For the uninitiated, data science can sound like black magic that happens deep in the bowels of a massive server room inside some digital compound … maybe in the metaverse. Before we can dive into data science as the important tool it is for business decision making, we have to demystify it.

Data science is the collection of methods used to extract meaningful insights from the ever-increasing amount of data available to all of us. Data scientists sometimes use complex algorithms like neural networks or random forests, but sometimes the work is as simple as running a linear regression.

This case study on a fictional pizza chain aims to show how even small businesses can use data science and analytics to drive business decisions.

Billy Ray’s Soulfood Pizza Emporium

[excerpted from Deep Finance: Corporate Finance in the Information Age]

Billy Ray’s Pizza is a regional pizza chain in the Southeastern United States. It has 33 locations in 10 states and is looking to expand into Nashville, TN. Billy Ray is not only an incredible pizza chef, he’s also something of a data scientist. As such, he relies on a combination of instinct and complex data models to determine where to place his restaurants. When he goes into a new city, he looks at the competitive landscape along with customer demographic data to determine the best place to put a new restaurant. He then compares potential new sites to his existing restaurants to determine how many square feet the new site should be.

Gathering Data

Billy Ray first goes to Google Maps and uses an API to download a list of the existing restaurants in a target city.

He is interested in pizza restaurants, of course, but he also wants to understand the dining landscape of the city, so he exports a list of all the restaurants.

Once he understands the competitive landscape, Billy Ray wants to understand the population in the areas in which he is considering placing a restaurant. He uses a combination of census data and other business sources to put together demographics by zip code.

Tools like Google Maps let him visualize zip codes in his target cities.

He uses online tools like https://www.incomebyzipcode.com/tennessee to find median household income by zip code, and then looks at other data like population density, foot traffic, car count on roads in front of potential locations, and other demographic data such as average age, race, gender, and education level.

Pizza Restaurants by neighborhood.

Once he has collected all of the relevant data, he is ready to let the numbers work their magic … almost. First, he has to consolidate all the data from disparate sources into a single table to make it easier to build and run his forecasting models.
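As a rough illustration of that consolidation step, the sketch below joins several per-source lookups into one row per zip code. Everything in it is an assumption made for the example: the `consolidate` helper, the column names, the placeholder zip codes, and the numbers are all invented and are not Billy Ray's actual data.

```python
# Hypothetical per-source records keyed by zip code.
# Zip codes and values are placeholders invented for illustration.
restaurants_by_zip = {"00001": 21, "00002": 14, "00003": 9}        # pizza restaurant counts
income_by_zip = {"00001": 58000, "00002": 61000, "00003": 72000}   # median household income
population_by_zip = {"00001": 98000, "00002": 76000}               # residents ("00003" missing)


def consolidate(**sources):
    """Join several {zip_code: value} mappings into one row per zip code.

    A zip code missing from a source gets None in that column, so gaps
    stay visible instead of being silently dropped.
    """
    all_zips = sorted(set().union(*(src.keys() for src in sources.values())))
    return [
        {"zip_code": z, **{name: src.get(z) for name, src in sources.items()}}
        for z in all_zips
    ]


table = consolidate(
    pizza_restaurants=restaurants_by_zip,
    median_income=income_by_zip,
    population=population_by_zip,
)
for row in table:
    print(row)
```

Keeping missing values as `None` rather than dropping the row makes it easier to spot which source still needs to be filled in before modeling.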

Through years of experience, he has narrowed down his site selection criteria to a few key metrics. But this took several attempts, and the more restaurants he built, the more data he had to help him make effective decisions.

(For the sake of this first model, we have excluded traffic and pedestrian counts as streets have not yet been selected. This first pass model will select the neighborhood. A subsequent model will be used to pick the best street location.)

Statistical tools like R and the more general-purpose Python language let him use computing power to build and evaluate models. Rigorous model evaluation is outside the scope of this exercise, but it is crucial to any model’s success. Learn more about model evaluation at: https://www.datavedas.com/model-evaluation-in-r/

Having established a model with the most predictive features, Billy Ray now uses it in each city where he opens a new restaurant.

Data Sources:

Neighborhood (adapted from Zip Code map), latitude/longitude (from Google Maps), Household Income (from online tools such as https://www.incomebyzipcode.com/tennessee), traffic counts (from Tennessee Department of Transportation), foot traffic (hired local counting service).

EDA: Exploratory Data Analysis

When the data is all collected, it is time to validate it and see if there are some key metrics that jump out before building a model.

In the above chart, we see the number of pizza restaurants in each neighborhood, and the number of residents per pizza restaurant in each. This gives us an idea of the potential customer base and the number of competitors in each market.
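The residents-per-restaurant figure is simple arithmetic, and a minimal sketch of it might look like the following. The neighborhood names and numbers are invented placeholders, and treating a neighborhood with no pizza restaurants as having an unbounded ratio is an assumption of this example.

```python
import math


def residents_per_restaurant(population, pizza_restaurants):
    """Residents per pizza restaurant; unbounded when there are no competitors."""
    if pizza_restaurants == 0:
        return math.inf  # assumption: no competitors means the ratio has no upper limit
    return population / pizza_restaurants


# Made-up figures for illustration only.
neighborhoods = [
    {"name": "Neighborhood A", "population": 90000, "pizza_restaurants": 20},
    {"name": "Neighborhood B", "population": 40000, "pizza_restaurants": 16},
    {"name": "Neighborhood C", "population": 30000, "pizza_restaurants": 5},
]

for n in neighborhoods:
    n["residents_per_restaurant"] = residents_per_restaurant(
        n["population"], n["pizza_restaurants"]
    )

# Highest ratio first: more potential customers for each existing competitor.
ranked = sorted(neighborhoods, key=lambda n: n["residents_per_restaurant"], reverse=True)
for n in ranked:
    print(n["name"], n["residents_per_restaurant"])
```

With these placeholder numbers, the smallest neighborhood ranks first because it has the fewest competitors relative to its population.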

We can also see which neighborhoods have the highest accessible customer base:

In larger models, EDA is a significant part of understanding your data. Different features can be checked for correlations and other revealing information. We’ll leave this study here in the interest of brevity, but a big part of the fun of being a data scientist is exploring the data to see what it tells you before you even begin building out a model.

Modeling

The first model we’re going to build is one that will automatically cluster neighborhoods to help us choose the optimal restaurant location. We’re going to use a tool called K-Means Clustering to complete this task.

Cluster Identification

Billy Ray went with 15 clusters for his first pass on the model. He thinks, however, that he might be able to position a restaurant between two, three, or four neighborhoods and draw customers from each. He will therefore use K-Means clustering to group the areas based on geographic location (lat/long) and the other variables he has collected, after first determining how many clusters to use.
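For readers curious about what K-Means does under the hood, here is a minimal sketch in plain Python. It is not Billy Ray's actual model: the `standardize` and `kmeans` helpers, the three features (latitude, longitude, median income), and every number are assumptions chosen for illustration. Features are rescaled first so that income, measured in dollars, does not swamp latitude and longitude in the distance calculation.

```python
import math
import random


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0  # guard constant columns
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means (Lloyd's algorithm). Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random starting centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up areas: [latitude, longitude, median household income].
areas = [
    [36.05, -86.65, 55000],
    [36.06, -86.66, 57000],
    [36.04, -86.64, 53000],
    [36.20, -86.80, 90000],
    [36.21, -86.81, 92000],
    [36.19, -86.79, 88000],
]
labels, centroids = kmeans(standardize(areas), k=2)
print(labels)
```

In practice, a library implementation would usually be preferable; the hand-rolled version is shown only to make the assign-then-update loop visible.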

To determine the optimal number of clusters, Billy Ray starts with his original 15 areas:

Billy Ray’s first clusters showed him that Antioch and South Nashville were the best locations for his new pizza restaurant. Even with 21 existing pizza restaurants in the area, the ratio of residents to pizza restaurants is still higher than in any other neighborhood.

His next step is to determine the optimal number of clusters to use for his model.

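In the interest of brevity, we won’t delve into the details of calculating the optimal number of clusters. In short, we run the model for a range of cluster counts, calculate the error for each, and look for the point where adding more clusters stops meaningfully reducing the error.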
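One common way to calculate the error for each number of clusters is the sum of squared distances from every point to its cluster center. The sketch below assumes that definition; the source does not say which error measure Billy Ray used. It includes a compact k-means so that it runs on its own, and it seeds the centroids with a deterministic farthest-first rule, which is a choice made for this example. The points are made up and form three obvious groups.

```python
import math


def kmeans_sse(points, k, iterations=100):
    """Run a basic k-means and return the total squared distance from each
    point to its nearest centroid (the 'error' for this value of k)."""
    # Farthest-first initialization: start at the first point, then keep
    # adding the point farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        centroids.append(
            list(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
        )
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members
                else centroids[c]
            )
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


# Made-up 2-D points arranged in three tight groups.
points = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.2, 4.9], [4.9, 5.3],
    [10.0, 0.0], [10.1, 0.2], [9.8, 0.1],
]
errors = {k: kmeans_sse(points, k) for k in range(1, 6)}
for k, err in errors.items():
    print(k, round(err, 3))
```

On this toy data the error falls sharply up to three clusters and then has very little left to lose, which is the "elbow" an analyst would look for.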

The algorithm suggested that six is the optimal number of clusters to use based on the data we have.

Optimized map with six clusters:

When we ran the model for six clusters, Cluster 4 showed the optimal location for a restaurant. This location pulls in customers from three of the original clusters, has the highest population concentration with the largest group in our target income demographic, and has the most potential customers per pizza restaurant.

Next Steps:

With his general location in place, Billy Ray needs to start looking at specific sites. For these sites, he will get more granular in his data and add some new fields, including traffic counts on nearby roads, the number of residences within a 5-mile radius, more detailed demographic data, crime rates, average home prices, and a count of nearby businesses.

His next model will be different from the first model, which used clustering to find the ideal location for his restaurant. In this next model, he will use multivariate regression to compare specific locations to his existing stores. This will help him determine the viability of a certain site and even make preliminary estimates on revenue for that location.
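As a preview of the idea, here is a minimal sketch of a regression with several predictors, fit by ordinary least squares in plain Python. The two features (square footage in thousands and daily traffic in thousands), the `fit_linear` and `predict` helpers, and all of the store data are assumptions for illustration. The made-up revenues follow an exact linear rule so the recovered coefficients can be checked; real store data would be noisy.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, ..., coef_n].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def predict(coefs, features):
    """Apply fitted coefficients to one site's features."""
    return coefs[0] + sum(c * f for c, f in zip(coefs[1:], features))


# Made-up existing stores: [square feet (thousands), daily traffic (thousands)].
stores = [[2.0, 10], [2.5, 12], [3.0, 9], [1.8, 15], [2.2, 20]]
# Made-up revenue generated exactly as 100 + 150 * sqft + 5 * traffic.
revenue = [450, 535, 595, 445, 530]

coefs = fit_linear(stores, revenue)
estimate = predict(coefs, [2.4, 14])  # a hypothetical candidate site
print([round(c, 3) for c in coefs], round(estimate, 1))
```

Solving the normal equations directly is fine for a handful of features; with many correlated predictors, a library routine based on a more numerically stable decomposition would be the safer choice.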

But we’ll save that model for another day.

Author

Glenn Hopper

Glenn Hopper is a director at Eventus Advisory Group and the author of Deep Finance: Corporate Finance in the Information Age. He has spent the past two decades helping startups transition to going concerns, operate at scale, and prepare for funding and/or acquisition. He is passionate about transforming the role of chief financial officer from historical reporter to forward-looking strategist.

Glenn Hopper February 24, 2022
