Data Science is the process of applying scientific methods, algorithms, and systems to extract information from data. For the unfamiliar it can seem like inexplicable magic that happens inside a black box. But when we peel back the lid and peer inside, we see that many of the tools used by data scientists and business analysts are pretty straightforward. By breaking down these processes and how they work, we – as business managers – can better understand what data science is and how it can benefit our organizations.
For example, every business in the world, from cul-de-sac lemonade stands to Fortune 100 firms, must do some type of forecasting as part of corporate planning, staffing, budgeting, and strategy. Forecasts can be qualitative (market research, panels, historical analogies), quantitative (time series analysis, causal relationships, simulation), or some combination of both.
While qualitative methodologies are based on estimates or opinions and are subjective, the mathematical approach is more sophisticated, defensible, and reproducible. The most common approach to business forecasting is quantitative and uses historical data to attempt to predict future results. This forecasting method is called Time Series Analysis.
Time series forecasting models aim to predict future outcomes based on past data. Because time series observations are recorded in chronological order, it is often possible to infer a relationship between successive observations.
Linear Regression
The most common way to use time series data to predict future results is to model the relationship between a response and one or more explanatory variables. Simple linear regression uses only one explanatory variable, such as outside temperature when predicting ice cream sales. Multiple linear regression uses several explanatory variables to predict an outcome. An example of multiple linear regression would be to use zip code, year built, and total square feet to predict home prices.
In time series analysis, simple linear regression uses time (hour, day, week, month, etc.) as the explanatory variable and the value being forecast (e.g. unit sales or market price) as the response, with the line fitted to historical data.
In the chart below, we see a fairly linear upward trend in Sales by Month over time:
We can see in this chart that the trend is not 100% linear. There are fluctuations that could be caused by trend, seasonal, or cyclical influences.
Trend is the overall direction of the outcomes (up or down). Seasonality is the impact of time of year. Note in the chart above, we only have one year of data, so it is very difficult to determine the impact of seasonality. Are sales in Q4 higher because of the overall trend, or is Q1 going to be a slow quarter every year? Cyclical factors are even more difficult to track because the time span of a “cycle” may not be known. Cyclical factors are typically driven by broader factors such as the overall economy, tax policies, and politics. There are also random variations caused by chance events.
When using linear regression for predictive analysis, we assume we’ve properly identified the correct predictor variable(s) to forecast an outcome. In the case of time series analysis, the predictor variable is set automatically: time. To keep things simple in our example, we are also assuming a linear trend. But in the real world, the trend could be asymptotic, exponential, or the data may not fit any standard curve. While we can certainly build forecasts using non-linear data, the required math escalates quickly and is better saved for another post in which we can discuss using polynomial regression.
The mathematics of simple linear regression models, however, is very easy to understand. The formula is one that may take you back to a high school algebra class:
Y = a + bX
Where:
Y: Dependent Variable
a: y intercept
b: slope of the line
X: independent variable
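To make the formula concrete, here is a minimal Python sketch of how a and b can be estimated with ordinary least squares, using the month number as X. The function name fit_line and the sales figures are made up for illustration and are not taken from the chart above.

```python
from statistics import mean


def fit_line(xs, ys):
    """Estimate a (intercept) and b (slope) of Y = a + bX by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: how X and Y vary together, divided by how much X varies on its own.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line always passes through the point of means.
    a = y_bar - b * x_bar
    return a, b


# Illustrative (made-up) sales for months 1 through 12.
months = list(range(1, 13))
sales = [102, 110, 108, 121, 125, 131, 129, 140, 146, 151, 149, 160]

a, b = fit_line(months, sales)
# Extend the fitted line one period ahead to forecast month 13.
forecast_month_13 = a + b * 13
print(f"a = {a:.2f}, b = {b:.2f}, month 13 forecast = {forecast_month_13:.2f}")
```

The slope b reads as the average change in sales per month, which is often the number a manager cares about most.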
Simpler still, software tools like Microsoft Excel and Google Sheets can automatically create a regression line when given the underlying data:
Decomposition of a Time Series
When building a forecast, it is imperative that the individual components that influence the response variable (outcome) are identified and isolated. Once the components are identified, we pull them out of the data by creating indexes for seasonality and trend data. Our next step is to create a trend line without the “noise” from seasonal data. We then add the seasonal and trend data back in to produce the forecast.
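One way to sketch these steps in Python is shown below. It assumes a multiplicative model with a simple-proportion seasonal index (each season's average divided by the overall average), which is only one of several ways to build the index and is a rough approximation when the trend is steep. The function names and the quarterly figures are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope for Y = a + bX."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - b * x_bar, b


def seasonal_indexes(values, period):
    """Average of each season divided by the overall average."""
    overall = mean(values)
    return [mean(values[i::period]) / overall for i in range(period)]


def decomposition_forecast(values, period, steps):
    """Deseasonalize, fit a trend line, then reapply the seasonal indexes."""
    idx = seasonal_indexes(values, period)
    # Step 1: remove the seasonal effect from each observation.
    deseasonalized = [v / idx[i % period] for i, v in enumerate(values)]
    # Step 2: fit the trend line to the deseasonalized data (time = 1, 2, ...).
    n = len(values)
    a, b = fit_line(list(range(1, n + 1)), deseasonalized)
    # Step 3: project the trend forward and add the seasonal effect back in.
    return [(a + b * (n + k)) * idx[(n + k - 1) % period] for k in range(1, steps + 1)]


# Two years of made-up quarterly sales (Q1..Q4, Q1..Q4).
quarterly_sales = [80, 120, 150, 90, 100, 140, 170, 110]
next_year = decomposition_forecast(quarterly_sales, period=4, steps=4)
print([round(f, 1) for f in next_year])
```

Note that at least two full cycles of data are needed before a seasonal index means much, which is the same limitation raised earlier about having only one year of history.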
Other Methods of Time Series Analysis
Simple Moving Average
When a forecast does not have seasonal components and the trend is neither growing nor declining rapidly, a simple moving average can be a quick and easy way to apply historical time series data to a future forecast. Moving averages are created by using subsets of historical data, which smooth out forecast data to create a more predictable trend. The goal when using moving averages is to minimize the impacts of random, short-term fluctuations.
The biggest question for analysts when using moving averages is how many data points to use in each subset. Larger subsets result in smoother data, but they respond more slowly to genuine changes and so may not be as accurate when conditions shift.
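As a minimal illustration, the sketch below forecasts the next period as the plain average of the most recent observations. The function name, the window sizes, and the demand figures are assumptions chosen for the example.

```python
def simple_moving_average(values, window):
    """Forecast the next period as the mean of the last `window` observations."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and the number of observations")
    return sum(values[-window:]) / window


# Made-up weekly demand with no obvious trend or seasonality.
weekly_demand = [52, 48, 55, 50, 47, 53, 49, 51]

# A short window reacts quickly to recent values; a long one is smoother.
print(simple_moving_average(weekly_demand, window=3))
print(simple_moving_average(weekly_demand, window=6))
```

Running the same data through two window sizes is a quick way to see the trade-off between smoothness and responsiveness.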
Weighted Moving Average
Weighted moving averages are typically used when more recent data is determined to be a better predictor of future performance than older data. While the simple moving average weights each data point equally, the weighted moving average allows the analyst to assign a different weight to each data point in the subset. For example, if we were looking to forecast a stock price, we may find it more relevant to weight the prior trading day at 0.6, with the day before weighted at 0.3 and the day before that at 0.1 (noting that the total of the weights has to add up to 1).
Typically, the most recent data points will be weighted higher, but when seasonality is a factor, there could be reason to weight certain months or quarters higher than others. For example, someone predicting snow shovel sales would likely weight winter months higher than summer months when forecasting.
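The stock price example above can be sketched in a few lines of Python. The weights 0.6, 0.3, and 0.1 come from the text; the function name and the closing prices are hypothetical.

```python
import math


def weighted_moving_average(values, weights):
    """Forecast the next period; weights[0] applies to the most recent value."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must add up to 1")
    if len(weights) > len(values):
        raise ValueError("not enough observations for the given weights")
    # Take the last len(weights) observations, newest first, to line up with weights.
    recent_first = reversed(values[-len(weights):])
    return sum(w * v for w, v in zip(weights, recent_first))


# Made-up closing prices, oldest to newest.
closing_prices = [100.0, 110.0, 120.0]

# Prior day weighted 0.6, the day before 0.3, the day before that 0.1.
forecast = weighted_moving_average(closing_prices, [0.6, 0.3, 0.1])
print(round(forecast, 2))
```

For a seasonal product such as snow shovels, the same function could be used with heavier weights on the relevant winter periods rather than on the most recent ones.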
Exponential Smoothing
Exponential smoothing simplifies the decision of how much weight to give historical data. In most applications, the most recent data points are going to be better predictors of future performance than older data. Exponential smoothing is the most used of all the forecasting techniques and is widely applied in inventory planning.
Similar to weighted moving average, exponential smoothing applies weights to prior periods, but exponential smoothing uses a formula rather than analyst discretion to determine and apply weights. In exponential smoothing, only three pieces of data are required to assemble a forecast: historical data, a base forecast, and a smoothing constant.
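The article does not spell out the formula, so the sketch below assumes the standard single exponential smoothing update: new forecast = previous forecast + alpha × (actual − previous forecast), where alpha is the smoothing constant between 0 and 1. The function name and the demand figures are made up.

```python
def exponential_smoothing(actuals, base_forecast, alpha):
    """Roll the forecast forward through `actuals`; return the next-period forecast."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    forecast = base_forecast
    for actual in actuals:
        # Move the forecast toward the latest actual by a fraction alpha of the error.
        forecast = forecast + alpha * (actual - forecast)
    return forecast


# Made-up monthly unit demand and a starting (base) forecast.
monthly_demand = [100, 110, 105, 115]
next_month = exponential_smoothing(monthly_demand, base_forecast=100, alpha=0.3)
print(round(next_month, 2))
```

A higher alpha makes the forecast follow recent actuals more closely, while a lower alpha gives more weight to the accumulated history. Because each update is applied on top of the previous forecast, the weight on older observations shrinks geometrically, which is where the name comes from.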
Conclusion
As with all things data science, the level of difficulty increases exponentially (a non-linear function) as you delve deeper and consider topics like noise, stationarity, autocorrelation, and other factors that can influence predictions. These are all tools and levels of understanding that can be layered on over time as you advance your analytical skills. It is important in the interim, however, not to be overwhelmed by the science behind it. As the availability of data continues to increase, there is an expectation that all of us in the business world are able to extract value from the numbers.