Data is everywhere. We’re practically swimming in it. Recent estimates suggest that some 79 zettabytes of data were generated in 2021 alone. Seventy-nine zettabytes … That is 79,000,000,000,000 gigabytes, or 79 trillion gigabytes, in case you lost count of the zeroes.
The amount of data from Google searches, smart watches, emails, texts, Slack messages, spreadsheets, TikTok videos, and myriad other online sources only continues to grow exponentially every year. This is great news for data scientists who strive to use this data to describe the world around us and to predict what will happen next. But even in this seemingly infinite sea of data, there are times we do not have enough to make valid statistical observations. To a struggling data scientist, it can feel like the curse of the Ancient Mariner:
“Water, water, everywhere,
Nor any drop to drink.”
~ Samuel Taylor Coleridge
But statisticians are used to working with limited data. Statistics, after all, is the science of drawing conclusions about an entire population from a limited sample. As such, numerous methodologies have been developed over the years to gather statistically valid collections of data from which statisticians can make inferences and predictions. The primary sampling methods are simple random sampling, systematic sampling, stratified sampling, and cluster sampling. Any of these approaches should return a statistically valid sample of the population being studied.
The Central Limit Theorem & The Law of Large Numbers
This is because the Central Limit Theorem (CLT) – one of the core tenets of statistics – states that given sufficiently large random samples from a population, the distribution of the sample means will be approximately normal, regardless of how the underlying population is distributed. This is significant because it allows us to make key inferences about the entire population based on the sample mean and standard deviation. The CLT is frequently used alongside the law of large numbers, which states that the larger the sample size, the closer the sample mean and standard deviation get to those of the population as a whole. For the CLT to apply, sample sizes of between 30 and 50 observations are usually considered sufficient. Statisticians can run into problems, however, when the sample size is not large enough to draw conclusions about the entire population.
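To make this concrete, here is a small illustrative sketch in Python (the exponential distribution, the sample size of 40, and the 5,000 repetitions are arbitrary choices for the demo): even when the population is heavily skewed, the means of repeated random samples form a roughly normal bell curve.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# "Population": a heavily right-skewed exponential distribution with mean 10.
# Draw 5,000 independent random samples of 40 observations each.
samples = rng.exponential(scale=10, size=(5_000, 40))
sample_means = samples.mean(axis=1)

# Despite the skew in the underlying data, the sample means cluster in an
# approximately normal distribution centered near the true mean of 10,
# with a spread close to sigma / sqrt(n).
print("mean of sample means:", round(sample_means.mean(), 2))  # ~10
print("std of sample means: ", round(sample_means.std(), 2))   # ~10 / sqrt(40) ≈ 1.58
```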
Bootstrapping
But when the sample size is smaller than what the CLT requires, data scientists must find other ways to represent the entire population. This is a common real-world problem, often due to a lack of data and/or the resources to perform repeated experiments.
It turns out, however, that there is a relatively simple and effective solution.
Bootstrapping is a statistical methodology that resamples a single dataset, with replacement, to create many simulated samples. It rests on the assumption that the original sample is random and representative of the entire population, which allows us to use each simulated bootstrap sample to calculate an estimate of a population parameter. Further, these estimates can be combined into a sampling distribution, which allows us to make inferences about the population as a whole. With bootstrapping we can establish confidence intervals and estimate population parameters such as the mean, median, and standard deviation.
A typical bootstrapping application will involve resampling the data thousands of times to create simulated datasets. The benefit of drawing so many resamples, rather than relying on the single sample of 30 to 50 observations generally considered sufficient, is that we end up with a much better estimate of the sampling distribution.
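As a minimal sketch of the procedure (the twelve data values and the 10,000 resamples below are made up purely for illustration), resampling a small dataset with replacement yields a bootstrap distribution of the mean, and the 2.5th and 97.5th percentiles of that distribution give a simple 95% confidence interval:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A small observed sample -- say, 12 measurements (hypothetical values).
sample = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.8, 3.9, 5.2, 4.7, 5.0])

# Each bootstrap sample is drawn from the original data *with replacement*
# and has the same size as the original sample.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# Percentile-based 95% confidence interval for the population mean.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.2f}")
print(f"95% bootstrap CI: ({lower:.2f}, {upper:.2f})")
```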
It may seem odd that one could infer statistics about an entire population just by resampling a small dataset over and over, but bootstrapping has been in use for more than 40 years, and over that time multiple studies have shown that bootstrap sampling distributions closely approximate the true sampling distributions of the statistics they estimate.
Practical Applications
Bootstrapping is frequently used in A/B testing, where a marketer may not have enough observations of customer behavior to reach a firm conclusion. Any time the study designer lacks the resources to conduct a larger study, bootstrapping can help produce reliable estimates, as long as the observations that were collected are truly random and representative of the overall population.
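A hedged sketch of what that might look like in code (the visitor counts and conversion rates below are invented for illustration): bootstrap the difference in conversion rates between the two variants and check whether the resulting interval excludes zero.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical A/B test results: 1 = converted, 0 = did not convert.
variant_a = rng.binomial(1, 0.11, size=120)  # 120 visitors saw version A
variant_b = rng.binomial(1, 0.14, size=115)  # 115 visitors saw version B

# Bootstrap the difference in conversion rates (B minus A).
diffs = np.array([
    rng.choice(variant_b, size=variant_b.size, replace=True).mean()
    - rng.choice(variant_a, size=variant_a.size, replace=True).mean()
    for _ in range(10_000)
])

lower, upper = np.percentile(diffs, [2.5, 97.5])
print(f"observed lift: {variant_b.mean() - variant_a.mean():.3f}")
print(f"95% bootstrap CI for the lift: ({lower:.3f}, {upper:.3f})")
# If the interval contains 0, the test has not shown a clear winner.
```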
As another example, suppose a physical education teacher wanted to find the median height of all 300 fourth graders in her school but didn’t have time to measure every student. She could use bootstrapping to approximate the median height from a much smaller sample: measure 10 or 15 kids, then resample those measurements over and over to make an inference about the median height for all 300 students.
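Here is what that might look like, assuming fifteen hypothetical height measurements in centimeters:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Heights (in cm) of the 15 fourth graders actually measured
# (hypothetical numbers for illustration).
heights = np.array([132, 128, 135, 140, 129, 137, 133, 131, 138, 136,
                    130, 134, 127, 139, 135])

# Resample the 15 measurements with replacement and take each resample's median.
boot_medians = np.array([
    np.median(rng.choice(heights, size=heights.size, replace=True))
    for _ in range(10_000)
])

lower, upper = np.percentile(boot_medians, [2.5, 97.5])
print(f"sample median: {np.median(heights):.1f} cm")
print(f"95% bootstrap CI for the school-wide median: ({lower:.1f}, {upper:.1f}) cm")
```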
Bootstrapping in Machine Learning
Bootstrap aggregation (or “bagging”) is at the heart of many ensemble machine learning algorithms, most notably random forests (gradient boosting, by contrast, builds its trees sequentially rather than on independent bootstrap samples). Bagging involves creating multiple bootstrap samples, training a separate decision tree on each, and then combining the individual predictions to produce a single result.
Bagging is commonly used to reduce overfitting to the training data. By taking multiple bootstrap samples of the data, data scientists introduce greater diversity into model training. Training a separate tree on each resample means every tree turns out slightly different, which ultimately yields better predictions because the individual trees’ errors are less correlated with one another.
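A rough sketch of bagging by hand, using scikit-learn’s DecisionTreeClassifier on a synthetic dataset purely for illustration (in practice you would usually reach for a ready-made implementation such as a random forest): each tree is fit on its own bootstrap sample of the training rows, and the ensemble predicts by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=3)

# Synthetic binary classification data, just for demonstration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_trees = 50
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# Aggregate: each tree votes, and the ensemble takes the majority class.
votes = np.array([tree.predict(X_test) for tree in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print(f"single bootstrapped tree accuracy: {trees[0].score(X_test, y_test):.3f}")
print(f"bagged ensemble accuracy:          {(ensemble_pred == y_test).mean():.3f}")
```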
This kind of bootstrap aggregation frequently results in something akin to the “wisdom of the crowd,” by creating models built on more diverse data that can outperform models based on single samples of the data.
Conclusion
Bootstrapping is a powerful statistical tool that allows data scientists to make inferences from limited data. It is frequently used to construct confidence intervals and estimate other statistical parameters when the sample size is very small. It can also be used to improve regression models and ensemble machine learning algorithms by reducing overfitting and training the models on more diverse data.