The Dangers of Big Data in Statistics – Bigger is Not Always Better

July 19, 2018 By Pascal Schmidt Machine Learning

Big data sets are more and more available in the fields of data science and statistics. Gone are the days when we could only deal with small to medium-sized samples. Gone are the days when we lacked statistical power in our analyses. However, this does not mean that bigger is always better. As with small samples, Big Data has its own challenges and pitfalls. In this blog post, I am going to examine some of the dangers of Big Data in statistics and how to overcome them.

What is Big Data?

  • Big data is large in volume. This means that it is collected from a variety of sources and then combined into a large data set.
  • Big data has lots of variety. This means it can be in different formats such as text data, video, or audio.
  • Big data is complex because it comes from different sources. Therefore, it is hard to match and clean.

In simpler terms, Big Data is data that is too big for commonly used software to process.


The Dangers of Big Data in Statistics

Example of the American Election in 1936

In the 1936 election, one of the biggest opinion polls in the history of the United States was conducted. In this poll, 2.4 million respondents expressed their preference for either Governor Alfred Landon or Franklin Roosevelt. The poll strongly favoured Landon over Roosevelt. However, the actual election ended with 46 states for Roosevelt and only two for Landon.

Where Has the Poll Gone Wrong?

A sample size this large should give confidence in the result. However, the sample suffered from selection bias. The poll was conducted by a magazine, which surveyed its own readers. Clearly, the readers of the magazine were not representative of the entire population of the United States, and so the poll failed to predict the outcome of the election. The same year, the American Institute of Public Opinion conducted the same poll with a sample only 2% of the size of the magazine's. Its predicted outcome was within 1% of the actual result. This example clearly shows the dangers of Big Data in statistics. Often, a small sample is sufficient to make predictions as long as it is representative of the population. It all depends on the quality of the data, not the quantity. So, the assumption that large sample sizes yield more meaningful results than small sample sizes is not correct.
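The poll story can be reproduced with a minimal simulation. The numbers below are made up for illustration (they are not the 1936 figures): the whole population favours candidate A at 55%, while the surveyed subgroup favours A at only 30%.

```python
# Sketch: why a huge but biased sample loses to a small random one.
# All percentages here are hypothetical, chosen only to illustrate the effect.
import random

random.seed(42)

population_support = 0.55   # true share of the population voting for A
readers_support = 0.30      # share voting for A among the surveyed subgroup

# Biased poll: many respondents, but all drawn from the unrepresentative
# subgroup (kept at 100,000 instead of 2.4 million for speed)
biased_poll = sum(random.random() < readers_support
                  for _ in range(100_000)) / 100_000

# Small random poll: only 3,000 respondents, drawn from the whole population
random_poll = sum(random.random() < population_support
                  for _ in range(3_000)) / 3_000

print(f"Biased poll estimate: {biased_poll:.3f}")  # far from the true 0.55
print(f"Random poll estimate: {random_poll:.3f}")  # close to the true 0.55
```

The tiny random sample lands near the truth while the huge biased one is off by a wide margin, no matter how many more biased respondents we add.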


Sampling Bias and Confounders

As already explained in this blog post, a biased and inconsistent estimator is a big problem no matter the sample size. However, it can occur more easily in Big Data problems because researchers have false confidence in the sheer amount of data. Let us examine why.

There is a famous example which suggests that ice cream sales cause theft, because there is a strong correlation between ice cream sales and theft rates. However, correlation is not causation. Just because ice cream sales and theft are highly correlated does not mean that ice cream sales cause theft. There is another variable (the confounding variable) related to both, and this variable is the weather. When it is hot outside, people eat more ice cream. In addition, people stay outside longer, go to festivals and parks, and there is more social interaction in summer than in winter. All of these activities make it easier for thieves to commit theft. In conclusion, weather is the variable that confounds the relationship between ice cream sales and theft.

So, if our model does not account for the confounding variable, then it does not matter how much data we have, because our model will always be wrong. It is biased because it deviates from the truth, and inconsistent because more data does not improve it. Big Data can therefore give us false confidence and make us believe that all the available data is sufficient for a good model.
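The ice cream example can be simulated. In this sketch (with made-up coefficients), temperature drives both ice cream sales and theft; the two end up strongly correlated even though neither causes the other, and the correlation vanishes once we control for temperature.

```python
# Sketch with made-up numbers: temperature is a confounder that drives both
# ice cream sales and theft, so the two correlate without any causal link.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

temperature = rng.normal(20, 8, n)                    # the confounder
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)   # driven by temperature
theft = 1.5 * temperature + rng.normal(0, 5, n)       # also driven by it

# The raw correlation looks impressive...
raw_corr = np.corrcoef(ice_cream, theft)[0, 1]

# ...but it disappears once we control for temperature by correlating the
# residuals after regressing each variable on temperature.
def residuals(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

partial_corr = np.corrcoef(residuals(ice_cream, temperature),
                           residuals(theft, temperature))[0, 1]

print(f"Raw correlation:     {raw_corr:.2f}")      # strongly positive
print(f"Partial correlation: {partial_corr:.2f}")  # close to zero
```

No amount of extra rows fixes this: a model of theft on ice cream sales alone stays biased until the confounder enters the model.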


Noise Accumulation

Other dangers of Big Data in statistics often occur in classification problems. Poor classification is often caused by weak features that do not contribute to the classification accuracy. When a data set has accumulated many such features, they produce so much noise that we can no longer find the signal. Here is an illustration of this problem.


The top-left plot uses the first m = 2 principal components and shows high discriminative power. However, the more components we add (m = 40, m = 200, m = 1,000), the less discriminative power the classification model has. This is due to the noise accumulated by all the features we are adding. It suggests that when we are dealing with Big Data, it is important to reduce the dimensionality with some sort of variable selection method (lasso, PCA, or stepwise selection).
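The same effect shows up in a toy simulation. In this sketch (not the data behind the plots above), only the first 2 of m features carry signal; a simple nearest-centroid classifier loses accuracy as pure-noise features are added.

```python
# Sketch: noise accumulation in a toy two-class problem. Only the first
# 2 features separate the classes; the remaining m - 2 are pure noise.
import numpy as np

rng = np.random.default_rng(1)

def accuracy(n_features, n_train=200, n_test=500):
    # Class means differ only in the first 2 dimensions
    mean = np.zeros(n_features)
    mean[:2] = 1.5

    # Training data for both classes, used only to estimate centroids
    x0 = rng.normal(0, 1, (n_train, n_features))
    x1 = rng.normal(0, 1, (n_train, n_features)) + mean
    c0, c1 = x0.mean(axis=0), x1.mean(axis=0)

    # Fresh test data
    test = np.vstack([rng.normal(0, 1, (n_test, n_features)),
                      rng.normal(0, 1, (n_test, n_features)) + mean])
    labels = np.array([0] * n_test + [1] * n_test)

    # Nearest-centroid rule: assign each point to the closer class mean
    d0 = ((test - c0) ** 2).sum(axis=1)
    d1 = ((test - c1) ** 2).sum(axis=1)
    preds = (d1 < d0).astype(int)
    return (preds == labels).mean()

for m in (2, 40, 200, 1000):
    print(f"m = {m:5d} features -> accuracy {accuracy(m):.2f}")
```

With m = 2 the classifier does well; by m = 1,000 the estimation noise from the useless features has drowned much of the signal, mirroring the loss of discriminative power in the plots.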

The P-Value Problem in Big Data

As mentioned in the introductory paragraph, gone are the times when we lacked statistical power. However, we have stepped from a lack of statistical power to p-values that are all "zero". This suggests statistically significant results even though there might not be any.

s^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2} {n - 1}

This is the sample variance. As n grows, it does not shrink; it settles around the true population variance. What does shrink with n is the standard error of the mean:

s_{e} = \sqrt \frac {s^2}{n }

When the sample variance is small and n is large, the standard error becomes very small.

t = \frac{\bar X - \mu} {s_{e}}

The numerator of the t-statistic measures the distance of the observed mean from the expected value, and the denominator measures the variability in the observed data. We saw that the standard error becomes very small when the variance is small and n is large. This results in a very large t-statistic and therefore in statistically significant results.
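The formulas above can be checked with a quick simulation. In this sketch, the true mean is 0.01 (a practically negligible effect) while the null hypothesis says 0; as n grows, the shrinking standard error inflates the t-statistic until the result looks "highly significant".

```python
# Sketch: a negligible true effect (mean 0.01 vs. hypothesized 0) yields a
# huge t-statistic once n is large, because the standard error shrinks.
import math
import random

random.seed(0)

def t_statistic(data, mu=0.0):
    """One-sample t-statistic: (mean - mu) / sqrt(s^2 / n)."""
    n = len(data)
    mean = sum(data) / n
    s2 = sum((x - mean) ** 2 for x in data) / (n - 1)
    se = math.sqrt(s2 / n)
    return (mean - mu) / se

results = {}
for n in (100, 10_000, 1_000_000):
    sample = [random.gauss(0.01, 1.0) for _ in range(n)]
    results[n] = t_statistic(sample)
    print(f"n = {n:>9,}  t = {results[n]:6.2f}")
```

The effect size never changes; only n does. This is why, with Big Data, a tiny p-value alone says little, and the practical size of the effect must be reported alongside it.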


In conclusion, big sample sizes are not always better than small sample sizes, and we have to be aware of the dangers of Big Data in statistics. There are pitfalls with big samples as well as with small ones. Therefore, whatever sample size you are dealing with, make sure you have investigated your data and are aware of potential problems and challenges. The dangers of Big Data in statistics are numerous, and I have only covered the most common ones.

If you want to find out more, check out these links:

Big Data and Large Sample Size

Big Data and the Danger of Being Precisely Inaccurate

Challenges of Big Data Analysis
