Statistical Bias Vs. Consistency – Random Error Vs. Systematic Error
July 2, 2018 By Pascal Schmidt Statistics
In this blog post, we will talk about statistical bias vs. consistency and about random error vs. systematic error. After that, we will provide examples of estimators that are unbiased and consistent, biased and consistent, unbiased but inconsistent, and biased but inconsistent. These concepts are often very confusing at first, so it definitely takes some time to understand and grasp them well. This is why I have provided graphs and examples to make the topic more intuitive, so let’s get started!
What is Bias?
In statistics, bias is the tendency to over- or underestimate a statistic (e.g. the mean) and hence the results drawn from it. According to the Cochrane definition, bias is…
“…a systematic error, or deviation from the truth, in results or inferences”
To understand this definition better, we have to understand what is meant by systematic error, as well as the difference between statistical bias and consistency.
Random Error
There are two types of error in statistics. The first is random error, which gives us imprecise results under repeated sampling, even though on average it gives us the right answer. Random error cannot be eliminated because it is random 🙂 and unpredictable.
Let’s suppose we are interested in the average height of men in the United States. We have collected a total of 10 samples, each including 100 observations. Let us assume we did not have any selection bias or other biases in our study design and acquired 10 iid (independent and identically distributed) samples which are representative of the average height of men in the US. Then, due to random error in our samples, we would not get the correct population height in each individual sample. Sometimes our sample mean would be too high and sometimes too low, but when we average across our 10 samples, the result becomes increasingly concentrated around the true height.
In conclusion, what we are essentially doing is decreasing the variability of our estimate (increasing consistency) by increasing the sample size. This makes our estimator (the average height) consistent, and it produces the correct result on average.
In the graph above, you can see an unbiased and consistent estimator. As n increases, we have less variability in our distribution and we get closer to the true value (the true value is 0 in the graph above).
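To make this concrete, here is a minimal Python sketch of the same idea (the true mean of 175 cm and standard deviation of 8 cm are made-up values for illustration). For each sample size n, we simulate 1,000 studies and compute the sample mean in each: the estimates are centered on the truth at every n, while their spread shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
true_height = 175  # hypothetical true mean height in cm (made-up value)

# The sample mean of an iid sample is unbiased at every n, and its
# spread around the truth shrinks as n grows (consistency)
for n in [10, 100, 1_000, 10_000]:
    means = [rng.normal(true_height, 8, size=n).mean() for _ in range(1_000)]
    print(f"n = {n:>6}: mean of estimates = {np.mean(means):7.2f}, "
          f"sd of estimates = {np.std(means):.3f}")
```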
Systematic Error
Now, bias refers to systematic error, which means that if we took our samples from before and averaged them, we would still get the wrong answer. We can ask ourselves how bias contributes to that.
Well, let us assume we introduced selection bias into our study design: we sampled American basketball players’ heights and treated these samples as representative of the average American man’s height. How can we get rid of this bias? We either collect a truly random sample, which eliminates the selection bias, or we increase our sample size until the bias shrinks to zero. So, if we have a biased but consistent estimator and we let n go to infinity, our estimator becomes unbiased. But how does that work? Imagine we keep sampling and sampling. After a while, we have sampled a quarter of the American male population. However, our bias is still high because, with all these basketball players in our sample, the average height is still being overestimated! So, we keep going until we have sampled every single man in America. Now our result is unbiased because we found the true population height by sampling the entire male population of the United States.
This leads to an interesting finding. If an estimator is biased and consistent, then as n goes towards infinity, or in our case towards 151.8 million (the population of men living in the United States), our estimator eventually becomes unbiased.
In the graph above, you can see a biased but consistent estimator. As n increases, our biased estimator becomes unbiased and the variability decreases again (the true value is 0 in the graph above).
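We can mimic this census argument with a quick simulation. The Python sketch below uses a hypothetical finite population with made-up numbers: taller men are more likely to be sampled (selection bias), we sample without replacement, and the sample mean starts out too high but converges to the true population mean as n approaches the population size.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical finite population of N men's heights in cm (made-up numbers)
N = 100_000
population = rng.normal(175, 8, size=N)

# Selection bias: taller men get exponentially larger sampling weights
weights = np.exp(population / 50)

# Weighted sampling without replacement via the Efraimidis-Spirakis trick:
# sorting by Exponential(1) / weight keys yields a weighted random order
keys = rng.exponential(size=N) / weights
sample = population[np.argsort(keys)]

# The running mean is biased upward at first, but hits the true
# population mean once the entire population has been sampled
for n in [100, 1_000, 10_000, N]:
    print(f"n = {n:>7}: sample mean = {sample[:n].mean():.2f}")
print(f"true population mean = {population.mean():.2f}")
```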
Combinations of (Un)biased and (In)consistent Estimators
- Unbiased and consistent
- Biased and consistent
- Unbiased and not consistent
- Biased and not consistent
In the first section, I gave an example of an unbiased and consistent estimator. In the second, I gave an example of a biased (through selection bias) but consistent estimator. Now suppose we have an unbiased estimator which is inconsistent. As an example, we randomly sample men from the United States, so our sample is iid and our estimator unbiased. There are many possible ways to determine the average height from the sample we took. In the first two sections, we summed up all of our observations and then divided by n. Alternatively, we could “estimate” the average height by simply taking the first observation from the sample. This estimator is unbiased due to the random sampling. However, it is inconsistent because no matter how much we increase n, the variance will not decrease; it stays constant. Hence, it is an unbiased but inconsistent estimator.
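A quick simulation shows the contrast with the sample mean. In the Python sketch below (same made-up height parameters as before), we simulate 2,000 studies for each n and keep only the first observation of each: the estimates stay centered on the truth, but their standard deviation never shrinks, no matter how large n gets.

```python
import numpy as np

rng = np.random.default_rng(7)
true_height = 175  # same made-up true mean as before

# "First observation" estimator: draw a sample of size n but use only
# its first value; repeat 2,000 times per n to see bias and spread
for n in [10, 1_000, 10_000]:
    firsts = [rng.normal(true_height, 8, size=n)[0] for _ in range(2_000)]
    print(f"n = {n:>6}: mean of estimates = {np.mean(firsts):.2f}, "
          f"sd of estimates = {np.std(firsts):.2f}")
```

Unlike the sample mean, the spread here stays near 8 cm for every n, which is exactly what inconsistency looks like.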
Now, let’s explain a biased and inconsistent estimator. This is probably the worst kind of estimator, because we might think that as n increases, our results gain power and our estimator becomes unbiased. Wrong!!! Let’s deviate a little from our previous example and suppose that we are interested in predicting the sales price of a house. Imagine that we forgot to include a square feet regressor in our model, and that square feet is correlated with the regressors we did include. This leads to estimators which are biased and inconsistent. For example, two houses with different square footage would get the same predicted sales price, even though the larger house should be predicted to sell for more. Because we left out an important variable, our coefficient estimates are distorted. This is called omitted variable bias and you can read about it in detail here.
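Omitted variable bias is easy to demonstrate in a simulation. The Python sketch below uses entirely made-up numbers (a hypothetical “bedrooms” regressor correlated with the omitted square feet variable): the estimated bedroom effect lands far from its true value and does not get any closer as n grows, i.e. the estimator is biased and inconsistent.

```python
import numpy as np

rng = np.random.default_rng(3)

# All coefficients and distributions below are made up for illustration
for n in [100, 10_000, 1_000_000]:
    sqft = rng.normal(2_000, 400, size=n)                    # omitted regressor
    bedrooms = 1 + sqft / 800 + rng.normal(0, 0.5, size=n)   # correlated with sqft
    price = (50_000 + 100 * sqft + 10_000 * bedrooms
             + rng.normal(0, 20_000, size=n))

    # Regress price on bedrooms only, leaving square feet out of the model
    X = np.column_stack([np.ones(n), bedrooms])
    beta = np.linalg.lstsq(X, price, rcond=None)[0]
    print(f"n = {n:>9}: estimated bedroom effect = {beta[1]:9,.0f} (true: 10,000)")
```

The bedroom coefficient converges, but to the wrong value, because it absorbs the effect of the omitted square feet variable; more data cannot fix a misspecified model.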
Now that you know the ins and outs of statistical bias vs. consistency, you can check out my next blog post, where I talk about the different types of biases with examples. You might also like the blog post about the dangers of Big Data, which includes selection bias.
I have found great supplementary material on the web if you want to learn more about statistical bias vs. consistency.