The Bias vs Variance Trade-off is an essential concept to grasp if you want to learn machine learning. Understanding its relation to overfitting and underfitting is necessary to build an accurate machine learning model. It is also often a topic covered during data science interviews, which is a good reason to go over it. Practice makes perfect!
When talking about machine learning, bias and variance refer to measures of prediction error produced by your predictive model. By prediction error, I mean the difference between your model’s predicted values and the actual values. I’m going to assume you already understand why we split our data into separate training and testing sets, so that models are evaluated on unseen data.
When assessing model performance, you want to compare the amount of error produced on your training set with the amount of error produced on your testing set.
If your model’s training produces a low error, yet testing that model on unseen data produces a high error, then your model is overfit. It has memorized nuances and bits of noise in your training set, becoming too complex. Its predictive ability on the testing set falls short because those nuances just aren’t present there in the same way. This is high variance.
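To make this concrete, here is a minimal NumPy sketch of high variance. The toy dataset (a sine curve plus noise) and the degree-12 polynomial are my own illustrative choices, not from any real problem: the point is simply a model with far more freedom than 15 noisy points warrant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy data: a smooth trend plus noise.
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

# A degree-12 polynomial has enough freedom to chase the noise
# in just 15 training points.
coefs = np.polyfit(x_train, y_train, 12)
train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)

print(f"train MSE: {train_mse:.3f}")  # very low: the noise was memorized
print(f"test MSE:  {test_mse:.3f}")   # much higher: high variance
```

The exact numbers will vary with the random seed, but the gap between the two errors is the signature of overfitting.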
There are a lot of definitions of variance out there. I will say variance is the part of prediction error that comes from the model’s sensitivity to small changes in the training data.
If your predictive model doesn’t train well and also produces high error on the test set, then your model is underfit. It performs poorly all around because it is too simple to capture the real trend in the data. This model has high bias.
Bias is the error that is produced from overly simplifying the model.
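And here is a matching sketch of high bias, under the same illustrative assumptions: a straight line asked to fit a curved trend will produce high error on training and testing data alike.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same kind of illustrative curved data as before.
x_train = np.linspace(0, 1, 50)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.linspace(0, 1, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

# A straight line is too simple for a sine-shaped trend; it misses
# the curve no matter which data it is trained on.
coefs = np.polyfit(x_train, y_train, 1)
train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)

print(f"train MSE: {train_mse:.3f}")  # high: the line cannot bend
print(f"test MSE:  {test_mse:.3f}")   # similarly high: high bias
```

Notice the contrast with overfitting: here the two errors are both high and roughly equal, because the model is too simple to memorize anything.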
Here’s a less technical, more real-life example by way of some classic high-school trickery:
There are students in a statistics course who are about to take a test. Wanting to study, of course, the students ask the teacher for a study guide to help them prepare. The teacher, who is actually an undercover senior data scientist about to unleash the wrath of the bias-variance trade-off, graciously hands over a study guide composed of training questions and says “If you learn these questions, you will pass the test with flying colors. Guaranteed.” The students praise the generous teacher.
They go back home and study every line and character within those questions so well that they have memorized the questions and full answers. Next week, they all show up to the test, ready to take on the world with their 100% study guide training accuracy. As they receive their tests, they lay eyes on the first question, and see that the words are mixed around and the numbers are all different. They jot down what they remember and move on. The rest of the test has them completely stumped. They have never really seen these questions before.
The results? All failing.
What was the students’ mistake here? They knew nothing at first, and error was high. As they started to learn the questions, error dropped. But they kept learning, memorizing unimportant noise: they overfit the study guide questions so thoroughly that they froze up when the questions were re-worded, producing a high amount of error during the test. Think of the students as having memorized the noise in the data. This is an example of high variance.
Now imagine a different scenario where the teacher refuses to grant the students a study guide at all, loosely says they will do fine, and then gives the same test. The students receive horrible results again because they weren’t able to grasp the technical questions without studying. Think of these students as underfitting the course’s lessons: their overly simple picture of the material biases every answer. This is an example of high bias.
What the students should have done is studied just enough to know the concepts, yet not memorize the training questions.
Bias and variance are usually inversely related. Too much of either is bad, because too much total error is produced in training or testing. This is why it is necessary to find the balance between the two that yields the lowest total error.
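One way to see that balance is to sweep model complexity and watch the two errors. A minimal sketch, again with a hypothetical toy dataset, using polynomial degree as a stand-in for complexity:

```python
import numpy as np

rng = np.random.default_rng(2)

x_train = np.linspace(0, 1, 40)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

# Sweep complexity (polynomial degree) and record both errors.
results = {}
for deg in [0, 1, 3, 5, 9, 15]:
    coefs = np.polyfit(x_train, y_train, deg)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    results[deg] = (train_mse, test_mse)
    print(f"degree {deg:2d}: train {train_mse:.3f}  test {test_mse:.3f}")

# Training error only falls as complexity grows, but test error falls,
# bottoms out, then climbs again as the model starts chasing noise.
```

The degree where test error bottoms out is the sweet spot the students should have aimed for: enough learning to capture the concepts, not enough to memorize the questions.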
Notice how there is always some amount of error left over. This is normally referred to as irreducible error. Unfortunately, no matter how good your predictive model is, this error cannot be reduced. It results from noise, like everyday random fluctuations in the data.
Think of bias and variance each as a source of error contributing to total error. In squared-error terms, total error is the sum of bias squared, variance, and irreducible error. As model complexity grows, bias falls while variance rises, which is what gives the test-error curve its characteristic U shape.
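For squared error, that decomposition (bias squared plus variance plus irreducible noise) can be checked numerically: refit the same model on many fresh training sets, record its predictions, and compare the three components against the directly measured error. A simulation sketch under toy assumptions (the sine target, noise level, and cubic model are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
noise_sd = 0.3

def true_f(x):
    # Illustrative "true" relationship the model is trying to learn.
    return np.sin(2 * np.pi * x)

x0 = 0.25  # the single input where we decompose the prediction error

# Refit the same cubic model on many fresh training sets and record
# its prediction at x0 each time.
preds = []
for _ in range(2000):
    x = rng.uniform(0, 1, 30)
    y = true_f(x) + rng.normal(0, noise_sd, 30)
    coefs = np.polyfit(x, y, 3)
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - true_f(x0)) ** 2  # error from oversimplifying
variance = preds.var()                      # error from data sensitivity
irreducible = noise_sd ** 2                 # noise no model can remove

# Measure the expected squared error at x0 directly, against fresh
# noisy observations of the target.
y0 = true_f(x0) + rng.normal(0, noise_sd, preds.size)
total = np.mean((preds - y0) ** 2)

print(f"bias^2 + variance + noise: {bias_sq + variance + irreducible:.4f}")
print(f"measured total error:      {total:.4f}")
```

The two printed numbers should agree up to simulation noise, which is the decomposition in action.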
When training your model, you essentially want your error, or loss, to be about the same, and low, for both the training set and the testing set. If your model is too simple to capture the pattern in the data, it will underfit and give you high error all around. On the other hand, if your model fits the training set too well and generates a high error on the testing set, your model is too complex.
There are different solutions for this, and they almost always depend on the data. That is a bit beyond the scope of this lesson, which is just meant to explain the bias-variance trade-off.
I hope you were able to at least grasp the concept here. If you have any questions whatsoever, reach out in the comments!
Thanks for reading,