Deep Tech Point
first stop in your tech adventure
What to do about outliers in linear regression?
April 7, 2022 | Data science

In this article, we are going to investigate what outliers in linear regression are, why and when they are important, and what we should do about them – should we remove them from the analysis or not.

What are outliers – the unusual Y values – in linear regression?

In regression, the unusual Y values aka outliers are data points or observations that fall far from the rest of the data points. Basically, every point that doesn’t appear to belong with the vast majority of the other points is an outlier.
But why are these points important? Because they can have a strong influence on the least-squares line. Most parametric statistics – means, correlations, and standard deviations among them – are very sensitive to outliers. And since common statistical procedures – linear regression included – rest on parametric assumptions, outliers can damage your analysis.
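To see that sensitivity in numbers, here is a minimal sketch with made-up values – a single extreme observation is enough to drag the mean and inflate the standard deviation:

```python
# Toy illustration: how one outlier distorts the mean and standard
# deviation, the building blocks of the least-squares line.
from statistics import mean, stdev

clean = [48, 50, 51, 49, 52, 50]   # tightly clustered values
with_outlier = clean + [150]       # one extreme observation

print(mean(clean), stdev(clean))                # mean 50, small spread
print(mean(with_outlier), stdev(with_outlier))  # mean dragged up, spread inflated
```

The clean mean is exactly 50; adding the single value 150 pushes it past 64 and multiplies the standard deviation many times over.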

What is the difference between outliers and high leverage observations aka influential points?

So, an outlier is an observation or a data point that does not go along with the rest of the data cloud. When a point falls horizontally away from the center of the cloud – that is, it has an extreme predictor (x) value – we call it a high leverage point, because it has the potential to pull the regression line toward itself. A high leverage point that actually changes the fit noticeably is called an influential point. When you want to determine whether a data point is influential, visualize the regression line with and without the leverage points. If the slope of the regression line changes significantly, that is a strong indication we are dealing with an influential point.
Basically, every outlier has the potential to be an influential point because it can influence the predicted responses, the slope coefficients, and at the end of the day the hypothesis test results. Nevertheless, the rule of thumb is to inspect these outliers first and decide whether they actually are influential points.
So, when we are dealing with only one predictor, we can simply look at a scatter plot and distinguish between an outlier and a high leverage data point.
Another easy way is to analyze a data set twice — once with and once without the outlier — and observe if there are differences in the results. Is there inflation in the parametric statistics such as means, correlations, and standard deviations? Is there a difference in the predicted responses, estimated slope coefficients, and hypothesis test results? If yes, we are not only dealing with an outlier but also an influential point.
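The "analyze it twice" idea can be sketched in a few lines. The data below is invented for illustration; the slope comes from the standard least-squares formula:

```python
# Sketch of the "fit twice" check: compare the least-squares slope
# with and without a suspect high-x point.
def ols_slope(xs, ys):
    """Ordinary least-squares slope: sum((x-mx)(y-my)) / sum((x-mx)^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.9]       # roughly y = 2x
x_out, y_out = x + [15], y + [5.0]  # one high-leverage point far to the right

slope_clean = ols_slope(x, y)        # close to 2
slope_with = ols_slope(x_out, y_out) # slope collapses toward 0
```

Here the slope drops from about 2 to well under 0.5 once the far-right point is included – a dramatic change in the fit, so by the criterion above this point is not just an outlier but an influential one.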

So, what to do about the outliers – should we drop them or not?

Ah, yes, that seems so tempting. Just push away these horrible data points that keep sticking a finger in your eye. Remove them. As quickly as possible. Not so fast.
A general rule of thumb is not to remove any data point just because it is an outlier. Outliers are perfectly legitimate data points and are oftentimes the most interesting observations, and for this reason we should investigate them before we decide whether to remove them or not.
One case where we should absolutely remove the outlier is when the data is obviously unrealistic – for example, when a person's weight is recorded as a physically impossible 18 kg. The real weight could be 81 kg, maybe even 118 kg or 180 kg or 181 kg, but we cannot know. In this case, it is obvious we are dealing with a recording error, and since we cannot correct it, we must remove the outlier.
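As a sketch of such a sanity check – the 30–300 kg bounds below are purely illustrative, not a clinical standard:

```python
# Drop physically impossible weights before fitting the model.
# The plausibility bounds are made up for illustration only.
weights = [81.0, 74.5, 18.0, 90.2, 68.3]   # 18 kg is clearly a recording error

plausible = [w for w in weights if 30 <= w <= 300]
# The 18 kg entry is removed; the legitimate values survive.
```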
Another case where we can remove the outlier is when you notice the data you're observing is not part of the population you're studying – for example, when you're observing the bone density of healthy middle-aged men, notice unusual patterns in one participant, and find out that person has diabetes (which often leads to unhealthy bones).
Another case where we can remove the outlier is when the data was not accurately measured, or at least not measured as intended. An example: you're measuring reaction time to a specific event, and you notice your participant is randomly hitting the response key. Obviously, the measurement will not provide accurate results, and you can remove those outliers.
A general rule of thumb is that if the outlier does not influence the results but does affect the assumptions of the experiment, you can remove it – but you must report that in a footnote of your paper. However, this is not a common situation – usually, the outlier will affect both the results and the assumptions of the experiment. In that case it is not recommended to leave the outlier out of the analysis, but if you decide to do so, you should again note this in a footnote and also describe how the results change with and without the outlier.

So, if your reasons are against dropping the outlier – what should you do?

When it turns out you should not leave out your outlier, you are left with two alternatives. Whatever technique you decide to go with, before anything else you need to know your data well and, most of all, you need to know your research area.
One alternative is to opt for a transformation – square root and log transformations will both pull in high values. If the outlier is in the dependent variable, this eases the strain on the model's assumptions; if the outlier is in an independent variable, it reduces the impact of that single point on the fit.
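A quick illustration of how both transformations compress the high end (the numbers are made up):

```python
# How sqrt and log transformations pull in an extreme value.
import math

values = [4, 9, 16, 400]                  # 400 is the extreme value

sqrts = [math.sqrt(v) for v in values]    # [2, 3, 4, 20]
logs = [math.log(v) for v in values]      # the gap shrinks even further

# Ratio of the extreme value to the next-largest one:
# raw: 400/16 = 25, sqrt: 20/4 = 5, log: about 2.2
```

On the raw scale the extreme point sits 25 times above its neighbor; after a square root it is only 5 times above, and after a log barely twice – which is exactly why these transformations tame the influence of a single high value.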
Another alternative is to try a different model – it may simply be that a model that is not linear is more appropriate for your case.