Deep Tech Point
first stop in your tech adventure
The things you should know about simple and multiple linear regression
April 4, 2022 | Data science

Linear regression is usually fitted with ordinary least squares (OLS), also called linear least squares, and it opens the door to the regression world. It is one of the most widely known modeling techniques and is usually among the first topics people master when they learn predictive modeling. We distinguish between simple and multiple linear regression, and this article focuses on these two.

What are simple and multiple linear regression?

Simple linear regression is one of the most basic and commonly used regression techniques, but when would we use it in real life? Businesses often use linear regression to evaluate the relationship between advertising spend and revenue. Scientists use it to understand the relationship between a drug dosage and, say, patients' blood sugar, or to evaluate the effect of fertilizer on crop yields. You can use simple linear regression when you have one dependent variable and one independent variable, both variables are continuous, and the relationship between them is well described by a straight line.

So, when do we use linear regression? We use this modeling technique to describe the relationship between a dependent variable (Y) and one (simple linear regression) or more (multiple linear regression) independent variables (X) with a best-fit straight line, also known as a regression line. Despite the term "linear model," this type of regression can also model curvature.

Linear regression equation

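The equation for simple linear regression, which predicts the value of the dependent variable (Y) from a given independent variable (X), is:

Y = a + b*X + e

Here a is the intercept, b is the slope of the line, and e is the error term, that is, the part of Y that the line does not explain. Multiple linear regression extends the same form with one coefficient per independent variable: Y = a + b1*X1 + b2*X2 + ... + e.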

How do we fit the best linear regression line?

The regression line is defined by the values of a and b. We estimate them from the available data with the least squares method, which is why the technique is also called linear least squares. Other methods exist, but least squares is probably the most common way to draw the best-fit regression line. It chooses a and b so that the sum of the squared vertical deviations from each data point to the line is as small as possible. Each deviation is squared before the deviations are added, so positive and negative values cannot cancel each other out.
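To make the least squares idea concrete, here is a minimal sketch in plain Python. It uses the standard closed-form solution for one predictor: the slope b is the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and the intercept a follows from the means. The function names and the small data set are made up for illustration only.

```python
from statistics import mean


def fit_simple_ols(xs, ys):
    """Return (a, b) that minimize sum((y - (a + b*x))**2)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations of X and Y, and sum of squared deviations of X.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: the line passes through the means
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical deviations from the points to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustration data, not taken from any real study.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_simple_ols(xs, ys)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"sum of squared residuals = {sum_squared_residuals(xs, ys, a, b):.4f}")
```

Any other choice of a and b gives a larger sum of squared residuals on the same data, which is exactly what "best fit" means here.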

Linear models are one of the most common forms of regression, and when we have a continuous dependent variable, linear regression is probably the first type of model we should consider. Despite the term "linear model", we can add polynomial terms such as X squared to give the fitted line curvature. The model is still called linear because it remains linear in its coefficients. In a similar way, we can include interaction effects by adding the product of two independent variables as an extra term.
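The following sketch shows one way to add curvature while staying inside the linear model: treat X squared as a second predictor and fit a multiple linear regression. It is written in plain Python and solves the normal equations with a small Gaussian elimination routine. The helper names and the noise-free data are assumptions made for this example; in practice a numerical library would usually do this step.

```python
def solve_linear_system(A, v):
    """Solve A * coef = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (M[r][n] - tail) / M[r][r]
    return coef


def fit_least_squares(rows, ys):
    """Least squares coefficients for a design matrix given as a list of rows."""
    k = len(rows[0])
    # Normal equations: (X^T X) * coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from an exact quadratic, without noise.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 0.5 * x + 2.0 * x ** 2 for x in xs]

# Each row is [1, X, X^2]: the constant, the predictor, and its square.
design = [[1.0, x, x ** 2] for x in xs]
a, b1, b2 = fit_least_squares(design, ys)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

An interaction effect between two predictors X1 and X2 could be added the same way, by appending a column that holds X1*X2 to each row of the design matrix.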

What about outliers in linear regression?

In regression, outliers are points that fall far away from the "cloud" of other points. These points can be especially important because they can have a strong influence on the least-squares line. However, not all outliers have the same impact: some are more influential than others. For example, it is very important to watch for points that lie far from the center of the cloud in the horizontal direction. They tend to pull harder on the line and can strongly affect its slope, so we call them points with high leverage. When a high-leverage point actually changes the slope, so that the point would lie unusually far from the least-squares line if we had fitted the line without it, we call it an influential point.
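As a rough illustration of leverage and influence, the sketch below computes the leverage of each point in a simple linear regression with the standard formula h = 1/n + (x - mean of X)^2 / (sum of squared deviations of X), and then refits the line without the highest-leverage point to see how much the slope changes. The function names and the data are hypothetical and chosen only to make the effect visible.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    return y_bar - b * x_bar, b


def leverages(xs):
    """Leverage of each point; it depends only on the X values."""
    n = len(xs)
    x_bar = mean(xs)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return [1.0 / n + (x - x_bar) ** 2 / s_xx for x in xs]


# Hypothetical data: the last point sits far to the right of the cloud
# and well below the trend of the other five points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]

h = leverages(xs)
worst = h.index(max(h))  # index of the highest-leverage point

a_all, b_all = fit_line(xs, ys)
xs_wo = xs[:worst] + xs[worst + 1:]
ys_wo = ys[:worst] + ys[worst + 1:]
a_wo, b_wo = fit_line(xs_wo, ys_wo)

print("leverages:", [round(v, 3) for v in h])
print(f"slope with all points:       {b_all:.3f}")
print(f"slope without point {worst}:      {b_wo:.3f}")
```

A large change in slope after removing a single high-leverage point is a sign that the point is influential. Whether it should stay in the model is a separate judgment call.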

Should you simply remove the outliers? It is tempting, but don't throw out data for no better reason than that it makes the fit look bad. As a rule of thumb, final models that ignore exceptions usually perform badly. Exceptions are usually there for a reason, and whatever final model you create should be able to accommodate them.

Another thing to be careful about is using a categorical predictor when one of its levels has only a small number of points, because those points can become influential.

Linear regression and the key takeaway points