Regression analysis: key points every data scientist should know

March 31, 2022 | Data science

In this article, we are going to learn about one of the most significant predictive analytics tools for machine learning and big data – regression. We are going to define it and explain why and in which cases we use it. We will also take a look at seven types of regression analysis, learn which variables go with which regression technique, and discuss some of the key factors associated with each one.

What is regression?

Regression is one of the leading predictive analytics tools for machine learning and big data. Most data scientists are familiar with linear and logistic regression because they are the ones most commonly used in the real world. But the fact is there are more than a dozen types of regression algorithms, each designed for a different kind of analysis and each carrying its own importance. In addition to these, you can experiment with formulas and come up with your own, new regression algorithm.
Regression analysis is a modeling technique in predictive analytics that studies the relationship between variables. Its primary goal is to understand the relationship between two variables – a dependent variable, which represents the outcome or target, and an independent variable, or predictor, which represents the action or input. The second goal of regression analysis, no less important, is to understand the strength of the relationship between the dependent and independent variable.
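To make this concrete, here is a minimal sketch in Python (the data and variable names are made up for illustration) that fits a simple linear regression with scikit-learn and reports the slope, intercept, and R-squared – the slope and R-squared speak directly to the direction and strength of the relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: years of experience (independent) vs. salary (dependent)
X = np.array([[1], [2], [3], [4], [5], [6]])  # predictor
y = np.array([35, 42, 48, 55, 60, 68])        # outcome, in thousands

model = LinearRegression().fit(X, y)

print(f"slope:     {model.coef_[0]:.2f}")     # direction of the relationship
print(f"intercept: {model.intercept_:.2f}")
print(f"R-squared: {model.score(X, y):.3f}")  # strength: variance explained
```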

Correlation analysis as a prerequisite to regression

Regression analysis has a cousin – correlation analysis, which is also typically visualized with a scatter plot and a fitted line – and the two are often performed together. Correlation analysis does not make any assumptions about one variable being independent and the other dependent, as regression analysis does. Correlation analysis is focused only on the strength and direction of the relationship between two or more variables. Regression analysis, on the other hand, hypothesizes that one or more variables are independent and have a causal relationship with the dependent variable.
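For comparison, a correlation analysis only asks how strongly two variables move together. A small sketch, again on made-up data, using scipy's Pearson correlation:

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up paired measurements; no dependent/independent assumption here
hours_studied = np.array([2, 4, 5, 7, 8, 10])
exam_score = np.array([55, 60, 66, 73, 78, 88])

r, p_value = pearsonr(hours_studied, exam_score)
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")  # strength and direction only
```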

Why do we use regression analysis?

Regression analysis has three major uses:

- determining the strength of predictors,
- forecasting an effect, and
- trend forecasting.

We’ve said it before – we use regression analysis to understand the relationship between two or more variables. A few examples can illustrate what we’re talking about.
Let’s say you’ve been gaining weight over the last few years. If you don’t plan to change your eating and exercise habits and continue to put on weight at the same rate, you can predict with a simple linear regression how much weight you’ll gain in the next five years (a minimal version of this is sketched after these examples).
Or, let’s say you want to estimate the growth of your product sales in the current economy. Based on data from past years, you can forecast future sales, assuming economic conditions and your approach to business stay the same.
Or let’s say you want to research whether socioeconomic status affects educational achievement, whether IQ influences earnings, or whether exercising affects your weight. These are all examples where we can use regression analysis – where we evaluate the impact of an independent variable on a dependent one and also try to estimate the strength of that relationship.
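Here is what the weight example above might look like in code – a minimal sketch with invented yearly weigh-ins, fitting a simple linear regression and extrapolating five years ahead:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented yearly weigh-ins: year index vs. weight in kg
years = np.array([[0], [1], [2], [3], [4]])
weight = np.array([78.0, 79.5, 81.2, 82.6, 84.1])

model = LinearRegression().fit(years, weight)

# Extrapolate to five years after the last measurement (year 4 + 5 = 9)
predicted = model.predict(np.array([[9]]))[0]
print(f"estimated yearly gain: {model.coef_[0]:.2f} kg")
print(f"predicted weight in five years: {predicted:.1f} kg")
```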

Types of regression techniques

We usually frame a problem as regression when the output variable is a real or continuous value, such as sales, salary, or even weight. But is it that simple? Which type of regression should we choose when trying to predict future sales, salary, or life expectancy with a specific disease? As we said at the beginning, there is much more to regression than linear and logistic regression – there are more than a dozen types of regression techniques, each with its own regression equation and regression coefficients. In general, every industry has a regression technique that is most common in its field. In medical research, for example, Cox regression is often used because it applies to survival, life expectancy, or “time to event” data (a minimal sketch follows below). Beyond such specialized cases, simple or multiple linear regression and logistic regression are definitely the ones that are most commonly used.
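For the curious, one way to run a Cox regression in Python is the lifelines library; a minimal sketch using the Rossi recidivism dataset that ships with lifelines, where the 'week' column is the time to event and 'arrest' is the event indicator:

```python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

# Rossi recidivism data bundled with lifelines: 'week' is time to event,
# 'arrest' is the event indicator, the remaining columns are covariates
df = load_rossi()

cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()  # hazard ratio and significance for each covariate
```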

But, how do we know when to use a specific type of regression technique to make a prediction? Regression techniques are driven by three metrics:

- the number of independent variables,
- the type of the dependent variable, and
- the shape of the regression line.

So, there are various kinds of regression techniques available to make predictions, and these techniques are driven by the three metrics mentioned above. Of course, you can even create new regression models, but let’s put that aside for now and focus on the seven regression techniques that are most commonly used:

- linear regression,
- logistic regression,
- polynomial regression,
- stepwise regression,
- ridge regression,
- lasso regression, and
- ElasticNet regression.
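As a quick taste of the regularized techniques on this list, here is a minimal sketch on toy data of our own making: ridge and lasso are regularized variants of linear regression, and the printed coefficients show how regularization shrinks them (lasso can zero some out entirely):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter in this toy target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(f"{type(model).__name__:>16}: {np.round(model.coef_, 2)}")
```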

So, the big question – how to choose the right regression model?

Yes, we know, so many options – and imagine if we had listed all the regression models that exist. Some people follow a simple rule: if the outcome is continuous, use linear regression; if the outcome is binary, use logistic regression. But is it really that simple?
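That rule of thumb is easy to demonstrate. The following sketch, on synthetic data of our own, fits a linear regression to a continuous outcome and a logistic regression to a binary one with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

# Continuous outcome -> linear regression
y_continuous = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)
linear = LinearRegression().fit(X, y_continuous)
print(f"linear R-squared:  {linear.score(X, y_continuous):.3f}")

# Binary outcome -> logistic regression
y_binary = (y_continuous > 0).astype(int)
logistic = LogisticRegression().fit(X, y_binary)
print(f"logistic accuracy: {logistic.score(X, y_binary):.3f}")
```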
The first step in choosing the right type of regression model is to explore your data – identify the relationships between variables and their impact, the types of the independent and dependent variables, the dimensionality, and other essential characteristics of the data.
Another important approach that can reveal the appropriateness of a regression model is analyzing different metrics. Three error metrics are commonly used for evaluating and reporting the performance of a regression model – Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). The appropriateness of a regression model can also be measured with R-squared, adjusted R-squared, AIC, BIC, the error term, or Mallows’ Cp criterion.
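All of these error metrics are one-liners in scikit-learn; a quick sketch on made-up observed and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up observed vs. predicted values from some fitted model
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.1, 9.3, 10.6])

mse = mean_squared_error(y_true, y_pred)
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {np.sqrt(mse):.3f}")
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R2:   {r2_score(y_true, y_pred):.3f}")
```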
Another excellent approach to evaluating regression models is cross-validation. With this approach, you divide your data set into two groups – a training set and a validation set – fit the model on the training set, and then take the mean squared difference between the observed and predicted values on the validation set, which gives you a measure of prediction accuracy.
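With scikit-learn you rarely have to split the data by hand – cross_val_score handles the splitting for you. A minimal sketch of the 5-fold variant of this idea, on synthetic data of our own:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

# 5-fold cross-validation; scikit-learn reports negated MSE by convention
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(f"mean validation MSE: {-scores.mean():.3f}")
```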
Another thing to keep in mind is that if your data set has multiple confounding variables, you should not rely on an automatic model selection method, because you do not want to put these variables into a model at the same time. Confounding variables affect other variables in ways that produce spurious or distorted associations between two variables, and therefore lead to false results.
In addition to all of the above, the selection of the best-fitted regression model also depends on your objective – a less powerful model is obviously easier to implement than a highly statistically significant, but also more complex, model.