
A quick guide to bivariate analysis in descriptive statistics

March 7, 2022 | Data science

“Bi” means two and the term bivariate analysis refers to the understanding of the relationship between two variables. In comparison, univariate analysis is about analyzing one variable, while multivariate analysis refers to understanding relationships between more than two variables.

Bivariate analysis has many real-life uses because it can help us understand the strength of the relationship between two variables. But before bivariate analysis can quantify that strength, it must first establish whether there is any causality or association between the two variables – whether the value of the dependent variable will change if the independent variable is modified.

What are the three common ways to perform the bivariate analysis?

Usually, in numeric variables, we use three common ways to carry out bivariate analysis and often we use them simultaneously:

1. Scatterplots
2. Correlation coefficients
3. Simple linear regression

1. What are scatterplots and how do we interpret them?

A scatterplot, also known as a scatter chart or scatter graph, uses dots to represent values for two different numeric variables. The position of each dot on the x- and y-axis indicates the values for an individual data point. We use scatterplots to observe relationships between variables – positive, negative, or none – and to judge the strength of those relationships – weak or strong.
Scatterplots help us visualize the relationship between two variables by assigning the value of one variable to the x-axis and the other to the y-axis. For example, if we place the weight of several people on the x-axis and their height on the y-axis, we can quickly notice a positive relationship between weight and height: as weight increases, height tends to increase as well. When the dots are packed tightly around this upward trend, the relationship is strong – whenever the variable on the x-axis increases, the variable on the y-axis reliably increases too. On the other hand, if the dots are spread out but the upward trend still exists, we are talking about a weak relationship.
Just as we can observe a positive relationship, we can also witness a negative one: as the variable on the x-axis increases, the variable on the y-axis decreases. A negative relationship can likewise be strong (the dots are packed tightly together) or weak (the dots are spread out).
And finally, a scatterplot can show that there is no relationship at all, either positive or negative – the dots on the scatterplot form no meaningful pattern.
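The four cases above can be sketched with matplotlib on synthetic data (the variable names and noise levels below are illustrative assumptions, not from the article):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

x = rng.uniform(50, 100, 100)                   # e.g. weight in kg (synthetic)
strong_pos = 0.9 * x + rng.normal(0, 3, 100)    # tight cloud: strong positive
weak_pos = 0.9 * x + rng.normal(0, 25, 100)     # spread-out cloud: weak positive
negative = -0.9 * x + rng.normal(0, 3, 100)     # tight downward cloud: strong negative
none = rng.uniform(0, 100, 100)                 # no relationship at all

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, y, title in zip(axes,
                        [strong_pos, weak_pos, negative, none],
                        ["strong positive", "weak positive",
                         "strong negative", "no relationship"]):
    ax.scatter(x, y, s=10)
    ax.set_title(title)
fig.savefig("scatterplots.png")
```

The tightness of each dot cloud around its trend is exactly what the text describes: the smaller the noise relative to the trend, the stronger the relationship looks.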


2. Correlation coefficients

Bivariate analysis can be also performed by a correlation coefficient, which gives us a good understanding of how two variables are related. In practice, we use correlation coefficients so we can quantify the relationships between two variables.
The most common type of correlation coefficient is the Pearson coefficient, which measures the linear association between two variables. Pearson coefficient values range between -1 and 1: a value of -1 indicates a perfect negative linear correlation between two variables, a value of 1 indicates a perfect positive linear correlation, and a value close to 0 indicates no linear correlation between the two variables.
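The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations, which is what `np.corrcoef` computes. A minimal sketch, using hypothetical weight/height data for illustration:

```python
import numpy as np

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their
    standard deviations (the sample sizes cancel out)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# Hypothetical data for illustration
weight = np.array([62, 70, 74, 80, 85, 92, 101])        # kg
height = np.array([160, 168, 171, 175, 180, 183, 190])  # cm

r = pearson(weight, height)
print(round(r, 3))  # close to 1: a strong positive linear association
```

In practice you would simply call `np.corrcoef(weight, height)[0, 1]`; the hand-written version matches it and makes the formula visible.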
In practice, we can look at a scatterplot and quickly tell there is a relationship (either positive or negative) between variables, but if we want to quantify that relationship precisely, we need to calculate the (Pearson) correlation coefficient. However, it is not all that simple. The Pearson correlation coefficient is useful for establishing a linear association between two variables, but we must be careful when interpreting it and keep the following three caveats in mind:

1. Correlation and causation are not the same.

Two variables can be correlated in our calculations, but that does not mean they are causally connected – that one variable influences the other to occur more or less often. One example of correlation without causation comes from a study that observed women during menopause and the effect of hormone replacement therapy (HRT). The study concluded that HRT reduces the risk of coronary heart disease. However, later studies did not confirm that causation. So why the correlation between HRT and a lower risk of coronary heart disease? Women who used HRT tended to come from higher socioeconomic classes and had better diets and more exercise – a hidden explanatory (confounding) variable that the original study failed to account for.

2. Outliers can significantly influence a change in the Pearson correlation coefficient

Suppose two variables have a Pearson correlation coefficient of 0.1, but the dataset also contains one extreme outlier. Including that single point can pull the calculated coefficient far from 0.1 – say toward 0.9 – and change how we interpret the data. For this reason, whenever we calculate the Pearson correlation coefficient for two variables, it is a good idea to also visualize them with a scatterplot. This way we can see if we are dealing with outliers and whether they could affect the final interpretation.
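The outlier effect is easy to reproduce on synthetic data (the numbers below are illustrative assumptions): fifty weakly related points, then the same fifty points plus one extreme value on both axes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two weakly related variables (true correlation around 0.1 by construction)
x = rng.normal(0, 1, 50)
y = 0.1 * x + rng.normal(0, 1, 50)
r_before = np.corrcoef(x, y)[0, 1]

# Append a single extreme outlier far along both axes
x_out = np.append(x, 10)
y_out = np.append(y, 10)
r_after = np.corrcoef(x_out, y_out)[0, 1]

# One point out of fifty-one drags the coefficient sharply upward
print(round(r_before, 2), round(r_after, 2))
```

A scatterplot of `x_out` against `y_out` would reveal the lone point immediately, which is exactly why the text recommends plotting before trusting the coefficient.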

3. A Pearson correlation coefficient captures only linear relationships between two variables.

Obviously, non-linear relationships also exist. For this reason, it again makes sense to visualize the data with a scatterplot when analyzing the relationship between two variables, because it can help you detect a nonlinear relationship.
Nevertheless, with practice, we get to know the types of research questions a Pearson correlation can examine – for example, the relationship between age, measured in years, and height, measured in inches, or the relationship between temperature, measured in degrees Celsius, and ice cream sales, measured in revenue.
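A classic illustration of the linearity caveat: a perfect quadratic relationship over a symmetric range produces a Pearson coefficient of essentially zero, even though the two variables are completely dependent.

```python
import numpy as np

x = np.linspace(-3, 3, 61)
y = x ** 2  # a perfect, but nonlinear, relationship

r = np.corrcoef(x, y)[0, 1]
print(r)  # effectively 0: Pearson completely misses the relationship
```

A scatterplot of `x` against `y` would show the parabola at a glance, which is why the text recommends always plotting alongside the coefficient.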

That said, the Pearson correlation coefficient is not the only measure of correlation – it is just the most popular one. In bivariate analysis, we also use, for example, the Kendall rank correlation or the Spearman rank correlation. Questions we could examine with Spearman rank correlation include: is there a statistically significant relationship between subjects' level of education (high school, bachelor's, or graduate degree) and their starting salary? Or, is there a statistically significant relationship between competitors' finishing position in a race and the competitors' age?
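Spearman rank correlation is simply the Pearson correlation computed on the ranks of the values, which is why it handles ordinal data and monotonic nonlinear relationships. A minimal sketch on the race example above (the ages are made up for illustration, and the simple ranking helper assumes no ties):

```python
import numpy as np

def rankdata(a):
    """1-based ranks of the values in a; assumes no ties for this sketch."""
    order = np.argsort(a)
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    return ranks

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return np.corrcoef(rankdata(x), rankdata(y))[0, 1]

# Hypothetical example: finishing position in a race vs. competitor age
position = np.array([1, 2, 3, 4, 5, 6])
age = np.array([24, 27, 23, 31, 36, 40])

print(round(spearman(position, age), 3))  # → 0.829
```

In practice, `scipy.stats.spearmanr` does the same thing and also handles ties and returns a p-value.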

3. What should we know about simple linear regression in bivariate analysis?

Another way to perform a bivariate analysis is a method called simple linear regression. With this method, we treat one variable as an explanatory variable (also known as an independent or predictor variable; it explains the variation in the response variable) and the other as a response variable (also known as a dependent or outcome variable; its value responds to changes in the explanatory variable). We then find the line that best fits the dataset, which helps us understand the exact relationship between the two variables. This line is known as the least-squares regression line, and it describes the relationship between the explanatory (independent) variable and the response (dependent) variable. In practice, we use software like Excel, SPSS, or a graphing calculator to find the equation of this line.
On a scatterplot, we should see our data points are scattered closely around this line because the least-squares regression line is the best fitting line for our data out of all the possible lines we could draw.
With simple linear regression in bivariate analysis, we can answer a question like: how tall can we expect a person to be if he weighs 80 kg? Caution, again: questions like these can only be answered reliably if we use values for the predictor variable that fall within the range of the predictor variable in the original dataset used to generate the least-squares regression line (let's say in our example we used weights from 70 kg to 130 kg). Imagine the answer we would get if we expected simple linear regression to answer the question "How tall can we expect a person to be if he weighs 180 kg?" – a prediction far outside the data, and not a realistic one.
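The weight/height example above can be sketched with `np.polyfit`, which fits the least-squares regression line; the data points below are made up for illustration, with weights spanning the 70–130 kg range mentioned in the text.

```python
import numpy as np

# Hypothetical weight (kg) / height (cm) data, weights spanning 70-130 kg
weight = np.array([70, 78, 85, 92, 100, 108, 115, 123, 130])
height = np.array([165, 169, 172, 175, 179, 182, 184, 188, 191])

# Least-squares regression line: height = slope * weight + intercept
slope, intercept = np.polyfit(weight, height, 1)

def predict_height(w):
    """Only meaningful for weights inside the observed 70-130 kg range."""
    return slope * w + intercept

print(round(predict_height(80), 1))  # expected height at 80 kg
```

Calling `predict_height(180)` would still return a number, which is exactly the trap the text warns about: the line knows nothing about weights it never saw.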

The types of bivariate analysis according to the type of variable

There are three types of bivariate analysis according to the type of variable:

1. If we examine two numerical variables, we usually use a scatterplot and a (linear) correlation coefficient
2. If we examine two categorical variables, we usually use a stacked column chart, which is a useful graph to visualize the relationship between two categorical variables. A stacked column chart compares the percentage that each category from one variable contributes to a total across categories of the second variable. We can also use a combination chart, which uses two or more chart types to emphasize that the chart contains different kinds of information. When we want to determine the association between two categorical variables, we use the chi-square test.
3. If we examine a numerical and a categorical variable, we can use a line chart with error bars or a combination chart to visualize the data. To determine the association between a numerical and a categorical variable, we use the Z-test or the t-test, which serve the same purpose – observing whether the averages of two groups are statistically different from each other (the t-test is typically used for smaller samples or when the population variance is unknown). We can also use Analysis of Variance (ANOVA), which assesses whether the averages of more than two groups are statistically different from each other; it is appropriate for comparing the averages of a numerical variable across more than two categories of a categorical variable.
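For the two-categorical case, the chi-square statistic compares each observed cell count in the contingency table with the count expected under independence. A minimal sketch (the contingency table below is a made-up example):

```python
import numpy as np

def chi_square_statistic(observed):
    """Chi-square statistic for a contingency table of two categorical
    variables: sum over cells of (observed - expected)^2 / expected,
    where the expected counts assume the two variables are independent."""
    observed = np.asarray(observed, float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

# Hypothetical contingency table: rows = one categorical variable,
# columns = the other (e.g. gender x preferred product)
table = [[30, 10, 20],
         [20, 25, 15]]

print(round(chi_square_statistic(table), 2))
```

The statistic is then compared against a chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom; in practice, `scipy.stats.chi2_contingency` computes both the statistic and the p-value in one call.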

Conclusion

In descriptive statistics, bivariate analysis is one of the most commonly used types of analysis because it helps us visualize and understand the relationship between two variables. Moreover, with numeric variables, we often use all three methods – scatterplots, correlation coefficients, and linear regression together because when analyzed together they help us see the big picture that could be hidden in the data.