Deep Tech Point

# 7 things every data scientist should know about descriptive statistics and univariate, bivariate and multivariate analysis

February 25, 2022 | Data science

This article scratches the surface of descriptive statistics: we define it and see what purpose it serves. With the help of descriptive statistics, we take a look at univariate analysis, the analysis of a single variable, and at the three major characteristics we observe in a single variable: the distribution, the central tendency, and the dispersion. Finally, we take a quick peek at bivariate and multivariate analysis, so you have a better understanding of what analyzing one, two, or more variables means.

### What is descriptive statistics?

Almost every study presents the basic features of its data, that is, simple summaries of the sample and the measures, with descriptive statistics. Compared with inferential statistics, descriptive statistics tells the story of what the data shows, while inferential statistics uses probability to draw conclusions from the sample and apply them to the wider population. Descriptive statistics is therefore much more straightforward: it only summarizes the sample by describing what is going on in the data.

### Univariate analysis as the simplest form of analyzing data

Univariate analysis is the simplest form of analyzing data because it examines only one variable at a time, so we are not dealing with relationships between variables, as is the case in regression. It is the most recognizable form of data analysis in descriptive statistics, and its sole purpose is to take data, summarize it, and find a pattern. In short, univariate analysis is about describing data.

In almost every univariate analysis, we observe and describe three major characteristics of the variable:

• the distribution
• the central tendency
• the dispersion

#### The distribution

Every variable in a sample has a distribution, a summary of how often each value or range of values occurs. You have probably already heard of the best-known one, the Gaussian distribution, also known as the normal distribution. The simplest way to describe a distribution is to list every value of a variable and the number of people who had each value, for example when we describe the gender distribution in a group of people. Cases like this aren’t problematic because there are only a few distinct values and we can list all of them.

It gets more complicated when a variable takes a huge number of distinct values, for example income, where almost every person has a different value. In cases like these, we usually group the values into ranges, for example into roughly five income intervals. The intervals must be mutually exclusive and exhaustive, so that every value falls into exactly one of them. The result is called a frequency distribution, and it is one of the most common ways to represent a single variable, either in tabular or graphical form, for example as a histogram or a stem-and-leaf display. Sometimes the shape of the distribution is also described with indices such as skewness and kurtosis.
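To make the idea of a frequency distribution more concrete, here is a minimal Python sketch. The sample data, the interval edges, and the helper name `binned_frequencies` are all invented for illustration; they are assumptions, not part of any particular study or library.

```python
from collections import Counter

# Hypothetical sample data, invented for illustration only.
genders = ["female", "male", "female", "female", "male", "other"]
incomes = [12_500, 18_200, 23_900, 31_000, 34_750, 41_300, 58_000, 77_400, 96_100]

# Few distinct values: simply count how often each value occurs.
gender_counts = Counter(genders)
print(gender_counts)


def binned_frequencies(values, edges):
    """Count values per half-open interval [edges[i], edges[i + 1]).

    Half-open intervals are mutually exclusive. They are exhaustive only if
    every value lies in [edges[0], edges[-1]), which is checked here.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if not edges[0] <= v < edges[-1]:
            raise ValueError(f"{v} falls outside the intervals")
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Many distinct values: group them into five income intervals.
income_edges = [0, 20_000, 40_000, 60_000, 80_000, 100_000]
income_counts = binned_frequencies(incomes, income_edges)

# A tiny text histogram: one '#' per person in the interval.
for low, high, n in zip(income_edges, income_edges[1:], income_counts):
    print(f"{low:>7,} to {high - 1:>7,}: {'#' * n} ({n})")
```

Because every value lands in exactly one interval, the counts always add up to the sample size, which is a quick sanity check for any frequency table.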

#### The central tendency

The mean, or average, is the most commonly used measure of central tendency. A measure of central tendency is a single value that describes the central position within a set of data. The three main measures are:

• Mean, aka average – to compute the mean, add up all the values and divide by the number of values.
• Median – the value found at the exact middle of the sorted set of values; with an even number of values, it is the average of the two middle values.
• Mode – the most frequently occurring value in the set of values.
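The three measures can be computed with Python's standard `statistics` module. This is only a small sketch; the scores are hypothetical and include one deliberately large value to show how differently the measures react to it.

```python
import statistics

# Hypothetical scores, invented for illustration; 21 is a deliberate outlier.
scores = [2, 3, 3, 5, 7, 8, 21]

mean = statistics.mean(scores)      # sum of the values / number of values
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequently occurring value

print(f"mean={mean}, median={median}, mode={mode}")
# prints: mean=7, median=5, mode=3
```

The single large value pulls the mean up to 7, while the median stays at 5, which is why the median is often preferred for skewed variables such as income.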

#### The dispersion

Dispersion goes hand in hand with the measures of central tendency: it refers to the spread of the values around the central tendency. The two most common measures of dispersion are:

• range
• standard deviation.

The range is a very simple measure and also the least robust one: it is the highest value minus the lowest value, so a single outlier can greatly exaggerate it. The standard deviation is more informative because it takes every value into account and shows how the set of scores relates to the mean of the sample. More precisely, the standard deviation is the square root of the variance, and it describes how widely the values of the variable are spread around the mean. Besides the range and the standard deviation, other measures of variability or dispersion include quartiles, absolute deviation, and variance.
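The following sketch computes these measures of dispersion with Python's standard `statistics` module. The scores are hypothetical, and the sketch assumes we want the sample versions of the variance and standard deviation (dividing by n - 1), which is what `statistics.variance` and `statistics.stdev` compute.

```python
import statistics

# Hypothetical scores, invented for illustration; 21 is a deliberate outlier.
scores = [2, 3, 3, 5, 7, 8, 21]

# Range: highest value minus lowest value.
value_range = max(scores) - min(scores)
range_without_outlier = max(scores[:-1]) - min(scores[:-1])

# Sample variance and standard deviation (divide by n - 1).
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)  # square root of the variance

# Quartiles: three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Mean absolute deviation: average distance of the values from the mean.
mean = statistics.mean(scores)
mean_abs_dev = sum(abs(x - mean) for x in scores) / len(scores)

print(f"range: {value_range} (without the outlier: {range_without_outlier})")
print(f"variance: {variance}, standard deviation: {std_dev:.2f}")
print(f"quartiles: {q1}, {q2}, {q3}")
print(f"mean absolute deviation: {mean_abs_dev:.2f}")
```

Dropping the single value 21 shrinks the range from 19 to 6, which illustrates how strongly one outlier can exaggerate it.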

### Bivariate and multivariate analysis as a step further in descriptive statistics

If we analyze more than one variable at a time, we are working with bivariate (two variables) or multivariate (more than two variables) analysis. As we have seen in the previous paragraphs, univariate analysis is a fairly simple descriptive analysis. Bivariate and multivariate analyses are slightly more complicated because they describe the relationships between two or more variables.

To be more precise, a bivariate analysis looks at two paired data sets and studies whether a relationship exists between them. It is extremely helpful for testing simple hypotheses of association and is, for example, often reported in quality-of-life research. For instance, we could store the data for a bivariate analysis in a two-column table and find that there is a relationship between weather conditions and the number of traffic accidents, where the weather condition is the independent variable and the number of traffic accidents is the dependent variable. Some of the most common types of bivariate analysis are:

• Scatter plots, which lay out a visual idea of the pattern that the variables follow.
• Regression analysis, which is a catch-all term for a wide variety of tools used to determine how your data points are related, for example whether they follow a straight line or an exponential curve.
• Correlation coefficient: in addition to regression analysis, the correlation coefficient measures the strength of a linear relationship (0 means there is no linear correlation, while +1 or -1 means that the variables are perfectly correlated).
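As a rough illustration of the last two items, the sketch below computes the Pearson correlation coefficient and a least-squares straight line in plain Python. The rainfall and accident numbers are invented, and the function names `pearson_r` and `fit_line` are hypothetical helpers written for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / ss_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical paired data: daily rainfall in mm (independent variable)
# and number of traffic accidents that day (dependent variable).
rainfall = [0, 2, 5, 10, 20]
accidents = [3, 4, 5, 8, 13]

r = pearson_r(rainfall, accidents)
slope, intercept = fit_line(rainfall, accidents)

print(f"correlation coefficient: {r:.3f}")
print(f"fitted line: accidents = {slope:.2f} * rainfall + {intercept:.2f}")
```

A value of `r` close to +1 or -1 says the points lie close to a straight line; it does not say anything about curved relationships, which is one reason to look at a scatter plot first.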

Multivariate analysis, on the other hand, uses more than two variables and analyzes which of them, if any, are correlated with a specific outcome. In this context, multivariate means that multiple independent variables are related to one outcome. The goal of multivariate analysis is to determine which variables are associated with the outcome, keeping in mind that correlation alone does not establish causation. Even a layperson can see that the majority of real-world problems are multivariate. For example, not only the weather but also the blood alcohol level, driving experience, number of hours slept, and probably a few other independent variables correlate with the number of traffic accidents. The most common multivariate analysis techniques include multiple linear regression, multiple logistic regression, MANOVA, factor analysis, and cluster analysis, to name just a few.
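To give a feel for one of these techniques, here is a minimal sketch of multiple linear regression in plain Python, fitted by solving the normal equations with Gaussian elimination. The data and the helper names `solve_linear_system` and `fit_multiple_regression` are invented for this example; in practice you would most likely reach for a numerical or statistics library instead.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (collinear variables?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_regression(rows, outcomes):
    """Return [intercept, b1, b2, ...] minimizing the squared error."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, outcomes))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: (rainfall in mm, hours slept) -> number of accidents.
rows = [(0, 8), (5, 7), (10, 6), (2, 5), (20, 7), (15, 4)]
accidents = [2, 4, 6, 5, 9, 10]

intercept, b_rain, b_sleep = fit_multiple_regression(rows, accidents)
print(f"accidents = {intercept:.2f} + {b_rain:.2f} * rain + {b_sleep:.2f} * sleep")
```

Each coefficient estimates how the outcome changes with one independent variable while the others are held fixed, which is what separates this from running several bivariate analyses side by side.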