If you want to be good at data science, you cannot afford to neglect data analytics and statistics. Both disciplines leave little room for guesswork and rely on facts. A solid grounding in them will help you think critically and, above all, make data-driven decisions. In this article, we review the basic statistical terms every data scientist should be familiar with.
Population and sample
In statistics, a population is the entire source of data to be collected, while a sample represents only a part of that population. In other words, a population is the pool of individuals, objects, events, or even procedures that share a common feature. That feature could be a year of birth, a nationality, a place of residence, a level of education, and so on. A population is usually very large and diverse. What, then, is a sample? A statistical sample is a selection of relevant representatives from a population to be included in a study. It is very rare for a study to cover an entire population; usually, we include only a sample, but we take care to make that sample large and representative enough to support valid conclusions about the population.
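As a minimal sketch of this distinction, the snippet below builds a made-up population of birth years (all values are synthetic, generated purely for illustration) and draws a simple random sample from it:

```python
import random

# Hypothetical population: birth years of 10,000 people (synthetic data)
random.seed(42)
population = [random.randint(1950, 2005) for _ in range(10_000)]

# Draw a simple random sample of 500 individuals from the population
sample = random.sample(population, 500)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"Population mean birth year: {population_mean:.1f}")
print(f"Sample mean birth year:     {sample_mean:.1f}")
```

With a reasonably sized random sample, the sample mean tends to land close to the population mean, which is exactly why careful sampling matters.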
A variable is any data item, any attribute or characteristic, that can be measured or counted. A variable can be a place of birth or a level of education; it can describe a place, a thing, a person, or even an idea. Variables are the core of any research and should therefore be clearly identified.
At this point, we should mention the difference between qualitative (aka categorical) and quantitative (aka numerical) variables. Qualitative variables have descriptive values and can only take the form of names or labels, such as the color of a person's eyes or hair, their name, or their nationality. A quantitative or numeric variable is something we can count or measure in numbers, such as a person's height, weight, or age. Quantitative variables can be further classified as discrete or continuous. A variable is continuous when it can take on any value between its minimum and its maximum; when this is not the case, we are dealing with a discrete variable. The number of firefighters in a department is a discrete variable, while their height and weight are continuous variables, since they can take any value between the minimum and maximum height and weight observed.
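To make the classification concrete, here is a rough sketch in Python. The record and all its values are hypothetical, and the int/float heuristic is only illustrative (a count stored as a float would fool it):

```python
# Hypothetical firefighter record (made-up values)
firefighter = {
    "name": "Ana Horvat",       # qualitative: a label, not a number
    "nationality": "Croatian",  # qualitative: a category
    "crew_count": 12,           # quantitative, discrete: countable (int)
    "height_cm": 178.4,         # quantitative, continuous: any value in a range (float)
    "weight_kg": 71.2,          # quantitative, continuous
}

# A rough heuristic: strings are labels, ints suggest discrete counts,
# floats suggest continuous measurements
for key, value in firefighter.items():
    if isinstance(value, str):
        kind = "qualitative"
    elif isinstance(value, int):
        kind = "quantitative, discrete"
    else:
        kind = "quantitative, continuous"
    print(f"{key}: {kind}")
```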
In addition to the distinction between quantitative and qualitative data, statistical data is often classified according to the number of variables studied; here we distinguish between univariate and bivariate data. When we look at only one variable in a study, we are working with univariate data. When we examine a relationship between two variables, we are working with bivariate data. For example, a study that examines the relationship between the height and weight of firefighters works with bivariate data.
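As an illustration, the sketch below treats hypothetical height and weight measurements of firefighters as bivariate data and computes the Pearson correlation coefficient between the two variables (all numbers are made up):

```python
# Bivariate data: height (cm) and weight (kg) of firefighters (hypothetical values)
heights = [172, 178, 169, 185, 190, 175, 181]
weights = [70, 80, 68, 92, 98, 74, 85]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Pearson correlation coefficient between the two variables
cov = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
var_h = sum((h - mean_h) ** 2 for h in heights)
var_w = sum((w - mean_w) ** 2 for w in weights)
r = cov / (var_h * var_w) ** 0.5
print(f"Correlation between height and weight: r = {r:.2f}")
```

A correlation coefficient close to 1 means taller firefighters in this made-up sample also tend to be heavier, which is the kind of relationship a bivariate study looks for.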
Let’s look at variables from another perspective. During an experiment, when we evaluate a hypothesis, we test the effect of the independent variable (the variable that is manipulated) on the dependent variable (the variable that is measured). In addition to these two, scientists often include controlled variables, which are kept constant while exploring the effects of the independent variable on the dependent one.
Having covered variables, analysis is another statistical term that every data scientist should be well acquainted with.
Analysis is the detailed study or examination of something in order to discover more about it. In statistics, we deal with:
- quantitative analysis aka statistical analysis, and
- qualitative analysis or non-statistical analysis.
Quantitative analysis is about looking at hard data and actual numbers: we collect data, look for patterns, and interpret the results, often with the help of data visualization. Qualitative analysis, on the other hand, is less definite because it often concerns things that cannot be expressed as numbers, such as subjective characteristics and opinions. Focus group discussions, in-depth interviews, and surveying prevailing opinions in forums are some of the most common methods of gathering qualitative data.
Understanding descriptive and inferential statistics
Every data scientist should be able to work with at least the basics of descriptive statistics. But what is that? The shortest possible answer would be: a summary of the characteristics of a data set. With descriptive statistics, we try to describe and understand the characteristics of a specific data set by giving short summaries of the sample. These summaries rely on measures of center or central tendency (the mean or average, the median, and the mode), which are among the best-known statistical measures, and measures of variability (standard deviation, variance, minimum and maximum values, range, kurtosis, and skewness).
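Python's standard `statistics` module covers most of these measures out of the box. The sketch below computes them for a small, made-up sample of firefighter ages:

```python
import statistics

# Hypothetical sample: ages of 12 firefighters (assumed values)
ages = [25, 28, 28, 31, 33, 35, 35, 35, 40, 44, 51, 58]

# Measures of central tendency
print("mean:  ", statistics.mean(ages))
print("median:", statistics.median(ages))
print("mode:  ", statistics.mode(ages))

# Measures of variability
print("range: ", max(ages) - min(ages))
print("stdev: ", statistics.stdev(ages))     # sample standard deviation
print("var:   ", statistics.variance(ages))  # sample variance
```

Skewness and kurtosis are not in the standard library; for those, libraries such as SciPy are the usual choice.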
Descriptive statistics helps us understand the properties of a data sample. When we know the sample mean and, for example, the distribution of a variable, we know a bit more about the collective attributes of the sample. But can we make a prediction with this information alone? No; to make predictions, we need to employ inferential statistics, a separate branch of statistics. Comparing descriptive vs. inferential statistics, we can say that with descriptive statistics there is no uncertainty: the statistics precisely describe the data you collected.

With inferential statistics, however, we take data from samples and make generalizations about a population. To do that, we need random and unbiased sampling methods that yield a representative sample; without one, we cannot make valid statistical predictions. Sample-based predictions also have a downside: a sampling error, the difference between the value measured in the sample and the actual value in the population, will always occur. For this reason, there is always at least some degree of uncertainty in inferential statistics, but we try to reduce it to a minimum with probability sampling methods.
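The effect of sampling error can be sketched with a quick simulation: we draw random samples of increasing size from a synthetic population (all values generated purely for illustration) and compare each sample mean with the true population mean:

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 100,000 measurements drawn from a normal distribution
population = [random.gauss(170, 10) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Draw random samples of increasing size and observe the sampling error
for size in (10, 100, 1000):
    sample = random.sample(population, size)
    error = abs(statistics.mean(sample) - true_mean)
    print(f"n={size:>5}: sampling error = {error:.3f}")
```

Larger samples generally produce smaller sampling errors, which is why representative, sufficiently large samples are key to making valid inferences.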
There is much, much more to statistics than the basics, but in this article we tried to go through a few of the fundamental terms and concepts, the true foundations on which you can deepen your knowledge.