Deep Tech Point
first stop in your tech adventure

8 things every data scientist should know about variables

February 21, 2022 | Data science

What is a variable, and what do we do with one? These are some of the questions we are going to answer in this article. They are the very basics of data science, but they are important to understand before you leap into much more complex topics. So, let’s start.

1. Every data scientist should know, at least roughly, the definition of a variable

A variable is any data item, that is, any attribute or characteristic that can be measured or counted and can assume different values. A variable can be a place of birth, a level of education, income, or a type of housing; it can describe a place, a thing, a person, or even an idea. Variables are the core of a study and should therefore be clearly identified.

2. Variables are classified into two main categories

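The two main variable categories are categorical (qualitative) variables, whose values are descriptive, and quantitative (numerical) variables, whose values are numbers that can be counted or measured.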

3. Categorical or qualitative variables can be either nominal or ordinal

As explained above, these variables have descriptive values, and their characteristics are not quantifiable: they cannot be counted or measured. Categorical variables have two subcategories: nominal variables, whose categories have no natural order, and ordinal variables, whose categories do have a natural order.

4. We can “translate” categorical or qualitative variables into numbers

As we said, categorical variables are not quantifiable: we can’t measure or count them. However, during data coding we can “translate” them into numbers, so that each number corresponds to one category. To illustrate, let’s look at a survey about a new car design. The answers we offered (“I think the design is excellent”, “very good”, “good”, “bad”, “awful”) could easily be translated into numbers: “rate the design of the car on a scale from one to five, where one stands for completely dissatisfied, or the worst, and five stands for very satisfied, or excellent.” This coding is self-explanatory with ordinal variables, where the order comes naturally. Closed-ended questions presented this way are known as Likert scale questions.

On the other hand, we can even assign numbers to nominal variables, which don’t have a natural order, but then we need access to metadata and we have to define the code set for every categorical variable. At the end of the day, transforming categorical variables into numbers is far from obligatory. If you decide to do it, we strongly suggest that you give each value a label, so that you or anyone else looking at the data understands what each value represents.
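The coding step described above can be sketched in a few lines of Python. This is only an illustration: the names `DESIGN_CODES`, `DESIGN_LABELS`, `BODY_TYPE_CODES`, and `encode` are made up for this example, and the car body types are an assumed nominal variable that does not come from the survey itself.

```python
# Hypothetical code set for the car-design question (ordinal: order matters).
# 1 is the worst rating and 5 is the best, as in the survey wording.
DESIGN_CODES = {
    "awful": 1,
    "bad": 2,
    "good": 3,
    "very good": 4,
    "excellent": 5,
}

# Reverse lookup, so every stored number keeps its human-readable label.
DESIGN_LABELS = {code: label for label, code in DESIGN_CODES.items()}

# Hypothetical code set for a nominal variable: the numbers are arbitrary
# identifiers and carry no order, so 3 is not "more" than 1.
BODY_TYPE_CODES = {"sedan": 1, "hatchback": 2, "suv": 3}


def encode(answers, code_set):
    """Translate category labels into their numeric codes."""
    return [code_set[answer] for answer in answers]


answers = ["good", "excellent", "bad", "good"]
codes = encode(answers, DESIGN_CODES)
decoded = [DESIGN_LABELS[code] for code in codes]

print(codes)    # [3, 5, 2, 3]
print(decoded)  # ['good', 'excellent', 'bad', 'good']
```

Keeping the reverse lookup next to the code set is one simple way to make sure the labels travel with the numbers.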

In addition, categorical variables can contain numbers in the first place, without any transformation like the one described above; they do not always contain labels or strings. An example is the class of the train car you are traveling in (first class, second class, and so on). In this case, we’re talking about an ordinal categorical variable.

Another interesting case is date and time values. Numerical variables, one would say. No, a date is a categorical variable, says another. Well, it depends on the context.
Let’s take dates as an example. Dates themselves are interval data, with the day of the month recorded as a number between 1 and 31; however, they can be viewed differently. What if you have only a few distinct days? Or what if the day of the week is what matters? Say you want to know the best day of the week to sell a car. You would convert each date into a day of the week (Monday, Tuesday, and so on), and treating it as a nominal variable would make sense. You could also treat dates as ordinal, for example when listing car models in the order they were released. The general case is to treat dates as a continuous numeric variable on an interval scale: the units are fixed, but the starting point is arbitrary, so there is no true zero. However, you might convert dates to, say, days since a particular event, and in that case the variable becomes ratio. This is a common problem (and beauty) of working with statistics: if you apply rules blindly, you will quickly hit a problem, so you have to think about what you are doing.
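The three ways of looking at dates can be sketched with Python's standard `datetime` module. All dates, model names, and the `campaign_start` reference event below are invented for illustration.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Made-up sale dates, for illustration only.
sale_dates = [date(2022, 2, 18), date(2022, 2, 19), date(2022, 2, 21)]

# Nominal view: keep only the day of the week.
weekdays = [DAY_NAMES[d.weekday()] for d in sale_dates]

# Ordinal view: only the order of the (made-up) release dates matters.
releases = {
    "Model B": date(2021, 6, 1),
    "Model A": date(2019, 3, 15),
    "Model C": date(2022, 1, 10),
}
release_order = sorted(releases, key=releases.get)

# Ratio view: days elapsed since a reference event, which has a true zero.
campaign_start = date(2022, 2, 14)  # hypothetical reference event
days_since = [(d - campaign_start).days for d in sale_dates]

print(weekdays)       # ['Friday', 'Saturday', 'Monday']
print(release_order)  # ['Model A', 'Model B', 'Model C']
print(days_since)     # [4, 5, 7]
```

The same column of dates ends up on three different scales, depending only on the question being asked.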

5. Quantitative or numerical variables can be additionally classified as discrete or continuous

As explained above, these variables have numeric values, so their characteristics are quantifiable: they can be counted or measured. Quantitative variables have two subcategories: discrete variables, which are counted and take separate values (such as whole numbers), and continuous variables, which are measured and can take any value within a range.

6. Other classifications of variables

In 1946, Stanley Smith Stevens introduced four scales of measurement: nominal, ordinal, interval, and ratio. These four scales are still widely used today to describe the characteristics of a variable.
Nominal and ordinal variables belong to qualitative data, and we’ve already talked about them.
Interval and ratio variables belong to quantitative data: an interval variable has fixed units but an arbitrary starting point and no true zero, while a ratio variable has a true zero, so ratios between its values are meaningful.
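A small worked example may help show why the true zero matters. Temperature is not discussed in this article; it is used here as an assumed illustration, because Celsius is an interval scale (its zero is arbitrary) while Kelvin is a ratio scale (its zero is absolute).

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15


cool_c, warm_c = 10.0, 20.0
cool_k, warm_k = celsius_to_kelvin(cool_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios are only meaningful on the ratio scale. "20 is twice 10" holds for
# the Celsius numbers, but 20 °C is not twice as hot as 10 °C.
ratio_c = warm_c / cool_c
ratio_k = warm_k / cool_k

print(diff_c, round(diff_k, 2))  # 10.0 10.0
print(ratio_c)                   # 2.0
print(round(ratio_k, 3))         # 1.035
```

Shifting the zero point leaves differences untouched but changes every ratio, which is why ratios are reserved for ratio-scale variables.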

7. Classification of variables according to the number of variables that are studied

In addition to the difference between quantitative and qualitative data, statistical data is often classified according to the number of variables that are studied. Here we distinguish univariate data from bivariate or multivariate data. When we look at only one variable in a study, we are working with univariate data. When we examine a relationship between two variables, we are working with bivariate data, and with more than two variables the data is multivariate. For example, if we conduct a study that examines the relationship between the speed of a car and the number of gears it has, we are working with bivariate data.
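As a rough sketch of the difference, the snippet below summarizes one variable on its own (univariate) and then measures the relationship between two variables (bivariate) with a Pearson correlation coefficient. The numbers are invented, and the `pearson` helper is written out by hand for this example rather than taken from a library.

```python
from math import sqrt
from statistics import mean

# Made-up measurements for five cars, for illustration only.
gears = [4, 5, 5, 6, 6]
top_speed_kmh = [150, 170, 180, 200, 210]

# Univariate: describe a single variable by itself.
average_speed = mean(top_speed_kmh)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


# Bivariate: describe how two variables move together.
r = pearson(gears, top_speed_kmh)

print(average_speed)  # 182
print(round(r, 2))
```

A value of `r` close to 1 would suggest that, in this invented sample, cars with more gears tend to have higher top speeds.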

8. Why should you care about the type of variable you’re dealing with?

It is important to identify and understand the type of each variable in a study, because variables are the basic units of information. For this reason, scientists carefully analyze and interpret every variable and its values to make sense of how things relate to each other in a descriptive study or an experiment. The type of variable determines which processing technique and statistical analysis you choose, how you design your study, which tests you select, and how you interpret the results. Take the visual presentation of data as an example. If we analyze a single variable (univariate analysis), we can use a bar plot or a histogram, but if we analyze several variables (multivariate analysis), those plots are no longer appropriate. Instead, for multivariate analysis we use scatter plots, contour plots, or multi-dimensional plots.
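To make this concrete, here is a minimal sketch of picking a summary statistic based on the type of variable. The mapping used (mode for nominal, median for ordinal, mean for quantitative) is one common convention rather than a rule stated in this article, and the `summarize` helper and all sample values are hypothetical.

```python
from statistics import mean, median_low, mode


def summarize(values, variable_type):
    """Return a typical value using a statistic that suits the variable type."""
    if variable_type == "nominal":
        # No order and no arithmetic: the most frequent category.
        return mode(values)
    if variable_type == "ordinal":
        # Order but no fixed distances: the middle category.
        # median_low always returns a value that occurs in the data.
        return median_low(values)
    if variable_type == "quantitative":
        # Numeric values with fixed units: the arithmetic mean.
        return mean(values)
    raise ValueError(f"unknown variable type: {variable_type}")


body_types = ["sedan", "suv", "sedan", "hatchback"]  # nominal
design_ratings = [3, 5, 2, 3]                        # ordinal codes, 1 to 5
prices_eur = [21000, 25000, 32000]                   # quantitative

print(summarize(body_types, "nominal"))       # sedan
print(summarize(design_ratings, "ordinal"))   # 3
print(summarize(prices_eur, "quantitative"))  # 26000
```

Taking the mean of the body types would not even run, and taking the mean of the rating codes would run but assume equal distances between categories, which is the kind of blind rule-following the type of variable is meant to prevent.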