Data Science: 6 Common Data Types

Written by Benoit Rolland, PhD | 2020

With the success of machine learning, it is easy to forget that data science is a much wider field that is not limited to artificial intelligence. It is a discipline that aims to better understand the world around us by combining mathematics, statistics, data analysis, data visualization, the scientific method and computer science. This endeavor is made possible by an intangible but omnipresent resource: data. Before starting any project, it is crucial to understand the difference between the following data types: numerical, categorical, continuous, discrete, nominal and ordinal.

This knowledge is key to fully grasping the statistical nature of the available data and to properly handling any given feature. Despite its simplicity, this step is essential to a robust and meaningful data analysis. Data types usually dictate which imputation strategies, statistical measurements, plot designs and algorithms are the most appropriate to use. Being comfortable with these properties is thus one of the most valuable tools for a data scientist.
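To illustrate how the data type can drive the choice of imputation strategy, here is a minimal Python sketch. The function name impute and the "numerical"/"categorical" labels are hypothetical, and the choice of the median and the mode is one common convention assumed for the example, not the only valid one.

```python
from statistics import median, mode


def impute(values, kind):
    """Fill None entries using a strategy chosen by data type.

    kind is a hypothetical label: "numerical" fills with the median,
    "categorical" fills with the most frequent value (the mode).
    """
    observed = [v for v in values if v is not None]
    if kind == "numerical":
        fill = median(observed)
    elif kind == "categorical":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    return [fill if v is None else v for v in values]


prices = [10.0, None, 14.0, 12.0]
colors = ["red", "blue", None, "red"]

print(impute(prices, "numerical"))    # [10.0, 12.0, 14.0, 12.0]
print(impute(colors, "categorical"))  # ['red', 'blue', 'red', 'red']
```

Averaging the color labels would be meaningless, which is why the categorical branch falls back on the most frequent value.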

Common Data Types

1. Numerical Data

Numerical, or quantitative, variables describe phenomena or properties that can be measured or counted, such as distance, temperature, price, duration or weight. This data type is always represented by a number. The category is quite broad and encompasses various subgroups, two of which are discussed below.

a) Discrete Data

Discrete features form a subgroup of numerical variables. They arise from characteristics that can be counted, but not measured. By nature, such data only takes integer values and can represent, for example, age in completed years, the number of people in a room or the number of tails obtained after 50 coin tosses.
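The coin toss example can be sketched in a few lines of Python. The function name count_tails and the seed parameter are hypothetical and only serve to make the simulation reproducible; the point is that the result is always a whole number between 0 and the number of tosses.

```python
import random


def count_tails(n_tosses, seed=None):
    """Simulate n_tosses fair coin tosses and count the tails."""
    rng = random.Random(seed)
    # Each toss is tails with probability 0.5; summing booleans yields an int.
    return sum(rng.random() < 0.5 for _ in range(n_tosses))


tails = count_tails(50, seed=0)
print(tails, type(tails).__name__)
```

A value such as 24.5 tails can never occur, which is what separates a count from a measurement.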

b) Continuous Data

The second subgroup of numerical variables is the mirror image of the discrete data type. Continuous features describe properties that can be measured, but not counted. By definition, they can take any value within a given interval and can characterize, for example, height, speed, length or electric current.
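Because a continuous variable can take any value in an interval, exact values rarely repeat, and such data is often summarized by counting observations per interval. The sketch below assumes equal-width bins; the function name bin_counts and the sample heights are made up for the illustration.

```python
def bin_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # Clamp the index so that v == high lands in the last bin.
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical heights in centimeters.
heights_cm = [158.2, 171.5, 171.9, 180.0, 165.3, 176.4]
print(bin_counts(heights_cm, low=150.0, high=190.0, n_bins=4))  # [1, 1, 3, 1]
```

This is the same idea that underlies a histogram, a plot design typically reserved for continuous features.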

2. Categorical Data

Categorical, or qualitative, data encompasses any variable that describes a given characteristic or quality. It can represent the color of an object, a plant or animal species, marital status, a favorite sport or music band, etc. Such variables are usually represented by text, but they are often encoded with numbers (0: no, 1: yes) so that they can be processed by an algorithm. Note that in this context, the numbers refer to categories and do not have any mathematical meaning. Like their numerical counterparts, categorical variables come in subtypes, such as nominal and ordinal.
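One way to encode categories numerically without suggesting a mathematical relationship between them is one-hot encoding, where each category gets its own 0/1 column. This is a minimal sketch of that technique; the function name one_hot and the alphabetical column order are choices made for the example.

```python
def one_hot(values):
    """Encode each label as a 0/1 vector with one position per category."""
    categories = sorted(set(values))  # alphabetical order, chosen arbitrarily
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


sports = ["tennis", "hockey", "tennis", "soccer"]
categories, encoded = one_hot(sports)
print(categories)  # ['hockey', 'soccer', 'tennis']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Had the labels been replaced by 0, 1 and 2 instead, an algorithm could wrongly read "tennis" as twice "soccer"; the separate columns avoid that implication.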

a) Nominal Data

Nominal data form a subgroup of categorical variables. This type of feature can be interpreted as a set of simple labels with no specific order, meaning that rearranging the values does not change their meaning. Marital status, native language and the name of an animal species are good examples of nominal variables.

b) Ordinal Data

Ordinal data are the opposite of nominal data. This data type describes labels or characteristics for which the order is crucial. Examples include education level, ranking in a competition, the perceived spiciness of hot peppers or the rated quality of a purchased product. Ordinal data thus carries additional meaning encoded in its hierarchy, even though the distance between two consecutive levels is not necessarily constant.
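Since the order carries the information, an ordinal variable can be mapped to integer ranks, and rank-based statistics such as the median become meaningful. The sketch below assumes a hypothetical five-level education scale; the names EDUCATION_ORDER, ordinal_encode and ordinal_median are illustrative.

```python
from statistics import median_low

# Hypothetical education scale, listed from lowest to highest level.
EDUCATION_ORDER = ["primary", "secondary", "bachelor", "master", "doctorate"]
RANK = {level: i for i, level in enumerate(EDUCATION_ORDER)}


def ordinal_encode(values):
    """Map each label to its position in the ordered scale."""
    return [RANK[v] for v in values]


def ordinal_median(values):
    """Return the median label; median_low always picks an observed rank."""
    return EDUCATION_ORDER[median_low(ordinal_encode(values))]


sample = ["master", "primary", "bachelor", "secondary", "bachelor"]
print(ordinal_encode(sample))  # [3, 0, 2, 1, 2]
print(ordinal_median(sample))  # bachelor
```

The ranks only express order: the gap between "primary" and "secondary" is not claimed to equal the gap between "master" and "doctorate", so a mean of the ranks should be interpreted with caution.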
