How to Identify Data Types

Fundamentals Series

Myrnelle Jover
Decision Data

--

Understanding the data at hand is key to any data science endeavour. Every variable in your dataset has a type, but mistyped variables are not uncommon: data types differ between languages and can become quite granular. In this article, we will look at the primary data types in mathematics, R, Python, and SQL.

Data types in mathematics

In mathematics and statistics, we split data according to whether they are quantitative or qualitative. Quantitative data allows inference to be drawn using numerical methods, whereas qualitative data only allows inference through comparison.

Quantitative data

In general, quantitative data refers to numerical information. We can further categorise data according to whether they are continuous or discrete.

Discrete variables present as specific points on a number line in one dimension, whereas continuous variables present as any number within a specified range.

Examples of discrete variables:

  • A person’s age in years
  • The number of cars on a road
  • The difference in scores between teams in a sports match

Examples of continuous variables:

  • A person’s height
  • The length of a car
  • The date and time of an event

Fig 1: Discrete values present as specific points on a number line.
Fig 2: Continuous values can be any number within a specified range.
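As a minimal sketch of the distinction, the Python snippet below records a person's age both ways. The measurements are made up for illustration; the point is only that rounding a continuous quantity down to whole years turns it into a discrete one.

```python
import math

# Hypothetical measurements for one person (values invented for illustration).
exact_age_years = 29.75  # continuous: any value within a range is possible
height_cm = 171.3        # continuous

# Recording age "in years" keeps only whole numbers, which makes it discrete.
age_in_years = math.floor(exact_age_years)

print(type(height_cm).__name__, height_cm)        # float 171.3
print(type(age_in_years).__name__, age_in_years)  # int 29
```

In Python, the discrete value ends up as an int and the continuous one as a float, which mirrors the split described above.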

In two or more dimensions, discrete functions present as disconnected points, whereas the graphs of continuous functions are connected. Additionally, a continuous function is called smooth if it also has a continuous derivative; otherwise, it is called non-smooth.

Fig 3: The discrete function consists of disconnected data points.
Fig 4: Continuous functions consist of connected data points. This graph is an example of a non-smooth continuous function.
Fig 5: By definition, all smooth functions are also continuous functions. In this example, the data points are implicit as they can occur anywhere on the function line.
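One way to see the difference between smooth and non-smooth functions is to compare the slope just to the left and just to the right of a point. The sketch below does this numerically for two example functions chosen for illustration (they are not the functions plotted in the figures): the absolute value, which has a kink at zero, and a parabola, which does not. The helper name one_sided_slopes and the step size h are assumptions of this sketch.

```python
def one_sided_slopes(f, x, h=1e-6):
    """Return the (left, right) finite-difference slopes of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right


def kink(x):
    # Continuous everywhere, but the derivative jumps at x = 0.
    return abs(x)


def parabola(x):
    # Continuous with a continuous derivative.
    return x * x


kink_left, kink_right = one_sided_slopes(kink, 0.0)
para_left, para_right = one_sided_slopes(parabola, 0.0)

print("kink:    ", kink_left, kink_right)  # slopes of about -1 and +1
print("parabola:", para_left, para_right)  # both slopes close to 0
```

The two slopes disagree for the kinked function and agree (up to the step size) for the parabola, which is the numerical signature of a non-smooth versus a smooth point.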

In mathematics, there are standard number sets that can be represented as a hierarchy of quantitative variables:

Discrete number sets

  • ℕ: The set of natural numbers {1, 2, 3, …}
  • ℤ: The set of integers {…, -3, -2, -1, 0, 1, 2, 3, …}

Continuous number sets

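  • ℚ: The set of rational (quotient/fraction) numbers. Strictly speaking, ℚ is countable, but it is dense on the number line (between any two rationals lies another), so it is grouped with the continuous sets here.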
  • ℝ: The set of real numbers

Fig 6: A Venn diagram representing the hierarchy of some standard number sets used in mathematics.
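Python's standard library happens to model a similar hierarchy in its numbers module, so the nesting of ℤ inside ℚ inside ℝ can be checked directly. This is a sketch rather than a full mapping: there is no built-in type for ℕ, and the helper name memberships is hypothetical.

```python
from fractions import Fraction
import numbers


def memberships(value):
    """Report which abstract number sets Python places a value in."""
    return {
        "Z": isinstance(value, numbers.Integral),
        "Q": isinstance(value, numbers.Rational),
        "R": isinstance(value, numbers.Real),
    }


for value in (3, Fraction(1, 3), 0.5):
    print(repr(value), memberships(value))

# An int belongs to Z, Q and R; a Fraction belongs to Q and R only;
# a float is registered as R only.
```

Note that Python registers a float such as 0.5 as a real number only, even though that particular value is also rational in the mathematical sense.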

Qualitative data

Qualitative data refers to data that allows for interpretation, such as audio, visual or textual material, or the Likert scale responses that one might encounter in questionnaires and self-report surveys.

If the data consists of named categories, then it is nominal; if it has an inherent order, it is ordinal, regardless of whether it is textual or numerical.

  • Nominal: From the Latin word nominalis, which means “pertaining to a name”. Includes nouns such as countries, cities, colours, brand names, and so on.
  • Ordinal: From the Latin word ordinalis, meaning “an order or place in a series”. Includes placings such as 1st, 2nd, 3rd, and sets of words which have an inherent hierarchy such as {good, better, best}.

Fig 7: A hierarchy of standard number sets and data types in mathematics.
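The nominal/ordinal distinction can be sketched in Python as follows. Using a set for nominal values and an IntEnum for ordinal values is one possible encoding chosen for this illustration, and the Rating class is hypothetical.

```python
from enum import IntEnum

# Nominal: named categories with no inherent order.
colours = {"red", "green", "blue"}


# Ordinal: categories with an inherent order, encoded here as ranks.
class Rating(IntEnum):
    GOOD = 1
    BETTER = 2
    BEST = 3


responses = [Rating.BEST, Rating.GOOD, Rating.BETTER]
ranked = sorted(responses)  # sorting is meaningful because ranks are ordered

print([r.name for r in ranked])    # ['GOOD', 'BETTER', 'BEST']
print(Rating.GOOD < Rating.BEST)   # True
```

Sorting the colours would only give alphabetical order, which says nothing about the colours themselves; sorting the ratings reflects their actual hierarchy.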

Ordinal variables are closely related to quantitative variables since numbers have an inherent ordering.

A note on dates

Since dates present in various formats (e.g. day, month, year, date, date and time, seconds, seasons, financial quarters), their data type depends on the specific application.
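To illustrate, the sketch below takes a single made-up event and expresses it three ways with Python's datetime module. The event itself and the calendar-quarter formula are assumptions for illustration; a financial quarter would depend on the fiscal calendar in use.

```python
from datetime import datetime, timezone

# A made-up event, fixed to UTC so the result does not depend on the machine.
event = datetime(2021, 3, 15, 14, 30, tzinfo=timezone.utc)

seconds_since_epoch = event.timestamp()  # continuous: a float number of seconds
calendar_year = event.year               # discrete: an integer
quarter = (event.month - 1) // 3 + 1     # discrete and ordered: 1 to 4

print(type(seconds_since_epoch).__name__)  # float
print(calendar_year, quarter)              # 2021 1
```

The same moment in time is continuous as a timestamp and discrete as a year or quarter, which is why the appropriate type depends on the question being asked.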

Data types in R

R is a statistical programming language widely used for data analysis and statistical software development. Therefore, the data types of its variables closely resemble those used in mathematics and statistics.

The tidyverse is a popular collection of data analysis packages for R and influences some of the data types shown below.

Table 1: An example of common data types in R.

Data types in Python

Python is a general-purpose programming language used for web applications, data analysis, information security, software development and artificial intelligence. It also influences other programming languages used for data science, such as Julia and Ruby.

Pandas is a popular data analysis software library for Python and may influence some of the data types shown here.

Table 2: An example of common data types in Python.
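As a quick way to identify types in practice, Python's built-in type() function reports the type of any value. The snippet below is a small sketch using invented sample values and built-in types only; it is not a reproduction of the table, and Pandas adds its own types on top of these.

```python
# Invented sample values, one per common built-in type.
samples = {
    "count": 42,
    "height_m": 1.68,
    "is_member": True,
    "city": "Brisbane",
    "missing": None,
}

type_names = {name: type(value).__name__ for name, value in samples.items()}

for name, type_name in type_names.items():
    print(f"{name}: {type_name}")
# count: int, height_m: float, is_member: bool, city: str, missing: NoneType
```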

Data types in SQL

SQL is a language for building and querying relational databases. It comes in many different flavours, including MySQL, T-SQL (used with MS SQL Server), PostgreSQL, Oracle and SQLite, and the flavour may influence some of the data types shown here.

Table 3: An example of common data types in SQL.
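For a hands-on check, the sketch below uses SQLite through Python's built-in sqlite3 module, since it needs no database server. The person table and its values are hypothetical, and the type names are specific to SQLite; other flavours use different names and offer more granular types.

```python
import sqlite3

# An in-memory database, discarded when the connection closes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (age INTEGER, height REAL, city TEXT)")
conn.execute("INSERT INTO person VALUES (?, ?, ?)", (29, 171.3, "Brisbane"))

# SQLite's typeof() reports the storage type of each stored value.
row = conn.execute(
    "SELECT typeof(age), typeof(height), typeof(city) FROM person"
).fetchone()
conn.close()

print(row)  # ('integer', 'real', 'text')
```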

Summary

We derive data types from mathematical categorisations of data, so understanding the main number sets helps to convert data types between the most common data science languages, as shown in the following table.

Table 4: An example of typical data type conversions between mathematics and common data science languages R, Python and SQL.
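In practice, data often arrives as text, and identifying its type means finding the narrowest type that fits every value. The function below is a minimal sketch of that idea in Python; the name infer_type and the int, float, str ordering are assumptions of this sketch, not a standard routine.

```python
def infer_type(raw_values):
    """Return the narrowest of int, float or str that fits every raw string."""
    for candidate in (int, float):
        try:
            for raw in raw_values:
                candidate(raw)  # raises ValueError if the string does not fit
        except ValueError:
            continue  # try the next, wider type
        return candidate
    return str  # fall back to text when no numeric type fits


print(infer_type(["12", "15", "9"]).__name__)       # int
print(infer_type(["1.5", "2", "3.25"]).__name__)    # float
print(infer_type(["red", "green", "7"]).__name__)   # str
```

The ordering follows the number set hierarchy: every integer can be read as a real number, so the function only widens from int to float to str when a value forces it to.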

--

Myrnelle Jover
Decision Data

I am a data scientist and former mathematics tutor with a passion for reading, writing and teaching others. I am also a hobbyist poet and dog mum to Jujubee.