Glossary of data terminology

Myrnelle Jover

Published in Decision Data, Mar 16, 2021

An accurate depiction of your constantly evolving data science vocabulary.

A

Algorithm: Refers to a step-by-step process to solve a particular problem.

Analytics: The process of investigating, cleaning and transforming data to discover valuable information. Also called data analysis.

Arc: An ordered pair of vertices in a directed graph, represented by an arrow. See directed graph, vertex.

Artificial Intelligence (AI): The study of machine intelligence, or the ability of a machine to learn from data.

B

Bivariate: Refers to two random variables, most commonly the interaction between them. Derived from the Latin prefix bi- meaning “two” and Latin word variatus meaning “an element likely to change”. See univariate and multivariate.

Boolean data: A quantitative data type that can only take on two values. This is a subtype of discrete data. Examples: TRUE/FALSE, 1/0 and success/failure.

C

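Category: A collection of objects together with the morphisms (arrows) between them, where morphisms can be composed and every object has an identity morphism. Often drawn as a directed graph. See directed graph, morphism.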

Category theory: The study of categories in mathematics.

Central tendency: A central (or typical) value in a distribution. Includes mean, median and mode.

Chart: A visual representation of data. Also called a graph.

Cherry-picking: The act of suppressing evidence, or showcasing incomplete evidence, to support a particular position while ignoring a significant portion of related and similar data that contradicts that position.

Classification: Techniques that use labelled data to predict a categorical outcome, such as separating a collection of objects into defined classes. These classes can be binary or multi-class.

Coefficients: Coefficients multiply the predictor variables in a regression model. Each coefficient has a sign (+/-) and magnitude (number) indicating the respective direction and strength of the relationship between a predictor and the response. The coefficient represents the mean change in the response for each unit change in the predictor, holding all other predictor variables constant.
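To make the idea of a coefficient concrete, here is a minimal sketch of a least-squares fit with a single predictor. The function name fit_simple_linear_regression and the hours/scores data are made up for illustration; with several predictors you would normally reach for a statistics library instead.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of the predictor.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: hours studied (predictor) and test score (response).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

intercept, slope = fit_simple_linear_regression(hours, scores)
print(intercept, slope)
```

Here the slope is positive (direction) and equal to 2 (magnitude): each extra hour is associated with a mean increase of two points in the score.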

Continuous data: A quantitative data type that can take on any value within a specified range.

Continuous function: A function that consists of a series of connected data points when graphed.

D

Dashboard: An information management tool that shows business metrics using data visualisation.

Data analysis: The process of investigating, cleaning and transforming data to discover valuable information. Also called analytics.

Data art: An art form that incorporates data.

Data mining: The process of discovering patterns and insights in large data sets.

Data science: An interdisciplinary field that uses scientific methods to extract knowledge and insights from data using algorithms. A field that unifies mathematics, statistics, computer science and domain knowledge.

Data set: A collection of data.

Data-driven decision making (DDDM): The process of making decisions from data rather than intuition, usually at the organisational level.

Database: An organised collection of data.

Data visualisation: A branch of data science concerning the display of data in graphical format.

Density estimation: A machine learning technique that estimates the underlying probability distribution of a variable from the observed data points.

Dependent variable: A variable whose value depends on another variable. Also called response variable or target variable.

Dimensionality: The number of variables in a data set. See dimensionality reduction.

Dimensionality reduction: A machine learning technique that simplifies higher-dimensional unlabelled data into lower-dimensional forms while retaining most of the information. See dimensionality.

Directed graph: A graph with a set of objects (called vertices) that are connected together through arrows (called arcs). See graph, vertex, arc.
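One common way to hold a directed graph in code is an adjacency mapping from each vertex to the vertices its arcs point to. The sketch below assumes three made-up vertices A, B and C; the name successors is only illustrative.

```python
# Each arc is an ordered pair (tail, head): the arrow points from tail to head.
arcs = [("A", "B"), ("B", "C"), ("A", "C")]

# Build a mapping from every vertex to the list of vertices it points to.
successors = {}
for tail, head in arcs:
    successors.setdefault(tail, []).append(head)
    successors.setdefault(head, [])  # make sure vertices with no outgoing arcs appear

print(successors)
```

Because arcs are ordered pairs, ("A", "B") and ("B", "A") would be two different arcs.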

Discrete data: A quantitative data type that can only take on specific, separate values within a specified range.

Discrete function: A function that consists of disconnected data points when graphed.

Distribution: A function that shows the space of possible values for a variable and their frequency.

Dot product: A binary operation that takes two vectors and returns a scalar value. The dot product of two vectors is the product of their magnitudes and the cosine of the angle between them. See magnitude.
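The two descriptions of the dot product (sum of component-wise products, and product of magnitudes times the cosine of the angle) give the same number. A small sketch, assuming two hand-picked 2D vectors that are 45 degrees apart; the helper names dot and magnitude are illustrative.

```python
import math


def dot(u, v):
    """Sum of the products of matching components."""
    return sum(a * b for a, b in zip(u, v))


def magnitude(v):
    """Length of a vector: square root of the sum of squared components."""
    return math.sqrt(sum(a * a for a in v))


u = [3.0, 0.0]  # points along the x-axis
v = [2.0, 2.0]  # points 45 degrees above the x-axis

algebraic = dot(u, v)
geometric = magnitude(u) * magnitude(v) * math.cos(math.pi / 4)

print(algebraic, geometric)
```

The two results agree up to floating-point rounding.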

E

Edge: An unordered pair of vertices, represented by a line in a graph. See graph.

Error component: The error component of an observed value is its deviation from the value predicted by the model. In other words, the part of the observation that the model does not explain. Also called residual.

Exploratory data analysis: Initial analysis performed on a dataset; includes summary statistics and data visualisation methods.

F

Feature engineering: The process of extracting features (variables) from raw data.

Frequency: The number of times an event occurs.

Fit: How well a model captures the patterns in the data it was trained on and, in turn, how well it generalises to unseen data. See model.

Function: A rule that relates each input to exactly one output.

G

Graph: (Data visualisation) A visual representation of data. (Graph theory) A structure showing the relationships between vertices through the edges joining them. See vertex, edge.

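Group: A set with an operation that combines any two elements to form another element of the set, where the operation is associative, the set contains an identity element and every element has an inverse.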

Group theory: The study of groups in abstract algebra.

H

Hypothesis: A supposition requiring further analysis.

I

Independent variable: A variable whose value does not depend on other variables. Also called predictor variable or input variable. See predictor variable.

Infographic: A visual narrative that combines multiple graphics to convey information.

Information Age: The period beginning in the 1970s marking rapid technological changes.

Input data: Data that is fed into a machine learning algorithm. Includes predictor variables. May include response variables in the case of supervised or semi-supervised learning. See algorithm.

Integer data: A quantitative data type that can only take whole number values, which are either positive, negative or zero. This is a subtype of discrete data. See discrete data.

Intelligent agent: An autonomous entity that uses observation and/or sensors to learn from its environment. See reinforcement learning.

L

Labelled data: Datasets containing a response variable. See supervised learning.

Linear regression: A regression technique that predicts (continuous) numerical outcomes on data with a linear relationship between the predictor variables and the response variable.

Logistic regression: A generalised linear model for data with a linear relationship between the predictor variables and the log-odds of the response variable. It predicts a continuous probability between 0 and 1, which can be thresholded into a class. For this reason, logistic regression is both a regression and classification technique.
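A minimal sketch of how a fitted logistic regression turns a predictor into a probability and then a class. The intercept and slope below are hypothetical values chosen for illustration, not the result of fitting real data, and the 0.5 threshold is an assumption.

```python
import math


def sigmoid(z):
    """Map log-odds (any real number) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, intercept, slope):
    # The log-odds are a linear function of the predictor.
    log_odds = intercept + slope * x
    return sigmoid(log_odds)


def predict_class(x, intercept, slope, threshold=0.5):
    # Thresholding the probability turns the regression into a classifier.
    return 1 if predict_probability(x, intercept, slope) >= threshold else 0


# Hypothetical coefficients, for illustration only.
intercept, slope = -4.0, 1.0

print(predict_probability(4.0, intercept, slope))  # log-odds of 0
print(predict_class(2.0, intercept, slope), predict_class(6.0, intercept, slope))
```

The continuous output is the probability; the class outcome only appears once a threshold is applied.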

M

Machine learning: A branch of artificial intelligence (AI) referring to the collection of algorithms that use data to discover functional relationships between input variables and target variables. See algorithm.

Magnitude: The magnitude of a vector is a scalar quantity that represents the length or size of the vector. The magnitude of a vector is calculated as the square root of the sum of the squares of its components.
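The calculation described above fits in a few lines. This sketch uses two made-up vectors whose lengths come out as whole numbers; the function name magnitude is illustrative.

```python
import math


def magnitude(v):
    """Square root of the sum of the squares of the components."""
    return math.sqrt(sum(component ** 2 for component in v))


print(magnitude([3, 4]))     # sqrt(9 + 16)
print(magnitude([1, 2, 2]))  # sqrt(1 + 4 + 4)
```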

Map: (Data visualisation) A visual representation of geographic information. (Mathematics) A function.

Mean: The average value in a set of numerical data. It is calculated by summing the data and dividing it by the number of values.

Median: The middle value in a set of numerical data, found by first arranging the values in ascending order. If the number of values is even, the median is the mean of the two middle values.

Mode: The most popular (frequent) value in a data set. Derived from the French phrase à la mode meaning “fashionable”.
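The mean, median and mode can all be computed with Python's built-in statistics module. A small sketch on a made-up list of values:

```python
from statistics import mean, median, mode

values = [2, 3, 3, 5, 7, 10]  # made-up data, already in ascending order

print(mean(values))    # sum of the values divided by how many there are
print(median(values))  # even count, so the mean of the two middle values (3 and 5)
print(mode(values))    # the most frequent value
```

On this list the three measures of central tendency differ: the mean is 5, the median is 4 and the mode is 3.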

Model: A machine learning algorithm that has been trained on data.

Monotonically increasing: A function that always increases or remains constant as the independent variable increases.
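For a function sampled at increasing inputs, the definition can be checked by comparing each output with the next. A minimal sketch; the helper name is_monotonically_increasing is illustrative.

```python
def is_monotonically_increasing(values):
    """True if each value is greater than or equal to the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


print(is_monotonically_increasing([1, 2, 2, 5]))  # flat stretches are allowed
print(is_monotonically_increasing([1, 3, 2]))     # the drop from 3 to 2 breaks it
```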

Morphism: A structure-preserving map from one mathematical object to another.

Multiple linear regression: Linear regression with multiple predictor variables. See linear regression, simple linear regression.

Multivariate: Refers to two or more random variables, most commonly the interaction between them. Derived from the Latin prefix multi- meaning “many” and Latin word variatus meaning “an element likely to change”. See univariate and bivariate.

Map visualisation: A type of data visualisation using geographic data.

N

Node: An object in a graph, also called a vertex. See graph.

Nominal data: Qualitative data made up of names or labels with no inherent order. From the Latin word nominalis, which means “pertaining to a name”. Includes nouns such as countries, cities, colours, brand names, and so on.

Non-smooth function: A non-differentiable function.

Null hypothesis: A default hypothesis that there is no significant association or difference between variables of interest. See hypothesis.

O

Ordinal data: Data whose values have an inherent order. From the Latin word ordinalis, meaning “an order or place in a series”. Includes placings such as 1st, 2nd and 3rd, and sets of words that have an inherent hierarchy such as {good, better, best}.

P

Predictor variable: A variable fed into a machine learning algorithm to predict the target variable. Also called independent variable or input variable. See algorithm, response variable.

Probability: The likelihood of an event occurring, expressed as a value between 0 and 1.

Probability distribution: A function that assigns a probability to each possible outcome of a random variable.

Probability theory: The branch of mathematics concerned with probability.

Python: Python is a general-purpose programming language used for web applications, data analysis, information security, software development and artificial intelligence. It also influences other programming languages used for data science, such as Julia and Ruby.

Q

Quantitative data: Type of data whose magnitude can be measured objectively. Also called numerical data.

Qualitative data: Type of data with magnitude measured subjectively. Usually collected from first-hand information, such as surveys, questionnaires and interviews.

R

R: A statistical programming language widely used for data analysis and statistical software development.

Regression: A machine learning technique that uses labelled data to predict a numerical outcome (except for the special case of logistic regression).

Reinforcement learning: A type of machine learning problem where an intelligent agent learns by trial and error using feedback from an interactive environment. The agent is rewarded for behaviour that moves it towards its goal and punished for behaviour that does not, and it learns to maximise its cumulative reward. See intelligent agent.

Relational database: A database containing tables that are related to one another. See SQL.

Residual: The residual of an observed value is its deviation from the value predicted by the model. In other words, the part of the observation that the model does not explain. Also called error component.
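A residual is simply the observed value minus the predicted value. The sketch below assumes a hypothetical, already-fitted model y = 1 + 2x and three made-up observations.

```python
def predict(x):
    """Hypothetical fitted model, for illustration only: y = 1 + 2x."""
    return 1.0 + 2.0 * x


x_values = [1.0, 2.0, 3.0]
observed = [3.0, 5.0, 8.0]

predicted = [predict(x) for x in x_values]
residuals = [obs - pred for obs, pred in zip(observed, predicted)]

print(residuals)
```

The first two observations sit exactly on the line, so their residuals are zero; the third lies one unit above it.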

Response variable: A variable that needs to be predicted. Also called dependent variable or target variable. See predictor variable.

S

Scientific method: A procedure used for discovering knowledge. The procedure consists of the following steps:

Formulate a question
Speculate, or construct a hypothesis
Make a prediction
Collect data
Test the results

Set: A collection of distinct objects.

Set theory: The study of sets in mathematics.

Simple linear regression: Linear regression with only one predictor variable. See linear regression, multiple linear regression.

Smooth function: A function that is differentiable everywhere. In stricter usage, a function that has derivatives of all orders.

Space: In mathematics, a set with some added structure.

SQL: SQL stands for Structured Query Language and is a language used for storing and manipulating data in relational databases. See relational database.
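To show SQL storing and querying related tables, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The customers and orders tables and their contents are made up for illustration.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")

# Two related tables: each order refers to a customer by id.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "total REAL)"
)

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5)],
)

# Join the tables and total the orders for each customer.
rows = conn.execute(
    "SELECT c.name, SUM(o.total) "
    "FROM customers c JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
conn.close()

print(rows)
```

The JOIN is what makes use of the relationship between the two tables.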

Supervised learning: A type of machine learning problem that uses labelled data as input to train algorithms to predict or classify outcomes. See labelled data.

T

Target variable: A variable that needs to be predicted. Also called dependent variable or response variable. See predictor variable.

Temporal: Relating to time.

Test set: A portion of a data set used to test the fit of a machine learning algorithm. See algorithm, machine learning, model, training set, validation set.

Training set: A portion of a data set used to train a machine learning algorithm, that is, to fit the algorithm’s parameters to the data. See algorithm, machine learning, model, test set, validation set.
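One simple way to carve a data set into training, validation and test sets is to shuffle it and slice it. The sketch below assumes a 60/20/20 split and a hypothetical helper named split_dataset; the proportions are a common choice rather than a rule.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the rows, then slice them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left over
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))
```

Every row lands in exactly one of the three sets, so the test set stays unseen during training and tuning.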

Trend: The direction of change in data.

Truncated graph: A graph that has a y-axis that does not start at zero.

U

Univariate: Refers to a single random variable. Derived from the Latin prefix uni- meaning “one” and Latin word variatus meaning “an element likely to change”. See bivariate and multivariate.

Unlabelled data: Datasets that do not contain a response variable. See unsupervised learning.

Unsupervised learning: A type of machine learning problem that uses unlabelled data as input to train algorithms to discover patterns. See unlabelled data.

V

Validation set: A portion of a data set used to tune the hyperparameters of a machine learning algorithm. See algorithm, machine learning, model, training set, test set.

Variable: An expression for a quantity that may change. See univariate, bivariate and multivariate.

Variance: The average of the squared differences between the observed response values and the mean of the response variable. The variance tells us the spread of the observed response values around their mean.
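A small sketch of the calculation on made-up values: take the squared differences from the mean, sum them, then divide by the number of values (population variance) or by one fewer (sample variance). Python's statistics module offers both as pvariance and variance.

```python
from statistics import mean, pvariance, variance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

# Manual calculation: average squared difference from the mean.
centre = mean(values)
squared_diffs = [(v - centre) ** 2 for v in values]
manual_population_variance = sum(squared_diffs) / len(values)

print(manual_population_variance)
print(pvariance(values))  # population variance: divides by n
print(variance(values))   # sample variance: divides by n - 1
```

The sample variance is slightly larger because it divides the same sum by 7 instead of 8.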

Vertex: An object in a graph, also called a node. See graph.
