How to Choose the Right Chart: A Practical Example

DataViz Series

Myrnelle Jover
Decision Data

Determining where the name of a common chart stems from, I leave as an exercise for the reader.

Systemising data visualisation

Let’s be data visualisation critics for a moment. Can you guess what’s wrong with the following chart? How would you fix it?

Fig 1: Graphic from Content Marketing Institute.

When I see this, I wonder why the designer chose a line chart. The categories aren’t ordinal, so if we shifted the departments around, the chart would look very different. The designer was probably trying to compare travel expenses between departments; in this case, a bar chart with departments ordered by dollar amount would be simple and effective. Explicitly stating the values would also help, as it is not entirely clear whether a value of $2,000 means two thousand or two million dollars.
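As a rough sketch of that fix, the snippet below orders departments by dollar amount and states each value explicitly. The department names and amounts are hypothetical placeholders rather than values read from the graphic, and the text bars simply stand in for a real bar chart.

```python
# Hypothetical travel expenses per department, in dollars (placeholder values,
# not taken from the original graphic).
expenses = {"Sales": 2000, "Marketing": 1200, "Engineering": 800, "Finance": 1500}


def order_for_bar_chart(values):
    """Return (category, value) pairs sorted from largest to smallest value."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)


def text_bar_chart(pairs, width=40):
    """Render each pair as a labelled text bar with its value written out."""
    largest = max(value for _, value in pairs)
    lines = []
    for name, value in pairs:
        bar = "#" * round(width * value / largest)
        lines.append(f"{name:<12} {bar} ${value:,}")
    return lines


ordered = order_for_bar_chart(expenses)
for line in text_bar_chart(ordered):
    print(line)
```

Because the bars are sorted by value, reordering the input no longer changes how the chart reads, which is the problem the line chart had.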

I have seen senior analysts produce charts similar to the one above. Beginners aren’t the only ones who get things a little muddled when it comes to data visualisation — we’re all human, and we all have our off days. That’s why I think it’s essential to have a process to fall back on.

In this article, we will recap the data visualisation process and implement it in three rounds. The first round explores whether a relationship exists between health and income using data from the Gapminder dataset. The second round extends this by questioning whether a country’s population is a factor that affects health and income. The third round inspects whether the continent adds another layer of insight into the relationships between health, income and population.

Recap of the data visualisation process

We begin by following the data visualisation process defined in the previous article.

Fig 2: The data visualisation process.

Round 1

Is there a relationship between health and income?

Question: Is there a relationship between health and income for all countries in 2007?

Hypothesis: The null hypothesis is that there is no relationship between the two variables, health and income.

Predict: If there is no relationship between health and income, there will be no distinguishable pattern within the chart. If there is a relationship between these variables, we predict it will be a direct one: life expectancy will increase as GDP per Capita increases.

Combine: Take a minute to determine the data types of these variables from the Gapminder dataset. I’m going to assume you have an understanding of data types. If not, please refer to this post from the fundamentals series. The answers are below the following image.

Table 1: Example data from the Gapminder dataset.

Answers

  • Nominal categorical variables: Country, continent
  • Discrete-time variable: Year
  • Continuous numerical variables: Life expectancy, GDP per Capita
  • Discrete numerical variable: Population

Possible charts

We are only interested in the health and income variables, Life Expectancy and GDP per Capita. Since these are both continuous numerical variables, we can produce the following charts from the recommendations in the previous post:

  • Scatterplot
  • Correlogram
  • Boxplot

Table 2: Examples of bivariate data visualisations.

Filter: From the table of bivariate visualisation purposes in the previous post, we see that out of the three recommended charts, the following are most appropriate for determining the existence of a relationship:

  • Scatterplot
  • Correlogram

Table 3: Viable visualisation options depend not only on compatible data types but also on the intended message.
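The Combine and Filter steps can be thought of as two lookups, sketched below. This is only an illustration: the lookup tables are hypothetical and contain just the entries discussed in this round, not the full recommendation tables from the previous post.

```python
# Data types from the answers above (keys follow the article's labels).
DATA_TYPES = {
    "Country": "nominal categorical",
    "Continent": "nominal categorical",
    "Year": "discrete time",
    "Life expectancy": "continuous numerical",
    "GDP per Capita": "continuous numerical",
    "Population": "discrete numerical",
}

# Partial lookup: only the type combination used in this round is filled in.
CHARTS_BY_TYPE_PAIR = {
    ("continuous numerical", "continuous numerical"): [
        "scatterplot",
        "correlogram",
        "boxplot",
    ],
}

# Partial lookup: charts this round treats as suitable for showing a relationship.
CHARTS_BY_PURPOSE = {
    "relationship": {"scatterplot", "correlogram"},
}


def combine(var_a, var_b):
    """Combine step: candidate charts for two variables, based on their data types."""
    pair = tuple(sorted((DATA_TYPES[var_a], DATA_TYPES[var_b])))
    return CHARTS_BY_TYPE_PAIR.get(pair, [])


def filter_charts(candidates, purpose):
    """Filter step: keep only the candidates that suit the intended message."""
    allowed = CHARTS_BY_PURPOSE.get(purpose, set())
    return [chart for chart in candidates if chart in allowed]


candidates = combine("Life expectancy", "GDP per Capita")
shortlist = filter_charts(candidates, "relationship")
print(candidates)
print(shortlist)
```

Keeping the two lookups separate mirrors the process: data types decide what is possible, and the intended message decides what is appropriate.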

However, as we see below, a correlogram adds little here: with only two variables, it shows a single piece of information, the correlation between life expectancy and GDP per Capita. Thus, we move forward with the scatterplot.

Fig 3: Correlogram example.
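To make the point concrete, here is a minimal sketch of the one number a two-variable correlogram would display, assuming it shows Pearson's correlation coefficient. The sample values are hypothetical placeholders, not rows from the Gapminder dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder values for a handful of countries.
gdp_per_capita = [1000, 5000, 10000, 30000]
life_expectancy = [50, 62, 70, 79]

r = pearson_r(gdp_per_capita, life_expectancy)
print(f"Correlation between the two variables: {r:.2f}")
```

A single coefficient summarises the strength of a linear relationship but hides its shape, which is why the scatterplot is the more informative choice here.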

Test: Does the following scatterplot persuade us to retain the null hypothesis that there is no relationship between health and income?

Fig 4: Scatterplot example.

Round 2

Is there a relationship between health, income and population?

From the graph above, we are curious about whether a nation’s population affects its health and income.

Question: Is there a relationship between health, income and population for all countries in 2007?

Hypothesis: The null hypothesis is that there is no relationship between the three variables: health, income and population.

Predict: If there is no relationship between these variables, there will be no distinguishable pattern within the chart. If there is a relationship, we predict that the population will decrease as life expectancy and GDP per Capita increase.

Combine: The data types of the variables are:

  • Continuous numerical variables: life expectancy, GDP per Capita
  • Discrete numerical variable: population

We will build on the scatterplot and add a plotting element for the additional variable, population. From the recommendations in the previous post, we can choose from the following elements:

  • Sequential colour palette
  • Datapoint size

Table 4: Summary of suitable plotting elements for additional dimensions in a 2D chart.

Filter: The following are examples that incorporate the plotting elements above. It isn't easy to ascertain the population using the colour palette, so we will use the datapoint size.

Fig 5: Sequential colour palette example.

Test: Does the following scatterplot persuade us to retain the null hypothesis that there is no relationship between population and health or population and income?

Fig 6: Datapoint size example.
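One way to encode population as datapoint size is to make each marker's area proportional to the population, sketched below. The function name, the `max_area` parameter and the population figures are hypothetical choices for illustration, not the method used to produce the figure above.

```python
def marker_areas(populations, max_area=1000.0):
    """Scale populations to marker areas so that area is proportional to population.

    The largest population is drawn with `max_area`; every other marker is
    scaled down linearly relative to it.
    """
    largest = max(populations)
    if largest <= 0:
        raise ValueError("populations must contain at least one positive value")
    return [max_area * population / largest for population in populations]


# Hypothetical populations for three countries (placeholder values).
populations = [1_000_000, 4_000_000, 16_000_000]
areas = marker_areas(populations)
print(areas)
```

With matplotlib, a list like `areas` could be passed as the `s` argument of `scatter`, which interprets marker size as an area in points squared. Scaling the area rather than the radius keeps a country with four times the population from looking sixteen times as large.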

Round 3

Does the continent affect health, income and population?

It would probably be beneficial to use continents to understand whether environmental, social or economic conditions provide context for the other chosen variables.

Question: Does the continent affect health, income and population for all countries in 2007?

Hypothesis: The null hypothesis is that there is no difference in health, income and population between the continents.

Predict: If there is no difference between the subsets, there will be no distinguishable pattern of continents within the chart. If there is a difference between the continents, we predict that the three-world model will explain it.

Combine: The data types of the variables are:

  • Continuous numerical variables: life expectancy, GDP per Capita
  • Discrete numerical variable: population
  • Nominal categorical variable: continent

We will build on the scatterplot and add a plotting element for the continent variable. From the recommendations in the previous post, we can choose from the following elements:

  • Qualitative colour palette
  • Shape
  • Facets

Filter: The following are examples that incorporate the plotting elements above. It is difficult to read the values of GDP per Capita when the chart is split into facets. Further, it is difficult to distinguish the continents from one another using shape without differences in colour, so we will use a qualitative colour palette.

Fig 7: Shape example.
Fig 8: Facet example.
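A qualitative colour palette amounts to giving each category its own distinct colour, with no implied order. The sketch below shows one way to build that mapping; the function name, the hex colours and the sample rows are hypothetical and chosen only for illustration.

```python
def assign_colours(categories, palette):
    """Map each distinct category to its own colour from a qualitative palette."""
    unique = sorted(set(categories))
    if len(unique) > len(palette):
        raise ValueError("palette has too few colours for the categories")
    return dict(zip(unique, palette))


# Five visually distinct hex colours (an arbitrary choice for this sketch).
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]

# Hypothetical sample of continent labels, one per plotted country.
continents = ["Asia", "Europe", "Africa", "Americas", "Oceania", "Asia"]

colour_of = assign_colours(continents, PALETTE)
point_colours = [colour_of[continent] for continent in continents]
print(colour_of)
```

Sorting the categories before assigning colours keeps the mapping stable between runs, so the same continent always gets the same colour regardless of row order.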

Test: Does the following scatterplot persuade us to retain the null hypothesis that there is no difference in population, health, or income between the continents?

Recreation of the 2015 Gapminder Poster for 2007.


Myrnelle Jover
Decision Data

I am a data scientist and former mathematics tutor with a passion for reading, writing and teaching others. I am also a hobbyist poet and dog mum to Jujubee.