How to Choose the Right Chart: The Decision-Making Process

DataViz Series

Myrnelle Jover
Decision Data

The kind of dashboard most of us prefer.

Data types define the way we interact with data

If you have read the welcome post, you will know that I was a mathematics tutor for five years. My students may not have realised it, but their questions often strengthened my understanding of mathematics.

On one such occasion, a student directed my attention to a page of their statistics assignment. The question read, “Stand on a sidewalk for 15 minutes and tally the cars you see in their colour, be it red, white, blue, black or green. Plot your results”. Below this was a neatly-drawn histogram with appropriate plotting elements.

This student couldn’t understand why they received full marks for their data collection but only half of the allocated marks for their plot. I replied, “It seems to me that you used a histogram instead of a bar or column chart; histograms aren’t the right chart for categorical data.” Until that moment, I had been unaware that I had absorbed an unspoken but foundational rule of data visualisation: appropriate chart usage depends on the data types of the variables in the diagram.

Why is this? Histograms have no spaces between the columns, suggesting that the data are at least ordinal, if not continuous. In contrast, column charts have gaps between the columns to imply that the data are separated categorically, though numbers may also represent these categories.

Fig 1: The difference between a histogram and a column chart is whether their variables are continuous or discrete.
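To make the distinction concrete, here is a minimal Python sketch that uses only the standard library. The car colours and speed readings are made-up sample data, and `tally_categories` and `bin_continuous` are hypothetical helper names of my own. The point is only that categories are counted by label, while continuous values are counted within contiguous intervals.

```python
from collections import Counter


def tally_categories(observations):
    """Count observations per category label (the input to a bar or column chart)."""
    return dict(Counter(observations))


def bin_continuous(values, start, width, n_bins):
    """Count values per contiguous interval [start + i*width, start + (i+1)*width).

    Adjacent bins share an edge, which is why histogram columns touch.
    Values outside the covered range are ignored in this sketch.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts


# Made-up sample data, for illustration only.
car_colours = ["red", "white", "white", "blue", "black", "white", "red"]
car_speeds_kmh = [41.5, 48.0, 52.3, 55.1, 58.9, 60.0, 63.7]

colour_counts = tally_categories(car_colours)
speed_counts = bin_continuous(car_speeds_kmh, start=40, width=10, n_bins=3)
```

The colour tally has no natural order or adjacency, so gaps between columns are honest. The speed bins are adjacent slices of one number line, so the columns should touch.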

I was never explicitly taught that some charts suit only specific combinations of variable types. I discovered mostly on my own that an underlying set of rules governs the choice of visualisation, and the conversation with my student taught me to base my data visualisation choices on variable data types. This article defines the process I have learnt for creating a visualisation, from nominating a set of possible charts to filtering them and adding dimensionality.

Contents of the current article

A practical example follows in the next article.

The data visualisation process

We all have unconscious processes, and it often takes external input to change them. Though my student helped me realise that I had developed an intuition for choosing between statistical charts, it wasn’t until I stepped into the business world and developed an analytical eye that I honed my decision process.

I have found that the most effective way to create a data visualisation is to incorporate the scientific method.

  1. Question: Before we collect or interrogate data for information, we must always begin by formulating a question. For example, you may be curious about the correlation between certain variables.
  2. Hypothesise: Develop a null hypothesis for your question.
  3. Predict: Predict the results of your question. If you are visually inclined, you might try mapping these in a flow chart.
  4. Collect: Collect data and determine the space of possible charts by assessing the data types of your selected variables.
  5. Filter: Filter these visualisations to ensure the purpose of the charts is congruent with the question that you need to answer.
  6. Test: Test whether any of the chosen visualisations provide a satisfactory response to your hypothesis.

Your unique situation determines whether another variable adds value to the hypothesis or narrows the narrative overmuch. Once a visualisation is chosen, we can add dimensions through plot elements such as colour, size, data point shapes and facetting.

Fig 2: The data visualisation process.

At times, we might need to combine datasets to satisfy our questions; this situation commonly occurs when either formulating a question (Step 1) or incorporating more variables (after Step 6). If this happens, additional cleaning and processing may be required before proceeding to the subsequent steps.

There are also times when we require a collection of visualisations, such as when building a dashboard (as shown in the article banner) 😉. In this case, we might want to formulate a collection of questions.

Choosing a chart by data type

Further categorisation of data types

In the first post of the Fundamentals Series, we categorised numeric data as continuous or discrete. Though it may not be mathematically correct, we can extend this thought process to other types of data:

Textual data

  • Continuous: free-text data
  • Discrete: categorical data

Temporal data

  • Continuous: continuous-time intervals
  • Discrete: sequenced time-series data

Similar to the example of histograms and column charts, categorising all data types in this way allows us to perceive nuanced differences between standard statistical graphs, and therefore understand (from first principles) which type of visualisation is appropriate.

Fig 3: Examples of continuous vs discrete variables.
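As a rough illustration of this categorisation, the sketch below guesses a (family, continuity) pair from a list of Python values. Everything here is an assumption made for the example: the function name `classify`, the `max_categories` threshold, and the choice to treat timestamps as continuous time and calendar dates as discrete time. Real data usually needs a human decision.

```python
from datetime import date, datetime


def classify(values, max_categories=10):
    """Guess (family, continuity) for a non-empty list of values. Heuristic only."""
    if not values:
        raise ValueError("cannot classify an empty list")
    # datetime is a subclass of date, so it must be checked first.
    if all(isinstance(v, datetime) for v in values):
        return ("temporal", "continuous")  # assumption: timestamps
    if all(isinstance(v, date) for v in values):
        return ("temporal", "discrete")    # assumption: sequenced calendar dates
    if all(isinstance(v, int) for v in values):
        return ("numeric", "discrete")
    if all(isinstance(v, (int, float)) for v in values):
        return ("numeric", "continuous")
    if all(isinstance(v, str) for v in values):
        # Few distinct labels: treat as categorical; otherwise as free text.
        if len(set(values)) <= max_categories:
            return ("textual", "discrete")
        return ("textual", "continuous")
    raise ValueError("mixed or unsupported value types")


colour_type = classify(["red", "white", "red", "blue"])
height_type = classify([1.62, 1.75, 1.80])
```

The threshold on distinct labels is deliberately crude: a short column of free-text comments would be misread as categorical, which is exactly the kind of case where you should override the guess.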

Univariate data

Univariate visualisations describe the composition of a single random variable. These visualisations typically show distributions and measures of central tendency for quantitative variables, whereas qualitative variables may display category hierarchies and their frequencies. For temporal variables, we assess interval patterns for regularity, seasonality or trends.

Fig 4: Examples of standard univariate data visualisations for numerical, categorical and temporal variables.

In Figure 1, we noted that the separation (or lack thereof) between adjacent columns in histograms, column charts and bar charts depends on whether the independent variable is continuous or discrete. Note also the continuity of time implied by the connected points within the line chart in Figure 4, since we can divide time into arbitrarily small increments. I have suggested some visualisation options for univariate data types in Figure 5.

Fig 5: Univariate visualisations typically show compositions or trends.

Creating a chart invariably requires a quantitative aspect. This is inherent in numerical variables; for qualitative variables, we can use (raw) frequencies or derived percentages, probabilities or categorical scoring, as in the example below.

Fig 6: Reading from top-to-bottom, we show the process of a categorical variable (Continent) from raw data to a frequency table, bar chart, waffle chart and pie chart. Note that there were no initial numerical aspects for plotting — we derive the frequencies, proportions and percentages from raw categorical data.
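The derivation in Figure 6 can be sketched in a few lines of Python. The continent labels below are made-up sample data, and `frequency_table` and `waffle_cells` are hypothetical helper names. For the waffle chart I have assumed the largest-remainder method so that the squares always add up to the grid size; other rounding schemes are possible.

```python
from collections import Counter


def frequency_table(labels):
    """Return {category: (count, proportion, percentage)} from raw labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: (n, n / total, 100 * n / total) for cat, n in counts.items()}


def waffle_cells(labels, cells=100):
    """Allocate a fixed number of waffle squares using the largest-remainder method."""
    counts = Counter(labels)
    total = sum(counts.values())
    exact = {cat: n * cells / total for cat, n in counts.items()}
    alloc = {cat: int(x) for cat, x in exact.items()}
    leftover = cells - sum(alloc.values())
    # Hand the remaining squares to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for cat in by_remainder[:leftover]:
        alloc[cat] += 1
    return alloc


# Made-up sample data, for illustration only.
continents = ["Asia", "Asia", "Europe", "Africa", "Asia", "Europe"]

table = frequency_table(continents)
squares = waffle_cells(continents, cells=10)
```

The frequency table feeds a bar chart directly, the proportions feed a pie chart, and the integer allocation feeds a waffle chart. All three start from the same raw categorical column.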

Bivariate data

Bivariate visualisations are most commonly used to show relationships between two variables, comparisons within hierarchies, or trends over time.

If both variables are quantitative, we likely want to determine the existence or pattern of a relationship. If one of the variables is categorical while the other is quantitative, we are probably comparing values within categories. If both variables are categorical, then we would be comparing proportions rather than values. The following table provides options for bivariate visualisation based on this premise. Note that independent and dependent variables differ between coordinate systems:

Cartesian coordinate system

  • Independent variable: x-axis
  • Dependent variable: y-axis

Polar coordinate system

  • Independent variable: the angle, theta (θ)
  • Dependent variable: the radius (r)

Table 1: Examples of bivariate data visualisations.
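The two coordinate systems are related by x = r·cos(θ) and y = r·sin(θ). As a small sketch, the code below spreads a set of categories evenly around the circle (the independent variable becomes the angle) and uses each value as the radius (the dependent variable). The helper names and the sample values are my own assumptions.

```python
import math


def polar_to_cartesian(theta, r):
    """Convert an angle in radians and a radius to (x, y)."""
    return (r * math.cos(theta), r * math.sin(theta))


def category_angles(n_categories):
    """Evenly spaced angles (radians) for n categories around the circle."""
    return [2 * math.pi * i / n_categories for i in range(n_categories)]


# Made-up values for four categories, for illustration only.
values = [3.0, 1.0, 2.0, 4.0]
points = [
    polar_to_cartesian(theta, r)
    for theta, r in zip(category_angles(len(values)), values)
]
```

Each point lies at a distance from the origin equal to its value, so a radar-style chart and a column chart of the same data carry the same numbers in different coordinates.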

Note also that when the visualisation utilises the Cartesian coordinate system, we may invert the axes to read category names more clearly. This way, the viewer doesn’t have to turn their head. We exemplify this by observing the difference between a bar chart and a column chart.

Fig 7: The main difference between a column chart and a bar chart is a flip of the coordinate axes. Additionally, we use column charts for temporal data visualisation, with time presented on the x-axis and the measurement unit on the y-axis.

There is at least one exception for axis inversion in the Cartesian coordinate system: continuous-time charts must have the temporal variable displayed on the x-axis. This is for ease of chronological reading and because time varies independently of the quantity being measured.

Filtering visualisations by purpose

Once we have a selection of possible visualisations, we must determine whether they are useful for our purpose. Tables 2 and 3 separate the visualisations mentioned earlier according to three broad objectives:

  1. To analyse distribution, composition or change;
  2. To determine the existence of patterns, relationships or trends; or
  3. To compare subsets within the data.

Table 2: Viable visualisation options for univariate data.

Table 3: Viable visualisation options depend not only on compatible data types but also on the intended message.
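The two-stage selection (first by data type, then by purpose) can be sketched as a lookup followed by a filter. The catalogue below is a small illustrative tagging of my own and is not a reproduction of Tables 2 and 3; the names `ChartOption`, `candidate_charts` and `filter_by_objective` are hypothetical, and the objective labels "describe", "relate" and "compare" stand in for objectives 1 to 3 above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartOption:
    name: str
    variable_types: tuple  # sorted data types of the plotted variables
    objective: str         # "describe", "relate" or "compare"


# Illustrative tagging only (an assumption), not the article's tables.
CATALOGUE = [
    ChartOption("histogram", ("numeric",), "describe"),
    ChartOption("pie chart", ("categorical",), "describe"),
    ChartOption("box plot", ("categorical", "numeric"), "describe"),
    ChartOption("column chart", ("categorical", "numeric"), "compare"),
    ChartOption("scatter plot", ("numeric", "numeric"), "relate"),
    ChartOption("line chart", ("numeric", "temporal"), "relate"),
]


def candidate_charts(variable_types):
    """Stage 1: charts whose variable types match, regardless of order."""
    key = tuple(sorted(variable_types))
    return [c for c in CATALOGUE if c.variable_types == key]


def filter_by_objective(candidates, objective):
    """Stage 2: keep only the charts that serve the stated objective."""
    return [c.name for c in candidates if c.objective == objective]


candidates = candidate_charts(["numeric", "categorical"])
chosen = filter_by_objective(candidates, "compare")
```

Keeping the two stages separate mirrors the process: the data types decide what is possible, and the question decides what is useful.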

Adding dimension through plotting elements

Colour

One of my favourite games is called I Love Hue. The objective of the game is to sort an unordered mosaic of coloured tiles into a harmonious spectrum. It’s visually appealing, and I find it calming.

Video 1: Screen recording of me playing I Love Hue.

Along the perimeter of the mosaic is a change in hue (colour). If you follow an outer tile towards the mosaic centre, you will notice that the path gradually becomes greyer — this indicates a decrease in saturation (colour intensity). Now, notice that some tiles appear to be within the same colour family as their neighbours, and only vary in brightness — this shows a change in lightness. These concepts make up the colour code system that we call HSL.

HSL is arguably easier to interpret than other colour code systems, such as RGB (red-green-blue) or HEX codes. If you have ever tried your hand at graphic design or photo-editing, or created a style sheet, these may sound familiar.
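To see how the three HSL components map onto the codes used elsewhere, here is a minimal sketch using Python's built-in `colorsys` module. The function name `hsl_to_hex` is my own. Note that `colorsys` orders the components as hue, lightness, saturation (HLS), so the arguments are swapped inside the function.

```python
import colorsys


def hsl_to_hex(hue_deg, saturation, lightness):
    """Convert HSL (hue in degrees, saturation and lightness in 0..1) to a HEX code."""
    # colorsys expects (hue, lightness, saturation), each in the range 0..1.
    r, g, b = colorsys.hls_to_rgb((hue_deg % 360) / 360, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )


pure_red = hsl_to_hex(0, 1.0, 0.5)    # full saturation, mid lightness
grey = hsl_to_hex(0, 0.0, 0.5)        # zero saturation removes the hue
dark_red = hsl_to_hex(0, 1.0, 0.25)   # same hue, lower lightness
```

Changing one argument at a time reproduces the three movements in the mosaic: around the perimeter (hue), towards the grey centre (saturation), and between lighter and darker neighbours (lightness).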

In colour theory, there are three types of colour palettes:

  • Qualitative palettes vary in hue while remaining constant in saturation and lightness. The resulting palette contains distinct colours which we can think of as discrete and unordered; therefore, they are most suitable for nominal categories.
  • Diverging palettes vary in lightness/saturation, though their extremities are usually different colours. Lightness and saturation are both ordered concepts, so these palettes are best used for ordered variables with a meaningful midpoint, such as those that contain positive, negative and neutral values.
  • Sequential palettes also vary in lightness/saturation, but in one direction only; a diverging palette can be thought of as two sequential palettes joined at a neutral midpoint. Therefore they are suitable for unidirectional ordered variables.

Fig 8: Examples of colour palette types from ColorBrewer.
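The three palette types can be generated directly from their definitions. The sketch below is not how ColorBrewer builds its palettes; it is a simplified construction of my own, with assumed default values for hue, saturation and lightness, and with a diverging palette modelled as two sequential ramps that meet at a light neutral grey.

```python
import colorsys


def _hex(hue, lightness, saturation):
    """Format an HLS colour (all components in 0..1) as a HEX code."""
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )


def qualitative(n, saturation=0.65, lightness=0.55):
    """n colours: evenly spaced hues, constant saturation and lightness."""
    return [_hex(i / n, lightness, saturation) for i in range(n)]


def sequential(n, hue=0.6, saturation=0.65, light=0.9, dark=0.3):
    """n >= 2 colours: one hue, lightness stepping from light to dark."""
    step = (dark - light) / (n - 1)
    return [_hex(hue, light + i * step, saturation) for i in range(n)]


def diverging(n_per_side, low_hue=0.0, high_hue=0.6):
    """Two sequential ramps (assumed construction) meeting at a neutral midpoint."""
    low = sequential(n_per_side, hue=low_hue)[::-1]  # dark to light
    high = sequential(n_per_side, hue=high_hue)      # light to dark
    return low + ["#f2f2f2"] + high


nominal_colours = qualitative(5)
ordered_colours = sequential(4)
signed_colours = diverging(3)
```

The qualitative palette gives five unordered hues for nominal categories, the sequential palette gives four ordered steps of one hue, and the diverging palette gives seven colours with the neutral grey in the middle.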

Size, shape and facets

Quiz time: Name the data types that are best suited to the following plotting elements. The answers can be found in Table 4, which follows the examples.

a) Variations in size

Fig 9: Example of data point size variation.

b) Variations in shape

Fig 10: Example of data point shape variation.

c) Facetting

Fig 11: Example of facetting.

By matching variable data types to these plotting elements, we can add as many dimensions as are suitable for a 2D plot. It is important to note that additional variables do not always add value — sometimes, they may be redundant or add unnecessary complexity. At the heart of a chart, one must always choose clarity.

Table 4: Summary of suitable plotting elements for additional dimensions in a 2D chart.

Conclusion

Selecting an appropriate chart depends on the initial question, the number of variables required to answer it, and the data types of those variables.

Univariate visualisations describe a variable's composition, whereas bivariate and multivariate visualisations show relationships, comparisons or trends. To convert a bivariate visualisation into a multivariate visualisation, we can layer suitable plotting elements such as colour, size, shape and facets according to the data types of the additional variables.

Myrnelle Jover
Decision Data

I am a data scientist and former mathematics tutor with a passion for reading, writing and teaching others. I am also a hobbyist poet and dog mum to Jujubee.