Understanding Logistic Regression

Machine Learning Series

Myrnelle Jover
Decision Data

--

Food — a difficult decision.

Modelling my life decisions

I’ve been a bit stagnant with my blog posting these last six months. I started an exciting new job blending data science with operations research, began my management journey as a data visualisation team lead for a volunteer organisation, raced the clock to move back home (on the other side of the country) before another COVID-19 lockdown, then spent my time catching up on all the family and social life I’d missed while living away.

Truth be told, I was also in denial about the possibility that lockdowns can still affect those of us who were last year reportedly #blessed with introversion.

Recently, I’ve been doing some side projects to help me with life admin. I’ve collected all kinds of information: shopping habits, food choices, sleeping patterns, stock information, etc.

Last weekend, I was trying to decide what I could be bothered making for meal prep. I use a cool app called Yummly that scrapes the web for recipes and categorises them by cuisines, ingredients, cooking techniques, taste preferences, etc. Unfortunately, the API is no longer working so I manually collected the information into a dataset (there aren’t enough recipes to show you at the moment, so this simulation should do for now).

Option 1: Get a computer to randomly choose a cooking recipe if you’re indecisive.
Option 2: Log your own “CBF factor” and get insight into your cooking choices — what factors are most important for you?

In this article, we are going to be exploring logistic regression — a generalised linear model that is often used for classification problems in machine learning. Similar to the linear regression article, we will be exploring the concepts, assumptions and metrics required to understand, validate and evaluate a logistic regression model.

Contents

  1. Overview of logistic regression
  2. Concepts
  3. Assumptions
  4. Evaluation metrics
  5. Summary of advantages and disadvantages
  6. Conclusions

Overview of logistic regression

Recall that all problems containing a response variable fall into the category of supervised learning methods. These are further separated into either regression or classification problems depending on whether their respective label predictions are numeric or categorical.

Logistic regression is a generalised linear model that can predict either a continuous outcome (a probability) or a class outcome, on data with a linear relationship between the d predictor variables, X, and the log-odds of the response variable, y. This means that logistic regression is both a regression and a classification technique.

Fig 1: Logistic regression appears twice in this diagram because it can be used for regression and classification problems.

Let’s take a look at both the linear and logistic regression equations to understand why. First, recall that the general regression problem takes the following form, and is classified as linear if f(X) is a linear function of the predictors and their coefficients.

y = f(X) + ε
Eqn 1: Regression techniques use predictor variables, X, to predict a numeric response, y.

From this definition of linearity, it is straightforward to see why the linear regression equation is aptly named:

y = β_0 + β_1·x_1 + β_2·x_2 + … + β_d·x_d + ε
Eqn 2: Linear regression models the response, y, directly as a linear function of the predictors.

To convert a linear regression equation to a logistic regression equation, we model the log-odds of the response variable, y; this transformation is called the logit (or log-link) function. As you can see below, the reason that logistic regression is considered a generalised linear model is that it is linear with respect to the log-odds of the response variable.

log(p / (1 − p)) = β_0 + β_1·x_1 + β_2·x_2 + … + β_d·x_d, where p = P(y = 1 | X)
Eqn 3: Logistic regression techniques are linear with respect to the log-odds of y.

Since we are primarily interested in the probability of the response belonging to a class, we need to redefine the equation so that we get a probability output using some simple algebra.

p = e^(β_0 + β_1·x_1 + … + β_d·x_d) / (1 + e^(β_0 + β_1·x_1 + … + β_d·x_d))
Eqn 4: Extracting the probability, p, from the logit function.

The resulting equation is what we call a sigmoid function, which squashes a straight line into an S-curve. This squashing (the inverse of the log-link) limits the space of possible outputs to values between 0 and 1, thereby providing a probability.

p = 1 / (1 + e^−(β_0 + β_1·x_1 + … + β_d·x_d))
Eqn 5: The sigmoid function gives us the probability of the response variable, y, belonging to a particular class.
Fig 2: Linear regression predictions are unbounded, whereas the predicted values of logistic regression remain within [0, 1].

If we are only interested in a probability output for regression, we can stop here. However, if we are using logistic regression for classification, then we need to apply some threshold, t, to the probability so the model knows how to assign a class. When the probability is greater than or equal to this threshold, we set the predicted value of the response variable to 1 (True), and to 0 (False) otherwise. This last step is what converts logistic regression into a classifier.

ŷᵢ = 1 if pᵢ ≥ t, and ŷᵢ = 0 otherwise
Eqn 6: Predicting the response for the i-th example. The threshold, t, can be any reasonable value in (0, 1).

To recap, the conversion from linear regression to logistic regression has the following steps:

  1. Take a linear function
  2. Take the log-odds of the response variable (i.e. apply the logit function)
  3. Re-arrange the equation for the probability, p
  4. Attach a probability threshold

Et voilà! Simple.

Fig 3: Visual recap of the difference between linear and logistic regression.
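To make those four steps concrete, here is a minimal Python sketch (the coefficient values are hypothetical and NumPy is assumed) that walks through the same conversion: a linear function, the sigmoid squashing, and a probability threshold.

    import numpy as np

    # Hypothetical fitted coefficients for d = 2 predictors: (intercept, beta_1, beta_2)
    beta = np.array([-1.5, 0.8, 2.0])

    # Three example observations, with a leading column of ones for the intercept
    X = np.array([[1.0, 0.5, 0.2],
                  [1.0, 2.0, 1.0],
                  [1.0, 0.1, 0.05]])

    # Steps 1-2: the linear function, which models the log-odds (Eqn 3)
    log_odds = X @ beta

    # Step 3: the sigmoid squashes the log-odds into a probability (Eqn 5)
    p = 1.0 / (1.0 + np.exp(-log_odds))

    # Step 4: attach a probability threshold to obtain class labels (Eqn 6)
    t = 0.5
    y_pred = (p >= t).astype(int)

    print(p)       # probabilities in (0, 1)
    print(y_pred)  # class labels in {0, 1}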

Concepts

There are many shared concepts between linear and logistic regression; to reduce repetition, feel free to read about regression concepts in my linear regression article.

Table 1: An overview of common regression concepts.

Assumptions

  1. Appropriate type: The type of logistic regression must be chosen according to the response variable.
  2. Independence: Similar to linear regression, the residuals must be independent of each other, and the observations should not come from repeated measurements of the same entity. If this assumption is violated, the estimated standard errors will be biased, which invalidates inference on the coefficients.
  3. No multicollinearity: The independent variables must have little to no multicollinearity. We use the term collinearity when two predictor variables are highly linearly related; multicollinearity describes the more general situation where one predictor is close to a linear combination of two or more of the others.
  4. Linearity: There is a linear relationship between the predictor variables and the log odds.
  5. Large sample size: We have a large sample size.

Assumption #1: Logistic regression type

There are three main types of logistic regression: binomial, ordinal and multinomial.

  • Binomial: The response variable consists of two classes.
  • Ordinal: The response variable consists of three or more classes with an inherent order.
  • Multinomial: The response variable consists of three or more classes that do not have an inherent order.

Examples of the inputs into these logistic regression types are provided in the table below.

Table 2: We use binomial logistic regression when the response consists of two classes, multinomial logistic regression when there are three or more unordered classes, and ordinal logistic regression when the response classes have an inherent ordering.

Assumption #2: Independence

Similar to linear regression, we cannot check either visually or numerically whether the residuals are independent; instead, we must review the data collection method and ensure, to the best of our knowledge, that it leads to independent observations.

Assumption #3: No multicollinearity

There are multiple tests to identify multicollinearity, but the easiest way is to plot a correlation matrix — the only catch is that the predictor variables need to be numeric.

Fig 4: Correlation matrix of the predictor variables used to determine the presence or absence of diabetes. (Data source: diabetes)
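As a rough sketch, assuming pandas and seaborn are available and the predictors sit in a hypothetical diabetes.csv with the response in an Outcome column, a correlation matrix like the one above could be produced along these lines:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical file and column names; swap in your own dataset
    df = pd.read_csv("diabetes.csv")
    predictors = df.drop(columns=["Outcome"])  # keep only the numeric predictors

    # Pairwise Pearson correlations between predictors
    corr = predictors.corr()

    # Heatmap of the correlation matrix; strong off-diagonal values hint at multicollinearity
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation matrix of predictors")
    plt.tight_layout()
    plt.show()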

Assumption #4: Linearity

The linearity assumption requires a linear relationship between the predictor variables and the log-odds of the response variable. It is a common mistake to check for a linear relationship with the categorical response itself instead of its log-odds; even when the response is coded as an integer (0 or 1), linearity against those raw labels is not meaningful, and the assessment must be made on the continuous log-odds scale.

Three common ways to assess the relationship between the predictors and the log-odds are a Box-Tidwell test, a scatterplot, or a comparison with a non-linear spline model.
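Here is a sketch of a Box-Tidwell-style check, assuming statsmodels and using simulated recipe-style data with strictly positive predictors (the predictor names are hypothetical): add an x·ln(x) term for each predictor and refit; a significant interaction term suggests the log-odds are not linear in that predictor.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Toy data: two strictly positive predictors and a binary response
    rng = np.random.default_rng(0)
    X = pd.DataFrame({"cook_time": rng.uniform(5, 120, 200),
                      "n_ingredients": rng.uniform(1, 20, 200)})
    log_odds = -3 + 0.02 * X["cook_time"] + 0.15 * X["n_ingredients"]
    y = (rng.uniform(size=200) < 1 / (1 + np.exp(-log_odds))).astype(int)

    # Box-Tidwell-style check: add an x*ln(x) term for each predictor and refit
    X_bt = X.copy()
    for col in X.columns:
        X_bt[col + "_x_ln"] = X[col] * np.log(X[col])

    model = sm.Logit(y, sm.add_constant(X_bt)).fit(disp=0)
    print(model.summary())  # significant x*ln(x) terms flag non-linearity in the log-odds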

Assumption #5: Large sample size

Logistic regression requires a lot of data; otherwise, we can find ourselves with an overfitting problem. In my cooking example, I would likely need hundreds of recipes that I had chosen to make (or decided against making) before my model could draw any reasonable conclusions about whether my CBF factor was more strongly influenced by cooking time, number of ingredients, or the number of steps in the recipe.

Unfortunately, there is no hard guideline on what “enough data” entails. Some people say that you need a minimum of 10–30 cases of the least frequent outcome for each predictor in your model, and others say even this is not enough. At an absolute minimum, logistic regression requires a sample size greater than the number of predictors (i.e. n > d). Beyond that, the sample size depends on how accurate the outputs need to be for your particular scenario; in general, the more (high-quality) data the better.
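As a rough illustration only (the specific numbers below are assumptions, not a hard rule), one common events-per-variable heuristic can be turned into a back-of-the-envelope sample size:

    # Rough events-per-variable heuristic (assumed numbers, not a hard rule)
    n_predictors = 3           # e.g. cooking time, number of ingredients, number of steps
    p_rare_outcome = 0.3       # proportion of the least frequent class (e.g. "actually cooked it")
    events_per_predictor = 10  # commonly quoted minimum; some authors argue for more

    min_sample_size = events_per_predictor * n_predictors / p_rare_outcome
    print(min_sample_size)     # => 100 recipes, at a minimum, for this scenario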

Table 3: Summary of logistic regression assumptions and how to check them.

Evaluation metrics

Confusion matrix

The confusion matrix is a summary table that shows the performance of a classification model by counting its correct and incorrect predictions for each class label.

If we are performing binary classification, these classes will either be one (positive) or zero (negative). The confusion matrix will be a 2×2 summary table where the rows represent predicted classes and the columns represent actual classes.

Note: Sometimes the position of the predicted and actual classes are switched, so it’s a good idea to double-check the arguments and output titles of your confusion matrix function.

From the confusion matrix, we determine the following values:

  • True positive (TP): A positive instance correctly classified as positive
  • False positive (FP): A negative instance incorrectly classified as positive
  • True negative (TN): A negative instance correctly classified as negative
  • False negative (FN): A positive instance incorrectly classified as negative

Fig 5: Confusion matrix for binary classification.
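Here is a minimal sketch with scikit-learn on toy labels. Note that confusion_matrix puts the actual classes on the rows and the predicted classes on the columns, the opposite of the layout described above, which is exactly why it pays to double-check:

    from sklearn.metrics import confusion_matrix

    # Toy actual and predicted labels for a binary problem
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # Rows are actual classes and columns are predicted classes, ordered [0, 1];
    # ravel() then returns tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tn, fp, fn, tp)  # => 3 1 1 3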

If we are performing a multiclass or ordinal classification, the class labels will simply be the name of the class; if there are n classes, the confusion matrix will be an n×n table. To retrieve the value counts defined in the binary classification example, we focus on one class at a time; we define the current focus class as the positive class, and all others will be negative classes.

To find the true positives, we use the single value on the main diagonal (in green) that belongs to the current focus class. The true negatives are all of the cells that lie in neither the row nor the column of the focus class; note that this includes off-diagonal cells among the other classes, not just the rest of the main diagonal. The remaining cells, those in the focus class's row or column but off the diagonal, are false positives (the focus class's row, since rows are predictions) or false negatives (the focus class's column).

Fig 6: Confusion matrix for multinomial classification.

The main metrics that come from the confusion matrix are in the table below.

Table 4: Definitions and equations of the main confusion matrix metrics.
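As a worked example with hypothetical counts, the most common of these metrics can be computed directly from TP, FP, TN and FN:

    # Hypothetical counts from a binary confusion matrix
    tp, fp, tn, fn = 40, 10, 35, 15

    accuracy    = (tp + tn) / (tp + tn + fp + fn)  # 0.75 - proportion of all predictions that are correct
    sensitivity = tp / (tp + fn)                   # ~0.73 - recall / true positive rate
    specificity = tn / (tn + fp)                   # ~0.78 - true negative rate
    precision   = tp / (tp + fp)                   # 0.80 - proportion of predicted positives that are correct
    f1          = 2 * precision * sensitivity / (precision + sensitivity)  # ~0.76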

Receiver operating characteristic (ROC)

The receiver operating characteristic (ROC) curve shows the performance of a classification model across all classification thresholds. It plots two quantities derived from the confusion matrix: the true positive rate (sensitivity) against the false positive rate (1 − specificity). As the threshold varies, the two trade off against each other; raising sensitivity generally lowers specificity, and vice versa.

Fig 7: ROC for binomial classification. A perfect model would have a curve consisting of two lines: a vertical line at x=0 and a horizontal line at y=1; the shape of this curve would look like the right-half of the letter ‘T’. (Data source: diabetes)

Area Under the Curve (AUC)

The area under the curve (AUC) uses the ROC to quantify the performance of a classifier. Perfect classifiers have an AUC of 1, whereas classifiers that perform no better than a coin flip have an AUC of 0.5.

Fig 8: AUC for binomial classification. (Data source: diabetes)

A full end-to-end sketch might look like the following (assuming Python with scikit-learn, and the diabetes data in a hypothetical diabetes.csv with the response in an Outcome column):
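    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

    # Hypothetical file/column names for the diabetes dataset; adjust to your copy
    df = pd.read_csv("diabetes.csv")
    X = df.drop(columns=["Outcome"])
    y = df["Outcome"]

    # Hold out a test set so the evaluation metrics are not computed on training data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    # Fit the logistic regression model
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Class predictions use scikit-learn's default threshold of 0.5
    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))

    # Predicted probabilities for the positive class are needed for the ROC and AUC
    y_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    print("AUC:", roc_auc_score(y_test, y_prob))

    # ROC curve: true positive rate (sensitivity) against false positive rate (1 - specificity)
    plt.plot(fpr, tpr, label="Logistic regression")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Coin flip (AUC = 0.5)")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()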

Summary of advantages and disadvantages

Advantages

Logistic regression is a probabilistic method that is efficient to train, simple to implement and has easily interpretable outputs. The coefficients of the model give an indicator of a predictor’s relationship with the log-odds of the response variable. In comparison to most other algorithms, it has low complexity and is less prone to overfitting if provided enough data.

Rule of thumb #1: Simpler models are usually less prone to overfitting.

Disadvantages

One of the major assumptions of logistic regression is a linear relationship between the predictor variables and the log-odds of the response variable, which restricts the model to linear decision boundaries between predicted classes. This means that plain logistic regression (without engineered non-linear features) cannot solve non-linear classification problems. In the case of complex non-linear relationships, we would tend towards more powerful methods, such as neural networks.

Rule of thumb #2: The complexity of a problem should determine the complexity of its solution.

Conclusions

Logistic regression is a supervised learning method whose prediction output is either a probability or a class label, depending on whether we attach a probability threshold. Binomial logistic regression predicts one of two classes, multinomial logistic regression predicts one of three or more unordered classes, and ordinal logistic regression predicts ordered classes.

To implement logistic regression, we must understand concepts common to all regression models, such as coefficients, variance, and residuals.

To fit a logistic model, the data must satisfy the following assumptions: choosing appropriate logistic regression types, independence of residuals, little to no multicollinearity, linearity between the predictors and log-odds of the response, and large sample size. Choosing appropriate logistic regression types is self-evident from the response classes. To test for independence, we must review the data collection method (if possible). To test for multicollinearity, we use a correlation matrix. To test for linearity, we use either a Box-Tidwell test or a scatterplot. Finally, the sample size depends on the situation — however, we require a lot to avoid overfitting.

We evaluate the performance of a logistic regression model against the following metrics: confusion matrix, receiver operating characteristic (ROC), and the area under the curve (AUC) of the ROC plot.

Similar to linear regression, the advantages of logistic regression lie in its simplicity of implementation and interpretation, though its disadvantages include strong assumptions of linearity, meaning that it should only be used for simple classification problems.

--


Myrnelle Jover
Decision Data

I am a data scientist and former mathematics tutor with a passion for reading, writing and teaching others. I am also a hobbyist poet and dog mum to Jujubee.