How to Structure Data Science Projects

Fundamentals Series

Myrnelle Jover
Decision Data

--

An accurate portrayal of you working on your next data science project.

The CRISP-DM Methodology

All good data science projects should begin with a structure that keeps the work well-organised, efficient, and effective. Such a structure helps to define the roles and responsibilities of team members and sets out a clear plan for achieving the project's goals.

The most widely used framework for data science projects is the Cross-Industry Standard Process for Data Mining (CRISP-DM). This methodology has six stages: business understanding, data understanding, data preparation, modelling, evaluation, and deployment. Many of these stages may require us to return to an earlier stage and alter our approach when the data, the method, or the model's performance demands it.


If you are looking to build some structure in your data science projects, I invite you to download a free copy of the format I have developed for myself using this framework, and that I personally use in my day-to-day work.

Fig 1: The CRISP-DM Methodology is a framework to help us organise our data science projects.

Stage 1: Business Understanding

The first stage involves defining the project’s objectives, understanding the context of the project, and identifying the data sources that will be used. This stage is also typically when we’ll have the most access to the stakeholder, so it’s important to be prepared with the questions we’ll need for later stages.

Some example questions we could use in our research are:

  • What is the problem that needs to be solved?
    This question provides context to the general business problem. If we listen well, we may be able to hone some more specific questions from their answer.
  • What are the objectives of the project?
    This question helps us to understand the purpose of the solution. Since the goal is to produce valuable and actionable insights, we can ask as many follow-up questions as we need.
  • What would be the use case of this solution?
    This question provides the required format of the solution. Ask whether the stakeholder would prefer to access the results on a dashboard, as an application, in a report, or in another format. We could also ask whether the solution needs to be deployed in the cloud or on-premises.
  • Who will be the users of this solution?
    We must tailor the communication of the data to the needs of the people who will be using it. From a product management perspective, it's a good idea to develop user personas: archetypal profiles of the people who will use the solution. It may also be useful to collect the contact details of a variety of users who can act as testers.
  • What is the size of the data, what is the data source, and how frequently would new data points come in?
    This question will inform our modelling decisions, as some methods have a size limitation or minimum size requirement. The frequency of new data will also help us determine the speed requirement of our chosen ML method.
  • To the best of your knowledge, what is the data collection method like?
    Many machine learning methods assume independent residuals. One of the quickest ways to assess this is from the data collection method itself, though we may also be able to tell from visual inspection once we have access to the data. When the data are inherently dependent, such as in time-series and spatial modelling problems, we can ask questions such as, “In what way are the data dependent, and why? What are the strongest sources of dependence?”
  • What are the definitions of success?
    At what point will we know we have met the stakeholder's needs? It's a good idea to determine success and exit criteria here since, with data science projects, it is entirely possible to continue optimising forever. There needs to be a definitive stopping point that tells us when our model is good enough. There are a few ways to do this, but my favourite is to use satisficing and optimising metrics: a satisficing metric is a pass/fail threshold, whereas an optimising metric is a quantity we try to maximise or minimise. An example is requiring a runtime of ≤ 45 seconds (satisficing metric) whilst maximising accuracy using the method known to perform best on similar problems (optimising metric). In this scenario, so long as every satisficing threshold is met, we simply pick the model with the best optimising metric and know that we are achieving our project goals. It's important to note that there may be multiple satisficing metrics to meet, but there is typically only one optimising metric; depending on the task, it may be unreasonable to optimise for more than one objective at a time.
  • What are the project timelines?
    This question should tell us what resources we would allocate to the project in an ideal scenario and help us plan the project milestones.
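
The satisficing-versus-optimising idea above can be sketched in a few lines of code. In this minimal example, the model names and numbers are entirely hypothetical: candidates are first filtered by the satisficing runtime budget, and only then ranked by the optimising metric.

```python
# Hypothetical candidate models with measured runtime (seconds) and accuracy.
candidates = [
    {"name": "gradient_boosting", "runtime_s": 62.0, "accuracy": 0.93},
    {"name": "random_forest",     "runtime_s": 38.0, "accuracy": 0.91},
    {"name": "logistic_reg",      "runtime_s": 4.0,  "accuracy": 0.86},
]

RUNTIME_BUDGET_S = 45.0  # satisficing metric: runtime must be <= 45 s


def select_model(candidates, budget=RUNTIME_BUDGET_S):
    """Keep only models that meet the satisficing threshold, then
    pick the one with the best optimising metric (accuracy)."""
    feasible = [m for m in candidates if m["runtime_s"] <= budget]
    if not feasible:
        return None  # no model meets the satisficing criterion
    return max(feasible, key=lambda m: m["accuracy"])


best = select_model(candidates)
print(best["name"])  # random_forest: meets the budget with the best accuracy
```

Note that the most accurate model overall is rejected here because it fails the satisficing threshold; that is exactly the stopping rule the criteria are meant to enforce.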

Stage 2: Data Understanding

The second stage of the methodology is used to explore the data and understand its quality, structure and characteristics. This is the time to detect any data quality issues and discover any underlying patterns within the data.

Examples of data quality issues include noise, outliers, missing values and patterns of missingness, duplicated records, too little or too much data, and too many or too few predictors. We don't necessarily need to act on any of this information just yet; we simply note the issues along with our treatment options.

Then, in our exploratory analysis, we visualise the distributions of the data and its noise, produce statistical summaries, check whether the features are measured in different units, compute correlations, segment the data by class to look for patterns, and decide whether the data need normalisation or scaling.
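
These quality checks can be collected into a small audit function that we run before deciding on any treatment. The sketch below uses pandas on a toy dataset (all values are hypothetical), counting missing values, duplicated rows, and numeric values more than three standard deviations from the column mean:

```python
import numpy as np
import pandas as pd

# Toy dataset standing in for the project's raw data (hypothetical values).
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 47, 31, 120],
    "income": [48_000, 52_000, 61_000, np.nan, 52_000, 55_000],
    "city":   ["Sydney", "Perth", "Perth", "Sydney", "Perth", "Sydney"],
})


def quality_report(df):
    """Collect quality signals without treating them yet."""
    report = {
        "n_rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
    }
    # Flag numeric values more than 3 standard deviations from the mean.
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()) / numeric.std()
    report["n_outlier_values"] = int((z.abs() > 3).sum().sum())
    return report


print(quality_report(df))
```

On a sample this small, the three-sigma rule flags nothing even though 120 looks suspicious, which is a reminder that automated checks complement, rather than replace, visual inspection.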

Stage 3: Data Preparation

The third stage of the methodology is used to clean, transform, and integrate data to prepare it for analysis. This is where we will apply treatment to any data quality issues discovered in Stage 2, such as missing value imputation, outlier detection and treatment, and feature engineering. There are numerous treatment options for each data quality issue, so we will go into these in more depth in another article.
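
As a taste of what such treatment looks like, here is a minimal sketch that median-imputes missing numeric values and then min-max scales each column to [0, 1]. The data and the choice of treatments are illustrative only; in practice the treatment follows from the issues noted in Stage 2.

```python
import numpy as np
import pandas as pd

# Hypothetical raw features with some missing values.
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 31],
    "income": [48_000, 52_000, np.nan, 55_000],
})


def prepare(df):
    """Median-impute missing numerics, then min-max scale to [0, 1]."""
    out = df.copy()
    for col in out.columns:
        out[col] = out[col].fillna(out[col].median())
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    return out


clean = prepare(df)
```

A key practical point not shown here: the imputation and scaling statistics should be computed on the training split only and reused on the test split, otherwise information leaks across the split.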

Stage 4: Modelling

The modelling stage is where we develop machine learning models and identify the most promising ones for achieving the project goals. Here, we will need to consider factors such as building a baseline model, determining our train-dev-test split or whether we will use cross-validation, the ML method we will use, and any hyperparameters that could be tuned.
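
A baseline model and cross-validation can be sketched without any libraries. The example below, on synthetic data (y = 3x plus noise), compares a mean-prediction baseline against a simple least-squares slope through the origin using 5-fold cross-validation; a real project would swap in proper models, but the comparison structure is the same.

```python
import random

random.seed(0)
# Synthetic regression data: y = 3x + noise (a stand-in for real data).
xs = [random.uniform(0, 10) for _ in range(100)]
ys = [3 * x + random.gauss(0, 1) for x in xs]


def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def kfold_mse(xs, ys, fit, k=5):
    """Average held-out MSE over k folds; fit(x_tr, y_tr) returns a predictor."""
    fold = len(xs) // k
    scores = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        x_tr, y_tr = xs[:lo] + xs[hi:], ys[:lo] + ys[hi:]
        predict = fit(x_tr, y_tr)
        scores.append(mse([predict(x) for x in xs[lo:hi]], ys[lo:hi]))
    return sum(scores) / k


def fit_mean(x_tr, y_tr):
    """Baseline: always predict the training mean."""
    m = sum(y_tr) / len(y_tr)
    return lambda x: m


def fit_slope(x_tr, y_tr):
    """Least-squares line through the origin: slope = sum(xy) / sum(x^2)."""
    b = sum(x * y for x, y in zip(x_tr, y_tr)) / sum(x * x for x in x_tr)
    return lambda x: b * x


baseline_mse = kfold_mse(xs, ys, fit_mean)
model_mse = kfold_mse(xs, ys, fit_slope)
```

The baseline matters because it gives every later model a number to beat; a sophisticated model that cannot outperform the mean predictor is not adding value.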

Stage 5: Evaluation

In this stage, the chosen models are evaluated to determine their effectiveness and to identify any areas for improvement. We may need to confer with the stakeholder to give progress updates or request feedback.
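
For a classification task, evaluation often starts with a confusion matrix and the metrics derived from it. A self-contained sketch with made-up labels and predictions:

```python
def confusion(y_true, y_pred):
    """Counts for a binary confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical labels and predictions from the chosen model.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion(y_true, y_pred)
precision = tp / (tp + fp)  # of everything flagged positive, how much was right?
recall = tp / (tp + fn)     # of all true positives, how many did we find?
```

Which metric to report back to the stakeholder should follow directly from the success criteria agreed in Stage 1, not from whichever number looks best.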

Stage 6: Deployment

The final stage of the CRISP-DM methodology involves deploying the chosen models into the business environment — but projects don’t always end after deployment; we may need to monitor the results of the model to ensure they are continuing to meet the project goals. There are several factors that can influence the need to update a model after deployment, such as a change in the data, performance degradation, changes in the business context, or newer approaches that may provide better performance than the existing model.
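
Monitoring for a change in the data can start very simply. The sketch below scores drift in a single feature as the shift in its mean, measured in training standard deviations; the 0.5 retraining trigger and all data values are illustrative assumptions, not standard values.

```python
import statistics


def drift_score(train, live):
    """Shift in the live mean, in units of the training standard deviation.
    A score above ~0.5 is a rough (hypothetical) trigger for retraining."""
    shift = abs(statistics.mean(live) - statistics.mean(train))
    return shift / statistics.stdev(train)


# Hypothetical feature values seen at training time vs. in production.
train_feature = [10.0, 12.0, 11.0, 13.0, 9.0, 11.5]
stable_live = [10.5, 11.8, 12.1, 10.9]
shifted_live = [18.0, 19.5, 17.2, 20.1]

stable = drift_score(train_feature, stable_live)    # small: no action needed
shifted = drift_score(train_feature, shifted_live)  # large: investigate/retrain
```

Production systems typically use richer tests per feature (for example, comparing whole distributions rather than means), but even a mean-shift check catches the grossest failures.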

Download a free cheat sheet of this framework.

Summary

The CRISP-DM methodology is a six-stage framework for organising and executing data science projects. The first stage, business understanding, involves defining the project’s objectives, understanding the context of the project, and identifying the data sources that will be used. The second stage, data understanding, involves exploring the data to understand its characteristics and any potential issues. The third stage, data preparation, involves cleaning and preparing the data for modelling. The fourth stage, modelling, involves selecting and applying appropriate machine learning techniques to the data. The fifth stage, evaluation, involves assessing the quality and effectiveness of the model. The final stage, deployment, involves implementing the model and making it available for use.


Myrnelle Jover
Decision Data

I am a data scientist and former mathematics tutor with a passion for reading, writing and teaching others. I am also a hobbyist poet and dog mum to Jujubee.