Agile Methodology for Data Science Projects

Agile project management methodology is a powerful framework when businesses are faced with uncertainty, rigid time frames and budget constraints. It has boomed in popularity over the last few years, but while the key concepts are easy to grasp, they are much harder to put into practice. Adopting Agile development is not as simple as having a stand-up and using Jira; it requires leadership and commitment. If you take a piecemeal approach to Agile, picking and choosing the easy parts to use in the next project cycle, nothing will change. Agile is a mindset and a culture, and it is greater than the sum of its parts. When it is adopted in full it can be hugely successful. If you have seen Agile work, you know what I mean.

The key benefits of the Agile methodology include:

  • Continuous integration
  • Faster time to market – deliver results and value quickly
  • On budget
  • Utilising the power of a self-organising team
  • Ability to pivot with pace and limit wasted time and resources

Agile began in the software development space, where the product is (more or less) easy to discretise. The chunks of work are prioritised so that those delivering the most business value come first. Once the minimum viable product (MVP) has been released, subsequent iterations build upon it, and each showcase to the client and end users demonstrates a product with more and more features.

The worlds of data science and software development are coming together. I have worked on many data science projects where we aimed to use Agile methodology to deliver predictive models and data-driven solutions. Some have been successful, but others failed to adopt Agile properly and ended up with a blown-out waterfall process and a very heavy tail end to the project.

In this post I’ll highlight some of the key thinking when applying Agile to data science projects.

User stories don’t always work

Traditionally user stories are of the form:

As a….

I want….

So, I can….

written from the point of view of the end user. This is a great way to put yourself in the shoes of those who are going to use the final product. It helps to shape how the product is built and ensures the finished product meets the business requirements. Each user story should contain enough detail to estimate the effort it will take the team to complete, but not so much that it dictates what the solution looks like.

Arguably the biggest stumbling block is how to write user stories for a data science project. Most data science projects develop prediction or classification models to aid or automate decision making. The end user is only really concerned with the output of the model. With that thinking, the user stories could be boiled down to one for each model. That doesn’t leave a lot to work with. The user stories need to be discrete chunks of work that can be completed within a sprint. Typically, a data science project will involve:

  • Plumbing – Getting the pipes and buckets ready for the data
  • Discover – Discover and source the data
  • Ingest – Ingest the data into the data warehouse
  • Clean, extract and explore – Start to wrangle the data into a usable form
  • Feature engineering – Impute missing values, transform data and extract key pieces of information
  • Model fitting and evaluation – Fit the appropriate model and evaluate its accuracy
  • Deployment – Deploy the model into the environment.

These are logical chunks of work for the data science and data engineering teams (each may consist of many stories with more specifics about individual data items). In this way the stories are written not for the end user but for the data science team, for example:

              As a Data Scientist

              I want to impute the missing values of variable x

              So, I can use variable x in the model for predicting y

In a sense they’re more like tasks than user stories. Trying to force these chunks of work into the user story template starts to get confusing. Mostly it wastes time on discussing the semantics of the wording and what the user story should say, rather than what the body of work actually is. I’ve been in this situation many times and the end result is always to simply detail the work as a task ready for story point estimation, ignoring the user story template.

My suggestion is to forget about the template for the data science components from the outset and simply detail them as tasks that can be estimated. It will save a lot of time and confusion. Detailing the work that needs to be done and how it supports the final product is more important than massaging it into the template.

Having said that, I have heard from others that keeping the user story template has worked for them, provided everyone is on the same page. For longer projects this may be OK since the team has the time to work through it. But in my experience forcing the work into this shape takes longer for no real gain.

The user story structure is still important for the final output, however, as it helps to ensure that all the intermediate tasks eventually support the output, which is what the end users are primarily interested in.

Find the fastest path to first release

Continuous integration and speed to market are key benefits of an Agile process. A common problem is that the models built aren’t released until the vast bulk of the data has been ingested, explored and fitted to the model. This is unlikely to happen until late in the project, so no value is delivered until the project is close to completion. This is akin to a waterfall approach.

In the software development space, continuous integration results in always having working software where the first release is the MVP with the features that deliver the most business value. In following sprints, the product is iterated on with the next highest priority features.

In the data science space this equates to always having a working model with output that can be productionised. During the planning phase a decision needs to be made on the minimal amount of work required to have a working model and output delivered to the client. The decision will be based on:

  • The type of model to be built
  • The ease with which data items can be sourced and ingested
  • The ease with which data items can be cleaned and transformed so they are fit for modelling
  • The most important features for the model and the value they provide.

The last point will be mostly a gut feel from subject matter experts and the data science team since it won’t objectively be known until the final model is built.

The model MVP may only have a few predictors. For example, if the product is a model to predict customer churn for a telco, the initial model may include only bill amount, tenure and data usage. For an image classification task, the MVP may classify only two expressions, such as happy and sad, from a subset of images of people rather than the full range of human emotion.

While the first model built won’t be the best or most accurate, as long as it is doing better than the baseline it is delivering value. Particularly in the data science space, showcasing a working, early model can greatly help the business to start testing the effectiveness of the model and to make gains right away. If improvements or changes need to be made, the team can pivot in the right direction.
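As a rough illustration of what "better than the baseline" can mean, the sketch below compares a model's accuracy with that of a naive baseline that always predicts the most common class. The labels and predictions are made-up toy values, and the helper names (`accuracy`, `majority_baseline`) are hypothetical. A real project would use its own data and whichever metric matters to the business.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Naive baseline: always predict the most common class."""
    most_common = Counter(y_true).most_common(1)[0][0]
    return [most_common] * len(y_true)


# Toy, made-up labels: 1 = churned, 0 = stayed.
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
# Hypothetical predictions from a first, simple model.
mvp_pred = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

baseline_acc = accuracy(y_true, majority_baseline(y_true))
mvp_acc = accuracy(y_true, mvp_pred)
beats_baseline = mvp_acc > baseline_acc

print(f"baseline={baseline_acc:.2f} mvp={mvp_acc:.2f} beats_baseline={beats_baseline}")
```

If the first model cannot beat a baseline this simple, that is useful to learn in the first sprint rather than in the last month.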

Iterate

Another common question is: how do you iterate a model building process? The data science and modelling process is inherently iterative and fits into an Agile process easily. Rarely is the first model you build the last. Once the MVP has been showcased, each iteration will likely add more features to the model, train it on more data, achieve higher accuracy, or even switch to a completely different modelling method. For example, the MVP may be a simple logistic regression model, whereas in later iterations, as the problem becomes more complex, a full neural net may be fitted to handle the complexity. At each showcase it is important to show these improvements as measures of progress. It shouldn’t be challenging to iterate the model building process; it only requires a different mindset.

What shouldn’t happen is that, in say a 6-month project, the working model is only shown in the last month. This does not allow any time for feedback or improvements. As mentioned above, this is in effect a waterfall project, and it is likely that in the last month the team will be putting in double time to fix everything that is wrong. The Agile iterative process avoids the crunch, resulting in a better product and happier people.

Another benefit of embracing the iterative nature is the power of the self-organising team and the chance to upskill junior staff. Once the first model has been deployed there will be tasks to add more features and improve the prediction accuracy. This is a great time for more junior staff to grab those user stories and build the next version of the model, after some of the more challenging work has been done (for example, choosing the most appropriate model and treating the missing values appropriately given the business context) and with an understanding of why those decisions were made.

Automate testing

Software development relies heavily on automated testing and test-driven development for success. A user story is complete when the new feature has been added to the product and is passing all the automated tests. As the product becomes more complex, the integration of new features may break others that previously passed.

The same occurs when building data-driven solutions. For example, the inclusion of a new feature in the model may not improve the model fit. It could increase the variance or decrease the accuracy on the test set, or it could improve the overall accuracy but decrease the accuracy for a particular class. Consider a case where a bank is developing a model to identify fraudulent credit card transactions. This is an example of a highly unbalanced dataset where, say, 0.1% of the transactions are fraudulent. By labelling every transaction as genuine the model would achieve an accuracy of 99.9%! But to identify the fraudulent records every transaction would then need to be checked manually by a human. It is in the bank’s best interest to instead minimise the number of false negatives, i.e. to let as few fraudulent transactions as possible slip through detection. Even if 5% of the transactions are labelled as fraudulent, this reduces the number to be manually checked by 95%, a huge saving. If a new feature is added to the model it may improve the overall accuracy but reduce the precision and recall, which is actually a worse model given the business context and how the output will be used.
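The arithmetic behind the fraud example can be sketched with a few confusion-matrix counts. The figures below are illustrative only: 100,000 transactions at the 0.1% fraud rate quoted above, plus the assumption (not stated above) that the model flagging 5% of transactions catches every fraudulent one. The `metrics` helper is a hypothetical name.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


n_transactions = 100_000
n_fraud = 100  # 0.1% of transactions are fraudulent

# Model A: labels every transaction as genuine.
model_a = metrics(tp=0, fp=0, fn=n_fraud, tn=n_transactions - n_fraud)

# Model B: flags 5% of transactions for manual review and, by assumption,
# all 100 fraudulent transactions are among the flagged ones.
n_flagged = 5_000
model_b = metrics(
    tp=n_fraud,
    fp=n_flagged - n_fraud,
    fn=0,
    tn=n_transactions - n_flagged,
)

for name, (acc, prec, rec) in {"A": model_a, "B": model_b}.items():
    print(f"model {name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Under these assumptions model A has the higher accuracy (99.9% against 95.1%) but a recall of zero, while model B catches every fraudulent transaction and leaves only 5,000 of the 100,000 for manual review. Accuracy alone would pick the wrong model.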

Another scenario may be where multiple models have been developed. A user story has been completed to improve the imputation of missing values for a particular variable. This improvement has improved one model but negatively affected the others which also use that variable. In an automated test-driven environment this will be picked up before the model is deployed to production. This will either mean the user story has not met the definition of done or it will create multiple other user stories to re-fit the other models. Either way this is protection against adverse effects of the experimental process and makes it clear how much additional work is needed to use this feature in the models.
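One way such an automated check might look is sketched below: before deployment, the metrics of every re-fitted model are compared with the metrics currently in production, and any drop beyond a tolerance is reported. The model names, metric values and the 0.01 tolerance are made up for illustration, and `find_regressions` is a hypothetical helper rather than part of any particular testing framework.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return {model: [metric, ...]} for metrics that dropped by more than tolerance."""
    regressions = {}
    for model, base_metrics in baseline.items():
        for metric, base_value in base_metrics.items():
            new_value = candidate[model][metric]
            if new_value < base_value - tolerance:
                regressions.setdefault(model, []).append(metric)
    return regressions


# Made-up metrics for two models that share the re-imputed variable.
baseline = {
    "churn": {"recall": 0.80, "precision": 0.60},
    "upsell": {"recall": 0.70, "precision": 0.55},
}
candidate = {
    "churn": {"recall": 0.84, "precision": 0.61},
    "upsell": {"recall": 0.62, "precision": 0.55},
}

regressions = find_regressions(baseline, candidate)
print(regressions)

# In a CI pipeline this could be the gate that blocks deployment, for example:
#     assert not regressions, f"Metric regressions found: {regressions}"
```

Here the new imputation helps the churn model but hurts recall for the upsell model, so the check flags it before anything reaches production.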

Final thoughts

There can be a lot of unease about the experimental nature of data science projects. That’s never going to go away but it can be managed by an open and transparent process that Agile supports. It is often said that failure is not missing the deadline, it is not communicating your progress.
