Better habits for managing complexity in data science codebases

How to become more agile and productive with better coding habits

If you’ve tried your hand at machine learning or data science, you know that code can get messy, quickly.

Actual footage of me writing data science code

Typically, code to train ML models is written in Jupyter notebooks and it’s full of (i) side effects (e.g. print statements, pretty-printed dataframes, data visualisations) and (ii) glue code without any abstraction, modularisation and automated tests.

While this may be fine for notebooks targeted at teaching people about the machine learning process, in real projects it’s a recipe for an unmaintainable mess. The lack of good coding habits makes code hard to understand, and consequently, modifying code becomes painful and error-prone. This makes it increasingly difficult for data scientists and developers to evolve their ML solutions to adapt to business needs.

Complexity is unavoidable, but it can be compartmentalized. In our homes, when we don’t actively organise and rationalise where, why and how we place things, mess accumulates and what should have been a simple task (e.g. finding a key) becomes unnecessarily time-consuming and frustrating. The same applies to our codebase.

Every time we write code in a way that adds another moving part, we increase complexity and add one more thing to hold in our head. While we cannot — and should not try to — escape from the essential complexity of a problem, we often add unnecessary accidental complexity and unnecessary cognitive load through bad coding practices.

If we can keep complexity under control by applying the principles listed below, our brains are freed up to solve the actual problem we want to solve. With this as our backdrop, we’ll share some techniques for identifying bad habits that add to complexity in code as well as habits that can help us manage complexity.

Five habits for managing complexity

“One of the most important techniques for managing software complexity is to design systems so that developers only need to face a small fraction of the overall complexity at any given time.” (John Ousterhout)

1. Keep code clean

One common bad habit (or “code smell”) is leaving dead code in the codebase. Dead code is code which is never executed, or whose result is never used in any other computation; commented-out blocks are a common example. Dead code is yet another unrelated thing that we have to hold in our heads when coding. For example, compare these two code samples:

# bad example

df = get_data()
# do_other_stuff()
# do_some_more_stuff()
# do_so_much_stuff()
model = train_model(df)

# good example

df = get_data()
model = train_model(df)
See how much easier it is to read the second code sample?

Clean code practices have been written about extensively in several languages, including Python. We’ve adapted these clean code principles for the machine learning context, and you can find them in this clean-code-ml repo.

2. Use functions to abstract away complexity

Imagine you’re in a restaurant. You’re given a menu. Instead of telling you the names of the dishes, this menu spells out the recipe for each dish. For example, one such dish is:

What dinner menus look like without abstraction

It would have been easier for us if the menu hid all the steps in the recipe (i.e. the implementation details) and instead gave us the name of the dish (i.e. an interface, an abstraction of the dish). (Answer: that was lentil soup).

To illustrate this point, here’s a code sample from a notebook in Kaggle’s Titanic competition before and after refactoring to a function.

# bad example
pd.qcut(df['Fare'], q=4, retbins=True)[1] # returns array([0., 7.8958, 14.4542, 31.275, 512.3292])
df.loc[ df['Fare'] <= 7.90, 'Fare'] = 0
df.loc[(df['Fare'] > 7.90) & (df['Fare'] <= 14.454), 'Fare'] = 1
df.loc[(df['Fare'] > 14.454) & (df['Fare'] <= 31), 'Fare'] = 2
df.loc[ df['Fare'] > 31, 'Fare'] = 3
df['Fare'] = df['Fare'].astype(int)
df['FareBand'] = df['Fare']

# good example (after refactoring into functions)
df['FareBand'] = categorize_column(df['Fare'], num_bins=4)

By abstracting away the complexity into functions, we made our code readable, testable and reusable.
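As a reference point, here is a minimal sketch of what such a `categorize_column` helper could look like. The function name comes from the example above; this particular implementation, built on `pd.qcut`, is an assumption about how it might be written:

```python
import pandas as pd

def categorize_column(series: pd.Series, num_bins: int) -> pd.Series:
    """Bin a continuous column into integer quantile-based bands (0 .. num_bins-1)."""
    # labels=False makes qcut return the integer bin index instead of an Interval
    return pd.qcut(series, q=num_bins, labels=False)
```

With this helper, the five lines of threshold logic in the bad example collapse into a single, testable call.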

When we refactor to functions, our entire notebook can be simplified and made more elegant:

# bad example (a typical data science notebook)
# (See the original notebook for the full listing.)

# good example
df = impute_nans(df, categorical_columns=['Embarked'],
                     continuous_columns =['Fare', 'Age'])
df = add_derived_title(df)
df = encode_title(df)
df = add_is_alone_column(df)
df = add_categorical_columns(df)
X, y = split_features_and_labels(df)

# best example. Notice how this reads like a story
prepare_data = compose(impute_nans,
                       add_derived_title,
                       encode_title,
                       add_is_alone_column,
                       add_categorical_columns,
                       split_features_and_labels)
X, y = prepare_data(df)

Life is happier when you read code that tells you what it does, not how it does it

Our mental overhead is now drastically reduced. We’re no longer forced to process many lines of implementation details to understand the entire flow. Instead, the functions abstract away the complexity, telling us what they do and saving us from having to spend mental effort figuring out how they do it.
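The `compose` helper used in the "best example" above isn’t defined in the snippet. A minimal sketch, assuming it pipes data through the functions left to right, could look like this:

```python
from functools import reduce

def compose(*functions):
    """Return a function that pipes its input through `functions`, left to right."""
    def pipeline(data):
        return reduce(lambda acc, fn: fn(acc), functions, data)
    return pipeline

# usage: each step receives the previous step's output
clean = compose(str.strip, str.lower)
# clean("  Hello ") == "hello"
```

Libraries such as `toolz` and `funcy` ship similar composition utilities, so in practice you may not need to write your own.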

3. Smuggle code out of Jupyter notebooks as soon as possible

Sure, Jupyter notebooks are great for quick prototyping. But it’s where we tend to put many things — glue code, print statements, glorified print statements (df.describe() or df.plot()), unused import statements and even stack traces ( 🙈). Despite our best intentions, so long as the notebooks are there, mess tends to accumulate.

Notebooks are useful because they give us fast feedback, and that’s often what we want when we’re given a new dataset and a new problem. However, the longer the notebooks become, the harder it is to get feedback on whether our changes are working.

For instance, when we change a line of code, the only way to ensure that everything still works is to restart and re-run the entire notebook(s). We’re forced to take on the complexity of the whole codebase even though we just want to work on one small part of it.

In contrast, if we had extracted our code into functions and Python modules and had unit tests, the test runner would give us feedback on our changes in a matter of seconds, even with hundreds of functions.

The more code we have, the harder it is for Jupyter notebooks to give us fast feedback on whether everything is working as expected.

Hence, our goal is to move code out of notebooks into Python modules and packages as early as possible. That way, it can rest within the safe confines of unit tests and domain boundaries. This helps to manage complexity by providing a structure for organising code and tests logically, and makes it easier for us to evolve our ML solution.

So, how do we move code out of Jupyter notebooks? Assuming you already have your code in a Jupyter notebook, you can follow this process:

The refactoring cycle for Jupyter notebooks

The details of each step in this process can be found in the clean-code-ml repo.

4. Apply test-driven development

There is a myth that we cannot apply test-driven development (TDD) to machine learning projects. To us, this is simply untrue. In any machine learning project, most of the code is concerned with data transformations (e.g. data cleaning, feature engineering) and a small part of the codebase is actual machine learning. Such data transformations can be written as pure functions that return the same output for the same input, and as such, we can apply TDD and reap its benefits. For instance, TDD can help us break down big and complex data transformations into smaller bite-size problems that we can fit in our head, one at a time.
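To make this concrete, here is a hypothetical TDD-style unit test for one of the transformations named earlier. The `SibSp`/`Parch` column names come from the Titanic dataset, and the implementation shown is an assumption about how such a function might be written; in TDD, the test would be written first and the function made to pass it:

```python
import unittest
import pandas as pd

def add_is_alone_column(df: pd.DataFrame) -> pd.DataFrame:
    """Pure transformation: flag passengers travelling with no relatives aboard."""
    df = df.copy()  # avoid mutating the caller's dataframe
    # SibSp = siblings/spouses aboard, Parch = parents/children aboard
    df['IsAlone'] = (df['SibSp'] + df['Parch'] == 0).astype(int)
    return df

class TestAddIsAloneColumn(unittest.TestCase):
    def test_flags_passengers_with_no_relatives(self):
        df = pd.DataFrame({'SibSp': [0, 1, 0], 'Parch': [0, 0, 2]})
        result = add_is_alone_column(df)
        self.assertEqual(result['IsAlone'].tolist(), [1, 0, 0])
```

Because the function is pure, the test needs no fixtures beyond a three-row dataframe, and it runs in milliseconds.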

As for testing that the actual machine learning part of the code works as we expect it to, we can write functional tests to assert that the metrics of the model (e.g. accuracy, precision) are above our expected threshold. In other words, these tests assert that the model functions according to our expectations (hence the name, functional test). Here’s an example of such a test:

import unittest
from sklearn.metrics import precision_score, recall_score

from src.train import prepare_data_and_train_model

class TestModelMetrics(unittest.TestCase):
  def test_model_precision_score_should_be_above_threshold(self):
    model, X_test, Y_test = prepare_data_and_train_model()
    Y_pred = model.predict(X_test)

    precision = precision_score(Y_test, Y_pred)

    self.assertGreaterEqual(precision, 0.7)

  def test_model_recall_score_should_be_above_threshold(self):
    model, X_test, Y_test = prepare_data_and_train_model()
    Y_pred = model.predict(X_test)

    recall = recall_score(Y_test, Y_pred)

    self.assertGreaterEqual(recall, 0.6)
Example of an automated functional test for ML models

When we’ve written these unit tests and functional tests, we can make them run on a continuous integration (CI) pipeline whenever a team member pushes code. This will allow us to catch errors as soon as they are introduced into our codebase, and not a few days or weeks later.

5. Make small and frequent commits

For example, compare a large, sprawling code diff with a small, focused one. In which is it easier to tell which function we’re working on?

When we make small and frequent commits, we get the following benefits:

  • Reduced visual distractions and cognitive load.
  • We needn’t worry about accidentally breaking working code if it’s already been committed.
  • In addition to red-green-refactor, we can also red-red-red-revert. If we were to inadvertently break something, we can easily fall back to the latest commit, and try again. This saves us from wasting time undoing problems that we accidentally created when we were trying to solve the essential problem.

So, how small of a commit is small enough? Try to commit when there is a single group of logically related changes and passing tests. One technique is to look out for the word “and” in our commit message, e.g. “Add exploratory data analysis and split sentences into tokens and refactor model training code”. Each of these three changes could become its own logical commit. In this situation, you can use git add -p to stage code in smaller batches to be committed.


Source: towardsdatascience