Software development tools for staying organised and keeping quality high
There are many online lists of the software and packages used in Data Science. Pandas, NumPy and Matplotlib are always featured, as are the machine learning libraries scikit-learn and TensorFlow.
However, just as important are some less DS-specific software development tools that should be part of your workflow on every project.
Git
Version control is a necessity on any coding project, with Data Science being no exception. Keeping track of who did what when, and having a comprehensive history of working, tested code is invaluable in projects of any scale, but especially when collaborating with others.
Git keeps track of changes made to your code-base, maintaining an annotated audit trail of code versions. Code can be cloned from a master repository for development, then changes are committed with comments and pushed back after adequate testing. You can switch between versions at any time and create branches to segregate the development of major features before merging them back into the master branch.
This allows multiple people to work simultaneously on different features of the same code-base, without risking overwriting someone else’s work. Depending on the structure of your team and the size of your project, you may want to put some policies in place around testing and code reviews before merging.
GitHub (owned by Microsoft) is probably the most popular Git repository hosting service, but alternatives exist such as GitLab, BitBucket and Launchpad. These each offer a varying array of management tools to help further organise your project, such as roadmaps and issue tracking.
A great tutorial can be found here, but nothing beats practice, so make sure that for your next project you’re taking full advantage of version control and all its features.
Conda
Managing your environment is crucial for making code reproducible. Nobody wants the headache of cloning a Git repository and then spending hours fixing compatibility errors and installing and re-installing modules. Conda handles this through Virtual Environments, onto which you can install the software you need, before easily switching between them. This allows you to keep your environments minimalist and clean, including only the packages that you need for your current project.
Anaconda Navigator is a cross-platform GUI which makes managing VEs incredibly easy. Simply create a new environment, search for the packages you want to use and hit “Apply” to add them. From there you can open that environment in a terminal, a plain Python session or a Jupyter notebook.
I often have to work across multiple machines and operating systems and so the ability to quickly and easily install the tools I need is incredibly useful and allows me to avoid moving data around.
If you’re using a terminal, Conda offers more flexibility at the expense of learning a few commands. This includes exporting environments as .yml files, which can be imported elsewhere. Saving these environment files in your Git repository means that it’s easy for others to replicate your environment before attempting to run your code.
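For example, a minimal terminal workflow might look like this (the environment name and packages are placeholders):

```bash
# Create an environment containing only what this project needs
conda create -n my-project python=3.9 pandas scikit-learn jupyter

# Switch into it while you work, and back out when you're done
conda activate my-project
conda deactivate

# Export the environment so it can be committed alongside the code
conda env export > environment.yml

# Recreate it on another machine from that file
conda env create -f environment.yml
```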
A tutorial is available here, and a good explanation of a workflow using Git and Conda is detailed here by Tim Hopper.
Unittest
Brian Kernighan once wrote:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
Everyone knows the frustrating feeling of realising halfway through a project that things are much more complicated than you had anticipated. The 80/20 rule often holds: you spend 20% of the time getting 80% of your work done, and then it’s the last 20% that causes all the headaches. Data Science projects can quickly spiral upwards in complexity, and keeping your head across all the moving parts can be tricky.
Chopping up your project into smaller, simpler components is therefore essential for collaborating and creating robust code without getting lost in the complexity. When working on a small, self-contained section of code, unit testing gives you the confidence that it is working as expected. This lets you forget about how it works and care only about what it does, a concept known as “Black Box Abstraction”.
Python ships with the unittest framework in its standard library. Using it, you can specify as many tests as you wish, with assertion methods that check the output of your functions. The tests can be run as a script, making it easy to check, and keep checking, the correctness of your code as you edit.
For example, let’s say you were doing some feature engineering on the “Cabin” field of the famous Titanic dataset. You might want to write a function extracting the deck letter “C” from a cabin value “C38”. After writing the function, you specify a number of test cases together with the expected outputs. If these unit tests all pass, you can be confident you’ve written the function correctly.
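A minimal sketch of what that might look like with unittest (the function, its behaviour and the test cases here are purely illustrative):

```python
import unittest


def extract_deck(cabin):
    """Return the deck letter from a cabin value such as 'C38'."""
    if not cabin:                      # handle missing cabin values
        return None
    return cabin.strip()[0].upper()    # the deck is the leading letter


class TestExtractDeck(unittest.TestCase):
    def test_simple_cabin(self):
        self.assertEqual(extract_deck("C38"), "C")

    def test_lowercase_cabin(self):
        self.assertEqual(extract_deck("e12"), "E")

    def test_missing_cabin(self):
        self.assertIsNone(extract_deck(None))


if __name__ == "__main__":
    unittest.main()
```

Running the file executes every test and reports any failures, so re-running the suite after each edit takes seconds.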
Imagine further that, later on, you wish to add the ability to extract the cabin number “38” as well. In editing the code, you don’t want to break the original functionality. Having those unit tests handy to check that nothing has been broken (and using Git to revert your changes if it has) lets you edit code without risking damage to things you’ve built previously.
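Continuing the sketch above, you could add a separate helper for the number, with new tests alongside the old ones; the existing TestExtractDeck cases keep running unchanged, so a failure there immediately tells you the edit broke something (again, names and behaviour are illustrative):

```python
import unittest


def extract_cabin_number(cabin):
    """Return the numeric part of a cabin value such as 'C38', if there is one."""
    if not cabin:
        return None
    digits = "".join(ch for ch in cabin if ch.isdigit())
    return int(digits) if digits else None


class TestExtractCabinNumber(unittest.TestCase):
    def test_simple_cabin(self):
        self.assertEqual(extract_cabin_number("C38"), 38)

    def test_deck_only_cabin(self):
        # some cabin values are just a deck letter with no number
        self.assertIsNone(extract_cabin_number("D"))

    def test_missing_cabin(self):
        self.assertIsNone(extract_cabin_number(None))


if __name__ == "__main__":
    unittest.main()
```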
Testing in Data Science comes with its own set of challenges. The underlying data your code relies on may change between tests, making it hard to reproduce unexpected behaviour, and the size of the data may make regular testing prohibitively time-consuming. A good solution is to use a small, static subset of the data for testing purposes. This allows you to definitively debug any strange behaviour and run your code in a few seconds. You can also add invented data points to test how your code responds to edge cases.
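One way to set this up (the rows and values below are invented for illustration) is to build a tiny, fixed dataset in the test’s setUp and reuse it everywhere:

```python
import unittest

import pandas as pd


def extract_deck(cabin):
    """Deck letter from a cabin value such as 'C38', as in the sketch above."""
    return cabin.strip()[0].upper() if isinstance(cabin, str) and cabin.strip() else None


class TestCabinFeatures(unittest.TestCase):
    def setUp(self):
        # A small, static, hand-written subset of the data: it loads instantly
        # and never changes underneath you between test runs.
        self.df = pd.DataFrame(
            {
                "PassengerId": [1, 2, 3, 4],
                "Cabin": ["C38", "e12", None, "D"],  # invented rows, including edge cases
            }
        )

    def test_deck_extracted_for_every_row(self):
        decks = self.df["Cabin"].map(extract_deck)
        self.assertEqual(list(decks), ["C", "E", None, "D"])


if __name__ == "__main__":
    unittest.main()
```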
More details on how to use unittest can be found here.