Managing Your Reusable Python Code as a Data Scientist

Originally posted on kdnuggets.

Here are a few approaches that I have settled on for managing my own reusable Python code as a data scientist, presented from most to least general code use, and aimed at beginners.

There are lots of different approaches to managing your own code, and they differ depending on your requirements, personality, technical know-how, role, and numerous other factors. While a highly experienced developer may have an incredibly regimented method of organizing their code across multiple languages, projects, and use cases, a data analyst who rarely writes their own code may be much more ad hoc and lackadaisical, simply because nothing more is needed. There really is no right or wrong; it’s simply a matter of what works, and is appropriate, for you.

To be specific, what I’m referring to by “managing code” is how you organize, store, and recall the different pieces of code that you have written yourself and found useful as long-term additions to your programming toolbox. Programming is all about automating, so if you find that you are performing similar tasks repetitively, it only makes sense to automate the recall of the code associated with those tasks.

This is why you are already using third-party libraries. No need to re-implement a support vector machine code base from scratch every time you want to use it; instead, you make use of a library — perhaps Scikit-learn — and take advantage of the collective work of numerous folks perfecting some code over time.

Extending this idea to the personal programming sphere only makes sense. You may already be doing this (I hope you are), but if not, here are a few approaches that I have settled on for managing my own reusable Python code as a data scientist, presented from most to least general code use.

Full-blown libraries

This is the most general approach there is, and what could be argued is the most “professional”; however, this alone does not make it the right choice all the time.

If you find that you are using the same functionality in numerous use cases, and doing so regularly, this is the way to go. This also makes sense if the functionality you want to reuse is easily parameterizable; that is, the task can be handled over and over again by writing and calling a generalized function with variables you can define each time you call.

For example, I often want to find the nth occurrence of some substring in a string, and there is no Python standard library function for this. Thus, I have a simple piece of code that accepts a string, a substring, and the occurrence number n as input, and returns the position in the string at which this nth occurrence begins (lifted long ago from here).

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start
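To make the behavior concrete, here is a small check of the function on made-up strings (the sample values are illustrative only). Two details are worth knowing: the function returns -1 when there are fewer than n occurrences, and because each search resumes at start+len(needle), overlapping matches are not counted.

```python
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

sample = "a,b,c,d,e"  # made-up example string with four commas

first = find_nth(sample, ",", 1)    # 1
fourth = find_nth(sample, ",", 4)   # 7
missing = find_nth(sample, ",", 5)  # -1, there is no fifth comma

# Overlapping matches are skipped: the second "aa" in "aaaa" is
# reported at index 2, not index 1.
non_overlapping = find_nth("aaaa", "aa", 2)

print(first, fourth, missing, non_overlapping)
```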

Since I deal with a lot of text processing, I have collected this with numerous other text processing functions I regularly use into a library that resides on my computer and can be imported like any other Python library. The steps for creating a library are somewhat lengthy, though straightforward, so I will not cover them here, but this article is one of very many that do so well.

So now that I have a textproc library, I can import and use my find_nth function easily, and as often as I like, without having to copy and paste the function into every program that uses it.

from textproc import find_nth

segment = line[:find_nth(line, ',', 4)].strip()
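One caveat with slicing on the result: when the line has fewer than four commas, find_nth returns -1, and line[:-1] silently drops the last character instead of signalling a problem. Below is a minimal sketch of a guard. The helper name text_before_nth and the choice to fall back to the whole stripped line are assumptions for illustration, not part of the textproc library described here.

```python
def find_nth(haystack, needle, n):
    # Same function as in the article, repeated so this snippet runs on its own
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

def text_before_nth(line, needle, n):
    """Return the text before the nth needle, or the whole line if absent.

    Hypothetical helper; the fallback behavior is an assumption.
    """
    pos = find_nth(line, needle, n)
    if pos < 0:
        # Fewer than n occurrences: avoid slicing with -1
        return line.strip()
    return line[:pos].strip()

# Made-up sample lines
full = text_before_nth("id, name, city, zip, notes", ",", 4)
short = text_before_nth("id, name", ",", 4)

print(repr(full))   # 'id, name, city, zip'
print(repr(short))  # 'id, name'
```

Whether a short line should fall back to the full text, return an empty string, or raise an error depends on the data, which is one reason to keep the decision in a single place.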

Also, if I want to extend the library to add more functions, or change the existing find_nth code, I can do so in one spot and just re-import.

Project-specific shared scripts

Perhaps you don’t need a full-blown library, as the code you want to reuse doesn’t seem to have a use beyond the project you are currently working on, but you do need to reuse it within a specific project. In this case, you can place the functions together in one script, and simply import that script by name. It’s the poor woman’s library, but it is often just what is needed.

In my graduate work I had to write a lot of code related to unsupervised learning, specifically k-means clustering. I wrote what became functions for initializing centroids, computing distances between data points and centroids, recalculating centroids, and so on, often with several different algorithms for the same task. I soon found that keeping copies of these functions in each separate script was not optimal, so I moved them out into their own scripts to be imported. It worked nearly the same way as a library, but the process was path-specific, and was meant for this project only.
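As a rough sketch of what "import that script by name" looks like, the example below creates a small shared script and imports from it. The file name cluster_utils.py and both functions are hypothetical stand-ins, not the actual graduate code, and the script is written to a temporary folder only so the example is self-contained. In practice the file would simply sit next to the code that uses it.

```python
import sys
import tempfile
from pathlib import Path

# Stand-in for a project folder; normally this is just the directory
# that already holds your scripts.
project_dir = Path(tempfile.mkdtemp())

# cluster_utils.py is a hypothetical shared script for this one project.
(project_dir / "cluster_utils.py").write_text(
    "import math\n"
    "\n"
    "def euclidean(a, b):\n"
    "    # Straight-line distance between two points\n"
    "    return math.dist(a, b)\n"
    "\n"
    "def nearest_centroid(point, centroids):\n"
    "    # Index of the centroid closest to the point\n"
    "    return min(range(len(centroids)),\n"
    "               key=lambda i: euclidean(point, centroids[i]))\n"
)

# A script in the same folder as the running program is importable already;
# the explicit path entry is needed here only because of the temp folder.
sys.path.insert(0, str(project_dir))

from cluster_utils import euclidean, nearest_centroid

centroids = [(0.0, 0.0), (10.0, 10.0)]
print(euclidean((0, 0), (3, 4)))                # 5.0
print(nearest_centroid((8.0, 9.0), centroids))  # 1
```

This is the path-specific part: the import works only because Python can find the script's folder, which is why a script like this stays tied to its project.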

Soon I had scripts for different centroid initialization functions and distance computation functions, and for data-loading and processing functions as well. As this code all became more and more parameterized and generally useful, the code eventually made its way into a legitimate library.

This seems to be how things usually progress, at least in my experience: You write a function in your script that you need to use now, and you use it. The project expands, or you move on to a similar project, and you realize that same function would be handy to have now. So that function gets dropped down to a script of its own, and you import it to use. If this usefulness continues beyond the near term, and you find that function having more general and longer term use, that function now gets added to an existing library, or is the basis for a new one.

Importing simple scripts is also particularly useful when working with Jupyter notebooks. Given the ad hoc, exploratory, and experimental nature of much of what goes on in Jupyter notebooks, I’m not a fan of importing notebooks into other notebooks as modules. If I find that more than one notebook is making regular use of some code excerpt, that code gets dropped down into a script stored in the same folder, which then gets imported into the notebook(s). This approach makes much more sense to me, and it provides more stability: I know that a notebook which another notebook relies on is not being edited in a harmful manner.
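One practical wrinkle with this setup: Python caches imported modules, so after you edit the script, rerunning the import cell in a live notebook will not pick up the change. Below is a minimal sketch using the standard library's importlib.reload. The module name nb_helpers and its clean function are hypothetical, and the script is written to a temporary folder only so the example is self-contained.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the notebook's folder
folder = Path(tempfile.mkdtemp())
script = folder / "nb_helpers.py"  # hypothetical shared script
script.write_text("def clean(text):\n    return text.strip()\n")
sys.path.insert(0, str(folder))

import nb_helpers
before = nb_helpers.clean("  Hello  ")  # 'Hello'

# Simulate editing the script while the notebook kernel is still running
script.write_text("def clean(text):\n    return text.strip().lower()\n")

# Re-execute the module's source so the edit takes effect
importlib.invalidate_caches()
nb_helpers = importlib.reload(nb_helpers)
after = nb_helpers.clean("  Hello  ")  # 'hello'

print(before, after)
```

Note that names pulled in with "from nb_helpers import clean" keep pointing at the old function after a reload, so calling through the module name is the safer habit here. In Jupyter, the IPython autoreload extension (%load_ext autoreload followed by %autoreload 2) automates the same step.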

Task-specific templates

I find that I often perform some of the same tasks over and over again which do not lend themselves well to being parameterized, or which could be parameterized but with more effort than it is worth. In such cases, I employ code templating, or boiler-plating. This is much closer to the copying and pasting of code that I set out to avoid at the start of this article, but sometimes it’s the right choice.

For example, I often need to “listify,” for lack of a better word, the contents of a Pandas DataFrame. I could write a function that determines the number of columns, accepts as input the columns to use, and so on, but the output often needs to be tweaked as well, all of which makes writing a general function far too time consuming.

In this case, I just write up a script template that can easily be changed, and keep it handy in a folder of similar templates. Here’s an excerpt of listify_df, which goes from CSV file to Pandas DataFrame, to the desired HTML output.

import pandas as pd

# Read CSV file into dataframe
csv_file = 'data.csv'
df = pd.read_csv(csv_file)

# Iterate over df, creating numbered list entries
# (the closing </b> and </blockquote> tags are an assumption;
# tweak the markup to suit the output you need)
i = 1
for index, row in df.iterrows():
    entry = '<b>' + str(i) + \
            '. <a href="' + \
            row['url'] + \
            '">' + \
            row['title'] + \
            '</a></b>' + \
            '\n\n<blockquote>\n' + \
            row['description'] + \
            '\n</blockquote>\n'
    print(entry)
    i += 1

In this case, clear filenames and folder organization are helpful for managing these often useful snippets.
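To illustrate how such a template gets tweaked from one use to the next, here is one possible variation. It swaps pandas for the standard library's csv module so that it runs without dependencies, builds the entry with f-strings, and escapes the text for HTML. Those choices, the closing tags, and the sample rows are all assumptions for illustration and not the original listify_df.

```python
import csv
import io
from html import escape

# Made-up sample data standing in for data.csv
csv_text = """url,title,description
https://example.com/a,First post,Intro to the topic
https://example.com/b,Second post,A <deeper> look
"""

entries = []
# enumerate(..., start=1) replaces the manual counter
for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
    entry = (
        f'<b>{i}. <a href="{escape(row["url"])}">{escape(row["title"])}</a></b>'
        f'\n\n<blockquote>\n{escape(row["description"])}\n</blockquote>\n'
    )
    entries.append(entry)

print("\n".join(entries))
```

The point is less this particular version than how small the edit is: the template stays a dozen or so lines, and each use changes only the markup or the columns.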

Short one-liners and blocks

Lastly, there are a lot of repetitive snippets you probably type regularly. So why do you do that?

You should be making use of a text expansion tool to insert short “phrases” when needed. I use AutoKey to manage such short phrases, which are associated with trigger keywords and then inserted when those keywords are typed.

For example, do you import a lot of the same libraries for all of your projects of a particular type? I do. For instance, you could set up all of the imports you would need for working on a particular task by typing, say, #nlpimport which, once typed, is recognized as a trigger keyword and is replaced with the following:

import sys, requests

import numpy as np
import pandas as pd

import texthero
import scattertext as st

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

from datasets import load_metric, list_metrics
from transformers import pipeline
from fastapi import FastAPI

It should be noted that some IDEs have these capabilities. I, myself, generally use glorified text editors to code, and so AutoKey is necessary (and incredibly useful) in my case. If you have an IDE which takes care of this, great. The point is, you shouldn’t need to be typing these over and over all the time.

This has been an overview of approaching the management of your reusable Python code as a data scientist. I hope that you have found it useful.

Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master’s degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.

