Originally posted on towardsdatascience.
Finding bottlenecks and optimizing performance using cProfile
If you’ve ever written a line of code (or even tens of thousands of lines), you’ve surely wondered “Why does my code take so long to run?” Answering that question isn’t always simple, but it can be easier if you search for answers the right way.
Perhaps you trust your knowledge of the problem at hand, using your subject matter expertise to check certain snippets first. Maybe you time a few different modules/classes/functions to see where most of the execution time is spent. Better yet, you can profile your code to get more information about the relative time spent in different functions and sub-functions. Whatever your process is, this blog might teach you some ways to get to the answer more quickly.
In this post, I’ll start off by showing you a basic profiling method. I’ll add more and more features and flavor to it, ending with a good, solid profiling decorator. For those in a hurry (or who want to refer back to this material), head over to this GitHub repository where you can find the profile decorator and an example.
Time It!
To profile your code, you’re gonna need to know how to time it. To do so, you can use something as simple as this:
from time import time

start = time()
# your script here
end = time()
print(f'It took {end - start} seconds!')
To easily time several functions/snippets (or if you just prefer to use a cleaner and more pythonic approach), you can convert the above into a timer decorator (discussed with an example here).
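For reference, here's a minimal sketch of such a timer decorator (the name timer and the print format are just illustrative):

from functools import wraps
from time import time

def timer(func):
    """Print the wall-clock run time of func on every call."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time()
        result = func(*args, **kwargs)
        print(f'{func.__name__} took {time() - start:.4f} seconds')
        return result
    return wrapper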
Using a timer on any function shows us the run time of that piece in isolation. For this approach to be useful in finding the bottleneck, we need more information. At least one of the following two conditions should be true in order to profile your code effectively:
- We should know the total run time of the program to have a better picture of the relative run time of our desired function/section. For example, if a piece of code takes 5 minutes to execute, is that 10%, 40%, or 90% of the total run time?
- We should know enough about the problem at hand or the run time of other parts of the program to confidently label a given piece of code as the bottleneck. Even if a function takes 10 minutes to run (let’s assume 10 minutes is relatively high), we should only worry about its inefficiency if we’re confident that there’s no other part that takes longer. As Donald Knuth famously said:
Premature optimization is the root of all evil.
A profiler package like cProfile helps us find the bottlenecks in our code by satisfying both of these conditions.
How to Use cProfile
Basic Usage
The most basic way of profiling with cProfile is using the run() function. All you need to do is pass what you want to profile as a string statement to run(). This is an example of what the report looks like (truncated for brevity):
>>> import cProfile
>>> import pandas as pd
>>> cProfile.run("pd.Series(list('ABCDEFG'))")
         258 function calls (256 primitive calls) in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
4 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 _dtype.py:319(_name_get)
....
11/9 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {built-in method numpy.array}
1 0.000 0.000 0.000 0.000 {built-in method numpy.empty}
....
The first line indicates that 258 calls were monitored, out of which 256 were primitive (a primitive call is one that was not induced via recursion). The next line, Ordered by: standard name, indicates that the report is sorted based on the standard name, which is the text in the filename:lineno(function) column. The line after that is the column headings:
- ncalls: the number of calls made. When there are two numbers (like 11/9 above), the function recursed; the first value is the total number of calls and the second is the number of primitive (non-recursive) calls.
- tottime: the total time spent in the given function, excluding time spent in calls to sub-functions.
- percall: the quotient of tottime divided by ncalls.
- cumtime: the cumulative time spent in this function and all sub-functions. This figure is accurate even for recursive functions.
- percall: the quotient of cumtime divided by primitive calls.
- filename:lineno(function): provides the respective data of each function.
The run() function can accept two more arguments: a filename to write the results to a file instead of stdout, and a sort argument that specifies how the output should be sorted. You can check the documentation to learn more about valid sort values. Some of the common ones are 'cumulative' (for cumulative time), 'time' (for total time), and 'calls' (for the number of calls).
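For example, using the documented run() signature (the statement is the same pandas one used above; the file name series.prof is just illustrative):

import cProfile
import pandas as pd

# Print the report to stdout, sorted by cumulative time
cProfile.run("pd.Series(list('ABCDEFG'))", sort='cumulative')

# Dump the raw profiling data to a file instead of printing a report
cProfile.run("pd.Series(list('ABCDEFG'))", filename='series.prof')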
If you pass a filename and save the results, you may notice that the output file is not human-readable. In this case, you need to use the pstats.Stats class to format the results, which we'll discuss next.
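As a quick preview, pstats.Stats can load such a file and print a readable report (a sketch, assuming the series.prof file from above):

import pstats

# Load the dumped stats, sort by cumulative time, and print the top 10 lines
ps = pstats.Stats('series.prof')
ps.sort_stats('cumulative').print_stats(10)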
More Control With Profile and pstats.Stats
Although cProfile.run() can be sufficient in most cases, if you need more control over profiling, you should use the Profile class of cProfile. The snippet below is taken from the Profile class documentation.
import cProfile, pstats, io
from pstats import SortKey
pr = cProfile.Profile()
pr.enable()
# ... do something ...
pr.disable()
s = io.StringIO()
sortby = SortKey.CUMULATIVE
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print(s.getvalue())
Let’s go through this line by line:
First, we create an instance of the Profile class (pr) and collect profiling data by calling enable. When we want to stop collecting profiling data, we call disable. What comes after that formats the collected statistics. We can then use the pstats.Stats class constructor to create an instance of the statistics object (ps).
The Stats class can create a statistics object from a profile object (pr) and print the output to the stream that's passed to it. The Stats class also has a sort_stats method that sorts the results based on the provided criteria. In this case, the criterion is SortKey.CUMULATIVE, which is for the cumulative time spent in a function. As described in the sort_stats documentation, the sorting criteria can be in the form of a SortKey enum (added in Python 3.7) or a string (i.e. using 'cumulative' instead of SortKey.CUMULATIVE is also valid). Finally, the results are created and printed to standard output.
This is good boilerplate with which to build a decorator (to learn more about decorators, I recommend this great and comprehensive primer):
import cProfile
import io
import pstats
from pstats import SortKey

def profile(func):
    def wrapper(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        retval = func(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        sortby = SortKey.CUMULATIVE  # 'cumulative'
        ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        ps.print_stats()
        print(s.getvalue())
        return retval
    return wrapper

# Profile foo
@profile
def foo():
    print('Profile me!')
Though it’s a good start, there are several pieces in the above decorator that can be made better. In particular, a more comprehensive function could:
- Allow users to specify sorting keys (whether a single key or a tuple of sorting keys)
- Only show the biggest time consumers rather than all the lines
- Clean up the report by removing all the leading path information from the file names, reducing the size of the printout and making the report easier to read
- Save the output to a file rather than printing to stdout
All of the above conditions can be handled by using the methods of pstats.Stats. Let's go over them in order:
- Pass the desired sort keys to the sort_stats() method (though we still need to check whether it's a single value or a tuple)
- Pass the max desired number of printed lines to print_stats()
- Use strip_dirs()
- By default, the output is printed to sys.stdout because pstats.Stats() uses that as its default stream argument, but to write the results to a file, we can pass a file object as the stream
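Put together, these calls look something like this (a sketch, assuming pr is a Profile object as in the earlier snippet; the Stats methods return the instance, so they can be chained):

import pstats

# Write a stripped, sorted, truncated report to a file instead of stdout
with open('report.txt', 'w') as f:
    ps = pstats.Stats(pr, stream=f)
    ps.strip_dirs().sort_stats('cumulative').print_stats(10)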
Applying the aforementioned changes to the profile decorator results in the following function (more explanation of the arguments is provided in the docstring):
import cProfile
import pstats
from functools import wraps

def profile(output_file=None, sort_by='cumulative', lines_to_print=None,
            strip_dirs=False):
    """A time profiler decorator.

    Inspired by and modified from the profile decorator of Giampaolo Rodola:
    http://code.activestate.com/recipes/577817-profile-decorator/

    Args:
        output_file: str or None. Default is None
            Path of the output file. If only the name of the file is given,
            it's saved in the current directory.
            If it's None, the name of the decorated function is used.
        sort_by: str or SortKey enum or tuple/list of str/SortKey enum
            Sorting criteria for the Stats object.
            For a list of valid strings and SortKeys refer to:
            https://docs.python.org/3/library/profile.html#pstats.Stats.sort_stats
        lines_to_print: int or None
            Number of lines to print. Default (None) is for all the lines.
            This is useful in reducing the size of the printout, especially
            when sorting by 'cumulative', since the time-consuming operations
            are printed toward the top of the file.
        strip_dirs: bool
            Whether to remove the leading path info from file names.
            This is also useful in reducing the size of the printout.

    Returns:
        Profile of the decorated function
    """

    def inner(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            _output_file = output_file or func.__name__ + '.prof'
            pr = cProfile.Profile()
            pr.enable()
            retval = func(*args, **kwargs)
            pr.disable()
            pr.dump_stats(_output_file)

            with open(_output_file, 'w') as f:
                ps = pstats.Stats(pr, stream=f)
                if strip_dirs:
                    ps.strip_dirs()
                if isinstance(sort_by, (tuple, list)):
                    ps.sort_stats(*sort_by)
                else:
                    ps.sort_stats(sort_by)
                ps.print_stats(lines_to_print)
            return retval

        return wrapper

    return inner
Wrap-up Example
To see how the profile decorator can help us detect bottlenecks, let's create an example with multiple functions. Assume you have a list of products sold in the past year, and you want to know each product's popularity by counting units sold. (What we want can be easily done using collections.Counter, or if you're a pandas user, by pandas.Series.value_counts, but let's forget about those for the sake of this exercise.)
import random

random.seed(20)

def create_products(num):
    """Create a list of random products with 3-letter alphanumeric names."""
    return [''.join(random.choices('ABCDEFG123', k=3)) for _ in range(num)]
# version1
@profile(sort_by='cumulative', lines_to_print=10, strip_dirs=True)
def product_counter_v1(products):
"""Get count of products in descending order."""
counter_dict = create_counter(products)
sorted_p = sort_counter(counter_dict)
return sorted_p
def create_counter(products):
counter_dict = {}
for p in products:
if p not in counter_dict:
counter_dict[p] = 0
counter_dict[p] += 1
return counter_dict
def sort_counter(counter_dict):
return {k: v for k, v in sorted(counter_dict.items(),
key=lambda x: x[1],
reverse=True)}
# ===========
# Analysis starts here
# ===========
num = 1_000_000 # assume we have sold 1,000,000 products
products = create_products(num)
# Let's add profile decorator to product_counter_v1 function
counter_dict = product_counter_v1(products)
The result will be saved in product_counter_v1.prof in your current directory, and it should look like this:
1007 function calls in 0.228 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.228 0.228 scratch_6.py:69(product_counter_v1)
1 0.215 0.215 0.215 0.215 scratch_6.py:86(create_counter)
1 0.000 0.000 0.013 0.013 scratch_6.py:105(sort_counter)
1 0.013 0.013 0.013 0.013 {built-in method builtins.sorted}
1 0.000 0.000 0.000 0.000 scratch_6.py:106(<dictcomp>)
1000 0.000 0.000 0.000 0.000 scratch_6.py:107(<lambda>)
1 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Examining this file, we can see that the vast majority of the execution time (0.215 of the 0.228 seconds) is spent in the create_counter function. So, let's make a new create_counter function and check its effects.
# version2
@profile(sort_by='cumulative', lines_to_print=10, strip_dirs=True)
def product_counter_v2(products):
"""Get count of products in descending order."""
counter_dict = create_counter_v2(products)
sorted_p = sort_counter(counter_dict)
return sorted_p
def create_counter_v2(products):
counter_dict = {}
for p in products:
try:
counter_dict[p] += 1
except KeyError:
counter_dict[p] = 1
return counter_dict
The result of profiling product_counter_v2 is shown below:
1007 function calls in 0.169 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.169 0.169 scratch_6.py:78(product_counter_v2)
1 0.158 0.158 0.158 0.158 scratch_6.py:95(create_counter_v2)
1 0.000 0.000 0.010 0.010 scratch_6.py:105(sort_counter)
1 0.010 0.010 0.010 0.010 {built-in method builtins.sorted}
1 0.000 0.000 0.000 0.000 scratch_6.py:106(<dictcomp>)
1000 0.000 0.000 0.000 0.000 scratch_6.py:107(<lambda>)
1 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
The run time decreased from 0.228 to 0.169 seconds, a ~26% reduction. If this is still not satisfactory, you can try using collections.defaultdict to create the counter_dict, as sketched below.
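That intermediate version isn't part of the original example or its profiling results, but a minimal sketch could look like this:

import collections

def create_counter_defaultdict(products):
    """Count products using defaultdict(int), which returns 0 for missing keys."""
    counter_dict = collections.defaultdict(int)
    for p in products:
        counter_dict[p] += 1  # no membership check or try/except needed
    return counter_dict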
For the last implementation, we'll use collections.Counter:
import collections
# version3
@profile(sort_by='cumulative', lines_to_print=10, strip_dirs=True)
def product_counter_v3(products):
"""Get count of products in descending order."""
return collections.Counter(products)
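Note that a Counter preserves insertion order rather than count order, so strictly speaking the result is not sorted. If you need the counts in descending order like the earlier versions, Counter.most_common provides that (a small sketch, not part of the profiled code below):

# most_common() returns (product, count) pairs sorted by count, descending
sorted_counts = dict(collections.Counter(products).most_common())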
Profiling product_counter_v3 shows that it yields a 62% improvement over product_counter_v1.
11 function calls in 0.086 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.086 0.086 scratch_6.py:118(product_counter_v3)
1 0.000 0.000 0.086 0.086 __init__.py:517(__init__)
1 0.000 0.000 0.086 0.086 __init__.py:586(update)
1 0.086 0.086 0.086 0.086 {built-in method _collections._count_elements}
1 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
1 0.000 0.000 0.000 0.000 abc.py:180(__instancecheck__)
2 0.000 0.000 0.000 0.000 _weakrefset.py:70(__contains__)
2 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Play around with different arguments of the profile decorator and see how the output changes. You can find the complete example in this GitHub repository.
Happy profiling!