Boost your efficiency and process Excel-files with Python

Load, transform, modify, and save Excel files with Python to improve your reporting processes

If you work with data, you will come into contact with Excel. Even if you don't use it yourself, your clients or colleagues will. Excel is great for what it was made for: spreadsheet calculations on smaller data sets. But I always hated the kind of Excel sheets with a million rows and hundreds of columns. These workbooks are slow and tend to crash after some calculations. So I started to use Python for handling large Excel files, which offers another big advantage: you create code that is reproducible and serves as documentation as well. Let's jump in!

Reading Excel files with Python

The file we want to process contains nearly 1 million rows and 16 columns.

pandas provides read_excel() to read Excel files into a DataFrame:

import pandas as pd
import numpy as np

df = pd.read_excel(r'...\Excel-Tutorial.xlsx')

As you can see, the data looks clean so far, but the column header seems to be wrong. A lot of Excel sheets contain headlines or other information to guide the reader. We can skip these parts and define a header row:

df = pd.read_excel(r'…\Excel-Tutorial.xlsx', header=[1]).reset_index()

The argument header=[1] specifies that we want to use the second row of the Excel sheet as the header. All previous rows are skipped.

Use Pandas to do some calculations

A typical question from the marketing department could be how much sales we had in the different countries in each year:
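A question like this maps directly to a pandas groupby. The snippet below is a minimal sketch with made-up sample data; the column names Country, Year, and Sales are assumptions, since the original workbook's schema is not shown:

```python
import pandas as pd

# Hypothetical sample standing in for the workbook's data
df = pd.DataFrame({
    'Country': ['DE', 'DE', 'US', 'US'],
    'Year': [2018, 2019, 2018, 2019],
    'Sales': [100.0, 150.0, 200.0, 250.0],
})

# Total sales per country and year
sales_per_country = df.groupby(['Country', 'Year'])['Sales'].sum()
print(sales_per_country)
```

The result is a Series indexed by (country, year), which is exactly the pivot-style summary the marketing department asked for.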

We finished this calculation in 86 ms. One big advantage of processing Excel files with Python is that any kind of calculation is much faster than in Excel itself. The more complex the operations, the greater the speed advantage.

Another requirement could be that the sales department needs the data for each country, grouped by year and category. Since they want to supply the data to the national markets, we have to save the calculations in different worksheets:
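One way to prepare one table per national market is to split the DataFrame by country and aggregate each part. This is a sketch with invented sample data; the column names Country, Year, Category, and Sales are assumptions:

```python
import pandas as pd

# Hypothetical sample standing in for the workbook's data
df = pd.DataFrame({
    'Country': ['DE', 'DE', 'US', 'US'],
    'Year': [2019, 2019, 2019, 2019],
    'Category': ['A', 'B', 'A', 'B'],
    'Sales': [10.0, 20.0, 30.0, 40.0],
})

# One aggregated DataFrame per national market,
# grouped by year and category
per_country = {
    country: grp.groupby(['Year', 'Category'], as_index=False)['Sales'].sum()
    for country, grp in df.groupby('Country')
}
```

Each value in per_country is a small DataFrame that can later be written to its own worksheet.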

Saving the results as Excel

In the next step, we want to save our results as Excel again to supply them to the sales and marketing departments. We will create a pd.ExcelWriter object and write the different worksheets:
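A minimal sketch of the multi-sheet save could look like this. The DataFrame contents, the sheet names, and the file name Sales_Report.xlsx are made up for illustration:

```python
import pandas as pd

# Assumed per-country results from the previous grouping step
sales_de = pd.DataFrame({'Year': [2018, 2019], 'Sales': [100.0, 150.0]})
sales_us = pd.DataFrame({'Year': [2018, 2019], 'Sales': [200.0, 250.0]})

# One worksheet per country in a single workbook;
# the writer saves the file when the with-block exits
with pd.ExcelWriter('Sales_Report.xlsx') as writer:
    sales_de.to_excel(writer, sheet_name='Germany', index=False)
    sales_us.to_excel(writer, sheet_name='USA', index=False)
```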

Easy, isn't it? Let's have a look at the newly created workbook:

As you can see, our DataFrames were saved correctly to the specified worksheets. After we send our great results to both departments, we receive a mail the next day: they ask for some formatting and visualization. Since we have to transform this kind of data every month, we decide to perform these tasks in Python as well.

Formatting and visualization

To add formatting and visualization, we have to create a writer object again:

As you can see, the first part of the code is the same as in the first example: we create a writer object. The xlsxwriter engine gives us access to Excel features such as charts and formatting. To use these features, we need the workbook object, workbook = writer.book, and the worksheet object, worksheet = writer.sheets['Sales_Sums']. In this example, we perform the modifications on our first sheet. We add a chart, specify the range for its data (=Sales_Sums!$B$2:$B$7) and insert it into our worksheet at cell A9.
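The chart step described above can be sketched as follows. The sheet name Sales_Sums and the ranges match the text, but the sample data and the file name Sales_Chart.xlsx are assumptions:

```python
import pandas as pd

# Hypothetical per-country sales sums (six rows, matching B2:B7)
sales = pd.DataFrame(
    {'Sales': [100.0, 150.0, 200.0, 250.0, 300.0, 350.0]},
    index=['AT', 'CH', 'DE', 'FR', 'UK', 'US'],
)

with pd.ExcelWriter('Sales_Chart.xlsx', engine='xlsxwriter') as writer:
    sales.to_excel(writer, sheet_name='Sales_Sums')
    workbook = writer.book
    worksheet = writer.sheets['Sales_Sums']

    # Column chart over the sales values in B2:B7,
    # with the country labels in A2:A7 as categories
    chart = workbook.add_chart({'type': 'column'})
    chart.add_series({
        'values': '=Sales_Sums!$B$2:$B$7',
        'categories': '=Sales_Sums!$A$2:$A$7',
        'name': 'Sales',
    })
    worksheet.insert_chart('A9', chart)
```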

In the same way, we add formatting for our sales data: a 3-color scale on the range B2:B7 visually highlights low and high values. We also adjust the width of the first and second columns with worksheet.set_column(0, 1, 30), format the column header for our sales data, and rename it to 2019 Sales Data. In a last step, we save the file:
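These formatting steps could be sketched like this. The cell ranges and column widths follow the text; the sample data and the file name Sales_Formatted.xlsx are made up:

```python
import pandas as pd

# Hypothetical per-country sales sums (six rows, matching B2:B7)
sales = pd.DataFrame(
    {'Sales': [100.0, 150.0, 200.0, 250.0, 300.0, 350.0]},
    index=['AT', 'CH', 'DE', 'FR', 'UK', 'US'],
)

with pd.ExcelWriter('Sales_Formatted.xlsx', engine='xlsxwriter') as writer:
    sales.to_excel(writer, sheet_name='Sales_Sums')
    workbook = writer.book
    worksheet = writer.sheets['Sales_Sums']

    # 3-color scale highlights low and high values in B2:B7
    worksheet.conditional_format('B2:B7', {'type': '3_color_scale'})

    # Widen the first two columns to 30 characters
    worksheet.set_column(0, 1, 30)

    # Rewrite the header cell with the new title in bold
    header_fmt = workbook.add_format({'bold': True, 'align': 'center'})
    worksheet.write('B1', '2019 Sales Data', header_fmt)
```

The file is saved automatically when the with-block exits.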

This result is much better and provides a big advantage over Excel: we can reproduce exactly the same file next month with one click.

Conclusion

Python is great for processing Excel files. You can handle large files much more easily, you create reproducible code, and you provide documentation for your colleagues. We also saw that we have easy access to advanced Excel features from Python. You could automate your whole reporting process.

Source: towardsdatascience