Python Profiling with line_profiler

line_profiler shows the amount of time taken to execute each line in a function of your code, which makes it an important tool for reducing a program's runtime. The command "pip install line_profiler" installs the package; use "conda install line_profiler" to install it into an existing conda environment.

I will demonstrate line_profiler on randomly generated data, calculating the ratio of supply to demand for releases from a reservoir. Demand, or target supply, is defined for each day of the water year. The following code first defines the calculation of day of water year, generates random data for demand and supply, and then defines two functions that calculate the supply-to-demand ratio in different ways. Add the @profile decorator on the line before a function definition to profile the execution of each line in that function.

import pandas as pd
import numpy as np
from line_profiler import profile

# Function to calculate day of water year (a water year starts on October 1)
def get_dowy(date):
    water_year_start = pd.Timestamp(year=date.year, month=10, day=1)
    if date < water_year_start:
        water_year_start = pd.Timestamp(year=date.year - 1, month=10, day=1)
    return (date - water_year_start).days + 1

# Generate random data for demand for each day of the water year
np.random.seed(0)
data = {
    'Median_Demand': np.random.randint(0, 1000, 367),
}

# Create dataframe
df_demand = pd.DataFrame(data)

# Generate random data for supply from 2001-10-01 to 2091-09-30 and also define the corresponding day of water year
date_range = pd.date_range(start='2001-10-01', end='2091-09-30', freq='D')
data = {
    'dowy': [get_dowy(date) for date in date_range],
    'Supply': np.random.uniform(0, 2500, len(date_range))
}
# Create dataframe
df_supply = pd.DataFrame(data, index=date_range)

@profile  # decorate the function to profile it line by line
def calc_supply_demand_1(df, df_median):
    ratio = pd.DataFrame()
    medians_dict = df_median['Median_Demand'].to_dict()
    demand = df['dowy'].map(medians_dict)
    supply = df['Supply']
    ratio = supply.resample('AS-OCT').sum() / demand.resample('AS-OCT').sum()
    return ratio

@profile
def calc_supply_demand_2(df, df_median):
    ratio = pd.DataFrame()
    medians_dict = df_median['Median_Demand'].to_dict()
    demand = pd.Series([df_median['Median_Demand'][i] for i in df.dowy], index=df.index)
    supply = df['Supply']
    ratio = supply.resample('AS-OCT').sum() / demand.resample('AS-OCT').sum()
    return ratio

ratio1 = calc_supply_demand_1(df_supply, df_demand)
ratio2 = calc_supply_demand_2(df_supply, df_demand)

Running the script by itself won't output anything related to line_profiler. To enable profiling, run the script with the environment variable LINE_PROFILE set to 1:

LINE_PROFILE=1 python Blog_Post.py
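
Alternatively, the kernprof script that ships with line_profiler can run the profiler without setting an environment variable; to the best of my knowledge, the equivalent invocation (with -l for line-by-line profiling and -v to print the results on exit) is:

kernprof -lv Blog_Post.py

This writes Blog_Post.py.lprof, which can be inspected later with "python -m line_profiler Blog_Post.py.lprof".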

Running with LINE_PROFILE=1 generates three output files, profile_output.txt, a timestamped profile_output_<timestamp>.txt, and profile_output.lprof, and prints the following to stdout:

Timer unit: 1e-09 s

0.04 seconds - /directory/Blog_Post.py:30 - calc_supply_demand_1
2.43 seconds - /directory/Blog_Post.py:39 - calc_supply_demand_2
Wrote profile results to profile_output.txt
Wrote profile results to profile_output_2024-03-29T192919.txt
Wrote profile results to profile_output.lprof
To view details run:
python -m line_profiler -rtmz profile_output.lprof

Executing "python -m line_profiler -rtmz profile_output.lprof" prints the following:

Timer unit: 1e-06 s

Total time: 0.0393394 s
File: /directory/Blog_Post.py
Function: calc_supply_demand_1 at line 30

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    30                                           @profile
    31                                           def calc_supply_demand_1(df, df_median):
    32         1      2716.4   2716.4      6.9       ratio = pd.DataFrame()
    33         1      1365.2   1365.2      3.5       medians_dict = df_median['Median_Demand'].to_dict()
    34         1      3795.6   3795.6      9.6       demand = df['dowy'].map(medians_dict)
    35         1       209.7    209.7      0.5       supply = df['Supply']
    36         1     31252.0  31252.0     79.4       ratio = supply.resample('AS-OCT').sum() / demand.resample('AS-OCT').sum()
    37         1         0.5      0.5      0.0       return ratio

Total time: 2.43446 s
File: /directory/Blog_Post.py
Function: calc_supply_demand_2 at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    39                                           @profile
    40                                           def calc_supply_demand_2(df, df_median):
    41         1      1365.1   1365.1      0.1       ratio = pd.DataFrame()
    42         1       697.5    697.5      0.0       medians_dict = df_median['Median_Demand'].to_dict()
    43         1   2411800.5    2e+06     99.1       demand = pd.Series([df_median['Median_Demand'][i] for i in df.dowy], index=df.index)
    44         1        53.9     53.9      0.0       supply = df['Supply']
    45         1     20547.0  20547.0      0.8       ratio = supply.resample('AS-OCT').sum() / demand.resample('AS-OCT').sum()
    46         1         0.6      0.6      0.0       return ratio

0.04 seconds - /directory/Blog_Post.py:30 - calc_supply_demand_1
2.43 seconds - /directory/Blog_Post.py:39 - calc_supply_demand_2

The result shows the line number, the number of hits (the number of times the line is executed; hits exceed one when the line runs inside a loop), the total time, the time per hit, the percentage of total time, and the line contents. For the first function, 79.4% of the time is spent computing ratio, whereas for the second function 99.1% of the time is spent constructing demand. ratio1 and ratio2 are identical; the two functions differ only in how demand is computed. We also see that calc_supply_demand_1 takes 0.04 seconds while calc_supply_demand_2 takes 2.43 seconds. By using line_profiler to identify that the demand calculation consumes 99.1% of the time in calc_supply_demand_2, I was able to reduce the runtime by a factor of about 61 (2.43/0.04). Another option is cProfile (details are in this blog post), which gives more detailed information.
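
The broader lesson is that per-element lookups in a Python-level loop should be replaced with a vectorized lookup wherever possible. Here is a minimal, self-contained sketch of the two patterns (the array sizes are arbitrary):

import pandas as pd
import numpy as np

dowy = pd.Series(np.random.randint(1, 367, 30000))          # day-of-water-year values
median_demand = pd.Series(np.random.randint(0, 1000, 367))  # lookup table keyed by dowy

# Slow: one pandas indexing call per element inside a Python loop
slow = pd.Series([median_demand[i] for i in dowy], index=dowy.index)

# Fast: a single vectorized dictionary lookup with Series.map
fast = dowy.map(median_demand.to_dict())

assert slow.equals(fast)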

References:

https://kernprof.readthedocs.io/en/latest

https://researchcomputing.princeton.edu/python-profiling

Importing, Exporting and Organizing Time Series Data in Python – Part 1

This blog post is Part 1 of a multi-part series of posts (see here for Part 2) intended to introduce options in Python available for (1) reading (importing) data, with particular emphasis on time series data and how to handle Excel spreadsheets; (2) organizing time series data in Python, with emphasis on using the open-source data analysis library pandas; and (3) exporting/saving data from Python.

In modeling water resources and environmental systems, we frequently must import and export large quantities of data (particularly time series data), both to run models and to process model outputs. I will describe the approaches and tools I have found to be most useful for managing data in these circumstances.

This blog post focuses on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets. Future posts will cover the pandas data management/export component.

Through an example that follows, there are two main lessons I hope to convey in this post:

  1. If you can store data, especially time series data, in text (.txt) or comma separated value (.csv) files, the time required for importing and exporting data will be vastly reduced compared to attempting the same thing with an Excel workbook (.xlsx file). Hence, if you must work with an Excel workbook, I suggest saving each of its worksheets as a separate .csv file and using those instead (see the sketch after this list). A .csv file can still be viewed in the MS Excel interface, but is much faster to import. I'll show below how much time this can save you.
  2. There are many ways to analyze the data once loaded into Python, but my suggestion is you take advantage of pandas, an open-source data analysis library in Python. This is why my example below makes use of built-in pandas features for importing data. (I will have some future posts with some useful pandas code for doing time series analysis/plotting. There are other previous pandas posts on our blog as well).
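
As a quick sketch of the suggestion in lesson 1 (the workbook name Reservoir_1.xlsx is just an example), each worksheet of a workbook can be written out as its own .csv file with pandas:

import pandas as pd

# sheet_name=None reads every worksheet into a dictionary of DataFrames
sheets = pd.read_excel('Reservoir_1.xlsx', sheet_name=None)
for sheet_name, df in sheets.items():
    df.to_csv('Reservoir_1_%s.csv' % sheet_name, index=False)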

It is important at this point to note that Microsoft Excel files (.xlsx or .xls) are NOT an ideal means of storing data. I have had the great misfortune of working with Excel (and Visual Basic for Applications) intensively for many years, and it can be somewhat of a disaster for data management. I am using Excel only because the input data files for the model I work with are stored as .xlsx files, and because this is the data storage interface with which others using my Python-based model are most comfortable.

An Example:

Suppose that you wish to read in and analyze time series of 100 years of simulated daily reservoir releases from 2 reservoirs (“Reservoir_1” and “Reservoir_2”) for 100 separate simulations (or realizations). Let’s assume data for the reservoirs is stored in two separate files, either .csv or .xlsx files (e.g., “Reservoir_1.xlsx” and “Reservoir_2.xlsx”, or “Reservoir_1.csv” and “Reservoir_2.csv”).

The figure below shows a screenshot of an example layout of the file in either case (.csv or .xlsx). Feel free to make your own such file, with column names that correspond to the title of each simulation realization.

[Figure: Worksheet appearance]
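
If you would like to generate such a file yourself, here is a minimal sketch (the column names Realization1 through Realization100 and the uniform random values are assumptions chosen to match the layout above):

import pandas as pd
import numpy as np

n_days = 365 * 100  # 100 years of daily releases (ignoring leap days)
data = {'Realization%d' % (i + 1): np.random.uniform(0, 100, n_days) for i in range(100)}
df = pd.DataFrame(data)
df.to_csv('Reservoir_1.csv', index=False)
df.to_excel('Reservoir_1.xlsx', index=False)  # writing .xlsx requires an engine such as openpyxl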

There are two import options for these data.

Option 1: Import your data directly into a dictionary of two pandas DataFrame objects, where the keys of the Python dictionary are the two reservoir names. Here is some code that does this. All you need to do is run this script in a directory where you also have your input files saved.


# Imports
import pandas as pd
from datetime import datetime
from datetime import timedelta
import time

start_time = time.time()  # Keep track of import time

start_date = datetime(1910, 1, 1)
simulation_dates = pd.date_range(start_date, start_date + timedelta(365*100 - 1))  # 100 years of daily dates (ignoring leap days)

# Specify File Type (Choices: (1) 'XLSX', or (2) 'CSV')
File_Type = 'CSV'

location_list = ['Reservoir_1', 'Reservoir_2']  # Names of data files
start = 0  # first column index of the input file from which to import data
stop = 100  # import columns start through stop-1; must not exceed the number of columns in the input file

# Create a dictionary where keys are items in location_list, and content associated with each key is a pandas dataframe.
if File_Type == 'XLSX':  # use '==' to compare strings ('is' tests object identity and can fail)
    extension = '.xlsx'
    # sheet_name=0 reads the first worksheet as a single DataFrame (the old 'sheetname' argument is deprecated)
    Excel_data_dictionary = {name: pd.read_excel(name + extension, sheet_name=0, usecols=[i for i in range(start, stop)]) for name in
                             location_list}
    # Reset indices as desired dates
    for keys in Excel_data_dictionary:
        Excel_data_dictionary[keys] = Excel_data_dictionary[keys].set_index(simulation_dates)
    print('XLSX Data Import Complete in %s seconds' % (time.time() - start_time))
elif File_Type == 'CSV':
    extension = '.csv'
    CSV_data_dictionary = {name: pd.read_csv(name + extension, usecols=[i for i in range(start, stop)]) for name in location_list}
    # Reset indices as desired dates
    for keys in CSV_data_dictionary:
        CSV_data_dictionary[keys] = CSV_data_dictionary[keys].set_index(simulation_dates)
    print('CSV Data Import Complete in %s seconds' % (time.time() - start_time))

When I run this code, the .csv data import takes about 1.5 seconds on my office desktop computer (not bad), while the .xlsx import takes 100 seconds (awful!). This is why you should avoid storing data in Excel workbooks.

You can visualize the pandas DataFrame object per the image below, with a row index that corresponds to dates and columns that correspond to each simulation run (or realization). Note that I added code that manually sets the index to date values, as my input sheet had no actual dates listed in it.

[Figure: pandas DataFrame layout]

Now, if I want to see the data for a particular realization or date, pandas makes this easy. For example, the following would access the DataFrame column corresponding to realization 8 for Reservoir_1:

CSV_data_dictionary['Reservoir_1']['Realization8']
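
Because the index was set to dates, selection by date works as well; for example (the dates here are arbitrary ones within the simulated period):

df = CSV_data_dictionary['Reservoir_1']
df.loc['1950-01-01']                               # all realizations on a single date
df.loc['1950-01-01':'1950-12-31', 'Realization8']  # one realization over one year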

Option 2: If you must work with .xlsx files and need to import specific sheets and cell ranges from those files, I recommend that you use the open-source Python library OpenPyXL. (There are other options, some of which are reviewed here). OpenPyXL is nice because it can both read and write from and to Excel, it can loop through worksheets and cells within those worksheets, and does not require that you have Microsoft Office if you are working in Windows. Indeed it does not require Windows at all (it works well on Linux, though I have not tried it out in OSX). OpenPyXL documentation is available here, and includes numerous examples. This page also provides some nice examples of using OpenPyXL.

Some additional tips regarding OpenPyXL:

  • Make sure you are working with the latest version. If you install openpyxl through an IDE, be sure it’s updated to at least version 2.2.4 or 2.2.5. I’ve had problems with earlier versions.
  • I believe OpenPyXL only works with .xlsx files, not .xls files.
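
As a quick illustration (the worksheet name 'Sheet1' and the cell range here are hypothetical), reading a specific cell range with OpenPyXL might look like this:

from openpyxl import load_workbook

wb = load_workbook('Reservoir_1.xlsx', read_only=True)  # read-only mode is faster for large files
ws = wb['Sheet1']  # assumes the worksheet is named 'Sheet1'
values = [[cell.value for cell in row] for row in ws['A1':'C10']]  # cell values from A1 through C10
wb.close()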

Python Data Analysis Part 2: Pandas / Matplotlib Live Demo

This post is part of our series on Python.  Part 1 came in three parts: a, b, and c.  We also have a post on setting up python.  And on how to write a python job submission script. For all python fun, search “python” in the search box on the right.

Transcript

At yesterday’s group meeting, I promised a transcript of the live demo. I’m pasting it below with light edits for formatting.

import pandas
from matplotlib import pyplot  # plain ipython needs this import; the HTML notebook adds it automatically

metrics = pandas.read_table("metrics.txt")
metrics
metrics[0:40]
# Round NFE down to the nearest 3000 so runs can be grouped into bins
metrics["roundeddown"] = 3000 * ( metrics["NFE"] // 3000)
metrics[0:40]
grouped = metrics.groupby(["model", "roundeddown"])
mean = grouped.mean()
mean
mean.loc["response"]
# Keep only the SBX+PM column of the grouped data
grouped = grouped["SBX+PM"]
mean = grouped.mean()
mean
fig = pyplot.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.plot(mean.loc["response"].index.values,
        mean.loc["response"].values, color='r')
fig
ax.plot(mean.loc["gasp"].index.values,
        mean.loc["gasp"].values, color='b')
fig
# Summarize each bin with its mean and 10th and 90th percentiles
summary = pandas.DataFrame(
            data = {
                    "mean":grouped.mean(),
                    "tenth":grouped.quantile(0.1),
                    "ninetieth":grouped.quantile(0.9)
            }
          )
summary
fig = pyplot.figure(figsize=(10,12))
ax = fig.add_subplot(2,1,1)
index = summary.loc["gasp"].index.values
mean = summary.loc["gasp"]["mean"].values
tenth = summary.loc["gasp"]["tenth"].values
ninetieth = summary.loc["gasp"]["ninetieth"].values
ax.plot(index, mean,
        index, tenth,
        index, ninetieth,
        color='k')
fig
ax.fill_between(index, tenth, ninetieth, color='k', alpha=0.01)
fig
ax.fill_between(index, tenth, ninetieth, color='k', alpha=0.1)
fig
ax = fig.add_subplot(2,1,2)
index = summary.loc["response"].index.values
mean = summary.loc["response"]["mean"].values
tenth = summary.loc["response"]["tenth"].values
ninetieth = summary.loc["response"]["ninetieth"].values
ax.plot(index, mean,
        index, tenth,
        index, ninetieth,
        color='k')
fig
ax.fill_between(index, tenth, ninetieth, color='k', alpha=0.1)
fig
# Log-scale both panels and give them common axis limits
for ax in fig.axes:
    ax.set_yscale("log")
    ax.set_ylim(0.0001, 1.0)
    ax.set_xlim(0, 103000)
fig

Description

The transcript picks up where your homework left off. We have 50 seeds worth of data for two models:

[Figure: sbx — 50 seeds of data for the two models]

And we want to produce a summary plot:

[Figure: SBX_filled — the summary plot]

You should be able to use text-based ipython on the cluster without any additional setup. Note that the transcript above begins with from matplotlib import pyplot: the HTML notebook I used for the demo imports pyplot automatically, but in a plain ipython session you need that import before any of the plotting commands.

If you want to see your figures, you’ll have to take the somewhat cumbersome steps of saving them (fig.savefig("filename")), copying them to your local machine (scp user1234@hammer.rcc.psu.edu:workingdirectory/filename.png .), and then viewing them.

I used an ipython HTML notebook for the demo. I’ll be making two posts about how to do this: one about how I did it for the demo, and another about installing ipython on your desktop.