Using Borg in Parallel and Serial with a Python Wrapper – Part 2

This blog post is Part 2 of a two-part series demonstrating how I have coupled a pure Python simulation model with the Borg multi-objective evolutionary algorithm (MOEA). I recommend reading Part 1 of this series before you read Part 2. In Part 1, I explain how to get Borg and provide sample code showing how you can access Borg’s serial and/or parallelized (master-slave) implementations through a Python wrapper (borg.py). In Part 2, I provide details for more advanced simulation-optimization setups that require you to pass additional information from the Borg wrapper into the simulation model (the “function evaluation”) beyond just decision variable values.

In Part 1, the example simulation model I use (PySedSim) is called through a function handle “Simulation_Caller” in the example_sim_opt.py file. Borg needs only this function handle to properly call the simulation model in each “function evaluation”. Borg’s only interaction with the simulation model is to pass decision variable values to that function handle (e.g., “Simulation_Caller”). In many circumstances, this is all you need.

However, as your simulation-optimization setup becomes more complex, your simulation model (i.e., the function evaluation) may need additional arguments from Borg beyond just the decision variables in order to execute properly. For example, in my own use of Borg in a simulation-optimization setting, I first import a variety of assumptions and preferences to set up a Borg-PySedSim run. Some of those assumptions and preferences help the simulation model (PySedSim) determine how to make use of the decision variable values Borg feeds it. So, I would like to pass those relevant assumptions and preferences directly into the Borg wrapper (borg.py), so the wrapper can in turn pass them into the simulation model along with the decision variable values.

Before I show how to do this, let me provide a more concrete example of how/why I am doing this in my own research. In my current work, decision variable values represent parameters for a reservoir operating policy that is being optimized. The simulation model needs to know how to take the decision variable values and turn them into a useful operating policy that can be simulated. Some of this information gets imported in order to run Borg, so I might as well pass that information directly into the simulation model while I have it on hand, rather than importing it once again in the simulation model.
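
As a purely illustrative sketch of this idea (the dictionary key and the piecewise-linear policy structure here are invented for the example), the simulation model might use one of those imported preferences to decide how to interpret the decision vector:

import numpy as np

def build_operating_policy(decision_vars, prefs):
    # prefs is a hypothetical dictionary imported before the Borg run and
    # passed through the wrapper; 'num_points' gives the number of points
    # on an assumed piecewise-linear release rule.
    n = prefs['num_points']
    storage_points = np.asarray(decision_vars[:n])
    release_points = np.asarray(decision_vars[n:2 * n])
    return {'storage': storage_points, 'release': release_points}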

To do what I describe above, we just need to modify the two functions in the example_sim_opt.py module so that a new argument, “additional_inputs”, is passed from Borg to the simulation handle. Using my Python code from Part 1, I provide modified code below; the changes appear in the Simulation_Caller() function (its signature, the unpacking of the additional inputs, and the simulation call) and in the Optimization() function (the two new dictionaries and the Borg constructor call). After that code, I indicate how I modify the borg.py wrapper so it can accept this information.

import numpy as np
import pysedsim # This is your simulation model
import platform  # helps identify directory locations on different types of OS

def Simulation_Caller(vars, additional_inputs):
    '''
    Purpose: Borg calls this function to run the simulation model and return multi-objective performance.

    Note: You could also just put your simulation/function evaluation code here.

    Args:
        vars: A list of decision variable values from Borg
        additional_inputs: A list of python data structures you want to pass from Borg into the simulation model.
    Returns:
        performance: policy's simulated objective values. A list of objective values, one value for each objective.
    '''

    borg_vars = vars  # Decision variable values from Borg

    # Unpack lists of additional inputs from Borg (example assumes additional inputs is a python list with two items)
    borg_dict_1 = additional_inputs[0]
    borg_dict_2 = additional_inputs[1]

    # Reformat decision variable values as necessary (e.g., cast Borg output parameters as array for use in simulation)
    op_policy_params = np.asarray(borg_vars)
    # Call/run simulation model with decision vars and additional relevant inputs, return multi-objective performance:
    performance = pysedsim.PySedSim(decision_vars = op_policy_params, sim_addl_input_1 = borg_dict_1, sim_addl_input_2 = borg_dict_2)
    return performance

def Optimization():

    '''
    Purpose: Call this method from the command line to initiate a simulation-optimization experiment.

    Returns:
        --pareto approximate set file (.set) for each random seed
        --Borg runtime file (.runtime) for each random seed
    '''

    import borg as bg  # Import borg wrapper

    parallel = 1  # 1 = master-slave (parallel), 0 = serial

    # The following are just examples of relevant MOEA specifications. Select your own values.
    nSeeds = 25  # Number of random seeds (Borg MOEA)
    num_dec_vars = 10  # Number of decision variables
    n_objs = 6  # Number of objectives
    n_constrs = 0  # Number of constraints
    num_func_evals = 30000  # Total number of simulations to run per random seed. Each simulation may itself be a Monte Carlo.
    runtime_freq = 1000  # Interval at which to print runtime details for each random seed
    decision_var_range = [[0, 1], [4, 6], [-1,4], [1,2], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1]]
    epsilon_list = [50000, 1000, 0.025, 10, 13, 4]  # Borg epsilon values for each objective
    borg_dict_1 = {'simulation_preferences_1': [1,2]}  # reflects data you want Borg to pass to simulation model
    borg_dict_2 = {'simulation_preferences_2': [3,4]}  # reflects data you want Borg to pass to simulation model

    # Where to save seed and runtime files
    main_output_file_dir = r'E:\output_directory'  # Raw string so backslashes are not treated as escape characters
    os_fold = Op_Sys_Folder_Operator()  # Folder operator for operating system
    output_location = main_output_file_dir + os_fold + 'sets'

    # If using master-slave, start MPI. Only do once.
    if parallel == 1:
        bg.Configuration.startMPI()  # start parallelization with MPI

    # Loop through seeds, calling borg.solve (serial) or borg.solveMPI (parallel) each time
    for j in range(nSeeds):
        # Instantiate borg class, then set bounds, epsilon values, and file output locations
        borg = bg.Borg(num_dec_vars, n_objs, n_constrs, Simulation_Caller, add_sim_inputs = [borg_dict_1, borg_dict_2])
        borg.setBounds(*decision_var_range)  # Set decision variable bounds
        borg.setEpsilons(*epsilon_list)  # Set epsilon values
        # Runtime file path for each seed:
        runtime_filename = main_output_file_dir + os_fold + 'runtime_file_seed_' + str(j+1) + '.runtime'
        if parallel == 1:
            # Run parallel Borg
            result = borg.solveMPI(maxEvaluations=num_func_evals, runtime=runtime_filename, frequency=runtime_freq)

        if parallel == 0:
            # Run serial Borg
            result = borg.solve({"maxEvaluations": num_func_evals, "runtimeformat": 'borg', "frequency": runtime_freq,
                                 "runtimefile": runtime_filename})

        if result:
            # This particular seed is now finished being run in parallel. The result will only be returned from
            # one node in case running Master-Slave Borg.
            result.display()

            # Create/write objective values and decision variable values to files in folder "sets", 1 file per seed.
            f = open(output_location + os_fold + 'Borg_DPS_PySedSim' + str(j+1) + '.set', 'w')
            f.write('#Borg Optimization Results\n')
            f.write('#First ' + str(num_dec_vars) + ' are the decision variables, ' + 'last ' + str(n_objs) +
                    ' are the ' + 'objective values\n')
            for solution in result:
                line = ''
                for i in range(len(solution.getVariables())):
                    line = line + (str(solution.getVariables()[i])) + ' '

                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f.write(line[0:-1]+'\n')
            f.write("#")
            f.close()

            # Create/write only objective values to files in folder "sets", 1 file per seed. Purpose is so that
            # the file can be processed in MOEAFramework, where performance metrics may be evaluated across seeds.
            f2 = open(output_location + os_fold + 'Borg_DPS_PySedSim_no_vars' + str(j+1) + '.set', 'w')
            for solution in result:
                line = ''
                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f2.write(line[0:-1]+'\n')
            f2.write("#")
            f2.close()

            print("Seed %s complete") %j

    if parallel == 1:
        bg.Configuration.stopMPI()  # stop parallel function evaluation process

def Op_Sys_Folder_Operator():
    '''
    Function to determine whether operating system is (1) Windows, or (2) Linux

    Returns folder operator for use in specifying directories (file locations) for reading/writing data pre- and
    post-simulation.
    '''

    if platform.system() == 'Windows':
        os_fold_op = '\\'
    elif platform.system() == 'Linux':
        os_fold_op = '/'
    else:
        os_fold_op = '/'  # Assume unix OS if it can't be identified

    return os_fold_op
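
As an aside, the Python standard library can handle this separator logic for you. Here is a sketch of an equivalent approach using os.path.join:

import os

main_output_file_dir = r'E:\output_directory'
# os.path.join inserts the correct separator for the current operating
# system, so no manual platform check is needed.
output_location = os.path.join(main_output_file_dir, 'sets')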

Next, you will need to acquire the Borg wrapper using the instructions in my previous blog post. You will then make only two modifications: (1) modify the Borg class in borg.py so it accepts the inputs you want to pass to the simulation; and (2) add some internal accounting in borg.py to ensure those inputs reach the borg.py methods that deal with your function handle. I will address these two in order.

First, modify the Borg class in borg.py so it now accepts an additional input (I only show some of the borg.py code here, just to indicate where changes are being made):


class Borg:
    def __init__(self, numberOfVariables, numberOfObjectives, numberOfConstraints, function, epsilons=None, bounds=None, directions=None, add_sim_inputs=None):
        # add_sim_inputs is the new input you will pass to Borg

Then, modify the portion of the borg.py wrapper where self.function is called, so it can accommodate any simulation inputs you have specified.


if add_sim_inputs is None:
    self.function = _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions)
else:
    # More simulation inputs are specified and can be passed to the simulation handle
    self.function = _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions, addl_inputs=add_sim_inputs)

After the above, the last step is to modify the _functionWrapper method in borg.py:


def _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions=None, addl_inputs=None):
    # addl_inputs will be passed into the simulation model
    def innerFunction(v, o, c):
        global terminate
        try:
            if addl_inputs is None:
                result = function(*[v[i] for i in range(numberOfVariables)])
            else:
                result = function([v[i] for i in range(numberOfVariables)], addl_inputs)
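
To see why this pattern works, below is a minimal, self-contained sketch of the closure idiom _functionWrapper uses (no Borg required; all names here are illustrative). The addl_inputs argument is captured by the inner function, so every function evaluation receives it even though the optimizer only ever supplies decision variable values:

def make_wrapper(function, num_vars, addl_inputs=None):
    # The inner function closes over addl_inputs; the caller (Borg, in the
    # real wrapper) only ever sees a function of the decision variables.
    def inner(v):
        if addl_inputs is None:
            return function(*[v[i] for i in range(num_vars)])
        return function([v[i] for i in range(num_vars)], addl_inputs)
    return inner

# Illustrative stand-in for a simulation model:
def fake_sim(dec_vars, extras):
    return [sum(dec_vars) + extras[0]['offset']]

wrapped = make_wrapper(fake_sim, 3, addl_inputs=[{'offset': 10.0}])
print(wrapped([1.0, 2.0, 3.0]))  # prints [16.0]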

Using Borg in Parallel and Serial with a Python Wrapper – Part 1

Simulation and optimization are frequently used to solve complex water resources and environmental systems problems. By itself, a simulation model begs the question “what to simulate?” Similarly, by itself, an optimization model begs the question “is the solution really best?” For this reason, simulation and optimization models are frequently coupled.

This blog post is part 1 of a multi-part series that will demonstrate how I have coupled a pure Python simulation model with the multi-objective evolutionary optimization algorithm Borg. In this post, I will show how you can access Borg’s serial and/or parallelized (master-slave) implementations through a Python wrapper (borg.py).

Please see this previous blog post for some background about Borg, and how to obtain it. My instructions below assume you have access to the Borg files.

In the setup I will describe below, Borg parameterizes and iteratively refines solutions (e.g., reservoir operating policies) to a problem, optimizing them in response to their simulated performance with respect to multiple objectives.

You will need the following Borg files (see link above for how to download these):

  • Serial (i.e., borg.c, libborg.so, etc.) and/or master-slave (i.e., borgms.c, libborgms.so, etc.) implementations of Borg, depending upon your ability to parallelize.
  • Python wrapper for Borg (borg.py), which will allow you to access Borg easily in Python.

You will need to create the following files yourself (I provide sample code below for these files):

  • example_sim_opt.py: A Python module that should contain two main functions:
    1. A simulation caller, which takes decision variables and returns multi-objective performance. This function is called “Simulation_Caller” in the example_sim_opt.py code below.
    2. An optimization function, which calls the Borg MOEA through its python wrapper borg.py. This borg.py wrapper provides extensive docstring documentation regarding required arguments, returned values, etc., so I do suggest reading through the wrapper if you have questions (e.g., about the python data types of arguments and returns).

Note that the file and function names above are just example names. You can name the above files whatever you want. Just be sure to modify the code I provide below to reflect the new names.

A sample of code for example_sim_opt.py is as follows:

import numpy as np
import pysedsim # This is your simulation model
import platform  # helps identify directory locations on different types of OS

def Simulation_Caller(vars):
    '''
    Purpose: Borg calls this function to run the simulation model and return multi-objective performance.

    Note: You could also just put your simulation/function evaluation code here.

    Args:
        vars: A list of decision variable values from Borg

    Returns:
        performance: policy's simulated objective values. A list of objective values, one value for each objective.
    '''

    borg_vars = vars  # Decision variable values from Borg

    # Reformat decision variable values as necessary (e.g., cast Borg output parameters as array for use in simulation)
    op_policy_params = np.asarray(borg_vars)
    # Call/run simulation model, return multi-objective performance:
    performance = pysedsim.PySedSim(decision_vars = op_policy_params)
    return performance

def Optimization():

    '''
    Purpose: Call this method from the command line to initiate a simulation-optimization experiment.

    Returns:
        --pareto approximate set file (.set) for each random seed
        --Borg runtime file (.runtime) for each random seed
    '''

    import borg as bg  # Import borg wrapper

    parallel = 1  # 1 = master-slave (parallel), 0 = serial

    # The following are just examples of relevant MOEA specifications. Select your own values.
    nSeeds = 25  # Number of random seeds (Borg MOEA)
    num_dec_vars = 10  # Number of decision variables
    n_objs = 6  # Number of objectives
    n_constrs = 0  # Number of constraints
    num_func_evals = 30000  # Total number of simulations to run per random seed. Each simulation may itself be a Monte Carlo.
    runtime_freq = 1000  # Interval at which to print runtime details for each random seed
    decision_var_range = [[0, 1], [4, 6], [-1,4], [1,2], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1]]
    epsilon_list = [50000, 1000, 0.025, 10, 13, 4]  # Borg epsilon values for each objective

    # Where to save seed and runtime files
    main_output_file_dir = r'E:\output_directory'  # Raw string so backslashes are not treated as escape characters
    os_fold = Op_Sys_Folder_Operator()  # Folder operator for operating system
    output_location = main_output_file_dir + os_fold + 'sets'

    # If using master-slave, start MPI. Only do once.
    if parallel == 1:
        bg.Configuration.startMPI()  # start parallelization with MPI

    # Loop through seeds, calling borg.solve (serial) or borg.solveMPI (parallel) each time
    for j in range(nSeeds):
        # Instantiate borg class, then set bounds, epsilon values, and file output locations
        borg = bg.Borg(num_dec_vars, n_objs, n_constrs, Simulation_Caller)
        borg.setBounds(*decision_var_range)  # Set decision variable bounds
        borg.setEpsilons(*epsilon_list)  # Set epsilon values
        # Runtime file path for each seed:
        runtime_filename = main_output_file_dir + os_fold + 'runtime_file_seed_' + str(j+1) + '.runtime'
        if parallel == 1:
            # Run parallel Borg
            result = borg.solveMPI(maxEvaluations=num_func_evals, runtime=runtime_filename, frequency=runtime_freq)

        if parallel == 0:
            # Run serial Borg
            result = borg.solve({"maxEvaluations": num_func_evals, "runtimeformat": 'borg', "frequency": runtime_freq,
                                 "runtimefile": runtime_filename})

        if result:
            # This particular seed is now finished being run in parallel. The result will only be returned from
            # one node in case running Master-Slave Borg.
            result.display()

            # Create/write objective values and decision variable values to files in folder "sets", 1 file per seed.
            f = open(output_location + os_fold + 'Borg_DPS_PySedSim' + str(j+1) + '.set', 'w')
            f.write('#Borg Optimization Results\n')
            f.write('#First ' + str(num_dec_vars) + ' are the decision variables, ' + 'last ' + str(n_objs) +
                    ' are the ' + 'objective values\n')
            for solution in result:
                line = ''
                for i in range(len(solution.getVariables())):
                    line = line + (str(solution.getVariables()[i])) + ' '

                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f.write(line[0:-1]+'\n')
            f.write("#")
            f.close()

            # Create/write only objective values to files in folder "sets", 1 file per seed. Purpose is so that
            # the file can be processed in MOEAFramework, where performance metrics may be evaluated across seeds.
            f2 = open(output_location + os_fold + 'Borg_DPS_PySedSim_no_vars' + str(j+1) + '.set', 'w')
            for solution in result:
                line = ''
                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f2.write(line[0:-1]+'\n')
            f2.write("#")
            f2.close()

            print("Seed %s complete") %j

    if parallel == 1:
        bg.Configuration.stopMPI()  # stop parallel function evaluation process

def Op_Sys_Folder_Operator():
    '''
    Function to determine whether operating system is (1) Windows, or (2) Linux

    Returns folder operator for use in specifying directories (file locations) for reading/writing data pre- and
    post-simulation.
    '''

    if platform.system() == 'Windows':
        os_fold_op = '\\'
    elif platform.system() == 'Linux':
        os_fold_op = '/'
    else:
        os_fold_op = '/'  # Assume unix OS if it can't be identified

    return os_fold_op

The following is an example of how you would submit a batch script on a Linux cluster to run a parallelized simulation-optimization experiment using the example_sim_opt.py and borg.py files. Note that in the parallelized version, simulations (i.e., “function evaluations”) are being run in parallel by separate processors.

You would need the following two files:

  1. example_sim_opt_caller.py. This is a python file that is used to call example_sim_opt.py
  2. example_sim_opt_batch_script.pbs. This is a batch script that runs example_sim_opt_caller.py in parallel on a cluster using Open MPI.

Example code for example_sim_opt_caller.py:


'''
Purpose: To initiate the optimization process, which will iteratively call the simulation model.
'''

import example_sim_opt  # Import main optimization module that uses borg python wrapper

# Call the Optimization function within the example_sim_opt module
example_sim_opt.Optimization()

Example code for example_sim_opt_batch_script.pbs:


#PBS -l nodes=8:ppn=16
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -o pysedsim_output.out

cd $PBS_O_WORKDIR
source /etc/profile.d/modules.sh
module load openmpi-1.6.5-intel-x86_64/gnu
module load python-2.7.5
mpirun python example_sim_opt_caller.py

You could then run the above batch script with a command such as:

qsub example_sim_opt_batch_script.pbs

A Guide to Using Git in PyCharm – Part 2

This post is part 2 of a multi-part series intended to describe how to use PyCharm’s integrated Git (version control) features.

In my previous PyCharm post (Part 1), I described how to get started using Git in PyCharm, including:

  • Creating a local git repository
  • Adding and committing files to a git repository
  • Making and committing changes to a python file

In this post, I will describe:

  • Creating branches
  • Merging branches

These are extremely valuable features of Git, as they allow you to create a new version of your code to modify and evaluate before permanently integrating the changes into the master version of your code.

The following tutorial will assume you already have PyCharm and Git installed, and have a Github account. It also assumes you have a git repository to start with. Follow my steps in the first blog post if you don’t already have a git repository.

Create a new branch by right clicking on your repository project folder, and in the menu select Git–>Repository–>Branches, as is shown below.

figure-1

Select “New Branch” and then enter the name of the new branch in the box (see below for how I named the example branch):

figure-2

You can now see in the version control window that a new branch called “example_branch_1” has been created as a branch from the master branch (you can access this by clicking “version control” window at the bottom).

figure-3

Importantly, PyCharm has now placed you on the new branch (“example_branch_1”). Any changes you make and commit, you are making to the branch, not to the master (the one you started with).

First, right-click on the python file and select Git –> Add.

Now, make some modifications to the python file on this branch.

In the menu at the top of the screen, select VCS –> Commit changes.

In the menu that appears, provide a message to attach to your commit so you know later what changes you made. You can double click on the file name in the window (e.g., on the “blog_post_file.py” file) to review what changes have actually been made. If you like the changes, select “Commit”.

figure-4

In the menu at the very bottom of the screen to the right, you can now go back and “check out” the master branch. Just click master –> Checkout, as is shown below.

figure-6

What now appears on your screen is the previous version of the “blog_post_file.py” file, before the function was modified, as is shown in the image below.

figure-5

The version control menu at the bottom explains the structure (and dates/times) of the master and branches.

You can now go back to the example branch if you want using the same feature.

This feature allows you to quickly snap to different branches that are being modified, and see how they are different. (This feature is also accessible from the menu at the top of the screen through VCS –>Git –> Branches).

If you want to merge your changes into your local master, do the following. From the master branch, select “example_branch_1” in the same menu shown at the bottom right below. This time select “Merge” rather than “Checkout”. Your changes will now be merged onto the master branch, and the modified function will appear on both the master and the example branch.

figure-7

I will continue to discuss PyCharm’s Git features in future posts in this series.

A Guide to Using Git in PyCharm – Part 1

This post is part 1 of a multi-part series intended to describe how to use PyCharm’s integrated Git (version control) features. See Part 2 here. PyCharm makes it easy to link a Github account directly to the PyCharm IDE and perform Git-related tasks within the PyCharm IDE rather than performing these tasks in a command prompt.

Today I will describe a few basic features to help you get started, including: creating a repository, adding files and committing changes. Future posts will discuss more detailed features, including branching, pushing/pulling, merging, etc. While PyCharm does have a very detailed website geared toward explaining some of this, this blog post series is intended to help those who just want some basic steps and pictures to get started.

You can find background information on Git on this blog, including an introduction to git, as well as tutorials on local version control and remote repositories. A glossary of Git terminology and commands can be found here.

PyCharm is one of many IDEs one can use to edit Python code. I offer some background here on why I chose PyCharm for my Python programming needs, and how to get PyCharm.

The following tutorial will assume you already have PyCharm and Git installed, and have a Github account.

  1. Within PyCharm, link your Github account.

File –> Settings –> Version Control –> GitHub

Input your GitHub login (username) and password.

  2. Create a new PyCharm project as pictured below, by going to:

File –> New Project.

(Note: A “project” can just be a folder/directory of related files that PyCharm will recognize as a “project”. If you already have a project, you can simply open the existing “project” to follow these steps to create a repository).

create_project

  3. Having established a new project, now create a local git repository for the project as pictured below, by going to:

VCS –> Import Into Version Control –> Create Git Repository.

If completed correctly, you should see a directory named “.git” appear in the project directory. Note that you must have already downloaded git for this to work (test that git.exe works by going File –> Settings –> Version Control –> Git –> Test).

create_git_repo

  4. Having created a git repository, now create Python files and add them to the repository.

Right click on your project in the Project menu, and select New –> Python File, as pictured below.

New Python File

PyCharm will prompt you to include this file in your repository, as is pictured below. If you select “Yes”, you can now commit and track changes you make to this file.

commit_file_question

  5. Commit the file to the repository by right-clicking the Python file in the project menu, and selecting Git–>Commit File, as is shown in the two images below:

git_commit

git_commit

You can include a message with your commit when you are prompted to commit, as shown below:

initial_commit

Note that file names in your project menu will appear in green text as long as the file has not been committed yet. The file name will no longer be green once you commit it.

  6. Now that your file has been added, you can make changes to the file and commit them.

As soon as you make edits to this newly committed file, the file name in the menu will change colors (to blue in my version of PyCharm). This signifies that uncommitted changes exist in that file.  You can commit the changes following the same process I described before, or by clicking the green “VCS” up arrow icon on the top menu.

UNCOMMITTED_CODE

You will now be prompted with a Commit Changes window, as appears below.

first_module_added

You can review changes that have been made, in case you have forgotten what has changed since you last committed, by double clicking on the file name (in the figure above, this would be the blue “blog_post_file.py” icon). PyCharm will show you what changes have been made in green. (It will also show deletions and/or rearrangements of code).

changes_shown

  7. Having committed the file and a change to the file, you can now go to the “Version Control” menu at the bottom of the PyCharm window to unveil a variety of features, including a “Log”, which stores all of the changes I have made to this local repository. The log for my example case is shown below.

version_control_window

I will continue to discuss more detailed features in future posts in this series.

Importing, Exporting and Organizing Time Series Data in Python – Part 2

This blog post is Part 2 of a multi-part series of posts intended to introduce (1) options in Python available for reading (importing) data (with particular emphasis on time series data, and how to handle .csv, .xls and .xlsx files); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

Part 1 of the series focused on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets.

This blog post presents a flexible Python function I developed to import time series outputs from a simulation model into a dictionary of 3D pandas DataFrames (DFs). This function is part of a data importing and exporting module included in the PySedSim simulation model I have been developing over the last year, but you can use this function for time series outputs from any simulation model, provided the data are stored in .csv files and organized in the format shown in Figure 3 below.

The value of this function is it creates a single dictionary of 3D pandas DFs storing time series outputs from multiple simulation “scenarios”; for multiple system locations (e.g., Reservoir 1, Reservoir 2…); multiple state variables (reservoir water storage, suspended sediment load, etc.); and multiple realizations, or stochastic simulation ensemble members (e.g., Realization 1, Realization 2, …, Realization 1000). I have found the 3D DFs to have more reliable functionality than 4D DFs, so I have elected not to use a fourth pandas panel dimension in this function.

In a future post, I will then present some additional modules I have developed that use Pandas functionality to evaluate “performance” of different system locations (e.g., reservoirs) with respect to a diverse set of temporal resampling strategies and distribution statistics.

Below is the Python function, which you are welcome to use and/or modify for your own purposes. Please note that I had trouble reproducing Python formatting in the blog post (and some html characters get incorrectly inserted), and apparently I cannot post a link to a .py file, so it will eventually be on Github in late 2016.

from __future__ import division  # Must be the first statement in the module

# Import method-specific libraries
from copy import deepcopy
import pandas as pd
import os
import platform

def Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir, proc_num = ''):
    '''
    Purpose: Imports time series outputs from a simulation run into a 3-dimensional pandas dataframe (DF).

    More detail: Module intended to import .csv files that have been produced as the result of a PySedSim simulation.
    Each .csv file should contain time series outputs for a single state variable (e.g., reservoir water storage) and
    system location (e.g., reservoir 1).

    The purpose of using this file may either be to (1) import the output into a dictionary of pandas structures so
    that simulation performance measures can be evaluated and plotted, or (2) to import all the data produced by
    separate processors into a single data structure that can then be exported into a .csv file that contains aggregated
    output for each system location/variable.

    DF details (a 3D DF exists for each system location (e.g., reservoir)):
    Axis 0: State variable (e.g., Water storage, Energy production) for system location
    Axis 1: Time (e.g., dates over simulation horizon)
    Axis 2: Realization Number (e.g., stochastic simulation ensemble members)

    :param Sims_to_Import: List, containing strings of simulation scenario names (these must be directories in the
    specified output file directory that have these names). Example: ["Alternative Scenario 7A"]

    :param Locations_to_Import: Dictionary, keys store strings representing simulation element names (e.g.,
    Reservoir 1). Keys must be in the Sims_to_Import list. Example: {"Alternative Scenario 7A": ["Reservoir 1"]}

    :param var_sub_list: List, containing strings of PySedSim state variable names for which .csv output files exist
    for the scenarios in the Sims_to_Import list. Example: ['water_surface_elevation', 'capacity_active_reservoir']

    :param file_dir: String, directory in which output files to be imported are located.
    Example: r'E:\PySedSim\ModelFiles\Output_Storage'

    :param proc_num: Optional. Integer, number appended to the output .csv file representing the processor that
    produced the file (e.g., the number 3 for the file 'water_surface_elevation_3.csv')

    :return TSID: Dictionary, where keys are scenario names. Key stores sub_dictionary, where sub_dictionary keys are
    system locations storing 3D pandas DF for each system location.
    :return Num_Realizations: Dictionary, where keys are scenario names, storing number of stochastic realiztions for
    scenario.
    :return Num_Years: Dictionary, where keys are scenario names, storing number of years in a simulation realization
    for scenario
    '''

    # Get operator (/ or \) for changing directory based on operating system.
    os_fold_op = Op_Sys_Folder_Operator()

    # This function reads back in previously exported simulation data so performance measure analysis can be conducted.
    if proc_num != '':
        cluster_loop = '_' + str(proc_num-1) # Subtract 1 as first file ends with "0".
        cluster_sub_folder = 'cluster_output'
    else:
        cluster_loop = ''
        cluster_sub_folder = ''

    # Initialize various data structures
    TSID = {} # Main dictionary to export
    TSID_Temp = {} # Use to temporarily load each processor's output sheet for location/variable, if applicable
    Num_Realizations = {} # For each scenario, stores number of realizations for that scenario
    Num_Years = {} # For each scenario, stores number of years in each realization for that scenario
    counter = {} # Temporary counter

    # Main data import loop. Intent is to import data into Time Series Import Dictionary (TSID)
    for sims in Sims_to_Import:
        counter[sims] = 0
        TSID[sims] = {} # Sub dict for each simulation will store locations.
        TSID_Temp[sims] = {} # Sub dict for element/variable output for a given processor in cluster.
        sim_import_loc = file_dir + os_fold_op + sims # This folder needs to already exist.
        for sys_locs in Locations_to_Import[sims]:
            TSID[sims][sys_locs] = {} # Sub dictionary for each location will store variables.
            TSID_Temp[sims][sys_locs] = {} # Sub dict for location will store a variable for each processor.
            loc_sub_folder = os_fold_op + cluster_sub_folder + os_fold_op + sys_locs
            # Requires that all the locations you are searching have all the variables listed above, which won't
            # always be the case (for junctions vs. reservoirs, for example).
            for vars in var_sub_list:
                file_path = sim_import_loc + loc_sub_folder  # File path reflecting new folder
                if os.path.exists(os.path.join(file_path, vars + cluster_loop + '.csv')):
                    # This variable exists as a file name in the specified file path, so import it.
                    if proc_num == '':
                        # User is not importing output files produced on a cluster by various processors. Proceed
                        # linearly (there are not different files from different processors that need to be combined).
                        # Import this dataframe to a csv file.
                        TSID[sims][sys_locs][vars] = pd.read_csv(os.path.join(file_path, vars + cluster_loop + '.csv'),
                        index_col=0)

                        # Force each dataframe to have datetime objects as dates rather than strings.
                        TSID[sims][sys_locs][vars].set_index(pd.to_datetime(TSID[sims][sys_locs][vars].index),
                        inplace=True)
                        # Determine number of realizations (ensembles). Only do this calculation once per simulation
                        # realization (on first pass through loop).
                        if counter[sims] == 0:
                            Num_Realizations[sims] = len(TSID[sims][sys_locs][vars].columns)
                            Num_Years[sims] = TSID[sims][sys_locs][vars].index[-1].year - \
                            TSID[sims][sys_locs][vars].index[0].year + 1
                            counter[sims] += 1
                    else:
                        # User wishes to use this processor to create a dictionary for the particular
                        # location/variable of interest. This processor will therefore read in all output .csv files
                        # produced by other processors.
                        for csv_sheet in range(proc_num):
                            # Import this dataframe to a csv file
                            TSID_Temp[sims][sys_locs][vars] = pd.read_csv(
                            os.path.join(file_path, vars + '_' + str(csv_sheet) + '.csv'), index_col=0)
                            # Make each dataframe have datetime objects as dates rather than strings.
                            TSID_Temp[sims][sys_locs][vars].set_index(
                            pd.to_datetime(TSID_Temp[sims][sys_locs][vars].index), inplace=True)
                            # Loop through locations and variables, store data from this processor in master dictionary.
                            if csv_sheet == 0:
                                TSID = deepcopy(TSID_Temp)
                            else:
                                for locs in TSID[sims]:
                                    for vars in TSID[sims][locs]:
                                        # Add this new set of realizations from this DF into the main DF
                                        TSID[sims][locs][vars] = pd.concat(
                                        [TSID[sims][locs][vars], TSID_Temp[sims][locs][vars]], axis=1)
    print("Data Import is completed.")
    # Return is conditional. Number of realizations/years cannot be provided if the TSID only represents one of many
    # ensemble members of a stochastic simulation:
    if proc_num != '':
        return TSID
    else:
        return TSID, Num_Realizations, Num_Years

The following is an example of how to make use of this function. Suppose you have 2 directories corresponding to two different simulation runs (or “scenarios”). Let’s call those Scenario 1 and Scenario 2.

Within each of those scenario directories, you have separate directories for each system location. In this example, the locations are “Reservoir 1” and “River Channel 1”, so there are 2 sub-directories.

Within each of those location directories, you have .csv files for the following state variables: water_storage, susp_sediment_inflow, inflow_rate. Each .csv file stores time series output for a particular scenario, location, and state variable across all realizations (ensemble members).

Figure 1 (below) shows an example of how these files might be stored for “River Channel 1”, within a directory titled “Simulation Output”.

Fig 2 - file location

Figure 2 (below) shows an example of how these files might be stored for “Reservoir 1”, within a directory titled “Simulation Output”.

Fig 2A - file location

Figure 3 (below) shows the layout of one such .csv file. (The PySedSim model automatically creates the directory structure and file formatting shown here).

Fig 1 - csv file layout

The following is an example function call:


from example_data_import import Import_Simulation_Output  # assumes the function above is saved in a module named example_data_import.py

Sims_to_Import = ['Scenario 1', 'Scenario 2']

Locations_to_Import = {'Scenario 1': ['Reservoir 1', 'River Channel 1'], 'Scenario 2': ['Reservoir 1', 'River Channel 1']}

var_sub_list = ['inflow_rate', 'susp_sediment_inflow', 'water_storage']

file_dir = r'C:\Users\tbw32\Documents\Reed Group\Blog - Water Programming\2016\July 2016\Simulation Output'

TSID, Num_Realizations, Num_Years = Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir)
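
Once the call returns, the nested dictionary can be sliced directly. A quick sketch (“Realization1” is an assumed column name, matching the ensemble naming used elsewhere in this series):

# Each leaf of the returned dictionary is a 2D DataFrame (dates x realizations).
storage_df = TSID['Scenario 1']['Reservoir 1']['water_storage']
print(Num_Realizations['Scenario 1'])  # number of ensemble members
print(storage_df['Realization1'].mean())  # mean daily storage in one realization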

In the next post I will demonstrate what to do with the dictionary that has been created.

Debugging in Python (using PyCharm) – Part 3

This post is part 3 of a multi-part series of posts intended to provide discussion of some basic debugging tools that I have found to be helpful in developing a pure Python simulation model using a Python Integrated Development Environment (IDE) called PyCharm.

Before I begin this post, note that Parts 1 and 2 of this series cover PyCharm’s breakpoint and profiling features, respectively.

In this post I will focus on PyCharm’s “Coverage” features, which are very useful for debugging by allowing you to see what parts of your program (e.g., modules, classes, methods) are/are not being accessed for a given implementation (run) of the model. If instead you are interested in seeing how much time is being spent running particular sections of code, or want to glimpse into the values of variables during execution, see the previous posts in this series on profiling and breakpoints.

To see what parts of my code are being accessed, I have found it helpful to create and run what are called “unit tests”. You can find more on unit testing here, or just by googling it. (Please note that I am not a computer scientist, so I am not intending to provide a comprehensive summary of all possible approaches you could take to do this. I am just going to describe something that has worked well for me). To summarize, unit testing refers to evaluating sections (units) of source code to determine whether those units are performing as they should. I have been using unit testing to execute a run of my model (called “PySedSim”) to see what sections of my code are and are not being accessed.

Please follow these steps:

Step 1. Open the script (or class, or method) you want to assess, and click on the function or method you want to assess.

In my case, I am assessing the top-level python file “PySedSim.py”, which is the file in my program that calls all of the classes to run a simulation (e.g., Reservoirs and River Channels). Within this file, I have clicked on the PySedSim function. Note that these files are already part of a PyCharm project I have created, and Python interpreters have already been established. You need to do that first.

Step 2. With your cursor still on the function/method of interest, press “Ctrl + Shift + T”.

A window should appear as it does below. Click to “Create New Test”.

Figure 1

Step 3. Create a new test. Specify the location of the script you are testing, and keep the suggested test file and class names, or modify them. Then click to add a check mark to the box next to the Test Method, and click “OK”.

Figure 2

Step 4. Modify the new script that has been created. (In my case, this file is called “test_pySedSim.py”, and appears initially as it does below).

Figure 3

I then modified this file so that it reflects testing I want to conduct on the PySedSim method in the PySedSim.py file.

In my case, it appears like this.


from unittest import TestCase
from PySedSim import PySedSim

class TestPySedSim(TestCase):
    def test_PySedSim(self):
        PySedSim()

Note that there is a ton of functionality that is now possible in this test file. I suggest reviewing this website again carefully for ideas. You can raise errors, and use the self.fail() function, to indicate whether or not your program is producing acceptable results. For example, if the program produces a negative result when it should produce a positive result, you can indicate to PyCharm that this represents a fail, and the test has not been passed. This offers you a lot of flexibility in testing various methods in your program.
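
For example, here is a sketch that builds on the test file above (it still assumes only that PySedSim() runs without raising an exception), using self.fail() to turn any exception into an explicit test failure:

from unittest import TestCase
from PySedSim import PySedSim

class TestPySedSim(TestCase):
    def test_PySedSim_runs(self):
        # Explicitly fail the test if the model raises any exception.
        try:
            PySedSim()
        except Exception as e:
            self.fail('PySedSim raised an exception: %s' % e)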

In my case, all I am wanting to do is run the model and see which sections were accessed, not to specifically evaluate results it has produced, so in my case PyCharm should execute the model and indicate it has “passed” the unit test (once I create and run the unit test).

Step 5. In the menu at the top of the screen that I show clicked on in the image below, click on “Edit configurations”.

Figure 4

From here, click on the “+” button, and go to Python tests –> Unittests.

Figure 5

Step 6. In the “Run/Debug Configurations” window, give your test a name in the “Name” box, and in the “script” box locate the script you created in Step 4, and indicate its path. Specify any method parameters that need to be specified to run the method. I did not specify any environment preferences, as the interpreter was already filled in. Click OK when you are done.

Figure 6

Step 7. Your test should now appear in the same configuration menu you clicked on earlier in Step 5. So, click the button at the top to “Run with Coverage”. (In my case, run water_programming_blog_post with coverage)

Figure 7

Note that it is likely going to take some time for the test to run (more than it would take for a normal execution of your code).

Step 8. Review the results.

A coverage window should appear to the right of your screen, indicating what portions (%) of the various functions and methods contained in this program were actually entered.

Figure 8

To generate a more detailed report, you can click on the button with the green arrow inside the coverage window, which will offer you options for how to generate a report. I selected the option to generate an html report. If you then select the “index” html file that appears in the directory you’re working in, you can click to see the coverage for each method.

For example, here is an image of a particular class (reservoir.py), showing in green those sections of code that were entered, and in red the sections that were not. I used this to discover that particular portions of some methods were not being accessed when they should have been. The script files themselves also now show green and red markings next to the code that was or was not entered, respectively. See the image above for an example of this.

Figure 9

PyCharm also indicates whether or not the unittest has passed. (Though I did not actually test for specific outputs from the program, I could have done tests on model outputs as I described earlier, and any test failure would be indicated here).

Importing, Exporting and Organizing Time Series Data in Python – Part 1

This blog post is Part 1 of a multi-part series of posts (see here for Part 2) intended to introduce (1) options in Python available for reading (importing) data (with particular emphasis on time series data, and how to handle Excel spreadsheets); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

In modeling water resources and environmental systems, we frequently must import and export large quantities of data (particularly time series data), both to run models and to process model outputs. I will describe the approaches and tools I have found to be most useful for managing data in these circumstances.

This blog post focuses on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets. Future posts will cover the pandas data management/export component.

Through an example that follows, there are two main lessons I hope to convey in this post:

  1. If you can store data, especially time series data, in text (.txt) or comma separated value (.csv) files, the time required for importing and exporting data will be vastly reduced compared to attempting the same thing in an Excel workbook (.xlsx file). Hence, I suggest that if you must work with an Excel workbook (.xlsx files), try to save each worksheet in the Excel workbook as a separate .csv file and use that instead. A .csv file can still be viewed in the MS Excel interface, but is much faster to import. I’ll show below how much time this can save you.
  2. There are many ways to analyze the data once loaded into Python, but my suggestion is you take advantage of pandas, an open-source data analysis library in Python. This is why my example below makes use of built-in pandas features for importing data. (I will have some future posts with some useful pandas code for doing time series analysis/plotting. There are other previous pandas posts on our blog as well).

It is important at this point to note that Microsoft Excel files (.xlsx or .xls) are NOT an ideal means of storing data. I have had the great misfortune of working with Excel (and Visual Basic for Applications) intensively for many years, and it can be somewhat of a disaster for data management. I am using Excel only because the input data files for the model I work with are stored as .xlsx files, and because this is the data storage interface with which others using my Python-based model are most comfortable.

An Example:

Suppose that you wish to read in and analyze time series of 100 years of simulated daily reservoir releases from 2 reservoirs (“Reservoir_1” and “Reservoir_2”) for 100 separate simulations (or realizations). Let’s assume data for the reservoirs is stored in two separate files, either .csv or .xlsx files (e.g., “Reservoir_1.xlsx” and “Reservoir_2.xlsx”, or “Reservoir_1.csv” and “Reservoir_2.csv”).

The figure below shows a screenshot of an example layout of the file in either case (.csv or .xlsx). Feel free to make your own such file, with column names that correspond to the title of each simulation realization.

Worksheet appearance
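
If you would like to generate such a file yourself, here is a minimal sketch that writes a correctly formatted .csv (the column names and dimensions follow the example above; the release values are just random placeholders):

import numpy as np
import pandas as pd

num_realizations = 100
num_days = 365 * 100  # 100 years of daily data, ignoring leap days

# One column per realization, named to match the example layout.
columns = ['Realization' + str(i + 1) for i in range(num_realizations)]
data = np.random.uniform(0, 500, size=(num_days, num_realizations))
pd.DataFrame(data, columns=columns).to_csv('Reservoir_1.csv', index=False)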

There are two import options for these data.

Option 1: Import your data directly into a dictionary of two pandas DataFrame objects, where the keys of the Python dictionary are the two reservoir names. Here is some code that does this. All you need to do is run this script in a directory where you also have your input files saved.


# Imports
import pandas as pd
from datetime import datetime
from datetime import timedelta
import time

start_time = time.time()  # Keep track of import time

start_date = datetime(1910, 1, 1)
simulation_dates = pd.date_range(start_date, start_date + timedelta(365*100 - 1))

# Specify File Type (Choices: (1) 'XLSX', or (2) 'CSV')
File_Type = 'CSV'

location_list = ['Reservoir_1', 'Reservoir_2']  # Names of data files
start = 0  # starting column of input file from which to import data
stop = 99  # ending column of input file from which to import data. Must not exceed number of columns in input file.

# Create a dictionary where keys are items in location_list, and content associated with each key is a pandas dataframe.
if File_Type == 'XLSX':
    extension = '.xlsx'
    # sheetname=0 reads the first worksheet; sheetname=None would instead return a dictionary of all sheets.
    Excel_data_dictionary = {name: pd.read_excel(name + extension, sheetname=0, usecols=[i for i in range(start, stop)]) for name in
                             location_list}
    # Reset indices as desired dates
    for keys in Excel_data_dictionary:
        Excel_data_dictionary[keys] = Excel_data_dictionary[keys].set_index(simulation_dates)
    print('XLSX Data Import Complete in %s seconds' % (time.time() - start_time))
elif File_Type == 'CSV':
    extension = '.csv'
    CSV_data_dictionary = {name: pd.read_csv(name + extension, usecols=[i for i in range(start, stop)]) for name in location_list}
    # Reset indices as desired dates
    for keys in CSV_data_dictionary:
        CSV_data_dictionary[keys] = CSV_data_dictionary[keys].set_index(simulation_dates)
    print('CSV Data Import Complete in %s seconds' % (time.time() - start_time))

When I run this code, the .csv data import took about 1.5 seconds on my office desktop computer (not bad), and the .xlsx import took 100 seconds (awful!)! This is why you should avoid storing data in Excel workbooks.

You can visualize the pandas DataFrame object per the image below, with a major axis that corresponds to dates, and a minor axis that corresponds to each simulation run (or realization). Note that I added code that manually sets the major axis index to date values, as my input sheet had no actual dates listed in it.

Pandas DF

Now, if I want to see the data for a particular realization or date, pandas makes this easy. For example, the following would access the DataFrame column corresponding to realization 8 for Reservoir_1:

CSV_data_dictionary['Reservoir_1']['Realization8']
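
Similarly, because the index was set to dates, you can slice by date. A sketch continuing from the dictionary created above (the specific dates are arbitrary):

# Releases at Reservoir_1 on a single date, across all realizations:
print(CSV_data_dictionary['Reservoir_1'].loc['1950-06-01'])

# A full year of releases for a single realization:
print(CSV_data_dictionary['Reservoir_1']['Realization8']['1950-01-01':'1950-12-31'])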

Option 2: If you must work with .xlsx files and need to import specific sheets and cell ranges from those files, I recommend that you use the open-source Python library OpenPyXL. (There are other options, some of which are reviewed here). OpenPyXL is nice because it can both read from and write to Excel files, it can loop through worksheets and cells within those worksheets, and it does not require that you have Microsoft Office if you are working in Windows. Indeed, it does not require Windows at all (it works well on Linux, though I have not tried it out in OSX). OpenPyXL documentation is available here, and includes numerous examples. This page also provides some nice examples of using OpenPyXL.
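
As a quick illustration (a sketch; the file and worksheet names are assumptions), reading a small cell range with OpenPyXL looks like this:

from openpyxl import load_workbook

# read_only mode is much faster for large workbooks.
wb = load_workbook('Reservoir_1.xlsx', read_only=True)
ws = wb['Sheet1']  # hypothetical worksheet name

# Print the values in cells A1 through C3.
for row in ws.iter_rows(min_row=1, max_row=3, min_col=1, max_col=3):
    print([cell.value for cell in row])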

Some additional tips regarding OpenPyXL:

  • Make sure you are working with the latest version. If you install openpyxl through an IDE, be sure it’s updated to at least version 2.2.4 or 2.2.5. I’ve had problems with earlier versions.
  • I believe OpenPyXL only works with .xlsx files, not .xls files.

Debugging in Python (using PyCharm) – Part 2

This post is part 2 of a multi-part series of posts intended to provide discussion of some basic debugging tools that I have found to be helpful in developing a pure Python simulation model using a Python Integrated Development Environment (IDE) called PyCharm. You can see part 1 of the debugging discussion here, and other discussion of useful PyCharm functionality here.

In this post I will focus on PyCharm’s “Profile” features, which are very useful for debugging by allowing you to see what parts of your program (e.g., modules, classes, methods) are/are not being called, how frequently, and how much time is being spent in them. This has been especially useful to me when my program runs without error messages, but isn’t exactly performing as I expect it to (e.g., the program is taking a very long time to produce a result).

Step 1. Open your python files in PyCharm. In this case, my file is called SedSim_v2.py. I suggest you do this with a file that already executes without errors.

Step 2. In menu at the top of the screen, click Run–>Profile SedSim_v2 (or your file name), as is shown in the image below.

Fig. 1

Step 3. This will open a new window in the “Run” section (PyCharm will now execute the file in a console). Wait for the profiling run to be completed, then within the Run window click on the upper (“capture snapshot”) icon. This will open a new file that summarizes the profile run with a “Statistics” tab and a “Call Graph” tab.

The statistics tab contains an overwhelming amount of information. A screenshot is shown below for just some of the Statistics tab.

Fig 2

The call graph presents a visual summary of function/method calls. An example screen shot of this is shown below. Note that in this case, the diagram we are seeing has a “hierarchical group” layout. Other groups include circular, organic, orthogonal, and directed orthogonal. Right-clicking in the call graph space will reveal various options, including diagram layout.

FIg 3

In the photo below I have zoomed in on a particular area of the figure. We can learn a lot from the small amount of information presented here, as I will describe below.

Fig 4

From the information presented in this figure, we can gather that SedSim_v2.py is responsible for 100% of execution time (this, of course, makes sense, as this was the top-level file I ran). I am still learning about the Profiling features in PyCharm, so my very loose interpretation of the remaining information is as follows. The “Total” run took 264,381 milliseconds (or about 264 seconds). Only 1.7% (“Own” time) of this run-time was taken up by actions within SedSim_v2.py that did not involve calling the other branched methods/functions listed in this diagram.

One can then follow the other branches in the Call diagram to see how much of the time spent calling methods outside of SedSim_v2 was spent in those methods. For example, the “Master_Method_Caller” method (which is a method of the Reservoir class in my model), shown in the upper left of the figure, is responsible for 21.3% of the run time. This method is called 73,000 times.
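
As far as I can tell, PyCharm’s profiler is built on Python’s standard profiling tools, so you can generate comparable statistics yourself with the standard library’s cProfile module. A minimal sketch (the module and output file names are placeholders):

import cProfile
import pstats

# Run the top-level script under the profiler and save the raw timing data
cProfile.run('import SedSim_v2', 'sedsim.prof')

# 'cumulative' corresponds roughly to PyCharm's "Total" column, and 'tottime' to "Own"
stats = pstats.Stats('sedsim.prof')
stats.sort_stats('cumulative').print_stats(10)  # show the ten most expensive call paths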

Step 4. Make necessary adjustments to your code based on what you have discovered. If, for example, you are experiencing a very long program run time, the approach I outline above may allow you to pinpoint what methods are responsible. This diagram helped me figure out that the Reservoir_Volume_Reduction method (another method of the Reservoir class) is taking up far too much time for what it does. Something clearly needed to be fixed!

Debugging in Python (using PyCharm) – Part 1

This blog post is the first of a multi-part series (here are the second and third debugging blog posts) intended to provide discussion of some basic debugging tools that I have found to be helpful in developing a pure Python simulation model using a Python Integrated Development Environment (IDE) called PyCharm. (Some of this discussion surely applies to other languages and development environments as well; I just want folks to know that I prepared this post with Python/PyCharm in mind.) The post was updated with some additional tips from Joe Kasprzyk on 10/15/2015.

If you didn’t catch my previous post, I strongly recommend PyCharm as a Python IDE, and have described in that post how to get access to PyCharm. PyCharm can be a bit heavy just for scientific computing, but in developing a large, object-oriented simulation model, I have found PyCharm’s code completion and debugging features to be a major asset for coding quickly and efficiently.

When I am debugging a program, there are often two fundamental issues at hand. The first is to make sure the program runs without any syntax errors (or any other errors generated while interpreting your code). The second is to be confident that your code is doing what you think it is. That is, even if it runs, there may be problems within the implementation of the routines you have coded. PyCharm’s highly intelligent code editor is very helpful for solving the former problem before you even run your program: it will highlight errors on the fly, help you complete code to avoid variable misspellings, etc. However, to handle the latter issue, you may want to glimpse into the code’s functioning (e.g., variable values) as the program executes. I will now describe some of PyCharm’s debugging features that I have found to be very helpful in this regard.

I prepared the instructions and figures that appear below with the Professional version of PyCharm (JetBrains PyCharm 4.5.2), and I’m using it in Windows 7.

I am going to show examples of how to use PyCharm’s debug features on a very simple Python script, which appears below. You can, of course, follow these same steps with your own files. This script should work, so it shouldn’t need debugging, but I’ll show how I would use the debugging tools to check it anyway.

# My program file
import matplotlib.pyplot as plt
import numpy as np

# Initialize empty arrays
array_len = 10000  # Desired length of arrays
var1 = np.zeros(array_len)
var2 = np.zeros(array_len)
var3 = np.zeros(array_len)
var4 = np.zeros(array_len)

# Define two functions:

def function_1(factor):
    # Function 1: fill var1 and var2
    # (the parameter is named "factor" to avoid shadowing the builtin "input")
    for i in range(array_len):
        var1[i] = 2 + i*factor
        var2[i] = 2*var1[i]
    return var1, var2

def function_2(factor):
    # Function 2: fill var3 and var4
    for j in range(array_len):
        var3[j] = 3 + j*factor
        var4[j] = 2*var3[j]
    return var3, var4

# Add var1 and var3, and plot result:
Output = function_1(1)[0] + function_2(2)[0]
plt.plot(Output)
plt.show()  # needed for the figure to appear when running as a plain script

Step 1. Open your files.

In PyCharm, open a file (or files) that you have coded, or are in the process of coding, and want to debug. For example, a screenshot of my open file appears below.

Fig 1

Step 2. Turn on the debugger.

There are numerous ways to do this. I prefer to open up a Python console to work from, and turn it on from there (as is shown in the screenshot below). To do this, click on the “Python console” button at the bottom of the screen, then click on the green bug icon on the console menu on the left to turn on the debugger. Note that you can actually have multiple consoles open (by clicking the “plus” icon on the console menu), and you can have debugging turned on in whichever consoles you prefer.

If you can’t figure out how to open a Python console, you can instead just click on the green bug icon in the menu at the top of the screen, next to the play button, while your script is open. Note that this may also immediately run the debugger, a step that comes later in this post.

Fig 2

Step 3. Create breakpoints.

Breakpoints are spots in your code where you want to temporarily interrupt execution so you can glimpse into the program’s operation (e.g., view values of variables as the program runs). If you left-click in the gutter just to the left of the code editor, a filled red circle will appear, denoting a breakpoint.

In my example, I will set breakpoints on the lines where variables var2 and var4 are defined. (Note that I like to have line numbers turned on, particularly when debugging. You can access this feature in PyCharm’s settings menu).

Breakpoints are only used when running the program in debug mode. That is, when you turn debugging off (e.g., by clicking once again on the bug icon in the console menu), the program will execute without stopping on breakpoints. Also, note that breakpoints are useful not only for viewing the program’s variable values and so forth, but also to see if the program is entering particular sections of code. If the program doesn’t stop on a breakpoint you’ve set, then the program is not entering the section of code where that breakpoint is located, which can be very valuable information.

Fig 3

Step 4. Create a run configuration and run the debugger.

There are several ways to do this as well. You can simply click on the green bug icon at the top of the screen again to run the debugger. Or, highlight your code, right click, and select the option to “Debug” your file.

Joe adds: If you want to do a quick debug of an uncomplicated script, the above will work well. However, sometimes you may want to debug a Python file that has command line arguments, such as pareto.py. To create a configuration, right click on the configuration pull-down menu in the upper right-hand corner of the program, and click Edit Configurations… Then you can add the command line arguments by typing them into the “Script parameters” box in that dialog window. Then, you can continue debugging as usual. Back to your regularly scheduled programming…

I have noticed that my programs execute significantly more slowly in debug mode, so you may experience the same. A debugger pane should appear as below.

Fig 4

Step 5. Set “Watches”.

Note the program execution has been interrupted on the “var1” line. Suppose we want to know the value of this local variable in its current loop. In the debugger pane, go to the “Watches” window, click the “plus” symbol, and enter “var1”. Repeat this process for the rest of the variables.

Fig 5

The debugger still shows no value for var1. This is because when the PyCharm debugger stops on a breakpoint, it interrupts execution before the breakpoint line is run, rather than after.

If you click on the “play” icon to resume execution in the debugger, you will see values appear in your “Watches” pane. (If you hold your cursor over the different icons in the debugging pane, an icon description at the bottom of the window appears, so you can figure out which buttons I’m referring to this way).

PyCharm also displays the values of watch variables (as well as the counter i) directly in the code editor (note the new gray text in the figure below). Watches are helpful for more complex programs than this one, where you could have hundreds or thousands of variables, attributes, objects, etc., and want to track the values of only specific ones, so you can check them as you debug.

Fig 6

Joe adds:

Step 6. Add some exception handling to find information about an error.

Sometimes it is unclear why you’re getting a particular error, such as “list index out of range.”  The Python run may give you information about what line number caused the problem, but that may not be enough.  Exception handling provides a way to create a ‘soft landing’ when the program runs into a problem.  Check out the Python documentation of this here.

For example, I had a problem with the function ‘withobjectives’ within pareto.py.  It was telling me list index out of range.  This seemed strange to me, and setting a breakpoint would be a lot of work because I didn’t know when the actual error was happening.  But exception handling saves the day.  Here’s the original offending line:

for oo in oindices:
    objectives.append(float(row[oo]))

I add ‘try’ and ‘except’ statements around the offending line:

for oo in oindices:
    try:
        objectives.append(float(row[oo]))
    except IndexError:  # catch only the error we expect, rather than a bare except
        print "You messed up. Here is row:"
        print "%s" % row

Now, there will be a line printed to the console that shows you the variable that caused the problem. In this case, there was an error in the text file, and I can go and use a find command in a text editor to realize that there was an extra column somewhere in the file that should not have been there.
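
If you want more context than a hand-written message, the standard library’s traceback module can print the full stack trace from inside the except block without halting the run. A sketch, in the same Python 2 style as above:

import traceback

for oo in oindices:
    try:
        objectives.append(float(row[oo]))
    except IndexError:
        traceback.print_exc()  # prints the full traceback, including file and line number
        print "Offending row: %s" % row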

We return to the original post:

I have described just a few basic features. Other neat ones include:

  • If you want to temporarily turn off all breakpoints during the debug run, use “mute breakpoints”.
  • If you want the breakpoints only to turn on when particular conditions are satisfied, click on the “view breakpoints” icon. In the image below I am setting a rule that means the breakpoint will only interrupt execution when the counter (i) is >= 9998.

Fig 7

  • If you’ve been scrolling around in a big file and can’t find where the breakpoint is, click on the “Show Execution Point” icon.
  • You can change your code as you are debugging without any problem. So, if you discover and fix an error, you don’t have to stop the debugger and start over again. (In many cases, though, it obviously might be a good idea to just start again, in which case you can click on the “Stop” and/or “Rerun” icons).
  • In addition to watches, and values displayed by PyCharm in the editor, you can see the values of a variable (or attribute, etc.) by holding your mouse over that variable.
  • If you want to see plots of variables as you are debugging, you can import matplotlib and insert code in your script to generate a plot of those variables; PyCharm will render such figures during the debug (a quick sketch follows this list). More details on matplotlib integration with the PyCharm debugger can be found here.
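
For example, while paused at a breakpoint in the script above, you could evaluate something like the following in the debug console (a sketch; var1 and i come from the example script):

import matplotlib.pyplot as plt
plt.plot(var1[:i])  # plot only the values filled in so far
plt.show()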

In a future post I will demonstrate PyCharm’s Coverage and Profile features, which are very useful for debugging by allowing you to see what parts of your program (e.g., modules) are/are not being accessed, how frequently, and how much time is being spent in them.

PyCharm as a Python IDE for Generating UML Diagrams

This blog post is intended to provide discussion of some interesting features of a Python Integrated Development Environment (IDE) called PyCharm, which I have recently found to be useful in the development of a pure Python object-oriented simulation model. Specifically, I will focus on PyCharm’s ability to generate diagrams that help visualize the design of an object-oriented model. (Also, please see my other post on PyCharm debugging features if you’re interested in that).

Note that there are numerous Python IDEs, all of which have advantages and disadvantages. This post is not intended to be a comparison of Python IDEs, as others have done such comparisons (see this interesting write-up by Dr. Pedro Kroger). To accomplish the tasks I will describe below, I found PyCharm to be the easiest IDE/approach to use.

To give you some background, I am working on building an object oriented river basin simulation model called PySedSim. The model will have many different classes (e.g., river channels and reservoirs), many of which are related to other classes in different ways (e.g., they may inherit behavior, or methods, from other classes). Naturally, I wish to create a diagram of all of these classes, methods, and relationships before coding them, to structure my code development process. In such a circumstance it can be helpful to create a Unified Modeling Language (UML) diagram to visualize the design of the object-oriented model (e.g., inheritance/class relationships, and methods/attributes present in each class). Other diagram types (e.g., tree diagram or directed graph diagram) can also be helpful, depending on what kind of program you are building.

To generate a UML diagram, I started out using tools like Gliffy, which comes with a 30-day trial, as well as just creating diagrams in MS Word. I grew frustrated with these approaches because after drawing a small UML diagram, I would decide to change the layout of the model (e.g., re-name and re-arrange classes), and then have to draw a completely new diagram! So, to work around this problem, I decided a more efficient approach could be to prepare a skeleton of a Python model, replete with class and method (and even attribute) definitions, and locate a Python IDE that could automatically generate a UML diagram for me using my Python code. This way, when I inevitably decide I don’t like the way the program is designed, I can re-write the skeleton code, and the diagram will automatically adjust itself. This takes full advantage of the fact that Python can be coded and run so quickly that it is basically executable pseudo-code.

As you may have guessed, I discovered that PyCharm is a good tool for doing exactly what I have described above (generate a class diagram from Python code). Note that only the professional version of PyCharm (available for free to students) will do this, as far as I can tell. While I will focus on PyCharm in this post, using PyCharm is not the only way to generate a UML from python code. Some other options are reviewed here. I found PyCharm to be the most efficient means of generating a nicely formatted UML among these options.

Here is some more information (or references to places where you can learn more) about PyCharm:

  • You can learn a bit more about PyCharm here.
  • Some of PyCharm’s key features are described here.
  • As I mentioned, if you are a student (or instructor) with a valid university email address, you can obtain a one-year (renewable) student/teacher license to use the professional version of PyCharm for free. Read more here.
  • A comparison of the free versus professional version is available here.
  • PyCharm has a blog here with useful tips and tricks.
  • Some additional things I have liked about PyCharm so far: really nice code completion and debugging features, support for Git version control, and the ability to rearrange the order of different script files in the user interface, which, for whatever reason, I have been unable to accomplish in Spyder (another Python IDE I have used).

Below I am going to explain and show how I created a class diagram using PyCharm. The code I show is written to be interpreted in Python 2, and I have only tested this process in Windows 7.

Step 1. Create some file(s)/scripts that establish different classes, and store them in a folder together. Open PyCharm, and go to File –> New Project, and select the directory where your files are stored. In this example, I am using three script files.

Step 2. Right click on one of your files in the project window. Click on Diagrams –> Show Diagram –> Python class diagram. A new .uml file will be created with a small rectangle in it representing the class you have added.

Figure 1

Step 3. Add other classes to this diagram by hitting the space bar on the screen (or right clicking anywhere on the canvas and clicking “add class to diagram”), and searching/selecting the other class files in the window that appears as below.

Figure 2

After adding the channel and reservoir classes, the diagram appears as below. Note that the Channel class has two subclasses, which I also added. You can rearrange the classes in any way that suits you.

Figure 3

Step 4. Right click on the canvas and select “Show categories”. The categories let you reveal additional information about each class. The buttons in the inner menu on the upper left (the “m”, “i” and “f” buttons) achieve the same thing.

Figure 4

For example, selecting “Methods” and “Fields” will show the methods and/or attributes present in each class.

Figure 5

If you make some changes to your code (e.g., re-name a method in your reservoir.py file), those changes will be automatically reflected in your class diagram. Note, however, that you can also re-name (‘refactor’) methods and attributes directly from within the UML diagram. For example, right-click on a method name in your UML diagram, click “Refactor”, then “Rename”. This lets you quickly change the method’s name in all classes that inherit/customize this method from the current class, and even in the classes from which this method is inherited. No changes will be made to identically named methods in completely separate classes. PyCharm will let you preview where the changes will be made before they are made.

Figure 6

Note that to do all of this you didn’t even need to fully develop or even run the model. You will see in my files below that most of my methods (or functions) don’t even have anything defined in them yet. I simply have kept them there as placeholders for when I start to code those methods.

If you have thoughts on other IDEs that you have found to be helpful for the purposes I describe, I would love to hear your comments.

If you want to use some existing code to try this, below is the code for the three class files I used to generate the UML in the images I showed.

From storage_element.py:

# import relevant libraries
import numpy as np
from ExcelDataImport import *

# Description: This file defines storage element class and methods

class Storage_Element:
    def __init__(self, name, T, Input_Data_File):
        self.name = name
        # Initialize critical attributes as arrays, then import input data.
        Storage_Element.Array_Initialization(self, name, T)
        Storage_Element.Import_Data(self, name, T, Input_Data_File)
    def __repr__(self):
        # Prints object information; removed_load and BS_W are set later by subclass methods
        return '[Model Element Info.: %s, %s, %s]' % (self.name, self.removed_load, self.BS_W)
    def Array_Initialization(self, name, T):
        self.Q_in = np.zeros(T)
        self.Q_out = np.zeros(T)
    def Import_Data(self, name, T, Input_Data_File):
        if 'Incremental Flows' in Input_Data_File.sheetnames:
            self.Incremental_Flow_Junction = Excel_Data_Import(self.name, Input_Data_File, 'Incremental Flows', 0, T, max_distinct_data_types = None, data_name_offset = None)
        if 'Incremental Sediment Loads' in Input_Data_File.sheetnames:
            self.Incremental_Sed_Load_Junction = Excel_Data_Import(self.name, Input_Data_File, 'Incremental Sediment Loads', 0, T, max_distinct_data_types = None, data_name_offset = None)
    def Element_inflows(self):
        return

From channel.py:

import numpy as np
from storage_element import Storage_Element
from ExcelDataImport import *

# Description: This file defines the channel class and methods
class Channel(Storage_Element):
    def __init__(self, name, T, Input_Data_File):
        if hasattr(Storage_Element, '__init__'):
            Storage_Element.__init__(self, name, T, Input_Data_File) # If parent class has constructor method, then call that first.
        # Channel.Import_Data(self, T, Input_Data_File)
    def Import_Data(self, T, Input_Data_File):
        if 'Reach Specifications' in Input_Data_File.sheetnames:
            # Placed here because reservoirs are considered to be reaches in the unregulated simulation.
            [self.Routing_Coefficient, self.Routing_Exponent, self.Pool_Volume, self.Initial_Storage, self.alpha_2_3, self.beta_2_3, self.Initial_Sediment_Mass] = Excel_Data_Import(self.name, Input_Data_File, 'Reach Specifications', 1, 7, max_distinct_data_types = None, data_name_offset = None)
    def mass_balance(self, constant, t=None):
        self.removed_load = np.power(constant,2) + self.new_func(2)
        self.BS_W[t+1] = self.BS_W[t] + 3
        return self.removed_load
    def new_func(self,new_constant):
        trapped_load = 20 + new_constant
        return trapped_load
    def Flow_Routing(self):
        return
    def Sediment_Routing(self):
        return
class Diversion_Channel(Channel):
    # A diversion is a channel or pipe that is a regulated (at inflow) conveyance channel, unregulated at outflow.
    def __init__(self, name, T, Input_Data_File):
        if hasattr(Channel, '__init__'):
            Channel.__init__(self, name, T, Input_Data_File)  # Parent constructor also requires Input_Data_File
    def Element_inflows(self):
        return
    def Flow_Routing(self):
        return
    def Sediment_Routing(self):
        return
class Bypass_Channel(Channel):
    # A bypass is a channel or pipe that is a regulated (at inflow) conveyance channel, unregulated at outflow, but is at the upstream end of a reservoir.
    def __init__(self, name, T, Input_Data_File):
        if hasattr(Channel, '__init__'):
            Channel.__init__(self, name, T, Input_Data_File)  # Parent constructor also requires Input_Data_File
    def Element_inflows(self):
        return
    def Flow_Routing(self):
        return
    def Sediment_Routing(self):
        return

From reservoir.py:

import numpy as np
from storage_element import Storage_Element
from outlet import Outlet
from ExcelDataImport import *

# This file defines the reservoir class and methods
class Reservoir(Storage_Element):
    def __init__(self, name, T, Input_Data_File):
        if hasattr(Storage_Element, '__init__'):
            Storage_Element.__init__(self, name, T, Input_Data_File) #If parent class has constructor method, then call that first.
        Reservoir.Array_Initialization(self,T) # Initialize arrays reservoirs will have.
        Reservoir.Import_Data(self, T, Input_Data_File)
        # Reservoir.Import_Reservoir_Data(self,....)
        # Every reservoir must have outlets (orifices) of some sort
        self.Orifices = Outlet(self.name, T, Input_Data_File)
    def Array_Initialization(self, T):
        # Initialize Arrays
        self.Q_downstream = np.zeros(T)
    def Import_Data(self, T, Input_Data_File):
            # Worksheet_Names_Preferences = {} # Dictionary stores key of worksheet names, and a list of the following:
        if 'Evaporation Data' in Input_Data_File.sheetnames:
            self.Monthly_Evap_Data = Excel_Data_Import(self.name, Input_Data_File, 'Evaporation Data', 1, 12, max_distinct_data_types = None, data_name_offset = None)
    def mass_balance(self, constant, t):
        self.removed_load = np.power(constant, 2) + self.new_func(2)
        self.BS_W[t+1] = self.BS_W[t] + 3 #- self.Orifice[3].Q_overflow[t]
        return self.removed_load
    def new_func(self,new_constant):
        trapped_load = 20 + new_constant
        return trapped_load