Using Borg in Parallel and Serial with a Python Wrapper – Part 2

This blog post is Part 2 of a two-part series that will demonstrate how I have coupled a pure Python simulation model with the Borg multi-objective evolutionary algorithm (MOEA). I recommend reading Part 1 of this series before you read Part 2. In Part 1, I explain how to get Borg and provide sample code showing how you can access Borg’s serial and/or parallelized (master-slave) implementations through a Python wrapper (borg.py). In Part 2, I provide details for more advanced simulation-optimization setups that require you to pass additional information from the Borg wrapper into the simulation model (the “function evaluation”) beyond just decision variable values.

In Part 1, the example simulation model I use (PySedSim) is called through a function handle “Simulation_Caller” in the example_sim_opt.py file. Borg needs only this function handle to properly call the simulation model in each “function evaluation”. Borg’s only interaction with the simulation model is to pass the decision variables to that function handle (e.g., “Simulation_Caller”). In many circumstances, this is all you need.

However, as your simulation-optimization setup becomes more complex, your simulation model (i.e., the function evaluation) may need additional arguments from Borg beyond the decision variables in order to execute properly. For example, in my own use of Borg in a simulation-optimization setting, before doing any optimization I first import a variety of assumptions and preferences to set up a Borg-PySedSim run. Some of those assumptions and preferences help the simulation model (PySedSim) determine how to make use of the decision variable values Borg feeds it. So, I would like to pass those relevant assumptions and preferences directly into the Borg wrapper (borg.py), so the wrapper can in turn pass them into the simulation model along with the decision variable values.

Before I show how to do this, let me provide a more concrete example of how/why I am doing this in my own research. In my current work, decision variable values represent parameters for a reservoir operating policy that is being optimized. The simulation model needs to know how to take the decision variable values and turn them into a useful operating policy that can be simulated. Some of this information gets imported in order to run Borg, so I might as well pass that information directly into the simulation model while I have it on hand, rather than importing it once again in the simulation model.

To do what I describe above, we just need to modify the two functions in the example_sim_opt.py module so that a new argument “additional_inputs” is passed from Borg to the simulation handle. Using my Python code from blog post 1, I provide modified code below; the changes appear in the Simulation_Caller() function (its new “additional_inputs” argument, the unpacking of those inputs, and the call to PySedSim) and in the Optimization() function (the two example dictionaries and the Borg instantiation that passes them). After that code, I indicate how to modify the borg.py wrapper so it can accept this information.

import numpy as np
import pysedsim # This is your simulation model
import platform  # helps identify directory locations on different types of OS

def Simulation_Caller(vars, additional_inputs):
    '''
    Purpose: Borg calls this function to run the simulation model and return multi-objective performance.

    Note: You could also just put your simulation/function evaluation code here.

    Args:
        vars: A list of decision variable values from Borg
        additional_inputs: A list of python data structures you want to pass from Borg into the simulation model.
    Returns:
        performance: policy's simulated objective values. A list of objective values, one value for each objective.
    '''

    borg_vars = vars  # Decision variable values from Borg

    # Unpack lists of additional inputs from Borg (example assumes additional inputs is a python list with two items)
    borg_dict_1 = additional_inputs[0]
    borg_dict_2 = additional_inputs[1]

    # Reformat decision variable values as necessary (e.g., cast Borg output parameters as an array for use in simulation)
    op_policy_params = np.asarray(borg_vars)
    # Call/run simulation model with decision vars and additional relevant inputs, return multi-objective performance:
    performance = pysedsim.PySedSim(decision_vars = op_policy_params, sim_addl_input_1 = borg_dict_1, sim_addl_input_2 = borg_dict_2)
    return performance

def Optimization():

    '''

    Purpose: Call this method from command line to initiate simulation-optimization experiment

    Returns:
        --pareto approximate set file (.set) for each random seed
        --Borg runtime file (.runtime) for each random seed

    '''

    import borg as bg  # Import borg wrapper

    parallel = 1  # 1= master-slave (parallel), 0=serial

    # The following are just examples of relevant MOEA specifications. Select your own values.
    nSeeds = 25  # Number of random seeds (Borg MOEA)
    num_dec_vars = 10  # Number of decision variables
    n_objs = 6  # Number of objectives
    n_constrs = 0  # Number of constraints
    num_func_evals = 30000  # Number of total simulations to run per random seed. Each simulation may itself be a Monte Carlo ensemble.
    runtime_freq = 1000  # Interval at which to print runtime details for each random seed
    decision_var_range = [[0, 1], [4, 6], [-1,4], [1,2], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1]]
    epsilon_list = [50000, 1000, 0.025, 10, 13, 4]  # Borg epsilon values for each objective
    borg_dict_1 = {'simulation_preferences_1': [1,2]}  # reflects data you want Borg to pass to simulation model
    borg_dict_2 = {'simulation_preferences_2': [3,4]}  # reflects data you want Borg to pass to simulation model

    # Where to save seed and runtime files
    main_output_file_dir = r'E:\output_directory'  # Specify location of output files for different seeds (raw string avoids backslash escapes)
    os_fold = Op_Sys_Folder_Operator()  # Folder operator for operating system
    output_location = main_output_file_dir + os_fold + 'sets'

    # If using master-slave, start MPI. Only do once.
    if parallel == 1:
        bg.Configuration.startMPI()  # start parallelization with MPI

    # Loop through seeds, calling borg.solve (serial) or borg.solveMPI (parallel) each time
    for j in range(nSeeds):
        # Instantiate borg class, then set bounds, epsilon values, and file output locations
        borg = bg.Borg(num_dec_vars, n_objs, n_constrs, Simulation_Caller, add_sim_inputs = [borg_dict_1, borg_dict_2])
        borg.setBounds(*decision_var_range)  # Set decision variable bounds
        borg.setEpsilons(*epsilon_list)  # Set epsilon values
        # Runtime file path for each seed:
        runtime_filename = main_output_file_dir + os_fold + 'runtime_file_seed_' + str(j+1) + '.runtime'
        if parallel == 1:
            # Run parallel Borg
            result = borg.solveMPI(maxEvaluations=num_func_evals, runtime=runtime_filename, frequency=runtime_freq)

        if parallel == 0:
            # Run serial Borg
            result = borg.solve({"maxEvaluations": num_func_evals, "runtimeformat": 'borg', "frequency": runtime_freq,
                                 "runtimefile": runtime_filename})

        if result:
            # This particular seed is now finished running. When running master-slave Borg, the result will only be
            # returned from one node.
            result.display()

            # Create/write objective values and decision variable values to files in folder "sets", 1 file per seed.
            f = open(output_location + os_fold + 'Borg_DPS_PySedSim' + str(j+1) + '.set', 'w')
            f.write('#Borg Optimization Results\n')
            f.write('#First ' + str(num_dec_vars) + ' are the decision variables, ' + 'last ' + str(n_objs) +
                    ' are the ' + 'objective values\n')
            for solution in result:
                line = ''
                for i in range(len(solution.getVariables())):
                    line = line + (str(solution.getVariables()[i])) + ' '

                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f.write(line[0:-1]+'\n')
            f.write("#")
            f.close()

            # Create/write only objective values to files in folder "sets", 1 file per seed. Purpose is so that
            # the file can be processed in MOEAFramework, where performance metrics may be evaluated across seeds.
            f2 = open(output_location + os_fold + 'Borg_DPS_PySedSim_no_vars' + str(j+1) + '.set', 'w')
            for solution in result:
                line = ''
                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f2.write(line[0:-1]+'\n')
            f2.write("#")
            f2.close()

            print("Seed %s complete") %j

    if parallel == 1:
        bg.Configuration.stopMPI()  # stop parallel function evaluation process

def Op_Sys_Folder_Operator():
    '''
    Function to determine whether operating system is (1) Windows, or (2) Linux

    Returns folder operator for use in specifying directories (file locations) for reading/writing data pre- and
    post-simulation.
    '''

    if platform.system() == 'Windows':
        os_fold_op = '\\'
    elif platform.system() == 'Linux':
        os_fold_op = '/'
    else:
        os_fold_op = '/'  # Assume unix OS if it can't be identified

    return os_fold_op
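
As an aside: rather than hand-picking the folder separator, Python's built-in os.path.join will insert the correct separator for whatever operating system the code runs on. A minimal sketch (reusing main_output_file_dir from the code above):

import os

# Equivalent to main_output_file_dir + os_fold + 'sets', on any operating system:
output_location = os.path.join(main_output_file_dir, 'sets')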
Next, you will need to acquire the Borg wrapper using the instructions I specified in my previous blog post. You will need to make only two modifications: (1) modify the Borg class in borg.py so it accepts the inputs you want to pass to the simulation; and (2) some additional internal accounting in borg.py to ensure those inputs are passed to the borg.py methods that deal with your function handle. I will address these two in order.

First, modify the Borg class in borg.py so it now accepts an additional input (I only show some of the borg.py code here, just to indicate where changes are being made):


class Borg:
    def __init__(self, numberOfVariables, numberOfObjectives, numberOfConstraints, function, epsilons = None, bounds = None, directions = None, add_sim_inputs=None):

    # add_sim_inputs is the new input you will pass to borg


Then, modify the portion of the borg.py wrapper where self.function is called, so it can accommodate any simulation inputs you have specified.


if add_sim_inputs is None:
    self.function = _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions)
else:
    # More simulation inputs are specified and can be passed to the simulation handle
    self.function = _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions, addl_inputs=add_sim_inputs)

After the above, the last step is to modify the _functionWrapper method in borg.py:


def _functionWrapper(function, numberOfVariables, numberOfObjectives, numberOfConstraints, directions=None, addl_inputs=None):
    # addl_inputs will be passed into the simulation model
    def innerFunction(v, o, c):
        global terminate
        try:
            if addl_inputs is None:
                result = function(*[v[i] for i in range(numberOfVariables)])
            else:
                result = function([v[i] for i in range(numberOfVariables)], addl_inputs)
            # ... (remainder of innerFunction unchanged)
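
With these changes in place, you can sanity-check the plumbing without invoking Borg at all by calling the simulation handle directly. A minimal sketch (the dummy decision vector below is purely illustrative):

# Hypothetical quick check that the handle accepts decision variables plus additional inputs:
test_vars = [0.5] * 10  # dummy decision vector of length num_dec_vars
borg_dict_1 = {'simulation_preferences_1': [1, 2]}
borg_dict_2 = {'simulation_preferences_2': [3, 4]}
performance = Simulation_Caller(test_vars, [borg_dict_1, borg_dict_2])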

Using Borg in Parallel and Serial with a Python Wrapper – Part 1

Simulation and optimization are frequently used to solve complex water resources and environmental systems problems. By itself, a simulation model raises the question “what to simulate?” Similarly, by itself, an optimization model raises the question “is the solution really best?” For this reason, simulation and optimization models are frequently coupled.

This blog post is part 1 of a multi-part series that will demonstrate how I have coupled a pure Python simulation model with the multi-objective evolutionary optimization algorithm Borg. In this post, I will show how you can access Borg’s serial and/or parallelized (master-slave) implementations through a Python wrapper (borg.py).

Please see this previous blog post for some background about Borg, and how to obtain it. My instructions below assume you have access to the Borg files.

In the setup I will describe below, Borg parameterizes and iteratively refines solutions (e.g., reservoir operating policies) to a problem, optimizing them in response to their simulated performance with respect to multiple objectives.

You will need the following Borg files (see link above for how to download these):

  • Serial (i.e., borg.c, libborg.so, etc.) and/or master-slave (i.e., borgms.c, libborgms.so, etc.) implementations of Borg, depending upon your ability to parallelize.
  • Python wrapper for Borg (borg.py), which will allow you to access Borg easily in Python.

You will need to create the following files yourself (I provide sample code below for these files):

  • example_sim_opt.py—A python module that should contain two main functions:
    1. A simulation caller, which takes decision variables and returns multi-objective performance. This function is called “Simulation_Caller” in the example_sim_opt.py code below.
    2. An optimization function, which calls the Borg MOEA through its python wrapper borg.py. This borg.py wrapper provides extensive docstring documentation regarding required arguments, returned values, etc., so I do suggest reading through the wrapper if you have questions (e.g., about the python data types of arguments and returns).

Note that the file and function names above are just example names. You can name the above files whatever you want. Just be sure to modify the code I provide below to reflect the new names.

A sample of code for example_sim_opt.py is as follows:

import numpy as np
import pysedsim # This is your simulation model
import platform  # helps identify directory locations on different types of OS

def Simulation_Caller(vars):
    '''
    Purpose: Borg calls this function to run the simulation model and return multi-objective performance.

    Note: You could also just put your simulation/function evaluation code here.

    Args:
        vars: A list of decision variable values from Borg

    Returns:
        performance: policy's simulated objective values. A list of objective values, one value for each objective.
    '''

    borg_vars = vars  # Decision variable values from Borg

    # Reformat decision variable values as necessary (e.g., cast Borg output parameters as an array for use in simulation)
    op_policy_params = np.asarray(borg_vars)
    # Call/run simulation model, return multi-objective performance:
    performance = pysedsim.PySedSim(decision_vars = op_policy_params)
    return performance

def Optimization():

    '''

    Purpose: Call this method from command line to initiate simulation-optimization experiment

    Returns:
        --pareto approximate set file (.set) for each random seed
        --Borg runtime file (.runtime) for each random seed

    '''

    import borg as bg  # Import borg wrapper

    parallel = 1  # 1= master-slave (parallel), 0=serial

    # The following are just examples of relevant MOEA specifications. Select your own values.
    nSeeds = 25  # Number of random seeds (Borg MOEA)
    num_dec_vars = 10  # Number of decision variables
    n_objs = 6  # Number of objectives
    n_constrs = 0  # Number of constraints
    num_func_evals = 30000  # Number of total simulations to run per random seed. Each simulation may itself be a Monte Carlo ensemble.
    runtime_freq = 1000  # Interval at which to print runtime details for each random seed
    decision_var_range = [[0, 1], [4, 6], [-1,4], [1,2], [0,1], [0,1], [0,1], [0,1], [0,1], [0,1]]
    epsilon_list = [50000, 1000, 0.025, 10, 13, 4]  # Borg epsilon values for each objective

    # Where to save seed and runtime files
    main_output_file_dir = r'E:\output_directory'  # Specify location of output files for different seeds (raw string avoids backslash escapes)
    os_fold = Op_Sys_Folder_Operator()  # Folder operator for operating system
    output_location = main_output_file_dir + os_fold + 'sets'

    # If using master-slave, start MPI. Only do once.
    if parallel == 1:
        bg.Configuration.startMPI()  # start parallelization with MPI

    # Loop through seeds, calling borg.solve (serial) or borg.solveMPI (parallel) each time
    for j in range(nSeeds):
        # Instantiate borg class, then set bounds, epsilon values, and file output locations
        borg = bg.Borg(num_dec_vars, n_objs, n_constrs, Simulation_Caller)
        borg.setBounds(*decision_var_range)  # Set decision variable bounds
        borg.setEpsilons(*epsilon_list)  # Set epsilon values
        # Runtime file path for each seed:
        runtime_filename = main_output_file_dir + os_fold + 'runtime_file_seed_' + str(j+1) + '.runtime'
        if parallel == 1:
            # Run parallel Borg
            result = borg.solveMPI(maxEvaluations=num_func_evals, runtime=runtime_filename, frequency=runtime_freq)

        if parallel == 0:
            # Run serial Borg
            result = borg.solve({"maxEvaluations": num_func_evals, "runtimeformat": 'borg', "frequency": runtime_freq,
                                 "runtimefile": runtime_filename})

        if result:
            # This particular seed is now finished running. When running master-slave Borg, the result will only be
            # returned from one node.
            result.display()

            # Create/write objective values and decision variable values to files in folder "sets", 1 file per seed.
            f = open(output_location + os_fold + 'Borg_DPS_PySedSim' + str(j+1) + '.set', 'w')
            f.write('#Borg Optimization Results\n')
            f.write('#First ' + str(num_dec_vars) + ' are the decision variables, ' + 'last ' + str(n_objs) +
                    ' are the ' + 'objective values\n')
            for solution in result:
                line = ''
                for i in range(len(solution.getVariables())):
                    line = line + (str(solution.getVariables()[i])) + ' '

                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f.write(line[0:-1]+'\n')
            f.write("#")
            f.close()

            # Create/write only objective values to files in folder "sets", 1 file per seed. Purpose is so that
            # the file can be processed in MOEAFramework, where performance metrics may be evaluated across seeds.
            f2 = open(output_location + os_fold + 'Borg_DPS_PySedSim_no_vars' + str(j+1) + '.set', 'w')
            for solution in result:
                line = ''
                for i in range(len(solution.getObjectives())):
                    line = line + (str(solution.getObjectives()[i])) + ' '

                f2.write(line[0:-1]+'\n')
            f2.write("#")
            f2.close()

            print("Seed %s complete") %j

    if parallel == 1:
        bg.Configuration.stopMPI()  # stop parallel function evaluation process

def Op_Sys_Folder_Operator():
    '''
    Function to determine whether operating system is (1) Windows, or (2) Linux

    Returns folder operator for use in specifying directories (file locations) for reading/writing data pre- and
    post-simulation.
    '''

    if platform.system() == 'Windows':
        os_fold_op = '\\'
    elif platform.system() == 'Linux':
        os_fold_op = '/'
    else:
        os_fold_op = '/'  # Assume unix OS if it can't be identified

    return os_fold_op

The following is an example of how you would submit a batch script on a Linux cluster to run a parallelized simulation-optimization experiment using the example_sim_opt.py and borg.py files. Note that in the parallelized version, simulations (i.e., “function evaluations”) are being run in parallel by separate processors.

You would need the following two files:

  1. example_sim_opt_caller.py. This is a Python file that is used to call example_sim_opt.py.
  2. example_sim_opt_batch_script.pbs. This is a batch script that runs example_sim_opt_caller.py in parallel on a cluster using Open MPI.

Example code for example_sim_opt_caller.py:


'''
Purpose: To initiate the optimization process, which will iteratively call the simulation model.
'''

import example_sim_opt  # Import main optimization module that uses borg python wrapper

# Call the Optimization function within the example_sim_opt module
example_sim_opt.Optimization()

Example code for example_sim_opt_batch_script.pbs:


#PBS -l nodes=8:ppn=16
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -o pysedsim_output.out

cd $PBS_O_WORKDIR
source /etc/profile.d/modules.sh
module load openmpi-1.6.5-intel-x86_64/gnu
module load python-2.7.5
mpirun python example_sim_opt_caller.py

You could then run the above batch script with a command such as:

qsub example_sim_opt_batch_script.pbs

A Guide to Using Git in PyCharm – Part 2


This post is part 2 of a multi-part series intended to describe how to use PyCharm’s integrated Git (version control) features.

In my previous PyCharm post (Part 1), I described how to get started using Git in PyCharm, including:

  • Creating a local git repository
  • Adding and committing files to a git repository
  • Making and committing changes to a python file

In this post, I will describe:

  • Creating branches
  • Merging branches

These are extremely valuable features of Git, as they allow you to create a new version of your code to modify and evaluate before permanently integrating the changes into the master version of your code.

The following tutorial will assume you already have PyCharm and Git installed, and have a Github account. It also assumes you have a git repository to start with. Follow my steps in the first blog post if you don’t already have a git repository.

Create a new branch by right clicking on your repository project folder, and in the menu select Git–>Repository–>Branches, as is shown below.

figure-1

Select “New Branch” and then enter the name of the new branch in the box (see below for how I named the example branch):

figure-2

You can now see in the version control window that a new branch called “example_branch_1” has been created as a branch from the master branch (you can access this view by clicking the “Version Control” window at the bottom).

figure-3

Importantly, PyCharm has now placed you on the new branch (“example_branch_1”). Any changes you make and commit, you are making to the branch, not to the master (the one you started with).

First, right-click on the python file and select Git –> Add.

Now, make some modifications to the python file on this branch.

In the menu at the top of the screen, select VCS –> Commit changes.

In the menu that appears, provide a message to attach to your commit so you know later what changes you made. You can double click on the file name in the window (e.g., on the “blog_post_file.py” file) to review what changes have actually been made. If you like the changes, select “Commit”.

figure-4

In the menu at the very bottom of the screen to the right, you can now go back and “check out” the master branch. Just click master –> Checkout, as is shown below.

figure-6

What now appears on your screen is the previous version of the “blog_post_file.py” file, before the function was modified, as is shown in the image below.

figure-5

The version control menu at the bottom explains the structure (and dates/times) of the master and branches.

You can now go back to the example branch if you want using the same feature.

This feature allows you to quickly snap to different branches that are being modified, and see how they are different. (This feature is also accessible from the menu at the top of the screen through VCS –>Git –> Branches).

If you want to merge your changes into your local master, do the following. From the master branch, select “example_branch_1” in the same menu shown at the bottom right. This time select “Merge” rather than “Checkout”. Your changes will now be merged into the master branch, and the modified function will now appear on both the master and the example branch.

figure-7

I will continue to discuss PyCharm’s Git features in future posts in this series.

A Guide to Using Git in PyCharm – Part 1


This post is part 1 of a multi-part series intended to describe how to use PyCharm’s integrated Git (version control) features. See Part 2 here. PyCharm makes it easy to link a Github account directly to the PyCharm IDE and perform Git-related tasks within the PyCharm IDE rather than performing these tasks in a command prompt.

Today I will describe a few basic features to help you get started, including: creating a repository, adding files and committing changes. Future posts will discuss more detailed features, including branching, pushing/pulling, merging, etc. While PyCharm does have a very detailed website geared toward explaining some of this, this blog post series is intended to help those who just want some basic steps and pictures to get started.

You can find background information on Git on this blog, including an introduction to git, as well as tutorials on local version control and remote repositories. A glossary of Git terminology and commands can be found here.

PyCharm is one of many IDEs one can use to edit Python code. I offer some background here on why I chose PyCharm for my Python programming needs, and how to get PyCharm.

The following tutorial will assume you already have PyCharm and Git installed, and have a Github account.

  1. Within PyCharm, link your Github account.

File –> Settings –> Version Control –> GitHub

Input your GitHub login (username) and password.

  2. Create a new PyCharm project as pictured below, by going to:

File –> New Project.

(Note: A “project” can just be a folder/directory of related files that PyCharm will recognize as a “project”. If you already have a project, you can simply open the existing “project” to follow these steps to create a repository).

create_project

  3. Having established a new project, now create a local git repository for the project as pictured below, by going to:

VCS –> Import Into Version Control –> Create Git Repository.

If completed correctly, you should see a directory named “.git” appear in the project directory. Note that you must have already downloaded git for this to work (test that git.exe works by going to File –> Settings –> Version Control –> Git –> Test).

create_git_repo

  4. Having created a git repository, now create python files and add them to the repository.

Right click on your project in the Project menu, and select New –> Python File, as pictured below.

New Python File

PyCharm will prompt you to include this file in your repository, as is pictured below. If you select “Yes”, you can now commit and track changes you make to this file.

commit_file_question

  5. Commit the file to the repository by right-clicking the python file in the project menu, and selecting Git–>Commit File, as is shown in the two images below:

git_commit

git_commit

You can include a message with your commit when you are prompted to commit, as shown below:

initial_commit

Note that file names in your project menu will appear in green text as long as the file has not been committed yet. The file name will no longer be green once you commit it.

  6. Now that your file has been added, you can make changes to the file and commit them.

As soon as you make edits to this newly committed file, the file name in the menu will change colors (to blue in my version of PyCharm). This signifies that uncommitted changes exist in that file.  You can commit the changes following the same process I described before, or by clicking the green “VCS” up arrow icon on the top menu.

UNCOMMITTED_CODE

You will now be prompted with a Commit Changes window, as appears below.

first_module_added

You can review changes that have been made, in case you have forgotten what has changed since you last committed, by double clicking on the file name (in the figure above, this would be the blue “blog_post_file.py” icon). PyCharm will show you what changes have been made in green. (It will also show deletions and/or rearrangements of code).

changes_shown

  7. Having committed the file and a change to the file, you can now go to the “Version Control” menu at the bottom of the PyCharm window to unveil a variety of features, including a “Log”, which stores all of the changes I have made to this local repository. The log for my example case is shown below.

version_control_window

I will continue to discuss more detailed features in future posts in this series.

Importing, Exporting and Organizing Time Series Data in Python – Part 2


This blog post is Part 2 of a multi-part series of posts intended to introduce (1) options available in Python for reading (importing) data (with particular emphasis on time series data, and how to handle .csv, .xls and .xlsx files); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

Part 1 of the series focused on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets.

This blog post presents a flexible python function I developed to import time series outputs from a simulation model into a dictionary of 3D pandas DataFrames (DFs). This function is part of a data importing and exporting module included in the PySedSim simulation model I have been developing in the last year, but you can use this function for time series outputs from any simulation model, provided the data are stored in .csv files and organized in the format shown in Fig. 2.

The value of this function is that it creates a single dictionary of 3D pandas DFs storing time series outputs from multiple simulation “scenarios”; for multiple system locations (e.g., Reservoir 1, Reservoir 2…); multiple state variables (reservoir water storage, suspended sediment load, etc.); and multiple realizations, or stochastic simulation ensemble members (e.g., Realization 1, Realization 2, …, Realization 1000). I have found the 3D DFs to have more reliable functionality than 4D DFs, so I have elected not to use a fourth pandas panel dimension in this function.

In a future post, I will then present some additional modules I have developed that use Pandas functionality to evaluate “performance” of different system locations (e.g., reservoirs) with respect to a diverse set of temporal resampling strategies and distribution statistics.

Below is the Python function, which you are welcome to use and/or modify for your own purposes. Please note that I had trouble reproducing Python formatting in the blog post (some HTML characters get incorrectly inserted), and apparently I cannot post a link to a .py file, so the code will eventually be on GitHub in late 2016.

# Import method-specific libraries
from __future__ import division  # __future__ imports must come before all other imports
from copy import deepcopy
import pandas as pd
import os
import platform

def Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir, proc_num = ''):
    '''
    Purpose: Imports time series outputs from a simulation run into a 3-dimensional pandas dataframe (DF).

    More detail: Module intended to import .csv files that have been produced as the result of a PySedSim simulation.
    Each .csv file should contain time series outputs for a single state variable (e.g., reservoir water storage) and
    system location (e.g., reservoir 1).

    The purpose of using this file may either be to (1) import the output into a dictionary of pandas structures so
    that simulation performance measures can be evaluated and plotted, or (2) to import all the data produced by
    separate processors into a single data structure that can then be exported into a .csv file that contains aggregated
    output for each system location/variable.

    DF details (a 3D DF exists for each system location (e.g., reservoir)):
    Axis 0: State variable (e.g., Water storage, Energy production) for system location
    Axis 1: Time (e.g., dates over simulation horizon)
    Axis 2: Realization Number (e.g., stochastic simulation ensemble members)

    :param Sims_to_Import: List, containing strings of simulation scenario names (these must be directories in the
    specified output file directory that have these names). Example: ["Alternative Scenario 7A"]

    :param Locations_to_Import: Dictionary, keys store strings representing simulation element names (e.g.,
    Reservoir 1). Keys must be in the Sims_to_Import list. Example: {"Alternative Scenario 7A": ["Reservoir 1"]}

    :param var_sub_list: List, containing strings of PySedSim state variable names for which .csv output files exist
    for the scenarios in the Sims_to_Import list. Example: ['water_surface_elevation', 'capacity_active_reservoir']

    :param file_dir: String, directory in which output files to be imported are located.
    Example: r'E:\PySedSim\ModelFiles\Output_Storage'

    :param proc_num: Optional. Integer, number appended to the output .csv file representing the processor that
    produced the file (e.g., the number 3 for the file 'water_surface_elevation_3.csv')

    :return TSID: Dictionary, where keys are scenario names. Key stores sub_dictionary, where sub_dictionary keys are
    system locations storing 3D pandas DF for each system location.
    :return Num_Realizations: Dictionary, where keys are scenario names, storing the number of stochastic realizations
    for the scenario.
    :return Num_Years: Dictionary, where keys are scenario names, storing the number of years in a simulation
    realization for the scenario.
    '''

    # Get operator (/ or \) for changing directory based on operating system.
    os_fold_op = Op_Sys_Folder_Operator()

    # This function reads back in previously exported simulation data so performance measure analysis can be conducted.
    if proc_num != '':
        cluster_loop = '_' + str(proc_num-1) # Subtract 1 as first file ends with "0".
        cluster_sub_folder = 'cluster_output'
    else:
        cluster_loop = ''
        cluster_sub_folder = ''

    # Initialize various data structures
    TSID = {} # Main dictionary to export
    TSID_Temp = {} # Use to temporarily load each processor's output sheet for location/variable, if applicable
    Num_Realizations = {} # For each scenario, stores number of realizations for that scenario
    Num_Years = {} # For each scenario, stores number of years in each realization for that scenario
    counter = {} # Temporary counter

    # Main data import loop. Intent is to import data into Time Series Import Dictionary (TSID)
    for sims in Sims_to_Import:
        counter[sims] = 0
        TSID[sims] = {} # Sub dict for each simulation will store locations.
        TSID_Temp[sims] = {} # Sub dict for element/variable output for a given processor in cluster.
        sim_import_loc = file_dir + os_fold_op + sims # This folder needs to already exist.
        for sys_locs in Locations_to_Import[sims]:
            TSID[sims][sys_locs] = {} # Sub dictionary for each location will store variables.
            TSID_Temp[sims][sys_locs] = {} # Sub dict for location will store a variable for each processor.
            loc_sub_folder = os_fold_op + cluster_sub_folder + os_fold_op + sys_locs
            # Requires that all the locs you are searching have all the variables you list above, which won't always
            # be the case (for junctions vs. reservoirs, for example).
            for vars in var_sub_list:
                file_path = sim_import_loc + loc_sub_folder # File path reflecting new folder
                if os.path.exists(os.path.join(file_path, vars + cluster_loop + '.csv')):
                    # This variable exists as a file name in the specified file path, so import it.
                    if proc_num == '':
                        # User is not importing output files produced on a cluster by various processors. Proceed
                        # linearly (there are not different files from different processors that need to be combined).
                        # Import this .csv file into a dataframe.
                        TSID[sims][sys_locs][vars] = pd.read_csv(os.path.join(file_path, vars + cluster_loop + '.csv'),
                        index_col=0)

                        # Force each dataframe to have datetime objects as dates rather than strings.
                        TSID[sims][sys_locs][vars].set_index(pd.to_datetime(TSID[sims][sys_locs][vars].index),
                        inplace=True)
                        # Determine number of realizations (ensembles). Only do this calculation once per simulation
                        # realization (on first pass through loop).
                        if counter[sims] == 0:
                            Num_Realizations[sims] = len(TSID[sims][sys_locs][vars].columns)
                            Num_Years[sims] = TSID[sims][sys_locs][vars].index[-1].year - \
                            TSID[sims][sys_locs][vars].index[0].year + 1
                            counter[sims] += 1
                    else:
                        # User wishes to use this processor to create a dictionary for the particular
                        # location/variable of interest. This processor will therefore read in all output .csv files
                        # produced by other processors.
                        for csv_sheet in range(proc_num):
                            # Import this .csv file into a dataframe.
                            TSID_Temp[sims][sys_locs][vars] = pd.read_csv(
                            os.path.join(file_path, vars + '_' + str(csv_sheet) + '.csv'), index_col=0)
                            # Make each dataframe have datetime objects as dates rather than strings.
                            TSID_Temp[sims][sys_locs][vars].set_index(
                            pd.to_datetime(TSID_Temp[sims][sys_locs][vars].index), inplace=True)
                            # Loop through locations and variables, store data from this processor in master dictionary.
                            if csv_sheet == 0:
                                TSID = deepcopy(TSID_Temp)
                            else:
                                for locs in TSID[sims]:
                                    for vars in TSID[sims][locs]:
                                        # Add this new set of realizations from this DF into the main DF
                                        TSID[sims][locs][vars] = pd.concat(
                                        [TSID[sims][locs][vars], TSID_Temp[sims][locs][vars]], axis=1)
    print("Data Import is completed.")
    # Return is conditional. Number of realizations/years cannot be provided if the TSID only represents one of many
    # ensemble members of a stochastic simulation:
    if proc_num != '':
        return TSID
    else:
        return TSID, Num_Realizations, Num_Years

The following is an example of how to make use of this function. Suppose you have 2 directories corresponding to two different simulation runs (or “scenarios”). Let’s call those Scenario 1 and Scenario 2.

Within each of those scenario directories, you have separate directories for each system location. In this example, the locations are “Reservoir 1” and “River Channel 1”, so there are 2 sub-directories.

Within each of those location directories, you have .csv files for the following state variables: water_storage, susp_sediment_inflow, inflow_rate. Each .csv file stores time series output for a particular scenario, location, and state variable across all realizations (ensemble members).
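
To make this concrete, the directory layout described above would look like the sketch below (names taken from this example):

Simulation Output/
    Scenario 1/
        Reservoir 1/
            inflow_rate.csv
            susp_sediment_inflow.csv
            water_storage.csv
        River Channel 1/
            inflow_rate.csv
            susp_sediment_inflow.csv
            water_storage.csv
    Scenario 2/
        (same structure as Scenario 1)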

Figure 2 (below) shows an example of how these files might be stored for “River Channel 1”, within a directory titled “Simulation Output”.

Fig 2 - file location

Figure 2A (below) shows an example of how these files might be stored for “Reservoir 1”, within a directory titled “Simulation Output”.

Fig 2A - file location

Figure 1 (below) shows an example of the contents of one such .csv file. (The PySedSim model automatically creates the directory structure and file formatting shown here).

Fig 1 - csv file layout

The following is an example function call (note: I assume here that the function above has been saved in a module named data_import_module.py; substitute whatever module name you use):


from data_import_module import Import_Simulation_Output  # module name is an assumption; use your own

Sims_to_Import = ['Scenario 1', 'Scenario 2']

Locations_to_Import = {'Scenario 1': ['Reservoir 1', 'River Channel 1'], 'Scenario 2': ['Reservoir 1', 'River Channel 1']}

var_sub_list = ['inflow_rate', 'susp_sediment_inflow', 'water_storage']

file_dir = r'C:\Users\tbw32\Documents\Reed Group\Blog - Water Programming\2016\July 2016\Simulation Output'

TSID, Num_Realizations, Num_Years = Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir)
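
Once the call returns, you can index the nested dictionary by scenario, location, and state variable. A short sketch (names taken from the example above):

# 2D DataFrame (dates as rows, realizations as columns) for one scenario/location/variable:
storage_df = TSID['Scenario 1']['Reservoir 1']['water_storage']
print(Num_Realizations['Scenario 1'], Num_Years['Scenario 1'])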

In the next post I will demonstrate what to do with the dictionary that has been created.

Debugging in Python (using PyCharm) – Part 3


This post is part 3 of a multi-part series of posts intended to provide discussion of some basic debugging tools that I have found to be helpful in developing a pure Python simulation model using a Python Integrated Development Environment (IDE) called PyCharm.

Before I begin this post, note that I have written previous blog posts on this topic, covering profiling and breakpoints; I refer to both below.

In this post I will focus on PyCharm’s “Coverage” features, which are very useful for debugging by allowing you to see what parts of your program (e.g., modules, classes, methods) are/are not being accessed for a given implementation (run) of the model. If instead you are interested in seeing how much time is being spent running particular sections of code, or want to glimpse into the values of variables during execution, see the previous posts I linked above on profiling and breakpoints.

To see what parts of my code are being accessed, I have found it helpful to create and run what are called “unit tests”. You can find more on unit testing here, or just by googling it. (Please note that I am not a computer scientist, so I am not intending to provide a comprehensive summary of all possible approaches you could take to do this. I am just going to describe something that has worked well for me). To summarize, unit testing refers to evaluating sections (units) of source code to determine whether those units are performing as they should. I have been using unit testing to execute a run of my model (called “PySedSim”) to see what sections of my code are and are not being accessed.

I integrated information from several other sources to prepare this post.

Please follow these steps:

Step 1. Open the script (or class, or method) you want to assess, and click on the function or method you want to assess.

In my case, I am assessing the top-level python file “PySedSim.py”, which is the file in my program that calls all of the classes to run a simulation (e.g., Reservoirs and River Channels). Within this file, I have clicked on the PySedSim function. Note that these files are already part of a PyCharm project I have created, and Python interpreters have already been established. You need to do that first.

Step 2. With your cursor still on the function/method of interest, press “Ctrl + Shift + T”.

A window should appear as it does below. Click to “Create New Test”.

Figure 1

Step 3. Create a new test. Specify the location of the script you are testing, and keep the suggested test file and class names, or modify them. Then click to add a check mark to the box next to the Test Method, and click “OK”.

Figure 2

Step 4. Modify the new script that has been created. (In my case, this file is called “test_pySedSim.py”, and appears initially as it does below).

Figure 3

I then modified this file so that it reflects testing I want to conduct on the PySedSim method in the PySedSim.py file.

In my case, it appears like this.


from unittest import TestCase
from PySedSim import PySedSim

class TestPySedSim(TestCase):
    def test_PySedSim(self):
        PySedSim()

Note that there is a ton of functionality that is now possible in this test file. I suggest reviewing this website again carefully for ideas. You can raise errors, and use the self.fail() function, to indicate whether or not your program is producing acceptable results. For example, if the program produces a negative result when it should produce a positive result, you can indicate to PyCharm that this represents a fail, and the test has not been passed. This offers you a lot of flexibility in testing various methods in your program.
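
For example, here is a sketch of such a test, assuming (hypothetically) that PySedSim() returns a list of objective values; my actual model may not return anything:

from unittest import TestCase
from PySedSim import PySedSim

class TestPySedSim(TestCase):
    def test_PySedSim_nonnegative(self):
        # Hypothetical check: assumes PySedSim() returns a list of objective values.
        performance = PySedSim()
        for value in performance:
            if value < 0:
                self.fail('PySedSim produced a negative objective value: %s' % value)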

In my case, all I am wanting to do is run the model and see which sections were accessed, not to specifically evaluate results it has produced, so in my case PyCharm should execute the model and indicate it has “passed” the unit test (once I create and run the unit test).

Step 5. In the menu at the top of the screen that I show clicked on in the image below, click on “Edit configurations”.

Figure 4

From here, click on the “+” button, and go to Python tests –> Unittests.

Figure 5

Step 6. In the “Run/Debug Configurations” window, give your test a name in the “Name” box, and in the “script” box locate the script you created in Step 4, and indicate its path. Specify any method parameters that need to be specified to run the method. I did not specify any environment preferences, as the interpreter was already filled in. Click OK when you are done.

Figure 6

Step 7. Your test should now appear in the same configuration menu you clicked on earlier in Step 5. So, click the button at the top to “Run with Coverage”. (In my case, run water_programming_blog_post with coverage)

Figure 7

Note that it is likely going to take some time for the test to run (more than it would take for a normal execution of your code).

Step 8. Review the results.

A coverage window should appear to the right of your screen, indicating what portions (%) of the various functions and methods contained in this program were actually entered.

Figure 8

To generate a more detailed report, you can click on the button with the green arrow inside the coverage window, which will offer you options for how to generate a report. I selected the option to generate an html report. If you then select the “index” html file that appears in the directory you’re working in, you can click to see the coverage for each method.

For example, here is an image of a particular class (reservoir.py), showing in green those sections of code that were entered, and in red the sections that were not. I used this to discover that particular portions of some methods were not being accessed when they should have been. The script files themselves also now have red and green text that appears next to the code that was not or was entered, respectively. See image above for an example of this.

Figure 9

PyCharm also indicates whether or not the unittest has passed. (Though I did not actually test for specific outputs from the program, I could have done tests on model outputs as I described earlier, and any test failure would be indicated here).

Importing, Exporting and Organizing Time Series Data in Python – Part 1


This blog post is Part 1 of a multi-part series of posts (see here for Part 2) intended to introduce (1) options available in Python for reading (importing) data (with particular emphasis on time series data, and how to handle Excel spreadsheets); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

In modeling water resources and environmental systems, we frequently must import and export large quantities of data (particularly time series data), both to run models and to process model outputs. I will describe the approaches and tools I have found to be most useful for managing data in these circumstances.

This blog post focuses on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets. Future posts will cover the pandas data management/export component.

Through an example that follows, there are two main lessons I hope to convey in this post:

  1. If you can store data, especially time series data, in text (.txt) or comma separated value (.csv) files, the time required for importing and exporting data will be vastly reduced compared to attempting the same thing in an Excel workbook (.xlsx file). Hence, I suggest that if you must work with an Excel workbook (.xlsx files), try to save each worksheet in the Excel workbook as a separate .csv file and use that instead. A .csv file can still be viewed in the MS Excel interface, but is much faster to import. I’ll show below how much time this can save you.
  2. There are many ways to analyze the data once loaded into Python, but my suggestion is you take advantage of pandas, an open-source data analysis library in Python. This is why my example below makes use of built-in pandas features for importing data. (I will have some future posts with some useful pandas code for doing time series analysis/plotting. There are other previous pandas posts on our blog as well).

It is important at this point to note that Microsoft Excel files (.xlsx or .xls) are NOT an ideal means of storing data. I have had the great misfortune of working with Excel (and Visual Basic for Applications) intensively for many years, and it can be somewhat of a disaster for data management. I am using Excel only because the input data files for the model I work with are stored as .xlsx files, and because this is the data storage interface with which others using my Python-based model are most comfortable.

An Example:

Suppose that you wish to read in and analyze time series of 100 years of simulated daily reservoir releases from 2 reservoirs (“Reservoir_1” and “Reservoir_2”) for 100 separate simulations (or realizations). Let’s assume data for the reservoirs is stored in two separate files, either .csv or .xlsx files (e.g., “Reservoir_1.xlsx” and “Reservoir_2.xlsx”, or “Reservoir_1.csv” and “Reservoir_2.csv”).

The figure below shows a screenshot of an example layout of the file in either case (.csv or .xlsx). Feel free to make your own such file, with column names that correspond to the title of each simulation realization.

Worksheet appearance

There are two import options for these data.

Option 1: Import your data directly into a dictionary of two pandas DataFrame objects, where the keys of the Python dictionary are the two reservoir names. Here is some code that does this. All you need to do is run this script in a directory where you also have your input files saved.


# Imports
import pandas as pd
from datetime import datetime
from datetime import timedelta
import time

start_time = time.time()  # Keep track of import time

start_date = datetime(1910, 1, 1)
simulation_dates = pd.date_range(start_date, start_date + timedelta(365*100 - 1))  # ~100 years of daily dates (leap days neglected)

# Specify File Type (Choices: (1) 'XLSX', or (2) 'CSV')
File_Type = 'CSV'

location_list = ['Reservoir_1', 'Reservoir_2']  # Names of data files
start = 0  # starting column of input file from which to import data
stop = 99  # ending column of input file from which to import data. Must not exceed number of columns in input file.

# Create a dictionary where keys are items in location_list, and content associated with each key is a pandas dataframe.
if File_Type == 'XLSX':
    extension = '.xlsx'
    Excel_data_dictionary = {name: pd.read_excel(name + extension, sheetname=0, usecols=[i for i in range(start, stop)]) for name in
                             location_list}  # sheetname=0 reads the first worksheet into a DataFrame (sheetname=None would return a dict of sheets)
    # Reset indices as desired dates
    for keys in Excel_data_dictionary:
        Excel_data_dictionary[keys] = Excel_data_dictionary[keys].set_index(simulation_dates)
    print('XLSX Data Import Complete in %s seconds' % (time.time() - start_time))
elif File_Type == 'CSV':
    extension = '.csv'
    CSV_data_dictionary = {name: pd.read_csv(name + extension, usecols=[i for i in range(start, stop)]) for name in location_list}
    # Reset indices as desired dates
    for keys in CSV_data_dictionary:
        CSV_data_dictionary[keys] = CSV_data_dictionary[keys].set_index(simulation_dates)
    print('CSV Data Import Complete in %s seconds' % (time.time() - start_time))

When I run this code, the .csv data import took about 1.5 seconds on my office desktop computer (not bad), and the .xlsx import took about 100 seconds (awful!). This is why you should avoid storing data in Excel workbooks.

You can visualize the pandas DataFrame object per the image below, with a major axis that corresponds to dates, and a minor axis that corresponds to each simulation run (or realization). Note that I added code that manually sets the major axis index to date values, as my input sheet had no actual dates listed in it.

Pandas DF

Now, if I want to see the data for a particular realization or date, pandas makes this easy. For example, the following would access the DataFrame column corresponding to realization 8 for Reservoir_1:

CSV_data_dictionary['Reservoir_1']['Realization8']
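
Date-based selection works the same way once the index holds datetime objects. A short sketch (the dates below assume the 1910 start date used in the code above):

df = CSV_data_dictionary['Reservoir_1']
df.loc['1910-01-01']  # all realizations on a single date
df.loc['1910-01-01':'1919-12-31', 'Realization8']  # ten years of one realization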

Option 2: If you must work with .xlsx files and need to import specific sheets and cell ranges from those files, I recommend that you use the open-source Python library OpenPyXL. (There are other options, some of which are reviewed here). OpenPyXL is nice because it can both read from and write to Excel files, it can loop through worksheets and cells within those worksheets, and it does not require that you have Microsoft Office if you are working in Windows. Indeed it does not require Windows at all (it works well on Linux, though I have not tried it out in OSX). OpenPyXL documentation is available here, and includes numerous examples. This page also provides some nice examples of using OpenPyXL. I include a short example sketch after the tips below.

Some additional tips regarding OpenPyXL:

  • Make sure you are working with the latest version. If you install openpyxl through an IDE, be sure it’s updated to at least version 2.2.4 or 2.2.5. I’ve had problems with earlier versions.
  • I believe OpenPyXL only works with .xlsx files, not .xls files.
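
To close, here is a minimal sketch of reading a specific worksheet and cell range with OpenPyXL. The file name, sheet name, and cell range below are hypothetical:

from openpyxl import load_workbook

wb = load_workbook('Reservoir_1.xlsx')
ws = wb['Sheet1']  # access a worksheet by name
# Pull the values from a rectangular cell range into a list of lists:
data = [[cell.value for cell in row] for row in ws['A1':'C10']]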