Importing, Exporting and Organizing Time Series Data in Python – Part 2

This blog post is Part 2 of a multi-part series of posts intended to introduce options in Python available for reading (importing) data (with particular emphasis on time series data, and how to handle .csv, .xls and .xlsx files); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.

Part 1 of the series focused on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets.

This blog post presents a flexible python function I developed to import time series outputs from a simulation model into a dictionary of 3D pandas DataFrames (DFs). This function is part of a data importing and exporting module included in the PySedSim simulation model I have been developing in the last year, but you can use this function for time series outputs from any simulation model, provided the data are stored in .csv files and organized in the format shown in Fig. 2.

The value of this function is it creates a single dictionary of 3D pandas DFs storing time series outputs from multiple simulation “scenarios”; for multiple system locations (e.g., Reservoir 1, Reservoir 2…); multiple state variables (reservoir water storage, suspended sediment load, etc.); and multiple realizations, or stochastic simulation ensemble members (e.g., Realization 1, Realization 2, …, Realization 1000). I have found the 3D DFs to have more reliable functionality than 4D DFs, so I have elected not to use a fourth pandas panel dimension in this function.

In a future post, I will then present some additional modules I have developed that use Pandas functionality to evaluate “performance” of different system locations (e.g., reservoirs) with respect to a diverse set of temporal resampling strategies and distribution statistics.

Below is the Python function, which you are welcome to use and/or modify for your own purposes. Please note that I had trouble reproducing Python formatting in the blog post (and some html characters get incorrectly inserted), and apparently I cannot post a link to a .py file, so it will eventually be on Github in late 2016.

 # Import method-specific libraries
from copy import deepcopy
from __future__ import division
import pandas as pd
import os
import platform

def Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir, proc_num = ''):
    '''
    Purpose: Imports time series outputs from a simulation run into a 3-dimensional pandas dataframe (DF).

    More detail: Module intended to import .csv files that have been produced as the result of a PySedSim simulation.
    Each .csv file should contain time series outputs for a single state variable (e.g., reservoir water storage) and
    system location (e.g., reservoir 1).

    The purpose of using this file may either be to (1) import the output into a dictionary of pandas structures so
    that simulation performance measures can be evaluated and plotted, or (2) to import all the data produced by
    separate processors into a single data structure that can then be exported into a .csv file that contains aggregated
    output for each system location/variable.

    DF details (a 3D DF exists for each system location (e.g., reservoir)):
    Axis 0: State variable (e.g., Water storage, Energy production) for system location
    Axis 1: Time (e.g., dates over simulation horizon)
    Axis 2: Realization Number (e.g., stochastic simulation ensemble members)

    :param Sims_to_Import: List, containing strings of simulation scenario names (these must be directories in the
    specified output file directory that have these names). Example: ["Alternative Scenario 7A"]

    :param Locations_to_Import: Dictionary, keys store strings representing simulation element names (e.g.,
    Reservoir 1). Keys must be in the Sims_to_Import list. Example: {"Alternative Scenario 7A": ["Reservoir 1"]}

    :param var_sub_list: List, containing strings of PySedSim state variable names for which .csv output files exist
    for the scenarios in the Sims_to_Import list. Example: ['water_surface_elevation', 'capacity_active_reservoir']

    :param file_dir: String, directory in which output files to be imported are located.
    Example: r'E:\PySedSim\ModelFiles\Output_Storage'

    :param proc_num: Optional. Integer, number appended to the output .csv file representing the processor that
    produced the file (e.g., the number 3 for the file 'water_surface_elevation_3.csv')

    :return TSID: Dictionary, where keys are scenario names. Key stores sub_dictionary, where sub_dictionary keys are
    system locations storing 3D pandas DF for each system location.
    :return Num_Realizations: Dictionary, where keys are scenario names, storing number of stochastic realiztions for
    scenario.
    :return Num_Years: Dictionary, where keys are scenario names, storing number of years in a simulation realization
    for scenario
    '''

    # Get operator (/ or \) for changing directory based on operating system.
    os_fold_op = Op_Sys_Folder_Operator()

    # This function reads back in previously exported simulation data so performance measure analysis can be conducted.
    if proc_num is not '':
        cluster_loop = '_' + str(proc_num-1) # Subtract 1 as first file ends with "0".
        cluster_sub_folder = 'cluster_output'
    else:
        cluster_loop = ''
        cluster_sub_folder = ''

    # Initialize various data structures
    TSID = {} # Main dictionary to export
    TSID_Temp = {} # Use to temporarily load each processor's output sheet for location/variable, if applicable
    Num_Realizations = {} # For each scenario, stores number of realizations for that scenario
    Num_Years = {} # For each scenario, stores number of years in each realization for that scenario
    counter = {} # Temporary counter

    # Main data import loop. Intent is to import data into Time Series Import Dictionary (TSID)
    for sims in Sims_to_Import:
        counter[sims] = 0
        TSID[sims] = {} # Sub dict for each simulation will store locations.
        TSID_Temp[sims] = {} # Sub dict for element/variable output for a given processor in cluster.
        sim_import_loc = file_dir + os_fold_op + sims # This folder needs to already exist.
        for sys_locs in Locations_to_Import[sims]:
            TSID[sims][sys_locs] = {} # Sub dictionary for each location will store variables.
            TSID_Temp[sims][sys_locs] = {} # Sub dict for location will store a variable for each processor.
            loc_sub_folder = os_fold_op + cluster_sub_folder + os_fold_op + sys_locs
            # Requires that all the locs you are searching have all the variables you list above, which wont be the
            # case always (for junctions vs. reservoirs, for example).
            for vars in var_sub_list:
                file_path = sim_import_loc + loc_sub_folder # File path reflecting new folder
                if os.path.exists(os.path.join(file_path, vars + cluster_loop + '.csv')) == True:
                    # This variable exists as a file name in the specified file path, so import it.
                    if proc_num == '':
                        # User is not importing output files produced on a cluster by various processors. Proceed
                        # linearly (there are not different files from different processors that need to be combined).
                        # Import this dataframe to a csv file.
                        TSID[sims][sys_locs][vars] = pd.read_csv(os.path.join(file_path, vars + cluster_loop + '.csv'),
                        index_col=0)

                        # Force each dataframe to have datetime objects as dates rather than strings.
                        TSID[sims][sys_locs][vars].set_index(pd.to_datetime(TSID[sims][sys_locs][vars].index),
                        inplace=True)
                        # Determine number of realizations (ensembles). Only do this calculation once per simulation
                        # realization (on first pass through loop).
                        if counter[sims] == 0:
                            Num_Realizations[sims] = len(TSID[sims][sys_locs][vars].columns)
                            Num_Years[sims] = TSID[sims][sys_locs][vars].index[-1].year - \
                            TSID[sims][sys_locs][vars].index[0].year + 1
                            counter[sims] += 1
                    else:
                        # User wishes to use this processor to create a dictionary for the particular
                        # location/variable of interest. This processor will therefore read in all output .csv files
                        # produced by other processors.
                        for csv_sheet in range(proc_num):
                            # Import this dataframe to a csv file
                            TSID_Temp[sims][sys_locs][vars] = pd.read_csv(
                            os.path.join(file_path, vars + '_' + str(csv_sheet) + '.csv'), index_col=0)
                            # Make each dataframe have datetime objects as dates rather than strings.
                            TSID_Temp[sims][sys_locs][vars].set_index(
                            pd.to_datetime(TSID_Temp[sims][sys_locs][vars].index), inplace=True)
                            # Loop through locations and variables, store data from this processor in master dictionary.
                            if csv_sheet == 0:
                                TSID = deepcopy(TSID_Temp)
                            else:
                                for locs in TSID[sims]:
                                    for vars in TSID[sims][locs]:
                                        # Add this new set of realizations from this DF into the main DF
                                        TSID[sims][locs][vars] = pd.concat(
                                        [TSID[sims][locs][vars], TSID_Temp[sims][locs][vars]], axis=1)
    print("Data Import is completed.")
    # Return is conditional. Number of realizations/years cannot be provided if the TSID only represents one of many
    # ensemble members of a stochastic simulation:
    if proc_num is not '':
        return TSID
    else:
        return TSID, Num_Realizations, Num_Years

The following is an example of how to make use of this function. Suppose you have 2 directories corresponding two two different simulation runs (or “scenarios”). Let’s call those Scenario 1 and Scenario 2.

Within each of those scenario directories, you have separate directories for each system location. In this example, the locations are “Reservoir 1” and “River Channel 1”, so there are 2 sub-directories.

Within each of those location directories, you have .csv files for the following state variables: water_storage, susp_sediment_inflow, inflow_rate. Each .csv file stores time series output for a particular scenario, location, and state variable across all realizations (ensemble members).

Figure 1 (below) shows an example of how these files might be stored for “River Channel 1”, within a directory titled “Simulation Output”.

Fig 2 - file location

Figure 2 (below) shows an example of how these files might be stored for “Reservoir 1”, within a directory titled “Simulation Output”.

Fig 2A - file location

Figure 2 shows an example of how one .csv file. (The PySedSim model automatically creates the directory structure and file formatting shown here).

Fig 1 - csv file layout

The following is an example function call:


import Import_Simulation_Output

Sims_to_Import = ['Scenario 1', 'Scenario 2']

Locations_to_Import = {'Scenario 1': ['Reservoir 1', 'River Channel 1'], 'Scenario 2': ['Reservoir 1', 'River Channel']}

var_sub_list = ['inflow_rate', 'susp_sediment_inflow', 'water_storage']

file_dir = r'C:\Users\tbw32\Documents\Reed Group\Blog - Water Programming\2016\July 2016\Simulation Output'

Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir)

In the next post I will demonstrate what to do with the dictionary that has been created.

Advertisements

2 thoughts on “Importing, Exporting and Organizing Time Series Data in Python – Part 2

  1. Pingback: Importing, Exporting and Organizing Time Series Data in Python – Part 1 – Water Programming: A Collaborative Research Blog

  2. Pingback: Water Programming Blog Guide (Part I) – Water Programming: A Collaborative Research Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s