This blog post is Part 2 of a multi-part series intended to introduce options in Python for (1) reading (importing) data (with particular emphasis on time series data, and how to handle .csv, .xls and .xlsx files); (2) organizing time series data in Python (with emphasis on using the open-source data analysis library pandas); and (3) exporting/saving data from Python.
Part 1 of the series focused on approaches for reading (importing) time series data, with particular emphasis on how (and how not) to handle data in MS Excel spreadsheets.
This blog post presents a flexible Python function I developed to import time series outputs from a simulation model into a dictionary of 3D pandas DataFrames (DFs). This function is part of a data importing and exporting module included in the PySedSim simulation model I have been developing over the last year, but you can use it for time series outputs from any simulation model, provided the data are stored in .csv files and organized in the format shown in Fig. 2.
The value of this function is that it creates a single dictionary of 3D pandas DFs storing time series outputs from multiple simulation “scenarios”; for multiple system locations (e.g., Reservoir 1, Reservoir 2…); multiple state variables (reservoir water storage, suspended sediment load, etc.); and multiple realizations, or stochastic simulation ensemble members (e.g., Realization 1, Realization 2, …, Realization 1000). I have found 3D DFs to be more reliable than 4D DFs, so I have elected not to use a fourth pandas panel dimension in this function. Instead, the scenario is handled by the dictionary keys.
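To make the nesting concrete, here is a minimal sketch of the kind of structure the function returns, built by hand from synthetic data. The realization column names and the numbers are made up for illustration; only the scenario/location/variable nesting reflects the function presented below.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one output file: 3 daily time steps x 2 realizations.
# (Column names and values are hypothetical.)
dates = pd.date_range("2016-01-01", periods=3, freq="D")
storage = pd.DataFrame(
    np.arange(6, dtype=float).reshape(3, 2),
    index=dates,
    columns=["Realization1", "Realization2"],
)

# Nesting: scenario -> system location -> state variable -> DataFrame (time x realization)
TSID = {"Scenario 1": {"Reservoir 1": {"water_storage": storage}}}

# Pull one variable back out and summarize across realizations.
df = TSID["Scenario 1"]["Reservoir 1"]["water_storage"]
n_time_steps, n_realizations = df.shape
mean_across_realizations = df.mean(axis=1)  # one value per date
print(mean_across_realizations)
```

Each lookup step narrows the data by one dimension, so the last object you reach is an ordinary time-by-realization DataFrame.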
In a future post, I will present some additional modules I have developed that use pandas functionality to evaluate the “performance” of different system locations (e.g., reservoirs) with respect to a diverse set of temporal resampling strategies and distribution statistics.
Below is the Python function, which you are welcome to use and/or modify for your own purposes. Please note that I had trouble reproducing Python formatting in the blog post (and some html characters get incorrectly inserted), and apparently I cannot post a link to a .py file, so I plan to post it on GitHub in late 2016.
# Import method-specific libraries
from __future__ import division  # Must come before all other imports
from copy import deepcopy
import os
import pandas as pd


def Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir, proc_num=''):
    r'''
    Purpose: Imports time series outputs from a simulation run into a 3-dimensional pandas dataframe (DF).

    More detail: Module intended to import .csv files that have been produced as the result of a PySedSim simulation.
    Each .csv file should contain time series outputs for a single state variable (e.g., reservoir water storage) and
    system location (e.g., reservoir 1).

    The purpose of using this file may either be to (1) import the output into a dictionary of pandas structures so
    that simulation performance measures can be evaluated and plotted, or (2) to import all the data produced by
    separate processors into a single data structure that can then be exported into a .csv file that contains
    aggregated output for each system location/variable.

    DF details (a 3D DF exists for each system location (e.g., reservoir)):
    Axis 0: State variable (e.g., Water storage, Energy production) for system location
    Axis 1: Time (e.g., dates over simulation horizon)
    Axis 2: Realization Number (e.g., stochastic simulation ensemble members)

    :param Sims_to_Import: List, containing strings of simulation scenario names (these must be directories in the
    specified output file directory that have these names). Example: ["Alternative Scenario 7A"]
    :param Locations_to_Import: Dictionary, where keys are scenario names (which must be in the Sims_to_Import list)
    and values are lists of strings representing simulation element names (e.g., Reservoir 1).
    Example: {"Alternative Scenario 7A": ["Reservoir 1"]}
    :param var_sub_list: List, containing strings of PySedSim state variable names for which .csv output files exist
    for the scenarios in the Sims_to_Import list.
    Example: ['water_surface_elevation', 'capacity_active_reservoir']
    :param file_dir: String, directory in which output files to be imported are located.
    Example: r'E:\PySedSim\ModelFiles\Output_Storage'
    :param proc_num: Optional. Integer, number appended to the output .csv file representing the processor that
    produced the file (e.g., the number 3 for the file 'water_surface_elevation_3.csv')
    :return TSID: Dictionary, where keys are scenario names. Key stores sub_dictionary, where sub_dictionary keys are
    system locations storing 3D pandas DF for each system location.
    :return Num_Realizations: Dictionary, where keys are scenario names, storing number of stochastic realizations
    for scenario.
    :return Num_Years: Dictionary, where keys are scenario names, storing number of years in a simulation realization
    for scenario
    '''

    # Get operator (/ or \) for changing directory based on operating system. os.sep stands in here for the PySedSim
    # helper Op_Sys_Folder_Operator(), which is defined elsewhere in the module and is not shown in this post.
    os_fold_op = os.sep

    # This function reads back in previously exported simulation data so performance measure analysis can be conducted.
    if proc_num != '':
        cluster_loop = '_' + str(proc_num-1)  # Subtract 1 as first file ends with "0".
        cluster_sub_folder = 'cluster_output'
    else:
        cluster_loop = ''
        cluster_sub_folder = ''

    # Initialize various data structures
    TSID = {}  # Main dictionary to export
    TSID_Temp = {}  # Use to temporarily load each processor's output sheet for location/variable, if applicable
    Num_Realizations = {}  # For each scenario, stores number of realizations for that scenario
    Num_Years = {}  # For each scenario, stores number of years in each realization for that scenario
    counter = {}  # Temporary counter

    # Main data import loop. Intent is to import data into Time Series Import Dictionary (TSID)
    for sims in Sims_to_Import:
        counter[sims] = 0
        TSID[sims] = {}  # Sub dict for each simulation will store locations.
        TSID_Temp[sims] = {}  # Sub dict for element/variable output for a given processor in cluster.
        sim_import_loc = file_dir + os_fold_op + sims  # This folder needs to already exist.
        for sys_locs in Locations_to_Import[sims]:
            TSID[sims][sys_locs] = {}  # Sub dictionary for each location will store variables.
            TSID_Temp[sims][sys_locs] = {}  # Sub dict for location will store a variable for each processor.
            loc_sub_folder = os_fold_op + cluster_sub_folder + os_fold_op + sys_locs
            # Requires that all the locs you are searching have all the variables you list above, which won't be the
            # case always (for junctions vs. reservoirs, for example).
            for vars in var_sub_list:
                file_path = sim_import_loc + loc_sub_folder  # File path reflecting new folder
                if os.path.exists(os.path.join(file_path, vars + cluster_loop + '.csv')):
                    # This variable exists as a file name in the specified file path, so import it.
                    if proc_num == '':
                        # User is not importing output files produced on a cluster by various processors. Proceed
                        # linearly (there are not different files from different processors that need to be combined).
                        # Import this csv file into a dataframe.
                        TSID[sims][sys_locs][vars] = pd.read_csv(
                            os.path.join(file_path, vars + cluster_loop + '.csv'), index_col=0)
                        # Force each dataframe to have datetime objects as dates rather than strings.
                        TSID[sims][sys_locs][vars].set_index(
                            pd.to_datetime(TSID[sims][sys_locs][vars].index), inplace=True)
                        # Determine number of realizations (ensembles). Only do this calculation once per simulation
                        # realization (on first pass through loop).
                        if counter[sims] == 0:
                            Num_Realizations[sims] = len(TSID[sims][sys_locs][vars].columns)
                            Num_Years[sims] = TSID[sims][sys_locs][vars].index[-1].year - \
                                              TSID[sims][sys_locs][vars].index[0].year + 1
                        counter[sims] += 1
                    else:
                        # User wishes to use this processor to create a dictionary for the particular
                        # location/variable of interest. This processor will therefore read in all output .csv files
                        # produced by other processors.
                        for csv_sheet in range(proc_num):
                            # Import this csv file into a dataframe.
                            TSID_Temp[sims][sys_locs][vars] = pd.read_csv(
                                os.path.join(file_path, vars + '_' + str(csv_sheet) + '.csv'), index_col=0)
                            # Make each dataframe have datetime objects as dates rather than strings.
                            TSID_Temp[sims][sys_locs][vars].set_index(
                                pd.to_datetime(TSID_Temp[sims][sys_locs][vars].index), inplace=True)
                            # Store data from this processor in master dictionary. Only the current location/variable
                            # is updated, so that data already combined for other variables is not overwritten.
                            if csv_sheet == 0:
                                TSID[sims][sys_locs][vars] = deepcopy(TSID_Temp[sims][sys_locs][vars])
                            else:
                                # Add this new set of realizations from this DF into the main DF
                                TSID[sims][sys_locs][vars] = pd.concat(
                                    [TSID[sims][sys_locs][vars], TSID_Temp[sims][sys_locs][vars]], axis=1)

    print("Data Import is completed.")

    # Return is conditional. Number of realizations/years cannot be provided if the TSID only represents one of many
    # ensemble members of a stochastic simulation:
    if proc_num != '':
        return TSID
    else:
        return TSID, Num_Realizations, Num_Years
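When output was produced on a cluster, the function combines the per-processor files for a given location and variable by concatenating them column-wise with pd.concat along axis=1, so that each processor contributes additional realization columns on a shared date index. The following is a small stand-alone sketch of that step using two made-up DataFrames in place of files; the column labels and values are hypothetical.

```python
import pandas as pd

dates = pd.date_range("2016-01-01", periods=3, freq="D")

# Hypothetical outputs from two processors, each holding two realizations.
proc_0 = pd.DataFrame({"r1": [1.0, 2.0, 3.0], "r2": [4.0, 5.0, 6.0]}, index=dates)
proc_1 = pd.DataFrame({"r3": [7.0, 8.0, 9.0], "r4": [10.0, 11.0, 12.0]}, index=dates)

# Column-wise concatenation: rows are aligned on the date index,
# and the realization columns from each processor sit side by side.
combined = pd.concat([proc_0, proc_1], axis=1)
num_realizations = len(combined.columns)
print(combined)
```

Because alignment happens on the index, this only gives a clean result when every processor wrote the same dates; dates missing from one file would show up as NaN in that file's columns.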
The following is an example of how to make use of this function. Suppose you have 2 directories corresponding to two different simulation runs (or “scenarios”). Let’s call those Scenario 1 and Scenario 2.
Within each of those scenario directories, you have separate directories for each system location. In this example, the locations are “Reservoir 1” and “River Channel 1”, so there are 2 sub-directories.
Within each of those location directories, you have .csv files for the following state variables: water_storage, susp_sediment_inflow, inflow_rate. Each .csv file stores time series output for a particular scenario, location, and state variable across all realizations (ensemble members).
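The layout described above (scenario directory, then location directory, then one .csv file per state variable) can be sketched as a list of expected file paths. This is only an illustration: the relative "Simulation Output" root is a placeholder for wherever your output actually lives.

```python
import os

file_dir = "Simulation Output"  # placeholder root directory (an assumption)
sims = ["Scenario 1", "Scenario 2"]
locations = {sim: ["Reservoir 1", "River Channel 1"] for sim in sims}
variables = ["water_storage", "susp_sediment_inflow", "inflow_rate"]

# One file per scenario / location / state variable combination.
expected_files = [
    os.path.join(file_dir, sim, loc, var + ".csv")
    for sim in sims
    for loc in locations[sim]
    for var in variables
]

for path in expected_files[:3]:
    print(path)
```

With 2 scenarios, 2 locations and 3 state variables, this gives 12 files in total.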
Figure 1 (below) shows an example of how these files might be stored for “River Channel 1”, within a directory titled “Simulation Output”.
Figure 2 (below) shows an example of how these files might be stored for “Reservoir 1”, within a directory titled “Simulation Output”.
Figure 2 shows an example of how one .csv file is formatted. (The PySedSim model automatically creates the directory structure and file formatting shown here.)
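For readers without the figure in front of them, here is a rough sketch of a file the function can read, inferred from how the function parses it: the first column holds dates (it becomes the index) and each remaining column holds one realization. The column headers and values below are invented, and an in-memory string stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical file contents: dates in the first column, one realization per remaining column.
csv_text = """,Realization1,Realization2
2016-01-01,10.0,12.0
2016-01-02,11.0,13.0
2017-01-01,9.0,14.0
"""

# Same parsing steps the function applies to each file.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
df.index = pd.to_datetime(df.index)  # dates as datetime objects rather than strings

num_realizations = len(df.columns)
num_years = df.index[-1].year - df.index[0].year + 1
print(num_realizations, num_years)
```

Note that the year count is taken from the first and last dates only, so it assumes the rows are in chronological order.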
The following is an example function call:
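# Assumes the function above has been saved in a file named Import_Simulation_Output.py (a hypothetical module name).
from Import_Simulation_Output import Import_Simulation_Output

Sims_to_Import = ['Scenario 1', 'Scenario 2']
Locations_to_Import = {'Scenario 1': ['Reservoir 1', 'River Channel 1'],
                       'Scenario 2': ['Reservoir 1', 'River Channel 1']}
var_sub_list = ['inflow_rate', 'susp_sediment_inflow', 'water_storage']
file_dir = r'C:\Users\tbw32\Documents\Reed Group\Blog - Water Programming\2016\July 2016\Simulation Output'

# Keep the returned values so the dictionary can be used afterwards.
TSID, Num_Realizations, Num_Years = Import_Simulation_Output(Sims_to_Import, Locations_to_Import, var_sub_list, file_dir)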
In the next post I will demonstrate what to do with the dictionary that has been created.