MORDM VIII: Characterizing the effects of deep uncertainty

In the previous post, we defined robustness using a satisficing metric in which (1) reliability should be at least 98%, (2) restriction frequency should be no more than 10%, and (3) the worst-case cost of drought mitigation actions should be no more than 10% of annual net volumetric income. To assess the robustness of these sets of actions (portfolios) against challenging states of the world (SOWs), we re-simulated the initial set of actions to discover how they fail.
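As a compact restatement (using notation introduced here rather than in the original formulation), the robustness of a portfolio is the fraction of sampled SOWs in which all three criteria hold simultaneously:

$$S = \frac{1}{N}\sum_{j=1}^{N}\Lambda_j, \qquad \Lambda_j = \begin{cases}1 & \text{if } REL_j \geq 98\%,\; RF_j \leq 10\%,\; WCC_j \leq 10\%\\ 0 & \text{otherwise}\end{cases}$$

where N is the number of DU SOWs and Λ_j indicates whether the portfolio satisfies all three criteria in SOW j. This is exactly what the robustness calculation later in this post computes.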

In this penultimate post, we will perform a simple sensitivity analysis on the average performance of all sixty-nine portfolios of actions to understand which uncertainties control the performance of each utility (Raleigh, Durham, and Cary) and of the region as a whole across all uncertain SOWs.

Calculating average performance across 100 DU SOWs

First, we create a new folder to hold the output of the next few post-processing steps. Navigate to the WaterPaths/ folder and create a folder called post_processing. Now, let’s calculate the average performance of each of the sixty-nine initial portfolios across the 100 DU SOWs that we previously simulated them over. The code for this can be found in the gather_objs.py file within the post_processing_code folder and should look like this:

# -*- coding: utf-8 -*-
"""
Created on Mon April 26 2022 11:12

@author: Lillian Bei Jia Lau
Organizes output objectives by mean across RDMs and across solutions

"""
import numpy as np

obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

'''
Performs regional minimax
'''
def minimax(N_SOLNS, objs):
    for i in range(N_SOLNS):
        for j in range(5):
            if j == 0:
                objs[i,15] = np.min([objs[i,0],objs[i,5], objs[i,10]])
            else:
                objs[i, (j+15)] = np.max([objs[i,j],objs[i,j+5], objs[i,j+10]])
    return objs

'''
Calculates each solution's mean performance across all SOWs
'''
def mean_performance_across_rdms(objs_by_rdm_dir, N_RDMS, N_SOLNS):
    objs_matrix = np.zeros((N_SOLNS,20,N_RDMS), dtype='float')
    objs_means = np.zeros((N_SOLNS,20), dtype='float')

    for i in range(N_RDMS):
        filepathname = objs_by_rdm_dir + str(i) + '_sols0_to_' + str(N_SOLNS) + '.csv'
        objs_file = np.loadtxt(filepathname, delimiter=",")
        objs_matrix[:,:15,i] = objs_file

        objs_file_wRegional = minimax(N_SOLNS, objs_matrix[:,:,i])

        objs_matrix[:,:,i] = objs_file_wRegional

        array_has_nan = np.isnan(np.mean(objs_matrix[:,3,i]))
        if(array_has_nan == True):
            print('NaN found at RDM ', str(i))

    for n in range(N_SOLNS):
        for n_objs in range(20):
            objs_means[n,n_objs] = np.mean(objs_matrix[n,n_objs,:])

    return objs_means

'''
Calculates the mean performance across all solutions for each SOW
'''
def mean_performance_across_solns(objs_by_rdm_dir, N_RDMS, N_SOLNS):
    objs_matrix = np.zeros((N_SOLNS,20,N_RDMS), dtype='float')
    objs_means = np.zeros((N_RDMS,20), dtype='float')

    for i in range(N_RDMS):
        filepathname = objs_by_rdm_dir + str(i) + '_sols0_to_' + str(N_SOLNS) + '.csv'
        objs_file = np.loadtxt(filepathname, delimiter=",")
        objs_matrix[:,:15,i] = objs_file
        objs_file_wRegional = minimax(N_SOLNS, objs_matrix[:,:,i])

        objs_matrix[:,:,i] = objs_file_wRegional

        array_has_nan = np.isnan(np.mean(objs_matrix[:,3,i]))
        if(array_has_nan == True):
            print('NaN found at RDM ', str(i))

    for n in range(N_RDMS):
        for n_objs in range(20):
            objs_means[n,n_objs] = np.mean(objs_matrix[:,n_objs,n])

    return objs_means

# change number of solutions available depending on the number of solutions
# that you identified
N_SOLNS = 69
N_RDMS = 100

# change the filepaths
objs_by_rdm_dir = '/yourFilePath/WaterPaths/output/Objectives_RDM'
objs_og_dir = '/yourFilePath/WaterPaths/'

fileoutpath = '/yourFilePath/WaterPaths/post_processing/'

fileoutpath_byRDMs = fileoutpath + 'meanObjs_acrossRDM.csv'
fileoutpath_bySoln = fileoutpath + 'meanObjs_acrossSoln.csv'

# should have shape (N_SOLNS, 20)
objs_byRDM = mean_performance_across_rdms(objs_by_rdm_dir, N_RDMS, N_SOLNS)
# should have shape (N_RDMS, 20)
objs_bySoln = mean_performance_across_solns(objs_by_rdm_dir, N_RDMS, N_SOLNS)

np.savetxt(fileoutpath_byRDMs, objs_byRDM, delimiter=",")
np.savetxt(fileoutpath_bySoln, objs_bySoln, delimiter=",")

This will output two .csv files: meanObjs_acrossRDM.csv contains the average performance of each of the sixty-nine portfolios evaluated over the 100 DU SOWs, while meanObjs_acrossSoln.csv contains the average performance of all solutions within each SOW. Take some time to understand this difference, as it will be important when performing sensitivity analysis and scenario discovery.
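As a quick sanity check (a hypothetical snippet, assuming it is run from the WaterPaths/ directory), you can load both files and compare their shapes:

import numpy as np

# each row of this file is one portfolio's objectives averaged over the 100 DU SOWs
mean_across_rdms = np.loadtxt('post_processing/meanObjs_acrossRDM.csv', delimiter=',')
# each row of this file is one DU SOW's objectives averaged over the 69 portfolios
mean_across_solns = np.loadtxt('post_processing/meanObjs_acrossSoln.csv', delimiter=',')

print(mean_across_rdms.shape)   # (69, 20)
print(mean_across_solns.shape)  # (100, 20)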

Calculate the robustness of each portfolio to deep uncertainty

Now, let’s identify how each of these solutions performs under a set of more challenging SOWs. Within post_processing_code/, identify the file called calc_robustness_across_rdms.py. It should look like this:

# -*- coding: utf-8 -*-
"""
Created on Mon April 26 2022 11:12

@author: Lillian Bei Jia Lau

Calculates the fraction of RDMs (SOWs) over which each solution meets the satisficing criteria
"""
import numpy as np
import pandas as pd

obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

utilities = ['Cary', 'Durham', 'Raleigh', 'Regional']

'''
Performs regional minimax
'''
def minimax(N_SOLNS, objs):
    for i in range(N_SOLNS):
        for j in range(5):
            if j == 0:
                objs[i,15] = np.min([objs[i,0],objs[i,5], objs[i,10]])
            else:
                objs[i, (j+15)] = np.max([objs[i,j],objs[i,j+5], objs[i,j+10]])
    return objs

'''
For each RDM, check whether each solution satisfies the satisficing criteria
'''
def satisficing(df_objs):
    df_objs['satisficing_C'] = (df_objs['REL_C'] >= 0.98).astype(int) *\
                                (df_objs['WCC_C'] <= 0.10).astype(int) *\
                                (df_objs['RF_C'] <= 0.10).astype(int)

    df_objs['satisficing_D'] = (df_objs['REL_D'] >= 0.98).astype(int) *\
                                (df_objs['WCC_D'] <= 0.10).astype(int) *\
                                (df_objs['RF_D'] <= 0.10).astype(int)

    df_objs['satisficing_R'] = (df_objs['REL_R'] >= 0.98).astype(int) *\
                                (df_objs['WCC_R'] <= 0.10).astype(int) *\
                                (df_objs['RF_R'] <= 0.10).astype(int)

    # regional satisficing for each solution (row-wise across the three utilities)
    df_objs['satisficing_reg'] = np.max(df_objs.iloc[:, 20:23], axis=1)
    return df_objs

def calc_robustness(objs_by_rdm_dir, N_RDMS, N_SOLNS):

    # matrix structure: (N_SOLNS, N_OBJS, N_RDMS)
    objs_matrix = np.zeros((N_SOLNS,20,N_RDMS), dtype='float')

    satisficing_matrix = np.zeros((N_SOLNS,4,N_RDMS), dtype='float')
    solution_robustness = np.zeros((N_SOLNS,4), dtype='float')

    for i in range(N_RDMS):
        # load the objective values of all solutions under RDM (SOW) i
        filepathname = objs_by_rdm_dir + str(i) + '_sols0_to_' + str(N_SOLNS) + '.csv'

        objs_file = np.loadtxt(filepathname, delimiter=",")

        objs_matrix[:,:15,i] = objs_file

        objs_file_wRegional = minimax(N_SOLNS, objs_matrix[:,:,i])

        objs_matrix[:,:,i] = objs_file_wRegional

        # NaN check
        array_has_nan = np.isnan(np.mean(objs_matrix[:,3,i]))
        if(array_has_nan == True):
            print('NaN found at RDM ', str(i))

    # for the perturbed instances
    for r in range(N_RDMS):

        df_curr_rdm = pd.DataFrame(objs_matrix[:, :, r], columns = obj_names)

        df_curr_rdm = satisficing(df_curr_rdm)
        satisficing_matrix[:N_SOLNS,:,r] = df_curr_rdm.iloc[:,20:24]

    for n in range(N_SOLNS):
        solution_robustness[n,0] = np.sum(satisficing_matrix[n,0,:])/N_RDMS
        solution_robustness[n,1] = np.sum(satisficing_matrix[n,1,:])/N_RDMS
        solution_robustness[n,2] = np.sum(satisficing_matrix[n,2,:])/N_RDMS

    solution_robustness[:,3] = np.min(solution_robustness[:,:3], axis=1)

    return solution_robustness

'''
Change number of solutions available depending on the number of solutions
that you identified and the number of SOWs that you are evaluating them over.
'''
N_RDMS = 100
N_SOLNS = 69

objs_by_rdm_dir = '/scratch/lbl59/blog/WaterPaths/output/Objectives_RDM'

fileoutpath_robustness = '/scratch/lbl59/blog/WaterPaths/post_processing/' + \
    'robustness_' + str(N_RDMS) + '_SOWs.csv'

robustness = calc_robustness(objs_by_rdm_dir, N_RDMS, N_SOLNS)

np.savetxt(fileoutpath_robustness, robustness, delimiter=",")

When you run this script from the terminal, you should see a .csv file called robustness_100_SOWs.csv appear in your post_processing/ folder. Now, let’s get on to some sensitivity analysis!

Delta moment-independent sensitivity analysis

The Delta moment-independent (DMI) method (Borgonovo, 2007) is a sensitivity analysis method that compares the entire probability distributions of the input and output parameters to estimate the sensitivity of the output to a specific input parameter. It is one of many global sensitivity analysis methods; global methods are one of the two main categories of sensitivity analysis (the other being local methods), which assess the degree to which uncertainty in model inputs maps to uncertainty in model output. For the purposes of this test case, the DMI method is preferable because it does not rely on any one statistical moment (such as the variance, mean, or skew) to describe the dependence of model output on its input parameters. It is also time-sensitive, reflecting the current state of knowledge within the system, which philosophically pairs well with our use of the ROF triggers. More information on alternative global sensitivity methods can be found here.
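For reference (this expression comes from Borgonovo (2007), not from the script below), the delta index for an input X_i and model output Y can be written as:

$$\delta_i = \frac{1}{2}\, \mathbb{E}_{X_i}\!\left[\int \left| f_Y(y) - f_{Y \mid X_i}(y)\right| \, dy\right]$$

where $f_Y$ is the unconditional probability density of the output and $f_{Y \mid X_i}$ is its density conditional on a fixed value of $X_i$. The index ranges from 0 to 1, and $\delta_i = 0$ when the output is independent of $X_i$.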

Within the context of our test case, we will use the DMI method to identify the decision variables and DU factors that most strongly influence our performance over the 100 DU SOWs. To perform DMI sensitivity analysis, first navigate to the post_processing/ folder. Within the folder, create two sub-folders called delta_output_DV/ and delta_output_DUF/. This is where all your DMI output will be stored. Next, locate the delta_sensitivity.py file within the post_processing_code/ folder. The code should look similar to the script provided below:

import sys
from SALib.analyze import delta
from SALib.util import read_param_file
import numpy as np
import pandas as pd

'''
Finds the upper and lower bounds of input parameters
'''
def find_bounds(input_file):
    bounds = np.zeros((input_file.shape[1],2), dtype=float)
    for i in range(input_file.shape[1]):
        bounds[i,0] = min(input_file[:,i])
        bounds[i,1] = max(input_file[:,i])

    return bounds
'''
Performs delta moment-independent sensitivity analysis
Source: https://github.com/SALib/SALib/tree/main/examples/delta
'''
def delta_sensitivity(dec_vars, measured_outcomes, names, mo_names, bounds, rdm, mode):
    X = dec_vars
    Y = measured_outcomes

    problem = {
        'num_vars': int(dec_vars.shape[1]),
        'names': names,
        'bounds': bounds
    }

    for i in range(measured_outcomes.shape[1]):
        mo_label = mo_names[i]
        if i == 2 and mode == 'objs':
            break
        else:
            filename = '../post_processing/delta_output_' + rdm + '/S1_' + mo_label + '.csv'
            S1 = delta.analyze(problem, X, Y[mo_label].values, num_resamples=10, conf_level=0.95, print_to_console=False)
            numpy_S1 = np.array(S1["S1"])
            fileout = pd.DataFrame([names, numpy_S1], index = None, columns = None)
            fileout.to_csv(filename, sep=",")

'''
0 - Name all file headers and compSol to be analyzed
'''
obj_names = ['REL_C', 'RF_C', 'INF_NPC_C', 'PFC_C', 'WCC_C', \
        'REL_D', 'RF_D', 'INF_NPC_D', 'PFC_D', 'WCC_D', \
        'REL_R', 'RF_R', 'INF_NPC_R', 'PFC_R', 'WCC_R', \
        'REL_reg', 'RF_reg', 'INF_NPC_reg', 'PFC_reg', 'WCC_reg']

dv_names = ['RT_C', 'RT_D', 'RT_R', 'TT_D', 'TT_R', 'LMA_C', 'LMA_D', 'LMA_R',\
            'RC_C', 'RC_D', 'RC_R', 'IT_C', 'IT_D', 'IT_R', 'IP_C', 'IP_D', \
            'IP_R', 'INF_C', 'INF_D', 'INF_R']

rdm_headers_dmp = ['Cary restr. eff', 'Durham restr. eff', 'Raleigh restr. eff']
rdm_headers_utilities = ['Demand growth\nmultiplier', 'Bond term\nmultiplier', \
                        'Bond interest\nrate multiplier', 'Infrastructure interest\nrate multiplier']
rdm_headers_ws = ['Streamflow amp', 'SCR PT', 'SCR CT', 'NRR PT', 'NRR CT', 'CR Low PT', 'CR Low CT',\
                  'CR High PT', 'CR High CT', 'WR1 PT', 'WR1 CT', 'WR2 PT', 'WR2 CT',\
                  'DR PT', 'DR CT', 'FR PT', 'FR CT']

duf_names = ['Cary restr. eff', 'Durham restr. eff', 'Raleigh restr. eff', 'Demand growth\nmultiplier',\
             'Bond term\nmultiplier', 'Bond interest\nrate multiplier', 'Infrastructure interest\nrate multiplier',\
             'Streamflow amp\nmultiplier', 'SCR PT\nmultiplier', 'SCR CT\nmultiplier', 'NRR PT\nmultiplier',\
             'NRR CT\nmultiplier', 'CR Low PT\nmultiplier', 'CR Low CT\nmultiplier', 'CR High PT\nmultiplier',\
             'CR High CT\nmultiplier', 'WR1 PT\nmultiplier', 'WR1 CT\nmultiplier', 'WR2 PT\nmultiplier',\
             'WR2 CT\nmultiplier', 'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier']

utilities = ['Cary', 'Durham', 'Raleigh', 'Regional']

N_RDMS = 100
N_SOLNS = 69

'''
1 - Load DU factor files and DV files
'''
# change to your own filepath
rdm_factors_directory = '/yourFilePath/WaterPaths/TestFiles/'
rdm_dmp_filename = rdm_factors_directory + 'rdm_dmp_test_problem_reeval.csv'
rdm_utilities_filename = rdm_factors_directory + 'rdm_utilities_test_problem_reeval.csv'
rdm_watersources_filename = rdm_factors_directory + 'rdm_water_sources_test_problem_reeval.csv'

rdm_dmp = pd.read_csv(rdm_dmp_filename, sep=",", names=rdm_headers_dmp)
rdm_utilities = pd.read_csv(rdm_utilities_filename, sep=",", names=rdm_headers_utilities)
rdm_ws_all = np.loadtxt(rdm_watersources_filename, delimiter=",")
rdm_ws = pd.DataFrame(rdm_ws_all[:,:17], columns=rdm_headers_ws)

dufs = pd.concat([rdm_dmp, rdm_utilities, rdm_ws], axis=1, ignore_index=True)
duf_numpy = dufs.to_numpy()

# change to your own filepath
dv_directory = '/yourFilePath/WaterPaths/'
dv_filename = dv_directory + 'NC_dvs_all_noheader.csv'
dvs = np.loadtxt(dv_filename, delimiter=",")

'''
2 - Get bounds for DU factors and DVs
'''
duf_bounds = find_bounds(duf_numpy)
dv_bounds = find_bounds(dvs)

'''
3 - Load robustness file and objectives file
'''
# change to your own filepath
main_dir = '/yourFilePath/WaterPaths/post_processing/'

robustness_filename = main_dir + 'robustness_100_SOWs.csv'
robustness_arr = np.loadtxt(robustness_filename, delimiter=",")
robustness_df = pd.DataFrame(robustness_arr, columns=utilities)

objs_mean_rdm_filename = main_dir + 'meanObjs_acrossRDM.csv'
objs_mean_rdm_arr = np.loadtxt(objs_mean_rdm_filename, delimiter=",")
objs_mean_rdm_df = pd.DataFrame(objs_mean_rdm_arr, columns=obj_names)

objs_mean_soln_filename = main_dir + 'meanObjs_acrossSoln.csv'
objs_mean_soln_arr = np.loadtxt(objs_mean_soln_filename, delimiter=",")
objs_mean_soln_df = pd.DataFrame(objs_mean_soln_arr, columns=obj_names)

# to change  depending on whether DV or DUF is being analyzed
dec_vars = dvs
measured_outcomes = objs_mean_rdm_df
names = dv_names
mo_names = obj_names
bounds = dv_bounds
rdm = 'DV'
mode = 'objs'
###

delta_sensitivity(dec_vars, measured_outcomes, names, mo_names, bounds, rdm, mode)

The code above identifies the sensitivity of the average values of all sixty-nine sets of performance objectives, taken over all 100 deeply uncertain SOWs, to the decision variables. This is why we use the meanObjs_acrossRDM.csv file – it contains sixty-nine sets of mean performance objectives, where each set maps back to its corresponding portfolio of actions.

To identify the sensitivity of performance objectives to the DU factors, change lines 121 to 127 to the following:

# to change  depending on whether DV or DUF is being analyzed
dec_vars = duf_numpy[:100,:]
measured_outcomes = objs_mean_soln_df
names = duf_names
mo_names = obj_names
bounds = duf_bounds[:100,:]
rdm = 'DUF'
mode = 'objs'
###

The code above identifies the sensitivity of the average values of all twenty performance objectives, taken over the sixty-nine portfolios, to the set of deeply uncertain hydroclimatic and demand scenarios. This is why we use the meanObjs_acrossSoln.csv file – it contains one hundred sets of mean performance objectives, where each set maps back to its corresponding DU SOW.

Great job so far! Now let’s visualize the sensitivity of our output to our input parameters using heatmaps. Before we can visualize each utility’s performance sensitivity, we must first organize the sensitivity indices into one file per utility containing the indices for all of its objectives. The gather_delta.py script does this. Simply change the value of mode on line 11 to 'DUF' to gather the indices for the DU factors instead.

"""
Created on Tue April 26 2022 16:12

@author: Lillian Bei Jia Lau

Gathers the delta sensitivity indices into files per utility
"""
import numpy as np
import pandas as pd

mode = 'DV'
main_dir = '/yourFilePath/WaterPaths/post_processing/delta_output_' + mode + '/'
utilities = ['_C', '_D', '_R', '_reg']
objs = ['REL', 'RF', 'INF_NPC', 'PFC', 'WCC']
utilities_full = ['Cary', 'Durham', 'Raleigh', 'Regional']

dv_names = ['RT_C', 'RT_D', 'RT_R', 'TT_D', 'TT_R', 'LMA_C', 'LMA_D', 'LMA_R',\
            'RC_C', 'RC_D', 'RC_R', 'IT_C', 'IT_D', 'IT_R', 'IP_C', 'IP_D', \
            'IP_R', 'INF_C', 'INF_D', 'INF_R']

duf_names = ['Cary restr. eff', 'Durham restr. eff', 'Raleigh restr. eff', 'Demand growth\nmultiplier',\
             'Bond term\nmultiplier', 'Bond interest\nrate multiplier', 'Infrastructure interest\nrate multiplier',\
             'Streamflow amp\nmultiplier', 'SCR PT\nmultiplier', 'SCR CT\nmultiplier', 'NRR PT\nmultiplier',\
             'NRR CT\nmultiplier', 'CR Low PT\nmultiplier', 'CR Low CT\nmultiplier', 'CR High PT\nmultiplier',\
             'CR High CT\nmultiplier', 'WR1 PT\nmultiplier', 'WR1 CT\nmultiplier', 'WR2 PT\nmultiplier',\
             'WR2 CT\nmultiplier', 'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier']

s1_dv_cary = np.zeros((len(objs), len(dv_names)), dtype=float)
s1_dv_durham = np.zeros((len(objs), len(dv_names)), dtype=float)
s1_dv_raleigh = np.zeros((len(objs), len(dv_names)), dtype=float)
s1_dv_regional = np.zeros((len(objs), len(dv_names)), dtype=float)

s1_dv_dict = {
    '_C': s1_dv_cary,
    '_D': s1_dv_durham,
    '_R': s1_dv_raleigh,
    '_reg': s1_dv_regional
}

s1_duf_cary = np.zeros((len(objs), len(duf_names)), dtype=float)
s1_duf_durham = np.zeros((len(objs), len(duf_names)), dtype=float)
s1_duf_raleigh = np.zeros((len(objs), len(duf_names)), dtype=float)
s1_duf_regional = np.zeros((len(objs), len(duf_names)), dtype=float)

s1_duf_dict = {
    '_C': s1_duf_cary,
    '_D': s1_duf_durham,
    '_R': s1_duf_raleigh,
    '_reg': s1_duf_regional
}

for i in range(len(utilities)):
    s1_util = []
    hdrs = []
    if mode == 'DV':
        s1_util = s1_dv_dict[utilities[i]]
        hdrs = dv_names
    elif mode == 'DUF':
        s1_util = s1_duf_dict[utilities[i]]
        hdrs = duf_names

    for j in range(len(objs)):
        curr_file = main_dir + 'S1_' + objs[j] + utilities[i] + '.csv'
        s1_util[j, :] = pd.read_csv(curr_file, sep=',', skiprows=2, header=None).iloc[0,1:]

    s1_util_df = pd.DataFrame(s1_util, columns=hdrs)
    out_filepath = main_dir + utilities_full[i] + '.csv'

    s1_util_df.to_csv(out_filepath, sep=',', index=False)

Now, let’s plot our heatmaps! The code to do so can be found in sensitivity_heatmap.py, and should look similar to the code provided below:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import AxesGrid

sns.set_theme()

# change depending on compromise solution and whether its sensitivity to DUF or DVs
mode = 'DUF'
rot = 45    # x-axis label rotation: use 0 for DV, 45 for DUF
main_dir = '/YourFilePath/WaterPaths/post_processing/delta_output_' + mode + '/'
c_filepath = main_dir + 'Cary.csv'
d_filepath = main_dir + 'Durham.csv'
r_filepath = main_dir + 'Raleigh.csv'
reg_filepath = main_dir + 'Regional.csv'

cary = pd.read_csv(c_filepath, index_col=False, header=0)
durham = pd.read_csv(d_filepath, index_col=False, header=0)
raleigh = pd.read_csv(r_filepath, index_col=False, header=0)
regional = pd.read_csv(reg_filepath, index_col=False, header=0)

savefig_dir = '/YourFilePath/WaterPaths/post_processing/'
savefig_name = savefig_dir + 'dmi_heatmap_' + mode + '.svg'

grid_kws = {"height_ratios": (0.20, 0.20, 0.20, 0.20, .02), "hspace": 0.5}
f, (ax1, ax2, ax3, ax4, cbar_ax) = plt.subplots(5, figsize=(15, 20), gridspec_kw=grid_kws)
plt.subplots_adjust(top = 0.95, bottom = 0.05,
            hspace = 0, wspace = 0.05)

y_objs=['REL', 'RF', 'INPC', 'PFC', 'WCC']

x_dvs=['$RT_{C}$', '$RT_{D}$', '$RT_{R}$', '$TT_{D}$', '$TT_{R}$', '$LM_{C}$', '$LM_{D}$', '$LM_{R}$',\
                '$RC_{C}$', '$RC_{D}$', '$RC_{R}$', '$IT_{C}$', '$IT_{D}$', '$IT_{R}$', '$IP_{C}$', \
                '$IP_{D}$', '$IP_{R}$','$INF_{C}$', '$INF_{D}$', '$INF_{R}$']
x_dufs = ['Cary\nrestr. eff', 'Durham\nrestr. eff', 'Raleigh\nrestr. eff', 'Dem. growth\nmultiplier',\
             'Bond term\nmultiplier', 'Bond interest\nrate multiplier', 'Infra. interest\nrate multiplier',\
             'Streamflow amp\nmultiplier', 'SCR PT\nmultiplier', 'SCR CT\nmultiplier', 'NRR PT\nmultiplier',\
             'NRR CT\nmultiplier', 'CR Low PT\nmultiplier', 'CR Low CT\nmultiplier', 'CR High PT\nmultiplier',\
             'CR High CT\nmultiplier', 'WR1 PT\nmultiplier', 'WR1 CT\nmultiplier', 'WR2 PT\nmultiplier',\
             'WR2 CT\nmultiplier', 'DR PT\nmultiplier', 'DR CT\nmultiplier', 'FR PT\nmultiplier', 'FR CT\nmultiplier']

x_labs = []
if mode == 'DV':
    x_labs = x_dvs
elif mode == 'DUF':
    x_labs = x_dufs

plt.rc('xtick', labelsize=1)
plt.rc('ytick', labelsize=3)
plt.rc('axes', labelsize=5)
plt.rc('axes', titlesize=14)

#ax1 = fig.add_subplot(411)
ax1.set_title("Cary")
sns.heatmap(cary, linewidths=.05, cmap="YlOrBr", xticklabels=[],
    yticklabels=y_objs, ax=ax1, cbar=False)
ax1.set_yticklabels(y_objs, rotation=0)

#ax2 = fig.add_subplot(412)
ax2.set_title("Durham")
sns.heatmap(durham, linewidths=.05, cmap="YlOrBr", xticklabels=[],
    yticklabels=y_objs, ax=ax2, cbar=False)
ax2.set_yticklabels(y_objs, rotation=0)

#ax3 = fig.add_subplot(413)
ax3.set_title("Raleigh")
sns.heatmap(raleigh, linewidths=.05, cmap="YlOrBr", xticklabels=[],
    yticklabels=y_objs, ax=ax3, cbar=False)
ax3.set_yticklabels(y_objs, rotation=0)

#ax4 = fig.add_subplot(414, fontsize=10)
ax4.set_title("Regional")
ax4 = sns.heatmap(regional, linewidths=.05, cmap="YlOrBr", xticklabels=x_labs,
    yticklabels=y_objs, ax=ax4, cbar=True, cbar_ax=cbar_ax,
    cbar_kws={'orientation': 'horizontal'})     # change depending on whether analyzing DUF or DV
ax4.set_xticklabels(x_labs, rotation=rot, fontsize=10)
ax4.set_yticklabels(y_objs, rotation=0)

plt.savefig(savefig_name)

Running this for the sensitivity to decision variables and DU factors will generate the following images:

Sensitivity of performance objectives to decision variables.

In the figure above, the color of each box represents the sensitivity of a performance objective (y-axis) to a specific decision variable (x-axis). It is interesting to note that the restriction triggers (RT) of all utilities strongly influence each utility's individual and regional reliability and restriction frequency. This indicates the potential for regional conflict, as errors in operating one utility's restriction trigger may adversely affect other utilities' reliability and their ability to maintain full control over their own use of water-use restrictions. Furthermore, Raleigh's performance is sensitive to more decision variables than its two counterparts, with its worst-case cost (WCC) being affected most by Cary's infrastructure investments. This observation highlights the importance of careful cooperation between a region's member utilities to ensure that all partners play their part in maintaining both their own and their counterparts' performance.

Sensitivity of performance objectives to DU factors.

In this next figure, we observe that uncertainty in demand growth is the only DU factor that significantly drives changes in individual and regional performance. This finding can thus help utilities to focus on demand management programs, or formulate operations and management policies that enable them to more quickly adapt to changes in consumer and industrial demand growth.

Overall, in this post, we performed a simple sensitivity analysis to identify the decision variables and DU factors that control regional and individual performance. All the code for processing the output data can be found in this GitHub repository here. In the next post, we will end the MORDM blog post series by performing scenario discovery to map regions of success and failure as defined by our robustness metrics.

References

Borgonovo, E. (2007). A new uncertainty importance measure. Reliability Engineering & System Safety, 92(6), 771-784. doi: 10.1016/j.ress.2006.04.015

Herman, J. D., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2015). How should robustness be defined for water systems planning under change? Journal of Water Resources Planning and Management, 141(10), 04015012. doi:10.1061/(asce)wr.1943-5452.0000509

Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B, & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623

Basics of data visualization with ggplot2

In my previous post, I showed how wonderful the ggplot2 library in R is for visualizing complex networks. I realized that while there are several posts on this blog going over the advanced visualization capabilities of the ggplot2 library, there isn’t a primer on structuring code for creating graphs in R…yet. In this post, I will go over the syntax for creating pretty ggplot2 graphs and tweaking various parameters. I am a self-declared Python aficionado, but love using ggplot2 because it is intuitive to use, beginner-friendly, and highly customizable all at the same time.

Dataset and setup

For this tutorial, I will be using one of the built-in datasets in R called mtcars which was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Further documentation on this dataset can be found here. We import the data into our RStudio workspace.

# import the library into our workspace
library(ggplot2)

# import dataset
data(mtcars)
head(mtcars)

The resultant dataset looks something like this.

Basic plot

Now that we have the data, we can get to plotting with ggplot2. We can declaratively create graphics using this library. We just have to provide the data, specify how to map properties to graph aesthetics, and the library takes care of the rest for us! We need to specify three things for each ggplot — 1) the data, 2) the aesthetics, and 3) the geometry.

Let us start by creating a basic scatterplot of the mileage (mpg) of each car as a function of its horsepower (hp). In this case the data is our dataframe mtcars, and the aesthetics x and y will be defined as the names of the columns we wish to plot along each axis — hp and mpg. We can also set the color aesthetic to indicate the number of cylinders (cyl) in each car. One of the reasons ggplot2 is so user-friendly is because each graph property can be tacked on to the same line of code with a + sign. Since we want a scatterplot, the geometry will be defined using geom_point().

# basic scatterplot
g <- ggplot(data = mtcars, aes(x = hp, y = mpg, color=cyl))
g + geom_point()

Excellent! The library automatically assigns the column names as axis labels, and uses the default theme and colors, but all of this can be modified to suit our tastes and to create pretty graphs. It is also important to note that we could have visualized the same data (less helpfully) as a line plot instead of a scatterplot, just by tweaking the geometry function.

# basic line plot
g + geom_line()

Well, this looks unpleasant. But wait, we can do so much more. We can also layer multiple geometries on the same graph to make more interesting plots.

# basic scatter+line plot
g + geom_line() + geom_point()

Additionally, we can tweak the geometry properties in each graph. Here is how we can transform the lines to dotted, and specify line widths and marker shape and size.

# change properties of geometry
g + geom_point(shape = "diamond", size = 3) +
  geom_line(color = "black", linetype = "dotted", size = .3) 

While our graph looks much neater now, using a line plot is actually pretty unhelpful for our dataset since each data point is a separate car. We will stick with a scatterplot for the rest of this tutorial. However, the above sort of graph would work great for time series data or other data that measures change in one variable.

Axis labels

One of the cardinal rules of good data visualization is to add axis labels to your graphs. While R automatically sets the axis labels to the column headers, we can override this to make the axis labels more informative with just one extra function.

# change axis titles
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)")

Title

This graph is in serious need of a title to provide a reader some idea of what they’re looking at. There are actually multiple ways to add a graph title here, but I find it easiest to use ggtitle().

# add title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") 

Alright, having a title is helpful, but I don’t love its placement on the graph. R automatically left-aligns the title, where it clashes with the y-axis. I would much rather have the title right-aligned, in a bigger font, and bolded. Here is how to do that.

# change position of title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Theme

There are ways to manually change the background and gridlines of ggplot2 graphs using theme(), but an easy workaround is to use the built-in themes. Which theme you use depends greatly on the graph type and formatting guidelines, but I personally like a white background, faint gridlines, and a bounding box. One thing to note here though is that theme_bw() overrides theme() so the order of these two matters.

# add theme
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also use the theme() function to change the base font size and font family. Shown below is how to increase the base font size to 15 and change the base font family to Courier.

# use theme to change base font family and font size
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)") +
  ggtitle("Mileage vs Horsepower")  +
  theme_bw(base_size = 15, base_family = "Courier") +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

Legend

It has been bothering me for the last seven paragraphs that my legend title still uses the column name. However, this is an easy fix. All I have to do is add a label to the color aesthetic in the labs() function.

# change legend title
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold"))

We can also change the position of the legend. R automatically places legends on the right, and while I like having it to the right in this case, I could also place the legend at the bottom of the graph. This automatically changes the aspect ratio of the graph.

# change legend position
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "bottom")

Margins

The theme() function is of endless use in ggplot2, and can be used to manually adjust the graph margins and add/remove white space padding. The order of arguments in margin() is clockwise — top, right, bottom, left (helpfully remembered by the mnemonic TRouBLe).

# add plot margins
g + geom_point(shape = "diamond", size = 3) +
  labs(x = "Horsepower (hp)", y = "Mileage (mpg)", color = "Cylinders") +
  ggtitle("Mileage vs Horsepower") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 1, size = 15, face = "bold")) +
  theme(legend.position = "right") +
  theme(plot.margin = margin(t = 1, r = 1, b = 1, l = 2, unit = "cm"))

Conclusion

I have barely scratched the surface of what can be achieved using ggplot2 in this post. There are hundreds of excellent tutorials online that dive deeper into ggplot2, like this blog post by Cedric Scherer. I have yet to learn so much about this library and data visualization in general, but have hopefully made a solid case for using ggplot2 to create clean and aesthetically-pleasing data visualizations.

Taylor Diagram

The evaluation of model performance is a widely discussed and yet extremely controversial topic in hydrology, atmospheric science, and environmental studies. Generally, it is not super straightforward to quantify the accuracy of tools that simulate complex systems. One of the reasons is that these systems usually have various sub-systems. For example, a complex river system will simulate regulated streamflow, water allocation, dam operations, etc. This can imply that there are multiple goodness-of-fit objectives that cannot be fully optimized simultaneously. In this situation, there will always be tradeoffs in the accuracy of the system representation.

The other reason is that these complex systems are often state-dependent, and their behavior non-linearly changes as the system evolves. Finally, there are variables in the system that have several properties, and it is usually impossible to improve all of their properties at the same time (Gupta et al., 1998). Take streamflow as an example. In calculating model accuracy, we can focus on the daily fluctuation of streamflow or look into the changes weekly, monthly, and annually. We can also concentrate on the seasonality of streamflow, extreme low and high cases, and the persistence of these extreme events. Therefore, our results are not fully defensible if we only focus on one performance metric and make inferences that ignore many other components of the system. In this blog post, I am going to explain a visualization technique that allows us to draw three or more metrics of the system on the same plot. The plot is called a Taylor Diagram. The Taylor Diagram is not really a solution to the problem that I explained, but it does provide a systematic and mathematically sound way of demonstrating a few popular goodness-of-fit measures. The original version of the Taylor Diagram includes three performance metrics: i) standard deviation of simulated vs. observed; ii) correlation coefficient between observed and simulated; and iii) centered root mean square error (CRMSE). However, there are other versions of the Taylor Diagram that include more than these three metrics (e.g., here). The Taylor Diagram was developed by Karl E. Taylor (original paper) and has been frequently used by meteorologists and atmospheric scientists. In this blog post, I focus on streamflow.

Underlying Mathematical Relationships

As mentioned above, the Taylor Diagram draws the following three statistical metrics on a single plot: standard deviation, CRMSE, and correlation. The equation below relates these three:

Equation 1:

$$\mathrm{CRMSE}^2 = \sigma_o^2 + \sigma_s^2 - 2\,\sigma_o\,\sigma_s\,R$$

In this equation, $\sigma_o$ and $\sigma_s$ are the standard deviations of the observed and simulated time series, and $R$ is the correlation coefficient between them.

This relationship can be easily derived using the definition of CRMSE, which is:

Equation 2:

$$\mathrm{CRMSE}^2 = \frac{1}{N}\sum_{i=1}^{N}\left[\left(s_i - \bar{s}\right) - \left(o_i - \bar{o}\right)\right]^2$$

In this equation, $s_i$ and $o_i$ are the simulated and observed values at time step $i$, $\bar{s}$ and $\bar{o}$ are their respective means, and $N$ is the number of time steps.

We can expand this equation to the following:

Equation 3:

$$\mathrm{CRMSE}^2 = \sigma_s^2 + \sigma_o^2 - \frac{2}{N}\sum_{i=1}^{N}\left(s_i - \bar{s}\right)\left(o_i - \bar{o}\right)$$

The first two components of the equation are the variances (squared standard deviations) of the simulated and observed time series. To understand the third component of the equation, we need to recall the mathematical definition of correlation.

Equation 4:

$$R = \frac{\frac{1}{N}\sum_{i=1}^{N}\left(s_i - \bar{s}\right)\left(o_i - \bar{o}\right)}{\sigma_s\,\sigma_o}$$

If we multiply both sides of Equation 4 by the standard deviations of the observed and simulated series (and by a factor of two), we see that the result is exactly the third component of Equation 3, which is how Equation 1 follows. The Taylor Diagram uses polar coordinates to visualize all of these components together. Readers can refer to the original paper for more discussion and the trigonometric reasoning behind this form of plot.

Code Example

In this blog post, I am presenting a simple R package that can be used to create Taylor Diagrams. In this example, I use the same streamflow datasets that I explained in the previous blog post. First, you need to download the time series from GitHub and use the following commands to read these two time series into R.

Observed_Arrow<-read.table("~/Taylor Diagram/Arrow_observed.txt", header = T)
Simulated_Arrow<-read.table("~/Taylor Diagram/Arrow_simulated.txt", header = T)

In this post, I use the “openair” R library, which was originally developed for atmospheric chemistry data analysis. In addition to the Taylor Diagram, the package provides many helpful visualization options that can be accessed from here. Note that I was able to find at least one more R package that can be used to create Taylor Diagrams (i.e., plotrix ). There are also Python and MATLAB libraries that can be used to create Taylor Diagrams.

You can use the following to install the “openair” package and activate the library.

install.packages("openair")
library(openair)

The following command can be used to create a Taylor Diagram.

combined_dataset<-cbind(Observed_Arrow[ 18263:length(Observed_Arrow[,1]),], Arrow_sim=Simulated_Arrow[1:10592,4])

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim")

Interpretation of the Taylor Diagram

The Taylor Diagram indicates the baseline observed point where correlation is 1 and CRMSE is 0 (purple color). If the simulation point is close to the observed point, it means that they are similar in terms of standard deviation, their correlation is high, and their CRMSE is close to zero. There is also a black dashed line that represents the standard deviation of the observed time series. If the red dot is above the line, it means that the simulated data set has a higher variation. The other helpful information that we gain from the Taylor Diagram is the correlation between observed and simulated values. Higher correlation shows a higher level of agreement between observed and simulated data. The correlation goes down as a point moves toward higher sectors in the graph. Centered root mean square error also shows the quality of the simulation process; however, it puts more weight on outliers. The golden contour lines on the polar plot show the value of CRMSE. Basically, on this figure, the red point has a higher standard deviation, the correlation is around 90%, and the CRMSE is close to 20,000.

The package also allows us to look at the performance of the model during different months or years, and we can compare different simulation scenarios with the observed data. Here, I use the function to draw a point for each month. The “group” argument can be used to do that.

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim", group = "Month")

The function also provides a simple way to break down the data into different plots for other attributes. For example, I created four panels for four different annual periods:

TaylorDiagram(combined_dataset, obs = "ARROW_obs", mod = "Arrow_sim", group = "Month", type = "Year")

How to make horizon plots in Python

Horizon plots were invented about a decade ago to facilitate visual comparison between two time series. They are not intuitive to read right away, but they are great for comparing and presenting many sets of time series together. They can take advantage of a minimal design by avoiding titles and ticks on every axis and packing the series close together to convey a bigger picture. The example below shows percent changes in the price of various food items over 25 years.

They are produced and read by dividing the values along the y-axis into bands based on value ranges. The color of each band comes from a divergent color map. By collapsing the bands down to the zero axis and layering the higher bands on top of the lower ones, one can create a time-varying heatmap of sorts.

Source: http://idl.cs.washington.edu/papers/horizon

I wasn’t able to find a script that could produce this in Python, besides some code in this GitHub repository, which is about a decade old and cannot really run in Python 3. I cleaned it up and updated the scripts with some additional features. I also added example data comparing USGS streamflow data with model simulation data at the same locations over 38 years. The code can be found here and can be used with any two datasets that one would like to compare, with as many points of comparison as needed (I used eight below, but the script can accept larger or smaller csv files, and the number of comparison points is detected automatically). The script handles the transformation of the data into uniform bands and produces the following figure, with every subplot comparing model output with observations at eight gauges, i.e., model prediction error. When the model is over-predicting, the area is colored blue; when it is under-predicting, the area is colored red. Darker shades indicate further divergence from the zero axis. The script uses three bands for both positive and negative divergence by default, but more can be added, as long as the user defines additional colors to be used.
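To make the band-and-layer idea concrete, here is a minimal, illustrative matplotlib sketch. This is my own simplified single-series example, not the script from the repository; the function name, colors, and synthetic data are arbitrary choices.

import numpy as np
import matplotlib.pyplot as plt

def horizon_plot(ax, t, values, band_height, n_bands=3,
                 pos_colors=("#c6dbef", "#6baed6", "#2171b5"),
                 neg_colors=("#fcbba1", "#fb6a4a", "#cb181d")):
    """Collapse a series into layered bands around zero (a basic horizon plot)."""
    for k in range(n_bands):
        lower, upper = k * band_height, (k + 1) * band_height
        # positive band k: clip to [lower, upper], then shift down to [0, band_height]
        pos = np.clip(values, lower, upper) - lower
        ax.fill_between(t, 0, pos, color=pos_colors[k], linewidth=0)
        # negative band k: mirror values below zero onto the same [0, band_height] range
        neg = np.clip(-values, lower, upper) - lower
        ax.fill_between(t, 0, neg, color=neg_colors[k], linewidth=0)
    ax.set_ylim(0, band_height)
    ax.set_yticks([])

# synthetic example: one noisy series, three bands above and three below zero
rng = np.random.default_rng(1)
t = np.arange(500)
y = np.cumsum(rng.normal(size=500))

fig, ax = plt.subplots(figsize=(8, 1))
horizon_plot(ax, t, y, band_height=np.abs(y).max() / 3)
plt.show()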

Using this type of visualization for these data allows for time-varying comparisons of multiple locations in the same basin. The benefit of it is most exploited with many subplots that make up a bigger picture.

Future extensions in this repository will include code to accept more file types than csv, more flexibility in how the data is presented and options to select different colormaps when executing.

Visualizing High Dimensional Data Using the T-SNE Method

In this blog post, I am going to go over a machine learning method that is used to visualize high-dimensional datasets. The method is called T-distributed Stochastic Neighbor Embedding, or T-SNE for short, and was developed by van der Maaten and Hinton (2008). Below, I describe T-SNE and provide a simple example in R.

Two weeks ago, Dave introduced a few useful dimensionality reduction methods to visualize the Pareto front. Last week, Rohini’s blog post provided some very useful information about self-organizing maps. I guess this blog post will officially make this a series of posts on dimensionality reduction and the visualization of high dimensional datasets. If you haven’t already read these two super informative blog posts, I highly recommend you take a look at them.

T-SNE is an extension of the Stochastic Neighbor Embedding (SNE) technique, and it uses an interesting approach to maintain, on a 2-D plane, the local relationships and similarities that exist in the original high-dimensional dataset. As van der Maaten and Hinton (2008) argue, other dimensionality reduction techniques (e.g., PCA) usually focus on macro distances in the dataset, which might lead to overlooking short-distance relationships. However, T-SNE is able to capture both local and macro distances in the data structure. Therefore, it can give more reasonable results when you are dealing with non-linear, high-dimensional datasets.

T-SNE uses the following formula, which basically places a Gaussian distribution over each data point and computes the density of all other data points with respect to the distribution around that point. The denominator of the equation normalizes the numerator.
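The conditional similarity from van der Maaten and Hinton (2008) can be written as follows (my rendering of the formula the post refers to):

$$p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i}\exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}$$

and the symmetric similarity used by T-SNE is $p_{ij} = \left(p_{j \mid i} + p_{i \mid j}\right)/2n$, where $n$ is the number of data points.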

In this equation, x_i is the reference point and x_j represents its neighboring points. Also, σ_i is the bandwidth, chosen for each point so that it yields the same perplexity. Perplexity is a measure of uncertainty that has a direct relationship with entropy; for more information about it, you can read this Wikipedia page. Basically, perplexity is a hyperparameter of T-SNE, and the final outcome might be very sensitive to its value. Also, p_ij is the probability of similarity of two points in the high-dimensional space. Basically, if two points are close to each other, p_ij will be high. If they are far from each other, the probability of similarity is low.

To make the low-dimensional map consistent with its original high-dimensional dataset, T-SNE also calculates a similar point-to-point probability of similarity in the 2-D map.
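In the paper, this low-dimensional similarity between mapped points $y_i$ and $y_j$ uses a Student t kernel with one degree of freedom (again, my rendering of the formula):

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}$$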

Basically, T-SNE moves the points on the 2-D plane to find the locations that minimize the Kullback-Leibler divergence between the similarity distribution P of the original data and the distribution Q of the 2-D map.
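Written out (following the original paper), the cost function minimized over the 2-D coordinates is:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij}\,\log\frac{p_{ij}}{q_{ij}}$$

where P and Q collect the pairwise similarities $p_{ij}$ and $q_{ij}$ defined above.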

As you probably noticed, in the low-dimensional space, T-SNE uses the Student t-distribution (with one degree of freedom). The reason is that the Student t-distribution (compared to the Gaussian) is a fat-tailed distribution. This allows T-SNE to separate dissimilar points with higher margins, which makes the map easier to interpret.

T-SNE Example in R

Here, I provide a simple R code example that demonstrates how you can use T-SNE to visualize high dimensional datasets. This example only has four dimensions, so it doesn’t really represent a high dimensional problem. As such, keep in mind that T-SNE shows its true capabilities when there are tens or even hundreds of dimensions.

First, you need to install the "Rtsne" library and load it.

install.packages("Rtsne")
library(Rtsne)

# We also need to load ggplot2 and ggpubr libraries

library(ggplot2)
library(ggpubr)

In this example, I use R’s famous “iris” data set. Here is how you can activate it.

# Load the iris dataset
data(iris)

Now you can use the following script to visualize the data using T-SNE.

# Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=30, max_iter=1000, check_duplicates = FALSE) # Run TSNE

# Create a data frame using IRIS values and the T-SNE values
df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))

# Specify column names
names(df_to_gg)<-c("Species", "Y.1", "Y.2")

# Show the objects in the 2D T-SNE representation
ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
  labs(title=paste("T-SNE Visualization- Perplexity Number = 30" )) +
  scale_color_manual(values = c("darkblue", "orange", "pink3"))

Here is the 2-D map that our code generates

Although T-SNE is a powerful method and has become quite popular in recent years, it might have some pitfalls. This blog post discusses some of these issues, including the sensitivity of the final map to perplexity hyperparameters or the fact that it can be tricky to interpret T-SNE maps because they might not represent the actual relationships between different clusters.

To explore the impacts of the perplexity hyperparameter on the final clusters, I use the following code to create T-SNE maps for different perplexity values.

# Here are the perplexity values that I am taking into account in my T-SNE analysis
perplexity_number_values<-c(2,5,10, 15, 30, 40)

# Initializing a plot list object
plot_list = list()

# This loop explores the effect of perplexity values on the T-SNE results
for (i_prp in 1:6){
  
  # I am setting a seed to make sure that my results for different perplexities are not sensitive to random factors
  set.seed(10)
  
  # Perplexity number
  perplexity_number<- perplexity_number_values[i_prp]
  
  # Calculating T-SNE's Y values which indicate where each cluster is on the 2-D map
  y_tsne <- Rtsne(iris[,1:4],pca=FALSE,perplexity=perplexity_number, max_iter=1000, check_duplicates = FALSE) # Run TSNE
  
  # Create a data frame using IRIS values and the T-SNE values
  df_to_gg<-as.data.frame(cbind(iris$Species, as.data.frame(y_tsne$Y)))
  
  # Specify Column names
  names(df_to_gg)<-c("Species", "Y.1", "Y.2")
  
  # Show the objects in the 2D T-SNE representation
  plt<-ggplot(df_to_gg, aes(x=Y.1, y=Y.2, color=Species))+geom_point()+theme_minimal() +
    labs(title=paste("T-SNE Visualization- Perplexity Number = ", perplexity_number )) +
    scale_color_manual(values = c("darkblue", "orange", "pink3"))
  
  # Save figure in the plot list object
  plot_list[[i_prp]]=plt
  
  
}
# Combine all the plot objects
plt_combined<-ggarrange(plotlist = plot_list[1:6], ncol = 2, nrow = 3, common.legend = T)

Here is the final result, which underlines the high sensitivity of the location of our clusters on the 2-D map to the choice of perplexity value. There are other hyperparameters that can be adjusted as well; for example, the maximum number of iterations might change the final output in some cases.

There are many other very good tutorials on T-SNE, such as here and here. Also, for example, this tutorial provides a nice and easy-to-follow Python code for T-SNE. Finally, there are some great YouTube videos that clearly explain the logic behind T-SNE and its algorithm (for example, here and here).

Beyond Hypervolume: Dynamic visualization of MOEA runtime

Multi-objective evolutionary algorithms have become an essential tool for decision support in water resources systems. A core challenge we face when applying them to real-world problems is that we don’t have analytic solutions to evaluate algorithmic performance, i.e. since we don’t know what solutions are possible beforehand, we don’t have a point of reference to assess how well our algorithm is performing. One way we can gain insight into algorithmic performance is by examining runtime dynamics. To aid our understanding of the dynamics of the Borg MOEA, I’ve written a small Python library to read Borg runtime files and build a dynamic dashboard that visualizes algorithmic progress.

The Borg MOEA produces runtime files which track algorithmic parameters and the evolving Pareto approximate set over an optimization run. Often we use these data to calculate performance metrics, which provide information on the relative convergence of an approximation set and the diversity of solutions within it (for background on metrics see Joe’s post here). Commonly, generational distance, epsilon indicator and hypervolume are used to examine quality of the approximation set. An animation of these metrics for the 3 objective DTLZ2 test problem is shown in Figure 1 below.

Figure 1: Runtime metrics for the DTLZ2 test problem. The x-axis is the number of function evaluations; the y-axis shows each individual metric.

While these metrics provide a helpful picture of general algorithmic performance, they don’t provide insight into how the individual objectives are evolving or Borg’s operator dynamics.

Figure 2 shows a diagnostic dashboard of the same 3 objective DTLZ2 test problem run. I used the Celluloid python package to animate the figures. I like this package because it allows you to fully control each frame of the animation.
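For anyone unfamiliar with Celluloid, here is a minimal sketch of the pattern I’m describing. The data below are made up for illustration and do not come from Borg runtime files, and the output filename is arbitrary.

import numpy as np
import matplotlib.pyplot as plt
from celluloid import Camera

rng = np.random.default_rng(0)

fig, ax = plt.subplots(figsize=(5, 4))
camera = Camera(fig)

# pretend each "snapshot" is the approximation set at a given NFE
for nfe in range(500, 5001, 500):
    n_solutions = 20 + nfe // 100
    objs = rng.random((n_solutions, 2)) / (nfe / 500)  # fake, shrinking two-objective front
    ax.scatter(objs[:, 0], objs[:, 1], c='steelblue', s=10)
    ax.text(0.7, 0.9, f'NFE = {nfe}', transform=ax.transAxes)
    camera.snap()  # capture everything drawn since the last snap as one frame

animation = camera.animate()
animation.save('runtime_dynamics.gif', writer='pillow')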

Figure 2: DTLZ2 runtime dynamics. The three objectives are shown in a scatter plot and a parallel axis plot. The third panel plots hypervolume over the optimization run, and the bottom panel shows Borg’s operator dynamics. (A higher resolution version of this file can be found here: https://www.dropbox.com/s/0nus5xhwqoqtghk/DTLZ2_runtime.gif?dl=0)

One thing we can learn from this dashboard is that though hypervolume starts to plateau around 3500 NFE, the algorithm is still working to find solutions that form an adequately diverse representation of the Pareto front. We can also observe that for this DTLZ2 run, the SPX and SBX operators were dominant. SBX is an operator tailored to problems with independent decision variables, like DTLZ2, so this result makes sense.

I’m planning on building off this dashboard to include a broader suite of visualization tools, including pairwise scatter plots and radial plots. The library can be found here: https://github.com/dgoldri25/runtimeDiagnostics

If anyone has suggestions or would like to contribute, I would love to hear from you!

Publication-Quality Illustrations in R

There are many ways to create high-quality figures for scientific publications, such as user-friendly, code-free tools like Adobe Illustrator. There are also some informative posts on the Water Programming blog that go over the basics of Illustrator and different tricks for making decent figures in that program (for example, here and here). One issue with Adobe Illustrator is that it’s not free. However, there are some open-source design alternatives to Illustrator, such as Inkscape. One broader problem with using these types of programs for figure design, though, is that figures can be challenging to produce and time-consuming to reproduce using code-free methods.

The ggpubr package in R provides a script-based way to create and combine high-quality figures. The pros of this method are that it is free, the figures are easily reproducible, and combining figures is not very time-consuming if you are comfortable coding in R.

A few drawbacks do exist, however. For one, there are several ways to make figures in R, but, when using ggpubr, all of the figures need to be in ggplot format, which limits the use of other figure formats in R. The other disadvantage is that ggpubr might not be as flexible as Illustrator in customizing the details of figures. In this blog post, I explain some of the main features of ggpubr and show some examples of how figures can be created using this method. Since ggpubr builds on ggplot2, it might be helpful to take a look at some ggplot2 basics first if you are not familiar with it (for example, here, here, and here).

In this blog post, I used R’s “airquality” dataset to showcase some of the applications of ggpubr. You can use the following code to load the airquality dataset.

library(MASS)
data(airquality)

# You can use the head() function to look at the first 6 columns of our dataset
head(airquality)

I used this dataset to create seven different ggplot2 figures and then I used ggpubr to merge them into a combined figure.

library(ggplot2)
# uncomment the next line if you haven't installed ggpubr yet 
# install.packages("ggpubr") 
library(ggpubr) 

p1<-ggplot(airquality, aes(x=as.factor(Month), y=Ozone, fill=as.factor(Month))) + geom_boxplot() + theme_bw() + 
  scale_fill_hue() + theme(legend.title = element_blank()) +
  labs(x="Month", y="Ozone concentration (PPM)", title = "a. Monthly Ozone concentration in New York")

p2<-ggplot(airquality, aes(x=as.factor(Month), y=Solar.R, fill=as.factor(Month))) + geom_boxplot() + theme_bw() + 
  scale_fill_hue() +theme(legend.title = element_blank()) +
  labs(x="Month", y="Solar Radiation (W/m2)", title = "b. Monthly Solar Radiation in New York")

p3<-ggplot(airquality, aes(x=as.factor(Month), y=Temp, fill=as.factor(Month))) + geom_boxplot() + theme_bw() + 
  scale_fill_hue() +theme(legend.title = element_blank()) +
  labs(x="Month", y="Temperature (Fahrenheit)", title = "c. Monthly Tmperature in New York")


p4<-ggplot(airquality, aes(x=Solar.R, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() +
scale_color_hue() +theme(legend.title = element_blank()) +
  labs(x="Solar Radiation (W/m2)", y="Ozone concentration (PPM)", title = "d. Ozone concentration vs. solar radiation")

p5<-ggplot(airquality, aes(x=Wind, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() + 
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Wind Speed (Miles per Hour)", y="Ozone concentration (PPM)", title = "e. Ozone concentration vs. wind speed")

p6<-ggplot(airquality, aes(x=Temp, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() + 
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Temperature (Fahrenheit)", y="Ozone concentration (PPM)", title = "f. Ozone concentration vs. temperature") 



p7<-ggplot(airquality, aes(x=Ozone)) + geom_density(fill="gold") + theme_bw() + theme(legend.title = element_blank()) +
  labs(x="Ozone concentration (PPM)", y="Density", title = "g. PDF of Ozone concentration in New York") +
  annotate("this is an example")

Here is an example of these plots (p1):

The main function of ggpubr (in my opinion, at least) is ggarrange. This function provides a nice and fast way to merge different graphs and tables, and it takes care of aligning the different panels of your plot so you don’t have to. If it’s applicable to your plot, you can also include a common legend.

# ggarrange is used to combine p1 to p7
combined_plot<-ggarrange(
  ggarrange(p1, p2, p3, p4, p5, p6, ncol=2, nrow=3,
            common.legend = TRUE), # TRUE: remove individual legends and add a common legend
  p7, ncol=1, nrow=2,
  heights = c(2, 0.7))

# ggsave can be used to save ggplot and ggpubr figures.
ggsave("YOUR FAVORITE DIRECTORY/CombinedPlot.png", combined_plot, width = 12, height = 14) 

Keep in mind that ggpubr provides some extra plotting functions of its own that are slightly different from the ggplot2 functions. For example, instead of geom_density, you can use ggdensity.
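
For instance, here is a minimal sketch of the p7 density plot rewritten with ggdensity(); the argument values are simply carried over from the ggplot2 version above.

# Sketch: ggpubr's ggdensity() used in place of ggplot2's geom_density()
library(ggpubr)
ggdensity(airquality, x = "Ozone", fill = "gold",
          xlab = "Ozone concentration (PPM)", ylab = "Density",
          title = "g. PDF of Ozone concentration in New York")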

Also, although ggplot2 provides several different plot themes, ggpubr allows users to apply its own publication-style theme. The following can be used to add that theme to a graph. Note that ggpubr’s own plotting functions use this theme as their default.
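
For example, theme_pubr() (ggpubr’s publication theme) can be added to the p1 boxplot defined earlier:

# Apply ggpubr's publication-style theme to an existing ggplot object
p1 + theme_pubr()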

Sometimes, we need to draw our readers’ attention to a specific part of the graph or add some extra information to the figure. You can use the annotate() function to do that.

p6<-ggplot(airquality, aes(x=Temp, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() + 
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Temperature (Fahrenheit)", y="Ozone concentration (PPM)", title = "f. Ozone concentration vs. temperature") 

p6<-p6+annotate( "rect", xmin = 80, xmax = 95, ymin = 50, ymax = 100, alpha = .3, color="black", fill="grey")+
  annotate("text", x=67, y=80, label="This is how you can add extra \n information to the plot", fontface=2)

R Shiny – Part 1

In this blog post, I explain a helpful R package that you can use to create interactive plots and web applications. It’s called shiny, and it can also be used to create websites with interactive user interfaces. The interactive pages can contain graphs, maps, tables, and other details you need on your site. In this post, I cover some basics of R shiny and some guidelines for creating interactive plots with it. In the next two posts, I’ll provide more details on how the package can be used to create simple web apps.

Shiny apps have two main components: ui.R and server.R. These can live in two separate R files or in a single one. The ui.R component creates the graphical user interface; I provide more details below. The server.R component is where calculations are done and graphs are generated. In other words, ui.R corresponds to front-end development and server.R to the back-end part of the code. Before I get into the details, I’ll note that there are several nice instruction manuals out there that you can refer to for more information. Probably the best is Shiny’s original manual, available here and here.
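
To make this structure concrete, here is a minimal, self-contained sketch of a single-file app (a toy example, not the download-log app developed below): the ui object lays out one slider and one plot, and the server function renders the plot reactively.

library(shiny)

# Front end: page layout and widgets
ui <- fluidPage(
  sliderInput(inputId = "n", label = "Sample size:", min = 10, max = 500, value = 100),
  plotOutput(outputId = "hist")
)

# Back end: calculations and rendering
server <- function(input, output){
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "A reactive histogram")
  })
}

# Combine the two components and launch the app
shinyApp(ui, server)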

To showcase some of the things you can create with R shiny, I downloaded the CRAN package download logs from November 1, 2019 here. This file has more than 3 million lines, each corresponding to one download of a CRAN package that day. I post-processed the data so that the final file lists the packages by the number of times they were downloaded that day. For example, this graph shows the first thirty packages. You can download the final post-processed file from here.
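
The post-processing itself boils down to counting how many times each package name appears in the raw log. A minimal sketch is shown below; the raw file name and its "package" column follow the standard CRAN download-log format, and the output columns match the pkg_by_freq.txt file read by the server code later in this post.

# Sketch: tally downloads per package from the raw CRAN log for 2019-11-01
raw_log <- read.csv("2019-11-01.csv.gz")               # one row per download event
pkg_by_freq <- as.data.frame(table(raw_log$package))   # downloads per package
names(pkg_by_freq) <- c("pkg_name", "pkg_count")
write.table(pkg_by_freq, "pkg_by_freq.txt", row.names = FALSE)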

User Interface

The user interface part of our script divides the page into several components and defines the characteristics of each. Let’s say we want a sidebar that provides options for the user. We can specify the layout of our web app, the size of each component, the details of any slider bars or dropdown menus we want, and so forth. The following is a user interface script I created, with comments explaining what each part of the code does.

library(shiny)
library(ggpubr)
library(ggplot2)

setwd("~/blog posts/R shiny/") 

# UI 

ui <- fluidPage(  
  # The fluidPage command translates R shiny code to HTML format 
  
  
  
  titlePanel("Most Downloaded R Packages"), # Here we can define the main title of our web app
  
  
  fluidRow( 
    
    # The fluidRow command gives us the flexibility to define 
    # our own layout; column and offset are two of the popular arguments
    # that can be used to generate that layout
    # There are also other predefined layouts available in 
    # R-Shiny such as sidebar layout, vertical layout and split layout
    
    
    # Here we define the width and position of our object in the web app
    column(10, offset = 4,
           
    # Users can have different types of widgets 
    # in R-Shiny, such as dropdown menus and date range inputs. 
    # One of the popular ways to set a number
    # is through a slider widget, the following two lines 
    # define two slider bars to set different values
    # The sliderInput command allows us to define the range 
    # and default value of our slider widget
           
           sliderInput(inputId = "NumLibs1",
                       label = "1- Number of libraries in PDF plot:", min = 30, max = 1000, value = 30),
           
           
           sliderInput(inputId = "NumLibs2",
                       label = "2- Number of libraries in bar plot:", min = 3, max = 30, value = 10)
    ),
    
    # The mainPanel command defines the size and layout 
    # structure of our main panel where we show our actual plots
    column(12, 
           mainPanel(
             
    # Creates a plot that will be placed in the main panel
             plotOutput(outputId = "plot", width = "150%", height = "450px")
             
           )
    )
    
  ) 
)

Server

The server handles the calculations that need to be done behind the scenes and then generates the plots, defines the properties of tables, and so on. The following is the server code I developed, with comments on all the components of my plot.


# Server 

server<-function(input, output){ 
  # Here we define a server function that does 
  # behind the scene calculations and generates our plot
  
  input_data<- read.table("pkg_by_freq.txt", header = T)
  # Reads the input file from our working directory
  
  reorderd_freq<-input_data[order(input_data$pkg_count, decreasing = T),]
  # This command sorts our input data based on the number of downloads 
  # on that particular day (11/01/2019)
  
  output$plot <- renderPlot({
    # This part renders our output plot
    
    max_numb<-input$NumLibs1
    num_pop_libs<-input$NumLibs2
    # Here our code receives the numbers that will 
    # be set by users through the two slider widgets
    
    p1<-ggplot(reorderd_freq[6:max_numb,], aes(x=pkg_count)) +geom_density(fill="lightblue")+
      labs(title = "1- PDF of Numuer of Downloads of R Packages",  x="Number of Downloads", y="") +
      theme_bw() +theme(axis.text.x  = element_text(size = rel(1.8)) )
    
    reorderd_freq$pkg_name <- reorder(reorderd_freq$pkg_name, reorderd_freq$pkg_count)
    p2<-ggplot(reorderd_freq[1:num_pop_libs,])+ geom_bar(aes(x=pkg_name, y=pkg_count), stat="identity", fill="purple1") +
      labs(title = "2- Most Downloaded R Packages", y="Number of Downloads", x="Package Name") +
      coord_flip() +theme_bw() +theme(axis.text.y  = element_text(size = rel(1.4)) )
    # Now we use the ggplot2 package to generate two figures: a PDF plot (geom_density) and a bar plot (geom_bar)
    # Note that we use the slider input to change the characteristics of our plot
    
    ggarrange(p1, p2)
    # Finally we combine our two plots using the ggarrange function from the ggpubr package
  })
  
}

One thing to keep in mind is that if you have two separate files for ui.R and server.R, you always have to save them in the same folder.

When you’re done with your ui.R and server.R, you can either use RStudio’s Run App button or the runApp() command to combine all the components and create your final output.

# This command connects the UI code with the server code and 
# generates our final output 
shinyApp(ui, server)
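
If ui.R and server.R are saved as two separate files, you can instead point runApp() at the folder that contains them (the folder name below is just a placeholder):

# Launch the app stored in a folder containing ui.R and server.R
library(shiny)
runApp("YOUR APP FOLDER")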

And here is the final interactive figure that you will produce:

Also, I created this simple interactive webpage to show the results of my R code example (download the R code from here). The left graph shows the probability density function of the download counts of the R libraries downloaded on November 1, 2019. Because more than 15,000 distinct R libraries were downloaded that day, the graph lets you focus on the distribution of the libraries with the highest download rates, and the first slider lets you change the number of libraries included in the plot. The right plot shows the most downloaded libraries, and the second slider lets you include an arbitrary number of the most popular libraries in the figure.

In the next couple of posts, I will give more examples of R shiny’s capabilities and ways to make websites using it.

ggplot (Part 2)

This is the second part of the ggplot introduction. In this blog post, I am going to go over how you can make a decent density plot in ggplot. Density plots are essentially smoothed versions of histograms: they show the distribution of your data by estimating its probability density with a kernel density estimation procedure. For example, when we have a regional data set, it is important to look at the distribution of our data across the region instead of just considering the regional average. In our example (download the data set from here), we are going to visualize the regional distribution of simulated average winter wheat yield over the 30 years from 1981 to 2010. The “ID” column in the data set represents one grid cell in the region, and there are 1,812 grid cells in total. For each grid cell, the average historical yield and the standard deviation of yield over the 30 years are given. First, we need to load the library; then, in the general code structure of “ggplot(dataframe, aes(x, y, fill)),” we map the x-axis to the yield column. The y-axis is calculated and added through “geom_density()”. Then, we can add a color, title, and labels and customize the background.

example1<- read.csv("(your directory)/example_1.csv")
library(ggplot2)   
ggplot(example1, aes(x=example1$period_ave_Y))+ 
geom_density(fill="blue")+
 theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"))+
  labs(title = paste("Density Plot of Regional Average Historical Yield (30 years)"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")

Now, we want to know how the standard deviation of the 30-year average yield for all the grid cells in the region can be mapped onto this density plot.

We can add another column (name it “SD_class”) to the data set and classify the standard deviations. The maximum and minimum standard deviations among all the grid cells are the following.

max(example1$period_sd_Y)
# [1] 3.605131
min(example1$period_sd_Y)
# [1] 0.8645882

For example, I want to see this plot categorized by standard deviations between 0.8 and 1.5, 1.5 and 2.5, and 2.5 up to the maximum value. Here, I write a simple loop that goes over each row (corresponding to one grid cell in the region), checks its standard deviation value, and fills the newly added column (“SD_class”) with the class specified in the “if” statements.

example1$SD_class<- NA
for (i in 1:nrow(example1)){
  if(example1[i,2]>0.8 && example1[i,2]<= 1.5) {example1[i,4]<- c("0.8-1.5")}
  if(example1[i,2]>1.5 && example1[i,2]<= 2.5) {example1[i,4]<- c("1.5-2.5")}
  if(example1[i,2]>2.5) {example1[i,4]<- c("2.5-3.6")}
}
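
As a side note, the same classification can be done without a loop. Here is an equivalent sketch using base R’s cut() function; the break points are chosen to match the classes above, with the upper break set just above the maximum standard deviation.

# Vectorized alternative: right-closed intervals (0.8,1.5], (1.5,2.5], (2.5,3.61]
example1$SD_class <- cut(example1$period_sd_Y,
                         breaks = c(0.8, 1.5, 2.5, 3.61),
                         labels = c("0.8-1.5", "1.5-2.5", "2.5-3.6"))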

Now, we just need to add “fill” to the aesthetics section of the code, specify the column with the classifications, and add “alpha” to make the colors transparent so that we can see the shapes of the distributions and whether they overlap.

ggplot(example1, aes(x=example1$period_ave_Y,fill =SD_class))+
  geom_density(alpha=0.4)+
  theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
        axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold"),plot.title = element_text(size = 20, face = "bold"),
        legend.text=element_text(size=13),legend.title=element_text(size=14))+
  labs(title = paste("Density Plot of Regional Average Historical Yield (30 years)"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")

We can also use the “facet_grid()” option, as in Part 1, and specify the classification column to show each of these classes in a separate panel.

ggplot(example1, aes(x=example1$period_ave_Y,fill =SD_class))+
  geom_density(alpha=0.4)+facet_grid(example1$SD_class ~ .)+
  theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
        axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold"),plot.title = element_text(size = 20, face = "bold"),
        legend.text=element_text(size=13),legend.title=element_text(size=14))+
  labs(title = paste("Density Plot of Regional Average Historical Yield (30 years)"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")

Another interesting thing to explore is where different percentiles of our data set fall on the density plot. For this, we need to obtain the density values (the y-axis of the plot) for the percentiles that we are interested in, for example 10%, 25%, 50%, 75%, and 90%. We also need to find the actual yield value corresponding to each percentile:

quantiles_yield <- quantile(example1$period_ave_Y, prob=c(0.1, 0.25, 0.5, 0.75, 0.9))
#     10%      25%      50%      75%      90% 
#  4.229513 5.055070 5.582192 5.939071 6.186014

Now, we are going to estimate the density value for each of the yields at the 10th, 25th, 50th, 75th, and 90th percentiles.

df <- approxfun(density(example1$period_ave_Y))

The above function gives us the approximate density value for any point (yield) we are interested in; in our case, the yields at the percentiles above:

df(c(quantiles_yield))
#[1] 0.1176976 0.3267841 0.6129621 0.6615790 0.4345247

Now, we can add several vertical segments to the density plot that show where each percentile is located on this graph. The limits of these segments on the y-axis are based on the density values for each percentile that we got above. Also, note that I used those values to adjust the positions of the labels for the segments.

ggplot()+ 
      geom_density(aes(x=example1$period_ave_Y),fill="blue",alpha=0.4) + 
    geom_segment(aes(x=quantiles_yield, y=0, xend =quantiles_yield,
                     yend= df(c(quantiles_yield))),size=1,colour =c("red","green","blue","purple","orange"),linetype='dashed')+
      theme(panel.background = element_rect(fill = 'white'),axis.line = element_line(size = 0.5, linetype = "solid",colour = "black"),
            axis.text=element_text(size=16),axis.title=element_text(size=16,face="bold"),plot.title = element_text(size = 20, face = "bold"),
            legend.text=element_text(size=13),legend.title=element_text(size=14))+
      labs(title = paste("Density Plot of Regional Average Historical Yield (30 years) and Percentiles"),x = "Winter Wheat Yield (tonnes/ha)", y = "Density", color="black")+
    annotate("text", x=4.229513, y=0.15, label=paste("10%"),size=5)+
    annotate("text", x=5.055070, y=0.36, label=paste("25%"),size=5)+
    annotate("text", x=5.582192, y=0.65, label=paste("50%"),size=5)+
    annotate("text", x=5.939071, y=0.7, label=paste("75%"),size=5)+
    annotate("text", x=6.186014, y=0.47, label=paste("90%"),size=5) 

Hydro Packages in R: HydroGOF

In this blog post, I will go over a very helpful hydrologic package in R that can make your hydro-life much easier. The package is called HydroGOF, and it can be used to make different types of plots, including mean monthly, annual, and seasonal plots for streamflow, rainfall, temperature, and other environmental variables. You can also use HydroGOF to compare your simulated flow to observed flow and calculate various performance metrics such as Nash-Sutcliffe efficiency. Indeed, the GOF part of HydroGOF stands for “goodness of fit.” More information about HydroGOF and its applications for hydrologists can be found here. Also, you can find a more comprehensive list of hydrologic R packages from this water programming blog post.

1- Library and Data Preparation

HydroGOF accepts R data frames and R zoo objects. If you are not familiar with R’s zoo objects, you can find more information here. In this tutorial, I use HydroGOF’s test-case streamflow data, which come in zoo format. Here is how you can install and load zoo and HydroGOF.

install.packages("zoo")
library(zoo)
install.packages("hydroGOF ")
library(hydroGOF)

After you load the package, you need to activate your streamflow data. This is how you do so.

# Activate HydroGOF's streamflow data
data(EgaEnEstellaQts) 

Now, let’s take a look at the first few lines of our streamflow data.

head(EgaEnEstellaQts)

Note that the first column is the date and the second column is the streamflow data in m3/s. Also, keep in mind that you can use zoo to manipulate the temporal regime of your data; for example, you can convert your daily data to monthly or annual values.
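
For instance, here is a quick sketch of aggregating the daily series to monthly means with zoo’s aggregate() and as.yearmon():

# Aggregate the daily zoo series to mean monthly streamflow
monthly_flow <- aggregate(EgaEnEstellaQts, as.yearmon, mean)
head(monthly_flow)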

2- Streamflow Plots

Now, let’s use HydroGOF to visualize our observed streamflow data. You can use the following commands to generate some nice figures that will help you explore the overall regime of the streamflow data.

obs<-EgaEnEstellaQts

hydroplot(x = obs,var.type = "FLOW", var.unit = "m3/s", ptype = "ts+boxplot", FUN=mean)
# Note that "hydroplot" command is very flexible and there are many
# options that users can add or remove

3- Generate Simulated Dataset

For this tutorial, I have written the following function, which uses the observed streamflow to generate a very simple estimate of daily streamflow. Basically, the function takes the observed data and calculates the average flow for each day of the year. Then, the function repeats this one-year series as many times as needed (ten times in our case) to match the length of the observed flow record.

simple_predictor<-function(obs){
  # This function generates a very simple prediction of streamflow 
  # based on observed streamflow inputs

  # Day of year (1-366) for each observation in the input series
  DOY<-data.frame(matrix(ncol =1, nrow = length(obs))) 
  
  for (i_day in 1:length(obs)){
    DOY[i_day,]=as.numeric(strftime(index(obs[i_day]), format = "%j"))
  }
 
  # Compute the average observed streamflow for each day of the year
  m_inflow_obs<-as.numeric(aggregate(obs[,1], by=list(DOY[,1]), mean)) 
  
  simplest_predictor<-data.frame(matrix(ncol=3, nrow =length(obs )))
  names(simplest_predictor)<-c("Date", "Observed", "Predicted")
  simplest_predictor[,1]=index(obs)
  
  simplest_predictor[,2]=coredata(obs)
  
  for (i_d in 1:length(obs)){
   # Repeat the day-of-year average flow over the entire simulation period 
    simplest_predictor[i_d,3]=m_inflow_obs[DOY[i_d,1]]
  }
  # Convert to zoo format
  dt_z<-read.zoo(simplest_predictor, format="%Y-%m-%d") 
  
  return(dt_z)
}

After loading the function, you can use the following to create your combined observed and simulated data frame.

# Here we just call the function
obs_sim<-simple_predictor(obs)

4- Hydrologic Error Metrics Using HydroGOF

There are twenty error metrics in HydroGOF—for example, mean error (ME), mean absolute error (MAE), root mean square error (RMSE), normalized root mean square error (NRMSE), percent bias (PBIAS), ratio of standard deviations (Rsd), and Nash-Sutcliffe efficiency (NSE). You can find more information about them here. You can use the following commands to calculate specific error metrics.

# Nash-Sutcliffe Efficiency
NSE(sim=obs_sim$Predicted, obs=obs_sim$Observed) 
# Root Mean Squared Error
rmse(sim=obs_sim$Predicted, obs=obs_sim$Observed) 
# Mean Error
me(sim=obs_sim$Predicted, obs=obs_sim$Observed) 

You can also use this cool command to see all of the available error metrics in HydroGOF.

gof(sim=obs_sim$Predicted, obs=obs_sim$Observed)

5- Visualizing Observed and Simulated Streamflow

Here is the most interesting part: you can plot observed and simulated on the same graph and add all error metrics to the plot.

ggof(sim=obs_sim$Predicted, obs=obs_sim$Observed, ftype="dm", gofs = c("NSE", "rNSE", "ME", "MSE",  "d", "RMSE", "PBIAS"), FUN=mean)
# You should explore different options that you can add to this figure. 
# For example you can choose which error metrics you want to display, etc