Our recently published eBook, Addressing Uncertainty in Multisector Dynamics Research, provides several interactive tutorials for hands-on training in model diagnostics and uncertainty characterization. The purpose of this post is to expand upon these trainings by providing a tutorial demonstrating gradient boosted trees for scenario discovery. I’ll first provide some brief background on scenario discovery and gradient boosted trees, then demonstrate a Python implementation on a water supply planning problem. All code here is written in Python, but the workflow is model-agnostic and can be paired with simulation models in any language. I’ve included my code within the text below, but all code and data for this post can also be found in this git repository.
Scenario discovery and gradient boosted trees
In water resources planning and management, decision makers are often faced with uncertainty about how their system will change in the future. Traditionally, planners have confronted this uncertainty by developing prespecified narrative scenarios, which reduce the multitude of possible future conditions into a small subset of important future states of the world (a prominent example is the ‘scenario matrix framework’ used to evaluate climate change (O’Neill et al., 2014)). While this approach has intuitive appeal, it may increase system vulnerability if future conditions do not evolve as decision makers expect (for a detailed critique of scenario-based planning see Reed et al., 2022). This vulnerability is especially apparent for systems facing deep uncertainty, where decision makers do not know or cannot agree upon the probability density functions of key system inputs (Kwakkel et al., 2016).
Scenario discovery (Groves and Lempert, 2007) is an exploratory modeling centered approach that seeks to discover consequential future scenarios using computational experiments rather than relying on prespecified information. To perform scenario discovery, decision makers first identify a set of relevant uncertainties and their plausible ranges. Next, an ensemble of these uncertainties is developed by sampling across parameter ranges. Candidate policies are then evaluated across this ensemble and machine learning or data mining algorithms are used to examine which combinations of uncertainties generate vulnerability in the system. These combinations can then be used to develop narrative scenarios to inform implementation and monitoring efforts or new policy development.
A core element of the scenario discovery process is the algorithm used to classify future states of the world. Popular algorithms include the Patient Rule Induction Method (PRIM), Classification and Regression Trees (CART) and logistic regression. Recently, gradient boosted trees have been applied as an alternative classification algorithm. Gradient boosted trees have advantages over other scenario discovery algorithms because they can easily capture nonlinear and non-differentiable boundaries in the uncertainty space (which are particularly prevalent in water supply planning problems that have discrete capacity expansion options), are highly resistant to overfitting and provide a clear means of ranking the importance of uncertain factors (Trindade et al., 2020). For a comprehensive overview of gradient boosted trees, see Bernardo’s post here.
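To illustrate the point about non-differentiable boundaries, here is a minimal sketch on synthetic data (not the Sedento Valley results). It assumes a made-up rule in which failures occur only in one corner of a two-factor space, and compares scikit-learn's logistic regression with gradient boosted trees on held-out samples. The variable names, the toy rule and the sample size are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic two-factor sample (hypothetical ranges, not the real DU factors)
rng = np.random.default_rng(0)
X_toy = rng.uniform([0.5, 0.9], [2.0, 1.1], size=(600, 2))

# assumed toy rule: failure only when factor 0 is high AND factor 1 is low,
# which creates a box-shaped (non-differentiable) failure region
y_toy = ~((X_toy[:, 0] > 1.5) & (X_toy[:, 1] < 1.0))

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0)

# a single linear boundary versus an ensemble of shallow trees
logit_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
gbt_acc = GradientBoostingClassifier(random_state=0).fit(
    X_train, y_train).score(X_test, y_test)

print(f"logistic regression accuracy: {logit_acc:.3f}")
print(f"gradient boosted trees accuracy: {gbt_acc:.3f}")
```

On a corner-shaped region like this, the tree ensemble should classify held-out samples more accurately than the linear model, though the size of the gap depends on the toy rule chosen.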
Test case: the Sedento Valley
To demonstrate gradient boosted trees for scenario discovery we’ll use the Sedento Valley water supply planning test case (Trindade et al., 2020). In the Sedento Valley, three water utilities seek to discover cooperative water supply management and infrastructure investment portfolios to meet several conflicting objectives in a system facing deep uncertainty. In this post, we’ll investigate how these deep uncertainties (which include demand growth, the efficacy of water use restrictions, financial variables and parameters governing infrastructure permitting and construction time) impact a utility’s ability to maintain three performance criteria: keeping reliability > 98%, restriction frequency < 20% and worst-case cost less than 10% of annual revenue. For simplicity, we’ll focus on one regional water utility named Watertown.
Step 1: create a sample of deeply uncertain states of the world
To start the scenario discovery process, we generate an ensemble of deep uncertainties that represent future states of the world (SOWs). Here, we’ll use Latin Hypercube Sampling with an implementation I found in the Surrogate Modeling Toolbox.
import numpy as np
from smt.sampling_methods import LHS

'''
This script will generate 1000 Latin Hypercube Samples (LHS)
of deeply uncertain system parameters for the Sedento Valley
'''

# create an array storing the ranges of deeply uncertain parameters
DU_factor_limits = np.array([
    [0.9, 1.1],     # Watertown restriction efficacy
    [0.9, 1.1],     # Dryville restriction efficacy
    [0.9, 1.1],     # Fallsland restriction efficacy
    [0.5, 2.0],     # Demand growth rate multiplier
    [1.0, 1.2],     # Bond term
    [0.6, 1.0],     # Bond interest rate
    [0.6, 1.4],     # Discount rate
    [0.75, 1.5],    # New River Reservoir permitting time
    [1.0, 1.2],     # New River Reservoir construction time
    [0.75, 1.5],    # College Rock Reservoir (low) permitting time
    [1.0, 1.2],     # College Rock Reservoir (low) construction time
    [0.75, 1.5],    # College Rock Reservoir (high) permitting time
    [1.0, 1.2],     # College Rock Reservoir (high) construction time
    [0.75, 1.5],    # Water Reuse permitting time
    [1.0, 1.2],     # Water Reuse construction time
    [0.8, 1.2],     # Inflow amplitude
    [0.2, 0.5],     # Inflow frequency
    [-1.57, 1.57]]) # Inflow phase

# Use the smt package to set up the LHS sampling
sampling = LHS(xlimits=DU_factor_limits)

# We will create 1000 samples
num = 1000

# Create the actual sample
x = sampling(num)

# save to a csv file
np.savetxt('DU_factors.csv', x, delimiter=',')
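For readers unfamiliar with Latin Hypercube Sampling, the following NumPy-only sketch shows the basic idea: each factor's range is split into as many equal-width strata as there are samples, and every stratum is sampled exactly once. This is a simplified illustration and not the smt implementation used above; the function name `latin_hypercube` and the two toy ranges are assumptions.

```python
import numpy as np

def latin_hypercube(limits, n_samples, seed=None):
    """Draw n_samples points with one point per equal-width stratum in each dimension."""
    rng = np.random.default_rng(seed)
    limits = np.asarray(limits, dtype=float)
    n_dims = limits.shape[0]

    # one uniform draw inside each of the n_samples strata of [0, 1), per dimension
    unit = (np.arange(n_samples)[:, None]
            + rng.random((n_samples, n_dims))) / n_samples

    # shuffle the strata independently in each dimension
    for j in range(n_dims):
        unit[:, j] = rng.permutation(unit[:, j])

    # rescale from [0, 1) to the requested ranges
    return limits[:, 0] + unit * (limits[:, 1] - limits[:, 0])

# two illustrative factor ranges (hypothetical, chosen to keep the example small)
toy_limits = [[0.5, 2.0], [0.9, 1.1]]
toy_sample = latin_hypercube(toy_limits, n_samples=10, seed=0)
print(toy_sample.shape)  # (10, 2)
```

Compared with plain random sampling, this stratification guarantees that the full range of every factor is covered even with a modest number of samples.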
Step 2: Evaluate performance across SOWs
Next, we’ll evaluate the performance of our policy across the LHS sample of DU factors. For the Sedento Valley test case, we use WaterPaths, an open-source simulation system for integrated water supply portfolio management and infrastructure investment planning (for more see Trindade et al., 2020). This step is not included in the git repository as it requires high-performance computing for this system, but results can be found in the “Model_output.csv” file. For simulation details, see Gold et al., 2022.
Step 3: Convert model output into a boolean array for classification
To perform classification, we need to convert the results of our simulations to a binary array classifying each SOW as a “success” or “failure” based on whether the policy met the performance criteria under the SOW. First, we define a small function to determine if an SOW meets a set of criteria, then we apply this function to our results. We also load the DU factor LHS sample.
import numpy as np

# First, define a function to check whether performance criteria are met
def check_criteria(objectives, crit_objs, crit_vals):
    """
    Determines if an objective meets a given set of criteria for a set of SOWs

    inputs:
        objectives: np array of all objectives across a set of SOWs
        crit_objs: a list containing the column index of the objective in question
        crit_vals: an array containing [min, max] of the values

    returns:
        meets_criteria: a numpy array of booleans marking the SOWs that meet
                        both min and max criteria
    """
    # check max and min criteria for each objective
    meet_low = objectives[:, crit_objs] >= crit_vals[0]
    meet_high = objectives[:, crit_objs] <= crit_vals[1]

    # check if max and min criteria are met at the same time
    meets_criteria = np.hstack((meet_low, meet_high)).all(axis=1)

    return meets_criteria


##### Load data and pre-process #####

# ASSUMPTION: placeholder column indices for Watertown's reliability,
# restriction frequency and worst-case cost objectives; check them against
# the header row of Model_output.csv before use
REL_COL, RF_COL, WCC_COL = 0, 1, 4

# load objectives and create input array of boolean values for SD input
Reeval_objectives = np.loadtxt('Model_output.csv', skiprows=1, delimiter=',')
REL = check_criteria(Reeval_objectives, [REL_COL], [.979, 1])
RF = check_criteria(Reeval_objectives, [RF_COL], [0, 0.10])
WCC = check_criteria(Reeval_objectives, [WCC_COL], [0, 0.10])

# a SOW is a success only if all three criteria are met
SD_input = np.vstack((REL, RF, WCC)).all(axis=0)

# load DU factors (skiprows=1 assumes a header row; the file written in
# Step 1 has no header, so use skiprows=0 if you load that file directly)
DU_factors = np.loadtxt('DU_factors.csv', skiprows=1, delimiter=',')
DU_names = ['Watertown Rest. Eff.', 'Dryville Rest. Eff.',
            'Fallsland Rest. Eff.', 'Demand Growth Rate', 'Bond Term',
            'Bond Interest', 'Discount Rate', 'NRR Perm', 'NRR Const',
            'CRR L Perm', 'CRR L Const.', 'CRR H Perm.', 'CRR H Const.',
            'WR1 Perm.', 'WR1 Const.', 'Inflows A', 'Inflows m', 'Inflows p']
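To make the success/failure logic concrete, here is a small worked example on a hypothetical objectives array with four SOWs. The objective values, the column order and the bounds are invented for illustration (the bounds follow the criteria stated in the text), and the check is written in vectorized form rather than with the function above.

```python
import numpy as np

# hypothetical objectives for four SOWs
# columns: reliability, restriction frequency, worst-case cost
toy_objectives = np.array([
    [0.99, 0.05, 0.02],   # meets all three criteria
    [0.95, 0.05, 0.02],   # reliability too low
    [0.99, 0.30, 0.02],   # restrictions too frequent
    [0.99, 0.05, 0.25],   # worst-case cost too high
])

# [min, max] bounds for each column (illustrative thresholds)
toy_bounds = np.array([[0.98, 1.0], [0.0, 0.2], [0.0, 0.1]])

# a SOW is a success only if every objective sits inside its bounds
within_bounds = ((toy_objectives >= toy_bounds[:, 0])
                 & (toy_objectives <= toy_bounds[:, 1]))
toy_success = within_bounds.all(axis=1)

print(toy_success.tolist())  # [True, False, False, False]
```

Only the first SOW counts as a success, because a single violated criterion is enough to label a SOW as a failure.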
Step 4: Fit a boosted trees classifier
After we’ve formatted the data, we’re ready to perform boosted trees classification. There are several packages for boosted trees in Python; here we’ll use the implementation from scikit-learn. We’ll use an ensemble of 200 trees with depth 3 and a learning rate of 0.1. These parameters need to be tuned for the individual problem. I found this nice post that goes into detail on parameter tuning.
##### Boosted Tree Classification #####
from sklearn.ensemble import GradientBoostingClassifier

# create a gradient boosted classifier object
gbc = GradientBoostingClassifier(n_estimators=200,
                                 learning_rate=0.1,
                                 max_depth=3)

# fit the classifier
gbc.fit(DU_factors, SD_input)
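Since the number of trees, tree depth and learning rate are problem specific, one possible way to tune them is a cross-validated grid search. The sketch below uses scikit-learn's GridSearchCV on synthetic data so that it runs without the model output files; the toy rule, the small grid and the variable names are assumptions, and for the real problem you would pass DU_factors and SD_input instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the DU factors and success labels (assumed toy rule)
rng = np.random.default_rng(0)
X_toy = rng.uniform([0.5, 0.9], [2.0, 1.1], size=(300, 2))
y_toy = X_toy[:, 0] < 1.7

# a deliberately small grid to keep the example fast
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1, random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross validation
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
```

Cross-validated accuracy is only one possible score. If failures are rare in your ensemble, a metric that accounts for class imbalance may be a better choice.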
Step 5: Examine which DU factors have the most impact on performance criteria
Now we’re ready to examine the results of our classification. First, we’ll examine how important each DU factor is to the classification results generated by boosted trees. To rank the importance of each DU factor, we examine the percentage decrease in impurity of the ensemble of trees that is associated with each factor. In plain English, this is a measure of how helpful each DU factor is to correctly classifying SOWs. This information is generated during the fit of the classifier above and is easily accessible as an attribute of our scikit-learn classifier.
For our example, one deep uncertainty, demand growth rate, clearly stands out as the most influential, as shown in the figure below. A second, the restriction efficacy for Watertown (the utility we’re focusing on), also stands out with a relatively high level of importance. All other DU factors have little impact on the classification in this case.
##### Factor Ranking #####
from copy import deepcopy
import matplotlib.pyplot as plt

# Extract the feature importances
feature_importances = deepcopy(gbc.feature_importances_)

# rank the feature importances and plot
importances_sorted_idx = np.argsort(feature_importances)
sorted_names = [DU_names[i] for i in importances_sorted_idx]

fig = plt.figure(figsize=(8, 8))
ax = fig.gca()
ax.barh(np.arange(len(feature_importances)),
        feature_importances[importances_sorted_idx])
ax.set_yticks(np.arange(len(feature_importances)))
ax.set_yticklabels(sorted_names)
ax.set_xlim([0, 1])
ax.set_xlabel('Feature Importance')
plt.tight_layout()
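For intuition about what a "decrease in impurity" means, here is a small worked example in plain Python. It uses Gini impurity on six made-up SOWs purely for illustration; scikit-learn's gradient boosting fits regression trees with a squared-error-type criterion, so this is not the exact quantity behind feature_importances_. The helper names and toy values are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of boolean success/failure labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def impurity_decrease(values, labels, threshold):
    """Impurity removed by splitting the SOWs at values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# six hypothetical SOWs: failures occur only at high demand growth
toy_labels = [True, True, True, False, False, True]
toy_demand = [0.6, 0.9, 1.2, 1.8, 1.9, 1.5]
toy_bond = [1.0, 1.2, 1.1, 1.0, 1.2, 1.1]

demand_gain = impurity_decrease(toy_demand, toy_labels, threshold=1.7)
bond_gain = impurity_decrease(toy_bond, toy_labels, threshold=1.05)

print(round(demand_gain, 3))  # 0.444, the split separates the classes perfectly
print(round(bond_gain, 3))    # 0.028, the split barely helps
```

A factor that repeatedly produces large decreases like the demand split accumulates a high importance score, while a factor like the bond term contributes almost nothing.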
Step 6: Create factor maps
Finally, we visualize the results of our classification through factor mapping. In the plot below, we show the uncertainty space projected onto the two most influential factors, demand growth rate and restriction efficacy. Each point represents a sampled SOW: red points represent SOWs that resulted in failure, while white points represent SOWs that resulted in success. The color in the background shows the predicted regions of success and failure from the boosted trees classification.
Here we observe that high levels of demand growth are the primary source of vulnerability for the water utility. When restriction efficacy is lower than our estimate (multiplier < 1), the utility faces failures at demand growth levels of about 1.7 times the estimated values. When restriction effectiveness is at or above estimates, the acceptable scaling of demand growth rises to about 1.8.
Taken as a whole, these results provide valuable insights for decision makers. From our original 18 deep uncertainties, we have discovered that two are critical for the success of our water supply management policy. Further, we have defined thresholds within the uncertainty space that define scenarios that lead to failure. We can use this information to inform monitoring efforts for the water supply policy, or to inform a new problem formulation that tailors actions to mitigate these vulnerabilities.
##### Factor Mapping #####

# Select the top two factors discovered above
selected_factors = DU_factors[:, [3, 0]]

# Fit a classifier using only these two factors
gbc_2_factors = GradientBoostingClassifier(n_estimators=200,
                                           learning_rate=0.1,
                                           max_depth=3)
gbc_2_factors.fit(selected_factors, SD_input)

# plot prediction contours
x_data = selected_factors[:, 0]
y_data = selected_factors[:, 1]
x_min, x_max = (x_data.min(), x_data.max())
y_min, y_max = (y_data.min(), y_data.max())

# create a grid to make predictions on
xx, yy = np.meshgrid(np.arange(x_min, x_max * 1.001, (x_max - x_min) / 100),
                     np.arange(y_min, y_max * 1.001, (y_max - y_min) / 100))
dummy_points = list(zip(xx.ravel(), yy.ravel()))

# predicted probability of success at each grid point
z = gbc_2_factors.predict_proba(dummy_points)[:, 1]
z[z < 0] = 0.
z = z.reshape(xx.shape)

# plot the factor map
fig = plt.figure(figsize=(10, 8))
ax = fig.gca()
ax.contourf(xx, yy, z, [0, 0.5, 1.], cmap='RdBu', alpha=.6, vmin=0.0, vmax=1)
ax.scatter(selected_factors[:, 0], selected_factors[:, 1],
           c=SD_input, cmap='Reds_r', edgecolor='grey',
           alpha=.6, s=100, linewidth=.5)
ax.set_xlim([.5, 2])
ax.set_ylim([.9, 1.1])
ax.set_xlabel('Demand Growth Multiplier')
ax.set_ylabel('Restriction Eff. Multiplier')
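The failure thresholds discussed above were read off the factor map by eye. As a rough sketch of how they could also be extracted numerically, the example below scans a grid of demand growth multipliers at a fixed restriction efficacy and reports the first value where the predicted probability of success drops below 0.5. It runs on synthetic data whose failure rule (thresholds of 1.7 and 1.8) is an assumption built to mimic the pattern described in the text; the helper `demand_threshold` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# synthetic two-factor sample: demand growth and restriction efficacy multipliers
rng = np.random.default_rng(1)
X_toy = rng.uniform([0.5, 0.9], [2.0, 1.1], size=(1000, 2))

# assumed toy rule: tolerable demand growth is 1.7 when restriction efficacy
# is below its estimate (multiplier < 1) and 1.8 otherwise
limit = np.where(X_toy[:, 1] < 1.0, 1.7, 1.8)
y_toy = X_toy[:, 0] < limit

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0)
clf.fit(X_toy, y_toy)

def demand_threshold(model, restriction_eff, grid):
    """Smallest demand multiplier on the grid with predicted P(success) < 0.5."""
    points = np.column_stack((grid, np.full_like(grid, restriction_eff)))
    p_success = model.predict_proba(points)[:, 1]
    failing = grid[p_success < 0.5]
    return failing.min() if failing.size else None

demand_grid = np.linspace(0.5, 2.0, 301)
low_eff_threshold = demand_threshold(clf, 0.95, demand_grid)
high_eff_threshold = demand_threshold(clf, 1.05, demand_grid)

print(low_eff_threshold, high_eff_threshold)
```

On this synthetic data the two values should land close to the 1.7 and 1.8 built into the toy rule. With real model output the boundary may be noisier, so it is worth checking the result against the factor map.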
Gold, D. F., Reed, P. M., Gorelick, D. E., & Characklis, G. W. (2022). Power and Pathways: Exploring Robustness, Cooperative Stability, and Power Relationships in Regional Infrastructure Investment and Water Supply Management Portfolio Pathways. Earth’s Future, 10(2), e2021EF002472.
Groves, D. G., & Lempert, R. J. (2007). A new analytic method for finding policy-relevant scenarios. Global Environmental Change, 17(1), 73-85.
Kwakkel, J. H., Walker, W. E., & Haasnoot, M. (2016). Coping with the wickedness of public policy problems: approaches for decision making under deep uncertainty. Journal of Water Resources Planning and Management, 142(3), 01816001.
O’Neill, B. C., Kriegler, E., Riahi, K., Ebi, K. L., Hallegatte, S., Carter, T. R., … & van Vuuren, D. P. (2014). A new scenario framework for climate change research: the concept of shared socioeconomic pathways. Climatic Change, 122(3), 387-400.
Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B., & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623
Trindade, B. C., Gold, D. F., Reed, P. M., Zeff, H. B., & Characklis, G. W. (2020). Water pathways: An open source stochastic simulation system for integrated water supply portfolio management and infrastructure investment planning. Environmental Modelling & Software, 132, 104772.