Optimizing reservoir operations with Reinforcement Learning, Pt 1: Creating a custom Gymnasium environment

As discussed by Lillian Lau in a recent blog post, Reinforcement Learning (RL) methods like Proximal Policy Optimization (PPO) have become increasingly popular for optimizing adaptive policies to solve a range of challenging real-world Markov Decision Process (MDP) problems. The original OpenAI Gym and its community-maintained successor Gymnasium have aided in the development and application of RL algorithms by providing a large library of standardized “Environments” for training RL agents in Python. The environment libraries draw from classic control problems like the Cart-Pole problem as well as video games like Atari Asteroids. This provides useful benchmarking functionality for training and evaluating different RL algorithms on a range of problems.

These libraries also provide standard functionality for user-specified custom environments. This facilitates the implementation of new MDP problems while maintaining interoperability with a range of RL algorithms for policy optimization. In this blog post, I will walk through the process of creating a new Gymnasium environment that can be used to simulate a reservoir operating policy in Python. I provide both a continuous state/action space version of the environment and a discrete version. These environments are designed to be interoperable with RL libraries such as Stable-Baselines3 for training adaptive policies using algorithms such as PPO, which will be the subject of a follow-up post. This post draws on my previous post on PID controllers and its simple reservoir model, which attempts to make releases that track a seasonally-varying storage target under stochastic inflows.

All code related to this post can be found in this GitHub repository.

Building a custom Gymnasium environment for reservoir operations

Gymnasium makes it possible to code nearly any MDP problem in a standardized form that can easily interface with RL libraries such as Stable-Baselines3. To do so, there are four major steps: (1) Define a new environment class that inherits from Gymnasium’s core Env class. (2) Define the observation space and action space for the MDP. (3) Define the reset function that resets the state of the environment at the beginning of each run. (4) Define the step function that implements a prescribed action, updates the state of the system, and calculates a reward for each time step. I will highlight these major steps below – readers interested in a more detailed look are referred to the GitHub repository referenced above.

Defining the environment class

First, we create a Python class to represent our new environment. For the reservoir operations example, I actually created three distinct environment classes: one with continuous observation/action spaces, one with discrete observation/action spaces, and a base class with shared functionality. First, the Reservoir_base class is defined to inherit from the Env class for Gymnasium environments.

import sys
import gymnasium
sys.modules["gym"] = gymnasium
from gymnasium import Env
from gymnasium.spaces import Dict, Discrete, Box
import numpy as np
from random import gauss, uniform
from math import pi, sin
import matplotlib.pyplot as plt

class Reservoir_base(Env):
    '''
    Basic reservoir class with shared functionality across continuous & discrete reservoir classes.
    Note this environment shouldn't be used directly; use the Reservoir_continuous or Reservoir_discrete versions,
    which inherit from this base class.
    '''

    def __init__(self):

        ### inherit from generic Env class for environments in Gym
        super(Reservoir_base, self).__init__()

        ...

In addition to inheriting behavior from the Env class, the Reservoir_base class defines many parameters that are shared across the discrete and continuous versions, related to time steps, inflows, min/max storages and releases, etc. It also defines several shared methods that update the internal states within the model: reset_basic(), update_inflow(), update_target(), update_storage_release(), update_reward(), and update_time(). Interested readers are referred to my previous post on PID controllers for more details on the state dynamics of this reservoir example, as well as this GitHub repository for implementation details for the present analysis.
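To give a flavor of these shared methods, here is a rough sketch of what update_target() and update_reward() might look like. This is my own illustration rather than the repository's code: it assumes a sinusoidal seasonal target (with a hypothetical amplitude attribute target_amp) and a reward equal to a hypothetical maximum value minus the absolute deviation of storage from target, consistent with the "maximum daily value of 10" mentioned later in this post. See reservoir_env.py in the repository for the actual definitions.

    ### Illustrative sketch only -- the actual methods are defined in reservoir_env.py.
    ### target_amp and max_reward are hypothetical attributes used for illustration.
    def update_target(self):
        '''Seasonally-varying storage target oscillating around target_base (assumes daily time steps).'''
        self.target = self.target_base + self.target_amp * sin(2 * pi * self.t / 365)

    def update_reward(self, storage, target):
        '''Reward decreases with the absolute deviation of storage from target.'''
        self.reward = self.max_reward - abs(storage - target)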

Next, we build on the base class to define more specific classes for the continuous and discrete environments. Here is the definition of the continuous version, which inherits from the base reservoir environment class.

class Reservoir_continuous(Reservoir_base):
    '''
    Continuous-state/continuous-action version of reservoir environment.
    Inherits from Reservoir_base class, and adds specific functionality for continuous version.
    '''

    def __init__(self):

        ### inherit from Reservoir_base class for basic shared functionality
        super(Reservoir_continuous, self).__init__()

        ...

The initialization of the Reservoir_discrete class follows straightforwardly from the continuous example.
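For completeness, its skeleton might look something like this (the full definition, including the discrete spaces shown in the next section, is in the repository):

class Reservoir_discrete(Reservoir_base):
    '''
    Discrete-state/discrete-action version of reservoir environment.
    Inherits from Reservoir_base class, and adds specific functionality for discrete version.
    '''

    def __init__(self):

        ### inherit from Reservoir_base class for basic shared functionality
        super(Reservoir_discrete, self).__init__()

        ...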

Defining the observation and action spaces

The next step is to define the observation and action spaces for the environment. The observation space is the set of time-evolving model states that are observed by the RL agent each time step in order to guide its action in the MDP problem. The action space is the set of actions determined by the agent each time step. In our simple reservoir example, the observation space is the 2D space of all possible storages and storage targets; at each time step, the observed state of the system is the pair of numbers (storage, target). The action space is the space of all possible releases.

Each environment needs to have an attribute named observation_space and an attribute named action_space. Gymnasium provides simple space classes (Dict, Box, Discrete) for defining continuous and discrete spaces. Here is how the spaces are defined in the continuous reservoir environment:

        ### define continuous 2D observation space: storage & target
        self.observation_shape = 2
        self.observation_space = Dict({'storage': Box(low=self.min_storage, high=self.max_storage, shape=(1,)),
                                       'target': Box(low=self.min_storage, high=self.max_storage, shape=(1,))})

        ### define continuous 1D action space: target reservoir release
        self.action_space = Box(low=self.min_release, high=self.max_release, shape=(1,))

And here is the discrete version, with both the observation and action spaces divided into 15 equally-spaced discrete values:

        ### define discrete 2D observation space: storage & target
        self.nstep_storage = 15
        self.grid_storage = np.arange(self.min_storage, self.max_storage + 1e-9,
                                      (self.max_storage - self.min_storage) / (self.nstep_storage - 1))
        self.observation_space = Dict({'storage_discrete': Discrete(self.nstep_storage),
                                       'target_discrete': Discrete(self.nstep_storage)})

        ### define discrete 1D action space: target reservoir release
        self.nstep_release = 15
        self.grid_release = np.arange(self.min_release, self.max_release + 1e-9,
                                      (self.max_release - self.min_release) / (self.nstep_release - 1))
        self.action_space = Discrete(self.nstep_release)

Defining the reset function

Next, we need to define the reset() function that resets the state of the reservoir each time we start a new simulation or training epoch. First consider reset_basic(), defined in the Reservoir_base class, which resets key state variables. Initial storage is set to a randomly sampled value.

def reset_basic(self):
    '''
    function for resetting basic state variables each time the environment is reset
    :return: None
    '''
    self.inflow = 0.
    self.inflow_rand_walk = 0.
    self.storage = uniform(self.min_storage, self.max_storage)
    self.target = self.target_base
    self.reward = 0.
    self.t = 0
    self.terminate = False

The reset() function for the continuous Gymnasium environment builds on reset_basic():

    def reset(self):
        '''
        function for resetting state variables each time the environment is reset
        :return: dictionary with the 2 observed states: storage & target.
        '''

        self.reset_basic()
        return {'storage': np.array([self.storage]), 'target': np.array([self.target])}

Thus, the reset() function simply resets key state variables and then returns the initial observed state for the RL agent. Note that the observed state values should be packaged in Numpy arrays for compatibility with Stable-Baselines3.

For the discrete-state environment, the observed states returned by reset() should be the integer indexes for the nearest values in the grid_storage array, rather than the observed state values themselves:

    def reset(self):
        '''
        function for resetting state variables each time the environment is reset
        :return: dictionary with the 2 observed states: storage & target.
        '''

        self.reset_basic()

        ### get discrete indexes for storage and target to return to agent
        storage_discrete = np.argmin(np.abs(self.grid_storage - self.storage))
        target_discrete = np.argmin(np.abs(self.grid_storage - self.target))

        return {'storage_discrete':np.array([storage_discrete]), 'target_discrete':np.array([target_discrete])}

Defining the step function

Lastly, we need to define the step() function. For any Gymnasium environment, the step() function takes the prescribed action as its argument, implements this action and updates all internal model states, and returns the observed states, the reward, and the termination status at the end of the time step. Here is the step function for the discrete environment:

    def step(self, release_discrete):
        '''
        Method for updating the reservoir each time step based on prescribed release target.
        This involves updating the current inflow and seasonally-varying target,
        potentially modifying the target release to ensure mass balance, updating the storage,
        calculating the reward based on deviation from target,
        updating the time step, and terminating the simulation after a preset number of time steps.
        :release_discrete: The index of the discrete release proposed by the RL agent.
        :return: {'storage_discrete':storage_discrete, 'target_discrete':target_discrete}: dict of observed state that is returned to RL agent;
                 reward: decreases based on abs deviation of discretized storage from discretized target;
                 terminate: whether episode ended due to time horizon;
                 {'inflow': self.inflow, 'release': self.release,'storage_cont': self.storage, 'target_cont':self.target}:
                        additional state info not returned to agent
        '''

        ### assert that proposed action is valid
        assert self.action_space.contains(release_discrete), f"Invalid Action: {release_discrete}"

        ### map discrete action index to release_target
        release_target = self.grid_release[release_discrete]

        ### update inflow
        self.update_inflow()

        ### update release target
        self.update_target()

        ### update storage & release to preserve mass balance
        self.update_storage_release(release_target)

        ### get discrete indexes for storage and target
        storage_discrete = np.argmin(np.abs(self.grid_storage - self.storage))
        target_discrete = np.argmin(np.abs(self.grid_storage - self.target))

        ### provide negative reward (ie penalty) based on abs value deviation from target.
        self.update_reward(storage_discrete, target_discrete)

        ### update time and truncate episode when time is over
        self.update_time()

        return {'storage_discrete':np.array([storage_discrete]), 'target_discrete':np.array([target_discrete])}, self.reward, self.terminate, \
                {'inflow': self.inflow, 'release': self.release,'storage_cont': self.storage, 'target_cont':self.target}

The update_inflow(), update_target(), update_storage_release(), update_reward(), and update_time() methods are defined in the Reservoir_base class (see repository). Note that the release_discrete variable passed into the step() function is an integer index into the discrete grid_release array, so we need to retrieve the actual release value before using it in the simulated time step. Similarly, after updating the storage and target values within the model step, we need to find the nearest discrete values and return their indexes. Our step() function returns four items: (1) a dictionary with the updated observed states, (2) the reward value calculated for this time step, (3) the termination status of the simulation (False until a specified number of time steps have passed, at which point it becomes True), and (4) an additional dictionary of internal model state information we want to track, but that is not provided to the RL agent as observed state.

The continuous step() function is analogous but simpler, since no mapping between discrete indexes and continuous values is needed. A rough sketch is shown below; the interested reader is referred to the GitHub repository for the exact implementation.
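This is my own approximation based on the discrete version above, not copied from the repository, and the exact arguments passed to the shared update methods are assumptions:

    def step(self, release_target):
        '''
        Update the reservoir for one time step based on a continuous prescribed release.
        :release_target: Numpy array holding the continuous release proposed by the RL agent.
        '''

        ### assert that proposed action is valid
        assert self.action_space.contains(release_target), f"Invalid Action: {release_target}"

        ### update inflow & seasonally-varying target
        self.update_inflow()
        self.update_target()

        ### update storage & release to preserve mass balance
        self.update_storage_release(release_target[0])

        ### provide negative reward based on abs deviation of storage from target
        self.update_reward(self.storage, self.target)

        ### update time and terminate episode when time is over
        self.update_time()

        return {'storage': np.array([self.storage]), 'target': np.array([self.target])}, self.reward, self.terminate, \
                {'inflow': self.inflow, 'release': self.release}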

Simulating behavior with a custom Gymnasium environment

Once we have defined the new Gymnasium environments, the next step is to use them for simulation! I will leave policy optimization with RL for a future post. For this post, I will demonstrate two very simple policies: (1) constant-valued prescribed releases that do not change over time, and (2) randomly-selected prescribed releases. Here I use the term “prescribed release” to distinguish it from the “actual release”, which may differ from the prescribed release when the prescribed release is infeasible due to mass balance constraints (i.e., when the prescribed release exceeds the available water, or when excess releases are required because the reservoir is full).

The script reservoir_simulate.py runs four separate simulations: (1) continuous states/actions with constant prescribed releases, (2) continuous states/actions with random prescribed releases, (3) discrete states/actions with constant prescribed releases, and (4) discrete states/actions with random prescribed releases. For brevity, I only show the code for the continuous environment with random prescribed releases here; the others are very similar (see repository).

import numpy as np
from random import seed
from reservoir_env import Reservoir_continuous, plot_continuous, Reservoir_discrete, plot_discrete

#############################################################################
### now simulate continuous-version reservoir under random releases
#############################################################################

### set random seed for consistent results
rseed = 5
seed(rseed)

### initialize reservoir & set initial state
reservoir = Reservoir_continuous()
obs = reservoir.reset()

### lists for storing state variables, actions, & rewards
storages, targets, inflows, releases, rewards = [obs['storage'][0]], [obs['target'][0]], [], [], []

### loop through daily simulation until the reservoir environment terminates after set number of time steps
terminated = False
while not terminated:
    ### generate a random release
    release = reservoir.action_space.sample()
    ### update state of reservoir with release & get reward
    obs, reward, terminated, info = reservoir.step(release)
    ### save state variables, actions, & rewards for plotting
    storages.append(obs['storage'][0])
    targets.append(obs['target'][0])
    inflows.append(info['inflow'])
    releases.append(info['release'])
    rewards.append(reward)

### plot reservoir storage, target, inflow, release, & reward time series
plot_continuous(reservoir, storages, targets, inflows, releases, rewards, 'figs/random_continuous.png')

reservoir.close()

As you can see, the Gymnasium environment makes it very simple to simulate the reservoir’s behavior. In brief, we initialize and reset the reservoir object, then iteratively update the system each time step by randomly sampling a prescribed release from the action space and stepping the reservoir object to return its updated states and rewards. We store both observed and unobserved state information each time step for visualization purposes. The simulation terminates after a predefined number of time steps. We then plot our simulated data using the plot_continuous() function, which is defined along with the Gymnasium environment classes in reservoir_env.py.
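For reference, the constant-release variant differs only in how the prescribed release is chosen inside the loop. Here is a minimal sketch, where the specific constant value is just an illustrative assumption (the script in the repository defines its own); the state saving and plotting proceed exactly as in the random-release example above:

### constant prescribed release each day (the midpoint value here is only for illustration)
constant_release = np.array([(reservoir.min_release + reservoir.max_release) / 2], dtype=np.float32)

terminated = False
while not terminated:
    ### update state of reservoir with the same prescribed release every time step
    obs, reward, terminated, info = reservoir.step(constant_release)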

Figs. 1-4 show the behavior of each of these four combinations of continuous vs. discrete environments and constant vs. random prescribed releases, under a single random inflow scenario. Of course, none of these simulations looks very good at all – the storages do not meaningfully track the storage targets, and the rewards are far below the maximum daily value of 10. This suggests (unsurprisingly) that neither constant releases nor randomized releases are an effective operating policy.

However, with the Gymnasium architecture in place, we are now in a position to develop more optimal operating policies with a variety of RL algorithms using libraries like Stable-Baselines3. Stay tuned for another post on this soon!

Exploring Internal Variability with Hidden Markov Models in Our Newly Published eBook Interactive Tutorial

We recently published two new Jupyter Notebook tutorials as technical appendices to our eBook on Addressing Uncertainty in MultiSector Dynamics Research. Dave debuted his tutorial on the blog earlier this month here. This post will cover my addition to the eBook, which outlines a hidden Markov modeling approach to creating synthetic streamflow scenarios. Similarly to Dave’s, this post will outline the notebook’s connections to topics in the main text of the eBook rather than detailing the code demonstrated in the notebook.

In this post, I’ll first give a brief overview of how a Hidden-Markov model-based streamflow generator can help facilitate exploratory modeling in the Colorado River Basin. Then I’ll go through examples of how it can be used to explore internal variability that characterizes the region’s hydroclimate.

Exploratory Modeling for Colorado River Basin Planning and Management

The Colorado River Basin (CRB) is experiencing an unprecedented water shortage crisis brought on by large hydroclimatic changes and over-allocation of the Colorado River. Parts of the basin are experiencing warming that is nearly twice the global average1. Even when above-average snowpack provides additional water, dry soils absorb this flow rather than allowing the runoff to reach and refill downstream reservoirs. The complexity of accommodating the demands of growing cities in the basin, combined with a diminishing water supply, makes planning and management a growing challenge in the region. Effective planning and management in the CRB must consider the fact that the region is shifting to a more arid climate, so the historical streamflow record is no longer a sufficient representation of future hydroclimate. Tools are needed that allow the exploration of the vulnerabilities of the CRB under future conditions that are plausible but have not been experienced before.

Herein lies the value of exploratory modeling as discussed in Chapter 2.2 of the eBook. Exploratory modeling can be used to run simulation models under vast ensembles of potential futures and then identify those with consequential effects. Rather than trying to predict future conditions exactly, the focus is shifted to a more holistic approach of discovering what conditions can lead to futures that cause system vulnerability or failure. The simulation model we use is a monthly water allocation model called StateMod which the State of Colorado uses for planning and simulating future management policies. In order to use this model in an exploratory modeling context, we need to find a way of generating plausible future scenarios that can be used to force the model. The future of the CRB will be defined by many uncertainties including demand, population growth, and policy changes pertaining to the Colorado River Compact. In this tutorial, we will focus specifically on the changing hydroclimate in the region and on a tool that allows us to generate streamflow scenarios for basins in the state of Colorado.

An HMM as a tool for exploring internal variability

As discussed in Chapter 3.3.6 of the eBook, there are many different ways to generate synthetic input data. To generate traces of streamflow, one can use physically-informed GCM projections, purely statistical approaches, or a hybrid of the two. In this tutorial, we utilize a Hidden Markov Model (HMM)-based streamflow generator, which is a purely statistical approach, to generate synthetic traces of streamflow. The HMM is (1) conditioned directly on regional streamflow and therefore better able to capture local drought dynamics than downscaled CMIP5/6 projections and (2) more computationally tractable to help facilitate exploratory analyses. The HMM can be utilized to generate stationary representations of streamflow and multipliers or shifts can be applied to the parameters to create non-stationary traces.

Decadal Drought Conditions

If we plot the historical annual streamflow at the outlet gage of the Upper Colorado River Basin (UCRB) in Figure 1, we can quantify decadal drought risk in the observed period. We use a metric proposed in Ault et al. (2014), which defines a decadal drought as a period in which the 11-year running mean of the available record drops more than half a standard deviation below the overall mean of the record. By this metric, the region has experienced two decadal droughts.
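As a rough illustration of this metric (not the notebook's exact code), the check can be written in a few lines of pandas, assuming annual_flows is a pandas Series of annual streamflow indexed by year; whether the running mean is centered is an assumption here:

import pandas as pd

def decadal_drought_years(annual_flows: pd.Series, window: int = 11) -> pd.Series:
    '''Flag years whose running mean falls more than half a standard deviation
    below the overall mean of the record (Ault et al., 2014).'''
    threshold = annual_flows.mean() - 0.5 * annual_flows.std()
    rolling_mean = annual_flows.rolling(window, center=True).mean()
    return rolling_mean < threshold

### e.g., count the number of years classified as being in decadal drought:
### n_drought_years = decadal_drought_years(annual_flows).sum()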

Figure 1: Historical annual streamflow at the outlet of the UCRB. Decadal drought periods are overlaid in tan.

However, this historical record contains very few instances of extremes and thus underestimates flood and drought hazards. It is important to remember that the streamflow observed in the region over the last century is only one realization of the hydrology that could have occurred, since the atmosphere is an inherently stochastic system. We therefore need a tool that lets us see multiple plausible realizations of the streamflow record to understand the internal variability that characterizes the historical period. This is where the HMM-based generator comes in. If a system follows a Markov process, it switches between a number of “hidden states” according to a transition matrix. The UCRB is best described by a two-state system: wet or dry. Each hidden state has its own Gaussian probability distribution, defined by a mean and standard deviation, and samples can be drawn from these distributions to create synthetic flows that fit the properties of the historical record. HMMs are an attractive choice for this region because they can simulate persistence (i.e., long-duration droughts), which is needed to create the kinds of conditions the region will likely experience in the future. Figure 2 shows the 2-state Gaussian HMM that is fit in the tutorial and the Viterbi sequence (i.e., the classification of years into each state).
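To make this concrete, here is a minimal sketch of fitting a 2-state Gaussian HMM and extracting the Viterbi sequence with the hmmlearn package. The tutorial's actual preprocessing and model settings may differ; in particular, log-transforming the flows is a common choice but an assumption here, and annual_flows stands in for the notebook's data:

import numpy as np
from hmmlearn.hmm import GaussianHMM

### annual_flows: 1D numpy array of annual streamflow at the UCRB outlet gage
X = np.log(annual_flows).reshape(-1, 1)

### fit a 2-state (wet/dry) Gaussian HMM
hmm_model = GaussianHMM(n_components=2, n_iter=1000)
hmm_model.fit(X)

### Viterbi sequence: most likely hidden state (wet or dry) for each historical year
viterbi_states = hmm_model.predict(X)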

Figure 2: a) Example wet and dry state distributions and b) resulting Viterbi sequence

If we sample 105-year traces from the HMM, we are creating alternative hydrologies that the region could have experienced, different from what was experienced historically. In the historical dataset, there are two decadal drought periods, which leads to 20 years classified as being in decadal drought conditions (1959-1969 and 2000-2010). However, if we sample from the model 1000 times to create 1000 more 105-year traces, we have the capacity to create realizations with many more years classified as being in decadal drought conditions. Figure 3 shows a histogram of the years classified as being in decadal drought across the 1000 samples. The historical trace is shown as a red vertical line. Note how the HMM creates many instances of more than 20 years of decadal drought conditions.
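A sketch of this sampling experiment, building on the hmm_model object and the decadal_drought_years() helper sketched above (both illustrative stand-ins for the notebook's code), might look like:

n_samples, n_years = 1000, 105
drought_year_counts = []

for s in range(n_samples):
    ### sample a synthetic 105-year trace of (log-space) flows from the fitted HMM
    synthetic_log_flows, _ = hmm_model.sample(n_years, random_state=s)
    synthetic_flows = pd.Series(np.exp(synthetic_log_flows.flatten()))

    ### count the years classified as being in decadal drought in this trace
    drought_year_counts.append(int(decadal_drought_years(synthetic_flows).sum()))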

Figure 3: The number of years in a 105-year trace classified to be in decadal drought conditions across 1000 samples from the stationary generator

Figure 4 shows one of the 1000 traces where 53 years (or half of the whole record) are classified in decadal drought conditions.

Figure 4: Example trace from stationary generator which is characterized by more frequent drought conditions

Multi-Decadal Drought Conditions

The decadal drought metric in Ault et al. (2014) can be extended to a multi-decadal drought context by inspecting where the 35-year running mean dips below the drought threshold. By this metric, the region has experienced no multi-decadal droughts (Figure 5).

Figure 5: Historical annual streamflow at the outlet of the UCRB. The 35-year rolling mean does not cross the drought threshold.

However, the stationary synthetic generator does have the capacity to create multi-decadal drought, such as the trace shown in Figure 6, even though the historical record does not have one.


Figure 6: An example trace from the stationary generator that exhibits a multi-decadal drought

The traces shown in Figures 4 and 6 are characterized by substantially more years of drought, and since they were created with a stationary generator, they are examples of what the UCRB could have experienced historically instead of what is shown in Figure 1. These examples demonstrate that natural variability alone can create records that contain more frequent droughts than what has been seen historically (even in the absence of climate changes).

Drought Severity

Is the HMM creating droughts that are more severe than what was experienced historically? Let’s use the decadal drought as an example. We can define severity as the maximum volume (cf) by which the rolling mean dips below the drought threshold. Figure 7 shows the histogram of these dips, and once again we see that we are creating droughts that dip substantially below the threshold, unlike the behavior observed in Figure 1 (represented here by the red line).
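Continuing the earlier sketch (again, an illustration rather than the notebook's exact code, and whether the threshold is taken from the historical record or recomputed for each trace is an assumption here), the severity of a trace can be computed as the largest dip of the rolling mean below the drought threshold:

def decadal_drought_severity(annual_flows: pd.Series, window: int = 11) -> float:
    '''Maximum volume by which the running mean dips below the drought threshold
    (returns 0 if the trace never enters decadal drought conditions).'''
    threshold = annual_flows.mean() - 0.5 * annual_flows.std()
    rolling_mean = annual_flows.rolling(window, center=True).mean()
    return float(max(0.0, (threshold - rolling_mean).max()))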


Figure 7: The severity of the decadal drought periods in each 105-year trace across 1000 samples from the stationary generator

Figure 8 shows an example trace of severe drought conditions where the rolling mean substantially dips below the drought threshold. Thus we see that natural variability alone can create records that contain more severe droughts than what has been seen historically.

Figure 8: An example trace from the stationary generator that exhibits severe drought conditions.

Conclusion

This tutorial was meant to demonstrate the utility of the HMM-based generator in a stationary context, with a focus on its ability to create more frequent and severe decadal and multi-decadal droughts. Sampling from the generator allows us to explore the rich internal variability that characterizes the region. We demonstrate that the internal variability captured by the generator can produce more frequent and severe droughts than what was experienced historically, and that this occurs in the absence of any adjustments to the HMM to create climate change proxies. Thus, it is important to acknowledge that future droughts may become even more severe and frequent as phases of internal variability interact with climate changes.

Though we won’t go over it in this blog post, the technical appendix also covers adjustments that can be made to the HMM parameterization to create non-stationary traces which you can find here.

References

1 The Nature Conservancy: https://www.nature.org/en-us/about-us/where-we-work/priority-landscapes/colorado-river/colorado-river-in-crisis/

Ault, T. R., Cole, J. E., Overpeck, J. T., Pederson, G. T., & Meko, D. M. (2014). Assessing the risk of persistent drought using climate model simulations and paleoclimate data. Journal of Climate, 27(20), 7529-7549.

Deep Reinforcement Learning with PPO (the ML kind, not the health insurance kind)

As the name *may* have implied, today’s blog post will be about proximal policy optimization (PPO), which is a deep reinforcement learning (DRL) algorithm introduced by OpenAI in 2017. Before we proceed, though, let’s set a few terms straight:

  • State: An abstraction of the current environment that the agent inhabits. An agent observes the current state of the environment, and makes a decision on the next action to take.
  • Action: The mechanism that the agent uses to transition between one state to another.
  • Agent: The decision-maker and learner that takes actions based on its consequent rewards or penalties.
  • Environment: The dynamic system defined by a set of rules that the agent interacts with. The environment changes based on the decisions made and actions taken by the agent.
  • Reward/Penalty: The feedback received by the agent. Often in reinforcement learning, the agent’s objective is to maximize its reward or minimize its penalty. In this post, we will proceed under the assumption of a reward-maximization objective.
  • Policy: A model that maps the states to the probability distribution of actions. The policy can then be used to tell the agent to select actions that are most likely to result in the lowest penalty/highest reward for a given state.
  • Continuous control: The states, actions, and rewards take on continuous (analog) values (e.g. move the cart forward by 1.774 inches).
  • Discrete control: The states, actions, and rewards take on values from a finite set of options (e.g. true/false, 1/0, left/right).

Building on the definition of a policy above, PPO is a policy-based DRL algorithm that consists of two main steps:

  1. The agent interacts with the environment by taking a limited number of actions and collecting samples of the resulting states and rewards.
  2. The agent then makes multiple optimizations (policy updates) for an estimate (or “surrogate”, as Schulman et al. 2017 call it) of the reward-maximizing objective function using stochastic gradient ascent (SGA). This is where the weights of the policy are incrementally tuned, as more data is obtained, to yield the highest possible expected reward.

Note: If you are wondering what SGA is, look up Stochastic Gradient Descent — it’s the same thing, but reversed.

These steps address a couple of issues faced by other policy-based methods such as standard policy gradient optimization (PGO) and trust region policy optimization (TRPO). Standard PGO allows only one gradient update per data sample, which is computationally expensive given the number of updates typically required for such problems. PGO is also susceptible to destructive policy updates, where one round of optimization can cause the policy to converge prematurely, or to fail to converge to the true maximum reward. TRPO, on the other hand, requires optimizing (and re-optimizing) a secondary constraint function that defines the policy's trust region (hence the name), making it complicated to implement and harder to explain.

PPO Plug

Unlike both standard PGO and TRPO, PPO serves to carefully balance the tradeoffs between ease of implementation, stable and reliable reward trajectories, and speed. It is particularly useful for training agents in continuous control environments, and achieves this in one of two ways:

  1. PPO with adaptive penalty: The penalty coefficient on the KL-divergence term that defines the trust region is adapted every time the policy changes, so that each policy update is significant but does not overshoot the true maximum reward.
  2. PPO with a clipped surrogate objective: This is currently the more widely used version of PPO, as it has been found to perform better than the former (Schulman et al., 2017; van Heeswijk, 2022). This variant restricts – clips – the range within which the policy can change by penalizing any update for which the ratio of the new policy's probability of an action (given a state) to the old policy's probability moves outside a threshold. The clipped objective is sketched below.
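For reference, the clipped surrogate objective from Schulman et al. (2017) can be written as follows, where r_t(θ) is the probability ratio between the new and old policies, Â_t is the estimated advantage, and ε is the clipping threshold (e.g., 0.2):

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]

Intuitively, the min and clip terms ensure the policy gains no extra benefit from pushing the probability ratio outside the [1-ε, 1+ε] band, which discourages the destructively large updates that plague standard PGO.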

The latest version of PPO, called PPO2 (screams creativity, no?), is GPU- and multiprocessor-enabled. In other words, it's a more efficiently-parallelized version of PPO that enables training over multiple environments at the same time. This is the algorithm we will be demonstrating next.

PPO Demo

As always, we first need to load some libraries. Assuming that you already have the usual suspects (NumPy, Matplotlib, Pandas, etc.), you will require the Gym and Stable-Baselines3 libraries, plus imageio, which we use below to record a GIF. Gym is a collection of environments representing common RL problems, while Stable-Baselines3 contains stable implementations of most RL algorithms written in PyTorch. To install both Gym and Stable-Baselines3 on your local machine, enter the following command into your command line:

pip install gym stable-baselines3

Once that is completed, create a Python file and follow along with the code. This demo was adapted from a similar PPO demo that can be found here.

In the Python file, we first import the necessary libraries to run and record the training process. We will directly import the PPO algorithm from Stable-Baselines3. Note that this version of the PPO algorithm is in fact the more recent PPO2 algorithm.

# import libraries to train the mountain car
import gym
from stable_baselines3 import PPO

# import libraries to record the training
import numpy as np
import imageio

Next, let’s create the gym environment. For the purpose of this post, we will use the Mountain Car environment from the Gym library. The Mountain Car problem is a deterministic Markov Decision Process (MDP) with a known reward function. In this problem, a car is placed at the bottom of a sinusoidal valley (Figure 1) and can take three discrete deterministic actions – accelerate to the right, accelerate to the left, or don’t accelerate – to gain enough momentum to push the car up to the flag on top of the mountain. The goal of the agent is to learn a policy that consistently accelerates the mountain car so that it reaches the flag within the 200-time-step limit of each episode. An episode is one full run of the environment from reset to termination (or truncation), analogous to an epoch in general machine learning.
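As a quick sanity check (not part of the original demo), you can inspect the environment's spaces to confirm the two-element observation (position, velocity) and the three discrete actions:

import gym

env = gym.make('MountainCar-v0')
print(env.observation_space)  # Box with 2 elements: car position along the x-axis and velocity
print(env.action_space)       # Discrete(3): accelerate left, don't accelerate, accelerate right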

Figure 1: Episode 1 (untrained) of the Mountain Car.

Aite, so let’s create this environment!

# make the gym environment
env_mtcar = gym.make('MountainCar-v0')

Next, we set up an actor-critic multi-layer perceptron model and apply the PPO2 algorithm to train the mountain car. Since we want to view the information output by the model, we set verbose = 1. We then allow the model to learn over a total of 10,000 timesteps.

# setup the model 
# Implement actor critic, using a multi-layer perceptron (2 layers of 64) in the pre-specified environment
model = PPO("MlpPolicy", env_mtcar, verbose=1)

# return a trained model that is trained over 10,000 timesteps
model.learn(total_timesteps=10_000)

Let’s take a look at the starting point of the environment. In general, it’s good practice to use the .reset() function to return the environment to its starting state after every episode. Here, we also initialize a list of images that we will later combine using the imageio library to form a GIF.

# get the environment
vec_env = model.get_env()

# reset the environment and return an array of observations whenever a new episode is started
obs = vec_env.reset()

# initialize the GIF to record
images = []
img = model.env.render(mode='rgb_array')

For the Mountain Car environment, the obs variable is a 2-element array where the first element describes the position of the car along the x-axis, and the second element describes the velocity of the car. After a reset, the obs variable should print to look something like [[-0.52558255 0. ]] where the velocity is zero (stationary).

Next, we run the trained policy for 1,000 steps to see how it behaves. Before each step is taken, we take a snapshot of the current frame and append it to the list of images we initialized earlier. We then predict the next action and hidden state (denoted by the “_” at the beginning of the variable name) given the current observation, requesting deterministic actions. We use this action to step the environment and return the resulting observation and reward. The reward variable returns a value of -1 for every step that does not result in reaching the flag. The done variable is a boolean that indicates whether the episode has ended (the car reached the flag or the time limit expired). If done=True, we reset the environment to its starting state; otherwise, we continue stepping. We also render the current frame as an image to be added to the image list.

# run the trained policy and record frames
for i in range(1000):
    # append a snapshot of the current frame to the image list
    images.append(img)

    # get the policy action from an observation (and the optional hidden state)
    # return the deterministic actions
    action, _states = model.predict(obs, deterministic=True)

    # step the environment with the given action
    obs, reward, done, info = vec_env.step(action)

    # print progress occasionally
    if (i % 500) == 0:
        print("i=", i)
        print("Observation=", obs)
        print("Reward=", reward)
        print("Done?", done)
        print("Info=", info)

    # reset the environment if the episode has ended
    if done:
        obs = vec_env.reset()

    # render the current frame for the GIF
    img = model.env.render(mode='rgb_array')
    vec_env.render()

Finally, we convert the list of images to a GIF and close the environment:

imageio.mimsave('mtcart.gif', [np.array(img) for i, img in enumerate(images) if i%2 == 0], fps=29)

You should see the mtcart.gif file saved in the same directory that you have your code file in. This GIF should look similar to Figure 2:

Figure 2: The mountain car environment progress for a 10,000 timestep training environment.

Conclusion

Overall, PPO is a relatively simple but powerful reinforcement learning algorithm to implement, with recent applications in video games, autonomous driving, and continuous control problems in general. This post provided a brief but thorough overview of the algorithm and a simple example application to the Mountain Car environment, and I hope it motivates you to check it out further and explore other environments as well!

References

King, J. (2020) Driving up a mountain, A Random Walk. Available at: https://jfking50.github.io/mountaincar/ (Accessed: April 12, 2023).

Proximal policy optimization (no date) Proximal Policy Optimization. Available at: https://openai.com/research/openai-baselines-ppo (Accessed: April 12, 2023).

Schulman, J. et al. (2017) Proximal policy optimization algorithms, arXiv.org. Available at: https://arxiv.org/abs/1707.06347 (Accessed: April 11, 2023).

Srinivasan, A.V. (2019) Stochastic gradient descent - clearly explained !!, Medium. Towards Data Science. Available at: https://towardsdatascience.com/stochastic-gradient-descent-clearly-explained-53d239905d31 (Accessed: April 11, 2023).

van Heeswijk, W. (2023) Proximal Policy Optimization (PPO) explained, Medium. Towards Data Science. Available at: https://towardsdatascience.com/proximal-policy-optimization-ppo-explained-abed1952457b (Accessed: April 11, 2023).

Creating Interactive Geospatial Maps in Python with Folium

Interactive mapping and data visualization provide data scientists and researchers with a unique opportunity to explore and analyze geospatial data, and to share their work with stakeholders in a more engaging and accessible way.

Personally, I’ve found that constructing an interactive map for my own research, and iteratively updating it as the work progresses, has been great for improving my own understanding of my project. My project map (not the one in this demo) has been a helpful narrative aid when introducing someone to the project.

I’ve had a lot of success using the folium Python package to create interactive maps, and want to share the tool with you all here.

In this post, I provide a demonstration on how to use the folium package to create an HTML based interactive map of a hydrologic watershed.

Folium: a quick introduction

Folium is a Python package that is used to create interactive maps.

It is built on top of the Leaflet JavaScript library and can be used to visualize geospatial data in unique ways. Using folium, users construct maps by adding various features including lines, markers, pop-up text boxes, and different base-maps (aka, tiles), and can export maps as interactive HTML documents.
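As a minimal example of this workflow (separate from the watershed demo below), a map with a single marker can be built and exported in just a few lines; the coordinates here are arbitrary placeholders:

import folium

# initialize a map centered on an arbitrary location (placeholder coordinates)
m = folium.Map(location=[36.5, -106.5], zoom_start=8)

# add a single marker with a pop-up text box
folium.Marker(location=[36.5, -106.5], popup='Example marker').add_to(m)

# export the interactive map as an HTML document
m.save('minimal_map.html')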

The full Folium documentation is here; however, I think it is best to learn through an example.

Demo: Mapping hydrologic basin data with Folium

All of the code and data used in this post is located in a GitHub repository: trevorja/Folium_Interactive_Map_Demo.

To interact with the demo map directly, download the file here: basin_map.html

The folium_map_demo.ipynb Jupyter Notebook steps through the process shown below and will re-create the map shown above. You can create a similar plot for a different basin by changing the station_id number inside ‘retrieve_basin_data.ipynb’ to a specific USGS gauge of interest.

Here is a brief video showcasing the interaction with the final map:

Map Data

For this demo, I am plotting various features within a hydrologic basin in northern New Mexico.

All of the data in the map is retrieved using the pynhd package from the HyRiver suite for python. For more information about these packages, see my previous post, Efficient hydroclimatic data accessing with HyRiver for Python.

The script ‘retrieve_basin_data.ipynb’ is set up to retrieve several features from the basin, including:

  • Basin boundary
  • Mainstem river
  • Tributary rivers
  • USGS gauge locations
  • New Mexico Water Data Initiative (NMWDI) Sites
  • HUC12 pour points

The geospatial data (longitude and latitudes) for each of these features are exported to data/basin_data.csv and used later to generate the folium map.
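For context, here is a rough sketch of the kind of pynhd calls used to pull some of these features; the station ID is a placeholder, and the exact arguments used in ‘retrieve_basin_data.ipynb’ may differ (see the repository and my earlier HyRiver post for the real workflow):

from pynhd import NLDI

# placeholder USGS station ID; replace with the gauge of interest
station_id = "08279500"

nldi = NLDI()

# basin boundary upstream of the gauge
basin = nldi.get_basins(station_id)

# upstream flowlines (mainstem and tributaries) within 100 km of the gauge
flowlines = nldi.navigate_byid(fsource="nwissite", fid=f"USGS-{station_id}",
                               navigation="upstreamTributaries", source="flowlines",
                               distance=100)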

Constructing a folium hydrologic map

Like many other data visualization programs, folium maps are constructed by iteratively adding features or map elements to the main map object. It is easy to add a map feature, however it takes more care to ensure that the features are well-organized and respond to user interaction the way you want them to.

In this demo, I deconstruct the process into five parts:

  1. Initializing the map
  2. Initializing feature groups (layers)
  3. Adding points to feature layers
  4. Adding polygons & lines to feature layers
  5. Adding layers onto the map

Before going any further, we need to import the necessary packages, load our basin data, and convert the geospatial data to geopandas.GeoDataFrame() geometry objects:

# Import packages
import pandas as pd
import folium
from folium.plugins import MarkerCluster
import geopandas as gpd
from shapely.geometry import Point, LineString
from shapely import wkt

# Specify the coordinate reference system (CRS): EPSG:4326 (WGS84 latitude/longitude)
crs = 4326

# Load the basin data
basin_data = pd.read_csv('./data/basin_data.csv', sep = ',', index_col=0)

# Convert to a geoDataFrame and geometry objects
basin_data['geometry'] = basin_data['geometry'].apply(wkt.loads)
geodata = gpd.GeoDataFrame(basin_data, crs = crs)

Additionally, I start by specifying a couple design options before getting started (colors, line widths, etc.). I do this at the start so that I can easily test different colors, and I store all of these specifications in a dict. This step is not necessary, but just helps to stay organized. The following code block shows some of these specifications, but some are left out just to make this post shorter:

## Plot options
# Line widths
basin_linewidth = 2
mainstem_linewidth = 3
tributary_linewidth = 1

# ...
# More color and design specs here...
# ...

# Combine options into a dictionary
plot_options = {
    'station':{'color': usgs_color,
               'size': usgs_size},
    'pourpoint': {'color': pourpoint_color,
                  'size': pourpoint_size},
    'nmwdi': {'color': nmwdi_color,
              'size': nmwdi_size}
              }

With that out of the way, we will start mapping.

Initializing the map

We start to construct our map using the folium.Map() function:

# Initialize the map
geomap = folium.Map(location = [36.5, -106.5],
                    zoom_start = 9.2,
                    tiles = 'cartodbpositron',
                    control_scale = True)

The location and zoom_start arguments set the default view; the user will be able to pan and zoom around the map, but this will be the starting location.

The tiles argument in the initial folium.Map() call sets the default base-map style, but additional styles can be added using the folium.TileLayer() function. Each TileLayer() call adds a different base-map style that is then available from the drop-down menu in the final figure.

## Add different tiles (base maps)
folium.TileLayer('openstreetmap').add_to(geomap)
folium.TileLayer('stamenwatercolor').add_to(geomap)
folium.TileLayer('stamenterrain').add_to(geomap)

Here is a side-by-side comparison of the four different tiles used in this demo:

Initializing feature groups (layers)

Before we add any lines or makers to the map, it is best to set up different layers using folium.FeatureGroup(). When the map is complete, the different feature groups can be toggled on-and-off using the drop-down menu.

Below, I initialize a folium.FeatureGroup for each of the six feature types in the map.

# Start feature groups for toggle functionality
basin_layer = folium.FeatureGroup(name = 'Basin Boundary', show=True)
usgs_layer = folium.FeatureGroup(name = 'USGS Gauges', show=True)
mainstem_layer = folium.FeatureGroup(name = 'Mainstem River', show=True)
tributary_layer = folium.FeatureGroup(name='Tributary Rivers', show=False)
pourpoint_layer = folium.FeatureGroup(name= 'HUC12 Pour Points', show=False)
nmwdi_layer = folium.FeatureGroup(name='NM Water Data Initiative Gauge', show=True)

The name argument is what will be displayed on the drop-down menu. The show argument indicates whether that layer is visible by default (when the map is first opened), or whether it first needs to be selected using the drop-down menu.

Now that the layers are initialized, we can begin adding the features (polygons, lines, and point markers) to each layer.

Adding points to feature layers

The folium.CircleMarker() is used to add circle points using a specific coordinate location.

The following code shows how I am iteratively adding different points to the different layers of the map based upon the feature type.

For each point, I extract the latitude and longitude (coords = [point.geometry.y, point.geometry.x]) and pass it to the folium.CircleMarker() function with colors and sizes specific to each of the three different point features (USGS gauge stations (station), NMWDI (nmwdi), and HUC12 pourpoints (pourpoint)):

plot_types = ['station', 'nmwdi', 'pourpoint']
plot_layers = [usgs_layer, nmwdi_layer, pourpoint_layer]

# Loop through the different feature point types
for i, feature_type in enumerate(plot_types):

    # Specify the correct feature layer to add the points to
    map_layer = plot_layers[i]

    # Add each point of this feature type
    for _, point in geodata.loc[geodata['type'] == feature_type].iterrows():
        coords = [point.geometry.y, point.geometry.x]

        # Add the popup box with description
        textbox = folium.Popup(point.description,
                               min_width=popup_width,
                               max_width=popup_width)

        # Add the marker at the coordinates with color-coordination
        folium.CircleMarker(coords,
                            popup=textbox,
                            fill_color=plot_options[feature_type]['color'],
                            fill=True,
                            fill_opacity=fill_opacity,
                            radius=plot_options[feature_type]['size'],
                            color=plot_options[feature_type]['color']).add_to(map_layer)

Adding polygons & lines to feature layers

I’ve found that it can be difficult to add polygons or lines to a folium map if the coordinate geometry is not formatted correctly. For me, the best method has been to convert the polygon or line data to a geopandas.GeoSeries and then convert it to JSON format using the .to_json() method.

Once in JSON format, the data can be added to the map using the folium.GeoJson() method similar to other features. Although, rather than adding it to the map directly, we add it to the specific feature layer so that we can toggle things later.

Below, I show how I add the basin boundary polygon to the map. This needs to be repeated for the mainstem and tributary river lines, and the full code is included in folium_map_demo.ipynb.

## Plot basin border
# Note: 'basin_boundary' below is an assumed placeholder for the feature-type label
# used in basin_data.csv; check the repository for the actual label.
for i, r in geodata.loc[geodata['type'] == 'basin_boundary'].iterrows():
    # Convert the Polygon or LineString to geoJSON format
    geo_json = gpd.GeoSeries(r['geometry']).simplify(tolerance=0.000001).to_json()
    geo_json = folium.GeoJson(data=geo_json,
                              style_function=lambda x: {'fillColor': 'none',
                                                        'weight': basin_linewidth,
                                                        'color': basin_color,
                                                        'opacity': 1,
                                                        'fillOpacity': 1,
                                                        'fill': False})
    # Add popup with line description
    folium.Popup(r.description,
                 min_width=popup_width,
                 max_width=popup_width).add_to(geo_json)

    # Add the feature to the appropriate layer
    geo_json.add_to(basin_layer)

And with that, the hard part is done.

Last step: adding layers onto map

Now, if you try to visualize the map it will be empty! This is because we have not added the feature layers to the map. In this last step, we add each of the six feature layers to the map and also add the folium.LayerControl() which will allow for us to toggle the different layers on-and-off:

# Add all feature layers to the map
basin_layer.add_to(geomap)
usgs_layer.add_to(geomap)
mainstem_layer.add_to(geomap)
tributary_layer.add_to(geomap)
pourpoint_layer.add_to(geomap)
nmwdi_layer.add_to(geomap)

# Add the toggle option for layers
folium.LayerControl().add_to(geomap)

Ready for the grand reveal?

Viewing, saving, and sharing the map

Viewing your map is as easy as calling the map name at any point in the script (i.e., geomap), and folium makes it easy to save the map as an HTML file using the .save() method, as shown here:

# Save and view the map
geomap.save("basin_map.html")
geomap

Once you have your HTML saved, and you’ve taken a moment to open it on your computer and made sure that the features are situated nicely, then it comes time to share. Other users can view the maps simply by opening the HTML file on their local machine, or you can add the HTML to a website.

Concluding thoughts

I hope you’ve found some inspiration here and find a way to incorporate interactive geospatial mapping into your own projects. I don’t think it can be overstated how much an interactive visual such as a folium map can broaden access to your dataset or model.

Thanks for reading!