As discussed by Lillian Lau in a recent blog post, Reinforcement Learning (RL) methods like Proximal Policy Optimization (PPO) have become increasingly popular for optimizing adaptive policies to solve a range of challenging real-world Markov Decision Process (MDP) problems. The original OpenAI Gym and its community-maintained successor Gymnasium have aided in the development and application of RL algorithms by providing a large library of standardized “Environments” for training RL agents in Python. These environment libraries draw from classic control problems like the Cart-Pole problem as well as video games like Atari Asteroids, providing useful benchmarking functionality for training and evaluating different RL algorithms across a range of problems.
These libraries also provide standard functionality for user-specified custom environments. This facilitates the implementation of new MDP problems while maintaining interoperability with a range of RL algorithms for policy optimization. In this blog post, I will walk through the process of creating a new Gymnasium environment that can be used to simulate a reservoir operating policy in Python. I provide both a continuous state/action space version of the environment and a discrete version. These environments are designed to be interoperable with RL libraries such as Stable-Baselines3 for training adaptive policies with algorithms such as PPO, which will be the subject of a follow-up post. This post builds on my previous post on PID controllers and its simple reservoir model, which attempts to make releases that track a seasonally-varying storage target under stochastic inflows.
All code related to this post can be found in this GitHub repository.
Building a custom Gymnasium environment for reservoir operations
Gymnasium makes it possible to code nearly any MDP problem in a standardized form that can easily interface with RL libraries such as Stable-Baselines3. To do so, there are four major steps: (1) Define a new environment class that inherits from Gymnasium’s core Env class. (2) Define the observation space and action space for the MDP. (3) Define the reset function that resets the state of the environment at the beginning of each run. (4) Define the step function that implements a prescribed action, updates the state of the system, and calculates a reward for each time step. I will highlight these major steps below – readers interested in a more detailed look are referred to the GitHub repository referenced above.
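Before turning to the reservoir example, the following minimal sketch shows how these four pieces fit together for a hypothetical toy environment. The class name, state variable, and dynamics here are placeholders, and I use the same older-style reset()/step() signatures that the reservoir environments in this post use (reset() returning only the observation, step() returning four items):
import numpy as np
from gymnasium import Env
from gymnasium.spaces import Box

class Toy_env(Env):
    ### (1) new environment class inheriting from Gymnasium's core Env class
    def __init__(self):
        super(Toy_env, self).__init__()
        ### (2) 1D continuous observation & action spaces
        self.observation_space = Box(low=0., high=1., shape=(1,))
        self.action_space = Box(low=-0.1, high=0.1, shape=(1,))

    def reset(self):
        ### (3) reset internal state & return the initial observation
        self.state = 0.5
        self.t = 0
        return np.array([self.state], dtype=np.float32)

    def step(self, action):
        ### (4) apply action, update state, calculate reward, & check termination
        self.state = float(np.clip(self.state + action[0], 0., 1.))
        reward = -abs(self.state - 0.5)
        self.t += 1
        terminate = self.t >= 100
        return np.array([self.state], dtype=np.float32), reward, terminate, {}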
Defining the environment class
First, we create a Python class to represent our new environment. For the reservoir operations example, I actually created three distinct environment classes: one with continuous observation/action spaces, one with discrete observation/action spaces, and a base class with shared functionality. To start, the Reservoir_base class is defined to inherit from the Env class for Gymnasium environments.
import sys
import gymnasium
sys.modules["gym"] = gymnasium
from gymnasium import Env
from gymnasium.spaces import Dict, Discrete, Box
import numpy as np
from random import gauss, uniform
from math import pi, sin
import matplotlib.pyplot as plt
class Reservoir_base(Env):
    '''
    Basic reservoir class with shared functionality across continuous & discrete reservoir classes.
    Note this environment shouldn't be used directly; use the Reservoir_continuous or Reservoir_discrete versions,
    which inherit from this base class.
    '''
    def __init__(self):
        ### inherit from generic Env class for environments in Gym
        super(Reservoir_base, self).__init__()
        ...
In addition to inheriting behavior from the Env class, the Reservoir_base class defines many parameters that are shared across the discrete and continuous versions, related to time steps, inflows, min/max storages and releases, etc. It also defines several shared methods that update the internal states within the model: reset_basic(), update_inflow(), update_target(), update_storage_release(), update_reward(), and update_time(). Interested readers are referred to my previous post on PID controllers for more details on the state dynamics of this reservoir example, as well as this GitHub repository for implementation details for the present analysis.
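While the full state dynamics are in the repository, the following sketch gives a flavor of the logic inside two of these shared methods, using the sin, pi, and gauss imports shown earlier. The attribute names target_amplitude, inflow_base, and inflow_rand_walk_sd are hypothetical placeholders here, and the actual parameterization in the repository may differ:
def update_target(self):
    ### seasonally-varying storage target, e.g. a sine wave around the base target
    self.target = self.target_base + self.target_amplitude * sin(2 * pi * self.t / 365)

def update_inflow(self):
    ### stochastic inflow, e.g. a Gaussian random walk around a base inflow, floored at zero
    self.inflow_rand_walk += gauss(0., self.inflow_rand_walk_sd)
    self.inflow = max(self.inflow_base + self.inflow_rand_walk, 0.)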
Next, we build on the base class to define more specific classes for the continuous and discrete environments. Here is the definition of the continuous version, which inherits from the base reservoir environment class.
class Reservoir_continuous(Reservoir_base):
    '''
    Continuous-state/continuous-action version of reservoir environment.
    Inherits from Reservoir_base class, and adds specific functionality for continuous version.
    '''
    def __init__(self):
        ### inherit from Reservoir_base class for basic shared functionality
        super(Reservoir_continuous, self).__init__()
        ...
The initialization of the Reservoir_discrete class follows straightforwardly from the continuous example.
Defining the observation and action spaces
The next step is to define the observation and action spaces for the environment. The observation space is the set of time-evolving model states that are observed by the RL agent each time step in order to guide its action in the MDP problem. The action space is the set of actions determined by the agent each time step. In our simple reservoir example, the observation space is the 2D space of all possible storages and storage targets; at each time step, the observed state of the system is the pair of numbers (storage, target). The action space is the space of all possible releases.
Each environment needs to have an attribute named observation_space and an attribute named action_space. Gymnasium provides simple space classes (Dict, Box, Discrete) to aid in defining continuous and discrete spaces. Here is how the spaces are defined in the continuous reservoir environment:
### define continuous 2D observation space: storage & target
self.observation_shape = 2
self.observation_space = Dict({'storage': Box(low=self.min_storage, high=self.max_storage, shape=(1,)),
'target': Box(low=self.min_storage, high=self.max_storage, shape=(1,))})
### define continuous 1D action space: target reservoir release
self.action_space = Box(low=self.min_release, high=self.max_release, shape=(1,))
And here is the discrete version, with both the observation and action spaces divided into 15 equally-spaced discrete values:
### define discrete 2D observation space: storage & target
self.nstep_storage = 15
self.grid_storage = np.arange(self.min_storage, self.max_storage + 1e-9,
(self.max_storage - self.min_storage) / (self.nstep_storage - 1))
self.observation_space = Dict({'storage_discrete': Discrete(self.nstep_storage),
'target_discrete': Discrete(self.nstep_storage)})
### define discrete 1D action space: target reservoir release
self.nstep_release = 15
self.grid_release = np.arange(self.min_release, self.max_release + 1e-9,
(self.max_release - self.min_release) / (self.nstep_release - 1))
self.action_space = Discrete(self.nstep_release)
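To make the discretization concrete, suppose hypothetically that min_storage = 0 and max_storage = 1000. Then grid_storage holds 15 evenly spaced values from 0 to 1000, and any continuous storage can be mapped to its nearest grid index with np.argmin, the same mapping used in the reset() and step() functions below:
### illustrative grid with hypothetical bounds of 0 and 1000
grid_storage = np.arange(0., 1000. + 1e-9, 1000. / 14)        ### [0., 71.4, 142.9, ..., 1000.]
### map a continuous storage of 340 to its nearest discrete index
storage_discrete = np.argmin(np.abs(grid_storage - 340.))     ### index 5, grid value ~357.1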
Defining the reset function
Next, we need to define the reset() function that resets the state of the reservoir each time we start a new simulation or training epoch. First consider reset_basic(), defined in the Reservoir_base class, which resets key state variables. Initial storage is set to a randomly sampled value.
def reset_basic(self):
    '''
    function for resetting basic state variables each time the environment is reset
    :return: None
    '''
    self.inflow = 0.
    self.inflow_rand_walk = 0.
    self.storage = uniform(self.min_storage, self.max_storage)
    self.target = self.target_base
    self.reward = 0.
    self.t = 0
    self.terminate = False
The reset() function for the continuous Gymnasium environment builds on reset_basic():
def reset(self):
    '''
    function for resetting state variables each time the environment is reset
    :return: dictionary with the 2 observed states: storage & target.
    '''
    self.reset_basic()
    return {'storage': np.array([self.storage]), 'target': np.array([self.target])}
Thus, the reset() function simply resets key state variables and then returns the initial observed state for the RL agent. Note that the observed state values should be packaged in Numpy arrays for compatibility with Stable-Baselines3.
For the discrete-state environment, the observed states returned by reset() should be the integer indexes of the nearest values in the grid_storage array, rather than the observed state values themselves:
def reset(self):
    '''
    function for resetting state variables each time the environment is reset
    :return: dictionary with the 2 observed states: storage & target.
    '''
    self.reset_basic()
    ### get discrete indexes for storage and target to return to agent
    storage_discrete = np.argmin(np.abs(self.grid_storage - self.storage))
    target_discrete = np.argmin(np.abs(self.grid_storage - self.target))
    return {'storage_discrete': np.array([storage_discrete]), 'target_discrete': np.array([target_discrete])}
Defining the step function
Lastly, we need to define the step() function. For any Gymnasium environment, the step() function takes the prescribed action as its argument, implements this action and updates all internal model states, and returns the observed states, the reward, and the termination status at the end of the time step. Here is the step() function for the discrete environment:
def step(self, release_discrete):
    '''
    Method for updating the reservoir each time step based on prescribed release target.
    This involves updating the current inflow and seasonally-varying target,
    potentially modifying the target release to ensure mass balance, updating the storage,
    calculating the reward based on deviation from target,
    updating the time step, and terminating the simulation after a preset number of time steps.
    :param release_discrete: The index of the discrete release proposed by the RL agent.
    :return: {'storage_discrete': storage_discrete, 'target_discrete': target_discrete}: dict of observed state that is returned to RL agent;
             reward: decreases based on abs deviation of discretized storage from discretized target;
             terminate: whether episode ended due to time horizon;
             {'inflow': self.inflow, 'release': self.release, 'storage_cont': self.storage, 'target_cont': self.target}:
             additional state info not returned to agent
    '''
    ### assert that proposed action is valid
    assert self.action_space.contains(release_discrete), f"Invalid Action: {release_discrete}"
    ### map discrete action index to release_target
    release_target = self.grid_release[release_discrete]
    ### update inflow
    self.update_inflow()
    ### update release target
    self.update_target()
    ### update storage & release to preserve mass balance
    self.update_storage_release(release_target)
    ### get discrete indexes for storage and target & last release
    storage_discrete = np.argmin(np.abs(self.grid_storage - self.storage))
    target_discrete = np.argmin(np.abs(self.grid_storage - self.target))
    ### provide negative reward (ie penalty) based on abs value deviation from target.
    self.update_reward(storage_discrete, target_discrete)
    ### update time and truncate episode when time is over
    self.update_time()
    return {'storage_discrete': np.array([storage_discrete]), 'target_discrete': np.array([target_discrete])}, self.reward, self.terminate, \
           {'inflow': self.inflow, 'release': self.release, 'storage_cont': self.storage, 'target_cont': self.target}
The update_inflow(), update_target(), update_storage_release(), update_reward(), and update_time() methods are defined in the Reservoir_base class (see repository). Note that the variable release_discrete that is passed as an argument to the step() function is an integer index into the discrete grid_release array, so we need to retrieve the actual release value before using it in the simulated time step. Similarly, after updating the storage and target values within the model step, we need to find the nearest discrete values and return their indexes. Our step() function returns four items: (1) a dictionary with the updated observed states, (2) the reward value calculated for this time step, (3) the termination status of the simulation (False until a specified number of time steps have passed, at which point it becomes True), and (4) an additional dictionary of internal model state information that we want to track but that is not provided to the RL agent as observed state.
The continuous step() function is similar; a rough sketch is given below, and the interested reader is referred to the GitHub repository for the full implementation.
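This sketch is simply the discrete version above with the discretization removed; the way the array-valued action is unwrapped and the contents of the extra info dictionary are assumptions here and may differ from the repository:
def step(self, release_target):
    '''
    Sketch of the continuous step(): update the reservoir based on the prescribed release target.
    '''
    ### assert that proposed action is valid
    assert self.action_space.contains(release_target), f"Invalid Action: {release_target}"
    ### update inflow & seasonally-varying target
    self.update_inflow()
    self.update_target()
    ### update storage & release to preserve mass balance, passing the scalar release value
    self.update_storage_release(release_target[0])
    ### provide negative reward (ie penalty) based on abs value deviation from target
    self.update_reward(self.storage, self.target)
    ### update time and terminate episode when time is over
    self.update_time()
    return {'storage': np.array([self.storage]), 'target': np.array([self.target])}, self.reward, self.terminate, \
           {'inflow': self.inflow, 'release': self.release}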
Simulating behavior with a custom Gymnasium environment
Once we have defined the new Gymnasium environments, the next step is to use them for simulation! I will leave policy optimization with RL for a future post. For this post, I will demonstrate two very simple policies: (1) constant-valued prescribed releases that do not change over time, and (2) randomly-selected prescribed releases. Here I use the term “prescribed release” to distinguish it from the “actual release”, which may differ when the prescribed release is infeasible due to mass balance constraints (i.e., when the prescribed release exceeds the available water, or when excess releases are required because the reservoir is full).
The script reservoir_simulate.py runs four separate simulations: (1) continuous states/actions with constant prescribed releases, (2) continuous states/actions with random prescribed releases, (3) discrete states/actions with constant prescribed releases, and (4) discrete states/actions with random prescribed releases. For brevity, I only show the code for the continuous/random simulation here; the others are very similar (see repository).
import numpy as np
from random import seed
from reservoir_env import Reservoir_continuous, plot_continuous, Reservoir_discrete, plot_discrete
#############################################################################
### now simulate continuous-version reservoir under random releases
#############################################################################
### set random seed for consistent results
rseed = 5
seed(rseed)
### initialize reservoir & set initial state
reservoir = Reservoir_continuous()
obs = reservoir.reset()
### lists for storing state variables, actions, & rewards
storages, targets, inflows, releases, rewards = [obs['storage'][0]], [obs['target'][0]], [], [], []
### loop through daily simulation until the reservoir environment terminates after set number of time steps
terminated = False
while not terminated:
    ### generate a random release
    release = reservoir.action_space.sample()
    ### update state of reservoir with release & get reward
    obs, reward, terminated, info = reservoir.step(release)
    ### save state variables, actions, & rewards for plotting
    storages.append(obs['storage'][0])
    targets.append(obs['target'][0])
    inflows.append(info['inflow'])
    releases.append(info['release'])
    rewards.append(reward)
### plot reservoir storage, target, inflow, release, & reward time series
plot_continuous(reservoir, storages, targets, inflows, releases, rewards, 'figs/random_continuous.png')
reservoir.close()
As you can see, the Gymnasium environment makes it very simple to simulate the reservoir’s behavior. In brief, we initialize and reset the reservoir object, then iteratively update the system each time step by randomly sampling a prescribed release from the action space and stepping the reservoir object to return its updated states and rewards. We store both observed and unobserved state information each time step for visualization purposes. The simulation terminates after a predefined number of time steps. We then plot our simulated data using the plot_continuous() function, which is defined along with the Gymnasium environment classes in reservoir_env.py.
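The constant-release simulations differ only in how the prescribed release is chosen inside the loop. As a sketch, assuming a constant release fixed at the midpoint of the allowable release range (the actual constant value used in reservoir_simulate.py may differ):
### re-initialize the continuous reservoir environment
reservoir = Reservoir_continuous()
obs = reservoir.reset()
### constant prescribed release, packaged as a float32 array to match the Box action space
constant_release = np.array([0.5 * (reservoir.min_release + reservoir.max_release)], dtype=np.float32)
terminated = False
while not terminated:
    ### update state of reservoir with the same release every time step
    obs, reward, terminated, info = reservoir.step(constant_release)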
Figs. 1-4 show the behavior of each of these four combinations of continuous vs. discrete environments and constant vs. random prescribed releases, under a single random inflow scenario. Of course, none of these simulations performs well – the storages do not meaningfully track the storage targets, and the rewards are far below the maximum daily value of 10. This suggests (unsurprisingly) that neither constant nor randomized releases make for an effective operating policy.
However, with the Gymnasium architecture in place, we are now in a position to develop better operating policies with a variety of RL algorithms using libraries like Stable-Baselines3. Stay tuned for another post on this soon!