Communicating model architecture with Python Diagrams

Communicating model architecture with Python Diagrams

Model diagrams are ubiquitous in science and engineering research and practice. A model diagram is any visualization of the various components of a model and their interactions. Depending on what you are trying to convey, the components of such a diagram could be: (1) broad concepts or systems and their modeled interactions (e.g., interconnected food-energy-water systems); (2) individual steps within a multi-part sequential experiment; (3) software modules and their interactions and dependencies (e.g., UML diagrams – see this previous blog post by Tom Wild); or (4) interacting nodes within a network simulation model. Regardless of the type of component, model diagrams are critical for communicating model architecture for presentations, papers, and collaborative work. They can also be very helpful internally for model developers when brainstorming and developing new models and workflows.

The development of a model diagram typically begins with pen and paper or marker and whiteboard, while the polished final version is typically created in PowerPoint or Adobe Illustrator. Illustrator makes it possible to create highly professional diagrams (see this blog post by Jazmin Zatarain Salazar for some tips), but the process is tedious and time consuming. More recently, I’ve begun experimenting with automated model diagramming workflows as an intermediate step. Software tools are available that allow the user to easily create model diagrams using simple scripts. Although the resulting diagram typically doesn’t look as polished as what you could create in Illustrator, these tools can be valuable for “quick and dirty” mockups at the brainstorming stage. They also make it possible to automatically update the model diagram each time the underlying model is updated, which can be helpful throughout the model development process.

In this blog post, I will demonstrate how to use the Diagrams package for Python to create a model diagram. Diagrams is an open-source Python package which is built on top of the Graphviz visualization software. Both tools are primarily used for prototyping software system architecture diagrams.

Example model diagram, from Diagrams package documentation.

However, they are flexible enough to be applied to any type of model diagram. In this post, I will demonstrate the use of the Diagrams package to visualize the nodes and links in a water resources simulation model. Interested readers can find the full source code associated with this blog post in this GitHub repository.

Water resources simulation model context

I am currently in the process of writing a paper that will introduce Pywr-DRB, our new open-source water resources simulation model of the Delaware River Basin. This model includes 17 reservoir nodes plus a similar number of river junction nodes and interbasin transfer nodes. Each of these “primary nodes” is associated with a variety of “secondary nodes” with functions such as applying catchment inflows, water withdrawals/consumptions, and time lags.

Given this size and complexity, it was important to develop a model diagram which could clearly convey the relationships between different nodes. There are several pieces of information I wanted to convey with this diagram: (1) the labeled set of primary nodes and the links between them corresponding to the river network; (2) the “type” of primary node: NYC reservoirs vs non-NYC reservoirs vs interbasin transfers vs river nodes; (3) the types of secondary nodes associated with each primary node (e.g., catchment inflows, reservoir, observation gage); and (4) the time delay between each nodes (i.e., the time in days that it takes for water to flow from upstream node to downstream node).

Here is the model diagram I created with the Diagrams Python package to meet these needs. Although I have since turned to Illustrator to make a more polished final version of this diagram, I found the Diagrams package to be a very useful intermediate step when brainstorming alternative diagram structures and different ways of conveying information. In the following section, I will walk through the code used to make it.

Model diagram for a water resources simulation model, created with Diagrams package.

Creating a model diagram with Python

I would highly suggest that interested readers first check out the Diagrams documentation, which has concise and helpful examples for getting started with the package. What follows is not an exhaustive introduction, but rather a targeted introduction to Diagrams based on the

Here is the code I used for creating my diagram. I have broken it down into Sections which will be explained in sequence in the text following the code block.

##### Section 1
from diagrams import Diagram, Cluster, Edge
from diagrams.custom import Custom


##### Section 2
### filename for output png file
filename='diagrams/Pywr-DRB_model_diagram'

### location of icons, downloaded from https://thenounproject.com/
### Note: file paths need to be relative to filename above (e.g., use '../' to back out of diagrams/ directory)
reservoir_icon = '../icons/reservoir.png'
river_icon = '../icons/river.png'
gage_icon = '../icons/measurement.png'
diversion_icon = '../icons/demand.png'


##### Section 3
### customize graphviz attributes
graph_attr = {
    'fontsize': '40',
    'splines': 'spline',
}

##### Section 4
### create a diagram to depict the node network
with Diagram("", filename=filename, show=False, graph_attr=graph_attr, direction='LR'):

    ##### Section 5
    ### diversion nodes
    graph_attr['bgcolor'] = 'mediumseagreen'

    with Cluster('NYC Diversion', graph_attr=graph_attr):
        NYCDiversion = Custom('', diversion_icon)

    with Cluster('NJ Diversion', graph_attr=graph_attr):
        NJDiversion = Custom('', diversion_icon)


    ##### Section 6
    ### function for creating edge with linestyle based on time delay between nodes (days)
    def create_edge(lag_days):
        penwidth = '4'
        if lag_days == 0:
            return Edge(color='black', style='solid', penwidth=penwidth)
        elif lag_days == 1:
            return Edge(color='black', style='dashed', penwidth=penwidth)
        elif lag_days == 2:
            return Edge(color='black', style='dotted', penwidth=penwidth)


    ##### Section 7
    ### cluster of minor nodes within major node
    def create_node_cluster(label, has_reservoir, has_gage):
        if has_reservoir and label in ['Cannonsville', 'Pepacton', 'Neversink']:
            bgcolor='firebrick'
        elif has_reservoir:
            bgcolor='lightcoral'
        else:
            bgcolor='cornflowerblue'
        graph_attr['bgcolor'] = bgcolor

        with Cluster(label, graph_attr=graph_attr):
            cluster_river = Custom('', river_icon)

            if has_reservoir:
                cluster_reservoir = Custom('', reservoir_icon)
                cluster_river >> create_edge(0) >> cluster_reservoir
                if has_gage:
                    cluster_gage = Custom('', gage_icon)
                    cluster_reservoir >> create_edge(0) >> cluster_gage
                    return {'river': cluster_river, 'reservoir': cluster_reservoir, 'out': cluster_gage}
                else:
                    return {'river': cluster_river, 'reservoir': cluster_reservoir, 'out': cluster_reservoir}
            else:
                if has_gage:
                    cluster_gage = Custom('', gage_icon)
                    cluster_river >> create_edge(0) >> cluster_gage
                    return {'river': cluster_river, 'reservoir': None, 'out': cluster_gage}
                else:
                    return {'river': cluster_river, 'reservoir': None, 'out': cluster_river}


    ##### Section 8
    ### river nodes
    Lordville = create_node_cluster('Lordville', has_reservoir=False, has_gage=True)
    Montague = create_node_cluster('Montague', has_reservoir=False, has_gage=True)
    Trenton1 = create_node_cluster('Trenton 1', has_reservoir=False, has_gage=False)
    Trenton2  = create_node_cluster('Trenton 2', has_reservoir=False, has_gage=True)
    DelawareBay  = create_node_cluster('Delaware Bay', has_reservoir=False, has_gage=True)

    ### reservoir nodes
    Cannonsville = create_node_cluster('Cannonsville', has_reservoir=True, has_gage=True)
    Pepacton = create_node_cluster('Pepacton', has_reservoir=True, has_gage=True)
    Neversink = create_node_cluster('Neversink', has_reservoir=True, has_gage=True)
    Prompton = create_node_cluster('Prompton', has_reservoir=True, has_gage=False)
    Wallenpaupack = create_node_cluster('Wallenpaupack', has_reservoir=True, has_gage=False)
    ShoholaMarsh = create_node_cluster('Shohola Marsh', has_reservoir=True, has_gage=True)
    Mongaup = create_node_cluster('Mongaup', has_reservoir=True, has_gage=True)
    Beltzville = create_node_cluster('Beltzville', has_reservoir=True, has_gage=True)
    FEWalter = create_node_cluster('F.E. Walter', has_reservoir=True, has_gage=True)
    MerrillCreek = create_node_cluster('Merrill Creek', has_reservoir=True, has_gage=False)
    Hopatcong = create_node_cluster('Hopatcong', has_reservoir=True, has_gage=False)
    Nockamixon = create_node_cluster('Nockamixon', has_reservoir=True, has_gage=False)
    Assunpink = create_node_cluster('Assunpink', has_reservoir=True, has_gage=True)
    StillCreek = create_node_cluster('Still Creek', has_reservoir=True, has_gage=False)
    Ontelaunee = create_node_cluster('Ontelaunee', has_reservoir=True, has_gage=False)
    BlueMarsh = create_node_cluster('Blue Marsh', has_reservoir=True, has_gage=True)
    GreenLane = create_node_cluster('Green Lane', has_reservoir=True, has_gage=False)


    ##### Section 9
    ### tie them all together, with edge linestyles designating time delay between nodes.
    Cannonsville['reservoir'] >> create_edge(0) >> NYCDiversion
    Pepacton['reservoir'] >> create_edge(0) >> NYCDiversion
    Neversink['reservoir'] >> create_edge(0) >> NYCDiversion
    Cannonsville['out'] >> create_edge(0) >> Lordville['river']
    Pepacton['out'] >> create_edge(0) >> Lordville['river']
    Lordville['out'] >> create_edge(2) >> Montague['river']
    Neversink['out'] >> create_edge(1) >> Montague['river']
    Prompton['out'] >> create_edge(1) >> Montague['river']
    Wallenpaupack['out'] >> create_edge(1) >> Montague['river']
    ShoholaMarsh['out'] >> create_edge(1) >> Montague['river']
    Mongaup['out'] >> create_edge(0) >> Montague['river']
    Montague['out'] >> create_edge(2) >> Trenton1['river']
    Beltzville['out'] >> create_edge(2) >> Trenton1['river']
    FEWalter['out'] >> create_edge(2) >> Trenton1['river']
    MerrillCreek['out'] >> create_edge(1) >> Trenton1['river']
    Hopatcong['out'] >> create_edge(1) >> Trenton1['river']
    Nockamixon['out'] >> create_edge(0) >> Trenton1['river']
    Trenton1['out'] >> create_edge(0) >> Trenton2['river']
    Trenton1['out'] >> create_edge(0) >> NJDiversion
    Trenton2['out'] >> create_edge(0) >> DelawareBay['river']
    Assunpink['out'] >> create_edge(0) >> DelawareBay['river']
    Ontelaunee['out'] >> create_edge(2) >> DelawareBay['river']
    StillCreek['out'] >> create_edge(2) >> DelawareBay['river']
    BlueMarsh['out'] >> create_edge(2) >> DelawareBay['river']
    GreenLane['out'] >> create_edge(1) >> DelawareBay['river']

In Section 1, I import the Diagram, Cluster, Edge, and Custom classes from the Diagrams package. The Diagrams package is written in an object-oriented way that makes it very easy to create and link together different visual objects such as nodes, edges, and clusters. The meaning of these classes will become clear in the following paragraphs.

In Section 2, I define the filename for the output graphic that will be created, as well as the filenames for input graphics that are used as icons within the diagram to represent inflows, reservoirs, observation gages, and diversions. I downloaded these graphics from The Noun Project, a great resource for finding icons. I have not included the icons in the GitHub repository because I don’t own the rights to distribute them; if you are trying to replicate this code, you will need to download icons of your own, name them consistent with the code above, and place them in the “icons” directory. It is also worth noting that the Diagrams package your output filename directory (in this case “diagrams/”) as the working directory, which means that the file paths for the icons must be relative to this working directory.

In Section 3, I overwrite some default parameters for the diagram: the font size and the arrow drawing methodology. These parameters are written in a dictionary called “graph_attr”. More details on the available parameters can be found in the Diagrams and GraphViz documentation pages.

In Section 4, I create the model diagram itself with a call to the Diagram class. This call is made within a context manager (“with … :”) so that all code that follows will directly impact the final diagram. The “direction” is given as “LR”, meaning that the major flow of components will be from left to right; other options include “TB” for top to bottom, as well as “RL” and “BT”. Trial and error is the best way to find out which direction is most efficient and intuitive for a given model diagram.

In Section 5, I create the first two primary nodes associated with interbasin transfer diversions to NYC and NJ. Each node is created by calling the Custom class. The Custom class in Diagrams is used to represent a node using a custom graphic (in this case, the “diversion” icon). Diagrams also has many built-in node types that use standard icons from major software companies (e.g., AWS) or shapes common in UML-type software architecture diagrams (“programming“). However, for our purposes the Custom class is best as it allows us to provide our own icons that improve the clarity of the diagram.

The other important thing to note from Section 5 is the use of the Cluster class, again within a context manager. This creates the rectangular background around each primary node. Although the diversion nodes here only have a single secondary node inside of them, we will see later that it is possible to group multiple secondary nodes inside of a single primary node “cluster”, which is helpful for organizing the diagram. The primary node is given a label using the first argument of the Cluster class (note that the Custom class also allows labeling, but I decided for clarity to only label at the Cluster level, which is why the first argument of the Custom call is an empty string). Specific attributes of the Cluster such as the color can be specified by supplying a new graph_attr dictionary directly to the Cluster call.

In Section 6, I define a function used to create customized arrows between nodes using the Edge class. The style of the arrow is customized based on the time lag between two nodes (with solid, dashed, or dotted lines representing 0, 1, or 2 days lag, respectively).

In Section 7, I define a function used to automatically create custom Clusters for any primary node type other than a diversion. The color is based on the primary node type: NYC reservoir (firebrick), non-NYC reservoir (lightcoral), or river junction (cornflowerblue). Each cluster is automatically populated with a catchment inflows icon and/or a reservoir icon and/or an observation gage icon based on the secondary nodes included in that primary node, as specified in the function arguments. Each secondary node is created with a call to the Custom class, and the secondary nodes within each primary node are linked with solid arrows (zero-day lag) via a call to the create_edge function. We return a dictionary of secondary nodes so that they can be linked to other nodes across the broader network.

Section 8 is where all of the primary nodes associated with river junctions and reservoirs in the model are actually created via calls to the create_node_cluster function. And finally, Section 9 is where the cross-node links are created via calls to the create_edge function, with arrows formatted according to lag times within the river network.

And that’s it! The diagram can be created by running this script in Python. The Diagrams package will automatically assign the layout of the different components in a way that tries to maximize flow and minimize overlap. It is also possible to choose between a few alternative methods for assigning the layout, or even to dictate the location of some or all components manually if you desire a greater degree of control. The interested reader is referred to the documentation for more details.

An introduction to linear programming for reservoir operations (part 1)

Motivation:

If you were like me, you may have been provided an overview of linear programming (LP) methods but craved a more hands-of demonstration, as opposed to the abstract/generic mathematical formulation that is often presented in lectures. This post is for the hydrology enthusiasts out there who want to improve their linear programming understanding by stepping through a contextual example.

In this post, I provide a very brief intro to linear programming then go through the process of applying a linear programming approach to a simple hydroelectric reservoir problem focused on determining a release rate that will consider both storage targets and energy generation. In a later post, I will come back to show how the formulation generated here can be implemented within simulation code.

Content:

  • An introduction to linear programming
    • Formulating an LP
    • Solving LP problems: Simplex
    • Relevance for water resource managers
  • A toy reservoir demonstration
    • The problem formulation
      • Constraints
      • Objective
  • Conclusions and what’s next

An introduction to linear programming

Linear programming (LP) is a mathematical optimization technique for making decisions under constraints. The aim of an LP problem is to find the best outcome (either maximizing or minimizing) given a set of linear constraints and a linear objective function.

Formulating an LP

The quality of an LP solution is only as good as the formulation used to generate it.

An LP formulation consists of three main components:

1. Decision Variables: These are the variables that you have control over. The goal here is to find the specific decision variable values that produce optimal outcomes. There are two decision variables shown in the figure below, x1 and x2.

2. Constraints: These are the restrictions which limit what decisions are allowed. (For our reservoir problem, we will use constraints to make sure total mass is conserved, storage levels stay within limits, etc.) The constraints functions are required to be linear, and are expressed as inequality relationships.

3. Objective Function: This is the (single) metric used to measure outcome performance. This function is required to be linear and it’s value is denoted Z in the figure below, where it depends on two decision variables (x1, and x2).

In practice, the list of constraints on the system can get very large. Fortunately matrix operations can be used to solve these problems quickly, although this requires that the problem is formatted in “standard form” shown above. The standard form arranges the coefficient values for the constraints within matrices A and b.

Solving LP problems: keep it Simplex

Often the number of potential solutions that satisfy your constraints will be very large (infinitely large for continuous variables), and you will want to find the one solution in this region that maximizes/minimizes your objective of interest (the “optimal solution”).

The set of all potential solutions is referred to as the “feasible space” (see the graphic below), with the constraint functions forming the boundary edges of this region. Note that by changing the constraints, you are inherently changing the feasible space and thus are changing the set of solutions that you are or are not considering when solving the problem.

The fundamental theorem of linear programming states that the optimal solution for an LP problem will exist at one of corners (or a vertex) of the feasible space.

Recognizing this, George Dantzig proposed the Simplex algorithm which strategically moves between vertices on the boundary feasible region, testing for optimality until the best solution is identified.

A detailed review of the Simplex algorithm warrants it’s own blog post, and wont be explained here. Fortunately there are various open LP solvers. For example, one option in Python is scipy.optimize.linprog().

Relevance for water resource managers

Why should you keep read this post?

If you work in water resource systems, you will likely need to make decisions in highly constrained settings. In these cases, LP methods can be helpful. In fact there are many scenarios in water resource management in which LP methods can (and historically have) been useful. Some general application contexts include:

  • Resource allocation problems: LP can be used to allocate water efficiently among different users like agriculture, industry, and municipalities.
  • Operations optimization: LP can guide the operations decisions of water reservoirs, deciding on storage levels, releases, and energy production to maximize objectives like water availability or energy generation from hydropower (the topic of this demonstration below).

Toy Reservoir Demonstration: Problem Formulation

The following demo aims to provide a concrete example of applying linear programming (LP) in the realm of water resource management. We will focus on a very simple reservoir problem, which, despite its simplicity, captures the essence of many real-world challenges.

Imagine you are managing a small hydroelectric reservoir that provides water and energy to a nearby town. Your goal is to operate the reservoir, choosing how much water to release each day such that the (1) the water level stays near a specific target and (2) you maximize energy generation. Additionally, there is a legally mandated minimum flow requirement downstream to sustain local ecosystems

Here, I will translate this problem context into formal LP notation, then in a later post I will provide Python code to implement the LP decision making process within a simulation model.

Problem formulation

We want to translate our problem context into formal mathematical notation that will allow us to use an LP solver. In order to help us get there, I’ve outlines the following table with all the variables being considered here.

Decision variables

In this case the things we are controlling at the reservoir releases (Rt), and the storage level (St) at each timestep.

Constraints:

Constraints are the backbone of any LP problem, defining the feasible space which ultimately dictates the optimal solution. Without constraints, an LP problem would have infinite solutions, rendering it practically meaningless. Constraints are meant to represent any sort of physical limitations, legal requirements, or specific conditions that must be satisfied by your decision. In the context of water resource management, constraints could signify capacity limits of a reservoir, minimum environmental flow requirements, or regulatory water diversion limits.

  • Mass balance requirement:

Naturally you want to make sure that mass is conserved through the reservoir, by balancing all of the inflows, outflows and storage states, for each timestep (t):

Although we need to separate this into two separate inequalities in order to meet the “standard form” for an LP formulation. I am also going to move the decision variables (Rt and St) to the left side of the inequalities.

  • Storage limitations:

The most obvious constraints are that we don’t want the reservoir to overflow or runout of water. We do this by requiring the storage to be between some minimum (Smin) and maximum (Smax) values specified by the operator.

Again we need to separate this into two separate inequalities:

  • Maximum and minimum release limitations:

The last two constraints represent the physical limit of how much water can be discharged out of the reservoir at any given time (Rmax) and the minimum release requirement that is needed to maintain ecosystem quality downstream of the reservoir (Rmin).

Objective:

The objective function in an LP represents your goals. In the case of our toy reservoir, we have two goals:

  1. Maximize water storage
  2. Maximize energy production.

Initially when I was setting this demonstration up, there was no hydroelectric component. However, I chose to add the energy generation because it introduces more complexity in the actions (as we will see). This is because the two objectives conflict with one another. When generating energy, the LP is encouraged to release a lot of water and maximize energy generation, however it must balance this with the goal of raising the storage to the target level. This tension between the two objectives makes for much more interesting decision dynamics.

1. Objective #1: Maximize storage

Since our reservoir managers want to make sure the Town’s water supply is reliable, they want to maximize the storage. This demo scenario assumes that flood is not a concern for this reservoir. We can define objective one simply as:

Again, we can replace the unknown value (St) with the mass-balance equation described above to generate a form of the objective function which includes the decision variable (Rt):

We can also drop the non-decision variables (all except Rt) since these cannot influence our solution:

2. Objective #2: Maximize energy generation:

Here I introduce a second component of the objective function associated with the amount of hydroelectric energy being generated by releases. Whereas the minimize-storage-deviation component of the objective function will encourage minimizing releases, the energy generation component of the objective function will incentivize maximizing releases to generate as much energy each day as possible.

I will define the energy generated during each timestep as:

(1+2). Aggregate minimization objective:
Lastly, LP problems are only able to solve for a single objective function. With that said, we need to aggregate the storage and energy generation objectives described above. In this case I take a simple sum of the two different objective scores. Of course this is not always appropriate, and the weighting should reflect the priorities of the decision makers and stakeholders.

In this case, this requires a priori specification of objective preferences, which will determine the solution to the optimization. In later posts I plan on showing an alternative method which allows for posterior interpretation of objective tradeoffs. But for now, we stay limited with the basic LP with equal objective weighting.

Also, we want to formulate the optimization as a minimization problem. Since we are trying to maximize both of our objectives, we will make them negative in the final objective function.

The final aggregate objective is calculated as the sum of objectives of the N days of operation:

Final formulation:

As we prepare to solve the LP problem, it is best to lay out the complete formulation in a way that will easily allow us to transform the information into the form needed for the solver:

Conclusions and what’s next

If you’ve made it this far, good on you, and I hope you found some benefit to the process of formulating a reservoir LP problem from scratch.

This post was meant to include simulated reservoir dynamics which implement the LP solutions in simulation. However, the post has grown quite long at this point and so I will save the code for another post. In the meantime, you may find it helpful to try to code the reservoir simulator yourself and try to identify some weaknesses with the proposed operational LP formulation.

Thanks for reading, and see you here next time where we will use this formulation to implement release decisions in a reservoir simulation.