PyCharm as a Python IDE for Generating UML Diagrams

This blog post discusses some interesting features of a Python Integrated Development Environment (IDE) called PyCharm, which I have recently found useful in developing a pure Python object-oriented simulation model. Specifically, I will focus on PyCharm’s ability to generate diagrams that help visualize the design of an object-oriented model. (Also, please see my other post on PyCharm debugging features if you’re interested in that.)

Note that there are numerous Python IDEs, all of which have advantages and disadvantages. This post is not intended to be a comparison of Python IDEs, as others have done such comparisons (see this interesting write-up by Dr. Pedro Kroger). To accomplish the tasks I will describe below, I found PyCharm to be the easiest IDE/approach to use.

To give you some background, I am working on building an object-oriented river basin simulation model called PySedSim. The model will have many different classes (e.g., river channels and reservoirs), many of which are related to other classes in different ways (e.g., they may inherit behavior, or methods, from other classes). Naturally, I wish to create a diagram of all of these classes, methods, and relationships before coding them, to structure my code development process. In such a circumstance it can be helpful to create a Unified Modeling Language (UML) diagram to visualize the design of the object-oriented model (e.g., inheritance/class relationships, and methods/attributes present in each class). Other diagram types (e.g., a tree diagram or directed graph) can also be helpful, depending on what kind of program you are building.

To generate a UML diagram, I started out using tools like Gliffy, which comes with a 30-day trial, as well as just creating diagrams in MS Word. I grew frustrated with these approaches because after drawing a small UML diagram, I would decide to change the layout of the model (e.g., re-name and re-arrange classes), and then have to draw a completely new diagram! So, to work around this problem, I decided a more efficient approach could be to prepare a skeleton of a Python model, replete with class and method (and even attribute) definitions, and locate a Python IDE that could automatically generate a UML diagram for me using my Python code. This way, when I inevitably decide I don’t like the way the program is designed, I can re-write the skeleton code, and the diagram will automatically adjust itself. This takes full advantage of the fact that Python can be coded and run so quickly that it is basically executable pseudo-code.

As you may have guessed, I discovered that PyCharm is a good tool for doing exactly what I have described above (generating a class diagram from Python code). Note that only the professional version of PyCharm (available for free to students) will do this, as far as I can tell. While I will focus on PyCharm in this post, it is not the only way to generate a UML diagram from Python code. Some other options are reviewed here. I found PyCharm to be the most efficient means of generating a nicely formatted UML diagram among these options.

Here is some more information (or references to places where you can learn more) about PyCharm:

  • You can learn a bit more about PyCharm here.
  • Some of PyCharm’s key features are described here.
  • As I mentioned, if you are a student (or instructor) with a valid university email address, you can obtain a one-year (renewable) student/teacher license to use the professional version of PyCharm for free. Read more here.
  • A comparison of the free versus professional version is available here.
  • PyCharm has a blog here with useful tips and tricks.
  • Some additional things I have liked about PyCharm so far: really nice code completion and debugging features, support for Git version control, and the ability to rearrange the order of different script files in the user interface, which, for whatever reason, I have been unable to accomplish in Spyder (another Python IDE I have used).

Below I am going to explain and show how I created a class diagram using PyCharm. The code I show is written to be interpreted in Python 2, and I have only tested this process in Windows 7.

Step 1. Create some file(s)/scripts that establish different classes, and store them in a folder together. Open PyCharm, and go to File –> New Project, and select the directory where your files are stored. In this example, I am using three script files.

Step 2. Right click on one of your files in the project window. Click on Diagrams –> Show Diagram –> Python class diagram. A new .uml file will be created with a small rectangle in it representing the class you have added.

Figure 1

Step 3. Add other classes to this diagram by hitting the space bar on the screen (or right clicking anywhere on the canvas and clicking “add class to diagram”), and searching/selecting the other class files in the window that appears as below.

Figure 2

After adding the channel and reservoir classes, the diagram appears as below. Note that the Channel class contains 2 subclasses that I also added. You can rearrange them in any way that suits you.

Figure 3

Step 4. Right click on the canvas and select “Show categories”. These offer you opportunities to reveal additional information about each class. There are also buttons that appear in the inner menu on the upper left (the “m”, “i” and “f” buttons), that will let you achieve the same thing.

Figure 4

For example, selecting “Methods” and “Fields” will show what methods and/or attributes are present in each class.

Figure 5

If you make some changes to your code (e.g., rename a method in your reservoir.py file), those changes will be automatically reflected in your class diagram. Note that you can also rename (or ‘refactor’) methods and attribute values directly from within the UML diagram. For example, right-click on a method name in your UML diagram, click “Refactor”, and click “Rename”. This will quickly change the method name in all classes that inherit/customize this method from the current class, and even in classes from which this method is inherited. No changes will be made to identically named methods in completely separate classes. PyCharm will let you preview where the changes will be made before they are applied.

Figure 6

Note that to do all of this you didn’t even need to fully develop or even run the model. You will see in my files below that most of my methods (or functions) don’t even have anything defined in them yet. I simply have kept them there as placeholders for when I start to code those methods.

If you have thoughts on other IDEs that you have found to be helpful for the purposes I describe, I would love to hear your comments.

If you want to use some existing code to try this, below is the code for the three class files I used to generate the UML diagram in the images I showed.

From storage_element.py:

# import relevant libraries
import numpy as np
from ExcelDataImport import *

# Description: This file defines storage element class and methods

class Storage_Element:
    def __init__(self, name, T, Input_Data_File):
        self.name = name
        # Initialize as arrays critical reservoir attributes.
        Storage_Element.Array_Initialization(self, name, T)
        Storage_Element.Import_Data(self, name, T, Input_Data_File)
    def __repr__(self):                                        # Added method that prints object information
        return '[Model Element Info.: %s, %s, %s]' % (self.name, self.removed_load, self.BS_W)      # String to print
    def Array_Initialization(self, name, T):
        self.Q_in = np.zeros(T)
        self.Q_out = np.zeros(T)
    def Import_Data(self, name, T, Input_Data_File):
        if 'Incremental Flows' in Input_Data_File.sheetnames:
            self.Incremental_Flow_Junction = Excel_Data_Import(self.name, Input_Data_File, 'Incremental Flows', 0, T, max_distinct_data_types = None, data_name_offset = None)
        if 'Incremental Sediment Loads' in Input_Data_File.sheetnames:
            self.Incremental_Sed_Load_Junction = Excel_Data_Import(self.name, Input_Data_File, 'Incremental Sediment Loads', 0, T, max_distinct_data_types = None, data_name_offset = None)
    def Element_inflows(self):
        return

From channel.py:

import numpy as np
from storage_element import Storage_Element
from ExcelDataImport import *

# Description: This file defines the channel class and methods
class Channel(Storage_Element):
    def __init__(self, name, T, Input_Data_File):
        if hasattr(Storage_Element, '__init__'):
            Storage_Element.__init__(self, name, T, Input_Data_File) # If parent class has constructor method, then call that first.
        # Channel.Import_Data(self, T, Input_Data_File)
    def Import_Data(self, T, Input_Data_File):
        if 'Reach Specifications' in Input_Data_File.sheetnames:
            # Placed here because reservoirs are considered to be reaches in the unregulated simulation.
            [self.Routing_Coefficient, self.Routing_Exponent, self.Pool_Volume, self.Initial_Storage, self.alpha_2_3, self.beta_2_3, self.Initial_Sediment_Mass] = Excel_Data_Import(self.name, Input_Data_File, 'Reach Specifications', 1, 7, max_distinct_data_types = None, data_name_offset = None)
    def mass_balance(self, constant, t=None):
        self.removed_load = np.power(constant,2) + self.new_func(2)
        self.BS_W[t+1] = self.BS_W[t] + 3
        return self.removed_load
    def new_func(self,new_constant):
        trapped_load = 20 + new_constant
        return trapped_load
    def Flow_Routing(self):
        return
    def Sediment_Routing(self):
        return
class Diversion_Channel(Channel):
    # A diversion is a channel or pipe that is a regulated (at inflow) conveyance channel, unregulated at outflow.
    def __init__(self,name,T):
        if hasattr(Channel, '__init__'):
            Channel.__init__(self, name, T) # If parent class has constructor method, then call that first.
    def Element_inflows(self):
        return
    def Flow_Routing(self):
        return
    def Sediment_Routing(self):
        return
class Bypass_Channel(Channel):
    # A bypass is a channel or pipe that is a regulated (at inflow) conveyance channel, unregulated at outflow, but is at the upstream end of a reservoir.
    def __init__(self,name,T):
        if hasattr(Channel, '__init__'):
            Channel.__init__(self, name, T) # If parent class has constructor method, then call that first.
    def Element_inflows(self):
        return
    def Flow_Routing(self):
        return
    def Sediment_Routing(self):
        return

From reservoir.py:

import numpy as np
from storage_element import Storage_Element
from outlet import Outlet
from ExcelDataImport import *

# This file defines the reservoir class and methods
class Reservoir(Storage_Element):
    def __init__(self, name, T, Input_Data_File):
        if hasattr(Storage_Element, '__init__'):
            Storage_Element.__init__(self, name, T, Input_Data_File) #If parent class has constructor method, then call that first.
        Reservoir.Array_Initialization(self,T) # Initialize arrays reservoirs will have.
        Reservoir.Import_Data(self, T, Input_Data_File)
        # Reservoir.Import_Reservoir_Data(self,....)
        # Every reservoir must have outlets (orifices) of some sort
        self.Orifices = Outlet(self.name, T, Input_Data_File)
    def Array_Initialization(self, T):
        # Initialize Arrays
        self.Q_downstream = np.zeros(T)
    def Import_Data(self, T, Input_Data_File):
            # Worksheet_Names_Preferences = {} # Dictionary stores key of worksheet names, and a list of the following:
        if 'Evaporation Data' in Input_Data_File.sheetnames:
            self.Monthly_Evap_Data = Excel_Data_Import(self.name, Input_Data_File, 'Evaporation Data', 1, 12, max_distinct_data_types = None, data_name_offset = None)
    def mass_balance(self, constant, t):
        self.removed_load = np.power(constant, 2) + self.new_func(2)
        self.BS_W[t+1] = self.BS_W[t] + 3 #- self.Orifice[3].Q_overflow[t]
        return self.removed_load
    def new_func(self,new_constant):
        trapped_load = 20 + new_constant
        return trapped_load

Embedding figure text into a LaTeX document

Oftentimes we have to create plots and schematic drawings for our publications. These figures are then included in the final document either as bitmaps (png, jpeg, bmp) or as vectorized images (ps, eps, pdf). Some inconveniences that arise from this process and show up in the final document are:

  • Loss of image quality due to resizing the figure (bitmaps only)
  • Different font type and size from the rest of the text
  • Limited ability to resize the figure while keeping its text readable
  • No straightforward way to add equations to the figure

If the document is being created in LaTeX, it is possible to overcome all these inconveniences by exporting your figure in either SVG or PostScript format and converting it to PDF+LaTeX format with Inkscape. This format allows the LaTeX engine to treat figure text like any other text in the document, while the lines and curves are kept as a vectorized image.

EXPORTING FIGURE

The process for creating a PDF+LaTeX figure is described below:

1 – Create your figure and save it in either SVG or PostScript format. Inkscape, MATLAB, gnuplot, and Python are examples of software that can export at least one of these formats. If your figure has any equations, remember to type them in LaTeX format in the figure.

2 – Open your figure with Inkscape, edit it as you see necessary (figure may need to be ungrouped), and save it.

3.0 – If you are comfortable using a terminal and the figure does not need editing, open a terminal pointing to the folder where the figure is and type the following command (without the $). If this works, you can skip steps 3 and 4 and go straight to step 5.

$ inkscape fig1.svg --export-pdf fig1.pdf --export-latex

3 – Click on File -> Save As…, select “Portable Document Format (*.pdf)” as the file format, and click on Save.

ss1

4 – On the Portable Document Format window that will open, check the option “PDF+LaTeX: Omit text in PDF, and create LaTeX file” and click on OK.

ss2

Inkscape will then export two files, both with the same name but one with the pdf extension and the other with the pdf_tex extension. The pdf file contains all the drawing elements, while the pdf_tex file contains all the text of the figure and calls the pdf file.

5 – In your LaTeX document, include the graphicx package with the command \usepackage{graphicx}.

6 – To include the figure in your document, use \input{your_figure.pdf_tex}. Do not use the conventional \includegraphics command; otherwise you will end up with an error or with a figure with no text. If you want to scale the figure, type \def\svgwidth{240bp} (where 240bp is the desired width of the figure) on the line before the \input command. Do not use the conventional [scale=0.5] option, which would cause an error. Some instructions are available in the first lines of the pdf_tex file, which can be opened with a regular text editor such as Notepad.
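
For reference, here is a minimal sketch of a document that puts steps 5 and 6 together, assuming the fig1.pdf / fig1.pdf_tex pair produced above sits next to the .tex file (the 0.8\columnwidth scaling is just an example):

\documentclass{article}
\usepackage{graphicx}  % required by the pdf_tex file
% \usepackage{color}   % may also be needed if the figure uses colored text

\begin{document}

\begin{figure}
  \centering
  \def\svgwidth{0.8\columnwidth}  % scale the figure; a length like 240bp also works
  \input{fig1.pdf_tex}            % note: \input, not \includegraphics
  \caption{Figure exported from Inkscape in PDF+LaTeX format.}
\end{figure}

\end{document}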

Below is a comparison of how the same figure would look in the document if exported in PDF+LaTeX and png formats. It can be seen that the figure created following the procedure above looks smoother, and its text style matches that of the paragraphs above and below, which is easier on the eyes. Also, the text can be selected and searched with any pdf viewer. However, the user should be aware that, since text font size is not affected by the scaling of the figure, some text may end up bigger than the boxes that are supposed to contain it, as well as too close to or too far from lines and curves. The former can be clearly seen in the figure below. This, however, can be easily fixed with software such as Inkscape and/or with the editing tips described in the following section.

ss3

TIPS FOR TEXT MANIPULATION AFTER FIGURE IS EXPORTED

If you notice a typo or poorly positioned text in the figure after it has been exported and inserted in your document, there is an easier way of fixing it than exporting the figure again. If you open the pdf_tex file (your_figure.pdf_tex) with a text editor such as Notepad, you can change any text and its position by changing the parameters of the \put commands inside the \begin{picture}\end{picture} LaTeX environment.

For example, it would be better if the value 1 on the x and y axes of the figures above showed as 1.0, so that its precision is consistent with that of the other values. The same applies to 2 vs. 2.0 on the x axis. This can be fixed by opening file fig1.pdf_tex and replacing lines:

\put(0.106,0.76466667){\makebox(0,0)[rb]{\smash{1}}}%
\put(0.53916667,0.0585){\makebox(0,0)[b]{\smash{1}}}%
\put(0.95833333,0.0585){\makebox(0,0)[b]{\smash{2}}}%

by:

\put(0.106,0.76466667){\makebox(0,0)[rb]{\smash{1.0}}}%
\put(0.53916667,0.0585){\makebox(0,0)[b]{\smash{1.0}}}%
\put(0.95833333,0.0585){\makebox(0,0)[b]{\smash{2.0}}}%

Also, one may think that the labels of both axes are too close to the axes. This can be fixed by replacing lines:

\put(0.02933333,0.434){\rotatebox{90}{\makebox(0,0)[b]{\smash{$x\cdot e^{-x+1}$}}}}%
\put(0.539,0.0135){\makebox(0,0)[b]{\smash{x}}}%

by:

\put(0.0,0.434){\rotatebox{90}{\makebox(0,0)[b]{\smash{$x\cdot e^{-x+1}$}}}}%
\put(0.0,0.0135){\makebox(0,0)[b]{\smash{x}}}%

With the modifications described above and resizing the legend box with Inkscape, the figure now would look like this:

ss4

Don’t forget to explore all the editing features of Inkscape. If you export a figure from gnuplot or MATLAB and ungroup it into small pieces with Inkscape, Inkscape gives you the freedom to rearrange and fine-tune your plot.

Parameter Description Files for the MOEAFramework algorithms and for the Borg MOEA

The parameter description files describe the parameters for each MOEA along with their feasible ranges.  Each line lists the name of a parameter, followed by its lower and upper bounds.  To create the description files, copy the corresponding text below into a text editor and save it as, for instance, AMALGAM_Params.txt for the AMALGAM description file.  Each MOEA needs its own separate description file.

AMALGAM:

populationSize 10 1000
maxEvaluations 1000 10000
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 500.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 500.0
de.F 0.0 1.0
de.K 0.0 1.0
metro.jumpRate 0.5 5.0
pso.c1 0.0 2.0 
pso.c2 0.0 2.0 
pso.chi 0.0 1.0

Borg MOEA:

initialPopulationSize 10 1000
maxEvaluations 1000 10000
injectionRate 0.1 1.0
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 500.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 500.0
de.crossoverRate 0.0 1.0
de.stepSize 0.0 1.0
um.rate 0.0 1.0
pcx.parents 2.0 10.0
pcx.offspring 1.0 10.0
pcx.eta 0.0 1.0
pcx.zeta 0.0 1.0
undx.parents 2.0 10.0
undx.offspring 1.0 10.0
undx.eta 0.0 1.0
undx.zeta 0.0 1.0
spx.parents 2.0 10.0
spx.offspring 1.0 10.0
spx.epsilon 0.0 1.0

ε-MOEA:

populationSize 10 1000
maxEvaluations 1000 10000
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 500.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 500.0

ε-NSGAII:

populationSize 10 1000
maxEvaluations 1000 10000
injectionRate 0.1 1.0
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 500.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 500.0

GDE3:

populationSize 10 1000
maxEvaluations 1000 10000
de.crossoverRate 0.0 1.0
de.stepSize 0.0 1.0

HypE:

populationSize 10 1000
maxEvaluations 1000 10000
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 500.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 500.0

IBEA:

populationSize 10 1000
archiveSize 10 1000
maxEvaluations 1000 10000
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 500.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 500.0

MOEAD:

populationSize 10 1000
maxEvaluations 1000 10000
neighborhoodSize 0.0 0.2
delta 0.0 1.0
eta 0.0 0.02
de.crossoverRate 0.0 1.0
de.stepSize 0.0 1.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 500.0

NSGAII:

populationSize 10 1000
maxEvaluations 1000 10000
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 500.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 500.0

OMOPSO:

populationSize 10 1000
archiveSize 10 1000
maxEvaluations 1000 10000
perturbationIndex 0.0 1.0

SPEA:

populationSize 10 1000
archiveSize 10 1000
maxEvaluations 1000 10000
sbx.rate 0.0 1.0
sbx.distributionIndex 0.0 500.0
pm.rate 0.0 1.0
pm.distributionIndex 0.0 500.0

The previous parameter description files were retrieved from the diagnostics study in Reed, P.M., Hadka, D., Herman, J., Kasprzyk, J., and Kollat, J., “Evolutionary Multiobjective Optimization in Water Resources: The Past, Present, and Future” (Editor Invited Submission to 35th Anniversary Special Issue), Advances in Water Resources, v51, 438-456, 2013.  In their study, they used 10,000 NFE as the lower bound, and the upper bound varies depending on the problem.  You can also vary the maxEvaluations parameter and adjust it according to the problem’s difficulty.
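
These description files are plain text, so they are easy to parse if you want to generate your own parameter samples for a diagnostic experiment. Below is a minimal Python sketch; the file name AMALGAM_Params.txt and the choice of 100 uniform random samples are illustrative assumptions only, not part of the study cited above.

import numpy as np

# Parse a parameter description file: each line is "name lower upper".
# The file name below is just an example.
names, lower, upper = [], [], []
with open('AMALGAM_Params.txt') as f:
    for line in f:
        parts = line.split()
        if len(parts) == 3:
            names.append(parts[0])
            lower.append(float(parts[1]))
            upper.append(float(parts[2]))

lower = np.array(lower)
upper = np.array(upper)

# Draw 100 uniform random samples within the feasible ranges
# (illustrative only; a Latin hypercube design is often preferred).
samples = lower + (upper - lower) * np.random.random((100, len(names)))
np.savetxt('AMALGAM_samples.txt', samples)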

Useful Linux commands to handle text files and speed up work

Most of us, nice and normal human beings, tend to prefer programs with GUIs over typing commands on a command prompt because the former looks more “real” and is more visual than the latter. However, one thing we don’t know (or, sometimes, don’t want to know) is that learning a few terminal commands can dramatically increase productivity. These commands can save us a lot of time by sparing us from opening and closing programs, navigating through menus and windows, moving the mouse around, as well as moving the hand back and forth from the mouse to the keyboard.

This post will mention and briefly describe some useful “intermediate level” Linux commands (some basic commands are presented in this post by Jon Herman), which can be called from a Linux OS, Cygwin (mostly), or Mac. Among the numerous tedious tasks these commands can greatly simplify is the particularly interesting chore of handling text files, be they scripts or data files. Commands for other tasks are covered as well. Keep in mind that the symbol * is a wild card (character that can mean any string of characters when searching for files), which is really useful when the goal is to repeatedly apply one command to multiple files. For all commands listed here skip the “$” character.

DELIMITER SEPARATED FILES HANDLING

  • Keep columns 1 to 30 and column 36 of a comma separated file (removing columns 31 to 35; cut numbers fields starting from 1) and export the output to another file.
    $ cut -d',' -f1-30,36 input.file >> output.file

    (More information on the post by Joe Kasprzyk)

  • Print only columns 2 and 4 (starting from 1) of a comma separated file.
    $ awk -F "," '{print $2,$4}' input.file >> output.file
  • Count number of columns in a file separated either by spaces or commas:
    $ head -1 input.file | sed 's/[^, ]//g' | wc -c
    or:
    $ awk -F "[, ]" 'END{print NF}' input.file
  • Print lines of a comma separated file in which the value in the 2nd column is lower than 100 and the value in the 5th column is higher than 0.3:
    $ awk -F "," '$2<100 && $5>0.3' input.file >> output.file
  • Print lines between 10th and 20th lines (not inclusive) of a file:
    $ awk 'NR>10 && NR<20' input.file >> output.file
  • Add a string to the end of multiple files:
    $ echo "your string" | tee -a *.set
  • Add a string to the end of one file:
    $ echo "your string" >> file

FILE SEARCHING

  • Find all text files in a folder that contain a certain string:
    $ grep -rn './folder' -e your_string
  • Find files recursively (* is a wildcard):
    $ find -type f -name name_of*.file

FILES INFO

  • See the contents of a zip/tar file without extracting it. Press q to quit.
    $ less file.tar
  • Count number of lines in a file:
    $ wc -l your.file
  • List all files with a certain extension in a directory:
    $ ls *.ext
  • Print files and folders in tree fashion:
    $ tree
  • Print the size of all subfolders and files in (sub)folders to a certain max depth in the folder hierarchy:
    $ du -h -a --max-depth=2

IMAGE HANDLING

  • Convert svg files to png (you need to have Inkscape installed):
    $ inkscape input.svg -d 300 -e output.png
  • Convert svg files to pdf-latex (you need to have Inkscape installed):
    $ inkscape input.svg --export-pdf output.pdf --export-latex
  • Rotate a picture:
    $ convert Fig6_prim.png -rotate 90 Fig6_prim_rotated.png

MISCELLANEOUS

  • See the history of commands you have typed:
    $ history
  • See a calendar (month and year optional):
    $ cal [month] [year]
  • Combine pdf files into one (you need to have pdftk installed):
    $ pdftk file1.pdf file2.pdf file3.pdf cat output newfile.pdf
    or, to merge all pdf files in a directory:
    $ pdftk *.pdf cat output newfile.pdf

    In order to see how to combine only certain pages of pdf files, as well as how to split all pages into separate pdf files, see this page.

  • See the manual of a command:
    $ man command

Another useful idea is that of piping the output of one command into another command. For example, if you want to print the number of files in a directory, you can pipe the output of the ls command (list all files in a directory) to the wc -l command (count the number of lines in a string). For this, use the “|” character:

$ ls | wc -l

However, you may instead want to check the number of lines in each of the files in a directory (say, 20 of them) at once, which can also be achieved by combining the ls and wc commands with the command xargs. The command then would look like:

$ ls | xargs wc -l

The command xargs takes the output from ls and passes the file names as arguments to wc -l, so you get a line count for every file listed by ls (plus a total).

Hope all this saves you some time!

New Federal Flood Frequency Analysis Guidelines

This post is a bit off topic for this blog, but I think it should be interesting to our readers.  The current procedure used by Federal agencies (FEMA, Bureau of Reclamation, USGS, Army Corps, etc.) to assess flood risk at gauged sites is described in Bulletin 17B.  That procedure is used to estimate flood risk for things like FEMA flood maps, to set flood insurance rates, and to design riparian structures like levees.  Given how important the procedure is, many people are surprised to learn that it has not been updated since 1982, despite improvements in statistical techniques and computational resources, and the additional data that is now available.  This post won’t speculate as to why change has taken so long, but I will briefly describe the old procedures and the proposed updates.

Bulletin 17B Procedure:

Dr. Veronica Webster has written an excellent brief history on the evolution of  national flood frequency standards in the US dating back to the 1960s.  For this post, we focus on the procedures adopted in 1982.  In short, Bulletin 17B recommends fitting the log-Pearson Type III distribution to the annual peak flow series at a gauged location, using the log-space method of moments.  In other words, one takes the logarithms of the flood series and then uses the mean, variance, and skew coefficient of the transformed series to fit a Pearson Type III distribution.  The Pearson Type III distribution is a shifted Gamma distribution.  When the population skew coefficient is positive the distribution is bounded on the low end, and when the population skew is negative the distribution is bounded on the high end.  When the population skew is zero, the Pearson Type III distribution becomes a normal distribution.  For those wanting more information, Veronica and Dr. Jery Stedinger have written a nice three-part series on the log-Pearson Type III, starting with a paper on the distribution characteristics.
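
To make the log-space method of moments concrete, here is a minimal Python sketch of the basic calculation (the peak flow values are made up for illustration, and the Bulletin’s additional refinements, such as skew bias correction, regional skew weighting, and outlier handling, are omitted):

import numpy as np
from scipy import stats

# Made-up annual peak flows (cfs), for illustration only.
peaks = np.array([3200., 4100., 2800., 5600., 3900., 7200., 3100.,
                  4800., 2600., 6100., 3500., 4300., 5200., 2900.])

logq = np.log10(peaks)

# Log-space sample moments.
mu = logq.mean()
sigma = logq.std(ddof=1)
g = stats.skew(logq, bias=False)  # station (sample) skew coefficient

# Pearson Type III fitted to the log-transformed flows;
# the 100-year flood is the 0.99 quantile.
q100 = 10 ** stats.pearson3.ppf(0.99, g, loc=mu, scale=sigma)
print('Estimated 100-year flood: %.0f cfs' % q100)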

Unfortunately, data are not always well behaved, and the true distribution of floods is certainly not log-Pearson Type III, so the Bulletin takes measures to make the procedure more robust.  One such measure is the use of a regional skew coefficient to supplement the sample skew coefficient computed from the transformed flood record.  The idea here is that estimating the skew coefficient from short samples is difficult because the estimate can be noisy, so one instead uses a weighted average of the sample skew and a regional skew.  The relative weights are inversely proportional to the mean squared error of each skew, so the more accurate skew is given more weight.  The Bulletin recommends obtaining a regional skew from either an average of nearby and hydrologically similar records, a model of skew based on basin characteristics (like drainage area), or the use of the National Skew Map (see below), based on a 1974 study.
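
As a small numerical illustration of that weighting (all values below are made up), the weighted skew is simply:

# Weighted skew: weights are inversely proportional to each estimate's
# mean squared error (MSE). All values below are illustrative only.
G_station, MSE_station = 0.25, 0.40      # sample (station) skew and its MSE
G_regional, MSE_regional = -0.10, 0.30   # regional skew and its MSE

G_weighted = (MSE_regional * G_station + MSE_station * G_regional) / \
             (MSE_station + MSE_regional)
# -> weights of about 0.43 on the station skew and 0.57 on the regional skew,
#    giving G_weighted of about 0.05 in this example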

National Skew Map provided in Bulletin 17B [IAWCD, 1982, Plate I]

A second concern is that small observations in a flood record might exert unjustifiable influence in the analysis and distort the estimates of the likelihood of the large floods of interest.  Often such small observations are caused by different hydrologic processes than those that cause the largest floods.  In my own experience, I’ve encountered basins wherein the maximum annual discharge is zero, or nearly zero in some years.  The Bulletin recommends censoring such observations, meaning that one removes them from the computation of the sample moments, then applies a probability adjustment to account for their presence.  The Bulletin provides an objective outlier detection test to identify potential nuisance observations, but ultimately leaves censoring decisions to the subjective judgement of the analyst.  The proposed objective test is the Grubbs-Beck test, which is a 10% significance test of whether the smallest observation in a normal sample is unusually small.

Updates:

The Hydrologic Frequency Analysis Work Group (HFAWG) is charged with updating the Bulletin.  Their recommendations can be found on-line, as well as a testing report which compares the new methods to the old Bulletin.  Bulletin 17C is currently being written.  Like Bulletin 17B, the new recommendations also fit the log-Pearson Type III distribution to the annual maximum series from a gaged site, but a new fitting technique is used: the expected moments algorithm (EMA).  This method is related to the method of moments estimators previously used, but allows for the incorporation of historical/paleo flood information, censored observations, and regional skew information in a unified, systematic methodology.

High water mark of the 1530 flood of the Tiber River at Rome

Historical information might include non-systematic records of large floods: “The town hall has been there since 1760 and has only been flooded once,”  thus providing a threshold flow which has only been crossed once in 256 years.  EMA can include that sort of valuable community experience about flooding in a statistically rigorous way! For the case that no historical information is available, no observations are censored, and no regional skew is used, the EMA moment estimators are exactly the same as the Bulletin 17B method of moments estimators.  See Dr. Tim Cohn’s website for EMA software.

The EMA methodology also has correct quantile confidence intervals (confidence intervals for the 100-year flood, etc.), which are more accurate than the ones used in Bulletin 17B.

Another update to the Bulletin involves the identification of outlier observations, which are now termed Potentially Influential Low Flows (PILFs).  A concern with the old detection test was that it rarely identified multiple outliers, even when several very small observations are present.  In fact, the Grubbs-Beck test, as used in the Bulletin, is only designed to test the smallest observation and not the second, third, or kth smallest observations.  Instead, the Bulletin adopts a generalization of the Grubbs-Beck test designed to test for multiple PILFs (see this proceedings, full paper to come).  The new test identifies PILFs more often and will identify more PILFs than the previous test, but we’ve found that use of a reasonable outlier test actually results in improved quantile efficiency when fitting log-Pearson Type III data with EMA  (again, full paper in review).

The final major revision is the retirement of the skew map (see above) and the adoption of language recommending more modern techniques, like Bayesian Generalized Least Squares (BGLS).  In fact, Dr. Andrea Veilleux, along with her colleagues at the USGS, has been busy using iterations of BGLS to generate models of regional skew across the country, including California, Iowa, the Southeast, the Northwest, Missouri, Vermont, Alaska, Arizona, and California rainfall floods.  My master’s work was in this area, and I’d be happy to write further on Bayesian regression techniques for hydrologic data, if folks are interested!

If folks are interested in software that implements the new techniques, the USGS has put out a package with good documentation.

That’s it for now!

Jon

Random Seed Analysis for the Borg MOEA using DTLZ2, 3 objective instance

Why do we want to perform a Random Seed Analysis?

A random seed analysis will provide insight into how sensitive an algorithm is to its initial population. Keep in mind that MOEAs are stochastic search tools that vary in their initial configuration and their probabilistic search operators, so two independent algorithmic runs will likely yield different results.  We account for this variability by running the algorithm with multiple random seeds and then averaging the results, or generating a reference set across all of them.  At the end of this exercise, you should be able to perform a random seed analysis for the Borg MOEA using a simple test case.

Pre-requisites:

Access to a Unix or Linux  Terminal

The Borg MOEA. For detailed instructions on how to request and compile the Borg MOEA,  read Basic Borg MOEA use.  I strongly suggest completing the DTLZ2 Exercise before performing this tutorial.  It will provide you with background on this test instance and will help you perform the initial setup.

The latest version of the MOEAFramework.  You can retrieve it from the MOEAFramework website.  Click on the Downloads tab and double-click on the Compiled Binaries; this should automatically download the MOEAFramework-2.5.tar.gz file.  Next, navigate to your Downloads folder from your terminal and type the following command to extract the files:

tar -xzf MOEAFramework-2.5.tar.gz

Follow the instructions in the MOEAFramework User Guide  to complete the setup.

Step 1.  Collect runtime dynamics for multiple seeds from the command line.

From your terminal, navigate to the dmh309-serial-borg-moea-eb8f1ca9d713 directory (previously used in the DTLZ2 Exercise); we’ll refer to this as “the borg directory” from here on.  These are the files that you should see:

borgfolder

It is possible to run several seeds directly from your terminal with the following command:

for i in {1..10}; do ./borg.exe -v 12 -o 3 -s $i -e 0.01,0.01,0.01 -l 0,0,0,0,0,0,0,0,0,0,0,0 -u 1,1,1,1,1,1,1,1,1,1,1,1 -n 1000 -R dtlz2_3_borg_S$i.runtime -F 10 python dtlz2.py; done

This command is very similar to the one we used in the DTLZ2 exercise, except that here we specify the seed flag -s, which sets the pseudorandom number seed.  For instance, every time we run -s 1, we should get the same results for that seed number.  In this exercise we want to run 10 different random seeds, so we loop ten times and $i takes the values 1 through 10.  We also use the runtime flag -R to enable the runtime output; dtlz2_3_borg_S$i.runtime is the name of the file in which the runtime output will be printed.  Finally, the -F flag specifies the output frequency; here we use a frequency of 10.

Once you run the previous command, you should see ten files with the .runtime extension and ten files with the .out extension in your borg directory.  Next, type the following:

mkdir runtime

This command creates a directory named runtime.  You can move all your runtime output into this newly created folder with the following command:

mv *.runtime runtime

Inside your borg directory, you should now see a runtime directory with all your runtime files.  You can also create an output directory and move all your .out files into it to keep everything organized.

Step 2.  Select the objectives

Open one of your runtime files; you should see 15 columns. The first 12 columns represent the decision variables and the last three columns are the objectives.  To select the last three columns from all ten runtime files at once, navigate to your runtime directory from your terminal and run the following command:

for i in {1..10}; do awk 'BEGIN {FS=" "}; /^#/ {print $0}; /^[-]?[0-9]/ {printf("%s %s %s\n", $13, $14, $15)}' dtlz2_3_borg_S$i.runtime > dtlz2_3_runtime_S$i.obj; done

In your runtime directory, you should see 10 files with the .obj extension; each of these files should have only three columns. Hold on to these files; we’ll be using them later.

Step 3.  Generate the DTLZ2 3 objective analytical set using the MOEAFramework

The MOEAFramework can generate the reference set for any problem with a known analytical solution.  For simplicity, place the MOEAFramework-2.0-Executable.jar (or the most recent version) in your borg directory and execute the following command:

 java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.SetGenerator -b DTLZ2_3 -n 1000 -o DTLZ2_3.pf -e 0.01,0.01,0.01

This command should generate the Pareto front for the DTLZ2 3-objective problem, using epsilon values of 0.01 for each objective.  We will use the DTLZ2_3.pf file for our metric calculations.

Step 4.  Calculate Metrics

For simplicity, copy the MOEAFramework-2.0-Executable.jar (or more recent version) and the DTLZ2_3.pf file generated in Step 3 into your runtime folder.  From your terminal, access your runtime folder and execute the following command:

for i in {1..10}; do java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.ResultFileEvaluator --reference DTLZ2_3.pf --input dtlz2_3_runtime_S$i.obj --problem DTLZ2_3 --output dtlz2_3_borg_S$i.metrics; done

The previous command calculates metrics for each of the ten seeds, comparing each to the DTLZ2 analytical set.  Open one of your metrics files; you can see that the file has 6 columns, one for each metric.  We usually care about hypervolume (first column), generational distance (second column), and epsilon indicator (fifth column).  Read more in MOEA Performance Metrics.

Subsequently, type the following command to average the metrics:

java -cp MOEAFramework-2.0-Executable.jar org.moeaframework.analysis.sensitivity.SimpleStatistics --mode average --output dtlz2_3_borg.average dtlz2_3_borg_*.metrics

Step 5. Plot your results

We can now plot hypervolume, generational distance, and epsilon indicator.  A Python example for plotting the average metrics is provided below.  You can also plot the metrics for each of the ten seeds to see the individual traces (a sketch of this is given after the code below).

import numpy as np
import matplotlib.pyplot as plt
metrics = np.loadtxt('dtlz2_3_borg.average', delimiter=' ')
f, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
ax1.plot(metrics[:,0]) #Hypervolume plot
ax1.set_ylabel('Hypervolume')
ax1.set_title('Runtime output')
ax2.plot(metrics[:,1]) #Generational distance plot
ax2.set_ylabel('Generational Distance')
ax3.plot(metrics[:,4], color='r') # Epsilon indicator plot
ax3.set_ylabel(r'Additive $\epsilon$-indicator')
f.subplots_adjust(hspace=0)
plt.setp([a.get_xticklabels() for a in f.axes[:-1]], visible=False)
plt.show()
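
If you also want to see the traces for the individual seeds rather than just the average, a minimal sketch along the same lines (assuming the ten .metrics files from Step 4 are in the working directory) is:

import numpy as np
import matplotlib.pyplot as plt

# Plot the hypervolume trace of each of the ten seeds on one set of axes.
for i in range(1, 11):
    # Any header lines in the .metrics files start with '#'.
    m = np.loadtxt('dtlz2_3_borg_S%d.metrics' % i, comments='#')
    plt.plot(m[:, 0], label='Seed %d' % i)  # first column is hypervolume
plt.xlabel('Runtime snapshot')
plt.ylabel('Hypervolume')
plt.title('Hypervolume traces across random seeds')
plt.legend(fontsize='small')
plt.show()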

From the random seed analysis, we expect hypervolume to increase over time denoting a better diversity of tradeoffs. Generational distance and epsilon indicator should decrease, indicating improved convergence and consistency relative to the analytical set.