June 11, 2023 by Lillian Lau

Parameter estimation of Hidden Markov Models using a Markov Chain Monte Carlo method in Julia

I recently completed a course Final Project in which I attempted to implement a Hidden Markov Model (HMM) parameterized using the Markov Chain Monte Carlo method. In this post, I will briefly introduce HMM models as well as the MCMC method. I will then walk you through an exercise implementing a HMM in the Julia programming language and share some preliminary results from my exploration in combining both these methods. All the code found in this exercise can be located in this GitHub repository.

Note: The exercise in this post will require some familiarity with Julia. For a quick introduction to this relatively recent newcomer to the programming languages scene, please refer to Peter Storm’s blog post here.

Hidden Markov Models

A Hidden Markov Model (HMM) is a statistical tool used to model different types of time series data ranging from signal processing, hydrology, financial statistics, speech recognition, and econometrics (Rabiner, 1990; Rydén, 2008), where the underlying process is assumed to be Markovian. Specifically, it has been found to be effectively at modeling systems with long-term persistence, where past events have effects on current and future events at a multi-year scale (Akintug and Rasmussen, 2005; Bracken et al., 2014). Recent studies have also found that it is possible to use HMMs to model the dynamics of streamflow, where the HMM state space represents the different hydrological conditions (e.g. dry vs. wet) and the observation space represents the streamflow data (Hadjimichael et al., 2020; 2020a). These synthetic streamflows are computationally generated with the aid of packages such as hmmlearn in Python and HMMBase in Julia. An example implementation of hmmlearn can be found in Julie Quinn’s blog post here. She also wrote the prequel to the implementation post, where she dives into the background of HMMs and applying Expectation-Maximization (EM) to parameterize the model.

Markov Chain Monte Carlo

The EM approach of parameterizing HMMs reflects a frequentist approach that assumes the existence of one “true” value for each HMM parameter. In this exercise, I will instead use a Bayesian approach via the Markov Chain Monte Carlo method to generate and randomly sample potential parameter (model prior) values from a Markov chain, which are statistical models that approximate Markovian time series. In these time series, the probability of being in any given state s_t at current time t is dependent only on states prior to s_t. Using the MCMC approach, a Markov Chain is constructed for each HMM parameter such that its stationary distribution (distribution of the parameters at equilibrium) is the distribution of its associated model posterior (the likelihood of a model structure after, or a posteriori, obtaining system observed data). The model priors (assumed model structure prior to having any knowledge about the system) are randomly sampled from their respective chains — hence the term “Monte Carlo”. These samples are then used to estimate the statistical properties of the distribution of each parameter.

Overall, MCMC is a powerful method to estimate the statistical properties of high-dimensional problems with many parameters. Nonetheless, it is beholden to its own set of limitations that include high computational expense and the inability to guarantee convergence. Figure 1 below summarize some of the benefits and drawbacks of using MCMC parameterization:

**Figure 1**: The pros and cons of using MCMC parameterization.

Setting up the parameters

In this exercise, I will be using the HMM to approximate the statistical properties of streamflow data found in the annualQ.csv file previously used also by Rohini in her post on creating HMMs to generate synthetic streamflow series and by the MSD UC Ebook HMM training Jupyter Notebook. This streamflow time series originated from a USGS outlet gauge in the Upper Colorado River Basin. It records 105 years’ worth of streamflow from 1909 and 2013 (inclusive). It was previously found that it was best described using two states (wet and dry; Bracken et al, 2014). In this exercise, I also assume that this time series is best approximated by a first-order HMM where the state at current time t (s_t) is only affected by the state at the timestep immediately prior to it (s_t-1).

In this exercise, I will be using MCMC to parameterize the following HMM parameters:

**Table 1**: Summary of all HMM parameters to be adjusted using the MCMC approach.

From Table 1 above:

P(a_i) is the initial state probability for state i
P(A_ij) is the transition probability from state i to state j
μ_i and σ_i are the mean and standard deviation for the vector of streamflow observations x for a given state i

Now that the dataset and model parameters have been introduced, we will proceed with demonstrating the code required to implement the HMM model with MCMC parameterization.

Implementing the streamflow HMM with using MCMC parameterization

Before begin, I recommend downloading and installing Julia 1.8.5. Please find the Windows and Mac 64-bit installer here. Once Julia is set up, we can download and install all required packages. In Julia, this can be done as follows:

# make sure all required packages are installed
import Pkg
Pkg.activate(@__DIR__);
Pkg.instantiate();

Once this is done, we can load all the required packages. In Julia, this is done using the using key word, which can be thought of as the Python import parallel. Note that I placed a semicolon after the using CSV line. This is to prevent the automatic printing of output messages that Julia automatically displays when loading packages.

# load all packages
using Random
using CSVFiles # load CSV data
using DataFrames # data storage and presentation
using Plots # plotting library
using StatsPlots # statistical plotting
using Distributions # statistical distribution interface
using Turing # probabilistic programming and MCMC
using Optim # optimization library
using StatsBase
using LaTeXStrings
using HMMBase
using DelimitedFiles
using CSV;

Now, let’s load the data and identify the number of years and number of states inherent to time streamflow time series.

# load data
# source: https://github.com/IMMM-SFA/msd_uncertainty_ebook/tree/main/notebooks/data
Q = CSV.File("annualQ.csv", header=[:Inflow]) |> DataFrame
Q = Q[2:end,:]
Q[!,:Inflow] = parse.(Float64, Q[!, :Inflow])
Q[!, :Year] = 1909:2013;
Q[!, :logQ] = log.(Q[!, :Inflow]);

# Assume first-order HMM where only the immediate prior step influences the next step
n_states = 2
logQ = Q[!,:logQ];
n_years = size(logQ)[1];

Next, we should set a random seed to ensure that our following results are consistent. This is because the package that we are using in this exercise called “Distributions” performs random sampling. Without setting the seed (shown below), the outcomes of the exercise may be wildly different each time the MCMC parameterization algorithm is run.

# Set the random seed
Random.seed!(1)

Next, we define the HMM model with MCMC sampling. The sampling is performed using methods and syntax built into the Turing.jl package. In the code below, also note lines 32 to 38. Within those lines, the observations’ states are enforced where observations with log-streamflows below the input mean are assigned to a dry state, while the opposite is implemented for log-streamflows above the wet state. This step is important to prevent mode collapse, where the HMM model cannot distinguish between the wet and dry state distributions due to the random sampling of the MCMC parameterization.

@model function HMM_mcmc(logQ, n_states, n_years)
    # Source: https://chat.openai.com/c/d81ef577-9a2f-40b2-be9d-c19bf89f9980
    # Prompt example: Please suggest how you would parameterize Hidden Markov Model with the Markov Chain Monte Carlo method 
    # Uses the Turing.jl package in Julia. The input to this model will be an inflow timeseries with two hidden states.
    mean_logQ = mean(logQ)

    # Define the prior distributions for the parameters
    μ₁ ~ Normal(15.7, 0.25)   # wet state   
    μ₂ ~ Normal(15.2, 0.25)   # dry state
    σ₁ ~ truncated(Normal(0.24, 0.1), 0, Inf)
    σ₂ ~ truncated(Normal(0.24, 0.1), 0, Inf)
    a11 ~ Dirichlet(ones(n_states))
    A ~ filldist(Dirichlet(ones(n_states)), n_states)
    s = tzeros(Int, n_years)   # Define the vector of hidden states

    # Define intial state probability distribution
    a = [a11[1], 1-a11[1]]   # [wet state, dry state]

    # Define the observation distributions 
    B = [Normal(μ₁, σ₁), Normal(μ₂, σ₂)]

    # Sample the initial hidden state and observation variables
    s[1] ~ Categorical(a)

    # Loop over time steps and sample the hidden+observed variables 
    for t in 2:n_years
        s[t] ~ Categorical(A[:,s[t-1]])
    end
    for t in 1:n_years
        logQ[t] ~ B[s[t]]

        # if the logQ is greater than the input mean, force the state to be a wet state
        if logQ[t] > mean_logQ
            s[t] ~ Categorical(A[:,1])
        # if the logQ is smaller than the input mean, force the state to be a dry state
        elseif logQ[t] < mean_logQ
            s[t] ~ Categorical(A[:,2])
        end

    end
end

We then have to run the model for at least 3,000 iterations before the sample values for each parameter converge to stable values. The code to perform this is as follows:

# run for 3000 iterations; usually enough
# run using >3 chains to double check convergence if desired
hmm_model = HMM_mcmc(logQ, n_states, n_years)
g = Gibbs(HMC(0.05, 10, :μ₁, :μ₂, :σ₁, :σ₂, :a11, :A), PG(10, :s))
chains = sample(hmm_model, g, MCMCThreads(), 3000, 1, drop_warmup=true)

To check for convergence, you can attempt to run plot(chains) to check for convergence, and corner(chains) to examine the parameters for dependencies. In this post, we will not discuss MCMC convergence in detail, but some things to look out for would be r_hat values in the output of the code shown above that are close to 1, as well as chains that are similar to each other. Refer to Table 2 and Figure 2 for some quick statistics that can be used as a baseline to determine if your model has converged.

**Table 2:** The mean converged values for some of the HMM parameters.

**Figure 2**: The mean converged values of the transition matrix components.

Now, let’s use the model to generate 1,000 samples of streamflow. First, we define the streamflow prediction function:

function predict_inflow(chain, logQ)
    μ₁ = Array(group(chain, :μ₁))
    μ₂ = Array(group(chain, :μ₂))
    σ₁ = Array(group(chain, :σ₁))
    σ₂ = Array(group(chain, :σ₂))
    a11 = Array(group(chain, "a11[1]"))
    a22 = Array(group(chain, "a11[2]"))
    A11 = Array(group(chain, "A[1,1]"))
    A12 = Array(group(chain, "A[1,2]"))
    A21 = Array(group(chain, "A[2,1]"))
    A22 = Array(group(chain, "A[2,2]"))

    n_samples = 1000
    
    μ1_sample = sample(μ₁, n_samples, replace=true);
    μ2_sample = sample(μ₂, n_samples, replace=true);
    σ1_sample = sample(σ₁, n_samples, replace=true);
    σ2_sample = sample(σ₂, n_samples, replace=true);
    a11_sample = sample(a11, n_samples, replace=true);
    a22_sample = sample(a22, n_samples, replace=true);
    A11_sample = sample(A11, n_samples, replace=true);
    A12_sample = sample(A12, n_samples, replace=true);
    A21_sample = sample(A21, n_samples, replace=true);
    A22_sample = sample(A22, n_samples, replace=true);
    
    Q_predict = zeros(length(logQ), n_samples)    
    #residuals_ar = zeros(length(temp_data), n_samples)
    
    s_matrix = zeros(length(logQ), n_samples)
    logQ_matrix = zeros(length(logQ), n_samples)
    meanQ = mean(logQ)
    
    for n = 1:n_samples 
        s_sample = tzeros(Int, n_years)
        Q_sample = zeros(n_years)

        μ1 = μ1_sample[n]
        μ2 = μ2_sample[n]
        σ1 = σ1_sample[n]
        σ2 = σ2_sample[n]
        a11 = a11_sample[n]
        a22 = a22_sample[n]
        A11 = A11_sample[n]
        A21 = 1 - A11_sample[n]
        A12 = 1 - A22_sample[n]
        A22 = A22_sample[n]
        #println(A)
        A = [A11 A12; A21 A22]
        A = transpose(A)
        #println(A)
        #a =  [1-a22, a22]
        a = [a11, 1-a11]
        print(a)
        B = [Normal(μ1, σ1), Normal(μ2, σ2)]

        # Sample the initial hidden state and observation variables
        s_sample[1] = rand(Categorical(a))

        # Loop over time steps and sample the hidden+observed variables 
        for t in 2:n_years
            s_sample[t] = rand(Categorical(A[s_sample[t-1], :]))
        end
        for t in 1:n_years
            Q_sample[t] = rand(B[s_sample[t]])
            if Q_sample[t] < meanQ
                s_sample[t] = 1
            elseif Q_sample[t] > meanQ
                s_sample[t] = 2
            end
        end

        s_matrix[:, n] = s_sample
        Q_predict[:, n] = Q_sample
        
    end

    return s_matrix, Q_predict

end

From this function, we obtain the state and log-inflow matrix with dimensions 105 × 1000 each.

s_matrix, logQ_predict = predict_inflow(chains, logQ)

We randomly select a timeseries out of the 1,000 timeseries output by the prediction function and plot the wet and dry states identified:

# create the dry and wet masks
q = Q[!,:Inflow]
y = Q[!,:Year]

# select a random sample timeseries to plot
idx = 7
Q_wet_mcmc = q[s_matrix[:,idx] .== 1]
Q_dry_mcmc = q[s_matrix[:,idx] .== 2]

y_wet_mcmc = y[s_matrix[:,idx] .== 1]
y_dry_mcmc = y[s_matrix[:,idx] .== 2];

# plot the figure
plot(y, Q.Inflow, c=:grey, label="Inflow", xlabel="Year", ylabel="Annual inflow (cubic ft/yr)")
scatter!(y_dry_mcmc, Q_dry_mcmc, c=:salmon, label="Dry state (MCMC)", xlabel="Year", ylabel="Annual inflow (cubic ft/yr)")
scatter!(y_wet_mcmc, Q_wet_mcmc, c=:lightblue, label="Wet state (MCMC)", xlabel="Year", ylabel="Annual inflow (cubic ft/yr)")

The output figure should look similar to this:

**Figure 2**: Wet (pink) and dry (blue) states identified within the streamflow timeseries.

From this figure, we can see that the HMM can identify streamflow extreme as wet and dry states (albeit with some inconsistencies) where historically high streamflows are identified as a “wet state” and low streamflows are correspondingly identified as “dry”.

To make sure that the model can distinguish the observation distributions for wet and dry states, we plot their probability distribution functions. Before we plot anything, though, let’s define a function that will help us plot the distributions themselves. For this function, we first need to load the LinearAlgebra package.

# load the LinearAlgebra package 
using LinearAlgebra

# function to plot the distribution of the wet and dry states 
function plot_dist(Q, μ, σ, A)
    evals, evecs = eigen(A).values, eigen(A).vectors
    one_eval = argmin(abs.(evals.-1))
    π = evecs[:, one_eval] / sum(evecs[:, one_eval])

    x_0 = LinRange(μ[1] - 4*σ[1], μ[1] + 4*σ[1], 10000)
    norm0 = Normal(μ[1], σ[1])
    fx_0 = π[1].*pdf.(norm0, x_0)
 
    x_1 = LinRange(μ[2] - 4*σ[2], μ[2] + 4*σ[2], 10000)
    norm1 = Normal(μ[2], σ[2])
    fx_1 = π[2].*pdf.(norm1, x_1)
 
    x = LinRange(μ[1] - 4*σ[1], μ[2] + 4*σ[2], 10000)
    fx = π[1].*pdf.(norm0, x) .+ π[2].*pdf.(norm1, x)
    
    #fig, ax = plt.subplots(1, 1, figsize=(12, 8))

    b = range(μ[1] - 4*σ[1], μ[2] + 4*σ[2], length=100)
    
    histogram(log.(Q), c=:lightgrey, normalize=:pdf, label="Log-Inflow")
    plot!(x_0, fx_0, c=:red, linewidth=3, label="Dry State Distr", xlabel="x", ylabel="P(x)")
    plot!(x_1, fx_1, c=:blue, linewidth=2, label="Wet State Distr")
    plot!(x, fx, c=:black, linewidth=2, label="Combined State Distr")
end

Great! Now let’s use this function to plot the distributions themselves.

# get the model parameter value means
μ₁_mean = mean(chains, :μ₁)
μ₂_mean = mean(chains, :μ₂)
σ₁_mean = mean(chains, :σ₁)
σ₂_mean = mean(chains, :σ₂)
a11_mean = mean(chains, "a11[2]")
A11_mean = mean(chains, "A[1,1]")
A12_mean = mean(chains, "A[1,2]")
A21_mean = mean(chains, "A[2,1]")
A22_mean = mean(chains, "A[2,2]")

# obtained from model above
μ = [μ₂_mean, μ₁_mean]
σ = [σ₂_mean, σ₁_mean]
A_mcmc = [A11_mean 1-A11_mean; 1-A22_mean A22_mean]

# plot the distributions
plot_dist(Q[!,:Inflow], μ, σ, A_mcmc)

This code block will output the following figure:

**Figure 3:** The overall (black), wet (blue) and dry (red) streamflow probability distributions.

From Figure 3, we can observe that the dry state streamflow distribution is different from that of the wet state. The HMM model correctly estimates a lower mean for the dry state, and a higher one for the dry state. Based on visual analysis, the states have similar standard deviations, but this is likely an artifact of the prior parameterization of the initial HMM model that we set up earlier. The step we took earlier to force the wet/dry state distributions given a sampled log-streamflow value also likely prevented mode collapse and enabled the identification of two distinct distributions.

Now, let’s see if our HMM can be used to generate synthetic streamflow. We do this using the following lines of code:

# plot the FDC of the sampled inflow around the historical 
sampled_obs_mcmc = zeros(1000, n_years)
Q_predict_mcmc = transpose(logQ_predict)
for i in 1:1000
    sampled_obs_mcmc[i,:] = exp.(Q_predict_mcmc[i, :])
    sampled_obs_mcmc[i,:] = sort!(sampled_obs_mcmc[i, :], rev=true)
end

hist_obs = copy(Q[!,:Inflow])
hist_obs = sort!(hist_obs, rev=true)
probs = (1:n_years)/n_years

plot(probs, hist_obs, c=:grey, linewidth=2, label="Historical inflow",
    xlabel="Exceedance prob.", ylabel="Annual inflow (cubic ft/yr)")
for i in 1:1000
    if i == 1
        plot!(probs, sampled_obs_mcmc[i,:], c=:lightgreen, linewidth=2,
            label="Sampled inflow (MCMC)", xlabel="Exceedance prob.", ylabel="Annual inflow (cubic ft/yr)")
    end

    plot!(probs, sampled_obs_mcmc[i,:], c=:lightgreen, linewidth=2, legend=false,
        xlabel="Exceedance prob.", ylabel="Annual inflow (cubic ft/yr)")
end

plot!(probs, hist_obs, c=:grey, linewidth=2, legend=false,
    xlabel="Exceedance prob.", ylabel="Annual inflow (cubic ft/yr)")

Once implemented, the following figure should be output:

**Figure 4:** The original historical streamflow timeseries (grey) and the synthetic streamflow generated by the HMM (green).

This demonstrates that the HMM model is capable of generating synthetic streamflow that approximates the statistical properties of the historical timeseries, while expanding upon the range of uncertainty in streamflow scenarios. This has implications for applications such as exploratory modeling of different streamflow scenarios, where such approaches can be used to generate a wetter- or dryer-than-historical streamflows.

Summary

While using MCMC to parameterize a HMM is a useful method to generate synthetic streamflow, one significant drawback is the long runtime and the large number of iterations that the MCMC algorithm takes to converge (Table 3).

**Table 3**: MCMC parameterization and runtime.

This is likely due to the HMM sampling setup. Notice that in lines 27 and 33-27 in the HMM_mcmc function: we are essentially fitting an additional probability distribution at each s_t essentially turning them into tunable parameters themselves. Since there are 105 elements in the streamflow timeseries, this introduces an additional 105 parameters for the MCMC to fit distributions to. This is the likely cause for the lengthy runtime. While we will not explore alternative HMM-MCMC parameterization approaches, more efficient alternatives definitely exist, and you are highly encouraged to check it out!

Overall, we began this post with a quick introduction to HMM and MCMC sampling. We then compared the pros and cons of using the MCMC approach to parameterize a HMM. Once we understood its benefits and drawbacks, we formulated a HMM for a 105-year long timeseries of annual streamflow and examined the performance of the model by analyzing its converged parameter values, observation probability distributions, as well as its ability to detect wet/dry states in the historical time series. We also used the HMM to generate 1,000 synthetic streamflow scenarios. Finally, we diagnosed the cause for the long time to convergence of the MCMC sampling approach.

This brings us to the end of the post! I hope you now have a slightly better understanding of HMMs, how they can be parameterized using the MCMC approach, as well as issues you might run into when you implement the latter.

Happy coding!

References

Akintug, B., & Rasmussen, P. F. (2005). A Markov switching model for annual hydrologic time series. Water Resources Research, 41(9). https://doi.org/10.1029/2004wr003605

Bracken, C., Rajagopalan, B., & Zagona, E. (2014). A hidden markov model combined with climate indices for Multidecadal Streamflow Simulation. Water Resources Research, 50(10), 7836–7846. https://doi.org/10.1002/2014wr015567

Hadjimichael, A., Quinn, J., & Reed, P. (2020). Advancing diagnostic model evaluation to better understand water shortage mechanisms in institutionally complex river basins. Water Resources Research, 56(10), e2020WR028079.

Hadjimichael, A., Quinn, J., Wilson, E., Reed, P., Basdekas, L., Yates, D., & Garrison, M. (2020a). Defining robustness, vulnerabilities, and consequential scenarios for diverse stakeholder interests in institutionally complex river basins. Earth’s Future, 8(7), e2020EF001503.

Reed, P.M., Hadjimichael, A., Malek, K., Karimi, T., Vernon, C.R., Srikrishnan, V., Gupta, R.S., Gold, D.F., Lee, B., Keller, K., Thurber, T.B, & Rice, J.S. (2022). Addressing Uncertainty in Multisector Dynamics Research [Book]. Zenodo. https://doi.org/10.5281/zenodo.6110623

Rabiner, L. R. (1990). A tutorial on Hidden Markov models and selected applications in speech recognition. Readings in Speech Recognition, 267–296. https://doi.org/10.1016/b978-0-08-051584-7.50027-9

Rydén, T. (2008). Em versus markov chain Monte Carlo for estimation of Hidden Markov models: A computational perspective. Bayesian Analysis, 3(4). https://doi.org/10.1214/08-ba326

Turek, D., de Valpine, P. and Paciorek, C.J. (2016) Efficient Markov Chain Monte Carlo Sampling for Hierarchical Hidden Markov Models, arXiv.org. Available at: https://arxiv.org/abs/1601.02698.

March 3, 2020 by David Gold

Make your Git repository interactive with Binder

Have you ever tried to demo a piece of software you wrote only to have the majority of participants get stuck when trying to configure their computational environment? Difficulty replicating computational environments can prevent effective demonstration or distribution of even simple codes. Luckily, new tools are emerging that automate this process for us. This post will focus on Binder, a tool for creating custom computing environments that can be distributed and used by many remote users simultaneously. Binder is language agnostic tool, and can be used to create custom environments for R, Python and Julia. Binder is powered by BinderHub, an open source service in the cloud. At the bottom of this post, I’ll provide an example of an interactive Python Jupyter Notebook that I created using BinderHub.

BinderHub

BinderHub combines two useful libraries: repo2docker and JupyterHub. repo2docker is a tool to build, run and push Docker images from source code repositories. This allows you to create copies of custom environments that users can replicate on any machine. These copies are can be stored and distributed along with the remote repository. JuptyerHub is a scalable system that can be used to spawn multiple Jupyter Notebook servers. JuptyerHub takes the Docker image created by repo2docker and uses it to spawn a Jupyter Notebook server on the cloud. This server can be accessed and run by multiple users at once. By combining repo2docker and JupyterHub, BinderHub allows users to both replicate complex environments and easily distribute code to large numbers of users.

Creating your own BinderHub deployment

Creating your own BinderHub deployment is incredibly easy. To start, you need a remote repository containing two things: (1) a Jupyter notebook with supporting code and (2) configuration files for your environment. Configuration files can either be an environment.yml file (a standard configuration file that can be generated with conda, see example here) or a requirements.txt file (a simple text file that lists dependencies, see example here).

To create an interactive BinderHub deployment:

Push your code to a remote repository (for example Github)
Go to mybinder.org and paste the repository’s URL into the dialoge box (make sure to select the proper hosting service)
Specify the branch if you are not on the Master
Click “Launch”

The website will generate a URL that you can copy and share with users. I’ve created an example for our Rhodium tutorial, which you can find here:

https://mybinder.org/v2/gh/dgoldri25/Rhodium/master?filepath=DMDU_Rhodium_Demo.ipynb

To run the interactive Jupyter Notebook, click on the file titled “Rhodium_Demo.ipynb”. Happy sharing!

March 27, 2019 by Peter Storm

From MATLAB to Julia: Insights from Translating an Opensource Kirsch-Nowak Streamflow Generator to Julia

A quick look into translating code: speed comparisons, practicality, and comments

As I am becoming more and more familiar with Julia—an open-source programming language—I’ve been attracted to translate code to not only run it on an opensource and free language but also to test its performance. Since Julia was made to be an open source language made to handle matrix operations efficiently (when compared to other high-level opensource languages), finding a problem to utilize these performance advantages only makes sense.

As with any new language, understanding how well it performs relative to the other potential tools in your toolbox is vital. As such, I decided to use a problem that is easily scalable and can be directly compare the performances of MATLAB and Julia—the Kirsch-Nowak synthetic stationary streamflow generator.

So, in an effort to sharpen my understanding of the Kirsch-Nowak synthetic stationary streamflow generator created by Matteo Giuliani, Jon Herman and Julianne Quinn, I decided to take on this project of converting from this generator from MATLAB. This specific generator takes in historical streamflow data from multiple sites (while assuming stationarity) and returns a synthetically generated daily timeseries of streamflow. For a great background on synthetic streamflow generation, please refer to this post by Jon Lamontagne.

Model Description

The example is borrowed from Julie’s code utilizes data from the Susquehanna River flows (cfs) at both Marietta (USGS station 01576000) and Muddy Run along with lateral inflows (cfs) between Marietta and Conowingo Damn (1932-2001). Additionally, evaporation rates (in/day) over the Conowingo and Muddy Run Dams (from an OASIS model simulation) utilized. The generator developed by Kirsch et al. (2013) utilizes a Cholesky decomposition to create a monthly synthetic record which preserves the autocorrelation structure of the historical data. The method proposed by Nowak et al. (2010) is then used to disaggregate to daily flows (using a historical month +/- 7 days). A full description of the methods can be found at this link.

Comparing Julia and MATLAB

Comparison of Performance between Julia and MATLAB

To compare the speeds of each language, I adapted the MATLAB code into Julia (shown here) on as nearly of equal basis as possible. I attempted to keep the loops, data structures, and function formulation as similar as possible, even calling similar libraries for any given function. julia_matlab_streamflow

When examining the performance between Julia (solid lines) and MATLAB (dashed lines), there is only one instance where MATLAB(x) outperformed Julia(+)—in the 10-realization, 1000-year simulation shown in the yellow dots in the upper left. Needless to say, Julia easily outperformed MATLAB in all other situations and required only 53% of the time on average (all simulations considered equal). However, Julia was much proportionally faster at lower dimensions of years (17-35% of the time required) than MATLAB. This is likely because I did not handle arrays optimally—the code could likely be sped up even more.

Considerations for Speeding Up Code

Row- Versus Column-Major Array Architecture

It is worth knowing how a specific language processes its arrays/matrices. MATLAB and Julia are both column-major languages, meaning the sequential indexes and memory paths are grouped by descending down row by row through a column then going through the next column. On the other hand, Numpy in Python specifically uses row-major architecture. The Wikipedia article on this is brief but well worthwhile for understanding these quirks.

This is especially notable because ensuring that proper indexing and looping methods are followed can substantially speed up code. In fact, it is likely that the reason Julia slowed down significantly on a 10-realization 1000-year simulation when compared to both its previous performances and MATLAB because of how the arrays were looped through. As a direct example shown below, when exponentiating through a [20000, 20000] array row-by-row took approximately 47.7 seconds while doing the same operation column-by-column only took 12.7 seconds.

5555

Dealing with Arrays

Simply put, arrays and matrices in Julia are a pain compared to MATLAB. As an example of the bad and the ugly, unlike in MATLAB where you can directly declare any size array you wish to work with, you must first create an array and then fill the array with individual array in Julia. This is shown below where an array of arrays is initialized below. However, once an array is established, Julia is extremely fast in loops, so dealing with filling a previously established array makes for a much faster experience.

# initialize output
qq = Array{Array}(undef, num_sites) #(4, 100, 1200)

for i = 1:num_sites
     qq[i] = Array{Float64}(undef, nR, nY * 12)
end

Once the plus side when creating arrays, Julia is extremely powerful in its ability to assign variable types to the components of a given array. This can drastically speed up your code during the day. Shown below, it is easy to the range of declarations and assignments being made to populate the array. There’s an easy example of declaring an array with zeros, and another where we’re populating an array using slices of another. Note the indexing structure for Qd_cg in the second loop–it is not technically a 3-D array but rather a 2-D array nested within a 1-D array–showing the issues mentioned prior.

delta = zeros(n_totals)
for i = 1:n_totals
     for j = 1:n_sites
          delta[i] += (Qtotals[month][j][i] - Z[j]) ^ 2
     end
end

q_ = Array{Float64, 2}(undef, num_realizations[k], 365 * num_years[k])
for i = 1: Nsites
     # put into array of [realizations, 365*num_yrs]
     for j = 1: num_realizations[k]
          q_[j, :] = Qd_cg[j][:, i]'
     end
end

Code Profiling: Order of Experiments

An interesting observation I’ve noticed is that Julia’s first run on a given block of code is substantially slower than every other attempt. Thus, it is likely worthwhile to run a smaller-scale array through to initialize the code if there are plans to move on to substantially more expensive operations (i.e. scaling up).

In the example below, we can see that the second iteration of the same exact code was over 10% faster when calling it a second time. However, when running the code without the function wrapper (in the original timed runs), the code was 10% faster (177 seconds) than the second sequential run shown below. This points to the importance of profiling and experimenting with sections of your code.

4444

Basic profiling tools are directly built into Julia, as shown in the Julia profiling documentation. This can be visualized easily using the ProfileView library. The Juno IDE (standard with Julia Pro) allegedly has a good built-in profile as well. However, it should be expected that most any IDE should do the trick (links to IDEs can be found here).

Syntax and Library Depreciation

While Julia is very similar in its structure and language to MATLAB, much of the similar language has depreciated as Julia has been rapidly upgraded. Notably, Julia released V1.0 in late 2018 and recently released V1.1, moving further away from similarities in function names. Thus, this stands as a lesson for individuals wishing to translate all of their code between these languages. I found a useful website that assists in translating general syntax, but many of the functions have depreciated. However, as someone who didn’t have any experience with MATLAB but was vaguely familiar with Julia, this was a godsend for learning differences in coding styles.

For example, creating an identity matrix in MATLAB utilizes the function eye(size(R)) to create an nxn matrix the size of R. While this was initially the language used in Julia, this specific language was depreciated in V0.7. To get around this, either ‘I’ can be used to create a scalable identity matrix or Matrix{Float64}(I, size(R), size(R)) declare an identity matrix of size(R) by size(R) for a more foolproof and faster operation.

When declaring functions, I have found Julia to be relatively straightforward and Pythonic in its declarations. While I still look to insert colons at the ends of declarations while forgetting to add ‘end’ at the end of functions, loops, and more, the ease of creating, calling, and interacting with functions makes Julia very accessible. Furthermore, its ability to interact with matrices in without special libraries (e.g. Numpy in Python) allows for more efficient coding without having to know specific library notation.

Debugging Drawbacks

One of the most significant drawbacks I run into when using Julia is the lack of clarity in generated error codes for common mistakes, such as adding extra brackets. For example, the following error code is generated in Python when adding an extra parenthesis at the end of an expression. 3333

However, Julia produces the follow error for an identical mistake:

One simple solution to this is to simply upgrade my development environment from Jupyter Notebooks to a general IDE to more easily root out issues by running code line-by-line. However, I see the lack of clarity in showing where specific errors arise a significant drawback to development within Julia. However, as shown in the example below where an array has gone awry, an IDE (such as Atom shown below) can make troubleshooting and debugging a relative breeze.

1111

Furthermore, when editing auxiliary functions in another file or module that was loaded as a library, Julia is not kind enough to simply reload and recompile the module; to get it to properly work in Atom, I had to shut down the Julia kernel then rerun the entirety of the code. Since Julia takes a bit to initially load and compile libraries and code, this slows down the debugging process substantially. There is a specific package (Revise) that exists to take care of this issue, but it is not standard and requires loading this specific library into your code.

GitHub Repositories: Streamflow Generators

PyMFGM: A parallelized Python version of the code, written by Bernardo Trindade

Kirsch-Nowak Stationary Generator in MATLAB: Developed by Matteo Giuliani, Jon Herman and Julianne Quinn

Kirsch-Nowak Stationary Generator in Julia: Please note that the results are not validated. However, you can easily access the Jupyter Notebook version to play around with the code in addition to running the code from your terminal using the main.jl script.

Full Kirsch-Nowak Streamflow Generator: Also developed by Matteo Giuliani, Jon Herman and Julianne Quinn and can handle rescaling flows for changes due to monsoons. I would highly suggest diving into this code alongside the relevant blog posts: Part 1 (explanation), Part 2 (validation).

February 19, 2019 by Peter Storm

Introducing Julia: A Fast and Modern Language

In this blog post, I am introducing Julia, a high-level open-source dynamic programming language. While it has taken root in finance, machine learning, life sciences, energy, optimization, and economic realms, it is certainly not common within the water programming realm. Through this post, I will give an overview of the features, pros, and cons of Julia before comparing its performance to Python. Following this, I give an overview of common Julia development environments as well as linking to additional resources.

Julia: What’s Going On?

Julia is a high-level open-source dynamic programming language built for scientific computing and data processing that is Pythonic in syntax to ensure accessibility while boosting computational efficiency. It is parallelizable on high performance computing resources utilizing CPUs and GPUs, allowing for its use in large-scale experiments.

With Version 1.1.0 recently released, Julia now stands amongst the established and stable programming languages. As such, Julia can handle matrices with ease while performing at speeds nearly comparable to C and Fortran. It is comparable to Cython/Numba and faster than Numpy, Python, Matlab, and R. Note that further benchmarking is required on a project-specific basis due to the speed of individual packages/libraries, but a simple case is shown in the following section.

The biggest case for implementing Julia over C/C++/Fortran is its simplicity in writing and implementation. Its language is similar Python, allowing for not only list comprehension but also ‘sloppy’ dynamic variable type declarations and use. Julia has the ability to move between dynamic and static variable types. Furthermore, Julia allows users to create unique variable types, allowing for maximum flexibility while maintaining efficiency.

Installing and implementing packages from the start is nearly seamless, simply requiring the name of the package and an internet connection. Most packages and libraries are hosted on GitHub and are relatively straightforward to install with just a couple lines of code. On the same hand, distributing and contributing to the opensource community is straightforward. Furthermore, Julia can access Python modules with wrappers and C/Fortran functions without, making it a very versatile language. Likewise, it can easily be called from Python (PyJulia) to speed up otherwise cumbersome operations.

Julia has relatively straightforward profiling modules built in. Beyond being able to utilize Jupyter Notebook commands, Julia has a range of code diagnostic tools, such as a @Time command to inform the user of wall time and memory allocation for a function.

One of the most obvious downsides to Julia when compared to Python is the relatively small community. There is not nearly the base of documentation that I am accustomed to; nearly every issue imaginable in the Python world has been explored on Stack Overflow, Julia is much more limited. However, because much of its functionality is based on MATLAB, similar issues and their corresponding solutions bleed over.

Additional transition issues that may be encountered is that unlike Python, Julia’s indexing begins with 1. Furthermore, Julia uses column-major ordering (like Matlab, Fortran, and R) unlike Python and C (row-major ordering), making iterating column-by-column substantially faster. Thus, special care must be taken when switching between languages in addition to ensuring consistent strategies for speeding up code. Another substantial transition that may be required is that Julia is not directly object-oriented since objects do not directly have embedded methods; instead, a given object must be passed to a function to be modified.

Overall, I see Julia as having the potential to improve computational efficiency for computationally-intensive models without the need to learn a more complex language (e.g. Fortran or C/C++). Between its ease of integration between other common languages/libraries to its ability to properly utilize HPC CPU/GPU resources, Julia may be a useful tool to utilize alongside Python to create more efficient large-scale models.

Julia Allows for Faster Matrix Operations

As a simple test to compare scalability of Julia and Python, I computed the dot product of increasingly large matrices using the base Julia language and Numpy in Python (both running in Jupyter Notebooks on a Windows desktop). You can find the code at the following link on Github.

As you can see in the generated figure, Julia outperforms Python for large matrix applications, indicating that Julia is more efficient and may be worth using for larger and larger operations. If you have any suggestions on improving this code for comparison, please let me know; I am absolutely game for improvement.

Development Environments for Julia

The first and most obvious variables on what development environment is best is you, the user. If you’re wanting all the bells and whistles versus a simple text-editing environment, the range of products to fulfill you desires exists.

Juno—an extension to Atom—is the default IDE that is incorporated with JuliaPro, a pre-packaged version of Julia that includes a range of standard packages—Anaconda is the parallel in the Python world.

Jupyter Notebooks are my go-to development environment for experimenting with code in both Python and Julia, especially for testing small sections of code. Between being able to easily create visuals and examples inline with your code and combining code with Markdown, it allows for clean and interactive approach to sharing code or teaching. Note that installation generally requires IJulia but can allow for easy integration into already existing Jupyter workflows.

JetBrains IDEs (e.g. CLion, PyCharm) support a Julia plugin that supports a wide range of the same features JetBrains is known for. However, it is still in beta and is working to improve its formatter and debugging capabilities.

JuliaBox is an online web interface using Jupyter Notebooks for code development on a remote server, requiring no local installation of Julia. Whether you are in a classroom setting, wanting to write code on your phone, or are just wanting to experiment with parallel computing (or don’t have access to HPC resources), JuliaBox allows for a nearly seamless setup experience. However, note that this development environment limits each session to 90 minutes (up to 8 hours on a paid subscription) and requires a subscription to access any resources beyond 3 CPU cores. Note that you can access GPU instances with a paid subscription.

Julia Studio is a default IDE used by a range of users. It is a no-frills IDE based on Qt Creator, resulting in a clean look.

For anyone looking to use Visual Studio, you can install a VS Code extension.

Vim is not surprisingly available for Julia. And if you’re feeling up the challenge, Emacs also has an interface.

Of course, you can just use a texteditor of your choice (e.g. Sublime Text with an extension) and simply run Julia from the terminal.

Julia Resources

Installing Julia
Julia’s wide range of learning resources
Learn Julia the Hard Way
Julia Style Guide
Trending Julia Packages
Stable (and standard) Julia Packages
Julia Code Profiling Guide
PyPlots in Julia
Calling Python Functions in Julia
Weave, a Report Generator: Made for Julia, Weave incorporates languages including Markdown, Pandoc, and LaTex to generate reports. It integrates with IJulia to work with Jupyter Notebooks
Julia Subreddit Community

Acknowledgements

Thanks to Dave Gold for his assistance in giving guidance for benchmarking and reminding me of the KISS principle.

Parameter	Description
The initial state distribution a for wet (1) and dry (2) states
The transition matrix A from a wet (1) to a dry (2) state
The probability distributions for the vector of observations x for a given state where Wet: 1 and Dry: 2

Water Programming: A Collaborative Research Blog

Tips and tricks on programming, evolutionary algorithms, and doing research

julia