September 26, 2016 by David Gold

A simple command for plotting autocorrelation functions in Matlab

Autocorrelation is a measure of persistence within a data set, which can be defined as the tendency for successive data points to be similar (Wilks, 2011). In atmospheric science, temporal autocorrelation can be a helpful tool for model evaluation. Temporal autocorrelation is also a fundamental concept for synthetic weather generation (for more detail, see Julie’s fantastic series of blog posts on synthetic weather generation). Calculating autocorrelation within a sample data set can also be helpful for assessing the applicability of classical statistical methods that require independence of data points within a sample. Should a data set prove to be strongly persistent, such methods will likely yield inaccurate results.

Autocorrelation is commonly computed by making a copy of the original data set, shifting the copy k points forward (where k is the lag over which you would like to compute the autocorrelation) and computing the Pearson correlation coefficient between the original data set and the copy.

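$r_k = \frac{\sum_{i=1}^{n-k}\left(x_i - \bar{x}_{-}\right)\left(x_{i+k} - \bar{x}_{+}\right)}{\sqrt{\sum_{i=1}^{n-k}\left(x_i - \bar{x}_{-}\right)^{2}}\,\sqrt{\sum_{i=k+1}^{n}\left(x_i - \bar{x}_{+}\right)^{2}}}$

Where:

n is the number of data points, k is the lag, $\bar{x}_{-}$ is the mean of the first n–k data points and $\bar{x}_{+}$ is the mean of the last n–k data points.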

The set of autocorrelations computed over a range of different lags is known as the autocorrelation function. Plotting the autocorrelation function can be a helpful tool for quickly assessing the presence of autocorrelation within a data set.
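For readers who want to see the calculation spelled out, here is a minimal Python sketch of the shift-and-correlate approach described above. The function names `lag_autocorr` and `acf` are my own, and the data are purely illustrative. Toolbox functions may use a slightly different estimator (for example, a single overall mean and variance), so their values need not match this sketch exactly.

```python
import numpy as np

def lag_autocorr(x, k):
    """Pearson correlation between a series and a copy of itself shifted by k points."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return 1.0  # a series is perfectly correlated with itself
    # first n-k points versus last n-k points
    return np.corrcoef(x[:-k], x[k:])[0, 1]

def acf(x, max_lag):
    """Autocorrelation function: lag-k autocorrelation for k = 0, 1, ..., max_lag."""
    return np.array([lag_autocorr(x, k) for k in range(max_lag + 1)])

# Illustrative data only: a steadily increasing series is perfectly persistent
x = np.arange(10.0)
r = acf(x, 3)
```

Plotting `r` against the lag (for example with a stem plot) gives the same kind of figure discussed in this post.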

You can generate such plots in Matlab using the simple command shown below:

autocorr(T,k)
% T is your data set and k is the number of lags you would like to compute
% (autocorr is part of Matlab's Econometrics Toolbox)

The command generates a plot of the autocorrelation function. Below are two examples: the first is the autocorrelation function of a set of observed temperature values in Des Moines, Iowa, and the second is the autocorrelation function of the temperature values at the same location as modeled by the MM5I regional climate model:

Figure 1: Temporal autocorrelation function of temperature observations from Des Moines, Iowa (temperatures reported at 3-hour intervals)

Figure 2: Temporal autocorrelation function of temperature produced by the MM5I regional climate model for Des Moines, Iowa (temperatures reported at 3-hour intervals)

Note the cyclical nature of the autocorrelation functions, which reflects the daily temperature cycle. The autocorrelation functions of the daily maximum or minimum temperatures would show more constant persistence.
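To see where the cyclical shape comes from, here is a small sketch using synthetic data (not the Des Moines record): a sinusoid with a 24-hour period sampled every 3 hours, plus random noise. The amplitude, noise level and record length are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
steps_per_day = 8                   # 3-hour intervals give 8 values per day
t = np.arange(60 * steps_per_day)   # 60 days of synthetic "temperatures"
temps = 10.0 * np.sin(2 * np.pi * t / steps_per_day) + rng.normal(0.0, 3.0, t.size)

def lag_corr(x, k):
    """Pearson correlation between the series and itself shifted by k points (k >= 1)."""
    return np.corrcoef(x[:-k], x[k:])[0, 1]

# lags of 12, 24, 36 and 48 hours
acf_values = {k: lag_corr(temps, k) for k in (4, 8, 12, 16)}
```

The autocorrelation is strongly positive at lags that are whole days (8 and 16 steps) and strongly negative at half-day offsets (4 and 12 steps), which is the oscillation visible in the figures.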

Sources:

Wilks, D. S. (2011). Statistical methods in the atmospheric sciences. Burlington, MA: Academic Press.

September 16, 2016 by Julie

Fitting Multivariate Normal Distributions

In water resources modeling, we often want to generate synthetic multivariate data for a simulation model in a way that preserves their marginal and joint behavior. The multivariate normal (MVN) distribution is a common model choice for these simulations because 1) it often arises naturally due to the Central Limit Theorem, 2) it has many useful properties that make data manipulation convenient, and 3) data can often be transformed to MVN even if that is not their underlying distribution.

As an example, let X be a K-dimensional random vector with K×1 mean vector μ and K×K covariance matrix [Σ]. If X is multivariate normal, i.e. X~MVN(μ,[Σ]), its probability density function is the following:

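$f\left(x\right) = \frac{1}{\left(2\pi\right)^{K/2}\sqrt{\det\left[\Sigma\right]}}\exp\left[-\frac{1}{2}\left(x-\mu\right)^{T}\left[\Sigma\right]^{-1}\left(x-\mu\right)\right]$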

where det(·) denotes the determinant. The term (x–μ)^T[Σ]^-1(x–μ) is called the squared Mahalanobis distance, which measures how far away an observation x is from its distribution’s mean, scaled by a multi-dimensional measure of spread, [Σ]. It is therefore a measure of how far away from the mean each data vector is in a statistical sense, as opposed to Euclidean distance, which only measures distance in a physical sense. The two measures are related, though. If all of the K dimensions of X are independent, every off-diagonal element of [Σ] will be equal to 0, and the diagonal elements will equal the variances in each dimension, σ_k² where k ∈ {1, 2, …, K}. In that case, the Mahalanobis distance is equal to the Euclidean distance from the mean after scaling each dimension by its standard deviation.
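The relationship between the two distances is easy to check numerically. The sketch below uses made-up values for the mean, standard deviations and observation; `squared_mahalanobis` is a helper name of my own.

```python
import numpy as np

def squared_mahalanobis(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^-1 (x - mu)."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # solve cov * y = d rather than forming the inverse explicitly
    return float(d @ np.linalg.solve(cov, d))

mu = np.array([1.0, 2.0, 3.0])       # hypothetical mean vector
sigma = np.array([0.5, 2.0, 4.0])    # hypothetical standard deviations
cov = np.diag(sigma**2)              # independent dimensions: diagonal covariance
x = np.array([2.0, 0.0, 7.0])        # hypothetical observation

d2_mahalanobis = squared_mahalanobis(x, mu, cov)
d2_scaled_euclidean = float(np.sum(((x - mu) / sigma) ** 2))
```

With a diagonal covariance matrix the two quantities agree; once the off-diagonal elements are non-zero they generally do not, which is why the Mahalanobis distance is the one to use for correlated sites.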

So how do we fit MVN distributions? Well, the MVN distribution has “four handy properties” (Wilks, 2011) that we can test. Here I will discuss two of them and how we can use these properties to test for multivariate normality. See Chapter 11 of Wilks (2011) for additional tests for multivariate normality.

Let X be a set of n joint observations of K variables. Denote each of the n observations x_i = [x_i1, x_i2, …, x_iK] where i ∈ {1, 2, …, n} and each of the K marginals X_k = [x_1k, x_2k, …, x_nk] where k ∈ {1, 2, …, K}. If X~MVN(μ,[Σ]), the following two properties (among others) hold:

All marginal distributions of X are univariate normal, i.e. X_k~N(μ_k, σ_k²)
The squared Mahalanobis distances, D_i² = (x_i–μ)^T[Σ]^-1(x_i–μ), have a χ_K² distribution with K degrees of freedom.

So if we want to fit an MVN distribution to X, each of these will have to be true. Let’s look at an example where X is the standard deviation in daily flows during all historical Septembers at five different sites within the same basin. In this case, K=5 for the 5 sites and n=51, as there are 51 years in the historical record. To fit an MVN distribution to X, we’ll first want to ensure that the marginal distributions of the standard deviations in daily September flows are normal at each of the K sites. Let’s first inspect these distributions visually with histograms:

[Figure: histograms of the standard deviation of daily September flows at each of the 5 sites]

Clearly these distributions are not normal, as they are positively skewed. But that’s okay, we can transform the data so that they look more normal. The Box-Cox transformation is commonly used to transform non-normal data, X, to normal data, Y (see Chapter 3 of Wilks (2011) for more details):

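$Y = \begin{cases} \frac{X^{\lambda}-1}{\lambda}, & \lambda \neq 0 \\ \ln\left(X\right), & \lambda = 0 \end{cases}$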
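As a quick illustration, here is a minimal sketch of the usual one-parameter Box-Cox transformation, y = (x^λ – 1)/λ for λ ≠ 0 and y = ln(x) for λ = 0. The function name `box_cox` and the sample values are my own; the transformation is only defined for strictly positive data.

```python
import numpy as np

def box_cox(x, lam):
    """One-parameter Box-Cox transformation of strictly positive data."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(x)          # limiting case as lambda approaches 0
    return (x**lam - 1.0) / lam

x = np.array([1.0, 2.0, 4.0, 8.0])   # illustrative values only
y_log = box_cox(x, 0)                # log-transform
y_sqrt = box_cox(x, 0.5)             # a milder, square-root-like transform
```

Values of λ below 1 pull in the right tail of a positively skewed sample, with λ=0 being a common and convenient choice.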

Using λ=0 (a log-transform), our transformed data look like this:

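[Figure: histograms of the log-transformed standard deviations of daily September flows at each of the 5 sites]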

These look much better! We can perform a formal hypothesis test to confirm that each of these 5 transformed data series is not inconsistent with the normal distribution. A number of tests are available, such as the Shapiro-Wilk test, the Kolmogorov-Smirnov test, and the Filliben Q-Q correlation test, which I use here (see Chapter 5 of Wilks, 2011 for a description of other tests). The Filliben Q-Q test finds the correlation between the sample data quantiles and the theoretical quantiles of the distribution being fit. I’ve plotted these below; the correlation coefficients at these 5 sites are [0.9922, 0.9951, 0.9822, 0.9909, 0.9945].

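[Figure: normal Q-Q plots of the log-transformed standard deviations of daily September flows at each of the 5 sites]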

Rejection regions for the Filliben Q-Q test for the normal distribution are tabulated for different significance levels and sample sizes based on Monte Carlo results. The relevant section of the table is copied below. For a sample size of n≈50, we fail to reject the null hypothesis that the data are normal at the 10% level even for the site with the lowest correlation (Site 3: 0.9822), as the rejection region is ρ≤0.981. This means that if the data were normal, there would be a 10% chance that a data series of length n=50 would have a correlation coefficient below 0.981.

[Table: critical values of the Filliben Q-Q correlation test for the normal distribution]
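The Q-Q correlation itself takes only a few lines to compute. The sketch below is a stripped-down version using synthetic samples rather than the streamflow data; it assumes the plotting positions (i – 0.5)/n, and other plotting-position formulas would change the values slightly. The name `qq_correlation` is my own.

```python
import numpy as np
from scipy import stats

def qq_correlation(sample):
    """Correlation between sorted sample values and standard normal quantiles."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = sample.size
    p = (np.arange(1, n + 1) - 0.5) / n    # assumed plotting positions
    theoretical = stats.norm.ppf(p)        # theoretical normal quantiles
    return np.corrcoef(sample, theoretical)[0, 1]

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=51)
skewed_sample = np.exp(rng.normal(size=51))   # lognormal: positively skewed

r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
r_after_log = qq_correlation(np.log(skewed_sample))
```

Because the correlation coefficient is unaffected by shifting and rescaling, the sample does not need to be standardized first. The skewed sample should score noticeably lower than the same sample after a log-transform, mirroring what the histograms show.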

So now we know that none of the marginal distributions is inconsistent with the normal distribution, but that does not guarantee that the joint distribution across sites will be multivariate normal. There could be multivariate outliers, or points which are not extreme within any particular site’s distribution, but are extreme within the context of the overall covariance structure. We test this by confirming that the squared Mahalanobis distances are not inconsistent with a χ_K² distribution. Again, this can be done by comparing the sample data quantiles to the theoretical data quantiles (figure below). Here the correlation coefficient is 0.9964.

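[Figure: χ² Q-Q plot of the squared Mahalanobis distances of the log-transformed standard deviations of daily September flows]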

Because the rejection regions will depend not only on the sample size (n) and significance level, but also on the number of degrees of freedom (K), there are no tabulated critical values for this test (there would need to be a separate table for every possible K). Instead of using a table, one has to perform a Monte Carlo simulation to calculate the critical region for the specific application. In this case, I did that by generating 10,000 random samples of length n=51 from a χ² distribution with K=5 degrees of freedom. Of the generated samples, 97.8% had correlation coefficients less than the observed value of 0.9964, suggesting that this sample is very consistent with a χ₅² distribution.
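If you would rather have a critical value to compare against than a percentile, the same Monte Carlo idea can be sketched as follows. This is a vectorized illustration, not the code used for the results above: the function name, the default of 2,000 simulations and the fixed seed are all my own choices.

```python
import numpy as np
from scipy import stats

def chi2_qq_critical_value(n, dof, alpha=0.10, n_sim=2000, seed=0):
    """Monte Carlo estimate of the Q-Q correlation below which we reject at level alpha."""
    rng = np.random.default_rng(seed)
    p = (np.arange(1, n + 1) - 0.5) / n
    theoretical = stats.chi2.ppf(p, dof)               # theoretical chi^2 quantiles
    # each row is one sorted random sample of length n
    sims = np.sort(rng.chisquare(dof, size=(n_sim, n)), axis=1)
    # Pearson correlation of every row with the theoretical quantiles
    sims_c = sims - sims.mean(axis=1, keepdims=True)
    theo_c = theoretical - theoretical.mean()
    corr = (sims_c @ theo_c) / (np.linalg.norm(sims_c, axis=1) * np.linalg.norm(theo_c))
    return np.quantile(corr, alpha), corr

crit, corr = chi2_qq_critical_value(n=51, dof=5)
```

An observed correlation below `crit` would fall in the rejection region at the chosen significance level. Increasing `n_sim` gives a more stable estimate at the cost of run time.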

So now that we know the MVN is a good fit for the log-transformed standard deviations in daily September flows, we can estimate the model parameters. This part is easy! The maximum likelihood estimator of the mean vector μ is the sample mean vector x̅ = [x̅₁, x̅₂, … , x̅_K] of the data (in this case, the log-transformed data). The covariance matrix [Σ] can be estimated with the sample covariance $\left[S\right]= \frac{1}{n-1}\left[X^{'}\right]^{T}\left[X^{'}\right]$ , where $\left[X^{'}\right] = \left[X\right] - \frac{1}{n}\left[1\right]\left[X\right]$ is the matrix of anomalies, [1] is an n×n matrix of 1s and [X] is an n×K matrix of the data (log-transformed here). Strictly speaking, the maximum likelihood estimator of [Σ] divides by n rather than n–1; [S] is its unbiased counterpart, and the two are nearly identical for a sample of this size.
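Here is a short sketch of the parameter estimation and of how the fitted model could then be used to generate synthetic data. The matrix `X` below is filled with random numbers as a stand-in for the log-transformed data, so the numbers themselves mean nothing; only the mechanics matter.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 51, 5
X = rng.normal(size=(n, K))     # placeholder for the n x K log-transformed data

# sample mean vector and sample covariance matrix
x_bar = X.mean(axis=0)
X_prime = X - np.ones((n, n)) @ X / n        # anomalies: subtract column means
S = X_prime.T @ X_prime / (n - 1)

# draw synthetic log-space values from the fitted MVN, then back-transform
synthetic_log = rng.multivariate_normal(x_bar, S, size=1000)
synthetic = np.exp(synthetic_log)            # undo the log-transform
```

The matrix `S` computed this way matches `np.cov(X, rowvar=False)`, which is the more convenient call in practice. Exponentiating the draws returns them to the original units and guarantees positive values.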

Below is Python code for all of the fitting and plotting done here.


from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns # seaborn.apionly was removed from newer seaborn releases
from scipy import stats

def fitMVN():
    # set plotting style
    sns.set_style("darkgrid")

    # load streamflow data
    Qdaily = np.loadtxt('data/Qdaily.txt')
    Nyears = int(np.shape(Qdaily)[0]/365)
    Nsites = np.shape(Qdaily)[1]
    Months = ['May','June','July','August','September','October','November','December','January','February','March','April']

    # calculate standard deviation in daily flows each month and squared Mahalanobis distances
    StdMonthly = calc_monthly_std(Qdaily, Nyears, Nsites)
    D2 = calcD2(Nyears, Nsites, np.log(StdMonthly))

    # calculate theoretical quantiles for a chi^2 distribution with dof = Nsites, and for the standard normal distribution
    m = np.array(range(1,Nyears+1))
    p = (m-0.5)/Nyears
    chi2 = stats.chi2.ppf(p,Nsites)
    norm = stats.norm.ppf(p,0,1)

    # initialize matrices to store correlation coefficients and significance levels for marginal normal distributions and chi^2 distributions
    normCorr = np.zeros([Nsites,12])
    norm_sigLevel = np.zeros([Nsites,12])
    chi2Corr = np.zeros([12])
    chi2_sigLevel = np.zeros([12])

    for i in range(len(Months)):
        # plot histograms of standard deviation of daily flows each month, and of their logs
        plotHistograms(Nsites, StdMonthly[:,:,i], 'Standard Deviation of Daily ' + Months[i] + ' Flows', Months[i] + 'Hist.png')
        plotHistograms(Nsites, np.log(StdMonthly[:,:,i]), 'log(Standard Deviation of Daily ' + Months[i] + ' Flows)', \
            'Log' + Months[i] + 'Hist.png')

        # plot QQ plots of standard deviation of daily flows each month, and of their logs
        plotNormQQ(Nsites, StdMonthly[:,:,i], norm, 'Standard Deviation of Daily ' + Months[i] + ' Flows', Months[i] + 'QQ.png')
        normCorr[:,i] = plotNormQQ(Nsites, np.log(StdMonthly[:,:,i]), norm, 'log(Standard Deviation of Daily ' + Months[i] + ' Flows)', 'Log' + Months[i] + 'QQ.png')

        # plot QQ plot of Chi Squared distribution of log of standard deviation in daily flows each month
        chi2Corr[i] = plotChi2QQ(Nsites, D2[:,i], chi2, r'D$\mathregular{^2}\!$ of log(Standard Deviation of Daily ' + Months[i] + ' Flows)', \
            'Log' + Months[i] + 'Chi2QQ.png')

        # find significance levels
        chi2_sigLevel[i] = chi2_MC(Nsites,Nyears,chi2,chi2Corr[i])
        norm_sigLevel[:,i] = norm_MC(Nsites,Nyears,norm,normCorr[:,i])

    np.savetxt('Norm_sigLevels.txt',np.transpose(norm_sigLevel))
    np.savetxt('Norm_corr.txt',np.transpose(normCorr))
    np.savetxt('Chi2_sigLevels.txt',chi2_sigLevel)
    np.savetxt('Chi2_corr.txt',chi2Corr)

    return None

def calc_monthly_std(Qdaily, Nyears, Nsites):
    Nmonths = 12
    # first month = May (1st month of water year)
    DaysPerMonth = np.array([31, 30, 31, 31, 30, 31, 30, 31, 31, 28, 31, 30])

    Qmonthly = np.zeros([Nsites, Nyears, Nmonths])
    StdMonthly = np.zeros([Nsites, Nyears, Nmonths])
    for year in range(Nyears):
        for month in range(Nmonths):
            start = year*365 + np.sum(DaysPerMonth[0:month])

            for i in range(Nsites):
                # find total flow each month
                Qmonthly[i,year,month] = 86400*np.sum(Qdaily[start:start+DaysPerMonth[month],i])

            # find standard deviation in daily flows each month
            for i in range(Nsites):
                for j in range(DaysPerMonth[month]):
                    StdMonthly[i,year,month] = StdMonthly[i,year,month] + \
                        (86400*Qdaily[start+j,i]-Qmonthly[i,year,month]/DaysPerMonth[month])**2

                StdMonthly[i,year,month] = np.sqrt((1/(DaysPerMonth[month]-1))*StdMonthly[i,year,month])

    return StdMonthly

def plotHistograms(Nsites, data, xlabel, filename):
    fig = plt.figure()
    for i in range(Nsites):
        ax = fig.add_subplot(1,Nsites,i+1)
        ax.hist(data[i,:],bins=10,color='navy',alpha=0.8)
        ax.set_title('Site ' + str(i+1),fontsize=16)

    fig.text(0.1, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=14)
    fig.text(0.5, 0.04, xlabel, ha='center', fontsize=14)
    fig.subplots_adjust(bottom=0.15)
    fig.set_size_inches([22.525,4.825])
    fig.savefig('Hists/' + filename)
    fig.clf()

    return None

def plotNormQQ(Nsites, data, norm, title, filename):
    corr = np.zeros([Nsites])
    fig = plt.figure()
    for i in range(Nsites):
        corr[i] = np.corrcoef(np.sort(data[i,:]),norm)[0,1]
        z = (data[i,:] - np.mean(data[i,:]))/np.std(data[i,:])
        ax = fig.add_subplot(1,Nsites,i+1)
        ax.scatter(norm,np.sort(z))
        ax.plot([-3,3],[-3,3],c='r')
        ax.set_title('Site ' + str(i+1),fontsize=16)
        ax.set_xlim([-3,3])
        ax.set_ylim([-3,3])

    fig.text(0.1, 0.5, 'Sample Quantiles', va='center', rotation='vertical', fontsize=14)
    fig.text(0.5, 0.04, 'Theoretical Quantiles', ha='center', fontsize=14)
    fig.suptitle('Normal Q-Q Plot of ' + title,fontsize=16)
    fig.subplots_adjust(bottom=0.15,top=0.85)
    fig.set_size_inches([22.525,4.825])
    fig.savefig('QQplots/' + filename)
    fig.clf()

    return corr

def calcD2(Nyears, Nsites, data):
    D2 = np.zeros([Nyears,12])
    X = np.zeros([Nyears, Nsites])
    Xprime = np.zeros([Nyears,Nsites])
    S = np.zeros(Nsites)
    for i in range(12):
        # fill data matrix, X, for ith month
        for j in range(Nsites):
            X[:,j] = data[j,:,i]

        # calculate covariance matrix, S, for ith month
        Xprime = X - (1/Nyears)*np.dot(np.ones([Nyears,Nyears]),X)
        S = (1/(Nyears-1))*np.dot(np.transpose(Xprime),Xprime)

        #calculate Mahalanobis distance, D2, for each year's ith month
        for j in range(Nyears):
            D2[j,i] = np.dot(np.dot((X[j,:] - np.mean(X,0)),np.linalg.inv(S)),(np.transpose(X[j,:] - np.mean(X,0))))

    return D2

def plotChi2QQ(Nsites, data, chi2, title, filename):
    corr = np.corrcoef(np.sort(data),chi2)[0,1]
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(chi2,np.sort(data))
    ax.plot([0,1.1*np.max(chi2)],[0,1.1*np.max(chi2)],c='r')
    ax.set_xlabel('Theoretical Quantiles',fontsize=16)
    ax.set_xlim([0, 1.1*np.max(chi2)])
    ax.set_ylabel('Sample Quantiles',fontsize=16)
    ax.set_ylim([0,1.1*np.max(data)])
    ax.tick_params(axis='both',labelsize=14)
    ax.set_title(r'$\chi^2$' + ' Q-Q Plot of ' + title,fontsize=16)
    fig.savefig('QQplots/' + filename)
    fig.clf()

    return corr

def chi2_MC(Nsites,Nyears,theoretical,dataCorr):
    corr = np.zeros(10000)
    for i in range(10000): # 10,000 MC simulations
        simulated = stats.chi2.rvs(Nsites,size=Nyears)
        corr[i] = np.corrcoef(np.sort(simulated),theoretical)[0,1]

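    # find significance level (fraction of simulated correlations below the observed one)
    corr = np.sort(corr)
    sigLevel = 0.0 # stays 0 if the observed correlation is below every simulated one
    for i in range(10000):
        if dataCorr > corr[i]:
            sigLevel = (i+0.5)/10000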

    return sigLevel

def norm_MC(Nsites,Nyears,theoretical,dataCorr):
    sigLevel = np.zeros(Nsites)
    corr = np.zeros([10000])
    for i in range(10000): # 10,000 MC simulations
        simulated = stats.norm.rvs(0,1,size=Nyears)
        corr[i] = np.corrcoef(np.sort(simulated),theoretical)[0,1]

    # find significance levels
    corr = np.sort(corr)
    for i in range(10000):
        for j in range(Nsites):
            if dataCorr[j] > corr[i]:
                sigLevel[j] = (i+0.5)/10000

    return sigLevel

fitMVN()

September 12, 2016 by Jazmin Zatarain

Easy vectorized parallel plots for multiple data sets

I will share a very quick and straightforward solution to generate parallel plots of multiple groups of data in Python. The idea is to transition from the parallel axis plot tool to a method that enables the plots to be exported as a vectorized image. You can also take a look at Matt’s Python parallel.py code, available on GitHub: https://github.com/matthewjwoodruff/parallel.py .

This is the type of figure that you will get:

The previous figure was generated with the following lines of code:

import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates
import matplotlib.pyplot as plt
import seaborn

data = pd.read_csv('sample_data.csv')

parallel_coordinates(data,'Name', color= ['#225ea8','#7fcdbb','#1d91c0'], linewidth=5, alpha=.8)
plt.ylabel('Direction of Preference $\\rightarrow$', fontsize=12)

plt.savefig('parallel_plot.svg')

Lines 1-4 are the required libraries. I just threw in the seaborn library to give it the gray background, but it is not necessary (note that in seaborn 0.8 and later, the style is only applied once you call seaborn.set()). In the parallel_coordinates function, you need to specify the data, the name of the column that identifies the groups ('Name' here) and the colors of the different groups. You can substitute the color argument with colormap and specify the colormap that you wish to use (e.g. colormap='YlGnBu'). I also specified an alpha for transparency to see overlapping lines. If you want to learn more, you can take a look at the parallel_coordinates source code. I found this Stack Overflow link very useful; it shows some examples of editing the source code to enable other capabilities.

Finally, the following snippet shows the format of the input data (the sample_data.csv file that is read in line 7):

Columns A-G specify the different categories to be plotted (e.g. the objectives of a problem), and Column H specifies the names of the different data groups. And there you have it; I hope you find this plotting alternative useful.
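If you do not have a CSV file at hand, the same layout can be built directly as a DataFrame. The sketch below is only an illustration: the column names (Obj1 to Obj3), the values and the group names are made up, and I use three categories instead of seven to keep it short. The one requirement is a column (here 'Name') that labels the group of each row.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")   # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# hypothetical data: one row per solution, one column per category, plus a group label
data = pd.DataFrame({
    "Obj1": [0.2, 0.4, 0.9, 0.7],
    "Obj2": [0.8, 0.6, 0.1, 0.3],
    "Obj3": [0.5, 0.5, 0.2, 0.9],
    "Name": ["Group A", "Group A", "Group B", "Group C"],
})

ax = parallel_coordinates(data, "Name", color=["#225ea8", "#7fcdbb", "#1d91c0"],
                          linewidth=2, alpha=0.8)
ax.set_ylabel("Direction of Preference $\\rightarrow$", fontsize=12)

# the .svg extension is what makes the exported figure a vector image
out_path = os.path.join(tempfile.gettempdir(), "parallel_plot_example.svg")
plt.savefig(out_path)
```

Writing the DataFrame out with `data.to_csv('sample_data.csv', index=False)` would give a file in the format expected by the script above.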

September 2, 2016 by Tom Wild

A Guide to Using Git in PyCharm – Part 1

This post is part 1 of a multi-part series intended to describe how to use PyCharm’s integrated Git (version control) features. See Part 2 here. PyCharm makes it easy to link a GitHub account directly to the PyCharm IDE and perform Git-related tasks within the IDE rather than in a command prompt.

Today I will describe a few basic features to help you get started, including: creating a repository, adding files and committing changes. Future posts will discuss more detailed features, including branching, pushing/pulling, merging, etc. While PyCharm does have a very detailed website geared toward explaining some of this, this blog post series is intended to help those who just want some basic steps and pictures to get started.

You can find background information on Git on this blog, including an introduction to git, as well as tutorials on local version control and remote repositories. A glossary of Git terminology and commands can be found here.

PyCharm is one of many IDEs one can use to edit Python code. I offer some background here on why I chose PyCharm for my Python programming needs, and how to get PyCharm.

The following tutorial will assume you already have PyCharm and Git installed, and have a GitHub account.

Within PyCharm, link your GitHub account.

File –> Settings –> Version Control –> GitHub

Input your GitHub login (username) and password.

Create a new PyCharm project as pictured below, by going to:

File –> New Project.

(Note: A “project” can just be a folder/directory of related files that PyCharm will recognize as a “project”. If you already have a project, you can simply open the existing “project” to follow these steps to create a repository).

Having established a new project, now create a local git repository for the project as pictured below, by going to:

VCS –> Import Into Version Control –> Create Git Repository.

If completed correctly, you should see a directory named “.git” appear in the project directory. Note that you must have already installed Git for this to work (test that git.exe works by going to File –> Settings –> Version Control –> Git –> Test).

Having created a Git repository, now create Python files and add them to the repository.

Right click on your project in the Project menu, and select New –> Python File, as pictured below.

PyCharm will prompt you to include this file in your repository, as is pictured below. If you select “Yes”, you can now commit and track changes you make to this file.

Commit the file to the repository by right-clicking the Python file in the Project menu and selecting Git –> Commit File, as is shown in the two images below:

You can include a message with your commit when you are prompted to commit, as shown below:

Note that file names in your project menu will appear in green text as long as the file has not been committed yet. The file name will no longer be green once you commit it.

Now that your file has been added, you can make changes to the file and commit them.

As soon as you make edits to this newly committed file, the file name in the menu will change colors (to blue in my version of PyCharm). This signifies that uncommitted changes exist in that file. You can commit the changes following the same process I described before, or by clicking the green “VCS” up arrow icon on the top menu.

You will now be prompted with a Commit Changes window, as appears below.

You can review changes that have been made, in case you have forgotten what has changed since you last committed, by double clicking on the file name (in the figure above, this would be the blue “blog_post_file.py” icon). PyCharm will show you what changes have been made in green. (It will also show deletions and/or rearrangements of code).

Having committed the file and a change to the file, you can now go to the “Version Control” menu at the bottom of the PyCharm window to unveil a variety of features, including a “Log”, which stores all of the changes I have made to this local repository. The log for my example case is shown below.
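For reference, the menu actions above correspond to a handful of ordinary Git commands. The sketch below runs them from Python in a temporary directory so that it does not touch any real project. It is my own illustration of the equivalent command-line workflow, not something PyCharm exposes: the helper `git`, the placeholder user name and email, and the commit messages are all assumptions.

```python
import os
import shutil
import subprocess
import tempfile

def git(args, cwd):
    """Run a git command in cwd and return its standard output."""
    cmd = ["git", "-c", "user.name=Example User", "-c", "user.email=user@example.com",
           "-c", "commit.gpgsign=false"] + args
    return subprocess.run(cmd, cwd=cwd, check=True, capture_output=True, text=True).stdout

log = None
if shutil.which("git"):                       # only run if Git is installed
    project = tempfile.mkdtemp()              # stands in for the PyCharm project folder
    git(["init"], project)                    # "Create Git Repository"
    path = os.path.join(project, "blog_post_file.py")
    with open(path, "w") as f:
        f.write("print('hello')\n")
    git(["add", "blog_post_file.py"], project)                  # add the file to the repository
    git(["commit", "-m", "Add blog_post_file.py"], project)     # first commit
    with open(path, "a") as f:
        f.write("print('a change')\n")                          # edit the committed file
    git(["commit", "-am", "Edit blog_post_file.py"], project)   # commit the change
    log = git(["log", "--oneline"], project)                    # what the "Log" tab displays
```

After the two commits, `log` holds one line per commit, which is the same history the Log tab presents graphically.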

I will continue to discuss more detailed features in future posts in this series.

Water Programming: A Collaborative Research Blog

Tips and tricks on programming, evolutionary algorithms, and doing research
