In water resources modeling, we often want to generate synthetic multivariate data for a simulation model in a way that preserves their marginal and joint behavior. The multivariate normal (MVN) distribution is a common model choice for these simulations because 1) it often arises naturally due to the Central Limit Theorem, 2) it has many useful properties that make data manipulation convenient, and 3) data can often be transformed to MVN even if that is not their underlying distribution.
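As an example, let X be a K-dimensional random vector with K×1 mean vector μ and K×K covariance matrix [Σ]. If X is multivariate normal, i.e. X~MVN(μ,[Σ]), its probability density function is the following:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{K/2}\sqrt{\det([\Sigma])}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T[\Sigma]^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$$
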
where det(·) denotes the determinant. The term $(x-\mu)^T[\Sigma]^{-1}(x-\mu)$ is called the squared Mahalanobis distance, which measures how far an observation x is from its distribution's mean, scaled by a multi-dimensional measure of spread, [Σ]. It is therefore a measure of how far from the mean each data vector is in a statistical sense, as opposed to Euclidean distance, which only measures distance in a physical sense. The two measures are related, though. If all of the K dimensions of X are independent, every off-diagonal element of [Σ] will be equal to 0, and the diagonal elements will equal the variances in each dimension, $\sigma_k^2$ where k ∈ {1, 2, …, K}. In that case, the squared Mahalanobis distance is equal to the squared Euclidean distance from the mean after scaling each dimension by its standard deviation.
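To make the link between the two distances concrete, here is a minimal sketch with made-up numbers (the mean, standard deviations and observation below are illustrative only, and `squared_mahalanobis` is a hypothetical helper name). With a diagonal covariance matrix, the squared Mahalanobis distance and the squared Euclidean distance of the standardized vector should agree.

```python
import numpy as np

def squared_mahalanobis(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu)."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # solve cov * y = d instead of forming the inverse explicitly
    return float(d @ np.linalg.solve(cov, d))

# illustrative values only (not from the streamflow data)
mu = np.array([1.0, 2.0, 3.0])       # mean vector
sigma = np.array([0.5, 2.0, 4.0])    # standard deviation in each dimension
cov = np.diag(sigma**2)              # independent dimensions -> diagonal covariance
x = np.array([2.0, 0.0, 7.0])        # one observation

d2_mahalanobis = squared_mahalanobis(x, mu, cov)

# squared Euclidean distance after centering and scaling each dimension
z = (x - mu) / sigma
d2_scaled_euclidean = float(np.sum(z**2))

print(d2_mahalanobis, d2_scaled_euclidean)
```

Once the off-diagonal elements of the covariance matrix are non-zero, the two numbers generally differ, because the Mahalanobis distance also accounts for correlation between dimensions.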
So how do we fit MVN distributions? Well, the MVN distribution has “four handy properties” (Wilks, 2011) that we can test. Here I will discuss two of them and how we can use these properties to test for multivariate normality. See Chapter 11 of Wilks (2011) for additional tests for multivariate normality.
Let X be a set of n joint observations of K variables. Denote each of the n observations $x_i = [x_{i1}, x_{i2}, \ldots, x_{iK}]$ where i ∈ {1, 2, …, n} and each of the K marginals $X_k = [x_{1k}, x_{2k}, \ldots, x_{nk}]$ where k ∈ {1, 2, …, K}. If X~MVN(μ,[Σ]), the following two properties (among others) hold:
- All marginal distributions of X are univariate normal, i.e. $X_k \sim N(\mu_k, \sigma_k^2)$.
- The squared Mahalanobis distances, $D_i^2 = (x_i-\mu)^T[\Sigma]^{-1}(x_i-\mu)$, have a $\chi^2_K$ distribution, i.e. a chi-squared distribution with K degrees of freedom.
So if we want to fit a MVN distribution to X, each of these will have to be true. Let’s look at an example where X is the standard deviation in daily flows during all historical Septembers at five different sites within the same basin. In this case, K=5 for the 5 sites and n=51, as there are 51 years in the historical record. To fit a MVN distribution to X, we’ll first want to ensure that the marginal distributions of the standard deviations in daily September flows are normal at each of the K sites. Let’s inspect these distributions visually with a histogram, first:
Clearly these distributions are not normal, as they are positively skewed. But that's okay, we can transform the data so that they look more normal. The Box-Cox transformation is commonly used to transform non-normal data, X, to normal data, Y (see Chapter 3 of Wilks (2011) for more details):

$$Y = \begin{cases} \dfrac{X^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(X), & \lambda = 0 \end{cases}$$

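As a rough illustration of the Box-Cox transformation, the sketch below implements it with NumPy only. The function name `box_cox` and the sample values are assumptions made for this example; they are not part of the original analysis.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive data x for a given lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(x)             # the lambda = 0 case is the log-transform
    return (x**lam - 1.0) / lam

# illustrative positive values (not the streamflow data)
x = np.array([0.5, 1.0, 2.0, 10.0])

y_log = box_cox(x, 0)        # log-transform
y_sqrt = box_cox(x, 0.5)     # a milder transformation
y_near0 = box_cox(x, 1e-6)   # approaches the log-transform as lambda -> 0

print(y_log)
print(y_near0)
```

The last line shows why λ=0 is defined as the logarithm: it is the limit of the general formula as λ shrinks toward zero.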
Using λ=0 (a log-transform), our transformed data look like this:
These look much better! We can perform a formal hypothesis test to confirm that each of these 5 transformed data series is not inconsistent with the normal distribution using a number of tests, such as the Shapiro-Wilk test, the Kolmogorov-Smirnov test, and the Filliben Q-Q correlation test, which I use here (see Chapter 5 of Wilks, 2011 for a description of other tests). The Filliben Q-Q test finds the correlation between the sample data quantiles and the theoretical quantiles of the distribution being fit. I've plotted these below; the correlation coefficients at these 5 sites are [0.9922, 0.9951, 0.9822, 0.9909, 0.9945].
Rejection regions for the Filliben Q-Q test for the normal distribution are tabulated for different significance levels and sample sizes based on Monte Carlo results. The relevant section of the table is copied below. For a sample size of n≈50, the rejection region at the 10% level is ρ≤0.981, so even at the site with the lowest correlation (Site 3: 0.9822) we fail to reject the null hypothesis that the data are normal. Put differently, if the data were normal, there would be a 10% chance that a data series of length n=50 would have a correlation coefficient at or below 0.981.
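The Q-Q correlation itself takes only a few lines to compute. The sketch below uses synthetic data rather than the streamflow record, and the plotting positions (i-0.5)/n are the ones used in the code at the end of this post. The helper name `normal_qq_correlation` is made up for this example.

```python
import numpy as np
from scipy import stats

def normal_qq_correlation(sample):
    """Correlation between the sorted sample and standard normal quantiles."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    p = (np.arange(1, n + 1) - 0.5) / n   # plotting positions
    q = stats.norm.ppf(p)                 # theoretical quantiles
    return np.corrcoef(x, q)[0, 1]

# synthetic example: a strongly skewed (lognormal) sample of length 51
rng = np.random.default_rng(0)
z = rng.normal(size=51)
skewed = np.exp(2.0 * z)

r_raw = normal_qq_correlation(skewed)          # before transformation
r_log = normal_qq_correlation(np.log(skewed))  # after a log-transform

critical_10pct = 0.981   # tabulated value quoted in the text for n of about 50
print(r_raw, r_raw <= critical_10pct)
print(r_log, r_log <= critical_10pct)
```

Because the correlation coefficient is unchanged by shifting or rescaling the data, there is no need to standardize the sample before comparing it with standard normal quantiles.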
So now we know that none of the marginal distributions is inconsistent with the normal distribution, but that does not guarantee that the joint distribution across sites will be multivariate normal. There could be multivariate outliers: points that are not extreme within any particular site's distribution, but are extreme within the context of the overall covariance structure. We test this by checking that the squared Mahalanobis distances are not inconsistent with a $\chi^2_K$ distribution. Again, this can be done by comparing the sample data quantiles to the theoretical quantiles (figure below). Here the correlation coefficient is 0.9964.
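A compact way to carry out this check is sketched below on synthetic MVN data (not the streamflow record). The function names are hypothetical. The covariance is the sample covariance with an n-1 divisor, and the plotting positions are (i-0.5)/n.

```python
import numpy as np
from scipy import stats

def squared_mahalanobis_distances(X):
    """Squared Mahalanobis distance of each row of the n x K matrix X."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)           # K x K sample covariance (n-1 divisor)
    S_inv = np.linalg.inv(S)
    # row-wise quadratic form centered_i^T S^{-1} centered_i
    return np.einsum('ij,ij->i', centered @ S_inv, centered)

def chi2_qq_correlation(D2, K):
    """Correlation between sorted D2 and chi-squared(K) quantiles."""
    n = len(D2)
    p = (np.arange(1, n + 1) - 0.5) / n
    q = stats.chi2.ppf(p, K)
    return np.corrcoef(np.sort(D2), q)[0, 1]

# synthetic stand-in for the log-transformed data: n = 51 years, K = 5 sites
rng = np.random.default_rng(1)
n, K = 51, 5
X = rng.multivariate_normal(np.zeros(K), np.eye(K), size=n)

D2 = squared_mahalanobis_distances(X)
r = chi2_qq_correlation(D2, K)
print(r)
```

A useful sanity check on the distance calculation is that, with the n-1 sample covariance, the squared distances always sum to (n-1)K.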
Because the rejection regions will depend not only on the sample size (n) and significance level, but also on the number of degrees of freedom (K), there are no tabulated critical values for this test (there would need to be a separate table for every possible K). Instead of using a table, one has to perform a Monte Carlo simulation to calculate the critical region for the specific application. In this case, I did that by generating 10,000 random samples of length n=51 from a $\chi^2$ distribution with K=5 degrees of freedom. Of the generated samples, 97.8% had correlation coefficients less than the observed value of 0.9964, suggesting that this sample is very consistent with a $\chi^2_5$ distribution.
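The Monte Carlo step can be sketched in vectorized form as follows. This is an illustration under the same assumptions as the text (samples drawn directly from a χ² distribution, plotting positions (i-0.5)/n). The function name, seed and variable names are made up, and the exact percentages will differ slightly from run to run and from the 97.8% reported above.

```python
import numpy as np
from scipy import stats

def chi2_qq_null_correlations(n, K, n_sims=10000, seed=0):
    """Q-Q correlations of n_sims chi-squared(K) samples of length n."""
    rng = np.random.default_rng(seed)
    p = (np.arange(1, n + 1) - 0.5) / n
    q = stats.chi2.ppf(p, K)                    # theoretical quantiles
    sims = rng.chisquare(K, size=(n_sims, n))   # one simulated sample per row
    sims.sort(axis=1)                           # sample quantiles
    # Pearson correlation of every row with q, computed in one pass
    qc = q - q.mean()
    sc = sims - sims.mean(axis=1, keepdims=True)
    return (sc @ qc) / np.sqrt((sc**2).sum(axis=1) * (qc**2).sum())

null_corr = chi2_qq_null_correlations(n=51, K=5)

observed = 0.9964                               # value reported in the text
critical_10pct = np.quantile(null_corr, 0.10)   # estimated 10% rejection threshold
frac_below = np.mean(null_corr < observed)      # share of simulations below observed

print(critical_10pct, frac_below)
```

If the observed correlation falls below `critical_10pct`, the χ² hypothesis would be rejected at the 10% level. A large `frac_below` means the observed sample looks better than most samples that truly come from the distribution.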
So now that we know the MVN is a good fit for the log-transformed standard deviations in daily September flows, we can estimate the model parameters. This part is easy! The maximum likelihood estimator (MLE) of the mean vector μ is the sample mean vector x̅ = [x̅1, x̅2, …, x̅K] of the data (in this case, the log-transformed data), while the covariance [Σ] is estimated by the sample covariance $[S] = \frac{1}{n-1}[X']^T[X']$, where $[X'] = [X] - \frac{1}{n}[1][X]$, with [1] being an n×n matrix of 1s and [X] an n×K matrix of the data (log-transformed here). Strictly speaking, the MLE of [Σ] divides by n rather than n-1; the n-1 version used here is the unbiased estimate, and the two are nearly identical for n=51.
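The parameter estimates can be sketched in a few lines of NumPy. The data matrix below is random placeholder data standing in for the n×K matrix of log-transformed values, and all variable names are assumptions for this example. The final lines show one way the fitted model might then be used to draw synthetic values.

```python
import numpy as np

# placeholder for the n x K matrix of log-transformed data (random, for illustration)
rng = np.random.default_rng(2)
n, K = 51, 5
X = rng.normal(loc=3.0, scale=0.5, size=(n, K))

# sample mean vector
x_bar = X.mean(axis=0)

# centered data: [X'] = [X] - (1/n)[1][X], with [1] an n x n matrix of ones
ones = np.ones((n, n))
X_prime = X - (ones @ X) / n

# sample covariance (n-1 divisor) and the strict MLE (n divisor)
S = (X_prime.T @ X_prime) / (n - 1)
S_mle = (X_prime.T @ X_prime) / n

# draw synthetic log-space values from the fitted MVN, then back-transform
synthetic_log = rng.multivariate_normal(x_bar, S, size=1000)
synthetic = np.exp(synthetic_log)   # undo the log-transform

print(x_bar)
print(S)
```

In practice `np.cov(X, rowvar=False)` gives the same [S] without building the n×n matrix of ones; the explicit form is shown only to mirror the formula.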
Below is Python code for all of the fitting and plotting done here.
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns  # seaborn.apionly is no longer available in current seaborn releases
from scipy import stats

def fitMVN():
    # set plotting style
    sns.set_style("darkgrid")

    # load streamflow data
    Qdaily = np.loadtxt('data/Qdaily.txt')
    Nyears = int(np.shape(Qdaily)[0]/365)
    Nsites = np.shape(Qdaily)[1]
    Months = ['May','June','July','August','September','October','November','December','January','February','March','April']

    # calculate standard deviation in daily flows each month and squared Mahalanobis distances
    StdMonthly = calc_monthly_std(Qdaily, Nyears, Nsites)
    D2 = calcD2(Nyears, Nsites, np.log(StdMonthly))

    # calculate theoretical quantiles for a chi^2 distribution with dof = Nsites, and for the standard normal distribution
    m = np.array(range(1,Nyears+1))
    p = (m-0.5)/Nyears
    chi2 = stats.chi2.ppf(p,Nsites)
    norm = stats.norm.ppf(p,0,1)

    # initialize matrices to store correlation coefficients and significance levels for marginal normal distributions and chi^2 distributions
    normCorr = np.zeros([Nsites,12])
    norm_sigLevel = np.zeros([Nsites,12])
    chi2Corr = np.zeros([12])
    chi2_sigLevel = np.zeros([12])

    for i in range(len(Months)):
        # plot histograms of standard deviation of daily flows each month, and of their logs
        plotHistograms(Nsites, StdMonthly[:,:,i], 'Standard Deviation of Daily ' + Months[i] + ' Flows', Months[i] + 'Hist.png')
        plotHistograms(Nsites, np.log(StdMonthly[:,:,i]), 'log(Standard Deviation of Daily ' + Months[i] + ' Flows)', \
            'Log' + Months[i] + 'Hist.png')

        # plot QQ plots of standard deviation of daily flows each month, and of their logs
        plotNormQQ(Nsites, StdMonthly[:,:,i], norm, 'Standard Deviation of Daily ' + Months[i] + ' Flows', Months[i] + 'QQ.png')
        normCorr[:,i] = plotNormQQ(Nsites, np.log(StdMonthly[:,:,i]), norm, 'log(Standard Deviation of Daily ' + Months[i] + ' Flows)', 'Log' + Months[i] + 'QQ.png')

        # plot QQ plot of Chi Squared distribution of log of standard deviation in daily flows each month
        # (raw string so the backslashes in the mathtext are not treated as escape sequences)
        chi2Corr[i] = plotChi2QQ(Nsites, D2[:,i], chi2, r'D$\mathregular{^2}\!$ of log(Standard Deviation of Daily ' + Months[i] + ' Flows)', \
            'Log' + Months[i] + 'Chi2QQ.png')

        # find significance levels
        chi2_sigLevel[i] = chi2_MC(Nsites,Nyears,chi2,chi2Corr[i])
        norm_sigLevel[:,i] = norm_MC(Nsites,Nyears,norm,normCorr[:,i])

    np.savetxt('Norm_sigLevels.txt',np.transpose(norm_sigLevel))
    np.savetxt('Norm_corr.txt',np.transpose(normCorr))
    np.savetxt('Chi2_sigLevels.txt',chi2_sigLevel)
    np.savetxt('Chi2_corr.txt',chi2Corr)

    return None

def calc_monthly_std(Qdaily, Nyears, Nsites):
    Nmonths = 12
    # first month = May (1st month of water year)
    DaysPerMonth = np.array([31, 30, 31, 31, 30, 31, 30, 31, 31, 28, 31, 30])

    Qmonthly = np.zeros([Nsites, Nyears, Nmonths])
    StdMonthly = np.zeros([Nsites, Nyears, Nmonths])
    for year in range(Nyears):
        for month in range(Nmonths):
            start = year*365 + np.sum(DaysPerMonth[0:month])

            for i in range(Nsites):
                # find total flow each month
                Qmonthly[i,year,month] = 86400*np.sum(Qdaily[start:start+DaysPerMonth[month],i])

            # find standard deviation in daily flows each month
            for i in range(Nsites):
                for j in range(DaysPerMonth[month]):
                    StdMonthly[i,year,month] = StdMonthly[i,year,month] + \
                        (86400*Qdaily[start+j,i]-Qmonthly[i,year,month]/DaysPerMonth[month])**2

                StdMonthly[i,year,month] = np.sqrt((1/(DaysPerMonth[month]-1))*StdMonthly[i,year,month])

    return StdMonthly

def plotHistograms(Nsites, data, xlabel, filename):
    fig = plt.figure()
    for i in range(Nsites):
        ax = fig.add_subplot(1,Nsites,i+1)
        ax.hist(data[i,:],bins=10,color='navy',alpha=0.8)
        ax.set_title('Site ' + str(i+1),fontsize=16)

    fig.text(0.1, 0.5, 'Frequency', va='center', rotation='vertical', fontsize=14)
    fig.text(0.5, 0.04, xlabel, ha='center', fontsize=14)
    fig.subplots_adjust(bottom=0.15)
    fig.set_size_inches([22.525,4.825])
    # note: the 'Hists/' directory must already exist
    fig.savefig('Hists/' + filename)
    fig.clf()

    return None

def plotNormQQ(Nsites, data, norm, title, filename):
    corr = np.zeros([Nsites])
    fig = plt.figure()
    for i in range(Nsites):
        corr[i] = np.corrcoef(np.sort(data[i,:]),norm)[0,1]
        z = (data[i,:] - np.mean(data[i,:]))/np.std(data[i,:])
        ax = fig.add_subplot(1,Nsites,i+1)
        ax.scatter(norm,np.sort(z))
        ax.plot([-3,3],[-3,3],c='r')
        ax.set_title('Site ' + str(i+1),fontsize=16)
        ax.set_xlim([-3,3])
        ax.set_ylim([-3,3])

    fig.text(0.1, 0.5, 'Sample Quantiles', va='center', rotation='vertical', fontsize=14)
    fig.text(0.5, 0.04, 'Theoretical Quantiles', ha='center', fontsize=14)
    fig.suptitle('Normal Q-Q Plot of ' + title,fontsize=16)
    fig.subplots_adjust(bottom=0.15,top=0.85)
    fig.set_size_inches([22.525,4.825])
    # note: the 'QQplots/' directory must already exist
    fig.savefig('QQplots/' + filename)
    fig.clf()

    return corr

def calcD2(Nyears, Nsites, data):
    D2 = np.zeros([Nyears,12])
    X = np.zeros([Nyears, Nsites])
    Xprime = np.zeros([Nyears,Nsites])
    S = np.zeros(Nsites)
    for i in range(12):
        # fill data matrix, X, for ith month
        for j in range(Nsites):
            X[:,j] = data[j,:,i]

        # calculate covariance matrix, S, for ith month
        Xprime = X - (1/Nyears)*np.dot(np.ones([Nyears,Nyears]),X)
        S = (1/(Nyears-1))*np.dot(np.transpose(Xprime),Xprime)

        # calculate squared Mahalanobis distance, D2, for each year's ith month
        for j in range(Nyears):
            D2[j,i] = np.dot(np.dot((X[j,:] - np.mean(X,0)),np.linalg.inv(S)),(np.transpose(X[j,:] - np.mean(X,0))))

    return D2

def plotChi2QQ(Nsites, data, chi2, title, filename):
    corr = np.corrcoef(np.sort(data),chi2)[0,1]
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(chi2,np.sort(data))
    ax.plot([0,1.1*np.max(chi2)],[0,1.1*np.max(chi2)],c='r')
    ax.set_xlabel('Theoretical Quantiles',fontsize=16)
    ax.set_xlim([0, 1.1*np.max(chi2)])
    ax.set_ylabel('Sample Quantiles',fontsize=16)
    ax.set_ylim([0,1.1*np.max(data)])
    ax.tick_params(axis='both',labelsize=14)
    ax.set_title(r'$\chi^2$' + ' Q-Q Plot of ' + title,fontsize=16)
    fig.savefig('QQplots/' + filename)
    fig.clf()

    return corr

def chi2_MC(Nsites,Nyears,theoretical,dataCorr):
    corr = np.zeros(10000)
    for i in range(10000): # 10,000 MC simulations
        simulated = stats.chi2.rvs(Nsites,size=Nyears)
        corr[i] = np.corrcoef(np.sort(simulated),theoretical)[0,1]

    # find significance levels
    corr = np.sort(corr)
    # default of 0 avoids an undefined sigLevel if dataCorr is below every simulated value
    sigLevel = 0.0
    for i in range(10000):
        if dataCorr > corr[i]:
            sigLevel = (i+0.5)/10000

    return sigLevel

def norm_MC(Nsites,Nyears,theoretical,dataCorr):
    sigLevel = np.zeros(Nsites)
    corr = np.zeros([10000])
    for i in range(10000): # 10,000 MC simulations
        simulated = stats.norm.rvs(0,1,size=Nyears)
        corr[i] = np.corrcoef(np.sort(simulated),theoretical)[0,1]

    # find significance levels
    corr = np.sort(corr)
    for i in range(10000):
        for j in range(Nsites):
            if dataCorr[j] > corr[i]:
                sigLevel[j] = (i+0.5)/10000

    return sigLevel

fitMVN()