NumPy vectorized correlation coefficient

Imagine that you have a matrix X with dimensions (N x k) and a vector y of size (1 x k), and you want to calculate the linear correlation coefficient between y and each row of X to obtain a vector of correlation coefficients, r, of size (N x 1) .

The first way you might consider doing this is in a loop with the np.corrcoef function, which returns the linear correlation coefficient between two vectors:

r = np.zeros((N,1))
for i in xrange(0,N):
    r[k] = np.corrcoef(X[i,:], y)[0,1]

If N is large, or if you need to perform this calculation many times (with an outer loop wrapped around it), this will be very slow. We can time it as follows:

for t in xrange(0,10): # average 10 timings

    start = time.time()
    r = np.zeros((10000,1))
    for k in xrange(0,10000):
        r[k] = np.corrcoef(vic_runoff[k,:], obs_runoff)[0,1]

    end = time.time()
    times[t] = end-start

print np.mean(times)

In this example with N = 10000, the average time is 0.48 seconds per calculation of the r vector. Too slow!

How can we speed this up? Some NumPy functions, such as sum and mean, have an axis argument to perform the operation along a particular dimension, but corrcoef does not (presumably because people often want the correlations between the rows of the matrix X, rather than the correlation of all rows with respect to a single vector). Recall the formula for the sample correlation coefficient between 1D vectors x and y:

linear correlation coefficient (from Wikipedia)

linear correlation coefficient (from Wikipedia)

We can write a function using NumPy’s vectorized arithmetic to compute these values all at once rather than in a loop. For example, np.multiply(X,y) (also given by X*y) performs element-wise multiplication of the vector y over all rows of the matrix X. The function might look something like this:

def vcorrcoef(X,y):
    Xm = np.reshape(np.mean(X,axis=1),(X.shape[0],1))
    ym = np.mean(y)
    r_num = np.sum((X-Xm)*(y-ym),axis=1)
    r_den = np.sqrt(np.sum((X-Xm)**2,axis=1)*np.sum((y-ym)**2))
    r = r_num/r_den
    return r

If we try timing it as before:

for t in xrange(0,10): # average 10 timings

    start = time.time()
    r = vcorrcoef(vic_runoff,obs_runoff)
    end = time.time()
    times[t] = end-start

print np.mean(times)

The average time is only 0.0017 seconds—a reduction of about 99.6%, or a 280x speedup compared to the non-vectorized approach. Not bad! It’s often said that Python is slow, but the matrix operations in NumPy can clearly be used to our advantage.

Advertisements

5 thoughts on “NumPy vectorized correlation coefficient

  1. Pingback: Water Programming Blog Guide (Part I) – Water Programming: A Collaborative Research Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s