Imagine that you have a matrix X with dimensions (N x k) and a vector y of size (1 x k), and you want to calculate the linear correlation coefficient between y and each row of X to obtain a vector of correlation coefficients, r, of size (N x 1) .
The first way you might consider doing this is in a loop with the np.corrcoef function, which returns the linear correlation coefficient between two vectors:
r = np.zeros((N,1))
for i in xrange(0,N):
r[k] = np.corrcoef(X[i,:], y)[0,1]
If N is large, or if you need to perform this calculation many times (with an outer loop wrapped around it), this will be very slow. We can time it as follows:
for t in xrange(0,10): # average 10 timings
start = time.time()
r = np.zeros((10000,1))
for k in xrange(0,10000):
r[k] = np.corrcoef(vic_runoff[k,:], obs_runoff)[0,1]
end = time.time()
times[t] = end-start
print np.mean(times)
In this example with N = 10000, the average time is 0.48 seconds per calculation of the r vector. Too slow!
How can we speed this up? Some NumPy functions, such as sum and mean, have an axis argument to perform the operation along a particular dimension, but corrcoef does not (presumably because people often want the correlations between the rows of the matrix X, rather than the correlation of all rows with respect to a single vector). Recall the formula for the sample correlation coefficient between 1D vectors x and y:
linear correlation coefficient (from Wikipedia)
We can write a function using NumPy’s vectorized arithmetic to compute these values all at once rather than in a loop. For example, np.multiply(X,y) (also given by X*y) performs element-wise multiplication of the vector y over all rows of the matrix X. The function might look something like this:
def vcorrcoef(X,y):
Xm = np.reshape(np.mean(X,axis=1),(X.shape[0],1))
ym = np.mean(y)
r_num = np.sum((X-Xm)*(y-ym),axis=1)
r_den = np.sqrt(np.sum((X-Xm)**2,axis=1)*np.sum((y-ym)**2))
r = r_num/r_den
return r
If we try timing it as before:
for t in xrange(0,10): # average 10 timings
start = time.time()
r = vcorrcoef(vic_runoff,obs_runoff)
end = time.time()
times[t] = end-start
print np.mean(times)
The average time is only 0.0017 seconds—a reduction of about 99.6%, or a 280x speedup compared to the non-vectorized approach. Not bad! It’s often said that Python is slow, but the matrix operations in NumPy can clearly be used to our advantage.