This blog post is adapted from material I learned during the 2021 San Diego Supercomputer Center (SDSC) Summer Institute. This was an introductory boot camp to high-performance computing (HPC), and one of the modules taught the application of Numba for in-line parallelization and speeding up of Python code.
What is Numba?
According to its official web page, Numba is a just-in-time (JIT) compiler that translates subsets of Python and NumPy code into fast machine code, enabling it to run at speeds approaching those of C or Fortran. This is because JIT compilation compiles specific functions only when they are first needed. Numba also caches the compiled version of a function for each combination of argument data types, which eliminates the need for recompilation every time the function is called with the same data types.
This blog post will demonstrate simple examples of using Numba and its most commonly used decorator, @jit, via Jupyter Notebook. The Binder file containing all the executable code can be found here.
Note: the ‘@‘ symbol indicates the use of a decorator.
Installing Numba and Setting up the Jupyter Notebook
First, in your command prompt, enter:
pip install numba
Alternatively, you can also use:
conda install numba
Next, import Numba:
import numpy as np
import numba
from numba import jit
from numba import vectorize
Great! Now let’s move on to using the @jit decorator.
Using @jit for executing functions on the CPU
The @jit decorator works best on numerical functions that use NumPy. It has two modes: nopython mode and object mode. Setting nopython=True tells the compiler to run the entire decorated function without involving the Python interpreter; this setting gives the best performance. However, if nopython is not set at all, the compiler defaults to object mode. In object mode, Numba identifies the loops that it can compile into machine code and runs the remaining code in the interpreter.
@jit is demonstrated on a simple matrix multiplication function:
# a function that performs matrix-vector multiplication
@jit(nopython=True)
def matrix_multiplication(A, x):
    b = np.empty(shape=(x.shape[0], 1), dtype=np.float64)
    for i in range(x.shape[0]):
        b[i] = np.dot(A[i, :], x)
    return b
Remember – decorating the function with @jit does not compile it yet! Compilation only happens when you call the function:
A = np.random.rand(10, 10)
x = np.random.rand(10, 1)
matrix_multiplication(A, x)
But how much faster is Numba, really? To find out, some benchmarking is in order. Jupyter Notebook has a handy magic command called %timeit that runs a statement many times in a loop to report its average execution time. It can be used as follows:
%timeit matrix_multiplication(A, x)
# 11.4 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Numba functions have a special .py_func attribute that runs the original, uncompiled Python function. Use it to compare the original runtime to that of the decorated version:
%timeit matrix_multiplication.py_func(A, x)
# 35.5 µs ± 3.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
From here, you can see that the Numba version runs about 3 times faster than the uncompiled version. Beyond NumPy arrays, Numba also supports tuples, integers, floats, and Python lists. All other Python features supported by Numba can be found here.
Besides explicitly declaring @jit at the start of a function definition, Numba also makes it simple to turn an existing NumPy function into a Numba function by applying jit(nopython=True) to the original function. This essentially uses the @jit decorator as a function. The function below, which calculates absolute percentage relative error, demonstrates how this is done:
# calculate absolute percentage relative error
def numpy_re(x, true):
    return np.abs((x - true) / true) * 100

numba_re = jit(nopython=True)(numpy_re)
And we can see that the Numba version is faster:
%timeit numpy_re(x, 0.66)
%timeit numba_re(x, 0.66)
where the NumPy version takes approximately 2.61 microseconds to run, while the Numba version takes 687 nanoseconds.
Inline parallelization with Numba
The @jit decorator can also enable inline parallelization by setting its parallelization pass, parallel=True. Parallelization in Numba is done via multi-threading: the work is split into threads that are distributed over all the available CPU cores. An example of this can be seen in the code snippet below, describing a function that evaluates the normal distribution density at a point x over arrays of means and standard deviations:
SQRT_2PI = np.sqrt(2 * np.pi)

@jit(nopython=True, parallel=True)
def normals(x, means, sds):
    result = np.exp(-0.5 * ((x - means) / sds) ** 2)
    return (1 / (sds * SQRT_2PI)) * result
As usual, the function must be compiled:
means = np.random.uniform(-1, 1, size=10**8)
sds = np.random.uniform(0.1, 0.2, size=10**8)
normals(0.6, means, sds)
To appreciate the speed-up that Numba’s multi-threading provides, compare the runtime for this with:
- A decorated version of the function with a disabled parallel pass
- The uncompiled, original NumPy function
The first example can be timed by:
normals_deco_nothread = jit(nopython=True)(normals.py_func)
%timeit normals_deco_nothread(0.6, means, sds)
# 3.24 s ± 757 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The first line of the code snippet takes an uncompiled copy of the normals function and applies the @jit decorator to it. This effectively creates a version of normals that is compiled with @jit but is not multi-threaded. This run of the function took approximately 3.2 seconds.
For the second example, simply:
%timeit normals.py_func(0.6, means, sds)
# 7.38 s ± 759 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now, compare both of these examples to the runtime of the decorated, multi-threaded version:
%timeit normals(0.6, means, sds)
# 933 ms ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The decorated, multi-threaded function is significantly faster (933 ms) than the decorated function without multi-threading (3.24 s), which in turn is faster than the uncompiled original NumPy function (7.38 s). However, the degree of speed-up may vary depending on the number of CPUs that the machine has available.
In general, the improvements achieved by using Numba on top of NumPy functions are marginal for simple, few-loop functions. Nevertheless, Numba is particularly useful for large datasets or high-dimensional arrays that require a large number of loops, and would benefit from the one-and-done compilation that it enables. For more information on using Numba, please refer to its official web page.