This blog post is adapted from material I learned during the 2021 San Diego Supercomputer Center (SDSC) Summer Institute. This was an introductory boot camp to high-performance computing (HPC), and one of the modules taught the application of Numba for in-line parallelization and speeding up of Python code.

**What is Numba?**

According to its official web page, Numba is a just-in-time (JIT) compiler that translates subsets of Python and NumPy code into fast machine code, enabling it to run at speeds approaching that of C or Fortran. This is because JIT compilation compiles specific functions only when they are first needed. Numba also caches the compiled version generated for each set of argument types passed to a function, eliminating recompilation when the function is called again with the same data types.

This blog post will demonstrate a simple example of using Numba and its most commonly used decorator, `@jit`, via Jupyter Notebook. **The Binder file containing all the executable code can be found here**.

*Note: the `@` symbol indicates the use of a decorator.*

**Installing Numba and Setting up the Jupyter Notebook**

First, in your command prompt, enter:

`pip install numba`

Alternatively, you can also use:

`conda install numba`

Next, import Numba:

```
import numpy as np
import numba
from numba import jit
from numba import vectorize
```

Great! Now let’s move on to using the `@jit` decorator.

**Using @jit for executing functions on the CPU**

The `@jit` decorator works best on numerical functions that use NumPy. It has two modes: `nopython` mode and `object` mode. Setting `nopython=True` tells the compiler to bypass the Python interpreter when running the entire decorated function, which yields the best performance. However, when:

- `nopython=True` fails,
- `nopython=False` is set, or
- `nopython` is not set at all,

the compiler defaults to `object` mode. In `object` mode, Numba identifies loops that it can compile into machine code and runs the remaining code in the interpreter.

Here, `@jit` is demonstrated on a simple matrix multiplication function:

```
# A function that performs row-by-row matrix-vector multiplication
@jit(nopython=True)
def matrix_multiplication(A, x):
    b = np.empty(shape=(x.shape[0], 1), dtype=np.float64)
    for i in range(x.shape[0]):
        b[i] = np.dot(A[i, :], x)
    return b
```

Remember – the use of `@jit` means that this function has not been compiled yet! Compilation only happens the first time you call the function:

```
A = np.random.rand(10, 10)
x = np.random.rand(10, 1)
matrix_multiplication(A, x)
```

But how much faster is Numba *really*? To find out, some benchmarking is in order. Jupyter Notebook has a handy magic command called `%timeit` that runs a statement many times in a loop to get its average execution time. It can be used as follows:

```
%timeit matrix_multiplication(A,x)
# 11.4 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Numba has a special `.py_func` attribute that effectively allows the decorated function to run as the original uncompiled Python function. Using this to compare its runtime to that of the decorated version:

```
%timeit matrix_multiplication.py_func(A,x)
# 35.5 µs ± 3.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

From here, you can see that the Numba version runs about 3 times faster than the pure NumPy version. In addition to NumPy arrays, Numba also supports tuples, integers, floats, and Python lists. All other Python features supported by Numba can be found **here**.

Besides explicitly declaring `@jit` at the start of a function, Numba makes it simple to turn a NumPy function into a Numba function by applying `jit(nopython=True)` to the original function. This essentially uses the `@jit` decorator as a function. The function to calculate absolute percentage relative error demonstrates how this is done:

```
# Calculate absolute percentage relative error
def numpy_re(x, true):
    return np.abs((x - true) / true) * 100

numba_re = jit(nopython=True)(numpy_re)
```

And we can see that the Numba version is faster:

```
%timeit numpy_re(x, 0.66)
%timeit numba_re(x, 0.66)
```

where the NumPy version takes approximately 2.61 microseconds to run, while the Numba version takes 687 nanoseconds.

**Inline parallelization with Numba**

The `@jit` decorator can also be used to enable inline parallelization by setting `parallel=True`. Parallelization in Numba is done via multi-threading: the work is split into chunks that are distributed over all the available CPU cores. An example of this can be seen in the code snippet below, describing a function that calculates the normal probability density of a set of data with given means and standard deviations:

```
SQRT_2PI = np.sqrt(2 * np.pi)

@jit(nopython=True, parallel=True)
def normals(x, means, sds):
    result = np.exp(-0.5 * ((x - means) / sds) ** 2)
    return (1 / (sds * SQRT_2PI)) * result
```

As usual, the function must be called once to compile it:

```
means = np.random.uniform(-1,1, size=10**8)
sds = np.random.uniform(0.1, 0.2, size=10**8)
normals(0.6, means, sds)
```

To appreciate the speed-up that Numba’s multi-threading provides, compare the runtime for this with:

- A decorated version of the function with a disabled parallel pass
- The uncompiled, original NumPy function

The first example can be timed by:

```
normals_deco_nothread = jit(nopython=True)(normals.py_func)
%timeit normals_deco_nothread(0.6, means, sds)
# 3.24 s ± 757 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

The first line of the code snippet makes an uncompiled copy of the `normals` function and then applies the `@jit` decorator to it. This effectively creates a version of `normals` that uses `@jit`, but is not multi-threaded. This run of the function took approximately 3.24 seconds.

For the second example, simply:

```
%timeit normals.py_func(0.6, means, sds)
# 7.38 s ± 759 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Now, compare both of these examples to the runtime of the decorated, multi-threaded `normals` function:

```
%timeit normals(0.6, means, sds)
# 933 ms ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

The decorated, multi-threaded function is significantly faster (933 ms) than the decorated function without multi-threading (3.24 s), which in turn is faster than the uncompiled original NumPy function (7.38 s). However, the degree of speed-up may vary depending on the number of CPUs that the machine has available.

**Summary**

In general, the improvements achieved by using Numba on top of NumPy functions are marginal for simple, few-loop functions. Nevertheless, Numba is particularly useful for large datasets or high-dimensional arrays that require a large number of loops, and would benefit from the one-and-done compilation that it enables. For more information on using Numba, please refer to its **official web page**.