We engineers with no formal computer science training tend, for a myriad of good reasons, to favor friendly programming languages such as Python over evil C/C++. However, when our application is performance sensitive and we cannot sit in our chairs waiting for results for as long as Python code would require, we are sometimes forced to use C/C++. Why not make the most of it when that is the case?
Here are some simple tips to make your C/C++ code run even faster, along with advice on where to look for further performance improvements. The last idea (data locality) is transferable to Python and other languages.
Most important trick
Improve your algorithm. Asking whether there is a simpler way of doing what you coded may reduce your algorithm's complexity (say, from O(n^2) to O(n log n)), which would:
- yield a great speed-up when you have lots of data or need to run a computation several times in a row, and
- make you look smarter.
Compiler flags
First and foremost, let those who know what they are doing (compiler developers) do the work for you by calling the compiler with the appropriate flags. There is an incredible number of flags to choose from, but I would say that the ones you should have on whenever possible are -O3 and -march=native.
The optimization flags (-O1 to -O3, the latter more aggressive than the former) perform a series of modifications on your code behind the scenes to speed it up, sometimes by more than an order of magnitude. The catch is that these modifications may make your code behave differently than you expected, so it is always good to do a few smaller runs with -O0 and -O3 and compare their results before getting into production mode.
The -march=native flag makes the compiler fine-tune your code for the processor it is being compiled on (conversely, -march=haswell would fine-tune it for the Haswell architecture, for example). This is great if you are only going to run your code on your own machine or on another machine known to have a similar architecture, but if you run the binary on a machine with a different architecture, especially an older one, you may end up with illegal instruction errors.
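The check-before-trusting workflow above can be sketched as follows (file names are hypothetical; the toy program just sums a harmonic series so the two builds have comparable output):

```shell
# write a tiny test program to compile (hypothetical example)
cat > /tmp/harmonic.c <<'EOF'
#include <stdio.h>
int main(void)
{
    double s = 0.0;
    for (int i = 1; i <= 1000; ++i)
        s += 1.0 / i;
    printf("%.6f\n", s);
    return 0;
}
EOF

# build once without optimization, once with aggressive optimization
gcc -O0 -o /tmp/harmonic_O0 /tmp/harmonic.c
gcc -O3 -march=native -o /tmp/harmonic_O3 /tmp/harmonic.c

# run both and compare results before trusting the optimized build
/tmp/harmonic_O0
/tmp/harmonic_O3
```

If the outputs diverge, the optimized build changed your program's observable behavior and you should investigate before shipping it.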
Restricted pointer array
When declaring a pointer array which you are sure will not be subject to pointer aliasing (that is, no other pointer will point to the same memory address), you can declare it as a restricted pointer, as below:
- GCC: double* __restrict__ my_pointer_array
- Intel compiler: double* restrict my_pointer_array
This lets the compiler know that it can reorder certain operations involving my_pointer_array to make your code faster, without the risk of changing a read/write order that would change your results. If you want to use the restrict qualifier with the Intel compiler, the flag -restrict must be passed as an argument to the compiler.
Aligned pointer array
By aligning an array, you are making sure the data is going to lie in the ideal location in memory for the processor to fetch it and perform the calculations it needs based on that array. To help your compiler optimize your data alignment, you need to (1) align your array to a specific number of bytes when it is declared and (2) tell the compiler the array is aligned when using it to perform calculations, since the compiler has no idea whether arrays received as arguments in a function are aligned. Below are examples in C, although the same ideas apply to C++ as well.
GCC
#include <stdio.h>
#include <omp.h>

#define SIZE_ARRAY 100000
#define ALIGN 64

void sum(double *__restrict__ a, double *__restrict__ b, double *__restrict__ c, int n)
{
    a = (double *) __builtin_assume_aligned(a, ALIGN);
    b = (double *) __builtin_assume_aligned(b, ALIGN);
    c = (double *) __builtin_assume_aligned(c, ALIGN);
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main(void)
{
    double a[SIZE_ARRAY] __attribute__((aligned(ALIGN)));
    double b[SIZE_ARRAY] __attribute__((aligned(ALIGN)));
    double c[SIZE_ARRAY] __attribute__((aligned(ALIGN)));

    for (int i = 0; i < SIZE_ARRAY; ++i) {
        a[i] = 5.;
        b[i] = 2.;
    }

    double start_time = omp_get_wtime();
    sum(a, b, c, SIZE_ARRAY);
    double time = omp_get_wtime() - start_time;
    printf("%0.6fs", time);
}
Intel compiler
#include <stdio.h>
#include <omp.h>

#define SIZE_ARRAY 100000
#define ALIGN 64

void sum(double *restrict a, double *restrict b, double *restrict c, int n)
{
    __assume_aligned(a, ALIGN);
    __assume_aligned(b, ALIGN);
    __assume_aligned(c, ALIGN);
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main(void)
{
    __declspec(align(ALIGN)) double a[SIZE_ARRAY];
    __declspec(align(ALIGN)) double b[SIZE_ARRAY];
    __declspec(align(ALIGN)) double c[SIZE_ARRAY];

    for (int i = 0; i < SIZE_ARRAY; ++i) {
        a[i] = 5.;
        b[i] = 2.;
    }

    double start_time = omp_get_wtime();
    sum(a, b, c, SIZE_ARRAY);
    double time = omp_get_wtime() - start_time;
    printf("%0.6fs", time);
}
Edit: In a comment to this post, Krister Walfridsson not only caught an issue with my GCC code, for which I thank him, but also shows the differences in the machine code generated with and without alignment.
Data Locality
Computers are physical things, which means that data is physically stored and needs to be physically moved around in memory, and between cache and processor, in order to be used in calculations. This means that, if your data is scattered all over the place in memory (e.g. in multiple pointer arrays living in different regions), the processor will have to reach out to several parts of memory to fetch all your data before performing any computations. By having the data intelligently laid out in memory you ensure that all data required for each computation is stored close together and sits in cache at the same time, which becomes even more important if your code uses too much data to fit in the cache at once.
To make your processor's life easier, ensure that all the data required for a calculation step is close together. For example, if a given computation requires three arrays of fixed sizes, it is a good idea to merge them into one long array, as in the example below for the Intel compiler.
#include <stdio.h>
#include <omp.h>

#define SIZE_ARRAY 100000
#define ALIGN 64

void sum(double *restrict a, double *restrict b, double *restrict c, int n)
{
    __assume_aligned(a, ALIGN);
    __assume_aligned(b, ALIGN);
    __assume_aligned(c, ALIGN);
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

int main(void)
{
    __declspec(align(ALIGN)) double abc[3 * SIZE_ARRAY];

    for (int i = 0; i < SIZE_ARRAY; ++i) {
        abc[i] = 5.;               /* the "a" block */
        abc[SIZE_ARRAY + i] = 2.;  /* the "b" block */
    }

    double start_time = omp_get_wtime();
    sum(abc, abc + SIZE_ARRAY, abc + 2 * SIZE_ARRAY, SIZE_ARRAY);
    double time = omp_get_wtime() - start_time;
    printf("%0.6fs", time);
}
Or even better, since c[i] depends only on a[i] and b[i], we can interleave the values of a, b and c to ensure that each computation is performed on data that sits right next to each other in memory:
#include <stdio.h>
#include <omp.h>

#define SIZE_ARRAY 100000
#define ALIGN 64
#define STRIDE 3

void sum(double *restrict abc, int n, int stride)
{
    __assume_aligned(abc, ALIGN);
    for (int i = 0; i < n; i += stride)
        abc[i + 2] = abc[i] + abc[i + 1];
}

int main(void)
{
    __declspec(align(ALIGN)) double abc[3 * SIZE_ARRAY];

    for (int i = 0; i < 3 * SIZE_ARRAY; i += STRIDE) {
        abc[i] = 5.;
        abc[i + 1] = 2.;
    }

    double start_time = omp_get_wtime();
    sum(abc, 3 * SIZE_ARRAY, STRIDE);  /* n is the full length of abc */
    double time = omp_get_wtime() - start_time;
    printf("%0.6fs", time);
}
Conclusion
Based on a class project in which we had to write C code to perform matrix multiplication, the improvements suggested here may speed up your code by a factor of 5 to 10. Also, the idea of data locality can be transferred to other languages, such as Python.