Simple profiling checks for running jobs on clusters

The goal of this short blog post is to share some simple tips on profiling your (to be) submitted jobs on high performance computing resources. Profiling your jobs can give you information about how efficiently you are using your computational resources, i.e., your CPUs and your allocated memory. Typically you would perform these checks on your experiment at a smaller scale, ensuring that everything is working as it should, before expanding to more tasks and CPUs.

Your first check is squeue typically paired with your user ID on a cluster. Here’s an example:

(base) [ah986@login02 project_dir]$ squeue -u ah986
             JOBID PARTITION     NAME      USER  ST       TIME  NODES NODELIST(REASON) 
           5688212    shared <job_name>    ah986  R       0:05      1 exp-4-55 

This tells me that my submitted job is utilizing 1 node in the shared partition of this cluster. If your cluster is using the SLURM scheduler, you can also use sacct which can display accounting data for all jobs you are currently running or have run in the past. There’s many pieces of information available with sacct, that you can specify using the --format flag. Here’s an example for the same job:

(base) [ah986@login02 project_dir]$ sacct --format=JobID,partition,state,time,start,end,elapsed,nnodes,ncpus,nodelist,AllocTRES%32 -j 5688212
       JobID  Partition      State  Timelimit               Start                 End    Elapsed   NNodes      NCPUS        NodeList                        AllocTRES 
------------ ---------- ---------- ---------- ------------------- ------------------- ---------- -------- ---------- --------------- -------------------------------- 
5688212          shared    RUNNING   20:00:00 2021-09-08T10:55:40             Unknown   00:19:47        1        100        exp-4-55 billing=360000,cpu=100,mem=200G+ 
5688212.bat+               RUNNING            2021-09-08T10:55:40             Unknown   00:19:47        1        100        exp-4-55          cpu=100,mem=200G,node=1 
5688212.0                  RUNNING            2021-09-08T10:55:40             Unknown   00:19:47        1        100        exp-4-55          cpu=100,mem=200G,node=1 

In this case I can see the number of nodes (1) and the number of cores (100) utilized by my job as well as the resources allocated to it (100 CPUs and 200G of memory on 1 node). This information is useful in cases where a task launches other tasks and you’d like to diagnose whether the correct number of cores is being used.

Another useful tool is seff, which is actually a wrapper around sacct and summarizes your job’s overall performance. It is a little unreliable while the job is still running, but after the job is finished you can run:

(base) [ah986@login02 project_dir]$ seff 5688212
Job ID: 5688212
Cluster: expanse
User/Group: ah986/pen110
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 100
CPU Utilized: 1-01:59:46
CPU Efficiency: 68.16% of 1-14:08:20 core-walltime
Job Wall-clock time: 00:22:53
Memory Utilized: 38.25 GB
Memory Efficiency: 19.13% of 200.00 GB

The information here is very useful if you want to find out about how efficiently you’re using your resources. For this example I had 100 separate tasks I needed to perform and I requested 100 cores on 1 node and 200 GB of memory. These results tell me that my job completed in 23mins or so, the total time using the CPUs (CPU Utilized) was 01:59:46, and most importantly, the efficiency of my CPU use. CPU Efficiency is calculated “as the ratio of the actual core time from all cores divided by the number of cores requested divided by the run time”, in this case 68.16%. What this means it that I could be utilizing my cores more efficiently by allocating fewer cores to the same number of tasks, especially if scaling up to a larger number of nodes/cores. Additionally, my allocated memory is underutilized and I could be requesting a smaller memory allocation without inhibiting my runs.

Finally, while your job is still running you can log in the node(s) executing the job to look at live data. To do so, you simply ssh to one of the nodes listed under NODELIST (not all clusters allow this). From there, you can run the top command like below (with your own username), which will start the live task manager:

(base) [ah986@r143 ~]$ top -u ah986

top - 15:17:34 up 25 days, 19:55,  1 user,  load average: 0.09, 12.62, 40.64
Tasks: 1727 total,   2 running, 1725 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.1 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 257662.9 total, 249783.4 free,   5561.6 used,   2317.9 buff/cache
MiB Swap: 716287.0 total, 716005.8 free,    281.2 used. 250321.1 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                              
 78985 ah986     20   0  276212   7068   4320 R   0.3   0.0   0:00.62 top                                                                                                                                  
 78229 ah986     20   0  222624   3352   2936 S   0.0   0.0   0:00.00 slurm_script                                                                                                                         
 78467 ah986     20   0  259464   8128   4712 S   0.0   0.0   0:00.00 srun                                                                                                                                 
 78468 ah986     20   0   54520    836      0 S   0.0   0.0   0:00.00 srun                                                                                                                                 
 78481 ah986     20   0  266404  19112   4704 S   0.0   0.0   0:00.24 parallel                                                                                                                             
 78592 ah986     20   0  217052    792    720 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78593 ah986     20   0  217052    732    660 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78594 ah986     20   0  217052    764    692 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78595 ah986     20   0  217052    708    636 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78596 ah986     20   0  217052    708    636 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78597 ah986     20   0  217052    796    728 S   0.0   0.0   0:00.00 sleep                                                                                                                                
 78598 ah986     20   0  217052    732    660 S   0.0   0.0   0:00.00 sleep       

Memory and CPU usage can be tracked from RES and %CPU columns respectively. In this case, for the sake of an example, I just assigned all my cores to sleep a certain number of minutes each (using no CPU or memory). Similar information can also be obtained using the ps command, with memory being tracked under the RSS column.

 (base) [ah986@r143 ~]$ ps -u$USER -o %cpu,rss,args
%CPU   RSS COMMAND
 0.0  3352 /bin/bash /var/spool/slurm/d/job3509431/slurm_script
 0.0  8128 srun --export=all --exclusive -N1 -n1 parallel -j 100 sleep {}m ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 0.0   836 srun --export=all --exclusive -N1 -n1 parallel -j 100 sleep {}m ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 0.1 19112 /usr/bin/perl /usr/bin/parallel -j 100 sleep {}m ::: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 0.0   792 sleep 3m
 0.0   732 sleep 4m
 0.0   764 sleep 5m
 0.0   708 sleep 6m
 0.0   708 sleep 7m
 0.0   796 sleep 8m
 0.0   732 sleep 9m
 0.0   712 sleep 10m

Profiling your Python script using cProfile

Profiling refers to performing dynamic analysis on a script to measure its execution time, the execution time of its subcomponents, as well as how many times each subcomponent is being called. This produces data on where the script program is spending the most time, and can help with optimizing your script to minimize its execution time. This blog has had two past posts on profiling, one on C++ using Callgrind and one on Python using PyCharm. PyCharm is a Python IDE that’s very useful but unfortunately not free, so if you’re looking for some freeware profiling functionality in Python, this post is for you.

Python has a module called cProfile. A simple example on timing the multiplication of two matrices with cProfile:

import cProfile
import numpy as np

mat1 = ([1, 6, 3],[3, 6, 3],[2, 8, 3]) 
mat2 = ([2, 7, 6],[5, 4, 7],[6, 1, 9]) 
  
cProfile.run('np.dot(mat1,mat2)')

this should print out something like the following:

         4 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

I have recently used cProfile on one of my own scripts which I’ll be using here to demonstrate how it can be used in your own work. I have a function called fish_game, which contained my model and took vars as input. This function also calls function hrvSTR which represented my action policy function (it’s extraneous to this post what these functions do exactly, one represents the system and the other represents a policy that we use to act on the system, you can see the full model here). The fish_game function was called by my optimization algorithm during optimization. Running cProfile on it produces this:

cProfile.run('fish_game(vars)',sort='cumulative')
         282814 function calls in 0.698 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.698    0.698 {built-in method builtins.exec}
        1    0.000    0.000    0.698    0.698 <string>:1(<module>)
        1    0.166    0.166    0.698    0.698 <ipython-input-4-df258d5a749f>:55(fish_game)
    20200    0.385    0.000    0.531    0.000 <ipython-input-4-df258d5a749f>:1(hrvSTR)
    20200    0.016    0.000    0.089    0.000 fromnumeric.py:1821(sum)
    20200    0.021    0.000    0.069    0.000 fromnumeric.py:64(_wrapreduction)
   121208    0.055    0.000    0.055    0.000 {built-in method numpy.core.multiarray.zeros}
    20200    0.046    0.000    0.046    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    20200    0.003    0.000    0.003    0.000 {built-in method builtins.isinstance}
    40400    0.003    0.000    0.003    0.000 {built-in method builtins.len}
    20200    0.002    0.000    0.002    0.000 {method 'items' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'normal' of 'mtrand.RandomState' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

This tells me one run of my function takes 0.698 seconds in total (this might vary slightly every time, depending on your processor usage at the time and other factors), and that most of that time, 0.531 seconds, are consumed by the hrvSTR function. Even though 0.7 seconds might not seem long, in the context of optimization, where this function would need to be evaluated tens of thousands of times, an additional 0.1 seconds might add hours of process time to your workflow. Trying to bring that down is probably a worthwhile investment of time that will result in time savings later on. As a result I figured there could be something I could do to reduce the time hrvSTR took. I particularly intrigued by the fact that some numpy process numpy.core.multiarray.zeros was called 121208 times, an order of magnitude more than every other method in my script, which prompted me to think that I might be unnecessarily repeating a process.

Looking at my code more closely (this is a script I have been using for more than a year now), I realized that I was ordering and normalizing and creating arrays for my action policy every single time it was called, something that was unnecessary, as the same policy was used for all time steps. I could instead perform all those steps once, save the outputs and use them for every time step instead of recalculating every time. I spent some time to adjust my script to do that and running cProfile again, produced this:

 cProfile.run('fish_game(vars)',sort='cumulative')
         79282 function calls in 0.379 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.379    0.379 {built-in method builtins.exec}
        1    0.000    0.000    0.379    0.379 <string>:1(<module>)
        1    0.150    0.150    0.379    0.379 <ipython-input-16-3c528334eb55>:62(fish_game)
    19800    0.193    0.000    0.229    0.000 <ipython-input-16-3c528334eb55>:35(hrvSTR)
    59414    0.036    0.000    0.036    0.000 {built-in method numpy.core.multiarray.zeros}
        4    0.000    0.000    0.000    0.000 fromnumeric.py:2817(mean)
        4    0.000    0.000    0.000    0.000 _methods.py:58(_mean)
        2    0.000    0.000    0.000    0.000 <ipython-input-16-3c528334eb55>:4(generate_policy)
        6    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        2    0.000    0.000    0.000    0.000 {method 'normal' of 'mtrand.RandomState' objects}
        2    0.000    0.000    0.000    0.000 fromnumeric.py:1821(sum)
        2    0.000    0.000    0.000    0.000 fromnumeric.py:64(_wrapreduction)
        4    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        4    0.000    0.000    0.000    0.000 _methods.py:48(_count_reduce_items)
        2    0.000    0.000    0.000    0.000 {method 'clip' of 'numpy.generic' objects}
        4    0.000    0.000    0.000    0.000 numeric.py:504(asanyarray)
       10    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        8    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        4    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

I basically shaved 0.3 seconds off my function evaluation time by investing some time to look more closely at my script. If optimizing with 30000 function evaluations, this translates to some 2+ hours of processing time that I am saving (I spent far less figuring this out for the first time).

You can also run cProfile directly from the command line like so:

python -m cProfile -s cumtime fish_game.py

Profiling C++ code with Callgrind

Often times, we have to write code to perform tasks whose complexity vary from mundane, such as retrieving and organizing data, to highly complex, such as simulations CFD simulations comprising the spine of a project. In either case, depending on the complexity of the task and amount of data to be processed, it may happen for the newborn code to leave us staring at an underscore marker blinking gracefully for hours on a command prompt during its execution until the results are ready, leading to project schedule delays and shortages of patience. Two standard and preferred approaches to the problem of time intensive codes are to simplify the algorithm and to make the code more efficient. In order to better select the parts of the code to work on, it is often useful to first find the parts of the code in which more time by profiling the code.

In this post, I will show how to use Callgrind, part of Valgrind, and KCachegrind to profile C/C++ codes on Linux — unfortunately, Valgrind is not available for Windows or Mac, although it can be ran on cluster from which results can be downloaded and visualized on Windows with QCachegrind. The first step is to install Valgrind and KCachegrind by typing the following commands in the terminal of a Debian based distribution, such as Ubuntu (equivalent yum commands area available for Red Hat based distributions):

$ sudo apt-get install valgrind
$ sudo apt-get install kcachegrind

Now that the required tools are installed, the next step is to compile your code with GCC/G++ (with a make file, cmake, IDE or by running the compiler directly from the terminal) and then run the following command in a terminal (type ctrl+shift+T to open the terminal):

$ valgrind --tool=callgrind path/to/your/compiled/program program_arguments

Callgrind will then run your program with some instrumentation added to its execution to measure time expenditures and cache use by each function in your code. Because of the instrumentation, Your code will take considerably longer to run under Callgrind than it typically would, so be sure to run a representative task that is as small as possible when profiling your code. During its execution, Callgrind will output a report similar to the one below on terminal itself:

==12345== Callgrind, a call-graph generating cache profiler
==12345== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==12345== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==12345== Command: path/to/your/compiled/program program_arguments
==12345==
==12345== For interactive control, run 'callgrind_control -h'.
IF YOUR CODE OUTPUTS TO THE TERMINAL, THE OUTPUT WILL BE SHOWN HERE.
==12345== 
==12345== Events    : Ir
==12345== Collected : 4171789731
==12345== 
==12345== I   refs:      4,171,789,731

The report above shows that it collected 4 billion events in order to generate the comprehensive report saved in the file callgrind.out.12345 — 12345 is here your process id, shown in the report above. Instead of submerging your soul into a sea of despair by trying to read the output file in a text editor, you should load the file into KCachegrind by typing:

$ kcachegrind calgrind.out.12345

You should now see a screen like the one below:

kcachegrind_initial.png

The screenshot above shows the profiling results for my code. The left panel shows the functions called by my code sorted by total time spent inside each function. Because functions call each other, callgrind shows two cost metrics as proxies for time spent in each function: Incl., showing the total cost of a function, and self, showing the time spent in each function itself discounting the callees. By clicking on “Self” to order to functions by the cost of the function itself, we sort the functions by the costs of their own codes, as shown below:

Untitled_sorted

Callgrind includes functions that are native to C/C++ in its analysis. If one of them appears in the highest positions of the left panel, it may be the case to try to use a different function or data structure that performs a similar task in a more efficient way. Most of the time, however, our functions are the ones in most of the top positions in the list. In the example above, we can see that a possible first step I can take to improve the time performance of my code is to make function “ContinuityModelROF::shiftStorage” more efficient. A few weeks ago, however, the function “ContinuityModel::continuityStep” was ranked first with over 30% of the cost, followed by a C++ map related function. I then replaced a map inside that function by a pointer vector, resulting in the drop of my function’s cost to less than 5% of the total cost of the code.

In case KCachegrind shows that a given function that is called from multiple places in the code is costly, you may want to know which function is the main culprit behind the costly calls. To do this, click on the function of interest (in this case, “_memcpy_sse2_unalight”) in the left panel, and then click on “Callers” in the right upper panel and on “Call Graph” in the lower right panel. This will show in list and graph forms the calls made to the function by other functions, and the asociated percent costs. Unfortunately, I have only the function “ContinuityModelROF::calculateROF” calling “_memcpy_sse2_unalight,” hence the simple graph, but the graph would be more complex if multiple functions made calls to “_memcpy_sse2_unalight.”

I hope this saves you at least the time spend reading this post!