Parallel File Compression in Linux

If you are dealing with big data and need to move it to different directories or archive it for long-term storage, you have to think about how to do so efficiently; without an efficient method, you will probably spend days organizing and finishing the work. Several utilities are popular for this task. I had more than 3 TB of data that I needed to move to free up some space on the disk, so I thought about a strategy for moving my files. My smallest subdirectory was about 175 GB, and it took about 2 hours to compress with plain tar and gzip. I then realized that gzip has options for the level (and therefore speed) of compression, which I did not know before, and this can be helpful. I ran a simple test on a smaller data set (about 1.94 GB), applying the different levels from 1 to 9, where 1 is the fastest but worst compression and 9 is the slowest but best compression:

GZIP=-<level> tar cvzf OUTPUT_FILE.tar.gz ./Paths_to_Archive/
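If you want to reproduce this comparison, a minimal sketch is to time each level in a loop (the test_level_* archive names are just placeholders; this assumes GNU tar and gzip, which pick up extra options from the GZIP environment variable):

#!/bin/bash
# Compress the same directory at every gzip level and report time and size.
for level in 1 2 3 4 5 6 7 8 9; do
    echo "=== gzip level $level ==="
    time GZIP="-$level" tar czf "test_level_$level.tar.gz" ./Paths_to_Archive/
    ls -lh "test_level_$level.tar.gz"   # compare the resulting archive sizes
done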

Here is the result: the default level is 6, but you can play with this option and, depending on your priority, trade compression time against compression ratio. If your file or folder is much bigger than the example I used here, choosing a different level can really save you time. Here is the graph that I created from this experiment:

You can speed this up further by using more than one core/processor. pigz is a parallel implementation of gzip that compresses files and folders across multiple processors and cores.
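If pigz is not already on your system, you can usually get it from your package manager or build it from source with make. A rough sketch, assuming version 2.4 to match the path used below; check the pigz website for the current download URL, and note that building requires the zlib development headers:

wget https://zlib.net/pigz/pigz-2.4.tar.gz   # download the source (URL/version may differ)
tar xzf pigz-2.4.tar.gz
cd pigz-2.4 && make                          # produces the ./pigz binary used below
cd ..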

tar cf - ./Paths_to_Archive | ./pigz-2.4/pigz -1 -p 20 > OUTPUT_FILE.tar.gz

In this case, I used "1" as the compression level, specified 20 processors for the task with -p 20, and gave the path where I downloaded pigz. The good news is that you can write a simple bash script to run the same command on a compute node rather than on the login node. That way you can use all the available processors on a node without slowing down the head node.

#!/bin/bash
# SLURM job settings: 24 h wall time, one node, 32 CPUs, normal partition, email at job end.
#SBATCH -t 24:00:00
#SBATCH --job-name=test
#SBATCH --mail-type=end
#SBATCH -p normal
#SBATCH --export=ALL
#SBATCH --nodes=1
#SBATCH --output="test.txt"
#SBATCH --cpus-per-task=32

# Stream the archive through pigz with 32 threads (matching --cpus-per-task).
tar cf - ./Paths_to_Archive | ./pigz-2.4/pigz -1 -p 32 > OUTPUT_FILE.tar.gz

Save the lines above in a file called test.sh, and submit it with sbatch test.sh.
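While the job runs, you can keep an eye on it with the usual SLURM commands (the output file name here matches the --output line in the script above):

squeue -u $USER    # is the job pending or running?
tail -f test.txt   # follow the job's output as it is written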

I used a larger subset of my data (16.5 GB) to check the different compression levels in parallel, and here is the result: the slowest option was almost threefold faster (168 vs. 478 seconds) than my previous test on one processor, even though that earlier test folder was much smaller (1.96 vs. 16.5 GB).
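To extract such an archive later, you can pipe it back through pigz. Note that pigz cannot fully parallelize gzip decompression, but it still uses separate threads for reading, writing, and checksum calculation, so it is somewhat faster than plain gunzip:

./pigz-2.4/pigz -dc OUTPUT_FILE.tar.gz | tar xf -   # decompress with pigz, unpack with tar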
