Intro to HDF5/h5py and comparison to CSV for speed & compression

The Hierarchical Data Format (HDF) is a family of free and open-source file formats designed to facilitate the storage and manipulation of large and heterogeneous datasets. The latest version, HDF5, is commonly used across a range of commercial and non-commercial applications. Like Parquet (see previous blog post by Rohini Gupta) and NetCDF (see this introductory post by Jon Herman as well as several others on more advanced topics), HDF5 offers substantial speed and compression advantages over flat text file formats like CSV and thus is especially useful for scientific modeling efforts that require storage and analysis of large and complex datasets.

In this post, I will first give an introduction to the HDF5 file format and show how to read and write HDF5 files in Python using the h5py package. I will then run an experiment that writes and reads datasets of different sizes in order to quantitatively demonstrate the speed and compression advantages of HDF5/h5py over CSV.

Figure 1: Advantages of HDF5 format. Source: HDF Group.

Background

HDF was originally developed by the US National Center for Supercomputing Applications and is now maintained by the non-profit HDF Group. HDF5 builds on the earlier HDF4 format and is closely related to NetCDF (the current NetCDF-4 format is itself built on top of HDF5). One of HDF5's major advantages is its hierarchical data structure, which allows for nested “groups” that are analogous to “directories” in a computer file system. Each group can contain one or more “datasets,” which are individual arrays of data. HDF5 allows heterogeneous data types to be combined either within a dataset or across groups, and it allows metadata “attributes” to be attached at any level of the hierarchy, from the root group down to individual datasets.

This hierarchical and flexible format has many advantages for data-heavy scientific modeling workflows. For example, in large-scale exploratory modeling workflows that dispatch ensembles of thousands of simulations, HDF5 format can be used to organize both the input data files and the output results files. Hierarchical groups can be used to organize data across nested experimental designs; for example, results can be organized with an outer group for each candidate policy/solution, an inner group for each sampled uncertain state-of-the-world, and one or more inner datasets to store arrays of relevant results for each policy/solution within each state-of-the-world. The relevant policy parameters, sampled states-of-the-world, and other important information can be stored right alongside the data as metadata at the proper level of the hierarchy. This can drastically improve the transparency and portability of results in complex modeling workflows.

Figure 2: Hierarchical structure of an HDF5 file. Source: Neon Science.
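
As a rough sketch of what such a layout might look like in h5py (all of the names here, such as soln_0, sow_3, policy_params, and objectives, are hypothetical, and the stored values are random placeholders):

import h5py
import numpy as np

### hypothetical ensemble results file: outer groups by policy, inner groups by state-of-the-world
with h5py.File('ensemble_results.hdf5', 'w') as f:
    for s in range(3):
        soln = f.create_group(f'soln_{s}')
        ### store the policy parameters as metadata on the policy group
        soln.attrs['policy_params'] = np.random.uniform(0, 1, size=5)
        for w in range(10):
            sow = soln.create_group(f'sow_{w}')
            ### store the sampled state-of-the-world as metadata on the inner group
            sow.attrs['sow_params'] = np.random.uniform(0, 1, size=4)
            ### dataset of simulated results for this policy in this state-of-the-world
            sow['objectives'] = np.random.normal(size=(100, 4))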

Another advantage is that HDF5 stores data in an efficient binary form and supports transparent compression filters such as gzip, which can significantly reduce file sizes compared to flat text formats like CSV. HDF5 also supports “data slicing” and “chunking” to read and write small subsets of the data at a time. This enables users to efficiently interact with large datasets that exceed the available RAM.
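
For reference, here is a minimal sketch of how compression and chunking are enabled in h5py; the file name, dataset name, and chunk shape below are arbitrary choices for illustration. Compression is opt-in and is specified when each dataset is created:

import h5py
import numpy as np

data = np.random.normal(size=(100000, 10))

### write the same array with and without a gzip compression filter
with h5py.File('compressed_example.hdf5', 'w') as f:
    ### chunked, gzip-compressed dataset; compression is applied chunk by chunk
    f.create_dataset('big_array_gzip', data=data, chunks=(1000, 10), compression='gzip')
    ### uncompressed version of the same data, for comparison
    f.create_dataset('big_array_raw', data=data)

How much the gzip filter helps depends on how compressible the data are: random floating-point noise like this compresses only modestly, while smooth or repetitive data can shrink dramatically.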

Interacting with HDF5 & h5py

Because of its complex hierarchical format, HDF5 is not as simple to read and write as CSV. You cannot open the files directly in a text editor or terminal, for example. However, the HDF Group provides the free HDFView software, which offers a simple and effective graphical user interface for viewing and editing the contents of HDF5 files.

Figure 3: Viewing an HDF5 dataset with HDFView software. The upper left panel shows the hierarchical structure of this data, with an outer group “soln50”, an inner group “du_1”, and within that many datasets such as “du10”. This dataset has several attributes, such as “dunames”, an array of character strings (bottom left). We can also view the main numeric data array associated with “du10” (bottom right).

The other way to interact with HDF5 files is through an application programming interface (API) provided by your programming language of choice. For example, h5py is a free and open source Python package that provides a fast and user-friendly interface to the C-level HDF5 API. Python users who are already familiar with the basics of dictionaries and Numpy arrays should find the h5py interface to be very intuitive.

Here is an example demonstrating the basic syntax for writing to and reading from HDF5 files using the h5py package.

import h5py
import numpy as np

### Create a new HDF5 file with a write connection
with h5py.File('example.hdf5', 'w') as f_write:
    ### create a new 2x2 array of 64-bit integers called mammals, within the animals group
    ds1 = f_write.create_dataset('animals/mammals', (2,2), dtype='i8')
    ### set values for mammals array
    ds1[...] = [[1,2],[3,4]]
    ### create/set metadata attribute tied to mammals array
    ds1.attrs['names'] = ['cat', 'dog']
    ### here is a different way to create and set a new dataset directly with data
    f_write['animals/reptiles'] = [[1.2, 2.3], [3.4, 4.5], [5.6, 6.7]]
    ### attribute tied directly to group rather than dataset
    f_write['animals'].attrs['data collection'] = 'made up'
    ### here is a new group and dataset, this time storing character string data.
    f_write['vehicles/cars'] = ['prius', 'focus']

### Reopen the HDF5 file in read mode to verify our data was stored correctly
with h5py.File('example.hdf5', 'r') as f_read:
    ### assign mammals array to variable ds1
    ds1 = f_read['animals/mammals']
    ### read stored array
    print(ds1[...])
    ### read attribute
    print(ds1.attrs['names'])
    ### you can also read array data directly without naming a variable first
    print(f_read['vehicles/cars'][...])
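
Because an open file object behaves much like a nested dictionary, the hierarchy can also be explored programmatically. For example, reusing the example.hdf5 file written above:

import h5py

with h5py.File('example.hdf5', 'r') as f_read:
    ### top-level groups act like dictionary keys
    print(list(f_read.keys()))                    ### ['animals', 'vehicles']
    ### group-level metadata attributes
    print(dict(f_read['animals'].attrs))
    ### walk the full hierarchy, printing the path and object for each member
    f_read.visititems(lambda name, obj: print(name, obj))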

Testing the performance of HDF5/h5py vs CSV

Now that we have covered the basics, I will demonstrate the value of HDF5/h5py over a standard CSV data workflow in terms of writing and reading speed as well as compressed storage size. To do this, I perform a simple experiment to loop over 9 different synthetic datasets, ranging from 200,000 to 60,000,000 total elements (Table 1). These datasets vary both in the size and number of arrays that are stored. For example, there are 3 datasets with 20,000,000 total elements: (1) a single array of 20,000,000 elements; (2) 1,000 different arrays with 20,000 elements each; and (3) 1,000,000 different arrays with 20 elements each. For each of the 9 synthetic datasets, I time both a writing-based and a reading-based task using CSV as well as HDF5. I also record the storage size for the full dataset in each case. Each individual array within a dataset is stored as its own CSV file, while for HDF5 the entire dataset is stored as a single file with individual arrays separated by multiple levels of nested groups.

For brevity’s sake I will only provide the code needed for the HDF5 experiment below. However, all code used in this blog post, including the CSV experiment, can be found at my WPB_HDF5 GitHub repository.

First, here is the function used to create the synthetic data for this experiment. Given the means and standard deviations for two random variables as well as their correlation, this function will return a NumPy array with nsamp randomly sampled values of the two variables.

import numpy as np
import h5py
import time
import os
import shutil
import sys
import subprocess

def generate_mvn(means, stds, corrs, nsamp):
    '''
    function for generating correlated multivariate normal random variables as numpy array
    :param means: 1d array of means for N variables
    :param stds: 1d array of standard deviations for N variables
    :param corrs: NxN correlation matrix
    :param nsamp: number of samples to generate
    :return: (nsamp x N) array of sampled variables
    '''
    covs = np.matmul(np.matmul(np.diag(stds), corrs),
                    np.diag(stds))
    return np.random.multivariate_normal(means, covs, size=nsamp)
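
As a quick sanity check, calling this function with made-up parameter values for two variables returns an (nsamp x 2) array:

### illustrative parameter values (not part of the experiment itself)
means = np.array([0.0, 1.0])
stds = np.array([1.0, 0.5])
corrs = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
samps = generate_mvn(means, stds, corrs, nsamp=5)
print(samps.shape)    ### (5, 2)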

Now to create the multi-level data, I draw nparam alternative parameterizations each for the means, the standard deviations, and the correlation. For each of the nparam^3 total parameterizations, I draw nsamp samples of the 2 variables. For example, if nsamp=30 and nparam=100, then I have 1,000,000 total parameterizations, each of which has its own array of size 2×30=60. In the CSV experiment this is stored as 1,000,000 different CSV files. In the HDF5 experiment this is stored as a single HDF5 file with 1,000,000 different arrays. The data are nested with outer groups representing the 100 different sampled means; inside each of these are 100 inner groups representing the sampled standard deviations; and inside each of those are 100 datasets holding the arrays for the sampled correlations. Each of these inner arrays has size 2×30.

In the experiment below, I loop over 9 different combinations of nsamp and nparam. For each, I randomly sample the means, standard deviations, and correlations, and then generate the random variable arrays for each combination. Each array is stored in the HDF5 file at the appropriate nested group path, along with the corresponding sampled parameter values, which are stored as attributes. I print the total time for writing each nsamp/nparam combination for HDF5 as well as CSV (not shown).

### loop over different experiment sizes
for nsamp, nparam in [(30,100), (30000,10), (10,100), (10000,10), (10000000,1), (1000,10), (1000000,1), (100,10), (100000,1)][::-1]:
    ### create new directories to hold data. Remove any old data.
    data_dir = f'nsamp{nsamp}_nparam{nparam}/'
    try:
        shutil.rmtree(data_dir)
    except FileNotFoundError:
        pass
    os.mkdir(data_dir)
    os.mkdir(data_dir + 'data_hdf5/')
    print(f'nsamp: {nsamp}, nparam: {nparam}\n')
    sys.stdout.flush()

    ### Randomly sample alternative means, stds, & corrs for 2-variable system.
    np.random.seed(101)
    mean_samps = np.random.uniform(-1, 1, size=(nparam,2))
    std_samps = np.random.uniform(0, 1, size=(nparam,2))
    corr_samps = np.random.uniform(0, 1, size=nparam)

    #### start timer
    t0 = time.perf_counter()

    ### Open connection to HDF5 file in write mode via h5py package
    with h5py.File(f'{data_dir}/data_hdf5/data.hdf5', 'w') as f_write:
        ### Loop over sampled means, stds, corrs -> within each, randomly sample 2-variable system nsamp times
        for i in range(nparam):
            means = mean_samps[i,:]
            for j in range(nparam):
                stds = std_samps[j,:]
                for k in range(nparam):
                    corr = corr_samps[k]
                    corrs = np.array([[1, corr], [corr, 1]])
                    ### generate sample for this combination of parameters
                    samps = generate_mvn(means, stds, corrs, nsamp)
                    ### store sample as hdf5 dataset in hierarchical structure -> outer group by means, inner group by stds, then separate dataset for separate corr samples within that
                    f_write[f'means{i}/stds{j}/corr{k}'] = samps
                    ### store correlation, stds, and means as metadata attributes at their appropriate group/dataset level
                    f_write[f'means{i}/stds{j}/corr{k}'].attrs['corr'] = [corr]
                f_write[f'means{i}/stds{j}'].attrs['stds'] = [stds[0], stds[1]]
            f_write[f'means{i}'].attrs['means'] = [means[0], means[1]]

    ### print run time
    dt = time.perf_counter() - t0
    print(f'HDF5 write: {dt:0.4f} seconds')
    sys.stdout.flush()

After writing, I also time how long it takes to re-read each array, calculate its means, standard deviations, and correlation, and compare these to the original parameter values stored as attributes. Finally, I record the total disk storage size for each dataset before deleting it.

    ### now loop back over and read hdf5 to compare actual to sampled params for each set
    t0 = time.perf_counter()
    params_actual, params_samp = {}, {}

    ### Connect to previously written HDF5 file, in read mode
    with h5py.File(f'{data_dir}/data_hdf5/data.hdf5', 'r') as f_read:
        ### loop over previously sampled means, stds, & corrs
        for i in range(nparam):
            params_actual[f'means{i}'], params_samp[f'means{i}'] = {}, {}
            for j in range(nparam):
                params_actual[f'means{i}'][f'stds{j}'], params_samp[f'means{i}'][f'stds{j}'] = {}, {}
                for k in range(nparam):
                    params_actual[f'means{i}'][f'stds{j}'][f'corr{k}'], params_samp[f'means{i}'][f'stds{j}'][f'corr{k}'] = {}, {}
                    ### retrieve actual means, stds, & corr params that we used to sample -> stored as metadata attributes
                    params_actual[f'means{i}'][f'stds{j}'][f'corr{k}']['means'] = f_read[f'means{i}'].attrs['means']
                    params_actual[f'means{i}'][f'stds{j}'][f'corr{k}']['stds'] = f_read[f'means{i}/stds{j}'].attrs['stds']
                    params_actual[f'means{i}'][f'stds{j}'][f'corr{k}']['corr'] = f_read[f'means{i}/stds{j}/corr{k}'].attrs['corr']
                    ### Calculate means, stds, & corrs directly from sampled data arrays
                    params_samp[f'means{i}'][f'stds{j}'][f'corr{k}']['means'] = f_read[f'means{i}/stds{j}/corr{k}'][...].mean(axis=0)
                    params_samp[f'means{i}'][f'stds{j}'][f'corr{k}']['stds'] = f_read[f'means{i}/stds{j}/corr{k}'][...].std(axis=0)
                    params_samp[f'means{i}'][f'stds{j}'][f'corr{k}']['corr'] = np.corrcoef(f_read[f'means{i}/stds{j}/corr{k}'][:, 0], f_read[f'means{i}/stds{j}/corr{k}'][:, 1])[0, 1]

    ### print run time
    dt = time.perf_counter() - t0
    print(f'HDF5 read: {dt:0.4f} seconds')
    sys.stdout.flush()

    ### print directory size and remove
    subprocess.run(['du', '-h', f'{data_dir}/data_hdf5/'])
    print('')
    shutil.rmtree(f'{data_dir}/data_hdf5/')

Results

The results of the experiment can be found in Table 1 and Figure 4. In general, the HDF5 approach significantly outperforms CSV on write speed, read speed, and storage size. The largest timing advantage occurs for the largest array size of 20,000,000 elements, where the HDF5 write and read times are only 2.67% and 3.66% of the CSV times, respectively. The timing benefits decay for array sizes below roughly 10,000 elements, and at 200 elements or fewer HDF5 actually performs similarly to CSV on writing and worse than CSV on reading. However, most large-scale scientific computing workflows that would consider HDF5 will have larger array sizes and thus should benefit significantly from the 20x+ speedups seen for large arrays.

In terms of compression, the best results are seen for the smallest array sizes. For example, the HDF5 file for the 20-element array case is only 13.18% of the size of the corresponding CSV collection. However, significant compression benefits are seen throughout the experiment, with relative storage sizes of roughly 31-33% for all datasets with array sizes between 2,000 and 20,000,000 elements.

| Total number of samples | Number of samples per dataset | Number of datasets | CSV write time (s) | HDF5 write time (s) | Relative write time (%) | CSV read time (s) | HDF5 read time (s) | Relative read time (%) | CSV storage size (MB) | HDF5 storage size (MB) | Relative storage size (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 200,000 | 200,000 | 1 | 0.24 | 0.01 | 3.31 | 0.18 | 0.01 | 8.13 | 4.9 | 1.6 | 32.65 |
| 200,000 | 200 | 1,000 | 0.48 | 0.29 | 61.23 | 0.36 | 0.57 | 157.52 | 7.9 | 2 | 25.32 |
| 2,000,000 | 2,000,000 | 1 | 1.93 | 0.06 | 2.94 | 2.08 | 0.11 | 5.20 | 49 | 16 | 32.65 |
| 2,000,000 | 2,000 | 1,000 | 2.07 | 0.34 | 16.38 | 1.80 | 0.64 | 35.48 | 51 | 16 | 31.37 |
| 20,000,000 | 20,000,000 | 1 | 17.15 | 0.46 | 2.67 | 17.61 | 0.65 | 3.66 | 480 | 153 | 31.88 |
| 20,000,000 | 20,000 | 1,000 | 18.89 | 0.72 | 3.81 | 16.59 | 1.19 | 7.19 | 489 | 154 | 31.49 |
| 20,000,000 | 20 | 1,000,000 | 318.83 | 277.27 | 86.97 | 205.07 | 557.91 | 272.06 | 3900 | 514 | 13.18 |
| 60,000,000 | 60,000 | 1,000 | 54.03 | 1.63 | 3.02 | 51.27 | 2.60 | 5.06 | 1500 | 459 | 30.6 |
| 60,000,000 | 60 | 1,000,000 | 352.47 | 311.93 | 88.50 | 243.45 | 593.66 | 243.85 | 3900 | 825 | 21.15 |
Table 1: Results from testing CSV vs HDF5 on write speed, compression, and read speed
Figure 4: Scaling of HDF5 performance (relative to CSV) as a function of the number of samples per dataset

Next steps

This scaling experiment demonstrates the significant advantages in write speed, read speed, and storage compression that are enabled by HDF5 and h5py. HDF5 also has significant organizational benefits due to the simplicity of storing complex heterogeneous data arrays and associated metadata within a hierarchical format.

In this blog post, I have only covered the basics of HDF5 and h5py. Additional benefits are possible by leveraging HDF5/h5py’s data slicing and chunking capabilities, which make it possible to efficiently access and manipulate small subsets of very large datasets that exceed available RAM. Some users may also be interested in PyTables, a Python package that provides additional flexibility for manipulating and querying complex heterogeneous HDF5 datasets.
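
To give a flavor of the slicing approach (the file name results.hdf5 and dataset name simulations below are hypothetical), a 2-dimensional dataset that is too large to hold in memory can be processed one block of rows at a time:

import h5py
import numpy as np

### compute a running maximum over a large dataset without loading it all at once
with h5py.File('results.hdf5', 'r') as f:
    dset = f['simulations']
    block = 10000
    running_max = -np.inf
    for start in range(0, dset.shape[0], block):
        ### only this slice of rows is read from disk
        chunk = dset[start:start + block, :]
        running_max = max(running_max, chunk.max())
    print(running_max)

Combined with chunked storage, this pattern scales to datasets far larger than the available memory.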
