HDF5

openPMD supports writing to and reading from HDF5 .h5 files. For this, the installed copy of openPMD must have been built with support for the HDF5 backend, which is enabled with the CMake option -DopenPMD_USE_HDF5=ON. For further information, check out the installation guide, the build dependencies and the build options.
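If you are unsure whether a given installation was built with HDF5 support, you can also query the compiled-in backends at runtime. The following is a minimal sketch using the C++ API; it assumes the openPMD::getVariants() query of recent openPMD-api releases and its "hdf5" map key:

```cpp
// Minimal sketch: check at runtime whether this openPMD-api build includes
// the HDF5 backend. Assumes openPMD::getVariants() and its "hdf5" map key,
// as provided by recent openPMD-api releases.
#include <openPMD/openPMD.hpp>

#include <iostream>

int main()
{
    auto const variants = openPMD::getVariants();
    auto const it = variants.find("hdf5");
    bool const hasHDF5 = (it != variants.end()) && it->second;

    std::cout << "HDF5 backend available: " << (hasHDF5 ? "yes" : "no")
              << std::endl;
    return hasHDF5 ? 0 : 1;
}
```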

I/O Method

HDF5 internally writes either serially, via POSIX on Unix systems, or in parallel to a single logical file via MPI-I/O.
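Which of the two paths is taken follows from how the Series is constructed. As a rough sketch (assuming an openPMD-api build with both HDF5 and MPI support), passing an MPI communicator selects the MPI-I/O path, while omitting it selects the serial POSIX path:

```cpp
// Sketch: serial (POSIX) vs. MPI-parallel (MPI-I/O) HDF5 output with
// openPMD-api; assumes the library was built with HDF5 and MPI support.
#include <openPMD/openPMD.hpp>

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    {
        // parallel: all ranks cooperate on a single logical file via MPI-I/O
        openPMD::Series parallel(
            "parallel_%T.h5", openPMD::Access::CREATE, MPI_COMM_WORLD);

        if (rank == 0)
        {
            // serial: no communicator, writes go through POSIX
            openPMD::Series serial("serial_%T.h5", openPMD::Access::CREATE);
        }
    } // destroy (close) both Series before MPI_Finalize
    MPI_Finalize();
    return 0;
}
```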

Backend-Specific Controls

The following environment variables control HDF5 I/O behavior at runtime.

Environment variable                | Default  | Description
------------------------------------|----------|-------------------------------------------------------------
OPENPMD_HDF5_INDEPENDENT            | ON       | Sets the MPI-parallel transfer mode to independent (ON) or collective (OFF).
OPENPMD_HDF5_ALIGNMENT              | 1        | Tuning parameter for parallel I/O; choose an alignment that is a multiple of the disk block size.
OPENPMD_HDF5_THRESHOLD              | 0        | Tuning parameter for parallel I/O; 0 aligns all requests, other values act as a size threshold.
OPENPMD_HDF5_CHUNKS                 | auto     | Default for H5Pset_chunk: "auto" (heuristic) or "none" (no chunking).
OPENPMD_HDF5_COLLECTIVE_METADATA    | ON       | Sets the MPI-parallel transfer mode for metadata operations to collective (ON) or independent (OFF).
OPENPMD_HDF5_PAGED_ALLOCATION       | ON       | Tuning parameter for parallel I/O in HDF5 to enable paged allocation.
OPENPMD_HDF5_PAGED_ALLOCATION_SIZE  | 33554432 | Size of the page, in bytes, if HDF5 paged allocation is enabled.
OPENPMD_HDF5_DEFER_METADATA         | ON       | Tuning parameter for parallel I/O in HDF5 to enable deferred HDF5 metadata operations.
OPENPMD_HDF5_DEFER_METADATA_SIZE    | 33554432 | Size of the buffer, in bytes, if HDF5 deferred metadata optimization is enabled.
H5_COLL_API_SANITY_CHECK            | unset    | Debug: set to 1 to perform an MPI_Barrier inside each metadata operation.
HDF5_USE_FILE_LOCKING               | TRUE     | Work-around: set to FALSE on HPC or network file systems where opening files for reading hangs.
OMPI_MCA_io                         | unset    | Work-around: set to ^ompio to disable OpenMPI's I/O implementation in older releases.

OPENPMD_HDF5_INDEPENDENT: by default, we implement MPI-parallel data storeChunk (write) and loadChunk (read) calls as non-collective MPI operations. Attribute writes are always collective in parallel HDF5. We choose the non-collective (independent) default for ease of use; be advised that, depending heavily on the use case, this can incur performance penalties. For independent parallel I/O, consider using a modern version of the MPICH implementation (in particular, ROMIO instead of OpenMPI's OMPIO implementation). Please refer to the HDF5 manual, function H5Pset_dxpl_mpio, for more details.
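The transfer mode is selected purely through the environment; the openPMD-api calls themselves do not change. A minimal sketch of an MPI-parallel write (assuming an MPI-enabled build), where launching with OPENPMD_HDF5_INDEPENDENT=OFF would switch the storeChunk transfers to collective MPI-I/O:

```cpp
// Sketch: each rank stores one contiguous slice of a global dataset.
// Launch e.g. with:  OPENPMD_HDF5_INDEPENDENT=OFF mpiexec -n 4 ./write_example
// to use collective transfers; the default (ON) keeps them independent.
#include <openPMD/openPMD.hpp>

#include <mpi.h>

#include <cstdint>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    {
        openPMD::Series series(
            "data_%T.h5", openPMD::Access::CREATE, MPI_COMM_WORLD);

        auto E_x = series.iterations[0].meshes["E"]["x"];
        std::uint64_t const localCells = 100;
        openPMD::Extent const globalExtent = {
            static_cast<std::uint64_t>(size) * localCells};
        E_x.resetDataset({openPMD::Datatype::DOUBLE, globalExtent});

        // each rank contributes its own block of localCells values
        std::vector<double> local(localCells, static_cast<double>(rank));
        E_x.storeChunk(
            local,
            {static_cast<std::uint64_t>(rank) * localCells},
            {localCells});

        series.flush(); // the actual (independent or collective) transfer
    }
    MPI_Finalize();
    return 0;
}
```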

OPENPMD_HDF5_ALIGNMENT: this sets the alignment in bytes for writes via the H5Pset_alignment function. According to the HDF5 documentation: "For MPI IO and other parallel systems, choose an alignment which is a multiple of the disk block size." On Lustre file systems, the NERSC documentation advises setting this to the Lustre stripe size. In addition, ORNL Summit GPFS users are recommended to set the alignment value to 16777216 (16 MiB).

OPENPMD_HDF5_THRESHOLD: this sets the threshold for the alignment of HDF5 operations via the H5Pset_alignment function. Setting it to 0 will force all requests to be aligned. Any file object greater than or equal in size to threshold bytes will be aligned on an address which is a multiple of OPENPMD_HDF5_ALIGNMENT.
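Both values are plain environment variables, read by the backend when the Series and its I/O handler are created; no API calls change. A sketch under the assumption that a 16 MiB stripe size fits the target file system (the byte values are illustrative, not recommendations):

```cpp
// Sketch: set alignment/threshold tuning before constructing the Series,
// either in the job script
//   OPENPMD_HDF5_ALIGNMENT=16777216 OPENPMD_HDF5_THRESHOLD=1048576 ./app
// or, equivalently, in-process via POSIX setenv as below.
#include <openPMD/openPMD.hpp>

#include <cstdlib> // setenv (POSIX)

int main()
{
    setenv("OPENPMD_HDF5_ALIGNMENT", "16777216", /*overwrite=*/1); // 16 MiB
    setenv("OPENPMD_HDF5_THRESHOLD", "1048576", /*overwrite=*/1);  //  1 MiB

    openPMD::Series series("data_%T.h5", openPMD::Access::CREATE);
    // ... define records and storeChunk() as usual ...
    series.flush();
    return 0;
}
```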

OPENPMD_HDF5_CHUNKS: this sets defaults for data chunking via H5Pset_chunk. Chunking generally improves performance and only needs to be disabled in corner-cases, e.g. when heavily relying on independent, parallel I/O that non-collectively declares data records.
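As a sketch, disabling chunking for such a workflow again only requires setting the environment variable before the Series is constructed:

```cpp
// Sketch: turn off HDF5 chunking; "auto" (the default) keeps the heuristic
// chunk layout. Uses POSIX setenv; setting the variable in the job script
// works just as well.
#include <openPMD/openPMD.hpp>

#include <cstdlib>

int main()
{
    setenv("OPENPMD_HDF5_CHUNKS", "none", 1);

    openPMD::Series series("data_%T.h5", openPMD::Access::CREATE);
    // ... datasets declared from here on use a contiguous layout ...
    series.flush();
    return 0;
}
```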

OPENPMD_HDF5_COLLECTIVE_METADATA: this is an option to enable collective MPI calls for HDF5 metadata operations via H5Pset_all_coll_metadata_ops and H5Pset_coll_metadata_write. By default, this optimization is enabled as it has proven to provide performance improvements. This option is only available from HDF5 1.10.0 onwards; for previous versions, it will fall back to independent MPI calls.

OPENPMD_HDF5_PAGED_ALLOCATION: this option enables paged allocation for HDF5 operations via H5Pset_file_space_strategy. The page size can be controlled by the OPENPMD_HDF5_PAGED_ALLOCATION_SIZE option.

OPENPMD_HDF5_PAGED_ALLOCATION_SIZE: this option configures, via H5Pset_file_space_page_size, the page size used when the OPENPMD_HDF5_PAGED_ALLOCATION optimization is enabled. Values are expressed in bytes. The default is 32 MiB (33554432 bytes).

OPENPMD_HDF5_DEFER_METADATA: this option enables deferred HDF5 metadata operations. The metadata buffer size can be controlled by the OPENPMD_HDF5_DEFER_METADATA_SIZE option.

OPENPMD_HDF5_DEFER_METADATA_SIZE: this option configures, via H5Pset_mdc_config, the size of the buffer used when the OPENPMD_HDF5_DEFER_METADATA optimization is enabled. Values are expressed in bytes. The default is 32 MiB (33554432 bytes).
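The metadata and allocation toggles above are, like the other controls, ordinary environment variables. A combined sketch, with buffer sizes that are illustrative rather than tuned recommendations:

```cpp
// Sketch: set the collective-metadata, paged-allocation and deferred-metadata
// controls before constructing the Series (job-script exports work the same).
// The 64 MiB buffer sizes below are examples, not recommendations.
#include <openPMD/openPMD.hpp>

#include <cstdlib>

int main()
{
    setenv("OPENPMD_HDF5_COLLECTIVE_METADATA", "ON", 1);
    setenv("OPENPMD_HDF5_PAGED_ALLOCATION", "ON", 1);
    setenv("OPENPMD_HDF5_PAGED_ALLOCATION_SIZE", "67108864", 1); // 64 MiB
    setenv("OPENPMD_HDF5_DEFER_METADATA", "ON", 1);
    setenv("OPENPMD_HDF5_DEFER_METADATA_SIZE", "67108864", 1); // 64 MiB

    openPMD::Series series("data_%T.h5", openPMD::Access::CREATE);
    // ... usual record definitions and storeChunk() calls ...
    series.flush();
    return 0;
}
```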

H5_COLL_API_SANITY_CHECK: this is an HDF5 control option for debugging parallel I/O logic (API calls). Debugging a parallel program with this option enabled can help to spot bugs such as collective MPI calls that are not issued by all participating MPI ranks. Do not use it in production; it will slow down parallel I/O operations.

HDF5_USE_FILE_LOCKING: this is an HDF5 1.10.1+ control option that disables HDF5's internal file locking operations (see the HDF5 1.10.1 release notes). This mechanism is mainly used to ensure that a file that is still being written to cannot (yet) be opened by either a reader or another writer. On some HPC and Jupyter systems, parallel/network file systems like GPFS are mounted in a way that interferes with this internal HDF5 access-consistency check. As a result, read-only operations like h5ls some_file.h5 or an openPMD Series open can hang indefinitely. If you are sure that the file was written completely and has been closed by the writer, e.g. because a simulation that created HDF5 outputs has finished, then you can set this environment variable to FALSE to work around the problem. You should also report the problem to your system support, so that they can fix the file system mount options or disable locking by default in the provided HDF5 installation.
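As a sketch of this work-around for a completely written, closed file (again via POSIX setenv; exporting the variable in the shell before h5ls or before launching the reader is equivalent):

```cpp
// Sketch: disable HDF5's file locking before opening a finished file
// read-only on a file system whose lock support makes opens hang.
// Only do this for files that are no longer being written.
#include <openPMD/openPMD.hpp>

#include <cstdlib>
#include <iostream>

int main()
{
    setenv("HDF5_USE_FILE_LOCKING", "FALSE", 1);

    openPMD::Series series("data_%T.h5", openPMD::Access::READ_ONLY);
    for (auto const &iteration : series.iterations)
        std::cout << "found iteration " << iteration.first << std::endl;
    return 0;
}
```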

OMPI_MCA_io: this is an OpenMPI control variable. Starting with OpenMPI 2.x, OpenMPI ships its own MPI-I/O backend, OMPIO. This backend is known to cause problems in older releases that might still be in use on some systems. Specifically, we found and reported a silent data corruption issue that was fixed only in OpenMPI versions 3.0.4, 3.1.4, 4.0.1 and newer. There are also problems in OMPIO with writes larger than 2 GB, which were only fixed in OpenMPI versions 3.0.5, 3.1.5, 4.0.3 and newer. Using export OMPI_MCA_io=^ompio before mpiexec/mpirun/srun/jsrun will disable OMPIO and instead fall back to the older ROMIO MPI-I/O backend in OpenMPI.

Known Issues

Warning

Jul 23rd, 2021 (HDFFV-11260): Collective HDF5 metadata reads (OPENPMD_HDF5_COLLECTIVE_METADATA=ON) broke in HDF5 1.10.5, falling back to individual metadata operations. HDF5 releases 1.10.4 and earlier are not affected; versions 1.10.9+, 1.12.2+ and 1.13.1+ fixed the issue.

Selected References

  • GitHub issue #554

  • Axel Huebl, Rene Widera, Felix Schmitt, Alexander Matthes, Norbert Podhorszki, Jong Youl Choi, Scott Klasky, and Michael Bussmann. On the Scalability of Data Reduction Techniques in Current and Upcoming HPC Systems from an Application Perspective, ISC High Performance 2017: High Performance Computing, pp. 15-29, 2017. arXiv:1706.00522, DOI:10.1007/978-3-319-67630-2_2