DASK
The Python bindings of openPMD-api provide direct methods to load data into the parallel Dask data analysis ecosystem.
How to Install
Among many package managers, PyPI ships the latest packages of Dask and PyArrow:
python3 -m pip install -U dask
python3 -m pip install -U pyarrow
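A quick way to confirm that both packages are importable in the active environment (printed versions will vary) is:
python3 -c "import dask, pyarrow; print(dask.__version__, pyarrow.__version__)"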
How to Use
The central Python API calls to convert to Dask datatypes are the ParticleSpecies.to_dask and Record_Component.to_dask_array methods.
import openpmd_api as io
import dask

s = io.Series("samples/git-sample/data%T.h5", io.Access.read_only)
electrons = s.iterations[400].particles["electrons"]
# the default schedulers are local/threaded. We can also use local
# "processes" or for multi-node "distributed", among others.
dask.config.set(scheduler='processes')
df = electrons.to_dask()
type(df)  # a dask.dataframe DataFrame
E = s.iterations[400].meshes["E"]
E_x = E["x"]
darr_x = E_x.to_dask_array()
type(darr_x)  # a dask.array Array
# note: no series.flush() needed
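The returned objects are lazy: operations are only recorded until .compute() is called, which triggers the actual read and calculation. As a minimal sketch of a follow-up analysis (the column name momentum_x is an assumption and depends on the records stored for the species):

# hypothetical follow-up analysis; column names depend on the dataset at hand
mean_px = df["momentum_x"].mean().compute()  # triggers the read and the reduction
max_Ex = darr_x.max().compute()              # NumPy-style reduction on the Dask array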
The to_dask_array method will automatically set the Dask array chunking based on the available chunks in the read data set. The default behavior can be overridden by passing an additional keyword argument chunks; see the dask.array.from_array documentation for more details. For example, to chunk only along the outermost axis in a 3D dataset using the default Dask array chunk size, call to_dask_array(chunks={0: 'auto', 1: -1, 2: -1}).
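For instance, a short sketch that overrides the chunking of the field component read above and computes a reduction (assuming E_x refers to a three-dimensional dataset):

darr_x = E_x.to_dask_array(chunks={0: 'auto', 1: -1, 2: -1})
darr_x.chunks                      # inspect the resulting chunk layout
mean_Ex = darr_x.mean().compute()  # evaluation happens here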
Example
A detailed example script for particle and field analysis is documented as 11_particle_dataframe.py in our examples.
See a video of openPMD on Dask in action in pull request #963 (part of openPMD-api v0.14.0 and later).
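For multi-node analysis with the "distributed" scheduler mentioned above, a dask.distributed Client is typically created before triggering computations. A minimal sketch, where the scheduler address and the column name momentum_x are placeholders:

from dask.distributed import Client

client = Client()  # local test cluster; or Client("tcp://scheduler:8786") for an existing cluster
df = electrons.to_dask()
print(df["momentum_x"].mean().compute())  # work is dispatched to the cluster's workers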