
I have large image datasets to train CNNs on. Since I cannot load all the images into RAM, I plan to dump them into an HDF5 file (with h5py) and then iterate over the set batch-wise, as suggested in

Most efficient way to use a large data set for PyTorch?

I tried creating a separate dataset for every image, all located in the same group, which is very fast. But I could not figure out how to iterate over all the datasets in the group other than accessing each one by its name. As an alternative, I tried putting all the images iteratively into one dataset by extending its shape, following

How to append data to one specific dataset in a hdf5 file with h5py and

incremental writes to hdf5 with h5py

but this is very slow. Is there a faster way to create an HDF5 dataset to iterate over?
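For reference, here is a minimal sketch of the "extend the shape per image" approach described above (the file name, image size, and the random dummy images are just placeholders); doing one resize plus one tiny write per image is the part that turns out to be slow:

```python
import h5py
import numpy as np

image_shape = (256, 256, 3)  # placeholder image size

with h5py.File("images.h5", "w") as f:
    # Resizable dataset, grown by one image at a time.
    dset = f.create_dataset(
        "images",
        shape=(0,) + image_shape,
        maxshape=(None,) + image_shape,
        dtype="uint8",
    )
    for i in range(1000):
        # Random data standing in for a real image.
        img = np.random.randint(0, 256, image_shape, dtype="uint8")
        dset.resize(i + 1, axis=0)
        dset[i] = img  # one resize + one small write per image
```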

  • You can iterate over all datasets in a group by using group.keys() and checking for instances of h5py.Dataset. See for example: stackoverflow.com/questions/34330283/… Commented Mar 7, 2019 at 16:41
  • The problem with this is that I would like to access the data batch-wise, e.g. 32 images at a time. Recreating each batch from individual per-image datasets in every epoch is very slow... Commented Mar 8, 2019 at 9:31
  • You shouldn't have each image as its own dataset, but rather one large dataset whose first axis represents images. So a stack of 10 256x256 RGB images should be a dataset with shape [10, 256, 256, 3]. Commented Mar 9, 2019 at 18:52
  • Thank you! I realized the dataset creation can be sped up a lot by not compressing the data and not reshaping the dataset every iteration. Commented Mar 11, 2019 at 9:28
  • The most important things are chunk_shape and chunk_cache. The documentation isn't very good on these topics, e.g. stackoverflow.com/a/48405220/4045774. Another quite common error is opening/closing the HDF5 file on every iteration. If you do it the right way, you should easily reach the sequential I/O speed of an HDD or SATA SSD. But without a code sample it is hard to say why your implementation is so slow. (A sketch of this appears below.) Commented Jun 14, 2019 at 8:10
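A rough sketch of the advice in the comments above, assuming the per-image-dataset layout from the question; the file name, group name, and cache size are illustrative, while rdcc_nbytes is an actual h5py.File keyword for the raw chunk cache. The single stacked dataset that the later comments recommend is sketched after the answer below.

```python
import h5py

# Open the file once, not once per batch, and enlarge the raw chunk cache
# (rdcc_nbytes is an h5py.File keyword; 1 GiB here is an arbitrary choice).
with h5py.File("images.h5", "r", rdcc_nbytes=1024**3) as f:
    group = f["images_group"]  # placeholder group name
    # Iterate over all datasets in the group, skipping any sub-groups.
    for name, obj in group.items():
        if isinstance(obj, h5py.Dataset):
            img = obj[...]  # reads the whole per-image dataset into memory
            print(name, img.shape)
```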

1 Answer


I realize this is an old question, but I found a very helpful resource on this subject that I wanted to share:

https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/ch04.html

Basically, an HDF5 file (with chunks enabled) is like a little filesystem: it stores data in chunks scattered throughout the file. So, like a filesystem, it benefits from locality. If the chunks have the same shape as the array sections you're trying to access, reading/writing will be fast. If the data you're looking for is scattered across multiple chunks, access will be slow.

So in the case of training a neural network on images, you're probably going to have to make the images a standard size. Set chunks=(1,) + image_shape, or even better chunks=(batch_size,) + image_shape, when creating the dataset, and reading/writing will be a lot faster.
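For concreteness, a sketch along those lines (dataset name, image size, batch size, and the random dummy data are placeholders): one dataset whose first axis indexes images, chunked per batch, written and then read back batch-wise.

```python
import h5py
import numpy as np

batch_size = 32
image_shape = (256, 256, 3)  # images resized to a standard shape
n_images = 10_000            # placeholder dataset size

# Write: one dataset whose first axis indexes images, chunked per batch.
with h5py.File("train.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(n_images,) + image_shape,
        dtype="uint8",
        chunks=(batch_size,) + image_shape,
    )
    for start in range(0, n_images, batch_size):
        stop = min(start + batch_size, n_images)
        # Random data standing in for real, pre-resized images.
        dset[start:stop] = np.random.randint(
            0, 256, (stop - start,) + image_shape, dtype="uint8"
        )

# Read: each slice lines up with one chunk, so batch access stays fast.
with h5py.File("train.h5", "r") as f:
    dset = f["images"]
    for start in range(0, dset.shape[0], batch_size):
        batch = dset[start:start + batch_size]  # feed this to the training loop
```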
