
I have large image datasets to train CNNs on. Since I cannot load all the images into RAM, I plan to dump them into an HDF5 file (with h5py) and then iterate over the set batch-wise, as suggested in

Most efficient way to use a large data set for PyTorch?

I tried creating a separate dataset for every image, all located in the same group, which is very fast. But I could not figure out how to iterate over all the datasets in the group other than accessing each one by its name. As an alternative, I tried putting all the images iteratively into one dataset by extending its shape, following

How to append data to one specific dataset in a hdf5 file with h5py and

incremental writes to hdf5 with h5py

but this is very slow. Is there a faster way to create an HDF5 dataset to iterate over?
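For reference, here is a minimal sketch of the "extend the shape per image" approach described above (the file name, image size, and the random dummy images are just placeholders); doing one resize plus one tiny write per image is the part that turns out to be slow:

```python
import h5py
import numpy as np

image_shape = (256, 256, 3)  # placeholder image size

with h5py.File("images.h5", "w") as f:
    # Resizable dataset, grown by one image at a time.
    dset = f.create_dataset(
        "images",
        shape=(0,) + image_shape,
        maxshape=(None,) + image_shape,
        dtype="uint8",
    )
    for i in range(1000):
        # Random data standing in for a real image.
        img = np.random.randint(0, 256, image_shape, dtype="uint8")
        dset.resize(i + 1, axis=0)
        dset[i] = img  # one resize + one small write per image
```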

  • You can iterate over all datasets in a group by using group.keys() and checking for instances of h5py.Dataset. See for example: stackoverflow.com/questions/34330283/… Commented Mar 7, 2019 at 16:41
  • The problem with this is that I would like to access the data batch-wise, e.g. 32 images at a time. Recreating each batch from individual per-image datasets in every epoch is very slow... Commented Mar 8, 2019 at 9:31
  • You shouldn't have each image as its own dataset, but rather one large dataset whose first axis represents images. So a stack of 10 256x256 RGB images should be a dataset with shape [10, 256, 256, 3]. Commented Mar 9, 2019 at 18:52
  • Thank you! I realized the dataset creation can be sped up a lot by not compressing the data and not reshaping the dataset every iteration. Commented Mar 11, 2019 at 9:28
  • The most important things are chunk_shape and chunk_cache. The documentation isn't very good on these topics, e.g. stackoverflow.com/a/48405220/4045774. Another quite common error is opening/closing the HDF5 file on every iteration. If you do it the right way, you should easily reach the sequential I/O speed of an HDD or SATA SSD. But without a code sample it is hard to say why your implementation is so slow. (A sketch of this appears below.) Commented Jun 14, 2019 at 8:10
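A rough sketch of the advice in the comments above, assuming the per-image-dataset layout from the question; the file name, group name, and cache size are illustrative, while rdcc_nbytes is an actual h5py.File keyword for the raw chunk cache. The single stacked dataset that the later comments recommend is sketched after the answer below.

```python
import h5py

# Open the file once, not once per batch, and enlarge the raw chunk cache
# (rdcc_nbytes is an h5py.File keyword; 1 GiB here is an arbitrary choice).
with h5py.File("images.h5", "r", rdcc_nbytes=1024**3) as f:
    group = f["images_group"]  # placeholder group name
    # Iterate over all datasets in the group, skipping any sub-groups.
    for name, obj in group.items():
        if isinstance(obj, h5py.Dataset):
            img = obj[...]  # reads the whole per-image dataset into memory
            print(name, img.shape)
```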

1 Answer


I realize this is an old question, but I found a very helpful resource on this subject that I wanted to share:

https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/ch04.html

Basically, an HDF5 file (with chunks enabled) is like a little filesystem: it stores data in chunks scattered throughout the file. So, like a filesystem, it benefits from locality. If the chunks have the same shape as the array sections you're trying to access, reading/writing will be fast. If the data you're looking for is scattered across multiple chunks, access will be slow.

So in the case of training a neural network on images, you're probably going to have to make the images a standard size. Set chunks=(1,) + image_shape, or even better chunks=(batch_size,) + image_shape, when creating the dataset, and reading/writing will be a lot faster.
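For concreteness, a sketch along those lines (dataset name, image size, batch size, and the random dummy data are placeholders): one dataset whose first axis indexes images, chunked per batch, written and then read back batch-wise.

```python
import h5py
import numpy as np

batch_size = 32
image_shape = (256, 256, 3)  # images resized to a standard shape
n_images = 10_000            # placeholder dataset size

# Write: one dataset whose first axis indexes images, chunked per batch.
with h5py.File("train.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(n_images,) + image_shape,
        dtype="uint8",
        chunks=(batch_size,) + image_shape,
    )
    for start in range(0, n_images, batch_size):
        stop = min(start + batch_size, n_images)
        # Random data standing in for real, pre-resized images.
        dset[start:stop] = np.random.randint(
            0, 256, (stop - start,) + image_shape, dtype="uint8"
        )

# Read: each slice lines up with one chunk, so batch access stays fast.
with h5py.File("train.h5", "r") as f:
    dset = f["images"]
    for start in range(0, dset.shape[0], batch_size):
        batch = dset[start:start + batch_size]  # feed this to the training loop
```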
