I have Python code whose output is a
sized matrix whose entries are all of type float. If I save it with the extension .dat, the file size is on the order of 500 MB. I read that using h5py reduces the file size considerably. So, let's say I have a 2D numpy array named A. How do I save it to an HDF5 file with h5py?
Also, how do I read the same file back into a numpy array in a different script, since I need to manipulate the array?
2 Answers
h5py provides a model of datasets and groups. The former are basically arrays, and the latter you can think of as directories. Each is named. You should look at the documentation for the API and examples:
http://docs.h5py.org/en/latest/quick.html
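To make the "groups are like directories" idea concrete, here is a minimal sketch. The file name, the group name run_1, and the dataset name matrix are made up for illustration, and the file goes into a temporary directory so nothing is overwritten.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, group, and dataset names, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "groups_demo.h5")
a = np.arange(12, dtype=np.float64).reshape(3, 4)

with h5py.File(path, "w") as f:
    grp = f.create_group("run_1")         # a group behaves like a directory
    grp.create_dataset("matrix", data=a)  # a dataset holds the array itself

with h5py.File(path, "r") as f:
    names = list(f["run_1"].keys())       # names of the members of the group
    b = f["run_1/matrix"][:]              # path-style access, read into memory

print(names)
print(b.shape)
```

The slash-separated key works much like a file path, which is why the directory analogy is a useful one.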
A simple example where you are creating all of the data upfront and just want to save it to an hdf5 file would look something like:
In [1]: import numpy as np
In [2]: import h5py
In [3]: a = np.random.random(size=(100,20))
In [4]: h5f = h5py.File('data.h5', 'w')
In [5]: h5f.create_dataset('dataset_1', data=a)
Out[5]: <HDF5 dataset "dataset_1": shape (100, 20), type "<f8">
In [6]: h5f.close()
You can then load that data back in using:
In [10]: h5f = h5py.File('data.h5','r')
In [11]: b = h5f['dataset_1'][:]
In [12]: h5f.close()
In [13]: np.allclose(a,b)
Out[13]: True
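Since the question is about file size: a plain create_dataset call stores the raw values, so any size reduction would have to come from compression. The sketch below assumes gzip at level 4 is acceptable for your data; the level is an arbitrary choice, and random floats like these compress poorly, so treat this as a way to try it out rather than a promise of smaller files.

```python
import os
import tempfile

import h5py
import numpy as np

# Temporary location so the sketch does not touch an existing data.h5.
path = os.path.join(tempfile.mkdtemp(), "data_compressed.h5")
a = np.random.random(size=(100, 20))

with h5py.File(path, "w") as h5f:
    # compression_opts=4 is the gzip level (0-9); 4 is just an example value.
    h5f.create_dataset("dataset_1", data=a,
                       compression="gzip", compression_opts=4)

with h5py.File(path, "r") as h5f:
    used = h5f["dataset_1"].compression  # name of the filter that was applied
    b = h5f["dataset_1"][:]              # decompression happens transparently

print(used)
print(np.array_equal(a, b))
```

Reading works exactly as before; gzip is lossless, so the array comes back unchanged.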
Definitely check out the docs linked above.
Writing to an HDF5 file depends on either h5py or pytables (each has a different Python API that sits on top of the HDF5 file specification). You should also take a look at the other simple binary formats that numpy provides natively, such as np.save and np.savez.
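For the numpy-native route, a rough sketch of both functions might look like this. The file names and the keyword names first and second are placeholders, and everything is written to a temporary directory.

```python
import os
import tempfile

import numpy as np

out_dir = tempfile.mkdtemp()
a = np.random.random(size=(100, 20))

# np.save writes a single array to a binary .npy file.
npy_path = os.path.join(out_dir, "a.npy")
np.save(npy_path, a)
b = np.load(npy_path)

# np.savez bundles several named arrays into one .npz archive.
npz_path = os.path.join(out_dir, "arrays.npz")
np.savez(npz_path, first=a, second=2 * a)
with np.load(npz_path) as archive:
    c = archive["first"]
    d = archive["second"]

print(np.array_equal(a, b), np.array_equal(a, c))
```

There is also np.savez_compressed, which takes the same arguments if you want the archive compressed.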
3 Comments
data.h5 exists, but I cannot view it with HDFView. I can read the contents with h5py, but not inspect it with HDFView. Any idea why?
A cleaner way to handle file open/close and avoid memory leaks:
Prep:
import numpy as np
import h5py
data_to_write = np.random.random(size=(100,20)) # or some such
Write:
with h5py.File('name-of-file.h5', 'w') as hf:
    hf.create_dataset("name-of-dataset", data=data_to_write)
Read:
with h5py.File('name-of-file.h5', 'r') as hf:
    data = hf['name-of-dataset'][:]
5 Comments
The with statement in Python uses what is known as a context manager. It will make sure the file is closed after it has been used. More information is available in the official documentation: docs.python.org/3/library/contextlib.html
For a scalar dataset, read the value with hf['name-of-scalar'][()], or you will get a ValueError: Illegal slicing argument for scalar dataspace.
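A small sketch of the scalar case mentioned in the comment above, reusing the placeholder names from the answer. The value 3.5 and the temporary file location are just for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "name-of-file.h5")

with h5py.File(path, "w") as hf:
    hf.create_dataset("name-of-scalar", data=3.5)             # shape ()
    hf.create_dataset("name-of-dataset", data=np.arange(5.0))

with h5py.File(path, "r") as hf:
    scalar = hf["name-of-scalar"][()]   # [()] reads a scalar dataset
    array = hf["name-of-dataset"][:]    # [:] is for datasets with dimensions

print(scalar, array.shape)
```

The [()] form also works on array datasets, so it is a reasonable default if you do not know the shape in advance.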
.dat extension?
np.savetxt("output.dat", A, '%10.8e')
np.save('output.dat', A), which will save it in a binary format (much faster, much less space used).
A = np.loadtxt('output.dat', unpack=True)
h5py doesn't create files smaller than those np.save would? Is h5py faster than np.save for arrays of the size given in the question?
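One way to answer the size question for your own data is simply to write the same array three ways and compare. This is only a sketch: the 100 x 20 random array stands in for the real matrix, the paths live in a temporary directory, and no timing is attempted. Note that np.save appends .npy to a file name that does not already end in it, so np.save('output.dat', A) actually produces output.dat.npy.

```python
import os
import tempfile

import h5py
import numpy as np

out_dir = tempfile.mkdtemp()
A = np.random.random(size=(100, 20))  # stand-in for the real matrix

txt_path = os.path.join(out_dir, "output.dat")
npy_path = os.path.join(out_dir, "output.npy")
h5_path = os.path.join(out_dir, "output.h5")

np.savetxt(txt_path, A, fmt="%10.8e")  # text, as in the comment above
np.save(npy_path, A)                   # numpy binary
with h5py.File(h5_path, "w") as hf:    # HDF5, no compression
    hf.create_dataset("A", data=A)

sizes = {p: os.path.getsize(p) for p in (txt_path, npy_path, h5_path)}
for p, n in sizes.items():
    print(os.path.basename(p), n, "bytes")
```

Without compression, both binary formats store 8 bytes per float plus some overhead, so I would expect them to come out close to each other and well below the text file.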