I'm having trouble reading an HDF5 MATLAB 7.3 file with Python. I'm using h5py 2.0.1.

I can read all the matrices that are stored in the file, but I cannot read a list of strings. h5py shows the strings as a dataset of shape (1, 894) with type |O4. This dataset contains object references, which I tried to dereference using the h5file[obj_ref] syntax.

This yields something like dataset "FFb": shape (4, 1) type "<u2". I interpreted that as an array of four chars, which seems to be the ASCII representation of the string.

Is there an easy way to get the strings out?

Is there any package providing MATLAB-to-Python HDF5 support?

3 Answers

13

I assume you mean it is a cell array of strings in MATLAB? This output looks normal: the dataset is an array of objects (|O4 is the NumPy object datatype). Each object is an array of 2-byte integers (<u2 is the NumPy little-endian unsigned 2-byte integer datatype). h5py has no way of knowing that the dataset is a cell array of strings; it could just as well be a cell array of arbitrary 16-bit integers.

The easiest way to get the strings out would be a list comprehension that uses unichr to convert the character codes, like this:

strlist = [u''.join(unichr(c) for c in h5file[obj_ref]) for obj_ref in dataset]

What this does is iterate over the dataset (for obj_ref in dataset) to create a new list. For each object reference, it dereferences the object (h5file[obj_ref]) to get an array of integers. It converts each integer into a character (unichr(c)) and joins those characters all together into a Unicode string (u''.join()).

Note that this produces a list of unicode strings. If you are absolutely sure that every string contains only ASCII characters, you can replace u'' by '' and unichr by chr.
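
For Python 3, where unichr no longer exists, the same conversion can be sketched without h5py. This is only a minimal sketch: it assumes the 2-byte values are UTF-16 code units (which the <u2 type suggests, but nothing in this thread confirms), and code_units_to_str is a hypothetical helper name, not part of h5py.

```python
import struct


def code_units_to_str(code_units):
    """Convert a flat iterable of 16-bit character codes to a str.

    Assumption: the values are UTF-16 code units.
    """
    # int() also accepts NumPy integer scalars, so a flattened array works too.
    values = [int(c) for c in code_units]
    # Pack the values as little-endian unsigned 16-bit integers ...
    raw = struct.pack("<%dH" % len(values), *values)
    # ... and decode the resulting bytes as UTF-16.
    return raw.decode("utf-16-le")


print(code_units_to_str([116, 101, 115, 116]))
```

With h5py one would presumably pass the flattened, dereferenced array, for example code_units_to_str(h5file[obj_ref][:].flatten()). Decoding the whole buffer as UTF-16 instead of calling chr on each value also copes with characters that need a surrogate pair.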

Caveat: I don't have h5py; this post is based on my experiences with MATLAB and NumPy. You may need to adjust the syntax or iteration order to suit your dataset.

3 Comments

You are right, I forgot there are no lists. It has to be a cell array. Is there no way to say that a particular dataset is of type string? Or does MATLAB just not do that? But MATLAB knows that these are strings, so somehow that must be stored in the HDF5 file. Your line seems good, I just hoped there was another way.
This is way old - but did you ever figure out a good solution to this problem @AndreasMueller? (other than just writing your own function that implements the above code)
I don't think I found one, at least I cannot remember ;)
4

You can get the original MATLAB class name of Group and Dataset objects with

dataset.attrs['MATLAB_class']

If the dataset contains a string, it will return b'char'.
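
Building on that attribute, one way to pick out string datasets before decoding them might look like the following. It is a sketch rather than a tested recipe: matlab_class and is_matlab_string are hypothetical helper names, and it assumes the attribute value arrives as bytes (as shown above) or as str.

```python
def matlab_class(obj):
    """Return the MATLAB_class attribute of a Group/Dataset as str, or None."""
    raw = obj.attrs.get("MATLAB_class")
    if raw is None:
        return None
    # Assumption: the value is bytes (e.g. b'char'); fall back to str() otherwise.
    return raw.decode("ascii") if isinstance(raw, bytes) else str(raw)


def is_matlab_string(obj):
    """True if the object was saved from a MATLAB char array."""
    return matlab_class(obj) == "char"
```

Once is_matlab_string(h5file[obj_ref]) is true, the dataset's values are the character codes, which can be joined into a string with chr as in the other answers.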

1 Comment

After checking the type, how do you access the string?
0

nneonneo's answer is broadly correct, but requires some changes for modern Python. Say you have a .mat file called my_matfile.mat, containing a cell array of strings my_cell_array. The following should extract the strings into a list:

import h5py

path = "my_matfile.mat"

with h5py.File(path, "r") as h5:
    my_string_list = []
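    # The dataset of object references is 2-D (shape (1, 894) in the question); take its first row.
    references = h5["my_cell_array"][0]
    for r in references:
        # Dereference to get the character codes, then join them into a str.
        my_string_list.append("".join(chr(c.item()) for c in h5[r][:]))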
