
I have some binary data and I was wondering how I can load that into pandas.

Can I somehow load it by specifying the format it is in and what the individual columns are called?

Edit:
Format is

int, int, int, float, int, int[256]

Each comma separation represents a column in the data, i.e. the last 256 integers form a single column.

5 Comments
  • You need to put it into a numpy array (or a Python dict/list). Is it a custom format, or something like Stata? Commented May 15, 2013 at 19:26
  • It's a custom format. Some integers, some floats. Commented May 15, 2013 at 20:30
  • Your best bet is probably to just read it with Python and create a numpy array; if speed is a problem, you can read it with Cython, or if you already have a reader in C, you can wrap that in Cython. Commented May 15, 2013 at 22:20
  • Can you provide the format of your binary file? Commented May 16, 2013 at 6:04
  • Sure. Added the format to the original post. Commented May 16, 2013 at 8:13

4 Answers


Even though this is an old question, I was wondering the same thing and I didn't see a solution I liked.

When reading binary data with Python I have found numpy.fromfile or numpy.frombuffer (numpy.fromstring is deprecated in its favor) to be much faster than using the Python struct module. Binary data with mixed types can be efficiently read into a numpy array with these functions, as long as the record format is constant and can be described with a numpy data type object (numpy.dtype).

import numpy as np
import pandas as pd

# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'),
               ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)  # `file` can be a path or an open file object
df = pd.DataFrame(data)

# Or if you want to explicitly set the column names
df = pd.DataFrame(data, columns=data.dtype.names)
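
If the records are already sitting in memory as a bytes object rather than in a file, numpy.frombuffer works the same way (a minimal sketch, reusing the dt defined above; raw_bytes is a stand-in name, not from the original answer):

# `raw_bytes` stands in for data you already hold, e.g. read from a socket
raw_bytes = open('data.bin', 'rb').read()
data = np.frombuffer(raw_bytes, dtype=dt)  # zero-copy, read-only view
df = pd.DataFrame(data, columns=data.dtype.names)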

Edits:

  • Removed unnecessary conversion of data.to_list(). Thanks fxx
  • Added example of leaving off the columns argument

4 Comments

A great improvement. Thanks for posting this solution.
The list conversion is unnecessary; using data directly to build the pandas DataFrame speeds things up: df = pd.DataFrame(data, columns=data.dtype.names)
Can something be done without providing the format? E.g. if I have more than a thousand columns, spelling out the dtype would take a while and be unnecessary effort.
I cannot directly convert the numpy array to a DataFrame; it raises ValueError: Data must be 1-dimensional, got ndarray of shape (6059, 1) instead, even though the numpy array is 1-D (shape=(6059,)). I ended up satisfied with the numpy array itself, as I can retrieve data with syntax like data[:]['f']. (A possible workaround is sketched below.)
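
One workaround for that ValueError, reusing the dt and data from the answer above (a sketch, assuming the nested (256,) field is what newer pandas rejects): build the frame from the scalar fields and attach the wide field as an object column.

import numpy as np
import pandas as pd

# Scalar fields become ordinary columns; the (256,) field becomes one
# object column whose cells are 256-element numpy arrays.
scalar_names = [n for n in data.dtype.names if data.dtype[n].shape == ()]
df = pd.DataFrame({n: data[n] for n in scalar_names})
df['f'] = list(data['f'])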

Recently I was confronted with a similar problem, though with a much bigger structure. I think I found an improvement on mowen's answer using the utility method DataFrame.from_records. In the example above, this would give:

import numpy as np
import pandas as pd

# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'), ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)
df = pd.DataFrame.from_records(data)

In my case, it significantly sped up the process. I assume the improvement comes from not having to create an intermediate Python list, instead creating the DataFrame directly from the NumPy structured array.
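
A variation worth noting (not from the original answer): from_records also accepts an index argument, and if the nested 'f' field trips up a newer pandas, you can select just the scalar fields first with NumPy multi-field indexing.

# Sketch: skip the wide sub-array field and promote one field to the index
df = pd.DataFrame.from_records(data[['a', 'b', 'c', 'd', 'e']], index='a')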



Here's something to get you started.

import os
from struct import unpack, calcsize
from pandas import DataFrame

entry_format = 'iiifi256i'  # int, int, int, float, int, int[256]
field_names = ['a', 'b', 'c', 'd', 'e', 'f']
entry_size = calcsize(entry_format)

rows = []
with open(input_filename, mode='rb') as f:
    entry_count = os.fstat(f.fileno()).st_size // entry_size
    for i in range(entry_count):
        record = f.read(entry_size)
        # unpack yields a flat 261-tuple; regroup the trailing 256 ints
        entry = unpack(entry_format, record)
        rows.append(dict(zip(field_names, entry[:5] + (entry[5:],))))

df = DataFrame(rows, columns=field_names)

4 Comments

With minor modifications to your snippet (like open(.., mode='rb') and os.fstat(input_filename)) I get the following error, DataFrame constructor not properly called!
Don't really need to get the count here.... for record in iter(lambda: f.read(entry_size), b''): # ... will do it (note the b'' sentinel, since the file is opened in binary mode; see the sketch after these comments)
This gives an error: ValueError: If use all scalar values, must pass index, and it seems like 'f' is 0 and not an array.
Cool, the struct module looks very useful. I would just append the entry_frame dictionaries to a list and then create a DataFrame from a list of dicts after the whole file is read.
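
A sketch of the count-free loop that comment suggests (reusing entry_format, field_names, and entry_size from the answer above; input_filename is a stand-in path):

from struct import unpack, calcsize
import pandas as pd

entry_format = 'iiifi256i'
field_names = ['a', 'b', 'c', 'd', 'e', 'f']
entry_size = calcsize(entry_format)
input_filename = 'data.bin'  # stand-in path

rows = []
with open(input_filename, mode='rb') as f:
    # Read fixed-size records until EOF; b'' is the binary-mode sentinel.
    # A trailing partial record would make unpack raise struct.error.
    for record in iter(lambda: f.read(entry_size), b''):
        entry = unpack(entry_format, record)
        rows.append(dict(zip(field_names, entry[:5] + (entry[5:],))))

df = pd.DataFrame(rows, columns=field_names)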

The following uses a precompiled struct.Struct, which is faster than passing the format string to struct.unpack on every call. An alternative is to use np.frombuffer or np.fromfile, as mentioned above.

import struct, ctypes, os
import numpy as np, pandas as pd

mystruct = struct.Struct('iiifi256i')
buff = ctypes.create_string_buffer(mystruct.size)
# dtype mirroring the struct layout: int32 x3, float32, int32,
# then a 256-element int32 sub-array
dtype = np.dtype('i4,i4,i4,f4,i4,(256,)i4')

with open(input_filename, mode='rb') as f:
    nrows = os.fstat(f.fileno()).st_size // mystruct.size
    array = np.empty((nrows,), dtype=dtype)
    for row in range(nrows):
        buff.raw = f.read(mystruct.size)
        record = mystruct.unpack_from(buff, 0)
        # record = np.frombuffer(buff, dtype=dtype)[0]  # alternative parse
        # regroup the flat 261-tuple so the last 256 ints fill the sub-array
        array[row] = record[:5] + (record[5:],)

df = pd.DataFrame(array)

see also http://pymotw.com/2/struct/
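
If you want to verify the speed claim on your own machine, a minimal sketch (blob is a stand-in record, and no timings are asserted here):

import struct, timeit

mystruct = struct.Struct('iiifi256i')
blob = b'\x00' * mystruct.size  # stand-in for one real record

t_compiled = timeit.timeit(lambda: mystruct.unpack(blob), number=100000)
t_module = timeit.timeit(lambda: struct.unpack('iiifi256i', blob), number=100000)
print(t_compiled, t_module)  # struct caches formats internally, so the gap may be modest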

