
I have some binary data and I was wondering how I can load that into pandas.

Can I somehow load it by specifying the format it is in and what the individual columns are called?

Edit:
Format is

int, int, int, float, int, int[256]

Each comma separation represents a column in the data, i.e. the last 256 integers form a single column.

5 Comments
  • You need to put it into a numpy array (or a Python dict/list). Is it a custom format, or something like Stata? Commented May 15, 2013 at 19:26
  • It's a custom format. Some integers, some floats. Commented May 15, 2013 at 20:30
  • Your best bet is probably to just read it with Python and create a numpy array; if speed is a problem, you can read it with Cython, or if you already have a reader in C, you can wrap that in Cython. Commented May 15, 2013 at 22:20
  • Can you provide the format of your binary file? Commented May 16, 2013 at 6:04
  • Sure. Added the format to the original post. Commented May 16, 2013 at 8:13

4 Answers


Even though this is an old question, I was wondering the same thing and I didn't see a solution I liked.

When reading binary data with Python I have found numpy.fromfile or numpy.frombuffer (numpy.fromstring is deprecated in its favor) to be much faster than using the Python struct module. Binary data with mixed types can be efficiently read into a numpy array with these functions, as long as the record format is constant and can be described with a numpy data type object (numpy.dtype).

import numpy as np
import pandas as pd

# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'),
               ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)  # `file` can be a path or an open file object
df = pd.DataFrame(data)

# Or if you want to explicitly set the column names
df = pd.DataFrame(data, columns=data.dtype.names)
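
If the records are already sitting in memory as a bytes object rather than in a file, numpy.frombuffer works the same way (a minimal sketch, reusing the dt defined above; raw_bytes is a stand-in name, not from the original answer):

# `raw_bytes` stands in for data you already hold, e.g. read from a socket
raw_bytes = open('data.bin', 'rb').read()
data = np.frombuffer(raw_bytes, dtype=dt)  # zero-copy, read-only view
df = pd.DataFrame(data, columns=data.dtype.names)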

Edits:

  • Removed unnecessary conversion of data.to_list(). Thanks fxx
  • Added example of leaving off the columns argument

4 Comments

A great improvement. Thanks for posting this solution.
The list conversion is unnecessary; using data directly to build the pandas DataFrame speeds things up: df = pd.DataFrame(data, columns=data.dtype.names)
Can something be done without providing the format? E.g. if I have more than a thousand columns, spelling out the dtype would take a while and be unnecessary effort.
I cannot directly convert the numpy array to a DataFrame; it raises ValueError: Data must be 1-dimensional, got ndarray of shape (6059, 1) instead, even though the numpy array is 1-D (shape=(6059,)). I ended up satisfied with the numpy array itself, as I can retrieve data with syntax like data[:]['f']. (A possible workaround is sketched below.)
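
One workaround for that ValueError, reusing the dt and data from the answer above (a sketch, assuming the nested (256,) field is what newer pandas rejects): build the frame from the scalar fields and attach the wide field as an object column.

import numpy as np
import pandas as pd

# Scalar fields become ordinary columns; the (256,) field becomes one
# object column whose cells are 256-element numpy arrays.
scalar_names = [n for n in data.dtype.names if data.dtype[n].shape == ()]
df = pd.DataFrame({n: data[n] for n in scalar_names})
df['f'] = list(data['f'])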

Recently I was confronted with a similar problem, though with a much bigger structure. I think I found an improvement on mowen's answer using the utility method DataFrame.from_records. In the example above, this would give:

import numpy as np
import pandas as pd

# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'), ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)
df = pd.DataFrame.from_records(data)

In my case, it significantly sped up the process. I assume the improvement comes from not having to create an intermediate Python list, instead creating the DataFrame directly from the NumPy structured array.
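
A variation worth noting (not from the original answer): from_records also accepts an index argument, and if the nested 'f' field trips up a newer pandas, you can select just the scalar fields first with NumPy multi-field indexing.

# Sketch: skip the wide sub-array field and promote one field to the index
df = pd.DataFrame.from_records(data[['a', 'b', 'c', 'd', 'e']], index='a')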



Here's something to get you started.

import os
from struct import unpack, calcsize
from pandas import DataFrame

entry_format = 'iiifi256i'  # int, int, int, float, int, int[256]
field_names = ['a', 'b', 'c', 'd', 'e', 'f']
entry_size = calcsize(entry_format)

rows = []
with open(input_filename, mode='rb') as f:
    entry_count = os.fstat(f.fileno()).st_size // entry_size
    for i in range(entry_count):
        record = f.read(entry_size)
        # unpack yields a flat 261-tuple; regroup the trailing 256 ints
        entry = unpack(entry_format, record)
        rows.append(dict(zip(field_names, entry[:5] + (entry[5:],))))

df = DataFrame(rows, columns=field_names)

4 Comments

With minor modifications to your snippet (like open(.., mode='rb') and os.fstat(input_filename)) I get the following error, DataFrame constructor not properly called!
Don't really need to get the count here.... for record in iter(lambda: f.read(entry_size), b''): # ... will do it (note the b'' sentinel, since the file is opened in binary mode; see the sketch after these comments)
This gives an error: ValueError: If use all scalar values, must pass index, and it seems like 'f' is 0 and not an array.
Cool, the struct module looks very useful. I would just append the entry_frame dictionaries to a list and then create a DataFrame from a list of dicts after the whole file is read.
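
A sketch of the count-free loop that comment suggests (reusing entry_format, field_names, and entry_size from the answer above; input_filename is a stand-in path):

from struct import unpack, calcsize
import pandas as pd

entry_format = 'iiifi256i'
field_names = ['a', 'b', 'c', 'd', 'e', 'f']
entry_size = calcsize(entry_format)
input_filename = 'data.bin'  # stand-in path

rows = []
with open(input_filename, mode='rb') as f:
    # Read fixed-size records until EOF; b'' is the binary-mode sentinel.
    # A trailing partial record would make unpack raise struct.error.
    for record in iter(lambda: f.read(entry_size), b''):
        entry = unpack(entry_format, record)
        rows.append(dict(zip(field_names, entry[:5] + (entry[5:],))))

df = pd.DataFrame(rows, columns=field_names)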

The following uses a precompiled struct.Struct, which is faster than passing the format string to struct.unpack on every call. An alternative is to use np.frombuffer or np.fromfile, as mentioned above.

import struct, ctypes, os
import numpy as np, pandas as pd

mystruct = struct.Struct('iiifi256i')
buff = ctypes.create_string_buffer(mystruct.size)
# dtype mirroring the struct layout: int32 x3, float32, int32,
# then a 256-element int32 sub-array
dtype = np.dtype('i4,i4,i4,f4,i4,(256,)i4')

with open(input_filename, mode='rb') as f:
    nrows = os.fstat(f.fileno()).st_size // mystruct.size
    array = np.empty((nrows,), dtype=dtype)
    for row in range(nrows):
        buff.raw = f.read(mystruct.size)
        record = mystruct.unpack_from(buff, 0)
        # record = np.frombuffer(buff, dtype=dtype)[0]  # alternative parse
        # regroup the flat 261-tuple so the last 256 ints fill the sub-array
        array[row] = record[:5] + (record[5:],)

df = pd.DataFrame(array)

see also http://pymotw.com/2/struct/
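
If you want to verify the speed claim on your own machine, a minimal sketch (blob is a stand-in record, and no timings are asserted here):

import struct, timeit

mystruct = struct.Struct('iiifi256i')
blob = b'\x00' * mystruct.size  # stand-in for one real record

t_compiled = timeit.timeit(lambda: mystruct.unpack(blob), number=100000)
t_module = timeit.timeit(lambda: struct.unpack('iiifi256i', blob), number=100000)
print(t_compiled, t_module)  # struct caches formats internally, so the gap may be modest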

