I have a large binary data file which I want to load into a C array for fast access. The data file just contains a sequence of 4-byte ints.
I get the data via the pkgutil.get_data function, which returns a binary string. The following code works:
    import pkgutil
    import struct

    cdef int data[32487834]

    def load_data():
        global data
        py_data = pkgutil.get_data('my_module', 'my_data')
        for i in range(32487834):
            data[i] = <int>struct.unpack('i', py_data[4*i:4*(i+1)])[0]
        return 0

    load_data()
The problem is that this code is quite slow. Reading the whole data file can take 7 or 8 seconds. Reading the file directly into an array in C only takes 1-2 seconds, but I want to use pkgutil.get_data so that my module can reliably find the data wherever it gets installed.
So, my question is: what's the best way to do this? Is there a way to directly cast the data as an array of ints without all the calls to struct.unpack? And, as a secondary question, is there a way to simply get a pointer to the data to avoid copying 120MB of data unnecessarily?
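To make the question concrete, here is roughly the kind of thing I have in mind. This is only a sketch, not tested code: it assumes the file holds native-endian 4-byte ints, the names load_data_fast and view are made up, and the const memoryview assignment needs a reasonably recent Cython:

    import pkgutil
    import struct

    # Read-only view, since pkgutil.get_data returns immutable bytes
    cdef const int[:] view

    def load_data_fast():
        global view
        py_data = pkgutil.get_data('my_module', 'my_data')
        # Candidate 1: one struct.unpack call for the whole buffer
        # (still builds a big tuple, but avoids millions of Python calls):
        # values = struct.unpack('%di' % (len(py_data) // 4), py_data)
        # Candidate 2: reinterpret the bytes as ints with no copy at all;
        # the memoryview keeps py_data alive:
        view = memoryview(py_data).cast('i')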
Alternatively, is there a way to make pkgutil return the file path to the data instead of the data itself (in which case I can use C file IO to read the file quite quickly)?
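One possible route for the path question, sketched here on the assumption that Python 3.9+ is available: importlib.resources can resolve a resource to a real filesystem path, extracting to a temporary file when the package is installed inside a zip:

    from importlib.resources import files, as_file

    def with_data_path():
        resource = files('my_module').joinpath('my_data')
        # as_file yields a real filesystem path, extracting to a
        # temporary file if the package lives inside a zip archive
        with as_file(resource) as path:
            print(path)  # usable with ordinary C or Python file IO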
EDIT:
Just for the record, here's the final code used (based on Veedrac's answer):
    import pkgutil
    from cpython cimport array
    import array

    cdef int[:] data

    cdef void load_data():
        global data
        py_data = pkgutil.get_data('my_module', 'my_data')
        data = array.array('i', py_data)

    load_data()
Everything is quite fast.
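And on the secondary question about a raw pointer: once the data sits in a typed memoryview, the address of the first element can be taken directly. A minimal sketch, assuming load_data has already run and the backing array stays referenced while the pointer is in use:

    cdef int* data_ptr():
        # Address of the first element of the typed memoryview; valid
        # only as long as the underlying array.array stays alive
        return &data[0]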