I have a large binary data file which I want to load into a C array for fast access. The data file just contains a sequence of 4-byte ints.
I get the data via the pkgutil.get_data function, which returns a binary string. The following code works:
    import pkgutil
    import struct

    cdef int data[32487834]

    def load_data():
        global data
        py_data = pkgutil.get_data('my_module', 'my_data')
        for i in range(32487834):
            data[i] = <int>struct.unpack('i', py_data[4*i:4*(i+1)])[0]
        return 0

    load_data()
The problem is that this code is quite slow. Reading the whole data file can take 7 or 8 seconds. Reading the file directly into an array in C only takes 1-2 seconds, but I want to use pkgutil.get_data so that my module can reliably find the data wherever it gets installed.
So, my question is: what's the best way to do this? Is there a way to directly cast the data as an array of ints without all the calls to struct.unpack? And, as a secondary question, is there a way to simply get a pointer to the data to avoid copying 120MB of data unnecessarily?
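To make the question concrete, here is roughly the kind of thing I have in mind. This is only a sketch, not tested code: it assumes the file holds native-endian 4-byte ints, the names load_data_fast and view are made up, and the const memoryview assignment needs a reasonably recent Cython:

    import pkgutil
    import struct

    # Read-only view, since pkgutil.get_data returns immutable bytes
    cdef const int[:] view

    def load_data_fast():
        global view
        py_data = pkgutil.get_data('my_module', 'my_data')
        # Candidate 1: one struct.unpack call for the whole buffer
        # (still builds a big tuple, but avoids millions of Python calls):
        # values = struct.unpack('%di' % (len(py_data) // 4), py_data)
        # Candidate 2: reinterpret the bytes as ints with no copy at all;
        # the memoryview keeps py_data alive:
        view = memoryview(py_data).cast('i')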
Alternatively, is there a way to make pkgutil return the file path to the data instead of the data itself (in which case I can use C file IO to read the file quite quickly)?
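One possible route for the path question, sketched here on the assumption that Python 3.9+ is available: importlib.resources can resolve a resource to a real filesystem path, extracting to a temporary file when the package is installed inside a zip:

    from importlib.resources import files, as_file

    def with_data_path():
        resource = files('my_module').joinpath('my_data')
        # as_file yields a real filesystem path, extracting to a
        # temporary file if the package lives inside a zip archive
        with as_file(resource) as path:
            print(path)  # usable with ordinary C or Python file IO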
EDIT:
Just for the record, here's the final code used (based on Veedrac's answer):
    import pkgutil
    from cpython cimport array
    import array

    cdef int[:] data

    cdef void load_data():
        global data
        py_data = pkgutil.get_data('my_module', 'my_data')
        data = array.array('i', py_data)

    load_data()
Everything is quite fast.
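And on the secondary question about a raw pointer: once the data sits in a typed memoryview, the address of the first element can be taken directly. A minimal sketch, assuming load_data has already run and the backing array stays referenced while the pointer is in use:

    cdef int* data_ptr():
        # Address of the first element of the typed memoryview; valid
        # only as long as the underlying array.array stays alive
        return &data[0]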