I currently have ~1000 binary files. Each file contains a few thousand messages, and every message has the same fixed layout of data types.
I've tried several ways of reading this into a NumPy array, but every attempt has been pretty slow, and I'm curious how fast it can be made.
In terms of reading the bytes into Python from the files, I've found it's much faster to create a bytearray of the correct size up front and use file.readinto() than to use file.read().
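For illustration, the read pattern I mean looks roughly like this (a minimal sketch; the file name is just a placeholder):

import os

path = 'messages_000.bin'                  # placeholder file name
buffer = bytearray(os.path.getsize(path))  # preallocate the full file size up front
with open(path, 'rb') as f:
    f.readinto(buffer)                     # fill the existing buffer in place, no intermediate copy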
So I'm left with the problem of getting the bytes into a NumPy array. Below is my first iteration; profiling shows 87.5% of the time is spent in the if-else block that appends the NumPy arrays.
import os
import numpy as np

numpy_types = np.dtype([('col1', 'i8'), ('col2', 'i8'), ('col3', 'i8'), ('col4', 'f4')])

count = 0
for file in files:
    # Preallocate a buffer of the file's size and fill it in place
    byte_array = bytearray(os.path.getsize(file))
    with open(file, 'rb') as f:
        f.readinto(byte_array)
    # Interpret the raw bytes as structured records
    array = np.frombuffer(byte_array, numpy_types)
    if count == 0:
        numpy_array = array
    else:
        # np.append copies the whole accumulated array on every iteration
        numpy_array = np.append(numpy_array, array)
    count += 1
In case anyone wants to try this at home, here is the above example again, along with a second attempt, in a form you can copy and paste.
1st attempt
Read each file into an individual numpy array and append them together
import numpy as np
import time
start = time.time()
byte_array = b''
bytes1 = b'\x00\xe8n\x14Z\x1d\xd8\x08\xff\xff\xff\xff\xff\xff\xff\xff\x00\xdd\x90\xa7\x16/\xd8\x08ff\xe0A'
# Create the byte array identical to what would be read in from each file
for i in range(1000):
    byte_array += bytes1
numpy_dtypes = np.dtype([('col1','i8'), ('col2', 'i8'), ('col3', 'i8'), ('col4', 'f4')])
total_time = 0
# Imitate loop of reading in multiple files
for i in range(1000):
    array = np.frombuffer(byte_array, numpy_dtypes)
    start2 = time.time()
    if i == 0:
        numpy_array = array
    else:
        numpy_array = np.append(numpy_array, array)
    total_time += (time.time() - start2)
print(f'took {total_time} to append numpy arrays together')
print(f'took {time.time()-start:.2f} seconds in total')
- took 12.19652795791626 to append numpy arrays together
- took 12.21 seconds in total
2nd attempt
I tried concatenating all the bytes into a single buffer first, then reading it into a NumPy array in one go.
import numpy as np
import time
start = time.time()
byte_array = b''
bytes1 = b'\x00\xe8n\x14Z\x1d\xd8\x08\xff\xff\xff\xff\xff\xff\xff\xff\x00\xdd\x90\xa7\x16/\xd8\x08ff\xe0A'
# Create the byte array identical to what would be read in from each file
for i in range(1000):
    byte_array += bytes1
numpy_dtypes = np.dtype([('col1','i8'), ('col2', 'i8'), ('col3', 'i8'), ('col4', 'f4')])
# Imitate loop of reading in multiple files
total_bytes = b''
start2 = time.time()
for i in range(1000):
    total_bytes += byte_array
print(f'took {time.time()-start2:.2f} seconds to append bytes together')
numpy_array = np.frombuffer(total_bytes, numpy_dtypes)
print(f'took {time.time()-start:.2f} seconds')
- took 12.67 seconds to append bytes together
- took 12.67 seconds in total
Why does the majority of the processing time come from appending the data together? Since this appears to be the bottleneck, is there a better approach, either to how the data is appended or to how it is read in the first place? I have also tried struct.unpack, but it is still quite slow, and as far as I am aware NumPy is faster at parsing bytes.
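For reference, the struct-based parsing was along these lines (a minimal sketch using struct.iter_unpack over the same 28-byte record layout, not my exact script):

import struct

bytes1 = b'\x00\xe8n\x14Z\x1d\xd8\x08\xff\xff\xff\xff\xff\xff\xff\xff\x00\xdd\x90\xa7\x16/\xd8\x08ff\xe0A'
byte_array = bytes1 * 1000   # same fake data as in the examples above

# '<qqqf' = little-endian, three int64 ('q') then one float32 ('f'),
# matching the i8/i8/i8/f4 NumPy dtype used above (28 bytes per record)
record = struct.Struct('<qqqf')

# iter_unpack steps through the buffer 28 bytes at a time, yielding one tuple per message
rows = list(record.iter_unpack(byte_array))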