
I'm trying to read in a PDF source file, and append each individual byte to an array of 8-bit integers. This is the slowest function in my program, and I was wondering if there was a more efficient method to do so, or if my process is completely wrong and I should go about it another way. Thanks in advance!

pagesize = 4096
arr = []
doc = ""
with open(filename, 'rb') as f:
    while doc != b'':
        doc = f.read(pagesize)
        for b in doc:
            arr.append(b)
  • On its face, this looks like a perfect use case for memory-mapped I/O (meaning you may not even need to read the file at all until something in your code actually wants to index into it). Are you using the resulting list in a manner incompatible with that? Commented Jun 15, 2018 at 17:53
  • Why read a pdf byte wise? Commented Jun 15, 2018 at 18:02
  • @PatrickArtner for this particular project, I need to convert every single byte into an 8-bit int (the data, the weird <A>88 values, etc.) Commented Jun 15, 2018 at 18:09

1 Answer


A bytes object is already a sequence of 8-bit integers:

>>> b = b'abc'
>>> for byte in b: print(byte)
97
98
99

If you want to convert it to a different kind of sequence, like a list, just call the constructor:

>>> lst = list(b)
>>> lst
[97, 98, 99]
>>> import array
>>> arr = array.array('b', b)
>>> arr
array('b', [97, 98, 99])

Or, if you need to do it a chunk at a time for some reason, just pass the whole chunk to extend:

>>> arr = list(b'abc')
>>> arr.extend(b'def')
>>> arr
[97, 98, 99, 100, 101, 102]

However, the most efficient thing to do is just leave it in a bytes:

with open(filename, 'rb') as f:
    arr = f.read()

… or, if you need it to be mutable, use bytearray:[1]

pagesize = 4096
arr = bytearray()
with open(filename, 'rb') as f:
    while True:
        chunk = f.read(pagesize)
        if not chunk:
            break
        arr.extend(chunk)

… or, if there's any chance you could benefit from speeding up elementwise operations over the whole array, use NumPy:

import numpy as np

with open(filename, 'rb') as f:
    arr = np.fromfile(f, dtype=np.uint8)

Or, don't even read the file in the first place and instead mmap it, then use the mmap as your sequence of integers.[2]

import mmap

with open(filename, 'rb') as f:
    arr = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

For comparison, under the covers (at least in CPython):

  • A bytes (or bytearray, or array.array('b'), or np.array(dtype=np.uint8), etc.) is stored as an array of 8-bit integers. So, 1M bytes takes 1MB.
    • A bytearray will have a bit of extra slack at the end, increasing the size by about 6%. So, 1M bytes takes 1.06MB.
  • A general-purpose sequence like a tuple or list is stored as an array of pointers to objects wrapping the 8-bit integers. The objects don't matter (there's only going to be one copy for each of the 256 values, no matter how many references there are to each), but the pointers are 8 bytes (4 bytes in 32-bit builds). So, 1M bytes takes 8MB.
    • A list has the same extra slack as bytearray, so it's 8.48MB.
  • A mmap is like a bytes or array.array('b') as far as virtual memory goes, but any pages that you haven't read or written may not be mapped into physical memory at all. So, 1M bytes takes at most 1MB, but could take as little as 8KB.
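If you want to see those numbers for yourself, here's a quick sketch using sys.getsizeof (the exact figures vary by CPython version and platform, and getsizeof only measures the container itself, not the int objects a list points to):

import sys

data = bytes(10**6)           # 1M bytes
print(sys.getsizeof(data))    # roughly 1MB plus a small fixed header

ba = bytearray(data)
print(sys.getsizeof(ba))      # about the same here; the slack shows up when it grows via extend

lst = list(data)
print(sys.getsizeof(lst))     # roughly 8MB of pointers on a 64-bit build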

[1] You can speed this up. If you pre-extend the bytearray 4K at a time (or, even better, pre-allocate the whole thing, if you know the length of the file), you can readinto a memoryview over a slice of the bytearray, as in the sketch below. But this is more complicated, and probably not worth it; if you need this, you should probably have been using either numpy or an mmap.
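For example, a rough sketch of the pre-allocating variant (the 64KB chunk size is an arbitrary choice, not something from the original code):

import os

size = os.path.getsize(filename)
arr = bytearray(size)              # pre-allocate the whole buffer up front
view = memoryview(arr)
with open(filename, 'rb') as f:
    pos = 0
    while pos < size:
        # readinto fills that slice of arr directly, with no temporary bytes objects
        n = f.readinto(view[pos:pos + 65536])
        if not n:
            break
        pos += n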

[2] This does mean that you have to move all your arr-using code inside the with, or otherwise keep the file open as long as you need the data, because the file itself is the storage for your "array"; you haven't copied the bytes into different storage in memory.
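For instance, a sketch of what that looks like in practice (the indexing and slicing shown are just illustrative):

import mmap

with open(filename, 'rb') as f:
    arr = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = arr[0]       # indexing yields an int, e.g. 37 for the '%' in '%PDF'
    header = arr[:8]     # slicing yields bytes, e.g. b'%PDF-1.7'
    arr.close()          # close the mapping before the file is closed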
