I want to upload a numpy array to S3 using the boto3 package which expects a bytes object. I want to convert this numpy array to bytes but without any copying due to memory constraints. Things I tried that do not work because they create copies:

It seems numpy used to provide numpy.ndarray.getbuffer but it was deprecated in later releases.

Is there a way to create a bytes view without copying?

  • I think stackoverflow.com/questions/25837641/… will help Commented Jan 18, 2021 at 12:57
  • Unfortunately, tostring is an alias to tobytes and also creates a copy of the original buffer. Commented Jan 18, 2021 at 17:19
  • I wonder how you check whether it gets copied? I also meant the other option where you save the array first and then read it back in bytes. Or the pickle answer Commented Jan 18, 2021 at 20:27
  • You can check by setting arr[0]=something else then check if the buffer changes too. I don't want to involve disk I/O either :) Commented Jan 19, 2021 at 4:13

1 Answer

You can take advantage of the ctypes module to create a pointer to the data array, cast into byte form.

import ctypes

import numpy as np

# generate the test array 
size = 0x10
dtype = np.short
arr = np.arange(size, dtype=dtype)
bsize = arr.itemsize # size of a single element in bytes (2 for np.short)

# create a pointer to the block of memory that the array lives in, cast to char type. Note that (size*bsize) _must_ be in parentheses for the code to run correctly.
memory_block = (ctypes.c_char*(size*bsize)).from_address(arr.ctypes.data)
print(memory_block.raw)
# b'\x00\x00\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\x06\x00\x07\x00\x08\x00\t\x00\n\x00\x0b\x00\x0c\x00\r\x00\x0e\x00\x0f\x00'

# mutate the array and check the contents at the pointer
arr[0] = 255
print(memory_block.raw)
# b'\xff\x00\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\x06\x00\x07\x00\x08\x00\t\x00\n\x00\x0b\x00\x0c\x00\r\x00\x0e\x00\x0f\x00'

This, at a minimum, seems to fulfill the test you put forward in the comments to the question. (i.e. if I mutate the array does my view on it change?).
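For comparison, here is a minimal sketch of the same mutation test using a plain memoryview instead of ctypes. This is a different technique from the pointer cast above, not a replacement for it, and it assumes the array is contiguous so that it can be reinterpreted as np.uint8 without copying.

```python
import numpy as np

arr = np.arange(16, dtype=np.int16)

# Reinterpret the same memory as unsigned bytes; .view() does not copy
# (it does require the array to be contiguous).
byte_view = memoryview(arr.view(np.uint8))

# The view covers the whole array and shares its memory.
shares = np.shares_memory(arr, np.asarray(byte_view))
print(byte_view.nbytes, shares)

# Mutate the array; the view reflects the change. 0x0101 is used so the
# result does not depend on byte order.
arr[0] = 0x0101
first_two = bytes(byte_view[:2])
print(first_two)
# b'\x01\x01'
```

Note that bytes(...) on a slice still copies that slice; only the memoryview itself is copy-free.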

There are a couple of things to note here, though. One, memory_block.raw builds a new Python bytes object each time it is accessed, and bytes objects are immutable, so assigning it to a variable leaves you with a copy that no longer tracks the array.

y = memory_block.raw
print(y[:2])
# b'\xff\x00'
arr[0] = 127
print(y[:2])
# b'\xff\x00'

Two, boto3 seems to want a file-like object, at least according to the source code for version 1.28.1. Calling bio = BytesIO(memory_block.raw) incurs a copy, which means that we're back to square one for uploading.
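A small sketch that makes the BytesIO copy visible, reusing the same ctypes pointer approach. The value 0x0101 is an arbitrary choice so that the bytes look the same regardless of byte order.

```python
import ctypes
from io import BytesIO

import numpy as np

arr = np.arange(16, dtype=np.int16)
memory_block = (ctypes.c_char * arr.nbytes).from_address(arr.ctypes.data)

# .raw builds a bytes object (a copy); BytesIO then holds its own data
bio = BytesIO(memory_block.raw)

# mutate the array after the BytesIO has been created
arr[0] = 0x0101

live = memory_block[:2]      # slicing the ctypes array reads current memory
stale = bio.getvalue()[:2]   # the BytesIO still holds the old bytes
print(live, stale)
# b'\x01\x01' b'\x00\x00'
```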

An uploader class

The ArrayUploader class below implements a few basic IO methods (read, seek, tell). When read is called, the data is still copied out of the underlying memory block, so headroom remains a limiting factor. However, if a read size is passed, only that much data should be copied from the memory block at a time. How boto3 sizes its reads from the IO object, I couldn't tell you.

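import ctypes
import os
from io import IOBase

import numpy as np

class ArrayUploader(IOBase):
    # set this up as a child of IOBase because boto3 wants an object
    # with a read method.
    def __init__(self, array):
        # the ctypes view below assumes one contiguous block of memory
        if not array.flags['C_CONTIGUOUS']:
            raise ValueError('array must be C-contiguous')
        # keep a reference so the memory is not freed while we point at it
        self._array = array
        self.nbytes = array.nbytes
        self.bufferview = (ctypes.c_char*(self.nbytes)).from_address(array.ctypes.data)
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, pos, whence=os.SEEK_SET):
        if whence == os.SEEK_CUR:
            pos += self._pos
        elif whence == os.SEEK_END:
            pos += self.nbytes
        # clamp so the position always stays inside the buffer
        self._pos = max(0, min(pos, self.nbytes))
        return self._pos

    def read(self, size=-1):
        # slice the ctypes array directly; going through .raw would copy
        # the whole buffer on every call
        if size is None or size < 0:
            size = self.nbytes - self._pos
        old = self._pos
        self._pos = min(old + size, self.nbytes)
        return self.bufferview[old:self._pos]
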
    

# generate the test array 
size = 0x10
dtype = np.short
arr = np.arange(size, dtype=dtype)

# initialize our uploader object
arrayuploader = ArrayUploader(arr)

# read some data out
print(x:=arrayuploader.read(8))
# b'\x00\x00\x01\x00\x02\x00\x03\x00'

# mutate the array, reread the same data
arr[0] = 127
arrayuploader.seek(0)
print(y:=arrayuploader.read(8))
# b'\x7f\x00\x01\x00\x02\x00\x03\x00'

# has x changed with the original array?
print(x == y)
# False
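
Since I can't say how boto3 sizes its reads, here is a rough sketch of what a chunked consumer looks like from the file object's side. MemoryViewReader and drain are hypothetical names made up for this illustration; drain only stands in for an uploader and does not call boto3. The reader is a variant of the class above that subclasses io.RawIOBase and implements readinto, so each read copies at most the requested number of bytes. It assumes a contiguous array and leaves out seek and tell.

```python
import io

import numpy as np


class MemoryViewReader(io.RawIOBase):
    """Hypothetical read-only file object over an array's bytes."""

    def __init__(self, array):
        # flatten and reinterpret as bytes; both are views for a
        # contiguous array, so no data is copied here
        self._view = memoryview(array.reshape(-1).view(np.uint8))
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # copy at most len(buffer) bytes into the caller's buffer
        chunk = self._view[self._pos:self._pos + len(buffer)]
        n = len(chunk)
        buffer[:n] = chunk
        self._pos += n
        return n


def drain(fileobj, chunk_size):
    """Stand-in for an uploader: read until EOF, track the largest chunk."""
    total, largest = 0, 0
    while chunk := fileobj.read(chunk_size):
        total += len(chunk)
        largest = max(largest, len(chunk))
    return total, largest


arr = np.arange(1000, dtype=np.int16)
total, largest = drain(MemoryViewReader(arr), chunk_size=256)
print(total, largest)
# 2000 256
```

With boto3, an object like this would presumably be handed to upload_fileobj together with a bucket name and key. That call is not exercised here, and whether it also needs seek and tell is something I have not verified.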