
Are numpy structured arrays an alternative to Python dict?

I would like to save memory, and I cannot afford much of a performance decline.

In my case, the keys are str and the values are int.

Can you give a quick conversion line in case they actually are an alternative?

I'm also open to suggestions for a different alternative.

I need to save memory because some dictionaries get larger than 50 GB in memory, and I need to have several open at once with 'only' 192 GB of RAM available.

  • What is your use case? How big is your dict? How much memory do you need to save, and why? What performance goals are you trying to hit? Do you need key-based lookup? This question is too vague to answer as is. Commented Sep 24, 2019 at 9:55
  • If you are using your dictionary like an array, then maybe, but if you are relying on fast look-ups by key, then probably no. Commented Sep 24, 2019 at 10:42
  • The structured array that you propose is (in effect) just two arrays of the same size, one with string elements and the other with numeric. Commented Sep 24, 2019 at 16:22
  • Because I need O(n) time to search in the string array and get the index to address the integer array? Commented Sep 24, 2019 at 16:31
  • Because here (stackoverflow.com/a/35485100/1367655) I asked what is the time complexity and someone said it's O(1)... Commented Sep 24, 2019 at 16:32

3 Answers

4

Maybe a bit late, but in case others have the same question, I did a simple benchmark:

In [1]: import random

In [2]: import string

In [3]: import pandas as pd

In [4]: import sys

In [5]: size = 10**6

In [6]: d = {''.join(random.choices(string.ascii_letters + string.digits, k=32)): random.randrange(size) for _ in range(size)}

In [7]: s = pd.Series(d)

In [8]: a = s.values.view(list(zip(s.index, ['i8'] * size)))

In [9]: key = s.index[random.randrange(size)]

In [10]: %timeit d[key]
61.5 ns ± 1.46 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [11]: %timeit s[key]
3.54 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit a[key]
154 ns ± 1.63 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [13]: sys.getsizeof(d)
Out[13]: 41943136

In [14]: sys.getsizeof(s)
Out[14]: 130816632

It seems that the structured-array lookup (a[key]) is about 2-3 times slower than dict (which is acceptable), while pd.Series is much slower than both. As for space efficiency, dict also outperforms pd.Series. I didn't find a way to get the memory usage of the numpy structured array (I tried sys.getsizeof and ndarray.nbytes, but both seem to miss the size of the fields).
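One rough workaround (a sketch, and only an approximation, assuming the a and s objects from the session above): add the array's data buffer to the Python-level strings that hold the field names inside the dtype, and use Series.memory_usage(deep=True) for the pandas side.

import sys

# Approximate size of the structured array: the data buffer plus the
# Python strings that store the field names in the dtype.
field_names_size = sum(sys.getsizeof(name) for name in a.dtype.names)
print(a.nbytes + field_names_size)

# For the pandas Series, deep=True also counts the string index.
print(s.memory_usage(deep=True))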


1 Comment

You could use a very large mapping and inspect in htop or the like to see how much memory is allocated at initialization...
1

NumPy arrays use a contiguous block of memory and can store only one type of element, such as int, float, string, or another object type, with each item occupying a fixed number of bytes.

NumPy also provides a set of functions for operations such as traversing an array, arithmetic, and some string operations on the stored items, all implemented in C. Because these operations avoid Python's per-item overhead, they are usually more efficient in terms of both memory and processing power.
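For instance (a small illustrative sketch, not from the original answer):

import numpy as np

arr = np.arange(10**6, dtype='i8')   # one million int64 values
print(arr.itemsize, arr.nbytes)      # 8 bytes per element, one contiguous buffer
print((arr * 2).sum())               # arithmetic loops run in compiled C, not Python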

Since you need key-value pairs, you can also store them in a numpy array, similar to a C struct, but it won't have dict-like features such as looking up an item, checking whether a key exists, or filtering; you have to implement those yourself on top of the array functionality.
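A minimal sketch of that approach (hypothetical data, assuming fixed-width string keys and int64 values): the lookup has to be written by hand, either as a linear scan or, if the array is kept sorted by key, with np.searchsorted.

import numpy as np

# Hypothetical key/value pairs stored like a C struct array.
pairs = np.array([('alpha', 1), ('beta', 2), ('gamma', 3)],
                 dtype=[('key', 'U16'), ('value', 'i8')])

# "Does the key exist?" and "get the value" are manual, e.g. a linear scan:
print(pairs['value'][pairs['key'] == 'beta'])        # -> [2]

# If the array is kept sorted by key, np.searchsorted gives O(log n) lookup:
pairs.sort(order='key')
i = np.searchsorted(pairs['key'], 'beta')
if i < len(pairs) and pairs['key'][i] == 'beta':
    print(pairs['value'][i])                          # -> 2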

A better option for you may be a pandas Series, which also uses a numpy array to store its data but provides a lot of functionality on top of it.
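For example (a sketch with made-up data, matching the str-to-int mapping in the question):

import pandas as pd

d = {'apple': 3, 'banana': 7, 'cherry': 1}   # str -> int, as in the question
s = pd.Series(d)                              # the one-line conversion

print(s['banana'])                            # dict-like lookup by key
print('banana' in s.index)                    # key-existence check
print(s.memory_usage(deep=True))              # memory use, including the string index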

2 Comments

Yeah, a pandas.DataFrame is probably the child of dict and ndarray that the OP is looking for.
A clarification on numpy arrays. Numpy has (at least) 3 types of arrays: ndarrays, record arrays, and structured arrays. It's true, ndarrays can only store one object type. However, both record arrays and structured arrays can store objects of different types (as defined with the dtype). You can index objects using the field names, similar to a dictionary. They are worth investigating if you have to handle a large number of data types.
0

You can check an object's in-memory size using sys:

import sys
import numpy as np

x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
    dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])

print(sys.getsizeof(x))

y = {
    1 : {
        'name': 'Rex',
        'age' : 9,
        'weight' : 81.0
    },
    2 : {
        'name': 'Fido',
        'age' : 3,
        'weight' : 27.0
    }
}

print(sys.getsizeof(y))
192
240

These are the results. I believe numpy structured arrays are relatively memory-efficient.

1 Comment

This drastically underestimates the total size of that dict.
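To illustrate the point in that comment, a rough recursive size estimate (just a sketch; it follows dicts, lists, tuples and sets and nothing else, and assumes the y dict from the answer above) gives a considerably larger number for y than a plain sys.getsizeof call does:

import sys

def deep_getsizeof(obj, seen=None):
    """Rough recursive estimate that also counts nested containers and their contents."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_getsizeof(i, seen) for i in obj)
    return size

print(deep_getsizeof(y))   # considerably larger than sys.getsizeof(y)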
