
Are numpy structured arrays an alternative to Python dict?

I would like to save memory, and I cannot afford much of a performance decline.

In my case, the keys are str and the values are int.

Can you give a quick conversion line in case they actually are an alternative?

I'm also open to suggestions for a different alternative.

I need to save memory because some dictionaries get larger than 50 GB in memory, and I need to have several open at once with 'only' 192 GB of RAM available.

  • What is your use case? How big is your dict? How much memory do you need to save, and why? What performance goals are you trying to hit? Do you need key-based lookup? This question is too vague to answer as is. Commented Sep 24, 2019 at 9:55
  • If you are using your dictionary like an array, then maybe, but if you are relying on fast look-ups by key, then probably no. Commented Sep 24, 2019 at 10:42
  • The structured array that you propose is (in effect) just two arrays of the same size, one with string elements and the other with numeric. Commented Sep 24, 2019 at 16:22
  • Because I need O(n) time to search in the string array and get the index to address the integer array? Commented Sep 24, 2019 at 16:31
  • Because here (stackoverflow.com/a/35485100/1367655) I asked what is the time complexity and someone said it's O(1)... Commented Sep 24, 2019 at 16:32

3 Answers

4

Maybe a bit late, but in case others have the same question, I did a simple benchmark:

In [1]: import random

In [2]: import string

In [3]: import pandas as pd

In [4]: import sys

In [5]: size = 10**6

In [6]: d = {''.join(random.choices(string.ascii_letters + string.digits, k=32)): random.randrange(size) for _ in range(size)}

In [7]: s = pd.Series(d)

In [8]: a = s.values.view(list(zip(s.index, ['i8'] * size)))

In [9]: key = s.index[random.randrange(size)]

In [10]: %timeit d[key]
61.5 ns ± 1.46 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [11]: %timeit s[key]
3.54 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit a[key]
154 ns ± 1.63 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [13]: sys.getsizeof(d)
Out[13]: 41943136

In [14]: sys.getsizeof(s)
Out[14]: 130816632

It seems that the structured-array lookup (a[key]) is about 2-3 times slower than dict (which is acceptable), while pd.Series is much slower than both. As for space efficiency, dict also outperforms pd.Series. I didn't find a way to get the memory usage of the numpy structured array (I tried sys.getsizeof and ndarray.nbytes, but both seem to miss the size of the fields).
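One rough workaround (a sketch, and only an approximation, assuming the a and s objects from the session above): add the array's data buffer to the Python-level strings that hold the field names inside the dtype, and use Series.memory_usage(deep=True) for the pandas side.

import sys

# Approximate size of the structured array: the data buffer plus the
# Python strings that store the field names in the dtype.
field_names_size = sum(sys.getsizeof(name) for name in a.dtype.names)
print(a.nbytes + field_names_size)

# For the pandas Series, deep=True also counts the string index.
print(s.memory_usage(deep=True))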


1 Comment

You could use a very large mapping and inspect in htop or the like to see how much memory is allocated at initialization...
1

NumPy arrays use a contiguous block of memory and can store only one type of element, such as int, float, string, or another object type, with each item occupying a fixed number of bytes.

NumPy also provides a set of functions for operations such as traversing an array, arithmetic, and some string operations on the stored items, all implemented in C. Because these operations avoid Python's per-item overhead, they are usually more efficient in terms of both memory and processing power.
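For instance (a small illustrative sketch, not from the original answer):

import numpy as np

arr = np.arange(10**6, dtype='i8')   # one million int64 values
print(arr.itemsize, arr.nbytes)      # 8 bytes per element, one contiguous buffer
print((arr * 2).sum())               # arithmetic loops run in compiled C, not Python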

Since you need key-value pairs, you can also store them in a numpy array, similar to a C struct, but it won't have dict-like features such as looking up an item, checking whether a key exists, or filtering; you have to implement those yourself on top of the array functionality.
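A minimal sketch of that approach (hypothetical data, assuming fixed-width string keys and int64 values): the lookup has to be written by hand, either as a linear scan or, if the array is kept sorted by key, with np.searchsorted.

import numpy as np

# Hypothetical key/value pairs stored like a C struct array.
pairs = np.array([('alpha', 1), ('beta', 2), ('gamma', 3)],
                 dtype=[('key', 'U16'), ('value', 'i8')])

# "Does the key exist?" and "get the value" are manual, e.g. a linear scan:
print(pairs['value'][pairs['key'] == 'beta'])        # -> [2]

# If the array is kept sorted by key, np.searchsorted gives O(log n) lookup:
pairs.sort(order='key')
i = np.searchsorted(pairs['key'], 'beta')
if i < len(pairs) and pairs['key'][i] == 'beta':
    print(pairs['value'][i])                          # -> 2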

A better option for you may be a pandas Series, which also uses a numpy array to store its data but provides a lot of functionality on top of it.
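For example (a sketch with made-up data, matching the str-to-int mapping in the question):

import pandas as pd

d = {'apple': 3, 'banana': 7, 'cherry': 1}   # str -> int, as in the question
s = pd.Series(d)                              # the one-line conversion

print(s['banana'])                            # dict-like lookup by key
print('banana' in s.index)                    # key-existence check
print(s.memory_usage(deep=True))              # memory use, including the string index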

2 Comments

Yeah, a pandas.DataFrame is probably the child of dict and ndarray that the OP is looking for.
A clarification on numpy arrays. Numpy has (at least) 3 types of arrays: ndarrays, record arrays, and structured arrays. It's true, ndarrays can only store one object type. However, both record arrays and structured arrays can store objects of different types (as defined with the dtype). You can index objects using the field names, similar to a dictionary. They are worth investigating if you have to handle a large number of data types.
0

You can check an object's in-memory size using sys:

import sys
import numpy as np

x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
    dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])

print(sys.getsizeof(x))

y = {
    1 : {
        'name': 'Rex',
        'age' : 9,
        'weight' : 81.0
    },
    2 : {
        'name': 'Fido',
        'age' : 3,
        'weight' : 27.0
    }
}

print(sys.getsizeof(y))
192
240

These are the results. I believe numpy structured arrays are relatively memory-efficient.

1 Comment

This drastically underestimates the total size of that dict.
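To illustrate the point in that comment, a rough recursive size estimate (just a sketch; it follows dicts, lists, tuples and sets and nothing else, and assumes the y dict from the answer above) gives a considerably larger number for y than a plain sys.getsizeof call does:

import sys

def deep_getsizeof(obj, seen=None):
    """Rough recursive estimate that also counts nested containers and their contents."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_getsizeof(i, seen) for i in obj)
    return size

print(deep_getsizeof(y))   # considerably larger than sys.getsizeof(y)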
