Consider the following code:

arr = []
for (str, id, flag) in some_data:
    arr.append((str, id, flag))

Imagine the input strings being 2 characters long on average and 5 characters at most, with some_data having 1 million elements. What would the memory requirement of such a structure be?

Could it be that a lot of memory is wasted on the strings? If so, how can I avoid that?

3 Answers

In this case, because the strings are quite short and there are so many of them, you stand to save a fair bit of memory by calling intern on the strings (sys.intern in Python 3). Assuming the strings contain only lowercase letters, there are just 26 * 26 = 676 possible two-letter strings, so there must be a lot of repetition in this list; intern ensures that those repetitions don't each become a separate object, but all refer to the same underlying object.

It's possible that Python already interns short strings, but judging by a number of different sources, this is highly implementation-dependent. So calling intern explicitly in this case is probably the way to go; YMMV.
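A minimal sketch of what that could look like, mirroring the loop from the question. It assumes Python 3, where the function lives at sys.intern; build_rows and the sample some_data are hypothetical names made up for illustration.

```python
import sys

def build_rows(some_data):
    """Mirror the loop from the question, interning the string field."""
    arr = []
    for text, ident, flag in some_data:
        # sys.intern returns one canonical object per distinct string value
        arr.append((sys.intern(text), ident, flag))
    return arr

# Hypothetical input: equal strings built at runtime rather than written as literals
some_data = [("".join(["a", "b"]), n, True) for n in range(3)]
arr = build_rows(some_data)

# Every row should now reference the very same string object
print(all(row[0] is arr[0][0] for row in arr))
```

The only change to the original loop is the sys.intern call, so it is cheap to try and measure.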

As an elaboration on why this is very likely to save memory, consider the following:

>>> import sys
>>> sys.getsizeof('')
40
>>> sys.getsizeof('a')
41
>>> sys.getsizeof('ab')
42
>>> sys.getsizeof('abc')
43

Adding single characters to a string adds only a byte to the size of the string itself, but every string takes up 40 bytes on its own.
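To put rough numbers on this for a whole list, here is a sketch that sums sys.getsizeof over the list, its tuples, and each distinct field object. shallow_total and the generated sample data are hypothetical, the exact figures depend on the Python version, and the comparison assumes CPython, where strings built at runtime are normally separate objects.

```python
import sys

def shallow_total(records):
    """Sum sys.getsizeof over the list, its tuples, and each distinct field object."""
    seen = set()                       # ids of objects already counted
    total = sys.getsizeof(records)     # the list itself (its pointer array)
    for rec in records:
        total += sys.getsizeof(rec)    # one tuple per record
        for obj in rec:
            if id(obj) not in seen:    # count shared objects only once
                seen.add(id(obj))
                total += sys.getsizeof(obj)
    return total

# Hypothetical data: two-letter strings built at runtime
letters = "abcdefghijklmnopqrstuvwxyz"
plain = [("".join([letters[i % 26], letters[(i // 26) % 26]]), i, i % 2 == 0)
         for i in range(10_000)]
interned = [(sys.intern(s), ident, flag) for s, ident, flag in plain]

print("separate string objects:", shallow_total(plain))
print("interned strings:       ", shallow_total(interned))
```

The difference between the two totals is roughly the per-string overhead multiplied by the number of duplicate strings that interning removed.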

2 Comments

Now I've learned that Python in general is quite memory-hungry. As you correctly point out, the length of the strings isn't the problem here, but rather the minimum size of objects. I was a bit shocked to also discover that the size of a simple int is 24 bytes (on a 64-bit system). Good to know...
My investigations of Python 3.12 show that, according to sys.getsizeof(), an empty string takes 41 bytes for the object, plus 1 byte for each additional character. However, pympler.asizeof() shows the empty string taking 48 bytes and staying at that size until the string is 8 characters long, at which point it jumps to 56 bytes. The memory requirement then grows by 8 bytes every 8 characters.

In recent Python 3 (64-bit) versions, string instances take up 49+ bytes. But also keep in mind that if you use non-ASCII characters, the memory usage jumps up even more:

>>> import sys
>>> sys.getsizeof('t')
50
>>> sys.getsizeof('я')
76

Notice that if even one character in a string is non-ASCII, the other characters can take up more space too: once any character falls outside the Latin-1 range, every character in the string is stored with 2 or 4 bytes:

>>> sys.getsizeof('t12345')
55  # +5 bytes, compared to 't'
>>> sys.getsizeof('я12345')
86  # +10 bytes, compared to 'я'

This has to do with the internal representation of strings since Python 3.3. See PEP 393 -- Flexible String Representation for more details.
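A small sketch for checking the per-character width yourself, assuming CPython 3.3 or later (where PEP 393 applies). bytes_per_added_char is a hypothetical helper: it measures how much sys.getsizeof grows when one more ASCII character is appended to a string that starts with the given prefix.

```python
import sys

def bytes_per_added_char(prefix, filler="x", n=100):
    """Growth of sys.getsizeof when one more filler character is appended."""
    shorter = prefix + filler * n
    longer = prefix + filler * (n + 1)
    return sys.getsizeof(longer) - sys.getsizeof(shorter)

# The widest character in the string decides the width used for all of them
for prefix in ("", "é", "я", "😀"):
    print(repr(prefix), bytes_per_added_char(prefix))
```

Measuring the difference between two lengths cancels out the fixed header, so only the per-character cost remains.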

Python in general is not very memory-efficient when it comes to having lots of small objects, and that goes beyond strings. See these examples:

>>> sys.getsizeof(1)
28
>>> sys.getsizeof(True)
28
>>> sys.getsizeof([])
56
>>> sys.getsizeof(dict())
232
>>> sys.getsizeof((1,1))
56
>>> sys.getsizeof([1,1])
72

Interning strings could help, but make sure you don't have too many unique values, as that could do more harm than good.

It's hard to tell how to optimize your specific case, as there is no single universal solution. You could save a lot of memory if you somehow serialized the data from multiple items into a single byte buffer, for example, but that could complicate your code or affect performance too much. In many cases it won't be worth it, but if I were in a situation where I really needed to optimize memory usage, I would also consider writing that part in a language like Rust (it's not too hard to create a native Python module via PyO3, for example).
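As one possible shape for the byte-buffer idea, here is a sketch using the standard struct module. The fixed 10-byte layout is an assumption made for illustration: ASCII strings of at most 5 characters with no NUL bytes, ids that fit in an unsigned 32-bit integer, and a boolean flag. pack_records and get_record are hypothetical helper names.

```python
import struct

# Assumed layout: 5-byte NUL-padded string, uint32 id, bool flag (10 bytes, no padding)
RECORD = struct.Struct("<5sI?")

def pack_records(rows):
    """Pack (string, id, flag) rows into one contiguous bytearray."""
    buf = bytearray(RECORD.size * len(rows))
    for i, (text, ident, flag) in enumerate(rows):
        RECORD.pack_into(buf, i * RECORD.size, text.encode("ascii"), ident, flag)
    return buf

def get_record(buf, index):
    """Decode the row stored at the given index."""
    raw, ident, flag = RECORD.unpack_from(buf, index * RECORD.size)
    return raw.rstrip(b"\x00").decode("ascii"), ident, flag

rows = [("ab", 1, True), ("hello", 2, False)]
buf = pack_records(rows)
print(len(buf), get_record(buf, 1))
```

Each row costs exactly RECORD.size bytes with no per-object overhead, at the price of decoding on every access and of a rigid layout.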

If your strings are that short, it is likely there will be a significant number of duplicates. Python's string interning optimises this so that each such string is stored only once and the reference is used multiple times, rather than storing the string multiple times...

These strings should be automatically interned as they are.

1 Comment

String literals are interned, but strings created from other sources are not necessarily interned. You wouldn't want an intern call every time you read something from a file...
