Python: size of strings in memory

Question

Consider the following code:

arr = []
for (s, item_id, flag) in some_data:  # renamed so the built-ins str and id are not shadowed
    arr.append((s, item_id, flag))

Imagine the input strings are 2 characters long on average and 5 characters at most, and some_data has 1 million elements. What will the memory requirement of such a structure be?

Could a lot of memory be wasted on the strings? If so, how can I avoid that?

Jo So · Accepted Answer · 2016-09-12 21:25:03Z

35

In this case, because the strings are quite short and there are so many of them, you stand to save a fair bit of memory by calling intern on the strings (sys.intern in Python 3). Assuming there are only lowercase letters in the strings, there are just 26 * 26 = 676 possible two-character strings, so there must be a lot of repetition in this list; intern will ensure that those repetitions don't result in separate objects, but all refer to the same underlying object.

It's possible that Python already interns short strings, but judging from a number of different sources, this seems to be highly implementation-dependent. So calling intern explicitly is probably the way to go here; YMMV.
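As a minimal sketch of what that could look like for the loop in the question, assuming Python 3 (where the function lives at sys.intern); build_rows and the sample data are made up for illustration:

```python
import sys

def build_rows(some_data):
    """Copy (string, id, flag) rows, interning each string on the way in."""
    rows = []
    for s, item_id, flag in some_data:
        # sys.intern returns the single shared copy of any equal string
        rows.append((sys.intern(s), item_id, flag))
    return rows

# Hypothetical sample: equal strings built at run time, as if read from a file
sample = [("".join(["a", "b"]), 1, True), ("".join(["a", "b"]), 2, False)]
rows = build_rows(sample)

# After interning, both rows reference the same string object
print(rows[0][0] is rows[1][0])
```

The cost is one dictionary lookup per row while loading, which is usually small next to the memory saved when duplicates are common.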

As an elaboration on why this is very likely to save memory, consider the following:

>>> import sys
>>> sys.getsizeof('')
40
>>> sys.getsizeof('a')
41
>>> sys.getsizeof('ab')
42
>>> sys.getsizeof('abc')
43

Adding a single character to a string adds only one byte to the size of the string itself, but every string carries a fixed overhead of 40 bytes on its own.
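To put rough numbers on the question's one million rows, here is a back-of-the-envelope sketch. It assumes two lowercase letters per string (so 676 distinct values), measures the per-object sizes on whatever interpreter runs it, and ignores the id and flag objects themselves, so treat the result as an estimate only:

```python
import struct
import sys

N_ROWS = 1_000_000   # from the question
N_UNIQUE = 26 * 26   # assumption: every string is two lowercase letters

per_string = sys.getsizeof("ab")            # one short string object
per_tuple = sys.getsizeof(("ab", 1, True))  # the tuple shell only, not its items
per_slot = struct.calcsize("P")             # one pointer slot in the outer list

containers = N_ROWS * (per_tuple + per_slot)
strings_unshared = N_ROWS * per_string      # every row owns its own string
strings_interned = N_UNIQUE * per_string    # rows share 676 string objects

print("tuples and list slots: %d bytes" % containers)
print("strings, not shared:   %d bytes" % strings_unshared)
print("strings, interned:     %d bytes" % strings_interned)
```

Whatever the exact per-object sizes turn out to be on your build, the string total shrinks by a factor of N_ROWS / N_UNIQUE when every duplicate is shared.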

edited Sep 12, 2016 at 21:25

Jo So

answered Feb 25, 2012 at 15:18

senderle

2 Comments

didi_X8 Over a year ago

Now I've learnt that Python in general is quite memory-hungry. As you correctly point out, the length of the strings isn't the problem here, but the minimum size of objects. I was a bit shocked to also discover that the size of a simple int is 24 bytes (on a 64-bit system). Good to know...

Andrew H Over a year ago

My investigations of Python 3.12 show that, according to sys.getsizeof(), an empty string takes 41 bytes for the object plus 1 byte for each additional character. However, pympler.asizeof() shows the empty string taking 48 bytes and staying at that size until the string reaches 8 characters, at which point it jumps to 56 bytes. The memory requirement then jumps by 8 bytes every 8 characters.

at54321 · Accepted Answer · 2022-01-04 08:09:45Z

In recent Python 3 (64-bit) versions, string instances take up 49+ bytes. But also keep in mind that if you use non-ASCII characters, the memory usage jumps up even more:

>>> import sys
>>> sys.getsizeof('t')
50
>>> sys.getsizeof('я')
76

Notice that if even one character in a string falls outside Latin-1, all the other characters take up more space too (2 or 4 bytes each, depending on the widest character in the string):

>>> sys.getsizeof('t12345')
55  # +5 bytes, compared to 't'
>>> sys.getsizeof('я12345')
86  # +10 bytes, compared to 'я'

This has to do with the internal representation of strings since Python 3.3. See PEP 393 -- Flexible String Representation for more details.
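A small sketch that makes the per-character width visible. It relies on CPython's PEP 393 layout, so other implementations may report different numbers; bytes_per_char is a helper name invented here:

```python
import sys

def bytes_per_char(ch):
    """Bytes that each extra copy of ch adds to a string made only of ch."""
    # Compare two freshly built strings so only the character data differs
    return sys.getsizeof(ch * 3) - sys.getsizeof(ch * 2)

samples = (
    "t",           # ASCII
    "\u00e9",      # e with acute accent: non-ASCII but still in Latin-1
    "\u044f",      # Cyrillic ya, as in the example above
    "\U0001F600",  # an emoji outside the Basic Multilingual Plane
)
for ch in samples:
    print(repr(ch), bytes_per_char(ch))
```

On CPython 3.3 and later this should report 1, 1, 2 and 4 bytes per character respectively. Latin-1 characters still cost one byte each; they only pay a larger fixed header than pure ASCII strings.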

Python in general is not very memory-efficient when it comes to lots of small objects, and that applies to more than just strings. See these examples:

>>> sys.getsizeof(1)
28
>>> sys.getsizeof(True)
28
>>> sys.getsizeof([])
56
>>> sys.getsizeof(dict())
232
>>> sys.getsizeof((1,1))
56
>>> sys.getsizeof([1,1])
72

Interning strings could help, but make sure you don't have too many unique values, as that could do more harm than good.
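One rough way to check this up front is to measure how many of the strings are duplicates, for example on a sample of the data. The helper below is hypothetical, and what counts as "enough" duplication is left to your judgement:

```python
def duplicate_ratio(strings):
    """Fraction of the strings that repeat a value seen elsewhere in the input."""
    strings = list(strings)
    if not strings:
        return 0.0
    return 1 - len(set(strings)) / len(strings)

# Four strings, two distinct values: half of them are redundant copies
print(duplicate_ratio(["ab", "ab", "cd", "ab"]))
```

A ratio near 1 suggests interning should pay off; a ratio near 0 means almost every string is unique and interning would mostly add bookkeeping.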

It's hard to tell how to optimize your specific case, as there is no single universal solution. You could save a lot of memory if you somehow serialized the data from multiple items into a single byte buffer, for example, but that could complicate your code or hurt performance too much. In many cases it won't be worth it, but if I were in a situation where I really needed to optimize memory usage, I would also consider writing that part in a language like Rust (it's not too hard to create a native Python module via PyO3, for example).
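To illustrate the byte-buffer idea in pure Python, here is one possible sketch. It assumes the strings are ASCII, at most 5 characters long (as in the question), and contain no NUL characters, and that the ids fit in a signed 64-bit integer; PackedRows is a made-up name, not an existing library class:

```python
from array import array

WIDTH = 5  # maximum string length, taken from the question

class PackedRows:
    """Stores (string, id, flag) rows in three flat buffers instead of tuples."""

    def __init__(self):
        self._strings = bytearray()  # WIDTH bytes per row, NUL-padded
        self._ids = array("q")       # signed 64-bit integers
        self._flags = bytearray()    # one byte per row

    def append(self, s, item_id, flag):
        raw = s.encode("ascii")      # raises UnicodeEncodeError on non-ASCII input
        if len(raw) > WIDTH:
            raise ValueError("string longer than %d bytes: %r" % (WIDTH, s))
        self._strings += raw.ljust(WIDTH, b"\0")
        self._ids.append(item_id)
        self._flags.append(1 if flag else 0)

    def __len__(self):
        return len(self._ids)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * WIDTH
        raw = bytes(self._strings[start:start + WIDTH]).rstrip(b"\0")
        return raw.decode("ascii"), self._ids[i], bool(self._flags[i])

rows = PackedRows()
rows.append("ab", 7, True)
rows.append("hello", 8, False)
print(rows[0], rows[1], len(rows))
```

Each row then costs a fixed 14 bytes of payload (5 + 8 + 1) plus buffer over-allocation, at the price of rebuilding a tuple and a string object on every read.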

Karl Barker · Accepted Answer · 2012-02-25 15:21:33Z

1

If your strings are so short, it is likely there will be a significant number of duplicates. Python's interning will optimise this so that these strings are stored only once and the reference is used multiple times, rather than storing the string multiple times...

These strings should be automatically interned as they are.

answered Feb 25, 2012 at 15:21

Karl Barker

1 Comment

user395760 Over a year ago

String literals are interned, but strings created from other sources are not necessarily interned. You wouldn't want an intern call every time you read something from a file...
