I have a large binary file that I would like to read in and unpack using struct.unpack(). The file consists of a number of lines, each 2957 bytes long. I read in the file using the following code:

with open("bin_file", "rb") as f:
    line = f.read(2957)

My question is: why is the size returned by

import sys
sys.getsizeof(line)

not equal to 2957 (in my case it is 2978)?

3
  • What are you using sys.getsizeof for? Commented Oct 31, 2014 at 17:52
  • @hobbs I am not using it for anything in particular, I just noticed the discrepancy and was wondering why that is the case Commented Oct 31, 2014 at 17:58
  • All this has nothing to do with file I/O; you'd get the same result with line = ' ' * 2957. Commented Oct 31, 2014 at 18:13

2 Answers

7

You misunderstand what sys.getsizeof() does. It returns the amount of memory Python uses for the string object, not the length of the line.

Python string objects track reference counts, the object type and other metadata together with the actual characters, so 2978 bytes is not the same thing as the string length.
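To see the two numbers side by side, here is a minimal sketch. It assumes CPython 3, where data read from a file opened in "rb" mode is a bytes object rather than the Python 2 str discussed here, and it uses a made-up 2957-byte payload in place of the real file. The exact footprint depends on your build, so no particular value is shown.

```python
import sys

# Hypothetical stand-in for one 2957-byte line read from the binary file.
line = b"\x00" * 2957

payload = len(line)              # number of data bytes in the object
footprint = sys.getsizeof(line)  # memory used by the whole object, in bytes
overhead = footprint - payload   # object header plus the trailing null byte

print(payload, footprint, overhead)
```

If you want to know how much data you read, len(line) is the number to use; sys.getsizeof() only tells you about the interpreter's bookkeeping.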

See the stringobject.h definition of the type:

typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;

Here PyObject_VAR_HEAD is a macro defined in object.h; it supplies the standard ob_refcnt, ob_type and ob_size fields.

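So the character data of a string of length 2957 takes 2958 bytes (string length plus the terminating null byte), and the remaining 20 bytes you see hold the reference count, the type pointer, the object 'size' (the string length here), the cached string hash and the interned state flag. Twenty bytes is consistent with five 4-byte fields, as you would get on a 32-bit build.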
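The breakdown above can be checked with a short sketch. It assumes CPython 3 and bytes objects, whose header differs from the Python 2 struct shown earlier, so the overhead printed will not necessarily be 21; the point is only that it stays the same whatever the length. The overhead() helper is a name introduced for illustration, and the final arithmetic assumes five 4-byte header fields, as on a 32-bit build.

```python
import sys

def overhead(n):
    """Bytes that sys.getsizeof reports beyond the n data bytes."""
    data = b"x" * n
    return sys.getsizeof(data) - len(data)

# The fixed cost is independent of how long the data is.
overheads = [overhead(n) for n in (0, 1, 100, 2957)]
print(overheads)

# Arithmetic from the answer, assuming five 4-byte header fields.
data_bytes = 2957
null_terminator = 1
header_bytes = 5 * 4
total = data_bytes + null_terminator + header_bytes
print(total)  # 2978
```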

Other object types will have different memory footprints, and the exact sizes of the C types used differ from platform to platform as well.
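As a rough illustration of both points, the following sketch prints the native sizes of a few C types on the current platform and the footprint of some empty built-in objects. It assumes CPython; the dictionary names are only for illustration and the numbers will vary between builds, which is why none are listed here.

```python
import struct
import sys

# Native sizes of some C types on this platform, in bytes.
c_sizes = {
    "int": struct.calcsize("i"),
    "long": struct.calcsize("l"),
    "pointer": struct.calcsize("P"),
}

# Memory footprint of a few empty objects; each type has its own header layout.
footprints = {
    "bytes": sys.getsizeof(b""),
    "tuple": sys.getsizeof(()),
    "list": sys.getsizeof([]),
    "dict": sys.getsizeof({}),
}

print(c_sizes)
print(footprints)
```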


3

A string object representing 2957 bytes of data takes more than 2957 bytes of memory to represent, due to overhead such as the type pointer and the reference count. sys.getsizeof includes this additional overhead.
