How efficient is Python substring extraction?

Question

I've got the entire contents of a text file (at least a few KB) in string myStr.

Will the following code create a copy of the string (less the first character) in memory?

myStr = myStr[1:]

I'm hoping it just refers to a different location in the same internal buffer. If not, is there a more efficient way to do this?

Thanks!

Note: I'm using Python 2.5.

@Glenn: Thanks for the edit. I always forget to proof-read the title!
– Cameron, Commented Mar 16, 2010 at 19:41
@Mike: Hah, I guess I'm over-optimizing. The files loaded could potentially be large (in theory) -- but currently the largest is 8KB :-)
– Cameron, Commented Mar 16, 2010 at 22:44
A few KB is tiny, but if you're doing an algorithm like [s[0:n] for n in range(0, len(s))], you'll end up with O(n^2), where in-place slicing would give you O(n). You can always code around it, obviously; it's just extra work.
– Glenn Maynard, Commented Mar 19, 2010 at 1:13

Glenn Maynard · Accepted Answer · 2010-03-16 19:30:03Z

4

At least in 2.6, slices of strings are always new allocations; string_slice() calls PyString_FromStringAndSize(). It doesn't reuse memory, which is a little odd, since with immutable strings it should be a relatively easy thing to do.

Short of the buffer API (which you probably don't want), there isn't a more efficient way to do this operation.
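
As a rough illustration of what the buffer-API route looks like, here is a minimal sketch. It assumes a modern Python 3 interpreter (not the 2.5/2.6 discussed here) and byte data rather than text, since memoryview only works on objects that expose the buffer protocol. The variable names and sample data are made up for the example.

```python
# Sketch only: assumes Python 3 and bytes data; names are illustrative.
data = b"\xef\xbb\xbfhello world"  # pretend this was read from a file

# An ordinary slice allocates a new bytes object holding a copy.
copied = data[3:]

# A memoryview slice refers to the same underlying buffer instead.
view = memoryview(data)[3:]

print(copied == bytes(view))  # both expose the same content
print(view.obj is data)       # the view still points at the original object
```

Converting the view back with bytes(view) copies again, so this only pays off when the consumer accepts buffer objects directly.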

answered Mar 16, 2010 at 19:30

Glenn Maynard


2 Comments

Cameron Over a year ago

Thanks for the info. I'm actually using Python 2.5 (I've updated my question) but I doubt it's done differently. I'll just have to live with the duplication, I guess (I really need to remove that one character).

jcdyer Over a year ago

Can't you just read the first character out of the file, and not assign it to the string to begin with? See my answer, coming momentarily. edit: see benson's answer instead.

SingleNegationElimination · Answer · 2010-03-16 19:35:31Z

3

As in most garbage-collected languages, strings are created as often as needed, which is very often. The reason is that tracking substrings as described would make garbage collection more difficult: a short substring would, for instance, keep its whole parent buffer alive.

What is the actual algorithm you are trying to implement? It might be possible to suggest ways to get better results if we knew a bit more about it.

As for an alternative, what is it you really need to do? Could you look at the issue differently, such as by keeping an integer index into the string? Could you use an array.array('u')?
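
The integer-index idea could look something like the sketch below. It is written for Python 3, and iter_lines, text, and the sample contents are hypothetical names chosen for illustration, not anything from the question.

```python
# Sketch only: 'text' and 'iter_lines' are hypothetical names.
def iter_lines(text, start=0):
    """Yield (begin, end) offsets of each line without slicing 'text'."""
    pos = start
    while pos < len(text):
        end = text.find("\n", pos)  # search from the current offset
        if end == -1:
            end = len(text)
        yield pos, end
        pos = end + 1


text = "\ufeffalpha\nbeta\ngamma"
# Start at offset 1 to skip the first character instead of copying text[1:].
spans = list(iter_lines(text, start=1))
words = [text[b:e] for b, e in spans]  # copy only the small pieces needed
```

Many str methods (find, startswith, index) accept a start offset, so the large string never needs to be re-sliced just to skip a prefix.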

answered Mar 16, 2010 at 19:35

SingleNegationElimination


1 Comment

Cameron Over a year ago

I'm removing the BOM from a UTF-8 decoded file in memory, then sending the contents of this file into a templating engine (Jinja2), then writing the result to an HTML response. I just figured out a way that I'll only have to do this once per template file, though, so it's not really an issue anymore :-)

Benson · Answer · 2010-03-16 19:50:39Z

1

One (albeit slightly hacky) solution would be something like this:

f = open("test.c")
f.read(1)         # skip the first byte
myStr = f.read()  # read the rest of the file
f.close()
print myStr

It will skip the first character, and then read the data into your string variable.
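
One caveat with the snippet above: a file opened this way yields bytes in Python 2, so read(1) skips one byte rather than one character. A character-aware variant might look like the sketch below. It assumes io.open (available from Python 2.6 on, so not in 2.5) and a UTF-8 encoded file; the temporary path and file contents are made up for the demo.

```python
import io
import os
import tempfile

# Create a small UTF-8 file that starts with a BOM, just for the demo.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\ufeffh\u00e9llo")

# In text mode, read(1) consumes one decoded character, not one byte.
with io.open(path, "r", encoding="utf-8") as f:
    first = f.read(1)
    my_str = f.read()
```

If the only goal is dropping a leading BOM, opening with encoding="utf-8-sig" is another option, since that codec skips the BOM while decoding.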

answered Mar 16, 2010 at 19:50

Benson


7 Comments

tgray Over a year ago

Actually, that will read the first byte, not necessarily the first character. In a UTF-8 encoded file, only the 128 US-ASCII characters are encoded in a single byte.

jcdyer Over a year ago

So read the first line, convert to unicode, and then strip the first character. Proceed more or less as above, converting to unicode as you go along. If you don't convert, then you're dealing with bytes.

Cameron Over a year ago

I would use this technique, but at the time I'm reading it from file I don't know whether the BOM should be kept or not. When I later retrieve the contents (from a DB), I get the entire file back at once. A version of your technique has actually already been presented to me in the answer to another (related) question I asked earlier: stackoverflow.com/questions/2456380/…

Mike Graham Over a year ago

Always use a context manager when dealing with files, i.e. with open("test.c") as f:

Benson Over a year ago

@Mike: I would have, but he said he was using 2.5, and I didn't want to muck about with the from __future__ import with_statement junk.


Mike Graham · Answer · 2010-03-16 23:21:58Z

1

Depending on what you are doing, itertools.islice may be a suitable memory-efficient solution (should one become necessary).
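
A minimal sketch of the islice approach, assuming Python 3 and a made-up sample string:

```python
from itertools import islice

my_str = "\ufeffhello"

# islice yields items lazily; no second string is built up front.
# Note it still steps over the skipped items one at a time.
for ch in islice(my_str, 1, None):
    pass  # process each character here

rest = "".join(islice(my_str, 1, None))  # joining does build a new string
```

This helps when the consumer can work character by character; if a real string is needed in the end, the join brings back the copy that a plain slice would have made.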

answered Mar 16, 2010 at 23:21

Mike Graham


2 Comments

Cameron Over a year ago

Cool, I didn't know that module even existed!

Mike Graham Over a year ago

Good find, then! itertools is constantly useful.
