String manipulation in Python

Question

I am converting some code from another language to Python. That code reads a rather large file into a string and then manipulates it by array indexing, like:

str[i] = 'e'

This does not work directly in Python because strings are immutable. What is the preferred way of doing this in Python?

I have seen the string.replace() function, but it returns a copy of the string, which does not sound optimal since the string in this case is an entire file.

Do you always replace the same column, or do you do search and replace?
– vartec, Commented Apr 7, 2009 at 12:25

Nicholas Riley · Accepted Answer · 2009-04-07 13:43:52Z

12

Assuming you're not using a variable-length text encoding such as UTF-8, you can use array.array:

>>> import array
>>> a = array.array('c', 'foo')
>>> a[1] = 'e'
>>> a
array('c', 'feo')
>>> a.tostring()
'feo'
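
The session above is Python 2; the 'c' typecode no longer exists in Python 3. A minimal sketch of the same idea for Python 3, assuming the data can be treated as bytes (or as a single-byte encoding such as ASCII), uses the built-in bytearray instead:

```python
# Sketch for Python 3: bytearray is a mutable sequence of bytes.
data = bytearray(b"foo")

# Single items are integers in Python 3, so convert the character with ord().
data[1] = ord("e")

# Decode once all edits are done (assumes ASCII content).
result = data.decode("ascii")
print(result)
```

Slice assignment such as data[1:2] = b"e" also works and avoids the ord() call.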

But since you're dealing with the contents of a file, mmap should be more efficient:

>>> f = open('foo', 'r+')
>>> import mmap
>>> m = mmap.mmap(f.fileno(), 0)
>>> m[:]
'foo\n'
>>> m[1] = 'e'
>>> m[:]
'feo\n'
>>> exit()
% cat foo
feo
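
For Python 3, roughly the same mmap session might look like the sketch below. It assumes a scratch file created with tempfile (not part of the original answer) so that it can run on its own, and it opens the file in binary mode because mmap works on bytes:

```python
import mmap
import os
import tempfile

# Create a small scratch file so the example is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"foo\n")

# Python 3 needs the file opened in binary read/write mode.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        m[1:2] = b"e"   # slice assignment takes bytes of the same length
        m.flush()       # push the change back to the file

# Read the file back to confirm the edit reached the disk.
with open(path, "rb") as f:
    contents = f.read()

os.remove(path)
print(contents)
```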

Here's a quick benchmarking script (you'll need to replace dd with something else for non-Unix OSes):

import os, time, array, mmap

def modify(s):
    for i in xrange(len(s)):
        s[i] = 'q'

def measure(func):
    start = time.time()
    func(open('foo', 'r+'))
    print func.func_name, time.time() - start

def do_split(f):
    l = list(f.read())
    modify(l)
    return ''.join(l)

def do_array(f):
    a = array.array('c', f.read())
    modify(a)
    return a.tostring()

def do_mmap(f):
    m = mmap.mmap(f.fileno(), 0)
    modify(m)

os.system('dd if=/dev/random of=foo bs=1m count=5')

measure(do_mmap)
measure(do_array)
measure(do_split)

Output I got on my several-year-old laptop matches my intuition:

5+0 records in
5+0 records out
5242880 bytes transferred in 0.710966 secs (7374304 bytes/sec)
do_mmap 1.00865888596
do_array 1.09792494774
do_split 1.20163106918

So mmap is slightly faster, but none of the suggested solutions differs much from the others. If you're seeing a huge difference, try using cProfile to see what's taking the time.
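
A rough sketch of what profiling with cProfile could look like on Python 3. The profile_modify helper and the bytearray stand-in for the file contents are assumptions made for illustration, not part of the benchmark above:

```python
import cProfile
import io
import pstats


def modify(buf):
    # Same per-item loop as the benchmark, written for Python 3.
    for i in range(len(buf)):
        buf[i] = ord("q")


def profile_modify(size=100_000):
    # Hypothetical helper: profiles modify() on an in-memory buffer.
    buf = bytearray(size)  # stand-in for the file contents
    profiler = cProfile.Profile()
    profiler.runcall(modify, buf)

    # Collect the report in a string instead of printing straight to stdout.
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return buf, out.getvalue()


buf, report = profile_modify()
print(report)
```

The report lists the calls that account for the most cumulative time, which is usually enough to tell whether the loop itself or something around it is the bottleneck.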

edited Apr 7, 2009 at 13:43

answered Apr 7, 2009 at 12:15


4 Comments

Stefano Borini Over a year ago

I seem to recall that mmap is Linux-only, so you could face portability problems.

Nicholas Riley Over a year ago

Nope, it works on Unix and Windows (docs.python.org/library/mmap.html). There are some minor API differences, but nothing that affects this use case. Actually, the difference is bigger on Windows: do_mmap 0.65700006485; do_array 1.0150001049; do_split 0.827999830246.

Zitrax Over a year ago

Thanks for the tip about cProfile, it pointed me to the problem. The for loops used range() which caused a lot of overhead. I switched to while loops and now the performance is good.

Nicholas Riley Over a year ago

Cool! Glad you figured it out.

Can Berk Güder · Accepted Answer · 2009-04-07 12:14:38Z

9

l = list(str)
l[i] = 'e'
str = ''.join(l)
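
A slightly fuller sketch of the same idea, assuming several edits are applied before a single join. The apply_edits name is hypothetical, and the variable is called text rather than str to avoid shadowing the built-in type:

```python
def apply_edits(text, edits):
    """Return a copy of text with each (index, char) edit applied."""
    chars = list(text)          # one mutable copy of the string
    for index, char in edits:
        chars[index] = char     # in-place replacement of one character
    return "".join(chars)       # build the final string once


print(apply_edits("foo bar", [(1, "e"), (4, "c")]))
```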

answered Apr 7, 2009 at 12:14


13 Comments

Can Berk Güder Over a year ago

@theycallmemorty: it consumes twice as much memory as C, but other than that, I can't see any reason why it shouldn't work.

user25148 Over a year ago

In fact, if there's a lot of such manipulation being done, it's probably best to keep the strings as lists of characters.

Zitrax Over a year ago

This works and seems to be slightly faster than the array approach from another answer. However, both methods are a lot slower than my previous code: currently ~7 seconds vs. 0.4 seconds.

Can Berk Güder Over a year ago

@liw.fi: Correct. The ''.join(l) line should be used after all character-based modifications are done.

Can Berk Güder Over a year ago

@Zitrax: What's your previous code? Python, or the original language (C?)? Also, see my reply to liw.fi's comment.


Chris Upchurch · Accepted Answer · 2009-04-07 14:57:50Z

1

Others have answered the string manipulation part of your question, but I think you ought to think about whether it would be better to parse the file and modify the data structure the text represents rather than manipulating the text directly.
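
As an illustration only: the question does not say what the file contains, so the sketch below assumes a small CSV document and uses the standard csv module to parse it, change one field, and write it back out.

```python
import csv
import io

# Hypothetical input: the real file format is not given in the question.
raw = "name,grade\nalice,a\nbob,b\n"

rows = list(csv.reader(io.StringIO(raw)))  # parse the text into a list of rows
rows[2][1] = "e"                           # change a field, not a character offset

# Serialize the modified rows back to text.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
updated = out.getvalue()
print(updated)
```

Working on the parsed rows means the code addresses "the grade in the third row" rather than a raw index into the text.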

answered Apr 7, 2009 at 14:57


vartec · Accepted Answer · 2009-04-07 12:16:05Z

0

Try:

sl = list(s)
sl[i] = 'e'
s = ''.join(sl)

answered Apr 7, 2009 at 12:16
