1

I am converting some code from another language to Python. That code reads a rather large file into a string and then manipulates it by array indexing, like:

str[i] = 'e'

This does not work directly in Python because strings are immutable. What is the preferred way of doing this in Python?

I have seen the string.replace() function, but it returns a copy of the string, which does not sound optimal when the string is an entire file.

2
  • Do you always replace the same column, or do you search and replace? Commented Apr 7, 2009 at 12:25
  • What is replaced depends on the content of the file. Commented Apr 7, 2009 at 12:33

4 Answers 4

12

Assuming you're not using a variable-length text encoding such as UTF-8, you can use array.array:

>>> import array
>>> a = array.array('c', 'foo')
>>> a[1] = 'e'
>>> a
array('c', 'feo')
>>> a.tostring()
'feo'
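
For readers on Python 3, where the 'c' typecode no longer exists, a bytearray is a close equivalent. The following is only a sketch: the sample data and the indices are made up for illustration, and it assumes the content can be treated as raw bytes (or a single-byte encoding).

```python
# Sketch: in-place edits on a mutable byte buffer (Python 3).
# The sample data and the indices are illustrative assumptions.
data = bytearray(b'foo')

# Single items of a bytearray are integers, so assign a byte value...
data[1] = ord('e')

# ...or use a slice assignment to write a bytes object.
data[2:3] = b'x'

result = bytes(data)           # immutable copy, made once at the end
text = result.decode('ascii')  # only valid because the sample data is ASCII
print(text)
```

The buffer is modified in place, so the only full copy is the final bytes() call.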

But since you're dealing with the contents of a file, mmap should be more efficient:

>>> f = open('foo', 'r+')
>>> import mmap
>>> m = mmap.mmap(f.fileno(), 0)
>>> m[:]
'foo\n'
>>> m[1] = 'e'
>>> m[:]
'feo\n'
>>> exit()
% cat foo
feo
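
On Python 3 the same idea might look like the sketch below. The temporary file is an assumption standing in for the real input file; the main differences from the session above are that the file is opened in binary mode and that single items of the map are integers.

```python
import mmap
import os
import tempfile

# Sketch: edit a file in place through mmap (Python 3).
# The temporary file stands in for the real input file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'foo\n')

with open(path, 'r+b') as f:
    # Length 0 maps the whole file; the file must not be empty.
    with mmap.mmap(f.fileno(), 0) as m:
        m[1] = ord('e')   # single items are ints in Python 3
        m.flush()         # push the change out to the file

with open(path, 'rb') as f:
    contents = f.read()

os.remove(path)
print(contents)
```

Because the map is backed by the file, the edit lands on disk without a separate write step.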

Here's a quick benchmarking script (you'll need to replace dd with something else for non-Unix OSes):

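# Python 2 script: xrange, the print statement and the 'c' typecode are Python 2 only.
import os, time, array, mmap

def modify(s):
    # Overwrite every element in place; works on any mutable sequence.
    for i in xrange(len(s)):
        s[i] = 'q'

def measure(func):
    # Time one strategy, including the cost of opening and reading the file.
    start = time.time()
    func(open('foo', 'r+'))
    print func.func_name, time.time() - start

def do_split(f):
    # Strategy 1: list of one-character strings, joined back at the end.
    l = list(f.read())
    modify(l)
    return ''.join(l)

def do_array(f):
    # Strategy 2: array of characters.
    a = array.array('c', f.read())
    modify(a)
    return a.tostring()

def do_mmap(f):
    # Strategy 3: memory-map the file and edit it in place (this rewrites foo on disk).
    m = mmap.mmap(f.fileno(), 0)
    modify(m)

# Create a 5 MB file of random bytes. bs=1m is BSD/macOS dd syntax; GNU dd expects bs=1M.
os.system('dd if=/dev/random of=foo bs=1m count=5')

measure(do_mmap)
measure(do_array)
measure(do_split)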

Output I got on my several-year-old laptop matches my intuition:

5+0 records in
5+0 records out
5242880 bytes transferred in 0.710966 secs (7374304 bytes/sec)
do_mmap 1.00865888596
do_array 1.09792494774
do_split 1.20163106918

So mmap is slightly faster, but none of the suggested solutions differs much from the others. If you're seeing a huge difference, try using cProfile to see what's taking the time.
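
A minimal sketch of profiling such a loop with cProfile, written for Python 3. The bytearray is a hypothetical stand-in for the file contents, and modify mirrors the loop from the benchmark rather than anyone's real code.

```python
import cProfile
import io
import pstats

def modify(buf):
    # Same shape as the benchmark loop: overwrite every byte in place.
    for i in range(len(buf)):
        buf[i] = ord('q')

# Hypothetical stand-in for the file contents.
buf = bytearray(b'x' * 100000)

profiler = cProfile.Profile()
profiler.enable()
modify(buf)
profiler.disable()

# Collect the five most expensive entries, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

For a whole script, running python -m cProfile -s cumulative on it gives a similar report without touching the code.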

4 Comments

I seem to recall that mmap is Linux-only, so you could face portability problems.
Nope, it works on both Unix and Windows (docs.python.org/library/mmap.html). There are some minor API differences, but nothing that affects this use case. The gap is actually bigger on Windows: do_mmap 0.65700006485; do_array 1.0150001049; do_split 0.827999830246.
Thanks for the tip about cProfile; it pointed me to the problem. The for loops used range(), which caused a lot of overhead. I switched to while loops and now the performance is good.
Cool! Glad you figured it out.
9
l = list(str)
l[i] = 'e'
str = ''.join(l)
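
A slightly fuller sketch of the same idea, assuming several edits are applied before the string is rebuilt. The helper name apply_edits and the sample edits are hypothetical.

```python
def apply_edits(text, edits):
    """Return a copy of text with each (index, char) edit applied."""
    chars = list(text)          # one mutable copy of the string
    for index, char in edits:
        chars[index] = char     # cheap in-place assignment
    return ''.join(chars)       # rebuild the string once, at the end

result = apply_edits('hello world', [(0, 'j'), (4, 'y')])
print(result)
```

The point is that list() and ''.join() each run once, however many characters are changed in between.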

13 Comments

@theycallmemorty: it consumes twice as much memory as C, but other than that, I can't see any reason why it shouldn't work.
In fact, if there's a lot of such manipulation being done, it's probably best to keep the strings as lists of characters.
This works and seems to be slightly faster than the array approach from another answer. However, both methods are a lot slower than my previous code: currently ~7 seconds vs 0.4 seconds.
@liw.fi: Correct. The ''.join(l) line should be used only after all character-based modifications are done.
@Zitrax: What's your previous code? Python, or the original language (C?)? Also, see my reply to liw.fi's comment.
1

Others have answered the string manipulation part of your question, but I think you ought to think about whether it would be better to parse the file and modify the data structure the text represents rather than manipulating the text directly.
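
As a rough illustration only: the question doesn't say what the file contains, so the key=value format, the function names and the sample data below are all assumptions. The shape of the approach is parse, change the data, then write it back out.

```python
def parse(text):
    """Parse 'key=value' lines into a dict (hypothetical format)."""
    settings = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition('=')
            settings[key.strip()] = value.strip()
    return settings

def serialize(settings):
    """Write the dict back out as 'key=value' lines."""
    return ''.join('%s=%s\n' % (k, v) for k, v in settings.items())

original = 'name=foo\nmode=fast\n'
settings = parse(original)
settings['mode'] = 'safe'      # edit the data, not the characters
updated = serialize(settings)
print(updated)
```

This relies on dicts preserving insertion order (Python 3.7+) to keep the lines in their original order.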

0

Try:

sl = list(s)
sl[i] = 'e'
s = ''.join(sl)
