Python fastest way to remove multiple spaces in a string

Question

This question has been asked before, but the fast answers that I have seen also remove the trailing spaces, which I don't want.

"   a     bc    "

should become

" a bc "

I have

text = re.sub(' +', " ", text)

but am hoping for something faster. The suggestion that I have seen (and which won't work) is

' '.join(text.split())

Note that I will be doing this to lots of smaller texts so just checking for a trailing space won't be so great.

If you want to really optimize stuff like this, use C, not python. Try cython, that is pretty much Python syntax but fast as C.
– Has QUIT--Anony-Mousse, Commented Jun 13, 2013 at 15:13
You could try ''.join((text[0],' '.join(text[1:-1].split()),text[-1])) but that is probably not faster than the regex (you'd need to timeit), and it's definitely not easier to read.
– mgilson, Commented Jun 13, 2013 at 15:14
Have you checked that this is really the thing slowing down your program? My (very uninformed) guess is that it is not. First profile, and then if performance really is an issue, then optimise (and the easiest way to do that might be to rewrite the critical bits in C).
– Adrian Ratnapala, Commented Jun 13, 2013 at 15:16
Why do you want something faster? I doubt it's really affecting your program.
– Lanaru, Commented Jun 13, 2013 at 15:18
See stackoverflow.com/questions/1546226/…. The winner seems to be while '  ' in s: s=s.replace('  ', ' ')
– Fredrik Pihl, Commented Jun 13, 2013 at 15:19

Fredrik Pihl · Accepted Answer · 2013-06-14 09:22:28Z

3

FWIW, some timings

$  python -m timeit -s 's="   a     bc    "' 't=s[:]' "while '  ' in t: t=t.replace('  ', ' ')"
1000000 loops, best of 3: 1.05 usec per loop

$ python -m timeit -s 'import re;s="   a     bc    "'  "re.sub(' +', ' ', s)"
100000 loops, best of 3: 2.27 usec per loop

$ python -m timeit -s 's=" a bc "' "''.join((s[0],' '.join(s[1:-1].split()),s[-1]))"
1000000 loops, best of 3: 0.592 usec per loop

$ python -m timeit -s 'import re;s="   a     bc    "'  "re.sub(' {2,}', ' ', s)"
100000 loops, best of 3: 2.34 usec per loop

$ python -m timeit -s 's="   a     bc    "' '" "+" ".join(s.split())+" "'
1000000 loops, best of 3: 0.387 usec per loop
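
The shell one-liners above can also be reproduced from a script. The following is a minimal sketch, not the harness used for the numbers above: the function names, the `benchmark` helper, and the precompiled pattern are assumptions made for illustration. Note that `collapse_split` unconditionally adds a space at each end, so it is only correct for inputs that start and end with a space and contain at least one word.

```python
import re
import timeit

SAMPLE = "   a     bc    "
EXPECTED = " a bc "

# Precompiled here for convenience; the one-liners above call re.sub directly.
_MULTI_SPACE = re.compile(r" {2,}")


def collapse_replace(s):
    # Each pass roughly halves every run of spaces; stop when no double space is left.
    while "  " in s:
        s = s.replace("  ", " ")
    return s


def collapse_regex(s):
    # Replace every run of two or more spaces with a single space.
    return _MULTI_SPACE.sub(" ", s)


def collapse_split(s):
    # Assumes s begins and ends with a space and contains at least one word.
    return " " + " ".join(s.split()) + " "


CANDIDATES = (collapse_replace, collapse_regex, collapse_split)


def benchmark(s=SAMPLE, number=100_000):
    """Return {function name: total seconds for `number` calls}."""
    return {f.__name__: timeit.timeit(lambda: f(s), number=number) for f in CANDIDATES}


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds:.4f} s for 100000 calls")
```

Absolute numbers will differ by machine and Python version, so only the relative ordering on your own data is meaningful.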

edited Jun 14, 2013 at 9:22

answered Jun 13, 2013 at 15:24

Fredrik Pihl


12 Comments

Aya Over a year ago

re.sub(' {2,}', ... would be a fairer test. There's no point in matching a single space.

mgilson Over a year ago

@Aya -- Good suggestion, for me, that does about 30% better for this simple test.

mgilson Over a year ago

I also timed my suggestion ... It comes in between the other two on my desktop: python -m timeit -s 's=" a bc "' "s = ''.join((s[0],' '.join(s[1:-1].split()),s[-1]))"

Aya Over a year ago

@lcfseth It would depend on the length of the string, and the number of multi-space instances. For longer strings with many multi-space instances, the regex would out-perform the str.replace approach.

Fredrik Pihl Over a year ago

With this trivial string the while-approach beats the re even with s = "..."*10000
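
Whether the regex overtakes the replace loop on longer inputs is easy to check rather than guess. This is a rough sketch under assumed names (`compare`, `collapse_replace`, `collapse_regex` are illustrative); it repeats the sample string to build longer inputs and times both approaches on each length.

```python
import re
import timeit


def collapse_replace(s):
    # Loop until no double space remains.
    while "  " in s:
        s = s.replace("  ", " ")
    return s


def collapse_regex(s):
    # Collapse runs of two or more spaces in a single pass.
    return re.sub(" {2,}", " ", s)


def compare(base="   a     bc    ", repeats=(1, 100, 10_000), number=20):
    """Return a list of (input length, replace seconds, regex seconds)."""
    rows = []
    for n in repeats:
        s = base * n
        # Both approaches must agree before their timings are worth comparing.
        assert collapse_replace(s) == collapse_regex(s)
        t_replace = timeit.timeit(lambda: collapse_replace(s), number=number)
        t_regex = timeit.timeit(lambda: collapse_regex(s), number=number)
        rows.append((len(s), t_replace, t_regex))
    return rows


if __name__ == "__main__":
    for length, t_replace, t_regex in compare():
        print(f"len={length}: replace={t_replace:.5f} s, regex={t_regex:.5f} s")
```

Inputs with very long runs of spaces, rather than many short runs, may behave differently, so it is worth timing strings that resemble your real data.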


Has QUIT--Anony-Mousse · Accepted Answer · 2013-06-14 09:34:29Z

If you want to really optimize stuff like this, use C, not Python.

Try Cython: it has pretty much Python syntax but can be as fast as C.

Here is some stuff you can time:

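import array
# Python 2 only: the 'c' typecode no longer exists in Python 3.
buf=array.array('c')
input="   a     bc    "
space=False  # True if the previous character was a space
for c in input:
  # Drop a space only when it directly follows another space.
  if not space or not c == ' ':
    buf.append(c)
  space = (c == ' ')
buf.tostring()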

Also try using cStringIO:

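# Python 2 only: cStringIO was removed in Python 3 (use io.StringIO there).
import cStringIO
buf=cStringIO.StringIO()
input="   a     bc    "
space=False  # True if the previous character was a space
for c in input:
  # Drop a space only when it directly follows another space.
  if not space or not c == ' ':
    buf.write(c)
  space = (c == ' ')
buf.getvalue()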

But again, if you want to make such things really fast, don't do it in Python. Use Cython. The two approaches I gave here will likely be slower, just because they put much more work on the Python interpreter. If you want these things to be fast, do as little as possible in Python. The for c in input loop alone likely kills any theoretical performance advantage of the above approaches.
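
The two snippets above are Python 2 code. For anyone who wants to time the same idea on Python 3, here is a minimal sketch of a port using io.StringIO; the function name `collapse_spaces_stringio` is an assumption for illustration, and the caveat above still applies: the per-character loop runs in the interpreter and will probably be slow.

```python
import io


def collapse_spaces_stringio(text):
    """Collapse runs of spaces to one space, keeping leading/trailing ones."""
    buf = io.StringIO()
    prev_space = False  # True if the previous character was a space
    for ch in text:
        is_space = ch == " "
        # Write everything except a space that directly follows another space.
        if not (prev_space and is_space):
            buf.write(ch)
        prev_space = is_space
    return buf.getvalue()


if __name__ == "__main__":
    print(repr(collapse_spaces_stringio("   a     bc    ")))
```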

Slater Victoroff · Accepted Answer · 2013-06-13 15:22:11Z

0

This is just a small rewrite of the suggestion above. Just because something has a small fault doesn't mean you should assume it won't work.

You could easily do something like:

front_space = lambda x:x[0]==" "
trailing_space = lambda x:x[-1]==" "
" "*front_space(text)+' '.join(text.split())+" "*trailing_space(text)
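
The expression above raises IndexError on an empty string and returns two spaces for a string made only of spaces. Below is a hedged sketch of the same idea with those cases handled; the function name `collapse_keep_edges` is hypothetical. It assumes the only whitespace in the text is the space character, because text.split() with no argument also splits on tabs and newlines.

```python
def collapse_keep_edges(text):
    """Collapse inner runs of spaces; keep at most one space at each end."""
    words = text.split()
    if not words:
        # Empty input stays empty; an all-space input collapses to one space.
        return " " if text else ""
    lead = " " if text.startswith(" ") else ""
    trail = " " if text.endswith(" ") else ""
    return lead + " ".join(words) + trail


if __name__ == "__main__":
    print(repr(collapse_keep_edges("   a     bc    ")))
```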

answered Jun 13, 2013 at 15:22

Slater Victoroff
