
I have a program which needs to turn many large one-dimensional numpy arrays of floats into delimited strings. I am finding this operation quite slow relative to the mathematical operations in my program and am wondering if there is a way to speed it up. For example, consider the following loop, which takes 100,000 random numbers in a numpy array and joins the array into a comma-delimited string 100 times.

import numpy as np
x = np.random.randn(100000)
for i in range(100):
    ",".join(map(str, x))

This loop takes about 20 seconds to complete (total, not per cycle). In contrast, 100 cycles of something like elementwise multiplication (x*x) would take less than 1/10 of a second to complete. Clearly the string join operation creates a large performance bottleneck; in my actual application it will dominate total runtime. This makes me wonder: is there a faster way than ",".join(map(str, x))? Since map() is where almost all the processing time occurs, this comes down to the question of whether there is a faster way to convert a very large number of numbers to strings.
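
For reference, here is a minimal timing sketch (an illustration using the standard timeit module; exact numbers will vary by machine) that compares the two operations directly:

import timeit
import numpy as np

x = np.random.randn(100000)

# time 100 repetitions of each operation, as in the loop above
t_join = timeit.timeit(lambda: ",".join(map(str, x)), number=100)
t_mult = timeit.timeit(lambda: x * x, number=100)
print("join: %.2fs  multiply: %.4fs" % (t_join, t_mult))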

  • Is it the conversion of numbers to strings that takes the time? Commented Apr 27, 2010 at 13:25
  • Multiplying two integers and converting/concatenating 100,000 numbers are quite different things - how can you expect both operations to even be in the same ballpark performance-wise? Commented Apr 27, 2010 at 13:28
  • Mark - Yes. Tim - True. My point is simply that the string operation creates a real bottleneck, and it would be nice if there was a way to speed things up. Commented Apr 27, 2010 at 13:43
  • float.hex is 25% faster than str. It can be read back in other languages using the "%a" format; see the sketch below. Commented Apr 28, 2010 at 14:08
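
A quick sketch of the float.hex suggestion above (assuming the same 100,000-element x; hex float strings are exact and machine-independent, though not human-friendly):

import numpy as np

x = np.random.randn(100000)

# numpy.float64 subclasses float, so float.hex applies directly; the result
# can be read back with float.fromhex in Python or the "%a" format elsewhere
s = ",".join(map(float.hex, x))

# lossless round trip
y = [float.fromhex(tok) for tok in s.split(",")]
assert y == x.tolist()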

7 Answers


A little late, but this is faster for me:

# generate an array of strings
x_arrstr = np.char.mod('%f', x)
# combine into a single string
x_str = ",".join(x_arrstr)

The speedup on my machine is about 1.5x.
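
A self-contained version for reference (note that '%f' keeps only six decimal places, so this is lossy compared to str; a format such as '%.17g' round-trips exactly, presumably at some speed cost worth timing on your own data):

import numpy as np

x = np.random.randn(100000)

# element-wise C-level formatting, then a single join
x_str = ",".join(np.char.mod('%f', x))

# full-precision variant
x_str_exact = ",".join(np.char.mod('%.17g', x))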




Very good writeup on the performance of various string concatenation techniques in Python: http://www.skymind.com/~ocrow/python_string/

I'm a little surprised that some of the latter approaches perform as well as they do, but it looks like you can certainly find something there that will work better than what you're doing now.

The fastest method mentioned on the site:

Method 6: List comprehensions

def method6():
  return ''.join([`num` for num in xrange(loop_count)])

This method is the shortest. I'll spoil the surprise and tell you it's also the fastest. It's extremely compact, and also pretty understandable. Create a list of numbers using a list comprehension and then join them all together. Couldn't be simpler than that. This is really just an abbreviated version of Method 4, and it consumes pretty much the same amount of memory. It's faster though because we don't have to call the list.append() function each time round the loop.
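
The quoted snippet is Python 2 (backticks are old shorthand for repr(), and xrange is gone); a rough Python 3 equivalent, with loop_count made a parameter here for illustration, would be:

def method6(loop_count=100000):
    # str() replaces the Python 2 backtick-repr; range replaces xrange
    return ''.join([str(num) for num in range(loop_count)])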

3 Comments

Thanks sblom. Unfortunately my code is already essentially the same as the fastest solution mentioned. Perhaps there is just no way to get it to go faster.
@Abiel If you really want it faster then you should look into using Cython.
I think generally people like it when an answer explains what to do, rather than just linking to a page that may or may not be removed in the future.

Convert the numpy array into a list first. The map operation seems to run faster on a list than on a numpy array.

e.g.

import numpy as np
x = np.random.randn(100000).tolist()
for i in range(100):
    ",".join(map(str, x))

In timing tests, I found a consistent 15% speedup for this example.

I'll leave others to explain why this might be faster as I have no idea!
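
One plausible explanation, offered here as an assumption rather than the answerer's: iterating over a numpy array yields numpy.float64 scalar objects, whose str() is slower than that of the plain Python floats that tolist() produces. A quick sketch to check the two cases side by side:

import timeit
import numpy as np

x = np.random.randn(100000)
x_list = x.tolist()   # plain Python floats

t_np = timeit.timeit(lambda: ",".join(map(str, x)), number=10)
t_py = timeit.timeit(lambda: ",".join(map(str, x_list)), number=10)
print("numpy scalars: %.2fs  python floats: %.2fs" % (t_np, t_py))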



I think you could experiment with numpy.savetxt passing a cStringIO.StringIO object as a fake file...

Or maybe using str(x) and replacing the whitespace with commas (edit: this won't work well, because str abbreviates large arrays with an ellipsis :-s).

As the purpose of this was to send the array over the network, maybe there are better alternatives (more efficient in both CPU and bandwidth). The one I pointed out in a comment on another answer was to encode the binary representation of the array as a Base64 text block. The main obstacle to this being optimal is that the client reading the chunk of data should be able to do nasty things like reinterpret a byte array as a float array, and that's not usually allowed in type-safe languages; but it could be done quickly with a C library call (and most languages provide a means to do this).

In case you cannot mess with bits, there's always the possibility of processing the numbers one by one to convert the decoded bytes to floats.

Oh, and watch out for the endianness of the machines when sending data over the network: convert to network order -> base64encode -> send | receive -> base64decode -> convert to host order.
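
A minimal sketch of that pipeline with numpy and the standard base64 module (the '>f8' big-endian dtype stands in for the network-order conversion; the receiver is assumed to know the dtype and element count out of band):

import base64
import numpy as np

x = np.random.randn(100000)

# sender: force big-endian ("network order") doubles, then Base64-encode
payload = base64.b64encode(x.astype('>f8').tobytes())

# receiver: Base64-decode and reinterpret the bytes as big-endian doubles
y = np.frombuffer(base64.b64decode(payload), dtype='>f8')

assert np.array_equal(x, y)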

1 Comment

Thanks fortran. Unfortunately I'm still not able to get a speed improvement with either savetxt or with str(x). str(x) at first appears to be much faster, but this disappears once np.set_printoptions(threshold=100000) (see my comment on unutbu's answer).

numpy.savetxt is even slower than string.join. ndarray.tofile() doesn't seem to work with StringIO.

But I did find a faster method (at least for the OP's example, on Python 2.5 with an older version of numpy):

import numpy as np
x = np.random.randn(100000)
for i in range(100):
    (",%f"*100000)[1:] % tuple(x)

It looks like string formatting is faster than string join if you have a well-defined format, as in this particular case. But I wonder why the OP needs such a long string of floating-point numbers in memory.

Newer versions of numpy show no speed improvement.
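
A length-agnostic variant of the same trick, in case the array size is not fixed (a sketch; worth timing against the original on your own numpy version):

import numpy as np

x = np.random.randn(100000)

# build one big format string, then apply it in a single % operation
fmt = ",".join(["%f"] * len(x))
s = fmt % tuple(x)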

11 Comments

Dingle - For whatever reason I am not finding this to be faster than my original example of join and str. As to why I need these long strings, I have a server application that operates on numpy arrays and then distributes the results in plain-text strings so that a variety of clients (including non-Python clients) can consume the data (this includes sending data over HTTP to remote clients). If there is a better way to distribute the data I would be happy to use it, but remember that clients using any programming language and running on any operating system would need to be able to consume it.
For that use, compressed binary data is better than plain text! :-) my HTTP knowledge is a little bit rusty now, but you can at least encode the raw floats in Base64 to get better bit-density than in decimal. Make sure that the marshalling scheme is the same in all platforms (check network and host byte order and IEEE 754 compatible representations). If there's no numpy method to do that, you could write your own routine in C and call it with ctypes.
Thanks fortran, this looks like it may be the answer. Certainly doing x.tostring() in numpy is very fast. I'm not very familiar with reading and writing binary data across different environments, but I will dig into this.
@Abiel, timeit shows it is 20-30% faster. Not sure if fortran's suggestion will improve the speed if the data size is not an issue here. What about JSON or XML? I thought binary data over the network was not safe to unpack.
fortran - After looking at your suggestion a bit more, I'm confused about how in practice you would decode the data at the client side, given that the client will not necessarily be written in Python. For example, the client might be written in Visual Basic and be designed to drop numerical arrays into a spreadsheet. In this case I would need to know how to take a binary representation of a numpy array and translate it into something like a VB Variant. Thoughts?

Using imap from itertools instead of map in the OP's code gives me about a 2-3% improvement, which isn't much, but it is something that might combine with other ideas for a larger gain.
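
For reference, a sketch of the imap variant (Python 2 only; in Python 3 the built-in map is already lazy, so there is nothing to change there):

import numpy as np
from itertools import imap  # Python 2; removed in Python 3

x = np.random.randn(100000)
s = ",".join(imap(str, x))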

Personally, I think that if you want much better than this, you will have to use something like Cython.


','.join(x.astype(str))

is about 10% slower than

x_arrstr = np.char.mod('%f', x)
x_str = ",".join(x_arrstr)

but is more readable.

