iterating over a single list in parallel in python

Question

The objective is to run several calculations over a single iterator in one pass, using the builtin sum and map functions concurrently. Perhaps something like itertools could replace classic for loops when analyzing (LARGE) data that arrives via an iterator...

In one simple example case I want to calculate ilen, sum_x & sum_x_sq:

ilen,sum_x,sum_x_sq=iterlen(iter),sum(iter),sum(map(lambda x:x*x, iter))

But without converting the (large) iter to a list (as with iter=list(iter))

n.b. Do this using sum & map and without for loops, maybe using the itertools and/or threading modules?

import random

def example_large_data(n=100000000, mean=0, std_dev=1):
    for _ in range(n):
        yield random.gauss(mean, std_dev)

-- edit --

Being VERY specific: I was taking a good look at itertools hoping that there was a dual function like map that could do it. For example: len_x,sum_x,sum_x_sq=itertools.iterfork(iter_x,iterlen,sum,sum_sq)

If I was to be very very specific: I am looking for just one answer, python source code for the "iterfork" procedure.

What's the importance of using the built-in sum and map? Any solution is going to involve either enough runtime overhead that using those builtins has little performance impact, or enough C extension code that rewriting sum and map would be trivial in comparison.
– user2357112, Commented Apr 8, 2015 at 7:59
Agreed: Maybe it can only be done in threads, but I'm hoping not and there is some pythonic feature I have overlooked. I was taking a good look at itertools hoping that there was a dual function like map that could do it. For example: len_x,sum_x,sum_x_sq=itertools.iterfork(iter_x,count,sum,sum_sq) ... this issue has come up often enough that I feel like I need a pattern.
– NevilleDNZ, Commented Apr 8, 2015 at 8:19
Last I checked, threading is the only option that doesn't force you to rewrite the routines that consume the iterator. I tried to come up with a way to do it with coroutines, but Python's coroutines just aren't powerful enough. On the bright side, if you control the source code for the functions that need to consume the iterator, the amount of rewriting is quite minor, and you can keep the old API.
– user2357112, Commented Apr 8, 2015 at 8:27
Being VERY specific, I am looking for just one answer, python code for an "iterfork" procedure.
– NevilleDNZ, Commented Apr 8, 2015 at 8:29
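
The threading approach raised in the comments could look roughly like the sketch below. It is an illustration under stated assumptions, not an itertools feature: `iterfork`, `_DONE` and `maxsize` are hypothetical names, each consumer is assumed to read its iterator to the end without raising (otherwise the feeding loop could block on a full queue), and the per-item queue overhead is likely to outweigh any gain from using the builtins.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the stream

def iterfork(iterable, *consumers, maxsize=1024):
    """Feed one iterable to several consumer functions, one thread each."""
    queues = [queue.Queue(maxsize) for _ in consumers]
    results = [None] * len(consumers)

    def run(index, consumer, q):
        # iter(q.get, _DONE) yields queued items until the sentinel arrives.
        results[index] = consumer(iter(q.get, _DONE))

    threads = [
        threading.Thread(target=run, args=(i, fn, q), daemon=True)
        for i, (fn, q) in enumerate(zip(consumers, queues))
    ]
    for t in threads:
        t.start()
    for item in iterable:
        for q in queues:
            q.put(item)  # blocks when a queue is full, bounding memory use
    for q in queues:
        q.put(_DONE)
    for t in threads:
        t.join()
    return tuple(results)

def iterlen(it):
    return sum(1 for _ in it)

def sum_sq(it):
    return sum(map(lambda x: x * x, it))

ilen, sum_x, sum_x_sq = iterfork(iter([1, 2, 3, 4]), iterlen, sum, sum_sq)
```

The bounded queues are the design choice that matters here: they keep the fastest consumer from running arbitrarily far ahead of the slowest one.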

Blckknght · Accepted Answer · 2015-04-08 08:30:53Z

You can use itertools.tee to turn your single iterator into three iterators which you can pass to your three functions.

iter0, iter1, iter2 = itertools.tee(input_iter, 3)
ilen, sum_x, sum_x_sq = count(iter0),sum(iter1),sum(map(lambda x:x*x, iter2))

That will work, but the builtin sum (and map in Python 2) consumes its whole input before returning, so it does not support parallel iteration. The first function you call will consume its iterator completely, then the second function will consume the second iterator, then the third will consume the third. Since tee has to store every value that has been seen by one of its output iterators but not yet by all of the others, this is essentially the same as creating a list from the iterator and passing it to each function.
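
A small experiment makes the buffering visible. In this sketch the `source` generator and the `pulled` list are illustrative names, not part of the question; the generator logs every value it produces, and by the time the first `sum` returns the whole input has already been drawn and is being held by `tee` for the second consumer.

```python
import itertools

pulled = []  # records every value drawn from the underlying source

def source(n):
    for i in range(n):
        pulled.append(i)
        yield i

iter0, iter1 = itertools.tee(source(5), 2)

total = sum(iter0)                    # the first consumer runs to completion
drained_after_first = len(pulled)     # the source is already exhausted here
total_sq = sum(x * x for x in iter1)  # replayed from tee's internal buffer

print(drained_after_first, total, total_sq)  # prints 5 10 30
```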

Now, if you use generator functions that consume only a single value from their input for each value they output, you might be able to make parallel iteration work using zip. In Python 3, map and zip both return lazy iterators that behave this way. The question is how to make sum into a generator.

I think you can get pretty much what you want by using itertools.accumulate (which was added in Python 3.2). It returns an iterator that yields a running sum of its input, so the last value it yields is the total. Here's how you could make it work for your problem (I'm assuming your count function was supposed to be an iterator-friendly version of len):

import itertools

iter0, iter1, iter2 = itertools.tee(input_iter, 3)

len_gen = itertools.accumulate(map(lambda x: 1, iter0))
sum_gen = itertools.accumulate(iter1)
sum_sq_gen = itertools.accumulate(map(lambda x: x*x, iter2))

parallel_gen = zip(len_gen, sum_gen, sum_sq_gen)  # zip is lazy in Python 3

for ilen, sum_x, sum_x_sq in parallel_gen:
    pass    # the generators do all the work, so there's nothing for us to do here

# ilen, sum_x, sum_x_sq have the right values here (if input_iter was not empty)
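
For reference, the same idea can be wrapped in a function. This is a sketch rather than part of the original answer: `running_stats` is a hypothetical name, and `collections.deque(..., maxlen=1)` is used here as one way to drain the zipped iterator without an explicit for loop while keeping only the final tuple. It assumes Python 3.2 or later, and returns zeros for an empty input as an arbitrary choice.

```python
import collections
import itertools

def running_stats(iterable):
    """Return (count, sum, sum of squares) in a single pass."""
    iter0, iter1, iter2 = itertools.tee(iterable, 3)
    len_gen = itertools.accumulate(map(lambda x: 1, iter0))
    sum_gen = itertools.accumulate(iter1)
    sum_sq_gen = itertools.accumulate(map(lambda x: x * x, iter2))
    # zip advances the three tee branches in lockstep, so tee buffers
    # at most one item; deque(maxlen=1) keeps only the final tuple.
    last = collections.deque(zip(len_gen, sum_gen, sum_sq_gen), maxlen=1)
    return last[0] if last else (0, 0, 0)

print(running_stats(iter([1, 2, 3, 4])))  # prints (4, 10, 30)
```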

If you're using Python 2, rather than 3, you'll have to write your own accumulate generator function (there's a pure Python implementation in the itertools documentation), and use itertools.imap and itertools.izip rather than the builtin map and zip functions.
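
A hand-written accumulate might look like the following. This is a simplified sketch that only supports addition, not a copy of the version in the documentation; it is written so that it should run unchanged on both Python 2 and Python 3.

```python
def accumulate(iterable):
    """Yield running totals: accumulate([1, 2, 3]) yields 1, 3, 6."""
    it = iter(iterable)
    try:
        total = next(it)
    except StopIteration:
        return  # empty input: yield nothing
    yield total
    for element in it:
        total = total + element
        yield total

print(list(accumulate([1, 2, 3, 4])))  # prints [1, 3, 6, 10]
```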
