5

There are two very big series of elements, the second 100 times bigger than the first. For each element of the first series, there are 0 or more elements in the second series. This can be traversed and processed with two nested loops, but the unpredictability of the number of matching elements for each member of the first series makes things very, very slow.

The actual processing of the second series of elements involves a bitwise AND (&) and a population count.

I couldn't find good optimizations in C, but I am considering doing inline asm, doing rep* mov* or similar for each element of the first series and then batch-processing the matching bytes of the second series, perhaps in buffers of 1 MB or so. But the code would get quite messy.

Does anybody know of a better way? C preferred but x86 ASM OK too. Many thanks!

Sample/demo code with a simplified problem: for clarity's sake, the first series are "people" and the second series are "events". (The original problem is actually 100m and 10,000m entries!)

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h> // for PRIu64 in printf

#define PEOPLE 1000000    //   1m
struct Person {
    uint8_t age;   // Filtering condition
    uint8_t cnt;   // Number of events for this person in E
} P[PEOPLE]; // Each has 0 or more bytes with bit flags

#define EVENTS 100000000  // 100m
uint8_t P1[EVENTS]; // Property 1 flags
uint8_t P2[EVENTS]; // Property 2 flags

void init_arrays() {
    for (int i = 0; i < PEOPLE; i++) { // just some stuff
        P[i].age = i & 0x07;
        P[i].cnt = i % 200; // keeps the total event count below EVENTS
    }
    for (int i = 0; i < EVENTS; i++) {
        P1[i]    = i % 7;  // just some stuff
        P2[i]    = i % 9;  // just some other stuff
    }
}

int main(int argc, char *argv[])
{
    uint64_t   sum = 0, fcur = 0;

    int age_filter = 7; // just some

    init_arrays();      // Init P, P1, P2

    for (int64_t p = 0; p < PEOPLE ; p++) {
        if (P[p].age < age_filter) {
            for (int64_t e = 0; e < P[p].cnt ; e++, fcur++)
                sum += __builtin_popcount( P1[fcur] & P2[fcur] );
        } else {
            fcur += P[p].cnt; // skip this person's events
        }
    }

    printf("(dummy %ld %ld)\n", sum, fcur );

    return 0;
}

gcc -O5 -march=native -std=c99 test.c -o test
8 Comments

  • You're probably gonna be bound by memory more than anything else... Commented Nov 10, 2012 at 23:32
  • Yup. With fixed sizes, yes, and that'd be my goal. But with variable-size matches it is much slower due to the inner loop (according to valgrind). There are also branch mispredictions on the condition, but those would be easier to get rid of, I think. Commented Nov 10, 2012 at 23:35
  • Have a look at Loop Tiling. That's probably gonna give you the most improvement. Commented Nov 10, 2012 at 23:41
  • Just a guess, but you could try doing your popcount before your loop, as it looks like every element pair P1[i] and P2[i] is processed. This would allow all popcounts to be done at once, which might give you some time gain. Commented Nov 11, 2012 at 13:16
  • @alecco Are you sure that the sample code does not need an else branch for the case when the person does not pass the age filter? Shouldn't there be an fcur += P[p].cnt? Otherwise, the inner loop would be consuming someone else's events... Commented Nov 13, 2012 at 17:13

7 Answers

4
+50

Since on average you get 100 items per person, you can speed things up by processing multiple bytes at a time. I re-arranged the code slightly in order to use pointers instead of indexes, and replaced one loop by two loops:

uint8_t *p1 = P1, *p2 = P2;
for (int64_t p = 0; p < PEOPLE ; p++) {
    if (P[p].age < age_filter) {
        int64_t e = P[p].cnt;
        for ( ; e >= 8 ; e -= 8) { // 8 bytes at a time with a 64-bit popcount
            sum += __builtin_popcountll( *((long long*)p1) & *((long long*)p2) );
            p1 += 8;
            p2 += 8;
        }
        for ( ; e ; e--) { // remaining 0-7 bytes
            sum += __builtin_popcount( *p1++ & *p2++ );
        }
    } else {
        p1 += P[p].cnt;
        p2 += P[p].cnt;
    }
}

In my testing this speeds up your code from 1.515s to 0.855s.

4 Comments

+1. Wider data type is better. I quickly ran the two versions and it gives a 2x improvement right away. Though isn't it supposed to be __builtin_popcountll? I'd also consider using 128-bit registers.
@VladLazarenko Ah, you are absolutely right about the __builtin_popcountll! Thank you very much, this is now fixed.
Good one. The distribution isn't normal, regretfully, and the median will likely be well below 64 bits. A similar optimization is to "round up" to 16-, 32- or 64-bit chunks per person, but it is a huge memory tradeoff since there are so many people with only a very few events (e.g. 0 or 1). upvote +bounty
This is the closest to an answer. Thanks.
2

The answer by Neil doesn't require sorting by age, which btw could be a good idea --

If the second loop has holes (please correct the original source code to support that idea), a common solution is to compute cumsum[n+1] = cumsum[n] + __builtin_popcount(P1[n] & P2[n]);
Then for each person: sum += cumsum[fcur + P[p].cnt] - cumsum[fcur];
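
A minimal sketch of that cumulative-sum idea, assuming the question's globals (P, P1, P2, PEOPLE, EVENTS); the cumsum array and the query() helper are illustrative names, and cumsum costs roughly 400 MB of uint32_t at the sample sizes:

static uint32_t cumsum[EVENTS + 1]; // prefix sums of popcount(P1[e] & P2[e]); the total fits in 32 bits

void build_cumsum(void) {
    cumsum[0] = 0;
    for (int64_t e = 0; e < EVENTS; e++)
        cumsum[e + 1] = cumsum[e] + (uint32_t)__builtin_popcount(P1[e] & P2[e]);
}

uint64_t query(int age_filter) {
    uint64_t sum = 0, fcur = 0;
    for (int64_t p = 0; p < PEOPLE; p++) {
        if (P[p].age < age_filter)
            sum += cumsum[fcur + P[p].cnt] - cumsum[fcur]; // whole person in O(1)
        fcur += P[p].cnt; // advance past this person's events either way
    }
    return sum;
}

Building cumsum is a single pass over the events, so repeated queries with different filters afterwards only touch the (much smaller) people array.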

Anyway it seems that the computational burden is merely of order EVENTS, not EVENTS*PEOPLE. Some optimization can still take place by running the inner loop once over all the consecutive people meeting the condition.

If there are really at most 8 predicates, it could make sense to precalculate all the sums (_popcounts(predicate[0..255])) for each person into separate arrays C[256][PEOPLE]. That just about doubles the memory requirements (on disk?), but localizes the search from 10GB+10GB+...+10GB (8 predicates) to one stream of 200MB (assuming 16-bit entries).

Depending on the probability of p(P[i].age < condition && P[i].height < cond2), it may no longer make sense to calculate cumulative sums. Maybe, maybe not. More likely just some SSE parallelism, 8 or 16 people at a time, will do.

2 Comments

Actually performance is the major driver for this, and it has to fit in RAM. And the "&" was a simplification; it could be !(a) & (b) or other combinations. I have that solved with macros and a table of function pointers; I'd rather have 100k of functions than 1000GB of storage :) But thanks for your thoughtful answer, have an upvote!
Your system has at least 80GB of ram?
2

A completely new approach could be to use ROBDDs to encode the truth tables of each person / each event. First, if the event tables are not very random and do not consist of pathological functions such as the truth tables of bignum multiplication, one may achieve compression of the functions; second, arithmetic operations on truth tables can be calculated in compressed form. Each subtree can be shared between users, and each arithmetic operation for two identical subtrees has to be calculated only once.

Comments

1

I don't know if your sample code accurately reflects your problem, but it can be rewritten like this:

for (int64_t p = 0; p < PEOPLE ; p++)
    if (P[p].age < age_filter)
        fcur += P[p].cnt;

for (int64_t e = 0; e < fcur ; e++)
    sum += __builtin_popcount( P1[e] & P2[e] );

1 Comment

Perhaps I wasn't clear. The first series ("people") is not ordered by "age", and in fact there is more than one property (e.g. "height"). So when running the code, there will be gaps in the second series, belonging to the elements of the first series that do not match the search criteria. But thanks for the answer, have an upvote anyway.
0

I don't know about gcc -O5 (it doesn't seem to be documented) and it seems to produce exactly the same code as gcc -O3 with my gcc 4.5.4 (though this was only tested on a relatively small code sample).

Depending on what you want to achieve, -O3 can be slower than -O2.

As for your problem, I'd suggest thinking more about your data structure than the actual algorithm. You should not focus on solving the problem with an adequate algorithm/code optimisation as long as your data aren't represented in a convenient manner.

If you want to quickly cut a large set of your data based on a single criterion (here, age in your example), I'd recommend using a variant of a sorted tree.
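
For what it's worth, here is a minimal sketch of that idea, using a flat index sorted with qsort rather than an actual tree, and assuming the question's globals; off[], by_age[], build_index() and query() are illustrative names:

#include <stdlib.h> // qsort

static int64_t off[PEOPLE];     // start offset of each person's events in P1/P2
static int32_t by_age[PEOPLE];  // person indices, sorted by age

static int cmp_age(const void *a, const void *b) {
    const struct Person *pa = &P[*(const int32_t *)a], *pb = &P[*(const int32_t *)b];
    return (int)pa->age - (int)pb->age;
}

void build_index(void) {
    int64_t f = 0;
    for (int64_t p = 0; p < PEOPLE; p++) {
        off[p] = f;
        f += P[p].cnt;
        by_age[p] = (int32_t)p;
    }
    qsort(by_age, PEOPLE, sizeof by_age[0], cmp_age);
}

uint64_t query(int age_filter) {
    uint64_t sum = 0;
    // the index is sorted by age, so stop at the first person that fails the filter
    for (int64_t i = 0; i < PEOPLE && P[by_age[i]].age < age_filter; i++) {
        int64_t p = by_age[i], f = off[p];
        for (int64_t e = 0; e < P[p].cnt; e++, f++)
            sum += __builtin_popcount(P1[f] & P2[f]);
    }
    return sum;
}

The events stay in their original order; only the iteration over people follows the sorted index, so the age filter becomes a predictable loop-exit test instead of a per-person misprediction (at the cost of a less sequential access pattern into P1/P2).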

2 Comments

All optimization options above -O3 are equivalent to -O3 in GCC (in other compilers they might not be).
The O5 was a slip, yes, apologies. Yet it was in the name of giving everything up front to help.
0

If your actual data (age, count, etc.) is indeed 8-bit, there is probably a lot of redundancy in the calculations. In this case you can replace the processing with lookup tables - for each 8-bit value you'll have 256 possible outputs, and instead of computing you might simply read the precomputed data from the table.
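
A minimal sketch of what such a table could look like for the question's inner loop (the 64 KiB pc_and table is an illustrative choice, not something specified in the answer; whether it beats a hardware popcnt instruction depends on the CPU and on cache pressure):

static uint8_t pc_and[256][256]; // pc_and[a][b] == popcount(a & b), 64 KiB total

void build_pc_table(void) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            pc_and[a][b] = (uint8_t)__builtin_popcount(a & b);
}

/* In the inner loop, the AND and the popcount collapse into one lookup:
       sum += pc_and[P1[fcur]][P2[fcur]];                                  */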

1 Comment

Queries can be multi-column, for example older than 21 and taller than 6ft.
0

To tackle the branch mispredictions (missing in other answers), the code could do something like:

#ifdef MISPREDICTIONS
if (cond)
    sum += value;
#else
mask = -(uint64_t)(cond != 0); // cond false: mask = 0; cond true: mask = all ones (0xFF...FF)
sum += (value & mask);         // adds value only when cond holds, without a branch
#endif

It's not completely free since there are data dependencies (think superscalar CPUs). But it usually gets a 10x boost for mostly unpredictable conditions.
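
As a hedged illustration, this is roughly how the trick could map onto the question's loop: every person's events are processed, and the mask zeroes out the contribution of people who fail the filter, so it only pays off when the misprediction cost outweighs the extra popcount work.

uint64_t sum = 0, fcur = 0;
for (int64_t p = 0; p < PEOPLE; p++) {
    // all ones if the person passes the age filter, zero otherwise
    uint64_t mask = -(uint64_t)(P[p].age < age_filter);
    for (int64_t e = 0; e < P[p].cnt; e++, fcur++)
        sum += (uint64_t)__builtin_popcount(P1[fcur] & P2[fcur]) & mask;
}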

Comments
