5

There are two very big series of elements, the second 100 times bigger than the first. For each element of the first series, there are 0 or more elements in the second series. This can be traversed and processed with two nested loops, but the unpredictability of the number of matching elements for each member of the first series makes things very, very slow.

The actual processing of the second series of elements involves a bitwise AND (&) and a population count.

I couldn't find good optimizations in C, but I am considering doing inline asm, doing rep* mov* or similar for each element of the first series and then batch-processing the matching bytes of the second series, perhaps in buffers of 1 MB or so. But the code would get quite messy.

Does anybody know of a better way? C preferred but x86 ASM OK too. Many thanks!

Sample/demo code with a simplified problem: for clarity's sake, the first series are "people" and the second series are "events". (The original problem is actually 100m and 10,000m entries!)

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h> // for PRIu64 in printf

#define PEOPLE 1000000    //   1m
struct Person {
    uint8_t age;   // Filtering condition
    uint8_t cnt;   // Number of events for this person in E
} P[PEOPLE]; // Each has 0 or more bytes with bit flags

#define EVENTS 100000000  // 100m
uint8_t P1[EVENTS]; // Property 1 flags
uint8_t P2[EVENTS]; // Property 2 flags

void init_arrays() {
    for (int i = 0; i < PEOPLE; i++) { // just some stuff
        P[i].age = i & 0x07;
        P[i].cnt = i % 200; // keeps the total event count below EVENTS
    }
    for (int i = 0; i < EVENTS; i++) {
        P1[i]    = i % 7;  // just some stuff
        P2[i]    = i % 9;  // just some other stuff
    }
}

int main(int argc, char *argv[])
{
    uint64_t   sum = 0, fcur = 0;

    int age_filter = 7; // just some

    init_arrays();      // Init P, P1, P2

    for (int64_t p = 0; p < PEOPLE ; p++) {
        if (P[p].age < age_filter) {
            for (int64_t e = 0; e < P[p].cnt ; e++, fcur++)
                sum += __builtin_popcount( P1[fcur] & P2[fcur] );
        } else {
            fcur += P[p].cnt; // skip this person's events
        }
    }

    printf("(dummy %ld %ld)\n", sum, fcur );

    return 0;
}

gcc -O5 -march=native -std=c99 test.c -o test
8 Comments

  • You're probably gonna be bound by memory more than anything else... Commented Nov 10, 2012 at 23:32
  • Yup. With fixed sizes, yes, and that'd be my goal. But with variable-size matches it is much slower due to the inner loop (according to valgrind). There are also branch mispredictions on the condition, but those would be easier to get rid of, I think. Commented Nov 10, 2012 at 23:35
  • Have a look at Loop Tiling. That's probably gonna give you the most improvement. Commented Nov 10, 2012 at 23:41
  • Just a guess, but you could try doing your popcount before your loop, as it looks like every element pair P1[i] and P2[i] is processed. This would allow all popcounts to be done at once, which might give you some time gain. Commented Nov 11, 2012 at 13:16
  • @alecco Are you sure that the sample code does not need an else branch for the case when the person does not pass the age filter? Shouldn't there be an fcur += P[p].cnt? Otherwise, the inner loop would be consuming someone else's events... Commented Nov 13, 2012 at 17:13

7 Answers

4
+50

Since on average you get 100 items per person, you can speed things up by processing multiple bytes at a time. I re-arranged the code slightly in order to use pointers instead of indexes, and replaced one loop by two loops:

uint8_t *p1 = P1, *p2 = P2;
for (int64_t p = 0; p < PEOPLE ; p++) {
    if (P[p].age < age_filter) {
        int64_t e = P[p].cnt;
        for ( ; e >= 8 ; e -= 8) { // 8 bytes at a time with a 64-bit popcount
            sum += __builtin_popcountll( *((long long*)p1) & *((long long*)p2) );
            p1 += 8;
            p2 += 8;
        }
        for ( ; e ; e--) { // remaining 0-7 bytes
            sum += __builtin_popcount( *p1++ & *p2++ );
        }
    } else {
        p1 += P[p].cnt;
        p2 += P[p].cnt;
    }
}

In my testing this speeds up your code from 1.515s to 0.855s.

4 Comments

+1. Wider data type is better. I quickly ran the two versions and it gives a 2x improvement right away. Though isn't it supposed to be __builtin_popcountll? I'd also consider using 128-bit registers.
@VladLazarenko Ah, you are absolutely right about the __builtin_popcountll! Thank you very much, this is now fixed.
Good one. The distribution isn't normal, regretfully, and the median will likely be well below 64 bits. A similar optimization is to "round up" to 16-, 32- or 64-bit chunks per person, but it is a huge memory tradeoff since there are so many people with only a very few events (e.g. 0 or 1). upvote +bounty
This is the closest to an answer. Thanks.
2

The answer by Neil doesn't require sorting by age, which btw could be a good idea --

If the second loop has holes (please correct the original source code to support that idea), a common solution is to compute cumsum[n+1] = cumsum[n] + __builtin_popcount(P1[n] & P2[n]);
Then for each person: sum += cumsum[fcur + P[p].cnt] - cumsum[fcur];
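
A minimal sketch of that cumulative-sum idea, assuming the question's globals (P, P1, P2, PEOPLE, EVENTS); the cumsum array and the query() helper are illustrative names, and cumsum costs roughly 400 MB of uint32_t at the sample sizes:

static uint32_t cumsum[EVENTS + 1]; // prefix sums of popcount(P1[e] & P2[e]); the total fits in 32 bits

void build_cumsum(void) {
    cumsum[0] = 0;
    for (int64_t e = 0; e < EVENTS; e++)
        cumsum[e + 1] = cumsum[e] + (uint32_t)__builtin_popcount(P1[e] & P2[e]);
}

uint64_t query(int age_filter) {
    uint64_t sum = 0, fcur = 0;
    for (int64_t p = 0; p < PEOPLE; p++) {
        if (P[p].age < age_filter)
            sum += cumsum[fcur + P[p].cnt] - cumsum[fcur]; // whole person in O(1)
        fcur += P[p].cnt; // advance past this person's events either way
    }
    return sum;
}

Building cumsum is a single pass over the events, so repeated queries with different filters afterwards only touch the (much smaller) people array.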

Anyway it seems that the computational burden is merely of order EVENTS, not EVENTS*PEOPLE. Some optimization can still take place by running the inner loop once over all the consecutive people meeting the condition.

If there are really at most 8 predicates, it could make sense to precalculate all the sums (_popcounts(predicate[0..255])) for each person into separate arrays C[256][PEOPLE]. That just about doubles the memory requirements (on disk?), but localizes the search from 10GB+10GB+...+10GB (8 predicates) to one stream of 200MB (assuming 16-bit entries).

Depending on the probability of p(P[i].age < condition && P[i].height < cond2), it may no longer make sense to calculate cumulative sums. Maybe, maybe not. More likely just some SSE parallelism, 8 or 16 people at a time, will do.

2 Comments

Actually performance is the major driver for this, and it has to fit in RAM. And the "&" was a simplification; it could be !(a) & (b) or other combinations. I have that solved with macros and a table of function pointers; I'd rather have 100k of functions than 1000GB of storage :) But thanks for your thoughtful answer, have an upvote!
Your system has at least 80GB of ram?
2

A completely new approach could be to use ROBDDs to encode the truth tables of each person / each event. First, if the event tables are not very random and do not consist of pathological functions such as the truth tables of bignum multiplication, one may achieve compression of the functions; second, arithmetic operations on truth tables can be calculated in compressed form. Each subtree can be shared between users, and each arithmetic operation for two identical subtrees has to be calculated only once.

Comments

1

I don't know if your sample code accurately reflects your problem, but it can be rewritten like this:

for (int64_t p = 0; p < PEOPLE ; p++)
    if (P[p].age < age_filter)
        fcur += P[p].cnt;

for (int64_t e = 0; e < fcur ; e++)
    sum += __builtin_popcount( P1[e] & P2[e] );

1 Comment

Perhaps I wasn't clear. The first series ("people") is not ordered by "age", and in fact there is more than one property (e.g. "height"). So when running the code, there will be gaps in the second series, belonging to the elements of the first series that do not match the search criteria. But thanks for the answer, have an upvote anyway.
0

I don't know about gcc -O5 (it doesn't seem to be documented) and it seems to produce exactly the same code as gcc -O3 with my gcc 4.5.4 (though this was only tested on a relatively small code sample).

Depending on what you want to achieve, -O3 can be slower than -O2.

As for your problem, I'd suggest thinking more about your data structure than the actual algorithm. You should not focus on solving the problem with an adequate algorithm/code optimisation as long as your data aren't represented in a convenient manner.

If you want to quickly cut a large set of your data based on a single criterion (here, age in your example), I'd recommend using a variant of a sorted tree.
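
For what it's worth, here is a minimal sketch of that idea, using a flat index sorted with qsort rather than an actual tree, and assuming the question's globals; off[], by_age[], build_index() and query() are illustrative names:

#include <stdlib.h> // qsort

static int64_t off[PEOPLE];     // start offset of each person's events in P1/P2
static int32_t by_age[PEOPLE];  // person indices, sorted by age

static int cmp_age(const void *a, const void *b) {
    const struct Person *pa = &P[*(const int32_t *)a], *pb = &P[*(const int32_t *)b];
    return (int)pa->age - (int)pb->age;
}

void build_index(void) {
    int64_t f = 0;
    for (int64_t p = 0; p < PEOPLE; p++) {
        off[p] = f;
        f += P[p].cnt;
        by_age[p] = (int32_t)p;
    }
    qsort(by_age, PEOPLE, sizeof by_age[0], cmp_age);
}

uint64_t query(int age_filter) {
    uint64_t sum = 0;
    // the index is sorted by age, so stop at the first person that fails the filter
    for (int64_t i = 0; i < PEOPLE && P[by_age[i]].age < age_filter; i++) {
        int64_t p = by_age[i], f = off[p];
        for (int64_t e = 0; e < P[p].cnt; e++, f++)
            sum += __builtin_popcount(P1[f] & P2[f]);
    }
    return sum;
}

The events stay in their original order; only the iteration over people follows the sorted index, so the age filter becomes a predictable loop-exit test instead of a per-person misprediction (at the cost of a less sequential access pattern into P1/P2).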

2 Comments

All optimization options above -O3 are equivalent to -O3 in GCC (in other compilers they might not be).
The O5 was a slip, yes, apologies. Yet it was in the name of giving everything up front to help.
0

If your actual data (age, count, etc.) is indeed 8-bit, there is probably a lot of redundancy in the calculations. In this case you can replace the processing with lookup tables - for each 8-bit value you'll have 256 possible outputs, and instead of computing you might simply read the precomputed data from the table.
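
A minimal sketch of what such a table could look like for the question's inner loop (the 64 KiB pc_and table is an illustrative choice, not something specified in the answer; whether it beats a hardware popcnt instruction depends on the CPU and on cache pressure):

static uint8_t pc_and[256][256]; // pc_and[a][b] == popcount(a & b), 64 KiB total

void build_pc_table(void) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            pc_and[a][b] = (uint8_t)__builtin_popcount(a & b);
}

/* In the inner loop, the AND and the popcount collapse into one lookup:
       sum += pc_and[P1[fcur]][P2[fcur]];                                  */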

1 Comment

Queries can be multi-column, for example older than 21 and taller than 6ft.
0

To tackle the branch mispredictions (missing in other answers), the code could do something like:

#ifdef MISPREDICTIONS
if (cond)
    sum += value;
#else
mask = -(uint64_t)(cond != 0); // cond false: mask = 0; cond true: mask = all ones (0xFF...FF)
sum += (value & mask);         // adds value only when cond holds, without a branch
#endif

It's not completely free since there are data dependencies (think superscalar CPUs). But it usually gets a 10x boost for mostly unpredictable conditions.
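
As a hedged illustration, this is roughly how the trick could map onto the question's loop: every person's events are processed, and the mask zeroes out the contribution of people who fail the filter, so it only pays off when the misprediction cost outweighs the extra popcount work.

uint64_t sum = 0, fcur = 0;
for (int64_t p = 0; p < PEOPLE; p++) {
    // all ones if the person passes the age filter, zero otherwise
    uint64_t mask = -(uint64_t)(P[p].age < age_filter);
    for (int64_t e = 0; e < P[p].cnt; e++, fcur++)
        sum += (uint64_t)__builtin_popcount(P1[fcur] & P2[fcur]) & mask;
}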

Comments
