
Is it possible to speed up an array operation in C on a computer with a multi-core AMD CPU? I list the code below.

UDP packets arrive every 512 ns, and the computation on the array below (where the data is accumulated in the else {} branch) takes more than 512 ns. The loop keeps up when packets arrive at one-microsecond intervals. My question: can one multi-thread the array accumulation and hence speed up the computation? Currently 'top' shows that the code uses 100% CPU when packets arrive at 512 ns intervals. Thanks in advance for any inputs/suggestions.

#define NC 64
#define NP 4

void* get_next_buf(mystr_t* str, uint64_t* size)
{
 char buf0[8500];
 long long pktnum, *ptr;
 int i,j,J, offset ;
 ssize_t recsize;
 socklen_t fromlen;
 int pktdiff;

 fromlen = sizeof(str->sa);  /* recvfrom() requires fromlen initialized to the address buffer size */
 recsize = recvfrom(str->sock, (void *)buf0, 8224, 0, (struct sockaddr *)&str->sa, &fromlen);

 if (recsize < 0) {
  fprintf(stderr,"error reading udp packet\n");
  return 0;
 }
/* clear the array for accumulate*/
memset(str->data, 0, 2*NCHAN*NPOL*sizeof(short));
/* swap bytes to extract packet counter*/
ptr = (long long *)buf0;
pktnum=BSWAP_64( *ptr ) & 0x00ffffffffffffff;
// got one packet of data. If a packet was missed, return array with zeroes
pktdiff = pktnum - str->prev_pkt_cnt;
if ( pktdiff != 1){
   str->bad_pkt_cnt++;
   fprintf (stderr,"%d+",pktdiff);
   str->prev_pkt_cnt = pktnum;
   *size = 2*sizeof(short)*NC*NP;
   return (void*) str->data;
}
//packet arrived in correct order, accumulate and return the array
else {
   J = 8192/(NC*NP);
   for (i=0;i<J;i++){
     for (j=0;j<NC;j++){   /* was j=j++, which is undefined behavior */
       offset = i*NC*NP;
       ((short *)str->data)[j]      += (short)(buf0[8+j+offset]);
       ((short *)str->data)[j+64]   += (short)(buf0[8+64+j+offset]);
       ((short *)str->data)[j+128]  += (short)(buf0[8+128+j+offset]);
       ((short *)str->data)[j+192]  += (short)(buf0[8+192+j+offset]);
    }
   }

  *size = sizeof(short)*NC*NP;
  str->prev_pkt_cnt = pktnum;
  /*return the acquired data buffer */
  return (void*) str->data;
 }

}
  • Probably not easily possible! Commented Feb 24, 2015 at 7:56
  • Did you compile with -O3? Have you profiled the code to identify the hot spots? Commented Feb 24, 2015 at 8:04
  • The offset calculation can move to the outer level, out of the nested loops (though this should have been noticed by the compiler already if you used suitable optimisation settings). The same goes for NC * NP (can be precalculated). Commented Feb 24, 2015 at 8:17
  • Never write j=j++; it should just be ++j. Probably won't help, but won't hurt either. Just do it. Commented Feb 24, 2015 at 8:18
  • What is the relation between NCHAN, NPOL and NC, NP? Commented Feb 24, 2015 at 8:37

3 Answers


That's fairly tight code, but if I were desperate to make it run a little faster, and couldn't find someone more expert than I am at such things nearby, I'd try:

  1. Instead of (short *)str->data repeated four times in the j loop, establish a pointer of type short* set to str->data. Save a deref. Any halfway decent compiler should be optimizing like this anyway, but sometimes compilers aren't perfect.

  2. Also in the j loop, the (short) conversion may be taking up one opcode - sign- or zero-extending a byte value into 16 bits. If buf0 were declared as type short instead of char, you save a clock cycle. The type of buf0[] doesn't seem to matter elsewhere, being cast to other types anyway. The indexing would have to be adjusted. Be careful that doesn't involve any extra operations, or you get no improvement.

  3. Dealing with adding short ints in parallel - maybe good ol' MMX, from the 1990s, could help. Or better yet, more recent SIMD opcodes. Those like to work on adjacent pieces of data, whereas your unrolled loop doesn't work on adjacent pieces in any one iteration. Perhaps it could be unrolled some other way. Again, good compilers ought to notice situations where SIMD ops could be used, but that may take a special command line flag, and may not be happening because of the way the loop is written. Rearrange the loops to work on [j], [j+1], [j+2], [j+3] at the innermost level, and maybe you get lucky.
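To make the third point concrete, here is a minimal SSE2 sketch, not tried against the asker's data: it widens signed bytes to 16-bit lanes and adds 16 of them to a short accumulator per iteration. The helper name accumulate64() and the assumption that the samples are signed 8-bit values are mine, not from the original code:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Accumulate 64 signed bytes from src into a 64-entry short accumulator.
   Illustrative sketch: one 64-element channel block per call. */
static void accumulate64(short *acc, const signed char *src)
{
    const __m128i zero = _mm_setzero_si128();
    for (int k = 0; k < 64; k += 16) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(src + k));
        /* 0xFF for negative bytes, 0x00 otherwise: the sign-extension bytes */
        __m128i sign = _mm_cmpgt_epi8(zero, v);
        /* interleave value/sign bytes -> sign-extended 16-bit lanes */
        __m128i lo   = _mm_unpacklo_epi8(v, sign);
        __m128i hi   = _mm_unpackhi_epi8(v, sign);
        __m128i a0   = _mm_loadu_si128((__m128i *)(acc + k));
        __m128i a1   = _mm_loadu_si128((__m128i *)(acc + k + 8));
        _mm_storeu_si128((__m128i *)(acc + k),     _mm_add_epi16(a0, lo));
        _mm_storeu_si128((__m128i *)(acc + k + 8), _mm_add_epi16(a1, hi));
    }
}
```

With gcc, -O3 -msse2 (SSE2 is baseline on x86-64) and calling this once per 64-byte group would replace the four scalar adds in the inner loop.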

Two or more threads working on one huge array is often a good way to speed things up, but huge has to be way bigger than typical memory cache pages or virtual memory pages, so threads don't need write access to the same chunk of stuff. I'm not up on the latest architecture at that level, but I'm pretty sure an array of only 8000 or so bytes isn't "huge" in the needed sense.

If threads could help, the way is to have two of them alternating, taking turns as packets arrive. One finishes up while the other starts. Fine, if the one starting a new packet doesn't need the results of the other unfinished one. But then, I'm no guru with such things.
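For what it's worth, that alternating scheme can be sketched with POSIX threads: each thread owns every other packet and accumulates into its own private array, so no locking is needed until a final merge. All the names here (worker_t, NPKT, PKTLEN) are illustrative, not from the asker's code:

```c
#include <pthread.h>
#include <string.h>

#define NPKT   8      /* packets per merge cycle (illustrative) */
#define PKTLEN 256    /* payload bytes per packet (illustrative) */

static signed char packets[NPKT][PKTLEN];   /* filled by the receive path */

typedef struct { int id; short acc[PKTLEN]; } worker_t;

/* Each worker accumulates only the packets it owns (even or odd indices)
   into its own private accumulator - no shared writes, hence no locks. */
static void *worker(void *arg)
{
    worker_t *w = arg;
    memset(w->acc, 0, sizeof w->acc);
    for (int p = w->id; p < NPKT; p += 2)
        for (int k = 0; k < PKTLEN; k++)
            w->acc[k] += packets[p][k];
    return NULL;
}

/* Merge both private accumulators into the final result. */
static void merge(short *out, const worker_t *a, const worker_t *b)
{
    for (int k = 0; k < PKTLEN; k++)
        out[k] = (short)(a->acc[k] + b->acc[k]);
}
```

In the real program the packet reception would still be serial (one recvfrom at a time), with the two workers started via pthread_create and joined before the merged buffer is handed to the consumer.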


3 Comments

I like the idea of handling alternate packets between two threads. The code in front of us doesn't have an order dependency. Though some synchronization would be required to rule out 'overtaking', if that matters to the consuming application.
@DarenW: the data in the packets are all 8-bit wide and this needs to be promoted to short before accumulating. I can't see how your suggestion 2 can work .... or am I misunderstanding something?
More likely I misunderstand something, since I'm not familiar with what you're doing, looked at the code only a few minutes, and thought up some ideas without trying them out.

You have many calculations that are done on every iteration but could be done outside the loops, or defined as constants: NC*NP can be precalculated, and offset = i*NC*NP; shouldn't be in the for j loop since it does not depend on j. Also 8+j+offset is computed multiple times, when you could use a variable that you increment.

Your else should look something like this:

 // the following are defined at the top of your program
 // #define NCNP (NC*NP)
 // #define J (8192/NCNP)
 // #define SIZE (sizeof(short)*NCNP)

 offset = 8;              /* payload starts after the 8-byte packet counter */
 for (i=0;i<J;i++){
    jo = offset;
    for (j=0;j<NC;j++){
      ((short *)str->data)[j]      += (short)(buf0[jo]);
      ((short *)str->data)[j+64]   += (short)(buf0[64+jo]);
      ((short *)str->data)[j+128]  += (short)(buf0[128+jo]);
      ((short *)str->data)[j+192]  += (short)(buf0[192+jo]);
      jo++;
     }
    offset += NCNP;       /* advance after each block, so i==0 starts at byte 8 */
   }

 *size = SIZE;
 str->prev_pkt_cnt = pktnum;
 /*return the acquired data buffer */
 return (void*) str->data;

Comments


Some of the tips above, plus a change in the code, help it execute without any packet losses. The last one from @Adam had some effect - Thanks Adam!

The second change I put in was to receive two packets with every call to the get_next_buf() function, as follows:

recsize = recvfrom(str->sock, (void *)buf0, 8224, 0, (struct sockaddr *)&str->sa, &fromlen);
recsize = recvfrom(str->sock, (void *)buf1, 8224, 0, (struct sockaddr *)&str->sa, &fromlen);

Then, data from both buf0 and buf1 were accumulated in str->data[] as follows:

J = 8192/(NCNP);
memset(udp2db->data, 0, NCNP*sizeof(short));   /* clear the accumulator once, before the loops */
offset = 8;
for (i=0;i<J;++i){
  jo = offset;
  for (j=0;j<NC;++j){
    ((short *)udp2db->data)[j]      += (short)(buf0[jo]) + (short)(buf1[jo]);
    ((short *)udp2db->data)[j+64]   += (short)(buf0[64+jo]) + (short)(buf1[64+jo]);
    ((short *)udp2db->data)[j+128]  += (short)(buf0[128+jo]) + (short)(buf1[128+jo]);
    ((short *)udp2db->data)[j+192]  += (short)(buf0[192+jo]) + (short)(buf1[192+jo]);
    jo++;
  }
  offset += NCNP;
}

Since the above never seems to go beyond 83% CPU load, this should be fine. Note also that I can add data from two successive packets because they are delivered faster than I can process them individually. The main reason I want a software accumulate is to avoid overflow in the hardware, where 8-bit accumulators are used. Thanks again to all for your suggestions!
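Unrelated to the arithmetic, one more knob that can help at these packet rates is enlarging the kernel's UDP receive buffer, so short processing stalls don't drop datagrams. A minimal sketch; the helper name and the size passed to it are illustrative, and on Linux the effective size is also capped by net.core.rmem_max:

```c
#include <stdio.h>
#include <sys/socket.h>

/* Ask the kernel for a larger socket receive buffer; returns 0 on success. */
static int grow_rcvbuf(int sock, int bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    return 0;
}
```

Called once right after the socket is created, e.g. grow_rcvbuf(str->sock, 4*1024*1024), this buys slack for bursts without touching the hot loop.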

1 Comment

I'd like to add that I'll soon try the multi-threaded version, as I do have plans to decrease the dump time to 256 ns.
