
Is it possible to speed up an array operation in C on a computer with a multi-core AMD CPU? I list the code below.

UDP packets arrive every 512 ns, and the computation on the array below (where the data is accumulated in the else {} branch) takes more than 512 ns. The loop keeps up when packets arrive at one-microsecond intervals. My question: can one multi-thread the array accumulation and hence speed up the computation? Currently 'top' shows that the code uses 100% CPU when packets arrive at 512 ns intervals. Thanks in advance for any inputs/suggestions.

#define NC 64
#define NP 4

void* get_next_buf(mystr_t* str, uint64_t* size)
{
 char buf0[8500];
 long long pktnum, *ptr;
 int i,j,J, offset ;
 ssize_t recsize;
 socklen_t fromlen;
 int pktdiff;

 fromlen = sizeof(str->sa);  /* recvfrom() requires fromlen initialized to the address buffer size */
 recsize = recvfrom(str->sock, (void *)buf0, 8224, 0, (struct sockaddr *)&str->sa, &fromlen);

 if (recsize < 0) {
  fprintf(stderr,"error reading udp packet\n");
  return 0;
 }
/* clear the array for accumulate*/
memset(str->data, 0, 2*NCHAN*NPOL*sizeof(short));
/* swap bytes to extract packet counter*/
ptr = (long long *)buf0;
pktnum=BSWAP_64( *ptr ) & 0x00ffffffffffffff;
// got one packet of data. If a packet was missed, return array with zeroes
pktdiff = pktnum - str->prev_pkt_cnt;
if ( pktdiff != 1){
   str->bad_pkt_cnt++;
   fprintf (stderr,"%d+",pktdiff);
   str->prev_pkt_cnt = pktnum;
   *size = 2*sizeof(short)*NC*NP;
   return (void*) str->data;
}
//packet arrived in correct order, accumulate and return the array
else {
   J = 8192/(NC*NP);
   for (i=0;i<J;i++){
     for (j=0;j<NC;j++){   /* was j=j++, which is undefined behavior */
       offset = i*NC*NP;
       ((short *)str->data)[j]      += (short)(buf0[8+j+offset]);
       ((short *)str->data)[j+64]   += (short)(buf0[8+64+j+offset]);
       ((short *)str->data)[j+128]  += (short)(buf0[8+128+j+offset]);
       ((short *)str->data)[j+192]  += (short)(buf0[8+192+j+offset]);
    }
   }

  *size = sizeof(short)*NC*NP;
  str->prev_pkt_cnt = pktnum;
  /*return the acquired data buffer */
  return (void*) str->data;
 }

}
  • Probably not easily possible! Commented Feb 24, 2015 at 7:56
  • Did you compile with -O3? Have you profiled the code to identify the hot spots? Commented Feb 24, 2015 at 8:04
  • The offset calculation can move to the outer level, out of the nested loops (though this should have been noticed by the compiler already if you used suitable optimisation settings). The same goes for NC * NP (can be precalculated). Commented Feb 24, 2015 at 8:17
  • Never write j=j++; it should just be ++j. Probably won't help, but won't hurt either. Just do it. Commented Feb 24, 2015 at 8:18
  • What is the relation between NCHAN, NPOL and NC, NP? Commented Feb 24, 2015 at 8:37

3 Answers


That's fairly tight code, but if I were desperate to make it run a little faster, and couldn't find someone more expert than I am at such things nearby, I'd try:

  1. Instead of (short *)str->data repeated four times in the j loop, establish a pointer of type short* set to str->data. Save a deref. Any halfway decent compiler should be optimizing like this anyway, but sometimes compilers aren't perfect.

  2. Also in the j loop, the (short) conversion may be taking up one opcode - sign- or zero-extending a byte value into 16 bits. If buf0 were declared as type short instead of char, you save a clock cycle. The type of buf0[] doesn't seem to matter elsewhere, being cast to other types anyway. The indexing would have to be adjusted. Be careful that doesn't involve any extra operations, or you get no improvement.

  3. Dealing with adding short ints in parallel - maybe good ol' MMX, from the 1990s, could help. Or better yet, more recent SIMD opcodes. Those like to work on adjacent pieces of data, whereas your unrolled loop doesn't work on adjacent pieces in any one iteration. Perhaps it could be unrolled some other way. Again, good compilers ought to notice situations where SIMD ops could be used, but that may take a special command line flag, and may not be happening because of the way the loop is written. Rearrange the loops to work on [j], [j+1], [j+2], [j+3] at the innermost level, and maybe you get lucky.
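To make the third point concrete, here is a minimal SSE2 sketch, not tried against the asker's data: it widens signed bytes to 16-bit lanes and adds 16 of them to a short accumulator per iteration. The helper name accumulate64() and the assumption that the samples are signed 8-bit values are mine, not from the original code:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Accumulate 64 signed bytes from src into a 64-entry short accumulator.
   Illustrative sketch: one 64-element channel block per call. */
static void accumulate64(short *acc, const signed char *src)
{
    const __m128i zero = _mm_setzero_si128();
    for (int k = 0; k < 64; k += 16) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(src + k));
        /* 0xFF for negative bytes, 0x00 otherwise: the sign-extension bytes */
        __m128i sign = _mm_cmpgt_epi8(zero, v);
        /* interleave value/sign bytes -> sign-extended 16-bit lanes */
        __m128i lo   = _mm_unpacklo_epi8(v, sign);
        __m128i hi   = _mm_unpackhi_epi8(v, sign);
        __m128i a0   = _mm_loadu_si128((__m128i *)(acc + k));
        __m128i a1   = _mm_loadu_si128((__m128i *)(acc + k + 8));
        _mm_storeu_si128((__m128i *)(acc + k),     _mm_add_epi16(a0, lo));
        _mm_storeu_si128((__m128i *)(acc + k + 8), _mm_add_epi16(a1, hi));
    }
}
```

With gcc, -O3 -msse2 (SSE2 is baseline on x86-64) and calling this once per 64-byte group would replace the four scalar adds in the inner loop.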

Two or more threads working on one huge array is often a good way to speed things up, but huge has to be way bigger than typical memory cache pages or virtual memory pages, so threads don't need write access to the same chunk of stuff. I'm not up on the latest architecture at that level, but I'm pretty sure an array of only 8000 or so bytes isn't "huge" in the needed sense.

If threads could help, the way is to have two of them alternating, taking turns as packets arrive. One finishes up while the other starts. Fine, if the one starting a new packet doesn't need the results of the other unfinished one. But then, I'm no guru with such things.
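For what it's worth, that alternating scheme can be sketched with POSIX threads: each thread owns every other packet and accumulates into its own private array, so no locking is needed until a final merge. All the names here (worker_t, NPKT, PKTLEN) are illustrative, not from the asker's code:

```c
#include <pthread.h>
#include <string.h>

#define NPKT   8      /* packets per merge cycle (illustrative) */
#define PKTLEN 256    /* payload bytes per packet (illustrative) */

static signed char packets[NPKT][PKTLEN];   /* filled by the receive path */

typedef struct { int id; short acc[PKTLEN]; } worker_t;

/* Each worker accumulates only the packets it owns (even or odd indices)
   into its own private accumulator - no shared writes, hence no locks. */
static void *worker(void *arg)
{
    worker_t *w = arg;
    memset(w->acc, 0, sizeof w->acc);
    for (int p = w->id; p < NPKT; p += 2)
        for (int k = 0; k < PKTLEN; k++)
            w->acc[k] += packets[p][k];
    return NULL;
}

/* Merge both private accumulators into the final result. */
static void merge(short *out, const worker_t *a, const worker_t *b)
{
    for (int k = 0; k < PKTLEN; k++)
        out[k] = (short)(a->acc[k] + b->acc[k]);
}
```

In the real program the packet reception would still be serial (one recvfrom at a time), with the two workers started via pthread_create and joined before the merged buffer is handed to the consumer.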


3 Comments

I like the idea of handling alternate packets between two threads. The code in front of us doesn't have an order dependency. Though some synchronization would be required to rule out 'overtaking', if that matters to the consuming application.
@DarenW: the data in the packets are all 8-bit wide and this needs to be promoted to short before accumulating. I can't see how your suggestion 2 can work .... or am I misunderstanding something?
More likely I misunderstand something, since I'm not familiar with what you're doing, looked at the code only a few minutes, and thought up some ideas without trying them out.

You have many calculations that are done on every iteration but could be done outside the loops, or defined as constants: NC*NP can be precalculated, and offset = i*NC*NP; shouldn't be in the for j loop since it does not depend on j. Also 8+j+offset is computed multiple times, when you could use a variable that you increment.

Your else should look something like this:

 // the following are defined at the top of your program
 // #define NCNP (NC*NP)
 // #define J (8192/NCNP)
 // #define SIZE (sizeof(short)*NCNP)

 offset = 8;              /* payload starts after the 8-byte packet counter */
 for (i=0;i<J;i++){
    jo = offset;
    for (j=0;j<NC;j++){
      ((short *)str->data)[j]      += (short)(buf0[jo]);
      ((short *)str->data)[j+64]   += (short)(buf0[64+jo]);
      ((short *)str->data)[j+128]  += (short)(buf0[128+jo]);
      ((short *)str->data)[j+192]  += (short)(buf0[192+jo]);
      jo++;
     }
    offset += NCNP;       /* advance after each block, so i==0 starts at byte 8 */
   }

 *size = SIZE;
 str->prev_pkt_cnt = pktnum;
 /*return the acquired data buffer */
 return (void*) str->data;

Comments


Some of the tips above, plus a change in the code, help it execute without any packet losses. The last one from @Adam had some effect - Thanks Adam!

The second change I put in was to receive two packets with every call to the get_next_buf() function, as follows:

recsize = recvfrom(str->sock, (void *)buf0, 8224, 0, (struct sockaddr *)&str->sa, &fromlen);
recsize = recvfrom(str->sock, (void *)buf1, 8224, 0, (struct sockaddr *)&str->sa, &fromlen);

Then, data from both buf0 and buf1 were accumulated in str->data[] as follows:

J = 8192/(NCNP);
memset(udp2db->data, 0, NCNP*sizeof(short));   /* clear the accumulator once, before the loops */
offset = 8;
for (i=0;i<J;++i){
  jo = offset;
  for (j=0;j<NC;++j){
    ((short *)udp2db->data)[j]      += (short)(buf0[jo]) + (short)(buf1[jo]);
    ((short *)udp2db->data)[j+64]   += (short)(buf0[64+jo]) + (short)(buf1[64+jo]);
    ((short *)udp2db->data)[j+128]  += (short)(buf0[128+jo]) + (short)(buf1[128+jo]);
    ((short *)udp2db->data)[j+192]  += (short)(buf0[192+jo]) + (short)(buf1[192+jo]);
    jo++;
  }
  offset += NCNP;
}

Since the above never seems to go beyond 83% CPU load, this should be fine. Note also that I can add data from two successive packets because they are delivered faster than I can process them individually. The main reason I want a software accumulate is to avoid overflow in the hardware, where 8-bit accumulators are used. Thanks again to all for your suggestions!
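Unrelated to the arithmetic, one more knob that can help at these packet rates is enlarging the kernel's UDP receive buffer, so short processing stalls don't drop datagrams. A minimal sketch; the helper name and the size passed to it are illustrative, and on Linux the effective size is also capped by net.core.rmem_max:

```c
#include <stdio.h>
#include <sys/socket.h>

/* Ask the kernel for a larger socket receive buffer; returns 0 on success. */
static int grow_rcvbuf(int sock, int bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    return 0;
}
```

Called once right after the socket is created, e.g. grow_rcvbuf(str->sock, 4*1024*1024), this buys slack for bursts without touching the hot loop.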

1 Comment

I'd like to add that I'll soon try the multi-threaded version, as I do have plans to decrease the dump time to 256 ns.
