I'm working on a homework assignment and have been stuck on my solution for hours. The task is to optimize the following code so that it runs faster, no matter how messy the result becomes. We're supposed to use techniques such as cache blocking and loop unrolling.
Problem:
// transpose a dim x dim matrix into dst by swapping all i,j with j,i
void transpose(int *dst, int *src, int dim) {
    int i, j;
    for(i = 0; i < dim; i++) {
        for(j = 0; j < dim; j++) {
            dst[j*dim + i] = src[i*dim + j];
        }
    }
}
What I have so far:
//attempt 1
void transpose(int *dst, int *src, int dim) {
    int i, j, id, jd;
    id = 0;
    for(i = 0; i < dim; i++, id+=dim) {
        jd = 0;
        for(j = 0; j < dim; j++, jd+=dim) {
            dst[jd + i] = src[id + j];
        }
    }
}
//attempt 2
void transpose(int *dst, int *src, int dim) {
    int i, j, id;
    int *pd, *ps;
    id = 0;
    for(i = 0; i < dim; i++, id+=dim) {
        pd = dst + i;
        ps = src + id;
        for(j = 0; j < dim; j++) {
            *pd = *ps++;
            pd += dim;
        }
    }
}
Some ideas, please correct me if I'm wrong:
I have thought about loop unrolling, but I don't think it would help, because we don't know whether the dimension of the NxN matrix is divisible by the unroll factor (it could be prime, for example). If I checked for that, the extra calculations would just slow down the function.
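To make the unrolling idea concrete, here is a rough sketch of what I understand it to look like. The name transpose_unrolled and the unroll factor of 4 are just illustrative choices, not something from the assignment. It uses a cleanup loop for the leftover columns instead of testing divisibility inside the loop, so as far as I can tell it gives the right result for any dim. I can't say whether it is actually faster.

```c
/* Sketch only: unroll the inner loop by 4 (an arbitrary, untuned factor).
   A cleanup loop handles the last dim % 4 columns, so dim does not
   have to be a multiple of 4. */
void transpose_unrolled(int *dst, int *src, int dim) {
    int i, j;
    int limit = dim - (dim % 4);   /* largest multiple of 4 that is <= dim */

    for (i = 0; i < dim; i++) {
        int *ps = src + i * dim;   /* start of row i in src */

        /* main loop: four elements per iteration */
        for (j = 0; j < limit; j += 4) {
            dst[(j + 0) * dim + i] = ps[j + 0];
            dst[(j + 1) * dim + i] = ps[j + 1];
            dst[(j + 2) * dim + i] = ps[j + 2];
            dst[(j + 3) * dim + i] = ps[j + 3];
        }

        /* cleanup loop: at most three leftover elements */
        for (; j < dim; j++) {
            dst[j * dim + i] = ps[j];
        }
    }
}
```

The remainder is computed once per call rather than once per element, so the extra work should be small compared to the dim*dim copies.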
Cache blocking wouldn't be very useful, because no matter what, we will be accessing one array linearly (1,2,3,4) while accessing the other in jumps of N. While we can get the function to exploit the cache and read the src block faster, it will still take a long time to write those values into the dst matrix.
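For comparison, this is roughly what I understand a cache-blocked version to look like. It is a sketch under assumptions: transpose_blocked and the BLOCK size of 8 are placeholders I made up, and the best tile size presumably depends on the machine's cache. The idea is that each tile only touches BLOCK rows of src and BLOCK rows of dst, so the strided writes might stay within a small set of cache lines instead of jumping across the whole matrix.

```c
#define BLOCK 8   /* assumed tile width, not tuned for any particular cache */

/* Sketch only: transpose one BLOCK x BLOCK tile at a time.
   Edge tiles are clipped, so dim need not be a multiple of BLOCK. */
void transpose_blocked(int *dst, int *src, int dim) {
    int ii, jj, i, j;

    for (ii = 0; ii < dim; ii += BLOCK) {
        /* last row of this tile (exclusive), clipped to the matrix edge */
        int i_end = (ii + BLOCK < dim) ? ii + BLOCK : dim;

        for (jj = 0; jj < dim; jj += BLOCK) {
            /* last column of this tile (exclusive), clipped likewise */
            int j_end = (jj + BLOCK < dim) ? jj + BLOCK : dim;

            /* plain transpose restricted to the current tile */
            for (i = ii; i < i_end; i++) {
                for (j = jj; j < j_end; j++) {
                    dst[j * dim + i] = src[i * dim + j];
                }
            }
        }
    }
}
```

Whether this beats the original would have to be measured, ideally with a few different BLOCK values.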
I have also tried using pointers instead of array indexing (attempt 2), but I don't think that actually speeds up the program in any way.
Any help would be greatly appreciated.
Thanks