
I'm trying to optimize 2D matrix addition in C using SIMD intrinsics (_mm256_add_pd, _mm256_storeu_pd, _mm256_loadu_pd, etc.). However, I'm not seeing much speedup at all: with some timing code, I measure between 0.8x and 1.5x the speed of the naive solution. Is this typical? I was thinking it could be a memory bottleneck, since there is very little computation per element. I expected roughly a 4x boost, since I'm doing 4 additions at once, so I'm not totally sure what the bottleneck is.

I wrote some code to demonstrate what I'm doing (comparing parallel + SIMD against SIMD only):

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <omp.h>

#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#include <x86intrin.h>
#endif

void add_matrix_naive(double **result, double **mat1, double **mat2, int rows, int cols) {
    int simdCols = cols / 4 * 4;
    if (simdCols > 0) {
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < simdCols; j += 4) {
                _mm256_storeu_pd(result[i] + j, _mm256_add_pd(
                    _mm256_loadu_pd(mat1[i] + j),
                    _mm256_loadu_pd(mat2[i] + j)));
            }
        }
    }

    //Handle remaining columns
    if (simdCols < cols) {
        for (int i = 0; i < rows; i++) {
            for (int j = simdCols; j < cols; j++) {
                result[i][j] = mat1[i][j] + mat2[i][j];
            }
        }
    }
}

void add_matrix(double **result, double **mat1, double **mat2, int rows, int cols) {
    int simdCols = cols / 4 * 4;
    #pragma omp parallel if (rows * cols >= 2000)
    {
        if (simdCols > 0) {
            #pragma omp for collapse(2)
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < simdCols; j += 4) {
                    _mm256_storeu_pd(result[i] + j, _mm256_add_pd(
                        _mm256_loadu_pd(mat1[i] + j),
                        _mm256_loadu_pd(mat2[i] + j)));
                }
            }
        }

        //Handle remaining columns
        if (simdCols < cols) {
            #pragma omp for collapse(2)
            for (int i = 0; i < rows; i++) {
                for (int j = simdCols; j < cols; j++) {
                    result[i][j] = mat1[i][j] + mat2[i][j];
                }
            }
        }
    }
}

int main() 
{ 
    omp_set_num_threads(8);
    //Allocate Matrices
    int rows = 200;
    int cols = 200;

    //Each matrix is one allocation: rows row-pointers followed by the data.
    double **matrix_a = malloc(rows * sizeof(double *) + rows * cols * sizeof(double));

    double *dataStart = (double *)(matrix_a + rows); //Data begins after the row pointers
    for (int i = 0; i < rows; i++) {
        matrix_a[i] = dataStart + i * cols;
        memset(matrix_a[i], 0, sizeof(double) * cols);
    }

    double **matrix_b = malloc(rows * sizeof(double *) + rows * cols * sizeof(double));

    dataStart = (double *)(matrix_b + rows);
    for (int i = 0; i < rows; i++) {
        matrix_b[i] = dataStart + i * cols;
        memset(matrix_b[i], 0, sizeof(double) * cols);
    }

    double **result = malloc(rows * sizeof(double *) + rows * cols * sizeof(double));

    dataStart = (double *)(result + rows);
    for (int i = 0; i < rows; i++) {
        result[i] = dataStart + i * cols;
        memset(result[i], 0, sizeof(double) * cols);
    }

    //Assign random values to matrices.
    for(int i = 0; i < rows; i++){
        for(int j = 0; j < cols; j++){
            matrix_a[i][j] = rand();
            matrix_b[i][j] = rand();
        }
    }

    
    int LOOP_COUNT = 4;

    double prevTime = omp_get_wtime();
    for (int i = 0; i < LOOP_COUNT; i++) {
        add_matrix(result, matrix_a, matrix_b, rows, cols);
    }
    double endTime = omp_get_wtime();
    double firstTime = (endTime - prevTime) / LOOP_COUNT;
    printf("Parallel + SIMD took %f seconds per iteration\n", firstTime);

    //Assign random values to matrices.
    for(int i = 0; i < rows; i++){
        for(int j = 0; j < cols; j++){
            matrix_a[i][j] = rand();
            matrix_b[i][j] = rand();
        }
    }

    prevTime = omp_get_wtime();
    for (int i = 0; i < LOOP_COUNT; i++) {
        add_matrix_naive(result, matrix_a, matrix_b, rows, cols);
    }
    endTime = omp_get_wtime();
    double secondTime = (endTime - prevTime) / LOOP_COUNT;
    printf("SIMD only took %f seconds per iteration\n", secondTime);
    printf("SIMD-only version is %fx faster\n", firstTime / secondTime);

    free(matrix_a);
    free(matrix_b);
    free(result);
    return 0;
}

Something I've noticed is that the result is quite dependent on LOOP_COUNT. With a high loop count the parallel/SIMD version does quite well, but with lower loop counts the naive solution tends to do better.

  • If you compile your naïve C code with -O3 on gcc/clang, they will likely be able to vectorize it as well (take a look at the generated assembly code); a scalar baseline for this is sketched after these comments. Commented Nov 21, 2020 at 23:49
  • "I'm not allowed to post my code online" translates to "I have this problem with this thing", which means we probably can't help. We need more specifics, and code that we can use to reproduce the problem. Commented Nov 22, 2020 at 1:44
  • But without code or any description of details to talk about, this isn't a useful question to answer for the benefit of future readers. Commented Nov 22, 2020 at 1:46
  • @tadman That makes sense, I added code to the post. Commented Nov 22, 2020 at 3:00
  • Ugh, why are you using an array of pointers to arrays instead of a single efficient 2D array? (See: A different way to malloc a 2D array?) That's going to make it harder for compilers to prove or check that there's no aliasing (i.e. that no output rows point to the same storage as some input rows). Commented Nov 22, 2020 at 3:02
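
For reference, the scalar baseline mentioned in the first comment might look like the sketch below (add_matrix_scalar is a hypothetical name, matching the question's signatures). With gcc/clang at -O3 this loop is usually auto-vectorized, which is one reason hand-written intrinsics may show no advantage:

void add_matrix_scalar(double **result, double **mat1, double **mat2, int rows, int cols) {
    //With gcc/clang -O3 (optionally -march=native) this loop is usually
    //auto-vectorized; inspect the generated assembly to confirm.
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            result[i][j] = mat1[i][j] + mat2[i][j];
        }
    }
}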

1 Answer

CPUs can compute things way faster than they can access memory.

Adding doubles is very fast. Your core is very likely bottlenecked on memory.
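
For scale (illustrative numbers, not measurements): each element addition reads two doubles and writes one, i.e. 24 bytes of memory traffic per floating-point add. At a sustained bandwidth of, say, 25 GB/s that caps you at roughly one billion adds per second, while a single AVX core at 3 GHz can do at least 4 double-precision adds per cycle, i.e. 12 billion per second. The vector units spend most of their time waiting for data.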

There's more.

omp_set_num_threads

Rarely a good idea. The default is usually good.
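
For example, dropping the call and running the binary as OMP_NUM_THREADS=4 ./a.out lets you experiment with thread counts without recompiling; with no setting, the runtime typically defaults to one thread per logical core.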

double **result, double **mat1, double **mat2

Double pointers mean double the RAM latency to reach the data: one dependent load for the row pointer, then another for the element. Use a single pointer. If you want to use slices of the matrix, just pass another parameter with the offset (stride) between rows of the matrix.
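
A minimal sketch of that layout (hypothetical names; stride is the row offset mentioned above, equal to cols for a dense matrix). As a bonus, restrict addresses the aliasing concern raised in the comments:

void add_matrix_flat(double *restrict result, const double *restrict mat1,
                     const double *restrict mat2, int rows, int cols, int stride) {
    //One load per element instead of chasing row pointers; restrict
    //tells the compiler the three buffers don't overlap.
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            result[(size_t)i * stride + j] =
                mat1[(size_t)i * stride + j] + mat2[(size_t)i * stride + j];
        }
    }
}

//Allocation becomes a single call:
//double *matrix_a = malloc((size_t)rows * cols * sizeof(double));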

But the most important problem: your benchmarking is completely wrong.

  1. To compare different implementations of an algorithm, you should compile 2 different programs. When you run both implementations in a single program, many confounding things happen: the cache hierarchy kicks in, CPUs have a separate dedicated cache for decoded instructions, modern CPUs are extremely fast at thermal throttling, and more.

  2. Compilers are not stupid. When they realize you compute something without using the result, they can drop that code. When they realize you compute something multiple times without changing the input data, they can eliminate the redundant work and compute it only once. (One way to keep the optimizer honest is sketched after this list.)
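
A sketch of point 2, assuming you keep a single-file microbenchmark despite point 1: perturb the input every iteration and consume the output, so the compiler can neither hoist nor drop the work (checksum is a hypothetical helper):

double checksum(double **m, int rows, int cols) {
    //Summing and later printing the result forces the compiler to
    //actually materialize every addition.
    double s = 0.0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            s += m[i][j];
    return s;
}

//In the timing loop:
//    matrix_a[0][0] = i;  //new input each iteration defeats hoisting
//    add_matrix(result, matrix_a, matrix_b, rows, cols);
//...and after timing:
//    printf("checksum: %f\n", checksum(result, rows, cols));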
