Newest 'avx512' Questions

10 votes

1 answer

425 views

AVX-512 MD5 implementation: unexplained performance regression on Zen 4

I have written an implementation of the MD5 hash function using AVX-512. While it uses SIMD instructions, it is fundamentally a scalar algorithm. The point of using SIMD instructions is to access ...

fuz

94.7k

asked Oct 8 at 16:55

4 votes

3 answers

199 views

AVX2 repack an array of structs of 5 ints to structs of 7 ints, with the extra elements from other arrays? Shuffle/combine for 8 YMM registers?

After some processing I need to write my data and I wanted to optimize it with AVX2. (An AVX-512 version is an optional extra; working fast with just AVX2 is the main goal.) I have this destination ...

Pavel P

17.3k

asked May 21 at 1:32

0 votes

0 answers

102 views

My ANN program is bottlenecked on a Euclidean distance calculation with 128D arrays even with AVX512. Can this be sped up more?

I'm calculating the distance between a query point that has been pre-loaded and a number of other nearby points, via Euclidean distance, while searching for approximate nearest neighbors (ANN search). ...

s. heller

27

asked Mar 28 at 21:34

0 votes

0 answers

87 views

How to merge two YMM registers into single ZMM but interleave?

I have two YMM registers which have the values v20={ b128, a128 } and v31={ d128, c128 }. I need to write those registers into memory, but in the following sequence: a128, c128, b128, d128. I wrote code which ...

nckm

123

asked Jan 11 at 19:24

2 votes

1 answer

397 views

Efficient way for using int8 AVX512-VNNI instruction, especially about loading the data to zmm register

I want to optimize my matrix multiplication operation using the AVX512-VNNI instructions on int8 data. I understand how vpdpbusd works, but I don't know how to use it efficiently. In detail, I ...

kdh

194

asked Dec 9, 2024 at 15:37

1 vote

1 answer

673 views

I need more performance for int8 vector multiplication (Intel AVX-512)

I implemented 8-bit integer multiplication for int8 matrix multiplication. (uint8_t or int8_t are the same since it's not widening.) This is my code, but I think it's really slow. inline __m512i ...

kdh

194

asked Dec 6, 2024 at 15:27

2 votes

0 answers

428 views

Why do GCC, ICX and Clang not auto-vectorize using AVX-512 based instructions on Intel processors but do the same on AMD?

My code is extremely simple void x(float* array, float const LOW_THRESHOLD, float const HIGH_THRESHOLD) noexcept { for ( int index = 0; index < 16; ++index ) { array[ index ] = ...

pratikpc

742

asked Nov 10, 2024 at 11:45

1 vote

0 answers

167 views

What is considered as "2 FMAs"?

The Intel® Xeon® Silver 4216 processor installed in the node supports the AVX-512 instruction set. When using AVX-512, how many FP32 operations can one core execute per clock cycle? Hint: Consider the ...

user24200147

11

asked Oct 11, 2024 at 16:06

2 votes

0 answers

175 views

enable avx512 zmm registers in gdb

I'm running mingw64 gdb on Windows and I'm trying to debug a mixed C and Fortran program that has inline asm. The problem I'm having is that zmm registers are not available to view. I read through ...

VarianceOfOne

21

asked Sep 29, 2024 at 13:46

3 votes

1 answer

236 views

Setting AVX512 vector to zero/non-zero sometimes causes signal SIGILL on Godbolt

On Godbolt, this executes fine: volatile __m512i v = _mm512_set_epi64(1, 0, 0, 0, 0, 0, 0, 0); but all zeros does not: volatile __m512i v = _mm512_set_epi64(0, 0, 0, 0, 0, 0, 0, 0); It ...

user997112

31.1k

asked Sep 16, 2024 at 22:17

2 votes

1 answer

509 views

Small performance gain using AVX512 over SSE in batch quaternion-vector multiplication

I've implemented a quaternion-vector multiplication function using SIMD instructions, with conditional compilation for AVX512, AVX2, and SSE. While I expected to see significant performance ...

HiroIshida

1,603

asked Sep 4, 2024 at 13:50

1 vote

2 answers

674 views

How to perform parallel addition using AVX with carry (overflow) fed back into the same element (PE checksum)?

I want to perform eight parallel adds of 16bit values using AVX SIMD. Addition with overflow is required, i.e. 'add with carry' like it is performed with the old "adc" x86 mnemonic. I ...

Devvy

55

asked Aug 19, 2024 at 22:38

1 vote

0 answers

106 views

AVX512 duplicate low 256 bits into high 256 bits inside a zmm register

Is there a faster way to duplicate (copy) the low 256 bits of an AVX-512 register into the higher 256 bits than using the _mm512_insertf64x4 instruction? My current solution is: __m512d zmm1 = ...

Tomas

71

asked Jul 24, 2024 at 14:00

1 vote

0 answers

134 views

What is the most efficient AVX2/512 code sequence to merge two registers with sorted values?

I have had students write code that efficiently sorts 8 and 16 32-bit numbers at a time in parallel using avx2 and avx512. The easy way is to load 8 or 16 registers, and implement an optimal sorting ...

Dov

8,644

asked Jun 19, 2024 at 12:30

6 votes

1 answer

1k views

AVX-512 BF16: load bf16 values directly instead of converting from fp32

On CPUs with AVX-512 and BF16 support, you can use the 512-bit vector registers to store 32 16-bit floats. I have found intrinsics to convert FP32 values to BF16 values (for example: ...

Thijs Steel

1,272

asked May 2, 2024 at 13:42

1 vote

0 answers

50 views

How to identify the proportion of frequency reduction of a process caused by AVX instructions?

Different types of AVX instructions can cause a decrease in CPU frequency[1]. The proportion of this decrease can be evaluated through the PMU events called `CORE_POWER.LVL0/1/2_TURBO_LICENSE`. However, ...

Frontier_Setter

809

asked Apr 24, 2024 at 8:42

3 votes

1 answer

301 views

Optimal instruction sequence for AVX512 gather of 4D vectors

Using AVX512 instructions, I can use an index vector to gather 16 single precision values from an array. However, such gather operations are not that efficient and issue at a rate of only 2 scalar ...

Wenzel Jakob

705

asked Apr 23, 2024 at 8:50

2 votes

0 answers

101 views

How do XCR0 and XSTATE work for AVX10.2/256?

In order to use AVX-512, the processor must support AVX-512 and certain bits of the register XCR0 must be set by the OS kernel. For AVX-512, these XCR0 bits are: 1: indicates saving support for XMM0-...

Myria

3,907

asked Apr 16, 2024 at 23:13

2 votes

1 answer

358 views

AVX512 perform AND of 512bits of 8-bit chars

I'd like to AND two vectors of 512 bits containing 8 bit elements. Looking at the Intel Intrinsics Guide I can see some 512-bit AND operations: __m512i _mm512_and_epi32 (__m512i a, __m512i b) __m512i ...

user997112

31.1k

asked Mar 22, 2024 at 0:21

4 votes

2 answers

518 views

How to call _mm256_mul_ph from rust?

_mm256_mul_ps is the Intel intrinsic for "Multiply packed single-precision (32-bit) floating-point elements". _mm256_mul_ph is the intrinsic for "Multiply packed half-precision (16-bit) ...

dmeister

35.9k

asked Feb 22, 2024 at 1:14

0 votes

0 answers

271 views

AVX 512 matrix multiplication with column-wise traversal on B

I wrote a matrix multiplication over floating point values with AVX512 intrinsics - for (int i=200; i<400; i++) { for (int k=1200; k<1400; k++) { tmp=val[440000+ (k-1200)*200 ...

Pratyush Das

534

asked Feb 14, 2024 at 18:40

5 votes

1 answer

330 views

AVX512 auto-vectorized C++ matrix-vector functions are much slower when source = destination, in-place

I've tried to write a few functions to carry out matrix-vector multiplication using a single matrix together with an array of source vectors. I've once written those functions in C++ and once in x86 ...

Loran

55

asked Jan 21, 2024 at 0:13

0 votes

0 answers

72 views

dst[i] equals src[i] multiplied by dst[i-1] in AVX or SSE

I have an array of 32-bit floats, like this: __m512 float_array = _mm512_setr_ps(a, b, c, d,.....); How can I get: __m512 float_array_mul = [a*b, a*b*c, a*b*c*d, ....]; In other words, an operation like ...

lee web

1

asked Dec 19, 2023 at 13:14

0 votes

0 answers

203 views

Extract 8 bit integer from __m512i data type (AVX-512)

I could not find an equivalent of int _mm_extract_epi8 (__m128i a, const int imm8) and int _mm256_extract_epi8 (__m256i a, const int index) in the AVX-512 instruction set. What is the best way to extract an 8 ...

KaraUL

11

asked Dec 15, 2023 at 14:20

2 votes

1 answer

1k views

Performance Difference Between _mm512_load_si512 and _mm512_stream_load_si512

I'm currently working on a project that involves AVX512 instructions and I have a question regarding the performance differences between _mm512_load_si512, _mm512_loadu_si512, and ...

MHErshadi

73

asked Dec 6, 2023 at 12:18
