405 questions
10 votes, 1 answer, 425 views
AVX-512 MD5 implementation: unexplained performance regression on Zen 4
I have written an implementation of the MD5 hash function using AVX-512. While it uses SIMD instructions, it is fundamentally a scalar algorithm. The point of using SIMD instructions is to access ...
4 votes, 3 answers, 199 views
AVX2 repack an array of structs of 5 ints to structs of 7 ints, with the extra elements from other arrays? Shuffle/combine for 8 YMM registers?
After some processing I need to write my data and I wanted to optimize it with AVX2.
(An AVX-512 version is an optional extra; working fast with just AVX2 is the main goal.)
I have this destination ...
0 votes, 0 answers, 102 views
My ANN program is bottlenecked on a Euclidean distance calculation with 128D arrays even with AVX512. Can this be sped up more?
I'm calculating the distance between a query point that has been pre-loaded and a number of other nearby points, via Euclidean distance, while searching for approximate nearest neighbors (ANN search). ...
0 votes, 0 answers, 87 views
How to merge two YMM registers into single ZMM but interleave?
I have two YMM registers which hold the values v20={ b128, a128 } and v31={ d128, c128 }
I need to write those registers to memory, but in the following sequence:
a128, c128, b128, d128
I wrote code which ...
2 votes, 1 answer, 397 views
Efficient way for using int8 AVX512-VNNI instruction, especially about loading the data to zmm register
I want to optimize my matrix multiplication using the AVX512-VNNI instructions on int8 data.
I understand how vpdpbusd works, but I don't know how to use it efficiently.
In detail, I ...
1 vote, 1 answer, 673 views
I need more performance for int8 vector multiplication (Intel AVX-512)
I implemented 8-bit integer multiplication for int8 matrix multiplication.
(uint8_t or int8_t are the same since it's not widening.)
This is my code, but I think it's really slow.
inline __m512i ...
2 votes, 0 answers, 428 views
Why do GCC, ICX and Clang not auto-vectorize using AVX-512 based instructions on Intel processors but do the same on AMD?
My code is extremely simple
void x(float* array, float const LOW_THRESHOLD, float const HIGH_THRESHOLD) noexcept
{
for ( int index = 0; index < 16; ++index )
{
array[ index ] = ...
1 vote, 0 answers, 167 views
What is considered as "2 FMAs"?
The Intel® Xeon® Silver 4216 processor installed in the node supports the AVX-512 instruction set. When using AVX-512, how many FP32 operations can one core execute per clock cycle?
Hint: Consider the ...
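The "2 FMAs" in questions of this kind usually refers to the number of FMA execution units per core. As a rough sketch of the arithmetic, assume a core has $U$ 512-bit FMA units ($U$ is a symbol introduced here for illustration; whether a particular model has one or two such units has to be checked against its specifications). Each 512-bit vector holds $512/32 = 16$ FP32 lanes, and one fused multiply-add counts as two floating-point operations per lane:

```latex
\text{FP32 ops per cycle} = U \times \frac{512}{32} \times 2 = 32\,U
```

Under that assumption, $U = 1$ gives 32 FP32 operations per cycle and $U = 2$ gives 64.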
2 votes, 0 answers, 175 views
enable avx512 zmm registers in gdb
I'm running mingw64 gdb on Windows and I'm trying to debug a mixed C and Fortran program that has inline asm. The problem I'm having is that zmm registers are not available to view. I read thru ...
3 votes, 1 answer, 236 views
Setting AVX512 vector to zero/non-zero sometimes causes signal SIGILL on Godbolt
On Godbolt, this executes fine:
volatile __m512i v = _mm512_set_epi64(1, 0, 0, 0, 0, 0, 0, 0);
but all zeros does not:
volatile __m512i v = _mm512_set_epi64(0, 0, 0, 0, 0, 0, 0, 0);
It ...
2 votes, 1 answer, 509 views
Small performance gain using AVX512 over SSE in batch quaternion-vector multiplication
I've implemented a quaternion-vector multiplication function using SIMD instructions, with conditional compilation for AVX512, AVX2, and SSE. While I expected to see significant performance ...
1 vote, 2 answers, 674 views
How to perform parallel addition using AVX with carry (overflow) fed back into the same element (PE checksum)?
I want to perform eight parallel adds of 16-bit values using AVX SIMD. Addition with overflow is required, i.e. 'add with carry' as performed by the old "adc" x86 mnemonic.
I ...
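As a point of reference for what the title describes, here is a minimal scalar sketch in portable C of a 16-bit add whose carry-out is fed back into the same element (end-around carry). The function names `add16_end_around` and `add16x8_end_around` are hypothetical, and this is a plain-C model of the per-lane arithmetic rather than the asker's code or a vectorized implementation. Each step (wrapping add, unsigned compare to detect the wrap, adding the carry) is element-wise, so it has the same shape as a SIMD version.

```c
#include <stddef.h>
#include <stdint.h>

/* One lane: 16-bit add where the carry-out is added back into the same lane
   (end-around carry, i.e. ones' complement addition). */
static inline uint16_t add16_end_around(uint16_t a, uint16_t b)
{
    uint16_t sum = (uint16_t)(a + b);      /* wraps modulo 2^16 */
    uint16_t carry = (uint16_t)(sum < a);  /* 1 exactly when the add wrapped */
    return (uint16_t)(sum + carry);        /* cannot wrap a second time */
}

/* Eight independent lanes, the shape of one 128-bit vector of uint16_t. */
static void add16x8_end_around(uint16_t dst[8],
                               const uint16_t a[8],
                               const uint16_t b[8])
{
    for (size_t i = 0; i < 8; ++i)
        dst[i] = add16_end_around(a[i], b[i]);
}
```

The second add cannot overflow: when the first add wraps, the wrapped sum is at most 0xFFFE, so adding 1 stays within 16 bits.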
1 vote, 0 answers, 106 views
AVX512 duplicate low 256 bits into high 256 bits inside a zmm register
Is there a faster way to duplicate (copy) the low 256 bits of an AVX-512 register into the higher 256 bits than using the _mm512_insertf64x4 instruction?
My current solution is:
__m512d zmm1 = ...
1 vote, 0 answers, 134 views
What is the most efficient AVX2/512 code sequence to merge two registers with sorted values?
I have had students write code that efficiently sorts 8 and 16 32-bit numbers at a time in parallel using AVX2 and AVX-512. The easy way is to load 8 or 16 registers, and implement an optimal sorting ...
6 votes, 1 answer, 1k views
AVX-512 BF16: load bf16 values directly instead of converting from fp32
On CPUs with AVX-512 and BF16 support, you can use the 512-bit vector registers to store 32 16-bit floats.
I have found intrinsics to convert FP32 values to BF16 values (for example: ...
1 vote, 0 answers, 50 views
How to identify the proportion of frequency reduction of a process caused by AVX instructions?
Different types of AVX instructions can cause a decrease in CPU frequency[1]. The proportion of this decrease can be evaluated through the PMU events called `CORE_POWER.LVL0/1/2_TURBO_LICENSE`.
However, ...
3 votes, 1 answer, 301 views
Optimal instruction sequence for AVX512 gather of 4D vectors
Using AVX512 instructions, I can use an index vector to gather 16 single precision values from an array. However, such gather operations are not that efficient and issue at a rate of only 2 scalar ...
2 votes, 0 answers, 101 views
How do XCR0 and XSTATE work for AVX10.2/256?
In order to use AVX-512, the processor must support AVX-512 and certain bits of the register XCR0 must be set by the OS kernel. For AVX-512, these XCR0 bits are:
1: indicates saving support for XMM0-...
2 votes, 1 answer, 358 views
AVX512 perform AND of 512bits of 8-bit chars
I'd like to AND two vectors of 512 bits containing 8 bit elements.
Looking at the Intel Intrinsics Guide I can see some 512-bit AND operations:
__m512i _mm512_and_epi32 (__m512i a, __m512i b)
__m512i ...
4 votes, 2 answers, 518 views
How to call _mm256_mul_ph from rust?
_mm256_mul_ps is the Intel intrinsic for "Multiply packed single-precision (32-bit) floating-point elements". _mm256_mul_ph is the intrinsic for "Multiply packed half-precision (16-bit) ...
0 votes, 0 answers, 271 views
AVX 512 matrix multiplication with column-wise traversal on B
I wrote a matrix multiplication over floating point values with AVX512 intrinsics -
for (int i=200; i<400; i++) {
for (int k=1200; k<1400; k++) {
tmp=val[440000+ (k-1200)*200 ...
5 votes, 1 answer, 330 views
AVX512 auto-vectorized C++ matrix-vector functions are much slower when source = destination, in-place
I've tried to write a few functions to carry out matrix-vector multiplication using a single matrix together with an array of source vectors. I wrote those functions once in C++ and once in x86 ...
0 votes, 0 answers, 72 views
dst[i] equals src[i] multiplied by dst[i-1] in AVX or SSE
I have an array of 32-bit floats, like this:
__m512 float_array = _mm512_setr_ps(a, b, c, d,.....);
How can I get:
__m512 float_array_mul = [a*b, a*b*c, a*b*c*d, ....];
In other words, an operation like ...
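The recurrence in the title is a prefix product (an inclusive scan under multiplication). One common way to map a scan onto a vector register is the log-step pattern: multiply the vector by a copy of itself shifted by 1, 2, 4 and 8 lanes, filling the vacated lanes with the identity 1.0. The sketch below models that pattern in portable C on a 16-element array standing in for the 16 float lanes of a 512-bit register. `LANES` and `prefix_product16` are hypothetical names, and this is a scalar model of the idea under those assumptions, not an intrinsics implementation.

```c
#define LANES 16  /* number of float lanes in one 512-bit vector */

/* In-place inclusive prefix product: afterwards v[i] holds the product of the
   original v[0..i]. Four steps (shift = 1, 2, 4, 8), each a whole-vector
   "shift lanes up, fill with 1.0f, multiply". */
static void prefix_product16(float v[LANES])
{
    for (int shift = 1; shift < LANES; shift <<= 1) {
        float shifted[LANES];
        for (int i = 0; i < LANES; ++i)
            shifted[i] = (i >= shift) ? v[i - shift] : 1.0f;
        for (int i = 0; i < LANES; ++i)
            v[i] *= shifted[i];
    }
}
```

With input {a, b, c, d, ...} this yields {a, a*b, a*b*c, a*b*c*d, ...}; the layout in the question is the same sequence starting one lane later. Because the products are associated differently from a sequential loop, floating-point results can differ from it in the last bits.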
0 votes, 0 answers, 203 views
Extract 8 bit integer from __m512i data type (AVX-512)
Could not find equivalent of
int _mm_extract_epi8 (__m128i a, const int imm8)
int _mm256_extract_epi8 (__m256i a, const int index)
in the AVX-512 instruction set.
What is the best way to extract an 8 ...
2 votes, 1 answer, 1k views
Performance Difference Between _mm512_load_si512 and _mm512_stream_load_si512
I'm currently working on a project that involves AVX512 instructions and I have a question regarding the performance differences between _mm512_load_si512, _mm512_loadu_si512, and ...