10 votes
1 answer
425 views

I have written an implementation of the MD5 hash function using AVX-512. While it uses SIMD instructions, it is fundamentally a scalar algorithm. The point of using SIMD instructions is to access ...
fuz's user avatar
  • 94.7k
4 votes
3 answers
199 views

After some processing I need to write my data and I wanted to optimize it with AVX2. (An AVX-512 version is an optional extra; working fast with just AVX2 is the main goal.) I have this destination ...
Pavel P's user avatar
  • 17.3k
0 votes
0 answers
102 views

I'm calculating the distance between a query point that has been pre-loaded and a number of other nearby points, via Euclidean distance, while searching for approximate nearest neighbors (ANN search). ...
s. heller's user avatar
0 votes
0 answers
87 views

I have two YMM registers holding the values v20={ b128, a128 } and v31={ d128, c128 }. I need to write those registers to memory in the following sequence: a128, c128, b128, d128. I wrote code which ...
nckm's user avatar
  • 123
2 votes
1 answer
397 views

I want to optimize my int8 matrix multiplication using AVX512-VNNI instructions. I understand how vpdpbusd works, but I don't know how to use it efficiently. In detail, I ...
kdh's user avatar
  • 194
1 vote
1 answer
673 views

I implemented 8-bit integer multiplication for int8 matrix multiplication. (uint8_t or int8_t are the same since it's not widening.) This is my code, but I think it's really slow. inline __m512i ...
kdh's user avatar
  • 194
2 votes
0 answers
428 views

My code is extremely simple: void x(float* array, float const LOW_THRESHOLD, float const HIGH_THRESHOLD) noexcept { for ( int index = 0; index < 16; ++index ) { array[ index ] = ...
pratikpc's user avatar
  • 742
1 vote
0 answers
167 views

The Intel® Xeon® Silver 4216 processor installed in the node supports the AVX-512 instruction set. When using AVX-512, how many FP32 operations can one core execute per clock cycle? Hint: Consider the ...
user24200147's user avatar
2 votes
0 answers
175 views

I'm running mingw64 gdb on Windows and I'm trying to debug a mixed C and Fortran program that has inline asm. The problem I'm having is that the zmm registers are not available to view. I read thru ...
VarianceOfOne's user avatar
3 votes
1 answer
236 views

On Godbolt, this executes fine: volatile __m512i v = _mm512_set_epi64(1, 0, 0, 0, 0, 0, 0, 0); but all zeros does not: volatile __m512i v = _mm512_set_epi64(0, 0, 0, 0, 0, 0, 0, 0); It ...
user997112's user avatar
  • 31.1k
2 votes
1 answer
509 views

I've implemented a quaternion-vector multiplication function using SIMD instructions, with conditional compilation for AVX512, AVX2, and SSE. While I expected to see significant performance ...
HiroIshida's user avatar
  • 1,603
1 vote
2 answers
674 views

I want to perform eight parallel adds of 16-bit values using AVX SIMD. Addition with overflow is required, i.e. 'add with carry' as performed by the old "adc" x86 mnemonic. I ...
Devvy's user avatar
  • 55
1 vote
0 answers
106 views

Is there a faster way to duplicate (copy) the low 256 bits of an AVX-512 register into the higher 256 bits than using the _mm512_insertf64x4 instruction? My current solution is: __m512d zmm1 = ...
Tomas's user avatar
  • 71
1 vote
0 answers
134 views

I have had students write code that efficiently sorts 8 and 16 32-bit numbers at a time in parallel using AVX2 and AVX-512. The easy way is to load 8 or 16 registers, and implement an optimal sorting ...
Dov's user avatar
  • 8,644
6 votes
1 answer
1k views

On CPUs with AVX-512 and BF16 support, you can use the 512-bit vector registers to store 32 16-bit floats. I have found intrinsics to convert FP32 values to BF16 values (for example: ...
Thijs Steel's user avatar
  • 1,272
1 vote
0 answers
50 views

Different types of AVX instructions can cause a decrease in CPU frequency[1]. The proportion of this decrease can be evaluated through the PMU events called `CORE_POWER.LVL0/1/2_TURBO_LICENS`. However, ...
Frontier_Setter's user avatar
3 votes
1 answer
301 views

Using AVX512 instructions, I can use an index vector to gather 16 single precision values from an array. However, such gather operations are not that efficient and issue at a rate of only 2 scalar ...
Wenzel Jakob's user avatar
2 votes
0 answers
101 views

In order to use AVX-512, the processor must support AVX-512 and certain bits of the register XCR0 must be set by the OS kernel. For AVX-512, these XCR0 bits are: 1: indicates saving support for XMM0-...
Myria's user avatar
  • 3,907
2 votes
1 answer
358 views

I'd like to AND two 512-bit vectors containing 8-bit elements. Looking at the Intel Intrinsics Guide I can see some 512-bit AND operations: __m512i _mm512_and_epi32 (__m512i a, __m512i b) __m512i ...
user997112's user avatar
  • 31.1k
4 votes
2 answers
518 views

_mm256_mul_ps is the Intel intrinsic for "Multiply packed single-precision (32-bit) floating-point elements". _mm256_mul_ph is the intrinsic for "Multiply packed half-precision (16-bit) ...
dmeister's user avatar
  • 35.9k
0 votes
0 answers
271 views

I wrote a matrix multiplication over floating point values with AVX512 intrinsics - for (int i=200; i<400; i++) { for (int k=1200; k<1400; k++) { tmp=val[440000+ (k-1200)*200 ...
Pratyush Das's user avatar
5 votes
1 answer
330 views

I've tried to write a few functions to carry out matrix-vector multiplication using a single matrix together with an array of source vectors. I've once written those functions in C++ and once in x86 ...
Loran's user avatar
  • 55
0 votes
0 answers
72 views

I have an array of 32-bit floats, like this: __m512 float_array = _mm512_setr_ps(a, b, c, d,.....); How can I get: __m512 float_array_mul = [a*b, a*b*c, a*b*c*d, ....]; In other words, an operation like ...
lee web's user avatar
0 votes
0 answers
203 views

I could not find an equivalent of int _mm_extract_epi8 (__m128i a, const int imm8) or int _mm256_extract_epi8 (__m256i a, const int index) in the AVX-512 instruction set. What is the best way to extract an 8 ...
KaraUL's user avatar
  • 11
2 votes
1 answer
1k views

I'm currently working on a project that involves AVX512 instructions and I have a question regarding the performance differences between _mm512_load_si512, _mm512_loadu_si512, and ...
MHErshadi's user avatar
