AVX2 what is the most efficient way to pack left based on a mask?

Problem description

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?

I've seen in SSE where it was done like this: (From: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf)

__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
  // Move 4 sign bits of mask to 4-bit integer value.
  int mask_index = _mm_movemask_ps(mask);
  // Select shuffle control data (shufmasks is a 16-entry LUT of pshufb controls)
  __m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask_index]);
  // Permute to move valid values to front of SIMD register
  __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
  return packed;
}
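
For reference, here is one way the 16-entry shufmasks table assumed above could be built. This is an illustrative sketch only (the slide deck does not show its definition, so the layout here is a guess):

#include <stdint.h>
#include <immintrin.h>

static __m128i shufmasks[16];

static void init_shufmasks(void)
{
  // For each 4-bit movemask value, build a pshufb control that moves the
  // selected 32-bit lanes to the front and zeroes the rest.
  for (int m = 0; m < 16; ++m) {
    uint8_t ctrl[16];
    int out = 0;
    for (int lane = 0; lane < 4; ++lane) {
      if (m & (1 << lane)) {                    // lane is selected: copy its 4 bytes next
        for (int b = 0; b < 4; ++b)
          ctrl[out * 4 + b] = (uint8_t)(lane * 4 + b);
        ++out;
      }
    }
    for (int i = out * 4; i < 16; ++i)
      ctrl[i] = 0x80;                           // high bit set => pshufb writes a zero byte
    shufmasks[m] = _mm_loadu_si128((const __m128i *)ctrl);
  }
}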

This seems fine for SSE which is 4 wide, and thus only needs a 16-entry LUT, but for AVX which is 8 wide, the LUT becomes quite large (256 entries, each 32 bytes, or 8K).

I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.

I think with some bit shuffling to count the number of sign bits set to the left, you could generate the necessary permutation table and then call _mm256_permutevar8x32_ps. But that is also quite a few instructions, I think.

Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?

Here is an illustration of the Left Packing Problem from the above document:

Thanks

Solution

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)

We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.

We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.

Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).

AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)


For integer vectors with 32-bit or wider elements: Either 1) use _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits. Change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (Not tested). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
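
As an illustrative sketch of option 2 (marked not tested above, so treat this the same way), assuming compare_mask is all-ones or all-zeros per 32-bit element; the name compress256_epi32 is hypothetical and the body mirrors the compress256 function shown later in this answer:

#include <stdint.h>
#include <immintrin.h>

static inline __m256i compress256_epi32(__m256i src, __m256i compare_mask)
{
  uint32_t mask32 = (uint32_t)_mm256_movemask_epi8(compare_mask); // 4 identical bits per 32-bit element

  uint64_t expanded_mask = _pdep_u64(mask32, 0x0F0F0F0F0F0F0F0F); // each 4-bit block into a byte's low nibble
  expanded_mask |= expanded_mask << 4;                            // fill the high nibble: 0x0F -> 0xFF

  const uint64_t identity_indices = 0x0706050403020100;           // identity shuffle, one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m128i bytevec  = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

  return _mm256_permutevar8x32_epi32(src, shufmask);              // VPERMD instead of VPERMPS
}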

For 64-bit integer or double elements, everything still Just Works: the compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)
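
An illustrative sketch of that reuse for doubles (the wrapper name compress256_pd is hypothetical; compress256 is the AVX2+BMI2 function shown later in this answer):

static inline __m256d compress256_pd(__m256d src, __m256d compare_mask)
{
  // Each all-ones / all-zeros 64-bit element gives a pair of equal bits here.
  unsigned int mask = _mm256_movemask_ps(_mm256_castpd_ps(compare_mask));
  return _mm256_castps_pd(compress256(_mm256_castpd_ps(src), mask));
}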

For 16-bit elements, you might be able to adapt this with 128-bit vectors.

For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.


The algorithm:

Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.

Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])
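
A tiny standalone check of just that step, with the numbers worked out here rather than taken from the text:

#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>   // _pext_u64; compile with -mbmi2 or -march=haswell

int main(void)
{
  uint64_t identity_3bit = 0xFAC688; // 0b111'110'101'100'011'010'001'000 = [ 7 6 5 4 3 2 1 0 ]
  uint64_t selector      = 0x1C7;    // 0b000'...'111'000'111: keep lanes 2 and 0
  uint64_t packed        = _pext_u64(identity_3bit, selector);
  printf("%#llx\n", (unsigned long long)packed);  // prints 0x10 = 0b010'000, i.e. [ ... 2 0 ]
  return 0;
}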

See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
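
The following is an illustrative reconstruction of that expansion for the packed 3-bit scheme, not the exact code from the Godbolt link:

#include <stdint.h>
#include <immintrin.h>   // _pdep_u32 / _pext_u32 (BMI2)

// mask = 8 bits from _mm256_movemask_ps
static inline uint32_t packed3_indices(uint32_t mask)
{
  uint32_t spaced   = _pdep_u32(mask, 0x00249249);  // one mask bit at the bottom of each 3-bit group
  uint32_t selector = spaced * 7;                   // 001 -> 111: fill each selected group (no carries)
  return _pext_u32(0x00FAC688, selector);           // e.g. mask = 0b101 gives 0b010'000, i.e. [ 2 0 ]
}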

Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.
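
To make that concrete, here is one illustrative way to finish the unpack of those packed 3-bit indices. It uses a 64-bit pdep, so in practice you would just take the byte-per-index shortcut described next; this only shows the 3-bit path end to end:

static inline __m256i unpack3_to_dwords(uint32_t packed3)
{
  uint64_t bytes   = _pdep_u64(packed3, 0x0707070707070707); // each 3-bit group into the low bits of a byte
  __m128i  bytevec = _mm_cvtsi64_si128(bytes);
  return _mm256_cvtepu8_epi32(bytevec);                      // vpmovzxbd: one 32-bit index per lane
}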

By the time you put all the pieces together, there are three total pext/pdeps. I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)

We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.

See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.

#include <stdint.h>
#include <immintrin.h>

// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
  uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
  expanded_mask *= 0xFF;    // mask |= mask<<1 | mask<<2 | ... | mask<<7;
  // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

  const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

  return _mm256_permutevar8x32_ps(src, shufmask);
}

This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).

    # clang 3.7.1 -std=gnu++14 -O3 -march=haswell
    mov     eax, edi                   # just to zero extend: goes away when inlining
    movabs  rcx, 72340172838076673     # The constants are hoisted after inlining into a loop
    pdep    rax, rax, rcx              # ABC       -> 0000000A0000000B....
    imul    rax, rax, 255              # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
    movabs  rcx, 506097522914230528
    pext    rax, rcx, rax
    vmovq   xmm1, rax
    vpmovzxbd       ymm1, xmm1         # 3c latency since this is lane-crossing
    vpermps ymm0, ymm1, ymm0
    ret

(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)

So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.

This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.

e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.
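
For reference, a hedged sketch of such a filter loop (not the exact Godbolt code; the function name and the greater-than predicate are illustrative, and n is assumed to be a multiple of 8):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

size_t filter_gt(float *dst, const float *src, size_t n, float limit)
{
  const __m256 vlimit = _mm256_set1_ps(limit);
  float *dst_start = dst;
  for (size_t i = 0; i < n; i += 8) {
    __m256 v      = _mm256_loadu_ps(src + i);
    __m256 keep   = _mm256_cmp_ps(v, vlimit, _CMP_GT_OQ);
    unsigned mask = (unsigned)_mm256_movemask_ps(keep);
    __m256 packed = compress256(v, mask);
    _mm256_storeu_ps(dst, packed);        // full-width store: fine as long as dst has room for n floats
    dst += _mm_popcnt_u32(mask);          // advance only by the number of elements kept
  }
  return (size_t)(dst - dst_start);
}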

clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).

gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.
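
The identity behind that strength reduction is simply x * 255 == (x << 8) - x:

static inline uint64_t mul255(uint64_t x)
{
  return (x << 8) - x;   // what gcc emits as mov/shl/sub instead of imul
}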


Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.

If you need to do this in a very long loop, the LUT is probably worth it if the initial cache misses are amortized over enough iterations, given the lower overhead of just unpacking the LUT entry. You still need movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.

You can unpack LUT entries with the same integer sequence I used, but @Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).
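
An illustrative sketch of that unpack, assuming each LUT entry packs eight 4-bit lane indices into a 32-bit word with the lowest nibble holding lane 0 (an assumed layout; the exact format @Froglegs used is not shown here):

#include <stdint.h>
#include <immintrin.h>

static inline __m256i unpack_lut_entry(uint32_t entry)
{
  __m256i v      = _mm256_set1_epi32((int32_t)entry);       // in real use this could be a vpbroadcastd load
  __m256i shifts = _mm256_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28);
  __m256i moved  = _mm256_srlv_epi32(v, shifts);             // vpsrlvd: per-lane variable right shift
  return _mm256_and_si256(moved, _mm256_set1_epi32(7));      // vpand: keep just the 3-bit index
}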
