Why does division by 3 require a rightshift (and other oddities) on x86?
I have the following C/C++ function:
unsigned div3(unsigned x) {
return x / 3;
}
When compiled using clang 10 at -O3, this results in:
div3(unsigned int):
mov ecx, edi # tmp = x
mov eax, 2863311531 # result = 3^-1
imul rax, rcx # result *= tmp
shr rax, 33 # result >>= 33
ret
What I do understand is: division by 3 is equivalent to multiplying by the multiplicative inverse 3^-1 mod 2^32, which is 2863311531.
There are some things that I don't understand though:
- Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
- Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
- Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
- What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
Edit 1
For those who don't understand what I mean by 3^-1 mod 2^32, I am talking about the multiplicative inverse here. For example:
// multiplying with inverse of 3:
15 * 2863311531 = 42949672965
42949672965 mod 2^32 = 5
// using fixed-point multiplication
15 * 2863311531 = 42949672965
42949672965 >> 33 = 5
// simply dividing by 3
15 / 3 = 5
So multiplying by 2863311531 is actually equivalent to dividing by 3. I assumed clang's optimization was based on modular arithmetic, when it's really based on fixed-point arithmetic.
Edit 2
I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 by 3^-1 gives 3^-1, not zero. Only fixed-point arithmetic has correct rounding.
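The difference between the two interpretations can be checked directly. The following is a minimal sketch, not taken from the original post; the names div3_modular, div3_fixed_point and compare_methods are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 3 * 0xAAAAAAAB == 0x200000001, which wraps to 1 modulo 2^32,
   so this constant is the modular inverse of 3. */
#define INV3 0xAAAAAAABu

/* Modular view: keep only the low 32 bits of the product.
   Correct only when x is an exact multiple of 3. */
static uint32_t div3_modular(uint32_t x) {
    return (uint32_t)(x * INV3);
}

/* Fixed-point view: the same constant is also ceil(2^33 / 3),
   so the full 64-bit product shifted right by 33 is floor(x / 3). */
static uint32_t div3_fixed_point(uint32_t x) {
    return (uint32_t)(((uint64_t)x * INV3) >> 33);
}

/* Hypothetical self-check comparing both views on the values from the edits. */
static void compare_methods(void) {
    assert(div3_modular(15) == 5);      /* exact multiple: both agree */
    assert(div3_fixed_point(15) == 5);
    assert(div3_modular(1) == INV3);    /* remainder: modular view is useless */
    assert(div3_fixed_point(1) == 0);   /* fixed-point view rounds down */
}
```

Calling compare_methods() passes silently, which matches the observation in Edit 2: both views agree on multiples of 3, and only the fixed-point one rounds correctly otherwise.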
Unfortunately, clang does not make any use of modular arithmetic, which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.
unsigned div3(unsigned x) {
__builtin_assume(x % 3 == 0);
return x / 3;
}
(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)
- Can't we multiply rax with edi directly?
We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value. (Zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it.)
Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.
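As a rough C-level picture of the clang output, each instruction can be mapped to one statement. This is only a sketch: div3_like_clang and div3_with_garbage are illustrative names, and the second function simulates a caller that left junk in the upper half of RDI.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the clang 10 sequence one instruction per statement. */
static uint32_t div3_like_clang(uint32_t x) {
    uint64_t rcx = x;               /* mov ecx, edi : zero-extends into RCX */
    uint64_t rax = 2863311531u;     /* mov eax, 2863311531 */
    rax *= rcx;                     /* imul rax, rcx */
    return (uint32_t)(rax >> 33);   /* shr rax, 33 */
}

/* What skipping the zero-extension would compute if the upper
   32 bits of RDI held something other than zero. */
static uint32_t div3_with_garbage(uint64_t rdi) {
    uint64_t rax = 2863311531u;
    rax *= rdi;                     /* imul rax, rdi on the raw register */
    return (uint32_t)(rax >> 33);
}
```

With x = 15 the first function returns 5, while feeding the second one 0x10000000F (the same low half plus one stray upper bit) gives 1431655770, which is why the extra mov is needed in a stand-alone function.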
- Why do we multiply in 64-bit mode?
(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)
You could mul edi to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul. (Although imul r64, r64 is slower on AMD Bulldozer-family, and on some low-power CPUs.) See https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF).
(Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)
Also, the part you want is in EDX, so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)
GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/
# gcc8.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop, stupid wasted instruction
mov edx, -1431655765 # 1 uop (same 32-bit constant, just printed differently)
mul edx # 3 uops on Sandybridge-family
mov eax, edx # 1 uop
shr eax # 1 uop
ret
# total of 7 uops on SnB-family
It would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but that is still worse than:
# gcc9.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop
mov edi, 2863311531 # 1 uop
imul rax, rdi # 1 uop
shr rax, 33 # 1 uop
ret
# total 4 uops, not counting ret
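Both sequences compute the same value; they just get at bit 33 and up of the product in different ways. A hedged C rendering of the two (function names are invented here, and the comments map statements to the instructions shown above):

```c
#include <assert.h>
#include <stdint.h>

/* gcc 8.3 style: 32x32 => 64 widening multiply, take the high half, shift once more. */
static uint32_t div3_gcc8_style(uint32_t x) {
    uint64_t edx_eax = (uint64_t)x * 0xAAAAAAABu;  /* mul edx : EDX:EAX = EAX * EDX */
    uint32_t high = (uint32_t)(edx_eax >> 32);     /* mov eax, edx : keep the high half */
    return high >> 1;                              /* shr eax : 32 + 1 = 33 bits total */
}

/* gcc 9.3 / clang style: one 64-bit multiply, one 33-bit shift. */
static uint32_t div3_gcc9_style(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

/* Hypothetical spot check that both agree with plain division. */
static void check_both_styles(void) {
    const uint32_t samples[] = {0u, 1u, 2u, 3u, 16u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(div3_gcc8_style(samples[i]) == samples[i] / 3);
        assert(div3_gcc9_style(samples[i]) == samples[i] / 3);
    }
}
```

The C source is equivalent either way; which instruction sequence comes out is a tuning decision made by the compiler.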
Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodable. It would mean 0xFFFFFFFFAAAAAAAB.
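The sign-extension rule for 32-bit immediates in 64-bit instructions can be modelled in a few lines of C. This is an illustrative sketch; sign_extend_imm32 and fits_in_imm32 are hypothetical helpers, not part of any assembler or compiler API.

```c
#include <assert.h>
#include <stdint.h>

/* Models how a 64-bit operand-size instruction widens a 32-bit immediate:
   bit 31 is copied into bits 32..63. */
static uint64_t sign_extend_imm32(uint32_t imm) {
    uint64_t value = imm;
    if (imm & 0x80000000u)
        value |= 0xFFFFFFFF00000000ULL;
    return value;
}

/* A 64-bit constant is usable as an imm32 only if it survives the round trip. */
static int fits_in_imm32(uint64_t constant) {
    return sign_extend_imm32((uint32_t)constant) == constant;
}
```

Under this model fits_in_imm32(0xAAAAAAAB) is false because bit 31 is set, so the constant has to be materialized in a register with a separate mov.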
- Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.
Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.
Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
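To see that signedness only matters for the high half, one can compute a full 32x32 => 64 product both ways and compare the halves. The sketch below does this at 32-bit width for readability (the same holds for the 64-bit imul in the compiler output); all function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a 32-bit pattern as a two's complement value, portably. */
static int64_t as_signed32(uint32_t bits) {
    return bits >= 0x80000000u ? (int64_t)bits - 4294967296LL : (int64_t)bits;
}

/* Full product treating both inputs as unsigned (what one-operand mul does). */
static uint64_t full_unsigned(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Full product treating both inputs as signed (what one-operand imul does). */
static uint64_t full_signed(uint32_t a, uint32_t b) {
    return (uint64_t)(as_signed32(a) * as_signed32(b));
}

/* Hypothetical check: low halves match, high halves can differ. */
static void check_halves(void) {
    uint64_t u = full_unsigned(0xFFFFFFFFu, 2u);   /* 0x00000001FFFFFFFE */
    uint64_t s = full_signed(0xFFFFFFFFu, 2u);     /* -1 * 2 = 0xFFFFFFFFFFFFFFFE */
    assert((uint32_t)u == (uint32_t)s);            /* same low 32 bits */
    assert((u >> 32) != (s >> 32));                /* different high 32 bits */
}
```

Since imul reg, reg only ever writes the low half, it gives the same register result a hypothetical mul reg, reg would.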
- What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than the full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):
See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
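For division by 3 specifically, the extra bit of shift can be motivated with a small experiment. The sketch below assumes 0x55555556 (that is, ceil(2^32 / 3)) as the natural candidate multiplier when only the high 32 bits are kept; that choice and the function names are illustrative, not taken from any compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the high 32 bits. 3 * 0x55555556 == 2^32 + 2, so the result is
   x/3 + 2x/(3 * 2^32); the extra term reaches 1/3 once x >= 2^31. */
static uint32_t div3_high_half_only(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0x55555556u) >> 32);
}

/* One more bit of precision. 3 * 0xAAAAAAAB == 2^33 + 1, so the result is
   x/3 + x/(3 * 2^33); the extra term stays below 1/6 for every 32-bit x,
   which is too small to push floor(x/3) up to the next integer. */
static uint32_t div3_shift33(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

With these definitions div3_high_half_only(2147483648u) returns 715827883 while the true quotient is 715827882; div3_shift33 returns the correct value for the same input.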
For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed
// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
int64_t invDivisor = 0x1999999A;
return (int32_t) ((invDivisor * dividend) >> 32);
}
Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183, when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
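The failing input can be reproduced with a short search over a window of non-negative inputs. This is a sketch: div10_approx repeats the approximation quoted above, and first_mismatch is a hypothetical helper that returns -1 when the window contains no wrong answer.

```c
#include <assert.h>
#include <stdint.h>

/* Same approximation as the quoted div10: high half of a 64-bit product. */
static int32_t div10_approx(int32_t dividend) {
    int64_t invDivisor = 0x1999999A;
    return (int32_t)((invDivisor * dividend) >> 32);
}

/* First input in [start, end] (both non-negative) where the approximation
   differs from true division, or -1 if there is none. */
static int32_t first_mismatch(int32_t start, int32_t end) {
    for (int64_t x = start; x <= end; ++x) {
        if (div10_approx((int32_t)x) != (int32_t)x / 10)
            return (int32_t)x;
    }
    return -1;
}
```

Scanning from 0 takes on the order of a billion iterations; a narrower window such as first_mismatch(1073741000, 1073742000) finds the same input much faster. The explanation is that 10 * 0x1999999A == 2^32 + 4, so the excess 4x/(10 * 2^32) reaches 0.1 at x = 2^30, and the first input at or above that with a fractional part of 0.9 is 1073741829.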
From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.
As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.
unsigned div3_exact_only(unsigned x) {
__builtin_assume(x % 3 == 0); // or an equivalent with if() __builtin_unreachable()
return x / 3;
}
Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.
div3_exact_only:
imul eax, edi, 0xAAAAAAAB # 1 uop, 3c latency
ret
However, it is something you can do yourself in source code, at least for known type widths like uint32_t:
uint32_t div3_exact_only(uint32_t x) {
return x * 0xaaaaaaabU;
}
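For other odd divisors the constant can be derived rather than memorized. The following sketch uses Newton's iteration for inverses modulo a power of two, a standard technique that is not part of the original answer; inverse_mod_2_32 and div_exact_odd are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative inverse of an odd number modulo 2^32. */
static uint32_t inverse_mod_2_32(uint32_t odd) {
    uint32_t inv = odd;             /* odd * odd == 1 (mod 8): 3 correct bits */
    for (int i = 0; i < 4; ++i)
        inv *= 2u - odd * inv;      /* each step doubles the correct bits: 6, 12, 24, 48 */
    return inv;
}

/* Exact division: only meaningful when x is a multiple of odd_divisor. */
static uint32_t div_exact_odd(uint32_t x, uint32_t odd_divisor) {
    return x * inverse_mod_2_32(odd_divisor);
}
```

For a divisor of 3 this reproduces 0xAAAAAAAB, so div_exact_odd(15, 3) is 5. As with the hand-written version above, the result is meaningless when x is not a multiple of the divisor.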