CUDA: In-warp reduction and the volatile keyword
Question
After reading the question and its answer from the following
LINK
I still have one question. From my C/C++ background, I understand that using volatile has its drawbacks. The answers also point out that in CUDA, if the volatile keyword is not used, the optimizer can replace accesses to the shared array with registers that hold the data.
I want to know what performance issues can arise when computing a (sum) reduction, e.g.
__device__ void sum(volatile int *s_data, int tid)
{
    if (tid < 16)
    {
        s_data[tid] += s_data[tid + 16];
        s_data[tid] += s_data[tid + 8];
        s_data[tid] += s_data[tid + 4];
        s_data[tid] += s_data[tid + 2];
        s_data[tid] += s_data[tid + 1];
    }
}
I am using an in-warp reduction. Since all the threads within a warp are in sync, I believe there is no need to use the __syncthreads() construct.
Will removing the volatile keyword mess up my sum (due to CUDA optimizations)? Can I use a reduction such as this without the volatile keyword?
Since I call this reduction function multiple times, will the volatile keyword cause any performance degradation?
Recommended answer
Removing the volatile keyword from that code could break it on Fermi and Kepler GPUs. Those GPUs lack instructions that operate directly on shared memory. Instead, the compiler must emit a load/store pair that moves each value to and from a register.
What the volatile keyword does in this context is make the compiler honour that load-operate-store cycle rather than perform an optimisation that keeps the value of s_data[tid] in a register. Keeping the sum accumulating in a register would break the implicit memory synchronisation required to make the warp-level shared memory summation work correctly.
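To make the difference concrete, here is a minimal sketch. The names warp_sum, warp_sum_register_sketch, sum_kernel and run_warp_sum are hypothetical and not part of the original question or answer. The first function is the volatile version from the question. The second only illustrates the kind of transformation the compiler is allowed to make once volatile is removed; it is not actual compiler output. The sketch assumes, as the question does, that the 32 threads of a warp execute in lockstep; on GPUs with independent thread scheduling (Volta and later), NVIDIA's documentation calls for explicit __syncwarp() between steps instead of relying on that assumption.

```cuda
#include <cuda_runtime.h>

// Volatile version from the question: every s_data[...] access below is a
// real shared-memory load or store, so each step sees its neighbours' results.
__device__ void warp_sum(volatile int *s_data, int tid)
{
    if (tid < 16)
    {
        s_data[tid] += s_data[tid + 16];
        s_data[tid] += s_data[tid + 8];
        s_data[tid] += s_data[tid + 4];
        s_data[tid] += s_data[tid + 2];
        s_data[tid] += s_data[tid + 1];
    }
}

// Illustration only (assumption, not compiler output): without volatile the
// compiler may keep the running sum in a register and store it just once.
// Not called by the kernel below.
__device__ void warp_sum_register_sketch(int *s_data, int tid)
{
    if (tid < 16)
    {
        int acc = s_data[tid];    // single load of this thread's element
        acc += s_data[tid + 16];
        acc += s_data[tid + 8];   // neighbour's partial sum was never stored,
        acc += s_data[tid + 4];   // so these loads may see stale values
        acc += s_data[tid + 2];
        acc += s_data[tid + 1];
        s_data[tid] = acc;        // single store at the very end
    }
}

// Hypothetical test kernel: one warp of 32 threads sums the values 1..32.
__global__ void sum_kernel(int *out)
{
    __shared__ int s_data[32];
    int tid = threadIdx.x;
    s_data[tid] = tid + 1;
    __syncthreads();              // make the initial values visible
    warp_sum(s_data, tid);
    if (tid == 0)
        *out = s_data[0];         // thread 0 holds the final sum
}

// Hypothetical host helper: launches one block of 32 threads and returns
// the value the kernel wrote.
int run_warp_sum()
{
    int *d_out = nullptr;
    int h_out = -1;
    cudaMalloc(&d_out, sizeof(int));
    sum_kernel<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Under the lockstep assumption, run_warp_sum() should return the sum of 1 through 32. Swapping warp_sum for the register sketch in sum_kernel is a way to see why the intermediate stores matter: thread 0 would then add elements that other threads never updated in shared memory.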