__builtin_prefetch, how much does it read?
Question
I'm trying to optimize some C++ (RK4) by using __builtin_prefetch. I can't figure out how to prefetch a whole structure.
I don't understand how much of the const void *addr is read. I want to have the next values of from and to loaded.
#define likely(x) __builtin_expect((x), 1)

for (int i = from; i < to; i++)
{
    double kv = myLinks[i].kv;
    particle* from = con[i].Pfrom;  // shadows the loop bound named from inside the body
    particle* to = con[i].Pto;      // shadows the loop bound named to inside the body
    // Prefetch values at con[i + 1].Pfrom and con[i + 1].Pto
    double pos = to->px - from->px;
    double delta = from->r + to->r - pos;
    double k1 = axcel(kv, delta, from->mass) * dt;  // axcel is an inlined function
    double k2 = axcel(kv, delta + 0.5 * k1, from->mass) * dt;
    double k3 = axcel(kv, delta + 0.5 * k2, from->mass) * dt;
    double k4 = axcel(kv, delta + k3, from->mass) * dt;
    if (likely(!from->bc))
    {
        from->x += ((k1 + 2 * k2 + 2 * k3 + k4) / 6);
    }
}
Link: http://www.ibm.com/developerworks/linux/library/l-gcc-hacks/
Recommended answer
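I think it just emits one prefetch machine instruction, which basically fetches the cache line containing the address. The size of a cache line is processor-specific (commonly 64 bytes on current x86 processors), so a structure larger than one cache line is not covered by a single __builtin_prefetch call.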
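To make the "one cache line per call" point concrete, here is a minimal sketch of prefetching a whole structure by issuing one prefetch per line. The helper name prefetch_object, the constant kAssumedCacheLine and the 64-byte value are assumptions made for illustration; they are not part of GCC, and the real line size should be checked for the target processor.

```cpp
#include <cstddef>

// Assumed cache-line size: 64 bytes is common on x86-64 but not guaranteed.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical helper: issue one prefetch per assumed cache line covering *obj.
template <typename T>
inline void prefetch_object(const T* obj)
{
    const char* p = reinterpret_cast<const char*>(obj);
    for (std::size_t off = 0; off < sizeof(T); off += kAssumedCacheLine)
        __builtin_prefetch(p + off, /*rw=*/0, /*locality=*/3);
    // An object that is not line-aligned can spill into one more line,
    // so also touch the line holding its last byte.
    __builtin_prefetch(p + sizeof(T) - 1, /*rw=*/0, /*locality=*/3);
}

// Number of prefetches issued by the loop above for an object of `size` bytes.
constexpr std::size_t lines_spanned(std::size_t size)
{
    return (size + kAssumedCacheLine - 1) / kAssumedCacheLine;
}
```

For a small structure this collapses to one or two prefetch instructions. For a large one it may be better to prefetch only the fields the loop actually reads.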
For instance, you could use __builtin_prefetch (con[i+3].Pfrom). In my (limited) experience, in such a loop it is better to prefetch several elements in advance.
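As a rough sketch of prefetching a few elements ahead, the loop below prefetches the particles that will be needed kPrefetchAhead iterations later. The particle and connection layouts, the function sum_overlap and the constant kPrefetchAhead are hypothetical stand-ins, since the question does not show the real types. The distance of 3 comes from the suggestion above and would need tuning by measurement.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the question's types; the real layouts are not shown.
struct particle { double px, r, mass, x; bool bc; };
struct connection { particle* Pfrom; particle* Pto; };

// Illustrative prefetch distance, to be tuned by measurement.
constexpr std::size_t kPrefetchAhead = 3;

// Sums the overlap of every connection while prefetching the particles
// that will be used kPrefetchAhead iterations later.
double sum_overlap(const std::vector<connection>& con)
{
    double total = 0.0;
    const std::size_t n = con.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Reading con[i + kPrefetchAhead] is a real load from the array,
        // so it must stay inside the array bounds.
        if (i + kPrefetchAhead < n) {
            __builtin_prefetch(con[i + kPrefetchAhead].Pfrom);
            __builtin_prefetch(con[i + kPrefetchAhead].Pto);
        }
        const particle* a = con[i].Pfrom;
        const particle* b = con[i].Pto;
        total += a->r + b->r - (b->px - a->px);
    }
    return total;
}
```

The prefetch itself does not fault on a bad address, but the load of the pointer from con does, hence the bounds check.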
Don't use __builtin_prefetch too often (i.e. don't put a lot of them inside a loop). Measure the performance gain if you need them, and compile with GCC optimization enabled (at least -O2). If you are very lucky, manual __builtin_prefetch could increase the performance of your loop by 10 or 20% (but it could also hurt it).
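One possible way to measure the gain is to time the same loop with and without the prefetch. The sketch below is only an illustration: gather_sum, best_seconds and the distance of 8 are made-up names and values, and the workload is a generic indexed gather rather than the RK4 loop from the question.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical workload: sum data[] through an index table,
// with or without a manual prefetch of a later element.
template <bool UsePrefetch>
double gather_sum(const std::vector<double>& data,
                  const std::vector<std::size_t>& idx)
{
    constexpr std::size_t ahead = 8;  // illustrative distance, tune by measuring
    double total = 0.0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (UsePrefetch && i + ahead < idx.size())
            __builtin_prefetch(&data[idx[i + ahead]]);
        total += data[idx[i]];
    }
    return total;
}

// Runs `run` several times and returns the best wall-clock time in seconds.
template <typename F>
double best_seconds(F&& run, int repeats = 5)
{
    using clock = std::chrono::steady_clock;
    volatile double sink = 0.0;  // keeps the compiler from discarding the work
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = clock::now();
        sink = run();
        const std::chrono::duration<double> elapsed = clock::now() - start;
        if (elapsed.count() < best) best = elapsed.count();
    }
    (void)sink;
    return best;
}
```

Compare best_seconds for gather_sum<true> and gather_sum<false> on data sets much larger than the cache, built with at least -O2. Timings on small inputs say little about prefetching.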
If such a loop is crucial to you, you might consider running it on GPUs with OpenCL or CUDA (but that requires recoding some routines in the OpenCL or CUDA language, and tuning them to your particular hardware).
Also use a recent GCC compiler (the latest release at the time of writing was 4.6.2), because it is making a lot of progress in these areas.
(Added in January 2018:)
Both hardware (processors) and compilers have made a lot of progress regarding caches, so it seems that using __builtin_prefetch is less useful today (in 2018). Be sure to benchmark.