How to convert two _pd into one _ps?
Question
I'm looping over some data and computing double-precision values. After every two __m128d operations, I want to store the results in a single __m128 of floats.
So 64+64 + 64+64 bits (two __m128d) are stored into one 32+32+32+32 bit __m128.
I'm doing something like this:
__m128d v_result;
__m128 v_result_float;
...
// some operations on v_result
// store the first two "slot" on float
v_result_float = _mm_cvtpd_ps(v_result);
// some operations on v_result
// I need to store the last two "slot" on float
v_result_float = _mm_cvtpd_ps(v_result); ?!?
But this (obviously) overwrites the first two float "slots" every time.
How can I "offset" the second _mm_cvtpd_ps so that it inserts its values into the 3rd and 4th "slots"?
Here is the full code:
__m128d v_pA;
__m128d v_pB;
__m128d v_result;
__m128 v_result_float;
float *pCEnd = pTest + roundintup8(blockSize);
for (; pTest < pCEnd; pA += 8, pB += 8, pTest += 8) {
v_pA = _mm_load_pd(pA);
v_pB = _mm_load_pd(pB);
v_result = _mm_add_pd(v_pA, v_pB);
v_result = _mm_max_pd(v_boundLower, v_result);
v_result = _mm_min_pd(v_boundUpper, v_result);
v_result = _mm_mul_pd(v_rangeLn2per12, v_result);
v_result = _mm_add_pd(v_minLn2per12, v_result);
// two double processed: store in 1° and 2° float slot
v_result_float = _mm_cvtpd_ps(v_result);
v_pA = _mm_load_pd(pA + 2);
v_pB = _mm_load_pd(pB + 2);
v_result = _mm_add_pd(v_pA, v_pB);
v_result = _mm_max_pd(v_boundLower, v_result);
v_result = _mm_min_pd(v_boundUpper, v_result);
v_result = _mm_mul_pd(v_rangeLn2per12, v_result);
v_result = _mm_add_pd(v_minLn2per12, v_result);
// another two double processed: store in 3° and 4° float slot
v_result_float = _mm_cvtpd_ps(v_result); // fail
v_result_float = someFunction(v_result_float);
_mm_store_ps(pTest, v_result_float);
v_pA = _mm_load_pd(pA + 4);
v_pB = _mm_load_pd(pB + 4);
v_result = _mm_add_pd(v_pA, v_pB);
v_result = _mm_max_pd(v_boundLower, v_result);
v_result = _mm_min_pd(v_boundUpper, v_result);
v_result = _mm_mul_pd(v_rangeLn2per12, v_result);
v_result = _mm_add_pd(v_minLn2per12, v_result);
// two double processed: store in 1° and 2° float slot
v_result_float = _mm_cvtpd_ps(v_result);
v_pA = _mm_load_pd(pA + 6);
v_pB = _mm_load_pd(pB + 6);
v_result = _mm_add_pd(v_pA, v_pB);
v_result = _mm_max_pd(v_boundLower, v_result);
v_result = _mm_min_pd(v_boundUpper, v_result);
v_result = _mm_mul_pd(v_rangeLn2per12, v_result);
v_result = _mm_add_pd(v_minLn2per12, v_result);
// another two double processed: store in 3° and 4° float slot
v_result_float = _mm_cvtpd_ps(v_result); // fail
v_result_float = someFunction(v_result_float);
_mm_store_ps(pTest + 4, v_result_float);
}
Recommended answer
You need to move the low 64 bits (two floats) of the second conversion into the high 64 bits of the first conversion's result, using movlhps (_mm_movelh_ps). Simplified example:
#include <immintrin.h>
__m128d some_double_operation(__m128d);
__m128 some_float_operation(__m128);
void foo(double const* input, float* output, int size)
{
// assuming everything is already nicely aligned ...
for(int i=0; i<size; i+=4, input+=4, output+=4)
{
__m128d res_lo = some_double_operation(_mm_load_pd(input));
__m128d res_hi = some_double_operation(_mm_load_pd(input+2));
__m128 res_float = _mm_movelh_ps(_mm_cvtpd_ps(res_lo), _mm_cvtpd_ps(res_hi));
__m128 res_final = some_float_operation(res_float);
_mm_store_ps(output, res_final);
}
}
Godbolt demo: https://godbolt.org/z/wgKjxN.
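To make the lane layout explicit, here is a minimal sketch of just the packing step. The helper names pack_pd_pair_to_ps and four_doubles_to_floats are hypothetical (they do not appear in the answer), and unaligned loads and stores are used so the sketch makes no alignment assumption.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE header for _mm_movelh_ps */

/* Hypothetical helper: narrows four doubles, held as two __m128d values,
   into one __m128 laid out as lo[0], lo[1], hi[0], hi[1]. */
static inline __m128 pack_pd_pair_to_ps(__m128d lo, __m128d hi)
{
    __m128 lo_ps = _mm_cvtpd_ps(lo);     /* lanes: lo0, lo1, 0, 0 */
    __m128 hi_ps = _mm_cvtpd_ps(hi);     /* lanes: hi0, hi1, 0, 0 */
    /* movlhps: keep the low half of lo_ps, copy the low half of hi_ps
       into the high half of the result. */
    return _mm_movelh_ps(lo_ps, hi_ps);  /* lanes: lo0, lo1, hi0, hi1 */
}

/* Hypothetical wrapper: converts four consecutive doubles to four floats. */
static void four_doubles_to_floats(const double *in, float *out)
{
    assert(in && out);
    __m128d lo = _mm_loadu_pd(in);       /* in[0], in[1] */
    __m128d hi = _mm_loadu_pd(in + 2);   /* in[2], in[3] */
    _mm_storeu_ps(out, pack_pd_pair_to_ps(lo, hi));
}
```

Because _mm_cvtpd_ps zeroes the upper two lanes of its result, nothing needs to be masked before the two halves are combined.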
If some_double_operation is inlined, the compiler will likely keep the result of the first double operation in a register that the second call does not use, so nothing needs to be spilled to memory.
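As a rough illustration of how this could look for the loop in the question, the sketch below puts the add/clamp/scale/offset chain into a static inline helper and packs two results per store. The names clamp_scale_pd and process_block, and passing the bounds as scalar parameters, are assumptions made for the example. The question's someFunction step is left out, and unaligned loads and stores are used; switch to _mm_load_pd and _mm_store_ps if the buffers are known to be 16-byte aligned.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stddef.h>

/* The add / clamp / scale / offset chain from the question, on two doubles.
   Hypothetical helper name; small enough that compilers typically inline it. */
static inline __m128d clamp_scale_pd(__m128d a, __m128d b,
                                     __m128d lower, __m128d upper,
                                     __m128d scale, __m128d offset)
{
    __m128d r = _mm_add_pd(a, b);
    r = _mm_max_pd(lower, r);
    r = _mm_min_pd(upper, r);
    r = _mm_mul_pd(scale, r);
    return _mm_add_pd(offset, r);
}

/* Processes `count` doubles from pA and pB (count must be a multiple of 4)
   and writes `count` floats to pOut. */
static void process_block(const double *pA, const double *pB, float *pOut,
                          size_t count, double lower, double upper,
                          double scale, double offset)
{
    assert(count % 4 == 0);
    const __m128d v_lower  = _mm_set1_pd(lower);
    const __m128d v_upper  = _mm_set1_pd(upper);
    const __m128d v_scale  = _mm_set1_pd(scale);
    const __m128d v_offset = _mm_set1_pd(offset);

    for (size_t i = 0; i < count; i += 4) {
        /* First two doubles: destined for float slots 0 and 1. */
        __m128d lo = clamp_scale_pd(_mm_loadu_pd(pA + i), _mm_loadu_pd(pB + i),
                                    v_lower, v_upper, v_scale, v_offset);
        /* Next two doubles: destined for float slots 2 and 3. */
        __m128d hi = clamp_scale_pd(_mm_loadu_pd(pA + i + 2), _mm_loadu_pd(pB + i + 2),
                                    v_lower, v_upper, v_scale, v_offset);
        /* Narrow both halves and merge them into one 4-float vector. */
        __m128 packed = _mm_movelh_ps(_mm_cvtpd_ps(lo), _mm_cvtpd_ps(hi));
        _mm_storeu_ps(pOut + i, packed);
    }
}
```

The question unrolls its loop to handle eight doubles per iteration; the sketch handles four per iteration for brevity, and the same pattern can be repeated inside the loop body if that unrolling is wanted.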