Determining the least element and its position in each matrix column with CUDA Thrust

Question

I have a fairly simple problem, but I cannot figure out an elegant solution to it.

I have Thrust code that produces c vectors of the same size containing values. Let's say each of these c vectors has an index. For each vector position, I would like to get the index of the c vector whose value is the lowest:

Example:

C0 =     (0,10,20,3,40)
C1 =     (1,2 ,3 ,5,10)

As a result I would get a vector containing, for each position, the index of the C vector that has the lowest value:

result = (0,1 ,1 ,0,1)

I have thought about doing it with Thrust zip iterators, but have come across issues. I could zip all the c vectors and implement an arbitrary transformation that takes a tuple and returns the index of its lowest value, but:

How to iterate over the contents of a tuple?
As I understand it, tuples can only store up to 10 elements, and there can be many more than 10 c vectors.

I then thought about doing it this way: instead of having c separate vectors, append them all into a single vector C, then generate keys referencing the positions and perform a stable sort by key, which regroups the vector entries from the same position together. In the example that would give:

C =      (0,10,20,3,40,1,2,3,5,10)
keys =   (0,1 ,2 ,3,4 ,0,1,2,3,4 )
after stable sort by key:
output = (0,1,10,2,20,3,3,5,40,10)
keys =   (0,0,1 ,1,2 ,2,3,3,4 ,4 )

Then generate keys from the positions in the vector, zip the output with the indices of the c vectors, and perform a reduce by key with a custom functor that, for each reduction, outputs the index with the lowest value. In the example:

input =  (0,1,10,2,20,3,3,5,40,10)
indexes= (0,1,0 ,1,0 ,1,0,1,0 ,1)
keys =   (0,0,1 ,1,2 ,2,3,3,4 ,4)
after reduce by keys on zipped input and indexes:
output = (0,1,1,0,1)

However, how do I write such a functor for the reduce by key operation?

Recommended answer

Since your vectors all have to be the same length, it is better to concatenate them and treat them as a matrix C.

Your problem then becomes finding the index of the minimum element of each column in a row-major matrix. It can be solved in two steps:

1. change row-major to column-major;
2. find the index of the minimum in each column.

In step 1, you proposed using stable_sort_by_key to rearrange the element order, which is not an efficient method, because the rearrangement can be computed directly from the number of rows and columns of the matrix. In Thrust, it can be done with a permutation iterator:

thrust::make_permutation_iterator(
    c.begin(),
    thrust::make_transform_iterator(
        thrust::make_counting_iterator((int) 0),
        (_1 % row) * col + _1 / row)
)
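
To make the index expression concrete, here is a minimal host-only C++ sketch (standard library only, no Thrust) that evaluates the same mapping in a plain loop. The helper name `gather_column_major` is an assumption made for illustration and is not part of Thrust.

```cpp
#include <vector>

// Hypothetical helper (not a Thrust API): reads a row-major row x col matrix
// in column-major order, using the same index expression as the
// permutation iterator above: (i % row) * col + i / row.
std::vector<float> gather_column_major(const std::vector<float>& c, int row, int col)
{
    std::vector<float> out(c.size());
    for (int i = 0; i < row * col; ++i) {
        int r = i % row;   // which input vector (matrix row) element i comes from
        int j = i / row;   // which position (matrix column) element i belongs to
        out[i] = c[r * col + j];
    }
    return out;
}
```

With the question's data, `gather_column_major({0, 10, 20, 3, 40, 1, 2, 3, 5, 10}, 2, 5)` yields 0, 1, 10, 2, 20, 3, 3, 5, 40, 10. That is the same order the stable sort by key produced in the question, obtained without sorting.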

In step 2, reduce_by_key does exactly what you want. In your case the binary reduction functor is simple, because Thrust already defines comparison on tuples (the elements of your zipped vector). Tuples are compared lexicographically, so the first element (the value) decides and the index only breaks ties. The functor is available as

thrust::minimum< thrust::tuple<float, int> >()
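
The effect of that functor on (value, index) pairs can be sketched on the host with `std::tuple`, used here as a stand-in for `thrust::tuple` on the assumption that both order tuples lexicographically. The names `ValIdx`, `tuple_min` and `reduce_segment` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <tuple>
#include <vector>

using ValIdx = std::tuple<float, int>;   // (value, index of the source vector)

// Hypothetical stand-in for thrust::minimum<thrust::tuple<float, int> >:
// returns the smaller tuple under lexicographic operator<.
ValIdx tuple_min(const ValIdx& a, const ValIdx& b)
{
    return a < b ? a : b;
}

// Reduces one segment (one matrix column) of (value, index) pairs.
// Assumes the segment is non-empty.
ValIdx reduce_segment(const std::vector<ValIdx>& seg)
{
    ValIdx best = seg.front();
    for (std::size_t k = 1; k < seg.size(); ++k)
        best = tuple_min(best, seg[k]);
    return best;
}
```

Because the value is the first tuple element, the index simply travels along with the winning value. When two values are equal, the pair with the smaller index wins.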

The whole program is shown below. Thrust 1.6.0+ is required because it uses placeholders in fancy iterators.

#include <iostream>
#include <iterator>
#include <algorithm>

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

using namespace thrust::placeholders;

int main()
{

    const int row = 2;
    const int col = 5;
    float initc[] =
            { 0, 10, 20, 3, 40, 1, 2, 3, 5, 10 };
    thrust::device_vector<float> c(initc, initc + row * col);

    thrust::device_vector<float> minval(col);
    thrust::device_vector<int> minidx(col);

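    // Segmented reduction: each run of `row` consecutive elements (one matrix
    // column, read in column-major order) is reduced to (min value, row index).
    thrust::reduce_by_key(
            // keys: i / row is the column that element i belongs to
            thrust::make_transform_iterator(
                    thrust::make_counting_iterator((int) 0),
                    _1 / row),
            thrust::make_transform_iterator(
                    thrust::make_counting_iterator((int) 0),
                    _1 / row) + row * col,
            // values: (element of c in column-major order, row index i % row)
            thrust::make_zip_iterator(
                    thrust::make_tuple(
                            thrust::make_permutation_iterator(
                                    c.begin(),
                                    thrust::make_transform_iterator(
                                            thrust::make_counting_iterator((int) 0), (_1 % row) * col + _1 / row)),
                            thrust::make_transform_iterator(
                                    thrust::make_counting_iterator((int) 0), _1 % row))),
            // output keys are not needed
            thrust::make_discard_iterator(),
            // output values: (min value, index of the min) per column
            thrust::make_zip_iterator(
                    thrust::make_tuple(
                            minval.begin(),
                            minidx.begin())),
            thrust::equal_to<int>(),
            thrust::minimum<thrust::tuple<float, int> >()
    );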

    std::copy(minidx.begin(), minidx.end(), std::ostream_iterator<int>(std::cout, " "));
    std::cout << std::endl;
    return 0;
}

Two remaining issues may affect performance:

1. the minimum values have to be output, although they are not required;
2. reduce_by_key is designed for segments of varying length, so it may not be the fastest algorithm for reducing segments that all have the same length.

Writing your own kernel could be the best solution for the highest performance.
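
As a rough idea of what such a kernel might compute, the host-side C++ sketch below scans each column of the row-major matrix directly and stores only the index. Both function names are hypothetical. In a CUDA kernel the outer loop over `j` would typically be replaced by one thread per column, with `j` derived from the block and thread indices. This illustrates the logic only and is not a tuned kernel.

```cpp
#include <vector>

// Hypothetical per-column work item: index of the row holding the smallest
// value in column j of a row-major row x col matrix. Ties keep the lowest row.
int column_argmin(const std::vector<float>& c, int row, int col, int j)
{
    int best = 0;
    for (int r = 1; r < row; ++r)
        if (c[r * col + j] < c[best * col + j])
            best = r;
    return best;
}

// Host loop standing in for the kernel launch: one iteration per column.
std::vector<int> all_column_argmin(const std::vector<float>& c, int row, int col)
{
    std::vector<int> minidx(col);
    for (int j = 0; j < col; ++j)
        minidx[j] = column_argmin(c, row, col, j);
    return minidx;
}
```

For the example matrix, `all_column_argmin({0, 10, 20, 3, 40, 1, 2, 3, 5, 10}, 2, 5)` gives 0, 1, 1, 0, 1, matching the expected result in the question. This version never writes the minimum values, and in the row-major layout neighbouring columns read neighbouring addresses, which is the access pattern one would usually want from adjacent threads.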
