
Pre-8.x equivalent of __reduce_max_sync() in CUDA

compute-capability gpu-warp cuda parallel-processing c++

Serge Rogatch · Tech Community · 3 years ago

cuda-memcheck has detected a race condition in code that does the following:

const bool condition = /* different in each thread */;
__shared__ int owner[nWarps];
/* ... owner[i] is initialized to blockDim.x+1 */
if(condition) {
    // Race: several threads of the same warp may reach this write.
    owner[threadIdx.x/32] = threadIdx.x;
}

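So basically, this code computes an owner thread for each warp based on some condition. Some warps may have no owner at all, but in other warps more than one thread may satisfy the condition, and then a race condition occurs because multiple threads assign a value to the same shared memory location.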
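To make the failure mode concrete, here is a minimal host-side C++ sketch (not CUDA, and not part of the original code) that counts how many threads in each warp would write to the same `owner` slot. The function name `count_candidates` and the fixed warp size of 32 are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Host-side model: for each group of 32 threads (one warp), count how many
// threads satisfy the condition and would therefore write to owner[warp].
// A count greater than 1 corresponds to the race reported by cuda-memcheck.
std::vector<int> count_candidates(const std::vector<bool>& condition) {
    const std::size_t warp_size = 32;  // assumed warp size
    const std::size_t n_warps = (condition.size() + warp_size - 1) / warp_size;
    std::vector<int> candidates(n_warps, 0);
    for (std::size_t tid = 0; tid < condition.size(); ++tid) {
        if (condition[tid]) {
            ++candidates[tid / warp_size];
        }
    }
    return candidates;
}
```

For example, with 64 threads where threads 3, 5 and 40 satisfy the condition, warp 0 has two candidates (a race) and warp 1 has exactly one.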

After going through the documentation, I think what I need to do is:

const uint32_t mask = __ballot_sync(0xffffffff, condition);
if(mask != 0) {
    const unsigned max_owner = __reduce_max_sync(mask, threadIdx.x);
    if(threadIdx.x == max_owner) {
        // at most 1 thread assigns here per warp
        owner[threadIdx.x/32] = max_owner;
    }
}

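However, my attempt has two problems:

1. I don't really need to find the maximum thread; any single thread with condition==true would do.
2. It requires CUDA compute capability 8.x, while I need to support devices with compute capability 5.2.

Could you help me solve the above problems?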


Serge Rogatch · 3 years ago

The following function seems to solve the problem:

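__device__ void SetOwnerThread(int* dest, const bool condition) {
  // Bit i of mask is set if lane i of this warp has condition == true.
  const uint32_t mask = __ballot_sync(0xffffffff, condition);
  if(!mask) {
    return;  // no thread in this warp satisfies the condition
  }
  // Isolate the lowest set bit (two's-complement trick on an unsigned value).
  const uint32_t lowest_bit = mask & -mask;
  const uint32_t my_bit = (1u << (threadIdx.x & 31));
  if(lowest_bit == my_bit) {
    // Exactly one thread per warp gets here, so there is no race.
    *dest = threadIdx.x;
  }
}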
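The bit manipulation above can be checked without a GPU. The following is a minimal host-side C++ sketch of the same election, operating on a ballot mask passed in as a plain integer. The helper names `lowest_lane` and `highest_lane` are hypothetical and do not appear in the original answer.

```cpp
#include <cstdint>

// Returns the lane (0..31) that owns the lowest set bit of the ballot mask,
// or -1 if no lane voted true. This is the lane the election would pick.
int lowest_lane(std::uint32_t mask) {
    if (mask == 0) return -1;
    // ~mask + 1 is the two's-complement negation, i.e. the same as -mask.
    const std::uint32_t lowest_bit = mask & (~mask + 1u);
    int lane = 0;
    while ((lowest_bit >> lane) != 1u) ++lane;
    return lane;
}

// Returns the lane that owns the highest set bit of the ballot mask, or -1.
// With lane indices as the reduced values, this is the lane a max-reduction
// over the voting lanes would select.
int highest_lane(std::uint32_t mask) {
    if (mask == 0) return -1;
    int lane = 31;
    while (((mask >> lane) & 1u) == 0u) --lane;
    return lane;
}
```

For a mask of 0x28 (lanes 3 and 5 voted true), `lowest_lane` gives 3 and `highest_lane` gives 5. In device code the loops would likely be replaced by the integer intrinsics `__ffs(mask) - 1` and `31 - __clz(mask)`, which should be available on compute capability 5.2; choosing the lowest lane is enough here because the question only needs one owner per warp, not the maximum.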
