Why doesn't OpenMP place threads according to a manual NUMA binding?

numa hpc openmp c++

Amos · Tech Community · 6 years ago

I am building a NUMA-aware processor that binds to a given socket and accepts a lambda. Here is what I have done:

#include <numa.h>
#include <sched.h>  // sched_getcpu
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>   // to_string
#include <thread>
#include <vector>

using namespace std;

unsigned nodes = numa_num_configured_nodes();
unsigned cores = numa_num_configured_cpus();
unsigned cores_per_node = cores / nodes;

int main(int argc, char* argv[]) {
    putenv("OMP_PLACES=sockets(1)");
    cout << numa_available() << endl;  // returns 0
    numa_set_interleave_mask(numa_all_nodes_ptr);
    int size = 200000000;
    for (auto i = 0; i < nodes; ++i) {
        auto t = thread([&]() {
            // binding to given socket
            numa_bind(numa_parse_nodestring(to_string(i).c_str()));
            vector<int> v(size, 0);
            cout << "node #" << i << ": on CPU " << sched_getcpu() << endl;
#pragma omp parallel for num_threads(cores_per_node) proc_bind(master)
            for (auto i = 0; i < 200000000; ++i) {
                for (auto j = 0; j < 10; ++j) {
                    v[i]++;
                    v[i] *= v[i];
                    v[i] *= v[i];
                }
            }
        });
        t.join();
    }
}

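However, all threads run on socket 0. It seems that numa_bind does not bind the current thread to the given socket. The second NUMA processor (NUMA node 1) prints "node #1: on CPU 0", but it should be on CPU 1. So what is going wrong?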

1 answer | up to 6 years ago

Daniel Langr · 6 years ago

The following works exactly as I expected:

#include <cassert>
#include <iostream>
#include <numa.h>
#include <omp.h>
#include <sched.h>

int main() {
   assert (numa_available() != -1);

   auto nodes = numa_num_configured_nodes();
   auto cores = numa_num_configured_cpus();
   auto cores_per_node = cores / nodes;

   omp_set_nested(1);

   #pragma omp parallel num_threads(nodes)
   {
      auto outer_thread_id = omp_get_thread_num();
      numa_run_on_node(outer_thread_id);

      #pragma omp parallel num_threads(cores_per_node)
      {
         auto inner_thread_id = omp_get_thread_num();

         #pragma omp critical
         std::cout
            << "Thread " << outer_thread_id << ":" << inner_thread_id
            << " core: " << sched_getcpu() << std::endl;

         assert(outer_thread_id == numa_node_of_cpu(sched_getcpu()));
      }
   }
}

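The program first creates 2 (outer) threads on my dual-socket server. It then binds them to different sockets (NUMA nodes). Finally, it splits each outer thread into 20 (inner) threads, because each CPU has 10 physical cores with hyper-threading enabled.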

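All inner threads run on the same socket as their parent thread: cores 0-9 and 20-29 for outer thread 0, and cores 10-19 and 30-39 for outer thread 1. (In my case, sched_getcpu() returned virtual core numbers in the range 0-39.)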
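The core numbering reported above can be written down as a small formula. The sketch below is only an illustration: expected_node_of_cpu is a hypothetical helper, and it assumes the layout seen on this particular server (the first hyper-thread of every physical core is numbered node by node, then the second hyper-threads in the same order). Other machines may number their CPUs differently, so numa_node_of_cpu() remains the authoritative source.

```cpp
#include <cassert>

// Hypothetical helper, not part of libnuma or OpenMP.
// Maps a logical CPU number to the NUMA node it would belong to, ASSUMING
// the numbering described in the answer: with 2 nodes and 10 physical cores
// per node, CPUs 0-9 and 20-29 are on node 0, CPUs 10-19 and 30-39 on node 1.
int expected_node_of_cpu(int cpu, int nodes, int physical_cores_per_node) {
    assert(cpu >= 0 && nodes > 0 && physical_cores_per_node > 0);
    // Number of physical cores in the whole machine (20 in the example).
    const int physical_cores = nodes * physical_cores_per_node;
    // Fold the hyper-thread siblings (20-39) back onto 0-19, then split by node.
    return (cpu % physical_cores) / physical_cores_per_node;
}
```

With the values from the answer, expected_node_of_cpu(25, 2, 10) gives 0 and expected_node_of_cpu(35, 2, 10) gives 1, which matches the node that the assertion in the program checks through numa_node_of_cpu().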

Note that there are no C++11 threads here, only pure OpenMP.
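Whichever threading model is used, it can help to check what a binding call actually did to the calling thread. The following is a minimal sketch that uses only the Linux/glibc affinity calls (sched_getaffinity, sched_setaffinity, sched_getcpu) rather than libnuma. The helper names first_allowed_cpu and pin_calling_thread are hypothetical, and the sketch assumes Linux with a compiler that defines _GNU_SOURCE (g++ does so by default).

```cpp
#include <sched.h>

// Hypothetical helper: returns the lowest-numbered CPU that the calling
// thread is currently allowed to run on, or -1 if the mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Hypothetical helper: restricts the calling thread to one CPU.
// Returns true if the kernel accepted the new affinity mask.
bool pin_calling_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Calling first_allowed_cpu() before and after a binding call, or comparing sched_getcpu() with the CPU passed to pin_calling_thread(), shows whether the restriction really applies to the thread that issued it.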