
Performance delta caused by pointer assignment or increment (strict aliasing?)

cpu-cache intel visual-c++ assembly performance


Mark Ingram · 6 years ago

Update: https://wandbox.org/permlink/G5NFe8ooSKg29ZuS
https://godbolt.org/z/PEWiRk

I'm seeing the time taken by a method vary from 0 to 500-900 over 256 iterations (Visual Studio 2017):

void* SomeMethod()
{
    void *result = _ptr; // _ptr is from malloc

    // Increment original pointer
    _ptr = static_cast<uint8_t*>(_ptr) + 32776;    // (1)

    // Set the back pointer
    *static_cast<ThisClass**>(result) = this;      // (2)

    return result;
}

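I don't believe I'm violating the strict aliasing rules. Looking at the output in Compiler Explorer, I can see that setting the back pointer (line (2)) generates only a single instruction: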

mov QWORD PTR [rax], rcx

For reference, incrementing the original pointer (line (1)) generates two instructions:

lea     rdx, QWORD PTR [rax+32776]
mov     QWORD PTR [rcx], rdx

For completeness, here is the full assembly output:

mov     rax, QWORD PTR [rcx]
lea     rdx, QWORD PTR [rax+32776]
mov     QWORD PTR [rcx], rdx
mov     QWORD PTR [rax], rcx
ret     0
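
To make the snippet above easier to experiment with, here is a minimal self-contained sketch of a class around it. Only the body of SomeMethod comes from the question; the constructor, the destructor, the _base member, the kBlockSize constant and the OwnerOf helper are hypothetical names added for illustration, and the question does not show how the real class is laid out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical reconstruction: only SomeMethod's body is taken from the question.
class ThisClass {
public:
    // Stride used on line (1) of the question: 32 KiB plus 8 bytes.
    static constexpr std::size_t kBlockSize = 32776;

    // Assumption: the class owns one malloc'ed buffer holding `blocks` blocks.
    explicit ThisClass(std::size_t blocks)
        : _base(std::malloc(blocks * kBlockSize)), _ptr(_base) {
        if (_base == nullptr) throw std::bad_alloc();
    }
    ~ThisClass() { std::free(_base); }
    ThisClass(const ThisClass&) = delete;
    ThisClass& operator=(const ThisClass&) = delete;

    void* SomeMethod() {
        void* result = _ptr;

        // (1) Advance the bump pointer by one block.
        _ptr = static_cast<std::uint8_t*>(_ptr) + kBlockSize;

        // (2) Store the back pointer in the first 8 bytes of the block.
        // This is the first write to the block, so it is the store that
        // touches freshly allocated memory.
        *static_cast<ThisClass**>(result) = this;

        return result;
    }

    // Hypothetical helper: read the back pointer out of a block.
    static ThisClass* OwnerOf(void* block) {
        assert(block != nullptr);
        return *static_cast<ThisClass**>(block);
    }

private:
    void* _base;  // start of the malloc'ed buffer, kept for free()
    void* _ptr;   // next unused block
};
```

Constructing the object with at least as many blocks as the number of SomeMethod calls keeps every store inside the buffer; the sketch does no bounds checking, just like the original.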

1 answer


Peter Cordes · 6 years ago

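In the test code you linked, each store lands about 32 KiB past the previous one, into freshly allocated memory with no warm-up. You are probably taking a soft page fault and a copy-on-write on every iteration. (Edit: the freshly malloc'ed memory is probably all lazily mapped to the same physical zero page.)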

256 iterations is also nowhere near enough for the CPU to ramp up from its idle clock speed to its normal/turbo clock speed.
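
The two points above (no warm-up, and a very short run) can be checked with a small experiment. The following is only a sketch under stated assumptions: the 4 KiB page size, the function names TouchPages, TimeStores and Measure, and the use of a one-byte volatile store in place of the 8-byte back-pointer store are all illustrative choices, not taken from the linked test code.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kStride = 32776;    // stride from the question
constexpr std::size_t kPageSize = 4096;   // assumption: 4 KiB pages

// Warm-up: write one byte in every page so each page is faulted in
// before the timed region. Returns the number of pages touched.
std::size_t TouchPages(std::uint8_t* buf, std::size_t bytes) {
    volatile std::uint8_t* p = buf;  // volatile keeps the stores from being optimized out
    std::size_t touched = 0;
    for (std::size_t off = 0; off < bytes; off += kPageSize) {
        p[off] = 1;
        ++touched;
    }
    return touched;
}

// Timed region: one store per stride, as in the question's loop.
// A one-byte store stands in for the back-pointer store.
long long TimeStores(std::uint8_t* buf, std::size_t count) {
    volatile std::uint8_t* p = buf;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        p[i * kStride] = 2;
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Allocate a fresh buffer, optionally warm it up, and time the stores.
// Returns elapsed nanoseconds, or -1 if the allocation fails.
long long Measure(std::size_t count, bool warm_up) {
    const std::size_t bytes = count * kStride;
    auto* buf = static_cast<std::uint8_t*>(std::malloc(bytes));
    if (buf == nullptr) return -1;
    if (warm_up) TouchPages(buf, bytes);
    const long long ns = TimeStores(buf, count);
    std::free(buf);
    return ns;
}
```

Comparing Measure(256, false) against Measure(256, true) should show how much of the measured time belongs to first-touch page faults rather than to the stores themselves. Calling Measure with a much larger count, or repeating the timed loop many times, would also give the CPU time to leave its idle clock speed.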

With balance_performance (not the default balance_power, so the clock speed ramps up faster):

I compiled with gcc8.2.1 and ran it in a loop with while ./a.out ;do :;done (1.125us)

Using Linux perf stat:

peter@volta:/tmp$ perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,dtlb_store_misses.miss_causes_a_walk,tlb_flush.dtlb_thread,dtlb_load_misses.miss_causes_a_walk -r100 ./a.out


 Performance counter stats for './a.out' (100 runs):

              1.15 msec task-clock                #    0.889 CPUs utilized            ( +-  0.33% )
                 0      context-switches          #   40.000 M/sec                    ( +- 49.24% )
                 0      cpu-migrations            #    0.000 K/sec                  
               191      page-faults               # 191250.000 M/sec                  ( +-  0.09% )
         4,343,915      cycles                    # 4343915.040 GHz                   ( +-  0.33% )  (82.06%)
           819,685      branches                  # 819685480.000 M/sec               ( +-  0.05% )
         4,581,597      instructions              #    1.05  insn per cycle           ( +-  0.05% )
         6,366,610      uops_issued.any           # 6366610010.000 M/sec              ( +-  0.05% )
         6,287,015      uops_executed.thread      # 6287015440.000 M/sec              ( +-  0.05% )
             1,271      dtlb_store_misses.miss_causes_a_walk # 1270910.000 M/sec                 ( +-  0.21% )
     <not counted>      tlb_flush.dtlb_thread                                         (0.00%)
     <not counted>      dtlb_load_misses.miss_causes_a_walk                                     (0.00%)

        0.00129289 +- 0.00000489 seconds time elapsed  ( +-  0.38% )

There are 191 page faults for 256 loop iterations, so the vast majority of this program's time is spent in the kernel.

Once we are back in user space, more than 1000 stores miss in the dTLB and also miss in the second-level TLB, so they require a page walk. But no loads.

We can get cleaner data by allocating more memory, so that we can increase Count without faulting. Profiling with perf record main