
Are a template function's typedef'd function arguments properly inlined when each instance of the template function is created?

inline math templates function c++

SoLaR · 6 years ago

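I have written a function that operates on several data streams at the same time, produces an output result, and places it in a destination stream. A great deal of time has gone into optimizing the performance of this function (OpenMP, intrinsics, etc.), and it performs beautifully. There is a lot of math involved, so needless to say the function is very long.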

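Now I want to make the math code replaceable for each instance of this same function, without writing out every version of the function by hand. I would like to differentiate between the instances using only #defines or inline functions (the code has to be inlined in each version).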

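I went for templates, but templates only allow type specifiers, and I realized that #defines can't be used here. The remaining solution is inline math functions, so the simplified idea is to create a header like this: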

'alm_quazimodo.h':

#pragma once

typedef struct ALM_DATA
{
  int l, t, r, b;
  int scan;
  BYTE* data;  
} ALM_DATA;

typedef BYTE (*MATH_FX)(BYTE&, BYTE&);
// etc

inline BYTE math_a1(BYTE& A, BYTE& B){ return ((BYTE)((B > A) ? B:A)); }
inline BYTE math_a2(BYTE& A, BYTE& B){ return ((BYTE)(255 - ((long)((long)(255 - A) * (255 - B)) >> 8))); }
inline BYTE math_a3(BYTE& A, BYTE& B){ return ((BYTE)((B < 128)?(2*(((long)A>>1)+64))*((float)B/255):(255-(2*(255-(((long)A>>1)+64))*(float)(255-B)/255)))); }
// etc

template <typename MATH>
inline int const template_math_av (MATH math, ALM_DATA& a, ALM_DATA& b) 
{ 
  // ultra simplified version of very complex code
  for (int y = a.t; y <= a.b; y++)
  {
    int yoffset = y * a.scan;
    for (int x = a.l; x <= a.r; x++)
    {
      int xoffset = yoffset + x;
      a.data[xoffset] = math(a.data[xoffset], b.data[xoffset]);
    }
  }
  return 0;
}

ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b);

math_caller is defined in 'alm_quazimodo.cpp' as follows:

#include "stdafx.h"
#include "alm_quazimodo.h"

ALM_API int math_caller(int condition, ALM_DATA& a, ALM_DATA& b)
{
  switch(condition)
  {
    case 1: return template_math_av<MATH_FX>(math_a1, a, b);
      break;
    case 2: return template_math_av<MATH_FX>(math_a2, a, b);
      break;
    case 3: return template_math_av<MATH_FX>(math_a3, a, b);
      break;
    // etc
  }
  return -1;
}

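The main concern here is optimization: chiefly that the math function code gets inlined, without breaking the existing optimizations of the original code. And, of course, without having to write a separate instance of the function for each specific math operation ;)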

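So does this template properly inline all of the math functions? And are there any suggestions on how to optimize this function template?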

If nothing else, thank you for reading this lengthy question.

1 answer | 6 years ago

gflegar 6 years ago

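It all depends on your compiler, the optimization level, and how and where the functions math_a1 to math_a3 are defined. Usually, the compiler can optimize this if the functions in question are inline functions in the same compilation unit as the rest of the code. If that does not happen in your case, you may want to consider functors instead of functions.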
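A minimal sketch of the functor suggestion, applied to a cut-down version of the code from the question. The names math_max and math_screen are hypothetical, BYTE is assumed to be an unsigned char, and ALM_API is left out. Because each functor is its own type, every call to template_math_av gets a separate instantiation in which the call target is known statically, whereas the MATH_FX version shares one instantiation for all function pointers. Whether the call is actually inlined still depends on the compiler and the optimization level.

```cpp
typedef unsigned char BYTE;  // assumption: stand-in for the Windows BYTE type

struct ALM_DATA {
    int l, t, r, b;  // inclusive rectangle bounds
    int scan;        // row stride in bytes
    BYTE* data;
};

// Hypothetical functor with the same math as math_a1 (per-byte maximum).
struct math_max {
    BYTE operator()(BYTE a, BYTE b) const { return b > a ? b : a; }
};

// Hypothetical functor with the same math as math_a2.
struct math_screen {
    BYTE operator()(BYTE a, BYTE b) const {
        return static_cast<BYTE>(255 - (((255 - a) * (255 - b)) >> 8));
    }
};

// Same loop as in the question; Math is deduced as a distinct type per functor.
template <typename Math>
int template_math_av(Math math, ALM_DATA& a, ALM_DATA& b) {
    for (int y = a.t; y <= a.b; ++y) {
        const int yoffset = y * a.scan;
        for (int x = a.l; x <= a.r; ++x) {
            const int i = yoffset + x;
            a.data[i] = math(a.data[i], b.data[i]);
        }
    }
    return 0;
}

int math_caller(int condition, ALM_DATA& a, ALM_DATA& b) {
    switch (condition) {
        case 1: return template_math_av(math_max(), a, b);
        case 2: return template_math_av(math_screen(), a, b);
    }
    return -1;  // unknown condition
}
```

The runtime switch stays in math_caller, outside the hot loop, so it is evaluated once per call rather than once per byte.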

Here are some simple examples I experimented with. You can do the same with your functions and check how different compilers behave.

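In my example, GCC 7.3 and clang 6.0 are quite good at optimizing out the function calls (provided, of course, that they can see the definition of the function). Somewhat surprisingly, however, ICC 18.0.0 was only able to optimize out the functor and the closure. Even the inline function gave it some trouble.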

Here is some of the code, in case the link stops working in the future. For the following code:

template <typename T, int size, typename Closure>
T accumulate(T (&array)[size], T init, Closure closure) {
    for (int i = 0; i < size; ++i) {
        init = closure(init, array[i]);
    }
    return init;
}

int sum(int x, int y) { return x + y; }
inline int sub_inline(int x, int y) { return x - y; }
struct mul_functor {
    int operator ()(int x, int y) const  { return x * y; }
};
extern int extern_operation(int x, int y);

int accumulate_function(int (&array)[5]) {
    return accumulate(array, 0, sum);
}
int accumulate_inline(int (&array)[5]) {
    return accumulate(array, 0, sub_inline);
}
int accumulate_functor(int (&array)[5]) {
    return accumulate(array, 1, mul_functor());
}
int accumulate_closure(int (&array)[5]) {
    return accumulate(array, 0, [](int x, int y) { return x | y; });
}
int accumulate_exetern(int (&array)[5]) {
    return accumulate(array, 0, extern_operation);
}

GCC 7.3 (x86) generates the following assembly:

sum(int, int):
        lea     eax, [rdi+rsi]
        ret
accumulate_function(int (&) [5]):
        mov     eax, DWORD PTR [rdi+4]
        add     eax, DWORD PTR [rdi]
        add     eax, DWORD PTR [rdi+8]
        add     eax, DWORD PTR [rdi+12]
        add     eax, DWORD PTR [rdi+16]
        ret
accumulate_inline(int (&) [5]):
        mov     eax, DWORD PTR [rdi]
        neg     eax
        sub     eax, DWORD PTR [rdi+4]
        sub     eax, DWORD PTR [rdi+8]
        sub     eax, DWORD PTR [rdi+12]
        sub     eax, DWORD PTR [rdi+16]
        ret
accumulate_functor(int (&) [5]):
        mov     eax, DWORD PTR [rdi]
        imul    eax, DWORD PTR [rdi+4]
        imul    eax, DWORD PTR [rdi+8]
        imul    eax, DWORD PTR [rdi+12]
        imul    eax, DWORD PTR [rdi+16]
        ret
accumulate_closure(int (&) [5]):
        mov     eax, DWORD PTR [rdi+4]
        or      eax, DWORD PTR [rdi+8]
        or      eax, DWORD PTR [rdi+12]
        or      eax, DWORD PTR [rdi]
        or      eax, DWORD PTR [rdi+16]
        ret
accumulate_exetern(int (&) [5]):
        push    rbp
        push    rbx
        lea     rbp, [rdi+20]
        mov     rbx, rdi
        xor     eax, eax
        sub     rsp, 8
.L8:
        mov     esi, DWORD PTR [rbx]
        mov     edi, eax
        add     rbx, 4
        call    extern_operation(int, int)
        cmp     rbx, rbp
        jne     .L8
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
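
In the listing above, only accumulate_exetern keeps a call in the loop, because the definition of extern_operation is not visible. For the code in the question, one further option is to pass the math function as a non-type template parameter instead of a runtime argument, so that each function produces its own instantiation. This is only a sketch: the names template_math_nt and math_caller_nt are hypothetical, BYTE is assumed to be an unsigned char, and it has not been checked against the compilers mentioned above.

```cpp
typedef unsigned char BYTE;  // assumption: stand-in for the Windows BYTE type

struct ALM_DATA {
    int l, t, r, b;  // inclusive rectangle bounds
    int scan;        // row stride in bytes
    BYTE* data;
};

typedef BYTE (*MATH_FX)(BYTE&, BYTE&);

// Math functions as in the question.
inline BYTE math_a1(BYTE& A, BYTE& B) { return (BYTE)((B > A) ? B : A); }
inline BYTE math_a2(BYTE& A, BYTE& B) {
    return (BYTE)(255 - ((long)((long)(255 - A) * (255 - B)) >> 8));
}

// The function pointer is a compile-time template argument here, so
// template_math_nt<math_a1> and template_math_nt<math_a2> are different
// functions, each with a call target that is a constant.
template <MATH_FX math>
int template_math_nt(ALM_DATA& a, ALM_DATA& b) {
    for (int y = a.t; y <= a.b; ++y) {
        const int yoffset = y * a.scan;
        for (int x = a.l; x <= a.r; ++x) {
            const int i = yoffset + x;
            a.data[i] = math(a.data[i], b.data[i]);
        }
    }
    return 0;
}

int math_caller_nt(int condition, ALM_DATA& a, ALM_DATA& b) {
    switch (condition) {
        case 1: return template_math_nt<math_a1>(a, b);
        case 2: return template_math_nt<math_a2>(a, b);
    }
    return -1;  // unknown condition
}
```

As with the examples above, the generated assembly is the only reliable way to confirm that the call was really inlined for a given compiler and set of flags.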