代码之家 › 专栏 › 技术社区 › Serge Rogatch

在AVX2中高效实现log2(_uuM256d)

avx2 logarithm floating-point algorithm c++

Serge Rogatch · 技术社区 · 7 年前

__m256d _mm256_log2_pd (__m256d a) 在英特尔以外的其他编译器上是不可用的,他们说它在AMD处理器上的性能受到了限制。中提到了一些在互联网上的实现 AVX log intrinsics (_mm256_log_ps) missing in g++-4.8? 和 SIMD math libraries for SSE and AVX Agner Fog's vector library

那么有人能解释一下如何实施吗 log2() double 数字是否有效?一、 e.比如什么 __m256d \u mm256\u log2\u pd(\u m256d a)

编辑:在我当前的具体案例中,数字是介于0和1之间的概率,对数用于熵计算:总和的求反 i P[i]*log(P[i]) P[i] 很大,因此数字可以接近0。我不确定准确性,所以会考虑任何以30位尾数开始的解决方案,尤其是首选可调整的解决方案。

https://en.wikipedia.org/wiki/Logarithm#Power_series . 如何改进?(需要提高性能和准确性)

namespace {
  const __m256i gDoubleExpMask = _mm256_set1_epi64x(0x7ffULL << 52);
  const __m256i gDoubleExp0 = _mm256_set1_epi64x(1023ULL << 52);
  const __m256i gTo32bitExp = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);
  const __m128i gExpNormalizer = _mm_set1_epi32(1023);
  //TODO: some 128-bit variable or two 64-bit variables here?
  const __m256d gCommMul = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
  const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
  const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
  const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
  const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
  const __m256d gVect1 = _mm256_set1_pd(1.0);
}

__m256d __vectorcall Log2(__m256d x) {
  const __m256i exps64 = _mm256_srli_epi64(_mm256_and_si256(gDoubleExpMask, _mm256_castpd_si256(x)), 52);
  const __m256i exps32_avx = _mm256_permutevar8x32_epi32(exps64, gTo32bitExp);
  const __m128i exps32_sse = _mm256_castsi256_si128(exps32_avx);
  const __m128i normExps = _mm_sub_epi32(exps32_sse, gExpNormalizer);
  const __m256d expsPD = _mm256_cvtepi32_pd(normExps);
  const __m256d y = _mm256_or_pd(_mm256_castsi256_pd(gDoubleExp0),
    _mm256_andnot_pd(_mm256_castsi256_pd(gDoubleExpMask), x));

  // Calculate t=(y-1)/(y+1) and t**2
  const __m256d tNum = _mm256_sub_pd(y, gVect1);
  const __m256d tDen = _mm256_add_pd(y, gVect1);
  const __m256d t = _mm256_div_pd(tNum, tDen);
  const __m256d t2 = _mm256_mul_pd(t, t); // t**2

  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);

  const __m256d log2_y = _mm256_mul_pd(terms01234, gCommMul);
  const __m256d log2_x = _mm256_add_pd(log2_y, expsPD);

  return log2_x;
}

到目前为止,我的实现每秒提供405 268 490个操作,直到第8位似乎都很精确。通过以下功能测量性能:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <immintrin.h>

// ... Log2() implementation here

const int64_t cnLogs = 100 * 1000 * 1000;

void BenchmarkLog2Vect() {
  __m256d sums = _mm256_setzero_pd();
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i += 4) {
    const __m256d x = _mm256_set_pd(double(i+3), double(i+2), double(i+1), double(i));
    const __m256d logs = Log2(x);
    sums = _mm256_add_pd(sums, logs);
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  double sum = sums.m256d_f64[0] + sums.m256d_f64[1] + sums.m256d_f64[2] + sums.m256d_f64[3];
  printf("Vect Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

Logarithm in C++ and assembly ,当前矢量实现比 std::log2() 速度比 std::log()

具体而言,使用以下近似公式:

2 回复 | 直到 7 年前

Peter Cordes 4 年前

通常的策略是基于身份的 log(a*b) = log(a) + log(b) ,或者在这种情况下 log2( 2^exponent * mantissa) ) = log2( 2^exponent ) + log2(mantissa) . 或者简化, exponent + log2(mantissa) log2(mantissa) 只需要适应非常有限的范围。(或等效地,尾数=0.5到1.0,并将指数偏差校正常数更改1)。

如果函数评估很重要 log2(1.0) 0.0 ,您可以通过实际使用 mantissa-1.0 作为多项式,没有常数系数。 0.0 ^ n = 0.0

Agner Fog's VCL implementation of log_d() 旨在获得非常高的精度,使用技巧避免舍入误差,尽可能避免可能导致添加小数字和大数字的情况。这在一定程度上掩盖了基本设计。

float log() ,请参阅上的多项式实现 http://jrfonseca.blogspot.ca/2008/09/fast-sse2-pow-tables-or-polynomials.html . 它省去了VCL使用的许多额外的精度获取技巧,所以更容易理解。它对1.0到2.0范围内的尾数使用多项式近似。

(这才是真正的诀窍实现:您只需要在小范围内工作的多项式。)

它已经做到了 log2 而不是 log 与VCL不同的是,在VCL中,log-base-e被烘焙到常数中,并使用它们。阅读它可能是理解的一个很好的起点 exponent + polynomial(mantissa) 日志() .

即使是最高精度的版本也不完整精度更不用说了 double ,但可以拟合具有更多项的多项式。或者很明显,两个多项式的比值效果很好;这就是VCL的用途 双重的 .

将JRF的SSE2函数移植到AVX2+FMA(尤其是带有 _mm512_getexp_ps 和 _mm512_getmant_ps 浮动 这正是我想要的。

jrf_fastlog() 是独立的,所以OOO执行很好地隐藏了FMA延迟,甚至不值得使用更高的ILP更短延迟多项式评估方法 VCL's polynomial_5() function "Estrin's scheme" ,它在FMA之前进行一些非FMA乘法,从而产生更多的指令)。

Agner Fog的VCL现在已获得Apache许可,因此任何项目都可以直接包含它。如果你想要高精度,你应该直接使用VCL。它只是头文件,只是内联函数,所以不会使二进制文件膨胀。

VCL的 日志 浮点和双功能在 vectormath_exp.h

提取指数位并将该整数转换回浮点(在调整IEEE FP使用的偏差后)。
双重的 中的值 [0.5, 1.0) 范围(或 (0.5, 1.0]

进一步调整 if(mantissa <= SQRT2*0.5) { mantissa += mantissa; exponent++;} mantissa -= 1.0 .

使用多项式近似来 log(x) 双重的 ,VCL 使用两个五阶多项式的比率。 @harold says this is often good for precision . 一个分区与许多FMA混合通常不会影响吞吐量,但它确实比FMA具有更高的延迟。使用 vrcpps vdivps 在现代硬件上。使用比率还可以通过并行评估两个低阶多项式而不是一个高阶多项式来创建更多的ILP,并且与高阶多项式的一个长dep链相比,可能会降低整体延迟(这也会沿该长链累积显著的舍入误差)。

然后添加 exponent + polynomial_approx_log(mantissa) ln2_lo + ln2_hi = ln(2) . 它被分成一个小常数和一个大常数,以减少舍入误差。

// res is the polynomial(adjusted_mantissa) result
// fe is the float exponent
// x is the adjusted_mantissa.  x2 = x*x;
res  = mul_add(fe, ln2_lo, res);             // res += fe * ln2_lo;
res += nmul_add(x2, 0.5, x);                 // res += x  - 0.5 * x2;
res  = mul_add(fe, ln2_hi, res);             // res += fe * ln2_hi;

ln2 材料和使用 VM_LN2 如果您的目标不是0.5或1 ulp精度(或此函数实际提供的任何功能;IDK)

x - 0.5*x2

然后,如果向量中的任何元素需要特殊处理以生成适当的NaN或-Inf,而不是我们从多项式+指数中得到的垃圾,它会检查下溢、溢出或非规范化,并进行分支。 如果已知您的值是有限的和正的,您可以注释掉这一部分并获得显著的加速 (甚至在分支之前的检查也需要几条指令)。

进一步阅读:

http://gallium.inria.fr/blog/fast-vectorizable-math-approx/
http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html 一种有趣的方法:it类型双关语a 浮动 uint32_t 和 浮动 主要表示指数的值,按比例缩放 1 << 23 ,但也包含尾数信息。

然后它使用一个带有一对系数的表达式来解决问题,并得到一个近似值。它包括除以 (constant + mantissa) 将浮点位模式转换为时纠正尾数污染 浮动 . 我发现在HSW和SKL上使用AVX2的矢量化版本比使用四阶多项式的JRF fastlog更慢、更不准确。(尤其是在将其用作fast的一部分时) arcsinh 它还使用除法单元 vsqrtps

Serge Rogatch 7 年前

扰流板 :最后看看如何将性能提高到每秒8.7亿个对数。

特殊情况 : 负数,负无穷和 NaN 楠 带正号位的s得到1024左右的对数。如果您不喜欢特殊情况的处理方式,一个选择是添加代码来检查它们,并做更适合您的事情。这将使计算速度减慢。

namespace {
  // The limit is 19 because we process only high 32 bits of doubles, and out of
  //   20 bits of mantissa there, 1 bit is used for rounding.
  constexpr uint8_t cnLog2TblBits = 10; // 1024 numbers times 8 bytes = 8KB.
  constexpr uint16_t cZeroExp = 1023;
  const __m256i gDoubleNotExp = _mm256_set1_epi64x(~(0x7ffULL << 52));
  const __m256d gDoubleExp0 = _mm256_castsi256_pd(_mm256_set1_epi64x(1023ULL << 52));
  const __m256i cAvxExp2YMask = _mm256_set1_epi64x(
    ~((1ULL << (52-cnLog2TblBits)) - 1) );
  const __m256d cPlusBit = _mm256_castsi256_pd(_mm256_set1_epi64x(
    1ULL << (52 - cnLog2TblBits - 1)));
  const __m256d gCommMul1 = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
  const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
  const __m128i cSseMantTblMask = _mm_set1_epi32((1 << cnLog2TblBits) - 1);
  const __m128i gExpNorm0 = _mm_set1_epi32(1023);
  // plus |cnLog2TblBits|th highest mantissa bit
  double gPlusLog2Table[1 << cnLog2TblBits];
} // anonymous namespace

void InitLog2Table() {
  for(uint32_t i=0; i<(1<<cnLog2TblBits); i++) {
    const uint64_t iZp = (uint64_t(cZeroExp) << 52)
      | (uint64_t(i) << (52 - cnLog2TblBits)) | (1ULL << (52 - cnLog2TblBits - 1));
    const double zp = *reinterpret_cast<const double*>(&iZp);
    const double l2zp = std::log2(zp);
    gPlusLog2Table[i] = l2zp;
  }
}

__m256d __vectorcall Log2TblPlus(__m256d x) {
  const __m256d zClearExp = _mm256_and_pd(_mm256_castsi256_pd(gDoubleNotExp), x);
  const __m256d z = _mm256_or_pd(zClearExp, gDoubleExp0);

  const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
    _mm256_castpd_si256(x), gHigh32Permute));
  // This requires that x is non-negative, because the sign bit is not cleared before
  //   computing the exponent.
  const __m128i exps32 = _mm_srai_epi32(high32, 20);
  const __m128i normExps = _mm_sub_epi32(exps32, gExpNorm0);

  // Compute y as approximately equal to log2(z)
  const __m128i indexes = _mm_and_si128(cSseMantTblMask,
    _mm_srai_epi32(high32, 20 - cnLog2TblBits));
  const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
    /*number of bytes per item*/ 8);
  // Compute A as z/exp2(y)
  const __m256d exp2_Y = _mm256_or_pd(
    cPlusBit, _mm256_and_pd(z, _mm256_castsi256_pd(cAvxExp2YMask)));

  // Calculate t=(A-1)/(A+1). Both numerator and denominator would be divided by exp2_Y
  const __m256d tNum = _mm256_sub_pd(z, exp2_Y);
  const __m256d tDen = _mm256_add_pd(z, exp2_Y);

  // Compute the first polynomial term from "More efficient series" of https://en.wikipedia.org/wiki/Logarithm#Power_series
  const __m256d t = _mm256_div_pd(tNum, tDen);

  const __m256d log2_z = _mm256_fmadd_pd(t, gCommMul1, y);

  // Leading integer part for the logarithm
  const __m256d leading = _mm256_cvtepi32_pd(normExps);

  const __m256d log2_x = _mm256_add_pd(log2_z, leading);
  return log2_x;
}

但是,如果您需要更多的一级缓存来满足其他需求,则可以通过减少 cnLog2TblBits 以降低对数计算精度为代价。

或者,为了保持高精度,可以通过添加以下项来增加多项式项的数量:

namespace {
  // ...
  const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
  const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
  const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
  const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
  const __m256d gCoeff5 = _mm256_set1_pd(1.0 / 11);
}

然后改变 Log2TblPlus() 在生产线之后 const __m256d t = _mm256_div_pd(tNum, tDen);

  const __m256d t2 = _mm256_mul_pd(t, t); // t**2

  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);
  const __m256d t11 = _mm256_mul_pd(t9, t2); // t**11
  const __m256d terms012345 = _mm256_fmadd_pd(gCoeff5, t11, terms01234);

  const __m256d log2_z = _mm256_fmadd_pd(terms012345, gCommMul1, y);

然后评论 // Leading integer part for the logarithm 其余不变。

cnLog2TblBits==5 terms012 . 但我还没有做过这样的测量,你需要实验什么适合你的需要。

显然,计算的多项式项越少,计算速度就越快。

In what situation would the AVX2 gather instructions be faster than individually loading the data? 建议您在以下情况下可以获得性能提升:

const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
  /*number of bytes per item*/ 8);

const __m256d y = _mm256_set_pd(gPlusLog2Table[indexes.m128i_u32[3]],
  gPlusLog2Table[indexes.m128i_u32[2]],
  gPlusLog2Table[indexes.m128i_u32[1]],
  gPlusLog2Table[indexes.m128i_u32[0]]);

对于我的实现,它节省了大约1.5个周期,将计算4个对数的总周期计数从18减少到16.5,因此性能提高到每秒8.7亿个对数。我保留了当前的实现,因为它更惯用,一旦CPU开始运行,应该会更快 gather 操作正确(与GPU一样合并)。

编辑2 on Ryzen CPU (but not on Intel) 您可以通过替换

const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
  _mm256_castpd_si256(x), gHigh32Permute));

  const __m128 hiLane = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
  const __m128 loLane = _mm_castpd_ps(_mm256_castpd256_pd128(x));
  const __m128i high32 = _mm_castps_si128(_mm_shuffle_ps(loLane, hiLane,
    _MM_SHUFFLE(3, 1, 3, 1)));