
How can I create a Spectre gadget in practice?

spectre x86 assembly caching


Margaret Bloom · 6 years ago

How can I make a reliable Spectre gadget (FLUSH+RELOAD)?
I believe I understand the theory behind the FLUSH+RELOAD technique, but in practice, despite some noise, I am unable to produce a working PoC.

Since I'm using the timestamp counter and the loads are very regular, I use this script to disable the prefetchers and turbo boost, and to fix/stabilize the CPU frequency:

    #!/bin/bash

    sudo modprobe msr

    #Disable turbo
    sudo wrmsr -a 0x1a0 0x4000850089

    #Disable prefetchers
    sudo wrmsr -a 0x1a4 0xf

    #Set performance governor
    sudo cpupower frequency-set -g performance

    #Minimum freq
    sudo cpupower frequency-set -d 2.2GHz

    #Maximum freq
    sudo cpupower frequency-set -u 2.2GHz

I have a contiguous buffer, aligned on 4 KiB, large enough to span 256 cache lines separated by an integral number GAP of lines.

    SECTION .bss ALIGN=4096

     buffer:    resb 256 * (1 + GAP) * 64

I use this function to flush the 256 lines.

    flush_all:
     lea rdi, [buffer]              ;Start pointer
     mov esi, 256                   ;How many lines to flush

    .flush_loop:
      lfence                        ;Prevent the previous clflush to be reordered after the load
      mov eax, [rdi]                ;Touch the page
      lfence                        ;Prevent the current clflush to be reordered before the load

      clflush  [rdi]                ;Flush a line
      add rdi, (1 + GAP)*64         ;Move to the next line

      dec esi
     jnz .flush_loop                ;Repeat

     lfence                         ;clflush are ordered with respect of fences ..
                                    ;.. and lfence is ordered (locally) with respect of all instructions
     ret

The function loops through all the lines, touching every page in between (each page more than once) and flushing each line.

Then I use this function to profile the accesses.

    profile:
      lea rdi, [buffer]             ;Pointer to the buffer
      mov esi, 256                  ;How many lines to test
      lea r8, [timings_data]        ;Pointer to timings results

      mfence                        ;I'm pretty sure this is useless, but I included it to rule out ..
                                    ;.. silly, hard to debug, scenarios

    .profile:
      mfence
      rdtscp
      lfence                        ;Read the TSC in-order (ignoring stores global visibility)

      mov ebp, eax                  ;Read the low DWORD only (this is a short delay)

      ;PERFORM THE LOADING
      mov eax, DWORD [rdi]

      rdtscp
      lfence                        ;Again, read the TSC in-order

      sub eax, ebp                  ;Compute the delta

      mov DWORD [r8], eax           ;Save it

      ;Advance the loop

      add r8, 4                     ;Move the results pointer
      add rdi, (1 + GAP)*64         ;Move to the next line

      dec esi                       ;Advance the loop
     jnz .profile

     ret

An MCVE is given in the appendix, and a repository is available to clone.

When assembled with GAP set to 0, linked and executed with taskset -c 0, the output lists the cycles needed to fetch each line. Only 64 lines are loaded from memory.

The output is stable across different runs. If I set GAP to 1, only 32 lines are fetched from memory. Of course 64 * (1+0) * 64 = 32 * (1+1) * 64 = 4096, so this may be related to paging?
If a store is executed before the profiling (but after the flush) to one of the first 64 lines, the output changes. A store to any of the other lines gives the first type of output.

I suspect the math in there is broken, but I need another pair of eyes to find out where.

EDIT

Hadi Brais pointed out a misuse of a volatile register. After fixing that, the output is inconsistent: I mostly see runs where the timings are low (~50 cycles) and sometimes runs where the timings are higher (~130 cycles). I don't know where the 130-cycle figure comes from (too low for memory, too high for the cache?).

The code is fixed in the MCVE (and the repository).

If a store to any of the first lines is executed before the profiling, no change is reflected in the output.

APPENDIX - MCVE

    BITS 64
    DEFAULT REL

    GLOBAL main

    EXTERN printf
    EXTERN exit

    ;Space between lines in the buffer
    %define GAP 0

    SECTION .bss ALIGN=4096

     buffer:    resb 256 * (1 + GAP) * 64

    SECTION .data

     timings_data:  TIMES 256 dd 0

     strNewLine     db `\n0x%02x: `, 0
     strHalfLine    db " ", 0
     strTiming      db `\e[48;5;16`
      .importance   db "0"
                    db `m\e[38;5;15m%03u\e[0m `, 0

     strEnd         db `\n\n`, 0

    SECTION .text

    ;
    ;FLUSH ALL THE LINES OF A BUFFER FROM THE CACHES
    ;

    flush_all:
     lea rdi, [buffer]              ;Start pointer
     mov esi, 256                   ;How many lines to flush

    .flush_loop:
      lfence                        ;Prevent the previous clflush to be reordered after the load
      mov eax, [rdi]                ;Touch the page
      lfence                        ;Prevent the current clflush to be reordered before the load

      clflush  [rdi]                ;Flush a line
      add rdi, (1 + GAP)*64         ;Move to the next line

      dec esi
     jnz .flush_loop                ;Repeat

     lfence                         ;clflush are ordered with respect of fences ..
                                    ;.. and lfence is ordered (locally) with respect of all instructions
     ret

    ;
    ;PROFILE THE ACCESS TO EVERY LINE OF THE BUFFER
    ;

    profile:
      lea rdi, [buffer]             ;Pointer to the buffer
      mov esi, 256                  ;How many lines to test
      lea r8, [timings_data]        ;Pointer to timings results

      mfence                        ;I'm pretty sure this is useless, but I included it to rule out ..
                                    ;.. silly, hard to debug, scenarios

    .profile:
      mfence
      rdtscp
      lfence                        ;Read the TSC in-order (ignoring stores global visibility)

      mov ebp, eax                  ;Read the low DWORD only (this is a short delay)

      ;PERFORM THE LOADING
      mov eax, DWORD [rdi]

      rdtscp
      lfence                        ;Again, read the TSC in-order

      sub eax, ebp                  ;Compute the delta

      mov DWORD [r8], eax           ;Save it

      ;Advance the loop

      add r8, 4                     ;Move the results pointer
      add rdi, (1 + GAP)*64         ;Move to the next line

      dec esi                       ;Advance the loop
     jnz .profile

     ret

    ;
    ;SHOW THE RESULTS
    ;

    show_results:
     lea rbx, [timings_data]        ;Pointer to the timings
     xor r12, r12                   ;Counter (up to 256)

    .print_line:

     ;Format the output

     xor eax, eax
     mov esi, r12d
     lea rdi, [strNewLine]          ;Setup for a call to printf

     test r12d, 0fh
     jz .print                      ;Test if counter is a multiple of 16

     lea rdi, [strHalfLine]         ;Setup for a call to printf

     test r12d, 07h                 ;Test if counter is a multiple of 8
     jz .print

    .print_timing:

      ;Print

      mov esi, DWORD [rbx]          ;Timing value

      ;Compute the color

      mov r10d, 60                  ;Used to compute the color
      mov eax, esi
      xor edx, edx
      div r10d                      ;eax = Timing value / 60

      ;Update the color

      add al, '0'
      mov edx, '5'
      cmp eax, edx
      cmova eax, edx
      mov BYTE [strTiming.importance], al

      xor eax, eax
      lea rdi, [strTiming]
      call printf WRT ..plt         ;Print a 3-digits number

      ;Advance the loop

      inc r12d                      ;Increment the counter
      add rbx, 4                    ;Move to the next timing
      cmp r12d, 256
     jb .print_line                 ;Advance the loop

      xor eax, eax
      lea rdi, [strEnd]
      call printf WRT ..plt         ;Print a new line

      ret

    .print:

      call printf WRT ..plt         ;Print a string

    jmp .print_timing

    ;
    ;E N T R Y   P O I N T
    ;

    main:

     ;Flush all the lines of the buffer
     call flush_all

     ;Test the access times
     call profile

     ;Show the results
     call show_results

     ;Exit
     xor edi, edi
     call exit WRT ..plt

1 answer | as of 5 years ago


Hadi Brais · 5 years ago

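The buffer is allocated from the bss section, and flush_all only reads from it, so the OS maps every virtual page of buffer to the same zero-filled copy-on-write physical page. After flushing, only the accesses to the first 64 lines miss in all cache levels (1), because all later accesses go to the same 4K page. That is why the latencies of the first 64 accesses fall in the main-memory latency range and the latencies of all later accesses equal the L1 hit latency (3) when GAP is 0.

When GAP is 1, every other line of that physical page is accessed, so only the first 32 accesses miss. When GAP is 63, all accesses are to the same line. Therefore, only the first access misses all caches.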
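The miss counts for the different GAP values can be written as one worked formula. This is a sketch under the assumption stated above, namely that every virtual page of the buffer aliases a single 4 KiB physical page holding 64 lines of 64 bytes; the symbol N_miss is introduced here only for illustration.

```latex
% Assumption: all virtual pages of the buffer map to one 4 KiB physical page
% (64 cache lines of 64 bytes). Access i (i = 0, ..., 255) then touches
% physical line  i * (1 + GAP) mod 64.
\[
  N_{\text{miss}} = \frac{64}{\gcd(64,\; 1 + \mathrm{GAP})}
\]
\[
  \mathrm{GAP}=0:\ \frac{64}{\gcd(64,1)} = 64, \qquad
  \mathrm{GAP}=1:\ \frac{64}{\gcd(64,2)} = 32, \qquad
  \mathrm{GAP}=63:\ \frac{64}{\gcd(64,64)} = 1
\]
```

These values match the 64 and 32 slow accesses reported in the question.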

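The solution is to change mov eax, [rdi] in flush_all to mov dword [rdi], 0, so that every page of the buffer gets its own physical page. The lfence instructions in flush_all are then not needed, because clflush cannot be reordered with writes. After this change, all accesses miss all cache levels (but not the TLB, see: Does clflush also remove TLB entries?).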
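A minimal sketch of what the modified flush_all could look like, assuming the same buffer and GAP definitions as the MCVE (repeated here only so the fragment assembles on its own). Using mfence as the final fence is a choice made for this sketch, not something taken from the question.

```nasm
BITS 64
DEFAULT REL

GLOBAL flush_all

;Same definitions as in the MCVE, repeated so this fragment assembles alone
%define GAP 0

SECTION .bss ALIGN=4096

 buffer:    resb 256 * (1 + GAP) * 64

SECTION .text

flush_all:
 lea rdi, [buffer]              ;Start pointer
 mov esi, 256                   ;How many lines to flush

.flush_loop:
  mov DWORD [rdi], 0            ;Write (not read) so the OS backs the page with ..
                                ;.. its own physical page instead of the shared zero page
  clflush [rdi]                 ;Flush the line just written (ordered after the store)
  add rdi, (1 + GAP)*64         ;Move to the next line

  dec esi
 jnz .flush_loop                ;Repeat

 mfence                         ;Assumption of this sketch: order all the flushes ..
                                ;.. before whatever the caller does next
 ret
```

With this version each line of the buffer is written once and then flushed, so the profiling loop should see a cache miss on every one of the 256 accesses rather than only on the first 64.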

For another example involving copy-on-write pages, see: Why are the user-mode L1 store miss events only counted when there is a store initialization loop?

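In a previous version of this answer I suggested removing the call to flush_all and using a GAP value of 63. With these changes, all of the access latencies appeared to be very high, and I incorrectly concluded that all of the accesses were missing all cache levels. As I said above, with a GAP value of 63 all accesses go to the same cache line, which is actually resident in the L1 cache. The reason all of the latencies were high is that every access was to a different virtual page and the TLB had no mapping for any of these virtual pages (to the same physical page), because with the call to flush_all removed, none of the virtual pages had been touched before. So the measured latencies represent the TLB miss latency, even though the line being accessed is in the L1 cache.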

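(1) If you don't disable the DCU IP prefetcher, it will actually prefetch all the lines back into the L1 after they are flushed, so all accesses will still hit in the L1.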

(3) Remember that you need to subtract the overhead of the rdtscp instructions. See: Memory latency measurement with time stamp counter.
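One possible way to estimate that overhead is to time the same rdtscp/lfence pair with no load in between and keep the smallest delta seen. The routine below is only a sketch: the name rdtscp_overhead and the repetition count of 1000 are hypothetical and do not come from the question's code.

```nasm
BITS 64
DEFAULT REL

GLOBAL rdtscp_overhead

SECTION .text

;Returns in eax the smallest observed delta between two back-to-back
;rdtscp/lfence pairs. Clobbers ecx, edx, esi, r9, r10 (all volatile).
rdtscp_overhead:
 mov esi, 1000                  ;Hypothetical number of repetitions
 mov r10d, -1                   ;Running minimum, starts at 0xFFFFFFFF

.again:
  mfence
  rdtscp
  lfence                        ;Same fencing as the profile loop
  mov r9d, eax                  ;Low DWORD of the first timestamp

  ;No load here: only the measurement itself is timed

  rdtscp
  lfence
  sub eax, r9d                  ;Delta with nothing in between

  cmp eax, r10d
  cmovb r10d, eax               ;Keep the smaller (unsigned) value

  dec esi
 jnz .again

 mov eax, r10d                  ;Return the minimum
 ret
```

Subtracting the returned value from each entry of timings_data gives a rough estimate of the load latency alone. It still does not make an L1 hit reliably distinguishable from an L2 hit.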
