
Linux Applications Debugging Techniques/Aiming for and measuring performance


gprof & -pg


To profile an application with gprof:

  • Compile the code with -pg
  • Link with -pg
  • Run the application. This creates a gmon.out file in the application's current folder.
  • At the prompt, in the folder where gmon.out is: gprof path-to-application (see the example transcript below)
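
A minimal end-to-end transcript (the source and program names are hypothetical):

$ g++ -pg code.cpp -o app    # compile and link with -pg
$ ./app                      # writes gmon.out into the current folder
$ gprof ./app                # reads gmon.out by default and prints the profile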


PAPI

The Performance Application Programming Interface (PAPI) gives the programmer access to the performance counter hardware found in most major microprocessors. With a decent C++ wrapper, measuring branch mispredictions and cache misses (and much more) becomes a one-liner.

By default, these are the events a papi::counters<Print_policy> instance looks for:

static const events_type events = {
      {PAPI_TOT_INS, "Total instructions"}
    , {PAPI_TOT_CYC, "Total cpu cycles"}
    , {PAPI_L1_DCM,  "L1 load  misses"}
//  , {PAPI_L1_STM,  "L1 store misses"}
    , {PAPI_L2_DCM,  "L2 load  misses"}
//  , {PAPI_L2_STM,  "L2 store misses"}
    , {PAPI_BR_MSP,  "Branch mispredictions"}
};

The counters class is parameterized with a Print_policy indicating what to do once the counter object goes out of scope.
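
The wrapper itself is not reproduced in this section. A minimal sketch of the same RAII idea, assuming the classic PAPI high-level C API (PAPI_start_counters/PAPI_stop_counters, present in classic PAPI releases but removed from recent ones) and with hypothetical names throughout:

   #include <papi.h>
   #include <cstdio>

   // Stand-in for papi::counters: snapshot the counters on construction,
   // print the accumulated deltas on destruction.
   class counters_sketch {
   public:
       explicit counters_sketch(const char* tag) : tag_(tag) {
           PAPI_start_counters(events_, nevents);
       }
       ~counters_sketch() {
           long long deltas[nevents];
           PAPI_stop_counters(deltas, nevents);  // values accumulated since start
           std::printf("Delta %s:\n", tag_);
           std::printf("  PAPI_L1_DCM (L1 load  misses): %lld\n", deltas[0]);
           std::printf("  PAPI_BR_MSP (Branch mispredictions): %lld\n", deltas[1]);
       }
   private:
       static const int nevents = 2;
       int events_[nevents] = {PAPI_L1_DCM, PAPI_BR_MSP};
       const char* tag_;
   };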

As an example, let us look at these lines of code:

   const int nlines = 196608;
   const int ncols  = 64;
   static char ctrash[nlines][ncols]; // ~12 MB: static, so it does not overflow the stack
   {
       int x;
       papi::counters<papi::stdout_print> pc("by column"); // <== the famous one-liner
       for (int c = 0; c < ncols; ++c) {
           for (int l = 0; l < nlines; ++l) {
               x = ctrash[l][c];
           }
       }
   }

The code simply loops over an array, but in the wrong order: the innermost loop iterates over the outermost index. The results are the same whether we loop over the first or the last index first, but in theory, to preserve cache locality, the innermost loop should iterate over the innermost index. This should make a significant difference in the time needed to traverse the array:

   {
       int x;
       papi::counters<papi::stdout_print> pc("By line");
       for (int l = 0; l < nlines; ++l) {
           for (int c = 0; c < ncols; ++c) {
               x = ctrash[l][c];
           }
       }
   }

papi::counters is a wrapper class around the PAPI functionality. It takes a snapshot of some performance counters when the counter object is created (in this case we are interested in cache misses and branch mispredictions) and another snapshot when the object is destroyed. It then prints the differences.

A first measurement, with non-optimized code (-O0), shows these results:

Delta by column:
  PAPI_TOT_INS (Total instructions): 188744788 (380506167-191761379)
  PAPI_TOT_CYC (Total cpu cycles): 92390347 (187804288-95413941)
  PAPI_L1_DCM (L1 load  misses): 28427 (30620-2193)                 <==
  PAPI_L2_DCM (L2 load  misses): 102 (1269-1167)
  PAPI_BR_MSP (Branch mispredictions): 176 (207651-207475)          <==

Delta By line:
  PAPI_TOT_INS (Total instructions): 190909841 (191734047-824206)
  PAPI_TOT_CYC (Total cpu cycles): 94460862 (95387664-926802)
  PAPI_L1_DCM (L1 load  misses): 403 (2046-1643)                    <==
  PAPI_L2_DCM (L2 load  misses): 21 (1081-1060)
  PAPI_BR_MSP (Branch mispredictions): 205934 (207350-1416)         <==

While cache misses did improve, branch mispredictions skyrocketed; not a good trade-off. Within the processor's pipeline, comparison operations translate into branch operations, and the non-optimized code the compiler generates is a bit strange.

Typically, branch machine code is generated directly by if/else statements and the ternary operator, and indirectly by virtual calls and calls through pointers, as illustrated below.
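
A small illustration (all function names hypothetical) of constructs that typically compile down to branches:

   // Direct branches: if/else and the ternary operator.
   int pick_if(int a, int b) {
       if (a > b) return a;     // conditional branch
       return b;
   }
   int pick_ternary(int a, int b) {
       return (a > b) ? a : b;  // branch (or a conditional move)
   }

   // Indirect branches: virtual calls and calls through pointers.
   struct Base {
       virtual int f() const { return 0; }
       virtual ~Base() {}
   };
   int call_virtual(const Base& o) {
       return o.f();            // indirect branch through the vtable
   }
   int call_through_pointer(int (*fp)(int, int), int a, int b) {
       return fp(a, b);         // indirect branch through a function pointer
   }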

Maybe optimized code (-O2) fares better? Or maybe not:

Delta by column:
  PAPI_TOT_INS (Total instructions): 329 (229368-229039)
  PAPI_TOT_CYC (Total cpu cycles): 513 (186217-185704)
  PAPI_L1_DCM (L1 load  misses): 2 (1523-1521)
  PAPI_L2_DCM (L2 load  misses): 0 (993-993)
  PAPI_BR_MSP (Branch mispredictions): 7 (1287-1280)

Delta By line:
  PAPI_TOT_INS (Total instructions): 330 (209614-209284)
  PAPI_TOT_CYC (Total cpu cycles): 499 (173487-172988)
  PAPI_L1_DCM (L1 load  misses): 2 (1498-1496)
  PAPI_L2_DCM (L2 load  misses): 0 (992-992)
  PAPI_BR_MSP (Branch mispredictions): 7 (1225-1218)

This time, the compiler optimized the loops away! It figured out that we do not really use the data in the array, so it removed them. Entirely!
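
One common way to keep such a read-only loop alive under optimization (not the approach taken below) is to make the sink volatile, forcing the compiler to perform every read; a sketch:

   volatile char x;  // volatile: each assignment to x must actually happen
   for (int c = 0; c < ncols; ++c)
       for (int l = 0; l < nlines; ++l)
           x = ctrash[l][c];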

Instead, let us see how this code performs:

   {
       int x;
       papi::counters<papi::stdout_print> pc("by column");
       for (int c = 0; c < ncols; ++c) {
           for (int l = 0; l < nlines; ++l) {
               x = ctrash[l][c];
               ctrash[l][c] = x + 1;
           }
       }
   }
Delta by column:
  PAPI_TOT_INS (Total instructions): 62918492 (63167552-249060)
  PAPI_TOT_CYC (Total cpu cycles): 224705473 (224904307-198834)
  PAPI_L1_DCM (L1 load  misses): 12415661 (12417203-1542)
  PAPI_L2_DCM (L2 load  misses): 9654638 (9655632-994)
  PAPI_BR_MSP (Branch mispredictions): 14217 (15558-1341)

Delta By line:
  PAPI_TOT_INS (Total instructions): 51904854 (115092642-63187788)
  PAPI_TOT_CYC (Total cpu cycles): 25914254 (250864272-224950018)
  PAPI_L1_DCM (L1 load  misses): 197104 (12614449-12417345)
  PAPI_L2_DCM (L2 load  misses): 6330 (9662090-9655760)
  PAPI_BR_MSP (Branch mispredictions): 296 (16066-15770)

Both cache misses and branch mispredictions improved by at least an order of magnitude. A run with non-optimized code shows improvements of the same order of magnitude.



OProfile

OProfile gives access to the same hardware counters as PAPI, but without instrumenting the code:

  • It is coarser-grained than PAPI: at function level.
  • Some out-of-the-box kernels (RedHat) are OProfile-unfriendly.
  • You need root access.

A wrapper script driving a complete profiling session:
#!/bin/bash

#
# A script to OProfile a program.
# Must be run as root.
#

if [ $# -ne 1 ]
then
  echo "Usage: $(basename $0) <for-binary-image>"
  exit 1
else
  binimg=$1
fi

# Make sure no previous session is still running.
opcontrol --stop
opcontrol --shutdown

# Out of the box RedHat kernels are OProfile repellent.
opcontrol --no-vmlinux
opcontrol --reset

# List of events for platform to be found in /usr/share/oprofile/<>/events
opcontrol --event=L2_CACHE_MISSES:1000

opcontrol --start

"$binimg"

opcontrol --stop
opcontrol --dump

# Overall and per-symbol reports.
rm -f "$binimg".opreport.log
opreport > "$binimg".opreport.log

rm -f "$binimg".opreport.sym
opreport -l > "$binimg".opreport.sym

opcontrol --shutdown
opcontrol --deinit
echo "Done"

perf

perf is a kernel-based subsystem that provides a framework for performance analysis of the impact running programs have on the kernel. It covers hardware features (CPU/PMU, Performance Monitoring Unit) as well as software features (software counters, tracepoints).

See also: the perf tutorial on the kernel's perf wiki.

perf list shows the events available on a particular machine. The events vary with the performance monitoring hardware and the software configuration of the system:

$ perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-prefetches                                    [Hardware cache event]
  dTLB-prefetch-misses                               [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  node-loads                                         [Hardware cache event]
  node-load-misses                                   [Hardware cache event]
  node-stores                                        [Hardware cache event]
  node-store-misses                                  [Hardware cache event]
  node-prefetches                                    [Hardware cache event]
  node-prefetch-misses                               [Hardware cache event]

  rNNN (see 'perf list --help' on how to encode it)  [Raw hardware event descriptor]

  mem:<addr>[:access]                                [Hardware breakpoint]

Note: running as root will output an extended list of events; some of the events (tracepoints?) require root access.

perf stat collects overall statistics for common performance events, including instructions executed and clock cycles consumed. Option flags allow collecting statistics for events other than the default ones:
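
The perftest.cpp source is not given in this section; a plausible stand-in (hypothetical, reusing the cache-unfriendly column-first walk from the PAPI section) could be:

   // perftest.cpp -- hypothetical workload for the perf runs below.
   const int nlines = 196608;
   const int ncols  = 64;
   static char ctrash[nlines][ncols];

   int main() {
       volatile char x;                  // keep the reads alive under -O2
       for (int c = 0; c < ncols; ++c)
           for (int l = 0; l < nlines; ++l)
               x = ctrash[l][c];         // column-first: cache-unfriendly
       return 0;
   }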

$ g++ -std=c++11 -ggdb -fno-omit-frame-pointer perftest.cpp -o perftest

$ perf stat ./perftest
 Performance counter stats for './perftest':

        379.991103 task-clock                #    0.996 CPUs utilized          
                62 context-switches          #    0.000 M/sec                  
                 0 CPU-migrations            #    0.000 M/sec                  
             6,436 page-faults               #    0.017 M/sec                  
       984,969,006 cycles                    #    2.592 GHz                     [83.27%]
       663,592,329 stalled-cycles-frontend   #   67.37% frontend cycles idle    [83.17%]
       473,904,165 stalled-cycles-backend    #   48.11% backend  cycles idle    [66.42%]
     1,010,613,552 instructions              #    1.03  insns per cycle        
                                             #    0.66  stalled cycles per insn [83.23%]
        53,831,403 branches                  #  141.665 M/sec                   [84.14%]
           401,518 branch-misses             #    0.75% of all branches         [83.48%]

       0.381602838 seconds time elapsed

$ perf stat --event=L1-dcache-load-misses ./perftest
 Performance counter stats for './perftest':

        12,942,048 L1-dcache-load-misses                                       

       0.373719009 seconds time elapsed

perf record records the performance data into a file which can later be analyzed with perf report:

$ perf record --event=L1-dcache-load-misses ./perftest

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.025 MB perf.data (~1078 samples) ]

$ ls -al
...
-rw-------  1 amelinte amelinte  27764 Feb 17 17:23 perf.data

perf report reads the performance data from the file and analyzes the recorded data:

$ perf report --stdio
# ========
# captured on: Sun Feb 17 17:23:34 2013
# hostname : bear
# os release : 3.2.0-4-amd64
# perf version : 3.2.17
# arch : x86_64
# nrcpus online : 4
# nrcpus avail : 4
# cpudesc : Intel(R) Core(TM) i3 CPU M 390 @ 2.67GHz
# cpuid : GenuineIntel,6,37,5
# total memory : 3857640 kB
# cmdline : /usr/bin/perf_3.2 record --event=L1-dcache-load-misses ./perftest 
# event : name = L1-dcache-load-misses, type = 3, config = 0x10000, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, id = { 
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# ========
#
# Events: 274  L1-dcache-load-misses
#
# Overhead         Command   Shared Object      Symbol
# ........         ........  .................  ..................
#
#   95.93 percent  perftest  perftest           [.] 0xd35           
#    1.06 percent  perftest  [kernel.kallsyms]  [k] clear_page_c
#    0.82 percent  perftest  [kernel.kallsyms]  [k] run_timer_softirq
#    0.42 percent  perftest  [kernel.kallsyms]  [k] trylock_page
#    0.41 percent  perftest  [kernel.kallsyms]  [k] __rcu_pending
#    0.41 percent  perftest  [kernel.kallsyms]  [k] update_curr
#    0.33 percent  perftest  [kernel.kallsyms]  [k] do_raw_spin_lock
#    0.26 percent  perftest  [kernel.kallsyms]  [k] __flush_tlb_one
#    0.18 percent  perftest  [kernel.kallsyms]  [k] flush_old_exec
#    0.06 percent  perftest  [kernel.kallsyms]  [k] __free_one_page
#    0.05 percent  perftest  [kernel.kallsyms]  [k] free_swap_cache
#    0.05 percent  perftest  [kernel.kallsyms]  [k] zone_statistics
#    0.04 percent  perftest  [kernel.kallsyms]  [k] alloc_pages_vma
#    0.01 percent  perftest  [kernel.kallsyms]  [k] mm_init
#    0.01 percent  perftest  [kernel.kallsyms]  [k] vfs_read
#    0.00 percent  perftest  [kernel.kallsyms]  [k] __cond_resched
#    0.00 percent  perftest  [kernel.kallsyms]  [k] finish_task_switch
#
# (For a higher level overview, try: perf report --sort comm,dso)
#

$ perf record -g ./perftest
$ perf report -g --stdio
...
# Overhead   Command      Shared Object                  Symbol
# ........  ........  .................  ......................
#
    97.23%  perftest  perftest           [.] 0xc75           
            |
            --- 0x400d2c
                0x400dfb
                __libc_start_main

perf annotate reads the input file and displays an annotated version of the code. If the object file has debug symbols, the source code is displayed alongside the assembly code; if there is no debug info, the annotated assembly is displayed. Broken!?

$ perf annotate -i ./perf.data -d ./perftest --stdio -f
Warning:
The ./perf.data file has no samples!

perf top is similar to the top tool: it generates and displays a performance counter profile in real time.
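
For instance, to watch L1 load misses system-wide (a hypothetical invocation; the display is live, so no output is reproduced here):

$ perf top --event=L1-dcache-load-misses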

Valgrind: cachegrind


Cachegrind simulates a machine with two cache levels ([I1 & D1] and L2) and branch (mis)prediction. It is useful for code annotation, down to line level, but it can differ significantly from the machine's actual CPU. It won't go far on AMD64 CPUs (vex disassembler issues). It is extremely slow, typically slowing the application down 12-15x.
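
A typical session (the program name ./app is hypothetical; cg_annotate's --auto=yes adds per-line source annotation):

$ valgrind --tool=cachegrind ./app
$ cg_annotate --auto=yes cachegrind.out.<pid>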


DIY: libhitcount


libmemleak can easily be modified to track calls made to a particular point in the code: just insert an mtrace() call at that point. A stand-alone sketch of the same idea follows.
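
A minimal DIY hit counter in the same spirit (independent of libmemleak; all names hypothetical):

   // hitcount_sketch.cpp -- count how many times a code location is reached.
   #include <atomic>
   #include <cstdio>
   #include <cstdlib>

   static std::atomic<unsigned long> g_hits{0};

   #define HITCOUNT() (++g_hits)  // drop this at the point of interest

   static void report() { std::printf("hits: %lu\n", g_hits.load()); }

   int main() {
       std::atexit(report);       // print the tally at exit
       for (int i = 0; i < 1000; ++i)
           HITCOUNT();
       return 0;
   }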


How things scale

L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns             
Compress 1K bytes with Zippy ............. 3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs
SSD random read ........................ 150,000 ns  = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns  = 250 µs
Round trip within same datacenter ...... 500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns  =   1 ms
Disk seek ........................... 10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns  = 150 ms

Operation           Cost (ns)     Ratio
Clock period        0.6           1.0
Best-case CAS       37.9          63.2
Best-case lock      65.6          109.3
Single cache miss   139.5         232.5
CAS cache miss      306.0         510.0
Comms Fabric        3,000         5,000
Global Comms        130,000,000   216,000,000
Table 2.1: Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System
  • Source: Paul E. McKenney


A note about hyperthreading


Other tools
