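ROSE Compiler Framework/Arithmetic Intensity Measuring Tool
A tool to help measure the arithmetic intensity of loops, that is, the ratio of floating-point operations (FLOPS) to bytes of memory traffic. It works as follows:
- It statically estimates the number of floating-point operations and load/store bytes in each iteration of a user-specified loop.
- It instruments the loop with statements that capture the loop iteration count and accumulate the FLOPS and memory traffic (load/store bytes).
- The user then runs the instrumented code to generate the final report.
Quick information
- Tool location: https://github.com/rose-compiler/rose-develop/tree/master/projects/ArithmeticMeasureTool
- Tests: type "make check" in the corresponding build tree
Getting the tool from the rose-develop repository is recommended, since it has the latest updates.
The first step is to download and install ROSE as usual.
Then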
- cd rose-build-tree/projects/ArithmeticMeasureTool
- make && make install
An executable named measureTool will be installed in the ROSE_INSTALLATION_PATH/bin directory.
Now prepare your environment so that the tool can be invoked:
# set.rose file, source it to set up environment variables
ROSE_INS=/home/liao6/workspace/masterDevClean/install
export ROSE_INS
PATH=$ROSE_INS/bin:$PATH
export PATH
LD_LIBRARY_PATH=$ROSE_INS/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
List of options
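- -help: print help information
- -debug: enable debug mode, which prints progress and internal results to the screen
- -annot your_annotation_file: accept user-specified function side effect annotations to complement the compiler analysis
- -static-counting-only: a special execution mode in which the tool scans all loop bodies and writes the counting results to a report file
- -report-file your_report_file.txt: specify your own report file name; otherwise the default file ai_tool_report.txt is used
- -use-algorithm-v2: in static counting mode, use the second version of the algorithm, a bottom-up synthesized traversal that computes FLOPS; still under development
Compiler analysis cannot determine the side effects of every function. This can happen because library source code is unavailable or because pointer usage in the source code is too complex. To address this, the tool accepts a function side effect annotation file through the -annot option.
Annotation file format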
operator abs(int val)
{
modify none; read{val}; alias none;
}
operator max(double val1, double val2)
{
modify none; read{val1, val2}; alias none;
}
Example command line
- measureTool -c -annot /path/to/functionSideEffect.annot your_input.c
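Static counting only (-static-counting-only) is a special mode in which the tool just finds all loops and counts the FLOPS of each loop body. The reported numbers are for a single iteration only.
Load/store bytes are expressed in two ways
- Expression format: e.g. 3*sizeof(float) + 5*sizeof(double)
- Final evaluated integer value: 52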
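The relation between the two representations can be sketched in a few lines of C. This is a minimal illustration, not code from the tool: the helper name bytes_from_counts is hypothetical, and the value 52 assumes a platform where sizeof(float) is 4 and sizeof(double) is 8.

```c
#include <stddef.h>

/* Hypothetical helper (not part of measureTool): evaluates an expression of
 * the form n_float*sizeof(float) + n_double*sizeof(double) on this machine. */
size_t bytes_from_counts(size_t n_float, size_t n_double)
{
    return n_float * sizeof(float) + n_double * sizeof(double);
}
```

On such a platform bytes_from_counts(3, 5) evaluates to 3*4 + 5*8 = 52. A machine with different type sizes yields a different integer, which is why the report keeps the symbolic expression next to the evaluated value.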
The results are written to a text report file.
./measureTool -c -static-counting-only -annot ../../../sourcetree/projects/ArithmeticMeasureTool/src/functionSideEffect.annot -I. ../../../sourcetree/projects/ArithmeticMeasureTool/test/jacobi.c
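An excerpt of the generated report follows. Note that the loop at line 129 has two FP add operations and two FP multiply operations. It loads 0 bytes and stores one double-precision element (typically 8 bytes). The final arithmetic intensity (AI) is therefore 4/8 = 0.5 ops/byte.
Content of the generated report file: ai_tool_report.txt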
----------Floating Point Operation Counts---------------------
SgForStatement@
/home/liao6/workspace/ExReDi/ai_tool/sourcetree/projects/ArithmeticMeasureTool/test/jacobi.c:129:10
fp_plus:2
fp_minus:0
fp_multiply:2
fp_divide:0
fp_total:4
----------Memory Operation Counts---------------------
Loads: NULL
Loads int: 0
Stores:1 * sizeof(double )
Store int: 8
----------Arithmetic Intensity---------------------
AI=0.5
Currently
- If AI is uninitialized, it is set to -1.0
- If the byte count is zero (division by zero), AI is set to 9999.9
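The AI value in the report and the two special values above can be summarized with a short sketch. It assumes that AI is the total FP operation count divided by the sum of load and store bytes, which matches the 4/8 = 0.5 example; the function and macro names are hypothetical and do not come from the tool's source.

```c
#define AI_UNINITIALIZED (-1.0)   /* reported when AI was never computed */
#define AI_ZERO_BYTES    9999.9   /* reported when no bytes are moved */

/* Hypothetical helper: arithmetic intensity in FP operations per byte. */
double arithmetic_intensity(unsigned long fp_total,
                            unsigned long load_bytes,
                            unsigned long store_bytes)
{
    unsigned long bytes = load_bytes + store_bytes;
    if (bytes == 0)
        return AI_ZERO_BYTES;     /* avoid dividing by zero */
    return (double)fp_total / (double)bytes;
}
```

For the jacobi.c loop above, arithmetic_intensity(4, 0, 8) gives 0.5.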
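In this mode, the translator can verify the results generated by the tool by comparing them with the expected results given in pragmas in the input code.
A user-provided pragma takes the following form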
#pragma aitool fp_plus(10) fp_minus(10) fp_multiply(10) fp_divide (10) fp_total(40)
for () ...
void error_check ( )
{
  int i,j;
  double xx,yy,temp,error;
  dx = 2.0 / (n-1);
  dy = 2.0 / (m-1);
  error = 0.0 ;
#pragma aitool fp_plus(3) fp_minus(3) fp_multiply(6)
  for (i=0;i<n;i++)
    for (j=0;j<m;j++)
    {
      xx = -1.0 + dx * (i-1);
      yy = -1.0 + dy * (j-1);
      temp = u[i][j] - (1.0-xx*xx)*(1.0-yy*yy);
      error = error + temp*temp;
    }
  error = sqrt(error)/(n*m);
  printf("Solution Error :%E \n",error);
}
fp_total is required, while the clauses for the other kinds of FP operations are optional.
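The comparison between pragma values and tool results might look like the following sketch. It is only an illustration of the rule that fp_total is required and the other clauses are optional: the struct, the function names, and the use of -1 for an omitted clause are all assumptions, not the tool's actual implementation.

```c
/* Counts for one loop; -1 marks a clause that the pragma did not provide. */
struct fp_counts {
    int plus, minus, multiply, divide, total;
};

/* An omitted clause (negative value) matches any measured count. */
static int clause_matches(int expected, int measured)
{
    return expected < 0 || expected == measured;
}

/* Returns 1 when every clause given in the pragma equals the measured count,
 * 0 otherwise. A pragma without fp_total is rejected. */
int verify_fp_counts(const struct fp_counts *expected,
                     const struct fp_counts *measured)
{
    if (expected->total < 0)
        return 0;
    return clause_matches(expected->plus, measured->plus)
        && clause_matches(expected->minus, measured->minus)
        && clause_matches(expected->multiply, measured->multiply)
        && clause_matches(expected->divide, measured->divide)
        && expected->total == measured->total;
}
```

With this sketch, a pragma that only specifies fp_total(40) is accepted for any loop whose measured total is 40, whatever the split between operation kinds.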
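Analysis and instrumentation is the default mode.
The tool currently works together with code changes added by the user, in the following steps
- Declare four global counters (chiterations, chloads, chstores, chflops) with these specific variable names, which the tool recognizes later
- Add chiterations = .. before each loop whose FP operations and load/store bytes you want to count
- Print the results: printf ("chflops =%lu chloads =%lu chstores=%lu\n", chflops, chloads, chstores);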
#include <stdio.h>
#define SIZE 10

// Instrumentation 1: add a few global variables
unsigned long int chiterations = 0;
unsigned long int chloads = 0;
unsigned long int chstores = 0;
unsigned long int chflops = 0;

double ref[2] = {9.2, 5.4};
double coarse[SIZE][SIZE][SIZE];
int main()
{
  double refScale = 1.0 / (ref[0] * ref[1]);
  int iboxlo1 = 0, iboxlo0 = 0, iboxhi1 = SIZE-1, iboxhi0 = SIZE-1;
  int var;
  int ic1=0, ic0=0;
  int ip0 = ic0 * ref[0];
  int ip1 = ic1 * ref[1];
  double coarseSum = 0.0;
  int ii1, ii0;

  for (var =0; var < SIZE ; var++)
  {
    //Instrumentation 2: pass in loop iteration for the loop to be counted
    chiterations = (1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0);
    for (ic1 = iboxlo1; ic1< iboxhi1 +1; ic1++)
      for (ic0 = iboxlo0; ic0< iboxhi0 +1; ic0++)
      {
        int ibreflo1 = 0, ibreflo0 = 0, ibrefhi1 = SIZE-1, ibrefhi0 = SIZE-1;
        //Instrumentation 3: pass in loop iteration for the loop to be counted
        chiterations = (1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0);
        for (ii1 = ibreflo1; ii1< ibrefhi1 +1; ii1++)
          for (ii0 = ibreflo0; ii0< ibrefhi0 +1; ii0++)
          {
            coarseSum = coarseSum + coarse[ii1][ii0][ii1] +(ip0 + ii0) + (ip1 + ii1) + var;
          }
        coarse[ic0][ic1][var] = coarseSum * refScale;
      }
  }
  //Instrumentation 4: print out results
  printf ("chflops =%lu chloads =%lu chstores=%lu\n", chflops, chloads, chstores);
  return 0;
}
./measureTool -c -annot ../../../sourcetree/projects/ArithmeticMeasureTool/src/functionSideEffect.annot nestedloops.c
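The tool will
- compute the FLOPS and load/store bytes of the specified loops
- add counter accumulation statements, using a separate iteration counter (chiterations_1, chiterations_2, ...) for each loop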
#include <stdio.h>
#define SIZE 10
// Instrumentation 1: add a few global variables
unsigned long chiterations = 0;
unsigned long chloads = 0;
unsigned long chstores = 0;
unsigned long chflops = 0;
double ref[2] = {(9.2), (5.4)};
double coarse[10][10][10];

int main()
{
  double refScale = 1.0 / (ref[0] * ref[1]);
  int iboxlo1 = 0;
  int iboxlo0 = 0;
  int iboxhi1 = 10 - 1;
  int iboxhi0 = 10 - 1;
  int var;
  int ic1 = 0;
  int ic0 = 0;
  int ip0 = (ic0 * ref[0]);
  int ip1 = (ic1 * ref[1]);
  double coarseSum = 0.0;
  int ii1;
  int ii0;
  unsigned long chiterations_1;
  unsigned long chiterations_2;
  for (var = 0; var < 10; var++) {
    //Instrumentation 2: pass in loop iteration for the loop to be counted
    chiterations_2 = (1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0);
    for (ic1 = iboxlo1; ic1 < iboxhi1 + 1; ic1++) {
      for (ic0 = iboxlo0; ic0 < iboxhi0 + 1; ic0++) {
        int ibreflo1 = 0;
        int ibreflo0 = 0;
        int ibrefhi1 = 10 - 1;
        int ibrefhi0 = 10 - 1;
        //Instrumentation 3: pass in loop iteration for the loop to be counted
        chiterations_1 = (1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0);
        for (ii1 = ibreflo1; ii1 < ibrefhi1 + 1; ii1++) {
          for (ii0 = ibreflo0; ii0 < ibrefhi0 + 1; ii0++) {
            coarseSum = coarseSum + coarse[ii1][ii0][ii1] + (ip0 + ii0) + (ip1 + ii1) + var;
          }
        }
        /* aitool generated Loads counting statement ... */
        chloads = chloads + chiterations_1 * (1 * sizeof(double ));
        /* aitool generated FLOPS counting statement ... */
        chflops = chflops + chiterations_1 * 4;
        coarse[ic0][ic1][var] = coarseSum * refScale;
      }
    }
    /* aitool generated Stores counting statement ... */
    chstores = chstores + chiterations_2 * (1 * sizeof(double ));
    /* aitool generated FLOPS counting statement ... */
    chflops = chflops + chiterations_2 * 1;
  }
  //Instrumentation 4: pass in loop iteration for the loop to be counted
  printf("chflops =%lu chloads =%lu chstores=%lu\n",chflops,chloads,chstores);
  return 0;
}
gcc -O3 rose_nestedloops.c -o nestedloops.out
./nestedloops.out
The result looks like
chflops =401000 chloads =800000 chstores=8000
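These totals can be checked by hand from the trip counts in the instrumented code. The sketch below redoes that arithmetic in closed form; the struct and function names are hypothetical, and the byte totals shown above assume sizeof(double) is 8.

```c
/* Totals accumulated by the counting statements in the nestedloops example. */
struct loop_totals {
    unsigned long flops, loads, stores;
};

struct loop_totals nestedloops_expected_totals(void)
{
    const unsigned long var_iters   = 10;       /* var loop: 0..9 */
    const unsigned long outer_iters = 10 * 10;  /* chiterations_2: ic1, ic0 */
    const unsigned long inner_iters = 10 * 10;  /* chiterations_1: ii1, ii0 */
    struct loop_totals t;

    /* The chiterations_1 statements run once per (ic1, ic0) iteration:
     * 4 FP operations and one double load per inner iteration. */
    t.loads = var_iters * outer_iters * inner_iters * sizeof(double);
    t.flops = var_iters * outer_iters * inner_iters * 4;

    /* The chiterations_2 statements run once per var iteration:
     * 1 FP operation and one double store per (ic1, ic0) iteration. */
    t.stores = var_iters * outer_iters * sizeof(double);
    t.flops += var_iters * outer_iters * 1;
    return t;
}
```

With 8-byte doubles this gives 10 * (100 * 400 + 100) = 401000 FP operations, 10 * 100 * 100 * 8 = 800000 load bytes, and 10 * 100 * 8 = 8000 store bytes, which matches the printed line.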
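The tool currently does not support Fortran loops that contain function calls
- ROSE's representation of Fortran procedures/routines is not precise enough (argument type information is missing) to hook up with the function side effect annotations, which were designed to match C/C++ functions.
The execution mode is held in the variable running_mode
- e_analysis_and_instrument
- e_static_counting
class FPCounters: public AstAttribute {}; stores the analysis results
void CountFPOperations() from src/ai_measurement.cpp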
Rose_STL_Container<SgNode*> nodeList = NodeQuery::querySubTree(input, V_SgBinaryOp);
for (Rose_STL_Container<SgNode *>::iterator i = nodeList.begin(); i != nodeList.end(); i++)
{
  fp_operation_kind_enum op_kind = e_unknown;
  // bool isFPType = false;
  // check operation type
  SgBinaryOp* bop= isSgBinaryOp(*i);
  switch (bop->variantT())
  {
    case V_SgAddOp:
    case V_SgPlusAssignOp:
      op_kind = e_plus;
      break;
    case V_SgSubtractOp:
    case V_SgMinusAssignOp:
      op_kind = e_minus;
      break;
    case V_SgMultiplyOp:
    case V_SgMultAssignOp:
      op_kind = e_multiply;
      break;
    case V_SgDivideOp:
    case V_SgDivAssignOp:
      op_kind = e_divide;
      break;
    default:
      break;
  } //end switch
  ...
}
The main functions are defined in ai_measurement.cpp
- std::pair <SgExpression*, SgExpression*> CountLoadStoreBytes (SgLocatedNode* input, bool includeScalars /* = true */, bool includeIntType /* = true */)
- SgExpression* calculateBytes (std::set<SgInitializedName*>& name_set, SgStatement* lbody, bool isRead)
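These functions return expressions that compute the values rather than the actual values, because sizeof(type) is machine dependent.
Configuration
- By default: only array references are counted. Scalars are ignored.
Algorithm
- Call the side effect analysis to find the variables that are read or written; some references may trigger both a read and a write access. If the analysis succeeds, continue. Otherwise a warning is issued.
- Accesses to the same array/scalar variable are grouped into one read (or write) access: e.g. array[i][j], array[i][j+1], array[i][j-1] and so on are counted as a single access
- Group the accesses by type: accesses of the same type increment the same counter, which shortens the resulting expression
- Iterate over the results to generate an expression like 2*sizeof(float) + 5*sizeof(double)
- As an approximation, a simple analysis that does not take function calls into account is used here.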
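The grouping steps above can be illustrated with a small self-contained sketch. It is not the ROSE implementation: the struct access record, the function name build_bytes_expr, and the fixed-size type table are assumptions made for the example, and it handles a single access kind (all reads or all writes) at a time.

```c
#include <stdio.h>
#include <string.h>

/* One array reference found in a loop body (hypothetical, simplified). */
struct access {
    const char *array;  /* array name, e.g. "a" for a[i][j+1] */
    const char *type;   /* element type, e.g. "double" */
};

#define MAX_TYPES 8

/* Writes an expression such as "1*sizeof(double) + 2*sizeof(float)" to out.
 * References to the same array count once; counts are grouped by type.
 * out_size must be at least 1. */
void build_bytes_expr(const struct access *acc, int n, char *out, size_t out_size)
{
    const char *types[MAX_TYPES];
    int counts[MAX_TYPES];
    int n_types = 0;

    for (int i = 0; i < n; i++) {
        /* Skip this reference if the same array appeared earlier. */
        int seen = 0;
        for (int j = 0; j < i; j++) {
            if (strcmp(acc[i].array, acc[j].array) == 0) {
                seen = 1;
                break;
            }
        }
        if (seen)
            continue;

        /* Find the counter for this type, or start a new one. */
        int t = 0;
        while (t < n_types && strcmp(types[t], acc[i].type) != 0)
            t++;
        if (t == n_types) {
            if (n_types == MAX_TYPES)
                continue;  /* sketch only: ignore types beyond the table size */
            types[n_types] = acc[i].type;
            counts[n_types] = 0;
            n_types++;
        }
        counts[t]++;
    }

    /* Join the per-type terms with " + ". */
    size_t used = 0;
    out[0] = '\0';
    for (int t = 0; t < n_types && used < out_size; t++) {
        int w = snprintf(out + used, out_size - used, "%s%d*sizeof(%s)",
                         t ? " + " : "", counts[t], types[t]);
        if (w < 0)
            break;
        used += (size_t)w;
    }
}
```

For the references a[i][j], a[i][j+1] (both double), b[i] and c[i] (both float), the sketch produces 1*sizeof(double) + 2*sizeof(float): the two references to a collapse into one access, and the two float arrays share one counter.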
// Obtain per-iteration load/store bytes calculation expressions
// excluding scalar types to match the manual version
//CountLoadStoreBytes (SgLocatedNode* input, bool includeScalars = true, bool includeIntType = true);
std::pair <SgExpression*, SgExpression*> load_store_count_pair = CountLoadStoreBytes (loop_body, false, true);
// chstores=chstores+chiterations*8
if (load_store_count_pair.second!= NULL)
{
  SgExprStatement* store_byte_stmt = buildCounterAccumulationStmt("chstores", new_iter_var_name, load_store_count_pair.second, scope);
  insertStatementAfter (loop, store_byte_stmt);
  attachComment(store_byte_stmt," aitool generated Stores counting statement ...");
}
// handle loads stmt 2nd so it can be inserted as the first after the loop
// build chloads=chloads+chiterations*2*8
if (load_store_count_pair.first != NULL)
{
  SgExprStatement* load_byte_stmt = buildCounterAccumulationStmt("chloads", new_iter_var_name, load_store_count_pair.first, scope);
  insertStatementAfter (loop, load_byte_stmt);
  attachComment(load_byte_stmt," aitool generated Loads counting statement ...");
}
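Scientific applications often have nested loops. Naive instrumentation causes two problems
- Nested loop bodies are counted twice
- chiterations = .. statements are used for loops at every level. The chiterations of an inner loop overwrites the chiterations meant for the outer loop.
Solution
- The translator uses a bottom-up traversal order: inner loops are processed first, then outer loops.
- To avoid double counting FP operations inside nested loops: all visited FP operation expressions are stored in a lookup table. Later counting checks whether an operation has already been counted and skips it if so.
- To avoid double counting variables used in nested loops when counting the outer loop body: this is handled slightly differently from FP operation expressions. Here we find all variables counted in the inner loop and exclude them when counting the outer loop. Note: the variable is excluded entirely, rather than marking one reference (say, a reference to a) and excluding only that reference later.
- Note: static counting mode does not perform this exclusion, because redundant counting at execution time is not a concern there. For nested loops, the FLOPS of the loop body are still computed for both the inner loop and the outer loop.
- Rewrite chiterations = .. to chiterations_loopId = .. so that each loop has its own iteration count variable.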
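The lookup-table idea for FP operations can be sketched as follows. This is a simplified model rather than the tool's code: each FP operation is represented by an integer id, a loop by the range of ids in its body, and all names are hypothetical.

```c
#define MAX_OPS 64

/* A loop is modeled as the half-open range [first, last) of FP-operation ids
 * in its body; an enclosing loop's range contains its inner loop's range. */
struct loop_range {
    int first, last;
};

/* Counts the operations of a loop that were not counted before and marks
 * them in the lookup table, so that enclosing loops skip them later.
 * Loops must be passed in bottom-up order: inner loops first. */
int count_new_fp_ops(struct loop_range loop, unsigned char counted[MAX_OPS])
{
    int n = 0;
    int id = loop.first < 0 ? 0 : loop.first;
    for (; id < loop.last && id < MAX_OPS; id++) {
        if (counted[id])
            continue;      /* already attributed to an inner loop */
        counted[id] = 1;
        n++;
    }
    return n;
}
```

In the nestedloops example, the inner loop body holds 4 FP operations (ids 0 to 3) and the enclosing loop body holds those plus one multiplication (id 4). Processing the inner range {0, 4} first returns 4, and the outer range {0, 5} then returns 1, which corresponds to the factors in chiterations_1 * 4 and chiterations_2 * 1.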
// global chiterations is changed to two local variables: each for one loop
unsigned long chiterations_1;
unsigned long chiterations_2;
for (var = 0; var < 10; var++) {
  //Instrumentation 2: pass in loop iteration for the loop to be counted
  chiterations_2 = ((1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0) * 1);
  for (ic1 = iboxlo1; ic1 < iboxhi1 + 1; ic1++) {
    for (ic0 = iboxlo0; ic0 < iboxhi0 + 1; ic0++) {
      int ibreflo1 = 0;
      int ibreflo0 = 0;
      int ibrefhi1 = 10 - 1;
      int ibrefhi0 = 10 - 1;
      //Instrumentation 3: pass in loop iteration for the loop to be counted
      chiterations_1 = ((1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0) * 1);
      for (ii1 = ibreflo1; ii1 < ibrefhi1 + 1; ii1++) {
        for (ii0 = ibreflo0; ii0 < ibrefhi0 + 1; ii0++) {
          coarseSum = coarseSum + coarse[ii1][ii0][ii1] + (ip0 + ii0) + (ip1 + ii1) + var;
        }
      }
      /* aitool generated Loads counting statement ... */
      chloads = chloads + chiterations_1 * (1 * sizeof(double ));
      /* aitool generated FLOPS counting statement ... */
      chflops = chflops + chiterations_1 * 4;
      coarse[ic0][ic1][var] = coarseSum * refScale;
    }
  }
  /* aitool generated Stores counting statement ... */
  chstores = chstores + chiterations_2 * (1 * sizeof(double ));
  /* aitool generated FLOPS counting statement ... */
  chflops = chflops + chiterations_2 * 1;
}
Run all built-in tests
- make check
Run only the static analysis tests
- make check-static
Manual test
- [liao6@tux322:~/workspace/ExReDi/ai_tool.git/translator]m && ./measureTool -c -annot ./src/functionSideEffect.annot -I. ./test/jacobi-v3.c