"Your current attempt runs in about O(N^2). So even if you had enough memory, it won't finish in your lifetime." Alexander Yee^[1]

   "premature optimization is the root of all evil" - Donald Knuth in paper "Structured Programming With Go To Statements"

           "good algorithms beat optimized code" Bruce Dawson ( Fractal extreme )

軟體最佳化

程式碼剖析
基準測試 ^[2]

如何詢問最佳化問題 ?

"如果你已經實際剖析了程式碼，有具體的程式碼片段，以便每個人都可以執行相同的程式碼來檢視其效能，並且你在某個地方釋出了這個庫，可以檢視（GitHub、Bitbucket 等），那麼我不一定認為在程式碼審查中進行討論有問題。包括你選擇的分析器中的結果，並識別瓶頸，將有助於保持主題的集中性。
如果你剛開始編寫程式碼，但已經剖析了一小段程式碼，並展示了異常的效能，那麼在 Stack Overflow 上提問是可接受的。包括你選擇的分析器中的結果，並識別瓶頸，將有助於保持主題的集中性。
如果你只是讓別人幫你最佳化程式碼，那就不要在任何地方提問。得到一個程式碼轉儲並被要求找到效能瓶頸，這只是我在專業工作中才會做的事，而不是在我的空閒時間。" (Makoto) ^[3]

另請參閱 : 程式碼審查 ^[4]

內部迴圈

維基百科中的內部迴圈 ^[5]

數值計算的最佳化

數字型別

double 比 float 快
"在我的測試（使用擾動）中，在 Fraktaler-3 中，x87 長雙精度浮點數在 AMD 2700X CPU 上比雙精度浮點數慢大約 4.2 倍。AMD RX 580 GPU 上的 floatexp（float + int32）比 AMD 2700X CPU 上的 x87 長雙精度浮點數快大約 2.3 倍，因此根據硬體的不同，x87 長雙精度浮點數可能沒有必要..."fractalforums.org: from-java-to-cplusplus

距離

距離近似
科比 1/4 定理 ^[6]

分解

四叉樹分解

"I did a web-based quad-tree Mandelbrot once.^[7]  I labelled the quadrants (eg a,b,c,d) the tile files were in directories (something like a.png b.png a/a.png a/b.png a/c/a/b.png), along with an sqlite database that said which tiles  were boring (all same colour), along with their period (0 for 100% exterior, 1 for 100% inside the main cardioid, 2 for 100% inside the main circle, etc).  The client was really simple, just a web page with tile paths; the server  would do a redirect to the relevant boring tile 0.png 1.png 2.png etc if any parent of the tile was boring, otherwise it'd try to serve the tile.png.  Meanwhile there was another process that just generated the tiles and populated the database.  I can't remember if I made 404 tile not found errors bump the priority of the tile for the renderer, but maybe.

Eventually i abandoned the project because recalculating from scratch is typically faster and more flexible for shallow zooms, and for deep zooms the storage is prohibitive." Claude Heiland-Allen ^[8]

對稱性

映象對稱

引數平面和曼德布洛特集合在平面中關於 x 軸對稱 :^[9]

colour(x,y) = colour(x,-y)

這裡有一個 C 程式碼示例，展示瞭如何將對稱影像（其軸位於其高度的中間）分成 3 部分

對稱軸
2 個對稱部分 : 軸上和軸下

// fill array using mirror symmetry of image 
// uses global var :  ...
int FillArray(unsigned char data[] )
{
   
 unsigned char Color; // gray from 0 to 255

// draw axis of symmetry
iy = iyAxisOfSymmetry; 
for(ix=ixMin;ix<=ixMax;++ix) PlotPoint(ix, iy, GiveColor(ix, iy));

// draw symmetric parts : above and below axis 
for(iyAbove = iyAboveMin; iyAbove<=iyAboveMax; ++iyAbove) 
  for(ix=ixMin; ix<=ixMax; ++ix)

  { // above axis compute color and save it to the array
    iy = iyAxisOfSymmetry + iyAbove;
    Color = GiveColor(ix, iy);
    PlotPoint(ix, iy, Color ); 
    // below the axis only copy Color the same as above without computing it 
    PlotPoint(ix, iyAxisOfSymmetry - iyAbove , Color ); 
   }   
 return 0;
}

另請參閱用於一般情況的 Pascal 程式碼 (Delphi) ^[10]

旋轉對稱

動力學平面和朱利亞集合具有旋轉對稱性。它可以用來加速計算

// fill array using symmetry of image 
// uses global var :  ...
int FillArraySymmetric(unsigned char data[] )
{
   
 unsigned char Color; // gray from 0 to 255

printf("axis of symmetry \n"); 
iy = iyAxisOfSymmetry;

for(ix=ixMin;ix<=ixMax;++ix) PlotPoint(ix, iy, GiveColor(ix, iy));

// above and below axis 
for(iyAbove = iyAboveMin; iyAbove<=iyAboveMax; ++iyAbove) 
  {printf(" %d from %d\n", iyAbove, iyAboveMax); //info 
  for(ix=ixMin; ix<=ixMax; ++ix)

  { // above axis compute color and save it to the array
    iy = iyAxisOfSymmetry + iyAbove;
    Color = GiveColor(ix, iy);
    PlotPoint(ix, iy, Color ); 
    // below the axis only copy Color the same as above without computing it 
    PlotPoint(ixMax-ix, iyAxisOfSymmetry - iyAbove , Color ); 
   } 
}  
 return 0;
}

複數

不顯式定義複數或不使用複數型別的計算通常更快。

平行計算

在曼德布洛特和朱利亞集合中，每個點/畫素都是獨立計算的，因此可以輕鬆地將其分解成更小的任務並同時執行。 ^[11]

平行計算型別 : 維基百科描述 ^[12]

多核（計算）
- Parallella 和 OpenCl ^[13]
- 英特爾 MIC ^[14]
- MPI ^[15]
- OpenMP ^[16] ^[17] ^[18]
通用圖形處理器計算 (GPGPU)
- OpenCl ^[19]
- CUDA ^[20]
- WebGl ^[21] ^[22]

向量化

C

gcc 向量擴充套件 ^[23]^[24]^[25]^[26]
英特爾® 高階向量擴充套件 ^[27]^[28]

c++

Eigen 中的 Mandelbrot - 顯式向量化

R

向量化 ^[29]^[30]

以下是上面圖片的 R 程式碼，包含註釋：^[31]

## based on the R code by Jarek Tuszynski
## http://commons.wikimedia.org/wiki/File:Mandelbrot_Creation_Animation_%28800x600%29.gif

iMax=20; # number of frames and iterations

## world coordinate ( float )
cxmin = -2.2;
cxmax = 1.0;
cymax = 1.2;
cymin = -1.2;

## screen coordinate ( integer)

dx=800; dy=600 ;           # define grid size

## sequences of  values = rows and columns
cxseq = seq(cxmin, cxmax, length.out=dx);
cyseq = seq(cymin, cymax, length.out=dy);
## sequences of  values = rows * columns
cxrseq = rep(cxseq, each = dy);
cyrseq = rep(cyseq,  dx);
C = complex( real=cxrseq, imag = cyrseq); # complex vector
C = matrix(C,dy,dx)       # convert from vector to matrix
# Now C is a matrix of c points coordinates (cx,cy)

# allocate memory for all the frames
F = array(0, dim = c(dy,dx,iMax)) # dy*dx*iMax array filled with zeros

Z = 0                     # initialize Z to zero
for (i in 1:iMax)         # perform iMax iterations
{
  Z = Z^2+C               # iteration; uses n point to compute n+1 point
  
  F[,,i]  = exp(-abs(Z))              # an omitted index is used to represent an entire column or row
     # store magnitude of the complex number, scale it using an exponential function to emphasize fine detail, 
}

# color and save F array to the file 
library(caTools)          # load library with write.gif function
# mapped to color palette jetColors . Dark red is a very low number, dark blue is a very high number
jetColors = colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
write.gif(F, "Mandelbrot.gif", col=jetColors, delay=100, transparent=0)

多核

"我總是建立與我擁有的核心數量相同的執行緒。超過這個數量，你的系統會花費太多時間進行任務切換" - Duncan C^[32]^[33]

使用 OpenMP

// fill array using symmetry of image 
// uses global var :  ...
int FillArraySymmetric(unsigned char data[] )
{
   
 unsigned char Color; // gray from 0 to 255

printf("axis of symmetry \n"); 
iy = iyAxisOfSymmetry; 
#pragma omp parallel for schedule(dynamic) private(ix,Color) shared(ixMin,ixMax, iyAxisOfSymmetry)
for(ix=ixMin;ix<=ixMax;++ix) PlotPoint(ix, iy, GiveColor(ix, iy));

/*
The use of ‘shared(variable, variable2) specifies that these variables should be shared among all the threads.
The use of ‘private(variable, variable2)’ specifies that these variables should have a separate instance in each thread.
*/

#pragma omp parallel for schedule(dynamic) private(iyAbove,ix,iy,Color) shared(iyAboveMin, iyAboveMax,ixMin,ixMax, iyAxisOfSymmetry)

// above and below axis 
for(iyAbove = iyAboveMin; iyAbove<=iyAboveMax; ++iyAbove) 
  {printf(" %d from %d\n", iyAbove, iyAboveMax); //info 
  for(ix=ixMin; ix<=ixMax; ++ix)

  { // above axis compute color and save it to the array
    iy = iyAxisOfSymmetry + iyAbove;
    Color = GiveColor(ix, iy);
    PlotPoint(ix, iy, Color ); 
    // below the axis only copy Color the same as above without computing it 
    PlotPoint(ixMax-ix, iyAxisOfSymmetry - iyAbove , Color ); 
   } 
}  
 return 0;
}

單元

"索尼 PlayStation 3 是消費市場上最便宜的平行計算機之一。" ^[34]^[35]

GPGPU

CUDA
- 適用於 CUDA 的 Yellow Dog Linux^[36]
- par4all ^[37]
OpenCl
GLSL - GLSL 編輯工具
- Fragmentarium : 用於使用基於 GPU（GLSL）的畫素圖形構建的 C++、OpenGL/GLSL 和 Qt 4 的遊樂場。
- 著色器編輯器使用 kickjs
- GLSL 沙盒由 Ricardo Cabello Miguel 使用 three.js
- jsfiddle 和示例
- WebGl : 透過 html 使用 GLSL。它是一個基於 OpenGL ES 2.0 API 的 3D 圖形 API，開發人員可以使用它來建立完整的 3D 網路應用程式。
  - Inigo Quilez、Pol Jeremias 的 shadertoy : 一個基於瀏覽器的片段著色器編輯器，用於建立 2D 效果。
    - shadertoy 如何運作？
  - Chrome 實驗
  - WebGL 遊樂場 - 新的不工作！！
  - 示例

Canvas : 支援瀏覽器的列表

效果

在 2 核 CPU 上的效果（每個核心 1 個執行緒）

方法	時間 [分鐘]	相對速度
CPU & 無最佳化	80	1
CPU & 對稱性	39	2
CPU & 對稱性 & OpenMP	24	3
GPU	? (待辦)

GPU 應該是

比 CPU 快 62 倍 ^[38]
快 760 倍 ^[39]

擾動演算法與級數逼近（在 e100 附近的縮放級別快 100 倍！）^[40]

使用 WinXP Intel 2.9 GHz CPU（使用 1 個 CPU）和 GTX 480 GPU，我使用 1000x1000 圖和 1000 次最大迭代得到以下結果：^[41]

型別	時間	描述
gpu	0.07 秒	gpu 是 GPU 上的純 CUDA 解決方案
gpuarray	3.4 秒	gpuarray 在 GPU 上使用 Python 中類似 numpy 的 CUDA 包裝器
numpy	43.4 秒	numpy 是 CPU 上的純 Numpy（基於 C）解決方案
python（序列）	1605.6 秒	python 是 CPU 上的純 Python 解決方案，使用 numpy 陣列

參考文獻

[1] stackoverflow 問題：如何以最快的方式計算 e 到 2 萬億位

[2] 計算機語言基準測試遊戲

[3] Stack Overflow 元資料：可以問關於程式碼最佳化的幫助嗎？

[4] stack exchange：程式碼審查

[5] 維基百科中的內迴圈

[6] 使用距離估計進行自適應過取樣，作者：Claude Heiland-Allen

[7] 距離估計，作者：Claude Heiland-Allen

[8] ractalforums.org：使用四叉樹快速建立超解析度分形影像

[9] 維基百科：反射對稱

[10] raktal.republika.pl 上關於 x 軸的映象對稱

[11] 令人尷尬的並行問題

[12] 維基百科：平行計算機 - 平行計算機的類別

[13] 使用 OpenCl 和 Parallella 的 Mandelbrot

[14] 維基百科：英特爾 MIC

[15] 並行 Mandelbrot 集（使用訊息傳遞介面 (MPI) 庫的 C 程式碼），作者：Omar U. Florez

[16] Joel Yliluoma 編寫的《OpenMP 指南：C++ 的簡單多執行緒程式設計》

[17] rogle 編寫的《使用 OpenMP 的並行 Mandelbrot》

[18] udiusmaximus : 指數對映和 OpenMP

[19] ythreads 編寫的《第 1 部分：OpenCL™ – 可移植並行性》

[20] Loren Shure 編寫的《Matlab 中的 GPU 上的 Mandelbrot 集》

[21] Felix Woitzel 編寫的《使用 webgl 的漸進 Julia 分形》

[22] roku.com 上的 glsl 沙箱

[23] : 向量擴充套件

[24] rt hubert 編寫的《使用 gcc 向量支援的第一個程式碼示例》

[25] EJRH Edmund Horner 編寫的《使用 SIMD 的 Mandelbrot 計算》

[26] 計算機語言基準測試遊戲

[27] 英特爾高階向量擴充套件

[28] Tim Horton 編寫的《ICC 和 Mandelbrot》

[29] 向量化（平行計算）

[30] Matlab - 程式碼向量化指南

[31] J.D. Long 編寫的《R 中的 Mandelbrot 集》

[32] /programming/what-is-the-general-approach-to-threading-for-2d-plotting/ Fractal 論壇討論 : 2D 繪圖的執行緒通用方法是什麼

[33] ractalforums 討論 : 以合理的價格獲得最強大的計算機？PHPSESSID=7db41da33f499eb25ef85f46866438f2

[34] 針對特定硬體的演算法 -德國薩爾大學數學影像分析組

[35] Jenks 並行程式設計部落格

[36] 在 Cell BE 處理器上程式設計高效能應用程式，第 1 部分：PLAYSTATION 3 上的 Linux 簡介

[37] r4all

[38] David Bucciarelli 編寫的《MandelCPU 與 MandelGPU》

[39] thworks : 說明三種 GPU 計算 Mandelbrot 集的方法

[40] Bernard Geiger 編寫的《Kalles Fraktaler 2》

[41] Ian Ozsvald ianozsvald.com 2010 年 7 月編寫的《使用 GPU、序列 numpy 和更快的 numpy 計算的 Mandelbrot，用於展示 CPU 和 GPU 計算之間的速度差異》

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]

[38]

[39]

[40]

[41]