Data Mining Algorithms In R/Classification/AdaBoost

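Boosting is one of the most important developments in classification methodology. It works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the resulting sequence of classifiers. For many classification algorithms, this simple strategy yields a dramatic improvement in performance. This seemingly mysterious phenomenon can be understood in terms of two well-known statistical principles: additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale, using maximum Bernoulli likelihood as the criterion.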
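As a sketch of that last point, following the notation of Friedman, Hastie and Tibshirani (listed in the references), take labels $y \in \{-1, 1\}$ and an additive score $F(x)$. The symbols $J$ and $F^{*}$ are introduced here only for illustration. The exponential criterion, its population minimizer, and the class probability it implies can be written as:

```latex
J(F) = E\left[ e^{-yF(x)} \right], \qquad
F^{*}(x) = \arg\min_{F} J(F) = \frac{1}{2}\log\frac{P(y=1 \mid x)}{P(y=-1 \mid x)}, \qquad
P(y=1 \mid x) = \frac{e^{F^{*}(x)}}{e^{-F^{*}(x)} + e^{F^{*}(x)}}
```

The minimizer is half the log-odds of the positive class, which is why the fitted score can be read on the logistic scale.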

Technique/Algorithm

Algorithm

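Although boosting has evolved over the years, we describe the most commonly used version of the AdaBoost procedure (Freund and Schapire, 1996), which we call Discrete AdaBoost. It is essentially the same as AdaBoost.M1 for binary data in Freund and Schapire. Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$, where $x_{i}$ is a vector-valued feature and $y_{i}=-1$ or $1$. We define $F(x)=\sum_{m=1}^{M}c_{m}f_{m}(x)$, where each $f_{m}(x)$ is a classifier producing values plus or minus 1 and the $c_{m}$ are constants; the corresponding prediction is $\operatorname{sign}(F(x))$. AdaBoost trains the classifiers $f_{m}(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is repeated for a sequence of weighted samples, and the final classifier is then defined as a linear combination of the classifiers from each stage.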
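The procedure described above can be sketched in base R. This is a minimal illustration, not the code used by the ada package: it assumes decision stumps as the weak learner (ada builds rpart trees), uses the stage weight $c_{m}=\log((1-err_{m})/err_{m})$ from Discrete AdaBoost, and all function names (`fit_stump`, `predict_stump`, `discrete_adaboost`, `predict_adaboost`) and the toy data are hypothetical.

```r
# Find the best weighted decision stump: feature j, threshold s and sign d such
# that predicting d * ifelse(x[, j] > s, 1, -1) has the lowest weighted error.
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x))) {
    for (s in unique(x[, j])) {
      pred <- ifelse(x[, j] > s, 1, -1)
      err <- sum(w[pred != y])
      for (d in c(1, -1)) {
        # Flipping the sign turns the error into 1 - err (weights sum to 1)
        e <- if (d == 1) err else 1 - err
        if (e < best$err) best <- list(j = j, s = s, d = d, err = e)
      }
    }
  }
  best
}

predict_stump <- function(stump, x) {
  stump$d * ifelse(x[, stump$j] > stump$s, 1, -1)
}

discrete_adaboost <- function(x, y, M = 20) {
  n <- nrow(x)
  w <- rep(1 / n, n)                          # start with uniform weights
  stumps <- vector("list", M)
  cm <- numeric(M)
  for (m in seq_len(M)) {
    stump <- fit_stump(x, y, w)               # train f_m on the weighted sample
    miss <- predict_stump(stump, x) != y
    err <- min(max(sum(w[miss]), 1e-10), 1 - 1e-10)  # guard against log(0)
    cm[m] <- log((1 - err) / err)             # stage constant c_m
    w <- w * exp(cm[m] * miss)                # up-weight misclassified cases
    w <- w / sum(w)                           # renormalize
    stumps[[m]] <- stump
  }
  list(stumps = stumps, cm = cm)
}

# Final classifier: sign of the weighted sum F(x) = sum_m c_m f_m(x)
predict_adaboost <- function(model, x) {
  Fx <- rep(0, nrow(x))
  for (m in seq_along(model$cm)) {
    Fx <- Fx + model$cm[m] * predict_stump(model$stumps[[m]], x)
  }
  ifelse(Fx >= 0, 1, -1)
}

# Toy two-class problem: points inside a circle are labeled 1, others -1
set.seed(1)
x <- matrix(runif(400, -1, 1), ncol = 2)
y <- ifelse(x[, 1]^2 + x[, 2]^2 < 0.5, 1, -1)
model <- discrete_adaboost(x, y, M = 30)
train_acc <- mean(predict_adaboost(model, x) == y)
stump_acc <- mean(predict_stump(model$stumps[[1]], x) == y)
```

Comparing `train_acc` with `stump_acc` shows how much the weighted vote improves on a single stump for this toy problem.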

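Implementation

AdaBoost is provided by the ada package. This section describes how to install and use it in the R environment.

Type the following commands in the R console to install and load the ada package: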

install.packages("ada")
library("rpart")
library("ada")

The function used to run the AdaBoost algorithm is:

ada(x, y, test.x, test.y=NULL, loss=c("exponential","logistic"), type=c("discrete","real","gentle"), iter=50, nu=0.1, bag.frac=0.5, model.coef=TRUE, bag.shift=FALSE, max.iter=20, delta=10^(-10), verbose=FALSE, ..., na.action=na.rpart)

The arguments are:

x: matrix of descriptors.

y: vector of responses. ‘y’ may have only two unique values.

test.x: testing matrix of descriptors (optional)

test.y: vector of testing responses (optional)

loss: loss="exponential", "ada","e" or any variation corresponds to the default boosting
under exponential loss. loss="logistic","l2","l" provides boosting under logistic
loss.

type: type of boosting algorithm to perform. “discrete” performs discrete Boosting
(default). “real” performs Real Boost. “gentle” performs Gentle Boost.

iter: number of boosting iterations to perform. Default = 50.

nu: shrinkage parameter for boosting. Default = 0.1.

bag.frac: sampling fraction for samples taken out-of-bag. This allows one to use random
permutation which improves performance.

model.coef: flag to use stageweights in boosting. If FALSE then the procedure corresponds
to epsilon-boosting.

bag.shift: flag to determine whether the stageweights should go to one as nu goes to zero.
This only makes sense if bag.frac is small. The rationale behind this parameter is discussed in (Culp et al., 2006).

max.iter: number of iterations to perform in the newton step to determine the coefficient.

delta: tolerance for convergence of the newton step to determine the coefficient.

verbose: print the number of iterations necessary for convergence of a coefficient.

formula: a symbolic description of the model to be fit.

data: an optional data frame containing the variables in the model.

subset: an optional vector specifying a subset of observations to be used in the fitting
process.

na.action: a function that indicates how to process ‘NA’ values. Default=na.rpart.

...: arguments passed to rpart.control. For stumps, use rpart.control(maxdepth=1,cp=-1,minsplit=0,xval=0).
maxdepth controls the depth of trees, and cp controls the complexity of trees. The priors should also be
fixed through the parms argument as discussed in the second reference.
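As a hedged illustration of how these arguments fit together, the sketch below boosts stumps through the x/y interface. The toy data and the names `x`, `y`, `stump_control` and `fit` are hypothetical, and the example assumes the ada and rpart packages are installed; the stump settings are the ones quoted in the description of `...` above.

```r
library(rpart)
library(ada)

# Hypothetical toy data: three numeric descriptors and a two-class response
set.seed(1)
x <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
y <- ifelse(x$x1 + x$x2 > 0, 1, -1)

# Stumps: trees of depth 1, as suggested in the argument description
stump_control <- rpart.control(maxdepth = 1, cp = -1, minsplit = 0, xval = 0)

# Discrete AdaBoost under exponential loss with 50 iterations and shrinkage 0.1
fit <- ada(x, y, loss = "exponential", type = "discrete",
           iter = 50, nu = 0.1, control = stump_control)
print(fit)
```

Smaller values of `nu` shrink each stage more, so they are usually paired with a larger `iter`.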

Type the following commands to display the results of the algorithm:

summary(AdaObject)
varplot(VariableImportanceObject)

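When using the “ada(x,y)” form: the x data can be in the form of data.frame or as.matrix. The y data can be in the form of data.frame, as.factor, as.matrix, as.array, or as.table. Missing values must be removed from the data before fitting.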

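When using the “ada(y~.)” form: the data must be in a data frame. The response can have factor or numeric values. Missing values may be present in the descriptor data as long as na.action is set to any option other than na.pass.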
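Because the x/y form requires complete data, rows with missing values have to be dropped first. A minimal base R sketch, using hypothetical toy values and variable names, could look like this:

```r
# Hypothetical toy descriptors with one missing value
x <- data.frame(x1 = c(0.2, 1.5, NA, 3.1), x2 = c(1, 0, 1, 0))
y <- c(-1, 1, 1, -1)

# TRUE for rows that have no NA in any descriptor or in the response
keep <- complete.cases(x, y)

x_clean <- x[keep, , drop = FALSE]
y_clean <- y[keep]
```

The cleaned `x_clean` and `y_clean` can then be passed to the x/y form, while the formula form can leave the handling of missing descriptors to `na.action`.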

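After the model is fit, “ada” prints a summary of the function call, the method used for boosting, the number of iterations, the final confusion matrix (observed versus predicted classification; the class labels are the same as in the response), the error for the training set, and the testing, training, and kappa estimates of the appropriate number of iterations.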

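A summary of this information can also be obtained with the command “print(x)”. The corresponding functions are listed below (use help with summary.ada, predict.ada, ... varplot for additional information on these commands):

summary : function to print a summary of the original function call, the method used for boosting, the number of iterations, the final confusion matrix, the accuracy, and the kappa statistic (a measure of agreement between the observed and predicted classifications). ‘summary’ can be used for training, testing, or validation data.

predict : function to predict the response for any data set (training, testing, or validation).

plot : function to plot the performance of the algorithm across boosting iterations. The default plot is iteration number (x-axis) versus prediction error (y-axis) for the data set used to build the model. The function can also produce an error plot for an external test set and kappa plots for the training and test sets at the same time.

pairs : function to produce pairwise plots of the descriptors. The descriptors are arranged by decreasing frequency of selection by boosting (upper left = most frequently selected). The color of a marker indicates class membership; the size of a marker indicates the predicted class probability. The larger the marker, the higher the probability of classification.

varplot : plot of variables ordered by the variable importance measure (based on improvement).

addtest : adds a test data set to the ada object, so that the test errors only have to be computed once.

update : adds more trees to the ada object.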
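A few of these helpers are sketched below on hypothetical toy data. The example assumes the ada and rpart packages are installed and only uses calls that also appear in the case study of this page (`ada`, `addtest`, `summary`, `varplot`) plus `predict` with its default settings; the data frame `toy` and its columns are invented for illustration.

```r
library(rpart)
library(ada)

# Hypothetical toy data: three numeric descriptors and a two-class response y
set.seed(2)
toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
toy$y <- ifelse(toy$x1 - toy$x2 > 0, 1, -1)
train <- toy[1:200, ]
test <- toy[201:300, ]

fit <- ada(y ~ ., data = train, iter = 20, type = "discrete")

pred <- predict(fit, newdata = test[, -4])   # predicted class for each test row
fit <- addtest(fit, test[, -4], test[, 4])   # attach the test set to the fit
summary(fit)                                 # accuracy and kappa per data set
varplot(fit)                                 # variable importance plot
```

Attaching the test set once with `addtest` means later summaries and plots can report test performance without recomputing it.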

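Case Study

Scenario

The data set contains information about compounds used in drug discovery. Specifically, it consists of 5631 compounds on which an in-house solubility screen (the ability of a compound to dissolve in a water/solvent mixture) was performed. Based on this screen, compounds were categorized as either insoluble (n=3493) or soluble (n=2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. One of these descriptors has missing values for approximately 14% (n=787) of the observations. The objective of the analysis is to model the relationship between the structural descriptors and the solubility class. The data set will be called soldat.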

Data

Input Format

x1 a numeric vector
x2 a numeric vector
x3 a numeric vector
x4 a numeric vector
x5 a numeric vector
x6 a numeric vector
x7 a numeric vector
x8 a numeric vector
x9 a numeric vector
x10 a numeric vector
x11 a numeric vector
x12 a numeric vector
x13 a numeric vector
x14 a numeric vector
x15 a numeric vector
x16 a numeric vector
x17 a numeric vector
x18 a numeric vector
x19 a numeric vector
x20 a numeric vector
.
.
.
x72 a numeric vector with missing data
y a numeric vector

Execution

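library("rpart")
library("ada")

# Load the solubility data set shipped with the ada package
data("soldat")
n <- nrow(soldat)
set.seed(100)
# Shuffle the row indices, then split roughly 50/30/20 into train/test/validation
ind <- sample(1:n)
trainval <- ceiling(n * .5)
testval <- ceiling(n * .3)
train <- soldat[ind[1:trainval],]
test <- soldat[ind[(trainval + 1):(trainval + testval)],]
valid <- soldat[ind[(trainval + testval + 1):n],]

# Fit Gentle AdaBoost with 70 iterations; column 73 holds the response y
control <- rpart.control(cp = -1, maxdepth = 14, maxcompete = 1, xval = 0)
gen1 <- ada(y~., data = train, test.x = test[,-73], test.y = test[,73], type = "gentle", control = control, iter = 70)
# Attach the validation set as a second test set
gen1 <- addtest(gen1, valid[,-73], valid[,73])
summary(gen1)
varplot(gen1)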
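To make the split concrete, the following base R sketch recomputes the three partition sizes from the number of compounds stated in the scenario (5631). The name `validval` is introduced here for illustration and does not appear in the original script.

```r
n <- 5631                              # number of compounds in soldat
trainval <- ceiling(n * .5)            # rows used for training
testval <- ceiling(n * .3)             # rows used for testing
validval <- n - trainval - testval     # remaining rows form the validation set
sizes <- c(train = trainval, test = testval, valid = validval)
print(sizes)
```

The three sizes add up to n, so every compound lands in exactly one partition.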

Output

Loss: exponential Method: gentle Iteration: 70
Training Results
Accuracy: 0.987 Kappa: 0.972
Testing Results
Accuracy: 0.765 Kappa: 0.487

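Analysis

The testing accuracies are reported in the order in which the data sets were entered, so the accuracy is 0.765 on the test set and 0.781 on the validation set. For this type of early drug discovery data, the Gentle AdaBoost algorithm performs adequately, with a test set accuracy of 76.5% (kappa approximately 0.5). To better understand the relationship between the descriptors and the response, the varplot function was used.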
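For readers unfamiliar with the kappa values reported next to the accuracy, the sketch below computes both quantities from a confusion matrix in base R. It assumes the statistic meant is Cohen's kappa, and the counts are hypothetical; they are not taken from the soldat results.

```r
# Hypothetical 2x2 confusion matrix (rows: observed class, columns: predicted)
conf_mat <- matrix(c(40, 10,
                      5, 45), nrow = 2, byrow = TRUE,
                   dimnames = list(observed = c("-1", "1"),
                                   predicted = c("-1", "1")))

n_obs <- sum(conf_mat)
accuracy <- sum(diag(conf_mat)) / n_obs                           # observed agreement
expected <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n_obs^2  # chance agreement
kappa_hat <- (accuracy - expected) / (1 - expected)               # Cohen's kappa
```

Kappa rescales accuracy so that 0 corresponds to chance-level agreement and 1 to perfect agreement, which is why it can sit well below the raw accuracy when the classes are unbalanced.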

References

Meira Jr., W.; Zaki, M. Fundamentals of Data Mining Algorithms. [1]
CBA R package. [2]
Friedman, J.; Hastie, T.; Tibshirani, R. Additive Logistic Regression: A Statistical View of Boosting.