Caret Package - Tuning(Grid Search, Random Search) in R


Killer T Cell 2020. 8. 29. 21:50

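R has many great packages, and dplyr is a prime example. In my opinion, though, the standout is caret. The name is short for "Classification And REgression Training", and the package makes classification and regression very convenient. I would even say it makes machine learning in R far easier than in Python. Here is why caret is so convenient: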

1. Efficient train/test splitting: createDataPartition()

2. Simple preprocessing: preProcess()

3. Easy control over model training: trainControl()

4. Built-in tuning plus convenient extra tuning: tuneGrid, tuneLength, etc.

5. Support for most models
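Items 1 and 2 can be sketched in a few lines. This is a minimal illustration rather than the code used later in this post: the seed and the variable names (train_x, pp_fit, and so on) are my own assumptions, and the scaling is learned from the training rows only.

```r
library(caret)

set.seed(42)  # arbitrary seed, only for reproducibility

# 1. Stratified 70/30 split on the outcome
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]

# 2. Learn centering/scaling from the training rows only,
#    then apply the same transformation to both sets
pp_fit    <- preProcess(train_x, method = c("center", "scale"))
train_std <- predict(pp_fit, train_x)
test_std  <- predict(pp_fit, test_x)

round(colMeans(train_std), 10)  # training columns are centered at 0
apply(train_std, 2, sd)         # and have unit standard deviation
```

When preProcess is passed to train() instead, as in the rest of this post, caret applies the same idea inside each resample.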

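This post focuses on item 4. As a baseline, I first built a model that predicts Species in the iris data. The predictors are standardized so that they are weighted equally, and the model is evaluated with 5-fold cross-validation repeated 5 times.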

library(caret)

irisdata <- iris

# Standardize predictors; evaluate with 5-fold CV repeated 5 times
pp <- c("center", "scale")
ctl <- trainControl(method="repeatedcv", number=5, repeats=5, allowParallel=TRUE, savePredictions="final")

# Test-set accuracy for each of 5 random train/test splits
Ac.ranger <- numeric(5)

for(i in 1:5){
  print(i)
  idx <- createDataPartition(irisdata$Species, p=0.7, list=FALSE)
  traindata <- irisdata[idx, ]
  testdata <- irisdata[-idx, ]
  fit.ranger <- train(Species~., data=traindata, method="ranger", trControl=ctl, preProcess=pp)
  # postResample() returns Accuracy and Kappa; keep only Accuracy
  Ac.ranger[i] <- postResample(pred=predict(fit.ranger, testdata), obs=testdata$Species)["Accuracy"]
}

fit.ranger
mean(Ac.ranger)

The performance is pretty good.

> mean(Ac.ranger)
[1] 0.9511111
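A note on postResample(): for a classification outcome it returns a named vector with both Accuracy and Kappa, so it helps to index by name when only accuracy is wanted. The toy factors below are made up for illustration.

```r
library(caret)

# Hypothetical predictions and observations, not from the iris model
pred <- factor(c("a", "b", "a", "b"), levels = c("a", "b"))
obs  <- factor(c("a", "b", "b", "b"), levels = c("a", "b"))

res <- postResample(pred = pred, obs = obs)
names(res)       # "Accuracy" "Kappa"
res["Accuracy"]  # 3 of 4 predictions match, so 0.75
```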

What matters here is the printed output of fit.ranger. In my case it looked like this.

> fit.ranger
Random Forest 

105 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

Pre-processing: centered (4), scaled (4) 
Resampling: Cross-Validated (5 fold, repeated 5 times) 
Summary of sample sizes: 84, 84, 84, 84, 84, 84, ... 
Resampling results across tuning parameters:

  mtry  splitrule   Accuracy   Kappa    
  2     gini        0.9561905  0.9342857
  2     extratrees  0.9580952  0.9371429
  3     gini        0.9600000  0.9400000
  3     extratrees  0.9580952  0.9371429
  4     gini        0.9600000  0.9400000
  4     extratrees  0.9600000  0.9400000

Tuning parameter 'min.node.size' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 3, splitrule = gini and min.node.size = 1.

To check the hyperparameters of the ranger model, you can use the following page.

https://topepo.github.io/caret/available-models.html


It helpfully lists mtry, splitrule, and min.node.size.

(Figure: the ranger entry on the Available Models page)

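caret tunes models by default. In the code above, as the fit.ranger output shows, it tried mtry values of 2, 3, and 4, tried gini and extratrees for splitrule, held min.node.size fixed at 1, and chose the combination of hyperparameters with the highest Accuracy (for regression the default metric is RMSE, where the lowest value wins). If you want a different evaluation metric, use the metric argument of train(). From here, let's tune further. There is more than one way to do it.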
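As a side note on the metric argument mentioned above, here is a minimal sketch that selects the model by Kappa instead of Accuracy. The names ctl_cv and fit_kappa and the seed are my own choices, it uses plain 5-fold CV to keep it quick, and it assumes the ranger package is installed.

```r
library(caret)

set.seed(42)  # arbitrary seed
ctl_cv <- trainControl(method = "cv", number = 5)

# Same model family as in this post, but selected by Kappa
fit_kappa <- train(Species ~ ., data = iris, method = "ranger",
                   trControl = ctl_cv, metric = "Kappa")

fit_kappa$metric    # the metric used for model selection
fit_kappa$bestTune  # combination with the largest resampled Kappa
```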

1. Grid Search

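This is an exhaustive search in which you specify the hyperparameter values yourself. I declared ranger_tune_grid and passed it to the tuneGrid argument of train(). Because every combination in the grid is evaluated, it is the most thorough method within the range you specify. For the same reason it takes a lot of time, so you need to choose a sensible range, and that requires understanding the machine learning model reasonably well.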

ranger_tune_grid <- expand.grid(mtry=1:4, splitrule=c("extratrees", "gini"), min.node.size=1:5)

fit.ranger <- train(Species~., data=traindata, method="ranger", trControl=ctl, preProcess=pp, tuneGrid=ranger_tune_grid)
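To get a feel for why grid search gets expensive, it helps to count the fits before running anything. The sketch below uses only base R; the variable names are mine, and the resample count assumes the 5-fold, 5-repeat setting used in this post.

```r
grid <- expand.grid(mtry = 1:4,
                    splitrule = c("extratrees", "gini"),
                    min.node.size = 1:5)

n_combos    <- nrow(grid)               # 4 * 2 * 5 = 40 combinations
n_resamples <- 5 * 5                    # 5 folds repeated 5 times
n_fits      <- n_combos * n_resamples   # 1000 fits during resampling

c(combinations = n_combos, resamples = n_resamples, fits = n_fits)
```

On top of these, caret fits one final model on the full training set with the chosen combination. Adding one more value to any parameter multiplies the total, which is why the range deserves some thought.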

The mean accuracy went up slightly. ~~To know whether the difference is statistically significant we would need an analysis of variance, but let's just move on.~~

  mtry  splitrule   min.node.size  Accuracy   Kappa    
  1     extratrees  1              0.9447619  0.9171429
  1     extratrees  2              0.9447619  0.9171429
  1     extratrees  3              0.9466667  0.9200000
  1     extratrees  4              0.9466667  0.9200000
  1     extratrees  5              0.9466667  0.9200000
  1     gini        1              0.9371429  0.9057143
  1     gini        2              0.9352381  0.9028571
  1     gini        3              0.9390476  0.9085714
  1     gini        4              0.9409524  0.9114286
  1     gini        5              0.9409524  0.9114286
  2     extratrees  1              0.9466667  0.9200000
  2     extratrees  2              0.9466667  0.9200000
  2     extratrees  3              0.9466667  0.9200000
  2     extratrees  4              0.9466667  0.9200000
  2     extratrees  5              0.9466667  0.9200000
  2     gini        1              0.9409524  0.9114286
  2     gini        2              0.9371429  0.9057143
  2     gini        3              0.9409524  0.9114286
  2     gini        4              0.9447619  0.9171429
  2     gini        5              0.9466667  0.9200000
  3     extratrees  1              0.9466667  0.9200000
  3     extratrees  2              0.9466667  0.9200000
  3     extratrees  3              0.9466667  0.9200000
  3     extratrees  4              0.9466667  0.9200000
  3     extratrees  5              0.9466667  0.9200000
  3     gini        1              0.9409524  0.9114286
  3     gini        2              0.9371429  0.9057143
  3     gini        3              0.9428571  0.9142857
  3     gini        4              0.9466667  0.9200000
  3     gini        5              0.9466667  0.9200000

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 3.

> mean(Ac.ranger)
[1] 0.96

2. Random Search

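This method offsets the long run time of Grid Search. As the name suggests, it picks parameter values at random. It is especially useful when the number of possible combinations in a grid search would be too large. However, for some models such a small search can work against optimizations in the algorithm: several caret models can evaluate many tuning values from a single fit when a regular grid is used, and random search loses that advantage. For that reason, random search is best avoided for the algorithms listed below.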

ada, AdaBag, AdaBoost.M1, bagEarth, blackboost, blasso, BstLm, bstSm, bstTree, C5.0, C5.0Cost, cubist, earth, enet, foba, gamboost, gbm, glmboost, glmnet, kernelpls, lars, lars2, lasso, lda2, leapBackward, leapForward, leapSeq, LogitBoost, pam, partDSA, pcr, PenalizedLDA, pls, relaxo, rfRules, rotationForest, rotationForestCp, rpart, rpart2, rpartCost, simpls, spikeslab, superpc, widekernelpls, xgbDART, xgbTree.

The method is simple. Set the search argument of trainControl() to "random", then specify the tuneLength argument in train().

ctl <- trainControl(method="repeatedcv", number=5, repeats=5, allowParallel=T, savePredictions="final", search="random")

fit.ranger <- train(Species~., data=traindata, method="ranger", trControl=ctl, preProcess=pp, tuneLength=10)

The selected parameters are rather extreme, which looks a bit odd... but in any case the performance improved.

  min.node.size  mtry  splitrule   Accuracy   Kappa    
   3             4     extratrees  0.9314286  0.8971429
   4             2     gini        0.9333333  0.9000000
   6             4     extratrees  0.9314286  0.8971429
   6             4     gini        0.9390476  0.9085714
  12             2     extratrees  0.9371429  0.9057143
  13             2     extratrees  0.9333333  0.9000000
  15             1     extratrees  0.9276190  0.8914286
  16             3     gini        0.9371429  0.9057143
  16             4     gini        0.9409524  0.9114286
  17             2     gini        0.9371429  0.9057143

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 4, splitrule = gini and min.node.size = 16.

> mean(Ac.ranger)
[1] 0.9644444
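Instead of reading the printed table, the sampled combinations and the selected one can be pulled out of the train object directly. This is a small sketch with my own names (ctl_random, fit_random), an arbitrary seed, plain 5-fold CV, and tuneLength = 5 to keep it fast; it assumes the ranger package is installed.

```r
library(caret)

set.seed(42)  # arbitrary seed
ctl_random <- trainControl(method = "cv", number = 5, search = "random")

fit_random <- train(Species ~ ., data = iris, method = "ranger",
                    trControl = ctl_random, tuneLength = 5)

fit_random$results   # one row per sampled combination, with Accuracy and Kappa
fit_random$bestTune  # the combination that was selected
```

With random search, tuneLength is the number of random combinations to try, so raising it trades run time for a wider search.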

I plotted the grid search results.

plot(fit.ranger)

I hope this helps as a reference when you write your own code.
