Feature Selection with the Caret R Package

Link: https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/

Rank Features By Importance

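Feature importance can be estimated by building a model. Some methods, such as decision trees, have a built-in mechanism for reporting the importance of each feature. For other methods, importance can be estimated from a ROC curve analysis carried out for each feature.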
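The two routes mentioned above can be sketched side by side. This is a minimal illustration rather than code from the original article: it assumes the rpart package is installed, and the names tree_fit and roc_importance are made up for this example. The tree reports importance from its own splits, while caret's filterVarImp computes an area under the ROC curve for each predictor on its own.

```r
library(mlbench)
library(caret)
library(rpart)

data(PimaIndiansDiabetes)
set.seed(7)

# Route 1: model-based importance taken from a fitted decision tree.
# rpart stores it as a named numeric vector on the fitted object.
tree_fit <- rpart(diabetes ~ ., data = PimaIndiansDiabetes)
print(tree_fit$variable.importance)

# Route 2: model-free importance from a per-feature ROC analysis.
# filterVarImp returns one row per predictor and one column per class.
roc_importance <- filterVarImp(x = PimaIndiansDiabetes[, 1:8],
                               y = PimaIndiansDiabetes$diabetes)

# Sort the predictors by the first class column, highest AUC first.
roc_ranked <- roc_importance[order(-roc_importance[, 1]), , drop = FALSE]
print(roc_ranked)
```

The two rankings need not agree, since the tree accounts for interactions between features and the ROC analysis looks at one feature at a time.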

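The example below loads the Pima Indians Diabetes dataset and trains a Learning Vector Quantization (LVQ) model. varImp then estimates the importance of each variable, so the features can be ranked by importance. The code is as follows: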

# ensure results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

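Going a step further, Recursive Feature Elimination (RFE) can select a feature subset automatically. The example below uses random forest functions to evaluate subsets of every size from 1 to 8 under 10-fold cross-validation. The selected subset should still be checked against statistical evidence and practical requirements before it is accepted.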

# ensure the results are repeatable
set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the data
data(PimaIndiansDiabetes)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
# list the chosen features
predictors(results)
# plot the results
plot(results, type=c("g", "o"))
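One way to check whether the selected subset is reasonable is to look at the resampled accuracy for every subset size and ask whether a smaller subset performs almost as well. The sketch below reruns the same RFE so that it is self-contained; the 1% tolerance and the name compact_size are assumptions chosen for illustration, not values from the original article. It assumes the randomForest package is installed.

```r
library(mlbench)
library(caret)

data(PimaIndiansDiabetes)
set.seed(7)

# Same RFE setup as above: random forest functions, 10-fold CV.
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
results <- rfe(PimaIndiansDiabetes[, 1:8], PimaIndiansDiabetes[, 9],
               sizes = 1:8, rfeControl = control)

# Resampled performance for each subset size, with standard deviations.
print(results$results[, c("Variables", "Accuracy", "AccuracySD")])

# Subset size that caret picked as best.
print(results$optsize)

# Smallest subset whose accuracy is within 1% of the best one.
# The tolerance is a judgment call, not a rule.
compact_size <- pickSizeTolerance(results$results, metric = "Accuracy",
                                  tol = 1, maximize = TRUE)
print(compact_size)
```

If compact_size is well below results$optsize, the extra features add little accuracy, and domain knowledge can decide whether they are worth keeping.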

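A note on the correlation analysis in the Remove Redundant Features section of the original article: I am not sure what to do with the features once the analysis is done. It neither ranks the features nor returns a feature subset, so how to use this information remains unclear.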
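One possible reading, offered here as an assumption rather than as the original article's stated procedure, is that caret's findCorrelation returns the column indices it recommends dropping, so the feature subset is simply whatever remains. The 0.5 cutoff and the names predictors_df, redundant and reduced are illustrative choices.

```r
library(mlbench)
library(caret)

data(PimaIndiansDiabetes)

# Predictors only; column 9 is the class label.
predictors_df <- PimaIndiansDiabetes[, 1:8]

# Pairwise correlations between the predictors.
correlation_matrix <- cor(predictors_df)

# Indices of columns flagged for removal. The cutoff is an assumption;
# a higher value flags fewer columns.
highly_correlated <- findCorrelation(correlation_matrix, cutoff = 0.5)
redundant <- names(predictors_df)[highly_correlated]
print(redundant)

# Drop the flagged columns, guarding against an empty result,
# because indexing with -integer(0) would select no columns at all.
reduced <- if (length(highly_correlated) > 0) {
  predictors_df[, -highly_correlated, drop = FALSE]
} else {
  predictors_df
}
print(names(reduced))
```

Under this reading the correlation step acts as a filter applied before modeling, and the reduced data frame can be passed to train or rfe in place of the full predictor set.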
