Feature Selection
Why
1. Curse of dimensionality: with many features relative to samples, models generalize poorly and training becomes expensive.
Method
1. Filter: score each feature independently and keep the highest-scoring ones. (criterion: information content)
1.1 Chi-square test
1.2 Correlation test
1.3 Information gain
Comments: modest computational cost, but the selected features are usually not as good as those from wrapper methods. A chi-square filter sketch follows this list.
2. Wrapper: search over feature subsets and evaluate each subset by training the model on it. (criterion: model accuracy)
2.1 Recursive feature elimination with cross-validation (see the caret::rfe sketch under "In Practice" 4.3)
Comments: selected features are better, but computation cost is higher and the subset search can overfit.
3. Embedded: the model measures each feature's contribution as it is trained.
3.1 Lasso: the L1 penalty shrinks the coefficients of uninformative features to exactly zero (see the glmnet sketch after this list).
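A minimal filter sketch using base R's chisq.test; the data frame, its features f1/f2, and the label y are hypothetical stand-ins. Features are ranked by their chi-square statistic against the target.

# hypothetical toy data: two categorical features and a class label
set.seed(1)
df <- data.frame(
  f1 = sample(c("a", "b"), 100, replace = TRUE),
  f2 = sample(c("x", "y", "z"), 100, replace = TRUE),
  y  = sample(c("pos", "neg"), 100, replace = TRUE)
)
# score each feature by its chi-square statistic against the target
scores <- sapply(setdiff(names(df), "y"), function(f) {
  chisq.test(table(df[[f]], df$y))$statistic
})
sort(scores, decreasing = TRUE)   # keep the top-ranked features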
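And a minimal embedded-selection sketch with the Lasso via the glmnet package (assumed installed); simulated data stands in for a real design matrix, and any feature whose fitted coefficient is exactly zero is dropped.

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)   # 10 candidate features
y <- x[, 1] - 2 * x[, 3] + rnorm(100)     # only features 1 and 3 matter
fit <- cv.glmnet(x, y, alpha = 1)         # alpha = 1 -> Lasso (L1 penalty)
coef(fit, s = "lambda.min")               # zero coefficients = features dropped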
In Practice
1. XGBoost
xgb.importance(model = bst)   # bst: a previously trained xgb.Booster
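A self-contained sketch, assuming the classic xgboost R interface and its bundled agaricus mushroom data; the importance table reports Gain, Cover, and Frequency per feature.

library(xgboost)
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
# train a small booster, then ask which features it actually used
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 10)
xgb.importance(model = bst)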
2. Decision Tree
Rank features by information gain (entropy reduction) at each split; see the rpart sketch below.
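A minimal sketch with rpart on the built-in iris data (the choice of rpart is an assumption); note that variable.importance sums each feature's split improvement, which with split = "information" is an entropy-based proxy for information gain rather than a standalone per-feature score.

library(rpart)
# grow a tree with entropy ("information") splits on the built-in iris data
fit <- rpart(Species ~ ., data = iris,
             parms = list(split = "information"))
fit$variable.importance   # summed split improvement per feature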
3. Random Forest
library(party)
# fit a conditional-inference random forest
cf1 <- cforest(ozone_reading ~ ., data = inputData,
               control = cforest_unbiased(mtry = 2, ntree = 50))
varimp(cf1)   # variable importance
4. Case study using caret
4.1 cor()
Compute the pairwise correlation matrix and remove one feature from each pair with absolute correlation >= 0.75.
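A sketch using caret::findCorrelation with the 0.75 cutoff from the note above; the iris measurements stand in for the dataset.

library(caret)
# drop one feature from each highly correlated pair (|r| >= 0.75)
num_data   <- iris[, 1:4]                 # stand-in numeric data
cor_matrix <- cor(num_data)
to_drop    <- findCorrelation(cor_matrix, cutoff = 0.75)
names(num_data)[to_drop]                  # flagged features
reduced    <- num_data[, -to_drop]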
4.2 varImp()
Estimate each variable's importance; the importance can be scored with different models, random forest being a common choice.
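A sketch of caret's varImp on a random-forest model (assumes the randomForest package is installed); any caret method could be substituted.

library(caret)
set.seed(7)
# train a random forest via caret, then rank the features it relied on
ctrl  <- trainControl(method = "cv", number = 5)
model <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)
varImp(model)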
4.3 Recursive Feature Elimination
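A sketch of recursive feature elimination with caret::rfe and 5-fold cross-validation; rfFuncs scores subsets with a random forest, so the randomForest package is assumed to be installed.

library(caret)
set.seed(7)
# repeatedly drop the weakest features and keep the subset with the best CV accuracy
ctrl    <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
results <- rfe(iris[, 1:4], iris$Species, sizes = 1:3, rfeControl = ctrl)
predictors(results)   # the selected feature subset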