title: "多重共线性" categories: Statistics updated: comments: true
Consider the linear regression
$$ y = X\beta + \varepsilon, $$
where $X$ is an $n\times p$ matrix, which can be read as $n$ samples with $p$ features (the covariates). When the columns of $X$ are linearly dependent, $X'X$ has no inverse and the parameter estimates break down. We call the situation where the columns of $X$ are linearly dependent, or nearly so, multicollinearity. Since ordinary least squares estimation uses the inverse of $X'X$, multicollinearity makes the estimates very unstable, for example producing extremely large coefficient estimates.
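To make this concrete, here is a minimal numerical sketch (the columns and noise levels are made up for illustration): a near-duplicate column makes $X'X$ nearly singular, and the normal-equation estimates blow up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# x2 is (almost) a copy of x1, so the design matrix is nearly collinear.
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)        # the true signal only uses x1

# OLS via the normal equations: beta = (X'X)^{-1} X'y.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                     # the two slope estimates are huge and nearly cancel
print(np.linalg.cond(X.T @ X))  # enormous condition number
```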
<!-- more -->
One way to diagnose this is to look at the singular values of $X$. Write its SVD as $X = U\Sigma V'$ (see the post 用 SVD 进行图像压缩 for details). The columns of the principal components $Z = XV$ are mutually orthogonal, and $Z'Z = \Sigma^2$.
Since
$$ Z_i = \sum_{k=1}^p v_{ki} X_k, $$
and $Z'Z = \Sigma^2$ gives $\|Z_i\|^2 = \sigma_i^2$, a small $\sigma_i$ means $Z_i$ is close to zero, i.e. the columns $X_k$ are nearly linearly dependent. (In fact, this is exactly what makes $X'X$ ill-conditioned and its inverse unstable.)
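As a sketch of this diagnostic (assuming numpy and an invented design matrix in which one column is almost a linear combination of two others), the smallest singular value is nearly zero and its right-singular vector reads off the near-linear relation among the columns:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 0] - 2 * X[:, 1] + 1e-4 * rng.normal(size=n)  # nearly dependent column

# Thin SVD: X = U diag(sigma) V'. Small singular values flag near-collinearity.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
print(sigma)          # the last singular value is tiny

# The right-singular vector for the smallest sigma gives the near-linear relation
# sum_k v_{ki} X_k ~ 0; here it is proportional (up to sign) to (1, -2, 0, -1).
print(Vt[-1])
print(np.linalg.norm(X @ Vt[-1]))  # matches the smallest singular value, so ||Z_i|| ~ 0
```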
That said, the more common diagnostics are probably the VIF (variance inflation factor) and the like, though I am not particularly interested in them.
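Still, for completeness, a sketch of the VIF computation using statsmodels (the data here are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others.
# Values far above ~5-10 are usually read as problematic collinearity.
for j in range(1, X.shape[1]):               # skip the constant
    print(j, variance_inflation_factor(X, j))
```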
Finally, consider the actual impact of multicollinearity. It doesn't change the predictive power of the model (at least, on the training data) but it does screw with our coefficient estimates. In most ML applications, we don't care about coefficients themselves, just the loss of our model predictions, so in that sense, checking VIF doesn't actually answer a consequential question. (But if a slight change in the data causes a huge fluctuation in coefficients [a classic symptom of multicollinearity], it may also change predictions, in which case we do care -- but all of this [we hope!] is characterized when we perform cross-validation, which is a part of the modeling process anyway.) A regression is more easily interpreted, but interpretation might not be the most important goal for some tasks.
See Why is multicollinearity not checked in modern statistics/machine learning - Cross Validated
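A quick way to see the point about coefficients versus predictions, as a sketch with invented data: refit on bootstrap resamples and compare how much the coefficients move with how much the predictions move.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)          # nearly collinear pair
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

X_probe = X[:5]                               # a few fixed points to compare predictions

# Refit on bootstrap resamples: the two coefficients swing wildly (only their sum is
# well determined), while the predictions at the probe points barely move.
for _ in range(3):
    idx = rng.integers(0, n, size=n)
    model = LinearRegression().fit(X[idx], y[idx])
    print(np.round(model.coef_, 1), np.round(model.predict(X_probe), 2))
```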
Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators. The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature. This technique benefits from being model agnostic and can be calculated many times with different permutations of the feature.
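A minimal usage sketch with scikit-learn's `permutation_importance` (the dataset and model below are arbitrary choices for illustration, not part of the quoted documentation):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, n_informative=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times on held-out data and record the score drop.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)   # average decrease in R^2 per feature
print(result.importances_std)
```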
Tree-based models provide an alternative measure of feature importances based on the mean decrease in impurity (MDI). Impurity is quantified by the splitting criterion of the decision trees (Gini, Entropy or Mean Squared Error). However, this method can give high importance to features that may not be predictive on unseen data when the model is overfitting. Permutation-based feature importance, on the other hand, avoids this issue, since it can be computed on unseen data.
Furthermore, impurity-based feature importance for trees is strongly biased and favors high cardinality features (typically numerical features) over low cardinality features such as binary features or categorical variables with a small number of possible categories.
Permutation-based feature importances do not exhibit such a bias. Additionally, the permutation feature importance may be computed with any performance metric on the model predictions and can be used to analyze any model class (not just tree-based models).
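A small sketch of this contrast, using made-up data with one genuinely predictive binary feature and one high-cardinality noise feature: MDI tends to credit the noise feature noticeably, while permutation importance on held-out data gives it roughly nothing.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
informative = rng.integers(0, 2, size=n)     # binary, genuinely predictive
random_num = rng.normal(size=n)              # high-cardinality pure noise
X = np.column_stack([informative, random_num])
y = informative ^ (rng.random(n) < 0.1)      # label = binary feature with 10% flips

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# MDI typically gives the noise column a sizeable share (it offers many split points
# that fit the flipped labels during training)...
print(model.feature_importances_)
# ...while permutation importance on held-out data leaves it near zero.
print(permutation_importance(model, X_te, y_te, n_repeats=10,
                             random_state=0).importances_mean)
```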
When two features are correlated and one of the features is permuted, the model will still have access to the feature through its correlated feature. This will result in a lower importance value for both features, even though they might actually be important.
See 4.2. Permutation feature importance — scikit-learn 0.23.2 documentation
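A sketch of this effect with invented data, where two columns carry nearly the same signal; `max_features=1` is a deliberate choice here so the forest actually spreads its splits over both copies rather than always preferring one of them:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
signal = rng.normal(size=n)
X = np.column_stack([signal,
                     signal + 0.01 * rng.normal(size=n),   # near-duplicate of column 0
                     rng.normal(size=n)])                   # unrelated noise column
y = signal + 0.1 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(max_features=1, random_state=0).fit(X_tr, y_tr)

# Permuting either of the two correlated columns alone hurts the score much less than
# the signal's full contribution (the other copy is still available), so both show a
# deflated importance; the noise column stays near zero.
print(permutation_importance(model, X_te, y_te, n_repeats=10,
                             random_state=0).importances_mean)
```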