The smaller the prior variance, the bigger its influence on the posterior.
Ridge (Prior distribution = Normal)
- A better compromise between bias and variance
- All predictors are kept in the model
Lasso (Prior distribution = Laplace)
- Sparsity achieved
- Does not necessarily yield good results in the presence of high collinearity
- Solution is not uniquely determined when the number of variables is greater than the number of subjects
- Tends to select only one variable among a group of predictors with high pairwise correlations
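The contrast above can be sketched on synthetic data (a minimal illustration using scikit-learn; the data, coefficients, and penalty strengths are made up for the demo): Ridge shrinks all coefficients but keeps every predictor, while Lasso drives some coefficients exactly to zero.

```python
# Ridge keeps all predictors; Lasso produces a sparse model.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
# assumption for the demo: only the first 3 predictors matter
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta + 0.1 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))  # all predictors kept
print("Lasso coefficients set to zero:", np.sum(lasso.coef_ == 0))  # sparsity achieved
```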
More complex models, able to learn the training sample better (lower bias), have higher variance, which leads to overfitting and high error when predicting future values of the output variable.
If the model depends on certain parameters (like α, λ for Lasso/Ridge) that are not supposed to be fit during the training phase, a separate validation sample can be used to select them (we pick the values that optimize model performance on the validation set).
Often the available dataset is too small, so splitting it into even smaller training, validation, and test sets can hurt model training, leading to noisy and unreliable models. In such cases cross-validation is often applied, performing not one but several random splits of the sample and averaging the model performance scores.
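The averaging described above can be sketched with k-fold cross-validation in scikit-learn (the dataset and model below are stand-ins chosen for the example):

```python
# 5-fold cross-validation: each fold serves once as the validation set,
# and the per-fold scores are averaged for a more stable estimate.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + 0.2 * rng.standard_normal(60)

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("per-fold R^2:", scores)
print("mean R^2 over folds:", scores.mean())
```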
Choice of λ or α in Model validation
Small values of λ or high values of α lead to a result close to OLS (identical to it if λ=0 or α=+∞)
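The limiting behavior in λ can be checked numerically (a sketch with scikit-learn's Ridge, where the penalty λ is the `alpha` constructor argument; the data are synthetic):

```python
# As the Ridge penalty goes to 0, the solution approaches OLS;
# as it grows large, the coefficients shrink toward zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(50)

ols = LinearRegression().fit(X, y)
ridge_tiny = Ridge(alpha=1e-8).fit(X, y)  # penalty ~ 0: essentially OLS
ridge_big = Ridge(alpha=1e4).fit(X, y)    # heavy penalty: strong shrinkage

print(np.allclose(ols.coef_, ridge_tiny.coef_, atol=1e-5))  # True
print(np.abs(ridge_big.coef_).max())                        # close to 0
```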
- Construct the covariance matrix
- Calculate its eigenvalues and eigenvectors
- Sort by eigenvalue (descending) to find the principal directions
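The steps above can be sketched with NumPy (a minimal version on synthetic data; a full PCA implementation would also project the data onto the leading eigenvectors):

```python
# PCA via eigendecomposition of the data covariance matrix.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.3])  # unequal variances

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 1) covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # 2) eigenvalues / eigenvectors
order = np.argsort(eigvals)[::-1]       # 3) sort by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("explained variance (sorted):", eigvals)
```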
- Covariance is invariant under translations
Rotations won’t change the eigenvalues of the data covariance matrix (although the eigenvectors will be rotated).
- Arbitrary scaling of the data will scale eigenvalues of the data covariance matrix by arbitrary amounts (and thus change e.g. the leading principal component).
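Both invariance properties can be verified directly (a sketch on synthetic 2-D data): rotating the data leaves the covariance eigenvalues unchanged, while rescaling a feature changes them.

```python
# Rotation preserves covariance eigenvalues; scaling does not.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 2)) * np.array([1.0, 2.0])

# sorted eigenvalues of the data covariance matrix
eig = lambda A: np.sort(np.linalg.eigvalsh(np.cov(A, rowvar=False)))

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(eig(X), eig(X @ R.T)))  # rotation: True, eigenvalues unchanged
scaled = X * np.array([10.0, 1.0])
print(np.allclose(eig(X), eig(scaled)))   # scaling: False, eigenvalues change
```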
Use idea of:
- Maximum variance formulation
- Minimum error formulation
Deal with non-linearity
Fisher’s linear discriminant
Also takes the within-class variance into account
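Fisher's criterion (maximizing between-class separation relative to within-class variance) can be sketched with scikit-learn's `LinearDiscriminantAnalysis`; the two-Gaussian dataset below is invented for the example:

```python
# Fisher's linear discriminant: project onto the direction that best
# separates the classes, accounting for within-class variance.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
X0 = rng.standard_normal((100, 2)) + np.array([-2.0, 0.0])  # class 0
X1 = rng.standard_normal((100, 2)) + np.array([2.0, 0.0])   # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
Z = lda.transform(X)  # 1-D projection along the discriminant direction
print("training accuracy:", lda.score(X, y))
```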
To learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction
[Blog for Notes in Chinese](http://blog.sina.com.cn/s/blog_6a1b8c6b0101h9ho.html#91)
- k-fold cross-validation
The regularization coefficient λ, with λ>0, controls how strongly the neural network is regularized. If λ is very large, the parameters of the neural network are encouraged to be as small as possible.
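This effect can be sketched with scikit-learn's `MLPRegressor`, where the L2 regularization strength is the `alpha` parameter (playing the role of λ above; the data and network size are arbitrary choices for the demo):

```python
# Heavier L2 regularization pushes the network weights toward zero.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

weak = MLPRegressor(hidden_layer_sizes=(8,), alpha=1e-4,
                    max_iter=2000, random_state=0).fit(X, y)
strong = MLPRegressor(hidden_layer_sizes=(8,), alpha=100.0,
                      max_iter=2000, random_state=0).fit(X, y)

norm = lambda m: sum(np.abs(w).sum() for w in m.coefs_)
print(norm(strong) < norm(weak))  # heavy regularization -> smaller weights
```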