MLC Midterm Review 1


A smaller prior variance means a bigger influence of the prior on the posterior.
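A minimal sketch of this effect, assuming the standard conjugate Normal-Normal model for an unknown mean with known noise variance (the function name and numbers are illustrative, not from the notes):

```python
def posterior_mean(prior_mean, prior_var, sample_mean, noise_var, n):
    """Posterior mean of a Normal mean under a Normal prior (known noise variance)."""
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + n * sample_mean / noise_var) / precision

# Prior mean 0, sample mean 10 from n = 5 observations, noise variance 1.
tight = posterior_mean(0.0, 0.01, 10.0, 1.0, 5)   # small prior variance
loose = posterior_mean(0.0, 100.0, 10.0, 1.0, 5)  # large prior variance
# The tight prior pulls the posterior mean strongly toward the prior mean (0);
# the loose prior lets the data dominate.
```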

Ridge (Prior distribution = Normal)


  • A better compromise between bias and variance


  • All predictors are kept in the model
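A sketch of ridge via its standard closed form, w = (XᵀX + λI)⁻¹Xᵀy, on synthetic data (numpy assumed available; the data and λ values are illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w_small = ridge(X, y, 0.01)
w_big = ridge(X, y, 100.0)
# Larger λ shrinks the coefficient vector toward zero, but every
# predictor keeps a nonzero coefficient — ridge does not produce sparsity.
```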

Lasso (Prior distribution = Laplace)


  • Sparsity achieved


  • Does not necessarily yield good results in the presence of high collinearity
  • Not uniquely determined when the number of variables is greater than the number of subjects
  • Tends to select only one variable among a group of predictors with high pairwise correlations
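The sparsity point can be sketched with a minimal coordinate-descent lasso using soft-thresholding (numpy assumed; a toy illustration with synthetic data, not a production solver):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter; lasso should zero out the rest.
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + 0.05 * rng.normal(size=100)
w = lasso_cd(X, y, lam=50.0)
```

The soft-threshold sets small coordinates exactly to zero, which is how the sparsity is achieved.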

Bias-Variance Trade Off:

More complex models, able to better learn from the training sample (lower bias), have higher variance, which leads to overfitting and high error when predicting future values of the output variable.
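A toy demonstration of the trade-off, assuming numpy and synthetic noisy-sine data (the degrees and sample sizes are illustrative): a degree-9 polynomial fits 10 training points better than a line, but generalizes worse on fresh data.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

x_tr, y_tr = sample(10)    # small training sample
x_te, y_te = sample(200)   # fresh test data from the same distribution

def fit_mse(deg):
    coefs = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    return tr, te

tr_lo, te_lo = fit_mse(1)   # simple model: higher bias
tr_hi, te_hi = fit_mse(9)   # complex model: near-zero training error
# The complex model memorizes the training set but has high variance,
# so its test error is worse — exactly the overfitting described above.
```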

Solution: Validation
If the model depends on certain parameters (like α, λ for Lasso/Ridge) that are not supposed to be fit during the training phase, a separate validation sample can be used for their selection (we pick the values that optimize model performance over the validation set).
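A sketch of selecting λ on a held-out validation split, assuming numpy, synthetic data, and the closed-form ridge solution as the model (the grid of λ values is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=120)

X_tr, y_tr = X[:80], y[:80]      # training split: fit the weights
X_val, y_val = X[80:], y[80:]    # validation split: pick the hyperparameter

def ridge(Xm, ym, lam):
    d = Xm.shape[1]
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(d), Xm.T @ ym)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
val_mse = {lam: np.mean((X_val @ ridge(X_tr, y_tr, lam) - y_val) ** 2)
           for lam in grid}
best_lam = min(val_mse, key=val_mse.get)  # λ with lowest validation error
```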

Solution: Cross-validation
Often the available dataset is small, so splitting it into even smaller training, validation and test sets could have a negative impact on model training, leading to noisy and unreliable models. In such cases cross-validation is often applied, performing not one but several random splits of the sample and averaging the model performance scores.
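A minimal k-fold cross-validation loop (numpy assumed; ridge is used as the model and the data are synthetic): instead of relying on one split, the validation score is averaged over k folds.

```python
import numpy as np

def kfold_mse(X, y, lam, k=5, seed=0):
    n, d = X.shape
    idx = np.random.default_rng(seed).permutation(n)  # random split of the sample
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(d), X[tr].T @ y[tr])
        scores.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(scores))  # performance averaged over the k folds

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + 0.2 * rng.normal(size=60)
score = kfold_mse(X, y, lam=1.0)
```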

Choice of λ or α in Model validation
Small values of λ or high values of α lead to results close to OLS (identical to it if λ = 0 or α = +∞).
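A quick numerical check of the limiting case (numpy assumed, synthetic data): the ridge normal equations with λ = 0 reproduce the ordinary least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares
w_ridge0 = np.linalg.solve(X.T @ X, X.T @ y)   # ridge solution with λ = 0
# The two coefficient vectors coincide (X has full column rank here).
```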


PCA steps:

  1. Normalize the data
  2. Construct the covariance matrix
  3. Calculate its eigenvalues
  4. Sort the eigenvalues and find the corresponding eigenvectors
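The steps above can be sketched with numpy (assumed available) on synthetic data with one dominant direction of variance:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                  # 1. center (normalize) the data
C = Xc.T @ Xc / (len(Xc) - 1)            # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # 3. eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]        # 4. sort eigenvalues descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs[:, :2]                  # project onto the top 2 components
```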

Covariance Matrix

  • Covariance is invariant under translations
    Rotations won’t change the eigenvalues of the data covariance matrix (although the eigenvectors will be rotated).
  • Arbitrary scaling of the data will scale eigenvalues of the data covariance matrix by arbitrary amounts (and thus change e.g. the leading principal component).
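Both bullet points can be checked numerically (numpy assumed, synthetic 2-D data): translating the data leaves the covariance eigenvalues unchanged, while per-feature rescaling does not.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.3], [0.3, 1.0]])

eig = lambda A: np.sort(np.linalg.eigvalsh(np.cov(A.T)))

base = eig(X)
shifted = eig(X + np.array([100.0, -50.0]))  # translation: same eigenvalues
scaled = eig(X * np.array([10.0, 1.0]))      # scaling: eigenvalues change
```

This is why the data are typically standardized before PCA when features have arbitrary units.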

PCA can be derived from two equivalent formulations:

  1. Maximum variance formulation
  2. Minimum error formulation

Kernel PCA
Deals with non-linearity by performing PCA in a feature space induced by a kernel.
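A sketch of kernel PCA with an RBF kernel (numpy assumed; the function names and γ value are illustrative): eigendecompose the centered kernel matrix instead of the covariance matrix.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_pca(X, n_components=2, gamma=1.0):
    K = rbf_kernel(X, gamma)
    n = len(X)
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Projections of the data onto the top kernel principal components.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 2))
Z = kernel_pca(X, n_components=2, gamma=0.5)
```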

Fisher’s linear discriminant
Also takes the within-class variance into account
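A sketch of the two-class Fisher direction w ∝ S_w⁻¹(m₁ − m₀), where S_w is the within-class scatter matrix (numpy assumed, synthetic classes with large within-class variance along one axis):

```python
import numpy as np

rng = np.random.default_rng(8)
X0 = rng.normal(loc=[0, 0], scale=[1.0, 3.0], size=(100, 2))  # class 0
X1 = rng.normal(loc=[3, 0], scale=[1.0, 3.0], size=(100, 2))  # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)  # within-class scatter
w = np.linalg.solve(Sw, m1 - m0)
w = w / np.linalg.norm(w)
# Dividing out S_w down-weights the high-variance (uninformative) axis,
# which is exactly how within-class variance is taken into account.
```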

Neural Network (Autoencoder)

Learns a representation (encoding) for a set of data, for the purpose of dimensionality reduction.
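A sketch of the encode/decode idea in the linear case (numpy assumed; this uses the known fact that a linear autoencoder with squared-error loss recovers the principal subspace, so the encoder weights are taken from the top singular vectors rather than trained by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 5)) @ np.diag([5.0, 3.0, 0.2, 0.1, 0.1])
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T                 # encoder weights: 5-dim input -> 2-dim code
code = Xc @ W                # encode (the low-dimensional representation)
recon = code @ W.T           # decode (reconstruct the input)
mse = float(np.mean((Xc - recon) ** 2))
# Most of the variance lives in 2 directions, so the 2-dim code
# reconstructs the data with small error.
```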

NN Regularization

  1. k-fold cross-validation
  2. L2

The regularization coefficient λ, with λ > 0, controls how much the neural network is regularized. If λ is very large, the parameters of the neural network will be encouraged to be as small as possible.
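A minimal sketch of how the L2 penalty enters gradient descent (plain Python; the data gradient is set to zero here purely to isolate the penalty's effect): the penalty (λ/2)·w² contributes λ·w to the gradient, shrinking each weight every step ("weight decay").

```python
lr, lam = 0.1, 1.0
w_plain, w_l2 = 5.0, 5.0
grad = 0.0  # pretend the data gradient is zero to isolate the penalty term

for _ in range(50):
    w_plain -= lr * grad              # no regularization: weight stays put
    w_l2 -= lr * (grad + lam * w_l2)  # L2 penalty pulls the weight toward 0

# With a large λ, the regularized weight decays toward zero, matching the
# statement above that parameters are encouraged to be small.
```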