### Bayes

The smaller the prior variance, the stronger the prior's influence on the posterior.
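As a worked illustration (a standard conjugate-normal result, with notation assumed here rather than taken from these notes): let $x_1,\dots,x_n \sim \mathcal{N}(\theta, \sigma^2)$ with known $\sigma^2$, and prior $\theta \sim \mathcal{N}(\mu_0, \tau^2)$. The posterior mean is a weighted average of the prior mean and the sample mean $\bar{x}$:

$$
\mathbb{E}[\theta \mid x_1,\dots,x_n] = \frac{\sigma^2}{\sigma^2 + n\tau^2}\,\mu_0 + \frac{n\tau^2}{\sigma^2 + n\tau^2}\,\bar{x}
$$

As the prior variance $\tau^2 \to 0$, the weight on $\mu_0$ goes to 1; as $\tau^2 \to \infty$, the posterior mean approaches $\bar{x}$.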

**Ridge** (Prior distribution = Normal)

pros:

- A better compromise between bias and variance than unregularized least squares

cons:

- All predictors are kept in the model

**Lasso** (Prior distribution = Laplace)

pros:

- Achieves sparsity (some coefficients are set exactly to zero)

cons:

- Does not necessarily yield good results in the presence of high collinearity
- The solution is not uniquely determined when the number of variables is greater than the number of subjects
- Tends to select only one variable from a group of predictors with high pairwise correlations
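The difference between the two penalties is easiest to see in a special case. The sketch below assumes an orthonormal design (XᵀX = I) and the objectives ½‖y − Xβ‖² + (λ/2)‖β‖² for Ridge and ½‖y − Xβ‖² + λ‖β‖₁ for Lasso; under those assumptions both have closed forms in terms of the OLS estimate. The function names and the numbers are illustrative only.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: scale every coefficient by 1 / (1 + lam)."""
    return [b / (1.0 + lam) for b in beta_ols]


def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-threshold every coefficient at lam."""
    out = []
    for b in beta_ols:
        if b > lam:
            out.append(b - lam)
        elif b < -lam:
            out.append(b + lam)
        else:
            out.append(0.0)  # coefficients within [-lam, lam] are removed
    return out


beta_ols = [3.0, -1.5, 0.4, -0.2]  # hypothetical OLS estimates
lam = 0.5

ridge = ridge_shrink(beta_ols, lam)
lasso = lasso_shrink(beta_ols, lam)

print("ridge:", ridge)  # every coefficient shrunk, none exactly zero
print("lasso:", lasso)  # [2.5, -1.0, 0.0, 0.0]
```

Ridge keeps all four predictors, while Lasso drops the two small ones, which matches the pros and cons listed above.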

### Bias-Variance Trade-off

More complex models are better able to learn from the training sample (lower bias) but have higher variance, which leads to overfitting and high error when predicting future values of the output variable.

**Solution: Validation**

If the model depends on certain parameters (such as α or λ for Lasso/Ridge) that are not fit during the training phase, a separate validation sample can be used to select them: we pick the values that optimize model performance on the validation set.

**Solution: Cross-validation**

Often the available dataset is small, so splitting it into even smaller training, validation, and test sets can hurt model training and lead to noisy, unreliable models. In such cases cross-validation is often applied: not one but several random splits of the sample are performed, and the model performance scores are averaged.
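A minimal sketch of k-fold cross-validation used to pick λ, assuming a one-feature ridge model without intercept (closed form w = Σxy / (Σx² + λ)). The synthetic data, the λ grid, and the helper names are illustrative assumptions, not part of the notes.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for a one-feature model without intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared prediction error of the slope w."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_score(xs, ys, lam, k=5, seed=0):
    """Average validation MSE over k folds of a fixed random shuffle."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse(w, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return sum(scores) / k


# Synthetic sample: y = 2x + noise (hypothetical data).
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(30)]
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]

grid = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: k_fold_score(xs, ys, lam) for lam in grid}
best_lam = min(cv_scores, key=cv_scores.get)
print("best lambda:", best_lam, "cv mse:", cv_scores[best_lam])
```

The same shuffle seed is reused for every λ so that all candidates are scored on identical folds.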

**Choice of λ or α in Model validation**

Small values of λ or large values of α give results close to OLS (identical to it if λ=0 or α=+∞).

### PCA

- Normalize the data
- Construct the covariance matrix
- Calculate its eigenvalues
- Sort the eigenvalues and find the corresponding eigenvectors
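The steps above can be sketched in plain Python. Two assumptions are made here: "normalize" is read as centering each feature and scaling it to unit variance, and the eigenpairs are found with power iteration plus deflation, a simple stand-in for the eigendecomposition routine a linear-algebra library would normally provide. The data and function names are illustrative.

```python
import math
import random


def standardize(rows):
    """Step 1: center each column and scale it to unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Step 2: covariance matrix of already-centered data."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]


def leading_eigenpair(mat, iters=500, seed=0):
    """Step 3: power iteration for the largest eigenvalue and its unit eigenvector."""
    d = len(mat)
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(v[i] * mat[i][j] * v[j] for i in range(d) for j in range(d))
    return value, v


def pca(rows, n_components):
    """Step 4: eigenpairs in decreasing eigenvalue order, obtained by deflation."""
    mat = covariance(standardize(rows))
    d = len(mat)
    pairs = []
    for _ in range(n_components):
        value, v = leading_eigenpair(mat)
        pairs.append((value, v))
        # Subtract the component just found so the next-largest becomes leading.
        mat = [[mat[i][j] - value * v[i] * v[j] for j in range(d)] for i in range(d)]
    return pairs


# Hypothetical 2-D data lying close to a line.
rng = random.Random(1)
data = []
for _ in range(200):
    t = rng.gauss(0, 1)
    data.append([t + rng.gauss(0, 0.2), 2 * t + rng.gauss(0, 0.2)])

components = pca(data, n_components=2)
for value, vector in components:
    print(round(value, 3), [round(x, 3) for x in vector])
```

Because the features are standardized, the eigenvalues sum to the number of features, so each eigenvalue divided by that sum is the fraction of variance its component explains.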

**Covariance Matrix**

- Covariance is invariant under translations
- Rotations won’t change the eigenvalues of the data covariance matrix (although the eigenvectors will be rotated)
- Arbitrary scaling of the data will scale eigenvalues of the data covariance matrix by arbitrary amounts (and thus change e.g. the leading principal component)
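These three statements follow directly from the definition. Writing $\Sigma = \operatorname{Cov}(x)$ (notation assumed here), for a shift $b$, an orthogonal matrix $R$ with $R^\top R = I$, and a diagonal scaling matrix $D$:

$$
\operatorname{Cov}(x + b) = \Sigma, \qquad
\operatorname{Cov}(Rx) = R\,\Sigma\,R^\top, \qquad
\operatorname{Cov}(Dx) = D\,\Sigma\,D
$$

If $\Sigma u = \lambda u$, then $(R\,\Sigma\,R^\top)(Ru) = \lambda\,(Ru)$: the eigenvalue is unchanged and the eigenvector is rotated. For $D\,\Sigma\,D$ there is no such relation in general; only uniform scaling $D = sI$ keeps the eigenvectors and multiplies every eigenvalue by $s^2$.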

**Uses the ideas of:**

- Maximum variance formulation
- Minimum error formulation

**Kernel PCA**

Deals with non-linearity

**Fisher’s linear discriminant**

Also takes the within-class variance into account
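In the standard two-class formulation (notation assumed here, not from the notes), the projection direction $w$ maximizes the ratio of between-class to within-class scatter:

$$
J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}, \qquad
S_B = (m_2 - m_1)(m_2 - m_1)^\top, \qquad
S_W = \sum_{k=1}^{2} \sum_{n \in C_k} (x_n - m_k)(x_n - m_k)^\top
$$

where $m_k$ is the mean of class $C_k$. The maximizer satisfies $w \propto S_W^{-1}(m_2 - m_1)$, so unlike PCA the direction depends on the class labels.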

### Neural Network

**Autoencoder**

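Assume that the output is the same as the input, then train the network and adjust its parameters to obtain the weights of each layer. Naturally, this gives several different representations of the input I (each layer corresponds to one representation), and these representations are the features.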

To learn a representation (encoding) for a set of data, for the purpose of dimensionality reduction.
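In symbols (an illustrative formulation, assuming squared reconstruction error, an encoder $f$, and a decoder $g$, none of which are named in the notes):

$$
\min_{f,\,g}\; \frac{1}{n} \sum_{i=1}^{n} \bigl\lVert x_i - g\bigl(f(x_i)\bigr) \bigr\rVert^2
$$

The code $h_i = f(x_i)$ is the learned representation; choosing its dimension smaller than that of $x_i$ gives the dimensionality reduction.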

[Blog for Notes in Chinese](http://blog.sina.com.cn/s/blog_6a1b8c6b0101h9ho.html#91)

**NN Regularization**

- k-fold cross-validation
- L2 regularization

The regularization coefficient λ, with λ>0, controls how strongly the neural network is regularized. If λ is very large, the parameters of the neural network are pushed to be as small as possible.
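A minimal sketch of this effect, assuming the simplest possible network: a single linear neuron trained by gradient descent on mean squared error plus λ·Σwⱼ². The data, the `train` helper, the learning rate, and the step count are illustrative choices, not from the notes.

```python
import math


def train(xs, ys, lam, lr=0.01, steps=2000):
    """Gradient descent on MSE + lam * sum(w_j ** 2) for a linear neuron."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        preds = [sum(wj * xj for wj, xj in zip(w, x)) for x in xs]
        grad = [
            2.0 / n * sum((p - y) * x[j] for p, y, x in zip(preds, ys, xs))
            + 2.0 * lam * w[j]  # gradient of the L2 penalty
            for j in range(d)
        ]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def l2_norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(wj * wj for wj in w))


# Hypothetical training data with two input features.
xs = [[1.0, 0.5], [2.0, -1.0], [3.0, 0.0], [4.0, 1.5]]
ys = [1.6, 4.9, 6.1, 6.4]

norms = {lam: l2_norm(train(xs, ys, lam)) for lam in (0.0, 0.1, 10.0)}
for lam, norm in norms.items():
    print(f"lambda={lam}: ||w|| = {norm:.4f}")
```

The weight norm decreases as λ grows, which is the shrinkage described above.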