
Machine Learning Algorithms Summary + R Code

Supervised Learning Algorithms


1 Supervised Learning by Empirical Risk Minimization (ERM)
1.1 Empirical Risk Minimization and Inductive Bias
1.2 Ordinary Least Squares (OLS)
1.3 Ridge Regression
1.4 LASSO
1.5 Logistic Regression
1.6 Regression Classifier
1.7 Linear Support Vector Machines (SVM)
1.8 Generalized Additive Models (GAMs)
1.9 Projection Pursuit Regression (PPR)
1.10 Neural Networks (NNETs)
1.11 Classification and Regression Trees (CARTs)
1.12 Random Forests
1.13 Rotation Forest
1.14 Smoothing Splines
2 Non-ERM Supervised Learning
2.1 k-Nearest Neighbour (KNN)
2.2 Kernel Regression
2.3 Local Likelihood and Local ERM
2.4 Boosting
2.5 Learning Vector Quantization (LVQ)
3 Dimensionality Reduction In Supervised Learning
3.1 Variable Selection
3.2 LASSO
3.3 Principal Component Regression (PCAR)
3.4 Partial Least Squares (PLS)
3.5 Canonical Correlation Analysis (CCA)
3.6 Reduced Rank Regression (RRR)
4 Generative Models In Supervised Learning
4.1 Fisher's Linear Discriminant Analysis (LDA)
4.2 Fisher's Quadratic Discriminant Analysis (QDA)
4.3 Naive Bayes
5 Ensembles
5.1 Committee Methods
5.2 Bayesian Model Averaging
5.3 Stacking
5.4 Bootstrap Averaging (Bagging)
5.5 Boosting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Classification:
Kernel Density Classification.
Naive Bayes Classifier - has the form of a generalized additive model. The models are
fit in quite different ways though.
Mixture Models for Density Estimation and Classification - can be viewed as a kind
of kernel method.

1.9 Projection Pursuit Regression (PPR)


Another way to generalize the hypothesis class F, which generalizes the GAM model, is to
allow f to be some simple function of a linear combination of the predictors, of the form
$$f(x) = \sum_{m=1}^{M} g_m(w_m^T x), \qquad (1.9)$$
where both $g_m$ and $w_m$ are learned from the data. The regularization is now performed by
choosing M and the class of $\{g_m\}_{m=1}^{M}$.

Note: PPR is not a pure ERM. Just like the GAM problem, in the PPR problem $\{g_m\}_{m=1}^{M}$ are
learned by Kernel Regression. Solving the PPR problem is thus a hybrid of the ERM and Kernel
Regression algorithms.

Note: If M is taken arbitrarily large, then for an appropriate choice of $g_m$ the PPR model can
approximate any continuous function in $\mathbb{R}^p$ arbitrarily well. Such a class of models is called a
universal approximator. However, this generality comes at a price: interpretation of the fitted
model is usually difficult, because each input enters the model in a complex and
multifaceted way. As a result, the PPR model is most useful for prediction, and not very
useful for producing an understandable model of the data.
Notice also that the neural network model with one hidden layer has exactly the same form
as the projection pursuit model described above. The difference is that the PPR model uses
nonparametric functions $g_m(v)$, while the neural network uses a far simpler function based on
sigmoid(v).
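
For a quick hands-on impression, here is a minimal sketch of fitting a PPR model with `stats::ppr()`; it assumes the `prostate.train`/`prostate.test` split and the `MSE` helper defined in the R code section at the end of these notes, and the choice of two ridge terms is arbitrary.
```{r PPR sketch}
# A minimal sketch: PPR with two ridge terms, each g_m fit by a smoother.
ppr.1 <- ppr(lcavol ~ ., data = prostate.train, nterms = 2)
# Train error:
MSE(predict(ppr.1) - prostate.train$lcavol)
# Test error:
MSE(predict(ppr.1, newdata = prostate.test) - prostate.test$lcavol)
```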

1.10 Neural Networks (NNETs) - Single Hidden Layer


We introduce the NNET model via the PPR model, and not through its historically original
construction. In the language of Eq.(1.9), a single-layer feed-forward neural network is a
model where $\{g_m\}_{m=1}^{M}$ are not learned from the data, but rather assumed a-priori:
$$g_m(x) := \beta_m \, \sigma(w_m^T x),$$
where only $\{\beta_m, w_m\}_{m=1}^{M}$ are learned from the data. A typical activation function $\sigma$ is the standard
logistic CDF: $\sigma(t) = \frac{1}{1+e^{-t}}$.
As can be seen, the NNET is merely a non-linear regression model, the parameters of which
are often called weights.

Loss Functions: Like any other ERM problem, we are free to choose the appropriate loss
function.
Universal Approximator: Like the PPR, even when $\{g_m\}_{m=1}^{M}$ are fixed beforehand, the class is
still a universal approximator.
Regularization: Regularization of the model is done via the selection of the number of
nodes/variables in the network and the number of layers.

1.11 Classification and Regression Trees (CARTs)


CARTs are a type of ERM where f(x) may include very non-smooth functions that can be
interpreted as "if-then" rules, also known as decision trees.
The hypothesis class of CARTs includes functions of the form
$$f(x) = \sum_{m=1}^{M} c_m \, I\{x \in R_m\}.$$
The parameters of the model are the different conditions $\{R_m\}_{m=1}^{M}$ and the function's value at
each condition, $\{c_m\}_{m=1}^{M}$.

Regularization: is done by the choice of M, which is governed by the tree depth.

Loss Functions: As usual, a squared loss can be used for continuous outcomes y. For
categorical outcomes, the loss function is called the impurity measure. One can use either the
misclassification error, the multinomial likelihood (known as the deviance, or
cross-entropy), or a first order approximation of the latter known as the Gini index.
Universal Approximator: CART is a universal approximator.

1.12 Random Forests
Trees are very flexible hypothesis classes. They thus have small bias but large variance.
Bagging trees reduces this variance by averaging trees fit to different bootstrap samples.
Alas, the variance (thus the MSE) of bagged trees is lower bounded by the fact that the trees use
the same variables, and are thus correlated. To remedy this, [Breiman, 2001] proposed to fit
trees to bootstrapped samples, using only a random subset of variables at each split. This
decorrelates the trees, thus allowing a reduction in the variance of their average (thus its MSE).

1.14 Smoothing Splines

Unsupervised Learning
1 Introduction to Unsupervised Learning
2 Density Estimation
2.1 Parametric Density Estimation
2.2 Kernel Density Estimation
2.3 Graphical Models
3 High Density Regions
3.1 Association Rules
4 Linear-Space Embeddings
4.1 Principal Components Analysis (PCA)
4.2 Random Projections
4.3 Sparse Principal Component Analysis (sPCA)
4.4 Multidimensional Scaling (MDS)
4.5 Local MDS
4.6 Isometric Feature Mapping (Isomap)
5 Non-Linear-Space Embeddings
5.1 Kernel Principal Component Analysis (kPCA)
5.2 Self Organizing Maps (SOM)
5.3 Principal Curves and Surfaces
5.4 Local Linear Embedding (LLE)
5.5 Auto Encoders
5.6 Matrix Factorization
5.7 Information Bottleneck
6 Latent Space Generative Models
6.1 Factor Analysis (FA)
6.2 Independent Component Analysis (ICA)
6.3 Exploratory Projection Pursuit
6.4 Compressed Sensing
6.5 Generative Topographic Map (GTM)
6.6 Finite Mixtures
6.7 Hidden Markov Models (HMM)
6.8 Latent Space Graphical Models
6.9 Latent Dirichlet Allocation (LDA)
6.10 Probabilistic Latent Semantic Indexing (PLSI)
6.11 Prediction by Partial Matching (PPM)
6.12 Dynamic Markov Compression (DMC)
7 Random Graph Models
7.1 Erdos-Renyi
7.2 Exchangeable Graph Model
7.3 p1 Graph Model
7.4 p2 Graph Model
7.5 Stochastic Block Graph Model
7.6 Latent Space Graph Model
7.7 Exponential Random Graphs (ERGMs)
8 Cluster Analysis
8.1 K-Means Clustering
8.2 K-Medoids Clustering (PAM)
8.3 Quality Threshold Clustering (QT)
8.4 Hierarchical Clustering
8.5 Fuzzy Clustering
8.6 Self Organizing Maps (SOM)
8.7 Spectral Clustering
8.8 Bi-Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3.1 Association Rules (Market Basket Analysis; Apriori algorithm)


Association rules, or market basket analysis, or affinity analysis, can be seen as
approximating the joint distribution with a region-wise constant function.
Apriori Algorithm
Terminology
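For a rule A => B over item sets A and B, the standard quantities (stated here for completeness) are
$$\text{support}(A \Rightarrow B) = P(A \cap B), \qquad \text{confidence}(A \Rightarrow B) = \frac{P(A \cap B)}{P(A)}, \qquad \text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{P(B)},$$
with all probabilities estimated by relative frequencies over the market baskets.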

The algorithm: (use dummy variables for 0/1 response = "in basket"/"Not in basket").
The first pass over the data computes the support (relative frequency) of all single-item sets.
Those whose support is less than the threshold are discarded. The second
pass computes the support of all item sets of size two that can be formed
from pairs of the single items surviving the first pass. In other words, to
generate all frequent itemsets with |K| = m, we need to consider only
candidates such that all of their m ancestral item sets of size m − 1 are
frequent. Those size-two item sets with support less than the threshold are
discarded. Each successive pass over the data considers only those item
sets that can be formed by combining those that survived the previous
pass with those retained from the first pass. Passes over the data continue
until all candidate rules from the previous pass have support less than the
specified threshold.
> Example: suppose the item set K = {peanut butter, jelly, bread} and
consider the rule {peanut butter, jelly} => {bread}. A support value
of 0.03 for this rule means that peanut butter, jelly, and bread appeared
together in 3% of the market baskets. A confidence of 0.82 for this rule implies
that when peanut butter and jelly were purchased, 82% of the time
bread was also purchased. If bread appeared in 43% of all market baskets
then the rule {peanut butter, jelly} => {bread} would have a lift of 1.95.
The goal of this analysis is to produce association rules (A => B) with both
high values of support and confidence.

Examples of Association Rules:

4 Linear Space Embedding Methods


Linear space embeddings are a class of dimensionality reduction techniques
that map the data X into a lower dimensional linear space M. The mapping itself,
f : X → M, can be linear or nonlinear, and we denote the low dimensional representation
of the data by f(X) ∈ M.
The ideas of ERM and inductive bias also apply to unsupervised learning:
we seek some f that does not incur too much loss on average, i.e., we seek to minimize R(f).

Remark: Two interpretations of "linear" can be found in the literature. It may refer to the
nature of the low dimensional space approximating the data, or to the nature of the
embedding operation.

4.1 PCA
Maximizing the variance of a projection under a norm constraint, using Lagrange multipliers:
$$\arg\max_{v:\,\|v\|=1} Var[v^T X], \qquad \text{where } Cov[X] = \Sigma \text{ so that } Var[v^T X] = v^T \Sigma v.$$

PCA is such a basic technique it has been rediscovered and renamed independently in many
fields. It can be found under the names of discrete Karhunen-Loeve Transform; Hotelling
Transform; Proper Orthogonal Decomposition (POD); Eckart-Young Theorem;
Schmidt-Mirsky Theorem; Empirical Orthogonal Functions; Empirical Eigenfunction
Decomposition; Empirical Component Analysis; Quasi-Harmonic Modes; Spectral
Decomposition; Empirical Modal Analysis; and possibly more.
Example:
Consider human height and weight data. While clearly two dimensional, you don't really
need both variables to understand how "big" the people in the data are. This is because height and
weight vary mostly along a single dimension, which can be interpreted as the "bigness" of an
individual. This is why physicians use the Body Mass Index (BMI) as an indicator of size,
instead of a two-dimensional measurement.
Assume now that you wish to give each individual a size score that is a linear combination of
height and weight: PCA does just that. It returns the linear combination that has the most
variability, i.e., the combination which best distinguishes between individuals.
Notice we have now offered two motivations for PCA: (i) find linear combinations that
best distinguish between observations, i.e., maximize variance;
(ii) find the linear subspace that best approximates the data. The reason these two problems
are equivalent is the use of the squared error. Informally speaking, the data has some
total variance, which can be decomposed into the part captured in M and the part not
captured.
Note: Usually for simplicity of exposition, we will assume that the data X has been mean
centered.
Terminology:
Principal Components: The linear combinations of the features which best separate
between observations. In our example, the "bigness" index of each individual.
The first component captures the most variance, the second component the second
most variance, etc. In terms of M, the principal components are an orthogonal basis
for M.
Scores: Synonymous with principal components.
Loadings: The weights of each original variable in each principal component.
In our example, the importance of height and weight in constructing the "bigness"
score.
PCA as a Graph Method
Starting from the maximal variance motivation, it is perhaps not surprising
that PCA depends only on the similarities between features, as measured by
their empirical covariance. The linearity of the target manifold was there by
assumption.
The building blocks of all these graph-based dimensionality reduction
methods are:
1. Compute some similarity graph G (or dissimilarity graph D) from the
raw features.

2. Call upon graph embedding theory to map the data points into the
target manifold M.

To summarize:
Task = dim reduce
Type = optimization
Input = Graph (G)
Output = embedding function
4.3 Sparse Principal Component Analysis (sPCA)
When analyzing the PCA results, we often wish to understand which features contribute to
which component. This is much easier when the loadings (A) are sparse, i.e., include many
zeroes. sPCA performs this in LASSO style, by means of l1 regularization.
4.4 Multidimensional Scaling (MDS)
MDS - Both self-organizing maps and principal curves and surfaces map data points
in Rp to a lower dimensional manifold. Multidimensional scaling (MDS) has a similar
goal, but approaches the problem in a somewhat different way.
MDS represents high-dimensional data in a low-dimensional coordinate system.
MDS requires only the dissimilarities dij , in contrast to the SOM and principal curves
and surfaces which need the data points xi.
MDS aims at representing a network (= a weighted graph) of distances (or
similarities) between observations, by embedding the observations in a q dimensional
linear subspace, while preserving the original distances.

5 Non-Linear Space Embedding Methods


The fact that the linear-space embedding of the data depends only on some similarity
graph has laid a bridge between feature embedding methods, such as PCA, and graph embedding
methods such as MDS. Moreover, it has opened the door for replacing the covariance
similarity with many other similarity measures.
Classic MDS is simply PCA when starting from G, thus viewed as a graph embedding
problem. kPCA plugs kernel similarities instead of covariance similarities. LocalMDS and
LLE follow a similar motivation using local measures of similarity.
The PCA solution can be cast in terms of the covariance between individuals (G = X'X) or the
Euclidean distances (D).
In particular, we show that all the information on the location (mean) of X, needed for the
PCA reconstruction, is actually encoded in G (or D).
5.1 Kernel Principal Component Analysis (kPCA)
The optimization problem is
$$\arg\max_{g} Cov[g(X)],$$
where g(X) is the best separating score (function).

We thus have two matters to attend to:
(i) We need to constrain g(x) so that it does not overfit.
(ii) We need the problem to be computable. This is precisely the goal of kPCA.
We have already encountered a similar problem with Smoothing Splines. It is thus not
surprising that the solution has the same form. Namely, if we choose the right g's, the solution

of the optimization problem takes a very simple form. The classes of such g's are known as
Reproducing Kernel Hilbert Spaces (RKHS).

Nonlinear Dimension Reduction and Local Multidimensional Scaling - These methods can be
thought of as flattening the manifold, and hence reducing the data to a set of low-dimensional
coordinates that represent their relative positions in the manifold. They are useful for problems where
signal-to-noise ratio is very high (e.g., physical systems), and are probably not as useful for
observational data with lower signal-to-noise ratios.
Three Methods of Nonlinear MDS:
ISOMAP = Isometric feature mapping (Tenenbaum et al., 2000) - constructs a graph to
approximate the geodesic distance between points along the manifold. Specifically, for each data
point we find its neighbors: points within some small Euclidean distance of that point. We construct a
graph with an edge between any two neighboring points. The geodesic distance between any two
points is then approximated by the shortest path between points on the graph. Finally, classical
scaling is applied to the graph distances, to produce a low-dimensional mapping.

LLE = Local linear embedding (Roweis and Saul, 2000) - takes a very different approach, trying
to preserve the local affine structure of the high-dimensional data. Each data point is approximated by
a linear combination of neighboring points. Then a lower dimensional representation is constructed
that best preserves these local approximations.

LLE aims at finding linear subspaces that are good approximations of small neighborhoods of
the whole data X. It is similar in spirit to Isomap and LocalMDS. It differs, however,
in the way similarities are computed, and in the way embeddings are performed. In particular,
as the name may suggest, LLE performs local embeddings into linear subspaces.
To summarize:
Task = dim. reduction
Type = algorithm
Input = graph (G)
Output = data embedding
Concept = local distance
Local MDS (Chen and Buja, 2008) - takes the simplest and arguably the most direct approach.
We define N to be the symmetric set of nearby pairs of points; specifically a pair (i, i') is in N if point i
is among the K-nearest neighbors of i', or vice-versa.
Self Organizing Maps (SOM)
SOMs are a non-linear-subspace dimensionality reduction method, aimed at
good clustering. It is non-linear because the algorithm (which cannot be cast
as an ERM problem, i.e., optimization problem) returns an embedding into a non-linear
manifold.
To summarize:
Task = dim. reduction
Type = algorithm
Input = X (data)
Output = parametric curve or surface
Concept = self consistency, i.e., a curve whose path is the average of all its closest
data points. Roughly speaking, one can think of this curve as a
parameterized function connecting all the k-means cluster centers in the smoothest
way possible.


8 Cluster Analysis

Gaussian Mixtures as Soft K-means Clustering.

K-means Clustering - the algorithm is appropriate when the dissimilarity measure is
taken to be squared Euclidean distance. This requires all of the variables to be of the
quantitative type. In addition, using squared Euclidean distance places the highest
influence on the largest distances. This causes the procedure to lack robustness
against outliers that produce very large distances.

K-medoids Clustering - For a given cluster assignment (C) find the observation in the
cluster minimizing total distance to other points in that cluster. This algorithm
assumes attribute data, but the approach can also be applied to data described only by
proximity matrices. There is no need to explicitly compute cluster centers.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recommender Systems Algorithms
1. Content Filtering
2. Collaborative Filtering
3. Hybrid Filtering
4. Recommender Systems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The two main approaches to recommender systems include content filtering and
collaborative filtering.
1. Content Filtering
In content filtering, the system is assumed to have some background information on the user
(say, because he logged in), and uses this information to give him recommendations. The
recommendation in this case, is approached as a supervised learning problem: the system
learns to predict a product's rating based on the user's features.
2. Collaborative Filtering
Unlike content filtering, in collaborative filtering, there is no external information on the user
or the products, besides the ratings of other users.
Collaborative filtering can be approached as a supervised learning problem or as an
unsupervised learning problem, but it is essentially neither: it is a missing data
problem.
The two main approaches to collaborative filtering include neighborhood methods, and latent
factor models.
a. The neighborhood methods to collaborative filtering rest on the assumption that
similar individuals have similar tastes. If someone similar to individual i has seen
movie j, then i should have a similar opinion.
b. The latent factor models approach to collaborative filtering rests on the assumption
that the rankings are a function of some latent user attributes and latent movie
attributes. This idea is not a new one, as we have seen it in the context of
unsupervised learning in factor analysis (FA) and independent component analysis
(ICA). This is why this approach is more commonly known as the
Matrix Factorization approach to collaborative filtering.
We can present several matrix factorization problems in the ERM framework.
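For concreteness, here is a minimal sketch (not part of the original notes) of the ERM view: a rank-k factorization R ≈ UVᵀ fit by alternating least squares on a toy, fully observed ratings matrix, ignoring the missing-data handling that a real recommender needs.
```{r matrix factorization sketch}
# A minimal alternating-least-squares sketch on a toy, fully observed ratings matrix.
set.seed(1)
n.users <- 10; n.items <- 8; k <- 2
R <- matrix(sample(1:5, n.users * n.items, replace = TRUE), n.users, n.items)
U <- matrix(rnorm(n.users * k), n.users, k) # user factors
V <- matrix(rnorm(n.items * k), n.items, k) # item factors
for (iter in 1:50) {
  U <- R %*% V %*% solve(crossprod(V))    # least squares update of U given V
  V <- t(R) %*% U %*% solve(crossprod(U)) # least squares update of V given U
}
mean((R - U %*% t(V))^2) # reconstruction MSE
```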
Hybrid Filtering
After introducing the ideas of content filtering and collaborative filtering, why not marry the
two? Hybrid filtering is the idea of imputing the missing data, thus making recommendations,
using both a viewer's attributes, and other viewers' preferences.
It can be presented as an ERM problem.
Recommender Systems Terminology
Content Based Filtering: A supervised learning approach to recommendations.
Collaborative Filtering: A missing data imputation approach to recommendations.
Memory Based Filtering: A non-parametric (neighborhood) approach to collaborative
filtering.


Model Based Filtering: A latent space generative model approach to collaborative filtering.

Misc notes:
========
The Relation Between Supervised and Unsupervised Learning
It may be surprising that collaborative filtering can be seen as both an unsupervised and a
supervised learning problem. But these are not mutually exclusive problems.
Since in unsupervised learning we try to learn the joint distribution of x, i.e., the
relationship between any variable in x and the rest, we may see it as several supervised
learning problems: in each, a different variable in x plays the role of y.

The Kernel Trick


Applies to: SVM, PCA, canonical correlation analysis, ridge regression, spectral clustering,
Gaussian processes, and more (k-nearest neighbor (kNN) is also a kernel method).
Think of smoothing splines: it was quite magical that, without constraining the hypothesis
class F, the ERM problem has a finite dimensional closed form solution. The property of an
infinite dimensional problem having a solution in a finite dimensional space is known as the
kernel property.
The question is then: what type of penalties J(f) will return simple solutions to the penalized ERM problem
$$\hat{f} = \arg\min_{f} \left\{ \sum_{i=1}^{n} l(y_i, f(x_i)) + \lambda J(f) \right\}? \qquad (1)$$
The answer is: functions that belong to a Reproducing Kernel Hilbert Space (RKHS).
The Bayesian View of RKHS
Just as the ridge regression has a Bayesian interpretation, so does the kernel trick. Informally,
the functions solving Eq.(1) can be seen as the posterior mode if our prior beliefs postulate
that the function we are trying to recover is a Gaussian zero-mean process with covariance
given by K.

Generative Models
By generative model we mean that we specify the whole data distribution.
This is particularly relevant to supervised learning where many methods only assume the
distribution of P(y|x) without stating the distribution of P(x).
LDA, QDA, and Naive Bayes follow this exact same rationale.
Dimensionality Reduction
- It is thus intimately related to lossy compression in information theory.
- Dimensionality reduction is often performed before supervised learning to keep
computational complexity low.


R code
Supervised Learning Code
```{r utility functions}
library(magrittr) # for piping
library(dplyr) # for handling data frames
# Some utility functions:
l2 <- function(x) x^2 %>% sum %>% sqrt
l1 <- function(x) abs(x) %>% sum
MSE <- function(x) x^2 %>% mean
missclassification <- function(tab) sum(tab[c(2,3)])/sum(tab)
```
We also initialize the random number generator so that we all get the same results (at least
upon a first run)
```{r set seed}
set.seed(2015)
```
# OLS
## OLS Regression
Starting with OLS regression, and a split train-test data set:
```{r OLS Regression}
View(prostate)
# now verify that your data looks as you would expect....
ols.1 <- lm(lcavol~. ,data = prostate.train)
# Train error:
MSE( predict(ols.1)- prostate.train$lcavol)
# Test error:
MSE( predict(ols.1, newdata = prostate.test)- prostate.test$lcavol)
```
Now using cross validation to estimate the prediction error:
```{r Cross Validation}
folds <- 10
fold.assignment <- sample(1:folds, nrow(prostate), replace = TRUE)
errors <- NULL
for (k in 1:folds){
prostate.cross.train <- prostate[fold.assignment!=k,]
prostate.cross.test <- prostate[fold.assignment==k,]
.ols <- lm(lcavol~. ,data = prostate.cross.train)
.predictions <- predict(.ols, newdata=prostate.cross.test)
.errors <- .predictions - prostate.cross.test$lcavol
errors <- c(errors, .errors)
}

# Cross validated prediction error:


MSE(errors)
```
Also trying a bootstrap prediction error:
```{r Bootstrap}
B <- 20
n <- nrow(prostate)
errors <- NULL
prostate.boot.test <- prostate
for (b in 1:B){
prostate.boot.train <- prostate[sample(1:n, replace = TRUE),]
.ols <- lm(lcavol~. ,data = prostate.boot.train)
.predictions <- predict(.ols, newdata=prostate.boot.test)
.errors <- .predictions - prostate.boot.test$lcavol
errors <- c(errors, .errors)
}
# Bootstrapped prediction error:
MSE(errors)
```

### OLS Regression Model Selection

Best subset selection: find the best model of each size:


```{r best subset}
# install.packages('leaps')
library(leaps)
regfit.full <- prostate.train %>%
regsubsets(lcavol~.,data = ., method = 'exhaustive')
summary(regfit.full)
plot(regfit.full, scale = "Cp")
```
Train-Validate-Test Model Selection.
Example taken from
[here](https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/ch6.html)
```{r OLS TVT model selection}
model.n <- regfit.full %>% summary %>% extract2("rss") %>% length # number of model sizes considered
X.train.named <- prostate.train %>% model.matrix(lcavol ~ ., data = .)
X.test.named <- prostate.test %>% model.matrix(lcavol ~ ., data = .)
View(X.test.named)
val.errors <- rep(NA, model.n)
train.errors <- rep(NA, model.n)
for (i in 1:model.n) {
  coefi <- coef(regfit.full, id = i)
  pred <- X.train.named[, names(coefi)] %*% coefi
  train.errors[i] <- MSE(y.train - pred)
  pred <- X.test.named[, names(coefi)] %*% coefi
  val.errors[i] <- MSE(y.test - pred)
}
plot(train.errors, ylab = "MSE", pch = 19, type = "black")
points(val.errors, pch = 19, type = "b", col="blue")
legend("topright",
legend = c("Training", "Validation"),
col = c("black", "blue"),
pch = 19)
```

AIC model selection:


```{r OLS AIC}
# Forward search:
ols.0 <- lm(lcavol~1 ,data = prostate.train)
model.scope <- list(upper=ols.1, lower=ols.0)
step(ols.0, scope=model.scope, direction='forward', trace = TRUE)
# Backward search:
step(ols.1, scope=model.scope, direction='backward', trace = TRUE)
```

Cross Validated Model Selection.


```{r OLS CV}
[TODO]
```
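
A minimal sketch of one way to do this, reusing `prostate`, `folds`, `fold.assignment`, `model.n` and `MSE` from the chunks above; the subset search is simply refit within each fold:
```{r OLS CV sketch}
cv.errors <- matrix(NA, folds, model.n)
for (k in 1:folds) {
  .train <- prostate[fold.assignment != k, ]
  .test  <- prostate[fold.assignment == k, ]
  .fit <- regsubsets(lcavol ~ ., data = .train, method = "exhaustive")
  .X.test <- model.matrix(lcavol ~ ., data = .test)
  for (i in 1:model.n) {
    .coefi <- coef(.fit, id = i)
    .pred <- .X.test[, names(.coefi)] %*% .coefi
    cv.errors[k, i] <- MSE(.test$lcavol - .pred)
  }
}
which.min(colMeans(cv.errors)) # model size with the smallest CV error
```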

Bootstrap model selection:


```{r OLS bootstrap}
[TODO]
```
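
A minimal sketch of one possible bootstrap flavour: refit a forward AIC search on bootstrap samples of `prostate.train` and tabulate how often each variable is selected.
```{r OLS bootstrap sketch}
B <- 20
selected <- list()
for (b in 1:B) {
  .boot <- prostate.train[sample(1:nrow(prostate.train), replace = TRUE), ]
  .fit <- step(lm(lcavol ~ 1, data = .boot),
               scope = lcavol ~ lweight + age + lbph + svi + lcp + gleason + pgg45 + lpsa,
               direction = "forward", trace = FALSE)
  selected[[b]] <- attr(terms(.fit), "term.labels")
}
sort(table(unlist(selected)), decreasing = TRUE) # selection frequencies
```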

Partial least squares and principal components:


```{r PLS}
pls::plsr()
pls::pcr()
```
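A minimal usage sketch with the prostate split; the choice of 3 components below is arbitrary:
```{r PLS sketch}
library(pls)
pls.1 <- plsr(lcavol ~ ., data = prostate.train, validation = "CV")
pcr.1 <- pcr(lcavol ~ ., data = prostate.train, validation = "CV")
# Test errors using (say) 3 components:
MSE(predict(pls.1, newdata = prostate.test, ncomp = 3) - prostate.test$lcavol)
MSE(predict(pcr.1, newdata = prostate.test, ncomp = 3) - prostate.test$lcavol)
```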
Canonical correlation analysis:
```{r CCA}

cancor()
# Kernel based robust version
kernlab::kcca()
```

## OLS Classification
```{r OLS Classification}
# Making train and test sets:
ols.2 <- lm(spam~., data = spam.train.dummy)
# Train confusion matrix:
.predictions.train <- predict(ols.2) > 0.5
(confusion.train <- table(prediction=.predictions.train, truth=spam.train.dummy$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(ols.2, newdata = spam.test.dummy) > 0.5
(confusion.test <- table(prediction=.predictions.test, truth=spam.test.dummy$spam))
missclassification(confusion.test)
```

# Ridge Regression
```{r Ridge I}
# install.packages('ridge')
library(ridge)
ridge.1 <- linearRidge(lcavol~. ,data = prostate.train)
# Note that if not specified, lambda is chosen automatically by linearRidge.
# Train error:
MSE( predict(ridge.1)- prostate.train$lcavol)
# Test error:
MSE( predict(ridge.1, newdata = prostate.test)- prostate.test$lcavol)
```

Another implementation, which also automatically chooses the tuning parameter $\lambda$:
```{r Ridge II}
# install.packages('glmnet')
library(glmnet)
ridge.2 <- glmnet(x=X.train, y=y.train, alpha = 0)
# Train error:
MSE( predict(ridge.2, newx =X.train)- y.train)


# Test error:
MSE( predict(ridge.2, newx = X.test)- y.test)
```
__Note__: `glmnet` is slightly picky.
I could not have created `y.train` using `select()` because I need a vector and not a
`data.frame`. Also, `as.matrix` is there as `glmnet` expects a `matrix` class `x` argument.
These objects are created in the make_samples.R script, which we sourced in the beginning.

# LASSO Regression
```{r LASSO}
# install.packages('glmnet')
library(glmnet)
lasso.1 <- glmnet(x=X.train, y=y.train, alpha = 1)
# Train error:
MSE( predict(lasso.1, newx =X.train)- y.train)
# Test error:
MSE( predict(lasso.1, newx = X.test)- y.test)
```

# Logistic Regression For Classification


```{r Logistic Regression}
logistic.1 <- glm(spam~., data = spam.train, family = binomial)
# numerical error. Probably due to too many predictors.
# Maybe regularizing the logistic regression with Ridge or LASSO will make things better?
```
In the next chunk, we do $l_2$ and $l_1$ regularized logistic regression.
Some technical remarks are in order:
- `glmnet` is picky with its inputs. This has already been discussed in the context of the
LASSO regression above.
- The `predict` function for `glmnet` objects returns a prediction (see below) for many
candidate regularization levels $\lambda$. We thus use `cv.glmnet`, which does an automatic
cross validated selection of the best regularization level.
```{r Regularized Logistic Regression}
library(glmnet)
# Ridge Regularization with CV selection of regularization:
logistic.2 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 0)
# LASSO Regularization with CV selection of regularization:
logistic.3 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 1)

# Train confusion matrix:



.predictions.train <- predict(logistic.2, newx = X.train.spam, type = 'class')


(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
.predictions.train <- predict(logistic.3, newx = X.train.spam, type = 'class')
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(logistic.2, newx = X.test.spam, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=y.test.spam))
missclassification(confusion.test)
.predictions.test <- predict(logistic.3, newx = X.test.spam, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=y.test.spam))
missclassification(confusion.test)
```

# SVM
## Classification
```{r SVM classification}
library(e1071)
svm.1 <- svm(spam~., data = spam.train)
# Train confusion matrix:
.predictions.train <- predict(svm.1)
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(svm.1, newdata = spam.test)
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```

## Regression
```{r SVM regression}
svm.2 <- svm(lcavol~., data = prostate.train)
# Train error:
MSE( predict(svm.2)- prostate.train$lcavol)
# Test error:
MSE( predict(svm.2, newdata = prostate.test)- prostate.test$lcavol)
```


# GAM Regression
```{r GAM}
# install.packages('mgcv')
library(mgcv)
form.1 <- lcavol~ s(lweight)+ s(age)+s(lbph)+s(svi)+s(lcp)+s(gleason)+s(pgg45)+s(lpsa)
gam.1 <- gam(form.1, data = prostate.train) # the model is too rich; let's select a variable subset
ridge.1 %>% coef %>% abs %>% sort(decreasing = TRUE) # select the most promising coefficients (a very arbitrary practice)
form.2 <- lcavol~ s(lweight)+ s(age)+s(lbph)+s(lcp)+s(pgg45)+s(lpsa) # keep only promising coefficients in the model
gam.2 <- gam(form.2, data = prostate.train)
# Train error:
MSE( predict(gam.2)- prostate.train$lcavol)
# Test error:
MSE( predict(gam.2, newdata = prostate.test)- prostate.test$lcavol)
```

# Neural Net
## Regression
```{r NNET regression}
library(nnet)
nnet.1 <- nnet(lcavol~., size=20, data=prostate.train, rang = 0.1, decay = 5e-4, maxit = 1000, linout = TRUE) # linout=TRUE for a linear (regression) output unit
# Train error:
MSE( predict(nnet.1)- prostate.train$lcavol)
# Test error:
MSE( predict(nnet.1, newdata = prostate.test)- prostate.test$lcavol)
```

Let's automate the network size selection:


```{r NNET validate}
validate.nnet <- function(size){
.nnet <- nnet(lcavol~., size=size, data=prostate.train, rang = 0.1, decay = 5e-4, maxit = 200, linout = TRUE)
.train <- MSE( predict(.nnet)- prostate.train$lcavol)
.test <- MSE( predict(.nnet, newdata = prostate.test)- prostate.test$lcavol)
return(list(train=.train, test=.test))
}


validate.nnet(3)
validate.nnet(4)
validate.nnet(20)
validate.nnet(50)
sizes <- seq(2, 30)
validate.sizes <- rep(NA, length(sizes))
for (i in seq_along(sizes)){
validate.sizes[i] <- validate.nnet(sizes[i])$test
}
plot(validate.sizes~sizes, type='l')
```
What can I say... This plot is not what I would expect. Could be due to the random nature of
the fitting algorithm.

## Classification
```{r NNET Classification}
nnet.2 <- nnet(spam~., size=5, data=spam.train, rang = 0.1, decay = 5e-4, maxit = 1000)
# Train confusion matrix:
.predictions.train <- predict(nnet.2, type='class')
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(nnet.2, newdata = spam.test, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```

# CART

## Regression
```{r Tree regression}
library(rpart)
tree.1 <- rpart(lcavol~., data=prostate.train)
# Train error:
MSE( predict(tree.1)- prostate.train$lcavol)
# Test error:
MSE( predict(tree.1, newdata = prostate.test)- prostate.test$lcavol)
```
At this stage we should prune the tree using `prune()`...
## Classification

```{r Tree classification}


tree.2 <- rpart(spam~., data=spam.train)
# Train confusion matrix:
.predictions.train <- predict(tree.2, type='class')
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(tree.2, newdata = spam.test, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```

# Random Forest
TODO
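A minimal sketch with the `randomForest` package (assuming the prostate and spam splits from make_samples.R):
```{r Random Forest sketch}
# install.packages('randomForest')
library(randomForest)
# Regression:
rf.1 <- randomForest(lcavol ~ ., data = prostate.train)
MSE(predict(rf.1) - prostate.train$lcavol) # out-of-bag based error
MSE(predict(rf.1, newdata = prostate.test) - prostate.test$lcavol) # test error
# Classification:
rf.2 <- randomForest(spam ~ ., data = spam.train)
.predictions.test <- predict(rf.2, newdata = spam.test)
(confusion.test <- table(prediction = .predictions.test, truth = spam.test$spam))
missclassification(confusion.test)
```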
# Rotation Forest
TODO

# Smoothing Splines
I will demonstrate the method with a single predictor, so that we can visualize the smoothing
that has been performed:
```{r Smoothing Splines}
spline.1 <- smooth.spline(x=X.train, y=y.train)
# Visualize the non linear hypothesis we have learned:
plot(y.train~X.train, col='red', type='h')
points(spline.1, type='l')
```
I am not extracting train and test errors as the output of `smooth.spline` will require some
tweaking for that.

# KNN
## Classification
```{r knn classification}
library(class)
knn.1 <- knn(train = X.train.spam, test = X.test.spam, cl =y.train.spam, k = 1)


# Test confusion matrix:


.predictions.test <- knn.1
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```
And now we would try to optimize `k` by trying different values.

# Kernel Regression
Kernel regression includes many particular algorithms.
```{r kernel}
# install.packages('np')
library(np)
ksmooth.1 <- npreg(txdat =X.train, tydat = y.train)
# Train error:
MSE( predict(ksmooth.1)- prostate.train$lcavol)
```
There is currently no method to make prediction on test data with this function.

# Stacking
As seen in the class notes, there are many ensemble methods.
Stacking, in my view, is by far the most useful and coolest. It is thus the only one I present
here.
The following example is adapted from [James E.
Yonamine](http://jayyonamine.com/?p=456).
```{r Stacking}
#####step 1: train models ####
#logits
logistic.2 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 0)
logistic.3 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 1)

# Learning Vector Quantization (LVQ)


my.codebook<-lvqinit(x=X.train.spam, cl=y.train.spam, size=10, prior=c(0.5,0.5),k = 2)
my.codebook<-lvq1(x=X.train.spam, cl=y.train.spam, codebk=my.codebook, niter = 100 *
nrow(my.codebook$x), alpha = 0.03)
# SVM
library('e1071')
svm.fit <- svm(y=y.train.spam, x=X.train.spam, probability=TRUE)


#####step 2a: build predictions for data.train####


train.predict<- cbind(
predict(logistic.2, newx=X.train.spam, type="response"),
predict(logistic.3, newx=X.train.spam, type="response"),
knn1(train=my.codebook$x, test=X.train.spam, cl=my.codebook$cl),
predict(svm.fit, X.train.spam, probability=TRUE)
)
####step 2b: build predictions for data.test####
test.predict <- cbind(
predict(logistic.2, newx=X.test.spam, type="response"),
predict(logistic.3, newx=X.test.spam, type="response"),
predict(svm.fit, newdata = X.test.spam, probability = TRUE),
knn1(train=my.codebook$x, test=X.test.spam, cl=my.codebook$cl)
)

####step 3: train SVM on train.predict####


final <- svm(y=y.train.spam, x=train.predict, probability=TRUE)
####step 4: use trained SVM to make predictions with test.predict####
final.predict <- predict(final, test.predict, probability=TRUE)
results<-as.matrix(final.predict)
table(results, y.test.spam)
```

# Fisher's LDA
```{r LDA}
library(MASS)
lda.1 <- lda(spam~., spam.train)
# Train confusion matrix:
.predictions.train <- predict(lda.1)$class
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(lda.1, newdata = spam.test)$class
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```
__Caution__:
Both `MASS` and `dplyr` have a function called `select`. I will thus try to avoid having the two packages
loaded at once, or call the function by its full name: `MASS::select` or `dplyr::select`.


# Naive Bayes
```{r Naive Bayes}
library(e1071)
nb.1 <- naiveBayes(spam~., data = spam.train)
# Train confusion matrix:
.predictions.train <- predict(nb.1, newdata = spam.train)
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(nb.1, newdata = spam.test)
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```


Unsupervised Learning R code


Some utility functions:
```{r utility}
l2 <- function(x) x^2 %>% sum %>% sqrt
l1 <- function(x) abs(x) %>% sum
MSE <- function(x) x^2 %>% mean
# Matrix norms:
frobenius <- function(A) norm(A, type="F")
spectral <- function(A) norm(A, type="2")
```

__Note__: `foo::bar` means that function `bar` is part of the `foo` package.
With this syntax, there is no need to load (`library`) the package.
If a line does not run, you may need to install the package: `install.packages('bar')`.
Sadly, RStudio currently does not autocomplete function arguments when using the `::`
syntax.

# Learning Distributions
## Gaussian Density Estimation
```{r}
# Sample from a multivariate Gaussian:
## Generate a covariance matrix
p <- 10
Sigma <- bayesm::rwishart(nu = 100, V = diag(p))$W
lattice::levelplot(Sigma)
# Sample from a multivariate Gaussian:
n <- 1e3
means <- 1:p
X1 <- mvtnorm::rmvnorm(n = n, sigma = Sigma, mean = means)
dim(X1)
# Estimate parameters and compare to the truth:
estim.means <- colMeans(X1) # recall the truth is (1,...,p)
plot(estim.means~means); abline(0,1, lty=2)
estim.cov <- cov(X1)
estim.cov.errors <- Sigma - estim.cov
lattice::levelplot(estim.cov.errors)
plot(estim.cov~Sigma); abline(0,1, lty=2)
frobenius(estim.cov.errors)


# Now try the same while playing with n and p.


```

Other covariance estimators (robust, fast,...)


```{r covariances}
# Robust covariance
estim.cov.1 <- MASS::cov.rob(X1)$cov
estim.cov.errors.1 <- Sigma - estim.cov.1
lattice::levelplot(estim.cov.errors.1)
frobenius(estim.cov.errors.1)
# Nearest neighbour cleaning of outliers
estim.cov.2 <- covRobust::cov.nnve(X1)$cov
estim.cov.errors.2 <- Sigma - estim.cov.2
lattice::levelplot(estim.cov.errors.2)
frobenius(estim.cov.errors.2)

# Robust covariance via the Minimum Covariance Determinant (MCD)
estim.cov.3 <- robustbase::covMcd(X1)$cov
estim.cov.errors.3 <- Sigma - estim.cov.3
lattice::levelplot(estim.cov.errors.3)
frobenius(estim.cov.errors.3)

# Another robust covariance estimator


estim.cov.4 <- robustbase::covComed(X1)$cov
estim.cov.errors.4 <- Sigma - estim.cov.4
lattice::levelplot(estim.cov.errors.4)
frobenius(estim.cov.errors.4)
```
## Non parametric density estimation
There is nothing that will even try dimensions higher than 6.
See [here](http://vita.had.co.nz/papers/density-estimation.pdf) for a review.

## Association rules
Note: Visualization examples are taken from the arulesViz [vignette](http://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf)
```{r association rules}
library(arules)
data("Groceries")
inspect(Groceries[1:2])
summary(Groceries)
rules <- apriori(Groceries, parameter = list(support=0.001, confidence=0.5))

summary(rules)
rules %>% sort(by='lift') %>% head %>% inspect

# Select a subset of rules


rule.subset <- subset(rules, subset = rhs %pin% "yogurt")
inspect(rule.subset)
# Visualize rules:
library(arulesViz)
plot(rules)
subrules <- rules[quality(rules)$confidence > 0.8]
plot(subrules, method="matrix", measure="lift", control=list(reorder=TRUE))
plot(subrules, method="matrix", measure=c("lift", "confidence"),
control=list(reorder=TRUE))
plot(subrules, method="grouped")
plot(rules, method="grouped", control=list(k=50))
subrules2 <- head(sort(rules, by="lift"), 10)
plot(subrules2, method="graph", control=list(type="items"))
plot(subrules2, method="graph")
# Export rules graph to use with other software:
# saveAsGraph(head(sort(rules, by="lift"),1000), file="rules.graphml")
rule.1 <- rules[1]
inspect(rule.1)
plot(rule.1, method="doubledecker", data = Groceries)
```
See also the `prim.box` function in the `prim` package for more algorithms to learn
association rules

# Dimensionality Reduction
## PCA
Note: example is a blend from [Gaston Sanchez](http://gastonsanchez.com/blog/howto/2012/06/17/PCA-in-R.html) and [Oregon's Geography dept.](http://geog.uoregon.edu/GeogR/topics/pca.html).

Get some data


```{r PCA data}
?USArrests
plot(USArrests) # basic plot

corrplot::corrplot(cor(USArrests), method = "ellipse") # slightly fancier

# As a correlation graph
cor.1 <- cor(USArrests)
qgraph::qgraph(cor.1)
qgraph::qgraph(cor.1, layout = "spring", posCol = "darkgreen", negCol = "darkmagenta")
```

```{r PCA}
USArrests.1 <- USArrests[,-3] %>% scale
pca1 <- prcomp(USArrests.1, scale. = TRUE)
(pca1$rotation) # loadings
# Now score the states:
pca1$x %>% extract(,1) %>% sort %>% head
```
Interpretation:
- PC1 seems to capture overall crime rate.
- PC2 seems to distinguish between sexual and non-sexual crimes.

Projecting on first two PCs:


```{r visualizing PCA}
library(ggplot2) # for graphing
pcs <- as.data.frame(pca1$x)
ggplot(data = pcs, aes(x = PC1, y = PC2, label = rownames(pcs))) +
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
geom_text(colour = "tomato", alpha = 0.8, size = 4) +
ggtitle("PCA plot of USA States - Crime Rates")
```

The bi-Plot
```{r biplot}
biplot(pca1) #ugly!

# library(devtools)
# install_github("vqv/ggbiplot")
ggbiplot::ggbiplot(pca1, labels = rownames(USArrests.1)) # better!
```

The scree-plot
```{r screeplot}
ggbiplot::ggscreeplot(pca1)
```
So clearly the main differentiation between states is captured by the first component.

Visualize the scoring as a projection of the states' attributes onto the factors.
```{r}
# get parameters of component lines (after Everitt & Rabe-Hesketh)
load <- pca1$rotation
slope <- load[2, ]/load[1, ]
mn <- apply(USArrests.1, 2, mean)
intcpt <- mn[2] - (slope * mn[1])
# scatter plot with the two new axes added
dpar(pty = "s") # square plotting frame
USArrests.2 <- USArrests[,1:2] %>% scale
xlim <- range(USArrests.2) # overall min, max
plot(USArrests.2, xlim = xlim, ylim = xlim, pch = 16, col = "purple") # both axes same
length
abline(intcpt[1], slope[1], lwd = 2) # first component solid line
abline(intcpt[2], slope[2], lwd = 2, lty = 2) # second component dashed
legend("right", legend = c("PC 1", "PC 2"), lty = c(1, 2), lwd = 2, cex = 1)
# projections of points onto PCA 1
y1 <- intcpt[1] + slope[1] * USArrests.2[, 1]
x1 <- (USArrests.1[, 2] - intcpt[1])/slope[1]
y2 <- (y1 + USArrests.1[, 2])/2
x2 <- (x1 + USArrests.1[, 1])/2
segments(USArrests.1[, 1], USArrests.1[, 2], x2, y2, lwd = 2, col = "purple")
```

Visualize the loadings:


```{r}
# install.packages('GPArotation')
pca.qgraph <- qgraph::qgraph.pca(USArrests.1, factors = 2, rotation = "varimax")
plot(pca.qgraph)
qgraph::qgraph(pca.qgraph, posCol = "darkgreen", layout = "spring", negCol =
"darkmagenta",
edge.width = 2, arrows = FALSE)
```


More implementations of PCA:


```{r}
# FAST solutions:
gmodels::fast.prcomp()
# More detail in output:
FactoMineR::PCA()
# For flexibility in algorithms and visualization:
ade4::dudi.pca()
# Another one...
# install.packages('amap')
amap::acp()
```

Principal tensor analysis:


```{r PTA}
PTAk::PTAk()
```

## sPCA
```{r sPCA}
```
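A minimal sketch using `elasticnet::spca()` (assumption: the `elasticnet` package is installed), run on `USArrests.1` from the PCA section above, asking for 2 sparse components with at most 2 non-zero loadings each:
```{r sPCA sketch}
# install.packages('elasticnet')
spca.1 <- elasticnet::spca(USArrests.1, K = 2, type = "predictor",
                           sparse = "varnum", para = c(2, 2))
spca.1$loadings # note the zeroed-out loadings
```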

## kPCA
```{r kPCA}
kernlab::kpca()
```
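A minimal usage sketch on the scaled crime data with an RBF kernel:
```{r kPCA sketch}
kpca.1 <- kernlab::kpca(as.matrix(USArrests.1), kernel = "rbfdot", features = 2)
head(kernlab::rotated(kpca.1)) # the data points embedded in the 2 kernel PCs
```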

## Random Projections
```{r Random Projections}
```
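Random projections are simple enough to sketch in base R: project `USArrests.1` onto k random Gaussian directions.
```{r Random Projections sketch}
k <- 2
p <- ncol(USArrests.1)
W <- matrix(rnorm(p * k), p, k) / sqrt(k) # random projection matrix
X.projected <- USArrests.1 %*% W
head(X.projected)
```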

## MDS
```{r MDS}
stats::cmdscale()
MASS::sammon()
MASS::isoMDS()

```
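A minimal usage sketch of classical MDS on the built-in `eurodist` road distances:
```{r MDS sketch}
mds.1 <- cmdscale(eurodist, k = 2)
plot(mds.1, type = "n")
text(mds.1, labels = rownames(mds.1)) # cities placed by their pairwise distances
```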

## Isomap
```{r Isomap}
```

## LLE
```{r LLE}
```
## LocalMDS
```{r Local MDS}
```
## Principal Curves & Surfaces
```{r Principla curves}
```

# Latent Space Generative Models


## FA
```{r factor analysis}
psych::principal()
```
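`psych::principal()` performs a principal-components style analysis; for maximum likelihood factor analysis, a minimal sketch with base R's `factanal()` on `USArrests.1` (which, with only 3 variables, can support a single factor) is:
```{r FA sketch}
fa.1 <- factanal(USArrests.1, factors = 1)
fa.1$loadings
```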
## ICA
```{r ICA}
fastICA::fastICA() # Also performs projection pursuit
```
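A quick usage sketch, extracting 2 independent components from the scaled crime data:
```{r ICA sketch}
ica.1 <- fastICA::fastICA(USArrests.1, n.comp = 2)
head(ica.1$S) # the estimated independent components
```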

## Exploratory Projection Pursuit


```{r exploratory projection pursuit}
# install.packages('REPPlab')
library(REPPlab) # will require the `rJava` package
```


## Generative Topographic Map


[TODO]
## Finite Mixture
```{r mixtures}
# install.packages('mixtools')
library(mixtools)
```
Read [this](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch20.pdf) for more
information.

## HMM
```{r}
# install.packages('HiddenMarkov')
library(HiddenMarkov)

```

# Clustering:
Generate clusters:
```{r generate clusters}
X <- clusterGeneration::genRandomClust(numClust=2)
clusterGeneration::viewClusters(X, cl=2)
```

## K-means
```{r kmeans}
stats::kmeans()
```
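A minimal usage sketch on the scaled crime data:
```{r kmeans sketch}
kmeans.1 <- kmeans(USArrests.1, centers = 2, nstart = 20)
kmeans.1$centers
table(kmeans.1$cluster)
```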
## Kmeans++
```{r kmeansPP}
kmpp <- function(X, k) {
  # k-means++ initialization; distmat() here is taken from the pracma package
  n <- nrow(X)
  C <- numeric(k)
  C[1] <- sample(1:n, 1)
  for (i in 2:k) {
    dm <- pracma::distmat(X, X[C, , drop = FALSE])
    pr <- apply(dm, 1, min); pr[C] <- 0
    C[i] <- sample(1:n, 1, prob = pr)
  }
  kmeans(X, X[C, , drop = FALSE])
}
```

## K-medoids
```{r kmedoids}
cluster::pam()
# Many other similarity measures:
proxy::dist()
```
## Hierarchical
```{r}
hclust()
# install.packages('cluster')
library(cluster)
agnes()
```
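
A short usage sketch, again on the scaled crime data:
```{r hclust sketch}
hclust.1 <- hclust(dist(USArrests.1)) # complete linkage by default
plot(hclust.1)
cutree(hclust.1, k = 2) # cut the dendrogram into 2 clusters
```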

## Self Organizing Maps


You may note the similar function names. This is why the `::` syntax is very useful.
```{r SOM}
# install.packages('som')
library(som)
som::som()
kohonen::som()
class::SOM()
```

## Spectral Clustering
```{r}
# install.packages('kernlab')
library(kernlab)
specc()
```

