Perceptron, Support Vector Machines and Margins, the Kernel Trick, and K-Nearest Neighbor
Linear Separability
[Figure: scatter plot of the data on axes X1 and X2.] Data has two features: X1 and X2. Two possible labels: blue and red.
Linear Separator
Linear Classification
Suppose there are N input variables, X1, ..., XN (all real numbers). A linear classifier is a function that looks like this: if w0 + w1 X1 + ... + wN XN ≥ 0, return Class 1 (e.g., red); otherwise, return Class 2 (e.g., blue).
The wi variables are called weights or parameters. Each one is a real number.
The set of all functions that look like this (one function for each choice of weights w0 through wN) is called the hypothesis class for linear classification.
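As a concrete illustration, here is a minimal sketch of this decision rule in Python; the function name, example weights, and class names are illustrative assumptions, not from the slides.

def classify(x, w):
    # x: the N input values X1..XN; w: the N+1 weights [w0, w1, ..., wN].
    score = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return "red" if score >= 0 else "blue"   # Class 1 vs. Class 2

print(classify([1.0, 2.0], [0.5, 1.0, -0.25]))  # score = 1.0, so "red"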
Hypotheses
[Figure: data on the X1-X2 plane with a linear separator and two query points. B: Which label? C: Which label?]
Perceptron Algorithm
Input: labeled training examples (x, y). Output: weights wj.
Loop over the training examples one at a time. Predict a label for the current example using the current weights; if the prediction is wrong, adjust each weight by the learning rate times the error times the corresponding input: wj ← wj + α (y - ŷ) xj.
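Here is a minimal runnable sketch of this training loop in Python, assuming labels are +1 and -1 and a fixed learning rate; the function name, toy data, and hyperparameters are illustrative, not from the slides.

def train_perceptron(data, n_features, learning_rate=0.1, epochs=100):
    # w[0] plays the role of w0 (the bias); w[1:] are w1..wN.
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for x, y in data:                    # one example at a time (online)
            score = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            y_hat = 1 if score >= 0 else -1  # current prediction
            error = y - y_hat                # 0 if correct, otherwise +2 or -2
            if error != 0:                   # update weights only on mistakes
                w[0] += learning_rate * error
                for j in range(n_features):
                    w[j + 1] += learning_rate * error * x[j]
    return w

# Linearly separable toy data: label +1 when X1 > X2, else -1.
data = [([2.0, 1.0], 1), ([3.0, 0.5], 1), ([1.0, 2.0], -1), ([0.5, 3.0], -1)]
print(train_perceptron(data, n_features=2))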
Properties of Perceptron
Convergence: If the data set is linearly separable, then the Perceptron algorithm converges to a linear separator (amazingly enough). If there is no linear separator, then Perceptron will keep moving the line around forever.
Online: Unlike gradient descent, MLE, etc., the Perceptron algorithm can train by looking at one example at a time, rather than processing all of the data in a batch. This is called an online training algorithm.
Quiz
[Figure: three candidate linear separators, labeled a, b, and c, on the X1-X2 plane.] Which separator is best?
Answer
[Figure: the same three separators a, b, and c.]
It's an opinion question, so any answer is acceptable. But machine learning people prefer b. Intuitively, b has the best chance of classifying a new data point correctly; a and c are overfitting.
Margin
Margin: the distance between the linear separator and the nearest data point.
[Figure: the three separators a, b, and c on the X1-X2 plane, with the margin of each shown.]
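To make this definition concrete: the distance from a point x to the hyperplane w0 + w1 X1 + ... + wN XN = 0 is |w0 + w1 x1 + ... + wN xN| / ||(w1, ..., wN)||, so the margin is the minimum of that distance over the data set. A small Python sketch, with illustrative names and data:

import math

def margin(points, w):
    # ||(w1, ..., wN)||: length of the weight vector, excluding the bias w0.
    norm = math.sqrt(sum(wi * wi for wi in w[1:]))
    # Distance from each point to the separator, minimized over the data set.
    return min(abs(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) / norm
               for x in points)

points = [[2.0, 1.0], [1.0, 2.0], [3.0, 0.0]]
print(margin(points, [0.0, 1.0, -1.0]))  # separator X1 - X2 = 0; prints ~0.707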
Quiz: Margins
[Figure: the same three separators a, b, and c.] Which separator has the largest margin?
Answer: Margins
[Figure: the same three separators a, b, and c.] Separator b has the largest margin. Preferring the separator with the maximum margin is the idea behind Support Vector Machines.
The Kernel Trick is to add a new input variable that is computed from the existing ones.
Let X3 = X1² + X2².
[Figure: plotted against axes X1 and X3, the data is now linearly separable.]
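For instance, data separated by a circle in the (X1, X2) plane becomes linearly separable once X3 = X1² + X2² is added. A small Python sketch; the points and the radius-1 threshold are illustrative assumptions:

def add_feature(x):
    # The new input variable here: append X3 = X1^2 + X2^2 to the features.
    x1, x2 = x
    return [x1, x2, x1 ** 2 + x2 ** 2]

# No line separates these two groups in the (X1, X2) plane...
inside = [[0.1, 0.2], [-0.3, 0.1], [0.0, -0.4]]
outside = [[1.5, 0.0], [-1.2, 1.1], [0.0, -2.0]]

# ...but with X3 added, the plane X3 = 1 separates them perfectly.
for x in inside + outside:
    x1, x2, x3 = add_feature(x)
    print(x, "->", "inside" if x3 < 1.0 else "outside")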
Properties of k-NN
Convergence: As the number of training examples grows, the expected accuracy on test data points approaches 100%.
Smoothing: Higher values of k can be used to combat overfitting. Typically, only odd values of k are used, to ensure that there are no ties during prediction.
Complexity: Training k-NN is very simple: just memorize each training data point. However, finding the nearest neighbors at test time can be an expensive operation. All sorts of hashing and indexing techniques have been invented to improve the time complexity of inference, but this remains an active area of study.
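A minimal Python sketch matching these properties: training just memorizes the data, and prediction scans every stored point for the k nearest. Names and toy data are illustrative; a real system would use the indexing techniques mentioned above.

import math
from collections import Counter

def knn_predict(train, x, k=3):
    # "Training" was memorization: train is just the stored (point, label) list.
    # Sort stored points by Euclidean distance to the query point x.
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    # Majority vote among the k nearest neighbors (odd k avoids ties).
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], "blue"), ([0.1, 0.2], "blue"),
         ([1.0, 1.0], "red"), ([0.9, 1.1], "red"), ([1.2, 0.8], "red")]
print(knn_predict(train, [0.2, 0.1]))  # -> "blue"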
Summary

Bayes Net: classification (from what you've seen, although it's possible to do regression as well); generative; parametric; trained by MLE with Laplace smoothing; supervised; batch; closed-form.
Naïve Bayes: classification; generative; parametric; trained by MLE with Laplace smoothing; supervised; batch; closed-form.
Linear Regression: regression; discriminative; parametric; trained by minimizing squared error, either in closed form or iteratively by gradient descent; supervised; batch.
Linear Classification: classification; discriminative; parametric; trained by the Perceptron algorithm; supervised; online; iterative.
k-NN: classification (or regression); discriminative; non-parametric; training is memorization; supervised; online; closed-form.