Perceptron, Support Vector Machines and Margins, the Kernel Trick, and K-Nearest Neighbor
Linear Separability
[Figure: scatter plot of the data on axes X1 and X2.] Data has two features: X1 and X2. Two possible labels: blue and red.
Linear Separator
Linear Classification
Suppose there are N input variables, X1, ..., XN (all real numbers). A linear classifier is a function that looks like this: if w0 + w1 X1 + ... + wN XN ≥ 0, return Class 1 (e.g., red); otherwise, return Class 2 (e.g., blue).
The wi variables are called weights or parameters. Each one is a real number.
The set of all functions that look like this (one function for each choice of weights w0 through wN) is called the hypothesis class for linear classification.
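As a concrete illustration, here is a minimal sketch of this decision rule in Python; the function name, example weights, and class names are illustrative assumptions, not from the slides.

def classify(x, w):
    # x: the N input values X1..XN; w: the N+1 weights [w0, w1, ..., wN].
    score = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return "red" if score >= 0 else "blue"   # Class 1 vs. Class 2

print(classify([1.0, 2.0], [0.5, 1.0, -0.25]))  # score = 1.0, so "red"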
Hypotheses
[Figure: data on the X1-X2 plane with a linear separator and two query points. B: Which label? C: Which label?]
Perceptron Algorithm
Input: labeled training examples (x, y). Output: weights wj.
Loop over the training examples one at a time. Predict a label for the current example using the current weights; if the prediction is wrong, adjust each weight by the learning rate times the error times the corresponding input: wj ← wj + α (y - ŷ) xj.
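Here is a minimal runnable sketch of this training loop in Python, assuming labels are +1 and -1 and a fixed learning rate; the function name, toy data, and hyperparameters are illustrative, not from the slides.

def train_perceptron(data, n_features, learning_rate=0.1, epochs=100):
    # w[0] plays the role of w0 (the bias); w[1:] are w1..wN.
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for x, y in data:                    # one example at a time (online)
            score = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            y_hat = 1 if score >= 0 else -1  # current prediction
            error = y - y_hat                # 0 if correct, otherwise +2 or -2
            if error != 0:                   # update weights only on mistakes
                w[0] += learning_rate * error
                for j in range(n_features):
                    w[j + 1] += learning_rate * error * x[j]
    return w

# Linearly separable toy data: label +1 when X1 > X2, else -1.
data = [([2.0, 1.0], 1), ([3.0, 0.5], 1), ([1.0, 2.0], -1), ([0.5, 3.0], -1)]
print(train_perceptron(data, n_features=2))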
Properties of Perceptron
Convergence: If the data set is linearly separable, then the Perceptron algorithm converges to a linear separator (amazingly enough). If there is no linear separator, then Perceptron will keep moving the line around forever.
Online: Unlike gradient descent, MLE, etc., the Perceptron algorithm can train by looking at one example at a time, rather than processing all of the data in a batch. This is called an online training algorithm.
Quiz
[Figure: three candidate linear separators, labeled a, b, and c, on the X1-X2 plane.] Which separator is best?
Answer
[Figure: the same three separators a, b, and c.]
It's an opinion question, so any answer is acceptable. But machine learning people prefer b. Intuitively, b has the best chance of classifying a new data point correctly; a and c are overfitting.
Margin
Margin: the distance between the linear separator and the nearest data point.
[Figure: the three separators a, b, and c on the X1-X2 plane, with the margin of each shown.]
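To make this definition concrete: the distance from a point x to the hyperplane w0 + w1 X1 + ... + wN XN = 0 is |w0 + w1 x1 + ... + wN xN| / ||(w1, ..., wN)||, so the margin is the minimum of that distance over the data set. A small Python sketch, with illustrative names and data:

import math

def margin(points, w):
    # ||(w1, ..., wN)||: length of the weight vector, excluding the bias w0.
    norm = math.sqrt(sum(wi * wi for wi in w[1:]))
    # Distance from each point to the separator, minimized over the data set.
    return min(abs(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) / norm
               for x in points)

points = [[2.0, 1.0], [1.0, 2.0], [3.0, 0.0]]
print(margin(points, [0.0, 1.0, -1.0]))  # separator X1 - X2 = 0; prints ~0.707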
Quiz: Margins
[Figure: the same three separators a, b, and c.] Which separator has the largest margin?
Answer: Margins
[Figure: the same three separators a, b, and c.] Separator b has the largest margin. Preferring the separator with the maximum margin is the idea behind Support Vector Machines.
The Kernel Trick is to add a new input variable that is computed from the existing ones.
Let X3 = X1² + X2².
[Figure: plotted against axes X1 and X3, the data is now linearly separable.]
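For instance, data separated by a circle in the (X1, X2) plane becomes linearly separable once X3 = X1² + X2² is added. A small Python sketch; the points and the radius-1 threshold are illustrative assumptions:

def add_feature(x):
    # The new input variable here: append X3 = X1^2 + X2^2 to the features.
    x1, x2 = x
    return [x1, x2, x1 ** 2 + x2 ** 2]

# No line separates these two groups in the (X1, X2) plane...
inside = [[0.1, 0.2], [-0.3, 0.1], [0.0, -0.4]]
outside = [[1.5, 0.0], [-1.2, 1.1], [0.0, -2.0]]

# ...but with X3 added, the plane X3 = 1 separates them perfectly.
for x in inside + outside:
    x1, x2, x3 = add_feature(x)
    print(x, "->", "inside" if x3 < 1.0 else "outside")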
Properties of k-NN
Convergence: As the number of training examples grows, the expected accuracy on test data points approaches 100%.
Smoothing: Higher values of k can be used to combat overfitting. Typically, only odd values of k are used, to ensure that there are no ties during prediction.
Complexity: Training k-NN is very simple: just memorize each training data point. However, finding the nearest neighbors at test time can be an expensive operation. All sorts of hashing and indexing techniques have been invented to improve the time complexity of inference, but this remains an active area of study.
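A minimal Python sketch matching these properties: training just memorizes the data, and prediction scans every stored point for the k nearest. Names and toy data are illustrative; a real system would use the indexing techniques mentioned above.

import math
from collections import Counter

def knn_predict(train, x, k=3):
    # "Training" was memorization: train is just the stored (point, label) list.
    # Sort stored points by Euclidean distance to the query point x.
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    # Majority vote among the k nearest neighbors (odd k avoids ties).
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], "blue"), ([0.1, 0.2], "blue"),
         ([1.0, 1.0], "red"), ([0.9, 1.1], "red"), ([1.2, 0.8], "red")]
print(knn_predict(train, [0.2, 0.1]))  # -> "blue"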
Summary

Bayes Net: classification (from what you've seen, although it's possible to do regression as well); generative; parametric; trained by MLE with Laplace smoothing; supervised; batch; closed-form.
Naïve Bayes: classification; generative; parametric; trained by MLE with Laplace smoothing; supervised; batch; closed-form.
Linear Regression: regression; discriminative; parametric; trained by minimizing squared error, either in closed form or iteratively by gradient descent; supervised; batch.
Linear Classification: classification; discriminative; parametric; trained by the Perceptron algorithm; supervised; online; iterative.
k-NN: classification (or regression); discriminative; non-parametric; training is memorization; supervised; online; closed-form.