
Artificial Intelligence:

Machine Learning - 1

 Russell & Norvig: Sections 18.1 to 18.4

Motivation

Too many applications to list here!

Recommender systems
  e.g. Amazon --> you might like this book
  e.g. LinkedIn --> people you might know
Pattern recognition
  Learning to recognize postal codes
  Handwriting recognition --> who wrote this cheque?

Why Machine Learning?

To construct programs that automatically improve with experience.
Learning is a crucial characteristic of an intelligent agent.
A learning system
  performs tasks and improves its performance the more tasks it accomplishes
  generalizes from given experiences and is able to make judgments on unseen cases

Machine Learning vs Data Mining

Supervised Machine Learning
  Try to predict the outcome of new data from a training set of already classified examples
  "I know what I am looking for, and I am using this training data to help me generalise."
  E.g. diagnostic system: which treatments are best for new diseases?
  E.g. speech recognition: given the acoustic pattern, which word was uttered?

Unsupervised Machine Learning / Data Mining / Knowledge Discovery
  Try to discover valuable knowledge/patterns from large databases of grocery bills, financial transactions, medical records, etc.
  "I'm not sure what I am looking for, but I am sure there's something interesting."
  E.g. anomaly detection: odd use of a credit card?
  E.g. learning association rules: are strawberries & whipped cream bought together?
4

Types of Learning

In Supervised learning
  We are given a training set of (X, f(X)) pairs
    big nose, big teeth, big eyes, no moustache         f(X) = not person
    small nose, small teeth, small eyes, no moustache   f(X) = person
    small nose, big teeth, small eyes, moustache        f(X) = ?

In Reinforcement learning
  We are not given the (X, f(X)) pairs
    small nose, big teeth, small eyes                   f(X) = ?
  But somehow we are told whether our learned f(X) is right or wrong
  Goal: maximize the number of right answers

In Unsupervised learning
  We are only given the Xs - not the corresponding f(X)
    big nose, big teeth, big eyes, no moustache         not given
    small nose, small teeth, small eyes, no moustache   not given
    small nose, big teeth, small eyes, moustache        f(X) = ?
  No teacher involved / Goal: find regularities among the Xs (clustering)
    Data mining
5

Remember this slide

Inductive Learning

= learning from examples
  Most work in ML
  Examples are given (positive and/or negative) to train a system in a classification (or regression) task
  Extrapolate from the training set to make accurate predictions about future examples
Can be seen as learning a function
  Given a new instance X you have never seen, you must find an estimate of the function f(X), where f(X) is the desired output
  Ex: X = small nose, big teeth, small eyes, moustache --> f(X) = ?
  X = features of a face (ex. small nose, big teeth, ...)
  f(X) = function to tell if X represents a human face or not
  (a small representation sketch follows after this slide)
7
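To make the (X, f(X)) view concrete, here is a minimal sketch (not from the slides) of how such training pairs and a learned estimate of f might look in Python; the dictionary encoding of the features and the toy rule inside f_hat are assumptions for illustration only.

```python
# Illustrative encoding of (X, f(X)) pairs; the dict format is an assumption.
training_set = [
    ({"nose": "big", "teeth": "big", "eyes": "big", "moustache": "no"}, "not person"),
    ({"nose": "small", "teeth": "small", "eyes": "small", "moustache": "no"}, "person"),
]

def f_hat(x):
    """Toy stand-in for whatever estimate of f(X) a learner might produce."""
    return "person" if x["nose"] == "small" else "not person"

# An unseen instance: the learner must guess its f(X).
new_x = {"nose": "small", "teeth": "big", "eyes": "small", "moustache": "yes"}
print(f_hat(new_x))
```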

Example

Given 5 pairs (X, f(X)) (the training set)
  Find a function that fits the training set well
  So that given a new X, you can predict its f(X) value

(Figure: candidate functions f(X) fitted through the 5 training points)

Note: choosing one function over another beyond just looking at the training set is called inductive bias
  ex: prefer "smoother" functions
  ex: if an attribute does not seem discriminating, drop it

source: Russell & Norvig (2003)

Inductive Learning Framework

Input data are represented by a vector of features, X
  Each vector X is a list of (attribute, value) pairs
  Ex: X = [nose:big, teeth:big, eyes:big, moustache:no]
  The number of features is fixed (positive, finite)
  Each attribute has a fixed, finite number of possible values
Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of features
  if the output is discrete --> classification
  if the output is continuous --> regression
9

Example

Real ML applications typically require hundreds or thousands of examples


source: Alison Cawsey: The Essence of AI (1997).
10

Techniques in ML

Probabilistic methods
  ex: Naïve Bayes Classifier
Decision trees
  Use only discriminating features as questions in a big if-then-else tree
Genetic algorithms
  Good functions evolve out of a population by combining possible solutions to produce offspring solutions and killing off the weaker of those solutions
Neural networks
  Also called parallel distributed processing or connectionist systems
  Intelligence arises from having a large number of simple computational units
11

Today







Last time: Probabilistic Methods


Decision trees
( Evaluation
Unsupervised learning)
Next time: Neural Networks
Next time: Genetic Algorithms

12

Guess Who?

Play online
13

Decision Trees

Simplest, but most successful form of learning algorithm
A very well-known algorithm is ID3 (Quinlan, 1987) and its successor C4.5
Look for features that are very good indicators of the result, and place these features (as questions) in the nodes of the tree
Split the examples so that those with different values for the chosen feature are in different sets
Repeat the same process with another feature
14

ID3 / C4.5 Algorithm

Top-down construction of the decision tree
  Recursive selection of the best feature to use at the current node in the tree
  Once the feature is selected for the current node, generate child nodes, one for each possible value of the selected attribute
  Partition the examples using the possible values of this attribute, and assign these subsets of the examples to the appropriate child node
  Repeat for each child node until all examples associated with a node are either all positive or all negative
  (a code sketch of this recursion follows after this slide)
15
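To make the recursion above concrete, here is a minimal Python sketch of ID3-style tree construction. This is an illustration, not Quinlan's actual ID3/C4.5 implementation: the list-of-dicts data format, the "label" key, and the helper names entropy and info_gain are assumptions; information gain itself is defined later in these slides.

```python
import math
from collections import Counter

def entropy(examples):
    """H of the class-label distribution of a set of examples."""
    counts = Counter(ex["label"] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr):
    """Expected entropy reduction from splitting on attribute attr."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    """Recursively build a tree as nested dicts; leaves are class labels."""
    labels = {ex["label"] for ex in examples}
    if len(labels) == 1:                 # all positive or all negative: leaf
        return labels.pop()
    if not attributes:                   # no attributes left: majority label
        return Counter(ex["label"] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:   # one child per value
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best])
    return tree

# Usage (toy data in the assumed format):
#   examples = [{"works_hard": "yes", "label": "A"}, {"works_hard": "no", "label": "no A"}]
#   print(id3(examples, ["works_hard"]))
```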

Example
Info on last year's students to determine if a student will get an A this year

             Features                                            Output f(X)
Student      A last year?   Black hair?   Works hard?   Drinks?   A this year?
X1: Richard  Yes            Yes           No            Yes       No
X2: Alan     Yes            Yes           Yes           No        Yes
X3: Alison   No             No            Yes           No        No
X4: Jeff     No             Yes           No            Yes       No
X5: Gail     Yes            No            Yes           Yes       Yes
X6: Simon    No             Yes           Yes           Yes       No
16

Example

Decision tree learned from this data:

A last year?
  yes --> Works hard?
            yes --> Output = Yes
            no  --> Output = No
  no  --> Output = No

          Features                                            Output f(X)
Student   A last year?   Black hair?   Works hard?   Drinks?   A this year?
Richard   Yes            Yes           No            Yes       No
Alan      Yes            Yes           Yes           No        Yes
Alison    No             No            Yes           No        No
Jeff      No             Yes           No            Yes       No
Gail      Yes            No            Yes           Yes       Yes
Simon     No             Yes           Yes           Yes       No
17

Example 2: The Restaurant

Goal: learn whether one should wait for a table

Attributes
  Alternate: another suitable restaurant nearby
  Bar: comfortable bar for waiting
  Fri/Sat: true on Fridays and Saturdays
  Hungry: whether one is hungry
  Patrons: how many people are present (none, some, full)
  Price: price range ($, $$, $$$)
  Raining: raining outside
  Reservation: reservation made
  Type: kind of restaurant (French, Italian, Thai, Burger)
  WaitEstimate: estimated wait by host (0-10 mins, 10-30, 30-60, >60)
18

Example 2: The Restaurant




Training data:

source: Norvig (2003)

19

A First Decision Tree

But is it the best decision tree we can build?

source: Norvig (2003)

20

Ockham's Razor Principle

"It is vain to do with more what can be done with less. Entities should not be multiplied beyond necessity." [Ockham, 1324]

In other words, always favor the simplest answer that correctly fits the training data
  i.e. the smallest tree on average

This type of assumption is called inductive bias
  inductive bias = making a choice beyond what the training instances contain
21

Finding the Best Tree

Learning can be seen as searching the space of all possible decision trees (from the empty tree to a complete tree)
  Inductive bias: prefer shorter trees on average
  How? Search the space of all decision trees:
    always pick the next attribute to split the data on based on its "discriminating power" (information gain)
    in effect, a hill-climbing search where the heuristic is information gain

source: Tom Mitchell, Machine Learning (1997)
22

Which Tree is Best?

(Figure: two candidate decision trees over features F1 ... F7, each with class labels at the leaves - a shallow, bushy tree versus a deep, chain-like one)
23

Choosing the Next Attribute

The key problem is choosing which feature to split a given set of examples on.

ID3 uses Maximum Information Gain:
  Choose the attribute that has the largest information gain
  i.e., the attribute that will result in the smallest expected size of the subtrees rooted at its children
  --> information theory
24

Intuitively

Patrons:
  If value is Some: all outputs = Yes
  If value is None: all outputs = No
  If value is Full: we need more tests

Type:
  If value is French: we need more tests
  If value is Italian: we need more tests
  If value is Thai: we need more tests
  If value is Burger: we need more tests

So Patrons is more discriminating

source: Norvig (2003)
25

Next Feature

For only the data where Patrons = Full:

Hungry:
  If value is Yes: we need more tests
  If value is No: all outputs = No

Type:
  If value is French: all outputs = No
  If value is Italian: all outputs = No
  If value is Thai: we need more tests
  If value is Burger: we need more tests

So Hungry is more discriminating (only 1 new branch)
26

A Better Decision Tree





4 tests instead of 9
11 branches instead of 21

source: Norvig (2003)

27

Choosing the Next Attribute

The key problem is choosing which feature to split a given set of examples on.
Most used strategy: information theory

Entropy (or information content):

  H(X) = -\sum_{x \in X} p(x) \log_2 p(x)

  H(fair coin toss) = H(1/2, 1/2) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1 bit

  H(a, b) = entropy if probability of success = a and probability of failure = b
28

Essential Information Theory

Developed by Shannon in the 1940s
Notion of entropy (information content): how informative is a piece of information?
  If you already have a good idea about the answer --> low entropy
    then a "hint" about the right answer is not very informative
  If you have no idea about the answer (ex. a 50-50 split) --> high entropy
    then a "hint" about the right answer is very informative
29

Entropy

Let X be a discrete random variable (RV)
Entropy (or information content):

  H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

  measures the amount of information in a RV
  average uncertainty of a RV
  the average length of the message needed to transmit an outcome x_i of that variable
  measured in bits
  for only 2 outcomes x_1 and x_2: 0 \le H(X) \le 1
30

Example: The Coin Flip

Fair coin:
  H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1 bit

Rigged coin:
  H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -(99/100 \log_2 99/100 + 1/100 \log_2 1/100) = 0.08 bits

(Plot: entropy as a function of P(head) - fair coin --> high entropy, rigged coin --> low entropy)
(a quick numeric check follows after this slide)
31
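A quick numeric check of the two entropies above, as a minimal Python sketch (the helper name entropy is just for illustration):

```python
import math

def entropy(probs):
    """H = -sum p * log2(p), with the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin   -> 1.0 bit
print(entropy([0.99, 0.01]))   # rigged coin -> ~0.08 bits
```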

Choosing the Best Feature (con't)

The "discriminating power" of an attribute A given a data set S:
  Let Values(A) = the set of values that attribute A can take
  Let S_v = the set of examples in the data set which have value v for attribute A (for each value v from Values(A))

Information gain (or entropy reduction):

  gain(S, A) = H(S) - H(S | A)
             = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)
32

Some Intuition

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

Size is the least discriminating attribute (i.e. smallest information gain)
Shape and color are the most discriminating attributes (i.e. highest information gain)
33

A Small Example (1)

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

H(S) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1

Values(Color) = {red, blue}
  Color = red:  2+ 1-
  Color = blue: 0+ 1-

gain(S, Color) = H(S) - \sum_{v \in Values(Color)} \frac{|S_v|}{|S|} H(S_v)

for each v of Values(Color):
  H(S | Color = red)  = H(2/3, 1/3) = -(2/3 \log_2 2/3 + 1/3 \log_2 1/3) = 0.918
  H(S | Color = blue) = H(1, 0) = -(1 \log_2 1) = 0

H(S | Color) = (3/4)(0.918) + (1/4)(0) = 0.6885
gain(Color) = H(S) - H(S | Color) = 1 - 0.6885 = 0.3115
34

A Small Example (2)

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

Shape:
  circle: 2+ 1-
  square: 0+ 1-

Note: by definition \log_2 0 = -\infty, but 0 \log_2 0 is taken to be 0

H(S) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1
H(S | Shape) = (3/4)(0.918) + (1/4)(0) = 0.6885
gain(Shape) = H(S) - H(S | Shape) = 1 - 0.6885 = 0.3115
35

A Small Example (3)

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

Size:
  big:   1+ 1-
  small: 1+ 1-

H(S) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1
H(S | Size) = (1/2)(1) + (1/2)(1) = 1
gain(Size) = H(S) - H(S | Size) = 1 - 1 = 0
36

A Small Example (4)

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

gain(Shape) = 0.3115
gain(Color) = 0.3115
gain(Size)  = 0

So first separate according to either color or shape (root of the tree)
(a small code check follows after this slide)
37
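As a check on the three gains above, here is a minimal Python sketch over the same four examples; the tuple-based dataset and the helper names are illustrative assumptions. (The slides' 0.3115 comes from rounding H(2/3, 1/3) to 0.918; unrounded, the gain is about 0.3113.)

```python
import math
from collections import Counter

DATA = [  # (size, color, shape, output)
    ("big", "red", "circle", "+"),
    ("small", "red", "circle", "+"),
    ("small", "red", "square", "-"),
    ("big", "blue", "circle", "-"),
]

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(attr_index):
    labels = [row[-1] for row in DATA]
    remainder = 0.0
    for value in {row[attr_index] for row in DATA}:
        subset = [row[-1] for row in DATA if row[attr_index] == value]
        remainder += len(subset) / len(DATA) * entropy(subset)
    return entropy(labels) - remainder

for name, idx in [("Size", 0), ("Color", 1), ("Shape", 2)]:
    print(name, round(gain(idx), 4))   # Size 0.0, Color 0.3113, Shape 0.3113
```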

Back to the Restaurant




Training data:

source: Norvig (2003)

38

The Restaurant Example

gain(alt) = ...    gain(bar) = ...    gain(fri) = ...    gain(hun) = ...

gain(pat) = 1 - [ \frac{2}{12} H(0/2, 2/2) + \frac{4}{12} H(4/4, 0/4) + \frac{6}{12} H(2/6, 4/6) ]
          = 1 - [ \frac{2}{12} (-(0/2 \log_2 0/2 + 2/2 \log_2 2/2)) + \frac{4}{12} (-(4/4 \log_2 4/4 + 0/4 \log_2 0/4)) + ... ] \approx 0.541 bits

gain(price) = ...    gain(rain) = ...    gain(res) = ...

gain(type) = 1 - [ \frac{2}{12} H(1/2, 1/2) + \frac{2}{12} H(1/2, 1/2) + \frac{4}{12} H(2/4, 2/4) + \frac{4}{12} H(2/4, 2/4) ] = 0 bits

gain(est) = ...

Attribute pat (Patrons) has the highest gain, so the root of the tree should be attribute Patrons
Do the same recursively for the subtrees
(a quick numeric check follows after this slide)
39
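A quick arithmetic check of gain(pat) and gain(type) above, as a minimal Python sketch (the helper H is just the discrete entropy from the earlier slides):

```python
import math

def H(*probs):
    """Entropy of a discrete distribution, with 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

gain_pat = 1 - (2/12 * H(0/2, 2/2) + 4/12 * H(4/4, 0/4) + 6/12 * H(2/6, 4/6))
gain_type = 1 - (2/12 * H(1/2, 1/2) + 2/12 * H(1/2, 1/2)
                 + 4/12 * H(2/4, 2/4) + 4/12 * H(2/4, 2/4))
print(round(gain_pat, 3), round(gain_type, 3))   # 0.541 0.0
```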

Decision Boundaries of Decision Trees

(Figures: a 2-D feature space with axes Feature 1 and Feature 2 is split step by step by axis-parallel thresholds t1, t2, t3; each test in the tree, such as "Feature 2 > t1" or "Feature 1 > t1", adds one axis-aligned boundary, so the tree carves the space into rectangular regions)
40-43

Applications of Decision Trees

One of the most widely used learning methods in practice
Fast, simple, and traceable
Can out-perform human experts in many problems
  A study for diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly
  Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example
44

Today







Last time: Probabilistic Methods


Decision trees
( Evaluation
Unsupervised learning)
Next time: Neural Networks
Next time: Genetic Algorithms

45

Evaluation of Learning Approach

How do you know if what you learned is correct?
  You run your classifier on a data set of unseen examples (that you did not use for training) for which you know the correct classification
  Metric: accuracy = % of instances of the test set (the unseen examples) the algorithm correctly classifies
Eg:
46

Evaluation Methodology

Standard methodology:
1. Collect a large set of examples (all with correct classifications)
2. Divide the collection into two disjoint sets: training set and test set
3. Apply the learning algorithm to the training set
   DO NOT LOOK AT THE TEST SET!
4. Measure performance with the test set: how well does the function learned in step 3 correctly classify the examples in the test set? (i.e. compute accuracy)

Important: keep the training and test sets disjoint!
To study the robustness of the algorithm, repeat steps 2-4 for different training sets and sizes of training sets
(a minimal sketch follows after this slide)
47
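A minimal sketch of the hold-out methodology above, assuming a classifier object with hypothetical fit/predict methods (not an API defined in these slides):

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Step 2: shuffle, then divide into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classifier, test_set):
    """Step 4: fraction of unseen examples classified correctly."""
    correct = sum(1 for x, label in test_set if classifier.predict(x) == label)
    return correct / len(test_set)

# Usage, assuming `data` is a list of (features, correct_label) pairs:
#   train, test = train_test_split(data)
#   classifier.fit(train)              # step 3 - never looks at `test`
#   print(accuracy(classifier, test))
```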

Error Analysis

Where did the learner go wrong?
  Use a confusion matrix / contingency table
  Rows = the correct class (that should have been assigned); columns = the classes assigned by the learner; each row totals 100%

(Table: 6 classes C1-C6; the diagonal holds the percentage of each class that was classified correctly - C1: 99.4, C2: 90.2, C3: 93.9, C4: 95.5, C5: 96.0, C6: 93.3 - while the small off-diagonal entries show which other classes it was confused with)
(a small sketch of building such a table follows after this slide)
48
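A small sketch of tabulating such a confusion matrix from (correct, assigned) label pairs; plain Python, illustrative only, with row percentages so that each row totals 100 as in the table above:

```python
from collections import Counter, defaultdict

def confusion_matrix(pairs):
    """pairs: iterable of (correct_class, assigned_class) tuples."""
    counts = defaultdict(Counter)
    for correct, assigned in pairs:
        counts[correct][assigned] += 1
    # convert each row to percentages so every row totals 100
    return {
        correct: {assigned: 100 * n / sum(row.values()) for assigned, n in row.items()}
        for correct, row in counts.items()
    }

# Usage: confusion_matrix([("C1", "C1"), ("C1", "C3"), ("C2", "C2"), ...])
```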

A Learning Curve

(Plot: prediction quality on the test set as a function of the size of the training set)
  the more, the better
  but after a while, not much improvement

source: Mitchell (1997)
49

Some Words on Training

In all types of learning, watch out for:
  Noisy input
  Overfitting/underfitting the training data
50

Noisy Input

In all types of learning, watch out for noisy input:
  Two examples have the same feature-value pairs, but different outputs

    Size   Color   Shape    Output
    Big    Red     Circle   +
    Big    Red     Circle   -

  Some values of features are incorrect or missing (ex. errors in the data acquisition)
  Some relevant attributes are not taken into account in the data set
51

Overfitting / Underfitting

In all types of learning, watch out for:

Overfitting the training data:
  If a large number of irrelevant features are present, we may find meaningless regularities in the data that are particular to the training data but irrelevant to the problem
  i.e. you find a model that fits the training set very well using a complicated model with many features, while the real solution is much simpler

Underfitting the training data:
  The training set is not representative enough of the real problem
  i.e. you find a simple explanation with few features, while the real solution is more complex
52

Cross-validation

Another method to reduce overfitting: k-fold cross-validation
  Run k experiments; each time, test on 1/k of the data and train on the rest
  Then average the results

Ex: 10-fold cross-validation
1. Collect a large set of examples (all with correct classifications)
2. Divide the collection into two disjoint sets: training (90%) and test (10% = 1/k)
3. Apply the learning algorithm to the training set
4. Measure performance with the test set
5. Repeat steps 2-4 with the 10 different portions
6. Average the results of the 10 experiments

(Figure: in each experiment - exp1, exp2, exp3, ... - a different 1/k portion of the data is held out as the test set and the rest is used for training)
(a minimal sketch follows after this slide)
53
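A minimal sketch of k-fold cross-validation as described above; train_fn and accuracy_fn are hypothetical callables standing in for steps 3 and 4 (they are not defined in these slides):

```python
import random

def k_fold_cross_validation(examples, k, train_fn, accuracy_fn, seed=0):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]       # k disjoint portions
    scores = []
    for i in range(k):                               # one experiment per fold
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)                      # step 3: train on the rest
        scores.append(accuracy_fn(model, test))      # step 4: test on the held-out 1/k
    return sum(scores) / k                           # step 6: average the k results

# Usage: k_fold_cross_validation(data, k=10, train_fn=..., accuracy_fn=...)
```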

Today







Last time: Probabilistic Methods


Decision trees
( Evaluation
Unsupervised learning)
Next time: Neural Networks
Next time: Genetic Algorithms

54

Unsupervised Learning

Learn without labeled examples, i.e. X is given, but not f(X)
  Ex: small nose, big teeth, small eyes, moustache --> f(X) = ?

Without f(X), you can't really identify/label a test instance
But you can:
  Cluster/group the features of the test data into a number of groups
  Discriminate between these groups without actually labeling them
55

Clustering

Represent each instance as a vector <a1, a2, a3, ..., an>
  Each vector can be visually represented in an n-dimensional space

(Figure: a table of instances X1-X5 with attribute values a1, a2, a3 and unknown outputs, plotted as points in that space)
56

Clustering

Clustering algorithm:
  Represent test instances in an n-dimensional space
  Partition them into regions of high density
    How? Many algorithms (ex. k-means)
  Compute the centroid of each region as the average of the data points in the cluster
57

k-means Clustering

User selects how many clusters they want (the value of k)

1. Place k points into the space (ex. at random). These points represent the initial group centroids.
2. Assign each data point xn to the nearest centroid.
3. When all data points have been assigned, recalculate the positions of the k centroids as the average of each cluster.
4. Repeat steps 2 and 3 until none of the data instances change group.
(a minimal sketch follows after this slide)
58
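A minimal sketch of the k-means loop above, using the Euclidean distance defined on the next slide; plain Python, illustrative only (real implementations also guard against empty clusters, handled crudely here by keeping the old centroid):

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def k_means(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: initial centroids
    while True:
        # step 2: assign each data point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: euclidean(x, centroids[i]))
            clusters[nearest].append(x)
        # step 3: recompute each centroid as the average of its cluster
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # step 4: stop when the centroids (and hence the groups) no longer change
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids

# Usage: k_means([(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.0)], k=2)
```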

Euclidean Distance

To find the nearest centroid, a possible metric is the Euclidean distance
Distance between 2 points p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn):

  d = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

Where to assign a data point x?
  For all k clusters, choose the one where x has the smallest distance

(Figure: a 2-D plot of data points and centroids)
59

Example (in 2-D, i.e. 2 features)

(Figures, one step per slide:)
  Initial 3 centroids c1, c2, c3 (ex. at random)
  Partition the data points to the closest centroid
  Re-compute the new centroids
  Re-assign the data points to the new closest centroids
  Repeat until no data point changes cluster
60-64

Notes on k-means

Converges very fast!
BUT:
  very sensitive to the initial choice of centroids
  user must set the initial k
    not easy to do
    may find useless clusters
  many other clustering algorithms exist
65

Today







Last time: Probabilistic Methods


Decision trees
( Evaluation
Unsupervised learning)
Next time: Neural Networks
Next time: Genetic Algorithms

66
