
Artificial Intelligence:

Machine Learning - 1

 Russell & Norvig: Sections 18.1 to 18.4

Motivation

Too many applications to list here!

Recommender systems
  e.g. Amazon --> you might like this book
  e.g. LinkedIn --> people you might know
Pattern recognition
  Learning to recognize postal codes
  Handwriting recognition --> who wrote this cheque?

Why Machine Learning?

To construct programs that automatically improve with experience.
Learning is a crucial characteristic of an intelligent agent.
A learning system
  performs tasks and improves its performance the more tasks it accomplishes
  generalizes from given experiences and is able to make judgments on unseen cases

Machine Learning vs Data Mining

Supervised Machine Learning
  Try to predict the outcome of new data from a training set of already classified examples
  "I know what I am looking for, and I am using this training data to help me generalise."
  E.g. diagnostic system: which treatments are best for new diseases?
  E.g. speech recognition: given the acoustic pattern, which word was uttered?

Unsupervised Machine Learning / Data Mining / Knowledge Discovery
  Try to discover valuable knowledge/patterns from large databases of grocery bills, financial transactions, medical records, etc.
  "I'm not sure what I am looking for, but I am sure there's something interesting."
  E.g. anomaly detection: odd use of a credit card?
  E.g. learning association rules: are strawberries & whipped cream bought together?
4

Types of Learning

In Supervised learning
  We are given a training set of (X, f(X)) pairs
    big nose, big teeth, big eyes, no moustache         f(X) = not person
    small nose, small teeth, small eyes, no moustache   f(X) = person
    small nose, big teeth, small eyes, moustache        f(X) = ?

In Reinforcement learning
  We are not given the (X, f(X)) pairs
    small nose, big teeth, small eyes                   f(X) = ?
  But somehow we are told whether our learned f(X) is right or wrong
  Goal: maximize the number of right answers

In Unsupervised learning
  We are only given the Xs - not the corresponding f(X)
    big nose, big teeth, big eyes, no moustache         not given
    small nose, small teeth, small eyes, no moustache   not given
    small nose, big teeth, small eyes, moustache        f(X) = ?
  No teacher involved / Goal: find regularities among the Xs (clustering)
    Data mining
5

Remember this slide

Inductive Learning

= learning from examples
  Most work in ML
  Examples are given (positive and/or negative) to train a system in a classification (or regression) task
  Extrapolate from the training set to make accurate predictions about future examples
Can be seen as learning a function
  Given a new instance X you have never seen, you must find an estimate of the function f(X), where f(X) is the desired output
  Ex: X = small nose, big teeth, small eyes, moustache --> f(X) = ?
  X = features of a face (ex. small nose, big teeth, ...)
  f(X) = function to tell if X represents a human face or not
  (a small representation sketch follows after this slide)
7
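To make the (X, f(X)) view concrete, here is a minimal sketch (not from the slides) of how such training pairs and a learned estimate of f might look in Python; the dictionary encoding of the features and the toy rule inside f_hat are assumptions for illustration only.

```python
# Illustrative encoding of (X, f(X)) pairs; the dict format is an assumption.
training_set = [
    ({"nose": "big", "teeth": "big", "eyes": "big", "moustache": "no"}, "not person"),
    ({"nose": "small", "teeth": "small", "eyes": "small", "moustache": "no"}, "person"),
]

def f_hat(x):
    """Toy stand-in for whatever estimate of f(X) a learner might produce."""
    return "person" if x["nose"] == "small" else "not person"

# An unseen instance: the learner must guess its f(X).
new_x = {"nose": "small", "teeth": "big", "eyes": "small", "moustache": "yes"}
print(f_hat(new_x))
```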

Example

Given 5 pairs (X, f(X)) (the training set)
  Find a function that fits the training set well
  So that given a new X, you can predict its f(X) value

(Figure: candidate functions f(X) fitted through the 5 training points)

Note: choosing one function over another beyond just looking at the training set is called inductive bias
  ex: prefer "smoother" functions
  ex: if an attribute does not seem discriminating, drop it

source: Russell & Norvig (2003)

Inductive Learning Framework

Input data are represented by a vector of features, X
  Each vector X is a list of (attribute, value) pairs
  Ex: X = [nose:big, teeth:big, eyes:big, moustache:no]
  The number of features is fixed (positive, finite)
  Each attribute has a fixed, finite number of possible values
Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of features
  if the output is discrete --> classification
  if the output is continuous --> regression
9

Example

Real ML applications typically require hundreds or thousands of examples


source: Alison Cawsey: The Essence of AI (1997).
10

Techniques in ML

Probabilistic methods
  ex: Naïve Bayes Classifier
Decision trees
  Use only discriminating features as questions in a big if-then-else tree
Genetic algorithms
  Good functions evolve out of a population by combining possible solutions to produce offspring solutions and killing off the weaker of those solutions
Neural networks
  Also called parallel distributed processing or connectionist systems
  Intelligence arises from having a large number of simple computational units
11

Today







Last time: Probabilistic Methods


Decision trees
( Evaluation
Unsupervised learning)
Next time: Neural Networks
Next time: Genetic Algorithms

12

Guess Who?

Play online
13

Decision Trees

Simplest, but most successful form of learning algorithm
A very well-known algorithm is ID3 (Quinlan, 1987) and its successor C4.5
Look for features that are very good indicators of the result, and place these features (as questions) in the nodes of the tree
Split the examples so that those with different values for the chosen feature are in different sets
Repeat the same process with another feature
14

ID3 / C4.5 Algorithm

Top-down construction of the decision tree
  Recursive selection of the best feature to use at the current node in the tree
  Once the feature is selected for the current node, generate child nodes, one for each possible value of the selected attribute
  Partition the examples using the possible values of this attribute, and assign these subsets of the examples to the appropriate child node
  Repeat for each child node until all examples associated with a node are either all positive or all negative
  (a code sketch of this recursion follows after this slide)
15
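To make the recursion above concrete, here is a minimal Python sketch of ID3-style tree construction. This is an illustration, not Quinlan's actual ID3/C4.5 implementation: the list-of-dicts data format, the "label" key, and the helper names entropy and info_gain are assumptions; information gain itself is defined later in these slides.

```python
import math
from collections import Counter

def entropy(examples):
    """H of the class-label distribution of a set of examples."""
    counts = Counter(ex["label"] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr):
    """Expected entropy reduction from splitting on attribute attr."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    """Recursively build a tree as nested dicts; leaves are class labels."""
    labels = {ex["label"] for ex in examples}
    if len(labels) == 1:                 # all positive or all negative: leaf
        return labels.pop()
    if not attributes:                   # no attributes left: majority label
        return Counter(ex["label"] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:   # one child per value
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best])
    return tree

# Usage (toy data in the assumed format):
#   examples = [{"works_hard": "yes", "label": "A"}, {"works_hard": "no", "label": "no A"}]
#   print(id3(examples, ["works_hard"]))
```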

Example
Info on last year's students to determine if a student will get an A this year

             Features                                            Output f(X)
Student      A last year?   Black hair?   Works hard?   Drinks?   A this year?
X1: Richard  Yes            Yes           No            Yes       No
X2: Alan     Yes            Yes           Yes           No        Yes
X3: Alison   No             No            Yes           No        No
X4: Jeff     No             Yes           No            Yes       No
X5: Gail     Yes            No            Yes           Yes       Yes
X6: Simon    No             Yes           Yes           Yes       No
16

Example

Decision tree learned from this data:

A last year?
  yes --> Works hard?
            yes --> Output = Yes
            no  --> Output = No
  no  --> Output = No

          Features                                            Output f(X)
Student   A last year?   Black hair?   Works hard?   Drinks?   A this year?
Richard   Yes            Yes           No            Yes       No
Alan      Yes            Yes           Yes           No        Yes
Alison    No             No            Yes           No        No
Jeff      No             Yes           No            Yes       No
Gail      Yes            No            Yes           Yes       Yes
Simon     No             Yes           Yes           Yes       No
17

Example 2: The Restaurant

Goal: learn whether one should wait for a table

Attributes
  Alternate: another suitable restaurant nearby
  Bar: comfortable bar for waiting
  Fri/Sat: true on Fridays and Saturdays
  Hungry: whether one is hungry
  Patrons: how many people are present (none, some, full)
  Price: price range ($, $$, $$$)
  Raining: raining outside
  Reservation: reservation made
  Type: kind of restaurant (French, Italian, Thai, Burger)
  WaitEstimate: estimated wait by host (0-10 mins, 10-30, 30-60, >60)
18

Example 2: The Restaurant




Training data:

source: Norvig (2003)

19

A First Decision Tree

But is it the best decision tree we can build?

source: Norvig (2003)

20

Ockham's Razor Principle

"It is vain to do with more what can be done with less. Entities should not be multiplied beyond necessity." [Ockham, 1324]

In other words, always favor the simplest answer that correctly fits the training data
  i.e. the smallest tree on average

This type of assumption is called inductive bias
  inductive bias = making a choice beyond what the training instances contain
21

Finding the Best Tree

Learning can be seen as searching the space of all possible decision trees (from the empty tree to a complete tree)
  Inductive bias: prefer shorter trees on average
  How? Search the space of all decision trees:
    always pick the next attribute to split the data on based on its "discriminating power" (information gain)
    in effect, a hill-climbing search where the heuristic is information gain

source: Tom Mitchell, Machine Learning (1997)
22

Which Tree is Best?

(Figure: two candidate decision trees over features F1 ... F7, each with class labels at the leaves - a shallow, bushy tree versus a deep, chain-like one)
23

Choosing the Next Attribute

The key problem is choosing which feature to split a given set of examples on.

ID3 uses Maximum Information Gain:
  Choose the attribute that has the largest information gain
  i.e., the attribute that will result in the smallest expected size of the subtrees rooted at its children
  --> information theory
24

Intuitively

Patrons:
  If value is Some: all outputs = Yes
  If value is None: all outputs = No
  If value is Full: we need more tests

Type:
  If value is French: we need more tests
  If value is Italian: we need more tests
  If value is Thai: we need more tests
  If value is Burger: we need more tests

So Patrons is more discriminating

source: Norvig (2003)
25

Next Feature

For only the data where Patrons = Full:

Hungry:
  If value is Yes: we need more tests
  If value is No: all outputs = No

Type:
  If value is French: all outputs = No
  If value is Italian: all outputs = No
  If value is Thai: we need more tests
  If value is Burger: we need more tests

So Hungry is more discriminating (only 1 new branch)
26

A Better Decision Tree





4 tests instead of 9
11 branches instead of 21

source: Norvig (2003)

27

Choosing the Next Attribute

The key problem is choosing which feature to split a given set of examples on.
Most used strategy: information theory

Entropy (or information content):

  H(X) = -\sum_{x \in X} p(x) \log_2 p(x)

  H(fair coin toss) = H(1/2, 1/2) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1 bit

  H(a, b) = entropy if probability of success = a and probability of failure = b
28

Essential Information Theory

Developed by Shannon in the 1940s
Notion of entropy (information content): how informative is a piece of information?
  If you already have a good idea about the answer --> low entropy
    then a "hint" about the right answer is not very informative
  If you have no idea about the answer (ex. a 50-50 split) --> high entropy
    then a "hint" about the right answer is very informative
29

Entropy

Let X be a discrete random variable (RV)
Entropy (or information content):

  H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

  measures the amount of information in a RV
  average uncertainty of a RV
  the average length of the message needed to transmit an outcome x_i of that variable
  measured in bits
  for only 2 outcomes x_1 and x_2: 0 \le H(X) \le 1
30

Example: The Coin Flip

Fair coin:
  H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1 bit

Rigged coin:
  H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -(99/100 \log_2 99/100 + 1/100 \log_2 1/100) = 0.08 bits

(Plot: entropy as a function of P(head) - fair coin --> high entropy, rigged coin --> low entropy)
(a quick numeric check follows after this slide)
31
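A quick numeric check of the two entropies above, as a minimal Python sketch (the helper name entropy is just for illustration):

```python
import math

def entropy(probs):
    """H = -sum p * log2(p), with the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin   -> 1.0 bit
print(entropy([0.99, 0.01]))   # rigged coin -> ~0.08 bits
```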

Choosing the Best Feature (con't)

The "discriminating power" of an attribute A given a data set S:
  Let Values(A) = the set of values that attribute A can take
  Let S_v = the set of examples in the data set which have value v for attribute A (for each value v from Values(A))

Information gain (or entropy reduction):

  gain(S, A) = H(S) - H(S | A)
             = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)
32

Some Intuition

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

Size is the least discriminating attribute (i.e. smallest information gain)
Shape and color are the most discriminating attributes (i.e. highest information gain)
33

A Small Example (1)

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

H(S) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1

Values(Color) = {red, blue}
  Color = red:  2+ 1-
  Color = blue: 0+ 1-

gain(S, Color) = H(S) - \sum_{v \in Values(Color)} \frac{|S_v|}{|S|} H(S_v)

for each v of Values(Color):
  H(S | Color = red)  = H(2/3, 1/3) = -(2/3 \log_2 2/3 + 1/3 \log_2 1/3) = 0.918
  H(S | Color = blue) = H(1, 0) = -(1 \log_2 1) = 0

H(S | Color) = (3/4)(0.918) + (1/4)(0) = 0.6885
gain(Color) = H(S) - H(S | Color) = 1 - 0.6885 = 0.3115
34

A Small Example (2)

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

Shape:
  circle: 2+ 1-
  square: 0+ 1-

Note: by definition \log_2 0 = -\infty, but 0 \log_2 0 is taken to be 0

H(S) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1
H(S | Shape) = (3/4)(0.918) + (1/4)(0) = 0.6885
gain(Shape) = H(S) - H(S | Shape) = 1 - 0.6885 = 0.3115
35

A Small Example (3)

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

Size:
  big:   1+ 1-
  small: 1+ 1-

H(S) = -(1/2 \log_2 1/2 + 1/2 \log_2 1/2) = 1
H(S | Size) = (1/2)(1) + (1/2)(1) = 1
gain(Size) = H(S) - H(S | Size) = 1 - 1 = 0
36

A Small Example (4)

Size    Color   Shape    Output
Big     Red     Circle   +
Small   Red     Circle   +
Small   Red     Square   -
Big     Blue    Circle   -

gain(Shape) = 0.3115
gain(Color) = 0.3115
gain(Size)  = 0

So first separate according to either color or shape (root of the tree)
(a small code check follows after this slide)
37
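As a check on the three gains above, here is a minimal Python sketch over the same four examples; the tuple-based dataset and the helper names are illustrative assumptions. (The slides' 0.3115 comes from rounding H(2/3, 1/3) to 0.918; unrounded, the gain is about 0.3113.)

```python
import math
from collections import Counter

DATA = [  # (size, color, shape, output)
    ("big", "red", "circle", "+"),
    ("small", "red", "circle", "+"),
    ("small", "red", "square", "-"),
    ("big", "blue", "circle", "-"),
]

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(attr_index):
    labels = [row[-1] for row in DATA]
    remainder = 0.0
    for value in {row[attr_index] for row in DATA}:
        subset = [row[-1] for row in DATA if row[attr_index] == value]
        remainder += len(subset) / len(DATA) * entropy(subset)
    return entropy(labels) - remainder

for name, idx in [("Size", 0), ("Color", 1), ("Shape", 2)]:
    print(name, round(gain(idx), 4))   # Size 0.0, Color 0.3113, Shape 0.3113
```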

Back to the Restaurant




Training data:

source: Norvig (2003)

38

The Restaurant Example

gain(alt) = ...    gain(bar) = ...    gain(fri) = ...    gain(hun) = ...

gain(pat) = 1 - [ \frac{2}{12} H(0/2, 2/2) + \frac{4}{12} H(4/4, 0/4) + \frac{6}{12} H(2/6, 4/6) ]
          = 1 - [ \frac{2}{12} (-(0/2 \log_2 0/2 + 2/2 \log_2 2/2)) + \frac{4}{12} (-(4/4 \log_2 4/4 + 0/4 \log_2 0/4)) + ... ] \approx 0.541 bits

gain(price) = ...    gain(rain) = ...    gain(res) = ...

gain(type) = 1 - [ \frac{2}{12} H(1/2, 1/2) + \frac{2}{12} H(1/2, 1/2) + \frac{4}{12} H(2/4, 2/4) + \frac{4}{12} H(2/4, 2/4) ] = 0 bits

gain(est) = ...

Attribute pat (Patrons) has the highest gain, so the root of the tree should be attribute Patrons
Do the same recursively for the subtrees
(a quick numeric check follows after this slide)
39
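A quick arithmetic check of gain(pat) and gain(type) above, as a minimal Python sketch (the helper H is just the discrete entropy from the earlier slides):

```python
import math

def H(*probs):
    """Entropy of a discrete distribution, with 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

gain_pat = 1 - (2/12 * H(0/2, 2/2) + 4/12 * H(4/4, 0/4) + 6/12 * H(2/6, 4/6))
gain_type = 1 - (2/12 * H(1/2, 1/2) + 2/12 * H(1/2, 1/2)
                 + 4/12 * H(2/4, 2/4) + 4/12 * H(2/4, 2/4))
print(round(gain_pat, 3), round(gain_type, 3))   # 0.541 0.0
```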

Decision Boundaries of Decision Trees

(Figures: a 2-D feature space with axes Feature 1 and Feature 2 is split step by step by axis-parallel thresholds t1, t2, t3; each test in the tree, such as "Feature 2 > t1" or "Feature 1 > t1", adds one axis-aligned boundary, so the tree carves the space into rectangular regions)
40-43

Applications of Decision Trees

One of the most widely used learning methods in practice
Fast, simple, and traceable
Can out-perform human experts in many problems
  A study for diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly
  Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example
44

Today







Last time: Probabilistic Methods


Decision trees
( Evaluation
Unsupervised learning)
Next time: Neural Networks
Next time: Genetic Algorithms

45

Evaluation of Learning Approach

How do you know if what you learned is correct?
  You run your classifier on a data set of unseen examples (that you did not use for training) for which you know the correct classification
  Metric: accuracy = % of instances of the test set (the unseen examples) the algorithm correctly classifies
Eg:
46

Evaluation Methodology

Standard methodology:
1. Collect a large set of examples (all with correct classifications)
2. Divide the collection into two disjoint sets: training set and test set
3. Apply the learning algorithm to the training set
   DO NOT LOOK AT THE TEST SET!
4. Measure performance with the test set: how well does the function learned in step 3 correctly classify the examples in the test set? (i.e. compute accuracy)

Important: keep the training and test sets disjoint!
To study the robustness of the algorithm, repeat steps 2-4 for different training sets and sizes of training sets
(a minimal sketch follows after this slide)
47
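A minimal sketch of the hold-out methodology above, assuming a classifier object with hypothetical fit/predict methods (not an API defined in these slides):

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Step 2: shuffle, then divide into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classifier, test_set):
    """Step 4: fraction of unseen examples classified correctly."""
    correct = sum(1 for x, label in test_set if classifier.predict(x) == label)
    return correct / len(test_set)

# Usage, assuming `data` is a list of (features, correct_label) pairs:
#   train, test = train_test_split(data)
#   classifier.fit(train)              # step 3 - never looks at `test`
#   print(accuracy(classifier, test))
```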

Error Analysis

Where did the learner go wrong?
  Use a confusion matrix / contingency table
  Rows = the correct class (that should have been assigned); columns = the classes assigned by the learner; each row totals 100%

(Table: 6 classes C1-C6; the diagonal holds the percentage of each class that was classified correctly - C1: 99.4, C2: 90.2, C3: 93.9, C4: 95.5, C5: 96.0, C6: 93.3 - while the small off-diagonal entries show which other classes it was confused with)
(a small sketch of building such a table follows after this slide)
48
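A small sketch of tabulating such a confusion matrix from (correct, assigned) label pairs; plain Python, illustrative only, with row percentages so that each row totals 100 as in the table above:

```python
from collections import Counter, defaultdict

def confusion_matrix(pairs):
    """pairs: iterable of (correct_class, assigned_class) tuples."""
    counts = defaultdict(Counter)
    for correct, assigned in pairs:
        counts[correct][assigned] += 1
    # convert each row to percentages so every row totals 100
    return {
        correct: {assigned: 100 * n / sum(row.values()) for assigned, n in row.items()}
        for correct, row in counts.items()
    }

# Usage: confusion_matrix([("C1", "C1"), ("C1", "C3"), ("C2", "C2"), ...])
```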

A Learning Curve

(Plot: prediction quality on the test set as a function of the size of the training set)
  the more, the better
  but after a while, not much improvement

source: Mitchell (1997)
49

Some Words on Training

In all types of learning, watch out for:
  Noisy input
  Overfitting/underfitting the training data
50

Noisy Input

In all types of learning, watch out for noisy input:
  Two examples have the same feature-value pairs, but different outputs

    Size   Color   Shape    Output
    Big    Red     Circle   +
    Big    Red     Circle   -

  Some values of features are incorrect or missing (ex. errors in the data acquisition)
  Some relevant attributes are not taken into account in the data set
51

Overfitting / Underfitting

In all types of learning, watch out for:

Overfitting the training data:
  If a large number of irrelevant features are present, we may find meaningless regularities in the data that are particular to the training data but irrelevant to the problem
  i.e. you find a model that fits the training set very well using a complicated model with many features, while the real solution is much simpler

Underfitting the training data:
  The training set is not representative enough of the real problem
  i.e. you find a simple explanation with few features, while the real solution is more complex
52

Cross-validation

Another method to reduce overfitting: k-fold cross-validation
  Run k experiments; each time, test on 1/k of the data and train on the rest
  Then average the results

Ex: 10-fold cross-validation
1. Collect a large set of examples (all with correct classifications)
2. Divide the collection into two disjoint sets: training (90%) and test (10% = 1/k)
3. Apply the learning algorithm to the training set
4. Measure performance with the test set
5. Repeat steps 2-4 with the 10 different portions
6. Average the results of the 10 experiments

(Figure: in each experiment - exp1, exp2, exp3, ... - a different 1/k portion of the data is held out as the test set and the rest is used for training)
(a minimal sketch follows after this slide)
53
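A minimal sketch of k-fold cross-validation as described above; train_fn and accuracy_fn are hypothetical callables standing in for steps 3 and 4 (they are not defined in these slides):

```python
import random

def k_fold_cross_validation(examples, k, train_fn, accuracy_fn, seed=0):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]       # k disjoint portions
    scores = []
    for i in range(k):                               # one experiment per fold
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)                      # step 3: train on the rest
        scores.append(accuracy_fn(model, test))      # step 4: test on the held-out 1/k
    return sum(scores) / k                           # step 6: average the k results

# Usage: k_fold_cross_validation(data, k=10, train_fn=..., accuracy_fn=...)
```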

Today







Last time: Probabilistic Methods


Decision trees
( Evaluation
Unsupervised learning)
Next time: Neural Networks
Next time: Genetic Algorithms

54

Unsupervised Learning

Learn without labeled examples, i.e. X is given, but not f(X)
  Ex: small nose, big teeth, small eyes, moustache --> f(X) = ?

Without f(X), you can't really identify/label a test instance
But you can:
  Cluster/group the features of the test data into a number of groups
  Discriminate between these groups without actually labeling them
55

Clustering

Represent each instance as a vector <a1, a2, a3, ..., an>
  Each vector can be visually represented in an n-dimensional space

(Figure: a table of instances X1-X5 with attribute values a1, a2, a3 and unknown outputs, plotted as points in that space)
56

Clustering

Clustering algorithm:
  Represent test instances in an n-dimensional space
  Partition them into regions of high density
    How? Many algorithms (ex. k-means)
  Compute the centroid of each region as the average of the data points in the cluster
57

k-means Clustering

User selects how many clusters they want (the value of k)

1. Place k points into the space (ex. at random). These points represent the initial group centroids.
2. Assign each data point xn to the nearest centroid.
3. When all data points have been assigned, recalculate the positions of the k centroids as the average of each cluster.
4. Repeat steps 2 and 3 until none of the data instances change group.
(a minimal sketch follows after this slide)
58
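A minimal sketch of the k-means loop above, using the Euclidean distance defined on the next slide; plain Python, illustrative only (real implementations also guard against empty clusters, handled crudely here by keeping the old centroid):

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def k_means(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: initial centroids
    while True:
        # step 2: assign each data point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: euclidean(x, centroids[i]))
            clusters[nearest].append(x)
        # step 3: recompute each centroid as the average of its cluster
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # step 4: stop when the centroids (and hence the groups) no longer change
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids

# Usage: k_means([(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.0)], k=2)
```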

Euclidean Distance

To find the nearest centroid, a possible metric is the Euclidean distance
Distance between 2 points p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn):

  d = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

Where to assign a data point x?
  For all k clusters, choose the one where x has the smallest distance

(Figure: a 2-D plot of data points and centroids)
59

Example (in 2-D, i.e. 2 features)

(Figures, one step per slide:)
  Initial 3 centroids c1, c2, c3 (ex. at random)
  Partition the data points to the closest centroid
  Re-compute the new centroids
  Re-assign the data points to the new closest centroids
  Repeat until no data point changes cluster
60-64

Notes on k-means

Converges very fast!
BUT:
  very sensitive to the initial choice of centroids
  user must set the initial k
    not easy to do
    may find useless clusters
  many other clustering algorithms exist
65

Today







Last time: Probabilistic Methods


Decision trees
( Evaluation
Unsupervised learning)
Next time: Neural Networks
Next time: Genetic Algorithms

66
