
Concept Learning CS 464: Introduction to Machine Learning

Concept Learning
Slides adapted from Chapter 2, Machine Learning by Tom M. Mitchell http://www-2.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html

Acquiring the definition of a general category from given positive and negative training examples of the category. An example: learning the concept "bird" from given examples of birds (positive examples) and non-birds (negative examples). Equivalently: inferring a boolean-valued function from training examples of its input and output.

Classification (Categorization)

Given:
A fixed set of categories: C = {c1, c2, ..., ck}
A description of an instance x = (x1, x2, ..., xn) with n features, x ∈ X, where X is the instance space.
Determine:
The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
If c(x) is a binary function with C = {1, 0} ({true, false}, {positive, negative}), then it is called a concept.

Learning for Categorization

A training example is an instance x ∈ X paired with its correct category c(x): <x, c(x)>, for an unknown categorization function c.
Given a set of training examples, D.
Find a hypothesized categorization function h(x) such that:

∀ <x, c(x)> ∈ D : h(x) = c(x)   (Consistency)
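As a minimal sketch (not from the slides; the hypothesis and data set below are hypothetical), the consistency condition can be written as a short Python check:

```python
# Check the consistency condition: for all <x, c(x)> in D, h(x) = c(x).

def consistent(h, D):
    """Return True iff hypothesis h agrees with the label c(x) on every
    training example <x, c(x)> in D."""
    return all(h(x) == c_x for x, c_x in D)

def h(x):
    # hypothetical concept: "instances whose first feature is 'Sunny'"
    return 1 if x[0] == "Sunny" else 0

D = [(("Sunny", "Warm"), 1), (("Rainy", "Cold"), 0)]
print(consistent(h, D))   # True: h agrees with every labeled example
```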

A Concept Learning Task - EnjoySport Training Examples

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      YES
2        Sunny  Warm     High      Strong  Warm   Same      YES
3        Rainy  Cold     High      Strong  Warm   Change    NO
4        Sunny  Warm     High      Strong  Warm   Change    YES

Hypothesis Space

Restrict learned functions to a given hypothesis space, H, of functions h(x) that can be considered as definitions of c(x).
For learning concepts on instances described by n discrete-valued features, consider the space of conjunctive hypotheses represented by a vector of n constraints <c1, c2, ..., cn>, where each ci is either:
  ? - a wild card indicating that any value is acceptable for that feature
  a specific value from the domain of that feature
  ∅ - indicating that no value is acceptable for that feature
Sample conjunctive hypotheses:
  <Sunny, ?, ?, Strong, ?, ?>
  <?, ?, ?, ?, ?, ?>   (most general hypothesis)
  <∅, ∅, ∅, ∅, ∅, ∅>   (most specific hypothesis)
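A minimal sketch of this representation in code (an assumption of these notes, not the slides: hypotheses are Python tuples where "?" is the wild card and None stands in for the empty constraint ∅):

```python
# Test whether an instance satisfies a conjunctive hypothesis.

WILD, EMPTY = "?", None

def satisfies(x, h):
    """Return True iff instance x (a tuple of feature values) satisfies
    conjunctive hypothesis h, i.e. h(x) = 1."""
    return all(c != EMPTY and (c == WILD or c == v) for c, v in zip(h, x))

h = ("Sunny", WILD, WILD, "Strong", WILD, WILD)
x = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
print(satisfies(x, h))              # True
print(satisfies(x, (EMPTY,) * 6))   # False: the most specific hypothesis matches nothing
```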

Each example day is described by six attributes. The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its attributes.

EnjoySport Concept Learning Task


Given
  Instances X: the set of all possible days, each described by the attributes
    Sky (values: Sunny, Cloudy, Rainy)
    AirTemp (values: Warm, Cold)
    Humidity (values: Normal, High)
    Wind (values: Strong, Weak)
    Water (values: Warm, Cold)
    Forecast (values: Same, Change)
  Target Concept (Function) c: EnjoySport : X → {0, 1}
  Hypotheses H: each hypothesis is described by a conjunction of constraints on the attributes. Example hypothesis: <Sunny, ?, ?, ?, ?, ?>
  Training Examples D: positive and negative examples of the target function
Determine
  A hypothesis h in H such that h(x) = c(x) for all x in D.

Inductive Learning Hypothesis


Although the learning task is to determine a hypothesis h identical to the target concept c over the entire set of instances X, the only information available about c is its value over the training examples. The Inductive Learning Hypothesis: Any function that is found to approximate the target concept well on a sufficiently large set of training examples will also approximate the target function well on unobserved examples. Assumes that the training and test examples are drawn independently from the same underlying distribution.

Concept Learning as Search


Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation. The goal of this search is to find the hypothesis that best fits the training examples. By selecting a hypothesis representation, the designer of the learning algorithm implicitly defines the space of all hypotheses that the program can ever represent and therefore can ever learn.

Concept Learning as Search (Cont.)


Consider an instance space consisting of n binary features, which therefore has 2^n instances. For conjunctive hypotheses, there are 4 choices for each feature (∅, T, F, ?), so there are 4^n syntactically distinct hypotheses. However, all hypotheses with one or more ∅'s are equivalent (they classify every instance as negative), so there are 3^n + 1 semantically distinct hypotheses. The target concept could in principle be any of the 2^(2^n) possible boolean functions on n input bits. Therefore, conjunctive hypotheses are a small subset of the space of possible functions, but both spaces are intractably large. All reasonable hypothesis spaces are intractably large or even infinite.

Enjoy Sport - Hypothesis Space


Sky has 3 possible values, and the other 5 attributes have 2 possible values each.
There are 3·2·2·2·2·2 = 96 distinct instances in X.
There are 5·4·4·4·4·4 = 5120 syntactically distinct hypotheses in H (two extra values per attribute: ? and ∅).
Every hypothesis containing one or more ∅ symbols represents the empty set of instances (it classifies every instance as negative).
There are 1 + 4·3·3·3·3·3 = 973 semantically distinct hypotheses in H (only one extra value per attribute, ?, plus the single hypothesis representing the empty set of instances).
Although EnjoySport has a small, finite hypothesis space, most learning tasks have much larger (even infinite) hypothesis spaces.
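A quick arithmetic check (a sketch, not part of the slides) of the counts quoted above:

```python
# Verify the EnjoySport hypothesis-space counts.

domain_sizes = [3, 2, 2, 2, 2, 2]   # Sky has 3 values, the other attributes have 2

instances = 1
syntactic = 1
semantic_non_empty = 1
for d in domain_sizes:
    instances *= d                  # distinct instances
    syntactic *= d + 2              # each attribute also allows "?" and the empty constraint
    semantic_non_empty *= d + 1     # each attribute allows "?" but not the empty constraint

print(instances)                    # 96
print(syntactic)                    # 5120
print(1 + semantic_non_empty)       # 973 (the +1 is the single empty-set hypothesis)
```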

Learning by Enumeration
For any finite or countably infinite hypothesis space, one can simply enumerate and test hypotheses one at a time until a consistent one is found. For each h in H do: If h is consistent with the training data D, then terminate and return h. This algorithm is guaranteed to terminate with a consistent hypothesis if one exists; however, it is obviously computationally intractable for almost any practical problem.

Using the Generality Structure


By exploiting the structure imposed by the generality of hypotheses, a hypothesis space can be searched for consistent hypotheses without enumerating or explicitly exploring all hypotheses.
An instance x ∈ X is said to satisfy a hypothesis h iff h(x) = 1 (positive).
Given two hypotheses h1 and h2, h1 is more general than or equal to h2 (h1 ≥ h2) iff every instance that satisfies h2 also satisfies h1.
Given two hypotheses h1 and h2, h1 is (strictly) more general than h2 (h1 > h2) iff h1 ≥ h2 and it is not the case that h2 ≥ h1.
Generality defines a partial order on hypotheses.
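A sketch of the "more general than or equal to" test for conjunctive hypotheses, under the same assumed tuple representation as the earlier snippet ("?" wild card, None for the empty constraint):

```python
WILD, EMPTY = "?", None

def covers(c_general, c_specific):
    """Does constraint c_general allow every value allowed by c_specific?"""
    if c_general == WILD:
        return True
    if c_specific == EMPTY:
        return True               # the empty constraint allows no value at all
    return c_general == c_specific

def more_general_or_equal(h1, h2):
    """h1 >= h2 iff every instance that satisfies h2 also satisfies h1."""
    return all(covers(c1, c2) for c1, c2 in zip(h1, h2))

print(more_general_or_equal(("Sunny", WILD), ("Sunny", "Warm")))   # True
print(more_general_or_equal(("Sunny", WILD), (WILD, "Strong")))    # False: neither is more general
```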

Examples of Generality
Conjunctive feature vectors:
  <Sunny, ?, ?, ?, ?, ?> is more general than <Sunny, ?, ?, Strong, ?, ?>
  Neither of <Sunny, ?, ?, ?, ?, ?> and <?, ?, ?, Strong, ?, ?> is more general than the other.
Axis-parallel rectangles in 2-d space (figure with three rectangles A, B, C):
  A is more general than B.
  Neither of A and C is more general than the other.

More General Than Relation

(Figure: three hypotheses h1, h2, h3 in the partial ordering.) h2 > h1 and h2 > h3, but there is no more-general-than relation between h1 and h3.

FIND-S Algorithm

Find the most specific hypothesis (least-general generalization) that is consistent with the training data. Starting with the most specific hypothesis, incrementally update the hypothesis after every positive example, generalizing it just enough to satisfy the new example.
Time complexity: O(|D|·n), where n is the number of features.

FIND-S Algorithm

Initialize h = <∅, ∅, ..., ∅> (the most specific hypothesis)
For each positive training instance x in D:
  For each feature fi:
    If the constraint on fi in h is not satisfied by x:
      If fi in h is ∅, then set fi in h to the value of fi in x
      else set fi in h to ?
If h is consistent with the negative training instances in D, then return h
else no consistent hypothesis exists
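A runnable sketch of FIND-S under the tuple representation used in the earlier snippets ("?" wild card, None for the empty constraint ∅); the demo uses the EnjoySport table as given above. This is an illustration, not the course's reference code.

```python
WILD, EMPTY = "?", None

def satisfies(x, h):
    return all(c != EMPTY and (c == WILD or c == v) for c, v in zip(h, x))

def find_s(D):
    """D is a list of (instance, label) pairs, label 1 (positive) or 0 (negative)."""
    n = len(D[0][0])
    h = [EMPTY] * n                      # start with the most specific hypothesis
    for x, label in D:
        if label != 1:
            continue                     # FIND-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] == EMPTY:
                h[i] = value             # first positive example: copy its values
            elif h[i] != value:
                h[i] = WILD              # generalize just enough to cover x
    if any(satisfies(x, h) for x, label in D if label == 0):
        return None                      # no consistent conjunctive hypothesis exists
    return tuple(h)

D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Change"), 1),
]
print(find_s(D))   # ('Sunny', 'Warm', '?', 'Strong', 'Warm', '?') for this table
```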

FIND-S Algorithm - Example

Issues with FIND-S

Given sufficient training examples, does FIND-S converge to a correct definition of the target concept (assuming it is in the hypothesis space)?
How do we know when the hypothesis has converged to a correct definition?
Why prefer the most specific hypothesis? Are more general hypotheses consistent? What about the most general hypothesis? What about the simplest hypothesis?
What if there are several maximally specific consistent hypotheses? Which one should be chosen?
What if there is noise in the training data and some training examples are incorrectly labeled?

Candidate-Elimination Algorithm
FIND-S outputs a hypothesis from H, that is consistent with the training examples, which is just one of many hypotheses from H that might fit the training data equally well. The key idea in the Candidate-Elimination algorithm is to output a description of the set of all hypotheses consistent with the training examples without explicitly enumerating all of its members. This is accomplished by using the more-general-than partial ordering and maintaining a compact representation of the set of consistent hypotheses.

Consistent Hypothesis

A hypothesis h is consistent with a set of training examples D iff h(x) = c(x) for each training example <x, c(x)> in D.
The key difference between consistent and satisfies: an example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is a positive or negative example of the target concept. An example is consistent with hypothesis h when h(x) = c(x), which depends on the target concept.

Version Space

Given a hypothesis space, H, and training data, D, the version space is the complete subset of H that is consistent with D. The version space can be naively generated for any finite H by enumerating all hypotheses and eliminating the inconsistent ones.

List-Then-Eliminate Algorithm

The version space of candidate hypotheses shrinks as more examples are observed, until ideally just one hypothesis remains that is consistent with all the observed examples. What if there is insufficient training data? The algorithm will then output the entire set of hypotheses consistent with the observed data.
The List-Then-Eliminate algorithm can be applied whenever the hypothesis space H is finite. It has many advantages, including the fact that it is guaranteed to output all hypotheses consistent with the training data. Unfortunately, it requires exhaustively enumerating all hypotheses in H - an unrealistic requirement for all but the most trivial hypothesis spaces. Can one compute the version space more efficiently than using enumeration?
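A small sketch of computing the version space by enumeration (the List-Then-Eliminate idea), using the toy Size/Color/Shape domain that appears in the Version Space Lattice slide below; names are illustrative.

```python
from itertools import product

WILD = "?"

def satisfies(x, h):
    # h contains no empty constraints here, since such hypotheses match nothing
    return all(c == WILD or c == v for c, v in zip(h, x))

def consistent(h, D):
    return all(satisfies(x, h) == bool(label) for x, label in D)

def version_space(domains, D):
    """Enumerate every non-empty conjunctive hypothesis over the given feature
    domains and keep the ones consistent with the training data D."""
    candidate_values = [list(dom) + [WILD] for dom in domains]
    return [h for h in product(*candidate_values) if consistent(h, D)]

domains = [("sm", "big"), ("red", "blue"), ("circ", "squr")]
D = [(("big", "red", "squr"), 1), (("sm", "blue", "circ"), 0)]
for h in version_space(domains, D):
    print(h)   # 7 hypotheses, from <big, red, squr> up to the members of G
```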

Compact Representation of Version Spaces with S and G

The version space can be represented more compactly by maintaining two boundary sets of hypotheses: S, the set of most specific consistent hypotheses, and G, the set of most general consistent hypotheses:

S = {s ∈ H | Consistent(s, D) ∧ ¬∃ s' ∈ H [(s > s') ∧ Consistent(s', D)]}
G = {g ∈ H | Consistent(g, D) ∧ ¬∃ g' ∈ H [(g' > g) ∧ Consistent(g', D)]}

S and G represent the entire version space via its boundaries in the generalization lattice (figure: G at the top, the version space in between, S at the bottom).

Compact Representation of Version Spaces

The Candidate-Elimination algorithm represents the version space by storing only its most general members G and its most specific members S. Given only these two sets S and G, it is possible to enumerate all members of the version space by generating the hypotheses that lie between these two sets in the general-to-specific partial ordering over hypotheses. Every member of the version space lies between these boundaries.

Version Space Lattice

Features: Size: {sm, big}; Color: {red, blue}; Shape: {circ, squr}.
Training examples: <<big, red, squr>, positive>, <<sm, blue, circ>, negative>.
(Lattice figure, color-coded into G, S, and the other version-space members: every conjunctive hypothesis over these features, from the most general <?, ?, ?> at the top, through hypotheses such as <big, ?, ?>, <?, red, ?>, <?, ?, squr>, <big, red, ?>, <big, ?, squr>, <?, red, squr>, down to the fully specific hypotheses such as <big, red, squr> and the empty hypothesis <∅, ∅, ∅> at the bottom.)

Candidate-Elimination Algorithm

The Candidate-Elimination algorithm computes the version space containing all hypotheses from H that are consistent with an observed sequence of training examples. It begins by initializing the version space to the set of all hypotheses in H; that is, by initializing the G boundary set to contain the most general hypothesis in H,
  G0 ← { <?, ?, ?, ?, ?, ?> }
and initializing the S boundary set to contain the most specific hypothesis,
  S0 ← { <∅, ∅, ∅, ∅, ∅, ∅> }
These two boundary sets delimit the entire hypothesis space, because every other hypothesis in H is both more general than S0 and more specific than G0. As each training example is considered, the S and G boundary sets are generalized and specialized, respectively, to eliminate from the version space any hypotheses found inconsistent with the new training example. After all examples have been processed, the computed version space contains all the hypotheses consistent with these examples, and only these hypotheses.

Candidate-Elimination Algorithm

Initialize G to the set of most-general hypotheses in H
Initialize S to the set of most-specific hypotheses in H
For each training example, d, do:
  If d is a positive example then:
    Remove from G any hypothesis that is inconsistent with d
    For each hypothesis s in S that is inconsistent with d:
      Remove s from S
      Add to S all minimal generalizations, h, of s such that:
        1) h is consistent with d, and
        2) some member of G is more general than h
      Remove from S any h that is more general than another hypothesis in S
  If d is a negative example then:
    Remove from S any hypothesis that is inconsistent with d
    For each hypothesis g in G that is consistent with d:
      Remove g from G
      Add to G all minimal specializations, h, of g such that:
        1) h is inconsistent with d, and
        2) some member of S is more specific than h
      Remove from G any h that is more specific than another hypothesis in G

Required Methods

To instantiate the algorithm for a specific hypothesis language requires the following procedures:
  equal-hypotheses(h1, h2)
  more-general(h1, h2)
  consistent(h, i)
  initialize-g()
  initialize-s()
  generalize-to(h, i)
  specialize-against(h, i)
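Below is a runnable sketch (not the course's reference implementation) of the Candidate-Elimination algorithm instantiated for conjunctive feature vectors, with helper functions loosely named after the required methods above. As in the earlier snippets, "?" is the wild card and None stands in for the empty constraint ∅; the demo data is the EnjoySport table from earlier.

```python
WILD, EMPTY = "?", None

def satisfies(x, h):
    return all(c != EMPTY and (c == WILD or c == v) for c, v in zip(h, x))

def more_general(h1, h2):
    """h1 >= h2 in the generality ordering."""
    return all(a == WILD or b == EMPTY or a == b for a, b in zip(h1, h2))

def generalize_to(s, x):
    """Unique minimal generalization of s that covers x (the FIND-S step)."""
    return tuple(v if c == EMPTY else (c if c == v else WILD) for c, v in zip(s, x))

def specialize_against(g, x, domains):
    """Minimal specializations of g that exclude x (one per replaceable '?')."""
    out = []
    for i, (c, v) in enumerate(zip(g, x)):
        if c == WILD:
            out += [g[:i] + (val,) + g[i + 1:] for val in domains[i] if val != v]
    return out

def candidate_elimination(D, domains):
    n = len(domains)
    S = {tuple([EMPTY] * n)}   # initialize-s: most specific hypothesis
    G = {tuple([WILD] * n)}    # initialize-g: most general hypothesis
    for x, label in D:
        if label:  # positive example: prune G, generalize S
            G = {g for g in G if satisfies(x, g)}
            for s in [s for s in S if not satisfies(x, s)]:
                S.remove(s)
                h = generalize_to(s, x)          # h covers x by construction
                if any(more_general(g, h) for g in G):
                    S.add(h)
            S = {s for s in S if not any(s != t and more_general(s, t) for t in S)}
        else:      # negative example: prune S, specialize G
            S = {s for s in S if not satisfies(x, s)}
            for g in [g for g in G if satisfies(x, g)]:
                G.remove(g)
                for h in specialize_against(g, x, domains):  # h excludes x by construction
                    if any(more_general(h, s) for s in S):
                        G.add(h)
            G = {g for g in G if not any(g != t and more_general(t, g) for t in G)}
    return S, G

domains = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Weak"), ("Warm", "Cold"), ("Same", "Change")]
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Change"), 1),
]
S, G = candidate_elimination(D, domains)
print(S)   # {('Sunny', 'Warm', '?', 'Strong', 'Warm', '?')} for the table as given above
print(G)   # ('Sunny', '?', '?', '?', '?', '?') and ('?', 'Warm', '?', '?', '?', '?')
```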

Minimal Specialization and Generalization

Procedures generalize-to and specialize-against are specific to a hypothesis language and can be complex. For conjunctive feature vectors:
  generalize-to: unique, see FIND-S.
  specialize-against: not unique; convert each ? to an alternative, non-matching value for that feature.
Example (specialize-against):
  Inputs:
    h = <?, red, ?>
    i = <small, red, triangle>
  Outputs:
    <big, red, ?>
    <medium, red, ?>
    <?, red, square>
    <?, red, circle>

Example Version Space

A version space with its general and specific boundary sets (figure). The version space includes all six hypotheses shown here, but can be represented more simply by S and G.

Candidate-Elimination Algo. - Example

S0 and G0 are the initial boundary sets corresponding to the most specific and most general hypotheses. Training examples 1 and 2 force the S boundary to become more general. They have no effect on the G boundary.

Candidate-Elimination Algo. - Example

Given that there are six attributes that could be specified to specialize G2, why are there only three new hypotheses in G3? For example, the hypothesis h = <?, ?, Normal, ?, ?, ?> is a minimal specialization of G2 that correctly labels the new example as a negative example, but it is not included in G3.
The reason this hypothesis is excluded is that it is inconsistent with S2. The algorithm determines this simply by noting that h is not more general than the current specific boundary, S2.
In fact, the S boundary of the version space forms a summary of the previously encountered positive examples that can be used to determine whether any given hypothesis is consistent with these examples. The G boundary summarizes the information from previously encountered negative examples. Any hypothesis more specific than G is assured to be consistent with past negative examples.

Candidate-Elimination Algo. - Example

The fourth training example further generalizes the S boundary. It also results in removing one member of the G boundary, because this member fails to cover the new positive example.
To understand the rationale for this step, it is useful to consider why the offending hypothesis must be removed from G. Notice that it cannot be specialized, because specializing it would not make it cover the new example. It also cannot be generalized, because by the definition of G, any more general hypothesis will cover at least one negative training example. Therefore, the hypothesis must be dropped from the G boundary, thereby removing an entire branch of the partial ordering from the version space of hypotheses remaining under consideration.

Candidate-Elimination Algorithm Example - Final Version Space

After processing these four examples, the boundary sets S4 and G4 delimit the version space of all hypotheses consistent with the set of incrementally observed training examples. This learned version space is independent of the sequence in which the training examples are presented (because in the end it contains all hypotheses consistent with the set of examples). As further training data is encountered, the S and G boundaries will move monotonically closer to each other, delimiting a smaller and smaller version space of candidate hypotheses.

Properties of VS Algorithm
S summarizes the relevant information in the positive examples (relative to H), so that positive examples do not need to be retained. G summarizes the relevant information in the negative examples, so that negative examples do not need to be retained.
The result is not affected by the order in which examples are processed, but computational efficiency may be.
Positive examples move the S boundary up; negative examples move the G boundary down.
If S and G converge to the same hypothesis, then it is the only one in H that is consistent with the data. If S and G become empty (if one does, the other must also), then there is no hypothesis in H consistent with the data.

Correctness of Learning
Since the entire version space is maintained, given a continuous stream of noise-free training examples, the VS algorithm will eventually converge to the correct target concept if it is in the hypothesis space, H, or eventually correctly determine that it is not in H. Convergence is correctly indicated when S=G.

Computational Complexity of VS
Computing the S set for conjunctive feature vectors is linear in the number of features and the number of training examples. Computing the G set for conjunctive feature vectors is exponential in the number of training examples in the worst case. In more expressive languages, both S and G can grow exponentially. The order in which examples are processed can significantly affect computational complexity.

Active Learning
In active learning, the system is responsible for selecting good training examples and asking a teacher (oracle) to provide a class label. In sample selection, the system picks good examples to query by picking them from a provided pool of unlabeled examples. In query generation, the system must generate the description of an example for which to request a label. Goal is to minimize the number of queries required to learn an accurate concept description.

Active Learning with VS


An ideal training example would eliminate half of the hypotheses in the current version space regardless of its label. If a training example matches half of the hypotheses in the version space, then the matching half is eliminated if the example is negative, and the other (non-matching) half is eliminated if the example is positive. Example:
Assume training set:
  Positive: <big, red, circle>
  Negative: <small, red, triangle>
Current version space:
  {<big, red, circle>, <big, red, ?>, <big, ?, circle>, <?, red, circle>, <?, ?, circle>, <big, ?, ?>}
An optimal query: <big, blue, circle>

With ⌈log2 |VS|⌉ such examples, convergence will result. This is the best possible guarantee in general.
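A sketch (names and the toy domain below are illustrative, not from the slides) of selecting a query instance that splits the current version space as evenly as possible, so that either answer eliminates roughly half of the remaining hypotheses:

```python
from itertools import product

WILD = "?"

def satisfies(x, h):
    return all(c == WILD or c == v for c, v in zip(h, x))

def best_query(version_space, domains):
    """Pick the candidate instance whose positive/negative split of the
    version space is closest to 50/50."""
    candidates = list(product(*domains))
    def imbalance(x):
        covered = sum(satisfies(x, h) for h in version_space)
        return abs(covered - len(version_space) / 2)
    return min(candidates, key=imbalance)

domains = [("small", "medium", "big"), ("red", "blue"), ("circle", "triangle")]
vs = [("big", "red", "circle"), ("big", "red", WILD), ("big", WILD, "circle"),
      (WILD, "red", "circle"), (WILD, WILD, "circle"), ("big", WILD, WILD)]
print(best_query(vs, domains))   # ('big', 'blue', 'circle'): satisfied by 3 of the 6 hypotheses
```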

Using an Unconverged VS
If the VS has not converged, how does it classify a novel test instance? If all elements of S match an instance, then the entire version space must (since it is more general) and it can be confidently classified as positive (assuming target concept is in H). If no element of G matches an instance, then the entire version space must not (since it is more specific) and it can be confidently classified as negative (assuming target concept is in H). Otherwise, one could vote all of the hypotheses in the VS (or just the G and S sets to avoid enumerating the VS) to give a classification with an associated confidence value. Voting the entire VS is probabilistically optimal assuming the target concept is in H and all hypotheses in H are equally likely a priori.
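A sketch of the classification rule above, using only the S and G boundary sets; the boundary sets shown in the demo are illustrative.

```python
WILD, EMPTY = "?", None

def satisfies(x, h):
    return all(c != EMPTY and (c == WILD or c == v) for c, v in zip(h, x))

def classify(x, S, G):
    """Return 'positive', 'negative', or 'unknown' (boundary sets disagree)."""
    if all(satisfies(x, s) for s in S):
        return "positive"     # every hypothesis in the version space matches x
    if not any(satisfies(x, g) for g in G):
        return "negative"     # no hypothesis in the version space matches x
    return "unknown"          # could vote the hypotheses for a confidence value

S = [("Sunny", "Warm", WILD, "Strong", WILD, WILD)]
G = [("Sunny", WILD, WILD, WILD, WILD, WILD), (WILD, "Warm", WILD, WILD, WILD, WILD)]
print(classify(("Sunny", "Warm", "Normal", "Strong", "Cool", "Change"), S, G))   # positive
print(classify(("Rainy", "Cold", "Normal", "Light", "Warm", "Same"), S, G))      # negative
```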

Inductive Bias
A hypothesis space that does not include all possible classification functions on the instance space incorporates a bias in the type of classifiers it can learn. Any means that a learning system uses to choose between two functions that are both consistent with the training data is called inductive bias. Inductive bias can take two forms:
Language bias: The language for representing concepts defines a hypothesis space that does not include all possible functions (e.g. conjunctive descriptions). Search bias: The language is expressive enough to represent all possible functions (e.g. disjunctive normal form) but the search algorithm embodies a preference for certain consistent functions over others (e.g. syntactic simplicity).

Unbiased Learning

For instances described by n features, each with m values, there are m^n instances. If these are to be classified into c categories, then there are c^(m^n) possible classification functions.
For n = 10, m = c = 2, there are 2^(2^10) ≈ 1.8×10^308 possible functions, of which only 59,049 can be represented as conjunctions (an incredibly small percentage!).
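A quick arithmetic check (sketch) of the counts quoted above, for n = 10 features with m = 2 values each and c = 2 classes:

```python
n, m, c = 10, 2, 2

instances = m ** n               # 1024 distinct instances
all_functions = c ** instances   # every possible labeling of those instances = 2**1024
conjunctions = 3 ** n            # each feature fixed to one of its 2 values or "?"

print(instances)                 # 1024
print(len(str(all_functions)))   # 309 digits, i.e. roughly 1.8 * 10**308
print(conjunctions)              # 59049
```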

Futility of Bias-Free Learning


A learner that makes no a priori assumptions about the target concept has no rational basis for classifying any unseen instances. Inductive bias can also be defined as the assumptions that, when combined with the observed training data, logically entail the subsequent classification of unseen instances.
Training-data + inductive-bias ⊢ novel-classifications

However, unbiased learning is futile: if we consider all possible functions, then simply memorizing the data without any real generalization is as good an option as any. Without bias, the version space is always trivial. The unique most-specific hypothesis is the disjunction of the positive instances, and the unique most-general hypothesis is the negation of the disjunction of the negative instances:

S = {(p1 ∨ p2 ∨ ... ∨ pk)}
G = {¬(n1 ∨ n2 ∨ ... ∨ nj)}

The bias of the VS algorithm (assuming it refuses to classify an instance unless it is classified the same by all members of the VS) is simply that H contains the target concept. The rote learner, which refuses to classify any instance unless it has seen it during training, is the least biased. Learners can be partially ordered by their amount of bias:
Rote-learner < VS Algorithm < FIND-S

Inductive Bias - Three Learning Algorithms

ROTE-LEARNER: Learning corresponds simply to storing each observed training example in memory. Subsequent instances are classified by looking them up in memory. If the instance is found in memory, the stored classification is returned. Otherwise, the system refuses to classify the new instance. Inductive bias: no inductive bias.
CANDIDATE-ELIMINATION: New instances are classified only in the case where all members of the current version space agree on the classification. Otherwise, the system refuses to classify the new instance. Inductive bias: the target concept can be represented in its hypothesis space.
FIND-S: This algorithm finds the most specific hypothesis consistent with the training examples. It then uses this hypothesis to classify all subsequent instances. Inductive bias: the target concept can be represented in its hypothesis space, and all instances are negative instances unless the opposite is entailed by its other knowledge.

Concept Learning - Summary


Concept learning can be seen as a problem of searching through a large predefined space of potential hypotheses. The general-to-specific partial ordering of hypotheses provides a useful structure for organizing the search through the hypothesis space. The FIND-S algorithm utilizes this general-to-specific ordering, performing a specific-to-general search through the hypothesis space along one branch of the partial ordering, to find the most specific hypothesis consistent with the training examples. The CANDIDATE-ELIMINATION algorithm utilizes this general-to-specific ordering to compute the version space (the set of all hypotheses consistent with the training data) by incrementally computing the sets of maximally specific (S) and maximally general (G) hypotheses.

Concept Learning - Summary


Because the S and G sets delimit the entire set of hypotheses consistent with the data, they provide the learner with a description of its uncertainty regarding the exact identity of the target concept. This version space of alternative hypotheses can be examined to determine whether the learner has converged to the target concept, to determine when the training data are inconsistent, to generate informative queries to further refine the VS, and to determine which unseen instances can be unambiguously classified based on the partially learned concept. The CANDIDATE-ELIMINATION algorithm is not robust to noisy data or to situations in which the unknown target concept is not expressible in the provided hypothesis space.

Concept Learning - Summary


Inductive learning algorithms are able to classify unseen examples only because of their implicit inductive bias for selecting one consistent hypothesis over another. If the hypothesis space is enriched to the point where there is a hypothesis corresponding to every possible subset of instances (the power set of the instances), this will remove any inductive bias from the CANDIDATE-ELIMINATION algorithm. Unfortunately, this also removes the ability to classify any instance beyond the observed training examples. An unbiased learner cannot make inductive leaps to classify unseen examples.
