
Total Score 102 out of 110

Karen Cegelski
Homework Week 1
CSIS 5420
June 3, 2005
Score 10 out of 10

1. (Question #2, page 30) For each of the following problem scenarios, decide if
a solution would best be addressed with supervised learning, unsupervised
clustering, or database query. As appropriate, state any initial hypothesis you
would like to test. If you decide that supervised learning or unsupervised
clustering is the best answer, list several input attributes you believe to be
relevant for solving the problem.

a. What characteristics differentiate people who have had back
surgery and have returned to work from those who have had back
surgery and have not returned to their jobs?

I would choose supervised learning to address this problem, as the
model could be built using data instances of known origin. One
hypothesis to test concerns physician specialty: did patients who
were operated on by a neurosurgeon return to work faster than those
who had the surgery performed by an orthopedic surgeon? Attributes
that could be relevant to the solution are physician, physician
specialty, type of employment, age, and overall health. Very good

b. A major automotive manufacturer recently initiated a tire recall for
one of their top-selling vehicles. The automotive company blames
the tires for the unusually high accident rate seen with their
top-seller. The company producing the tires claims the high accident
rate only occurs when their tires are on the vehicle in question.
Who is to blame?

To solve this problem, I would do a database query. In querying the
database, I would want to know the make/model of the vehicle,
model year, and type of tires, and I would filter for vehicles that
have been involved in accidents. Any correlation between the type
of tires and the number of accidents would help determine who is
responsible. Good

c. When customers visit my web site, what products are they most
likely to buy together?

Unsupervised clustering would be used to decide this scenario. If
this web site were a clothing site, it might be determined that a
woman who wants a blue dress is likely to buy accessories such as
shoes and jewelry at the same time. Some attributes would be:
customer ID; type of clothing; accessories (shoes, necklace,
earrings); hose. Good

d. What percent of my employees miss one or more days of work
per month?

A database query would be used to answer this question. I would
ask the database for the employee number, employee name, and length
of employment, and filter for employees who have missed one or
more days. Good

e. What relationships can I find between an individual's height,
weight, age, and favorite spectator sport?

I would use unsupervised clustering to find the answer to this question.
There are no predefined classes; instead, data instances can be grouped
together based on a similarity scheme defined by the clustering model.
The hypothesis to test would be whether a relationship exists between
individual demographics and favorite spectator sport. It is very possible
that no relationship would be found. Attributes could be:
person name or ID number, height, weight, age, favorite sport. Good

Score 10 out of 10

2. (Question #3, page 30) Medical doctors are experts at disease diagnosis and
surgery. Explain how medical doctors use induction to help develop their skills.

Induction, or inductive reasoning, is the process of reasoning in which
the conclusion of an argument is very likely to be true based on given
symptoms. Doctors base a diagnosis on observations of particular patterns
and formulate treatment based on these recurring patterns. In determining
whether a patient who is having chest pains is having a heart attack or
just acid reflux, the physician will order certain tests that confirm or
eliminate what the symptoms suggest.
Good

Score 10 out of 10

3. (Question #6, page 31) What happens when you try to build a decision tree for
the data in Table 1.1 without employing the attributes Swollen Glands and Fever?

Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1           Yes          Yes    Yes             Yes         Yes       Strep throat
2           No           No     No              Yes         Yes       Allergy
3           Yes          Yes    No              Yes         No        Cold
4           Yes          No     Yes             No          No        Strep throat
5           No           Yes    No              Yes         No        Cold
6           No           No     No              Yes         No        Allergy
7           No           No     Yes             No          No        Strep throat
8           Yes          No     No              Yes         Yes       Allergy
9           No           Yes    No              Yes         Yes       Cold
10          Yes          Yes    No              Yes         Yes       Cold

Without the symptoms fever and swollen glands, the diagnosis could be
misleading. The diagnosis could be either allergy or a cold without
factoring in these two attributes. The attributes sore throat, congestion,
and headache are not important in determining the diagnosis. Good

Let's pick sore throat as the top-level node. The only possible values are yes and no.
Instances 1, 3, 4, 8, and 10 follow the yes path. Instances 2, 5, 6, 7, and 9 follow the
no path. The path for sore throat = yes has representatives from all three classes, as
does sore throat = no.

Next we follow the sore throat = yes path and choose headache. We need only concern
ourselves with instances 1, 3, 4, 8, and 10. For headache = yes we have instances 1 (strep
throat), 8 (allergy), and 10 (cold). For headache = no we have instances 3 (cold) and 4
(strep throat).

Next follow headache = yes and choose congestion, the only remaining attribute. All
three instances show congestion = yes, so the tree is unable to further differentiate
them. Following headache = no, congestion does separate instance 3 (cold) from instance 4
(strep throat), but that does not help the three mixed instances under headache = yes.
Therefore, the path following sore throat = yes cannot fully separate the classes. The
problem repeats itself for the path sore throat = no, where instances 2 and 9, and
instances 5 and 6, share the same attribute values but differ in diagnosis. In general,
any top-level node choice of sore throat, congestion, or headache gives a similar result.
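
The walk-through above can be checked mechanically. The following is a minimal sketch, not part of the original assignment: it groups the ten instances of Table 1.1 by the three remaining attributes and reports every group whose members still carry different diagnoses. The names `TABLE` and `ambiguous_groups` are introduced here purely for illustration.

```python
from collections import defaultdict

# Table 1.1 restricted to the attributes left after dropping
# Swollen Glands and Fever:
# (patient id, sore throat, congestion, headache, diagnosis)
TABLE = [
    (1, "Yes", "Yes", "Yes", "Strep throat"),
    (2, "No", "Yes", "Yes", "Allergy"),
    (3, "Yes", "Yes", "No", "Cold"),
    (4, "Yes", "No", "No", "Strep throat"),
    (5, "No", "Yes", "No", "Cold"),
    (6, "No", "Yes", "No", "Allergy"),
    (7, "No", "No", "No", "Strep throat"),
    (8, "Yes", "Yes", "Yes", "Allergy"),
    (9, "No", "Yes", "Yes", "Cold"),
    (10, "Yes", "Yes", "Yes", "Cold"),
]


def ambiguous_groups(rows):
    """Return attribute combinations shared by instances with different diagnoses."""
    groups = defaultdict(list)
    for pid, sore_throat, congestion, headache, diagnosis in rows:
        groups[(sore_throat, congestion, headache)].append((pid, diagnosis))
    return {
        key: members
        for key, members in groups.items()
        if len({diagnosis for _, diagnosis in members}) > 1
    }


for key, members in sorted(ambiguous_groups(TABLE).items()):
    print(key, members)
```

No tree built from only these three attributes can split a group whose members agree on every attribute value, so the groups printed here are the instances that remain misclassifiable regardless of which attribute is chosen as the top-level node.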

Score 10 out of 10

4. (Question #6, page 63) Suppose you have used data mining to develop two
alternative models designed to accept or reject home mortgage applications.
Both models show an 85% test set classification correctness. The majority of
errors made by model A are false accepts whereas the majority of errors made
by model B are false rejects. Which model should you choose? Justify your
answer.

Model B should be chosen because most of its errors are false rejects,
which tells us that this model is less likely to erroneously offer a home
loan to an individual who may be likely to default. The test set error
rate is a useful measure for model evaluation, but other factors, such as
costs incurred for false inclusion and losses resulting from false
omission, must be considered. Good, but consider this perspective: since
a mortgage is secured credit, is there much risk in false accepts?

Score 10 out of 10

5. (Question #7, page 63) Suppose you have used data mining to develop two
alternative models designed to decide whether or not to drill for oil. Both models
show an 85% test set classification correctness. The majority of errors made by
model A are false accepts whereas the majority of errors made by model B are
false rejects. Which model should you choose? Justify your answer.

Model A should be chosen because most of its errors are false accepts,
which tells us that this model is less likely than Model B to reject a
site that actually contains oil. The test set error rate is a useful
measure for model evaluation, but other factors, such as costs incurred
for false inclusion and losses resulting from false omission, must be
considered. OK, but consider: if the cost of drilling for oil is very
high, Model B is the better choice.
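
The dependence of the choice on error costs can be made concrete with a small calculation. The sketch below is illustrative only: the error counts (15 errors per 100 test instances, split 12/3 and 3/12) and the cost figures are hypothetical values chosen to match the 85% correctness in the question, not data from the textbook.

```python
# Hypothetical error profiles for 100 test instances at 85% correctness.
MODEL_A = {"false_accepts": 12, "false_rejects": 3}   # mostly false accepts
MODEL_B = {"false_accepts": 3, "false_rejects": 12}   # mostly false rejects


def total_error_cost(model, cost_false_accept, cost_false_reject):
    """Total cost of a model's errors given a cost per error type."""
    return (model["false_accepts"] * cost_false_accept
            + model["false_rejects"] * cost_false_reject)


def cheaper_model(cost_false_accept, cost_false_reject):
    """Return 'A' or 'B', whichever has the lower total error cost."""
    cost_a = total_error_cost(MODEL_A, cost_false_accept, cost_false_reject)
    cost_b = total_error_cost(MODEL_B, cost_false_accept, cost_false_reject)
    return "A" if cost_a < cost_b else "B"


# Drilling a dry well costs far more than missing a site (assumed units).
print(cheaper_model(cost_false_accept=10, cost_false_reject=1))  # B
# Missing an oil-bearing site costs far more than a dry well.
print(cheaper_model(cost_false_accept=1, cost_false_reject=10))  # A
```

With equal accuracy, the preferred model flips as soon as the relative cost of the two error types flips, which is the point of the instructor's comment.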

Score 10 out of 10

6. (Question #8, page 63) Explain how unsupervised clustering can be used to
evaluate the likely success of a supervised learner model.

Unsupervised clustering can be used to evaluate the likely success of a
supervised learner model using:

• A confusion matrix, to compute model accuracy by adding the values found
  on the main diagonal and dividing this number by the total number of
  test set instances.
• Two-class error analysis, to denote false accepts and false rejects.
• Mean absolute error and mean squared error, to evaluate supervised
  models having numeric output.

OK, but let me suggest a simpler answer.

In a supervised learner model, we pre-determine which attributes will be
used to classify our data and what specific clusters we will accept. In other
words, we assume that a chosen set of attributes will classify our data
under a chosen output attribute.

If our unsupervised learner determines that the same input attributes will
form clusters that differentiate the values of the output attribute, then the
complementary results verify the supervised learner assumptions.

Score 10 out of 10

7. (Question #97, page 63) Explain how supervised learning can be used to help
evaluate the results of an unsupervised clustering model.

Supervised learning can be used to help evaluate the results of an unsupervised
clustering model using the following technique:

• Perform an unsupervised clustering. Designate each cluster as a class
  and assign each an arbitrary name such as C1, C2, and C3.
• Take a random sample of instances from each of the classes formed by
  the instance clustering. Each class should be represented in the random
  sample in the same ratio as it is represented in the dataset.
• Construct a supervised learner model with the class name as the output
  attribute, using the randomly sampled instances as training data. Use the
  remaining instances to test the supervised model for classification
  correctness. Very good
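
The three steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the textbook's procedure: the clustering step is taken as already done (the points and the labels C1, C2, C3 are synthetic, generated by the hypothetical helper `make_demo_data`), and a simple nearest-centroid classifier stands in for the supervised learner.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_fraction, rng):
    """Sample each cluster in proportion to its size; return (train, test) indices."""
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)
    train, test = [], []
    for indices in by_cluster.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


def fit_centroids(points, labels, indices):
    """Supervised stand-in: mean point of each class in the training sample."""
    sums, counts = {}, Counter()
    for i in indices:
        total = sums.setdefault(labels[i], [0.0] * len(points[i]))
        for d, value in enumerate(points[i]):
            total[d] += value
        counts[labels[i]] += 1
    return {lab: [v / counts[lab] for v in total] for lab, total in sums.items()}


def predict(centroids, point):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
    return min(centroids, key=sq_dist)


def evaluate_clustering(points, labels, train_fraction=0.5, seed=0):
    """Train on the stratified sample, test on the rest, return correctness."""
    rng = random.Random(seed)
    train, test = stratified_split(labels, train_fraction, rng)
    centroids = fit_centroids(points, labels, train)
    correct = sum(predict(centroids, points[i]) == labels[i] for i in test)
    return correct / len(test)


def make_demo_data(seed=1):
    """Hypothetical clustering output: three well-separated 2-D clusters."""
    rng = random.Random(seed)
    centers = {"C1": (0.0, 0.0), "C2": (10.0, 10.0), "C3": (20.0, 0.0)}
    sizes = {"C1": 20, "C2": 30, "C3": 10}
    points, labels = [], []
    for name, (cx, cy) in centers.items():
        for _ in range(sizes[name]):
            points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            labels.append(name)
    return points, labels


points, labels = make_demo_data()
print(f"test-set correctness: {evaluate_clustering(points, labels):.2f}")
```

High test-set correctness suggests the clusters are well defined in terms of the input attributes; low correctness suggests the clustering did not find structure that a supervised model can recover.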

Score 7 out of 10

8. (Computational Question #1, page 63) Consider the following three-class
confusion matrix. The matrix shows the classification results of a supervised
model that uses previous voting records to determine the political party affiliation
(Republican, Democrat, or Independent) of members of the United States
Senate.

        Computed Decision
        Rep   Dem   Ind
Rep      42     2     1
Dem       5    40     3
Ind       0     3     4

a. What percent of the instances were correctly classified?

86% Good

b. According to the confusion matrix, how many Democrats are in the
Senate? How many Republicans? How many Independents?

Democrats: 40. Should be 48 (add across the row).

Republicans: 42. Should be 45 (add across the row).

Independents: 4. Should be 7 (add across the row).

There are 100 senators total.

c. How many Republicans were classified as belonging to the
Democratic Party?

2 Republicans were classified as belonging to the Democratic Party. Good

d. How many Independents were classified as Republicans?

0 Independents were classified as Republicans. Good
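
The answers to parts a through d can be read directly off the matrix. The following is a short sketch, assuming (as the instructor's "add across the row" comments imply) that rows are actual classes and columns are computed classes. The names `LABELS`, `MATRIX`, `accuracy`, and `actual_counts` are introduced here for illustration.

```python
LABELS = ["Rep", "Dem", "Ind"]

# Rows: actual party. Columns: party computed by the model.
MATRIX = [
    [42, 2, 1],   # actual Rep
    [5, 40, 3],   # actual Dem
    [0, 3, 4],    # actual Ind
]


def accuracy(matrix):
    """Sum of the main diagonal divided by the total number of instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def actual_counts(matrix, labels):
    """Number of instances that truly belong to each class (row sums)."""
    return {label: sum(row) for label, row in zip(labels, matrix)}


print(accuracy(MATRIX))               # 0.86
print(actual_counts(MATRIX, LABELS))  # {'Rep': 45, 'Dem': 48, 'Ind': 7}
print(MATRIX[0][1])                   # Republicans classified as Democrats: 2
print(MATRIX[2][0])                   # Independents classified as Republicans: 0
```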

Score 7 out of 10

9. (Computational Question #2, page 64) Suppose we have two
classes, each with 100 instances. The instances in one class
contain information about individuals who currently have credit card
insurance. The instances in the second class include information
about individuals who have at least one credit card but are without
credit card insurance. Use the following to answer the questions
below:

IF Life Insurance = Yes & Income > $50K

THEN Credit Card Insurance = Yes

Rule Accuracy = 80%

Rule Coverage = 40%


a. How many individuals represented by the instances
in the class of credit card insurance holders have life
insurance and make more than $50,000 per year?

80 individuals. Should be 40 instances.

b. How many instances representing individuals who
do not have credit card insurance have life insurance
and make more than $50,000 per year?

80 instances. Should be 10 instances.

Score 10 out of 10

10. (Computational Question #3, page 64) Consider the confusion
matrices shown below.

a. Compute the lift for Model X.

Lift = 2.00785 Very good

b. Compute the lift for Model Y.

Lift = 2.25 Very good

Model X   Computed Accept   Computed Reject
Accept                 46                54
Reject              2,245             7,655

Model Y   Computed Accept   Computed Reject
Accept                 45                55
Reject              1,955             7,945
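
Both lift values can be reproduced from the matrices. This is a minimal sketch that takes lift to be P(accept | computed accept) divided by P(accept | population), with rows as actual classes and columns as computed classes. The function name `lift` and its parameter names are chosen here for illustration.

```python
def lift(c11, c12, c21, c22):
    """Lift of a two-class model from its confusion matrix cells.

    c11: actual accept, computed accept   c12: actual accept, computed reject
    c21: actual reject, computed accept   c22: actual reject, computed reject
    """
    population = c11 + c12 + c21 + c22
    p_accept_in_sample = c11 / (c11 + c21)            # among computed accepts
    p_accept_in_population = (c11 + c12) / population
    return p_accept_in_sample / p_accept_in_population


lift_x = lift(46, 54, 2245, 7655)
lift_y = lift(45, 55, 1955, 7945)
print(f"Model X lift: {lift_x:.5f}")
print(f"Model Y lift: {lift_y:.5f}")
```

Model Y accepts fewer true positives (45 against 46) but has the higher lift, because it selects a noticeably smaller group (2,000 computed accepts against 2,291).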

Score 8 out of 10

11. (Computational Question #4, page 65) A certain mailing list
consists of P names. Suppose a model has been built to determine
a select group of individuals from the list who will receive a special
flyer. As a second option, the flyer can be sent to all individuals on
the list. Use the notation given in the confusion matrix below to
show that the lift for choosing the model over sending out the flyer
to the entire population can be computed with the equation:

Send Flyer?   Computed Send   Computed Don't Send
Send          C11             C12
Don't Send    C21             C22

Lift = P(C11 | Sample) / P(C11 | Population)

Send Flyer?   Computed Send        Computed Don't Send
Send          c11                  c12                        Sum(Send)
Don't Send    c21                  c22                        Sum(Don't Send)
              Sum(Computed Send)   Sum(Computed Don't Send)   Sum(Total)

Lift = ( c11 / Sum(Computed Send) ) / ( Sum(Send) / Sum(Total) )

So Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / (C11 + C12 + C21 + C22) )

and we know that (C11 + C12 + C21 + C22) = the total number of names P.
Therefore, using substitution:

Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / P )

Lift = ( C11 / (C11 + C21) ) * ( P / (C11 + C12) )

Lift = ( C11 * P ) / ( (C11 + C12) * (C11 + C21) )
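
The algebra can be spot-checked numerically. The sketch below compares the ratio form with the final closed form using exact rational arithmetic. The function names are hypothetical, and the cell values are those of Model X from question 10.

```python
from fractions import Fraction


def lift_ratio(c11, c12, c21, c22):
    """Lift as (C11 / (C11 + C21)) / ((C11 + C12) / P)."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11, c11 + c21) / Fraction(c11 + c12, p)


def lift_closed_form(c11, c12, c21, c22):
    """Lift as (C11 * P) / ((C11 + C12) * (C11 + C21))."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11 * p, (c11 + c12) * (c11 + c21))


cells = (46, 54, 2245, 7655)  # Model X confusion matrix from question 10
assert lift_ratio(*cells) == lift_closed_form(*cells)
print(float(lift_closed_form(*cells)))
```

A single numeric agreement is not a proof, but it is a quick guard against a sign or parenthesis slip in the derivation.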
