
Tutorial knn

This is a guest post by Dr. Scott Turner, who won the Machine March Madness prediction contest
last year, and who was the co-winner of the Sweet 16 contest from two years ago. If you like this
post, check out his great blog all about algorithmic prediction of NCAA
basketball: http://netprophetblog.blogspot.com/.

Dr. Turner has a Ph.D. in Artificial Intelligence from UCLA. His dissertation subject was a program
called MINSTREL that told stories about King Arthur and his knights, as a way to explore issues in
creativity and storytelling. Since obtaining his Ph.D. in 1993, Dr. Turner has worked for the
Aerospace Corporation, where he advises the nation's space programs on software and systems
engineering issues.

Danny & Lee asked me to contribute a guest post as part of the Machine Madness contest. I started
writing a posting about using RapidMiner as part of a prediction workflow, but unfortunately I
became overwhelmed with other tasks and wasn't able to finish it. I had given up on finishing it
when I realized that anyone entering the Machine Madness contest at this late date might well
appreciate a tool that makes the routine parts of building a predictive model very fast. So I
quickly finished it up and hope it will prove helpful to someone. Readers who are expert
data miners won't find much here, but I hope that it might be useful to the interested amateur who
knows more about basketball (football, baseball, etc.) than about statistics and data mining and
wants to put in a quick entry.

I will assume that you have some program or method for generating the statistics or ratings you
want to use to predict games and that you've saved those results as an Excel file. (These might just
be season averages of the statistics Danny & Lee are providing.) As a tool, RapidMiner is not
well-suited for this part of the problem; its strengths are in pulling the predictive value out of
those statistics rather than generating them. (Or perhaps I should say that it's not well-suited as I
understand it. I wouldn't be surprised to learn that it has useful features in this area that I don't
know about.) The Excel file should have one line for each game, with columns for the team names,
statistics, ratings, and scores.
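
For readers who want to check the shape of their data before loading it into RapidMiner, here is a minimal Python sketch of the one-row-per-game layout. It assumes the sheet has been exported to CSV so that only the standard library is needed. The column names (Date, Home, Away, Hscore, Ascore, Hrating, Arating) and all values are invented for illustration; they are not the actual columns of my file.

```python
import csv
import io

# Hypothetical column names and invented values, for illustration only.
SAMPLE = """Date,Home,Away,Hscore,Ascore,Hrating,Arating
2012-01-05,Team A,Team B,71,64,25.3,22.1
2012-01-07,Team C,Team A,58,66,19.8,25.9
"""

def load_games(text):
    """Parse one-row-per-game CSV text into a list of dicts with numeric fields converted."""
    games = []
    for row in csv.DictReader(io.StringIO(text)):
        for key in ("Hscore", "Ascore"):
            row[key] = int(row[key])
        for key in ("Hrating", "Arating"):
            row[key] = float(row[key])
        games.append(row)
    return games

games = load_games(SAMPLE)
print(len(games), "games; first home team:", games[0]["Home"])
```

Each dict corresponds to what RapidMiner will later call an "example", and each key to an "attribute".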

The next step is to download and install RapidMiner. You can do that here. The "community
edition" of RapidMiner is completely free. (I like free.) There's a user forum here where questions
usually get a fairly quick response.

Once you've installed it, start up RapidMiner. You'll see this:



RapidMiner has three default perspectives: Design, Results, and Welcome. It starts up in
Welcome. Switch to Design by clicking on the icon that looks like a pencil writing in a notebook,
by choosing it from the View menu, or by hitting F8. The Design view looks like this:




The blank central area is the canvas where you'll graphically build your RapidMiner process. The
left side has a menu of Operators as well as Repositories (where processes are stored). The right
side has details about the current operator (just a blank "Process" in this case, because we haven't
added anything yet).

To start, let's read in our Excel file of game data. In the list of Operators on the left side of the
RapidMiner window, you'll see a folder labeled "Import". Clicking on that reveals sub-folders
labeled "Data", "Models", and so on. Click on the Data folder and you'll see a list of
operators. "Read Excel" should be near the top. Click and drag that operator onto the blank area
in the middle of the screen and release. You'll see this:


There are a couple of things to note. First, RapidMiner has automatically drawn a connection from
the output of this operator (the little semi-circle node on the right of the box) to the right edge of
the workspace. Anything going out to that edge will show up in the Results view when the process
is executed. Second, the message window at the bottom of the workspace shows an error. It is
complaining: "The mandatory parameter 'excel file' is undefined."

To fix this, look to the right side. You'll see that it is now showing the details for the highlighted
"Read Excel" operator. Just below there you'll see a button for an "Import Configuration Wizard"
and then some input boxes for the various parameters for this operator, including the "excel file"
parameter being complained about. There's also a description/help box for the operator below the
parameters section.

Use the "Import Configuration Wizard" to find your Excel file and prepare it to be read in. The
wizard does some basic data checking, so you may discover a problem in your file at this
point. Here's what the final step of the wizard looks like for my sample data:




There are 8 columns in my data: name, score, TrueSkill mean, and home winning
percentage. (TrueSkill is a rating system, and the mean is its estimate of a team's strength. You
can read more about it here.) These will be the inputs to my prediction model.

To run a process in RapidMiner, you click the right-facing blue triangle button near the top of the
window. Right now our process isn't very interesting -- it just reads in the Excel file and sends it to
the Results -- but let's run it and see what happens. You may be asked to save your process and
whether you want to switch to the Results view. For both questions you can save a default answer,
which is handy. When you switch to the Results view you'll see something like this:




The data you read in creates an "Example Set" and this window is showing you the Meta Data View
for the data set. In my case, the data set has 3699 examples (games), and for each attribute in the
examples, the window shows the Role, Name, Type, Statistics, Range and Missings. There's some
interesting stuff here -- for example, home teams scored between 28 and 124 points in this
season. A home team scored only 28 points?! That's pretty intriguing.
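
If you'd like to see what goes into that summary, here is a rough Python sketch of the per-attribute statistics (range, average, and missing count). The `summarize` helper and the sample rows are invented for illustration and say nothing about how RapidMiner computes its own numbers.

```python
def summarize(rows, attribute):
    """Return min, max, mean, and missing count for one numeric attribute."""
    values = [r[attribute] for r in rows if r.get(attribute) is not None]
    missing = len(rows) - len(values)
    if not values:
        return {"min": None, "max": None, "mean": None, "missing": missing}
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "missing": missing,
    }

# Invented rows; None marks a missing value.
rows = [{"Hscore": 71}, {"Hscore": 58}, {"Hscore": None}, {"Hscore": 90}]
stats = summarize(rows, "Hscore")
print(stats)
```

A surprisingly low minimum, like the 28 above, is exactly the kind of thing this summary is good at surfacing.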

Let's follow up. Click on the "Data View" button and then on the Hscore column header to look at
the actual data sorted by the home team's score:




Apparently that 28-point performance was put in by SMU against UAB. That had to be fun to watch!
You can do some interesting data analysis with the Plot View and Advanced Chart options here, but
let's continue on with building a process.

Switch back to the Design view and let's work on conditioning the data. In many cases, there are
problems in the input data -- such as missing values -- that will corrupt your prediction
models. RapidMiner provides a number of operators for fixing these sorts of problems. Let's work
on fixing missing values. In the Design View, on the Operators tab on the left part of the screen,
you'll see a search box. This is handy for finding operators by name. Type "missing" into the Search
box and you should see this:




Click on the "Replace Missing Values" operator, drag it onto the canvas in the middle of the screen
and drop it. You'll now have this:




You'll see that RapidMiner is complaining of an error in our process: we don't have an input to the
Replace Missing Values operator. We want to connect the output of our Excel file to the input of
this operator. To do this, we left-click on the output of the Read Excel operator, drag the
resulting orange line to the input of the Replace Missing Values operator, and release. This brings
up a pop-up box asking whether we really want to disconnect the current output connection. Allow
RapidMiner to disconnect the port and you should have this:




And that's all you need to do: add operators and hook them together into a process. By default, the
Replace Missing Values operator replaces all missing values with the average value for that
attribute. That's fine for now, so we'll leave it as is.
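
To make "replace with the average" concrete, here is a small Python sketch of mean imputation for a single attribute. The `replace_missing_with_mean` helper and the "Hrating" attribute name are hypothetical, and `None` stands in for an empty spreadsheet cell. It illustrates the idea only, not RapidMiner's implementation.

```python
def replace_missing_with_mean(rows, attribute):
    """Return new rows where None in `attribute` is replaced by the mean of the known values."""
    known = [r[attribute] for r in rows if r[attribute] is not None]
    if not known:
        return [dict(r) for r in rows]  # nothing to average, so leave values unchanged
    mean = sum(known) / len(known)
    return [
        {**r, attribute: mean if r[attribute] is None else r[attribute]}
        for r in rows
    ]

# Invented values; the middle game is missing its rating.
rows = [{"Hrating": 20.0}, {"Hrating": None}, {"Hrating": 30.0}]
filled = replace_missing_with_mean(rows, "Hrating")
print(filled[1]["Hrating"])
```

The gap is filled with the average of the two known ratings, and the original rows are left untouched.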

One very important step we need to take is to create a "label". The label is the attribute that we're
trying to predict. In our case, we'll be trying to predict the winner of the game: "Home" or
"Away". We don't actually have that in our input data, so we'll need to create a new attribute and
set it to be our label.

To do this, find the "Generate Attributes" operator and the "Set Role" operator and modify your
process to look like this:


Now click on the "Generate Attributes" operator. On the right you'll see a parameter labeled
"function descriptions" with a button labeled "Edit List(0)". Click on this button to bring up a
view that will let us define a new attribute in our data set.


This is fairly simple to use. We type in a name for our new attribute in the left-hand column and
then an expression for calculating it in the right-hand column. We can use any existing attribute in
our expression, and clicking on the calculator icon brings up a tool to help create
expressions. In our case, we want to create a new attribute called "winner" that has the value
"Home" if the home team scored more than the Away team, and "Away" otherwise. The expression
to do this is 'if(Hscore>Ascore,"Home","Away")':
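
For comparison, the same rule written in Python might look like the sketch below. The `add_winner` helper and the sample scores are invented for illustration; only the attribute names Hscore and Ascore come from the data above.

```python
def add_winner(game):
    """Return a copy of the game with a 'winner' field: 'Home' if Hscore > Ascore, else 'Away'."""
    winner = "Home" if game["Hscore"] > game["Ascore"] else "Away"
    return {**game, "winner": winner}

# Invented scores for two games.
games = [
    {"Hscore": 71, "Ascore": 64},
    {"Hscore": 58, "Ascore": 66},
]
labeled = [add_winner(g) for g in games]
print([g["winner"] for g in labeled])
```

Like the RapidMiner expression, this would label a tie as "Away", which doesn't matter for basketball since games can't end tied.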




And that's it for creating the new attribute. Now we need to set the Role of this attribute to "label"
so that our models will know what we're trying to predict. To do this, click on the Set Role
operator and in the right-side pane, select our new attribute from the drop-down box next to
Name, and "label" from the drop-down box next to "target role":

We're almost ready to start modeling, but let's check to make sure we've added the "winner"
attribute correctly. Hit the run button to run the process and let's look at the output in the Results
view:

At the top of the results (colored light yellow because of its role as "label") we see the new
attribute "winner". In this data set, the Home team won almost twice as often as the Away
team. If you click on the Data View button, you can check a few games to make sure the
calculation is correct:

Looks good, so let's train a model. Switch back to the Design View, find the k-NN model, drag it
into the process, and connect it up to look like this:

Along the right side you can see the parameters for the k-NN operator. Change "k" to 3. We're
almost ready to create a model, but we need to add one last step. Right now the input data to our
model includes the scores of both teams. It isn't very hard to predict who will win the game if we
know who scored more points :-) so we'll need to remove that information from our
examples. To do this, we need an operator called "Select Attributes". Drop this into our process
between "Set Role" and "k-NN".

Highlight the new operator, and on the right side, set the "attribute filter type" to "subset" and
then click on "Select Attributes". That will bring up this dialog:

Now we simply select attributes we want to include from the left side and use the green arrow to
move them to the right side. We want to leave out the Hscore, Ascore and Date attributes.
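
In code terms, this step just drops the columns the model shouldn't see. Here is a minimal Python sketch; the `select_attributes` helper, the rating column names, and the sample values are hypothetical.

```python
# Attributes to leave out: the two scores give away the answer, and the date is not a predictor here.
EXCLUDED = {"Hscore", "Ascore", "Date"}

def select_attributes(game, excluded=EXCLUDED):
    """Return a copy of the game without the excluded attributes."""
    return {name: value for name, value in game.items() if name not in excluded}

# Invented example game.
game = {
    "Date": "2012-01-05",
    "Hscore": 71,
    "Ascore": 64,
    "Hrating": 25.3,
    "Arating": 22.1,
    "winner": "Home",
}
features = select_attributes(game)
print(sorted(features))
```

The label ("winner") stays in, because the model needs it for training; only the information that would leak the result is removed.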

Save this and we're now ready to run the process to create a model. Hit the Run button and you
should see results that look like this:

Great, we created a model! But how good is it? We don't have any idea. To figure that out, we
need to apply the model and then measure its performance. Let's do that.
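
Before we do, it may help to see what a k-NN model does when it is applied: it finds the k training examples closest to the new example and takes a majority vote of their labels. The Python sketch below assumes plain Euclidean distance and two invented rating features per game. RapidMiner's k-NN operator has its own distance options, so treat this as an illustration of the idea rather than a copy of what the operator does.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training examples.

    `train` is a list of (features, label) pairs; features are equal-length tuples of numbers.
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented (home rating, away rating) pairs with known winners.
train = [
    ((25.0, 20.0), "Home"),
    ((24.0, 21.0), "Home"),
    ((26.0, 19.0), "Home"),
    ((18.0, 26.0), "Away"),
    ((19.0, 25.0), "Away"),
]

print(knn_predict(train, (23.0, 22.0)))
print(knn_predict(train, (18.5, 25.5)))
```

With k set to 3 and only two possible labels, the vote can never tie, which is one reason an odd k is a convenient choice here.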

Switch back to the Design View, find the "Apply Model" and "Performance (Classification)"
operators, and add them to your process after the k-NN operator like so:

Note that the model output of the k-NN operator goes into the model input for the Apply Model
operator, and the example set output goes into the unlabeled input. The labeled output of Apply
Model goes into the labeled input of the Performance operator, and the performance output of that
operator goes out the right-hand side of our process.

Run this, and you should get a Results View that looks something like this:

Wow, 83% accuracy predicting the winner of the game -- pretty good! Good enough to win the
Machine Madness contest? Who can say? :-)
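
One caution about that number: in the process above, the model is applied to the same games it was trained on, and accuracy measured that way tends to be optimistic. The Python sketch below shows the usual remedy of holding some games back for testing. Everything in it is an assumption for illustration: the data is synthetic, the helpers are hypothetical, and the k-NN is a plain Euclidean-distance version rather than RapidMiner's operator.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training examples nearest to `query` (Euclidean distance)."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Fraction of `test` examples whose label the k-NN model built on `train` predicts correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)

def holdout_split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Synthetic games: two invented ratings per game and a noisy outcome with a made-up home edge.
rng = random.Random(42)
examples = []
for _ in range(200):
    home, away = rng.gauss(25, 4), rng.gauss(25, 4)
    p_home = 1 / (1 + math.exp(-(home - away + 2) / 4))
    examples.append(((home, away), "Home" if rng.random() < p_home else "Away"))

train, test = holdout_split(examples)
print("accuracy on the training games:", accuracy(train, train))
print("accuracy on the held-out games:", accuracy(train, test))
```

The held-out figure is the one to trust when judging how the model might do on games it hasn't seen. RapidMiner has validation operators for this purpose, which would be a natural next improvement to the process.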

This illustrates the basics of using RapidMiner for prediction. RapidMiner has a wealth of features
and options, and there are many improvements you can make to the simple process flow I've
illustrated above. But hopefully this has given you enough guidance to get started, and good luck!
