
Machine Learning

with Python

The Basics

By David V.
Copyright 2017 by David V.
All Rights Reserved

All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the author, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law.
Table of Contents
Introduction
Chapter 1- Getting Started
Chapter 2- Python and matplotlib for Data Exploration
Chapter 3- Logistic Regression
Conclusion
Disclaimer

While all attempts have been made to verify the information provided in this book, the author does not assume any responsibility for errors, omissions, or contrary interpretations of the subject matter contained within. The information provided in this book is for educational and entertainment purposes only. The reader is responsible for his or her own actions, and the author does not accept any responsibility for any liabilities or damages, real or perceived, resulting from the use of this information.

The trademarks that are used are used without any consent, and the publication of the trademarks is without permission or backing by the trademark owners. All trademarks and brands within this book are for clarifying purposes only and are owned by their respective owners, who are not affiliated with this document.


Introduction

Machine learning is a very common field of study in the world today. With Python, it is easy for you to implement the concepts of machine learning, because the language has numerous libraries that help you create production systems employing machine learning. This book helps you learn how. Enjoy reading!
Chapter 1- Getting Started

The best way for you to understand machine learning is by doing: completing projects, beginning with small ones. Python is an interpreted and very powerful programming language. It can be used for research as well as for the creation of production systems, as it is a complete programming language. Python has numerous libraries and modules to choose from, meaning that there are different ways to accomplish a particular task.

The following are the best steps to follow when doing a machine learning project in Python:

1. Define Problem.

2. Prepare Data.

3. Evaluate Algorithms.

4. Improve Results.

5. Present Results.

The above steps will also help you learn how to use new tools or a new platform.

First Project

Our first project will be classifying iris flowers. This project is good for the following reasons:

1. It has numeric attributes, making it possible for us to figure out how the loading and handling of data can be done.

2. The problem falls under the category of classification, meaning that we can implement it using a supervised learning algorithm.

3. It is also a multi-class classification problem, which needs a specialized form of handling.

4. The project is small, and it will fit into memory very well.

5. All the numeric attributes are in the same units and on the same scale, which means we will not have to do any special transformations or scaling to get started.

The following is what we will cover in this small project:

1. Installing Python and the SciPy platform.

2. Loading the dataset.

3. Summarizing the dataset.

4. Visualizing the dataset.

5. Evaluating some algorithms.

6. Making predictions.
Installing Python and the SciPy Platform

You should install the Python SciPy platform on your system if you have not already done so. The following are the SciPy libraries which you should install on your system:

scipy
numpy
matplotlib
pandas
sklearn

These libraries can be installed in a number of ways. The best approach is to choose a single installation method and then follow it throughout the installation process.
Installing via pip
In the case of Mac and Linux users, it is possible to install the SciPy libraries via pip. The libraries will be installed in wheel package format. To do the installation, you must ensure that you have installed both pip and Python on your system. Note, however, that installing these libraries with pip on Windows does not work properly.

First, begin by upgrading pip to the latest version. This can be done with the following command:

python -m pip install --upgrade pip

You can then use pip to install the SciPy packages. The following command demonstrates how to do this:

pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose

The packages will then be installed for your local user, so no special permissions are needed to write to the system directories.
In the case of user installs, please ensure that the user install executable directory is on the PATH. In Linux, the PATH can be set as follows:

# This should be added at the end of the ~/.bashrc file
export PATH="$PATH:/home/username/.local/

In OSX, the PATH can be set as follows:

# This should be added at the end of the ~/.bash_profile file
export PATH="$PATH:/Users/username/Libra

Note that you must use your correct username.
Installation via Linux Package Manager

The installation can also be done much more quickly from the repositories of the various Linux distributions. Note that such installations will be system-wide, and they will be somewhat outdated compared to the versions installed via pip.

For users of Ubuntu and Debian, the installation of the libraries can be done by executing the following command on the terminal:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

For users of Fedora 22 and later versions, use the following command:

sudo dnf install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel
Installation via Mac Package Manager

Unlike Linux, the Mac does not come with a package manager, but there exist several package managers which you can install.

Macports

The installation of the SciPy libraries using this package manager can be done by executing the following command:

sudo port install py35-numpy py35-scipy py35-matplotlib py35-ipython +notebook py35-pandas py35-sympy py35-nose

Now that you have the SciPy libraries necessary for this project, you can move to the next step.
Start Python, Check for Versions

You should check to be sure that Python was installed properly and is running as expected. Just open the terminal and then type the following:

python

Below is a script which can be used for testing the environment. It works by importing each of the libraries required for our project and printing the version of each library. This is the script:

# Check the versions of the libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

You will then get the version of each of the libraries if the installation was done successfully.

Load Data

We will be using the iris flower data set. It is a very famous dataset, widely used in machine learning by almost everyone.

It has 150 observations of iris flowers, with four columns of flower measurements in centimeters. The fifth column holds the species to which each flower belongs. There are three species, and each flower must belong to one of these.

Importing the Libraries

There are objects, functions, and libraries which should be used in this project. We can import them into the project as follows:
# Load libraries
import pandas
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

Everything should load without an error. If you do get an error, stop and fix your environment before continuing.

Loading the Dataset

It is possible for us to load data directly from the UCI Machine Learning repository. We will use pandas to load the data, and pandas will also be used later for exploring the data with both data visualization and descriptive statistics.

The names of the columns have to be specified while loading the data. This will help us later when we need to explore the data. This can be done as shown below:

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

Our expectation is that the data should load without any incident. If you have network problems, feel free to download the iris.data file into the working directory and then load it with the same mechanism, but the URL has to be changed to point to the local file.
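For example, a minimal sketch assuming you have saved the file as iris.data in the current working directory:

# Load the dataset from a local copy of iris.data instead of the URL
dataset = pandas.read_csv("iris.data", names=names)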

It is possible for us to learn the number of rows (instances) and attributes we have by using the shape property. This can be done as shown below:

# shape
print(dataset.shape)

There should be 150 instances and 5 attributes, as shown below:

(150, 5)
We can then eyeball the data as follows:

# head
print(dataset.head(20))

This will give the first 20 rows of data contained in the file. This is shown below:
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa

We can then look at the statistics of each attribute. This includes the count, the mean, the min and max values, and some percentiles. This is shown below:

# descriptions
print(dataset.describe())

You will observe that all the numeric values have a similar scale and similar ranges.

Class Distribution

We now need to know the number of instances that belong to each class. This can be viewed as an absolute count, as demonstrated below:

# class distribution
print(dataset.groupby('class').size())

You will then find that all the classes have the same number of instances.
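If the data loaded correctly, the output should look roughly like this (each of the three species appears 50 times in the iris data set):

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64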

Data Visualizations

Now that we have some basic idea regarding the data, it is good for us to extend this with visualizations. Let us have a look at the plots.

Univariate Plots

These will help us understand each attribute; they are plots of the individual variables. Since we have numeric input variables, we can create some box and whisker plots for them. This is shown below:

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
This will help us get a good picture of how the input variables are distributed. We can also create a histogram of each input variable, which will help us learn more about the distributions:

# histograms
dataset.hist()
plt.show()

Two of the input variables appear to have a Gaussian distribution. This is useful to note, as there are algorithms which can exploit this assumption.

Multivariate Plots

We should explore how the variables interact with each other. Scatterplots of all pairs of attributes will help us identify any structured relationships between the input variables.

# scatter plot matrix
scatter_matrix(dataset)
plt.show()

You will notice a diagonal grouping for some of the attribute pairs. This is an indication of a high correlation and a predictable relationship.
Evaluation of Algorithms

We will now build some models of the data and then estimate their accuracy on unseen data.

Creating the Validation Dataset

We want to know whether the model we create is a good one. Later, statistical methods will be used for estimating the accuracy of the models on unseen data. We also want a more concrete estimate of the accuracy of the best model by evaluating it on actual unseen data, so we hold some of the data back from the algorithms. This is shown below:

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

The loaded dataset has been split into two: 80% of it will be used for training the models, and 20% will be held back as the validation dataset. The training data is now contained in X_train and Y_train for training the models, while X_validation and Y_validation are held back for use later.

Test Harness

10-fold cross validation will be used to estimate the accuracy. With this, the training dataset is split into 10 parts: training is done on 9 parts while the remaining part is used for testing, and this is repeated for all combinations of train-test splits. This is shown below:

# Test options and the evaluation metric
seed = 7
scoring = 'accuracy'

The models are to be evaluated using the metric of accuracy. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to get a percentage. We will use the scoring variable when we build and evaluate each model below.
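As a quick illustration with made-up numbers (not part of the project code), the accuracy calculation is simply:

# Accuracy as a percentage: correct predictions divided by total instances, times 100
correct_predictions = 144   # hypothetical count of correct predictions
total_instances = 150       # hypothetical dataset size
accuracy = correct_predictions / total_instances * 100
print(accuracy)  # 96.0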

Build Models

We are not sure which algorithms or which configurations will be best for this kind of problem. There are 6 possible algorithms which we can use. These include the following:

1. Logistic Regression (LR).

2. Gaussian Naive Bayes (NB).

3. Linear Discriminant Analysis (LDA).

4. K-Nearest Neighbors (KNN).

5. Classification and Regression Trees (CART).

6. Support Vector Machines (SVM).

This is a mix of simple linear and non-linear algorithms. The random number seed is reset before each run, which ensures that each algorithm is evaluated using the same data splits and that the results can be directly compared. Let us begin by evaluating our six models:


# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

Selecting the Best Model

We now have 6 models as well as an accuracy estimate for each of them. Our aim is to compare the models and then choose the most accurate one. The program should give the following result once executed:

LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.981667 (0.025000)

As shown in the above output, KNN has the highest estimated accuracy score. We can go ahead and create a plot of the model evaluation results and compare the mean accuracy and spread of each model. Note that each algorithm was evaluated 10 times.

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Making Predictions

We found the KNN model to be the most accurate of the tested models. It is now time for us to get the accuracy of this model on the validation set.

This will provide us with a final independent check on the accuracy of the best model. It is also good to keep a validation set in case a mistake occurred during training, such as overfitting to the training set or a data leak; both would lead to an overly optimistic result.

The KNN model can be run directly on the validation set, and we then summarize the results as a final accuracy score, a confusion matrix, and a classification report. This is shown below:
below:

# Make predictions on the validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

You will then get the accuracy. The confusion matrix gives an indication of the three errors that were made, and the classification report gives a breakdown of each class by precision, recall, f1-score, and support. Here is the result:

0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11

    avg / total       0.90      0.90      0.90        30
Chapter 2- Python and matplotlib for Data Exploration

The Python libraries can be used together with matplotlib for the purpose of exploring data. Let us discuss how this can be done.

Load the Data Set

This is the first step in machine learning. The data should be observed data: you can collect it yourself, or you can browse various data sources for existing data sets. In this case, we will load the digits data set which comes with scikit-learn, a Python library.

For the data to be loaded, we have to import the datasets module from sklearn. You can then make use of the load_digits() method from datasets to load the data. This is shown below:

# Import `datasets` from `sklearn`
from sklearn import datasets

# Load in the `digits` data
digits = datasets.load_digits()

# Print the `digits` data
print(digits)

It is good to be aware that the datasets module also has other methods for loading and fetching popular reference datasets, and you can count on the module if you need artificial data generators. Alternatively, if we wanted to pull the data straight from a repository such as the UCI Machine Learning Repository, we could do it as follows:

# Import the `pandas` library as `pd`
import pandas as pd

# Load in the data with `read_csv()`
digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra", header=None)

# Print out the `digits`
print(digits)

Note that when the data is split up in this manner, it is divided into a training set and a test set, indicated by the .tra and .tes extensions. Both files need to be loaded to work through the full project; the command above only loads the training set.
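As a minimal sketch (assuming the same URL pattern holds for the test file), the test portion could be loaded in the same way:

# Load the corresponding test set (.tes extension, same URL pattern)
digits_test = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes", header=None)
print(digits_test.shape)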
Exploring the Data

Before you begin using a particular data set, it is good to read its description and try to learn something about it. For scikit-learn's built-in data, a separate description file is not readily available, but when you import data from an external data source, it will usually come with a description, and this will give you enough insight into the data. However, it is good to have sufficient knowledge of how the data set works.

Performing exploratory data analysis (EDA) on a particular data set is not always easy.

Gather Information on the Data

Suppose you have not checked any folder with a data description; then it is time for you to start gathering that information yourself.

After printing out the digits data once you have loaded it through the scikit-learn datasets module, you will notice that a lot of information is available. You should already be aware of things such as the target values and the description of the data. The digits data can be accessed through the data attribute, the target values through the target attribute, and the description via the DESCR attribute.
If you need to know which keys are available, you just have to execute digits.keys(). You can try the code shown below:

# Get the keys of the `digits` data
print(digits.keys())

# Print out the data
print(digits.data)

# Print out the target values
print(digits.target)

# Print out the description of the `digits` data
print(digits.DESCR)

You can then go ahead and check the type of your data. If you used read_csv() to import the data, you will have a DataFrame containing all the data. There will be no description component, but it is possible to resort to head() or tail() to inspect the data. It is always good to read the folder containing the data description.
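For example, a minimal sketch assuming the DataFrame loaded earlier with read_csv() is still named digits:

# Inspect the first and last rows of the DataFrame loaded with read_csv()
print(digits.head())
print(digits.tail())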


The data attribute should be used for

isolating the numpy array from the


digits data and the shape attribute

should be used for finding more. The


same can also be done for target and

DESCR. An attribute known as

images also exists, and this is used for

describing the data in the images.

The shape attribute of an array can be


used as shown below:

# numpy is needed for np.unique() below
import numpy as np

# Isolate the `digits` data
digits_data = digits.data

# Inspect the shape
print(digits_data.shape)

# Isolate the target values with `target`
digits_target = digits.target

# Inspect the shape
print(digits_target.shape)

# Print the number of unique labels
number_digits = len(np.unique(digits.target))

# Isolate the `images`
digits_images = digits.images

# Inspect the shape
print(digits_images.shape)

Visualizing the Data Images Using matplotlib

It is also possible for you to visualize the images which you are using. There are multiple libraries in Python which can be used for this purpose, but here we will be using matplotlib. This can be done as shown below:

# Import the matplotlib library
import matplotlib.pyplot as plt

# Set the figure size (width, height) in inches
fig = plt.figure(figsize=(6, 6))

# Adjust the subplots
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# For each of the 64 images
for i in range(64):
    # Initialize the subplots: add a subplot in the grid of 8 by 8, at the i+1-th position
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    # Display an image at the i-th position
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # Label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

# Show the plot
plt.show()

The above code might seem lengthy and even overwhelming. Note that we began by importing the library, matplotlib.pyplot. We then set up a figure 6 inches wide and 6 inches long. This creates a canvas, and all the subplots containing the images will be displayed on it. We also set the alignment on the left, right, bottom, and top. We then created a loop to fill the figure we created.

The subplots are initialized one by one, and each is added to its own position on a grid measuring 8 by 8, so one image is displayed in each grid cell. We also used binary colors, which give us white, black, and gray values, and nearest as the interpolation method, which means the displayed image is not smoothed.

The cherry on the pie is the text added to the subplots. The target labels are printed at the coordinates (0, 7) of each subplot, meaning that they appear in the bottom-left corner of the subplot. The line plt.show() displays the plot so that it becomes visible. To make it simpler, it is possible for you to visualize only the target labels as shown below:

# Import matplotlib
import matplotlib.pyplot as plt

# Join the images and the target labels into a list
images_and_labels = list(zip(digits.images, digits.target))

# For each element contained in the list
for index, (image, label) in enumerate(images_and_labels[:8]):
    # Initialize a subplot of 2X4 at the i+1-th position
    plt.subplot(2, 4, index + 1)
    # Do not plot any axes
    plt.axis('off')
    # Display the images in all the subplots
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title to every subplot
    plt.title('Training: ' + str(label))

# Show the plot
plt.show()

Note that once we imported matplotlib.pyplot, we zipped our two numpy arrays together and saved the result in a variable named images_and_labels. Each element of this list holds an instance of digits.images and the corresponding value of digits.target.

Principal Component Analysis (PCA)

Since the digits data set has 64 features, it presents a challenge: it becomes hard for us to understand the structure and maintain an overview of the digits data. You are working with a high-dimensional data set.

High dimensionality results when one tries to describe objects via a collection of many features. The problem with high-dimensional data is that the algorithms might be expected to take in too many features. Having many dimensions may also mean that the data points are located far from each other, and the distance between data points can become uninformative.

Principal Component Analysis (PCA) helps us solve this problem. It works by finding a linear combination of the two variables that contains much of the information. This principal component, or new variable, can be used to replace the two original variables. You can see it as a linear transformation method for finding the directions that maximize the variance of the data.

Scikit-learn makes it easy to apply PCA to your data. This is shown below:

# Import the PCA models (RandomizedPCA is available in older scikit-learn versions)
from sklearn.decomposition import PCA, RandomizedPCA

# Create a Randomized PCA model that takes in two components
randomized_pca = RandomizedPCA(n_components=2)

# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(digits.data)

# Create a regular PCA model
pca = PCA(n_components=2)

# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(digits.data)

# Inspect the shape
reduced_data_pca.shape

# Print out the data
print(reduced_data_rpca)
print(reduced_data_pca)

Note that in the above example we have used the RandomizedPCA() method, because it tends to perform better when the number of dimensions is high. You can replace the randomized PCA model with a regular PCA estimator and observe the difference in the results.

It is good to keep in mind that the model is told to keep only two components. This ensures that we have two-dimensional data that we can plot. Also, note that the target class labels are not passed to the PCA transformation; we want to investigate whether the PCA reveals the distribution of the different labels and whether the instances can be clearly separated from each other. A scatterplot can now be built to visualize the data:

colors = ['black', 'purple', 'blue', 'yellow', 'white', 'lime', 'cyan', 'orange', 'red', 'gray']
for i in range(len(colors)):
    x = reduced_data_rpca[:, 0][digits.target == i]
    y = reduced_data_rpca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])
plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()

Running this code uses matplotlib to display the scatter plot of the reduced data.

Preprocessing the Data

Data should be prepared well before it can be modeled. This preparation step is commonly known as preprocessing.

Data Normalization

We will begin by preprocessing the data. The digits data can, for example, be standardized by using the scale() method. The following example demonstrates this:

# Import
from sklearn.preprocessing import scale

# Apply the `scale()` method to the `digits` data
data = scale(digits.data)

Once the data has been scaled, the distribution of each attribute is shifted so that its mean is 0 and its standard deviation is 1.
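As a quick sanity check (a minimal sketch, assuming the scaled array is stored in data as above), you can verify this:

# Per-feature means should be close to 0 and standard deviations close to 1
# (features that are constant in the original data keep a standard deviation of 0)
import numpy as np
print(np.round(data.mean(axis=0), 2))
print(np.round(data.std(axis=0), 2))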

Split the Data into Test and Training Sets

To assess the performance of your model later, the data set should be divided into two parts: a test set and a training set. The first one is used for evaluating the trained system, while the second one is used for training the system.

A common approach is to use roughly two-thirds to three-quarters of the data set for training and the remainder as the test set. In the example below, 25% of the data is held out for testing:


# Import the `train_test_split`

from sklearn.cross_validation import


________________

# Split `digits` data into the training

and the test sets

X_train, X_test, y_train, y_test,


images_train, images_test =
train_test_split(data, digits.target,
digits.images, test_size=0.25,
random_state=42)
Note that in the above code, the traditional way of splitting has been respected. In the arguments to the train_test_split() method, you can see that the test_size has been set to 0.25. You also see that the parameter random_state has been set to a value of 42; this argument ensures that the split is reproducible, so it comes out the same every time the code is run.

Now that the data set has been split into train and test sets, the numbers can be inspected before the data is modeled. This is shown below:

# Number of training features
n_samples, n_features = X_train.shape

# Print out `n_samples`
print(n_samples)

# Print out `n_features`
print(n_features)

# Number of training labels
n_digits = len(np.unique(y_train))

# Inspect `y_train`
print(len(y_train))

The training set X_train now contains 1347 samples, which is 75% of the original data set. The remaining 25%, held in X_test and y_test, is the test set of 450 samples.

Clustering the digits Data

At this point we have only loaded and stored the data; we have not performed any actual learning or modeling until now.

The time has come for us to find the clusters of the training set. The model can be set up using KMeans() from the cluster module. You will observe that only three arguments are passed to the model: init, n_clusters, and random_state.

You may remember the last argument from above, when we split the data into training and test sets; it is responsible for ensuring that we get reproducible results. Consider the code given below:

# Import the `cluster` module
from sklearn import cluster

# Create the KMeans model
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)

# Fit the training data `X_train` to the model
clf.fit(X_train)

The init argument specifies the initialization method; even though it defaults to k-means++, it appears explicitly in the code. It is also clear that the argument n_clusters has been set to 10. This number specifies the number of groups or clusters which the data will be formed into, as well as the number of centroids which will be generated. Note that a cluster centroid is the middle of a cluster.

If you were to add the n_init parameter to the KMeans() function, you would be able to determine how many different centroid configurations the algorithm will try. The images making up the cluster centers can be visualized as shown below:

# Import matplotlib
import matplotlib.pyplot as plt

# Set the figure size in inches
fig = plt.figure(figsize=(8, 3))

# Add a title
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

# For all the labels (0-9)
for i in range(10):
    # Initialize the subplots in a grid measuring 2X5, at the i+1-th position
    ax = fig.add_subplot(2, 5, 1 + i)
    # Display the images
    ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
    # Don't show the axes
    plt.axis('off')

# Show the plot
plt.show()

The next step is the prediction of the labels of the test set. This can be done as shown below:

# Predict the labels for `X_test`
y_pred = clf.predict(X_test)

# Print out the first 100 instances of `y_pred`
print(y_pred[:100])

# Print out the first 100 instances of `y_test`
print(y_test[:100])

# Study the shape of the cluster centers
clf.cluster_centers_.shape

In the above example, we are predicting the values of the test set, which contains 450 samples. The result is stored in y_pred. We then print out the first 100 instances of y_pred and y_test, and some results should be observable immediately.

We can now visualize the labels which have been predicted. This can be done as shown below:

# Import `Isomap()`
from sklearn.manifold import Isomap

# Create an isomap and fit the `digits` data to it
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

# Compute the cluster centers and predict the cluster index for every sample
clusters = clf.fit_predict(X_train)

# Create a plot with the subplots in a grid measuring 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.suptitle('Predicted Versus the Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

# Add the scatterplots to the subplots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')

# Show the plots
plt.show()

Isomap() is used here as a way of reducing the high-dimensional digits data set. The difference from the PCA method is that Isomap is a non-linear reduction method. You can try to run the above code using PCA rather than Isomap and see the effect. The solution can be found here:

# Import `PCA()`
from sklearn.decomposition import PCA

# Model and then fit the `digits` data to the PCA model
X_pca = PCA(n_components=2).fit_transform(X_train)

# Compute the cluster centers and then predict the cluster index for every sample
clusters = clf.fit_predict(X_train)

# Create a plot with the subplots in a grid of 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.suptitle('Predicted Versus the Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

# Add the scatterplots to the subplots
ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')

# Show the plots
plt.show()
Evaluating the Clustering Model

We should now evaluate the performance of our model; in other words, we need to know how accurate our model's predictions are.

Let us begin by printing out a confusion matrix:

# Import `metrics` from `sklearn`
from sklearn import metrics

# Print out the confusion matrix with `confusion_matrix()`
print(metrics.confusion_matrix(y_test, y_pred))
You may want to learn more about the results than the confusion matrix alone can tell you. We should apply some different cluster metrics to assess the quality of our clusters; this way, you will learn how well the cluster labels fit the correct labels. Consider the following code:
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_rand_score, adjusted_mutual_info_score, silhouette_score

print('% 9s' % 'inertia   homo   compl   v-meas   ARI   AMI   silhouette')
print('%i   %.3f   %.3f   %.3f   %.3f   %.3f   %.3f'
      % (clf.inertia_,
         homogeneity_score(y_test, y_pred),
         completeness_score(y_test, y_pred),
         v_measure_score(y_test, y_pred),
         adjusted_rand_score(y_test, y_pred),
         adjusted_mutual_info_score(y_test, y_pred),
         silhouette_score(X_test, y_pred, metric='euclidean')))

There are several metrics to consider. The homogeneity score tells us the extent to which each cluster contains only data points which belong to a single class. The completeness score measures the extent to which data points that are members of a given class are also elements of the same cluster. The v-measure score is the harmonic mean of homogeneity and completeness.
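As a small illustration of that relationship (a sketch only, not part of the original code), the v-measure can be recomputed from the two underlying scores:

# v-measure is the harmonic mean of homogeneity (h) and completeness (c)
h = homogeneity_score(y_test, y_pred)
c = completeness_score(y_test, y_pred)
v = 2 * h * c / (h + c)
print(v, v_measure_score(y_test, y_pred))  # the two values should match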

The adjusted Rand score (ARI) measures the similarity between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or different clusters in the predicted and true clusterings.

The Adjusted Mutual Information (AMI) score also helps in the comparison of clusterings. It measures the agreement between the two assignments while correcting for chance groupings, and it takes a maximum value of 1 when the clusterings are equivalent.

The silhouette score measures how similar an object is to its own cluster compared to the other clusters. Its value ranges between -1 and 1, and a higher value indicates that the object is closely matched to the cluster it belongs to and poorly matched to the neighboring clusters. If many points have a high value, the cluster configuration is good.

From the above explanation, it is clear that our values are not good. In our case, the silhouette score is close to 0, which means that the samples lie close to the decision boundary between two neighboring clusters; this is an indication that the samples might have been assigned to the wrong clusters.

The ARI measure shows that not all data points in a given cluster are the same, while the completeness score shows that there are some data points which were not assigned to the correct cluster. You should therefore consider another estimator to predict the labels for the digits data.
Support Vector Machines

Consider the code given below:

# Import `train_test_split`
from sklearn.cross_validation import train_test_split

# Split the data into the training and the test sets
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(digits.data, digits.target, digits.images, test_size=0.25, random_state=42)

# Import the `svm` model
from sklearn import svm

# Create the SVC model
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')

# Fit the data to the SVC model
svc_model.fit(X_train, y_train)

If you follow the algorithm map, the first model we arrive at is a linear SVC. This has been applied to the digits data. Also, note that we have used X_train and y_train to fit the data to the SVC model; this is very different from clustering. In addition, the value for gamma has been set manually here. You can automatically obtain good values for the parameters by using tools such as cross validation and grid search.

Grid search is a very good way of adjusting the parameters. This can be done as shown below:

# Split the `digits` data into two equal sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.5, random_state=0)

# Import GridSearchCV
from sklearn.grid_search import GridSearchCV

# Set the parameter candidates
parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Create a classifier with the parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Print out the results
print('Best score for training data:', clf.best_score_)
print('Best `C`:', clf.best_estimator_.C)
print('Best kernel:', clf.best_estimator_.kernel)
print('Best `gamma`:', clf.best_estimator_.gamma)
You should then take the classifier with the parameter candidates which have just been created and apply it to the second part of the data set. Next, a new classifier should be trained using the best parameters found by the grid search. You then score the result to see whether the best parameters found by the grid search are actually working.

# Apply the classifier to the test data, and then view the accuracy score
clf.score(X_test, y_test)

# Train and then score a new classifier with the grid search parameters
svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(X_train, y_train).score(X_test, y_test)
You will see that the parameters are working well. You may have observed that in the first SVM classifier, the penalty parameter C of the error term was specified as 100. Note also that the kernel was explicitly specified as linear. The kernel argument specifies the kernel which is to be used in the algorithm, and it defaults to rbf. There are other kernel types which you can specify, such as poly and linear.

A kernel can be seen as a function which computes the similarity between training data points. When the kernel is provided to the algorithm, together with the training data and the labels, you get a classifier, just as we have in this case. Here we have trained a model which is able to categorize unseen objects into their specific categories. With a linear kernel, the SVM looks for a linear division of the data points.

The results obtained from the grid search indicate that an rbf kernel would have worked better; both the gamma and the penalty parameter were well specified by the search.

We can now proceed with our model and predict values for the test set. This is shown below:

# Predict the labels of `X_test`
print(svc_model.predict(X_test))

# Print `y_test` to check the results
print(y_test)

The images can also be visualized together with their predicted labels. This is shown below:

# Import matplotlib
import matplotlib.pyplot as plt

# Assign the predicted values to `predicted`
predicted = svc_model.predict(X_test)

# Zip together `images_test` and the `predicted` values in `images_and_predictions`
images_and_predictions = list(zip(images_test, predicted))

# For the first 4 elements in `images_and_predictions`
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    # Initialize the subplots in a grid measuring 1 by 4 at the position i+1
    plt.subplot(1, 4, index + 1)
    # Do not show the axes
    plt.axis('off')
    # Display the images in all the subplots of the grid
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title to the plot
    plt.title('Predicted: ' + str(prediction))

# Show the plot
plt.show()

Note that we have zipped the images and the predicted values together, and that only the first four elements of images_and_predictions have been taken. We should now determine the performance of the model. This is shown below:

# Import `metrics`
from sklearn import metrics

# Print the classification report of `y_test` and `predicted`
print(metrics.classification_report(y_test, predicted))

# Print the confusion matrix of `y_test` and `predicted`
print(metrics.confusion_matrix(y_test, predicted))
You can compare the performance of this model to the clustering model which we used earlier. It is also possible for you to visualize the predicted and actual labels using Isomap(). This is shown below:

# Import `Isomap()`
from sklearn.manifold import Isomap

# Create an isomap and then fit the `digits` data to it
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

# Calculate the cluster centers and then predict the cluster index for every sample
predicted = svc_model.predict(X_train)

# Create a plot with subplots on a grid measuring 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.subplots_adjust(top=0.85)

# Add the scatterplots to the subplots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=predicted)
ax[0].set_title('Predicted labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Labels')

# Add a title
fig.suptitle('Predicted versus actual labels', fontsize=14, fontweight='bold')

# Show the plot
plt.show()

This visualization confirms the classification report.


Chapter 3- Logistic Regression

Logistic regression is a classification algorithm. Its learning process is closely related to that of linear regression, with the difference being that the cost and gradient functions are formulated differently. Logistic regression uses a logit or sigmoid activation function rather than the continuous output used in linear regression.

To demonstrate how this works, let us begin by importing the data which is to be used in the exercise:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import os
path = os.getcwd() + '\data\data1.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
data.head()

In this data, we have two continuous independent variables, Exam 1 and Exam 2. The Admitted label is our prediction target, and it is binary-valued: a value of 0 indicates that the student was not admitted, while a value of 1 indicates that the student was admitted. This can be visualized with colors on a graph using the following code:


positive =
data[data['Admitted'].isin([1])]

negative =

data[data['Admitted'].isin([0])]

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['Exam 1'],
positive['Exam 2'], s=50, c='b',
marker='o', label='Admitted')

ax.scatter(negative['Exam 1'],
negative['Exam 2'], s=50, c='r',
marker='x', label='Not Admitted')

ax.legend()

ax.set_xlabel('Exam 1 Score')

ax.set_ylabel('Exam 2 Score')

The plot suggests a nearly linear decision boundary. Since it curves slightly, it is impossible to classify every example correctly with a straight line, but we can get quite close.

It is now time for us to implement logistic regression, so that we can train a model to find the optimal decision boundary and then make some class predictions. Our first step is the implementation of a sigmoid function. This is shown below:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

This is the activation function for the output of logistic regression. It converts a continuous input into a value between 0 and 1. That value can be interpreted as the class probability, or the likelihood that an input example should be classified positively. Using this probability together with a threshold value, we can obtain a discrete label prediction. It is helpful to visualize the output of the function to see what is happening. This is shown below:

nums = np.arange(-10, 10, step=1)

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(nums, sigmoid(nums), 'r')
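As a small illustration of the thresholding idea described above (a sketch only, using NumPy directly rather than the predict() function defined later):

# Convert sigmoid probabilities into discrete 0/1 labels with a 0.5 threshold
probs = sigmoid(np.array([-2.0, -0.3, 0.0, 0.8, 3.0]))
labels = (probs >= 0.5).astype(int)
print(probs)
print(labels)  # [0 0 1 1 1]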

We should then go ahead and write a cost function. A cost function evaluates your model's performance on the training data, given a set of model parameters. The cost function for logistic regression is as follows:

def cost(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / (len(X))
The output is reduced down to a single scalar value: the cost, quantified as a function of the difference between the class probability assigned by the model and the true label of the example. Note that this is implemented in a vectorized manner, because the model's predictions for the whole data set are computed in one statement, sigmoid(X * theta.T).

It is now a good time for us to test the cost function to make sure it is working well, but first we need to do some setup. This can be done as shown below:

# add a ones column - this makes the matrix multiplication easier
data.insert(0, 'Ones', 1)

# set X (training data) and y (target variable)
cols = data.shape[1]
X = data.iloc[:, 0:cols-1]
y = data.iloc[:, cols-1:cols]

# convert to numpy arrays and then initialize the parameter array theta
X = np.array(X.values)
y = np.array(y.values)
theta = np.zeros(3)

It is good to check the shape of the data structures you are working with, to confirm that their dimensions are sensible. This is a very useful technique when implementing matrix multiplication.

X.shape, theta.shape, y.shape

With the model parameters theta given as zeros, we can calculate the cost of this initial solution:

cost(theta, X, y)
Our cost function is working. Our next step is to write a function for computing the gradient of the model parameters, to learn how we can change the parameters to improve the outcome of the model on the training data. The function can be written as shown below:

def gradient(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)

    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)

    error = sigmoid(X * theta.T) - y

    for i in range(parameters):
        term = np.multiply(error, X[:, i])
        grad[i] = np.sum(term) / len(X)

    return grad

Note that we have only computed a single gradient step here. Let's use SciPy's fmin_tnc optimization function (the equivalent of "fminunc"), which optimizes the parameters given functions that compute the cost and the gradients. This is shown below:

import scipy.optimize as opt
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
cost(result[0], X, y)

Our next step is to write a function which gives us predictions for the dataset X using the learned parameters theta. The function can then be used for scoring the training accuracy of the classifier.

def predict(theta, X):
    probability = sigmoid(X * theta.T)
    return [1 if x >= 0.5 else 0 for x in probability]

theta_min = np.matrix(result[0])
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
# percentage of correct predictions
accuracy = 100 * sum(map(int, correct)) // len(correct)
print('accuracy = {0}%'.format(accuracy))

This gives an accuracy of about 89%, which is a good result for such a simple model.


Conclusion

We have come to the end of this book. That is how machine learning is done in Python. Always begin by choosing the training data set, and then continue with the rest of the steps!
