
Machine Learning

with Python

The Basics

By David V.
Copyright 2017 by David V.
All Rights Reserved

All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the author, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law.
Table of Contents
Introduction
Chapter 1- Getting Started
Chapter 2- Python and matplotlib for Data Exploration
Chapter 3- Logistic Regression
Conclusion
Disclaimer

While all attempts have been made to verify the information provided in this book, the author does not assume any responsibility for errors, omissions, or contrary interpretations of the subject matter contained within. The information provided in this book is for educational and entertainment purposes only. The reader is responsible for his or her own actions, and the author does not accept any responsibility for any liabilities or damages, real or perceived, resulting from the use of this information.

The trademarks that are used are used without any consent, and the publication of the trademarks is without permission or backing by the trademark owners. All trademarks and brands within this book are for clarifying purposes only and are owned by their respective owners, who are not affiliated with this document.


Introduction

Machine learning is a very common field of study in the world today. With Python, it is easy for you to implement the concepts of machine learning, because the language has numerous libraries that help you create production systems employing machine learning. This book helps you learn how. Enjoy reading!
Chapter 1- Getting Started

The best way for you to understand machine learning is by doing: completing projects, beginning with small ones. Python is an interpreted and very powerful programming language. It can be used for research as well as for the creation of production systems, as it is a complete programming language. Python has numerous libraries and modules to choose from, meaning that there are different ways to accomplish a particular task.

The following are the best steps to follow when doing a machine learning project in Python:

1. Define Problem.

2. Prepare Data.

3. Evaluate Algorithms.

4. Improve Results.

5. Present Results.

The above steps will also help you learn how to use new tools or a new platform.

First Project

Our first project will be classifying iris flowers. This project is good for the following reasons:

1. It has numeric attributes, making it possible for us to figure out how the loading and handling of data can be done.

2. The problem falls under the category of classification, meaning that we can implement it using a supervised learning algorithm.

3. It is also a multi-class classification problem, which needs a specialized form of handling.

4. The project is small, and it will fit into memory very well.

5. All the numeric attributes are in the same units and on the same scale, which means we will not have to do any special transformations or scaling to get started.

The following is what we will cover in this small project:

1. Installing Python and the SciPy platform.

2. Loading the dataset.

3. Summarizing the dataset.

4. Visualizing the dataset.

5. Evaluating some algorithms.

6. Making predictions.
Installing Python and the SciPy Platform

You should install the Python SciPy platform on your system if you have not already done so. The following are the SciPy libraries which you should install on your system:

scipy
numpy
matplotlib
pandas
sklearn

These libraries can be installed in a number of ways. The best approach is to choose a single installation method and then follow it throughout the installation process.
Installing via pip
In the case of Mac and Linux users, it is possible to install the SciPy libraries via pip. The libraries will be installed in wheel package format. To do the installation, you must ensure that you have installed both pip and Python on your system. Note, however, that installing these libraries with pip on Windows does not work properly.

First, begin by upgrading pip to the latest version. This can be done with the following command:

python -m pip install --upgrade pip

You can then use pip to install the SciPy packages. The following command demonstrates how to do this:

pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose

The packages will then be installed for your local user, so no special permissions are needed to write to the system directories.
In the case of user installs, please ensure that the user install executable directory is on the PATH. In Linux, the PATH can be set as follows:

# This should be added at the end of the ~/.bashrc file
export PATH="$PATH:/home/username/.local/

In OSX, the PATH can be set as follows:

# This should be added at the end of the ~/.bash_profile file
export PATH="$PATH:/Users/username/Libra

Note that you must use your correct username.
Installation via Linux Package Manager

The installation can also be done much more quickly from the repositories of the various Linux distributions. Note that such installations will be system-wide, and they will be somewhat outdated compared to the versions installed via pip.

For users of Ubuntu and Debian, the installation of the libraries can be done by executing the following command on the terminal:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

For users of Fedora 22 and later versions, use the following command:

sudo dnf install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel
Installation via Mac Package Manager

Unlike Linux, the Mac does not come with a package manager, but there exist several package managers which you can install.

Macports

The installation of the SciPy libraries using this package manager can be done by executing the following command:

sudo port install py35-numpy py35-scipy py35-matplotlib py35-ipython +notebook py35-pandas py35-sympy py35-nose

Now that you have the SciPy libraries necessary for this project, you can move to the next step.
Start Python, Check for Versions

You should check to be sure that Python was installed properly and is running as expected. Just open the terminal and then type the following:

python

Below is a script which can be used for testing the environment. It works by importing each of the libraries required for our project and printing the version of each library. This is the script:

# Check the versions of the libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

You will then get the version of each of the libraries if the installation was done successfully.

Load Data

We will be using the iris flower data set. It is a very famous dataset, widely used in machine learning by almost everyone.

It has 150 observations of iris flowers, with four columns of flower measurements in centimeters. The fifth column holds the species to which each flower belongs. There are three species, and each flower must belong to one of these.

Importing the Libraries

There are objects, functions, and libraries which should be used in this project. We can import them into the project as follows:
# Load libraries
import pandas
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

Everything should load without an error. If you do get an error, stop and fix your environment before continuing.

Loading the Dataset

It is possible for us to load data directly from the UCI Machine Learning repository. We will use pandas to load the data, and pandas will also be used later for exploring the data with both data visualization and descriptive statistics.

The names of the columns have to be specified while loading the data. This will help us later when we need to explore the data. This can be done as shown below:

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

Our expectation is that the data should load without any incident. If you have network problems, feel free to download the iris.data file into the working directory and then load it with the same mechanism, but the URL has to be changed to point to the local file.
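For example, a minimal sketch assuming you have saved the file as iris.data in the current working directory:

# Load the dataset from a local copy of iris.data instead of the URL
dataset = pandas.read_csv("iris.data", names=names)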

It is possible for us to learn the number of rows (instances) and attributes we have by using the shape property. This can be done as shown below:

# shape
print(dataset.shape)

There should be 150 instances and 5 attributes, as shown below:

(150, 5)
We can then eyeball the data as follows:

# head
print(dataset.head(20))

This will give the first 20 rows of data contained in the file. This is shown below:
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa

We can then look at the statistics of each attribute. This includes the count, the mean, the min and max values, and some percentiles. This is shown below:

# descriptions
print(dataset.describe())

You will observe that all the numeric values have a similar scale and similar ranges.

Class Distribution

We now need to know the number of instances that belong to each class. This can be viewed as an absolute count, as demonstrated below:

# class distribution
print(dataset.groupby('class').size())

You will then find that all the classes have the same number of instances.
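If the data loaded correctly, the output should look roughly like this (each of the three species appears 50 times in the iris data set):

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64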

Data Visualizations

Now that we have some basic idea regarding the data, it is good for us to extend this with visualizations. Let us have a look at the plots.

Univariate Plots

These will help us understand each attribute; they are plots of the individual variables. Since we have numeric input variables, we can create some box and whisker plots for them. This is shown below:

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
This will help us get a good picture of how the input variables are distributed. We can also create a histogram of each input variable, which will help us learn more about the distributions:

# histograms
dataset.hist()
plt.show()

Two of the input variables appear to have a Gaussian distribution. This is useful to note, as there are algorithms which can exploit this assumption.

Multivariate Plots

We should explore how the variables interact with each other. Scatterplots of all pairs of attributes will help us identify any structured relationships between the input variables.

# scatter plot matrix
scatter_matrix(dataset)
plt.show()

You will notice a diagonal grouping for some of the attribute pairs. This is an indication of a high correlation and a predictable relationship.
Evaluation of Algorithms

We will now build some models of the data and then estimate their accuracy on unseen data.

Creating the Validation Dataset

We want to know whether the model we create is a good one. Later, statistical methods will be used for estimating the accuracy of the models on unseen data. We also want a more concrete estimate of the accuracy of the best model by evaluating it on actual unseen data, so we hold some of the data back from the algorithms. This is shown below:

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

The loaded dataset has been split into two: 80% of it will be used for training the models, and 20% will be held back as the validation dataset. The training data is now contained in X_train and Y_train for training the models, while X_validation and Y_validation are held back for use later.

Test Harness

10-fold cross validation will be used to estimate the accuracy. With this, the training dataset is split into 10 parts: training is done on 9 parts while the remaining part is used for testing, and this is repeated for all combinations of train-test splits. This is shown below:

# Test options and the evaluation metric
seed = 7
scoring = 'accuracy'

The models are to be evaluated using the metric of accuracy. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to get a percentage. We will use the scoring variable when we build and evaluate each model below.
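As a quick illustration with made-up numbers (not part of the project code), the accuracy calculation is simply:

# Accuracy as a percentage: correct predictions divided by total instances, times 100
correct_predictions = 144   # hypothetical count of correct predictions
total_instances = 150       # hypothetical dataset size
accuracy = correct_predictions / total_instances * 100
print(accuracy)  # 96.0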

Build Models

We are not sure which algorithms or which configurations will be best for this kind of problem. There are 6 possible algorithms which we can use. These include the following:

1. Logistic Regression (LR).

2. Gaussian Naive Bayes (NB).

3. Linear Discriminant Analysis (LDA).

4. K-Nearest Neighbors (KNN).

5. Classification and Regression Trees (CART).

6. Support Vector Machines (SVM).

This is a mix of simple linear and non-linear algorithms. The random number seed is reset before each run, which ensures that each algorithm is evaluated using the same data splits and that the results can be directly compared. Let us begin by evaluating our six models:


# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

Selecting the Best Model

We now have 6 models as well as an accuracy estimate for each of them. Our aim is to compare the models and then choose the most accurate one. The program should give the following result once executed:

LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.981667 (0.025000)

As shown in the above output, KNN has the highest estimated accuracy score. We can go ahead and create a plot of the model evaluation results and compare the mean accuracy and spread of each model. Note that each algorithm was evaluated 10 times.

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Making Predictions

We found the KNN model to be the most accurate of the tested models. It is now time for us to get the accuracy of this model on the validation set.

This will provide us with a final independent check on the accuracy of the best model. It is also good to keep a validation set in case a mistake occurred during training, such as overfitting to the training set or a data leak; both would lead to an overly optimistic result.

The KNN model can be run directly on the validation set, and we then summarize the results as a final accuracy score, a confusion matrix, and a classification report. This is shown below:
below:

# Make predictions on the validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

You will then get the accuracy. The confusion matrix gives an indication of the three errors that were made, and the classification report gives a breakdown of each class by precision, recall, f1-score, and support. Here is the result:

0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11

    avg / total       0.90      0.90      0.90        30
Chapter 2- Python and matplotlib for Data Exploration

The Python libraries can be used together with matplotlib for the purpose of exploring data. Let us discuss how this can be done.

Load the Data Set

This is the first step in machine learning. The data should be observed data: you can collect it yourself, or you can browse various data sources for existing data sets. In this case, we will load the digits data set which comes with scikit-learn, a Python library.

For the data to be loaded, we have to import the datasets module from sklearn. You can then make use of the load_digits() method from datasets to load the data. This is shown below:

# Import `datasets` from `sklearn`
from sklearn import datasets

# Load in the `digits` data
digits = datasets.load_digits()

# Print the `digits` data
print(digits)

It is good to be aware that the datasets module also has other methods for loading and fetching popular reference datasets, and you can count on the module if you need artificial data generators. Alternatively, if we wanted to pull the data straight from a repository such as the UCI Machine Learning Repository, we could do it as follows:

# Import the `pandas` library as `pd`
import pandas as pd

# Load in the data with `read_csv()`
digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra", header=None)

# Print out the `digits`
print(digits)

Note that when the data is split up in this manner, it is divided into a training set and a test set, indicated by the .tra and .tes extensions. Both files need to be loaded to work through the full project; the command above only loads the training set.
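As a minimal sketch (assuming the same URL pattern holds for the test file), the test portion could be loaded in the same way:

# Load the corresponding test set (.tes extension, same URL pattern)
digits_test = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes", header=None)
print(digits_test.shape)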
Exploring the Data

Before you begin using a particular data set, it is good to read its description and try to learn something about it. For scikit-learn's built-in data, a separate description file is not readily available, but when you import data from an external data source, it will usually come with a description, and this will give you enough insight into the data. However, it is good to have sufficient knowledge of how the data set works.

Performing exploratory data analysis (EDA) on a particular data set is not always easy.

Gather Information on the Data

Suppose you have not checked any folder with a data description; then it is time for you to start gathering that information yourself.

After printing out the digits data once you have loaded it through the scikit-learn datasets module, you will notice that a lot of information is available. You should already be aware of things such as the target values and the description of the data. The digits data can be accessed through the data attribute, the target values through the target attribute, and the description via the DESCR attribute.
If you need to know which keys are available, you just have to execute digits.keys(). You can try the code shown below:

# Get the keys of the `digits` data
print(digits.keys())

# Print out the data
print(digits.data)

# Print out the target values
print(digits.target)

# Print out the description of the `digits` data
print(digits.DESCR)

You can then go ahead and check the type of your data. If you used read_csv() to import the data, you will have a DataFrame containing all the data. There will be no description component, but it is possible to resort to head() or tail() to inspect the data. It is always good to read the folder containing the data description.
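For example, a minimal sketch assuming the DataFrame loaded earlier with read_csv() is still named digits:

# Inspect the first and last rows of the DataFrame loaded with read_csv()
print(digits.head())
print(digits.tail())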


The data attribute should be used for

isolating the numpy array from the


digits data and the shape attribute

should be used for finding more. The


same can also be done for target and

DESCR. An attribute known as

images also exists, and this is used for

describing the data in the images.

The shape attribute of an array can be


used as shown below:

# numpy is needed for np.unique() below
import numpy as np

# Isolate the `digits` data
digits_data = digits.data

# Inspect the shape
print(digits_data.shape)

# Isolate the target values with `target`
digits_target = digits.target

# Inspect the shape
print(digits_target.shape)

# Print the number of unique labels
number_digits = len(np.unique(digits.target))

# Isolate the `images`
digits_images = digits.images

# Inspect the shape
print(digits_images.shape)

Visualizing the Data Images Using matplotlib

It is also possible for you to visualize the images which you are using. There are multiple libraries in Python which can be used for this purpose, but here we will be using matplotlib. This can be done as shown below:

# Import the matplotlib library
import matplotlib.pyplot as plt

# Set the figure size (width, height) in inches
fig = plt.figure(figsize=(6, 6))

# Adjust the subplots
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# For each of the 64 images
for i in range(64):
    # Initialize the subplots: add a subplot in the grid of 8 by 8, at the i+1-th position
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    # Display an image at the i-th position
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # Label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

# Show the plot
plt.show()

The above code might seem lengthy and even overwhelming. Note that we began by importing the library, matplotlib.pyplot. We then set up a figure 6 inches wide and 6 inches long. This creates a canvas, and all the subplots containing the images will be displayed on it. We also set the alignment on the left, right, bottom, and top. We then created a loop to fill the figure we created.

The subplots are initialized one by one, and each is added to its own position on a grid measuring 8 by 8, so one image is displayed in each grid cell. We also used binary colors, which give us white, black, and gray values, and nearest as the interpolation method, which means the displayed image is not smoothed.

The cherry on the pie is the text added to the subplots. The target labels are printed at the coordinates (0, 7) of each subplot, meaning that they appear in the bottom-left corner of the subplot. The line plt.show() displays the plot so that it becomes visible. To make it simpler, it is possible for you to visualize only the target labels as shown below:

# Import matplotlib
import matplotlib.pyplot as plt

# Join the images and the target labels into a list
images_and_labels = list(zip(digits.images, digits.target))

# For each element contained in the list
for index, (image, label) in enumerate(images_and_labels[:8]):
    # Initialize a subplot of 2X4 at the i+1-th position
    plt.subplot(2, 4, index + 1)
    # Do not plot any axes
    plt.axis('off')
    # Display the images in all the subplots
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title to every subplot
    plt.title('Training: ' + str(label))

# Show the plot
plt.show()

Note that once we imported matplotlib.pyplot, we zipped our two numpy arrays together and saved the result in a variable named images_and_labels. Each element of this list holds an instance of digits.images and the corresponding value of digits.target.

Principal Component Analysis (PCA)

Since the digits data set has 64 features, it presents a challenge: it becomes hard for us to understand the structure and maintain an overview of the digits data. You are working with a high-dimensional data set.

High dimensionality results when one tries to describe objects via a collection of many features. The problem with high-dimensional data is that the algorithms might be expected to take in too many features. Having many dimensions may also mean that the data points are located far from each other, and the distance between data points can become uninformative.

Principal Component Analysis (PCA) helps us solve this problem. It works by finding a linear combination of the two variables that contains much of the information. This principal component, or new variable, can be used to replace the two original variables. You can see it as a linear transformation method for finding the directions that maximize the variance of the data.

Scikit-learn makes it easy to apply PCA to your data. This is shown below:

# Import the PCA models (RandomizedPCA is available in older scikit-learn versions)
from sklearn.decomposition import PCA, RandomizedPCA

# Create a Randomized PCA model that takes in two components
randomized_pca = RandomizedPCA(n_components=2)

# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(digits.data)

# Create a regular PCA model
pca = PCA(n_components=2)

# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(digits.data)

# Inspect the shape
reduced_data_pca.shape

# Print out the data
print(reduced_data_rpca)
print(reduced_data_pca)

Note that in the above example we have used the RandomizedPCA() method, because it tends to perform better when the number of dimensions is high. You can replace the randomized PCA model with a regular PCA estimator and observe the difference in the results.

It is good to keep in mind that the model is told to keep only two components. This ensures that we have two-dimensional data that we can plot. Also, note that the target class labels are not passed to the PCA transformation; we want to investigate whether the PCA reveals the distribution of the different labels and whether the instances can be clearly separated from each other. A scatterplot can now be built to visualize the data:

colors = ['black', 'purple', 'blue', 'yellow', 'white', 'lime', 'cyan', 'orange', 'red', 'gray']
for i in range(len(colors)):
    x = reduced_data_rpca[:, 0][digits.target == i]
    y = reduced_data_rpca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])
plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()

Running this code uses matplotlib to display the scatter plot of the reduced data.

Preprocessing the Data

Data should be prepared well before it can be modeled. This preparation step is commonly known as preprocessing.

Data Normalization

We will begin by preprocessing the data. The digits data can, for example, be standardized by using the scale() method. The following example demonstrates this:

# Import
from sklearn.preprocessing import scale

# Apply the `scale()` method to the `digits` data
data = scale(digits.data)

Once the data has been scaled, the distribution of each attribute is shifted so that its mean is 0 and its standard deviation is 1.
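As a quick sanity check (a minimal sketch, assuming the scaled array is stored in data as above), you can verify this:

# Per-feature means should be close to 0 and standard deviations close to 1
# (features that are constant in the original data keep a standard deviation of 0)
import numpy as np
print(np.round(data.mean(axis=0), 2))
print(np.round(data.std(axis=0), 2))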

Split the Data into Test and Training Sets

To assess the performance of your model later, the data set should be divided into two parts: a test set and a training set. The first one is used for evaluating the trained system, while the second one is used for training the system.

A common approach is to use roughly two-thirds to three-quarters of the data set for training and the remainder as the test set. In the example below, 25% of the data is held out for testing:


# Import the `train_test_split`

from sklearn.cross_validation import


________________

# Split `digits` data into the training

and the test sets

X_train, X_test, y_train, y_test,


images_train, images_test =
train_test_split(data, digits.target,
digits.images, test_size=0.25,
random_state=42)
Note that in the above code, the traditional way of splitting has been respected. In the arguments to the train_test_split() method, you can see that the test_size has been set to 0.25. You also see that the parameter random_state has been set to a value of 42; this argument ensures that the split is reproducible, so it comes out the same every time the code is run.

Now that the data set has been split into train and test sets, the numbers can be inspected before the data is modeled. This is shown below:

# Number of training features
n_samples, n_features = X_train.shape

# Print out `n_samples`
print(n_samples)

# Print out `n_features`
print(n_features)

# Number of training labels
n_digits = len(np.unique(y_train))

# Inspect `y_train`
print(len(y_train))

The training set X_train now contains 1347 samples, which is 75% of the original data set. The remaining 25%, held in X_test and y_test, is the test set of 450 samples.

Clustering the digits Data

At this point we have only loaded and stored the data; we have not performed any actual learning or modeling until now.

The time has come for us to find the clusters of the training set. The model can be set up using KMeans() from the cluster module. You will observe that only three arguments are passed to the model: init, n_clusters, and random_state.

You may remember the last argument from above, when we split the data into training and test sets; it is responsible for ensuring that we get reproducible results. Consider the code given below:

# Import the `cluster` module
from sklearn import cluster

# Create the KMeans model
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)

# Fit the training data `X_train` to the model
clf.fit(X_train)

The init argument specifies the initialization method; even though it defaults to k-means++, it appears explicitly in the code. It is also clear that the argument n_clusters has been set to 10. This number specifies the number of groups or clusters which the data will be formed into, as well as the number of centroids which will be generated. Note that a cluster centroid is the middle of a cluster.

If you were to add the n_init parameter to the KMeans() function, you would be able to determine how many different centroid configurations the algorithm will try. The images making up the cluster centers can be visualized as shown below:

# Import matplotlib
import matplotlib.pyplot as plt

# Set the figure size in inches
fig = plt.figure(figsize=(8, 3))

# Add a title
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

# For all the labels (0-9)
for i in range(10):
    # Initialize the subplots in a grid measuring 2X5, at the i+1-th position
    ax = fig.add_subplot(2, 5, 1 + i)
    # Display the images
    ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
    # Don't show the axes
    plt.axis('off')

# Show the plot
plt.show()

The next step is the prediction of the labels of the test set. This can be done as shown below:

# Predict the labels for `X_test`
y_pred = clf.predict(X_test)

# Print out the first 100 instances of `y_pred`
print(y_pred[:100])

# Print out the first 100 instances of `y_test`
print(y_test[:100])

# Study the shape of the cluster centers
clf.cluster_centers_.shape

In the above example, we are predicting the values of the test set, which contains 450 samples. The result is stored in y_pred. We then print out the first 100 instances of y_pred and y_test, and some results should be observable immediately.

We can now visualize the labels which have been predicted. This can be done as shown below:

# Import `Isomap()`
from sklearn.manifold import Isomap

# Create an isomap and fit the `digits` data to it
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

# Compute the cluster centers and predict the cluster index for every sample
clusters = clf.fit_predict(X_train)

# Create a plot with the subplots in a grid measuring 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.suptitle('Predicted Versus the Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

# Add the scatterplots to the subplots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')

# Show the plots
plt.show()

Isomap() is used here as a way of reducing the high-dimensional digits data set. The difference from the PCA method is that Isomap is a non-linear reduction method. You can try to run the above code using PCA rather than Isomap and see the effect. The solution can be found here:

# Import `PCA()`
from sklearn.decomposition import PCA

# Model and then fit the `digits` data to the PCA model
X_pca = PCA(n_components=2).fit_transform(X_train)

# Compute the cluster centers and then predict the cluster index for every sample
clusters = clf.fit_predict(X_train)

# Create a plot with the subplots in a grid of 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.suptitle('Predicted Versus the Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

# Add the scatterplots to the subplots
ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')

# Show the plots
plt.show()
Evaluating the Clustering Model

We should now evaluate the performance of our model; in other words, we need to know how accurate our model's predictions are.

Let us begin by printing out a confusion matrix:

# Import `metrics` from `sklearn`
from sklearn import metrics

# Print out the confusion matrix with `confusion_matrix()`
print(metrics.confusion_matrix(y_test, y_pred))
You may want to learn more about the results than the confusion matrix alone can tell you. We should apply some different cluster metrics to assess the quality of our clusters; this way, you will learn how well the cluster labels fit the correct labels. Consider the following code:
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_rand_score, adjusted_mutual_info_score, silhouette_score

print('% 9s' % 'inertia   homo   compl   v-meas   ARI   AMI   silhouette')
print('%i   %.3f   %.3f   %.3f   %.3f   %.3f   %.3f'
      % (clf.inertia_,
         homogeneity_score(y_test, y_pred),
         completeness_score(y_test, y_pred),
         v_measure_score(y_test, y_pred),
         adjusted_rand_score(y_test, y_pred),
         adjusted_mutual_info_score(y_test, y_pred),
         silhouette_score(X_test, y_pred, metric='euclidean')))

There are several metrics to consider. The homogeneity score tells us the extent to which each cluster contains only data points which belong to a single class. The completeness score measures the extent to which data points that are members of a given class are also elements of the same cluster. The v-measure score is the harmonic mean of homogeneity and completeness.
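As a small illustration of that relationship (a sketch only, not part of the original code), the v-measure can be recomputed from the two underlying scores:

# v-measure is the harmonic mean of homogeneity (h) and completeness (c)
h = homogeneity_score(y_test, y_pred)
c = completeness_score(y_test, y_pred)
v = 2 * h * c / (h + c)
print(v, v_measure_score(y_test, y_pred))  # the two values should match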

The adjusted Rand score (ARI) measures the similarity between two clusterings by considering all pairs of samples and counting the pairs that are assigned to the same or different clusters in the predicted and true clusterings.

The Adjusted Mutual Information (AMI) score also helps in the comparison of clusterings. It measures the agreement between the two assignments while correcting for chance groupings, and it takes a maximum value of 1 when the clusterings are equivalent.

The silhouette score measures how similar an object is to its own cluster compared to the other clusters. Its value ranges between -1 and 1, and a higher value indicates that the object is closely matched to the cluster it belongs to and poorly matched to the neighboring clusters. If many points have a high value, the cluster configuration is good.

From the above explanation, it is clear that our values are not good. In our case, the silhouette score is close to 0, which means that the samples lie close to the decision boundary between two neighboring clusters; this is an indication that the samples might have been assigned to the wrong clusters.

The ARI measure shows that not all data points in a given cluster are the same, while the completeness score shows that there are some data points which were not assigned to the correct cluster. You should therefore consider another estimator to predict the labels for the digits data.
Support Vector Machines

Consider the code given below:

# Import `train_test_split`
from sklearn.cross_validation import train_test_split

# Split the data into the training and the test sets
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(digits.data, digits.target, digits.images, test_size=0.25, random_state=42)

# Import the `svm` model
from sklearn import svm

# Create the SVC model
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')

# Fit the data to the SVC model
svc_model.fit(X_train, y_train)

If you follow the algorithm map, the first model we arrive at is a linear SVC. This has been applied to the digits data. Also, note that we have used X_train and y_train to fit the data to the SVC model; this is very different from clustering. In addition, the value for gamma has been set manually here. You can automatically obtain good values for the parameters by using tools such as cross validation and grid search.

Grid search is a very good way of adjusting the parameters. This can be done as shown below:

# Split the `digits` data into two equal sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.5, random_state=0)

# Import GridSearchCV
from sklearn.grid_search import GridSearchCV

# Set the parameter candidates
parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Create a classifier with the parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Print out the results
print('Best score for training data:', clf.best_score_)
print('Best `C`:', clf.best_estimator_.C)
print('Best kernel:', clf.best_estimator_.kernel)
print('Best `gamma`:', clf.best_estimator_.gamma)
You should then take the classifier with the parameter candidates which have just been created and apply it to the second part of the data set. Next, a new classifier should be trained using the best parameters found by the grid search. You then score the result to see whether the best parameters found by the grid search are actually working.

# Apply the classifier to the test data, and then view the accuracy score
clf.score(X_test, y_test)

# Train and then score a new classifier with the grid search parameters
svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(X_train, y_train).score(X_test, y_test)
You will see that the parameters are working well. You may have observed that in the first SVM classifier, the penalty parameter C of the error term was specified as 100. Note also that the kernel was explicitly specified as linear. The kernel argument specifies the kernel which is to be used in the algorithm, and it defaults to rbf. There are other kernel types which you can specify, such as poly and linear.

A kernel can be seen as a function which computes the similarity between training data points. When the kernel is provided to the algorithm, together with the training data and the labels, you get a classifier, just as we have in this case. Here we have trained a model which is able to categorize unseen objects into their specific categories. With a linear kernel, the SVM looks for a linear division of the data points.

The results obtained from the grid search indicate that an rbf kernel would have worked better; both the gamma and the penalty parameter were well specified by the search.

We can now proceed with our model and predict values for the test set. This is shown below:

# Predict the labels of `X_test`
print(svc_model.predict(X_test))

# Print `y_test` to check the results
print(y_test)

The images can also be visualized together with their predicted labels. This is shown below:

# Import matplotlib
import matplotlib.pyplot as plt

# Assign the predicted values to `predicted`
predicted = svc_model.predict(X_test)

# Zip together `images_test` and the `predicted` values in `images_and_predictions`
images_and_predictions = list(zip(images_test, predicted))

# For the first 4 elements in `images_and_predictions`
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    # Initialize the subplots in a grid measuring 1 by 4 at the position i+1
    plt.subplot(1, 4, index + 1)
    # Do not show the axes
    plt.axis('off')
    # Display the images in all the subplots of the grid
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title to the plot
    plt.title('Predicted: ' + str(prediction))

# Show the plot
plt.show()

Note that we have zipped the images and the predicted values together, and that only the first four elements of images_and_predictions have been taken. We should now determine the performance of the model. This is shown below:

# Import `metrics`
from sklearn import metrics

# Print the classification report of `y_test` and `predicted`
print(metrics.classification_report(y_test, predicted))

# Print the confusion matrix of `y_test` and `predicted`
print(metrics.confusion_matrix(y_test, predicted))
You can compare the performance of this model to the clustering model which we used earlier. It is also possible for you to visualize the predicted and actual labels using Isomap(). This is shown below:

# Import `Isomap()`
from sklearn.manifold import Isomap

# Create an isomap and then fit the `digits` data to it
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

# Calculate the cluster centers and then predict the cluster index for every sample
predicted = svc_model.predict(X_train)

# Create a plot with subplots on a grid measuring 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.subplots_adjust(top=0.85)

# Add the scatterplots to the subplots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=predicted)
ax[0].set_title('Predicted labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Labels')

# Add a title
fig.suptitle('Predicted versus actual labels', fontsize=14, fontweight='bold')

# Show the plot
plt.show()

This visualization confirms the classification report.


Chapter 3- Logistic Regression

Logistic regression is a classification algorithm. Its learning process is closely related to that of linear regression, with the difference being that the cost and gradient functions are formulated differently. Logistic regression uses a logit or sigmoid activation function rather than the continuous output used in linear regression.

To demonstrate how this works, let us begin by importing the data which is to be used in the exercise:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import os
path = os.getcwd() + '\data\data1.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
data.head()

In this data, we have two continuous independent variables, Exam 1 and Exam 2. The Admitted label is our prediction target, and it is binary-valued: a value of 0 indicates that the student was not admitted, while a value of 1 indicates that the student was admitted. This can be visualized with colors on a graph using the following code:


positive =
data[data['Admitted'].isin([1])]

negative =

data[data['Admitted'].isin([0])]

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['Exam 1'],
positive['Exam 2'], s=50, c='b',
marker='o', label='Admitted')

ax.scatter(negative['Exam 1'],
negative['Exam 2'], s=50, c='r',
marker='x', label='Not Admitted')

ax.legend()

ax.set_xlabel('Exam 1 Score')

ax.set_ylabel('Exam 2 Score')

The plot suggests a nearly linear decision boundary. Since it curves slightly, it is impossible to classify every example correctly with a straight line, but we can get quite close.

It is now time for us to implement logistic regression, so that we can train a model to find the optimal decision boundary and then make some class predictions. Our first step is the implementation of a sigmoid function. This is shown below:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

This is the activation function for the output of logistic regression. It converts a continuous input into a value between 0 and 1. That value can be interpreted as the class probability, or the likelihood that an input example should be classified positively. Using this probability together with a threshold value, we can obtain a discrete label prediction. It is helpful to visualize the output of the function to see what is happening. This is shown below:

nums = np.arange(-10, 10, step=1)

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(nums, sigmoid(nums), 'r')
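As a small illustration of the thresholding idea described above (a sketch only, using NumPy directly rather than the predict() function defined later):

# Convert sigmoid probabilities into discrete 0/1 labels with a 0.5 threshold
probs = sigmoid(np.array([-2.0, -0.3, 0.0, 0.8, 3.0]))
labels = (probs >= 0.5).astype(int)
print(probs)
print(labels)  # [0 0 1 1 1]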

We should then go ahead and write a cost function. A cost function evaluates your model's performance on the training data, given a set of model parameters. The cost function for logistic regression is as follows:

def cost(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / (len(X))
The output is reduced down to a single scalar value: the cost, quantified as a function of the difference between the class probability assigned by the model and the true label of the example. Note that this is implemented in a vectorized manner, because the model's predictions for the whole data set are computed in one statement, sigmoid(X * theta.T).

It is now a good time for us to test the cost function to make sure it is working well, but first we need to do some setup. This can be done as shown below:

# add a ones column - this makes the matrix multiplication easier
data.insert(0, 'Ones', 1)

# set X (training data) and y (target variable)
cols = data.shape[1]
X = data.iloc[:, 0:cols-1]
y = data.iloc[:, cols-1:cols]

# convert to numpy arrays and then initialize the parameter array theta
X = np.array(X.values)
y = np.array(y.values)
theta = np.zeros(3)

It is good to check the shape of the data structures you are working with, to confirm that their dimensions are sensible. This is a very useful technique when implementing matrix multiplication.

X.shape, theta.shape, y.shape

With the model parameters theta given as zeros, we can calculate the cost of this initial solution:

cost(theta, X, y)
Our cost function is working. Our next step is to write a function for computing the gradient of the model parameters, to learn how we can change the parameters to improve the outcome of the model on the training data. The function can be written as shown below:

def gradient(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)

    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)

    error = sigmoid(X * theta.T) - y

    for i in range(parameters):
        term = np.multiply(error, X[:, i])
        grad[i] = np.sum(term) / len(X)

    return grad

Note that we have only computed a single gradient step here. Let's use SciPy's fmin_tnc optimization function (the equivalent of "fminunc"), which optimizes the parameters given functions that compute the cost and the gradients. This is shown below:

import scipy.optimize as opt
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
cost(result[0], X, y)

Our next step is to write a function which gives us predictions for the dataset X using the learned parameters theta. The function can then be used for scoring the training accuracy of the classifier.

def predict(theta, X):
    probability = sigmoid(X * theta.T)
    return [1 if x >= 0.5 else 0 for x in probability]

theta_min = np.matrix(result[0])
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
# percentage of correct predictions
accuracy = 100 * sum(map(int, correct)) // len(correct)
print('accuracy = {0}%'.format(accuracy))

This gives an accuracy of about 89%, which is a good result for such a simple model.


Conclusion

We have come to the end of this book. That is how machine learning is done in Python. Always begin by choosing the training data set, and then continue with the rest of the steps!
