Machine Learning with Python
The Basics
By David V.
Copyright 2017 by David V.
All Rights Reserved
Getting Started
A predictive modeling project generally moves through five steps:
1. Define Problem.
2. Prepare Data.
3. Evaluate Algorithms.
4. Improve Results.
5. Present Results.
First Project
The iris dataset is a good first project: the attributes are numeric, so the data does not require any special form of handling, and it needs no transformations or scaling so as to get started. In this project we will work through the following steps:
1. Installing the Python and SciPy platform.
2. Loading the dataset.
3. Summarizing the dataset.
4. Visualizing the dataset.
5. Evaluating some algorithms.
6. Making predictions.
Installing Python and the SciPy Platform
There are 5 key libraries that you will need to install on your system:
scipy
numpy
matplotlib
pandas
sklearn
There is more than one way to install these packages; the pip installer is the most common route.
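Assuming pip is already available on your system, a single command installs all five (note that the sklearn library is published on PyPI under the name scikit-learn):
pip install scipy numpy matplotlib pandas scikit-learn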
A user install (pip install --user) places the packages in your home directory rather than in the system directories. In the case of user installs, please ensure that the directory containing the installed scripts is on your PATH. In Linux, the PATH can be set by adding the following line to the ~/.bashrc file:
export PATH="$PATH:/home/username/.local/bin"
In OSX, the PATH can be set by adding a similar line to the ~/.bash_profile file:
export PATH="$PATH:/Users/username/Library/Python/<version>/bin"
Here <version> is your installed Python version, and you should replace username with your actual username.
Installation via Linux Package Manager
On Linux, the SciPy libraries can also be installed through the distribution's package manager by running a single command from the terminal: this will pull in every library the package manager can install.
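For example, on Debian and Ubuntu systems the following apt command should work (the package names given here are the Python 3 builds and are an assumption about your distribution; check your package manager if they differ):
sudo apt-get install python3-scipy python3-numpy python3-matplotlib python3-pandas python3-sklearn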
On OSX, Macports plays the same role. The installation of the SciPy libraries by whichever method you choose should be followed by a check that everything is running as expected.
A simple way to perform this check is to run a short script that prints the version of each library. Below is the script:
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
If all the versions print without errors, the installation was done successfully.
Load Data
We will be using the dataset for the iris flower. It is a very famous dataset, familiar to almost everyone in the field. First, load the libraries that we will need:
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
We can see how many rows and columns the data contains with the shape property:
# shape
print(dataset.shape)
(150, 5)
That is, the dataset has 150 rows (instances) and 5 columns (attributes).
We can then eyeball the data as follows:
# head
print(dataset.head(20))
The first 20 rows of the data are shown below:
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
A statistical summary of each attribute, including the count, mean, min and max values and some percentiles, is shown below:
# descriptions
print(dataset.describe())
Class Distribution
Let us now take a look at the number of instances that belong to each class:
# class distribution
print(dataset.groupby('class').size())
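You will see that the classes are perfectly balanced, with 50 instances each:
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64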
Data Visualizations
Now that we have some basic idea regarding the data, it is good for us to extend this with some visualizations.
Univariate Plots
We start with some univariate plots, that is, plots of each individual variable. We begin with box and whisker plots:
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
This will help us have a good picture regarding how the input variables are distributed. We can also create a histogram of each input variable to get an idea of the distribution:
# histograms
dataset.hist()
plt.show()
Multivariate Plots
Next we can look at the interactions between the variables with a scatter plot matrix:
# scatter plot matrix
scatter_matrix(dataset)
plt.show()
Note the diagonal grouping of some pairs of attributes; this suggests high correlation and predictable relationships.
Evaluation of Algorithms
We need to know whether the model we create is actually any good. For that, we will hold back some data that the algorithms do not get to see during training, and use it later to get an independent estimate of the accuracy of the best model on unseen data. We split the loaded dataset into two parts: 80% for training and 20% for validation.
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
We now have training data in X_train and Y_train, and validation data in X_validation and Y_validation to be used later.
Test Harness
The 10-fold cross validation will be used for the purpose of estimating the accuracy: the training set is split into 10 parts, the model is trained on 9 of them and tested on the remaining 1, and this is repeated for all combinations of train-test splits. We are using the metric of accuracy, the ratio of correctly predicted instances to the total number of instances:
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
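As a minimal illustration of what 10-fold splitting does (a toy sketch of my own using 20 hypothetical samples, not part of the project code):
from sklearn import model_selection
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
for train_index, test_index in kfold.split(list(range(20))):
    # each fold trains on 18 samples and tests on the remaining 2
    print(len(train_index), len(test_index))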
Build Models
We do not know in advance which algorithms will do well on this problem, so let us evaluate a mixture of simple linear and nonlinear algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Gaussian Naive Bayes (NB) and Support Vector Machines (SVM).
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
We can also create a plot of the evaluation results and compare the spread and the mean accuracy of each model:
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Making Predictions
The KNN algorithm can now be tested directly on the validation set. This provides us with a final, independent check on the accuracy of the best model, and guards against an overly optimistic result caused by overfitting the training set.
# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11
    avg / total       0.90      0.90      0.90        30
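In the confusion matrix, rows correspond to the actual classes and columns to the predicted ones. The three off-diagonal entries show that all of the errors involve confusing Iris-versicolor with Iris-virginica; Iris-setosa is never misclassified.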
Chapter 2 - Python and matplotlib for Data Exploration
The Python libraries can be used to load and explore standard datasets. The scikit-learn library ships with a few small datasets, including the handwritten digits dataset, which can be loaded as shown below:
# Import `datasets` from `sklearn` and numpy
from sklearn import datasets
import numpy as np
# Load in the `digits` data
digits = datasets.load_digits()
# Print the `digits` data
print(digits)
Printing the dataset gives a large amount of information. The object that is returned behaves like a dictionary: the data itself lives under the data key, the labels under target, and a full description of the dataset under the DESCR attribute. If you need to know the keys which are available, you just have to execute digits.keys(), as shown below:
# Print the keys
print(digits.keys())
# Print out the data
print(digits.data)
# Print out the description of the `digits` data
print(digits.DESCR)
# Isolate the `data`
digits_data = digits.data
print(digits_data.shape)
# Isolate the target values
digits_target = digits.target
print(digits_target.shape)
# Count the number of unique labels
number_digits = len(np.unique(digits.target))
# Isolate the `images`
digits_images = digits.images
print(digits_images.shape)
Next, we can visualize the images themselves by plotting the first 64 digits in an 8 by 8 grid:
# Import matplotlib
import matplotlib.pyplot as plt
# Figure size (width, height) in inches
fig = plt.figure(figsize=(6, 6))
# Adjust the subplots
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# For each of the 64 images
for i in range(64):
    # Initialize the subplots: add a subplot in the grid of 8 by 8, at the i+1-th position
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    # Display image at the i-th position
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # Label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
# Show the plot
plt.show()
If the matplotlib code above does not yet feel smooth, there is a more compact alternative that joins the images and the target labels into a list:
# Import matplotlib
import matplotlib.pyplot as plt
# Join images and the target labels into a list
images_and_labels = list(zip(digits.images, digits.target))
# For every element in the list
for index, (image, label) in enumerate(images_and_labels[:8]):
    # initializing a subplot of the 2X4 grid at the i+1-th position
    plt.subplot(2, 4, index + 1)
    # Don't plot any axes
    plt.axis('off')
    # Display the image in the subplot
    plt.imshow(image, cmap=plt.cm.binary, interpolation='nearest')
    # Add a title with the label
    plt.title('Training: ' + str(label))
# Show the plot
plt.show()
Each subplot now displays an image together with its corresponding label from digits.target.
Principal Component Analysis (PCA)
The digits data has 64 features, far too many to visualize directly. PCA constructs a small number of new, uncorrelated variables (principal components) that capture as much as possible of the variance of data. A two-component model can be created as shown below:
# Import `PCA` and `RandomizedPCA`
from sklearn.decomposition import PCA, RandomizedPCA
# Create a Randomized PCA model that takes two components
randomized_pca = RandomizedPCA(n_components=2)
# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(digits.data)
# Create a regular PCA model
pca = PCA(n_components=2)
# Fit and transform data to the model
reduced_data_pca = pca.fit_transform(digits.data)
# Inspect the shape and the reduced data
print(reduced_data_pca.shape)
print(reduced_data_pca)
The reduced data can then be drawn in a scatter plot, with one color for each digit class:
colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']
for i in range(len(colors)):
    x = reduced_data_rpca[:, 0][digits.target == i]
    y = reduced_data_rpca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])
plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
Data Normalization
Before we apply a clustering algorithm, the digits data should be standardized so that every feature contributes on the same scale. The scale() function from sklearn.preprocessing demonstrates this:
# Import `scale`
from sklearn.preprocessing import scale
# Apply `scale()` to the `digits` data
data = scale(digits.data)
# Inspect the scaled data
print(data)
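As a quick sanity check (a small snippet of my own, not part of the original text), you can confirm that each feature column now has mean approximately 0 and, for the non-constant features, standard deviation approximately 1:
# Column means should be ~0; constant (all-zero) pixel columns keep std 0
print(data.mean(axis=0))
print(data.std(axis=0))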
Next, split the data into training and test sets and fit a K-Means clustering model to the training data:
# Split the `digits` data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)
# Number of distinct classes in the training labels
n_digits = len(np.unique(y_train))
# Create and fit the KMeans model
from sklearn import cluster
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)
The images of the ten cluster centers can then be visualized:
# Import matplotlib
import matplotlib.pyplot as plt
# Figure size in inches
fig = plt.figure(figsize=(8, 3))
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')
# For all of the 10 clusters
for i in range(10):
    # Initialize the subplots in a grid measuring 2X5, at the i+1-th position
    ax = fig.add_subplot(2, 5, 1 + i)
    # Display the image of the cluster center, reshaped to 8x8 pixels
    ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
    # Don't show axes
    plt.axis('off')
# Show plot
plt.show()
The labels for the test set can be predicted as shown below:
# Predict labels for the `X_test`
y_pred = clf.predict(X_test)
# Print out the first 100 predicted labels
print(y_pred[:100])
# Study the shape of the cluster centers
print(clf.cluster_centers_.shape)
To see how well the clusters match the actual labels, you can visualize the training data in two dimensions with an Isomap embedding, as shown below:
# Import `Isomap()`
from sklearn.manifold import Isomap
# Create an isomap and fit the `digits` data to it
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)
# Compute cluster centers and predict the cluster index for each sample
clusters = clf.fit_predict(X_train)
# Create plot with the subplots in a grid measuring 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
fig.subplots_adjust(top=0.85)
# Add the scatterplots to subplots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')
# Show the plots
plt.show()
The same comparison can be made with PCA instead of Isomap:
# Import `PCA()`
from sklearn.decomposition import PCA
# Model and fit the `digits` data to the PCA model
X_pca = PCA(n_components=2).fit_transform(X_train)
# Compute cluster centers and predict the cluster index for each sample
clusters = clf.fit_predict(X_train)
# Create plot with the subplots in a grid measuring 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
fig.subplots_adjust(top=0.85)
# Add the scatterplots to subplots
ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')
# Show plots
plt.show()
Evaluating the Clustering Model
The confusion matrix compares the predicted cluster labels with the actual labels of the test set:
# Import `metrics` from `sklearn`
from sklearn import metrics
# Print out the confusion matrix
print(metrics.confusion_matrix(y_test, y_pred))
You may also want to learn more regarding the quality of the clustering than the confusion matrix alone can tell you. Several standard clustering metrics can be computed with the following code:
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_rand_score, adjusted_mutual_info_score, silhouette_score
print('inertia    homo   compl  v-meas     ARI    AMI  silhouette')
print('%i   %.3f   %.3f   %.3f   %.3f   %.3f   %.3f'
      % (clf.inertia_,
         homogeneity_score(y_test, y_pred),
         completeness_score(y_test, y_pred),
         v_measure_score(y_test, y_pred),
         adjusted_rand_score(y_test, y_pred),
         adjusted_mutual_info_score(y_test, y_pred),
         silhouette_score(X_test, y_pred, metric='euclidean')))
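Briefly: homogeneity is high when each cluster contains members of only a single class, completeness is high when all members of a class end up in the same cluster, and the v-measure is the harmonic mean of the two. The adjusted Rand index (ARI) and adjusted mutual information (AMI) measure agreement between the two labelings corrected for chance, while the silhouette score measures how well separated the clusters are, ranging from -1 to 1.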
If you know the labels in advance, a supervised model is a better fit than clustering. Split the data once more with train_test_split, this time using the original digits data, and fit a support vector classifier:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(digits.data, digits.target, digits.images, test_size=0.25, random_state=42)
# Create and fit the SVC model
from sklearn import svm
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')
svc_model.fit(X_train, y_train)
The values of gamma and C were set manually above; better values can be found automatically by using a grid search.
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
# Set the parameter candidates
parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Create a classifier with the parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)
# Train the classifier on the training data
clf.fit(X_train, y_train)
# Print out the results
print('Best `C`:', clf.best_estimator_.C)
print('Best kernel:', clf.best_estimator_.kernel)
print('Best `gamma`:', clf.best_estimator_.gamma)
You should then apply the classifier with the best parameters to the test set, and verify the result by training a new SVC with those parameters by hand:
# Apply the classifier to the test data, and view the accuracy score
print(clf.score(X_test, y_test))
# Train and score a new classifier with the grid search parameters
print(svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(X_train, y_train).score(X_test, y_test))
You will see that the scores agree, confirming that the parameters found by the grid search do work. You can now use the model to make predictions:
# Predict the labels of `X_test`
print(svc_model.predict(X_test))
# Print the `y_test` to check the results
print(y_test)
A visualization of the first four test images together with their predicted labels is shown below:
# Import matplotlib
import matplotlib.pyplot as plt
# Assign the predicted values to `predicted`
predicted = svc_model.predict(X_test)
# Zip together the test images and the predicted values
images_and_predictions = list(zip(images_test, predicted))
# For the first 4 elements in `images_and_predictions`
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    # Initialize the subplots in a grid of 1 by 4 at the i+1-th position
    plt.subplot(1, 4, index + 1)
    # Don't show axes
    plt.axis('off')
    # Displaying the images in all the subplots in the grid
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title to the plot
    plt.title('Predicted: ' + str(prediction))
# Show the plot
plt.show()
The performance of the classifier can then be quantified with a classification report and a confusion matrix, as shown below:
# Import `metrics`
from sklearn import metrics
# Print the classification report of `y_test` and `predicted`
print(metrics.classification_report(y_test, predicted))
# Print the confusion matrix of `y_test` and `predicted`
print(metrics.confusion_matrix(y_test, predicted))
You will notice that the performance of this classifier is much better than that of the clustering model. You can visualize the predictions against the actual labels using an Isomap embedding once more:
# Import `Isomap()`
from sklearn.manifold import Isomap
# Create an isomap and fit the `digits` data to this
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)
# Predict the label of each sample
predicted = svc_model.predict(X_train)
# Create some plot with subplots on a grid measuring 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
fig.subplots_adjust(top=0.85)
# Add the scatterplots to the subplots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=predicted)
ax[0].set_title('Predicted labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Labels')
# Add title
fig.suptitle('Predicted versus actual labels', fontsize=14, fontweight='bold')
# Show the plot
plt.show()
Logistic Regression
We will now implement logistic regression from scratch and apply it to a dataset of exam scores, predicting whether a student is admitted. Begin with the imports and a look at the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
path = os.path.join(os.getcwd(), 'data', 'data1.txt')
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
data.head()
# Separate the positive (admitted) and negative (not admitted) examples
positive = data[data['Admitted'].isin([1])]
negative = data[data['Admitted'].isin([0])]
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
Next we write the sigmoid function, which maps any real-valued input into the (0, 1) interval:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
Let us plot it to make sure it looks right:
nums = np.arange(-10, 10, step=1)
fig, ax = plt.subplots(figsize=(12,8))
ax.plot(nums, sigmoid(nums), 'r')
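As a quick sanity check (a small snippet of my own, not from the original exercise), the function maps 0 to exactly 0.5 and squashes large negative and positive values toward 0 and 1:
# prints approximately [0.0000454, 0.5, 0.9999546]
print(sigmoid(np.array([-10.0, 0.0, 10.0])))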
Now we can write the cost function that evaluates a set of parameters theta on the training data:
def cost(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / (len(X))
The output has to be reduced to a single scalar value: the sum of the per-example log losses divided by the number of examples. Note the use of the matrix product X * theta.T to compute the model output for all examples at once.
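In equation form (standard logistic regression notation, with $m$ training examples and hypothesis $h_\theta(x) = \sigma(\theta^T x)$), the cost being computed is:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \right]$$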
Some setup is needed before we can test the cost function. We add a column of ones to the data so that the intercept becomes just another parameter, then separate X (the training data) from y (the target variable):
# add a ones column - this makes the matrix multiplication work out easier
data.insert(0, 'Ones', 1)
# set X (training data) and y (target variable)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
# convert to numpy arrays and initialize the parameter array theta
X = np.array(X.values)
y = np.array(y.values)
theta = np.zeros(3)
We can now compute the cost of the initial solution, with the model parameters, represented as theta, all set to zero:
cost(theta, X, y)
Our cost function is working, and our next step is a function that computes the gradient of the parameters:
def gradient(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)
    error = sigmoid(X * theta.T) - y
    for i in range(parameters):
        term = np.multiply(error, X[:,i])
        grad[i] = np.sum(term) / len(X)
    return grad
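For reference, the gradient being computed for each parameter $\theta_j$ is:
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$$
Note that this function only computes the gradient; it does not perform a gradient descent step itself, which is left to the SciPy optimizer below.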
We can now use SciPy's optimization routine to find the optimal parameters:
import scipy.optimize as opt
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
cost(result[0], X, y)
Next we write a function that uses the learned parameters to output predictions for a dataset, predicting 1 whenever the sigmoid output [which can be read as a probability] is at least 0.5:
def predict(theta, X):
    probability = sigmoid(X * theta.T)
    return [1 if x >= 0.5 else 0 for x in probability]
We can then measure the accuracy of the model on the training set:
theta_min = np.matrix(result[0])
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
accuracy = (sum(map(int, correct)) / len(correct)) * 100
print('accuracy = {0}%'.format(accuracy))