
5/22/2019 Copy of haberman datasets analysis.ipynb - Colaboratory

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from google.colab import files

uploaded = files.upload()

haberman.csv(n/a) - 3103 bytes, last modified: 5/16/2019 - 100% done
Saving haberman.csv to haberman (1).csv

haberman = pd.read_csv("haberman.csv")

print(haberman.shape)

(305, 4)

Observation: shape returns (rows, columns), that is, (data points, columns). The four columns are three features plus the class label.

print(haberman.columns)
haberman.columns = ["age", "operation_year", "axil_nodes", "survival status"]
haberman.head()

Index(['30', '64', '1', '1.1'], dtype='object')


   age  operation_year  axil_nodes  survival status
0   30              62           3                1
1   30              65           0                1
2   31              59           2                1
3   31              65           4                1
4   33              58          10                1
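
The column index printed above, ['30', '64', '1', '1.1'], suggests that the file has no header line, so read_csv took the first record as column names and that record is not among the 305 rows. The following is a minimal sketch of one way to keep it, assuming the file really has no header. The three-line `sample` string is a stand-in for haberman.csv, built from the values visible in the outputs above.

```python
import io
import pandas as pd

# Stand-in for haberman.csv: the first three records implied by the outputs above.
sample = "30,64,1,1\n30,62,3,1\n30,65,0,1\n"
names = ["age", "operation_year", "axil_nodes", "survival status"]

# Default behaviour: the first record is consumed as the header row.
lost_row = pd.read_csv(io.StringIO(sample))

# header=None keeps every record as data; names= supplies the column labels.
all_rows = pd.read_csv(io.StringIO(sample), header=None, names=names)

print(lost_row.shape)
print(all_rows.shape)
```

With this approach the separate `haberman.columns = [...]` assignment is no longer needed, because the labels are set while reading.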

1. This is an imbalanced dataset: the two survival classes are not equally represented.
2. The columns are renamed to ["age", "operation_year", "axil_nodes", "survival status"].

haberman.info()

https://colab.research.google.com/drive/1o1_WfES3ATAM2cXHemuFB6JVMSf76syP#scrollTo=nCsQxUiSTW7_&printMode=true 1/13

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 305 entries, 0 to 304
Data columns (total 4 columns):
age 305 non-null int64
operation_year 305 non-null int64
axil_nodes 305 non-null int64
survival status 305 non-null int64
dtypes: int64(4)
memory usage: 9.6 KB

haberman["survival status"].value_counts()

1 224
2 81
Name: survival status, dtype: int64


Observation: of the 305 patients, 224 (status 1) lived more than 5 years and 81 (status 2) died within 5 years.
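
To put the imbalance in proportion, the counts can be turned into class shares. This is a small sketch that rebuilds the label column from the two counts printed above rather than reading the file. The variable names `status` and `shares` are illustrative.

```python
import pandas as pd

# Rebuild the label column from the value_counts() output above.
status = pd.Series([1] * 224 + [2] * 81, name="survival status")

# normalize=True returns fractions instead of raw counts.
shares = status.value_counts(normalize=True)
print(shares.round(3))
```

Roughly three quarters of the patients fall in class 1, so a model that always predicts survival would already be right about 73% of the time.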

haberman.describe()

              age  operation_year  axil_nodes  survival status
count  305.000000      305.000000  305.000000       305.000000
mean    52.531148       62.849180    4.036066         1.265574
std     10.744024        3.254078    7.199370         0.442364
min     30.000000       58.000000    0.000000         1.000000
25%     44.000000       60.000000    0.000000         1.000000
50%     52.000000       63.000000    1.000000         1.000000
75%     61.000000       66.000000    4.000000         2.000000
max     83.000000       69.000000   52.000000         2.000000

Observation: age ranges from 30 to 83 with a median of 52. The maximum number of positive axil_nodes is 52, but at least 25% of patients have none, the median is 1, and 75% of patients have 4 or fewer.
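
The share of patients with no positive nodes can also be computed directly instead of being read off the quartiles. The sketch below uses only the five rows shown by head() earlier, so its numbers are illustrative and do not describe the full dataset. The frame name `head` is an assumption.

```python
import pandas as pd

# The five rows shown by head() above; the full file is not reproduced here.
head = pd.DataFrame({
    "age": [30, 30, 31, 31, 33],
    "axil_nodes": [3, 0, 2, 4, 10],
})

# Fraction of patients with no positive nodes (mean of a boolean column).
share_zero = (head["axil_nodes"] == 0).mean()

# Quartiles and maximum, the same figures describe() reports.
quartiles = head["axil_nodes"].quantile([0.25, 0.5, 0.75])
print(share_zero, quartiles.tolist(), head["axil_nodes"].max())
```

Running the same two expressions on the full `haberman` frame would give the exact share behind the "at least 25%" reading of the describe() table.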

2-D scatter plot

haberman.plot(kind='scatter', x='age', y='axil_nodes')
plt.show()


Observation: most patients have very few positive axil_nodes; the points cluster near 0.

sns.set_style("whitegrid")
# `height` replaces the deprecated `size` argument of FacetGrid.
sns.FacetGrid(haberman, hue="survival status", height=8) \
    .map(plt.scatter, 'age', 'axil_nodes') \
    .add_legend()
plt.show()


Observation: the orange and blue points overlap, so the two classes cannot be separated in this plot. Many patients have 0 axil_nodes.

plt.close()
sns.set_style("whitegrid")
sns.pairplot(haberman, hue = 'survival status', vars =("age", "operation_year", "axil_no
plt.show()

/usr/local/lib/python3.6/dist-packages/seaborn/axisgrid.py:2065: UserWarning: The `si


warnings.warn(msg, UserWarning)

Observation: the pair plots do not let us distinguish the classes, because most of the points overlap.

UNIVARIATE ANALYSIS: HISTOGRAM, PDF, CDF

# `height` replaces the deprecated `size` argument of FacetGrid.
sns.FacetGrid(haberman, hue="survival status", height=5) \
    .map(sns.distplot, 'axil_nodes') \
    .add_legend()
plt.show()

sns.FacetGrid(haberman, hue="survival status", height=5) \
    .map(sns.distplot, "age") \
    .add_legend()
plt.show()

sns.FacetGrid(haberman, hue='survival status', height=7) \
    .map(sns.distplot, "operation_year") \
    .add_legend()
plt.show()

Observations:
1. Only axil_nodes shows a visible difference between the two classes.
2. age and operation_year are not useful on their own, because the two class distributions overlap.
3. More patients operated on in 1965 did not survive than in other years.

alive = haberman.loc[haberman['survival status'] == 1]
dead = haberman.loc[haberman["survival status"] == 2]

counts, bin_edges = np.histogram(alive['axil_nodes'], bins=15, density=True)
# Normalize so the bin values sum to 1 (probability per bin).
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
# Running total of the bin probabilities.
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(["PDF for patients who survived more than 5 years",
            "CDF for patients who survived more than 5 years"])
plt.show()

[0.79017857 0.07142857 0.05357143 0.01785714 0.02232143 0.00892857
 0.00892857 0.00892857 0.00446429 0.00892857 0.         0.
 0.         0.         0.00446429]
[ 0.          3.06666667  6.13333333  9.2        12.26666667 15.33333333
 18.4        21.46666667 24.53333333 27.6        30.66666667 33.73333333
 36.8        39.86666667 42.93333333 46.        ]

counts, bin_edges = np.histogram(dead['axil_nodes'], bins=15, density=True)
# Normalize so the bin values sum to 1 (probability per bin).
pdf = counts / sum(counts)
print(pdf)
print(bin_edges)
# Running total of the bin probabilities.
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], pdf)
plt.plot(bin_edges[1:], cdf)
plt.legend(["PDF for patients who died within 5 years",
            "CDF for patients who died within 5 years"])
plt.show()

[0.48148148 0.12345679 0.11111111 0.09876543 0.04938272 0.03703704
 0.07407407 0.         0.         0.         0.01234568 0.
 0.         0.         0.01234568]
[ 0.          3.46666667  6.93333333 10.4        13.86666667 17.33333333
 20.8        24.26666667 27.73333333 31.2        34.66666667 38.13333333
 41.6        45.06666667 48.53333333 52.        ]
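
The two printed arrays can be compared numerically as well as visually. The sketch below copies the bin probabilities from the outputs above and accumulates them into CDFs. The array names are illustrative. The bins of the two groups have different widths (about 3.07 and 3.47 nodes), so the comparison is approximate.

```python
import numpy as np

# Bin probabilities copied from the two printed outputs above.
pdf_alive = np.array([0.79017857, 0.07142857, 0.05357143, 0.01785714, 0.02232143,
                      0.00892857, 0.00892857, 0.00892857, 0.00446429, 0.00892857,
                      0.0, 0.0, 0.0, 0.0, 0.00446429])
pdf_dead = np.array([0.48148148, 0.12345679, 0.11111111, 0.09876543, 0.04938272,
                     0.03703704, 0.07407407, 0.0, 0.0, 0.0, 0.01234568,
                     0.0, 0.0, 0.0, 0.01234568])

# Cumulative sums give the fraction of patients at or below each bin's upper edge.
cdf_alive = np.cumsum(pdf_alive)
cdf_dead = np.cumsum(pdf_dead)

# First bin only: the lowest node counts in each group.
print(round(float(cdf_alive[0]), 3), round(float(cdf_dead[0]), 3))
```

About 79% of the survivors fall in the first bin, against about 48% of the patients who died, which is consistent with axil_nodes being the most informative of the three features.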

MEAN, MEDIAN, PERCENTILE

print("mean:")
print(np.mean(alive["age"]))
print(np.mean(dead["age"]))

mean:
52.11607142857143
53.67901234567901

print(np.mean(alive["operation_year"]))
print(np.mean(dead["operation_year"]))

62.857142857142854
62.82716049382716

print(np.mean(alive["axil_nodes"]))
print(np.mean(dead["axil_nodes"]))

2.799107142857143
7.45679012345679

print('std')
print(np.std(alive["age"]))
print(np.std(dead["age"]))

std
10.913004640364269
10.10418219303131

print(np.std(alive['operation_year']))
print(np.std(dead["operation_year"]))

3.2220145175061514
3.3214236255207883

print(np.std(alive["axil_nodes"]))
print(np.std(dead["axil_nodes"]))

5.869092706952767
9.128776076761632

print("median")
print(np.median(alive['age']))
print(np.median(dead["age"]))

median
52.0
53.0

print(np.median(alive['operation_year']))
print(np.median(dead["operation_year"]))

63.0
63.0

print(np.median(alive['axil_nodes']))
print(np.median(dead['axil_nodes']))

0.0
4.0
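
The per-class means, standard deviations, and medians above can also be produced in one pass with a groupby. This is a sketch on a hypothetical toy frame: the column names match the notebook, but the values are made up for illustration. Note that np.std uses ddof=0 while pandas defaults to ddof=1, so ddof=0 is passed explicitly to match the np.std calls above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names match the notebook, values are made up.
toy = pd.DataFrame({
    "survival status": [1, 1, 1, 2, 2],
    "axil_nodes": [0, 1, 2, 4, 10],
})

grouped = toy.groupby("survival status")["axil_nodes"]
summary = pd.DataFrame({
    "mean": grouped.mean(),
    "median": grouped.median(),
    # ddof=0 matches np.std; pandas would otherwise use ddof=1.
    "std": grouped.std(ddof=0),
})
print(summary)
```

Applied to the full `haberman` frame, the same pattern gives one row per class and keeps the two groups side by side.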

print('quantiles')

quantiles

print(np.percentile(alive["age"], np.arange(0, 100, 25)))
print(np.percentile(dead["age"], np.arange(0, 100, 25)))
print(np.percentile(alive["operation_year"], np.arange(0, 100, 25)))
print(np.percentile(dead["operation_year"], np.arange(0, 100, 25)))
print(np.percentile(alive["axil_nodes"], np.arange(0, 100, 25)))
print(np.percentile(dead["axil_nodes"], np.arange(0, 100, 25)))

[30. 43. 52. 60.]
[34. 46. 53. 61.]
[58. 60. 63. 66.]
[58. 59. 63. 65.]
[0. 0. 0. 3.]
[ 0.  1.  4. 11.]
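
Each printed row holds the 0th, 25th, 50th, and 75th percentiles, because np.arange(0, 100, 25) stops before 100 and so the maximum is not reported. The sketch below, on hypothetical node counts, lists the percentiles explicitly to include the maximum and derives the interquartile range. The names `nodes`, `five_number`, and `iqr` are illustrative.

```python
import numpy as np

# Hypothetical node counts, for illustration only.
nodes = np.array([0, 0, 1, 2, 3, 4, 8, 11, 20])

# arange(0, 100, 25) yields 0, 25, 50, 75: the 100th percentile is left out.
print(np.arange(0, 100, 25))

# Listing the percentiles explicitly includes the maximum.
five_number = np.percentile(nodes, [0, 25, 50, 75, 100])

# Interquartile range: spread of the middle 50% of the values.
iqr = five_number[3] - five_number[1]
print(five_number, iqr)
```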

print("90th percentile")

90th percentile

print(np.percentile(alive["age"], 90))
print(np.percentile(alive["operation_year"], 90))
print(np.percentile(alive["axil_nodes"], 90))

print(np.percentile(dead["age"], 90))
print(np.percentile(dead["operation_year"], 90))
print(np.percentile(dead["axil_nodes"], 90))

67.0
67.0
8.0
67.0
67.0
20.0

from statsmodels import robust

# robust.mad rescales the raw median absolute deviation by about 1.4826,
# so that it estimates the standard deviation for normally distributed data.
print("Median absolute deviation ")
print(robust.mad(alive["age"]))
print(robust.mad(alive["operation_year"]))
print(robust.mad(alive["axil_nodes"]))
print(robust.mad(dead["age"]))
print(robust.mad(dead["operation_year"]))
print(robust.mad(dead["axil_nodes"]))

Median absolute deviation
13.343419966550417
4.447806655516806
0.0
11.860817748044816
4.447806655516806
5.930408874022408

BOX PLOT AND WHISKERS

sns.boxplot(x='survival status', y='age', data=haberman)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb8a5ed8b38>

print("\n.......year......")
sns.boxplot(x='survival status', y='operation_year', data=haberman)

.......year......
<matplotlib.axes._subplots.AxesSubplot at 0x7fb8a5eb8e80>

print("\n......axil_nodes.....")
sns.boxplot(x="survival status", y="axil_nodes", data=haberman)

......axil_nodes.....
<matplotlib.axes._subplots.AxesSubplot at 0x7fb8a5e1d358>

Observation:
1. For patients who died within 5 years, the middle 50% of operations fall between 1959 and 1965, with a median of 1963. For survivors the box spans 1960 to 1966, with the same median.
2. About 75% of the patients operated on in 1965 survived.
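
By default, seaborn's boxplot draws each whisker out to the most extreme data point within 1.5 times the interquartile range of the box edge, and marks anything beyond that as an outlier. The sketch below computes those limits for axil_nodes from the quartiles printed earlier in the notebook. The helper `tukey_fences` is a hypothetical name, not a seaborn function.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of axil_nodes from the np.percentile output earlier in the notebook:
# survivors Q1 = 0, Q3 = 3; patients who died Q1 = 1, Q3 = 11.
alive_fences = tukey_fences(0.0, 3.0)
dead_fences = tukey_fences(1.0, 11.0)
print(alive_fences, dead_fences)
```

Survivors with more than 7.5 positive nodes are therefore drawn as outliers, while for patients who died the upper limit is 26 nodes.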

VIOLIN PLOT

# The `size` keyword is dropped: it is not a violinplot parameter.
sns.violinplot(x='survival status', y='age', data=haberman)
plt.show()

sns.violinplot(x="survival status", y='operation_year', data=haberman)
plt.show()

sns.violinplot(x='survival status', y='axil_nodes', data=haberman)
plt.show()

sns.jointplot(x='age', y='operation_year', data=dead, kind='kde')
plt.show()


sns.jointplot(x='axil_nodes', y='operation_year', data=alive, kind='kde')
plt.show()
