Você está na página 1de 11

Python For Data Science Cheat Sheet Lists Also see NumPy Arrays Libraries

>>> a = 'is' Import libraries


Python Basics >>> b = 'nice' >>> import numpy Data analysis Machine learning
Learn More Python for Data Science Interactively at www.datacamp.com >>> my_list = ['my', 'list', a, b] >>> import numpy as np
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] Selective import
>>> from math import pi Scientific computing 2D plotting
Variables and Data Types Selecting List Elements Index starts at 0
Subset Install Python
Variable Assignment
>>> my_list[1] Select item at index 1
>>> x=5
>>> my_list[-3] Select 3rd last item
>>> x
Slice
5 >>> my_list[1:3] Select items at index 1 and 2
Calculations With Variables >>> my_list[1:] Select items after index 0
>>> my_list[:3] Select items before index 3 Leading open data science platform Free IDE that is included Create and share
>>> x+2 Sum of two variables
>>> my_list[:] Copy my_list powered by Python with Anaconda documents with live code,
7 visualizations, text, ...
>>> x-2 Subtraction of two variables
Subset Lists of Lists
>>> my_list2[1][0] my_list[list][itemOfList]
3
>>> my_list2[1][:2] Numpy Arrays Also see Lists
>>> x*2 Multiplication of two variables
>>> my_list = [1, 2, 3, 4]
10 List Operations >>> my_array = np.array(my_list)
>>> x**2 Exponentiation of a variable
25 >>> my_list + my_list >>> my_2darray = np.array([[1,2,3],[4,5,6]])
>>> x%2 Remainder of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
Selecting Numpy Array Elements Index starts at 0
1 >>> my_list * 2
>>> x/float(2) Division of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] Subset
2.5 >>> my_list2 > 4 >>> my_array[1] Select item at index 1
True 2
Types and Type Conversion Slice
List Methods >>> my_array[0:2] Select items at index 0 and 1
str() '5', '3.45', 'True' Variables to strings
my_list.index(a) Get the index of an item array([1, 2])
>>>
int() 5, 3, 1 Variables to integers >>> my_list.count(a) Count an item Subset 2D Numpy arrays
>>> my_list.append('!') Append an item at a time >>> my_2darray[:,0] my_2darray[rows, columns]
my_list.remove('!') Remove an item array([1, 4])
float() 5.0, 1.0 Variables to floats >>>
>>> del(my_list[0:1]) Remove an item Numpy Array Operations
bool() True, True, True >>> my_list.reverse() Reverse the list
Variables to booleans >>> my_array > 3
>>> my_list.extend('!') Append an item array([False, False, False, True], dtype=bool)
>>> my_list.pop(-1) Remove an item >>> my_array * 2
Asking For Help >>> my_list.insert(0,'!') Insert an item array([2, 4, 6, 8])
>>> help(str) >>> my_list.sort() Sort the list >>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])
Strings
>>> my_string = 'thisStringIsAwesome' Numpy Array Functions
String Operations Index starts at 0
>>> my_string >>> my_array.shape Get the dimensions of the array
'thisStringIsAwesome' >>> my_string[3] >>> np.append(other_array) Append items to an array
>>> my_string[4:9] >>> np.insert(my_array, 1, 5) Insert items in an array
String Operations >>> np.delete(my_array,[1]) Delete items in an array
String Methods >>> np.mean(my_array) Mean of the array
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome' >>> my_string.upper() String to uppercase >>> np.median(my_array) Median of the array
>>> my_string + 'Innit' >>> my_string.lower() String to lowercase >>> my_array.corrcoef() Correlation coefficient
'thisStringIsAwesomeInnit' >>> my_string.count('w') Count String elements >>> np.std(my_array) Standard deviation
>>> 'm' in my_string >>> my_string.replace('e', 'i') Replace String elements
True >>> my_string.strip() Strip whitespaces DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Advanced Indexing Also see NumPy Arrays Combining Data
Selecting data1 data2
Pandas >>> df3.loc[:,(df3>1).any()] Select cols with any vals >1 X1 X2 X1 X3
Learn Python for Data Science Interactively at www.DataCamp.com >>> df3.loc[:,(df3>1).all()] Select cols with vals > 1
>>> df3.loc[:,df3.isnull().any()] Select cols with NaN a 11.432 a 20.784
>>> df3.loc[:,df3.notnull().all()] Select cols without NaN b 1.303 b NaN
Indexing With isin c 99.906 d 20.784
>>> df[(df.Country.isin(df2.Type))] Find same elements
Reshaping Data >>> df3.filter(items=”a”,”b”]) Filter on values
Merge
>>> df.select(lambda x: not x%5) Select specific elements
Pivot Where X1 X2 X3
>>> pd.merge(data1,
>>> df3= df2.pivot(index='Date', Spread rows into columns >>> s.where(s > 0) Subset the data data2, a 11.432 20.784
columns='Type', Query how='left',
values='Value') b 1.303 NaN
>>> df6.query('second > first') Query DataFrame on='X1')
c 99.906 NaN
Date Type Value

0 2016-03-01 a 11.432 Type a b c Setting/Resetting Index >>> pd.merge(data1, X1 X2 X3


1 2016-03-02 b 13.031 Date data2, a 11.432 20.784
>>> df.set_index('Country') Set the index
how='right',
2 2016-03-01 c 20.784 2016-03-01 11.432 NaN 20.784 >>> df4 = df.reset_index() Reset the index b 1.303 NaN
on='X1')
3 2016-03-03 a 99.906 >>> df = df.rename(index=str, Rename DataFrame d NaN 20.784
2016-03-02 1.303 13.031 NaN columns={"Country":"cntry",
4 2016-03-02 a 1.303 "Capital":"cptl", >>> pd.merge(data1,
2016-03-03 99.906 NaN 20.784 "Population":"ppltn"}) X1 X2 X3
5 2016-03-03 c 20.784 data2,
how='inner', a 11.432 20.784
Pivot Table Reindexing on='X1') b 1.303 NaN
>>> s2 = s.reindex(['a','c','d','e','b'])
>>> df4 = pd.pivot_table(df2, Spread rows into columns X1 X2 X3
values='Value', Forward Filling Backward Filling >>> pd.merge(data1,
index='Date', data2, a 11.432 20.784
columns='Type']) >>> df.reindex(range(4), >>> s3 = s.reindex(range(5), how='outer', b 1.303 NaN
method='ffill') method='bfill') on='X1') c 99.906 NaN
Stack / Unstack Country Capital Population 0 3
0 Belgium Brussels 11190846 1 3 d NaN 20.784
>>> stacked = df5.stack() Pivot a level of column labels 1 India New Delhi 1303171035 2 3
>>> stacked.unstack() Pivot a level of index labels 2 Brazil Brasília 207847528 3 3 Join
3 Brazil Brasília 207847528 4 3
0 1 1 5 0 0.233482 >>> data1.join(data2, how='right')
1 5 0.233482 0.390959 1 0.390959 MultiIndexing Concatenate
2 4 0.184713 0.237102 2 4 0 0.184713
>>> arrays = [np.array([1,2,3]),
3 3 0.433522 0.429401 1 0.237102 np.array([5,4,3])] Vertical
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays) >>> s.append(s2)
Unstacked 3 3 0 0.433522
>>> tuples = list(zip(*arrays)) Horizontal/Vertical
1 0.429401 >>> index = pd.MultiIndex.from_tuples(tuples, >>> pd.concat([s,s2],axis=1, keys=['One','Two'])
Stacked names=['first', 'second']) >>> pd.concat([data1, data2], axis=1, join='inner')
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
Melt >>> df2.set_index(["Date", "Type"])
>>> pd.melt(df2, Gather columns into rows
Dates
id_vars=["Date"],
value_vars=["Type", "Value"],
Duplicate Data >>> df2['Date']= pd.to_datetime(df2['Date'])
>>> df2['Date']= pd.date_range('2000-1-1',
value_name="Observations") >>> s3.unique() Return unique values periods=6,
>>> df2.duplicated('Type') Check duplicates freq='M')
Date Type Value
Date Variable Observations >>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
0 2016-03-01 Type a >>> df2.drop_duplicates('Type', keep='last') Drop duplicates >>> index = pd.DatetimeIndex(dates)
0 2016-03-01 a 11.432 1 2016-03-02 Type b
>>> df.index.duplicated() Check index duplicates >>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')
1 2016-03-02 b 13.031 2 2016-03-01 Type c
2 2016-03-01 c 20.784 3 2016-03-03 Type a Grouping Data Visualization Also see Matplotlib
4 2016-03-02 Type a
3 2016-03-03 a 99.906
5 2016-03-03 Type c Aggregation >>> import matplotlib.pyplot as plt
4 2016-03-02 a 1.303 >>> df2.groupby(by=['Date','Type']).mean()
6 2016-03-01 Value 11.432 >>> s.plot() >>> df2.plot()
>>> df4.groupby(level=0).sum()
5 2016-03-03 c 20.784 7 2016-03-02 Value 13.031 >>> df4.groupby(level=0).agg({'a':lambda x:sum(x)/len(x), >>> plt.show() >>> plt.show()
8 2016-03-01 Value 20.784 'b': np.sum})
9 2016-03-03 Value 99.906 Transformation
>>> customSum = lambda x: (x+x%2)
10 2016-03-02 Value 1.303
>>> df4.groupby(level=0).transform(customSum)
11 2016-03-03 Value 20.784

Iteration Missing Data


>>> df.dropna() Drop NaN values
>>> df.iteritems() (Column-index, Series) pairs >>> df3.fillna(df3.mean()) Fill NaN values with a predetermined value
>>> df.iterrows() (Row-index, Series) pairs >>> df2.replace("a", "f") Replace values with others
DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Inspecting Your Array Subsetting, Slicing, Indexing Also see Lists
>>> a.shape Array dimensions Subsetting
NumPy Basics >>>
>>>
len(a)
b.ndim
Length of array
Number of array dimensions
>>> a[2]
3
1 2 3 Select the element at the 2nd index
Learn Python for Data Science Interactively at www.DataCamp.com >>> e.size Number of array elements >>> b[1,2] 1.5 2 3 Select the element at row 0 column 2
>>> b.dtype Data type of array elements 6.0 4 5 6 (equivalent to b[1][2])
>>> b.dtype.name Name of data type
>>> b.astype(int) Convert an array to a different type Slicing
NumPy >>> a[0:2]
array([1, 2])
1 2 3 Select items at index 0 and 1
2
The NumPy library is the core library for scientific computing in Asking For Help >>> b[0:2,1] 1.5 2 3 Select items at rows 0 and 1 in column 1
>>> np.info(np.ndarray.dtype) array([ 2., 5.]) 4 5 6
Python. It provides a high-performance multidimensional array
Array Mathematics
1.5 2 3
>>> b[:1] Select all items at row 0
object, and tools for working with these arrays. array([[1.5, 2., 3.]]) 4 5 6 (equivalent to b[0:1, :])
Arithmetic Operations >>> c[1,...] Same as [1,:,:]
Use the following import convention: array([[[ 3., 2., 1.],
>>> import numpy as np [ 4., 5., 6.]]])
>>> g = a - b Subtraction
array([[-0.5, 0. , 0. ], >>> a[ : :-1] Reversed array a
NumPy Arrays [-3. , -3. , -3. ]])
array([3, 2, 1])

>>> np.subtract(a,b) Boolean Indexing


1D array 2D array 3D array Subtraction
>>> a[a<2] Select elements from a less than 2
>>> b + a Addition 1 2 3
array([[ 2.5, 4. , 6. ], array([1])
axis 1 axis 2
1 2 3 axis 1 [ 5. , 7. , 9. ]]) Fancy Indexing
1.5 2 3 >>> np.add(b,a) Addition >>> b[[1, 0, 1, 0],[0, 1, 2, 0]] Select elements (1,0),(0,1),(1,2) and (0,0)
axis 0 axis 0 array([ 4. , 2. , 6. , 1.5])
4 5 6 >>> a / b Division
array([[ 0.66666667, 1. , 1. ], >>> b[[1, 0, 1, 0]][:,[0,1,2,0]] Select a subset of the matrix’s rows
[ 0.25 , 0.4 , 0.5 ]]) array([[ 4. ,5. , 6. , 4. ], and columns
>>> np.divide(a,b) Division [ 1.5, 2. , 3. , 1.5],
Creating Arrays >>> a * b
array([[ 1.5, 4. , 9. ],
Multiplication
[ 4. , 5.
[ 1.5, 2.
,
,
6.
3.
,
,
4. ],
1.5]])

>>> a = np.array([1,2,3]) [ 4. , 10. , 18. ]])


>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float) >>> np.multiply(a,b) Multiplication Array Manipulation
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]], >>> np.exp(b) Exponentiation
dtype = float) >>> np.sqrt(b) Square root Transposing Array
>>> np.sin(a) Print sines of an array >>> i = np.transpose(b) Permute array dimensions
Initial Placeholders >>> np.cos(b) Element-wise cosine >>> i.T Permute array dimensions
>>> np.log(a) Element-wise natural logarithm
>>> np.zeros((3,4)) Create an array of zeros >>> e.dot(f) Dot product
Changing Array Shape
>>> np.ones((2,3,4),dtype=np.int16) Create an array of ones array([[ 7., 7.], >>> b.ravel() Flatten the array
>>> d = np.arange(10,25,5) Create an array of evenly [ 7., 7.]]) >>> g.reshape(3,-2) Reshape, but don’t change data
spaced values (step value)
>>> np.linspace(0,2,9) Create an array of evenly Comparison Adding/Removing Elements
spaced values (number of samples) >>> h.resize((2,6)) Return a new array with shape (2,6)
>>> e = np.full((2,2),7) Create a constant array >>> a == b Element-wise comparison >>> np.append(h,g) Append items to an array
>>> f = np.eye(2) Create a 2X2 identity matrix array([[False, True, True], >>> np.insert(a, 1, 5) Insert items in an array
>>> np.random.random((2,2)) Create an array with random values [False, False, False]], dtype=bool) >>> np.delete(a,[1]) Delete items from an array
>>> np.empty((3,2)) Create an empty array >>> a < 2 Element-wise comparison
array([True, False, False], dtype=bool) Combining Arrays
>>> np.array_equal(a, b) Array-wise comparison >>> np.concatenate((a,d),axis=0) Concatenate arrays
I/O array([ 1, 2,
>>> np.vstack((a,b))
3, 10, 15, 20])
Stack arrays vertically (row-wise)
Aggregate Functions array([[ 1. , 2. , 3. ],
Saving & Loading On Disk [ 1.5, 2. , 3. ],
>>> a.sum() Array-wise sum [ 4. , 5. , 6. ]])
>>> np.save('my_array', a) >>> a.min() Array-wise minimum value >>> np.r_[e,f] Stack arrays vertically (row-wise)
>>> np.savez('array.npz', a, b) >>> b.max(axis=0) Maximum value of an array row >>> np.hstack((e,f)) Stack arrays horizontally (column-wise)
>>> np.load('my_array.npy') >>> b.cumsum(axis=1) Cumulative sum of the elements array([[ 7., 7., 1., 0.],
>>> a.mean() Mean [ 7., 7., 0., 1.]])
Saving & Loading Text Files >>> b.median() Median >>> np.column_stack((a,d)) Create stacked column-wise arrays
>>> np.loadtxt("myfile.txt") >>> a.corrcoef() Correlation coefficient array([[ 1, 10],
>>> np.std(b) Standard deviation [ 2, 15],
>>> np.genfromtxt("my_file.csv", delimiter=',') [ 3, 20]])
>>> np.savetxt("myarray.txt", a, delimiter=" ") >>> np.c_[a,d] Create stacked column-wise arrays
Copying Arrays Splitting Arrays
Data Types >>> h = a.view() Create a view of the array with the same data >>> np.hsplit(a,3) Split the array horizontally at the 3rd
>>> np.copy(a) Create a copy of the array [array([1]),array([2]),array([3])] index
>>> np.int64 Signed 64-bit integer types >>> np.vsplit(c,2) Split the array vertically at the 2nd index
>>> np.float32 Standard double-precision floating point >>> h = a.copy() Create a deep copy of the array [array([[[ 1.5, 2. , 1. ],
>>> np.complex Complex numbers represented by 128 floats [ 4. , 5. , 6. ]]]),
array([[[ 3., 2., 3.],
>>>
>>>
np.bool
np.object
Boolean type storing TRUE and FALSE values
Python object type Sorting Arrays [ 4., 5., 6.]]])]

>>> np.string_ Fixed-length string type >>> a.sort() Sort an array


>>> np.unicode_ Fixed-length unicode type >>> c.sort(axis=0) Sort the elements of an array's axis DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Plot Anatomy & Workflow
Plot Anatomy Workflow
Matplotlib Axes/Subplot The basic steps to creating plots with matplotlib are:
Learn Python Interactively at www.DataCamp.com 1 Prepare data 2 Create plot 3 Plot 4 Customize plot 5 Save plot 6 Show plot
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() Step 2
Matplotlib Y-axis Figure >>> ax = fig.add_subplot(111) Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) Step 3, 4
Matplotlib is a Python 2D plotting library which produces >>> ax.scatter([2,4,6],
publication-quality figures in a variety of hardcopy formats [5,15,25],
color='darkgreen',
and interactive environments across marker='^')
platforms. >>> ax.set_xlim(1, 6.5)
X-axis
>>> plt.savefig('foo.png')

1 Prepare The Data Also see Lists & NumPy


>>> plt.show() Step 6

1D Data 4 Customize Plot


>>> import numpy as np Colors, Color Bars & Color Maps Mathtext
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x) >>> plt.plot(x, x, x, x**2, x, x**3) >>> plt.title(r'$sigma_i=15$', fontsize=20)
>>> z = np.sin(x) >>> ax.plot(x, y, alpha = 0.4)
>>> ax.plot(x, y, c='k') Limits, Legends & Layouts
2D Data or Images >>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, Limits & Autoscaling
>>> data = 2 * np.random.random((10, 10)) cmap='seismic')
>>> data2 = 3 * np.random.random((10, 10)) >>> ax.margins(x=0.0,y=0.1) Add padding to a plot
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j] >>> ax.axis('equal') Set the aspect ratio of the plot to 1
Markers >>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) Set limits for x-and y-axis
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2 >>> fig, ax = plt.subplots() >>> ax.set_xlim(0,10.5) Set limits for x-axis
>>> from matplotlib.cbook import get_sample_data >>> ax.scatter(x,y,marker=".") Legends
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy')) >>> ax.plot(x,y,marker="o") >>> ax.set(title='An Example Axes', Set a title and x-and y-axis labels
ylabel='Y-Axis',
Linestyles xlabel='X-Axis')
2 Create Plot >>>
>>>
plt.plot(x,y,linewidth=4.0)
plt.plot(x,y,ls='solid')
>>> ax.legend(loc='best')
Ticks
No overlapping plot elements

>>> import matplotlib.pyplot as plt >>> ax.xaxis.set(ticks=range(1,5), Manually set x-ticks


>>> plt.plot(x,y,ls='--') ticklabels=[3,100,-12,"foo"])
Figure >>> plt.plot(x,y,'--',x**2,y**2,'-.') >>> ax.tick_params(axis='y', Make y-ticks longer and go in and out
>>> plt.setp(lines,color='r',linewidth=4.0) direction='inout',
>>> fig = plt.figure() length=10)
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0)) Text & Annotations
Subplot Spacing
Axes >>> ax.text(1, >>> fig3.subplots_adjust(wspace=0.5, Adjust the spacing between subplots
-2.1, hspace=0.3,
All plotting is done with respect to an Axes. In most cases, a 'Example Graph', left=0.125,
style='italic') right=0.9,
subplot will fit your needs. A subplot is an axes on a grid system. >>> ax.annotate("Sine", top=0.9,
>>> fig.add_axes() xy=(8, 0), bottom=0.1)
>>> ax1 = fig.add_subplot(221) # row-col-num xycoords='data', >>> fig.tight_layout() Fit subplot(s) in to the figure area
xytext=(10.5, 0),
>>> ax3 = fig.add_subplot(212) textcoords='data', Axis Spines
>>> fig3, axes = plt.subplots(nrows=2,ncols=2) arrowprops=dict(arrowstyle="->", >>> ax1.spines['top'].set_visible(False) Make the top axis line for a plot invisible
>>> fig4, axes2 = plt.subplots(ncols=3) connectionstyle="arc3"),) >>> ax1.spines['bottom'].set_position(('outward',10)) Move the bottom axis line outward

3 Plotting Routines 5 Save Plot


Save figures
1D Data Vector Fields >>> plt.savefig('foo.png')
>>> lines = ax.plot(x,y) Draw points with lines or markers connecting them >>> axes[0,1].arrow(0,0,0.5,0.5) Add an arrow to the axes Save transparent figures
>>> ax.scatter(x,y) Draw unconnected points, scaled or colored >>> axes[1,1].quiver(y,z) Plot a 2D field of arrows >>> plt.savefig('foo.png', transparent=True)
>>> axes[0,0].bar([1,2,3],[3,4,5]) Plot vertical rectangles (constant width) >>> axes[0,1].streamplot(X,Y,U,V) Plot 2D vector fields
>>> axes[1,0].barh([0.5,1,2.5],[0,1,2])
6
Plot horiontal rectangles (constant height)
>>> axes[1,1].axhline(0.45) Draw a horizontal line across axes Data Distributions Show Plot
>>> axes[0,1].axvline(0.65) Draw a vertical line across axes >>> ax1.hist(y) Plot a histogram
>>> ax.fill(x,y,color='blue') Draw filled polygons >>> ax3.boxplot(y) Make a box and whisker plot >>> plt.show()
>>> ax.fill_between(x,y,color='yellow') Fill between y-values and 0 >>> ax3.violinplot(z) Make a violin plot
2D Data or Images Close & Clear
>>> fig, ax = plt.subplots() >>> plt.cla() Clear an axis
>>> axes2[0].pcolor(data2) Pseudocolor plot of 2D array >>> plt.clf() Clear the entire figure
>>> im = ax.imshow(img, Colormapped or RGB arrays >>> axes2[0].pcolormesh(data) Pseudocolor plot of 2D array
cmap='gist_earth', >>> plt.close() Close a window
interpolation='nearest', >>> CS = plt.contour(Y,X,U) Plot contours
vmin=-2, >>> axes2[2].contourf(data1) Plot filled contours
vmax=2) >>> axes2[2]= ax.clabel(CS) Label a contour plot DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet 3 Plotting With Seaborn
Seaborn Axis Grids
Learn Data Science Interactively at www.DataCamp.com >>> g = sns.FacetGrid(titanic, Subplot grid for plotting conditional >>> h = sns.PairGrid(iris) Subplot grid for plotting pairwise
col="survived", relationships >>> h = h.map(plt.scatter) relationships
row="sex") >>> sns.pairplot(iris) Plot pairwise bivariate distributions
>>> g = g.map(plt.hist,"age") >>> i = sns.JointGrid(x="x", Grid for bivariate plot with marginal
>>> sns.factorplot(x="pclass", Draw a categorical plot onto a y="y", univariate plots
y="survived", Facetgrid data=data)
Statistical Data Visualization With Seaborn hue="sex",
data=titanic)
>>> i = i.plot(sns.regplot,
sns.distplot)
The Python visualization library Seaborn is based on >>> sns.lmplot(x="sepal_width", Plot data and regression model fits >>> sns.jointplot("sepal_length", Plot bivariate distribution
y="sepal_length", across a FacetGrid "sepal_width",
matplotlib and provides a high-level interface for drawing hue="species", data=iris,
attractive statistical graphics. data=iris) kind='kde')

Categorical Plots Regression Plots


Make use of the following aliases to import the libraries: >>> sns.regplot(x="sepal_width", Plot data and a linear regression
Scatterplot
>>> import matplotlib.pyplot as plt y="sepal_length", model fit
>>> sns.stripplot(x="species", Scatterplot with one
>>> import seaborn as sns data=iris,
y="petal_length", categorical variable
data=iris) ax=ax)
The basic steps to creating plots with Seaborn are: >>> sns.swarmplot(x="species", Categorical scatterplot with Distribution Plots
y="petal_length", non-overlapping points
1. Prepare some data data=iris) >>> plot = sns.distplot(data.y, Plot univariate distribution
2. Control figure aesthetics Bar Chart kde=False,
color="b")
3. Plot with Seaborn >>> sns.barplot(x="sex", Show point estimates and
y="survived", confidence intervals with Matrix Plots
4. Further customize your plot hue="class", scatterplot glyphs
>>> sns.heatmap(uniform_data,vmin=0,vmax=1) Heatmap
data=titanic)
>>> import matplotlib.pyplot as plt Count Plot
>>>
>>>
>>>
import seaborn as sns
tips = sns.load_dataset("tips")
sns.set_style("whitegrid") Step 2
Step 1
>>> sns.countplot(x="deck",
data=titanic,
Show count of observations
4 Further Customizations Also see Matplotlib
palette="Greens_d")
>>> g = sns.lmplot(x="tip", Step 3
Point Plot Axisgrid Objects
y="total_bill",
data=tips, >>> sns.pointplot(x="class", Show point estimates and >>> g.despine(left=True) Remove left spine
aspect=2) y="survived", confidence intervals as >>> g.set_ylabels("Survived") Set the labels of the y-axis
>>> g = (g.set_axis_labels("Tip","Total bill(USD)"). hue="sex", rectangular bars >>> g.set_xticklabels(rotation=45) Set the tick labels for x
set(xlim=(0,10),ylim=(0,100))) data=titanic, >>> g.set_axis_labels("Survived", Set the axis labels
Step 4 palette={"male":"g", "Sex")
>>> plt.title("title")
>>> plt.show(g) Step 5 "female":"m"}, >>> h.set(xlim=(0,5), Set the limit and ticks of the
markers=["^","o"], ylim=(0,5), x-and y-axis
linestyles=["-","--"]) xticks=[0,2.5,5],

1
Boxplot yticks=[0,2.5,5])
Data Also see Lists, NumPy & Pandas >>> sns.boxplot(x="alive", Boxplot
Plot
y="age",
>>> import pandas as pd hue="adult_male",
>>> import numpy as np >>> plt.title("A Title") Add plot title
data=titanic)
>>> uniform_data = np.random.rand(10, 12) >>> plt.ylabel("Survived") Adjust the label of the y-axis
>>> sns.boxplot(data=iris,orient="h") Boxplot with wide-form data
>>> data = pd.DataFrame({'x':np.arange(1,101), >>> plt.xlabel("Sex") Adjust the label of the x-axis
'y':np.random.normal(0,4,100)}) Violinplot >>> plt.ylim(0,100) Adjust the limits of the y-axis
>>> sns.violinplot(x="age", Violin plot >>> plt.xlim(0,10) Adjust the limits of the x-axis
Seaborn also offers built-in data sets: y="sex", >>> plt.setp(ax,yticks=[0,5]) Adjust a plot property
>>> titanic = sns.load_dataset("titanic") hue="survived", >>> plt.tight_layout() Adjust subplot params
>>> iris = sns.load_dataset("iris") data=titanic)

2 Figure Aesthetics Also see Matplotlib


5 Show or Save Plot Also see Matplotlib
>>> plt.show() Show the plot
Context Functions >>> plt.savefig("foo.png") Save the plot as a figure
>>> f, ax = plt.subplots(figsize=(5,6)) Create a figure and one subplot >>> plt.savefig("foo.png", Save transparent figure
>>> sns.set_context("talk") Set context to "talk" transparent=True)
>>> sns.set_context("notebook", Set context to "notebook",
Seaborn styles font_scale=1.5, Scale font elements and
>>> sns.set() (Re)set the seaborn default
rc={"lines.linewidth":2.5}) override param mapping Close & Clear Also see Matplotlib
>>> sns.set_style("whitegrid") Set the matplotlib parameters Color Palette >>> plt.cla() Clear an axis
>>> sns.set_style("ticks", Set the matplotlib parameters >>> plt.clf() Clear an entire figure
{"xtick.major.size":8, >>> sns.set_palette("husl",3) Define the color palette >>> plt.close() Close a window
"ytick.major.size":8}) >>> sns.color_palette("husl") Use with with to temporarily set palette
>>> sns.axes_style("whitegrid") Return a dict of params or use with >>> flatui = ["#9b59b6","#3498db","#95a5a6","#e74c3c","#34495e","#2ecc71"]
with to temporarily set the style >>> sns.set_palette(flatui) Set your own color palette DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Retrieving RDD Information Reshaping Data
Basic Information Reducing
PySpark - RDD Basics >>> rdd.getNumPartitions() List the number of partitions
>>> rdd.reduceByKey(lambda x,y : x+y)
.collect()
Merge the rdd values for
each key
Learn Python for data science Interactively at www.DataCamp.com >>> rdd.count() Count RDD instances [('a',9),('b',2)]
3 >>> rdd.reduce(lambda a, b: a + b) Merge the rdd values
>>> rdd.countByKey() Count RDD instances by key ('a',7,'a',2,'b',2)
defaultdict(<type 'int'>,{'a':2,'b':1}) Grouping by
>>> rdd.countByValue() Count RDD instances by value >>> rdd3.groupBy(lambda x: x % 2) Return RDD of grouped values
Spark defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() Return (key,value) pairs as a
.mapValues(list)
.collect()
PySpark is the Spark Python API that exposes {'a': 2,'b': 2} dictionary >>> rdd.groupByKey() Group rdd by key
>>> rdd3.sum() Sum of RDD elements .mapValues(list)
the Spark programming model to Python. 4950 .collect()
>>> sc.parallelize([]).isEmpty() Check whether RDD is empty [('a',[7,2]),('b',[2])]
True
Initializing Spark Summary
Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
SparkContext >>> rdd3.max() Maximum value of RDD elements >>> rdd3.aggregate((0,0),seqOp,combOp) Aggregate RDD elements of each
99 (4950,100) partition and then the results
>>> from pyspark import SparkContext >>> rdd3.min() Minimum value of RDD elements
>>> sc = SparkContext(master = 'local[2]') >>> rdd.aggregateByKey((0,0),seqop,combop) Aggregate values of each RDD key
0
>>> rdd3.mean() Mean value of RDD elements .collect()
Inspect SparkContext 49.5 [('a',(9,2)), ('b',(2,1))]
>>> rdd3.stdev() Standard deviation of RDD elements >>> rdd3.fold(0,add) Aggregate the elements of each
>>> sc.version Retrieve SparkContext version 28.866070047722118 4950 partition, and then the results
>>> sc.pythonVer Retrieve Python version >>> rdd3.variance() Compute variance of RDD elements >>> rdd.foldByKey(0, add) Merge the values for each key
>>> sc.master Master URL to connect to 833.25 .collect()
>>> str(sc.sparkHome) Path where Spark is installed on worker nodes >>> rdd3.histogram(3) Compute histogram by bins [('a',9),('b',2)]
>>> str(sc.sparkUser()) Retrieve name of the Spark User running ([0,33,66,99],[33,33,34])
>>> rdd3.stats() Summary statistics (count, mean, stdev, max & >>> rdd3.keyBy(lambda x: x+x) Create tuples of RDD elements by
SparkContext
>>> sc.appName Return application name min) .collect() applying a function
>>> sc.applicationId Retrieve application ID
>>>
>>>
sc.defaultParallelism Return default level of parallelism
sc.defaultMinPartitions Default minimum number of partitions for Applying Functions Mathematical Operations
RDDs >>> rdd.map(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd.subtract(rdd2) Return each rdd value not contained
.collect() .collect() in rdd2
Configuration [('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')] [('b',2),('a',7)]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd2.subtractByKey(rdd) Return each (key,value) pair of rdd2
>>> from pyspark import SparkConf, SparkContext and flatten the result .collect() with no matching key in rdd
>>> conf = (SparkConf() >>> rdd5.collect() [('d', 1)]
.setMaster("local") ['a',7,7,'a','a',2,2,'a','b',2,2,'b'] >>> rdd.cartesian(rdd2).collect() Return the Cartesian product of rdd
.setAppName("My app") >>> rdd4.flatMapValues(lambda x: x) Apply a flatMap function to each (key,value) and rdd2
.set("spark.executor.memory", "1g")) .collect() pair of rdd4 without changing the keys
>>> sc = SparkContext(conf = conf) [('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
Sort
Using The Shell Selecting Data >>> rdd2.sortBy(lambda x: x[1]) Sort RDD by given function
.collect()
In the PySpark shell, a special interpreter-aware SparkContext is already Getting [('d',1),('b',1),('a',2)]
created in the variable called sc. >>> rdd.collect() Return a list with all RDD elements >>> rdd2.sortByKey() Sort (key, value) RDD by key
[('a', 7), ('a', 2), ('b', 2)] .collect()
$ ./bin/spark-shell --master local[2] >>> rdd.take(2) Take first 2 RDD elements [('a',2),('b',1),('d',1)]
$ ./bin/pyspark --master local[4] --py-files code.py [('a', 7), ('a', 2)]
>>> rdd.first() Take first RDD element
Set which master the context connects to with the --master argument, and
('a', 7) Repartitioning
add Python .zip, .egg or .py files to the runtime path by passing a >>> rdd.top(2) Take top 2 RDD elements
[('b', 2), ('a', 7)] >>> rdd.repartition(4) New RDD with 4 partitions
comma-separated list to --py-files. >>> rdd.coalesce(1) Decrease the number of partitions in the RDD to 1
Sampling
Loading Data >>> rdd3.sample(False, 0.15, 81).collect() Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97] Saving
Parallelized Collections Filtering >>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.filter(lambda x: "a" in x) Filter the RDD >>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)]) .collect() 'org.apache.hadoop.mapred.TextOutputFormat')
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)]) [('a',7),('a',2)]
>>> rdd3 = sc.parallelize(range(100)) >>> rdd5.distinct().collect() Return distinct RDD values
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
("b",["p", "r"])])
['a',2,'b',7]
>>> rdd.keys().collect() Return (key,value) RDD's keys
Stopping SparkContext
['a', 'a', 'b'] >>> sc.stop()
External Data
Read either one text file from HDFS, a local file system or or any Iterating Execution
Hadoop-supported file system URI with textFile(), or read in a directory >>> def g(x): print(x)
>>> rdd.foreach(g) Apply a function to all RDD elements $ ./bin/spark-submit examples/src/main/python/pi.py
of text files with wholeTextFiles().
('a', 7)
>>> textFile = sc.textFile("/my/directory/*.txt") ('b', 2) DataCamp
>>> textFile2 = sc.wholeTextFiles("/my/directory/") ('a', 2) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Duplicate Values GroupBy
>>> df = df.dropDuplicates() >>> df.groupBy("age")\ Group by age, count the members
PySpark - SQL Basics .count() \
.show()
in the groups
Learn Python for data science Interactively at www.DataCamp.com Queries
>>> from pyspark.sql import functions as F
Select Filter
>>> df.select("firstName").show() Show all entries in firstName column >>> df.filter(df["age"]>24).show() Filter entries of age, only keep those
>>> df.select("firstName","lastName") \ records of which the values are >24
PySpark & Spark SQL .show()
>>> df.select("firstName", Show all entries in firstName, age

Spark SQL is Apache Spark's module for "age", and type


Sort
explode("phoneNumber") \
working with structured data. .alias("contactInfo")) \
.select("contactInfo.type", >>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
Initializing SparkSession "firstName",
"age") \ >>> df.orderBy(["age","city"],ascending=[0,1])\
A SparkSession can be used create DataFrame, register DataFrame as tables, .show() .collect()
execute SQL over tables, cache tables, and read parquet files. >>> df.select(df["firstName"],df["age"]+ 1) Show all entries in firstName and age,
.show() add 1 to the entries of age
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
>>> df.select(df['age'] > 24).show()
When
Show all entries where age >24 Missing & Replacing Values
.builder \ >>> df.select("firstName", Show firstName and 0 or 1 depending
.appName("Python Spark SQL basic example") \ >>> df.na.fill(50).show() Replace null values
F.when(df.age > 30, 1) \ on age >30 >>> df.na.drop().show() Return new df omitting rows with null values
.config("spark.some.config.option", "some-value") \ .otherwise(0)) \
.getOrCreate() >>> df.na \ Return new df replacing one value with
.show() .replace(10, 20) \ another
>>> df[df.firstName.isin("Jane","Boris")] Show firstName if in the given options .show()
Creating DataFrames Like
.collect()

From RDDs
>>> df.select("firstName", Show firstName, and lastName is
df.lastName.like("Smith")) \ TRUE if lastName is like Smith
Repartitioning
.show()
>>> from pyspark.sql.types import * Startswith - Endswith >>> df.repartition(10)\ df with 10 partitions
>>> df.select("firstName", Show firstName, and TRUE if .rdd \
Infer Schema .getNumPartitions()
>>> sc = spark.sparkContext df.lastName \ lastName starts with Sm
.startswith("Sm")) \ >>> df.coalesce(1).rdd.getNumPartitions() df with 1 partition
>>> lines = sc.textFile("people.txt")
.show()
>>> parts = lines.map(lambda l: l.split(",")) >>> df.select(df.lastName.endswith("th")) \ Show last names ending in th
>>>
>>>
people = parts.map(lambda p: Row(name=p[0],age=int(p[1])))
peopledf = spark.createDataFrame(people)
.show() Running SQL Queries Programmatically
Substring
Specify Schema >>> df.select(df.firstName.substr(1, 3) \ Return substrings of firstName Registering DataFrames as Views
>>> people = parts.map(lambda p: Row(name=p[0], .alias("name")) \
age=int(p[1].strip()))) .collect() >>> peopledf.createGlobalTempView("people")
>>> schemaString = "name age" Between >>> df.createTempView("customer")
>>> fields = [StructField(field_name, StringType(), True) for >>> df.select(df.age.between(22, 24)) \ Show age: values are TRUE if between >>> df.createOrReplaceTempView("customer")
field_name in schemaString.split()] .show() 22 and 24
>>> schema = StructType(fields) Query Views
>>> spark.createDataFrame(people, schema).show()
+--------+---+
| name|age|
Add, Update & Remove Columns >>> df5 = spark.sql("SELECT * FROM customer").show()
+--------+---+ >>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
|
|
Mine| 28|
Filip| 29|
Adding Columns .show()
|Jonathan| 30|
+--------+---+ >>> df = df.withColumn('city',df.address.city) \
.withColumn('postalCode',df.address.postalCode) \
From Spark Data Sources .withColumn('state',df.address.state) \
.withColumn('streetAddress',df.address.streetAddress) \
Output
.withColumn('telePhoneNumber', Data Structures
JSON explode(df.phoneNumber.number)) \
>>> df = spark.read.json("customer.json") .withColumn('telePhoneType',
>>> df.show() >>> rdd1 = df.rdd Convert df into an RDD
+--------------------+---+---------+--------+--------------------+ explode(df.phoneNumber.type)) >>> df.toJSON().first() Convert df into a RDD of string
| address|age|firstName |lastName| phoneNumber|
+--------------------+---+---------+--------+--------------------+ >>> df.toPandas() Return the contents of df as Pandas
|[New York,10021,N...| 25|
|[New York,10021,N...| 21|
John|
Jane|
Smith|[[212 555-1234,ho...|
Doe|[[322 888-1234,ho...|
Updating Columns DataFrame
+--------------------+---+---------+--------+--------------------+
>>> df2 = spark.read.load("people.json", format="json")
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber') Write & Save to Files
Parquet files Removing Columns >>> df.select("firstName", "city")\
>>> df3 = spark.read.load("users.parquet") .write \
TXT files >>> df = df.drop("address", "phoneNumber") .save("nameAndCity.parquet")
>>> df4 = spark.read.text("people.txt") >>> df = df.drop(df.address).drop(df.phoneNumber) >>> df.select("firstName", "age") \
.write \
.save("namesAndAges.json",format="json")
Inspect Data
>>> df.dtypes Return df column names and data types >>> df.describe().show() Compute summary statistics Stopping SparkSession
>>> df.show() Display the content of df >>> df.columns Return the columns of df
>>> df.count() >>> spark.stop()
>>> df.head() Return first n rows Count the number of rows in df
>>> df.first() Return first row >>> df.distinct().count() Count the number of distinct rows in df
>>> df.take(2) Return the first n rows >>> df.printSchema() Print the schema of df DataCamp
>>> df.schema Return the schema of df >>> df.explain() Print the (logical and physical) plans
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Model Architecture Inspect Model
>>> model.output_shape
Sequential Model Model output shape
Keras >>> from keras.models import Sequential
>>>
>>>
model.summary()
model.get_config()
Model summary representation
Model configuration
Learn Python for data science Interactively at www.DataCamp.com >>> model = Sequential() >>> model.get_weights() List all weight tensors in the model
>>> model2 = Sequential()
>>> model3 = Sequential() Compile Model
Multilayer Perceptron (MLP) MLP: Binary Classification
Keras Binary Classification >>> model.compile(optimizer='adam',
loss='binary_crossentropy',
Keras is a powerful and easy-to-use deep learning library for >>> from keras.layers import Dense metrics=['accuracy'])
Theano and TensorFlow that provides a high-level neural >>> model.add(Dense(12, MLP: Multi-Class Classification
input_dim=8, >>> model.compile(optimizer='rmsprop',
networks API to develop and evaluate deep learning models. kernel_initializer='uniform', loss='categorical_crossentropy',
activation='relu')) metrics=['accuracy'])
A Basic Example >>> model.add(Dense(8,kernel_initializer='uniform',activation='relu'))
MLP: Regression
>>> model.add(Dense(1,kernel_initializer='uniform',activation='sigmoid')) >>> model.compile(optimizer='rmsprop',
>>> import numpy as np loss='mse',
>>> from keras.models import Sequential Multi-Class Classification metrics=['mae'])
>>> from keras.layers import Dense >>> from keras.layers import Dropout
>>> data = np.random.random((1000,100)) >>> model.add(Dense(512,activation='relu',input_shape=(784,))) Recurrent Neural Network
>>> labels = np.random.randint(2,size=(1000,1)) >>> model.add(Dropout(0.2)) >>> model3.compile(loss='binary_crossentropy',
>>> model = Sequential() optimizer='adam',
>>> model.add(Dense(512,activation='relu')) metrics=['accuracy'])
>>> model.add(Dense(32, >>> model.add(Dropout(0.2))
activation='relu', >>> model.add(Dense(10,activation='softmax'))

>>>
input_dim=100))
model.add(Dense(1, activation='sigmoid'))
Regression Model Training
>>> model.compile(optimizer='rmsprop', >>> model.add(Dense(64,activation='relu',input_dim=train_data.shape[1])) >>> model3.fit(x_train4,
loss='binary_crossentropy', >>> model.add(Dense(1)) y_train4,
metrics=['accuracy']) batch_size=32,
>>> model.fit(data,labels,epochs=10,batch_size=32) Convolutional Neural Network (CNN) epochs=15,
verbose=1,
>>> predictions = model.predict(data) >>> from keras.layers import Activation,Conv2D,MaxPooling2D,Flatten validation_data=(x_test4,y_test4))
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
Data Also see NumPy, Pandas & Scikit-Learn >>>
>>>
model2.add(Activation('relu'))
model2.add(Conv2D(32,(3,3))) Evaluate Your Model's Performance
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ide- >>> model2.add(Activation('relu')) >>> score = model3.evaluate(x_test,
>>> model2.add(MaxPooling2D(pool_size=(2,2))) y_test,
ally, you split the data in training and test sets, for which you can also resort batch_size=32)
>>> model2.add(Dropout(0.25))
to the train_test_split module of sklearn.cross_validation.
>>> model2.add(Conv2D(64,(3,3), padding='same'))
Keras Data Sets >>>
>>>
model2.add(Activation('relu'))
model2.add(Conv2D(64,(3, 3)))
Prediction
>>> from keras.datasets import boston_housing, >>> model2.add(Activation('relu')) >>> model3.predict(x_test4, batch_size=32)
mnist, >>> model2.add(MaxPooling2D(pool_size=(2,2))) >>> model3.predict_classes(x_test4,batch_size=32)
cifar10, >>> model2.add(Dropout(0.25))
imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>>
>>>
model2.add(Flatten())
model2.add(Dense(512))
Save/ Reload Models
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data() >>> model2.add(Activation('relu')) >>> from keras.models import load_model
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000) >>> model2.add(Dropout(0.5)) >>> model3.save('model_file.h5')
>>> num_classes = 10 >>> my_model = load_model('my_model.h5')
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))
Other
Recurrent Neural Network (RNN) Model Fine-tuning
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/
ml/machine-learning-databases/pima-indians-diabetes/
>>> from keras.klayers import Embedding,LSTM Optimization Parameters
pima-indians-diabetes.data"),delimiter=",") >>> model3.add(Embedding(20000,128)) >>> from keras.optimizers import RMSprop
>>> X = data[:,0:8] >>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2)) >>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> y = data [:,8] >>> model3.add(Dense(1,activation='sigmoid')) >>> model2.compile(loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
Preprocessing Also see NumPy & Scikit-Learn
Early Stopping
Sequence Padding Train and Test Sets >>> from keras.callbacks import EarlyStopping
>>> from keras.preprocessing import sequence >>> from sklearn.model_selection import train_test_split >>> early_stopping_monitor = EarlyStopping(patience=2)
>>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80) >>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X, >>> model3.fit(x_train4,
>>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80) y,
test_size=0.33, y_train4,
random_state=42) batch_size=32,
One-Hot Encoding epochs=15,
>>> from keras.utils import to_categorical Standardization/Normalization validation_data=(x_test4,y_test4),
>>> Y_train = to_categorical(y_train, num_classes) >>> from sklearn.preprocessing import StandardScaler callbacks=[early_stopping_monitor])
>>> Y_test = to_categorical(y_test, num_classes) >>> scaler = StandardScaler().fit(x_train2)
>>> Y_train3 = to_categorical(y_train3, num_classes) >>> standardized_X = scaler.transform(x_train2) DataCamp
>>> Y_test3 = to_categorical(y_test3, num_classes) >>> standardized_X_test = scaler.transform(x_test2) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet 3 Renderers & Visual Customizations
Glyphs Customized Glyphs Also see Data
Bokeh Scatter Markers Selection and Non-Selection Glyphs
Learn Bokeh Interactively at www.DataCamp.com, >>> p1.circle(np.array([1,2,3]), np.array([3,2,1]), >>> p = figure(tools='box_select')
taught by Bryan Van de Ven, core contributor fill_color='white') >>> p.circle('mpg', 'cyl', source=cds_df,
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3], selection_color='red',
color='blue', size=1) nonselection_alpha=0.1)
Line Glyphs
Plotting With Bokeh >>> p1.line([1,2,3,4], [3,4,5,6], line_width=2) Hover Glyphs
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]), >>> hover = HoverTool(tooltips=None, mode='vline')
The Python interactive visualization library Bokeh pd.DataFrame([[3,4,5],[3,2,1]]), >>> p3.add_tools(hover)
enables high-performance visual presentation of color="blue")
Colormapping
large datasets in modern web browsers. Rows & Columns Layout US
Asia >>> color_mapper = CategoricalColorMapper(
factors=['US', 'Asia', 'Europe'],
Europe

Rows Columns palette=['blue', 'red', 'green'])


Bokeh’s mid-level general purpose bokeh.plotting >>> from bokeh.layouts import row >>> from bokeh.layouts import columns >>> p3.circle('mpg', 'cyl', source=cds_df,
>>> layout = row(p1,p2,p3) >>> layout = column(p1,p2,p3)
interface is centered around two main components: data color=dict(field='origin',
Nesting Rows & Columns transform=color_mapper),
and glyphs. >>>layout = row(column(p1,p2), p3) legend='Origin'))

Grid Layout Linked Plots Also see Data


+ = >>> from bokeh.layouts import gridplot
>>> row1 = [p1,p2]
Linked Axes
data glyphs plot >>> p2.x_range = p1.x_range
>>> row2 = [p3]
>>> p2.y_range = p1.y_range
>>> layout = gridplot([[p1,p2],[p3]])
The basic steps to creating plots with the bokeh.plotting Linked Brushing
interface are: Tabbed Layout >>> p4 = figure(plot_width = 100, tools='box_select,lasso_select')
>>> from bokeh.models.widgets import Panel, Tabs >>> p4.circle('mpg', 'cyl', source=cds_df)
1. Prepare some data: >>> tab1 = Panel(child=p1, title="tab1") >>> p5 = figure(plot_width = 200, tools='box_select,lasso_select')
Python lists, NumPy arrays, Pandas DataFrames and other sequences of values >>> p5.circle('mpg', 'hp', source=cds_df)
>>> tab2 = Panel(child=p2, title="tab2")
2. Create a new plot >>> layout = Tabs(tabs=[tab1, tab2]) >>> layout = row(p4,p5)
3. Add renderers for your data, with visual customizations Legends
4. Specify where to generate the output Legend Location Legend Orientation
5. Show or save the results Inside Plot Area >>> p.legend.orientation = "horizontal"
>>> from bokeh.plotting import figure >>> p.legend.location = 'bottom_left' >>> p.legend.orientation = "vertical"
>>> from bokeh.io import output_file, show Outside Plot Area
>>> x = [1, 2, 3, 4, 5] >>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1]) Legend Background & Border
Step 1 >>> r2 = p2.line([1,2,3,4], [3,4,5,6])
>>> y = [6, 7, 2, 4, 5] >>> p.legend.border_line_color = "navy"
>>> legend = Legend(items=[("One" , [p1, r1]),("Two" , [r2])], location=(0, -30))
>>> p = figure(title="simple line example", Step 2 >>> p.add_layout(legend, 'right') >>> p.legend.background_fill_color = "white"
x_axis_label='x',
y_axis_label='y')
>>> p.line(x, y, legend="Temp.", line_width=2) Step 3 4 Output Statistical Charts With Bokeh
Bokeh’s high-level bokeh.charts interface is ideal for quickly
Also see Data
>>> output_file("lines.html") Step 4 Output to HTML File
>>> show(p) Step 5 creating statistical charts
>>> from bokeh.io import output_file, show

1 Data Also see Lists, NumPy & Pandas


>>> output_file('my_bar_chart.html', mode='cdn')
Notebook Output
Bar Chart
>>> from bokeh.charts import Bar
Under the hood, your data is converted to Column Data >>> p = Bar(df, stacked=True, palette=['red','blue'])
>>> from bokeh.io import output_notebook, show
Sources. You can also do this manually: Box Plot
>>> import numpy as np >>> output_notebook()
>>> import pandas as pd >>> from bokeh.charts import BoxPlot
>>> df = pd.DataFrame(np.array([[33.9,4,65, 'US'], Embedding >>> p = BoxPlot(df, values='vals', label='cyl',
[32.4,4,66, 'Asia'], legend='bottom_right')
Standalone HTML
Label 1

[21.4,4,109, 'Europe']]),
Label 2
Label 3

columns=['mpg','cyl', 'hp', 'origin'],


index=['Toyota', 'Fiat', 'Volvo'])
>>> from bokeh.embed import file_html Histogram
>>> html = file_html(p, CDN, "my_plot") Histogram

>>> from bokeh.charts import Histogram


>>> from bokeh.models import ColumnDataSource Components >>> p = Histogram(df, title='Histogram')
>>> cds_df = ColumnDataSource(df)
>>> from bokeh.embed import components Scatter Plot
2 Plotting
>>> script, div = components(p) >>> from bokeh.charts import Scatter

5
>>> p = Scatter(df, x='mpg', y ='hp', marker='square',
Show or Save Your Plots
y-axis
>>> from bokeh.plotting import figure xlabel='Miles Per Gallon',
>>> p1 = figure(plot_width=300, tools='pan,box_zoom') ylabel='Horsepower')
x-axis

>>> p2 = figure(plot_width=300, plot_height=300, >>> show(p1) >>> save(p1)


x_range=(0, 8), y_range=(0, 8)) >>> show(layout) >>> save(layout) DataCamp
>>> p3 = figure() Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Linear Algebra Also see NumPy
You’ll use the linalg and sparse modules. Note that scipy.linalg contains and expands on numpy.linalg.
SciPy - Linear Algebra >>> from scipy import linalg, sparse Matrix Functions
Learn More Python for Data Science Interactively at www.datacamp.com
Creating Matrices Addition
>>> np.add(A,D) Addition
>>> A = np.matrix(np.random.random((2,2)))
>>> B = np.asmatrix(b) Subtraction
SciPy >>> C = np.mat(np.random.random((10,5))) >>> np.subtract(A,D) Subtraction
The SciPy library is one of the core packages for >>> D = np.mat([[3,4], [5,6]]) Division
>>> np.divide(A,D) Division
scientific computing that provides mathematical Basic Matrix Routines Multiplication
algorithms and convenience functions built on the >>> A @ D Multiplication operator
Inverse
NumPy extension of Python. >>> A.I Inverse (Python 3)
>>> np.multiply(D,A) Multiplication
>>> linalg.inv(A) Inverse
>>> np.dot(A,D) Dot product
Interacting With NumPy Also see NumPy Transposition >>> np.vdot(A,D) Vector dot product
>>> import numpy as np >>> A.T Tranpose matrix >>> np.inner(A,D) Inner product
>>> A.H Conjugate transposition >>> np.outer(A,D) Outer product
>>> a = np.array([1,2,3])
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)]) Trace >>> np.tensordot(A,D) Tensor dot product
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]]) >>> np.trace(A) Trace >>> np.kron(A,D) Kronecker product
Norm Exponential Functions
Index Tricks >>> linalg.expm(A) Matrix exponential
>>> linalg.norm(A) Frobenius norm
>>> np.mgrid[0:5,0:5] Create a dense meshgrid >>> linalg.expm2(A) Matrix exponential (Taylor Series)
>>> linalg.norm(A,1) L1 norm (max column sum)
>>> np.ogrid[0:2,0:2] Create an open meshgrid >>> linalg.expm3(D) Matrix exponential (eigenvalue
>>> linalg.norm(A,np.inf) L inf norm (max row sum) decomposition)
>>> np.r_[3,[0]*5,-1:1:10j] Stack arrays vertically (row-wise)
>>> np.c_[b,c] Create stacked column-wise arrays Rank Logarithm Function
>>> np.linalg.matrix_rank(C) Matrix rank >>> linalg.logm(A) Matrix logarithm
Shape Manipulation Determinant Trigonometric Functions
>>> linalg.det(A) Determinant >>> linalg.sinm(D) Matrix sine
>>> np.transpose(b) Permute array dimensions
Solving linear problems >>> linalg.cosm(D) Matrix cosine
>>> b.flatten() Flatten the array >>> linalg.tanm(A) Matrix tangent
>>> np.hstack((b,c)) Stack arrays horizontally (column-wise) >>> linalg.solve(A,b) Solver for dense matrices
>>> np.vstack((a,b)) Stack arrays vertically (row-wise) >>> E = np.mat(a).T Solver for dense matrices Hyperbolic Trigonometric Functions
>>> np.hsplit(c,2) Split the array horizontally at the 2nd index >>> linalg.lstsq(F,E) Least-squares solution to linear matrix >>> linalg.sinhm(D) Hypberbolic matrix sine
>>> np.vpslit(d,2) Split the array vertically at the 2nd index equation >>> linalg.coshm(D) Hyperbolic matrix cosine
Generalized inverse >>> linalg.tanhm(A) Hyperbolic matrix tangent
Polynomials >>> linalg.pinv(C) Compute the pseudo-inverse of a matrix Matrix Sign Function
(least-squares solver) >>> np.signm(A) Matrix sign function
>>> from numpy import poly1d
>>> p = poly1d([3,4,5]) Create a polynomial object >>> linalg.pinv2(C) Compute the pseudo-inverse of a matrix Matrix Square Root
(SVD) >>> linalg.sqrtm(A) Matrix square root
Vectorizing Functions Creating Sparse Matrices Arbitrary Functions
>>> def myfunc(a): >>> linalg.funm(A, lambda x: x*x) Evaluate matrix function
if a < 0: >>> F = np.eye(3, k=1) Create a 2X2 identity matrix
return a*2
else:
>>> G = np.mat(np.identity(2)) Create a 2x2 identity matrix Decompositions
return a/2 >>> C[C > 0.5] = 0
>>> np.vectorize(myfunc) Vectorize functions
>>> H = sparse.csr_matrix(C) Compressed Sparse Row matrix Eigenvalues and Eigenvectors
>>> I = sparse.csc_matrix(D) Compressed Sparse Column matrix >>> la, v = linalg.eig(A) Solve ordinary or generalized
>>> J = sparse.dok_matrix(A) Dictionary Of Keys matrix eigenvalue problem for square matrix
Type Handling >>> E.todense() Sparse matrix to full matrix >>> l1, l2 = la Unpack eigenvalues
>>> sparse.isspmatrix_csc(A) Identify sparse matrix >>> v[:,0] First eigenvector
>>> np.real(b) Return the real part of the array elements >>> v[:,1] Second eigenvector
>>> np.imag(b) Return the imaginary part of the array elements
>>> np.real_if_close(c,tol=1000) Return a real array if complex parts close to 0 Sparse Matrix Routines >>> linalg.eigvals(A) Unpack eigenvalues
>>> np.cast['f'](np.pi) Cast object to a data type Singular Value Decomposition
Inverse >>> U,s,Vh = linalg.svd(B) Singular Value Decomposition (SVD)
>>> sparse.linalg.inv(I) Inverse >>> M,N = B.shape
Other Useful Functions
Norm >>> Sig = linalg.diagsvd(s,M,N) Construct sigma matrix in SVD
>>> np.angle(b,deg=True) Return the angle of the complex argument >>> sparse.linalg.norm(I) Norm LU Decomposition
>>> g = np.linspace(0,np.pi,num=5) Create an array of evenly spaced values Solving linear problems >>> P,L,U = linalg.lu(C) LU Decomposition
(number of samples)
>>> g [3:] += np.pi >>> sparse.linalg.spsolve(H,I) Solver for sparse matrices
>>> np.unwrap(g) Unwrap Sparse Matrix Decompositions
>>>
>>>
np.logspace(0,10,3) Create an array of evenly spaced values (log scale)
np.select([c<4],[c*2]) Return values from a list of arrays depending on
Sparse Matrix Functions >>> la, v = sparse.linalg.eigs(F,1) Eigenvalues and eigenvectors
conditions >>> sparse.linalg.expm(I) Sparse matrix exponential >>> sparse.linalg.svds(H, 2) SVD
>>> misc.factorial(a) Factorial
>>> Combine N things taken at k time
>>>
misc.comb(10,3,exact=True)
misc.central_diff_weights(3) Weights for Np-point central derivative Asking For Help DataCamp
>>> misc.derivative(myfunc,1.0) Find the n-th derivative of a function at a point >>> help(scipy.linalg.diagsvd)
>>> np.info(np.matrix) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Create Your Model Evaluate Your Model’s Performance
Supervised Learning Estimators Classification Metrics
Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com Linear Regression Accuracy Score
>>> from sklearn.linear_model import LinearRegression >>> knn.score(X_test, y_test) Estimator score method
>>> lr = LinearRegression(normalize=True) >>> from sklearn.metrics import accuracy_score Metric scoring functions
>>> accuracy_score(y_test, y_pred)
Support Vector Machines (SVM)
Scikit-learn >>> from sklearn.svm import SVC Classification Report
>>> svc = SVC(kernel='linear') >>> from sklearn.metrics import classification_report Precision, recall, f1-score
Scikit-learn is an open source Python library that Naive Bayes >>> print(classification_report(y_test, y_pred)) and support
implements a range of machine learning, >>> from sklearn.naive_bayes import GaussianNB Confusion Matrix
>>> gnb = GaussianNB() >>> from sklearn.metrics import confusion_matrix
preprocessing, cross-validation and visualization >>> print(confusion_matrix(y_test, y_pred))
algorithms using a unified interface. KNN
>>> from sklearn import neighbors Regression Metrics
A Basic Example >>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> from sklearn import neighbors, datasets, preprocessing
Mean Absolute Error
>>> from sklearn.model_selection import train_test_split Unsupervised Learning Estimators >>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.metrics import accuracy_score >>> y_true = [3, -0.5, 2]
>>> iris = datasets.load_iris() Principal Component Analysis (PCA) >>> mean_absolute_error(y_true, y_pred)
>>> X, y = iris.data[:, :2], iris.target >>> from sklearn.decomposition import PCA Mean Squared Error
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33) >>> pca = PCA(n_components=0.95) >>> from sklearn.metrics import mean_squared_error
>>> scaler = preprocessing.StandardScaler().fit(X_train) >>> mean_squared_error(y_test, y_pred)
>>> X_train = scaler.transform(X_train)
K Means
>>> X_test = scaler.transform(X_test) >>> from sklearn.cluster import KMeans R² Score
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5) >>> k_means = KMeans(n_clusters=3, random_state=0) >>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred) Model Fitting Clustering Metrics
Adjusted Rand Index
Supervised learning >>> from sklearn.metrics import adjusted_rand_score
Loading The Data Also see NumPy & Pandas >>> lr.fit(X, y) Fit the model to the data
>>> adjusted_rand_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse >>> svc.fit(X_train, y_train) Homogeneity
>>> from sklearn.metrics import homogeneity_score
matrices. Other types that are convertible to numeric arrays, such as Pandas Unsupervised Learning >>> homogeneity_score(y_true, y_pred)
DataFrame, are also acceptable. >>> k_means.fit(X_train) Fit the model to the data
>>> pca_model = pca.fit_transform(X_train) Fit to data, then transform it V-measure
>>> import numpy as np >>> from sklearn.metrics import v_measure_score
>>> X = np.random.random((10,5)) >>> metrics.v_measure_score(y_true, y_pred)
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F','F'])
>>> X[X < 0.7] = 0 Prediction Cross-Validation
>>> from sklearn.cross_validation import cross_val_score
Supervised Estimators >>> print(cross_val_score(knn, X_train, y_train, cv=4))
Training And Test Data >>> y_pred = svc.predict(np.random.random((2,5))) Predict labels
>>> y_pred = lr.predict(X_test)
>>> print(cross_val_score(lr, X, y, cv=2))
Predict labels
>>> from sklearn.model_selection import train_test_split >>> y_pred = knn.predict_proba(X_test) Estimate probability of a label
>>> X_train, X_test, y_train, y_test = train_test_split(X,
y, Unsupervised Estimators Tune Your Model
random_state=0) >>> y_pred = k_means.predict(X_test) Predict labels in clustering algos Grid Search
>>> from sklearn.grid_search import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
Preprocessing The Data "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
Standardization Encoding Categorical Features param_grid=params)
>>> grid.fit(X_train, y_train)
>>> from sklearn.preprocessing import StandardScaler >>> from sklearn.preprocessing import LabelEncoder >>> print(grid.best_score_)
>>> scaler = StandardScaler().fit(X_train) >>> print(grid.best_estimator_.n_neighbors)
>>> enc = LabelEncoder()
>>> standardized_X = scaler.transform(X_train) >>> y = enc.fit_transform(y)
>>> standardized_X_test = scaler.transform(X_test) Randomized Parameter Optimization
Normalization Imputing Missing Values >>> from sklearn.grid_search import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
>>> from sklearn.preprocessing import Normalizer "weights": ["uniform", "distance"]}
>>> from sklearn.preprocessing import Imputer >>> rsearch = RandomizedSearchCV(estimator=knn,
>>> scaler = Normalizer().fit(X_train) >>> imp = Imputer(missing_values=0, strategy='mean', axis=0) param_distributions=params,
>>> normalized_X = scaler.transform(X_train) >>> imp.fit_transform(X_train) cv=4,
>>> normalized_X_test = scaler.transform(X_test) n_iter=8,
random_state=5)
Binarization Generating Polynomial Features >>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
>>> from sklearn.preprocessing import Binarizer >>> from sklearn.preprocessing import PolynomialFeatures
>>> binarizer = Binarizer(threshold=0.0).fit(X) >>> poly = PolynomialFeatures(5)
>>> binary_X = binarizer.transform(X) >>> poly.fit_transform(X) DataCamp
Learn Python for Data Science Interactively

Você também pode gostar