Rapids Cheatsheet

Cheat Sheet
www.RAPIDS.ai
TIDY DATA: A foundation for wrangling in pandas

In a tidy data set:
    Each variable is saved in its own column.
    Each observation is saved in its own row.

Tidy data complements pandas' vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively.

INGESTING AND RESHAPING DATA: Change the layout of a data set

gdf = cudf.read_csv(filename, delimiter=",", names=col_names, dtype=col_types)
    Read a CSV file into a DataFrame.
df.pivot(columns='var', values='val')
    Spread rows into columns. (Planned for Future Release)
cudf.concat([gdf1, gdf2])
    Append rows of DataFrames.
gdf.add_column('name', gdf1['name'])
    Append columns of DataFrames.
gdf.sort_values('mpg')
    Order rows by values of a column (low to high).
gdf.sort_values('mpg', ascending=False)
    Order rows by values of a column (high to low).
df.rename(columns={'y': 'year'})
    Rename the columns of a DataFrame. (Planned for Future Release)
gdf.sort_index()
    Sort the index of a DataFrame.
gdf.set_index()
    Return a new DataFrame with a new index.
gdf.drop_column('Length')
    Drop column from DataFrame.

SYNTAX: Creating DataFrames

         a    b    c
    1    4    7   10
    2    5    8   11
    3    6    9   12

gdf = cudf.DataFrame([
    ("a", [4, 5, 6]),
    ("b", [7, 8, 9]),
    ("c", [10, 11, 12])
])
    Specify values for each column.

gdf = cudf.DataFrame.from_records(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
    Specify values for each row.

METHOD CHAINING

Most pandas methods return a DataFrame so another pandas method can be applied to the result. This improves readability of code.

gdf = (cudf.from_pandas(df)
       .query('val >= 200')
       .nlargest(3, 'val'))

SUBSET OBSERVATIONS

gdf.query('Length > 7')
    Extract rows that meet logical criteria.
df.drop_duplicates()
    Remove duplicate rows (only considers columns). (Planned for Future Release)
df.head(n)
    Select first n rows.
df.tail(n)
    Select last n rows.
df.sample(frac=0.5)
    Randomly select fraction of rows. (Planned for Future Release)
df.sample(n=10)
    Randomly select n rows. (Planned for Future Release)
df.iloc[10:20]
    Select rows by position. (Planned for Future Release)
gdf.nlargest(n, 'value')
    Select and order top n entries.
gdf.nsmallest(n, 'value')
    Select and order bottom n entries.

LOGIC IN PYTHON (AND PANDAS)

    <     Less than                   !=                                Not equal to
    >     Greater than                df.column.isin(values)            Group membership
    ==    Equals                      pd.isnull(obj)                    Is NaN
    <=    Less than or equals         pd.notnull(obj)                   Is not NaN
    >=    Greater than or equals      &, |, ~, ^, df.any(), df.all()    Logical and, or, not, xor, any, all

SUBSET VARIABLES (COLUMNS)

gdf[['width', 'length', 'species']]
    Select multiple columns with specific names.
gdf['width'] or gdf.width
    Select single column with specific name.
df.filter(regex='regex')
    Select columns whose name matches regular expression regex. (Planned for Future Release)
gdf.loc[2:5, ['x2', 'x4']]
    Get rows from index 2 to index 5 from the 'x2' and 'x4' columns.
df.iloc[:, [1, 2, 5]]
    Select columns in positions 1, 2 and 5 (first column is 0). (Planned for Future Release)
df.loc[df['a'] > 10, ['a', 'c']]
    Select rows meeting logical condition, and only the specific columns. (Planned for Future Release)

REGEX (REGULAR EXPRESSIONS) EXAMPLES (Planned for Future Release)

    '\.'                 Matches strings containing a period '.'
    'Length$'            Matches strings ending with the word 'Length'
    '^Sepal'             Matches strings beginning with the word 'Sepal'
    '^x[1-5]$'           Matches strings beginning with 'x' and ending with 1, 2, 3, 4 or 5
    '^(?!Species$).*'    Matches strings except the string 'Species'

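The patterns listed under REGEX (REGULAR EXPRESSIONS) EXAMPLES can be tried out with Python's standard re module, without a DataFrame. The sketch below assumes the behaviour of pandas' df.filter(regex=...), which keeps a column when re.search finds the pattern anywhere in its name. The column names and the select_columns helper are hypothetical, made up for illustration; they are not part of cuDF or pandas.

```python
import re

# Hypothetical column names, chosen so that each pattern has something to match.
columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Species", "x1", "x3", "x10"]

# The five patterns from the regex examples table.
patterns = [r"\.", r"Length$", r"^Sepal", r"^x[1-5]$", r"^(?!Species$).*"]


def select_columns(names, pattern):
    """Return the names in which the pattern is found (re.search semantics)."""
    return [name for name in names if re.search(pattern, name)]


for pattern in patterns:
    print(f"{pattern!r:22} -> {select_columns(columns, pattern)}")
```

Note that '^x[1-5]$' rejects 'x10' because the pattern allows exactly one digit between the 'x' and the end of the name.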
SUMMARIZE DATA

gdf['w'].value_counts()
    Count number of rows with each unique value of variable.
len(gdf)
    Number of rows in DataFrame.
gdf['w'].unique_count()
    Number of distinct values in a column.
df.describe()
    Basic descriptive statistics for each column (or GroupBy).

Pygdf provides a set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:

sum()
    Sum values of each object.
count()
    Count non-NA/null values of each object.
median()
    Median value of each object. (Planned for Future Release)
quantile([0.25, 0.75])
    Quantiles of each object.
applymap(function)
    Apply function to each object.
min()
    Minimum value in each object.
max()
    Maximum value in each object.
mean()
    Mean value of each object.
var()
    Variance of each object.
std()
    Standard deviation of each object.

HANDLING MISSING DATA

df.dropna()
    Drop rows with any column having NA/null data. (Planned for Future Release)
gdf['length'].fillna(value)
    Replace all NA/null data with value.

COMBINE DATA SETS

    gdf1          gdf2
    x1  x2        x1  x3
    A   1         A   T
    B   2         B   F
    C   3         D   T

STANDARD JOINS

gdf1.merge(gdf2, how='left', on='x1')
    Join matching rows from gdf2 to gdf1.
    x1  x2   x3
    A   1    T
    B   2    F
    C   3    NaN

gdf1.merge(gdf2, how='right', on='x1')
    Join matching rows from gdf1 to gdf2.
    x1  x2   x3
    A   1.0  T
    B   2.0  F
    D   NaN  T

gdf1.merge(gdf2, how='inner', on='x1')
    Join data. Retain only rows in both sets.
    x1  x2   x3
    A   1    T
    B   2    F

gdf1.merge(gdf2, how='outer', on='x1')
    Join data. Retain all values, all rows.
    x1  x2   x3
    A   1    T
    B   2    F
    C   3    NaN
    D   NaN  T

MAKE NEW COLUMNS

df.assign(Area=lambda df: df.Length*df.Height)
    Compute and append one or more new columns. (Planned for Future Release)
gdf['Volume'] = gdf.Length*gdf.Height*gdf.Depth
    Add single column.
pd.qcut(df.col, n, labels=False)
    Bin column into n buckets. (Planned for Future Release)

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (cuDF Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples:

max(axis=1)
    Element-wise max.
min(axis=1)
    Element-wise min.
clip(lower=-10, upper=10)
    Trim values at input thresholds.
abs()
    Absolute value.

Apply row functions

Define a kernel function:
>>> def kernel(in1, in2, in3, out1, out2, extra1, extra2):
...     for i, (x, y, z) in enumerate(zip(in1, in2, in3)):
...         out1[i] = extra2 * x - extra1 * y
...         out2[i] = y - extra1 * z

Call the kernel with apply_rows:

>>> import numpy as np
>>> outdf = gdf.apply_rows(kernel,
...                        incols=['in1', 'in2', 'in3'],
...                        outcols=dict(out1=np.float64, out2=np.float64),
...                        kwargs=dict(extra1=2.3, extra2=3.4))

GROUP DATA

gdf.groupby("col")
    Return a GroupBy object, grouped by values in column named "col".
df.groupby(level="ind")
    Return a GroupBy object, grouped by values in index level named "ind". (Planned for Future Release)

FILTERING JOINS (Planned for Future Release)

adf[adf.x1.isin(bdf.x1)]
    All rows in adf that have a match in bdf.
    x1  x2
    A   1
    B   2

adf[~adf.x1.isin(bdf.x1)]
    All rows in adf that do not have a match in bdf.
    x1  x2
    C   3

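The result tables for the standard and filtering joins can be reproduced by hand, which makes it easier to see which rows each how value keeps. The following is a plain-Python sketch, not cuDF code and not a description of how cuDF implements merge. It assumes that x1 is unique in each table, uses None where the tables show NaN, and returns rows in left-then-right key order, which a real merge does not promise. The join, semi_join and anti_join helpers are hypothetical names.

```python
# Plain-Python sketch of join semantics; not cuDF code.
# The two inputs mirror the small tables under COMBINE DATA SETS, keyed on x1.
gdf1 = {"A": 1, "B": 2, "C": 3}        # x1 -> x2
gdf2 = {"A": "T", "B": "F", "D": "T"}  # x1 -> x3


def join(left, right, how):
    """Return (x1, x2, x3) rows; None stands in for the NaN shown in the tables."""
    if how == "left":
        keys = list(left)
    elif how == "right":
        keys = list(right)
    elif how == "inner":
        keys = [k for k in left if k in right]
    elif how == "outer":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unsupported how: {how!r}")
    return [(k, left.get(k), right.get(k)) for k in keys]


def semi_join(left, right):
    """Rows of left whose key appears in right (the isin filter)."""
    return [(k, v) for k, v in left.items() if k in right]


def anti_join(left, right):
    """Rows of left whose key does not appear in right (the ~isin filter)."""
    return [(k, v) for k, v in left.items() if k not in right]


for how in ("left", "right", "inner", "outer"):
    print(how, join(gdf1, gdf2, how))
print("semi", semi_join(gdf1, gdf2))
print("anti", anti_join(gdf1, gdf2))
```

The two filtering joins return only columns of the left table, which is what distinguishes them from an inner join on the same key.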
SET-LIKE OPERATIONS

    gdf1          gdf2
    x1  x2        x1  x2
    A   1         B   2
    B   2         C   3
    C   3         D   4

gdf1.merge(gdf2, how='inner')
    Rows that appear in both gdf1 and gdf2 (Intersection).
    x1  x2
    B   2
    C   3

gdf1.merge(gdf2, how='outer')
    Rows that appear in either or both gdf1 and gdf2 (Union).
    x1  x2
    A   1
    B   2
    C   3
    D   4

(pd.merge(ydf, zdf, how='outer', indicator=True)
   .query('_merge == "left_only"')
   .drop(columns=['_merge']))
    Rows that appear in ydf but not zdf (Setdiff), where ydf and zdf are the pandas counterparts of gdf1 and gdf2. (Planned for Future Release)
    x1  x2
    A   1

All of the summary functions listed under SUMMARIZE DATA can be applied to a group (see GROUP DATA). Additional GroupBy functions:

size()
    Size of each group.
agg(function)
    Aggregate group using function.

The examples below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame.
Planned for Future Release

shift(1)
    Copy with values shifted by 1.
shift(-1)
    Copy with values lagged by 1.
rank(method='dense')
    Ranks with no gaps.
rank(method='min')
    Ranks. Ties get min rank.
rank(pct=True)
    Ranks rescaled to interval [0, 1].
rank(method='first')
    Ranks. Ties go to first value.
cumsum()
    Cumulative sum.
cummax()
    Cumulative max.
cummin()
    Cumulative min.
cumprod()
    Cumulative product.

WINDOWS (Planned for Future Release)

df.expanding()
    Return an Expanding object allowing summary functions to be applied cumulatively.
df.rolling(n)
    Return a Rolling object allowing summary functions to be applied to windows of length n.

ONE-HOT ENCODING

cuDF can convert pandas category data types into one-hot encoded or dummy variables easily.

import pandas as pd
import cudf

pet_owner = [1, 2, 3, 4, 5]
pet_type = ['fish', 'dog', 'fish', 'bird', 'fish']
df = pd.DataFrame({'pet_owner': pet_owner, 'pet_type': pet_type})
df.pet_type = df.pet_type.astype('category')

my_gdf = cudf.DataFrame.from_pandas(df)
my_gdf['pet_codes'] = my_gdf.pet_type.cat.codes

codes = my_gdf.pet_codes.unique()
enc_gdf = my_gdf.one_hot_encoding('pet_codes', 'pet_dummy', codes)
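To see what the one-hot encoding steps produce for the pet example, the same transformation can be traced in plain Python. This sketch does not call cuDF. It assumes that categories are numbered in sorted order, as pandas does for string categories, and that each output column is named by joining the prefix and the code with an underscore; that naming is an assumption about one_hot_encoding, not something stated in this cheat sheet.

```python
# Plain-Python sketch of the encoding steps; it does not call cuDF.
pet_type = ["fish", "dog", "fish", "bird", "fish"]

# Assumption: categories are the sorted unique values and each value's code is
# its position in that list, which is how pandas orders string categories.
categories = sorted(set(pet_type))
pet_codes = [categories.index(p) for p in pet_type]  # one integer code per row

# One indicator column per distinct code. The "pet_dummy_<code>" naming is an
# assumption about how the prefix argument is combined with each code.
codes = sorted(set(pet_codes))
encoded = {
    f"pet_dummy_{code}": [1 if c == code else 0 for c in pet_codes]
    for code in codes
}

for name, column in encoded.items():
    print(name, column)
```

Each row has exactly one indicator set, so the three columns carry the same information as the single pet_codes column in a form that models expecting numeric inputs can use.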
This cheat sheet was inspired by the RStudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). Written by Irv Lustig, Princeton Consultants.
