www.RAPIDS.ai
TIDY DATA: A foundation for wrangling in pandas

In a tidy data set, each variable is saved in its own column and each
observation is saved in its own row. Tidy data complements pandas'
vectorized operations: pandas will automatically preserve observations as
you manipulate variables. No other format works as intuitively.

INGESTING AND RESHAPING DATA: Change the layout of a data set

gdf = cudf.read_csv(filename, delimiter=",", names=col_names, dtype=col_types)
    Read a CSV file into a GPU DataFrame.

gdf.sort_values('mpg')
    Order rows by values of a column (low to high).

gdf.sort_values('mpg', ascending=False)
    Order rows by values of a column (high to low).

df.rename(columns={'y': 'year'})
    Rename the columns of a DataFrame.
    Planned for Future Release

df.pivot(columns='var', values='val')
    Spread rows into columns.
    Planned for Future Release

gdf.sort_index()
    Sort the index of a DataFrame.

gdf.set_index()
    Return a new DataFrame with a new index.

gdf.drop_column('Length')
    Drop a column from a DataFrame.
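cuDF largely mirrors the pandas API, so the reshaping calls above can be tried on CPU with pandas itself. A minimal sketch with hypothetical sample data (the `car`/`mpg` columns are illustrative, not from the original):

```python
import pandas as pd

# Hypothetical sample table of cars and their mpg values.
df = pd.DataFrame({"car": ["A", "B", "C"], "mpg": [21.0, 33.9, 14.3]})

# Order rows by values of a column (low to high / high to low).
low_to_high = df.sort_values("mpg")
high_to_low = df.sort_values("mpg", ascending=False)

# Rename a column.
renamed = df.rename(columns={"mpg": "miles_per_gallon"})

# Spread rows into columns with pivot: long format -> wide format.
long_df = pd.DataFrame({
    "car": ["A", "A", "B", "B"],
    "var": ["mpg", "hp", "mpg", "hp"],
    "val": [21.0, 110.0, 33.9, 65.0],
})
wide = long_df.pivot(index="car", columns="var", values="val")
```

Replacing `pd.DataFrame` with `cudf.DataFrame` should give the same results for the operations cuDF already supports.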
SUMMARIZE DATA

df['w'].value_counts()
    Count number of rows with each unique value of variable.

len(gdf)
    # of rows in DataFrame.

gdf['w'].unique_count()
    # of distinct values in a column.

df.describe()
    Basic descriptive statistics for each column.
    Planned for Future Release

HANDLING MISSING DATA

df.dropna()
    Drop rows with any column having NA/null data.

gdf['length'].fillna(value)
    Replace all NA/null data with value.
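The summarizing and missing-data calls above can be sketched with pandas on CPU. Note one assumption: cuDF's `unique_count()` is taken here to correspond to pandas' `nunique()`, based on the two APIs' names.

```python
import pandas as pd

# Hypothetical column with a missing value.
df = pd.DataFrame({"w": ["a", "b", "a", None, "a"]})

# Count rows per unique value (NA excluded by default).
counts = df["w"].value_counts()

# Number of distinct values (pandas' nunique, assumed equivalent
# to cuDF's unique_count).
n_distinct = df["w"].nunique()

# Replace NA/null data with a value, or drop rows containing NA.
filled = df["w"].fillna("missing")
dropped = df.dropna()
```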
MAKE NEW COLUMNS

df.assign(Area=lambda df: df.Length*df.Height)
    Compute and append one or more new columns.
    Planned for Future Release

gdf['Volume'] = gdf.Length*gdf.Height*gdf.Depth
    Add single column.

pd.qcut(df.col, n, labels=False)
    Bin column into n buckets.
    Planned for Future Release

SUMMARY FUNCTIONS

cuDF provides a set of summary functions that operate on different kinds of
objects (DataFrame columns, Series, GroupBy) and produce single values for
each of the groups. When applied to a DataFrame, the result is returned as
a Series for each column. Examples:

sum()
    Sum values of each object.
count()
    Count non-NA/null values of each object.
median()
    Median value of each object.
    Planned for Future Release
quantile([0.25, 0.75])
    Quantiles of each object.
applymap(function)
    Apply function to each object.
min()
    Minimum value in each object.
max()
    Maximum value in each object.
mean()
    Mean value of each object.
var()
    Variance of each object.
std()
    Standard deviation of each object.

VECTOR FUNCTIONS

cuDF provides a large set of vector functions that operate on all columns of
a DataFrame or on a single selected column (a cuDF Series). These functions
produce vectors of values for each of the columns, or a single Series for an
individual Series. Examples:

max(axis=1)
    Element-wise max.
min(axis=1)
    Element-wise min.
clip(lower=-10, upper=10)
    Trim values at input thresholds.
abs()
    Absolute value.

STANDARD JOINS

gdf1.merge(gdf2, how='left', on='x1')
    Join matching rows from gdf2 to gdf1.

gdf1.merge(gdf2, how='right', on='x1')
    Join matching rows from gdf1 to gdf2.

gdf1.merge(gdf2, how='inner', on='x1')
    Join data. Retain only rows in both sets.

gdf1.merge(gdf2, how='outer', on='x1')
    Join data. Retain all values, all rows.

APPLY ROW FUNCTIONS

Define a kernel function:
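The column-creation and join entries above can be exercised with pandas on CPU; cuDF's merge follows the same `how`/`on` semantics. The `Length`/`Height`/`Depth` and `x1`/`x2`/`x3` data below are hypothetical sample values:

```python
import pandas as pd

# Add columns: assign appends computed columns; direct assignment adds one.
df = pd.DataFrame({"Length": [2.0, 3.0], "Height": [4.0, 5.0], "Depth": [1.0, 2.0]})
df = df.assign(Area=lambda d: d.Length * d.Height)
df["Volume"] = df.Length * df.Height * df.Depth

# Bin a column into n buckets of roughly equal counts.
bins = pd.qcut(pd.Series([1, 2, 3, 4]), 2, labels=False)

# Standard joins on a shared key column.
gdf1 = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
gdf2 = pd.DataFrame({"x1": ["B", "C", "D"], "x3": [True, False, True]})
left = gdf1.merge(gdf2, how="left", on="x1")    # keep all rows of gdf1
inner = gdf1.merge(gdf2, how="inner", on="x1")  # rows whose key is in both
outer = gdf1.merge(gdf2, how="outer", on="x1")  # all rows from both
```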
>>> def kernel(in1, in2, in3, out1, out2, extra1, extra2):
...     for i, (x, y, z) in enumerate(zip(in1, in2, in3)):
...         out1[i] = extra2 * x - extra1 * y
...         out2[i] = y - extra1 * z

Call the kernel with apply_rows:

>>> outdf = gdf.apply_rows(kernel,
...                        incols=['in1', 'in2', 'in3'],
...                        outcols=dict(out1=np.float64,
...                                     out2=np.float64),
...                        kwargs=dict(extra1=2.3, extra2=3.4))

GROUP DATA

gdf.groupby("col")
    Return a GroupBy object, grouped by values in column named "col".

df.groupby(level="ind")
    Return a GroupBy object, grouped by values in index level named "ind".
    Planned for Future Release

FILTERING JOINS

adf[adf.x1.isin(bdf.x1)]
    All rows in adf that have a match in bdf.
    Planned for Future Release

adf[~adf.x1.isin(bdf.x1)]
    All rows in adf that do not have a match in bdf.
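The kernel passed to `apply_rows` is plain Python over per-row values, so its semantics can be checked on CPU with numpy before running it on the GPU; the input arrays below are hypothetical. The filtering-join idiom is likewise plain pandas:

```python
import numpy as np
import pandas as pd

# CPU sketch of what the apply_rows kernel above computes, using the
# same formulas: out1 = extra2*x - extra1*y, out2 = y - extra1*z.
def kernel(in1, in2, in3, out1, out2, extra1, extra2):
    for i, (x, y, z) in enumerate(zip(in1, in2, in3)):
        out1[i] = extra2 * x - extra1 * y
        out2[i] = y - extra1 * z

in1 = np.array([1.0, 2.0])
in2 = np.array([3.0, 4.0])
in3 = np.array([5.0, 6.0])
out1 = np.empty_like(in1)
out2 = np.empty_like(in1)
kernel(in1, in2, in3, out1, out2, extra1=2.3, extra2=3.4)

# Filtering joins via isin: keep (or exclude) rows of adf whose key
# appears in bdf.
adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["B", "C", "D"]})
semi = adf[adf.x1.isin(bdf.x1)]    # rows with a match in bdf
anti = adf[~adf.x1.isin(bdf.x1)]   # rows without a match in bdf
```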
All of the summary functions listed above can be applied to a group.
Additional GroupBy functions:

size()
    Size of each group.
agg(function)
    Aggregate group using function.

The examples below can also be applied to groups. In this case, the function
is applied on a per-group basis, and the returned vectors are of the length
of the original DataFrame.
Planned for Future Release

shift(1)
    Copy with values shifted by 1.
shift(-1)
    Copy with values lagged by 1.
rank(method='dense')
    Ranks with no gaps.
rank(method='min')
    Ranks. Ties get min rank.
rank(pct=True)
    Ranks rescaled to interval [0, 1].
rank(method='first')
    Ranks. Ties go to first value.
cumsum()
    Cumulative sum.
cummax()
    Cumulative max.
cummin()
    Cumulative min.
cumprod()
    Cumulative product.

WINDOWS

df.expanding()
    Return an Expanding object allowing summary functions to be applied
    cumulatively.
    Planned for Future Release

df.rolling(n)
    Return a Rolling object allowing summary functions to be applied to
    windows of length n.
    Planned for Future Release

ONE-HOT ENCODING

cuDF can easily convert pandas category data types into one-hot encoded or
dummy variables:

pet_owner = [1, 2, 3, 4, 5]
pet_type = ['fish', 'dog', 'fish', 'bird', 'fish']
df = pd.DataFrame({'pet_owner': pet_owner, 'pet_type': pet_type})
df.pet_type = df.pet_type.astype('category')
my_gdf = cudf.DataFrame.from_pandas(df)
my_gdf['pet_codes'] = my_gdf.pet_type.cat.codes
codes = my_gdf.pet_codes.unique()
enc_gdf = my_gdf.one_hot_encoding('pet_codes', 'pet_dummy', codes)

SET-LIKE OPERATIONS

Given two DataFrames with the same columns:

gdf1:  x1  x2        gdf2:  x1  x2
       A   1                B   2
       B   2                C   3
       C   3                D   4

gdf1.merge(gdf2, how='inner')
    Rows that appear in both gdf1 and gdf2 (intersection): B 2, C 3.

gdf1.merge(gdf2, how='outer')
    Rows that appear in either or both gdf1 and gdf2 (union):
    A 1, B 2, C 3, D 4.

pd.merge(gdf1, gdf2, how='outer', indicator=True)
    .query('_merge == "left_only"')
    .drop(columns=['_merge'])
    Rows that appear in gdf1 but not gdf2 (set difference): A 1.
    Planned for Future Release
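On CPU, the one-hot encoding example and the set-like operations can be reproduced with pandas alone; `pd.get_dummies` stands in here for cuDF's `one_hot_encoding`, which is an assumed equivalence, and the frames match the `gdf1`/`gdf2` tables above:

```python
import pandas as pd

# One-hot encoding on CPU: category codes plus get_dummies (assumed
# CPU counterpart of cuDF's one_hot_encoding).
df = pd.DataFrame({
    "pet_owner": [1, 2, 3, 4, 5],
    "pet_type": ["fish", "dog", "fish", "bird", "fish"],
})
df["pet_type"] = df["pet_type"].astype("category")
df["pet_codes"] = df["pet_type"].cat.codes
dummies = pd.get_dummies(df["pet_type"], prefix="pet_dummy")

# Set-like operations expressed as joins on all shared columns.
gdf1 = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
gdf2 = pd.DataFrame({"x1": ["B", "C", "D"], "x2": [2, 3, 4]})
intersection = gdf1.merge(gdf2, how="inner")
union = gdf1.merge(gdf2, how="outer")
setdiff = (gdf1.merge(gdf2, how="outer", indicator=True)
               .query('_merge == "left_only"')
               .drop(columns=["_merge"]))
```

Categories are ordered alphabetically (bird, dog, fish), so `pet_codes` comes out as 2, 1, 2, 0, 2.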
This cheat sheet was inspired by the RStudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). Written by Irv Lustig, Princeton Consultants.