a abordagem tidyverse
Atividade
1. Coletar dados
19% 21% 2. Limpar e organizar dados
50 60% 0/100 50 57% 0/100 3. Minerar de dados
5% 5%
4% 4% 4. Construir dados de treino
3% 10%
9% 3% 5. Refinar algorítmos
6. Outros
25 25
library(tidyverse) 1
ls("package:tidyverse") 2
tidyverse_packages() 1
tibble
readr
dplyr
forcats
stringr
purrr
I Aperfeiçoamento do data.frame.
I A classe tibble.
I Formas ágeis de criar tibbles.
I Formas ágeis de modificar objetos das classes.
I Método print mais enxuto e informativo.
Both drop_na(data, ...) fill(data, ..., .direction = c("down", "up")) replace_na(data, C 2000 213K
Construct by columns. make this replace = list(), ...)
C 2000 1T
Drop rows containing Fill in NA’s in … columns with most
tibble(x = 1:3, y = c("a", "b", "c")) tibble NA’s in … columns. recent non-NA values. Replace NA’s by column. separate_rows(table3, rate)
x x x
tribble(…)
unite(data, col, ..., sep = "_", remove = TRUE)
A tibble: 3 × 2 x1 x2 x1 x2 x1 x2 x1 x2 x1 x2 x1 x2
Construct by rows. x y A 1 A 1 A 1 A 1 A 1 A 1
tribble( ~x, ~y, <int> <chr> B NA D 3 B NA B 1 B NA B 2
Collapse cells across several columns to
1 1 a C NA C NA C 1 C NA C 2
1, "a", 2 2 b D 3 D 3 D 3 D 3 D 3 make a single column.
2, "b", 3 3 c E NA E NA E 3 E NA E 2
table5
3, "c")
drop_na(x, x2) fill(x, x2) replace_na(x, list(x2 = 2)) country century year country year
as_tibble(x, …) Convert data frame to tibble. Afghan 19 99 Afghan 1999
enframe(x, name = "name", value = "value") Expand Tables - quickly create tables with combinations of values Afghan
Brazil
20
19
0
99
Afghan
Brazil
2000
1999
Convert named vector to a tibble Brazil 20 0 Brazil 2000
complete(data, ..., fill = list()) expand(data, ...) China 19 99 China 1999
is_tibble(x) Test whether x is a tibble. China 20 0 China 2000
Adds to the data missing combinations of the Create new tibble with all possible combinations
values of the variables listed in … of the values of the variables listed in … unite(table5, century, year,
complete(mtcars, cyl, gear, carb) expand(mtcars, cyl, gear, carb) col = "year", sep = "")
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at tidyverse.org • readr 1.1.0 • tibble 1.2.12 • tidyr 0.6.0 • Updated: 2017-01
Summarise Cases
w
www
ww criteria. filter(iris, Sepal.Length > 7)
w
www pull(iris, Sepal.Length)
weight = NULL, .env = parent.frame()) Randomly Use these helpers with select (),
summary function select fraction of rows.
e.g. select(iris, starts_with("Sepal"))
summarise(.data, …)
Compute table of summaries.
w
www
ww sample_frac(iris, 0.5, replace = TRUE)
sample_n(tbl, size, replace = FALSE, weight =
contains(match) num_range(prefix, range) :, e.g. mpg:cyl
ends_with(match) one_of(…) -, e.g, -Species
w
ww summarise(mtcars, avg = mean(mpg))
count(x, ..., wt = NULL, sort = FALSE)
NULL, .env = parent.frame()) Randomly select
size rows. sample_n(iris, 10, replace = TRUE)
matches(match) starts_with(match)
Count number of rows in each group defined slice(.data, …) Select rows by position. MAKE NEW VARIABLES
slice(iris, 10:15)
by the variables in … Also tally().
w
ww count(iris, Species)
w
www
ww top_n(x, n, wt) Select and order top n entries (by
group if grouped data). top_n(iris, 5, Sepal.Width)
These apply vectorized functions to columns. Vectorized funs take
vectors as input and return vectors of the same length as output
(see back).
VARIATIONS vectorized function
summarise_all() - Apply funs to every column.
summarise_at() - Apply funs to specific columns. mutate(.data, …)
summarise_if() - Apply funs to all cols of one type. Logical and boolean operators to use with filter() Compute new column(s).
<
>
<=
>=
is.na()
!is.na()
%in%
!
|
&
xor() w
wwww
w mutate(mtcars, gpm = 1/mpg)
transmute(.data, …)
Group Cases See ?base::logic and ?Comparison for help. Compute new column(s), drop others.
Use group_by() to create a "grouped" copy of a table.
dplyr functions will manipulate each "group" separately and
w
ww transmute(mtcars, gpm = 1/mpg)
mtcars %>%
arrange(.data, …) Order rows by values of a
column or columns (low to high), use with
w
www mutate_all(faithful, funs(log(.), log2(.)))
mutate_if(iris, is.numeric, funs(log(.)))
w
www
ww group_by(cyl) %>% w
www
ww desc() to order from high to low.
arrange(mtcars, mpg)
mutate_at(.tbl, .cols, .funs, …) Apply funs to
Figura 13. Cartão de referência de operações em dados com tabulares com dplyr.
Walmes Zeviani · UFPR Manipulação e Visualização de Dados 37
Vector Functions Summary Functions Combine Tables
TO USE WITH MUTATE () TO USE WITH SUMMARISE () COMBINE VARIABLES COMBINE CASES dplyr
mutate() and transmute() apply vectorized summarise() applies summary functions to x y
functions to columns to create new columns. columns to create a new table. Summary A B C A B D A B C A B D A B C
A B C
vectorized function summary function Use bind_cols() to paste tables beside each
other as they are. + y
C v 3
d w 4
+ =
A B.x C B.y D
a t 1 a t 3
MISC
a t 1 t 3
specify one or more common b u 2 b u 2
Row Names
b u 2 u 2
c v 3 NA NA columns to match on. c v 3 d w 1
dplyr::case_when() - multi-case if_else() left_join(x, y, by = "A")
dplyr::coalesce() - first non-NA values by
element across a set of vectors Tidy data does not use rownames, which store a
variable outside of the columns. To work with the A.x B.x C A.y B.y Use a named vector, by = c("col1" = Use a "Filtering Join" to filter one table against
dplyr::if_else() - element-wise if() + else() rownames, first move them into a column. a t 1 d w
"col2"), to match on columns that the rows of another.
dplyr::na_if() - replace specific values with NA b u 2 b u
have different names in each table.
C A B c v 3 a t
pmax() - element-wise max() rownames_to_column() left_join(x, y, by = c("C" = "D")) semi_join(x, y, by = NULL, …)
A B A B C
pmin() - element-wise min() 1 a t 1 a t Move row names into col. a t 1 Return rows of x that have a match in y.
dplyr::recode() - Vectorized switch() 2 b u 2 b u a <- rownames_to_column(iris, var A1 B1 C A2 B2 Use suffix to specify the suffix to b u 2 USEFUL TO SEE WHAT WILL BE JOINED.
dplyr::recode_factor() - Vectorized switch()
3 c v 3 c v
= "C") a t 1 d w give to unmatched columns that
for factors b u 2 b u
have the same name in both tables. A B C anti_join(x, y, by = NULL, …)
c v 3 a t
A B C A B column_to_rownames() left_join(x, y, by = c("C" = "D"), suffix = c v 3 Return rows of x that do not have a
1 a t 1 a t
Move col in row names. c("1", "2")) match in y. USEFUL TO SEE WHAT WILL
2 b u 2 b u
3 c v 3 c v column_to_rownames(a, var = "C") NOT BE JOINED.
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2017-03
Figura 14. Cartão de referência de operações em dados com tabulares com dplyr.
Walmes Zeviani · UFPR Manipulação e Visualização de Dados 38
Programação funcional com purrr
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at purrr.tidyverse.org • purrr 0.2.3 • Updated: 2017-09
1 2 3
Sepal.L Sepal.W Petal.L Petal.W
individual tables within the 5.1 3.5 1.4 0.2 Make a list Work with Simplify
cells of a larger, organizing 4.9 3.0 1.4 0.2
column list columns the list
4.7 3.2 1.3 0.2 S.L S.W P.L P.W
table. 4.6 3.1 1.5 0.2 Species S.L S.W P.L P.W 5.1 3.5 1.4 0.2
Call: column
lm(S.L ~ ., df)
setosa 5.1 3.5 1.4 0.2 4.9 3.0 1.4 0.2
5.0 3.6 1.4 0.2 Coefs:
setosa 4.9 3.0 1.4 0.2 4.7 3.2 1.3 0.2
(Int) S.W P.L P.W
n_iris$data[[1]] setosa 4.7 3.2 1.3 0.2 4.6 3.1 1.5 0.2 2.3 0.6 0.2 0.2
setosa 4.6 3.1 1.5 0.2
Species data S.L S.W P.L P.W
nested data frame Sepal.L Sepal.W Petal.L Petal.W versi 7.0 3.2 4.7 1.4
setos <tibble [50x4]> 7.0 3.2 4.7 1.4
Species data model Call: Species beta
versi 6.4 3.2 4.5 1.5 setosa <tibble [50x4]> <S3: lm> lm(S.L ~ ., df) setos 2.35
Species data 7.0 3.2 4.7 1.4 versi <tibble [50x4]> 6.4 3.2 4.5 1.5
versi 6.9 3.1 4.9 1.5 versi <tibble [50x4]> <S3: lm> Coefs: versi 1.89
virgini <tibble [50x4]> 6.9 3.1 4.9 1.5
setosa <tibble [50 x 4]> 6.4 3.2 4.5 1.5 versi 5.5 2.3 4.0 1.3 virgini <tibble [50x4]> <S3: lm> (Int) S.W P.L P.W virgini 0.69
5.5 2.3 4.0 1.3 1.8 0.3 0.9 -0.6
versicolor <tibble [50 x 4]> 6.9 3.1 4.9 1.5 virgini 6.3 3.3 6.0 2.5
virginica <tibble [50 x 4]> 5.5 2.3 4.0 1.3 virgini 5.8 2.7 5.1 1.9 S.L S.W P.L P.W
virgini 7.1 3.0 5.9 2.1 Call:
n_iris 6.5 2.8 4.6 1.5 6.3 3.3 6.0 2.5
lm(S.L ~ ., df)
virgini 6.3 2.9 5.6 1.8 5.8 2.7 5.1 1.9
n_iris$data[[2]] 7.1 3.0 5.9 2.1 Coefs:
(Int) S.W P.L P.W
6.3 2.9 5.6 1.8 0.6 0.3 0.9 -0.1
Sepal.L Sepal.W Petal.L Petal.W n_iris <- iris %>% mod_fun <- function(df) b_fun <- function(mod)
Use a nested data frame to: 6.3 3.3 6.0 2.5 group_by(Species) %>% lm(Sepal.Length ~ ., data = df) coefficients(mod)[[1]]
5.8 2.7 5.1 1.9 nest()
• preserve relationships 7.1 3.0 5.9 2.1 m_iris <- n_iris %>% m_iris %>% transmute(Species,
between observations and 6.3 2.9 5.6 1.8 mutate(model = map(data, mod_fun)) beta = map_dbl(model, b_fun))
subsets of data 6.5 3.0 5.8 2.2
n_iris$data[[3]]
• manipulate many sub-tables 1. MAKE A LIST COLUMN - You can create list columns with functions in the tibble and dplyr packages, as well as tidyr’s nest()
at once with the purrr functions map(), map2(), or pmap().
tibble::tribble(…) tibble::tibble(…) dplyr::mutate(.data, …) Also transmute()
Makes list column when needed Saves list input as list columns Returns list col when result returns list.
Use a two step process to create a nested data frame: tribble( ~max, ~seq, max seq tibble(max = c(3, 4, 5), seq = list(1:3, 1:4, 1:5)) mtcars %>% mutate(seq = map(cyl, seq))
1. Group the data frame into groups with dplyr::group_by() 3, 1:3, 3 <int [3]>
with one row per group S.L S.W P.L P.W 5, 1:5)
5 <int [5]>
tibble::enframe(x, name="name", value="value") dplyr::summarise(.data, …)
5.1 3.5 1.4 0.2 Converts multi-level list to tibble with list cols Returns list col when result is wrapped with list()
Species S.L S.W P.L P.W
setosa 5.1 3.5 1.4 0.2
Species
setosa
S.L S.W P.L P.W
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
enframe(list('3'=1:3, '4'=1:4, '5'=1:5), 'max', 'seq') mtcars %>% group_by(cyl) %>%
setosa 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 4.6 3.1 1.5 0.2 summarise(q = list(quantile(mpg)))
setosa 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 5.0 3.6 1.4 0.2
setosa 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2
setosa 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 S.L S.W P.L P.W 2. WORK WITH LIST COLUMNS - Use the purrr functions map(), map2(), and pmap() to apply a function that returns a result element-wise
versi
versi
7.0
6.4
3.2 4.7
3.2 4.5
1.4
1.5
versi
versi
7.0
6.4
3.2
3.2
4.7
4.5
1.4
1.5
Species data
setos <tibble [50x4]>
7.0
6.4
3.2 4.7
3.2 4.5
1.4
1.5 to the cells of a list column. walk(), walk2(), and pwalk() work the same way, but return a side effect.
versi versi 6.9 3.1 4.9 1.5 6.9 3.1 4.9 1.5
purrr::map(.x, .f, ...)
6.9 3.1 4.9 1.5 versi <tibble [50x4]>
versi versi 5.5 2.3 4.0 1.3 5.5 2.3 4.0 1.3 data
fun( , …)
5.5 2.3 4.0 1.3 virgini <tibble [50x4]> data result
versi
virgini
6.5 2.8 4.6 1.5 versi
virgini
6.5
6.3
2.8
3.3
4.6
6.0
1.5
2.5
6.5 2.8 4.6 1.5
Apply .f element-wise to .x as .f(.x) map( <tibble [50x4]>
, fun, …) fun(
<tibble [50x4]>
, …)
result 1
6.3 3.3 6.0 2.5 <tibble [50x4]> <tibble [50x4]> result 2
virgini 5.8 2.7 5.1 1.9 virgini 5.8 2.7 5.1 1.9 S.L S.W P.L P.W n_iris %>% mutate(n = map(data, dim)) <tibble [50x4]> fun( <tibble [50x4]> , …) result 3
virgini 7.1 3.0 5.9 2.1 virgini 7.1 3.0 5.9 2.1 6.3 3.3 6.0 2.5
virgini 6.3 2.9 5.6 1.8 virgini 6.3 2.9 5.6 1.8 5.8 2.7 5.1 1.9 purrr::map2(.x, .y, .f, ...) data model
virgini 6.5 3.0 5.8 2.2 virgini 6.5 3.0 5.8 2.2 7.1 3.0 5.9 2.1
Apply .f element-wise to .x and .y as .f(.x, .y)
data model
fun( <tibble [50x4]> , <S3: lm> ,…) result
map2( , , fun, …)
<tibble [50x4]> <S3: lm> result 1
6.3 2.9 5.6 1.8 fun( <tibble [50x4]> , <S3: lm> ,…)
n_iris <- iris %>% group_by(Species) %>% nest() 6.5 3.0 5.8 2.2 m_iris %>% mutate(n = map2(data, model, list)) <tibble [50x4]>
<tibble [50x4]>
<S3: lm>
<S3: lm> fun( <tibble [50x4]> , <S3: lm> ,…)
result 2
result 3
Unnest a nested data frame Species data Species S.L S.W P.L P.W
setos <tibble [50x4]> setosa 5.1 3.5 1.4 0.2
with unnest(): versi <tibble [50x4]> setosa 4.9 3.0 1.4 0.2 3. SIMPLIFY THE LIST COLUMN (into a regular column)
virgini <tibble [50x4]> setosa 4.7 3.2 1.3 0.2
n_iris %>% unnest() setosa 4.6 3.1 1.5 0.2
Use the purrr functions map_lgl(), purrr::map_lgl(.x, .f, ...) purrr::map_dbl(.x, .f, ...)
versi 7.0 3.2 4.7 1.4
tidyr::unnest(data, ..., .drop = NA, .id=NULL, .sep=NULL) versi 6.4 3.2 4.5 1.5 map_int(), map_dbl(), map_chr(), Apply .f element-wise to .x, return a logical vector Apply .f element-wise to .x, return a double vector
versi 6.9 3.1 4.9 1.5
as well as tidyr’s unnest() to reduce n_iris %>% transmute(n = map_lgl(data, is.matrix)) n_iris %>% transmute(n = map_dbl(data, nrow))
Unnests a nested data frame. versi 5.5 2.3 4.0 1.3
a list column into a regular column. purrr::map_chr(.x, .f, ...)
virgini
virgini
6.3
5.8
3.3
2.7
6.0
5.1
2.5
1.9 purrr::map_int(.x, .f, ...)
virgini 7.1 3.0 5.9 2.1 Apply .f element-wise to .x, return an integer vector Apply .f element-wise to .x, return a character vector
virgini 6.3 2.9 5.6 1.8
n_iris %>% transmute(n = map_int(data, nrow)) n_iris %>% transmute(n = map_chr(data, nrow))
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at purrr.tidyverse.org • purrr 0.2.3 • Updated: 2017-09
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at http://ggplot2.tidyverse.org • ggplot2 3.1.0 • Updated: 2018-12
+ =
x ..count.. (n <- d + geom_bar(aes(fill = fl))) The default cartesian coordinate system discrete variables.
aesthetic prepackaged scale-specific r + coord_fixed(ratio = 1/2)
scale_ to adjust scale to use arguments ratio, xlim, ylim
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
data stat geom coordinate plot Cartesian coordinates with fixed aspect ratio
x = x ·
system n + scale_fill_manual( between x and y units
y = ..count.. values = c("skyblue", "royalblue", "blue", “navy"), r + coord_flip()
t + facet_grid(cols = vars(fl))
Visualize a stat by changing the default stat of a geom limits = c("d", "e", "p", "r"), breaks =c("d", "e", "p", “r"), xlim, ylim
facet into columns based on fl
name = "fuel", labels = c("D", "E", "P", "R")) Flipped Cartesian coordinates
function, geom_bar(stat="count") or by using a stat t + facet_grid(rows = vars(year))
function, stat_count(geom="bar"), which calls a default range of labels to use breaks to use in
r + coord_polar(theta = "x", direction=1 )
facet into rows based on year
title to use in theta, start, direction
values to include legend/axis in legend/axis legend/axis
geom to make a layer (equivalent to a geom function). in mapping Polar coordinates t + facet_grid(rows = vars(year), cols = vars(fl))
Use ..name.. syntax to map stat variables to aesthetics. r + coord_trans(ytrans = “sqrt")
facet into both rows and columns
xtrans, ytrans, limx, limy
GENERAL PURPOSE SCALES Transformed cartesian coordinates. Set xtrans and t + facet_wrap(vars(fl))
wrap facets into a rectangular layout
geom to use stat function geommappings ytrans to the name of a window function.
Use with most aesthetics
i + stat_density2d(aes(fill = ..level..), Set scales to let axis limits vary across facets
geom = "polygon") scale_*_continuous() - map cont’ values to visual ones 60
π + coord_quickmap()
variable created by stat scale_*_discrete() - map discrete values to visual ones π + coord_map(projection = "ortho", t + facet_grid(rows = vars(drv), cols = vars(fl),
lat
scale_*_identity() - use data values as visual ones orientation=c(41, -74, 0))projection, orienztation, scales = "free")
xlim, ylim x and y axis limits adjust to individual facets
c + stat_bin(binwidth = 1, origin = 10)
scale_*_manual(values = c()) - map discrete values to long
Themes
Set legend type for each aesthetic: colorbar, legend, or
ggplot() + stat_function(aes(x = -3:3), n = 99, fun = o + scale_fill_gradient2(low="red", high=“blue", none (no legend)
dnorm, args = list(sd=0.5)) x | ..x.., ..y.. mid = "white", midpoint = 25) n + scale_fill_discrete(name = "Title",
labels = c("A", "B", "C", "D", "E"))
e + stat_identity(na.rm = TRUE) o + scale_fill_gradientn(colours=topo.colors(6)) r + theme_bw()
r + theme_classic() Set legend title and labels with a scale function.
ggplot() + stat_qq(aes(sample=1:100), dist = qt, White background
Also: rainbow(), heat.colors(), terrain.colors(), with grid lines r + theme_light()
dparam=list(df=5)) sample, x, y | ..sample.., ..theoretical..
Zooming
cm.colors(), RColorBrewer::brewer.pal() r + theme_gray()
r + theme_linedraw()
e + stat_sum() x, y, size | ..n.., ..prop.. Grey background
r + theme_minimal()
e + stat_summary(fun.data = "mean_cl_boot") SHAPE AND SIZE SCALES (default theme) Minimal themes
r + theme_dark()
r + theme_void()
Without clipping (preferred)
h + stat_summary_bin(fun.y = "mean", geom = "bar") p <- e + geom_point(aes(shape = fl, size = cyl)) dark for contrast
p + scale_shape() + scale_size() Empty theme t + coord_cartesian(
e + stat_unique() xlim = c(0, 100), ylim = c(10, 20))
p + scale_shape_manual(values = c(3:7))
With clipping (removes unseen data points)
t + xlim(0, 100) + ylim(10, 20)
p + scale_radius(range = c(1,6))
p + scale_size_area(max_size = 6) t + scale_x_continuous(limits = c(0, 100)) +
scale_y_continuous(limits = c(0, 100))
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at http://ggplot2.tidyverse.org • ggplot2 3.1.0 • Updated: 2018-12
u %>% str_subset("^theme_") 1
u %>% str_subset("^(scale|coord)_") 1
0 str_count(string, pattern) Count the number str_extract(string, pattern) Return the first str_trunc(string, width, side = c("right", "left",
3 of matches in a string. NA pattern match found in each string, as a vector. "center"), ellipsis = "...") Truncate the width of
1
2 str_count(fruit, "a") Also str_extract_all to return every pattern strings, replacing content with ellipsis.
match. str_extract(fruit, "[aeiou]") str_trunc(fruit, 3)
str_locate(string, pattern) Locate the
start end
2 4
4 7 positions of pattern matches in a string. Also str_match(string, pattern) Return the first str_trim(string, side = c("both", "left", "right"))
NA NA str_locate_all. str_locate(fruit, "a") NA NA
pattern match found in each string, as a Trim whitespace from the start and/or end of a
3 4
matrix with a column for each ( ) group in string. str_trim(fruit)
pattern. Also str_match_all.
str_match(sentences, "(a|the) ([^ ]+)")
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at stringr.tidyverse.org • Diagrams from @LVaudor ! • stringr 1.2.0 • Updated: 2017-10
"
Pattern arguments in stringr are interpreted as MATCH CHARACTERS see <- function(rx) str_view_all("abc ABC 123\t.!?\\(){}\n", rx)
regular expressions after any special characters [:blank:] .
have been parsed. string (type regexp matches example
this) (to mean this) (which matches this) space
In R, you write regular expressions as strings, a (etc.) a (etc.) see("a") abc ABC 123 .!?\(){} tab
sequences of characters surrounded by quotes \\. \. . see("\\.") abc ABC 123 .!?\(){}
("") or single quotes('').
\\! \! ! see("\\!") abc ABC 123 .!?\(){} [:graph:]
Some characters cannot be represented directly \\? \? ? see("\\?") abc ABC 123 .!?\(){}
in an R string . These must be represented as \\\\ \\ \ see("\\\\") abc ABC 123 .!?\(){} [:punct:]
special characters, sequences of characters that \\( \( ( see("\\(") abc ABC 123 .!?\(){}
have a specific meaning., e.g. . , : ; ? ! \ | / ` = * + - ^
\\) \) ) see("\\)") abc ABC 123 .!?\(){}
Special Character Represents \\{ \{ { see("\\{") abc ABC 123 .!?\(){} _ ~ " ' [ ] { } ( ) < > @# $
\\ \ \\} \} } see( "\\}") abc ABC 123 .!?\(){}
\" " \\n \n new line (return) see("\\n") abc ABC 123 .!?\(){} [:alnum:]
\n new line \\t \t tab see("\\t") abc ABC 123 .!?\(){}
Run ?"'" to see a complete list \\s \s any whitespace (\S for non-whitespaces) see("\\s") abc ABC 123 .!?\(){} [:digit:]
\\d \d any digit (\D for non-digits) see("\\d") abc ABC 123 .!?\(){}
0 1 2 3 4 5 6 7 8 9
Because of this, whenever a \ appears in a regular \\w \w any word character (\W for non-word chars) see("\\w") abc ABC 123 .!?\(){}
expression, you must write it as \\ in the string \\b \b word boundaries see("\\b") abc ABC 123 .!?\(){}
that represents the regular expression. [:digit:]
1
digits see("[:digit:]") abc ABC 123 .!?\(){} [:alpha:]
1
Use writeLines() to see how R views your string [:alpha:] letters see("[:alpha:]") abc ABC 123 .!?\(){} [:lower:] [:upper:]
1
after all special characters have been parsed. [:lower:] lowercase letters see("[:lower:]") abc ABC 123 .!?\(){}
[:upper:]
1
uppercase letters see("[:upper:]") abc ABC 123 .!?\(){} a b c d e f A B CD E F
writeLines("\\.") [:alnum:]
1
letters and numbers see("[:alnum:]") abc ABC 123 .!?\(){}
# \. g h i j k l GH I J K L
[:punct:] 1 punctuation see("[:punct:]") abc ABC 123 .!?\(){}
writeLines("\\ is a backslash") [:graph:] 1 letters, numbers, and punctuation see("[:graph:]") abc ABC 123 .!?\(){}
mn o p q r MNO PQ R
# \ is a backslash [:space:] 1
space characters (i.e. \s) see("[:space:]") abc ABC 123 .!?\(){} s t u vw x S TU V WX
[:blank:] 1 space and tab (but not new line) see("[:blank:]") abc ABC 123 .!?\(){} z Z
. every character except a new line see(".") abc ABC 123 .!?\(){}
INTERPRETATION 1 Many base R functions require classes to be wrapped in a second set of [ ], e.g. [[:digit:]]
Patterns in stringr are interpreted as regexs To
change this default, wrap the pattern in one of:
ALTERNATES alt <- function(rx) str_view_all("abcde", rx) QUANTIFIERS quant <- function(rx) str_view_all(".a.aa.aaa", rx)
regex(pattern, ignore_case = FALSE, multiline = example example
FALSE, comments = FALSE, dotall = FALSE, ...) regexp matches regexp matches
Modifies a regex to ignore cases, match end of ab|d or alt("ab|d") abcde a? zero or one quant("a?") .a.aa.aaa
lines as well of end of strings, allow R comments [abe] one of alt("[abe]") abcde a* zero or more quant("a*") .a.aa.aaa
within regex's , and/or to have . match everything a+ one or more quant("a+") .a.aa.aaa
including \n. [^abe] anything but alt("[^abe]") abcde
str_detect("I", regex("i", TRUE)) [a-c] range alt("[a-c]") abcde 1 2 ... n a{n} exactly n quant("a{2}") .a.aa.aaa
1 2 ... n a{n, } n or more quant("a{2,}") .a.aa.aaa
fixed() Matches raw bytes but will miss some n ... m a{n, m} between n and m quant("a{2,4}") .a.aa.aaa
characters that can be represented in multiple ANCHORS anchor <- function(rx) str_view_all("aaa", rx)
ways (fast). str_detect("\u0130", fixed("i")) regexp matches example
^a start of string anchor("^a") aaa GROUPS ref <- function(rx) str_view_all("abbaab", rx)
coll() Matches raw bytes and will use locale
specific collation rules to recognize characters a$ end of string anchor("a$") aaa Use parentheses to set precedent (order of evaluation) and create groups
that can be represented in multiple ways (slow).
regexp matches example
str_detect("\u0130", coll("i", TRUE, locale = "tr"))
(ab|d)e sets precedence alt("(ab|d)e") abcde
LOOK AROUNDS look <- function(rx) str_view_all("bacad", rx)
boundary() Matches boundaries between example
characters, line_breaks, sentences, or words. regexp matches Use an escaped number to refer to and duplicate parentheses groups that occur
str_split(sentences, boundary("word")) a(?=c) followed by look("a(?=c)") bacad earlier in a pattern. Refer to each group by its order of appearance
a(?!c) not followed by look("a(?!c)") bacad string regexp matches example
(?<=b)a preceded by look("(?<=b)a") bacad (type this) (to mean this) (which matches this) (the result is the same as ref("abba"))
(?<!b)a not preceded by look("(?<!b)a") bacad \\1 \1 (etc.) first () group, etc. ref("(a)(b)\\2\\1") abbaab
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at stringr.tidyverse.org • Diagrams from @LVaudor ! • stringr 1.2.0 • Updated: 2017-10
ls("package:forcats") 1
Factors stored displayed Change the order of levels Change the value of levels
R represents categorical integer 1 1= a a 1= a
data with factors. A factor vector 3 23 == bc c 23 == bc a 1= a a 1= b fct_relevel(.f, ..., after = 0L) a 1= a v 1= v fct_recode(.f, ...) Manually change
is an integer vector with a c 2= b c 2= c Manually reorder factor levels. c 2= b z 2= x levels. Also fct_relabel which obeys
2 b 3= c 3= a fct_relevel(f, c("b", "c", "a")) 3= c 3= z purrr::map syntax to apply a function
levels attribute that stores levels 1 a b b b x
a set of mappings between or expression to each level.
integers and categorical values. When you view a factor, R
a a a v fct_recode(f, v = "a", x = "b", z = "c")
displays not the integers, but the values associated with them. fct_infreq(f, ordered = NA) fct_relabel(f, ~ paste0("x", .x))
a 1= a a 1= a fct_unique(f) Return the a 1= a a 1= a fct_shuffle(f, n = 1L) Randomly a 1= a a 1= a fct_other(f, keep, drop, other_level =
2= b 2= b unique values, removing b 2= b
b 2= c permute order of factor levels. 2= b 2= b "Other") Replace levels with "other."
c c 3= c 3= b fct_shuffle(f4) c Other
b
3= c
b
3= c duplicates. fct_unique(f) c c b
3= c
b
3 = Other fct_other(f, keep = c("a", "b"))
a a a
fct_reorder(.f, .x, .fun=median, ...,
.desc = FALSE) Reorder levels by
Combine Factors Add or drop levels
1= a 1= b
a 2= b
ca 2= c their relationship with another
bc 3= c b 3= a
variable.
a 1= a + a fct_c(…) Combine factors boxplot(data = iris, Sepal.Width ~ a a
b 1= a = 1= a
fct_reorder(Species, Sepal.Width))
1= a 1= a fct_drop(f, only) Drop unused levels.
c 2= c a 2= b c 2= c with different levels. b 2= b 2= b f5 <- factor(c("a","b"),c("a","b","x"))
3= b f1 <- factor(c("a", "c")) 3= x b
b f6 <- fct_drop(f5)
f2 <- factor(c("b", "a"))
a fct_c(f1, f2) fct_reorder2(.f, .x, .y, .fun =
1= a 1= b last2, ..., .desc = TRUE) Reorder a 1= a a 1= a fct_expand(f, …) Add levels to
2= b 2= c 2= b 2= b a factor. fct_expand(f6, "x")
3= c 3= a
levels by their final values when b b 3= x
a 1= a a 1= a plotted with two other variables.
b
2= b
b
2= b
3= c
fct_unify(fs, levels = ggplot(data = iris,
lvls_union(fs)) Standardize aes(Sepal.Width, Sepal.Length,
a 1= a
2= c
a 1= a
levels across a list of factors. color = fct_reorder2(Species,
a 1= a a 1= a fct_explicit_na(f, na_level="(Missing)")
c c2 2= b
b 2= b
b 2= b Assigns a level to NAs to ensure they
3= c
fct_unify(list(f2, f1)) Sepal.Width, Sepal.Length))) + 3= x appear in plots, etc.
geom_smooth() NA x fct_explicit_na(factor(c("a", "b", NA)))
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at forcats.tidyverse.org • Diagrams inspired by @LVaudor ! • forcats 0.3.0 • Updated: 2019-02
2017-11-28 12:00:00 dt <- as_datetime(1511870400) d <- as_date(17498) t <- hms::as.hms(85) round_date(x, unit = "second")
## "2017-11-28 12:00:00 UTC" ## "2017-11-28" ## 00:01:25 Round to nearest unit.
round_date(dt, unit = "month")
Jan Feb Mar Apr
Stamp Date-times
2018-01-31 11:59:59 year(x) Year. year(dt)
2017-22-12 10:00:00 ydm_hms(), ydm_hm(), ydm_h(). isoyear(x) The ISO 8601 year.
ydm_hms("2017-22-12 10:00:00") epiyear(x) Epidemiological year.
stamp() Derive a template from an example string and return a new
11/28/2017 1:02:03 mdy_hms(), mdy_hm(), mdy_h(). month(x, label, abbr) Month. function that will apply the template to date-times. Also
mdy_hms("11/28/2017 1:02:03") 2018-01-31 11:59:59 stamp_date() and stamp_time().
month(dt)
dmy_hms(), dmy_hm(), dmy_h(). 2018-01-31 11:59:59 1. Derive a template, create a function Tip: use a
1 Jan 2017 23:59:59 day(x) Day of month. day(dt) sf <- stamp("Created Sunday, Jan 17, 1999 3:34")
dmy_hms("1 Jan 2017 23:59:59") date with
wday(x,label,abbr) Day of week.
2. Apply the template to dates day > 12
ymd(), ydm(). ymd(20170131) qday(x) Day of quarter.
20170131 sf(ymd("2010-04-05"))
2018-01-31 11:59:59 ## [1] "Created Monday, Apr 05, 2010 00:00"
mdy(), myd(). mdy("July 4th, 2000") hour(x) Hour. hour(dt)
July 4th, 2000
2018-01-31 11:59:59 minute(x) Minutes. minute(dt)
4th of July '99 dmy(), dym(). dmy("4th of July '99")
2001: Q3 yq() Q for quarter. yq("2001: Q3") 2018-01-31 11:59:59 second(x) Seconds. second(dt) Time Zones
2:01 hms::hms() Also lubridate::hms(), x
J F M A M J week(x) Week of the year. week(dt) R recognizes ~600 time zones. Each encodes the time zone, Daylight
hm() and ms(), which return J A S O N D isoweek() ISO 8601 week. Savings Time, and historical calendar variations for an area. R assigns
periods.* hms::hms(sec = 0, min= 1, epiweek() Epidemiological week. one time zone per vector.
hours = 2)
x
J F M A M J
quarter(x, with_year = FALSE)
Use the UTC time zone to avoid Daylight Savings.
J A S O N D Quarter. quarter(dt) OlsonNames() Returns a list of valid time zone names. OlsonNames()
2017.5 date_decimal(decimal, tz = "UTC")
date_decimal(2017.5)
x
J F M A M J semester(x, with_year = FALSE) 5:00 6:00
J A S O N D Semester. semester(dt)
now(tzone = "") Current time in tz 4:00 Mountain Central 7:00
with_tz(time, tzone = "") Get
(defaults to system tz). now() am(x) Is it in the am? am(dt) Pacific Eastern the same date-time in a new
January today(tzone = "") Current date in a pm(x) Is it in the pm? pm(dt) time zone (a new clock time).
xxxxx with_tz(dt, "US/Pacific")
xxx tz (defaults to system tz). today()
dst(x) Is it daylight savings? dst(d) PT
MT
fast_strptime() Faster strptime. CT ET
fast_strptime('9/1/01', '%y/%m/%d') leap_year(x) Is it a leap year? force_tz(time, tzone = "") Get
leap_year(d) 7:00 7:00 the same clock time in a new
parse_date_time() Easier strptime. Pacific Eastern time zone (a new date-time).
parse_date_time("9/1/01", "ymd") update(object, ..., simple = FALSE) force_tz(dt, "US/Pacific")
update(dt, mday = 2, hour = 1) 7:00 7:00
Mountain Central
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at lubridate.tidyverse.org • lubridate 1.6.0 • Updated: 2017-12
# Outras funções/objetos. 1
ls("package:magrittr") %>% 2
str_subset("^[^%]*$") 3
x <- precip 1
mean(sqrt(x - min(x))) 2
## [1] 5.020078
x <- x - min(x) 1
x <- sqrt(x) 2
mean(x) 3
## [1] 5.020078
precip %>% 1
`-`(min(.)) %>% # o mesmo que subtract(min(.)) 2
sqrt() %>% 3
mean() 4
## [1] 5.020078
x <- precip 1
x <- sqrt(x) 2
x <- x[x > 5] 3
x <- mean(x) 4
x 5
## [1] 6.364094
precip %>% 1
sqrt() %>% 2
.[is_greater_than(., 5)] %>% # o mesmo que .[`>`(., 5)] 3
mean() 4
## [1] 6.364094
# Do CRAN. 1
install.packages("tidyverse") 2
3
# Do GitHub. 4
# install.packages("devtools") 5
devtools::install_github("hadley/tidyverse") 6
7
# Atualizar caso já tenha instalado. 8
tidyverse_update() 9