
Data Manipulation and Visualization

the tidyverse approach

Prof. Walmes Zeviani
walmes@ufpr.br

Laboratório de Estatística e Geoinformação
Departamento de Estatística
Universidade Federal do Paraná



Motivation



Data manipulation and visualization

- Manipulating and visualizing data are mandatory tasks in Data Science (DS).
- Knowledge about the data determines the success of the following steps.
- Doing this efficiently requires:
  - Knowing the process and its steps.
  - Mastering the technology for it.
- There are countless software tools aimed at this.
- R stands out in DS for being free & open source, with many resources and a large community.



Time spent in DS

What do data scientists spend most of their time doing, and how much do they enjoy it?

[Chart with two panels, "Time spent" and "Least enjoyable", broken down by activity: 1. Collecting data, 2. Cleaning and organizing data, 3. Mining data, 4. Building training data, 5. Refining algorithms, 6. Other. Cleaning and organizing data takes the largest share of the time (about 60%).]

Figure 1. Time spent and enjoyment by activity. Source: Gil Press, 2016.



The R environment for data manipulation

- R is the lingua franca of Statistics.
- From the beginning it has offered resources for data manipulation.
- The data.frame is the base structure for tabular data.
- base, utils, stats, reshape, etc. provide resources to import, transform, modify, filter and aggregate data.frames.
- However, there are "some imperfections", or room for improvement:
  - Unwanted coercions from data.frame/matrix to vector (see the sketch below).
  - Irregular/inconsistent order and naming of function arguments.
  - Package dependencies resolved only in cascade.
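
A minimal sketch (not from the original slides) of the first imperfection: with base R's default drop = TRUE, extracting a single column from a data.frame silently returns a plain vector, whereas a tibble preserves the tabular structure.

df <- data.frame(x = 1:3, y = letters[1:3])
class(df[, "x"])                 # "integer": silently coerced to a vector
class(df[, "x", drop = FALSE])   # "data.frame": only if you remember drop = FALSE

library(tibble)
tb <- tibble(x = 1:3, y = letters[1:3])
class(tb[, "x"])                 # "tbl_df" "tbl" "data.frame": no coercion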



The tidyverse approach



The tidyverse

- It offers a reimplementation and extension of the functionality for data manipulation and visualization.
- It is a collection of 8 core R packages that operate in harmony.
- They were planned and built to work together.
- They have a clearer grammar, organization, philosophy and data structures.
- Easier code development and better portability.
- Other packages couple very well with the tidyverse.
- Packages: https://www.tidyverse.org/packages/.
- R4DS: https://r4ds.had.co.nz/.
- Cookbook: https://rstudio-education.github.io/tidyverse-cookbook/program.html.



What the tidyverse contains

library(tidyverse)
ls("package:tidyverse")

## [1] "tidyverse_conflicts" "tidyverse_deps"      "tidyverse_logo"
## [4] "tidyverse_packages"  "tidyverse_update"

tidyverse_packages()

## [1] "broom"        "cli"          "crayon"       "dplyr"
## [5] "dbplyr"       "forcats"      "ggplot2"      "haven"
## [9] "hms"          "httr"         "jsonlite"     "lubridate"
## [13] "magrittr"    "modelr"       "purrr"        "readr"
## [17] "readxl\n(>=" "reprex"       "rlang"        "rstudioapi"
## [21] "rvest"       "stringr"      "tibble"       "tidyr"
## [25] "xml2"        "tidyverse"



The tidyverse packages

Figure 2. Packages that are part of the tidyverse.



But in reality

Figure 3. In a parallel universe.



The anatomy of the tidyverse

tibble

- A reimplementation of the data.frame with many improvements.
- A lean print() method.
- Documentation: https://tibble.tidyverse.org/.

readr

- Reading tabular data: csv, tsv, fwf.
- "Smart" features that determine the type of each variable.
- E.g.: date fields are imported as dates!
- Documentation: https://readr.tidyverse.org/.



The anatomy of the tidyverse

tidyr

- Support for putting data in the tidy (tabular) format.
- Each variable is a column.
- Each observation (sampling unit) is a row.
- Each value is a cell.
- Documentation: https://tidyr.tidyverse.org/.

dplyr

- Offers an extensive grammar for data manipulation.
- Split-apply-combine operations.
- Most of the data manipulation is done with dplyr.
- Documentation: https://dplyr.tidyverse.org/.



The anatomy of the tidyverse

ggplot2

- Graphics built on The Grammar of Graphics (WILKINSON et al., 2013).
- Clear mapping of the variables in the data set to visual variables, and layer-based construction.
- Documentation: https://ggplot2.tidyverse.org/.
- WICKHAM (2016): ggplot2 - Elegant Graphics for Data Analysis.
- TEUTONICO (2015): ggplot2 Essentials.

forcats

- For handling categorical variables/factors.
- Rename, reorder, transform and lump levels (see the sketch below).
- Documentation: https://forcats.tidyverse.org/.
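
A minimal forcats sketch (illustrative data, not from the original slides) covering the operations listed above: relevel, rename and lump factor levels.

library(forcats)

f <- factor(c("low", "high", "medium", "high", "low", "low"))
levels(f)                                     # alphabetical by default: "high" "low" "medium"

f <- fct_relevel(f, "low", "medium", "high")  # reorder the levels explicitly
f <- fct_recode(f, L = "low", M = "medium", H = "high")  # rename levels (new = old)
fct_count(f)                                  # frequency of each level
fct_lump(f, n = 1)                            # keep the most frequent level, lump the rest into "Other"
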
The anatomy of the tidyverse

stringr

- Cohesive resources built for string manipulation (see the sketch below).
- Built on top of stringi.
- Documentation: https://stringr.tidyverse.org/.
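
A minimal stringr sketch (illustrative data, not from the slides): detect, extract and replace patterns in character vectors.

library(stringr)

x <- c("apple 12kg", "banana 5kg", "grape 2kg")

str_detect(x, "kg")              # TRUE TRUE TRUE
str_extract(x, "[0-9]+")         # "12" "5" "2"
str_replace(x, " [0-9]+kg", "")  # "apple" "banana" "grape"
str_to_upper(str_sub(x, 1, 5))   # first five characters, upper case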

purrr

- Resources for functional programming.
- Functions that apply functions in batch, iterating over objects: vectors, lists, etc.
- Documentation: https://purrr.tidyverse.org/.



They harmonize well with the tidyverse

- magrittr: pipe operators → %>% (see the sketch below).
- rvest: web scraping.
- httr: HTTP requests and the like.
- xml2: XML manipulation.
- lubridate and hms: manipulation of date/time data.
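
A minimal sketch of the pipe (not from the original slides): x %>% f(y) is read as f(x, y), which lets operations be written in the order they are applied.

library(magrittr)

x <- c(4, 9, 16)
x %>% sqrt() %>% sum()   # 9
sum(sqrt(x))             # 9, the equivalent nested call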



tibble data structures



Anatomy of the tibble

- An improvement on the data.frame.
- The tibble class.
- Quick ways to create tibbles (see the sketch below).
- Quick ways to modify objects of these classes.
- A leaner and more informative print method.
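
A minimal sketch of tibble creation (illustrative columns, not from the slides): column-wise with tibble(), row-wise with tribble(), plus the compact print and glimpse() views.

library(tibble)

tb <- tibble(x = 1:4, y = x^2, grp = c("a", "a", "b", "b"))  # later columns can use earlier ones
tr <- tribble(
  ~name, ~value,
  "a",   1,
  "b",   2
)

print(tb)     # concise print: dimensions, column types, first rows
glimpse(tb)   # transposed overview: one line per column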



# packageVersion("tibble") 1
ls("package:tibble") 2

## [1] "add_case" "add_column"


## [3] "add_row" "as_data_frame"
## [5] "as_tibble" "as.tibble"
## [7] "column_to_rownames" "data_frame"
## [9] "data_frame_" "deframe"
## [11] "enframe" "frame_data"
## [13] "frame_matrix" "glimpse"
## [15] "has_name" "has_rownames"
## [17] "is_tibble" "is.tibble"
## [19] "is_vector_s3" "knit_print.trunc_mat"
## [21] "lst" "lst_"
## [23] "new_tibble" "obj_sum"
## [25] "remove_rownames" "repair_names"
## [27] "rowid_to_column" "rownames_to_column"
## [29] "set_tidy_names" "tbl_sum"
## [31] "tibble" "tibble_"
## [33] "tidy_names" "tribble"
## [35] "trunc_mat" "type_sum"



Figure 4. Using the tibble.



Reading data with readr



Anatomy of readr

- Import of data in text format.
- Import functions: read_*() (see the sketch below).
- Write functions: write_*().
- Parsing functions: parse_*().
- They can identify date fields.
- Many import control options:
  - Encoding.
  - Field and decimal separators.
  - Quotes, comments, etc.
- Reference card for reading with readr and tidying with tidyr: https://rawgit.com/rstudio/cheatsheets/master/data-import.pdf.
- Data-reading examples from the course: http://leg.ufpr.br/~walmes/cursoR/data-vis/99-datasets.html.
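
A minimal import sketch (the file and columns are illustrative, not from the slides): column types are guessed and reported, and can be fixed with col_types.

library(readr)

arq <- tempfile(fileext = ".csv")                      # temporary file so the example is self-contained
write_csv(data.frame(date = c("2019-01-01", "2019-02-01"),
                     y    = c(1.5, 2.3)), arq)

dat <- read_csv(arq)                                   # types are guessed: date as Date, y as double
spec(dat)                                              # shows the column specification that was used
dat <- read_csv(arq, col_types = cols(date = col_date(),
                                      y    = col_double()))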



# packageVersion("readr") 1
ls("package:readr") %>% 2
str_subset("(read|parse|write)_") %>% 3
sort() 4

## [1] "parse_character" "parse_date" "parse_datetime"


## [4] "parse_double" "parse_factor" "parse_guess"
## [7] "parse_integer" "parse_logical" "parse_number"
## [10] "parse_time" "parse_vector" "read_csv"
## [13] "read_csv2" "read_csv2_chunked" "read_csv_chunked"
## [16] "read_delim" "read_delim_chunked" "read_file"
## [19] "read_file_raw" "read_fwf" "read_lines"
## [22] "read_lines_chunked" "read_lines_raw" "read_log"
## [25] "read_rds" "read_table" "read_table2"
## [28] "read_tsv" "read_tsv_chunked" "write_csv"
## [31] "write_delim" "write_excel_csv" "write_file"
## [34] "write_lines" "write_rds" "write_tsv"



Figure 5. Reference card: data import with readr.


Figure 6. Reading data with readr.



Figure 7. Parsing values with readr.



Tidy-format data with tidyr



Anatomy of tidyr

- For tidying data.
- Change the layout of the data: long ↔ wide (see the sketch below).
- Split one variable into several fields.
- Concatenate several fields to create one variable.
- Remove or impute missing values: NA.
- Nest lists inside tables (nested tibbles).
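
A minimal reshaping sketch (illustrative data, not from the slides), using the gather()/spread() verbs shown in the reference card; in current tidyr they are superseded by pivot_longer()/pivot_wider().

library(tidyr)

wide <- data.frame(id = 1:2, t1 = c(10, 20), t2 = c(11, 21))
long <- gather(wide, key = "time", value = "y", t1, t2)      # wide -> long
spread(long, key = time, value = y)                          # long -> wide again

d <- data.frame(rate = c("3/10", "7/20"))
d <- separate(d, rate, into = c("cases", "pop"), sep = "/")   # split one column into two
unite(d, col = "rate", cases, pop, sep = "/")                 # and concatenate them back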



# packageVersion("tidyr") 1
ls("package:tidyr") 2

## [1] "%>%" "complete" "complete_"


## [4] "crossing" "crossing_" "drop_na"
## [7] "drop_na_" "expand" "expand_"
## [10] "extract" "extract_" "extract_numeric"
## [13] "fill" "fill_" "full_seq"
## [16] "gather" "gather_" "nest"
## [19] "nest_" "nesting" "nesting_"
## [22] "population" "replace_na" "separate"
## [25] "separate_" "separate_rows" "separate_rows_"
## [28] "smiths" "spread" "spread_"
## [31] "table1" "table2" "table3"
## [34] "table4a" "table4b" "table5"
## [37] "uncount" "unite" "unite_"
## [40] "unnest" "unnest_" "who"



Figure 8. Reference card: data tidying with tidyr (and tibble).


Figure 9. The definition of tidy data, i.e. the tabular format.



Figure 10. Changing the layout of data with tidyr.



Figure 11. tidyr resources for handling missing data.



Figure 12. Splitting and concatenating values with tidyr.



Aggregation with dplyr



Anatomy of dplyr

- dplyr is the grammar for data manipulation.
- It has a consistent set of verbs that act on tables.
- Verbs: mutate(), select(), filter(), arrange(), summarise(), slice(), rename(), etc. (see the sketch below).
- Suffixes: _at(), _if(), _all(), etc.
- Grouping: group_by() and ungroup().
- Joins: inner_join(), full_join(), left_join() and right_join().
- Summary functions: n(), n_distinct(), first(), last(), nth(), etc.
- And much more in the reference card.
- Reference card: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf.
- It is without doubt the most important package in the tidyverse.
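
A minimal split-apply-combine sketch with dplyr (using the built-in mtcars data; not from the original slides).

library(dplyr)

mtcars %>%
  filter(hp > 90) %>%                  # keep rows that meet a condition
  mutate(gpm = 1 / mpg) %>%            # create a new column
  group_by(cyl) %>%                    # split by number of cylinders
  summarise(n        = n(),
            mpg_mean = mean(mpg),
            hp_max   = max(hp)) %>%    # one summary row per group
  arrange(desc(mpg_mean))              # order the result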



# library(dplyr)
ls("package:dplyr") %>% str_c(collapse = ", ") %>% strwrap()

## [1] "%>%, add_count, add_count_, add_row, add_rownames, add_tally,"


## [2] "add_tally_, all_equal, all_vars, anti_join, any_vars, arrange,"
## [3] "arrange_, arrange_all, arrange_at, arrange_if, as_data_frame,"
## [4] "as.tbl, as.tbl_cube, as_tibble, auto_copy, band_instruments,"
## [5] "band_instruments2, band_members, bench_tbls, between,"
## [6] "bind_cols, bind_rows, case_when, changes, check_dbplyr,"
## [7] "coalesce, collapse, collect, combine, common_by, compare_tbls,"
## [8] "compare_tbls2, compute, contains, copy_to, count, count_,"
## [9] "cumall, cumany, cume_dist, cummean, current_vars, data_frame,"
## [10] "data_frame_, db_analyze, db_begin, db_commit, db_create_index,"
## [11] "db_create_indexes, db_create_table, db_data_type, db_desc,"
## [12] "db_drop_table, db_explain, db_has_table, db_insert_into,"
## [13] "db_list_tables, db_query_fields, db_query_rows, db_rollback,"
## [14] "db_save_query, db_write_table, dense_rank, desc, dim_desc,"
## [15] "distinct, distinct_, do, do_, dr_dplyr, ends_with, enexpr,"
## [16] "enexprs, enquo, enquos, ensym, ensyms, eval_tbls, eval_tbls2,"
## [17] "everything, explain, expr, failwith, filter, filter_,"
## [18] "filter_all, filter_at, filter_if, first, frame_data, full_join,"
## [19] "funs, funs_, glimpse, group_by, group_by_, group_by_all,"
## [20] "group_by_at, group_by_if, group_by_prepare, grouped_df,"
## [21] "group_indices, group_indices_, groups, group_size, group_vars,"
## [22] "id, ident, if_else, inner_join, intersect, is_grouped_df,"
## [23] "is.grouped_df, is.src, is.tbl, lag, last, lead, left_join,"
## [24] "location, lst, lst_, make_tbl, matches, min_rank, mutate,"
## [25] "mutate_, mutate_all, mutate_at, mutate_each, mutate_each_,"
## [26] "mutate_if, n, na_if, nasa, n_distinct, near, n_groups, nth,"
## [27] "ntile, num_range, one_of, order_by, percent_rank,"
## [28] "progress_estimated, pull, quo, quo_name, quos, rbind_all,"
## [29] "rbind_list, recode, recode_factor, rename, rename_, rename_all,"

Figure 13. Reference card: operations on tabular data with dplyr.

Figure 14. Reference card: operations on tabular data with dplyr (continued).

Functional programming with purrr



Anatomy of purrr

- purrr provides a complete and consistent toolkit for functional programming.
- Its functions are a refinement of the apply family.
- Several map-type functions, one for each type of input/output (see the sketch below).
- They iterate over vectors, lists, columns, rows, etc.
- They allow filtering, concatenating and pairing lists, etc.
- It has functions for handling exceptions: failures/errors and warnings.
- Reference card: https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf.
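
A minimal map sketch (illustrative, not from the slides): the suffix chooses the output type, and safely() is one of the exception-handling wrappers.

library(purrr)

x <- list(a = 1:3, b = 4:6, c = 7:9)

map(x, sum)                       # list output
map_dbl(x, mean)                  # double vector output, one value per element
map2_dbl(1:3, 4:6, ~ .x + .y)     # iterate over two vectors in parallel

log_safe <- safely(log)           # errors are captured instead of stopping the loop
map(list(10, "a"), log_safe)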



# library(purrr)
ls("package:purrr") %>% str_c(collapse = ", ") %>% strwrap()

## [1] "%>%, %||%, %@%, accumulate, accumulate_right, array_branch,"


## [2] "array_tree, as_function, as_mapper, as_vector, at_depth,"
## [3] "attr_getter, auto_browse, compact, compose, cross, cross2,"
## [4] "cross3, cross_d, cross_df, cross_n, detect, detect_index,"
## [5] "discard, every, flatten, flatten_chr, flatten_dbl, flatten_df,"
## [6] "flatten_dfc, flatten_dfr, flatten_int, flatten_lgl,"
## [7] "has_element, head_while, imap, imap_chr, imap_dbl, imap_dfc,"
## [8] "imap_dfr, imap_int, imap_lgl, invoke, invoke_map,"
## [9] "invoke_map_chr, invoke_map_dbl, invoke_map_df, invoke_map_dfc,"
## [10] "invoke_map_dfr, invoke_map_int, invoke_map_lgl, is_atomic,"
## [11] "is_bare_atomic, is_bare_character, is_bare_double,"
## [12] "is_bare_integer, is_bare_list, is_bare_logical,"
## [13] "is_bare_numeric, is_bare_vector, is_character, is_double,"
## [14] "is_empty, is_formula, is_function, is_integer, is_list,"
## [15] "is_logical, is_null, is_numeric, is_scalar_atomic,"
## [16] "is_scalar_character, is_scalar_double, is_scalar_integer,"
## [17] "is_scalar_list, is_scalar_logical, is_scalar_numeric,"
## [18] "is_scalar_vector, is_vector, iwalk, keep, lift, lift_dl,"
## [19] "lift_dv, lift_ld, lift_lv, lift_vd, lift_vl, list_along,"
## [20] "list_merge, list_modify, lmap, lmap_at, lmap_if, map, map2,"
## [21] "map2_chr, map2_dbl, map2_df, map2_dfc, map2_dfr, map2_int,"
## [22] "map2_lgl, map_at, map_call, map_chr, map_dbl, map_df, map_dfc,"
## [23] "map_dfr, map_if, map_int, map_lgl, modify, modify_at,"
## [24] "modify_depth, modify_if, negate, partial, pluck, pmap,"
## [25] "pmap_chr, pmap_dbl, pmap_df, pmap_dfc, pmap_dfr, pmap_int,"
## [26] "pmap_lgl, possibly, prepend, pwalk, quietly, rbernoulli,"
## [27] "rdunif, reduce, reduce2, reduce2_right, reduce_right,"
## [28] "rep_along, rerun, safely, set_names, simplify, simplify_all,"
## [29] "some, splice, tail_while, transpose, update_list, vec_depth,"

Figure 15. Reference card: functional programming with purrr.


Figure 16. Reference card: functional programming with purrr (continued: list columns and nested data).

Graphics with ggplot2



Anatomy of ggplot2

- ggplot2 is the most widely adopted graphics package in data science.
- Its implementation is based on The Grammar of Graphics (WILKINSON et al., 2013).
- The grammar makes graphs be built in layers (see the sketch below).
- Reference card: https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf.
- A ggplot2 tutorial presented at R Day: http://rday.leg.ufpr.br/materiais/intro_ggplo2_tomas.pdf.
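
A minimal layered plot sketch with ggplot2 (built-in mtcars data; not from the original slides): data, aesthetic mappings, geoms, facets and labels are added as layers.

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                     # layer 1: points
  geom_smooth(method = "lm", se = FALSE) +   # layer 2: fitted lines per colour group
  facet_wrap(~cyl) +                         # small multiples by number of cylinders
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
  theme_minimal()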



Figure 17. Reference card: graphics with ggplot2.


Figura 18. Cartão de referência de gráficos com ggplot2.


Walmes Zeviani · UFPR Manipulação e Visualização de Dados 47
u <- ls("package:ggplot2") 1
u %>% str_subset("^geom_") 2

## [1] "geom_abline" "geom_area" "geom_bar"


## [4] "geom_bin2d" "geom_blank" "geom_boxplot"
## [7] "geom_col" "geom_contour" "geom_count"
## [10] "geom_crossbar" "geom_curve" "geom_density"
## [13] "geom_density2d" "geom_density_2d" "geom_dotplot"
## [16] "geom_errorbar" "geom_errorbarh" "geom_freqpoly"
## [19] "geom_hex" "geom_histogram" "geom_hline"
## [22] "geom_jitter" "geom_label" "geom_line"
## [25] "geom_linerange" "geom_map" "geom_path"
## [28] "geom_point" "geom_pointrange" "geom_polygon"
## [31] "geom_qq" "geom_qq_line" "geom_quantile"
## [34] "geom_raster" "geom_rect" "geom_ribbon"
## [37] "geom_rug" "geom_segment" "geom_sf"
## [40] "geom_sf_label" "geom_sf_text" "geom_smooth"
## [43] "geom_spoke" "geom_step" "geom_text"
## [46] "geom_tile" "geom_violin" "geom_vline"

u %>% str_subset("^theme_") 1

## [1] "theme_bw" "theme_classic" "theme_dark" "theme_get"


## [5] "theme_gray" "theme_grey" "theme_light" "theme_linedraw"
## [9] "theme_minimal" "theme_replace" "theme_set" "theme_test"
## [13] "theme_update" "theme_void"

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 48
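
Um esboço mínimo combinando algumas dessas famílias de funções (o conjunto de dados mpg acompanha o próprio ggplot2; escala, facetas e tema são escolhas apenas ilustrativas):

library(ggplot2)

# Esboço: dispersão com camadas, escala de cor, facetas e tema.
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  facet_wrap(~cyl) +
  theme_minimal()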


u %>% str_subset("^stat_") 1

## [1] "stat_bin" "stat_bin2d" "stat_bin_2d"


## [4] "stat_binhex" "stat_bin_hex" "stat_boxplot"
## [7] "stat_contour" "stat_count" "stat_density"
## [10] "stat_density2d" "stat_density_2d" "stat_ecdf"
## [13] "stat_ellipse" "stat_function" "stat_identity"
## [16] "stat_qq" "stat_qq_line" "stat_quantile"
## [19] "stat_sf" "stat_sf_coordinates" "stat_smooth"
## [22] "stat_spoke" "stat_sum" "stat_summary"
## [25] "stat_summary2d" "stat_summary_2d" "stat_summary_bin"
## [28] "stat_summary_hex" "stat_unique" "stat_ydensity"

u %>% str_subset("^(scale|coord)_") 1

## [1] "coord_cartesian" "coord_equal"


## [3] "coord_fixed" "coord_flip"
## [5] "coord_map" "coord_munch"
## [7] "coord_polar" "coord_quickmap"
## [9] "coord_sf" "coord_trans"
## [11] "scale_alpha" "scale_alpha_continuous"
## [13] "scale_alpha_date" "scale_alpha_datetime"
## [15] "scale_alpha_discrete" "scale_alpha_identity"
## [17] "scale_alpha_manual" "scale_alpha_ordinal"
## [19] "scale_color_brewer" "scale_color_continuous"
## [21] "scale_color_discrete" "scale_color_distiller"
## [23] "scale_color_gradient" "scale_color_gradient2"
## [25] "scale_color_gradientn" "scale_color_grey"
## [27] "scale_color_hue" "scale_color_identity"
## [29] "scale_color_manual" "scale_color_viridis_c"
## [31] Zeviani
Walmes "scale _color
· UFPR _viridis
Manipulação _d"
e Visualização Dados _colour_brewer"
"scale
de 49
Manipulação de strings com stringr

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 50


Anatomia do stringr

I O stringr é uma coleção de funções para operações com


strings.
I Ele foi construído sobre o stringi.
I Cartão de referência:
https://github.com/rstudio/cheatsheets/raw/master/strings.pdf.
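
Um esboço mínimo de uso, tomando o vetor de exemplo fruit que acompanha o stringr:

library(stringr)

# Esboço: consultas e transformações básicas no vetor de exemplo fruit.
str_subset(fruit, "berry")                        # nomes que contêm "berry"
str_detect(fruit, "^a")                           # começa com "a"? (vetor lógico)
str_replace(fruit, " ", "_")                      # troca o primeiro espaço por "_"
str_c(str_to_upper(fruit[1:3]), collapse = ", ")  # junta em uma única string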

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 51


ls("package:stringr") 1

## [1] "%>%" "boundary" "coll"


## [4] "fixed" "fruit" "invert_match"
## [7] "regex" "sentences" "str_c"
## [10] "str_conv" "str_count" "str_detect"
## [13] "str_dup" "str_extract" "str_extract_all"
## [16] "str_flatten" "str_glue" "str_glue_data"
## [19] "str_interp" "str_length" "str_locate"
## [22] "str_locate_all" "str_match" "str_match_all"
## [25] "str_order" "str_pad" "str_remove"
## [28] "str_remove_all" "str_replace" "str_replace_all"
## [31] "str_replace_na" "str_sort" "str_split"
## [34] "str_split_fixed" "str_squish" "str_sub"
## [37] "str_sub<-" "str_subset" "str_to_lower"
## [40] "str_to_title" "str_to_upper" "str_trim"
## [43] "str_trunc" "str_view" "str_view_all"
## [46] "str_which" "str_wrap" "word"
## [49] "words"

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 52


Figura 19. Cartão de referência para manipulação de strings com stringr.


Walmes Zeviani · UFPR Manipulação e Visualização de Dados 53
Figura 20. Cartão de referência para manipulação de strings com stringr.


Walmes Zeviani · UFPR Manipulação e Visualização de Dados 54
Manipulação de fatores com forcats

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 55


Anatomia do forcats

I O forcats é uma coleção de funções para operações com


fatores.
I Permite renomear, reordenar, aglutinar níveis, etc.
I Cartão de referência:
https://github.com/rstudio/cheatsheets/raw/master/factors.pdf.
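
Um esboço mínimo de uso (o fator abaixo foi criado apenas para ilustração):

library(forcats)

# Esboço: fator ilustrativo com níveis desbalanceados.
f <- factor(c("b", "b", "a", "c", "c", "c"))
fct_count(f)                             # frequência de cada nível
fct_infreq(f)                            # reordena os níveis pela frequência
fct_recode(f, baixo = "a", alto = "c")   # renomeia níveis (novo = antigo)
fct_lump(f, n = 1)                       # aglutina os menos frequentes em "Other"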

ls("package:forcats") 1

## [1] "%>%" "as_factor" "fct_anon"


## [4] "fct_c" "fct_collapse" "fct_count"
## [7] "fct_drop" "fct_expand" "fct_explicit_na"
## [10] "fct_infreq" "fct_inorder" "fct_lump"
## [13] "fct_other" "fct_recode" "fct_relabel"
## [16] "fct_relevel" "fct_reorder" "fct_reorder2"
## [19] "fct_rev" "fct_shift" "fct_shuffle"
## [22] "fct_unify" "fct_unique" "gss_cat"
## [25] "last2" "lvls_expand" "lvls_reorder"
## [28] "lvls_revalue" "lvls_union"

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 56


Figura 21. Cartão de referência para manipulação de fatores com forcats.


Walmes Zeviani · UFPR Manipulação e Visualização de Dados 57
Dados cronológicos com lubridate e hms

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 58


Anatomia dos pacotes

I Recursos para manipulação de dados date-time.


I Fácil decomposição de datas: dia, mês, semana, dia da
semana, etc.
I Lida com fusos horários, horários de verão, etc.
I Estende para outras classes de dados baseadas em date-time:
duração, período, intervalos.
I Cartão de referência:
https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf.
I Não é carregado junto com o tidyverse.
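
Um esboço mínimo de uso (data e fuso horário escolhidos apenas para ilustração):

library(lubridate)

# Esboço: parse, decomposição e aritmética de datas.
d <- ymd_hms("2019-03-15 14:30:00", tz = "America/Sao_Paulo")
year(d)                        # componente ano
month(d, label = TRUE)         # mês como fator rotulado
wday(d, label = TRUE)          # dia da semana
d + months(2)                  # aritmética com períodos
with_tz(d, "UTC")              # mesmo instante em outro fuso horário
hms::hms(0, 30, 14)            # hora do dia (14:30:00) com o pacote hms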

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 59


library(lubridate) 1
ls("package:lubridate") %>% str_c(collapse = ", ") %>% strwrap() 2

## [1] "%--%, add_with_rollback, am, Arith, as_date, as_datetime,"


## [2] "as.difftime, as.duration, as.interval, as.period, ceiling_date,"
## [3] "Compare, date, date<-, date_decimal, day, day<-, days,"
## [4] "days_in_month, ddays, decimal_date, dhours, dmicroseconds,"
## [5] "dmilliseconds, dminutes, dmy, dmy_h, dmy_hm, dmy_hms,"
## [6] "dnanoseconds, dpicoseconds, dseconds, dst, duration, dweeks,"
## [7] "dyears, dym, edays, ehours, emicroseconds, emilliseconds,"
## [8] "eminutes, enanoseconds, epicoseconds, epiweek, epiyear,"
## [9] "eseconds, eweeks, eyears, fast_strptime, fit_to_timeline,"
## [10] "floor_date, force_tz, force_tzs, guess_formats, here, hm, hms,"
## [11] "hour, hour<-, hours, int_aligns, int_diff, int_end, int_end<-,"
## [12] "intersect, interval, int_flip, int_length, int_overlaps,"
## [13] "int_shift, int_standardize, int_start, int_start<-, is.Date,"
## [14] "is.difftime, is.duration, is.instant, is.interval, isoweek,"
## [15] "isoyear, is.period, is.POSIXct, is.POSIXlt, is.POSIXt,"
## [16] "is.timepoint, is.timespan, lakers, leap_year, local_time, %m-%,"
## [17] "%m+%, make_date, make_datetime, make_difftime, mday, mday<-,"
## [18] "mdy, mdy_h, mdy_hm, mdy_hms, microseconds, milliseconds,"
## [19] "minute, minute<-, minutes, month, month<-, ms, myd,"
## [20] "nanoseconds, new_difftime, new_duration, new_interval,"
## [21] "new_period, now, olson_time_zones, origin, parse_date_time,"
## [22] "parse_date_time2, period, period_to_seconds, picoseconds, pm,"
## [23] "pretty_dates, qday, qday<-, quarter, reclass_date,"
## [24] "reclass_timespan, rollback, round_date, second, second<-,"
## [25] "seconds, seconds_to_period, semester, setdiff, show, stamp,"
## [26] "stamp_date, stamp_time, time_length, today, tz, tz<-, union,"
## [27] "wday, wday<-, week, week<-, weeks, %within%, with_tz, yday,"
## [28] "yday<-, ydm, ydm_h, ydm_hm, ydm_hms, year, year<-, years, ymd,"
## [29]
Walmes"ymd _h,
Zeviani ymd_Manipulação
· UFPR hm, ymd_ehms, yq" de Dados
Visualização 60
Figura 22. Cartão de referência para manipulação de *date-time* com lubridate e hms.


Walmes Zeviani · UFPR Manipulação e Visualização de Dados 61
Figura 23. Cartão de referência para manipulação de *date-time* com lubridate e hms.


Walmes Zeviani · UFPR Manipulação e Visualização de Dados 62
Encadeando com operadores do magrittr

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 63


Anatomia

I O operador %>% (pipe) permite expressar as operações de forma mais direta.
I É uma ideia inspirada nos pipes do shell Unix.
I A lógica é bem simples:
I x %>% f é o mesmo que f(x).
I x %>% f(y) é o mesmo que f(x, y).
I x %>% f %>% g %>% h é o mesmo que h(g(f(x))).
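
Um esboço mínimo dessas equivalências (vetor apenas ilustrativo):

library(magrittr)

x <- c(1, 4, 9)
x %>% sqrt()                      # f(x): 1 2 3
x %>% head(2)                     # f(x, y): 1 4
x %>% sqrt() %>% sum() %>% log()  # h(g(f(x))): log(6)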

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 64


Anatomia do magrittr
library(magrittr) 1
2
# Operadores "pipe". 3
ls("package:magrittr") %>% 4
str_subset("%") 5

## [1] "%<>%" "%>%" "%$%" "%T>%"

# Outras funções/objetos. 1
ls("package:magrittr") %>% 2
str_subset("^[^%]*$") 3

## [1] "add" "and"


## [3] "debug_fseq" "debug_pipe"
## [5] "divide_by" "divide_by_int"
## [7] "equals" "extract"
## [9] "extract2" "freduce"
## [11] "functions" "inset"
## [13] "inset2" "is_greater_than"
## _
[15] "is in" "is_less_than"
## [17] "is_weakly_greater_than" "is_weakly_less_than"
## [19] "mod" "multiply_by"
## [21] "multiply_by_matrix" "n'est pas"
## [23] "not" "or"
## [25] "raise_to_power" "set_colnames"
## [27] "set_names" "set_rownames"
## [29] "subtract" "undebug_fseq"
## [31] "use_series"

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 65
Exemplos de uso do pipe (1)

x <- precip 1
mean(sqrt(x - min(x))) 2

## [1] 5.020078

x <- x - min(x) 1
x <- sqrt(x) 2
mean(x) 3

## [1] 5.020078

precip %>% 1
`-`(min(.)) %>% # o mesmo que subtract(min(.)) 2
sqrt() %>% 3
mean() 4

## [1] 5.020078

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 66


Exemplos de uso do pipe (2)

x <- precip 1
x <- sqrt(x) 2
x <- x[x > 5] 3
x <- mean(x) 4
x 5

## [1] 6.364094

precip %>% 1
sqrt() %>% 2
.[is_greater_than(., 5)] %>% # o mesmo que .[`>`(., 5)] 3
mean() 4

## [1] 6.364094

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 67
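
Além do %>%, o magrittr oferece os operadores %<>%, %$% e %T>%, listados anteriormente na saída de ls("package:magrittr"). Um esboço ilustrativo, usando o conjunto mtcars do próprio R:

library(magrittr)

# Esboço: demais operadores do magrittr.
x <- c(3, 1, 2)
x %<>% sort()                     # pipe de atribuição: equivale a x <- sort(x)
mtcars %$% cor(mpg, wt)           # expõe as colunas do data frame à expressão
rnorm(5) %T>% print() %>% mean()  # "tee": imprime e segue o encadeamento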


Mãos à obra!

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 68


Instalar o tidyverse

# Do CRAN. 1
install.packages("tidyverse") 2
3
# Do GitHub. 4
# install.packages("devtools") 5
devtools::install_github("hadley/tidyverse") 6
7
# Atualizar caso já tenha instalado. 8
tidyverse_update() 9

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 69


O que vem agora?

I Uma visão aprofundada de cada pacote do tidyverse.


I Exemplos didáticos seguidos de desafios práticos.
I Happy coding.

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 70


Referências

TEUTONICO, D. Ggplot2 essentials. Packt Publishing, 2015.


WICKHAM, H. Ggplot2: Elegant graphics for data analysis.
Springer International Publishing, 2016.
WILKINSON, L.; WILLS, D.; ROPE, D.; NORTON, A.; DUBBS, R.
The grammar of graphics. Springer New York, 2013.

Walmes Zeviani · UFPR Manipulação e Visualização de Dados 71
