Start RStudio and open a new R script:
- On Windows click the start button and search for rstudio. On Mac RStudio will be in your applications folder.
- In RStudio go to File -> New File -> R Script.
What is R?
R is a programming language designed for statistical computing. Notable characteristics include:
Whatever you’re trying to do, you’re probably not the first to try doing it in R. Chances are good that someone has
already written a package for that.
The window in the upper-left is your R script. This is where you will write instructions for R to carry out.
The window in the lower-left is the R console. This is where results will be displayed.
Exercise 0
The purpose of this exercise is to give you an opportunity to explore the interface provided by RStudio (or
whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an
opportunity to figure it out.
Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R
function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are
looking for should appear in a pop up!
3. R includes extensive documentation, including a file named “An introduction to R”. Try to find this help file.
Exercise 0 solution
## 1. 2 plus 2
2 + 2
## [1] 4
## [1] 4
## [1] 3.162278
## or
10^(1/2)
## [1] 3.162278
R basics
Function calls
The general form for calling R functions is
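As a minimal sketch of what a call looks like: a function name followed by parentheses containing comma-separated arguments, which can be matched by position or by name. The first comment is a schematic pattern, not runnable code, and seq is used here only as a convenient base R example.

```r
## Schematic form: FunctionName(arg.1 = value.1, arg.2 = value.2, ...)

## Arguments can be matched by position...
a <- seq(1, 10, 3)
## ...or by name, in which case their order does not matter
b <- seq(by = 3, to = 10, from = 1)
a
b
```

Both calls produce the same sequence; naming arguments tends to make longer calls easier to read.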
Assignment
Values can be assigned names and used in subsequent operations
The <- operator (less than followed by a dash) is used to save values
The name on the left gets the value on the right.
## [1] 3.162278
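A short sketch of assignment, assuming nothing beyond base R. The names x and y are arbitrary choices for illustration.

```r
## Save the value 10 under the name x
x <- 10
## Use the name in a later operation and save that result too
y <- sqrt(x)
## Typing a name on its own prints its value
y
```

Assigning a new value to x later does not change y, because y stores the result that was computed at the time of assignment.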
help(help)
The help function can be used to look up the documentation for a function, or to look up the documentation for a
package. We can learn how to use the stats package by reading its documentation like this:
help(package = "stats")
Data sets
The examples in this workshop use the baby names data provided by the governments of the United States and
the United Kingdom. A cleaned and merged version of these data is available at
http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv .
You can install a package in R using the install.packages() function. Once a package is installed you may
use the library function to attach it so that it can be used.
## install.packages("readr")
library(readr)
Note: You may be confused by the existence of similar functions, e.g., read.csv and read.delim. These are
legacy functions that tend to be slower and less robust than the readr functions. One way to tell them apart is
that the faster, more robust versions use underscores in their names (e.g., read_csv), while the older functions
use dots (e.g., read.csv). My advice is to use the more robust newer versions, i.e., the ones with underscores.
Exercise 1
The purpose of this exercise is to practice reading data into R.
1. Open the help page for the read_csv function. How can you limit the number of rows to be read in?
2. Read just the first 10 rows of the data set.
3. Once you have successfully read in the first 10 rows, read the whole file, assigning the result to the name
baby.names .
Exercise 1 solution
## read ?read_csv
read_csv("http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv",
n_max = 10)
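The same pattern can be tried without a network connection. The sketch below writes a tiny made-up CSV to a temporary file, reads part of it with n_max, and then reads all of it and assigns the result to a name. The file contents and the names first.two and toy.names are assumptions for illustration; with the workshop data you would pass the URL above and assign the result to baby.names.

```r
library(readr)

## Write a tiny made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("Year,Sex,Name,Count",
             "1996,Female,ann,10",
             "1996,Male,bob,20",
             "1997,Female,ann,30"),
           tmp)

## n_max limits how many data rows are read
first.two <- read_csv(tmp, n_max = 2)
## Without n_max the whole file is read; assigning the result keeps it for later use
toy.names <- read_csv(tmp)
nrow(first.two)
nrow(toy.names)
```

Reading a few rows first is a cheap way to check that the column names and types look right before reading a large file.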
Data Manipulation
data.frame objects
Usually data read into R will be stored as a data.frame.
A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order).
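To make the row-then-column ordering concrete, here is a small sketch using a made-up data.frame (the object toy and its values are invented for illustration).

```r
## A small made-up data.frame with 3 rows and 2 columns
toy <- data.frame(Name = c("ann", "bob", "cy"),
                  Count = c(10L, 20L, 30L))
## dim reports the number of rows first, then the number of columns
dim(toy)
## nrow and ncol return each dimension separately
nrow(toy)
ncol(toy)
```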
## install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
We can get more information about R objects using the glimpse function.
glimpse(baby.names) # structure
## Observations: 197,106
## Variables: 4
## $ Year <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex <chr> "Female", "Female", "Female", "Female", "Female", "Femal...
## $ Name <chr> "aaliyah", "aarin", "aaron", "abagail", "abbey", "abbi",...
## $ Count <int> 802, 5, 26, 87, 510, 5, 311, 235, 17, 1402, 8, 5, 5, 6, ...
## # A tibble: 19 x 4
## Year Sex Name Count
## <int> <chr> <chr> <int>
## 1 1996 Female jill 306
## 2 1997 Female jill 254
## 3 1998 Female jill 206
## 4 1999 Female jill 169
## 5 2000 Female jill 168
## 6 2001 Female jill 130
## 7 2002 Female jill 85
## 8 2003 Female jill 83
## 9 2004 Female jill 53
## 10 2005 Female jill 61
## 11 2006 Female jill 108
## 12 2007 Female jill 42
## 13 2008 Female jill 25
## 14 2009 Female jill 34
## 15 2010 Female jill 31
## 16 2011 Female jill 5
## 17 2013 Female jill 13
## 18 2014 Female jill 18
## 19 2015 Female jill 7
## # A tibble: 2 x 4
## Year Sex Name Count
## <int> <chr> <chr> <int>
## 1 1996 Female jill 306
## 2 1996 M jack 4240
In the previous example we used == to filter rows. Other relational and logical operators are listed below.
== equal to
!= not equal to
%in% contained in
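A brief sketch of these operators inside filter, using a small made-up data.frame in place of the baby names data (the object toy and its values are assumptions for illustration).

```r
library(dplyr)

## Made-up stand-in for the baby names data
toy <- data.frame(Year = c(1996L, 1996L, 1997L, 1997L),
                  Name = c("jill", "jack", "jill", "jack"),
                  Count = c(30L, 40L, 25L, 10L),
                  stringsAsFactors = FALSE)

## == keeps rows equal to a value
only.jill <- filter(toy, Name == "jill")
## != keeps rows not equal to a value
not.jill <- filter(toy, Name != "jill")
## %in% keeps rows whose value appears in a set
in.1996 <- filter(toy, Year %in% c(1995L, 1996L))
```

Conditions can be combined with & (and) and | (or), as in the exercise solution that follows.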
Exercise 2
1. Extract data for the name "ashley".
2. Restrict the previous extraction to include only years between 2000 and 2004.
Exercise 2 solution
## 1. Extract data for the name "ashley".
filter(baby.names, Name == "ashley")
## 2. Restrict the previous extraction to include only years between 2000 and 2004.
filter(baby.names,
       Year >= 2000 &
       Year <= 2004 &
       Name == "ashley")
## # A tibble: 10 x 4
## Year Sex Name Count
## <int> <chr> <chr> <int>
## 1 2000 Female ashley 17997
## 2 2000 M ashley 44
## 3 2001 Female ashley 16524
## 4 2001 M ashley 33
## 5 2002 Female ashley 15339
## 6 2002 M ashley 18
## 7 2003 Female ashley 14512
## 8 2003 M ashley 31
## 9 2004 Female ashley 14370
## 10 2004 M ashley 57
## # A tibble: 197,106 x 5
## Year Sex Name Count Thousands
## <int> <chr> <chr> <int> <dbl>
## 1 1996 Female aaliyah 802 0.802
## 2 1996 Female aarin 5 0.005
## 3 1996 Female aaron 26 0.026
## 4 1996 Female abagail 87 0.087
## 5 1996 Female abbey 510 0.51
## 6 1996 Female abbi 5 0.005
## 7 1996 Female abbie 311 0.311
## 8 1996 Female abbigail 235 0.235
## 9 1996 Female abbigale 17 0.017
## 10 1996 Female abby 1402 1.40
## # ... with 197,096 more rows
head(baby.names)
tail(baby.names)
## # A tibble: 6 x 6
## Year Sex Name Count Thousands Decade
## <int> <chr> <chr> <int> <dbl> <chr>
## 1 2015 Male william 19 0.019 2010's
## 2 2015 Male wyatt 29 0.029 2010's
## 3 2015 Male xavier 11 0.011 2010's
## 4 2015 Male zander 6 0.006 2010's
## 5 2015 Male zane 6 0.006 2010's
## 6 2015 Male zayden 6 0.006 2010's
Exercise 3
1. If you look at unique(baby.names$Sex) you’ll notice that some records indicate Male with "M", while
other records use "Male". Correct this by replacing "M" with "Male".
2. Create a column named “Popular” containing a 1 in rows where Count is greater than 30000 and a 0
otherwise.
3. Filter the baby names data to display only the popular names.
Exercise 3 solution
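## 1. Replace "M" with "Male", leaving all other values unchanged
baby.names <- mutate(baby.names,
                     Sex = case_when(Sex == "M" ~ "Male",
                                     TRUE ~ Sex))
## 2. Flag rows with a Count of at least 30000 (stored as TRUE/FALSE)
baby.names <- mutate(baby.names,
                     Popular = case_when(Count < 30000 ~ FALSE,
                                         Count >= 30000 ~ TRUE))
## 3. Keep only the rows flagged as popular
filter(baby.names, Popular)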
## # A tibble: 17 x 7
## Year Sex Name Count Thousands Decade Popular
## <int> <chr> <chr> <int> <dbl> <chr> <lgl>
## 1 1996 Male christopher 30870 30.9 1990's TRUE
## 2 1996 Male jacob 31864 31.9 1990's TRUE
## 3 1996 Male matthew 32031 32.0 1990's TRUE
## 4 1996 Male michael 38322 38.3 1990's TRUE
## 5 1997 Male jacob 34090 34.1 1990's TRUE
## 6 1997 Male matthew 31480 31.5 1990's TRUE
## 7 1997 Male michael 37505 37.5 1990's TRUE
## 8 1998 Male jacob 35958 36.0 1990's TRUE
## 9 1998 Male matthew 31091 31.1 1990's TRUE
## 10 1998 Male michael 36569 36.6 1990's TRUE
## 11 1999 Male jacob 35306 35.3 1990's TRUE
## 12 1999 Male matthew 30388 30.4 1990's TRUE
## 13 1999 Male michael 33854 33.9 1990's TRUE
## 14 2000 Male jacob 34418 34.4 2000's TRUE
## 15 2000 Male michael 31992 32.0 2000's TRUE
## 16 2001 Male jacob 32487 32.5 2000's TRUE
## 17 2002 Male jacob 30509 30.5 2000's TRUE
## # A tibble: 6 x 8
## # Groups: Year, Sex [1]
## Year Sex Name Count Thousands Decade Popular max_count
## <int> <chr> <chr> <int> <dbl> <chr> <lgl> <dbl>
## 1 1996 Female aaliyah 802 0.802 1990's FALSE 25150
## 2 1996 Female aarin 5 0.005 1990's FALSE 25150
## 3 1996 Female aaron 26 0.026 1990's FALSE 25150
## 4 1996 Female abagail 87 0.087 1990's FALSE 25150
## 5 1996 Female abbey 510 0.51 1990's FALSE 25150
## 6 1996 Female abbi 5 0.005 1990's FALSE 25150
tail(baby.names)
## # A tibble: 6 x 8
## # Groups: Year, Sex [1]
## Year Sex Name Count Thousands Decade Popular max_count
## <int> <chr> <chr> <int> <dbl> <chr> <lgl> <dbl>
## 1 2015 Male william 19 0.019 2010's FALSE 19485
## 2 2015 Male wyatt 29 0.029 2010's FALSE 19485
## 3 2015 Male xavier 11 0.011 2010's FALSE 19485
## 4 2015 Male zander 6 0.006 2010's FALSE 19485
## 5 2015 Male zane 6 0.006 2010's FALSE 19485
## 6 2015 Male zayden 6 0.006 2010's FALSE 19485
filter(baby.names,
Count == max_count)
Note that the data remains grouped until you change the groups by running group_by again or remove
grouping information with ungroup.
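A minimal sketch of this behavior on made-up data: group, add a per-group maximum, filter on it, then check the grouping before and after ungroup. The object toy and its values are invented; max_count mirrors the column name shown in the output above.

```r
library(dplyr)

## Made-up stand-in for the baby names data
toy <- data.frame(Year = c(1996L, 1996L, 1996L, 1997L, 1997L),
                  Sex = c("Female", "Female", "Male", "Female", "Male"),
                  Name = c("ann", "bea", "bob", "ann", "bob"),
                  Count = c(10L, 30L, 20L, 40L, 5L),
                  stringsAsFactors = FALSE)

## Add the largest Count within each Year x Sex group
toy <- mutate(group_by(toy, Year, Sex), max_count = max(Count))
## Keep the row holding the largest Count in each group
top <- filter(toy, Count == max_count)
## The grouping is still attached to the data...
grouped.vars <- group_vars(toy)
grouped.vars
## ...until it is removed explicitly
toy <- ungroup(toy)
group_vars(toy)
```

Calling ungroup once a grouped step is finished helps avoid surprises in later calls that would otherwise operate group by group.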
Grouping can be useful when modifying a data.frame with mutate or extracting subsets with filter, but it
really shines when combined with summarize. For example, suppose that we want the most popular girl and
boy names for each decade. In this case we need to summarize the data by summing the Count column for each
name within each Sex X Decade group.
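One way this could be sketched, using made-up data rather than the real baby names: sum Count for each name within each Sex X Decade combination, then keep the name with the largest total in each combination. The objects toy, totals and top and the column Total are hypothetical names chosen for illustration.

```r
library(dplyr)

## Made-up stand-in with a Decade column already present
toy <- data.frame(Decade = c("1990's", "1990's", "1990's", "1990's", "2000's", "2000's"),
                  Sex = c("Female", "Female", "Female", "Male", "Female", "Female"),
                  Name = c("ann", "ann", "bea", "bob", "ann", "bea"),
                  Count = c(10L, 15L, 20L, 7L, 5L, 9L),
                  stringsAsFactors = FALSE)

## Total Count for each name within each Decade x Sex combination
totals <- summarize(group_by(toy, Decade, Sex, Name), Total = sum(Count))
## Regroup explicitly by Decade and Sex rather than relying on leftover grouping
totals <- group_by(ungroup(totals), Decade, Sex)
## Keep the name with the largest total in each Decade x Sex group
top <- ungroup(filter(totals, Total == max(Total)))
top
```

Regrouping explicitly before the filter step keeps the intent visible and does not depend on which grouping levels summarize leaves in place.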
In the previous example we used sum and max, two examples of basic statistics functions in R. Other basic
statistics functions include:
- mean
- median
- sd
- var
- min
- quantile
- length
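A quick sketch of these functions applied to a made-up numeric vector (the name counts and its values are invented for illustration).

```r
## Made-up values
counts <- c(5, 10, 20, 40)

mean(counts)           # arithmetic mean
median(counts)         # middle value
sd(counts)             # standard deviation
var(counts)            # variance (the square of sd)
min(counts)            # smallest value
quantile(counts, 0.5)  # 50th percentile, same value as the median
length(counts)         # number of elements
```

Any of these can be used inside summarize in the same way as sum and max.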
Exporting Data
Now that we have made some changes to our data set, we might want to save those changes to a file.
## character(0)
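A minimal sketch of saving a data set with write_csv from the readr package, using a made-up data.frame and a temporary directory so that nothing in your working directory is overwritten. The file name toyNames.csv is hypothetical; for the workshop data you would pass baby.names and a file name of your choosing.

```r
library(readr)

## Made-up data to save
toy <- data.frame(Name = c("ann", "bob"), Count = c(10L, 20L))

## Hypothetical output location inside the session's temporary directory
out.file <- file.path(tempdir(), "toyNames.csv")

## Write the data as comma-separated text
write_csv(toy, out.file)

## Confirm that the file was created
file.exists(out.file)
```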
Exercise 4
1. Calculate the total number of children born.
2. Filter the data to extract data from 2004 and calculate the total number of children born in that year.
3. Calculate the number of boys and girls born each year. Assign the result to the name births.by.year .
Exercise 4 solution
## 1. Calculate the total number of children born.
baby.names <- ungroup(baby.names)
summarize(baby.names, Total = sum(Count))
## # A tibble: 1 x 1
## Total
## <int>
## 1 64519299
## # A tibble: 1 x 1
## Total
## <int>
## 1 3294392
## 3. Calculate the number of boys and girls born each year. Assign the result
## to the name `births.by.year`.
births.by.year <- summarize(group_by(baby.names, Year, Sex),
                            Count = sum(Count))
births.by.year
## # A tibble: 40 x 3
## # Groups: Year [?]
## Year Sex Count
## <int> <chr> <int>
## 1 1996 Female 1497121
## 2 1996 Male 1728057
## 3 1997 Female 1478802
## 4 1997 Male 1714552
## 5 1998 Female 1496488
## 6 1998 Male 1733194
## 7 1999 Female 1497247
## 8 1999 Male 1736147
## 9 2000 Female 1527247
## 10 2000 Male 1769871
## # ... with 30 more rows
Basic graphs
R has decent plotting tools built in (see, e.g., help(plot)). However, to make things easier on ourselves we
will use a contributed package called ggplot2 instead.
First, we’ll plot the number of boys and girls born each year.
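One way such a plot could be drawn with ggplot2 is sketched here. To keep the example self-contained it rebuilds only the first ten rows of births.by.year as printed above (1996 to 2000); the object names births and p are arbitrary, and the workshop's own plotting call may differ.

```r
library(ggplot2)

## First ten rows of births.by.year as printed above (1996-2000)
births <- data.frame(
  Year = rep(1996:2000, each = 2),
  Sex = rep(c("Female", "Male"), times = 5),
  Count = c(1497121, 1728057, 1478802, 1714552, 1496488,
            1733194, 1497247, 1736147, 1527247, 1769871))

## One line per sex, with Year on the x axis and Count on the y axis
p <- ggplot(births, aes(x = Year, y = Count, color = Sex)) +
  geom_line()
p
```

With the full data you would pass births.by.year to ggplot directly instead of rebuilding it by hand.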
Next, we’ll filter out the most popular girls names and plot their popularity over time.
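A sketch of plotting name popularity over time. It uses the Female counts for "ashley" and "jill" from 2000 to 2004 that are printed earlier in this document; these two names are chosen only because their values are available here, not because they are the most popular. With the full data you would first filter baby.names down to the names of interest. The log scale is an assumption made so that both lines stay visible.

```r
library(ggplot2)

## Female counts for two names, 2000-2004, as printed earlier
girls <- data.frame(
  Year = rep(2000:2004, times = 2),
  Name = rep(c("ashley", "jill"), each = 5),
  Count = c(17997, 16524, 15339, 14512, 14370,
            168, 130, 85, 83, 53))

## One line per name; a log scale keeps the much rarer name readable
p <- ggplot(girls, aes(x = Year, y = Count, color = Name)) +
  geom_line() +
  scale_y_log10()
p
```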
Exercise 5
1. Add a new column to the baby.names data equal to the proportion of boys and girls born each year with
each name. That is, calculate Proportion = Count/sum(Count) for each Year X Sex group.
2. Filter the baby.names data, retaining only the most popular girl and boy names for each year.
3. Plot proportion over time to see changes in the proportion of parents choosing the most popular name of
the year.
Exercise 5 solution
## 1. Add a new column to the baby.names data equal to the proportion
## of boys and girls born each year with each name. That is, calculate
## Proportion = Count/sum(Count) for each Year X Sex group.
## 2. Filter the baby.names data, retaining only the most popular girl
## and boy names for each year.
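A sketch of how these steps might fit together, run on a small made-up data.frame rather than the real baby.names (the objects toy, top and p and all values are invented for illustration). It is one possible approach under those assumptions, not necessarily the workshop's own solution.

```r
library(dplyr)
library(ggplot2)

## Made-up stand-in for the baby names data
toy <- data.frame(Year = rep(c(1996L, 1997L), each = 4),
                  Sex = rep(c("Female", "Female", "Male", "Male"), times = 2),
                  Name = rep(c("ann", "bea", "bob", "cal"), times = 2),
                  Count = c(30L, 10L, 60L, 20L, 25L, 15L, 45L, 15L),
                  stringsAsFactors = FALSE)

## 1. Proportion of each name within its Year x Sex group
toy <- mutate(group_by(toy, Year, Sex), Proportion = Count / sum(Count))

## 2. Keep the most popular name in each Year x Sex group, then drop the grouping
top <- ungroup(filter(toy, Count == max(Count)))

## 3. Plot the proportion over time, one line per sex
p <- ggplot(top, aes(x = Year, y = Proportion, color = Sex)) +
  geom_line()
p
```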
Additional resources
IQSS workshops: http://projects.iq.harvard.edu/rtc/filter_by/workshops