Https Tutorials Iq Harvard Edu R Rintro Rintro HTML

Introduction to R workshop notes

Welcome

Materials and setup
NOTE: skip this section if you are not running R locally (e.g., if you are running R in your browser using a remote Jupyter server).
You should have R installed. If not:
Download and install R from http://cran.r-project.org
Download and install RStudio from https://www.rstudio.com/products/rstudio/download/#download
Notes and examples for this workshop are available at
Start RStudio and open a new R script:
- On Windows, click the Start button and search for rstudio. On Mac, RStudio will be in your Applications folder.
- In RStudio, go to File -> New File -> R Script
What is R?
R is a programming language designed for statistical computing. Notable characteristics include:
Vast capabilities, wide range of statistical and graphical techniques

Very popular in academia, growing popularity in business: http://r4stats.com/articles/popularity/
Written primarily by statisticians
FREE (no monetary cost and open source)
Excellent community support: mailing list, blogs, tutorials
Easy to extend by writing new functions
Whatever you’re trying to do, you’re probably not the first to try doing it in R. Chances are good that someone has
already written a package for that.
Graphical User Interfaces (GUIs)

There are many different ways you can interact with R. See the Data Science Tools workshop notes for details.

For this workshop I encourage you to use RStudio; it is a good R-specific IDE that mostly just works.
Launch RStudio (skip if not using RStudio)

Note: skip this section if you are not using RStudio (e.g., if you are running these examples in a Jupyter
notebook).
Start the RStudio program
In RStudio, go to File -> New File -> R Script
The window in the upper-left is your R script. This is where you will write instructions for R to carry out.
The window in the lower-left is the R console. This is where results will be displayed.
Exercise 0
The purpose of this exercise is to give you an opportunity to explore the interface provided by RStudio (or
whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an
opportunity to figure it out.
Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R
function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are
looking for should appear in a pop up!
1. Try to get R to add 2 plus 2.
2. Try to calculate the square root of 10.
3. R includes extensive documentation, including a file named “An introduction to R”. Try to find this help file.
Exercise 0 solution
## 1. 2 plus 2
2 + 2
## [1] 4

## or
sum(2, 2)
## [1] 4
## 2. square root of 10:

sqrt(10)
## [1] 3.162278
## or
10^(1/2)
## [1] 3.162278
## 3. Find "An Introduction to R".
## Go to the main help page by running 'help.start()' or using the GUI
## menu, find and click on the link to "An Introduction to R".
R basics
Function calls
The general form for calling R functions is
## FunctionName(arg.1 = value.1, arg.2 = value.2, ..., arg.n = value.n)
Arguments can be matched by name; unnamed arguments will be matched by position.
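To make the matching rule concrete, here is a small sketch using round(), whose two arguments are named x and digits. The variable names on the left are only for illustration.

```r
# round() has two arguments, named x and digits.
by_position <- round(3.14159, 2)               # both arguments matched by position
by_name     <- round(digits = 2, x = 3.14159)  # matched by name, so the order does not matter
mixed       <- round(3.14159, digits = 2)      # first by position, second by name

by_position
by_name
mixed
```

All three calls ask for the same thing, so they return the same value.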
Assignment
Values can be assigned names and used in subsequent operations
The <- operator (less than followed by a dash) is used to save values
The name on the left gets the value on the right.
sqrt(10) ## calculate square root of 10; result is not stored anywhere
## [1] 3.162278
x <- sqrt(10) # assign result to a variable named x
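Once a value has a name, the name can stand in for the value in later calculations. A minimal sketch (the name y is just an illustration):

```r
x <- sqrt(10)   # store the result under the name x
x               # typing a name prints its value
y <- x^2        # x can be reused in later calculations

# Squaring the square root recovers 10, up to floating-point rounding.
all.equal(y, 10)
```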
Asking R for help

You can ask R for help using the help function, or the ? shortcut.
help(help)
The help function can be used to look up the documentation for a function or for a package. We can learn how to
use the stats package by reading its documentation like this:
help(package = "stats")
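As a small sketch of the ? shortcut mentioned above: the two commented-out lines below are equivalent ways to open the help page for sqrt. They are commented out only because they open a help viewer when run. The name sqrt_help is an illustration, not part of the workshop materials.

```r
## help(sqrt)
## ?sqrt

# help() returns an object that is displayed when it is printed, so
# assigning the result looks the topic up without opening the viewer.
sqrt_help <- help("sqrt")
class(sqrt_help)

# args() is a quick way to see a function's argument names.
args(round)
```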
Example project: baby names!

General goals
I would like to know what the most popular baby names are. In the course of answering this question we will
learn to call R functions, install and load packages, assign values to names, read and write data, and more.
Data sets
The examples in this workshop use the baby names data provided by the governments of the United States and
the United Kingdom. A cleaned and merged version of these data is available at
http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv .
Getting data into R

R has data reading functionality built-in – see e.g., help(read.table) . However, faster and more robust tools
are available, and so to make things easier on ourselves we will use a contributed package called readr
instead. This requires that we learn a little bit about packages in R.
Installing and using R packages

A large number of contributed packages are available. If you are looking for a package for a specific task,
https://cran.r-project.org/web/views/ and https://r-pkg.org are good places to start.
You can install a package in R using the install.packages() function. Once a package is installed you may
use the library function to attach it so that it can be used.
## install.packages("readr")
library(readr)
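If you re-run a script often, you may prefer to install a package only when it is missing. The following is one possible pattern, not the only one; it assumes a CRAN mirror is configured if the install step is actually needed.

```r
# Install readr only if it is not already available, then attach it.
if (!requireNamespace("readr", quietly = TRUE)) {
  install.packages("readr")
}
library(readr)

# A function can also be called without attaching its package, using ::
readr::parse_number("1,402")
```

The package::function form is handy when you need just one function, or when two attached packages define functions with the same name.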
Readers for common file types

In order to read data from a file, you have to know what kind of file it is. The table below lists functions that can
import data from common plain-text formats.
Data Type                  Function
comma separated            read_csv()
tab separated              read_tsv()
other delimited formats    read_delim()
whitespace separated       read_table()
fixed width                read_fwf()
Note: You may be confused by the existence of similar functions, e.g., read.csv and read.delim. These are
legacy functions that tend to be slower and less robust than the readr functions. One way to tell them apart is
that the faster, more robust versions use underscores in their names (e.g., read_csv) while the older functions
use dots (e.g., read.csv). My advice is to use the more robust newer versions, i.e., the ones with underscores.
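The sketch below tries the readers on a tiny file so you can see them side by side without downloading anything. The two data rows are copied from output shown later in these notes; the temporary file paths and object names are illustrative assumptions.

```r
library(readr)

# Write a tiny comma-separated file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Year,Sex,Name,Count",
             "1996,Female,emily,25150",
             "1996,Male,michael,38322"),
           csv_path)
toy_csv <- read_csv(csv_path)

# The same data, tab separated, read with read_tsv().
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("Year\tSex\tName\tCount",
             "1996\tFemale\temily\t25150",
             "1996\tMale\tmichael\t38322"),
           tsv_path)
toy_tsv <- read_tsv(tsv_path)

# The older base R reader handles the same file and returns a plain data.frame.
toy_base <- read.csv(csv_path)

toy_csv
```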
Exercise 1
The purpose of this exercise is to practice reading data into R.
1. Open the help page for the read_csv function. How can you limit the number of rows to be read in?
2. Read just the first 10 rows of "http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv".
3. Once you have successfully read in the first 10 rows, read the whole file, assigning the result to the name
baby.names .
Exercise 1 solution
## read ?read_csv
## limit rows with the n_max argument
read_csv("http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv",
         n_max = 10)
## Parsed with column specification:

## cols(
## Year = col_integer(),
## Sex = col_character(),
## Name = col_character(),
## Count = col_integer()
## )
## read all the data

baby.names <- read_csv("http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv")

## Parsed with column specification:
## cols(
## Year = col_integer(),
## Sex = col_character(),
## Name = col_character(),
## Count = col_integer()
## )
Data Manipulation
data.frame objects
Usually data read into R will be stored as a data.frame
A data.frame is a list of vectors of equal length
Each vector in the list forms a column
Each column can be a different type of vector
Typically columns are variables and the rows are observations
A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order)
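These properties can be checked directly on a small hand-built data.frame. This is a minimal sketch using base R only; the object name toy is an assumption, and the three rows reuse name and count pairs that appear in the output later in these notes.

```r
# Build a data.frame from two vectors of equal length.
toy <- data.frame(Name  = c("jack", "jill", "ashley"),
                  Count = c(4240L, 306L, 23676L),
                  stringsAsFactors = FALSE)

is.list(toy)        # a data.frame is a list ...
length(toy)         # ... with one element per column
sapply(toy, class)  # each column can have its own type
dim(toy)            # number of rows first, then number of columns
toy$Count           # a single column is an ordinary vector
```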
Tools for manipulating data.frame objects

R has decent data manipulation tools built-in – see e.g., help(Extract) . However, these tools are powerful
and complex and often overwhelm beginners. To make things easier on ourselves we will use a contributed
package called dplyr instead.
## install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':

##
## intersect, setdiff, setequal, union
Checking imported data

It is always a good idea to examine the imported data set. Usually we want the result to be a data.frame.
class(baby.names) # check to see that it is a data.frame
## [1] "tbl_df" "tbl" "data.frame"
We can get more information about R objects using the glimpse function.
glimpse(baby.names) # structure
## Observations: 197,106
## Variables: 4
## $ Year <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex <chr> "Female", "Female", "Female", "Female", "Female", "Femal...
## $ Name <chr> "aaliyah", "aarin", "aaron", "abagail", "abbey", "abbi",...
## $ Count <int> 802, 5, 26, 87, 510, 5, 311, 235, 17, 1402, 8, 5, 5, 6, ...
View(baby.names) # visually inspect
Filter data.frame rows

You can extract subsets of data.frames using filter to select rows meeting some condition.

## rows where Name == "jill"
filter(baby.names, Name == "jill")
## # A tibble: 19 x 4
## Year Sex Name Count
## <int> <chr> <chr> <int>
## 1 1996 Female jill 306
## 11 2006 Female jill 108
## rows where Year is 1996 and Name is either "jack" or "jill"

filter(baby.names, Year == 1996 & Name %in% c("jack", "jill"))
## # A tibble: 2 x 4
## 2 1996 M jack 4240
In the previous example we used == to filter rows. Other relational and logical operators are listed below.

Operator   Meaning
==         equal to
!=         not equal to
>          greater than
>=         greater than or equal to
<          less than
<=         less than or equal to
%in%       contained in
These operators may be combined with & (and) or | (or).
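Here is a short sketch of combining conditions inside filter. The data are a toy table with made-up counts, and the object names are illustrative.

```r
library(dplyr)

# Toy data with made-up counts, only to show how conditions combine.
toy <- data.frame(Year  = c(1996, 1996, 2006, 2006),
                  Name  = c("jack", "jill", "jack", "jill"),
                  Count = c(40, 30, 10, 20))

# & keeps rows where both conditions hold.
both <- filter(toy, Year == 1996 & Count > 35)

# | keeps rows where at least one condition holds.
either <- filter(toy, Year == 2006 | Count >= 40)

# ! negates a condition; here it keeps every name except "jack".
not_jack <- filter(toy, !(Name %in% c("jack")))

both
either
not_jack
```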
Exercise 2: Data Extraction

Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names . The file
is located at “http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv”
1. Extract data for the name “ashley”.
2. Restrict the previous extraction to include only years between 2000 and 2004.
Exercise 2 solution
## 1. Extract data for the name "ashley".
filter(baby.names, Name == "ashley")

## # A tibble: 39 x 4
## 1 1996 Female ashley 23676
## 2 1996 M ashley 64
## 4 1997 M ashley 63
## 6 1998 M ashley 27
## 8 1999 M ashley 28
## 10 2000 M ashley 44
## # ... with 29 more rows
## 2. Restrict the previous extraction to include only years between 2000 and 2004.
filter(baby.names,
       Year >= 2000 &
       Year <= 2004 &
       Name == "ashley")
## # A tibble: 10 x 4
## 2 2000 M ashley 44
## 4 2001 M ashley 33
## 6 2002 M ashley 18
## 8 2003 M ashley 31
## 10 2004 M ashley 57
Adding, removing, and modifying data.frame columns
You can modify data.frames using the mutate function. It works like this:
baby.names <- mutate(baby.names, Thousands = Count/1000)

baby.names
## # A tibble: 197,106 x 5
## Year Sex Name Count Thousands
## <int> <chr> <chr> <int> <dbl>
## 1 1996 Female aaliyah 802 0.802
## 2 1996 Female aarin 5 0.005
## 3 1996 Female aaron 26 0.026
## 4 1996 Female abagail 87 0.087
## 5 1996 Female abbey 510 0.51
## 6 1996 Female abbi 5 0.005
## 7 1996 Female abbie 311 0.311
## 8 1996 Female abbigail 235 0.235
## 9 1996 Female abbigale 17 0.017
## 10 1996 Female abby 1402 1.40
## # ... with 197,096 more rows
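The example above adds a column. As a sketch of the other two operations in this section's title, the code below modifies a column with mutate and then removes it with select, another dplyr function. It works on a small toy data.frame so that baby.names is left untouched; the name toy is an assumption.

```r
library(dplyr)

toy <- data.frame(Name  = c("abbey", "abby"),
                  Count = c(510L, 1402L),
                  stringsAsFactors = FALSE)

# Add a column.
toy <- mutate(toy, Thousands = Count / 1000)

# Modify it: reusing an existing column name overwrites that column.
toy <- mutate(toy, Thousands = round(Thousands, 1))
toy

# Remove it: a minus sign in select() drops the named column.
toy <- select(toy, -Thousands)
names(toy)
```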
Often one needs to replace values conditionally, as in the following example:
baby.names <- mutate(
  baby.names,
  Decade = case_when(Year < 2000 ~ "1990's",
                     Year >= 2000 & Year < 2010 ~ "2000's",
                     Year < 2020 ~ "2010's",
                     TRUE ~ as.character(Year)))
head(baby.names)

## Year Sex Name Count Thousands Decade
## <int> <chr> <chr> <int> <dbl> <chr>
## 1 1996 Female aaliyah 802 0.802 1990's
## 2 1996 Female aarin 5 0.005 1990's
## 3 1996 Female aaron 26 0.026 1990's
## 4 1996 Female abagail 87 0.087 1990's
## 5 1996 Female abbey 510 0.51 1990's
## 6 1996 Female abbi 5 0.005 1990's
tail(baby.names)
## Year Sex Name Count Thousands Decade
## <int> <chr> <chr> <int> <dbl> <chr>
## 1 2015 Male william 19 0.019 2010's
## 2 2015 Male wyatt 29 0.029 2010's
## 3 2015 Male xavier 11 0.011 2010's
## 4 2015 Male zander 6 0.006 2010's
## 5 2015 Male zane 6 0.006 2010's
## 6 2015 Male zayden 6 0.006 2010's
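case_when checks its conditions in order and uses the first one that is true for each element, which is why the Decade example can write Year < 2020 without also excluding earlier years. A small sketch on a plain vector (the years are arbitrary illustration values):

```r
library(dplyr)

years <- c(1996, 2004, 2015, 2025)

decade <- case_when(years < 2000 ~ "1990's",
                    years < 2010 ~ "2000's",   # only reached when years >= 2000
                    years < 2020 ~ "2010's",   # only reached when years >= 2010
                    TRUE ~ as.character(years)) # fallback for everything else
decade
```

The final TRUE condition catches anything not matched earlier, here the year 2025.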
Exercise 3: Data manipulation

Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names . The file
is located at “http://tutorials.iq.harvard.edu/R/Rintro/dataSets/babyNames.csv”
1. If you look at unique(baby.names$Sex) you’ll notice that some records indicate Male with "M", while
other records use "Male". Correct this by replacing "M" with "Male".
2. Create a column named “Popular” containing TRUE in rows where Count is at least 30000 and FALSE
otherwise.
3. Filter the baby names data to display only the popular names.
Exercise 3 solution
# 1
baby.names <- mutate(baby.names,
                     Sex = case_when(Sex == "M" ~ "Male",
                                     TRUE ~ Sex))
# 2
baby.names <- mutate(baby.names,
                     Popular = case_when(Count < 30000 ~ FALSE,
                                         Count >= 30000 ~ TRUE))
# 3
filter(baby.names, Popular)
## # A tibble: 17 x 7
## Year Sex Name Count Thousands Decade Popular
## <int> <chr> <chr> <int> <dbl> <chr> <lgl>
## 1 1996 Male christopher 30870 30.9 1990's TRUE
## 2 1996 Male jacob 31864 31.9 1990's TRUE
## 3 1996 Male matthew 32031 32.0 1990's TRUE
## 4 1996 Male michael 38322 38.3 1990's TRUE
## 5 1997 Male jacob 34090 34.1 1990's TRUE
## 8 1998 Male jacob 35958 36.0 1990's TRUE
## 11 1999 Male jacob 35306 35.3 1990's TRUE
## 14 2000 Male jacob 34418 34.4 2000's TRUE
## 16 2001 Male jacob 32487 32.5 2000's TRUE
## 17 2002 Male jacob 30509 30.5 2000's TRUE
Grouping and Aggregation

So far we’ve seen that “Jacob”, “Matthew”, and “Michael” tend to be popular names. That isn’t very satisfying,
because it leaves us wanting to know which girls’ names are popular, and perhaps how popularity has changed
over time. To answer these questions we need to operate on groups within the data rather than on the whole data
structure at once. The dplyr package makes this easy to do using the group_by function.
baby.names <- mutate(group_by(baby.names, Year, Sex),
                     max_count = max(Count))
head(baby.names)
## # Groups: Year, Sex [1]
## Year Sex Name Count Thousands Decade Popular max_count
## <int> <chr> <chr> <int> <dbl> <chr> <lgl> <dbl>
## 1 1996 Female aaliyah 802 0.802 1990's FALSE 25150
## 2 1996 Female aarin 5 0.005 1990's FALSE 25150
## 3 1996 Female aaron 26 0.026 1990's FALSE 25150
## 4 1996 Female abagail 87 0.087 1990's FALSE 25150
## 5 1996 Female abbey 510 0.51 1990's FALSE 25150
## 6 1996 Female abbi 5 0.005 1990's FALSE 25150
tail(baby.names)
## 1 2015 Male william 19 0.019 2010's FALSE 19485
## 2 2015 Male wyatt 29 0.029 2010's FALSE 19485
## 3 2015 Male xavier 11 0.011 2010's FALSE 19485
## 4 2015 Male zander 6 0.006 2010's FALSE 19485
## 5 2015 Male zane 6 0.006 2010's FALSE 19485
## 6 2015 Male zayden 6 0.006 2010's FALSE 19485
filter(baby.names,
       Count == max_count)

## # A tibble: 40 x 8
## 1 1996 Female emily 25150 25.2 1990's FALSE 25150
## 2 1996 Male michael 38322 38.3 1990's TRUE 38322
## 8 1999 Male jacob 35306 35.3 1990's TRUE 35306
## 10 2000 Male jacob 34418 34.4 2000's TRUE 34418
Note that the data remains grouped until you change the groups by running group_by again or remove
grouping information with ungroup .
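The following sketch shows the grouping sticking around and then being removed. It uses four rows taken from the 1996 output above; the object names are illustrative.

```r
library(dplyr)

toy <- data.frame(Year  = c(1996, 1996, 1996, 1996),
                  Sex   = c("Female", "Female", "Male", "Male"),
                  Name  = c("emily", "abbey", "michael", "jacob"),
                  Count = c(25150, 510, 38322, 31864),
                  stringsAsFactors = FALSE)

# max() is computed separately within each Year x Sex group.
grouped <- mutate(group_by(toy, Year, Sex), max_count = max(Count))
group_vars(grouped)      # the grouping is still attached to the result

# Keep the top name in each group.
top <- filter(grouped, Count == max_count)

# After ungroup(), the same mutate() works on all rows at once.
ungrouped <- ungroup(grouped)
group_vars(ungrouped)    # no grouping variables left
overall <- mutate(ungrouped, max_count = max(Count))
overall
```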
Grouping can be useful when modifying a data.frame with mutate or extracting subsets with filter, but it
really shines when combined with summarize. For example, suppose that we want the most popular girl and
boy names for each decade. In this case we need to summarize the data by summing the Count column for each
name within each Decade X Sex group.
bn.by.decade <- summarize(group_by(baby.names, Decade, Sex, Name),
                          Count = sum(Count),
                          Thousands = sum(Thousands))
filter(group_by(bn.by.decade, Decade, Sex),
       Count == max(Count))

## # Groups: Decade, Sex [6]
## Decade Sex Name Count Thousands
## <chr> <chr> <chr> <int> <dbl>
## 1 1990's Female emily 103597 104.
## 2 1990's Male michael 146430 146.
## 3 2000's Female emily 223612 224.
## 4 2000's Male jacob 273690 274.
## 5 2010's Female sophia 121787 122.
## 6 2010's Male jacob 112227 112.
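To see what summarize does to the shape of the data, here is the same two-step pattern on a toy table small enough to check by hand. The counts are made up for illustration and the object names are assumptions.

```r
library(dplyr)

# Made-up yearly counts: two rows for emily in the 1990's, one for abbey,
# and one for emily in the 2000's.
toy <- data.frame(Decade = c("1990's", "1990's", "1990's", "2000's"),
                  Name   = c("emily", "emily", "abbey", "emily"),
                  Count  = c(10, 20, 5, 7),
                  stringsAsFactors = FALSE)

# Step 1: collapse to one row per Decade x Name, summing the counts.
totals <- summarize(group_by(toy, Decade, Name), Count = sum(Count))
totals

# Step 2: within each decade, keep the name with the largest total.
top <- filter(group_by(totals, Decade), Count == max(Count))
top
```

Unlike mutate, which keeps every row, summarize returns one row per group.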
In the previous example we used sum and max, two examples of basic statistics functions in R. Other basic
statistics functions include:
- mean
- median
- sd
- var
- min
- quantile
- length
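A quick sketch of these functions on a short numeric vector. The five values are the first five Count entries shown in the glimpse output earlier; the name counts is an assumption.

```r
counts <- c(802, 5, 26, 87, 510)

mean(counts)      # arithmetic mean
median(counts)    # middle value once sorted
min(counts)       # smallest value
length(counts)    # number of values
sd(counts)        # standard deviation
var(counts)       # variance, the square of the standard deviation
quantile(counts, probs = c(0.25, 0.5, 0.75))  # quartiles
```

Any of these can be used inside summarize in the same way sum and max were used above.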
Exporting Data
Now that we have made some changes to our data set, we might want to save those changes to a file.
# write data to a .csv file
write_csv(baby.names, "babyNames.csv")
# write data to an R data file (.rds)
write_rds(baby.names, "babyNames.rds")
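Either file can be read back with the matching reader. The sketch below does a round trip through temporary files on a toy data.frame, so it does not overwrite anything in your working directory; the file paths and object names are illustrative assumptions.

```r
library(readr)

toy <- data.frame(Name  = c("jack", "jill"),
                  Count = c(4240L, 306L),
                  stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(toy, csv_path)
write_rds(toy, rds_path)

# An .rds file stores the R object itself, so it comes back unchanged.
from_rds <- read_rds(rds_path)
identical(from_rds, toy)

# A .csv file is plain text, so column types are guessed again on reading.
from_csv <- read_csv(csv_path)
from_csv
```

A .csv file is the better choice for sharing data with other programs, while .rds is convenient when the data will only be used from R.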
Saving and loading R workspaces

In addition to importing individual datasets, R can save and load entire workspaces.
ls() # list objects in our workspace
## [1] "baby.names" "bn.by.decade" "fname" "oname"

## [5] "x"

save.image(file="myWorkspace.RData") # save workspace
rm(list=ls()) # remove all objects from our workspace
ls() # list stored objects to make sure they are deleted
## character(0)
## Load the "myWorkspace.RData" file and check that it is restored

load("myWorkspace.RData") # load myWorkspace.RData
ls() # list objects
## [1] "baby.names" "bn.by.decade" "fname" "oname"

## [5] "x"
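If you only want to keep a few objects rather than the whole workspace, base R's save function accepts the objects to store. A minimal sketch, writing to a temporary file so that nothing in your working directory is touched (the path and the name restored are assumptions):

```r
x <- sqrt(10)

rdata_path <- tempfile(fileext = ".RData")
save(x, file = rdata_path)   # store only x
rm(x)                        # remove it from the workspace
exists("x")                  # confirm that it is gone

# load() recreates the saved objects and returns their names.
restored <- load(rdata_path)
restored
x
```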
Exercise 4
1. Calculate the total number of children born.
2. Filter the data to extract data from 2004 and calculate the total number of children born in that year.
3. Calculate the number of boys and girls born each year. Assign the result to the name births.by.year .
Exercise 4 solution
## 1. Calculate the total number of children born.
baby.names <- ungroup(baby.names)
summarize(baby.names, Total = sum(Count))
## Total
## <int>
## 1 64519299

## 2. Filter the data to extract data from 2004 and calculate the total
## number of children born in that year.
summarize(filter(baby.names, Year == 2004), Total = sum(Count))
## Total
## <int>
## 1 3294392
## 3. Calculate the number of boys and girls born each year. Assign the result
## to the name `births.by.year`.
births.by.year <- summarize(group_by(baby.names, Year, Sex),
                            Count = sum(Count))
births.by.year
## # A tibble: 40 x 3
## # Groups: Year [?]
## Year Sex Count
## <int> <chr> <int>
## 1 1996 Female 1497121
## 2 1996 Male 1728057
## 3 1997 Female 1478802
## 4 1997 Male 1714552
## 5 1998 Female 1496488
## 6 1998 Male 1733194
## 7 1999 Female 1497247
## 8 1999 Male 1736147
## 9 2000 Female 1527247
## 10 2000 Male 1769871
Basic graphs
R has decent plotting tools built in; see e.g., help(plot). However, to make things easier on ourselves we
will use a contributed package called ggplot2 instead.

## install.packages("ggplot2")
library(ggplot2)
First, we’ll plot the number of boys and girls born each year.
qplot(Year, Count, color = Sex,
      geom = "line",
      data = births.by.year)
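Recent ggplot2 releases mark qplot as deprecated, so it may print a warning. The same plot can be written with the ggplot function. The sketch below is one way to do that; it uses a four-row stand-in for births.by.year, with values copied from the Exercise 4 output, so that it runs on its own.

```r
library(ggplot2)

# Stand-in for births.by.year (first four rows of the Exercise 4 output).
births <- data.frame(Year  = c(1996, 1996, 1997, 1997),
                     Sex   = c("Female", "Male", "Female", "Male"),
                     Count = c(1497121, 1728057, 1478802, 1714552))

# aes() maps columns to x, y, and color; geom_line() draws one line per Sex.
p <- ggplot(births, aes(x = Year, y = Count, color = Sex)) +
  geom_line()

# Typing p at the console (or calling print(p)) draws the plot.
```

With the real data, replace births with births.by.year.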
Next, we’ll pick out the most popular girls’ names and plot their popularity over time.

popular.girls <- filter(group_by(baby.names, Year, Sex),
                        Sex == "Female" & Count == max(Count))
qplot(Year, Count, color = Name,
      geom = "line",
      data = filter(baby.names,
                    Sex == "Female" & Name %in% popular.girls$Name))
Exercise 5
1. Add a new column to the baby.names data equal to the proportion of boys and girls born each year with
each name. That is, calculate Proportion = Count/sum(Count) for each Year X Sex group.
2. Filter the baby.names data, retaining only the most popular girl and boy names for each year.
3. Plot proportion over time to see changes in the proportion of parents choosing the most popular name of
the year.
Exercise 5 solution
## 1. Add a new column to the baby.names data equal to the proportion
## of boys and girls born each year with each name. That is, calculate
## Proportion = Count/sum(Count) for each Year X Sex group.
baby.names <- mutate(group_by(baby.names, Year, Sex),
                     Proportion = Count / sum(Count))
## 2. Filter the baby.names data, retaining only the most popular girl
## and boy names for each year.
bn.top.prop <- filter(group_by(baby.names, Year, Sex),
                      Proportion == max(Proportion))
## 3. Plot proportion over time to see changes in the proportion of
## parents choosing the most popular name of the year.
qplot(Year, Proportion, color = Sex,
      geom = "line",
      data = bn.top.prop)

Wrap-up
Help us make this workshop better!
Please take a moment to fill out a very short feedback form. These workshops exist for you – tell us what you
need! http://tinyurl.com/R-intro-feedback
Additional resources
IQSS workshops: http://projects.iq.harvard.edu/rtc/filter_by/workshops

IQSS statistical consulting: http://dss.iq.harvard.edu
Software (all free!):
R and R package download: http://cran.r-project.org
RStudio download: http://rstudio.org
ESS (emacs R package): http://ess.r-project.org/
Online tutorials
http://www.codeschool.com/courses/try-r
http://www.datacamp.org
http://swirlstats.com/
http://r4ds.had.co.nz/
Getting help:
Documentation and tutorials: http://cran.r-project.org/other-docs.html
Recommended R packages by topic: http://cran.r-project.org/web/views/
Mailing list: https://stat.ethz.ch/mailman/listinfo/r-help
StackOverflow: http://stackoverflow.com/questions/tagged/r
Coming from…
Stata: http://www.princeton.edu/~otorres/RStata.pdf
SAS/SPSS: http://www.et.bs.ehu.es/~etptupaf/pub/R/RforSAS&SPSSusers.pdf
matlab: http://www.math.umaine.edu/~hiebeler/comp/matlabR.pdf
Python: http://mathesaurus.sourceforge.net/matlab-python-xref.pdf
