Preparing Students for Big Data with R and RStudio
R Pruim
JSM 2014
Peering Around the Bend
Is There a Light at the End of the Tunnel?
Or are we adrift at sea?
Some Important Questions . . .
. . . that I’m (mostly) not going to answer
1. What is big data?

• Size? Structure? Hygiene? Workflow?

2. What are the key (big) data skills?
• Programming? Databases? Concepts?

3. Who needs (big) data skills?
• Science Majors? Stat Majors? CS majors? Special subsets?

4. When/how will these students get these skills?
• Special courses/programs? Thread through all courses?

5. Who will take the lead on Big Data Education?
• Statisticians? Computer Scientists? Natural Scientists?
6. Must we walk before we run?
• Can/should we get to the good/big stuff right away?
Is Big Data Primarily an HR Problem?
Or is it an IT problem?
Questions to answer
1. Can you use R in a first course?
2. Should you use R in a first course?

• Yes. It’s not too hard for you or your students.
• But you should use R wisely.
• R is a good (but not the only) choice.
• There are advantages to starting early.
• You need to match technology to your goals and your students.
So you want to use R early. . .
Good news: Anyone who can learn to use a TI calculator can learn to use R
• but neither one can be used effortlessly on day one
More good news: Using R actually makes some things easier
• creating notes (reproducible workflow)

• answering email
• student projects (reproducible workflow)
• can reveal thinking problems
Keys to how I use R
• Keep things simple
If R is the hardest thing in your course, then your R is too hard and your questions are too easy.
• Think of R as a programming language
focus on declarative rather than programming use of R
• Less Volume, More Creativity
Less Volume, More Creativity
"A lot of times you end up putting in a lot more
volume, because you are teaching fundamentals
and you are teaching concepts that you need to
put in, but you may not necessarily use because
they are building blocks for other concepts and
variations that will come off of that ... In the
offseason you have a chance to take a step back
and tailor it more specifically towards your team
and towards your players."
Mike McCarthy, Head Coach, Green Bay Packers
Less Volume, More Creativity
One key to successfully introducing R is finding a set of commands
that is
• small: fewer is better

• coherent: commands should be as similar as possible
• powerful: can do what needs doing
It is not enough to use R; it must be used elegantly.

Two examples of this principle:
• the formula-interface (using the mosaic package)

• the Hadley-verse (dplyr, tidyr, ggvis, etc.)
Minimal R
Enough R for Intro Stats (cran.r-project.org/web/packages/mosaic, mosaic-web.org)
Help
apropos()
?
??
example()
Basic Calculations
Basic calculation works like a calculator.
# basic ops: + - * / ^ ( )
log(); exp(); sqrt()
log10(); abs(); choose()
Formula Interface
The following syntax (often with some parts omitted) is used for graphical summaries, numerical summaries, and inference procedures.
goal(y ~ x | z, data=..., groups=...)
For plots:
• y: is y-axis variable
• x: is x-axis variable
• z: conditioning variable (separate panels)
• groups: conditioning variable (overlaid graphs)
For other things: ‘y ~ x | z’ can usually be read ‘y is modeled by (or depends on) x differently for each z’.
See the sampler for examples.
Numerical Summaries
These functions have a formula interface to match plotting.
favstats() # mosaic
tally() # mosaic
mean() # mosaic augmented
median() # mosaic augmented
sd() # mosaic augmented
var() # mosaic augmented
diffmean() # mosaic
quantile() # mosaic augmented
prop() # mosaic
perc() # mosaic
rank()
IQR() # mosaic augmented
min(); max() # mosaic augmented
Graphics (mostly lattice)
bwplot()
xyplot()
histogram() # mosaic augmented
densityplot()
freqpolygon() # mosaic
qqmath()
makeFun() # mosaic
plotFun() # mosaic
ladd() # mosaic
dotPlot() # mosaic
bargraph() # mosaic
xqqmath() # mosaic
mplot(data=HELPrct, 'scatter')
mplot(data=HELPrct, 'boxplot')
mplot(data=HELPrct, 'histogram')
Randomization/Simulation
rflip() # mosaic
do() # mosaic
sample() # mosaic augmented
resample() # with replacement
shuffle() # mosaic
rbinom()
rnorm() # etc, if needed
Distributions
pbinom(); pnorm();
xpnorm() # mosaic augmented
pchisq(); pt()
qbinom(); qnorm();
qchisq(); qt()
plotDist() # mosaic
Inference
t.test() # mosaic augmented
binom.test() # mosaic augmented
prop.test() # mosaic augmented
xchisq.test() # mosaic augmented
fisher.test()
pval() # mosaic
model <- lm() # linear models
summary(model)
coef(model)
confint(model) # mosaic augmented
anova(model)
makeFun(model) # mosaic
resid(model); fitted(model)
mplot(model) # mosaic
mplot(TukeyHSD(model))
model <- glm() # logistic reg.
Data
read.file() # mosaic
nrow(); ncol(); dim()
summary()
str()
names()
head(); tail()
with()
factor()
ntiles() # mosaic
cut()
c()
cbind(); rbind()
colnames()
rownames()
relevel()
reorder()
rep()
seq()
sort()
rank()
Data Transformation
select() # dplyr
mutate() # dplyr
filter() # dplyr
arrange() # dplyr
summarise() # dplyr
group_by() # dplyr
left_join() # dplyr
inner_join() # dplyr
merge()
Minimal R
R Sampler for Intro Stats (cran.r-project.org/web/packages/mosaic, mosaic-web.org)
rflip(6)
Flipping 6 coins [ Prob(Heads) = 0.5 ] ...
T H T H H T
Number of Heads: 3 [Proportion Heads: 0.5]
do(2) * rflip(6)
  n heads tails prop
1 6     3     3  0.5
2 6     3     3  0.5
coins <- do(1000) * rflip(6)
tally(~heads, data = coins)
  0   1   2   3   4   5   6
 17  98 243 346 194  91  11
tally(~heads, data = coins, format = "perc")
   0    1    2    3    4    5    6
 1.7  9.8 24.3 34.6 19.4  9.1  1.1
tally(~(heads >= 5 | heads <= 1), data = coins)
 TRUE FALSE
  217   783
histogram(~heads, data = coins, width = 1,
groups = (heads >= 5 | heads <= 1))
[Plot: density histogram of heads, 0 to 6]
tally(~sex + substance, data = HELPrct)
        substance
sex      alcohol cocaine heroin
  female      36      41     30
  male       141     111     94
mean(age ~ sex, data = HELPrct)
female   male
 36.25  35.47
diffmean(age ~ sex, data = HELPrct)
diffmean
 -0.7841
favstats(age ~ sex, data = HELPrct)
  .group min Q1 median   Q3 max  mean    sd   n missing
1 female  21 31     35 40.5  58 36.25 7.585 107       0
2   male  19 30     35 40.0  60 35.47 7.750 346       0
densityplot(~age | sex, groups = substance,
data = HELPrct, auto.key = TRUE)
[Plot: density of age, 10 to 70, in panels for female and male, with overlaid curves for alcohol, cocaine, heroin]
bwplot(age ~ substance | sex, data = HELPrct)
[Plot: boxplots of age, 20 to 60, by substance, in panels for female and male]
pval(binom.test(~sex, data = HELPrct))
  p.value
1.932e-30
confint(t.test(~age, data = HELPrct))
mean of x lower upper level
    35.65 34.94 36.37  0.95
model <- lm(weight ~ height + gender,
data=Heightweight)
wt <- makeFun(model)
wt( height=72, gender="male")
    1
179.1
xyplot(weight ~ height, groups=gender,
data=Heightweight)
plotFun(wt(h,gender="male") ~ h,
add=TRUE, col="skyblue")
plotFun(wt(h,gender="female") ~ h,
add=TRUE, col="navy")
[Plot: weight, 100 to 200, versus height, 55 to 75, with the two fitted lines added]
plotDist("chisq", df = 4)
[Plot: chi-square density with 4 degrees of freedom, 0 to 20]
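The coin-flipping lines of the sampler can also be checked against the exact binomial calculation. The following is a minimal sketch in base R only, so it runs without mosaic: rbinom() stands in for do(1000) * rflip(6), and the seed and the variable names are assumptions made for illustration, not part of the slides.

```r
# Base-R stand-in for `do(1000) * rflip(6)`: each draw is the number of
# heads in 6 fair coin flips. The seed is arbitrary (an assumption).
set.seed(2014)
heads <- rbinom(1000, size = 6, prob = 0.5)

# Counts of 0..6 heads; factor() keeps levels that happen to be empty.
counts <- table(factor(heads, levels = 0:6))
print(counts)

# Simulated proportion of "extreme" results (at most 1 or at least 5 heads).
sim_extreme <- mean(heads >= 5 | heads <= 1)

# Exact binomial probability of the same event: (1 + 6 + 6 + 1) / 64.
exact_extreme <- sum(dbinom(c(0, 1, 5, 6), size = 6, prob = 0.5))

print(c(simulated = sim_extreme, exact = exact_extreme))
```

The exact probability is 14/64 = 0.21875, which is close to the 217 out of 1000 that the sampler reports for the same event.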
The Most Important Template
goal( y ~ x , data = mydata )
Other versions:
# simpler version
goal( ~ x, data = mydata )
# fancier version
goal( y ~ x | z , data = mydata )
# unified version
goal( formula , data = mydata )
2 Questions
What do you want R to do? (goal)
• This determines the function to use
What must R know to do that?
• This determines the inputs to the function

• Must identify the variables and data frame
Example: How do we tell R to make this plot?
[Plot: births, 7000 to 10000, versus date, Jan to Jan]
What is the Goal?
• a scatter plot (xyplot())
What does R need to know?
• which variable goes where (births ~ date)
• which data set (data=Birthdays78)
xyplot( births ~ date, data=Birthdays78)
Your turn: How do you make this plot?
[Plot: boxplots of age, 20 to 60, by substance (alcohol, cocaine, heroin)]
Two Questions?
bwplot( age ~ substance, data=HELPrct)
Your turn: How about this one?
[Plot: horizontal boxplots with substance (alcohol, cocaine, heroin) on the y-axis and age, 20 to 60, on the x-axis]
bwplot( substance ~ age, data=HELPrct )
Graphical Summaries: One Variable
histogram( ~ age, data=HELPrct)
[Plot: density histogram of age, 20 to 60]
Note: When there is one variable it is on the right side of the formula.
Graphical Summaries Using the Template
One Variable
histogram( ~age, data=HELPrct )

densityplot( ~age, data=HELPrct )
bwplot( ~age, data=HELPrct )
qqmath( ~age, data=HELPrct )
freqpolygon( ~age, data=HELPrct )
bargraph( ~sex, data=HELPrct )
Two Variables
xyplot( i1 ~ age, data=HELPrct )

bwplot( age ~ substance, data=HELPrct )
bwplot( substance ~ age, data=HELPrct )
• i1: average number of drinks (standard units) consumed per day
Expanding the template: groups and panels
• Add groups = group to overlay.
• Use y ~ x | z to create multipanel plots.
densityplot( ~ age | sex, data=HELPrct,
groups=substance,
auto.key=TRUE)
[Plot: density of age, 10 to 70, in panels for female and male, with overlaid curves for alcohol, cocaine, heroin]
Bells & Whistles
Lots available
• titles
• axis labels
• colors
• sizes
• transparency
• etc, etc.
My approach:
• Let the students ask or

• Let the data analysis drive
Bells and Whistles
[Plot: births, 7000 to 10000, versus date, Jan to Jan, with a legend for day of the week, Sun through Sat]
Numerical Summaries: One Variable
Big idea: Replace plot name with summary name
• Nothing else changes
histogram( ~ age, data=HELPrct )
[Plot: density histogram of age, 20 to 60]
mean( ~ age, data=HELPrct )
[1] 35.65
Other Summaries
The mosaic package includes formula-aware versions of mean(), sd(), var(), min(), max(), sum(), IQR(), . . .
It also provides favstats() to compute our favorites.
favstats( ~ age, data=HELPrct )
 min Q1 median Q3 max  mean   sd   n missing
  19 30     35 40  60 35.65 7.71 453       0
Numerical Summaries: Two Variables
Three ways to think about this. All do the same thing.
sd( age ~ substance, data=HELPrct )
sd( ~ age | substance, data=HELPrct )
sd( ~ age, groups=substance, data=HELPrct )
alcohol cocaine  heroin
  7.652   6.693   7.986
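For comparison, the same "several spellings, one answer" idea can be sketched in base R without mosaic. This is only an illustration under stated assumptions: the built-in iris data stands in for HELPrct, and aggregate(), tapply() and split() are base R substitutes for the three mosaic calls, not the slides' method.

```r
# iris ships with base R (datasets package); it stands in for HELPrct here.

# Formula spelling: standard deviation of Sepal.Length within each Species.
by_formula <- aggregate(Sepal.Length ~ Species, data = iris, FUN = sd)

# Vector-plus-grouping spelling.
by_tapply <- with(iris, tapply(Sepal.Length, Species, sd))

# Split-then-summarise spelling.
by_split <- sapply(split(iris$Sepal.Length, iris$Species), sd)

print(by_formula)
print(by_tapply)
print(by_split)
```

All three return the same group standard deviations; only the way the grouping is expressed differs.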
Numerical Summaries: Tables
tally( ~ sex, data=HELPrct, margins=TRUE )
female   male  Total
   107    346    453
tally( sex ~ substance, data=HELPrct, format="count" )
        substance
sex      alcohol cocaine heroin
  female      36      41     30
  male       141     111     94
One Template to Rule a Lot
• single and multiple variable graphical summaries
• single and multiple variable numerical summaries
• linear models
mean( age ~ sex, data=HELPrct )
female   male
 36.25  35.47
bwplot( age ~ sex, data=HELPrct )
lm( age ~ sex, data=HELPrct )
(Intercept)     sexmale
    36.2523     -0.7841
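The numbers on this slide are linked: the intercept is the mean of the first group, and the sexmale coefficient is the difference between the two group means. A minimal base R sketch of the same pattern follows. It assumes the built-in ToothGrowth data as a stand-in for HELPrct, and uses aggregate() and boxplot() in place of mosaic's mean() and lattice's bwplot().

```r
# ToothGrowth ships with base R (datasets package); it stands in for HELPrct.
f <- len ~ supp

# Numerical summary: group means, via the formula method of aggregate().
group_means <- aggregate(f, data = ToothGrowth, FUN = mean)

# Graphical summary from the same formula. plot = FALSE returns the boxplot
# statistics without opening a graphics device; use the default
# (plot = TRUE) to draw the plot.
bp <- boxplot(f, data = ToothGrowth, plot = FALSE)

# Linear model: again the same formula.
model <- lm(f, data = ToothGrowth)

print(group_means)
print(bp$stats[3, ])  # group medians
print(coef(model))
```

Only the function name changes between the three calls. The formula and the data argument stay the same, which is the point of the template.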
Modeling
Modeling is really the starting point for the mosaic design.
• linear models (lm() and glm()) defined the template

• lattice graphics use the template (so we chose lattice)
• we added functionality so numerical summaries can be done
with the same template
• additional things added to make modeling easier for beginners.
Modeling and visualization are two key ingredients in dealing with Big Data
• both involve data reduction
Randomization – You can do() it
A general approach to randomization
1. Do it for your data
2. Do it for “random” data
3. Do it lots of times for “random” data
• definition of “random” is important, but can often be handled by shuffle() or resample()
Randomization – You can do() it
do(1) * lm(width ~ length, data=KidsFeet)
  Intercept length  sigma r.squared
1     2.862 0.2479 0.3963     0.411
do(3) * lm( width ~ shuffle(length), data=KidsFeet)
  Intercept    length  sigma r.squared
1    10.762 -0.071595 0.5075 0.0342683
2     5.362  0.146840 0.4778 0.1441500
3     9.203 -0.008535 0.5163 0.0004871
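The three-step recipe can also be sketched in base R, for readers without mosaic installed. This is an illustration under stated assumptions: the built-in cars data stands in for KidsFeet, sample() plays the role of shuffle(), replicate() plays the role of do(), and the helper slope() and the seed are hypothetical names and values chosen here.

```r
# Base-R sketch of the three steps, using the built-in `cars` data
# (dist ~ speed) as a stand-in for KidsFeet.
set.seed(2014)  # arbitrary seed, only to make the run repeatable

# Hypothetical helper: fitted slope of dist on speed for a data frame.
slope <- function(data) coef(lm(dist ~ speed, data = data))[["speed"]]

# 1. Do it for your data.
observed <- slope(cars)

# 2. Do it for "random" data: permute the explanatory variable once.
shuffled_once <- slope(transform(cars, speed = sample(speed)))

# 3. Do it lots of times for "random" data.
null_slopes <- replicate(1000, slope(transform(cars, speed = sample(speed))))

# Share of shuffled slopes at least as extreme as the observed one.
p_value <- mean(abs(null_slopes) >= abs(observed))
print(c(observed = observed, shuffled_once = shuffled_once, p_value = p_value))
```

Shuffling the explanatory variable breaks any association with the response, so the shuffled slopes show what the slope looks like when nothing is going on.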
Advantages to Starting R early
• gives students a taste for computation (without programming)
• focus on asking what computation can do, not how to do it
• provides a base to build on for programming later
• better workflow (reproducibility)
• the sky is the limit
Thanks
All of this is a work in progress that would not be as far as it is and
will not get as far as it can go without the help of others.
• Co-conspirators
Danny Kaplan (Macalester College), Nick Horton (Amherst College)
• Project MOSAIC workshop participants

• The Computation and Visualization Consortium
• My science colleagues at Calvin
• The team at RStudio
A note about my slides
The new support for documentation creation in RStudio is great.
• These slides are PDF, but I created them in RMarkdown (+ a little bit of LaTeX fiddling)
• A single RMarkdown file can generate PDF, HTML, or Word
• no need to know HTML, LaTeX or Word
• but if you do, you can take advantage
