
Preparing Students for Big Data with R and RStudio

R Pruim

JSM 2014
Peering Around the Bend
Is There a Light at the End of the Tunnel?
Or are we adrift at sea?
Some Important Questions . . .
. . . that I’m (mostly) not going to answer

1. What is big data?
• Size? Structure? Hygiene? Workflow?
2. What are the key (big) data skills?
• Programming? Databases? Concepts?
3. Who needs (big) data skills?
• Science Majors? Stat Majors? CS majors? Special subsets?
4. When/how will these students get these skills?
• Special courses/programs? Thread through all courses?
5. Who will take the lead on Big Data Education?
• Statisticians? Computer Scientists? Natural Scientists?
6. Must we walk before we run?
• Can/should we get to the good/big stuff right away?
Is Big Data Primarily an HR Problem?
Or is it an IT problem?
Questions to answer
1. Can you use R in a first course?
• Yes. It’s not too hard for you or your students.
• But you should use R wisely.
2. Should you use R in a first course?
• R is a good (but not the only) choice.
• There are advantages to starting early.
• You need to match technology to your goals and your students.
So you want to use R early. . .
Good news: Anyone who can learn to use a TI calculator can
learn to use R
• but neither one can be used effortlessly on day one
So you want to use R early. . .
More good news: Using R actually makes some things easier

• creating notes (reproducible workflow)


• answering email
• student projects (reproducible workflow)
• can reveal thinking problems
So you want to use R early. . .
Keys to how I use R

• Keep things simple

If R is the hardest thing in your course, then your R is too
hard and your questions are too easy.

• Think of R as a programming language

focus on declarative rather than programming use of R

• Less Volume, More Creativity


Less Volume, More Creativity
"A lot of times you end up putting in a lot more
volume, because you are teaching fundamentals
and you are teaching concepts that you need to
put in, but you may not necessarily use because
they are building blocks for other concepts and
variations that will come off of that ... In the
offseason you have a chance to take a step back
and tailor it more specifically towards your team
and towards your players."

Mike McCarthy, Head Coach, Green Bay Packers
Less Volume, More Creativity
One key to successfully introducing R is finding a set of commands
that is

• small: fewer is better


• coherent: commands should be as similar as possible
• powerful: can do what needs doing

It is not enough to use R; it must be used elegantly.


Two examples of this principle:

• the formula-interface (using the mosaic package)


• the Hadley-verse (dplyr, tidyr, ggvis, etc.)
Minimal R
Enough R for Intro Stats
cran.r-project.org/web/packages/mosaic    mosaic-web.org

Help

apropos()
?
??
example()

Basic Calculations

Basic calculation works like a calculator.

# basic ops: + - * / ^ ( )
log(); exp(); sqrt()
log10(); abs(); choose()

Formula Interface

The following syntax (often with some parts omitted) is used for graphical
summaries, numerical summaries, and inference procedures.

goal(y ~ x | z, data=...,
     groups=...)

For plots:
• y: is y-axis variable
• x: is x-axis variable
• z: conditioning variable (separate panels)
• groups: conditioning variable (overlaid graphs)

For other things:
‘y ~ x | z’ can usually be read ‘y is modeled by (or depends on) x differently
for each z’.

See the sampler for examples.

Numerical Summaries

These functions have a formula interface to match plotting.

favstats()          # mosaic
tally()             # mosaic
mean()              # mosaic augmented
median()            # mosaic augmented
sd()                # mosaic augmented
var()               # mosaic augmented
diffmean()          # mosaic
quantile()          # mosaic augmented
prop()              # mosaic
perc()              # mosaic
rank()
IQR()               # mosaic augmented
min(); max()        # mosaic augmented

Graphics (mostly lattice)

bwplot()
xyplot()
histogram()         # mosaic augmented
densityplot()
freqpolygon()       # mosaic
qqmath()
makeFun()           # mosaic
plotFun()           # mosaic
ladd()              # mosaic
dotPlot()           # mosaic
bargraph()          # mosaic
xqqmath()           # mosaic
mplot(data=HELPrct, 'scatter')
mplot(data=HELPrct, 'boxplot')
mplot(data=HELPrct, 'histogram')

Randomization/Simulation

rflip()             # mosaic
do()                # mosaic
sample()            # mosaic augmented
resample()          # with replacement
shuffle()           # mosaic
rbinom()
rnorm()             # etc, if needed

Distributions

pbinom(); pnorm();
xpnorm()            # mosaic augmented
pchisq(); pt()
qbinom(); qnorm();
qchisq(); qt()
plotDist()          # mosaic

Inference

t.test()            # mosaic augmented
binom.test()        # mosaic augmented
prop.test()         # mosaic augmented
xchisq.test()       # mosaic augmented
fisher.test()
pval()              # mosaic
model <- lm()       # linear models
summary(model)
coef(model)
confint(model)      # mosaic augmented
anova(model)
makeFun(model)      # mosaic
resid(model); fitted(model)
mplot(model)        # mosaic
mplot(TukeyHSD(model))
model <- glm()      # logistic reg.

Data

read.file()         # mosaic
nrow(); ncol(); dim()
summary()
str()
names()
head(); tail()
with()
factor()
ntiles()            # mosaic
cut()
c()
cbind(); rbind()
colnames()
rownames()
relevel()
reorder()
rep()
seq()
sort()
rank()

Data Transformation

select()            # dplyr
mutate()            # dplyr
filter()            # dplyr
arrange()           # dplyr
summarise()         # dplyr
group_by()          # dplyr
left_join()         # dplyr
inner_join()        # dplyr
merge()
Minimal R
R Sampler for Intro Stats
cran.r-project.org/web/packages/mosaic    mosaic-web.org

rflip(6)

Flipping 6 coins [ Prob(Heads) = 0.5 ] ...

T H T H H T

Number of Heads: 3 [Proportion Heads: 0.5]

do(2) * rflip(6)

  n heads tails prop
1 6     3     3  0.5
2 6     3     3  0.5

coins <- do(1000) * rflip(6)
tally(~heads, data = coins)

  0   1   2   3   4   5   6
 17  98 243 346 194  91  11

tally(~heads, data = coins, format = "perc")

   0    1    2    3    4    5    6
 1.7  9.8 24.3 34.6 19.4  9.1  1.1

tally(~(heads >= 5 | heads <= 1), data = coins)

 TRUE FALSE
  217   783

histogram(~heads, data = coins, width = 1,
          groups = (heads >= 5 | heads <= 1))

[Figure: density histogram of heads (0 to 6), grouped by heads >= 5 | heads <= 1]

tally(~sex + substance, data = HELPrct)

        substance
sex      alcohol cocaine heroin
  female      36      41     30
  male       141     111     94

mean(age ~ sex, data = HELPrct)

female   male
 36.25  35.47

diffmean(age ~ sex, data = HELPrct)

diffmean
 -0.7841

favstats(age ~ sex, data = HELPrct)

  .group min Q1 median   Q3 max  mean    sd   n missing
1 female  21 31     35 40.5  58 36.25 7.585 107       0
2   male  19 30     35 40.0  60 35.47 7.750 346       0

densityplot(~age | sex, groups = substance,
            data = HELPrct, auto.key = TRUE)

[Figure: density plots of age in female and male panels, one curve per substance (alcohol, cocaine, heroin)]

bwplot(age ~ substance | sex, data = HELPrct)

[Figure: boxplots of age by substance in female and male panels]

pval(binom.test(~sex, data = HELPrct))

  p.value
1.932e-30

confint(t.test(~age, data = HELPrct))

mean of x     lower     upper     level
    35.65     34.94     36.37      0.95

model <- lm(weight ~ height + gender,
            data=Heightweight)
wt <- makeFun(model)
wt( height=72, gender="male")

    1
179.1

xyplot(weight ~ height, groups=gender,
       data=Heightweight)
plotFun(wt(h,gender="male") ~ h,
        add=TRUE, col="skyblue")
plotFun(wt(h,gender="female") ~ h,
        add=TRUE, col="navy")

[Figure: scatter plot of weight against height, grouped by gender, with the two fitted lines overlaid]

plotDist("chisq", df = 4)

[Figure: density curve of the chi-squared distribution with 4 degrees of freedom]
The Most Important Template

goal( y ~ x , data = mydata )

Other versions:

# simpler version
goal( ~ x, data = mydata )
# fancier version
goal( y ~ x | z , data = mydata )
# unified version
goal( formula , data = mydata )
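
As a rough sketch of how these versions look in practice, the lattice plotting functions used later in the deck can be called on data sets that ship with R. The choice of `cars` and `ToothGrowth` is an assumption made here for illustration only; the slides themselves use HELPrct and other data.

```r
library(lattice)  # lattice is installed with standard R distributions

# simpler version: goal( ~ x, data = mydata )
p1 <- histogram(~ speed, data = cars)

# two-variable version: goal( y ~ x, data = mydata )
p2 <- xyplot(dist ~ speed, data = cars)

# fancier version: goal( y ~ x | z, data = mydata )
# one panel per supplement type (supp)
p3 <- bwplot(len ~ factor(dose) | supp, data = ToothGrowth)

# unified version: the formula is just another input to the function
f  <- dist ~ speed
p4 <- xyplot(f, data = cars)

# lattice plots are objects; they are drawn when printed, e.g. print(p3)
```

In each call only the goal and the formula change; the `data =` argument always says where the variables live.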
2 Questions
What do you want R to do? (goal)

• This determines the function to use

What must R know to do that?

• This determines the inputs to the function


• Must identify the variables and data frame
Example: How do we tell R to make this plot?

[Figure: scatter plot of births (about 7000 to 10000) against date (Jan to Jan)]

What is the Goal?

• a scatter plot (xyplot())

What does R need to know?

• which variable goes where (births ~ date)


• which data set (data=Birthdays78)
Example: How do we tell R to make this plot?
xyplot( births ~ date, data=Birthdays78)

[Figure: the resulting scatter plot of births against date]
Your turn: How do you make this plot?
[Figure: vertical boxplots of age (y-axis, 20 to 60) for each substance (alcohol, cocaine, heroin)]

Two Questions?
Your turn: How do you make this plot?
bwplot( age ~ substance, data=HELPrct)

[Figure: the resulting vertical boxplots of age by substance]
Your turn: How about this one?
[Figure: horizontal boxplots of age (x-axis, 20 to 60) for each substance (alcohol, cocaine, heroin)]
Your turn: How about this one?
bwplot( substance ~ age, data=HELPrct )

[Figure: the resulting horizontal boxplots of age by substance]
Graphical Summaries: One Variable
histogram( ~ age, data=HELPrct)

[Figure: density histogram of age]

Note: When there is one variable it is on the right side of the
formula.
Graphical Summaries Using the Template
One Variable

histogram( ~age, data=HELPrct )


densityplot( ~age, data=HELPrct )
bwplot( ~age, data=HELPrct )
qqmath( ~age, data=HELPrct )
freqpolygon( ~age, data=HELPrct )
bargraph( ~sex, data=HELPrct )

Two Variables

xyplot( i1 ~ age, data=HELPrct )


bwplot( age ~ substance, data=HELPrct )
bwplot( substance ~ age, data=HELPrct )
• i1: average number of drinks (standard units) consumed per day
Expanding the template: groups and panels
• Add groups = group to overlay.
• Use y ~ x | z to create multipanel plots.

densityplot( ~ age | sex, data=HELPrct,
             groups=substance,
             auto.key=TRUE)

[Figure: density plots of age in female and male panels, with overlaid curves for alcohol, cocaine, and heroin]
Bells & Whistles
Lots available

• titles
• axis labels
• colors
• sizes
• transparency
• etc, etc.

My approach:

• Let the students ask or


• Let the data analysis drive
Bells and Whistles
[Figure: scatter plot of births against date, with a legend for day of the week (Sun, Mon, Tues, Wed, Thurs, Fri, Sat)]
Numerical Summaries: One Variable
Big idea: Replace plot name with summary name

• Nothing else changes

histogram( ~ age, data=HELPrct )


mean( ~ age, data=HELPrct )

[1] 35.65

[Figure: density histogram of age]
Other Summaries
The mosaic package includes formula-aware versions of mean(),
sd(), var(), min(), max(), sum(), IQR(), . . .
It also provides favstats() to compute our favorites.

favstats( ~ age, data=HELPrct )

 min Q1 median Q3 max  mean   sd   n missing
  19 30     35 40  60 35.65 7.71 453       0
Numerical Summaries: Two Variables
Three ways to think about this. All do the same thing.

sd( age ~ substance, data=HELPrct )


sd( ~ age | substance, data=HELPrct )
sd( ~ age, groups=substance, data=HELPrct )

alcohol cocaine  heroin
  7.652   6.693   7.986
Numerical Summaries: Tables
tally( ~ sex, data=HELPrct, margins=TRUE )

female   male  Total
   107    346    453

tally( sex ~ substance, data=HELPrct, format="count" )

substance
sex alcohol cocaine heroin
female 36 41 30
male 141 111 94
One Template to Rule a Lot
• single and multiple variable graphical summaries
• single and multiple variable numerical summaries
• linear models

mean( age ~ sex, data=HELPrct )


bwplot( age ~ sex, data=HELPrct )
lm( age ~ sex, data=HELPrct )

female male
36.25 35.47

(Intercept) sexmale
36.2523 -0.7841
Modeling
Modeling is really the starting point for the mosaic design.

• linear models (lm() and glm()) defined the template


• lattice graphics use the template (so we chose lattice)
• we added functionality so numerical summaries can be done
with the same template
• additional things added to make modeling easier for beginners.

Modeling and visualization are two key ingredients in dealing with
Big Data

• both involve data reduction


Randomization – You can do() it
A general approach to randomization

1. Do it for your data


2. Do it for “random” data
3. Do it lots of times for “random” data

• definition of “random” is important, but can often be handled
by shuffle() or resample()
Randomization – You can do() it
do(1) * lm(width ~ length, data=KidsFeet)

Intercept length sigma r.squared


1 2.862 0.2479 0.3963 0.411

do(3) * lm( width ~ shuffle(length), data=KidsFeet)

Intercept length sigma r.squared


1 10.762 -0.071595 0.5075 0.0342683
2 5.362 0.146840 0.4778 0.1441500
3 9.203 -0.008535 0.5163 0.0004871
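
The three-step recipe can be carried through to an approximate p-value. The sketch below is a base-R analogue, not the mosaic code from the slides: replicate() stands in for do() and sample() stands in for shuffle(). It also assumes the built-in `cars` data in place of KidsFeet so that it runs without extra packages, and the helper name `slope` is hypothetical.

```r
# Base-R sketch of the three-step randomization recipe.
# Assumption: the built-in `cars` data (speed, dist) stands in for KidsFeet.
set.seed(2014)  # arbitrary seed so the run is repeatable

# Helper (hypothetical name): slope of a simple linear regression of y on x
slope <- function(y, x) unname(coef(lm(y ~ x))[2])

# 1. Do it for your data
observed <- slope(cars$dist, cars$speed)

# 2. Do it for "random" data: shuffling x breaks any link between x and y
one_shuffle <- slope(cars$dist, sample(cars$speed))

# 3. Do it lots of times for "random" data
null_slopes <- replicate(1000, slope(cars$dist, sample(cars$speed)))

# Share of shuffled slopes at least as extreme as the observed slope
p_value <- mean(abs(null_slopes) >= abs(observed))
```

Here "random" means that the explanatory variable has been shuffled, which is the same choice the shuffle(length) call above makes.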
Advantages to Starting R early
• gives students a taste for computation (without programming)
• focus on asking what computation can do, not how to do it
• provides a base to build on for programming later
• better workflow (reproducibility)
• the sky is the limit
Thanks
All of this is a work in progress that would not be as far as it is and
will not get as far as it can go without the help of others.

• Co-conspirators

Danny Kaplan (Macalester College), Nick Horton (Amherst College)

• Project MOSAIC workshop participants


• The Computation and Visualization Consortium
• My science colleagues at Calvin
• The team at RStudio
A note about my slides
The new support for documentation creation in RStudio is great.

• These slides are PDF, but I created them in RMarkdown (+ a
little bit of LaTeX fiddling)
• A single RMarkdown file can generate PDF, HTML, or Word
• no need to know HTML, LaTeX or Word
• but if you do, you can take advantage
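
A minimal sketch of the one-file, several-formats idea follows. It assumes the rmarkdown package and pandoc are installed, and it skips rendering when they are not. The file name, the document title, and the use of the built-in `cars` data are all made up for illustration.

```r
# Sketch: write a tiny R Markdown file, then render every format it declares.
fence <- strrep("`", 3)  # chunk delimiter, built here to avoid nested fences

rmd_lines <- c(
  "---",
  "title: \"One source, several formats\"",
  "output:",
  "  html_document: default",
  "  word_document: default",
  "---",
  "",
  "Stopping distance against speed for the built-in cars data.",
  "",
  paste0(fence, "{r}"),
  "summary(lm(dist ~ speed, data = cars))",
  fence
)

rmd_file <- file.path(tempdir(), "demo.Rmd")  # hypothetical file name
writeLines(rmd_lines, rmd_file)

# Rendering needs the rmarkdown package and pandoc, so only try if both exist.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  # "all" renders each format listed under `output:` in the YAML header
  outputs <- rmarkdown::render(rmd_file, output_format = "all", quiet = TRUE)
  print(basename(outputs))
}
```

Adding `pdf_document` to the header would produce PDF as well, but that format also needs a LaTeX installation.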
