Introduction to Statistics
Syllabus Overview
x On CANVAS under web page and syllabus. Updated periodically.
About The Sections . . .
Videos
To Flip or Not to Flip . . .
Why I’m here . . .
Your Questions . . .
• Like dreams, statistics are a form of wish fulfillment – Jean Baudrillard
“As you can see, we now have conclusive proof that smoking is one of the leading causes of statistics” – Fletcher Knebel
More charitable view :
• Statistics is the art of stating in precise terms that which one does not know. – William Kruskal
In other words :
x Statistics is about quantifying VARIATION (De Veaux et al.)
http://www.sciencedirect.com/science/article/pii/S0006320716303044
STAT 10x Intro Stats : Fall 2016 – J Reuning-Scherer
Statistics are also important in separating fact from fiction (despite what you might think . . .)
x Statistics turned medicine into a field that (by 1910) had a better than even chance of helping you rather than hurting you (no more leeches . . .)
Example : which animals kill the most humans?
II Collecting (and producing) data : Sampling, bias, design of experiments, randomization
Probability (need for Interpretation)
Random variables, rules of probability, conditional probability, Bayes’ rule, binomial & Normal distributions, Central Limit Theorem
III Interpreting data -- Statistical inference
x Confidence intervals
x Hypothesis testing
x Advanced Techniques (mostly in sections)
o Regression
o ANOVA
o Multiple Regression
SO : If we design our statistics wisely . . . . .
x We INFER that what we observe about our sample is true of the population
x That is, we infer that our sample statistic approximates the population parameter.
Quantitative variable (continuous) – a numeric valued variable, where math on the variable makes sense. Has associated units (don’t forget about these!!)
Categorical Variables (factor, discrete variable, dichotomous variable (two levels)) – places an individual into one of several groups or categories
Data – the Statistician’s Raw Material (Cartoon Guide)
Data consists of values of some variables measured on individuals from a population.
Sometimes a variable can be made categorical or quantitative
Example : Person’s age in years, or a person’s age category (0-20 years, 21-40, 41-60, 60+)
Who cares?!?
Different statistical tools have been developed for different kinds of data. What tools you use depends on the kind of data you collect and what you want to know.
Let’s suppose you have some data – now what do you do?!?!?
Example : Crimean War. March 1854, Russia defeats a Turkish fleet in the Black Sea, threatening British and French shipping; Britain and France join the war against Russia. They fight on the Crimean Peninsula through 1856, ultimately defeating Russia at a cost on all sides of some 300,000 lives.
Florence Nightingale was one of the nurses with the British army during the conflict. She began to keep meticulous records on her patients. Her records probably looked something like this :
Soldier | Regiment | Date of Death | Cause of Death
Frank Butler | 13th Light Cavalry | 10/15/1855 | Cholera
George Parson | 7th Infantry | 10/15/1855 | Cholera
Michael Summers | 13th Light Cavalry | 10/15/1855 | Horse Kick
Matthew Green | 14th Infantry | 10/16/1855 | Gunshot
……
Making Statistical Graphs
Step 2 : Turn piles into pictures
Bar Chart : AREA = RELATIVE FREQUENCY
[Figure: bar chart "Chart of Cause of Death", Count (0 to 16000) by Cause of Death: Disease, Other, Wounds]
A Distribution gives
x A list of all possible values of a variable (i.e. list of all possible causes of death)
x The frequency with which each value of the variable occurs (i.e. how many deaths from disease, how many from wounds, how many from other causes).
Sampling Distribution : distribution of a SAMPLE (known and measured)
Population/True Distribution : distribution of the entire population (unknown and estimated with sampling distribution!!)
x Bar heights give the count of observations in each death category. This is called the distribution of the variable cause of death
Height : Range of possible values? M F
Pie Chart – advantage is that it clearly shows relative proportion in each category.
AREA = RELATIVE FREQUENCY
[Figure: "Pie Chart of Count of Deaths vs Cause of Death", with slices for Disease, Wounds, and Other]
Date | Died of Wounds | Died of Disease | Died of Other Causes | Total Deaths | Proportion Disease Deaths
Apr 1854 | 0 | 1 | 5 | 6 | 0.17
May 1854 | 0 | 12 | 9 | 21 | 0.57
Jun 1854 | 0 | 11 | 6 | 17 | 0.65
Jul 1854 | 0 | 359 | 23 | 382 | 0.94
Aug 1854 | 1 | 828 | 30 | 859 | 0.96
Sep 1854 | 81 | 788 | 70 | 939 | 0.84
Oct 1854 | 132 | 503 | 128 | 763 | 0.66
Nov 1854 | 287 | 844 | 106 | 1237 | 0.68
Dec 1854 | 114 | 1725 | 131 | 1970 | 0.88
Jan 1855 | 83 | 2761 | 324 | 3168 | 0.87
Feb 1855 | 42 | 2120 | 361 | 2523 | 0.84
Mar 1855 | 32 | 1205 | 172 | 1409 | 0.86
Apr 1855 | 48 | 477 | 57 | 582 | 0.82
May 1855 | 49 | 508 | 37 | 594 | 0.86
Jun 1855 | 209 | 802 | 31 | 1042 | 0.77
Jul 1855 | 134 | 382 | 33 | 549 | 0.7
Aug 1855 | 164 | 483 | 25 | 672 | 0.72
Sep 1855 | 276 | 189 | 20 | 485 | 0.39
Oct 1855 | 53 | 128 | 18 | 199 | 0.64
Nov 1855 | 33 | 178 | 32 | 243 | 0.73
Dec 1855 | 18 | 91 | 28 | 137 | 0.66
Jan 1856 | 2 | 42 | 48 | 92 | 0.46
Feb 1856 | 0 | 24 | 19 | 43 | 0.56
Mar 1856 | 0 | 15 | 35 | 50 | 0.3
Total | 1758 | 14476 | 1748 | 17982 |
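The distribution of cause of death, and the disease share for any month, can be recomputed from the counts in the table. The following is a minimal Python sketch using only figures from the table; the variable names are illustrative and not part of the course materials.

```python
# Total deaths by cause, taken from the Total row of the table.
totals = {"Wounds": 1758, "Disease": 14476, "Other": 1748}
all_deaths = sum(totals.values())

# The distribution as relative frequencies (these sum to 1).
relative_freq = {cause: count / all_deaths for cause, count in totals.items()}

# Share of deaths from disease in two individual months,
# given as (wounds, disease, other) counts from the table.
months = {"Jan 1855": (83, 2761, 324), "Sep 1855": (276, 189, 20)}
disease_share = {m: d / (w + d + o) for m, (w, d, o) in months.items()}

for cause, freq in relative_freq.items():
    print(f"{cause}: {freq:.3f}")
for month, share in disease_share.items():
    print(f"{month}: {share:.2f}")
```

Over the whole period, roughly four out of five recorded deaths were from disease, which is the pattern the bar and pie charts make visible.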
MINITAB : use Graph → Pie Chart. For this example, choose Chart Values from a Table. Enter variables in the appropriate boxes.
SPSS: Graphs → Legacy Dialogs → Pie. Choose Simple. For the Nightingale data, choose Summaries for groups of cases. Define Slices by Death_Cause, Slices Represent Number_Dead.
Sometimes we want to consider two categorical variables at the same time. Use a two-way or contingency table to display this information : each cell contains the count of individuals who had a combination of categorical characteristics.
Example : Crimean War. Nightingale recorded deaths in each category for each month over a two year period :
Bar Chart – can include frequencies for two categorical variables.
Here is a bar chart for the first 9 months of Crimean War hospital data :
[Figure: clustered bar chart "Chart of Date, Cause of Death", Count (0 to 1800) by Cause of Death (Disease, Other, Wounds) within Date, 04_1854 to 12_1854]
MINITAB : use Graph → Bar Chart and choose Cluster. You can enter raw data or summarized data (for summarized data, click on DATA OPTIONS, choose FREQUENCY, and enter the variable with the summarized counts).
SPSS: Use Graphs → Legacy Dialogs → Bar. Choose Clustered. Then proceed as before, but Define Clusters by Death_Cause, Category Axis as Month_Year.
Here is how F. Nightingale displayed frequencies for two categorical variables (time and death category) : this is a Polar Graph (first one!!)
Features of Graph :
x Clearly shows frequencies in three categories of death over time.
x Allows comparisons of years – note change between winter of 1854 and winter 1855 (after Nightingale presented year one figures to army brass and got major improvements in sanitation)
Graphical methods for displaying quantitative sampling distributions
Example : 2013 World Poverty Data (data from World Bank). For a sample of 232 countries, examine the average number of children a woman will have over her lifetime.
Data :
1.7, 4.9, 5.9, 1.8, 3.2, 1.8, 2.2, 1.7, 2.1, 1.9, 1.4, 2.0, 6.0, 1.8, 4.8, 5.6, 2.2, 1.5, 2.1, 1.9, 1.3, 1.6, 2.7, 1.6, 3.2,
1.8, 1.8, 2.0, 2.2, 2.6, 4.4, 1.6, 1.4, 1.5, 1.5, 1.8, 1.7, 4.9, 4.8, 5.0, 2.3, 4.7, 2.3, 1.8, 2.2, 1.4, 1.5, 1.5, 1.4, 3.4,
1.7, 2.5, 2.8, 1.9, 1.8, 2.0, 1.7, 2.6, 2.8, 1.5, 4.7, 1.3, 1.6, 4.5, 1.6, 4.3, 1.8, 2.6, 2.0, 3.3, 4.1, 1.9, 1.8, 3.9, 4.9,
5.8, 4.9, 4.8, 1.3, 2.2, 2.1, 3.8, 2.4, 2.5, 1.7, 1.1, 3.0, 5.0, 1.5, 3.1, 1.3, 2.3, 2.5, 2.0, 1.9, 4.0, 2.0, 3.0, 1.4, 2.3,
3.2, 1.4, 2.6, 4.4, 3.2, 2.9, 3.0, 1.2, 2.2, 2.6, 2.2, 3.0, 1.5, 4.8, 2.4, 1.9, 2.2, 4.2, 4.8, 1.5, 2.3, 2.8, 2.6, 3.0, 1.6,
1.6, 1.4, 1.1, 1.8, 2.7, 1.5, 4.5, 2.3, 2.7, 2.2, 2.4, 1.4, 6.8, 1.4, 1.9, 2.8, 1.7, 2.4, 5.2, 4.7, 1.4, 5.4, 2.0, 1.8, 3.1,
2.3, 7.6, 6.0, 2.5, 1.7, 1.9, 1.9, 2.3, 2.0, 1.7, 1.7, 2.9, 3.5, 3.2, 2.5, 2.4, 3.0, 3.8, 1.3, 1.6, 2.0, 1.3, 2.9, 3.3, 2.1,
2.0, 1.5, 1.7, 4.5, 2.6, 2.6, 4.4, 4.9, 1.2, 4.0, 4.7, 2.2, 6.6, 1.5, 5.0, 4.9, 5.0, 3.2, 4.1, 2.3, 1.3, 1.6, 1.9, 3.3, 2.4,
3.0, 6.3, 4.6, 1.4, 3.8, 2.3, 5.2, 3.8, 1.8, 2.3, 2.0, 5.2, 5.9, 1.5, 1.9, 2.0, 1.9, 2.2, 2.0, 2.4, 1.8, 1.7, 3.4, 4.0, 2.5,
4.1, 4.1, 2.4, 5.9, 5.7, 3.5
All Wedges start at center
Wedge area = Death Frequency
1.1, 1.1, 1.2, 1.2, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.5, 1.5, 1.5,
1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.7, 1.7, 1.7, 1.7, 1.7, 1.7,
1.7, 1.7, 1.7, 1.7, 1.7, 1.7, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.9, 1.9, etc.
Range | 0.75 to 1.25 | 1.25 to 1.75 | 1.75 to 2.25 | Etc.
Count | 4 | 52 | 54 |
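The first two bin counts in the Range/Count table can be reproduced by counting how many sorted values fall in each interval. The sketch below is a hand-rolled illustration in Python, not what MINITAB or SPSS does internally; `value_counts` holds the number of times each value below 1.75 appears in the sorted list above.

```python
# How often each value appears in the sorted fertility data (values below 1.75 only).
value_counts = {1.1: 2, 1.2: 2, 1.3: 7, 1.4: 11, 1.5: 13, 1.6: 9, 1.7: 12}

bin_start, bin_width = 0.75, 0.5  # bins: 0.75 to 1.25, 1.25 to 1.75, ...

bin_counts = {}
for value, count in value_counts.items():
    index = int((value - bin_start) // bin_width)  # which bin the value falls in
    bin_counts[index] = bin_counts.get(index, 0) + count

for index in sorted(bin_counts):
    low = bin_start + index * bin_width
    print(f"{low:.2f} to {low + bin_width:.2f}: {bin_counts[index]}")
```

Changing `bin_width` regroups the same data, which is why the shape of a histogram depends on the number of bins.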
[Figure: "Histogram of Fertility Rate 2013", Frequency (0 to 60) vs. Fertility Rate 2013 (1 to 7)]
Symmetric : one half is a mirror image of the other half
Asymmetric : not symmetric (AAAHH!)
Unimodal : one mode
Bi-modal : two modes
(you get the idea . . . .)
[Figure: example shapes labeled "Symmetric, unimodal" and "Not symmetric, bimodal"]
MINITAB : use Graph → Histogram. Enter the variable with the raw data : you can let MINITAB choose the bins.
SPSS notes : use Graphs → Legacy Dialogs → Histogram or Graphs → Chart Builder.
Note : the shape of a histogram is highly dependent on the number of bins – in general, let the computer choose!
[Figure: two histograms of "Fertility Rate (Avg Children per Woman)" drawn with different numbers of bins]
A Mode of a distribution is
x A local maximum (math definition)
x A peak
x The most common value (and the second most common value, etc)
Skewness :
[Figure: two example histograms labeled "Skewed to the right" and "Skewed to the left"]
Numerical Descriptions – the CENTER
“Physical” interpretation of Mean :
MEAN
MEDIAN
For EVEN number of observations, sample median is average of middle observations
Example: Height of six students in class in cm
Data : 178, 163, 168, 167, 170, 150
Sorted : 150, 163, 167, 168, 170, 178
Median : 167.5
Note : some programs/books will include the median when calculating quartiles :
150, 163, 167, (167.5) 168, 170, 178
Q1 = 165, M = 167.5, Q3 = 169
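Quartile conventions differ between programs, so it helps to see one computed explicitly. The sketch below uses Python's standard `statistics` module on the six heights; its "exclusive" method is one common convention and is an illustrative choice here, not the rule used in the note above.

```python
import statistics

heights = [178, 163, 168, 167, 170, 150]  # cm, from the example above

# For an even number of observations the median averages the two middle values.
median = statistics.median(heights)

# Quartiles depend on the convention. The "exclusive" method interpolates
# at positions (n + 1) * p in the sorted data.
q1, q2, q3 = statistics.quantiles(heights, n=4, method="exclusive")

print(median, q1, q3)
```

This convention gives Q1 = 159.75 and Q3 = 172.0, which differs from the Q1 = 165 and Q3 = 169 obtained by including the median in each half. Both are legitimate; check which rule your software uses.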
In a Left Skewed distribution, MEAN < MEDIAN
Example : Poverty Data. Life Expectancy for 198 Countries. Mean=69, Median=72
[Figure: "Dotplot of Life Expectancy at Birth" (48 to 84), with MEAN and MEDIAN marked]
Mean = 1.46 (impossible), Median = 0.99
Assuming that 7.0 was supposed to be 0.7, Mean = 0.83, Median = 0.97
Median changes very little, Mean changes quite a bit!
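The mean and median figures for the class responses can be checked directly. The sketch below assumes the data are the ten "probabilities of taking my class" listed in the robustness example later in these notes, and recomputes both statistics with the suspicious 7.0 and with it corrected to 0.7.

```python
import statistics

# Ten class responses; 7.0 is impossible for a probability.
responses = [1.0, 0.9, 0.99, 1.0, 0.3, 0.95, 1.0, 0.5, 7.0, 1.0]
corrected = [0.7 if r == 7.0 else r for r in responses]

for label, data in (("with 7.0", responses), ("with 0.7", corrected)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label}: mean = {mean:.2f}, median = {median:.3f}")
```

One bad value moves the mean from about 0.83 to about 1.46, while the median barely moves. That is what it means for the median to be robust.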
In a Right Skewed distribution, MEAN > MEDIAN
Example : Poverty Data. GNI per capita for 198 Countries around the world. Mean=10637, Median=3715
[Figure: "Dotplot of GNI per capita Atlas method"]
In a Symmetric distribution, MEAN = MEDIAN
MINITAB Dotplot : Use Graph → Dotplot. Choose Simple. Enter variable with data.
SPSS: use Graphs → Legacy Dialogs → Scatter/Dot and then choose Simple Dot. Enter variable with data.
NOW : Consider the two plots below (both roughly symmetric so mean=median). The centers in both plots are about zero. So what's different between the plots ?
[Figure: two histograms of Frequency centered at zero, with horizontal axes running from about -9 to 9]
Numerical descriptions of sample distributions – the SPREAD or VARIABILITY
x Range = maximum – minimum
x Interquartile Range (IQR) : Q3 - Q1. This gives the width of the middle 50% of the data
x Five Number Summary : {min, Q1, median, Q3, maximum}
Boxplot – a graph based on the five number summary that helps detect outliers
Example : Fertility Rates in 232 Countries
[Figure: "Boxplot of Fertility Rates", Fertility Rate (1 to 8)]
Making a Boxplot
Example : Probability of taking my class – boxplot of all responses, then with 7.0 changed to 0.7
x ‘Whiskers’ extend to largest and smallest observations that are not suspected outliers
x Doesn’t matter if the graph is vertical or horizontal – it’s a one dimensional graph! (and width of the box means nothing . . .)
[Figure: "Boxplot of Fertility Rates", drawn in both orientations]
x Why 1.5? Because John Tukey, who invented the boxplot in 1977, chose this as a reasonable cutoff for outliers, and it's still used today. More on this later . . .
Boxplots are very useful for comparing the distribution of two or more groups
Example : Height of students in a class by gender
[Figure: boxplots "Height by Gender", Height(cm) (150 to 200) for F and M]
Example : Roundup. Two FES students examined the claim that Roundup degrades into harmless substances after a specified period. They examined 5 levels of Roundup – full strength, down to plain water. The dry weight of 3 different radish plants was measured at each Roundup level.
Boxplot of Results : where are the whiskers???
[Figure: boxplots "Radish Dry Weight vs Concentration", Dry Weight (0.0 to 2.0) at Round-Up Concentration 0.00, 0.10, 0.25, 0.50, 1.00]
VARIANCE and STANDARD DEVIATION (SD)
x Most common and useful measure of SPREAD of a distribution. Measures spread around the MEAN
x Relationship: Standard Deviation = $\sqrt{\text{Variance}}$
x Notation : Sample Variance = $s^2$, Standard Deviation = $s$
The Sample Variance $s^2$ is a Statistic. This is a number we can calculate
o How far away are the observations, on average, from the mean?
o This calculation involves the DEVIATIONS
x Formula for Sample Variance : in words, the average of the squared deviations

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Why squared Deviations?
• Sum of deviations is just 0. Squaring the deviations converts the negative deviations to positive numbers
• Could use average absolute value of the deviations – called ‘mean absolute deviation’
• Summing squares is a natural operation – (think Pythagoras)
• … spread in a normal distribution (Fisher, 1920) AND they are a more accurate measure of dispersion. See nice discussion here: http://www.separatinghyperplanes.com/2014/04/why-do-statisticians-use-standard.html and also here : http://www.leeds.ac.uk/educol/documents/00003759.htm
Example : In class . . .
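As a worked illustration of the sample variance formula, the sketch below applies it step by step to the six student heights from the median example. This is an illustrative calculation and not necessarily the example done in class.

```python
import math
import statistics

heights = [150, 163, 167, 168, 170, 178]  # cm, the six student heights

n = len(heights)
mean = sum(heights) / n
deviations = [x - mean for x in heights]

# The deviations themselves always sum to zero, which is why we square them.
sum_of_deviations = sum(deviations)

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)
sd = math.sqrt(variance)

print(mean, sum_of_deviations, variance, round(sd, 2))
```

The hand calculation gives a variance of 86 cm² and an SD of about 9.27 cm, and `statistics.variance(heights)` returns the same variance.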
More on VARIANCE and STANDARD DEVIATION (SD)
x SD of 3, 3, 3, 3, 3, 3 is zero (no variation)
x Robustness (recall that a statistic that is not sensitive to outliers is robust)
IQR is robust; SD is not
Example : probabilities of taking my class :
1.0, 0.9, 0.99, 1.0, 0.3, 0.95, 1.0, 0.5, 7.0, 1.0
SD = 2.0, IQR = 0.2
Change 7.0 to 0.7 : SD = 0.25, IQR = 0.35
IQR changes little, SD changes quite a bit
MINITAB Summary Stats : To get all of the summary statistics discussed so far (mean, median, IQR, SD), use Stat → Basic Statistics → Display Descriptive Statistics.
SPSS: use Analyze → Descriptive Statistics → Descriptives
Results for 6 heights :
Variable N Mean SE Mean StDev Minimum Q1 Median Q3
Height 6 166.00 3.79 9.27 150.00 159.75 167.50 172.00
Rules for LINEAR TRANSFORMATIONS of Means and Variances
(this looks boring, but will be useful later – hint hint!!!)
Example : Height conversion. Suppose I measured student height in cm and then calculated a mean and standard deviation. I actually want the mean and standard deviation of height in inches while wearing shoes with ½ inch soles. Could convert each data point and recalculate mean/std dev. OR . . . use brain!
A conversion formula from centimeters (x) to inches (y) :

$$y_i = \frac{1}{2.54} x_i + 0.5$$

This is called a linear transformation.
[Figure: line plot of Inches (0 to 100) vs. Centimeters (0 to 200)]
Suppose we have some data with sample mean $\bar{x}$ and sample variance $s_x^2$. Make a linear transformation (multiply by $a$, add $b$) to create $y_i = a x_i + b$.
Rules for Linear Transformations :

$$\bar{y} = a\bar{x} + b$$

(mean of y is a times mean of x, plus b)
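The linear transformation rules can be checked against the long way of converting every data point. The sketch below uses the six heights and the cm-to-inches-with-shoes conversion. It also uses the companion rule for spread, that the variance of y is a² times the variance of x, so the SD of y is |a| times the SD of x. The height example applies that rule, but it is stated here as an assumption because the slide text for it did not survive.

```python
import statistics

heights_cm = [150, 163, 167, 168, 170, 178]
a, b = 1 / 2.54, 0.5  # cm -> inches, plus half an inch for shoe soles

# Long way: convert each data point.
heights_in = [a * x + b for x in heights_cm]
mean_direct = statistics.mean(heights_in)
var_direct = statistics.variance(heights_in)

# Shortcut: transform the summary statistics directly.
mean_rule = a * statistics.mean(heights_cm) + b
var_rule = a ** 2 * statistics.variance(heights_cm)  # adding b does not change spread
sd_rule = abs(a) * statistics.stdev(heights_cm)

print(round(mean_rule, 1), round(var_rule, 1), round(sd_rule, 2))
```

Both routes agree: a mean of about 65.9 inches, a variance of about 13.3 inches², and an SD of about 3.65 inches.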
Example : Mean height in inches with shoes :

$$\bar{y} = \frac{1}{2.54}(166) + 0.5 = 65.9 \text{ inches}$$

Variance and SD of height in inches with shoes :

$$s_y^2 = \left(\frac{1}{2.54}\right)^2 \times 86 = 13.3 \ (\text{inches}^2)$$

$$s_y = \frac{1}{2.54} \times 9.3 = \sqrt{13.3} = 3.65 \text{ inches}$$

Sample data we collect is observable (it’s sensible)
x We calculate various properties of our sample data (Statistics)
x Sample Mean
x Sample Variance
x Histogram – the observed frequency of data in various ranges
Sample → Statistic
x However, our sample data is just an observable reflection of the true, unknown, ideal Population (intelligible).
x The population has various unknown, true, fixed properties (Parameters):
x True Mean height of people
x True Variance in the height of people
x True Histogram – true relative frequency of observations over various ranges (i.e. 22% of men are between 5’6” and 6’0” ). This ‘True Histogram’ is called a Probability Density Curve or Probability Density Function.
Population → Parameter
What we see (mortals) | Invisible Truth (gods)
Sample | Population
Statistic | Parameter
Sample mean $\bar{x}$ | True Mean $\mu$
Sample Standard Deviation $s$ | True Standard Deviation $\sigma$
. . . . and . . . .
Sample Histogram | True Density Curve
Density Curve is
x Idealized, smoothed histogram. Limit of large population (sample size → ∞ (infinity))
x A positive valued curve
x A mathematical curve with area EXACTLY=1 underneath it
(Aside : for those who admitted knowing calculus, this means that the integral of the function over the entire real line is equal to one!)
Exponential Density with mean 5 (used in failure times – like …
[Figure: exponential density curve, Y (0.0 to 0.2) vs. X (0 to 50)]
[Figure: "Histogram of DO (mg/L) -S", Frequency vs. DO (mg/L) -S (4 to 10)]
What is the shape of the underlying density? Maybe . . .
We look at 66 observations of surface waters in Bridgeport harbor in summer
[Figure: normal density curves (Y = Density), labeled $\sigma$ = 1 and standard deviation = 0.2]
General Normal Densities
Underlying Shape is Always the Same!
Remember This!
[Figure: normal density curves with $\sigma$ larger and $\sigma$ smaller]
Note that the distance $\sigma$ away from the mean $\mu$ indicates the inflection point of the curve – the place where it goes from concave down to concave up.
“68, 95, 99.7 rule”
• 68% of the population is within 1 SD of the mean (i.e. between $\mu - \sigma$ and $\mu + \sigma$)
Example : Dissolved Oxygen
DO content in Bridgeport harbor has an approximately normal distribution with mean 6.6 mg/l and standard deviation 1.4 mg/l. (i.e., data is N(6.6,1.4) ).
Let’s pretend that the TRUE VALUES – the parameters (think gods) are $\mu$ = 6.6, $\sigma$ = 1.4
What percentile is a DO level of 8?
x That is, what percent of the probability density function is below 8?
x That is, if we took SAMPLE data, about what percent of the data should have a value of 8 or less?
8.0 is 1 standard deviation above the mean. Want the shaded area.
Think about the same area in a standard normal distribution
Remember : all normal distributions are equivalent – the shape stays the same, only the units change!
Use Picture :
[Figure: normal curve with 68% in the middle and 16% in each tail]
16% + 68% = 84%
Answer : 8.0 is the 84th percentile.
That is, about 84% of DO levels observed in Bridgeport Harbor will have a value of 8.0 or less
MINITAB Normal Probabilities : Use Calc → Probability Distributions → Normal. Fill in 6.6 for mean and 1.4 for SD. Do a cumulative probability for an input constant of 8.0
SPSS: use Transform → Compute Variable. In Target Variable, enter some variable name (like normquant) and then in Numeric Expression enter CDF.NORMAL(8,6.6,1.4)
Cumulative Distribution Function
Normal with mean = 6.60 and standard deviation = 1.40
x P( X <= x)
8.0000 0.8413
Example : (DO levels). What percentage of DO levels observed in Bridgeport harbor will have a value of 5.0 mg/l or less?
Not so simple now (5.0 is not a ‘nice’ number of standard deviations away from the mean – i.e. not an easy use of the ’68, 95, 99.7’ rule)
1) Use MINITAB / SPSS - same procedure as above, just change 8.0 to 5.0
x P( X <= x)
5.0000 0.1265 ← about 13%
2) Use Normal Score (also called the z-score, or the Standardized value). How many standard deviations below 6.6 mg/l is 5.0?

$$\frac{5.0 - 6.6}{1.4} = -1.14$$

In General : If x has a normal distribution with mean $\mu$ and standard deviation $\sigma$, then the normal score is

$$z = \frac{x - \mu}{\sigma}$$

Furthermore, z ~ N (0,1) (standard normal distribution)
Look up the z-score in a Table of Standard Normal Probabilities
(Appendix of any stat book)
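The same cumulative probabilities can be computed without MINITAB, SPSS, or a printed table. The sketch below uses `NormalDist` from Python's standard library (Python 3.8 or later) with the DO parameters from the example. It is an alternative tool, not the course's prescribed software.

```python
from statistics import NormalDist

# DO in mg/l, treating mean 6.6 and SD 1.4 as the true parameters.
do_levels = NormalDist(mu=6.6, sigma=1.4)

p_below_8 = do_levels.cdf(8.0)  # 8.0 is one SD above the mean
p_below_5 = do_levels.cdf(5.0)

# Same answer via the z-score and the standard normal distribution.
z = (5.0 - 6.6) / 1.4
p_below_5_via_z = NormalDist().cdf(z)

print(round(p_below_8, 4), round(p_below_5, 4), round(z, 2))
```

The direct calculation and the z-score route give the same probability, which is the point of standardizing: every normal distribution reduces to N(0,1).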
So, why are we doing this!?!
In a few weeks, we’ll find that in most situations, our search for
the true mean depends on understanding normal distributions
and z-scores.
Aside: Remember the 1.5*IQR rule for boxplots? For a Standard Normal Distribution: Q1 = -0.674 and Q3 = 0.674, so IQR = 1.349 and the upper cutoff is Q3 + 1.5*IQR = 2.698 (the lower cutoff is -2.698).
That is, we're calling anything 2.7 standard deviations away from the mean an outlier! This is equivalent to keeping the middle 99.3% of the distribution – sort of like the three standard deviation rule!
Assessing Normality - Normal Quantile Plots (Normal Probability Plots)
The idea –
x Hard to look at a histogram or a dotplot and guess if it has the right shape
x Instead, make a plot where we judge how well points fall along a straight line
Example : DO in Bridgeport
[Figure: "Histogram of DO (mg/L) -S", Frequency vs. DO (mg/L) -S (4 to 10)]
Is this normal?
[Figure: normal probability plot of DO (mg/L) -S]
How to construct quantile plots
… be if they came from a (standard) Normal distribution.” (since every normal distribution has the same shape, it doesn’t matter which one we use)
Example : Iron concentration levels from 39 Limestone watersheds.
[Figure: "Histogram of Iron Concentration in Limestone Watersheds" and "Probability Plot of Iron" (Normal - 95% CI)]
Example : Damage from hurricanes from 1991-2004 (billions of $2004) ordered from smallest to largest:
{3, 4, 4, 5, 6, 7, 9, 14, 15, 44}
x Divide a Standard normal distribution into 11 equal areas (i.e. one more than the number of data points).
[Figure: standard normal density from -3 to 3, split by dotted lines]
Areas between dotted lines are all the same!
Values
{-1.34, -0.91, -0.6, -0.35, -0.11, 0.11, 0.35, 0.6, 0.91, 1.34}
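The cut points that divide the standard normal curve into 11 equal areas can be computed with the inverse of the cumulative distribution function. The sketch below is a minimal illustration using `NormalDist.inv_cdf` from Python's standard library. Statistical packages may use slightly different plotting positions, so their values can differ a little from these.

```python
from statistics import NormalDist

damage = [3, 4, 4, 5, 6, 7, 9, 14, 15, 44]  # sorted, billions of 2004 dollars
n = len(damage)

# Cut points that split the standard normal into n + 1 equal areas:
# the i-th cut point has area i / (n + 1) to its left.
normal_values = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

# Each sorted data value is paired with its normal value for plotting.
for x, z in zip(damage, normal_values):
    print(x, round(z, 2))
```

If the data were roughly normal, these pairs would fall near a straight line. Here the largest value, 44, sits far from the line the other points suggest.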
x Plot the values in the standard normal distribution vs. the data
{-1.34, -0.91, -0.6, -0.35, -0.11, 0.11, 0.35, 0.6, 0.91, 1.34}
[Figure: scatterplot of Values (-1.5 to 1.5) vs. Data (0 to 50)]
Example : Data generated from a N(5,10) distribution with different sample sizes
Normal Quantile Plots by Hand (to get a rough idea)
In class
[Figure: histogram of Iron Concentration (0 to 20)]
… Q-Q Plots. Enter variable(s). Make sure the Test Distribution is Normal!
Example : World Poverty 2013. Does data on percent population in urban areas have an … truncated distribution)
[Figure: histogram and normal probability plot of percent urban population. Mean 57.07, StDev 23.98, N 209, AD 1.334, P-Value <0.005]
Data Relationships
Up Next: describing relationship between two quantitative variables :
• Scatterplots
• Association and Correlation
• Regression
Scatterplot Notation :
x The Horizontal axis is ALWAYS called the X axis.
x The Vertical axis is ALWAYS called the Y axis.
10.5 72 16.4
10.7 81 18.8
10.8 83 19.7
[Figure: scatterplot with Diameter (10 to 20) on the horizontal axis]
Example : World Poverty Data : Factors associated with Fertility Rate (2008 data)
[Figure: scatterplots of Fertility vs. Female Illiteracy, Life Expectancy, and GNI per capita, labeled "Negative Association" and "Positive Association"]
Weak positive correlation (near zero) : r = 0.06
Strong positive correlation (near one) : r = 0.99
Moderate negative correlation : r = - 0.52
Strong negative correlation : r = - 0.96
[Figure: four example scatterplots, one for each correlation listed above]
• -1 indicates all points are on a line with negative slope
Original and z-scores for the cherry tree data :
x Second, multiply and average :

$$r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i}$$

Algebra shows this is the same thing :

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

Algebra also shows this is a dimensionless number between –1 and +1 (try this if you like!)

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

[Figure: scatterplot split into four quadrants at $(\bar{x}, \bar{y})$. Upper left: $x_i - \bar{x} < 0$, $y_i - \bar{y} > 0$. Upper right: $x_i - \bar{x} > 0$, $y_i - \bar{y} > 0$. Lower left: $x_i - \bar{x} < 0$, $y_i - \bar{y} < 0$. Lower right: $x_i - \bar{x} > 0$, $y_i - \bar{y} < 0$.]
What is the value of $\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$ ?
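The two forms of the correlation formula can be compared numerically. The sketch below uses only the three cherry tree rows that survive earlier in these notes, and assumes the first column is diameter and the third is volume. With three points the result is only an illustration of the arithmetic, not the correlation for the full data set.

```python
import math
import statistics

# Three surviving cherry tree rows (assumed columns: diameter, volume).
diameter = [10.5, 10.7, 10.8]
volume = [16.4, 18.8, 19.7]
n = len(diameter)

x_bar, y_bar = statistics.mean(diameter), statistics.mean(volume)
s_x, s_y = statistics.stdev(diameter), statistics.stdev(volume)

# First standardize, then multiply and average (dividing by n - 1).
z_x = [(x - x_bar) / s_x for x in diameter]
z_y = [(y - y_bar) / s_y for y in volume]
r_from_z = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# The algebraically equivalent form with sums of deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(diameter, volume))
s_xx = sum((x - x_bar) ** 2 for x in diameter)
s_yy = sum((y - y_bar) ** 2 for y in volume)
r_from_sums = s_xy / math.sqrt(s_xx * s_yy)

print(round(r_from_z, 3), round(r_from_sums, 3))
```

Each product of z-scores is positive when a point lies in the upper right or lower left quadrant and negative otherwise, so r sums up which quadrants dominate.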
Correlation is often confused with association.
DO NOT CONFUSE CORRELATION WITH ASSOCIATION
Example : Crime. A survey asked 550 people about ways of reducing crime. Respondents could answer ‘Agree’, ‘No opinion’, or ‘Disagree’, coded as 1, 0, -1.
The questions were
x Violent criminals who commit three crimes should always receive a mandatory life sentence without parole.
However – here is a Scatterplot of Three Strikes vs Rehabilitation
[Figure: scatterplot of Three Strikes vs Rehabilitation]
Correlation is NOT appropriate!
Example : Relationship between Body Weight and Brain Weight (extracted from "Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976)). This data records the average body and brain weight of 62 species of mammals. Correlation is r = 0.93 – seems great! However, here is the scatterplot :
[Figure: "Scatterplot of Brain Weight (g) vs. Body Weight (kg)", with Human labeled]
In this case, the elephants are Large Outliers – they are driving the large correlation. Remove elephants – correlation is only 0.651 (and then human looks like an outlier . . .)
It seems we should be able to do something about this . . .
Transformations
x Sometimes, it is necessary to transform one or more variables before calculating correlations.
Example : Relationship between Body Weight and Brain Weight. The elephant is an outlier in both body weight and brain weight. Experience (mine) suggests trying to take the natural logs of both variables. Look at Scatterplot – seems … variables :
[Figure: scatterplot with logbrain on one axis]
Transformations in MINITAB : Calc → Calculator. Make a new variable name (e.g. ‘Log Brain’) and write the formula for the new variable in the Expression box.
Transformations in SPSS : Transform → Compute Variable. Make a new variable name (e.g. ‘LogBrain’) and write the formula for the new variable in the Numeric Expression box.
Example : Diameter and Volume of 31 Black Cherry Trees in Allegheny Forest.
[Figure: scatterplot of Volume vs. Diameter]
Use brain – pretend a tree is a box. Let d be the length of a side. Volume of Tree is $V = d^3$. So diameter is proportional to the cube-root of volume ($d = \sqrt[3]{V}$).
That is, there is a linear relationship between the diameter and the cube-root of the volume.
SO : Make a cube-root transformation of volume,
Regression
Aside – at the moment, we’re doing regression ‘lite’. The idea is to pique your interest. In a few weeks, we’ll do ‘serious’ regression.
The simplest case :
x Take two quantitative / continuous variables.
x You think one variable can be used to predict the values of the other variable.
Some Regression History
Example : Sir Francis Galton : cousin of Darwin,
invented eugenics, the weather map, correlation, the idea of
surveys about human communities, AND methods for
classifying fingerprints!
Galton collected data which
measured the heights of fathers
and sons. How do fathers’
heights predict sons’ heights?
Specifically, if a father is 72” tall, . . .
[Figure: Scatterplot of Son Ht vs Father Ht]

Fathers : mean 68”, SD 3”
Sons : mean 69”, SD 3”
Correlation r = 0.5
Since 72” is 4/3 SD above the
fathers’ mean, natural guess for
son is 4/3 SD above the sons’ mean,
i.e. 73”. This seems a bit high in
the picture.
‘Best’ guess depends on correlation!
Guess that son will be,
not 4/3 SD’s above mean, but
correlation * 4/3 = 2/3 SD’s above mean, that is 71”.

The equation for all the best guesses :
$\frac{\hat{y} - 69}{3} = r \left( \frac{x - 68}{3} \right)$
(left side : standardized y value; right side : r times the standardized x value)
( ŷ , read ‘y hat’, symbolizes our guess of y for a given x)
In general :
$\frac{\hat{y} - \bar{y}}{s_y} = r \left( \frac{x - \bar{x}}{s_x} \right)$
Rearranging, we get
$\hat{y} = b_0 + b_1 x$
with slope $b_1 = r \frac{s_y}{s_x}$ and intercept $b_0 = \bar{y} - b_1 \bar{x}$

Let’s Review : Least Squares Linear Regression
x There frequently exists a linear relationship between two variables.
x If this linear relationship exists, use the value of one variable to predict
the value of another
x Use regression to find the ‘best’ line that describes this relationship
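As a quick check, the standardized form and the slope/intercept form of the prediction line can be compared using the father and son summary numbers from the example (means 68” and 69”, both SDs 3”, r = 0.5). The function and variable names are illustrative only.

```python
def predict_son_height(father_ht, r=0.5,
                       mean_father=68.0, sd_father=3.0,
                       mean_son=69.0, sd_son=3.0):
    """Predict a son's height by shrinking the father's z-score by r."""
    z_father = (father_ht - mean_father) / sd_father  # standardized x value
    z_son = r * z_father                              # standardized guess for y
    return mean_son + z_son * sd_son

# Same line written as y-hat = b0 + b1 * x
slope = 0.5 * 3.0 / 3.0           # b1 = r * s_y / s_x
intercept = 69.0 - slope * 68.0   # b0 = y-bar - b1 * x-bar

guess_from_z = predict_son_height(72)
guess_from_line = intercept + slope * 72
naive_guess = predict_son_height(72, r=1.0)  # ignores the correlation

print(guess_from_z, guess_from_line, naive_guess)
```

Both forms give the same guess for a 72” father, and setting r = 1 reproduces the "natural" guess that looked too high in the scatterplot.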
STAT 10x Intro Stats : Fall 2016 – J Reuning-Scherer 122 STAT 10x Intro Stats : Fall 2016 – J Reuning-Scherer 124
What’s ‘best’?
For each data point we observe, look at the vertical difference
between the observed Y value ( $y_i$ )
and the corresponding point on the
regression line ( $\hat{y}_i$ - the fitted value).
This difference is called the residual
or error: $e_i = y_i - \hat{y}_i$
So what’s ‘Best’?
[Figure: Bad Fit vs. Better Fit]

Regression in MINITAB – use Stat → Regression →
Fitted Line Plot. Later, we’ll use Stat →
Regression → Regression. This gives lots of information
we haven’t discussed yet.
SPSS: for now, use Analyze → Regression → Curve
Estimation. Enter variables and make sure that under models
you’ve checked Linear. Later, you’ll use Analyze →
Regression → Linear.

MINITAB produces a plot that is a scatterplot of the data, the ‘best’ fitted line, and
the estimated regression equation (slope and intercept) calculated by minimizing
the sum of the squared residuals.
[Figure: Fitted Line Plot, logbrain = 2.135 + 0.7517 logbody; S = 0.694295, R-Sq = 92.1%, R-Sq(adj) = 91.9%]
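A small sketch of what "minimizing the sum of the squared residuals" means in practice. The five data points are hypothetical, and the helper names are not from MINITAB or SPSS; the slope and intercept use the standard least-squares formulas.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, roughly linear
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# Any other line (here: nudged slope or intercept) does at least as badly.
nudged = [
    sum_squared_residuals(xs, ys, b0 + db0, b1 + db1)
    for db0 in (-0.5, 0.0, 0.5)
    for db1 in (-0.2, 0.0, 0.2)
]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {best:.4f}")
```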
Relationship of Regression to Correlation :
Notice that
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \le \sum_{i=1}^{n} (y_i - \bar{y})^2$
NOW : plugging in the definition of correlation, more algebra shows
$\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - r^2$

Correlation measures the improvement of a sloped line over a flat
line. In other words (take home message – forget formulas) . . .

Example : Consider 4 points
(x,y) : (0,0) (1,2) (2,4) (3,6).
$\bar{y} = 3$. Correlation is 1, i.e. $r^2 = 1$
[Figure: plot of Y showing the deviations from the mean of y's]
Now :
Example : Consider 4 points
(x,y) : (0,0) (0,4) (1,2) (1,6)
(same y’s!!!!). $\bar{y} = 3$.
Correlation is 0.45, i.e. $r^2 = 0.2$
[Figure: plot of Y vs. X showing the regression line, the mean of Y's, a deviation, and a residual]
Now :
x Variance(Y) = 6.6
x Variance (residuals) = 5.3
SO : Amount of variation of y’s explained by
regression is
$\frac{6.6 - 5.3}{6.6} = .2$

A point can be influential, an outlier, or BOTH!
Example : The point (10, 10) is . . .? (what happens without this
point?)
  X    Y
  1    0
  0    1
 -1    0
  0   -1
 10   10
[Figure: fitted line plot, Y = 4.88E-02 + 0.975610X, R-Sq = 95.2 %]
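Both examples can be reproduced with a few lines of code. This sketch fits the least-squares line, computes r squared as one minus the ratio of residual variation to total variation, and refits the second data set without the point (10, 10). The function name `fit_line` is illustrative.

```python
def fit_line(points):
    """Least-squares fit; returns (intercept, slope, r_squared)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    r_squared = 1 - sse / syy   # share of variation in y explained by the line
    return intercept, slope, r_squared

# Four points with the same y's as the perfect-correlation example
_, _, r2_four_points = fit_line([(0, 0), (0, 4), (1, 2), (1, 6)])

# Five points including (10, 10), then the same data without it
with_point = [(1, 0), (0, 1), (-1, 0), (0, -1), (10, 10)]
b0_with, b1_with, r2_with = fit_line(with_point)
b0_without, b1_without, r2_without = fit_line(with_point[:-1])

print(f"four points: r^2 = {r2_four_points:.3f}")
print(f"with (10,10):    Y = {b0_with:.4f} + {b1_with:.6f} X, R-Sq = {r2_with:.1%}")
print(f"without (10,10): Y = {b0_without:.4f} + {b1_without:.6f} X, R-Sq = {r2_without:.1%}")
```

Dropping the single point (10, 10) changes the slope from nearly 1 to 0, which is what makes it influential.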
Regression – when things go bad
(Not bad really, but things that can be problematic . . . )
x Least-squares regression is not robust (resistant)
to ‘unusual’ points.

Example : Poverty Data (year 2000 data). Try
predicting population growth rate (percent growth)
with per capita calories per day. Two ‘unusual
points’. These are (outliers, influential?). Notice
that the regression line does not move much when they are
removed.
[Figures: Population Growth plots with and without the unusual points; Kuwait and Syria are labeled]
Evaluating Regression models :
Here is our simple linear regression model : this describes the
predicted value of y, denoted by y 'hat'.
$\hat{y} = b_0 + b_1 x$
The more complete model is that y is a linear function of x, with
added errors (denoted by $\varepsilon_i$ )
$y = b_0 + b_1 x + \varepsilon_i$
For reasons we'll discuss later, we assume the errors come from a
normal distribution with mean zero and some standard deviation
sigma – in 'stat' notation :
$\varepsilon_i \sim N(0, \sigma)$

To test these assumptions, we look at residual plots
Residual Plots
x To see if residuals have a normal distribution, make a Normal
Quantile (or Probability) plot of the residuals.
x To see if there are discernible patterns in the residuals, make
a plot of the fitted values ( ŷ : on the horizontal axis) vs. the
residuals (vertical axis). There should be no discernible
patterns in this plot!
Essentially, this plot removes the linear trend from your data and
displays any trends that remain!
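To see how residuals expose what a straight line misses, here is a minimal sketch on hypothetical data that is deliberately curved (y = x squared). The least-squares residuals always sum to zero and have no linear relationship with the fitted values, so any leftover pattern is non-linear structure.

```python
# Hypothetical curved data: y = x^2, fitted with a straight line anyway
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Text version of a "fitted value vs. residual" plot
for f, e in zip(fitted, residuals):
    print(f"fitted = {f:6.2f}   residual = {e:6.2f}")
```

The residuals are positive at both ends and negative in the middle: a U-shaped pattern, which is the kind of thing a residual plot is meant to reveal.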
Example : Brain/Body weight relationship in mammals.
We used regression to predict log(brain wt) based on
log(body wt).
Here are plots of residuals :
A normal quantile plot of the
residuals is approximately linear –
this is good!
A plot of residuals vs. fitted
values shows no discernible
pattern, and there are no
obvious outliers. This is good!
[Figures: normal probability plot of the residuals; residuals vs. fitted values]

Example – Poverty Data. Try to use GNI per capita to
predict fertility rates.
Here is the fitted line plot : NOT linear.
[Figure: Fitted Line Plot, Fertility = 3.415 - 0.000042 GNI; S = 1.32742, R-Sq = 21.1%, R-Sq(adj) = 20.6%]
Here are residual plots :
Residuals are somewhat
curved, but plot of fitted
vs. residuals shows a
definite pattern!
[Figures: normal probability plot of the residuals; residuals vs. fitted values]
Shows improvement – notice that
relationship is more linear, R2
increases (i.e. more of overall
variation is explained by the . . .
[Figure: Fitted Line Plot, Fertility = 8.755 - 0.7063 logGNI; S = 0.999532, R-Sq = 55.3%, R-Sq(adj) = 55.0%]
[Figures: normal probability plot of the residuals; residuals vs. fitted values]

Lurking Variables in Regression (things hidden . . )
A variable that has an important effect but was overlooked.
. . . another, lurking variable.
Example : There is strong association between GNP/capita and
Fertility Rates. This does not mean that getting paid less CAUSES
women to have more babies!
Lurking variables can actually cause a reversal in the apparent
magnitude of an effect
[Figure: plot by Temperature Group (50 deg, 80 deg)]
However : look at plot
of counts by
temperature group with
age of subject
included (age is the
covariate)
[Figure: Scatterplot of Counts vs Age by Group (50 deg, 80 deg), annotated ‘20 Count difference!’]
50 deg actually has
lower counts than 80 deg.
BUT Age has a strong
impact on Response
Time, and the average
age is quite different in each treatment group ( $\bar{x}_{1.} = 32$ ,
$\bar{x}_{2.} = 57$ ), so age effect becomes confounded with . . .

Here is a plot of some experimental data. Now – think about what
happens as temperature increases or decreases – is the linear
estimate still valid?
[Figure: Fitted Line Plot, Ln(cm/sec) = 21.16 - 5610 1/Temp; S = 0.101393, R-Sq = 97.5%, R-Sq(adj) = 97.0%]
$\ln(\text{Speed}) = b_0 + b_1 \left( \frac{1}{\text{Temp}} \right)$
That is, the natural log of ant speed is linearly related to the inverse of
the Temperature.

x Make sure the things you measure will answer the question
you're asking.
x Have an analysis plan BEFORE you begin!
x Design Carefully : Consider
o Feasibility of obtaining measurements
o Cost and time associated with obtaining measurements
o The number of measurements needed to obtain ‘good’
results (sample size calculation if possible)
o Possible sources of variability, bias, error
x Run a pilot study : Mini-study to test your
procedures and identify problems
- How long will data collection take?
- Will your measurement procedure work?
- Will people answer your survey?
- Will people lie?
- Will your plants die?

Local Control
x Methods used to control experimental /
observational error, increase accuracy of
observations, and allow for inference regarding
treatment factors in an experiment.
Examples of Local Control :
x Have a placebo group to use as an experimental baseline
x For oral surveys, ask questions the same way every time
x Make sure your equipment is properly calibrated
(thermometers, scales, etc.)
x Make ground plots the same size and soil composition
Example : Scurvy
x Disease caused by vitamin C deficiency
x Causes the body’s connective tissues to
degenerate (teeth fall out, wounds open).
x Killed an estimated 2 million sailors
between 1500 and 1800 – number one
cause of death among sailors in this period
One of the first experiments (ever) was conducted by Dr. James
Lind who in 1753 published ‘A Treatise of the Scurvy’.

‘On the 20th May, 1747, I took twelve patients in the scurvy on board the Salisbury at sea.
Their cases were as similar as I could have them. . . . They lay together in one place . .
and had one diet in common to all. Two of these were ordered each a quart of cyder a
day. Two others took elixir vitriol three times a day. Two others took two spoonfuls of
vinegar three times a day upon an empty stomach. Two of the worst patients, with the
tendons in the ham rigid (a symptom none of the rest had) were put under a course of sea
water. Two others had each two oranges and one lemon given them every day. The two
remaining patients took the bigness of a nutmeg three times a day of an electuary . . .
The consequence was that the most sudden and visible good effects were perceived from
the use of the oranges and lemons; one of those who had taken them being at the end of
six days fit for duty. The other was the best recovered of any in his condition, and being
now deemed pretty well was appointed nurse to the rest of the sick.’

This study exhibits
x Local Control (common diet, similar patients, control
group)
x Randomization (some problem here . . .)

Randomization
Randomization is the process by which
x Experimental units are assigned to
treatment groups
x Observational units are selected
Why Randomize?
x Randomization prevents BIAS
We say a study is biased if it favors certain outcomes
Examples :
x You want to compare treatments for dealing with woolly . . .
x You need to buy a new toaster. You go to epinion.com and
see what other consumers liked.
Worst x Machines (drums of paper, numbers in a hat, lotto balls)
x Computers – most common way to achieve randomization.
Best x Tables of random digits

Machines – often cool, but not always random!
Example : Randomization Failure in the 1969
Draft. During the Vietnam War, men over 18 years old were
drafted in groups of identical birthdays. The Selective Service needed to assign
a random draft order to days of the year. To do this, 366 capsules were put into
a shoebox (??) and then dumped into a glass jar. Each capsule contained a
piece of paper containing one birthday of the year. The head of the Selective
Service pulled out capsules one at a time, assigning each successive capsule
the next draft number (i.e. September 14 got drafted first, April 24 was drafted
second, etc). Here is a plot of draft order vs. day of year : Does it seem
random?
[Figure: Scatterplot of Draft Number vs Day of Year]
Turned out that the correlation of Day of Year and Draft Number is -0.22, a
significantly negative correlation. Process was NOT random (challenged in
court, but judge ruled that it was ‘random enough’.)
http://en.wikipedia.org/wiki/Draft_lottery_(1969)

Computers
x Most common way to achieve randomization.
x Really pseudo-randomization (uses an algorithm, so it is
necessarily a predictable process)
x Adequate for most randomization needs (this is what
pharmaceutical companies use, for example)
There are many ways to do this – the method depends on the
situation.
Example (in MINITAB) : Suppose you want to :
x Give a survey to 100 people.
x There are 3 versions of the survey (A, B, and C)
x 40 people should receive survey A, and 30 people each
should receive surveys B and C.
Solution : in MINITAB
1. Use Calc → Make Patterned Data → Simple Set
of Numbers. Choose numbers from 1 to 100 in steps of 1, store in
a new column.
2. Use Calc → Random Data → Choose from Columns.
Sample from the column above WITHOUT replacement. Save in a new
column. This gives a random ordering of the original column.
3. Assign the first 40 people in the new column to receive survey A, the next 30
receive survey B, the final 30 receive survey C.

Assign survey type A first : give to persons 19, 22, 39, 50, 34, 5,
75, 62, 87, 13, 96, 29, 90, 71, (ignore 96 – repeat), 98, 64, etc.,
until 40 people are assigned. Then continue to assign people to
survey types B and C.
19223 95034 05756 28713 96299 07196 98642
What if we only want to assign 30 people, ten to each survey
type?
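The same assignment can be sketched in code: shuffle the person numbers (a random ordering, equivalent to sampling without replacement) and hand out survey versions in blocks. The function name, the default group sizes, and the seed are illustrative assumptions, and this uses Python's pseudo-random generator rather than MINITAB.

```python
import random

def assign_surveys(n_people=100, counts=(("A", 40), ("B", 30), ("C", 30)), seed=None):
    """Randomly assign survey versions; returns {person_number: version}."""
    if sum(k for _, k in counts) != n_people:
        raise ValueError("group sizes must add up to the number of people")
    rng = random.Random(seed)          # pseudo-random, reproducible with a seed
    people = list(range(1, n_people + 1))
    rng.shuffle(people)                # random ordering of persons 1..n
    assignment = {}
    start = 0
    for version, k in counts:
        for person in people[start:start + k]:
            assignment[person] = version
        start += k
    return assignment

assignment = assign_surveys(seed=2016)
small = assign_surveys(n_people=30, counts=(("A", 10), ("B", 10), ("C", 10)), seed=2016)

print(sorted(p for p, v in assignment.items() if v == "A")[:10])
```

Changing `n_people` and `counts` covers the smaller case of 30 people with ten per survey type.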
Observational Studies :
You may want to ensure that certain sub-populations are
represented in your sample. In this case, use a
Stratified Sample : First stratify the population into known
homogeneous sub-groups and then sample within subgroups.

Example : In a class with 60% women, if you
stratify by gender and then sample 100
individuals, you need to sample 60 women and
40 men in order to make accurate inferences
about the entire class!
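A minimal sketch of proportional allocation in a stratified sample. The class roster below is hypothetical (500 made-up IDs, 60% women), and the function name is illustrative; the point is only that each stratum is sampled in proportion to its share of the population.

```python
import random

def proportional_stratified_sample(strata, total_n, seed=None):
    """Draw a simple random sample within each stratum, sized by its share.

    Note: rounding each stratum's size may not add up exactly to total_n
    when the shares are not round numbers.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, k)   # without replacement
    return sample

# Hypothetical class of 500 students, 60% women
strata = {
    "women": [f"W{i}" for i in range(300)],
    "men": [f"M{i}" for i in range(200)],
}
sample = proportional_stratified_sample(strata, total_n=100, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
```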
Another problem that often arises in sampling is that
individuals are too far apart, or that there is no complete list
of individuals. In this case, use
Cluster Sampling :
x Subdivide population into clusters
x Sample from within each cluster OR sample all
individuals inside chosen clusters.

Stratified Sample vs. Clustered Sample
Example . . ..

Sometimes, the total population size is unknown, but can be
observed at regular intervals. In this case, use a
Systematic Sample : take
every ith unit that comes
along (every 10th person to
leave a grocery store, every
100th item in a production
line, every 10th day in a
year).
Finally, you can mix-and-match sampling techniques to achieve
a . . .

PITFALLS IN SAMPLING
Survey Design – lots to say here, you can take another entire
class on this subject . . .
http://lap.umd.edu/survey_design/index.html
Local Control for Experiments
x Describes methods used to control experimental
error, increase accuracy of observations, and allow
for inference regarding treatment factors.
x Includes things like
o Placebo groups – are effects due to the thought of being
treated?
o Measurement accuracy – scale calibration, consistency of
research assistants, etc.
o Making treatment groups as similar as possible to control
for confounding factors
x May require additional design to reduce between subject
variability (matching or blocking)

The experimental counterpart of the simple random sample is
the
Completely Randomized Design : Experimental units are
randomly assigned to treatment groups. Every experimental unit
has the same chance of being assigned to a particular treatment
group.
Balanced Design – the most
common type of completely
randomized design – equal number
of experimental units in each
treatment group. When in doubt,
use a balanced design!
Example : Deer Contraceptives.
As an alternative to hunting, deer
contraceptives are sometimes used to
control deer populations. A study
examined the effect of Norgestomet at
0mg (placebo), 14mg, 21mg, 28mg, or
42mg. Twenty does were randomly
assigned to each treatment group for a
total of 100 does. For each doe, it was
recorded whether or not she had a fawn during the next mating season.
This is a balanced completely randomized design with
x Five treatment groups,
x 20 experimental units per treatment group where a doe was an
experimental unit
x 100 observations total

One of the most common methods for achieving local control
(i.e. increasing ability to observe differences between
treatment groups) is to use a
Block Design :
x Intention of blocking is to divide experimental units
according to factors that are thought to have an
effect on the response variable.
x Experimental units known to be similar are divided into blocks.
Each block receives ALL treatment groups.
x Analogous to a stratified sample (a block is like a stratum)
Round-Up Concentration
Species 0 0.1 0.25 0.5 1.0
Rye Grass 3 pots 3 pots 3 pots 3 pots 3 pots
Radish 3 pots 3 pots 3 pots 3 pots 3 pots
Crossover Design
Often, variability within subject/unit is greater . . .
Example : (Cartoon Guide) :
Compare ethanol vs. regular gas
in 10 cabs to test for differences . . .
. . . shows strong
evidence of a
difference.
Other Experimental Designs
x Factorial Design – two or more treatment factors (we'll
discuss this when we do Two-Way ANOVA)
x Analysis of Covariance – take out effect of a continuous
factor (we'll cover later in the semester)
x Latin Squares, Unbalanced Designs, Random Effects
Models (take experimental design class), . . . . . .

Remember :
Parameter – a fixed number that
describes a population (e.g., μ =
true population mean height). We
don’t know this number (Gods only)
Statistic – a number that describes a sample (e.g., x̄ = sample
mean height). We know this number, but the number can
(and usually does) change from sample to sample. Use
the statistic to estimate the unknown parameter!
Example : Let p be the true proportion of the
population that believes in global warming (the
PARAMETER).
Suppose a TOTAL of FOUR people live in the U.S.
(this is the population). I.e., just this once, we know
the entire population. Here are their opinions (known
to Gods, who are letting us know just
this once . . . )

Individual   Attitude
1            Believe
2            Believe
3            Don’t Believe
4            Don’t Believe

x We want to estimate p using a
STATISTIC p̂ , the sample
proportion that believes in global
warming.
In this population, p =0.5 (the parameter, i.e. the true proportion
of the population that believes in global warming).

In terms of probability, this is the sampling distribution for p̂ for
samples of size two :

p̂             0      0.5    1
Probability   1/6    4/6    1/6

The sampling distribution gives
x All possible values of p̂
x The proportion of times p̂ takes on each of these values, or
the probability that p̂ takes each of these values.
If the mean of the sampling distribution of a statistic equals
the true value of a parameter, the statistic is said to be an
UNBIASED ESTIMATOR of the parameter
Now : what is the variability of the sampling distribution
of p̂ ?
We’ll see formulas for calculating this later. Suffice it to say that
the standard deviation of p̂ is about 0.29. Trust me.

NOW : Suppose our budget is so small we can only afford a sample
of size n =1. How does the sampling distribution of p̂ change?

POPULATION
Individual   Attitude
1            Believe
2            Believe
3            Don’t Believe
4            Don’t Believe

POSSIBLE SAMPLES
Individuals in SRS   Attitude        p̂
1                    Believe         1
2                    Believe         1
3                    Don’t Believe   0
4                    Don’t Believe   0

Sampling Distribution of p̂ for n =1 :

p̂             0      1
Probability   1/2    1/2

p̂ is still unbiased: Mean of sampling distribution = 0.5
= true proportion that believe in global warming, p
Variability of the sampling distribution of p̂ is clearly 0.5 in this
case.

SO : Mean of sampling distribution of p̂ = .5 for samples of
size 1 or 2.
However :
Standard Dev. of p̂ with samples of size 2 ≈ 0.3
Standard Dev. of p̂ with samples of size 1 ≈ 0.5
Taking samples of size 2 gives us
estimates that are less variable!
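Because the population has only four people, the sampling distributions for n = 1 and n = 2 can be listed exhaustively. This sketch codes "Believe" as 1 and "Don't Believe" as 0, enumerates every possible simple random sample, and computes the mean and standard deviation of p̂ directly; the function names are illustrative. For n = 2 the exact standard deviation is the square root of 1/12, roughly 0.29.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [1, 1, 0, 0]   # 1 = Believe, 0 = Don't Believe; p = 0.5

def sampling_distribution(pop, n):
    """Map each possible value of p-hat to its probability under SRS of size n."""
    samples = list(combinations(range(len(pop)), n))   # all equally likely
    counts = Counter(Fraction(sum(pop[i] for i in s), n) for s in samples)
    return {value: Fraction(c, len(samples)) for value, c in counts.items()}

def mean_and_sd(dist):
    mean = sum(value * prob for value, prob in dist.items())
    variance = sum((value - mean) ** 2 * prob for value, prob in dist.items())
    return mean, math.sqrt(variance)

dist_n2 = sampling_distribution(population, 2)
dist_n1 = sampling_distribution(population, 1)
mean_n2, sd_n2 = mean_and_sd(dist_n2)
mean_n1, sd_n1 = mean_and_sd(dist_n1)

print("n = 2:", dict(sorted(dist_n2.items())), "mean", mean_n2, "sd", round(sd_n2, 3))
print("n = 1:", dict(sorted(dist_n1.items())), "mean", mean_n1, "sd", round(sd_n1, 3))
```

Both sample sizes give a mean of 0.5 (unbiased), while the larger sample gives the smaller spread.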
BIAS and VARIABILITY
Bias of an estimator
= (mean of sampling distrib.) - (true value of parameter)
Variability of an estimator
= (Standard Deviation of sampling distrib.)

To see this, we simulate (make up) data. . . . .
Example : Suppose the true proportion of people who
believe in global warming is 0.80, or 80%.
x We now assume that the population is large, and
we just happen to know the true proportion of believers (i.e. p
= 0.80)
x If we take a sample of size n =10, what sort of values for p̂
(the sample proportion) might we see? How often will we see
these particular values? This is the sampling distribution.
x To estimate the sampling distribution, let’s simulate taking many
samples of size n =10 (how about 1000 such samples), and . . .

Random Binomial Data in MINITAB : use Calc → Random
Data → Binomial (more on the binomial distribution later). The
number of trials is 10, the probability of success is 0.8, we generate 1000
rows of data.

Note : this is just an EXERCISE – you would never
actually take many samples of size 10, only ONE
sample of size 10. We are doing this to see what
values (and with what frequency) our sample
proportion p̂ might take! Another way to think
about it : about how far can p̂ be from the true value
0.80 just by chance?
Here is the estimated sampling
distribution of p̂ for samples of
size n = 10 :
[Figure: Histogram of n=10 (Frequency vs. sample proportion)]
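The same exercise can be sketched outside MINITAB. This version uses Python's standard pseudo-random generator; the function name and the seed are arbitrary choices, and the exact counts will differ from the MINITAB histogram because the random draws differ.

```python
import random
from collections import Counter

def simulate_p_hat(p=0.8, n=10, n_samples=1000, seed=0):
    """Simulate n_samples sample proportions, each from a sample of size n."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(n_samples):
        believers = sum(rng.random() < p for _ in range(n))  # one sample of size n
        p_hats.append(believers / n)
    return p_hats

p_hats = simulate_p_hat()
frequency = Counter(p_hats)
mean_p_hat = sum(p_hats) / len(p_hats)

# Text histogram of the estimated sampling distribution
for value in sorted(frequency):
    print(f"{value:.1f} | {'*' * (frequency[value] // 10)} ({frequency[value]})")
print("mean of simulated p-hats:", round(mean_p_hat, 3))
```

Calling `simulate_p_hat(n=100)` or `simulate_p_hat(n=1000)` shows how the spread of p̂ shrinks as the sample size grows.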
Now, suppose we took samples
of size n =100. How does the
estimated sampling distribution
of p̂ change?
[Figure: Histogram of n=100 (Frequency vs. sample proportion)]

Now, suppose we took samples
of size n =1000. How does the
estimated sampling distribution
of p̂ change?
[Figure: Histogram of n=1000 (Frequency vs. sample proportion)]

Probability Models – We’re modeling some random
phenomenon. A probability model consists of
x A List of possible outcomes
x A probability for each outcome
Discrete
Toss a coin: S = {H,T}.
Watch a tree for a year and see if it dies :
S = {Dead, Alive}.
Example : Let the event A = (Gas mileage between 30 and 40
mpg)

If A ={H}, then P(A)=0.5
Probability measure : a function (satisfying certain
conditions) that assigns a probability (a number between 0 and
1) to each event.
Three approaches :
1) Classical
2) Relative Frequency
3) Personal/Subjective Probability (Bayesian)

Example : Roll Two Dice S={see picture}
If A ={Sum of Dice at least 10}, then
$P(A) = \frac{\#\text{ outcomes in } A}{\#\text{ outcomes in } S} = \frac{6}{36} = 0.167$
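The classical calculation can be checked by listing the sample space. This is a small sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space S: all ordered pairs (first die, second die)
sample_space = list(product(range(1, 7), repeat=2))

# Event A: the sum of the dice is at least 10
event_a = [outcome for outcome in sample_space if sum(outcome) >= 10]

p_a = Fraction(len(event_a), len(sample_space))
print(event_a)
print(f"P(A) = {len(event_a)}/{len(sample_space)} = {float(p_a):.3f}")
```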
2) Relative Frequency ( Long Run Frequency ): When an
experiment can be repeated, the probability of an event is
the proportion of times the event occurs in the long run
Example : What’s the probability a radish seed will grow in
soil treated with RoundUp? Plant many, many seeds, count
the number of times the seed germinates.
3) Personal/Subjective Probability (Bayesian) Most
events in life aren’t repeatable. We assign
probabilities all the time :
x What’s the probability I’ll take statistics?
x What’s the probability this guy will ask me out on a date?
x What’s the probability a huge body of fresh water will
halt the gulf stream and lead to an ice age within a
century? (some people think high . .)

Now a bit of gambling history . . . . .
A rich Frenchman, Antoine Gombaud, known as the
‘Chevalier de Méré’ liked to gamble. However, he
was confused by certain experiences at the gambling tables. He
posed the following question to his mathematician
friend, Blaise Pascal in 1654 :
Which is more likely :
1) At least one six in four rolls of a single die
2) At least one double-six in 24 rolls of a pair of dice
Pascal with his friend Pierre de Fermat soon worked
out the Chevalier’s problem, and in the process
developed the algebraic basis of probability theory.
Here is the theory they worked out . . .
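Before the theory, the Chevalier's two bets can be explored in the relative-frequency spirit: play each game many times and record the proportion of wins. This is only a simulation sketch with an arbitrary seed and number of repetitions, so the estimates are approximate.

```python
import random

def estimate_chevalier_bets(n_trials=50_000, seed=1654):
    """Estimate the long-run win frequency of both bets by simulation."""
    rng = random.Random(seed)
    wins_single = 0   # at least one six in 4 rolls of one die
    wins_double = 0   # at least one double-six in 24 rolls of two dice
    for _ in range(n_trials):
        if any(rng.randint(1, 6) == 6 for _ in range(4)):
            wins_single += 1
        if any((rng.randint(1, 6), rng.randint(1, 6)) == (6, 6) for _ in range(24)):
            wins_double += 1
    return wins_single / n_trials, wins_double / n_trials

p_single, p_double = estimate_chevalier_bets()
print(f"at least one six in 4 rolls:         about {p_single:.3f}")
print(f"at least one double-six in 24 rolls: about {p_double:.3f}")
```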
Think in terms of betting . . . .
This may seem arbitrary, but it’s actually a good description of
how we assign probabilities. Bayesians assign a probability based
on the information at hand, and update their probability as
more data becomes available.
Regardless of how you choose to assign probabilities, the
following two rules apply :
1) For any event A, $0 \le P(A) \le 1$
2) $P(S) = 1$
(Something must happen)

Venn Diagrams : Represent Sample Spaces and Events
with pictures!
x Think about S as a car
windshield
x A is an area in the
windshield.
x It’s about to start raining.
x Let A be the event that . . .
[Figure: Venn diagram of event A inside sample space S]
Probability measure for this picture : Probability = Area
$P(A) = \frac{\text{area of } A}{\text{area of } S}$
For convenience assume (area of windshield S ) = 1.
So P(A) = area of A.
Remember : 0 ≤ P(A) ≤ 1 and P(S) = 1

Complement of A
(raindrop falls
in ‘not A’)
x Disjoint Events : If A and B are disjoint, then
$P(A \text{ or } B) = P(A) + P(B)$
Complement rule
$P(A) = 1 - P(A^c)$
[Figures: Venn diagrams of disjoint events A and B, and of A with its complement A^c]

Conditional Probability
Notation : P(B | A)
Read as “Probability of B given A”
Meaning : Given that A has already happened, what is the
probability of B?
By eyeball : P(B | A) = 0.5
[Figure: Venn diagram of overlapping events A and B]

Given that the raindrop fell in
A, we restrict our attention to
the set A. The drop is equally
likely to fall anywhere within A.
Given A, the event B also
occurs when the drop falls in
the darker region, i.e., the event
( A and B ).

This gives the formal
definition for Conditional
Probability :
$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$
Or, equivalently
$P(A \text{ and } B) = P(B \mid A) \, P(A)$
Independence
Two events A and B are independent if
being told that A occurred has no effect on
the probability that B also occurs.
Formal Definition :
A and B are independent if
$P(B \mid A) = P(B)$
Equivalently,
$\frac{P(A \text{ and } B)}{P(A)} = P(B)$
Equivalently,
$P(A \text{ and } B) = P(A) \, P(B)$

Example : Toss two coins. Given that the first coin
is heads, what is the probability that the second coin
is also heads?
Coins are independent so
P(Heads (Toss 2) | Heads (Toss 1)) = P(Heads (Toss 2))
That is, these events are independent.
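The coin example can be verified by enumerating the four equally likely outcomes and applying the definition of conditional probability. A small sketch with illustrative names:

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: HH, HT, TH, TT (equally likely)
sample_space = list(product("HT", repeat=2))

def prob(event):
    """Classical probability: favorable outcomes over total outcomes."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

def first_heads(outcome):
    return outcome[0] == "H"

def second_heads(outcome):
    return outcome[1] == "H"

def both_heads(outcome):
    return first_heads(outcome) and second_heads(outcome)

# P(B | A) = P(A and B) / P(A)
p_second_given_first = prob(both_heads) / prob(first_heads)

print("P(Heads on toss 2 | Heads on toss 1) =", p_second_given_first)
print("P(Heads on toss 2)                   =", prob(second_heads))
print("P(A and B) == P(A) P(B)?", prob(both_heads) == prob(first_heads) * prob(second_heads))
```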
Which is more likely :
1) At least one six in four rolls of a single die
2) At least one double-six in 24 rolls of a pair of dice
1) Let the events A, B, C, D be the events of getting a six on
roll 1, 2, 3, or 4, respectively.
We want
$P(A \text{ or } B \text{ or } C \text{ or } D) = P(A) + P(B) + P(C) + P(D) - (\text{probability of overlaps})$
[Figure: Venn diagram of overlapping events A, B, C, D]
. . .
$= 1 - \left( \frac{5}{6} \right)^4 = 0.518$
Because dice rolls are
independent – the value of
one die throw does not
change the likelihood of
outcomes on subsequent . . .

Similar reasoning (use complement rule), gives that
$P(\text{at least one double six}) = 1 - P(\text{no double sixes}) = 1 - \left( \frac{35}{36} \right)^{24} = 0.491$
SO : more likely to get one six in four
throws of the dice!

Example : Who will you vote for? The Real Clear
Politics Average on 8/30/16 for likely voters had the
following distribution for a four way race:
• Clinton 42%
• Trump 38%
• Johnson 8%
• Stein 3%

Suggestion: Check to see if probabilities satisfy requirements:
Let A be the event that a person plans to vote for a particular candidate.

1) For any event A, $0 \le P(A) \le 1$ (all fine here)
2) $P(S) = 1$ (umm . . .)

Probabilities of events so far only add to 91% - we need another category (none/other: 9%)

Question: what is the probability that three randomly chosen people all plan to vote for Trump?

Help: Probabilities multiply ONLY if events are independent. Think carefully about whether this is true!

In this case, it is reasonable to assume people are independent (we picked them at random!) so
$$P(A \text{ and } B) = P(A)\, P(B)$$
$$P(\text{"T" and "T" and "T"}) = P(\text{"T"})^{3} = 0.38^{3} = 0.055$$
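The two requirement checks and the multiplication rule are easy to reproduce. The sketch below uses the poll shares quoted above; the dictionary name `poll` and the rounding used to tidy floating-point noise are choices made for this illustration only.

```python
# Poll shares from the example, written as probabilities.
poll = {"Clinton": 0.42, "Trump": 0.38, "Johnson": 0.08, "Stein": 0.03}

# Requirement 1: every probability lies between 0 and 1.
all_valid = all(0 <= p <= 1 for p in poll.values())

# Requirement 2: P(S) = 1. The listed shares fall short, so the remainder
# goes to a catch-all category (rounded to suppress floating-point noise).
listed_total = sum(poll.values())
poll["none/other"] = round(1 - listed_total, 10)

# Multiplication rule, assuming the three respondents are independent.
p_three_trump = poll["Trump"] ** 3

print(all_valid, round(listed_total, 2), poll["none/other"], round(p_three_trump, 3))
```

The multiplication step is only justified by the independence assumption stated in the example (people picked at random).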
Question: what is the probability that a person picked at random would vote for Johnson or Stein?

Help: 'or' means P(A or B). Use addition rule and then think about whether events are disjoint.
$$P(J \text{ or } S) = P(J) + P(S) - P(J \text{ and } S)$$
Last probability is zero since a person can't vote for two people, i.e., these events are disjoint:
[Venn diagram: disjoint regions A and B]
$$P(J \text{ and } S) = 0$$
So:
$$P(J \text{ or } S) = P(J) + P(S) = 0.08 + 0.03 = 0.11$$

Question: what is the probability that if we choose four people at random, at least one will vote for Clinton?

Help: When you see 'AT LEAST' you should think 'Complement Rule':
$$P(A) = 1 - P(A^c)$$
[Venn diagram: region A and its complement A^c]
$$P(\text{at least one Clinton}) = 1 - P(\text{none of four choose Clinton}) = 1 - P(\text{one does not choose Clinton})^{4} = 1 - (0.58)^{4} = 0.89$$
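The 'at least' pattern is the same in the dice problem and in the Clinton question, so one small helper covers both. This is a sketch under the independence assumption used in the examples; the function name `at_least_once` is made up here.

```python
def at_least_once(p, n):
    """P(an event of probability p happens at least once in n independent trials).

    Complement rule: 1 - P(it never happens) = 1 - (1 - p) ** n.
    """
    return 1 - (1 - p) ** n

# Addition rule for disjoint events: P(J and S) = 0, so nothing is subtracted.
p_j_or_s = 0.08 + 0.03 - 0.0

# Dice: at least one six in 4 rolls vs. at least one double-six in 24 rolls.
one_six = at_least_once(1 / 6, 4)
double_six = at_least_once(1 / 36, 24)

# Poll: at least one Clinton voter among four people chosen at random.
clinton = at_least_once(0.42, 4)

print(round(p_j_or_s, 2), round(one_six, 3), round(double_six, 3), round(clinton, 2))
```

Comparing `one_six` with `double_six` reproduces the conclusion that the single six in four rolls is the better bet.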
Counting and Probability

Example: In 5 card draw, a straight is a sequence of cards in numerical sequence without regard to suit. An ace may be the low card or the high card in a straight.
Show how to compute the probability of getting a straight.

You can get these values from Pascal's Triangle: each entry is the sum of the two numbers just above it.
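The Pascal's Triangle rule can be sketched in a few lines. The sketch assumes that "these values" refers to binomial coefficients (the number of ways to choose k items from n), which is not stated explicitly here. As one possible counting route to the straight question, it also counts straights directly: 10 choices for the lowest rank (ace through 10) and 4 suits for each of the 5 cards, which includes straight flushes.

```python
from math import comb

def pascal_row(n):
    """Row n of Pascal's triangle, built by summing adjacent entries of the row above."""
    row = [1]
    for _ in range(n):
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row

# Assumption: the triangle's entries are the binomial coefficients C(n, k).
row_5 = pascal_row(5)
matches_comb = row_5 == [comb(5, k) for k in range(6)]

# Counting route to the straight probability (unordered 5-card hands).
straights = 10 * 4 ** 5   # lowest rank ace..10, any suit for each card
hands = comb(52, 5)       # number of distinct 5-card hands
p_straight = straights / hands

print(row_5, matches_comb, hands, round(p_straight, 4))
```

Counting unordered hands avoids having to track the order in which the cards are dealt.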
Let's calculate
1) The probability of getting dealt a straight in order
2) The number of orders for five cards

The first card can be any card up to 10 in value (counting the ace as low):
$$\Pr(\text{first card} \le 10) = \frac{40}{52}$$
Next cards must be one of the four cards that are one higher in value than the previous card:
$$P(\text{Straight in order}) = \frac{40}{52} \cdot \frac{4}{51} \cdot \frac{4}{50} \cdot \frac{4}{49} \cdot \frac{4}{48}$$
NOW: There are 120 = 5*4*3*2*1 ways to get dealt five particular cards. SO:
$$P(\text{Straight}) = 120 \cdot \frac{40}{52} \cdot \frac{4}{51} \cdot \frac{4}{50} \cdot \frac{4}{49} \cdot \frac{4}{48} = 0.0039$$

2. Probability (with a bit of counting) – Bayes' Theorem

Example: Uganda, after an extensive safe-sex campaign, has reduced the rate of HIV infection from almost 15% of adults (early 1990's) to 7.1% (2015, estimated 10% in cities). (Learn more about AIDS in Uganda: http://www.avert.org/aidsuganda.htm)

An oral HIV test is given at random to an adult in the capital Kampala. If a given person has a positive test result, what is the conditional probability that the person indeed has the virus?

Information:
• 10% of the population has the virus
• Test has a false positive rate of 1%
• Test has a false negative rate of .03%

A = {HIV virus in blood} → A^c = {HIV NOT in blood}
B = {Blood test positive} → B^c = {Blood test negative}

We want P(A | B) = P(Have HIV | Positive Test)

P(A) = 0.1 so P(A^c) = 0.9
P(B | A^c) = 0.01 so P(B^c | A^c) = 0.99
P(B^c | A) = 0.0003 so P(B | A) = 0.9997

Want P(Have HIV | Positive Test) =
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$
Draw a Probability Tree!

[Probability tree]
.10  A:    .9997  B | A      .0003  B^c | A
.90  A^c:  .01    B | A^c    .99    B^c | A^c

Get numerator: Conditional Probability (the other direction)
$$P(A \text{ and } B) = P(B \mid A) \cdot P(A) = P(A \mid B) \cdot P(B)$$
$$P(A \text{ and } B) = P(B \mid A) \cdot P(A) = 0.9997 \cdot 0.1 = 0.09997 \approx 0.10$$
Note that we multiply probabilities along the tree.

SO: a person's risk of having HIV after a positive test increases from about 10% to about 91% (still about a 9% chance of not having HIV).
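The tree calculation can be written out in code. This is a sketch, not the slides' own working: the denominator P(B) is obtained here by adding the two tree branches that end in a positive test (the law of total probability), a step not shown in this part of the notes. The function and parameter names are made up for the illustration.

```python
def posterior(prior, false_pos, false_neg):
    """P(A | B): probability of having the condition given a positive test."""
    sensitivity = 1 - false_neg             # P(B | A)
    p_a_and_b = sensitivity * prior         # multiply along the branch A -> B
    p_ac_and_b = false_pos * (1 - prior)    # multiply along the branch A^c -> B
    p_b = p_a_and_b + p_ac_and_b            # assumed step: total probability of B
    return p_a_and_b / p_b

# Values from the example: P(A) = 0.10, P(B | A^c) = 0.01, P(B^c | A) = 0.0003.
p_hiv_given_positive = posterior(prior=0.10, false_pos=0.01, false_neg=0.0003)

print(round(p_hiv_given_positive, 3))
```

With these inputs the result is close to 0.917, in line with the roughly 91% quoted above. Lowering `prior` shows how strongly the answer depends on the base rate.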