Chapter 2
Presenting data in an informative way
  Numerical summaries (mean, median, standard deviation, ...)
  Graphical displays (histogram, boxplot, ...)
  Association between variables (scatterplot, correlation, ...)
Reminders
  Make sure you have access to Minitab and the appropriate calculator
  Homework 1 is due Wednesday, Jan 22, in class
  Problem sets: P1 and P2
Example: Molding
Injection molding is a manufacturing process used to form many types of parts. Injected material is usually hot and shrinks as it cools, so the part cavity must be oversized compared to the desired final dimensions. Shrinkage is influenced by many factors, including
  Injection velocity (ft/sec)
  Mold temperature (deg C)
  Material viscosity
Example: Molding
A comparative experiment was performed in which mold temperature was held fixed and results were observed at two levels of injection velocity (low and high). Shrinkage of cooled parts is measured (m) for 10 specimens at each velocity. Results:

Low velocity    71.68  71.55  71.62  71.58  71.48  71.42  71.84  71.58  72.07  71.92
High velocity   71.68  71.74  71.48  71.71  71.56  71.70  71.62  71.52  71.55  71.50
[Figure: two dot plots of Shrinkage, one spanning about 71.5 to 72.0 and one about 71.55 to 71.75]
Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Median $m$: 50% of the $x_i$ are lower and 50% are higher than $m$
The mean is sensitive to outliers, but the median is not:
Mean = 71.7, Median = 71.6
[Figure: two dot plots of Shrinkage on an axis from 71 to 75]
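The sensitivity claim can be checked numerically. The sketch below uses the ten high-velocity shrinkage values from the molding example; the value 75.0 is an invented outlier (an assumption made only for illustration, not a measurement from the experiment).

```python
from statistics import mean, median

# High-velocity shrinkage values from the molding example.
high = [71.68, 71.74, 71.48, 71.71, 71.56, 71.70, 71.62, 71.52, 71.55, 71.50]

# Hypothetical contaminated sample: the largest value is replaced by 75.0.
# 75.0 is made up for this illustration.
with_outlier = sorted(high)[:-1] + [75.0]

print(f"original:     mean = {mean(high):.3f}, median = {median(high):.3f}")
print(f"with outlier: mean = {mean(with_outlier):.3f}, median = {median(with_outlier):.3f}")
```

The mean moves toward the outlier, while the median stays put because the two middle observations are unchanged.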
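Sample variance:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right]$
Sample standard deviation: $s = \sqrt{s^2}$
Range: $\max\{x_i\} - \min\{x_i\}$
Percentiles (quantiles): e.g., the 90th percentile is a number that is higher than 90% of the data but lower than the other 10%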
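A quick numerical check that the definition and the two shortcut forms of the sample variance agree, again using the high-velocity shrinkage values. This is a sketch in plain Python; the variable names are chosen here for illustration.

```python
from math import sqrt

# High-velocity shrinkage values from the molding example.
x = [71.68, 71.74, 71.48, 71.71, 71.56, 71.70, 71.62, 71.52, 71.55, 71.50]
n = len(x)
xbar = sum(x) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

# Shortcut forms: sum of squares minus a correction term.
sum_sq = sum(xi ** 2 for xi in x)
s2_shortcut_a = (sum_sq - sum(x) ** 2 / n) / (n - 1)
s2_shortcut_b = (sum_sq - n * xbar ** 2) / (n - 1)

s = sqrt(s2_definition)
data_range = max(x) - min(x)

print(f"s^2: {s2_definition:.6f} {s2_shortcut_a:.6f} {s2_shortcut_b:.6f}")
print(f"s = {s:.4f}, range = {data_range:.2f}")
```

The shortcut forms are convenient on a calculator, but in floating point they subtract two large, nearly equal numbers, so the definition is usually the safer choice in code.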
STAT 312 (Lecture 2) Introduction January 15, 2014 6 / 33
Sd = 1, Range = 3.48
[Figure: two dot plots of Shrinkage on an axis from 71 to 75]
Statistic       Low vel.   High vel.
Sample mean     71.666     71.606
Sample sd.      0.216      0.0961
Range           0.670      0.260
Minimum         71.400     71.480
25% quantile    71.517     71.515
Median          71.600     71.590
75% quantile    71.860     71.703
Maximum         72.070     71.740
In Minitab:
  Stat > Basic Statistics > Display Descriptive Statistics...
  Select variables
  Click Statistics... to select the summaries you want
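For readers without Minitab, the same kind of summary can be sketched with Python's standard `statistics` module. The function name `describe` is an illustrative choice. The `"exclusive"` quantile method is used because it happens to reproduce the High vel. column of the table; other software may interpolate quartiles differently.

```python
from statistics import mean, median, quantiles, stdev

def describe(values):
    """Return the summaries shown in the table for one sample."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    return {
        "mean": mean(values),
        "sd": stdev(values),            # sample sd (divides by n - 1)
        "range": max(values) - min(values),
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
    }

# High-velocity shrinkage values from the molding example.
high = [71.68, 71.74, 71.48, 71.71, 71.56, 71.70, 71.62, 71.52, 71.55, 71.50]

for name, value in describe(high).items():
    print(f"{name:>6}: {value:.4f}")
```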
What do we need to know about how this comparative study was conducted in order to draw conclusions? What tentative conclusions can we draw about the importance of injection velocity from this data?
Descriptive summaries
Graphs: histograms, dot plots, box plots, stem-and-leaf plots
Numbers: mean, median, range, variance, standard deviation, quantiles
Put your SOCS on!
  Shape: symmetric, skewed, unimodal, bimodal
  Outliers: observations that are clearly different
  Center: representative value (mean, median)
  Spread: variation around the center (sd., range, quartiles)
Stem-and-leaf diagram
Divide each number into a stem and a leaf (usually the last digit)
List the stem values in a vertical column
Record all the leaves beside the corresponding stem
Make a stem-and-leaf diagram of the low velocity data
Shows the actual values
Looks like a sideways histogram
Not practical for large data sets
A bit old-fashioned
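The steps above can be sketched in a few lines of Python. The function `stem_and_leaf` is a hypothetical helper written for this example. It assumes values recorded to two decimals and uses the low velocity shrinkage values as listed in the molding example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for values recorded to two decimals."""
    rows = defaultdict(list)
    for v in sorted(values):
        hundredths = round(v * 100)          # 71.68 -> 7168, avoids float noise
        stem, leaf = divmod(hundredths, 10)  # 7168 -> stem 716, leaf 8
        rows[stem].append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):  # keep empty stems too
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem // 10}.{stem % 10} | {leaves}")
    return lines

# Low velocity shrinkage values as listed in the molding example.
low = [71.68, 71.55, 71.62, 71.58, 71.48, 71.42, 71.84, 71.58, 72.07, 71.92]

for line in stem_and_leaf(low):
    print(line)
```

Empty stems are printed on purpose: dropping them would hide gaps in the data and distort the shape of the display.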
Larger dataset
Various measurements (length, weight, gender, etc.) for various species of fish (cod, haddock, common dab, etc.)
Stem-and-leaf diagram
Common Dab
Count  Stem | Leaves
   3    1   | 111
  10    1   | 2233333
  31    1   | 444444444455555555555
  64    1   | 666666677777777777777777777777777
  92    1   | 8888888888889999999999999999
 129    2   | 0000000000000000000011111111111111111
 162    2   | 222222222222223333333333333333333
 (28)   2   | 4444444444444555555555555555
 173    2   | 6666666666666666666666666667777777777777777777777
 124    2   | 8888888888888888888888888889999999999999999
  81    3   | 0000000000000000001111111111111111111
  44    3   | 2222222223333333333333
  22    3   | 4444444445555555
   6    3   | 6677
   2    3   |
   2    4   | 11
Histograms
Histogram of the lengths of Common Dab in Icelandic waters
[Figure: histogram; x-axis: Length (cm), 10 to 42; y-axis: Frequency, 0 to 50]
Shows the number (or proportion) of data that fall within each bin
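A minimal sketch of what a histogram computes: each observation is assigned to a bin and the bins are counted. The helper `bin_counts` and the fish lengths below are invented for illustration (they are not the Common Dab data), and the code assumes every value is at least `start`.

```python
from collections import Counter

def bin_counts(values, start, width):
    """Count values in bins [start + k*width, start + (k+1)*width)."""
    index = Counter((v - start) // width for v in values)
    return [index.get(k, 0) for k in range(max(index) + 1)]

# Hypothetical lengths in cm, made up for this example.
lengths = [12, 15, 17, 18, 21, 22, 22, 24, 25, 25, 26, 27, 29, 31, 33, 38]

for width in (2, 7):
    print(f"bin width {width} cm")
    for k, count in enumerate(bin_counts(lengths, start=10, width=width)):
        lo = 10 + k * width
        print(f"  {lo:>2}-{lo + width:<2} {'*' * count}")
```

Running it with widths of 2 cm and 7 cm shows the same data producing a ragged picture with narrow bins and a smooth, coarse one with wide bins.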
Shape of a distribution
From a histogram
Symmetric: The right and left sides of the histogram are approximately mirror images of each other.
Skewed to the right: The right side of the histogram extends much farther than the left side.
Skewed to the left: The left side of the histogram extends much farther than the right side.
Bimodal: Two distinct peaks.
[Figure: example histograms with panels titled Symmetric, Skewed Right, Skewed Left, and Bimodal]
Histograms
Bin size matters!
[Figure: three histograms of Length (cm) against Frequency, each drawn with a different bin width]
Boxplots
Boxplot of the lengths of Common Dab in Icelandic waters
[Figure: boxplot; axis: Length (cm), 10 to 40]
Side-by-side Boxplots
Boxplots of the lengths of three species in Icelandic waters
[Figure: side-by-side boxplots for Common Dab, Rough Dab, and Herring; axis: Length (cm), 10 to 40; several points are plotted individually beyond the whiskers]
Interquartile range (IQR): Q3 - Q1
Whiskers extend no further than 1.5 × IQR from the quartiles in either direction.
Points outside that range are marked individually and might be considered outliers (or an indicator of a skewed distribution).
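The whisker rule can be sketched as follows. The helper `boxplot_summary` and the lengths are invented for illustration. Quartiles are computed with the `"exclusive"` method of `statistics.quantiles`, which is one of several conventions, so other software may give slightly different fences.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Quartiles, IQR, whisker ends, and flagged points under the 1.5 x IQR rule."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    outliers = [v for v in data if v < lo_fence or v > hi_fence]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),  # most extreme points within the fences
        "outliers": outliers,
    }

# Hypothetical lengths in cm, made up for this example.
lengths = [18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 41]

for name, value in boxplot_summary(lengths).items():
    print(f"{name:>8}: {value}")
```

Note that the whiskers end at actual data values inside the fences, not at the fences themselves.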
When data have more than one variable, we often want to examine the relationship between them
Descriptive measures of linear association:
  Sample correlation, scatterplots
FDS acronym:
  Form: Does the relationship appear to be linear? Curved? Parabolic? Etc.
  Direction: As one variable increases, does the other increase? Decrease?
  Strength: Is the relationship strong (a very tight relationship with very little noise)? Or very weak, with lots of noisy variation around the relationship?
Scatterplots
[Figure: scatterplot; x-axis: Midterm I, 40 to 100; y-axis: Midterm II, 40 to 100]
Scores on two Midterms in a math class. The overall pattern of the relationship between X and Y is linear, i.e. the points on the scatterplot all fall around a straight line. Note that as the values on the X-axis (midterm I) increase, the values on the Y-axis (midterm II) increase also. This is called a positive association.
Scatterplots
mpg and weight of cars
[Figure: scatterplot; x-axis: Weight, 1500 to 5000; y-axis: Miles per gallon (mpg), 10 to 40]
FDS: There is a moderately strong negative association between weight and mpg. The relationship is slightly curved: mpg seems to level off around 10 mpg for heavy cars.
Scatterplots
Scatterplot of weight versus length of Common Dab
[Figure: scatterplot; x-axis: length (cm), 10 to 40; y-axis: weight (g), 0 to 800]
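Correlation coefficient
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, where
$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$
$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$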
Alternative formulas
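$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$
$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$
$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$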
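A small worked check of these sums, using the standard definition of the sample correlation, r = Sxy / sqrt(Sxx Syy). The five (x, y) pairs are invented for illustration and the variable names are chosen here.

```python
from math import sqrt

# Hypothetical paired data, made up for this example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Definitions: sums of squared deviations and cross products.
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Alternative (shortcut) formulas.
sxx_alt = sum(xi ** 2 for xi in x) - n * xbar ** 2
syy_alt = sum(yi ** 2 for yi in y) - n * ybar ** 2
sxy_alt = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

r = sxy / sqrt(sxx * syy)

print(f"Sxx = {sxx}, Syy = {syy}, Sxy = {sxy}")
print(f"shortcut: Sxx = {sxx_alt}, Syy = {syy_alt}, Sxy = {sxy_alt}")
print(f"r = {r:.4f}")
```

By hand: the means are 3 and 4, so Sxx = 10, Syy = 6, Sxy = 6, and r = 6 / sqrt(60), about 0.77.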
Correlation coefficient
[Figure: panels of example scatterplots; horizontal axis: X]
Correlation coefficient
The scatterplot above shows vocabulary growth for children at different ages. There is a strong association here, but r is close to 0.
Be careful: weak correlation (r close to 0) only implies no LINEAR relationship; there may be another type of relationship between the two variables (e.g., a curved relationship).
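A minimal numerical illustration of this warning, using invented data: y is completely determined by x through y = x², yet the sample correlation is zero because the relationship is symmetric rather than linear. The helper `correlation` is written here for the example.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect curved relationship and a perfect linear one.
x = [-3, -2, -1, 0, 1, 2, 3]
r_curved = correlation(x, [xi ** 2 for xi in x])
r_line = correlation(x, [2 * xi + 1 for xi in x])

print(f"y = x^2:    r = {r_curved:.3f}")
print(f"y = 2x + 1: r = {r_line:.3f}")
```

Plotting the data first would reveal the curve immediately, which is why a scatterplot should accompany any reported r.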
Correlation coefficient
Correlation ≠ causation
Number of churches and bars in 100 different cities in the US
The correlation between these two variables is quite high (0.95)
Q: Does this mean that going to church causes you to go to bars?
A: Of course not! There is a lurking variable in this example: the size of the population, i.e., cities with larger populations tend to have both more churches and more bars.
Strong correlation does not prove causation
Scatterplot matrix