
Introduction

Chapter 2
Presenting data in an informative way:
  Numerical summaries (mean, median, standard deviation, ...)
  Graphical displays (histogram, boxplot, ...)
  Association between variables (scatterplot, correlation, ...)

Reminders:
  Make sure you have access to Minitab and the appropriate calculator
  Homework 1 is due Wednesday, Jan 22, in class
  Problem sets: P1 and P2

STAT 312 (Lecture 2)

Introduction

January 15, 2014

1 / 33

Learning from Data

Example: Molding
Injection molding is a manufacturing process used to form many types of parts. Injected material is usually hot and shrinks as it cools, so the part cavity must be oversized compared to the desired final dimensions. Shrinkage is influenced by many factors, including:
  Injection velocity (ft/sec)
  Mold temperature (deg C)
  Material viscosity

A comparative experiment was performed in which mold temperature was held at a fixed temperature and results were observed at two levels of injection velocity (low and high). Shrinkage of cooled parts is measured (m) for 10 specimens at each velocity. Results:

Low velocity    71.68  71.62  71.48  71.84  72.07
                71.55  71.58  71.42  71.58  71.92
High velocity   71.68  71.48  71.56  71.62  71.55
                71.74  71.71  71.70  71.52  71.50

Example: Molding comparisons

Dotplots

[Figure: Dotplot of Shrinkage for Low Velocity; axis: Shrinkage, 71.4 to 72.0]
[Figure: Dotplot of Shrinkage for High Velocity; axis: Shrinkage, 71.50 to 71.75]

Useful numerical summaries

Numerical summaries for one variable


Data: $x_1, x_2, \ldots, x_n$

Measures of Center
  Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  Median $m$: 50% of the $x_i$'s are lower and 50% are higher than $m$

The mean is sensitive to outliers, but the median is not:
[Figure: two dotplots of Shrinkage on a common axis from 71 to 75. First panel: Mean = 71.7, Median = 71.6. Second panel: Mean = 72, Median = 71.6.]
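The comparison in the two panels can be sketched in a few lines of Python. This is only an illustration: the low-velocity values come from the molding example, while the outlier value 74.90 is an assumption (it is not given on the slide; it was picked because it moves the mean to about 72).

```python
from statistics import mean, median

# Low-velocity shrinkage values from the molding example.
low_velocity = [71.68, 71.62, 71.48, 71.84, 72.07,
                71.55, 71.58, 71.42, 71.58, 71.92]

# Hypothetical outlier: replace the largest observation by 74.90.
# The value 74.90 is an assumption made for illustration only.
largest = max(low_velocity)
with_outlier = [74.90 if x == largest else x for x in low_velocity]

for label, data in [("original", low_velocity), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean = {mean(data):.2f}, median = {median(data):.2f}")
```

The mean moves toward the outlier, while the median depends only on the middle of the sorted data and stays where it was.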



Data: $x_1, x_2, \ldots, x_n$

Measures of Spread
  Sample variance:
  $$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
  Alternative formula:
  $$s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right]$$
  Sample standard deviation: $s = \sqrt{s^2}$
  Range: $\max\{x_i\} - \min\{x_i\}$
  Percentiles (quantiles): e.g., the 90th percentile is a number that is higher than 90% of the data but lower than the remaining 10%
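As a quick numerical check, the definition of the sample variance and its two alternative forms can be evaluated side by side. The sketch below uses the high-velocity shrinkage values from the molding example; the variable names are illustrative choices, not part of the course material.

```python
import math

# High-velocity shrinkage values from the molding example.
high_velocity = [71.68, 71.48, 71.56, 71.62, 71.55,
                 71.74, 71.71, 71.70, 71.52, 71.50]

n = len(high_velocity)
sum_x = sum(high_velocity)
sum_x2 = sum(x * x for x in high_velocity)
x_bar = sum_x / n

# Definition: squared deviations from the mean, divided by n - 1.
s2_def = sum((x - x_bar) ** 2 for x in high_velocity) / (n - 1)
# Alternative form using the sum of squares and the squared sum.
s2_alt1 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
# Alternative form using the sum of squares and the mean.
s2_alt2 = (sum_x2 - n * x_bar ** 2) / (n - 1)

s = math.sqrt(s2_def)
data_range = max(high_velocity) - min(high_velocity)

print(f"s^2 (definition)  = {s2_def:.6f}")
print(f"s^2 (alt. form 1) = {s2_alt1:.6f}")
print(f"s^2 (alt. form 2) = {s2_alt2:.6f}")
print(f"s = {s:.4f}, range = {data_range:.2f}")
```

The three variance values agree up to floating-point rounding. The alternative forms subtract two large, nearly equal numbers, so on a computer the definition is usually the numerically safer choice; the shortcut forms are mainly convenient for hand calculation.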

Sd and range are sensitive to outliers

[Figure: two dotplots of Shrinkage on a common axis from 71 to 75. First panel: Sd = 0.2, Range = 0.65. Second panel: Sd = 1, Range = 3.48.]
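The same effect can be reproduced numerically. In this sketch the low-velocity values are those of the molding example, and the outlier 74.90 is an assumed value, chosen because it gives a range of 3.48 as quoted in the figure; the slide itself does not state which point was moved.

```python
from statistics import stdev

# Low-velocity shrinkage values from the molding example.
low_velocity = [71.68, 71.62, 71.48, 71.84, 72.07,
                71.55, 71.58, 71.42, 71.58, 71.92]

# Hypothetical outlier: the largest value replaced by 74.90 (assumption).
with_outlier = sorted(low_velocity)[:-1] + [74.90]


def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


for label, data in [("original", low_velocity), ("with outlier", with_outlier)]:
    print(f"{label:>12}: sd = {stdev(data):.2f}, range = {data_range(data):.2f}")
```

A single extreme observation multiplies both the standard deviation and the range several times over, which is why quartile-based measures of spread are often reported alongside them.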


Example: Molding comparisons


Numerical summaries

Statistic       Low vel.   High vel.
Sample mean     71.666     71.606
Sample sd.      0.216      0.0961
Range           0.670      0.260
Minimum         71.400     71.480
25% quantile    71.517     71.515
Median          71.600     71.590
75% quantile    71.860     71.703
Maximum         72.070     71.740

In Minitab:
  Stat > Basic Statistics > Display Descriptive Statistics...
  Select variables
  Click Statistics... to select the summaries you want
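For readers without Minitab, a similar summary can be sketched with Python's standard `statistics` module. The example below uses the high-velocity data. It assumes that the quartiles are located at positions (n + 1)p in the sorted data, which is what `statistics.quantiles(..., method="exclusive")` does; that rule reproduces the High vel. column above, but other software may use a different quantile convention.

```python
import statistics

# High-velocity shrinkage values from the molding example.
high_velocity = [71.68, 71.48, 71.56, 71.62, 71.55,
                 71.74, 71.71, 71.70, 71.52, 71.50]

# Quartiles at positions (n + 1) * p in the sorted data (assumed convention).
q1, med, q3 = statistics.quantiles(high_velocity, n=4, method="exclusive")

summary = {
    "mean": statistics.mean(high_velocity),
    "sd": statistics.stdev(high_velocity),
    "range": max(high_velocity) - min(high_velocity),
    "min": min(high_velocity),
    "q1": q1,
    "median": med,
    "q3": q3,
    "max": max(high_velocity),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.4f}")
```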


Example: Molding continued

What do we need to know about how this comparative study was conducted in order to draw conclusions?
What tentative conclusions can we draw about the importance of injection velocity from these data?

The engineer in charge of the study was somewhat surprised at the results from the previous experiment. Upon checking his logs, he noted that the temperature used was somewhat lower than that ordinarily used for the specific material being studied. A second series of parts is fabricated, this time at a higher temperature. Results for higher temperature:

Low velocity    76.20  76.09  76.15  76.18  76.17
                75.94  75.98  76.12  76.25  75.82
High velocity   93.25  93.19  93.29  93.75  93.37
                92.98  92.87  93.47  93.89  91.62

Dotplots in Minitab: Graph > Dotplot... > select Multiple Y's > select variables

Example: Molding comparative dotplots

Now what conclusions can we draw?



Descriptive summaries

Graphs: histograms, dot plots, box plots, stem/leaf plots
Numbers: mean, median, range, variance, standard deviation, quantiles

Put your SOCS on!
  Shape: symmetric, skewed, unimodal, bimodal
  Outliers: observations clearly different from the rest
  Center: representative value (mean, median)
  Spread: variation around center (sd., range, quartiles)


Graphical displays of one variable

Stem-and-leaf diagram


  Divide each number into a stem and a leaf (usually the last digit)
  List the stem values in a vertical column
  Record all the leaves beside the corresponding stem

Make a stem-and-leaf diagram of the low velocity data

  Shows the actual values
  Looks like a sideways histogram
  Not practical for large data sets
  A bit old-fashioned
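The construction steps can be sketched as a small Python function. This is one possible implementation, not the course's: the function name `stem_and_leaf` and the choice to scale the data to whole hundredths before splitting off the last digit are assumptions made for the example.

```python
from collections import defaultdict

# Low-velocity shrinkage values from the molding example.
low_velocity = [71.68, 71.62, 71.48, 71.84, 72.07,
                71.55, 71.58, 71.42, 71.58, 71.92]


def stem_and_leaf(values, decimals=2):
    """Return {stem: [leaves]}; the leaf is the last kept digit of each value."""
    # Scale to integers first so the split into stem and leaf is exact.
    scaled = sorted(round(v * 10 ** decimals) for v in values)
    leaves = defaultdict(list)
    for number in scaled:
        stem, leaf = divmod(number, 10)
        leaves[stem].append(leaf)
    # List every stem between the smallest and largest, including empty ones,
    # so that gaps in the data remain visible.
    first, last = scaled[0] // 10, scaled[-1] // 10
    return {stem: leaves.get(stem, []) for stem in range(first, last + 1)}


diagram = stem_and_leaf(low_velocity)
for stem, leaf_list in diagram.items():
    print(f"{stem / 10:.1f} | {''.join(str(leaf) for leaf in leaf_list)}")
```

Here a row such as `71.5 | 588` stands for the observations 71.55, 71.58 and 71.58.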


Larger dataset

Various measurements (length, weight, gender, etc.) for various species of fish (cod, haddock, common dab, etc.)

Common Dab
Columns: cumulative count, stem, leaves

    3   1  111
   10   1  2233333
   31   1  444444444455555555555
   64   1  666666677777777777777777777777777
   92   1  8888888888889999999999999999
  129   2  0000000000000000000011111111111111111
  162   2  222222222222223333333333333333333
 (28)   2  4444444444444555555555555555
  173   2  6666666666666666666666666667777777777777777777777
  124   2  8888888888888888888888888889999999999999999
   81   3  0000000000000000001111111111111111111
   44   3  2222222223333333333333
   22   3  4444444445555555
    6   3  6677
    2   3
    2   4  11

Histograms

[Figure: Histogram of the lengths of Common Dab in Icelandic waters; x-axis: Length (cm), 10 to 42; y-axis: Frequency, 0 to 50]

Shows the number (or proportion) of data that fall within each bin
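Counting observations per bin is simple enough to sketch directly. The example below uses the low-velocity molding data rather than the fish data (which are not listed in these slides); the values are stored in hundredths as integers so that bin edges are exact. The function name `bin_counts` and the two bin widths are illustrative assumptions.

```python
# Low-velocity shrinkage values from the molding example, in hundredths
# (71.68 -> 7168) so that all bin arithmetic is done on integers.
low_velocity_hundredths = [7168, 7162, 7148, 7184, 7207,
                           7155, 7158, 7142, 7158, 7192]


def bin_counts(values, start, width, n_bins):
    """Count the values in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = (v - start) // width
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


# Narrow bins of width 0.10 and wide bins of width 0.35, both starting at 71.40.
narrow = bin_counts(low_velocity_hundredths, start=7140, width=10, n_bins=7)
wide = bin_counts(low_velocity_hundredths, start=7140, width=35, n_bins=2)

for label, counts in [("width 0.10", narrow), ("width 0.35", wide)]:
    print(label)
    for k, count in enumerate(counts):
        print(f"  bin {k}: {'*' * count}")
```

With the narrow bins the gap and the long right tail are visible; with the wide bins the same data collapse into two bars, so the picture depends on the bin width chosen.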

Shape of a distribution
From a histogram

  Symmetric: The right and left sides of the histogram are approximately mirror images of each other.
  Skewed to the right: The right side of the histogram extends much farther than the left side.
  Skewed to the left: The left side of the histogram extends much farther than the right side.
  Bimodal: Two distinct peaks.

[Figure: example histograms in four panels: Symmetric, Skewed Right, Skewed Left, Bimodal]

Histograms
Bin size matters!

[Figure: three histograms of the lengths of Common Dab in Icelandic waters, each drawn with a different bin width; x-axis: Length (cm); y-axis: Frequency]

Boxplots

The Five-Number Summary


The Five-Number Summary is a way to describe a distribution in five numbers:
  Minimum: the smallest value
  Q1 (first quartile): 1/4 of the data are less than this value
  Median: half of the data are less than this value
  Q3 (third quartile): 3/4 of the data are less than this value
  Maximum: the largest value

These five numbers frame the four quarters of the data.

Boxplot: the Five-Number Summary on a graph (essentially)

Boxplot

[Figure: Boxplot of the lengths of Common Dab in Icelandic waters; axis: Length (cm), 10 to 40]

Side-by-side Boxplots

[Figure: Boxplots of the lengths of three species in Icelandic waters (Common Dab, Rough Dab, Herring); axis: Length (cm), 10 to 40]

Interquartile range (IQR): Q3 - Q1
Whiskers extend no further than 1.5 IQR from the quartiles in either direction. Points outside that range are marked individually and might be considered outliers (or an indicator of a skewed distribution).
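The 1.5 IQR rule can be sketched as follows, using the high-velocity results from the higher-temperature molding run. The helper name `boxplot_fences` is hypothetical, and the quartiles are computed with `statistics.quantiles(..., method="exclusive")`, which is an assumed convention; plotting software may compute quartiles slightly differently and so place the whisker limits elsewhere.

```python
import statistics

# High-velocity shrinkage values from the higher-temperature molding run.
high_temp_high_velocity = [93.25, 93.19, 93.29, 93.75, 93.37,
                           92.98, 92.87, 93.47, 93.89, 91.62]


def boxplot_fences(values):
    """Return (lower, upper) limits located 1.5 IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


lower, upper = boxplot_fences(high_temp_high_velocity)
flagged = [x for x in high_temp_high_velocity if x < lower or x > upper]

print(f"whisker limits: [{lower:.3f}, {upper:.3f}]")
print(f"points beyond the limits: {flagged}")
```

Under these assumptions only the observation 91.62 falls outside the limits, so a boxplot of this sample would draw it as an individual point.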

Graphical displays of many variables

When data have more than one variable, we often want to examine the relationship between them.

Descriptive measures of linear association:
  Sample correlation, scatterplots

FDS acronym:
  Form: Does the relationship appear to be linear? Curved? Parabolic? Etc.
  Direction: As one variable increases, does the other increase? Decrease?
  Strength: Is the relationship strong (a very tight relationship with very little noise)? Or very weak, with lots of noisy variation around the relationship?

Scatterplots

[Figure: scatterplot "Scores on two Midterms"; x-axis: Midterm I, 40 to 100; y-axis: Midterm II, 40 to 100]

Scores on two Midterms in a math class. The overall pattern of the relationship between X and Y is linear, i.e. the points on the scatterplot all fall around a straight line. Note that as the values on the X-axis (midterm I) increase, the values on the Y-axis (midterm II) increase also. This is called a positive association.
[Figure: scatterplot "mpg and weight of cars"; x-axis: Weight, 1500 to 5000; y-axis: Miles per gallon (mpg), 10 to 40]

FDS: There is a moderately strong negative association between weight and mpg. The relationship is slightly curved: mpg seems to level off around 10 mpg for heavy cars.

[Figure: Scatterplot of weight versus length of Common Dab; x-axis: length (cm), 10 to 40; y-axis: weight (g), 0 to 800]

Correlation coefficient

Measure of linear association

The sample correlation coefficient is defined as
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
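The definition translates almost line by line into code. The sketch below is a minimal illustration; the function name `correlation` and the five data pairs are hypothetical and not taken from the lecture.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical paired data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(f"r = {r:.4f}")
```

For these values S_xx = 10, S_yy = 6 and S_xy = 6, so r = 6 / sqrt(60), roughly 0.77: a positive, moderately strong linear association.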

Alternative formulas

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$
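For completeness, the shortcut form of S_xy follows from expanding the product and using the identities sum of x_i = n x-bar and sum of y_i = n y-bar. Setting y_i = x_i gives the corresponding result for S_xx, and likewise for S_yy. A short derivation, with all sums running over i = 1, ..., n:

```latex
\begin{align*}
S_{xy} &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
       &= \sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i
          - \bar{x} \sum_{i=1}^{n} y_i + n\bar{x}\bar{y} \\
       &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} \\
       &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
        = \sum_{i=1}^{n} x_i y_i
          - \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr) \Bigl( \sum_{i=1}^{n} y_i \Bigr).
\end{align*}
```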

The correlation coefficient

Correlation coefficient ($r$): a numerical measure of linear association between two quantitative variables.
  $-1 \le r \le 1$
  Measures the strength and direction of the linear relationship between the variables:
    If r is positive, we have a positive linear association (when X gets bigger, Y tends to get bigger too)
    If r is negative, we have a negative linear association (when X gets bigger, Y tends to get smaller)
    When r is close to -1 or +1, the points on the scatterplot cluster tightly around a straight line, i.e., we have a strong linear association
    When r is closer to 0 (positive or negative), the points are more widely scattered around a straight line and we have a weak linear association
Match the following correlations to the corresponding scatterplot: r = 0.9, r = 0.8, r = 0, r = 0.5, r = 0.9, r = 0.99

[Figure: scatterplots of Y against X, to be matched with the correlations listed above]

Correlation coefficient only describes a linear relationship

[Figure: scatterplot of vocabulary growth for children at different ages]

The scatterplot above shows vocabulary growth for children at different ages. There is a strong association here, but r is close to 0.

Be careful: weak correlation (r close to 0) only implies no LINEAR relationship; there may be another type of relationship between the two variables (e.g., a curved relationship).
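A small constructed example makes the point concrete. The data below are hypothetical (they are not the vocabulary data): y is determined exactly by x through y = x^2, yet the sample correlation is zero because the relationship is symmetric about x = 0. The helper `correlation` is an illustrative implementation of the usual formula r = S_xy / sqrt(S_xx S_yy).

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: a perfect, but curved, relationship y = x^2.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

r = correlation(x, y)
print(f"r = {r:.4f}")
```

Knowing x here determines y completely, so the association is as strong as it can be, but a straight line captures none of it. This is why a scatterplot should always accompany r.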

Correlation ≠ causation

Number of churches and bars in 100 different cities in the US
  The correlation between these two variables is quite high (0.95)
  Q: Does this mean that going to church causes you to go to bars?
  A: Of course not! There is a lurking variable in this example: the size of the population, i.e., cities with larger populations tend to have both more churches and more bars.

Strong correlation does not prove causation
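The lurking-variable mechanism can be mimicked with a toy data set. All numbers below are invented for illustration (six hypothetical cities, not the 100 US cities of the example): both counts are constructed to grow with population, and neither is computed from the other. The helper `correlation` is an illustrative implementation of r = S_xy / sqrt(S_xx S_yy).

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical cities: population in thousands, with churches and bars
# each set to roughly a fixed fraction of the population.
population = [10, 20, 30, 40, 50, 60]
churches = [6, 9, 16, 19, 26, 29]
bars = [3, 7, 10, 13, 17, 20]

r_churches_bars = correlation(churches, bars)
print(f"r(churches, bars)       = {r_churches_bars:.3f}")
print(f"r(population, churches) = {correlation(population, churches):.3f}")
print(f"r(population, bars)     = {correlation(population, bars):.3f}")
```

Churches and bars come out strongly correlated even though the only thing linking them in this construction is city size.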

Scatterplot matrix

Scatterplot matrix for more than two variables