June 7, 2011
Example
Population, Unit, Sample, Size
Sample
A parameter is a numerical value that would be calculated using all of the values of the units in the population. A statistic is a numerical value that is calculated using all of the values of the units in a sample.
Example
Parameter or Statistic?
According to the Campus Housing Fact Sheet at a Big-Ten University, 60% of the students living in campus housing are in-state residents. In a sample of 200 students living in campus housing, 56.5% were found to be in-state residents. Circle your answer.
In this particular situation, the value of 60% is a (parameter, statistic).
In this particular situation, the value of 56.5% is a (parameter, statistic).
Definitions
A unit is the item or object we observe. When the object is a person, we refer to the unit as a subject.
An observation is the information or characteristic recorded for each unit.
A characteristic that can vary from unit to unit is called a variable.
A collection of observations on one or more variables is called a data set.
Definitions
More About Variables
Qualitative variables are those which classify the units into categories. The categories may or may not have a natural ordering to them. Qualitative variables are also called categorical variables.
Quantitative variables have numerical values that are measurements (length, weight, and so on) or counts (of how many). Arithmetic operations on such numerical values do have meaning.
A discrete variable can only take on a finite (or countable) number of values.
A continuous variable can take on any value in an interval (or collection of intervals).
Example
Unit, Observation, Variables
Composer                     Siblings
Ludwig Van Beethoven         6
Nikolai Karlovich Medtner    5
Jacques Offenbach            6

Identify the following:
Unit
Observation
Example
What Type of Random Variable is Weight?
Packages are brought to a mailing center and weighed. Their results are recorded. Is weight discrete or continuous?
Packages are brought to a mailing center and weighed. Their weights are recorded to the nearest pound. Is weight discrete or continuous?
Packages under 5 pounds are classified as light, those weighing between 5 and 20 pounds are classified as medium, and those over 20 pounds are classified as heavy. We record the variable weight, which takes on the values light, medium, or heavy. Now the variable weight is qualitative.
The type of a random variable is determined by its context in the experiment, not by general categories. It is important to ask many questions about the data and how they were obtained.
Measures of Center
What single number would best represent the most typical age for the 20 subjects?

Subject  Gender  Age     Subject  Gender  Age
1        M       45      11       M       41
2        M       41      12       F       44
3        F       51      13       F       47
4        F       46      14       F       49
5        F       47      15       M       45
6        F       42      16       F       42
7        M       43      17       M       41
8        F       50      18       F       40
9        M       39      19       M       45
10       M       32      20       M       37
Measures of center are numerical values that tend to report (in some sense) the middle of the data. We shall focus on two such measures: the mean and the median.
Measures of Center
Mean
The mean of a set of n observations is the sum of the observations divided by the number of observations, n.
If the observations are a sample of a larger group, then we denote the mean by $\bar{x}$ (pronounced "x-bar"). If the observations are the entire group, i.e. the entire population, then we denote the mean by the Greek letter $\mu$.
Math Trip: If $x_1, x_2, \ldots, x_n$ denote the observations, then the mean is calculated by
$$\bar{x} = \frac{(x_1 + x_2 + \cdots + x_n)}{n}.$$
Note the parentheses in the numerator. If you forget these in your calculator, things will go horribly wrong!
Example
Measures of Center: Mean
Consider the following data.

Subject  Gender  Age     Subject  Gender  Age
1        M       45      11       M       41
2        M       41      12       F       44
3        F       51      13       F       47
4        F       46      14       F       49
5        F       47      15       M       45
6        F       42      16       F       42
7        M       43      17       M       41
8        F       50      18       F       40
9        M       39      19       M       45
10       M       32      20       M       37

Find the mean of the ages of the male subjects.
$$\bar{x} = \frac{(45 + 41 + 43 + 39 + 32 + 41 + 45 + 41 + 45 + 37)}{10} = \frac{409}{10} = 40.9$$
Example
Measures of Center: Mean
Suppose that the number of children in a simple random sample of 10 households is as follows: 2, 3, 0, 2, 1, 0, 3, 0, 1, 4.
1. Calculate the sample mean number of children per household.
2. Interpret your answer.
3. Suppose that the observation for the last household in the above list was incorrectly recorded as 40 instead of 4. What would happen to the mean?
Solution
Measures of Center: Mean
1. Calculate the sample mean number of children per household.
$$\bar{x} = \frac{(2+3+0+2+1+0+3+0+1+4)}{10} = \frac{16}{10} = 1.6$$
2. Interpret your answer. Note that 1.6 is not rounded up to, say, 2. We are reporting a value that we would expect on average, over many samples of 10 households.
3. Suppose that the observation for the last household in the above list was incorrectly recorded as 40 instead of 4. What would happen to the mean?
$$\bar{x} = \frac{(2+3+0+2+1+0+3+0+1+40)}{10} = \frac{52}{10} = 5.2$$
Thus we say the mean is sensitive to extreme observations. Most graphical displays would detect this...always graph your data!
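The effect of the misrecorded value can be checked with a few lines of Python. This is only a sketch: the helper name `mean` and the variable names are illustrative choices, not part of the course materials.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

# Number of children in the 10 sampled households
children = [2, 3, 0, 2, 1, 0, 3, 0, 1, 4]

# Same list, but with the last household recorded as 40 instead of 4
misrecorded = children[:-1] + [40]

mean_correct = mean(children)         # 16 / 10
mean_misrecorded = mean(misrecorded)  # 52 / 10

print(mean_correct, mean_misrecorded)
```

A single extreme entry moves the mean from 1.6 to 5.2, even though nine of the ten observations are unchanged.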
Let's Do It!
Measures of Center: Mean
Suppose a sample of size n = 10 observations is observed.
Can $\bar{x}$ be larger than the maximum value or less than the minimum value? If yes, give an example.
Can $\bar{x}$ be the minimum value? If yes, give an example.
Can $\bar{x}$ be the maximum value? If yes, give an example.
Can $\bar{x}$ be exactly the midpoint between the minimum and maximum value (when the minimum does not equal the maximum)? If yes, give an example.
Can $\bar{x}$ be exactly the second smallest value (out of the 10, not all equal observations, when they are ordered from smallest to largest)? If yes, give an example.
Can $\bar{x}$ be not equal to any value in the sample? If yes, give an example.
Let's Do It!
A Mean Is Not Always Representative
Kim's biology test scores are 7, 98, 25, 19, and 26. Calculate Kim's mean test score. Explain why the mean does not do a very good job at summarizing Kim's test scores.
Let's Do It!
Combining Means
We have seven students. The mean score for three of these students is 54 and the mean score for the four other students is 76. What is the mean score for all seven students?
The Mean
As an Equilibrium Point
The mean = the point of equilibrium, the point where the distribution would balance.
[Figure: two distributions balanced at their means. First picture: axis from 1 to 3, Mean = 2. Second picture: axis from 1 to 7, Mean = 3.]
If the distribution is symmetric, as in the first picture at the left, the mean would be exactly at the center of the distribution. As the largest observation is moved further to the right, making this observation somewhat extreme, the mean shifts towards the extreme observation. If a distribution appears to be skewed, we may wish also to report a more resistant measure of center.
Frequency Tables
Sometimes data are grouped into classes, and the number of observations falling in each class is recorded. The result is called a frequency table.
The data represent the number of miles run during one week for a sample of 20 runners.
7, 13, 15, 18, 18, 20, 22, 22, 24, 24, 25, 26, 27, 28, 29, 33, 34, 35, 37, 40.
This can be grouped into the following frequency table (based upon given classes).

Class          Frequency
5.5 - 10.5     1
10.5 - 15.5    2
15.5 - 20.5    3
20.5 - 25.5    5
25.5 - 30.5    4
30.5 - 35.5    3
35.5 - 40.5    2

Using the class midpoints $x_m$ and the frequencies $f$, the mean is
$$\bar{x} = \frac{\sum f \cdot x_m}{n} = \frac{490}{20} = 24.5 \text{ miles}$$
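The grouped-mean calculation can be sketched in Python as follows. The class midpoints are taken as the average of each class's two boundaries, which is an assumption consistent with the total of 490 above; the variable names are illustrative.

```python
# Class boundaries and frequencies from the miles-run frequency table
classes = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
           (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
freq = [1, 2, 3, 5, 4, 3, 2]

# Midpoint of each class: (lower boundary + upper boundary) / 2
midpoints = [(lo + hi) / 2 for lo, hi in classes]

n = sum(freq)                                          # number of runners
sum_fxm = sum(f * xm for f, xm in zip(freq, midpoints))
grouped_mean = sum_fxm / n

# For comparison: the mean of the 20 raw observations
raw = [7, 13, 15, 18, 18, 20, 22, 22, 24, 24,
       25, 26, 27, 28, 29, 33, 34, 35, 37, 40]
raw_mean = sum(raw) / len(raw)

print(n, sum_fxm, grouped_mean, raw_mean)
```

The grouped mean (24.5) is close to, but not the same as, the mean of the raw data (24.85), because every observation in a class is treated as if it sat at the class midpoint.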
Let's Do It!
The Mean of Grouped Data/Frequency Tables
Eighty randomly selected light bulbs were tested to determine their lifetime in hours. The frequency table of the results is shown below. Find the average lifetime of a light bulb.

Class      Frequency, f    Midpoint, x_m    f·x_m
53-63      6
64-74      12
75-85      25
86-96      18
97-107     14
108-118    5

n =
$\sum f \cdot x_m =$
$\bar{x} = \frac{\sum f \cdot x_m}{n} =$
Let's Do It!
The Mean of Grouped Data/Frequency Tables
The cost per load (in cents) of 35 laundry detergents tested by a consumer organization is given below.

Class    Frequency, f    Midpoint, x_m    f·x_m
13-19    2
20-26    7
27-33    12
34-40    5
41-47    6
48-54    1
55-61    0
62-68

n =
$\sum f \cdot x_m =$
The mean is given by $\bar{x} = \frac{\sum f \cdot x_m}{n} =$
Measures of Center
Median
A measure of center that is more resistant to extreme values is the median.
The median, M, of a set of n observations, ordered from smallest to largest, is a value such that half of the observations are less than or equal to that value and half the observations are greater than or equal to that value.
If the number of observations is odd, the median is the middle observation.
If the number of observations is even, the median is any number between the two middle observations, including either of the two middle observations. To be consistent, we will define the median as the mean or average of the two middle observations.
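The two cases in the definition (odd and even n) can be sketched as a short Python function. The function name `median` is an illustrative choice; the data are the ages of the ten male subjects from the earlier mean example.

```python
def median(values):
    """Middle observation for odd n; average of the two middle ones for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the middle two

# Ages of the ten male subjects
male_ages = [45, 41, 43, 39, 32, 41, 45, 41, 45, 37]
male_median = median(male_ages)

print(male_median)
```

Here the ordered ages are 32, 37, 39, 41, 41, 41, 43, 45, 45, 45, so the two middle observations are both 41 and the median is 41, close to the mean of 40.9 found earlier.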
Example
Median
Find the median, M, of the ages of the following 8 subjects.
30 37 39 40 | 41 42 43 44
The median M lies between the two middle observations, 40 and 41.
So, $M = \frac{40 + 41}{2} = 40.5$.
Let's Do It!
Median
The number of children in each of 10 households is shown below.
Number of Children: 2 3 0 1 4 0 3 0 1 2
Median, M =
What happens to the median if the fifth observation in the first list was incorrectly recorded as 40 instead of 4?
What happens to the median if the third observation in the first list was incorrectly recorded as -20 instead of 0?
The median is resistant; that is, it does not change, or changes very little, in response to extreme observations.
Measures of Center
Mode
The mode of a set of observations is the most frequently occurring value; it is the value having the highest frequency among the observations.
The mode of the values {0, 0, 0, 0, 1, 1, 2, 2, 3, 4} is 0.
The values {0, 0, 0, 1, 1, 2, 2, 2, 3, 4} have two modes, 0 and 2 (bimodal).
The values {0, 1, 2, 4, 5, 8} have no mode.
The mode is not often used as a measure of center for quantitative data. The mode can be computed for qualitative (non-numeric) data.
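The three cases above can be reproduced with a small Python helper. This is a sketch under one stated assumption: when no value occurs more than once, the helper reports no mode (an empty list), matching the convention used here. The name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumption: every value occurring once means "no mode"
    return sorted(v for v, c in counts.items() if c == top)

one_mode = modes([0, 0, 0, 0, 1, 1, 2, 2, 3, 4])
two_modes = modes([0, 0, 0, 1, 1, 2, 2, 2, 3, 4])
no_mode = modes([0, 1, 2, 4, 5, 8])

print(one_mode, two_modes, no_mode)
```

Because the helper only counts occurrences, it also works for qualitative data such as a list of category labels.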
Measures of Center
Different Measures Can Give Different Impressions
Consider the annual incomes of five families in a neighborhood: $12,000, $12,000, $30,000, $90,000, $100,000.
1. Calculate the average income.
2. Calculate the median income.
3. Calculate the modal income.
4. If you were trying to promote that this is an affluent neighborhood, which measure might you prefer to present? If you were trying to argue against a tax increase, which measure might you prefer to present? If you want to represent these values with the income that is in the middle, which measure might you prefer to present?
Measures of Center
Shapes of Distributions
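[Figure: four distribution shapes with the positions of the mean, median, and mode marked.]
Symmetric Distribution: Mean = Median = Mode
Bimodal Distribution: Mean = Median, with two modes
Left Skewed: Mean, Median, Mode (from left to right)
Right Skewed: Mode, Median, Mean (from left to right)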
Homework
[Figure: List 1 and List 2 plotted on a common axis from 35 to 85.]
Range is just the difference between the largest value and the smallest value. Consider the following data sets.
List 1: 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 63, 64, 65
List 2: 35, 40, 45, 50, 55, 60, 60, 60, 65, 70, 75, 80, 85
Range of List 1: $65 - 55 = 10$.
Range of List 2: $85 - 35 = 50$.
Clearly, List 2 is spread out more than List 1.
[Figure: a second pair of lists, List 1 and List 2, plotted on a common axis from 35 to 85.]
Both lists have ranges of 50, yet List 1 has more data concentrated in the middle while List 2 has more data concentrated on the ends. The range alone cannot distinguish the two.
The three values that divide the data into four parts are called the quartiles, represented by $Q_1$, $Q_2 = M$ = Median, and $Q_3$.
Finding the quartiles:
1. Find the median of all of the observations.
2. First Quartile = $Q_1$ = median of observations that fall below the median.
3. Third Quartile = $Q_3$ = median of observations that fall above the median.
Some things to remember:
When the number of observations is odd, the middle observation is the median. This observation is not included in either of the two halves when computing $Q_1$ and $Q_3$.
Although different books, calculators, and computers may use slightly different ways to compute the quartiles, they are all based on the same idea.
In a left-skewed distribution, the first quartile will be farther from the median than the third quartile.
If the distribution is symmetric, the quartiles should be the same distance from the median.
In a right-skewed distribution, the third quartile will be farther from the median than the first quartile.
Example
Quartiles
Find the quartiles of the ages of the following 8 subjects.
30 37 | 39 40 | 41 42 | 43 44
(the dividers mark $Q_1$, M, and $Q_3$, in that order)
$Q_1 = \frac{37 + 39}{2} = 38$
$M = Q_2 = \frac{40 + 41}{2} = 40.5$
$Q_3 = \frac{42 + 43}{2} = 42.5$
The interquartile range measures the spread of the middle 50% of the data and is defined to be $IQR = Q_3 - Q_1$.
Find the interquartile range of the ages of the following 8 subjects.
30 37 | 39 40 | 41 42 | 43 44
(the dividers mark $Q_1$, M, and $Q_3$, in that order)
The pth percentile is the value such that p% of the observations fall at or below that value and (100 - p)% of the observations fall at or above that value.
The first quartile $Q_1$ is the 25th percentile since 25% of the data fall below and 75% of the data fall above.
The second quartile $Q_2 = M$ (the median) is the 50th percentile since 50% of the data fall below and 50% of the data fall above.
The third quartile $Q_3$ is the 75th percentile since 75% of the data fall below and 25% of the data fall above.
One well-used summary of variation is the five-number summary, defined to be the Minimum, $Q_1$, Median, $Q_3$, and Maximum of the data set.
Find the five-number summary of the ages of the following 8 subjects.
30 37 | 39 40 | 41 42 | 43 44
(the dividers mark $Q_1$, M, and $Q_3$, in that order)
Solution: Min = 30, $Q_1$ = 38, M = 40.5, $Q_3$ = 42.5, Max = 44.
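The quartile rule used here (median of the lower half and median of the upper half, leaving out the middle observation when n is odd) can be sketched in Python. The function names are illustrative, and other software may use slightly different quartile rules, so results from a calculator or statistics package need not match exactly.

```python
def median(values):
    """Middle observation for odd n; average of the two middle ones for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]  # observations below the median
    # For odd n the middle observation is excluded from both halves
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

ages = [30, 37, 39, 40, 41, 42, 43, 44]
summary = five_number_summary(ages)
iqr = summary[3] - summary[1]  # Q3 - Q1

print(summary, iqr)
```

For the eight ages this gives Min = 30, Q1 = 38, M = 40.5, Q3 = 42.5, Max = 44, and an interquartile range of 4.5.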
A boxplot is a graphical representation of the five-number summary of a data set.
1. List the data values in order from smallest to largest.
2. Find the five-number summary: Minimum, $Q_1$, Median, $Q_3$, and Maximum.
3. $Q_1$ and $Q_3$ determine the ends of the box, and a line is drawn inside the box to mark the value of the Median.
4. Draw lines (called whiskers) from the midpoints of the ends of the box out to the Minimum and Maximum.
[Figure: boxplot with Min, $Q_1$, M, $Q_3$, and Max labeled.]
Example
Boxplots
[Figure: boxplot of Age on an axis from 30 to 45, with Min = 30 and Max = 44 labeled and the IQR marked.]
1. Find the five-number summary.
2. Draw the box part of the boxplot using $Q_1$, M, and $Q_3$.
3. Find the Interquartile Range, $IQR = Q_3 - Q_1$.
4. Compute the quantity $STEP = 1.5 \times IQR$.
5. Find the location of the inner fences.
   Lower Inner Fence = $Q_1 - STEP$
   Upper Inner Fence = $Q_3 + STEP$
6. Draw whiskers from the midpoints of the ends of the box to the smallest and largest values within the inner fences. These whiskers end with small vertical lines.
7. All of the observations that fall outside the inner fences are potential outliers and are plotted with solid dots.
Example
Modified Boxplots
Construct a modified boxplot for the ages of the following 8 subjects: 30, 37, 39, 40, 41, 42, 43, 44.
Recall: Min = 30, $Q_1$ = 38, M = 40.5, $Q_3$ = 42.5, Max = 44.
Note: $IQR = Q_3 - Q_1 = 42.5 - 38 = 4.5$ and $STEP = 1.5 \times IQR = 1.5(4.5) = 6.75$.
Lower Fence: $Q_1 - STEP = 38 - 6.75 = 31.25$
Upper Fence: $Q_3 + STEP = 42.5 + 6.75 = 49.25$
[Figure: modified boxplot of Age on an axis from 30 to 50, marking Min = 30, the Lower Fence, $Q_1$ = 38, M = 40.5, $Q_3$ = 42.5, Max = 44, the Upper Fence, and the IQR.]
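The fence calculation for this example can be sketched in Python. The quartiles are entered directly from the values recalled in the example; the variable names are illustrative.

```python
ages = [30, 37, 39, 40, 41, 42, 43, 44]
q1, q3 = 38, 42.5          # quartiles recalled in the example

iqr = q3 - q1              # interquartile range
step = 1.5 * iqr           # STEP = 1.5 x IQR

lower_fence = q1 - step
upper_fence = q3 + step

# Observations outside the inner fences are potential outliers
outliers = [x for x in ages if x < lower_fence or x > upper_fence]

# Whiskers extend to the smallest and largest observations inside the fences
inside = [x for x in ages if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(lower_fence, upper_fence, outliers, whisker_low, whisker_high)
```

Since 30 is below the lower fence of 31.25, it is flagged as a potential outlier, and the lower whisker stops at 37 rather than at the minimum.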
Example
Side-by-Side Boxplots
Side-by-side boxplots are helpful for comparing two or more distributions with respect to the five-number summary.
Although the median of the first process is closer to the target value of 20.000 cm, the second process produces a less variable distribution.
Let's Do It!
Modified Boxplots
Variable = age for 23 children randomly assigned to one of two treatment groups.
Amoxicillin:  8 9 9 10 10 11 11 12 14 14 17
Cefadroxil:   7 8 9 9 9 10 10 11 12 13 14 16
(a) Give the five-number summary for each of the two treatment groups. Comment on your results.
(b) Make side-by-side boxplots for the antibiotic study data in part (a).
(c) Using our rule of thumb, are there any outliers for the Amoxicillin group? If so, modify your boxplot above.
(d) Using our rule of thumb, are there any outliers for the Cefadroxil group? If so, modify your boxplot above.
Let's Do It!
Modified Boxplots
For each of the following modified boxplots, report the corresponding five-number summary and list the values for all outliers (if any).
(a) [Figure: modified boxplot on an axis from 0 to 100.]
Min ____, Q1 ____, M ____, Q3 ____, Max ____, Outliers ____
(b) [Figure: modified boxplot on an axis from 0 to 100.]
Min ____, Q1 ____, M ____, Q3 ____, Max ____, Outliers ____
(c) [Figure: modified boxplot on an axis from 0 to 100.]
Min ____, Q1 ____, M ____, Q3 ____, Max ____, Outliers ____
Standard Deviation
The Idea
Standard deviation is a measure of the spread of the observations from the mean. Think of the standard deviation as roughly an average (or standard) distance of the observations from their mean. If all of the observations are the same, then the standard deviation will be 0 (i.e. no spread). Otherwise the standard deviation is positive and the more spread out the observations are about their mean, the larger the value of the standard deviation.
Standard Deviation
The Idea
Suppose you make three observations: 0, 5, 7. Then, the sample mean is $\bar{x} = \frac{0 + 5 + 7}{3} = 4$.

Observation    Deviation from $\bar{x} = 4$
0              -4
5              1
7              3

Problem: The average of the deviations is zero!
$$\frac{-4 + 1 + 3}{3} = \frac{0}{3} = 0.$$
(That's boring!) It turns out that the average of the deviations from the mean will always be zero, so we need a little trick.
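The claim that the deviations always average to zero follows in one line from the definition of the mean. The short derivation below uses only the definition $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$:

```latex
\[
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0 ,
\qquad\text{so}\qquad
\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}) = 0 .
\]
```

The negative deviations exactly cancel the positive ones, whatever the data are.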
Standard Deviation
The Idea
Suppose you make three observations: 0, 5, 7. Then, the sample mean is $\bar{x} = 4$.
Solution: Use the squared deviations from the mean.

Deviations from the Mean    Squared Deviations
-4                          16
1                           1
3                           9

The "average" of the squared deviations, computed by dividing by n - 1 rather than n, is called the sample variance:
$$\frac{16 + 1 + 9}{3 - 1} = \frac{26}{2} = 13.$$
The sample standard deviation is $\sqrt{13} \approx 3.60555$.
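The same steps can be traced in Python. This is a sketch with illustrative variable names; the final line cross-checks against `statistics.variance` from the standard library, which also divides by n - 1.

```python
import math
import statistics

obs = [0, 5, 7]
n = len(obs)

xbar = sum(obs) / n                      # sample mean
deviations = [x - xbar for x in obs]     # deviations from the mean
squared = [d ** 2 for d in deviations]   # squared deviations

sample_variance = sum(squared) / (n - 1) # divide by n - 1, not n
sample_sd = math.sqrt(sample_variance)

print(deviations, sum(deviations))
print(squared, sample_variance, sample_sd)
print(statistics.variance(obs))          # library cross-check
```

The deviations sum to zero, the squared deviations sum to 26, and dividing by n - 1 = 2 gives the sample variance of 13.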
Standard Deviation
The Idea
Notes
When calculating the sample variance in the previous example, $\frac{16 + 1 + 9}{3 - 1} = \frac{26}{2} = 13$, we subtract one in the denominator. This is because we estimated the mean and hence have used up some information. If you want more information, then take advanced statistics courses.
When calculating the (sample or population) variance, we square all of the deviations and then add them, so the variance is measured in squared units. We take a square root to return to the original units; the result is the standard deviation.
Just as the mean is not a resistant measure of center, the standard deviation is not a resistant measure of spread, since it uses the mean in its definition. It is heavily influenced by extreme values.
Standard Deviation
The Math
Suppose $\mu$ is the population mean. The population variance is denoted by
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}.$$
The population standard deviation is $\sigma = \sqrt{\sigma^2}$.
Note that when dealing with population variance or standard deviation, we do not divide by n - 1 since we have not estimated the mean. The population mean can be calculated exactly.
Standard Deviation
The Math - Shortcut Formulas for Sample Variance or Sample Standard Deviation
Some shortcut formulas are presented for calculating the sample variance and sample standard deviation. Let x1 , x2 , . . . , xn denote a sample of n observations. Then,
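$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}}$$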
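As a quick check that the shortcut formula agrees with the definition, both can be computed in Python for the three observations 0, 5, 7 used earlier. The function names are illustrative, not part of the course materials.

```python
def sample_variance_definition(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """(sum of x^2 - (sum of x)^2 / n) / (n - 1); no deviations needed."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

obs = [0, 5, 7]
by_definition = sample_variance_definition(obs)
by_shortcut = sample_variance_shortcut(obs)

print(by_definition, by_shortcut)
```

For these data the sum of the observations is 12 and the sum of their squares is 74, so the shortcut gives (74 - 144/3) / 2 = 26 / 2 = 13, the same value as the definition. The shortcut needs only the two running totals, which is why it is convenient for hand or calculator work.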
Standard Deviation
The Math - Shortcut Formulas for Population Variance or Population Standard Deviation
Some shortcut formulas are presented for calculating the population variance and population standard deviation. Let $x_1, x_2, \ldots, x_n$ denote all n observations in a population. Then,
Variance:
$$\sigma^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n}$$
Standard Deviation:
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n}}$$
Example
Standard Deviation
In a recent study of the effect of a certain diet on weight reduction, 11 subjects were put on the diet for two weeks and their weight loss/gain in lbs was measured (positive values indicate weight loss). 1, 1, 2, 2, 3, 2, 1, 1, 3, 2.5, 23. What is the standard deviation of the weight loss?
$$\sum_{i=1}^{11} x_i^2 = 569.25$$
Example
Standard Deviation (Continued)
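$$\sum_{i=1}^{11} x_i = 4.5 \quad\text{and}\quad \sum_{i=1}^{11} x_i^2 = 569.25$$
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1} = \frac{569.25 - (4.5)^2 / 11}{11 - 1} \approx 56.741$$
The sample standard deviation is $s = \sqrt{s^2} = \sqrt{56.741} \approx 7.5327$. So, our answer is s = 7.5 lbs.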
Let's Do It!
Standard Deviation
The following are the ages of a sample of 20 patients seen in the emergency room of a hospital on a Friday night. 35 37 32 53 21 45 43 23 39 64 60 10 36 34 12 22 54 36 45 55
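For grouped data, the same shortcut applies with the class midpoints $x_m$ and frequencies $f$.
$\sum f \cdot x_m^2 = 13{,}310$. We found $n = 20$ and $\sum f \cdot x_m = 490$.
The formula for sample variance of grouped data is
$$s^2 = \frac{\sum f \cdot x_m^2 - \left(\sum f \cdot x_m\right)^2 / n}{n - 1} = \frac{13{,}310 - (490)^2 / 20}{20 - 1} \approx 68.68$$
The formula for sample standard deviation of grouped data is $s = \sqrt{s^2} = \sqrt{68.68} \approx 8.28734$. So, our final answer is s = 8.3.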
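The grouped-data totals used above (n = 20, 490, and 13,310) match the miles-run frequency table from earlier, so the calculation can be sketched in Python from that table. The midpoints 8, 13, ..., 38 are the centers of the classes 5.5 - 10.5 through 35.5 - 40.5; variable names are illustrative.

```python
import math

# Class midpoints and frequencies from the miles-run frequency table
midpoints = [8, 13, 18, 23, 28, 33, 38]
freq = [1, 2, 3, 5, 4, 3, 2]

n = sum(freq)
sum_fxm = sum(f * xm for f, xm in zip(freq, midpoints))        # sum of f * x_m
sum_fxm2 = sum(f * xm ** 2 for f, xm in zip(freq, midpoints))  # sum of f * x_m^2

# Shortcut formula for the sample variance of grouped data
s2 = (sum_fxm2 - sum_fxm ** 2 / n) / (n - 1)
s = math.sqrt(s2)

print(n, sum_fxm, sum_fxm2, round(s2, 2), round(s, 1))
```

The totals come out to 20, 490, and 13,310, giving a variance of 1305 / 19, about 68.68, and a standard deviation of about 8.3.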
Homework
HW page 37: 2, 3, 9, 13