
STAT 509

STATISTICS FOR ENGINEERS


Spring, 2014
Lecture Notes
Joshua M. Tebbs
Department of Statistics
University of South Carolina
Contents

3 Modeling Random Behavior
  3.1 Introduction
  3.2 Probability
    3.2.1 Sample spaces and events
    3.2.2 Unions and intersections
    3.2.3 Axioms and rules
    3.2.4 Conditional probability and independence
    3.2.5 Additional probability rules
  3.3 Random variables and distributions
  3.4 Discrete random variables
    3.4.1 Binomial distribution
    3.4.2 Geometric distribution
    3.4.3 Negative binomial distribution
    3.4.4 Hypergeometric distribution
    3.4.5 Poisson distribution
  3.5 Continuous random variables
    3.5.1 Exponential distribution
    3.5.2 Gamma distribution
    3.5.3 Normal distribution
  3.6 Reliability and lifetime distributions
    3.6.1 Weibull distribution
    3.6.2 Reliability functions

4 Statistical Inference
  4.1 Populations and samples
  4.2 Parameters and statistics
  4.3 Point estimators and sampling distributions
  4.4 Sampling distributions involving Ȳ
    4.4.1 Central Limit Theorem
    4.4.2 t distribution
    4.4.3 Normal quantile-quantile (qq) plots
  4.5 Confidence intervals for a population mean
    4.5.1 Known population variance σ²
    4.5.2 Unknown population variance σ²
    4.5.3 Sample size determination
  4.6 Confidence interval for a population proportion p
  4.7 Confidence interval for a population variance σ²
  4.8 Confidence intervals for the difference of two population means μ_1 − μ_2: Independent samples
    4.8.1 Equal variance case: σ²_1 = σ²_2
    4.8.2 Unequal variance case: σ²_1 ≠ σ²_2
  4.9 Confidence interval for the difference of two population proportions p_1 − p_2: Independent samples
  4.10 Confidence interval for the ratio of two population variances σ²_2/σ²_1: Independent samples
  4.11 Confidence intervals for the difference of two population means μ_1 − μ_2: Dependent samples (Matched-pairs)
  4.12 One-way analysis of variance
    4.12.1 Overall F test
    4.12.2 Follow-up analysis: Tukey pairwise confidence intervals

6 Linear regression
  6.1 Introduction
  6.2 Simple linear regression
    6.2.1 Least squares estimation
    6.2.2 Model assumptions and properties of least squares estimators
    6.2.3 Estimating the error variance
    6.2.4 Inference for β_0 and β_1
    6.2.5 Confidence and prediction intervals for a given x = x_0
  6.3 Multiple linear regression
    6.3.1 Introduction
    6.3.2 Matrix representation
    6.3.3 Estimating the error variance
    6.3.4 The hat matrix
    6.3.5 Analysis of variance for linear regression
    6.3.6 Inference for individual regression parameters
    6.3.7 Confidence and prediction intervals for a given x = x_0
  6.4 Model diagnostics (residual analysis)

7 Factorial Experiments
  7.1 Introduction
  7.2 Example: A 2² experiment with replication
  7.3 Example: A 2⁴ experiment without replication
3 Modeling Random Behavior
Complementary reading: Chapter 3 (VK); Sections 3.1-3.5 and 3.9.
3.1 Introduction
TERMINOLOGY: Statistics is the development and application of theory and meth-
ods to the collection, design, analysis, and interpretation of observed information from
planned and unplanned studies.
"Statisticians get to play in everyone else's back yard." (John Tukey, Princeton)
Here are some examples where statistics could be used:
1. In a reliability (time to event) study, an engineer is interested in quantifying the
time until failure for a jet engine fan blade.
2. In an agricultural study in Iowa, researchers want to know which of four fertilizers
(which vary in their nitrogen contents) produces the highest corn yield.
3. In a clinical trial, physicians want to determine which of two drugs is more effective
for treating HIV in the early stages of the disease.
4. In a public health study, epidemiologists want to know whether smoking is linked
to a particular demographic class in high school students.
5. A food scientist is interested in determining how different feeding schedules (for
pigs) could affect the spread of salmonella during the slaughtering process.
6. A pharmacist posits that administering caffeine to premature babies in the ICU at
Richland Hospital will reduce the incidence of necrotizing enterocolitis.
7. A research dietician wants to determine if academic achievement is related to body
mass index (BMI) among African American students in the fourth grade.
PAGE 1
CHAPTER 3 STAT 509, J. TEBBS
8. An economist, as part of President Obama's re-election campaign, is trying to forecast
the monthly unemployment and under-employment rates for 2012.
REMARK: Statisticians use their skills in mathematics and computing to formulate
statistical models and analyze data for a specific problem at hand. These models are
then used to estimate important quantities of interest (to the researcher), to test the
validity of important conjectures, and to predict future behavior. Being able to identify
and model sources of variability is an important part of this process.
TERMINOLOGY: A deterministic model is one that makes no attempt to explain
variability. For example, in chemistry, the ideal gas law states that

PV = nRT,

where P = pressure of a gas, V = volume, n = the amount of substance of gas (number
of moles), R = the ideal gas constant, and T = temperature. In circuit analysis, Ohm's
law states that

V = IR,

where V = voltage, I = current, and R = resistance.
- In both of these models, the relationship among the variables is completely determined without any ambiguity.
- In real life, this is rarely true, for the obvious reason: there is natural variation that arises in the measurement process.
- For example, a common electrical engineering experiment involves setting up a simple circuit with a known resistance R. For a given current I, different students will then calculate the voltage V. With a sample of n = 20 students, conducting the experiment in succession, we might very well get 20 different measured voltages!
- A deterministic model is too simplistic for real life; it does not acknowledge the inherent variability that arises in the measurement process.
A probabilistic (or stochastic or statistical) model might look like

V = IR + ε,

where ε is a random term that accounts for measurement error.
PREDICTION: Statistical models can also be used to predict future outcomes. For
example, suppose that I am trying to predict

Y = MATH 141 final course percentage

for incoming freshmen enrolled in MATH 141. For each freshman student, I will record
the following variables:
x_1 = SAT MATH score
x_2 = high school GPA.
A deterministic model would be

Y = f(x_1, x_2),

for some function f : R² → [0, 100]. This model suggests that for a student with values x_1
and x_2, we could compute Y exactly if the function f were known. A statistical model
for Y might look something like this:

Y = β_0 + β_1 x_1 + β_2 x_2 + ε,
where ε is a random term that accounts for not only measurement error (e.g., incorrect
student information, grading errors, etc.) but also

(a) all of the other variables not accounted for (e.g., major, leisure habits, natural
ability, etc.) and
(b) the error induced by assuming a linear relationship between Y and {x_1, x_2} when,
in fact, it may not be linear.
In this example, with certain (probabilistic) assumptions on ε and a mathematically
sensible way to estimate the unknown β_0, β_1, and β_2 (i.e., coefficients of the linear
function), we can produce point predictions of Y on a student-by-student basis; we can
also characterize numerical uncertainty in our predictions.
3.2 Probability
3.2.1 Sample spaces and events
TERMINOLOGY: Probability is a measure of one's belief in the occurrence of a future
event. Here are some events to which we may wish to assign a probability:
- tomorrow's temperature exceeding 80 degrees
- manufacturing a defective part
- concluding one fertilizer is superior to another when it isn't
- the NASDAQ losing 5 percent of its value
- you being diagnosed with prostate/cervical cancer in the next 20 years.
TERMINOLOGY: Many real-life phenomena can be envisioned as a random experiment.
The set of all possible outcomes for a given random experiment is called the
sample space, denoted by S. The number of outcomes in S is denoted by n_S.
Example 3.1. In each of the following random experiments, we write out a correspond-
ing sample space.
(a) The Michigan state lottery calls for a three-digit integer to be selected:
S = {000, 001, 002, ..., 998, 999}.
The size of the set of all possible outcomes is n_S = 1000.
(b) A USC undergraduate student is tested for chlamydia (0 = negative, 1 = positive):
S = {0, 1}.
The size of the set of all possible outcomes is n_S = 2.
(c) Four equally qualied applicants (a, b, c, d) are competing for two positions. If the
positions are identical (so that selection order does not matter), then
S = {ab, ac, ad, bc, bd, cd}.

The size of the set of all possible outcomes is n_S = 6. If the positions are different (e.g.,
project leader, assistant project leader, etc.), then

S = {ab, ba, ac, ca, ad, da, bc, cb, bd, db, cd, dc}.

In this case, the size of the set of all possible outcomes is n_S = 12.
TERMINOLOGY: Suppose that S is a sample space for a random experiment. We say
that A is an event in S if A ⊆ S; that is, every outcome in A is also an outcome in S.
GOAL: We would like to develop a mathematical framework so that we can assign prob-
ability to an event A. This will quantify how likely the event is. The probability that
the event A occurs is denoted by P(A).
INTUITIVE: Suppose that a sample space S contains n_S < ∞ outcomes, each of which
is equally likely. If the event A contains n_A outcomes, then

P(A) = n_A / n_S.
This is called an equiprobability model. Its main requirement is that all outcomes in
S are equally likely.
Important: If the outcomes in S are not equally likely, then this result is not
applicable.
Example 3.2. In the random experiments from Example 3.1, we use the previous result
to assign probabilities to events (if applicable).
(a) The Michigan state lottery calls for a three-digit integer to be selected:
S = {000, 001, 002, ..., 998, 999}.
The size of the set of all possible outcomes is n_S = 1000. Let the event

A = {000, 005, 010, 015, ..., 990, 995}
  = {winning number is a multiple of 5}.

There are n_A = 200 outcomes in A. It is reasonable to assume that each outcome in S
is equally likely. Therefore,

P(A) = 200/1000 = 0.20.
(b) A USC undergraduate student is tested for chlamydia (0 = negative, 1 = positive):
S = {0, 1}.
The size of the set of all possible outcomes is n_S = 2. However, is it reasonable to assume
that each outcome in S (0 = negative, 1 = positive) is equally likely? The prevalence of
chlamydia among college age students is much less than 50 percent (in SC, this prevalence
is probably somewhere between 5-12 percent). Therefore, it would be illogical to assign
probabilities using an equiprobability model.
(c) Four equally qualied applicants (a, b, c, d) are competing for two positions. If the
positions are identical (so that selection order does not matter), then
S = {ab, ac, ad, bc, bd, cd}.
The size of the set of all possible outcomes is n_S = 6. If A is the event that applicant d
is selected for one of the two positions, then

A = {ad, bd, cd}
  = {applicant d is chosen}.

There are n_A = 3 outcomes in A. If each of the 4 applicants has the same chance of
being selected (an assumption), then each of the n_S = 6 outcomes in S is equally likely.
Therefore,

P(A) = 3/6 = 0.50.
INTERPRETATION: In general, what does P(A) really measure? There are two main
interpretations:

- P(A) measures the likelihood that A will occur on any given experiment.
- If the experiment is performed many times, then P(A) can be interpreted as the percentage of times that A will occur over the long run. This is called the relative frequency interpretation.
If we are using the former interpretation, then it is common to use a decimal representation;
e.g., P(A) = 0.50. If we are using the latter, it is commonly accepted to say something
like "the event A will occur 50 percent of the time." This gives the impression that A
will occur, on average, 1 out of every 2 times the experiment is performed. It does not
mean that the event will occur exactly 1 out of every 2 times the experiment is performed.
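R CHECK: The relative frequency interpretation is easy to demonstrate by simulation. Below is a minimal R sketch using the lottery of Example 3.2(a); the number of simulated plays (10,000) is an arbitrary choice of ours. Over many plays, the proportion of winning numbers that are multiples of 5 settles near P(A) = 0.20.

> draws <- sample(0:999, 10000, replace = TRUE)  # 10,000 simulated lottery draws
> mean(draws %% 5 == 0)                          # relative frequency of A; close to 0.20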
3.2.2 Unions and intersections
TERMINOLOGY: The null event, denoted by ∅, is an event that contains no outcomes.
The null event has probability P(∅) = 0.

TERMINOLOGY: The union of two events A and B is the set of all outcomes in
either set or both. We denote the union of two events A and B by

A ∪ B = {ω ∈ S : ω ∈ A or ω ∈ B}.

TERMINOLOGY: The intersection of two events A and B is the set of all outcomes
in both sets. We denote the intersection of two events A and B by

A ∩ B = {ω ∈ S : ω ∈ A and ω ∈ B}.

TERMINOLOGY: If the events A and B contain no common outcomes, we say the
events are mutually exclusive. In this case,

P(A ∩ B) = P(∅) = 0.
Example 3.3. A primary computer system is backed up by two secondary systems.
They operate independently of each other; i.e., the failure of one has no effect on any
of the others. We are interested in the readiness of these three systems, which we can
envision as an experiment. The sample space for this experiment can be described as

S = {yyy, yny, yyn, ynn, nyy, nny, nyn, nnn},

where "y" means the system is ready and "n" means the system is not ready. Define

A = {primary system is operable} = {yyy, yny, yyn, ynn}
B = {first backup system is operable} = {yyy, yyn, nyy, nyn}.

The union of A and B is

A ∪ B = {yyy, yny, yyn, ynn, nyy, nyn}.

In words, A ∪ B occurs if the primary system or the first backup system is operable (or
both). The intersection of A and B is

A ∩ B = {yyy, yyn}

and occurs if the primary system and the first backup system are both operable.

Important: Can we compute P(A), P(B), P(A ∪ B), and P(A ∩ B) in this example?
For example, if we wrote

P(A ∪ B) = n_{A∪B}/n_S = 6/8 = 0.75,

we would be making the assumption that each outcome in S is equally likely! This would
only be true if each system (primary and both backups) functions with probability equal
to 1/2. This is extremely unlikely! Therefore, we cannot compute probabilities in this
example without additional information about the system-specific failure rates.
Example 3.4. Hemophilia is a sex-linked hereditary blood defect of males characterized
by delayed clotting of the blood, which makes it difficult to control bleeding. When a
woman is a carrier of classical hemophilia, there is a 50 percent chance that a male child
will inherit this disease. If a carrier gives birth to two males (not twins), what is the
probability that either will have the disease? That both will have the disease?
Solution. We can envision the process of having two male children as an experiment
with sample space

S = {++, +−, −+, −−},

where "+" means the male offspring has the disease and "−" means he does not. To
compute the probabilities requested in this problem, we will assume that each outcome
in S is equally likely. Define the events:

A = {first child has disease} = {++, +−}
B = {second child has disease} = {++, −+}.

The union and intersection of A and B are, respectively,

A ∪ B = {either child has disease} = {++, +−, −+}
A ∩ B = {both children have disease} = {++}.

The probability that either male child will have the disease is

P(A ∪ B) = n_{A∪B}/n_S = 3/4 = 0.75.

The probability that both male children will have the disease is

P(A ∩ B) = n_{A∩B}/n_S = 1/4 = 0.25.
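R CHECK: For equiprobability models like this one, R's set operations give a quick numerical check. This is a sketch; the vectors S, A, and B are our own encoding of the outcomes.

> S <- c("++", "+-", "-+", "--")           # sample space (equally likely)
> A <- c("++", "+-")                       # first child has disease
> B <- c("++", "-+")                       # second child has disease
> length(union(A, B))/length(S)            # P(A or B) = 0.75
> length(intersect(A, B))/length(S)        # P(A and B) = 0.25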
3.2.3 Axioms and rules
KOLMOGOROV AXIOMS: For any sample space S, a probability P must satisfy

(1) 0 ≤ P(A) ≤ 1, for any event A
(2) P(S) = 1
(3) If A_1, A_2, ..., A_n are pairwise mutually exclusive events, then

P(A_1 ∪ A_2 ∪ ··· ∪ A_n) = Σ_{i=1}^{n} P(A_i).

- The term "pairwise mutually exclusive" means that A_i ∩ A_j = ∅, for all i ≠ j.
- The event A_1 ∪ A_2 ∪ ··· ∪ A_n, in words, is read "at least one A_i occurs."
3.2.4 Conditional probability and independence
IDEA: In some situations, we may be fortunate enough to have prior knowledge about
the likelihood of other events related to the event of interest. We can then incorporate
this information into a probability calculation.
TERMINOLOGY: Let A and B be events in a sample space S with P(B) > 0. The
conditional probability of A, given that B has occurred, is

P(A|B) = P(A ∩ B)/P(B).

Similarly,

P(B|A) = P(A ∩ B)/P(A).
Example 3.5. In a company, 36 percent of the employees have a degree from a SEC
university, 22 percent of the employees that have a degree from the SEC are also engineers,
and 30 percent of the employees are engineers. An employee is selected at random.
(a) Compute the probability that the employee is an engineer and is from the SEC.
(b) Compute the conditional probability that the employee is from the SEC, given that
s/he is an engineer.
Solution: Define the events
A = {employee is an engineer}
B = {employee is from the SEC}.
From the information in the problem, we are given: P(A) = 0.30, P(B) = 0.36, and
P(A|B) = 0.22. In part (a), we want P(A ∩ B). Note that

0.22 = P(A|B) = P(A ∩ B)/P(B) = P(A ∩ B)/0.36.

Therefore,

P(A ∩ B) = 0.22(0.36) = 0.0792.

In part (b), we want P(B|A). From the definition of conditional probability,

P(B|A) = P(A ∩ B)/P(A) = 0.0792/0.30 = 0.264.
IMPORTANT: Note that, in this example, the conditional probability P(B|A) and the
unconditional probability P(B) are not equal.

- In other words, knowledge that A has occurred has changed the likelihood that B occurs.
- In other situations, it might be that the occurrence (or non-occurrence) of a companion event has no effect on the probability of the event of interest. This leads us to the definition of independence.
TERMINOLOGY: When the occurrence or non-occurrence of B has no effect on whether
or not A occurs, and vice-versa, we say that the events A and B are independent.
Mathematically, we define A and B to be independent if and only if

P(A ∩ B) = P(A)P(B).

Note that if A and B are independent,

P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A)

and

P(B|A) = P(B ∩ A)/P(A) = P(B)P(A)/P(A) = P(B).

Note: These results only apply if A and B are independent. In other words, if A and B
are not independent, then these rules do not apply.
Example 3.6. In an engineering system, two components are placed in series; that is,
the system is functional as long as both components are. Each component is functional
with probability 0.95. Define the events

A_1 = {component 1 is functional}
A_2 = {component 2 is functional}

so that P(A_1) = 0.95 and P(A_2) = 0.95. The probability that the system is functional
is given by P(A_1 ∩ A_2).

- If the components operate independently, then A_1 and A_2 are independent events, so that

P(A_1 ∩ A_2) = P(A_1)P(A_2) = 0.95(0.95) = 0.9025.

- If the components do not operate independently (e.g., failure of one component wears on the other), then we cannot compute P(A_1 ∩ A_2) without additional knowledge.
EXTENSION: The notion of independence extends to any finite collection of events
A_1, A_2, ..., A_n. Mutual independence means that the probability of the intersection of
any sub-collection of A_1, A_2, ..., A_n equals the product of the probabilities of the events
in the sub-collection. For example, if A_1, A_2, A_3, and A_4 are mutually independent, then

P(A_1 ∩ A_2) = P(A_1)P(A_2)
P(A_1 ∩ A_2 ∩ A_3) = P(A_1)P(A_2)P(A_3)
P(A_1 ∩ A_2 ∩ A_3 ∩ A_4) = P(A_1)P(A_2)P(A_3)P(A_4).
3.2.5 Additional probability rules
TERMINOLOGY: Suppose that S is a sample space and that A is an event. The
complement of A, denoted by Ā, is the set of all outcomes in S not in A. That is,

Ā = {ω ∈ S : ω ∉ A}.
1. Complement rule: Suppose that A is an event. Then

P(Ā) = 1 − P(A).

2. Additive law: Suppose that A and B are two events. Then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

3. Multiplicative law: Suppose that A and B are two events. Then

P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B).

4. Law of Total Probability (LOTP): Suppose that A and B are two events. Then

P(A) = P(A|B)P(B) + P(A|B̄)P(B̄).

5. Bayes' Rule: Suppose that A and B are two events. Then

P(B|A) = P(A|B)P(B)/P(A) = P(A|B)P(B)/[P(A|B)P(B) + P(A|B̄)P(B̄)].
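R CHECK: The Law of Total Probability and Bayes' Rule combine naturally into a single computation. Below is a small R helper (the function name and argument names are our own) that returns P(B|A) from P(B), P(A|B), and P(A|B̄). With the inputs of Example 3.8 below, it returns about 0.462.

bayes <- function(pB, pA_given_B, pA_given_notB) {
  pA <- pA_given_B*pB + pA_given_notB*(1 - pB)  # Law of Total Probability
  pA_given_B*pB/pA                              # Bayes' Rule: P(B|A)
}
bayes(0.3, 0.4, 0.2)  # 0.4615385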
Example 3.7. The probability that train 1 is on time is 0.95. The probability that train
2 is on time is 0.93. The probability that both are on time is 0.90. Define the events

A_1 = {train 1 is on time}
A_2 = {train 2 is on time}.

We are given that P(A_1) = 0.95, P(A_2) = 0.93, and P(A_1 ∩ A_2) = 0.90.

(a) What is the probability that train 1 is not on time?

P(Ā_1) = 1 − P(A_1) = 1 − 0.95 = 0.05.

(b) What is the probability that at least one train is on time?

P(A_1 ∪ A_2) = P(A_1) + P(A_2) − P(A_1 ∩ A_2) = 0.95 + 0.93 − 0.90 = 0.98.

(c) What is the probability that train 1 is on time given that train 2 is on time?

P(A_1|A_2) = P(A_1 ∩ A_2)/P(A_2) = 0.90/0.93 ≈ 0.968.

(d) What is the probability that train 2 is on time given that train 1 is not on time?

P(A_2|Ā_1) = P(Ā_1 ∩ A_2)/P(Ā_1) = [P(A_2) − P(A_1 ∩ A_2)]/[1 − P(A_1)]
           = (0.93 − 0.90)/(1 − 0.95) = 0.60.

(e) Are A_1 and A_2 independent events?

Answer: They are not independent because

P(A_1 ∩ A_2) ≠ P(A_1)P(A_2).

Equivalently, note that P(A_1|A_2) ≠ P(A_1). In other words, knowledge that A_2 has
occurred changes the likelihood that A_1 occurs.
Example 3.8. An insurance company classifies people as accident-prone and non-accident-prone.
For a fixed year, the probability that an accident-prone person has an
accident is 0.4, and the probability that a non-accident-prone person has an accident is
0.2. The population is estimated to be 30 percent accident-prone. Define the events

A = {policy holder has an accident}
B = {policy holder is accident-prone}.

We are given that P(B) = 0.3, P(A|B) = 0.4, and P(A|B̄) = 0.2.

(a) What is the probability that a new policy-holder will have an accident?

Solution: By the Law of Total Probability,

P(A) = P(A|B)P(B) + P(A|B̄)P(B̄) = 0.4(0.3) + 0.2(0.7) = 0.26.

(b) Suppose that the policy-holder does have an accident. What is the probability that
s/he was accident-prone?

Solution: We want P(B|A). By Bayes' Rule,

P(B|A) = P(A|B)P(B)/[P(A|B)P(B) + P(A|B̄)P(B̄)] = 0.4(0.3)/[0.4(0.3) + 0.2(0.7)] ≈ 0.46.
3.3 Random variables and distributions
TERMINOLOGY: A random variable Y is a variable whose value is determined by
chance. The distribution of a random variable consists of two parts:
1. an elicitation of the set of all possible values of Y (called the support)
2. a function that describes how to assign probabilities to events involving Y .
NOTATION: By convention, we denote random variables by upper case letters towards
the end of the alphabet; e.g., W, X, Y , Z, etc. A possible value of Y (i.e., a value in the
support) is denoted generically by the lower case version y. In words, the symbol
P(Y = y)

is read, "the probability that the random variable Y equals the value y." The symbol

F_Y(y) ≡ P(Y ≤ y)

is read, "the probability that the random variable Y is less than or equal to the value
y." This probability is called the cumulative distribution function of Y and will be
discussed later.
TERMINOLOGY: If a random variable Y can assume only a finite (or countable) number
of values, we call Y a discrete random variable. If it makes more sense to envision Y as
assuming values in an interval of numbers, we call Y a continuous random variable.
Example 3.9. Classify the following random variables as discrete or continuous and
specify the support of each random variable.
V = number of unbroken eggs in a randomly selected carton (dozen)
W = pH of an aqueous solution
X = length of time between accidents at a factory
Y = whether or not you pass this class
Z = number of aircraft arriving tomorrow at CAE.
The random variable V is discrete. It can assume values in
{v : v = 0, 1, 2, ..., 12}.
The random variable W is continuous. It most certainly assumes values in

{w : −∞ < w < ∞}.

Of course, with most solutions, it is more likely that W is not negative (although
this is possible) and not larger than, say, 15 (a very reasonable upper bound).
However, the choice of {w : −∞ < w < ∞} is not mathematically incongruous
with these practical constraints.
The random variable X is continuous. It can assume values in
{x : x > 0}.
The key feature here is that a time cannot be negative. In theory, it is possible
that X can be very large.
The random variable Y is discrete. It can assume values in
{y : y = 0, 1},
where I have arbitrarily labeled 1 for passing and 0 for failing. Random vari-
ables that can assume exactly 2 values (e.g., 0, 1) are called binary.
The random variable Z is discrete. It can assume values in

{z : z = 0, 1, 2, ...}.

I have allowed for the possibility of a very large number of aircraft arriving.
3.4 Discrete random variables
TERMINOLOGY: Suppose that Y is a discrete random variable. The function

p_Y(y) = P(Y = y)

is called the probability mass function (pmf) for Y. The pmf p_Y(y) is a function
that assigns probabilities to each possible value of Y.
PROPERTIES: A pmf p_Y(y) for a discrete random variable Y satisfies the following:

1. 0 < p_Y(y) < 1, for all possible values of y.
2. The sum of the probabilities, taken over all possible values of Y, must equal 1; i.e.,

Σ_{all y} p_Y(y) = 1.
Example 3.10. A mail-order computer business has six telephone lines. Let Y denote
the number of lines in use at a specific time. Suppose that the probability mass function
(pmf) of Y is given by

y        0     1     2     3     4     5     6
p_Y(y)   0.10  0.15  0.20  0.25  0.20  0.06  0.04

Figure 3.1 (left) displays p_Y(y), the probability mass function (pmf) of Y.

- The height of the bar above y is equal to p_Y(y) = P(Y = y).
- If y is not equal to 0, 1, 2, 3, 4, 5, 6, then p_Y(y) = 0.
[Figure 3.1: PMF (left) and CDF (right) of Y in Example 3.10.]
Figure 3.1 (right) displays the cumulative distribution function (cdf) of Y,

F_Y(y) = P(Y ≤ y).

- F_Y(y) is a nondecreasing function.
- 0 ≤ F_Y(y) ≤ 1; this makes sense since F_Y(y) = P(Y ≤ y) is a probability!
- The cdf F_Y(y) in this example (Y is discrete) takes a "step" at each possible value of Y and stays constant otherwise.
- The height of the step at a particular y is equal to p_Y(y) = P(Y = y).

Here is the table from above, but now with the values of the cumulative distribution
function added:

y        0     1     2     3     4     5     6
p_Y(y)   0.10  0.15  0.20  0.25  0.20  0.06  0.04
F_Y(y)   0.10  0.25  0.45  0.70  0.90  0.96  1.00
(a) What is the probability that exactly two lines are in use?

p_Y(2) = P(Y = 2) = 0.20.

(b) What is the probability that at most two lines are in use?

P(Y ≤ 2) = P(Y = 0) + P(Y = 1) + P(Y = 2)
         = p_Y(0) + p_Y(1) + p_Y(2)
         = 0.10 + 0.15 + 0.20 = 0.45.

Note: This is also equal to F_Y(2) = 0.45.

(c) What is the probability that at least five lines are in use?

P(Y ≥ 5) = P(Y = 5) + P(Y = 6) = p_Y(5) + p_Y(6) = 0.06 + 0.04 = 0.10.

It is also important to note that in part (c), we could have computed

P(Y ≥ 5) = 1 − P(Y ≤ 4) = 1 − F_Y(4) = 1 − 0.90 = 0.10.
TERMINOLOGY: Let Y be a discrete random variable with pmf p_Y(y). The expected
value of Y is given by

μ = E(Y) = Σ_{all y} y p_Y(y).

The expected value for a discrete random variable Y is simply a weighted average of the
possible values of Y. Each value y is weighted by its probability p_Y(y). In statistical
applications, μ = E(Y) is commonly called the population mean.
Example 3.11. In Example 3.10, we examined the distribution of Y, the number of
lines in use at a specified time. The probability mass function (pmf) of Y is given by

y        0     1     2     3     4     5     6
p_Y(y)   0.10  0.15  0.20  0.25  0.20  0.06  0.04

The expected value of Y is

μ = E(Y) = Σ_{all y} y p_Y(y)
  = 0(0.10) + 1(0.15) + 2(0.20) + 3(0.25) + 4(0.20) + 5(0.06) + 6(0.04) = 2.64.

Interpretation: On average, we would expect 2.64 lines in use at the specified time.

Interpretation: Over the long run, if we observed many values of Y at this specified
time (and this pmf was applicable each time), then the average of these Y observations
would be close to 2.64.

Interpretation: Place an "×" at μ = 2.64 in Figure 3.1 (left). This represents the
"balance point" of the probability mass function.
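R CHECK: Because E(Y) is just a weighted sum, it is one line in R (a quick numerical check of the calculation above; vector names are ours):

> y <- 0:6
> p <- c(0.10, 0.15, 0.20, 0.25, 0.20, 0.06, 0.04)
> sum(y*p)   # E(Y)
[1] 2.64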
FUNCTIONS: Let Y be a discrete random variable with pmf p_Y(y). Suppose that g is
a real-valued function. Then, g(Y) is a random variable and

E[g(Y)] = Σ_{all y} g(y) p_Y(y).
PROPERTIES OF EXPECTATIONS: Let Y be a discrete random variable with pmf
p_Y(y). Suppose that g, g_1, g_2, ..., g_k are real-valued functions, and let c be any real
constant. Expectations satisfy the following (linearity) properties:

(a) E(c) = c
(b) E[c g(Y)] = c E[g(Y)]
(c) E[Σ_{j=1}^{k} g_j(Y)] = Σ_{j=1}^{k} E[g_j(Y)].
Note: These rules are also applicable if Y is continuous (coming up).
Example 3.12. In a one-hour period, the number of gallons of a certain toxic chemical
that is produced at a local plant, say Y , has the following pmf:
y        0    1    2    3
p_Y(y)   0.2  0.3  0.3  0.2
[Figure 3.2: PMF (left) and CDF (right) of Y in Example 3.12.]
(a) Compute the expected number of gallons produced during a one-hour period.

Solution: The expected value of Y is

μ = E(Y) = Σ_{all y} y p_Y(y) = 0(0.2) + 1(0.3) + 2(0.3) + 3(0.2) = 1.5.

We would expect 1.5 gallons of the toxic chemical to be produced per hour (on average).

(b) The cost (in hundreds of dollars) to produce Y gallons is given by the cost function
g(Y) = 3 + 12Y + 2Y². What is the expected cost in a one-hour period?

Solution: We want to compute E[g(Y)]. We first compute E(Y²):

E(Y²) = Σ_{all y} y² p_Y(y) = 0²(0.2) + 1²(0.3) + 2²(0.3) + 3²(0.2) = 3.3.

Therefore,

E[g(Y)] = E(3 + 12Y + 2Y²) = 3 + 12E(Y) + 2E(Y²) = 3 + 12(1.5) + 2(3.3) = 27.6.

The expected hourly cost is $2,760.00.
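R CHECK: Both routes to E[g(Y)] are easy to verify numerically, either by weighting g(y) directly or by using linearity (a quick check of the computation above):

> y <- 0:3; p <- c(0.2, 0.3, 0.3, 0.2)
> sum((3 + 12*y + 2*y^2)*p)         # E[g(Y)] directly
[1] 27.6
> 3 + 12*sum(y*p) + 2*sum(y^2*p)    # via linearity of expectation
[1] 27.6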
TERMINOLOGY: Let Y be a discrete random variable with pmf p_Y(y) and expected
value E(Y) = μ. The population variance of Y is given by

σ² ≡ var(Y) ≡ E[(Y − μ)²] = Σ_{all y} (y − μ)² p_Y(y).

The population standard deviation of Y is the positive square root of the variance:

σ = √σ² = √var(Y).
FACTS: The population variance σ² satisfies the following:

(a) σ² ≥ 0. σ² = 0 if and only if the random variable Y has a degenerate distribution; i.e., all the probability mass is located at one support point.
(b) The larger (smaller) σ² is, the more (less) spread in the possible values of Y about the population mean μ = E(Y).
(c) σ² is measured in (units)² and σ is measured in the original units.
COMPUTING FORMULA: Let Y be a random variable with population mean E(Y) = μ.
An alternative computing formula for the population variance is

var(Y) = E[(Y − μ)²] = E(Y²) − [E(Y)]².

This formula is easy to remember and makes subsequent calculations easier.
Example 3.13. In Example 3.12, we examined the pmf for the number of gallons of a
certain toxic chemical produced at a local plant, denoted by Y. The pmf of Y is

y        0    1    2    3
p_Y(y)   0.2  0.3  0.3  0.2

In Example 3.12, we computed E(Y) = 1.5 and E(Y²) = 3.3. The population variance
of Y is

σ² = var(Y) = E(Y²) − [E(Y)]² = 3.3 − (1.5)² = 1.05.

The population standard deviation of Y is

σ = √σ² = √1.05 ≈ 1.025.
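R CHECK: The computing formula translates directly into R (a quick check; vector names are ours):

> y <- 0:3; p <- c(0.2, 0.3, 0.3, 0.2)
> v <- sum(y^2*p) - sum(y*p)^2   # var(Y) = E(Y^2) - [E(Y)]^2
> v; sqrt(v)                     # variance and standard deviation
[1] 1.05
[1] 1.024695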
3.4.1 Binomial distribution
BERNOULLI TRIALS: Many experiments can be envisioned as consisting of a sequence
of trials, where
1. each trial results in a success or a failure,
2. the trials are independent, and
3. the probability of success, denoted by p, 0 < p < 1, is the same on every trial.
Example 3.14. We give examples of situations that can be thought of as observing
Bernoulli trials.

- When circuit boards used in the manufacture of Blu-ray players are tested, the long-run percentage of defective boards is 5 percent.
  - circuit board = trial
  - defective board is observed = success
  - p = P(success) = P(defective board) = 0.05.
- Ninety-eight percent of all air traffic radar signals are correctly interpreted the first time they are transmitted.
  - radar signal = trial
  - signal is correctly interpreted = success
  - p = P(success) = P(correct interpretation) = 0.98.
- Albino rats used to study the hormonal regulation of a metabolic pathway are injected with a drug that inhibits body synthesis of protein. The probability that a rat will die from the drug before the study is complete is 0.20.
  - rat = trial
  - rat dies before study is over = success
  - p = P(success) = P(dies early) = 0.20.
TERMINOLOGY: Suppose that n Bernoulli trials are performed. Define

Y = the number of successes (out of n trials performed).

We say that Y has a binomial distribution with number of trials n and success probability
p. Shorthand notation is Y ~ b(n, p).
PMF: If Y ~ b(n, p), then the probability mass function of Y is given by

p_Y(y) = (n choose y) p^y (1 − p)^(n−y),   y = 0, 1, 2, ..., n,

and p_Y(y) = 0 otherwise.
MEAN/VARIANCE: If Y ~ b(n, p), then

E(Y) = np
var(Y) = np(1 − p).
Example 3.15. In an agricultural study, it is determined that 40 percent of all plots
respond to a certain treatment. Four plots are observed. In this situation, we interpret

- plot of land = trial
- plot responds to treatment = success
- p = P(success) = P(responds to treatment) = 0.4.
[Figure 3.3: PMF (left) and CDF (right) of Y ~ b(n = 4, p = 0.4) in Example 3.15.]
If the Bernoulli trial assumptions hold (independent plots, same response probability for
each plot), then

Y = the number of plots which respond ~ b(n = 4, p = 0.4).

(a) What is the probability that exactly two plots respond?

P(Y = 2) = p_Y(2) = (4 choose 2)(0.4)²(1 − 0.4)^(4−2) = 6(0.4)²(0.6)² = 0.3456.

(b) What is the probability that at least one plot responds?

P(Y ≥ 1) = 1 − P(Y = 0) = 1 − (4 choose 0)(0.4)⁰(1 − 0.4)⁴ = 1 − (0.6)⁴ = 0.8704.

(c) What are E(Y) and var(Y)?

E(Y) = np = 4(0.4) = 1.6
var(Y) = np(1 − p) = 4(0.4)(0.6) = 0.96.
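R CHECK: These hand calculations can be verified with R's binomial functions, which are introduced formally in the next example:

> dbinom(2, 4, 0.4)       # part (a): P(Y = 2)
[1] 0.3456
> 1 - dbinom(0, 4, 0.4)   # part (b): P(Y >= 1)
[1] 0.8704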
Example 3.16. An electronics manufacturer claims that 10 percent of its power supply
units need servicing during the warranty period. To investigate this claim, technicians
at a testing laboratory purchase 30 units and subject each one to an accelerated testing
protocol to simulate use during the warranty period. In this situation, we interpret

- power supply unit = trial
- supply unit needs servicing during warranty period = success
- p = P(success) = P(supply unit needs servicing) = 0.1.

If the Bernoulli trial assumptions hold (independent units, same probability of needing
service for each unit), then

Y = the number of units requiring service during the warranty period ~ b(n = 30, p = 0.1).
Note: Instead of computing probabilities by hand, we will use R.
BINOMIAL R CODE: Suppose that Y ~ b(n, p).

p_Y(y) = P(Y = y)    F_Y(y) = P(Y ≤ y)
dbinom(y,n,p)        pbinom(y,n,p)
(a) What is the probability that exactly five of the 30 power supply units require
servicing during the warranty period?

p_Y(5) = P(Y = 5) = (30 choose 5)(0.1)⁵(1 − 0.1)^(30−5)
       = dbinom(5,30,0.1) = 0.1023048.

(b) What is the probability that at most five of the 30 power supply units require
servicing during the warranty period?

F_Y(5) = P(Y ≤ 5) = Σ_{y=0}^{5} (30 choose y)(0.1)^y (1 − 0.1)^(30−y)
       = pbinom(5,30,0.1) = 0.9268099.
[Figure 3.4: PMF (left) and CDF (right) of Y ~ b(n = 30, p = 0.1) in Example 3.16.]
(c) What is the probability that at least five of the 30 power supply units require service?

P(Y ≥ 5) = 1 − P(Y ≤ 4) = 1 − Σ_{y=0}^{4} (30 choose y)(0.1)^y(1 − 0.1)^(30−y)
         = 1-pbinom(4,30,0.1) = 0.1754949.

(d) What is P(2 ≤ Y ≤ 8)?

P(2 ≤ Y ≤ 8) = Σ_{y=2}^{8} (30 choose y)(0.1)^y(1 − 0.1)^(30−y).
One way to get this in R is to use the command:
> sum(dbinom(2:8,30,0.1))
[1] 0.8142852
The dbinom(2:8,30,0.1) command creates a vector containing p_Y(2), p_Y(3), ..., p_Y(8),
and the sum command adds them. Another way to calculate this probability in R is
and the sum command adds them. Another way to calculate this probability in R is
> pbinom(8,30,0.1)-pbinom(1,30,0.1)
[1] 0.8142852
3.4.2 Geometric distribution
NOTE: The geometric distribution also arises in experiments involving Bernoulli trials:
1. Each trial results in a success or a failure.
2. The trials are independent.
3. The probability of success, denoted by p, 0 < p < 1, is the same on every trial.
TERMINOLOGY: Suppose that Bernoulli trials are continually observed. Define

Y = the number of trials to observe the first success.

We say that Y has a geometric distribution with success probability p. Shorthand
notation is Y ~ geom(p).
PMF: If Y ~ geom(p), then the probability mass function of Y is given by

p_Y(y) = (1 − p)^(y−1) p,   y = 1, 2, 3, ...,

and p_Y(y) = 0 otherwise.
MEAN/VARIANCE: If Y ~ geom(p), then

E(Y) = 1/p
var(Y) = (1 − p)/p².
Example 3.17. Biology students are checking the eye color of fruit flies. For each fly,
the probability of observing white eyes is p = 0.25. In this situation, we interpret

- fruit fly = trial
- fly has white eyes = success
- p = P(success) = P(white eyes) = 0.25.
[Figure 3.5: PMF (left) and CDF (right) of Y ~ geom(p = 0.25) in Example 3.17.]
If the Bernoulli trial assumptions hold (independent flies, same probability of white eyes
for each fly), then

Y = the number of flies needed to find the first white-eyed fly ~ geom(p = 0.25).

(a) What is the probability the first white-eyed fly is observed on the fifth fly checked?

p_Y(5) = P(Y = 5) = (1 − 0.25)^(5−1)(0.25) = (0.75)⁴(0.25) ≈ 0.079.

(b) What is the probability the first white-eyed fly is observed before the fourth fly is
examined? Note: For this to occur, we must observe the first white-eyed fly (success)
on either the first, second, or third fly.

F_Y(3) = P(Y ≤ 3) = P(Y = 1) + P(Y = 2) + P(Y = 3)
       = (1 − 0.25)⁰(0.25) + (1 − 0.25)¹(0.25) + (1 − 0.25)²(0.25)
       = 0.25 + 0.1875 + 0.140625 ≈ 0.578.
GEOMETRIC R CODE: Suppose that Y ~ geom(p).

p_Y(y) = P(Y = y)    F_Y(y) = P(Y ≤ y)
dgeom(y-1,p)         pgeom(y-1,p)

Note: R parameterizes the geometric distribution by the number of failures before the
first success, which is why y-1 appears in these commands.
> dgeom(5-1,0.25) ## Part (a)
[1] 0.07910156
> pgeom(3-1,0.25) ## Part (b)
[1] 0.578125
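R CHECK: Simulation offers another check on the formula E(Y) = 1/p (a sketch; the number of replications is an arbitrary choice of ours). Again, rgeom returns failure counts, so we add 1 to convert to trial counts:

> waits <- rgeom(10000, 0.25) + 1   # simulated numbers of flies examined
> mean(waits)                       # should be close to 1/p = 4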
3.4.3 Negative binomial distribution
NOTE: The negative binomial distribution also arises in experiments involving Bernoulli
trials:
1. Each trial results in a success or a failure.
2. The trials are independent.
3. The probability of success, denoted by p, 0 < p < 1, is the same on every trial.
TERMINOLOGY: Suppose that Bernoulli trials are continually observed. Define

Y = the number of trials to observe the rth success.

We say that Y has a negative binomial distribution with waiting parameter r and
success probability p. Shorthand notation is Y ~ nib(r, p).

REMARK: Note that the negative binomial distribution is a generalization of the
geometric. If r = 1, then the nib(r, p) distribution reduces to the geom(p).
PMF: If Y ~ nib(r, p), then the probability mass function of Y is given by

p_Y(y) = (y−1 choose r−1) p^r (1 − p)^(y−r),   y = r, r + 1, r + 2, ...,

and p_Y(y) = 0 otherwise.
MEAN/VARIANCE: If Y ~ nib(r, p), then

E(Y) = r/p
var(Y) = r(1 − p)/p².
Example 3.18. At an automotive paint plant, 15 percent of all batches sent to the lab
for chemical analysis do not conform to specifications. In this situation, we interpret

- batch = trial
- batch does not conform = success
- p = P(success) = P(not conforming) = 0.15.

If the Bernoulli trial assumptions hold (independent batches, same probability of
nonconforming for each batch), then

Y = the number of batches needed to find the third nonconforming batch ~ nib(r = 3, p = 0.15).
(a) What is the probability the third nonconforming batch is observed on the tenth batch
sent to the lab?

p_Y(10) = P(Y = 10) = (10−1 choose 3−1)(0.15)³(1 − 0.15)^(10−3)
        = (9 choose 2)(0.15)³(0.85)⁷ ≈ 0.039.

(b) What is the probability that no more than two nonconforming batches will be
observed among the first 30 batches sent to the lab? Note: This means the third
nonconforming batch must be observed on the 31st batch, the 32nd, the 33rd, etc.

P(Y ≥ 31) = 1 − P(Y ≤ 30) = 1 − Σ_{y=3}^{30} (y−1 choose 3−1)(0.15)³(0.85)^(y−3) ≈ 0.151.
[Figure 3.6: PMF (left) and CDF (right) of Y ~ nib(r = 3, p = 0.15) in Example 3.18.]
NEGATIVE BINOMIAL R CODE: Suppose that Y ~ nib(r, p).

p_Y(y) = P(Y = y)    F_Y(y) = P(Y ≤ y)
dnbinom(y-r,r,p)     pnbinom(y-r,r,p)

Note: R parameterizes the negative binomial by the number of failures before the rth
success, which is why y-r appears in these commands.
> dnbinom(10-3,3,0.15) ## Part (a)
[1] 0.03895012
> 1-pnbinom(30-3,3,0.15) ## Part (b)
[1] 0.1514006
3.4.4 Hypergeometric distribution
SETTING: Consider a population of N objects and suppose that each object belongs to
one of two dichotomous classes: Class 1 and Class 2. For example, the objects (classes)
might be people (infected/not), parts (conforming/not), plots of land (respond to
treatment/not), etc. In the population of interest, we have

N = total number of objects
r = number of objects in Class 1
N − r = number of objects in Class 2.

Envision taking a sample of n objects from the population (objects are selected at random
and without replacement). Define

Y = the number of objects in Class 1 (out of the n selected).

We say that Y has a hypergeometric distribution and write Y ~ hyper(N, n, r).
PMF: If Y ~ hyper(N, n, r), then the probability mass function of Y is given by

p_Y(y) = (r choose y)(N−r choose n−y) / (N choose n),   for y ≤ r and n − y ≤ N − r,

and p_Y(y) = 0 otherwise.
MEAN/VARIANCE: If Y ~ hyper(N, n, r), then

E(Y) = n(r/N)
var(Y) = n(r/N)[(N − r)/N][(N − n)/(N − 1)].
Example 3.19. A supplier ships parts to a company in lots of 100 parts. The company
has an acceptance sampling plan which adopts the following acceptance rule:

"Sample 5 parts at random and without replacement. If there are no defectives
in the sample, accept the entire lot; otherwise, reject the entire lot."

In this example, the population size is N = 100. The sample size is n = 5. Define the
random variable

Y = the number of defectives in the sample ~ hyper(N = 100, n = 5, r).
[Figure 3.7: PMF (left) and CDF (right) of Y ~ hyper(N = 100, n = 5, r = 10) in Example 3.19.]
(a) If r = 10, what is the probability that the lot will be accepted? Note: The lot will
be accepted only if Y = 0.

p_Y(0) = P(Y = 0) = (10 choose 0)(90 choose 5) / (100 choose 5)
       = 1(43949268)/75287520 ≈ 0.584.

(b) If r = 10, what is the probability that at least 3 of the 5 parts sampled are defective?

P(Y ≥ 3) = 1 − P(Y ≤ 2)
         = 1 − [(10 choose 0)(90 choose 5) + (10 choose 1)(90 choose 4) + (10 choose 2)(90 choose 3)] / (100 choose 5)
         ≈ 1 − (0.584 + 0.339 + 0.070) = 0.007.
HYPERGEOMETRIC R CODE: Suppose that Y ~ hyper(N, n, r).

p_Y(y) = P(Y = y)     F_Y(y) = P(Y ≤ y)
dhyper(y,r,N-r,n)     phyper(y,r,N-r,n)
> dhyper(0,10,100-10,5) ## Part (a)
[1] 0.5837524
> 1-phyper(2,10,100-10,5) ## Part (b)
[1] 0.006637913
3.4.5 Poisson distribution
NOTE: The Poisson distribution is commonly used to model counts, such as

1. the number of customers entering a post office in a given hour
2. the number of α-particles discharged from a radioactive substance in one second
3. the number of machine breakdowns per month
4. the number of insurance claims received per day
5. the number of defects on a piece of raw material.
TERMINOLOGY: In general, we define

Y = the number of "occurrences" over a unit interval of time (or space).

A Poisson distribution for Y emerges if these occurrences obey the following rules:

(i) The number of occurrences in non-overlapping intervals (of time or space) are independent random variables.
(ii) The probability of an occurrence in a sufficiently short interval is proportional to the length of the interval.
(iii) The probability of 2 or more occurrences in a sufficiently short interval is zero.

We say that Y has a Poisson distribution and write Y ~ Poisson(λ). A process that
produces occurrences according to these rules is called a Poisson process.
[Figure 3.8: PMF (left) and CDF (right) of Y ~ Poisson(λ = 2.5) in Example 3.20.]
PMF: If Y ~ Poisson(λ), then the probability mass function of Y is given by

p_Y(y) = λ^y e^(−λ)/y!,   y = 0, 1, 2, ...,

and p_Y(y) = 0 otherwise.
MEAN/VARIANCE: If Y ~ Poisson(λ), then

E(Y) = λ
var(Y) = λ.
Example 3.20. Let Y denote the number of times per month that a detectable amount
of radioactive gas is recorded at a nuclear power plant. Suppose that Y follows a Poisson
distribution with mean λ = 2.5 times per month.

(a) What is the probability that there are exactly three times a detectable amount of
gas is recorded in a given month?

P(Y = 3) = p_Y(3) = (2.5)³e^(−2.5)/3! = 15.625e^(−2.5)/6 ≈ 0.214.
(b) What is the probability that there are no more than four times a detectable amount
of gas is recorded in a given month?

P(Y ≤ 4) = Σ_{y=0}^{4} (2.5)^y e^(−2.5)/y!
         = (2.5)⁰e^(−2.5)/0! + (2.5)¹e^(−2.5)/1! + (2.5)²e^(−2.5)/2! + (2.5)³e^(−2.5)/3! + (2.5)⁴e^(−2.5)/4!
         ≈ 0.891.
POISSON R CODE: Suppose that Y ~ Poisson(λ).

p_Y(y) = P(Y = y)    F_Y(y) = P(Y ≤ y)
dpois(y,λ)           ppois(y,λ)
> dpois(3,2.5) ## Part (a)
[1] 0.213763
> ppois(4,2.5) ## Part (b)
[1] 0.891178
3.5 Continuous random variables
RECALL: A random variable Y is called continuous if it can assume any value in an
interval of real numbers.
Contrast this with a discrete random variable whose values can be counted.
For example, if Y = time (measured in seconds), then the set of all possible values
of Y is
{y : y > 0}.
If Y = temperature (measured in deg C), the set of all possible values of Y (ignoring
absolute zero and physical upper bounds) might be described as
{y : −∞ < y < ∞}.
Neither of these sets of values can be counted.
IMPORTANT: Assigning probabilities to events involving continuous random variables
is different than in discrete models. We do not assign positive probability to specific
values (e.g., Y = 3, etc.) like we did with discrete random variables. Instead, we assign
positive probability to events which are intervals (e.g., 2 < Y < 4, etc.).
TERMINOLOGY: Every continuous random variable we will discuss in this course has a
probability density function (pdf), denoted by f_Y(y). This function has the following
characteristics:

1. f_Y(y) ≥ 0, that is, f_Y(y) is nonnegative.
2. The area under any pdf is equal to 1, that is,

∫_{−∞}^{∞} f_Y(y) dy = 1.

3. If y_0 is a specific value of interest, then the cumulative distribution function
(cdf) of Y is given by

F_Y(y_0) = P(Y ≤ y_0) = ∫_{−∞}^{y_0} f_Y(y) dy.

4. If y_1 and y_2 are specific values of interest (y_1 < y_2), then

P(y_1 ≤ Y ≤ y_2) = ∫_{y_1}^{y_2} f_Y(y) dy = F_Y(y_2) − F_Y(y_1).

5. If y_0 is a specific value, then P(Y = y_0) = 0. In other words, in continuous
probability models, specific points are assigned zero probability (see #4 above;
this will make perfect mathematical sense). An immediate consequence of this is
that if Y is continuous,

P(y_1 ≤ Y ≤ y_2) = P(y_1 ≤ Y < y_2) = P(y_1 < Y ≤ y_2) = P(y_1 < Y < y_2)

and each is equal to

∫_{y_1}^{y_2} f_Y(y) dy.

This is not true if Y has a discrete distribution because positive probability is
assigned to specific values of Y.
[Figure 3.9: PDF (left) and CDF (right) of Y in Example 3.21.]
IMPORTANT: Evaluating a pdf at a specific value y_0, that is, computing f_Y(y_0), does
not give you a probability! This simply gives you the height of the pdf f_Y(y) at y = y_0.
Example 3.21. Suppose that Y has the pdf

f_Y(y) = 3y², 0 < y < 1,

and f_Y(y) = 0 otherwise. Find the cumulative distribution function (cdf) of Y.

Solution. For 0 < y < 1,

F_Y(y) = ∫_{−∞}^{y} f_Y(t) dt = ∫_{0}^{y} 3t² dt = t³ |_{0}^{y} = y³.

Therefore, the cdf of Y is

F_Y(y) = 0 for y < 0;  F_Y(y) = y³ for 0 ≤ y < 1;  F_Y(y) = 1 for y ≥ 1.
(a) Calculate P(Y < 0.3).

Method 1: PDF:
P(Y < 0.3) = ∫_{0}^{0.3} 3y² dy = y³ |_{0}^{0.3} = (0.3)³ − 0³ = 0.027.

Method 2: CDF:
P(Y < 0.3) = F_Y(0.3) = (0.3)³ = 0.027.

(b) Calculate P(Y > 0.8).

Method 1: PDF:
P(Y > 0.8) = ∫_{0.8}^{1} 3y² dy = y³ |_{0.8}^{1} = 1³ − (0.8)³ = 0.488.

Method 2: CDF:
P(Y > 0.8) = 1 − P(Y ≤ 0.8) = 1 − F_Y(0.8) = 1 − (0.8)³ = 0.488.

(c) Calculate P(0.3 < Y < 0.8).

Method 1: PDF:
P(0.3 < Y < 0.8) = ∫_{0.3}^{0.8} 3y² dy = y³ |_{0.3}^{0.8} = (0.8)³ − (0.3)³ = 0.485.

Method 2: CDF:
P(0.3 < Y < 0.8) = F_Y(0.8) − F_Y(0.3) = (0.8)³ − (0.3)³ = 0.485.
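R CHECK: R's integrate function gives a numerical check on these integrals (a sketch; the function name f is ours):

> f <- function(y) 3*y^2              # the pdf on (0, 1)
> integrate(f, 0, 1)$value            # total area under the pdf: 1
> integrate(f, 0.3, 0.8)$value        # part (c): P(0.3 < Y < 0.8) = 0.485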
TERMINOLOGY: Let Y be a continuous random variable with pdf f_Y(y). The expected
value (or mean) of Y is given by

μ = E(Y) = ∫_{−∞}^{∞} y f_Y(y) dy.

NOTE: The limits of the integral in this definition, while technically correct, will always
be the lower and upper limits corresponding to the nonzero part of the pdf.
FUNCTIONS: Let Y be a continuous random variable with pdf f_Y(y). Suppose that g
is a real-valued function. Then, g(Y) is a random variable and

E[g(Y)] = ∫_{−∞}^{∞} g(y) f_Y(y) dy.

TERMINOLOGY: Let Y be a continuous random variable with pdf f_Y(y) and expected
value E(Y) = μ. The population variance of Y is given by

σ² ≡ var(Y) ≡ E[(Y − μ)²] = ∫_{−∞}^{∞} (y − μ)² f_Y(y) dy.

The computing formula is still

var(Y) = E(Y²) − [E(Y)]².

The population standard deviation of Y is the positive square root of the variance:

σ = √σ² = √var(Y).
Example 3.22. Find the mean and variance of Y in Example 3.21.

Solution. The mean of Y is

μ = E(Y) = ∫_{0}^{1} y f_Y(y) dy = ∫_{0}^{1} y · 3y² dy = ∫_{0}^{1} 3y³ dy = (3y⁴/4) |_{0}^{1} = 3/4.

To find var(Y), we will use the computing formula var(Y) = E(Y²) − [E(Y)]². We
already have E(Y) = 3/4. Next,

E(Y²) = ∫_{0}^{1} y² f_Y(y) dy = ∫_{0}^{1} y² · 3y² dy = ∫_{0}^{1} 3y⁴ dy = (3y⁵/5) |_{0}^{1} = 3/5.

Therefore,

σ² = var(Y) = E(Y²) − [E(Y)]² = 3/5 − (3/4)² = 0.0375.

The population standard deviation is

σ = √0.0375 ≈ 0.194.
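R CHECK: The same integrals can be checked numerically (a sketch using integrate; the object names are ours):

> f <- function(y) 3*y^2
> EY <- integrate(function(y) y*f(y), 0, 1)$value     # E(Y) = 0.75
> EY2 <- integrate(function(y) y^2*f(y), 0, 1)$value  # E(Y^2) = 0.6
> EY2 - EY^2                                          # var(Y) = 0.0375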
QUANTILES: Suppose that Y is a continuous random variable with cdf F_Y(y) and let
0 < p < 1. The pth quantile of the distribution of Y, denoted by φ_p, solves

F_Y(φ_p) = P(Y ≤ φ_p) = ∫_{−∞}^{φ_p} f_Y(y) dy = p.

The median of the distribution of Y is the p = 0.5 quantile. That is, the median φ_0.5
solves

F_Y(φ_0.5) = P(Y ≤ φ_0.5) = ∫_{−∞}^{φ_0.5} f_Y(y) dy = 0.5.
NOTE: Another name for the pth quantile is the 100pth percentile.
REMARK: When Y is discrete, there are some potential problems with the definition
that φ_p solves F_Y(φ_p) = P(Y ≤ φ_p) = p. The reason is that there may be many values of
φ_p that satisfy this equation. By convention, in discrete distributions, the pth quantile
φ_p is taken to be the smallest value satisfying F_Y(φ_p) = P(Y ≤ φ_p) ≥ p.
3.5.1 Exponential distribution
TERMINOLOGY: A random variable Y is said to have an exponential distribution
with parameter λ > 0 if its pdf is given by

f_Y(y) = λe^(−λy), y > 0,

and f_Y(y) = 0 otherwise. Shorthand notation is Y ~ exponential(λ). Important: The
exponential distribution is used to model the distribution of positive quantities (e.g.,
lifetimes, etc.).
MEAN/VARIANCE: If Y ~ exponential(λ), then

E(Y) = 1/λ
var(Y) = 1/λ².

CDF: Suppose that Y ~ exponential(λ). Then, the cdf of Y exists in closed form and
is given by

F_Y(y) = 1 − e^(−λy) for y > 0, and F_Y(y) = 0 for y ≤ 0.
[Figure 3.10: Exponential pdfs with different values of λ (λ = 1, λ = 1/2, λ = 1/5).]
Example 3.23. Extensive experience with fans of a certain type used in diesel engines
has suggested that the exponential distribution provides a good model for time until
failure (i.e., lifetime). Suppose that the lifetime of a fan, denoted by Y (measured in
10000s of hours), follows an exponential distribution with λ = 0.4.

(a) What is the probability that a fan lasts longer than 30,000 hours?

Method 1: PDF:
P(Y > 3) = ∫_{3}^{∞} 0.4e^(−0.4y) dy = −e^(−0.4y) |_{3}^{∞} = e^(−1.2) ≈ 0.301.
[Figure 3.11: PDF (left) and CDF (right) of Y ~ exponential(λ = 0.4) in Example 3.23.]
Method 2: CDF:
P(Y > 3) = 1 − P(Y ≤ 3) = 1 − F_Y(3) = 1 − [1 − e^(−0.4(3))] = e^(−1.2) ≈ 0.301.

(b) What is the probability that a fan will last between 20,000 and 50,000 hours?

Method 1: PDF:
P(2 < Y < 5) = ∫_{2}^{5} 0.4e^(−0.4y) dy = −e^(−0.4y) |_{2}^{5}
             = −[e^(−0.4(5)) − e^(−0.4(2))] = e^(−0.8) − e^(−2) ≈ 0.314.

Method 2: CDF:
P(2 < Y < 5) = F_Y(5) − F_Y(2) = [1 − e^(−0.4(5))] − [1 − e^(−0.4(2))]
             = e^(−0.8) − e^(−2) ≈ 0.314.
MEMORYLESS PROPERTY: Suppose that Y ~ exponential(λ), and let r and s be
positive constants. Then

P(Y > r + s | Y > r) = P(Y > s).

If Y measures time (e.g., time to failure, etc.), then the memoryless property says that
the distribution of additional lifetime (s time units beyond time r) is the same as the
original distribution of the lifetime. In other words, the fact that Y has "made it" to
time r has been forgotten. For example, in Example 3.23,

P(Y > 5 | Y > 2) = P(Y > 3) ≈ 0.301.
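R CHECK: A quick numerical check of the memoryless property with λ = 0.4:

> (1 - pexp(5, 0.4))/(1 - pexp(2, 0.4))   # P(Y > 5 | Y > 2)
[1] 0.3011942
> 1 - pexp(3, 0.4)                        # P(Y > 3): identical, as claimed
[1] 0.3011942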
POISSON RELATIONSHIP: Suppose that we are observing "occurrences" over time
according to a Poisson process with rate λ. Define the random variable

W = the time until the first occurrence.

Then, W ~ exponential(λ). It is also true that the time between any two occurrences
in a Poisson process follows this same exponential distribution (these are called
interarrival times).
Example 3.24. Suppose that customers arrive at a check-out according to a Poisson
process with mean λ = 12 per hour. What is the probability that we will have to wait
longer than 10 minutes to see the first customer? Note: 10 minutes is 1/6th of an hour.
Solution. The time until the first arrival, say W, follows an exponential distribution
with λ = 12, so the cdf of W, for w > 0, is

F_W(w) = 1 - e^{-12w}.
The desired probability is

P(W > 1/6) = 1 - P(W ≤ 1/6) = 1 - F_W(1/6)
           = 1 - [1 - e^{-12(1/6)}]
           = e^{-2} ≈ 0.135.
EXPONENTIAL R CODE: Suppose that Y ~ exponential(λ).

F_Y(y) = P(Y ≤ y)    φ_p
pexp(y,λ)            qexp(p,λ)
> 1-pexp(1/6,12) ## Example 3.24
[1] 0.1353353
> qexp(0.9,12) ## 0.9 quantile
[1] 0.1918821
NOTE: The command qexp(0.9,12) gives the 0.90 quantile (90th percentile) of the
exponential(λ = 12) distribution. In Example 3.24, this means that 90 percent of the
waiting times will be less than approximately 0.192 hours (only 10 percent will exceed).
3.5.2 Gamma distribution
TERMINOLOGY: The gamma function is a real function of t, defined by

Γ(t) = ∫_0^∞ y^{t-1} e^{-y} dy,

for all t > 0. The gamma function satisfies the recursive relationship

Γ(α) = (α - 1)Γ(α - 1),

for α > 1. Therefore, if α is an integer, then

Γ(α) = (α - 1)!.
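R's gamma() function implements Γ(t) directly; a quick check (added here) of the
recursion and the factorial identity:

> gamma(5)     ## Gamma(5) = 4! = 24
[1] 24
> 4*gamma(4)   ## (alpha - 1)*Gamma(alpha - 1) with alpha = 5
[1] 24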
Figure 3.12: Gamma pdfs with different values of α and λ (α = 1.5, λ = 1/2; α = 2,
λ = 1/3; α = 2.5, λ = 1/5).
TERMINOLOGY: A random variable Y is said to have a gamma distribution with
parameters α > 0 and λ > 0 if its pdf is given by

f_Y(y) = { [λ^α / Γ(α)] y^{α-1} e^{-λy},  y > 0
         { 0,                             otherwise.
Shorthand notation is Y ~ gamma(α, λ).

By changing the values of α and λ, the gamma pdf can assume many shapes. This
makes the gamma distribution popular for modeling positive random variables (it
is more flexible than the exponential).

Note that when α = 1, the gamma pdf reduces to the exponential(λ) pdf.
MEAN/VARIANCE: If Y ~ gamma(α, λ), then

E(Y) = α/λ
var(Y) = α/λ².
CDF: The cdf of a gamma random variable does not exist in closed form. Therefore,
probabilities involving gamma random variables and gamma quantiles must be computed
numerically (e.g., using R, etc.).
GAMMA R CODE: Suppose that Y ~ gamma(α, λ).

F_Y(y) = P(Y ≤ y)    φ_p
pgamma(y,α,λ)        qgamma(p,α,λ)
Example 3.25. When a certain transistor is subjected to an accelerated life test, the
lifetime Y (in weeks) is well modeled by a gamma distribution with α = 4 and λ = 1/6.
(a) Find the probability that a transistor will last at least 50 weeks.
P(Y ≥ 50) = 1 - P(Y < 50) = 1 - F_Y(50)
          = 1-pgamma(50,4,1/6)
          = 0.03377340.
(b) Find the probability that a transistor will last between 12 and 24 weeks.
P(12 ≤ Y ≤ 24) = F_Y(24) - F_Y(12)
               = pgamma(24,4,1/6)-pgamma(12,4,1/6)
               = 0.4236533.
(c) Twenty percent of the transistor lifetimes will be below which time? Note: I am
asking for the 0.20 quantile (20th percentile) of the lifetime distribution.
> qgamma(0.2,4,1/6)
[1] 13.78072
Figure 3.13: PDF (left) and CDF (right) of Y ~ gamma(α = 4, λ = 1/6) in Example 3.25.
3.5.3 Normal distribution
TERMINOLOGY: A random variable Y is said to have a normal distribution if its
pdf is given by

f_Y(y) = { [1/(σ√(2π))] e^{-(1/2)[(y-μ)/σ]²},  -∞ < y < ∞
         { 0,                                  otherwise.
Shorthand notation is Y ~ N(μ, σ²). Another name for the normal distribution is the
Gaussian distribution.
MEAN/VARIANCE: If Y ~ N(μ, σ²), then

E(Y) = μ
var(Y) = σ².
REMARK: The normal distribution serves as a very good model for a wide range of mea-
surements; e.g., reaction times, fill amounts, part dimensions, weights/heights, measures
of intelligence/test scores, economic indicators, etc.
Figure 3.14: Normal pdfs with different values of μ and σ² (mu = 0, sigma = 1;
mu = 2, sigma = 2; mu = 1, sigma = 3).
CDF: The cdf of a normal random variable does not exist in closed form. Probabilities
involving normal random variables and normal quantiles can be computed numerically
(e.g., using R, etc.).
NORMAL R CODE: Suppose that Y ~ N(μ, σ²).

F_Y(y) = P(Y ≤ y)    φ_p
pnorm(y,μ,σ)         qnorm(p,μ,σ)
Example 3.26. The time it takes for a driver to react to the brake lights on a decelerating
vehicle is critical in helping to avoid rear-end collisions. A recently published study
suggests that this time during in-traffic driving, denoted by Y (measured in seconds),
follows a normal distribution with mean μ = 1.5 and variance σ² = 0.16.
Figure 3.15: PDF (left) and CDF (right) of Y ~ N(μ = 1.5, σ² = 0.16) in Example 3.26.
(a) What is the probability that reaction time is less than 1 second?
P(Y < 1) = F_Y(1)
         = pnorm(1,1.5,sqrt(0.16))
         = 0.1056498.
(b) What is the probability that reaction time is between 1.1 and 2.5 seconds?
P(1.1 ≤ Y ≤ 2.5) = F_Y(2.5) - F_Y(1.1)
                 = pnorm(2.5,1.5,sqrt(0.16))-pnorm(1.1,1.5,sqrt(0.16))
                 = 0.835135.
(c) Five percent of all reaction times will exceed which time? Note: I am asking for
the 0.95 quantile (95th percentile) of the reaction time distribution.
> qnorm(0.95,1.5,sqrt(0.16))
[1] 2.157941
EMPIRICAL RULE: For any N(μ, σ²) distribution,

about 68% of the distribution is between μ - σ and μ + σ
about 95% of the distribution is between μ - 2σ and μ + 2σ
about 99.7% of the distribution is between μ - 3σ and μ + 3σ.
This is also called the 68-95-99.7% rule. This rule allows for us to make statements
like this (referring to Example 3.26, where μ = 1.5 and σ = 0.4):
About 68 percent of all reaction times will be between 1.1 and 1.9 seconds.
About 95 percent of all reaction times will be between 0.7 and 2.3 seconds.
About 99.7 percent of all reaction times will be between 0.3 and 2.7 seconds.
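These percentages come directly from the standard normal cdf; a quick check in R
(added here for reference):

> pnorm(1)-pnorm(-1)   ## ~ 68%
[1] 0.6826895
> pnorm(2)-pnorm(-2)   ## ~ 95%
[1] 0.9544997
> pnorm(3)-pnorm(-3)   ## ~ 99.7%
[1] 0.9973002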
TERMINOLOGY: A random variable Z is said to have a standard normal distribu-
tion if its pdf is given by

f_Z(z) = { [1/√(2π)] e^{-z²/2},  -∞ < z < ∞
         { 0,                    otherwise.
Shorthand notation is Z ~ N(0, 1). A standard normal distribution is a special normal
distribution, that is, a normal distribution with mean μ = 0 and variance σ² = 1. The
variable Z is called a standard normal random variable.
RESULT: If Y ~ N(μ, σ²), then

Z = (Y - μ)/σ ~ N(0, 1).

The result says that Z follows a standard normal distribution; i.e., Z ~ N(0, 1). In this
context, Z is called the standardized value of Y.
Important: Therefore, any normal random variable Y ~ N(μ, σ²) can be converted
to a standard normal random variable Z by applying this transformation.
IMPLICATION: Any probability calculation involving a normal random variable
Y ~ N(μ, σ²) can be transformed into a calculation involving Z ~ N(0, 1). More
specifically, if Y ~ N(μ, σ²), then

P(y_1 < Y < y_2) = P( (y_1 - μ)/σ < Z < (y_2 - μ)/σ )
                 = F_Z( (y_2 - μ)/σ ) - F_Z( (y_1 - μ)/σ ).
Q: Why is this important?

Because it is common for textbooks (like yours) to table the cumulative distri-
bution function F_Z(z) for various values of z. See Table 1 (pp 593-594, VK).
Therefore, the preceding standardization result makes it possible to find proba-
bilities involving normal random variables "by hand" without using calculus or
software.
Because R will compute normal probabilities directly, I view hand calculation to
be unnecessary and outdated.
See the examples in the text (Section 3.5) for illustrations of this approach.
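To illustrate the standardization result numerically (this check is mine, not from the
text), the direct and standardized calculations for Example 3.26(b) agree:

> pnorm(2.5,1.5,0.4)-pnorm(1.1,1.5,0.4)   ## direct calculation; sigma = sqrt(0.16) = 0.4
[1] 0.835135
> pnorm(2.5)-pnorm(-1)                    ## standardized: (2.5-1.5)/0.4 = 2.5, (1.1-1.5)/0.4 = -1
[1] 0.835135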
3.6 Reliability and lifetime distributions
TERMINOLOGY: Reliability analysis deals with failure time (i.e., lifetime, time-to-
event) data. For example,
T = time from start of product service until failure
T = time of sale of a product until a warranty claim
T = number of hours in use/cycles until failure.
We call T a lifetime random variable if it measures the time to an event; e.g.,
failure, death, eradication of some infection/condition, etc. Engineers are often involved
with reliability studies in practice, because reliability is related to product quality.
NOTE: There are many well known lifetime distributions, including
exponential
Weibull
lognormal
Others: gamma, inverse Gaussian, Gompertz-Makeham, Birnbaum-Saunders,
extreme value, log-logistic, etc.
The normal (Gaussian) distribution is rarely used to model lifetime variables.
3.6.1 Weibull distribution
TERMINOLOGY: A random variable T is said to have a Weibull distribution with
parameters β > 0 and η > 0 if its pdf is given by

f_T(t) = { (β/η)(t/η)^{β-1} e^{-(t/η)^β},  t > 0
         { 0,                              otherwise.
Shorthand notation is T ~ Weibull(β, η).

We call
β = shape parameter
η = scale parameter.

By changing the values of β and η, the Weibull pdf can assume many shapes. The
Weibull distribution is very popular among engineers in reliability applications.

Note that when β = 1, the Weibull pdf reduces to the exponential(λ = 1/η) pdf.
MEAN/VARIANCE: If T ~ Weibull(β, η), then

E(T) = η Γ(1 + 1/β)
var(T) = η² { Γ(1 + 2/β) - [Γ(1 + 1/β)]² }.
Figure 3.16: Weibull pdfs with different values of β and η (β = 2, η = 5; β = 2, η = 10;
β = 3, η = 10).
CDF: Suppose that T ~ Weibull(β, η). Then, the cdf of T exists in closed form and is
given by

F_T(t) = { 0,                 t ≤ 0
         { 1 - e^{-(t/η)^β},  t > 0.
Example 3.27. Suppose that the lifetime of a rechargeable battery, denoted by T
(measured in hours), follows a Weibull distribution with parameters β = 2 and η = 10.
(a) What is the mean time to failure?
E(T) = 10 Γ(3/2) ≈ 8.862 hours.
(b) What is the probability that a battery is still functional at time t = 20?
P(T ≥ 20) = 1 - P(T < 20) = 1 - F_T(20)
          = 1 - [1 - e^{-(20/10)²}] ≈ 0.018.
Figure 3.17: PDF (left) and CDF (right) of T ~ Weibull(β = 2, η = 10) in Example 3.27.
(c) What is the probability that a battery is still functional at time t = 20 given that
the battery is functional at time t = 10?
P(T ≥ 20 | T ≥ 10) = P(T ≥ 20 and T ≥ 10) / P(T ≥ 10)
                   = P(T ≥ 20) / P(T ≥ 10)
                   = [1 - F_T(20)] / [1 - F_T(10)]
                   = e^{-(20/10)²} / e^{-(10/10)²} ≈ 0.050.
(d) What is the 99th percentile of this lifetime distribution? We set

F_T(φ_{0.99}) = 1 - e^{-(φ_{0.99}/10)²} = 0.99

and solve for φ_{0.99}. This gives φ_{0.99} ≈ 21.460 hours. Only one percent of the battery
lifetimes will exceed this value.
WEIBULL R CODE: Suppose that T ~ Weibull(β, η).

F_T(t) = P(T ≤ t)    φ_p
pweibull(t,β,η)      qweibull(p,β,η)
> 10*gamma(3/2) ## Part (a)
[1] 8.86227
> 1-pweibull(20,2,10) ## Part (b)
[1] 0.01831564
> (1-pweibull(20,2,10))/(1-pweibull(10,2,10)) ## Part (c)
[1] 0.04978707
> qweibull(0.99,2,10) ## Part (d)
[1] 21.45966
3.6.2 Reliability functions
DESCRIPTION: We now describe some different, but equivalent, ways of defining the
distribution of a (continuous) lifetime random variable T.

The cumulative distribution function (cdf):

F_T(t) = P(T ≤ t).

This can be interpreted as the proportion of units that have failed by time t.
The survivor function:

S_T(t) = P(T > t) = 1 - F_T(t).

This can be interpreted as the proportion of units that have not failed by time t;
e.g., the unit is still functioning, a warranty claim has not been made, etc.
The probability density function (pdf):

f_T(t) = (d/dt) F_T(t) = -(d/dt) S_T(t).

Also, recall that

F_T(t) = ∫_0^t f_T(u) du

and

S_T(t) = ∫_t^∞ f_T(u) du.
TERMINOLOGY: The hazard function is defined as

h_T(t) = lim_{ε→0} P(t ≤ T < t + ε | T ≥ t) / ε.

The hazard function is not a probability; rather, it is a probability rate. Therefore, it
is possible that a hazard function may exceed one.
REMARK: The hazard function (or hazard rate) is a very important characteristic of
a lifetime distribution. It indicates the way the risk of failure varies with time.
Distributions with increasing hazard functions are seen in units for which some kind of
aging or "wear out" takes place. Certain types of units (e.g., electronic devices, etc.)
may display a decreasing hazard function, at least in the early stages of their lifetimes.
NOTE: It is insightful to note that

h_T(t) = lim_{ε→0} P(t ≤ T < t + ε | T ≥ t) / ε
       = lim_{ε→0} P(t ≤ T < t + ε) / [ε P(T ≥ t)]
       = [1 / P(T ≥ t)] lim_{ε→0} [F_T(t + ε) - F_T(t)] / ε
       = f_T(t) / S_T(t).

We can therefore describe the distribution of the continuous lifetime random variable T
by using either f_T(t), F_T(t), S_T(t), or h_T(t).
Example 3.28. In this example, we find the hazard function for T ~ Weibull(β, η).
Recall that the pdf of T is

f_T(t) = { (β/η)(t/η)^{β-1} e^{-(t/η)^β},  t > 0
         { 0,                              otherwise.
The cdf of T is

F_T(t) = { 0,                 t ≤ 0
         { 1 - e^{-(t/η)^β},  t > 0.
The survivor function of T is

S_T(t) = 1 - F_T(t) = { 1,              t ≤ 0
                      { e^{-(t/η)^β},   t > 0.
Figure 3.18: Weibull hazard functions with η = 1. Upper left: β = 3. Upper right:
β = 1.5. Lower left: β = 1. Lower right: β = 0.5.
Therefore, the hazard function, for t > 0, is

h_T(t) = f_T(t) / S_T(t) = [ (β/η)(t/η)^{β-1} e^{-(t/η)^β} ] / e^{-(t/η)^β}
       = (β/η)(t/η)^{β-1}.
Plots of Weibull hazard functions are given in Figure 3.18. It is easy to show

h_T(t) is increasing if β > 1 (wear out; population of units gets weaker with aging)
h_T(t) is constant if β = 1 (constant hazard; exponential distribution)
h_T(t) is decreasing if β < 1 (infant mortality; population of units gets stronger with
aging).
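A short R sketch (the function name is my own) that evaluates this hazard and
reproduces the qualitative shapes in Figure 3.18:

> hazard.weibull <- function(t,beta,eta){ (beta/eta)*(t/eta)^(beta-1) }
> t <- seq(0.01,4,by=0.01)
> plot(t,hazard.weibull(t,3,1),type="l",ylab="h(t)")   ## increasing (beta = 3)
> lines(t,hazard.weibull(t,1,1),lty=2)                 ## constant (beta = 1)
> lines(t,hazard.weibull(t,0.5,1),lty=3)               ## decreasing (beta = 0.5)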
Example 3.29. We consider the data in Example 3.23 of Vining and Kowalski (pp 162).
The data are times, denoted by T (measured in months), to the first failure for 20 electric
carts used for internal delivery and transportation in a large manufacturing facility.
0.9 1.5 2.3 3.2 3.9 5.0 6.2 7.5 8.3 10.4
11.1 12.6 15.0 16.3 19.3 22.6 24.8 31.5 38.1 53.0
From these data, maximum likelihood estimates of β and η are computed to be

β̂ ≈ 1.110
η̂ ≈ 15.271.
Note: Maximum likelihood estimation is a mathematical technique used to find
values of β and η that "most closely agree" with the observed data. R computes these
estimates automatically. In Figure 3.19, we display the (estimated) PDF f_T(t), CDF
F_T(t), survivor function S_T(t), and hazard function h_T(t) based on these estimates.
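One way to compute such estimates in R (a sketch; the notes do not say which routine
was used, and the cart vector name below is mine) is MASS::fitdistr:

> library(MASS)
> cart <- c(0.9,1.5,2.3,3.2,3.9,5.0,6.2,7.5,8.3,10.4,
+           11.1,12.6,15.0,16.3,19.3,22.6,24.8,31.5,38.1,53.0)
> fitdistr(cart,densfun="weibull")   ## shape = beta-hat ~ 1.110, scale = eta-hat ~ 15.271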
REMARK: Note that the estimate β̂ ≈ 1.110 is larger than 1. This suggests that there
is "wear out" taking place among the carts; that is, the population of carts gets weaker
as time passes.
(a) Using the estimated Weibull(β̂ ≈ 1.110, η̂ ≈ 15.271) distribution as a model for future
cart lifetimes, find the probability that a cart will still be functioning after t = 36 months.

P(T ≥ 36) = 1 - P(T < 36) = 1 - F_T(36)
          = 1 - [1 - e^{-(36/15.271)^{1.110}}] ≈ 0.075.
(b) Use the estimated distribution to find the 90th percentile of the cart lifetimes.

F_T(φ_{0.90}) = 1 - e^{-(φ_{0.90}/15.271)^{1.110}} = 0.90.

Solving for φ_{0.90} gives φ_{0.90} ≈ 32.373 months. Only ten percent of the cart lifetimes
will exceed this value.
Figure 3.19: Weibull functions with β̂ ≈ 1.110 and η̂ ≈ 15.271. Upper left: PDF. Upper
right: CDF. Lower left: Survivor function. Lower right: Hazard function.
> 1-pweibull(36,1.110,15.271) ## Part (a)
[1] 0.07497392
> qweibull(0.90,1.110,15.271) ## Part (b)
[1] 32.37337
TERMINOLOGY: A quantile-quantile plot (qq plot) is a graphical display that can
help assess the appropriateness of a distribution. Here is how the plot is constructed:
On the vertical axis, we plot the observed data, ordered from low to high.
Figure 3.20: Weibull qq plot for the electric cart data in Example 3.29. The observed data
are plotted versus the theoretical quantiles from a Weibull distribution with β̂ ≈ 1.110
and η̂ ≈ 15.271.
On the horizontal axis, we plot the (ordered) theoretical quantiles from the distri-
bution (model) assumed for the observed data.
Our intuition should suggest the following:

If the observed data agree with the distribution's theoretical quantiles, then the
qq plot should look like a straight line (the distribution is a good choice).

If the observed data do not agree with the theoretical quantiles, then the qq
plot should have curvature in it (the distribution is not a good choice).
Interpretation: The Weibull qq plot in Figure 3.20 looks like a straight line. This
suggests that a Weibull distribution is a good fit for the electric cart lifetime data.
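A minimal sketch of how a plot like Figure 3.20 could be drawn in R (the plotting
positions (i - 0.5)/n are one common choice; the notes do not specify which was used):

> n <- length(cart)
> p <- ((1:n)-0.5)/n                      ## plotting positions
> theo <- qweibull(p,1.110,15.271)        ## theoretical Weibull quantiles
> plot(theo,sort(cart),xlab="Weibull percentiles",ylab="Observed values")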
4 Statistical Inference
Complementary reading: Chapter 4 (VK); Sections 4.1-4.8. Read also Sections 3.6-3.7.
4.1 Populations and samples
OVERVIEW: This chapter is about statistical inference. This deals with making
(probabilistic) statements about a population of individuals based on information that
is contained in a sample taken from the population.
Example 4.1. Suppose that we wish to study the performance of lithium batteries used
in a certain calculator. The purpose of our study is to determine the mean lifetime of
these batteries so that we can place a limited warranty on them in the future. Since this
type of battery has not been used in this calculator before, no one (except the Oracle) can
tell us the distribution of Y, the battery's lifetime. In fact, not only is the distribution
not known, but all parameters which index this distribution aren't known either.
TERMINOLOGY: A population refers to the entire group of individuals (e.g., parts,
people, batteries, etc.) about which we would like to make a statement (e.g., proportion
defective, median weight, mean lifetime, etc.).
It is generally accepted that the entire population can not be measured. It is too
large and/or it would be too time consuming to do so.
To draw inferences (make probabilistic statements) about a population, we therefore
observe a sample of individuals from the population.
We will assume that the sample of individuals constitutes a random sample.
Mathematically, this means that all observations are independent and follow the
same probability distribution. Informally, this means that each sample (of the same
size) has the same chance of being selected. Our hope is that a random sample of
individuals is representative of the entire population of individuals.
NOTATION: We will denote a random sample of observations by

Y_1, Y_2, ..., Y_n.

That is, Y_1 is the value of Y for the first individual in the sample, Y_2 is the value of
Y for the second individual in the sample, and so on. The sample size tells us how
many individuals are in the sample and is denoted by n. Statisticians refer to the set of
observations Y_1, Y_2, ..., Y_n generically as data. Lower case notation y_1, y_2, ..., y_n is used
when citing numerical values (or when referring to realizations of the upper case versions).
BATTERY DATA: Consider the following random sample of n = 50 battery lifetimes
y_1, y_2, ..., y_50 (measured in hours):
4285 2066 2584 1009 318 1429 981 1402 1137 414
564 604 14 4152 737 852 1560 1786 520 396
1278 209 349 478 3032 1461 701 1406 261 83
205 602 3770 726 3894 2662 497 35 2778 1379
3920 1379 99 510 582 308 3367 99 373 454
In Figure 4.1, we display a histogram of the battery lifetime data. We see that the
(empirical) distribution of the battery lifetimes is skewed towards the high side.

Which continuous probability distribution seems to display the same type of pattern
that we see in the histogram?

An exponential(λ) model seems reasonable here (based on the histogram shape).
What is λ?

In this example, λ is called a (population) parameter. It describes the theoretical
distribution which is used to model the entire population of battery lifetimes.

In general, (population) parameters which index probability distributions (like the
exponential) are unknown.

All of the probability distributions that we discussed in Chapter 3 are meant to
describe (model) population behavior.
Figure 4.1: Histogram of battery lifetime data (measured in hours).
4.2 Parameters and statistics
TERMINOLOGY: A parameter is a numerical quantity that describes a population.
In general, population parameters are unknown. Some very common examples are:

μ = population mean
σ² = population variance
p = population proportion.
CONNECTION: All of the probability distributions that we talked about in Chapter 3
were indexed by population (model) parameters. For example,

the N(μ, σ²) distribution is indexed by two parameters, the population mean μ and
the population variance σ².
the Poisson(λ) distribution is indexed by one parameter, the population mean λ.

the Weibull(β, η) distribution is indexed by two parameters, the shape parameter
β and the scale parameter η.

the b(n, p) distribution is indexed by one parameter, the population proportion of
"successes" p.
TERMINOLOGY: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a population.

The sample mean is

Ȳ = (1/n) Σ_{i=1}^n Y_i.

The sample variance is

S² = [1/(n-1)] Σ_{i=1}^n (Y_i - Ȳ)².

The sample standard deviation is the positive square root of the sample variance;
i.e.,

S = √S² = √{ [1/(n-1)] Σ_{i=1}^n (Y_i - Ȳ)² }.
Important: Unlike their population analogues, these quantities can be computed from
a sample of data Y_1, Y_2, ..., Y_n.
TERMINOLOGY: A statistic is a numerical quantity that can be calculated from a
sample of data. Some very common examples are:

Ȳ = sample mean
S² = sample variance
p̂ = sample proportion.
For example, with the battery lifetime data (a random sample of n = 50 lifetimes),

ȳ = 1274.14 hours
s² = 1505156 (hours)²
s ≈ 1226.85 hours.
> mean(battery) ## sample mean
[1] 1274.14
> var(battery) ## sample variance
[1] 1505156
> sd(battery) ## sample standard deviation
[1] 1226.848
SUMMARY: The table below succinctly summarizes the salient differences between a
population and a sample (a parameter and a statistic):
Group of individuals Numerical quantity Status
Population (Not observed) Parameter Unknown
Sample (Observed) Statistic Calculated from sample data
Statistical inference deals with making (probabilistic) statements about a population
of individuals based on information that is contained in a sample taken from the popu-
lation. We do this by
(a) estimating unknown population parameters with sample statistics
(b) quantifying the uncertainty (variability) that arises in the estimation process.
These are both necessary to construct confidence intervals and to perform hypothesis
tests, two important exercises discussed in this chapter.
4.3 Point estimators and sampling distributions
NOTATION: To keep our discussion as general as possible (as the material in this subsec-
tion can be applied to many situations), we will let θ denote a population parameter.
For example, θ could denote a population mean, a population variance, a population
proportion, a Weibull or gamma model parameter, etc. It could also denote a
parameter in a regression context (Chapters 6-7).
TERMINOLOGY: A point estimator θ̂ is a statistic that is used to estimate a popu-
lation parameter θ. Common examples of point estimators are:

Ȳ → a point estimator for μ (population mean)
S² → a point estimator for σ² (population variance)
S → a point estimator for σ (population standard deviation).
CRUCIAL POINT: It is important to note that, in general, an estimator θ̂ is a statistic,
so it depends on the sample of data Y_1, Y_2, ..., Y_n.

The data Y_1, Y_2, ..., Y_n come from the sampling process; e.g., different random sam-
ples will yield different data sets Y_1, Y_2, ..., Y_n.

In this light, because the sample values Y_1, Y_2, ..., Y_n will vary from sample to sam-
ple, the value of θ̂ will too! It therefore makes sense to think about all possible
values of θ̂; that is, the distribution of θ̂.
TERMINOLOGY: The distribution of an estimator θ̂ (a statistic) is called its sampling
distribution. A sampling distribution describes mathematically how θ̂ would vary in
repeated sampling. We will study many sampling distributions in this chapter.
TERMINOLOGY: We say that θ̂ is an unbiased estimator of θ if and only if

E(θ̂) = θ.

In other words, the mean of the sampling distribution of θ̂ is equal to θ. Note that
unbiasedness is a characteristic describing the center of a sampling distribution. This
deals with accuracy.
RESULT: Mathematics shows that when Y_1, Y_2, ..., Y_n is a random sample,

E(Ȳ) = μ
E(S²) = σ².

That is, Ȳ and S² are unbiased estimators of their population analogues.
GOAL: Not only do we desire to use point estimators θ̂ which are unbiased, but we would
also like for them to have small variability. In other words, when θ̂ misses θ, we would
like for it to not miss by much. This deals with precision.

MAIN POINT: Accuracy and precision are the two main mathematical characteristics
that arise when evaluating the quality of a point estimator θ̂. We desire point estimators
θ̂ which are unbiased (perfectly accurate) and have small variance (highly precise).
TERMINOLOGY: The standard error of a point estimator θ̂ is equal to

se(θ̂) = √var(θ̂).

In other words, the standard error is equal to the standard deviation of the sampling
distribution of θ̂. An estimator's standard error measures the amount of variability in
the point estimator θ̂. Therefore,

smaller se(θ̂) ⟺ more precise.
4.4 Sampling distributions involving Ȳ
NOTE: This subsection summarizes Sections 3.6-3.7 (VK).
Result 1: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution.
The sample mean Ȳ has the following sampling distribution:

Ȳ ~ N(μ, σ²/n).
This result reminds us that

E(Ȳ) = μ.

That is, the sample mean Ȳ is an unbiased estimator of the population mean μ.
This result also shows that the standard error of Ȳ (as a point estimator) is

se(Ȳ) = √var(Ȳ) = √(σ²/n) = σ/√n.
Figure 4.2: Braking time example. Population distribution: Y ~ N(μ = 1.5, σ² = 0.16).
Also depicted are the sampling distributions of Ȳ when n = 5 and n = 25.
Example 4.2. In Example 3.26 (notes), we examined the distribution of

Y = time (in seconds) to react to brake lights during in-traffic driving.

We assumed that

Y ~ N(μ = 1.5, σ² = 0.16).

We call this the population distribution, because it describes the distribution of values
of Y for all individuals in the population (here, in-traffic drivers).
Question. Suppose that we take a random sample of n = 5 drivers with times
Y_1, Y_2, ..., Y_5. What is the distribution of the sample mean Ȳ?

Solution. If the sample size is n = 5, then with μ = 1.5 and σ² = 0.16, we have

Ȳ ~ N(μ, σ²/n) = N(1.5, 0.032).
This distribution describes the values of Ȳ we would expect to see in repeated sampling,
that is, if we repeatedly sampled n = 5 individuals from this population of in-traffic
drivers and calculated the sample mean Ȳ each time.

Question. Suppose that we take a random sample of n = 25 drivers with times
Y_1, Y_2, ..., Y_25. What is the distribution of the sample mean Ȳ?

Solution. If the sample size is n = 25, then with μ = 1.5 and σ² = 0.16, we have

Ȳ ~ N(μ, σ²/n) = N(1.5, 0.0064).

The sampling distribution of Ȳ when n = 5 and when n = 25 is shown in Figure 4.2.
4.4.1 Central Limit Theorem
Result 2: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a population distribution
with mean μ and variance σ² (not necessarily a normal distribution). When the sample
size n is large, we have

Ȳ ~ AN(μ, σ²/n).

The symbol AN is read "approximately normal." This result is called the Central Limit
Theorem (CLT).
Result 1 guarantees that when the underlying population distribution is N(μ, σ²),
the sample mean

Ȳ ~ N(μ, σ²/n).

The Central Limit Theorem (Result 2) says that even if the population distribution
is not normal (Gaussian), the sampling distribution of the sample mean Ȳ will be
approximately normal (Gaussian) when the sample size is sufficiently large.
Example 4.3. The time to death for rats injected with a toxic substance, denoted by
Y (measured in days), follows an exponential distribution with λ = 1/5. That is,

Y ~ exponential(λ = 1/5).
Figure 4.3: Rat death times. Population distribution: Y ~ exponential(λ = 1/5). Also
depicted are the sampling distributions of Ȳ when n = 5 and n = 25.
This is the population distribution, that is, this distribution describes the time to
death for all rats in the population.

In Figure 4.3, I have shown the exponential(1/5) population distribution (solid
curve). I have also depicted the theoretical sampling distributions of Ȳ when n = 5
and when n = 25.

Main point: Notice how the sampling distribution of Ȳ begins to (albeit distantly)
resemble a normal distribution when n = 5. When n = 25, the sampling distri-
bution of Ȳ looks very much to be normal (Gaussian). This is precisely what is
conferred by the CLT. The larger the sample size n, the better a normal (Gaussian)
distribution approximates the true sampling distribution of Ȳ.
Example 4.4. When a batch of a certain chemical product is prepared, the amount of
a particular impurity in the batch (measured in grams) is a random variable Y with the
following population parameters:

μ = 4.0 g
σ² = (1.5 g)².
Suppose that n = 50 batches are prepared (independently). What is the probability that
the sample mean impurity amount Ȳ is greater than 4.2 grams?

Solution. With n = 50, μ = 4, and σ² = (1.5)², the CLT says that

Ȳ ~ AN(μ, σ²/n) = AN(4, 0.045).
Therefore,

P(Ȳ > 4.2) = 1 - P(Ȳ < 4.2)
           ≈ 1-pnorm(4.2,4,sqrt(0.045)) = 0.1728893.
Important: Note that in making this (approximate) probability calculation, we never
made an assumption about the underlying population distribution shape.
4.4.2 t distribution
Result 3: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution.
Result 1 says the sample mean Ȳ has the following sampling distribution:

Ȳ ~ N(μ, σ²/n).

If we standardize Ȳ, we obtain

Z = (Ȳ - μ)/(σ/√n) ~ N(0, 1).

Replacing the population standard deviation σ with the sample standard deviation S,
we get a new sampling distribution:

t = (Ȳ - μ)/(S/√n) ~ t(n - 1),

a t distribution with degrees of freedom ν = n - 1.
Figure 4.4: Probability density functions of N(0, 1), t(2), and t(10).
FACTS: The t distribution has the following characteristics:

It is continuous and symmetric about 0 (just like the standard normal distribution).

It is indexed by a value ν called the degrees of freedom.

In practice, ν is often an integer (related to the sample size).

As ν → ∞, t(ν) → N(0, 1); thus, when ν becomes larger, the t(ν) and the N(0, 1)
distributions look more alike.

When compared to the standard normal distribution, the t distribution, in general,
is less peaked and has more probability (area) in the tails.

The t pdf formula is complicated and is unnecessary for our purposes. R will
compute t probabilities and quantiles from the t distribution.
t R CODE: Suppose that T ~ t(ν).

F_T(t) = P(T ≤ t)    φ_p
pt(t,ν)              qt(p,ν)
Example 4.5. Hollow pipes are to be used in an electrical wiring project. In testing
1-inch pipes, the data below were collected by a design engineer. The data are mea-
surements of Y, the outside diameter of this type of pipe (measured in inches). These
n = 25 pipes were randomly selected and measured, all in the same location.
1.296 1.320 1.311 1.298 1.315
1.305 1.278 1.294 1.311 1.290
1.284 1.287 1.289 1.292 1.301
1.298 1.287 1.302 1.304 1.301
1.313 1.315 1.306 1.289 1.291
From their extensive experience, the manufacturers of this pipe claim that the population
distribution is normal (Gaussian) and that the mean outside diameter is μ = 1.29 inches.
Under this assumption (which may or may not be true), calculate the value of

t = (ȳ - μ)/(s/√n).
Solution. We use R to find the sample mean ȳ and the sample standard deviation s:
> mean(pipes) ## sample mean
[1] 1.29908
> sd(pipes) ## sample standard deviation
[1] 0.01108272
With n = 25, we have

t = (1.29908 - 1.29)/(0.01108272/√25) ≈ 4.096.
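Equivalently, t can be computed in one line from the data (assuming the pipes vector
used above):

> (mean(pipes)-1.29)/(sd(pipes)/sqrt(length(pipes)))   ## t ~ 4.096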
Figure 4.5: t(24) probability density function. A mark at t = 4.096 has been added.
Analysis. If the manufacturers' claim is true (that is, if μ = 1.29 inches), then

t = (ȳ - μ)/(s/√n)

comes from a t(24) distribution. The t(24) pdf is displayed above in Figure 4.5.
Key question: Does t = 4.096 seem like a value you would expect to see from this
distribution? If not, what might this suggest? Recall that t was computed under the
assumption that μ = 1.29 inches (the manufacturers' claim).
Question. The value t = 4.096 is what percentile of the t(24) distribution?
> pt(4.096,24)
[1] 0.9997934
Answer: t = 4.096 is approximately the 99.98th percentile of the t(24) distribution.
4.4.3 Normal quantile-quantile (qq) plots
IMPORTANT: Result 3 says that if Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²)
distribution, then

t = (Ȳ - μ)/(S/√n) ~ t(n - 1).
An obvious question therefore arises: What if Y_1, Y_2, ..., Y_n are non-normal (i.e., non-
Gaussian)?
Answer: The t distribution result still approximately holds, even if the underlying
population distribution is not perfectly normal. The approximation improves when
the sample size is larger
the population distribution is more symmetric (not highly skewed).
Because the normality assumption (for the population distribution) is not absolutely
critical for this t sampling distribution result to hold, we say that Result 3 is robust to
the normality assumption.
REMARK: Robustness is a nice property; it assures us that the underlying assumption
of normality is not an absolute requirement for the t distribution result to hold. Other
sampling distribution results (coming up) are not always robust to normality departures.
TERMINOLOGY: Just as we used Weibull qq plots to assess the Weibull model assump-
tion in the last chapter, we can use a normal quantile-quantile (qq) plot to assess the
normal distribution assumption. The plot is constructed as follows:
On the vertical axis, we plot the observed data, ordered from low to high.
On the horizontal axis, we plot the (ordered) theoretical quantiles from the distri-
bution (model) assumed for the observed data (here, normal).
ILLUSTRATION: Figure 4.6 shows the normal qq plot for the pipe diameter data in
Example 4.5. The ordered data do not match up perfectly with the normal quantiles,
but the plot doesn't set off any serious alarms (insofar as a departure from normality is
concerned).
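In R, a plot like Figure 4.6 can be produced with qqnorm() and qqline() (a sketch,
assuming the pipes vector from Example 4.5):

> qqnorm(pipes)   ## ordered data vs. theoretical normal quantiles
> qqline(pipes)   ## line through the first and third theoretical quartiles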
Figure 4.6: Normal qq plot for the pipe diameter data in Example 4.5. The observed
data are plotted versus the theoretical quantiles from a normal distribution. The line
added passes through the first and third theoretical quartiles.
4.5 Confidence intervals for a population mean

4.5.1 Known population variance σ²
SETTING: To get things started, we will assume that Y_1, Y_2, ..., Y_n is a random sample
from a N(μ, σ²) population distribution. We will assume that

the population variance σ² is known (largely unrealistic).

the goal is to estimate the population mean μ.
We already know that Ȳ is an unbiased (point) estimator for μ, that is,

E(Ȳ) = μ.

However, reporting Ȳ alone does not acknowledge that there is variability attached to
this estimator. For example, in Example 4.5, with the n = 25 measured pipes, reporting

ȳ ≈ 1.299 in

as an estimate of the population mean μ does not account for the fact that

the 25 pipes measured were drawn randomly from a population of all pipes, and

different samples would give different sets of pipes (and different values of ȳ).

In other words, using a point estimator only ignores important information; namely,
how variable the population of pipes is.
REMEDY: To avoid this problem (i.e., to account for the uncertainty in the sampling
procedure), we therefore pursue the topic of interval estimation (also known as con-
fidence intervals). The main difference between a point estimate and an interval estimate
is that

a point estimate is a "one-shot guess" at the value of the parameter; this ignores
the variability in the estimate.

an interval estimate (i.e., confidence interval) is an interval of values. It is
formed by taking the point estimate and then adjusting it downwards and up-
wards to account for the point estimate's variability. The end result is an interval
estimate.
DERIVATION: We start our discussion by revisiting Result 1 in the last subsection.
Recall that if Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) distribution, then the
sampling distribution of Ȳ is

Ȳ ~ N(μ, σ²/n)
Figure 4.7: N(0, 1) pdf. The upper 0.025 and lower 0.025 areas have been shaded. The
associated quantiles are z_{0.025} ≈ 1.96 and -z_{0.025} ≈ -1.96, respectively.
and therefore

Z = (Ȳ - μ)/(σ/√n) ~ N(0, 1).

We now introduce new notation that identifies quantiles from this distribution.
NOTATION: Let z_{α/2} denote the upper α/2 quantile from the N(0, 1) distribution.
Because the N(0, 1) distribution is symmetric about z = 0, we know that -z_{α/2} is the
lower α/2 quantile. For example, if α = 0.05 (see Figure 4.7), we know that

z_{0.05/2} = z_{0.025} ≈ 1.96
-z_{0.05/2} = -z_{0.025} ≈ -1.96.
To see where these values come from, you can use Table 1 (VK, pp 593-594) or R:
> qnorm(0.975,0,1) ## z_{0.025} ## upper 0.025 quantile
[1] 1.959964
> qnorm(0.025,0,1) ## -z_{0.025} ## lower 0.025 quantile
[1] -1.959964
DERIVATION: In general, for any value of α, 0 < α < 1, we can write

1 - α = P(-z_{α/2} < Z < z_{α/2})
      = P( -z_{α/2} < (Ȳ - μ)/(σ/√n) < z_{α/2} )
      = P( -z_{α/2} σ/√n < Ȳ - μ < z_{α/2} σ/√n )
      = P( z_{α/2} σ/√n > μ - Ȳ > -z_{α/2} σ/√n )
      = P( Ȳ + z_{α/2} σ/√n > μ > Ȳ - z_{α/2} σ/√n )
      = P( Ȳ - z_{α/2} σ/√n < μ < Ȳ + z_{α/2} σ/√n ).

We call

( Ȳ - z_{α/2} σ/√n, Ȳ + z_{α/2} σ/√n )

a 100(1-α) percent confidence interval for the population mean μ. This is sometimes
written (more succinctly) as

Ȳ ± z_{α/2} σ/√n.
Note the form of the interval:

point estimate (Ȳ) ± quantile (z_{α/2}) × standard error (σ/√n).

Many confidence intervals we will study follow this same general form.
Here is how we interpret this interval: We say

"We are 100(1 - α) percent confident that the population mean μ is in
this interval."
Unfortunately, the word "confident" does not mean "probability." The term "con-
fidence" in confidence interval means that if we were able to sample from the pop-
ulation over and over again, each time computing a 100(1 - α) percent confidence
interval for μ, then 100(1 - α) percent of the intervals we would compute would
contain the population mean μ.

That is, "confidence" refers to long term behavior of many intervals; not prob-
ability for a single interval. Because of this, we call 100(1 - α) the confidence
level. Typical confidence levels are

90 percent (α = 0.10) ⟹ z_{0.05} ≈ 1.645
95 percent (α = 0.05) ⟹ z_{0.025} ≈ 1.96
99 percent (α = 0.01) ⟹ z_{0.005} ≈ 2.58.
The length of the 100(1 - α) percent confidence interval

Ȳ ± z_{α/2} σ/√n

is equal to

2 z_{α/2} σ/√n.

Therefore,

the larger the sample size n, the smaller the interval length.
the larger the population variance σ², the larger the interval length.
the larger the confidence level 100(1 - α), the larger the interval length.
Clearly, shorter confidence intervals are preferred. They are more informative!
Example 4.6. Civil engineers have found that the ability to see and read a sign at night
depends in part on its "surround luminance"; i.e., the light intensity near the sign. The
data below are n = 30 measurements of the random variable Y, the surround luminance
(in candela per m²). The 30 measurements constitute a random sample from all signs in
a large metropolitan area.
10.9 1.7 9.5 2.9 9.1 3.2 9.1 7.4 13.3 13.1
6.6 13.7 1.5 6.3 7.4 9.9 13.6 17.3 3.6 4.9
13.1 7.8 10.3 10.3 9.6 5.7 2.6 15.1 2.9 16.2
Based on past experience, the engineers assume a normal population distribution (for
the population of all signs) with known population variance σ² = 20.

Question. Find a 90 percent confidence interval for μ, the mean surround luminance.

Solution. We first use R to calculate the sample mean ȳ:
> mean(intensity) ## sample mean
[1] 8.62
For a 90 percent confidence level; i.e., with α = 0.10, we use

z_{0.10/2} = z_{0.05} ≈ 1.645.

This can be determined from Table 1 (VK, pp 593-594) or from R:
> qnorm(0.95,0,1) ## z_{0.05} ## upper 0.05 quantile
[1] 1.644854
With n = 30 and σ² = 20, a 90 percent confidence interval for the mean surround
luminance μ is

ȳ ± z_{α/2} σ/√n = 8.62 ± 1.645 (√20/√30) = (7.28, 9.96) candela/m².

Interpretation: We are 90 percent confident that the mean surround luminance μ for
all signs in the population is between 7.28 and 9.96 candela/m².
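The interval is easy to reproduce directly in R (a sketch using the quantities above):

> 8.62+c(-1,1)*qnorm(0.95)*sqrt(20/30)   ## 90 percent z interval: (7.28, 9.96)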
Further analysis: Recall that the engineers claimed that the population of all light
intensities is described by a normal (Gaussian) distribution. In Figure 4.8, we display a
normal qq plot to check this assumption. The qq plot might cause some concern about
the normality assumption. There is some mild evidence of disagreement with the normal
model (although with a small sample, it is always hard to be sure).
Figure 4.8: Normal qq plot for the light intensity data in Example 4.6. The observed
data are plotted versus the theoretical quantiles from a normal distribution. The line
added passes through the first and third theoretical quartiles.
Is this a serious cause for concern?: Probably not. Recall that even if the population
distribution (here, the distribution of all light intensity measurements in the city) is not
perfectly normal, we still have

Ȳ ~ AN(μ, σ²/n),

for n large, by the Central Limit Theorem. Therefore, our 90 percent confidence interval
is still approximately valid. A sample of size n = 30 is pretty large. In other words,
at n = 30, the CLT approximation above is usually "kicking in" rather well unless the
underlying population distribution is grossly skewed (and I mean very grossly). This is
not the case here.
4.5.2 Unknown population variance σ²
SETTING: We will continue to assume that Y_1, Y_2, ..., Y_n is a random sample from a
N(μ, σ²) population distribution.

Our goal is the same; namely, to write a 100(1 - α) percent confidence interval for
the population mean μ.

However, we will no longer make the (rather unrealistic) assumption that the popula-
tion variance σ² is known.
RECALL: If you look back in the notes at the "known σ²" case, you will see that to derive
a 100(1-α) percent confidence interval for μ, we started with the following distributional
result:

Z = (Ȳ - μ)/(σ/√n) ~ N(0, 1).

This led us to the following confidence interval formula:

Ȳ ± z_{α/2} σ/√n.

The obvious problem is that, because σ² is now unknown, we can not calculate the
interval. Not to worry; we just need a different starting point. Recall that

t = (Ȳ - μ)/(S/√n) ~ t(n - 1),

where S is the sample standard deviation (a point estimator for the population standard
deviation σ). This result is all we need; in fact, it is straightforward to reproduce the
"known σ²" derivation and tailor it to this (now more realistic) case. A 100(1 - α)
percent confidence interval for μ is given by

Ȳ ± t_{n-1,α/2} (S/√n).
The symbol t_{n-1,α/2} denotes the upper α/2 quantile from a t distribution with ν = n-1
degrees of freedom. This value can be easily obtained from R. For those of you that like
probability tables, VK tables the t distributions in Table 2 (pp 595).
We see that the interval again has the same form:

point estimate (Ȳ) ± quantile (t_{n-1,α/2}) × standard error (S/√n).

We interpret the interval in the same way.

"We are 100(1 - α) percent confident that the population mean μ is in
this interval."
Example 4.7. Acute exposure to cadmium produces respiratory distress and kidney and
liver damage (and possibly death). For this reason, the level of airborne cadmium dust
and cadmium oxide fume in the air, denoted by Y (measured in milligrams of cadmium
per m³ of air), is closely monitored. A random sample of n = 35 measurements from a
large factory are given below:
0.044 0.030 0.052 0.044 0.046 0.020 0.066
0.052 0.049 0.030 0.040 0.045 0.039 0.039
0.039 0.057 0.050 0.056 0.061 0.042 0.055
0.037 0.062 0.062 0.070 0.061 0.061 0.058
0.053 0.060 0.047 0.051 0.054 0.042 0.051
Based on past experience, engineers assume a normal population distribution (for the
population of all cadmium measurements).

Question. Find a 99 percent confidence interval for μ, the mean level of airborne
cadmium.

Solution. We first use R to calculate the sample mean ȳ and the sample standard
deviation s:
> mean(cadmium) ## sample mean
[1] 0.04928571
> sd(cadmium) ## sample standard deviation
[1] 0.0110894
For a 99 percent confidence level; i.e., with α = 0.01, we use

t_{34,0.01/2} = t_{34,0.005} ≈ 2.728.

Note that Table 2 (VK, pp 595) is not helpful here; the ν = 34 degrees of freedom
quantiles are not listed. From R, we get:
> qt(0.995,34) ## t_{34,0.005} ## upper 0.005 quantile
[1] 2.728394
With n = 35, ȳ ≈ 0.049, and s ≈ 0.011, a 99 percent confidence interval for the population
mean level of airborne cadmium is

ȳ ± t_{n-1,α/2} (s/√n) = 0.049 ± 2.728 (0.011/√35) = (0.044, 0.054) mg/m³.

Interpretation: We are 99 percent confident that the population mean level of airborne
cadmium is between 0.044 and 0.054 mg/m³.
NOTE: It is possible to implement the t interval procedure entirely in R:
> # Calculate t interval directly
> t.test(cadmium,conf.level=0.99)$conf.int
[1] 0.04417147 0.05439996
Further analysis: Recall that the engineers claimed that the population of all airborne
cadmium concentrations is described by a normal (Gaussian) distribution. In Figure
4.9, we display the normal qq plot to check this assumption. The qq plot looks pretty
supportive of the normal assumption, although there is mild evidence of a slight departure
in the upper tail.
ROBUSTNESS: The t confidence interval is based on the population distribution being
normal (Gaussian). However, this interval is robust to departures from normality; i.e.,
even if the population distribution is non-normal (non-Gaussian), we can still use the t
confidence interval and get approximately valid results.
Figure 4.9: Normal qq plot for the cadmium data in Example 4.7. The observed data
are plotted versus the theoretical quantiles from a normal distribution. The line added
passes through the first and third theoretical quartiles.
4.5.3 Sample size determination
MOTIVATION: In the planning stages of an experiment or investigation, it is often of
interest to determine how many individuals are needed to write a confidence interval
with a given level of precision. For example, we might want to construct a 95 percent
confidence interval for a population mean μ so that the interval length is no more than
5 units (e.g., days, inches, dollars, etc.). Of course, collecting data almost always costs
money! Therefore, one must be cognizant not only of the statistical issues associated
with sample size determination, but also of the practical issues like cost, time spent
in data collection, personnel training, etc.
SETTING: Suppose that Y_1, Y_2, ..., Y_n is a random sample from a N(μ, σ²) population,
where σ² is known. In this (known σ²) situation, recall that a 100(1 - α) percent con-
fidence interval for μ is given by

Ȳ ± z_{α/2} (σ/√n),

where B = z_{α/2}(σ/√n). The quantity B is called the margin of error.
FORMULA: In the setting described above, it is possible to determine the sample size n
necessary once we specify these three pieces of information:

the value of σ² (or an educated guess at its value; e.g., from past information, etc.)
the confidence level, 100(1 - α)
the margin of error, B.
This is true because

B = z_{α/2} (σ/√n)  ⟺  n = ( z_{α/2} σ / B )².
Example 4.8. In a biomedical experiment, we would like to estimate the population
mean remaining life μ of healthy rats that are given a certain dose of a toxic substance.
Suppose that we would like to write a 95 percent confidence interval for μ with a margin
of error equal to B = 2 days. From past studies, remaining rat lifetimes have been
approximated by a normal distribution with standard deviation σ = 8 days. How many
rats should we use for the experiment?
Solution. With z_{0.05/2} = z_{0.025} ≈ 1.96, B = 2, and σ = 8, the desired sample size to
estimate μ is

n = ( z_{α/2} σ / B )² = ( 1.96 × 8 / 2 )² ≈ 61.46.

We would sample n = 62 rats to achieve these goals.
> qnorm(0.975,0,1)
[1] 1.959964
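The whole calculation (including the round up) can be done in one R line:

> ceiling((qnorm(0.975)*8/2)^2)
[1] 62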
4.6 Confidence interval for a population proportion p
SITUATION: We now switch gears and focus on a new parameter: the population
proportion p. This parameter emerges when the characteristic we measure on each
individual is binary (i.e., only 2 outcomes possible). Here are some examples:
p = proportion of defective circuit boards
p = proportion of customers who are satisfied
p = proportion of payments received on time
p = proportion of HIV positives in SC.
To start our discussion, we need to recall the Bernoulli trial assumptions for each
individual in the sample:
1. each individual results in a success or a failure,
2. the individuals are independent, and
3. the probability of success, denoted by p, 0 < p < 1, is the same for every
individual.
In our examples above,

"success" ⟺ circuit board defective
"success" ⟺ customer satisfied
"success" ⟺ payment received on time
"success" ⟺ HIV positive individual.
RECALL: If the individual success/failure statuses in the sample adhere to the Bernoulli
trial assumptions, then

Y = the number of successes out of n sampled individuals

follows a binomial distribution, that is, Y ~ b(n, p). The statistical problem at hand is
to use the information in Y to estimate p.
POINT ESTIMATOR: A natural point estimator for p, the population proportion, is

p̂ = Y/n,

the sample proportion. This statistic is simply the proportion of "successes" in the
sample (out of n individuals).
PROPERTIES: Fairly simple arguments can be used to show the following results:

E(p̂) = p
se(p̂) = √[ p(1 - p)/n ].

The first result says that the sample proportion p̂ is an unbiased estimator of the
population proportion p. The second (standard error) result quantifies the precision of p̂
as an estimator of p.
SAMPLING DISTRIBUTION: Knowing the sampling distribution of p̂ is critical if we
are going to formalize statistical inference procedures for p. In this situation, we appeal
to an approximate result (conferred by the CLT) which says that

p̂ ~ AN( p, p(1 - p)/n ),

when the sample size n is large.
RESULT: An approximate 100(1 - α) percent confidence interval for p is given by

p̂ ± z_{α/2} √[ p̂(1 - p̂)/n ].

This interval should be used only when the sample size n is large. A common
rule of thumb (to use this interval formula) is to require

n p̂ ≥ 5
n(1 - p̂) ≥ 5.

Under these conditions, the CLT should adequately approximate the true sampling
distribution of p̂, thereby making the confidence interval formula above approxi-
mately valid.
Note again the form of the interval:

point estimate (p̂) ± quantile (z_{α/2}) × standard error (√[ p̂(1 - p̂)/n ]).

We interpret the interval in the same way.

"We are 100(1 - α) percent confident that the population proportion p
is in this interval."

The value z_{α/2} is the upper α/2 quantile from the N(0, 1) distribution.
Example 4.9. One source of water pollution is gasoline leakage from underground
storage tanks. In Pennsylvania, a random sample of n = 74 gasoline stations is selected
and the tanks are inspected; 10 stations are found to have at least one leaking tank.
Question. Calculate a 95 percent confidence interval for p, the population proportion
of gasoline stations with at least one leaking tank.
Solution. In this situation, we interpret

gasoline station = "individual trial"
at least one leaking tank = "success"
p = population proportion of stations with at least one leaking tank.

For 95 percent confidence, we need z_{0.05/2} = z_{0.025} ≈ 1.96. The sample proportion of
stations with at least one leaking tank is

p̂ = 10/74 ≈ 0.135.
Therefore, an approximate 95 percent confidence interval for p is

0.135 ± 1.96 √[ 0.135(1 - 0.135)/74 ] = (0.057, 0.213).

Interpretation: We are 95 percent confident that the population proportion of stations
in Pennsylvania with at least one leaking tank is between 0.057 and 0.213.
CLT approximation check: We have

n p̂ = 74 (10/74) = 10
n(1 - p̂) = 74 (1 - 10/74) = 64.

Both of these are larger than 5 ⟹ we can feel comfortable in using this interval formula.
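A sketch of the interval calculation in R (computed directly, because prop.test uses a
different interval formula than the one above):

> phat <- 10/74
> phat+c(-1,1)*qnorm(0.975)*sqrt(phat*(1-phat)/74)   ## (0.057, 0.213)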
QUESTION: Suppose that we would like to write a 100(1-α) percent confidence interval
for p, a population proportion. We know that

p̂ ± z_{α/2} √[ p̂(1 - p̂)/n ]

is an approximate 100(1-α) percent confidence interval for p. What sample size n should
we use?
SAMPLE SIZE DETERMINATION: To determine the necessary sample size, we first
need to specify two pieces of information:

the confidence level 100(1 - α)
the margin of error:

B = z_{α/2} √[ p̂(1 - p̂)/n ].

A small problem arises. Note that B depends on p̂. Unfortunately, p̂ can only be
calculated once we know the sample size n. We overcome this problem by replacing p̂
with p_0, an a priori guess at its value. The last expression becomes

B = z_{α/2} √[ p_0(1 - p_0)/n ].
Solving this equation for n, we get

n = ( z_{α/2} / B )² p_0(1 - p_0).

This is the desired sample size n to find a 100(1 - α) percent confidence interval for p
with a prescribed margin of error (roughly) equal to B. I say "roughly," because there
may be additional uncertainty arising from our use of p_0 (our best guess).
CONSERVATIVE APPROACH: If there is no sensible guess for p available, use p_0 = 0.5.
In this situation, the resulting value for n will be as large as possible. Put another way,
using p_0 = 0.5 gives the most conservative solution (i.e., the largest sample size, n).
This is true because

n = n(p_0) = ( z_{α/2} / B )² p_0(1 - p_0),

when viewed as a function of p_0, is maximized when p_0 = 0.5.
Example 4.10. You have been asked to estimate the proportion of raw material (in a
certain manufacturing process) that is being scrapped; e.g., the material is so defective
that it can not be reworked. If this proportion is larger than 10 percent, this will be
deemed (by management) to be an unacceptable continued operating cost and a sub-
stantial process overhaul will be performed. Past experience suggests that the scrap rate
is about 5 percent, but recent information suggests that this rate may be increasing.
Question. You would like to write a 95 percent confidence interval for p, the popula-
tion proportion of raw material that is to be scrapped, with a margin of error equal to
B = 0.02. How many pieces of material should you ask to be sampled?
Solution. For 95 percent confidence, we need z_{0.05/2} = z_{0.025} ≈ 1.96. In providing an
initial guess, we have options; we could use

p_0 = 0.05 (historical scrap rate)
p_0 = 0.10 (critical mass value)
p_0 = 0.50 (most conservative choice).
For these choices, we have

n = ( 1.96 / 0.02 )² 0.05(1 - 0.05) ≈ 457
n = ( 1.96 / 0.02 )² 0.10(1 - 0.10) ≈ 865
n = ( 1.96 / 0.02 )² 0.50(1 - 0.50) ≈ 2401.
As we can see, the guessed value of p_0 has a substantial impact on the final sample
size calculation.
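All three sample sizes can be computed at once in R (a sketch; ceiling rounds each
result up to the next whole unit):

> p0 <- c(0.05,0.10,0.50)
> ceiling((qnorm(0.975)/0.02)^2*p0*(1-p0))
[1]  457  865 2401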
4.7 Confidence interval for a population variance σ²
MOTIVATION: In many situations, one is concerned not with the mean of an underlying
(continuous) population distribution, but with the variance σ² instead. If σ² is excessively
large, this could point to a potential problem with a manufacturing process, for example,
where there is too much variation in the measurements produced. In a laboratory setting,
chemical engineers might wish to estimate the variance σ² attached to a measurement
system (e.g., scale, caliper, etc.). In field trials, agronomists are often interested in
comparing the variability levels for different cultivars or genetically-altered varieties. In
clinical trials, physicians are often concerned if there are substantial differences in the
variation levels of patient responses at different clinic sites.
NEW RESULT: Suppose that $Y_1, Y_2, ..., Y_n$ is a random sample from a $N(\mu, \sigma^2)$ distri-
bution. The quantity
\[
Q = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1),
\]
a $\chi^2$ distribution with $\nu = n-1$ degrees of freedom.
FACTS: The $\chi^2$ distribution has the following characteristics:
- It is continuous, skewed to the right, and always positive.
- It is indexed by a value $\nu$ called the degrees of freedom. In practice, $\nu$ is often
an integer (related to the sample size).
- The $\chi^2$ pdf formula is unnecessary for our purposes. R will compute $\chi^2$ probabilities
and quantiles from the $\chi^2(\nu)$ distribution.
$\chi^2$ R CODE: Suppose that $Q \sim \chi^2(\nu)$. R computes the cdf $F_Q(q) = P(Q \leq q)$ with
pchisq(q,nu) and the quantile $\phi_p$ satisfying $P(Q \leq \phi_p) = p$ with qchisq(p,nu).
NOTE: Table 3 (VK, pp 596-597) catalogues some quantiles from some $\chi^2$ distributions.
[Figure 4.10: $\chi^2$ probability density functions with different degrees of freedom
(df = 5, 10, 20).]
GOAL: Suppose that $Y_1, Y_2, ..., Y_n$ is a random sample from a $N(\mu, \sigma^2)$ distribution. We
would like to write a 100(1-$\alpha$) percent confidence interval for $\sigma^2$.
NOTATION: Let $\chi^2_{n-1,\alpha/2}$ denote the lower $\alpha/2$ quantile and let $\chi^2_{n-1,1-\alpha/2}$ denote the
upper $\alpha/2$ quantile of the $\chi^2(n-1)$ distribution; i.e., $\chi^2_{n-1,\alpha/2}$ and $\chi^2_{n-1,1-\alpha/2}$ satisfy
\[
P(Q < \chi^2_{n-1,\alpha/2}) = \alpha/2 \qquad P(Q > \chi^2_{n-1,1-\alpha/2}) = \alpha/2,
\]
respectively. Note that, unlike the $N(0,1)$ and $t$ distributions, the $\chi^2$ distribution is
not symmetric. Therefore, different notation is needed to identify the quantiles of $\chi^2$
distributions (this is nothing to get worried about).
DERIVATION: Because $Q \sim \chi^2(n-1)$, we write
\begin{align*}
1 - \alpha &= P(\chi^2_{n-1,\alpha/2} < Q < \chi^2_{n-1,1-\alpha/2}) \\
&= P\left(\chi^2_{n-1,\alpha/2} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{n-1,1-\alpha/2}\right) \\
&= P\left(\frac{1}{\chi^2_{n-1,\alpha/2}} > \frac{\sigma^2}{(n-1)S^2} > \frac{1}{\chi^2_{n-1,1-\alpha/2}}\right) \\
&= P\left(\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} > \sigma^2 > \frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}\right) \\
&= P\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right).
\end{align*}
This argument shows that
\[
\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}, \ \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right)
\]
is a 100(1-$\alpha$) percent confidence interval for the population variance $\sigma^2$. We
interpret the interval in the same way.

    We are 100(1-$\alpha$) percent confident that the population variance $\sigma^2$ is in
    this interval.
IMPORTANT: A 100(1-$\alpha$) percent confidence interval for the population standard
deviation $\sigma$ arises from simply taking the square root of the endpoints of the $\sigma^2$ interval.
That is,
\[
\left(\sqrt{\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}}, \ \sqrt{\frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}}\right)
\]
is a 100(1-$\alpha$) percent confidence interval for the population standard deviation $\sigma$. In
practice, this interval may be preferred over the $\sigma^2$ interval, because standard deviation
is a measure of variability in terms of the original units (e.g., dollars, inches, days, etc.).
The variance is measured in squared units (e.g., dollars$^2$, in$^2$, days$^2$, etc.).
Example 4.11. Indoor swimming pools are noted for their poor acoustical properties.
Suppose your goal is to design a pool in such a way that
- the population mean time that it takes for a low-frequency sound to die out is
$\mu = 1.3$ seconds
- the population standard deviation for the distribution of die-out times is $\sigma = 0.6$
seconds.
Computer simulations of a preliminary design are conducted to see whether these stan-
dards are being met; here are data from $n = 20$ independently-run simulations. The data
are obtained on the time (in seconds) it takes for the low-frequency sound to die out.
1.34 2.56 1.28 2.25 1.84 2.35 0.77 1.84 1.80 2.44
0.86 1.29 0.12 1.87 0.71 2.08 0.71 0.30 0.54 1.48
Question. Find a 95 percent confidence interval for the population standard deviation of
times $\sigma$. What does this interval suggest about whether the preliminary design conforms
to specifications (with respect to variability)?
Solution. For 95 percent confidence (i.e., using $\alpha = 0.05$), we need to use
\[
\chi^2_{n-1,\alpha/2} = \chi^2_{19,0.025} \approx 8.907 \qquad \chi^2_{n-1,1-\alpha/2} = \chi^2_{19,0.975} \approx 32.852.
\]
I got these quantiles from R:
> qchisq(0.025,19) ## chi^2_{19,0.025}
[1] 8.906516
> qchisq(0.975,19) ## chi^2_{19,0.975}
[1] 32.85233
We also need to calculate the sample variance $s^2$. From R, we get $s^2 \approx 0.555$.
> var(sounds) ## sample variance
[1] 0.554666
[Figure 4.11: Normal qq plot for the swimming pool sound time data in Example 4.11.
The observed data are plotted versus the theoretical quantiles from a normal distribution.
The line added passes through the first and third theoretical quartiles.]
A 95 percent confidence interval for $\sigma^2$ is
\[
\left(\frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}, \ \frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}}\right)
= \left(\frac{19(0.555)}{32.852}, \ \frac{19(0.555)}{8.907}\right) = (0.321, 1.184).
\]
Interpretation: We are 95 percent confident that the population variance $\sigma^2$ is between
0.321 and 1.184 sec$^2$.
A 95 percent confidence interval for $\sigma$ (which is what we originally wanted) is
\[
(\sqrt{0.321}, \sqrt{1.184}) = (0.567, 1.088).
\]
Interpretation: We are 95 percent confident that the population standard deviation $\sigma$
is between 0.567 and 1.088 sec. From this interval, it appears as though the pool design
specifications are being met (with respect to the population standard deviation level).
Note that $\sigma = 0.6$ is contained in the confidence interval for $\sigma$.
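For completeness, here is our own R sketch of the entire calculation (sounds is the data vector used above; the name ci.var is ours). Small differences from the hand calculation reflect rounding of $s^2$:

> ci.var = 19*var(sounds)/qchisq(c(0.975,0.025),19)  ## CI for sigma^2; approx (0.321, 1.183)
> sqrt(ci.var)                                       ## CI for sigma;   approx (0.566, 1.088)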
Major warning: Unlike the $z$ and $t$ confidence intervals for a population mean $\mu$,
the $\chi^2$ interval for $\sigma^2$ (and for $\sigma$) is not robust to departures from normality. If the
underlying population distribution is non-normal (non-Gaussian), then the confidence
interval formulas for $\sigma^2$ and $\sigma$ are not to be used. Therefore, it is very important to
check the normality assumption with these interval procedures (e.g., use a qq-plot).
Analysis: With only $n = 20$ measurements, it is somewhat hard to tell, but the qq-plot
in Figure 4.11 looks fairly straight. Small sample sizes make interpreting qq-plots more
difficult (e.g., the analyst may look for patterns that are not really there).
4.8 Confidence intervals for the difference of two population
means $\mu_1 - \mu_2$: Independent samples
REMARK: In practice, it is very common to compare the same characteristic (mean,
proportion, variance) from two different distributions. For example, we may wish to
compare
- the mean starting salaries of male and female engineers (compare $\mu_1$ and $\mu_2$)
- the proportion of scrap produced from two manufacturing processes (compare $p_1$
and $p_2$)
- the variance of sound levels from two indoor swimming pool designs (compare $\sigma^2_1$
and $\sigma^2_2$).
Our previous work is applicable only for a single distribution (i.e., a single mean $\mu$, a single
proportion $p$, and a single variance $\sigma^2$). We therefore need to extend these procedures to
handle two distributions. We start with comparing two means.
TWO-SAMPLE PROBLEM: Suppose that we have two independent samples:
\begin{align*}
\text{Sample 1}: &\ Y_{11}, Y_{12}, ..., Y_{1n_1} \sim N(\mu_1, \sigma^2_1) \text{ random sample} \\
\text{Sample 2}: &\ Y_{21}, Y_{22}, ..., Y_{2n_2} \sim N(\mu_2, \sigma^2_2) \text{ random sample.}
\end{align*}
GOAL: Construct a 100(1-$\alpha$) percent confidence interval for the difference of population
means $\mu_1 - \mu_2$.
POINT ESTIMATORS: We define the statistics
\begin{align*}
\overline{Y}_{1+} &= \frac{1}{n_1}\sum_{j=1}^{n_1} Y_{1j} = \text{sample mean for sample 1} \\
\overline{Y}_{2+} &= \frac{1}{n_2}\sum_{j=1}^{n_2} Y_{2j} = \text{sample mean for sample 2} \\
S^2_1 &= \frac{1}{n_1-1}\sum_{j=1}^{n_1}(Y_{1j} - \overline{Y}_{1+})^2 = \text{sample variance for sample 1} \\
S^2_2 &= \frac{1}{n_2-1}\sum_{j=1}^{n_2}(Y_{2j} - \overline{Y}_{2+})^2 = \text{sample variance for sample 2.}
\end{align*}
4.8.1 Equal variance case: $\sigma^2_1 = \sigma^2_2$
GOAL: We want to write a confidence interval for $\mu_1 - \mu_2$, but how this interval is
constructed depends on the values of $\sigma^2_1$ and $\sigma^2_2$. In particular, we consider two cases:
- $\sigma^2_1 = \sigma^2_2$; that is, the two population variances are equal
- $\sigma^2_1 \neq \sigma^2_2$; that is, the two population variances are not equal.
We first consider the equal variance case. Addressing this case requires us to start with
the following (sampling) distribution result:
\[
T = \frac{(\overline{Y}_{1+} - \overline{Y}_{2+}) - (\mu_1 - \mu_2)}{\sqrt{S^2_p\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim t(n_1 + n_2 - 2),
\]
where
\[
S^2_p = \frac{(n_1-1)S^2_1 + (n_2-1)S^2_2}{n_1 + n_2 - 2}.
\]
[Figure 4.12: Two normal distributions (Distribution 1 and Distribution 2) with
$\sigma^2_1 = \sigma^2_2$.]
Some comments are in order:
- For this sampling distribution to hold (exactly), we need
  - the two samples to be independent
  - the two population distributions to be normal (Gaussian)
  - the two population distributions to have the same variance; i.e., $\sigma^2_1 = \sigma^2_2$.
- The statistic $S^2_p$ is called the pooled sample variance estimator of the common
population variance, say, $\sigma^2$. Algebraically, it is simply a weighted average of the
two sample variances $S^2_1$ and $S^2_2$ (where the weights are functions of the sample
sizes $n_1$ and $n_2$).
- The sampling distribution $T \sim t(n_1 + n_2 - 2)$ should suggest to you that confidence
interval quantiles will come from this $t$ distribution; note that this distribution
depends on the sample sizes from both samples.
- In particular, because $T \sim t(n_1 + n_2 - 2)$, we can find the value $t_{n_1+n_2-2,\alpha/2}$ that
satisfies
\[
P(-t_{n_1+n_2-2,\alpha/2} < T < t_{n_1+n_2-2,\alpha/2}) = 1 - \alpha.
\]
Substituting $T$ into the last expression and performing algebraic manipulations, we
obtain
\[
(\overline{Y}_{1+} - \overline{Y}_{2+}) \pm t_{n_1+n_2-2,\alpha/2}\sqrt{S^2_p\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.
\]
This is a 100(1-$\alpha$) percent confidence interval for the mean difference $\mu_1 - \mu_2$.
We see that the interval again has the same form:
\[
\underbrace{\overline{Y}_{1+} - \overline{Y}_{2+}}_{\text{point estimate}} \ \pm \ \underbrace{t_{n_1+n_2-2,\alpha/2}}_{\text{quantile}} \ \underbrace{\sqrt{S^2_p\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}_{\text{standard error}}.
\]
We interpret the interval in the same way.

    We are 100(1-$\alpha$) percent confident that the population mean difference
    $\mu_1 - \mu_2$ is in this interval.
Important: In two-sample situations, it is often of interest to see if the means $\mu_1$ and
$\mu_2$ are different.
- If the confidence interval for $\mu_1 - \mu_2$ includes 0, this does not suggest that the
means $\mu_1$ and $\mu_2$ are different.
- If the confidence interval for $\mu_1 - \mu_2$ does not include 0, it does.
Example 4.12. In the vicinity of a nuclear power plant, environmental engineers from
the EPA would like to determine if there is a difference between the mean weight in
fish (of the same species) from two locations. Independent samples are taken from each
location and the following weights (in ounces) are observed:
[Figure 4.13: Boxplots of fish weights (in ounces) by location in Example 4.12.]
Location 1: 21.9 18.5 12.3 16.7 21.0 15.1 18.2 23.0 36.8 26.6
Location 2: 22.0 20.6 15.4 17.9 24.4 15.6 11.4 17.5
Question. Construct a 90 percent confidence interval for the mean difference $\mu_1 - \mu_2$.
Here, $\mu_1$ ($\mu_2$) denotes the population mean weight of all fish at location 1 (2).
Solution. In order to visually assess the equal variance assumption, we use boxplots
to display the data in each sample; see Figure 4.13. The equal variance assumption,
based on the figure, looks reasonable; note that the spread in each distribution looks
roughly the same (save the outlier in Location 1).
NOTE: Instead of resorting to hand calculation (as we have often done in previous
examples), we will use R to calculate the confidence interval directly:
> t.test(loc.1,loc.2,conf.level=0.90,var.equal=TRUE)$conf.int
[1] -1.940438 7.760438
Therefore, a 90 percent confidence interval for the mean difference $\mu_1 - \mu_2$ is
$(-1.940, 7.760)$ oz.
Interpretation: We are 90 percent confident that the mean difference $\mu_1 - \mu_2$ is between
$-1.940$ and $7.760$ oz. Note that this interval includes 0. Therefore, we do not have
sufficient evidence that the population mean fish weights $\mu_1$ and $\mu_2$ are different.
ROBUSTNESS: Some comments are in order about the robustness properties of the
two-sample confidence interval
\[
(\overline{Y}_{1+} - \overline{Y}_{2+}) \pm t_{n_1+n_2-2,\alpha/2}\sqrt{S^2_p\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
\]
for the mean difference $\mu_1 - \mu_2$.
- We should only use this interval if there is strong evidence that the population
variances $\sigma^2_1$ and $\sigma^2_2$ are equal (or at least close). Otherwise, we should use a
different interval (coming up).
- Like the one-sample $t$ interval for $\mu$, the two-sample $t$ interval (and the unequal
variance version coming up) is robust to normality departures. This means that we
can feel comfortable with the interval even if the underlying population distributions
are not perfectly normal (Gaussian).
4.8.2 Unequal variance case: $\sigma^2_1 \neq \sigma^2_2$
REMARK: When $\sigma^2_1 \neq \sigma^2_2$, the problem of constructing a 100(1-$\alpha$) percent confidence
interval for $\mu_1 - \mu_2$ becomes more difficult theoretically. However, we can still write an
approximate confidence interval.
FORMULA: An approximate 100(1-$\alpha$) percent confidence interval for $\mu_1 - \mu_2$ is given by
\[
(\overline{Y}_{1+} - \overline{Y}_{2+}) \pm t_{\nu,\alpha/2}\sqrt{\frac{S^2_1}{n_1} + \frac{S^2_2}{n_2}},
\]
where the degrees of freedom $\nu$ is calculated as
\[
\nu = \frac{\left(\dfrac{S^2_1}{n_1} + \dfrac{S^2_2}{n_2}\right)^2}{\dfrac{\left(\dfrac{S^2_1}{n_1}\right)^2}{n_1-1} + \dfrac{\left(\dfrac{S^2_2}{n_2}\right)^2}{n_2-1}}.
\]
This interval is always approximately valid, as long as
- the two samples are independent
- the two population distributions are approximately normal (Gaussian).
No one in their right mind would calculate this interval by hand (particularly
nasty is the formula for $\nu$). R will produce the interval on request.
Example 4.13. You are part of a recycling project that is examining how much paper is
being discarded (not recycled) by employees at two large plants. These data are obtained
on the amount of white paper thrown out per year by employees (data are in hundreds
of pounds). Samples of employees at each plant were randomly selected.
Plant 1: 3.01 2.58 3.04 1.75 2.87 2.57 2.51 2.93 2.85 3.09
1.43 3.36 3.18 2.74 2.25 1.95 3.68 2.29 1.86 2.63
2.83 2.04 2.23 1.92 3.02
Plant 2: 3.79 2.08 3.66 1.53 4.07 4.31 2.62 4.52 3.80 5.30
3.41 0.82 3.03 1.95 6.45 1.86 1.87 3.78 2.74 3.81
Question. Are there differences in the mean amounts of white paper discarded by
employees at the two plants? Answer this question by finding a 95 percent confidence
interval for the mean difference $\mu_1 - \mu_2$. Here, $\mu_1$ ($\mu_2$) denotes the population mean
amount of white paper discarded per employee at Plant 1 (2).
Solution. In order to visually assess the equal variance assumption, we again use
[Figure 4.14: Boxplots of discarded white paper amounts (in 100s lb) by plant in
Example 4.13.]
boxplots to display the data in each sample; see Figure 4.14. For these data, the equal
variance assumption would be highly suspect; the spread in the distribution of Plant 2
values is much larger than that of Plant 1. We again use R to calculate the (unequal
variance) confidence interval for $\mu_1 - \mu_2$:
> t.test(plant.1,plant.2,conf.level=0.95,var.equal=FALSE)$conf.int
[1] -1.35825799 -0.01294201
Therefore, a 95 percent confidence interval for the mean difference $\mu_1 - \mu_2$ is
$(-1.358, -0.013)$, in hundreds of pounds.
Interpretation: We are 95 percent confident that the mean difference $\mu_1 - \mu_2$ is between
$-135.8$ and $-1.3$ lb. This interval does not include 0. Therefore, we have evidence that
the population mean weights ($\mu_1$ and $\mu_2$) at the two plants are different.
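Although no one should do this routinely, it is instructive to verify $\nu$ and the interval once. A sketch of our own (v1, v2, and nu are our names):

> v1 = var(plant.1)/length(plant.1)    ## S_1^2/n_1
> v2 = var(plant.2)/length(plant.2)    ## S_2^2/n_2
> nu = (v1+v2)^2/(v1^2/24 + v2^2/19)   ## Welch degrees of freedom; approx 23.97
> mean(plant.1)-mean(plant.2) + c(-1,1)*qt(0.975,nu)*sqrt(v1+v2)
[1] -1.35825799 -0.01294201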
REMARK: In this subsection, we have presented two confidence intervals for $\mu_1 - \mu_2$. One
assumes $\sigma^2_1 = \sigma^2_2$ (equal variance assumption) and one assumes $\sigma^2_1 \neq \sigma^2_2$ (unequal
variance assumption). If you are unsure about which interval to use, go with the
unequal variance interval. The penalty for using it when $\sigma^2_1 = \sigma^2_2$ is much smaller
than the penalty for using the equal variance interval when $\sigma^2_1 \neq \sigma^2_2$.
4.9 Confidence interval for the difference of two population pro-
portions $p_1 - p_2$: Independent samples
NOTE: We also can extend our confidence interval procedure for a single population
proportion $p$ to two populations. Define
\begin{align*}
p_1 &= \text{population proportion of successes in Population 1} \\
p_2 &= \text{population proportion of successes in Population 2.}
\end{align*}
For example, we might want to compare the proportion of
- defective circuit boards for two different suppliers
- satisfied customers before and after a product design change (e.g., Facebook, etc.)
- on-time payments for two classes of customers
- HIV positives for individuals in two demographic classes.
POINT ESTIMATORS: We assume that there are two independent random samples of
individuals (one sample from each population to be compared). Define
\begin{align*}
Y_1 &= \text{number of successes in Sample 1 (out of $n_1$ individuals)} \sim b(n_1, p_1) \\
Y_2 &= \text{number of successes in Sample 2 (out of $n_2$ individuals)} \sim b(n_2, p_2).
\end{align*}
The point estimators for $p_1$ and $p_2$ are the sample proportions, defined by
\[
\hat{p}_1 = \frac{Y_1}{n_1} \qquad \hat{p}_2 = \frac{Y_2}{n_2}.
\]
GOAL: We would like to write a 100(1-$\alpha$) percent confidence interval for $p_1 - p_2$, the
difference of two population proportions.
IMPORTANT: To accomplish this goal, we need the following distributional result.
When the sample sizes $n_1$ and $n_2$ are large,
\[
Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}} \sim AN(0, 1).
\]
If this sampling distribution holds approximately, then
\[
(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
\]
is an approximate 100(1-$\alpha$) percent confidence interval for $p_1 - p_2$.
- For the $Z$ sampling distribution to hold approximately (and therefore for the in-
terval above to be useful), we need
  - the two samples to be independent
  - the sample sizes $n_1$ and $n_2$ to be large; common rules of thumb are to require
\[
n_i\hat{p}_i \geq 5 \qquad n_i(1-\hat{p}_i) \geq 5,
\]
for each sample $i = 1, 2$. Under these conditions, the CLT should adequately
approximate the true sampling distribution of $Z$, thereby making the confi-
dence interval formula above approximately valid.
- Note again the form of the interval:
\[
\underbrace{\hat{p}_1 - \hat{p}_2}_{\text{point estimate}} \ \pm \ \underbrace{z_{\alpha/2}}_{\text{quantile}} \ \underbrace{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}_{\text{standard error}}.
\]
- We interpret the interval in the same way.
    We are 100(1-$\alpha$) percent confident that the population proportion
    difference $p_1 - p_2$ is in this interval.

- The value $z_{\alpha/2}$ is the upper $\alpha/2$ quantile from the $N(0, 1)$ distribution.
Important: In two-sample situations, it is often of interest to see if the proportions
$p_1$ and $p_2$ are different.
- If the confidence interval for $p_1 - p_2$ includes 0, this does not suggest that the
proportions $p_1$ and $p_2$ are different.
- If the confidence interval for $p_1 - p_2$ does not include 0, it does.
Example 4.14. A programmable lighting control system is being designed. The pur-
pose of the system is to reduce electricity consumption costs in buildings. The system
eventually will entail the use of a large number of transceivers (a device comprised of
both a transmitter and a receiver). Two types of transceivers are being considered. In
life testing, 200 transceivers (randomly selected) were tested for each type.
- Transceiver 1: 20 failures were observed (out of 200)
- Transceiver 2: 14 failures were observed (out of 200).
Question. Define $p_1$ ($p_2$) to be the population proportion of Transceiver 1 (Transceiver
2) failures. Write a 95 percent confidence interval for $p_1 - p_2$. Is there a significant
difference between the failure rates $p_1$ and $p_2$?
Solution. For 95 percent confidence, we need $z_{0.05/2} = z_{0.025} \approx 1.96$. The sample
proportions of defective transceivers are
\[
\hat{p}_1 = \frac{20}{200} = 0.10 \qquad \hat{p}_2 = \frac{14}{200} = 0.07.
\]
Therefore, an approximate 95 percent confidence interval for $p_1 - p_2$ is
\[
(0.10 - 0.07) \pm 1.96\sqrt{\frac{0.10(1-0.10)}{200} + \frac{0.07(1-0.07)}{200}} = (-0.025, 0.085).
\]
Interpretation: We are 95 percent confident that the difference of the population failure
rates for the two transceivers is between $-0.025$ and $0.085$. Because this interval includes
0, we do not have sufficient evidence that the two failure rates $p_1$ and $p_2$ are different.
CLT approximation check: We have
\begin{align*}
n_1\hat{p}_1 &= 200\left(\frac{20}{200}\right) = 20 & n_2\hat{p}_2 &= 200\left(\frac{14}{200}\right) = 14 \\
n_1(1-\hat{p}_1) &= 200\left(1-\frac{20}{200}\right) = 180 & n_2(1-\hat{p}_2) &= 200\left(1-\frac{14}{200}\right) = 186.
\end{align*}
All of these quantities are larger than 5 $\Longrightarrow$ we can feel comfortable in using this interval
formula.
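Here is a short sketch of how one might script this interval in R (the object names p1.hat, p2.hat, and se are ours):

> p1.hat = 20/200; p2.hat = 14/200
> se = sqrt(p1.hat*(1-p1.hat)/200 + p2.hat*(1-p2.hat)/200)      ## standard error
> (p1.hat-p2.hat) + c(-1,1)*qnorm(0.975)*se                     ## approx (-0.025, 0.085)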
4.10 Confidence interval for the ratio of two population vari-
ances $\sigma^2_2/\sigma^2_1$: Independent samples
IMPORTANCE: You will recall that when we wrote a confidence interval for $\mu_1 - \mu_2$,
the difference of the population means (with independent samples), we proposed two
different intervals:
- one interval that assumed $\sigma^2_1 = \sigma^2_2$
- one interval that assumed $\sigma^2_1 \neq \sigma^2_2$.
We now propose a confidence interval procedure that can be used to determine which
assumption is more appropriate. This confidence interval is used to compare the popu-
lation variances in two independent samples.
TWO-SAMPLE PROBLEM: Suppose that we have two independent samples:
\begin{align*}
\text{Sample 1}: &\ Y_{11}, Y_{12}, ..., Y_{1n_1} \sim N(\mu_1, \sigma^2_1) \text{ random sample} \\
\text{Sample 2}: &\ Y_{21}, Y_{22}, ..., Y_{2n_2} \sim N(\mu_2, \sigma^2_2) \text{ random sample.}
\end{align*}
GOAL: Construct a 100(1-$\alpha$) percent confidence interval for the ratio of population
variances $\sigma^2_2/\sigma^2_1$.
IMPORTANT: To accomplish this, we need the following sampling distribution result:
\[
Q = \frac{S^2_1/\sigma^2_1}{S^2_2/\sigma^2_2} \sim F(n_1 - 1, n_2 - 1),
\]
an $F$ distribution with (numerator) $\nu_1 = n_1 - 1$ and (denominator) $\nu_2 = n_2 - 1$ degrees
of freedom.
FACTS: The $F$ distribution has the following characteristics:
- continuous, skewed right, and always positive
- indexed by two degree of freedom parameters $\nu_1$ and $\nu_2$; these are usually integers
and are often related to sample sizes
- The $F$ pdf formula is complicated and is unnecessary for our purposes. R will
compute $F$ probabilities and quantiles from the $F$ distribution.
F R CODE: Suppose that $Q \sim F(\nu_1, \nu_2)$. R computes the cdf $F_Q(q) = P(Q \leq q)$ with
pf(q,nu1,nu2) and the quantile corresponding to probability $p$ with qf(p,nu1,nu2).
NOTATION: Let $F_{n_1-1,n_2-1,\alpha/2}$ and $F_{n_1-1,n_2-1,1-\alpha/2}$ denote the lower and upper quantiles,
respectively, of the $F(n_1-1, n_2-1)$ distribution; i.e., these values satisfy
\[
P(Q < F_{n_1-1,n_2-1,\alpha/2}) = \alpha/2 \qquad P(Q > F_{n_1-1,n_2-1,1-\alpha/2}) = \alpha/2,
\]
respectively. Similar to the $\chi^2$ distribution, the $F$ distribution is not symmetric. There-
fore, different notation is needed to identify the quantiles of $F$ distributions.
DERIVATION: Because
\[
Q = \frac{S^2_1/\sigma^2_1}{S^2_2/\sigma^2_2} \sim F(n_1 - 1, n_2 - 1),
\]
[Figure 4.15: $F$ probability density functions with different degrees of freedom
(df = 5,5; 5,10; 10,10).]
we can write
\begin{align*}
1 - \alpha &= P\left(F_{n_1-1,n_2-1,\alpha/2} < Q < F_{n_1-1,n_2-1,1-\alpha/2}\right) \\
&= P\left(F_{n_1-1,n_2-1,\alpha/2} < \frac{S^2_1/\sigma^2_1}{S^2_2/\sigma^2_2} < F_{n_1-1,n_2-1,1-\alpha/2}\right) \\
&= P\left(\frac{S^2_2}{S^2_1}F_{n_1-1,n_2-1,\alpha/2} < \frac{\sigma^2_2}{\sigma^2_1} < \frac{S^2_2}{S^2_1}F_{n_1-1,n_2-1,1-\alpha/2}\right).
\end{align*}
This shows that
\[
\left(\frac{S^2_2}{S^2_1}F_{n_1-1,n_2-1,\alpha/2}, \ \frac{S^2_2}{S^2_1}F_{n_1-1,n_2-1,1-\alpha/2}\right)
\]
is a 100(1-$\alpha$) percent confidence interval for the ratio of the population variances
$\sigma^2_2/\sigma^2_1$. We interpret the interval in the same way.

    We are 100(1-$\alpha$) percent confident that the ratio $\sigma^2_2/\sigma^2_1$ is in this interval.
- If the confidence interval for $\sigma^2_2/\sigma^2_1$ includes 1, this does not suggest that the vari-
ances $\sigma^2_1$ and $\sigma^2_2$ are different.
- If the confidence interval for $\sigma^2_2/\sigma^2_1$ does not include 1, it does.
Example 4.15. We consider again the recycling project in Example 4.13 that examined
the amount of white paper discarded per employee at two large plants. The data (pre-
sented in Example 4.13) were obtained on the amount of white paper thrown out per
year by employees (data are in hundreds of pounds). Samples of employees at each plant
($n_1 = 25$ and $n_2 = 20$) were randomly selected. The boxplots in Figure 4.14 did suggest
that the population variances may be different.
Question. Find a 95 percent confidence interval for $\sigma^2_2/\sigma^2_1$, the ratio of the population
variances. Here, $\sigma^2_1$ ($\sigma^2_2$) denotes the population variance of the amount of white paper
discarded by employees at Plant 1 (Plant 2).
Solution. We use R to get the sample variances $s^2_1$ and $s^2_2$:
> var(plant.1) ## sample variance (Plant 1)
[1] 0.3071923
> var(plant.2) ## sample variance (Plant 2)
[1] 1.878411
That is, $s^2_1 \approx 0.307$ and $s^2_2 \approx 1.878$. We also use R to get the necessary $F$ quantiles:
> qf(0.025,24,19) ## F_{24,19,0.025} ## lower 0.025 quantile
[1] 0.4264113
> qf(0.975,24,19) ## F_{24,19,0.975} ## upper 0.025 quantile
[1] 2.452321
The lower bound of a 95 percent confidence interval for $\sigma^2_2/\sigma^2_1$ is
\[
\frac{s^2_2}{s^2_1}F_{24,19,0.025} = \frac{1.878}{0.307}\times 0.4264113 = 2.607.
\]
The upper bound of a 95 percent confidence interval for $\sigma^2_2/\sigma^2_1$ is
\[
\frac{s^2_2}{s^2_1}F_{24,19,0.975} = \frac{1.878}{0.307}\times 2.452321 = 14.995.
\]
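In R, both bounds can be computed at once; a one-line sketch of our own:

> var(plant.2)/var(plant.1)*qf(c(0.025,0.975),24,19)   ## approx (2.607, 14.995)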
[Figure 4.16: Normal quantile-quantile (qq) plots for the employee recycle data for the
two plants in Example 4.15.]
Interpretation: We are 95 percent confident that the ratio of the population variances
$\sigma^2_2/\sigma^2_1$ is between 2.607 and 14.995. This interval does not include 1. Therefore, we
have sufficient evidence that the population variances ($\sigma^2_1$ and $\sigma^2_2$) at the two plants are
different.
Discussion: This finding supports our use of the unequal-variance interval for $\mu_1 - \mu_2$
in Example 4.13. Some statisticians recommend using this equal/unequal variance test
before deciding which confidence interval to use for $\mu_1 - \mu_2$. Some statisticians (including
your authors) do not.
Major warning: Like the $\chi^2$ interval for a single population variance $\sigma^2$, the two-sample $F$
interval for the ratio of two variances is not robust to departures from normality. If the
underlying population distributions are non-normal (non-Gaussian), then this interval
should not be used.
Discussion: Figure 4.16 (above) displays the normal qq plots for the data from the two
plants. I am somewhat worried about the normal assumption for Plant 2.
4.11 Confidence intervals for the difference of two population
means $\mu_1 - \mu_2$: Dependent samples (Matched pairs)
Example 4.16. Creatine is an organic acid that helps to supply energy to cells in the
body, primarily muscle. Because of this, it is commonly used by those who are weight
training to gain muscle mass. Does it really work? Suppose that we are designing an
experiment involving USC male undergraduates who exercise/lift weights regularly.
Design 1 (Independent samples): Recruit 30 students who are representative of the
population of USC male undergraduates who exercise/lift weights. For a single weight
training session, we will
- assign 15 students to take creatine
- assign 15 students an innocuous substance that looks like creatine (but has no
positive/negative effect on performance).
For each student, we will record
\[
Y = \text{maximum bench press weight (MBPW).}
\]
We will then have two samples of data (with $n_1 = 15$ and $n_2 = 15$):
\begin{align*}
\text{Sample 1 (Creatine)}: &\ Y_{11}, Y_{12}, ..., Y_{1n_1} \\
\text{Sample 2 (Control)}: &\ Y_{21}, Y_{22}, ..., Y_{2n_2}.
\end{align*}
To compare the population means
\begin{align*}
\mu_1 &= \text{population mean MBPW for students taking creatine} \\
\mu_2 &= \text{population mean MBPW for students not taking creatine,}
\end{align*}
we could construct a two-sample $t$ confidence interval for $\mu_1 - \mu_2$ using
\[
(\overline{Y}_{1+} - \overline{Y}_{2+}) \pm t_{n_1+n_2-2,\alpha/2}\sqrt{S^2_p\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
\]
or
\[
(\overline{Y}_{1+} - \overline{Y}_{2+}) \pm t_{\nu,\alpha/2}\sqrt{\frac{S^2_1}{n_1} + \frac{S^2_2}{n_2}},
\]
depending on our underlying assumptions about $\sigma^2_1$ and $\sigma^2_2$.
Design 2 (Matched pairs): Recruit 15 students who are representative of the popu-
lation of USC male undergraduates who exercise/lift weights.
- Each student will be assigned first to take either creatine or the control substance.
- For each student, we will then record his value of $Y$ (MBPW).
- After a period of recovery (e.g., 1 week), we will then have each student take the
other treatment (creatine/control) and record his value of $Y$ again (but now on
the other treatment).
- In other words, for each individual student, we will measure $Y$ under both condi-
tions.
NOTE: In Design 2, because MBPW measurements are taken on the same student, the
difference between the two measurements (creatine/control) should be less variable than the
difference between a creatine measurement on one student and a control measurement
on a different student.
- In other words, the student-to-student variation inherent in the latter difference
is not present in the difference between MBPW measurements taken on the same
individual student.
MATCHED PAIRS: In general, by obtaining a pair of measurements on a single individ-
ual (e.g., student, raw material, machine, etc.), where one measurement corresponds to
Treatment 1 and the other measurement corresponds to Treatment 2, you eliminate
variation among the individuals. This allows you to compare the two experimental con-
ditions (e.g., creatine/control, biodegradability treatments, operators, etc.) under more
homogeneous conditions where only variation within individuals is present (that is, the
variation arising from the difference in the two experimental conditions).
Table 4.1: Creatine example. Sources of variation in the two independent sample and
matched pairs designs.

    Design                      Sources of Variation
    Two Independent Samples     among students, within students
    Matched Pairs               within students
ADVANTAGE: When you remove extra variability, this enables you to do a better job of
comparing the two experimental conditions (treatments). By "better job," I mean you
can more precisely estimate the difference between the treatments (excess variability
that naturally arises among individuals is not getting in the way). This gives you a better
chance of identifying a difference between the treatments if one really exists.
NOTE: In matched pairs experiments, it is important to randomize the order in which
treatments are assigned. This may eliminate common patterns that may be seen when
always following, say, Treatment 1 with Treatment 2. In practice, the experimenter could
flip a fair coin to determine which treatment is applied first.
IMPLEMENTATION: Data from matched pairs experiments are analyzed by examining
the difference in responses of the two treatments. Specifically, compute
\[
D_j = Y_{1j} - Y_{2j},
\]
for each individual $j = 1, 2, ..., n$. After doing this, we have essentially created a one-
sample problem, where our data are
\[
D_1, D_2, ..., D_n,
\]
the so-called data differences. The one-sample 100(1-$\alpha$) percent confidence interval
\[
\overline{D} \pm t_{n-1,\alpha/2}\frac{S_D}{\sqrt{n}},
\]
where $\overline{D}$ and $S_D$ are the sample mean and sample standard deviation of the differences,
respectively, is an interval estimate for
\[
\mu_D = \text{mean difference between the 2 treatments.}
\]
Table 4.2: Creatine data. Maximum bench press weight (in lbs) for creatine and control
treatments with 15 students. Note: These are not real data.

    Student j   Creatine MBPW   Control MBPW   Difference (D_j = Y_1j - Y_2j)
    1           230             200              30
    2           140             155             -15
    3           215             205              10
    4           190             190               0
    5           200             170              30
    6           230             225               5
    7           220             200              20
    8           255             260              -5
    9           220             240             -20
    10          200             195               5
    11           90             110             -20
    12          130             105              25
    13          255             230              25
    14           80              85              -5
    15          265             255              10
INTERPRETATION: The parameter $\mu_D$ describes the difference in means for the two
treatment groups. If there are no differences between the two treatments, $\mu_D = 0$.
- If the confidence interval for $\mu_D$ includes 0, this does not suggest that the two treatment
means are different.
- If the confidence interval for $\mu_D$ does not include 0, it does.
To analyze the creatine data, let's compute a 95 percent confidence interval for $\mu_D$:
> t.test(diff,conf.level=0.95)$conf.int
[1] -3.227946 15.894612
Interpretation: We are 95 percent confident that the mean difference in MBPW
is between $-3.2$ and $15.9$ lb. Because this interval includes 0, this does not suggest
that taking creatine leads to a larger mean MBPW.
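Our own sketch below transcribes the data from Table 4.2 and reproduces this interval from the one-sample formula (the vector names creatine and control are ours; diff matches the vector used in the t.test call above):

> creatine = c(230,140,215,190,200,230,220,255,220,200,90,130,255,80,265)
> control = c(200,155,205,190,170,225,200,260,240,195,110,105,230,85,255)
> diff = creatine-control                          ## D_j = Y_1j - Y_2j
> mean(diff) + c(-1,1)*qt(0.975,14)*sd(diff)/sqrt(15)
[1] -3.227946 15.894612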
4.12 One-way analysis of variance
REVIEW: So far in this chapter, we have discussed confidence intervals for a single
population mean $\mu$ and for the difference of two population means $\mu_1 - \mu_2$. When there
are two means, we have recently seen that the design of the experiment/study completely
determines how the data are to be analyzed.
- When the two samples are independent, this is called a (two) independent-
sample design.
- When the two samples are obtained on the same individuals (so that the samples
are dependent), this is called a matched pairs design.
Confidence interval procedures for $\mu_1 - \mu_2$ depend on the design of the study.
TERMINOLOGY: More generally, the purpose of an experiment is to investigate dif-
ferences between or among two or more treatments. In a statistical framework, we do
this by comparing the population means of the responses to each treatment.
- In order to detect treatment mean differences, we must try to control the effects
of error so that any variation we observe can be attributed to the effects of the
treatments rather than to differences among the individuals.
BLOCKING: Designs involving meaningful grouping of individuals, that is, blocking,
can help reduce the effects of experimental error by identifying systematic components
of variation among individuals.
- The matched pairs design for comparing two treatments is an example of such a
design. In this situation, individuals themselves are treated as blocks.
The analysis of data from experiments involving blocking will not be covered in this
course (see, e.g., STAT 506, STAT 525, and STAT 706). We focus herein on a simpler
setting, that is, a one-way classification model. This is an extension of the two
independent-sample design to more than two treatments.
ONE-WAY CLASSIFICATION: Consider an experiment to compare $t \geq 2$ treatments
set up as follows:
- We obtain a random sample of individuals and randomly assign them to treat-
ments. Samples corresponding to the treatment groups are independent (i.e., the
individuals in each treatment sample are unrelated).
- In observational studies (where no treatment is physically applied to individu-
als), individuals are inherently different to begin with. We therefore simply take
random samples from each treatment population.
- We do not attempt to group individuals according to some other factor (e.g., loca-
tion, gender, weight, variety, etc.). This would be an example of blocking.
MAIN POINT: In a one-way classification design, the only way in which individuals are
classified is by the treatment group assignment. Hence, such an arrangement is called
a one-way classification. When individuals are thought to be basically alike (other
than the possible effect of treatment), experimental error consists only of the variation
among the individuals themselves; that is, there are no other systematic sources of
variation.
Example 4.17. Four types of mortar: (1) ordinary cement mortar (OCM), (2) polymer
impregnated mortar (PIM), (3) resin mortar (RM), and (4) polymer cement mortar (PCM)
were subjected to a compression test to measure strength (MPa). Here are the strength
measurements taken on different mortar specimens (36 in all).
OCM: 51.45 42.96 41.11 48.06 38.27 38.88 42.74 49.62
PIM: 64.97 64.21 57.39 52.79 64.87 53.27 51.24 55.87 61.76 67.15
RM: 48.95 62.41 52.11 60.45 58.07 52.16 61.71 61.06 57.63 56.80
PCM: 35.28 38.59 48.64 50.99 51.52 52.85 46.75 48.31
Side by side boxplots of the data are in Figure 4.17.
[Figure 4.17: Boxplots of strength measurements (MPa) for the four mortar types (OCM,
PIM, RM, PCM).]
In this example,
- Treatment = mortar type (OCM, PIM, RM, and PCM). There are $t = 4$ treat-
ment groups.
- Individuals = mortar specimens.
- This is an example of an observational study, not an experiment. That is, we do
not physically apply a treatment here; instead, the mortar specimens are inherently
different to begin with. We simply take random samples of each mortar type.
QUERY: An initial question that we might have is the following:

    Are the treatment (mortar type) population means equal? Or, are the treat-
    ment population means different?
This question can be answered by performing a hypothesis test, that is, by testing
\[
H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4
\]
versus
\[
H_a: \text{the population means } \mu_i \text{ are not all equal.}
\]
GOAL: We now develop a statistical procedure that allows us to test this type of hy-
pothesis in a one-way classification model.
4.12.1 Overall F test
NOTATION: Let $t$ denote the number of treatments to be compared. Define
\[
Y_{ij} = \text{response on the } j\text{th individual in the } i\text{th treatment group}
\]
for $i = 1, 2, ..., t$ and $j = 1, 2, ..., n_i$.
- $n_i$ is the number of replications for treatment $i$.
- When $n_1 = n_2 = \cdots = n_t = n$, we say the design is balanced; otherwise, the
design is unbalanced.
- Let $N = n_1 + n_2 + \cdots + n_t$ denote the total number of individuals measured. If the
design is balanced, then $N = nt$.
Define
\begin{align*}
\overline{Y}_{i+} &= \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij} \\
S^2_i &= \frac{1}{n_i-1}\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i+})^2 \\
\overline{Y}_{++} &= \frac{1}{N}\sum_{i=1}^{t}\sum_{j=1}^{n_i} Y_{ij}.
\end{align*}
The statistics $\overline{Y}_{i+}$ and $S^2_i$ are simply the sample mean and sample variance, re-
spectively, of the $i$th sample. The statistic $\overline{Y}_{++}$ is the sample mean of all the data
(across all $t$ treatment groups).
STATISTICAL HYPOTHESIS: Our goal is to develop a procedure to test
\[
H_0: \mu_1 = \mu_2 = \cdots = \mu_t
\]
versus
\[
H_a: \text{the population means } \mu_i \text{ are not all equal.}
\]
- The null hypothesis $H_0$ says that there is no treatment difference, that is, all
treatment population means are the same.
- The alternative hypothesis $H_a$ says that a difference among the $t$ population
means exists somewhere (but does not specify how the means are different).
- When performing a hypothesis test, the goal is to decide which hypothesis is more
supported by the observed data.
ASSUMPTIONS: We have independent random samples from $t \geq 2$ normal distribu-
tions, each of which has the same variance (but possibly different means):
\begin{align*}
\text{Sample 1}: &\ Y_{11}, Y_{12}, ..., Y_{1n_1} \sim N(\mu_1, \sigma^2) \\
\text{Sample 2}: &\ Y_{21}, Y_{22}, ..., Y_{2n_2} \sim N(\mu_2, \sigma^2) \\
&\ \ \vdots \\
\text{Sample } t: &\ Y_{t1}, Y_{t2}, ..., Y_{tn_t} \sim N(\mu_t, \sigma^2).
\end{align*}
The procedure we develop is formulated by deriving two estimators for $\sigma^2$. These two
estimators are formed by (1) looking at the variance of the observations within samples,
and (2) looking at the variance of the sample means across the $t$ samples.
WITHIN ESTIMATOR: To estimate $\sigma^2$ within samples, we take a weighted average
(weighted by the sample sizes) of the $t$ sample variances; that is, we pool all variance
estimates together to form one estimate. Define
\[
SS_{\text{res}} = (n_1-1)S^2_1 + (n_2-1)S^2_2 + \cdots + (n_t-1)S^2_t
= \sum_{i=1}^{t}\underbrace{\sum_{j=1}^{n_i}(Y_{ij} - \overline{Y}_{i+})^2}_{(n_i-1)S^2_i}.
\]
We call $SS_{\text{res}}$ the residual sum of squares. Mathematics shows that
\[
E\left(\frac{SS_{\text{res}}}{\sigma^2}\right) = N - t \ \Longrightarrow \ E(MS_{\text{res}}) = \sigma^2,
\]
where
\[
MS_{\text{res}} = \frac{SS_{\text{res}}}{N-t}.
\]
IMPORTANT: $MS_{\text{res}}$ is an unbiased estimator of $\sigma^2$ regardless of whether or not $H_0$ is
true. We call $MS_{\text{res}}$ the residual mean squares.
ACROSS ESTIMATOR: To derive the across-sample estimator, we assume a com-
mon sample size $n_1 = n_2 = \cdots = n_t = n$ (to simplify notation). Recall that if a sample
arises from a normal population, then the sample mean is also normally distributed, i.e.,
\[
\overline{Y}_{i+} \sim N\left(\mu_i, \frac{\sigma^2}{n}\right).
\]
NOTE: If all the treatment population means are equal, that is,
\[
H_0: \mu_1 = \mu_2 = \cdots = \mu_t = \mu, \text{ say,}
\]
is true, then
\[
\overline{Y}_{i+} \sim N\left(\mu, \frac{\sigma^2}{n}\right).
\]
If $H_0$ is true, then the $t$ sample means $\overline{Y}_{1+}, \overline{Y}_{2+}, ..., \overline{Y}_{t+}$ are a random sample of size $t$
from a normal distribution with mean $\mu$ and variance $\sigma^2/n$. The sample variance of this
random sample is given by
\[
\frac{1}{t-1}\sum_{i=1}^{t}(\overline{Y}_{i+} - \overline{Y}_{++})^2
\]
and has expectation
\[
E\left[\frac{1}{t-1}\sum_{i=1}^{t}(\overline{Y}_{i+} - \overline{Y}_{++})^2\right] = \frac{\sigma^2}{n}.
\]
Therefore,
\[
MS_{\text{trt}} = \frac{1}{t-1}\underbrace{\sum_{i=1}^{t}n(\overline{Y}_{i+} - \overline{Y}_{++})^2}_{SS_{\text{trt}}}
\]
is an unbiased estimator of $\sigma^2$; i.e., $E(MS_{\text{trt}}) = \sigma^2$, when $H_0$ is true.
TERMINOLOGY: We call $SS_{\text{trt}}$ the treatment sums of squares and $MS_{\text{trt}}$ the treat-
ment mean squares. $MS_{\text{trt}}$ is our second point estimator for $\sigma^2$. Recall that $MS_{\text{trt}}$
is an unbiased estimator of $\sigma^2$ only when $H_0: \mu_1 = \mu_2 = \cdots = \mu_t$ is true (this is
important!). If we have different sample sizes, we simply adjust $MS_{\text{trt}}$ to
\[
MS_{\text{trt}} = \frac{1}{t-1}\underbrace{\sum_{i=1}^{t}n_i(\overline{Y}_{i+} - \overline{Y}_{++})^2}_{SS_{\text{trt}}}.
\]
This is still an unbiased estimator for $\sigma^2$ when $H_0$ is true.
MOTIVATION: When $H_0$ is true (i.e., the treatment means are the same), then
\[
E(MS_{\text{trt}}) = \sigma^2 \qquad E(MS_{\text{res}}) = \sigma^2.
\]
These two facts suggest that when $H_0$ is true,
\[
F = \frac{MS_{\text{trt}}}{MS_{\text{res}}} \approx 1.
\]
When $H_0$ is not true (i.e., the treatment means are different), then
\[
E(MS_{\text{trt}}) > \sigma^2 \qquad E(MS_{\text{res}}) = \sigma^2.
\]
These two facts suggest that when $H_0$ is not true,
\[
F = \frac{MS_{\text{trt}}}{MS_{\text{res}}} > 1.
\]
SAMPLING DISTRIBUTION: When $H_0$ is true, the $F$ statistic
\[
F = \frac{MS_{\text{trt}}}{MS_{\text{res}}} \sim F(t-1, N-t).
\]
DECISION: We reject $H_0$ and conclude the treatment population means are different
if the $F$ statistic is far out in the right tail of the $F(t-1, N-t)$ distribution. Why?
Because a large value of $F$ is not consistent with $H_0$ being true! Large values of $F$ (far
out in the right tail) are more consistent with $H_a$.
[Figure 4.18: The $F(3, 32)$ probability density function. This is the distribution of $F$ in
Example 4.17 if $H_0$ is true. A mark at $F = 16.848$ has been added.]
MORTAR DATA: We now use R to calculate the F statistic for the strength/mortar
type data in Example 4.17.
> anova(lm(strength ~ mortar.type))
Analysis of Variance Table
Df Sum Sq Mean Sq F value Pr(>F)
mortar.type 3 1520.88 506.96 16.848 9.576e-07 ***
Residuals 32 962.86 30.09
CONCLUSION: $F = 16.848$ is not an observation we would expect from the $F(3, 32)$
distribution (the distribution of $F$ when $H_0$ is true); see Figure 4.18. Therefore, we reject
$H_0$ and conclude the population mean strengths for the four mortar types are different.
In other words, the evidence from the data suggests that $H_a$ is true, not $H_0$.
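The p-value reported in the ANOVA table is just the area to the right of the observed $F$ statistic under the $F(3, 32)$ density. A one-line check of our own:

> 1-pf(16.848,3,32)   ## approx 9.6e-07, matching Pr(>F) above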
TERMINOLOGY: As we have just seen (from the recent R analysis), it is common to
display one-way classification results in an ANOVA table. The form of the ANOVA
table for the one-way classification is given below:

    Source       df     SS          MS                        F
    Treatments   t-1    SS_trt      MS_trt = SS_trt/(t-1)     F = MS_trt/MS_res
    Residuals    N-t    SS_res      MS_res = SS_res/(N-t)
    Total        N-1    SS_total

It is easy to show that
\[
SS_{\text{total}} = SS_{\text{trt}} + SS_{\text{res}}.
\]
total
measures how observations vary about the overall mean, without regard to
treatments; that is, it measures the total variation in all the data. SS
total
can be
partitioned into two components:
SS
trt
measures how much of the total variation is due to the treatments
SS
res
measures what is left over, which we attribute to inherent variation
among the individuals.
Degrees of freedom (df) add down.
Mean squares (MS) are formed by dividing sums of squares by the corresponding
degrees of freedom.
TERMINOLOGY: The probability value (p-value) for a hypothesis test measures
how much evidence we have against H
0
. It is important to remember the following:
the smaller the p-value = the more evidence against H
0
.
MORTAR DATA: For the strength/mortar type data in Example 4.17 (from the R out-
put), we see that
\[
\text{p-value} = 0.0000009576.
\]
- This is obviously quite small, which suggests that we have an enormous amount of
evidence against $H_0$.
- In this example, this p-value is calculated as the area to the right of $F = 16.848$ on
the $F(3, 32)$ probability density function.
- Therefore, this probability is interpreted as follows: If $H_0$ is true, this is the
probability that we would get a test statistic equal to or larger than $F = 16.848$.
Since this is extremely unlikely (p-value = 0.0000009576), this strongly suggests
that $H_0$ is not true.
P-VALUE RULES: Probability values are used in more general hypothesis test settings
in statistics (not just in one-way classification).
Q: How low does a p-value have to get before we reject $H_0$?
A: Unfortunately, there is no right answer to this question. What is commonly done
is the following.
- First choose a significance level $\alpha$ that is small. This represents the probability
that we will reject a true $H_0$, that is,
\[
\alpha = P(\text{Reject } H_0 \,|\, H_0 \text{ true}).
\]
Common values of $\alpha$ chosen beforehand are $\alpha = 0.10$, $\alpha = 0.05$ (the most common),
and $\alpha = 0.01$.
- The smaller $\alpha$ is chosen to be, the more evidence one requires to reject $H_0$.
This is a true statement because of the following well-known decision rule:
\[
\text{p-value} < \alpha \ \Longrightarrow \ \text{reject } H_0.
\]
- Therefore, the value of $\alpha$ chosen by the experimenter determines how low the p-
value must get before $H_0$ is ultimately rejected.
- For the strength/mortar type data, there is no ambiguity in our decision. For other
situations (e.g., p-value = 0.052), the decision may not be as clear cut.
4.12.2 Follow-up analysis: Tukey pairwise confidence intervals
RECALL: In a one-way classification, the overall F test is used to test the hypotheses
\[
H_0: \mu_1 = \mu_2 = \cdots = \mu_t
\]
versus
\[
H_a: \text{the population means } \mu_i \text{ are not all equal.}
\]
QUESTION: If we do reject $H_0$, have we really learned anything that is all that
relevant? All we have learned is that at least one of the population treatment means
is different. We have no idea which one(s) or how many. In this light, rejecting $H_0$ is
largely an uninformative conclusion.
FOLLOW-UP ANALYSES: If $H_0$ is rejected, that is, we conclude at least one population
treatment mean $\mu_i$ is different, the obvious game becomes determining which one(s) and
how they are different. To do this, we will construct Tukey pairwise confidence
intervals for all population treatment mean differences $\mu_i - \mu_{i'}$, $1 \leq i < i' \leq t$. If there
are $t$ treatments, then there are
\[
\binom{t}{2} = \frac{t(t-1)}{2}
\]
pairwise confidence intervals to construct. For example, with the strength/mortar type
example, there are $t = 4$ treatments and 6 pairwise intervals:
\[
\mu_1 - \mu_2, \ \mu_1 - \mu_3, \ \mu_1 - \mu_4, \ \mu_2 - \mu_3, \ \mu_2 - \mu_4, \ \mu_3 - \mu_4,
\]
where
\begin{align*}
\mu_1 &= \text{population mean strength for mortar type OCM} \\
\mu_2 &= \text{population mean strength for mortar type PIM} \\
\mu_3 &= \text{population mean strength for mortar type RM} \\
\mu_4 &= \text{population mean strength for mortar type PCM.}
\end{align*}
PROBLEM: If we construct multiple confidence intervals (here, 6 of them), and if we
construct each one at the 100(1-$\alpha$) percent confidence level, then the overall confidence
level for all 6 intervals will be less than 100(1-$\alpha$) percent. In statistics, this is known
as the multiple comparisons problem.
GOAL: Construct confidence intervals for all pairwise differences $\mu_i - \mu_{i'}$, $1 \leq i < i' \leq t$,
and have our overall confidence level still be at 100(1-$\alpha$) percent.
SOLUTION: Simply increase the confidence level associated with each individual inter-
val! Tukey's method is designed to do this. Even better, R automates the construction
of Tukey's intervals. The intervals are of the form
\[
(\overline{Y}_{i+} - \overline{Y}_{i'+}) \pm q_{t,N-t,\alpha}\sqrt{MS_{\text{res}}\left(\frac{1}{n_i} + \frac{1}{n_{i'}}\right)},
\]
where $q_{t,N-t,\alpha}$ is the Tukey quantile which gives an overall confidence level of 100(1-$\alpha$)
percent (overall for the set of all possible pairwise intervals).
MORTAR DATA: For the strength/mortar type data in Example 4.17, the R output
below gives all pairwise intervals. Note that the overall confidence level is 95 percent.
> TukeyHSD(aov(lm(strength ~ mortar.type)),conf.level=0.95)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lm(strength ~ mortar.type))
$mortar.type
diff lwr upr p adj
PCM-OCM 2.48000 -4.950955 9.910955 0.8026758
PIM-OCM 15.21575 8.166127 22.265373 0.0000097
RM-OCM 12.99875 5.949127 20.048373 0.0001138
PIM-PCM 12.73575 5.686127 19.785373 0.0001522
RM-PCM 10.51875 3.469127 17.568373 0.0016850
RM-PIM -2.21700 -8.863448 4.429448 0.8029266
ANALYSIS: In the R output, the columns labeled lwr and upr give, respectively, the
lower and upper limits of the pairwise confidence intervals.
- We are (at least) 95 percent confident that the difference between the population
mean strengths for the PCM and OCM mortars is between $-4.95$ and $9.91$ MPa.
  - Note that this confidence interval includes 0, which suggests that these two
population means are not different.
  - An equivalent finding is that the adjusted p-value, given in the p adj col-
umn, is large.
- We are (at least) 95 percent confident that the difference between the population
mean strengths for the PIM and OCM mortars is between $8.17$ and $22.27$ MPa.
  - Note that this confidence interval does not include 0, which suggests that
these two population means are different.
  - An equivalent finding is that the adjusted p-value, given in the p adj col-
umn, is small.
- Interpretations for the remaining 4 confidence intervals are formed similarly.
The main point is this:
- If a pairwise confidence interval (for two population means) includes 0, then
these population means are declared not to be different.
- If a pairwise interval does not include 0, then the population means are
declared to be different.
- The conclusions we make for all possible pairwise comparisons are at the
100(1-$\alpha$) percent confidence level.
Therefore, for the strength/mortar type data, the following pairs of population
means are declared to be different:

    PIM-OCM   RM-OCM   PIM-PCM   RM-PCM.

The following pairs of population means are declared to be not different:

    PCM-OCM   RM-PIM.
6 Linear regression
Complementary reading: Chapter 6 (VK); Sections 6.1-6.4.
6.1 Introduction
IMPORTANCE: A problem that arises in engineering, economics, medicine, and other
areas is that of investigating the relationship between two (or more) variables. In such
settings, the goal is to model a continuous random variable $Y$ as a function of one or more
independent variables, say, $x_1, x_2, ..., x_k$. Mathematically, we can express this model as
\[
Y = g(x_1, x_2, ..., x_k) + \epsilon,
\]
where $g: \mathbb{R}^k \rightarrow \mathbb{R}$. This is called a regression model.
- The presence of the (random) error $\epsilon$ conveys the fact that the relationship between
the dependent variable $Y$ and the independent variables $x_1, x_2, ..., x_k$ through $g$ is
not deterministic. Instead, the term $\epsilon$ absorbs all variation in $Y$ that is not
explained by $g(x_1, x_2, ..., x_k)$.
LINEAR MODELS: In this course, we will consider models of the form
\[
Y = \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}_{g(x_1, x_2, ..., x_k)} + \epsilon,
\]
that is, $g$ is a linear function of $\beta_0, \beta_1, ..., \beta_k$. We call this a linear regression model.
- The response variable $Y$ is assumed to be random (but we do get to observe its
value).
- The regression parameters $\beta_0, \beta_1, ..., \beta_k$ are assumed to be fixed and unknown.
- The independent variables $x_1, x_2, ..., x_k$ are assumed to be fixed (not random).
- The error term $\epsilon$ is assumed to be random (and not observed).
DESCRIPTION: More precisely, we call a regression model a linear regression model
if the regression parameters enter the $g$ function in a linear fashion. For example, each
of the following models is a linear regression model:
\begin{align*}
Y &= \underbrace{\beta_0 + \beta_1 x}_{g(x)} + \epsilon \\
Y &= \underbrace{\beta_0 + \beta_1 x + \beta_2 x^2}_{g(x)} + \epsilon \\
Y &= \underbrace{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2}_{g(x_1,x_2)} + \epsilon.
\end{align*}
Main point: The term "linear" does not refer to the shape of the regression function $g$.
It refers to how the regression parameters $\beta_0, \beta_1, ..., \beta_k$ enter the $g$ function.
6.2 Simple linear regression
TERMINOLOGY: A simple linear regression model includes only one independent
variable $x$ and is of the form
\[
Y = \beta_0 + \beta_1 x + \epsilon.
\]
The regression function
\[
g(x) = \beta_0 + \beta_1 x
\]
is a straight line with intercept $\beta_0$ and slope $\beta_1$. If $E(\epsilon) = 0$, then
\begin{align*}
E(Y) &= E(\beta_0 + \beta_1 x + \epsilon) \\
&= \beta_0 + \beta_1 x + E(\epsilon) \\
&= \beta_0 + \beta_1 x.
\end{align*}
Therefore, we have these interpretations for the regression parameters $\beta_0$ and $\beta_1$:
- $\beta_0$ quantifies the mean of $Y$ when $x = 0$.
- $\beta_1$ quantifies the change in $E(Y)$ brought about by a one-unit change in $x$.
Example 6.1. As part of a waste removal project, a new compression machine for
processing sewage sludge is being studied. In particular, engineers are interested in the
following variables:
\begin{align*}
Y &= \text{moisture control of compressed pellets (measured as a percent)} \\
x &= \text{machine filtration rate (kg-DS/m/hr).}
\end{align*}
Engineers collect $n = 20$ observations of $(x, Y)$; the data are given below.
Obs x Y Obs x Y
1 125.3 77.9 11 159.5 79.9
2 98.2 76.8 12 145.8 79.0
3 201.4 81.5 13 75.1 76.7
4 147.3 79.8 14 151.4 78.2
5 145.9 78.2 15 144.2 79.5
6 124.7 78.3 16 125.0 78.1
7 112.2 77.5 17 198.8 81.5
8 120.2 77.0 18 132.5 77.0
9 161.2 80.1 19 159.6 79.0
10 178.9 80.2 20 110.7 78.6
Table 6.1: Sewage data. Moisture ($Y$, measured as a percentage) and machine filtration
rate ($x$, measured in kg-DS/m/hr). There are $n = 20$ observations.
Figure 6.1 displays the data in a scatterplot. This is the most common graphical display
for bivariate data like those seen above. From the plot, we see that
- the variables $Y$ and $x$ are positively related, that is, an increase in $x$ tends to be
associated with an increase in $Y$
- the variables $Y$ and $x$ are linearly related, although there is a large amount of
variation that is unexplained
- this is an example where a simple linear regression model may be adequate.
[Figure 6.1: Scatterplot of pellet moisture $Y$ (measured as a percentage) as a function of
machine filtration rate $x$ (measured in kg-DS/m/hr).]
6.2.1 Least squares estimation
TERMINOLOGY: When we say "fit a regression model," we mean that we would like
to estimate the regression parameters in the model with the observed data. Suppose that
we collect $(x_i, Y_i)$, $i = 1, 2, ..., n$, and postulate the simple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
\]
for each $i = 1, 2, ..., n$. Our first goal is to estimate $\beta_0$ and $\beta_1$. Formal assumptions for
the error terms $\epsilon_i$ will be given later.
LEAST SQUARES: A widely-accepted method of estimating the model parameters $\beta_0$
and $\beta_1$ is least squares. The method of least squares says to choose the values of $\beta_0$
and $\beta_1$ that minimize
\[
Q(\beta_0, \beta_1) = \sum_{i=1}^{n}[Y_i - (\beta_0 + \beta_1 x_i)]^2.
\]
Denote the least squares estimators by $b_0$ and $b_1$, respectively; that is, the values of $\beta_0$
and $\beta_1$ that minimize $Q(\beta_0, \beta_1)$. A two-variable minimization argument can be used to
find closed-form expressions for $b_0$ and $b_1$. Taking partial derivatives of $Q(\beta_0, \beta_1)$, we
obtain
\begin{align*}
\frac{\partial Q(\beta_0, \beta_1)}{\partial \beta_0} &= -2\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i) \stackrel{\text{set}}{=} 0 \\
\frac{\partial Q(\beta_0, \beta_1)}{\partial \beta_1} &= -2\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i)x_i \stackrel{\text{set}}{=} 0.
\end{align*}
Solving for $\beta_0$ and $\beta_1$ gives the least squares estimators
\begin{align*}
b_0 &= \overline{Y} - b_1\overline{x} \\
b_1 &= \frac{\sum_{i=1}^{n}(x_i - \overline{x})(Y_i - \overline{Y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2} = \frac{SS_{xy}}{SS_{xx}}.
\end{align*}
In real life, it is rarely necessary to calculate $b_0$ and $b_1$ by hand, although VK (Example
6.4, pp 379) does give an example of hand calculation. R automates the entire model
fitting process and subsequent analysis.
Example 6.1 (continued). We now use R to calculate the equation of the least squares
regression line for the sewage sludge data in Example 6.1. Here is the output:
> fit = lm(moisture~filtration.rate)
> fit
lm(formula = moisture ~ filtration.rate)
Coefficients:
(Intercept) filtration.rate
72.95855 0.04103
From the output, we see the least squares estimates (to 3 dp) for the sewage data are
\[
b_0 = 72.959 \qquad b_1 = 0.041.
\]
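The closed-form formulas are easy to verify directly. Here is a sketch of our own, using the variable names from the lm() call above (the objects SSxy, SSxx, b0, and b1 are ours):

> SSxy = sum((filtration.rate-mean(filtration.rate))*(moisture-mean(moisture)))
> SSxx = sum((filtration.rate-mean(filtration.rate))^2)
> b1 = SSxy/SSxx                                  ## slope; approx 0.041
> b0 = mean(moisture)-b1*mean(filtration.rate)    ## intercept; approx 72.959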
[Figure 6.2: Scatterplot of pellet moisture $Y$ (measured as a percentage) as a function of
filtration rate $x$ (measured in kg-DS/m/hr). The least squares line has been added.]
Therefore, the equation of the least squares line that relates moisture percentage $Y$ to
the filtration rate $x$ is
\[
\widehat{Y} = 72.959 + 0.041x,
\]
or, in other words,
\[
\widehat{\text{Moisture}} = 72.959 + 0.041 \times \text{Filtration rate}.
\]
NOTE: Your authors call the least squares line the prediction equation. This is
because we can predict the value of $Y$ (moisture) for any value of $x$ (filtration rate). For
example, when the filtration rate is $x = 150$ kg-DS/m/hr, we would predict the moisture
percentage to be
\[
\widehat{Y}(150) = 72.959 + 0.041(150) \approx 79.109.
\]
6.2.2 Model assumptions and properties of least squares estimators
INTEREST: We wish to investigate the properties of $b_0$ and $b_1$ as estimators of the true
regression parameters $\beta_0$ and $\beta_1$ in the simple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
\]
for $i = 1, 2, ..., n$. To do this, we need assumptions on the error terms $\epsilon_i$. Specifically, we
will assume throughout that
- $E(\epsilon_i) = 0$, for $i = 1, 2, ..., n$
- $\text{var}(\epsilon_i) = \sigma^2$, for $i = 1, 2, ..., n$, that is, the variance is constant
- the random variables $\epsilon_i$ are independent
- the random variables $\epsilon_i$ are normally distributed.
IMPLICATION: Under these assumptions,
\[
Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2).
\]
Fact 1. The least squares estimators $b_0$ and $b_1$ are unbiased estimators of $\beta_0$ and $\beta_1$,
respectively, that is,
\[
E(b_0) = \beta_0 \qquad E(b_1) = \beta_1.
\]
Fact 2. The least squares estimators $b_0$ and $b_1$ have the following sampling distributions:
\[
b_0 \sim N(\beta_0, c_{00}\sigma^2) \quad \text{and} \quad b_1 \sim N(\beta_1, c_{11}\sigma^2),
\]
where
\[
c_{00} = \frac{1}{n} + \frac{\overline{x}^2}{SS_{xx}} \quad \text{and} \quad c_{11} = \frac{1}{SS_{xx}}.
\]
Knowing these sampling distributions is critical if we want to write confidence intervals
and perform hypothesis tests for $\beta_0$ and $\beta_1$.
6.2.3 Estimating the error variance
GOAL: In the simple linear regression model
Y
i
=
0
+
1
x
i
+
i
,
where
i
N(0,
2
), we now turn our attention to estimating
2
, the error variance.
TERMINOLOGY: In the simple linear regression model, define the ith fitted value by
\[
\widehat{Y}_i = b_0 + b_1 x_i,
\]
where $b_0$ and $b_1$ are the least squares estimators. Each observation has its own fitted value. Geometrically, an observation's fitted value is the vertical projection of its Y value, upward or downward, onto the least squares line.
TERMINOLOGY: We define the ith residual by
\[
e_i = Y_i - \widehat{Y}_i.
\]
Each observation has its own residual. Geometrically, an observation's residual is the vertical distance (i.e., length) between its Y value and its fitted value.

- If an observation's Y value is above the least squares regression line, its residual is positive.
- If an observation's Y value is below the least squares regression line, its residual is negative.
INTERESTING: In the simple linear regression model (provided that the model includes an intercept term $\beta_0$), we have the following algebraic result:
\[
\sum_{i=1}^{n} e_i = \sum_{i=1}^{n}(Y_i - \widehat{Y}_i) = 0,
\]
that is, the sum of the residuals (from a least squares fit) is equal to zero.
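This zero-sum property is easy to confirm numerically from the fitted model; a one-line check, assuming the fit object from the earlier lm() call:

> # Residuals from a least squares fit (with intercept) sum to zero
> sum(resid(fit))   # zero up to floating-point error, e.g. on the order of 1e-13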
Obs   x      Y     Yhat    e = Y - Yhat     Obs   x      Y     Yhat    e = Y - Yhat
1     125.3  77.9  78.100  -0.200           11    159.5  79.9  79.503   0.397
2      98.2  76.8  76.988  -0.188           12    145.8  79.0  78.941   0.059
3     201.4  81.5  81.223   0.277           13     75.1  76.7  76.040   0.660
4     147.3  79.8  79.003   0.797           14    151.4  78.2  79.171  -0.971
5     145.9  78.2  78.945  -0.745           15    144.2  79.5  78.876   0.624
6     124.7  78.3  78.075   0.225           16    125.0  78.1  78.088   0.012
7     112.2  77.5  77.563  -0.062           17    198.8  81.5  81.116   0.384
8     120.2  77.0  77.891  -0.891           18    132.5  77.0  78.396  -1.396
9     161.2  80.1  79.573   0.527           19    159.6  79.0  79.508  -0.508
10    178.9  80.2  80.299  -0.099           20    110.7  78.6  77.501   1.099

Table 6.2: Sewage data. Fitted values (Yhat = b0 + b1x) and residuals from the least squares fit.
SEWAGE DATA: In Table 6.2, I have used R to calculate the fitted values and residuals for each of the n = 20 observations in the sewage sludge data set.
TERMINOLOGY: We define the residual sum of squares by
\[
SS_{\text{res}} \equiv \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(Y_i - \widehat{Y}_i)^2.
\]
Fact 3. In the simple linear regression model,
\[
MS_{\text{res}} = \frac{SS_{\text{res}}}{n-2}
\]
is an unbiased estimator of $\sigma^2$, that is, $E(MS_{\text{res}}) = \sigma^2$. The quantity
\[
\widehat{\sigma} = \sqrt{MS_{\text{res}}} = \sqrt{\frac{SS_{\text{res}}}{n-2}}
\]
estimates $\sigma$ and is called the residual standard error.
SEWAGE DATA: For the sewage data in Example 6.1, we use R to calculate $MS_{\text{res}}$:
> fitted.values = predict(fit)
> residuals = moisture-fitted.values
> # Calculate MS_res
> sum(residuals^2)/18
[1] 0.4426659
For the sewage data, an (unbiased) estimate of the error variance $\sigma^2$ is $MS_{\text{res}} \approx 0.443$. The residual standard error is
\[
\widehat{\sigma} = \sqrt{MS_{\text{res}}} = \sqrt{0.4426659} \approx 0.6653.
\]
This estimate can also be seen in the following R output:
> summary(fit)
lm(formula = moisture ~ filtration.rate)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 72.958547 0.697528 104.596 < 2e-16 ***
filtration.rate 0.041034 0.004837 8.484 1.05e-07 ***
Residual standard error: 0.6653 on 18 degrees of freedom
Multiple R-squared: 0.7999, Adjusted R-squared: 0.7888
F-statistic: 71.97 on 1 and 18 DF, p-value: 1.052e-07
6.2.4 Inference for $\beta_0$ and $\beta_1$
INTEREST: In the simple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
\]
the regression parameters $\beta_0$ and $\beta_1$ are unknown. It is therefore of interest to construct confidence intervals and perform hypothesis tests for these parameters.

- In practice, inference for the slope parameter $\beta_1$ is of primary interest because of its connection to the independent variable x in the model.
- Inference for $\beta_0$ is less meaningful, unless one is explicitly interested in the mean of Y when x = 0. We will not pursue this.
CONFIDENCE INTERVAL FOR $\beta_1$: Under our model assumptions, the following sampling distribution arises:
\[
t = \frac{b_1 - \beta_1}{\sqrt{MS_{\text{res}}/SS_{xx}}} \sim t(n-2).
\]
This result can be used to derive a $100(1-\alpha)$ percent confidence interval for $\beta_1$, which is given by
\[
b_1 \pm t_{n-2,\alpha/2}\sqrt{MS_{\text{res}}/SS_{xx}}.
\]

- The value $t_{n-2,\alpha/2}$ is the upper $\alpha/2$ quantile from the t(n-2) distribution.
- Note the form of the interval:
\[
\underbrace{\text{point estimate}}_{b_1} \pm \underbrace{\text{quantile}}_{t_{n-2,\alpha/2}} \times \underbrace{\text{standard error}}_{\sqrt{MS_{\text{res}}/SS_{xx}}}.
\]
- We interpret the interval in the same way: we are $100(1-\alpha)$ percent confident that the population regression slope $\beta_1$ is in this interval.
- When interpreting the interval, of particular interest to us is the value $\beta_1 = 0$.
  - If $\beta_1 = 0$ is in the confidence interval, this suggests that Y and x are not linearly related.
  - If $\beta_1 = 0$ is not in the confidence interval, this suggests that Y and x are linearly related.
HYPOTHESIS TEST FOR $\beta_1$: If our interest was to test
\[
H_0: \beta_1 = \beta_{1,0} \quad \text{versus} \quad H_a: \beta_1 \neq \beta_{1,0},
\]
where $\beta_{1,0}$ is a fixed value (often, $\beta_{1,0} = 0$), we would focus our attention on
\[
t = \frac{b_1 - \beta_{1,0}}{\sqrt{MS_{\text{res}}/SS_{xx}}}.
\]
REASONING: If $H_0: \beta_1 = \beta_{1,0}$ is true, then t arises from a t(n-2) distribution. We can therefore judge the amount of evidence against $H_0$ by comparing t to this distribution. R automatically calculates t and produces a p-value for the test above. Remember that small p-values are evidence against $H_0$.

Example 6.1 (continued). We now use R to test
\[
H_0: \beta_1 = 0 \quad \text{versus} \quad H_a: \beta_1 \neq 0,
\]
for the sewage sludge data in Example 6.1. Note that $\beta_{1,0} = 0$.
> summary(fit)
lm(formula = moisture ~ filtration.rate)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 72.958547 0.697528 104.596 < 2e-16 ***
filtration.rate 0.041034 0.004837 8.484 1.05e-07 ***
ANALYSIS: Figure 6.3 shows the t(18) distribution, that is, the distribution of t when $H_0: \beta_1 = 0$ is true for the sewage sludge example. Clearly, t = 8.484 is not an expected outcome from this distribution (p-value = 0.000000105)! In other words, there is strong evidence that moisture percentage is linearly related to machine filtration rate.

ANALYSIS: A 95 percent confidence interval for $\beta_1$ is calculated as follows:
\[
b_1 \pm t_{18,0.025}\,\text{se}(b_1) = 0.0410 \pm 2.1009(0.0048) = (0.0309, 0.0511).
\]
We are 95 percent confident that the population regression slope $\beta_1$ is between 0.0309 and 0.0511. Note that this interval does not include 0.
> qt(0.975,18)
[1] 2.100922
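The interval can also be obtained in one step from the fitted model; a small convenience sketch, assuming the fit object from before:

> # 95% confidence intervals for beta_0 and beta_1
> confint(fit, level = 0.95)
> # The filtration.rate row reproduces the hand calculation above,
> # approximately (0.0309, 0.0512) up to rounding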
Figure 6.3: t(18) probability density function. A mark at t = 8.484 has been added.
6.2.5 Confidence and prediction intervals for a given $x = x_0$
INTEREST: Consider the simple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
\]
where $\epsilon_i \sim N(0,\sigma^2)$. We are often interested in using the fitted model to learn about the response variable Y at a certain setting for the independent variable, say $x = x_0$. For example, in our sewage sludge example, we might be interested in the moisture percentage Y when the filtration rate is x = 150 kg-DS/m/hr. Two potential goals arise:

- We might be interested in estimating the mean response of Y when $x = x_0$. This mean response is denoted by $E(Y|x_0)$. This value is the mean of the probability distribution
\[
Y(x_0) \sim N(\beta_0 + \beta_1 x_0, \sigma^2).
\]
- We might be interested in predicting a new response Y when $x = x_0$. This predicted response is denoted by $Y^*(x_0)$. This value is a new outcome from
\[
Y(x_0) \sim N(\beta_0 + \beta_1 x_0, \sigma^2).
\]

In the first problem, we are interested in estimating the mean of the response variable Y at a certain value of x. In the second problem, we are interested in predicting the value of a new random variable Y at a certain value of x. Conceptually, the second problem is far more difficult than the first.
GOALS: We would like to create $100(1-\alpha)$ percent intervals for the mean $E(Y|x_0)$ and for the new value $Y^*(x_0)$. The former is called a confidence interval (since it is for a mean response) and the latter is called a prediction interval (since it is for a new random variable).

POINT ESTIMATOR/PREDICTOR: To construct either interval, we start with the same quantity:
\[
\widehat{Y}(x_0) = b_0 + b_1 x_0,
\]
where $b_0$ and $b_1$ are the least squares estimates from the fit of the model.

- In the confidence interval for $E(Y|x_0)$, we call $\widehat{Y}(x_0)$ a point estimator.
- In the prediction interval for $Y^*(x_0)$, we call $\widehat{Y}(x_0)$ a point predictor.

The primary difference in the intervals arises in assessing the variability of $\widehat{Y}(x_0)$.
CONFIDENCE INTERVAL: A $100(1-\alpha)$ percent confidence interval for the mean $E(Y|x_0)$ is given by
\[
\widehat{Y}(x_0) \pm t_{n-2,\alpha/2}\sqrt{MS_{\text{res}}\left[\frac{1}{n} + \frac{(x_0-\overline{x})^2}{SS_{xx}}\right]}.
\]
PREDICTION INTERVAL: A $100(1-\alpha)$ percent prediction interval for the new response $Y^*(x_0)$ is given by
\[
\widehat{Y}(x_0) \pm t_{n-2,\alpha/2}\sqrt{MS_{\text{res}}\left[1 + \frac{1}{n} + \frac{(x_0-\overline{x})^2}{SS_{xx}}\right]}.
\]
COMPARISON: The two intervals are identical except for the extra "1" in the standard error part of the prediction interval. This extra "1" arises from the additional uncertainty associated with predicting a new response from the $N(\beta_0 + \beta_1 x_0, \sigma^2)$ distribution. Therefore, a $100(1-\alpha)$ percent prediction interval for $Y^*(x_0)$ will be wider than the corresponding $100(1-\alpha)$ percent confidence interval for $E(Y|x_0)$.
INTERVAL LENGTH: The length of both intervals clearly depends on the value of $x_0$. In fact, the standard error of $\widehat{Y}(x_0)$ will be smallest when $x_0 = \overline{x}$ and will get larger the farther $x_0$ is from $\overline{x}$ in either direction. This implies that the precision with which we estimate $E(Y|x_0)$ or predict $Y^*(x_0)$ decreases the farther we get away from $\overline{x}$. This makes intuitive sense: we would expect to have the most confidence in our fitted model near the center of the observed data.
TERMINOLOGY: It is sometimes desired to estimate $E(Y|x_0)$ or predict $Y^*(x_0)$ based on the fit of the model for values of $x_0$ outside the range of x values used in the experiment/study. This is called extrapolation and can be very dangerous. In order for our inferences to be valid, we must believe that the straight line relationship holds for x values outside the range where we have observed data. In some situations, this may be reasonable. In others, we may have no theoretical basis for making such a claim without data to support it.
Example 6.1 (continued). In our sewage sludge example, suppose that we are interested in estimating $E(Y|x_0)$ and predicting a new $Y^*(x_0)$ when the filtration rate is $x_0 = 150$ kg-DS/m/hr.

- $E(Y|x_0)$ denotes the mean moisture percentage for compressed pellets when the machine filtration rate is $x_0 = 150$ kg-DS/m/hr. In other words, if we were to repeat the experiment over and over again, each time using a filtration rate of $x_0 = 150$ kg-DS/m/hr, then $E(Y|x_0)$ denotes the mean value of Y (moisture percentage) that would be observed.
- $Y^*(x_0)$ denotes a possible value of Y for a single run of the machine when the filtration rate is set at $x_0 = 150$ kg-DS/m/hr.
R automates the calculation of confidence and prediction intervals, as seen below.
> predict(fit,data.frame(filtration.rate=150),level=0.95,interval="confidence")
fit lwr upr
79.11361 78.78765 79.43958
> predict(fit,data.frame(filtration.rate=150),level=0.95,interval="prediction")
fit lwr upr
79.11361 77.6783 80.54893
Note that the point estimate (point prediction) is easily calculated from the unrounded coefficient estimates:
\[
\widehat{Y}(x_0 = 150) = 72.958547 + 0.041034(150) \approx 79.1136.
\]

- A 95 percent confidence interval for $E(Y|x_0 = 150)$ is (78.79, 79.44). When the filtration rate is $x_0 = 150$ kg-DS/m/hr, we are 95 percent confident that the mean moisture percentage is between 78.79 and 79.44 percent.
- A 95 percent prediction interval for $Y^*(x_0 = 150)$ is (77.68, 80.55). When the filtration rate is $x_0 = 150$ kg-DS/m/hr, we are 95 percent confident that the moisture percentage for a single run of the experiment will be between 77.68 and 80.55 percent.
Figure 6.4 shows 95 percent confidence bands for $E(Y|x_0)$ and 95 percent prediction bands for $Y^*(x_0)$. These are not simultaneous bands (i.e., these are not bands for the entire population regression function).
Figure 6.4: Scatterplot of pellet moisture Y (measured as a percentage) as a function of machine filtration rate x (measured in kg-DS/m/hr). The least squares regression line has been added. Ninety-five percent confidence/prediction bands have been added.
6.3 Multiple linear regression
6.3.1 Introduction
PREVIEW: We have already considered the simple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
\]
for i = 1, 2, ..., n, where $\epsilon_i \sim N(0,\sigma^2)$. We now extend this basic model to include multiple independent variables $x_1, x_2, ..., x_k$. This is much more realistic because, in practice, Y often depends on many different factors (i.e., not just one). Specifically, we consider models of the form
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,
\]
for i = 1, 2, ..., n. We call this a multiple linear regression model.
- There are now p = k + 1 regression parameters $\beta_0, \beta_1, ..., \beta_k$. These are unknown and are to be estimated with the observed data.
- Schematically, we can envision the observed data as follows:

Individual   Y     x_1    x_2    ...   x_k
1            Y_1   x_11   x_12   ...   x_1k
2            Y_2   x_21   x_22   ...   x_2k
...          ...   ...    ...    ...   ...
n            Y_n   x_n1   x_n2   ...   x_nk
- Each of the n individuals contributes a response Y and a value of each of the independent variables $x_1, x_2, ..., x_k$.
- We continue to assume that $\epsilon_i \sim N(0,\sigma^2)$.
- We also assume that the independent variables $x_1, x_2, ..., x_k$ are fixed and measured without error.
PREVIEW: To fit the multiple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,
\]
we again use the method of least squares. Simple computing formulae for the least squares estimators are no longer available (as they were in simple linear regression). This is hardly a big deal because we will use computing to automate all analyses. For instructional purposes, it is advantageous to express multiple linear regression models in terms of matrices and vectors. This streamlines notation and makes the presentation easier.
6.3.2 Matrix representation
MATRIX REPRESENTATION: Consider the multiple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,
\]
for i = 1, 2, ..., n. Define
\[
\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1k} \\
1 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\boldsymbol{\epsilon} = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.
\]
With these definitions, the model above can be expressed equivalently as
\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}.
\]
In this equivalent representation,

- $\mathbf{Y}$ is an $n \times 1$ (random) vector of responses
- $\mathbf{X}$ is an $n \times p$ (fixed) matrix of independent variable measurements (p = k + 1)
- $\boldsymbol{\beta}$ is a $p \times 1$ (fixed) vector of unknown population regression parameters
- $\boldsymbol{\epsilon}$ is an $n \times 1$ (random) vector of unobserved errors.
LEAST SQUARES: The notion of least squares is the same as it was in the simple linear regression model. To fit a multiple linear regression model, we want to find the values of $\beta_0, \beta_1, ..., \beta_k$ that minimize
\[
Q(\beta_0,\beta_1,...,\beta_k) = \sum_{i=1}^{n}\left[Y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})\right]^2,
\]
or, in matrix notation, the value of $\boldsymbol{\beta}$ that minimizes
\[
Q(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}).
\]
Because $Q(\boldsymbol{\beta})$ is a scalar function of the p = k + 1 elements of $\boldsymbol{\beta}$, it is possible to use calculus to determine the values of the p elements that minimize it. Formally, we can take p partial derivatives with respect to each of $\beta_0, \beta_1, ..., \beta_k$ and set these equal to zero. Using the calculus of matrices, we can write this resulting system of p equations (and p unknowns) as follows:
\[
\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{Y}.
\]
These are called the normal equations. Provided that $\mathbf{X}'\mathbf{X}$ is full rank, the (unique) solution is
\[
\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix}.
\]
This is the least squares estimator of $\boldsymbol{\beta}$. The fitted regression model is
\[
\widehat{\mathbf{Y}} = \mathbf{X}\mathbf{b},
\]
or, equivalently,
\[
\widehat{Y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik},
\]
for i = 1, 2, ..., n.
TECHNICAL NOTE: For the least squares estimator
\[
\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}
\]
to be unique, we need $\mathbf{X}$ to be of full column rank; i.e., $r(\mathbf{X}) = p = k + 1$. This will occur when there are no linear dependencies among the columns of $\mathbf{X}$. If $r(\mathbf{X}) < p$, then $\mathbf{X}'\mathbf{X}$ does not have a unique inverse. In this case, the normal equations cannot be solved uniquely.
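As an aside, the normal equations can be solved directly in R; solve() with two arguments solves the linear system without explicitly forming the inverse, which is numerically preferable. A minimal sketch using the sewage data from Example 6.1 (any full-rank design matrix works the same way):

> # Solve the normal equations X'X b = X'Y directly
> X = cbind(1, filtration.rate)        # n x p design matrix (p = 2 here)
> b = solve(t(X) %*% X, t(X) %*% moisture)
> b                                    # matches coef(lm(moisture ~ filtration.rate))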
Example 6.2. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study from the LaTrobe Valley of Victoria, Australia, samples of cheddar cheese were analyzed for their chemical composition and were subjected to taste tests. For each specimen, the taste Y was obtained by combining the scores from several tasters. Data were collected on the following variables:

- Y = taste score (TASTE)
- $x_1$ = concentration of acetic acid (ACETIC)
- $x_2$ = concentration of hydrogen sulfide (H2S)
- $x_3$ = concentration of lactic acid (LACTIC).

Variables ACETIC and H2S were both measured on the log scale. The variable LACTIC has not been transformed. Table 6.3 contains concentrations of the various chemicals in n = 30 specimens of cheddar cheese and the observed taste score.
Specimen TASTE ACETIC H2S LACTIC Specimen TASTE ACETIC H2S LACTIC
1 12.3 4.543 3.135 0.86 16 40.9 6.365 9.588 1.74
2 20.9 5.159 5.043 1.53 17 15.9 4.787 3.912 1.16
3 39.0 5.366 5.438 1.57 18 6.4 5.412 4.700 1.49
4 47.9 5.759 7.496 1.81 19 18.0 5.247 6.174 1.63
5 5.6 4.663 3.807 0.99 20 38.9 5.438 9.064 1.99
6 25.9 5.697 7.601 1.09 21 14.0 4.564 4.949 1.15
7 37.3 5.892 8.726 1.29 22 15.2 5.298 5.220 1.33
8 21.9 6.078 7.966 1.78 23 32.0 5.455 9.242 1.44
9 18.1 4.898 3.850 1.29 24 56.7 5.855 10.20 2.01
10 21.0 5.242 4.174 1.58 25 16.8 5.366 3.664 1.31
11 34.9 5.740 6.142 1.68 26 11.6 6.043 3.219 1.46
12 57.2 6.446 7.908 1.90 27 26.5 6.458 6.962 1.72
13 0.7 4.477 2.996 1.06 28 0.7 5.328 3.912 1.25
14 25.9 5.236 4.942 1.30 29 13.4 5.802 6.685 1.08
15 54.9 6.151 6.752 1.52 30 5.5 6.176 4.787 1.25
Table 6.3: Cheese data. ACETIC, H2S, and LACTIC are independent variables. The response variable is TASTE.
MODEL: Researchers postulate that each of the three chemical composition variables $x_1$, $x_2$, and $x_3$ is important in describing the taste and consider the multiple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i,
\]
for i = 1, 2, ..., 30. We now use R to fit this model using the method of least squares:
> fit = lm(taste~acetic+h2s+lactic)
> fit
Coefficients:
(Intercept) acetic h2s lactic
-28.877 0.328 3.912 19.670
This output gives the values of the least squares estimates:
\[
\mathbf{b} = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} = \begin{pmatrix} -28.877 \\ 0.328 \\ 3.912 \\ 19.670 \end{pmatrix}.
\]
Therefore, the fitted least squares regression model is
\[
\widehat{Y} = -28.877 + 0.328x_1 + 3.912x_2 + 19.670x_3,
\]
or, in other words,
\[
\widehat{\text{TASTE}} = -28.877 + 0.328\,\text{ACETIC} + 3.912\,\text{H2S} + 19.670\,\text{LACTIC}.
\]
6.3.3 Estimating the error variance
GOAL: Consider the multiple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,
\]
for i = 1, 2, ..., n, where $\epsilon_i \sim N(0,\sigma^2)$. We have just seen how to estimate $\beta_0, \beta_1, ..., \beta_k$ via least squares and how to automate this procedure in R. Our next task is to estimate the error variance $\sigma^2$.
TERMINOLOGY: Define the residual sum of squares by
\[
SS_{\text{res}} = \sum_{i=1}^{n}(Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2.
\]
In matrix notation, we can write this as
\[
SS_{\text{res}} = (\mathbf{Y} - \widehat{\mathbf{Y}})'(\mathbf{Y} - \widehat{\mathbf{Y}}) = (\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{Y} - \mathbf{X}\mathbf{b}) = \mathbf{e}'\mathbf{e}.
\]

- The $n \times 1$ vector $\widehat{\mathbf{Y}} = \mathbf{X}\mathbf{b}$ contains the least squares fitted values.
- The $n \times 1$ vector $\mathbf{e} = \mathbf{Y} - \widehat{\mathbf{Y}}$ contains the least squares residuals.
RESULT: In the multiple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,
\]
where $\epsilon_i \sim N(0,\sigma^2)$,
\[
MS_{\text{res}} = \frac{SS_{\text{res}}}{n-p}
\]
is an unbiased estimator of $\sigma^2$, that is, $E(MS_{\text{res}}) = \sigma^2$. The quantity
\[
\widehat{\sigma} = \sqrt{MS_{\text{res}}} = \sqrt{\frac{SS_{\text{res}}}{n-p}}
\]
estimates $\sigma$ and is called the residual standard error.
CHEESE DATA: For the cheese data in Example 6.2, we use R to calculate $MS_{\text{res}}$:
> fitted.values = predict(fit)
> residuals = taste-fitted.values
> # Calculate MS_res
> sum(residuals^2)/26
[1] 102.6299
6.3.4 The hat matrix
TERMINOLOGY: Consider the linear regression model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ and define
\[
\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'.
\]
$\mathbf{H}$ is called the hat matrix. Important quantities in linear regression can be written as functions of the hat matrix.

- The vector of fitted values can be written as
\[
\widehat{\mathbf{Y}} = \mathbf{X}\mathbf{b} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y}.
\]
- The vector of residuals can be written as
\[
\mathbf{e} = \mathbf{Y} - \widehat{\mathbf{Y}} = \mathbf{Y} - \mathbf{H}\mathbf{Y} = (\mathbf{I} - \mathbf{H})\mathbf{Y},
\]
where $\mathbf{I}$ is the $n \times n$ identity matrix.
- Interestingly, it turns out that $\widehat{\mathbf{Y}}$ and $\mathbf{e}$ are orthogonal vectors in $\mathbb{R}^n$.
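These identities are straightforward to verify numerically. A minimal sketch, assuming the fit object for the cheese model in Example 6.2; model.matrix() extracts the design matrix X from a fitted lm object.

> # Verify hat-matrix identities for the cheese model
> X = model.matrix(fit)                  # n x p design matrix
> H = X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
> max(abs(H %*% taste - fitted(fit)))    # Y-hat = HY (difference is ~ 0)
> sum(fitted(fit) * resid(fit))          # orthogonality of Y-hat and e (~ 0)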
6.3.5 Analysis of variance for linear regression
IDENTITY: Algebraically, it can be shown that
\[
\underbrace{\sum_{i=1}^{n}(Y_i - \overline{Y})^2}_{SS_{\text{total}}}
= \underbrace{\sum_{i=1}^{n}(\widehat{Y}_i - \overline{Y})^2}_{SS_{\text{reg}}}
+ \underbrace{\sum_{i=1}^{n}(Y_i - \widehat{Y}_i)^2}_{SS_{\text{res}}}.
\]
- $SS_{\text{total}}$ is the total sum of squares. $SS_{\text{total}}$ is the numerator of the sample variance of $Y_1, Y_2, ..., Y_n$. It measures the total variation in the response data.
- $SS_{\text{reg}}$ is the regression sum of squares. $SS_{\text{reg}}$ measures the variation in the response data explained by the linear regression model.
- $SS_{\text{res}}$ is the residual sum of squares. $SS_{\text{res}}$ measures the variation in the response data not explained by the linear regression model.
Table 6.4: Analysis of variance table for linear regression.

Source       df      SS          MS                          F
Regression   k       SS_reg      MS_reg = SS_reg/k           F = MS_reg/MS_res
Residual     n - p   SS_res      MS_res = SS_res/(n - p)
Total        n - 1   SS_total
ANOVA TABLE: We can combine all of this information to produce an analysis of variance (ANOVA) table. Such tables are standard in regression analysis.

- The degrees of freedom (df) add down.
  - $SS_{\text{total}}$ can be viewed as a statistic that has lost a degree of freedom for having to estimate the overall mean of Y with the sample mean $\overline{Y}$. Recall that n - 1 is our divisor in the sample variance $S^2$.
  - There are k degrees of freedom associated with $SS_{\text{reg}}$ because there are k independent variables.
  - The degrees of freedom for $SS_{\text{res}}$ can be thought of as the divisor needed to create an unbiased estimator of $\sigma^2$. Recall that
\[
MS_{\text{res}} = \frac{SS_{\text{res}}}{n-p} = \frac{SS_{\text{res}}}{n-k-1}
\]
is an unbiased estimator of $\sigma^2$.
- The sums of squares (SS) also add down. This follows from the algebraic identity noted earlier.
- Mean squares (MS) are the sums of squares divided by their degrees of freedom.
- The F statistic is formed by taking the ratio of $MS_{\text{reg}}$ and $MS_{\text{res}}$. More on this in a moment.
COEFFICIENT OF DETERMINATION: Since
\[
SS_{\text{total}} = SS_{\text{reg}} + SS_{\text{res}},
\]
the proportion of the total variation in the data explained by the linear regression model is
\[
R^2 = \frac{SS_{\text{reg}}}{SS_{\text{total}}}.
\]
This statistic is called the coefficient of determination. Clearly, $0 \le R^2 \le 1$. The larger the $R^2$, the better the regression model explains the variability in the data.

IMPORTANT: It is critical to understand what $R^2$ does and does not measure. Its value is computed under the assumption that the multiple linear regression model is correct and assesses how much of the variation in the data may be attributed to that relationship rather than to inherent variation.

- If $R^2$ is small, it may be that there is a lot of random inherent variation in the data, so that, although the multiple linear regression model is reasonable, it can explain only so much of the observed overall variation.
- Alternatively, $R^2$ may be close to 1 (e.g., in a simple linear regression model fit), but this may not be the best model. In fact, $R^2$ could be very high, but ultimately not relevant because it assumes the simple linear regression model is correct. In reality, a better model may exist (e.g., a quadratic model).
F STATISTIC: The F statistic in the ANOVA table is used to test
\[
H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0
\]
versus
\[
H_a: \text{at least one of the } \beta_j \text{ is nonzero}.
\]
In other words, F tests whether or not at least one of the independent variables $x_1, x_2, ..., x_k$ is important in describing the response Y. If $H_0$ is rejected, we do not know which one or how many of the $\beta_j$'s are nonzero; only that at least one is.

SAMPLING DISTRIBUTION: When $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ is true,
\[
F = \frac{MS_{\text{reg}}}{MS_{\text{res}}} \sim F(k, n-p).
\]
Therefore, we can gauge the evidence against $H_0$ by comparing F to this distribution. Values of F far out in the right (upper) tail are evidence against $H_0$. R automatically produces the value of F and the corresponding p-value. Recall that small p-values are evidence against $H_0$ (the smaller the p-value, the more evidence).
Example 6.2 (continued). For the cheese data in Example 6.2, we fit the multiple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i,
\]
for i = 1, 2, ..., 30. The ANOVA table, obtained using SAS, is shown below.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Regression 3 4994.50861 1664.83620 16.22 <.0001
Residual 26 2668.37806 102.62993
Corrected Total 29 7662.88667
The F statistic is used to test
\[
H_0: \beta_1 = \beta_2 = \beta_3 = 0
\]
versus
\[
H_a: \text{at least one of the } \beta_j \text{ is nonzero}.
\]
ANALYSIS: Based on the F statistic (F = 16.22) and the corresponding probability value (p-value < 0.0001), we have strong evidence to reject $H_0$. See also Figure 6.5. Interpretation: We conclude that at least one of the independent variables (ACETIC, H2S, LACTIC) is important in describing taste.
NOTE: The coefficient of determination is
\[
R^2 = \frac{SS_{\text{reg}}}{SS_{\text{total}}} = \frac{4994.51}{7662.89} \approx 0.652.
\]
Interpretation: About 65.2 percent of the variability in the taste data is explained by the linear regression model that includes ACETIC, H2S, and LACTIC. The remaining 34.8 percent of the variability in the taste data is explained by other sources.
Figure 6.5: F(3, 26) probability density function. A mark at F = 16.22 has been added.
IMPORTANT: If we fit this model in R, we get the following:
fit = lm(taste~acetic+h2s+lactic)
anova(fit)
Response: taste
Df Sum Sq Mean Sq F value Pr(>F)
acetic 1 2314.14 2314.14 22.5484 6.528e-05 ***
h2s 1 2147.11 2147.11 20.9209 0.0001035 ***
lactic 1 533.26 533.26 5.1959 0.0310870 *
Residuals 26 2668.38 102.63
NOTE: The convention used by R is to split up the regression sum of squares $SS_{\text{reg}} = 4994.50861$ into sums of squares for each of the three independent variables ACETIC, H2S, and LACTIC, as they are added sequentially to the model (these are called sequential sums of squares). The sequential sums of squares for the independent variables add to the $SS_{\text{reg}}$ for the model (up to rounding error), that is,
\[
SS_{\text{reg}} = 4994.51 = 2314.14 + 2147.11 + 533.26 = SS(\text{ACETIC}) + SS(\text{H2S}) + SS(\text{LACTIC}).
\]
In words,

- SS(ACETIC) is the sum of squares added when compared to a model that includes only an intercept term.
- SS(H2S) is the sum of squares added when compared to a model that includes an intercept term and ACETIC.
- SS(LACTIC) is the sum of squares added when compared to a model that includes an intercept term, ACETIC, and H2S.

In other words, we can use the sequential sums of squares to assess the impact of adding independent variables ACETIC, H2S, and LACTIC to the model in sequence.
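Because these sums of squares are sequential, they change if the variables enter the formula in a different order (only the residual line is unaffected). A quick check, assuming the cheese data are loaded:

> # Sequential sums of squares depend on the order of entry
> anova(lm(taste ~ acetic + h2s + lactic))   # order used above
> anova(lm(taste ~ lactic + h2s + acetic))   # different order, different splits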
6.3.6 Inference for individual regression parameters
IMPORTANCE: Consider our multiple linear regression model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,
\]
for i = 1, 2, ..., n, where $\epsilon_i \sim N(0,\sigma^2)$. Confidence intervals and hypothesis tests for $\beta_j$ can help us assess the importance of using the independent variable $x_j$ in a model with the other independent variables. That is, inference regarding $\beta_j$ is always conditional on the other variables being included in the model.
CONFIDENCE INTERVALS: A $100(1-\alpha)$ percent confidence interval for $\beta_j$, for j = 0, 1, 2, ..., k, is given by
\[
b_j \pm t_{n-p,\alpha/2}\sqrt{\widehat{\sigma}^2 c_{jj}},
\]
where
\[
\widehat{\sigma}^2 = MS_{\text{res}} = \frac{SS_{\text{res}}}{n-p}
\]
is our unbiased estimate of $\sigma^2$ and $c_{jj} = (\mathbf{X}'\mathbf{X})^{-1}_{jj}$ is the corresponding diagonal element of the $(\mathbf{X}'\mathbf{X})^{-1}$ matrix.
HYPOTHESIS TESTS: Hypothesis tests for
\[
H_0: \beta_j = 0 \quad \text{versus} \quad H_a: \beta_j \neq 0
\]
can be performed by examining the p-value output provided in R.

- If $H_0: \beta_j = 0$ is not rejected, then $x_j$ is not important in describing Y in the presence of the other independent variables.
- If $H_0: \beta_j = 0$ is rejected, this means that $x_j$ is important in describing Y even after including the effects of the other independent variables.
> summary(fit)
Call: lm(formula = taste ~ acetic + h2s + lactic)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.877 19.735 -1.463 0.15540
acetic 0.328 4.460 0.074 0.94193
h2s 3.912 1.248 3.133 0.00425 **
lactic 19.670 8.629 2.279 0.03109 *
Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116
F-statistic: 16.22 on 3 and 26 DF, p-value: 3.810e-06
OUTPUT: The Estimate output gives the values of the least squares estimates:
\[
b_0 \approx -28.877 \quad b_1 \approx 0.328 \quad b_2 \approx 3.912 \quad b_3 \approx 19.670.
\]
Therefore, the fitted least squares regression model is
\[
\widehat{Y} = -28.877 + 0.328x_1 + 3.912x_2 + 19.670x_3,
\]
or, in other words,
\[
\widehat{\text{TASTE}} = -28.877 + 0.328\,\text{ACETIC} + 3.912\,\text{H2S} + 19.670\,\text{LACTIC}.
\]
The Std. Error output gives
\[
19.735 = \text{se}(b_0) = \sqrt{c_{00}\widehat{\sigma}^2} = \sqrt{\widehat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}_{00}}
\]
\[
4.460 = \text{se}(b_1) = \sqrt{c_{11}\widehat{\sigma}^2} = \sqrt{\widehat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}_{11}}
\]
\[
1.248 = \text{se}(b_2) = \sqrt{c_{22}\widehat{\sigma}^2} = \sqrt{\widehat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}_{22}}
\]
\[
8.629 = \text{se}(b_3) = \sqrt{c_{33}\widehat{\sigma}^2} = \sqrt{\widehat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}_{33}},
\]
where
\[
\widehat{\sigma}^2 = MS_{\text{res}} = \frac{SS_{\text{res}}}{30-4} = (10.13)^2 \approx 102.63
\]
is the square of the Residual standard error. The t value output gives the t statistics
\[
t = -1.463 = \frac{b_0 - 0}{\sqrt{c_{00}\widehat{\sigma}^2}} \quad
t = 0.074 = \frac{b_1 - 0}{\sqrt{c_{11}\widehat{\sigma}^2}} \quad
t = 3.133 = \frac{b_2 - 0}{\sqrt{c_{22}\widehat{\sigma}^2}} \quad
t = 2.279 = \frac{b_3 - 0}{\sqrt{c_{33}\widehat{\sigma}^2}}.
\]
These t statistics can be used to test $H_0: \beta_j = 0$ versus $H_a: \beta_j \neq 0$, for j = 0, 1, 2, 3. Two-sided probability values are in Pr(>|t|). At the $\alpha = 0.05$ level,

- we do not reject $H_0: \beta_0 = 0$ (p-value = 0.155). Interpretation: In the model which includes all three independent variables, the intercept term $\beta_0$ is not statistically different from zero.
- we do not reject $H_0: \beta_1 = 0$ (p-value = 0.942). Interpretation: ACETIC does not significantly add to a model that includes H2S and LACTIC.
- we reject $H_0: \beta_2 = 0$ (p-value = 0.004). Interpretation: H2S does significantly add to a model that includes ACETIC and LACTIC.
- we reject $H_0: \beta_3 = 0$ (p-value = 0.031). Interpretation: LACTIC does significantly add to a model that includes ACETIC and H2S.
CONFIDENCE INTERVALS: Ninety-five percent confidence intervals for the regression parameters $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$, respectively, are
\[
b_0 \pm t_{26,0.025}\,\text{se}(b_0) = -28.877 \pm 2.056(19.735) = (-69.45, 11.70)
\]
\[
b_1 \pm t_{26,0.025}\,\text{se}(b_1) = 0.328 \pm 2.056(4.460) = (-8.84, 9.50)
\]
\[
b_2 \pm t_{26,0.025}\,\text{se}(b_2) = 3.912 \pm 2.056(1.248) = (1.35, 6.48)
\]
\[
b_3 \pm t_{26,0.025}\,\text{se}(b_3) = 19.670 \pm 2.056(8.629) = (1.93, 37.41).
\]
The conclusions reached from interpreting these intervals are the same as those reached using the hypothesis test p-values. Note that the $\beta_2$ and $\beta_3$ intervals do not include zero; those for $\beta_0$ and $\beta_1$ do.
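All four intervals can be reproduced in a single call, assuming the fit object for the cheese model:

> # 95% confidence intervals for beta_0, beta_1, beta_2, beta_3
> confint(fit, level = 0.95)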
6.3.7 Confidence and prediction intervals for a given $\mathbf{x} = \mathbf{x}_0$
GOALS: We would like to create $100(1-\alpha)$ percent intervals for the mean $E(Y|\mathbf{x}_0)$ and for the new value $Y^*(\mathbf{x}_0)$. As in the simple linear regression case, the former is called a confidence interval (since it is for a mean response) and the latter is called a prediction interval (since it is for a new random variable).
CHEESE DATA: Suppose that we are interested in estimating $E(Y|\mathbf{x}_0)$ and predicting a new $Y^*(\mathbf{x}_0)$ when ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, so that
\[
\mathbf{x}_0 = \begin{pmatrix} 5.5 \\ 6.0 \\ 1.4 \end{pmatrix}.
\]
We use R to compute the following:
> predict(fit,data.frame(acetic=5.5,h2s=6.0,lactic=1.4),level=0.95,interval="confidence")
fit lwr upr
23.93552 20.04506 27.82597
> predict(fit,data.frame(acetic=5.5,h2s=6.0,lactic=1.4),level=0.95,interval="prediction")
fit lwr upr
23.93552 2.751379 45.11966
Note that the point estimate/prediction is
\[
\widehat{Y}(\mathbf{x}_0) = b_0 + b_1 x_{10} + b_2 x_{20} + b_3 x_{30} = -28.877 + 0.328(5.5) + 3.912(6.0) + 19.670(1.4) \approx 23.936.
\]

- A 95 percent confidence interval for $E(Y|\mathbf{x}_0)$ is (20.05, 27.83). When ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, we are 95 percent confident that the mean taste rating is between 20.05 and 27.83.
- A 95 percent prediction interval for $Y^*(\mathbf{x}_0)$ is (2.75, 45.12). When ACETIC = 5.5, H2S = 6.0, and LACTIC = 1.4, we are 95 percent confident that the taste rating for a new cheese specimen will be between 2.75 and 45.12.
6.4 Model diagnostics (residual analysis)
IMPORTANCE: We now discuss certain diagnostic techniques for linear regression. The term "diagnostics" refers to the process of checking the model assumptions. This is an important exercise because if the model assumptions are violated, then our analysis (and all subsequent interpretations) could be compromised.

MODEL ASSUMPTIONS: We first recall the model assumptions on the error terms in the linear regression model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,
\]
for i = 1, 2, ..., n. Specifically, we have made the following assumptions:

- $E(\epsilon_i) = 0$, for i = 1, 2, ..., n
- $\text{var}(\epsilon_i) = \sigma^2$, for i = 1, 2, ..., n, that is, the variance is constant
- the random variables $\epsilon_i$ are independent
- the random variables $\epsilon_i$ are normally distributed.
RESIDUALS: In checking our model assumptions, we first have to deal with the obvious problem; namely, the error terms $\epsilon_i$ in the model are never observed. However, from the fit of the model, we can calculate the residuals
\[
e_i = Y_i - \widehat{Y}_i,
\]
where the ith fitted value
\[
\widehat{Y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik}.
\]
We can think of the residuals $e_1, e_2, ..., e_n$ as proxies for the error terms $\epsilon_1, \epsilon_2, ..., \epsilon_n$, and, therefore, we can use the residuals to check our model assumptions instead.
QQ PLOT FOR NORMALITY: To check the normality assumption (for the errors) in linear regression, it is common to display the qq-plot of the residuals.

- Recall that if the plotted points follow a straight line (approximately), this supports the normality assumption.
- Substantial deviation from linearity is not consistent with the normality assumption.
- The plot in Figure 6.6 supports the normality assumption for the errors in the multiple linear regression model for the cheese data; see the sketch after this list.
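A minimal sketch for producing this display in R, assuming the fit object for the cheese model:

> # Normal qq-plot of the least squares residuals (cf. Figure 6.6)
> qqnorm(resid(fit))
> qqline(resid(fit))   # reference line through the first and third quartiles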
RESIDUAL PLOT: By the phrase "residual plot," I mean the plot of the residuals (on the vertical axis) versus the predicted values (on the horizontal axis). This plot is simply the scatterplot of the residuals and the predicted values.
Figure 6.6: Cheese data. Normal qq-plot of the least squares residuals.
- Advanced linear model arguments show that if the model does a good job at describing the data, then the residuals and fitted values are independent.
- This means that a plot of the residuals versus the fitted values should reveal no noticeable patterns; that is, the plot should appear to be random in nature (e.g., a random scatter of points).
- On the other hand, if there are definite (non-random) patterns in the residual plot, this suggests that the model is inadequate in some way or it could point to a violation in the model assumptions.
- The plot in Figure 6.7 does not suggest any obvious model inadequacies! It looks completely random in appearance; a sketch for producing it follows this list.
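A matching sketch for the residual plot, again assuming the fit object:

> # Residual plot: residuals versus fitted values (cf. Figure 6.7)
> plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
> abline(h = 0)   # horizontal reference line at zero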
Figure 6.7: Cheese data. Residual plot for the multiple linear regression model fit. A horizontal line at zero has been added.
COMMON VIOLATIONS: Although there are many ways to violate the statistical assumptions associated with linear regression, the most common violations are

- non-constant variance (heteroscedasticity)
- misspecifying the true regression function
- correlated observations over time.
Example 6.3. An electric company is interested in modeling peak hour electricity demand (Y) as a function of total monthly energy usage (x). This is important for planning purposes because the generating system must be large enough to meet the maximum demand imposed by customers. Data for n = 53 residential customers for a given month are shown in Figure 6.8.
Figure 6.8: Electricity data. Left: Scatterplot of peak demand (Y, measured in kWh) versus monthly usage (x, measured in kWh) with least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit.
Problem: There is a clear problem with non-constant variance here. Note how the residual plot fans out like the bell of a trumpet. This violation may have been missed by looking at the scatterplot alone, but the residual plot highlights it.

Remedy: A common course of action to handle non-constant variance is to apply a transformation to the response variable Y. Common transformations are logarithmic ($\ln Y$), square-root ($\sqrt{Y}$), and inverse ($1/Y$).
ELECTRICITY DATA: A square root transformation is commonly applied to address non-constant variance. Consider the simple linear regression model
\[
W_i = \beta_0 + \beta_1 x_i + \epsilon_i,
\]
for i = 1, 2, ..., 53, where $W_i = \sqrt{Y_i}$. It is straightforward to fit this transformed model in R as before. We simply regress W on x (instead of regressing Y on x).
> fit.2 = lm(sqrt(peak.demand) ~ monthly.usage)
Figure 6.9: Electricity data. Left: Scatterplot of the square root of peak demand ($\sqrt{Y}$) versus monthly usage (x, measured in kWh) with the least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit with transformed response.
> fit.2
Coefficients:
(Intercept) monthly.usage
0.580831 0.000953
ANALYSIS: Figure 6.9 above shows the scatterplot (left) and the residual plot (right) from fitting the transformed model. The "fanning out" shape that we saw previously (in the untransformed model) is now largely absent. The fitted transformed model is
\[
\widehat{W} = 0.580831 + 0.000953x,
\]
or, in other words,
\[
\widehat{\sqrt{\text{Peak demand}}} = 0.580831 + 0.000953 \times \text{Monthly usage}.
\]
Further analyses can be carried out with the transformed model; e.g., testing whether peak demand (on the square root scale) is linearly related to monthly usage.
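For example, predictions on the original kWh scale can be recovered by squaring predictions from the transformed model. A hedged sketch, assuming the fit.2 object above; the usage value of 2000 kWh is a hypothetical illustration, and squaring a point prediction is convenient but slightly biased for the mean on the original scale.

> # Predict peak demand at monthly.usage = 2000 kWh (hypothetical value)
> w.hat = predict(fit.2, data.frame(monthly.usage = 2000))
> w.hat^2   # back-transform from the square-root scale to kWh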
Figure 6.10: Windmill data. Left: Scatterplot of DC output Y versus wind velocity (x, measured in mph) with least squares simple linear regression line superimposed. Right: Residual plot for the simple linear regression model fit.
Example 6.4. A research engineer is investigating the use of a windmill to generate electricity. He has collected data on the direct current (DC) output Y from his windmill and the corresponding wind velocity (x, measured in mph). Data for n = 25 observation pairs are shown in Figure 6.10.

Problem: There is a clear quadratic relationship between DC output and wind velocity, so a simple linear regression model fit (as shown above) is inappropriate. The residual plot shows a pronounced quadratic pattern; this pattern is not accounted for in fitting a straight line model.
Remedy: Fit a multiple linear regression model with two independent variables: wind velocity x and its square $x^2$, that is, consider the quadratic regression model
\[
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i,
\]
for i = 1, 2, ..., 25. It is straightforward to fit a quadratic model in R. We simply regress Y on x and $x^2$.
Figure 6.11: Windmill data. Left: Scatterplot of DC output Y versus wind velocity (x, measured in mph) with least squares quadratic regression curve superimposed. Right: Residual plot for the quadratic regression model fit.
> wind.velocity.sq = wind.velocity^2
> fit.2 = lm(DC.output ~ wind.velocity + wind.velocity.sq)
> fit.2
Coefficients:
(Intercept) wind.velocity wind.velocity.sq
-1.15590 0.72294 -0.03812
The fitted quadratic regression model is
\[
\widehat{Y} = -1.15590 + 0.72294x - 0.03812x^2,
\]
or, in other words,
\[
\widehat{\text{DC output}} = -1.15590 + 0.72294\,\text{Wind velocity} - 0.03812\,(\text{Wind velocity})^2.
\]
Note that the residual plot from the quadratic model fit, shown above, now looks quite good. The quadratic trend has disappeared (because the model now incorporates it).
Figure 6.12: Global temperature data. Left: Time series plot of the temperature Y measured one time per year. The independent variable x is year, measured as 1900, 1901, ..., 1997. A simple linear regression model fit has been superimposed. Right: Residual plot from the simple linear regression model fit.
Example 6.5. The data in Figure 6.12 (left) are temperature readings (in degrees C) on land-air average temperature anomalies, collected once per year from 1900-1997. To emphasize that the data are collected over time, I have used straight lines to connect the observations; this is called a time series plot.

- Unfortunately, it is all too common that people fit linear regression models to time series data and then blindly use them for prediction purposes.
- It takes neither a meteorologist nor an engineering degree to know that temperature observations collected over time are probably correlated. Not surprisingly, residuals from a simple linear regression fit display clear correlation over time.
- Regression techniques (as we have learned in this chapter) are generally not appropriate when analyzing time series data for this reason. More advanced modeling techniques are needed.
7 Factorial Experiments
Complementary reading: Chapter 7 (VK); Sections 7.1-7.2.
7.1 Introduction
REMARK: In engineering experiments, particularly those carried out in industrial settings, there are often several factors of interest, and the goal is to assess the effects of these factors on a continuous response Y (e.g., yield, lifetime, fill weights, etc.). A factorial treatment structure is an efficient way of defining treatments in these types of experiments.

- One example of a factorial treatment structure uses k factors, where each factor has two levels. This is called a $2^k$ factorial experiment.
- Factorial experiments are often used in the early stages of experimental work. For this reason, factorial experiments are also called factor screening experiments.
Example 7.1. A nickel-titanium alloy is used to make components for jet turbine aircraft engines. Cracking is a potentially serious problem in the final part, as it can lead to nonrecoverable failure. A test is run at the parts producer to determine the effect of k = 4 factors on cracks: pouring temperature (A), titanium content (B), heat treatment method (C), and amount of grain refiner used (D).

- Factor A has 2 levels: low temperature and high temperature
- Factor B has 2 levels: low content and high content
- Factor C has 2 levels: Method 1 and Method 2
- Factor D has 2 levels: low amount and high amount.

The response variable in the experiment is Y = length of largest crack (in mm) induced in a piece of sample material.
NOTE: In this example, there are 4 factors, each with 2 levels. Thus, there are $2 \times 2 \times 2 \times 2 = 2^4 = 16$ different treatment combinations. These are listed here:

a1b1c1d1    a1b2c1d1    a2b1c1d1    a2b2c1d1
a1b1c1d2    a1b2c1d2    a2b1c1d2    a2b2c1d2
a1b1c2d1    a1b2c2d1    a2b1c2d1    a2b2c2d1
a1b1c2d2    a1b2c2d2    a2b1c2d2    a2b2c2d2

For example, the treatment combination a1b1c1d1 holds each factor at its low level, the treatment combination a1b1c2d2 holds Factors A and B at their low levels and Factors C and D at their high levels, and so on.
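Enumerating the 16 combinations is easy to automate; a minimal R sketch using the same level labels:

> # All 2^4 = 16 treatment combinations
> combos = expand.grid(A = c("a1","a2"), B = c("b1","b2"),
+                      C = c("c1","c2"), D = c("d1","d2"))
> nrow(combos)   # 16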
TERMINOLOGY: In a $2^k$ factorial experiment, one replicate of the experiment uses $2^k$ runs, one at each of the $2^k$ treatment combinations.

- Therefore, in Example 7.1, one replicate of the experiment would require 16 runs (one at each treatment combination listed above).
- Two replicates would require 32 runs, three replicates would require 48 runs, and so on.
TERMINOLOGY: There are different types of effects of interest in factorial experiments: main effects and interaction effects. For example, in a $2^4$ factorial experiment,

- there is 1 effect that does not depend on any of the factors.
- there are 4 main effects: A, B, C, and D.
- there are 6 two-way interaction effects: AB, AC, AD, BC, BD, and CD.
- there are 4 three-way interaction effects: ABC, ABD, ACD, and BCD.
- there is 1 four-way interaction effect: ABCD.
OBSERVATION: Note that 1 + 4 + 6 + 4 + 1 = 16. In other words, with 16 observations (from one $2^4$ replicate), we can estimate the 4 main effects and we can estimate all of the 11 interaction effects. We will have 1 observation left to estimate the overall mean of Y, that is, the effect that depends on none of the 4 factors.
GENERALIZATION: In a $2^k$ factorial experiment, there is/are

- $\binom{k}{0} = 1$ overall mean (the mean of Y ignoring all factors)
- $\binom{k}{1} = k$ main effects
- $\binom{k}{2} = \frac{k(k-1)}{2}$ two-way interaction effects
- $\binom{k}{3}$ three-way interaction effects, and so on.

Note that
\[
\binom{k}{0} + \binom{k}{1} + \binom{k}{2} + \cdots + \binom{k}{k} = \sum_{j=0}^{k}\binom{k}{j} = 2^k,
\]
and additionally that $\binom{k}{0}, \binom{k}{1}, ..., \binom{k}{k}$ are the entries in the (k+1)th row of Pascal's Triangle. Observe also that $2^k$ grows quickly in size as k increases. For example, if there are k = 10 factors (A, B, C, D, E, F, G, H, I, and J, say), then performing just one replicate of the experiment would require $2^{10} = 1024$ runs! In real life, rarely would this type of experiment be possible.
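The binomial identity above is easy to confirm numerically; a one-line check for k = 10:

> # Binomial coefficients sum to 2^k (here k = 10)
> sum(choose(10, 0:10))   # 1024 = 2^10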
7.2 Example: A $2^2$ experiment with replication
NOTE: We first consider $2^k$ factorial experiments where k = 2, that is, there are only two factors, denoted by A and B. This is called a $2^2$ experiment. We illustrate with an agricultural example.

Example 7.2. Predicting corn yield prior to harvest is useful for making feed supply and marketing decisions. Corn must have an adequate amount of nitrogen (Factor A) and phosphorus (Factor B) for profitable production and also for environmental concerns.
Table 7.1: Corn yield data (bushels/plot).

Treatment combination   Yield (Y)             Treatment sample mean
a1b1                    35, 26, 25, 33, 31    30
a1b2                    39, 33, 41, 31, 36    36
a2b1                    37, 27, 35, 27, 34    32
a2b2                    49, 39, 39, 47, 46    44
Experimental design: In a $2 \times 2 = 2^2$ factorial experiment, two levels of nitrogen ($a_1$ = 10 and $a_2$ = 15) and two levels of phosphorus ($b_1$ = 2 and $b_2$ = 4) were used. Applications of nitrogen and phosphorus were measured in pounds per plot. Twenty small (quarter-acre) plots were available for experimentation, and the four treatment combinations a1b1, a1b2, a2b1, and a2b2 were randomly assigned to plots. Note that there are 5 replications.

Response: The response variable is Y = yield per plot (measured in # bushels). Side-by-side boxplots of the data in the table above are in Figure 7.1.
Naive analysis: One silly way to analyze these data would be to simply regard each of the combinations a1b1, a1b2, a2b1, and a2b2 as a treatment and perform a one-way ANOVA with t = 4 treatment groups like we did in Chapter 4. This would produce the following ANOVA table:
> anova(lm(yield ~ treatment))
Analysis of Variance Table
Df Sum Sq Mean Sq F value Pr(>F)
treatment 3 575 191.67 9.5833 0.0007362 ***
Residuals 16 320 20.00
Figure 7.1: Boxplots of corn yields (bushels/plot) for four treatment groups.
OMNIBUS CONCLUSION: The value F = 9.5833 is not what we would expect from an F(3, 16) distribution, the distribution of F when
\[
H_0: \mu_{11} = \mu_{12} = \mu_{21} = \mu_{22}
\]
is true (p-value $\approx$ 0.0007). Therefore, we would conclude that at least one of the factorial treatment population means is different.
REMARK: As we have discussed before in one-way classification experiments, an overall F test provides very little information. With a factorial treatment structure, it is possible to explore the data further; in particular, we can learn about the main effects due to nitrogen (Factor A) and due to phosphorus (Factor B). We can also learn about the interaction between nitrogen and phosphorus.
PARTITION: Let us first recall the treatment sum of squares from the one-way ANOVA: $SS_{\text{trt}} = 575$. The way we learn more about specific effects is to partition $SS_{\text{trt}}$ into the following pieces: $SS_A$, $SS_B$, and $SS_{AB}$. By partition, I mean that we will write
\[
SS_{\text{trt}} = SS_A + SS_B + SS_{AB}.
\]
In words,

- $SS_A$ is the sum of squares due to the main effect of A (nitrogen)
- $SS_B$ is the sum of squares due to the main effect of B (phosphorus)
- $SS_{AB}$ is the sum of squares due to the interaction effect of A and B (nitrogen and phosphorus).

We can use R to write this partition in a richer ANOVA table (mathematical details omitted):
> fit = lm(yield ~ nitrogen*phosphorus)
> anova(fit)
Analysis of Variance Table
Df Sum Sq Mean Sq F value Pr(>F)
nitrogen 1 125 125 6.25 0.0236742 *
phosphorus 1 405 405 20.25 0.0003635 ***
nitrogen:phosphorus 1 45 45 2.25 0.1530877
Residuals 16 320 20
The F statistics $F_A$, $F_B$, and $F_{AB}$ can be used to test for main effects and an interaction effect, respectively. Each F statistic above has an F(1, 16) distribution under the assumption that the associated effect is zero. Small p-values (e.g., p-value < 0.05) indicate that the effect is nonzero. Effects with large p-values can be treated as not significant.
Figure 7.2: F(1, 16) probability density function. A mark at $F_{AB}$ = 2.25 has been added.
ANALYSIS: When analyzing data from an experiment with a $2^2$ factorial treatment structure, the first task is to judge whether the two factors interact; here, whether or not the nitrogen/phosphorus contribution is real. From the ANOVA table, we see that $F_{AB}$ = 2.25 (p-value $\approx$ 0.153). This value of $F_{AB}$ is not all that unreasonable when compared to the F(1, 16) distribution, the distribution of $F_{AB}$ when nitrogen and phosphorus do not interact. In other words, we do not have substantial evidence that nitrogen and phosphorus interact.
NOTE: An interaction plot is a graphical display that can help us assess (visually) whether two factors interact. In this plot, the levels of Factor A (say) are marked on the horizontal axis. The sample means of the treatments are plotted against the levels of A, and the points corresponding to the same level of Factor B are joined by straight lines.
Figure 7.3: Interaction plot for nitrogen and phosphorus in Example 7.2.
- If Factors A and B do not interact at all, the interaction plot should display parallel lines. That is, the effect of one factor stays constant across the levels of the other factor. This is essentially what it means to have no interaction.
- If the interaction plot displays a departure from parallelism (including an overwhelming case where the lines intersect), then this is visual evidence of interaction. That is, the effect of one factor depends on the levels of the other factor.
- The F test that uses $F_{AB}$ provides numerical evidence of interaction. The interaction plot provides visual evidence; a sketch for producing the plot follows this list.
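Base R provides a convenience function for this display. A minimal sketch, assuming yield, nitrogen, and phosphorus from Example 7.2 are loaded (nitrogen and phosphorus as factors):

> # Interaction plot for the corn yield data (cf. Figure 7.3)
> interaction.plot(x.factor = nitrogen, trace.factor = phosphorus,
+                  response = yield, ylab = "Mean yield (bushels/plot)")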
CONCLUSION: We do not have strong evidence that nitrogen and phosphorus interact. The $F_{AB}$ statistic is not significant and the interaction plot does not show a substantial departure from parallelism.
GENERAL STRATEGY: The following are guidelines for analyzing data from $2^2$ factorial experiments. Start by looking at whether or not the interaction contribution is significant. This can be done by using an interaction plot and an F test that uses $F_{AB}$.

- If the interaction is significant, then formal analysis of main effects is not all that meaningful because their interpretations depend on the interaction. In this situation, the best approach is to just redo the entire analysis as a one-way ANOVA with 4 treatments. Tukey pairwise confidence intervals can help you formulate an ordering among the 4 treatment population means.
- If the interaction is not significant, I prefer to refit the model without the interaction term present and then examine the main effects. This can be done numerically by examining the sizes of $F_A$ and $F_B$, respectively.
ANALYSIS: Here is the ANOVA table for the corn yield data, leaving out the nitrogen/phosphorus interaction term:
> fit = lm(yield ~ nitrogen + phosphorus)
> anova(fit)
Analysis of Variance Table
Df Sum Sq Mean Sq F value Pr(>F)
nitrogen 1 125 125.00 5.8219 0.027403 *
phosphorus 1 405 405.00 18.8630 0.000442 ***
Residuals 17 365 21.47
Comparing this to the ANOVA table with interaction, note that the interaction sum of squares, $SS_{AB}$ = 45, has now been absorbed into the residual sum of squares.

- The main effect of nitrogen (Factor A) is significant in describing yield ($F_A$ = 5.8219, p-value $\approx$ 0.0274).
- The main effect of phosphorus (Factor B) is strongly significant in describing yield ($F_B$ = 18.8630, p-value = 0.0004).
Figure 7.4: Left: Side-by-side boxplots for nitrogen (Factor A). Right: Side-by-side boxplots for phosphorus (Factor B).
CONFIDENCE INTERVALS: A 95 percent confidence interval for $\mu_{A1} - \mu_{A2}$, the difference in means for the two levels of nitrogen (Factor A), is given by
\[
(\overline{Y}_{A1} - \overline{Y}_{A2}) \pm t_{17,0.025}\sqrt{MS_{\text{res}}\left(\frac{1}{10} + \frac{1}{10}\right)}.
\]
A 95 percent confidence interval for $\mu_{B1} - \mu_{B2}$, the difference in means for the two levels of phosphorus (Factor B), is given by
\[
(\overline{Y}_{B1} - \overline{Y}_{B2}) \pm t_{17,0.025}\sqrt{MS_{\text{res}}\left(\frac{1}{10} + \frac{1}{10}\right)}.
\]
The R code online can be used to calculate these intervals:

- 95% CI for $\mu_{A1} - \mu_{A2}$: (-9.37, -0.62) bushels/acre
- 95% CI for $\mu_{B1} - \mu_{B2}$: (-13.37, -4.63) bushels/acre

Note that neither of these intervals includes zero. This is expected because $F_A$ and $F_B$ are both significant at the $\alpha$ = 0.05 level.
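Since the notes refer to R code available online, one hedged way to reproduce the nitrogen interval by hand is sketched below. It assumes yield, nitrogen, and phosphorus from Example 7.2, with nitrogen coded as a factor whose levels are labeled "a1" and "a2" (these labels are an assumption).

> # 95% CI for mu_A1 - mu_A2 by hand (10 observations per nitrogen level)
> fit = lm(yield ~ nitrogen + phosphorus)   # additive model, 17 residual df
> ms.res = sum(resid(fit)^2)/17             # MS_res
> d = mean(yield[nitrogen == "a1"]) - mean(yield[nitrogen == "a2"])
> d + c(-1, 1)*qt(0.975, 17)*sqrt(ms.res*(1/10 + 1/10))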
REGRESSION: In Example 7.2, there were two levels of nitrogen ($a_1$ = 10 and $a_2$ = 15) and two levels of phosphorus ($b_1$ = 2 and $b_2$ = 4) used in the experiment. These levels, which we generically called "low" and "high" when analyzing the data using ANOVA, are actually numerical in nature (measured in pounds per plot). In this light, there is nothing to prevent us from fitting the following multiple linear regression model using the numerical values of nitrogen and phosphorus:
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i,
\]
where the independent variables are

- $x_1$ = nitrogen amount (10 or 15 pounds)
- $x_2$ = phosphorus amount (2 or 4 pounds).

Doing so in R gives the following output:
> fit = lm(yield ~ nitrogen + phosphorus)
> summary(fit)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.5000 6.1297 1.550 0.139599
nitrogen 1.0000 0.4144 2.413 0.027403 *
phosphorus 4.5000 1.0361 4.343 0.000442 ***
Residual standard error: 4.634 on 17 degrees of freedom
Multiple R-squared: 0.5922, Adjusted R-squared: 0.5442
F-statistic: 12.34 on 2 and 17 DF, p-value: 0.0004886
This output gives the values of the least squares estimates:
\[
\mathbf{b} = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} 9.5 \\ 1.0 \\ 4.5 \end{pmatrix}.
\]
Therefore, the fitted least squares regression model for the corn yield data is
\[
\widehat{Y} = 9.5 + 1.0x_1 + 4.5x_2,
\]
or, in other words,
\[
\widehat{\text{YIELD}} = 9.5 + 1.0\,\text{NITROGEN} + 4.5\,\text{PHOSPHORUS}.
\]
This equation can subsequently be used to make predictions about future yields based on given values of nitrogen and phosphorus. In doing so, be careful about extrapolation; for example, you would not want to make a prediction when $x_1$ = 25 and $x_2$ = 10. These values are not representative of those used in the actual experiment, so this model may not be a good description of yield for these values of nitrogen and phosphorus.
INTERESTING: I have below constructed the analysis of variance table for the multiple linear regression fit:
> anova(fit)
Analysis of Variance Table
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
nitrogen 1 125 125.00 5.8219 0.027403 *
phosphorus 1 405 405.00 18.8630 0.000442 ***
Residuals 17 365 21.47
You will note that this table is identical to the two-way ANOVA table (without interaction) on pp 182 (notes). This is no coincidence! In fact, the two-way ANOVA model (without interaction) and the multiple linear regression model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$$
are actually identical models. Therefore, fitting each one gives the same analysis and the same conclusions.
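One way to convince yourself of this is to fit both parameterizations and compare; a minimal sketch, assuming nitrogen and phosphorus hold the numerical amounts:

# Each factor has only two levels, so the factor and numeric codings
# span the same column space and give identical fitted values
fit.anova <- lm(yield ~ factor(nitrogen) + factor(phosphorus))
fit.reg   <- lm(yield ~ nitrogen + phosphorus)
all.equal(fitted(fit.anova), fitted(fit.reg))   # TRUE
anova(fit.anova)   # same SS, F statistics, and p-values as anova(fit.reg)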
PAGE 185
7.3 Example: A 2^4 experiment without replication
Example 7.3. A chemical product is produced in a pressure vessel. A factorial experiment is carried out to study the factors thought to influence the filtration rate of this product. The four factors are temperature (A), pressure (B), concentration of formaldehyde (C), and stirring rate (D). Each factor is present at two levels (e.g., "low" and "high"). A 2^4 experiment is performed with one replication; the data are shown below.
                Factor                          Filtration rate
Run    A    B    C    D    Run label            (Y, gal/hr)
 1     -    -    -    -    a1 b1 c1 d1           45
 2     +    -    -    -    a2 b1 c1 d1           71
 3     -    +    -    -    a1 b2 c1 d1           48
 4     +    +    -    -    a2 b2 c1 d1           65
 5     -    -    +    -    a1 b1 c2 d1           68
 6     +    -    +    -    a2 b1 c2 d1           60
 7     -    +    +    -    a1 b2 c2 d1           80
 8     +    +    +    -    a2 b2 c2 d1           65
 9     -    -    -    +    a1 b1 c1 d2           43
10     +    -    -    +    a2 b1 c1 d2          100
11     -    +    -    +    a1 b2 c1 d2           45
12     +    +    -    +    a2 b2 c1 d2          104
13     -    -    +    +    a1 b1 c2 d2           75
14     +    -    +    +    a2 b1 c2 d2           86
15     -    +    +    +    a1 b2 c2 d2           70
16     +    +    +    +    a2 b2 c2 d2           96
NOTATION: When discussing factorial experiments, it is common to use the symbol "-" to denote the low level of a factor and the symbol "+" to denote the high level. For example, the first row of the table above indicates that each factor (A, B, C, and D) is run at its low level. The response Y for this run is 45 gal/hr.
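To reproduce this design in R, the 16 runs can be built in the standard order shown above; a minimal sketch (expand.grid varies its first factor fastest, which matches the run order in the table):

# Build the 2^4 design and attach the filtration rates in run order
design <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"),
                      C = c("c1", "c2"), D = c("d1", "d2"))
design$filtration <- c(45, 71, 48, 65, 68, 60, 80, 65,
                       43, 100, 45, 104, 75, 86, 70, 96)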
PAGE 186
NOTE: In this experiment, there are k = 4 factors, so there are 15 effects to estimate:
- the 4 main effects: A, B, C, and D
- the 6 two-way interactions: AB, AC, AD, BC, BD, and CD
- the 4 three-way interactions: ABC, ABD, ACD, and BCD
- the 1 four-way interaction: ABCD.
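These counts are just binomial coefficients, which is easy to verify in R:

choose(4, 1:4)        # 4 main effects, 6 two-way, 4 three-way, 1 four-way
sum(choose(4, 1:4))   # 15 effects in total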
In this 2^4 experiment, we have 16 values of Y and 15 effects to estimate. This means that only one degree of freedom is left over, and it is used to estimate the overall mean. This leaves us with no observations (and therefore no degrees of freedom) to perform statistical tests. This is an obvious problem! Why? Because we have no way to judge which main effects are significant, and we cannot learn about how these factors interact.
TERMINOLOGY: A single replicate of a 2^k factorial experiment is called an unreplicated factorial. With only one replicate, as in Example 7.3, there is no internal error estimate, so we cannot perform statistical tests to judge significance. What do we do?
- One approach to the analysis of an unreplicated factorial is to assume that certain higher-order interactions are negligible and then combine their mean squares to estimate the error.
- This is an appeal to the sparsity of effects principle; that is, most systems are dominated by some of the main effects and low-order interactions, and most high-order interactions are negligible.
- To learn about which effects may be negligible, we can fit the full ANOVA model and obtain the SS attached to each of these 15 effects (see next page).
- Effects with large SS can be retained; effects with small SS can be discarded. A smaller model with only the large effects can then be fit. This smaller model will have an error estimate formed by taking all of the effects with small SS and combining them together.
PAGE 187
ANALYSIS: Here is the R output summarizing the fit of the full model:
> # Fit full model
> fit = lm(filtration ~ A*B*C*D)
> anova(fit)
Analysis of Variance Table
Df Sum Sq Mean Sq F value Pr(>F)
A 1 1870.56 1870.56
B 1 39.06 39.06
C 1 390.06 390.06
D 1 855.56 855.56
A:B 1 0.06 0.06
A:C 1 1314.06 1314.06
B:C 1 22.56 22.56
A:D 1 1105.56 1105.56
B:D 1 0.56 0.56
C:D 1 5.06 5.06
A:B:C 1 14.06 14.06
A:B:D 1 68.06 68.06
A:C:D 1 10.56 10.56
B:C:D 1 27.56 27.56
A:B:C:D 1 7.56 7.56
Residuals 0 0.00
Warning message: ANOVA F-tests on an essentially perfect fit are unreliable
NOTE: From this table, it is easy to see that the effects
A, C, D, AC, AD
are far more significant than the others. For example, the smallest SS in this set is 390.06 (Factor C), which is over 5 times larger than the largest remaining SS (68.06). As a next step, we therefore consider fitting a smaller model with these 5 effects only. This will free up 10 degrees of freedom that can be used to estimate the error variance.
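A quick way to see this ordering is to rank the effect sums of squares from the full fit; a minimal sketch, assuming the variables used in the full-model fit above:

# Rank the 15 effect sums of squares from the saturated model
full <- lm(filtration ~ A*B*C*D)
tab  <- anova(full)                 # F-tests are unreliable here; use the SS only
ss   <- setNames(tab[["Sum Sq"]], rownames(tab))
sort(ss[names(ss) != "Residuals"], decreasing = TRUE)   # A, AC, AD, D, C lead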
ANALYSIS: Here is the R output (next page) summarizing the fit of the smaller model that includes only A, C, D, AC, and AD:
PAGE 188
[Figure: two interaction plots with x-axis Factor.A (a1, a2) and y-axis Filtration rate (gal/hr) running from 40 to 100; the left panel traces Factor.C (c1, c2) and the right panel traces Factor.D (d1, d2).]
Figure 7.5: Left: Interaction plot for temperature (Factor A) and concentration of
formaldehyde (Factor C). Right: Interaction plot for temperature (Factor A) and stirring
rate (Factor D).
> # Fit smaller model
> fit = lm(filtration ~ A + C + D + A:C + A:D)
> anova(fit)
Analysis of Variance Table
Df Sum Sq Mean Sq F value Pr(>F)
A 1 1870.56 1870.56 95.865 1.928e-06 ***
C 1 390.06 390.06 19.990 0.001195 **
D 1 855.56 855.56 43.847 5.915e-05 ***
A:C 1 1314.06 1314.06 67.345 9.414e-06 ***
A:D 1 1105.56 1105.56 56.659 1.999e-05 ***
Residuals 10 195.13 19.51
NOTE: It is clear that these five effects are each significant (note that the p-values are all very close to zero). Interaction plots for temperature (Factor A) and concentration of formaldehyde (Factor C), and for temperature (Factor A) and stirring rate (Factor D), are in Figure 7.5. These plots depict the strong pairwise interactions that exist.
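Plots like those in Figure 7.5 can be produced with the built-in interaction.plot() function; a minimal sketch for the left panel, assuming the factor vectors A and C and the response filtration from before:

# Interaction plot for temperature (A) and formaldehyde concentration (C)
interaction.plot(x.factor = A, trace.factor = C, response = filtration,
                 xlab = "Factor.A", ylab = "Filtration rate (gal/hr)",
                 trace.label = "Factor.C")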
PAGE 189
REGRESSION: In Example 7.3, there were no numerical values attached to the levels of temperature (Factor A), concentration of formaldehyde (Factor C), and stirring rate (Factor D). Therefore, if we want to fit a regression model (e.g., for prediction purposes, etc.), we can use the following variables with arbitrary numerical codings assigned:
x_1 = temperature (-1 = low; +1 = high)
x_2 = concentration of formaldehyde (-1 = low; +1 = high)
x_3 = stirring rate (-1 = low; +1 = high).
With these values, we can fit the multiple linear regression model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1}x_{i2} + \beta_5 x_{i1}x_{i3} + \epsilon_i.$$
Doing so in R gives the following output:
> fit = lm(filtration ~ temp + conc + stir.rate + temp:conc + temp:stir.rate)
> summary(fit)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.062 1.104 63.444 2.30e-14 ***
temp 10.812 1.104 9.791 1.93e-06 ***
conc 4.938 1.104 4.471 0.00120 **
stir.rate 7.313 1.104 6.622 5.92e-05 ***
temp:conc -9.062 1.104 -8.206 9.41e-06 ***
temp:stir.rate 8.312 1.104 7.527 2.00e-05 ***
Residual standard error: 4.417 on 10 degrees of freedom
Multiple R-squared: 0.966, Adjusted R-squared: 0.9489
F-statistic: 56.74 on 5 and 10 DF, p-value: 5.14e-07
This output gives the values of the least squares estimates
$$\mathbf{b} = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{pmatrix} = \begin{pmatrix} 70.062 \\ 10.812 \\ 4.938 \\ 7.313 \\ -9.062 \\ 8.312 \end{pmatrix}.$$
PAGE 190
Therefore, the fitted least squares regression model for the filtration rate data is
$$\widehat{Y} = 70.062 + 10.812x_1 + 4.938x_2 + 7.313x_3 - 9.062x_1x_2 + 8.312x_1x_3,$$
or, in other words,
$$\widehat{\rm FILT} = 70.062 + 10.812\,{\rm TEMP} + 4.938\,{\rm CONC} + 7.313\,{\rm STIR} - 9.062\,{\rm TEMP{*}CONC} + 8.312\,{\rm TEMP{*}STIR}.$$
This fitted regression model can be used to produce confidence intervals or prediction intervals for the filtration rate at given values of temperature, concentration, and stirring rate.
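For example, here is a minimal sketch of a 95% prediction interval at high temperature, low concentration, and high stirring rate (coded +1, -1, +1), assuming fit is the lm object from the output above:

# Minimal sketch: 95% prediction interval for filtration rate
new.run <- data.frame(temp = 1, conc = -1, stir.rate = 1)
predict(fit, newdata = new.run, interval = "prediction", level = 0.95)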
ANOVA TABLE: I have constructed the analysis of variance table for the multiple linear regression fit in this example:
> anova(fit)
Analysis of Variance Table
Df Sum Sq Mean Sq F value Pr(>F)
temp 1 1870.56 1870.56 95.865 1.928e-06 ***
conc 1 390.06 390.06 19.990 0.001195 **
stir.rate 1 855.56 855.56 43.847 5.915e-05 ***
temp:conc 1 1314.06 1314.06 67.345 9.414e-06 ***
temp:stir.rate 1 1105.56 1105.56 56.659 1.999e-05 ***
Residuals 10 195.13 19.51
Note that this table is identical to the ANOVA table on pp 189 (notes).
REMARK: In this chapter, we have only just scratched the surface when it comes to discussing factorial treatment structures. Specialized courses in experimental design, such as STAT 506, would delve into more advanced designs and analysis techniques. For example, a design that often arises in industrial experiments is that of running a 2^k factorial experiment in fewer than 2^k runs; these are called fractional factorial experiments. In these experiments, the engineer acknowledges a priori that the highest-order interactions are negligible, and the goal is to assess main effects and lower-order interactions only. For those who are interested, Section 7.3 (VK) introduces this concept. An illustrative example is Example 7.7 on pp 499 (VK).
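As a taste of what such a design looks like, the add-on FrF2 package (an assumption; it is not used elsewhere in these notes) can generate a half fraction of the 2^4 design in Example 7.3:

# Sketch: a 2^(4-1) fractional factorial, i.e., 8 of the 16 runs
library(FrF2)        # assumes the FrF2 package is installed
FrF2(nruns = 8, nfactors = 4)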
PAGE 191