Lecture notes
Jennie Hansen
George Streftaris
Contents

Introduction
1 Collecting Data
1.1 Sampling
1.2 Experimentation
1.3 Measurement
1.4 Looking at data intelligently
Introduction
The use of statistical arguments and the conclusions of statistical analyses are
used to guide and inform many decisions in society. Some examples include:
Global warming: The scientific claims both for and against global
warming are often based on statistical analyses - so how can we judge
the validity of the underlying statistical analyses on both sides of the
argument?
Clinical research: NICE (National Institute for Health and Clinical
Excellence) recommends whether new drugs should be made available
on the NHS. Their recommendations are based on statistical analysis
of the effectiveness of the drug. How are the clinical experiments designed and the data collected in order to prove/disprove clinical claims of
effectiveness?
Psychological research: You may have heard people say that women
are better at multi-tasking than men, whereas men are better at map
reading than women. What is the basis (if any!) of such claims? How
did researchers design an experiment to investigate these differences between men and women and how do they know that the differences that
were observed between men and women were statistically significant
and could be attributed to a difference in gender?
Opinion polls (e.g. market research, television ratings, etc.): Many
corporate decisions are based on survey results. It is important to understand what the percentages in such surveys mean, how they were
obtained, and how accurate the information is - otherwise costly mistakes can be made.
Life insurance and pensions: Life insurance premiums and pension calculations are based on estimates of life expectancy. These estimates
are based on statistical analyses of mortality data. Life insurance companies and pension providers would find it difficult to meet their obligations if the underlying statistical calculations were incorrect!
In all of the above examples, the aim of the underlying statistical analysis is
to provide insight by means of numbers. This process usually involves three
stages:
1. Collecting data
2. Organising data
3. Drawing conclusions from the data (inference)
In this module, we look at the statistical principles and techniques used at
each of these three stages in an analysis. At the end of this module students
should be able to
understand the logic of statistical reasoning
understand how statisticians come to their conclusions
be able to evaluate the use of statistical methods in a variety of applications (e.g. science, finance, the media, etc.)
Developing these skills is an important step in the development of the critical
thinking skills which are required in every aspect of professional life.
Chapter 1
Collecting Data
1.1 Sampling
Samuel Johnson is reported to have said, "You don't have to eat the whole
ox to know that the meat is tough." This is one way of describing the idea
behind sampling. Sampling is a way to gain information about the whole by
examining only a part of the whole. Sampling is widely used by industry,
social scientists, political parties, the media, etc.
Example 1.1.1 When a newspaper reports that 34% of the Scottish electorate support independence for Scotland, someone, somewhere, has to have
asked some voters (but clearly not every voter) their opinion. The reported
percentage is based on the sample of voters that were questioned. This is an
example of sampling in order to obtain information about the whole.
Whenever statisticians discuss sampling they use certain precise terms to
describe the procedure of sampling:
Population - this is the entire group of objects about which information is
sought.
Unit - any individual member of the population.
Sample - a subset of the population which is used to gain information about
the population.
Sampling frame - this is the list or group of units from which the sample
is chosen.
Variable - a characteristic of a unit which is to be measured for those units
in the sample.
Discussion
In the above example, the population is the entire Scottish electorate. The
sampling frame is not usually the same as the population. For example, if
the sample was chosen by selecting a subset from the electoral roll, then
the sampling frame would be the electoral roll - however, the electoral roll
does not necessarily contain the name of every eligible Scottish voter. In the
example the variable to be measured is opinion on Scottish independence,
e.g. support/don't support independence.
Example 1.1.2 Acceptance sampling is used by purchasers to check the
quality of the items in a large shipment. In practice, manufacturers often
incorporate acceptance-sampling procedures in their contracts with suppliers.
The purchaser and supplier agree that a sample of parts from a large shipment
will be selected and inspected and, based on the number of parts in the sample
which meet the purchaser's specifications, the shipment is either accepted or
rejected.
Discussion
In the above example, the population and the sampling frame are the same,
i.e. the shipment of items, and the variable measured is whether or not a
part is defective.
Sampling design
We have seen that we start with a question about a population and then
take a sample from the population in order to answer our question about the
population. This approach will give us a meaningful answer provided we can
be confident that the sample is (roughly) representative of the population.
Problem: How should we go about selecting a (representative) sample from
the population?
One possibility is to select a sample consisting of the entire population - this
is what happens when the government conducts a census. In this case we
would obtain the exact answer to our question. It might seem that this
would be the ideal approach, but it is usually problematic to sample the
entire population. Some problems with this approach include:
expensive, time-consuming,
problems with acceptance sampling if units are destroyed as part of the
sampling procedure (e.g. testing the tolerance of fuses in a sample).
Ideally, then, a sample should be chosen in such a way that no subset of n
units is more likely to be selected than any other.
A simple random sample (SRS) of size n is a sample chosen in such a
way that every collection of n units from the sampling frame has the same
chance of being selected.
Remark
By taking a simple random sample, we can avoid some of the problems
associated with convenience sampling because no part of the population is
likely to be over (or under) represented in a simple random sample.
Question: How can we select a simple random sample?
One method is to use physical mixing. Physical mixing is the method that
is used to select the lottery numbers each week. The lottery draw works as
follows:
Start with 49 identical balls and label the balls with the numbers 1 to
49. Then thoroughly mix the balls and select a ball at random from the
49 balls (i.e. choose it mechanically or blindly from the 49 balls).
Key point: at this first stage, every ball has the same chance to be
selected!
After selecting the first ball, the remaining balls are thoroughly mixed
and a second ball is selected at random from the 48 balls.
Note: if the mixing at the second stage is thorough, then any of the
remaining 48 balls has the same chance of being selected. So, after two
stages in the draw, we have a simple random sample of size 2.
This procedure is repeated 4 more times to obtain a total of 6 balls and
because at each stage any of the remaining balls is equally likely to be
selected, the resulting sample of 6 balls is a simple random sample of
size 6 from the original 49 balls, and the numbers on the selected balls
correspond to a simple random sample from the numbers 1, 2, ..., 49.
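The staged draw described above can be simulated in a few lines of Python; because each stage is a uniform choice among the remaining balls, the result is a simple random sample. (The function name lottery_draw is ours.)

```python
import random

def lottery_draw(n_balls=49, n_drawn=6, rng=None):
    """Simulate the staged lottery draw: at each stage every remaining
    ball is equally likely to be chosen, so the final set is a simple
    random sample of size n_drawn from the numbers 1..n_balls."""
    rng = rng or random.Random()
    balls = list(range(1, n_balls + 1))
    drawn = []
    for _ in range(n_drawn):
        # "thorough mixing" followed by a blind selection is modelled
        # as a uniform choice among the remaining balls
        ball = rng.choice(balls)
        balls.remove(ball)
        drawn.append(ball)
    return drawn

sample = lottery_draw(rng=random.Random(1))
```

Running the function repeatedly gives different samples of 6 distinct numbers from 1 to 49, exactly as the weekly draw does.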
Remarks
Selecting a random sample using physical mixing is not as easy as it might
appear. One key feature of this approach is that at each stage we must be
able to thoroughly mix the set of objects from which we are selecting a random
object. In the case of the lottery, much effort goes into verifying that the
machine that mixes the balls does, in fact, thoroughly mix the balls before
each draw and that there is no bias in the way that the machine selects
each ball (just imagine the uproar if the mixing stage of the procedure was
found to be biased in some way!)
In other situations it can be more difficult to guarantee that the objects
have been thoroughly mixed. For example, when playing card games, someone usually shuffles the deck before dealing the cards to the players. The
purpose of shuffling the deck is to physically mix the deck in such a way that
each player has the same chance of getting any particular card in their hand.
However, does shuffling a deck really thoroughly mix the deck?! Also, how
many times should a fresh deck be shuffled in order to ensure that it has
been thoroughly mixed? Casinos are interested in the answers to these questions and have even commissioned statisticians to work out the answers! It
turns out that a fresh deck should be shuffled 7 times in order to thoroughly
mix the deck (see http://en.wikipedia.org/wiki/Persi_Diaconis and
http://homepage.mac.com/bridgeguys/SGlossary/ShuffleofCards.html).
Another problem with using physical mixing to select a simple random
sample is that the sampling frame may be too large for this method to be
practical. For example, suppose the Admissions Office at Heriot-Watt would
like to select a sample of 50 first-year students to interview regarding their
reasons for choosing to go to university. It wouldn't be practical to get several
hundred identical balls, put student names on the balls, mix the balls, and
select a sample of 50 balls in the same way that the lottery balls are drawn!
Even if we tried to do this (and could find a container big enough to hold all
the balls), there would still be the problem of making sure that the balls were
thoroughly mixed before each draw. It was exactly these sorts of difficulties
that were encountered when the US army conducted the first lottery in 1970
to determine who would be drafted into the army. The order in which men
between 19 and 25 were to be drafted was determined by drawing capsules
which contained birthdays out of a box - men with the first birthday to
be drawn would be drafted first, then the men with the next birthday, etc.
However, because of a flawed mixing process, in the 1970 lottery it turned
out that capsules containing birthdays later in the year were more likely to
be drawn than capsules containing birthdays in January!
Question: So, if it is difficult and sometimes not practical to use physical
mixing to select a simple random sample, what can we do instead?
Answer: One way to select a simple random sample (even from a very large
sampling frame) is to use a table of random digits.
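The random-digit method can be sketched as follows: label the units of the frame with equal-width numeric labels, then read labels off the digit stream, skipping any that are out of range or already selected. The student frame below is hypothetical, and a seeded pseudo-random generator stands in for a printed table of random digits.

```python
import random

def srs_from_frame(frame, n, digit_stream):
    """Select a simple random sample of n units by the random-digit
    method: label the units 0..len(frame)-1 with equal-width labels,
    then read labels from the digit stream, discarding labels that
    are out of range or already chosen."""
    width = len(str(len(frame) - 1))   # digits needed per label
    chosen = []
    buf = ""
    for d in digit_stream:
        buf += str(d)
        if len(buf) == width:
            label, buf = int(buf), ""
            if label < len(frame) and label not in chosen:
                chosen.append(label)
                if len(chosen) == n:
                    break
    return [frame[i] for i in chosen]

# Hypothetical frame of first-year students; a seeded generator plays
# the role of the table of random digits.
rng = random.Random(42)
digits = (rng.randrange(10) for _ in range(10_000))
students = [f"student_{i}" for i in range(700)]
interviewees = srs_from_frame(students, 50, digits)
```

Because every valid label is equally likely at each step, every subset of 50 students has the same chance of selection, which is exactly the SRS requirement.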
Once we have selected a simple random sample from the population, there
are some questions that we need to resolve:
How can we justify using information from the sample to tell us something about the population - especially if the sample is only a small
subset of the population?
How does information from the sample tell us something about the
population?
To begin thinking about these questions we need to introduce some more
terminology.
A parameter is a numerical characteristic of the population. It is a
fixed number, but we (usually) do not know its value.
A statistic is a numerical characteristic of the sample. The value of
the statistic is known once we have taken a sample, but its value
changes from sample to sample.
Typically, when we want to know something about a population, our question
about the population can be expressed in terms of an unknown parameter.
After taking a sample from the population, we compute the value of an
associated statistic and use the value of this statistic to estimate the value
of the unknown population parameter.
Example 1.1.5 In the 1980s Newsday (an American weekly news magazine) surveyed a random sample of 1373 parents. The magazine wanted to
determine the proportion, p, of parents in the American population who, if
given the choice, would have children again. In the sample, 1249 parents said
that they would, if given the choice, have children again. So, the proportion
in the sample was p̂ = 1249/1373 ≈ 0.91. Note: The fraction p̂ which was computed
from the sample is a statistic and we use it to estimate the population
parameter p.
Now suppose Newsday selected another random sample of size 1373. We
would not expect the number in the second sample who would have children
again to be exactly the same as the number in the first sample, but we
might still expect the proportion in the sample to be close to 0.91, the
proportion in the first sample. In fact, we intuitively expect that if we were
to repeat the sampling many times, the values of the sample proportion
would cluster around the true population proportion.
[Figure: bar graph of the sampling distribution of the sample proportion for
200 simple random samples of size 25; the observed proportions range from
0.0 to 0.44 and cluster around the population proportion p = 0.2.]
Discussion
1. The bar graph above (which describes the sampling distribution) shows
that the values of the sample proportions obtained from the 200 repetitions are (more or less) symmetrically clustered around the population
parameter p = 0.2. This symmetry arises because taking a simple random sample is an unbiased sampling procedure. In this example, we
know the value of the population parameter p, but even if p were unknown, it would still be true that the sample proportions from repeated
sampling would be clustered around the population parameter p.
2. On the other hand, if our sampling procedure is biased, the values of the
sample proportions will tend to be clustered on one side or the other
of the population parameter p, and the bulge in the sampling distribution
will be on one side or the other of the population parameter p. This is
because bias in the sampling method means that the sampling statistic
tends to either overestimate or underestimate the population parameter
when we repeatedly sample from the population.
3. The spread of the values of the sample proportions gives an indication
of the precision of the sampling method. We will always see some
spread in the values taken by a sample statistic when we repeatedly
sample because there is sampling variability. Since we cannot eliminate
sampling variability, our goal must be to try to reduce the spread in
the sampling distribution of our statistic (i.e. to increase the precision
of the sample statistic).
Question: How can we improve the precision of a statistic obtained from a
simple random sample?
In this (and many other situations) the precision of a statistic which is based
on a simple random sample can be increased by increasing the size of the
simple random sample. To illustrate this, I used a computer to perform 200
repetitions of sampling 100 beads from 5000 (sample size =100) and 200 repetitions of sampling 250 beads from 5000 (sample size =250). The sampling
distributions for the corresponding sample proportions are represented below
by bar graphs.
[Figure: Sampling distribution of the sample proportions: sample size 100
(200 repetitions).]

[Figure: Sampling distribution of the sample proportions: sample size 250
(200 repetitions).]
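The repeated-sampling experiment described above is easy to reproduce. This sketch uses the bead population of the example (5000 beads, of which 20% are "black", coded as 1); the function name sample_proportions is ours.

```python
import random

POP_SIZE, P = 5000, 0.2
# Population of 5000 beads: 1000 black (coded 1), 4000 others (coded 0)
population = [1] * int(POP_SIZE * P) + [0] * int(POP_SIZE * (1 - P))
rng = random.Random(0)

def sample_proportions(n, repetitions=200):
    """Draw `repetitions` simple random samples of size n (without
    replacement) and return the sample proportion from each."""
    return [sum(rng.sample(population, n)) / n for _ in range(repetitions)]

# Range of observed sample proportions for each sample size
spreads = {}
for n in (25, 100, 250):
    props = sample_proportions(n)
    spreads[n] = max(props) - min(props)
```

Printing spreads shows the range of observed proportions shrinking as n grows, which is the increase in precision discussed below.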
Discussion
Let's look a bit more closely at the bar graphs above which correspond to
the sampling distribution of the sample proportion p̂ when the sample size
equals 25, 100, and 250 respectively. We can see that as the sample size
increases, the values of p̂ (obtained from repeated samples) become more
tightly clustered around the population parameter p = 0.2:
When the sample size is 25, the values of p̂ range between 0.0 and
0.44. However, not many of the observed values were as far away from
the true proportion p = 0.2 as either the value 0.0 or 0.44. In fact,
194 of the values of p̂ (97% of the 200 values) lie somewhere between
0.04 and 0.36. This indicates that if we were to take another simple
random sample of size 25 from the 5000 balls then there is a good
chance that the difference between the sample proportion p̂ computed
from our sample and the true proportion p = 0.2 is no more than 0.16.

When the sample size is 100, the values of p̂ range between 0.08
and 0.28. Again, not many of the observed values were as far away
from the true proportion p = 0.2 as either the value 0.08 or 0.28. In
fact, 197 of the values of p̂ (98.5% of the 200 values) lie somewhere
between 0.12 and 0.26. In this case, if we were to take another simple
random sample of size 100 from the 5000 balls then there is a good
chance that the difference between the sample proportion p̂ computed
from our sample and the true proportion p = 0.2 is no more than 0.08.

When the sample size is 250, the values of p̂ range between 0.14
and 0.26. However, not many of the observed values were as far away
from the true proportion p = 0.2 as either the value 0.14 or 0.26. In
fact, 194 of the values of p̂ (97% of the 200 values) lie somewhere
between 0.15 and 0.25. In this case, if we were to take another simple
random sample of size 250 from the 5000 balls then there is a good
chance that the difference between the sample proportion p̂ computed
from our sample and the true proportion p = 0.2 is no more than 0.05.
So, by looking at the sampling distribution, we can work out the likely
range of values for p̂, and we can see that this range of values is smaller the
larger the sample size. So, precision increases as the sample size increases.
In the example above we knew the value of the population proportion
p = 0.2, but our observation that the precision of the sample proportion p̂
increases as the sample size increases holds no matter what the population
proportion p equals. In fact, statisticians have studied examples like our
model and have worked out the following rule of thumb for determining the
precision of p̂ in terms of the sample size:
Rule of thumb: Suppose that you select a simple random sample of size
n from a (much larger) population. Then there is a good chance that
the magnitude of the difference between the sample proportion p̂ and the
population parameter p is less than 1/√n.
One consequence of this rule of thumb:
The precision of a sample statistic depends on the size of the sample
but not on the size of the population provided the population is much
larger than the sample.
For example (provided the population is very large), the sample proportion
p̂ computed from a simple random sample of size 1000 from the population
is likely to differ from the (unknown!) population parameter by at most
1/√1000 ≈ 0.03.
Another consequence of the rule of thumb is that it allows us to determine how big the sample must be in order to achieve a prescribed level of
precision. For example, suppose a national radio station wishes to determine
the proportion p of the population that listen to their station at least once
during a typical week.
Question: Given that the station is able to select a simple random sample
from the population of radio listeners, how big a sample should they select
in order to have a good chance that the sample proportion p̂ differs from
the population proportion by at most 0.02 (i.e. there is at most a 2% error)?
Answer: The statisticians' Rule of Thumb says that we should choose the
sample size n so that

    1/√n = 0.02.

We can re-arrange this equation to get √n = 1/0.02 = 50, i.e. n = 2500.
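The rule-of-thumb calculations above can be checked in a few lines of Python (the helper names margin and sample_size_for are ours):

```python
import math

def margin(n):
    """Rule-of-thumb precision for a simple random sample of size n:
    the sample proportion is likely within 1/sqrt(n) of the truth."""
    return 1 / math.sqrt(n)

def sample_size_for(precision):
    """Smallest n with 1/sqrt(n) <= precision, i.e. n = (1/precision)^2
    (re-arranging the rule of thumb)."""
    return math.ceil((1 / precision) ** 2)

n_radio = sample_size_for(0.02)   # the radio-station question: n = 2500
m_1000 = margin(1000)             # size-1000 sample: about 0.03
```

Note how quickly the required sample size grows: halving the error from 0.02 to 0.01 quadruples n from 2500 to 10000.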
Note: Now that you have a rule of thumb for determining the precision of
a sample proportion you can look out for mistaken criticisms of statistics in
the media. For example, a journalist criticised the Newsday results about
parenting (see Example 1.1.5 above) by saying that a random sample of size
1373 was too small to give any meaningful information about a population
of several million. But our rule of thumb says that the sample proportion p̂
calculated from a simple random sample of size 1373 is likely to differ from
the true proportion in the population by at most 1/√1373 ≈ 0.027!
Summary:
Despite the sampling variability of a statistic computed from a simple
random sample (i.e. the value of the statistic varies from sample to
sample), the values of the statistic have a sampling distribution which
can be observed by looking at a frequency bar graph for the values of
the statistic which are obtained by repeated sampling.
When the sampling frame consists of the entire population (as it did
in our example of sampling balls), then the values of the statistic computed from repeated simple random samples from the entire population
neither consistently overestimate nor consistently underestimate the
value of the population parameter that we wish to estimate. In other
words, simple random sampling produces unbiased estimates and the
sampling distribution of the statistic bulges around the value of the
population parameter.
The sampling distribution associated with a statistic computed from a
sample gives an indication of the precision of the statistic (i.e. we can
get a rough idea from the sampling distribution of the magnitude of
the typical difference between the value of the statistic computed from
the sample and the value of the population parameter). The precision
of a statistic computed from a simple random sample depends on the
size of the sample and can be made as high as desired by taking a large
enough sample.
Errors in sampling
In the discussion above we saw that we can always expect that there will be
a difference between the value of a sample statistic and the (unknown) population parameter that we wish to determine. In Example 1.1.6 above, the
discrepancy between the sample proportion and the population proportion
p = 0.2 was caused by chance in selecting the random sample (i.e. due to
chance, we can't guarantee that the proportion of black beads in the sample
will be exactly the same as the proportion of black beads in the population). We also saw that for this very simple example, we could reduce the
discrepancy between the value of the sample proportion p̂ and the population proportion p = 0.2 by increasing the sample size. Unfortunately, not all
discrepancies between sample statistics and population parameters are the
result of chance errors (which can be reduced by increasing sample size).
In particular, whenever we are sampling from a human population there are
other sources of error that we need to watch out for. Some typical examples
of these are described below.
Sampling errors
Sampling errors are errors that arise from the act of taking a sample
and cause the sample statistic to differ from the population parameter. Sampling errors arise because the sample is a subset of the entire population.
There are two types of sampling error:
Random sampling errors are the deviations between the sample statistic and the population parameter which are caused by chance when we
select a random sample. The deviations between the sample proportions and the population proportion observed in the example above are
random sampling errors.
Nonrandom sampling errors arise from improper sampling methods.
These errors can arise because the sampling method is inherently biased
(e.g. convenience sampling). Nonrandom sampling errors can also arise
when the sampling frame (the list from which the sample is drawn)
differs systematically from the population.
Example 1.1.7 Suppose that a polling organisation has been commissioned
to determine the proportion of Edinburghs population who favour the introduction of a congestion charge. The polling organisation decides to use the
Edinburgh telephone directory as a sampling frame (i.e. list) from which to
select a random sample to survey.
Question: Will a sample chosen in this way be representative of the population (i.e. adults who live in Edinburgh)?
Answer: The problem in this example is that using the telephone directory
means people without landline phones (e.g. students and others who rely
primarily on mobile phones and people who can't afford a telephone landline)
and those who are ex-directory will not be part of any sample chosen using
this method. The random sample selected by the polling organisation will be
representative of the population of landline phone users but won't necessarily
be representative of the population under investigation (i.e. adults who live in
Edinburgh). If the views of the excluded members of the population differ
significantly from those who are listed in the telephone directory, then the
sample statistic will be a biased estimate of the population parameter.
Note: If the sampling frame differs systematically from the population, sample statistics will be biased no matter how the sample is selected from the
sampling frame. In other words, simple random sampling cannot give us
unbiased statistics if the sampling frame differs systematically from the population.
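The effect of a sampling frame that differs systematically from the population can be illustrated with a small simulation of the congestion-charge example. All the numbers below (population size, landline coverage, favour rates) are invented for illustration:

```python
import random

rng = random.Random(3)

# Hypothetical population of 100,000 Edinburgh adults. Suppose 45% of
# the 80,000 with landlines favour the charge but only 20% of the
# 20,000 without, so the overall rate is 0.8*0.45 + 0.2*0.20 = 0.40.
population = ([("landline", i < 36_000) for i in range(80_000)] +
              [("mobile_only", i < 4_000) for i in range(20_000)])

# The flawed sampling frame: the telephone directory (landlines only).
frame = [u for u in population if u[0] == "landline"]

def favour_rate(units):
    return sum(fav for _, fav in units) / len(units)

true_rate = favour_rate(population)                  # exactly 0.40
biased_est = favour_rate(rng.sample(frame, 5000))    # SRS from the frame
```

Even though the 5000-unit sample is a genuine simple random sample from the frame, its estimate hovers around 0.45 rather than the true 0.40, and no increase in sample size will remove that gap.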
Nonsampling errors
Nonsampling errors are errors that are not related to the act of selecting
a sample from the population. These errors can occur even if we used
the entire population as our sample. Here are some typical sources of
nonsampling errors:
Missing data: Sometimes information from a sample is incomplete
because it was not possible to contact some members of the sample or
because some members of the sample refuse to respond. Even if the
entire population was used as the sample, missing data could cause the
results of a survey to be biased if the people who can't be contacted or
who refuse to respond differ in some specific way from the population
as a whole.
Response errors: Some members of a sample may give wrong answers when surveyed. For example, subjects may lie about their age,
weight, income, use of alcohol, cigarettes and drugs, etc. Even subjects that are trying to answer a question truthfully may answer incorrectly because they can't estimate very accurately exactly how many
times they go up and down stairs in a day or how much tea they drink
in a day, etc. Other subjects may exaggerate their answers.
Processing errors: These errors usually occur at some stage in the
process of entering raw data into a computer. Sometimes big errors can
occur simply because a zero has been added or deleted as a number is
recorded. These errors are often spotted by asking whether the results
make numerical sense.
Errors due to the method used to collect data once the sample
has been selected: Once the sample has been selected, the data has
to be collected. In the case of surveys (e.g. market research or opinion
polls) a decision must be made whether to contact subjects in the
sample by post, telephone, online, or by personal interview. Each of
these methods can lead to bias in the results.
Postal surveys are relatively inexpensive but response rates can
be low and, depending on the nature of the survey, there can also
be voluntary response bias.
Telephone surveys use computers to randomly dial numbers (so
even unlisted numbers can be reached). They are also relatively
inexpensive. However, there are still households (mostly poorer
ones) that do not have a telephone, so this leads to some nonrandom sampling error. It is also important in telephone sampling to
try the same number several times and at different times of the
day - otherwise the sample will only contain those who are usually
at home at a certain time of the day.
Some organisations, such as YouGov, now carry out surveys online. Again, not everyone is online so there is potential for some
nonrandom sampling error since those who cannot be reached by
an online survey may be different in some important way from
those that can be contacted online.
Personal (e.g. face-to-face) interviews can result in a higher response rate but can be expensive to conduct. Also, in some cases
face-to-face interviews can lead to response errors. For example,
a face-to-face interview about one's health or lifestyle might involve some embarrassing questions which some subjects would be
reluctant to answer.
Errors due to the wording of the questions in a survey: The
problem is that the wording of a question can be slanted to favour one
response over another. One way to slant a question is to pose it in
terms of a desirable outcome. For example, consider the following two
questions:
In practice, it is often not practical to take a simple random sample from the
population of interest. Some of the practical problems include:
The population may be so large that it is very difficult or too time-consuming to construct a (complete) sampling frame (e.g. it would be
quite time-consuming to construct a complete sampling frame for the
population of all Scottish high school students.)
The sampling frame may be so large (e.g. the electoral roll for the UK)
that it is technically difficult to select a simple random sample from it.
A simple random sample taken from a very large population (e.g. a
simple random sample from all UK adults) is likely to be geographically dispersed. If the sampling data is to be collected by interview,
then tracking down all the members of the simple random sample for
interview is both time-consuming and expensive.
To deal with these and other problems, statisticians and opinion pollsters
have developed more sophisticated methods for selecting representative
samples. Some examples of these more elaborate methods are described
below.
Multistage sampling
Let's consider the problem of interviewing a sample of size 500 from the
population of Scottish high school students. A simple random sample of size
500 from this population (supposing that we can obtain such a sample) is
very likely to be dispersed throughout Scotland and would be expensive to
interview. Instead, we could use the following approach to select a sample:
Select a random sample of size 20 from a list of all Scottish high schools.
Get the school roll for each high school in the sample of 20 schools and
select a random sample of size 25 from each school roll. This gives us
(in total) a sample of size 500.
Discussion
1. The key feature of this multistage sampling example is that at each
stage we make selections at random.
2. This procedure does not select a simple random sample since there are
some subsets of 500 students that are impossible to select by using the
procedure described above. For example, this procedure will never select a subset of students who attend 500 different schools. Nevertheless,
since at each stage we select schools and students at random, we avoid
some of the problems with bias which arise when we don't make sample
selections at random.
3. The other advantage of this multistage sampling design is that the interviewers only have to visit 20 schools rather than traveling to (possibly)
hundreds of different schools across Scotland.
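The two-stage procedure for sampling 500 Scottish high school students can be sketched as follows. The school names and rolls below are invented placeholders; the essential point is that a random selection is made at each stage.

```python
import random

rng = random.Random(7)

# Hypothetical frame: 300 schools, each with its own (made-up) roll.
schools = {f"school_{i}": [f"school_{i}_pupil_{j}" for j in range(150)]
           for i in range(300)}

# Stage 1: a random sample of 20 schools from the list of all schools.
chosen_schools = rng.sample(sorted(schools), 20)

# Stage 2: a random sample of 25 pupils from each chosen school's roll,
# giving a total sample of 20 * 25 = 500 pupils.
multistage_sample = [pupil
                     for school in chosen_schools
                     for pupil in rng.sample(schools[school], 25)]
```

Note that, as the discussion above explains, this is not a simple random sample of 500 pupils: any set of pupils spread over more than 20 schools has probability zero of being selected.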
Stratified sampling
To construct a stratified sample, we divide the population into distinct groups
which are called strata. Next, we decide how many units from each stratum
will be included in the sample (the number selected from each stratum will
often depend on what we want to know about the population). Finally, we
select a (simple) random sample of the designated size from each stratum and
combine these samples to form the stratified sample. To illustrate stratified
sampling, we will consider two examples.
Example 1.1.8 Suppose that the University library wants to conduct a survey of Heriot-Watt students studying on campus in order to determine student views on the service provided by the Riccarton library. The population
for this survey is the 6191 students studying on the Riccarton campus, of
which 4699 (75.9%) are undergraduates and 1492 (24.1%) are postgraduates.
A stratified sample of size 200 is obtained by selecting a simple random sample of size 152 from the undergraduates and a simple random sample of size 48 from the postgraduate students.
Discussion
1. By selecting a stratified sample, the library can guarantee that 76% of the students in the sample are undergraduates and 24% are postgraduates, closely matching the percentages in the whole population of students.
2. By selecting a simple random sample from each group, we can avoid
sampling bias and we can use data from each group to obtain unbiased
estimates for each group separately and for the whole population. For
example, suppose that 87 of the 152 undergraduates surveyed and 34 of the 48 postgraduates surveyed said that they "Strongly favour" longer library opening hours. From these data we can estimate p_UG = 87/152 = 0.572, the proportion of undergraduates who strongly favour longer opening hours, and p_PG = 34/48 = 0.708, the proportion of postgraduates who strongly favour longer opening hours. To estimate the proportion of all students who strongly favour longer opening hours, we work backwards and estimate how many students in each group strongly favour longer opening hours as follows:

0.572 × 4699 ≈ 2688 undergraduates
0.708 × 1492 ≈ 1056 postgraduates

So, we estimate that, in total, 2688 + 1056 = 3744 of the 6191 students strongly favour longer opening hours. This gives us an estimated proportion of

p_Total = 3744/6191 = 0.605.
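The arithmetic in Example 1.1.8 can be written as a small routine. A sketch, using the survey numbers above (the dictionary keys are just labels of my choosing):

```python
# Stratum sizes (N), sample sizes (n) and "Strongly favour" counts
# from the library survey in Example 1.1.8.
strata = {
    "undergraduate": {"N": 4699, "n": 152, "count": 87},
    "postgraduate":  {"N": 1492, "n": 48,  "count": 34},
}

def stratified_estimate(strata):
    """Scale each stratum's sample proportion up to the stratum size,
    then divide the estimated total by the population size."""
    population = sum(s["N"] for s in strata.values())
    estimated_total = sum(s["N"] * s["count"] / s["n"] for s in strata.values())
    return estimated_total / population

p_total = stratified_estimate(strata)  # ≈ 0.605
```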
Now let's look at another example.
Example 1.1.9 Suppose that the Admissions Office at Heriot-Watt wants
to conduct a survey of undergraduate entrants to find out what the students
thought of the service provided by the Admissions Office. The Office plans
to select a sample of 160 entrants to survey. Based on the data which is
summarised in the table displayed below, the Admissions Office identifies 3
distinct groups of entrants: Home/EU students on the Edinburgh campus,
Overseas students on the Edinburgh campus, and students on the Borders
Campus.
Heriot-Watt Undergraduate Entrants

                   Home/EU   Overseas   Total
Edinburgh Campus     1270        148     1418
Borders Campus        169          1      170
Total                1439        149     1588
In addition to finding out about general customer satisfaction, the Admissions Office would also like to investigate any differences between these
groups with respect to their satisfaction with the service provided. Now if
they select a simple random sample of size 160 from the 1588 entrants there
will be only a few Overseas students and Borders students in the sample because there are (relatively) few such students in the population of entrants.
To get better (i.e. more precise) information about these two groups it is
necessary to have more of these students in the sample. To accomplish this
the Admissions Office decides to select a stratified sample, and selects simple random samples of size 80, 40, and 40 from the Home/EU students
on the Edinburgh campus, the Overseas students on the Edinburgh campus,
and the students on the Borders Campus, respectively, in order to obtain a
sample of size 160 from the new entrants.
Discussion
1. In this example, the numbers selected from each group do not correspond to the relative sizes of these three groups in the population of
Heriot-Watt undergraduate entrants. This is because the Admissions
Office wants to get more precise information about Overseas students
on the Edinburgh campus and students on the Borders Campus, so it
selects relatively more students from these two groups. Nevertheless,
since we use simple random sampling to select the samples from each
stratum, we can use the data to obtain unbiased estimates for each
stratum. Here are the data:
68 out of 80 Home/EU students (Edinburgh campus),
27 of the 40 Overseas students (Edinburgh campus),
23 out of the 40 Borders campus students
reported that they were "Very satisfied" with the service provided by
the Admissions Office.
From these data we estimate the proportion "Very satisfied" in each stratum: p_H = 68/80 = 0.85, p_O = 27/40 = 0.675, and p_B = 23/40 = 0.575. Working backwards as in Example 1.1.8:

p_H × 1270 = 1079.5 Home/EU students (Edinburgh campus),
p_O × 148 = 99.9 Overseas students (Edinburgh campus),
p_B × 170 = 97.75 Borders campus students.

So, we estimate the overall proportion of students who were "Very satisfied" to be

p_T = (1079.5 + 99.9 + 97.75)/1588 = 0.804.
2. Taken as a whole, a stratified sample constructed as described above
would over-represent the opinions of Overseas students on the Edinburgh campus and students on the Borders Campus and would underrepresent the opinions of Home/EU students on the Edinburgh campus.
So, for example, we would need to be careful about using the sample to
estimate the proportion of all entrants who are "Very satisfied" with the service provided by the Admissions Office. In fact, if we just computed a simple proportion of those in the entire sample who were "Very satisfied" we would certainly obtain a biased estimate! Fortunately, much as we did in Example 1.1.8, provided that we know the size of each stratum and the size of the sample from each stratum, we can use the data from this stratified sample to obtain unbiased estimates of population parameters.
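To see concretely why the weighting matters in Example 1.1.9, compare the stratified estimate with the naive proportion computed from the raw sample. An illustrative sketch using the numbers above:

```python
# Stratum sizes (N), sample sizes (n) and "Very satisfied" counts
# from the Admissions Office survey.
strata = [
    {"name": "Home/EU Edinburgh",  "N": 1270, "n": 80, "count": 68},
    {"name": "Overseas Edinburgh", "N": 148,  "n": 40, "count": 27},
    {"name": "Borders",            "N": 170,  "n": 40, "count": 23},
]

population = sum(s["N"] for s in strata)  # 1588 entrants

# Unbiased estimate: weight each stratum proportion by its size.
weighted = sum(s["N"] * s["count"] / s["n"] for s in strata) / population

# Naive estimate: pool the whole sample, ignoring the design.
naive = sum(s["count"] for s in strata) / sum(s["n"] for s in strata)

# weighted ≈ 0.804, while naive = 118/160 = 0.7375 — pulled down by the
# deliberately over-sampled Overseas and Borders strata.
```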
Systematic random sampling
As a final example in this section, let's see how to construct a systematic random sample of size 25 from the students enrolled in this module:
As with selecting a simple random sample, we start with an (ordered) class
list and we assign to each student on the list one (or more) 3-digit number(s).
We then use a table of random digits to select the first person in the sample.
Once the first person is selected, we then select every fifth person (say) on
the list, starting with the randomly selected first person, until we have a
sample of size 25.
Discussion
1. We can select a systematic random sample much more quickly than a
simple random sample because we only need to select the first person
at random. The rest of the sample is obtained by a systematic selection
from the list, starting from a random person on the list.
2. A systematic random sample is not a simple random sample since not
all subsets of size 25 have the same chance of being selected (for example, systematic random sampling will never select the first 25 people on the class list). Nevertheless, since we select the starting point for
the sample at random, every person on the list has the same chance of
being selected by a systematic random sample. This means that there
is no favouritism in the selection mechanism (i.e. we do not have sampling bias provided there is no underlying bias in the way the names
appear on the class list).
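The selection rule described above takes only a few lines of code. A sketch, assuming a hypothetical class list of 125 names so that a step of 5 yields exactly 25 names:

```python
import random

def systematic_sample(class_list, n=25, step=5, seed=2):
    """Choose a random start among the first `step` positions, then take
    every `step`-th name on the list until `n` names are selected."""
    start = random.Random(seed).randrange(step)
    return [class_list[start + step * k] for k in range(n)]

students = [f"student_{i:03d}" for i in range(125)]
sample = systematic_sample(students)
```

Only the starting point is random; the rest of the sample follows mechanically from the list order, which is why this is so much quicker than drawing a simple random sample.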
Note: The key features of the sampling designs described above are that each is

based upon a well-defined procedure for selecting the sample, and
uses chance to select units from the population.
We also note that it is possible to combine these methods to construct even
more elaborate random sampling designs. All the sampling methods described above are examples of probability sampling:
A probability sample is a sample chosen in such a way that we know
what samples are possible (not all need be possible) and we know
the chance each possible sample has to be chosen.
Key point: Provided we are working with a probability sample, we can still
use the data obtained from the sample to obtain (unbiased) estimates of the
population parameters we are interested in and we can work out the sampling
distribution for our estimates. By looking at the sampling distribution we
can also work out the likely magnitude of our sampling error.
1.2 Experimentation
Discussion
1. In the above example the response variable is blood pressure and the explanatory variable is whether or not the participant in the study smoked. This study is not an experiment because the researcher did not impose a treatment (i.e. smoking or not smoking) on the participants in the study. So, even though we can identify explanatory and
response variables, this does not mean that the study is an experiment.
2. We justify the use of experimentation to establish causation as follows:
Suppose that we change the level of one or more explanatory variables (and all other experimental conditions remain the same); then any resulting change in the response variable must be caused by the changes in the levels of the explanatory variables. For example,
we could investigate the effect of water temperature on colour fastness
of a dye by washing dyed fabric at different temperatures. Provided
we could keep all the conditions in the experiment the same (except
for water temperature), any changes in the colour of the fabric after
washing could be attributed to the effect of the water temperature on
the dyed fabric. Unfortunately, in many situations it can be difficult to
guarantee that nothing affects the response variable except the changes
in the explanatory variables!
The need for an experimental design
In our discussion of sampling we looked at how to sample in order to obtain
a representative sample of the population and to minimise the errors in our
results. Likewise, in experimentation we have to be concerned with how an
experiment is designed.
The most basic type of experiment follows one of these patterns:
Treatment
Observation
(1)
(2)
or
Observation
Treatment
Observation
In experiment (1), a treatment is applied and its effect is observed. In experiment (2), before-and-after measurements are taken. Now under ideal
conditions (e.g. an experiment in a carefully set up laboratory), experiments
following one of the designs above can give us good results. Unfortunately, it
is not always possible to design an ideal experiment and just as we need to
sample with care, we also need to do experiments carefully in order to draw
conclusions from them.
With sampling we needed to look out for sampling procedures which could
lead to sampling bias. With experimentation (and observational studies)
the problem is usually invalid data from which we are unable to draw any
conclusions, i.e. we cannot determine if the treatment had an effect on the
response variable. Here are some typical situations which result in invalid
data:
Confounding of variables
Sometimes we cannot determine the effect of an explanatory variable on the
response variable because the response variable may be influenced by other
variables which are not part of the treatment used in the experiment. Any
variable which is not an explanatory variable in the experiment but which
may influence the response variable is called an extraneous variable.
Example 1.2.2 An educational researcher who has developed a new method
to teach reading to primary school children decides to test its effectiveness
by asking several head teachers in Edinburgh to introduce the scheme in
their schools. At the end of the academic year the children who have been
taught using the new scheme are tested and the results are compared with the
reading test results from schools that did not participate in the scheme. The
results for the pupils that were taught under the new method were higher,
on average, than those of the children from non-participating schools.
Question: Do the results from the above experiment show that the new
scheme improves reading attainment?
No! The problem is that there may be other factors which have also influenced the performance of the children in the participating schools. For example, the researcher may have favoured contacting head teachers of schools in more prosperous areas of Edinburgh, where average performance on reading tests was already higher than the city average before the experiment. Also, head teachers did not have to participate in the experiment. So perhaps the ones who chose to let their schools participate in the new scheme were already more motivated to improve reading attainment in their schools than the ones who didn't participate. The motivation and enthusiasm of the participating head teachers may have helped to improve the reading attainment
of the pupils at least as much as the new scheme! So we cannot draw any
conclusions from this experiment because factors other than the new reading
scheme may have influenced the results of the experiment. This is an example
of confounding:
The effects of two or more variables on a response variable are said
to be confounded if these effects cannot be distinguished from one
another.
In the example above, the motivation of the participating head teachers (an extraneous variable) and the method of teaching (the explanatory variable) are confounded.
Data from observational studies
As already mentioned above, it is usually difficult to determine cause and effect based on data from observational studies. In particular, we often have problems with confounding of variables in observational studies. Here's another example of a (comparative) observational study:
Example 1.2.3 A large study used health service records to investigate the
effectiveness of two ways to treat prostate cancer. One treatment was traditional surgery and the other was based on chemotherapy. In each case, the
patient's consultant determined which treatment would be given. The study
found that the patients who received chemotherapy were less likely to survive
for more than 5 years.
Discussion
In this example the response variable is post-treatment survival and the explanatory variable is the type of cancer treatment (i.e. surgery or chemotherapy). This study is not an experiment because the researcher did not impose the treatment on the patients (each patient's consultant determined the treatment).
Question: Do the results from the above comparative study show that
chemotherapy is less effective as a treatment for prostate cancer?
No! There are other variables that may be confounding the explanatory variable (which is cancer treatment). In particular, the choice of treatment for
each patient was determined by the patient's doctor (not by the researcher
who was doing the study), and the doctor is likely to consider a variety of
factors when deciding which treatment is appropriate. For example, some
patients may have been in such poor health or have other complicating health
problems that surgery would have been too dangerous for these patients. In
these cases, the doctor would be more likely to recommend chemotherapy
instead of surgery. If unwell patients tend to be recommended more often
than healthier patients for chemotherapy, this could also explain why the patients who received chemotherapy tended to have a lower survival rate. So,
in this example, the patients' general health before treatment (an extraneous
variable) and the cancer treatment received (the explanatory variable) are
confounded.
Placebo effect
The response by patients to any treatment in which they have confidence is
called the placebo effect. There are many surprising examples of the power
of the placebo effect in the medical literature. Here are a few:
Many studies have shown that placebos relieve pain in 30-40% of patients, even those recovering from major surgery.
One study found that when a group of balding men was given a placebo,
42% of the men either experienced increased hair growth or did not
continue to lose hair.
In an experiment to investigate the effectiveness of vitamin C in preventing colds it was found that those who thought that they were being
given vitamin C (but in fact received a placebo) had fewer colds than
those who thought that they were being given a placebo (even though
they were, in fact, receiving vitamin C)!
Remark: Because of the placebo effect, clinical trials of drugs and other
medical treatments have to be carefully designed. For example, suppose that
I wish to determine whether a certain medication is effective in reducing
blood pressure. A naive design for an experiment to investigate this question
might be to measure the blood pressure of 40 patients, give each patient
the medication, and then measure their blood pressure after a week on the
medication. The problem with this approach is that any improvement in
a patient's blood pressure might be due to the fact that they expect the
treatment to work (i.e. might be due to the placebo effect) rather than due
to the action of the medication. In other words the placebo effect and the
effect of the medication are confounded.
Experimental design
We've seen above that data from experiments and observational studies can
be invalid due to the confounding of variables. In order to avoid generating
invalid data, we need to develop statistically sound methods for conducting
experiments. In other words, we need a good experimental design (i.e. a good
plan for the experiment). The key idea behind most good experimental designs is comparative experimentation. The basic features of comparative
experimentation are:
1. Start with two equivalent groups.
2. Give the treatment to one of the groups (this group is called the experimental group). The other group (which is called the control group) is
treated in exactly the same way except that this group does not receive
the treatment.
Key point: Extraneous variables influence both groups, whereas the treatment only affects one group.
Warning: Although this experimental design addresses the problem of confounding variables, there is still some room for improving this design! In
particular, comparison will eliminate the problem of confounding only if we
have equivalent groups of subjects. The weakness in the design described
above is that it relies on dividing the units into two equivalent groups. But
how can we be sure that the groups are equivalent? How can we make sure that one of the groups isn't different from the other in some way that
leads to bias in the experimental results? In particular, how can we avoid
some hidden bias arising due to the way that units or subjects are assigned
to groups?
Just as in sampling we were able to eliminate (some) sources of bias by selecting a random sample, so in experimentation we can use random allocation to improve our experimental design and reduce any bias in the results.
Implementation of random allocation
The implementation of random allocation of units to experimental groups is
similar to the method of selecting a simple random sample.
Example 1.2.4 Suppose that I want to test a new organic fertiliser on
tomato plants. I start with 40 plants and assign a number to each plant.
Then using a table of random digits, I select 20 of the 40 plants to be fertilised. The other 20 plants receive no fertiliser, but in every other way are
treated exactly the same as the 20 plants in the treatment group. At the end
of the growing season I record the yield of each plant. This is an example of
a randomised comparative experiment because the units are randomly
assigned to groups.
Discussion: Because we have allocated the tomato plants randomly to the
two groups, there has been no favouritism in the allocation of the plants i.e. the composition of each group is roughly the same with respect to the
other extraneous variables such as the health and vigour of the plants. In
other words, neither group is more likely to consist of the healthiest plants
(or the weaker plants). This random allocation of plants to groups averages
out the effect of extraneous variables and ensures that the groups are roughly
equivalent.
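The random allocation in Example 1.2.4 mirrors the mechanics of simple random sampling. A sketch (the plant labels are hypothetical):

```python
import random

def random_allocation(units, seed=3):
    """Shuffle the units, then split them into two equal groups:
    the first half gets the treatment, the second half is the control."""
    shuffled = list(units)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

plants = [f"plant_{i:02d}" for i in range(1, 41)]
fertilised, control = random_allocation(plants)  # 20 plants per group
```

Because the shuffle is random, no plant characteristic (health, vigour, etc.) can systematically favour one group.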
The importance of randomised comparative experiments stems from the fact
that we can use the following argument to establish cause-and-effect based
on the results of a randomised comparative experiment:
Randomization produces groups of subjects that should be similar in
all respects before we apply the treatments
Comparative design ensures that extraneous variables other than the
experimental treatments operate equally on all groups.
Therefore, (significant) differences in the response variable between
groups must be due to the differences (and the effects) of the treatments.
Further discussion: The more subjects used in a randomised comparative
experiment, the more likely it will be that the treatment groups in the experiment will be roughly equivalent. For example, in the tomato experiment
described above, there is still a chance that when I randomly allocate plants
to groups I will (by chance) allocate many more healthy plants to the group
that gets the fertiliser than to the other group (this would be unlucky but
still possible). To reduce the chance that, in spite of random allocation, I end
up with unbalanced groups, I should make the treatment groups as large as
possible, since this would reduce the chance that one group has many more
healthy plants than the other.
who is involved in recording data from the subjects) shouldn't know who is
receiving the Echinacea and who is getting the placebo. Once the experiment
has ended (e.g. all the data has been collected), the blinds can be removed
so that a statistician can analyse the results.
Note: It is generally accepted that whenever possible, it is desirable that
medical experiments with human subjects are randomized, double-blind with
a placebo control.
Interpretation of results
How can we know that the differences observed between the treatment group
and the control group are significant - i.e. the differences are due to something other than just chance?
Example 1.2.4 again Let's consider the tomato fertiliser experiment again.
Suppose at the end of the growing season I harvest the tomatoes from each
plant and weigh them. I then use my data to compute the average yield for
each group and I obtain:
Average yield for fertilised plants: 3.95 kg
Average yield for unfertilised plants: 3.58 kg
Question: Since the average yield for the fertilised plants is greater than
the average yield for the unfertilised plants, do these results prove that the
organic fertiliser increases yield?
Discussion: We need to be careful about coming to hasty conclusions! The
problem is that even if both groups received no fertiliser there would still
be a difference between the yield of the first group of plants and the second
group of plants. This is because there will always be some chance differences
between the plants in the groups and this will result in chance differences
between their yields. In order to decide whether these results prove that
the fertiliser increases yield, a statistician must first work out how much
chance variation in the yields we would expect to see between the groups
if neither group is fertilised. Next, the statistician looks at the observed
difference between the unfertilised and fertilised groups. Now if this observed
difference is much greater than the difference that we would expect to see
between two untreated groups, we can conclude that the difference in the
yields is unlikely to be due to just chance variation and we conclude that
the best interpretation of the data is that the fertiliser increases yield. The
statistician would report that the difference between the yields is statistically
significant.
An observed effect or difference of a size that would rarely occur by
chance is called a statistically significant effect or difference.
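One standard way a statistician can quantify "how much difference chance alone would produce" is a permutation test: repeatedly reshuffle the group labels at random and count how often a difference at least as large as the observed one arises. The yields below are made-up numbers for illustration only, not data from the notes.

```python
import random
import statistics

def permutation_p_value(treated, control, n_reps=5000, seed=4):
    """Proportion of random relabellings whose difference in mean yield
    is at least as large as the observed difference (one-sided)."""
    rng = random.Random(seed)
    observed = statistics.mean(treated) - statistics.mean(control)
    pooled = list(treated) + list(control)
    extreme = 0
    for _ in range(n_reps):
        rng.shuffle(pooled)
        diff = (statistics.mean(pooled[:len(treated)])
                - statistics.mean(pooled[len(treated):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_reps

# Hypothetical yields (kg) for 8 fertilised and 8 unfertilised plants.
fertilised   = [4.1, 3.8, 4.3, 3.9, 4.0, 4.2, 3.7, 4.4]
unfertilised = [3.6, 3.5, 3.9, 3.4, 3.7, 3.6, 3.8, 3.3]
p = permutation_p_value(fertilised, unfertilised)
# A small p means the observed difference would rarely occur by chance
# alone, i.e. it is statistically significant.
```

This is only a sketch of the underlying idea; the formal machinery is covered in subsequent statistics modules.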
In practice, whether an observed effect or difference is statistically significant
will depend on both the magnitude of the observed effect or difference and
on the number of subjects in the experiment. You will learn more about
exactly how statisticians determine whether an observed effect or difference
is statistically significant in subsequent statistics modules. For now the main
point is that if you read that a result of an investigation is statistically
significant, you can conclude that the investigators found good statistical
evidence to support the claim that differences in the levels of the response
variable(s) are due to differences in the treatments imposed.
Difficulties and issues in experimentation
In the section on sampling we saw that even when we use an unbiased sampling method, there can still be problems with sampling that cannot be
avoided by using a good sampling method (e.g. non-sampling errors such
as nonresponse errors, leading questions, processing errors, etc.). Likewise,
randomised comparative experiments go a long way towards avoiding the
problem of invalid data, but we still need to be on the lookout for difficulties
in experimentation and in the interpretation of experimental results. Here
are (just a few) problems and issues that we need to watch out for:
Applicability of the results (can the results be extended ?)
A common problem with the interpretation of experimental results is that
the applicability of the results can be over-stated. We always need to look
carefully at how an experiment was conducted in order to determine to what
population the conclusions apply. In many experiments the researcher has
to select subjects from an available pool of subjects which may not be representative of the population to which the researcher would like to apply the
results. In this case the results will probably be valid for the subject pool
but the researcher must justify why the results can be applied to the larger
population.
Example 1.2.6 Various well-designed clinical trials have shown that using
drugs to reduce blood cholesterol in middle-aged men with high cholesterol
also decreases their risk of a heart attack. Can we conclude from this trial
that, in general, reducing blood cholesterol decreases the risk of heart disease?
Discussion: The problem with drawing general conclusions about blood
cholesterol and the risk of heart attack from the results of these experiments
is that there may be important physiological differences between men and
women (or between men of different ages) which mean that blood cholesterol
level is not as important a risk factor for these other groups as it is for middle-aged men with high cholesterol. Doctors, for example, need to be careful not
to assume that the results of clinical trials are applicable to types of patients
that were not part of the relevant trials.
Lack of realism in the experiment
Another (related) problem with the interpretation of experimental results is
that the experimental treatments (or some other feature of the experiment)
may be unrealistic.
Example 1.2.7 In order to determine whether a food additive is safe, it is
standard practice to test high doses of the additive on laboratory rats. The
additive is deemed unsafe if the experimental group develops significantly
more tumours than the control group.
Discussion: The decision to ban a food additive based on an animal experiment is an example of erring on the side of caution. It is important
to remember that such an experiment does not necessarily prove that the
additive is actually dangerous for human consumption. The problem is that
the experiment is not realistic: humans are not rats and typical doses of the
additive are usually much smaller than the doses given to the rats.
Psychologists and other social scientists often have to devise ingenious experiments to investigate psychological responses to various factors. The difficulty
with some of these experiments is that they are (necessarily) somewhat artificial - e.g. they are conducted in a laboratory, the subjects are aware that
they are participating in a psychological experiment, etc. As a result, one
needs to be careful about generalising the findings of such experiments to
real-world situations.
Dropouts, nonadherers, and refusals in experiments with human
subjects
Experiments with human subjects can be compromised by human behaviour!
When this happens, statisticians and researchers have to figure out how to
make appropriate adjustments in order to try to reduce any bias that may
result from human behaviour. Some typical problems include:
Dropouts: Experiments that continue over a long period of time often have subjects who drop out before the end of the experiment. It is very important that researchers try to determine the reasons that participants drop out. In particular, the researchers should try to determine whether the reason for dropping out is related to a feature of the experiment. For example, perhaps the subjects receiving one particular treatment experienced unpleasant side effects and as a result decided to stop participating. Clearly, their reason for dropping out is very relevant to the experiment, and as a result the results of the experiment may be biased because the dropouts did not complete the experiment.
Nonadherers: A subject who participates in an experiment but who doesn't follow the experimental treatment is called a nonadherer. There
are many reasons why a subject may break the rules. For example,
an experiment might require participants to take a medication according to a very careful schedule over several weeks. The difficulty with
such an experiment is that people sometimes aren't very good about
remembering to take medication. If subjects are not taking the medication according to the experimental guidelines it will be difficult to
determine what the true effect of the medication is!
Refusals: Human subjects have to agree to participate in experiments
and that means that individuals can refuse to participate! Now if there
is no particular reason or pattern to the refusals, then non-participation
of some of the selected subjects may not make any difference to the
validity of the experimental results. On the other hand, if those who
refuse to participate differ in some systematic way from those who
participate then bias can result.
More complicated experimental designs
There are a variety of ways that randomised comparative experiments can
be developed to make more complicated comparisons. We describe some
common variants below.
Completely randomized design with multiple factors/levels
Randomised comparative experiments can be used to investigate the combined effect of more than one variable on the response variable. Variables
can also be set at different levels in order to investigate the effect of the
level of a variable (e.g. dose of a certain drug) on the response variable. Here
is an example of an experiment with two explanatory variables that are set
at various levels:
Example 1.2.8 Clothing manufacturers usually recommend both the temperature (e.g. 30°, 40°, etc.) and the cycle setting ("Cotton wash", "Synthetic wash", etc.) at which a garment should be washed. To determine the optimal temperature and cycle setting for a particular material, we can perform
a randomised comparative experiment with multiple factors and levels. In
this case the factors (i.e. explanatory variables) are temperature and cycle
setting and the levels are the various settings for temperature and cycle.
All possible combinations of temperature and cycle settings give us a total of
20 different treatments as shown in the diagram below (labelled by Roman
numerals):
             30°    40°    50°     60°    90°
Cotton        I     II     III     IV     V
Synthetic     VI    VII    VIII    IX     X
Delicate      XI    XII    XIII    XIV    XV
Wool          XVI   XVII   XVIII   XIX    XX
To carry out the experiment, the researcher obtains 200 pieces of the same
fabric which have all been stained with the same substance, and randomly
allocates 10 pieces of fabric to each of the 20 treatments. At the end of the
experiment the washed pieces of fabric are examined to see how well they
have been cleaned, whether the dye in the fabric has run, etc.
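The 20 treatment combinations and the random allocation of the 200 fabric pieces can be sketched as follows (the piece numbering is hypothetical):

```python
import itertools
import random

temperatures = [30, 40, 50, 60, 90]
cycles = ["Cotton", "Synthetic", "Delicate", "Wool"]

# Every factor-level combination: 4 cycles x 5 temperatures = 20 treatments.
treatments = list(itertools.product(cycles, temperatures))

# Randomly allocate the 200 stained fabric pieces, 10 per treatment.
pieces = list(range(1, 201))
random.Random(5).shuffle(pieces)
allocation = {t: pieces[10 * i:10 * (i + 1)]
              for i, t in enumerate(treatments)}
```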
Remark: This experiment allows the manufacturer to discover how the interaction between the two explanatory variables (temperature and cycle setting) affects the response variable(s).
Randomized Block Designs
Matching subjects in various ways can be used in conjunction with randomization to produce more precise results than would be obtained by a simple randomised comparative experiment. This is particularly useful in the design of experiments where it is thought that extraneous variables (i.e.
variables that are not part of the treatment) may have a big impact on the
response variable. In order to control the effects of these extraneous variables
in the experiment a block design can be used:
A block is a group of experimental units or subjects that are similar
with respect to some extraneous variables that are thought to affect
the response to the treatment in the experiment.
In a randomised block design experiment, the subjects are first grouped
into blocks and then, within the blocks, the subjects are randomly assigned
to treatments.
Note: In a randomised block design, the allocation of subjects to blocks is
not random! The subjects or units are grouped together according to some
characteristics that they have in common. After the subjects have been put
in blocks, the subjects within a block are randomly allocated to treatments.
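The two-stage procedure (group into blocks, then randomise within each block) can be sketched as follows. For illustration we assume two small blocks of six subjects and three treatments, so each treatment appears exactly twice in each block; the subject labels are invented:

```python
import random

random.seed(2)  # fixed seed for reproducibility

# Stage 1 (NOT random): subjects grouped into blocks by an
# extraneous variable they have in common
blocks = {
    "block A": [f"A{i}" for i in range(6)],
    "block B": [f"B{i}" for i in range(6)],
}
treatments = ["treatment 1", "treatment 2", "treatment 3"]

# Stage 2 (random): within each block, shuffle the subjects and
# deal them out to the treatments in turn
assignment = {}
for block, subjects in blocks.items():
    shuffled = subjects[:]
    random.shuffle(shuffled)
    for i, subj in enumerate(shuffled):
        assignment[subj] = treatments[i % len(treatments)]
```

Because the shuffle happens inside each block, each block ends up with a balanced, randomly chosen split of its subjects across the treatments.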
Here's a simple example of a randomised block design:
Example 1.2.9 A pharmaceutical company wishes to compare the effectiveness of a new drug for reducing levels of LDL cholesterol to the effectiveness
of two commonly used treatments for high levels of LDL cholesterol. It is
thought that the effectiveness of any drug for reducing LDL levels is affected
both by the gender of the patient and by the initial level of LDL in the bloodstream. A total of 600 men and 400 women have agreed to participate in
the clinical trial of these treatments. In order to control for these extraneous
variables, the men are divided into blocks of men with similar levels of LDL
cholesterol and the women are divided into blocks of women with similar
levels of cholesterol. Within each block the subjects are randomly allocated
to treatments. Also, because this is an experiment with human subjects, the
experiment is double-blind - i.e. the subjects do not know which treatment
they are receiving and the staff running the experiment do not know which
treatment a patient has received.
Discussion: Blocking in the experiment above allows a researcher to get a
clearer picture of the differences between the treatments. This is because the
blocks have been chosen to equalise important (and unavoidable) sources
of variation between the subjects. Less important sources of variation are
then averaged out by randomly allocating treatments within the blocks. In
addition, by grouping similar subjects together before randomly allocating
treatments, the researcher can also separately investigate the responses of the different blocks (e.g. men and women) to the treatments.

1.3 Measurement
To measure this property, the participant exhales into a peak flow meter and the level of
peak flow is recorded.
Note: Once the researcher has decided how to measure lung function, the variable is defined in terms of the method of measurement. In this case, the variable is peak flow because that is what the researcher actually measures.
Now, deciding how to measure a property is easiest when everyone clearly understands the property that we propose to measure (e.g. height, weight, etc.). Problems arise when the definition of the property to be measured is imprecise or disputed.
Example 1.3.2 Suppose that a psychologist wants to measure intelligence.
In this case, there is an immediate problem because human intelligence is
a complex property and there is no universally accepted definition of it.
Without a clear understanding of intelligence, it is difficult for researchers to
agree how to measure it. For example, there is much debate about whether
the standard IQ test is an appropriate measure of a property that is as
complex as intelligence!
Here's another example that illustrates some of the issues that arise when we
try to measure properties in complicated situations.
Example 1.3.3 Suppose we wish to measure an individual's employment status. Before we can measure this property, we need to define what we mean when we say that someone is employed, unemployed, or economically inactive.
Note: Different organisations may have different ideas about what it means to say that someone is employed! In the UK, the Office for National Statistics has adopted the following definitions:
1. A person (aged 16 or over) is employed if in the previous week they did at least one hour of paid work, or were temporarily away from a job (e.g. on holiday), or were on a government training scheme, or did unpaid work for a family business.
2. A person (aged 16 or over) is unemployed if in the previous week they did not have a job but were available to start work within the next two weeks and had either been looking for a job during the last four weeks or were waiting to start a job that they had already secured.
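The two definitions above can be written as a decision rule. This is only an illustrative sketch, not official ONS code: the function and argument names are invented, and the residual "economically inactive" category is assumed here for anyone aged 16 or over who fits neither definition:

```python
# Illustrative sketch of the ONS-style definitions (names invented)
def employment_status(age, worked_paid_hour, temporarily_away, on_scheme,
                      unpaid_family_work, available_two_weeks,
                      looked_four_weeks, waiting_to_start):
    if age < 16:
        return "not classified"
    # Employed: at least one hour of paid work last week, or temporarily
    # away from a job, or on a government training scheme, or unpaid
    # work for a family business
    if worked_paid_hour or temporarily_away or on_scheme or unpaid_family_work:
        return "employed"
    # Unemployed: no job, but available within two weeks and either
    # looking for work recently or waiting to start a secured job
    if available_two_weeks and (looked_four_weeks or waiting_to_start):
        return "unemployed"
    # Residual category (an assumption in this sketch)
    return "economically inactive"
```

For instance, someone aged 30 who did an hour of paid work last week is classified as employed, while someone with no job who is available and actively searching is classified as unemployed.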
A variable is a valid measure of a property if it is relevant and appropriate as a representation of that property.
Discussion: The method of measuring employment status described above
is an example of a valid (though not perfect) measure of the property. It is
both relevant and appropriate.
In contrast, here's a (silly) example of an invalid measurement:
Example 1.3.4 Suppose that I want to measure intelligence. To do this, I decide to measure an individual's height in centimetres. Clearly, it is invalid to measure intelligence by measuring someone's height because height is neither relevant nor appropriate as a numerical representation of someone's intelligence!
Here is a more subtle example of invalid measurement:
Example 1.3.5 Suppose that a small business wishes to measure the level of employee job satisfaction. It surveys its employees and discovers that 65 of its employees who are under 50 are "Satisfied" or "Very satisfied" with their job, whereas only 43 of its employees who are 50 or older are "Satisfied" or "Very satisfied" with their job. Can we conclude from these numbers that the younger employees are more satisfied with their jobs than the older employees?
No! The company employs 56 people who are 50 or older and 175 people who are under 50. So the satisfaction rate for younger employees is

65/175 = 0.371, or 37.1%,

whereas the satisfaction rate for older employees is

43/56 = 0.768, or 76.8%.

This satisfaction rate is a more valid measure of the level of job satisfaction than a simple count of the numbers.
Note: The above example illustrates that often the rate (i.e. fraction,
proportion, or percent) at which something occurs is a more valid measure
than a simple count of occurrences.
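The rate calculation from Example 1.3.5 can be checked directly; the group labels below are just illustrative:

```python
# Satisfaction counts and group sizes from Example 1.3.5
satisfied = {"under 50": 65, "50 or older": 43}
employees = {"under 50": 175, "50 or older": 56}

# Rate = count of satisfied employees / size of the group
rates = {group: satisfied[group] / employees[group] for group in satisfied}

# under 50: 65/175 ≈ 0.371 (37.1%); 50 or older: 43/56 ≈ 0.768 (76.8%)
```

Comparing the raw counts (65 vs 43) points one way; comparing the rates (37.1% vs 76.8%) points the other, which is exactly why the rate is the more valid measure here.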
Lastly, let's think again about the problem of how to measure intelligence.
Because there continue to be debates about how to define intelligence, there
continue to be debates about how to measure it - e.g. is the score of the
standard IQ test a valid measure of intelligence?
One way to resolve debates over whether a variable is a valid measure is to
claim (instead) that the variable has predictive validity:
A measurement of a property has predictive validity if it can be used
to predict success on tasks that are related to the property to be
measured.
Discussion: So, for example, instead of arguing about whether an IQ score
is a valid measurement of intelligence, we could claim (instead) that an IQ
score is valid for predicting success on (for example) school assessments.
Key Point: We can use data to investigate whether a variable has predictive validity - e.g. by looking at an individual's IQ score and their results on school assessments we can investigate the claim that IQ scores are valid for predicting success on assessments. In contrast, data consisting of only IQ scores can't really help us decide whether an IQ score is a valid measurement of intelligence!
Accuracy in measurement
Once we have decided how to measure a property (i.e. we have decided what variable to measure), we need to consider the process of taking the measurements. Ideally, we want our measurement process to be unbiased and
reliable.
A measurement process is unbiased if it does not systematically overstate or understate the true value of the variable measured.
A measurement process has random error if repeated measurements on
the same individual give different results. If the random error is
small, we say that the process is reliable - i.e. repeated measurements
on the same individuals give the same (or approximately the same)
results.
Example 1.3.6 Let's look again at the process of measuring peak flow (as
a representation of lung function). The measurement process is reliable if,
when I take repeated measurements on the same person, I get (more or less)
the same reading on the meter. On the other hand, if the meter is faulty (e.g. perhaps it gets stuck), then the measurement process will be biased because the meter will always tend to record a peak flow value that is smaller than the true peak flow.
Even if our measurement process is subject to random errors, we can improve our measurement by averaging several measurements taken on the same unit to obtain a more reliable (less variable) result.
Example 1.3.7 Every week I monitor my rabbit's health by weighing him using my kitchen scales. I take three measurements because the measurements tend to vary by about 25 g (plus or minus). I average these measurements to obtain a more reliable measurement of my rabbit's weight.
Note: By averaging measurements we can improve the reliability of our
measurement process, but this does not reduce bias. Bias depends on how
good the measurement instrument is! In the example above, this means that
we need to know whether I have a good set of scales - e.g. do they tend to
give (more or less) the right answer? Or, do they tend to give readings that
are either too large or too small?
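A small simulation illustrates both points: averaging repeated readings reduces random error, but it does nothing to remove bias. The true weight, the size of the random error, and the size of the bias below are all assumptions chosen for illustration:

```python
import random
import statistics

random.seed(3)  # fixed seed for reproducibility

TRUE_WEIGHT = 1500.0  # assumed true weight of the rabbit, in grams

def measure():
    # one reading with random error of roughly plus or minus 25 g
    return TRUE_WEIGHT + random.uniform(-25, 25)

# 1000 single readings vs 1000 averages of three readings each
singles = [measure() for _ in range(1000)]
averaged = [statistics.mean(measure() for _ in range(3)) for _ in range(1000)]

# Averaging shrinks the spread of the measurements (less random error) ...
assert statistics.stdev(averaged) < statistics.stdev(singles)

def biased_measure():
    # a faulty instrument that always reads 50 g too low (assumed bias)
    return measure() - 50

# ... but averaging a biased instrument's readings does not remove the
# bias: the average settles near the wrong value, 50 g below the truth
biased_avg = statistics.mean(biased_measure() for _ in range(1000))
assert abs(biased_avg - (TRUE_WEIGHT - 50)) < 5
```

The spread of the three-reading averages is smaller than the spread of single readings (roughly by a factor of the square root of 3), while the biased instrument's average stays stuck about 50 g below the true weight no matter how many readings are averaged.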
Further Discussion: In many situations (as in the examples above), we can check whether a measurement process is reliable by taking repeated measurements on the same units. However, sometimes researchers have to use
more complicated ways of checking reliability. For example, it is difficult to
check the reliability of psychometric tests because, if the same person takes
a particular type of test over and over again, they will learn how to take
the test and their scores will increase with repeated attempts.
1.4 Looking at data intelligently
The problem with this information is that it gives the impression that Maltesers are a low-calorie sweet because each one has only 11 calories. However, under EU regulations a low-calorie food is one that has fewer than 40 calories per 100 g. Maltesers contain 505 calories per 100 g!
Note: The information given was correct, but the advert didn't tell you the full story!
Example In order to provide consumers with information about energy efficiency, many products are now rated using a letter scale, with A corresponding to the greatest energy efficiency and G corresponding to the lowest.
In a brochure produced by a well-known double-glazing company, the company points out that its standard windows are all B-rated. The brochure also points out that the minimum standard of energy efficiency required for all new windows installed in Scotland is a D-rating. This information creates the impression that the company's windows are much more energy efficient than its competitors'! What is missing is information about the energy ratings of windows supplied by other companies - i.e. it is not really relevant what the minimum standard is if most companies also supply B-rated windows. (A little checking on the Web shows that many of this company's competitors also supply B-rated windows as standard!)
Again, the information provided is correct, but we need more information in order to properly understand how special (or otherwise) the company's windows are.