Designing Clinical Research

Authors: Hulley, Stephen B.; Cummings, Steven R.; Browner, Warren S.; Grady, Deborah G.; Newman, Thomas B.
Title: Designing Clinical Research, 3rd Edition
Copyright 2007 Lippincott Williams & Wilkins
1
Getting Started: The Anatomy and
Physiology of Clinical Research
Stephen B. Hulley
Thomas B. Newman
Steven R. Cummings
This chapter introduces clinical research from two viewpoints, setting up themes
that run together through the book. One is the anatomy of research: what it's
made of. This includes the tangible elements of the study plan: research
question, design, subjects, measurements, sample size calculation, and so forth.
An investigator's goal is to create these elements in a form that will make the
project feasible, efficient, and cost-effective.
The other theme is the physiology of research: how it works. Studies are
useful to the extent that they yield valid inferences, first about what happened
in the study sample and then about how these study findings generalize to
people outside the study. The goal is to minimize the errors, random and
systematic, that threaten conclusions based on these inferences.
Separating the two themes is artificial in the same way that the anatomy of the
human body doesn't make much sense without some understanding of its
physiology. But the separation has the same advantage: it clarifies our thinking
about a complex topic.
Anatomy of Research: What It's Made Of
The structure of a research project is set out in its protocol, the written
plan of the study. Protocols are well known as devices for seeking grant funds,
but they also have a vital scientific function: helping the investigator organize
her research in a logical, focused, and efficient way. Table 1.1 outlines the
components of a protocol. We introduce the whole set here, expand on each
component in the ensuing chapters of the book, and return to put the completed
pieces together in Chapter 19.
Research Question
The research question is the objective of the study, the uncertainty the
investigator wants to resolve. Research questions often begin with a general
concern that must be narrowed down to a concrete, researchable issue. Consider,
for example, the general question:
Should people eat more fish?
This is a good place to start, but the question must be focused before planning
efforts can begin. Often this involves breaking the question into more specific
components, and singling out one or two of these to build the protocol around.
Here are some examples:
How often do Americans eat fish?
Does eating fish lower the risk of cardiovascular disease?
Is there a risk of mercury toxicity from increasing fish intake in older adults?
Do fish oil supplements have the same effects on cardiovascular disease as dietary fish?
Which fish oil supplements don't make people smell like fish?

Table 1.1 Outline of the Study Protocol

Element                        Purpose
Research questions             What questions will the study address?
Background and significance    Why are these questions important?
Design                         How is the study structured?
  Time frame
  Epidemiologic approach
Subjects                       Who are the subjects and how will they be selected?
  Selection criteria
  Sampling design
Variables                      What measurements will be made?
  Predictor variables
  Confounding variables
  Outcome variables
Statistical issues             How large is the study and how will it be analyzed?
  Hypotheses
  Sample size
  Analytic approach

A good research question should pass the "So what?" test. Getting the
answer should contribute usefully to our state of knowledge. The acronym FINER
denotes five essential characteristics of a good research question: it should be
feasible, interesting, novel, ethical, and relevant (Chapter 2).
Background and Significance
The background and significance section of a protocol sets the proposed study
in context and gives its rationale: What is known about the topic at hand? Why is
the research question important? What kind of answers will the study provide?
This section cites previous research that is relevant (including the investigator's
own work) and indicates the problems with the prior research and what
uncertainties remain. It specifies how the findings of the proposed study will
help resolve these uncertainties and lead to new scientific knowledge and
influence practice guidelines or public health policy. Often, work on the
significance section will lead to modifications in the research question.
Design
The design of a study is a complex issue. A fundamental decision is whether to
take a passive role in observing the events taking place in the study subjects in
an observational study or to apply an intervention and examine its effects on
these events in a clinical trial (Table 1.2). Among observational studies, two
common designs are cohort studies, in which observations are made in a group
of subjects that is followed over time, and cross-sectional studies, in which
observations are made on a single occasion. Cohort studies can be further
divided into prospective studies that begin in the present and follow subjects
into the future, and retrospective studies that examine information and
specimens that have been collected in the past. A third common option is the
case-control design, in which the investigator compares a group of people who
have a disease or condition with another group who do not. Among clinical trial
options, the randomized blinded trial is usually the best design, but
nonrandomized or unblinded designs may be more suitable for some research
questions.
No one approach is always better than the others, and each research question
requires a judgment about which design is the most efficient way to get a
satisfactory answer. The randomized blinded trial is often held up as the best
design for establishing causality and the effectiveness of interventions, but there
are many situations for which an observational study is a better choice or the
only feasible option. The relatively low cost of case-control studies and their
suitability for rare outcomes make them attractive for some questions. Special
considerations apply to choosing designs for studying diagnostic tests. These
issues are discussed in Chapters 7 through 12, each dealing with a
particular set of designs.
A typical sequence for studying a topic begins with observational studies of a
type that is often called descriptive. These studies explore the lay of the
land, for example, describing distributions of diseases and health-related
characteristics in the population:
What is the average number of servings of fish per week in the diet of
Americans with a history of coronary heart disease (CHD)?
Descriptive studies are usually followed or accompanied by analytic studies that
evaluate associations to permit inferences about cause-and-effect relationships:
Is there an association between fish intake and risk of recurrent myocardial
infarction in people with a history of CHD?

Table 1.2 Examples of Common Clinical Research Designs Used to Find Out Whether Fish Intake Reduces Coronary Heart Disease Risk

Observational Designs

Study design: Cohort study
  Key feature: A group followed over time
  Example: The investigator measures fish intake at baseline and periodically
  examines subjects at follow-up visits to see if those who eat more fish have
  fewer coronary heart disease (CHD) events

Study design: Cross-sectional study
  Key feature: A group examined at one point in time
  Example: She interviews subjects about current and past history of fish intake
  and correlates results with history of CHD and current coronary calcium score

Study design: Case-control study
  Key feature: Two groups selected based on the presence or absence of an outcome
  Example: She examines a group of patients with CHD (the "cases") and compares
  them with a group who did not have CHD (the "controls"), asking about past
  fish intake

The final step is often a clinical trial to establish the effects of an intervention:
Does treatment with fish oil capsules reduce total mortality in people with
CHD?
Clinical trials usually occur relatively late in a series of research studies about a
given question, because they tend to be more difficult and expensive, and to
answer more definitively the narrowly focused questions that arise from the
findings of observational studies.
It is useful to characterize a study in a single sentence that summarizes the
design and research question. If the study has two major phases, the design for
each should be mentioned.
This is a cross-sectional study of dietary habits in 50- to 69-year-old people
with a history of CHD, followed by a prospective cohort study of whether
fish intake is associated with low risk of subsequent coronary events.
This sentence is the research analog to the opening sentence of a medical
resident's report on a new hospital admission: "This 62-year-old white
policewoman was well until 2 hours before admission, when she developed
crushing chest pain radiating to the left shoulder." Some designs do not easily
fit into the categories listed above, and classifying them with a single sentence
can be surprisingly difficult. It is worth the effort: a precise description of
design and research question clarifies the investigator's thoughts and is useful
for orienting colleagues and consultants.

Table 1.2 (continued)

Clinical Trial Design

Study design: Randomized blinded trial
  Key feature: Two groups created by a random process, and a blinded intervention
  Example: She randomly assigns subjects to receive fish oil supplements or
  placebo, then follows both treatment groups for several years to observe the
  incidence of CHD

Study Subjects
Two major decisions must be made in choosing the study subjects (Chapter 3).
The first is to specify inclusion and exclusion criteria that define the target
population: the kinds of patients best suited to the research question. The
second decision concerns how to recruit enough people from an accessible
subset of this population to be the subjects of the study. For example, the study
of fish intake in people with coronary heart disease (CHD) might identify
subjects seen in the clinic with diagnosis codes for myocardial infarction,
angioplasty, or coronary artery bypass grafting in their electronic medical
record. Decisions about which patients to study represent trade-offs; studying a
random sample of people with CHD from the entire country (or at least several
different states and medical care settings) would enhance generalizability but be
much more difficult and costly.
Variables
Another major set of decisions in designing any study concerns the choice of
which variables to measure (Chapter 4). A study of fish intake in the diet, for
example, might ask about different types of fish that contain different levels of
omega-3 fatty acids, and include questions about portion size, whether the fish was
fried or baked, and whether the subject takes fish oil supplements.
In an analytic study the investigator studies the associations among variables to
predict outcomes and to draw inferences about cause and effect. In considering
the association between two variables, the one that occurs first or is more likely
on biologic grounds to be causal is called the predictor variable; the other is
called the outcome variable (see footnote 1). Most observational studies have
many predictor variables (age, race, sex, smoking history, fish and fish oil
supplement intake), and several outcome variables (heart attacks, strokes,
quality of life, unpleasant odor).
Clinical trials examine the effects of an intervention (a special kind of predictor
variable that the investigator manipulates), such as treatment with fish oil
capsules. This design allows her to observe the effects on the outcome variable
using randomization to control for the influence of confounding variables:
other predictors of the outcome, such as intake of red meat or income level,
that could be related to dietary fish and confuse the interpretation
of the findings.
Statistical Issues
The investigator must develop plans for estimating sample size and for managing
and analyzing the study data. This generally involves specifying a hypothesis
(Chapter 5).
Hypothesis: 50- to 69-year-old women with CHD who take fish oil
supplements will have a lower risk of myocardial infarction than those who
do not.
This is a version of the research question that provides the basis for testing the
statistical significance of the findings. The hypothesis also allows the
investigator to calculate the sample size, the number of subjects needed to
observe the expected difference in outcome between study groups with
reasonable probability, or power (Chapter 6). Purely descriptive studies (what
proportion of people with CHD use fish oil supplements?) do not involve tests of
statistical significance, and thus do not require a hypothesis; instead, the
number of subjects needed to produce acceptably narrow confidence intervals for
means, proportions, or other descriptive statistics can be calculated.
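The calculation for a descriptive study can be sketched in a few lines. The example below is a minimal illustration rather than the method developed in Chapter 6: it assumes the normal approximation to the binomial distribution, and the function name, the guessed prevalence of 20%, and the target half-width of 5 percentage points are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(expected_p, half_width, confidence=0.95):
    """Subjects needed so the confidence interval for a proportion is
    roughly expected_p +/- half_width (normal approximation).

    expected_p is the investigator's advance guess at the proportion;
    it is an assumption, not a study result.
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * expected_p * (1 - expected_p) / half_width ** 2
    return math.ceil(n)


# Hypothetical planning values: guessed prevalence 20%, interval of +/- 5 points.
n_needed = sample_size_for_proportion(expected_p=0.20, half_width=0.05)
print(n_needed)
```

Under these assumptions, halving the desired half-width roughly quadruples the number of subjects needed, because the required sample size is inversely proportional to the square of the half-width.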
Physiology of Research: How It Works
The goal of clinical research is to draw inferences from findings in the
study about the nature of the universe around it (Fig. 1.1). Two major sets of
inferences are involved in interpreting a study (illustrated from right to left in
Fig. 1.2). Inference #1 concerns internal validity, the degree to which the
investigator draws the correct conclusions about what actually happened in the
study. Inference #2 concerns external validity (also called generalizability),
the degree to which these conclusions can be appropriately applied to people and
events outside the study.
When an investigator plans a study, she reverses the process, working from left
to right in the lower half of Fig. 1.2 with the goal of maximizing the validity of
these inferences at the end of the study. She designs a study plan in which the
choice of research question, subjects, and measurements enhances the external
validity of the study and is conducive to implementation with a high degree of
internal validity. In the next sections we address design and then
implementation before turning to the errors that threaten the validity of these
inferences.
FIGURE 1.1. The findings of a study lead to inferences about the universe
outside.
Designing the Study
Consider the simple descriptive question:
What is the prevalence of regular use of fish oil supplements among people
with CHD?
This question cannot be answered with perfect accuracy because it would be
impossible to study all patients with CHD and our approaches to discovering
whether a person is taking fish oil are imperfect. So the investigator settles for a
related question that can be answered by the study:
Among a sample of patients seen in the investigator's clinic who have a
previous CHD diagnosis and respond to a mailed questionnaire, what
proportion report taking fish oil supplements?
The transformation from research question to study plan is illustrated in Fig. 1.3.
One major component of this transformation is the choice of a sample of
subjects that will represent the population. The group of subjects specified in
the protocol can only be a sample of the population of interest because there are
practical barriers to studying the entire population. The decision to study
patients in the investigator's clinic identified through the electronic medical
record system is a compromise. This is a sample that is feasible to study but one
that may produce a different prevalence of fish oil use than that found in all
people with CHD.
The other major component of the transformation is the choice of variables that
will represent the phenomena of interest. The variables specified in the study
plan are usually proxies for these phenomena. The decision to use a self-report
questionnaire to assess fish oil use is a fast and inexpensive way to collect
information, but it will not be perfectly accurate. Some people may not
accurately remember or record how much they take in a typical week, others
may report how much they think they should be taking, and some may be taking
products that they do not realize should be included.
In short, each of the differences in Fig. 1.3 between the research question and
the study plan has the purpose of making the study more practical. The cost of
this increase in practicality, however, is the risk that design changes may cause
the study to produce a wrong or misleading conclusion because its design answers
something different from the research question of interest.
FIGURE 1.2. The process of designing and implementing a research project
sets the stage for drawing conclusions from the findings.
Implementing the Study
Returning to Fig. 1.2, the right-hand side is concerned with implementation and
the degree to which the actual study matches the study plan. At issue here is the
problem of a wrong answer to the research question because the way the sample
was actually drawn, and the measurements made, differed in important ways
from the way they were designed (Fig. 1.4).
The actual sample of study subjects is almost always different from the intended
sample. The plans to study all eligible patients with CHD, for example, could be
disrupted by incomplete diagnoses in the electronic medical record, wrong
addresses for the mailed questionnaire, and refusal to participate. Those
subjects who are reached and agree to participate may have a different
prevalence of fish oil use than those not reached or not interested. In addition to
these problems with the subjects, the actual measurements can differ from the
intended measurements. If the format of the questionnaire is unclear, subjects
may get confused and check the wrong box, or they may simply omit the
question by mistake.
These differences between the study plan and the actual study can alter the
answer to the research question. Figure 1.4 illustrates that errors in
implementing the study join errors of design in leading to a misleading or wrong
answer to the research question.
FIGURE 1.3. Design errors: if the intended sample and variables do not
represent the target population and phenomena of interest, these errors
may distort inferences about what actually happens in the population.
Causal Inference
A special kind of validity problem arises in studies that examine the association
between a predictor and an outcome variable in order to draw causal inference.
If a cohort study finds an association between fish intake and CHD events, does
this represent cause and effect, or is the fish just an innocent bystander in a
web of causation that involves other variables? Reducing the likelihood of
confounding and other rival explanations is one of the major challenges in
designing an observational study (Chapter 9).
The Errors of Research
No study is free of errors, and the goal is to maximize the validity of inferences
from what happened in the study sample to the nature of things in the
population. Erroneous inferences can be addressed in the analysis phase of
research, but a better strategy is to focus on design and implementation (Fig.
1.5), preventing errors from occurring in the first place to the extent that this is
practical.
The two main kinds of error that interfere with research inferences are random
error and systematic error. The distinction is important because the strategies
for minimizing them are quite different.
FIGURE 1.4. Implementation errors: if the actual subjects and
measurements do not represent the intended sample and variables, these
errors may distort inferences about what actually happened in the study.
Random error is a wrong result due to chance: sources of variation that are
equally likely to distort estimates from the study in either direction. If the true
prevalence of fish oil supplement use in 50- to 69-year-old patients with CHD is
20%, a well-designed sample of 100 patients from that population might contain
exactly 20 patients who use these supplements. More likely, however, the
sample would contain a nearby number such as 18, 19, 21, or 22. Occasionally,
chance would produce a substantially different number, such as 12 or 28. Among
several techniques for reducing the influence of random error (Chapter 4), the
simplest is to increase the sample size. The use of a larger sample diminishes
the likelihood of a wrong result by increasing the precision of the
estimate, the degree to which the observed prevalence approximates 20% each
time a sample is drawn.
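The arithmetic behind this example can be checked directly. The sketch below assumes a simple random sample, so that the number of supplement users in the sample follows a binomial distribution with a true prevalence of 20%; the function names are hypothetical and are introduced only for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k users in a random sample of n."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_between(low, high, n, p):
    """Probability that the number of users falls in low..high inclusive."""
    return sum(binomial_pmf(k, n, p) for k in range(low, high + 1))


def standard_error(p, n):
    """Standard error of an observed proportion from a sample of size n."""
    return math.sqrt(p * (1 - p) / n)


TRUE_PREVALENCE = 0.20  # the 20% used in the text
N = 100                 # sample size used in the text

p_exactly_20 = binomial_pmf(20, N, TRUE_PREVALENCE)
p_18_to_22 = prob_between(18, 22, N, TRUE_PREVALENCE)
# "Substantially different" counts: 12 or fewer, or 28 or more.
p_far_off = (prob_between(0, 12, N, TRUE_PREVALENCE)
             + prob_between(28, N, N, TRUE_PREVALENCE))

print(f"P(exactly 20 users)   = {p_exactly_20:.3f}")
print(f"P(18 to 22 users)     = {p_18_to_22:.3f}")
print(f"P(<= 12 or >= 28)     = {p_far_off:.3f}")

# Precision improves with sample size: the standard error is 0.04 at
# n = 100 and 0.02 at n = 400, so quadrupling the sample halves it.
for n in (100, 400, 1600):
    print(n, round(standard_error(TRUE_PREVALENCE, n), 4))
```

Exactly 20 users is the single most likely count but still an uncommon one, nearby counts are collectively more likely, and counts as extreme as 12 or 28 are rare but possible.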
Systematic error is a wrong result due to bias: sources of variation that
distort the study findings in one direction. An illustration is the decision in Fig.
1.3 to study patients in the investigator's clinic, where the local treatment
patterns have responded to her interest in the topic and her fellow doctors are
more likely than the US average to recommend fish oil. Increasing the sample
size has no effect on systematic error. The only way to improve the accuracy of
the estimate (the degree to which it approximates the true value) is to design
the study in a way that either reduces the size of the various biases or gives
some information about them. An example would be to compare results with
those from a second sample of patients with CHD drawn from another setting, for
example, examining whether the findings in patients seen in a cardiology clinic
are different from those in patients in a gynecology clinic.
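The contrast between the two kinds of error can be made concrete with a small calculation. The sketch below splits the expected error of a prevalence estimate into a bias component and a chance component. The 30% prevalence assumed for the investigator's clinic is a hypothetical figure chosen for illustration (the text says only that fish oil use is more likely there), and the function name is likewise an assumption.

```python
import math


def root_mean_squared_error(true_p, sampled_p, n):
    """Typical distance between the study estimate and the true prevalence.

    true_p    : prevalence in the target population
    sampled_p : prevalence in the population actually sampled
                (differs from true_p when the sample is biased)
    n         : number of subjects
    """
    bias = sampled_p - true_p                   # systematic error
    variance = sampled_p * (1 - sampled_p) / n  # random error
    return math.sqrt(bias ** 2 + variance)


TRUE_P = 0.20    # true prevalence used in the text
CLINIC_P = 0.30  # hypothetical prevalence in the investigator's clinic

for n in (100, 1_000, 10_000):
    biased = root_mean_squared_error(TRUE_P, CLINIC_P, n)
    unbiased = root_mean_squared_error(TRUE_P, TRUE_P, n)
    print(f"n={n:>6}  biased sample: {biased:.4f}  unbiased sample: {unbiased:.4f}")
```

With an unbiased sample the error keeps shrinking as the sample grows. With the biased sample it levels off at the size of the bias (10 percentage points under these assumptions), however many subjects are enrolled.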
FIGURE 1.5. Research errors. This blown-up detail of the error boxes in
Figures 1.3 and 1.4 reveals strategies for controlling random and systematic
error in the design, implementation, and analysis phases of the study.
The examples of random and systematic error in the preceding two paragraphs
are components of sampling error, which threatens inferences from the study
subjects to the population. Both random and systematic errors can also
contribute to measurement error, threatening the inferences from the study
measurements to the phenomena of interest. An illustration of random
measurement error is the variation in the response when the diet questionnaire
is administered to the patient on several occasions. An example of systematic
measurement error is the underestimation of the prevalence of fish oil use due
to lack of clarity in how the question is phrased. Additional strategies for
controlling all these sources of error are presented in Chapters 3 and 4.
The concepts presented in the last several pages are summarized in Fig. 1.6.
Getting the right answer to the research question is a matter of designing and
implementing the study in a fashion that keeps the extent of inferential errors at
an acceptable level.
FIGURE 1.6. Summary of the physiology of research: how it works.
Designing the Study
Study Protocol
The process of developing the study plan begins with the one-sentence
research question described earlier. Three versions of the study plan are then
produced in sequence, each larger and more detailed than the preceding one.

Outline of the elements of the study (Table 1.1 and Appendix 1.1). This
one-page beginning serves as a standardized checklist to remind the
investigator to include all the components. As important, the sequence has
an orderly logic that helps clarify the investigator's thinking on the topic.

Study protocol. This expansion on the study outline can range from 5 to
25 or more pages, and is used to plan the study and to apply for grant
support. The protocol parts are discussed throughout this book and put
together in Chapter 19.

Operations manual. This collection of specific procedural instructions,
questionnaires, and other materials is designed to ensure a uniform and
standardized approach to carrying out the study with good quality control
(Chapters 4 and 17).

The research question and study outline should be written out at an early stage.
Putting thoughts down on paper leads the way from vague ideas to specific plans
and provides a concrete basis for getting advice from colleagues and
consultants. It is a challenge to do it (ideas are easier to talk about than to
write down), but the rewards are a faster start and a better project.
Appendix 1.1 provides an example of a study outline. These plans deal more with
the anatomy of research (Table 1.1) than with its physiology (Fig. 1.6), so the
investigator must remind herself to worry about the errors that may result when
it is time to draw inferences about what happened in the study sample and how
it applies to the population. A study's virtues and problems can be revealed by
explicitly considering how the question the study is likely to answer differs from
the research question, given the plans for acquiring subjects and making
measurements, and given the likely problems of implementation.
With the study outline in hand and the intended inferences in mind, the
investigator can proceed with the details of her protocol. This includes getting
advice from colleagues, drafting specific recruitment and measurement methods,
considering scientific and ethical appropriateness, changing the study question
and outline, pretesting specific recruitment and measurement methods, making
more changes, getting more advice, and so forth. This iterative process is the
nature of research design and the topic of the rest of this book.
Trade-offs
Errors are an inherent part of all studies. The main issue is whether the errors
will be large enough to change the conclusions in important ways. When
designing a study, the investigator is in much the same position as a labor union
official bargaining for a new contract. The union official begins with a wish
list: shorter hours, more money, health care benefits, and so forth. She must
then make concessions, holding on to the things that are most important and
relinquishing those that are not essential. At the end of the negotiations is a
vital step: she looks at the best contract she could negotiate and decides if it
has become so bad that it is no longer worth having.
The same sort of concessions must be made by an investigator when she
transforms the research question to the study plan and considers potential
problems in implementation. On one side are the issues of internal and external
validity; on the other, feasibility. The vital last step of the union negotiator is
sometimes omitted. Once the study plan has been formulated, the investigator
must decide whether it adequately addresses the research question and whether
it can be implemented with acceptable levels of error. Often the answer is no,
and there is a need to begin the process anew. But take heart! Good scientists
distinguish themselves not so much by their uniformly good research ideas as by
their tenacity in turning over those that won't work and trying again.
Summary
1. The anatomy of research is the set of tangible elements that make up the
study plan: the research question and its significance, and the design,
study subjects, and measurement approaches. The challenge is to
design elements that are fast, inexpensive, and easy to implement.
2. The physiology of research is how the study works. The study findings are
used to draw inferences about what happened in the study sample
(internal validity), and about events in the world outside (external
validity). The challenge here is to design and implement a study plan
with adequate control over two major threats to these inferences: random
error (chance) and systematic error (bias).
3. In designing a study the investigator may find it helpful to consider the
relationships between the research question (what she wants to answer),
the study plan (what the study is designed to answer), and the actual
study (what the study will actually answer, given the errors of
implementation that can be anticipated).
4. A good way to develop the study plan is to begin with a one-sentence
version of the research question and expand this into an outline that sets
out the study elements in a standardized sequence. Later on the study plan
will be expanded into the protocol and the operations manual.
5. Good judgment by the investigator and advice from colleagues are needed
for the many trade-offs involved, and for determining the overall viability
of the project.
Appendix
Appendix 1.1
Outline of a Study*

Element             Example
Title               Relationship between Level of Experience and Degree of
                    Clinical Utility of Third Heart Sound Auscultation.
Research question   Do auscultatory assessments of third heart sound by more
                    experienced physicians result in higher sensitivity and
                    specificity for detecting left ventricular dysfunction than
                    assessments by less experienced physicians?
Significance        1. Auscultation of third heart sounds is a standard
                       physical examination indicator of heart failure that all
                       medical students have learned for 100 years.
                    2. The degree to which this clinical assessment, which
                       many physicians find difficult, actually detects
                       abnormal left ventricular function has not been
                       studied.
                    3. There are no studies of whether auscultatory
                       measurements of third heart sounds by cardiology
                       fellows and attendings are more accurate than those
                       of residents and medical students.
Study design        Cross-sectional analytic study
Subjects
  Entry criteria    Adults referred for left heart catheterization
  Sampling design   Consecutive sample of consenting patients
Variables
  Predictor         Level of experience of physicians
  Outcome           1. Area under the receiver operating characteristic curve
                       for third heart sound score (AUC) in relation to higher
                       LV diastolic pressure by catheterization
                    2. AUC in relation to lower ejection fraction by cardiac
                       echo
                    3. AUC in relation to B natriuretic protein
Statistical issues  Hypothesis: More experienced physicians will have more
                    favorable AUCs
                    Sample size (to be filled in after reading Chapter 6)

*Fortunately this study, designed and implemented by clinical investigators in
training at our institution, found that more experienced physicians were better
at detecting clinically significant third heart sounds (1).

Reference
1. Marcus GM, Vessey J, Jordan MV, et al. Relationship between accurate
auscultation of a clinically useful third heart sound and level of experience.
Arch Intern Med 2006;166:17.

Footnote
1 Predictors are sometimes termed independent variables and outcomes
dependent variables, but we find this usage confusing, particularly since
"independent" means something quite different in the context of multivariate
analyses.

2
Conceiving The Research Question
Steven R. Cummings
Warren S. Browner
Stephen B. Hulley
The research question is the uncertainty about something in the population
that the investigator wants to resolve by making measurements on her study
subjects (Fig. 2.1). There is no shortage of good research questions, and even as
we succeed in producing answers to some questions, we remain surrounded by
others. Recent clinical trials, for example, have established that treatments that
block the synthesis of estradiol (aromatase inhibitors) reduce the risk of breast
cancer in women who have had early stage cancer (1). But now there are new
questions: How long should treatment be continued, what is the best way to
prevent the osteoporosis that is an adverse effect of these drugs, and does this
treatment prevent breast cancer in patients with BRCA1 and BRCA2 mutations?
The challenge in searching for a research question is not a shortage of
uncertainties; it is the difficulty of finding an important one that can be
transformed into a feasible and valid study plan. This chapter presents
strategies for accomplishing this in arenas that range from classical clinical
research to the newly popular translational research.
FIGURE 2.1. Choosing the research question and designing the study plan.
Origins of a Research Question
For an established investigator the best research questions usually emerge
from the findings and problems she has observed in her own prior studies and in
those of other workers in the field. A new investigator has not yet developed this
base of experience. Although a fresh perspective can sometimes be useful by
allowing a creative person to conceive new approaches to old problems, lack of
experience is largely an impediment.
Mastering the Literature
It is important to master the published literature in an area of study;
scholarship is a necessary ingredient to good research. A new investigator
should conduct a thorough search of published literature in the area of study.
Carrying out a systematic review is a great first step in developing and
establishing expertise in a research area, and the underlying literature review
can serve as background for grant proposals and research reports. Recent
advances may be presented at research meetings or just be known to active
investigators in a particular field long before they are published. Thus, mastery
of a subject entails participating in meetings and building relationships with
experts in the field.
Being Alert to New Ideas and Techniques
In addition to the medical literature as a source of ideas for research questions,
all investigators find it helpful to attend conferences in which recent work is
presented. As important as the presentations are the opportunities for informal
conversations with other scientists during the breaks. A new investigator who
overcomes her shyness and engages a speaker at the coffee break will often find
the experience richly rewarding, and occasionally will find she has a new senior
colleague. Even better, for a speaker known in advance to be especially relevant,
it may be worthwhile to look up her recent publications and contact her in
advance to arrange a meeting during the conference.
A skeptical attitude about prevailing beliefs can stimulate good research
questions. For example, it has been widely believed that lacerations that extend
through the dermis require sutures to assure rapid healing and a satisfactory
cosmetic outcome. Alternative approaches that would not require local
anesthetics, would be faster and less expensive, and would produce as good a
cosmetic result were widely believed to be unachievable. However, Quinn et al.
noted personal experience and case series evidence that wounds repair
themselves regardless of whether wound edges are approximated. They carried
out a randomized trial in which patients with hand lacerations less than 2 cm in
length all received tap water irrigation and a 48-hour antibiotic dressing, but
one group received conventional sutures while the other did not. The group
treated with sutures had a more painful and time-consuming treatment, but
subsequent blinded assessment revealed similar time to healing and cosmetic
results (2).
The application of new technologies often generates new insights and
questions about familiar clinical problems, which in turn can generate new
paradigms (3). Recent advances in imaging and in techniques for molecular and
genetic analyses, for example, have spawned a large number of clinical research
studies that have informed extraordinary advances in the use of these
technologies in clinical medicine. Similarly, taking a new concept or finding from
one field and applying it to a problem in a different field can lead to good
research questions. Low bone density, for example, is widely recognized as a risk
factor for fractures. Investigators applied this technology to other populations
and found that women with low bone density have higher rates of cognitive
decline (4), perhaps due to low levels of estrogen over a lifetime.
Keeping the Imagination Roaming
Careful observation of patients has led to many descriptive studies and is a
fruitful source of research questions. Teaching is also an excellent source of
inspiration; ideas for studies often occur while preparing presentations or during
discussions with inquisitive students. Because there is usually not enough time
to develop these ideas on the spot, it is useful to keep them in a computer file or
notebook for future reference.
There is a major role for creativity in the process of conceiving research
questions, imagining new methods to address old questions and having fun with
ideas. There is also a need for tenacity, for returning to a troublesome problem
repeatedly until a resolution is reached that feels comfortable. Some creative
ideas come to mind during informal conversations with colleagues over lunch;
others occur in brainstorming sessions. Many inspirations are solo affairs that
strike while preparing a lecture, showering, perusing the Internet, or just sitting
and thinking. Fear of criticism or seeming unusual can prematurely quash new
ideas. The trick is to put an unresolved problem clearly in view and allow the
mind to run freely toward it.
Choosing a Mentor
Nothing substitutes for experience in guiding the many judgments involved in
conceiving and fleshing out a research question. Therefore an essential strategy
for a new investigator is to apprentice herself to an experienced mentor who
has the time and interest to work with her regularly. A good mentor will be
available for regular meetings and informal discussions, encourage creative
ideas, provide wisdom that comes from experience, help ensure protected time
for research, open doors to networking and funding opportunities, encourage the
development of independent work, and put the new investigator's name first on
grants and publications whenever appropriate. Sometimes it is desirable to have
more than one mentor, representing different disciplines. Good relationships of
this sort can also provide tangible resources that are needed: office space,
access to clinical populations, datasets and specimen banks, specialized
laboratories, financial resources, and a research team. Choosing a mentor can be
a difficult process, and is perhaps the single most important decision a new
investigator makes.
Characteristics of a Good Research Question
The characteristics of a good research question, assessed in the context of
the intended study design, are that it be feasible, interesting, novel, ethical, and
relevant (which form the mnemonic FINER; Table 2.1).
...
Table 2.1 FINER Criteria for a Good Research Question
Feasible
  Adequate number of subjects
  Adequate technical expertise
  Affordable in time and money
  Manageable in scope
Interesting
  Getting the answer intrigues the investigator and her friends
Novel
  Confirms, refutes or extends previous findings
  Provides new findings
Ethical
  Amenable to a study that institutional review board will approve
Relevant
  To scientific knowledge
  To clinical and health policy
  To future research
Feasible
It is best to know the practical limits and problems of studying a question early
on, before wasting much time and effort along unworkable lines.
Number of subjects. Many studies do not achieve their intended purposes
because they cannot enroll enough subjects. A preliminary calculation of
the sample size requirements of the study early on can be quite helpful
(Chapter 6), together with an estimate of the number of subjects likely to
be available for the study, the number who would be excluded or refuse to
participate, and the number who would be lost to follow-up. Even careful
planning often produces estimates that are overly optimistic, and the
investigator should assure that there are enough eligible willing subjects. It
is sometimes necessary to carry out a pilot survey or chart review to be
sure. If the number of subjects appears insufficient, the investigator can
consider several strategies: expanding the inclusion criteria, eliminating
unnecessary exclusion criteria, lengthening the time frame for enrolling
subjects, acquiring additional sources of subjects, developing more precise
measurement approaches, inviting colleagues to join in a multicenter study,
and using a different study design.
Technical expertise. The investigators must have the skills, equipment,
and experience needed for designing the study, recruiting the subjects,
measuring the variables, and managing and analyzing the data. Consultants
can help to shore up technical aspects that are unfamiliar to the
investigators, but for major areas of the study it is better to have an
experienced colleague steadily involved as a coinvestigator; for example, it
is wise to include a statistician as a member of the research team from the
beginning of the planning process. It is best to use familiar
and established approaches, because the process of developing new
methods and skills is time-consuming and uncertain. When a new approach
is needed, such as a questionnaire, expertise in how to accomplish the
innovation should be sought.
Cost in time and money. It is important to estimate the costs of each
component of the project, bearing in mind that the time and money needed
will generally exceed the amounts projected at the outset. If the projected
costs exceed the available funds, the only options are to consider a less
expensive design or to develop additional sources of funding. Early
recognition of a study that is too expensive or time-consuming can lead to
modification or abandonment of the plan before expending a great deal of
effort.
Scope. Problems often arise when an investigator attempts to accomplish
too much, making many measurements at repeated contacts with a large
group of subjects in an effort to answer too many research questions. The
solution is to narrow the scope of the study and focus only on the most
important goals. Many scientists find it difficult to give up the opportunity
to answer interesting side questions, but the reward may be a better
answer to the main question at hand.
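The enrollment arithmetic described under Number of subjects above can be
sketched in a few lines of Python. This is only an illustration: the function
names and all of the planning numbers (1,000 available subjects, 30% excluded,
40% refusing, 10% lost to follow-up, 200 completers required) are hypothetical
assumptions, not figures from any study, and the required sample size itself
would come from the calculations in Chapter 6.

```python
import math


def expected_completers(n_available, frac_excluded, frac_refusing, frac_lost):
    """Expected number of subjects who enroll and complete follow-up."""
    eligible = n_available * (1 - frac_excluded)   # survive exclusion criteria
    enrolled = eligible * (1 - frac_refusing)      # agree to participate
    return enrolled * (1 - frac_lost)              # not lost to follow-up


def required_available(n_required, frac_excluded, frac_refusing, frac_lost):
    """Smallest accessible pool expected to yield n_required completers."""
    yield_per_person = (
        (1 - frac_excluded) * (1 - frac_refusing) * (1 - frac_lost)
    )
    return math.ceil(n_required / yield_per_person)


# Hypothetical planning numbers, for illustration only.
completers = expected_completers(
    1000, frac_excluded=0.30, frac_refusing=0.40, frac_lost=0.10
)
pool = required_available(
    200, frac_excluded=0.30, frac_refusing=0.40, frac_lost=0.10
)
print(round(completers), pool)
```

Because planning estimates tend to be overly optimistic, it may be prudent to
rerun such a calculation with more pessimistic fractions before concluding that
enough subjects are available.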
Interesting
An investigator may have many motivations for pursuing a particular research
question: because it will provide financial support, because it is a logical or
important next step in building a career, or because getting at the truth of the
matter is interesting. We like this last reason; it is one that grows as it is
exercised and that provides the intensity of effort needed for overcoming the
many hurdles and frustrations of the research process. However, it is wise to
confirm that you are not the only one who finds a question interesting. Speak
with mentors and outside experts before devoting substantial energy to developing
a research plan or grant proposal that peers and funding agencies may consider
dull.
Novel
Good clinical research contributes new information. A study that merely
reiterates what is already established is not worth the effort and cost. The
novelty of a proposed study can be determined by thoroughly reviewing the
literature, consulting with experts who are familiar with ongoing research, and
searching lists of projects that have been funded using the NIH Computer
Retrieval of Information on Scientific Projects (CRISP). Although novelty is an
important criterion, a research question need not be totally original; it can be
worthwhile to ask whether a previous observation can be replicated, whether the
findings in one population also apply to others, or whether improved
measurement techniques can clarify the relationship between known risk factors
and a disease. A confirmatory study is particularly useful if it avoids the
weaknesses of previous studies.
Ethical
A good research question must be ethical. If the study poses unacceptable
physical risks or invasion of privacy (Chapter 14), the investigator must seek
other ways to answer the question. If there is uncertainty about whether the
study is ethical, it is helpful to discuss it at an early stage with a representative
of the institutional review board.
Relevant
Among the characteristics of a good research question, none is more important
than its relevance. A good way to decide about relevance is to imagine the
various outcomes that are likely to occur and consider how each possibility might
advance scientific knowledge, influence practice guidelines and health policy, or
guide further research. When relevance is uncertain, it is useful to discuss the
idea with mentors, clinicians or experts in the field.
Developing the Research Question and Study Plan
It helps a great deal to write down the research question and a brief (one-page)
outline of the study plan at an early stage (Appendix 1.1). This requires some
self-discipline, but it forces the investigator to clarify her ideas about the plan
and to discover specific problems that need attention. The outline also provides
a basis for specific suggestions from colleagues.
Problems and Solutions
Two general solutions to the problems involved in developing a research question
deserve special emphasis. The first is the importance of getting good advice. We
recommend a research team that includes representatives of each of the major
disciplines involved in the study, and that includes at least one senior scientist.
In addition, it is a good idea to consult with specialists who can guide the
discovery of previous research on the topic and the choice and design of
measurement techniques. Sometimes a local expert will do, but it is often useful
to contact individuals in other institutions who have published pertinent work on
the subject. A new investigator may be intimidated by the prospect of writing or
calling someone she knows only as an author in the Journal of the American
Medical Association, but most scientists respond favorably to such requests for
advice.
The second solution is to allow the study plan to gradually emerge from an
iterative process of designing, reviewing, pretesting, and revising. Once the
one-page study plan is specified, advice from colleagues will usually result in
important changes. As the protocol gradually takes shape, a small pretest of the
number and willingness of the potential subjects may lead to changes in the
recruitment plan. The preferred imaging test may turn out to be prohibitively
costly and a less expensive alternative sought. The qualities needed in the
investigator for these planning stages of research are creativity, tenacity, and
judgment.
Primary and Secondary Questions
Many studies have more than one research question. Experiments often address
the effect of the intervention on more than one outcome; for example, the
Women's Health Initiative was designed to determine whether reducing dietary
fat intake would reduce the risk of breast cancer, but an important secondary
hypothesis was to examine the effect on coronary events (5). Almost all cohort
and case-control studies look at several risk factors for each outcome. The
advantage of designing a study with several research questions is the efficiency
that can result, with several answers emerging from a single study. The
disadvantages are the increased complexity of designing and implementing the
study and of drawing statistical inferences when there are multiple hypotheses
(Chapter 5). A sensible strategy is to establish a single
primary research question around which to focus the study plan and sample size
estimate, adding secondary research questions about other predictors or
outcomes that may also produce valuable conclusions.
Translational Research
Translational research refers to studies of how to translate findings from
the ivory tower into the real world. Translational research (6) comes in
two main flavors (Fig. 2.2):
Applying basic science findings from laboratory research in clinical studies
of patients (sometimes abbreviated as T1 research), and
Applying the findings of these clinical studies to alter health practices in the
community (sometimes abbreviated as T2 research).
Both forms of translational research require identifying a translation
opportunity. Just as a literary translator first needs to find a novel or poem that
merits translating, a translational research investigator must first identify a
worthwhile scientific finding. Translational research projects are usually limited
by the quality of the source material, so think "Tolstoy": the more
valuable the result of a laboratory experiment or a clinical trial, the more likely a
translational project will have merit. Pay attention to colleagues when they talk
about their latest findings, to presentations at national meetings about novel
methods, and to speculation about mechanisms in published reports.
Translating Research from the Laboratory to Clinical Practice (T1)
A host of new tools have become available for clinical investigations, including
analysis of single nucleotide polymorphisms (SNPs), gene expression arrays,
imaging and proteomics. From the viewpoint of a clinical investigator, there is
nothing intrinsically different about any of these measurements or test results.
The chapters on measurements will be useful in planning studies involving these
types of measurements, as will the advice about study design, population
samples, and sample size. Especially relevant will be the information about
multiple hypothesis testing.
Compared with ordinary clinical research, being a successful T1 translational
researcher requires having an additional skill set or identifying a collaborator
with those skills. Bench-to-bedside research necessitates a thorough
understanding of the underlying basic science. Although many clinical
researchers believe that they can master this knowledge (just as many
laboratory-based researchers believe doing clinical research requires no special
training), in reality the skills hardly overlap. For example, suppose a basic
scientist has identified a gene that affects circadian rhythm in mice. A clinical
investigator has access to a cohort study with data on sleep cycles in people
and a bank of stored DNA, and wants to study whether there is an association
between polymorphisms in the human homolog of that gene and sleep in people.
In order to propose a T1 study looking at that association she needs
collaborators who are familiar with that gene and with the advantages and
limitations of the various methods of genotyping.
Similarly, imagine that a laboratory-based investigator has discovered a
unique pattern of gene expression in tissue biopsy samples from patients with
breast cancer. She should not propose a study of its use as a diagnostic test for
breast cancer without collaborating with someone who understands the
importance of test-retest reliability, receiver operating characteristic curves,
sampling and blinding, and the effects of prior probability of disease on the
applicability of her discovery. Good translational research requires expertise in
more than one area. Thus a research team interested in testing a new drug needs
scientists familiar with molecular biology, pharmacokinetics, pharmacodynamics,
Phase I clinical trials, and the practice of medicine.
Translating Research from Clinical Studies to Populations (T2)
Studies that attempt to apply findings from clinical trials to larger and more
diverse populations often require expertise in identifying high-risk or
underserved groups, understanding the difference between screening and
diagnosis, and knowledge of how to implement changes in health care delivery
systems. On a practical level, this kind of research usually needs access to large
groups of patients (or clinicians), such as those enrolled in health plans. Support
and advice from the department chair, the chief of the medical staff at an
affiliated hospital, or the leader of the local medical society may be helpful
when planning these studies.
FIGURE 2.2. Translational research is the component of clinical research that
interacts with basic science research (hatched area T1) or with population
research (hatched area T2).
Some investigators take a short cut when doing this type of translational
research, studying patients in their colleagues' practices (e.g., a housestaff-run
clinic in an academic medical center) rather than involving practitioners in the
community. This is a bit like translating Aristophanes into modern Greek: it will
still not be very useful for English-speaking readers. Chapter 18 emphasizes the
importance of getting as far into the community as possible.
The sampling scheme is often a problem when studying whether research results
can be applied in general populations. For example, in a study of whether a new
office-based diet and exercise program will be effective in the community, it may
not be possible to randomly assign individual patients. One solution would be to
use physician practices as the unit of randomization; this will almost certainly
require collaborating with an expert on cluster sampling and clustered analyses.
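Using physician practices as the unit of randomization can be illustrated with
a minimal Python sketch. The practice labels, arm names, and function name below
are hypothetical, and the sketch covers only the allocation step: every patient
inherits the arm of her practice. It does not address the clustered analysis,
which is where the expert collaboration mentioned above is needed.

```python
import random


def randomize_practices(practice_ids, seed=None):
    """Randomly assign whole practices to two arms of near-equal size."""
    rng = random.Random(seed)          # seeded for a reproducible allocation
    shuffled = list(practice_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    assignment = {pid: "intervention" for pid in shuffled[:half]}
    assignment.update({pid: "usual care" for pid in shuffled[half:]})
    return assignment


# Hypothetical practices and patients, for illustration only.
practices = ["A", "B", "C", "D", "E", "F"]
assignment = randomize_practices(practices, seed=1)

# Each patient is labeled by the practice she attends, not randomized herself.
patients = [("patient-1", "A"), ("patient-2", "A"), ("patient-3", "D")]
for patient_id, practice in patients:
    print(patient_id, practice, assignment[practice])
```

Patients from the same practice always land in the same arm, which is the
defining feature of this design and the reason the analysis must account for
clustering.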
Many T2 research projects use proxy process variables as their
outcomes. For example, if clinical trials have established that a new treatment
reduces mortality from sepsis, a translational research study might not need to
have mortality as the outcome. Rather, it might examine different approaches to
implementing the treatment protocol, and
use the percentage of patients with sepsis who were placed on the protocol as
the outcome of the study.
Summary
1. All studies should start with a research question that addresses what the
investigator would like to know. The goal is to find one that can be
developed into a good study plan.
2. One key ingredient for developing a research question is scholarship that
is acquired by a thorough and continuing review of the work of others, both
published and unpublished. Another key ingredient is experience, and the
single most important decision a new investigator makes is her choice of
one or two senior scientists to serve as her mentor(s).
3. Good research questions arise from medical articles and conferences,
from critical thinking about clinical practices and problems, from applying
new methods to old issues, and from ideas that emerge from teaching
and daydreaming.
4. Before committing much time and effort to writing a proposal or carrying
out a study, the investigator should consider whether the research question
and study plan are FINER: feasible, interesting, novel, ethical,
and relevant.
5. Early on, the research question should be developed into a one-page
written study plan that specifically describes how many subjects will be
needed, and how the subjects will be selected and the measurements made.
6. Developing the research question and study plan is an iterative process
that includes consultations with advisors and friends, a growing familiarity
with the literature, and pilot studies of the recruitment and measurement
approaches. The qualities needed in the investigator are creativity,
tenacity, and judgment.
7. Most studies have more than one question, but it is useful to focus on a
single primary question in designing and implementing the study.
8. Translational research is a type of clinical research that studies the
application of basic science findings in clinical studies of patients (T1), and
how to apply these findings to improve health practices in the community
(T2); it requires collaborations from laboratory to population-based
investigators, using the clinical research methods presented in this
book.
References
1. The ATAC Trialists Group. Anastrozole alone or in combination with
tamoxifen versus tamoxifen alone for adjuvant treatment of postmenopausal
women with early breast cancer: first results of the ATAC randomized trial.
Lancet 2002;359:2131-2139.
2. Quinn J, Cummings S, Callaham M, et al. Suturing versus conservative
management of lacerations of the hand: randomized controlled trial. BMJ
2002;325:299-301.
3. Kuhn TS. The structure of scientific revolutions. Chicago, IL: University of
Chicago Press, 1962.
4. Yaffe K, Browner W, Cauley J, et al. Association between bone mineral
density and cognitive decline in older women. J Am Geriatr Soc
1999;47:1176-1182.
5. Prentice RL, Caan B, Chlebowski RT, et al. Low-fat dietary pattern and risk
of invasive breast cancer. JAMA 2006;295:629-642.
6. Zerhouni EA. US biomedical research: basic, translational and clinical
sciences. JAMA 2005;294:1352-1358.
3
Choosing the Study Subjects:
Specification, Sampling, and Recruitment
Stephen B. Hulley
Thomas B. Newman
Steven R. Cummings
A good choice of study subjects serves the vital purpose of ensuring that the
findings in the study accurately represent what is going on in the population of
interest (Fig. 3.1). The protocol must specify a sample of subjects that can be
studied at an acceptable cost in time and money, yet one that is large enough to
control random error and representative enough to allow generalizing the study
findings to populations of interest. An important precept here is that
generalizability is rarely a simple yes-or-no matter; it is a complex qualitative
judgment that is highly dependent on the investigator's choice of population and
of sampling design.
We will come to the issue of choosing the appropriate number of study subjects
in Chapter 6. In this chapter we address the process of specifying and
sampling the kinds of subjects who will be representative and feasible. We also
discuss strategies for recruiting these subjects to participate in the study.
FIGURE 3.1. Choosing study subjects that represent the population.
Basic Terms and Concepts
Populations and Samples
A population is a complete set of people with a specified set of characteristics,
and a sample is a subset of the population. In lay usage, the characteristics that
define a population are geographic (the population of Canada). In research the
defining characteristics are also clinical, demographic, and temporal:
Clinical and demographic characteristics define the target population, the
large set of people throughout the world to which the results will be
generalized: all teenagers with asthma, for example.
The accessible population is a geographically and temporally defined
subset of the target population that is available for study: teenagers with
asthma living in the investigator's town this year.
The study sample is the subset of the accessible population that
participates in the study.
Generalizing the Study Findings
The classic Framingham Study was an early approach to designing a study that
would allow inferences from findings observed in a sample to be applied to a
population (Fig. 3.2). The sampling design called for listing all the adult
residents of the town and then asking every second person to participate. This
systematic sampling design is not as tamperproof as a true random
sample (as noted later in this chapter), but
two more serious concerns were the facts that one-third of the Framingham
residents selected for the study refused to participate, and that in their place the
investigators accepted other residents who had heard about the study and
volunteered (1). Because respondents are often healthier than nonrespondents,
especially if they are volunteers, the characteristics of the actual sample
undoubtedly differed from those of the intended sample. Every sample has some
errors, however, and the issue is how much damage has been done. The
Framingham Study sampling errors do not seem large enough to invalidate the
conclusion that the findings of the study, that hypertension is a risk factor for
coronary heart disease (CHD), can be generalized to all the residents of
Framingham.
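The difference between the systematic design used in Framingham (every second
person on a list) and a true random sample can be sketched as follows. This is
an illustrative assumption-laden example: the roster of ten numbered residents,
the function names, and the seed are hypothetical and are not taken from the
Framingham protocol.

```python
import random


def systematic_sample(roster, step=2, start=0):
    """Take every `step`-th person from an ordered roster."""
    return roster[start::step]


def simple_random_sample(roster, n, seed=None):
    """Draw n people without replacement, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(roster, n)


# Hypothetical roster of ten residents, for illustration only.
roster = list(range(1, 11))

every_second = systematic_sample(roster)          # fixed by list order
random_five = simple_random_sample(roster, 5, seed=0)

print(every_second)
print(sorted(random_five))
```

In the systematic version, who is selected is fully determined by the order of
the list and the starting point, so anyone who controls the list could influence
the sample; the random version removes that opportunity.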
The next concern is the validity of generalizing the finding that hypertension is a
risk factor for CHD from the accessible population of Framingham residents to
target populations elsewhere. This inference is more subjective. The town of
Framingham was selected from the universe of towns in the world, not with a
scientific sampling design, but because it seemed fairly typical of middle-class
residential communities in the United States and was convenient to the
investigators. The validity of generalizing the Framingham risk relationships to
populations in other parts of the country involves the precept that, in general,
analytic studies and clinical trials that address biologic relationships produce
more widely generalizable results across diverse populations than descriptive
studies that address distributions of characteristics. For example, the strength of
hypertension as a risk factor for CHD is similar in Caucasian Framingham
residents to that observed in inner city African Americans, but the prevalence of
hypertension is much higher in the latter population.
Steps in Designing the Protocol for Acquiring Study Subjects
The inferences in Fig. 3.2 are presented from right to left, the sequence used for
interpreting the findings of a completed study. An investigator who is planning a
study reverses this sequence, beginning on the left (Fig. 3.3). She begins by
specifying the clinical and demographic characteristics of the target population
that will serve the research question well. She then uses geographic and
temporal criteria to specify a study sample that is representative and practical.
FIGURE 3.2. Inferences in generalizing from the study subjects to the
target populations.
Selection Criteria
An investigator wants to study the efficacy of low dose testosterone versus
placebo for enhancing libido in menopause. She begins by creating selection
criteria that define the population to be studied.
Establishing Inclusion Criteria
The inclusion criteria define the main characteristics of the target population
that pertain to the research question (Table 3.1). Age is often a crucial factor. In
this study the investigators might decide to focus on women in their fifties,
reasoning that in this group the benefit-to-harm ratio of the drug might be
optimal, but another study might include older decades. Incorporating African
American, Hispanic, and Asian women in the study would appear to expand
generalizability, but it's important to realize that the increase in generalizability
is illusory unless there are enough women of each race to statistically test for
the presence of an interaction (an effect in one
race that is different from that in other races, Chapter 9); this is a large
number, and most studies are not powered to discover such interactions.
Specifying clinical characteristics often involves difficult judgments, not only
about which factors are important to the research question, but about how to
define them. How, for example, would an investigator put into practice the
criterion that the subjects be in "good general health"? She might decide
not to include patients with diseases that might be worsened by the testosterone
treatment (atherosclerosis) or interfere with follow-up (metastatic cancer).
The selection criteria that address the geographic and temporal characteristics of
the accessible population may involve trade-offs between scientific and practical
goals. The investigator may find that patients at her own hospital are an
available and inexpensive source of subjects. But she must consider whether
peculiarities of the local referral patterns might interfere with generalizing the
results to other populations. On these and other decisions about inclusion
criteria, there is no single course of action that is clearly right or wrong; the
important thing is to make decisions that are sensible, that can be used
consistently throughout the study, and that will provide a basis for knowing to
whom the published conclusions apply.
FIGURE 3.3. Steps in designing the protocol for choosing the study subjects.
Establishing Exclusion Criteria
Exclusion criteria indicate subsets of individuals who would be suitable for the
research question were it not for characteristics that might interfere with the
success of follow-up efforts, the quality of the data, or the acceptability of
randomized treatment (Table 3.1). Clinical trials differ somewhat from
observational studies in being more likely to have exclusions mandated by
concern for the safety of the participant (Chapter 10). A good general rule that
keeps things simple and preserves the number of potential study subjects is to
have as few exclusion criteria as possible.
Table 3.1 Designing Selection Criteria for a Clinical Trial of Low Dose Testosterone
to Enhance Libido in Menopause
Design Feature | Example
Inclusion criteria (be specific): specifying populations relevant to the research
question and efficient for study
Demographic characteristics | White women 50 to 60 years old
Clinical characteristics | Good general health; has a sexual partner
Geographic (administrative) characteristics | Patients attending clinic at the investigator's hospital
Temporal characteristics | Between January 1 and December 31 of specified year
Exclusion criteria can be a two-edged sword. Including alcoholics in the
testosterone trial might provide subjects with low baseline libido, for example,
but this potential advantage could be accompanied by greater problems with
adherence to study treatment and with follow-up; the investigator may decide to
exclude alcoholics if she believes that adherence to study protocol is the more
important consideration. (She will then face the problem of developing specific
criteria for classifying whether an individual is alcoholic.)
Clinical versus Community Populations
If the research question involves patients with a disease, hospitalized or
clinic-based patients are easier to find, but selection factors that determine who
comes to the hospital or clinic may have an important effect. For example, a
specialty clinic at a tertiary care medical center tends to accumulate patients with
serious forms of the disease that give a distorted impression of the commonplace
features and prognosis. For research questions that pertain to diagnosis,
treatment, and prognosis of patients in medical settings, sampling from primary
care clinics can be a better choice.
Another common option in choosing the sample is to select subjects in the
community who represent a healthy population. These samples are often
recruited using mass mailings and advertising, and are not fully representative
of a general population because they must (a) volunteer, (b) fit inclusion and
exclusion criteria, and (c) agree to be included in the study. True
population-based samples are difficult and expensive to recruit, but
useful for guiding public health and clinical practice in the community. One of
the largest and best examples is the National Health and Nutrition Examination
Survey (NHANES), a probability sample of all US residents.
The size and diversity of a sample can be increased by collecting data by mail or
telephone, by collaborating with colleagues in other cities, or by using
preexisting data sets such as NHANES and Medicare. Electronically accessible
datasets have come into widespread use in clinical research and may be more
representative of national populations and less time-consuming than other
possibilities (Chapter 13).
Table 3.1 (continued)
Design Feature | Example
Exclusion criteria (be parsimonious): specifying subsets of the population that
will not be studied because of
A high likelihood of being lost to follow-up | Alcoholic or plan to move out of state
An inability to provide good data | Disoriented or have a language barrier*
Being at high risk of possible adverse effects | History of myocardial infarction or stroke
*Alternatives to exclusion (when these subgroups are important to the
research question) would be collecting nonverbal data or using bilingual
staff and questionnaires.
Sampling
Often the number of people who meet the selection criteria is too large, and
there is a need to select a sample (subset) of the population for study.
Convenience Samples
In clinical research the study sample is often made up of people who meet the
entry criteria and are easily accessible to the investigator. This is termed a
convenience sample. It has obvious advantages in cost and logistics, and is a
good choice for many research questions.
A convenience sample can minimize volunteerism and other selection biases by
consecutively selecting every accessible person who meets the entry criteria.
Such a consecutive sample is especially desirable when it amounts to taking
the entire accessible population over a long enough period to include seasonal
variations or other temporal changes that are important to the research
question. The validity of using a sample rests on the premise that, for the purpose
of answering the research question at hand, it sufficiently represents the target
population. With convenience samples this requires a subjective judgment.
Probability Samples
Sometimes, particularly with descriptive research questions, there is a need for a
scientific basis for generalizing the findings in the study sample to the
population. Probability sampling, the gold standard for ensuring generalizability,
uses a random process to guarantee that each unit of the population has a
specified chance of being included in the sample. It is a scientific approach that
provides a rigorous basis for estimating the fidelity with which phenomena
observed in the sample represent those in the population, and for computing
statistical significance and confidence intervals. There are several versions of
this approach.
A simple random sample is drawn by enumerating the units of the population
and selecting a subset at random. The most common use of this approach in
clinical research is when the investigator wishes to select a representative
subset from a population that is larger than she needs. To take a random sample
of the cataract surgery patients at her hospital, for example, the investigator
could list all such patients on the operating room schedules for the period of
study, then use a table of random numbers to select individuals for study
(Appendix 3.1).
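The cataract surgery example can be sketched in a few lines of code. This is only an illustration under stated assumptions: the patient identifiers, the list size of 741, the sample size of 74, and the seed are all hypothetical, and a pseudorandom number generator stands in for the printed table of random numbers.

```python
import random


def simple_random_sample(units, n, seed=None):
    """Return n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(list(units), n)


# Hypothetical enumeration: every cataract surgery patient found on the
# operating room schedules for the study period, numbered 1 to 741.
patients = [f"patient_{i:03d}" for i in range(1, 742)]

sample = simple_random_sample(patients, n=74, seed=2007)
print(len(sample), "patients selected; first three:", sample[:3])
```

Recording the seed alongside the enumerated list lets someone else reproduce the selection exactly, which is one practical way to document that the investigator did not influence who entered the sample.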
A systematic sample resembles a simple random sample in first enumerating
the population but differs in that the sample is selected by a preordained
periodic process (e.g., the Framingham approach of taking every second person
from a list of town residents). Systematic sampling is susceptible to errors
caused by natural periodicities in the population, and it allows the investigator to
predict and perhaps manipulate those who will be in the sample. It offers no
logistic advantages over simple random sampling, and in clinical research it is
rarely a better choice.
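A small sketch can make the periodicity problem concrete. The town list below is invented for illustration: it assumes each household contributes two consecutive entries, always in the same order, so that taking every second person picks up only one kind of entry.

```python
def systematic_sample(units, k, start=0):
    """Take every k-th unit from an enumerated list, beginning at index `start`."""
    return list(units)[start::k]


# Hypothetical town list: five households, each listed as two consecutive
# entries in a fixed order (first adult, then second adult).
residents = [(f"household_{h}", position)
             for h in range(1, 6)
             for position in ("first adult", "second adult")]

every_second = systematic_sample(residents, k=2)
positions_sampled = {position for _, position in every_second}
print(positions_sampled)
```

Because the sampling interval matches the period of the list, the sample contains only "first adult" entries. Anyone who knows the list order and the starting point can also predict exactly who will be chosen, which is the second weakness the text describes.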
A stratified random sample involves dividing the population into subgroups
according to characteristics such as sex or race and taking a random sample
from each of these strata. The subsamples in a stratified sample can be
weighted to draw disproportionately from subgroups that are less common in the
population but of special interest to the investigator. In studying the incidence
of toxemia in pregnancy, for example, the investigator could stratify the
population by race and then sample equal numbers from each stratum. This
would yield incidence estimates of comparable precision from each racial group.
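The equal-numbers design in the toxemia example might look like the following sketch. The stratum labels, stratum sizes, and the choice of 30 subjects per stratum are assumptions made up for illustration; they do not come from any real study.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of equal size from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


# Hypothetical population of 1,000 pregnancies in three strata of unequal size.
population = ([("A", i) for i in range(800)]
              + [("B", i) for i in range(150)]
              + [("C", i) for i in range(50)])

sample = stratified_sample(population, stratum_of=lambda unit: unit[0],
                           n_per_stratum=30, seed=1)

# Each stratum is sampled with a different probability, so an estimate for the
# whole population must weight each stratum back by its sampling fraction.
sampling_fractions = {"A": 30 / 800, "B": 30 / 150, "C": 30 / 50}
print({name: len(members) for name, members in sample.items()})
```

With 30 subjects in every stratum, the within-stratum incidence estimates rest on the same number of observations, which is what gives them comparable precision. The cost is that the pooled sample no longer mirrors the population unless the strata are reweighted.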
A cluster sample is a random sample of natural groupings (clusters) of
individuals in the population. Cluster sampling is very useful when the population
is widely dispersed and it is impractical to list and sample from all its elements.
Consider, for example, the problem of reviewing the hospital records of patients
with lung cancer selected randomly from a statewide list of discharge diagnoses;
patients could be studied at lower cost by choosing a random sample of the
hospitals and taking the cases from these. Community surveys often use a
two-stage cluster sample: a random sample is drawn from city blocks enumerated
on a map and a field team visits the blocks in the sample, lists all the addresses in
each, and selects a subsample for study by a second random process. A
disadvantage of cluster sampling is the fact that naturally occurring groups are
often relatively homogeneous for the variables of interest; each city block, for
example, tends to have people of uniform socioeconomic status. This means that
the effective sample size will be somewhat smaller than the number of subjects,
and that statistical analysis must take the clustering into account.
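One common way to quantify how much smaller the effective sample size becomes is the design effect approximation for equal-sized clusters, 1 + (m - 1) x ICC, where m is the cluster size and ICC is the intraclass correlation. The chapter does not give this formula; it is a standard survey-sampling approximation brought in here as an assumption, and the numbers below are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Approximate variance inflation for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_subjects, cluster_size, icc):
    """Number of independent subjects carrying the same information."""
    return n_subjects / design_effect(cluster_size, icc)


# Hypothetical survey: 20 city blocks with 25 respondents each (500 subjects),
# and an assumed intraclass correlation of 0.05 for the variable of interest.
deff = design_effect(cluster_size=25, icc=0.05)
n_eff = effective_sample_size(n_subjects=500, cluster_size=25, icc=0.05)
print(f"design effect = {deff:.2f}, effective sample size = {n_eff:.0f}")
```

Under these assumed values the 500 subjects carry roughly the information of 227 independent ones. When the clusters are not homogeneous at all (ICC of 0), the design effect is 1 and nothing is lost.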
Summarizing the Sampling Design Options
The use of descriptive statistics and tests of statistical significance to draw
inferences about the population from observations in the study sample is based
on the assumption that a probability sample has been used. But in clinical
research a random sample of the whole target population is almost never
possible. Convenience sampling, preferably with a consecutive design, is a
practical approach that is often suitable. The decision about whether the
proposed sampling design is satisfactory requires that the investigator make a
judgment: for the research question at hand, will the conclusions of the study
be similar to those that would result from studying a true probability sample of
the target population?
Recruitment
The Goals of Recruitment
An important factor to consider in choosing the accessible population and
sampling approach is the feasibility of recruiting study participants. There are
two main goals:
(a) to recruit a sample that adequately represents the target population; and
(b) to recruit enough subjects to meet the sample size requirements.
Achieving a Representative Sample
The approach to recruiting a representative sample begins in the design phase
with choosing populations and sampling methods wisely. It ends with
implementation, guarding against errors in applying the entry criteria to
prospective study participants, and monitoring adherence to these criteria as the
study progresses.
A particular concern, especially in observational studies, is the problem of
nonresponse. The proportion of eligible subjects who agree to enter the study
(the response rate) influences the validity of inferring that the sample
represents the population. People who are difficult to reach and those who refuse
to participate once they are contacted tend to be different from people who do
enroll. The level of nonresponse that will compromise the generalizability of the
study depends on the research question and on the reasons for not responding.
A nonresponse rate of 25%, a good achievement in many settings, can seriously
distort the observed prevalence of a disease when the disease itself is a cause of
nonresponse. The degree to which this bias may influence the conclusions of a
study can sometimes be estimated during the study with an intensive effort to
acquire additional information on a sample of nonrespondents.
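A short calculation shows how a 25% nonresponse rate can distort prevalence when the disease itself lowers the chance of responding. All figures here are invented for illustration: 1,000 eligible people, a true prevalence of 10%, a 50% response rate among those with the disease, and a response rate among the healthy chosen so that overall nonresponse comes to 25%.

```python
def observed_prevalence(n_eligible, true_prevalence,
                        response_if_diseased, response_if_healthy):
    """Return (prevalence among responders, overall nonresponse rate)."""
    diseased = n_eligible * true_prevalence
    healthy = n_eligible - diseased
    diseased_responders = diseased * response_if_diseased
    healthy_responders = healthy * response_if_healthy
    responders = diseased_responders + healthy_responders
    return diseased_responders / responders, 1 - responders / n_eligible


# Hypothetical scenario: 100 diseased and 900 healthy people are eligible;
# 50 of the diseased and 700 of the healthy agree to take part.
prevalence, nonresponse = observed_prevalence(
    n_eligible=1000, true_prevalence=0.10,
    response_if_diseased=0.50, response_if_healthy=700 / 900)
print(f"nonresponse = {nonresponse:.0%}, observed prevalence = {prevalence:.1%}")
```

Under these assumptions the study would report a prevalence near 6.7% instead of the true 10%, an underestimate of about one third, even though the overall response rate looks respectable. If response were unrelated to disease, the same 25% nonresponse would leave the estimate unbiased.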
The best way to deal with nonresponse bias, however, is to minimize the number
of nonrespondents. The problem of failure to make contact with individuals who
have been chosen for the sample can be reduced by designing a systematic
series of repeated contact attempts and by using various methods (mail, email,
telephone, home visit). Among those contacted, refusal to participate can be
minimized by improving the efficiency and attractiveness of the study (especially
the initial encounter), by choosing a design that avoids invasive and
uncomfortable tests, by using brochures and individual discussion to allay
anxiety and discomfort, and by providing incentives such as reimbursing the
costs of transportation and providing the results of tests. If language barriers
are prevalent, they can be circumvented by using bilingual staff and translated
questionnaires.
Recruiting Sufficient Numbers of Subjects
Falling short in the rate of recruitment is one of the commonest problems in
clinical research. In planning a study it is safe to assume that the number of
subjects who meet the entry criteria and agree to enter the study will be fewer,
sometimes by several fold, than the number projected at the outset. The
solutions to this problem are to estimate the magnitude of the recruitment
problem empirically with a pretest, to plan the study with an accessible
population that is larger than believed necessary, and to make contingency plans
should the need arise for additional subjects. While the study is in progress it is
important to closely monitor progress in meeting the recruitment goals and
tabulate reasons for falling short of the goals; understanding the proportions of
potential subjects lost to the study at various stages can lead to strategies for
enhancing recruitment by reducing some of these losses.
Sometimes recruitment involves selecting patients who are already known to the
members of the research team (e.g., in a study of a new treatment in patients
attending the investigator's clinic). Here the chief concern is to present the
opportunity for participation in the study fairly, making clear the advantages and
disadvantages. In discussing the desirability of participation, the investigator
must recognize the special ethical dilemmas that arise when her advice as the
patient's physician might conflict with her interests as an investigator (Chapter 14).
Often recruitment involves contacting populations that are not known to the
members of the research team. It is helpful if at least one member of the
research team has previous experience with the approaches for contacting the
prospective subjects. These include screening in work settings or public places
such as shopping malls; sending out large numbers of mailings to listings such
as driver's license holders; advertising on the Internet; inviting referrals from
clinicians; carrying out retrospective record reviews; and examining lists of
patients seen in clinic and hospital settings. Some of these approaches,
particularly the latter two, involve concerns with privacy invasion that must be
considered by the institutional review board.
It may be helpful to prepare for recruitment by getting the support of important
organizations. For example, the investigator can meet with hospital
administrators to discuss a clinic-based sample, and with the leadership of the
medical society and county health department to plan a community screening
operation or mailing to physicians. Written endorsements can be included as an
appendix in applications for funding. For large studies it may be useful to create
a favorable climate in the community by giving public lectures or by advertising
through radio, TV, newspaper, fliers, websites, or mass mailings.
Summary
1. All clinical research is based, philosophically and practically, on the use of a
sample to represent a population.
2. The advantage of sampling is efficiency; it allows the investigator to draw
inferences about a large population by examining a subset at relatively
small cost in time and effort. The disadvantage is the source of error it
introduces. If the sample is not sufficiently representative for the research
question at hand, the findings may not generalize well to the population.
3. In designing a study the first step is to conceptualize the target
population with a specific set of inclusion criteria that establish the
demographic and clinical characteristics of subjects well suited to the
research question, an appropriate accessible population that is
geographically and temporally convenient, and a parsimonious set of
exclusion criteria that eliminate subjects who are unethical or
inappropriate to study.
4. The next step is to design an approach to sampling the population. A
convenience sample is often a good choice in clinical research, especially
if it is drawn consecutively. Simple random sampling can be used to
reduce the size of a convenience sample if necessary, and other
probability sampling strategies (stratified and cluster) are useful in
certain situations.
5. Finally, the investigator must design and implement strategies for
recruiting a sample of subjects that is large enough to meet the study
needs, and that minimizes bias due to nonresponse and loss to follow-up.
Reference
1. Dawber TR. The Framingham Study. Cambridge, MA: Harvard University
Press, 1980:14-29.
Appendix 3.1
Selecting a Random Sample from a Table of Random Numbers
10480 15011 01536 81647 91646 02011
22368 46573 25595 85393 30995 89198
24130 48390 22527 97265 78393 64809
42167 93093 06243 61680 07856 16376
37570 33997 81837 16656 06121 91782
77921 06907 11008 42751 27756 53498
99562 72905 56420 69994 98872 31016
96301 91977 05463 07972 18876 20922
89572 14342 63661 10281 17453 18103
85475 36857 53342 53998 53060 59533
28918 79578 88231 33276 70997 79936
63553 40961 48235 03427 49626 69445
09429 93969 52636 92737 88974 33488
10365 61129 87529 85689 48237 52267
07119 97336 71048 08178 77233 13916
51085 12765 51821 51259 77452 16308
02368 21382 52404 60268 89368 19885
01011 54092 33362 94904 31273 04146
52162 53916 46369 58569 23216 14513
07056 97628 33787 09998 42698 06691
48663 91245 85828 14346 09172 30163
54164 58492 22421 74103 47070 25306
32639 32363 05597 24200 38005 13363
29334 27001 87637 87308 58731 00256
02488 33062 28834 07351 19731 92420
81525 72295 04839 96423 24878 82651
29676 20591 68086 26432 46901 20949
00742 57392 39064 66432 84673 40027
05366 04213 25669 26422 44407 44048
91921 26418 64117 94305 26766 25940
To select a 10% random sample, begin by enumerating (listing and
numbering) every element of the population to be sampled. Then decide on a
rule for obtaining an appropriate series of numbers; for example, if your list
has 741 elements (which you have numbered 1 to 741), your rule might be to
go vertically down each column using the first three digits of each number
(beginning at the upper left, the numbers are 104, 223, etc.) and to select the
first 74 different numbers that fall in the range of 1 to 741. Finally, pick a
starting point by an arbitrary process. (Closing your eyes and putting your
pencil on some number in the table is one way to do it.)
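The rule described above can be sketched in code using the table as printed in this appendix. The function name is hypothetical, and for simplicity the sketch always starts at the upper left of the table, whereas the text recommends an arbitrary starting point.

```python
TABLE = """\
10480 15011 01536 81647 91646 02011
22368 46573 25595 85393 30995 89198
24130 48390 22527 97265 78393 64809
42167 93093 06243 61680 07856 16376
37570 33997 81837 16656 06121 91782
77921 06907 11008 42751 27756 53498
99562 72905 56420 69994 98872 31016
96301 91977 05463 07972 18876 20922
89572 14342 63661 10281 17453 18103
85475 36857 53342 53998 53060 59533
28918 79578 88231 33276 70997 79936
63553 40961 48235 03427 49626 69445
09429 93969 52636 92737 88974 33488
10365 61129 87529 85689 48237 52267
07119 97336 71048 08178 77233 13916
51085 12765 51821 51259 77452 16308
02368 21382 52404 60268 89368 19885
01011 54092 33362 94904 31273 04146
52162 53916 46369 58569 23216 14513
07056 97628 33787 09998 42698 06691
48663 91245 85828 14346 09172 30163
54164 58492 22421 74103 47070 25306
32639 32363 05597 24200 38005 13363
29334 27001 87637 87308 58731 00256
02488 33062 28834 07351 19731 92420
81525 72295 04839 96423 24878 82651
29676 20591 68086 26432 46901 20949
00742 57392 39064 66432 84673 40027
05366 04213 25669 26422 44407 44048
91921 26418 64117 94305 26766 25940
"""


def picks_from_table(table_text, population_size, n_wanted):
    """Go down each column, keep the first three digits of each entry, and
    collect the first n_wanted different numbers in 1..population_size."""
    rows = [line.split() for line in table_text.strip().splitlines()]
    picks = []
    for column in range(len(rows[0])):
        for row in rows:
            number = int(row[column][:3])
            if 1 <= number <= population_size and number not in picks:
                picks.append(number)
                if len(picks) == n_wanted:
                    return picks
    return picks  # the table ran out before n_wanted numbers were found


picks = picks_from_table(TABLE, population_size=741, n_wanted=74)
print(picks[:5])
```

The first numbers selected are 104 and 223, matching the example in the text. Numbers above 741, such as 779 from the sixth row, are skipped, and the population elements carrying the selected numbers make up the sample.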
4
Planning the Measurements: Precision and
Accuracy
Stephen B. Hulley
Jeffrey N. Martin
Steven R. Cummings
Measurements describe phenomena in terms that can be analyzed statistically. The validity
of a study depends on how well the variables designed for the study represent the
phenomena of interest (Fig. 4.1). How well does a prostate-specific antigen (PSA) level
signal cancer in the prostate that will soon metastasize, for example, or an insomnia
questionnaire detect amount and quality of sleep?
This chapter begins by considering how the choice of measurement scale influences the
information content of the measurement. We then turn to the central goal of minimizing
measurement error: how to design measurements that are relatively precise (free of
random error) and accurate (free of systematic error), thereby enhancing the validity of
drawing inferences from the study to the universe. We conclude with some considerations
for clinical research, noting especially the advantages of storing specimens for later
measurements.
Measurement Scales
Table 4.1 presents a simplified classification of measurement scales and the
information that results. The classification is important because some types of variables are
more informative than others, adding power to the study and reducing sample size
requirements.
FIGURE 4.1. Designing measurements that represent the phenomena of interest.
Continuous Variables
Continuous variables are quantified on an infinite scale. The number of possible values of
body weight, for example, is limited only by the sensitivity of the machine that is used to
measure it. Continuous variables are rich in information.
A scale whose units are limited to integers (such as the number of cigarettes smoked per
day) is termed discrete. Discrete variables that have a considerable number of possible
values can resemble continuous variables in statistical analyses and be equivalent for the
purpose of designing measurements.
Categorical Variables
Phenomena that are not suitable for quantification can often be measured by classifying
them in categories. Categorical variables with two possible values (dead or alive) are
termed dichotomous. Categorical variables with more than two categories
(polychotomous) can be further characterized according to the type of information they
contain.
Nominal variables have categories that are not ordered; type O blood, for example, is
neither more nor less than type B; nominal variables tend to have a qualitative and
absolute character that makes them straightforward to measure. Ordinal variables
have categories that do have an order, such as severe, moderate, and mild pain. The
additional information is an advantage over nominal variables, but because ordinal
variables do not specify a numerical or uniform difference between one category and the
next, the information content is less than that of discrete variables.
Table 4.1 Measurement Scales
Type of Measurement | Characteristics of Variable | Example | Descriptive Statistics | Information Content
Categorical,* nominal | Unordered categories | Sex; blood type; vital status | Counts, proportions | Lower
Categorical,* ordinal | Ordered categories with intervals that are not quantifiable | Degree of pain | In addition to the above: medians | Intermediate
Continuous or ordered discrete | Ranked spectrum with quantifiable intervals | Weight; number of cigarettes/day | In addition to the above: means, standard deviations | Higher
*Categorical measurements that contain only two classes (e.g., sex) are termed
dichotomous.
Choosing a Measurement Scale
A good general rule is to prefer continuous variables, because the additional information
they contain improves statistical efficiency. In a study comparing the antihypertensive
effects of several treatments, for example, measuring blood pressure in millimeters of
mercury allows the investigator to observe the magnitude of the change in every subject,
whereas measuring it as hypertensive versus normotensive would limit the assessment. The
continuous variable contains more information, and the result is a study with more power
and/or a smaller sample size (Chapter 6).
The rule has some exceptions. If the research question involves the determinants of low
birth weight, for example, the investigator would be more concerned with babies whose
weight is so low that their health is compromised than with differences observed over the
full spectrum of birth weights. In this case she is better off with a large enough sample to
be able to analyze the results with a dichotomous outcome like the proportion of babies
whose weight is below 2,500 g. Even when the categorical data are more meaningful,
however, it is still best to collect the data as a continuous variable. This leaves the analytic
options open: to change the cutoff point that defines low birth weight (she may later decide
that 2,350 g is a better value for identifying babies at increased risk of developmental
abnormalities), or to fall back on the more powerful analysis of the predictors of the full
spectrum of weight.
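The advantage of recording birth weight as a continuous variable can be illustrated with a brief sketch. The eight birth weights are invented for the example; only the 2,500 g and 2,350 g cutoffs come from the text.

```python
def proportion_below(weights_g, cutoff_g):
    """Proportion of babies whose recorded weight is below the cutoff."""
    return sum(weight < cutoff_g for weight in weights_g) / len(weights_g)


# Hypothetical birth weights in grams, recorded as a continuous variable.
birth_weights_g = [3400, 2480, 2300, 3100, 2600, 2390, 3650, 2950]

low_at_2500 = proportion_below(birth_weights_g, 2500)
low_at_2350 = proportion_below(birth_weights_g, 2350)
print(low_at_2500, low_at_2350)
```

Because the raw weights were kept, the dichotomy can be redrawn at any cutoff after the fact. Had the data been recorded only as "below 2,500 g: yes or no", the 2,350 g analysis and any analysis of the full spectrum of weight would be impossible.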
Similarly, when there is the option of designing the number of response categories in an
ordinal scale, as in a question about food preferences, it is often useful to provide a
half-dozen categories that range from "strongly dislike" to "extremely fond of." The results
can later be collapsed into a dichotomy (dislike and like), but not vice versa.
Many characteristics, particularly symptoms (pain) or aspects of lifestyle, are difficult to
describe with categories or numbers. But these phenomena often have important roles in
diagnostic and treatment decisions, and the attempt to measure them is an essential part
of the scientific approach to description and analysis. This is illustrated by the SF-36, a
standardized questionnaire for assessing quality of life (1). The process of classification
and measurement, if done well, can increase the objectivity of our knowledge, reduce bias,
and provide a means of communication.
Precision
The precision of a variable is the degree to which it is reproducible, with nearly the
same value each time it is measured. A beam scale can measure body weight with great
precision, whereas an interview to measure quality of life is more likely to produce values
that vary from one observer to the next. Precision has a very important influence on the
power of a study. The more precise a measurement, the greater the statistical power at a
given sample size to estimate mean values and to test hypotheses (Chapter 6).
Note to Table 4.1: Continuous variables have an infinite number of values (e.g., weight),
whereas discrete variables are limited to integers (e.g., number of cigarettes/day). Discrete
variables that are ordered (e.g., arranged in sequence from few to many) and that
have a large number of possible values resemble continuous variables for practical
purposes of measurement and analysis.
Precision (also called reproducibility, reliability, and consistency) is a function of random
error (chance variability); the greater the error, the less precise the measurement. There
are three main sources of random error in making measurements.
Observer variability refers to variability in measurement that is due to the observer,
and includes such things as choice of words in an interview and skill in using a
mechanical instrument.
Instrument variability refers to variability in the measurement due to changing
environmental factors such as temperature, aging mechanical components, different
reagent lots, and so on.
Subject variability refers to intrinsic biologic variability in the study subjects due to
such things as fluctuations in mood and time since last medication.
Assessing Precision
Precision is assessed as the reproducibility of repeated measurements, either comparing
measurements made by the same person (within-observer reproducibility) or different
people (between-observer reproducibility). Similarly, it can be assessed within or between
instruments.
The reproducibility of continuous variables is often expressed as the within-subject
standard deviation. However, if a Bland-Altman plot (2) of the within-subject
standard deviation versus that subject's mean demonstrates a linear association, then the
preferred approach is the coefficient of variation (within-subject standard deviation
divided by the mean). Correlation coefficients should be avoided (2). For categorical
variables, percent agreement and the kappa statistic (3) are often used (Chapter 12).
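The measures named above can be computed with a few short functions. This is a sketch under stated assumptions: the within-subject standard deviation is taken as the square root of the average within-subject variance, kappa is computed in its two-rater (Cohen) form, and all readings and ratings are invented.

```python
import math
import statistics


def within_subject_sd(repeats_by_subject):
    """Square root of the mean within-subject variance of repeated readings."""
    variances = [statistics.variance(values)
                 for values in repeats_by_subject.values()]
    return math.sqrt(statistics.mean(variances))


def coefficient_of_variation(repeats_by_subject):
    """Within-subject standard deviation divided by the overall mean."""
    all_values = [v for values in repeats_by_subject.values() for v in values]
    return within_subject_sd(repeats_by_subject) / statistics.mean(all_values)


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which two raters give the same category."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Agreement beyond chance; undefined when chance agreement equals 1."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    categories = set(rater_a) | set(rater_b)
    p_expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                     for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical duplicate systolic blood pressure readings (mm Hg) per subject.
readings = {"s1": [120, 124], "s2": [140, 138], "s3": [110, 114]}
wsd = within_subject_sd(readings)
cv = coefficient_of_variation(readings)

# Hypothetical yes/no classifications of ten films by two observers.
observer_1 = ["Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "N"]
observer_2 = ["Y", "Y", "Y", "N", "N", "N", "N", "N", "N", "Y"]
agreement = percent_agreement(observer_1, observer_2)
kappa = cohen_kappa(observer_1, observer_2)
print(f"within-subject SD = {wsd:.2f}, CV = {cv:.3f}")
print(f"agreement = {agreement:.0%}, kappa = {kappa:.2f}")
```

In this made-up example the two observers agree on 80% of films, but kappa is only about 0.58, because more than half of that agreement would be expected by chance given how often each observer says "yes".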
Strategies for Enhancing Precision
There are five approaches to minimizing random error and increasing the precision of
measurements (Table 4.2):
1. Standardizing the measurement methods. All study protocols should include
operational definitions (specific instructions for making the measurements). This
includes written directions on how to prepare the environment and the subject, how to
carry out and record the interview, how to calibrate the instrument, and so forth
(Appendix 4.1). This set of materials, part of the operations manual, is essential for
large and complex studies and recommended for smaller ones. Even when there is
only a single observer, specific written guidelines for making each measurement will
help her performance to be uniform over the duration of the study and serve as the
basis for describing the methods when the results are published.
2. Training and certifying the observers. Training will improve the consistency of
measurement techniques, especially when several observers are involved. It is often
desirable to design a formal test of the mastery of the techniques specified in the
operations manual and to certify that observers have achieved the prescribed level of
performance (Chapter 17).
3. Refining the instruments. Mechanical and electronic instruments can be engineered to
reduce variability. Similarly, questionnaires and interviews can be written to increase
clarity and avoid potential ambiguities (Chapter 15).
4. Automating the instruments. Variations in the way human observers make
measurements can be eliminated with automatic mechanical devices and self-response
questionnaires.
5. Repetition. The influence of random error from any source is reduced by repeating the
measurement, and using the mean of the two or more readings. Precision will be
substantially increased by this strategy, the primary limitation being the added cost
and practical difficulties of repeating the measurements.
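The gain from the fifth strategy, repetition, can be quantified with a simple sketch. It relies on an assumption the text does not state: the random errors in repeated readings are independent and have the same standard deviation, so the standard deviation of the mean of n readings is the single-reading value divided by the square root of n. The 8 mm Hg figure is hypothetical.

```python
import math


def sd_of_mean(sd_single_reading, n_readings):
    """Random-error SD of the mean of n independent, equally variable readings."""
    return sd_single_reading / math.sqrt(n_readings)


# Hypothetical random error of a single blood pressure reading: SD of 8 mm Hg.
sd_single = 8.0
for n_readings in (1, 2, 4):
    print(n_readings, "reading(s):", round(sd_of_mean(sd_single, n_readings), 2), "mm Hg")
```

Under these assumptions two readings cut the random error by about 30% and four readings halve it, so the first repeat buys the most precision per unit of added cost. Errors shared by every reading, such as a miscalibrated instrument, are not reduced by averaging.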
For each measurement in the study, the investigator must decide how vigorously to pursue
each of these strategies. This decision can be based on the importance of the variable, the
magnitude of the potential problem with precision, and the feasibility and cost of the
strategy. In general, the first two strategies (standardizing and training) should always be
used, and the fifth (repetition) is an option that is guaranteed to improve precision
whenever it is feasible and affordable.
Table 4.2 Strategies for Reducing Random Error in Order to Increase Precision, with
Illustrations from a Study of Antihypertensive Treatment
Strategy to Reduce Random Error | Source of Random Error | Example of Random Error | Example of Strategy to Prevent the Error
1. Standardizing the measurement methods in an operations manual | Observer | Variation in blood pressure (BP) measurement due to variable rate of cuff deflation (sometimes faster than 2 mm Hg/second and sometimes slower) | Specify that the cuff be deflated at 2 mm Hg/second
1. Standardizing the measurement methods in an operations manual | Subject | Variation in BP due to variable length of quiet sitting | Specify that subject sit in a quiet room for 5 minutes before BP measurement
2. Training and certifying the observer | Observer | Variation in BP due to variable observer technique | Train observer in standard techniques
3. Refining the instrument | Instrument and observer | Variation in BP due to digit preference (e.g., the tendency to round number to a multiple of 5) | Design instrument that conceals BP reading until after it has been recorded
4. Automating the instrument | Observer | Variation in BP due to variable observer technique | Use automatic BP measuring device
4. Automating the instrument | Subject | Variation in BP due to emotional reaction to observer by subject | Use automatic BP measuring device
5. Repeating the measurement | Observer, subject, and instrument | All measurements and all sources of variation | Use mean of two or more BP measurements
Accuracy
The accuracy of a variable is the degree to which it actually represents what it is
intended to represent. This has an important influence on the validity of the
study: the degree to which the observed findings lead to the correct inferences about
phenomena taking place in the study sample and in the universe.
Accuracy is different from precision in the ways shown in Table 4.3, and the two are not
necessarily linked. If serum cholesterol were measured repeatedly using standards that had
inadvertently been diluted twofold, for example, the result would be inaccurate but could
still be precise (consistently off by a factor of 2). This concept is further illustrated in
Figure 4.2. Accuracy and precision do often go hand in hand, however, in the sense that
many of the strategies for increasing precision will also improve accuracy.
Accuracy is a function of systematic error (bias); the greater the error, the less accurate
the variable. The three main classes of measurement error noted in the earlier section on
precision each have counterparts here.
Observer bias is a distortion, conscious or unconscious, in the perception or
reporting of the measurement by the observer. It may represent systematic errors in
the way an instrument is operated, such as a tendency to round down blood pressure
measurements, or in the way an interview is carried out, as in the use of leading
questions.
Instrument bias can result from faulty function of a mechanical instrument. A scale
that has not been calibrated recently may have drifted downward, producing
consistently low body weight readings.
Subject bias is a distortion of the measurement by the study subject, for example, in
reporting an event (respondent or recall bias). Patients with breast cancer who believe
that alcohol is a cause of their cancer, for example, may exaggerate the amount they
used to drink.

Table 4.3 The Precision and Accuracy of Measurements

| | Precision | Accuracy |
|---|---|---|
| Definition | The degree to which a variable has nearly the same value when measured several times | The degree to which a variable actually represents what it is supposed to represent |
| Best way to assess | Comparison among repeated measures | Comparison with a reference standard |
| Value to study | Increase power to detect effects | Increase validity of conclusions |
| Threatened by | Random error (chance) contributed by the observer, the subject, and the instrument | Systematic error (bias) contributed by the observer, the subject, and the instrument |
The accuracy of a measurement is best assessed by comparing it, when possible, to a
gold standard, a reference technique that is considered accurate. For
measurements on a continuous scale, the mean difference between the measurement under
investigation and the gold standard across study subjects can be determined. For
measurements on a dichotomous scale, accuracy in comparison to a gold standard can be
described in terms of sensitivity and specificity (Chapter 12). For measurements on
categorical scales with more than two response options, kappa can be used.
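The three comparisons just described can be sketched in a few lines of code. This is an illustrative sketch, not a procedure taken from the text: the function names and all data values are hypothetical, and kappa is computed with the usual unweighted Cohen formula (observed agreement corrected for the agreement expected by chance).

```python
from statistics import mean


def mean_difference(measured, gold):
    """Mean of (measurement - gold standard); a nonzero value suggests bias."""
    return mean(m - g for m, g in zip(measured, gold))


def sensitivity_specificity(test_positive, truly_positive):
    """Sensitivity and specificity of a dichotomous measurement.

    Assumes the gold standard contains at least one positive and one negative.
    """
    pairs = list(zip(test_positive, truly_positive))
    tp = sum(1 for t, g in pairs if t and g)
    fn = sum(1 for t, g in pairs if not t and g)
    tn = sum(1 for t, g in pairs if not t and not g)
    fp = sum(1 for t, g in pairs if t and not g)
    return tp / (tp + fn), tn / (tn + fp)


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two lists of categorical ratings.

    Assumes chance-expected agreement is below 1 (more than one category used).
    """
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    observed = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b) / n
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Hypothetical data: a clinic scale versus a reference scale (kg).
clinic_kg = [70.5, 82.0, 64.5, 91.0]
reference_kg = [71.0, 82.5, 65.0, 91.5]
print("mean difference (kg):", mean_difference(clinic_kg, reference_kg))

# Hypothetical data: a screening test versus a gold-standard diagnosis.
screen = [True, True, False, True, False, False, False, True]
diagnosis = [True, True, True, True, False, False, False, False]
print("sensitivity, specificity:", sensitivity_specificity(screen, diagnosis))

# Hypothetical data: a new severity rating versus a reference rating.
new_rating = ["mild", "moderate", "severe", "mild", "moderate", "severe"]
ref_rating = ["mild", "moderate", "severe", "moderate", "moderate", "mild"]
print("kappa:", cohen_kappa(new_rating, ref_rating))
```

A mean difference near zero does not by itself show that individual readings are close to the gold standard, since positive and negative errors can cancel; it addresses systematic error only.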
Validity
The degree to which a variable represents what is intended is difficult to assess when
measuring subjective and abstract phenomena, such as pain or quality of life, for which
there is no concrete gold standard. At issue is a particular type of accuracy termed
validity: how well the measurement represents the phenomenon of interest. There are
three ways to view and assess validity:
Content validity examines how well the assessment represents all aspects of the
phenomena under study (for example, including questions on social, physical,
emotional, and intellectual functioning to assess quality of life), and often it uses
subjective judgments (face validity) about whether the measurements seem
reasonable.
Construct validity refers to how well a measurement conforms to theoretical
constructs; for example, if an attribute is theoretically believed to differ between two
groups, a measure of this attribute that has construct validity would show this
difference.
Criterion-related validity is the degree to which a new measurement correlates with
well-accepted existing measures. A powerful version of this approach is predictive
validity, the ability of the measurement to predict an outcome: the validity of a
measure of depression would be strengthened if it was found to predict suicide.

FIGURE 4.2. The difference between precision and accuracy.

The general approach to validating an abstract measure is to begin by searching the
literature and consulting with experts in an effort to find a suitable instrument
(questionnaire) that has already been validated. Using such an instrument has the
advantage of making the results of the new study comparable to earlier work in the area,
and may simplify and strengthen the process of applying for grants and publishing the
results. Its disadvantage, however, is that an instrument taken off the shelf may be
outmoded or not appropriate for the research question.
If existing instruments are not suitable for the needs of the study, then the investigator
may decide to develop a new measurement approach and validate it herself. This can be an
interesting challenge that leads to a worthwhile contribution to the literature, but it is fair
to say that the process is often less scientific and conclusive than the word
"validation" connotes (Chapter 15).
Strategies for Enhancing Accuracy
The major approaches to increasing accuracy include the first four of the strategies listed
earlier for precision, and three additional ones (Table 4.4):
1. Standardizing the measurement methods
2. Training and certifying the observers
3. Refining the instruments
4. Automating the instruments
5. Making Unobtrusive Measurements. It is sometimes possible to design
measurements that the subjects are not aware of, thereby eliminating the possibility
that they will consciously bias the variable. A study of advice on healthy eating
patterns for schoolchildren, for example, could measure the number of candy bar
wrappers in the trash.
6. Calibrating the Instrument. The accuracy of many instruments, especially those
that are mechanical or electrical, can be increased by periodic calibration using a gold
standard.
7. Blinding. This classic strategy does not ensure the overall accuracy of the
measurements, but it can eliminate differential bias that affects one study group
more than another. In a double-blind clinical trial the subjects and observers do not
know whether active medicine or placebo has been assigned, and any inaccuracy in
measuring the outcome will be the same in the two groups.

Table 4.4 Strategies for Reducing Systematic Error in Order to Increase Accuracy, with Illustrations from a Study of Antihypertensive Treatment

| Strategy to Reduce Systematic Error | Source of Systematic Error | Example of Systematic Error | Example of Strategy to Prevent the Error |
|---|---|---|---|
| 1. Standardizing the measurement methods in an operations manual | Observer | Consistently high diastolic blood pressure (BP) readings due to using the point at which sounds become muffled | Specify the operational definition of diastolic BP as the point at which sounds cease to be heard |
| | Subject | Consistently high readings due to measuring BP right after walking upstairs to clinic | Specify that subject sit in quiet room for 5 minutes before measurement |
| 2. Training and certifying the observer | Observer | Consistently high BP readings due to failure to follow procedures specified in operations manual | Trainer checks accuracy of observer's reading with a double-headed stethoscope |
| 3. Refining the instrument | Instrument | Consistently high BP readings with standard cuff in subjects with very large arms | Use extra-wide BP cuff in obese patients |
| 4. Automating the instrument | Observer | Conscious or unconscious tendency for observer to read BP lower in study group randomized to active drug | Use automatic BP measuring device |
| | Subject | BP increase due to proximity of attractive technician | Use automatic BP measuring device |
| 5. Making unobtrusive measurements | Subject | Tendency of subject to overestimate compliance with study drug | Measure study drug level in urine |
| 6. Calibrating the instrument | Instrument | Consistently high BP readings due to manometer being out of adjustment | Calibrate each month |
| 7. Blinding | Observer | Conscious or unconscious tendency for observer to read BP lower in active treatment group | Use double-blind placebo to conceal study group assignment |
| | Subject | Tendency of subject to overreport side effects if she knew she was on active drug | Use double-blind placebo to conceal study group assignment |
The decision on how vigorously to pursue each of these seven strategies for each
measurement rests, as noted earlier for precision, on the judgment of the investigator. The
considerations are the magnitude of the potential impact that the anticipated degree of
inaccuracy will have on the conclusions of the study and the feasibility and cost of the
strategy. The first two strategies (standardizing and training) should always be used,
calibration is needed for any instrument that has the potential to change over time, and
blinding is essential whenever feasible.
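The distinction drawn above between overall inaccuracy and differential bias can be made concrete with simple arithmetic. The sketch below uses invented numbers (true mean diastolic pressures of 88 and 92 mm Hg, a device that reads 3 mm Hg high, and an unblinded observer who reads 2 mm Hg low in the active group); none of these values come from the text.

```python
def observed_difference(true_active, true_placebo, bias_active, bias_placebo):
    """Observed between-group difference when each group's readings are biased."""
    return (true_active + bias_active) - (true_placebo + bias_placebo)


# Hypothetical true mean diastolic BP (mm Hg) in each study group.
true_active, true_placebo = 88.0, 92.0
true_difference = true_active - true_placebo

# Blinded trial: a miscalibrated device reads 3 mm Hg high in BOTH groups.
# Each group mean is inaccurate, but the bias cancels in the comparison.
blinded = observed_difference(true_active, true_placebo, 3.0, 3.0)

# Unblinded trial: the same device bias, plus an observer who tends to read
# 2 mm Hg lower in the active-drug group only (differential bias).
unblinded = observed_difference(true_active, true_placebo, 3.0 - 2.0, 3.0)

print("true difference:     ", true_difference)
print("blinded observation: ", blinded)
print("unblinded observation:", unblinded)
```

In this example the blinded comparison recovers the true difference even though every individual reading is too high, whereas the unblinded comparison exaggerates the treatment effect. Calibration, not blinding, is what would fix the shared 3 mm Hg error.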
Other Features of Measurement Approaches
Measurements should be sensitive enough to detect differences in a characteristic
that are important to the investigator. Just how much sensitivity is needed depends on the
research question. For example, a study of whether a new medication helps people to quit
smoking could use an outcome measure that is relatively insensitive to the number of
cigarettes smoked each day. On the other hand, if the question is the effect of reducing the
nicotine content of cigarettes on the number of cigarettes smoked, the method should be
sensitive to differences in daily habits of just a few cigarettes.
An ideal measurement is specific, representing only the characteristic of interest. The
carbon monoxide level in expired air is a measure of smoking habits that is only moderately
specific because it can also be affected by other exposures such as automobile exhaust.
The overall specificity of assessing smoking habits can be increased by supplementing the
carbon monoxide data with other measurements (such as self-report and serum cotinine
level) that are not affected by air pollution.
Measurements should be appropriate to the objectives of the study. A study of stress as
an antecedent to myocardial infarction, for example, would need to consider which kind of
stress (psychological or physical, acute or chronic) was of interest before setting out the
operational definitions for measuring it.
Measurements should provide an adequate distribution of responses in the study
population. A measure of functional status is most useful if it produces values that range
from high in some subjects to low in others. One of the main functions of
pretesting is to ensure that the actual responses do not all cluster around one end of the
possible range of response (Chapter 17).
Finally, there is the issue of objectivity. This is achieved by reducing the involvement of
the observer and by increasing the structure of the instrument. The danger in these
strategies, however, is the consequent tunnel vision that limits the scope of the
observations and the ability to discover unanticipated phenomena. The best design is often
a compromise, including an opportunity for acquiring subjective and qualitative data in
addition to the main objective and quantitative measurements.
Measurements on Stored Materials
Clinical research involves measurements on people that range across a broad array of
domains (Table 4.5). Some of these measurements can only be made during a contact with
the study subject, but many can be carried out later on biological specimens banked for
chemical or genetic analysis, or on images from radiographic and other procedures filed
electronically.
One advantage of such storage is the opportunity to reduce the cost of the study by making
measurements only on individuals who turn out during follow-up to develop the condition of
interest. A terrific approach to doing this is the nested case-control design (Chapter 7);
paired blinded measurements can be made in a single analytic batch, eliminating the
batch-to-batch component of random error. A second advantage is that scientific advances may
lead to new ideas and measurement techniques that can be employed years after the study
is completed.
The growing interest in translational research (Chapter 2) takes advantage of new
measurements that have greatly expanded clinical research in the areas of genetic and
molecular epidemiology (4,5). Measurements on specimens that contain DNA (e.g.,
saliva, blood) can provide information on genotypes that contribute to the
occurrence of disease or modify a patient's response to treatment. Measurements on serum
can be used to study molecular causes or consequences of disease; for example, proteomic
patterns may provide useful information for diagnosing certain diseases (6). It is important
to consult with experts regarding the proper collection tubes and storage conditions in
order to preserve the quality of the specimens and make them available for the widest
spectrum of subsequent use.
In Closing
Table 4.5 reviews the many kinds of measurements that can be included in a study.
Some of these are the topic of later chapters in this book. In Chapter 9 we will address the
issue of choosing measurements that will facilitate inferences about confounding and
causality. And in Chapter 15 we will address the topic of questionnaires and other
instruments for measuring information supplied by the study subject.

Table 4.5 Common Types of Measurements that Can Be Made on Stored Materials

| Type of Measurement | Examples | Bank for Later Measurement |
|---|---|---|
| Medical history | Diagnoses, medications, operations, symptoms, physical findings | Clinical charts |
| Psychosocial factors | Depression, family history | Voice recordings, videotapes |
| Anthropometric | Height, weight, body composition | Photographs |
| Biochemical measures | Serum cholesterol, plasma fibrinogen | Serum, plasma, urine, pathology specimens |
| Genetic/molecular tests | Single nucleotide polymorphisms, human leukocyte antigen type | DNA, immortal cell line |
| Imaging | Bone density, coronary calcium | X-rays, CT scans, MRI |
| Electromechanical | Arrhythmia, congenital heart disease | Electrocardiogram, echocardiogram |

In designing measurements it is important to keep in mind the value of efficiency and
parsimony. The full set of measurements should collect useful data at an affordable cost in
time and money. Efficiency can be improved by increasing the quality of each item and by
reducing the number of items measured. Collecting more data than are needed is a
common error that can tire subjects, overwhelm the research team, and clutter data
management and analysis. The result may be a more expensive study that paradoxically is
less successful in answering the main research questions.
Summary
1. Variables are either continuous (quantified on an infinite scale), discrete (quantified
on a finite scale of integers), or categorical (classified in categories). Categorical
variables are further classified as nominal (unordered) or ordinal (ordered); those
that have only two categories are termed dichotomous.
2. Clinical investigators prefer variables that contain more information and thereby
provide greater power and/or smaller sample sizes: continuous variables > discrete
variables > ordered categorical variables > nominal and dichotomous variables.
3. The precision of a measurement (i.e., the reproducibility of replicate measures) is
another major determinant of power and sample size. Precision is reduced by random
error (chance) from three sources of variability: the observer, the subject, and the
instrument.
4. Strategies for increasing precision that should be part of every study are to
operationally define and standardize methods in an operations manual, and to
train and certify observers. Other strategies that are often useful are refining the
instruments, automating the instruments, and using the mean of repeated
measurements.
5. The accuracy of a measurement (i.e., the degree to which it actually measures the
characteristic it is supposed to measure) is a major key to inferring correct
conclusions. Validity is a form of accuracy commonly used for abstract variables.
Accuracy is reduced by systematic error (i.e., bias) from the same three sources:
the observer, the subject, and the instrument.
6. The strategies for increasing accuracy include all those listed for precision with the
exception of repetition. In addition, accuracy is enhanced by unobtrusive measures,
by calibration, and (in comparisons between groups) by blinding.
7. Individual measurements should be sensitive, specific, appropriate, and objective,
and they should produce a range of values. In the aggregate, they should be broad
but parsimonious, serving the research question at moderate cost in time and
money.
8. Investigators should consider storing banks of materials for later measurements
that can take advantage of new technologies and the efficiency of nested
case-control designs.
Appendix
Appendix 4.1
Operations Manual: Operational Definition of a
Measurement of Grip Strength
The operations manual describes the method for conducting and recording the results of all
the measurements made in the study. This example, from the operations manual of the
Study of Osteoporotic Fractures, describes the use of a dynamometer to measure grip
strength. To standardize instructions from examiner to examiner and from subject to
subject, the protocol includes a script of instructions to be read to the participant verbatim.
Protocol for Measuring Grip Strength with the Dynamometer
Grip strength will be measured in both hands. The handle should be adjusted so that the
participant holds the dynamometer comfortably. Place the dynamometer in the right hand
with the dial facing the palm. The participant's arm should be flexed 90° at the elbow with
the forearm parallel to the floor.
1. Demonstrate the test to the subject. While demonstrating, use the following
description: "This device measures your arm and upper body strength. We will
measure your grip strength in both arms. I will demonstrate how it is done. Bend your
elbow at a 90° angle, with your forearm parallel to the floor. Don't let your arm
touch the side of your body. Lower the device and squeeze as hard as you can while I
count to three. Once your arm is fully extended, you can loosen your grip."
2. Allow one practice trial for each arm, starting with the right if she is right handed. On
the second trial, record the kilograms of force from the dial to the nearest 0.5 kg.
3. Reset the dial. Repeat the procedure for the other arm.
The arm should not contact the body. The gripping action should be a slow, sustained
squeeze rather than an explosive jerk.
References
1. Ware JE, Gandek B Jr. Overview of the SF-36 health survey and the International
Quality of Life Assessment (IQOLA) Project. J Clin Epidemiol 1998;51:903-912.
2. Bland JM, Altman DG. Measurement error and correlation coefficients. BMJ
1996;313:41-42; also, Measurement error proportional to the mean. BMJ 1996;313:106.
3. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas
1960;20:37-46.
4. Guttmacher AE, Collins FS. Genomic medicine: a primer. NEJM
2002;347:1512-1520.
5. Healy DG. Case-control studies in the genomic era: a clinician's guide.
http://www.neurology.thelancet.com. 2006;5:701-707.
6. Liotta LA, Ferrari M, Petricoin E. Written in blood. Nature 2003;425:905.
5
Getting Ready to Estimate Sample
Size: Hypotheses and Underlying
Principles
Warren S. Browner
Thomas B. Newman
Stephen B. Hulley
After an investigator has decided whom and what she is going to study
and the design to be used, she must decide how many subjects to
sample. Even the most rigorously executed study may fail to answer its
research question if the sample size is too small. On the other hand, a
study with too large a sample will be more difficult and costly than
necessary. The goal of sample size planning is to estimate an
appropriate number of subjects for a given study design.
Although a useful guide, sample size calculations give a deceptive
impression of statistical objectivity. They are only as accurate as the
data and estimates on which they are based, which are often just
informed guesses. Sample size planning is a mathematical way of
making a ballpark estimate. It often reveals that the research design is
not feasible or that different predictor or outcome variables are
needed. Therefore, sample size should be estimated early in the design
phase of a study, when major changes are still possible.
Before setting out the specific approaches to calculating sample size for
several common research designs in Chapter 6, we will spend some
time considering the underlying principles. Readers who find some of
these principles confusing will enjoy discovering that sample size
planning does not require their total mastery. However, just as a recipe
makes more sense if the cook is somewhat familiar with the
ingredients, sample size calculations are easier if the investigator is
acquainted with the basic concepts.
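To see why a sample size calculation is only as good as the guesses that go into it, consider a rough sketch for comparing the means of two equal-sized groups. It uses a common normal-approximation formula, n per group = 2[(z_alpha + z_beta) x SD / effect]^2, which is an assumption of this example and not necessarily the method presented in Chapter 6. The SD of 20 and the effect sizes of 10 and 5 are invented for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect a difference in two means.

    Normal approximation with a two-sided alpha and equal group sizes;
    `effect` is the assumed true difference and `sd` the assumed SD.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)


# Hypothetical guesses: SD of 20 units, with two candidate effect sizes.
for guess in (10.0, 5.0):
    print(f"assumed effect {guess}: about {n_per_group(guess, 20.0)} per group")
```

Halving the assumed effect size roughly quadruples the required sample, so a modest change in an informed guess can decide whether a design is feasible.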
Hypotheses
The research hypothesis is a specific version of the research
question that summarizes the main elements of the study (the
sample, and the predictor and outcome variables) in a form that
establishes the basis for tests of statistical significance. Hypotheses
are not needed in descriptive studies, which describe how
characteristics are distributed in a population, such as a study of the
prevalence of a particular genotype among patients with hip fractures.
(That does not mean, however, that
you won't need to do a sample size estimate for a descriptive study,
just that the methods for doing so, described in Chapter 6, are
different.) Hypotheses are needed for studies that will use tests of
statistical significance to compare findings among groups, such as a
study of whether that particular genotype is more common among
patients with hip fractures than among controls. Because most
observational studies and all experiments address research questions
that involve making comparisons, most studies need to specify at least
one hypothesis. If any of the following terms appear in the research
question, then the study is not simply descriptive, and a hypothesis
should be formulated: greater than, less than, causes, leads to,
compared with, more likely than, associated with, related to, similar to,
correlated with.
Characteristics of a Good Hypothesis
A good hypothesis must be based on a good research question. It
should also be simple, specific, and stated in advance.
Simple versus Complex
A simple hypothesis contains one predictor and one outcome variable:
A sedentary lifestyle is associated with an increased risk of
proteinuria in patients with diabetes.
A complex hypothesis contains more than one predictor variable:
A sedentary lifestyle and alcohol consumption are associated with
an increased risk of proteinuria in patients with diabetes.
Or more than one outcome variable:
Alcohol consumption is associated with an increased risk of
proteinuria and of neuropathy in patients with diabetes.
Complex hypotheses like these are not readily tested with a single
statistical test and are more easily approached as two or more simple
hypotheses. Sometimes, however, a combined predictor or outcome
variable can be used:
Alcohol consumption is associated with an increased risk of
developing a microvascular complication of diabetes (i.e.,
proteinuria, neuropathy, or retinopathy) in patients with diabetes.
In this example the investigator has decided that what matters is
whether a participant has a complication, not what type of complication
occurs.
Specific versus Vague
A specific hypothesis leaves no ambiguity about the subjects and
variables or about how the test of statistical significance will be
applied. It uses concise operational definitions that summarize the
nature and source of the subjects and how variables will be measured.
Use of tricyclic antidepressant medications, assessed with
pharmacy records, is more common in patients hospitalized with
an admission diagnosis of myocardial infarction at Longview
Hospital in the past year than in controls hospitalized for
pneumonia.
This is a long sentence, but it communicates the nature of the study in
a clear way that minimizes any opportunity for testing something a
little different once the study findings have been examined. It would be
incorrect to substitute, during the analysis phase of the study, a
different measurement of the predictor, such as the self-reported use
of pills for depression, without considering the issue of multiple
hypothesis testing (a topic we discuss at the end of the chapter).
Usually, to keep the research hypothesis concise, some of these details
are made explicit in the study plan rather than being stated in the
research hypothesis. But they should always be clear in the
investigator's conception of the study, and spelled out in the protocol.
It is often obvious from the research hypothesis whether the predictor
variable and the outcome variable are dichotomous, continuous, or
categorical. If it is not clear, then the type of variables can be
specified:
Alcohol consumption (in mg/day) is associated with an increased
risk of proteinuria (>300 mg/day) in patients with diabetes.
If the research hypothesis begins to get too cumbersome, the
definitions can be left out, as long as they are clarified elsewhere in
the protocol.
In-Advance versus After-the-Fact
The hypothesis should be stated in writing at the outset of the study.
Most important, this will keep the research effort focused on the
primary objective. A single prestated hypothesis also creates a stronger
basis for interpreting the study results than several hypotheses that
emerge as a result of inspecting the data. Hypotheses that are
formulated after examination of the data are a form of multiple
hypothesis testing that can lead to overinterpreting the importance of
the findings.
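The risk of overinterpretation can be quantified under simplifying assumptions. The sketch below assumes that every hypothesis examined is truly null and that the tests are independent, each carried out at the same significance level; real after-the-fact hypotheses are rarely independent, so the numbers are only indicative. The function name is hypothetical.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests of true null
    hypotheses is 'significant' at level alpha purely by chance."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(f"{k:2d} hypotheses tested: {prob_any_false_positive(k):.2f}")
```

With a single prestated hypothesis the chance of a spurious "significant" finding stays at the chosen level; with many hypotheses suggested by the data it climbs quickly.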
Types of Hypotheses
For the purpose of testing statistical significance, the research
hypothesis must be restated in forms that categorize the expected
difference between the study groups.
Null and alternative hypotheses. The null hypothesis states
that there is no association between the predictor and outcome
variables in the population (there is no difference in the frequency
of drinking well water between subjects who develop peptic ulcer
disease and those who do not). The null hypothesis is the formal
basis for testing statistical significance. Assuming that there really
is no association in the population, statistical tests help to
estimate the probability that an association observed in a study is
due to chance.
The proposition that there is an association (the frequency of
drinking well water is different in subjects who develop peptic
ulcer disease than in those who do not) is called the alternative
hypothesis. The alternative hypothesis cannot be tested directly;
it is accepted by default if the test of statistical significance
rejects the null hypothesis (see later).
One- and two-sided alternative hypotheses. A one-sided
hypothesis specifies the direction of the association between the
predictor and outcome variables. The hypothesis that drinking well
water is more common among subjects who develop peptic ulcers
is a one-sided hypothesis. A two-sided hypothesis states only that
an association exists; it does not specify the direction. The
hypothesis that subjects who develop peptic ulcer disease have a
different frequency of drinking well water than those who do not is
a two-sided hypothesis.
One-sided hypotheses may be appropriate in selected circumstances,
such as when only one direction for an association is clinically
important or biologically meaningful. An example is the one-sided
hypothesis that a new drug for hypertension is more likely to cause
rashes than a placebo; the possibility that the drug causes fewer
rashes than the placebo is not usually worth testing (it might be if the
drug had anti-inflammatory properties!). A one-sided hypothesis may
also be appropriate when there is very strong evidence from prior
studies that an association is unlikely to occur in one of the two
directions, such as a study that tested whether cigarette smoking
affects the risk of brain cancer. Because smoking has been associated
with an increased risk of many different types of cancers, a one-sided
alternative hypothesis (e.g., that smoking increases the risk of brain
cancer) might suffice. However, investigators should be aware that
many well-supported hypotheses (e.g., that β-carotene therapy will
reduce the risk of lung cancer, or that treatment with drugs that reduce
the number of ventricular ectopic beats will reduce sudden death
among patients with ventricular arrhythmias) turn out to be wrong
when tested in randomized trials. Indeed, in these two examples, the
results of well-done trials revealed a statistically significant effect that
was opposite in direction from the one supported by previous data
(1,2,3). Overall, we believe that nearly all alternative hypotheses
deserve to be two-sided.
It is important to keep in mind the difference between a research
hypothesis, which is often one-sided, and the alternative hypothesis
that is used when planning sample size, which is almost always
two-sided. For example, suppose the research hypothesis is that recurrent
use of antibiotics during childhood is associated with an increased risk
of inflammatory bowel disease. That hypothesis specifies the direction
of the anticipated effect, so it is one-sided. Why use a two-sided
alternative hypothesis when planning the sample size? The answer is
that most of the time, both sides of the alternative hypothesis (i.e.,
greater risk or lesser risk) are interesting, and the investigators would
want to publish the results no matter which direction was observed.
Statistical rigor requires that the investigator choose between one- and
two-sided hypotheses before analyzing the data; switching to a one-sided
alternative hypothesis to reduce the P value (see below) is not correct.
In addition (and this is probably the real reason that two-sided
alternative hypotheses are much more common), most grant and
manuscript reviewers expect two-sided hypotheses, and are critical of a
one-sided approach.
Underlying Statistical Principles
A hypothesis, such as that 15 minutes or more of exercise per day
is associated with a lower mean fasting blood glucose level in
middle-aged women with diabetes, is either true or false in the real world.
Because an investigator cannot study all middle-aged women with
diabetes, she must test the hypothesis in a sample of that target
population. As noted in Figure 1.6, there will always be a need to draw
inferences about phenomena in the population from events observed in
the sample.
In some ways, the investigator's problem is similar to that faced by a
jury judging a defendant (Table 5.1). The absolute truth about whether
the defendant committed the crime cannot usually be determined.
Instead, the jury begins by presuming innocence: the defendant did not
commit the crime. The jury must decide whether there is sufficient
evidence to reject the presumed innocence of the defendant; the
standard is known as "beyond a reasonable doubt." A jury can err,
however, by convicting an innocent defendant or by failing to convict a
guilty one.

Table 5.1 The Analogy between Jury Decisions and Statistical Tests

Jury Decision | Statistical Test
Innocence: The defendant did not counterfeit money. | Null hypothesis: There is no association between dietary carotene and the incidence of colon cancer in the population.
Guilt: The defendant did counterfeit money. | Alternative hypothesis: There is an association between dietary carotene and the incidence of colon cancer.
Standard for rejecting innocence: Beyond a reasonable doubt. | Standard for rejecting null hypothesis: Level of statistical significance (α).
Correct judgment: Convict a counterfeiter. | Correct inference: Conclude that there is an association between dietary carotene and colon cancer when one does exist in the population.
Correct judgment: Acquit an innocent person. | Correct inference: Conclude that there is no association between carotene and colon cancer when one does not exist.
Incorrect judgment: Convict an innocent person. | Incorrect inference (type I error): Conclude that there is an association between carotene and colon cancer when there actually is none.
Incorrect judgment: Acquit a counterfeiter. | Incorrect inference (type II error): Conclude that there is no association between carotene and colon cancer when there actually is one.

In similar fashion, the investigator starts by presuming the null
hypothesis of no association between the predictor and outcome
variables in the population. Based on the data collected in her sample,
the investigator uses statistical tests to determine whether there is
sufficient evidence to reject the null hypothesis in favor of the
alternative hypothesis that there is an association in the population.
The standard for these tests is known as the level of statistical
significance.
Type I and Type II Errors
Like a jury, an investigator may reach a wrong conclusion. Sometimes
by chance alone a sample is not representative of the population and
the results in the sample do not reflect reality in the population,
leading to an erroneous inference. A type I error (false-positive)
occurs if an investigator rejects a null hypothesis that is actually true
in the population; a type II error (false-negative) occurs if the
investigator fails to reject a null hypothesis that is actually not true in
the population. Although type I and type II errors can never be avoided
entirely, the investigator can reduce their likelihood by increasing the
sample size (the larger the sample, the less likely that it will differ
substantially from the population) or by manipulating the design or the
measurements in other ways that we will discuss.
In this chapter and the next, we deal only with ways to reduce type I
and type II errors due to chance variation, also known as random
error. False-positive and false-negative results can also occur because
of bias, but errors due to bias are not usually referred to as type I and
II errors. Such errors are especially troublesome, because they may be
difficult to detect and cannot usually be quantified using statistical
methods or avoided by increasing the sample size. (See Chapters 1, 3, 4,
and 7 through 12 for ways to reduce errors due to bias.)
Effect Size
The likelihood that a study will be able to detect an association
between a predictor and an outcome variable in a sample depends on
the actual magnitude of that association in the population. If it is large
(mean fasting blood glucose levels are 20 mg/dL lower in diabetic
women who exercise than in those who do not), it will be easy to detect
in the sample. Conversely, if the size of the association is small (a
difference of 2 mg/dL), it will be difficult to detect in the sample.
Unfortunately, the investigator does not usually know the exact size of
the association; one of the purposes of the study is to estimate it!
Instead, the investigator must choose the size of the association that
she expects to be present in the sample. That quantity is known as the
effect size. Selecting an appropriate effect size is the most difficult
aspect of sample size planning (4). The investigator should first try to
find data from prior studies in related areas to make an informed guess
about a reasonable effect size. When data are not available, it may be
necessary to do a small pilot study. Alternatively, she can choose the
smallest effect size that in her opinion would be clinically meaningful (a
10 mg/dL reduction in the fasting glucose level).
Of course, from the public health point of view, even a reduction of 2
or 3 mg/dL in fasting glucose levels might be important, especially if it
was easy to achieve. The choice of the effect size is always arbitrary,
and considerations of feasibility are often paramount. Indeed, when the
number of available or affordable subjects is limited, the investigator
may have to work backward (Chapter 6) to determine the effect size
that her study will be able to detect.
There are many different ways to measure the size of an association,
especially when the outcome variable is dichotomous. For example,
consider a study of whether middle-aged men are more likely to have
impaired hearing than middle-aged women. Suppose an investigator
finds that 20% of women and 30% of men 50 to 65 years of age are
hard of hearing. These results could be interpreted as showing that
men are 10% more likely to have impaired hearing than women (30% -
20%, the absolute difference), or 50% more likely ([30% - 20%] ÷
20%, the relative difference). For sample size planning, both of the
proportions matter; the sample size tables in this book use the smaller
proportion (in this case, 20%) and the absolute difference (10%)
between the groups being compared.
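
The arithmetic in the hearing example can be written out as a short Python sketch. The function name describe_effect is hypothetical, chosen only for illustration; the proportions are the ones given in the text.

```python
def describe_effect(p_low, p_high):
    """Summarize the difference between two proportions.

    p_low and p_high are the proportions with the outcome in the two
    groups, expressed as fractions (0.20 means 20%).
    """
    absolute = p_high - p_low      # absolute difference in proportions
    relative = absolute / p_low    # difference relative to the smaller proportion
    return {"smaller": p_low, "absolute": absolute, "relative": relative}


# Hearing example from the text: 20% of women, 30% of men.
hearing = describe_effect(0.20, 0.30)
print(f"smaller proportion:  {hearing['smaller']:.0%}")
print(f"absolute difference: {hearing['absolute']:.0%}")
print(f"relative difference: {hearing['relative']:.0%}")
```

The smaller proportion and the absolute difference are the two quantities that would be carried into a sample size table.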
Many studies measure several effect sizes, because they measure
several different predictor and outcome variables. For sample size
planning, the sample size using the desired effect size for the most
important hypothesis should be determined; the effect sizes for the
other hypotheses can then be estimated. If there are several
hypotheses of similar importance, then the sample size for the study
should be based on whichever hypothesis needs the largest sample.
α, β, and Power
After a study is completed, the investigator uses statistical tests to try
to reject the null hypothesis in favor of its alternative, in much the
same way that a prosecuting attorney tries to convince a jury to reject
innocence in favor of guilt. Depending on whether the null hypothesis is
true or false in the target population, and assuming that the study is
free of bias, four situations are possible (Table 5.2). In two of these,
the findings in the sample and reality in the population are concordant,
and the investigator's inference will be correct. In the other two
situations, either a type I or type II error has been made, and the
inference will be incorrect.

Table 5.2 Truth in the Population versus the Results in the Study Sample: The Four Possibilities

Results in the Study Sample | Truth in the Population: Association Between Predictor and Outcome | Truth in the Population: No Association Between Predictor and Outcome
Reject null hypothesis | Correct | Type I error
Fail to reject null hypothesis | Type II error | Correct

The investigator establishes the maximum chance that she will tolerate
of making type I and II errors in advance of the study. The probability
of committing a type I error (rejecting the null hypothesis when it is
actually true) is called α (alpha). Another name for α is the level of
statistical significance.
If, for example, a study of the effects of exercise on fasting blood
glucose levels is designed with an α of 0.05, then the investigator has
set 5% as the maximum chance of incorrectly rejecting the null
hypothesis if it is true (and inferring that exercise and fasting blood
glucose levels are associated in the population when, in fact, they are
not). This is the level of reasonable doubt that the investigator will be
willing to accept when she uses statistical tests to analyze the data
after the study is completed.
The probability of making a type II error (failing to reject the null
hypothesis when it is actually false) is called β (beta). The quantity
[1 - β] is called power, the probability of correctly rejecting the null
hypothesis in the sample if the actual effect in the population is equal
to (or greater than) the effect size.
If β is set at 0.10, then the investigator has decided that she is willing
to accept a 10% chance of missing an association of a given effect size
if it exists. This represents a power of 0.90; that is, a 90% chance of
finding an association of that size or greater. For example, suppose
that exercise really would lead to an average reduction of 20 mg/dL in
fasting glucose levels among diabetic women in the entire population.
Suppose that the investigator drew a sample of women from the
population on numerous occasions, each time carrying out the same
study (with the same measurements and the same 90% power each
time). Then in nine of every ten studies the investigator would
correctly reject the null hypothesis and conclude that exercise is
associated with fasting glucose level. This does not mean, however,
that the investigator doing a single study will be unable to detect it if
the effect actually present in the population was smaller, say, a 15
mg/dL reduction; it means simply that she will have less than a 90%
likelihood of doing so.
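
To make the last point concrete, the following Python sketch approximates power for the exercise example under assumptions that are ours rather than the text's: a normal approximation for comparing two means, a hypothetical standard deviation of 40 mg/dL, and a hypothetical 85 women per group. The function name approximate_power is also illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def approximate_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two means.

    Uses a normal approximation with a common, known standard deviation
    and equal group sizes; the negligible chance of rejecting in the
    wrong direction is ignored.
    """
    standard_error = sd * sqrt(2.0 / n_per_group)
    z_critical = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(abs(effect) / standard_error - z_critical)


# Hypothetical planning values (not from the text).
SD_MG_DL = 40.0
N_PER_GROUP = 85

power_20 = approximate_power(20.0, SD_MG_DL, N_PER_GROUP)
power_15 = approximate_power(15.0, SD_MG_DL, N_PER_GROUP)
print(f"power if the true effect is 20 mg/dL: {power_20:.2f}")
print(f"power if the true effect is 15 mg/dL: {power_15:.2f}")
```

Under these assumed values the study has about 90% power for a 20 mg/dL effect and noticeably less for a 15 mg/dL effect; the smaller effect can still be detected, only less reliably.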
Ideally, α and β would be set at zero, eliminating the possibility of
false-positive and false-negative results. In practice they are made as
small as possible. Reducing them, however, requires increasing the
sample size; other strategies are discussed in Chapter 6. Sample size
planning aims at choosing a sufficient number of subjects to keep α
and β at an acceptably low level without making the study
unnecessarily expensive or difficult.
Many studies set α at 0.05 and β at 0.20 (a power of 0.80). These are
arbitrary values, and others are sometimes used: the conventional
range for α is between 0.01 and 0.10, and that for β is between 0.05
and 0.20. In general, the investigator should use a low α when the
research question makes it particularly important to avoid a type I
(false-positive) error, for example, in testing the efficacy of a
potentially dangerous medication. She should use a low β (and a small
effect size) when it is especially important to avoid a type II
(false-negative) error, for example, in reassuring the public that
living near a toxic waste dump is safe.
P Value
The null hypothesis acts like a straw man: it is assumed to be true so
that it can be knocked down as false with a statistical test. When the
data are analyzed, such tests determine the P value, the probability of
seeing an effect as big as or bigger than that in the study by chance if
the null hypothesis actually were true. The null hypothesis is rejected
in favor of its alternative if the P value is less than α, the
predetermined level of statistical significance.
A "nonsignificant" result (i.e., one with a P value greater than
α) does not mean that there is no association in the population; it
only means that the result observed in the sample is small compared
with what could have occurred by chance alone. For example, an
investigator might find that men with hypertension were twice as likely
to develop prostate cancer as those with normal blood pressure, but
because the number of cancers in the study was modest this apparent
effect had a P value of only 0.08. This means that even if hypertension
and prostatic carcinoma were not associated in the population, there
would be an 8% chance of finding such an association due to random
error in the sample. If the investigator had set the significance level as
a two-sided α of 0.05, she would have to conclude that the association
in the sample was "not statistically significant." It might be
tempting for the investigator to change her mind about the level of
statistical significance, reset the two-sided α to 0.10, and report,
"The results showed a statistically significant association (P <
0.10)," or switch to a one-sided P value and report it as "P =
0.04." A better choice would be to report that "The results,
although suggestive of an association, did not achieve statistical
significance (P = 0.08)."
This solution acknowledges that statistical significance is not an
all-or-none situation. In part because of this problem, many statisticians
and epidemiologists are moving away from hypothesis testing, with its
emphasis on P values, to using confidence intervals to report the
precision of the study results (5,6,7). However, for the purposes of
sample size planning for analytic studies, hypothesis testing is still the
standard.
Sides of the Alternative Hypothesis
Recall that an alternative hypothesis actually has two sides, either or
both of which can be tested in the sample by using one- or two-sided
statistical tests. When a two-sided statistical test is used, the P value
includes the probabilities of committing a type I error in each of two
directions, which is about twice as great as the probability in either
direction alone. It is easy to convert from a one-sided P value to a
two-sided P value, and vice versa. A one-sided P value of 0.05, for
example, is usually the same as a two-sided P value of 0.10. (Some
statistical tests are asymmetric, which is why we said "usually.")
In those rare situations in which an investigator is only interested in
one of the sides and has so formulated the alternative hypothesis,
sample size should be calculated accordingly. A one-sided hypothesis
should never be used just to reduce the sample size.
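
The conversion between one- and two-sided P values can be sketched in Python for the simplest symmetric case, a z statistic referred to the standard normal distribution. This is an illustrative assumption; the function name p_values_from_z is hypothetical, and the doubling rule does not hold for asymmetric tests.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return (one_sided, two_sided) P values for a z statistic.

    The one-sided value is taken in the direction of the observed
    effect; the two-sided value doubles it, which is valid only for a
    symmetric test statistic.
    """
    one_sided = 1.0 - NormalDist().cdf(abs(z))
    return one_sided, 2.0 * one_sided


one_sided, two_sided = p_values_from_z(1.645)
print(f"one-sided P = {one_sided:.3f}, two-sided P = {two_sided:.3f}")
```

A z statistic of 1.645 gives a one-sided P value of about 0.05 and a two-sided P value of about 0.10, matching the example in the text.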
Type of Statistical Test
The formulas used to calculate sample size are based on mathematical
assumptions, which differ for each statistical test. Before the sample
size can be calculated, the investigator must decide on the statistical
approach to analyzing the data. That choice depends mainly on the
type of predictor and outcome variables in the study. Table 6.1 lists
some common statistics used in data analysis, and Chapter 6 provides
simplified approaches to estimating sample size for studies that use
these statistics.
Additional Points
Variability
It is not simply the size of an effect that is important; its variability
also matters. Statistical tests depend on being able to show a
difference between the groups being compared. The greater the
variability (or spread) in the outcome variable among the subjects, the
more likely it is that the values in the groups will overlap, and the
more difficult it will be to demonstrate an overall difference between
them. Because measurement error contributes to the overall variability,
less precise measurements require larger sample sizes (8).
Consider a study of the effects of two isocaloric diets (low fat and low
carbohydrate) in achieving weight loss in 20 obese patients. If all those
on the low-fat diet lost about 3 kg and all those on the
low-carbohydrate diet failed to lose much weight (an effect size of 3 kg), it
is likely that the low-fat diet really is better (Fig. 5.1A). On the other
hand, suppose that although the average weight loss is 3 kg in the
low-fat group and 0 kg in the low-carbohydrate group, there is a great deal
of overlap between the two groups. (The changes in weight vary from a
loss of 8 kg to a gain of 8 kg.) In this situation (Fig. 5.1B), although
the effect size is still 3 kg, the greater variability will make it more
difficult to detect a difference between the diets, and a larger sample
size will be needed.
When one of the variables used in the sample size estimate is
continuous (e.g., body weight in Figure 5.1), the investigator will need
to estimate its variability. (See the section on the t test in Chapter 6
for details.) In the other situations, variability is already included in
the other parameters entered into the sample size formulas and tables,
and need not be specified.
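
The dependence of sample size on variability can be illustrated with a rough Python sketch. It uses a common normal-approximation formula for comparing two means, n per group = 2(z_α + z_β)²(S/E)², which is an assumption here and not necessarily the method behind the tables in Chapter 6. The standard deviations are hypothetical; only the 3 kg effect size comes from the diet example.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect, sd, alpha=0.05, beta=0.20):
    """Approximate subjects per group for comparing two means.

    Normal-approximation formula:
        n = 2 * (z_alpha + z_beta)**2 * (sd / effect)**2
    with a two-sided alpha. It somewhat understates n when n is small.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(1.0 - beta)
    return ceil(2.0 * (z_alpha + z_beta) ** 2 * (sd / effect) ** 2)


EFFECT_KG = 3.0                  # effect size from the diet example
for sd_kg in (1.0, 2.0, 4.0):    # hypothetical standard deviations
    print(f"SD = {sd_kg} kg -> about {n_per_group(EFFECT_KG, sd_kg)} per group")
```

With the effect size held at 3 kg, doubling the standard deviation roughly quadruples the required sample size, which is the point of the comparison between panels A and B of Figure 5.1.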
Multiple and Post Hoc Hypotheses
When more than one hypothesis is tested in a study, especially if some
of those hypotheses were formulated after the data were analyzed
(post hoc hypotheses), the likelihood that at least one will achieve
statistical significance on the basis of chance alone increases. For
example, if 20 independent hypotheses are tested at an α of 0.05, the
likelihood is substantial (64%; [1 - 0.95^20]) that at least one
hypothesis will be statistically significant by chance alone. Some
statisticians advocate adjusting the level of statistical significance
when more than one hypothesis is tested in a study. This keeps the
overall probability of accepting any one of the alternative hypotheses,
when all the findings are due to chance, at the specified level. For
example, genomic studies that look for an association between
hundreds (or even thousands) of genotypes and a disease need to use
a much smaller α than 0.05, or they risk identifying many
false-positive associations.
One approach, named after the mathematician Bonferroni, is to divide
the significance level (say, 0.05) by the number of hypotheses tested.
If there were four hypotheses, for example, each would be tested at an
α of 0.0125 (i.e., 0.05 ÷ 4). This would require substantially
increasing the sample size over that needed for testing each
hypothesis at an α of 0.05.
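
The two calculations above (the 64% chance of at least one false-positive among 20 independent tests, and the Bonferroni-adjusted level for four tests) can be checked with a few lines of Python. The function names are hypothetical and the sketch assumes independent tests with every null hypothesis true.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false-positive among independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(overall_alpha, n_tests):
    """Per-hypothesis significance level under the Bonferroni approach."""
    return overall_alpha / n_tests


print(f"20 tests at alpha = 0.05: {familywise_error(0.05, 20):.0%}")
per_test = bonferroni_alpha(0.05, 4)
print(f"Bonferroni alpha for 4 tests: {per_test}")
print(f"overall error with that alpha: {familywise_error(per_test, 4):.3f}")
```

The last line shows that testing four hypotheses at 0.0125 each keeps the overall false-positive chance just under 0.05.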

FIGURE 5.1. A: Weight loss achieved by two diets. All subjects on
the low-fat diet lost from 2 to 4 kg, whereas weight change in
those on the low-carbohydrate (CHO) diet varied from -1 to +1
kg. Because there is no overlap between the two groups, it is
reasonable to infer that the low-fat diet is better at achieving
weight loss than the low-carbohydrate diet (as would be confirmed
with a t test, which has a P value < 0.0001). B: Weight loss
achieved by two diets. There is substantial overlap in weight
change in the two groups. Although the effect size is the same (3
kg) as in A, there is little evidence that one diet is better than the
other (as would be confirmed with a t test, which has a P value of
0.19).

We believe that a Bonferroni-type of approach to multiple hypothesis
testing is usually too stringent. Investigators do not adjust the
significance levels for hypotheses that are tested in separate studies.
Why do so when several hypotheses are tested in the same study? In
our view, adjusting for multiple hypotheses is chiefly useful when
the likelihood of making false-positive errors is high, because the
number of tested hypotheses is substantial (say, more than ten) and
the prior probability for each hypothesis is low (e.g., in screening a
large number of genes for association with a phenotype). The first
criterion is actually stricter than it may appear, because what matters
is the number of hypotheses that are tested, not the number that are
reported. Testing 50 hypotheses but only reporting or emphasizing the
one or two P values that are less than 0.05 is misleading. Adjusting
for multiple hypotheses is especially important when the consequences
of making a false-positive error are large, such as mistakenly
concluding that an ineffective treatment is beneficial.
In general, the issue of what significance level to use depends more on
the prior probability of each hypothesis than on the number of
hypotheses tested. There is an analogy with the use of diagnostic tests
that may be helpful (9). When interpreting the results of a diagnostic
test, a clinician considers the likelihood that the patient being tested
has the disease in question. For example, a modestly abnormal test
result in a healthy person (a serum alkaline phosphatase level that is
15% greater than the upper limit of normal) is probably a false-positive
test that is unlikely to have much clinical importance. Similarly, a P
value of 0.05 for an unlikely hypothesis is probably also a false-positive
result.
However, an alkaline phosphatase level that is 10 or 20 times greater
than the upper limit of normal is unlikely to have occurred by chance
(although it might be a laboratory error). So too a very small P value
(say, <0.001) is unlikely to have occurred by chance (although it could
be due to bias). It is hard to dismiss very abnormal test results as
being false-positives or to dismiss very low P values as being due to
chance, even if the prior probability of the disease or the hypothesis
was low.
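
One way to make the diagnostic-test analogy concrete is to apply Bayes' rule, letting power play the role of sensitivity and α the role of the false-positive rate. The Python sketch below is only an illustration under those assumptions: the function name, the prior probabilities, and the default power of 0.80 are hypothetical, and the calculation treats the result simply as "P < α" without using the actual P value.

```python
def prob_true_given_significant(prior, power=0.80, alpha=0.05):
    """Probability that an association is real given that P < alpha.

    Bayes' rule with power playing the role of sensitivity and alpha
    the role of the false-positive rate (1 - specificity).
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


for prior in (0.50, 0.10, 0.01):   # hypothetical prior probabilities
    posterior = prob_true_given_significant(prior)
    print(f"prior {prior:.2f} -> probability the finding is real {posterior:.2f}")
```

A significant result for a hypothesis with a 50% prior probability is very likely to be real, whereas the same result for a hypothesis with a 1% prior probability is still more likely than not to be a false-positive. Because the sketch ignores how small the P value is, it does not capture the point made above that a very low P value is hard to dismiss.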
Moreover, the number of tests that were ordered, or hypotheses that
were tested, is not always relevant. The interpretation of an elevated
serum uric acid level in a patient with a painful and swollen joint
should not depend on whether the physician ordered just a single test
(the uric acid level) or obtained the result as part of a panel of 20
tests. Similarly, when interpreting the P value for testing a research
hypothesis that makes good sense, it should not matter that the
investigator also tested several unlikely hypotheses. What matters
most is the reasonableness of the research hypothesis being tested:
that it has a substantial prior probability of being correct. (Prior
probability, in this "Bayesian" approach, is usually a subjective
judgment based on evidence from other sources.) Hypotheses that are
formulated during the design of a study usually meet this requirement;
after all, why else would the investigator put the time and effort into
planning and doing the study?
What about unanticipated associations that appear during the collection
and analysis of a study's results? This process is sometimes called
"hypothesis generation" or, less favorably, "data-mining" or a
"fishing expedition." The many informal comparisons that are
made during data analysis are a form of multiple hypothesis testing. A
similar problem arises when variables are redefined during data
analysis, or when results are presented for subgroups of the sample.
Significant P values for data-generated hypotheses that were not
considered during the design of the study are often due to chance.
They should be viewed with interest but skepticism and considered a
fertile source of potential research questions for future studies.
Sometimes, however, an investigator fails to specify a particular
hypothesis in advance, although that hypothesis seems reasonable
when it is time for the data to be analyzed. This might happen, for
example, if others discover a new risk factor while the study is going
on, or if the investigator just didn't happen to think of a particular
hypothesis when the study was being designed. The important issue is
not so much whether the hypothesis was formulated before the study
began, but whether there is a reasonable prior probability based on
evidence from other sources that the hypothesis is true (9).
There are some definite advantages to formulating more than one
hypothesis when planning a study. The use of multiple unrelated
hypotheses increases the efficiency of the study, making it possible to
answer more questions with a single research effort and to discover
more of the true associations that exist in the population. It may also
be a good idea to formulate several related hypotheses; if the findings
are consistent, the study conclusions are made stronger. Studies in
patients with heart failure have found that the use of
angiotensin-converting enzyme inhibitors is beneficial in reducing
cardiac admissions, cardiovascular mortality, and total mortality. Had
only one of these hypotheses been tested, the inferences from these
studies would have been less definitive. Lunch may not be free,
however, when multiple hypotheses are tested. Suppose that when these
related and prestated hypotheses are tested, only one turns out to be
statistically significant. Then the investigator must decide (and try to
convince editors and readers) whether the significant results, the
nonsignificant results, or both sets of results are true.
Primary and Secondary Hypotheses
Some studies, especially large randomized trials, specify some
hypotheses as being "secondary." This usually happens when
there is one primary hypothesis around which the study has been
designed, but the investigators are also interested in other research
questions that are of lesser importance. For example, the primary
outcome of a trial of zinc supplementation might be hospitalizations or
emergency department visits for upper respiratory tract infections; a
secondary outcome might be self-reported days missed from work or
school. If the study is being done to obtain approval for a
pharmaceutical agent, the primary outcome is what will matter most to
the regulatory body. The sample size calculations are always focused
on the primary hypothesis, and secondary hypotheses with insufficient
power should be avoided. Stating a secondary hypothesis in advance
does increase the credibility of the results. Stating a secondary
hypothesis after the data have been collected and analyzed is another
form of data dredging.
A good rule, particularly for clinical trials, is to establish in advance as
many hypotheses as make sense, but specify just one as the primary
hypothesis, which can be tested statistically without argument about
whether to adjust for multiple hypothesis testing. More important,
having a primary hypothesis helps to focus the study on its main
objective and provides a clear basis for the main sample size
calculation.
Summary
1. Sample size planning is an important part of the design of both
analytic and descriptive studies. The sample size should be
estimated early in the process of developing the research design,
so that appropriate modifications can be made.
2. Analytic studies and experiments need a hypothesis that
specifies, for the purpose of subsequent statistical tests, the
anticipated association between the main predictor and outcome
variables. Purely descriptive studies, lacking the strategy of
comparison, do not require a hypothesis.
3. Good hypotheses are specific about how the population will be
sampled and the variables measured, simple (there is only one
predictor and one outcome variable), and formulated in
advance.
4. The null hypothesis, which proposes that the predictor and
outcome variables are not associated, is the basis for tests of
statistical significance. The alternative hypothesis
proposes that they are associated. Statistical tests attempt to
reject the null hypothesis of no association in favor of the
alternative hypothesis that there is an association.
5. An alternative hypothesis is either one-sided (only one direction
of association will be tested) or two-sided (both directions will be
tested). One-sided hypotheses should only be used in unusual
circumstances, when only one direction of the association is
clinically or biologically meaningful.
6. For analytic studies and experiments, the sample size is an
estimate of the number of subjects required to detect an
association of a given effect size and variability at a specified
likelihood of making type I (false-positive) and type II
(false-negative) errors. The maximum likelihood of making a type I
error is called α; that of making a type II error, β. The quantity
(1 - β) is power, the chance of observing an association of a
given effect size or greater in a sample if one is actually present
in the population.
7. It is often desirable to establish more than one hypothesis in
advance, but the investigator should specify a single primary
hypothesis as a focus and for sample size estimation.
Interpretation of findings from testing multiple hypotheses in
the sample, including unanticipated findings that emerge from the
data, is based on a judgment about the prior probability that
they represent real phenomena in the population.
References
1. The Alpha-Tocopherol, Beta Carotene Cancer Prevention Study
Group. The effect of vitamin E and beta carotene on the incidence
of lung cancer and other cancers in male smokers. N Engl J Med
1994;330:1029-1035.
2. Echt DS, Liebson PR, Mitchell LB, et al. Mortality and morbidity
in patients receiving encainide, flecainide, or placebo. The Cardiac
Arrhythmia Suppression Trial. N Engl J Med 1991;324:781-788.
3. The Cardiac Arrhythmia Suppression Trial II Investigators. Effect
of the antiarrhythmic agent moricizine on survival after myocardial
infarction. N Engl J Med 1992;327:227-233.
4. Van Walraven C, Mahon JL, Moher D, et al. Surveying physicians
to determine the minimal important difference: implications for
sample-size calculation. J Clin Epidemiol 1999;52:717-723.
5. Daly LE. Confidence limits made easy: interval estimation using
a substitution method. Am J Epidemiol 1998;147:783-790.
6. Goodman SN. Toward evidence-based medical statistics. 1: The P
value fallacy. Ann Intern Med 1999;130:995-1004.
7. Goodman SN. Toward evidence-based medical statistics. 2: The
Bayes factor. Ann Intern Med 1999;130:1005-1013.
8. McKeown-Eyssen GE, Tibshirani R. Implications of measurement
error in exposure for the sample sizes of case-control studies. Am J
Epidemiol 1994;139:415-421.
9. Browner WS, Newman TB. Are all significant P values created
equal? The analogy between diagnostic tests and clinical research.
JAMA 1987;257:2459-2463.
6
Estimating Sample Size and Power: Applications and
Examples
Warren S. Browner
Thomas B. Newman
Stephen B. Hulley
Chapter 5 introduced the basic principles underlying sample size calculations. This chapter presents
several cookbook techniques for using those principles to estimate the sample size needed for a
research project. The first section deals with sample size estimates for an analytic study or
experiment, including some special issues that apply to these studies such as multivariate analysis.
The second section considers studies that are primarily descriptive. Subsequent sections deal with
studies that have a fixed sample size, strategies for maximizing the power of a study, and how to
estimate the sample size when there appears to be insufficient information from which to work. The
chapter concludes with common errors to avoid.
At the end of the chapter, there are tables and formulas in the appendixes for several basic methods
of estimating sample size. In addition, there is a calculator on our website
(http://www.epibiostat.ucsf.edu/dcr/), and there are many sites on the Web that can provide instant
interactive sample size calculations; try searching for "sample size" and "power" and
"interactive". Most statistical packages can also estimate sample size for common study
designs.
Sample Size Techniques for Analytic Studies and Experiments
There are several variations on the recipe for estimating sample size in an analytic study or
experiment, but they all have certain steps in common:
1. State the null hypothesis and either a one- or two-sided alternative hypothesis.
2. Select the appropriate statistical test from Table 6.1 based on the type of predictor variable
and outcome variable in those hypotheses.
3. Choose a reasonable effect size (and variability, if necessary).
Table 6.1 Simple Statistical Tests for Use in Estimating Sample Size*

                         Outcome Variable
Predictor Variable       Dichotomous          Continuous
Dichotomous              Chi-squared test†    t test
Continuous               t test               Correlation coefficient

* See text for what to do about ordinal variables, or if planning to analyze the data
with another type of statistical test.
† The chi-squared test is always two-sided; a one-sided equivalent is the Z statistic.
4. Set α and β. (Specify a two-sided α unless the alternative hypothesis is clearly one-sided.)
5. Use the appropriate table or formula in the appendix to estimate the sample size.
Even if the exact value for one or more of the ingredients is uncertain, it is important to estimate the
sample size early in the design phase. Waiting until the last minute to prepare the sample size can be
a rude awakening: it may be necessary to start over with new ingredients, which may mean
redesigning the entire study. This is why this subject is covered early in this book.
Not all analytic studies fit neatly into one of the three main categories that follow; a few of the more
common exceptions are discussed in the section called "Other Considerations and Special
Issues."
The t Test
The t test (sometimes called "Student's" t test, after the pseudonym of its developer) is
commonly used to determine whether the mean value of a continuous outcome variable in one group
differs significantly from that in another group. For example, the t test would be appropriate to use
when comparing the mean depression scores in patients treated with two different antidepressants,
or the mean change in weight among two groups of participants in a placebo-controlled trial of a new
drug for weight loss. The t test assumes that the distribution (spread) of the variable in each of the
two groups approximates a normal (bell-shaped) curve. However, the t test is remarkably robust, so
it can be used for almost any distribution unless the number of subjects is small (fewer than 30 to
40) or there are extreme outliers.
To estimate the sample size for a study that will be analyzed with a t test (see Example 6.1), the
investigator must
1. State the null hypothesis and whether the alternative hypothesis is one- or two-sided.
2. Estimate the effect size (E) as the difference in the mean value of the outcome variable between
the study groups.
3. Estimate the variability of the outcome variable as its standard deviation (S).
4. Calculate the standardized effect size (E/S), defined as the effect size divided by the standard
deviation of the outcome variable.
5. Set α and β.
The effect size and variability can often be estimated from previous studies in the literature and
consultation with experts. Occasionally, a small pilot study will be necessary to estimate the standard
deviation of the outcome variable (also see the section "How to estimate sample size when there
is insufficient information," later in this chapter). When the outcome variable is the change in a
continuous measurement (e.g., change in weight during a study), the investigator should use the
standard deviation of the change in that variable (not the standard deviation of the variable itself) in
the sample size estimates. The standard deviation of the change in a variable is usually smaller than
the standard deviation of the variable; therefore the sample size will also be smaller.
The standardized effect size is a unitless quantity that makes it possible to estimate a sample size
when an investigator cannot obtain information about the variability of the outcome variable; it also
simplifies comparisons between the effect sizes of different variables. (The standardized effect size
equals the effect size divided by the standard deviation of the outcome variable. For example, a 10
mg/dL difference in serum cholesterol level, which has a standard deviation in the population of
about 40 mg/dL, would equal a standardized effect size of 0.25.) The larger the standardized effect
size, the smaller the required sample size. For most studies, the standardized effect size will be
>0.1. Effect sizes smaller than that are difficult to detect (they require very large sample sizes) and
usually not very important clinically.
Appendix 6A gives the sample size requirements for various combinations of α and β for several
standardized effect sizes. To use Table 6A, look down its leftmost column for the standardized effect
size. Next, read across the table to the chosen values for α and β for the sample size required per
group. (The numbers in Table 6A assume that the two groups being compared are of the same size;
use the formula below the table or an interactive Web-based program if that assumption is not true.)
Example 6.1 Calculating Sample Size When Using the t Test
Problem: The research question is whether there is a difference in the efficacy of salbutamol and
ipratropium bromide for the treatment of asthma. The investigator plans a randomized trial of the
effect of these drugs on FEV1 (forced expiratory volume in 1 second) after 2 weeks of treatment. A
previous study has reported that the mean FEV1 in persons with treated asthma was 2.0 liters, with a
standard deviation of 1.0 liter. The investigator would like to be able to detect a difference of 10% or
more in mean FEV1 between the two treatment groups. How many patients are required in each group
(salbutamol and ipratropium) at α (two-sided) = 0.05 and power = 0.80?
Solution: The ingredients for the sample size calculation are as follows:
1. Null Hypothesis: Mean FEV1 after 2 weeks of treatment is the same in asthmatic patients
treated with salbutamol as in those treated with ipratropium.
Alternative Hypothesis (two-sided): Mean FEV1 after 2 weeks of treatment is different in
asthmatic patients treated with salbutamol from what it is in those treated with ipratropium.
2. Effect Size = 0.2 liters (10% × 2.0 liters).
3. Standard Deviation of FEV1 = 1.0 liter.
4. Standardized Effect Size = effect size ÷ standard deviation = 0.2 liters ÷ 1.0 liter = 0.2.
5. α (two-sided) = 0.05; β = 1 - 0.80 = 0.20. (Recall that β = 1 - power.)
Looking across from a standardized effect size of 0.20 in the leftmost column of Table 6A and down
from α (two-sided) = 0.05 and β = 0.20, 394 patients are required per group. This is the number
of patients in each group who need to complete the study; even more will need to be enrolled to
account for dropouts. This sample size may not be feasible, and the investigator might reconsider the
study design, or perhaps settle for only being able to detect a larger effect size. See the section on
the t test for paired samples (Example 6.8) for a great solution.
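The calculation in Example 6.1 can also be sketched in code. The function below is a minimal illustration, not the method behind Table 6A: it assumes the common normal approximation n = 2(z_α + z_β)² ÷ (E/S)² per group, and the function and parameter names are hypothetical. Because the table may include a small-sample adjustment, the result can differ from the printed value by a subject or two.

```python
import math
from statistics import NormalDist


def t_test_sample_size(effect_size, std_dev, alpha=0.05, power=0.80, two_sided=True):
    """Approximate subjects needed per group (equal groups) for a two-sample t test.

    Assumption of this sketch: the normal approximation
    n = 2 * (z_alpha + z_beta)^2 / (E/S)^2, which may differ slightly from Table 6A.
    """
    standardized = effect_size / std_dev           # E/S, a unitless quantity
    tail = alpha / 2 if two_sided else alpha       # split alpha for a two-sided test
    z_alpha = NormalDist().inv_cdf(1 - tail)       # critical value for the type I error rate
    z_beta = NormalDist().inv_cdf(power)           # critical value for power = 1 - beta
    n = 2 * (z_alpha + z_beta) ** 2 / standardized ** 2
    return math.ceil(n)                            # round up to whole subjects


# Inputs from Example 6.1: E = 0.2 liters, S = 1.0 liter.
n_per_group = t_test_sample_size(effect_size=0.2, std_dev=1.0)
print(n_per_group)  # 393, close to the 394 read from Table 6A
```

Rounding up rather than to the nearest integer keeps the estimate conservative, which suits a planning calculation.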
The t test is usually used for comparing continuous outcomes, but it can also be used to estimate the
sample size for a dichotomous outcome (e.g., in a case-control study) if the study has a continuous
predictor variable. In this situation, the t test compares the mean value of the predictor variable in
the cases with that in the controls.
There is a convenient shortcut for approximating sample size using the t test, when more than about
30 subjects will be studied and the power is set at 0.80 (β = 0.2) and α (two-sided) is set at 0.05
(1). The formula is
Sample size (per equal-sized group) = 16 ÷ (standardized effect size)².
For Example 6.1, the shortcut estimate of the sample size would be 16 ÷ 0.2² = 400 per group.
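As a brief illustration of the shortcut, the sketch below applies 16 ÷ (E/S)² to a few standardized effect sizes. The function name is hypothetical. The constant 16 is a rounding of 2 × (1.96 + 0.84)², which is roughly 15.7, so the shortcut slightly overestimates the normal-approximation value.

```python
import math


def shortcut_sample_size(standardized_effect_size):
    """Rule-of-thumb subjects per group: 16 / (E/S)^2.

    Intended only for alpha (two-sided) = 0.05 and power = 0.80,
    with more than about 30 subjects per group.
    """
    raw = 16 / standardized_effect_size ** 2
    # Round first so floating-point noise cannot push an exact answer up by one.
    return math.ceil(round(raw, 6))


for es in (0.2, 0.25, 0.5):
    print(es, shortcut_sample_size(es))
# 0.2 400
# 0.25 256
# 0.5 64
```

Halving the standardized effect size quadruples the required sample size, which is why small effects are so expensive to detect.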
The Chi-Squared Test
The chi-squared test (χ²) can be used to compare the proportion of subjects in each of two groups
who have a dichotomous outcome. For example, the proportion of men who develop coronary heart
disease (CHD) while being treated with folate can be compared with the proportion who develop CHD
while taking a placebo. The chi-squared test is always two-sided; an equivalent test for one-sided
hypotheses is the one-sided Z test.
In an experiment or cohort study, effect size is specified by the difference between P1, the proportion
of subjects expected to have the outcome in one group, and P2, the proportion expected in the other
group. In a case-control study, P1 represents the proportion of cases expected to have a particular
risk factor, and P2 represents the proportion of controls expected to have the risk factor. Variability is
a function of P1 and P2, so it need not be specified.
To estimate the sample size for a study that will be analyzed with the chi-squared test or Z test to
compare two proportions, the investigator must
1. State the null hypothesis and decide whether the alternative hypothesis should be one- or
two-sided.
2. Estimate the effect size and variability in terms of P1, the proportion with the outcome in one
group, and P2, the proportion with the outcome in the other group.
3. Set α and β.
Appendix 6B gives the sample size requirements for several combinations of α and β, and a range of
values of P1 and P2. To estimate the sample size, look down the leftmost column of Tables 6B.1 or
6B.2 for the smaller of P1 and P2 (if necessary rounded to the nearest 0.05). Next, read across for the
difference between P1 and P2. Based on the chosen values for α and β, the table gives the sample
size required per group.
Example 6.2 Calculating Sample Size When Using the Chi-Squared Test
Problem: The research question is whether elderly smokers have a greater incidence of skin cancer
than nonsmokers. A review of previous literature suggests that the 5-year incidence of skin cancer is
about 0.20 in elderly nonsmokers. At α (two-sided) = 0.05 and power = 0.80, how many smokers
and nonsmokers will need to be studied to determine whether the 5-year skin cancer incidence is at
least 0.30 in smokers?
Solution: The ingredients for the sample size calculation are as follows:
1. Null Hypothesis: The incidence of skin cancer is the same in elderly smokers and nonsmokers.
Alternative Hypothesis (two-sided): The incidence of skin cancer is different in elderly
smokers and nonsmokers.
2. P2 (incidence in nonsmokers) = 0.20; P1 (incidence in smokers) = 0.30. The smaller of these
values is 0.20, and the difference between them (P1 - P2) is 0.10.
3. α (two-sided) = 0.05; β = 1 - 0.80 = 0.20.
Looking across from 0.20 in the leftmost column in Table 6B.1 and down from an expected difference
of 0.10, the middle number for α (two-sided) = 0.05 and β = 0.20 is the required sample size of
313 smokers and 313 nonsmokers. If the investigator had chosen to use a one-sided alternative
hypothesis, given that there is a great deal of evidence suggesting that smoking is a carcinogen and
none suggesting that it prevents cancer, the sample size would be 251 smokers and 251 nonsmokers.
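The two-proportion calculation in Example 6.2 can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to build Table 6B: it assumes the usual normal-approximation formula for comparing two proportions, followed by a continuity correction, and the function name is hypothetical. Under those assumptions the sketch gives the same figures as the example.

```python
import math
from statistics import NormalDist


def two_proportion_sample_size(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Approximate subjects per group (equal groups) to compare two proportions.

    Assumptions of this sketch: normal-approximation formula plus a
    continuity correction; neither is taken from the chapter text.
    """
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2                      # average of the two proportions
    diff = abs(p1 - p2)                        # effect size
    # Uncorrected sample size per group.
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / diff ** 2
    # Continuity correction, which always increases the estimate.
    n_corrected = (n / 4) * (1 + math.sqrt(1 + 4 / (n * diff))) ** 2
    return math.ceil(n_corrected)


# Inputs from Example 6.2: P1 = 0.30 (smokers), P2 = 0.20 (nonsmokers).
two_sided_n = two_proportion_sample_size(0.30, 0.20)
one_sided_n = two_proportion_sample_size(0.30, 0.20, two_sided=False)
print(two_sided_n, one_sided_n)  # 313 251
```

Note that variability never appears as a separate input: it is computed from P1 and P2, as the text describes.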
Often the investigator specifies the effect size in terms of the relative risk (risk ratio) of the
outcome in two groups of subjects. For example, an investigator might study whether women who
use oral contraceptives are at least twice as likely as nonusers to have a myocardial infarction. In a
cohort study (or experiment), it is straightforward to convert back and forth between relative risk
and the two proportions (P1 and P2), since the relative risk is just P1 divided by P2 (or vice versa).
For a case-control study, however, the situation is a little more complex because the relative risk
must be approximated by the odds ratio, which equals [P1 × (1 - P2)] ÷ [P2 × (1 - P1)]. The
investigator must specify the odds ratio (OR) and P2 (the proportion of controls exposed to the
predictor variable). Then P1 (the proportion of cases exposed to the predictor variable) is

$$P_1 = \frac{OR \times P_2}{(1 - P_2) + (OR \times P_2)}$$

For example, if the investigator expects that 10% of controls will be exposed to the oral
contraceptives (P2 = 0.1) and wishes to detect an odds ratio of 3 associated with the exposure, then

$$P_1 = \frac{3 \times 0.1}{(1 - 0.1) + (3 \times 0.1)} = \frac{0.3}{1.2} = 0.25$$
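Solving the odds ratio definition for P1 gives P1 = (OR × P2) ÷ [(1 - P2) + (OR × P2)]. The short sketch below applies that rearrangement to the oral contraceptive example and then checks it by converting back to an odds ratio. The function names are hypothetical.

```python
def case_exposure_proportion(odds_ratio, p2):
    """P1 (proportion of cases exposed) from the odds ratio and P2 (controls exposed)."""
    return odds_ratio * p2 / ((1 - p2) + odds_ratio * p2)


def odds_ratio_from_proportions(p1, p2):
    """Odds ratio as defined in the text: [P1 * (1 - P2)] / [P2 * (1 - P1)]."""
    return (p1 * (1 - p2)) / (p2 * (1 - p1))


# 10% of controls exposed, odds ratio of 3 to be detected.
p1 = case_exposure_proportion(odds_ratio=3, p2=0.1)
print(round(p1, 2))                                     # 0.25
print(round(odds_ratio_from_proportions(p1, 0.1), 2))   # 3.0, the round trip
```

With P1 = 0.25 and P2 = 0.10 in hand, the sample size can be estimated exactly as for any other comparison of two proportions.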
The Correlation Coefficient
Although the correlation coefficient (r) is not used frequently in sample size calculations, it can be
useful when the predictor and outcome variables are both continuous. The correlation coefficient is a
measure of the strength of the linear association between the two variables. It varies between -1 and
+1. Negative values indicate that as one variable increases, the other decreases (like blood lead level
and IQ in children). The closer the absolute value of r is to 1, the stronger the association; the closer
to 0, the weaker the association. Height and weight in adults, for example, are highly correlated in
some populations, with r ≈ 0.9. Such high values, however, are uncommon; many biologic
associations have much smaller correlation coefficients.
Correlation coefficients are common in some fields of clinical research, such as behavioral medicine,
but using them to estimate the sample size has a disadvantage: correlation coefficients have little
intuitive meaning. When squared (r²) a correlation coefficient represents the proportion of the spread
(variance) in an outcome variable that results from its linear association with a predictor variable,
and vice versa. That's why small values of r, such as those of 0.3 or less, may be statistically
significant if the sample is large enough without being very meaningful clinically or scientifically,
since they explain at most 9% of the variance.
An alternative, and often preferred, way to estimate the sample size for a study in which the
predictor and outcome variables are both continuous is to dichotomize one of the two variables (say,
at its median) and use the t test calculations instead. This has the advantage of expressing the effect
size as a difference between two groups.
To estimate sample size for a study that will be analyzed with a correlation coefficient, the
investigator must
1. State the null hypothesis, and decide whether the alternative hypothesis is one- or two-sided.
2. Estimate the effect size as the absolute value of the smallest correlation coefficient (r) that the
investigator would like to be able to detect. (Variability is a function of r and is already included
in the table and formula.)
3. Set α and β.
In Appendix 6C, look down the leftmost column of Table 6C for the effect size (r). Next, read across
the table to the chosen values for α and β, yielding the total sample size required. Table 6C yields
the appropriate sample size when the investigator wishes to reject the null hypothesis that there is
no association between the predictor and outcome variables (e.g., r = 0). If the investigator wishes
to determine whether the correlation coefficient in the study differs from a value other than zero
(e.g., r = 0.4), she should see the text below Table 6C for the appropriate methodology.
Example 6.3 Calculating Sample Size When Using the Correlation Coefficient in a
Cross-Sectional Study
Problem: The research question is whether urinary cotinine levels (a measure of the intensity of
current cigarette smoking) are correlated with bone density in smokers. A previous study found a
modest correlation (r = -0.3) between reported smoking (in cigarettes per day) and bone density;
the investigator anticipates that urinary cotinine levels will be at least as well correlated. How many
smokers will need to be enrolled, at α (two-sided) = 0.05 and β = 0.10?
1. Null Hypothesis: There is no correlation between urinary cotinine level and bone density in
smokers.
Alternative Hypothesis: There is a correlation between urinary cotinine level and bone density
in smokers.
2. Effect size (r) = |-0.3| = 0.3.
3. α (two-sided) = 0.05; β = 0.10.
Using Table 6C, reading across from r = 0.30 in the leftmost column and down from α (two-sided) =
0.05 and β = 0.10, 113 smokers will be required.
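Example 6.3 can be sketched in code as well. The function below assumes the standard approach based on Fisher's z transformation of r, with total N = [(z_α + z_β) ÷ C]² + 3 and C = 0.5 × ln[(1 + r) ÷ (1 - r)]. That formula is an assumption of this sketch, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def correlation_sample_size(r, alpha=0.05, beta=0.10, two_sided=True):
    """Approximate total subjects needed to detect a correlation of size |r| (null: r = 0).

    Assumption of this sketch: Fisher z approximation,
    N = ((z_alpha + z_beta) / C)^2 + 3 with C = 0.5 * ln((1 + |r|) / (1 - |r|)).
    """
    effect = abs(r)                               # only the magnitude of r matters
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(1 - beta)
    c = 0.5 * math.log((1 + effect) / (1 - effect))
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)


# Inputs from Example 6.3: r = -0.3, alpha (two-sided) = 0.05, beta = 0.10.
total_n = correlation_sample_size(-0.3)
print(total_n)  # 113
```

Unlike the t test and chi-squared calculations, the result here is the total sample size, not a per-group figure.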
Other Considerations and Special Issues
Dropouts
Each sampling unit must be available for analysis; subjects who are enrolled in a study but in whom
outcome status cannot be ascertained (such as dropouts) do not count in the sample size. If the
investigator anticipates that any of her subjects will not be available for follow-up, she should
increase the size of the enrolled sample accordingly. If, for example, the investigator estimates that
20% of her sample will be lost to follow-up, then the sample size should be increased by a factor of
(1 ÷ [1 - 0.20]), or 1.25.
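The adjustment for dropouts amounts to one division. The sketch below is illustrative only; the function name is hypothetical, and the 394 completers per group are borrowed from Example 6.1 as a convenient input.

```python
import math


def enrollment_target(completers_needed, dropout_fraction):
    """Subjects to enroll so that the expected number of completers meets the requirement."""
    if not 0 <= dropout_fraction < 1:
        raise ValueError("dropout_fraction must be at least 0 and less than 1")
    inflation = 1 / (1 - dropout_fraction)        # e.g., 1 / (1 - 0.20) = 1.25
    return math.ceil(completers_needed * inflation)


# 394 completers needed per group, 20% expected loss to follow-up.
print(enrollment_target(394, 0.20))  # 493
```

Dividing by (1 - dropout fraction) is not the same as adding the dropout percentage: adding 20% to 394 would give only 473, which would leave the study short after losses.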
Categorical Variables
Ordinal variables can often be treated as continuous variables, especially if the number of
categories is relatively large (six or more) and if averaging the values of the variable makes sense.
In other situations, the best strategy is to change the research hypothesis slightly by dichotomizing
the categorical variable. As an example, suppose a researcher is studying whether the sex of a
diabetic patient is associated with the number of times the patient visits a podiatrist in a year. The
number of visits is unevenly distributed: many people will have no visits, some will make one visit,
and only a few will make two or more visits. In this situation, the investigator could estimate the
sample size as if the outcome were dichotomous (no visits versus one or more visits).
Survival Analysis
When an investigator wishes to compare which of two treatments is more effective in prolonging life
or in reducing the symptomatic phase of a disease, survival analysis will be the appropriate technique
for analyzing the data (2,3). Although the outcome variable, say weeks of survival, appears to be
continuous, the t test is not appropriate because what is actually being assessed is not time (a
continuous variable) but the proportion of subjects (a dichotomous variable) still alive at each point
in time. A reasonable approximation can be made by dichotomizing the outcome variable at the end
of the anticipated follow-up period (e.g., the proportion surviving for 6 months or more), and
estimating the sample size with the chi-squared test.
Clustered Samples
Some research designs involve the use of clustered samples, in which subjects are sampled by
groups (Chapter 11). Consider, for example, a study of whether an educational intervention directed
at clinicians improves the rate of smoking cessation among their patients. Suppose that 20 physicians
are randomly assigned to the group that receives the intervention and 20 physicians are assigned to
a control group. One year later, the investigators plan to review the charts of a random sample of 50
patients who had been smokers at baseline in each practice to determine how many have quit
smoking. Does the sample size equal 40 (the number of physicians) or 2,000 (the number of
patients)? The answer, which lies somewhere in between those two extremes, depends upon how
similar the patients within a physician's practice are (in terms of their likelihood of smoking
cessation) compared with the similarity among all the patients. Estimating this quantity often
requires obtaining pilot data, unless another investigator has previously done a similar study. There
are several techniques for estimating the required sample size for a study using clustered samples
(4,5,6,7), but they are challenging and usually require the assistance of a statistician.
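To see why the answer lies between 40 and 2,000, it may help to look at one widely used approximation that the text does not name: the design effect, 1 + (m - 1) × ICC, where m is the number of patients per cluster and ICC is the intracluster correlation (a measure of how similar patients within a practice are). The sketch below assumes equal cluster sizes, and the ICC values are hypothetical. It is a teaching illustration, not a substitute for the techniques cited above.

```python
def effective_sample_size(n_clusters, patients_per_cluster, icc):
    """Effective number of independent subjects under the design effect 1 + (m - 1) * ICC.

    Assumptions of this sketch: equal cluster sizes and a single ICC; the ICC
    values used below are hypothetical, not estimates from any study.
    """
    design_effect = 1 + (patients_per_cluster - 1) * icc
    return n_clusters * patients_per_cluster / design_effect


# 40 physicians (20 per arm), 50 patients sampled per practice.
for icc in (0.0, 0.05, 1.0):
    print(icc, round(effective_sample_size(40, 50, icc)))
# 0.0 2000   patients within a practice are no more alike than patients overall
# 0.05 580   modest similarity already removes most of the apparent sample
# 1.0 40     patients within a practice are identical, so only physicians count
```

The two extremes reproduce the numbers in the text, and the middle row suggests how quickly even a small amount of within-practice similarity erodes the effective sample size.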
Matching
For a variety of reasons (Chapter 9), an investigator may choose to use a matched design. The
techniques in this chapter, which ignore any matching, nevertheless provide reasonable estimates of
the required sample size. More precise estimates can be made using standard approaches (8) or an
interactive Web-based program.
Multivariate Adjustment and Other Special Statistical Analyses
When designing an observational study, an investigator may decide that one or more variables will
confound the association between the predictor and outcome (Chapter 9), and plan to use statistical
techniques to adjust for these confounders when she analyzes her results. When this adjustment
will be included in testing the primary hypothesis, the estimated sample size needs to take this into
account.
Analytic approaches that adjust for confounding variables often increase the required sample size
(9,10). The magnitude of that increase depends on several factors, including the prevalence of the
confounder, the strength of the association between the predictor and the confounder, and the
strength of the association between the confounder and the outcome. These effects are complex and
no general rule covers all situations.
Statisticians have developed multivariate methods such as linear regression and logistic regression
that allow the investigator to adjust for confounding variables. One widely used statistical technique,
Cox proportional hazards analysis, can adjust both for confounders and for differences in length of
follow-up. If one of these techniques is going to be used to analyze the data, there are corresponding
approaches for estimating the required sample size (3,11,12,13,14). Sample size techniques are also
available for other designs, such as studies of potential genetic risk factors or candidate genes
(15,16,17), economic studies (18,19,20), dose-response studies (21), or studies that involve more
than two groups (22). Again, the Internet is a useful resource for these more sophisticated
approaches (e.g., search for "sample size" and "logistic regression").
It is usually easier, at least for novice investigators, to estimate the sample size assuming a simpler
method of analysis, such as the chi-squared test or the t test. Suppose, for example, an investigator
is planning a case-control study of whether serum cholesterol level (a continuous variable) is
associated with the occurrence of brain tumors (a dichotomous variable). Even if the eventual plan is
to analyze the data with the logistic regression technique, a ballpark sample size can be estimated
with the t test. It turns out that the simplified approaches usually produce sample size estimates that
are similar to those generated by more sophisticated techniques. An experienced statistician may
need to be consulted, however, if a grant proposal that involves substantial costs is being submitted
for funding: grant reviewers will expect you to use a sophisticated approach even if they accept that
the sample size estimates are based on guesses about the risk of the outcome, the effect size, and so
on.
Equivalence Studies
Sometimes the goal of a study is to show that the null hypothesis is correct and that there really is
no substantial association between the predictor and outcome variables (23,24,25,26). A common
example is a clinical trial to test whether a new drug is as effective as an established drug. This
situation poses a challenge when planning sample size, because the desired effect size is zero (i.e.,
the investigator would like to show that the two drugs are equally effective).
One acceptable method is to design the study to have substantial power (say, 0.90 or 0.95) to reject
the null hypothesis when the effect size is small enough that it would not be clinically important
(e.g., a difference of 5 mg/dL in mean fasting glucose levels). If the results of such a well-powered
study show "no effect" (i.e., the 95% confidence interval excludes the prespecified difference
of 5 mg/dL), then the investigator can be reasonably sure that the two drugs have similar effects.
One problem with equivalence studies, however, is that the additional power and the small effect size
often require a very large sample size.
Another problem involves the loss of the usual safeguards that are inherent in the paradigm of the
null hypothesis, which protects a conventional study, such as one that compares an active drug with
a placebo, against Type I errors (falsely rejecting the null hypothesis). The paradigm ensures that
many problems in the design or execution of a study, such as using imprecise measurements or
inadequate numbers of subjects, make it harder to reject the null hypothesis. Investigators in a
conventional study, who are trying to reject a null hypothesis, have a strong incentive to do the best
possible study. The same is not true for an equivalence study, in which the goal is to find no
difference, and the safeguards do not apply.
Sample Size Techniques for Descriptive Studies
Estimating the sample size for descriptive studies, including studies of diagnostic tests, is based
on somewhat different principles. Such studies do not have predictor and outcome variables, nor do
they compare different groups. Therefore the concepts of power and the null and alternative
hypotheses do not apply. Instead, the investigator calculates descriptive statistics, such as means
and proportions. Often, however, descriptive studies (What is the prevalence of depression among
elderly patients in a medical clinic?) are also used to ask analytic questions (What are the predictors
of depression among these patients?). In this situation, sample size should be estimated for the
analytic study as well, to avoid the common problem of having inadequate power for what turns out
to be the question of greater interest.
Descriptive studies commonly report confidence intervals, a range of values about the sample
mean or proportion. A confidence interval is a measure of the precision of a sample estimate. The
investigator sets the confidence level, such as 95% or 99%. An interval with a greater confidence
level (say 99%) is wider, and therefore more likely to include the true population value, than an
interval with a lower confidence level (90%).
The width of a confidence interval depends on the sample size. For example, an investigator might
wish to estimate the mean score on the U.S. Medical Licensing Examination in a group of medical
students. From a sample of 200 students, she might estimate that the mean score in the population
of all students is 215, with a 95% confidence interval from 210 to 220. A smaller study, say with 50
students, might have about the same mean score but would almost certainly have a wider 95%
confidence interval.
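The dependence of interval width on sample size can be made concrete with a small sketch. It assumes a normal-approximation interval, mean ± z × SD ÷ √n. The standard deviation is not given in the text, so it is back-calculated here from the 200-student interval purely for illustration; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def ci_half_width(std_dev, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * std_dev / math.sqrt(n)


# Hypothetical SD implied by a 95% interval of 215 +/- 5 with n = 200.
z95 = NormalDist().inv_cdf(0.975)
implied_sd = 5 * math.sqrt(200) / z95

print(round(ci_half_width(implied_sd, 200), 1))  # 5.0  -> interval 210 to 220
print(round(ci_half_width(implied_sd, 50), 1))   # 10.0 -> interval 205 to 225
```

Because width scales with 1 ÷ √n, cutting the sample to one quarter of its size doubles the width of the interval.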
When estimating sample size for descriptive studies, the investigator specifies the desired level and
width of the confidence interval. The sample size can then be determined from the tables or formulas
in the appendix.
Continuous Variables
When the variable of interest is continuous, a confidence interval around the mean value of that
variable is often reported. To estimate the sample size for that confidence interval, the investigator
must
1. Estimate the standard deviation of the variable of interest.
2. Specify the desired precision (total width) of the confidence interval.
3. Select the confidence level for the interval (e.g., 95%, 99%).
To use Appendix 6D, standardize the total width of the interval (divide it by the standard deviation of
the variable), then look down the leftmost column of Table 6D for the expected standardized width.
Next, read across the table to the chosen confidence level for the required sample size.
Example 6.4 Calculating Sample Size for a Descriptive Study of a Continuous Variable
Problem: The investigator seeks to determine the mean IQ among third graders in an urban area with
a 99% confidence interval of ± 3 points. A previous study found that the standard deviation of IQ
in a similar city was 15 points.
1. Standard deviation of variable (SD) = 15 points.
2. Total width of interval = 6 points (3 points above and 3 points below). The standardized width
of interval = total width ÷ SD = 6 ÷ 15 = 0.4.
3. Confidence level = 99%.
Reading across from a standardized width of 0.4 in the leftmost column of Table 6D and down from
the 99% confidence level, the required sample size is 166 third graders.
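Example 6.4 can be sketched with the normal-approximation formula N = 4 × z² × S² ÷ W², where W is the total width of the interval and S is the standard deviation. That formula is an assumption of this sketch, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def mean_ci_sample_size(std_dev, total_width, confidence=0.95):
    """Approximate subjects needed for a confidence interval of a given total width around a mean.

    Assumption of this sketch: N = 4 * z^2 / (W/S)^2 (normal approximation).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standardized_width = total_width / std_dev    # W/S, as used to enter Table 6D
    return math.ceil(4 * z ** 2 / standardized_width ** 2)


# Inputs from Example 6.4: SD = 15 points, total width = 6 points, 99% confidence.
n_third_graders = mean_ci_sample_size(15, 6, confidence=0.99)
print(n_third_graders)  # 166
```

The interval's total width, not its half-width, is the input, which is why the factor of 4 appears in the formula.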
Dichotomous Variables
In a descriptive study of a dichotomous variable, results can be expressed as a confidence interval
around the estimated proportion of subjects with one of the values.
This includes studies of the sensitivity or specificity of a diagnostic test, which appear at first
glance to be continuous variables but are actually dichotomous: proportions expressed as
percentages (Chapter 12). To estimate the sample size for that confidence interval, the investigator
must
1. Estimate the expected proportion with the variable of interest in the population. (If more than
half of the population is expected to have the characteristic, then plan the sample size based on
the proportion expected not to have the characteristic.)
2. Specify the desired precision (total width) of the confidence interval.
3. Select the confidence level for the interval (e.g., 95%).
In Appendix 6E, look down the leftmost column of Table 6E for the expected proportion with the
variable of interest. Next, read across the table to the chosen width and confidence level, yielding
the required sample size.
Example 6.5 provi des a sample si ze calcul ati on for studyi ng the sensi tivi ty of a di agnosti c test, whi ch
yi el ds the required number of subjects wi th the disease. When studyi ng the specificity of the test,
the i nvesti gator must esti mate the required number of subjects who do not have the di sease. There
are al so techniques for estimating the sample si ze for studi es of receiver operating characteristic
(ROC) curves (27), li keli hood ratios (28), and reli abi li ty (29) (Chapter 12).
Example 6.5 Calculating Sample Size for a Descriptive Study of a Dichotomous
Variable
Problem: The investigator wishes to determine the sensitivity of a new diagnostic test for pancreatic
cancer. Based on a pilot study, she expects that 80% of patients with pancreatic cancer will have
positive tests. How many such patients will be required to estimate a 95% confidence interval for the
test's sensitivity of 0.80 ± 0.05?
1. Expected proportion = 0.20. (Because 0.80 is more than half, sample size is estimated from the
proportion expected to have a negative result, that is, 0.20.)
2. Total width = 0.10 (0.05 below and 0.05 above).
3. Confidence level = 95%.
Reading across from 0.20 in the leftmost column of Table 6E and down from a total width of 0.10,
the middle number (representing a 95% confidence level) yields the required sample size of 246
patients with pancreatic cancer.
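The same kind of check can be sketched for a proportion. This is a minimal illustration that assumes the normal-approximation formula N = 4 z² P(1 − P) ÷ W²; the function name n_for_proportion_ci is hypothetical.

```python
import math
from statistics import NormalDist


def n_for_proportion_ci(p, total_width, confidence=0.95):
    """Approximate number of subjects so that a confidence interval for a
    proportion has about the requested total width (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # p * (1 - p) is symmetric, so 0.20 and 0.80 give the same answer
    return math.ceil(4 * z**2 * p * (1 - p) / total_width**2)


# Example 6.5: expected proportion 0.20, total width 0.10, 95% confidence
print(n_for_proportion_ci(p=0.20, total_width=0.10, confidence=0.95))
```

Under these assumptions the result agrees with the 246 patients read from Table 6E.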
What to do When Sample Size is Fixed
Especially when doing secondary data analysis, the sample size may have been determined
before you design your study. In this situation, or if the number of participants who are available or
affordable for study is limited, the investigator must work backward from the fixed sample size. She
estimates the effect size that can be detected at a given power (usually 80%) or, less commonly, the
power to detect a given effect. The investigator can use the sample size tables in the chapter
appendixes, interpolating when necessary, or use the sample size formulas in the appendixes for
estimating the effect size.
A good general rule is that a study should have a power of 80% or greater to detect a reasonable
effect size. It is often tempting to pursue research hypotheses that have less power if the cost of
doing so is small, such as when doing an analysis of data that have already been collected. The
investigator should keep in mind, however, that she might face the difficulty of interpreting (and
publishing) a study that may have found no effect because of insufficient power; the broad
confidence intervals will reveal the possibility of a substantial effect in the population from which the
small study sample was drawn.
Example 6.6 Calculating the Detectable Effect Size When Sample Size is Fixed
Problem: An investigator determines that there are 100 patients with systemic lupus erythematosus
(SLE) who might be willing to participate in a study of whether a 6-week meditation program affects
disease activity, as compared with a control group that receives a pamphlet describing relaxation. If
the standard deviation of the change in a validated SLE disease activity scale score is expected to be
five points in both the control and the treatment groups, what size difference will the investigator be
able to detect between the two groups, at α (two-sided) = 0.05 and β = 0.20?
Solution: In Table 6A, reading down from α (two-sided) = 0.05 and β = 0.20 (the rightmost column
in the middle triad of numbers), 45 patients per group are required to detect a standardized effect
size of 0.6, which is equal to three points (0.6 × 5 points). The investigator (who will have about
50 patients per group) will be able to detect a difference of a little less than three points between
the two groups.
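Working backward from a fixed sample size can also be sketched numerically. The snippet below is an approximation, assuming the z-based relation E = (z_α + z_β) × S × √(2 ÷ n) for two equal groups rather than the t statistic used in Table 6A; the function name detectable_difference is hypothetical.

```python
import math
from statistics import NormalDist


def detectable_difference(n_per_group, sd, alpha=0.05, beta=0.20):
    """Approximate smallest difference in means detectable with two equal
    groups, two-sided alpha, and power 1 - beta (z approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(1 - beta)
    return (z_alpha + z_beta) * sd * math.sqrt(2 / n_per_group)


# Example 6.6: about 50 patients per group, SD of the change = 5 points
print(round(detectable_difference(n_per_group=50, sd=5), 2))
```

With these assumptions the detectable difference comes out a little under three points, in line with the solution above.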
Strategies for Minimizing Sample Size and Maximizing Power
When the estimated sample size is greater than the number of subjects that can be studied
realistically, the investigator should proceed through several steps. First, the calculations should be
checked, as it is easy to make mistakes. Next, the ingredients should be reviewed. Is the
effect size unreasonably small or the variability unreasonably large? Could α or β, or both, be
increased without harm? Would a one-sided alternative hypothesis be adequate? Is the confidence
level too high or the interval unnecessarily narrow?
These technical adjustments can be useful, but it is important to realize that statistical tests
ultimately depend on the information contained in the data. Many changes in the ingredients, such as
reducing power from 90% to 80%, do not improve the quantity or quality of the data that will be
collected. There are, however, several strategies for reducing the required sample size or for
increasing power for a given sample size that actually increase the information content of the
collected data. Many of these strategies involve modifications of the research hypothesis; the
investigator should carefully consider whether the new hypothesis still answers the research question
that she wishes to study.
Use Continuous Variables
When continuous variables are an option, they usually permit smaller sample sizes than dichotomous
variables. Blood pressure, for example, can be expressed either as
millimeters of mercury (continuous) or as the presence or absence of hypertension (dichotomous).
The former permits a smaller sample size for a given power or a greater power for a given sample
size.
In Example 6.7, the continuous outcome addresses the effect of nutrition supplements on muscle
strength among the elderly. The dichotomous outcome is concerned with its effects on the proportion
of subjects who have at least a minimal amount of strength, which may be a more valid surrogate for
potential fall-related morbidity.
Example 6.7 Use of Continuous versus Dichotomous Variables
Problem: Consider a placebo-controlled trial to determine the effect of nutrition supplements on
strength in elderly nursing home residents. Previous studies have established that quadriceps
strength (as peak torque in newton-meters) is approximately normally distributed, with a mean of 33
Nm and a standard deviation of 10 Nm, and that about 10% of the elderly have very weak
muscles (strength <20 Nm). Nutrition supplements for 6 months are anticipated to increase
strength by 5 Nm as compared with the usual diet. This change in mean strength can be estimated,
based on the distribution of quadriceps strength in the elderly, to correspond to a reduction in the
proportion of the elderly who are very weak to about 5%.
One design might treat strength as a dichotomous variable: very weak versus not very weak. Another
might use all the information contained in the measurement and treat strength as a continuous
variable. How many subjects would each design require at α (two-sided) = 0.05 and β = 0.20? How
does the change in design affect the research question?
Solution: The ingredients for the sample size calculation using a dichotomous outcome variable
(very weak versus not very weak) are as follows:
1. Null Hypothesis: The proportion of elderly nursing home residents who are very weak (peak
quadriceps torque <20 Nm) after receiving 6 months of nutrition supplements is the same as
the proportion who are very weak in those on a usual diet.
Alternative Hypothesis: The proportion of elderly nursing home residents who are very weak
(peak quadriceps torque <20 Nm) after receiving 6 months of nutrition supplements differs
from the proportion in those on a usual diet.
2. P1 (prevalence of being very weak on usual diet) = 0.10; P2 (in supplement group) = 0.05. The
smaller of these values is 0.05, and the difference between them (P1 - P2) is 0.05.
3. α (two-sided) = 0.05; β = 0.20.
Using Table 6B.1, reading across from 0.05 in the leftmost column and down from an expected
difference of 0.05, the middle number (for α [two-sided] = 0.05 and β = 0.20) shows that this design would
require 473 subjects per group.
The ingredients for the sample size calculation using a continuous outcome variable (quadriceps
strength as peak torque) are as follows:
1. Null Hypothesis: Mean quadriceps strength (as peak torque in Nm) in elderly nursing home
residents after receiving 6 months of nutrition supplements is the same as mean quadriceps
strength in those on a usual diet.
Alternative Hypothesis: Mean quadriceps strength (as peak torque in Nm) in elderly nursing
home residents after receiving 6 months of nutrition supplements differs from mean quadriceps
strength in those on a usual diet.
2. Effect size = 5 Nm.
3. Standard deviation of quadriceps strength = 10 Nm.
4. Standardized effect size = effect size ÷ standard deviation = 5 Nm ÷ 10 Nm = 0.5.
5. α (two-sided) = 0.05; β = 0.20.
Using Table 6A, reading across from a standardized effect size of 0.50, with α (two-sided) = 0.05
and β = 0.20, this design would require about 64 subjects in each group. (In this example, the
shortcut sample size estimate from page 68 of 16 ÷ (standardized effect size)², or 16 ÷ 0.5², gives
the same estimate of 64 subjects per group.) The bottom line is that the use of an outcome variable
that was continuous rather than dichotomous meant that a substantially smaller sample size was needed
to study this research question.
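The contrast between the two designs in Example 6.7 can be illustrated with a rough calculation. The sketch below assumes standard z-based formulas for comparing two means and two proportions with equal group sizes; the function names are hypothetical. Because Table 6A uses the t statistic, and because the gap for proportions is consistent with a continuity correction that this sketch omits (an assumption, not something the text states), the printed tables give somewhat larger numbers (64 and 473).

```python
import math
from statistics import NormalDist


def _z_values(alpha, beta):
    """Standard normal deviates for two-sided alpha and for beta."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2), nd.inv_cdf(1 - beta)


def n_per_group_means(effect, sd, alpha=0.05, beta=0.20):
    """Approximate subjects per group to compare two means."""
    z_a, z_b = _z_values(alpha, beta)
    return math.ceil(2 * (z_a + z_b) ** 2 * (sd / effect) ** 2)


def n_per_group_proportions(p1, p2, alpha=0.05, beta=0.20):
    """Approximate subjects per group to compare two proportions
    (no continuity correction)."""
    z_a, z_b = _z_values(alpha, beta)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Continuous design: effect 5 Nm, SD 10 Nm
print(n_per_group_means(effect=5, sd=10))
# Dichotomous design: 10% versus 5% very weak
print(n_per_group_proportions(p1=0.10, p2=0.05))
```

Even with these approximations, the dichotomous design needs several times as many subjects per group as the continuous one.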
Use Paired Measurements
In some experiments or cohort studies with continuous outcome variables, paired
measurements (one at baseline, another at the conclusion of the study) can be made in each
subject. The outcome variable is the change between these two measurements. In this situation, a t
test on the paired measurements can be used to compare the mean value of this change in the two
groups. This technique often permits a smaller sample size because, by comparing each subject with
herself, it removes the baseline between-subject part of the variability of the outcome variable. For
example, the change in weight on a diet has less variability than the final weight, because final
weight is highly correlated with initial weight. Sample size for this type of t test is estimated in the
usual way, except that the standardized effect size (E/S in Table 6A) is the anticipated difference in
the change in the variable divided by the standard deviation of that change.
Example 6.8 Use of the t Test with Paired Measurements
Problem: Recall Example 6.1, in which the investigator studying the treatment of asthma is
interested in determining whether salbutamol can improve FEV1 by 200 mL compared with
ipratropium bromide. Sample size calculations indicated that 394 subjects per group are needed,
more than are likely to be available. Fortunately, a colleague points out that asthmatic patients have
great differences in their FEV1 values before treatment. These between-subject differences account
for much of the variability in FEV1 after treatment, therefore obscuring the effect of treatment. She
suggests using a paired t test to compare the changes in FEV1 in the two groups. A pilot study finds
that the standard deviation of the change in FEV1 is only 250 mL. How many subjects would be
required per group, at α (two-sided) = 0.05 and β = 0.20?
Solution: The ingredients for the sample size calculation are as follows:
1. Null Hypothesis: Change in mean FEV1 after 2 weeks of treatment is the same in asthmatic
patients treated with salbutamol as it is in those treated with ipratropium bromide.
Alternative Hypothesis: Change in mean FEV1 after 2 weeks of treatment is different in
asthmatic patients treated with salbutamol from what it is in those treated with ipratropium
bromide.
2. Effect size = 200 mL.
3. Standard deviation of the outcome variable = 250 mL.
4. Standardized effect size = effect size ÷ standard deviation = 200 mL ÷ 250 mL = 0.8.
5. α (two-sided) = 0.05; β = 1 - 0.80 = 0.20.
Using Table 6A, this design would require about 26 participants per group, a much more reasonable
sample size than the 394 per group in Example 6.1. In this example, the shortcut sample size
estimate of 16 ÷ (standardized effect size)², or 16 ÷ 0.8², gives a similar estimate of 25 subjects per
group.
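The shortcut used in Examples 6.7 and 6.8 is simple enough to express in a few lines. This is only a sketch of the rule of thumb quoted in the text (16 ÷ standardized effect size squared, which applies to a two-sided α of 0.05 and β of 0.20); the function name shortcut_n_per_group is hypothetical.

```python
import math


def shortcut_n_per_group(effect, sd):
    """Rule-of-thumb subjects per group: 16 / (E/S)^2,
    intended for two-sided alpha = 0.05 and beta = 0.20."""
    # 16 / (effect / sd)^2 written as 16 * (sd / effect)^2
    return math.ceil(16 * (sd / effect) ** 2)


# Example 6.8: effect 200 mL, SD of the change 250 mL
print(shortcut_n_per_group(effect=200, sd=250))
# Example 6.7: effect 5 Nm, SD 10 Nm
print(shortcut_n_per_group(effect=5, sd=10))
```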
A Brief Technical Note
This chapter always refers to two-sample t tests, which are used when comparing the mean values
of an outcome variable in two groups of subjects. A two-sample t test can be unpaired, if the
outcome variable itself is being compared between two groups (see Example 6.1), or paired
if the outcome is the change in a pair of measurements, say before and after an intervention (see
Example 6.8).
A third type of t test, the one-sample paired t test, compares the mean change in a pair of
measurements within a single group to zero change. This type of analysis is reasonably common in
time series designs (Chapter 10), a before-after approach to examining treatments that are
difficult to randomize (for example, the effect of elective hysterectomy, a decision few women are
willing to leave to a coin toss, on quality of life). However, it is a fairly weak design because the
absence of a comparison group makes it difficult to know what would have happened had the subjects
been left untreated (Chapter 10). When planning a study that will be analyzed with a one-sample
paired t test, the sample size in Appendix 6A represents the total number of subjects (because there
is only one group). Appendix 6F presents additional information on the use and misuse of one- and
two-sample t tests.
Use More Precise Variables
Because they reduce variability, more precise variables permit a smaller sample size in both analytic
and descriptive studies. Even a modest change in precision can have a substantial effect on sample
size. For example, when using the t test to estimate sample size, a 20% decrease in the standard
deviation of the outcome variable results in a 36% decrease in the sample size. Techniques for
increasing the precision of a variable, such as making measurements in duplicate, are presented in
Chapter 4.
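The 36% figure follows from a short calculation, sketched here under the assumption (consistent with the formula in Appendix 6A) that the required sample size n is proportional to the square of the standard deviation S for a fixed effect size E:

```latex
\[
n \propto \left(\frac{S}{E}\right)^{2}
\quad\Longrightarrow\quad
\frac{n_{\text{new}}}{n_{\text{old}}}
  = \frac{(0.8\,S)^{2}}{S^{2}} = 0.64 ,
\]
```

that is, 64% of the original sample size, a 36% decrease.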
Use Unequal Group Sizes
Because an equal number of subjects in each of two groups usually gives the greatest power for a
given total number of subjects, Tables 6A, 6B.1, and 6B.2 in the
appendixes assume equal sample sizes in the two groups. Sometimes, however, the distribution of
subjects is not equal in the two groups, or it is easier or less expensive to recruit study subjects for
one group than the other. It may turn out, for example, that an investigator wants to estimate
sample size based on the 30% of the subjects in a cohort who smoke cigarettes (compared with 70%
who do not smoke). Or, in a case-control study, the number of persons with the disease may be
small, but it may be possible to sample a much larger number of controls. In general, the gain in
power when the size of one group is increased to twice the size of the other is considerable; tripling
and quadrupling one of the groups provide progressively smaller gains. Sample sizes for unequal
groups can be computed from the formulas found in the text to Appendixes 6A and 6B or from the
Web.
Here is a useful approximation for estimating sample size for case-control studies of dichotomous
risk factors and outcomes using c controls per case. If n represents the number of cases that would
have been required for one control per case (at a given α, β, and effect size), then the approximate
number of cases (n′) with cn′ controls that will be required is
\[ n' = \frac{c + 1}{2c} \times n \]
For example, with c = 2 controls per case, then [(2 + 1) ÷ (2 × 2)] × n = 3/4 × n, and only
75% as many cases are needed. As c gets larger, n′ approaches 50% of n (when c = 10, for
example, n′ = 11/20 × n).
Example 6.9 Use of Multiple Controls per Case in a Case-Control Study
Problem: An investigator is studying whether exposure to household insecticide is a risk factor for
aplastic anemia. The original sample size calculation indicated that 25 cases would be required, using
one control per case. Suppose that the investigator has access to only 18 cases. How should the
investigator proceed?
Solution: The investigator should consider using multiple controls per case (after all, she can find
many patients who do not have aplastic anemia). By using three controls per case, for example, the
approximate number of cases that will be required is [(3 + 1) ÷ (2 × 3)] × 25 = 17.
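The approximation for multiple controls per case is easy to tabulate. The following is a minimal sketch of that approximation; the function name cases_needed is hypothetical.

```python
import math


def cases_needed(n_one_to_one, controls_per_case):
    """Approximate number of cases required with c controls per case,
    given the number of cases needed with one control per case."""
    c = controls_per_case
    return math.ceil((c + 1) / (2 * c) * n_one_to_one)


# Example 6.9: 25 cases needed at 1:1; try one to four controls per case
for c in (1, 2, 3, 4):
    cases = cases_needed(25, c)
    print(c, cases, c * cases)  # controls per case, cases, total controls
```

With three controls per case the function returns 17 cases, matching the example, and the diminishing returns beyond that are visible in the printed rows.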
Use a More Common Outcome
When the outcome is dichotomous, using a more frequent outcome, up to a frequency of 0.5, is
usually one of the best ways to increase power: if an outcome occurs more often, there is more of a
chance to detect its predictors. Power actually depends more on the number of subjects with a
specified outcome than it does on the total number of subjects in the study. Studies with rare
outcomes, like the occurrence of breast cancer in healthy women, require very large sample sizes to
have adequate power.
One of the best ways to make an outcome more common is to enroll subjects at greater risk of
developing that outcome (such as women with a family history of breast cancer). Others are to
extend the follow-up period, so that there is more time to accumulate outcomes, or to loosen the
definition of what constitutes an outcome (e.g., by including ductal carcinoma in situ). All these
techniques, however, may change the research question, so they should be used with caution.
Example 6.10 Use of a More Common Outcome
Problem: Suppose an investigator is comparing the efficacy of an antiseptic gargle versus a placebo
gargle in preventing upper respiratory infections. Her initial calculations indicated that her
anticipated sample of 200 volunteer college students was inadequate, in part because she expected
that only about 20% of her subjects would have an upper respiratory infection during the 3-month
follow-up period. Suggest a few changes in the study plan.
Solution: Here are two possible solutions: (a) study a sample of pediatric interns and residents, who
are likely to experience a much greater incidence of upper respiratory infections than college
students; or (b) follow the sample for a longer period of time, say 6 or 12 months. Both of these
solutions involve modification of the research hypothesis, but neither change seems sufficiently large
to affect the overall research question about the efficacy of antiseptic gargle.
How to Estimate Sample Size When There is Insufficient
Information
Often the investigator finds that she is missing one or more of the ingredients for the sample size
calculation and becomes frustrated in her attempts to plan the study. This is an especially frequent
problem when the investigator is using an instrument of her own design (such as a new questionnaire on
quality of life in patients with urinary incontinence). How should she go about deciding what effect
size or standard deviation to use?
The first strategy is an extensive search for previous and related findings on the topic and on
similar research questions. Roughly comparable situations and mediocre or dated findings may be
good enough. (For example, are there data on quality of life among patients with other urologic
problems, or with related conditions like having a colostomy?) If the literature review is
unproductive, she should contact other investigators about their judgment on what to expect, and
whether they are aware of any unpublished results that may be relevant. If there is still no
information available, she may consider doing a small pilot study or obtaining a data set for a
secondary analysis to obtain the missing ingredients before embarking on the main study. (Indeed, a
pilot study is highly recommended for almost all studies that involve new instruments, measurement
methods, or recruitment strategies. Pilot studies save time in the end by enabling investigators to do a much
better job planning the main study.) Pilot studies are useful for estimating the standard deviation of
a measurement, or the proportion of subjects with a particular characteristic. Another trick is to
recognize that for continuous variables that have a roughly bell-shaped distribution, the standard
deviation can be estimated as one-quarter of the difference between the high and low ends of the
range of values that occur commonly, ignoring extreme values. For example, if most subjects are
likely to have a serum sodium level between 135 and 143 mEq/L, the standard deviation of serum
sodium is about 2 mEq/L (1/4 × 8 mEq/L).
Alternatively, the investigator can determine the detectable effect size based on a value that she
considers to be clinically meaningful. For example, suppose that an investigator is studying a new
invasive treatment for severe refractory gastroparesis, a
condition in which at most 5% of patients improve spontaneously. If the treatment is shown to be
effective, she thinks that gastroenterologists would be willing to treat up to five patients to produce
a sustained benefit in one of those patients (because the treatment has substantial side effects and
is expensive, she doesn't think that the number would be more than 5). A number needed to treat
(NNT) of 5 corresponds to a risk difference of 20% (NNT = 1/risk difference), so the investigator
should estimate the sample size based on a comparison of P1 = 5% versus P2 = 25% (i.e., 59
subjects per group at a power of 0.80 and a two-sided α of 0.05).
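The step from a number needed to treat to a sample size can be sketched as follows. This is an illustration under stated assumptions: a z-based two-proportion formula with an optional continuity correction. The correction is included only because it happens to reproduce the 59 per group quoted above; the text does not say how that figure was computed. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def risk_difference_from_nnt(nnt):
    """NNT = 1 / risk difference, so risk difference = 1 / NNT."""
    return 1 / nnt


def n_per_group_two_proportions(p1, p2, alpha=0.05, beta=0.20,
                                continuity=True):
    """Approximate subjects per group to compare two proportions
    (two-sided alpha; continuity correction is an assumption here)."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    z_b = nd.inv_cdf(1 - beta)
    p_bar = (p1 + p2) / 2
    diff = abs(p1 - p2)
    n = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / diff ** 2
    if continuity:
        n = n / 4 * (1 + math.sqrt(1 + 4 / (n * diff))) ** 2
    return math.ceil(n)


p1 = 0.05                                # spontaneous improvement
p2 = p1 + risk_difference_from_nnt(5)    # 0.05 + 0.20
print(round(p2, 2), n_per_group_two_proportions(p1, p2))
```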
Another strategy, when the mean and standard deviation of a continuous or categorical variable are
in doubt, is to dichotomize that variable. Categories can be lumped into two groups, and continuous
variables can be split at their mean or median. For example, dividing quality of life into "better
than the median" or "the median or less" avoids having to estimate its standard deviation in
the sample, although one still has to estimate what proportions of subjects would be above the
median in the two groups being studied. The chi-squared statistic can then be used to make a
reasonable, albeit somewhat high, estimate of the sample size.
If all this fails, the investigator should just make an educated guess about the likely values of the
missing ingredients. The process of thinking through the problem and imagining the findings will
often result in a reasonable estimate, and that is what sample size planning is about. This is usually
a better option than just deciding to design the study to have 80% power at a two-sided α of 0.05
to detect a standardized effect size of, say, 0.5 between the two groups (n = 64 per group, by the
way). Very few grant reviewers will accept that sort of arbitrary decision.
Common Errors to Avoid
Many inexperienced investigators (and some experienced ones!) make mistakes when planning
sample size. A few of the more common ones follow:
1. The most common error is estimating the sample size late during the design of the study. Do it
early in the process, when fundamental changes can still be made.
2. Dichotomous variables can appear to be continuous when they are expressed as a percentage
or rate. For example, vital status (alive or dead) might be misinterpreted as continuous when
expressed as percent alive. Similarly, in survival analysis a dichotomous outcome can appear to
be continuous (e.g., median survival in months). For all of these, the outcome itself is actually
dichotomous and the appropriate simple approach in planning sample size would be the
chi-squared test.
3. The sample size estimates the number of subjects with outcome data, not the number who need
to be enrolled. The investigator should always plan for dropouts and subjects with missing
data.
4. The tables at the end of the chapter assume that the two groups being studied have equal
sample sizes. Often that is not the case; for example, a cohort study of whether use of vitamin
supplements reduces the risk of sunburn would probably not enroll equal numbers of subjects
who used, or did not use, vitamins. If the sample sizes are not equal, then the formulas that
follow the tables or the Web should be used.
5. When using the t test to estimate the sample size, what matters is the standard deviation of the
outcome variable. Therefore, if the outcome is change in a continuous variable, the investigator
should use the standard deviation of that change rather than the standard deviation of the
variable itself.
6. Be aware of clustered data. If there appear to be two levels of sample size (e.g., one
for physicians and another for patients), clustering is a likely problem and the tables in the
appendices do not apply.
Summary
1. When estimating sample size for an analytic study, the following steps need to be taken: (a)
state the null and alternative hypotheses, specifying the number of sides; (b) select a
statistical test that could be used to analyze the data, based on the types of predictor and
outcome variables; (c) estimate the effect size (and its variability, if necessary); and (d)
specify appropriate values for α and β, based on the importance of avoiding Type I and Type
II errors.
2. Other considerations in calculating sample size for analytic studies include adjusting for
potential dropouts, and strategies for dealing with categorical variables, survival analysis,
clustered samples, multivariate adjustment, and equivalence studies.
3. The steps for estimating sample size for descriptive studies, which do not have hypotheses,
are to (a) estimate the proportion of subjects with a dichotomous outcome or the standard
deviation of a continuous outcome; (b) specify the desired precision (width of the confidence
interval); and (c) specify the confidence level (e.g., 95%).
4. When sample size is predetermined, the investigator can work backward to estimate the
detectable effect size or, less commonly, the power.
5. Strategies to minimize the required sample size include using continuous variables, more
precise measurements, paired measurements, unequal group sizes, and more common
outcomes.
6. When there seems to be not enough information to estimate the sample size, the investigator
should review the literature in related areas, do a small pilot study, or choose an effect size
that is clinically meaningful; standard deviation can be estimated as 1/4 of the range of
commonly encountered values. If none of these is feasible, an educated guess can give a useful
ballpark estimate.
Appendices
Appendix 6A
Sample Size Required per Group When Using the t Test to
Compare Means of Continuous Variables
Table 6A Sample Size per Group for Comparing Two Means

One-sided α =         0.005                  0.025                  0.05
Two-sided α =         0.01                   0.05                   0.10
E/S*    β =    0.05   0.10   0.20    0.05   0.10   0.20    0.05   0.10   0.20
0.10          3,565  2,978  2,338   2,600  2,103  1,571   2,166  1,714  1,238
0.15          1,586  1,325  1,040   1,157    935    699     963    762    551
0.20            893    746    586     651    527    394     542    429    310
0.25            572    478    376     417    338    253     347    275    199
0.30            398    333    262     290    235    176     242    191    139
0.40            225    188    148     164    133    100     136    108     78
0.50            145    121     96     105     86     64      88     70     51
0.60            101     85     67      74     60     45      61     49     36
0.70             75     63     50      55     44     34      45     36     26
0.80             58     49     39      42     34     26      35     28     21
0.90             46     39     21      34     27     21      28     22     16
1.00             38     32     26      27     23     17      23     18     14

*E/S is the standardized effect size, computed as E (expected effect size) divided by S (SD
of the outcome variable). To estimate the sample size, read across from the standardized
effect size, and down from the specified values of α and β for the required sample size in
each group.

Calculating Variability
Variability is usually reported as either the standard deviation or the standard error of the mean
(SEM). For the purposes of sample size calculation, the standard deviation of the variable is most
useful. Fortunately, it is easy to convert from one measure to another: the standard deviation is
simply the standard error times the square root of N, where N is the number of subjects that makes
up the mean. Suppose a study reported that the weight loss in 25 persons on a low-fiber diet was 10
± 2 kg (mean ± SEM). The standard deviation would be 2 × √25 = 10 kg.
General Formula for Other Values
The general formula for other values of E, S, α, and β, or for unequal group sizes, is as follows.
Let:
z_α = the standard normal deviate for α (If the alternative hypothesis is two-sided, z_α = 2.58 when
α = 0.01, z_α = 1.96 when α = 0.05, and z_α = 1.645 when α = 0.10. If the alternative hypothesis is
one-sided, z_α = 1.645 when α = 0.05.)
z_β = the standard normal deviate for β (z_β = 0.84 when β = 0.20, and z_β = 1.282 when β = 0.10)
q1 = proportion of subjects in group 1
q2 = proportion of subjects in group 2
N = total number of subjects required
Then:
\[ N = \frac{(1/q_1 + 1/q_2)\, S^2 \,(z_\alpha + z_\beta)^2}{E^2} \]
Readers who would like to skip the work involved in hand calculations with this formula can get an
instant answer from a calculator on our website (http://www.epibiostat.ucsf.edu/dcr/). (Because this
formula is based on approximating the t statistic with a z statistic, it will slightly underestimate the
sample size when N is less than about 30. Table 6A uses the t statistic to estimate sample size.)
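For readers who prefer to script the general formula rather than use a calculator, here is a minimal sketch. It implements the z-based formula for N as given in this appendix, plus the SEM-to-SD conversion described under Calculating Variability; the function and parameter names are hypothetical, and as noted above the z approximation slightly underestimates small sample sizes relative to Table 6A.

```python
import math
from statistics import NormalDist


def sd_from_sem(sem, n):
    """Standard deviation = standard error of the mean x sqrt(N)."""
    return sem * math.sqrt(n)


def total_n_two_means(effect, sd, q1=0.5, alpha=0.05, beta=0.20,
                      two_sided=True):
    """Total subjects N = (1/q1 + 1/q2) * S^2 * (z_alpha + z_beta)^2 / E^2,
    where q1 and q2 = 1 - q1 are the proportions in the two groups."""
    q2 = 1 - q1
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_sided else nd.inv_cdf(1 - alpha)
    z_beta = nd.inv_cdf(1 - beta)
    return math.ceil((1 / q1 + 1 / q2) * sd**2 * (z_alpha + z_beta)**2
                     / effect**2)


print(sd_from_sem(sem=2, n=25))                 # weight-loss example
print(total_n_two_means(effect=5, sd=10))       # equal groups, E/S = 0.5
print(total_n_two_means(effect=5, sd=10, q1=0.3))  # 30% / 70% split
```

With equal groups and a standardized effect size of 0.5, the total is 126 (63 per group), close to the 64 per group in Table 6A; an unequal 30%/70% split requires a larger total.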
Appendix 6B
Sample Size Required per Group When Using the Chi-Squared
Statistic or Z Test to Compare Proportions of Dichotomous
Variables
P.86
Table 6B.1 Sample Size per Gr oup for Comparing Two
Proportions
Smaller
of P
1
and
P
2
*
Upper number: = 0.05 (one-sided) or = 0.10 (two-sided); = 0.20
Middle number: = 0.025 (one-sided) or = 0.05 (two-sided); =
0.20
Lower number: = 0.025 (one-sided) or = 0.05 (two-sided); =
0.10
Difference Between P
1
and P
2
0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50
0.05 381 129 72 47 35 27 22 18 15 13
473 159 88 59 43 33 26 22 18 16
620 207 113 75 54 41 33 27 23 19
0.10 578 175 91 58 41 31 24 20 16 14
724 219 112 72 51 37 29 24 20 17
958 286 146 92 65 48 37 30 25 21
0.15 751 217 108 67 46 34 26 21 17 15
944 270 133 82 57 41 32 26 21 18
1,252 354 174 106 73 53 42 33 26 22
0.20 900 251 121 74 50 36 28 22 18 15
1,133 313 151 91 62 44 34 27 22 18
1,504 412 197 118 80 57 44 34 27 23
0.25 1,024 278 132 79 53 38 29 23 18 15
...
1,289 348 165 98 66 47 35 28 22 18
1,714 459 216 127 85 60 46 35 28 23
0.30 1,123 300 141 83 55 39 29 23 18 15
1,415 376 175 103 68 48 36 28 22 18
1,883 496 230 134 88 62 47 36 28 23
0.35 1,197 315 146 85 56 39 29 23 18 15
1,509 395 182 106 69 48 36 28 22 18
2,009 522 239 138 90 62 47 35 27 22
0.40 1,246 325 149 86 56 39 29 22 17 14
1,572 407 186 107 69 48 35 27 21 17
2,093 538 244 139 90 62 46 34 26 21
0.45 1,271 328 149 85 55 38 28 21 16 13
1,603 411 186 106 68 47 34 26 20 16
2,135 543 244 138 88 60 44 33 25 19
0.50 1,271 325 146 83 53 36 26 20 15
1,603 407 182 103 66 44 32 24 18
2,135 538 239 134 85 57 42 30 23
0.55 1,246 315 141 79 50 34 24 18
1,572 395 175 98 62 41 29 22
2,093 522 230 127 80 53 37 27
0.60 1,197 300 132 74 46 31 22
1,509 376 165 91 57 37 26
2,009 496 216 118 73 48 33
0.65 1,123 278 121 67 41 27
...
The general formula for calculating the total sample size (N) required for a study using the z
statistic, where P1 and P2 are defined above, is as follows (see Appendix 6A for definitions of zα
and zβ). Let
q1 = proportion of subjects in group 1
1,415 348 151 82 51 33
1,883 459 197 106 65 41
0.70 1,024 251 108 58 35
1,289 313 133 72 43
1,714 412 174 92 54
0.75 900 217 91 47
1,133 270 112 59
1,504 354 146 75
0.80 751 175 72
944 219 88
1,252 286 113
0.85 578 129
724 159
958 207
0.90 381
473
620
The one-sided estimates use the z statistic.
*P1 represents the proportion of subjects expected to have the outcome in one group; P2 in
the other group. (In a case-control study, P1 represents the proportion of cases with the
predictor variable; P2 the proportion of controls with the predictor variable.) To estimate the
sample size, read across from the smaller of P1 and P2, and down the expected difference
between P1 and P2. The three numbers represent the sample size required in each group for
the specified values of α and β.
Additional detail for P1 and P2 between 0.01 and 0.10 is given in Table 6B.2.
...
q2 = proportion of subjects in group 2
N = total number of subjects
P = q1P1 + q2P2
Then
Readers who would like to skip the work involved in hand calculations with this formula can get an
instant answer from a calculator on our website (http://www.epibiostat.ucsf.edu/dcr/). (This formula
does not include the Fleiss-Tytun-Ury continuity correction and therefore underestimates the required
sample size by up to about 10%. Tables 6B.1 and 6B.2 do include this continuity correction.)
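The displayed formula that should follow "Then" is not legible in this copy. The sketch below uses the standard uncorrected z-test formula for comparing two proportions, written in terms of the q1, q2, P1, P2, and P defined above; treat its exact form as an assumption rather than a transcription of the printed equation. The function and parameter names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def total_n_two_proportions(p1, p2, q1=0.5, q2=0.5, alpha=0.05, beta=0.20, two_sided=True):
    """Total N for comparing two proportions with the z test (no continuity correction).

    Assumed standard form:
    N = [z_a*sqrt(P(1-P)(1/q1+1/q2)) + z_b*sqrt(P1(1-P1)/q1 + P2(1-P2)/q2)]^2 / (P1-P2)^2
    """
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(1 - beta)
    p = q1 * p1 + q2 * p2  # weighted average proportion, as defined in the text
    term_null = z_alpha * sqrt(p * (1 - p) * (1 / q1 + 1 / q2))
    term_alt = z_beta * sqrt(p1 * (1 - p1) / q1 + p2 * (1 - p2) / q2)
    return (term_null + term_alt) ** 2 / (p1 - p2) ** 2


# P1 = 0.05, P2 = 0.10, equal groups, two-sided alpha = 0.05, beta = 0.20
n_total = total_n_two_proportions(0.05, 0.10)
n_per_group = n_total / 2
print(round(n_per_group))
```

For this example the sketch gives about 434 per group, against 473 in Table 6B.1, which includes the continuity correction; the gap is in line with the underestimate of up to about 10% described in the text.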
Table 6B.2 Sample Size per Group for Comparing Two
Proportions, the Smaller of Which Is Between 0.01 and 0.10
Upper number: α = 0.05 (one-sided) or α = 0.10 (two-sided); β = 0.20
Middle number: α = 0.025 (one-sided) or α = 0.05 (two-sided); β = 0.20
Lower number: α = 0.025 (one-sided) or α = 0.05 (two-sided); β = 0.10
Rows: smaller of P1 and P2. Columns: expected difference between P1 and P2.
0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10
0.01 2,019 700 396 271 204 162 134 114 98 87
2,512 864 487 332 249 197 163 138 120 106
3,300 1,125 631 428 320 254 209 178 154 135
0.02 3,205 994 526 343 249 193 157 131 113 97
4,018 1,237 651 423 306 238 192 161 137 120
5,320 1,625 852 550 397 307 248 207 177 154
0.03 4,367 1,283 653 414 294 224 179 148 126 109
5,493 1,602 813 512 363 276 220 182 154 133
7,296 2,114 1,067 671 474 359 286 236 199 172
0.04 5,505 1,564 777 482 337 254 201 165 139 119
6,935 1,959 969 600 419 314 248 203 170 146
9,230 2,593 1,277 788 548 410 323 264 221 189
0.05 6,616 1,838 898 549 380 283 222 181 151 129
8,347 2,308 1,123 686 473 351 275 223 186 159
...
Appendix 6C
Total Sample Size Required When Using the Correlation
Coefficient (r)
11,123 3,061 1,482 902 620 460 360 291 242 206
0.06 7,703 2,107 1,016 615 422 312 243 197 163 139
9,726 2,650 1,272 769 526 388 301 243 202 171
12,973 3,518 1,684 1,014 691 508 395 318 263 223
0.07 8,765 2,369 1,131 680 463 340 263 212 175 148
11,076 2,983 1,419 850 577 423 327 263 217 183
14,780 3,965 1,880 1,123 760 555 429 343 283 239
0.08 9,803 2,627 1,244 743 502 367 282 227 187 158
12,393 3,308 1,562 930 627 457 352 282 232 195
16,546 4,401 2,072 1,229 827 602 463 369 303 255
0.09 10,816 2,877 1,354 804 541 393 302 241 198 167
13,679 3,626 1,702 1,007 676 491 377 300 246 207
18,270 4,827 2,259 1,333 893 647 495 393 322 270
0.10 11,804 3,121 1,461 863 578 419 320 255 209 175
14,933 3,936 1,838 1,083 724 523 401 318 260 218
19,952 5,242 2,441 1,434 957 690 527 417 341 285
The one-sided estimates use the z statistic.
Table 6C Sample Size for Determining Whether a Correlation
Coefficient Differs from Zero
One-sided α = 0.005 0.025 0.05
...
The general formula for other values of r, α, and β is as follows (see Appendix 6A for definitions of
zα and zβ). Let
r = expected correlation coefficient
C = 0.5 × ln[(1 + r)/(1 − r)]
N = total number of subjects required
Then
$$N = \left[(z_\alpha + z_\beta) \div C\right]^2 + 3.$$
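The correlation formula above lends itself to a short Python sketch. The function name and default arguments are assumptions made for illustration; the z values come from the standard library rather than from Appendix 6A.

```python
from math import log
from statistics import NormalDist


def total_n_correlation(r, alpha=0.05, beta=0.20, two_sided=True):
    """Total N for testing whether a correlation coefficient differs from zero."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(1 - beta)
    c = 0.5 * log((1 + r) / (1 - r))  # Fisher transformation of r
    return ((z_alpha + z_beta) / c) ** 2 + 3


# Expected r = 0.30, two-sided alpha = 0.05, beta = 0.20
n_r = total_n_correlation(0.30)
print(round(n_r, 1))
```

For r = 0.30 with two-sided α = 0.05 and β = 0.20, the result is just under 85, which rounds up to the 85 shown in the r = 0.30 row of Table 6C.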
Estimating Sample Size for Difference between Two Correlations
If testing whether a correlation, r1, is different from r2 (i.e., the null hypothesis is that r1 = r2; the
alternative hypothesis is that r1 ≠ r2), let
Two-sided α = 0.01 0.05 0.10
β = 0.05 0.10 0.20 0.05 0.10 0.20 0.05 0.10 0.20
r*
0.05 7,118 5,947 4,663 5,193 4,200 3,134 4,325 3,424 2,469
0.10 1,773 1,481 1,162 1,294 1,047 782 1,078 854 616
0.15 783 655 514 572 463 346 477 378 273
0.20 436 365 287 319 259 194 266 211 153
0.25 276 231 182 202 164 123 169 134 98
0.30 189 158 125 139 113 85 116 92 67
0.35 136 114 90 100 82 62 84 67 49
0.40 102 86 68 75 62 47 63 51 37
0.45 79 66 53 58 48 36 49 39 29
0.50 62 52 42 46 38 29 39 31 23
0.60 40 34 27 30 25 19 26 21 16
0.70 27 23 19 20 17 13 17 14 11
0.80 18 15 13 14 12 9 12 10 8
*To estimate the total sample size, read across from r (the expected correlation coefficient)
and down from the specified values of α and β.
...
C1 = 0.5 × ln[(1 + r1)/(1 − r1)]
C2 = 0.5 × ln[(1 + r2)/(1 − r2)]
Then
$$N = \left[(z_\alpha + z_\beta) \div (C_1 - C_2)\right]^2 + 3.$$
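The two-correlation formula can be sketched the same way. The names below are illustrative assumptions, and the example values of r1 and r2 are hypothetical.

```python
from math import log
from statistics import NormalDist


def fisher_c(r):
    """C = 0.5 * ln[(1 + r)/(1 - r)], as defined in the text."""
    return 0.5 * log((1 + r) / (1 - r))


def total_n_two_correlations(r1, r2, alpha=0.05, beta=0.20, two_sided=True):
    """N from the formula above for testing r1 against r2."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return ((z_alpha + z_beta) / (fisher_c(r1) - fisher_c(r2))) ** 2 + 3


# Hypothetical example: r1 = 0.50 versus r2 = 0.30
n_diff = total_n_two_correlations(0.50, 0.30)
print(round(n_diff, 1))
```

Because the difference C1 − C2 is squared, the order of r1 and r2 does not matter, and the required N grows quickly as the two correlations approach each other.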
Appendix 6D
Sample Size for a Descriptive Study of a Continuous Variable
For other values of W, S, and a confidence level of (1 − α), the total number of subjects required (N)
is
Table 6D Sample Size for Common Values of W/S*
W/S, by Confidence Level
90% 95% 99%
0.10 1,083 1,537 2,665
0.15 482 683 1,180
0.20 271 385 664
0.25 174 246 425
0.30 121 171 295
0.35 89 126 217
0.40 68 97 166
0.50 44 62 107
0.60 31 43 74
0.70 23 32 55
0.80 17 25 42
0.90 14 19 33
1.00 11 16 27
*W/S is the standardized width of the confidence interval, computed as W (desired total
width) divided by S (standard deviation of the variable). To estimate the total sample size,
read across from the standardized width and down from the specified confidence level.
...
$$N = 4 z_\alpha^2 S^2 \div W^2$$
(see Appendix 6A for the definition of zα).
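A minimal Python sketch of this descriptive-study formula follows. The function name and the way the confidence level is passed are assumptions introduced here for illustration.

```python
from statistics import NormalDist


def n_descriptive_continuous(W, S, confidence=0.95):
    """Total N so that a confidence interval for a mean has total width W."""
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided deviate
    return 4 * z_alpha**2 * S**2 / W**2


# Standardized width W/S = 0.10 at a 95% confidence level
n_mean = n_descriptive_continuous(W=0.10, S=1.0)
print(round(n_mean, 1))
```

For W/S = 0.10 at 95% confidence the sketch gives about 1,536.6, in agreement with the 1,537 listed in Table 6D.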
Appendix 6E
Sample Size for a Descriptive Study of a Dichotomous Variable
Table 6E Sample Size for Proportions
Expected Proportion
(P)*
Upper number: 90% confidence level
Middle number: 95% confidence level
Lower number: 99% confidence level
Total Width of Confidence Interval (W)
0.10 0.15 0.20 0.25 0.30 0.35 0.40
0.10 98 44
138 61
239 106
0.15 139 62 35 22
196 87 49 31
339 151 85 54
0.20 174 77 44 28 19 14
246 109 61 39 27 20
426 189 107 68 47 35
0.25 204 91 51 33 23 17 13
288 128 72 46 32 24 18
499 222 125 80 55 41 31
0.30 229 102 57 37 25 19 14
323 143 81 52 36 26 20
559 249 140 89 62 46 35
0.40 261 116 65 42 29 21 16
...
The general formula for other values of P, W, and a confidence level of (1 − α), where P and W are
defined above, is as follows. Let
zα = the standard normal deviate for a two-sided α, where (1 − α) is the confidence level (e.g.,
since α = 0.05 for a 95% confidence level, zα = 1.96; similarly, for a 90% confidence level zα =
1.65, and for a 99% confidence level zα = 2.58).
Then the total number of subjects required is:
$$N = 4 z_\alpha^2 P(1 - P) \div W^2$$
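The same calculation for a proportion can be sketched in Python. The function name is an assumption made for illustration, and the z value is computed from the standard library rather than rounded to two decimals, so results can differ by a subject or so from the printed table.

```python
from statistics import NormalDist


def n_descriptive_proportion(P, W, confidence=0.95):
    """Total N so that a confidence interval for a proportion has total width W."""
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided deviate
    return 4 * z_alpha**2 * P * (1 - P) / W**2


# Expected proportion 0.10, total interval width 0.10, 95% confidence
n_prop = n_descriptive_proportion(P=0.10, W=0.10)
print(round(n_prop, 1))
```

For P = 0.10 and W = 0.10 at 95% confidence this gives about 138, the middle number in the first cell of Table 6E. The requirement peaks at P = 0.50, where P(1 − P) is largest.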
Appendix 6F
Use and Misuse of t Tests
Two-sample t tests, the primary focus of this chapter, are used when comparing the mean values of
a variable in two groups of subjects. The two groups can be defined by a predictor variable (active
drug versus placebo in a randomized trial, or presence versus absence of a risk factor in a cohort
study) or they can be defined by an outcome variable, as in a case-control study. A two-sample t
test can be unpaired, if measurements obtained on a single occasion are being compared between
two groups, or paired, if the change in measurements made at two points in time, say before and
after an intervention, is being compared between the groups. A third type of t test, the
one-sample paired t test, compares the mean change in measurements at two points in time within a
single group to zero change.
Table 6F illustrates the misuse of one-sample paired t tests in a study designed for between-group
comparisons: a randomized blinded trial of the effect of a new sleeping pill on quality of life. In
situations like this, some investigators have performed (and published!) analyses with two separate
one-sample t tests, one each in the treatment and placebo groups.
In the table, the P values designated with a dagger (†) are from one-sample paired t tests. The
first P (0.05) shows a significant change in quality of life in the treatment group during the study;
the second P value (0.16) shows no significant change in the control group. However, this analysis
does not permit inferences about differences between the groups, and it would be wrong to conclude
that there was a significant effect of the treatment.
The P values designated with an asterisk (*) represent the appropriate two-sample t test results. The first two
P values (0.87 and 0.64) are two-sample unpaired t tests that show no statistically significant
between-group differences in the initial or final measurements for quality of life. The last P value
(0.17) is a two-sample paired t test; it is closer to 0.05 than the P value for the end of study values
(0.64) because the paired mean differences have smaller standard deviations. However, the improved
quality of life in the treatment group (1.3) was not significantly different from that in the placebo
group (0.9), and the correct conclusion is that the study did not find the treatment to be effective.
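The between-group comparisons can be approximated from the summary statistics in Table 6F. The sketch below is an assumption-laden illustration: it uses a large-sample normal approximation in place of the t distribution (reasonable with 100 subjects per group), and the function name is invented here.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sided P value for a difference in means between two independent groups.

    Uses a normal approximation to the two-sample t test.
    """
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Baseline quality of life: 7.0 (SD 4.5) versus 7.1 (SD 4.4)
p_baseline = two_sample_p(7.0, 4.5, 100, 7.1, 4.4, 100)
# Change from baseline: 1.3 (SD 2.1) versus 0.9 (SD 2.0)
p_change = two_sample_p(1.3, 2.1, 100, 0.9, 2.0, 100)
print(round(p_baseline, 2), round(p_change, 2))
```

The approximation gives roughly 0.87 for the baseline comparison and 0.17 for the comparison of changes, matching the asterisked values in the table. The key design point is that both groups enter a single test, rather than each group being tested separately against zero change.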
369 164 92 59 41 30 23
639 284 160 102 71 52 40
0.50 272 121 68 44 30 22 17
384 171 96 61 43 31 24
666 296 166 107 74 54 42
*To estimate the sample size, read across from the expected proportion (P) who have the variable
of interest and down from the desired total width (W) of the confidence interval. The three
numbers represent the sample size required for 90%, 95%, and 99% confidence levels.
...
References
Table 6F Correct (and Incorrect) Ways to Analyze Paired Data
Quality of Life, as Mean ± SD
Time of Measurement   Treatment (n = 100)   Control (n = 100)   P value
Baseline              7.0 ± 4.5             7.1 ± 4.4           0.87*
End of study          8.3 ± 4.7             8.0 ± 4.6           0.64*
P value               0.05†                 0.16†
Difference            1.3 ± 2.1             0.9 ± 2.0           0.17*
1. Lehr R. Sixteen S-squared over D-squared: a relation for crude sample size estimates. Stat
Med 1992;11:1099-1102.
2. Lakatos E, Lan KK. A comparison of sample size methods for the logrank statistic. Stat Med
1992;11:179-191.
3. Shih JH. Sample size calculation for complex clinical trials with survival endpoints. Control Clin
Trials 1995;16:395-407.
4. Donner A. Sample size requirements for stratified cluster randomization designs [published
erratum appears in Stat Med 1997;30;16:2927]. Stat Med 1992;11:743-750.
5. Liu G, Liang KY. Sample size calculations for studies with correlated observations. Biometrics
1997;53:937-947.
6. Kerry SM, Bland JM. Trials which randomize practices II: sample size. Fam Pract
1998;15:84-87.
7. Hayes RJ, Bennett S. Simple sample size calculation for cluster-randomized trials. Int J
Epidemiol 1999;28:319-326.
8. Edwardes MD. Sample size requirements for case-control study designs. BMC Med Res
Methodol 2001;1:11.
9. Drescher K, Timm J, Jöckel KH. The design of case-control studies: the effect of confounding
on sample size requirements. Stat Med 1990;9:765-776.
10. Lui KJ. Sample size determination for case-control studies: the influence of the joint
distribution of exposure and confounder. Stat Med 1990;9:1485-1493.
11. Vaeth M, Skovlund E. A simple approach to power and sample size calculations in logistic
regression and Cox regression models. Stat Med 2004;23:1781-1792.
12. Dupont WD, Plummer WD Jr. Power and sample size calculations for studies involving linear
...
regression. Control Clin Trials 1998;19:589-601.
13. Hsieh FY, Bloch DA, Larsen MD. A simple method of sample size calculation for linear and
logistic regression. Stat Med 1998;17:1623-1634.
14. Hsieh FY, Lavori PW. Sample-size calculations for the Cox proportional hazards regression
model with nonbinary covariates. Control Clin Trials 2000;21:552-560.
15. Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics
1997;53:1253-1261.
16. Elston RC, Idury RM, Cardon LR, et al. The study of candidate genes in drug trials: sample
size considerations. Stat Med 1999;18:741-751.
17. García-Closas M, Lubin JH. Power and sample size calculations in case-control studies of
gene-environment interactions: comments on different approaches. Am J Epidemiol
1999;149:689-692.
18. Torgerson DJ, Ryan M, Ratcliffe J. Economics in sample size determination for clinical trials.
QJM 1995;88:517-521.
19. Laska EM, Meisner M, Siegel C. Power and sample size in cost-effectiveness analysis. Med
Decis Making 1999;19:339-343.
20. Willan AR, O'Brien BJ. Sample size and power issues in estimating incremental
cost-effectiveness ratios from clinical trials data. Health Econ 1999;8:203-211.
21. Patel HI. Sample size for a dose-response study [published erratum appears in J Biopharm
Stat 1994;4:127]. J Biopharm Stat 1992;2:1-8.
22. Day SJ, Graham DF. Sample size estimation for comparing two or more treatment groups in
clinical trials. Stat Med 1991;10:33-43.
23. Nam JM. Sample size determination in stratified trials to establish the equivalence of two
treatments. Stat Med 1995;14:2037-2049.
24. Bristol DR. Determining equivalence and the impact of sample size in anti-infective studies: a
point to consider. J Biopharm Stat 1996;6:319-326.
25. Tai BC, Lee J. Sample size and power calculations for comparing two independent proportions
in a negative trial. Psychiatry Res 1998;80:197-200.
26. Hauschke D, Kieser M, Diletti E, et al. Sample size determination for proving equivalence
based on the ratio of two means for normally distributed data. Stat Med 1999;18:93-105.
27. Obuchowski NA. Computing sample size for receiver operating characteristic studies. Invest
Radiol 1994;29:238-243.
28. Simel DL, Samsa GP, Matchar DB. Likelihood ratios with confidence: sample size estimation
for diagnostic test studies. J Clin Epidemiol 1991;44:763-770.
29. Walter SD, Eliasziw M, Donner A. Sample size and optimal designs for reliability studies. Stat
Med 1998;17:101-110.
...
7
Designing a Cohort Study
Steven R. Cummings
Thomas B. Newman
Stephen B. Hulley
Cohort studies involve following groups of subjects over time. There
are two primary purposes: descriptive, typically to describe the
occurrence of certain outcomes over time; and analytic, to analyze
associations between predictors and those outcomes. This chapter
begins with a description of the classic prospective cohort study, in
which the investigator defines the sample and measures predictor
variables before undertaking a follow-up period to observe outcomes.
We then review retrospective cohort studies, which save time and
money because the follow-up period and outcomes have already
occurred when the study takes place, and include highly efficient
nested case-control and case-cohort options. The chapter
concludes by describing multiple-cohort studies and reviewing the
methods for optimizing a key ingredient for all cohort designs, cohort
retention during follow-up.
Prospective Cohort Studies
Structure
Cohort was the Roman term for a group of soldiers that marched
together, and in clinical research a cohort is a group of subjects
followed over time. In a prospective cohort study, the investigator
begins by assembling a sample of subjects (Fig. 7.1). She measures
characteristics in each subject that might predict the subsequent
outcomes, and follows these subjects with periodic measurements of
the outcomes of interest.
...
Example 7.1 Prospective Cohort Study
The Nurses' Health Study examines incidence and risk factors for
common diseases in women. The basic steps in performing the study
were to:
1. Assemble the Cohort. In 1976, the investigators obtained lists
of registered nurses aged 25 to 42 in the 11 most populous states
and mailed them an invitation to participate in the study; those
who agreed became the cohort.
2. Measure Predictor Variables and Potential Confounders.
They mailed a questionnaire about weight, exercise, and other
potential risk factors and obtained completed questionnaires from
121,700 nurses. They send questionnaires periodically to ask
about additional risk factors and update the status of some risk
factors that had been measured previously.
3. Follow Up the Cohort and Measure Outcomes. The periodic
questionnaires also included questions about the occurrence of a
variety of disease outcomes.
The prospective approach allowed investigators to make measurements
at baseline, and collect data on subsequent outcomes. The large size of
the cohort and long period of follow-up have provided substantial
FIGURE 7.1. In a prospective cohort study, the investigator (a)
selects a sample from the population (the dotted line signifies its
large and undefined size), (b) measures the predictor variables (in
this case whether a dichotomous risk factor is present [shaded]),
and (c) measures the outcome variables during follow-up (in this
case whether a disease occurs [outlined in bold]).
...
statistical power to study risk factors for cancers and other diseases.
For example, the investigators examined the hypothesis that gaining
weight increases a woman's risk of breast cancer after menopause (1).
The women reported their weight at age 18 in an early questionnaire,
and their current weights in later questionnaires. The investigators
succeeded in following 95% of the women, and 1,517 cases of breast
cancer were confirmed during the next 12 years. Heavier women had a
higher risk of breast cancer after menopause, and those who gained
more than 20 kg since age 18 had a twofold increased risk of
developing breast cancer (relative risk = 2.0; 95% confidence interval,
1.4 to 2.8). Adjusting for potential confounding factors did not change
the result.
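To show how a relative risk and its confidence interval arise from cohort counts, here is a minimal Python sketch. The counts are hypothetical and are not Nurses' Health Study data; the function name is an assumption, and the interval uses the common log-based approximation for a crude (unadjusted) relative risk, which may differ from the method the investigators used.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed, confidence=0.95):
    """Crude relative risk with a log-based confidence interval."""
    rr = (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)
    # Standard error of ln(RR) for cumulative-incidence data
    se_log_rr = sqrt(1 / cases_exposed - 1 / n_exposed
                     + 1 / cases_unexposed - 1 / n_unexposed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return rr, exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr)


# Hypothetical cohort: 40 cases among 1,000 exposed, 20 among 1,000 unexposed
rr, lower, upper = relative_risk(40, 1000, 20, 1000)
print(round(rr, 1), round(lower, 2), round(upper, 2))
```

With these made-up counts the relative risk is 2.0 with a 95% interval of roughly 1.2 to 3.4. The interval narrows as the number of outcome events grows, which is one reason large cohorts with long follow-up yield more precise estimates.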
Strengths and Weaknesses
The prospective cohort design is a powerful strategy for assessing
incidence (the number of new cases of a condition in a specified time
interval), and it is helpful in investigating the potential causes of the
condition. Measuring levels of the predictor before the outcome occurs
establishes the time sequence of the variables and prevents the
predictor measurements from being influenced by knowledge of the
outcome. The prospective approach also allows the investigator to
measure variables more completely and accurately than is possible
retrospectively. This is important for predictors such as dietary habits
that are difficult for a subject to remember accurately. When fatal
diseases are studied retrospectively, predictor variables about the
decedent can only be reconstructed from indirect sources such as
medical records or friends and relatives.
All cohort studies share the general disadvantage of observational
studies (relative to clinical trials) that causal inference is challenging
and interpretation often muddied by the influences of confounding
variables (Chapter 9). A particular weakness of the prospective design
is its expense and inefficiency for studying rare outcomes. Even
diseases we think of as relatively common, such as breast cancer,
happen so infrequently in any given year that large numbers of people
must be followed for long periods of time to observe enough outcomes
to produce meaningful results. Cohort designs become more efficient as
the outcomes become more common and immediate; a prospective
study of risk factors for progression after treatment of patients with
breast cancer will be smaller and less time consuming than a
prospective study of risk factors for the occurrence of breast cancer in
a healthy population.
Retrospective Cohort Studies
...
Structure
The design of a retrospective cohort study (Fig. 7.2) differs from that
of a prospective one in that the assembly of the cohort, baseline
measurements, and follow-up have all happened in the past. This type
of study is only possible if adequate
data about the risk factors and outcomes are available on a cohort of
subjects that has been assembled for other purposes.
Example 7.2 Retrospective Cohort Study
To describe the natural history of thoracic aortic aneurysms and risk
factors for rupture of these aneurysms, Clouse et al. analyzed data
from the medical records of 133 patients who had aneurysms (2). The
basic steps in performing the study were to
1. Identify a Suitable Cohort. The investigators used the residents
of Olmsted County, Minnesota. They searched a database of
diagnoses made between 1980 and 1995 and found 133 residents
who had a diagnosis of aortic aneurysm.
2. Collect Data about Predictor Variables. They reviewed
patients' records to collect gender, age, size of aneurysm, and
risk factors for cardiovascular disease at the time of diagnosis.
3. Collect Data about Subsequent Outcomes. They collected data
from the medical records of the 133 patients to determine
FIGURE 7.2. In a retrospective cohort study, the investigator (a)
identifies a cohort that has been assembled in the past, (b)
collects data on predictor variables (measured in the past), and (c)
collects data on outcome variables (measured in the present).
...
whether the aneurysm ruptured or was surgically repaired.
The investigators found that the 5-year risk of rupture was 20% and
that women were 6.8 times more likely to suffer a rupture than men
(95% confidence interval, 2.3 to 20). They also found that 31% of
aneurysms with diameters of more than 6 cm ruptured, compared with
none with diameters of less than 4 cm.
Retrospective cohort studies have many of the same strengths as
prospective cohort studies, and they have the advantage of being much
less costly and time consuming. The subjects are already assembled,
baseline measurements have already been made, and the follow-up
period has already taken place. The main disadvantages are the limited
control the investigator has over the approach to sampling the
population, and over the nature and the quality of the predictor
variables. The existing data may be incomplete, inaccurate, or
measured in ways that are not ideal for answering the research
question.
Nested Case-Control and Case-Cohort
Studies
1
Structure
A nested case-control design has a case-control study
nested within a cohort study (Fig. 7.3). It is an excellent design
for predictor variables that are expensive to measure and that can be
assessed at the end of the study on subjects who
develop the outcome during the study (the cases), and on a sample of
those who do not (the controls). The investigator begins with a suitable
cohort with enough cases by the end of follow-up to provide adequate
power to answer the research question. At the end of the study she
applies criteria that define the outcome of interest to identify all those
who have developed the outcome (the cases). Next, she selects a
random sample of the subjects who have not developed the outcome
(the controls); she can increase power by selecting two or three
controls for each case, and by matching on constitutional determinants
of outcome such as age and sex (see Chapter 9 for the pros and cons of
matching). She then retrieves specimens, images, or records that were
collected before the outcomes had occurred, measures the predictor
variables, and compares the levels in cases and controls.
...
The nested case-cohort approach is the same design except that
the controls are a random sample of all the members of the cohort
regardless of outcomes. This means that there will be some cases
among those sampled for the comparison group, who will also appear
among the cases and be analyzed as such (removing them from the
cohort sample for purposes of analysis is a negligible problem provided
that the outcome is uncommon). This approach has the advantage that
the controls represent the cohort in general, and therefore provide a
basis for estimating incidence and prevalence in the population from
which it was drawn. More important, it means that this cohort sample
can be used as the comparison group for more than one type of
outcome provided that it is not too common. In Example 7.3, for
instance, a single set of sex hormone levels from the baseline exam
measured in a random sample of the cohort could be compared with
levels from baseline in cases with breast cancer in one analysis, and in
cases with fractures in another.
Example 7.3 Nested Case-Control Design
Cauley et al. carried out a nested case-control study of whether
higher levels of sex hormones were risk factors for breast cancer (4).
The basic steps in this study were to
1. Identify a Cohort with Banked Samples. The investigators
FIGURE 7.3. In a nested case-control study, the investigator (a)
identifies a cohort with banked specimens, images, or information;
(b) identifies those participants who developed the outcome during
follow-up (the cases); (c) selects a sample from the rest of the
cohort (the controls); and (d) measures predictor variables in
cases and controls.
...
used serum and data from the Study of Osteoporotic Fractures, a
prospective cohort of 9,704 women age 65 and older.
2. Identify Cases at the End of Follow-up. Based on responses to
follow-up questionnaires and review of death certificates, the
investigators identified 97 subjects with a first occurrence of
breast cancer during 3.2 years of follow-up.
3. Select Controls. The investigators selected a random sample of
244 women in the cohort who did not develop breast cancer during
that follow-up period.
4. Measure Predictors on Baseline Samples from Cases and
Controls. Levels of estradiol and testosterone were measured in
serum specimens from the baseline examination that had been
stored at −190°C by laboratory staff who were blinded to
case-control status.
Women who had high levels of either estradiol or testosterone had a
threefold increase in the risk of a subsequent diagnosis of breast
cancer compared with women who had very low levels of these
hormones.
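The two sampling schemes described in this section can be sketched in Python. This is a minimal illustration on a made-up cohort: the record layout, function names, and seed are assumptions, and the sketch ignores matching and the timing of outcomes.

```python
import random


def nested_case_control(cohort, controls_per_case=2, seed=0):
    """All cases plus a random sample of subjects who did not develop the outcome."""
    rng = random.Random(seed)
    cases = [s for s in cohort if s["outcome"]]
    non_cases = [s for s in cohort if not s["outcome"]]
    k = min(len(non_cases), controls_per_case * len(cases))
    return cases, rng.sample(non_cases, k)


def case_cohort(cohort, subcohort_size, seed=0):
    """All cases plus a random sample of the whole cohort, regardless of outcome."""
    rng = random.Random(seed)
    cases = [s for s in cohort if s["outcome"]]
    subcohort = rng.sample(cohort, min(subcohort_size, len(cohort)))
    return cases, subcohort


# Hypothetical cohort of 1,000 subjects in which 1 in 50 develops the outcome
cohort = [{"id": i, "outcome": i % 50 == 0} for i in range(1000)]
cases, controls = nested_case_control(cohort)
cc_cases, subcohort = case_cohort(cohort, subcohort_size=100)
print(len(cases), len(controls), len(subcohort))
```

In the nested case-control version the controls are drawn only from subjects without the outcome, whereas the case-cohort subcohort may include some cases by chance. Expensive predictor measurements would then be made only on the stored baseline material for the selected subjects.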
Nested case-control and case-cohort studies are especially
useful for costly measurements on serum, electronic images, hospital
charts, etc. that have been archived at the beginning of the study and
preserved for later analysis. In addition to the cost savings of not
making the measurements on the entire cohort, the design allows the
investigator to introduce novel measurements that were not available at
the outset of the study. The design preserves all the advantages of
cohort studies that result from collecting predictor variables before the
outcomes have happened, and it avoids the potential biases of
conventional case-control studies that draw cases and controls from
different populations and cannot make measurements on cases and
controls who have died.
The chief disadvantage of this design is that many research questions
and circumstances are not amenable to the strategy of storing
materials for later analysis on a sample of the study subjects. Also,
when data are available for the entire cohort at no additional cost,
nothing is gained by studying only a sample of controls; the whole
cohort should be used.
These are such great designs that an investigator planning a
prospective study should always consider preserving biologic samples
and storing images or records that involve expensive measurements for
subsequent nested case-control or case-cohort analyses. She
should ensure that the conditions of storage will preserve substances of
...
interest for many years, and consider setting aside specimens for
periodic measurements to confirm that the components have remained
stable. She may also find it useful to collect new samples or
information during the follow-up period that can be used in the
case-control comparisons.
Multiple-Cohort Studies and External
Controls
Structure
Multiple-cohort studies begin with two or more separate samples of
subjects: typically, one group with exposure to a potential risk factor
and one or more other groups with no exposure or a lower level of
exposure (Fig. 7.4). After defining suitable cohorts with different levels
of exposure to the predictor of interest, the investigator measures
predictor variables, follows up the cohorts, and assesses outcomes as
in any other type of cohort study.
The use of two different samples of subjects in a double-cohort design
should not be confused with the use of two samples in the
case-control design (Chapter 8). In a double-cohort study the two
groups of subjects are chosen based on the level of a predictor
variable, whereas in a case-control study the two groups are chosen
based on the presence or absence of the outcome.
...
Example 7.4 Multiple-Cohort Design
To determine whether significant neonatal jaundice or dehydration has
any significant adverse effects on neurodevelopment, investigators
from UCSF and the Northern California Kaiser Permanente Medical Care
Program (5) undertook a triple-cohort study. The basic steps in
performing the study were to
1. Identify Cohorts with Different Exposures. The investigators
used electronic databases to identify term and near-term
newborns who (1) had a total serum bilirubin level of ≥25
mg/dL, or (2) were readmitted for dehydration with a serum
sodium of ≥150 mEq/L or weight loss of ≥12% from
birth, or (3) were randomly selected from the birth cohort.
2. Determine Outcomes. The investigators used electronic
databases to search for diagnoses of neurological disorders and
did full neurodevelopmental examinations at the age of 5 for
consenting participants.
With few exceptions, neither hyperbilirubinemia nor dehydration was
associated with adverse outcomes.
In a variation on the multiple-cohort design, the outcome rate in a
cohort can be compared with outcome rates in a census or registry
from a different population. For example, in a classic study of whether
uranium miners had an increased incidence of lung cancer, Wagoner et
al. (6) compared the incidence of respiratory cancer in 3,415 uranium
miners with that of white men who lived in the same states. The
increased incidence of lung cancer observed in the miners helped
establish occupational exposure to ionizing radiation as an important
cause of lung cancer.
The multiple-cohort design may be the only feasible approach for
studying rare exposures, and exposures to potential occupational and
environmental hazards. Using data from a census or registry as the
external control group has the additional advantage of being population
FIGURE 7.4. In a prospective double-cohort study, the
investigator (a) selects cohorts from two populations with different
levels of the predictor, and (b) measures outcome variables during
follow-up. (Double-cohort studies can also be conducted
retrospectively.)
P.104
...
based and economical. Otherwise, the strengths of this desi gn are
si mi lar to those of other cohort studi es.
The problem of confounding is accentuated in a multiple-cohort study
because the cohorts are assembled from different populations that can
differ in important ways (besides exposure to the predictor variable)
that influence the outcomes. Although some of these differences, such
as age and race, can be matched or used to adjust the findings
statistically, other characteristics may not be measurable and create
problems in the interpretation of observed associations.
Other Cohort Study Issues
The hallmark of a cohort study is the need to define a group of
subjects at the beginning of a period of follow-up. The subjects should
be appropriate to the research question and available for follow-up.
They should sufficiently resemble the population to which the results
will be generalized. The number of subjects should provide adequate
power.
The quality of the study will depend on the precision and accuracy of
the measurements of predictor and outcome variables. The ability to
draw inferences about cause and effect will also depend on the degree
to which the investigator has identified and measured all potential
confounders and sources of effect modification (Chapter 9). Predictor
variables may change during the study; whether and how frequently
measurements should be repeated depends on cost, how much the variable
is likely to change, and the importance to the research question of
observing these changes. Outcomes should be assessed using standardized
criteria and blindly, without knowing the values of the predictor variables.
Follow-up of the entire cohort is important, and prospective studies
should take a number of steps to achieve this goal. Loss of subjects
can be minimized in several ways (Table 7.1). Those who plan to move
out of reach during the study or who will be difficult to follow for other
reasons should be excluded at the outset. The investigator should
collect information early on that she can use to find subjects if they
move or die. This includes the address, telephone number, and e-mail
address of the subject, personal physician, and one or two close friends
or relatives who do not live in the same house. It is useful to obtain
the social security number and (for those over 65) the Medicare number.
This information will allow the investigator to determine the vital
status of subjects who are lost to follow-up using the National Death
Index and to obtain hospital discharge information from the Social
Security Administration for subjects who receive Medicare. Periodic
contact with the subjects once or twice a year helps in keeping track
of them, and may improve the timeliness and accuracy of recording the
outcomes of interest. Finding subjects for follow-up assessments
sometimes requires persistent and repeated efforts by mail, e-mail,
telephone, house calls, or professional tracking.
Table 7.1 Strategies for Minimizing Losses during Follow-up
During enrollment
1. Exclude those likely to be lost
   a. Planning to move
   b. Uncertainty about willingness to return
   c. Ill health or fatal disease unrelated to research question
2. Obtain information to allow future tracking
   a. Address, telephone number(s), and e-mail address of subject
   b. Social Security/Medicare numbers
   c. Name, address, telephone number, and e-mail addresses for one or
      two close friends or relatives who do not live with the subject
   d. Name, e-mail, address, and telephone number of physician(s)
During follow-up
1. Periodic contact with subjects to collect information, provide
   results, express care, and so on
   a. By telephone: may require calls during weekends and evenings
   b. By mail: repeated mailings by e-mail or with stamped,
      self-addressed return cards
   c. Other: newsletters, token gifts
2. For those who are not reached by phone or mail:*
   a. Contact friends, relatives, or physicians
   b. Request forwarding addresses from postal service
   c. Seek address through other public sources, such as telephone
      directories and the Internet, and ultimately a credit bureau search
   d. For subjects receiving Medicare, collect data about hospital
      discharges from Social Security Administration
   e. Determine vital status from state health department or National
      Death Registry
At all times
1. Treat study subjects with appreciation, kindness, and respect,
   helping them to understand the research question so they will want
   to join as partners in making the study successful
* This assumes that participants in the study have given informed
consent to collect the tracking information and for follow-up contact.
Summary
1. In cohort studies, subjects are followed over time to describe
the incidence or natural history of a condition and to analyze
predictors (risk factors) for various outcomes. Measuring the
predictor before the outcome occurs establishes the sequence of
events and helps control bias in that measurement.
2. Prospective cohort studies begin at the outset of follow-up and
may require large numbers of subjects followed for long periods of
time. This disadvantage can sometimes be overcome by
identifying a retrospective cohort in which measurements of
predictor variables have already occurred.
3. Another efficient variant is the nested case-control design. A
bank of specimens, images, or records is collected at baseline;
measurements are made on the stored materials for all subjects
who have developed an outcome, and for a subset of those who
have not. In the nested case-cohort strategy, a single
random sample of the cohort can serve as the comparison group
for several case-control studies.
4. The multiple-cohort design, which compares the incidence of
outcomes in cohorts that differ in level of a predictor variable, is
useful for studying the effects of rare and occupational exposures.
5. Inferences about cause and effect are strengthened by
measuring all potential confounding variables at baseline. Bias in
the assessment of outcomes is prevented by standardizing the
measurements and blinding those assessing the outcome to the
predictor variable values.
6. The strengths of a cohort design can be undermined by incomplete
follow-up of subjects. Losses can be minimized by excluding
subjects who may not be available for follow-up, collecting
baseline information that facilitates tracking, staying in touch
with all subjects regularly, and involving subjects as partners in
the research.
References
1. Huang Z, Hankinson SE, Colditz GA, et al. Dual effect of
weight and weight gain on breast cancer risk. JAMA
1997;278:1407-1411.
2. Clouse WD, Hallett JW Jr, Schaff HV, et al. Improved prognosis
of thoracic aortic aneurysms: a population-based study. JAMA
1998;280:1926-1929.
3. Szklo M, Nieto FJ. Epidemiology: beyond the basics.
Gaithersburg, MD: Aspen, 2000:33-38.
4. Cauley JA, Lucas FL, Kuller LH, et al. Study of Osteoporotic
Fractures Research Group. Elevated serum estradiol and
testosterone concentrations are associated with a high risk for
breast cancer. Ann Intern Med 1999;130:270-277.
5. Newman TB, Liljestrand P, Jeremy RJ, et al. Outcomes of
newborns with total serum bilirubin levels of 25 mg/dL or more. N
Engl J Med 2006;354:1889-1900.
6. Wagoner JK, Archer VE, Lundin FE, et al. Radiation as the cause
of lung cancer among uranium miners. N Engl J Med
1965;273:181-187.
These terms are used inconsistently in the literature; the definitions
provided here are the simplest. For a detailed discussion, see Szklo and
Nieto (3).
8
Designing Cross-sectional and
Case-Control Studies
Thomas B. Newman
Warren S. Browner
Steven R. Cummings
Stephen B. Hulley
Chapter 7 dealt with cohort studies, in which the sequence of the
measurements is the same as the chronology of cause and effect: first the
predictor, then (after an interval of follow-up) the outcome. In this chapter
we turn to two kinds of observational studies that are not guided by this
logical time sequence.
In a cross-sectional study, the investigator makes all of her
measurements on a single occasion or within a short period of time. She
draws a sample from the population and looks at distributions of variables
within that sample, sometimes designating predictor and outcome variables
based on biologic plausibility and information from other sources. In a
case-control study, the investigator works backward. She begins by
choosing one sample from a population of patients with the outcome (the
cases) and another from a population without it (the controls); then she
compares the levels of the predictor variables in the two
samples to see which ones are associated with and might cause the
outcome.
Cross-Sectional Studies
Structure
The structure of a cross-sectional study is similar to that of a cohort study
except that all the measurements are made at about the same time, with
no follow-up period (Fig. 8.1). Cross-sectional designs are very well suited
to the goal of describing variables and their distribution patterns. In the
National Health and Nutrition Examination Survey (NHANES), for example,
a sample designed to represent the US population is interviewed and
examined. NHANES surveys have been carried out periodically, and an
NHANES follow-up (cohort) study has been added to the original
cross-sectional design. Each cross-sectional study is a major source of
information about the health and habits of the US population in the year it
is carried out, providing estimates of such things as the prevalence of
smoking in various demographic groups. All NHANES datasets are available
for public use.
Cross-sectional studies can also be used for examining associations,
although the choice of which variables to label as predictors and which as
outcomes depends on the cause-and-effect hypotheses of the investigator
rather than on the study design. This choice is easy for constitutional
factors such as age and race; these cannot be altered by other variables
and therefore are predictors. For other variables, however, the choice can
go either way. For example, a cross-sectional finding in NHANES III is an
association between childhood obesity and hours spent watching television
(1). Whether to label obesity or TV-watching as the outcome depends on
the question of interest to the investigator.
Unlike cohort studies, which have a longitudinal time dimension and can be
used to estimate incidence (the proportion who get a disease or condition
over time), cross-sectional studies can generally provide information only
about prevalence, the proportion who have a disease or condition at one
point in time (Table 8.1). Prevalence is useful to health planners who want
to know how many people have certain diseases so that they can allocate
enough resources to care for them, and it is useful to the clinician who
must estimate the likelihood that the patient sitting in her office has a
particular disease. When analyzing cross-sectional studies, the prevalence
of the outcome is compared in those with and without an exposure, giving
the relative prevalence of the outcome, the cross-sectional equivalent of
relative risk. An example calculation of prevalence and relative prevalence
is provided in Appendix 8A.
FIGURE 8.1. In a cross-sectional study, the investigator (a) selects a
sample from the population and (b) measures predictor and outcome
variables (e.g., presence or absence of a risk factor and disease).
Sometimes cross-sectional studies describe the prevalence of ever having
done something or ever having had a disease or condition. In that case,
the prevalence is the same as the cumulative incidence, and it is
important to make sure that follow-up time is the same in those exposed
and unexposed. This is illustrated in Example 8.1, in which the prevalence
of ever having tried smoking was studied in a cross-sectional study of
children with differing levels of exposure to movies in which the actors
smoke. Of course, children who had seen more movies were also older, and
therefore had longer to try smoking, so it was very important to adjust for
age in analyses (multivariate adjustment is discussed in Chapter 9).
Table 8.1 Statistics for Expressing Disease Frequency in Observational Studies

Type of Study     Statistic              Definition
Cohort            Incidence rate         (Number of people who get a disease or condition) / (Number of people at risk × Time period at risk)
Cross-sectional   Prevalence             (Number of people who have a disease or condition) / (Number of people at risk)
Both              Cumulative incidence   (Number of people who get [cohort] or report ever having acquired [cross-sectional] a disease or condition) / (Number of people at risk)

Example 8.1 Cross-sectional Study
To determine whether exposure to movies in which the actors smoke is
associated with smoking initiation, Sargent et al. (2):
1. Selected the Sample: They did a random-digit-dial survey of 6,522
children aged 10 to 14 years.
2. Measured the Variables: They quantified smoking in 532 popular
movies and for each subject asked which of a randomly selected
subset of 50 movies they had seen. Subjects were also asked about a
variety of covariates such as age, race, gender, parental smoking and
education, sensation-seeking (e.g., "I like to do dangerous
things") and self-esteem (e.g., "I wish I were someone
else"). The outcome variable was whether the child had ever tried
smoking a cigarette.
The prevalence of ever having tried smoking varied from 2% in the lowest
quartile of movie smoking exposure to 22% in the highest quartile. After
adjusting for age and other confounders, odds ratios were much lower but
still significant: 1.7, 1.8, and 2.6 for the second, third, and highest
quartiles of movie smoking exposure, compared with the lowest quartile.
Based on the adjusted odds ratios, the authors estimated that 38% of
smoking initiation was attributable to exposure to movies in which the
actors smoke.
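The gap between the crude and adjusted results in Example 8.1 can be made concrete with a short calculation. The sketch below uses only the two prevalences quoted above (2% and 22%); the function names are ours, chosen for illustration, and the adjusted odds ratio of 2.6 is simply the figure reported in the example, not something this code derives.

```python
def relative_prevalence(prev_exposed: float, prev_unexposed: float) -> float:
    """Prevalence of the outcome in the exposed divided by that in the unexposed."""
    if prev_unexposed <= 0:
        raise ValueError("prevalence in the unexposed group must be positive")
    return prev_exposed / prev_unexposed


def crude_odds_ratio(prev_exposed: float, prev_unexposed: float) -> float:
    """Odds of the outcome in the exposed divided by the odds in the unexposed."""
    odds_exposed = prev_exposed / (1 - prev_exposed)
    odds_unexposed = prev_unexposed / (1 - prev_unexposed)
    return odds_exposed / odds_unexposed


# Prevalence of ever having tried smoking, highest vs. lowest exposure quartile.
highest_quartile = 0.22
lowest_quartile = 0.02

crude_rp = relative_prevalence(highest_quartile, lowest_quartile)
crude_or = crude_odds_ratio(highest_quartile, lowest_quartile)

print(f"Crude relative prevalence: {crude_rp:.1f}")
print(f"Crude odds ratio:          {crude_or:.1f}")
print("Adjusted odds ratio reported in the example: 2.6")
```

The crude comparison is several times larger than the adjusted odds ratio, which is what one would expect if older children both saw more movies and had more time to try smoking.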
Strengths and Weaknesses of Cross-sectional Studies
A major strength of cross-sectional studies over cohort studies and clinical
trials is that there is no waiting for the outcome to occur. This makes them
fast and inexpensive, and it means that there is no loss to follow-up. A
cross-sectional study can be included as the first step in a cohort study or
experiment at little or no added cost. The results define the demographic
and clinical characteristics of the study group at baseline and can
sometimes reveal cross-sectional associations of interest.
A weakness of cross-sectional studies is the difficulty of establishing causal
relationships from observational data collected in a cross-sectional time
frame. Cross-sectional studies are also impractical for the study of rare
diseases if the design involves collecting data on a sample of individuals
from the general population. A cross-sectional study of stomach cancer in a
general population of 45- to 59-year-old men, for example, would need
about 10,000 subjects to find just one case.
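The arithmetic behind this statement can be sketched as follows. The prevalence of 1 per 10,000 is our assumption, inferred from the "10,000 subjects to find just one case" figure rather than stated in the text, and the calculation treats subjects as independent draws from the population.

```python
def expected_cases(sample_size: int, prevalence: float) -> float:
    """Expected number of prevalent cases in a random sample."""
    return sample_size * prevalence


def prob_at_least_one_case(sample_size: int, prevalence: float) -> float:
    """Chance that a random sample contains one or more cases (binomial model)."""
    return 1 - (1 - prevalence) ** sample_size


assumed_prevalence = 1 / 10_000  # assumption implied by the text, not a cited figure
n = 10_000

print(f"Expected cases in {n:,} subjects: {expected_cases(n, assumed_prevalence):.2f}")
print(f"Probability of at least one case: {prob_at_least_one_case(n, assumed_prevalence):.2f}")
```

Under these assumptions, even a sample large enough to expect one case has a substantial chance of containing none, which reinforces the point about impracticality.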
Cross-sectional studies can be done on rare diseases if the sample is drawn
from a population of diseased patients rather than from the general
population. A case series of this sort is better suited to describing the
characteristics of the disease than to analyzing differences between these
patients and healthy people, although informal comparisons with prior
experience can sometimes identify very strong risk factors. Of the first
1,000 patients with AIDS, for example, 727 were homosexual or bisexual
males and 236 were injection drug users (3). It did not require a formal
control group to conclude that these groups were at increased risk.
Furthermore, within a sample of persons with a disease there may be
associations of interest (e.g., the higher risk of Kaposi's sarcoma among
patients with AIDS who were homosexual than among those who were
injection drug users).
When cross-sectional studies measure only prevalence and not cumulative
incidence, the information they can produce on prognosis, natural
history, and disease causation is limited. To show causation, investigators
need to demonstrate that the incidence of disease differs in those exposed
to a risk factor. Because prevalence is the product of disease incidence and
disease duration, a factor that is associated with higher prevalence of
disease may be a cause of the disease but could also be associated with
prolonged duration of the disease. For example, the prevalence of severe
depression is affected not just by its incidence, but by the duration of
episodes, the suicide rate, and the responsiveness to medication of those
affected.
Therefore, cross-sectional studies may show increased relative prevalence
either because the condition occurs more frequently in those with the
exposure, or because the condition lasts longer in those with the exposure.
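A small numerical sketch may help show how duration alone can raise relative prevalence. The incidence and duration values below are hypothetical, chosen only for illustration, and the product "incidence times duration" is used as the simple approximation described in the text.

```python
def approx_prevalence(incidence_per_year: float, mean_duration_years: float) -> float:
    """Approximate prevalence as incidence multiplied by average disease duration."""
    return incidence_per_year * mean_duration_years


# Hypothetical groups: identical incidence, but episodes last twice as long
# in the exposed group.
incidence = 0.01            # new cases per person per year (both groups)
duration_exposed = 2.0      # years
duration_unexposed = 1.0    # years

prev_exposed = approx_prevalence(incidence, duration_exposed)
prev_unexposed = approx_prevalence(incidence, duration_unexposed)

relative_incidence = incidence / incidence
relative_prev = prev_exposed / prev_unexposed

print(f"Relative incidence:  {relative_incidence:.1f}")
print(f"Relative prevalence: {relative_prev:.1f}")
```

In this sketch the exposure does not cause the disease at all, yet a cross-sectional study would find the condition twice as prevalent among the exposed.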
Serial Surveys
A series of cross-sectional studies of a single population observed at
several points in time is sometimes used to draw inferences about
changing patterns over time. For example, Zito et al. (4), using annual
cross-sectional surveys, reported that the prevalence of prescription
psychotropic drug use among youth (<20 years old) increased more than
threefold between 1987 and 1996 in a mid-Atlantic Medicaid population.
This is not a cohort design because it does not follow a single group of
people over time; there are changes in the population over time due to
births, deaths, aging, migration, and eligibility changes.
Case-Control Studies
Structure
To investigate the causes of all but the most common diseases, both
cohort and cross-sectional studies of general population samples are
expensive: each would require thousands of subjects to identify risk
factors for a rare disease like stomach cancer. A case series of patients
with the disease can identify an obvious risk factor (such as, for AIDS,
injection drug use), using prior knowledge of the prevalence of the risk
factor in the general population. For most risk factors, however, it is
necessary to assemble a reference group, so that the prevalence of the
risk factor in subjects with the disease (cases) can be compared with the
prevalence in subjects without the disease (controls).
The retrospective structure of a case-control study is shown in Fig.
8.2. The study identifies one group of subjects with the disease and
another without it, then looks backward to find differences in predictor
variables that may explain why the cases got the disease and the controls
did not.
Case-control studies began as epidemiologic studies to try to identify
risk factors for diseases. Therefore the outcome traditionally used to
determine case-control status has been the presence or absence of a
disease. For this reason and because it makes the discussion easier to
follow, we generally refer to "cases" as those with the disease.
However, the case-control design can also be used to look at other
outcomes, such as disability among those who already have a disease. In
addition, when undesired outcomes are the rule rather than the exception,
the cases in a case-control study may be the rare patients who have had
a good outcome, such as recovery from a usually fatal disease.
Case-control studies are the "house red" on the research design
wine list: more modest and a little riskier than the other selections but
much less expensive and sometimes surprisingly good. The design of a
case-control study is challenging because of the increased opportunities
for bias, but there are many examples of well-designed case-control
studies that have yielded important results. These include the links
between maternal diethylstilbestrol use and vaginal cancer in daughters (a
classic study that provided a definitive conclusion based on just seven
cases!) (5), and between prone sleeping position and sudden infant death
syndrome (6), a simple result that has saved thousands of lives.
FIGURE 8.2. In a case-control study, the investigator (a) selects a
sample from a population with the disease (cases), (b) selects a
sample from a population at risk that is free of the disease (controls),
and (c) measures predictor variables.
Example 8.2 Case-Control Study
Because intramuscular (IM) vitamin K is given routinely to newborns in the
United States, a pair of studies reporting a doubling in the risk of
childhood cancer among those who had received IM vitamin K caused quite
a stir (7,8). To investigate this association further, German investigators
(9):
1. Selected the Sample of Cases. 107 children with leukemia from the
German Childhood Cancer Registry.
2. Selected the Sample of Controls. 107 children matched by sex and
date of birth and randomly selected from children living in the same
town as the case at the time of diagnosis (from local government
residential registration records).
3. Measured the Predictor Variable. Reviewed medical records to
determine which cases and controls had received IM vitamin K in the
newborn period.
The authors found 69 of 107 cases (64%) and 63 of 107 controls (59%)
had been exposed to IM vitamin K, for an odds ratio of 1.2 (95%
confidence interval [CI], 0.7 to 2.3). (See Appendix 8A for the
calculation.) Therefore, this study did not confirm the existence of an
association between the receipt of IM vitamin K as a newborn and
subsequent childhood leukemia. The point estimate and upper limit of the
95% CI leave open the possibility of a clinically important increase in
leukemia in the population from which the samples were drawn, but
several other studies and an analysis using an additional control group in
the example study also failed to confirm the association (10,11).
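For readers who want to see the arithmetic, the following is a minimal sketch of a crude (unmatched) odds ratio with an approximate 95% confidence interval, using the counts quoted in Example 8.2. The log-based (Woolf) interval and the function name are our choices for illustration; this is not necessarily the method used in Appendix 8A or by the study authors, whose controls were matched. The crude result is close to, but not identical with, the published 1.2 (0.7 to 2.3).

```python
import math


def crude_odds_ratio_ci(exposed_cases, unexposed_cases,
                        exposed_controls, unexposed_controls, z=1.96):
    """Crude odds ratio and Woolf (log-based) confidence interval from a 2x2 table."""
    odds_ratio = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)
    # Standard error of ln(OR): square root of the summed reciprocals of the four cells.
    se_log_or = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                          + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts from Example 8.2: 69 of 107 cases and 63 of 107 controls were exposed.
or_est, ci_low, ci_high = crude_odds_ratio_ci(69, 107 - 69, 63, 107 - 63)
print(f"Crude OR = {or_est:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Because the interval includes 1.0, the crude calculation leads to the same conclusion as the text: the data do not confirm the association.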
Case-control studies cannot yield estimates of the incidence or
prevalence of a disease because the proportion of study subjects who have
the disease is determined by how many cases and how many controls the
investigator chooses to sample, rather than by their proportions in the
population. What case-control studies do provide is descriptive
information on the characteristics of the cases and, more important, an
estimate of the strength of the association between each predictor variable
and the presence or absence of the disease. These estimates are in the
form of the odds ratio, which approximates the relative risk if the
prevalence of the disease is relatively low (about 10% or less) (Appendix
8B).
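The claim that the odds ratio approximates the relative risk when disease is uncommon can be checked numerically. The sketch below fixes a hypothetical relative risk of 2 and varies the risk in the unexposed; all of the risk values are illustrative assumptions, not data from any study cited here.

```python
def odds(probability: float) -> float:
    """Convert a probability to odds."""
    return probability / (1 - probability)


def rr_and_or(risk_exposed: float, risk_unexposed: float):
    """Return (relative risk, odds ratio) for two risks."""
    relative_risk = risk_exposed / risk_unexposed
    odds_ratio = odds(risk_exposed) / odds(risk_unexposed)
    return relative_risk, odds_ratio


# Hypothetical baseline risks, each paired with a doubled risk in the exposed.
results = {}
for baseline in (0.001, 0.01, 0.10, 0.30):
    rr, or_ = rr_and_or(2 * baseline, baseline)
    results[baseline] = (rr, or_)
    print(f"risk in unexposed = {baseline:<5}  RR = {rr:.2f}  OR = {or_:.2f}")
```

With rare outcomes the two measures nearly coincide; as the outcome becomes common, the odds ratio moves progressively farther from 1 than the relative risk.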
Strengths of Case-Control Studies
Efficiency for Rare Outcomes
One of the major strengths of case-control studies is their rapid, high
yield of information from relatively few subjects. Consider a study of the
effect of circumcision on subsequent carcinoma of the penis. This cancer is
very rare in circumcised men but is also rare in uncircumcised men: their
lifetime cumulative incidence is about 0.16% (12). To do a cohort study
with a reasonable chance (80%) of detecting even a very strong risk factor
(say a relative risk of 50) would require more than 6,000 men, assuming
that roughly equal proportions were circumcised and uncircumcised. A
randomized clinical trial of circumcision at birth would require the same
sample size, but the cases would occur at a median of 67 years after entry
into the study; it would take three generations of epidemiologists to
follow the subjects!
Now consider a case-control study of the same question. For the same
chance of detecting the same relative risk, only 16 cases and 16 controls
(and not much investigator time) would be required. For diseases that are
either rare or have long latent periods between exposure and disease,
case-control studies are far more efficient than the other designs. In
fact, they are often the only feasible option.
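To see why the cohort design is so inefficient here, it helps to count how many cancers the 6,000-man cohort described above would be expected to produce over a lifetime. The sketch below assumes an exact 50/50 split between circumcised and uncircumcised men and applies the relative risk of 50 to the 0.16% figure; both simplifications are ours, made for illustration.

```python
def expected_cohort_cases(cohort_size: int, risk_uncircumcised: float,
                          relative_risk: float, fraction_uncircumcised: float = 0.5):
    """Expected lifetime cases in each exposure group of a hypothetical cohort."""
    n_uncirc = cohort_size * fraction_uncircumcised
    n_circ = cohort_size - n_uncirc
    risk_circumcised = risk_uncircumcised / relative_risk
    return n_uncirc * risk_uncircumcised, n_circ * risk_circumcised


cases_uncirc, cases_circ = expected_cohort_cases(
    cohort_size=6_000,
    risk_uncircumcised=0.0016,  # 0.16% lifetime cumulative incidence from the text
    relative_risk=50,           # the "very strong" risk factor posited in the text
)
total_cases = cases_uncirc + cases_circ

print(f"Expected cases among uncircumcised men: {cases_uncirc:.1f}")
print(f"Expected cases among circumcised men:   {cases_circ:.2f}")
print(f"Total expected cases in the cohort:     {total_cases:.1f}")
```

Under these assumptions, thousands of men followed for decades would yield only about five cases, whereas the case-control design starts from cases that have already occurred.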
Usefulness for Generating Hypotheses
The retrospective approach of case-control studies, and their ability to
examine a large number of predictor variables, make them useful for
generating hypotheses about the causes of a new outbreak of disease. For
example, a case-control study of an epidemic of acute renal failure in
Haitian children found an odds ratio of 53 for ingestion of locally
manufactured acetaminophen syrup. Further investigation revealed that
the renal failure was due to poisoning by diethylene glycol, which was
found to contaminate the glycerine solution used to make the
acetaminophen syrup (13).
Weaknesses of Case-Control Studies
Case-control studies have great strengths, but they also have major
limitations. The information available in case-control studies is limited:
unless the population and time period from which the cases arose are
known, there is no direct way to estimate the incidence or prevalence of
the disease, nor the attributable or excess risk. There is also the problem
that only one outcome can be studied (the presence or absence of the
disease that was the criterion for drawing the two samples), whereas
cohort and cross-sectional studies (and clinical trials) can study any
number of outcome variables. But the biggest weakness of case-control
studies is their susceptibility to bias. This bias comes chiefly from two
sources: the separate sampling of the cases and controls, and the
retrospective measurement of the predictor variables. These two problems
and the strategies for dealing with them are the topic of the next two
sections.
Sampling Bias and How to Control It
The sampling in a case-control study begins with the cases. Ideally, the
sample of cases would be a complete or a random sample of everyone who
develops the disease under study. An immediate problem comes up,
however. How do we know who has developed the disease and who has
not? In cross-sectional and cohort studies the disease is systematically
sought in all the study participants, but in case-control studies the cases
must be sampled from patients in whom the disease has already been
diagnosed and who are available for study. This sample may not be
representative of all patients who develop the disease because those who
are undiagnosed, misdiagnosed, unavailable for study, or dead are less
likely to be included (Fig. 8.3).
In general, sampling bias is important when the sample of cases is
unrepresentative with respect to the risk factor being studied. Diseases
that almost always require hospitalization and are relatively easy to
diagnose, such as hip fracture and traumatic amputations, can be safely
sampled from diagnosed and accessible cases. On the other hand,
conditions that may not come to medical attention are not well suited to
retrospective studies because of the selection that precedes diagnosis. For
example, women seen in a gynecologic clinic with first-trimester
spontaneous abortions would probably differ from the entire population of
women experiencing spontaneous abortions because those with greater
access to gynecologic care or with complications would be
overrepresented. If a predictor variable of interest is associated with
gynecologic care in the population (such as past use of an intrauterine
device [IUD]), sampling cases from the clinic could be an important source
of bias. If, on the other hand, a predictor is unrelated to gynecologic
care (such as blood type), there would be less likelihood of a clinic-based
sample being unrepresentative.
FIGURE 8.3. Some reasons that the cases in a case-control study
may not be representative of all cases of the disease.
Although it is important to think about these issues, in actual practice the
selection of cases is often straightforward because the accessible sources
of subjects are limited. The sample of cases may not be entirely
representative, but it may be all that the investigator has to work with.
The more difficult decisions faced by an investigator designing a
case-control study then relate to the more open-ended task of selecting
the controls. The general goal is to sample controls from a population at
risk for the disease that is otherwise similar to the cases. Four strategies
for sampling controls follow:
Hospital- or clinic-based controls. One strategy to compensate for
the possible selection bias caused by obtaining cases from a hospital
or clinic is to select controls from the same facilities. For example, in
a study of past use of an IUD as a risk factor for spontaneous
abortion, controls could be sampled from a population of women
seeking care for vaginitis at the same gynecologic clinic. Compared
with a random sample of women from the same area, these controls
would presumably better represent the population of women who, had
they developed a spontaneous abortion, would have come to the clinic
and become a case.
However, selection of an unrepresentative sample of controls to
compensate for an unrepresentative sample of cases can be
problematic. If the risk factor of interest also causes diseases for
which the controls seek care, the prevalence of the risk factor in the
control group will be falsely high, biasing the study results toward the
null. If, for example, many women in the control group had vaginitis
and use of an IUD increased the risk of vaginitis, there would be an
excess of IUD users among the controls, masking a possible real
association between IUD use and spontaneous abortion.
Because hospital-based and clinic-based control subjects are usually
unwell and because their diseases may be associated with the risk
factors being studied, the use of hospital- or clinic-based controls can
produce misleading findings. For this reason, the added convenience of
hospital- or clinic-based controls is not often worth the possible threat
to the validity of the study.
Matching. Matching is a simple method of ensuring that cases and
controls are comparable with respect to major factors that are related
to the disease but not of interest to the investigator. So many risk
factors and diseases are related to age and sex, for example, that the
study results may be unconvincing unless the cases and controls are
comparable with regard to these two variables. One approach to
avoiding this problem is to choose controls that match the cases on
these constitutional predictor variables. For example, in a study that
matched on sex and age (say, within 2 years), for a 44-year-old male
case the investigators would choose a male control between the ages
of 42 and 46 years. Alternatively, the investigators can try to make
sure that the overall proportions of men in each age-group are the
same in the cases and controls (a process known as frequency
matching). Matching does have its adverse consequences, however,
particularly when modifiable predictors such as income or serum
cholesterol level are matched. The reasons for this and the
alternatives to matching are discussed in Chapter 9.
Using a population-based sample of cases. Population-based
case-control studies are now possible for many diseases, because
of a rapid increase in the use of disease registries, both in
geographically defined populations and within health maintenance
organizations. Because cases obtained from such registries are
generally representative of the general population of patients in the
area with the disease, the choice of a control group is simplified: it
should be a representative sample from the population covered by the
registry. In Example 8.2, all residents of the town were registered
with the local government, making selection of such a sample
straightforward.
When registries are available, population-based case-control
studies are clearly the most desirable. As the disease registry
approaches completeness and the population it covers approaches
stability (no migration in or out), the population-based case-control
study approaches a case-control study that is nested within a
cohort study or clinical trial (Chapter 7). When information on the
cases and controls can come from previously recorded sources
(thereby avoiding the need for consent of the subject and the selection
bias likely to accompany such consent), this design has the potential for
eliminating sampling bias, because both cases and controls are
selected from the same population. When designing the sampling
approach for a case-control study, the nested case-control
design is useful to keep in mind as the model to emulate.
Using two or more control groups. Because selection of a control
group can be so tricky, particularly when the cases are not a
representative sample of those with disease, it is sometimes
advisable to use two or more control groups selected in different
ways. The Public Health Service study of Reye's syndrome and
medications (14), for example, used four types of controls:
emergency room controls (seen in the same emergency room as the
case), inpatient controls (admitted to the same hospital as the case),
school controls (attending the same school or day care center as the
case), and community controls (identified by random-digit dialing).
The odds ratios for salicylate use in cases compared with each of
these control groups (in the order listed) were 39, 66, 33, and 44,
and each was statistically significant. The consistent finding of a
strong association using control groups that would have a variety of
sampling biases makes a convincing case for the inference that there
is a real association in the population.
Unfortunately, many causal factors have odds ratios that are much
closer to unity, and the biases associated with different strategies for
selecting controls can endanger causal inference. What happens if the
control groups give conflicting results? This is actually helpful,
revealing an inherent fragility in the case-control method for the
research question at hand. If
possible, the investigator should seek additional information to try to
determine the magnitude of potential biases from each of the control
groups. In any case, it is better to have inconsistent results and
conclude that the answer is not known than to have just one control
group and draw the wrong conclusion.
Differential Measurement Bias and How to Control It
The second particular problem of case-control studies is bias due to
measurement error caused by the retrospective approach to measuring
the predictor variables, particularly when it occurs to a different extent in
cases than in controls. Case-control studies of birth defects, for
example, are susceptible to recall bias: parents of babies with birth
defects may be more likely to recall drug exposures than parents of normal
babies, because they will already have been worrying about what caused
the defect. Recall bias cannot occur in a cohort study because the parents
are asked about exposures before the baby is born.
In addition to the strategies set out in Chapter 4 for controlling biased
measurements (standardizing the operational definitions of variables,
choosing objective approaches, supplementing key variables with data from
several sources, etc.), there are two specific strategies for avoiding bias in
measuring risk factors in case-control studies:
Use data recorded before the outcome occurred. It may be
possible, for example, to examine prenatal records in a
case-control study of IM vitamin K as a risk factor for cancer. This
excellent strategy is limited to the extent that recorded information
about the risk factor of interest is available and of satisfactory
reliability. For example, information about vitamin K administration
was often missing from medical records, and how that missing
information was treated affected results of some studies of vitamin K
and subsequent cancer risk (10).
Use blinding. The general approach to blinding was discussed in
Chapter 4, but there are some issues that are specific to designing
interviews in case-control studies. Because both observers and
study subjects could be blinded both to the case-control status of
each subject and to the risk factor being studied, four types of
blinding are possible (Table 8.2).
Ideally, neither the study subjects nor the observers should know which
subjects are cases and which are controls. If this can be done successfully,
differential bias in measuring the predictor variable is eliminated. In
practice, this is often difficult. The subjects know whether they are sick or
well, so they can be blinded to case-control status only if controls are
also ill with diseases that they believe might be related to the risk factors
being studied. (Of course, if the controls are selected for a disease that is
related to the risk factor being studied, it will cause sampling bias.) Efforts
to blind interviewers are hampered by the obvious nature of some diseases
(an interviewer can hardly help noticing if the subject is jaundiced or has
had a laryngectomy), and by the clues that interviewers may discern in the
subject's responses.
Blinding to specific risk factors being studied is usually easier than blinding
to case-control status. Case-control studies are often first steps in
investigating an illness, so there may not be one risk factor of particular
interest. When there is, the study subjects and the interviewer can be kept
in the dark about the study hypotheses by including dummy
questions about plausible risk factors not associated with the disease. For
example, if the specific hypothesis to be tested is whether honey intake is
associated with increased risk of infant botulism, equally detailed
questions about jelly, yogurt, and bananas could be included in the
interview. This type of blinding does not actually prevent differential bias,
but it allows an estimate of whether it is a problem: if the cases report
more exposure to honey but no increase in the other foods, then
differential measurement bias is less likely. This strategy would not work if
the association between infant botulism and honey had previously been
widely publicized or if some of the dummy risk factors turned out to be
real risk factors.

Table 8.2 Approaches to Blinding Interview Questions in a Case-Control Study

Person Blinded | Blinding Case-Control Status | Blinding Risk Factor Measurement
Subject | Possible if both cases and controls have diseases that could plausibly be related to the risk factor | Include dummy risk factors and be suspicious if they differ between cases and controls. May not work if the risk factor for the disease has already been publicized
Observer | Possible if cases are not externally distinguishable from controls, but subtle signs and statements volunteered by the subjects make it difficult | Possible if interviewer is not the investigator, but may be difficult to maintain

Blinding the observer to the case-control status of the study subject is a
particularly good strategy for laboratory measurements such as blood
tests and x-rays. Blinding under these circumstances is easy and should
always be done: someone other than the individual who will make the
measurement simply applies coded identification labels to each specimen.
The importance of blinding was illustrated by 15 case-control studies
comparing measurements of bone mass between hip fracture patients and
controls; much larger differences were found in the studies that used
unblinded measurements than in the blinded studies (15).
Case-Crossover Studies
A variant of the case-control design, useful for studying the short-term
effects of intermittent exposures, is the case-crossover design (16). As
with regular case-control studies, these are retrospective studies that
begin with a group of cases: people who have had the outcome of interest.
However, unlike traditional case-control studies, in which the exposures
of the cases are compared with exposures of a group of controls, in
case-crossover studies each case serves as his or her own control. Exposures of
the cases at the time (or right before) the outcome occurred are compared
with exposures of those same cases at one or more other points in time.
For example, McEvoy et al. (17) studied cases who were injured in car
crashes and reported owning or using a mobile phone. Using phone
company records, they compared mobile phone usage in the 10 minutes
before the crash with usage when the subjects were driving at the same
time of day 24 hours, 72 hours, and 7 days before the crash. They found
that mobile phone usage was more likely in the 10 minutes before a crash
than in the comparison time periods, with an odds ratio of about 4. The
analysis of a case-crossover study is like that of a matched case-control
study, except that the control exposures are exposures of the case at
different time periods, rather than exposures of the matched control. This
is illustrated in the case-crossover study example in Appendix 8A. Other
examples of use of
the case-crossover design include a series of studies of possible triggers of
myocardial infarction, including episodes of anger (18), and use of
marijuana (19) and of sildenafil (Viagra) (20).
Choosing Among Observational Designs
The pros and cons of the main observational designs presented in the
last two chapters are summarized in Table 8.3. We have already described
these issues in detail and will make only one final point here. Among all
these designs, none is best and none is worst; each has its place and
purpose, depending on the research question and the circumstances.

Table 8.3 Advantages and Disadvantages of the Major Observational Designs

Design | Advantages | Disadvantages*
Cohort: All | Establishes sequence of events; multiple predictors and outcomes; number of outcome events grows over time; yields incidence, relative risk, excess risk | Often requires large sample sizes; less feasible for rare outcomes
Cohort: Prospective | More control over subject selection and measurements; avoids bias in measuring predictors | Follow-up can be lengthy; often expensive
Cohort: Retrospective | Follow-up is in the past; relatively inexpensive | Less control over subject selection and measurements
Cohort: Multiple cohort | Useful when distinct cohorts have different or rare exposures | Bias and confounding from sampling several populations
Cross-sectional | Relatively short duration; a good first step for a cohort study or clinical trial; yields prevalence of multiple predictors and outcomes | Does not establish sequence of events; not feasible for rare predictors or rare outcomes; does not yield incidence
Case-control | Useful for rare outcomes; short duration, small sample size; relatively inexpensive; yields odds ratio (resembles relative risk for uncommon outcomes) | Bias and confounding from sampling two populations; differential measurement bias; limited to one outcome variable; sequence of events unclear; does not yield prevalence, incidence, or excess risk
Combination designs: Nested case-control | Advantages of a retrospective cohort design, only much more efficient | Suitable cohort and specimens may not be available
Combination designs: Nested case-cohort | Can use a single control group for multiple case-control studies | Suitable cohort and specimens may not be available
Combination designs: Case-crossover | Cases serve as their own controls, reducing random error and confounding | Requires special circumstances

* All these observational designs have the disadvantage
(compared with randomized trials) of being susceptible to the
influence of confounding variables. See Chapter 9.

Summary
1. In a cross-sectional study, the variables are all measured at a
single point in time, with no structural distinction between predictors
and outcomes. Cross-sectional studies are valuable for providing
descriptive information about prevalence; they also have the
advantage of avoiding the time, expense, and dropout problems
of a follow-up design.
2. Cross-sectional studies yield weaker evidence for causality than
cohort studies, because the predictor variable is not shown to precede
the outcome. A further weakness is the need for a large sample size
(compared with that of a case-control study) when studying
uncommon diseases. The cross-sectional design can be used for an
uncommon disease in a case series of patients with that disease, and
it often serves as the first step of a cohort study or experiment.
3. In a case-control study, the prevalence of risk factors in a sample
of subjects who have a disease or other outcome of interest (the
cases) is compared with that in a separate sample who do not (the
controls). This design is relatively inexpensive and uniquely
efficient for studying rare diseases.
4. One problem with case-control studies is their susceptibility to
sampling bias. Four approaches to reducing sampling bias are (a) to
sample controls and cases in the same (admittedly unrepresentative)
way; (b) to match the cases and controls; (c) to do a
population-based study; and (d) to use several control groups sampled in
different ways.
5. The other major problem with case-control studies is their
retrospective design, which makes them susceptible to measurement
bias that affects cases and controls differentially. Such bias can be
reduced by measuring the predictor prior to the outcome and by
blinding the subjects and observers.
6. Case-crossover studies are a variation on the matched
case-control design in which observations at two points in time
allow each case to serve as his or her own control.
Appendix
Appendix 8A: Calculating Measures of
Association
1. Cross-sectional study. Reijneveld (21) did a cross-sectional study
of maternal smoking as a risk factor for infant colic. Partial results
are shown below:

Predictor Variable | Outcome Variable: Infant Colic | No Infant Colic | Total
Mother smokes 15 to 50 cigarettes/day | 15 (a) | 167 (b) | 182 (a + b)
Mother does not smoke | 111 (c) | 2,477 (d) | 2,588 (c + d)
Total | 126 (a + c) | 2,644 (b + d) | 2,770 (a + b + c + d)

Prevalence of colic with smoking mothers = a/(a + b) = 15/182 =
8.2%.
Prevalence of colic with nonsmoking mothers = c/(c + d) = 111/2,588
= 4.3%.
Prevalence of colic overall = (a + c)/(a + b + c + d) = 126/2,770 =
4.5%.
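The three prevalences above can be checked with a few lines of Python. This is a minimal sketch using the cell counts from the table. The variable names are ours, and the relative prevalence on the last line is an additional measure of association computed from the same cells, not a figure quoted in the text above.

```python
# Cell counts from the cross-sectional table (Reijneveld, reference 21)
a, b = 15, 167      # smoking mothers: colic, no colic
c, d = 111, 2477    # nonsmoking mothers: colic, no colic

prev_smokers = a / (a + b)                   # prevalence among exposed
prev_nonsmokers = c / (c + d)                # prevalence among unexposed
prev_overall = (a + c) / (a + b + c + d)     # prevalence in the whole sample

# Ratio of the two prevalences (an assumption of this sketch: the
# text above stops at the prevalences themselves)
relative_prevalence = prev_smokers / prev_nonsmokers

print(f"smokers:    {prev_smokers:.1%}")
print(f"nonsmokers: {prev_nonsmokers:.1%}")
print(f"overall:    {prev_overall:.1%}")
print(f"relative prevalence: {relative_prevalence:.2f}")
```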
2. Case-control study. The research question for Example 8.2 was
whether there is an association between IM vitamin K and risk of
childhood leukemia. The findings were that 69/107 leukemia cases
and 63/107 controls had received IM vitamin K. A two-by-two table of
these findings is as follows:

Predictor Variable: Medication History | Outcome Variable: Diagnosis, Childhood Leukemia | Control
IM vitamin K | 69 (a) | 63 (b)
No IM vitamin K | 38 (c) | 44 (d)
Total | 107 | 107

The odds ratio is the cross-product of this table:
ad/bc = (69 × 44)/(63 × 38) = 1.27.
Because the disease (leukemia in this instance) is rare, the odds ratio
provides a good estimate of the relative risk. The authors actually did
a multivariate, matched analysis, as was appropriate for the matched
design, but in this case the simple, unmatched odds ratio was almost
the same as the one reported in the study.
3. Matched case-control study
(To illustrate the similarity between analysis of a matched
case-control study and a case-crossover study, we will use
the same example for both.) The
research question is whether mobile telephone use increases the risk
of car crashes among mobile telephone owners. A traditional matched
case-control study might consider self-reported frequency of using
a mobile telephone while driving as the risk factor. Then the cases
would be people injured in crashes and they could be matched to
controls who had not been in crashes by age, sex, and mobile
telephone prefix. The cases and controls would be asked whether they
ever use a mobile telephone while driving. (To simplify, for this
example, we dichotomize the exposure and consider people as either
users or nonusers of mobile telephones while driving.)
We then classify each case/control pair according to whether both are
users, neither is a user, or the case was a user but not the control, or
the control was a user but not the case. If we had 300 pairs, the
results might look like this:

Matched Controls | Cases (with crash injuries): User | Cases: Nonuser | Total
User | 110 | 40 | 150
Nonuser | 90 | 60 | 150
Total | 200 | 100 | 300

The table above shows that there were 90 pairs where the case ever
used a mobile phone while driving, but not the matched control, and
40 pairs where the matched control but not the case was a
user. Note that this 2 × 2 table is different from the 2 × 2
table from the unmatched vitamin K study above, in which each cell in
the table is the number of people in that cell. In the 2 × 2 table for
a matched case-control study the number in each cell is the
number of pairs of subjects in that cell; the total N in the table above
is therefore 600 (300 cases and 300 controls). The odds ratio for such
a table is simply the ratio of the two types of discordant pairs; in the
table above the OR = 90/40 = 2.25.
4. Case-crossover study
Now consider the case-crossover study of the same question. Data
from the study by McEvoy et al. are shown below.

Seven Days Before Crash | Crash Time Period: Driver Using Phone | Crash Time Period: Not Using | Total
Driver using phone | 5 | 6 | 11
Not using | 27 | 288 | 315
Total | 32 | 294 | 326

For the case-crossover study, each cell in the table is a number of
subjects, not a number of pairs, but each cell represents two time
periods for that one subject: the time period just before the crash
and a comparison time period 7 days before. Therefore the 5 in the
upper left cell means there were 5 drivers involved in crashes who
were using a mobile phone just before they crashed, and also using a
mobile phone during the comparison period 7 days before, while the
27 just below indicates that there were 27 drivers involved in crashes
who were using a phone just before crashing, but not using a phone
during the comparison period 7 days before. The odds ratio is the
ratio of the numbers of discordant time periods, in this example 27/6
= 4.5.
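The odds ratio calculations in items 2 through 4 can be sketched as follows. The function names are ours, chosen for illustration. The unmatched odds ratio uses the cross-product ad/bc defined in Appendix 8B, and the matched and case-crossover odds ratios use only the discordant cells, as described above.

```python
def unmatched_odds_ratio(a, b, c, d):
    """Cross-product odds ratio ad/bc for an unmatched two-by-two table
    (a, b = exposed cases, exposed controls; c, d = unexposed)."""
    return (a * d) / (b * c)


def discordant_odds_ratio(case_exposed_only, control_exposed_only):
    """Odds ratio for matched pairs (or for a case-crossover table):
    the ratio of the two kinds of discordant pairs or time periods."""
    return case_exposed_only / control_exposed_only


# Item 2: IM vitamin K and childhood leukemia (unmatched analysis)
or_vitamin_k = unmatched_odds_ratio(a=69, b=63, c=38, d=44)

# Item 3: hypothetical 300 matched pairs; 90 pairs with only the case a
# user, 40 pairs with only the control a user
or_matched = discordant_odds_ratio(90, 40)

# Item 4: McEvoy et al.; 27 drivers using a phone only before the crash,
# 6 using it only in the comparison period 7 days earlier
or_crossover = discordant_odds_ratio(27, 6)

print(round(or_vitamin_k, 2), or_matched, or_crossover)
```

Note that the concordant cells (110 and 60 in the matched table, 5 and 288 in the case-crossover table) do not enter the matched calculation at all.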
Appendix 8B: Why the Odds Ratio Can
Be Used as an Estimate for Relative
Risk in a Case-Control Study
The data in a case-control study represent two samples: the cases are
drawn from a population of people who have the disease and the controls
from a population of people who do not have the disease. The predictor
variable is measured, and the following two-by-two table produced:

 | Disease | No Disease
Risk factor present | a | b
Risk factor absent | c | d

If this two-by-two table represented data from a cohort study, then the
incidence of the disease in those with the risk factor would be a/(a + b)
and the relative risk would be simply [a/(a + b)]/[c/(c + d)]. However, it
is not appropriate to compute either incidence or relative risk in this way
in a case-control study because the two samples are not drawn from the
population in the same proportions. Usually, there are roughly equal
numbers of cases and controls in the study samples but many fewer cases
than controls in the population. Instead, relative risk in a case-control
study can be approximated by the odds ratio, computed as the
cross-product of the two-by-two table, ad/bc.
This extremely useful fact is difficult to grasp intuitively but easy to
demonstrate algebraically. Consider the situation for the full population,
represented by a', b', c', and d'.

 | Disease | No Disease
Risk factor present | a' | b'
Risk factor absent | c' | d'

Here it is appropriate to calculate the risk of disease among people with
the risk factor as a'/(a' + b'), the risk among those without the
risk factor as c'/(c' + d'), and the relative risk as [a'/(a' +
b')]/[c'/(c' + d')]. We have already discussed the fact that
a'/(a' + b') is not equal to a/(a + b). However, if the disease is
relatively uncommon (as most are), then a' is much smaller than b',
and c' is much smaller than d'. This means that a'/(a' + b')
is closely approximated by a'/b' and that c'/(c' + d') is
closely approximated by c'/d'. Therefore the relative risk of the
population can be approximated as follows:

[a'/(a' + b')]/[c'/(c' + d')] ≈ (a'/b')/(c'/d')

The latter term is the odds ratio of the population (literally, the ratio of
the odds of disease in those with the risk factor, a'/b', to the odds of
disease in those without the risk factor, c'/d'). This can be
rearranged as the cross-product:

(a'/b')/(c'/d') = a'd'/b'c' = (a'/c') × (d'/b')

However, a'/c' in the population equals a/c in the sample if the cases
are representative of all cases in the population (i.e., have the same
prevalence of the risk factor). Similarly, b'/d' equals b/d if the
controls are representative.
Therefore the population parameters in this last term can be replaced by
the sample parameters, and we are left with the fact that the odds ratio
observed in the sample, ad/bc, is a close approximation of the relative risk
in the population, [a'/(a' + b')]/[c'/(c' + d')], provided
that the disease is rare and sampling error (systematic as well as random)
is small.
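A short numerical check makes the role of the rare-disease assumption concrete. The two populations below are invented for illustration (they are not data from any study cited here). Both have a relative risk of exactly 2, but the odds ratio is close to it only when the disease is uncommon.

```python
def relative_risk(a, b, c, d):
    """Risk in those with the risk factor divided by risk in those without."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product odds ratio ad/bc."""
    return (a * d) / (b * c)


# Hypothetical population counts: (diseased exposed, nondiseased exposed,
#                                  diseased unexposed, nondiseased unexposed)
rare = (20, 9980, 10, 9990)          # risks of 0.2% and 0.1%
common = (4000, 6000, 2000, 8000)    # risks of 40% and 20%

rr_rare, or_rare = relative_risk(*rare), odds_ratio(*rare)
rr_common, or_common = relative_risk(*common), odds_ratio(*common)

print(f"rare disease:   RR = {rr_rare:.3f}, OR = {or_rare:.3f}")
print(f"common disease: RR = {rr_common:.3f}, OR = {or_common:.3f}")
```

With the common disease the odds ratio overstates the relative risk, which is why the approximation is stated only for uncommon outcomes.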
References
1. Andersen RE, Crespo CJ, Bartlett SJ, et al. Relationship of
physical activity and television watching with body weight and level of
fatness among children: results from the Third National Health and
Nutrition Examination Survey. JAMA 1998;279(12):938-942.
2. Sargent JD, Beach ML, Adachi-Mejia AM, et al. Exposure to movie
smoking: its relation to smoking initiation among US adolescents.
Pediatrics 2005;116(5):1183-1191.
3. Jaffe HW, Bregman DJ, Selik RM. Acquired immune deficiency
syndrome in the United States: the first 1,000 cases. J Infect Dis
1983;148(2):339-345.
4. Zito JM, Safer DJ, DosReis S, et al. Psychotropic practice patterns
for youth: a 10-year perspective. Arch Pediatr Adolesc Med
2003;157(1):17-25.
5. Herbst AL, Ulfelder H, Poskanzer DC. Adenocarcinoma of the vagina.
Association of maternal stilbestrol therapy with tumor appearance in
young women. N Engl J Med 1971;284(15):878-881.
6. Beal SM, Finch CF. An overview of retrospective case-control studies
investigating the relationship between prone sleeping position and
SIDS. J Paediatr Child Health 1991;27(6):334-339.
7. Golding J, Paterson M, Kinlen LJ. Factors associated with childhood
cancer in a national cohort study. Br J Cancer 1990;62(2):304-308.
8. Golding J, Greenwood R, Birmingham K, et al. Childhood cancer,
intramuscular vitamin K, and pethidine given during labour. BMJ
1992;305(6849):341-346.
9. von Kries R, Gobel U, Hachmeister A, et al. Vitamin K and childhood
cancer: a population based case-control study in Lower Saxony,
Germany. BMJ 1996;313(7051):199-203.
10. Roman E, Fear NT, Ansell P, et al. Vitamin K and childhood cancer:
analysis of individual patient data from six case-control studies. Br J
Cancer 2002;86(1):63-69.
11. Fear NT, Roman E, Ansell P, et al. Vitamin K and childhood cancer:
a report from the United Kingdom Childhood Cancer Study. Br J Cancer
2003;89(7):1228-1231.
12. Kochen M, McCurdy S. Circumcision and the risk of cancer of the
penis. A life-table analysis. Am J Dis Child 1980;134:484-486.
13. O'Brien KL, Selanikio JD, Hecdivert C, et al. Epidemic of pediatric
deaths from acute renal failure caused by diethylene glycol poisoning.
Acute Renal Failure Investigation Team. JAMA
1998;279(15):1175-1180.
14. Hurwitz ES, Barrett MJ, Bregman D, et al. Public Health Service
study of Reye's syndrome and medications. Report of the main study.
JAMA 1987;257(14):1905-1911.
15. Cummings SR. Are patients with hip fractures more osteoporotic?
Review of the evidence. Am J Med 1985;78:487-494.
16. Maclure M, Mittleman MA. Should we use a case-crossover design?
Annu Rev Public Health 2000;21:193-221.
17. McEvoy SP, Stevenson MR, McCartt AT, et al. Role of mobile phones
in motor vehicle crashes resulting in hospital attendance: a
case-crossover study. BMJ 2005;331(7514):428.
18. Mittleman MA, Maclure M, Sherwood JB, et al. Determinants of
Myocardial Infarction Onset Study Investigators. Triggering of acute
myocardial infarction onset by episodes of anger. Circulation
1995;92(7):1720-1725.
19. Mittleman MA, Lewis RA, Maclure M, et al. Triggering myocardial
infarction by marijuana. Circulation 2001;103(23):2805-2809.
20. Mittleman MA, Maclure M, Glasser DB. Evaluation of acute risk for
myocardial infarction in men treated with sildenafil citrate. Am J
Cardiol 2005;96(3):443-446.
21. Reijneveld SA, Brugman E, Hirasing RA. Infantile colic: maternal
smoking as potential risk factor. Arch Dis Child 2000;83(4):302-303.
9
Enhancing Causal Inference in Observational
Studies
Thomas B. Newman
Warren S. Browner
Stephen B. Hulley
For many research questions, the inference that an association represents a
cause-effect relation is important. (Exceptions are studies of diagnostic and
prognostic tests, discussed in Chapter 12.) The ability to make that inference depends
upon decisions made during both the design and analysis phases of a study. Although
this text is concerned primarily with designing clinical research, in this chapter we
discuss ways to strengthen causal inference in both phases, because knowledge of
analysis phase options can help inform decisions about study design. We begin with a
discussion of how to avoid spurious associations and then concentrate on ruling out
real associations that do not represent cause-effect, especially those due to
confounding.
Suppose that a study reveals an association between coffee drinking and myocardial
infarction (MI). One possibility is that coffee drinking is a cause of MI. Before reaching
this conclusion, however, four rival explanations must be considered (Table 9.1). The
first two of these, chance (random error) and bias (systematic error), represent
spurious associations: coffee drinking and MI are associated only in the study findings,
not in the population.
Even if the association is real, however, it may not represent a cause-effect
relationship. Two rival explanations must be considered. One is the possibility of
effect-cause: that having an MI makes people drink more coffee. (This is just
cause and effect in reverse.) The other is the possibility of confounding, in which a
third factor (such as cigarette smoking) is both associated with coffee drinking and a
cause of MI.
Spurious Associations
Ruling Out Spurious Associations Due to Chance
Imagine that there is no association between coffee drinking and MI in the population,
and that 60% of the entire population drinks coffee, whether or not they have had an
MI. If we were to select a random sample of 20 patients with MI, we would expect
about 12 of them to drink coffee. But by chance alone we might happen to get 19
coffee drinkers in a sample of 20 patients with MI. In that case, unless we were lucky
enough to get a similar chance excess of coffee drinkers among the controls, a
spurious association between coffee consumption and MI would be observed. Such an
association due to random error (chance), if statistically significant, is called a type
I error (Chapter 5).
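How unlikely is the chance excess in this example? A minimal sketch, assuming the number of coffee drinkers in the sample follows a binomial distribution with n = 20 and p = 0.6 (the scenario just described), and asking for the probability of 19 or more coffee drinkers rather than exactly 19. The helper name `prob_at_least` is ours.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial probability of k or more successes in n independent draws,
    each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


n, p = 20, 0.6
expected_drinkers = n * p                  # about 12 coffee drinkers expected
p_19_or_more = prob_at_least(19, n, p)     # chance of 19 or 20 coffee drinkers

print(f"expected coffee drinkers: {expected_drinkers:.0f}")
print(f"P(19 or more of 20) = {p_19_or_more:.6f}")
```

The probability is small (roughly 5 in 10,000) but not zero, so across many studies such chance excesses will occasionally appear.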
Strategies for addressing random error are available in both the design and analysis
phases of research (Table 9.2). The design strategies of increasing the precision of
measurements and increasing the sample size are important ways to reduce random
error that are discussed in Chapters 4 and 6. The analysis strategy of calculating P
values helps the investigator quantify the magnitude of the observed association in
comparison with what might have occurred by chance alone. For example, a P value of
0.10 indicates that, if there were no association in the population, a value of the test
statistic as large as the one observed (or larger) would occur by chance alone about
one time in ten. Confidence intervals show a range of values for
the parameter being estimated (e.g., risk ratio) that are consistent with the
study's results.
Table 9.1 The Five Explanations When an Association between Coffee Drinking and Myocardial Infarction (MI) is Observed in a Sample

Explanation | Type of Association | What's Really Going on in the Population? | Causal Model
1. Chance (random error) | Spurious | Coffee drinking and MI are not related |
2. Bias (systematic error) | Spurious | Coffee drinking and MI are not related |
3. Effect-cause | Real | MI is a cause of coffee drinking | Coffee drinking ← MI
4. Confounding | Real | Coffee drinking is associated with a third, extrinsic factor that is a cause of MI |
5. Cause-effect | Real | Coffee drinking is a cause of MI | Coffee drinking → MI

Ruling Out Spurious Associations Due to Bias
Associations that are spurious because of bias are trickier. To understand bias it is
helpful to distinguish between the research question and the question actually
answered by the study (Chapter 1). The research question is what the investigator
really wishes to answer, while the question answered by the study reflects the
compromises the investigator needed to make for the study to be feasible. Bias can be
thought of as a systematic difference between the research question and the actual
question answered by the study that causes the study to give the wrong answer to the
research question. Strategies for minimizing these systematic errors are available in
both the design and analysis phases of research (Table 9.2).
Design Phase. Many kinds of bias have been identified, and dealing with some
of them has been a major topic of this book. To the specific strategies noted in
Chapters 3, 4, 7, and 8
we now add a general approach to minimizing sources of bias. Write down the
research question and the study plan side by side, as in Figure 9.1. Then
carefully think through the following three concerns as they pertain to this
particular research question:
a. Do the samples of study subjects (e.g., cases and controls or exposed and
unexposed subjects) represent the population(s) of interest?
b. Do the measurements of the predictor variables represent the predictors
of interest?
c. Do the measurements of the outcome variables represent the outcomes
of interest?
P.129
Table 9.2 Strengthening the Inference that an
Association has a CauseEffect Basis: Ruling
Out Spurious Associations
Type of
Spurious
Association
Design Phase (How to Prevent
the Rival Explanation)
Analysis Phase
(How to Evaluate
the Rival
Explanation)
Chance (due
to random
error)
Increase sampl e size and other
strategies to i ncrease preci si on
(Chapters 4 and 6)
Cal culate P
val ues and
confi dence
i nterval s
Interpret them i n
the context of
pri or evi dence
(Chapter 5)
Bi as (due to
systemati c
error)
Carefull y consider the potenti al
consequences of each difference
between the research questi on
and the study pl an (see Fi g. 1.6):
Obtai n addi ti onal
data to see i f
potential bi ases
have actual l y
occurred
...
For each question answered "No" or "Maybe not," consider whether the bias applies similarly to one or both groups studied (e.g., cases and controls or exposed and unexposed) and whether it is large enough to affect the answer to the research question.
To illustrate this with our coffee and MI example, consider the implications of drawing the sample of control subjects from a population of hospitalized patients. If many of these patients have chronic illnesses that have caused them to reduce their coffee intake, the sample of controls will not represent the target population from which the patients with MI arose; there will be a shortage of coffee drinkers. Furthermore, if coffee drinking is measured by questionnaire, the answers on the questionnaire may not accurately represent actual coffee drinking, the predictor of interest. And if esophageal spasm, which can be exacerbated by coffee, is misdiagnosed as MI, a spurious association between coffee and MI could be found because the measured outcome (diagnosis of MI) did not accurately represent the outcome of interest (actual MI).
The next step is to think about possible strategies for preventing each potential bias. For example, as discussed in Chapter 8, selecting more than one control group in a case-control study is one approach to addressing sampling bias. In Chapter 4 we suggested strategies for reducing measurement bias. In each case, judgments are required about the likelihood of bias, and how easily it could be prevented with changes in the study plan. If the bias is easily preventable, revise the study plan and ask the three questions again. If the bias is not easily preventable, decide whether the study is still worth doing by making a judgment on the likelihood of the potential bias and the degree to which it will compromise the conclusions.
population/subjects
phenomena/measurements
Check consistency with other studies (especially those using different methods)
...
Analysis Phase. The investigator is often faced with one or more potential biases after the data have been collected. Some may have been anticipated but too difficult to prevent, and others may not have been suspected until it was too late to avoid them.
In either situation, one approach is to obtain additional information to estimate the magnitude of the potential bias. Suppose, for example, the investigator is concerned that the hospitalized control subjects do not represent the target population of people free of MI because they have decreased their coffee intake due to chronic illness. The magnitude of this sampling bias could be estimated by reviewing the diagnoses of the control subjects and separating them into two groups: those with illnesses that might alter coffee habits and those with illnesses that would not. If both types of controls drank less coffee than the MI cases, then sampling bias would be a less likely explanation for the findings. Similarly, if the investigator is concerned that a questionnaire does not accurately capture coffee drinking (perhaps because of poorly worded questions), she could assign a blinded interviewer to question a subset of the cases and
FIGURE 9.1. Minimizing bias by comparing the research question and the study plan.
...
controls to determine the agreement with their questionnaire responses. Finally, if it is the outcome measure that is in doubt, the investigator could specify objective electrocardiographic and serum enzyme changes needed for the diagnosis, and reanalyze the data excluding the subset of cases that do not meet these criteria.
The investigator can also look at the results of other studies. If the conclusions are consistent, the association is less likely to be due to bias. This is especially true if the other studies have used different methods and are therefore unlikely to share the same biases. In many cases, potential biases turn out not to be a major problem. The decision on how vigorously to pursue additional information and how best to discuss these issues in reporting the study are matters of judgment for which it is helpful to seek advice from colleagues.
Real Associations Other than Cause-Effect
In addition to chance and bias, the two types of associations that are real but do not represent cause-effect must be considered (Table 9.3).
Effect-Cause
One possibility is that the cart has come before the horse: the outcome has caused the predictor. Effect-cause is often a problem in cross-sectional and case-control studies, especially when the predictor variable is a laboratory test for which no previous values are available, and in case-crossover studies if the timing of events is uncertain. For example, in the study of mobile phone use and motor vehicle accidents described in Chapter 8, a car crash could cause a mobile phone call (to report the crash right after it happened), rather than vice versa. To address this possibility, the investigators asked drivers who had been involved in a crash about phone use both before and after the crash, and verified the responses using phone records and the estimated time of the crash (1).
Table 9.3 Strengthening the Inference that an Association has a Cause-Effect Basis: Ruling Out Other Real Associations

Type of real association: Effect-cause (the outcome is actually the cause of the predictor)
  Analysis phase (how to evaluate the rival explanation): Consider biologic plausibility. Consider findings of other studies with different designs.
  Design phase (how to prevent the rival explanation): Do a longitudinal study. Obtain data on the historic sequence of the variables. (Ultimate solution:
...
Effect-cause is less commonly a problem in cohort studies because risk factor measurements can be made in a group of people who do not yet have the disease. Even in cohort studies, however, effect-cause is possible if the disease has a long latent period and those with subclinical disease cannot be identified at baseline. For example, type 2 diabetes is associated with subsequent risk of pancreatic cancer. Some of this association is almost certainly effect-cause, because pancreatic cancer can cause diabetes, and the association between diabetes and pancreatic cancer diminishes with follow-up time (2). However, some association persists (a relative risk of about 1.5) even when pancreatic cancer cases diagnosed within 4 years of the onset of diabetes are excluded, leaving open the possibility that part of the relationship might be cause-effect.
This example illustrates a general approach to ruling out effect-cause: drawing inferences from assessments of the variables at different points in time. In addition, effect-cause is often unlikely on the grounds of biologic implausibility. For example, it is unlikely that incipient lung cancer causes cigarette smoking.
Confounding
The other rival explanation in Table 9.3 is confounding, which occurs when there is a third factor involved in the association that is the real cause of the outcome. The word confounding usually means something that confuses interpretation, but in clinical research the term has a more specific definition.
A confounding variable is one that is associated with the predictor variable, and a cause of the outcome variable.
Cigarette smoking is a likely confounder in the coffee and MI example because smoking is associated with coffee drinking and is a cause of MI. If this is the actual explanation, then the association between coffee and MI does not represent cause-effect although it is still real; the coffee is an innocent bystander. Appendix 9A gives a numeric example of how cigarette smoking could cause an apparent association between coffee drinking and MI.
Aside from bias, confounding is often the only likely alternative explanation to cause-effect and the most important one to try to rule out. It is also the most challenging; much of the rest of this chapter is devoted to strategies for coping with confounders.
Coping With Confounders in the Design Phase
In observational studies, most strategies for coping with confounding variables require that an investigator be aware of and able to measure them. It is helpful to list the variables (like age and sex) that may be associated with the predictor variable of
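(Table 9.3, continued)
Effect-cause, design phase (continued): do a randomized trial)
Type of real association: Confounding (another variable is associated with the predictor and a cause of the outcome)
  Design phase: See Table 9.4
  Analysis phase: See Table 9.5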
...
interest and that may also be a cause of the outcome. The investigator must then choose among design and analysis strategies for controlling the influence of these potential confounding variables.
Table 9.4 Design Phase Strategies for Coping with Confounders

Strategy: Specification
  Advantages: Easily understood. Focuses the sample of subjects for the research question at hand.
  Disadvantages: Limits generalizability. May make it difficult to acquire an adequate sample size.

Strategy: Matching
  Advantages: Can eliminate the influence of strong constitutional confounders like age and sex. Can eliminate the influence of confounders that are difficult to measure. Can increase precision (power) by balancing the number of cases and controls in each stratum. May be a sampling convenience, making it easier to select the controls in a case-control study.
  Disadvantages: May be time consuming and expensive; less efficient than increasing the number of subjects. Decision to match must be made at the outset of the study and can have an irreversible adverse effect on the analysis and conclusions. Requires an early decision about which variables are predictors and which are confounders. Eliminates the option of studying matched variables as predictors or as intervening variables. Requires a matched analysis. Creates the danger of overmatching (i.e., matching on a factor that is not a confounder, thereby reducing power). Only feasible for case-control and multiple-cohort studies.
...
The first two design phase strategies (Table 9.4), specification and matching, involve changes in the sampling scheme. Cases and controls (in a case-control study) or exposed and unexposed subjects (in a cohort study) are sampled in such a way that they have comparable values of the confounding variable. This removes the confounder as an explanation for any association that is observed between predictor and outcome. The third design phase strategy, use of what we call opportunistic study designs, is only applicable to selected research questions for which the right conditions exist. However, when applicable, these designs resemble randomized trials in their ability to reduce or eliminate confounding not only by measured variables, but by unmeasured variables as well.
Specification
The simplest strategy is to design inclusion criteria that specify a value of the potential confounding variable and exclude everyone with a different value. For example, the investigator studying coffee and MI could specify that only nonsmokers be included in the study. If an association were then observed between coffee and MI, it obviously could not be due to smoking.
Specification is an effective strategy, but, as with all restrictions in the sampling scheme, it has disadvantages. First, even if coffee does not cause MI in nonsmokers, it may cause MI in smokers. (This phenomenon, an effect of coffee on MI that is different in smokers from that in nonsmokers, is called effect modification or interaction.) Therefore, specification limits the generalizability of information available from a study, in this instance compromising our ability to generalize to smokers. Second, if smoking is highly prevalent among the patients available for the study, the investigator may not be able to recruit a large enough sample of nonsmokers.
These problems can become serious if specification is used to control too many confounders or to control them too narrowly. Sample size and generalizability would be major problems if a study were restricted to lower-income, nonsmoking, 70- to 75-year-old men.
Matching
In a case-control study, matching involves selecting cases and controls with matching values of the confounding variable(s). Matching and specification are both sampling strategies that prevent confounding by allowing comparison only of cases and controls that share comparable levels of the confounder. Matching differs from
(Table 9.4, continued)
Strategy: Opportunistic study designs
  Advantages: Can provide great strength of causal inference. May be a lower cost and elegant alternative to a randomized trial.
  Disadvantages: Only possible in select circumstances where the predictor variable is randomly or virtually randomly assigned, and an instrumental variable exists.
...
specification, however, in preserving generalizability because subjects at all levels of the confounder can be studied.
Matching is usually done individually (pairwise matching). In the study of coffee drinking as a predictor of MI, for example, each case (a patient with an MI) could be individually matched to one or more controls that smoked roughly the same amount as the case (e.g., 10 to 20 cigarettes/day). The coffee drinking of each case would then be compared with the coffee drinking of the matched control(s).
An alternative approach to pairwise matching is to match in groups (frequency matching). For each level of smoking, the number of cases with that amount of smoking could be counted, and an appropriate number of controls with the same level of smoking could be selected. If the study called for two controls per case and there were 20 cases that had smoked 10 to 20 cigarettes/day, the investigators would select 40 controls that smoked this amount, matched as a group to the 20 cases.
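The frequency-matching arithmetic just described (two controls per case within each smoking level) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frequency_match`, the record layout, the random selection of controls within a stratum, and all of the data are hypothetical and are not taken from any study cited in this chapter.

```python
import random
from collections import Counter, defaultdict

def frequency_match(cases, control_pool, key, controls_per_case=2, seed=0):
    """Pick controls so that each stratum holds controls_per_case controls per case."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    cases_per_stratum = Counter(key(c) for c in cases)
    pool_by_stratum = defaultdict(list)
    for person in control_pool:
        pool_by_stratum[key(person)].append(person)
    selected = []
    for stratum, n_cases in cases_per_stratum.items():
        needed = n_cases * controls_per_case
        available = pool_by_stratum[stratum]
        if len(available) < needed:
            raise ValueError(f"stratum {stratum!r}: need {needed}, only {len(available)} available")
        selected.extend(rng.sample(available, needed))
    return selected

# Hypothetical data: 20 cases who smoked 10 to 20 cigarettes/day and 5 nonsmoking cases.
cases = [{"id": i, "smoking": "10-20/day"} for i in range(20)]
cases += [{"id": 100 + i, "smoking": "none"} for i in range(5)]
# Hypothetical pool of potential controls.
pool = [{"id": 1000 + i, "smoking": "10-20/day"} for i in range(60)]
pool += [{"id": 2000 + i, "smoking": "none"} for i in range(30)]

controls = frequency_match(cases, pool, key=lambda p: p["smoking"])
counts = Counter(c["smoking"] for c in controls)
print(counts["10-20/day"], counts["none"])  # 40 controls for the 20 cases, 10 for the 5 cases
```

The sketch raises an error when a stratum has too few potential controls, which mirrors the practical difficulty of finding matches noted later in the text.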
Matching is most commonly used in case-control studies, but it can also be used with multiple-cohort designs. For example, to investigate the effects of service in the 1990 to 1991 Gulf War on subsequent fertility in male veterans, Maconochie et al. compared 51,581 men deployed to the Gulf region during the war with 51,688 men who were not deployed, but were frequency-matched by service, age, fitness to be deployed, serving status and rank (3). There was a slightly higher risk of reported infertility and a longer time to conception in the Gulf War veterans.
There are four main advantages to matching (Table 9.4). The first three relate to the control of confounding variables; the last is a matter of logistics.
Matching is an effective way to prevent confounding by constitutional factors like age and sex that are strong determinants of outcome, not susceptible to intervention, and unlikely to be an intermediary in a causal pathway.
Matching can be used to control confounders that cannot be measured and controlled in any other way. For example, matching siblings (or, better yet, twins) with one another can control for a whole range of genetic and familial factors that would be impossible to measure, and matching for clinical center in a multicenter study can control for unspecified differences among the populations seen at the centers.
Matching may increase the precision of comparisons between groups (and therefore the power of the study to find a real association) by balancing the number of cases and controls at each level of the confounder. This may be important if the available number of cases is limited or if the cost of studying the subjects is high. However, the effect of matching on precision is modest and not always favorable (see "overmatching," below). In general, the desire to enhance precision is a less important reason to match than the need to control confounding.
Finally, matching may be used primarily as a sampling convenience, to narrow down an otherwise impossibly large number of potential controls. For example, in a nationwide study of toxic shock syndrome, victims were asked to identify friends to serve as controls (3). This convenience, however, also runs the risk of overmatching.
There are a number of disadvantages to matching (Table 9.4).
Matching sometimes requires additional time and expense to identify a match
...
for each subject. In case-control studies, for example, the more matching criteria there are, the larger the pool of controls that must be searched to match each case. Cases for which no match can be found will need to be discarded. The possible increase in statistical power from matching must therefore be weighed against the potential loss of otherwise eligible cases or controls.
Because matching is a sampling strategy, the decision to match must be made at the beginning of the study and is irreversible. This precludes further analysis of the effect of the matched variables on the outcome. It also can create a serious error if the matching variable is not a fixed (constitutional) variable like age or sex, but a variable intermediate in the causal pathway between the predictor and outcome. For example, if an investigator wishing to investigate the effects of alcohol intake on risk of MI matched on serum high-density lipoprotein (HDL) levels, she would miss any beneficial effects of alcohol that are mediated through an increase in HDL. Although the same error can occur with the analysis phase strategies discussed later, matching builds the error into the study in a way that cannot be undone; with the analysis phase strategies the error can be avoided simply by appropriately altering the analysis.
Correct analysis of pair-matched data requires special analytic techniques that compare each subject only with the individual(s) with whom she has been matched, and not with subjects who have differing levels of confounders. The use of ordinary statistical analysis techniques on matched data can lead to incorrect results (generally biased toward no effect) because the assumption that the groups are sampled independently is violated. This sometimes creates a problem because the appropriate matched analyses, especially multivariate techniques, are less familiar to most investigators and less readily available in packaged statistical programs than are the usual unmatched techniques.
A final disadvantage of matching is the possibility of overmatching, which occurs when the matching variable is not a confounder because it is not associated with the outcome. Overmatching can reduce the power of a case-control study, making it more difficult to find an association that really exists in the population. In the study of toxic shock syndrome that used friends for controls, for example, matching may have inappropriately controlled for regional differences in tampon marketing, making it more probable that cases and controls would use the same brand of tampon. It is important to note, however, that overmatching will not distort the estimated relative risk (provided that a matched analysis is used); it will only reduce its statistical significance (footnote 1). Therefore when the findings of the study are statistically significant (as was the case in the toxic shock example), overmatching is not a problem.
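The point made in the list above, that pair-matched data need a matched analysis and that an ordinary unmatched analysis tends to be biased toward no effect, can be illustrated with a small Python sketch. The choice of technique is an assumption of this illustration rather than something the chapter specifies: it uses the standard discordant-pair odds ratio and the McNemar chi-square statistic (without continuity correction) for 1:1 matched pairs. The function names and the pair counts are hypothetical.

```python
def matched_pair_odds_ratio(pairs):
    """Matched analysis of 1:1 pairs given as (case_exposed, control_exposed) booleans.

    Only discordant pairs carry information: the odds ratio is the number of pairs
    in which only the case is exposed divided by the number in which only the
    control is exposed.
    """
    case_only = sum(1 for case, ctrl in pairs if case and not ctrl)
    control_only = sum(1 for case, ctrl in pairs if ctrl and not case)
    if control_only == 0:
        raise ValueError("no pairs with only the control exposed; odds ratio undefined")
    odds_ratio = case_only / control_only
    mcnemar_chi2 = (case_only - control_only) ** 2 / (case_only + control_only)
    return odds_ratio, mcnemar_chi2

def unmatched_odds_ratio(pairs):
    """Ordinary 2x2 odds ratio that ignores the pairing (shown for comparison only)."""
    exposed_cases = sum(1 for case, _ in pairs if case)
    unexposed_cases = len(pairs) - exposed_cases
    exposed_controls = sum(1 for _, ctrl in pairs if ctrl)
    unexposed_controls = len(pairs) - exposed_controls
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Hypothetical 100 matched pairs: 30 both exposed, 30 only the case exposed,
# 10 only the control exposed, 30 neither exposed.
pairs = ([(True, True)] * 30 + [(True, False)] * 30 +
         [(False, True)] * 10 + [(False, False)] * 30)

matched_or, chi2 = matched_pair_odds_ratio(pairs)
crude_or = unmatched_odds_ratio(pairs)
print(matched_or, chi2, crude_or)  # 3.0 10.0 2.25
```

In these made-up data the unmatched odds ratio (2.25) is closer to 1 than the matched odds ratio (3.0), which is the direction of error the text describes.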
Opportunistic Studies
Under certain conditions or for certain research questions, there may be opportunities to control for confounding variables in the design phase, even without measuring them. Because these designs are not generally available, we call them "opportunistic" designs. One example for short-term exposures with immediate effects is the case-crossover study (Chapter 8): all potential confounding variables that are constant over time (e.g., sex, race, social class, genetic factors) are controlled because each subject is compared only to herself in a different time period.
Occasionally, investigators discover a natural experiment, in which subjects are either exposed or not exposed through a process that in effect randomly allocates them to have or not have a risk factor or intervention. For example, Lofgren et al. (4) studied the effects of discontinuity of care on test ordering and length of stay by
...
taking advantage of the fact that patients admitted after 5:00 PM to their institution were alternately assigned to senior residents that either maintained care of the patients or transferred them to another team the following morning. They found that patients whose care was transferred had 38% more laboratory tests (P = 0.01) and 2-day longer median length of stay (P = 0.06) than those kept on the same team. Similarly, Bell and Redelmeier (5) studied effects of nursing staffing by comparing outcomes for patients with selected diagnoses who were admitted on weekends to those admitted on weekdays. They found higher mortality from all three conditions hypothesized to be sensitive to staffing ratios, but not for other conditions.
As genetic differences in susceptibility to an exposure are elucidated, a strategy called Mendelian randomization (3) becomes an option. Mendelian randomization takes advantage of the fact that for common genetic polymorphisms, the allele a person receives is determined at random within families, and usually not linked to relevant confounding variables. Therefore, if people with alleles expected to confer increased susceptibility to a risk factor do indeed have a higher rate of disease than those who are either unexposed, or exposed but less susceptible, the study can provide strong evidence for causality.
For example, some farmers who dip sheep in insecticides (to kill ticks, lice, etc.) have health complaints that might or might not be due to their occupational exposures. Investigators at the University of Manchester took advantage of a polymorphism in the paraoxonase-1 gene, which leads to enzymes with differing ability to hydrolyze the organophosphate sheep dip diazinonoxon. They hypothesized that if sheep dip was a cause of ill health in exposed farmers, farmers with ill health would be more likely to have alleles associated with reduced paraoxonase-1 activity. They asked farmers who believed that sheep dip had adversely affected their health to suggest control farmers who were similarly exposed to sheep dip, but in good health. Their finding that exposed farmers with health complaints had a higher frequency of alleles associated with reduced paraoxonase-1 activity than similarly exposed but asymptomatic farmers provided strong evidence of a causal relationship between exposure to sheep dip and ill health (6).
Natural experiments and Mendelian randomization are examples of a more general approach to enhancing causal inference in observational studies, use of instrumental variables. These are variables associated with the predictor of interest, but not independently associated with outcome. Whether someone is admitted on a weekend, for example, is associated with staffing levels, but was thought not to be independently associated with mortality risk (for the diagnoses studied), so admission on a weekend can be considered an instrumental variable. Similarly, activity of the paraoxonase-1 enzyme is associated with possible toxicity due to dipping sheep, but not otherwise associated with ill health. Other examples of instrumental variables are draft lottery number (used to investigate delayed effects of military service during the Vietnam War era (7)) and the distance of residence from a facility that does coronary revascularization procedures (used to investigate the effects of these procedures on mortality (8)).
Coping with Confounders in the Analysis Phase
Design phase strategies require deciding at the outset of the study which variables are predictors and which are confounders. An advantage of analysis phase strategies is that they allow the investigator to defer that decision until she has examined the data for evidence as to which variables may be confounders (i.e., associated with the predictor of interest and a cause of the outcome).
Sometimes there are several predictor variables, each of which may act as a confounder to the others. For example, although coffee drinking, smoking, male sex,
...
and personality type are associated with MI, they are also associated with each other. The goal is to determine which of these predictor variables are independently associated with MI and which are associated with MI only because they are associated with other (causal) risk factors. In this section, we discuss analytic methods for assessing the independent contribution of predictor variables in observational studies. These methods are summarized in Table 9.5.
Stratification
Like specification and matching, stratification ensures that only cases and controls (or exposed and unexposed subjects) with similar levels of a potential confounding variable are compared. It involves segregating the subjects into strata (subgroups) according to the level of a potential confounder and then examining the relation between the predictor and outcome separately in each stratum. Stratification is illustrated in Appendix 9A. By considering smokers and nonsmokers separately (stratifying on smoking), the confounding effects of smoking can be removed.
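A minimal Python sketch may make the effect of stratifying on smoking concrete. The counts below are invented for illustration and are not the numbers in Appendix 9A. The pooled estimate uses the Mantel-Haenszel odds ratio, a standard way of combining strata that is chosen here as an assumption; the chapter does not name a pooling method.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b = coffee-drinking and non-coffee-drinking cases (MI);
    c, d = coffee-drinking and non-coffee-drinking controls.
    """
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Pooled odds ratio across strata using the Mantel-Haenszel estimator."""
    tables = list(tables)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

# Hypothetical case-control counts (a, b, c, d) within each smoking stratum.
# Smokers drink more coffee (80% vs. 20%) and contribute more of the cases.
strata = {
    "smokers": (80, 20, 40, 10),
    "nonsmokers": (10, 40, 20, 80),
}

# Crude (unstratified) table: add the strata together cell by cell.
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))
crude_or = odds_ratio(*crude_table)
stratum_ors = {name: odds_ratio(*table) for name, table in strata.items()}
pooled_or = mantel_haenszel_or(strata.values())

print(crude_table, crude_or)  # (90, 60, 60, 90) 2.25
print(stratum_ors)            # {'smokers': 1.0, 'nonsmokers': 1.0}
print(pooled_or)              # 1.0
```

In this constructed example the crude odds ratio of 2.25 disappears within each stratum, which is the pattern expected when smoking fully accounts for the apparent association between coffee and MI.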
Appendix 9A also illustrates interaction, in which stratification reveals that the association between predictor and outcome varies with the level of a third factor. Because the third factor (smoking in this example) modifies the effect of the predictor (coffee drinking) on outcome (MI), interaction is sometimes also called effect modification. By chance alone the estimates of association in different strata will rarely be precisely the same, and interaction introduces additional complexity, because a single measure of association no longer can summarize the relationship between predictor and outcome. For this reason, before concluding that an interaction is present, it is necessary to assess its biological plausibility and statistical significance (using a formal test for interaction, or, as a shortcut, checking to see whether the confidence intervals in the different strata overlap). The issue of interaction also arises for subgroup analyses of clinical trials (Chapter 11), and for meta-analyses when homogeneity of studies is being considered (Chapter 13).
Table 9.5 Analysis Phase Strategies for Coping with Confounders

Strategy: Stratification
  Advantages: Easily understood. Flexible and reversible; can choose which variables to stratify upon after data collection.
  Disadvantages: Number of strata limited by sample size needed for each stratum (few covariables can be considered; few strata per covariable leads to incomplete control of confounding). Relevant covariables must have been measured.

Strategy: Statistical adjustment
  Advantages: Multiple confounders can be controlled simultaneously.
  Disadvantages: Model may not fit: incomplete control of confounding (if
...
Stratification resembles matching and specification in being easily understood. An advantage of stratification is its flexibility: by performing several stratified analyses, the investigators can decide which variables appear to be confounders and ignore the remainder. (This is done by determining whether the results of stratified analyses substantially differ from those of unstratified analyses; see Appendix 9A.) Stratification also has the advantage over design phase strategies of being reversible: no choices need be made at the beginning of the study that might later be regretted.
The principal disadvantage of stratified analysis is the limited number of variables that can be controlled simultaneously. For example, possible confounders in the coffee and MI study might include age, systolic blood pressure, serum cholesterol, cigarette smoking, and alcohol intake. To stratify on these five variables, with three strata for
(Table 9.5, continued)
Strategy: Statistical adjustment (continued)
  Disadvantages (continued): model does not fit confounder-outcome relationship); inaccurate estimates of strength of effect (if model does not fit predictor-outcome relationship). Results may be hard to understand. (Many people do not readily comprehend the meaning of a regression coefficient.) Relevant covariables must have been measured.
  Advantages (continued): Information in continuous variables can be fully used. Flexible and reversible.

Strategy: Propensity scores
  Advantages: Multiple confounders can be controlled simultaneously. Information in continuous variables can be fully used. Enhances control for confounding when more people receive the treatment than get the outcome. If a stratified or matched analysis is used, does not require model assumptions. Flexible and reversible.
  Disadvantages: Results may be hard to understand. Relevant covariables must have been measured. Can only be done for exposed and unexposed subjects with overlapping propensity scores, reducing sample size.
...
each, would require 3^5 (= 243) strata! With this many strata there will be some with no cases or no controls, and these strata cannot be used.
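The count of strata can be checked by simply enumerating the combinations. The short Python sketch below does this with the five confounders named in the text; the three level labels ("low", "medium", "high") are placeholders chosen for the illustration.

```python
from itertools import product

# The five potential confounders named in the text.
confounders = ["age", "systolic blood pressure", "serum cholesterol",
               "cigarette smoking", "alcohol intake"]
# Placeholder labels for the three strata of each confounder.
levels = ("low", "medium", "high")

# Every stratum is one combination of a level for each confounder.
strata = list(product(levels, repeat=len(confounders)))
print(len(strata))  # 243
print(strata[0])    # ('low', 'low', 'low', 'low', 'low')
```

Each added confounder multiplies the number of strata by three, so the number of subjects available per stratum shrinks quickly.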
To maintain a sufficient number of subjects in each stratum, a variable is often divided into just two strata. When the strata are too broad, however, the confounder may not be adequately controlled. For example, if the preceding study stratified using only two age strata (e.g., age < 50 and age ≥ 50), some residual confounding would still be possible if within each stratum the subjects drinking the most coffee were older and therefore at higher risk of MI.
Adjustment
Several statistical techniques are available to adjust for confounders. These techniques model the nature of the associations among the variables to isolate the effects of predictor variables and confounders. For example, a study of the effect of lead levels on IQ in children might examine parental education as a potential confounder. Statistical adjustment might model the relation between parents' years of schooling and the child's IQ as a straight line, in which each year of parent education is associated with a fixed increase in child IQ. The IQs of children with different lead levels could then be adjusted to remove the effect of parental education using the approach described in Appendix 9B. Similar adjustments can be made for several confounders simultaneously, using software for multivariate analysis.
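The straight-line adjustment described above can be sketched with an ordinary least-squares model of IQ on lead level and parental education. Everything in this Python example is an assumption made for illustration: the function `fit_linear`, the tiny noise-free data set (constructed so that each year of parental education adds 2 IQ points and high lead subtracts 5), and the use of a single linear model rather than the specific calculation in Appendix 9B.

```python
def fit_linear(predictors, outcomes):
    """Ordinary least squares via the normal equations; an intercept is added.

    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    X = [(1.0, *row) for row in predictors]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, outcomes))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical children: (high_lead indicator, parental years of schooling) and IQ.
# Children with high lead levels tend to have less-educated parents.
children = [(0, 12), (0, 14), (0, 16), (0, 18), (1, 8), (1, 10), (1, 12), (1, 14)]
iq = [94, 98, 102, 106, 81, 85, 89, 93]

high = [y for (lead, _), y in zip(children, iq) if lead == 1]
low = [y for (lead, _), y in zip(children, iq) if lead == 0]
crude_difference = sum(high) / len(high) - sum(low) / len(low)

intercept, lead_effect, education_slope = fit_linear(children, iq)
print(crude_difference)            # -13.0 (unadjusted difference in mean IQ)
print(round(lead_effect, 6))       # -5.0 (difference after adjusting for education)
print(round(education_slope, 6))   # 2.0 (IQ points per year of parental schooling)
```

In this constructed example most of the unadjusted 13-point difference reflects parental education; the adjusted difference of 5 points is what remains for lead. The result depends on the linear model being appropriate, which is the concern raised in the paragraphs that follow.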
One of the great advantages of multivariate adjustment techniques i s the capaci ty
to adjust for the i nfluence of many confounders si multaneousl y. Another advantage i s
thei r use of al l the information in continuous variabl es. It i s easy, for exampl e, to
adjust for a parent' s educati on l evel i n 1-year i ntervals, rather than strati fyi ng i nto
just two categori es.
There are, however, two disadvantages of multivariate adjustment. First, the
model may not fit. Computerized statistical packages have made these models so
accessible that the investigator may not stop to consider whether their use is
appropriate for the predictor and outcome variables in the study. Taking the example
in Appendix 9B, the investigator should examine whether the relation between the
parents' years of schooling and the child's IQ is actually linear. If the pattern is very
different (e.g., the slope of the line becomes steeper with increasing education), then
attempts to adjust IQ for parental education using a linear model will be imperfect and
the estimate of the independent effect of lead will be incorrect.
Second, the resulting highly derived statistics are difficult to understand intuitively.
This is particularly a problem if a simple model does not fit and transformations (e.g.,
parental education squared) or interaction terms (used when the effect of one variable
is modified by another) are needed.
Propensity Scores
Propensity scores are a relatively new analytic technique that can be useful in
observational studies of treatment efficacy. They are a tool for controlling
confounding by indication, the problem that patients for whom a treatment is
indicated (and hence prescribed) are often at higher risk or otherwise intrinsically
different from those who do not get the treatment. Recall that in order to be a
confounder, a variable must be associated with both the predictor and outcome.
Instead of adjusting for all other factors that predict outcome, use of propensity
scores involves creating a multivariate (usually logistic) model to predict receipt of
the treatment. Each subject can then be assigned a predicted probability of treatment,
a propensity score. This single score can be used as the only confounding variable in
stratified or multivariate analysis. Alternatively, subjects who did and did not receive
the treatment can be matched by propensity score, and outcomes compared between
matched pairs.
Example 9.1 Propensity analysis
Gum et al. (10) prospectively studied 6,174 consecutive adults undergoing stress
echocardiography, 2,310 of whom (37%) were taking aspirin and 276 of whom died in
the 3.1-year follow-up period. In unadjusted analyses, aspirin use was not associated
with mortality (4.5% in both groups). However, when 1,351 patients who had received
aspirin were matched to 1,351 patients with the same propensity to receive aspirin
but who did not, mortality was 47% lower in those treated (P = 0.002).
Analysis using propensity scores has three distinct advantages. First, the number of
potential confounding variables that can be modeled as predictors of the intervention
in the propensity score is greater than if one is modeling the predictors of outcome
because the number of people treated is generally much greater than the number who
develop the outcome (2,310 compared with 276 in Example 9.1).² Second, because the
potential confounding variables are reduced to a single score, the primary analysis of
the relationship between the main predictor and outcome can be a stratified or
matched analysis, which does not require assumptions about the form of the
relationship between predictor, outcome, and confounding variables. Finally, if the
predictor variable is receipt of a prescribed treatment, investigators might be more
confident in understanding determinants of treatment than determinants of outcome,
because, after all, treatment decisions are made by clinicians based on a limited and
potentially knowable number of patient characteristics.
Of course, like other multivariate techniques, use of propensity scores still requires
that potential confounding variables be identified and measured. A limitation of this
technique is that it does not provide information about the relationship between any of
the confounding variables and outcome; the only result is for the treatment that was
modeled with the propensity score. However, because this is an analysis phase
strategy, it does not preclude doing more traditional multivariate analyses as well,
and both types of analysis are usually done. The main disadvantages of this technique
are that it requires an additional step and is less intuitive, less familiar, and less well
understood by journals and reviewers than traditional multivariate analyses.
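The two steps described in this section (model receipt of treatment, then match on the
predicted probability) can be sketched in Python. This is only an illustration under
stated assumptions, not the procedure used by Gum et al.: the simulated cohort, the
covariate names (high_risk, age_decades), the simple gradient-ascent logistic fit, the
greedy matching rule, and the 0.05 caliper are all hypothetical choices made for the
example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit a logistic model by plain gradient ascent; returns [intercept, b1, b2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(w[0] + sum(b * x for b, x in zip(w[1:], xi)))
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        w = [b + lr * g / n for b, g in zip(w, grad)]
    return w


def propensity_scores(w, X):
    """Predicted probability of treatment for each subject."""
    return [sigmoid(w[0] + sum(b * x for b, x in zip(w[1:], xi))) for xi in X]


def match_pairs(scores, treated, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on the score, without replacement."""
    available = {i for i, t in enumerate(treated) if not t}
    pairs = []
    for i, t in enumerate(treated):
        if not t or not available:
            continue
        j = min(available, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


def simulate_cohort(n=300, seed=0):
    """Hypothetical cohort: sicker patients are more likely to be treated and to die."""
    rng = random.Random(seed)
    X, treated, died = [], [], []
    for _ in range(n):
        high_risk = 1 if rng.random() < 0.5 else 0
        age_decades = rng.uniform(-1.0, 1.0)  # age, centred and scaled
        p_treat = sigmoid(-1.2 + 2.0 * high_risk + 0.5 * age_decades)
        t = 1 if rng.random() < p_treat else 0
        p_death = (0.06 + 0.14 * high_risk) * (0.5 if t else 1.0)
        X.append([high_risk, age_decades])
        treated.append(t)
        died.append(1 if rng.random() < p_death else 0)
    return X, treated, died


def mortality(died, indices):
    indices = list(indices)
    return sum(died[i] for i in indices) / len(indices) if indices else float("nan")


X, treated, died = simulate_cohort()
scores = propensity_scores(fit_logistic(X, treated), X)
pairs = match_pairs(scores, treated)

print("matched pairs:", len(pairs))
print("unmatched mortality, treated vs untreated:",
      mortality(died, (i for i, t in enumerate(treated) if t)),
      mortality(died, (i for i, t in enumerate(treated) if not t)))
print("matched mortality, treated vs untreated:",
      mortality(died, (i for i, _ in pairs)),
      mortality(died, (j for _, j in pairs)))
```

In practice the propensity model would be fitted with a statistical package and would
include many measured predictors of treatment; the sketch only shows where the single
score enters the matched comparison.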
Underestimation of Causal Effects
To this point, we have focused on determining whether observed associations are
causal. The emphasis has been on whether alternative bases for an association exist,
that is, on avoiding a false conclusion that an association is real and causal when it is
not. However, another type of error is also possible: underestimation of causal
effects. It is important to remember that chance, bias, and confounding can all be
reasons why a real association might be missed or underestimated.
We discussed chance as a reason for missing an association in Chapter 5, when we
reviewed type II errors and the need to make sure the sample size will provide
adequate power to find real associations. After a study has been completed, however,
the power calculation is no longer the best way to quantify uncertainty due to random
error. At this stage, estimating the probability of finding an effect of a specified size is
less relevant than the actual findings, expressed as the observed estimate of
association (e.g., risk ratio) and its 95% confidence interval.
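As an illustration of reporting an estimate with its confidence interval, the sketch
below computes a risk ratio and an approximate 95% confidence interval using the
standard large-sample log method. The formula is a common textbook approximation
rather than something specified in this chapter, and the counts (20 of 100 exposed
and 10 of 100 unexposed subjects with the outcome) are hypothetical.

```python
import math


def risk_ratio_ci(events_exposed, n_exposed, events_unexposed, n_unexposed, z=1.96):
    """Risk ratio with an approximate 95% CI (large-sample log method)."""
    rr = (events_exposed / n_exposed) / (events_unexposed / n_unexposed)
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(
        1 / events_exposed - 1 / n_exposed + 1 / events_unexposed - 1 / n_unexposed
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical cohort: 20/100 events among exposed, 10/100 among unexposed.
rr, lower, upper = risk_ratio_ci(20, 100, 10, 100)
print(f"risk ratio = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

With these made-up counts the point estimate is a doubling of risk, but the interval
(about 0.99 to 4.05) still includes 1, which conveys the remaining uncertainty more
directly than a post hoc power calculation would.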
Bias can also distort estimates of association toward no effect. As discussed in
Chapter 8, observers should be blinded when ascertaining risk factor status among
cases and controls in order to avoid differential measurement bias, for example,
differences between the cases and controls in the way questions are asked or answers
interpreted that might lead observers to get the answers they desire. Because
observers might desire results in either direction, differential measurement bias can
bias results in either direction.
Confounding can also lead to attenuation of real associations. For example, suppose
coffee drinking actually protected against MI, but was more common in smokers. If
smoking were not controlled for, the beneficial effects of coffee might be
missed: coffee drinkers might appear to have the same risk of MI as those who did
not drink coffee, when (based on their greater smoking) one would have expected
their risk to be higher. This type of confounding, in which the effects of a beneficial
factor are hidden by its association with a cause of the outcome, is sometimes called
suppression (11). It is a common problem for observational studies of treatments,
because treatments are often most indicated in those at higher risk of a bad outcome.
The result, noted earlier, is confounding by indication, in which a beneficial
treatment can appear to be useless (as aspirin did in Example 9.1) or even harmful.
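A small numerical sketch may make suppression concrete. The counts below are
hypothetical (they are not from this chapter or from any study) and were chosen so
that coffee halves the odds of MI within each smoking stratum, yet the crude odds
ratio is exactly 1 because coffee drinking is more common among smokers.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio from the four cells of a 2 x 2 table."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Hypothetical (MI, no MI) counts for coffee drinkers and non-drinkers.
strata = {
    "smokers": {"coffee": (40, 40), "no_coffee": (20, 10)},
    "nonsmokers": {"coffee": (10, 40), "no_coffee": (55, 110)},
}

stratum_or = {
    name: odds_ratio(*cells["coffee"], *cells["no_coffee"])
    for name, cells in strata.items()
}

# Collapse over smoking status to get the crude (unadjusted) table.
crude_coffee = tuple(sum(s["coffee"][k] for s in strata.values()) for k in range(2))
crude_no_coffee = tuple(sum(s["no_coffee"][k] for s in strata.values()) for k in range(2))
crude_or = odds_ratio(*crude_coffee, *crude_no_coffee)

print("stratum-specific odds ratios:", stratum_or)  # 0.5 in each stratum
print("crude odds ratio:", crude_or)                # 1.0
```

An analysis that ignored smoking would conclude that coffee has no effect, although
within smokers and within nonsmokers the odds of MI are halved.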
Choosing a Strategy
What general guidelines can be offered for deciding whether to cope with
confounders during the design or analysis phases, and how best to do it? The use of
specification to control confounding is most appropriate for situations in which the
investigator is chiefly interested in specific subgroups of the population; this is really
just a special form of the general process in every study of establishing criteria for
selecting the study subjects (Chapter 3).
An important decision to make in the design phase of the study is whether to match.
Matching is most appropriate for case-control studies and fixed constitutional
factors such as age, race, and sex. Matching may also be helpful when the sample size
is small compared with the number of strata necessary to control for known
confounders, and when the confounders are more easily matched than measured.
However, because matching can permanently compromise the investigator's ability
to observe real associations, it should be used sparingly, particularly for variables that
may be in the causal chain. In many situations the analysis phase strategies
(stratification, adjustment, and propensity scores) are just as good for controlling
confounding, and have the great advantage of being reversible: they allow the
investigator to add or subtract covariates to the statistical model in her efforts to infer
causal pathways.
The decision to stratify, adjust, or use propensity scores can wait until after the
data are collected; in many cases the investigator may wish to do all of the above.
However, it is important at the time the study is designed to consider which factors
may later be used for adjustment, in order to know which variables to measure. Also,
because strategies to adjust for the influence of a specific confounding variable can
only succeed to the degree that the confounder is well measured, it is important to
design measurement approaches that have adequate precision and accuracy (Chapter
4).
Evidence Favoring Causality
The approach to enhancing causal inference has largely been a negative one thus
far: how to rule out the four rival explanations in Table 9.1. A complementary
strategy is to seek characteristics of associations that provide positive evidence for
causality, of which the most important are the consistency and strength of the
association, the presence of a dose-response relation, and biologic plausibility.
When the results are consistent in studies of various designs, it is less likely that
chance or bias is the cause of an association. Real associations that represent
effect-cause or confounding, however, will also be consistently observed. For
example, if cigarette smokers drink more coffee and have more MIs in the population,
studies will consistently observe an association between coffee drinking and MI.
The strength of the association is also important. For one thing, stronger associations
give more significant P values, making chance a less likely explanation. Stronger
associations also provide better evidence for causality by reducing the likelihood of
confounding. Associations due to confounding are indirect (i.e., via the confounder)
and therefore are generally weaker than direct cause-effect associations. This is
illustrated in Appendix 9A: the strong associations between coffee and smoking (odds
ratio = 16) and between smoking and MI (odds ratio = 4) led to a much weaker
association between coffee and MI (odds ratio = 2.25).
A dose-response relation provides positive evidence for causality. The association
between cigarette smoking and lung cancer is an example: moderate smokers have
higher rates of cancer than do nonsmokers, and heavy smokers have even higher
rates. Whenever possible, predictor variables should be measured continuously or in
several categories, so that any dose-response relation that is present can be
observed. Once again, however, a dose-response relation can be observed with
effect-cause associations or with confounding. For example, if heavier coffee
drinkers also were heavier smokers, their MI risk would be greater than that of
moderate coffee drinkers.
Finally, biologic plausibility is an important consideration for drawing causal
inference: if a causal mechanism that makes sense biologically can be proposed,
evidence for causality is enhanced, whereas associations that do not make sense given
our current understanding of biology are less likely to represent cause-effect. It is
important not to overemphasize biologic plausibility, however. Investigators seem to
be able to come up with a plausible mechanism for virtually any association.
Summary
1. The design of observational studies should anticipate the need to interpret
associations. The inference that the association represents a cause-effect
relationship (often the goal of the study) is strengthened by strategies that
reduce the likelihood of the four rival explanations: chance, bias,
effect-cause, and confounding.
2. The role of chance can be minimized by designing a study with adequate
sample size and precision to assure a low type I error rate. Once the study is
completed, the likelihood that chance is the basis of the association can be
judged from the P value and the consistency of the results with previous
evidence.
3. Bias arises from differences between the population and phenomena addressed
by the research question and the actual subjects and measurements in the study.
Bias can be avoided by basing design decisions on a judgment as to whether
these differences will lead to a wrong answer to the research question.
4. Effect-cause is made less likely by designing a study that permits assessment
of temporal sequence, and by considering biologic plausibility.
5. Confounding is made less likely by the following strategies, most of which
require potential confounders to be anticipated and measured:
   a. Specification or matching in the design phase, which alters the sampling
   strategy to ensure that only groups with similar levels of the confounder are
   compared. These strategies should be used sparingly because they can
   irreversibly limit the information available from the study.
   b. Stratification, adjustment, or propensity analysis in the analysis phase,
   which accomplish the same goal statistically and preserve more options for
   inferring causal pathways. Stratification is the easiest to grasp intuitively,
   and adjustment can permit many factors to be controlled simultaneously.
   Propensity scores are particularly helpful for addressing confounding by
   indication in studies of treatment efficacy.
6. Investigators should be on the lookout for opportunistic observational designs,
including natural experiments, Mendelian randomization, and other
instrumental variable designs, that offer a strength of causal inference that
can approach that of a randomized clinical trial.
7. In addition to serving as rival explanations for observed associations, chance,
bias, and confounding can also lead to suppression (underestimation) of real
causal associations.
8. Causal inference can be enhanced by positive evidence, notably the consistency
and strength of the association, the presence of a dose-response relation,
and prior evidence on biologic plausibility.
Appendix
Appendix 9A: Hypothetical Example of
Confounding and Interaction
The entries in these tables are numbers of subjects. Therefore, the top left entry
means that there were 90 subjects with MI who drank coffee.
1. If we look at the entire group of study subjects, there appears to be an
association between coffee drinking and MI (odds ratio = 2.25):

Smokers and Nonsmokers Combined
               MI    No MI
Coffee         90     60
No coffee      60     90

Odds ratio for MI associated with coffee: (90 × 90) / (60 × 60) = 2.25

2. However, this could be due to confounding, as shown by the tables stratified on
smoking below. These tables show that coffee drinking is not associated with MI
in either smokers or nonsmokers:

Smokers
               MI    No MI
Coffee         80     40
No coffee      20     10

Nonsmokers
               MI    No MI
Coffee         10     20
No coffee      40     80

Odds ratio for MI associated with coffee:
in smokers = (80 × 10) / (40 × 20) = 1; in nonsmokers = (10 × 80) / (20 × 40) = 1

Smoking is a confounder because it is strongly associated with coffee drinking
(first table below) and with MI (second table below):

MI and No MI Combined
               Coffee    No Coffee
Smokers         120         30
Nonsmokers       30        120

Odds ratio for coffee drinking associated with smoking: (120 × 120) / (30 × 30) = 16

Coffee and No Coffee Combined
               MI    No MI
Smokers       100     50
Nonsmokers     50    100

Odds ratio for MI associated with smoking: (100 × 100) / (50 × 50) = 4

3. A more complicated situation is interaction. In that case, the association
between coffee drinking and MI differs in smokers and nonsmokers. (In this
example, the association between coffee drinking and MI in the whole study is
due entirely to a strong association in smokers.) When interaction is present, the
odds ratios in different strata are different, and must be reported separately:

Smokers
               MI    No MI
Coffee         50     15
No Coffee      10     33

Nonsmokers
               MI    No MI
Coffee         40     45
No Coffee      50     57

Odds ratio for MI associated with coffee:
in smokers = (50 × 33) / (15 × 10) = 11; in nonsmokers = (40 × 57) / (45 × 50) ≈ 1.0
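The odds ratios quoted in Appendix 9A can be checked directly from the cell counts.
The following is a minimal sketch; the function name odds_ratio and the tuple layout
are illustrative choices, while the counts are those given in the appendix tables.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 table laid out as
           outcome+  outcome-
    exp+      a         b
    exp-      c         d
    """
    return (a * d) / (b * c)


# Confounding example: coffee (rows) by MI (columns).
combined = odds_ratio(90, 60, 60, 90)        # all subjects
smokers = odds_ratio(80, 40, 20, 10)         # stratum: smokers
nonsmokers = odds_ratio(10, 20, 40, 80)      # stratum: nonsmokers

# Why smoking confounds: it is associated with both coffee and MI.
smoking_coffee = odds_ratio(120, 30, 30, 120)  # smoking (rows) by coffee (columns)
smoking_mi = odds_ratio(100, 50, 50, 100)      # smoking (rows) by MI (columns)

# Interaction example: the coffee-MI odds ratio differs by stratum.
interaction_smokers = odds_ratio(50, 15, 10, 33)
interaction_nonsmokers = odds_ratio(40, 45, 50, 57)

print("crude coffee-MI OR:", combined)                            # 2.25
print("stratum ORs (confounding):", smokers, nonsmokers)          # 1.0 1.0
print("smoking-coffee OR:", smoking_coffee)                       # 16.0
print("smoking-MI OR:", smoking_mi)                               # 4.0
print("stratum ORs (interaction):", interaction_smokers,
      round(interaction_nonsmokers, 2))                           # 11.0 1.01
```

When the stratum-specific odds ratios agree with each other but differ from the crude
value, the pattern points to confounding; when they differ from each other, it points
to interaction.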
Appendix 9B: A Simplified Example of
Adjustment
Suppose that a study finds two major predictors of the IQ of children: the parental
education level and the child's blood lead level. Consider the following hypothetical
data on children with normal and high lead levels:

                      Average Years of       Average IQ
                      Parental Education     of Child
High lead level             10.0                 95
Normal lead level           12.0                110

Note that the parental education level is also associated with the child's blood lead
level. The question is, "Is the difference in IQ more than can be accounted for on
the basis of the difference in parental education?" To answer this question we look
at how much difference in IQ the difference in parental education levels would be
expected to produce. We do this by plotting parental education level versus IQ in the
children with normal lead levels (Fig. 9.2).³
The dotted line in Figure 9.2 shows the relationship between the child's IQ and
parental education in children with normal lead levels; there is an increase in the
child's IQ of five points for each 2 years of parental education. Therefore, we can
adjust the IQ of the normal lead group to account for the difference in mean parental
education by sliding down the line from point A to point A′. (Because the group with
normal lead levels had 2 more years of parental education on the average, we adjust
their IQs downward by five points to make them comparable in mean parental
education to the high lead group.) This still leaves a 10-point difference in IQ between
points A′ and B, suggesting that lead has an independent effect on IQ of this
magnitude. Therefore, of the 15-point difference in IQ of children with normal and high
lead levels, five points can be accounted for by their parents' different education
levels and the remaining ten are attributable to the lead exposure.
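The arithmetic of this simplified adjustment can be written out as a short sketch. The
group means and the slope (five IQ points per 2 years of parental education) are the
hypothetical values from Appendix 9B; the function and variable names are illustrative
only.

```python
def adjusted_difference(iq_reference, iq_comparison,
                        education_reference, education_comparison,
                        slope_per_year):
    """Split a crude IQ difference into the part explained by parental education
    (assuming a single linear slope) and the remaining, adjusted difference."""
    crude = iq_reference - iq_comparison
    explained = slope_per_year * (education_reference - education_comparison)
    return crude, explained, crude - explained


slope = 5 / 2  # IQ points per year of parental education (5 points per 2 years)

crude, explained, adjusted = adjusted_difference(
    iq_reference=110, iq_comparison=95,              # normal vs. high lead
    education_reference=12.0, education_comparison=10.0,
    slope_per_year=slope,
)

print("crude IQ difference:", crude)                 # 15
print("explained by parental education:", explained) # 5.0
print("adjusted difference (lead):", adjusted)       # 10.0
```

As footnote 3 notes, a real analysis of covariance estimates one slope from both lead
groups and assumes the relation is linear with the same slope in each group.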
References
1. McEvoy SP, Stevenson MR, McCartt AT, et al. Role of mobile phones in motor
vehicle crashes resulting in hospital attendance: a case-crossover study. BMJ
2005;331(7514):428.
2. Huxley R, Ansary-Moghaddam A, Berrington de Gonzalez A, et al. Type-II
diabetes and pancreatic cancer: a meta-analysis of 36 studies. Br J Cancer
2005;92(11):2076-2083.
3. Maconochie N, Doyle P, Carson C. Infertility among male veterans of the
1990-1991 Gulf war: reproductive cohort study. BMJ 2004;329:196-201.
Erratum in BMJ 2004;329:323.
4. Lofgren RP, Gottlieb D, Williams RA, et al. Post-call transfer of resident
responsibility: its effect on patient care [see comments]. J Gen Intern Med
1990;5(6):501-505.
5. Bell CM, Redelmeier DA. Mortality among patients admitted to hospitals on
weekends as compared with weekdays. N Engl J Med 2001;345(9):663-668.
6. Davey Smith G, Ebrahim S. Mendelian randomization: can genetic
epidemiology contribute to understanding environmental determinants of disease?
Int J Epidemiol 2003;32(1):1-22.
7. Cherry N, Mackness M, Durrington P, et al. Paraoxonase (PON1) polymorphisms
in farmers attributing ill health to sheep dip. Lancet 2002;359(9308):763-764.
8. Hearst N, Newman TB, Hulley SB. Delayed effects of the military draft on
mortality. A randomized natural experiment. N Engl J Med
1986;314(10):620-624.
9. McClellan M, McNeil BJ, Newhouse JP. Does more intensive treatment of acute
myocardial infarction in the elderly reduce mortality? Analysis using instrumental
variables. JAMA 1994;272(11):859-866.
10. Gum PA, Thamilarasan M, Watanabe J, et al. Aspirin use and all-cause
mortality among patients being evaluated for known or suspected coronary artery
disease: a propensity analysis. JAMA 2001;286(10):1187-1194.
11. Klungel OH, Martens EP, Psaty BM, et al. Methods to assess intended effects of
drug treatment in observational studies are reviewed. J Clin Epidemiol
2004;57(12):1223-1231.
12. Katz M. Multivariable analysis: a practical guide for clinicians. Cambridge:
Cambridge University Press, 1999.
FIGURE 9.2. Hypothetical graph of child's IQ as a linear function (dotted line) of
years of parental education.
Footnote
1
The reason that overmatching reduces power can be seen with a matched pairs
analysis of a case-control study. In the matched analysis, only case-control pairs
that are discordant for exposure to the risk factor are analyzed (Appendix 8A).
Matching on a variable associated with the risk factor will lead to fewer discordant
pairs, and hence smaller effective sample size and less power. Of course, this happens
to some extent any time matching is used, not just with overmatching. The difference
with overmatching is that this cost comes with no benefit, because the matching was
not necessary to control confounding. If a matched analysis is not used, then the
estimate of the effect size will be distorted, because the matching causes the cases
and controls to be more likely to have the same value of the risk factor.
2
Another reason that more confounders can be included is that there is no danger of
overfitting the propensity model: interaction terms, quadratic terms, and
multiple indicator variables can all be included (10).
3
This description of analysis of covariance (ANCOVA) is simplified. Actually, parental
education is plotted against the child's IQ in both the normal and high lead groups,
and the single slope that fits both plots the best is used. The model for this form of
adjustment therefore assumes linear relationships between education and IQ in both
groups, and that the slopes of the lines in the two groups are the same.
10
Designing a Randomized Blinded Trial
Steven R. Cummings
Deborah Grady
Stephen B. Hulley
In clinical trials, the investigator applies an intervention and observes the
effect on outcomes. The major advantage of a trial over an observational study
is the ability to demonstrate causality. In particular, randomly assigning the
intervention can eliminate the influence of confounding variables, and blinding
its administration can eliminate the possibility that the observed effects of the
intervention are due to differential use of other treatments in the treatment and
control groups or to biased ascertainment or adjudication of the outcome.
However, clinical trials are generally expensive and time consuming, address narrow
clinical questions, and sometimes expose participants to potential harm. For
these reasons, trials are best reserved for relatively mature research
questions, when observational studies and other lines of evidence suggest that
an intervention might be effective and safe but stronger evidence is required
before it can be approved or recommended. Not every research question is
amenable to the clinical trial design; for example, it is not feasible to study whether
drug treatment of high LDL-cholesterol in children will prevent heart attacks many
decades later. But clinical trial evidence on clinical interventions should be
obtained whenever possible.
This chapter focuses on designing the classic randomized blinded trial (Fig.
10.1), addressing the choice of intervention and control, defining outcomes,
selecting participants, measuring baseline variables, and approaches to
randomizing and blinding. In the next chapter we will cover alternative trial
designs and implementation and analysis issues.
Selecting the Intervention and Control
Conditions
In a clinical trial, the investigator compares the outcome in groups of
participants that receive different interventions. Between-group designs
always include a group that receives an intervention to be tested, and another
that receives either no active treatment (preferably a placebo) or a comparison
treatment.
Choice of Intervention
The choice of intervention is the critical first step in designing a clinical trial.
Investigators should consider several issues as they design their interventions,
including the intensity, duration, and frequency of the intervention that best
balances effectiveness and safety. It is also important to consider the feasibility
of blinding, whether to treat with one or a combination of interventions, and
generalizability to the way the treatment will be used in practice. If important
decisions are uncertain, such as which dose best balances effectiveness and
safety, it is generally best to postpone major or costly trials until pilot studies
have been completed to help resolve the issue. Choosing the best treatment can
be especially difficult in studies that involve years of follow-up because a
treatment that reflects current practice at the outset of the study may have
become outmoded by the end, transforming a pragmatic test into an academic
exercise.
The best balance between effectiveness and safety depends on the condition
being studied. On the one hand, effectiveness is generally the paramount
consideration in designing interventions to treat illnesses that cause severe
symptoms and a high risk of death. Therefore, it may be best to choose the
highest tolerable dose for treatment of metastatic cancer. On the other
hand, safety should be the primary criterion for designing interventions to treat
less severe conditions or to prevent illness. Preventive therapy in healthy people
should meet stringent tests of safety: if it is effective, the treatment will prevent
the condition in a few persons, but everyone treated will be at risk of the
adverse effects of the drug. In this case, it is generally best to choose the
lowest effective dose. If the best dose is not certain based on prior
animal and human research findings, there may be a need for additional trials
that compare the effects of multiple doses on surrogate outcomes (see phase II
trials, Chapter 11).
FIGURE 10.1. In a randomized trial, the investigator (a) selects a sample
from the population, (b) measures baseline variables, (c) randomizes the
participants (R), (d) applies interventions (one should be a blinded placebo,
if possible), (e) measures outcome variables during follow-up (blinded to
randomized group assignment).
Sometimes an investigator may decide to compare several promising doses
with a single control group in a major disease endpoint trial. For example, at the
time the Multiple Outcomes of Raloxifene Evaluation Trial was designed, it was not
clear which dose of raloxifene (60 or 120 mg) was best, so the trial tested two
doses of raloxifene for preventing fractures (1). This is sometimes a reasonable
strategy, but it has its costs: a larger and more expensive trial, and the
complexity of dealing with multiple hypotheses (Chapter 5).
Trials to test single interventions are generally much easier to plan and
implement than those testing combinations of treatments. However, many
medical conditions, such as HIV infection or congestive heart failure, are treated
with combinations of drugs or therapies. The most important disadvantage of
testing combinations of treatments is that the result cannot provide clear
conclusions about any one of the interventions. In the first Women's Health
Initiative trial, for example, postmenopausal women were treated with estrogen
plus progestin therapy or placebo. The intervention increased the risk of several
conditions, such as breast cancer; however, it was unclear whether the effect
was due to the estrogen or the progestin (2). In general, it is preferable to
design trials that have only one major difference between any two study groups.
The investigator should consider how well the intervention can be incorporated in
practice. Simple interventions are generally better than complicated ones
(patients are more likely to take a pill once a day than two or three times).
Complicated interventions, such as multifaceted counseling about changing
behavior, may not be feasible to incorporate in general practice because they
require rare expertise or are too time consuming or costly. Such interventions
are less likely to have clinical impact, even if a trial proves that they are
effective.
Some treatments are generally given in doses that vary from patient to patient.
In these instances, it may be best to design an intervention so that the active
drug is titrated to achieve a clinical outcome such as reduction in the hepatitis
C viral load. To maintain blinding, corresponding changes should be made (by
someone not otherwise involved in the trial) in the dose of placebo for a
randomly selected or matched participant in the placebo group.
Choice of Control
The best control group receives no active treatment in a way that can be
blinded, which for medications generally requires a placebo that is
indistinguishable from active treatment. This strategy compensates for any
placebo effect of the active intervention (i.e., through suggestion and other
nonpharmacologic mechanisms) so that any outcome difference between study
groups can be ascribed to a biological effect.
The cleanest comparison between the intervention and control groups occurs
when there are no cointerventions: medications, therapies, or behaviors
(other than the study intervention) that reduce the risk of developing the
outcome of interest. If participants use effective cointerventions, power will be
reduced and the sample size will need to be larger or the trial longer. In the
absence of effective blinding, the trial protocol must include plans to obtain data
to allow statistical adjustment for differences between the groups in the rate of
use of such cointerventions during the trial. However, adjusting for such
postrandomization differences violates the intention-to-treat principle and should
be viewed as a secondary or explanatory analysis (Chapter 11).
Often it is not possible to withhold treatments other than the study intervention.
For example, in a trial of a new drug to reduce the risk of myocardial infarction
in persons with known coronary heart disease (CHD), the investigators cannot
ethically prohibit or discourage participants from taking medical treatments that
are indicated for persons with known CHD, including aspirin, statins, and
beta-blockers. One solution is to give standard care drugs to all participants in the
trial; although this approach reduces the event rate and therefore increases the
required sample size, it minimizes the potential for differences in cointerventions
between the groups and tests whether the new intervention improves outcome
when given in addition to standard care.
When the treatment to be studied is a new drug that is believed to be a good
alternative to standard care, one option is to design an equivalence trial in
which new treatments are compared with those already proven effective (see
Chapter 11). When the treatment to be studied is a surgery or other procedure
that is so attractive that prospective participants are reluctant to be randomized
to something different, an excellent approach may be randomization to
immediate intervention versus a wait-list control. This design requires an
outcome that can be assessed within a few months of starting the intervention.
It provides an opportunity for a randomized comparison between the immediate
intervention and wait-list control groups during the first several months, and
also for a within-group comparison before and after the intervention in the
wait-list control group (see Chapter 11 for time-series and cross-over designs).
Choosing Outcome Measurements
The definition of the specific outcomes of the trial influences many other
design components, as well as the cost and even the feasibility of answering the
question. Trials should include several outcome measurements to increase the
richness of the results and possibilities for secondary analyses. However, a
single outcome must be chosen that reflects the main question, allows
calculation of the sample size, and sets the priority for efforts to implement
the study.
Clinical outcomes provide the best evidence about whether and how to use
treatments. However, for outcomes that are uncommon, such as the occurrence
of cancer, trials must generally be large, long, and expensive. As noted in
Chapter 6, outcomes measured as continuous variables, such as quality of life,
can generally be studied with fewer subjects and shorter follow-up times than
rates of a dichotomous clinical outcome, such as recurrence of treated breast
cancer.
Intermediate markers, such as bone density, are measurements that are
related to the clinical outcome. Trials that use intermediate outcomes can further
our understanding of pathophysiology and provide information to design the best
dose or frequency of treatment for use in trials with clinical outcomes. The
clinical relevance of trials with intermediate outcomes depends in large part on
how accurately changes in these markers, especially changes that occur due to
treatment, represent changes in the risk or natural history of clinical outcomes.
Intermediate markers can be considered surrogate markers for the clinical
outcome to the extent that treatment-induced changes in the marker
consistently predict how treatment changes the clinical outcome (3). Generally, a
good surrogate measures changes in an intermediate factor in the main pathway
that determines the clinical outcome.
HIV viral load is a good surrogate marker because treatments that reduce the
viral load consistently reduce morbidity and mortality in patients with HIV
infection. In contrast, bone mineral density (BMD) is considered a poor surrogate
marker (3). It reflects the amount of mineral in a section of bone, but
treatments that improve BMD sometimes have little or no effect on fracture risk,
and the magnitude of change in BMD can substantially underestimate how much the
treatment reduces fracture risk (4). The best evidence that a biological marker
is a good surrogate comes from randomized trials of the clinical outcome
(fractures) that also measure change in the marker (BMD) in all participants. If
the marker is a good surrogate, then statistical adjustment for changes in the
marker will account for much of the effect of treatment on the outcome (3).
Number of Outcome Variables
It is often desirable to have several outcome variables that measure different
aspects of the phenomena of interest. In the Heart and Estrogen/Progestin
Replacement Study (HERS), CHD events were chosen as the primary endpoint.
Nonfatal myocardial infarction, coronary revascularization, hospitalization for
unstable angina or congestive heart failure, stroke and transient ischemic attack,
venous thromboembolic events, and all-cause mortality were all assessed and
adjudicated to provide a more detailed description of the cardiovascular effects
of hormone therapy (5). However, a single primary endpoint (CHD events) was
designated for the purpose of planning the sample size and duration of the study
and to avoid the problems of interpreting tests of multiple hypotheses
(Chapter 5).
Adverse Effects
The investigator should include outcome measures that will detect the
occurrence of adverse effects that may result from the intervention. Revealing
whether the beneficial effects of an intervention outweigh the adverse ones is a
major goal of most clinical trials, even those that test apparently innocuous
treatments like a health education program. Adverse effects may range from
relatively minor symptoms, such as a mild or transient rash, to serious and fatal
complications. The investigator should consider the problem that the rate of
occurrence, the effect of treatment, and the sample size requirements for
detecting adverse effects will generally be different from those for detecting
benefits. Unfortunately, rare side effects will usually be impossible to detect no
matter how large the trial and are discovered (if at all) only after an
intervention is in widespread clinical use.
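The difficulty of detecting rare side effects can be illustrated with a small
calculation. The sketch below is not from the text: it assumes that each treated
participant independently has the same risk of the adverse event, and the
function name and the example risk of 1 in 10,000 are hypothetical. It only
estimates the chance of observing at least one event, which is a much lower bar
than showing that the treatment caused it.

```python
def prob_at_least_one_event(n_participants: int, event_risk: float) -> float:
    """Chance of seeing one or more events among n independent participants.

    Assumes every participant has the same, independent risk (an illustrative
    simplification, not a claim about any real treatment).
    """
    return 1.0 - (1.0 - event_risk) ** n_participants


if __name__ == "__main__":
    hypothetical_risk = 1 / 10_000  # assumed risk of the rare adverse effect
    for n in (500, 5_000, 50_000):
        p = prob_at_least_one_event(n, hypothetical_risk)
        print(f"treated participants = {n:>6}: P(at least one event) = {p:.2f}")
```

Under these assumptions, a trial with 5,000 treated participants has roughly a
39% chance of observing even a single case of a 1-in-10,000 adverse effect.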
In the early stages of testing a new treatment when potential adverse effects are
unclear, investigators should ask broad, open-ended questions about all types of
potential adverse effects. In large trials, assessment and coding of all potential
adverse events can be very expensive and time consuming, often with a low
yield of important results. Investigators should consider strategies for
minimizing this burden while preserving an adequate assessment of potential
harms of the intervention. For example, in very large trials, common and minor
events, such as upper respiratory infections and gastrointestinal upset, might be
recorded in a subset of the participants. Important potential adverse events or
effects that are expected because of previous research or clinical experience
should be ascertained by specific queries. For example, because rhabdomyolysis
is a reported side effect of treatment with statins, the signs and symptoms of
myositis should be queried in any trial of a new statin.
When data from a trial are used to apply for regulatory approval of a new drug,
the trial design must satisfy regulatory expectations for reporting adverse events
(see Good Clinical Practices on the U.S. Food and Drug Administration
[FDA] website). Certain disease areas, such as cancer, have established methods
for classifying adverse events (see NCI Common Toxicity Criteria on the
National Cancer Institute website).
Selecting the Participants
Chapter 3 discussed how to specify entry criteria defining a target
population that is appropriate to the research question and an accessible
population that is practical to study, how to design an efficient and scientific
approach to selecting participants, and how to recruit them. Here we cover
issues that are especially relevant to clinical trials.
Define Entry Criteria
In a clinical trial, inclusion and exclusion criteria have the joint goal of
identifying a population in which it is feasible, ethical, and relevant to study
the impact of the intervention on outcomes. Inclusion criteria should produce a
sufficient number of enrollees who have a high enough rate of the primary
outcome to achieve adequate power to find an important effect on the outcome.
On the other hand, criteria should also maximize the generalizability of findings
from the trial and ease of recruitment. For example, if the outcome of interest is
a rare event, such as breast cancer, it is usually necessary to recruit participants
who have a high risk of the outcome to reduce the sample size and follow-up
time to feasible levels. On the other hand, narrowing the inclusion criteria to
higher-risk women limits the generalizability of the results and makes it more
difficult to recruit participants into the trial.
To plan the right sample size, the investigator must have reliable estimates of
the rate of the primary outcome in people who might be enrolled. These
estimates can be based on data from vital statistics, longitudinal observational
studies, or rates observed in the untreated group in trials with outcomes similar
to those in the planned trial. For example, expected rates of breast cancer in
postmenopausal women can be estimated from cancer registry data. The
investigator should keep in mind, however, that screening and healthy volunteer
effects generally mean that event rates among those who qualify and agree to
enter clinical trials are lower than in the general population; it may be preferable
to obtain rates of breast cancer from the placebo group of other trials with
similar inclusion criteria.
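To see why an overestimated outcome rate matters, the following sketch shows
how the required sample size grows when the event rate in the control group is
lower than planned. It is an illustration only: it uses a common
normal-approximation formula for comparing two proportions, which may differ
somewhat from the methods and tables in Chapter 6, and the function name, the
event rates, and the assumed 30% risk reduction are all hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(control_rate: float, relative_risk: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants per group for a two-sided test of two proportions.

    Normal-approximation formula (an assumption of this sketch):
    n = (z_alpha/2 + z_beta)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1 = control_rate                  # expected outcome rate with placebo
    p2 = control_rate * relative_risk  # expected outcome rate with treatment
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Hypothetical planning scenario: treatment reduces risk by 30%.
    for rate in (0.10, 0.05):
        print(f"control event rate {rate:.0%}: "
              f"{n_per_group(rate, 0.70)} participants per group")
```

In this sketch, halving the control event rate more than doubles the number of
participants needed per group.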
Including participants with a high risk of the outcome can decrease the number
of subjects needed for the trial. If risk factors for the outcome have been
established, then the selection criteria can be designed to include participants
who have a minimum estimated risk of the outcome of interest. The Raloxifene
Use for The Heart trial, designed to test the effect of raloxifene for prevention of
cardiovascular disease (CVD) and breast cancer, enrolled women who were at
increased risk of CVD based on a combination of risk factors (6). Another way to
increase the rate of events is to limit enrollment to people who already have the
disease. The Heart and Estrogen/Progestin Replacement Study included 2,763
women who already had CHD to test whether estrogen plus progestin reduced
the risk of new CHD events (5). This approach was much less costly than the
Women's Health Initiative trial of the same research question in women without
CHD, which required about 17,000 participants (7).
Additionally, a trial can be smaller and shorter if it includes people who are
likely to have the greatest benefit from the treatment. For example, tamoxifen
blocks the binding of estradiol to its receptor and decreases the risk of breast
cancer that is estrogen receptor positive but not that of cancer that is estrogen
receptor negative (8). Therefore, a trial testing the effect of tamoxifen on the
risk of breast cancer would be somewhat smaller and shorter if the selection
criteria specify participants at high risk of estrogen receptor-positive breast
cancer.
Although probability samples of general populations confer advantages in
observational studies, this type of sampling is generally not feasible and has
limited value for randomized trials. Inclusion of participants with diverse
characteristics will increase the confidence that the results of a trial apply
broadly. However, setting aside issues of adherence to randomized treatment, it
is generally true that results of a trial done in a convenience sample (e.g.,
women with CHD who respond to advertisements) will be similar to results
obtained in probability samples of eligible people (all women with CHD).
Stratification by a characteristic, such as racial group, allows investigators to
enroll a desired number of participants with a characteristic that may have an
influence on the effect of the treatment or its generalizability. Recruitment to a
stratum is generally closed when the goal for participants with that characteristic
has been reached.
Exclusion criteria should be parsimonious because unnecessary exclusions may
diminish the generalizability of the results, make it more difficult to recruit the
necessary number of participants, and increase the complexity and cost of
recruitment. There are five reasons for excluding people from a clinical trial
(Table 10.1).
The treatment may be unsafe in people who are susceptible to known or
suspected adverse effects of the active treatment. For example, myocardial
infarction is a rare adverse effect of treatment with sildenafil (Viagra).
Therefore, trials of Viagra to treat painful vasospasm in patients with Raynaud's
disease should exclude patients who have CHD (9). Conversely, receiving placebo
may be considered unsafe for some participants. For example, bisphosphonates
are known to be so beneficial in women with vertebral fractures that it would be
unacceptable to enter them in a placebo-controlled trial of a new treatment for
osteoporosis unless bisphosphonates could also be provided for all trial
participants. Persons in whom the active treatment is unlikely to be effective
should be excluded, as well as those who are unlikely to be adherent to the
intervention or unlikely to complete follow-up. It is wise to exclude people who
are not likely to contribute a primary outcome to the study (e.g., because they
will move during the period of follow-up). Occasionally, practical problems such
as impaired mental status that makes it difficult to follow instructions justify
exclusion. Investigators should carefully weigh potential exclusion criteria that
apply to many people (e.g., diabetes or upper age limits) as these may have a
large impact on the feasibility and costs of recruitment and the generalizability
of results.
Table 10.1 Reasons for Excluding People from a Clinical Trial
Reason                                Example (a trial of raloxifene vs.
                                      placebo to prevent heart disease)
1. A study treatment would be harmful
   Unacceptable risk of adverse       Prior venous thromboembolic event
   reaction to active treatment       (raloxifene increases risk of venous
                                      thromboembolic events)
   Unacceptable risk of               Recent estrogen receptor-positive
   assignment to placebo              breast cancer (treatment with an
                                      anti-estrogen is an effective standard
                                      of care)
2. Active treatment is unlikely to be effective
   At low risk for the outcome        Low coronary heart disease risk factors
   Has a type of disease that is
   not likely to respond to
   treatment
   Taking a treatment that is         Taking estrogen therapy (which
   likely to interfere with the       competes with raloxifene)
   intervention
3. Unlikely to adhere to the          Poor adherence during run-in
   intervention
4. Unlikely to complete               Plans to move before trial ends
   follow-up                          Short life expectancy because of a
                                      serious illness
                                      Unreliable participation in visits
                                      before randomization
Design an Adequate Sample Size and Plan the Recruitment Accordingly
Trials with too few participants to detect substantial effects are wasteful and
unethical, and may produce misleading conclusions (10). Estimating the sample
size is one of the most important early parts of planning a trial (Chapter 6).
Outcome rates in clinical trials are commonly lower than estimated, primarily
due to screening and volunteer bias. Recruitment for a trial is usually more
difficult than recruitment for an observational study. For these reasons, the
investigator should plan an adequate sample from a large accessible population,
and enough time and money to get the desired sample size when (as usually
happens) the barriers to doing so turn out to be greater than expected.
Measuring Baseline Variables
To facilitate contacting participants who are lost to follow-up, it is important
to record the names, phone numbers, addresses, and e-mail addresses of two or
three friends or relatives who will always know how to reach the participant. It is
also valuable to record Social Security numbers or other national I.D. numbers.
These can be used to determine the vital status of participants (through the
National Death Index) or to detect key outcomes using health records (e.g.,
health insurance systems). However, this is protected personal health
information that must be kept confidential and should not accompany
data that are sent to a coordinating center or sponsoring institution.
Describe the Participants
Investigators should collect enough information (e.g., age, gender, and
measurements of the severity of disease) to help others judge the
generalizability of the findings. These measurements also provide a means for
checking on the comparability of the study groups at baseline; the first table of
the final report of a clinical trial typically compares the levels of baseline
characteristics in the study groups. The goal is to make sure that differences in
these levels do not exceed what might be expected from the play of chance,
which might suggest a technical error or bias in carrying out the randomization.
Table 10.1 (continued)
5. Practical problems with            Impaired mental state that prevents
   participating in the protocol      accurate answers to questions
Measure Variables that Are Risk Factors for the Outcome or Can Be Used to
Define Subgroups
It is a good idea to measure baseline variables that are likely to be strong
predictors of the outcome (e.g., smoking habits of the spouse in a trial of a
smoking intervention). This allows the investigator to study secondary research
questions, such as predictors of the outcomes. In small trials where
randomization is more prone to produce chance maldistributions of baseline
characteristics, measurement of important predictors of the outcome permits
statistical adjustment of the primary randomized comparison to reduce the
influence of these chance maldistributions on the outcome of the trial. Baseline
measurements of potential predictors of the outcome also allow the investigator
to examine whether the intervention has different effects in subgroups classified
by baseline variables, an uncommon but important phenomenon termed effect
modification or interaction (Chapter 9). For example, bone density measured at
baseline in the Fracture Intervention Trial led to the finding that treatment with
alendronate significantly decreased the risk of nonspine fractures in women with
very low bone density (osteoporosis) but had no effect in women with higher bone
density (11). Importantly, a specific test for the interaction of bone density and
treatment effect was statistically significant (P = 0.02).
Measure Baseline Value of the Outcome Variable
If outcomes include change in a variable, the outcome variable must be
measured at the beginning of the study in the same way that it will be measured
at the end. In studies that have a dichotomous outcome (incidence of CHD, for
example) it may be important to demonstrate by history and electrocardiogram
that the disease is not present at the outset. In studies that have a continuous
outcome variable (effects of antihypertensive drugs on blood pressure) the best
measure is generally a change in the outcome over the course of the study. This
approach usually minimizes the variability in the outcome between study
participants and offers more power than simply comparing blood pressure values
at the end of the trial. Similarly, it may also be useful to measure secondary
outcome variables, and outcomes of planned ancillary studies, at baseline.
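The gain in power from analyzing change depends on how strongly the baseline
and final measurements are correlated within participants. The sketch below is
an illustration rather than a method given in the text: it assumes the outcome
has the same standard deviation at baseline and at the end of the trial, and the
function names and the example blood pressure values are hypothetical. Under
that assumption the variance of a change score is 2 x SD^2 x (1 - correlation),
so change scores are less variable than final values whenever the correlation
exceeds 0.5.

```python
def variance_of_change(sd: float, correlation: float) -> float:
    """Variance of (final - baseline), assuming equal SD at both time points."""
    return 2 * sd ** 2 * (1 - correlation)


def relative_sample_size(correlation: float) -> float:
    """Sample size for comparing change scores relative to comparing final values.

    Values below 1 mean the change-score analysis needs fewer participants.
    """
    return 2 * (1 - correlation)


if __name__ == "__main__":
    sd_systolic = 15.0  # hypothetical SD of systolic blood pressure, mm Hg
    for r in (0.3, 0.5, 0.8):
        print(f"correlation {r:.1f}: variance of change = "
              f"{variance_of_change(sd_systolic, r):.0f}, "
              f"relative sample size = {relative_sample_size(r):.1f}")
```

When the baseline and final values are only weakly correlated, this sketch
suggests that comparing final values can be the more efficient choice.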
Be Parsimonious
Having pointed out all these uses for baseline measurements, we should stress
that the design of a clinical trial does not require that any be measured, because
randomization eliminates the problem of confounding by factors that are present
at the outset. Making a lot of measurements adds expense and complexity. In a
randomized trial that has a limited budget, time and money are usually better
spent on things that are vital to the integrity of the trial, such as the adequacy
of the sample size, the success of randomization and blinding, and the
completeness of follow-up. Yusuf et al. have promoted the use of large trials
with very few measurements (12).
Establish Banks of Materials
Storing images, sera, DNA, and other biologic specimens at baseline will allow
subsequent measurement of biological effects of the treatment, biological
markers that predict the outcome, and factors (such as genotype) that might
identify people who respond well or poorly to the treatment. Stored specimens
can also be a rich resource to study other research questions not directly related
to the main outcome.
Randomizing and Blinding
The third step in Figure 10.1 is to randomly assign the participants to two
or more groups. In the simplest design, one group receives an active treatment
intervention and the other receives a placebo. The random allocation of
participants to one or another of the study groups establishes the basis for
testing the statistical significance of differences between these groups in the
measured outcome. Random assignment provides that age, sex, and other
prognostic baseline characteristics that could confound an observed association
(even those that are unknown or unmeasured) will be distributed equally, except
for chance variation, among the randomized groups.
Do a Good Job of Random Assignment
Because randomization is the cornerstone of a clinical trial, it is important that it
be done correctly. The two most important features are that the procedure truly
allocates treatments randomly and that the assignments are tamperproof so
that neither intentional nor unintentional factors can influence the
randomization.
Ordinarily, the participant completes the baseline examinations, is found eligible
for inclusion, and gives consent to enter the study before randomization. He is
then randomly assigned by computerized algorithm or by applying a set of
random numbers, which are typically computer-generated. Once a list of the
random order of assignment to study groups is generated, it must be applied to
participants in strict sequence as they enter the trial.
It is essential to design the random assignment procedure so that members of
the research team who have any contact with the study participants cannot
influence the allocation. For example, random treatment assignments can be
placed in advance in a set of sealed envelopes by someone who will not be
involved in opening the envelopes. Each envelope must be numbered (so that all
can be accounted for at the end of the study), opaque (to prevent
transillumination by a strong light), and otherwise tamperproof. When a
participant is randomized, his name and the number of the next unopened
envelope are first recorded in the presence of a second staff member and both
staff sign the envelope; then the envelope is opened and the randomization
number contained therein is assigned to the participant.
Multicenter trials typically use a separate tamperproof randomization facility that
the trial staff contact when an eligible participant is ready to be randomized. The
staff member provides the name and study ID of the new participant. This
information is recorded and the treatment group is then randomly assigned by
providing a treatment assignment number linked to the interventions. Treatment
can also be randomly assigned by computer programs at a single research site as
long as these programs are tamperproof. Rigorous precautions to prevent
tampering with randomization are needed because investigators sometimes find
themselves under pressure to influence the randomization process (e.g., for an
individual who seems particularly suitable for an active treatment group in a
placebo-controlled trial).
Consider Special Randomization Techniques
The preferred approach is typically simple randomization of individual
participants in an equal ratio to each intervention group. Trials of small to
moderate size will have a small gain in power if special randomization
procedures are used to balance the study groups in the numbers of participants
they contain (blocked randomization) and in the distribution of baseline variables
known to predict the outcome (stratified blocked randomization).
Blocked randomization is a commonly used technique to ensure that the
number of participants is equally distributed among the study groups.
Randomization is done in blocks of predetermined size. For example, if
the block size is six, randomization proceeds normally within each block until the
third person is randomized to one group, after which participants are
automatically assigned to the other group until the block of six is completed.
This means that in a study of 30 participants exactly 15 will be assigned to each
group, and in a study of 33 participants, the disproportion could be no greater
than 18:15. Blocked randomization with a fixed block size is less suitable for
nonblinded studies because the treatment assignment of the participants at the
end of each block could be predicted and manipulated. This problem can be
minimized by varying the size of the blocks randomly (ranging, for example, from
four to eight) according to a schedule that is not known to the investigator.
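The blocking procedure described above can be sketched in a few lines. This is
a minimal illustration, not a tamperproof production system: the function name,
the arm labels, and the default block sizes are assumptions, and each block is
built by shuffling an equal number of assignments to each arm, which has the
same balancing effect as the rule described in the text.

```python
import random


def blocked_allocation(n_participants, block_sizes=(4, 6, 8),
                       arms=("A", "B"), seed=None):
    """Return a list of arm assignments built from randomly sized balanced blocks.

    Every block size must be a multiple of the number of arms so that each
    completed block contains the same number of assignments to every arm.
    """
    if any(size % len(arms) != 0 for size in block_sizes):
        raise ValueError("each block size must be a multiple of the number of arms")
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_participants:
        size = rng.choice(block_sizes)              # block size hidden from staff
        block = list(arms) * (size // len(arms))    # equal numbers per arm
        rng.shuffle(block)                          # random order within the block
        schedule.extend(block)
    return schedule[:n_participants]                # last block may be incomplete


if __name__ == "__main__":
    fixed = blocked_allocation(30, block_sizes=(6,), seed=1)
    print("fixed blocks of six:", fixed.count("A"), "vs", fixed.count("B"))
    varied = blocked_allocation(33, seed=1)
    print("varying blocks:     ", varied.count("A"), "vs", varied.count("B"))
```

In practice the schedule would be generated and held by someone with no
participant contact, and applied in strict sequence as participants enroll.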
Stratified blocked randomization ensures that an important predictor of the
outcome is more evenly distributed between the study groups than chance alone
would dictate. In a trial of the effect of a drug to prevent fractures, having a
prior vertebral fracture is such a strong predictor of outcome and response to
treatment that it may be best to ensure that similar numbers of people who have
vertebral fractures are assigned to each group. This can be achieved by dividing
participants into two groups (those with and those without vertebral
fractures) as they enroll in the trial and then carrying out blocked
randomization separately in each of these two strata. Stratified blocked
randomization can slightly enhance the power of a small trial by reducing the
variation in outcome due to chance disproportions in important baseline
variables. It is of little benefit in large trials (more than 1,000 participants)
because chance assignment ensures nearly even distribution of baseline
variables. An important limitation of stratified blocked randomization is the small
number of baseline variables, not more than two or three, that can be balanced
by this technique.
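Carrying out blocked randomization separately within each stratum can be
sketched as follows. The class name, arm labels, stratum labels, and block size
are hypothetical choices for illustration; a real trial would also need the
safeguards against tampering discussed earlier.

```python
import random
from collections import defaultdict


class StratifiedBlockRandomizer:
    """Keeps an independent sequence of balanced blocks for each stratum."""

    def __init__(self, arms=("active", "placebo"), block_size=4, seed=None):
        if block_size % len(arms) != 0:
            raise ValueError("block size must be a multiple of the number of arms")
        self.arms = arms
        self.block_size = block_size
        self.rng = random.Random(seed)
        # stratum label -> assignments remaining in that stratum's current block
        self.pending = defaultdict(list)

    def assign(self, stratum):
        """Return the next assignment for a participant in the given stratum."""
        if not self.pending[stratum]:
            block = list(self.arms) * (self.block_size // len(self.arms))
            self.rng.shuffle(block)
            self.pending[stratum] = block
        return self.pending[stratum].pop()


if __name__ == "__main__":
    randomizer = StratifiedBlockRandomizer(seed=7)
    enrolling = ["prior fracture", "no prior fracture", "prior fracture",
                 "prior fracture", "no prior fracture", "prior fracture"]
    for stratum in enrolling:
        print(f"{stratum:>17}: {randomizer.assign(stratum)}")
```

Because each additional stratification variable multiplies the number of strata,
blocks in many strata would be left incomplete, which is one way to see why only
two or three variables can be balanced this way.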
Randomizing equal numbers of participants to each group maximizes study
power, but unequal allocation of participants to treatment and control
groups may sometimes be appropriate (13). Occasionally, investigators increase
the ratio of active to placebo treatment to make the trial more attractive to
potential subjects who would like a greater chance of receiving active treatment
if they enroll, or decrease the ratio (as in the Women's Health Initiative low-fat
diet trial (14)) to save money if the intervention is expensive. A trial comparing
multiple active treatments to one control group may increase the power of those
comparisons by enlarging the control group (as in the Coronary Drug Project trial
(15)). In this case there is no clear way to pick the best proportions to use, and
disproportionate randomization might complicate the process of obtaining
informed consent. Because the advantages are marginal (the effect of even a 2:1
disproportion on power is surprisingly modest (16)), the best decision is usually
to assign equal numbers to each group.
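The modest cost of unequal allocation can be checked with a short calculation.
The sketch below is an illustration under stated assumptions, not a result taken
from reference 16: it assumes a comparison of two group means with equal
outcome variance in each group and a fixed total number of participants, so that
the variance of the difference is proportional to 1/n1 + 1/n2. The function name
is hypothetical.

```python
def relative_efficiency(ratio: float) -> float:
    """Efficiency of a ratio:1 allocation relative to 1:1 at the same total size.

    Derived from var(difference) proportional to 1/n1 + 1/n2, which gives
    4 * ratio / (1 + ratio)^2 (1.0 means no loss of efficiency).
    """
    return 4 * ratio / (1 + ratio) ** 2


if __name__ == "__main__":
    for ratio in (1, 2, 3, 4):
        eff = relative_efficiency(ratio)
        extra = (1 / eff - 1) * 100  # extra participants needed for equal power
        print(f"{ratio}:1 allocation: efficiency {eff:.3f}, "
              f"about {extra:.0f}% more participants needed")
```

Under these assumptions a 2:1 allocation retains about 89% of the efficiency of
equal allocation, so the total sample would need to grow by roughly 12.5% to
preserve power.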
Randomization of matched pairs is a strategy for balancing baseline
confounding variables that requires selecting pairs of subjects who are matched
on important factors like age and sex, then randomly assigning one member of
each pair to each study group. A drawback of randomizing matched pairs is that
it complicates recruitment and randomization, requiring that an eligible
participant wait for randomization until a suitable match has been identified. In
addition, matching is generally not necessary in large trials in which random
assignment prevents confounding. However, a particularly attractive version of
this design can be used when the circumstances permit a contrast of treatment
and control effects in two parts of the same individual. In the Diabetic
Retinopathy Study, for example, each participant had one eye randomly assigned
to photocoagulation treatment while the other served as a control (17).
Blinding
Whenever possible, the investigator should design the interventions in such a
fashion that the study participants, staff who have contact with them, persons
making laboratory measurements, and those adjudicating outcomes have no
knowledge of the study group assignment. When it is not possible to blind all of
these individuals, it is highly desirable to blind as many as possible (always, for
example, blinding laboratory personnel). In a randomized trial, blinding is as
important as randomization: it prevents bias due to use of cointerventions and
biased ascertainment of outcomes.
Table 10.2 In a Randomized Blinded Trial, Randomization Eliminates Confounding
by Baseline Variables and Blinding Eliminates Confounding by Cointerventions
Explanation for Association     Strategy to Rule Out Rival Explanation
1. Chance                       Same as in observational studies
2. Bias                         Same as in observational studies
3. Effect-Cause                 (Not a possible explanation in a trial)
5. Cause-Effect
...
Randomization only eliminates the influence of confounding variables that are
present at the time of randomization; it does not eliminate differences that
develop between the groups during follow-up (Table 10.2). In an unblinded study
the investigator or study staff may give extra attention or treatment to
participants he knows are receiving the active drug, and this
cointervention may be the actual cause of any difference in outcome
that is observed between the groups. For example, in an unblinded trial of the
effect of exercise to prevent myocardial infarction, the investigator's eagerness
to find a benefit might lead him to suggest that participants in the exercise
group stop smoking. Cointerventions can also affect the control group if, for
example, participants who know that they are receiving placebo seek out other
treatments that affect the outcome. Concern by a participant's family or private
physician might also lead to effective cointerventions if the study group is not
blinded. Cointerventions that are delivered similarly in both groups may
decrease the power of the study by decreasing outcome rates, but
cointerventions that affect one group more than the other can cause bias in
either direction.
The other important value of blinding is to prevent biased ascertainment and
adjudication of outcome. In an unblinded trial, the investigator may be
tempted to look more carefully for outcomes in the untreated group or to
diagnose the outcome more frequently. For example, in an unblinded trial of
estrogen therapy, the investigators may be more likely to ask women in the
active treatment group about pain or swelling in the calf and to order ultrasound
or other tests to make the diagnosis of deep vein thrombosis.
After a possible outcome event has been ascertained, it is important that
personnel who will adjudicate the outcome are blinded. Results of the Canadian
Cooperative Multiple Sclerosis trial nicely illustrate the importance of blinding in
unbiased outcome adjudication (18). Persons with multiple sclerosis were
randomly assigned to combined plasma exchange, cyclophosphamide, and
prednisone, or to sham plasma exchange and placebo medications. At the end of
the trial, the severity of multiple sclerosis was assessed using a structured
examination by neurologists blinded to treatment assignment and again by
neurologists who were unblinded. Therapy was not effective based on the
assessment of the blinded neurologists, but was statistically significantly
effective based on the assessment of the unblinded neurologists.
Blinded assessment of outcome may not be important if the outcome of the trial
is a hard outcome such as death, about which there is no uncertainty or
opportunity for biased assessment. Most other outcomes, such as cause-specific
death, disease diagnosis, physical measurements, questionnaire scales, and
self-reported conditions, are susceptible to biased ascertainment.
After the study is over, it is a good idea to assess whether the participants and
investigators were unblinded by asking them to guess which treatment the
participant was assigned to; if a higher than expected proportion guesses
correctly, the published discussion of the findings should include an assessment
of the potential biases that partial unblinding may have caused.
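One simple way to judge whether the proportion of correct guesses is "higher
than expected" is an exact binomial calculation. The sketch below is an
illustration, not a procedure specified in the text: it assumes two study groups
of equal size, so that guessing at random would be correct half the time, and it
ignores "don't know" responses. The function name and the example counts are
hypothetical.

```python
from math import comb


def prob_this_many_correct_by_chance(n_guesses: int, n_correct: int,
                                     p_chance: float = 0.5) -> float:
    """One-sided exact binomial probability of n_correct or more correct guesses
    if every guess were made at random with success probability p_chance."""
    return sum(
        comb(n_guesses, k) * p_chance ** k * (1 - p_chance) ** (n_guesses - k)
        for k in range(n_correct, n_guesses + 1)
    )


if __name__ == "__main__":
    # Hypothetical end-of-study survey: 200 participants guess their assignment.
    for correct in (105, 115, 125):
        p = prob_this_many_correct_by_chance(200, correct)
        print(f"{correct}/200 correct: probability by chance alone = {p:.4f}")
```

A small probability suggests that blinding was at least partly broken, although
correct guesses can also reflect a real treatment effect that participants
noticed.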
What to Do When Blinding Is Difficult or Impossible. In some cases blinding
is difficult or impossible, either for technical or ethical reasons. For example, it
is difficult to blind participants if they are assigned to an educational, dietary or
exercise intervention. However, the control group in such studies might receive a
different form of education, diet or exercise of a type and intensity unlikely to be
effective. Surgical interventions often cannot be blinded because it may be
unethical to perform sham surgery in the control group. However, surgery is
always associated with some risk, so it is very important to determine if the
procedure is truly effective. For example, a recent randomized trial found that
arthroscopic debridement of the cartilage of the knee was no more effective than
sham arthroscopy for relieving osteoarthritic knee pain (19). In this case, the
risk to participants in the control group may have been outweighed if thousands
of patients were prevented from undergoing an ineffective procedure.
If the interventions cannot be blinded, the investigator should limit and
standardize other potential cointerventions as much as possible and blind study
staff who ascertain and adjudicate the outcomes. For example, an investigator
testing the effect of yoga for relief of hot flashes could specify a precise regimen
of yoga sessions in the treatment group and general relaxation sessions of equal
duration in the control group. To minimize other differences between the groups,
he could instruct both yoga and control participants to refrain from starting new
recreational, exercise or relaxation activities or other treatments for hot flashes
until the trial has ended. Also, study staff who collect information on the
severity of hot flashes could be different from those who provide yoga training.
Summary
1. The choice and dose of intervention is a difficult decision that balances
effectiveness and safety; other considerations include relevance to
clinical practice, simplicity, suitability for blinding, and feasibility of
enrolling subjects.
2. The best comparison group is a placebo control that allows participants,
investigators and study staff to be blinded.
3. Clinically relevant outcome measures such as pain, quality of life,
occurrence of cancer, and death are the most meaningful outcomes of
trials. Intermediary markers, such as HIV viral load, are valid surrogate
markers for clinical outcomes to the degree that treatment-induced changes
in the marker consistently predict changes in the clinical outcome.
4. All clinical trials should include measures of potential adverse effects of
the intervention.
5. The criteria for selecting study participants should identify those who
are likely to benefit and not be harmed by treatment, easy to recruit,
and likely to adhere to treatment and follow-up protocols. Choosing
participants at high risk of an uncommon outcome can decrease sample
size and cost, but may make recruitment more difficult and decrease
generalizability of the findings.
6. Baseline variables should be measured parsimoniously to track the
participants, describe their characteristics, measure risk factors for and
baseline values of the outcome, and enable later examination of disparate
intervention effects in various subgroups (interactions); serum, genetic
material, and so on should be stored for later analysis.
7. Randomization, which eliminates bias due to baseline confounding
variables, should be tamperproof; matched pair randomization is an
excellent design when feasible, and in small trials stratified blocked
randomization can reduce chance maldistributions of key predictors.
8. Blinding the intervention is as important as randomization and serves
to control cointerventions and biased outcome ascertainment and
adjudication.
References
1. Ettinger B, Black DM, Mitlak BH, et al. Reduction of vertebral fracture
risk in postmenopausal women with osteoporosis treated with raloxifene:
results from a 3-year randomized clinical trial. Multiple Outcomes of
Raloxifene Evaluation (MORE) investigators. JAMA 1999;282:637-645.
2. The Women's Health Initiative Study Group. Design of the Women's Health
Initiative clinical trial and observational study. Control Clin Trials
1998;19:61-109.
3. Prentice RL. Surrogate endpoints in clinical trials: definition and
operational criteria. Stat Med 1989;8:431-440.
4. Cummings SR, Karpf DB, Harris F, et al. Improvement in spine bone
density and reduction in risk of vertebral fractures during treatment with
antiresorptive drugs. Am J Med 2002;112:281-289.
5. Hulley S, Grady D, Bush T, et al. Randomized trial of estrogen plus
progestin for secondary prevention of coronary heart disease in
postmenopausal women. JAMA 1998;280:605-613.
6. Mosca L, Barrett-Connor E, Wenger NK, et al. Design and methods of the
Raloxifene Use for The Heart (RUTH) Study. Am J Cardiol
2001;88:392-395.
7. Rossouw JE, Anderson GL, Prentice RL, et al. Risks and benefits of
estrogen plus progestin in healthy postmenopausal women: principal results
from the Women's Health Initiative randomized controlled trial. JAMA
2002;288:321-333.
8. Fisher B, Costantino J, Wickerham D, et al. Tamoxifen for prevention of
breast cancer: report of the National Surgical Adjuvant Breast and Bowel
Project P-1 Study. JNCI 1998;90:1371-1388.
9. Fries R, Shariat K, von Wilmowsky H, et al. Sildenafil in the treatment of
Raynaud's phenomenon resistant to vasodilatory therapy. Circulation
2005;112:2980-2985.
10. Freiman JA, Chalmers TC, Smith H Jr, et al. The importance of beta, the
type II error and sample size in the design and interpretation of the
randomized control trial. Survey of 71 negative trials. N Engl J Med
1978;299:690-694.
11. Cummings SR, Black DM, Thompson DE, et al. Effect of alendronate on
risk of fracture in women with low bone density but without vertebral
fractures: results from the fracture intervention trial. JAMA
1998;280:2077-2082.
12. Yusuf S, Collins R, Peto R. Why do we need some large, simple
randomized trials? Stat Med 1984;3:409-420.
13. Avins AL. Can unequal be more fair? Ethics, subject allocation, and
randomised clinical trials. J Med Ethics 1998;24:401-408.
14. Prentice RL, Caan B, Chlebowski RT, et al. Low-fat dietary pattern and
risk of invasive breast cancer: the Women's Health Initiative randomized
controlled dietary modification trial. JAMA 2006;295:629-642.
15. CDP Research Group. The coronary drug project. Initial findings leading
to modifications of its research protocol. JAMA 1970;214:1303-1313.
16. Friedman LM, Furberg C, DeMets DL. Fundamentals of clinical trials, 3rd
ed. St. Louis, MO: Mosby Year Book, 1996.
17. Diabetic Retinopathy Study Research Group. Preliminary report on effects
of photocoagulation therapy. Am J Ophthalmol 1976;81:383-396.
18. Noseworthy JH, O'Brien P, Erickson BJ, et al. The Mayo-Clinic Canadian
cooperative trial of sulfasalazine in active multiple sclerosis. Neurology
1998;51:1342-1352.
19. Moseley JB, O'Malley K, Petersen NJ, et al. A controlled trial of
arthroscopic surgery for osteoarthritis of the knee. N Engl J Med
2002;347:81-88.
11
Alternative Trial Designs and
Implementation Issues
Deborah Grady
Steven R. Cummings
Stephen B. Hulley
In the last chapter, we discussed the classic randomized, blinded, parallel group
trial: how to select the intervention, choose outcomes, select participants,
measure baseline variables, randomize, and blind. In this chapter, we describe
alternative clinical trial designs and address the conduct of clinical trials,
including interim monitoring during the trial.
Alternative Clinical Trial Designs
Other Randomized Designs
There are a number of variations on the classic parallel group randomized trial
that may be useful when the circumstances are right.
The factorial design aims to answer two (or more) separate research questions
in a single cohort of participants (Fig. 11.1). A good example is the Women's
Health Study, which was designed to test the effect of low-dose aspirin and
vitamin E on risk for cardiovascular events among healthy women (1). The
participants were randomly assigned to four groups, and two hypotheses were
tested by comparing two halves of the study cohort. First, the rate of
cardiovascular events in women on aspirin is compared with women on aspirin
placebo (disregarding the fact that half of each of these groups received vitamin
E); then the rate of cardiovascular events in those on vitamin E is compared with
all those on vitamin E placebo (now disregarding the fact that half of each of
these groups received aspirin). The investigators have two complete trials for
the price of one.
The factorial design can be very efficient. For example, the Women's Health
Initiative randomized trial was able to test the effect of three interventions
(hormone therapy, low-fat diet and calcium plus vitamin D) on a number of
outcomes in one cohort (2). A limitation is the possibility of interactions between
the effects of the treatments on the outcomes. For example, if the effect of
aspirin on risk for cardiovascular disease is different in women treated with
vitamin E compared to those not treated with vitamin E, an interaction exists
and the effect of aspirin would have to be calculated separately in these two
groups. This would reduce the
power of these comparisons, because only half of the participants would be
included in each analysis. Factorial designs can actually be used to study such
interactions, but trials designed to test interactions are more complicated and
difficult to implement, larger sample sizes are required, and the results can be
hard to interpret. Other limitations of the factorial design are that the same
study population must be appropriate for each intervention and multiple
treatments may interfere with recruitment and adherence.
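The "two trials for the price of one" analysis of a two-by-two factorial design
can be sketched as follows. Each factor is evaluated by pooling the two cells
that received it and comparing them with the two cells that did not. The cell
counts are hypothetical, the function names are ours, and the sketch assumes
that the two interventions do not interact.

```python
from typing import Dict, Tuple

# Key: (received drug A?, received drug B?); value: (events, participants).
Cells = Dict[Tuple[bool, bool], Tuple[int, int]]

def marginal_risk(cells: Cells, factor: int, active: bool) -> float:
    """Event risk pooled over every cell in which the chosen factor equals `active`."""
    events = sum(e for key, (e, n) in cells.items() if key[factor] == active)
    total = sum(n for key, (e, n) in cells.items() if key[factor] == active)
    return events / total

def marginal_risk_ratio(cells: Cells, factor: int) -> float:
    """Risk in those on the active factor divided by risk in those on its placebo."""
    return marginal_risk(cells, factor, True) / marginal_risk(cells, factor, False)

# Hypothetical counts for four groups of 1,000 participants each.
hypothetical_cells: Cells = {
    (True, True): (40, 1000),    # drug A + drug B
    (True, False): (44, 1000),   # drug A + placebo B
    (False, True): (50, 1000),   # placebo A + drug B
    (False, False): (54, 1000),  # placebo A + placebo B
}

rr_drug_a = marginal_risk_ratio(hypothetical_cells, factor=0)
rr_drug_b = marginal_risk_ratio(hypothetical_cells, factor=1)
print(f"Risk ratio for drug A: {rr_drug_a:.3f}")
print(f"Risk ratio for drug B: {rr_drug_b:.3f}")
```

Both comparisons use all 4,000 participants. If an interaction were present,
each effect would have to be estimated within strata of the other factor, using
only half of the cohort each time.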
FIGURE 11.1. In a factorial randomized trial, the investigator (a) selects a
sample from the population, (b) measures baseline variables, (c) randomly
assigns two active interventions and their controls to four groups as shown,
(d) applies interventions, (e) measures outcome variables during follow-up,
(f) analyzes the results, first combining the two drug A groups to be
compared with the two placebo A groups and then combining the two drug B
groups to be compared with the two placebo B groups.
Group or cluster randomization requires that the investigator randomly assign
naturally occurring groups or clusters of participants to the intervention groups
rather than assign individuals. A good example is a trial that enrolled players on
120 college baseball teams, randomly allocated half of the teams to an
intervention to encourage cessation of spit-tobacco use, and observed a
significantly lower rate of spit-tobacco use among players on the teams that
received the intervention compared to control teams (3). Applying the
intervention to groups of people may be more feasible and cost effective than
treating individuals one at a time, and it may better address research questions
about the effects of public health programs in the population. Some
interventions, such as a low-fat diet, are difficult to implement in only one
member of a family. Similarly, when participants in a natural group are
randomized individually, those who receive the intervention are likely to discuss
or share the intervention with family members, colleagues or acquaintances who
have been assigned to the control group. For example, a clinician in a group
practice who is randomly assigned to an educational intervention is very likely to
discuss this intervention with his colleagues. In the cluster randomization
design, the units of randomization and analysis are groups, not individuals.
Therefore, the effective sample size is smaller than the number of individual
participants and power is diminished. In fact, the effective sample size depends
on the correlation of the effect of the intervention among participants in the
clusters and is somewhere between the number of clusters and the number of
participants (4). Another drawback is that sample size estimation and data
analysis are more complicated in cluster randomization designs than for
individual randomization (4).
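The statement that the effective sample size lies between the number of
clusters and the number of participants can be illustrated with the usual
design-effect approximation for equal-sized clusters, 1 + (m - 1) x ICC, where m
is the cluster size and ICC is the intracluster correlation coefficient. This
formula is not given in the text above; we bring it in as an assumption, and the
cluster size and ICC values below are hypothetical.

```python
def effective_sample_size(n_clusters: int, cluster_size: int, icc: float) -> float:
    """Approximate effective sample size under cluster randomization.

    Assumes equal cluster sizes and the standard design effect
    1 + (cluster_size - 1) * icc.
    """
    design_effect = 1 + (cluster_size - 1) * icc
    return n_clusters * cluster_size / design_effect

# 120 teams as in the example above; 20 players per team is a hypothetical value.
for icc in (0.0, 0.05, 1.0):
    n_eff = effective_sample_size(n_clusters=120, cluster_size=20, icc=icc)
    print(f"ICC = {icc:.2f}: effective sample size is about {n_eff:.0f}")
```

With no within-cluster correlation the effective sample size equals the number
of participants (2,400 here); with complete correlation it falls to the number
of clusters (120).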
In equivalence trials, an intervention is compared to an active control.
Equivalence trials may be necessary when there is a known effective treatment
for a condition, or an accepted standard of care. In this situation, it may
be unethical to assign participants to placebo treatment. For example, because
bisphosphonates effectively prevent osteoporotic fractures in women at high risk,
new drugs should be compared against or added to this standard of care. In
general, there should be strong evidence that the active comparison treatment is
effective for the types of participants who will be enrolled in the trial.
The objective of equivalence trials is to prove that the new intervention is at
least as effective as the established one. It is impossible to prove that two
treatments are exactly equivalent because the sample size would be infinite.
Therefore, the investigator sets out to prove that the difference between the new
treatment and the established treatment is no more than a defined amount. If
the acceptable difference between the new and the established treatment is
small, the sample size for an equivalence trial can be large, much larger than
for a placebo-controlled trial. However, there is little clinical reason to test a
new therapy if it does not have significant advantages over an established
treatment, such as less toxicity or cost, or greater ease of use. Depending on
how much advantage the new treatment is judged to have, the allowable
difference between the efficacy of the new treatment and the established
treatment may be substantial. In this case, the sample size estimate for an
equivalence trial may be similar to that for a placebo-controlled trial.
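The dependence of sample size on the allowable difference can be made concrete
with a common normal-approximation formula for comparing two means. The formula
and all numbers below are our assumptions for illustration, not taken from the
text: the outcome is continuous with a common standard deviation, the two
treatments truly have the same effect, and a one-sided alpha is used.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_for_margin(margin: float, sd: float,
                           alpha_one_sided: float = 0.025,
                           power: float = 0.90) -> int:
    """Participants per group needed to rule out a difference larger than `margin`.

    Normal approximation: n = 2 * sd^2 * (z_alpha + z_beta)^2 / margin^2,
    assuming the true difference between treatments is zero.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * sd**2 * (z_alpha + z_beta) ** 2 / margin**2)

# Hypothetical margins expressed in standard deviation units (sd = 1).
for margin in (0.5, 0.25, 0.1):
    print(f"margin = {margin}: {n_per_group_for_margin(margin, sd=1.0)} per group")
```

Because the margin enters as a square, halving the allowable difference roughly
quadruples the number of participants needed.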
An important problem with equivalence trials is that the traditional roles of the
null and alternative hypotheses are reversed. The null hypothesis for equivalence
trials is that the effects of the two treatments are not more different than a
prespecified amount; the alternative hypothesis is that the difference does
exceed this amount. In this case, failure to reject the null hypothesis results in
accepting the hypothesis that the two treatments are equal. Inadequate sample
size, poor adherence to the study treatments and large loss to follow-up all
reduce the power of the study to reject the null hypothesis in favor of the
alternative. Therefore, an inferior new treatment may appear to be equivalent to
the standard when in reality the findings just represent an underpowered and
poorly done study.
Nonrandomized Between-Group Designs
Trials that compare groups that have not been randomized are far less effective
than randomized trials in controlling for the influence of confounding variables.
Analytic methods can adjust for baseline factors that are unequal in the two
study groups, but this strategy does not deal with the problem of unmeasured
confounding. When the findings of randomized and nonrandomized studies of the
same research question are compared, the apparent benefits of intervention are
much greater in the nonrandomized studies, even after adjusting statistically for
differences in baseline variables (5). The problem of confounding in
nonrandomized clinical studies can be serious and not fully removed by
statistical adjustment (6).
Sometimes participants are allocated to study groups by a pseudorandom
mechanism. For example, every other subject (or every subject with an even
hospital record number) may be assigned to the treatment group. Such designs
sometimes offer logistic advantages, but the predictability of the study group
assignment permits the investigator to tamper with it by manipulating the
sequence or eligibility of new subjects.
Participants are sometimes assigned to study groups by the investigator
according to certain specific criteria. For example, patients with diabetes may be
allocated to receive either insulin four times a day or long-acting insulin once a
day according to their willingness to accept four daily injections. The problem
with this design is that those willing to take four injections per day might be
more compliant with other health advice, and this might be the cause of any
observed difference in the outcomes of the two treatment programs.
Nonrandomized designs are sometimes chosen in the mistaken belief that they
are more ethical than randomization because they allow the participant or
clinician to choose the intervention. In fact, studies are only ethical if they have
a reasonable likelihood of producing the correct answer to the research question,
and randomized studies are more likely to lead to a conclusive and correct result
than nonrandomized designs. Moreover, the ethical basis for any trial is the
uncertainty as to whether the intervention will be beneficial or harmful. This
uncertainty, termed equipoise, means that an evidence-based choice of
interventions is not possible and justifies random assignment.
Within-Group Designs
Designs that do not include randomization can be useful options for some types
of questions. In a time-series design, measurements are made before and after
each participant receives the intervention (Fig. 11.2). Therefore, each
participant serves as his own control to evaluate the effect of treatment. This
means that innate characteristics such as age, sex, and genetic factors are not
merely balanced (as they are in between-group studies) but actually eliminated
as confounding variables.
The major disadvantage of within-group designs is the lack of a concurrent
control group. The apparent efficacy of the intervention might be due to
learning effects (participants do better on follow-up cognitive function tests
because they learned from the baseline test), regression to the mean
(participants who were selected for the trial because they had high blood
pressure at baseline are found to have lower blood pressure at follow-up simply
due to random variation in blood pressure), or secular trends (upper respiratory
infections are less frequent at follow-up because the trial started during flu
season). Within-group designs sometimes use a strategy of repeatedly starting
and stopping the treatment. If repeated onset and offset of the intervention
produces similar patterns in the outcome, this provides strong support that
these changes are due to the treatment. This approach is only useful when the
outcome variable responds rapidly and reversibly to the intervention (e.g., the
effect of a statin on LDL-cholesterol level). The design has a clinical
application in the so-called N-of-one
study in which an individual patient can alternate between active and inactive
versions of a drug (using identical-appearing placebo prepared by the local
pharmacy) to detect his particular response to the treatment (7).
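Regression to the mean in an uncontrolled within-group study can be
demonstrated with a small simulation. In this sketch every number is
hypothetical: each person has a stable true blood pressure plus independent
random variation at each visit, there is no treatment effect at all, and only
those whose baseline reading is at or above a cutoff are "enrolled."

```python
import random
from statistics import mean

def simulate_regression_to_mean(n: int = 10_000, true_mean: float = 130.0,
                                true_sd: float = 10.0, noise_sd: float = 10.0,
                                cutoff: float = 150.0, seed: int = 0):
    """Return (mean baseline, mean follow-up) among people selected for a high
    baseline reading, when no intervention effect exists."""
    rng = random.Random(seed)
    baseline_selected, followup_selected = [], []
    for _ in range(n):
        true_bp = rng.gauss(true_mean, true_sd)
        baseline = true_bp + rng.gauss(0, noise_sd)   # visit-to-visit variation
        followup = true_bp + rng.gauss(0, noise_sd)   # same person, no treatment
        if baseline >= cutoff:
            baseline_selected.append(baseline)
            followup_selected.append(followup)
    return mean(baseline_selected), mean(followup_selected)

baseline_mean, followup_mean = simulate_regression_to_mean()
print(f"Mean at baseline among those selected: {baseline_mean:.1f}")
print(f"Mean at follow-up among the same people: {followup_mean:.1f}")
```

The follow-up mean is lower even though nothing was done to the participants,
which is the apparent "efficacy" that a concurrent control group would expose.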
The crossover design has features of both within- and between-group designs
(Fig. 11.3). Half of the participants are randomly assigned to start with the
control period and then switch to active treatment; the other half begin with the
active treatment and then switch to control. This approach (or the Latin square
for more than two treatment groups) permits between-group, as well as
within-group analyses. The advantages of this design are substantial: it
minimizes the potential for confounding because each participant serves as his
own control and the paired analysis substantially increases the statistical power
of the trial so that it needs fewer participants. However, the disadvantages are
also substantial: a doubling of the duration of the study, and the added
complexity of analysis and interpretation created by the problem of potential
carryover effects. A carryover effect is the residual influence of the
intervention on the outcome during the period after it has been stopped (blood
pressure not returning to baseline levels for months after a course of diuretic
treatment, for example). To reduce the carryover effect, the investigator can
introduce an untreated washout period between treatments with the hope that the
outcome variable will return to normal before starting the next intervention, but
it is difficult to know whether all carryover effects have been eliminated. In
general, crossover studies are chiefly a good choice when the number of study
subjects is limited and the outcome responds rapidly and reversibly to an
intervention.
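The gain in power from the paired analysis can be approximated by comparing the
number of participants needed in a parallel-group trial with the number needed
in a crossover trial. The sketch below rests on assumptions that are not in the
text: a continuous outcome, a normal approximation, no carryover or period
effects, and a hypothetical within-person correlation between the two periods.

```python
from math import ceil
from statistics import NormalDist

def _z_sum_squared(alpha_two_sided: float, power: float) -> float:
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2

def total_n_parallel(delta: float, sd: float,
                     alpha_two_sided: float = 0.05, power: float = 0.90) -> int:
    """Total participants for two parallel groups: 2 * [2 sd^2 z^2 / delta^2]."""
    per_group = ceil(2 * sd**2 * _z_sum_squared(alpha_two_sided, power) / delta**2)
    return 2 * per_group

def total_n_crossover(delta: float, sd: float, within_corr: float,
                      alpha_two_sided: float = 0.05, power: float = 0.90) -> int:
    """Total participants when each person provides both periods.

    The within-person difference has variance 2 * sd^2 * (1 - within_corr).
    """
    var_diff = 2 * sd**2 * (1 - within_corr)
    return ceil(var_diff * _z_sum_squared(alpha_two_sided, power) / delta**2)

# Hypothetical effect size of 0.5 SD and within-person correlation of 0.5.
print("Parallel groups:", total_n_parallel(delta=0.5, sd=1.0))
print("Crossover:", total_n_crossover(delta=0.5, sd=1.0, within_corr=0.5))
```

Under these assumptions the crossover trial needs roughly (1 - correlation) / 2
as many participants, but each participant must be followed for twice as long.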
FIGURE 11.2. In a time-series trial, the investigator (a) selects a sample
from the population, (b) measures baseline and outcome variables, (c)
applies the intervention to the whole cohort, (d) follows up the cohort and
measures outcome variables again, (e) (optional) removes the intervention
and measures outcome variables again, and so on.
A variation on the crossover design may be appropriate when participants are
randomly assigned to usual care or to a very appealing intervention (such as
weight loss, yoga or elective surgery). Participants assigned to usual care may
be provided the active intervention at the end of the parallel, two-group period,
making enrollment much more attractive. The outcome can be measured at the
end of the intervention period in this group, providing within-group crossover
data on the participants who receive the delayed intervention.
Trials for Regulatory Approval of New Interventions
Many trials are done to test the effectiveness and safety of new treatments that
might be considered for approval for marketing by the U.S. Food and Drug
Administration (FDA) or another international regulatory body. Trials are also
done to determine whether drugs that have FDA approval for one condition might
be approved for the treatment or prevention of other conditions. The design and
conduct of these trials are generally the same as for other trials, but regulatory
requirements must be considered.
FIGURE 11.3. In a crossover randomized trial, the investigator (a) selects a
sample from the population, (b) measures baseline and outcome variables,
(c) randomizes the participants (R), (d) applies interventions, (e) measures
outcome variables during follow-up, (f) allows washout period to reduce
carryover effect, (g) applies the intervention to former placebo group and
placebo to former intervention group, (h) measures outcome variables again
at the end of follow-up.
The FDA publishes general and specific guidelines on how such trials should be
conducted (search for FDA on the web). It would be wise for
investigators and staff conducting trials with the goal of obtaining FDA approval
of a new medication or device to seek specific training on these general
guidelines, called Good Clinical Practice. In addition, the FDA provides
specific guidelines for studies of certain outcomes. For example, studies
designed to obtain FDA approval of treatments for hot flashes in menopausal
women must currently include participants with at least seven hot flashes per
day or 50 per week. FDA guidelines are regularly updated and similar guidelines
are available from international regulatory agencies.
Trials for regulatory approval of new treatments are generally described by
phase. This system refers to an orderly progression in the testing of a new
treatment, from experiments in animals (preclinical) and initial unblinded,
uncontrolled treatment of a few human volunteers to test safety (phase I), to
small randomized blinded trials that test the effect of a range of doses on side
effects and clinical outcomes (or surrogate outcomes) (phase II), to randomized
trials large enough to test the hypothesis that the treatment improves the
targeted condition (such as blood pressure) or reduces the risk of disease (such
as stroke) with acceptable safety (phase III) (Table 11.1). Phase IV refers to
large studies (which may or may not be randomized trials) conducted after a
drug is approved. These studies are often conducted (and financed) by marketing
departments of pharmaceutical companies with the goals of assessing the rate of
serious side effects when used in large populations and identifying additional
uses of the drug that might be approved by the FDA.
Pilot Clinical Trials
Designing and conducting a successful clinical trial requires extensive
information on the type, dose and duration of the intervention, the likely effect
of the intervention on the outcome, potential adverse effects and the feasibility
of recruiting, randomizing and maintaining participants in the trial. Often, the
only way to obtain some of this information is to conduct a good pilot study.
Table 11.1 Stages in Testing New Therapies
Preclinical   Studies in cell cultures and animals
Phase I       Unblinded, uncontrolled studies in a few volunteers to test
              safety
Phase II      Relatively small randomized blinded trials to test
              tolerability and different intensity or dose of the
              intervention on surrogate or clinical outcomes
Phase III     Relatively large randomized blinded trials to test the
              effect of the therapy on clinical outcomes
Phase IV      Large trials or observational studies conducted after the
              therapy has been approved by the FDA to assess the rate
              of serious side effects and evaluate additional therapeutic
              uses
Pilot studies vary from a brief test of the feasibility of recruitment to a full-scale
pilot in hundreds of participants. Pilot studies should be as carefully planned as
the main trial, with clear objectives and methods. Many pilot studies are focused
primarily on determining the feasibility, time required and cost of recruiting
adequate numbers of eligible participants, and discovering if they are willing to
accept randomization and can comply with the intervention. Pilot studies may
also be designed to demonstrate that planned measurements, data collection
instruments and data management systems are feasible and efficient. For pilot
trials focused primarily on feasibility, a control group is generally not included.
An important goal of many pilot studies is to define the optimal
intervention: the frequency, intensity and duration of the intervention that will
result in minimal toxicity and maximal effectiveness. Phase I and II studies can
be viewed as pilot studies with these goals.
Another important goal of pilot studies is to provide parameters to allow more
accurate estimation of sample size. Sound estimates of the rate of the outcome
or mean outcome measure in the placebo group, the effect of the intervention on
the main outcome (effect size), and the statistical variability of this outcome
are crucial to planning the sample size. In some situations, an estimate of the
effect size and its variability can be achieved by delivering the intervention to all
pilot subjects. For example, if it is known that a surgical procedure results in a
certain volume of blood loss, evaluating the amount of blood loss in a small
group of pilot study participants who undergo a new procedure might provide a
good estimate of the effect size. However, if there is likely to be a placebo
effect, it may be better to randomize pilot participants to receive the new
intervention or placebo. For example, to obtain an estimate of the effect of a
new treatment for pain related to dental extractions, the fact that pain responds
markedly to placebo treatment would result in a biased estimate of effect if no
placebo group is included.
Many trials fall short of estimated power not because the effect of the
intervention is less than anticipated, but because the rate of outcome events in
the placebo group is much lower than expected. This screening bias
likely occurs because persons who fit the enrollment criteria for a clinical trial
and agree to be randomized are healthier than the general population with the
condition of interest. Therefore, in some trials, it is crucial to determine the
rate of the outcome in the placebo
group, which can only be done by randomizing participants to placebo in a pilot
study.
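The loss of power that results from a lower-than-expected placebo event rate
can be sketched with a normal approximation for comparing two proportions. The
sample size, event rates, and relative risk below are hypothetical, and the
approximation is ours rather than a method described in the text.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(n_per_group: int, placebo_rate: float,
                          relative_risk: float,
                          alpha_two_sided: float = 0.05) -> float:
    """Approximate power to detect a given relative risk (normal approximation).

    The treated-group rate is placebo_rate * relative_risk; the small chance of
    rejecting in the wrong direction is ignored.
    """
    p0 = placebo_rate
    p1 = placebo_rate * relative_risk
    se = sqrt(p0 * (1 - p0) / n_per_group + p1 * (1 - p1) / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    return NormalDist().cdf(abs(p0 - p1) / se - z_alpha)

# Hypothetical trial: 1,000 per group, treatment lowers risk by 30% (RR = 0.7).
planned = power_two_proportions(1000, placebo_rate=0.10, relative_risk=0.7)
actual = power_two_proportions(1000, placebo_rate=0.05, relative_risk=0.7)
print(f"Power if the placebo event rate is 10%: {planned:.2f}")
print(f"Power if the placebo event rate is only 5%: {actual:.2f}")
```

The intervention works equally well in both scenarios; power falls only because
fewer outcome events accrue.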
A pilot study should have a short but complete protocol (approved by the
Institutional Review Board), data collection forms and analysis plans. Variables
should include the typical baseline measures, predictors and outcomes included
in a clinical trial, but also estimates of the number of subjects available or
accessible for recruitment, the number who are contacted or respond using
different sources or recruitment techniques, the number and proportion eligible
for the trial, those who are eligible but refuse (or say they would refuse)
randomization, the time and cost of recruitment and randomization, and
estimates of adherence to the intervention and other aspects of the protocol,
including study visits. It may be very helpful to debrief both subjects
and staff after the pilot study to obtain their views on how the trial methods
could be improved.
A good pilot study requires substantial time and can be costly, but markedly
improves the chance of funding for major clinical trials and the likelihood that
the trial will be successfully completed.
Conducting a Clinical Trial
Follow-up and Adherence to the Protocol
If a substantial number of study participants do not receive the study
intervention, do not adhere to the protocol, or are lost to follow-up, the results
of the trial are likely to be underpowered or biased. Strategies for maximizing
follow-up and adherence are outlined in Table 11.2.
The effect of the intervention (and the power of the trial) is reduced to the
degree that participants do not receive it. The investigator should try to choose
a study drug or intervention that is easy to apply or take and is well tolerated.
Adherence is likely to be poor if a behavioral intervention requires hours of
practice by participants. Drugs that can be taken in a single daily dose are the
easiest to remember and therefore preferable. The protocol should include
provisions that will enhance adherence, such as instructing participants to take
the pill at a standard point in the morning routine and giving them pill containers
labeled with the day of the week.
There is also a need to consider how best to measure adherence to the
intervention, using such approaches as self-report, pill counts, pill containers
with computer chips that record when the container is opened, and serum or
urinary metabolite levels. This information can identify participants who are not
complying, so that approaches to improving adherence can be instituted and the
investigator can interpret the findings of the study appropriately.
Adherence to study visits and measurements can be enhanced by discussing
what is involved in the study before consent is obtained, by scheduling the visits
at a time that is convenient and with enough staff to prevent waiting, by calling
the participant the day before each visit, and by reimbursing travel expenses
and other out-of-pocket costs.
Failure to follow trial participants and measure the outcome of interest can result
in biased results, diminished credibility of the findings, and decreased statistical
power. For example, a trial of nasal calcitonin spray to reduce the risk of
osteoporotic fractures reported that treatment reduced fracture risk by 36% (8).
However, about 60% of those randomized were lost to follow-up, and it was not
known if fractures had occurred in these participants. Because the overall
number of fractures was small, even a few fractures in the participants lost to
follow-up could have altered the findings of the trial. This uncertainty diminished
the credibility of the study findings (9).
Table 11.2 Maximizing Follow-up and Adherence to the Protocol
Principle (flush left) and Examples (indented)
Choose subjects who are likely to be adherent to the intervention and protocol
  Require completion of two or more comprehensive visits before randomization
  Exclude those who are nonadherent in a prerandomization run-in period
  Exclude those who are likely to move or be noncompliant
Make the intervention easy
  Use a single tablet once a day if possible
Make study visits convenient and enjoyable
  Schedule visits often enough to maintain close contact but not frequently
  enough to be tiresome
  Schedule visits at night or on weekends, or collect information by phone or e-mail
  Have adequate and well-organized staff to prevent waiting
  Provide reimbursement for travel
  Establish interpersonal relationships with subjects
Make study measurements painless, useful and interesting
  Choose noninvasive, informative tests that are otherwise costly or unavailable
  Provide test results of interest to participants and appropriate counseling
  or referrals
Encourage subjects to continue in the trial
  Never discontinue subjects from follow-up for protocol violations, adverse
  events, or side effects
  Send participants birthday and holiday cards
  Send newsletters and e-mail messages
  Emphasize the scientific importance of adherence and follow-up
Find subjects who are lost to follow-up
  Pursue contacts of subjects
  Use a tracking service

Even if participants violate the protocol or discontinue the trial intervention,
they should be followed so that their outcomes can be used in intention-to-treat
analyses. In many trials, participants who violate the protocol by enrolling in
another trial, missing study visits, or discontinuing the study intervention are
discontinued from follow-up; this can result in biased or uninterpretable results.
Consider, for example, a drug that causes a symptomatic side effect that results
in more frequent discontinuation of the study medication in those on active
treatment compared to those on placebo. If participants who discontinue study
medication are not continued in follow-up, this can bias the findings if the side
effect is associated with the main outcome.
Some strategies for achieving complete follow-up are similar to those discussed
for cohort studies (Chapter 7). At the outset of the study, participants should be
informed of the importance of follow-up and investigators should record the
name, address, and telephone number of one or two close acquaintances who
will always know where the participant is. In addition to enhancing the
investigator's ability to assess vital status, the ability to contact participants by
phone or e-mail may give him access to proxy outcome measures from those who
refuse to come for a visit at the end. The Heart and Estrogen/Progestin
Replacement Study (HERS) trial used all of these strategies: 89% of the women
returned for the final clinic visit after an average of 4 years of follow-up, another
8% had a final telephone contact for outcome ascertainment, and information on
vital status was determined for every participant by using phone contact,
registered letters, contacts with close relatives, and a tracking service (10).
The design of the trial should make it as easy as possible for participants to
adhere to the intervention and complete all follow-up visits and measurements.
Long and stressful visits can deter some participants from attending. Participants
are more likely to return for visits that involve noninvasive tests, such as
electron beam computed tomography, than for invasive tests such as coronary
angiography. Collecting follow-up information by phone or electronic means may
improve adherence for participants who find visits difficult. On the other hand,
participants may lose interest in a trial if there are not some social or
interpersonal rewards for participation. Participants may tire of study visits that
are scheduled monthly, and they may lose interest if visits only occur annually.
Follow-up is improved by making the trial experience positive and enjoyable for
study participants: designing trial measurements and procedures to be painless
and interesting; performing tests that would not otherwise be available;
providing results of tests to participants (if the result will not influence
outcomes); sending newsletters, e-mail notes of appreciation, and holiday and
birthday cards; giving inexpensive gifts; and developing strong interpersonal
relationships with an enthusiastic and friendly study staff.
Two design aspects that are specific to trials may improve adherence and
follow-up: screening visits before randomization and a run-in period. Asking
participants to attend one or two screening visits before randomization may
exclude participants who find that they cannot complete such visits. The trick
here is to set the hurdles for entry into the trial high enough to exclude those
who will later be nonadherent, but not high enough to exclude participants who
will turn out to have satisfactory adherence.
A run-in period may be useful for increasing the proportion of study
participants who adhere to the intervention and follow-up procedures (Fig. 11.4).
During the baseline period, all participants are placed on placebo. A specified
time later (usually a few weeks), only those who have complied with the
intervention (e.g., taken at least 80% of the assigned study medication) are
randomized. Excluding nonadherent participants before randomization in this
fashion may increase the power of the study and permit a better estimate of the
full effects of intervention. However, a run-in period delays entry into the trial,
the proportion of participants excluded is generally small, and participants
randomized to the active drug may notice a change in their medication following
randomization, contributing to unblinding. It is also not clear that a placebo
run-in is more effective in increasing adherence than the requirement that
participants complete one or more screening visits before randomization. In the
absence of a specific reason to suspect that adherence in the study will be poor,
it is probably not necessary to include a run-in period in the trial design.
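The run-in rule described above can be sketched in a few lines. This is only a minimal illustration, assuming adherence is estimated from pill counts: the function names, the record layout, and the participant data are hypothetical, and the 80% cutoff is simply the example threshold mentioned in the text.

```python
def adherence_fraction(pills_dispensed, pills_returned, pills_expected):
    """Estimate the fraction of expected doses taken from a pill count."""
    if pills_expected <= 0:
        raise ValueError("pills_expected must be positive")
    taken = pills_dispensed - pills_returned
    # Clamp to [0, 1]: a pill count cannot show more than full adherence.
    return max(0.0, min(1.0, taken / pills_expected))


def eligible_for_randomization(run_in_records, threshold=0.80):
    """Return IDs of participants whose run-in adherence meets the threshold."""
    eligible = []
    for rec in run_in_records:
        frac = adherence_fraction(rec["dispensed"], rec["returned"], rec["expected"])
        if frac >= threshold:
            eligible.append(rec["id"])
    return eligible


# Hypothetical run-in data: 28 days of once-daily placebo, 30 pills dispensed.
run_in = [
    {"id": "A01", "dispensed": 30, "returned": 2, "expected": 28},   # 28 of 28 taken
    {"id": "A02", "dispensed": 30, "returned": 12, "expected": 28},  # 18 of 28 taken
    {"id": "A03", "dispensed": 30, "returned": 7, "expected": 28},   # 23 of 28 taken
]
randomizable = eligible_for_randomization(run_in)
print(randomizable)
```

A pill count is only one of the adherence measures listed earlier; the same screening logic would apply to any measure that can be expressed as a fraction of assigned doses taken.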
A variant of the placebo run-in design is the use of the active drug rather than
the placebo for the run-in period. In addition to increasing adherence among
those who enroll, an active drug run-in is designed to select participants who
tolerate and respond to the intervention. The effect of treatment on an
intermediary variable (i.e., a biomarker associated with the outcome) is used as
the criterion for randomization. In a trial of the effect of an antiarrhythmic drug
on mortality, for example, the investigators randomized only those participants
whose arrhythmias were satisfactorily suppressed without undue side effects
(11). This design maximized power by increasing the proportion of the
intervention group that is responsive to the intervention. It also improved
generalizability by mimicking the clinician's tendency to continue using a drug
only when he sees evidence that it is working. However, the findings of trials
using this strategy may not be generalizable to those excluded.
Using an active run-in may also result in underestimation of the rate of adverse
effects. A trial of the effect of carvedilol on mortality in patients with congestive
heart failure used a 2-week active run-in period. During the run-in, 17 people
had worsening congestive heart failure and 7 died (12). These people were not
randomized in the trial, and these adverse effects of drug treatment were not
included as outcomes.
FIGURE 11.4. In a randomized trial preceded by a run-in period to test
compliance, the investigator (a) selects a sample from the population, (b)
measures baseline variables, (c) conducts the run-in, (d) randomizes
adherent participants (R), (e) applies interventions, and (f) measures outcome
variables during follow-up.
Adjudicating Outcomes
Most self-reported outcomes, such as history of stroke or a participant report of
quitting smoking, are not 100% accurate. Self-reported outcomes that are
important to the trial should be confirmed if possible. Occurrence of disease,
such as a stroke, is generally adjudicated by (a) creating clear criteria for the
outcome (e.g., a new, persistent neurologic deficit with corresponding lesion on
computed tomography or magnetic resonance imaging scan), (b) collecting the
clinical documents needed to make the assessment (e.g., discharge summaries
and radiology reports), and (c) having experts review each potential case and
judge whether the criteria for the diagnosis have been met. The adjudication is
often done by two experts working independently, then resolving discordant
cases by discussion between the two or with a third expert. Those who collect
the information and adjudicate the cases must be blinded to the treatment
assignment.
Monitoring Clinical Trials
Investigators must ensure that participants are not exposed to a harmful
intervention, denied a beneficial intervention, or continued in a trial if the
research question is unlikely to be answered.
The most pressing reason to monitor clinical trials is to make sure that the
intervention does not turn out unexpectedly to be harmful. If harm is judged to
be clearly present and to outweigh any benefits, the trial should be stopped.
Second, if an intervention is more effective than was estimated when the trial
was designed, then benefit can be observed early in the trial. When clear
benefit has been proved, it may be unethical to continue the trial and delay
offering the intervention to participants on placebo and to others who could
benefit. Third, if there is a very low probability of answering the research
question, it may be unethical to continue participants in a trial that requires time
and effort and that may cause some discomfort or risk. If a clinical trial is
scheduled to continue for 5 years, for example, but after 4 years there is little
difference in the rate of outcome events in the intervention and control groups,
then the conditional power (the likelihood of answering the research
question given the results thus far) becomes very small and consideration should
be given to stopping the trial. Trials are also sometimes stopped early because
investigators are unable to recruit or retain enough participants to provide
adequate power to answer the research question, or because adherence to the
intervention is very poor.
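The idea of conditional power can be made concrete with a small calculation. The sketch below is not taken from the text: it assumes the standard Brownian-motion ("B-value") approximation, in which the interim Z statistic at information fraction t is rescaled to B(t) = Z(t)·sqrt(t) and the remaining data add a normally distributed increment. The function name and parameters are hypothetical, only the upper rejection region is counted, and information is assumed to accrue in proportion to calendar time.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def conditional_power(z_interim, info_fraction, drift=None, alpha=0.05):
    """Probability that the final Z exceeds the critical value, given the
    interim result.

    z_interim:     standardized test statistic observed so far
    info_fraction: fraction of the planned information accrued (0 < t < 1)
    drift:         assumed expected value of the final Z; if None, the
                   current trend (z_interim / sqrt(t)) is extrapolated
    alpha:         two-sided significance level of the final test
    """
    t = info_fraction
    if not 0.0 < t < 1.0:
        raise ValueError("info_fraction must be strictly between 0 and 1")
    b = z_interim * sqrt(t)          # B-value: interim Z rescaled by sqrt(t)
    if drift is None:
        drift = b / t                # assume the observed trend continues
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    expected_final = b + drift * (1.0 - t)
    return 1.0 - _STD_NORMAL.cdf((z_crit - expected_final) / sqrt(1.0 - t))


# Hypothetical interim look after 4 of 5 planned years with little difference
# between groups (Z = 0.5).
cp_trend = conditional_power(0.5, 0.8)                # current trend continues
cp_design = conditional_power(0.5, 0.8, drift=2.80)   # effect assumed at design (80% power)
print(f"conditional power, current trend: {cp_trend:.4f}")
print(f"conditional power, design effect: {cp_design:.4f}")
```

Under either assumption the conditional power in this made-up scenario is only a few percent or less, which is the situation the paragraph above describes as grounds for considering early termination.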
The research question might be answered by other trials before a given trial is
finished. It is desirable to have more than one trial that provides evidence
concerning a given research question, but if definitive evidence becomes
available during a trial, the investigator should consider stopping.
Most clinical trials should include an interim monitoring plan. Trials funded by
the National Institutes of Health (NIH) generally require interim monitoring,
even if the intervention is considered safe (such as a behavioral intervention for
weight loss). How interim monitoring will occur should be considered in the
planning of any clinical trial. In small trials with interventions likely to be safe,
the trial investigators might monitor safety or appoint a single independent data
and safety monitor. In large trials and trials in which adverse effects of the
intervention are unknown or potentially dangerous, interim monitoring is
generally performed by a committee (usually known as the Data and Safety
Monitoring Board [DSMB] or Data Monitoring Committee) consisting of experts in
the disease or condition under study, biostatisticians, clinical trialists, ethicists,
and occasionally a representative of the patient group being studied. These
experts are not involved in the trial, and should have no personal or financial
interest in its continuation. DSMB guidelines and procedures should be detailed
in writing before the trial begins. Guidance for developing DSMB procedures is
provided by the FDA and the NIH. Items to include in these guidelines are
outlined in Table 11.3.
Stopping a trial should always be a careful decision that balances ethical
responsibility to the participants and the advancement of scientific knowledge.
Whenever a trial is stopped early, the chance to provide more conclusive results
will be lost. The decision is often complex, and potential risks to participants
must be weighed against possible benefits. Statistical tests of significance
provide important but not conclusive information for stopping a trial. Trends
over time and effects on related outcomes should be evaluated for consistency,
and the impact of stopping the study early on the credibility of the findings
should be carefully considered (Example 11.1).
There are many statistical methods for monitoring the interim results of a trial.
Analyzing the results of a trial repeatedly is a form of multiple hypothesis testing
and thereby increases the probability of a type I error. For example, if
$\alpha$ = 0.05 is used for each interim test and the results of a trial are analyzed
four times during the trial and again at the end, the probability of making a
type I error is increased from 5% to about 14% (13). To address this problem,
statistical methods for interim monitoring generally decrease the $\alpha$ for each
test so that the overall $\alpha$ is close to 0.05. There are multiple approaches to
deciding how to spend $\alpha$ (Appendix 11.1).
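The inflation from 5% to about 14% can be checked with a short simulation. The sketch below is an illustration under stated assumptions, not the calculation used in the cited reference: it assumes five equally spaced looks, a normally distributed test statistic, a two-sided threshold of 1.96 at every look, and no true treatment effect. The function name, seed, and number of simulated trials are arbitrary choices.

```python
import random
from math import sqrt


def overall_type1_error(n_looks=5, z_crit=1.96, n_trials=100_000, seed=2007):
    """Estimate the chance of at least one 'significant' look under the null."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Each look adds an equal, independent block of null data.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / sqrt(look)   # Z statistic on all data so far
            if abs(z) > z_crit:
                false_positives += 1      # trial would be declared positive
                break
    return false_positives / n_trials


estimate = overall_type1_error()
print(f"Estimated overall type I error with 5 looks: {estimate:.3f}")
```

Because each look reuses the earlier data, the looks are positively correlated, which is why the estimate falls well below the roughly 23% that five independent tests at 0.05 would give.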
Table 11.3 Monitoring a Clinical Trial
Elements to monitor
  Recruitment
  Randomization
  Adherence to intervention, and blinding
  Follow-up completeness
  Important variables
  Outcomes
  Adverse effects
  Potential co-interventions
Who will monitor
  Trial investigator or a single monitor if small trial with minor hazards
  Independent DSMB otherwise
Methods for interim monitoring
  Specify statistical approach and frequency of monitoring in advance
  Importance of judgment and context in addition to statistical stopping rules
Changes in the protocol that can result from monitoring
  Terminate the trial
  Modify the trial
  Stop one arm of the trial
  Add new measurements necessary for safety monitoring
  Discontinue high-risk participants
  Extend the trial in time
  Enlarge the trial sample

Example 11.1 Trials That Have Been Stopped Early
Canadian Atrial Fibrillation Anticoagulation Study (CAFA) (14): Atrial
fibrillation is a risk factor for stroke and embolic events. The CAFA study was a
blinded, randomized, placebo-controlled trial to evaluate the efficacy of warfarin
in decreasing the rate of stroke, systemic embolism, or intracerebral or fatal
bleeding in patients with nonrheumatic atrial fibrillation. The trial was designed
to enroll 660 subjects and follow them on therapy for 3.5 years. During the trial
(after 383 patients had been randomized and followed for a mean of 1.2 years),
the results of two other randomized trials were reported showing a significant
decrease in stroke risk and a low rate of major bleeding in those treated with
warfarin. The Steering Committee decided that the evidence of benefit with
warfarin was sufficiently compelling to stop the trial.
Cardiac Arrhythmia Suppression Trial (CAST) (11): The occurrence of
ventricular premature depolarizations in survivors of myocardial infarction (MI)
is a risk factor for sudden death. The CAST evaluated the effect of
antiarrhythmic therapy (encainide, flecainide, or moricizine) in patients with
asymptomatic or mildly symptomatic ventricular arrhythmia after MI on risk for
sudden death. During an average of 10 months of follow-up, the participants
treated with active drug had a higher total mortality (7.7% vs. 3.0%) and a
higher rate of death from arrhythmia (4.5% vs. 1.5%) than those assigned to
placebo. The trial was planned to continue for 5 years, but this large and highly
statistically significant difference led to the trial being stopped after 18 months.
Coronary Drug Project (CDP) (15, 16): The CDP was a randomized, blinded
trial to determine if five different cholesterol-lowering interventions (conjugated
estrogen 5.0 mg/day; estrogen 2.5 mg/day; clofibrate 1.8 g/day;
dextrothyroxine 6.0 mg/day; niacin 3.0 g/day) reduced the 5-year mortality
rate. The CDP enrolled 8,341 men with MI who were followed for at least 5
years. With an average of 18 months of follow-up, the high-dose estrogen arm
was stopped due to an excess of nonfatal MI (6.2% compared with 3.2%) and
venous thromboembolic events (3.5% compared with 1.5%), as well as testicular
atrophy, gynecomastia, breast tenderness, and decreased libido. At the same
time, dextrothyroxine was stopped in the subgroup of men who had frequent
premature ventricular beats on their baseline electrocardiogram because the
death rate in this subgroup was 38.5% compared with 11.5% in the same
subgroup receiving placebo. Dextrothyroxine therapy was stopped in all subjects
shortly thereafter due to an excess mortality rate in the overall treated group.
Two years before the planned end of the study, the 2.5-mg-dose estrogen arm
was also stopped because there was no evidence of any beneficial effect and an
increased risk of venous thromboembolic events.
Physicians' Health Study (17): The Physicians' Health Study was a randomized
trial of the effect of aspirin (325 mg every other day) on cardiovascular
mortality. The trial was stopped after 4.8 years of the planned 8-year follow-up.
There was a statistically significant reduction in risk of MI in the treated group
(relative risk for nonfatal MI = 0.56), but the number of cardiovascular disease
deaths in each group was equal. The rate of cardiovascular disease deaths
observed in the study was far lower than expected (88 after 4.8 years of
follow-up vs. 733 expected), and the trial was stopped because of the beneficial
effect of aspirin on risk for nonfatal MI coupled with the very low conditional
power to detect a favorable impact of aspirin therapy on cardiovascular mortality.
Adaptive Design
Clinical trials are generally conducted according to a protocol that does not
change during the conduct of the study. However, for some types of treatments
and conditions, it is possible to monitor results from the trial as it progresses
and change the design of the trial based on interim analyses of the results (18).
For example, consider a trial of several doses of a new treatment for menopausal
hot flashes. The initial design may plan to enroll 40 women to a placebo group
and 40 to each of three doses for 12 weeks of treatment over an enrollment
period lasting 1 year. Review of the results after the first 10 women in each
group have completed the first 4 weeks of treatment might reveal that there is a
trend toward an effect only in the highest dose. It may be more efficient to stop
assigning participants to the two lower doses and continue randomizing only to
the highest dose. In this case, the design of the trial can be adapted to the
interim results by changing the design in midstream to use only one dose versus
the placebo. Other facets of a trial that could be changed based on interim
results include increasing or decreasing the sample size or duration of the trial if
interim results indicate that the effect size or rate of outcomes differs from the
original assumptions.
These adaptive designs are feasible only for treatments that produce outcomes
that are measured and analyzed early enough in the course of the trial that
changes can be made in the design. To prevent bias in the ascertainment of
outcomes, the interim analyses and consideration of change in design must be
done by an independent group such as a DSMB that reviews unblinded data.
Furthermore, multiple interim analyses will increase the probability of finding a
result that is due to chance variations in the early results from the trial; the
increased chance of a type I error must be considered in the design and
analysis of the results. Adaptive designs are also more complex to conduct and
analyze, informed consent must include the range of possible changes in the
study design or be repeated, and it is difficult to estimate the cost of an
adaptive trial and the specific resources necessary to complete it.
With these precautions and limitations, adaptive designs are efficient and may
be valuable, especially during the development of a new treatment, allowing
earlier identification of the best dose and duration and ensuring that a high
proportion of participants receive the optimal treatment.
Analyzing the Results
Statistical analysis of the primary hypothesis of a clinical trial is generally
straightforward. If the outcome is dichotomous, the simplest approach is to
compare the proportions in the study groups using a chi-squared test. When the
outcome is continuous, a t test may be used, or a nonparametric alternative if
the outcome is not normally distributed. In most clinical trials, the duration of
follow-up is different for each participant, necessitating the use of survival time
methods. More sophisticated statistical models such as Cox proportional hazards
analysis can accomplish this and at the same time adjust for chance
maldistributions of baseline confounding variables. The technical details of when
and how to use these methods are described elsewhere (19).
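For the dichotomous case, the comparison of two proportions can be sketched as follows. This is a minimal illustration using the Pearson chi-squared statistic for a 2 x 2 table without continuity correction; the function name and the event counts are hypothetical and do not come from any trial discussed in this chapter.

```python
from math import erfc, sqrt


def chi_squared_2x2(events_a, n_a, events_b, n_b):
    """Pearson chi-squared test (1 df, no continuity correction) comparing
    the proportion with the outcome in two randomized groups."""
    a, b = events_a, n_a - events_a      # group A: events, non-events
    c, d = events_b, n_b - events_b      # group B: events, non-events
    n = n_a + n_b
    margin_product = (a + b) * (c + d) * (a + c) * (b + d)
    if margin_product == 0:
        raise ValueError("a margin of the 2x2 table is zero")
    chi2 = n * (a * d - b * c) ** 2 / margin_product
    # With 1 degree of freedom, chi2 is the square of a standard normal Z,
    # so the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical trial: 30/500 events on active treatment vs. 50/500 on placebo.
chi2, p = chi_squared_2x2(30, 500, 50, 500)
print(f"chi-squared = {chi2:.2f}, P = {p:.3f}")
```

When follow-up time differs among participants, as the paragraph above notes, a survival-time method would replace this simple comparison of proportions.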
Two important issues that should be considered in the analysis of clinical trial
results are the primacy of the intention-to-treat analytic approach and the
ancillary role for subgroup analyses. The investigator must decide what to do
with nonadherence or crossovers: participants assigned to the active
treatment group who do not get treatment or discontinue it, and those assigned
to the control group who end up getting active treatment. An analysis done by
intention-to-treat compares outcomes between the study groups with every
participant analyzed according to his randomized group assignment, regardless
of whether he adhered to the assigned intervention. Intention-to-treat analyses
may underestimate the full effect of the treatment, but they guard against the
more important problem of biased results.
An alternative to the intention-to-treat approach is to analyze only those who
comply with the intervention. It is common, for example, to perform per
protocol analyses that include only participants who were fully adherent to
the protocol. This is defined in various ways, but often includes only participants
in both groups who were adherent to the assigned study medication, completed a
certain proportion of visits or measurements, and had no other protocol
violations. A subset of the per protocol analysis is an as-treated
analysis in which only participants who were adherent to the intervention are
included. These analyses seem reasonable because participants can only be
affected by an intervention they actually receive. The problem arises, however,
that participants who adhere to the study treatment and protocol may be
different from those who drop out in ways that are related to the outcome. In
the Postmenopausal Estrogen-Progestin Interventions Trial (PEPI), 875
postmenopausal women were randomly assigned to four different estrogen or
estrogen plus progestin regimens and placebo (20). Among women assigned to
the unopposed estrogen arm, 30% had discontinued treatment after 3 years
because of endometrial hyperplasia, which is a precursor of endometrial cancer.
If these women are eliminated in a per protocol analysis, the association of
estrogen therapy and endometrial cancer will be missed.
The major disadvantage of the intention-to-treat approach is that participants
who choose not to take the assigned intervention will, nevertheless, be included
in the estimate of the effects of that intervention. Therefore, substantial
discontinuation or crossover between treatments will cause intention-to-treat
analyses to underestimate the magnitude of the effect of treatment. For this
reason, results of trials are often evaluated with both intention-to-treat and per
protocol analyses. For example, in the Women's Health Initiative randomized
trial of the effect of estrogen plus progestin treatment on breast cancer risk, the
hazard ratio was 1.24 (P = 0.003) from the intention-to-treat analysis and 1.49
in the as-treated analysis (P < 0.001) (21). If the results of intention-to-treat
and per protocol analyses differ, the intention-to-treat results generally
predominate for estimates of efficacy because they preserve the value of
randomization and, unlike per protocol analyses, can only bias the estimated
effect in the conservative direction (favoring the null hypothesis). However, for
estimates of harm (e.g., the breast cancer findings noted above), as-treated or
per protocol analyses provide the most conservative estimates, as interventions
can only be expected to cause harm in exposed persons. Results can only be
analyzed both by intention-to-treat and per protocol if follow-up measures are
completed regardless of whether participants adhere to treatment, which should
always be a goal.
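The difference between the two analytic approaches can be illustrated with a toy calculation. The sketch below uses entirely hypothetical data in which nonadherent participants in the active arm have a worse prognosis than adherent ones; the function names and record layout are assumptions made for illustration only.

```python
def risk(participants):
    """Proportion of participants with the outcome event."""
    return sum(p["event"] for p in participants) / len(participants)


def risk_ratio(participants, per_protocol=False):
    """Risk ratio (active vs. placebo) by randomized arm.

    Intention-to-treat keeps everyone in the arm they were assigned to;
    per protocol drops participants who did not adhere.
    """
    analyzed = [p for p in participants if p["adherent"] or not per_protocol]
    active = [p for p in analyzed if p["arm"] == "active"]
    placebo = [p for p in analyzed if p["arm"] == "placebo"]
    return risk(active) / risk(placebo)


def make_group(arm, adherent, n_events, n_total):
    """Build n_total hypothetical records, the first n_events with the outcome."""
    return [
        {"arm": arm, "adherent": adherent, "event": i < n_events}
        for i in range(n_total)
    ]


# Hypothetical data: nonadherent participants in the active arm are sicker.
cohort = (
    make_group("active", True, 8, 80)      # 10% risk among adherent
    + make_group("active", False, 8, 20)   # 40% risk among nonadherent
    + make_group("placebo", True, 18, 90)  # 20% risk
    + make_group("placebo", False, 2, 10)  # 20% risk
)

itt = risk_ratio(cohort)
pp = risk_ratio(cohort, per_protocol=True)
print(f"ITT risk ratio: {itt:.2f}; per protocol risk ratio: {pp:.2f}")
```

In this made-up cohort the per protocol estimate looks more favorable than the intention-to-treat estimate, but part of that difference reflects who was dropped from the active arm rather than the treatment itself, which is the selection problem described above.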
Subgroup analyses are defined as comparisons between randomized groups in
a subset of the trial cohort. These analyses have a mixed reputation because
they are easy to misuse and can lead to wrong conclusions. With proper care,
however, they can provide useful ancillary information and expand the inferences
that can be drawn from a clinical trial. To preserve the value of randomization,
subgroups should be defined by measurements that were made before
randomization. For example, a trial of alendronate to prevent osteoporotic
fractures found that the drug decreased risk of fracture by 14% among women
with low bone density. Preplanned analyses by subgroups of bone density
measured at baseline revealed that the treatment was effective (36% reduction
in fracture risk; P < 0.01) among women whose bone
density was more than 2.5 standard deviations below normal. In contrast,
treatment was ineffective in women with higher bone density at baseline (P =
0.02 for the interaction) (22). It is important to note that the value of
randomization is preserved: the fracture rate among women randomized to
alendronate is compared with the rate among women randomized to placebo in
each subgroup.
Subgroup analyses are prone, however, to producing misleading results for
several reasons. Subgroups are, by definition, smaller than the entire trial
population, and there may not be sufficient power to find important differences;
investigators should avoid claiming that a drug was ineffective in a
subgroup when the finding might reflect insufficient power to find an effect.
Investigators often examine results in a large number of subgroups, increasing
the likelihood of finding a different effect of the intervention in one subgroup by
chance. For example, if 20 subgroups are examined, differences in one subgroup
at P < 0.05 would be expected by chance. Optimally, planned subgroup analyses
should be defined before the trial begins and the number of subgroups analyzed
should be reported with the results of the study. A conservative approach is to
require that claims about different responses in subgroups be supported by
statistical evidence that there is an interaction between the effect of treatment
and the subgroup characteristic, as in the alendronate trial noted above; if
several subgroups are examined, a significance level of 0.01 should be used.
Subgroup analyses based on postrandomization factors do not preserve the value
of randomization and often produce misleading results. Per protocol analyses
limited to subjects who adhere to the randomized treatment are examples of this
type of postrandomization subgroup analysis.
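The arithmetic behind the 20-subgroup example can be written out explicitly. The sketch below assumes that the subgroup tests are independent and that there is no true subgroup effect; real subgroups often overlap, so these figures are only approximate. The function name is hypothetical.

```python
def chance_findings(n_subgroups, alpha=0.05):
    """Expected number of, and chance of at least one, false-positive subgroup
    result when there is no true subgroup effect.

    Assumes the subgroup tests are independent, which is only an approximation
    when subgroups overlap.
    """
    expected = n_subgroups * alpha
    at_least_one = 1.0 - (1.0 - alpha) ** n_subgroups
    return expected, at_least_one


expected, at_least_one = chance_findings(20)
print(f"Expected false positives: {expected:.1f}")
print(f"Chance of at least one:   {at_least_one:.2f}")
```

With 20 subgroups tested at 0.05, one false-positive result is expected on average, and the chance of seeing at least one is well over half, which is why reporting the number of subgroups examined matters.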
Summary
1. There are several variations on the randomized trial design that can
substantially increase efficiency under the right circumstances:
a. The factorial design allows two independent trials to be carried out
for the price of one.
b. Cluster randomization permits efficient studies of naturally
occurring groups.
c. Equivalence trials compare a new intervention to an existing
standard of care; this design may be the most ethical and
clinically meaningful, but often requires a larger sample size than
placebo-controlled trials.
d. Time-series designs have a single (nonrandomized) group with
outcomes compared within each subject during periods on and off the
intervention.
e. Crossover designs combine randomized and time-series designs to
enhance control over confounding and minimize the required sample
size if carryover effects are not a problem.
2. If a substantial number of study participants do not adhere to the study
intervention or are lost to follow-up, the results of the trial are likely to
be underpowered, biased, or uninterpretable.
3. An important difference between clinical trials and observational studies is
that in a clinical trial, something is being done to the participants.
During a trial, interim monitoring by an independent DSMB is needed to
ensure that participants are not exposed to a harmful intervention, denied a
beneficial intervention, or continued in a trial if the research question is
unlikely to be answered.
4. Intention-to-treat analysis takes advantage of the control of confounding
provided by randomization and should be the primary analysis approach.
Per protocol analyses, a secondary approach that provides an estimate of
the effect size in adherent subjects, should be interpreted with caution.
5. With proper care, subgroup analyses can expand the inferences that can
be drawn from a clinical trial. To preserve the value of randomization,
analyses should compare outcomes between subsets of randomly assigned
study groups classified by prerandomization variables. To minimize
misinterpretations, the investigator should specify the subgroups in
advance, test interactions for statistical significance, and report the
number of subgroups examined.
Appendix
Appendix 11.1: Interim Monitoring of Trial
Outcomes
Interim monitoring of trial results is a form of multiple testing, and thereby
increases the probability of a type I error. To address this problem, $\alpha$ for each
test ($\alpha_i$) is generally decreased so that the overall $\alpha$ is approximately 0.05.
There are multiple statistical methods for decreasing $\alpha_i$.
One of the easiest to understand is the Bonferroni method, where $\alpha_i = \alpha/N$ if $N$
is the total number of tests performed. For example, if the overall $\alpha$ is 0.05 and
five tests will be performed, $\alpha_i$ for each test is 0.01. This method has several
disadvantages, however. It requires using an equal threshold for stopping the
trial at any interim analysis and results in a very low $\alpha$ for the final analysis.
Most investigators would rather use a stricter threshold (a lower $\alpha_i$) for stopping
a trial earlier rather than later in the trial and use an $\alpha$ close to 0.05 for the
final analysis. In addition, this approach is too conservative because it assumes that
each test is independent. Interim analyses are not independent, because each successive
analysis is based on cumulative data, some of which were included in prior
analyses. For these reasons, Bonferroni is not generally used.
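The Bonferroni arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not part of the original appendix: the function names are hypothetical, and the familywise error calculation assumes the tests are independent, which (as noted above) interim analyses are not.

```python
def bonferroni_alpha(overall_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return overall_alpha / n_tests


def familywise_error_if_independent(per_test_alpha: float, n_tests: int) -> float:
    """Chance of at least one type I error, ASSUMING the tests are independent."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


# Five looks at the data with an overall alpha of 0.05, as in the text.
alpha_i = bonferroni_alpha(0.05, 5)
uncorrected = familywise_error_if_independent(0.05, 5)
corrected = familywise_error_if_independent(alpha_i, 5)

print(f"alpha_i per test:                  {alpha_i:.3f}")
print(f"type I error, each test at 0.05:   {uncorrected:.3f}")
print(f"type I error, each test at alpha_i: {corrected:.3f}")
```

Under the independence assumption, testing five times at 0.05 inflates the overall type I error to more than 0.2, whereas testing at $\alpha_i$ keeps it just under 0.05.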
A commonly used method suggested by O'Brien and Fleming (23) uses a very
small initial $\alpha_i$, then gradually increases it such that $\alpha_i$ for the final test is
close to the overall $\alpha$. O'Brien and Fleming provide methods for calculating $\alpha_i$ if
the investigator chooses the number of tests to be done and the overall $\alpha$. At
each test, $Z_i = Z^* (N/i)^{1/2}$, where $Z_i$ is the Z value for the $i$th test; $Z^*$ is
determined so as to achieve the overall significance level; $N$ is the total number of
tests planned; and $i$ is the index of the test. For example, for five tests and overall
$\alpha = 0.05$, $Z^* = 2.04$; the initial $\alpha_1 = 0.00001$ and the final $\alpha_5 = 0.046$. This
method is unlikely to lead to stopping a trial very early unless there is a striking
difference in outcome between randomized groups. In addition, this method avoids the
awkward situation of getting to the end of a trial and accepting the null
hypothesis although the P value is substantially less than 0.05.
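To see how slowly the O'Brien-Fleming threshold relaxes, the boundary can be tabulated for each look. The sketch below assumes the boundary has the form $Z_i = Z^* (N/i)^{1/2}$ and converts each $Z_i$ to a two-sided nominal P value from the standard normal distribution; the function name is hypothetical, and values computed this way may differ somewhat from the rounded figures quoted in the text.

```python
from math import sqrt
from statistics import NormalDist


def obrien_fleming_boundaries(z_star: float, n_tests: int):
    """Return (look, Z_i, two-sided nominal P) for each planned look.

    Assumes the boundary Z_i = Z* * sqrt(N / i), where N is the number of
    planned tests and i is the index of the current test.
    """
    std_normal = NormalDist()
    rows = []
    for i in range(1, n_tests + 1):
        z_i = z_star * sqrt(n_tests / i)
        nominal_p = 2.0 * (1.0 - std_normal.cdf(z_i))
        rows.append((i, z_i, nominal_p))
    return rows


# Five planned tests with Z* = 2.04, as in the example in the text.
boundaries = obrien_fleming_boundaries(2.04, 5)
for look, z_i, nominal_p in boundaries:
    print(f"look {look}: Z = {z_i:.2f}, nominal two-sided P = {nominal_p:.6f}")
```

The first look requires a Z value above 4.5, so only a striking difference would stop the trial early, while the final look uses a threshold close to the conventional one.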
A major drawback to the preceding methods is that the number of tests and the
proportion of data to be tested must be decided before the trial starts. In some
trials, additional interim tests become necessary when important trends occur.
DeMets and Lan (24) developed a method using a specified $\alpha$-spending function
that provides continuous stopping boundaries. The $\alpha_i$ at a particular time (or
after a certain proportion of outcomes) is determined by the function and by the
number of previous looks. Using this method, neither the number of
looks nor the proportion of
data to be analyzed at each look must be specified before the trial. Of
course, for each additional unplanned interim analysis conducted, the final
overall $\alpha$ is a little smaller.
A different set of statistical methods based on curtailed sampling techniques
suggests termination of a trial if future data are unlikely to change the
conclusion. The multiple testing problem is irrelevant because the decision is
based only on estimation of what the data will show at the end of the trial. A
common approach is to compute the conditional probability of rejecting the null
hypothesis at the end of the trial, based on the accumulated data. A range of
conditional power is typically calculated, first assuming that $H_0$ is true (i.e., that
any future outcomes in the treated and control groups will be equally
distributed) and second assuming that $H_a$ is true (i.e., that outcomes will be
distributed unequally in the treatment and control groups as specified by $H_a$).
Other estimates can also be used to provide a full range of reasonable effect
sizes. If the conditional power to reject the null hypothesis across the range of
assumptions is low, the null hypothesis is not likely to be rejected and the trial
might be stopped.
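One way to make the conditional power calculation concrete is the following sketch. It is an illustration under stated assumptions rather than the method of any particular trial: it uses a normal (Brownian motion) approximation to the accumulating test statistic, a one-sided final critical value of 1.96, and a hypothetical interim result; the function and variable names are invented for the example.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def conditional_power(z_interim: float, info_fraction: float,
                      drift: float, z_critical: float = 1.96) -> float:
    """Probability that the final Z exceeds z_critical, given the interim Z.

    Assumes the test statistic behaves like Brownian motion with drift:
    `drift` is the expected final Z value (0 under the null hypothesis) and
    `info_fraction` is the proportion of the planned data already observed.
    """
    if not 0.0 < info_fraction < 1.0:
        raise ValueError("info_fraction must be strictly between 0 and 1")
    b_value = z_interim * sqrt(info_fraction)       # progress so far
    remaining = 1.0 - info_fraction                 # information still to come
    expected_final = b_value + drift * remaining    # expected final statistic
    return 1.0 - _STD_NORMAL.cdf((z_critical - expected_final) / sqrt(remaining))


# Hypothetical interim look: Z = 0.5 with half of the outcomes observed.
z_interim, info_fraction = 0.5, 0.5

# Drift implied by the alternative in a trial designed for 80% power.
design_drift = 1.96 + _STD_NORMAL.inv_cdf(0.80)

cp_null = conditional_power(z_interim, info_fraction, drift=0.0)
cp_alt = conditional_power(z_interim, info_fraction, drift=design_drift)
print(f"conditional power assuming H0: {cp_null:.3f}")
print(f"conditional power assuming Ha: {cp_alt:.3f}")
```

In this hypothetical example the conditional power is low under both assumptions, which is the pattern that might lead a monitoring board to consider stopping for futility.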
References
1. Ridker PM, Cook NR, Lee I, et al. A randomized trial of low-dose aspirin in
the primary prevention of cardiovascular disease in women. N Engl J Med
2005;352:1293–1304.
2. The Women's Health Initiative Study Group. Design of the women's health
initiative clinical trial and observational study. Control Clin Trials
1998;19:61–109.
3. Walsh M, Hilton J, Masouredis C, et al. Smokeless tobacco cessation
intervention for college athletes: results after 1 year. Am J Public Health
1999;89:228–234.
4. Donner A, Birkett N, Buck C. Randomization by cluster: sample size
requirements and analysis. Am J Epidemiol 1981;114:906–914.
5. Chalmers T, Celano P, Sacks H, et al. Bias in treatment assignment in
controlled clinical trials. N Engl J Med 1983;309:1358–1361.
6. Pocock S. Current issues in the design and interpretation of clinical trials.
Br Med J 1985;296:39–42.
7. Nickles CJ, Mitchall GK, Delmar CB, et al. An n-of-1 trial service in clinical
practice: testing the effectiveness of stimulants for
attention-deficit/hyperactivity disorder. Pediatrics 2006;117:2040–2046.
8. Chestnut CH III, Silverman S, Andriano K, et al. A randomized trial of
nasal spray salmon calcitonin in postmenopausal women with established
osteoporosis: the prevent recurrence of osteoporotic fractures study. Am J
Med 2000;109:267–276.
9. Cummings SR, Chapurlat R. What PROOF proves about calcitonin and
clinical trials. Am J Med 2000;109:330–331.
10. Hulley S, Grady D, Bush T, et al. Randomized trial of estrogen plus
progestin for secondary prevention of coronary heart disease in
postmenopausal women. JAMA 1998;280:605–613.
11. Cardiac Arrhythmia Suppression Trial (CAST) Investigators. Preliminary
report: effect of encainide and flecainide on mortality in a randomized trial
of arrhythmia suppression after myocardial infarction. N Engl J Med
1989;321:406–412.
12. Pfeffer M, Stevenson L. Beta-adrenergic blockers and survival in heart
failure. N Engl J Med 1996;334:1396–1397.
13. Armitage P, McPherson C, Rowe B. Repeated significance tests on
accumulating data. J R Stat Soc 1969;132A:235–244.
14. Laupacis A, Connolly SJ, Gent M, et al. How should results from
completed studies influence ongoing clinical trials? The CAFA Study
experience. Ann Intern Med 1991;115:818–822.
15. Coronary Drug Project Research Group. The Coronary Drug Project.
Initial findings leading to modifications of its research protocol. JAMA
1970;214:1303–1313.
16. Coronary Drug Project Research Group. The Coronary Drug Project.
Findings leading to discontinuation of the 2.5-mg/day estrogen group. JAMA
1973;226:652–657.
17. PHS Investigations. Findings from the aspirin component of the ongoing
Physicians' Health Study. N Engl J Med 1988;318:262–264.
18. Chang M, Chow S, Pong A. Adaptive design in clinical research: issues,
opportunities, and recommendations. J Biopharm Stat 2006;16:299–309.
19. Friedman LM, Furberg C, DeMets DL. Fundamentals of clinical trials, 3rd
edn. St. Louis, MO: Mosby Year Book, 1996.
20. Writing Group for the PEPI Trial. Effects of estrogen or
estrogen/progestin regimens on heart disease risk factors in postmenopausal
women. JAMA 1995;273:199–208.
21. Chlebowski RT, Hendrix SL, Langer RD, et al. Influence of estrogen plus
progestin on breast cancer and mammography in healthy postmenopausal
women. The Women's Health Initiative Randomized Trial. JAMA
2003;289:3243–3253.
22. Cummings SR, Black D, Thompson D, et al. Effect of alendronate on risk
of fracture in women with low bone density but without vertebral fractures:
results from the fracture intervention trial. JAMA 1998;280:2077–2082.
23. O'Brien P, Fleming T. A multiple testing procedure for clinical trials.
Biometrics 1979;35:549–556.
24. DeMets D, Lan G. The alpha spending function approach to interim data
analyses. Cancer Treat Res 1995;75:1–27.
12
Designing Studies of Medical Tests
Thomas B. Newman
Warren S. Browner
Steven R. Cummings
Stephen B. Hulley
Medical tests, such as those performed to screen for a risk factor, diagnose a disease,
or estimate a patient's prognosis, are an important topic for clinical research. The
study designs discussed in this chapter can be used when studying whether, and in
whom, a particular test should be done.
Although clinical trials of medical tests are occasionally feasible and sometimes
necessary, most designs for studies of medical tests are descriptive and resemble the
observational designs in Chapters 7 and 8. There are, however, some important
differences. The goal of most observational studies is to identify causal relationships
(e.g., whether estrogen use causes breast cancer). Causality is generally irrelevant in
studies of diagnostic tests. In addition, knowing that a test result is more closely
associated with a condition or outcome than would be expected by chance alone is not
nearly enough to determine its clinical usefulness. Instead, parameters that describe
the performance of a medical test, such as sensitivity, specificity, and likelihood
ratios, are commonly estimated, with their associated confidence intervals. In this
chapter we review studies of medical tests, focusing not just on studies of test
performance but also on determining whether or under what circumstances a test is
clinically useful.
Determining Whether a Test is Useful
For a test to be useful it must pass muster on a series of increasingly difficult
questions that address its reproducibility, accuracy, feasibility, and effects on
clinical decisions and outcomes (Table 12.1). Favorable answers to each of these
questions are necessary but insufficient criteria for a test to be worth doing. For
example, if a test does not give consistent results when performed by different people
or in different places, it can hardly be useful. If the test seldom supplies new
information and hence seldom affects clinical decisions, it may not be worth doing.
Even if it affects decisions, if these decisions do not improve the clinical outcome of
patients who were tested, the test still may not be useful.
Table 12.1 Questions to Determine Usefulness of a Medical Test, Possible Designs to
Answer Them, and Statistics for Reporting Results

| Question | Possible Designs | Statistics for Results* |
|---|---|---|
| How reproducible is the test? | Studies of intra- and interobserver and laboratory variability | Proportion agreement, kappa, coefficient of variation, mean and distribution of differences (avoid correlation coefficient) |
| How accurate is the test? | Cross-sectional, case-control, or cohort-type designs in which a test result is compared with a gold standard | Sensitivity, specificity, positive and negative predictive value, receiver operating characteristic curves, and likelihood ratios |
| How often do test results affect clinical decisions? | Diagnostic yield studies, studies of pre- and posttest clinical decision making | Proportion abnormal, proportion with discordant results, proportion of tests leading to changes in clinical decisions; cost per abnormal result or per decision change |
| What are the costs, risks, and acceptability of the test? | Prospective or retrospective studies | Mean costs, proportions experiencing adverse effects, proportions willing to undergo the test |
| Does doing the test improve clinical outcome or have adverse effects? | Randomized trials, cohort or case-control studies in which the predictor variable is receiving the test and the outcome includes morbidity, mortality, or costs related either to the disease or to its treatment | Risk ratios, odds ratios, hazard ratios, number needed to treat, rates and ratios of desirable and undesirable outcomes |

*Most statistics in this table should be presented with confidence intervals.

Of course, if using a test improves outcome, favorable answers to the other questions
can be inferred. However, demonstrating that doing a test improves outcome is
impractical for most diagnostic tests. Instead, the potential effects of a test on clinical
outcomes are usually assessed indirectly, by demonstrating that the test increases the
likelihood of making the correct diagnosis or is safer or less costly than existing tests.
When developing a new diagnostic or prognostic test, it may be worthwhile to consider
what aspects of current practice are most in need of improvement. Are current tests
unreliable, expensive, dangerous, or difficult to perform?
General Issues for Studies of Medical Tests
Gold standard for diagnosis. Some diseases have a gold standard, such as the
results of a tissue biopsy, that is generally accepted to indicate the presence (or
absence) of that disease. Other diseases have definitional gold
standards, such as defining coronary artery disease as a 50% obstruction of at
least one major coronary artery as seen with coronary angiography. Still others,
such as rheumatologic diseases, require that a patient have a minimum number
of signs, symptoms, or specific
laboratory abnormalities to meet the criteria for having the disease. Of course,
the accuracy of any signs, symptoms, or laboratory tests used to diagnose a
disease cannot be studied if those same signs and symptoms are used as part of
the gold standard for the diagnosis. Furthermore, if the gold standard is
imperfect it can make a test either look worse than it really is (if in reality the
test outperforms the gold standard), or better than it really is (if the gold
standard is an imperfect measure of the condition of interest and the test has the
same deficiencies).
Spectrum of disease severity and of test results. Because the goal of most
studies of medical tests is to draw inferences about populations by making
measurements on samples, the way the sample is selected has a major effect on
the validity of the inferences. Spectrum bias occurs when the spectrum of
disease (or nondisease) in the sample differs from that in the population to which
the investigator wishes to generalize. This can occur if the sample of subjects
with disease is sicker, or the subjects without the disease are healthier, than
those to whom the test will be applied in practice. Almost any test will perform
well if the task is to distinguish between the very sick and the healthy, such as
those with symptomatic pancreatic cancer and healthy controls. It is more
difficult to distinguish between one disease and another that can cause similar
symptoms, or between the healthy and those with early, presymptomatic disease.
The subjects in a study of a diagnostic test should have spectra of disease and
nondisease that resemble those of the population in which the test will be used.
For example, a diagnostic test for pancreatic cancer might be studied in patients
with abdominal pain and weight loss.
Spectrum bias can occur from an inappropriate spectrum of test results as well as
an inappropriate spectrum of disease. For example, consider a study of
interobserver agreement among radiologists reading mammograms. If they are
asked to classify the films as normal or abnormal, their agreement will be much
higher if the positive films they examine are a set selected because they
are clearly abnormal, and the negative films are a set selected as free of
suspicious abnormalities.
Sources of variation, generalizability, and the sampling scheme. For some
research questions the main source of variation in test results is between
patients. For example, some infants with bacteremia will have an elevated white
blood cell count, whereas others will not. The proportion of bacteremic infants
with high white blood cell counts is not expected to vary much according to who
draws the blood or what laboratory measures it.
On the other hand, for many tests the results may depend on the person doing or
interpreting them, or the setting in which they are done. For example,
sensitivity, specificity, and interrater reliability for interpreting mammograms
depend on the readers' skill and experience as well as the quality of the
equipment. Sampling those who perform and interpret the test can enhance the
generalizability of studies of tests that require technical or interpretive skill.
When accuracy may vary from institution to institution, the investigators will
need to sample several different institutions to be able to assess the
generalizability of the results.
Importance of blinding. Many studies of diagnostic tests involve judgments,
such as whether to consider a test result positive, or whether a person has a
particular disease. Whenever possible, investigators should blind those
interpreting test results from information about the patient being tested that is
related to the gold standard. In a study of the contribution of ultrasonography to
the diagnosis of appendicitis, for example, those reading the sonograms should
not know the results of the history and physical examination. Similarly, the
pathologists making the final determination of who does and does not have
appendicitis (the gold standard to which sonogram
results will be compared) should not know the results of the ultrasound
examination. Blinding prevents biases, preconceptions, and information from
sources other than the test from affecting these judgments.
Costs versus charges. Investigators wishing to focus on test expense may be
tempted to report charges rather than costs because charges are more readily
available and are generally much higher than costs. However, test charges vary
greatly among institutions and may have little relation to what is actually paid
for the test or to its actual costs. In many cases, test charges resemble the rack
rate on the inside door of a hotel room: a charge much higher than most
customers actually pay. On the other hand, estimating how much an institution
or society must spend per test is difficult, because many of the expenses, such
as laboratory space and equipment, are fixed. One approach is to use the
average amount actually paid for the test; another is to multiply charges by the
institution's average cost-to-charge ratio.
Studies of Test Reproducibility
Sometimes the results of tests vary according to when or where they were done
or who did them. Intraobserver variability describes the lack of reproducibility in
results when the same observer or laboratory performs the test at different times. For
example, if a radiologist is shown the same chest radiograph on two occasions, what
proportion of the time will he agree with himself on the interpretation? Interobserver
variability describes the lack of reproducibility among two or more observers: if
another radiologist is shown the same film, how likely is he to agree with the first
radiologist?
Studies of reproducibility may be done when the level of reproducibility (or lack
thereof) is the main research question. In addition, reproducibility is often studied
with a goal of quality improvement, either for those making measurements as part of
a research study of a different question, or as a part of clinical care. When
reproducibility is poor (because either intra- or interobserver variability is large), a
measurement is unlikely to be useful, and it may need to be either improved or
abandoned.
Studies of reproducibility do not require a gold standard, so they can be done for tests
or diseases where none exists. Of course, both (or all) observers can agree with one
another and still be wrong: intra- and interobserver reproducibility address precision,
not accuracy (Chapter 4).
Designs
The basic design to assess test reproducibility involves comparing test
results from more than one observer or on more than one occasion in a sample of
patients or specimens. For tests that involve several steps in many locations,
differences in any one of which might affect reproducibility, the investigator will need
to decide on the breadth of the study's focus. For example, measuring interobserver
agreement of pathologists about the interpretation of a set of cervical cytology slides
in a single hospital may overestimate the overall reproducibility of Pap smears
because the variability in how the sample was obtained and how the slide was
prepared would not be assessed.
The extent to which an investigator needs to isolate the steps that might lead to
interobserver disagreement depends partly on the goals of his study. Most studies
should estimate the reproducibility of the entire testing process, because this is what
determines whether the test is worth using. On the other hand, an investigator who
is developing or improving a test may want to focus on the specific steps at which
variability occurs, to improve the process. In either case, the investigator should lay
out the exact process for obtaining the test result in the operations manual (Chapters
4 and 17) and then describe it in the methods section when reporting the study
results.
Analysis
Categorical variables. The simplest measure of interobserver agreement is the
proportion of observations on which the observers agree exactly, sometimes
called the concordance rate. However, when there are more than two categories
or the observations are not evenly distributed among the categories (e.g., when
the proportion abnormal on a dichotomous test is much different from
50%), the concordance rate can be hard to interpret, because it does not account
for agreement that could result simply from both observers having some
knowledge about the prevalence of abnormality. For example, if 95% of subjects
are normal, two observers who randomly choose which 5% of tests to call
abnormal will agree that results are normal about 90% of the
time. A better measure of interobserver agreement, called kappa (Appendix
12A), measures the extent of agreement beyond what would be expected by
chance alone. Kappa ranges from -1 (perfect disagreement) to 1 (perfect
agreement). A kappa of 0 indicates that the amount of agreement was exactly
that expected by chance. Kappa values above 0.8 are generally considered very
good; levels of 0.6 to 0.8 are good.
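The contrast between the concordance rate and kappa can be illustrated with a short calculation. This sketch uses the usual definition of Cohen's kappa, (observed agreement - expected agreement) / (1 - expected agreement), with expected agreement computed from each observer's marginal totals; the counts are hypothetical and the function name is invented for the example.

```python
def cohen_kappa(table):
    """Observed agreement, chance-expected agreement, and kappa.

    `table[r][c]` is the number of subjects placed in category r by the
    first observer and category c by the second (a square table of counts).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[r][c] for r in range(k)) for c in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, expected, kappa


# Hypothetical readings of 100 films; each observer calls 5% abnormal.
#                 observer B: abnormal, normal
readings = [[2, 3],    # observer A: abnormal
            [3, 92]]   # observer A: normal

observed, expected, kappa = cohen_kappa(readings)
print(f"concordance rate:   {observed:.3f}")
print(f"expected by chance: {expected:.3f}")
print(f"kappa:              {kappa:.3f}")
```

In this made-up example the observers agree on 94% of films, yet most of that agreement would be expected by chance because abnormal films are rare, so kappa is only modest.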
Continuous variables. Measures of interobserver variability for continuous
variables depend on the design of the study. Some studies measure the
agreement between just two machines or methods (e.g., temperatures obtained
from two different thermometers). The best way to describe the data from such a
study is to report the mean difference between the paired measurements and the
distribution of the differences, perhaps indicating the proportion of time that the
difference is clinically important. For example, if a clinically important difference
in temperature is thought to be 0.3°C, a study comparing temperatures from
tympanic and rectal thermometers could estimate the mean difference between
the two and how often the two measurements differed by more than 0.3°C.
1
Other studies examine interobserver or interinstrument variability of a large
group of different technicians, laboratories, or machines. These results are
commonly summarized using the coefficient of variation, which is the standard
deviation of the results on a single specimen divided by their mean, expressed as
a percentage. If the results are normally distributed (i.e., if a histogram with
results on the same specimen would be bell shaped), then about 95% of the
results on different machines will be within two standard deviations of the mean.
For example, given a coefficient of variation of a serum cholesterol measurement
of 2% (2), the standard deviation of multiple measurements with a mean of 200
mg/dL would be about 4 mg/dL and about 95% of laboratories would be expected
to report a value between 192 and 208 mg/dL.
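The cholesterol arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function names are hypothetical, the replicate values are invented for illustration, and the two-standard-deviation range assumes the results are normally distributed.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * stdev(values) / mean(values)


def expected_range(mean_value, cv_percent, n_sd=2.0):
    """Range within n_sd standard deviations of the mean for a given CV."""
    sd = mean_value * cv_percent / 100.0
    return mean_value - n_sd * sd, mean_value + n_sd * sd


# Example from the text: CV of 2% around a mean of 200 mg/dL.
low, high = expected_range(200.0, 2.0)
print(f"about 95% of results between {low:.0f} and {high:.0f} mg/dL")

# Hypothetical results (mg/dL) from five laboratories on one specimen.
replicates = [196, 204, 200, 198, 202]
cv = coefficient_of_variation(replicates)
print(f"coefficient of variation: {cv:.2f}%")
```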
Studies of the Accuracy of Tests
Studies in this section address the question, "To what extent does the test
give the right answer?" To be able to answer this question, a gold standard must be
available in order to tell what the right answer is.
Designs
Sampling. Studies of diagnostic tests can have designs analogous to
case-control or cross-sectional studies, whereas studies of prognostic tests
usually resemble cohort studies. In the case-control design, those with and
without the disease are sampled separately and the test results in the two groups
are compared. Unfortunately, it is often hard to reproduce a clinically realistic
spectrum of the disease and absence of the disease in the two samples. Those
with the disease should not have progressed to severe stages that are relatively
easy to diagnose. Those without the target disease should be patients who had
symptoms consistent with a particular disease and who turned out not to have it.
Studies of tests that sample those with and without the target disease separately
are also subject to a bias in the measurement of the test result if that
measurement is made knowing whether the sample came from a case or control.
Finally, studies with this sampling scheme cannot be used (without other
information) to estimate predictive value or posterior probability (discussed
below). Therefore, case-control sampling for diagnostic tests should be
reserved for rare diseases for which no other sampling scheme is feasible.
A single cross-sectional sample of patients being evaluated for a particular
diagnosis generally will yield more valid and interpretable results. For example,
Tokuda et al. (3) found that the severity of chills was a strong predictor of
bacteremia in a series of 526 consecutive febrile adult emergency department
patients. Because the subjects were enrolled before it was known whether they
were bacteremic, the spectrum of patients in this study should be reasonably
representative of patients who present to emergency rooms with fever.
A variant of the cross-sectional sampling scheme that we call tandem testing is
sometimes used to compare two (presumably imperfect) tests with one another.
Both tests are done on a representative sample of patients who may or may not
have the disease and the gold standard is selectively applied to the patients with
positive results on either or both tests. Because subjects with negative results
may be false-negatives, the gold standard should also be applied to a random
sample of patients with concordant negative results. This design, which allows
the investigator to determine which test is more accurate without the expense of
measuring a gold standard in all the subjects with negative test results, has been
used in studies comparing different cervical cytology methods (4).
Prognostic test studies require either prospective or retrospective cohort
designs. In prospective cohort studies, the test is done at baseline, and the
subjects are then followed to see who develops the outcome of interest. A
retrospective cohort study may be possible if a new test becomes available, such
as viral load in HIV-positive patients, and a previously defined cohort with
banked blood samples is available. Then the viral load can be measured in the
stored blood, to see whether it predicts prognosis. The nested case-control
design (Chapter 7) is particularly attractive if the outcome of interest is rare and
the test is expensive.
Predictor variable: the test result. Although it is simplest to think of the
results of a diagnostic test as being either positive or negative, many tests have
categorical, ordinal, or continuous results. Whenever possible, investigators should
use ordinal or continuous results to take advantage of all available information in
the test. Most tests are more indicative of a disease if they are very abnormal than if
they are slightly abnormal, and most also have a borderline range in which they do
not provide much information.
Outcome variable: the disease (or its outcome). The outcome variable in a
diagnostic test study is often the presence or absence of the disease, best
determined with a gold standard. Wherever possible, the assessment of outcome
should not be influenced by the results of the diagnostic test being studied. This
is best accomplished by blinding those measuring the gold standard so that they
do not know the results of the test. Sometimes uniform application of the gold
standard is not ethical or feasible for studies of diagnostic tests, particularly
screening tests. For example, Smith-Bindman et al. studied the accuracy of
mammography according to characteristics of the interpreting radiologist (5).
Women with positive mammograms were referred for further tests, eventually
with pathologic evaluation as the gold standard. However, it is not reasonable to
do biopsies in women whose mammograms are negative. Therefore, to determine
whether these women had falsely negative mammograms the authors linked their
mammography results with local tumor registries and used whether or not breast
cancer was diagnosed in the year following mammography as the gold standard.
This solution, although reasonable, assumes that all breast cancers that exist at
the time of mammography will be diagnosed within 1 year, and that all cancers
diagnosed within 1 year existed at the time of the mammogram. Measuring the
gold standard differently depending on the result of the test in this fashion
creates a potential for bias, discussed in more detail at the end of the chapter.
Prognostic tests are studied in patients who already have the disease. The
outcome is what happens to them, such as how long they live, what
complications they develop, or what additional treatments they require. Again,
blinding is important, especially if clinicians caring for the patients may make
decisions based upon the prognostic factors being studied. For example, Rocker
et al. (6) found that the attending physicians' estimates of prognosis, but not
those of bedside nurses, were independently associated with intensive care unit
mortality. This could be because the attending physicians were more skilled at
estimating severity of illness, but it could also be because attending physician
prognostic estimates had a greater effect than those of the nurses on decisions
to withdraw support. To distinguish between these possibilities, it would be
helpful to obtain estimates of prognosis from attending physicians other than
those involved in making or framing decisions about withdrawal of support.
Analysis
Sensitivity, specificity, and positive and negative predictive values. When
results of a dichotomous test are compared with a dichotomous gold standard,
the results can be summarized in a 2 × 2 table (Table 12.2). The sensitivity is
defined as the proportion of subjects with the disease in whom the test gives the
right answer (i.e., is positive), whereas the specificity is the proportion of
subjects without the disease in whom the test gives the right answer (i.e., is
negative). Positive and negative predictive values are the proportions of subjects
with positive and negative tests in whom the test gives the right answer.
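The four definitions above translate directly into arithmetic on the cells of a 2 × 2 table. The sketch below is illustrative only: the cell names (true and false positives and negatives) and the counts are assumptions introduced for the example, not the labels or data of Table 12.2, and the predictive values are meaningful only when the sample reflects the prevalence of disease in the population of interest.

```python
def two_by_two_summary(true_pos, false_pos, false_neg, true_neg):
    """Sensitivity, specificity, and predictive values from 2 x 2 counts."""
    return {
        # Among subjects with the disease, proportion with a positive test.
        "sensitivity": true_pos / (true_pos + false_neg),
        # Among subjects without the disease, proportion with a negative test.
        "specificity": true_neg / (true_neg + false_pos),
        # Among positive tests, proportion who truly have the disease.
        "positive_predictive_value": true_pos / (true_pos + false_pos),
        # Among negative tests, proportion who truly do not have the disease.
        "negative_predictive_value": true_neg / (true_neg + false_neg),
    }


# Hypothetical cross-sectional sample: 100 with disease, 200 without.
summary = two_by_two_summary(true_pos=90, false_pos=30, false_neg=10, true_neg=170)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```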
Receiver operating characteristic curves. Many diagnostic tests yield ordinal
or continuous results. With such tests, several values of sensitivity and
specificity are possible, depending on the cutoff point chosen to define a
positive test. This trade-off between sensitivity and specificity can be displayed
using a graphic technique originally developed in electronics: receiver operating
characteristic (ROC) curves. The investigator selects several cutoff points and
determines the sensitivity and specificity at each point. He then graphs the
sensitivity (or true-positive rate) on the Y-axis as a function of 1 - specificity
(the false-positive rate) on the X-axis. An ideal test is one that reaches the
upper left corner of the graph (100% true-positives and no false-positives). A
worthless test follows the diagonal from the lower left to the upper right corners:
at any cutoff the true-positive rate is the same as the false-positive rate
(Fig. 12.1). The area under the ROC curve, which thus ranges from 0.5 for a useless
test to 1.0 for a perfect test, is a useful summary of the overall accuracy of a
test and can be used to compare the accuracy of two or more tests.
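As a rough sketch of the procedure just described, the following Python computes a (false-positive rate, true-positive rate) point at each cutoff and approximates the area under the curve with the trapezoidal rule. The test values, the cutoffs, and the function names are hypothetical, and treating a result at or above the cutoff as positive is an assumption made for the example.

```python
def roc_points(diseased, healthy, cutoffs):
    """Return sorted (false-positive rate, true-positive rate) pairs.

    A result at or above the cutoff is counted as a positive test.
    """
    points = []
    for cut in cutoffs:
        tpr = sum(x >= cut for x in diseased) / len(diseased)  # sensitivity
        fpr = sum(x >= cut for x in healthy) / len(healthy)    # 1 - specificity
        points.append((fpr, tpr))
    return sorted(points)


def auc_trapezoid(points):
    """Area under the ROC curve, anchoring the curve at (0, 0) and (1, 1)."""
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Hypothetical test results, for illustration only.
diseased = [3, 5, 6, 8, 9]
healthy = [1, 2, 4, 5, 7]
points = roc_points(diseased, healthy, cutoffs=[2, 4, 6, 8])
auc = auc_trapezoid(points)
print(points)
print(f"AUC = {auc:.2f}")
```

A test whose points all fall on the diagonal gives an area of 0.5 with the same function, which matches the description of a worthless test.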
Table 12.2. Summarizing Results of a Study of a Dichotomous Test in a 2 × 2 Table
...
Likelihood ratios. Although the information in a diagnostic test with continuous
or ordinal results can be summarized using sensitivity and specificity or ROC
curves, there is a better way. Likelihood ratios allow the investigator to take
advantage of all information in a test. For each test result, the likelihood ratio is
the ratio of the likelihood of that result in someone with the disease to the
likelihood of that result in someone without the disease:

\[ \text{Likelihood ratio} = \frac{P(\text{Result} \mid \text{Disease})}{P(\text{Result} \mid \text{No Disease})} \]

The P is read as "probability of" and the | is read as "given."
Thus P(Result|Disease) is the probability of the result given disease, and
P(Result|No Disease) is the probability of that result given no disease. The
likelihood ratio is a ratio of these two probabilities.
The higher the likelihood ratio, the better the test result for ruling in a diagnosis;
a likelihood ratio greater than 100 is very high (and very unusual among tests).
On the other hand, the lower a likelihood ratio (the closer it is to 0), the better
the test result is for ruling out the disease. A likelihood ratio of 1 means that the
test result provides no information at all about the likelihood of disease.
FIGURE 12.1. Receiver operating characteristic curves for good and worthless tests.

An example of how to calculate likelihood ratios is shown in Table 12.3, which
presents results from the Pediatric Research in Office Settings Febrile Infant
study (10) on how well the white blood cell count predicted bacteremia or
bacterial meningitis in young, febrile infants. A white blood cell count that is
either less than 5,000 cells/mm³ or at least 15,000 cells/mm³ was more common
among infants with bacteremia or meningitis than among other infants. The
calculation of likelihood ratios simply quantifies this: 8% of the infants with
bacteremia or bacterial meningitis had less than 5,000 cells/mm³, whereas only
4% of those without bacteremia or meningitis did. Therefore the likelihood ratio
is 8%/4% = 2.
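The same calculation can be sketched for every row of Table 12.3 at once. The counts below are taken from that table; the variable and function names are assumptions made for the example. Because this sketch divides unrounded proportions, some values differ slightly from the table, which divides rounded percentages (for instance, about 1.8 rather than 2.0 for counts below 5,000).

```python
# Counts from Table 12.3: (with meningitis or bacteremia, without).
wbc_table = {
    "<5,000": (5, 96),
    "5,000-9,999": (18, 854),
    "10,000-14,999": (8, 790),
    "15,000-19,999": (17, 286),
    ">=20,000": (15, 151),
}


def likelihood_ratios(table):
    """Likelihood ratio for each result: P(result | disease) / P(result | no disease)."""
    n_disease = sum(yes for yes, _ in table.values())
    n_no_disease = sum(no for _, no in table.values())
    return {
        result: (yes / n_disease) / (no / n_no_disease)
        for result, (yes, no) in table.items()
    }


lrs = likelihood_ratios(wbc_table)
for result, lr in lrs.items():
    print(f"{result:>14}: LR = {lr:.2f}")
```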
Relative risks and risk differences. The analysis of studies of prognostic tests
or risk factors for disease is similar to that of other cohort studies. If everyone in
a prognostic test study is followed for a set period of time (say 3 years) with few
losses to follow-up, then the results can be summarized with absolute risks,
relative risks, and risk differences. Especially when follow-up is complete and of
short duration, results of prognostic tests are sometimes summarized like
diagnostic tests, using sensitivity, specificity, predictive value, likelihood ratios,
and ROC curves. On the other hand, when the study subjects are followed for
varying lengths of time, a survival-analysis technique that accounts for the length
of follow-up time is preferable (11).
Table 12.3. Example of Calculation of Likelihood Ratios from a Study of Predictors
of Bacterial Meningitis or Bacteremia Among Young Febrile Infants

White Blood Cell      Meningitis or Bacteremia        Likelihood
Count (per mm³)       Yes            No               Ratio
<5,000                 5 (8%)          96 (4%)        2.0
5,000-9,999           18 (29%)        854 (39%)       0.7
10,000-14,999          8 (12%)        790 (36%)       0.3
15,000-19,999         17 (27%)        286 (13%)       2.1
≥20,000               15 (24%)        151 (7%)        3.4
Total                 63 (100%)     2,177 (100%)
...
Studies of the Effect of Test Results on Clinical Decisions
A test may be accurate, but if the disease is very rare, the test may be so seldom
positive that it is not worth doing in most situations. Another diagnostic test may be
positive more often but not affect clinical decisions because it does not provide new
information beyond what was already known from the medical history, physical
examination, or other tests. The study designs in this section address the yield of
diagnostic tests and their effects on clinical decisions.
Types of Studies
Diagnostic yield studies. Diagnostic yield studies address such questions as
the following:
   When a test is ordered for a particular indication, how often is it abnormal?
   Can a test result be predicted from other information available at the time
   of testing?
   What happens to patients with abnormal results? Do they appear to benefit?
Diagnostic yield studies estimate the proportion of positive tests among
patients with a particular indication for the test. Of course, showing that a
test is often positive is not sufficient to indicate the test should be done.
However, a diagnostic yield study showing a test is almost always negative
may be sufficient to question its use for that indication.
For example, Siegel et al. (12) studied the yield of stool cultures in
hospitalized patients with diarrhea. Although not all patients with diarrhea
receive stool cultures, it seems reasonable to assume that those who do
are, if anything, more likely to have a positive culture than those who do
not. Overall, only 40 (2%) of 1,964 stool cultures were positive. Moreover,
none of the positive results were in the 997 patients who had been in the
hospital for more than 3 days. Because a negative stool culture is unlikely
to affect management in these patients with a low likelihood of bacterial
diarrhea, it is of little value in that setting. Therefore, the authors were able
to conclude that stool cultures are unlikely to be useful in patients with
diarrhea who have been in the hospital for more than 3 days.
Before/after studies of clinical decision making. These designs directly
address the effect of a test result on clinical decisions. The design generally
involves a comparison between what clinicians do (or say they would do) before
and after obtaining results of a diagnostic test. For example, Carrico et al. (13)
prospectively studied the value of abdominal ultrasound in 94 children with acute
lower abdominal pain. They asked the clinicians requesting the sonograms to
record their diagnostic impression and what their treatment would be if a
sonogram were not available. After doing the sonograms and providing the
clinicians with the results, they asked again. They found that sonographic
information changed the initial treatment plan in 46% of patients.
Of course (as discussed later), altering a clinical decision does not guarantee
that a patient will benefit. Therefore, if a study with this design shows effects on
decisions, it is most useful when the natural history of the disease and the
efficacy of treatment are clear. In the preceding example, there is very likely a
benefit from changing the decision from "discharge from hospital" to
"laparotomy" in children with appendicitis, or from "laparotomy" to
"observe" in children with nonspecific abdominal pain.
Studies of Feasibility, Costs, and Risks of Tests
An important area for clinical research relates to the practicalities of diagnostic
testing. What proportion of patients will return a postcard with tuberculosis skin test
results? What proportion of colonoscopies are complicated by hypotension? What are
the medical and psychological effects of false-positive screening tests in newborns?
Design Issues
Studies of the feasibility, costs, and risks of tests are generally descriptive. The
sampling scheme is important because tests often vary among the people or
institutions doing them, as well as the patients receiving them.
Among studies that sample individual patients, several sampling schemes are possible.
A straightforward choice is to study everyone who receives the test, as in a study of
the return rate of postcards after tuberculosis skin testing. Alternatively, for some
questions, the subjects in the study may be only those with results that were positive
or falsely positive. For example, Bodegard et al. (14) studied families of infants who
had tested falsely positive on a newborn screening test for hypothyroidism and found
that fears about the baby's health persisted for at least 6 months in almost 20% of
the families.
Adverse effects can occur not just from false-positive results, but also from tests in
which the measurement may be correct but a patient's reaction leads to a decrement
in quality of life. Rubin and Cummings (15), for example, studied women who had
undergone bone densitometry to test for osteoporosis. They found that women who
had been told that their bone density was abnormal were much more likely to limit
their activities because of fear of falling.
Analysis
Results of these studies can usually be summarized with simple descriptive statistics
like means and standard deviations, medians, ranges, and frequency distributions.
Dichotomous variables, such as the occurrence of adverse effects, can be summarized
with proportions and their 95% CIs. For example, Waye et al. (16) reported that 3 of
2,097 ambulatory colonoscopies (0.14%; 95% CI, 0.03% to 0.42%) resulted in
hypotension that required intravenous fluids.
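A proportion and its 95% CI can be computed in several ways; the text does not say which method was used for the colonoscopy example. The sketch below assumes an exact (Clopper-Pearson) binomial interval, found by bisection using only the Python standard library. The function names are assumptions made for the example.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for x events in n trials."""

    def bisect(f):
        # f is increasing in p on [0, 1]; find the p at which f(p) = 0.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else bisect(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


# 3 episodes of hypotension in 2,097 colonoscopies (Waye et al.).
lower, upper = exact_ci(3, 2097)
print(f"{3 / 2097:.2%} (95% CI, {lower:.2%} to {upper:.2%})")
```

With so few events the interval is markedly asymmetric around the point estimate, which is one reason a normal approximation is a poor choice for rare adverse effects.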
There are generally no sharp lines that divide tests into those that are or are not
feasible, or those that have or do not have an unacceptably high risk of adverse
effects. For this reason it is helpful in the design stage of the study to specify criteria
for deciding that the test is acceptable. What rate of follow-up would be insufficient?
What rate of complications would be too high?
Studies of the Effect of Testing on Outcomes
The best way to determine the value of a medical test is to see whether patients
who are tested have a better outcome (e.g., live longer) than those who are not.
Randomized trials are the ideal design for making this determination, but trials of
diagnostic tests are often difficult to do. The value of tests is therefore usually
estimated from observational studies. The key difference between the designs
described in this section and the experimental and observational designs discussed
elsewhere in this book is that the predictor variable for this section is testing, rather
than a treatment, risk factor, or test result.
Designs
Testing itself is unlikely to have any direct benefit on the patient's health. It is only
when a test result leads directly to the use of effective preventive or therapeutic
interventions that the patient may benefit. Therefore, one important caveat about
outcome studies of testing is that the predictor variable actually being studied is not
just a test (e.g., a fecal occult blood test), but everything that follows (e.g.,
procedures for following up abnormal results, colonoscopy, etc.).
The outcome variable of these studies must be a measure of morbidity or mortality,
not simply a diagnosis or stage of disease. For example, showing that men who are
screened for prostate cancer have a greater proportion of cancers diagnosed at an
early stage does not by itself establish the value of screening. It is possible that some
of those cancers would not have caused any problem if they had not been detected or
that treatment of the detected cancers is ineffective.
The outcome should be broad enough to include plausible adverse effects of testing
and treatment, and may include psychological as well as medical effects of testing.
Therefore, a study of the value of prostate-specific antigen screening for prostate
cancer should include treatment-related morbidity (e.g., impotence or incontinence,
perioperative myocardial infarction) and mortality. When many more people are tested
than are expected to benefit (as is usually the case), less severe adverse outcomes
among those without the disease may be important, because they will occur
much more frequently. While negative test results may be reassuring and relieving to
some patients, in others the psychological effects of labeling or false-positive results,
loss of insurance, and troublesome (but nonfatal) side effects of preventive
medications may outweigh infrequent benefits.
Observational studies. Observational studies are generally quicker, easier, and
less costly than experimental studies. However, they have important
disadvantages as well, especially because patients who are tested tend to differ
from those who are not tested in important ways that may be related to the risk
of a disease or its prognosis. For example, those getting the test may be at lower
risk of an adverse health outcome, because people who volunteer for medical
tests and treatments tend to be healthier than average, an example of volunteer
bias. On the other hand, those tested may be at higher risk, because patients
are more likely to be tested when they or their clinicians are concerned about a
disease or its sequelae, an example of confounding by indication for the test
(Chapter 9).
An additional common problem with observational studies of testing is the lack of
standardization and documentation of any interventions or changes in
management that follow positive results. If a test does not improve outcome in a
particular setting, it could be because follow-up of abnormal results was poor,
because patients were not compliant with the planned intervention, or because
the particular intervention used in the study was not ideal.
Clinical trials. The most rigorous design for assessing the benefit of a diagnostic
test is a clinical trial, in which subjects are randomly assigned to receive or not
to receive the test. Presumably the result of the test is then used to guide
clinical management. A variety of outcomes can be measured and compared in
the two groups. Randomized trials minimize or eliminate confounding and
selection bias and allow measurement of all relevant outcomes such as mortality,
morbidity, cost, and satisfaction. Standardizing the testing and intervention
process enables others to reproduce the results.
Unfortunately, randomized trials of diagnostic tests are often not practical,
especially for diagnostic tests already in use in the care of sick patients.
Randomized trials are generally more feasible and important for tests that might
be used in large numbers of apparently healthy people, such as new screening
tests.
Randomized trials, however, may bring up ethical issues about withholding
potentially valuable tests. Rather than randomly assigning subjects to undergo a
test or not, one approach to minimizing this ethical concern is to randomly assign
some subjects to receive an intervention that increases the use of the test, such
as frequent postcard reminders and assistance in scheduling. The primary
analysis must still follow the intention-to-treat rule; that is, the entire
group that was randomized to receive the intervention must be compared with
the entire comparison group. However, this rule will tend to create a
conservative bias; the observed efficacy of the intervention will underestimate
the actual efficacy of the test, because some subjects in the control group will
get the test and some subjects in the intervention group will not. This problem
can be addressed in secondary analyses that assume all the difference between
the two groups is due to different rates of testing. The actual benefits of testing
in the subjects as a result of the intervention can then be estimated algebraically
(18).
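One simple way to carry out such an algebraic estimate is sketched below. It is an illustration under the stated assumption (that all of the difference in outcome between the groups is due to the difference in testing rates), not necessarily the method of reference 18. The intention-to-treat risk difference is divided by the difference in the proportion tested. All numbers and names are hypothetical.

```python
def effect_among_tested(risk_intervention, risk_control,
                        tested_intervention, tested_control):
    """Estimate the risk reduction attributable to testing.

    Assumes the whole outcome difference between the randomized groups is
    due to the difference in the proportion of subjects who were tested.
    """
    itt_difference = risk_control - risk_intervention          # intention-to-treat effect
    testing_difference = tested_intervention - tested_control  # extra testing achieved
    return itt_difference / testing_difference


# Hypothetical trial: reminders raise testing from 30% to 70%, and the
# risk of the outcome falls from 2.0% to 1.7%.
estimate = effect_among_tested(
    risk_intervention=0.017, risk_control=0.020,
    tested_intervention=0.70, tested_control=0.30,
)
print(f"ITT risk difference: {0.020 - 0.017:.4f}")
print(f"Estimated risk difference due to testing: {estimate:.4f}")
```

In this hypothetical example the intention-to-treat difference of 0.3 percentage points corresponds to an estimated 0.75 percentage points among those whose testing status was changed by the intervention, which illustrates the conservative bias described above.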
Example 12.1 An Elegant Observational Study of a Screening Test
Selby et al. (17) did a nested case-control study in the Kaiser Permanente Medical
Care Program to determine whether screening sigmoidoscopy reduces the risk of death
from colon cancer. They compared the rates of previous sigmoidoscopy among
patients who had died of colon cancer with controls who had not. They found an
adjusted odds ratio of 0.41 (95% CI, 0.25 to 0.69), suggesting that sigmoidoscopy
resulted in a 60% decrease in the death rate from cancer of the rectum and distal
colon.
A potential problem is that patients who undergo sigmoidoscopy may differ in
important ways from those who do not, and that those differences might be associated
with a difference in the expected death rate from colon cancer. To address this
possible confounding, Selby et al. examined the apparent efficacy of sigmoidoscopy at
preventing death from cancers of the proximal colon, above the reach of the
sigmoidoscope. If patients who underwent sigmoidoscopy were less likely to die of
colon cancer for other reasons, then sigmoidoscopy would appear to be protective
against these cancers as well. However, sigmoidoscopy had no effect on mortality
from cancer of the proximal colon (adjusted odds ratio = 0.96; 95% CI, 0.61 to 1.50),
suggesting that confounding was not the reason for the apparent benefit in distal
colon cancer mortality.
Analysis
Analyses of studies of the effect of testing on outcome are those appropriate to the
specific design used: odds ratios for case-control studies, and risk ratios or hazard
ratios for cohort studies or experiments. A convenient way to express the results is to
project the results of the testing procedure to a large cohort (e.g., 100,000), listing
the number of initial tests, follow-up tests, people treated, side effects of treatment,
costs, and lives saved.
Pitfalls in the Design or Analysis of Diagnostic Test Studies
As with other types of clinical research, errors in the design or analysis of studies of
diagnostic tests are common. Some of the most common and serious of these, along
with steps to avoid them, are outlined below.
Verification Bias 1: Selective Application of a Single Gold Standard
A common sampling strategy for studies of medical tests is to study (either
prospectively or retrospectively) patients at risk for disease who receive the gold
standard for diagnosis. However, this causes a problem if the findings being studied
are also used to decide who gets the gold standard. For example, consider a study of
predictors of fracture in children presenting to the emergency department with ankle
injuries, in which only children who had x-rays for ankle injuries were included. If
those without a particular finding, for example, ankle swelling, were less likely to get
an x-ray, both false-negatives and true-negatives (c and d in the 2 × 2 table in
Table 12.4) would be reduced, thereby increasing sensitivity (a/(a + c)) and
decreasing specificity (d/(d + b)), as shown in Table 12.4. This bias, called
verification bias, work-up bias, or referral bias, is illustrated numerically in
Appendix 12B.
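The direction of this bias can be seen in a small sketch. The counts are hypothetical (they are not the numbers in Appendix 12B), as is the assumption that every child with swelling but only one in four children without swelling gets an x-ray.

```python
def sens_spec(a, b, c, d):
    """Sensitivity a/(a + c) and specificity d/(b + d) from a 2 x 2 table."""
    return a / (a + c), d / (b + d)


# Hypothetical counts if every injured child were x-rayed.
full = {"a": 80, "b": 120, "c": 20, "d": 280}

# Assumption: all children with swelling are x-rayed, but only 25% of
# those without swelling are, so cells c and d shrink proportionally.
xray_fraction_no_swelling = 0.25
verified = dict(
    full,
    c=full["c"] * xray_fraction_no_swelling,
    d=full["d"] * xray_fraction_no_swelling,
)

sens_full, spec_full = sens_spec(**full)
sens_biased, spec_biased = sens_spec(**verified)
print(f"All children x-rayed:   sensitivity {sens_full:.2f}, specificity {spec_full:.2f}")
print(f"X-rayed children only:  sensitivity {sens_biased:.2f}, specificity {spec_biased:.2f}")
```

Restricting the sample to children who were x-rayed raises the apparent sensitivity of swelling and lowers its apparent specificity, even though the finding itself has not changed.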
This type of verification bias can be avoided by using strict criteria for application of
the gold standard that do not include the test being studied. Another strategy is to
use a different gold standard for those in whom the usual gold standard is not
indicated. However, this can cause other problems as discussed below.

Table 12.4. How Verification Bias Leads to Overestimation of Sensitivity and
Underestimation of Specificity, by Decreasing the Number of Subjects in the Study
with No Swelling, and Hence Both Cells c and d

               Fracture    No Fracture
Swelling       a           b
No swelling    c           d
Verification Bias 2: Different Gold Standards for Those Testing Positive and Negative
A different type of verification bias, which might be called double gold standard
bias, occurs when different gold standards are used for those with positive and
negative test results. An example is the previously mentioned study of mammography
(5) in which the gold standard for those with positive mammograms was a biopsy,
whereas for those with negative mammograms it was a period of follow-up to see if a
cancer became evident. Having two different gold standards for the disease is a
problem if the gold standards might not agree with one another.
Another example is a study of ultrasonography to diagnose intussusception in young
children (19). All children with a positive ultrasound scan for intussusception received
the gold standard, a contrast enema. In contrast, the majority of children with a
negative ultrasound were observed in the emergency room and intussusception was
ruled out clinically. For cases of intussusception that resolve spontaneously, the two
gold standards would give different results: the contrast enema would be positive, and
clinical follow-up would be negative. If these cases have a negative ultrasound, the
double gold standard can turn what would appear to be a false-negative result (when
the ultrasound is negative and the contrast enema is positive) into a true-negative
(when the ultrasound is negative and clinical follow-up reveals no intussusception).
This increases both sensitivity and specificity (Table 12.5). A numerical example of
this double gold standard type of verification bias is provided in Appendix 12C.
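A brief sketch with hypothetical counts (not the numbers in the appendix) shows why both measures rise. It assumes that some ultrasound-negative cases of intussusception resolve on their own and are therefore classified as nondiseased by clinical follow-up, moving them from cell c to cell d.

```python
def sens_spec(a, b, c, d):
    """Sensitivity a/(a + c) and specificity d/(b + d) from a 2 x 2 table."""
    return a / (a + c), d / (b + d)


# Hypothetical counts if every child received the contrast enema.
single = {"a": 40, "b": 10, "c": 10, "d": 140}

# Assumption: 6 of the 10 ultrasound-negative cases resolve spontaneously,
# so clinical follow-up counts them as true-negatives (cell c -> cell d).
self_resolving = 6
double = dict(single,
              c=single["c"] - self_resolving,
              d=single["d"] + self_resolving)

sens_single, spec_single = sens_spec(**single)
sens_double, spec_double = sens_spec(**double)
print(f"Single gold standard: sensitivity {sens_single:.3f}, specificity {spec_single:.3f}")
print(f"Double gold standard: sensitivity {sens_double:.3f}, specificity {spec_double:.3f}")
```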
Because sometimes using an invasive gold standard for everyone is not feasible,
investigators considering a study with two gold standards should make every effort to
use other data sources (e.g., autopsy studies examining the prevalence of
asymptomatic cancers among patients who died from other causes in a study of a
cancer screening test) to assess the degree to which double gold standard bias might
threaten the validity of the study.

Table 12.5. How Using Clinical Follow-up as the Gold Standard for Children with a
Negative Ultrasound Moves Self-resolving Cases of Intussusception From Cell c to
Cell d and Changes False-Negatives into True-Negatives
Inadequate Sample Size
A basic principle is that if there are plenty of instances of what the investigator is
trying to measure, the sample size is likely to be adequate. However, if the disease or
outcome being tested for is rare, this may require testing a very large number of
people. Many laboratory tests, for example, are not expensive, and a yield of 1% or
less might justify doing them, especially if they can diagnose a serious treatable
illness. Therefore, to conclude that a test is not useful, the upper limit of the
confidence interval for the yield should be low enough to exclude a clinically
significant yield.
For example, Sheline and Kehr (20) retrospectively reviewed routine admission
laboratory tests, including the Venereal Disease Research Laboratory (VDRL) test for
syphilis, among 252 psychiatric patients and found that the laboratory tests identified
1 patient with previously unsuspected syphilis. If this patient's psychiatric symptoms
were indeed due to syphilis, it would be hard to argue that it was not worth the
$3,186 spent on VDRLs to make this diagnosis. But if the true rate of unsuspected
syphilis were close to the 0.4% seen in this study, a study of this sample size could
easily have found no cases. In that situation, the upper limit of the 95% CI would
have been 1.2%. This confidence limit would not be low enough to exclude a clinically
significant yield of the VDRL in such psychiatric patients.
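The figures in this example can be checked with a short sketch. It assumes that the 1.2% upper limit comes from the familiar "rule of three" approximation (3/n when no events are observed), which agrees closely with the exact one-sided 95% bound. The text does not state which method was used, and the function name is an assumption.

```python
def upper_limit_zero_events(n, confidence=0.95):
    """Exact one-sided upper confidence limit when 0 events are seen in n subjects."""
    return 1 - (1 - confidence) ** (1 / n)


n = 252  # psychiatric patients tested with the VDRL

# Chance of finding no cases at all if the true rate were 1 in 252 (about 0.4%).
p_none = (1 - 1 / n) ** n

rule_of_three = 3 / n                     # quick approximation
exact_upper = upper_limit_zero_events(n)  # exact one-sided bound

print(f"P(no cases found)         = {p_none:.2f}")
print(f"Rule-of-three upper limit = {rule_of_three:.1%}")
print(f"Exact one-sided limit     = {exact_upper:.1%}")
```

So a study of this size would miss every case more than one time in three, and even then it could not rule out a yield of about 1%.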
Inappropriate Exclusion
When calculating proportions, such as the proportion of subjects with a positive test
result in a diagnostic yield study, excluding subjects from the numerator without
excluding similar subjects from the denominator is a common error. The basic rule is
that if any subjects who test positive are excluded from the numerator, similar
subjects must also be excluded from the denominator. In a study of routine laboratory
tests in emergency department patients with new seizures (21), for example, 11 of
136 patients (8%) had a correctable laboratory abnormality (hypoglycemia,
hypocalcemia, etc.) as a sole or contributory cause for their seizure. In 9 of the 11
patients, however, the abnormality was suspected on the basis of the history or
physical examination. The authors therefore reported that only 2 of 136 patients
(1.5%) had abnormalities not suspected on the basis of the history or physical
examination. But if all patients with suspected abnormalities are excluded from the
numerator, then similar patients must be excluded from the denominator as well. The
correct denominator for this proportion is therefore not all 136 patients tested, but
only those who were not suspected of having any laboratory abnormalities on the
basis of their medical history or physical examination.
Institution-Specific Results
Generalizability is especially important for tests that require skill or training to do or
interpret. For example, just because pathologists in a particular institution cannot
agree on what constitutes an abnormal Pap smear does not mean that pathologists
elsewhere would have the same problem. In some cases, investigators are motivated
to study questions that seem particularly problematic in their own institution. The
results obtained may be internally valid but of little interest elsewhere.
Nongeneralizable findings can also occur in institutions that do exceptionally well. For
example, it is possible that the value of abdominal ultrasonography in children with
belly pain reported by Carrico et al. (13) is greater than would be found elsewhere,
because of the particular skill of their ultrasonographers.
Dropping Borderline or Uninterpretable Results
Sometimes a test may fail to give any answer at all, such as if the assay failed, the
test specimen deteriorated, or the test result fell into a gray zone of being neither
positive nor negative. It is not usually legitimate to ignore these problems, but how to
handle them depends on the specific research question and study design. In studies
dealing with the expense or inconvenience of tests, failed attempts to do the test are
clearly important results. On the other hand, for most other studies of diagnostic
tests, instances of failure of the test to provide a result should be divided into those
that likely are and are not related to characteristics of the patient. Thus patients
whose specimens were lost or in whom the assays failed for reasons unrelated to the
patient can generally be excluded without distorting results.
Patients with "nondiagnostic" imaging studies or a borderline result on a test
need to be counted as having had that specific result on the test. In effect, this may
change a dichotomous test to an ordinal one: positive, negative, and indeterminate.
ROC curves can then be drawn and likelihood ratios can be calculated for the
"indeterminate" as well as positive and negative results.
Summary
1. The usefulness of medical tests can be assessed using designs that address a
series of increasingly stringent questions (Table 12.1). For the most part,
standard observational designs provide descriptive statistics of test
characteristics with confidence intervals.
2. The subjects for a study of a diagnostic test should be chosen from patients who
have a spectrum of disease and nondisease that reflects the anticipated use of
the test in clinical practice.
3. If possible, the investigator should blind those interpreting the test results from
other information about the patients being tested.
4. Measuring the reproducibility of a test, including the intra- and interobserver
variability, is often a good first step in evaluating a test.
5. Studies of the accuracy of tests require a gold standard for determining if a
patient has, or does not have, the disease or outcome being studied.
6. The results of studies of the accuracy of diagnostic tests can be summarized
using sensitivity, specificity, predictive value, ROC curves, and likelihood
ratios. Studies of the value of prognostic tests can be summarized with risk
ratios or hazard ratios.
7. Because of the difficulty of demonstrating that doing a test improves outcome,
studies of the effects of tests on clinical decisions and the accuracy,
feasibility, costs, and risks of tests are often most useful when they suggest a
test should not be done.
8. The most rigorous way to study a diagnostic test is to do a clinical trial, in
which subjects are randomly assigned to receive or not to receive the test, and
outcomes, such as mortality, morbidity, cost, and satisfaction, are compared.
However, there may be practical and ethical impediments to such trials; with
appropriate attention to possible biases and confounding, observational studies
of these questions can be helpful.
Appendix
Appendix 12A: Calculation of Kappa to
Measure Interobserver Agreement
P.200
...
When there are two observers or when the same observer repeats a measurement on
two occasions, the agreement can be summarized in a c by c table, where c is
the number of categories that the measurement can have. For example, consider two
observers listening for an S4 gallop on cardiac examination (Table 12A.1). They
record it as either present or absent. The simplest measure of interobserver
agreement is the concordance rate, that is, the proportion of observations on which
the two observers agree. The concordance rate can be obtained by summing the
numbers along the diagonal from the upper left to the lower right and dividing the
sum by the total number of observations. In this example, out of 100 patients there
were 10 patients in whom both observers heard a gallop, and 75 in whom neither did,
for a concordance rate of (10 + 75)/100 = 85%.
When the observations are not evenly distributed among the categories (e.g., when
the proportion abnormal on a dichotomous test is substantially different from
50%), the concordance rate can be misleading. For example, if the two observers each
hear a gallop on five patients but do not agree on which patients have the gallop,
their observed agreement will still be 90% (Table 12A.2). In fact, if two observers
both know an abnormality is uncommon, they can have nearly perfect agreement just
by never or rarely saying that it is present.
Table 12A.1 Interobserver Agreement on Presence of an S4 Gallop

                                 Gallop Heard     No Gallop Heard    Total,
                                 by Observer 1    by Observer 1      Observer 2
Gallop heard by observer 2       10               5                  15
No gallop heard by observer 2    10               75                 85
Total, observer 1                20               80                 100

Note: The concordance rate is the percentage of the time two observers agree
with one another. In this example, both observers either heard or did not
hear the gallop in (10 + 75)/100 = 85% of cases.
Table 12A.2 High Agreement When Both Observers Know Gallops are Uncommon

                                 Gallop Heard     No Gallop Heard    Total,
                                 by Observer 1    by Observer 1      Observer 2
Gallop heard by observer 2       0                5                  5
No gallop heard by observer 2    5                90                 95
Total, observer 1                5                95                 100

Note: When both observers know that an abnormality is uncommon, they will
have a high concordance rate, even if they do not agree on which subjects
are abnormal. In this case the observers agree 90% of the time, although
they do not agree at all on who has a gallop.

To get around this problem, another measure of interobserver agreement, called
kappa (κ), is sometimes used. Kappa measures the extent of agreement beyond what
would be expected from knowing the marginal values (i.e., the row and
column totals). Kappa ranges from -1 (perfect disagreement) to 1 (perfect
agreement). A kappa of 0 indicates that the amount of agreement was exactly that
expected by chance. κ is estimated as:

\[
\kappa = \frac{\text{Observed agreement (\%)} - \text{Expected agreement (\%)}}{100\% - \text{Expected agreement (\%)}}
\]

The expected proportion in each cell is simply the proportion in that cell's row
(i.e., the row total divided by the sample size) times the proportion in that cell's
column (i.e., the column total divided by the sample size). The expected agreement is
obtained by adding the expected proportions in the cells along the diagonal of the
table, in which the observers agreed.
For example, in Table 12A.1, the observers appear to have done quite well: they have
agreed 85% of the time. But how well did they do compared with agreement by
chance? By chance alone they will agree about 71% of the time: (20% × 15%) +
(80% × 85%) = 71%. Because the observed agreement was 85%, kappa is
(85% - 71%)/(100% - 71%) = 0.48, which is respectable, if somewhat less impressive
than 85% agreement. But now consider Table 12A.2. Although the observed agreement was
90%, the expected agreement is (5% × 5%) + (95% × 95%) = 90.5%. Therefore,
kappa is (90% - 90.5%)/(100% - 90.5%) = -0.05, a tiny bit worse than chance
alone.
When there are more than two categories of variables, it is important to distinguish
between ordinal variables, which are intrinsically ordered, and nominal variables,
which are not. For ordinal variables, kappa fails to capture all the information in the
data, because it does not give partial credit for coming close. For example, if a
radiograph can be classified as normal, questionable, and
abnormal, having one observer call it normal and the other call it questionable
is better agreement than if one says it is normal and the other says it is abnormal. To
give credit for partial agreement, a weighted kappa (see Footnote 3) should be used.
Appendix 12B: Numerical Example of
Verification Bias: 1
Consider two studies examining ankle swelling as a predictor of fractures in children
with ankle injuries. The first study is a consecutive sample of 200 children. In this
study, all children with ankle injuries are x-rayed, regardless of swelling. The
sensitivity and specificity of ankle swelling are 80% and 75%, as shown in Table
12B.1:

Table 12B.1 Ankle Swelling as a Predictor of Fracture Using a Consecutive Sample

               Fracture    No Fracture
Swelling       32          40
No swelling    8           120
Total          40          160

Sensitivity = 32/40 = 80%
Specificity = 120/160 = 75%

The second study is a selected sample, in which only half the children without ankle
swelling are x-rayed. Therefore, the numbers in the "No swelling" row will be
reduced by half. This raises the apparent sensitivity from 32/40 (80%) to 32/36
(89%) and lowers the apparent specificity from 120/160 (75%) to 60/100 (60%), as
shown in Table 12B.2:

Table 12B.2 Verification Bias: Ankle Swelling as a Predictor of Fracture Using a
Selected Sample

               Fracture    No Fracture
Swelling       32          40
No swelling    4           60
Total          36          100

Sensitivity = 32/36 = 89%
Specificity = 60/100 = 60%

Note: If we knew that the children with no swelling who received an x-ray
were otherwise similar to those who did not receive an x-ray, we could
estimate the verification bias and correct for it algebraically. In practice, the
children who receive an x-ray are probably more likely to have a fracture
than those who do not, so the effect of verification bias on sensitivity in the
example above is likely a worst-case scenario. (That is, if the clinicians were
good at predicting who did not have a fracture, all those not x-rayed would
not have had any fractures, and the sensitivity of the test would not have
been biased upwards. Specificity, however, would still be underestimated.)

Appendix 12C: Numerical Example of
Verification Bias: 2
Results of the study by Eshed et al. (19) of ultrasonography to diagnose
intussusception are shown in Table 12C.1:

Table 12C.1 Results of a Study of Ultrasound Diagnosis of Intussusception

                 Intussusception    No Intussusception
Ultrasound +     37                 7
Ultrasound -     3                  104
Total            40                 111

Sensitivity = 37/40 = 93%

The 104 subjects with a negative ultrasound listed as having "No
Intussusception" actually included 86 who were followed clinically and did not
receive a contrast enema. If about 10% of these subjects (i.e., nine children) actually
had an intussusception that resolved spontaneously, but that would still have been
identified if they had a contrast enema, and all subjects had received a contrast
enema, those nine children would have changed from true-negatives to
false-negatives, as shown in Table 12C.2:

Table 12C.2 Effect on Sensitivity and Specificity if Nine Children with
Spontaneously Resolving Intussusception had Received the Contrast Enema
Gold Standard Instead of Clinical Follow-up

                 Intussusception    No Intussusception
Ultrasound +     37                 7
Ultrasound -     3 + 9 = 12         104 - 9 = 95
Total            49                 102

Sensitivity = 37/49 = 76%

Now consider the 37 subjects with positive ultrasound scans, who had intussusception
based on their contrast enema. Suppose about 10% of those intussusceptions would
have resolved spontaneously, if given the chance. Then about
four children would change from true-positives to false-positives, as shown in Table
12C.3:

Table 12C.3 Effect on Sensitivity and Specificity if Four Children with
Spontaneously Resolving Intussusception had Received the Clinical Follow-up
Gold Standard Instead of the Contrast Enema

                 Intussusception    No Intussusception
Ultrasound +     37 - 4 = 33        7 + 4 = 11
Ultrasound -     3                  104
Total            36                 115

Sensitivity = 33/36 = 92%

Therefore, for spontaneously resolving cases of intussusception, the ultrasound scan
will appear to give the right answer whether it is positive or negative, increasing both
sensitivity and specificity.
References
1. Bland J, Altman D. Statistical methods for assessing agreement between two
methods of clinical measurement. Lancet 1986;1:307–310.
2. Watson JE, Evans RW, Germanowski J, et al. Quality of lipid and lipoprotein
measurements in community laboratories. Arch Pathol Lab Med 1997;121(2):105–109.
3. Tokuda Y, Miyasato H, Stein GH, et al. The degree of chills for risk of
bacteremia in acute febrile illness. Am J Med 2005;118(12):1417.
4. Sawaya GF, Washington AE. Cervical cancer screening: which techniques should
be used and why? Clin Obstet Gynecol 1999;42(4):922–938.
5. Smith-Bindman R, Chu P, Miglioretti DL, et al. Physician predictors of
mammographic accuracy. J Natl Cancer Inst 2005;97(5):358–367.
6. Rocker G, Cook D, Sjokvist P, et al. Clinician predictions of intensive care unit
mortality. Crit Care Med 2004;32(5):1149–1154.
7. Guyatt G, Rennie D. Users' guides to the medical literature. A manual for
evidence-based practice. Chicago, IL: AMA Press, 2002.
8. Fletcher R, Fletcher S. Clinical epidemiology, the essentials, 4th ed. Baltimore,
MD: Lippincott Williams & Wilkins, 2005.
9. Straus S, Richardson W, Glasziou P, et al. Evidence-based medicine: how to
practice and teach EBM. New York: Elsevier/Churchill Livingstone, 2005.
10. Pantell RH, Newman TB, Bernzweig J, et al. Management and outcomes of care
of fever in early infancy. JAMA 2004;291(10):1203–1212.
11. Vittinghoff E, Glidden D, Shiboski S, et al. Regression methods in biostatistics:
linear, logistic, survival, and repeated measures models. New York:
Springer-Verlag, 2005.
12. Siegel DL, Edelstein PH, Nachamkin I. Inappropriate testing for diarrheal
diseases in the hospital. JAMA 1990;263(7):979–982.
13. Carrico CW, Fenton LZ, Taylor GA, et al. Impact of sonography on the
diagnosis and treatment of acute lower abdominal pain in children and young
adults. AJR Am J Roentgenol 1999;172(2):513–516.
14. Bodegard G, Fyro K, Larsson A. Psychological reactions in 102 families with a
newborn who has a falsely positive screening test for congenital hypothyroidism.
Acta Paediatr Scand Suppl 1983;304:1–21.
15. Rubin SM, Cummings SR. Results of bone densitometry affect women's
decisions about taking measures to prevent fractures. Ann Intern Med
1992;116(12 Pt 1):990–995.
16. Waye JD, Lewis BS, Yessayan S. Colonoscopy: a prospective report of
complications. J Clin Gastroenterol 1992;15(4):347–351.
17. Selby JV, Friedman GD, Quesenberry CJ, et al. A case-control study of
screening sigmoidoscopy and mortality from colorectal cancer. N Engl J Med
1992;326(10):653–657.
18. Sheiner LB, Rubin DB. Intention-to-treat analysis and the goals of clinical
trials. Clin Pharmacol Ther 1995;57(1):6–15.
19. Eshed I, Gorenstein A, Serour F, et al. Intussusception in children: can we rely
on screening sonography performed by junior residents? Pediatr Radiol
2004;34(2):134–137.
20. Sheline Y, Kehr C. Cost and utility of routine admission laboratory testing for
psychiatric inpatients. Gen Hosp Psychiatry 1990;12(5):329–334.
21. Turnbull TL, Vanden Hoek TL, Howes DS, et al. Utility of laboratory studies in
the emergency department patient with a new-onset seizure. Ann Emerg Med
1990;19(4):373–377.
Footnotes
1
Although commonly used, the correlation coefficient is best avoided in studies of the
reliability of laboratory tests because it is highly influenced by outlying values and
does not allow readers to determine how frequently differences between the two
measurements are clinically important. Confidence intervals for the mean difference
should also be avoided because their dependence on sample size makes them
potentially misleading. A narrow confidence interval for the mean difference between
the two measurements does not imply that they generally closely agree, only that
the mean difference between them is being measured precisely. For more extensive
reading on this issue, see Bland and Altman (1).
2
For dichotomous tests the likelihood ratio for a positive test is
\[ \mathrm{LR}^{+} = \frac{\text{Sensitivity}}{1 - \text{Specificity}} \]
and the likelihood ratio for a negative test is
\[ \mathrm{LR}^{-} = \frac{1 - \text{Sensitivity}}{\text{Specificity}} \]
Detailed discussions of how to use likelihood ratios and prior information (the prior
probability of disease) to estimate a patient's probability of disease after knowing the
test result (the posterior probability) are available in standard clinical epidemiology
texts (7,8,9). The formula is
Prior odds × Likelihood ratio = Posterior odds,
where prior and posterior odds are related to their respective probabilities by
\[ \text{odds} = \frac{P}{1 - P} \]
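The likelihood ratio and odds formulas in this footnote can be illustrated with a short sketch. The function names below are illustrative choices, not part of the original text; the numbers are taken from the consecutive-sample ankle-swelling example in Table 12B.1 (sensitivity 80%, specificity 75%, 40 fractures among 200 children).

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a dichotomous test."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


def probability_to_odds(p):
    """Convert a probability to odds: P / (1 - P)."""
    return p / (1.0 - p)


def odds_to_probability(odds):
    """Convert odds back to a probability: odds / (1 + odds)."""
    return odds / (1.0 + odds)


def posterior_probability(prior_probability, likelihood_ratio):
    """Prior odds x likelihood ratio = posterior odds, returned as a probability."""
    prior_odds = probability_to_odds(prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return odds_to_probability(posterior_odds)


# Table 12B.1: sensitivity 80%, specificity 75%, prior probability 40/200 = 20%.
lr_pos, lr_neg = likelihood_ratios(0.80, 0.75)
post_if_swelling = posterior_probability(0.20, lr_pos)
post_if_no_swelling = posterior_probability(0.20, lr_neg)
```

As a check, the posterior probabilities agree with the rows of Table 12B.1: 32 of the 72 children with swelling had a fracture, as did 8 of the 128 children without swelling.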
3
The formula for weighted kappa is the same as that for regular kappa except that
observed and expected agreement are summed not just along the diagonal, but for the
whole table, with each cell first multiplied by a weight for that cell. Any weighting
system can be used, but the most common are
\[ W_{ij} = 1 - \frac{|i - j|}{c - 1} \quad \text{and} \quad W_{ij} = 1 - \left[ \frac{i - j}{c - 1} \right]^{2} \]
where W_{ij} is the weight for the number in the i-th row and the j-th column
and c is the number of categories.
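A minimal sketch of the kappa and weighted kappa calculations follows, assuming the agreement table is given as a square list of counts (rows for observer 2, columns for observer 1). The function name `kappa` and the `weighting` argument are illustrative assumptions; the two 2 by 2 tables reproduce Tables 12A.1 and 12A.2 from Appendix 12A.

```python
def kappa(table, weighting="none"):
    """Kappa for a c-by-c table of counts (c >= 2).

    weighting: "none" (regular kappa), "linear" (1 - |i - j|/(c - 1)),
    or "quadratic" (1 - [(i - j)/(c - 1)]^2).
    """
    c = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(c)) for j in range(c)]

    def weight(i, j):
        if weighting == "none":
            return 1.0 if i == j else 0.0
        if weighting == "linear":
            return 1.0 - abs(i - j) / (c - 1)
        if weighting == "quadratic":
            return 1.0 - ((i - j) / (c - 1)) ** 2
        raise ValueError("unknown weighting: " + weighting)

    # Observed agreement: weighted sum of the observed cell proportions.
    observed = sum(weight(i, j) * table[i][j] / n
                   for i in range(c) for j in range(c))
    # Expected agreement: weighted sum of (row proportion x column proportion).
    expected = sum(weight(i, j) * row_totals[i] * col_totals[j] / (n * n)
                   for i in range(c) for j in range(c))
    # Undefined when expected agreement is exactly 1.
    return (observed - expected) / (1.0 - expected)


table_12a1 = [[10, 5], [10, 75]]   # Table 12A.1
table_12a2 = [[0, 5], [5, 90]]     # Table 12A.2

kappa_12a1 = kappa(table_12a1)     # about 0.48, as in Appendix 12A
kappa_12a2 = kappa(table_12a2)     # about -0.05, as in Appendix 12A
```

With only two categories, both weighting schemes reduce to regular kappa, because every off-diagonal cell gets a weight of 0. The weights matter only when there are three or more ordered categories.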
13
Utilizing Existing Databases
Deborah Grady
Norman Hearst
Many research questions can be answered quickly and efficiently using
data that have already been collected. There are three general
approaches to using existing data. Secondary data analysis is the use
of existing data to investigate research questions other than the main
ones for which the data were originally gathered. Ancillary studies
add one or more measurements to a study, often in a subset of the
participants, to answer a separate research question. Systematic
reviews combine the results of multiple previous studies of a given
research question, often including calculation of a summary estimate of
effect that has greater precision than the individual study estimates.
Making creative use of existing data is a fast and effective way for new
investigators with limited resources to begin to answer important
research questions.
Advantages and Disadvantages
The main advantages of using existing data are speed and
economy. A research question that might otherwise require much time
and money to investigate can sometimes be answered rapidly and
inexpensively. For example, in the Multiple Risk Factor Intervention
Trial (MRFIT), a large heart disease prevention trial in men,
information about the smoking habits of the wives of the study subjects
was recorded to examine whether this influenced the men's ability to
quit smoking. After the study was over, one of the investigators
realized that the data provided an opportunity to investigate the health
effects of passive smoking, a new finding at the time. A twofold
excess in the incidence of heart disease was found in nonsmoking men
married to smoking wives when compared with similar nonsmoking men
married to nonsmoking wives (1).
Existing data sets also have disadvantages. The selection of the
population to study, which data to collect, the quality of data gathered,
and how variables were measured and recorded are all predetermined.
The existing data may have been collected from a population that is not
ideal (men only rather than men and women), the measurement
approach may not be what the investigator would prefer (history of
hypertension, a dichotomous historical variable, in place of actual blood
pressure), and the quality of the data may be poor (frequent missing or
incorrect values). Important confounders and outcomes may not have
been measured or recorded. All these factors contribute to the main
disadvantage of using existing data: the investigator has little or no
control over what data have been collected, and how.
Secondary Data Analysis
Secondary data sets may come from previous research studies,
medical records, health care billing files, death certificates, and many
other sources. Previous research studies, often conducted at the
investigator's institution, may provide a rich source of secondary data.
Many studies collect more data than the investigators analyze and
contain interesting findings that have gone unnoticed. Access to such
data is controlled by the study's principal investigator; the new
researcher should therefore seek out information about the work of
senior investigators at his institution. One of the most important ways
a good mentor can be helpful to a new investigator is by providing
knowledge of and access to relevant data both from his own and other
institutions. Most large NIH-funded studies are now required to make
their data publicly available after a certain period of time. These data
sets are usually available through the Internet and can provide
extensive data addressing related research questions.
Other sources of secondary data are large regional and national data
sets that are publicly available and do not have a principal
investigator. Computerized databases of this sort are as varied as the
reasons people have for collecting information. We will give several
examples that deserve special mention, and readers can locate others
in their own areas of interest.
Tumor registries are government-supported agencies that collect
complete statistics on cancer incidence, treatment, and outcome in
defined geographic areas. These registries currently include about one
quarter of the US population, and the area of coverage is expected to
increase during the coming years. One of the purposes of these
registries is to provide data to outside investigators. Combined data for
all the registries are available from the Surveillance, Epidemiology, and
End Results (SEER) Program. For example, investigators used the SEER
registry of breast cancer diagnoses to determine the specificity of
screening mammography in a large cohort of women in the San
Francisco Bay Area. Women with negative mammograms in whom
cancer was not diagnosed within 13 months were considered to have
had true negative mammography (2).
Death certificate registries can be used to follow the mortality of any
cohort. The National Death Index includes all deaths in the United
States since 1978. This can be used to ascertain the vital status of
subjects of an earlier study or of those who are part of another data
set that includes important predictor variables. An example is the
follow-up of men with coronary disease who were treated with
high-dose nicotinic acid (or placebo) to lower serum cholesterol in the
Coronary Drug Project. Although there was no difference in death rates
at the end of the 5 years of randomized treatment, a mortality
follow-up 9 years later using the National Death Index revealed a significant
difference (3). Whether an individual is alive or dead is public
information, so follow-up was available even for men who had dropped
out of the study.
The National Death Index can be used when any two of three basic
individual identifiers (name, birth date, and social security number) are
known. Ascertainment of the fact of death is 99% complete with this
system, and additional information from the death certificates (notably
cause of death) can then be obtained from state records. On the state
and local level, many jurisdictions now have computerized vital
statistics systems, in which individual data (such as information from
birth or death certificates) are entered as they are received.
Secondary data can be especially useful for studies to evaluate patterns
of utilization and clinical outcomes of medical treatment. This approach
can complement the information available from randomized trials and
examine questions that trials cannot answer. These types of existing
data include administrative and clinical databases such as those
developed by Medicare, the Department of Veterans Affairs, Kaiser
Permanente Medical Group, the Duke Cardiovascular Disease Databank,
and registries such as the San Francisco Mammography Registry and
the National Registry of Myocardial Infarction (NRMI). Information from
these sources (many of which can be found on the Web) can be very
useful for studying rare adverse events and for assessing real-world
utilization and effectiveness of an intervention that has been shown to
work in a clinical trial setting. For example, the NRMI was used to
examine risk factors for intracranial hemorrhage after treatment with
recombinant tissue-type plasminogen activator (tPA) for acute
myocardial infarction (MI). The registry included 71,073 patients who
received tPA; among these, 673 had intracranial hemorrhage confirmed
by computed tomography or magnetic resonance imaging. A
multivariate analysis showed that a tPA dose exceeding 1.5 mg/kg was
significantly associated with developing an intracranial hemorrhage
when compared with lower doses (4). Given that the overall risk of
developing an intracranial hemorrhage was less than 1%, a clinical trial
collecting primary data to examine this outcome would have been
prohibitively large and expensive.
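The arithmetic behind this point can be sketched in a few lines. The registry counts are those quoted above; the trial size of 2,000 patients is a purely hypothetical figure chosen for illustration, not a number from the NRMI study.

```python
# Registry counts quoted in the text above.
hemorrhage_cases = 673
patients_given_tpa = 71_073

# Overall risk of intracranial hemorrhage in the registry (a little under 1%).
overall_risk = hemorrhage_cases / patients_given_tpa


def expected_events(n_patients, risk):
    """Expected number of outcome events among n_patients at the given risk."""
    return n_patients * risk


# Hypothetical trial size, for illustration only.
hypothetical_trial_size = 2_000
events_in_hypothetical_trial = expected_events(hypothetical_trial_size, overall_risk)
```

Even a trial of a couple of thousand patients would be expected to observe fewer than 20 hemorrhages at this risk, far too few to compare dose groups, which is why the registry was the practical source of data.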
Another valuable contribution from this type of secondary data analysis
is a better understanding of the difference between efficacy and
effectiveness. The randomized clinical trial is the gold standard for
determining the efficacy of a therapy under highly controlled
circumstances in selected clinical settings. In the "real world,"
however, patients and treatments are often different. The choice of
drugs and dosage by the treating physician and the adherence to
medications by the patient are much more variable. These factors often
act to make the new therapy less effective than demonstrated in trials.
Assessing the effectiveness of treatments in actual practice can
sometimes be accomplished through studies using secondary data. For
example, primary angioplasty has been demonstrated to be superior to
thrombolytic therapy in clinical trials of treating patients with acute MI
(5). But this may only be true when success rates for angioplasty are
as good as those achieved in the clinical trial setting. Secondary
analyses of community data sets have not found a benefit of primary
angioplasty over thrombolytic therapy (6, 7).
Secondary data analysis is often the best approach for studying the
utilization of accepted therapies. Although clinical trials can
demonstrate efficacy of a new therapy, this benefit can only occur if
the therapy is adopted by practicing physicians. Understanding
utilization rates, addressing regional variation and use in specific
populations (such as the elderly, ethnic minorities, the economically
disadvantaged, and women), can have major public health implications.
For example, despite convincing data that angiotensin converting
enzyme inhibitors decrease mortality in patients with MI, a secondary
analysis of community data has shown that many patients with clear
indications for such therapy do not receive it (8).
Two data sets may also be linked to answer a research question.
Investigators who were interested in how military service affects health
used the 1970 to 1972 draft lottery involving 5.2 million 20-year-old
men who were assigned eligibility for military service randomly by date
of birth (the first data set) linked to later mortality based on state
death certificate registries (the second source of data). The predictor
variable (date of birth) was a randomly assigned proxy for military
service during the Vietnam era. Men who had been randomly assigned to
be eligible for the draft had significantly greater mortality from suicide
and motor vehicle accidents in the ensuing 10 years (9). The study was
done for less than $2,000 (not including the investigators' time), yet it
was a more unbiased approach to examining the effect of military service
on specific causes of subsequent death than other studies of this topic
with much larger budgets.
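The linkage step can be sketched as follows. This is a simplified illustration, not the investigators' method: the birthdays, death records, and field names are all made-up values, and the sketch assumes eligibility depends only on month and day of birth, ignoring lottery year and cutoff numbers.

```python
from collections import Counter
from datetime import date

# Data set 1 (hypothetical values): birthdays (month, day) drawn as draft-eligible.
draft_eligible_birthdays = {(9, 14), (4, 24), (12, 30)}

# Data set 2 (hypothetical records): death certificates with birth date and cause.
death_records = [
    {"birth_date": date(1951, 9, 14), "cause": "suicide"},
    {"birth_date": date(1951, 3, 2), "cause": "motor vehicle accident"},
    {"birth_date": date(1950, 4, 24), "cause": "motor vehicle accident"},
    {"birth_date": date(1952, 7, 19), "cause": "other"},
]


def link_and_count(records, eligible_birthdays):
    """Classify each death by draft eligibility, using date of birth as the link."""
    counts = Counter()
    for record in records:
        born = record["birth_date"]
        eligible = (born.month, born.day) in eligible_birthdays
        counts[(eligible, record["cause"])] += 1
    return counts


deaths_by_eligibility_and_cause = link_and_count(death_records, draft_eligible_birthdays)
```

Because date of birth appears in both sources, no individual contact or new data collection is needed. The comparison of cause-specific deaths between eligible and ineligible birthdays follows directly from the linked counts.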
When individual data are not available, aggregate data sets can
sometimes be useful. The term aggregate data means that
information is available only for groups of subjects (e.g., death rates
from cervical cancer in each of the 50 states). With such data,
associations can only be measured among these groups by comparing
group information on a risk factor (such as tobacco sales) with the rate
of an outcome. Studies using aggregate data are called ecologic
studies.
The advantage of aggregate data is its availability. Its major drawback
is the fact that associations are especially susceptible to confounding:
groups tend to differ from each other in many ways, not all of which
are causally related. Furthermore, associations observed in the
aggregate do not necessarily hold for the individual. For example, sales
of cigarettes may be greater in states with high suicide rates, but the
individuals who commit suicide may not be the ones doing most of the
smoking. This situation is referred to as the ecologic fallacy.
Aggregate data are most appropriately used to test the plausibility of a
new hypothesis or to generate new hypotheses. Interesting results can
then be pursued in another study that uses individual data.
Getting Started
After choosing a research topic and becoming familiar with the
literature in that area (including a thorough literature search and
advice from a senior mentor), the next step is to investigate whether
the research question can be addressed with an existing database. The
help of a senior colleague can be invaluable in finding an appropriate
data set. An experienced researcher has defined areas of interest in
which he stays current and is aware of important data sets and the
investigators who control these data, both at his own institution and
elsewhere. This person can help identify and gain access to the
appropriate database. Often, the research question needs to be altered
slightly (by modifying the definition of the predictor or outcome
variables, for example) to fit the available data.
The best solution may be close at hand, a database at the home
institution. For example, a University of California, San Francisco
(UCSF) fellow who was interested in the role of lipoproteins in coronary
disease noticed that one of the few interventions known to lower the
level of lipoprotein(a) was estrogen. Knowing that the Heart and
Estrogen/Progestin Replacement Study (HERS), a major clinical trial of
hormone treatment to prevent coronary disease, was being managed at
UCSF, the fellow approached the investigators with his interest.
Because no one else had specifically planned to examine the
relationship between this lipoprotein, hormone treatment, and coronary
heart disease events, the fellow designed an analysis and publication
plan. After receiving permission from the HERS study leadership, he
worked with coordinating center statisticians, epidemiologists, and
programmers to carry out an analysis that he subsequently published in
a leading journal (10).
Sometimes a research question can be addressed that has little to do
with the original study. For example, another fellow from UCSF was
interested in the value of repeated screening Pap tests in women over
65 years old. He realized that the mean age of participants in the HERS
trial was 67 years, that participants were required to have a normal Pap
test to enter, and that they then received screening Pap tests annually
during follow-up. By following up on Pap test outcomes, he was able to
document that 110 Pap tests were abnormal among over 2,500 women
screened over a 2-year period, and only one woman was ultimately found
to have abnormal follow-up histology. Therefore, all but one of the
abnormal Pap tests were falsely positive (11). This study strongly
influenced the US Preventive Services Task Force's current
recommendation that Pap tests should not be performed in low-risk women
over age 65 with previous normal tests.
Sometimes it is necessary to venture further afield. Working from a
list of predictor and outcome variables whose relation might help to
answer the research question, an investigator can seek to locate
databases that include these variables. Phone calls or e-mail messages
to the authors of previous studies or to government officials might
result in access to files containing useful data. It is essential to
conquer any anxiety that the investigator may feel about contacting
strangers to ask for help. Most people are surprisingly cooperative,
either by providing data themselves or by suggesting other places to
try.
Once the data for answering the research question have been located,
the next challenge is to obtain permission to use them. It is a good
practice to use official letterhead on correspondence and to adopt any
institutional titles that are appropriate. Young investigators should
determine if their mentors are acquainted with the investigators who
control the database, as an introduction may be more effective than a
cold contact. It is generally most effective to work with an investigator
who is interested in the research topic and involved in the study whose
database you would like to examine. This investigator can facilitate
access to the data, assure that you understand the study methods and
how the variables were measured, and often becomes a valued
colleague and collaborator. Databases that result from multicenter
studies and clinical trials generally have clear mechanisms for obtaining
access to the data that include the requirement for a written analysis
proposal and approval by an analysis or publications committee.
The investigator should be very specific about what information is
sought and confirm the request in writing. It is a good idea to keep the
size of the request to a minimum and to offer to pay any cost of
preparing the data. If the data set is controlled by another group of
researchers, the investigator can suggest a collaborative relationship.
In addition to providing an incentive to share the data, this can engage
a coinvestigator who is familiar with the database. It is wise to clearly
define such a relationship early on, including who will be first author of
the planned publications. Important arrangements of this sort often
benefit from a face-to-face meeting.
Ancillary Studies
Research with secondary data takes advantage of the fact that the
data needed to answer a research question are already available. In an
ancillary study, the investigator adds one or several measurements to
an existing study to answer a different research question. For example,
in the HERS trial of the effect of hormone therapy on risk for coronary
events in 2,763 elderly women, an investigator added measurement of
the frequency and severity of urinary incontinence. Adding a one-page
questionnaire created a large trial of the effect of hormone therapy on
urinary incontinence, with little additional time or expense (12).
Ancillary studies have many of the advantages of secondary data
analysis with fewer constraints. They are both inexpensive and
efficient, and the investigator can design a few key ancillary
measurements specifically to answer the research question. Ancillary
studies can be added to any type of study, including cross-sectional
and case–control studies, but large prospective cohort studies and
randomized trials are particularly well suited to such studies.
Ancillary studies in randomized trials have the problem that the
measurements may be most informative when added before the trial
begins, and it may be difficult for an outsider to identify trials in the
planning phase. Even when a variable was not measured at baseline,
however, a single measurement during or at the end of the trial can
produce useful information. By adding cognitive function measures at
the end of the HERS trial, the investigators were able to compare the
cognitive function of elderly women treated with hormone therapy for 4
years with the cognitive function of those treated with placebo (13).
A good opportunity for ancillary studies is provided by the banks of
stored serum, DNA, images, and so on, that are found in most large
clinical trials and cohort studies. The opportunity to propose new
measurements using these specimens can be an extremely cost-effective
approach to answering a novel research question, especially if
it is possible to make these measurements on a subset of specimens
using a nested case-control or case-cohort design (Chapter 7). In
HERS, for example, genetic analyses of fewer than 100 cases and
controls showed that the excess number of thromboembolic events in
the hormone-treated group was not due to an interaction with factor V
Leiden (14).
Getting Started
Opportunities for ancillary studies should be actively pursued,
especially by new investigators with limited time and resources. A good
place to start is to identify studies with research questions that include
either the predictor or the outcome variable of interest. For example,
an investigator interested in the effect of weight loss on pain
associated with osteoarthritis might start by identifying trials of
interventions (such as diet, exercise, behavior change, or drugs) for
weight loss. Such studies can be identified by searching lists of studies
funded by the federal government, by contacting pharmaceutical
companies that manufacture drugs for weight loss, and by talking with
experts in weight loss who are familiar with ongoing studies. To create
an ancillary study, the investigator would simply add a measure of
arthritis symptoms among subjects enrolled in these studies.
Alternatively, he might identify studies that have joint pain as an
outcome, and add change in weight as an ancillary measure.
After identifying a study that provides a good opportunity for ancillary
measures, the next step is to obtain the cooperation of the study
investigators. Most researchers will consider adding brief ancillary
measures to an established study if they address an important question
and do not substantially interfere with the conduct of the main study.
Investigators will be reluctant to add measures that require a lot of the
participant's time (cognitive function testing) or are invasive and
unpleasant (colonoscopy) or costly (positron emission tomography
scanning).
Generally, formal permission from the principal investigator or the
appropriate study committee is required to add an ancillary study. Most
large, multicenter studies have established procedures requiring a
written application. The proposed ancillary study is generally reviewed
by a committee that can approve, reject, or revise the ancillary study.
Many ancillary measures require funding, and the ancillary study
investigator must find a way to pay these costs. Of course, the cost of
an ancillary study is much less than the cost of conducting the same
trial independently. Some large studies may have their own mechanisms
for funding ancillary studies, especially if the research question is
important and considered relevant by the funding agency. The NIH has
recently issued several requests for proposals to add ancillary studies
to large NIH-funded trials.
The disadvantages of ancillary studies are few. If the main study is
already in progress, new variables can be added, but variables already
being measured cannot be changed. In some cases there may be
practical problems in obtaining permission from the investigators or
sponsor to perform the ancillary study, training those who will make
the measurements, or obtaining separate informed consent from
participants. Because the ancillary study investigator may not have
designed or conducted the main study, it may also be difficult to obtain
access to the full database for analysis. These issues, including a clear
understanding of authorship of scientific papers that result from the
ancillary study and the rules governing their preparation and
submission, need to be clarified before starting the study.
Systematic Reviews
Systematic reviews identify completed studies that address a
research question, and evaluate the results of these studies to arrive at
conclusions about a body of research. In contrast to other approaches
to reviewing the literature, systematic reviews use a well-defined and
uniform approach to identify all relevant studies, display the results of
eligible studies, and, when appropriate, calculate a summary estimate
of the overall results. The statistical aspects of a systematic review
(calculating summary effect estimates and variance, statistical tests of
heterogeneity, and statistical estimates of publication bias) are called
meta-analysis.
A systematic review can be a good opportunity for a new
investigator. Although it takes a surprising amount of time and effort,
a systematic review generally does not require substantial financial or
other resources. Completing a good systematic review requires that the
investigator become intimately familiar with the literature regarding
the research question. For new investigators, this detailed knowledge
of published studies is invaluable. Publication of a good systematic
review can also establish a new investigator as an expert on
the research question. Moreover, the findings, with power enhanced by
the larger sample size available from the combined studies and
peculiarities of individual study findings revealed by comparison with
the others, often represent an important scientific contribution.
Systematic review findings can be particularly useful for developing
practice guidelines.
The elements of a good systematic review are listed in Table 13.1. Just
as for other studies, the methods for completing each of these steps
should be described in a written protocol before the systematic review
begins.
The Research Question
As with any research, a good systematic review has a well-formulated,
clear research question that meets the usual FINER criteria (Chapter
2). Feasibility depends largely on the existence of a set of studies of
the question. The research question should describe the disease or
condition of interest, the population and setting, the intervention and
comparison treatment (for trials), and the outcomes of interest. For
example, "Among persons admitted to an intensive care unit with
unstable angina, does treatment with aspirin plus intravenous heparin
reduce the risk of myocardial infarction and death during the
hospitalization more than treatment with aspirin alone?" (15).
Identifying Completed Studies
Systematic reviews are based on a comprehensive and unbiased search
for completed studies. The search should follow a well-defined strategy
established before the results of the individual studies are known. The
process of identifying studies for potential inclusion in the review and
the sources for finding such articles should be explicitly documented
before the study. Searches should not be limited to MEDLINE, which
includes only about half of all published English-language clinical
research studies and often does not list non-English-language
references. Depending on the research question, other electronic
databases such as AIDSLINE, CANCERLIT, and EMBASE can be included,
as well as manual review of the bibliography of relevant published
studies, previous reviews, evaluation of the Cochrane Collaboration
database, and consultation with experts. The search strategy should be
clearly described so that other investigators can replicate the search.

Table 13.1 Elements of a Good Systematic Review
1. Clear research question
2. Comprehensive and unbiased identification of completed studies
3. Definition of inclusion and exclusion criteria
4. Uniform and unbiased abstraction of the characteristics and
   findings of each study
5. Clear and uniform presentation of data from individual studies
6. Calculation of a summary estimate of effect and confidence
   interval based on the findings of all eligible studies when
   appropriate
7. Assessment of the heterogeneity of the findings of the individual
   studies
8. Assessment of potential publication bias
9. Subgroup and sensitivity analyses
Criteria for Including and Excluding Studies
The protocol for a systematic review should provide a good rationale
for including and excluding studies, and these criteria should be
established a priori. Criteria for including or excluding studies from
meta-analyses typically designate the period during which studies were
published, the population that is acceptable for study, the disease or
condition of interest, the intervention to be studied, whether blinding is
required, acceptable control groups, required outcomes, maximal
acceptable loss to follow-up, and minimal acceptable length of
follow-up. Once these criteria are established, each potentially eligible
study should be reviewed for eligibility independently by two or more
investigators, with disagreements resolved by another reviewer or by
consensus. When determining eligibility, it may be best to blind
reviewers to the date, journal, authors, and results of trials.
Published systematic reviews should list studies that were
considered for inclusion and the specific reason for excluding a study.
For example, if 30 potentially eligible trials are identified, these 30
trials should be fully referenced and a reason should be given for each
exclusion.
Collecting Data from Eligible Studies
Data should be abstracted from each study in a uniform and unbiased
fashion. Generally, this is done independently by two or more
abstractors using predesigned forms that include variables that define
eligibility criteria, design features, the population included in the
study, the number of individuals in each group, the intervention (for
trials), the main outcome, secondary outcomes, and outcomes in
subgroups. The data abstraction forms should include any data that will
subsequently appear in the text, tables, or figures describing the
studies included in the systematic review, or in tables or figures
presenting the outcomes. When the two abstractors disagree, a third
abstractor may settle the difference, or a consensus process may be
used. The process for abstracting data from studies for the systematic
review should be clearly described in the manuscript.
The published reports of some studies that might be eligible for
inclusion in a systematic review may not include important information,
such as design features, risk estimates, and standard deviations. Often
it is difficult to tell if design features such as blinding were not
implemented or were just not described in the publication. The
reviewer can sometimes calculate relative risks and confidence
intervals from crude data presented from randomized trials, but it is
generally unacceptable to calculate risk estimates and confidence
intervals based on crude data from observational studies because there
is not sufficient information to adjust for potential confounders. Every
effort should be made to contact the authors to retrieve important
information that is not included in the published description of a study.
If this necessary information cannot be calculated or obtained, the
study findings are generally excluded.
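
The calculation of a relative risk and confidence interval from the crude data of a randomized trial, mentioned above, might look like the following minimal sketch. It assumes the usual large-sample formula for the standard error of the log relative risk; the function name and the trial counts are hypothetical and are not taken from any study cited in this chapter.

```python
import math


def relative_risk_ci(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Relative risk and approximate 95% CI from crude two-arm trial counts."""
    risk_trt = events_trt / n_trt
    risk_ctl = events_ctl / n_ctl
    rr = risk_trt / risk_ctl
    # Large-sample standard error of log(RR); assumes no zero cells.
    se_log_rr = math.sqrt(
        1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical trial: 15/100 events on treatment, 30/100 on control.
rr, lower, upper = relative_risk_ci(15, 100, 30, 100)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because this uses only crude counts, it is appropriate only where randomization makes adjustment for confounders unnecessary, which is the distinction the paragraph above draws.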
Presenting the Findings Clearly
Systematic reviews generally include three types of information. First,
important characteristics of each study included in the systematic
review are presented in tables. These often include the study sample
size, number of outcomes, length of follow-up, characteristics of the
population studied, and methods used in the study. Second, the review
displays the results of the individual studies (risk estimates, confidence
intervals, or P values) in a table or figure. Finally, in the absence of
significant heterogeneity (see below), the meta-analysis presents
summary estimates and confidence intervals based on the findings of
all the included studies as well as sensitivity and subgroup analyses.
The summary effect estimates represent a main outcome of the
meta-analysis but should be presented in the context of all the
information abstracted from the individual studies. The characteristics
and findings of individual studies included in the systematic review
should be displayed clearly in tables and figures so that the reader can
form opinions that do not depend solely on the statistical summary
estimates.
Meta-Analysis: Statistics for Systematic Reviews
Summary effect estimate and confidence interval. Once all
completed studies have been identified, those that meet the
inclusion and exclusion criteria have been chosen, and data have
been abstracted from each study, a summary estimate (summary
relative risk, summary odds ratio, etc.) and confidence interval
may be calculated. The summary effect is essentially an average
effect weighted by the inverse of the variance of the outcome of
each study. Methods for calculating the summary effect and
confidence interval are discussed in Appendix 13.1. Those not
interested in the details of calculating mean weighted estimates
from multiple studies should at least be aware that different
approaches can give different results. For example, recent
meta-analyses of the effectiveness of condoms for preventing
heterosexual transmission of HIV have given summary estimates
ranging from 80% to 94% decrease in transmission rates,
although they are based on the results of almost identical sets of
studies (16, 17).
Heterogeneity. Combining the results of several studies is not
appropriate if the studies differ in clinically important ways, such
as the intervention, outcome, controls, blinding, and so on. It is
also inappropriate to combine the findings if the results of the
individual studies differ widely. Even if the methods used in the
studies appear to be similar, the fact that the results vary
markedly suggests that something important was different in the
individual studies. This variability in the findings of the individual
studies is called heterogeneity (and the study findings are said
to be heterogeneous); if there is little variability, the study
findings are said to be homogeneous.
How can the investigator decide whether methods and findings are
similar enough to combine into summary estimates? First, he can
review the individual studies to determine if there are substantial
differences in study design, study populations, intervention, or
outcome. Then he can examine the results of the individual
studies. If some trials report a substantial beneficial effect of an
intervention and others report considerable harm, heterogeneity is
clearly present. Sometimes, it is difficult to decide if
heterogeneity is present. For example, if one trial reports a 50%
risk reduction for a specific intervention but another reports only
a 30% risk reduction, is heterogeneity present? Statistical
approaches (tests of homogeneity) have been developed to help
answer this question (Appendix 13.1), but ultimately, this requires
judgment. Every reported systematic review should include some
discussion of heterogeneity and its effect on the summary
estimates.
Assessment of Publication Bias
Publication bias occurs when published studies are not representative
of all studies that have been done, usually because positive results
tend to be submitted and published more often than negative results.
There are two main ways to deal with publication bias. Unpublished
studies can be identified and the results included in the summary
estimate. Unpublished results may be identified by querying
investigators and reviewing abstracts, meeting presentations, and
doctoral theses. The results of unpublished studies can be included with
those of the published trials in the overall summary estimate, or
sensitivity analyses can determine if adding these unpublished results
substantially changes the summary estimate determined from published
results. However, including unpublished results in a systematic review
is problematic for several reasons. It is often difficult to identify
unpublished studies and even more difficult to abstract the required
data. Frequently, inadequate information is available to determine if
the study meets inclusion criteria for the systematic review or to
evaluate the quality of the methods. For these reasons, unpublished
data are not often included in meta-analyses.
Alternatively, the extent of potential publication bias can be
estimated and this information used to temper the conclusions of the
systematic review. Publication bias exists when unpublished studies
have different findings from published studies. Unpublished studies are
more likely to be small (large studies usually get published, regardless
of the findings) and to have found no association between the risk
factor or intervention and the outcome (markedly positive studies
usually get published, even if small). If there is no publication bias,
there should be no association between a study's size (or the variance
of the outcome estimate) and findings. The degree of this association is
often measured using Kendall's Tau, a coefficient of correlation. A
strong or statistically significant correlation between study outcome
and sample size suggests publication bias. In the absence of
publication bias, a plot of study sample size versus outcome (e.g., log
relative risk) should have a bell or funnel shape with the apex near
the summary effect estimate.
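
As a rough illustration of the rank-correlation check described above, the sketch below computes Kendall's Tau between study sample size and log relative risk. It is a minimal version under stated assumptions: it uses the simple tau-a form (tied pairs count as neither concordant nor discordant), it gives no P value, and both data sets are invented for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical sample sizes for six published studies.
sizes = [50, 80, 120, 300, 1000, 2500]

# Hypothetical log relative risks: the smallest studies report the most
# extreme effects, the pattern expected under publication bias.
log_rr_biased = [-0.9, -0.7, -0.6, -0.4, -0.3, -0.25]

# Hypothetical log relative risks scattered on both sides of the summary
# effect among small studies, as expected without publication bias.
log_rr_balanced = [-0.1, -0.6, -0.45, -0.2, -0.35, -0.3]

tau_biased = kendall_tau(sizes, log_rr_biased)
tau_balanced = kendall_tau(sizes, log_rr_balanced)
print(tau_biased, tau_balanced)
```

A Tau near zero is consistent with a symmetric funnel; a Tau far from zero signals that outcome and study size move together. With so few studies the statistic is unstable, so it supplements rather than replaces inspection of the plot.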
The funnel plot in Fig. 13.1A suggests that there is little publication
bias because small studies with both negative and positive findings
were published. The plot in Fig. 13.1B, on the other hand, suggests
publication bias because the distribution appears truncated in the
corner that should contain small, negative studies.
When substantial publication bias is likely, summary estimates should
not be calculated or should be interpreted cautiously. Every reported
systematic review should include some discussion of potential
publication bias and its effect on the summary estimates.

FIGURE 13.1. A: Funnel plot that does not suggest publication
bias because there are studies with a range of large and small
sample sizes, and low relative risks are reported by some smaller
studies. B: Funnel plot suggestive of publication bias because the
smaller studies primarily report high relative risks.
Subgroup and Sensitivity Analyses
Subgroup analyses may be possible using data from all or some
subset of the studies included in the systematic review. For example, in
a systematic review of the effect of postmenopausal estrogen therapy
on endometrial cancer risk, some of the studies presented the results
by duration of estrogen use. Subgroup analyses of the results of
studies that provided such information demonstrated that longer
duration of use was associated with higher risk for cancer (18).
Sensitivity analyses indicate how sensitive the findings of the
meta-analysis are to certain decisions about the design of the
systematic review or inclusion of certain studies. For example, if the
authors decided to include studies with a slightly different design or
methods in the systematic review, the findings are strengthened if the
summary results are similar whether or not the questionable studies
are included. Systematic reviews should generally include sensitivity
analyses if any of the design decisions appear questionable or
arbitrary.
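
One simple way to carry out the kind of sensitivity analysis described above is to recompute the summary estimate with each study omitted in turn. The sketch below is a hypothetical illustration, not the authors' procedure: it assumes inverse-variance weighting of log relative risks, and the three studies and their variances are invented.

```python
import math


def pooled_relative_risk(log_rrs, variances):
    """Inverse-variance weighted summary relative risk."""
    weights = [1.0 / v for v in variances]
    pooled_log = sum(w * y for w, y in zip(weights, log_rrs)) / sum(weights)
    return math.exp(pooled_log)


def leave_one_out(log_rrs, variances):
    """Summary relative risk recomputed with each study left out in turn."""
    results = []
    for skip in range(len(log_rrs)):
        kept_effects = [y for i, y in enumerate(log_rrs) if i != skip]
        kept_variances = [v for i, v in enumerate(variances) if i != skip]
        results.append(pooled_relative_risk(kept_effects, kept_variances))
    return results


# Hypothetical studies; suppose the first has a questionable design.
log_rrs = [math.log(0.5), math.log(0.8), math.log(0.7)]
variances = [0.08, 0.02, 0.04]

overall = pooled_relative_risk(log_rrs, variances)
without_each = leave_one_out(log_rrs, variances)
print(round(overall, 3), [round(r, 3) for r in without_each])
```

If the summary estimate barely moves when the questionable study is dropped, the decision to include it matters little; a large shift is a finding that the review should report.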
Garbage In, Garbage Out
The biggest drawback to a systematic review is that it can produce a
reliable-appearing summary estimate based on the results of individual
studies that are of poor quality. The process of assessing quality is
complex and problematic. We favor relying on relatively strict criteria
for good study design when setting the inclusion criteria. If the
individual studies that are summarized in a systematic review are of
poor quality, no amount of careful analysis can prevent the summary
estimate from being unreliable. A special instance of this problem is
encountered in systematic reviews of observational data. If the results
of these studies are not adjusted for potential confounding variables,
the results of the meta-analysis will also be unadjusted and potentially
confounded.
Summary
Secondary Data Analysis
1. Secondary data analysis has the advantage of greatly reducing
the time and cost of doing research and the disadvantage of
providing the investigator little or no control over the study
population, design, or measurements.
2. One good source of data for secondary analysis is a completed
research project at the investigator's institution; others are the
large number of public databases now available from many
sources.
3. Large community-based data sets are useful for studying the
effectiveness and utilization of an intervention in the
community, and for discovering rare adverse events.
Ancillary Studies
1. A clever ancillary study can answer a new research question with
little cost and effort. As with secondary data analyses, the
investigator cannot control the design, but he is able to specify
a few key additional measurements.
2. Good opportunities for ancillary studies may be found in cohort
studies or clinical trials that include either the predictor or
outcome variable for the research question of interest. Stored
banks of serum, DNA, images, and so on, provide the
opportunity for cost-effective nested case-control and
case-cohort designs.
3. Most large studies have written policies that allow investigators
(including outside scientists) to propose and carry out ancillary
studies.
Systematic Reviews
1. A good systematic review, like any other study, requires a
complete written protocol before the study begins. The protocol
should include the research question, methods for identifying
all eligible studies, methods for abstracting data from the
studies, and statistical methods.
2. The statistical aspects of a systematic review, termed
meta-analysis, include the summary effect estimate and confidence
interval, tests for evaluating heterogeneity and potential
publication bias, and planned subgroup and sensitivity
analyses.
3. The characteristics and findings of individual studies should be
displayed clearly in tables and figures so that the reader can form
opinions that do not depend solely on the statistical summary
estimates.
4. The biggest drawback to a systematic review is that the results
can be no more reliable than the quality of the studies on which
it is based.
Appendix
Appendix 13.1: Statistical Methods for Meta-Analysis
Summary Effects and Confidence Intervals
The primary goal of meta-analysis is to calculate a summary effect size
and confidence interval. An intuitive way to do this is to multiply each
trial relative risk (an effect estimate) by the sample size (a weight that
reflects the accuracy of the relative risk), add these products, and
divide by the sum of the weights. In actual practice, the inverse of the
variance of the effect estimate from each individual study
($1/\text{variance}_i$) is used as the weight for each study. The inverse
of the variance is a better estimate of the precision of the effect
estimate than the sample size because it takes into account the number
of outcomes and their distribution. The weighted mean effect estimate is
calculated by multiplying each study weight ($1/\text{variance}_i$) by the
log of the relative risk (or any other risk estimate, such as the log
odds ratio, risk difference, etc.), adding these products, and dividing
by the sum of the weights. Small studies generally result in a large
variance (and a wide confidence interval around the risk estimate) and
large studies result in a small variance (and a narrow confidence
interval around the risk estimate). Therefore, in a meta-analysis, large
studies get a lot of weight (1/small variance) and small studies get
little weight (1/big variance).
To determine if the summary effect estimate is statistically significant,
the variability of the estimate of the summary effect is calculated.
There are various formulas for calculating the variance of summary risk
estimates (19,20). Most use something that approximates the inverse
of the sum of the weights of the individual studies
($1/\sum_i \text{weight}_i$). The variance of the summary estimate is
used to calculate the 95% confidence interval around the summary
estimate ($\pm 1.96 \times \text{variance}^{1/2}$).
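
The weighting scheme and confidence interval described above can be put together in a few lines. The following is a minimal sketch of inverse-variance pooling on the log relative risk scale; the function name and the three example trials are hypothetical, and published formulas differ in detail (19,20).

```python
import math


def pooled_fixed_effect(log_rrs, variances, z=1.96):
    """Summary relative risk and 95% CI by inverse-variance weighting."""
    weights = [1.0 / v for v in variances]          # weight_i = 1 / variance_i
    total_weight = sum(weights)
    pooled_log = sum(w * y for w, y in zip(weights, log_rrs)) / total_weight
    pooled_variance = 1.0 / total_weight            # 1 / sum of the weights
    half_width = z * math.sqrt(pooled_variance)     # 1.96 x variance^(1/2)
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - half_width),
        math.exp(pooled_log + half_width),
    )


# Hypothetical trials: log relative risks and their variances.
log_rrs = [math.log(0.5), math.log(0.8), math.log(0.7)]
variances = [0.08, 0.02, 0.04]   # the second trial is the largest

summary_rr, ci_low, ci_high = pooled_fixed_effect(log_rrs, variances)
print(f"Summary RR = {summary_rr:.2f} ({ci_low:.2f} to {ci_high:.2f})")
```

The interval is computed on the log scale and then exponentiated, so it is asymmetric around the summary relative risk. The trial with the smallest variance dominates the result, which is the behavior the text describes.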
Random- versus Fixed-Effect Models
There are multiple statistical approaches available for calculating a
summary estimate (20). The choice of statistical method is usually
dependent on the type of outcome (relative risk, risk reduction,
difference score, etc.). In addition to the statistical model, the
investigator must also choose to use either a fixed-effect or
random-effect model. The fixed-effect model simply calculates the
variance of a summary estimate based on the inverse of the sum of the
weights of each individual study. The random-effect model adds variance
to the summary effect in proportion to the variability of the results of
the individual studies. Summary effect estimates are generally similar
using either the fixed- or random-effect model, but the variance of the
summary effect is greater in the random-effect model to the degree
that the results of the individual studies differ, and the confidence
interval around the summary effect is correspondingly larger, so that
summary results are less likely to be statistically significant. Many
journals now require authors to use a random-effect model because it
is considered conservative. Meta-analyses should state clearly
whether they used a fixed- or random-effect model.
Simply using a random-effect model does not obviate the problem of
heterogeneity. If the studies identified by a systematic review are
clearly heterogeneous, a summary estimate should not be calculated.
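
To make the contrast between the two models concrete, the sketch below compares the variance of the summary estimate under each. The appendix does not name a specific random-effect estimator, so the DerSimonian-Laird method is used here as an assumption; the function name and the data are hypothetical.

```python
def fixed_and_random_effect(effects, variances):
    """Fixed-effect and DerSimonian-Laird random-effect summaries.

    Returns (fixed_estimate, fixed_variance,
             random_estimate, random_variance, tau_squared).
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total

    # Q measures how far the study results scatter around the fixed estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total - sum(w * w for w in weights) / total
    tau_squared = max(0.0, (q - df) / c)   # between-study variance

    # Random-effect weights add the between-study variance to each study.
    re_weights = [1.0 / (v + tau_squared) for v in variances]
    re_total = sum(re_weights)
    random = sum(w * y for w, y in zip(re_weights, effects)) / re_total
    return fixed, 1.0 / total, random, 1.0 / re_total, tau_squared


# Hypothetical log relative risks that differ somewhat between studies.
varied = fixed_and_random_effect([-0.69, -0.22, -0.36], [0.08, 0.02, 0.04])

# Hypothetical studies with identical results: the two models coincide.
identical = fixed_and_random_effect([-0.3, -0.3, -0.3], [0.08, 0.02, 0.04])

print(varied)
print(identical)
```

When the study results differ, the between-study variance is positive and the random-effect variance exceeds the fixed-effect variance, giving the wider confidence interval described above. When the results agree, the two models give the same answer.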
Statistical Tests of Homogeneity
Tests of homogeneity assume that the findings of the individual trials
are the same (the null hypothesis) and use a statistical test (test of
homogeneity) to determine if the data (the individual study findings)
refute this hypothesis. A chi-square test is commonly used (19). If the
data do support the null hypothesis (P value ≥ 0.10), the
investigator accepts that the studies are homogeneous. If the data do
not support the hypothesis (P value < 0.10), he rejects the null
hypothesis and assumes that the study findings are heterogeneous. In
other words, there are meaningful differences in the populations
studied, the nature of the predictor or outcome variables, or the study
results.
All meta-analyses should report tests of homogeneity with a P value.
These tests are not very powerful and it is hard to reject the null
hypothesis and prove heterogeneity when the sample size (the
number of individual studies) is small. For this reason, a P value
somewhat higher than the typical value of 0.05 is typically used as a
cutoff. If substantial heterogeneity is present, it is inappropriate to
combine the results of trials into a single summary estimate.
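
A chi-square test of homogeneity can be sketched as follows. The text does not say which statistic is meant, so this example assumes Cochran's Q (the weighted sum of squared deviations from the fixed-effect summary, with degrees of freedom equal to the number of studies minus one). The P value uses the closed-form chi-square tail for whole-number degrees of freedom; all names and data are hypothetical.

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term = total = 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series of half-integer terms.
    total = math.erfc(math.sqrt(half))
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (j + 0.5)
    return total


def homogeneity_test(effects, variances):
    """Cochran's Q, its degrees of freedom, and the P value."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    return q, df, chi_square_sf(q, df)


# Two hypothetical trials reporting 50% and 30% risk reductions.
q_similar, df_similar, p_similar = homogeneity_test(
    [math.log(0.5), math.log(0.7)], [0.04, 0.04]
)

# Two hypothetical trials reporting benefit and harm.
q_opposed, df_opposed, p_opposed = homogeneity_test(
    [math.log(0.5), math.log(1.5)], [0.04, 0.04]
)

print(p_similar, p_opposed)
```

With these invented variances, the 50% versus 30% comparison gives a P value above the 0.10 cutoff, while the benefit versus harm comparison falls far below it. The first result illustrates the low power noted above: with only two studies, failing to reject homogeneity is weak evidence that the studies agree.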
References
1. Svendsen KH, Kuller LH, Martin MJ, et al. Effects of passive
smoking in the multiple risk factor intervention trial (MRFIT). Am J
Epidemiol 1987;126:783-795.
2. Kerlikowske K, Grady D, Barclay J, et al. Likelihood ratios for
modern screening mammography. JAMA 1996;276:39-43.
3. Canner PL. Mortality in CDP patients during a nine-year
post-treatment period. J Am Coll Cardiol 1986;8:1243-1255.
4. Gurwitz JH, Gore JM, Goldberg RJ, et al. Risk for intracranial
hemorrhage after tissue plasminogen activator treatment for acute
myocardial infarction. Participants in the National Registry of
Myocardial Infarction 2. Ann Intern Med 1998;129:597-604.
5. Weaver WD, Simes RJ, Betriu A, et al. Comparison of primary
coronary angioplasty and intravenous thrombolytic therapy for
acute myocardial infarction: a quantitative review. JAMA
1997;278:2093-2098; published erratum appears in JAMA
1998;279:876.
6. Every NR, Parsons LS, Hlatky M, et al. A comparison of
thrombolytic therapy with primary coronary angioplasty for acute
myocardial infarction. Myocardial infarction triage and intervention
investigators. N Engl J Med 1996;335:1253-1260.
7. Tiefenbrunn AJ, Chandra NC, French WJ, et al. Clinical
experience with primary percutaneous transluminal coronary
angioplasty compared with alteplase (recombinant tissue-type
plasminogen activator) in patients with acute myocardial infarction:
a report from the Second National Registry of Myocardial Infarction
(NRMI-2). J Am Coll Cardiol 1998;31:1240-1245.
8. Barron HV, Michaels AD, Maynard C, et al. Use of
angiotensin-converting enzyme inhibitors at discharge in patients with
acute myocardial infarction in the United States: data from the National
Registry of Myocardial Infarction 2. J Am Coll Cardiol
1998;32:360-367.
9. Hearst N, Newman TB, Hulley SB. Delayed effects of the military
draft on mortality: a randomized natural experiment. N Engl J Med
1986;314:620-624.
10. Shlipak M, Simon J, Vittinghoff E, et al. Estrogen and progestin,
lipoprotein (a), and the risk of recurrent coronary heart disease
events after menopause. JAMA 2000;283:1845-1852.
11. Sawaya GF, Grady D, Kerlikowske K, et al. The positive
predictive value of cervical smears in previously screened
postmenopausal women: the Heart and Estrogen/progestin
Replacement Study (HERS). Ann Intern Med 2000;133:942-950.
12. Grady D, Brown J, Vittinghoff E, et al. Postmenopausal
hormones and incontinence: the Heart and Estrogen/Progestin
Replacement Study. Obstet Gynecol 2001;97:116-120.
13. Grady D, Yaffe K, Kristof M, et al. Effect of postmenopausal
hormone therapy on cognitive function: the Heart and
Estrogen/progestin Replacement Study. Am J Med
2002;113:543-548.
14. Herrington DM, Vittinghoff E, Howard TD, et al. Factor V Leiden,
hormone replacement therapy, and risk of venous thromboembolic
events in women with coronary disease. Arterioscler Thromb Vasc
Biol 2002;22:1012-1017.
15. Oler A, Whooley M, Oler J, et al. Heparin plus aspirin reduces
the risk of myocardial infarction or death in patients with unstable
angina. JAMA 1996;276:811-815.
16. Pinkerton SD, Abramson PR. Effectiveness of condoms in
preventing HIV transmission. Soc Sci Med 1997;44:1303-1312.
17. Weller S, Davis K. Condom effectiveness in reducing
heterosexual HIV transmission. Cochrane Database Syst Rev 2002;
(1):CD003255.
18. Grady D, Gebretsadik T, Kerlikowske K, et al. Hormone
replacement therapy and endometrial cancer risk: a meta-analysis.
Obstet Gynecol 1995;85:304-313.
19. Petitti D. Meta-analysis, decision analysis and cost
effectiveness analysis. New York: Oxford University Press, 1994.
20. Cooper H, Hedges LV. The handbook of research synthesis. New
York: Russell Sage Foundation, 1994.
14
Addressing Ethical Issues
Bernard Lo
Research with human participants raises ethical concerns because people
accept risks and inconvenience primarily to advance scientific knowledge
and to benefit others. For the public to be willing to participate in
clinical research and to provide public funding, it needs to trust that
such research is conducted according to strict ethical standards (1).
In this chapter we begin by reviewing ethical principles and the federal
regulations about informed consent and institutional review boards
(IRBs) in the United States. We then turn to a number of ethical
considerations including scientific misconduct, conflict of interest,
authorship, and confidentiality.
Ethical Principles
Three ethical principles guide research with human participants (2).
The principle of respect for persons requires investigators to obtain
informed consent from research participants, to protect participants with
impaired decision-making capacity, and to maintain confidentiality.
Research participants are not passive sources of data, but individuals
whose rights and welfare must be respected.
The principle of beneficence requires that the research design be
scientifically sound and that the risks of the research be acceptable in
relation to the likely benefits. Risks to participants include both
physical harm from research interventions and psychosocial harm, such as
breaches of confidentiality, stigma, and discrimination. The risks of
participating in the study can be reduced, for example, by screening
potential participants to exclude those likely to suffer adverse effects
and monitoring participants for adverse effects.
The principle of justice requires that the benefits and burdens of
research be distributed fairly. Vulnerable populations, such as people
with poor access to health care, those with impaired decision-making
capacity, and institutionalized persons, may lack the capacity to make
informed and free choices about participating in research.
Such populations may seem attractive to study if access and follow-up
are convenient, but vulnerable populations should not be targeted for
research if other populations would also be suitable participants.
Justice also requires equitable access to the benefits of research.
Traditionally, clinical research has been regarded as risky, and potential
subjects have been thought of as guinea pigs who needed protection
from dangerous interventions that would confer little or no personal
benefit. Increasingly, however, clinical research is regarded as providing
access to new therapies for such conditions as HIV infection, cancer, and
organ transplantation. Patients who seek promising new drugs for fatal
conditions want increased access to clinical research, not greater
protection (3). In addition, groups that are underrepresented in clinical
research receive suboptimal clinical care because of a weak evidence base.
Children, women, and members of ethnic minorities historically have
been underrepresented in clinical research. NIH-funded clinical
researchers must have adequate representation of children, women, and
members of ethnic minorities in studies, or else justify why they are
underrepresented.
Federal Regulations for Research on
Human Subjects
Federal regulations are intended to ensure that human subjects
research is conducted in an ethically acceptable manner (4). (Although
the regulations refer to human "subjects," the term
"participants" is generally preferred today.) These regulations
apply to all federally funded research and to research that will be
submitted to the U.S. Food and Drug Administration (FDA) in support of a
new drug or device application. In addition, most universities require
that all research on human subjects conducted by affiliated faculty and
staff comply with these regulations, including research funded privately
or conducted off-site.
These federal regulations define research as systematic
investigation designed to develop or contribute to generalizable
knowledge (4). Research is therefore distinguished from unproven
clinical care that is directed toward benefiting the individual patient and
not toward publication. Human subjects are living individuals about
whom an investigator obtains either data through intervention or
interaction with the individual or identifiable private
information. Private information comprises (1) information that a
person can reasonably expect is not being observed or recorded and (2)
information that has been provided for specific purposes and that the
individual can reasonably expect will not be made public (e.g., a medical
record). Information is identifiable if the identity of the subject
is or may be readily ascertained by the investigator or associated with
the information. Research data that is identified by a code is not
considered individually identifiable if the key that links data to
participants is destroyed before the research begins or if the
investigators have no access to the key.
Researchers who have questions about these federal regulations should
consult their IRB or read the full text of the federal regulations, which is
available on the website of the Office for Human Research Protections
(OHRP) of the Department of Health and Human Services.
The federal regulations provide two main protections for human subjects:
IRB approval and informed consent.
Institutional Review Board Approval
Federal regulations require that research with human subjects be
approved by an IRB. The IRB's mission is to ensure that the research is
ethically acceptable and that the welfare and rights of research
participants are protected. Although most IRB members are researchers,
IRBs must also include community members and persons knowledgeable
about legal and ethical issues concerning research.
When approving a research study, the IRB must determine that:
risks to participants are minimized,
risks are reasonable in relation to anticipated benefits and the
importance of the knowledge that is expected to result,
selection of participants is equitable,
informed consent will be sought from participants or their legally
authorized representatives, and
confidentiality is adequately maintained (4).
The IRB system is decentralized. Each local IRB implements federal
regulations using its own forms, procedures, and guidelines, and there is
no appeal to a higher body. As a result, a protocol for a multicenter study
may be approved by the IRB of one institution but not by the IRB of
another institution. Usually these differences can be resolved through
discussions or protocol modifications.
IRBs have been criticized for several reasons (5,6). They may place
undue emphasis on consent forms and fail to scrutinize the research
design. Review of the scientific merit of the research is usually beyond
the expertise of the IRB and is left to the funding agency. Although IRBs
need to review any protocol revisions and monitor adverse events,
typically they do not check whether research was actually carried out in
accordance with the approved protocols. Many IRBs lack the resources
and expertise to adequately fulfill their mission of protecting research
participants. For these reasons, federal regulations and IRB approval
should be regarded only as a minimal ethical standard for research.
Ultimately, the judgment and character of the investigator are the
most essential element for assuring that research is ethically acceptable.
Exceptions to Institutional Review Board Review
Certain research may be exempted from IRB review or may receive
expedited review.
Certain types of research may be exempted from IRB review, most
commonly surveys, interviews, and research with existing specimens,
records, or data (Table 14.1). The ethical justification for such
exemptions is that the research involves low risk, almost all people
would consent to such research, and obtaining consent from each subject
would make such studies prohibitively expensive or difficult.
An IRB may allow certain research to undergo expedited review by a
single reviewer rather than the full committee (Table 14.2). The
Department of Health and Human Services publishes a list of types of
research that are eligible for expedited review (7), which can be obtained
at its website.
The concept of minimal risk to participants plays a key role in federal
regulations, as indicated in Tables 14.1 and 14.2. Minimal risk is defined
as that ordinarily encountered in daily life or during the performance
of routine physical or psychological tests. Both the magnitude and
probability of risk must be considered. The IRB must judge whether a
specific project may be considered minimal risk.
Table 14.1 What Research is Exempt from
Institutional Review Board Review?
1. Surveys, interviews, or observations of public behavior
   unless:
      subjects can be identified, either directly or through
      identifiers, and disclosure of subjects' responses could
      place them at risk for legal liability or damage their
      reputation, financial standing, or employability.
2. Studies of existing records, data, or specimens, provided
   that:
      samples exist and are publicly available (e.g., data
      tapes released by state and federal agencies) or
      information is recorded by the investigator in such a
      manner that subjects cannot be identified, either
      directly or through identifiers. Coded data is
      considered identifiable if the codes could be broken
      with the cooperation of others.
3. Research on normal educational practices
The HIPAA Health Privacy Regulations
The federal Health Privacy Regulations (commonly known as HIPAA, after
the Health Insurance Portability and Accountability Act) require
researchers to obtain permission from patients to use protected health
information in research (8,9,10). The Privacy Rule protects individually
identifiable health information, which is termed protected health
information. Under the Privacy Rule, individuals must sign an
authorization to allow the health care provider to use or disclose
protected health information in a research project. The regulations
specify information that must be included in the authorization form. This
HIPAA authorization form is in addition to the informed consent form
required by the IRB. Researchers must obtain authorization for each use
of protected information for research. Furthermore, under the Privacy
Rule research participants may have the right to access health
information and to obtain a record of disclosures of their protected
health information.
Table 14.2 What Research May Undergo
Expedited Institutional Review Board (IRB)
Review?
1. Research that involves no more than minimal risk and is
   one of the categories of research listed by the Department
   of Health and Human Services as eligible for expedited
   review. Examples include:
      collection of specimens through venipuncture
      collection of specimens through noninvasive
      procedures routinely employed in clinical practice,
      such as electrocardiograms and magnetic resonance
      imaging. However, procedures using x-rays must be
      reviewed by the full IRB.
      research involving data, records, or specimens that
      have been collected or will be collected for clinical
      purposes and research using surveys or interviews
      that is not exempt from IRB review.
2. Minor changes in previously approved research
Informed and Voluntary Consent
Investigators must obtain informed and voluntary consent from research
participants.
Disclosure of Information to Participants
Investigators must disclose information that is relevant to the potential
participant's decision whether or not to participate in the research.
Specifically, investigators must discuss with potential participants:
The nature of the research project. The prospective subject should
be told explicitly that research is being conducted, what the purpose
of the research is, and how participants are being recruited. The
actual study hypothesis need not be stated.
The procedures of the study. Participants need to know what they
will be asked to do in the research project. On a practical level,
they should be told how much time will be required and how often.
Procedures that are not standard clinical care should be identified
as such. Alternative procedures or treatments that may be available
outside the study should be discussed. If the study involves blinding
or randomization, these concepts should be explained in terms the
participant can understand. In interview or questionnaire research,
participants should be informed of the topics to be addressed.
The risks and potential benefits of the study and the alternatives
to participating in the study. Medical, psychosocial, and economic
harms and benefits should be described in lay terms. Also, potential
participants need to be told the alternatives to participation, for
example, whether the intervention in a clinical trial is available
outside the study. Concerns have been voiced that often the
information provided to participants understates the risks and
overstates the benefits (11,12). For example, research on new
drugs is sometimes described as offering benefits to participants.
However, most promising new interventions, despite encouraging
preliminary results, show no significant advantages over standard
therapy. Often participants have a "therapeutic
misconception" that the research intervention is designed to
provide them a personal benefit (13). Investigators should make
clear that it is not known whether the study drug is more effective
than standard therapy and that promising drugs can cause serious
harms.
Consent Forms
Written consent forms are generally required to document that the
process of informed consent (discussions between an investigator and
the subject) has occurred. The consent form needs to contain all the
information that must be disclosed under the provisions of 45 CFR
46.116. Alternatively, a short form may be used, which states that the
required elements of informed consent have been presented orally. If the
short form is used, there must be a witness to the oral presentation, and
the witness must sign the short consent form as well as the participant.
IRBs usually have sample consent language and forms that they prefer
investigators to use. IRBs may require more information to be disclosed
than the Common Rule requires. Investigators should be familiar with the
templates and suggestions from their IRBs.
Participants' Understanding of Disclosed
Information
Ethically, the crucial issue regarding consent is not what information the
researcher discloses but whether participants understand the risks and
benefits of the research project. Research participants commonly have
serious misunderstandings about the goals of research and the
procedures and risks of the specific protocol (1,14). In discussions and
consent forms, researchers should avoid technical jargon and
complicated sentences. IRBs have been criticized for excessive focus on
consent forms rather than on whether potential participants have
understood pertinent information (1). Strategies to increase
comprehension by participants include having a study team member or a
neutral educator spend more time talking one-on-one with study
participants, simplifying consent forms, using a question and answer
format, providing information over several visits, and using audiotapes or
videotapes (15). In research that involves
substantial risk or is controversial, investigators should consider
assessing whether participants have appreciated the disclosed
information (16).
The Voluntary Nature of Consent
Ethically valid consent must be voluntary as well as informed.
Researchers must minimize the possibility of coercion or undue influence.
Examples of undue influence are excessive payments to participants or
asking staff members or students to volunteer for research. Undue
influence is ethically problematic because participants might discount
the risks of a research project or find it too difficult to decline to
participate. Participants must understand that declining to participate in
the study will not compromise their medical care and that they may
withdraw from the project at any time.
Exceptions to Consent
Table 14.3 explains how informed consent or written consent forms may
not be needed in several situations. First, activities that do not obtain
identifiable private information on living persons are not considered human
subjects research. Second, the activity may qualify for an exemption
from the Common Rule. These provisions permit exceptions to informed
consent for many projects that carry out secondary analyses of existing
data or biological materials. Under HIPAA, the exceptions to individual
authorization for research differ somewhat from exceptions and waivers
of informed consent under the Common Rule. HIPAA allows research to
be carried out without authorization if the data set does not contain
certain specified participant identifiers.
Table 14.3 Is Informed Consent Required
Under the Common Rule?
1. Is the activity human subjects research as defined in
   46.102(e) and (f)?
      Is there an intervention or interaction with a living
      person?
      Does the researcher obtain identifiable private
      information?
   If the answer to both questions is NO, then the Common
   Rule does not apply
2. Does the activity qualify for an exemption from the
   Common Rule under 45 CFR 46.101(b)?
      Existing data, documents, records, or specimens,
      provided that the investigator records data in a way
      that it cannot be linked to the subject OR that the
      data or specimens are publicly available.
      Surveys, interviews, observation of public behavior,
      provided that subjects cannot be identified AND
      responses could not put participants at legal,
      financial, or social risk.
      Educational practices in educational settings.
3. Does the project qualify for a waiver or modification of
   informed consent under 46.116(d)?
      the research involves no more than minimal risk to
      the subjects; AND
      the waiver or alteration will not adversely affect the
      rights and welfare of the subjects; AND
      the research could not practicably be carried out
      without the waiver or alteration; AND
      whenever appropriate, the subjects will be provided
      with additional pertinent information after
      participation.
4. Does the research project qualify for a waiver of signed
   consent forms under 46.117(c)?
      The only record linking the subject and the research
      would be the consent document and the principal risk
      is a breach of confidentiality OR
      The research is minimal risk and involves no
      procedures for which written consent is normally
      required outside the research context.
      The research presents no more than minimal risk to
      participants and
      The waiver or alteration would not adversely affect
      the rights and welfare of participants and
      The research otherwise could not practicably be
      carried out
Subjects Who Lack Decision-Making Capacity
When participants are not capable of giving informed consent, permission
to participate in the study should be obtained from the subject's legally
authorized representative. Also, the protocol should be subjected to
additional scrutiny, to ensure that the research question could not be
studied in a population that is capable of giving consent.
Risks and Benefits
Researchers need to maximize the benefits and minimize the risks of
research projects. Researchers must anticipate risks that might occur in
the study and modify the protocol to reduce risks to an acceptable level.
Measures might include identifying and excluding persons who are very
susceptible to adverse events, appropriate monitoring for adverse
events, and training staff in how to identify and respond to serious
adverse events. An important aspect of minimizing risk is maintaining
participants' confidentiality.
Confidentiality
Breaches of confidentiality may cause stigma or discrimination,
particularly if the research addresses sensitive topics such as psychiatric
illness, alcoholism, or sexual behaviors. Strategies for protecting
confidentiality include coding research data, storing it in locked cabinets,
protecting or destroying the key that identifies subjects, and limiting
personnel who have access to identifiers. However, investigators should
not make unqualified promises of confidentiality. Confidentiality may be
breached if research records are audited or subpoenaed, or if conditions
are identified that legally must be reported. Researchers have a moral
and legal obligation to override confidentiality to prevent harm in such
situations as child abuse, certain infectious diseases, and serious threats
of violence by psychiatric patients. In projects where information about
such situations can be foreseen, the protocol should specify how field
staff should respond, and participants should be informed of these plans.
Investigators can forestall subpoenas in legal disputes by obtaining
confidentiality certificates from the Public Health Service if the research
project involves sensitive information, such as sexual attitudes or
practices, use of alcohol or drugs, illegal conduct, or mental health, or
any information that could reasonably lead to stigma or discrimination
(17). These certificates allow the investigator to withhold names or
identifying characteristics of the participants from people not connected
with the project, even if faced with a subpoena or court order. The
research need not be federally funded. However, these certificates do not
apply to audits by funding agencies or the FDA.
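As a rough illustration of the "coding research data" strategy described
in this section, the following Python sketch replaces an identifying field
with a random study code and returns the linking key separately, so that
the key can be stored apart from the data or destroyed. The function name
code_participants, the field names, and the code format are hypothetical
choices made for this example; they are not taken from the regulations or
from any particular data management system.

```python
import secrets


def code_participants(records, id_field="name"):
    """Replace an identifying field with a random study code.

    Returns (coded_records, key). The key maps each study code back to
    the original identifier and should be stored separately from the
    coded data (or destroyed) to protect confidentiality.
    """
    key = {}
    coded_records = []
    for record in records:
        # Draw a random code; redraw in the unlikely event of a collision.
        code = "P" + secrets.token_hex(4)
        while code in key:
            code = "P" + secrets.token_hex(4)
        key[code] = record[id_field]
        # Copy every field except the identifier, then attach the code.
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["study_code"] = code
        coded_records.append(coded)
    return coded_records, key


# Hypothetical example data, not drawn from any real study.
records = [
    {"name": "Participant A", "systolic_bp": 128},
    {"name": "Participant B", "systolic_bp": 141},
]
coded, key = code_participants(records)
```

Under this scheme the analysis files carry only the study codes, and only
personnel who hold the key can link a record back to a person. Whether
coded data counts as identifiable still depends on who can access the key.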
Research Participants who Require
Additional Protections
Some participants might be at greater risk for being used in ethically
inappropriate ways in research (18). Such vulnerable persons might
have difficulty giving voluntary and informed consent or might be more
susceptible to adverse events.
Types of Vulnerability
Identifying different types of vulnerability allows researchers to adopt
safeguards tailored to the specific type of vulnerability.
Cognitive or Communicative Impairments
Persons with impaired cognitive function may have difficulty
understanding information about a study and deliberating about the risks
and benefits of the study.
Vulnerability because of Power Differences
Persons who reside in institutions, such as prisoners or nursing home
residents, might feel pressure to participate in research. In these
institutions, those in authority control the daily routine and life choices
of the residents (19). Residents might not appreciate that they may
decline to participate in research, without retaliation by authorities or
jeopardy to other aspects of their everyday lives.
If the investigator in the research project is also a participant's treating
physician, the participant might find it difficult to decline to participate
in research. Participants might fear that if they decline, the physician will
not be as interested in their care or that they might have difficulty getting
timely appointments. This might particularly be a concern for patients at
a specialized clinic or hospital, who have few alternative sources of care.
Social and Economic Disadvantages
Persons with poor access to health care and low socioeconomic status
may join a research study to obtain payment, a physical examination, or
screening tests, although they would regard the risks as unacceptable if
they had a higher income. Poor education or low health literacy may
both make it difficult for participants to comprehend information about
the study and make them unduly influenced by other people.
Special Federal Regulations for Vulnerable
Participants
Research on Children
Investigators must obtain both the permission of the parents and the
assent of the child when developmentally appropriate. In addition,
research with children involving more than minimal risk is circumscribed.
Such research is permissible if it presents the prospect of direct benefit
to the child. If the research offers no such prospect, it may still be
approved by the IRB, provided that the increase over minimal risk is
minor and the research is likely to yield generalizable knowledge of vital
importance about the child's disorder or condition.
Research on Prisoners
Prisoners may not feel free to refuse to participate in research and may
be unduly influenced by cash payments, living conditions, or parole
considerations. Federal regulations limit the types of research that are
permitted and require both stricter IRB review and approval by the
Department of Health and Human Services.
Research on Pregnant Women, Fetuses, and
Embryos
Extra protections and restrictions are required when research is carried
out on fetuses and embryos or pregnant women.
Responsibilities of Investigators
Scientific Misconduct
In several highly publicized cases, researchers made up or altered
research data or enrolled ineligible participants in clinical trials
(20,21,22,23). Such conduct gives incorrect answers to the research
question, undermines public trust in research, and threatens public
support of federally funded research (24).
The federal government defines research misconduct as fabrication,
falsification, and plagiarism, as the website of the Office for Research
Integrity explains. Fabrication is making up results and recording or
reporting them. Falsification is manipulating research materials,
equipment, or procedures or changing or omitting data or results, so that
the research record misrepresents the actual findings. Plagiarism is
appropriating another person's ideas, results, or words without giving
appropriate credit.
The federal definition of misconduct requires perpetrators to act
intentionally in the sense that they are aware that their conduct is
wrong. Research misconduct does not include honest error or legitimate
scientific differences of opinion, which are a normal part of the research
process. The federal definition also excludes other wrong actions, such as
double publication, failure to share research materials, and sexual
harassment (25). Such inappropriate behavior should be dealt with by
the principal investigator and institution.
When research misconduct is alleged, both the federal funding agency
and the investigator's institution have the responsibility to carry out a
fair and timely inquiry or investigation (26). During an investigation,
both whistleblowers and accused scientists have rights that must be
respected. Whistleblowers need to be protected from retaliation, and
accused scientists need to be told the charges and given an opportunity
to respond. Punishment for proven research misconduct may include
suspension of a grant, debarment from future grants, and other
administrative, criminal, or civil procedures.
Authorship
Authorship of scientific papers results in prestige, promotions, and
grants for researchers. Therefore investigators are eager to receive
credit for publications. Researchers also need to take responsibility for
problems with published articles (27). In several cases of scientific
misconduct, coauthors of manuscripts containing fabricated, falsified, or
plagiarized data denied knowledge of the misconduct. The rise in
multiple-authored papers has made it more difficult to assign
accountability for published articles.
Problems with authorship include guest authorship and ghost authorship.
Guest or honorary authors are persons who have made only trivial
contributions to the paper, for example, by providing access to
participants, reagents, laboratory assistance, or funding (28). Ghost
authors are individuals who made substantial contributions to the paper
but are not listed as authors; generally ghost authors are employees of
pharmaceutical companies or public relations officers. In one study, 21%
of articles had guest authors and 13% ghost authors (29).
Medical journals have set criteria for authorship (30). Authors must make
substantial contributions to (a) the conception and design of the project,
or the data analysis and interpretation, and (b) the drafting or revising
of the article; they must also (c) give final approval of the manuscript.
Mere acquisition of funding, data collection, or supervision of a research
group does not justify authorship, instead warranting an
acknowledgment. Because there is no agreement on criteria for first,
middle, or last author, it has been suggested that the contributions of
each author to the project be described in the published article (31).
Disagreements commonly arise among the research team regarding who
should be an author or the order of authors. These issues are best
discussed explicitly and decided at the beginning of a project.
Collaborators subsequently might not carry out the tasks they agreed to,
for example, failing to carry out data analyses or prepare
a first draft. Changes in authorship should be negotiated when decisions
are made to shift responsibilities for the work. Detailed suggestions have
been made on how to carry out such negotiations diplomatically (32).
Conf l i ct s of I nt er est
Researchers may have confl icti ng i nterests that mi ght impai r thei r
objectivi ty and undermi ne publ i c trust in research ( 33,34). Even the
percepti on of a confl i ct of interest may be del eteri ous (35).
Types of Conflicts of Interests
Dual roles for clinician-investigators. An investigator may be
the personal physician of an eligible research participant. Such
participants might fear that their future care will be jeopardized if
they decline to participate in the research, or they may not
distinguish between research and treatment. Furthermore, what is
best for a particular participant may differ from what is best for the
research project. In this situation, the welfare of the participant
should be paramount, and the physician must do what is best for
the participant.
Financial conflicts of interest. Studies of new drugs are
commonly funded by pharmaceutical companies or biotechnology
firms. The ethical concern is that certain financial ties may lead to
bias in the design and conduct of the study, the overinterpretation
of positive results, or failure to publish negative results (33,36,37).
If investigators hold stock or stock options in the company making
the drug or device under study, they may reap large financial
rewards if the treatment is shown to be effective, in addition to
their compensation for conducting the study. Furthermore,
investigators may lose well-paying consulting arrangements if the
drug proves ineffective.
Responding to Conflicting Interests
Researchers can respond to some conflicts of interest by substantially
eliminating the potential for bias. Other situations, however, have such
great potential for conflicts of interest that they should be avoided.
Minimize conflicting interests. In well-designed clinical trials,
several standard precautions help keep competing interests in
check. Investigators can be blinded to the intervention a subject is
receiving, to prevent bias in assessing outcomes. An independent
data safety monitoring board (DSMB), whose members have no
conflict of interest, can review interim data and terminate the study
if the data provide convincing evidence of benefit or harm. The
peer review process for grants, abstracts, and manuscripts also
helps eliminate biased research.
Physicians should separate the roles of investigator in a research
project and clinician providing the research participant's medical
care, whenever possible. A member of the research team who is not
the treating physician should handle consent discussions and follow-up
visits that are part of the study.
If research is funded by a pharmaceutical company, academic-based
investigators need to ensure that the contract gives them control
over the primary data and statistical analysis, and the
freedom to publish findings, whether or not the investigational
drug is found to be effective (36,38). The investigator has an
ethical obligation to take responsibility for all aspects of the
research, ensuring that the work is done rigorously. The sponsor
may review the manuscripts, make
suggestions, and ensure that patent applications have been filed
before the article is submitted to a journal. However, the sponsor
must not have power to veto or censor publication (36).
Disclose conflicting interests. Conflicts of interest should be
disclosed to the IRB and research participants. In a landmark court
case, the California Supreme Court declared that physicians need to
disclose personal interests unrelated to the patient's health,
whether research or economic, that may affect the physician's
professional judgment (39). Medical journals commonly require
authors to disclose such conflicts of interest when manuscripts are
submitted or published (40,41). Although disclosure itself is a small
step, it may deter investigators from ethically problematic
practices.
Manage conflicts of interest. If a particular study presents
concerns about a conflict of interest, the research institution may
require additional safeguards, such as closer monitoring of the
informed consent process.
Prohibit certain situations. To minimize conflicts of interest,
researchers from academic institutions should not hold stock or
stock options in a company that has a financial interest in the
intervention being studied, nor be an officer in the company
(42,43,44). Many universities, however, allow investigators to have
de minimis holdings under $10,000.
Ethical Issues Specific to Certain Types of
Research
Randomized Clinical Trials
Although randomized controlled trials are the most rigorous design for
evaluating interventions (see Chapter 10), they present special ethical
concerns because the intervention is determined by chance. The ethical
justification for assigning treatment by randomization is that the arms of
the protocol are in equipoise. That is, current evidence does not prove
that either arm is superior. Even if some experts believe that one arm
offers more effective treatment, other experts believe the opposite (45).
Furthermore, individual participants and their personal physicians must
find randomization acceptable. If physicians believe strongly that one
arm of the trial is superior and can provide the intervention in that arm
outside the study, they cannot in good faith recommend that their
patients enter the trial. Also, the participant might not consider the arms
equivalent, for example, when the trade-offs between benefit and
adverse effects differ markedly in a comparison of medical and surgical
approaches to a disease (46).
Interventions for control groups also raise ethical concerns. According
to the principle of "do no harm," it is problematic to withhold
therapies that are known to be effective. Hence the control group should
receive the current standard of care. However, placebo controls may still
be justified in short-term studies that do not pose serious risks to
participants, such as studies of mild hypertension and mild, self-limited
pain. Participants need to be informed of effective interventions that are
available outside the research study. Dilemmas about the control group
are particularly difficult when the research participants have such poor
access to care that the research project is the only practical way for
them to receive adequate health care.
It is unethical to continue a clinical trial if there is compelling evidence
that one arm is safer or more effective. Furthermore, it would be wrong
to continue a trial that will not answer the research question because of
low enrollment, few outcome events, or high dropout rates. The periodic
analysis of interim data in a clinical trial by an independent DSMB can
determine whether a trial should be terminated prematurely (47). Such
interim analyses should not be carried out by the researchers
themselves, because unblinding investigators to interim findings can lead
to bias if the study continues. Procedures for examining interim data and
statistical stopping rules should be specified in the protocol (Chapter
11).
Clinical trials in developing countries present additional ethical
dilemmas, as Chapter 18 discusses.
Research on Previously Collected Specimens and Data
Such research offers the potential for significant discoveries. For
example, DNA testing on a large number of stored biological specimens
that are linked to clinical data may identify genes that increase the
likelihood of developing a disease or responding to a particular
treatment. Large biobanks of blood and tissue samples allow future
studies to be carried out without the collection of additional samples.
Research on previously collected specimens and data offers no physical
risks to participants. However, there are ethical concerns. Consent for
future studies is problematic because no one can anticipate what kind of
research might be carried out later. Furthermore, participants may object
to the use of data and samples in certain ways (48). Breaches of
confidentiality may occur and may lead to stigma and discrimination.
Even if individual participants are not harmed, groups may be harmed.
Historically, genetics research in the United States led to eugenics
abuses, such as forced sterilization of persons with mental retardation or
psychiatric illness (49).
When biological specimens are collected, consent forms should allow
participants to agree to or refuse certain broad categories of future
research using the specimens. For example, participants might agree to
allow their specimens to be used in future research on related conditions
or for any kind of future study that is approved by an IRB and scientific
review panel. Participants should also know whether the code identifying
individual participants will be retained or shared with other researchers.
Furthermore, participants should understand that research discoveries
from the biobank may be patented and developed into commercial
products. Several national biobanks in Europe have required commercial
users of the biobanks to make payments to the government, so that the
population that contributed the samples will derive some financial
benefit.
Other Issues
Payment to Research Participants
Participants in clinical research deserve payment for their time and effort
and reimbursement for out-of-pocket expenses such as transportation
and childcare. Practically speaking, compensation may be needed to
enroll and retain participants. The widespread practice is to offer higher
payment for studies that are very inconvenient or risky. However, such
incentives also raise ethical concerns about undue inducement. If
participants are paid more to participate in riskier research, poor persons
may undertake risks against their better judgment. To avoid undue
influence, it has been suggested that participants be compensated only
for actual expenses and for their time, at an hourly rate for unskilled labor
(50).
Summary
1. Investigators must ensure that their projects observe the ethical
principles of respect for persons, beneficence, and justice.
2. Investigators must ensure that research meets the requirements of
applicable federal regulations. Informed consent from
participants and IRB review are the key features of these
regulations. During the informed consent process, investigators
must explain to potential participants the nature of the project
and the risks, potential benefits, and alternatives.
3. Vulnerable populations, such as children, prisoners, pregnant
women, and people with cognitive deficiency or social
disadvantage, require additional protections.
4. Researchers must have ethical integrity. They must not commit
scientific misconduct, including fabrication, falsification, or
plagiarism. They should deal with conflicts of interest
appropriately and follow criteria for appropriate authorship.
5. In certain types of research, additional ethical issues must be
addressed. In randomized clinical trials, the intervention arms must
be in equipoise, control groups must receive appropriate
interventions, and the trial must not be continued once it has been
demonstrated that one intervention is safer or more effective. When
research is carried out on previously collected specimens and data,
special attention needs to be given to confidentiality.
References
1. Institute of Medicine. Responsible research: a systems approach
to protecting research participants. Washington, DC: National
Academies Press, 2003.
2. National Commission for the Protection of Human Subjects of
Biomedical and Behavioral Research. The Belmont report: ethical
principles and guidelines for the protection of human subjects of
biomedical and behavioral research. Washington, DC: US Government
Printing Office, 1979.
3. Levine C, Dubler NN, Levine RJ. Building a new consensus: ethical
principles and policies for clinical research on HIV/AIDS. IRB
1991;13:1-17.
4. Department of Health and Human Services. Protection of human
subjects. 45 CFR 46. Available at
http://www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm.
2005.
5. US General Accounting Office. Continued vigilance critical to
protecting human subjects. Washington, DC: Government Accounting
Office, 1996.
6. Office of the Inspector General. Institutional review boards: their
role in reviewing approved research. Washington, DC: Department of
Health and Human Services, 1998.
7. Institutional Review Board (IRB). Through an expedited review
procedure. 63 Federal Register 60364-60367. Available at:
http://www.hhs.gov/ohrp/humansubjects/guidance/expedited98.htm.
(1998)
8. Department of Health and Human Services. 45 CFR Parts 160 and
164. Standards for Privacy of Individually Identifiable Health
Information. Fed Regist 2002;67:53182-53273.
9. National Institutes of Health. Protecting Personal Health
Information in Research: Understanding the HIPAA Privacy Rule.
http://www.privacyruleandresearch.nih.gov/. Accessed July 28, 2003.
10. Gunn PP, Fremont AM, Bottrell M, et al. The Health Insurance
Portability and Accountability Act Privacy Rule: a practical guide for
researchers. Med Care 2004;42(4):321-327.
11. Advisory Committee on Human Radiation Experiments. Final
report. New York: Oxford University Press, 1998.
12. King NM. Defining and describing benefit appropriately in clinical
trials. J Law Med Ethics 2000;28(4):332-343.
13. Lidz CW, Appelbaum PS. The therapeutic misconception:
problems and solutions. Med Care 2002;40(9 Suppl):V55-V63.
14. Wendler D, Emanuel EJ, Lie RK. The standard of care debate: can
research in developing countries be both ethical and responsive to
those countries' health needs? Am J Public Health 2004;94
(6):923-928.
15. Flory J, Emanuel E. Interventions to improve research
participants' understanding in informed consent for research: a
systematic review. JAMA 2004;292(13):1593-1601.
16. Woodsong C, Karim QA. A model designed to enhance informed
consent: experiences from the HIV prevention trials network. Am J
Public Health 2005;95(3):412-419.
17. Wolf L, Lo B. Using the law to protect confidentiality of sensitive
research data. IRB 1999;21:4-7.
18. National Bioethics Advisory Commission. Ethical and policy issues
in international research. Rockville, MD: National Bioethics Advisory
Commission, 2001.
19. Goffman E. Asylums: essays on the social situation of mental
patients and other inmates. Garden City, NY: Anchor Books, 1961.
20. Culliton B. Coping with fraud: the Darsee case. Science
1983;220:31-35.
21. Relman AS. Lessons from the Darsee affair. N Engl J Med
1983;308:1415-1417.
22. Engler RL, Covell JW, Friedman PJ, et al. Misrepresentation and
responsibility in medical research. N Engl J Med 1987;317
(22):1383-1389.
23. Kassirer JP, Angell M. The journal's policy on cost-effectiveness
analyses. N Engl J Med 1994;331(10):669-670.
24. Dingell JD. Shattuck lecture: misconduct in medical research. N
Engl J Med 1993;328:1610-1615.
25. Friedman PJ. Advice to individuals involved in misconduct
accusations. Acad Med 1996;71(7):716-723.
26. Mello MM, Brennan TA. Due process in investigations of research
misconduct. N Engl J Med 2003;349(13):1280-1286.
27. Rennie D, Flanagin A. Authorship! authorship! guests, ghosts,
grafters, and the two-sided coin. JAMA 1994;271:469-471.
28. Shapiro DW, Wenger NS, Shapiro MS. The contributions of
authors to multiauthored biomedical research papers. JAMA
1994;271:438-442.
29. Flanagin A, Carey LA, Fontanarosa PB, et al. Prevalence of
articles with honorary authors and ghost authors in peer-reviewed
medical journals. JAMA 1998;280:222-224.
30. Lundberg GD, Glass RM. What does authorship mean in a
peer-reviewed medical journal? JAMA 1996;276:75.
31. Rennie D, Yank V, Emanuel L. When authorship fails: a proposal
to make contributors accountable. JAMA 1997;278:579-585.
32. Browner WS. Publishing and presenting clinical research.
Baltimore, MD: Lippincott Williams & Wilkins, 1999.
33. Relman AS. Economic incentives in clinical investigation. N Engl J
Med 1989;320:933-934.
34. Bekelman JE, Li Y, Gross CP. Scope and impact of financial
conflicts of interest in biomedical research: a systematic review.
JAMA 2003;289(4):454-465.
35. Thompson DF. Understanding financial conflicts of interest. N
Engl J Med 1993;329:573-576.
36. Rennie D, Flanagin A. Thyroid storm. JAMA
1997;277:1238-1243.
37. DeAngelis CA. Conflict of interest and the public trust. JAMA
2000;284:2237-2238.
38. Hillman AL, Eisenberg JM, Pauly MV, et al. Avoiding bias in the
conduct and reporting of cost-effectiveness research sponsored by
pharmaceutical companies. N Engl J Med 1991;324(19):1362-1365.
39. Moore v. Regents of University of California, 51 Cal.3d 120; Cal
Rptr. 146, 793 P.2d 479 (1990).
40. Rennie D, Flanagin A. Conflicts of interest in the publication of
science. JAMA 1991;266:266-267.
41. Angell M, Kassirer JP. Editorials and conflicts of interest. N Engl J
Med 1996;335(14):1055-1056.
42. Healy B, Campeau L, Gray R, et al. Conflict-of-interest guidelines
for a multicenter clinical trial of treatment after coronary-artery
bypass-graft surgery. N Engl J Med 1989;320(14):949-951.
43. Topol EJ, Armstrong P, Van de Werf F, et al. Confronting the
issues of patient safety and investigator conflict of interest in an
international trial of myocardial reperfusion. J Am Coll Cardiol
1992;19:1123-1128.
44. Association of American Medical Colleges. Protecting subjects,
preserving trust, promoting progress: policy and guidelines for the
oversight of individual financial interests in human subjects research.
Washington, DC: Association of American Medical Colleges, 2001.
45. Freedman B. Equipoise and the ethics of clinical research. N Engl
J Med 1987;317:141-145.
46. Lilford RJ. Ethics of clinical trials from a bayesian and decision
analytic perspective: whose equipoise is it anyway? BMJ 2003;326
(7396):980-981.
47. Slutsky AS, Lavery JV. Data safety and monitoring boards. N Engl
J Med 2004;350(11):1143-1147.
48. National Bioethics Advisory Commission. Research on human
stored biologic materials. Rockville, MD: National Bioethics Advisory
Commission, 1999.
49. Kevles DJ. In the name of eugenics: genetics and the uses of
human heredity. New York: Knopf, 1985.
50. Dickert N, Grady C. What's the price of a research subject?
Approaches to payment for research participation. N Engl J Med
1999;341:198-203.
15
Designing Questionnaires and Interviews
Steven R. Cummings
Stephen B. Hulley
Much of the data in clinical research is gathered using questionnaires or interviews. For
many studies, the validity of the results depends on the quality of these instruments. In
this chapter we will describe the components of questionnaires and interviews and outline
procedures for developing them.
Designing Good Instruments
Open-Ended and Closed-Ended Questions
There are two basic types of questions, open-ended and closed-ended, which serve
somewhat different purposes. Open-ended questions are particularly useful when it is
important to hear what respondents have to say in their own words. For example:
What habits do you believe increase a person's chance of having a stroke?
__________________________________________________________
__________________________________________________________
Open-ended questions leave the respondent free to answer with fewer limits imposed by
the researcher. They allow participants to report more information than is possible with a
discrete list of answers, but the responses may be less complete. A major disadvantage is
that open-ended questions usually require qualitative methods or special systems (such as
coding dictionaries for symptoms and health conditions) to code and analyze the responses,
which takes more time than entering data from closed-ended responses, and may require
subjective judgments. Open-ended questions are often used in exploratory phases of
question design because they facilitate understanding a concept as respondents express it.
Phrases and words used by respondents can form the basis for more structured items in a
later phase.
Closed-ended questions are more common and form the basis for most standardized
measures. These questions ask respondents to choose from two or more preselected
answers:
Which of the following do you believe increases the chance of having a stroke?
(check all that apply)
[ ] Smoking
[ ] Being overweight
[ ] Stress
[ ] Drinking alcohol
Because closed-ended questions provide a list of possible alternatives from which the
respondent may choose, they are quicker and easier to answer and the answers are
easier to tabulate and analyze. In addition, the list of possible answers often helps clarify
the meaning of the question. Finally, closed-ended questions are well suited for use in
multi-item scales designed to produce a single score.
On the other hand, closed-ended questions have several disadvantages. They lead
respondents in certain directions and do not allow them to express their own, potentially
more accurate, answers. The set of answers may not be exhaustive, that is, it may not
include all possible options (the list above, for example, does not include sexual activity
or dietary salt). One solution is to include an option such as "Other (please specify)" or
"None of the above."
When a single response is desired, the respondent should be so instructed and the set of
possible responses should also be mutually exclusive (i.e., the categories should not
overlap) to ensure clarity and parsimony.
When the question allows more than one answer, instructing the respondent to mark "all
that apply" is not ideal. This does not force the respondent to consider each possible
response, and a missing item may represent either an answer that does not apply or an
overlooked item. It is better to ask respondents to mark each possible response as either
"yes" or "no," as in this example:
Which of the following do you believe increases the chance of having a stroke?
                     Yes    No    Don't know
Smoking              [ ]    [ ]      [ ]
Being overweight     [ ]    [ ]      [ ]
Stress               [ ]    [ ]      [ ]
Drinking alcohol     [ ]    [ ]      [ ]
The visual analog scale (VAS) is another option for recording answers to closed-ended
questions using lines or other drawings. The participant is asked to mark a line at a spot,
along the continuum from one extreme to the other, that best represents his or her
characteristic. It is important that the words that anchor each end describe the most
extreme values for the item of interest. Here is a VAS for pain severity:
Please use an X to mark the place on this line that best describes the severity of your pain
in general over the past week.
____________________________________________________________
None                                              Unbearable
For convenience of measurement, the lines are often 10 cm long and the score is the
distance, in centimeters, from the lowest extreme (Example 15.1).
Example 15.1: Illustrated Use of a Visual Analog Scale for Rating the Severity
of Pain
_________________X__________________________________________
None                                              Unbearable
This is a 10-cm line, and the mark is 3.0 cm from the end (30% of the distance from none
to unbearable), so the respondent's pain would be recorded as having a severity of 3.0, or
30%.
VASs are attractive because they rate characteristics on a continuous scale; they may be
more sensitive to change than ratings based on categorical lists of adjectives. An
alternative approach is to provide numbers to circle instead of a line. This may be easier to
score, but some participants may find it difficult to understand, so it is important to explain
and give examples of how to answer the question.
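The VAS scoring rule described in Example 15.1 (the distance of the mark from the lowest extreme, optionally expressed as a percentage of the line length) can be sketched in a few lines of Python. This is only an illustration: the function name `vas_score` and its arguments are hypothetical and are not part of any instrument described in the text.

```python
def vas_score(mark_cm, line_cm=10.0):
    """Score a visual analog scale response.

    mark_cm: distance of the respondent's mark from the lowest extreme, in cm.
    line_cm: total length of the printed line, in cm (10 cm is assumed here).
    Returns (score in cm, score as a percentage of the line).
    """
    if line_cm <= 0:
        raise ValueError("line length must be positive")
    if not 0 <= mark_cm <= line_cm:
        raise ValueError("the mark must fall on the line")
    return mark_cm, 100.0 * mark_cm / line_cm


# The mark in Example 15.1 is 3.0 cm from the "None" end of a 10-cm line.
score_cm, score_pct = vas_score(3.0)
print(score_cm, score_pct)
```

Rejecting marks that fall off the line, rather than silently clipping them, is a design choice made for this sketch; a real data-entry system might instead flag such responses for review.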
Formatting
On questionnaires, it is customary to describe the purpose of the study and how the data
will be used in a brief statement on the cover. Similar information is usually presented at
the beginning of an interview as part of obtaining consent. To ensure accurate and
standardized responses, all instruments must have instructions specifying how they should
be filled out. This is true not only in self-administered questionnaires, but also for the
forms that interviewers use to record responses.
Sometimes it is helpful to provide an example of how to complete a question, using a
simple question that is easily answered (Example 15.2).
Example 15.2: Instructions on How to Fill Out a Questionnaire that Assesses
Dietary Intake
These questions are about your usual eating habits during the past 12 months. Please mark
your usual serving size and write down how often you eat each food in the boxes next to
the type of food.
For example, if you drink a medium (6 oz) glass of apple juice about three times a week,
you would answer:
To improve the flow of the instrument, questions concerning major subject areas should be
grouped together and introduced by headings or short descriptive statements. To warm up
the respondent to the process of answering questions, it is helpful to begin with
emotionally neutral questions such as name and contact information. More sensitive
questions can then be placed in the middle, and questions about personal characteristics
such as income or sexual function are often placed at the end of the instrument. For each
question or set of questions, particularly if the format differs from that of other questions
on the instrument, instructions must indicate clearly how to respond.
If the instructions include different time frames, it is sometimes useful to repeat the time
frame at the top of each new set of questions. For example, questions such as
How often have you visited a doctor during the past year?
During the past year, how many times have you been a patient in an emergency
department?
How many times were you admitted to the hospital during the past year?
can be shortened and tidied as follows:
During the past year, how many times have you
visited a doctor?
been a patient in an emergency department?
been admitted to a hospital?
The visual design of the instruments should make it as easy as possible for respondents
to complete all questions in the correct sequence. If the format is too complex,
respondents or interviewers may skip questions, provide the wrong information, and even
refuse to complete the instruments.
A neat format with plenty of space is more attractive and easier to use than one that is
crowded or cluttered. Although investigators often assume that a questionnaire will appear
shorter by having fewer pages, the task is more difficult when more questions are crowded
onto a page. Response scales should be spaced widely enough so that it is easy to circle or
check the correct number without the mark accidentally including the answer above or
below. When an open-ended question is included, the space for responding should be big
enough to allow respondents with large handwriting to write comfortably in the space.
People with visual problems, including many elderly subjects, will appreciate large type
(e.g., font size 14) and high contrast (black on white).
Possible answers to closed-ended questions should be lined up vertically and preceded by
boxes or brackets to check, or by numbers to circle, rather than open blanks:
How many different medicines do you take every day? (Check one)
[ ] None
[ ] 1-2
[ ] 3-4
[ ] 5-6
[ ] 7 or more
(Note that these response options are exhaustive and mutually exclusive.)
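The two properties noted above, exhaustive and mutually exclusive, can be checked mechanically for numeric response categories. The sketch below is a hypothetical helper (`check_categories` is not from the text) that assumes each category is an inclusive integer range and that an upper bound of `None` stands for an open-ended category such as "7 or more."

```python
import math


def check_categories(bins, lowest=0):
    """Return a list of problems found in numeric response categories.

    bins: list of (low, high) inclusive integer ranges; high=None means
          an open-ended top category such as "7 or more".
    lowest: smallest value a respondent could report (assumed to be 0).
    An empty list means the categories are exhaustive and mutually exclusive.
    """
    problems = []
    expected = lowest  # smallest value not yet covered by any category
    for low, high in sorted(bins, key=lambda b: b[0]):
        if low < expected:
            problems.append(f"overlap at {low}")
        elif low > expected:
            problems.append(f"gap: {expected} to {low - 1} not covered")
        expected = math.inf if high is None else max(expected, high + 1)
    if expected != math.inf:
        problems.append(f"no category for values of {expected} or more")
    return problems


# The medicine-count options from the text: None, 1-2, 3-4, 5-6, 7 or more.
medicines = [(0, 0), (1, 2), (3, 4), (5, 6), (7, None)]
print(check_categories(medicines))

# A deliberately flawed set: 2 appears twice and 5 is missing.
flawed = [(0, 0), (1, 2), (2, 4), (6, None)]
print(check_categories(flawed))
```

A check like this applies only to ordered numeric categories; for lists of named options, exhaustiveness still has to be judged by pretesting and by offering an "Other (please specify)" option.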
Sometimes the investigator may wish to follow up certain answers with more detailed
questions. This is best accomplished by a branching question. Respondents' answers to
the initial question, often referred to as a "screener," determine whether they are
directed to answer additional questions or skip ahead to later questions. For example:
Have you ever been told that you have high blood pressure?
Branching questions save time and allow respondents to avoid irrelevant or redundant
questions. Directing the respondent to the next appropriate question is done by using
arrows to point from response to follow-up questions and including directions such as
"Go to question 11" (see Appendix 15.1).
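In a computer-administered instrument, the branching described above is usually expressed as skip rules rather than arrows. The following is a minimal sketch under stated assumptions: the question identifiers (`q8` through `q11`), the rule table, and the function `next_question` are all hypothetical, with `q8` standing in for the blood pressure screener and `q11` for the "Go to question 11" target.

```python
def next_question(current, answer, skip_rules, order):
    """Return the id of the next question to ask, or None at the end.

    skip_rules maps (question id, answer) to the question to jump to.
    Any answer without a rule falls through to the next question in order.
    """
    target = skip_rules.get((current, answer))
    if target is not None:
        return target
    position = order.index(current)
    if position + 1 < len(order):
        return order[position + 1]
    return None


# Hypothetical layout: q8 is the screener, q9 and q10 are follow-ups
# that only apply to respondents who answer "yes".
order = ["q8", "q9", "q10", "q11"]
skip_rules = {("q8", "no"): "q11"}

print(next_question("q8", "yes", skip_rules, order))
print(next_question("q8", "no", skip_rules, order))
```

Keeping the rules in a table separate from the question text makes it easier to review every branch before fielding the instrument.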
If questionnaires or data will be entered by scanning the forms, the format of the
questions and the page may be dictated by the requirements of the software used for
scanning. It is important to understand the requirements of the data entry program and
use the appropriate software to develop the forms (Chapter 16).
Wording
Every word in a question can influence the validity and reproducibility of the responses.
The objective should be to construct questions that are simple, are free of ambiguity, and
encourage accurate and honest responses without embarrassing or offending the
respondent.
Clarity. Questions must be as clear and specific as possible. In general, concrete
words are preferred over abstract words. For example, to measure the amount of
exercise respondents get, asking, "How much exercise do you usually get?" is
less clear than "During a typical week, how many hours do you spend in vigorous
walking?"
Simplicity. Questions should use simple, common words that convey the idea and
avoid technical terms and jargon. For most people, for example, it is clearer to ask
about "drugs you can buy without a doctor's prescription" than to ask about
"over-the-counter medications." Sentences should also be simple, using the
fewest words and simplest grammatical structure that convey the meaning.
Neutrality. Avoid "loaded" words and stereotypes that suggest that there is a
most desirable answer. Asking, "During the last month, how often did you drink too
much alcohol?" will discourage respondents from admitting that they drink a lot of
alcohol. "During the last month, how often did you drink more than five drinks in
one day?" is a more factual, less judgmental, and less ambiguous question.
Sometimes it is useful to set a tone that permits the respondent to admit to behaviors and
attitudes that may be considered undesirable. For example, when asking about a patient's
compliance with prescribed medications, an interviewer or a questionnaire may use an
introduction: "People sometimes forget to take medications their doctor prescribes.
Does that ever happen to you?" Wording of these introductions can be tricky. It is
important to give respondents permission to admit certain behaviors without encouraging
them to exaggerate.
Collecting information about potentially sensitive areas like sexual behavior or income is
especially difficult. Some people feel more comfortable answering these types of questions
in self-administered questionnaires than in interviews, but a skillful interviewer can
sometimes elicit open and honest answers. In personal interviews, it may be useful to put
potentially embarrassing responses on a card so that the respondent can answer by simply
pointing to a response.
Setting the Time Frame
Many questions are designed to measure the frequency of certain habitual or recurrent
behaviors, like drinking alcohol or taking medications. To measure the frequency of the
behavior it is essential to have the respondent describe it in terms of some unit of time. If
the behavior is usually the same day after day, such as taking one tablet of a diuretic
every morning, the question can be very simple: "How many tablets do you take a day?"
Many behaviors change from day to day, season to season, or year to year. To measure
these, the investigator must first decide what aspect of the behavior is most important to
the study: the average or the extremes. For example, a study of the effect of chronic
alcohol intake on the risk of cardiovascular disease may need a measurement of average
consumption during a period of time. On the other hand, a study of the role of alcohol in
the occurrence of falls may need to know how frequently the respondent drank enough
alcohol to become intoxicated.
Questi ons about average behavi or can be asked i n two ways: asking about usual or
typi cal behavi or or counting actual behavi ors during a peri od of ti me. For exampl e,
an investi gator may determi ne average i ntake of beer by aski ng respondents to estimate
thei r usual i ntake:
About how many beers do you have duri ng a typi cal week (one beer is equal to one
12-oz can or bottle, or one l arge gl ass)?
P.246
...
beers per week
This format is simple and brief. It assumes, however, that respondents can accurately
average their behavior into a single estimate. Because drinking patterns often change
markedly over even brief intervals, the respondent may have a difficult time deciding what
is a typical week. Faced with questions that ask about usual or typical behavior, people
often report the things they do most commonly and ignore the extremes. Asking about
drinking on typical days, for example, will underestimate alcohol consumption if the
respondent drinks unusually large amounts on weekends.
An alternative approach is to quantify exposure during a certain period of time.
"During the last 7 days, how many beers did you have (one beer is equal to one 12-oz
can or bottle, or one large glass)?"
_____ beers in the last 7 days
The goal is to ask about the shortest recent segment of time that accurately represents the
characteristic over the whole period of interest for the research question. The best length
of time depends on the characteristic. For example, patterns of sleep can vary considerably
from day to day, but questions about sleep habits during the past week may adequately
represent patterns of sleep during an entire year. On the other hand, the frequency of
unprotected sex may vary greatly from week to week, so questions about unprotected sex
should cover longer intervals.
Using diaries may be a more accurate approach to keep track of events, behaviors, or
symptoms that happen episodically (such as falls) or that vary from day to day (such as
pain following surgery or vaginal bleeding). This may be valuable when the timing or
duration of an event is important or the occurrence is easily forgotten. Participants can
enter these data into electronic devices, and the approach allows the investigator to
calculate an average daily score of the object or behavior being assessed. However, this
approach can be time consuming for participants and can lead to more missing data than
the more common retrospective questions. The use of diaries assumes that the time period
assessed was typical, and the self-awareness involved in using diaries can alter the
behavior being recorded.
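As a rough illustration of the distinction between averages and extremes, the sketch below summarizes a 7-day diary of daily beer counts. The diary values, the function name summarize_diary, and the threshold of "more than five drinks in one day" used to flag heavy days are all assumptions made for the example, not part of any standard instrument.

```python
from statistics import mean

# Hypothetical 7-day diary: beers recorded each day (None = entry was skipped).
diary = [0, 0, 1, 0, None, 6, 5]

def summarize_diary(entries, heavy_threshold=5):
    """Summarize a diary as an average, a count of extreme days, and missing days."""
    recorded = [e for e in entries if e is not None]
    return {
        # Average over the days that were actually recorded.
        "average_per_day": mean(recorded) if recorded else None,
        # Days with more than `heavy_threshold` drinks (the "extremes").
        "heavy_days": sum(1 for e in recorded if e > heavy_threshold),
        # Diaries tend to produce missing data, so count it explicitly.
        "missing_days": len(entries) - len(recorded),
    }

summary = summarize_diary(diary)
print(summary)
```

A study of chronic intake would mainly use the average, whereas a study of falls would be more interested in the count of heavy days; the same diary supports both.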
Avoid Pitfalls
Double-barreled questions. Each question should contain only one concept.
Questions that use the words "or" or "and" sometimes lead to unsatisfactory responses.
Consider this question designed to assess caffeine intake: "How many cups of
coffee or tea do you drink during a day?" Coffee contains much more caffeine than
tea and differs in other ways, so a response that combines the two beverages is not
as precise as it could be. When a question attempts to assess two things at one time,
it is better to break it into two separate questions: (1) "How many cups of coffee
do you drink during a typical day?" and (2) "How many cups of tea do you drink
during a typical day?"
Hidden assumptions. Sometimes questions make assumptions that may not apply to
all people who participate in the study. For example, a standard depression item asks
how often respondents have felt this way in the past week: "I felt that I could not
shake off the blues even with help from my family." This assumes that respondents
have families and ask for emotional support; for those who do not have a family or
who do not seek help from their family, it is difficult to answer the question.
The question and answer options don't match. It is important that the question
match the options for the answer, a task that seems simple but is often done
incorrectly. For example, the question, "Have you had pain in the last week?" is
sometimes matched with response options of "never," "seldom,"
"often," "very often," which is grammatically incorrect and can be
confusing to respondents. (The question should be changed to "How often have you
had pain in the last week?" or the answer should be changed to "yes" or
"no.") Another common problem occurs when questions about intensity are
given agree/disagree options. For example, a respondent may be given the statement
"I am sometimes depressed" and then asked to respond with "agree" or
"disagree." For those who are often depressed, it is unclear how to respond;
disagreeing with this statement could mean that the person is often depressed or
never depressed. In such a case, it is usually clearer to use a simple question about
how often the person feels depressed matched with options about frequency (never,
sometimes, often).
Scales and Scores to Measure Abstract Variables
It is difficult to quantitatively assess abstract concepts, such as quality of life, from single
questions. Therefore abstract characteristics are commonly measured by generating scores
from a series of questions that are organized into a scale.
Using multiple items to assess a concept may have other advantages over single questions
or several questions asked in different ways that cannot be combined. Compared with the
alternative approaches, multi-item scales can increase the range of possible responses
(e.g., a multi-item quality-of-life scale might generate scores that range from 1 to 100
whereas a single question rating quality of life might produce four or five responses from
"poor" to "excellent"). A disadvantage of multi-item scales is that they produce
results (quality of life = 46.2) that can be difficult to understand intuitively.
Likert scales are commonly used to quantify attitudes, behaviors, and domains of health-
related quality of life. These scales provide respondents with a list of statements or
questions and ask them to select a response that best represents the rank or degree of
their answer. Each response is assigned a number of points.
For each item, circle the one number that best represents your opinion:

                                   Strongly                                Strongly
                                   Agree      Agree   Neutral   Disagree   Disagree
a. Smoking in public places
   should be illegal.                 1         2        3         4          5
b. Advertisements for
   cigarettes should be banned.       1         2        3         4          5
c. Public funds should be
   spent for antismoking
   campaigns.                         1         2        3         4          5

An investigator can compute an overall score for a respondent's answers by simply
summing the score for each item, or averaging the points for all nonmissing items. For
example, a person who answered that he or she strongly agreed that smoking in public
places should be illegal (one point) and advertisements for cigarettes be banned (one
point) but disagreed that public funds should be spent for antismoking advertising (four
points) would have a total score of 6. Simply adding up or averaging
item scores assumes that all the items have the same weight and that each item is
measuring the same general characteristic.
The internal consistency of a scale can be tested statistically using measures such as
Cronbach's alpha (1) that assess the overall consistency of a scale. Cronbach's alpha is
calculated from the correlations between scores on individual items. Values of this measure
above 0.70 are usually acceptable, and 0.80 or more is excellent. Lower values for internal
consistency indicate that some of the individual items may be measuring different
characteristics.
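The scoring rules and the internal-consistency check can be sketched in a few lines. This is only an illustration: the function names are invented here, the five-respondent data set is hypothetical, and alpha is computed with the commonly used variance-based formula, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), which is one of several equivalent ways of expressing how strongly the items correlate.

```python
from statistics import variance

def likert_total(responses):
    """Sum the points for all items; assumes every item was answered."""
    return sum(responses)

def likert_mean(responses):
    """Average the points over nonmissing items (None marks a skipped item)."""
    answered = [r for r in responses if r is not None]
    return sum(answered) / len(answered) if answered else None

def cronbach_alpha(rows):
    """Variance-based Cronbach's alpha.

    rows: one list of item scores per respondent, same items in the same
    order, no missing values, at least two items and two respondents.
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# The respondent from the text: strongly agree, strongly agree, disagree.
example_total = likert_total([1, 1, 4])

# Hypothetical answers from five respondents to the three smoking items.
hypothetical = [[1, 1, 2], [2, 2, 3], [4, 5, 4], [5, 4, 5], [3, 3, 3]]
alpha = cronbach_alpha(hypothetical)
print(example_total, round(alpha, 2))
```

Averaging over nonmissing items keeps a respondent with one skipped item on the same 1-to-5 metric as everyone else, whereas a plain sum would make that respondent's score look artificially low.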
Creating New Questionnaires and Scales
Sometimes an investigator needs to measure a characteristic for which there is no standard
questionnaire or interview approach. When no adequate measure can be found of a concept
that is important to the research, it is necessary to create new questions or develop a new
scale. The task can range from the creation of a single new question about a minor variable
in one study ("How frequently do you cut your toenails?") to developing and testing a new
multi-item scale for measuring the primary outcome (sexual quality of life) for a major
study or line of investigation. At the simplest end of this spectrum, the investigator may
use good judgment and basic principles of writing good questions to develop an item that
should then be pretested to make sure it is clear and produces appropriate answers. At the
other extreme, developing a new instrument to measure an important concept may need a
systematic approach that can take years from initial draft to final product.
The latter process often begins by generating potential items for the instrument from
interviews with individuals and focus groups (small groups of people who are relevant to
the research question and who are invited to spend 1 or 2 hours discussing specific topics
pertaining to the study with a group leader). Once the instrument has been drafted, the
next step is to invite critical review by peers, mentors, and experts. The investigator then
proceeds with the iterative sequence of pretesting, revising, shortening, and validating that
is described in the next section. The development of the National Eye Institute Visual
Function Questionnaire illustrates this process (Example 15.3).
Example 15.3: Development of a New Multi-Item Instrument
The National Eye Institute Visual Function Questionnaire exemplifies the painstaking
development and testing of a multi-item instrument. Mangione and colleagues devoted
several years to creating and testing the scale because it was intended to serve as a
primary measurement of outcome of many studies of eye disease (2,3,4). They began by
interviewing patients with eye diseases about the ways that the conditions affected their
lives. Then they organized focus groups of patients with the diseases and analyzed
transcripts of these sessions to choose relevant questions and response options. They
produced and pretested a long questionnaire that was administered to hundreds of
participants in several studies. They used data from these studies to identify items that
made the largest contribution to variation in scores from person to person and to shorten
the questionnaire from 51 to 25 items.
Because the creation and validation of new multi-item instruments is time consuming, it
should generally only be undertaken for variables that are central to a study, and when
existing measures are inadequate or inappropriate for the people who will be included in
the study.
Steps in Assembling the Instruments for the Study
There are a number of steps in developing a set of instruments for a particular study.
Make a List of Variables
Before designing an interview or questionnaire instrument, the researcher should write a
detailed list of the information to be collected and concepts to be measured in the study. It
can be helpful to list the role of each item (e.g., predictors, outcomes, and potential
confounders) in answering the main research questions.
Collect Existing Measures
Assemble a file of questions or instruments that are available for measuring each variable.
When there are several alternative methods, it is useful to create an electronic file for each
variable to be measured and then to find and file copies of candidate questions or
instruments for each item. It is important to use the best possible instruments to measure
the main predictors and outcomes of a study, so most of the effort of collecting alternative
instruments should focus on these major variables.
There are several sources for instruments. A good place to start is to collect instruments
from other investigators who have conducted studies that included measurements of
interest. Many standard instruments have been compiled and reviewed in books, review
articles, and electronic sources accessed through NIH, CDC, or other Web sites. There are
collections of instruments on the Web that can be found by searching for key terms such as
"health outcomes questionnaires." Instruments can also be found by examining
published studies of similar topics and by calling or writing the authors.
Borrowing instruments from other studies has the advantage of saving development time
and allowing results to be compared with those of other studies. On the other hand,
existing instruments may not be entirely appropriate for the question or the population, or
they may be too long. It is ideal to use existing instruments without modification. However,
if some of the items are inappropriate (as may occur when a questionnaire developed for
one cultural group is applied to a different setting), it may be necessary to delete, change,
or add a few items.
If a good established instrument is too long, the investigator can contact those who
developed the instrument to see if they have shorter versions. Deleting items from
established scales risks changing the meaning of scores and endangering comparisons of
the findings with results from studies that used the intact scale. Shortening a scale can
also diminish its reproducibility or its sensitivity to detect changes. However, it is
sometimes acceptable to delete sections or subscales that are not essential to the
study while leaving other parts intact.
Compose a Draft
The first draft of the instrument should include more questions about the topic than will
eventually be included in the instrument. The first draft should be formatted just as a final
questionnaire would be.
Revise
The investigator should read the first draft carefully, attempting to answer each question
as if he were a respondent and trying to imagine all possible ways to
misinterpret questions. The goal is to identify words or phrases that might be confusing or
misunderstood by even a few respondents and to find abstract words or jargon that could
be translated into simpler, more concrete terms. Questions that are complex should be split
into two or more questions. Colleagues and experts in questionnaire design should be
asked to review the instrument, considering the content of the items as well as clarity.
Shorten the Set of Instruments for the Study
Studies usually collect more data than will be analyzed. Long interviews, questionnaires,
and examinations may tire respondents and thereby decrease the accuracy and
reproducibility of their responses. When the instrument is sent by mail, people are less
likely to respond to long questionnaires than to short questionnaires. It is important to
resist the temptation to include additional questions or measures "just in case" they
might produce interesting data. Questions that are not essential to answering the main
research question increase the amount of effort involved in obtaining, entering, cleaning,
and analyzing data. Time devoted to unnecessary or marginally valuable data can detract
from other efforts and decrease the overall quality and productivity of the study.
To decide if a concept is essential, it is useful to think ahead to analyzing and reporting the
results of the study. Sketching out the final tables will help to ensure that all needed
variables are included and to identify those that are less important. If there is any doubt
about whether an item or measure will be used in later analyses, it is usually best to leave
it out.
Pretest
Pretests should be done to clarify, refine, and time the instrument. For key measurements,
large pilot studies may be valuable to find out whether each question produces an adequate
range of responses and to test the validity and reproducibility of the instrument (Chapter
17).
Validate
Questionnaires and interviews can be assessed for validity (an aspect of accuracy) and for
reproducibility (precision) in the same fashion as any other type of measurement (Chapter
4). The process begins with choosing questions that have face validity, a subjective but
important judgment that the items assess the characteristics of interest, and continues
with efforts to establish content validity and construct validity. Whenever feasible, new
instruments can then be compared with established gold standard approaches to
measuring the condition of interest. Ultimately, the predictive validity of an instrument
can be assessed by correlating measurements with future outcomes.
If an instrument is intended to measure change, then its responsiveness can be tested by
applying it to patients before and after receiving treatments considered effective by other
measures. For example, a new instrument designed to measure quality of life in people with
impaired visual acuity might include questions that have face validity ("Are you able to
read a newspaper without glasses or contact lenses?"). Answers could be compared with
the responses to an existing validated instrument (Example 15.3) among patients with
severe cataracts and among those with normal eye examinations. The responsiveness of
the instrument to change could be tested by comparing responses of patients with cataracts
before and after curative surgery. The process of validating new instruments is time
consuming and expensive, and worthwhile only if existing instruments are inadequate for
the research question or population to be studied.
Administering the Instruments
Questionnaires versus Interviews
There are two basic approaches to collecting data about attitudes, behaviors, knowledge,
health, and personal history. Questionnaires are instruments that respondents administer
to themselves, and interviews are those that are administered verbally by an interviewer.
Each approach has advantages and disadvantages.
Questionnaires are generally a more efficient and uniform way to administer simple
questions, such as those about age or habits of tobacco use. Questionnaires are less
expensive than interviews because they do not require as much time from research staff,
and they are more standardizable. Interviews are usually better for collecting answers to
complicated questions that require explanation or guidance, and interviewers can make
sure that responses are complete. Interviews may be necessary when participants will have
variable ability to read and understand questions. However, interviews are more costly and
time consuming, and they have the disadvantage that the responses may be influenced by
the relationship between interviewer and respondent.
Both types of instruments can be standardized, but questionnaires have the advantage
because interviews are inevitably administered at least a little differently each time. Both
methods of collecting information are susceptible to errors caused by imperfect memory;
both are also affected by the respondent's tendency to give socially acceptable answers,
although not necessarily to the same degree.
Interviewing
The skill of the interviewer can have a substantial impact on the quality of the responses.
Standardizing the interview procedure from one interview to the next is the key to
maximizing reproducibility. The interview must be conducted with uniform wording of
questions and uniform nonverbal signals during the interview. Interviewers must be careful
to avoid introducing their own biases into the responses by changing the words or the tone
of their voice. This requires training and practice.
For the interviewer to comfortably read the questions verbatim, the interview should be
written in language that resembles common speech. Questions that sound unnatural or
stilted when they are said aloud will encourage interviewers to improvise their own, more
natural but less standardized way of asking the question.
Sometimes it is necessary to follow up on a respondent's answers to encourage him to give
an appropriate answer or to clarify the meaning of a response. This "probing" can
also be standardized by writing standard phrases in the margins or beneath the text of
each question. To a question about how many cups of coffee respondents drink on a typical
day, some respondents might respond, "I'm not sure; it's different from day to day."
The instrument could include the follow-up probe: "Do the best you can; tell me
approximately how many you drink on a typical day."
Interviews can be conducted in person or over the telephone. Computer-assisted
telephone interviewing (CATI) can reduce some of the costs associated with interviews
while retaining most of their advantages (4). The interviewer reads questions to the
respondent as they appear on the computer screen, and answers are entered directly into a
database as they are keyed in. This enables immediate checking of out-of-range values.
Interactive Voice Response (IVR) systems replace the interviewer with computer-
generated questions that collect subject responses by telephone keypad or voice
recognition (5).
In-person interviews, however, may be necessary if the study requires direct observation of
participants or physical examinations, or if potential participants do not have telephones
(e.g., the homeless). Some elderly and ill persons are best reached through in-person
interviews where they are living.
Methods of Administering Questionnaires
There are several methods of administering questionnaires. They can be given to subjects
in person or administered through the mail, by e-mail, or through a Web site. Distributing
questionnaires in person allows the researcher to explain the instructions before the
participant starts answering the questions. When the research requires the participant to
visit the research site for examinations, questionnaires can also be sent in advance of an
appointment and answers checked for completeness before the participant leaves.
E-mailed questionnaires have several advantages over those sent by US Mail, although
they can only be sent to participants who have access to and familiarity with the Internet.
Questionnaires sent by e-mail give respondents an easy way to provide data, without a
clinic visit, that can be entered directly into databases. Questionnaires on Web sites or
handheld devices can produce very clean data because answers can be automatically
checked for missing and out-of-range values, the errors pointed out to the respondent, and
the responses accepted only after the errors are corrected.
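The automatic checking described above can be sketched as a small validation routine. The field names, the allowed ranges, and the function check_responses are hypothetical and chosen only for illustration; the sketch also assumes that answers arrive as numbers rather than free text.

```python
def check_responses(responses, rules):
    """Return a list of error messages; an empty list means the form can be accepted.

    responses: dict mapping field name -> numeric answer (None or "" if unanswered)
    rules:     dict mapping field name -> (lowest allowed, highest allowed)
    """
    errors = []
    for field, (low, high) in rules.items():
        value = responses.get(field)
        if value is None or value == "":
            errors.append(f"{field}: answer is missing")
        elif not (low <= value <= high):
            errors.append(f"{field}: {value} is outside the allowed range {low}-{high}")
    return errors

# Hypothetical range rules for two questionnaire items.
rules = {"age_years": (18, 110), "cups_coffee_per_day": (0, 30)}

# First submission: a typing error in age and a skipped coffee question.
first_try = check_responses({"age_years": 530, "cups_coffee_per_day": None}, rules)
# After the respondent corrects the flagged items, the form passes.
second_try = check_responses({"age_years": 53, "cups_coffee_per_day": 2}, rules)
print(first_try, second_try)
```

In a Web or handheld form, the messages in first_try would be shown next to the offending items and the submission would be stored only once the returned list is empty.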
Summary
1. For many clinical studies, the quality of the results depends on the quality and
appropriateness of the questionnaires and interviews. Investigators should take the
time and care to make sure the instruments are as valid and reproducible as
possible before the study begins.
2. Open-ended questions allow subjects to answer without limitations imposed by the
investigator, and closed-ended questions are easier to answer and analyze. The
response options to a closed-ended question should be exhaustive and mutually
exclusive.
3. Questions should be clear, simple, neutral, and appropriate for the population that
will be studied. Investigators should examine potential questions from the viewpoint
of potential participants, looking for ambiguous terms and common pitfalls such as
double-barreled questions, hidden assumptions, and answer options that do
not match the question.
4. The instrument should be easy to read, and interview questions should be
comfortable to read out loud. The format should fit the method for electronic data
entry and be spacious and uncluttered, with instructions and arrows that direct the
respondent or interviewer.
5. To measure abstract variables such as attitudes or health status, questions can be
combined into multi-item scales to produce a total score. Such scores assume that
the questions measure a single characteristic and that the responses are internally
consistent.
6. An investigator should search out and use existing instruments that are known to
produce valid and reliable results. When it is necessary to modify existing measures or
devise a new one, the investigator should start by collecting existing measures to be
used as potential models and sources of ideas.
7. The whole set of instruments to be used in a study should be pretested and timed
before the study begins. For new instruments, small initial pretests can improve the
clarity of questions and instructions; later, larger pilot studies can test and refine the
new instrument's range, reproducibility, and validity.
8. Self-administered questionnaires are more economical than interviews, they are
more readily standardized, and the added privacy can enhance the validity of the
responses. Interviews, on the other hand, can ensure more complete responses and
enhance validity through improved understanding. Administration of instruments by
computer-assisted telephone interviewing, e-mail, and the Internet can enhance
the efficiency of a study.
Appendix
Appendix 15.1
An Example of a Questionnaire about Smoking
The following items are taken from a self-administered questionnaire used in our Study of
Osteoporotic Fractures. Note that the branching questions are followed by arrows that
direct the subject to the next appropriate question and that the format is uncluttered with
the responses consistently lined up on the left of each next area.
1. Have you smoked at least 100 cigarettes in your entire life?
...
7. Have you ever lived for at least a year in the same household with someone who smoked
cigarettes regularly?
References
1. Bland JM, Altman DG. Cronbach's alpha. BMJ 1997;314:572.
2. Mangione CM, Berry S, Spritzer K, et al. Identifying the content area for the 51-item
National Eye Institute Visual Function Questionnaire: results from focus groups with
visually impaired persons. Arch Ophthalmol 1998;116:227-233.
3. Mangione CM, Lee PP, Pitts J, et al. Psychometric properties of the National Eye
Institute Visual Function Questionnaire (NEI-VFQ). NEI-VFQ Field Test Investigators.
Arch Ophthalmol 1998;116:1496-1504.
4. Anie KA, Jones PW, Hilton SR, et al. A computer-assisted telephone interview
technique for assessment of asthma morbidity and drug use in adult asthma. J Clin
Epidemiol 1996;49:653-656.
5. Kobak KA, Greist JH, Jefferson JW, et al. Computer assessment of depression and
anxiety over the phone using interactive voice response. MD Comput 1999;16:64-68.
16
Data Management
Michael A. Kohn
We have seen that undertaking a clinical research project requires choosing a
study design, defining the population, and specifying the predictor and outcome
variables. Ultimately, all information about the subjects and variables will reside
in a computer database that will be used to store, update, and monitor the
data, as well as format the data for statistical analysis. Simple study databases
consisting of individual data tables can be maintained using spreadsheet or
statistical software. More complex databases containing multiple interrelated
data tables require database management software. Data management for a
clinical research study involves defining the data tables, developing the data
entry system, and querying the data for monitoring and analysis.
Data Tables
All computer databases consist of one or more data tables in which the
rows correspond to records or entities and the columns correspond to
fields or attributes. For example, the simplest study databases consist
of a single table in which each row corresponds to an individual study subject
and each column corresponds to a subject-specific attribute such as name, birth
date, sex, predictor or outcome status. Each row must have a column value or
combination of column values that distinguishes it from the other rows. We
recommend assigning a unique identification number ("subject ID") to
each study participant. Using a unique subject identifier that has no meaning
external to the study database simplifies the process of de-linking study
data from personal identifiers for purposes of maintaining subject confidentiality.
Figure 16.1 shows a simplified data table for a cohort study of the association
between neonatal jaundice and IQ score at age 5 (1). Each row in the table
corresponds to a study subject, and each column corresponds to an attribute of
that subject. The dichotomous predictor is "Jaundice," that is, whether the
subject had neonatal jaundice, and the continuous outcome is "ExIQScor," which is the
subject's IQ score at age 5.
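The recommendation to use a subject ID with no external meaning can be illustrated with a short sketch that splits identified records into a separately stored linking key and a de-identified study table. Everything here is an assumption made for the example: the helper names, the six-digit random ID format, the field name SubjectID, and the two made-up records. Only FName, DOB, Jaundice, and ExIQScor follow the field names used in the text.

```python
import secrets

def new_subject_id(existing, digits=6):
    """Draw a random numeric ID that carries no meaning and is not already in use."""
    while True:
        candidate = "".join(secrets.choice("0123456789") for _ in range(digits))
        if candidate not in existing:
            return candidate

def delink(people):
    """Separate personal identifiers from study data, joined only by the subject ID."""
    link_key = {}    # subject ID -> personal identifiers (kept apart, access restricted)
    study_rows = []  # subject ID + study variables only
    for person in people:
        sid = new_subject_id(link_key)
        link_key[sid] = {"FName": person["FName"], "DOB": person["DOB"]}
        study_rows.append({"SubjectID": sid,
                           "Jaundice": person["Jaundice"],
                           "ExIQScor": person["ExIQScor"]})
    return link_key, study_rows

# Two made-up records, for illustration only.
people = [
    {"FName": "Subject A", "DOB": "2000-01-05", "Jaundice": True, "ExIQScor": 104},
    {"FName": "Subject B", "DOB": "2000-02-10", "Jaundice": False, "ExIQScor": 99},
]
link_key, study_rows = delink(people)
print(study_rows)
```

Because the ID is random rather than derived from a name, birth date, or medical record number, the study table can be shared for analysis without the linking key.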
If the study data are limited to a single table such as the table in Figure 16.1,
they are easily accommodated in a spreadsheet or statistical package.¹ We often
refer to a database consisting of a single, two-dimensional table as a "flat-file."²
Many statistical packages have added features to accommodate more
than one table, but at their core, most remain single-table or flat-file databases.
The need to include more than one table in a study database (and move from
spreadsheet or statistical software to data management software) often arises
first when measurements are repeated on individual subjects. If the same study
variable is measured on multiple occasions, then a separate table should be used
to store the repeated measurements. The rows in this separate table
correspond to individual examinations and include the examination date, the
results of the exam, and most importantly, the subject identification number. In
this relational database, the relationship between the table of subjects
and the table of examinations is termed one-to-many.³
To improve precision and enable assessment of the interrater reliability in our
infant jaundice study, the subjects receive the IQ exam multiple times from
different examiners. This requires a second table of examinations in which each
row corresponds to a discrete examination, and the columns represent
examination date, examination results, and most importantly, the subject
identification number (to link back to the subject table) (Fig. 16.2). In this two-
table database structure, querying the examination table for all exams
performed within a particular time period
requires searching a single exam date column. A change to a subject-specific
field like birth date is made in one place, and consistency is preserved. Fields
holding personal identifiers such as name and birth date appear only in the
subject table with the other table(s) containing only the subject ID. The
database can still accommodate subjects (such as Alejandro, Ryan, Zachary, and
Jackson) who have no exams.
FIGURE 16.1. Simplified data table for a cohort study of the association
between neonatal jaundice and IQ score at age 5. The dichotomous
predictor is "Jaundice," that is, whether the subject had neonatal
jaundice, and the continuous outcome is "ExIQScor," which is the
subject's IQ score at age 5.
By structuring the database this way, instead of as a very wide and complex
single table, we have eliminated redundant storage and the opportunity for
inconsistencies. Relational database software will maintain referential
integrity, meaning that it will not allow creation of an exam record for a subject
who does not already exist in the subject table. Similarly, a subject may not be
deleted unless and until all that subject's examinations have also been deleted.
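
As a minimal sketch of the two-table structure and the referential integrity
described above, the following uses Python's built-in sqlite3 module. The ExamID
and ExDate column names, the subject ID 2101, and all data values are
illustrative assumptions rather than details of the study database. SQLite
enforces foreign keys only after the PRAGMA shown; other database software may
enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.executescript("""
CREATE TABLE Subject (
    SubjectID INTEGER PRIMARY KEY,
    FName     TEXT,
    DOB       TEXT,      -- ISO date string
    Jaundice  INTEGER    -- 1 = yes, 0 = no
);
CREATE TABLE Exam (
    ExamID    INTEGER PRIMARY KEY,                                -- assumed column
    SubjectID INTEGER NOT NULL REFERENCES Subject(SubjectID),    -- link to the "one" side
    ExDate    TEXT,                                               -- assumed column
    ExIQScor  INTEGER
);
""")

# One subject with two exams: the "many" side of the one-to-many relationship.
conn.execute("INSERT INTO Subject VALUES (2101, 'Robert', '2005-01-06', 0)")
conn.execute("INSERT INTO Exam (SubjectID, ExDate, ExIQScor) VALUES (2101, '2010-01-29', 104)")
conn.execute("INSERT INTO Exam (SubjectID, ExDate, ExIQScor) VALUES (2101, '2010-02-12', 99)")


def try_sql(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# An exam for a subject who is not in the Subject table should be refused.
orphan_exam = try_sql(
    "INSERT INTO Exam (SubjectID, ExDate, ExIQScor) VALUES (9999, '2010-03-01', 100)"
)
# Deleting a subject who still has exams should also be refused.
delete_parent = try_sql("DELETE FROM Subject WHERE SubjectID = 2101")

print(orphan_exam, delete_parent)
```

Both statements are refused because the Exam table declares SubjectID as a
reference to the Subject table; without that declaration the database would
accept orphaned exam records.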
Data Dictionaries, Data Types, and Domains
So far we have seen tables only in the spreadsheet view. Each column or field
has a name and, implicitly, a data type and a definition. In the Subject
table of Figure 16.2, FName is a text field that contains the subject's
first name; DOB is a date field that contains the subject's birth date, and
Jaundice is a yes/no field that indicates whether the study subject had
neonatal jaundice. In the Exam table, ExWght is a real-number
weight in kilograms and ExIQScor is an integer IQ score. The data
dictionary makes these column definitions explicit. Figure 16.3 shows the
subject and exam tables in table design (or data dictionary) view. Note
that the data dictionary is itself a table with rows representing fields and
columns for field name, field type, and field description. Since the data
dictionary is a table of information about the database itself, it is referred to as
metadata.⁴

FIGURE 16.2. The two-table infant jaundice study database has a table of
study subjects in which each row corresponds to a single study subject and
a table of examinations in which each row corresponds to a particular
examination. Since a subject can have multiple examinations, the
relationship between the two tables is one-to-many. The SubjectID field in
the exam table links the exam-specific data to the subject-specific data.

Each field also has a domain or range of allowed values. For example, the
allowed values for the Sex field are M and F.⁵ The
software will not allow entry of any other value in this field. Similarly, the
ExIQScor field allows only integers between 40 and 200. Creating
validation rules to define allowed values affords some protection against data
entry errors. Some of the field types come with automatic validation rules. For
example, the database management software will always reject a date of April
31.
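
The validation rules above can be sketched in a few lines of Python. This is
only an illustration of the idea, not how any particular database package
implements it; the function name validate_record and the wording of the
messages are assumptions. The date check relies on the standard datetime
module, which refuses impossible dates such as April 31.

```python
from datetime import date

ALLOWED_SEX = {"M", "F"}        # domain of the Sex field
IQ_RANGE = range(40, 201)       # integers 40 through 200 inclusive


def validate_record(sex, iq_score, year, month, day):
    """Return a list of validation problems; an empty list means the record passes."""
    problems = []
    if sex not in ALLOWED_SEX:
        problems.append(f"Sex must be M or F, got {sex!r}")
    if not isinstance(iq_score, int) or iq_score not in IQ_RANGE:
        problems.append(f"ExIQScor must be an integer from 40 to 200, got {iq_score!r}")
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        problems.append(f"Invalid date: {year}-{month:02d}-{day:02d}")
    return problems


print(validate_record("M", 112, 2007, 4, 30))   # passes all three checks
print(validate_record("X", 250, 2007, 4, 31))   # fails all three checks
```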
Variable Names
Most spreadsheet, statistical, and database management programs allow long
column headings or variable names. Philosophies and naming conventions
abound. We recommend variable names that are short enough to type quickly,
but long enough to be descriptive. Although they are often allowed by the
software, we recommend avoiding spaces and special characters in variable
names. It is generally better to use a variable name that describes the field
rather than its location
on the data collection form (e.g., EverSmokedCigarettes or
eversmo, instead of question1).

FIGURE 16.3. The table of study subjects (Subject) and the table
of measurements (Exam) in data dictionary view. Each
variable or field has a name, a data type, a description, and a domain or set
of allowed values.

Data Entry
Whether the study database consists of one or many tables and whether it
uses spreadsheet, statistical, or database management software, a mechanism
for populating the data tables is required.
Keyboard transcription. Historically, the common method for populating a
study database has been to collect data on paper forms⁶ and then
transcribe the data by keyboard into the computer tables. The investigator
or other members of the research team may fill out the paper form, or in
some cases, the subject himself fills it out. Transcription can occur from the
paper forms directly into the data tables (e.g., the response to question 3
on subject 10 goes into the cell at row 10, column 3) or through on-screen
forms designed to make data entry easier and include automatic data
validation checks. Transcription should occur as soon as possible after
the data collection, so that the subject and interviewer or data collector are
still available if responses are found to be missing or out of range. Also, as
discussed below, monitoring for data problems (e.g., outlier values) and
preliminary analyses can only occur once the data are in the computer
database.
If transcribing from paper forms, the investigator should consider double
data entry to ensure the fidelity of the transcription. The database
program compares the two values entered for each variable and presents a
list of values that do not match. Discrepant entries are then checked on the
original forms and corrected. Double data entry identifies data entry errors
at the cost of doubling the time required for data entry. An alternative is to
recheck or reenter a random proportion of the data. If the error rate is
acceptably low, additional data editing is unlikely to be worth the effort and
cost.
Machine-readable forms. Another alternative is to scan the data into the
tables using optical mark recognition (OMR) and optical character
recognition (OCR) software. Machine-readable forms can be created
using this special software. When scanned or faxed, the handwritten
information on these forms is read into the database. The obvious
advantage is that keyboard data entry is not required. However, these
systems are more difficult and costly to design and support. Because text is
difficult to read accurately, machine-readable forms must be completed
carefully, typically requiring that bubbles be filled in for categorical
variables and that text be written clearly inside specific spaces. For this
reason, machine-readable forms are often completed by trained study staff,
rather than by study participants.
Most OMR/OCR programs provide a verification step. After the data forms
are scanned or faxed, an image of the completed form and the data as they
will appear in the database are presented on a computer screen. The person
who is verifying the data must approve each coded value before it becomes
part of the database. Systems for collecting data from machine-readable
forms
generally handle some types of data so well (bubbles, check boxes) that
verification is not required, but accuracy should be tested before the study
begins and retested during data collection. Machine-readable systems may
be less accurate for other types of data, such as text, and it may be
necessary to devote substantial time toward verification.
As with keyboard transcription, scanning of machine-readable forms should
occur as soon as possible after data collection.
Distributed data entry. If data collection occurs at multiple locations, the
paper forms can be mailed or faxed to a central location for transcription or
scanning into the computer database. Alternatively, the computer data
entry can occur at each of the locations. In such distributed data entry,
networked computers and the Internet allow direct entry into the central
study database, often maintained on a server at the principal investigator's
institution. Alternatively, data are stored on a local computer at the data
collection site and batch transmitted by diskette, CD, tape, e-mail, or File
Transfer Protocol (FTP). Government regulations require that electronic
health information be either de-identified or transmitted securely (e.g.,
encrypted and password-protected).
Electronic data capture. Increasingly, research studies collect data using
on-screen forms⁷ or web pages, instead of paper data collection forms.
This has many advantages:
• The data are keyed directly into the data tables without a second
  transcription step, which can be a source of error.
• The computer form can include validation checks and provide
  immediate feedback when an entered value is out of range.
• The computer form can also incorporate skip logic so that, for
  example, a question about packs per day appears only if the subject
  answered yes to a question about cigarette smoking.
• The form can still be filled out by a member of the study team or by
  the subject himself. When filling out a computer questionnaire,
  subjects may be more forthcoming about sensitive topics such as
  sexual behaviors and illicit drug use than during an in-person
  interview.
• The form may be viewed and data entered on portable, wireless
  devices such as handheld tablet computers and personal digital
  assistants.
When using on-screen forms for electronic data capture, it sometimes
makes sense to print out a paper record of the data immediately after
collection. This is analogous to printing out a receipt after a transaction at
the automated teller machine. The printout is a paper snapshot of
the record immediately after data collection and may be used as the
original or source document if a paper version is required.
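
The double data entry comparison described under keyboard transcription above
can be sketched as follows. This is a hedged illustration rather than the
behavior of any specific database program; the function name
find_discrepancies, the subject IDs, and the entered values are all
hypothetical.

```python
def find_discrepancies(first_pass, second_pass):
    """Compare two independent keyings of the same forms.

    Each argument maps (subject_id, field_name) to the value entered.
    Returns a list of (key, first_value, second_value) for every mismatch,
    including values present in one pass but missing from the other.
    """
    mismatches = []
    for key in sorted(set(first_pass) | set(second_pass)):
        v1 = first_pass.get(key)
        v2 = second_pass.get(key)
        if v1 != v2:
            mismatches.append((key, v1, v2))
    return mismatches


# Hypothetical entries: the second keyer transposed the digits of one IQ score.
first_entry = {
    (2101, "ExWght"): 17.5,
    (2101, "ExIQScor"): 104,
    (2102, "ExIQScor"): 121,
}
second_entry = {
    (2101, "ExWght"): 17.5,
    (2101, "ExIQScor"): 140,
    (2102, "ExIQScor"): 121,
}

to_check = find_discrepancies(first_entry, second_entry)
print(to_check)
```

Each item in the resulting list identifies a value to be checked against the
original paper form; values that match in both passes are accepted without
review.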
Coded Responses versus Free Text
As mentioned above, defining a variable or field in a data table includes
specifying its range of allowed values. For subsequent analysis, it is always
preferable to limit responses to a range of coded values rather than allowing
free-text responses. This is the same as the distinction made in Chapter 15
between closed-ended and open-ended questions. If the range of
possible responses is unclear, initial data
collection during the pretesting of the study can allow free-text responses that
will subsequently be used to develop coded response options.
A set of response options to a question should be exhaustive (all possible
options are provided) and mutually exclusive (no two options can both be
correct); response options can always be made collectively exhaustive by adding
an "other" response. On-screen data collection forms and web pages
provide three possible formats for displaying the mutually exclusive and
collectively exhaustive response options: drop-down list, pick list (field list), or
option group (Fig. 16.4). These formats will be familiar to any research subject
or data entry person who has worked with a computer form or web page. The
drop-down list saves screen space but will not work if the screen form will be
printed to paper for data collection. Both the pick list (which is just a drop-down
list that is permanently dropped down) and the option group require more screen
space, but provide a complete record when printed.
Questions with a set of mutually exclusive responses correspond to a single field
in the data table. "All that apply" questions are not mutually exclusive,
corresponding to as many yes/no fields as there are possible responses. By
convention, response options for "all that apply" questions use square
check boxes rather than the round radio buttons used for option groups with
mutually exclusive responses (Fig. 16.5). It is good practice to be consistent
when coding yes/no (dichotomous) variables. In particular, 0 should always
represent no or absent, and 1 should always represent yes or present. With this
coding, the average value of the variable is interpretable as the proportion with
the attribute.
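
A short sketch may make the 0/1 coding concrete. It expands an "all that
apply" question into one yes/no field per response option and then shows that
the mean of a 0/1 field equals the proportion with the attribute. The option
names and the four responses are hypothetical.

```python
# Hypothetical response options for one "all that apply" question.
OPTIONS = ["Aspirin", "Ibuprofen", "Acetaminophen"]


def to_yes_no_fields(checked):
    """Turn the set of checked boxes into one 0/1 field per option."""
    return {option: int(option in checked) for option in OPTIONS}


# Boxes checked by four hypothetical subjects (the third checked none).
responses = [{"Aspirin"}, {"Aspirin", "Ibuprofen"}, set(), {"Ibuprofen"}]
records = [to_yes_no_fields(checked) for checked in responses]

# With 1 = yes and 0 = no, the mean of the field is the proportion answering yes.
proportion_aspirin = sum(r["Aspirin"] for r in records) / len(records)
print(records[1])
print(proportion_aspirin)
```

Here two of the four subjects checked the first option, so the mean of that
field is 0.5. Had "yes" been coded as 1 for some fields and 2 for others, the
mean would have no such direct interpretation.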
Importing Measurements and Laboratory Results
Much study information, such as the baseline demographic information in the
hospital registration system, the laboratory results in the laboratory's computer
system, and the measurements made by dual energy x-ray absorptiometry
scanners and Holter monitors, is already in digital electronic format. Where
possible, these data should be incorporated directly in the study database to
avoid the labor and potential transcription errors involved in reentering the data.
For example, in the study of infant jaundice, the subject demographic data and
contact information are obtained from the hospital database. Computer systems
can often communicate with each other directly using a variety of protocols (web
services, ODBC, XML, etc.), and if this is not possible, they can produce
text-delimited or fixed-column-width character files that the database software can
import.
Back-end versus Front-end Software
Now that we have discussed data tables and data entry, we can make the
distinction between the study database's back end and front end. The back end
consists of the data tables themselves. The front end or interface
consists of the on-screen forms or web pages used for entering, viewing, and
editing the data. Table 16.1 lists some software programs used in data
management for clinical research.
Simple study databases consisting of a single data table can use spreadsheet or
statistical software for the back-end data table and the study personnel can
enter data directly into the data table's cells, obviating the need for front-end
data collection forms. More complex study databases consisting of multiple data
tables require relational database software to maintain the back-end data
tables. If the data are collected first on paper forms, entering the data will
require scanning using OMR/OCR software or transcription through front-end
forms or web pages. As discussed under Electronic Data Capture above,
data may also be entered directly
into the front-end forms or web pages (with the option of printing out a paper
snapshot of the record immediately after it is collected).

FIGURE 16.4. Formats for entering from a mutually exclusive, collectively
exhaustive list of responses. The drop-down list (A) saves screen space but
will not work if the screen form will be printed to paper for data collection.
Both the pick list (which is just a drop-down list that is permanently
dropped down) (B) and the option group (C) require more screen space, but
will work if printed.

FIGURE 16.5. By convention, response options for "all that apply"
questions use square check boxes. "All that apply" questions
correspond to as many fields as there are possible responses.

Table 16.1 Some Software Programs Used in Research Data Management
Spreadsheet
    Excel
    Open Office Calc*
Statistical Analysis
    Statistical Analysis System (SAS)
    Statistical Package for the Social Sciences (SPSS)
    Stata
    R*
Form Scanning (Optical Mark Recognition) Software
    TeleForm
Integrated Desktop Database Systems (Back End and Front End)
    Access
    Filemaker Pro
    Open Office Base*
Enterprise Relational Database Systems (Back End)
    Oracle
    DB2
    SQL Server
    MySQL*
Interface Builders (Front End)
    Adobe Acrobat
    Front Page
    Dream Weaver
    Visual Studio
    JBuilder (Java)
    Eclipse (Java)
Integrated Web-Based Applications for Research Data Management
    Oracle Clinical
    Phase Forward Clintrial/InForm
    QuesGen
    Velos eResearch
    StudyTRAX
    Labmatrix
*Open source software.

Some of the statistical packages, such as SAS, have developed data entry
modules. Integrated desktop database programs, such as Access and
Filemaker Pro, also provide extensive tools for the development of data forms.
The so-called enterprise database programs, such as Oracle, SQL
Server, and MySQL, are generally used for the back end, with the front end
developed using separate software. Frequently, these enterprise databases are
used in conjunction with a web browser-based front end.
Integrated applications, dedicated to research data management, with
web-based front ends and enterprise back ends are beginning to appear (Table 16.1).
At the time of publication, none of these was established in the research
community, but this is a domain in which rapid progress is likely to take place.
Extracting Data (Queries)
Once the database has been created and data entered, the investigator will
want to organize, sort, filter, and view (query) the data. Queries are
used for monitoring data entry, reporting study progress, and ultimately
analyzing the results. The standard language for manipulating data in a
relational database is called Structured Query Language (SQL) (pronounced
sequel).⁸ All relational database software systems use one or another
variant of SQL, but most provide a graphical interface for building queries that
makes it unnecessary for the clinical researcher to learn SQL.
A query can join data from two or more tables, display only selected fields, and
filter for records that meet certain criteria. Queries can also calculate values
based on raw data fields from the tables. Figure 16.6 shows the results of a
query on our infant jaundice database that filters for boys examined in January
or February and
calculates age in months (from birth date and date of exam) as well as body
mass index (BMI, from weight and height). Note that the result of a query that
joins two tables, displays only certain fields, selects rows based on special
criteria, and calculates certain values, still looks like a table in spreadsheet
view. One of the tenets of the relational database model is that operations on
tables produce table-like results. The data in Figure 16.6 are easily exported to a
statistical analysis package. (Note that no personal identifiers are included in the
query.)
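
A query of this kind might look like the following sketch, which runs SQL
through Python's built-in sqlite3 module. The ExDate and ExHght column names,
the subject IDs, and all data values are assumptions made for illustration, and
the date functions shown are specific to SQLite; other database systems use
different syntax. Age is approximated as the difference in calendar months,
ignoring the day of the month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Subject (SubjectID INTEGER PRIMARY KEY, Sex TEXT, DOB TEXT);
CREATE TABLE Exam (
    SubjectID INTEGER REFERENCES Subject(SubjectID),
    ExDate    TEXT,   -- assumed column name
    ExWght    REAL,   -- kilograms
    ExHght    REAL    -- centimeters (assumed column name)
);
""")
conn.executemany("INSERT INTO Subject VALUES (?, ?, ?)", [
    (1, "M", "2005-01-15"),
    (2, "F", "2005-02-01"),
    (3, "M", "2004-11-05"),
])
conn.executemany("INSERT INTO Exam VALUES (?, ?, ?, ?)", [
    (1, "2010-01-20", 18.0, 110.0),
    (1, "2010-06-10", 19.0, 112.0),   # excluded: examined in June
    (2, "2010-02-05", 17.0, 108.0),   # excluded: not a boy
    (3, "2010-02-10", 20.0, 100.0),
])

QUERY = """
SELECT s.SubjectID,
       e.ExDate,
       (CAST(strftime('%Y', e.ExDate) AS INTEGER) - CAST(strftime('%Y', s.DOB) AS INTEGER)) * 12
         + (CAST(strftime('%m', e.ExDate) AS INTEGER) - CAST(strftime('%m', s.DOB) AS INTEGER))
         AS AgeMonths,
       ROUND(e.ExWght / ((e.ExHght / 100.0) * (e.ExHght / 100.0)), 1) AS BMI
FROM Subject AS s
JOIN Exam AS e ON e.SubjectID = s.SubjectID
WHERE s.Sex = 'M'
  AND strftime('%m', e.ExDate) IN ('01', '02')
ORDER BY s.SubjectID, e.ExDate
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The result is itself a table: one row per qualifying exam, with the two
calculated columns alongside the selected fields, and with no names or birth
dates in the output.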
Identifying and Correcting Errors in the Data
The first step toward avoiding errors in the data is testing the data
collection and management system as part of the overall pretesting for the
study. The entire system (data tables, data entry forms, and queries) should be
tested using dummy data.
We have discussed ways to enhance the fidelity of keyboard transcription,
scanning of machine-readable forms, or electronic data capture once data
collection begins. Values that are outside the permissible range should not get
past the data entry process. However, the database should also be queried for
missing values and outliers (extreme values that are nevertheless within the
range of allowed values). For example, a weight of 30 kg might be within the
range of allowed values for a 5-year-old, but if it is 5 kg greater than any other
weight in the dataset, it bears investigation. Many data entry systems are
incapable of doing cross-field validation, which means that the data tables may
contain field values that are within the allowed ranges but inconsistent with one
another. For example, it would not make sense for a 30-kg 5-year-old to have a
height of 100 cm. While the weight and height values are within range, the
weight (extremely high for a 5-year-old) is inconsistent with the height
(extremely low for a 5-year-old). A 5-year-old simply cannot have a BMI of 30
kg/m². Such an inconsistency is easily identified using a query like the one
depicted in Figure 16.6.

FIGURE 16.6. The results of a query on the infant jaundice database
filtering for boys examined in January or February and calculating age in
months (from birth date and date of exam) as well as body mass index
(BMI, from weight and height).
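
The outlier and cross-field checks described above can be sketched as a small
Python routine. The ExHght field name, the subject IDs, the measurements, and
in particular the BMI limit of 25 are illustrative assumptions, not clinical
thresholds; the 5-kg gap follows the example in the text.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def flag_exams(exams, bmi_limit=25.0, weight_gap_kg=5.0):
    """Return (SubjectID, reason) pairs for values that deserve a second look.

    bmi_limit is an arbitrary illustrative cutoff, not a clinical standard.
    """
    flags = []
    for exam in exams:
        other_weights = [e["ExWght"] for e in exams if e is not exam]
        # Outlier: at least weight_gap_kg heavier than every other subject.
        if other_weights and exam["ExWght"] - max(other_weights) >= weight_gap_kg:
            flags.append((exam["SubjectID"], "weight outlier"))
        # Cross-field check: weight and height each in range but implausible together.
        if bmi(exam["ExWght"], exam["ExHght"]) > bmi_limit:
            flags.append((exam["SubjectID"], "weight/height inconsistent"))
    return flags


# Hypothetical exams of 5-year-olds; the last is the 30-kg, 100-cm child from the text.
exams = [
    {"SubjectID": 2101, "ExWght": 17.5, "ExHght": 108.0},
    {"SubjectID": 2102, "ExWght": 18.0, "ExHght": 110.0},
    {"SubjectID": 2103, "ExWght": 20.0, "ExHght": 112.0},
    {"SubjectID": 2104, "ExWght": 30.0, "ExHght": 100.0},
]

flags = flag_exams(exams)
print(flags)
```

Flagged values are not errors by definition; they are candidates for checking
against the source documents.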
Missing values, outliers, inconsistencies, and other data problems are identified
using queries and communicated to the study staff, who can respond to them by
checking original source documents, interviewing the participant, or repeating
the measurement. If the study relies on paper source documents, any resulting
changes to the data should be highlighted (e.g., in red ink), dated, and signed.
As discussed below, electronic databases should maintain an audit log of all data
changes.
If data are collected by several investigators from different clinics or locations,
means and medians should be compared across investigators and sites.
Substantial differences by investigator or site can indicate systematic differences
in measurement or data collection.
Data editing and cleaning should give higher priority to more important
variables. For example, in a randomized trial, the most important variable is the
outcome, and
no errors should be tolerated. In contrast, errors in other variables, such as the
date of a visit, may not substantially affect the results of analyses. Data editing
is an iterative process. After errors are identified and corrected, editing
procedures should be repeated until very few important errors are identified. At
this point, the edited database is declared final or frozen, so that no
further changes are permitted even if errors are discovered.
Analysis of the Data
Analyzing the data often requires creating new, derived variables based on
the raw field values in the frozen dataset. For example, continuous
variables may be dichotomized (blood pressure above a cut point defined as
hypertension), new categories created (specific drugs grouped as antibiotics),
and calculations made (years of smoking × number of packs of cigarettes per day
= pack years). It is desirable to decide how missing data will be handled.
"Don't know" is often recoded as a special category, combined with
"no," or excluded as missing. If the study uses database software, queries
can be used to derive the new variables prior to export to a statistical analysis
package. Alternatively, derivation of the new fields can occur in the statistical
package itself.
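
The derivations just listed can be sketched in Python as follows. The cut
point of 140 mm Hg, the drug list, the code 9 for a separate "don't know"
category, and the function names are assumptions chosen for illustration; a
real study would take these definitions from its analysis plan.

```python
def is_hypertensive(systolic_bp, cut_point=140):
    """Dichotomize a continuous variable: 1 if above the cut point, else 0."""
    return int(systolic_bp > cut_point)


# Grouping specific drugs into a new category (illustrative list only).
ANTIBIOTICS = {"amoxicillin", "cephalexin"}


def drug_category(drug_name):
    return "antibiotic" if drug_name.lower() in ANTIBIOTICS else "other"


def pack_years(years_smoked, packs_per_day):
    """Years of smoking x packs of cigarettes per day."""
    return years_smoked * packs_per_day


def recode_dont_know(answer, policy):
    """Recode a yes / no / don't know response under one of three policies.

    policy is 'separate' (own category, coded 9 here), 'as_no', or 'missing'.
    """
    if answer == "yes":
        return 1
    if answer == "no":
        return 0
    if policy == "as_no":
        return 0
    if policy == "missing":
        return None
    return 9  # separate category


print(is_hypertensive(150), drug_category("Amoxicillin"), pack_years(20, 1.5))
print([recode_dont_know("don't know", p) for p in ("separate", "as_no", "missing")])
```

Keeping these definitions in one place, whether as queries or as functions
like these, makes it easier to apply them identically in every analysis.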
When multiple manuscripts are written based on the same database, it is
desirable to use the same definitions of variables and handle missing data in the
same way for each analysis. For example, it may be disconcerting for readers if
the number of diabetic participants in the study varies. This could easily happen
if diabetes is defined as self-reported diabetes in one analysis and reported use
of hypoglycemic medications in another.
Confidentiality and Security
As mentioned above, to protect research subject confidentiality, the
database should assign a unique subject identifier (subject ID) that has no
meaning external to the study database. In other words, the subject ID should
not incorporate the subject's name, initials, birth date, or medical record
number. Any database fields that contain personal identifiers should be deleted
when the study is complete⁹ and prior to sharing the data. If the database uses
multiple tables, the personal identifiers can be kept in a separate table. Study
databases that contain personal identifiers must be maintained on secure servers
accessible only to authorized members of the research team, each of whom will
have a user ID and password.
To be fully compliant with the Privacy Rule of the Health Insurance Portability
and Accountability Act (HIPAA) (4), the database system should audit all
viewing, entering, and changing of the data. Field-level auditing allows
determination of when a data element was changed, who made the change, and
what change was made. This is necessary under the Code of Federal Regulations,
Title 21, Part 11 (21 CFR 11) (5) if the electronic data will be used in an FDA
New Drug Application without paper source documents.
The study database must be backed up regularly and stored off-site.
Periodically, the back-up procedure should be tested by restoring a backed-up
copy of the data. At the end of the study the original data, data dictionary, final
database, and
the study analyses should be archived for future use. Such archives can be
revisited in future years, allowing the investigator to respond to questions about
the integrity of the data or analyses, perform further analyses to address new
research questions, and share data with other investigators.
Summary
1. The study database consists of one or more data tables in which the rows
correspond to records or entities (often participants) and the
columns correspond to fields or attributes (often measurements).
2. The data dictionary specifies the name, data type, description, and
range of allowed values for all the fields in the database.
3. The data entry system is the means by which the data tables are
populated. Transcription from paper forms may require double data entry
to ensure fidelity.
4. Electronic data capture through on-screen forms or web pages eliminates
the transcription step. If required, a source document can be printed
immediately after direct data entry.
5. A spreadsheet or statistical program is adequate for simple databases, but
complex databases require the creation of a relational database using
database management software.
6. Database queries sort and filter the data as well as calculate values based
on the raw data fields. Queries are used to monitor data entry, report on
study progress, and format the results for analysis.
7. To protect subject confidentiality, databases that contain personal
identifiers must be stored on secure servers, with access restricted and
audited.
8. Loss of the database must be prevented by regular backups and off-site
storage, and by archiving copies of key versions of the database for
future use.
Footnotes
¹ In the table shown in Fig. 16.1, the mean (± standard deviation) IQ score for
the 5 of 6 neonatal jaundice patients who had the outcome measured is 112.6
(± 19.1). For the 4 of 7 controls with measurements, the mean is 101.3
(± 24.2). t-test comparison of these means yields a P value of 0.46.
² The original meaning of the term flat file was a file consisting of a
string of characters that could only be evaluated sequentially, such as a
tab-delimited text file.
³ Strictly speaking, the term relational has little to do with the
between-table relationships. In fact, relation is the formal term from
mathematical set theory for a data table (2,3). However, the concept of a
relational database as a collection of related tables is a useful heuristic.
⁴ Although Figure 16.3 displays two data dictionaries, one for the Subject
table and one for the Exam table, the entire database can be viewed as
having a single data dictionary rather than one dictionary for each table. For
each field in the database, the single data dictionary requires specification of the
field's table name in addition to the field name, field type, field description, and
range of allowed values.
⁵ Another possibly preferable way to handle gender is to replace the Sex
field with a Male field, and use 1 for yes (male sex) and 0 for no (female
sex). This way the mean value for the field is the proportion of subjects who are
male.
⁶ In clinical trials, the paper data collection form corresponding to a specific
subject is commonly called a Case Report Form (CRF).
⁷ Since in clinical trials paper forms are called CRFs, electronic on-screen forms
are called eCRFs.
⁸ SQL has 3 sublanguages: DDL (Data Definition Language), DML (Data
Manipulation Language), and DCL (Data Control Language). Strictly speaking,
DML is the SQL sublanguage used to view, organize, and extract data, as well as
insert, update, and delete records.
⁹ The table linking the subject ID to personal identifiers such as name, address,
phone, and medical record number may be removed from the database but
archived in a secure location should it ever be necessary to reidentify a subject.
References
1. Newman TB, Liljestrand P, Jeremy RJ, et al. Five-year outcome of
newborns with total serum bilirubin levels of 25 mg/dL or more. N Engl J Med
2006;354(18):1889–1900.
2. Codd EF. A relational model of data for large shared data banks. Commun
ACM 1970;13(6):377–387.
3. Date CJ. An introduction to database systems, 7th ed. Reading, MA:
Addison Wesley, 2000.
4. NIH. Protecting personal health information in research: understanding
the HIPAA privacy rule. NIH Publication #03-5388, September 25, 2003.
5. FDA. Guidance for industry: computerized systems used in clinical trials
(Draft guidance, Revision 1). September, 2004.
17
Implementing the Study and Quality Control
Deborah Grady
Stephen B. Hulley
Most of this book has dealt with the left-hand side of the clinical research model,
addressing matters of design (Fig. 17.1). In this chapter we turn to the
right-hand, implementation side. Even the best of plans thoughtfully assembled in
the armchair may work out differently in practice. Skilled research staff may be
unavailable, study space less than optimal, participants less willing to enroll
than anticipated, the intervention poorly tolerated, and the measurements
challenging. The conclusions of a well-designed study can be marred by
ignorance, carelessness, lack of training and standardization, and other errors in
finalizing and implementing the protocol.
Successful study implementation begins with assembling resources including
space, staff, and financial management for study start-up. The next task is to
finalize the protocol through a process of pretesting and pilot studies of
recruitment, measurement and intervention plans, in an effort to avoid the need
for protocol
revisions after data collection has begun. The study is then carried out with a
systematic approach to quality control of clinical and lab procedures and of
data management following the principles of Good Clinical Practice (GCP).

FIGURE 17.1. Implementing a research project.

Assembling Resources
Space
Conducting clinical research in which volunteer participants make study visits
requires accessible, attractive, and sufficient space. Failing to successfully
negotiate for space early in the study planning process can result in difficulty
enrolling participants, poor adherence to study visits, incomplete data, and
unhappy staff. Clinical research space must be easily accessible to participants
and have adequate available parking. The space should be welcoming,
comfortable, and spacious enough to accommodate staff, measurement
equipment, and storage of study drug and study-related files. If there will be a
physical examination, provision for privacy and hand-washing must be available.
If the participants must go to other places for tests (such as the hospital
laboratory or radiology department) these should be easily accessible. In some
studies, such as those that enroll sick patients or deliver interventions that could
be dangerous, access to cardiopulmonary resuscitation teams and equipment
may be required.
Many university medical centers have clinical research centers that provide fully
equipped research space that is staffed by experienced research staff. Clinical
research centers often include the ability to make specialized measurements
(such as caloric intake, bone density, and insulin clamp studies), and may
provide access to other services (such as professional recruitment, database
management, and statistical analysis). These centers provide an excellent option
for carrying out clinical and translational research, but may require separate
application and review procedures and reimbursement for services.
The Resear ch Team
Research teams range in size from small (often just the investigator and a
part-time research assistant) to multiple full-time staff for large studies.
Regardless of size, all research teams must accomplish similar activities and fill
similar roles, which are described in Table 17.1. Often, one person carries out
several or many of these activities. However, some of these duties require
special expertise, such as statistical programming and analyses. Some team
members, such as the financial and human resources managers, are generally
employed by the university or medical center, and provided by the investigator's
department or unit. Regardless of the size of the study team, the principal
investigator (PI) must make sure that each of the functions described in Table
17.1 is carried out.
After deciding on the number of team members and the distribution of duties,
the next step is to work with a departmental administrator to find qualified and
experienced job applicants. This can be difficult, because formal training for
some research team members is variable, and job requirements vary from one
study to the next. For example, the crucial position of project director may be
filled by a person with a background in nursing, pharmacy, public health,
laboratory services, or pharmaceutical research, and the duties of this position
can vary widely.
Table 17.1 Functional Roles for Members of a Research Team*

Role | Function | Comment
Principal investigator | Ultimately responsible for the design, conduct and quality of the study, and for reporting findings |
Project director/clinic coordinator | Provides day-to-day management of all study activities | Experienced, responsible, meticulous, and with good interpersonal and organizational skills
Recruiter | Ensures that the desired number of eligible participants are enrolled | Knowledgeable and experienced with a range of recruitment techniques
Research assistant/clinic staff | Carries out study visit procedures and makes measurements | Physical examination or other specialized procedures may require special licenses or certification
Quality control coordinator | Ensures that all staff follow standard operating procedures (SOPs), and oversees quality control | Observes study procedures to ensure adherence to SOPs, may supervise audit by external groups such as U.S. Food and Drug Administration (FDA)
Data manager | Designs, tests, and implements the data entry, editing and storage system |
Programmer/analyst | Produces study reports that describe recruitment, adherence and data quality, conducts data analyses | Works under the supervision of the principal investigator (PI) and statistician
Statistician | Estimates sample size and power, designs analysis plan, interprets findings | Often plays a major role in overall study design and conduct
Administrative assistant | Provides clerical and administrative support, sets up meetings, etc. | Generally not eligible for federal direct cost support
Financial manager | Prepares budget and manages expenditures | Provides projections to help manage budget
Human resources manager | Assists in preparing job descriptions, hiring, evaluations | Helps manage personnel issues and problems

*In small studies, one person may take on several of these roles.

Most universities and medical centers have formal methods for posting job
openings, but other avenues, such as newspaper and web-based advertisements,
can be useful. The safest approach is to find staff of known competence, for
example, someone working for a colleague whose project has ended.
Leadership and Team-Building
The quality of a study that involves more than one person on the research team
begins with the integrity and leadership of the PI. The PI should ensure that all
staff are properly trained and certified to carry out their duties. He should
clearly convey the message that protection of human subjects, maintenance of
privacy, completeness and accuracy of data, and fair presentation of research
findings are paramount. He cannot watch every measurement made by
colleagues and staff, but if he creates a sense that he is aware of all study
activities and feels strongly about human subjects' protection and the quality of
the data, most people will respond in kind. It is helpful to meet with each
member of the team from time to time, expressing appreciation and discussing
problems and solutions. A good leader is adept at delegating authority
appropriately and at the same time setting up a hierarchical system of
supervision that ensures sufficient oversight of all aspects of the study.
From the outset of the planning phase, the investigator should lead regular staff
meetings with all members of the research team. Meetings should have the
agenda distributed in advance, with progress reports from individuals who have
been given responsibility for specific areas of the study. These meetings provide
an opportunity to discover and solve problems, and to involve everyone in the
process of developing the project and conducting the research. Staff meetings
are enhanced by scientific discussions and updates related to the project.
Regular staff meetings are a great source of morale and interest in the goals of
the study and provide on-the-job education and training.
Most research-oriented universities and medical centers provide a wide range of
institutional resources for conducting clinical research. These include human
resources and financial management infrastructure, and centralized clinical
research centers that provide space and experienced research staff. Many
universities also have core laboratories where specialized measurements can be
performed, centralized space and equipment for storage of biologic specimens or
images, centralized database management services, professional recruitment
centers, expertise regarding U.S. Food and Drug Administration (FDA) and other
regulatory issues, and libraries of study forms and documents. This
infrastructure may not be readily apparent in a large sprawling institution, and
investigators should become familiar with their local resources before
trying to set up these services themselves.
Study Start-up
At the beginning of the study, the PI must finalize the budget, develop and sign
any contracts that are involved, define staff positions, hire and train staff,
obtain institutional review board (IRB) approval, write the operations manual,
develop and test forms, develop and test the database, and begin recruiting
participants. This period of study activity is referred to as study start-up, and
requires intensive effort before the first participant can be enrolled. Adequate
time and planning for study start-up are important to the conduct of a
high-quality study.
Adequate funding for conducting the study is crucial. The budget will have been
prepared at the time the proposal is submitted for funding, well in advance of
starting the study (Chapter 19). Most universities and medical centers employ
staff with financial expertise to assist in the development of budgets (the preaward
manager). It is a good idea to get to know this person well, and to thoroughly
understand regulations related to various sources of funding.
In general, the rules for expending NIH and other public funding are
considerably more restrictive than for industry or foundation funding. The total
amount of the budget cannot be increased if the work turns out to be more
costly than predicted, and shifting money across categories of expense (e.g.,
personnel, equipment, supplies, travel) usually requires approval by the sponsor.
Universities and medical centers typically employ financial personnel whose main
responsibility is to ensure that funds available to an investigator through grants
and contracts are spent appropriately (the postaward manager). The
postaward manager should prepare regular reports and projections that allow the
investigator to make adjustments in the budget to make the best use of the
available finances during the life of the study.
The budget for a study supported by a pharmaceutical company is part of a
contract that incorporates the protocol and a clear delineation of the tasks to be
carried out by the investigator and the sponsor. Contracts are legal documents
that obligate the investigator to activities and describe the timing and amount of
payment in return for specified deliverables. University or medical center
lawyers are needed to help develop such contracts and ensure that they protect
the investigator's intellectual property rights, access to data, publication rights,
and so forth. However, lawyers may not be familiar with the tasks required to
complete a specific study, and input from the investigator is crucial, especially
with regard to the scope of work.
Institutional Review Board Approval
The IRB must approve the study protocol, consent form and recruitment
materials before recruitment can begin (Chapter 14). Investigators should be
familiar with the requirements of their local IRB and the time required to obtain
approval. IRB staff are generally very helpful in these matters, and should be
contacted early on to discuss any procedural issues and design decisions that
affect study participants.
Operations Manual and Forms Development
The study protocol is commonly expanded to create the operations manual,
which includes the protocol, information on study organization and policies, and
a detailed version of the methods section of the study protocol (Appendix 17.1).
It specifies exactly how to recruit and enroll study participants, and describes all
activities that occur at each visit: how randomization and blinding will be
achieved, how each variable will be measured, quality control procedures, data
management practices, and the statistical analysis plan. It should also include all
of the questionnaires and forms that will be used in the study, with instructions
on contacting the study participants, carrying out interviews, completing and
coding study forms, entering and editing data, and collecting and processing
specimens. An operations manual is essential for research carried out by several
individuals, particularly when there is collaboration among investigators in more
than one location. Even when a single investigator does all the work himself,
operational definitions help reduce random variation and changes in
measurement technique over time.
Design of the data collection forms will have an important influence on the
quality of the data and the success of the study (Chapter 15). Before the first
participant is recruited, the forms should be pretested. Any entry on a form that
involves judgment requires explicit operational definitions that should be
summarized briefly on the form itself and set out in more detail in the operations
manual. The items should be coherent and their sequence clearly formatted, with
arrows indicating when questions should be skipped (see Appendix 15.1). Pretesting
will ensure clarity of meaning and ease of use. Labeling each page with the date,
name, and ID number of the subject and staff safeguards the integrity of the
data should pages become separated. Some studies use digital forms, handheld
computers, personal digital assistants or other devices to collect data, bypassing
the need to create paper forms. These devices must also be pretested during
study start-up, and directions for their use included in the operations manual.
Database Design
Before the first participant is recruited, the database that will be used to enter,
store, update, and monitor the data must be created and tested. Database
design and management are discussed in Chapter 16. Depending on the type of
database that will be used and the scope of the study, development and testing
of the data entry and management system can require weeks to months after
staff with the appropriate skills have been identified, hired, and trained. For very
large studies, professional database design and management services are
available, generally from organizations that provide professional research
management (Clinical Research Organizations, or CROs).
Recruitment
Approaches to successfully recruiting the goal number of study participants are
described in Chapter 3. We want to emphasize here that timely recruitment is
the most difficult aspect of many studies. Adequate time, staff, resources, and
expertise are essential, and should be planned well in advance of study start-up.
Finalizing the Protocol
Pretests and Pilot Studies
Pretests and pilot studies are designed to evaluate the feasibility, efficiency and
cost of study methods, the reproducibility and accuracy of measurements, and
likely recruitment rates, outcome rates and effect sizes. The nature and scale of
pretests and pilot studies depend on the study design and the needs of the
study. For most studies, a series of pretests or a small pilot study serves very
well, but for large, expensive studies a full-scale pilot study may be appropriate
(1). It may be desirable to spend up to 10% of the eventual cost of the study to
make sure that recruitment strategies will work, measurements are appropriate
and sample size estimates are realistic.
Pretests are evaluations of specific questionnaires, measures, or procedures
that can be carried out by study staff to assess their functionality,
appropriateness and feasibility. For example, pretesting the data entry and
database management system is generally done by having study staff complete
forms with missing, out-of-range or illogical data, entering these data and
testing to ensure that the data editing system identifies these errors.
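The pretest of a data editing system described above can be illustrated with a
minimal sketch. The field names, range limits, and the `check_record` function
are hypothetical, chosen only for illustration; a real study would take its
limits and logic checks from its own operations manual and database system.

```python
from datetime import date

# Hypothetical range limits; a real study would define these in its operations manual.
RANGE_LIMITS = {"systolic_bp": (60, 260), "diastolic_bp": (30, 160), "age": (18, 110)}


def check_record(record):
    """Return a list of human-readable problems found in one data-entry record."""
    problems = []
    for field, (low, high) in RANGE_LIMITS.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside {low}-{high}")
    # Illogical combinations: each value may be plausible alone but not together.
    sbp, dbp = record.get("systolic_bp"), record.get("diastolic_bp")
    if sbp is not None and dbp is not None and dbp >= sbp:
        problems.append("diastolic_bp is not lower than systolic_bp")
    birth, visit = record.get("birth_date"), record.get("visit_date")
    if birth is not None and visit is not None and visit < birth:
        problems.append("visit_date precedes birth_date")
    return problems


# Deliberately flawed test forms, as staff might complete during a pretest.
test_forms = [
    {"systolic_bp": 120, "diastolic_bp": 80, "age": 54,
     "birth_date": date(1950, 1, 1), "visit_date": date(2004, 6, 1)},  # clean
    {"systolic_bp": None, "diastolic_bp": 80, "age": 54},               # missing value
    {"systolic_bp": 120, "diastolic_bp": 80, "age": 540},               # out of range
    {"systolic_bp": 90, "diastolic_bp": 95, "age": 54},                 # illogical
]

for number, form in enumerate(test_forms, start=1):
    print(number, check_record(form) or "no problems found")
```

The pretest succeeds if every planted error is reported and the clean form
passes without a complaint.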
Pilot studies enroll a small number of participants and require a protocol,
approval by the IRB and informed consent. Important reasons for conducting a
pilot study are to guide decisions about how to design recruitment approaches,
measurements, and interventions. Evaluating the methods for recruiting study
participants during a pilot study can provide rough estimates of the number who
are available and willing to enroll, and test the efficiency of different recruitment
approaches. Pilot studies can also give the investigator an idea of the nature of
the populations he will be sampling: the distributions of age, sex, race, and other
characteristics that may be important to the study. They can be designed to
provide data on the feasibility of measurements, subjective reactions to each
procedure and any discomfort it may have caused, whether there were
questionnaire items that were not understood, and ways to improve the study.
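As a rough illustration of how pilot recruitment counts might be turned into
planning estimates, the sketch below scales the pilot yield up to a target
sample size. The function name, the counts, and the assumption that the pilot
rates will hold in the main study are all hypothetical.

```python
from math import ceil


def recruitment_projection(screened, eligible, enrolled, weeks, target):
    """Project main-study recruitment needs from pilot counts.

    Assumes (optimistically) that pilot eligibility, consent, and weekly
    enrollment rates carry over unchanged to the main study.
    """
    return {
        "eligible_rate": eligible / screened,
        "consent_rate": enrolled / eligible,
        # People who must be screened to enroll the target number.
        "screens_needed": ceil(target * screened / enrolled),
        # Weeks of recruitment needed at the pilot's enrollment pace.
        "weeks_needed": ceil(target * weeks / enrolled),
    }


# Hypothetical pilot: 200 screened, 50 eligible, 20 enrolled over 8 weeks.
pilot = recruitment_projection(screened=200, eligible=50, enrolled=20,
                               weeks=8, target=300)
print(pilot)
```

Estimates of this kind are only as good as the pilot, so they are best treated
as a starting point for planning staff and time rather than as a forecast.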
Pilot studies may be particularly helpful for studies that involve a new
intervention, where it is important to determine the dose or intensity,
frequency, and duration of the intervention. For example, a pilot test of a new
school-based AIDS education program designed to prevent HIV infection could
help optimize effectiveness by determining the ideal duration of each training
session and number of sessions per week.
Before the study begins, it is a good idea to test study procedures in a full-scale
dress rehearsal. The purpose is to iron out problems with the final set of
instruments and procedures. What appears to be a smooth, problem-free
protocol on paper usually reveals logistic and substantive problems in practice,
and the dress rehearsal will generate improvements in the approach. The
investigator himself can serve as a mock subject to experience the study and
the research team from that viewpoint.
Minor Protocol Revisions once Data Collection Has Begun
No matter how carefully the study is designed and the procedures pretested,
problems inevitably appear once the study has begun. The general rule is to
make as few changes as possible at this stage. Sometimes, however, protocol
modifications can strengthen the study.
The decision as to whether a minor change will improve the integrity of the study
is often a trade-off between the benefit that results from the improved
methodology and the disadvantages of altering the uniformity of the study
findings and spending time and money to change the system. Decisions that
simply involve making an operational definition more specific are relatively
easy. For example, in a study that excludes alcoholics, can a recovering alcoholic
be included? This decision should be made in consultation with coinvestigators,
but with adequate communication through memos and the operations manual to
ensure that it is applied uniformly by all staff for the remainder of the study.
Often minor adjustments of this sort do not require IRB approval, particularly if
they do not involve changing the protocol that has been approved by the IRB,
but the PI should ask an IRB staff member if there is any uncertainty.
Other decisions are more difficult. Suppose a questionnaire to elicit symptoms of
angina pectoris asks if the pain goes away within 10 minutes. Six months into the
study the investigator realizes that this does not distinguish the brief sharp pains
of a few seconds' duration that are probably musculoskeletal from the pain of some
minutes' duration that could represent an ischemic myocardium. The investigator
could change the wording of the question to exclude instances of sharp pain lasting
less than 1 minute. Doing so might produce a variable that is more appropriate
for the research question, but it would create a change halfway through the
study in the nature of what is measured. This is difficult to deal with in the
analysis phase, and it may be best not to make the change. Sometimes it is
possible to continue measuring the variable the old way and to add the new
approach in addition. The new version of the question should be asked last, so
that it will not alter the responses to the original standard set of questions.
Substantive Protocol Revisions once Data Collection Has Begun
Major changes in the study protocol, such as including different kinds of
participants or changing the intervention or outcome, are a serious problem.
Although there may be good reasons for making these changes, they must be
undertaken with a view to analyzing and reporting the data separately if this will
lead to a more appropriate interpretation of the findings. The judgments involved
are illustrated by two examples from the Raloxifene Use for the Heart (RUTH)
trial, a multicenter clinical trial of the effect of treatment with raloxifene on
coronary events and breast cancer in 10,101 women at high risk for coronary
heart disease (CHD) events. The initial definition of the primary outcome was the
occurrence of nonfatal myocardial infarction (MI) or coronary death. Early in the
trial, it was noted that the rate of this outcome was lower than expected,
probably because new clinical cointerventions such as thrombolysis and
percutaneous angioplasty lowered the risk. After careful consideration, the RUTH
Executive Committee decided to change the primary outcome to include acute
coronary syndromes other than MI. This change was made early in the trial;
appropriate information had been collected on potential cardiac events to
determine if these met the new criteria for acute coronary syndrome, allowing
the study database to be searched for acute coronary syndrome events that had
occurred before the change was made (1).
Also early in the RUTH trial, results from the Multiple Outcomes of Raloxifene
Evaluation (MORE) trial showed that the relative risk of breast cancer was
markedly reduced by treatment with raloxifene (2). These results were not
conclusive, since the number of breast cancers was small, and there were
concerns about generalizability since all women enrolled in MORE had
osteoporosis. To determine if raloxifene would also reduce the risk of breast
cancer in another population (older women without osteoporosis and at risk for
CHD events), the RUTH Executive Committee decided to add breast cancer as a
second primary outcome (1).
Each of these changes was major, requiring a protocol amendment, approval of
the IRB at each clinical site, and approval of the FDA. These are examples of
substantive revisions that enhanced feasibility or the information content of the
study without compromising its overall integrity. Tinkering with the protocol is
not always so successful. Substantive revisions should only be undertaken after
weighing the pros and cons with members of the research team and appropriate
advisors such as the DSMB and funding agency. The investigator must then deal
with the potential impact of the change when he analyzes data and draws the
study conclusions.
Closeout
At some point in all longitudinal studies and clinical trials, follow-up of
participants stops. The period during which participants complete their last visit
in the study is often called closeout. Closeout of clinical studies
presents several issues that deserve careful planning and implementation. At a
minimum, at the closeout visit staff should thank participants for their time and
effort and inform them that their participation was critical to the success of the
study. In addition, closeout may include the following activities:
- Participants (and their physicians) may be informed of the results of
  laboratory tests or other measurements that were performed during the
  study, either in person at the last visit or later by mail.
- In a blinded clinical trial, participants may be told their treatment status,
  either at the last visit, or by mail at the time all participants have
  completed the trial and the main data analyses are complete.
- A copy of the main manuscript based on the study results and a press
  release or other description of the findings written in lay language may be
  mailed to participants (and their physicians) at the time of presentation or
  publication.
- After all participants have completed the study, they may be invited to a
  reception during which the PI thanks them, discusses the results of the
  study, and answers questions.
Quality Control During the Study
Good Clinical Practice
A crucial aspect of clinical research is the approach to ensuring that all aspects
of the study are of the highest quality. Guidelines for high-quality research,
called Good Clinical Practice (GCP), were developed to apply specifically to
clinical trials that test drugs requiring approval by the FDA or other regulatory
agencies, and are defined as

    a standard for the design, conduct, performance,
    monitoring, auditing, recording, analyses, and reporting of
    clinical trials that provides assurance that the data and
    reported results are credible and accurate, and that the rights,
    integrity, and confidentiality of trial subjects are protected.

Recently, these principles have been increasingly applied to clinical trials
sponsored by federal and other public agencies, and to research designs other
than trials (Table 17.2). GCP requirements are described in detail in the FDA
Code of Federal Regulations Title 21 (3). The International Conference on
Harmonization (4) provides quality control guidelines used by regulatory
agencies in Europe, the United States and Japan.
GCP is best implemented by standard operating procedures (SOPs) for all
study-related activities. The study protocol and operations manual can be
considered SOPs, but often do not cover areas such as how staff are trained and
certified, how the database is developed and tested, or how study files are
maintained, kept confidential, and backed up. Many universities have staff who
specialize in processes for meeting GCP guidelines and various templates and
models for SOPs. GCP with respect to ethical conduct of research is addressed in
Chapter 14, and in this chapter we focus on quality control of study procedures
and data management.
Quality Control for Clinical Procedures
It is a good idea to assign one member of the research team to be the quality
control coordinator who is responsible for implementing appropriate quality
control techniques for all aspects of the study, supervising staff training and
certification, and monitoring the use of quality control procedures during the
study. The goal is to detect possible problems before they occur, and prevent
them. The quality control coordinator may also be responsible for preparing for
and acting as the contact person for audits by the IRB, FDA, study sponsor, or
NIH. Quality control of clinical procedures begins during the planning phase and
continues throughout the study (Table 17.3).
The operations manual. The operations manual is a very important aspect
of quality control that has been described earlier in this chapter. To
illustrate, consider measuring blood pressure, a partially subjective
outcome for which there is no feasible gold standard. The operations manual
should give specific instructions for preparations before the clinic visit
(including the timing of taking blood pressure medication); preparing the
participant for the measurement (remove long-sleeved clothing, sit quietly
for 5 minutes); choosing the proper size blood pressure cuff; locating the
brachial artery and applying the cuff; inflating and deflating the cuff; and
recognizing which sounds represent systolic and diastolic blood pressure.
Table 17.2 Aspects of the Conduct of Clinical Research that are Covered by
Good Clinical Practices

The design is supported by preclinical, animal and other data as appropriate
The study is conducted according to ethical research principles
A written protocol is carefully followed
Investigators and those providing clinical care are trained and qualified
All clinical and laboratory procedures meet quality standards
Data are reliable and accurate
Complete and accurate records are maintained
Statistical methods are prespecified and carefully followed
The results are clearly and fairly reported

Table 17.3 Quality Control of Clinical Procedures*

Steps that precede the study:
    Develop a manual of operations
    Define recruitment strategies
    Operational definitions of measurements
    Standardized instruments and forms
    Approach to managing and analyzing the data
    Quality control systems
    Systems for blinding participants and investigators
    Appoint quality control coordinator
    Train the research team and document this
    Certify the research team and document this
Steps during the study:
    Provide steady and caring leadership
    Hold regular staff meetings
    Special procedures for drug interventions
    Recertify the research team
    Periodic performance review
    Periodically compare measurements across technicians and over time

*Clinical procedures include blood pressure measurement, structured interview,
chart review, etc.

Training and certification. Standardized training of study staff is
essential to high-quality research. All staff involved in the study should
receive appropriate training before the study begins, and be certified as to
competence with regard to key procedures and measurements. With regard
to measurement of blood pressure, for example, members of the team can
be trained in each aspect of the measurement, required to pass a written
test on the relevant section of the operations manual and to obtain
satisfactory readings on mock participants assessed simultaneously by the
instructor using a double-headed stethoscope. The certification procedure
should be supplemented during the study by scheduled recertifications and
a log of training, certification and recertification should be maintained at
the study site.
Performance review. Supervisors should review the way clinical
procedures are carried out by periodically sitting in on representative clinic
visits or telephone calls. After obtaining the study participant's permission,
the supervisor can be quietly present for at least one complete example of
every kind of interview and technical procedure each member of his
research team performs. This may seem awkward at first, but it soon
becomes comfortable. It is helpful to use a standardized checklist
(provided in advance and based on the protocol and operations manual)
during these observations. Afterward, communication between the
supervisor and the research team member can be facilitated by reviewing
the checklist and resolving any quality control issues that were noted in a
positive and nonpejorative fashion. The timing and results of performance
reviews should be recorded in training logs.
Involving peers from the research team as reviewers is useful for building
morale and teamwork, as well as for ensuring the consistent application of
standardized approaches among members of the team who do the same
thing. One advantage of using peers as observers in this system is that all
members of the research team acquire a sense of ownership of the quality
control process. Another advantage is that the observer often learns as
much from observing someone else's performance as the person at the
receiving end of the review procedure.
Periodic reports. It is important to tabulate data on the technical quality
of the clinical procedures and measurements at regular intervals. This can
give clues to the presence of missing, inaccurate, or imprecise
measurements. Differences among the members of a blood pressure
screening team in the mean levels observed over the past 2 months, for
example, can lead to the discovery of differences in their measurement
techniques. Similarly, a gradual change over a period of months in the
standard deviation of sets of readings can indicate a change in the
technique for making the measurement. Periodic reports should also
address the success of recruitment, the timeliness of data entry, the
proportion of missing and out-of-range variables, the time to address data
queries, and the success of follow-up and adherence to the intervention.
Special procedures for drug interventions. Clinical trials that use drugs,
particularly those that are blinded, require special attention to the quality
control of labeling, drug delivery and storage, dispensing the medication
and collecting unused medication. Providing the correct drug and dosage is
ensured by carefully planning with the manufacturer and pharmacy the
nature of the drug distribution approach, by overseeing its implementation,
and occasionally by testing the composition of the blinded study
medications to make sure they contain the correct constituents. Drug
studies also require clear procedures and logs for tracking receipt of study
medication, storage, distribution, and return by participants.
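The periodic reports described above, which compare mean levels across
technicians and track the standard deviation of readings over time, can be
sketched as follows. The readings, the function names, and the 5 mm Hg
tolerance are hypothetical; the tolerance is an arbitrary screening threshold
for illustration, not a statistical test.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical readings: (technician, month, systolic blood pressure in mm Hg).
readings = [
    ("A", 1, 128), ("A", 1, 134), ("A", 2, 130), ("A", 2, 126),
    ("B", 1, 129), ("B", 1, 133), ("B", 2, 131), ("B", 2, 127),
    ("C", 1, 120), ("C", 1, 118), ("C", 2, 122), ("C", 2, 120),
]


def summarize(rows, key_index):
    """Mean and SD of readings grouped by technician (0) or month (1)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: (mean(vals), stdev(vals)) for key, vals in sorted(groups.items())}


def flag_technicians(rows, tolerance):
    """Technicians whose mean differs from the overall mean by more than tolerance."""
    overall = mean(row[2] for row in rows)
    return [tech for tech, (avg, _) in summarize(rows, 0).items()
            if abs(avg - overall) > tolerance]


print("By technician:", summarize(readings, 0))
print("By month:", summarize(readings, 1))
print("Review technique of:", flag_technicians(readings, tolerance=5))
```

A flagged technician is a prompt to observe the measurement technique, not
proof of error, since technicians may simply have seen different participants.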
Quality Control for Laboratory Procedures
The quality of laboratory procedures can be controlled using many of the
approaches described above for clinical procedures. In addition, the fact that
specimens are being removed from the participants (creating the possibility of
mislabeling) and the technical nature of laboratory tests lead to special
strategies summarized below (Table 17.4).
Attention to labeling. When a participant's blood specimen or
electrocardiogram is mistakenly labeled with another individual's name, it
may be impossible to correct or even discover the error later. The only
solution is prevention, avoiding transposition errors by carefully checking
the participant's name and number when labeling each specimen. Computer
printouts of labels for blood tubes and records speed the process of labeling
and avoid the digit transpositions that can occur when numbers are
handwritten. A good procedure when transferring serum from one tube to
another is to label the new tube in advance and hold the two tubes next to
each other, reading one out loud while checking the other; this can also be
automated with scannable bar codes.
Blinding. The task of blinding the observer is easy when it comes to
measurements on specimens, and it is always a good idea to label
specimens so that the technician has no knowledge of the study group or
the value of other key variables. Even for apparently objective procedures,
like an automated blood glucose determination, this precaution reduces
opportunities for bias and provides a stronger methods section when
reporting the results. However, blinding laboratory staff means that there
must be clear procedures for reporting abnormal results to a member of the
staff who is qualified to review the results and decide if the participant
should be notified or other action should be taken. In clinical trials, there
must also be strategies in place for (sometimes emergent) unblinding if
laboratory measures indicate abnormalities that might be associated with
the trial intervention and require immediate action.
Bl i nded dupl i cat es and st andar d pool s. When specimens or images are
sent to a central laboratory for chemical analysi s or interpretati on, i t may
be desirabl e to send blinded dupl icatesa second specimen from a random
subset of partici pants given a separate and fi cti tious ID numberthrough
the same system. This strategy gives a measure of the precision of the
laboratory technique. Another approach for serum specimens that can be
stored frozen is to prepare a pool of serum at the outset and periodically
send ali quots through the system that are blindl y l abel ed wi th fictitious ID
numbers. Measurements carried out on the serum pool at the outset, using
the best available technique, establ ish i ts consti tuents; the pool is then
used as a gold standard duri ng the study, providing estimates of accuracy
and precision. A third approach, for measurements that have inherent
variability such as a Pap test or mammography readings, is to involve two
independent, blinded readers. If both agree withi n predefi ned limi ts, the
result i s established. Discordant resul ts may be resolved by di scussion and
consensus, or the opi nion of a third reader.
Commercial laboratory contracts. In some studies, biologic measures
made on blood, sera, cells, or tissue are made under contract to commercial
laboratories. The lab must be appropriately licensed and certified and a
copy of these certifications should be on file in the study office. Commercial
labs should guarantee timely service and provide standardized procedures
for handling coded specimens, notifying investigators of abnormal results,
and transferring data to the main database.

Table 17.4 Quality Control of Laboratory Procedures*
Steps that precede the study
    Use strategies in Table 17.3
    Establish good labeling procedures
Steps during the study
    Use strategies in Table 17.3
    Ensure/document proper function of equipment
    Use blinded duplicates or standard pools
*Laboratory procedures include blood tests, x-rays,
electrocardiograms, radiology, pathology, etc.
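
The way blinded duplicates give a measure of precision can be illustrated
with a short calculation. The sketch below is only one possible approach: the
function name, the result keys, and the glucose values are hypothetical, and
the within-pair standard deviation (the square root of the sum of squared
differences divided by twice the number of pairs) is an assumed, commonly used
summary rather than a method prescribed in this chapter.

```python
import math


def duplicate_precision(pairs):
    """Summarize agreement between original and blinded duplicate results.

    pairs: list of (original_value, duplicate_value) tuples, one per
    participant whose specimen was sent through the laboratory twice.
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("at least one duplicate pair is required")

    diffs = [original - duplicate for original, duplicate in pairs]
    mean_difference = sum(diffs) / n
    # Within-pair SD estimated from duplicates: sqrt(sum(d^2) / (2n)).
    within_pair_sd = math.sqrt(sum(d * d for d in diffs) / (2 * n))
    grand_mean = sum(a + b for a, b in pairs) / (2 * n)
    # Coefficient of variation, expressed as a percentage of the mean.
    cv_percent = 100 * within_pair_sd / grand_mean

    return {
        "n_pairs": n,
        "mean_difference": mean_difference,
        "within_pair_sd": within_pair_sd,
        "cv_percent": cv_percent,
    }


# Hypothetical glucose results (mg/dL) for four blinded duplicate pairs.
example_pairs = [(100, 102), (98, 96), (101, 101), (99, 103)]
summary = duplicate_precision(example_pairs)
print(summary)
```

A mean difference far from zero would suggest a systematic shift between the
two runs, whereas a large within-pair standard deviation points to poor
precision. Tracking these values over time is one way to produce the blind
duplicate tabulations listed in Appendix 17.2.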
Quality Control for Data Management
The investigator should set up and pretest the data management system before
the study begins (Chapter 16). This includes designing the forms for recording
measurements; choosing computer hardware and software for data editing and
management; designing the data editing parameters for missing, out-of-range,
and illogical entries; testing the data management system; and planning
dummy tabulations to ensure that the appropriate variables are collected
(Table 17.5).
Missing data. Missing data can be disastrous if they affect a large
proportion of the measurements, and even a few missing values can
sometimes bias the conclusions. A study of the long-term sequelae of an
operation that has a delayed mortality rate of 5%, for example, could
seriously underestimate this complication if 10% of the participants were
lost to follow-up and if death were a common reason for losing them.
Erroneous conclusions due to missing data can sometimes be corrected after
the fact (in this case by an intense effort to track down the missing
participants), but often the measurement cannot be replaced. There are
statistical techniques for imputing missing values based on other
information (from baseline or from other follow-up visits) available for the
participant. Although these techniques are useful, particularly for
multivariate analysis in which the accumulation of missing data across a
number of predictor variables could otherwise lead to large proportions of
participants unavailable for analysis, they do not guarantee conclusions
free of nonresponse bias if there are substantial numbers of missing
observations.
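
The arithmetic behind the delayed-mortality example can be made concrete with
a small sketch. The 5% mortality and 10% loss to follow-up come from the text;
the cohort size of 1,000 and the assumption that 30 of the 100 lost
participants had died are hypothetical numbers chosen only for illustration,
as is the function name.

```python
def observed_mortality(n_enrolled, n_deaths, n_lost, n_deaths_among_lost):
    """Mortality rate seen among participants with complete follow-up.

    All arguments are counts. Deaths that occur among participants who
    were lost to follow-up are invisible to the investigator.
    """
    if n_deaths_among_lost > min(n_deaths, n_lost):
        raise ValueError("deaths among the lost cannot exceed deaths or losses")
    n_followed = n_enrolled - n_lost
    if n_followed <= 0:
        raise ValueError("no participants remain in follow-up")
    return (n_deaths - n_deaths_among_lost) / n_followed


# True mortality: 50 of 1,000 (5%). Lost to follow-up: 100 (10%).
# Hypothetical assumption: 30 of the 100 lost participants had died.
rate = observed_mortality(1000, 50, 100, 30)
print(f"{rate:.1%}")  # prints 2.2%
```

Under these assumed numbers the study would report a mortality of about 2.2%,
less than half the true 5%. If deaths were no more common among the lost than
among the rest (5 of the 100), the observed rate would remain 5%, which is why
the reason for the loss matters as much as its size.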
The only good solution is to design and carry out the study in ways that
avoid missing data, for example, by having a member of the research team
check forms for completeness before the participant leaves the clinic,
designing electronic data entry interfaces that do not allow skipped entries
and designing the database so that missing data are immediately flagged
for study staff (Table 17.6). Missing clinical measurements should be
addressed while the participant is still in the clinic when it is relatively
easy to correct errors that are discovered.

Table 17.5 Quality Control of Data Management: Steps that Precede the Study
Be parsimonious: collect only needed variables
Select appropriate computer hardware and software for database management
Program the database to flag missing and out-of-range values
Test the database using missing and out-of-range values
Plan analyses and test with dummy tabulations
Design forms that are
    Self-explanatory
    Coherent (e.g., multiple-choice options are exhaustive and mutually
    exclusive)
    Clearly formatted with boxes for data entry and arrows directing skip
    patterns
    Printed in lower case using capitals, underlining, and bold font for
    emphasis
    Esthetic and easy to read
    Pretested and validated (see Chapter 15)
    Labeled on every page with date, name, ID number, and/or bar code

Table 17.6 Quality Control of Data Management: Steps during the Study
Flag or check for omissions and major errors while participant is still in
the clinic
    No errors or transpositions in ID number, name code, date on each page
    All the correct forms for the specified visit have been filled out
    No missing entries or faulty skip patterns
    Entries are legible
    Values of key variables are within permissible range
    Values of key variables are consistent with each other (e.g., age and
    birth date)
Carry out periodic frequency distributions and variance measures to discover
aberrant values
Create other periodic tabulations to discover errors (see Appendix 17.2)

Inaccurate and imprecise data. This is an insidious problem that often
remains undiscovered, particularly when more than one person is involved
in making the measurements. In the worst case, the investigator designs
the study and leaves the collection of the data to his research assistants.
When he returns to analyze the data, some of the measurements may be
seriously biased by the consistent use of an inappropriate technique. This
problem is particularly severe when the errors in the data cannot be
detected after the fact. If interviews are carried out with leading questions
or if blood pressure is measured differently in participants known to be
receiving placebo, the database will include serious errors that are
undetectable. The investigator will assume that the variables mean what he
intended them to mean, and, ignorant of the problem, may draw
conclusions from his study that are wrong.
Staff training and certification, periodic performance reviews, and regular
evaluation of differences in mean or range of data generated by different
staff members can help identify or prevent these problems. Computerized
editing plays an important role, using data entry and management systems
programmed to flag or not to allow submission of forms with missing,
inconsistent, and out-of-range values. A standardized procedure should be
in place for changing original data on any data form. Generally this should
be done as soon after data collection as possible, and includes marking
through the original entry (not erasing it), signing and dating the change.
This provides an audit trail to justify changes in data and prevent
fraud. On a computerized database, changing data generally requires
generating a computer-based entry that is recorded with the date, staff ID,
and reason for changing the data.
Periodic tabulation and inspection of frequency distributions of important
variables at regular intervals allows the investigator to assess the
completeness and quality of the data at a time when correction of past
errors may still be possible (e.g., by calling the participant or requesting
that the participant return to the study offices), and when further errors in
the remainder of the study can be prevented. A useful list of topics for
quality control reports is provided in Appendix 17.2.
Fraudulent data. Clinical investigators who lead research teams have to
keep in mind the possibility of an unscrupulous colleague or employee who
chooses fabrication of study information as the easiest way to get the job
done. Approaches to guarding against such a disastrous event include
taking great care in choosing colleagues and staff, developing a strong
relationship with them so that ethical behavior is explicitly understood and
rigorously followed by all, being alert to the possibility of fraud when data
are examined, and making unscheduled checks of the primary source of the
data to be sure that they are real.
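
The computerized editing described in this section (flagging missing,
out-of-range, and inconsistent values before a form is accepted) can be
sketched in a few lines. This is a minimal illustration, not a description
of any particular data management system: the field names, the permissible
ranges, and the function names are all hypothetical.

```python
from datetime import date

# Hypothetical required fields and permissible ranges for one study form.
REQUIRED_FIELDS = ("participant_id", "visit_date", "birth_date", "age", "systolic_bp")
PERMISSIBLE_RANGES = {"age": (18, 100), "systolic_bp": (70, 260)}


def age_on(birth_date, on_date):
    """Age in completed years on a given date."""
    years = on_date.year - birth_date.year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def check_record(record):
    """Return a list of edit messages for one form; empty if none apply."""
    flags = []

    # Missing entries.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            flags.append(f"missing: {field}")

    # Out-of-range values.
    for field, (low, high) in PERMISSIBLE_RANGES.items():
        value = record.get(field)
        if value not in (None, "") and not low <= value <= high:
            flags.append(f"out of range: {field}={value}")

    # Consistency of key variables with each other (age and birth date).
    birth, visit, age = record.get("birth_date"), record.get("visit_date"), record.get("age")
    if isinstance(birth, date) and isinstance(visit, date) and isinstance(age, int):
        if age_on(birth, visit) != age:
            flags.append("inconsistent: age vs birth_date")

    return flags


# Hypothetical form with a missing blood pressure and a wrong age.
form = {
    "participant_id": "0042",
    "visit_date": date(2006, 6, 1),
    "birth_date": date(1950, 1, 1),
    "age": 45,
    "systolic_bp": None,
}
for message in check_record(form):
    print(message)
```

Running such checks while the participant is still in the clinic matches the
first step in Table 17.6. Checks of this kind cannot detect values that are
plausible but wrong, which is why staff training, performance review, and
unscheduled checks of source data remain necessary.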
Collaborative Multicenter Studies
Many research questions require larger numbers of participants than are
available in a single center, and these are often addressed in collaborative
studies carried out by research teams that work in several locations. Sometimes
these are all in the same city or state, and a single investigator can oversee all
the research teams. Often, however, collaborative studies are carried out by
investigators in cities thousands of miles apart with separate funding and
administrative and regulatory structures.
Multicenter studies of this sort require special steps to ensure that all centers
are using the same study procedures and producing comparable data that can be
combined in the analysis of the results. A coordinating center establishes a
communication network, coordinates the development of the operations manual,
forms, and other standardized quality control aspects of the trial, trains staff
at each center who will make the measurements, and oversees data management,
analysis, and publication. Collaborative studies often have distributed data
entry systems with computers or scanners connected through the Internet.
There is also a need for establishing a governance system with a steering
committee made up of the PIs and representatives of the funding institution,
and with various subcommittees. One subcommittee needs to be responsible for
quality control issues, developing the standardization procedures and the
systems for training, certification, and performance review of study staff. These
tend to be complicated and expensive, providing centralized training for
relevant staff from each center, site visits for performance review and data
audits by coordinating center staff and peers (Appendix 17.2).
In a multicenter study, changes in operational definitions and other study
methods often result from questions raised by a clinical center that are answered
by the relevant study staff or committee and posted on the Internet in a running
list to make sure that everyone involved in the study is aware of changes. If a
significant number of changes accumulate, dated revised pages in the operations
manual should be prepared that include these changes. Small single site studies
can follow a simpler pattern, making notes about changes that are dated and
retained in the operations manual.
A Final Thought
A common error in research is the tendency to collect too much data. The fact
that the baseline period is the only chance to measure baseline variables leads
to a desire to include everything that might conceivably be of interest, and there
is a tendency to have more follow-up visits and collect more data at them than is
useful. Investigators tend to collect far more data than they will ever analyze or
publish.
One problem with this approach is the time consumed by measuring less
important things; participants become tired and annoyed, and the quality of
more important measurements deteriorates. Another problem is the added size
and complexity of the database, which makes quality control and data analysis
more difficult. It is wise to question the need for every variable that will be
collected and to eliminate many that are optional. Including a few intentional
redundancies can improve the validity of important variables, but parsimony is
the general rule.
Summary
1. Successful study implementation begins with assembling resources
including space, staff, and budget for study start-up.
2. The next task is to finalize the protocol through a process of pretesting
and pilot studies of the appropriateness and feasibility of plans for
recruitment, measurements, interventions, and outcome
ascertainment in an effort to minimize the need for subsequent protocol
revisions once data collection has begun.
3. Minor protocol revisions after the study has begun, such as adding an
item to a questionnaire or modifying an operational definition, are relatively
easily accomplished, though IRB approval may sometimes be required
and data analysis may be affected.
4. Major protocol revisions after the study has begun, such as a change in
the nature of the intervention or primary outcome, have major
implications and should be undertaken reluctantly and with the approval
of key bodies such as the DSMB, IRB, and funding institution.
5. The study is then carried out with a systematic approach under the
supervision of a quality control coordinator, following the principles of
GCP, and including:
a. Quality control for clinical procedures: operations manual, staff
training and certification, performance review, periodic reports
(on recruitment, visit adherence, and so forth), and team meetings.
b. Quality control for laboratory procedures: blinding and
systematically labeling specimens taken from study participants, and
using standard pools and blinded duplicates.
c. Quality control of the data management: designing forms and
electronic systems to enable oversight of the completeness,
accuracy, and integrity of collecting, editing, entering, and analyzing
the data.
6. Collaborative multicenter studies have special systems for managing the
study and quality control.
Appendix
Appendix 17.1
Example of an Operations Manual Table of Contents¹
Chapter 1. Study protocol
Chapter 2. Organization and policies
    Participating units (clinical centers, laboratories, coordinating center, etc.)
    Administration and governance (committees, funding agency, safety and
    data monitoring, etc.)
    Policy concerns (publications and presentations, ancillary studies, conflict of
    interest, etc.)
Chapter 3. Recruitment
    Eligibility and exclusion criteria
    Sampling design
    Recruitment approaches (publicity, referral contacts, screening, etc.)
    Informed consent
Chapter 4. Clinic visits
    Content of the baseline visit
    Content and timing of follow-up visits
    Follow-up procedures for nonresponders
Chapter 5. Randomization and blinding procedures
Chapter 6. Predictor variables
    Measurement procedures
    Intervention, including drug labeling, delivery and handling procedures
    Assessment of compliance
Chapter 7. Outcome variables
    Assessment and adjudication of primary outcomes
    Assessment and management of other outcomes and adverse events
Chapter 8. Quality control
    Overview and responsibilities
    Training in procedures
    Certification of staff
    Equipment maintenance
    Peer review and site visits
    Periodic reports
Chapter 9. Data management
    Data collection and recording
    Data entry
    Editing, storage, and backup
    Confidentiality
    Analysis plans
Chapter 10. Data analysis
Appendices
    Letters to participants, primary providers, and so on
    Questionnaires, forms
    Details on procedures, criteria, and so on
N.B. This is a model for a large multicenter trial. The manual of operations for
a small study can be less elaborate.
Appendix 17.2
Quality Control Tables and Checklists
I. Tabulations for monitoring performance characteristics.²
  A. Clinic characteristics
    1. Recruitment
      a. Number of participants screened for enrollment; number
         rejected and tabulation of reasons for rejection
      b. Cumulative graph of number recruited compared with that
         required to achieve recruitment goal
    2. Follow-up
      a. Number of completed follow-up examinations for each
         expected visit; number seen within specified time frame
      b. Number of dropouts and participants who cannot be located
         for follow-up
    3. Data quantity and quality
      a. Number of forms completed, number that generated edit
         messages, and number of unanswered edit queries
      b. Number of forms missing
    4. Protocol adherence
      a. Number of ineligible participants enrolled
      b. Summary of data on pill counts and other adherence
         measures by treatment group
  B. Data center characteristics
    1. Number of forms received and number awaiting data entry
    2. Cumulative list of coding and protocol changes
    3. Timetable indicating completed and unfinished tasks
  C. Central laboratory characteristics
    1. Number of samples received and number analyzed
    2. Number of samples inadequately identified, lost, or destroyed
    3. Number of samples requiring reanalysis and tabulation of reasons
    4. Mean and variance of blind duplicate differences, and secular
       trend analyses based on repeat determinations of known
       standards
  D. Reading center characteristics
    1. Number of records received and read
    2. Number of records received that were improperly labeled or had
       other deficiencies (tabulate deficiencies)
    3. Analyses of repeat readings as a check on reproducibility of
       readings and as a means of monitoring for time shifts in the
       reading process
II. Site visit components:
  A. Site visit to clinical center
    1. Private meeting of the site visitors with the PI
    2. Meeting of the site visitors with members of the clinic staff
    3. Inspection of examining and record storage facilities
    4. Comparison of data contained on randomly selected data forms
       with those contained in the computer data file
    5. Review of file of data forms and related records to assess
       completeness and security against loss or misuse
    6. Observation of clinic personnel carrying out specified procedures
    7. Check of operations manuals, forms, and other documents on file
       at the clinic to assess whether they are up-to-date
    8. Observation or verbal walk through of certain procedures (e.g.,
       the series of examinations needed to determine participant
       eligibility)
    9. Conversations with actual study participants during or after
       enrollment as a check on the informed consent process
    10. Private conversations with key support personnel to assess their
        practices and philosophy with regard to data collection
    11. Private meeting with the PI chief concerning identified problems
  B. Site visit to data center
    1. Review of methods for inventorying data received from clinics
    2. Review of methods for data management and verification
    3. Assessment of the adequacy of methods for filing and storing
       paper records received from clinics, including the security of the
       storage area and methods for protecting records against loss or
       unauthorized use
    4. Review of available computing resources
    5. Review of method of randomization and of safeguards to protect
       against breakdowns in the randomization process
    6. Review of data editing procedures
    7. Review of computer data file structure and methods for
       maintaining the analysis database
    8. Review of programming methods both for data management and
       analysis, including an assessment of program documentation
    9. Comparison of information contained on original study forms with
       that in the computer data file
    10. Review of methods for generating analysis data files and related
        data reports
    11. Review of analysis philosophy
    12. Review of methods for backing up the main data file
    13. Review of master file of key study documents, such as
        handbooks, manuals, data forms, minutes of study committees,
        and so on, for completeness
Footnote
² Tables should contain results for the entire study period, and, when
appropriate, for the time period covered since production of the last report.
Rates and comparisons among staff and participating units should be provided
when appropriate.
References
1. Mosca L, Barrett-Connor E, Wenger NK, et al. Design and methods of the
Raloxifene Use for The Heart (RUTH) Study. Am J Cardiol 2001;88:392-395.
2. MORE Investigators. The effect of raloxifene on risk of breast cancer in
postmenopausal women: results from the MORE randomized trial. Multiple
outcomes of raloxifene evaluation. JAMA 1999;281:2189-2197.
3. Information about Good Clinical Practices in FDA Code of Federal
Regulations Title 21.
http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/cfrsearch.cfm,
October, 2006.
4. Information about Good Clinical Practices in the European Medicines
Agency International Conference on Harmonization, at
http://www.google.com/search?hl=en&q=ICH&btnG=Google+Search, or at
http://www.ich.org/, October, 2006.
18
Community and International Studies
Norman Hearst
Thomas E. Novotny
Most clinical research takes place in university medical centers or other
academic institutions. Such sites offer many advantages for conducting
research, including the obvious one of having experienced researchers.
An established culture, reputation, and infrastructure for research
facilitate the work of everyone from novice investigator to tenured
professor. Success breeds more success, thereby concentrating clinical
research in centers of excellence. This chapter, in contrast, deals with
research that takes place outside of such centers.
We define community research as research that takes place outside the
usual university or medical center setting and that is designed to meet
the needs of the communities where it is conducted. International
research, particularly in poor countries, can involve many of the same
challenges of establishing a research program where none existed before.
Community and international research both often involve collaboration
between local investigators and colleagues from an established
research center. Such collaboration can be productive and critical in
solving longstanding or emerging health problems, but it can be
challenging because of physical distance, cultural differences, and
funding constraints.
Why Community and International Research?
Community research is often the only way to address research questions
that have to do with specific settings or populations. Research in
academic medical centers tends to focus on priorities that may be quite
different from those in their surrounding communities, let alone those in
distant places. The "10/90 gap" in health research, in which 90% of
the global burden of disease receives only 10% of global research
investment (1), is ample justification for more collaborative research that
addresses the enormous health problems of low- and middle-income
countries. Furthermore, participation in the research process has benefits
for a community that go beyond the value of the information collected in
a particular study.
Local Questions
Many research questions require answers available only through local
research. National or state level data from central sources may not
accurately reflect local disease burdens or the distribution of risk factors
in the local community. Interventions, especially those designed to
change behavior, may not have the same effect in different settings. For
example, the public health effectiveness of condom promotion as an AIDS
prevention strategy is quite different in the United States than in Africa
(2). Finding approaches that fit local needs requires local research (Table
18.1).
Biologic data on the pathophysiology of disease and the effectiveness of
treatments are usually generalizable to a wide variety of populations and
cultures. But even here there can be racial or genetic differences or
differences based on disease etiology. The efficacy of antihypertensive
drugs is different in patients of African and European descent (3). The
causative agents and patterns of antimicrobial sensitivity for pneumonia
are different in Bolivia and Boston.
Table 18.1 Examples of Research Questions Requiring Local Research
What are the rates of child car seat and seat belt use in a low-income
neighborhood of Chicago?
What are the patterns of antimicrobial resistance of tuberculosis
isolates in Uganda?
What is the impact of a worksite-based AIDS prevention
campaign for migrant farm workers in Texas?
What proportion of coronary heart disease among women in
Brazil is associated with cigarette smoking?

Greater Generalizability
Community research is sometimes useful for producing results that are
more generalizable. For example, patients with back pain who are seen
at referral hospitals are very different from patients who present with
back pain to primary care providers. Studies of the natural history of
back pain or response to treatment at a tertiary care center therefore
may be of limited use for clinical practice in the community.
Partly in response to this problem, several practice-based research
networks have been organized in which physicians from community
settings work together to study research questions of mutual interest
(4). An example is the response to treatment of patients with carpal
tunnel syndrome in primary care practices (5). Most patients improved
with conservative therapy; few required referral to specialists or
sophisticated diagnostic tests. This contrasted with the previous
literature on the disease from academic medical centers, which had
indicated that the majority of patients with carpal tunnel syndrome
require surgery.
Issues of generalizability are also important in international research.
Research findings from one country will not always apply in another.
Although results generalize best to where the research was done, they
may also be relevant for migrant populations that originated in the
country of the research. Such migrant and displaced populations are of
ever increasing importance in a world that had 175 million international
migrants as of the year 2000 (6).
Building Local Capacity
Clinical research should not be the exclusive property of academic
medical centers. The priorities of researchers in these sites are bound to
reflect the issues they encounter in their daily practice or that they
believe are of general scientific or economic
importance. Conducting research in the community setting ensures that
questions of local importance will also be addressed.
The value of community participation in research goes beyond the
specific information collected in each study. Conducting research has a
substantial positive ripple effect by raising local scholarly standards and
encouraging creativity and independent thinking. Each project builds
skills and confidence that allow local researchers to see themselves as
full participants in the scientific process, not just consumers of
knowledge produced elsewhere. This in turn encourages more research.
Furthermore, participating in research can bring intellectual and financial
resources to a community and help encourage local empowerment and
self-sufficiency.
Community Research
In theory, community research is much like any other research. The
general approach outlined in this book applies just as well in a small
town in rural America or Kathmandu as it does in San Francisco or
London. In practice, the greatest challenge is finding experienced
colleagues or mentors with whom to interact and learn. Such help may
not be available locally. This often leads to an important early decision
for would-be local investigators: to work alone or in collaboration with
more established investigators based elsewhere.
Starting on Your Own
Getting started in research without the help of a more experienced
colleague is like teaching oneself how to swim: it is not impossible, but it
is difficult. Sometimes, however, it is the only option. Following a few
rules may make the process easier.
Start simple. It is seldom a good idea to begin research in a
community with a randomized controlled trial. Small descriptive
studies producing useful local data may make more sense: better
a small success than a large failure. More ambitious projects can be
saved for later. For example, a descriptive study of condom use
among young men in Uganda conducted by a novice local researcher
served as a first step toward a larger intervention trial on AIDS
prevention in that community (7,8).
Think of local comparative advantage. What questions can an
investigator answer in his local setting better than anyone else?
This usually means leaving the development of new laboratory
techniques and treatments to the academic medical centers and
drug companies of the world. It is often best for a young
investigator to focus on health problems or populations that are
unusual elsewhere, but common in his community.
Network. As discussed in Chapter 2, networking is important for
any investigator. A new investigator should make whatever contact
he can with scientists elsewhere who are addressing similar
research questions. If formal collaborators are not available, it may
at least be possible to find someone to give feedback on a draft of a
research protocol, a questionnaire, or a manuscript. Attending a
scientific conference in one's field of interest is a good way to make
such contacts. Contacts can also be made at a distance through
telephone, letters, and e-mail. Complimenting a person's work (if
not overdone) can be a good way to initiate such a contact.
Col l abor ati ve Resear ch
Because i t i s diffi cul t to get started on one's own, a good way to begi n
research i n a community is often in coll aboration wi th more experienced
researchers based elsewhere. There are two mai n model s for such
coll aborati on: top-down and bottom-up (9).
The top-down model refers to studies that ori gi nate i n an academi c
center and i nvol ve communi ty i nvesti gators i n the recrui tment of patients
and the conduct of the study. Thi s occurs, for exampl e, i n large
mul ti center tri al s that i nvite hospitals and cli ni cs to enrol l patients i nto
an establ ished research protocol. Thi s approach has the great advantage
P.294
...
that it comes wi th bui lt-in senior coll aborators who are usual l y
responsi ble for obtai ning the necessary resources and cl earances to
conduct the study.
Although one can gain valuable experience through this sort of
collaboration, opportunities to develop as a researcher may be limited.
Just as important, the potential benefit to one's community may be no
greater than if the study had been done elsewhere. Once the study is
over, the academic center or drug company may cut off involvement
quickly with little or nothing left behind in the community.
In the bottom-up model, established investigators provide guidance and
technical assistance to local investigators and communities developing
their own research agendas. Some academic medical centers offer
training programs for community investigators or international
researchers. If one can gain access to such a program or establish an
equivalent relationship, this can be ideal for building local research
capacity, especially when such a partnership is sustained on a long-term
basis. But establishing an institutional relationship of this type is not
easy. Supporting bottom-up community research can be time consuming
and therefore expensive. Most funding agencies are more interested in
sponsoring specific research projects than in building local research
capacity. Even when funding to cover expenses is available, experienced
investigators may prefer to spend their time conducting their own
research rather than helping others get started.
Community researchers need to take advantage of the potential
incentives they can offer to more established investigators with whom
they would like to work. In the top-down model, the most important
thing they can offer is access to subjects. In the bottom-up model, the
incentives can include the intrinsic scientific merit of a study in the
community, coauthorship of resulting publications, and the satisfaction of
helping a less experienced colleague in a worthwhile endeavor.
To start a new research program, the ideal option may be to form a
long-term partnership with an established research institution. Collaboration
under such a structure can include a combination of top-down and
bottom-up projects. It must be remembered, however, that good
research collaboration is fundamentally between individual investigators.
An academic institution may provide the climate, structure, and
resources that support individual collaboration, but the individuals
themselves must provide the cultural sensitivity, mutual respect, hard
work, and long-term commitment to make it work.
International Research
International research often involves collaboration between groups
with different levels of experience and resources and therefore is subject
to many of the same issues as community research. However,
international research brings additional challenges. The issues
described below are especially important.
Barriers of Distance, Language, and Culture
Because of the distances involved, opportunities for face-to-face
communication between international colleagues are limited. If at all
possible, colleagues on both sides should make at least one site visit to
each other's institutions. International conferences may sometimes
provide additional opportunities to meet, but such opportunities are
likely to be rare. Fortunately, wireless communications, faxes, and e-mail
(perhaps with voice-over capability and wide-band video capacity) have
made international communication easier, faster, and less expensive.
Good communication is possible at any distance, but it requires effort on
both sides. The most modern methods of communication are of no help if
they are not used regularly. Lack of frequent communication and prompt
response to queries made on either side is a sign that a long-distance
collaboration may be in trouble.
Language differences are often superimposed on the communication
barriers caused by distance. If the first language spoken by investigators
at all sites is not the same, it is important that there be a language that
everyone can use. Expecting all interactions to be in English places
investigators in poor countries at a disadvantage. Foreign investigators
who do not speak the local language are unlikely to have more than a
superficial understanding of the country's culture and cannot participate
fully in many key aspects of a study, including questionnaire
development and conversations with study subjects and research
assistants. This is especially important in studies with behavioral
components.
Even when linguistic barriers are overcome, cultural differences can
cause serious misunderstandings between investigators and their
subjects or between investigators. Literal word-by-word translations of
questionnaires may have different meanings, be culturally inappropriate,
or omit key local factors. Institutional norms may be different. For
example, in some settings, a foreign collaborator's department chief who
had little direct involvement in a study might expect to be first author of
the resulting publication. Such issues should be anticipated and clearly
laid out in advance as part of the important process of gaining high-level
local institutional support for the project. Patience, good will, and
flexibility on all sides can usually surmount problems of this type. For
larger projects, it may be advisable to include an anthropologist, ethicist,
or other expert on cultural issues as part of the research team.
Frequent, clear, and open communication and prompt clarification of any
questions or confusion are essential. When dealing with cultural and
language differences, it is better to be repetitive and risk stating the
obvious than to make incorrect assumptions about what the other person
thinks. Written affiliation agreements that spell out mutual
responsibilities and obligations may help clarify issues such as data
ownership, authorship order, publication rights, and decisions regarding
the framing of research results. Development of such agreements
requires the personal and careful attention of collaborators from both
sides.
Issues of Funding
Because of economic inequities, collaboration between institutions in rich
and poor countries is generally only possible with funding originating
from the rich country or, less often, from other rich countries or
international organizations. An increasing number of large donor
organizations are active in global health research, but often their support
is limited to a specific research agenda. Donor funding tends to flow
through the institution in the rich country, reinforcing the subordinate
position of institutions in poor countries. As in any situation with an
unequal balance of power, this creates a potential for exploitation.
When investigators from rich countries control the purse strings, it is
not uncommon for them to treat their counterparts in poor countries
more like employees than colleagues. International donors and funding
agencies need to be especially careful to discourage this and instead to
promote true joint governance of collaborative activities.
Different practices of financial management are another potential area
for conflict between cultures. Institutions in rich countries may attempt
to impose accounting standards that are difficult or impossible to meet
locally. Institutions in poor countries may load budgets with computers
and other equipment that they expect to keep after the study is over.
Although this is understandable given their needs and lack of alternative
funding sources, it is important that any subsidies beyond the actual cost
of conducting the research be clearly negotiated and that the potential
for diversion of funds by institutions or individuals be minimized.
Conversely, institutional overheads and higher investigator salaries often
create the inequitable situation of the majority of funding for
collaborative research staying in the rich country even when most of the
work is in the poor country.
Source country institutions and donors should pay particular attention to
building the research administration capacity of local partners. This
could mean providing administrative and budgetary training or using
consultants in the field to help with local administrative tasks. Effort
invested in developing administrative capacity may pay off in improved
responsiveness to deadlines, more efficient reporting, avoiding
unnecessary conflict, and building a solid infrastructure for future
research.
Ethical Issues
International research raises ethical issues that must be faced
squarely. All the general ethical issues for research apply (Chapter 14).
Because international research presents an enhanced potential for
exploitation, it also requires additional considerations and safeguards.
What, for example, is the appropriate comparison group when testing
new treatments in a poor country where conventional treatment is
unavailable? Placebo controls are unethical when other effective
treatments are the standard of care in a community. But what is the
standard of care in a community where most people are too poor
to afford proven treatments? On the one hand, it may not be possible for
investigators to provide state-of-the-art treatment to every participant in
a study. On the other hand, allowing placebo controls simply because
people are poor may encourage drug companies and others to test their
new treatments in poor countries without proper protections and benefits
for volunteers. Studies in poor countries of expensive antiretroviral drugs
have drawn new attention to these concerns (10,11).
A related issue has to do with testing treatments that, even if proven
effective, are unlikely to be economically accessible to the population of
the host country. Are such studies ethical, even if they follow all the
usual rules? If not, what proportion of study subjects should be able to
afford the new treatment to make the study ethical? These questions do
not have simple answers. Established international conventions governing
ethical research, such as the Declaration of Helsinki, have been
challenged and are subject to multiple interpretations (12,13).
A key test may be to consider why the study is being conducted in a poor
country in the first place. If the true goal is to gather information to help
the people of that country, this should weigh in favor of the study.
Ideally, the goal of research should be sustainable change and added
value for the host country (14). If, on the other hand, the goal is
expediency or to avoid obstacles to doing the study in a rich country, the
study should be subject to all ethical requirements that would apply in
the sponsoring country.
For this and other reasons, studies in poor countries that are directed or
funded from elsewhere should be approved by ethical review boards in
both countries. Although such approval is necessary, it does not
guarantee that a study is ethical. Systems for ethical review of research
in many poor countries are weak or nonexistent and can sometimes be
manipulated by local investigators or politicians who stand to benefit
from a study. Conversely, review boards in rich countries are sometimes
ignorant of or insensitive to the special issues involved in international
research. Official approval does not remove the final responsibility for
the ethical conduct of research from the investigators themselves.
Less talked about but also important are ethical issues in the treatment
of collaborators from poor countries. Several issues must be agreed
upon in advance. Who owns the data that will be generated? Who needs
whose permission to conduct and publish analyses? Will local
investigators get the support they need to prepare manuscripts for
international publication without having to pay for this by giving up first
authorship? How long a commitment is being made on both sides? A
large recent trial in several poor countries of voluntary counseling and
testing to prevent HIV infection abruptly dropped its collaborating site in
Indonesia (15). According to the investigators, this was because the
outcome variable of interest (HIV seroconversion) turned out to be less
common at that site than projected in the study's power calculations.
Although this decision made practical sense, it was perceived by the
Indonesians as a breach of faith.
Other ethical issues may have to do with local economic and political
realities. For example, a planned clinical trial of pre-exposure HIV
prophylaxis with tenofovir for commercial sex workers recently was
cancelled although it had been cleared by multinational ethical review
boards (16). The intended study subjects were concerned that they might
end up with no source of medical care for problems related to HIV
infection or drug effects and were not willing to participate without
guarantees of lifetime health insurance. The Prime Minister of the
country intervened to stop the trial.
Finally, an explicit goal of all international collaboration should be to
increase local research capacity. What skills and equipment will the
project leave behind when completed? What training activities will take
place for project staff? Will local researchers participate in international
conferences? Will this be only for high-level local investigators who
already have many such opportunities, or will junior colleagues have a
chance as well? Will the local researchers be true collaborators and
principal authors of publications, or are they simply being hired to collect
data? Scientists in poor countries should ask and expect clear answers to
these questions. As summarized in Table 18.2, good communication and
long-term commitment are recurring themes in successful international
collaborative research.
Risks and Frustrations
Researchers from rich countries who contemplate becoming involved in
international research need to start with a realistic appreciation of the
difficulties and risks involved. Launching such work is usually a long,
slow process. Bureaucratic obstacles are common on both ends. In
countries that lack infrastructure and political stability, years of work can
be vulnerable to major disruption from natural or manmade
catastrophe. In extreme cases, these can threaten the safety of project
staff or investigators. For example, important collaborative AIDS
research programs that had been built over many years were completely
destroyed by recent civil wars in Rwanda and the Congo.
Less catastrophic and more common are the daily hardships and health
risks that expatriate researchers may face, ranging from unsafe water
and malaria to smog, common crime, and traffic accidents. Key
requirements to help ensure the safety of researchers working abroad include
evacuation and repatriation insurance, good pre-travel health advice,
registration with their Embassy or Consulate in the host country, and
awareness of any special risks or instability in the host country. The US
Centers for Disease Control and Prevention website provides excellent
health information (http://www.cdc.gov/travel/), and the US Department
of State website provides up-to-date information on registration,
evacuation, and safety for international travelers
(http://www.state.gov/travelandbusiness/).
Another frustration for researchers in poor countries is the difficulty in
applying their findings. Even when new strategies for preventing or
treating disease can be successfully developed and proven to be
effective, lack of political will and resources often thwarts their
widespread application. Researchers need to be realistic in their
expectations, gear their work toward investigating strategies that would
be feasible to implement if found effective, and be prepared to act as
advocates for improving the health of the populations they study.
Table 18.2 Strategies to Improve International Collaborative Research
Scientists in poor countries
  Choose collaborators carefully
  Learn English (or other language of collaborators)
  Become familiar with the international scientific literature in area of study
  Be sure that collaboration will build local research capacity
  Clarify administrative and scientific expectations in advance
Scientists in rich countries
  Choose collaborators carefully
  Learn the local language and culture
  Be sensitive to local ethical issues
  Encourage local collaboration in all aspects of the research process
  Clarify administrative and scientific expectations in advance
Funding agencies
  Set funding priorities based on public health need
  Encourage true collaboration rather than a purely top-down model
  Recognize the importance of building local research capacity
  Make subsidies for local equipment and infrastructure explicit
  Be sure that overhead and high salaries in the rich country do not take too much of the budget
The Rewards
Despite the difficulties, the need for more health research in many parts
of the world is overwhelming. Many investigators in major academic
centers wonder how much difference their work really makes. It often
seems that there are plenty of other qualified people who could do the
job just about as well as they do. By participating in international
research, an investigator in a rich country can sometimes have a far
greater and more immediate impact on people's lives than would be
possible by staying within the walls of his own university. This impact
comes not only from the research itself but also from doing one's own
small part to foster international collaboration.
Many of the potential problems with international research have their
positive aspects. Although funding is harder to obtain, the same amount
of money can go much further (17). Cross-cultural collaboration is as
rewarding as it is difficult. The chance to have meaningful involvement
and make a real contribution in a foreign land is a rare privilege.
Furthermore, it can teach unexpected lessons that enrich careers and
lives. All stand to gain through increased collaboration and expanding the
traditional settings for research.
Summary
1. Community and international research is necessary to discover
regional differences in such things as the epidemiology of a
disease or the cultural factors that determine which interventions
will be effective.
2. Local participation in clinical research can have secondary
benefits to the region such as enhanced levels of scholarship and
self-sufficiency.
3. Although the theoretical issues involved in research are broadly
applicable, practical issues such as acquiring funding and
mentoring are more difficult in a community setting; tips for
success include starting small, thinking of local advantages, and
networking.
4. Collaboration between academic medical centers and community
researchers can follow a top-down model (community
investigators conduct studies that originate from the academic
center) or a bottom-up model (investigators from the academic
center help community investigators conduct their own research).
5. International research involves many of the same issues as
community research with additional challenges related to
communication and language, cultural differences, funding,
unequal balance of power, financial and administrative
practices, and ethics.
6. Overcoming these challenges can bring the rewards of helping
people in need, a large public health impact, and rich
cross-cultural experiences.
References
1. Stevens P. Diseases of poverty and the 10/90 gap. November
2004. Available at:
http://www.fightingdiseases.org/pdf/Diseases_of_Poverty_FINAL.pdf
2. Hearst N, Chen S. Condom promotion for AIDS prevention in the
developing world: is it working? Stud Fam Plann 2004;35(1):39–47.
3. Drugs for hypertension. Med Lett Drugs Ther 1999;41:23–28.
4. Nutting PA, Beasley JW, Werner JJ. Practice-based research
networks answer primary care questions. JAMA 1999;281:686–688.
5. Miller RS, Ivenson DC, Fried RA, et al. Carpal tunnel syndrome in
primary care: a report from ASPN. J Fam Pract 1994;38:337–344.
6. United Nations Population Division. The international migrant
stock: a global view. 2002. Available at:
http://www.iom.int/documents/officialtxt/en/unpd%5Fhandout.pdf
7. Kamya M, McFarland W, Hudes ES, et al. Condom use with casual
partners by men in Kampala, Uganda. AIDS 1997;11(Suppl
1):S61–S66.
8. Kajubi P, Kamya MR, Kamya S, et al. Increasing condom use
without reducing HIV risk: results of a controlled community trial in
Uganda. J Acquir Immune Defic Syndr 2005;40(1):77–82.
9. Hearst N, Mandel J. A research agenda for AIDS prevention in the
developing world. AIDS 1997;11(Suppl 1):S1–S4.
10. Lurie P, Wolfe SM. Unethical trials of interventions to reduce
perinatal transmission of the human immunodeficiency virus in
developing countries. N Engl J Med 1997;337:853–856.
11. Perinatal HIV Intervention Research in Developing Countries
Workshop Participants. Science, ethics, and the future of research
into maternal-infant transmission of HIV-1. Lancet
1999;353:832–835.
12. Brennan TA. Proposed revisions to the Declaration of Helsinki:
will they weaken the ethical principles underlying human research? N
Engl J Med 1999;341:527–531.
13. Levine RJ. The need to revise the Declaration of Helsinki. N Engl
J Med 1999;341:531–534.
14. Taylor D, Taylor CE. Just and lasting change: when communities
own their futures. Baltimore, MD: JHU Press, 2002.
15. Kamenga MC, Sweat MD, De Zoysa I, et al. The voluntary HIV-1
counseling and testing efficacy study: design and methods. AIDS
Behav 2000;4:5–14.
16. Page-Shafer K, Saphonn V, Sun LP, et al. HIV prevention
research in a resource-limited setting: the experience of planning a
trial in Cambodia. Lancet 2005;366(9495):1499–1503.
17. Chequer P, Marins JRP, Possas C, et al. AIDS research in Brazil.
AIDS 2005;19(Suppl 4):S1–S3.
19
Writing and Funding a Research Proposal
Steven R. Cummings
Stephen B. Hulley
The protocol is the detailed written plan of the study. Writing the protocol forces
the investigator to organize, clarify, and refine all the elements of the study, and
this enhances the scientific rigor and the efficiency of the project. Even if the
investigator does not require funding for a study, a protocol is necessary for guiding
the work. A proposal is a document written for the purpose of obtaining funds from
granting agencies. It contains the study protocol, the budget, and other
administrative and supporting information that is required by the specific agency or
board. This chapter will focus on the structure of a proposal and on how to write
one that will be successful in getting funded.
Writing Proposals
The task of preparing a proposal generally requires several months of
organizing, writing, and revising. The following steps can help the project to get off
to a good start.
Decide where the proposal will be submitted. Every funding agency has
its own unique process and requirements for proposals. Therefore, the
investigator should start by deciding where the proposal will be submitted,
determining the limit on amounts of funding, and obtaining detailed guidelines
about how to craft the proposal for that particular agency.
Organize a team and designate a leader. Most proposals are written by a
team of several people who will eventually carry out the study. This team may
be small (just the investigator and his mentor) or large (including
collaborators, a biostatistician, a fiscal administrator, and support staff). It is
important that this team include or have access to the main expertise needed
for designing and implementing the study.
One member of the team must assume the responsibility for leading the effort.
Generally this individual is the principal investigator (PI), who will have the
ultimate authority and accountability for the study. The PI should generally be
an experienced scientist whose knowledge and wisdom are useful for design
decisions and whose track record with previous studies increases the likelihood
of a successful study and, therefore, of funding (reviewers give considerable
weight to the value of experience). Some studies also have a Co-PI, often a
junior scientist who will serve as the day-to-day manager of the study and
coordinate the proposal-writing effort. Either the PI or the Co-PI must exert
steady leadership, delegating responsibilities for writing and other tasks,
setting deadlines, conducting periodic meetings of the team, and ensuring that
all the necessary tasks are completed on time.
Follow the guidelines of the funding agency. All funding sources provide
written guidelines that the investigator must carefully study before starting to
write the proposal. This information includes instructions for organizing the
proposal, page limits, information on the amount of money that can be
requested, and elements that must be included in the proposal.
However, these guidelines do not contain all the important information that the
investigator needs to know about the operations and the preferences of the
funding agencies. The NIH and private foundations have scientific
administrators whose job is to help investigators design their proposals to be
more responsive to the agency's funding policies. Early in the development of
the proposal it is a good idea to discuss the plan with an individual at the
agency who can clarify what the agency prefers (such as budgetary limits and
the scope and detail required in the proposal) and confirm that the research
plan is within the bounds of the agency's interests. The initial contact can be
made by e-mail or letter, but a series of telephone calls or even a visit is a
better way to establish a relationship and get information that will lead to a
fundable proposal.
It is useful to make a checklist of the details that are required, and to
carefully review the checklist before sending the proposal. Rejection of an
otherwise excellent proposal for lack of adherence to details is a frustrating
and avoidable experience.
Establish a timetable and meet periodically. A schedule for completing the
writing tasks keeps gentle pressure on team members to meet their obligations
on time. In addition to addressing the scientific components specified by the
funding agency, the timetable should take into account the administrative
requirements of the institution that will sponsor the research. Universities
often require a time-consuming review of the budget and subcontracts before a
proposal can be submitted to the funding agency. Leaving these details to the
end can precipitate a last-minute crisis that damages an otherwise well-done
proposal.
A timetable generally works best if it specifies deadlines for written products
and if each individual participates in setting his own assignments. The
timetable should be reviewed at periodic meetings of the writing team to check
that the tasks are on schedule and the deadlines still realistic.
Find a model proposal. It is helpful to borrow from a colleague a successful
recent proposal to the agency from which funding is being sought. Successful
applications illustrate in a concrete way the format and content of a good
proposal. The investigator can find inspiration for new ideas from the model
and then design and write a proposal that is even clearer, more logical, and
more persuasive. It is also a good idea to borrow examples of written
criticisms that have been provided by the agency for previous successful or
unsuccessful proposals. This will illustrate the key points that are important to
the scientists who will be reviewing the proposal.
NIH proposals of interest can be identified by using the Internet to search the
NIH CRISP (Computer Retrieval of Information on Scientific Projects) database
of funded grants. Copies of funded proposals can be obtained by writing to the
PI or, as a last resort, through the Freedom of Information Act.
Work from an outline. Begin by setting out the proposal in outline form
(Table 19.1). This provides a starting point for writing and is useful for
organizing the tasks that need to be done. If several people will be working on
the grant, the outline helps in assigning responsibilities for writing parts of the
proposal. One of the most common roadblocks to creating an outline is the
feeling that an entire plan must be worked out before starting to write the first
sentence. The investigator should put this notion aside and let his thoughts
flow onto paper, creating the raw material for editing, refining, and getting
specific advice from colleagues.
Review, pretest, and revise repeatedly. Writing a proposal is an iterative
process; there are usually many versions, each reflecting new ideas, advice,
and pretest experiences. Before the final draft is written, the proposal should
be critically reviewed by colleagues who are familiar with the subject matter
and funding agency. Particular attention should go to the quality of the
research question, the validity of the design and methods, and the clarity of
the writing. It is better to have sharp and detailed criticism before the
proposal is submitted than to have the project rejected because of failure to
anticipate and address potential problems. When the proposal is nearly ready
for submission, the final step is to review it carefully for internal consistency,
format, adherence to agency guidelines, and typographical errors.
Elements of a Proposal
The most important elements of a proposal are set out in Table 19.1 in the
sequence required by the NIH. Some funding institutions may require less
information or a different format, and the investigator should organize the proposal
according to the guidelines of the agency that will receive the proposal (generally
available on the Web).
The Beginning
The title should be descriptive and concise. It provides the first impression and a
lasting reminder of the content and design of the study. A good title manages to
summarize these elements, achieving brevity by avoiding unnecessary phrases like
"A study to determine the...". In an NIH grant application, the choice of
words in the title is important because it can influence the decision on which study
section (review group) and institute will receive the protocol.
The abstract is a concise summary of the protocol that should begin with the
research question and rationale, then set out the design and methods, and conclude
with a statement of the importance of potential findings of the study. Most agencies
require that the abstract be kept within a limited number of words, so it is best to
use efficient and descriptive terms. The abstract will generally be written after the
other protocol elements are settled, and it should go through enough revisions to
ensure that it is first rate. This will be the only page read by some reviewers, and a
convenient reminder of the specifics of the proposal for everyone else. It must
therefore stand on its own, incorporating all the main features of the proposed
study and persuasively revealing the strengths.
Table 19.1 Main Elements of a Proposal, Based on the NIH Model
Title
Abstract
Administrative parts
    Budget and budget justification
    Biosketches of investigators
    Resources, equipment, physical facilities
Specific aims
Background and significance
Preliminary studies and experience of the investigators
Methods
    Overview of design
    Study subjects
        Selection criteria
        Design for sampling
        Plans for recruitment
    Measurements
        Main predictor variables (intervention, if an experiment)
        Potential confounding variables
        Outcome variables
    Statistical issues
        Approach to statistical analyses
        Hypotheses, sample size, and power
    Quality control and data management
    Timetable and organizational chart
    Limitations and issues
Ethical considerations
References
Appendices and collaborative agreements
The Administrative Parts
Almost all agencies require an administrative section that includes a budget and a
description of the qualifications of personnel and the institution and access to
equipment, space, and expertise.
The budget section is generally organized according to guidelines from the funding
institution. The NIH, for example, has a prescribed format that requires a detailed
budget for the first 12-month period and a summary budget for the entire proposed
project period (usually 3 to 5 years). The detailed 12-month budget includes the
following categories of expenses: personnel (including names and positions of all
persons involved in the project, the percent of time each will devote to the project,
and the dollar amounts of salary and fringe benefits listed separately for each
individual); consultant costs; equipment (itemized); supplies (itemized); travel
(itemized); patient care costs; alterations and renovations; consortium/contractual
costs; and other expenses (e.g., the costs of telephones, mail, copying, illustration,
publication, books, and fee-for-service contracts).
The budget should not be left until the last minute. Many elements require time (to
get good estimates of the cost of space, equipment, and personnel). The best
approach is to notify a knowledgeable administrator as soon as possible about the
plan to submit a proposal and schedule regular meetings with him to review
progress and a written timeline for finishing the administrative section. An
administrator can begin working as soon as the outline of the proposal is
formulated, recommending the amounts for budget items. Institutions have
regulations that must be followed and deadlines to meet, and an experienced
administrator can help the investigator anticipate institutional rules, pitfalls, and
potential delays. The administrator can also be very helpful in drafting the text of
the sections on budget and resources, and in collecting the biosketches, appendices,
and other supporting materials.
The need for the amounts requested for each item of the budget must be fully
explained in a budget justification. Salaries will generally comprise most of the
overall cost of a typical clinical research project, so it is important to show the need
for each person and his effort. Carefully conceived job descriptions for the
investigators and other members of the research team should leave no doubt in the
reviewers' minds that the estimated effort of each individual is essential to the
success of the project.
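Because salary and fringe benefits are listed separately for each individual, the
personnel lines reduce to simple arithmetic on percent effort. The sketch below is
only an illustration of that arithmetic: the roles, salaries, effort percentages, and
fringe rates are hypothetical values chosen for the example, and the names Person,
personnel_line, and personnel_total are not part of any agency's forms.

```python
from dataclasses import dataclass


@dataclass
class Person:
    role: str
    annual_salary: float   # full-time base salary in dollars (hypothetical)
    percent_effort: float  # percent of time devoted to the project, 0 to 100
    fringe_rate: float     # fringe benefits as a fraction of salary (hypothetical)


def personnel_line(p: Person) -> tuple[float, float]:
    """Return (salary requested, fringe requested) for one person."""
    salary = p.annual_salary * p.percent_effort / 100
    fringe = salary * p.fringe_rate
    return salary, fringe


def personnel_total(people: list[Person]) -> float:
    """Sum salary plus fringe over everyone listed in the budget."""
    return sum(sum(personnel_line(p)) for p in people)


# Hypothetical team for the first 12-month budget period.
team = [
    Person("Principal investigator", 150_000, 30, 0.25),
    Person("Study coordinator", 60_000, 100, 0.30),
    Person("Statistician", 100_000, 10, 0.25),
]

for p in team:
    salary, fringe = personnel_line(p)
    print(f"{p.role}: {p.percent_effort:.0f}% effort, "
          f"salary ${salary:,.0f}, fringe ${fringe:,.0f}")
print(f"Personnel total: ${personnel_total(team):,.0f}")
```

Laying the numbers out this way makes it easy to check that each person's listed
effort matches the job description given in the budget justification.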
Reviewers often look at the percentages of time committed by key members of the
project. Occasionally, proposals may be criticized because key members of the
research team have only a very small (5%) commitment of time listed in the budget
and a large number of other studies listed in their "other support" (implying
that they have too many other commitments to be able to devote the necessary
energy to the proposed study). On the other hand, the reviewers may also balk at
percentages that are inflated beyond the requirements of the job description.
Even the best-planned budgets will change as the needs of the study change or
there are unexpected expenses and savings. In general, once the grant is awarded
the investigator is allowed to spend money in different ways from those specified in
the budget, provided that the changes are modest and the expenditures are all
appropriate to the study. When the investigator wants to move money across
categories or to make a substantial change (up or down) in the effort of key
investigators, he may need to get approval from the funding agency. Agencies
generally approve reasonable requests for rebudgeting so long as the investigator is
not asking for an increase in total funds.
The biosketches of investigators are four-page resumes that include academic
degrees, current and previous employment, honors, recent and pertinent
publications, and descriptions of recent research grants and contracts. The sections on
resources available to the project, including computer and technical equipment and
office and laboratory space, often draw on boilerplate in previous grants by
colleagues in the investigator's institution.
Aims and Significance
The specific aims are statements of the research question and plan using a concise
format that specifies in concrete terms the desired outcome. When appropriate,
aims may be expressed as testable hypotheses. Most research proposals have
several, and after an introductory paragraph these should be presented in a logical
sequence. Sometimes this means putting them in order of importance, and
sometimes in chronological order (objectives served by baseline data first, then
those related to follow-up). Sometimes, as in the following example, a logical
approach is to present the administrative aims first, then the scientific aims:
1. To recruit 400 healthy men, 40 to 59 years old, into a randomized blinded trial
of the effects of a testosterone patch.
2. To test the hypothesis that, compared with men assigned to receive a placebo
patch, those assigned to receive the testosterone patch will have
a. less bone loss
b. an increase in quadriceps muscle strength
c. a decreased risk of falling.
The specific aims section can also serve as an outline for organizing later sections;
the components of the significance and methods sections should usually follow a
parallel sequence.
When a study has many facets, it is tempting to impress the reader with a long and
detailed listing of specific aims. This strategy may backfire, creating a proposal that
is overly ambitious or cluttered. When numerous specific aims are possible, it is
best to propose only the most important and interesting ones. In general, the
specific aims section should not exceed one page.
The background and significance section sets the proposal in context, describing
the background in the field under study. It should be written, as much as possible,
in a way that is comprehensible to someone who is not an expert in that field.
Enough information should be given to make clear what this particular study will
accomplish and why it is important. How, specifically, will the study findings
advance understanding, change clinical practice, or influence policy?
The purpose of this section is to demonstrate that the investigator understands
what has been accomplished, what the problems are, and what needs to be done.
The appropriate breadth or detail of the review depends on the scope of the specific
aims, the complexity of the field, and the expectations of the review panel.
Reviewers usually appreciate a thoughtful critical review of the most important
previous studies rather than an exhaustive superficial catalogue of previous
publications.
The preliminary studies and experience of the investigators section should
concisely describe relevant previous research and skills of the investigator; a
limited number of preprints can be included in appendices. Emphasis should be
placed on the importance of the previous work and on the reasons it should be
continued or extended. Pilot studies that support the research question and the
feasibility of the study are important to many types of proposals, especially when
the research team has little previous experience in the area to be studied, when the
question is novel, and when there may be doubts about the feasibility of the
proposed procedures or recruitment of subjects. Results of these studies should be
highlighted here, with details provided in the appendices.
The Scientific Methods
The methods section generally receives close scrutiny from reviewers, and it will
later serve as the basis for the operations manual for carrying out the study.
Weakness in the technical methods is a common reason that proposals fail to be
approved or funded by the NIH. For these reasons, this section deserves careful
attention to detail.
The first concern is how to organize the section. Sometimes agencies provide
guidelines about how to organize the methods. If not, we recommend the
components and sequence listed in Table 19.1. A detailed table of the contents of
the methods section can be very helpful at this point, and an overview of the
design, sometimes accompanied by a schematic diagram or table, is essential for
orienting the reader (Table 19.2).
The other specific components of the methods section have been discussed in other
parts of this book. The subjects and measurements (Chapters 3 and 4), pretest
plans, data management, and quality control (Chapters 16 and 17) are the
centerpiece of the proposal, and require sufficient detail so that sophisticated
reviewers will understand exactly how the study will be performed and the reasons
for the design choices. Long descriptions of some techniques, such as the details of
biochemical assays or of questionnaires, can be put into an appendix unless they
are crucial to the evaluation of scientific merit.
The statistical section should usually begin with the plans for analysis. These can
be set out in a logical sequence: first the descriptive tabulations, and then the
approach to analyzing associations among variables. This will lead to the topic
of sample size (Chapters 5 and 6), which should begin with a statement of the null
hypotheses and the choice of statistical test before giving the sample size and
power estimates at the specified alpha and effect size. Most NIH review panels
attach considerable importance to the statistical section, so it is a good idea to
involve a statistician in writing, or at least in reviewing, this component of the
proposal.
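To make the link between alpha, power, effect size, and sample size concrete, the
following is a minimal sketch rather than a recommended method. It assumes a
two-sided comparison of two group means, uses the standard normal approximation
instead of the t distribution, and expresses effect size in standard deviation units.
The function name n_per_group and the example values are illustrative only; a real
proposal should follow the approach in Chapters 5 and 6 and the advice of a
statistician.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects needed per group for a two-sided, two-sample
    comparison of means (normal approximation).

    effect_size is the expected difference in means divided by the
    standard deviation of the outcome.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()                        # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)      # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)               # value corresponding to power = 1 - beta
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Example: detect a difference of half a standard deviation
# with alpha = 0.05 (two-sided) and power = 0.80.
print(n_per_group(0.5))  # 63 per group under the normal approximation
```

Halving the effect size roughly quadruples the required sample size, which is why
the assumed effect size deserves explicit justification in the proposal.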
Table 19.2 Study Timeline for a Randomized Trial of the Effect of Testosterone
Administration on Risk Factors for Heart Disease, Prostate Cancer, and Fractures
Visits: Screening Visit, Randomization, 3 Months, 6 Months, 12 Months
Medical history: X X
Blood pressure: X X X X X
Prostate examination: X X
Prostate specific antigen: X X
Blood lipid levels: X X X X
Markers of inflammation: X X
Bone density: X X
Markers of bone turnover: X X X
Handgrip strength: X X X X
Adverse events: X X X
The proposal must provide a realistic work plan and timetable, including dates
when each major phase of the study will be started and completed (Fig. 19.1).
Similar timetables can be prepared for staffing patterns and other components of
the project. For large studies, an organizational chart describing the research team
should indicate levels of authority and accountability, and show how the team will
function.
FIGURE 19.1. A hypothetical timetable.
Final Pieces
The human subjects section is devoted to the ethical issues raised by the study,
setting forth the issues of safety, privacy, and confidentiality. This section should
indicate the specific plans to inform potential subjects of the risks and benefits, and
to obtain their consent to participate (Chapter 14). It is an appropriate place to
describe the inclusion of women, children and participants from minority groups, as
required of NIH proposals, expanding on information provided in the methods
section.
The references send a message about the investigator's familiarity with a field.
They should be comprehensive but parsimonious, up to date and balanced, not an
exhaustive and unselected list. Each reference should be cited accurately; errors in
these citations or misinterpretation of the work will be viewed negatively by
reviewers who are familiar with the field of research.
Information that is important for all reviewers to understand about the research
plan should generally not be put in an appendix; in NIH study sections only the
primary and secondary reviewers (usually about three) receive the appendices.
However, the appendices are useful for detailed technical and supporting material
that can be mentioned or described briefly in the main text. Examples are highly
relevant preprints or in-progress reports by the investigators, questionnaires, and
long descriptions of measurements that might be a useful reference for a reviewer.
The proposed use and value of each consultant should be described, accompanied
by a signed letter of agreement from the individual and a copy of his biosketch.
(Investigators with effort listed in the budget should not provide letters, because
they are officially part of the proposal.) An explanation of the programmatic and
administrative arrangements between the applicant organization and collaborating
institutions, labs, and so on should be included, accompanied by letters of
commitment from responsible officials addressed to the investigator.
Writing Proposals for Career Development Awards
The research plan is only one element of proposals for career development
awards. These proposals emphasize descriptions of the candidate and his strategy
for developing a career in research, including plans for training in research. They
generally require evidence of commitment from a mentor who has a strong track
record in research and mentoring, and from the applicant's institution (1). The
requirements and criteria for review of applications for NIH career development
awards are available on the NIH Web site.
Characteristics of Good Proposals
A good proposal for a research project has several attributes. First is the
scientific quality of the research plan: it must be based on a good research
question, use a design and methods that are rigorous and feasible, and have a
research team with sufficient experience, skill, and commitment to carry it out
(2,3).
Clarity of presentation is one of the most important determinants of the fate of
grant applications. Even if the research question is important and the study plan
excellent, a poor presentation can leave the reviewer confused and uninterested.
The proposal should be concise and engaging, and not lose the attention of the
reviewer with writing that wanders vaguely through peripheral topics. A proposal
that is well organized, thoughtfully written, attractively presented, and free of
errors reassures the reader that the conduct of research is likely to be of similar
quality.
Reviewers are often overwhelmed by a large stack of lengthy proposals, so the
merits of the project must stand out in a way that will not be missed even with a
quick and cursory reading. Clear outlines, short sections with meaningful
subheadings, brief point-by-point summaries, concise tables, and simple
diagrams can guide the reviewer's understanding of the most important features of
the proposal. It is good to leave some white space on the pages.
Most reviewers are sophisticated, and are put off by overstatement and other
heavy-handed forms of grantsmanship. Proposals that exaggerate the importance of
the project or overestimate what it can accomplish will generate skepticism. Writing
with enthusiasm is a good idea, but the investigator should be realistic about the
limitations of the project. Most reviewers are adept at identifying potential problems
in the design or feasibility of a research project.
Rather than ignore potential flaws, an investigator can address them explicitly,
discussing the advantages and disadvantages of the various trade-offs in reaching
the chosen plan. It is a mistake to overemphasize these problems, however, for this
may lead a reviewer to focus disproportionately on the weaker aspects of the
proposal and to overlook its strengths. The goal is to reassure the reviewer that the
investigator has anticipated the potential problems and has a realistic and
thoughtful approach to dealing with them. If the investigator thinks of issues that
are not fully resolved in the main body of the proposal, it may be useful to pose
these as questions with thoughtful and balanced answers in a "Questions and
Issues" section at the end of the methods section.
A final round of scientific review by skilled scientists who have not been centrally
involved, at a point in time when substantial changes are still possible, can be
extraordinarily helpful to the proposal as well as a rewarding collegial experience.
The investigator should leave time in the last week or two before the deadline to
have someone with excellent writing skills read the proposal for clarity,
grammatical errors, and spelling errors that are missed by word-processing spell-
and grammar-check programs.
Finding Support for Research
Investigators should be alert to opportunities to conduct good research without
formal proposals for funding. For example, a beginning researcher may himself
analyze data sets that have been collected by others, or receive small amounts of
staff time from senior scientists or his department to conduct small studies.
Conducting research without formal proposals for funding is generally quicker and
simpler but has the disadvantage that the projects must be inexpensive and limited
in scope. Furthermore, academic institutions often base decisions about
advancement in part on a faculty member's track record of garnering external
funding for research. There are four main sources of funds for medical research:
the government (notably NIH, but also the Centers for Disease Control and
Prevention (CDC), and many other federal, state and county agencies);
private nonprofit institutions (notably foundations and professional
societies);
profit-making corporations (notably pharmaceutical companies); and
intramural resources (e.g., from the investigator's university).
Getting support from these sources is a complex and competitive process that
favors investigators with experience and tenacity, and beginning investigators are
well advised to find a mentor with these characteristics. In the sections below, we
focus on several of the most important of these; for a complete listing, try the
American Association for the Advancement of Science (AAAS) Web site (4).
National Institutes of Health Grants and Contracts
It takes 8 to 10 months from the time a successful application is submitted to NIH
until it receives funding. During this time the application goes through a process of
initial administrative review by NIH staff, advisory peer review, final
recommendation about funding by the Council of an institute, and decision about
funding by the institute director (5). The peer review process, although laborious
and somewhat capricious, is reasonably fair and tends to enhance the quality of
medical research in the same way that journal reviewers enhance the quality of the
medical literature.
The NIH offers many types of grants and contracts (6). The "R" awards (such
as R01 and smaller R03 and R21 awards) support research projects conceived by the
investigator on a topic of his choosing or written in response to a publicized request
by one of the institutes at NIH. The "K" awards (such as K-08 or K-23
awards) support training and development of the careers of junior or midlevel
investigators. An excellent way to begin a research career, K-awards generally
provide substantial support for the young investigator's salary and modest support
for research projects (1).
Institute-initiated proposals are designed to stimulate research in areas
designated by NIH advisory committees, and take the form of either Requests for
Proposals (RFPs) or Requests for Applications (RFAs). Under an RFP, the
investigator contracts to perform certain research activities determined by the NIH.
Under an RFA, the investigator conducts research in a topic area defined by the NIH,
but the specific research question and study plan are proposed by the investigator.
RFPs use the contract mechanism to reimburse the contractor for the costs involved in
achieving the planned objectives, and RFAs use the grant mechanism to support
activities that are more open-ended.
Grant applications are usually reviewed by one of many NIH study
sections. Each of these has a specific focus and is composed of experts in that
area drawn from institutions around the country. A list of the study sections and
their current membership is available on the NIH Web site (5), and many
investigators use this information to make sure their applications will be responsive
to the particular individuals who may provide their peer review. Proposals for
K-awards are usually reviewed by study sections composed of experts in a general
area of research sponsored by the particular NIH Institute. Proposals sent in
response to an RFA or RFP are usually reviewed by ad hoc committees of peers that
follow the same procedures as the study sections in passing on the merits of a
proposal.
When an investigator submits a grant application to the NIH, it is assigned by the
Center for Scientific Review (CSR) to a particular study section (Fig. 19.2). After
review and discussion, each member of the study section assigns a priority score of
1 to 3 for each of the applications judged to be in the upper half. (A few
applications are deferred to the next cycle 4 months later, pending clarification of
points that were unclear, and the rest are not reviewed.) Scoring is done by secret
ballot, and the average is computed and multiplied by 100 to yield a score from
100 (best) to 300 (worst). This score is compared with other scores from the study
section to generate a percentile rank.
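The scoring arithmetic described above can be sketched in a few lines. This is an
illustration only: the member ballots and the set of comparison scores are invented,
and the percentile_rank function uses a simple stand-in definition (the percentage
of comparison scores that are as good or better), which is an assumption and not
the formula the NIH actually applies.

```python
from statistics import mean


def priority_score(member_scores: list[float]) -> int:
    """Average the members' ballots (1 = best, 3 = worst, as described in the
    text) and multiply by 100 to get a score from 100 to 300."""
    if not member_scores:
        raise ValueError("at least one ballot is required")
    if not all(1 <= s <= 3 for s in member_scores):
        raise ValueError("each ballot must be between 1 and 3")
    return round(mean(member_scores) * 100)


def percentile_rank(score: int, comparison_scores: list[int]) -> float:
    """Hypothetical stand-in: percentage of comparison scores that are
    as good as or better than (numerically at or below) this score."""
    as_good_or_better = sum(1 for s in comparison_scores if s <= score)
    return 100 * as_good_or_better / len(comparison_scores)


# Invented ballots from four study section members.
ballots = [1.2, 1.5, 1.4, 1.3]
score = priority_score(ballots)

# Invented scores for the applications scored by the same study section.
section_scores = [120, 135, 150, 180, 210, 240, 260, 290]
print(score, percentile_rank(score, section_scores))
```

Lower numbers are better on both scales, so a low percentile rank places the
application near the front of the funding order.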
The CSR also assigns each grant application to a particular institute at NIH. Each
institute then funds the grants assigned to it, in order of priority score (tempered
by an advisory council review and sometimes overridden by the institute director),
until the budget it has received from Congress is exhausted (Fig. 19.3). If an
application is of interest to more than one institute, the PI should request dual
assignment and the second institute may provide funds if the primary institute
cannot, or the institutes may share funding.
The investigator should decide in advance, with advice from senior colleagues, on
the outcome he prefers for the two key assignments that are made by the CSR: to
a study section and to an institute. Study sections vary a great deal not only in
topic area but also in the stringency and nature of their review, and there is a
considerable difference among institutes in the extent and quality of the
competition. Although the assignments are not fully controllable, the investigator
can influence them by (a) choosing words in the title and abstract that make it
obvious what the best assignment would be; (b) stating his preference in the cover
letter for the application; and (c) asking the NIH scientist in charge of the study
section of choice (the "scientific review administrator") or the NIH scientist who will
handle the grant at the institute of choice for advice on how to steer the
application.
FIGURE 19.2. Overview of NIH and foundation funding sources and mechanisms.
After an application has been reviewed by the appropriate committee, the
investigator receives written notification of the committee's action. This summary
statement includes the score, percentile, and detailed comments and criticisms
from the committee members who reviewed the application.
Applications that are not funded, as is often the case for the first submission, can
be revised and submitted up to two more times. If the reviewers' criticisms suggest
that the application can be made more acceptable to the committee, then a
thoughtfully revised version may have an excellent chance of obtaining funding
when it is resubmitted. An investigator need not automatically make all the changes
suggested by reviewers, but he should adopt revisions that will satisfy the
reviewers' criticisms wherever possible and justify any decision not to do so. A good
format for the introduction of a resubmission is to quote each major criticism from
the summary statement and then state the corresponding response and changes in
the proposal, which should be earmarked in the text.
Grants from Foundations and Professional Societies
Private foundations (such as The Robert Wood Johnson Foundation) generally
restrict their funding to specific areas of interest. Some disease-based foundations
and professional societies (such as the American Heart Association and American
Cancer Society) also sponsor small research programs, many of which are designed
to support junior investigators. The total amount of research support is far smaller
than that provided by NIH, and most foundations have the goal of using this money
to fill the gaps, funding projects of merit that for one or another reason would not
be funded by NIH. A few foundations offer career development awards that usually
provide less financial support than NIH K-awards, and are focused on specific areas
such as quality of health care. The Foundation Center (7) maintains a searchable
directory of foundations, their Web sites, and contact information along with advice
about how to write effective proposals to foundations. Decisions about funding
follow procedures that vary from one institution to another but that usually respond
rapidly to relatively short proposals (Fig. 19.3). The decisions are often made by an
executive process rather than by peer review. Typically, the staff of the foundation
makes a recommendation that is ratified by a board of directors.
FIGURE 19.3. NIH and foundation procedures for reviewing grant applications.
To determine whether a foundation might be interested in a particular proposal, an
investigator should consult with his senior mentors, and check foundation Web sites.
The Web site will generally describe the goals and purposes of the foundation and
often list projects that have recently been funded. If it appears that the foundation
might be an appropriate source of support, it is best to contact the appropriate staff
member of the foundation to describe the project, determine the potential interest,
and get guidance about how to submit a proposal. Many foundations ask that
investigators send a short (three- to five-page) letter describing the background
and principal goals of the project, the qualifications of the investigators, and the
approximate duration and costs of the research. If the proposal is of sufficient
interest, the foundation may request a more detailed proposal.
Research Support from Industry
Corporations that make drugs and devices (referred to as "industry") are a
major source of funding, especially for randomized trials of new treatments. Large
companies generally accept applications for investigator-initiated research that may
include small pilot studies about the effects or mechanisms of action of a treatment,
or epidemiologic studies about conditions of interest to the company. They will often
supply the drug and a matching placebo for an investigator's research. Companies
may provide small grants to support educational programs in areas of their interest.
However, the most common form of industry support for clinical research is payment
to enroll participants into large trials that are designed, conducted and analyzed by
the company.
Requests for support for research or educational programs, or to participate as a
site in a trial, generally begin by contacting the local or regional representative for
the company. If the company is interested in the topic, the investigator may be
asked to submit a relatively short application and complete forms about the request.
Companies often give preference to requests from "opinion leaders,"
clinicians or investigators who are well known and whose views may influence how
other clinicians prescribe drugs or use devices. Therefore, a young investigator
seeking industry support should generally get the help of a well-known mentor in
contacting the company and submitting the application.
The contracts for support from profit-making companies can be a mixed
experience. For participation in clinical trials, companies generally pay investigators
a fixed fee for each participant included in the trial and the trial closes enrollment
when the desired study-wide goal has been met. An investigator may enroll enough
participants to receive funding that exceeds his costs, in which case he may retain
the surplus as a long-term unrestricted account, but he will lose money if he
recruits too few participants to achieve the needed economy of scale.
Funding from industry, particularly from marketing departments, is often channeled
into topics and activities intended to increase the sales of the company's product
(8). Investigators generally have more control over results of investigator-initiated
work funded by industry than when they are one of many investigators in large
industry-sponsored trials. Multicenter studies are generally analyzed by company
statisticians and often written by industry-funded medical writers.
Industry conducts research primarily to make profit from the sale of drugs and
devices. This motivation may influence employees of the company to put findings
about their products in the most favorable light. All medical research (regardless of
the source of support) is susceptible to various extrascientific influences. Because
society places a premium on a favorable result, negative results are often dull and
hard to publish even though we all recognize that a conclusive negative finding may
be as important as a conclusive positive one. Investigators can create some
safeguards against undue influence of financial and social pressures. It is
important that contracts with companies include clear terms providing investigators
with meaningful access to data. Investigators should seek involvement in publishing
and presenting the results of the studies provided that they are able to design,
request, and carefully review the analyses, write all or key sections of papers, and
produce their own slides for presentation of results at meetings. Manuscripts and
presentations should be reviewed by publication committees, most of whose
members are scientists involved in the trial but not affiliated with the company.
One advantage of corporate support is that it is the only practical way to address
some research questions. There would be no other source of funds, for example, for
testing a new antibiotic that is not yet on the market. Another advantage is the
relative speed with which this source of funding can be acquired; decisions about
small investigator-initiated proposals are made within a few months and drug
companies are often eager to sign up qualified investigators to participate in their
multicenter clinical trials. Additionally, most pharmaceutical companies place a high
premium on maintaining a reputation for integrity (which enhances their dealings
with the vigilant U.S. Food and Drug Administration (FDA) and their
stature with the public), and the research expertise, measurement instruments,
statistical support and financial resources they provide can improve the quality of
the research.
Intramural Support
Universities often have local research funds for their own investigators that can be
discovered through the Dean's office. Grants from these intramural funds are
generally limited to relatively small amounts, but they are usually available much
more quickly (weeks to months) and to a higher proportion of applicants than grants
from the NIH or private foundations. Intramural funds may be restricted to special
purposes, such as pilot studies that may lead to external funding, or the purchase
of equipment that will permit a study to be done by scientists whose salary is
supported by training funds. Such funds are often earmarked for junior faculty
members or fellows and provide a unique opportunity for a beginning investigator
to acquire the experience of leading a funded project.
Summary
1. The protocol is the detailed written plan of the study. It is the scientific
component of a proposal for funding, which also contains administrative and
supporting information required by the funding agency.
2. An investigator who is working on a research protocol should begin by getting
advice from senior colleagues about the choice of funding agency. The next
steps are to study that agency's written guidelines and to contact the
scientific administrator in the agency for advice.
3. The process of writing a proposal, which often takes much longer than
expected, includes organizing a team with the necessary expertise,
designating a project leader, establishing a timetable for written products,
finding a model proposal, outlining the proposal along agency guidelines,
and reviewing progress at regular meetings. The proposal should be reviewed
by knowledgeable colleagues, revised often, and polished at the end with
attention to detail.
4. A good proposal requires not only a good research question, study plan,
and research team, but also a good presentation: the proposal must
communicate clearly and concisely, following a logical outline and indicating
the advantages and disadvantages of trade-offs in the study plan. The merits
of the proposal should stand out so that they will not be missed by a busy
reviewer.
5. There are four main sources of support for clinical research:
a. The NIH and other governmental sources are the largest providers of
support, using a complex system of peer and administrative review that
moves slowly but encourages good science.
b. Foundations and societies are often interested in promising research
questions that escape NIH funding, and have review procedures that are
quicker but more parochial than those of NIH.
c. Manufacturers of drugs and devices are a very large source of support
that is usually channeled to company-run studies of new drugs and
medical devices, but corporations value partnerships with leading
scientists and support some investigator-initiated research.
d. Intramural funds tend to have favorable funding rates for getting small
amounts of money quickly, and are suitable for pilot studies and
beginning investigators.
References
1. Gill TM, McDermott MM, Ibrahim SA, et al. Getting funded: career
development awards for aspiring clinical investigators. J Gen Intern Med
2004;19:472–478.
2. Inouye SK, Fiellin DA. An evidence-based guide to writing grant proposals for
clinical research. Ann Intern Med 2005;142:274–282.
3. Advice about how to write successful NIH applications:
http://www.niaid.nih.gov/ncn/grants/default.htm and
http://www.ora.stanford.edu/ora/ratd/nih_04.asp.
4. General advice from AAAS about how to obtain research funding:
http://www.sciencecareers.sciencemag.org/funding.
5. Information about types of NIH funding:
http://www.grants.nih.gov/grants/oer.htm.
6. Description of the NIH grant review and funding process:
http://www.cms.csr.nih.gov.
7. Information from the Foundation Center about applying for funding from
foundations: http://www.fdncenter.org/.
8. Davidoff F, DeAngelis CD, Drazen JM, et al. Sponsorship, authorship, and
accountability. JAMA 2001;286:1232–1234.