
February 17, 2012

The GATE Notes: a Graphic Approach To Epidemiology


Epidemiology with Pictures

PREFACE
The Graphic Approach To Epidemiology (GATE) uses a triangle, circle, square and arrows to graphically represent all epidemiological studies, including randomised and non-randomised studies. Every epidemiological study can be hung on the GATE frame. The GATE approach was developed to make epidemiological principles, study design, analysis and appraisal easy to understand and remember. GATE does this by illustrating how every epidemiological study shares the same basic structure and how every epidemiological study is designed with the same objective: to measure how much dis-ease occurs in different groups of people (or populations). These differences in the amount of dis-ease in different groups of people can provide insights into the causes and predictors of dis-ease, and the effects of treatments on dis-ease.

The GATE Notes are not a comprehensive epidemiology textbook but a visually aided guide to epidemiological principles, measures, analyses, errors and study design. The GATE approach is equally applicable in clinical, health services or public health practice. GATE started life as a Graphic Appraisal Tool for Epidemiological studies, to help medical students develop critical appraisal skills, but the GATE approach is equally relevant to teaching epidemiological study design (Graphic Architectural Tool for Epidemiology), which is the flip-side of the critical appraisal of epidemiological studies.

GATE was inspired by Ken Rothman (1), who so elegantly dissected epidemiological study design into its component parts (which we illustrate with the GATE frame), and by the pioneering work of the Evidence Based Medicine Working Group (2,3), who developed user-friendly structured guides to critiquing the clinical epidemiological literature. However, our overarching goal to make the teaching of epidemiology as simple and as accessible as possible was inspired by Jerry Morris's book Uses of Epidemiology (4), in which he defined epidemiology as numerator ÷ denominator (i.e. number of people with outcomes ÷ number of people in a population) and won me over to its elegant simplicity.

THE GATE FRAME

GATE has been work in progress since the early 1990s. We thank the thousands of students and health professionals whom we have exposed to versions of GATE. We have observed their reactions, assessed their understanding, and hopefully improved GATE. Finally, I thank my colleagues who have helped me make so many improvements to GATE.

Rod Jackson

CHAPTER 1: EPIDEMIOLOGICAL STUDY DESIGN, MEASUREMENT & ANALYSIS: triangles, circles, squares & arrows
1.1. INTRODUCTION

Epidemiology is the study of how much dis-ease occurs in different groups of people and of the factors that determine any differences (variation) in the amount of dis-ease between these groups. We use the term dis-ease, rather than disease, to encompass all health-related events (e.g. an injury or death) and health-related states (e.g. diabetes, a disability, a raised blood pressure, or a state of wellness). The hyphen between dis and ease emphasises that as well as investigating pathological conditions, epidemiology can be used to investigate any health-related factor that affects our state of health or ability to function well (i.e. being at ease). Moreover, while epidemiologists usually measure dis-ease as negative events or states, sometimes they measure positive states of health such as wellbeing, survival, or, say, remission of cancer.

The more technically correct epidemiological term for the amount of dis-ease is dis-ease occurrence. An occurrence (of dis-ease) describes the transition from a non dis-eased state to a dis-eased state. Sometimes it is possible, and useful, to observe when the transition from no dis-ease to dis-ease occurs (e.g. when a heart attack occurs), and epidemiologists often count the number of events that occur over a period of time. Other times it is much easier and more useful just to determine if, not when, the transition has occurred, such as classifying someone as overweight or as having diabetes - it is impossible to observe a person transitioning from a normal to an overweight state or from no diabetes to diabetes, but it can be useful to know how many people are overweight, or have diabetes, at one point in time.

Epidemiologists measure the amount of dis-ease in different groups of people (or populations). A population is any cluster of people who share a specified common factor. This factor could be a geographic characteristic (e.g. people living in northern or southern Europe); a demographic characteristic (e.g. an age group, gender, ethnicity or socio-economic category); a time period (e.g. 2001); a dis-ease (e.g. heart disease); a behaviour (e.g. smoking); a treatment (e.g. a blood pressure lowering drug); or a combination of several of these factors. Measures of the amount of dis-ease (dis-ease occurrence) in a population can inform health service planners about the services required for that population. Or, by measuring dis-ease occurrence in different populations (e.g. smokers versus non-smokers or Maori versus non-Maori), it is possible to investigate possible and probable determinants (or causes) of variations in dis-ease occurrence. This knowledge can inform health promotion, dis-ease prevention and treatment decisions.

All epidemiological studies are measurement exercises involving the collection of data that can be counted (i.e. quantified). Quantitative data can be classified as categorical - data that are grouped into categories (e.g. male/female, smokers/non-smokers, dead/alive) - or numerical - data that take on numerical values (e.g. body weight, blood cholesterol levels, number of hospital visits, number of births). Methods of presenting quantitative data depend on the type of data collected. Most epidemiological studies present data in categories, such as the number (and proportions) of men or women, smokers or non-smokers, or people with lung cancer

or no lung cancer, in a study population. Numerical data is often presented using measures of central tendency, such as the average (or mean) body mass index in people with diabetes and in people without diabetes. Numerical data can also be presented in categories; for example, body mass index - a numerical measure - can be categorised into normal, overweight and obese categories. For simplicity, most examples in the GATE Notes involve using just two categories to describe populations or dis-eases (e.g. smoking - yes/no and lung cancer - yes/no).
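The distinction between numerical and categorical presentations of the same measure can be sketched in a few lines of code. The BMI values and the cut-points below are made-up, illustrative numbers, not data from the GATE Notes:

```python
# Sketch: summarising numerical data as a mean, and recoding it into categories.
# The BMI values and cut-points are illustrative assumptions only.

bmis = [21.5, 24.0, 27.3, 31.8, 29.9, 23.1, 35.2, 26.4]

# Numerical presentation: a measure of central tendency (the mean).
mean_bmi = sum(bmis) / len(bmis)

# Categorical presentation: recode each numerical value into a category.
def bmi_category(bmi):
    if bmi < 25.0:
        return "normal"
    elif bmi < 30.0:
        return "overweight"
    return "obese"

counts = {}
for bmi in bmis:
    cat = bmi_category(bmi)
    counts[cat] = counts.get(cat, 0) + 1

print(f"mean BMI = {mean_bmi:.1f}")
print(counts)
```

The same data can therefore be reported either as a single mean or as counts (and proportions) per category, which is the form most GATE examples use.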

This chapter describes:
- measures of dis-ease occurrence in groups of people;
- ways of comparing differences in dis-ease occurrence between groups of people; and
- the shared design features of all epidemiological studies.

1.2. MEASURES OF DIS-EASE OCCURRENCE & the GATE FRAME: NUMERATORS and DENOMINATORS (and the hourglass)

Epidemiological measures of dis-ease occurrence all have a Denominator - the number of people in the study population being investigated - and a Numerator - the number of people from a denominator in whom the dis-ease occurs.

An hourglass illustrates the two essential components of all epidemiological measures of dis-ease occurrence. The denominator (i.e. the number of people in the study population) at the beginning of an epidemiological study is represented by the number of grains of sand in the top bulb of the hourglass, before any sand has flowed to the lower bulb. The numerator (i.e. the number of people in whom dis-ease occurs) is represented by the number of grains of sand that fall through to the lower bulb. The key requirement of an epidemiological study is that the dis-ease outcomes counted in the numerator must come from a defined denominator population, just as the sand in the bottom bulb must come from the top bulb. That's why all well-conducted epidemiological studies begin by defining the denominator population. Most epidemiological studies measure dis-ease occurrence in several sub-populations within the study population, and the study objective is to determine if the occurrence in each sub-population differs and why it differs. We call these sub-populations groups, and they are the study denominators used in calculating dis-ease occurrence.
Number of persons in whom dis-ease occurs (numerator)
-----------------------------------------------------
Number of persons in study population (denominator)

Epidemiology is the study of the occurrence of dis-ease in populations

Note: see over for details of different measures of occurrence related to the timing of measurements

The GATE frame (Figure 1.1) illustrates the component parts of epidemiological studies, identifying the numerators & denominators:
[Figure: the GATE frame - a triangle (Participant Population: the overall DENOMINATOR), a circle divided into Exposure (EG) and Comparison (CG) Groups (the study DENOMINATORS), and a square of Outcomes (the NUMERATORS) with cells a & b (outcome: yes) and c & d (outcome: no).]
Figure 1.1. An epidemiological study with one Exposure Group (EG), a Comparison Group (CG) & a categorical (yes or no) Outcome

Triangle = Participant (or study) Population (P), e.g. 2000 men aged 45-74 years. While this is the overall denominator, it is usually subdivided into several study denominators.

Circle = study Denominators. The GATE circle is divided into the Exposure Group (EG) & Comparison Group (CG), e.g. EG = 400 people exposed to smoking & CG = 1600 unexposed non-smokers. EG & CG are the actual denominators used in the calculations of dis-ease occurrence.

Square = Numerators - Dis-ease Outcomes (O). The GATE square is divided into 4 cells: a are the people from EG in whom dis-ease occurs and c are those from EG who don't get dis-ease, while b are people from CG in whom dis-ease occurs and d are those from CG who don't get dis-ease. Measures of dis-ease occurrence normally use a and b as the numerators - the people with dis-ease. We call the occurrence of dis-ease in the Exposure Group the Exposure Group Occurrence or EGO (EGO = a/EG), and the occurrence of dis-ease in the Comparison Group is called the Comparison Group Occurrence or CGO (CGO = b/CG). One could measure the occurrence of no dis-ease in EG (= c/EG) and CG (= d/CG), and this is done in some studies, particularly diagnostic test accuracy studies (discussed later).

Notes in purple are not required reading for POPHLTH 111. Not all exposures or outcomes are easily categorised as yes or no (i.e. exposed or not exposed and dis-ease or no dis-ease). For example, the exposure of interest could be salt intake and the outcome of interest could be blood pressure level. In this example both the exposure and the outcome are numerical measures (see page 2 for definition). In many epidemiological studies these numerical measures are divided into categories (e.g. low or high salt intake and low or high blood pressure). This allows the study to measure the occurrence of high (or low) blood pressure in people with low salt intake (EGO) and the occurrence of high blood pressure in people with high salt intake (CGO). An alternative to dividing numerical outcome measures into two categories in order to calculate dis-ease occurrence is to calculate the average level of the outcome in EG and CG. In the salt and blood pressure example, salt intake is still classified into low and high intake, but the average blood pressure level is calculated in each group. This is also considered to be a measure of dis-ease occurrence and simply involves adding together the outcome measure (e.g. blood pressure level) for every person in EG (e.g. people with a low salt intake), then dividing by the total number of people in EG to determine the average blood pressure level. The same calculation is done for CG (e.g. people with a high salt intake).

Some exposures and dis-ease outcomes are classified into more than two categories, such as low / moderate / high salt intake and low / moderate / high blood pressure; this would just involve adding additional vertical and horizontal dividers to the GATE frame circle and square, and additional calculations of dis-ease occurrence. In the ideal study everyone in EG and CG ultimately gets classified as either having a dis-ease outcome (a or b) or not having the dis-ease (c or d). Therefore the number of people in EG should equal the number of people in a & c. Similarly, the number of people in CG should equal the number of people in b & d.

Time - the horizontal and vertical arrows in the GATE frame represent the time when, or during which, the outcomes are measured; this is discussed in the next section.
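The EGO and CGO arithmetic from the GATE frame can be sketched directly. The group sizes follow the smoking example above; the outcome counts a and b are hypothetical numbers added for illustration:

```python
# Sketch of the GATE frame arithmetic for a categorical (yes/no) outcome.
# EG and CG sizes follow the smoking example in the text; the outcome
# counts a and b are made-up numbers.

EG = 400    # exposure group (smokers)
CG = 1600   # comparison group (non-smokers)
a = 40      # dis-ease outcomes in EG (hypothetical)
b = 80      # dis-ease outcomes in CG (hypothetical)
c = EG - a  # people in EG who don't get the dis-ease
d = CG - b  # people in CG who don't get the dis-ease

EGO = a / EG   # Exposure Group Occurrence
CGO = b / CG   # Comparison Group Occurrence

print(f"EGO = {EGO:.2f}, CGO = {CGO:.2f}")   # EGO = 0.10, CGO = 0.05
```

Note how EG = a + c and CG = b + d, as in the ideal study described above where everyone is ultimately classified into one of the four cells.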

Some studies involve both a numerical exposure measure (e.g. body weight) and a numerical outcome measure (e.g. blood pressure level). Associations between numerical exposures and numerical outcomes are described by calculating correlation coefficients.

1.3. THE TWO KEY MEASURES OF DIS-EASE OCCURRENCE: INCIDENCE & PREVALENCE

There are two key measures of dis-ease occurrence in epidemiology - incidence and prevalence. They are differentiated by the timing of the measures. We illustrate this using the analogy of the Incidence rain dropping into the Prevalence pool.

[Diagram: a Population cloud rains Incidence raindrops into a Prevalence pool; water leaves the pool via Death drips and evaporates up to a Cure cloud.]

Incidence is calculated by counting the number of onsets of dis-ease occurring during a period of time (the numerator - analogous to the number of raindrops falling into the pool, say, over one hour) and then dividing the numerator by the number of people in the study population (the denominator - analogous to the total number of possible raindrops in the Population cloud). The raindrops analogy illustrates that dis-ease onsets (raindrops hitting the pool) can be observed when they occur, so it is possible to count the number of onsets that occur during a specified time period. Incidence is the most appropriate measure of dis-ease occurrence for dis-eases that have an observable and obvious onset (e.g. heart attacks occurring in 5 years among 1000 smokers compared to heart attacks occurring in 5 years in 1000 non-smokers, which could be reported as the number of heart attacks per 1000 people over 5 years). Incidence measures require the dis-ease outcome to be a categorical (e.g. yes / no) variable. The vertical arrow in the GATE frame (Figure 1.1) and the raindrops falling into the pool in the diagram above are used to represent incidence measures of outcomes.

Incidence is usually presented as the proportion (or percentage) of people from the study population (or, more commonly, from the exposure or comparison groups within the study population) in whom a dis-ease event occurs during a specified time period. Most epidemiological textbooks differentiate between two slightly different measures of incidence - Incidence Proportion and Incidence Rate. Incidence Proportion (discussed above) is also known as cumulative incidence or, more commonly, simply as risk. Incidence Proportion counts everyone who started the study in the denominator and everyone who has a dis-ease onset during the study time period in the numerator.

The Incidence Rate is a more exact measure of incidence as it only counts participants in the denominator for the time they remain in the study. To achieve this, participants are counted in units of person-time in the study. For example, participants remaining in the study for 10 years contribute 10 person-years each to the denominator, while a person who dies 2 years into the study contributes 2 person-years and another who decides to leave the study after 5 years contributes 5 person-years. In practice, unless the study has a very long follow-up period, a high loss to follow-up or a very high event rate (all of which are uncommon), there will be little difference between the Incidence Rate and the Incidence Proportion, and the terms can be used interchangeably, as long as the time period over which events are counted is always specified. The incidence proportion is easier to calculate because the denominator is simply the number of persons who started in the study rather than person-time. For simplicity, the GATE Notes only use Incidence Proportion (usually known as Risk). Incidence is typically calculated separately in the exposure and comparison groups in the study population (i.e. in EG and in CG; see box below). Therefore EG and CG are considered to be separate denominators within the initial overall study population.

Incidence = Number of persons with dis-ease onsets (Numerator) ÷ Total number of persons (Denominator), during the study Time

Incidence in EG* (EGO**) = (number of yes outcomes from EG ÷ number in EG) during time T
                         = [a ÷ EG] = [a ÷ (a + c)] during time T
Incidence in CG* (CGO**) = (number of yes outcomes from CG ÷ number in CG) during time T
                         = [b ÷ CG] = [b ÷ (b + d)] during time T

* EG is the acronym for the Exposure Group - the people exposed to the factor being studied - and CG is the Comparison Group - the people not exposed.
** EGO is the acronym for Exposure Group Occurrence & CGO for Comparison Group Occurrence. In this example Incidence is the measure of Occurrence.
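A small sketch can contrast the Incidence Proportion and the Incidence Rate described above. The cohort below is entirely made up, and deliberately tiny so the person-time bookkeeping is visible:

```python
# Sketch contrasting Incidence Proportion with Incidence Rate in a made-up
# cohort. Each tuple is (had_dis-ease_onset, person_years_in_study).

cohort = [
    (False, 10.0), (False, 10.0),
    (True, 2.0),    # a dis-ease onset 2 years into the study
    (False, 5.0),   # left the study after 5 years
    (True, 7.5),
    (False, 10.0),
]

n_people = len(cohort)
n_events = sum(1 for event, _ in cohort if event)
person_years = sum(py for _, py in cohort)

# Incidence Proportion (risk): everyone who started counts in the denominator.
incidence_proportion = n_events / n_people

# Incidence Rate: people count only for the time they remain in the study.
incidence_rate = n_events / person_years

print(f"incidence proportion = {n_events}/{n_people} = {incidence_proportion:.3f}")
print(f"incidence rate = {n_events}/{person_years} person-years = {incidence_rate:.3f} per person-year")
```

With short follow-up and few losses the two measures stay close, which is why the GATE Notes treat them interchangeably and use Incidence Proportion throughout.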

[Diagram: as above, but the countable Incidence raindrops are replaced by Incidence drizzle falling from the Population cloud into the Prevalence pool, which still loses water via the Death drips and the Cure cloud.]

Prevalence is the alternative measure of dis-ease occurrence to incidence and is calculated by counting the number of people with dis-ease at one point in time (the numerator) and then dividing by the number of people in the study group at that point in time (the denominator). Prevalence is the measure of dis-ease occurrence that is used for dis-eases that develop so slowly that the actual time they occur cannot easily be observed or measured. Examples include diabetes, a raised blood pressure level or being overweight - none of which has an easily observable (and therefore measurable) time of onset. This is illustrated in the adjacent picture by replacing the Incidence raindrops - which can be counted - with Incidence drizzle - which cannot easily be counted. However the drizzle, like the raindrops, falls into the pool. So if you cannot count the raindrops as they fall (Incidence), you can count the amount of drizzle that has fallen into the pool by measuring how much water is in the pool (Prevalence).

For example, a prevalence study of diabetes would involve identifying a study population, measuring their fasting blood glucose levels (at one point in time), and calculating the proportion of this population who have a glucose level high enough for them to be diagnosed with diabetes. This proportion with diabetes is known as the prevalence of diabetes at the point in time that the measurements were taken. You do not know when they developed diabetes, but you know that at a particular point in time a certain proportion of the study population have diabetes. We often measure the prevalence of a dis-ease at two points of time (analogous to measuring the amount of water in the pool at two time points) and calculate the change in prevalence. The difference in prevalence between the two time points is in fact a measure of the incidence of dis-ease over the period between the two time points.

Some dis-eases that do have observable onsets are still best measured as prevalence if the signs and symptoms come and go frequently, such as asthma attacks. People with asthma often have multiple asthma attacks of different severity and different frequency, so measuring the incidence of asthma attacks is very difficult and not very useful information. For example, in a population of 100 people, one person may have 10 attacks over 1 year, another may have 7 attacks and another may have 20 attacks, while 97 of the 100 people have no attacks. If you add up the attacks (10 + 7 + 20 = 37) and divide by the population (100) you would calculate the incidence of asthma attacks as 37/100/year, but this is rather meaningless because only 3 of the 100 people had any asthma attacks. It is more useful to first define asthma, for example, as a condition
involving at least 2 attacks in one year that are severe enough to limit usual activities, rather than measuring the exact number of asthma attacks. A diagnosis of asthma could therefore be made if a person has, for example, had 2 or more asthma attacks, severe enough to limit normal activities, in the previous year. Once we have defined what asthma is, we then measure the prevalence of asthma as the proportion of a group of people who, at the time of asking, have had at least 2 severe asthma attacks in the previous year. So this measure of prevalence is done in two steps:
i. asthma is first defined by a minimum number of attacks over the previous year, which has some aspects of a measure of incidence because it depends on events that happen over time;
ii. then the prevalence of asthma (as defined in step i) is calculated at one point in time.
We do not measure the number of attacks, as we would if we were measuring incidence, just whether they have had at least two attacks over the previous year.

The prevalence pool analogy is used to illustrate that prevalence is a static measure of dis-ease status, like the amount of water in a pool at a point in time. It is important to appreciate that the size of the Prevalence pool depends both on the rate at which raindrops fall into the pool (Incidence) and on how much water is lost from the pool. People can leave the prevalence pool either by dying - illustrated by water dripping out (the death drips) - or by being cured - illustrated by water evaporating from the pool (the cure cloud). Therefore a population with a high incidence of dis-ease could have a low prevalence if the death rate or cure rate is also high. Alternatively, a population with a low incidence of dis-ease could have a high prevalence of dis-ease if almost no one dies of the dis-ease or is cured. Prevalence is a measure of the amount (i.e. the burden) of dis-ease in a population at a point in time and is very relevant to funders and planners of health services.
However it is a less useful measure than incidence for investigating causes of dis-ease occurrence because, as discussed, a high incidence of dis-ease could result in either a high or a low prevalence depending on the death rate and cure rate. As described above, prevalence can be calculated as a proportion for a categorical outcome such as diabetes (e.g. diabetes prevalence in 60-70 year old Maori women is 35%). For a numerical outcome such as blood cholesterol, an average or mean value is usually calculated (e.g. the mean cholesterol in patients waiting for heart surgery is 6.5 mmol/L). This is done by adding up the cholesterol values of all participants (i.e. the sum) and then dividing by the number of participants. Alternatively, numerical outcomes can be reclassified as categorical outcomes (e.g. high or low cholesterol levels) to calculate prevalence as a proportion. Using the example above, the prevalence of high blood cholesterol (say, > 6.5 mmol/L) in patients waiting for bypass surgery could be 500 per 1000 patients (50%). A horizontal arrow is used in the GATE frame (Figure 1.1), and the prevalence pool in the raindrops diagram, to represent prevalence measures of outcomes at a point in time.

Prevalence = Number¹ or sum² of dis-ease states (Numerator) ÷ Total number of persons (Denominator), at a point in time

Prevalence in EG (EGO*) = number of yes outcomes from EG ÷ number in EG = a ÷ EG
Prevalence in CG (CGO*) = number of yes outcomes from CG ÷ number in CG = b ÷ CG

¹ If the numerator is a count of a categorical (yes/no) dis-ease state (e.g. diabetes), then prevalence will be a proportion (often expressed as a percentage, %).
² If the numerator is the sum of the scores for a numerical outcome measured on everyone (e.g. blood pressure), then prevalence will usually be expressed as a mean (or average).
* As with incidence, in the GATE Notes we use the generic term Exposure Group Occurrence (EGO) to describe the prevalence in the exposure group and Comparison Group Occurrence (CGO) for the prevalence in the comparison group.
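The two forms of prevalence in the box above - a proportion for a categorical outcome and a mean for a numerical one - can be sketched as follows. The glucose values and the diagnostic cut-point are illustrative assumptions, not clinical guidance:

```python
# Sketch: point prevalence for a categorical outcome (diabetes, defined here
# by an assumed fasting glucose cut-point) and a mean for a numerical outcome.
# All values below are made up for illustration.

glucose = [5.1, 7.8, 6.2, 9.4, 5.6, 7.2, 4.9, 8.1]  # mmol/L, one point in time
DIABETES_CUTOFF = 7.0  # illustrative threshold, not a clinical recommendation

# Categorical outcome: prevalence as a proportion.
n_diabetes = sum(1 for g in glucose if g >= DIABETES_CUTOFF)
prevalence = n_diabetes / len(glucose)

# Numerical outcome: "prevalence" expressed as a mean level.
mean_glucose = sum(glucose) / len(glucose)

print(f"prevalence of diabetes = {n_diabetes}/{len(glucose)} = {prevalence:.2f}")
print(f"mean glucose = {mean_glucose:.2f} mmol/L")
```

Both numbers describe the same group at the same point in time; the choice between them depends on whether the outcome is presented categorically or numerically.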

Prevalence measures are usually called point prevalence measures because the presence of dis-ease is measured at one point in time. Sometimes we measure period prevalence, as in the asthma example discussed above. As asthma was defined by the number of attacks that occurred over a period of time, this measure of prevalence is usually called period prevalence. Period prevalence is a mix of incidence and prevalence because it includes onsets of dis-ease that have occurred during a time period. Differentiating between incidence and period prevalence measures of occurrence can be difficult and confusing in some situations and it may also be possible to measure both. You should always ask yourself: which measure of dis-ease occurrence will give me the most useful answer to my specific question?


1.4. COMPARING DIS-EASE OCCURRENCES IN DIFFERENT GROUPS & POPULATIONS (EGO VERSUS CGO) TO ESTIMATE THE EFFECTS OF EXPOSURES: RISK RATIOS (EGO ÷ CGO) & RISK DIFFERENCES (EGO - CGO)

Most epidemiological studies are designed to investigate whether there are differences in dis-ease occurrence (or dis-ease risk) between exposure and comparison groups within a study population. Using the GATE terminology, most epidemiological studies are designed to compare the Exposure Group Occurrence (EGO) with the Comparison Group Occurrence (CGO). The difference between EGO and CGO (e.g. EGO = road traffic injury incidence among drinkers and CGO = road traffic injury incidence among non-drinkers) may be caused by the effect of the study exposure (e.g. drinking alcohol) on the dis-ease outcome (e.g. road traffic injury). This difference in dis-ease occurrence between an exposed group and an unexposed (or comparison) group is commonly called an estimate of effect (or effect estimate).

In most situations it is important to measure the occurrence of dis-ease both in the group of people who are exposed (EG) and in the group who are not exposed (CG), because the comparison group is seldom dis-ease-free. For example, not all road traffic injuries in drunk drivers will be caused by alcohol, because many non-drinkers also have road traffic injuries. It is the difference in the occurrence of road traffic injuries between drunk and sober drivers that better indicates the effect of alcohol consumption on road traffic injury rates. This important principle is often overlooked in the reporting of associations between exposures and dis-ease: the number of road traffic injuries among drunk drivers is often wrongly reported in the lay press as the number of road traffic injuries caused by alcohol. Another common example of this kind of misinterpretation of associations between exposures and dis-ease is the apparent link between drugs and common side effects. If a person gets muscle pain soon after starting a new drug, both the patient and their doctor may assume the drug caused the muscle pain. But as muscle pain is very common, the link may be spurious. The best way to determine if a drug causes muscle pain is to compare the occurrence of muscle pain in people who take the drug (EGO) with the occurrence of muscle pain among those who don't (CGO).

While comparisons of dis-ease occurrence are typically called estimates of the effect of an exposure (e.g. alcohol) on a dis-ease outcome (e.g. injury), it is usually more appropriate to initially describe them as estimates of association between an exposure and an outcome. The term effect suggests a causal relationship between the exposure and outcome, but there are several further steps required before one can be reasonably confident that an association is causal. The first step is to determine whether there are any important errors in the measurements (see Chapter 2 of these Notes). The second step is to consider where on the causal pathway, or in the causal mix, the exposure is situated, and then to consider other evidence supporting a causal link. For simplicity, in the GATE Notes we use estimates of effect and estimates of association interchangeably.

There are two main ways to compare two dis-ease occurrences:
i. the exposure group occurrence can be divided by the comparison group occurrence (EGO ÷ CGO) to produce a Ratio of Occurrences; or
ii. the comparison group occurrence can be subtracted from the exposure group occurrence (EGO - CGO), or vice versa (CGO - EGO), to produce a Difference in Occurrences.
Comparisons of occurrences are commonly called comparisons of risks. So a ratio of occurrences (EGO ÷ CGO) may be called a risk ratio or, more commonly, a relative risk (RR), while a difference between occurrences (EGO - CGO) is commonly called a risk difference (RD). The risk difference is often referred to as an absolute risk to
distinguish it from a relative risk, but it would be more appropriate to use the term absolute risk difference (shortened to risk difference), because all measures of risk or occurrence (i.e. incidence and prevalence) are absolute risk measures. It is only when you divide one risk by another that you produce a relative risk. So the ratio of two absolute risks is a relative estimate of the difference between two absolute risks, while the risk difference is an absolute estimate of the difference between two absolute risks.

[Figure: two pairs of columns - first pair: EGO = 20 units, CGO = 10 units; second pair: EGO = 8 units, CGO = 4 units.]

The figure shows two pairs of columns to illustrate the differences between the two ways of comparing dis-ease occurrences. The heights of the columns represent the magnitudes of the occurrences (we shall call them risks here). The first pair of risks is EGO = 20 units and CGO = 10 units, while the second pair is EGO = 8 units and CGO = 4 units. The ratio of the column heights (risks) is the same for both pairs (20 units ÷ 10 units & 8 units ÷ 4 units, both = 2), but the differences in the heights are not the same (20 units - 10 units = 10 units; 8 units - 4 units = 4 units). Note that estimates of relative risk have no units (because they are relative), but estimates of risk difference have the same units as
EGO and CGO.

RISK RATIO or RELATIVE RISK (RR = EGO ÷ CGO): By convention, in the GATE Notes the relative risk is calculated by dividing the dis-ease occurrence in the exposure group (EGO) by the occurrence in the comparison group (CGO), but it can also be calculated by dividing CGO by EGO, so always make sure you explain a relative risk by stating that the risk in one (specified) group is, say, 2 times higher than in another (specified) group. A relative risk can be any number greater than zero. If there is no difference in the risk or occurrence of a dis-ease between the two groups being compared (i.e. EGO = CGO), then the relative risk = 1.0 (i.e. when RR = 1 there is no difference in the effect of E and C on the study outcome; this is often known as the no-effect value).

Relative Risk* (RR) = Exposure Group Occurrence** [EGO] (or EG Risk) ÷ Comparison Group Occurrence** [CGO] (or CG Risk)

* the terms Risk Ratio and Relative Risk can be used interchangeably.
** if dis-ease occurrence measures are calculated as means or averages (e.g. mean quality of life scores in people taking different anti-depressant drugs), then the relative comparison of two mean scores would yield a relative mean (RM).

A relative risk that is less than 1.0 can also be expressed as a Relative Risk Reduction (RRR), because the risk is reduced below 1.0. The RRR is usually expressed as a percentage and is calculated by subtracting the relative risk from 1.0 and then multiplying by 100 (see box below). For example, if the risk of heart attacks in people taking a cholesterol-lowering drug relative to people not taking the drug is 7/10 (i.e. RR = 0.7),


then the RRR = (1.0 - 0.7) x 100 = 30%. In other words the cholesterol-lowering drug takers have a 30% lower risk of heart attacks relative to the non-drug takers, suggesting that the drug lowers the risk of heart attacks. Similarly if the relative risk is greater than 1.0, it can be expressed as a Relative Risk Increase (RRI). The RRI is usually expressed as a percentage increase, calculated by subtracting 1.0 from the relative risk and then multiplying by 100. For example if the risk of heart attacks in smokers relative to non-smokers is 2/1 or RR = 2.0, then the RRI = (2.0 - 1) x 100 = 100%. In other words smokers have a 100% higher risk of heart attacks relative to non-smokers, suggesting that smoking increases the risk of heart attacks.

Relative risk reduction (RRR) = (1 - RR) x 100%. e.g. if RR = 0.6, RRR = (1 - 0.6) x 100 = 40%
Relative risk increase (RRI) = (RR - 1) x 100%. e.g. if RR = 1.6, RRI = (1.6 - 1) x 100 = 60%
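The RR/RRR/RRI arithmetic above can be sketched in a few lines of Python (the function names are ours, for illustration only, not part of the GATE Notes):

```python
def relative_risk(ego: float, cgo: float) -> float:
    """Relative risk: dis-ease occurrence in the exposure group (EGO)
    divided by the occurrence in the comparison group (CGO)."""
    return ego / cgo

def rrr_or_rri(rr: float) -> str:
    """Express a relative risk as a percentage reduction (RRR, RR < 1)
    or increase (RRI, RR > 1) from the no-effect value of 1.0."""
    if rr < 1.0:
        return f"RRR = {(1.0 - rr) * 100:.0f}%"
    if rr > 1.0:
        return f"RRI = {(rr - 1.0) * 100:.0f}%"
    return "RR = 1.0 (no-effect value)"

# Worked examples from the text:
print(rrr_or_rri(relative_risk(7 / 100, 10 / 100)))  # cholesterol drug: RRR = 30%
print(rrr_or_rri(relative_risk(2 / 100, 1 / 100)))   # smoking: RRI = 100%
```

Note that either group can sit in the numerator, which is why the text insists on always stating which group's risk is being divided by which.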

The odds ratio (OR) is another measure used to compare risks. It is the only estimate of effect that can be derived from case-control studies, although ORs can be calculated in any epidemiological study. The odds ratio is similar to the relative risk in most circumstances, but as dis-ease risk becomes more common in the study population, the difference between these two estimates of effect increases. If less than about 15-20% of people in the study population develop the dis-ease during the study period, then the difference between the odds ratio and relative risk is of little relevance, so introductory level readers can think of the odds ratio as equivalent to a relative risk. In the GATE frame shown in Figure 1.1, the Odds Ratio is calculated as (a÷b) ÷ (c÷d), which is the ratio of the odds of being exposed among people who have the study dis-ease outcome (i.e. those in a or b) to the odds of being exposed among people who don't have the study dis-ease outcome (i.e. those in c or d). Mathematically the Odds Ratio [(a÷b) ÷ (c÷d)] can be rewritten as (a÷c) ÷ (b÷d). This latter equation is very similar to the Relative Risk [i.e. RR = {a÷(a+c)} ÷ {b÷(b+d)}] if a << c and b << d (in other words, if having the dis-ease outcome (i.e. a or b) is uncommon compared to not having the dis-ease (i.e. c & d)). The odds ratio in a case-control study will therefore approximate the relative risk in an equivalent cohort study if the above conditions apply. The odds ratio is also the measure of effect calculated in logistic regression equations.
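The OR-versus-RR approximation can be demonstrated with a minimal sketch, using hypothetical 2x2 cell counts (a, b = outcomes in EG and CG; c, d = non-outcomes in EG and CG, as in the GATE frame):

```python
def risk_ratio(a, b, c, d):
    """RR from a 2x2 table: a,b = outcome in EG,CG; c,d = no outcome in EG,CG."""
    return (a / (a + c)) / (b / (b + d))

def odds_ratio(a, b, c, d):
    """OR = (a/b) / (c/d), algebraically identical to (a/c) / (b/d) = ad/bc."""
    return (a * d) / (b * c)

# Rare outcome (a << c and b << d): the OR closely approximates the RR.
print(risk_ratio(10, 5, 990, 995))   # RR = 2.0
print(odds_ratio(10, 5, 990, 995))   # OR is approximately 2.01

# Common outcome: the OR drifts away from (here, overstates) the RR.
print(risk_ratio(60, 30, 40, 70))    # RR = 2.0
print(odds_ratio(60, 30, 40, 70))    # OR = 3.5
```

The cell counts are invented purely to show the two regimes described in the text.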

Risk Difference (RD = EGO - CGO): By convention, in the GATE Notes the risk difference is calculated by subtracting the Comparison Group Occurrence (CGO) from the Exposure Group Occurrence (EGO), but it could also be calculated by subtracting EGO from CGO. As when describing the relative risk, always state which risk (group) is being subtracted from which. A risk difference can be any number between minus infinity and plus infinity. If there is no difference between the groups compared (EGO = CGO), the risk difference = 0 units (i.e. a RD = 0 demonstrates no difference in the effect of E and C on the study outcome; this is also known as the no-effect value; remember that the equivalent no-effect value for a RR is 1.0).


Risk Difference* (RD) = Exposure Group Occurrence** [EGO] (or EG Risk) MINUS Comparison Group Occurrence** [CGO] (or CG Risk)

* the RD is an Absolute Risk Reduction (ARR) if the risk is lower in the exposure group or an Absolute Risk Increase (ARI) if the risk is higher in the exposure group. ** if dis-ease occurrence measures are calculated as means (i.e. averages), the difference between two means is a mean difference (MD).

If the exposure in a study is a treatment (e.g. a drug) and there is a treatment benefit (i.e. the dis-ease occurrence [or risk] in the treatment group is less than in the comparison group: EGO < CGO), then the risk difference can also be expressed as the Number of people Needed To be Treated to prevent one event (abbreviated to Number Needed to Treat or NNT). For example, if the risk of death over 5 years in patients with breast cancer treated with surgery plus chemotherapy (EGO) is 20/100, compared to a risk of 25/100 in patients receiving surgery only (CGO), the RD is -5/100 (i.e. EGO - CGO = 20/100 - 25/100 = -5/100). This means that for every 100 people treated with surgery plus chemotherapy rather than surgery alone, there will be 5 fewer deaths over 5 years, which is the same as stating that for every 20 people so treated, there will be one less death over 5 years. The number 20 is called the number needed to treat to benefit one person. The NNT is the reciprocal of the risk difference (i.e. NNT = 1 ÷ RD), which in this example is 1 ÷ (5/100) = 100/5 = 20. If the risk difference indicates that the treatment is harmful, then the NNT will be the Number of people Needed to Treat to cause harm to one person. This is often called the Number Needed to Harm or NNH, but it would be more accurate to call it the NNT(H), while if there is a benefit it should be called the NNT(B).
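The breast cancer worked example can be checked with a short sketch (the function names are illustrative, not GATE terminology):

```python
def risk_difference(ego: float, cgo: float) -> float:
    """Risk difference: EGO minus CGO (negative = fewer events in EG)."""
    return ego - cgo

def nnt(rd: float) -> float:
    """NNT = 1 / RD. The absolute value gives the number needed to treat;
    the sign of the RD tells you whether it is an NNT(B) or an NNT(H)."""
    return 1 / abs(rd)

# Breast cancer example from the text:
ego = 20 / 100   # surgery + chemotherapy: 5-year risk of death
cgo = 25 / 100   # surgery alone
rd = risk_difference(ego, cgo)   # RD is about -5/100
print(rd, nnt(rd))               # NNT(B) is about 20 over 5 years
```

As the text notes, the NNT is tied to the time period: if risk accrues roughly evenly, the NNT to prevent one event in 1 year is about 5 times the NNT over 5 years.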
If the study exposure is a risk factor, like smoking, rather than a treatment, the equivalent measures are the Number Needed to Expose to Benefit or Harm one person, or NNE(B) and NNE(H). Similarly, if the exposure is a screening test, the equivalent measure is the Number Needed To Screen (NNS) to correctly (or incorrectly) diagnose one person with dis-ease, or the Number Needed To Screen to prevent or cause one death. The NNT (or NNE/NNS), like the RD, is very dependent on the Time period specified. For example, the NNT to prevent one event in 1 year will be about 5 times the NNT to prevent one event in 5 years.

1.5. PECOT: THE 5 PARTS OF ALL EPIDEMIOLOGICAL STUDIES

The GATE frame shown in Figure 1.1 illustrates the key components of all epidemiological study designs. There are 5 parts to most epidemiological studies: 1. the Participants or study Population; 2. the Exposure Group and 3. the Comparison Group; 4. the dis-ease Outcomes; and 5. the Time point at which, or period during which, dis-ease outcomes are measured. We use the acronym PECOT as a memory aid for describing these 5 parts of epidemiological studies.


The occasional study does not have a comparison group, so it is just PEOT, but there is usually an implicit comparison based on other studies, or the Exposure Group can be subdivided by age, gender, etc., which in effect creates a Comparison Group. An alternative acronym to PECOT commonly seen in the clinical epidemiology literature is PICO, where the I stands for Intervention (a treatment) or Indicator (a risk factor or prognostic factor) and Time is not explicitly mentioned. The GATE Notes use the acronym PECOT because it is more generic to all epidemiological studies, whether clinically or public health focussed, experimental or non-experimental. Exposure is the generic epidemiological term for any factor that is used to allocate the study participants into groups. Also, the T in PECOT is to remind study appraisers and designers of the importance of the Time point when, or the Time period during which, dis-ease outcomes are measured. Dotted (not solid) horizontal and vertical lines are used within the GATE frame to indicate that there may be more than two exposure groups and more than two outcome groups. Alternatively, either or both exposures and outcomes can be numerical measures. The Outcomes square with a, b, c & d cells is commonly called a 2x2 table, which can be derived from any epidemiological study with dichotomous exposure groups (i.e. exposure / comparison) and dichotomous outcomes (i.e. yes / no). All types of epidemiological study will hang on the generic GATE Frame with its 5 PECOT components. The main study types are described in Chapter 3.
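How the a, b, c & d cells of the 2x2 table are filled from dichotomous exposure and outcome data can be sketched as follows, using made-up participant records (the data and variable names are purely illustrative):

```python
# Hypothetical participant records: (exposed?, developed the outcome?)
participants = [
    (True, True), (True, False), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]

# Fill the GATE 2x2 cells: a,b = outcome yes in EG,CG; c,d = outcome no in EG,CG.
a = sum(1 for e, o in participants if e and o)
b = sum(1 for e, o in participants if not e and o)
c = sum(1 for e, o in participants if e and not o)
d = sum(1 for e, o in participants if not e and not o)

ego = a / (a + c)   # Exposure Group Occurrence
cgo = b / (b + d)   # Comparison Group Occurrence
print(a, b, c, d, ego, cgo)
```

Every dichotomous-exposure, dichotomous-outcome study reduces to these four cells, which is why the same effect measures (RR, RD, OR) apply across study types.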

REFERENCES
1. Rothman KJ. Epidemiology: An Introduction. Oxford University Press, 2002.
2. Evidence-Based Medicine Working Group. Evidence-based medicine: a new approach to teaching the practice of medicine. JAMA 1992;268:2420-5.
3. Straus SE, Richardson WS, Glasziou P, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. 3rd edn. Elsevier Churchill Livingstone, 2005. pp 89-90.
4. Morris JN. Uses of Epidemiology. 3rd edn. Livingstone, 1976.


CHAPTER 2: ERROR IN EPIDEMIOLOGICAL STUDIES (RAMboMAN and Confidence Intervals)


2.1. INTRODUCTION

Chapter 1 described how to estimate the occurrence of dis-ease (EGO & CGO) and associations between exposures and dis-ease (RR = EGO ÷ CGO and RD = EGO - CGO). Before you accept these estimates as true, it is essential to check whether they are likely to have any important random or non-random errors (i.e. deviations from the truth). This chapter describes the main types of error found in epidemiological studies. Errors can occur either due to problems with the study recruitment, design and implementation, or due to chance. Errors caused by chance are described as random errors. Errors caused by problems with how the study is designed or conducted are described here simply as non-random errors, but are often called biases or systematic errors.

We use the acronym RAMboMAN (from the movie character) to demonstrate where non-random errors can occur in epidemiological studies. For estimates of EGO and CGO - and therefore estimates of RR and RD - to be valid (i.e. to have no important non-random errors), the right people must be included in the right parts of the GATE frame. We then use the 95% Confidence Interval to describe the amount of random error in the study results.

The cartoon showing darts in the bull's-eye of a dartboard is used to illustrate error. If each dart represents the result of one of a series of identical studies conducted on different samples from the same population, and the bull's-eye represents the truth, then any dart missing the bull's-eye has error - a deviation from the truth.


Figure 2.1. Hanging a study on the GATE frame: PECOT and RAMboMAN (also see Glossary). The figure is a one-page appraisal form with three columns, summarised below.

STUDY QUESTION & DESIGN: describe with PECOT
- P = Participants: describe the Setting, the eligibility criteria, the recruitment process, and the % of eligibles who participated.
- EG = Exposed Group [Intervention]: method of allocation (randomly or by measurement); describe E (and how it was measured, if not an RCT).
- CG = Comparison Group [Control]: describe C (and how it was measured, if not an RCT).
- O = Outcome: primary (& secondary, including adverse outcomes); describe outcomes and how/when they were measured.
- T = Time when outcomes were counted (at what point in time or over what time period).

STUDY NUMBERS: hang on the GATE frame
- Eligibles: n = _____; Participants: n = _____
- EG allocated = _____; CG allocated = _____
- EG completed follow-up (f/u) = _____; CG completed f/u = _____
- EG incomplete f/u = _____; CG incomplete f/u = _____
- Outcomes: a = _____ (in EG); b = _____ (in CG)
- EGO = a/EG; CGO = b/CG
- RR = EGO/CGO (95% CI); RD = EGO - CGO (95% CI); NNT = 1/RD (95% CI)

STUDY ERROR: assess using RAMboMAN
- Recruitment appropriate to study goals? Setting/eligible population appropriate, given study goals? Participants representative of Eligibles? Participant risk/prognostic profile reported?
- Allocation (± adjustment) to EG & CG successful? If allocated randomly: was the process concealed? Were EG & CG similar? If allocated by measurement: was it done accurately? Done before outcomes? Were differences between EG & CG documented?
- Maintenance of EG & CG as allocated sufficient? Compliance high, contamination low? Co-interventions similar in EG & CG? Completeness of follow-up high?
- Participants & investigators blind to exposure? blind and objective Measurements? Outcomes measured accurately?
- ANalyses: intention to treat (if RCT)? Adjusted if EG & CG different? 95% CI or p-values given?

Summary: Non-random error: amount & direction of bias (RAMboM)? ANalyses done well? Random error: sufficiently low; power/sample size sufficiently high (if no statistically significant effect demonstrated)? Applicability of findings? Any important adverse effects? Size of effect sufficient to be meaningful (RR &/or RD)? Can it be applied in practice?


GLOSSARY to Figure 2.1
Use this form for questions about: interventions (RCTs & cohort studies), risk factors/causes (cohort & cross-sectional studies) or prognosis (cohort studies). Hang the study on the GATE Frame.

STUDY QUESTIONS/DESIGN: use PECOT & the GATE Frame to define the study question & describe the study design.
Setting of study: timing & locations in which the eligible population is identified (e.g. country/urban/hospital).
Eligible Population: those from the study setting who meet the eligibility (i.e. inclusion/exclusion) criteria.
Recruitment process: how was the eligible population identified from the study setting? What kind of list (sampling frame) was used to recruit potential participants (e.g. hospital admission list, electoral rolls, advertisements)? Who was recruited (e.g. a random sample, consecutive eligibles)?
P: Participants: recruited from eligibles & allocated to EG/CG. How allocated? By randomisation or by measurement?
EG: Exposed Group: participants allocated to the main exposure (or intervention or prognostic group) being studied. If there are multiple exposures, use a new GATE frame for each exposure.
CG: Comparison Group: participants allocated to the alternative (or no) exposure (i.e. control).
Outcome: specified study outcome(s) for analyses. If multiple outcomes, use additional GATE frames.
Time: when were outcomes measured: at one point in time (prevalence) or over a period of time (incidence)?

STUDY VALIDITY (non-random error): use RAMboMAN to identify possible non-random error.
Recruitment: was the setting/eligible population appropriate given the study goals? If relevant, were participants representative of eligibles? Could the results be applied to relevant populations? This should be able to be determined from the risk factor/prognostic profile of participants. In prognostic studies, were participants at a similar stage in the progression of their disease or condition?
Allocation: how well were participants allocated to E & C? If an RCT, was the randomisation process well described and valid? If randomised, was allocation concealed (i.e. was allocation to exposure/comparison determined by a process independent of study staff and participants)? Was randomisation successful (i.e. were EG & CG similar after randomisation - were baseline characteristics similar in each group)? If not randomised (an observational study), were measurements of E & C accurate & done similarly for EG & CG? Were differences between EG & CG documented for later adjustment/interpretation?
Maintenance: did participants remain in the groups (EG or CG) they were initially allocated to? Compliance: % of participants allocated to EG (or CG) who remained exposed to E (or C) during the study. Contamination: % of participants allocated to CG who cross over to EG (& vice versa if CG is also an exposure). Co-intervention: other significant interventions received unequally by EG & CG during follow-up? Completeness of follow-up: was it high & similar in EG & CG?
Blinding: were participants/investigators blind to whether participants were exposed to E or C?
blind and objective Measurement of outcomes: were outcome assessors blind to (unaware of) whether participants were in EG or CG? Were outcomes measured objectively (e.g. biopsies, x-rays, validated questionnaires)?
ANalyses (calculation of occurrence [EGO & CGO] and effect [RR & RD] estimates).
Measures of Occurrence: EGO: Exposed Group Occurrence (either incidence or prevalence measures; also known as the Experimental Event Rate (EER) in RCTs). CGO: Comparison Group Occurrence (or Control Event Rate (CER) in RCTs). Most studies report cumulative incidence or prevalence measures of occurrence, and EGO = a/EG & CGO = b/CG. Always document over what time period (cumulative incidence) or at what point in time (prevalence) EGO & CGO were measured.
Measures of Effect (for comparing EGO & CGO): Risk Ratio (RR) = EGO/CGO, more commonly known as Relative Risk. Odds Ratios & Hazard Ratios are similar to the RR. Risk Difference (RD) = EGO - CGO, also known as the absolute risk difference. NNT (or NNE) = 1/RD: the Number Needed to Treat (or Expose) to change the number of outcomes by one (in a specified time). NNT(B): if the exposure/intervention is BENEFICIAL. NNT(H) or NNH: if the exposure/intervention is HARMFUL.
Intention-to-treat (or expose) analyses: did the analyses (i.e. calculation of EGO & CGO) include all participants initially allocated to EG & CG, including anyone who dropped out during the study or did not complete follow-up?
Adjusted analyses (for confounders): were EG & CG similar at baseline? If not, were analytical methods used to adjust for any differences (e.g. stratified analyses, multiple regression)?
Random error [= 95% Confidence Interval (CI)] in estimates of EGO, CGO, RR, RD & NNT/NNE is usually assessed by the width of the 95% CI. A wide CI (i.e. a big gap between the upper & lower confidence limits (CLs)) = more random error = less precision.

STUDY SUMMARY
Non-random error (bias): what was the likely amount & direction of bias in the measures of effect? Is bias likely to substantially increase or decrease the observed difference between EGO & CGO (and therefore the effect sizes)?
Random error: would you make a different decision if the real effect was closer to the upper CL than the lower CL? Power: if the effect sizes were not statistically significant, was the study just too small to show a real difference?
Applicability: if the sizes of beneficial versus adverse effects are considered meaningful (i.e. sufficiently large benefits versus small harms) & errors are small, are the findings likely to be applicable in practice?
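The glossary notes that random error is usually assessed by the width of the 95% CI. As an illustration (this formula is a standard log-transform approximation, not taken from the GATE Notes), a sketch of a 95% CI for a risk ratio:

```python
import math

def rr_with_ci(a, b, c, d, z=1.96):
    """Point estimate and approximate 95% CI for a risk ratio from a 2x2
    table (a,b = outcomes in EG,CG; c,d = non-outcomes), using the
    standard log-transform method for the standard error of ln(RR)."""
    rr = (a / (a + c)) / (b / (b + d))
    se = math.sqrt(1 / a - 1 / (a + c) + 1 / b - 1 / (b + d))
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper

# A small study (wide CI = more random error, less precision)
# versus a ten-times-larger study with the same RR (narrower CI):
print(rr_with_ci(10, 5, 90, 95))
print(rr_with_ci(100, 50, 900, 950))
```

Both calls return RR = 2.0, but the larger study's confidence limits sit much closer together, which is exactly the wide-versus-narrow CI contrast the glossary describes.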
REFERENCE: Jackson et al. The GATE frame: critical appraisal with pictures. Evidence-Based Medicine 2006;11:35-38. Also in: Evidence-Based Nursing 2006;9:68-71, and ACP Journal Club 2006;144:A8-A11.


2.2. NON-RANDOM ERROR (or BIAS or SYSTEMATIC ERROR): RAMboMAN

Figure 2.1 is a tool for designing and critiquing a range of epidemiological studies including randomised controlled trials, cohort studies and cross-sectional studies. The acronym RAMboMAN, down the left hand side and below the GATE frame in the figure, illustrates where non-random error can occur in epidemiological studies. It stands for: Recruitment; Allocation; Maintenance; blind and objective Measurements of outcomes; and ANalyses. The components of RAMboMAN are described below.

R stands for Recruitment. The R question is: were the appropriate study participants recruited into the study, given the study objectives? There are two types of recruitment error that can occur in epidemiological studies. One type makes it difficult to apply (or generalise) the findings to a wider (or external) population and is also known as an external validity error. This type of recruitment error occurs when the main objective of the study is to measure the characteristics of a specified eligible population (the Eligibles in Figure 2.1), but the participants (P) who are recruited are not representative of the Eligibles. For example, consider a study in which the objective is to measure the prevalence of participation in sport at school among all New Zealand school children (the Eligibles). In this type of study, the investigators must make sure that a representative sample of all New Zealand school children is recruited. The best way to do this is to obtain a list of all school children and choose a random sample of children from the list. If, however, the investigators only recruited participants from schools that require all children to participate in sport, then the prevalence of sport participation among the study participants will overestimate the true prevalence among all school children in New Zealand, because not all schools expect all children to participate in sport.
We call this error a recruitment error (also known as a selection bias). Of note, even if the investigators recruit their participants correctly (e.g. a random sample from a list of all New Zealand school children), it is still possible to recruit a non-representative sample just by chance alone, particularly if the sample is small. This is known as a random sampling error and is discussed later in this chapter under Random Error. In many studies, it is unnecessary to recruit participants who are representative of a specified external population. For example, it is possible to determine that a particular antibiotic can cure a particular type of infection in a group of people who are not representative of any particular population. Nevertheless one should always ask: is sufficient information given about the recruitment process for me to determine whether I could apply the results of this study to myself, my patients, or my population? The other type of error that is often described as a selection (or recruitment) error in textbooks occurs when many or all of the participants allocated to the Exposure Group are recruited from a different source than the participants allocated to the Comparison Group. This is equivalent to the GATE frame having two separate or overlapping triangles. For example, in a study investigating whether heavy manual labour reduces the risk of heart disease, the Exposure Group (workers who undertake heavy manual labour) will have to be recruited from a population of labourers, whereas the Comparison Group (sedentary people) could be recruited from, say, office workers or the general population.
As heavy manual labourers have to be fit and healthy enough to do manual labour (and many office workers may be too unfit or unhealthy to do heavy manual labour), any association found between manual labour and risk of heart disease may be more related to the characteristics of the people recruited than to the effect of manual labour on the risk of heart disease. This type of error is very different from the recruitment error described in the section above, and we prefer to consider it as one of the allocation/adjustment errors described in the next section.

To remind readers to consider all the key recruitment issues, the triangle in the GATE frame is divided into 3 overlapping levels (Figure 2.1): i. the open top of the triangle represents the setting in which the eligible population was recruited, for example school children living in New Zealand in 2010; ii. the rest of the triangle, combining the two lower levels, represents the eligible population (i.e. those who meet the inclusion and exclusion criteria [the eligibility criteria]; for example, including children aged 5-9 years but excluding those with significant disabilities); and iii. the tip of the triangle represents those from the eligible population who agree to take part (i.e. the study Participants). Often only a small proportion of the eligible population (e.g. the more healthy or more motivated) agree to participate in a study and, as discussed above, if the study objective is to measure the prevalence of a dis-ease in a community or population, it is important to determine if the participants are similar enough to all people who meet the eligibility criteria. If a substantial proportion of the eligible population does not agree to take part, these people are known as non-responders, and if the non-responders differ from the responders, this can cause a recruitment error (also known as a non-response bias or selection bias). There is no specific level of response (i.e. the response rate) that is considered unacceptable, but a response rate of less than about 70-75% of those invited could cause a significant recruitment error in prevalence studies. The other type of study that requires a well-defined participant population is a prognostic study (e.g. consider a study question like: what is the prognosis (probability of survival or of death) among patients aged 40 to 50 years diagnosed with advanced prostate cancer?).

A stands for Allocation or Adjustment. The A question is: were the study participants successfully allocated (± adjustment) to the Exposure Group (EG) and the Comparison Group (CG)? In the ideal epidemiological study, the participants allocated to the Exposure Group and the Comparison Group would be identical except for the factor (the exposure) being investigated. In many studies this is not possible, but the allocation and adjustment processes are designed to make EG and CG as similar as possible. There are two common ways of allocating participants to the exposure and comparison groups in epidemiological studies. One way is to allocate participants by a random process to EG and CG. Studies using this random allocation process are called randomised controlled trials (RCTs) and involve the study investigators, in effect, tossing a coin for each participant. If it comes up heads, the participant is allocated to, say, EG and is offered the study exposure (e.g. a drug); if it comes up tails, the participant is allocated to CG (the control or comparison group) and receives a placebo (a non-active tablet that looks identical to the study drug), or perhaps an alternative treatment, or nothing. The purpose of the randomisation process is to give all participants an equal chance of being allocated to EG or CG and so produce exposure and comparison groups that are very similar. RCTs are experiments because the study investigators actively experiment on participants by controlling the allocation process. Studies that randomly allocate participants to exposure and comparison groups should not be confused with studies that randomly sample (recruit) participants from a population (discussed under Recruitment above). Occasionally the investigator chooses which participants will receive the exposure, rather than using a random allocation process.
Such non-randomised experiments are not recommended because the study investigators may choose to treat particular people they think will benefit most from the study treatment. While this may seem a very reasonable approach, it is almost certain that the people in EG will differ from those in CG, which makes it difficult to separate the effect of the treatment from the effects of other attributes of the people selected. This problem, known as confounding, is discussed further below.

The main alternative to randomly allocating participants to EG and CG is to allocate participants by measurement. Participants are measured to determine if they are exposed to the factor(s) being investigated in the study (e.g. they may be questioned about cigarette smoking or be asked to have a blood test). Participants are then allocated to EG (e.g. the smokers' group) or CG (the non-smokers' group) according to these measurements. In the smoking example, the measurement instrument is usually a questionnaire; however if the exposure of interest in the study is a blood test result, then the measurement instrument would be a test done in a laboratory. Studies that allocate participants by measurement rather than by a random allocation method are usually called observational studies, because participants are observed in order to determine if they are exposed or not exposed and are then allocated to the appropriate exposure/comparison group. In observational studies the exposure and comparison groups are frequently quite different from each other, and we usually try to adjust for these differences in the analyses (see below). Therefore it is important to collect sufficient information about the differences between EG and CG that can be used for the adjustments. As discussed in the introduction to the Allocation section, in the ideal epidemiological study the only difference between participants in EG and in CG is the presence or absence of the exposure (E) being investigated, but this is seldom the case in observational (non-randomised) studies.
For example, in a non-randomised study investigating the effect of vigorous leisure-time physical activity (E) on the risk of heart attacks, participants who report taking vigorous leisure-time activity, say, at least three times per week will be allocated to EG, and participants who report less activity will be allocated to CG. In a study like this, it is very likely that the average age of participants in EG will be younger than in CG, because more young people are physically active than older people. As heart attacks are also less common in younger than in older people, there will be fewer heart attacks in EG than in CG simply because of the difference in average age between EG and CG. In addition, people who take regular leisure-time physical activity tend to smoke less and eat more healthily than non-regular exercisers. So while there may be a real beneficial effect of physical activity on heart attack risk, any difference between the occurrence of heart attacks (EGO) in the physically active group and the occurrence of heart attacks (CGO) in the less active group will be caused by a mix of the effect of physical activity on heart attacks and the effects of age and other factors on heart attacks. This problem of mixing two or more effects (e.g. younger age, physical activity and non-smoking) that are all related to the dis-ease outcome is called confounding. Some epidemiological textbooks state that confounding is caused by a selection bias, because it is caused by the methods used to select (i.e. allocate) participants into EG and CG. However it is important not to confuse this type of selection bias with the recruitment error discussed in the Recruitment section above, which is due to the methods used to recruit (or select) participants for the study (the GATE triangle).
For these reasons we prefer not to use the term selection bias at all, and instead use the term allocation error as the cause of confounding, because the error arises from how participants were allocated to EG and CG. The best way to reduce the likelihood of allocation error (and therefore confounding) is to conduct an RCT in which participants are randomly allocated to EG and CG. Randomisation is a very effective allocation method for producing two groups (EG and CG) with similar characteristics. If the study is big enough, random allocation will result in similar numbers of older and younger people, men and women, smokers and non-smokers, etc., in EG and CG. However, in some RCTs, particularly small ones, randomly allocating participants may not produce groups with similar characteristics, just by chance alone. Therefore it is always important to check for differences between EG and CG at the beginning of a study; this is called a baseline comparison and should be done whether the study has allocated participants by randomisation or by measurement.

Allocation error can still occur in a large RCT if the random allocation process is tampered with. Consider the scenario of a surgeon who takes part as one of the investigators in an RCT comparing a surgical (E) and a medical (C) treatment to prevent a heart condition. The study protocol requires the surgeon to open a sealed envelope that contains a randomly generated instruction to proceed either with surgical (E) or medical (C) treatment. The surgeon opens the envelope and finds an instruction to allocate the patient to the medical group (CG) and so is expected to proceed with medical treatment. However, the surgeon may feel that this particular patient is more likely to benefit from surgical than medical treatment, possibly because the patient is young and healthy and will cope with surgery much better than older patients. In this scenario, the surgeon may feel some pressure to reseal the envelope, choose another one, and keep doing so until an envelope instructs allocation of the patient to the surgical group (EG). In this example the effect of the surgery on the outcome will be mixed with the effect of the patient's young age on the outcome, which will cause confounding. This tampering with the randomisation process is believed to have been surprisingly common in the past, and simple methods have been developed to prevent it, or at least reduce the chance of it happening. The solution is known as concealment of allocation. There are a number of ways of doing this, but in effect it involves getting a completely independent person to open the envelope, write down the treatment group and the name of the participant, and then instruct the investigator implementing the treatment (e.g. the surgeon) which group the participant has been allocated to. This means that any subsequent tampering with the randomisation process will be obvious, because the independent person has documented the correct group for that participant. In practice, concealment of allocation is usually done by phone, fax or the internet, which further separates the independent person from the participant and investigator. Another, less secure, approach is to number the envelopes and their contents and require them to be used consecutively; it is usually, but not always, possible to check whether this has been done correctly. Studies comparing concealed with unconcealed allocation have been shown to produce quite different results, with unconcealed allocation tending to exaggerate the benefits of the favoured treatment. Of note, if the study exposure (E) is a drug and the comparison group exposure (C) is an identical-looking tablet, then concealment of allocation is unnecessary as long as no one directly involved with the participants or practitioners can tell the difference between E and C. While a large RCT with concealed random allocation is the best way to minimise allocation error, randomisation is only possible when the exposure being investigated is considered to be safe (e.g. cholesterol-lowering drugs). You cannot randomise people to smoking or non-smoking groups, or to any drugs you believe may be harmful! Therefore in many situations only non-randomised studies are possible. However, in studies that allocate participants to EG and CG by measurement rather than by random allocation, it is common to find important differences between the characteristics of participants in each group that may affect the study outcome and cause confounding. A range of methods are used to reduce this confounding, either in the design of the study or in the study analyses.
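The concealment idea described above (an independent person documents the allocation before disclosing it, so later tampering is visible) can be sketched as follows; the function and all its details are purely illustrative, not a real trial system:

```python
import random

def concealed_allocation(participant_id, log, rng=random):
    """Sketch of central (concealed) random allocation: an 'independent
    office' draws the group and records it BEFORE telling the investigator,
    so any later deviation can be checked against the log."""
    group = rng.choice(["EG", "CG"])
    log.append((participant_id, group))  # written down before disclosure
    return group

# The independent office keeps its own record the investigator cannot rewrite:
allocation_log = []
for pid in range(1, 6):
    concealed_allocation(pid, allocation_log)
print(allocation_log)
```

The key design point is the ordering: the draw is logged by a party independent of the investigator before the investigator learns the result, which is what makes "resealing the envelope" detectable.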

A also stands for Adjusted analyses. The alternative A question is: if there were differences in the characteristics of participants in EG and CG that could affect the study dis-ease outcomes (i.e. confounders), were they adjusted for in the analyses?


Allocation errors can be reduced in the study analyses by dividing participants into, say, older and younger age groups or strata (equivalent to dividing the study participants in the triangle into two triangles and then analysing the data as if there were two separate studies). The results of the analyses in the different strata can then be combined, if they give reasonably similar results. If they give very different results, they should be reported separately. This analytical approach is known as stratified analysis and in the example given here, the analysis addresses (or adjusts for) allocation error (or confounding) caused by allocating more young people to the exposure group (e.g. frequent physical activity) than to the comparison group (infrequent physical activity). This is the equivalent of age-standardisation, which is commonly done when national populations with different age structures are compared. Stratified analyses are also called adjusted analyses. Other multivariate statistical methods can be used to reduce the amount of confounding by adjusting for multiple differences (e.g. age, smoking, gender, socio-economic status etc.) between EG and CG. These multivariate analyses simultaneously stratify EG and CG into multiple comparable strata and a detailed description is beyond the scope of these notes. Unfortunately it is never possible to fully adjust for confounding factors, firstly because the confounders are seldom measured perfectly but secondly, and more importantly, because it is obviously only possible to adjust for the confounding factors that are measured. There will often be some confounding factors, such as a positive, healthy attitude to life, which are very difficult to measure but may have a major effect on aspects of a participant's behaviour that also have an effect on the dis-ease outcome being studied.
Therefore it is important not to over-interpret the results of a non-randomised study if you believe there could be important unmeasured or poorly measured confounding factors.
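A rough illustration of stratified (adjusted) analysis: the counts below are invented so that young people are over-represented in the exposure group. The crude risk ratio makes the exposure look strongly protective, but the Mantel-Haenszel summary across age strata (a standard method for combining stratum-specific results) shows no effect once age is accounted for:

```python
# Each stratum: (exposed cases, exposed total, comparison cases, comparison total)
# Invented data: risk is 0.10 in both groups among the young and 0.40 in
# both groups among the old, but EG contains far more young people.
strata = {
    "younger": (8, 80, 2, 20),
    "older":   (8, 20, 32, 80),
}

def crude_rr(strata):
    """Risk ratio ignoring the strata (confounded by age here)."""
    a = sum(s[0] for s in strata.values())
    n1 = sum(s[1] for s in strata.values())
    b = sum(s[2] for s in strata.values())
    n0 = sum(s[3] for s in strata.values())
    return (a / n1) / (b / n0)

def mantel_haenszel_rr(strata):
    """Standard Mantel-Haenszel summary risk ratio across strata."""
    num = sum(a * n0 / (n1 + n0) for a, n1, b, n0 in strata.values())
    den = sum(b * n1 / (n1 + n0) for a, n1, b, n0 in strata.values())
    return num / den

print(round(crude_rr(strata), 2))            # confounded estimate
print(round(mantel_haenszel_rr(strata), 2))  # age-adjusted estimate
```

With these made-up numbers the crude RR is about 0.47 while the age-adjusted RR is 1.0: the entire apparent "protective effect" was confounding by age.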

M stands for Maintenance. The M question is: were most of the participants maintained throughout the study in the groups (EG & CG) to which they were initially allocated? In the perfect epidemiological study, once allocated to EG or CG, participants should remain in their allocated group and: i. maintain their exposure or comparison status throughout the study; ii. not be exposed to other factors that could influence the study outcome(s); and iii. not drop out of the study. If some participants' exposure status changes or some are lost to follow-up, this can introduce a maintenance error. In practice participants are seldom perfectly maintained in their allocated groups, but as long as any maintenance errors are small and similar in both EG and CG, the error will underestimate the true effect of the exposure on the study outcome(s). This conservative error is usually considered preferable to not knowing whether the error will exaggerate or underestimate the true study effect measures. The best way to keep the degree of maintenance error similar in EG and CG is to keep the participants and study practitioners blind to whether the participant is receiving the study exposure (E) or the comparison exposure (C). This is easier to achieve with drugs (which can be prepared to look identical) than with other interventions like surgery or physiotherapy or diet.

There are 4 main factors that can cause Maintenance error: compliance, contamination, co-intervention and loss-to-follow-up. In some RCTs, participants are required to take an intervention (e.g. a drug or diet) every day for a number of months or years. If they comply with these instructions most of the time (e.g. take 80% or more of the tablets prescribed) they are considered to have good Compliance. The level of compliance is often checked periodically in trials and it usually falls over time. If the comparison group is also actively exposed (e.g. receiving an alternative drug), then their compliance should also be assessed. Similarly, exposure status should be checked periodically in long-term observational studies (e.g. light, moderate or heavy drinking), although this is seldom done and is an important shortcoming of many observational studies, particularly those with very long follow-up periods during which exposure status may change a lot. In studies where both EG and CG receive (different) treatments, compliance can be a problem for both EG and CG. If participants in CG receive the study treatment that was only meant for EG this is called Contamination, because participants in CG are contaminated by the exposure that was meant only for EG. In studies where both EG and CG receive (different) treatments, contamination can go both ways: from EG to CG or from CG to EG. Maintenance error can also occur if the exposure and comparison groups are treated differently during the study in any way (other than being exposed to E or C) that influences the dis-ease outcome. This is more common when the participants or the clinical staff involved are aware of which group (EG or CG) the participants are allocated to (i.e. are not blind to exposure status). For example in a study comparing the effectiveness of a new blood pressure lowering drug (E) with an older drug (C), participants in the comparison group may have been allocated to a treatment that is usually perceived to be less effective than the new drug being studied. So these participants may ask for, or be more receptive to, advice about lifestyle changes to help reduce blood pressure or may be more likely to try additional interventions. Similarly the clinicians involved may be more likely to provide other interventions to participants allocated to the older therapy. This form of maintenance error is known as Co-Intervention. The best way to reduce co-intervention (and also contamination) is to keep both participants and practitioners blind to the allocated exposure.

If neither the participants nor the practitioners are aware of which exposure (or comparison exposure) the participants are receiving (i.e. the exposure status), then the study is called a double blind study. In studies where only the participants or only the study staff are unaware of participants' exposure status, the study is called a single blind study. Participants who stop taking part or drop out of a study (i.e. are lost to follow-up) sometime after being allocated to EG and CG can also cause a maintenance error. The degree of error will be exaggerated if the numbers and characteristics of those who are lost to follow-up differ substantially between EG and CG. One way to address this problem (other than blinding) is to calculate EGO and CGO assuming that all the participants initially allocated to EG and CG are still in their allocated groups. This is called Intention-To-Treat (ITT) or Intention-To-Expose (ITE) analysis and will reduce any differences between EGO and CGO and therefore underestimate the effects of the exposure being investigated. This conservative error is considered to be preferable to the alternative approach of only including participants who remain in EG and CG in the calculations of EGO and CGO, known as On-Treatment (or per-protocol) analyses. On-Treatment analyses tend to exaggerate differences between EGO and CGO and therefore over-estimate any true effect of interventions. Some studies calculate EGO & CGO using both intention-to-treat and on-treatment methods. If the differences between the two methods are small, then loss-to-follow-up is less likely to be an important cause of error.
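The contrast between intention-to-treat and on-treatment analyses can be shown with a small worked example. The counts below are invented: some sicker participants in EG stop treatment, so an analysis restricted to completers makes the treatment look better than the intention-to-treat analysis does:

```python
# Hypothetical trial counts (made up for illustration):
# EG: 100 allocated, 12 events in total; 20 (sicker) participants stopped
#     treatment, and only 6 of the 12 events occurred among the 80 completers.
# CG: 100 allocated, 15 events; everyone completed.

def risks(eg_events, eg_n, cg_events, cg_n):
    """Return (EGO, CGO, risk difference) from event counts."""
    ego, cgo = eg_events / eg_n, cg_events / cg_n
    return ego, cgo, ego - cgo

itt = risks(12, 100, 15, 100)         # intention-to-treat: allocated denominators
per_protocol = risks(6, 80, 15, 100)  # on-treatment: completers only

print("ITT RD:", round(itt[2], 3))                    # -0.03
print("Per-protocol RD:", round(per_protocol[2], 3))  # -0.075
```

With these made-up numbers the per-protocol risk difference (-7.5 events per 100) is two and a half times the intention-to-treat estimate (-3 per 100), illustrating how on-treatment analyses can exaggerate an apparent benefit when drop-out is related to prognosis.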

boM stands for blind or objective Measurement of dis-ease outcomes. The boM question is: were the people who measured the dis-ease outcomes unaware of (i.e. blind to) the participants' exposure status, or were these measurements made objectively (using measurement instruments that were not influenced by subjective human factors)? Errors in the measurement of outcomes can result in study participants being classified in the wrong dis-ease outcome category of the GATE frame (e.g. classified as mild disability rather than moderate disability, which is often a difficult distinction). If these errors result from deficiencies with measurement methods or instruments (e.g. a questionnaire rather than a blood test for smoking, or a poorly designed questionnaire for diagnosing disability, or perhaps a faulty set of scales that overestimates weight) they will cause a measurement error. While boM as described here relates to errors in the measurement of outcomes, it is similar to errors in the measurement of exposure status in non-randomised studies, where measurement error could result in participants being allocated to the wrong exposure (or comparison) groups (e.g. non-smoking rather than smoking). We refer to this latter error as an allocation error, because it will result in people being allocated to the wrong exposure/comparison group. Errors in the measurement of outcomes can be reduced in several ways. Knowledge of a participant's exposure status can influence the participant's or the practitioner's perception or interpretation of signs and symptoms of the study outcome. For example, the results of RCTs of surgery (E) versus physiotherapy (C) for treating limitations in knee movement and pain due to damaged cartilage in the knee joint can be influenced just by the knowledge of which intervention was used. Participants receiving surgery may report greater improvements in movement and less pain than participants receiving physiotherapy, because they may assume that surgery to remove damaged cartilage should be more effective than physiotherapy. As the outcomes being investigated (i.e. range of movement of the knee and pain) are not simple, objective, clear-cut (yes/no) measures, they are more susceptible to influence from subjective factors that may be unrelated to the effectiveness of the treatment. The practitioner who measures the degree of movement in a participant's knee and asks about pain may also be influenced by knowledge of the type of treatment received.

One way to reduce this problem is to blind the participants or investigators or both to knowledge of which intervention (exposure) participants received. While it is generally not possible to keep information about surgery from participants, there is a famous blinded study of surgery versus physiotherapy for the knee problem described above. To blind the participants, everyone in the study received a local anaesthetic and a small cut in the skin of the knee. However the actual surgery, which used a keyhole procedure through the small cut, was only undertaken on participants randomly allocated to the surgical group (EG). The surgeon just pretended to do the surgery on the comparison group (CG). As the procedure was done behind surgical drapes, participants were unable to tell if they had received surgery or not, so they were blind to the exposure. The practitioners measuring the study outcomes were also kept blind to whether participants were in EG or CG. This study showed no differences in pain or knee movement between the surgical and non-surgical groups, whereas most of the previous un-blinded studies had shown a benefit of surgery. The other main way of addressing errors in the measurement of outcomes (and also of exposures) is to use objective measurements wherever possible. Examples include well-validated standardised questionnaires about the study exposures and outcomes that are administered in exactly the same way to all participants. Alternatively, if a reliable blood test is available for measuring the dis-ease outcome, it would be considered more objective than self-reported information from the participant. For example, there is a blood test for checking whether someone has recently smoked a cigarette. Another example of a more objective measure than signs and symptoms is a chest x-ray to diagnose heart failure. However x-rays and other scans require interpretation, so radiologists reading the scans should also be blind to information about participants' exposure status. Clearly death is an objective measure of outcome.


AN stands for ANalyses. The AN question is: were the study analyses done correctly? As discussed in the previous chapter, the goal of all epidemiological studies is to measure EGO and CGO, and to calculate the difference between EGO and CGO (i.e. RR, RD, NNT/E). The AN component of RAMboMAN is to remind you to check whether these were done correctly. There are two key analytical issues in epidemiological studies, both of which have been mentioned above. The first relates to the denominator population used in the calculation of EGO and CGO. If everyone who was allocated to EG or to CG is included in the denominators in the analyses, then this is known as an intention-to-treat (or intention-to-expose) analysis, whereas if only those who remained on treatment are included, then an on-treatment (or on-exposure) analysis has been done. As discussed in the loss-to-follow-up section under Maintenance error, intention-to-treat/expose analyses are generally considered to be the preferable approach. The second key analytical issue relates to adjustments for potential confounding, which has already been described under Allocation (and adjustment) error.
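The basic AN calculations can be expressed as a short function. This sketch uses the worked numbers from Figure 2.2 (EGO = 9 and CGO = 6 events per 100 in 5 years, written here as proportions):

```python
def effect_measures(ego, cgo):
    """Relative risk, risk difference and number needed to treat/expose,
    from the occurrence measures in the exposure and comparison groups."""
    rr = ego / cgo            # relative risk: EGO / CGO
    rd = ego - cgo            # risk difference: EGO - CGO
    nnt = 1 / abs(rd) if rd != 0 else float("inf")  # NNT/E = 1 / |RD|
    return rr, rd, nnt

# EGO = 9 per 100 and CGO = 6 per 100 over 5 years, as in Figure 2.2
rr, rd, nnt = effect_measures(0.09, 0.06)
print(round(rr, 2), round(rd, 3), round(nnt, 1))  # 1.5 0.03 33.3
```

So for this example the RR is 1.5, the RD is 3 events per 100 in 5 years, and roughly 33 people would need to be exposed for 5 years for one extra event to occur.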

2.3 RANDOM ERROR and 95% CONFIDENCE INTERVALS

In epidemiological studies, random errors are errors that occur due to chance, rather than due to the way studies are designed and conducted. Unlike the non-random errors described in Section 2.2, most random errors can be reduced by increasing study size or by increasing the number of times a factor is measured on each participant (e.g. blood pressure). Using the analogy of throwing a standard six-sided dice: every time you throw it, there is a one in six chance of it landing on one of the six sides. However if you throw it six times it is very unlikely to land once on each of the sides; you may even get the same side six times just by chance alone! Yet if you throw it 600 times it will land on each of the six sides approximately 100 times each. The more throws you make, the less influence chance (random error) will have on the number of times the dice lands on a particular side rather than another. There are a number of causes of random error in epidemiological studies. The four main ones are described below.

Random sampling error: In the study discussed in Section 2.2 about the prevalence of regular participation in sport among New Zealand school children, let us assume that the study participants were a representative sample of all children and that they were recruited by taking a random sample from all New Zealand school rolls. Even if the recruitment process is done very well, the participants will never be a perfectly representative sample of all children on all the school rolls, because you would literally have to include every school child in New Zealand to achieve this. Every representative sample of school children recruited will be slightly different from every other sample, just by chance. So the prevalence of sport participation in one sample of children will be different from the prevalence in other samples and they may all be different from the true prevalence among all school children, which is what the study is trying to determine.
This error is known as a random sampling error.


Random sampling error is inherent in every study because, as discussed above, every study population can only be a sample of the total population of interest. While repeated samples from the same total population will all be different, the bigger the sample chosen, the smaller the differences between the sample and the total population (i.e. the smaller the random sampling error).
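The dice analogy is easy to check by simulation. The sketch below (illustrative only; the seed and throw counts are arbitrary) shows that the spread of the six side-proportions around the expected 1/6 shrinks as the number of throws grows:

```python
import random
from collections import Counter

def side_proportions(n_throws, seed=1):
    """Throw a fair six-sided dice n_throws times and return the
    proportion of throws landing on each side."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    counts = Counter(rng.randint(1, 6) for _ in range(n_throws))
    return {side: counts[side] / n_throws for side in range(1, 7)}

for n in (6, 60, 600, 60000):
    props = side_proportions(n)
    spread = max(props.values()) - min(props.values())
    # the gap between the most and least common side shrinks as n grows
    print(n, round(spread, 3))
```

With 6 throws the proportions are typically all over the place; with 60,000 they all sit close to 0.167, which is the random-error story of the text in miniature.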

Random measurement / assessment error: The measurements of exposure (and comparison) status and of the dis-ease outcomes are all subject to random measurement error. Our ability to measure biological factors in exactly the same way every time we measure them is often poor, particularly if the measurement instrument requires a human operator. For example when measuring blood pressure with a standard sphygmomanometer, operators may record different results in repeated measurements of, say, a person's blood pressure level even if the actual level remained unchanged. This could be due to a variation in background noise or some other factor that influences the operator's ability to detect the blood pressure sounds accurately. The best way to reduce this random error is to take multiple measurements and average them or to use an automatic, more objective, instrument.

The randomness inherent in biological phenomena: A fundamental cause of random error is the inherent variability in all biological phenomena and therefore inherent variability in all measurements of biological phenomena (i.e. measuring factors in living organisms that by definition are always changing). For example, if blood pressure is measured a number of times on the same person, using exactly the same automatic sphygmomanometer, each reading will be slightly different, even if the instrument is perfect (i.e. no random measurement error). The reason for these differences is that a person's blood pressure level changes from moment to moment. As with random measurement errors due to operator error, these differences between multiple measurements caused by biological variability can be reduced by taking multiple measurements and then averaging the results.
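A quick simulation illustrates why averaging repeated measurements reduces this random error. The numbers below are invented for illustration (a "true" blood pressure of 120 mmHg that varies from moment to moment with a standard deviation of 8 mmHg):

```python
import random
import statistics

def bp_readings(rng, k, true_mean=120.0, sd=8.0):
    """Simulate k blood pressure readings on one person whose
    moment-to-moment level varies around a stable true mean."""
    return [rng.gauss(true_mean, sd) for _ in range(k)]

rng = random.Random(42)  # fixed seed for reproducibility

# Recorded value when we take 1 reading vs the average of 4 readings,
# repeated for 2000 simulated clinic visits:
single = [statistics.mean(bp_readings(rng, 1)) for _ in range(2000)]
averaged = [statistics.mean(bp_readings(rng, 4)) for _ in range(2000)]

print(round(statistics.stdev(single), 1))    # spread of single readings
print(round(statistics.stdev(averaged), 1))  # averaging 4 roughly halves it
```

The spread of the averaged values is about half that of single readings (in general, averaging k readings shrinks the random variation by a factor of the square root of k).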

Random allocation error: As discussed in Section 2.2, the exposure and comparison groups in a randomised controlled trial may differ by chance alone, particularly if the trial is small; this type of random allocation error can also be reduced by undertaking a larger study, so large numbers are randomised to EG and CG.

2.4 ESTIMATING RANDOM ERROR WITH 95% CONFIDENCE INTERVALS

As there is random error in every measurement in every epidemiological study, all measures of EGO or CGO and calculations of RR or RD or NNT will have random error. Fortunately statisticians have developed methods for estimating the amount of some of the random errors described above, particularly random sampling error. There are two main ways of describing the amount of random sampling error in a measurement or calculation: confidence intervals and p-values; they are different ways of expressing the same information. We will focus on confidence intervals because they are easier to understand and usually more informative than p-values. A small section on p-values is included as they are still in common use. It is important to appreciate that it is not possible to estimate the total amount of random error in a study measurement or calculation, so the confidence intervals and p-values described here only account for some of this error, mainly the random sampling error, which is the easiest to measure. Therefore the estimates of the amount of random error described by confidence intervals or p-values are all very approximate and will generally underestimate the total random error. The true amount of random error could be up to twice as much as it is possible to calculate and will depend in particular on the amount of random measurement error present in a study (personal correspondence, Doug Altman).

Confidence Intervals (CI): A confidence interval describes the range of values of a particular measure (e.g. EGO, CGO, RR, RD, NNT) that is likely to include the true value. The amount of random error in the measurement is reflected in the width of the confidence interval; the wider the interval, the more random error in the measure. The 95% confidence interval (CI) is the most commonly used CI for reporting the amount of random error in epidemiological studies. There is nothing special about a 95% CI compared, say, with a 90% CI; it has simply become the standard CI measure of random error, just like a p-value of greater than or less than 0.05 has become the standard p-value measure of the degree of random error in a result. Occasionally, a study will report 90% or 99% intervals or even a family of confidence intervals from, say, 50% to 99% intervals. For a given measurement or calculation the 99% CI will be wider than the 95% CI (because there is a 99% chance that it includes the true value rather than only a 95% chance), and the 95% CI will be wider than the 90% CI etc. Figure 2.2 illustrates the results of a study in which EGO = 9.0, CGO = 6.0 and the RD (EGO - CGO) = 3.0. The filled-in black squares represent the actual values (or point estimates) of EGO, CGO and the RD determined in this example, while the horizontal lines through the squares represent the 95% CIs. Each end of a confidence interval is called a confidence limit. For example the point estimate for EGO = 9.0 events per 100 in 5 years, the lower 95% confidence limit is 8.0 events per 100 in 5 years, and the upper 95% confidence limit is 10.0 events per 100 in 5 years. (Note: these CIs are not real intervals but have been made up to illustrate the principles.)
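For readers who want to see where such intervals come from, the sketch below computes approximate (Wald) 95% confidence intervals for a risk and for a risk difference. The event counts are invented to roughly match the EGO = 9 per 100 example:

```python
import math

def risk_ci(events, n, z=1.96):
    """Approximate (Wald) 95% CI for a risk such as EGO or CGO."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of a proportion
    return p, p - z * se, p + z * se

def rd_ci(e1, n1, e0, n0, z=1.96):
    """Approximate 95% CI for a risk difference (EGO - CGO)."""
    p1, p0 = e1 / n1, e0 / n0
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    rd = p1 - p0
    return rd, rd - z * se, rd + z * se

# 90 events among 1000 people in EG (EGO = 9 per 100),
# 60 events among 1000 people in CG (CGO = 6 per 100):
print(risk_ci(90, 1000))
print(rd_ci(90, 1000, 60, 1000))
```

With 1000 people per group, EGO = 0.09 has a 95% CI of roughly 0.072 to 0.108, and the risk difference of 0.03 has a 95% CI of roughly 0.007 to 0.053; note how the whole RD interval sits above zero, so these invented counts would be conventionally "statistically significant".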

Figure 2.2 The confidence interval for EGO from 8.0 to 10.0 estimates the degree of uncertainty (i.e. the amount of random error) in the study measurement of 9.0 events per 100 in 5 years. If one assumes that the study participants are a representative sample of the total population of interest, and if one also assumes there is no non-random error in the measurement of EGO, then a reasonable interpretation of this 95% confidence interval is that there is about a 95% probability that the true value of EGO in the total population of interest lies between 8.0 and 10.0. The more exact definition of a 95% confidence interval is more of a mouthful: if the same study is repeated many times using random samples recruited from the same total population, then approximately 95% of the (95%) confidence intervals would include the true value in the total population.

Figure 2.3 As shown in Figure 2.2, EGO, CGO and the RD all have CIs. For completeness, Figure 2.3 presents the RR for the same study (RR = EGO/CGO = 1.5, with its 95% CI: 1.2 to 2.0). Note that both figures (2.2 and 2.3) include a vertical dotted line, known as the no-effect line. The study will show no effect if EGO = CGO, which for a risk difference (EGO - CGO) has a value of zero (Figure 2.2) and for a relative risk (EGO/CGO) has a value of one (Figure 2.3). In the example given (Figures 2.2 & 2.3) the CIs for EGO and CGO do not overlap: the lower confidence limit for EGO (= 8.0) does not overlap with the upper confidence limit for CGO (= 7.0). When there is no overlap in the CIs for EGO and CGO it is reasonable to assume that EGO and CGO are truly different from each other, rather than there being an apparent difference that is due to the random error in the measurements. Also, when there is no overlap in the CIs for EGO and CGO, the confidence intervals for RD or RR will not cross the no-effect line. In other words, these two measures of the association between EGO and CGO also show a real effect and it is the convention to state that the study results are statistically significant. Later in the chapter we will argue about whether this statement is very useful or not.

Figure 2.4 The next 2 figures (2.4 & 2.5) show the results of another study with identical values for EGO, CGO, RD and RR but with wider confidence intervals. In Figure 2.4 there is an obvious overlap between the 95% CIs for EGO and CGO. This means that the study is unable to determine if EGO is different from CGO in the population from which the study participants were recruited, because EGO could be any value between 6.0 and 14.0 while CGO could be any value between 4.0 and 10.0. When the 95% CIs for EGO and CGO overlap, the 95% CIs for the RD and RR will usually cross the no-effect line. In other words it is not possible to determine if the true RD is positive or negative, or if the RR is more than or less than 1.0. In this situation it is the convention to state that the study results are NOT statistically significant. Of note, sometimes when the 95% CIs of EGO and CGO overlap, a p-value will still show a statistically significant effect (i.e. p < 0.05).


Figure 2.5 The most likely reason for the difference in the width of the confidence intervals in Figures 2.2 & 2.3 compared with Figures 2.4 & 2.5 is the size of the two studies. Yet a non-significant result (RR or RD) is often wrongly interpreted as demonstrating that there is no difference between EGO and CGO (i.e. that there is no association between the exposure and the dis-ease outcome in this population). When the 95% CI for a RR or RD crosses the no-effect line, we prefer to state that there is too much random error to determine if there is a real difference between EGO and CGO.

Statistical significance: Figure 2.6 shows the RD point estimates and 95% CIs for 4 identical studies, each with the same number of participants recruited from the same total population of interest. As the studies have the same number of participants, the CIs are a similar width. The studies have been ordered by the size of the RD point estimates and, as would be expected, none have identical point estimates because all are subject to random error. The first 2 studies (1 & 2) have 95% CIs that overlap the no-effect line (i.e. RD = 0), the upper 95% confidence limit of study 3 just touches the no-effect line, while the whole of the CI for study 4 is below the no-effect line. The convention is to describe the results of studies 1 & 2 as not statistically significant, study 3 as of borderline statistical significance, and study 4 as showing a statistically significant result. The figure also shows the p-values associated with each of these study results, which are described in more detail in the next section.

Figure 2.7 shows the RR point estimates and 95% CIs for 4 studies with identical methods drawn from the same population, but with different numbers of participants in each study. The difference in study size is represented by the size of the point estimate squares and also by the length of the 95% CIs. Bigger studies tend to have more dis-ease events, and the width of the CI is dependent on the number of events that occur in a study. As the number of events in a study increases, the width of the 95% CIs for EGO, CGO, RR or RD decreases. The CIs for study 4 in both Figures 2.6 and 2.7 suggest there may be a real difference between EGO and CGO in the populations from which the participants were recruited, as the 95% CIs for the effect estimates lie entirely on the lower side of the no-effect lines. Equally, if the whole 95% CI was above the no-effect line, this would also suggest a real difference between EGO and CGO, but in the opposite direction.

Figure 2.7 As discussed, a confidence interval can be calculated for all measures of dis-ease frequency (EGO & CGO for both prevalence & incidence) and all estimates of effect (relative risks, risk differences and NNTs). Confidence intervals can be calculated for both categorical outcomes (e.g. the incidence of prostate cancer) and continuous variables (e.g. average blood pressure level). There are many free computer programs available for calculating confidence intervals (e.g. www.pedro.org.au/wpcontent/uploads/CIcalculator.xls).

Clinical significance: Clinicians (and practitioners / decision-makers in general) should be more interested in the clinical significance (or practical significance) of a study's results than the statistical significance. A study result is considered statistically significant if neither of the confidence limits crosses the no-effect line, but the same result is only considered to be clinically significant if a clinician would make the same clinical decision whether the true result was at one end of the confidence interval or the other. For example if a treatment was estimated to reduce the risk of a dis-ease by 8 events per 1000 people treated for 5 years (i.e. RD = -8/1000/5 years) and the 95% CI was -14 to -2 events/1000/5 years, then despite being statistically significant, a clinician may decide the benefit is too small if the true effect was -2 events/1000 per 5 years, although an effect of -14/1000/5 years would have been large enough to be clinically worthwhile. So this statistically significant result is not necessarily clinically significant. Similarly a small but statistically significant effect with a narrow confidence interval (e.g. RD = -2/100 treated per 5 years; 95% CI: -3 to -1) may not be of clinical significance anywhere within the confidence interval.


P-values: A p-value is the main alternative, but less intuitive, way of describing the amount of random error in a study result. It describes the probability of getting a study result (e.g. RR = 1.5 or RD = 3.0) by chance alone, when there actually is no effect in the underlying population sampled (i.e. when EGO = CGO; a RR = 1.0 or RD = 0). The correct interpretation of a p-value, using the example of study 1 in Figure 2.6 (i.e. p > 0.05), is as follows: if there is no difference between EGO and CGO in the population from which the participants were sampled (i.e. true RD = 0), then there is more than a 5% probability that the study result obtained (e.g. RD = 0.8) could have occurred by chance alone. Therefore the hypothesis that there is no difference between EGO and CGO in the underlying population cannot be rejected. This is called a non-significant p-value. The description of a p-value is much more of a mouthful than the definition of confidence intervals and is also less intuitive! When one of the 95% confidence limits touches the no-effect value, as in study 3 in Figure 2.6, the p-value will be approximately 0.05 and this study is considered of borderline statistical significance. Similarly if the 95% confidence interval is narrower and does not include the "no-effect" value, as illustrated in study 4 in Figure 2.6, the p-value is likely to be < 0.05 and considered to be statistically significant. P-values are calculated from the same information used for calculating confidence intervals, so they can be interpreted like confidence intervals (i.e. to assess the amount of random error in a measurement). A large p-value or a wide confidence interval indicates a large amount of random error (i.e. poor precision). The degree of statistical significance of the results of a study depends on both the size of the study (which determines the width of the CIs around EGO, CGO, RR and RD) and on how different EGO and CGO are (i.e. the size of the RR or RD point estimates). P-values combine both these factors into one number (the p-value), whereas CIs disaggregate these two factors by examining the upper and lower confidence limits. Therefore CIs tend to be less prone to misinterpretation than p-values and are generally considered a more useful measure. However the exact p-value has the advantage of not being based on the arbitrary decision to calculate a 95% CI rather than, say, a 90% or 99% CI. The best compromise is to present the point estimate, the 95% CI and the exact p-value. Prior to the widespread use of personal computers, p-values were obtained from statistical tables and were generally presented as p < or > 0.05 or 0.01. However most personal computers can calculate exact p-values (e.g. p = 0.037), which are less prone to being misinterpreted as yes/no tests of the truth. As discussed at the beginning of the Random Error section, it is never possible to accurately quantify all the random errors in a study result and many assumptions are made in estimating these random errors. Therefore all measures of random error and statements about statistical significance should be considered as very approximate.
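The link between confidence limits and p-values can be made concrete: an estimate that sits exactly 1.96 standard errors from the no-effect value (so that its 95% CI just touches the no-effect line, like study 3 in Figure 2.6) has a two-sided p-value of about 0.05. A minimal sketch using a normal approximation (the numbers fed in are arbitrary):

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_sided_p(estimate, se):
    """Two-sided p-value testing the no-effect value of 0
    (e.g. for a risk difference), via a normal approximation."""
    z = estimate / se
    return 2 * (1 - normal_cdf(abs(z)))

# An estimate exactly 1.96 standard errors from zero, i.e. a 95% CI
# whose lower limit just touches the no-effect line:
print(round(two_sided_p(0.03, 0.03 / 1.96), 3))  # 0.05
```

Shrinking the standard error (a bigger study) while keeping the same point estimate drives the p-value down, which is the same information the narrowing confidence interval conveys.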
Meta-analyses (Figure 2.8): The confidence intervals and p-values calculated for a particular study are often misinterpreted as yes/no tests of whether there is a difference between EGO and CGO in the underlying population. If the p-value is greater than 0.05 and the 95% CI crosses the no-effect line (by convention a non-statistically significant result), then it is often assumed there is no difference between EGO and CGO in the underlying population. This is illustrated in Figure 2.8, which shows estimates of relative risks and associated 95% confidence intervals in four hypothetical randomised controlled trials studying the effect of a new blood pressure-lowering drug on the risk of stroke. As in the previous two figures, the squares indicate the point estimates of the effect in each study (RR in this example) and the length of the horizontal lines through the squares indicates the width of the 95% confidence intervals. In all four trials the upper confidence limit crosses the no-effect line (RR = 1.0); in other words, none of the trials demonstrates a (conventionally) statistically significant effect of blood pressure lowering on stroke risk. However this does not necessarily mean that there is no beneficial treatment effect. In fact all the point estimates of effect are to the left of the no-effect line (i.e. on the benefit side of the no-effect line, suggesting that blood pressure lowering may reduce the risk of stroke). So there could be a real benefit, but perhaps none of the individual studies is large enough to give a precise enough estimate (i.e. the estimates have too much random error) to demonstrate that there is a real treatment benefit. One way to determine if there is likely to be a real treatment effect is to combine the results of the four studies mathematically.
This is known as a meta-analysis and it generates a summary estimate of the effect (the large diamond at the bottom of the figure; a vertical line through the middle of the diamond gives the point estimate of the summary effect and the width of the diamond describes the summary confidence interval). The associated 95% confidence interval is narrow and does not cross the no-effect line, suggesting that there is likely to be a real treatment benefit. If a p-value were calculated for this summary estimate, it would be much less than 0.05 and would meet the conventional criteria for a statistically significant result. Combining multiple similar studies in a meta-analysis is an alternative to conducting one large study. Meta-analyses are now commonly undertaken to combine the results of multiple RCTs that individually have too much random error to demonstrate whether or not the intervention has a real effect. Increasingly, meta-analyses are also being used to combine other types of studies, particularly diagnostic test accuracy studies.
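The pooling step of a simple meta-analysis like the one in Figure 2.8 can be sketched with inverse-variance (fixed-effect) weighting. The four trial results below are hypothetical, chosen so that each individual trial's 95% CI crosses the no-effect line while all the point estimates lie on the benefit side:

```python
import math

# Hypothetical (RR, lower 95% CL, upper 95% CL) for four trials,
# each individually non-significant (upper limit crosses 1.0)
trials = [(0.80, 0.55, 1.16), (0.85, 0.60, 1.20),
          (0.75, 0.50, 1.12), (0.82, 0.58, 1.16)]

weights, weighted_logs = [], []
for rr, lo, hi in trials:
    # Recover the SE of log(RR) from the CI width, then weight by 1/SE^2
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
    w = 1 / se**2
    weights.append(w)
    weighted_logs.append(w * math.log(rr))

log_summary = sum(weighted_logs) / sum(weights)
se_summary = 1 / math.sqrt(sum(weights))
rr_summary = math.exp(log_summary)
ci = (math.exp(log_summary - 1.96 * se_summary),
      math.exp(log_summary + 1.96 * se_summary))
print(f"Summary RR = {rr_summary:.2f}, 95% CI {ci[0]:.2f}-{ci[1]:.2f}")
```

The summary CI is narrower than any individual trial's CI and no longer crosses 1.0, mirroring the diamond in Figure 2.8. This is only a sketch of the fixed-effect method; real meta-analyses also assess heterogeneity and may use random-effects models.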
2.5. DESCRIBING THE STATISTICAL POWER OF A STUDY

If the 95% confidence intervals around the RR or RD in a study cross the no-effect line and therefore do not meet the standard criteria for statistical significance (as in each of the four studies in Figure 2.8), there are two possible explanations. The first explanation is that there is no difference between EGO and CGO in the underlying population from which the study participants were recruited. The alternative explanation is that there is a difference, but the study results were too imprecise to be certain about this difference (they had too much random error / the confidence intervals around EGO and CGO were too wide). If the latter explanation is correct (i.e. there is a real difference but the study didn't find it), then the study does not have adequate statistical power to show this effect, and it is described as an under-powered study. If a study shows a statistically significant RR or RD (as in Study 4 in Figure 2.7), then by definition the study has adequate statistical power. The statistical power of a study is only in question when a study does not show a statistically significant result. A reasonable analogy for statistical power is the magnification power of a microscope. A high-powered study (equivalent to a high-powered microscope) can differentiate between two separate objects (in this case the two objects are the 95% CIs for EGO and CGO). If the CIs overlap, the study cannot differentiate between EGO and CGO (as shown in 2.4). This may of course be due to there being no difference between EGO and CGO in the underlying population, in which case the study would be considered to have adequate statistical power. However if there is a real difference between EGO and CGO in the underlying population, as demonstrated by the meta-analysis in Figure 2.8, then the individual studies would be considered to be under-powered.
The ability (or power) of a study to differentiate between EGO and CGO depends on both the distance between EGO and CGO and the width of the confidence intervals around EGO and CGO. The statistical power of a study is determined by: i. the size of the study (the bigger the study, the narrower the confidence intervals); ii. the size of the effect estimate (i.e. EGO − CGO or EGO/CGO); iii. the choice of confidence interval or p-value (usually a 95% CI, which is equivalent to p = 0.05); and iv. the variability of the factor in the population if it is a continuous measure (i.e. the standard deviation). Free on-line calculators are available that enable one to input these factors to calculate study power (e.g. www.dssresearch.com/toolkit/spcalc/power.asp). By convention, a statistical power of about 80% or more is considered reasonable. In other words, a study should have at least an 80% probability of demonstrating a statistically significant effect (i.e. RR or RD) if a real effect exists. Study investigators should always do a power calculation before they undertake a study so they can determine how many participants to include, as increasing the number of participants is the main way to increase a study's power. When critiquing an epidemiological study, it is only necessary to check the study power if the findings are not statistically significant.
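As a rough illustration of what the on-line power calculators mentioned above compute, the normal-approximation power formula for comparing two proportions can be sketched in a few lines. The EGO and CGO values and group sizes below are hypothetical:

```python
import math

def power_two_proportions(ego, cgo, n_per_group):
    """Approximate power of a two-sided 5% test for a difference
    between two proportions (normal approximation, equal group sizes)."""
    se = math.sqrt(ego * (1 - ego) / n_per_group
                   + cgo * (1 - cgo) / n_per_group)
    z = abs(ego - cgo) / se
    # Power ~= P(Z > 1.96 - z); the tiny probability of a significant
    # result in the wrong direction is ignored
    return 0.5 * math.erfc((1.96 - z) / math.sqrt(2))

# Hypothetical values: CGO = 10%, EGO = 7% (i.e. RR = 0.7)
power_1000 = power_two_proportions(0.07, 0.10, 1000)
power_2000 = power_two_proportions(0.07, 0.10, 2000)
print(f"n=1000 per group: power = {power_1000:.2f}")
print(f"n=2000 per group: power = {power_2000:.2f}")
```

Doubling the number of participants per group raises the power well above the conventional 80% threshold in this example, which is why increasing study size is the main way to increase power.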
CHAPTER 3: SUBTYPES OF EPIDEMIOLOGICAL STUDIES - all epidemiological studies hang on the GATE frame
3.1. INTRODUCTION

This chapter describes the design features of the major epidemiological study subtypes by hanging them on the GATE frame. Once you understand GATE you will understand the fundamental elements of epidemiological study design and will also find it easier to both appraise and design studies. All epidemiological studies fit on the GATE frame because all epidemiological studies are variations on one fundamental design, illustrated by the GATE frame. We recommend that when you are appraising or designing a study, you begin by drawing a GATE frame and then hang the study on the frame, documenting the main components using the PECOT acronym. Secondly, check the calculations of EGO and CGO; these are the fundamental calculations in all epidemiological studies (except case-control studies). Finally, use the RAMboMAN acronym to assess the strength of the study (i.e. the potential for biases in the estimates of EGO and CGO). When you appraise an epidemiological study, it is important to weigh up the relative importance of the study's strengths and weaknesses based on the RAMboMAN criteria. All studies have shortcomings and it is all too easy to dismiss the findings because a possible bias has been identified; feeding frenzies of criticism are common in critical appraisal sessions! A useful way to assess the significance of a study's weaknesses is to re-calculate a range of dis-ease occurrences (EGO and CGO) and estimates of association (RR and RD) based on a range of possible errors. This is called a sensitivity analysis. For example, if there is a large loss to follow-up, you could re-calculate EGO & CGO assuming, say, half those lost to follow-up had the primary study outcome. If this has little effect on the estimates of RR and RD, then the loss to follow-up is unlikely to cause a major bias. While all the RAMboMAN criteria should be considered when critiquing a study, the study question (and study design) determine whether some criteria will be more important than others, as discussed below.
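The loss-to-follow-up sensitivity analysis described above can be sketched with hypothetical numbers:

```python
# Hypothetical trial: 1000 participants per group, 50 outcomes in EG,
# 80 in CG, with 100 participants lost to follow-up in each group
eg_n, cg_n = 1000, 1000
eg_events, cg_events = 50, 80
eg_lost, cg_lost = 100, 100

# Observed estimates, treating those lost to follow-up as event-free
ego = eg_events / eg_n
cgo = cg_events / cg_n
print(f"Observed:    EGO={ego:.3f}, CGO={cgo:.3f}, RR={ego/cgo:.2f}")

# Sensitivity analysis: assume half of those lost had the outcome
ego_s = (eg_events + eg_lost / 2) / eg_n
cgo_s = (cg_events + cg_lost / 2) / cg_n
print(f"Sensitivity: EGO={ego_s:.3f}, CGO={cgo_s:.3f}, RR={ego_s/cgo_s:.2f}")
```

If the RR under the pessimistic assumption stays close to the observed RR, loss to follow-up is unlikely to be causing a major bias; a large shift would be a warning sign.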
3.2. KEY DIFFERENCES BETWEEN EPIDEMIOLOGICAL STUDY DESIGNS

Previous chapters have emphasised the similarities between all epidemiological study designs. This chapter describes the key differences in their designs. Epidemiological studies are primarily differentiated by: i. how participants are allocated to exposure and comparison groups; and ii. when the study outcomes are measured.

i. Non-randomised versus randomised allocation to exposure and comparison groups Studies can be differentiated into two types based on how participants are allocated to exposure and comparison groups.
Non-randomised studies are also called observational studies because, once the participants are recruited to the study, the investigators measure, or observe, who is exposed or not (e.g. smokers or non-smokers) and then allocate them to EG (e.g. smokers) & CG (e.g. non-smokers). In other words, the participants are allocated by measurement. In contrast, in a randomised controlled trial (RCT) participants are randomly allocated to intervention groups (EG) or a control group (CG). It is an experiment because the investigator (i.e. experimenter) determines who will be exposed or not. The occasional experimental study is non-randomised, if the investigator allocates participants to EG and CG using a non-random process. This is no longer a common study design because such studies have been shown to be very prone to confounding.

ii. Cross-sectional or Longitudinal measurement of outcomes: Epidemiological studies can also be differentiated into two subtypes based on when the study outcomes are measured. In cross-sectional studies, the measurement of outcomes is done at one point in time (i.e. prevalence), simultaneously with the allocation of participants into exposure and comparison groups. This does not imply that the participants have just become exposed, but rather that the measurement of exposure (and comparison) status happens at the same time as outcomes are measured. For example EG and CG may be smokers and non-smokers, while the Outcome categories might be high and low blood pressure. While both exposures and outcomes are measured at the same study visit, the participants had either been smoking or not smoking (i.e. exposed or not exposed) for some time before the study was done. In longitudinal studies, participants are followed over time and the outcomes can be measured over different time periods (i.e. incidence), often over many years following the allocation of participants to EG and CG. It is also possible to measure the prevalence of outcomes in longitudinal studies, at any specified point in time during the follow-up period. It is not possible to measure incidence in a cross-sectional study because there is no follow-up time.
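The distinction between prevalence (a single point in time) and incidence (new outcomes over follow-up time) can be illustrated with hypothetical counts:

```python
# Hypothetical cohort of 1000 people followed for 5 years
n = 1000
new_cases = 75          # outcomes occurring during follow-up
existing_at_start = 40  # people who already had the condition at baseline

# Incidence proportion (risk): new cases over the follow-up period,
# among those free of the condition at the start
at_risk = n - existing_at_start
incidence_5yr = new_cases / at_risk

# Point prevalence at baseline: existing cases at one point in time
prevalence_baseline = existing_at_start / n

print(f"5-year incidence = {incidence_5yr:.3f}")          # 0.078
print(f"Baseline prevalence = {prevalence_baseline:.3f}")  # 0.040
```

Only the longitudinal design yields the incidence measure, because it requires a period of follow-up; the cross-sectional design can only yield the prevalence measure.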

Prospective or retrospective assessment of exposures and outcomes: Many epidemiological studies are described as prospective (looking forward in time) or retrospective (looking back in time), but this can be very confusing and misleading. Most epidemiological studies investigate the effect of exposures on dis-ease outcomes, so it is essential to determine that the postulated exposure happened before the outcome occurred. In the ideal prospective study, the study exposure is either randomly allocated (as in an RCT) or is measured (as in a non-randomised study), and then outcome events are measured prospectively after the exposure status has been determined. All RCTs are prospective in design because the study always begins by allocating participants to exposure and comparison groups and then exposing them or not, so outcomes are always measured after the exposure has started. Many non-randomised studies are also prospective, but sometimes information on both exposures and outcomes is collected after they have happened (i.e. retrospectively).
For example, many of the occupational cohort studies investigating asbestos exposure and disease have involved examining old factory records of exposure to asbestos and old hospital and death records to identify relevant outcomes. The study participants could be all people who worked in a factory that handled asbestos; those participants who worked directly with asbestos would be allocated to the exposure group and those in other departments would be allocated to the comparison group. Such studies are often called historical or retrospective cohort studies. The collection of exposure and outcome data in these studies could be considered retrospective because the studies were undertaken well after both the exposures and outcomes had occurred. However if the actual data on exposure was documented in factory records before the disease outcomes were documented in hospital records or death registers, then the outcome data documentation (rather than the study data collection) will have been done prospectively following exposure documentation. When factory records were not available, some asbestos studies involved asking survivors to recall whether they were exposed to asbestos in their workplace or not. In this type of study the exposure data is measured retrospectively to outcome measurement. One of the causes of bias in retrospective measurement of exposure is that people who develop disease outcomes tend to provide study investigators with more complete information about their possible exposures in the past than people who do not develop disease (often called recall bias), because they tend to spend a lot of time trying to recall what could have caused them to get the disease. Another problem (and cause of bias) with this kind of retrospective cohort study is that it is often only possible to investigate prior exposure in survivors, and many potential participants may be missed because they had already died as a result of the exposure.
A similar bias occurs in some prospective non-randomised studies. Although it is possible to design a non-randomised study so that outcomes are only measured prospectively after the participants have been allocated to exposure and comparison groups, it is seldom possible to measure exposures prospectively. Participants in the exposure group of a non-randomised study may have been exposed for years prior to the study. For example, in a cohort study of smoking (exposure) and lung cancer (outcome) some potential participants who were long-term smokers would have died due to their smoking before the study started. One way around this problem would be to start a cohort study pre-exposure, early in life (e.g. at conception or at birth). These cohort studies are called life-course studies. Given the possible mix of retrospective and prospective measurements in the same study, it is better not to label studies as retrospective or prospective, but rather describe particular measurements of exposures and outcomes as either retrospectively or prospectively documented.

Descriptive or analytic studies: Sometimes epidemiological studies are labelled as descriptive and others as analytical. Descriptive studies describe the frequency of health-related behaviours, risk factors and outcomes, whereas analytic studies analyse these descriptions. However all epidemiological studies involve both description and analysis, so it is generally unhelpful to use these terms to differentiate between studies.
3.3. THE CLASSIFICATION OF EPIDEMIOLOGICAL STUDIES

The key characteristics that differentiate epidemiological study subtypes are covered in the section above. A brief description of each of the main designs is given below.

LONGITUDINAL STUDIES: Longitudinal studies include non-randomised studies (cohort studies, case-control studies, and prognostic studies) and randomised controlled trials.

Cohort (or follow-up) studies (Figure 3.1):

In cohort studies the study investigators measure the presence (and absence) of study exposures among the study participants (also known as a study cohort) and then allocate them into the appropriate Exposure Group (EG) or Comparison Group (CG) accordingly (e.g. a smoking group and a non-smoking group). Figure 3.1 illustrates the structure of a standard non-randomised cohort study in which the participants (P), or cohort, are allocated to the appropriate EG or CG after being assessed (or measured) to determine their exposure status. In cohort studies the EG and CG are followed for a period of time (T) during which dis-ease outcomes (O) are counted. In cohort studies the main measure of dis-ease occurrence is incidence (i.e. EGO = a/EG and CGO = b/CG during a period of time (T)). It is also possible to calculate prevalence measures in cohort studies by measuring a dis-ease outcome at one point of time during a study (e.g. blood cholesterol levels in the middle of a long-term study of the effect of dietary fat consumption on heart attacks).

[Figure 3.1. Design features of both randomised (RCTs) & non-randomised (Observational) cohort studies: Participants P (the study cohort) are allocated to the Exposure & Comparison Groups EG & CG (the denominators) either randomly (RCT) or by measurement (observational study); the Outcomes O (the numerators: yes = a, b; no = c, d) are counted over time T.]

Cohort studies are also known as follow-up studies because the cohort is followed over time to identify study outcomes. As discussed in Chapter 2, a common cause of bias in non-randomised studies is confounding. Clinical decisions to treat patients (e.g. with hormone replacement therapy following menopause) or the voluntary uptake of some lifestyle practices (e.g. smoking, a low fat diet, or regular leisure time exercise) do not happen at random, so exposed and non-exposed participants are likely to differ in other ways that may affect dis-ease outcomes. Therefore the results of non-randomised studies, particularly those investigating therapies and voluntary lifestyle practices, should be assessed for the presence of confounding, if at all possible.
Case-Control studies (Figure 3.2):

A case-control study is usually described as an observational study in which the exposure status (i.e. E or C) of a group of people with dis-ease, known as cases, is compared with the exposure status of a group of people from the same population without dis-ease, who are known as controls. Mathematically this can be restated as the odds of exposure in cases divided by the odds of exposure in controls, which is the odds ratio.

However a good case-control study is just a variation on the cohort study design, in which the case-control study is nested in a virtual cohort study (Figure 3.2). The dotted triangle, circle and square in Figure 3.2 represent the virtual cohort study, while the solid lines represent the case-control study nested inside this virtual cohort study. So the odds of exposure in the cases (a/b in the top part of the square in Figure 3.2) can be compared to the odds of exposure in the controls (eg/cg in the inner circle in Figure 3.2). Case-control studies can be done much more quickly and more cheaply than the equivalent cohort studies.

[Figure 3.2. Design features of a case-control study nested inside a virtual cohort study: dotted lines represent the Virtual Participants VP (the virtual study cohort); solid lines represent the case-control study; the controls (eg & cg) are a representative sample of VP allocated by measurement, and the cases (a & b) are all the dis-ease outcomes from VP.]
As with all epidemiological studies, the appraisal (or design) of a case-control study starts by defining the eligible population (those who meet the eligibility criteria) and then the method used to recruit the Participants (P). However the fundamental difference between a cohort study and a case-control study is that in a cohort study all Participants have their exposure status measured so they can all be allocated to either EG or CG, whereas in a case-control study only a small sample of potential Participants are recruited to have their exposure status measured. So the equivalent to a cohort study's P in a case-control study is referred to here as the virtual P (VP) or virtual study cohort. The sample chosen from the VP is the control group of the case-control study and should be a representative sample of VP. They have their exposure status measured and are allocated to an exposure group and a comparison group just as in a cohort study. As they are in effect a sample of the equivalent EG & CG in a cohort study, we use the lower case letters eg & cg to describe them. If the controls are a representative sample of the equivalent P in a standard cohort study, then the ratio of EG to CG should be identical to the ratio of eg to cg.

The cases in a case-control study (a & b in Figure 3.2) are usually all the virtual Participants who have a dis-ease event during the specified study follow-up period. The reason for including all dis-ease outcomes from VP is that the number of outcome events (a & b) is usually very small compared with the number of VP; indeed most epidemiological studies, whatever the design, struggle to find sufficient outcome events to calculate EGO & CGO with low random error. The advantage of a case-control study is
that one could define a very large number of people as the VP, such as the one million people in a large city who are on the electoral register, but only recruit 1000 of them as controls. Moreover because all the outcome events that occur among the one million people who are the VP are included as cases, it may be possible to identify 1000 cases in a reasonably short time period, say 1 year. The equivalent cohort study might only have sufficient funding to identify and measure the exposure status of 10,000 of the one million people. As the cohort study only includes one-hundredth of the one million VP in the case-control study, it could take up to 100 times as long to identify 1000 cases. So a cohort study with 10,000 people would only identify 100 cases in about 10 years of follow-up, whereas the equivalent case-control study would only have to include about 1000 controls (a representative sample of VP) and would be able to identify 1000 cases (all outcomes among VP in one year) and yet still be completed in 1-2 years.

Unfortunately case-control studies also have their shortcomings. In particular it is difficult to ensure that the controls are a truly representative sample of VP, and the validity of the calculations in a case-control study depends on the ratio of exposed to unexposed controls (i.e. eg to cg) being the same as the ratio of exposed to unexposed people in all VP (EG to CG). Another shortcoming of case-control studies is that exposure information is usually collected from cases after they have had their dis-ease event and, as discussed earlier in this chapter, retrospective measurements have their problems, because people with disease (i.e. cases) often recall much more detail about their past exposures than people without disease (i.e. controls). Case-control studies are the most appropriate design to use when outcomes occur soon (within minutes, hours or days) after exposure, such as the effect of sleepiness or drinking on road traffic injury. They are also the most efficient epidemiological study design, so they are commonly used when dis-ease outcomes are very rare and very large numbers of study participants would be required to generate sufficient numbers of dis-ease outcomes. As with all non-randomised studies, case-control studies are prone to confounding. It is not possible to directly calculate EGO (a/EG) and CGO (b/CG), the main measurements in all other epidemiological studies, in a case-control study because there are no complete denominators (i.e. EG & CG). Instead case-control studies measure eg and cg, which are samples of the virtual complete denominators. However if the ratio of eg to cg is very similar to what the ratio of EG to CG would be in the equivalent cohort study, then it is possible to calculate the Odds Ratio (OR) in a case-control study, and this estimate of effect is very similar to the Relative Risk (RR) calculated in a cohort study in most situations.
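The claim that the OR from a case-control study approximates the RR from the equivalent cohort study can be illustrated numerically. All counts below are hypothetical:

```python
# Hypothetical virtual cohort (VP) of 1,000,000 people: 200,000 exposed
# (EG) and 800,000 unexposed (CG), with 400 and 800 dis-ease outcomes
# (the cases a & b) respectively
EG, CG = 200_000, 800_000
a, b = 400, 800

# Full-cohort relative risk (needs the complete denominators EG & CG)
rr = (a / EG) / (b / CG)

# Case-control version: 1000 controls sampled representatively from VP,
# so eg/cg preserves the 1:4 ratio of EG to CG
eg, cg = 200, 800
odds_ratio = (a / b) / (eg / cg)

print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}")
```

Because the controls preserve the EG:CG ratio and the outcome is rare relative to VP, the OR computed from only 1000 controls matches the RR that would have required measuring the exposure status of the entire virtual cohort.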
As discussed earlier, the Odds Ratio is the odds of exposure in cases (a/b) divided by the odds of exposure in the controls (eg/cg). This measure of effect (or association), OR = (a/b) / (eg/cg), can be rewritten as OR = (a/eg) / (b/cg), which will be very similar to the RR = (a/EG) / (b/CG), assuming eg/cg = EG/CG. This assumption will be met if the controls (eg & cg) are a truly representative sample of the virtual cohort (VP), the population from which all the cases (a & b) come. As stated at the beginning of this section, the usual way of thinking about case-control studies is to consider them as a comparison between the exposure status of people with dis-ease (the cases) in a population and a sample of people from the same population without dis-ease (the controls). Taking a sample of people without disease is usually very similar to taking a sample of everyone in the virtual cohort because the number of people who develop disease is usually a very small proportion of all VP. The main benefit of
thinking of case-control studies as nested in virtual cohort studies is that it is easier to identify possible biases using the RAMboMAN criteria.

Prognostic studies: These are a version of cohort studies in which the objective is to determine how well exposure(s) (called prognostic factors) predict the occurrence of outcomes (often death) over a period of time, among people who generally already have dis-ease. In Figure 3.1, E and C would be called prognostic factors rather than risk factors as in a standard non-randomised cohort study. A person's prognosis (e.g. the probability that they will die, or survive, in the next 5 years) is very dependent on their current health status. For example, two people who have been diagnosed with bowel cancer will have a very different five-year prognosis if the cancer in one of them has spread beyond the bowel (i.e. has metastasised) whereas the cancer in the other person is still limited to the wall of the bowel. Therefore in prognostic studies it is very important that the participants have a similar severity of dis-ease or have been stratified into different categories based on the severity of their dis-ease. A person's prognosis is very time-dependent, so prognostic studies measure incidence (i.e. usually EGO = a/EG and CGO = b/CG during a defined time period, which are measures of the incidence of dis-ease). Alternatively EGO = c/EG and CGO = d/CG during a defined time period (Figure 3.1), which are measures of the incidence of survival.

Randomised Controlled Trials (RCTs): RCTs are versions of cohort studies in which the participants are randomly allocated to intervention (exposure) or control (comparison) groups. Figure 3.1 can be used to illustrate the structure of an RCT as well as a standard cohort study. The Participants (P) are randomly allocated to E or C and followed for a period of time (T) during which outcomes (O) are counted. As in cohort studies, RCTs are primarily designed to measure the incidence of outcomes in exposure and comparison groups (EGO = a/EG and CGO = b/CG during a defined time period). A randomised trial is the most valid study design for assessing the effectiveness (benefits and harms) of therapeutic or preventive interventions, including screening programmes. The main advantage of RCTs over standard cohort studies is that they are less prone to confounding, because the randomisation process is designed to produce an EG and CG with very similar characteristics. For practical and ethical reasons many important questions cannot be investigated using the randomised trial design (e.g. assessing the effects of potentially dangerous or illicit drugs). Moreover, when trials are possible, they are often conducted in artificial environments with atypical, highly motivated participants who may not be representative of any typical population (the R in RAMboMAN). This may limit the applicability of their findings to populations of interest, although the importance of representativeness is often over-emphasised in appraisals of RCTs.
The most important causes of bias in RCTs are poor allocation processes (randomisation and concealment) and poor maintenance of participants in their allocated groups.

CROSS-SECTIONAL STUDIES (Figure 3.3): All cross-sectional studies are non-randomised studies in which dis-ease outcomes are measured simultaneously with the participants' exposure status. There are several types of cross-sectional study, which are described below.

Surveys measuring the prevalence of outcomes: Many epidemiological studies simply involve recruiting a sample from a known population and measuring dis-ease outcomes at one point in time (i.e. prevalence or cross-sectional measurements). Examples include measuring the prevalence of health-related behaviours such as alcohol consumption, of risk factors such as blood pressure, or of diseases such as asthma, in a defined population (e.g. New Zealanders aged 25-44 years). The exposure and comparison groups in surveys are usually based on age, gender, ethnicity, social class or a particular year (e.g. EG = the population of a country in 1995 and CG = the population of the same country in 2005). In surveys it is essential that the study participants represent the eligible population, so the recruitment process should be well planned and well described.

[Figure 3.3. Design features of cross-sectional studies (E, C and dis-ease Outcomes measured at the same time): Participants P are allocated to the Exposure & Comparison Groups EG & CG (the denominators) by measurement (observational study); the Outcomes (the numerators: yes = a, b; no = c, d) are measured simultaneously.]

Cross-sectional studies investigating the associations between exposures and outcomes: Some cross-sectional studies, other than simple surveys, have similar objectives and are similar in design to cohort studies, except that the study exposures and outcomes are measured at the same time. Therefore only prevalence measures of dis-ease outcomes can be determined [e.g. the prevalence of asthma (O) among smoking (E) and non-smoking (C) Auckland University students (P) measured at the time of an epidemiology lecture]. Because the outcome (e.g. asthma) is measured at the same time as the exposure (e.g. smoking) in cross-sectional studies, if smoking causes asthma and people who develop asthma subsequently stop smoking, then the investigator has a problem! In the extreme situation, if every smoker who develops asthma because of their smoking immediately stops smoking, it will not be possible to find a relationship between smoking and asthma in a cross-sectional study. The opposite problem, finding an association that isn't real, can also occur in cross-sectional studies. For example, in a cross-sectional study investigating whether regular vigorous physical activity (E) improves lung function (O), the investigator may find that people who do the least
exercise have the worst lung function. While this association may be a real cause-effect association, it is possible that people with poor lung function are unable to do any vigorous activity because they get very short of breath. If this is true then the study dis-ease outcome (i.e. poor lung function) could be the cause of the study exposure (i.e. low exercise levels) rather than low exercise levels being the cause of poor lung function. It is often impossible to determine which of these two possible temporal associations between exposures and outcomes is correct in cross-sectional studies, and this limits their usefulness for investigating causal associations. This temporal association issue between exposures and outcomes (i.e. cause and effect) is a problem in any study in which exposure measures are made after the person has developed the study outcome (i.e. retrospectively), so it is a problem in many case-control studies and some cohort studies.

Diagnostic (and screening) test accuracy studies (Figure 3.4): These are a type of cross-sectional study designed to assess the accuracy of a diagnostic or screening test by comparing the results of the test with the results of a good diagnostic standard. The most accurate way of diagnosing a dis-ease is to use a reference standard or gold standard diagnostic tool, but these tools are often expensive or invasive (e.g. they may require an operation), so clinicians are always searching for less expensive and less invasive tests to use instead of the reference standard tools.

[Figure 3.4. Design features of a diagnostic test accuracy study for calculating likelihoods & likelihood ratios. Participants (P) are allocated to EG & CG by measurement using the reference (or gold) standard - EG: participants with dis-ease; CG: participants without dis-ease (DENOMINATORS). The outcomes (O, NUMERATORS) are the test results: +ve (cells a & b) or -ve (cells c & d).]

There are two ways of describing a diagnostic test accuracy study using the GATE frame (Figures 3.4 and 3.5). In the version of the study shown in Figure 3.4, EG are all the participants who are diagnosed with dis-ease using the reference standard (i.e. reference standard positive) and CG are the reference standard negative participants (without dis-ease). The test result is the study outcome (O) - either the test is positive [a & b] or the test is negative [c & d]. The square in the GATE frame in Figure 3.4 is the standard 2 x 2 table of diagnostic test studies. In the study version described in Figure 3.4 the dis-ease frequency measures (EGO & CGO) are known as likelihood measures. The positive EGO = a/EG and is the likelihood of a positive test in people with dis-ease determined by the reference standard, and the positive CGO = b/CG is the likelihood of a positive test in people without dis-ease determined by the reference standard. In diagnostic test accuracy studies alternative measures of EGO and CGO can be made using c and d - the negative test results - as numerators, rather than the positive test results (a & b). These measures are the negative EGO = c/EG, which is the likelihood of a negative test in people with dis-ease determined by the reference standard, and the negative


CGO = d/CG, which is the likelihood of a negative test in people without dis-ease determined by the reference standard. In diagnostic test accuracy studies EGO ÷ CGO is known as the Likelihood Ratio (LR) and it is equivalent to the RR calculated in other studies. However there are two types of LR - a positive LR if a & b are used as the numerators in calculating EGO & CGO (a/EG & b/CG) and a negative LR when c & d are used (c/EG & d/CG). The other results commonly reported in diagnostic or screening test accuracy studies are sensitivity, which is the likelihood of a positive test in those with the dis-ease as assessed by the reference standard (i.e. the positive EGO or a/EG), and specificity, which is the likelihood of a negative test result in those without dis-ease (i.e. the negative CGO or d/CG). One benefit of a LR is that it combines sensitivity and specificity in one number: the positive LR equals sensitivity ÷ (1 - specificity) and the negative LR equals (1 - sensitivity) ÷ specificity. The LRs can also be used to rapidly calculate the probability that a person has (or does not have) a dis-ease using a tool called the Likelihood Ratio Nomogram. (4)

Many of our students who first learnt to use GATE for appraising RCTs and cohort studies find the way we hang diagnostic test accuracy studies on the GATE frame (Figure 3.4) confusing. They assume that EG and CG should be the participants with and without a positive test result, and that a & b should be those with dis-ease (i.e. reference standard positive) and c & d should be those people without dis-ease (i.e. reference standard negative), as illustrated in Figure 3.5.

[Figure 3.5. Design features of a diagnostic test study for calculating +ve and -ve predictive values. Participants (P) are allocated to EG & CG by test measurement - EG: participants with a +ve test; CG: participants with a -ve test (DENOMINATORS). The outcomes (O, NUMERATORS) are the reference (gold) standard results: +ve (cells a & b) or -ve (cells c & d).]

While there is no right or wrong way to hang a diagnostic test study on the GATE frame, there are two subtly different questions that can be asked about the accuracy of diagnostic tests. The question addressed by the GATE frame shown in Figure 3.4 is: among people with (or without) dis-ease, what is the likelihood that they will have a positive (or negative) test result? In contrast the question addressed in Figure 3.5 is: among people who are tested, how well does a positive (or negative) test predict their chance of having dis-ease? In Figure 3.5 the positive EGO (a/EG) is called the Positive Predictive Value (PPV) - the probability of having dis-ease if the test is positive - while the negative CGO (d/CG) is the Negative Predictive Value (NPV) - the probability of not having dis-ease if the test is negative. The questions addressed in Figure 3.5 about positive and negative predictive values are very dependent on the prevalence of dis-ease among Participants (also known as the pre-test probability of dis-ease) as well as the accuracy of the test. Therefore in these studies the R for Recruitment component of RAMboMAN is extremely important. In contrast the questions addressed in Figure 3.4 about likelihoods and likelihood ratios are much less dependent on the prevalence of dis-ease in P and are mainly influenced by the accuracy of the test.


There are two other common biases in many diagnostic test accuracy studies. First, as the reference standard measure is often expensive or invasive, it is common to find that many potential participants do not have both the reference standard and the test measurements completed. Second, the interpretation of the test results may be influenced by knowledge of the reference standard results and vice versa, because often neither is a very objective measure and both require human (subjective) interpretation. Some so-called reference (or gold) standards would be better described as silver or bronze standards because they are not very accurate. Therefore it is important to check whether tests and reference standards were measured objectively and, if not, whether each was interpreted blind to the other measure.

A screening test is just a diagnostic test applied in a screening situation, but there is a fundamental difference between a screening test accuracy study, which is a cross-sectional study as described above, and a screening test effectiveness study, which is ideally a randomised controlled trial. In the latter type of study, participants are randomised to EG if they receive the screening test (e.g. a mammogram) and to CG if they do not receive the screening test. Then all participants are followed over a period of time to determine if mammograms reduce the risk of death from breast cancer.

ECOLOGICAL STUDIES
Many epidemiological studies involve comparisons of groups of populations (typically the whole populations of cities, regions or countries) rather than of groups of individuals. For example a study might compare the average annual death rate from heart disease in a group of low-income countries and in a group of high-income countries. In other words the study participants are groups of populations rather than groups of individuals. This type of study design is called an ecological study.

Ecological studies can be longitudinal or cross-sectional. For example in a longitudinal study of the association between alcohol consumption and coronary heart disease in different countries, exposure status can be determined by assessing the per capita consumption of alcohol (i.e. average consumption per person) in one year from national sales data in each of the selected countries (EG would include countries with high per capita consumption and CG would include countries with lower per capita consumption) and the outcome could be the average incidence of coronary heart disease in each group of countries over the subsequent 5 years. In a cross-sectional ecological study investigating whether use of inhaled bronchodilator drugs increases the prevalence of asthma, exposure status could be assessed by the per capita sales of the drug during a particular year in each country and countries would be allocated to EG if they had high consumption and to CG if they had low consumption. The outcome could be the average prevalence of asthma based on prevalence surveys carried out during the same year in each country.

Ecological studies like those described above are often plotted on a graph with each country represented by a point that shows both its exposure status (e.g. per capita consumption of alcohol, on the horizontal axis) and its outcome status (e.g. CHD mortality, on the vertical axis). This is equivalent to each country being a different EG (i.e. EG1, EG2, EG3, etc) except for one country that would be equivalent to CG. Similarly, studies of trends in the prevalence or incidence of dis-ease in the same population over a number of years (with the years on the horizontal axis) are also ecological studies.
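The alcohol example can be sketched at the country level. The consumption figures, incidence rates and cut-off below are all invented for illustration; the EG/CG allocation follows the high versus low per capita consumption rule described above:

```python
# Hypothetical country-level data: (per capita alcohol consumption in one year,
# average CHD incidence per 100,000 over the subsequent 5 years).
countries = {
    "A": (11.2, 410), "B": (9.8, 380), "C": (4.1, 290),
    "D": (10.5, 430), "E": (3.2, 250), "F": (5.0, 300),
}

CUTOFF = 8.0  # litres per capita; an arbitrary illustrative threshold

# Each country, not each person, is the unit of analysis.
eg = [chd for alc, chd in countries.values() if alc >= CUTOFF]  # high consumption
cg = [chd for alc, chd in countries.values() if alc < CUTOFF]   # low consumption

ego = sum(eg) / len(eg)  # average incidence in high-consumption countries
cgo = sum(cg) / len(cg)  # average incidence in low-consumption countries
print(f"EGO = {ego:.0f}, CGO = {cgo:.0f}, ratio = {ego / cgo:.2f}")
```

Because the comparison is between whole populations, any number of other country-level differences (diet, income, health care) could confound an association like this one, which is the weakness noted at the end of this section.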


It is also possible for an ecological study to be a randomised controlled trial. In a study of the effectiveness of a televised advertising programme on reducing road traffic injuries, different regions across a country could be randomised to receive the advertisements or not, and regional statistics assessed to determine if reported injury rates (the study outcome) differed between regions that were exposed and unexposed to the advertisements.

Ecological studies are useful exploratory studies for investigating both risk factor and intervention effects because they are generally reasonably cheap and can be completed in a short time. They are also the only appropriate design for assessing exposures that are almost universal in some populations, for example, the consumption of dairy products in New Zealand. If almost everyone in a study population is exposed to a factor, it is not possible to investigate that factor within that population, other than to investigate different levels of exposure. However if the factor is common in some populations but uncommon in others (e.g. consumption of dairy products is still uncommon in many Asian countries) then ecological studies including both high and low consumption countries can be undertaken. Unfortunately ecological studies are also very prone to confounding because there are usually many other differences between populations that may influence the study outcome.

SYSTEMATIC REVIEWS & META-ANALYSES:
As discussed in the previous chapter, one of the most common errors in epidemiological studies is random error due to the small number of dis-ease outcomes in many studies. Dis-ease events are surprisingly uncommon in most populations and the large studies required to produce large numbers of dis-ease outcomes are very expensive to undertake. Therefore most studies are too small to produce precise estimates (i.e. with low random error) of dis-ease frequency and of measures of association or effect.

A systematic review (SR) is a study design that combines a group of studies addressing the same question. The name describes how the studies are recruited: using a systematic or comprehensive search process. As many studies are now available electronically, most systematic reviews start with a systematic search of electronic registers of studies like PubMed. In the most comprehensive systematic reviews, the investigators will also contact researchers whom they think may have done, or know about, studies that have not been published. The inclusion of unpublished studies in a systematic review will sometimes have a significant effect on the review's conclusions because the published studies may be a biased sample of all studies, as it is easier to get studies with positive results published.

If the recruited studies are methodologically valid and have reasonably similar findings, they can be mathematically combined using a technique known as meta-analysis. This technique reduces the amount of random error in the effect estimates, as discussed in section 2.4. A meta-analysis is not a specific epidemiological study design but rather an analytical technique used to combine the results of a group of studies, ideally identified from a systematic review of multiple studies. However a systematic review combined with a meta-analysis is a specific epidemiological meta-study design.
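The text does not specify how study results are combined; one common approach is fixed-effect inverse-variance weighting of log relative risks, sketched here with invented study results:

```python
import math

def inverse_variance_pool(estimates):
    """Fixed-effect meta-analysis: weight each study's log RR by 1/SE^2.

    estimates: list of (log_rr, standard_error) pairs, one per study.
    Returns the pooled log RR and its (smaller) standard error.
    """
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * log_rr for (log_rr, _), w in zip(estimates, weights)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Three hypothetical small trials, each individually imprecise:
studies = [(math.log(0.75), 0.30), (math.log(0.85), 0.25), (math.log(0.80), 0.35)]
pooled, se = inverse_variance_pool(studies)
lo, hi = math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)
print(f"Pooled RR = {math.exp(pooled):.2f} (95% CI {lo:.2f}-{hi:.2f}), SE = {se:.2f}")
```

The pooled standard error is always smaller than any individual study's, which is exactly the reduction in random error described above; pooling does nothing, however, to reduce any systematic error shared by the combined studies.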
The Cochrane Collaboration is an international not-for-profit organisation that provides leadership, training, and support for groups involved in undertaking systematic reviews


and meta-analyses. Reviews that meet the Collaboration's criteria are published electronically by the Collaboration using a standard format and are updated regularly as new studies are completed. Cochrane Reviews are available free to everyone in New Zealand: google Cochrane Library NZ or go to http://www.moh.govt.nz/cochranelibrary. Anyone who wants to inform a health-related decision based on epidemiological evidence (whether a clinical decision for an individual patient or a policy decision for a population) should ideally use evidence from a systematic review of studies, if available, rather than from an individual study.

REFERENCES
1. Rothman KJ. Epidemiology: An Introduction. Oxford University Press; 2002.
2. Evidence-Based Medicine Working Group. Evidence-based medicine: a new approach to teaching the practice of medicine. JAMA 1992; 268: 2420-5.
3. Morris JN. Uses of Epidemiology. 3rd Edn. Livingstone; 1976.
4. Straus SE, Richardson WS, Glasziou P, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. 3rd Edn. Elsevier Churchill Livingstone; 2005. pp 89-90.


Table 3.1 Properties of different study designs

Randomised controlled trials (RCT)
Usual study objective: To investigate the effects of different interventions (exposures) on dis-ease incidence in different groups of individuals.
Main applications: Studying the effects of interventions (e.g. new therapies).
Main design features: Longitudinal, experimental. Participants randomly allocated to either the study exposure or the comparison exposure and dis-ease outcomes measured during a follow-up period.
Main strengths: Randomisation minimises confounding.
Main weaknesses: Ethical limitations. Logistically difficult. Long-term follow-up difficult & costly. Large studies expensive.

Cohort (& case-control) studies
Usual study objective: To investigate associations (effects) between risk / prognostic factors (exposures) & dis-ease incidence in different groups of individuals.
Main applications: Studying the causes of dis-ease incidence (i.e. risk or prognostic factors) or the effects of interventions.
Main design features: Longitudinal, observational (non-experimental). Participants allocated to exposure & comparison by measurement & dis-ease outcomes measured during follow-up.
Main strengths: Cohort studies: multiple outcomes can be assessed; exposure measured before outcome, avoiding recall bias and providing a clear time sequence between exposure and dis-ease outcomes. Case-control studies: efficient if dis-ease is rare; the best study design when the exposure has a rapid effect on the outcome.
Main weaknesses: Confounding common.

Cross-sectional studies
Usual study objective: To measure dis-ease prevalence in defined groups / populations of individuals. To investigate associations between exposures & dis-ease prevalence in the groups.
Main applications: Measuring prevalence / burden of dis-ease in different groups & populations.
Main design features: Cross-sectional, observational (non-experimental). Participants allocated to exposure & comparison by measurement; dis-ease outcomes measured simultaneously.
Main strengths: Generally cheap & can be completed quickly. Best design for assessing the prevalence (or burden) of dis-ease in a population.
Main weaknesses: Uncertain time sequence limits interpretation of cause and effect. Confounding common.

Ecological studies
Usual study objective: To investigate associations between exposures & dis-ease prevalence or incidence in different groups of populations.
Main applications: Studying the causes of dis-ease incidence and prevalence.
Main design features: Longitudinal or cross-sectional, non-experimental or experimental. Exposure and comparison allocated to groups rather than individuals.
Main strengths: Generally cheap & quick. Useful when the majority of some populations are exposed but others are not. Efficient for rare outcomes.
Main weaknesses: Confounding very common.
