19 November 2012
The chi-squared test in contingency tables, P&G Chapter 15, pp 342 - 349
Odds and Odds Ratios, P&G Chapter 6, pp 144 - 149, section 15.3 Confidence intervals for an OR, P&G, pp 352 - 357
Summary
contingency table.
Contingency tables and testing for associations between categorical variables (2 × 2 and r × c tables)
Main method will be the Pearson chi-squared (χ²) test
THE ORGANIZATION OF TABLES
and the outcome variable (success or failure) for the rows. The group variable is sometimes called the explanatory variable. The outcome variable is sometimes called the response variable.
THE ORGANIZATION OF TABLES . . .
Epidemiologists often use case-control designs,
where cases are sampled according to outcome (typically, disease or no disease) and exposure is measured based on retrospective records. In a case-control analysis, Stata uses the exposure status as the column label because it is the explanatory variable. Epi 202 uses exposure as the column variable. Epi 500 uses the exposure status as the row variable.
answer yes to the abortion question? Is response to the abortion question independent of sex of the student? Is there any association between response and sex of the student?
pF = probability that a randomly chosen female student answers yes
pM = probability that a randomly chosen male student answers yes
Then the hypotheses are H0: pF = pM vs. HA: pF ≠ pM.
The (Pearson) Chi-Square Test: basic idea
1. If the row and column variables are independent (null hypothesis is true), what do we expect to see?
2. How do expected values in the table compare to what has been observed?
percentage of the female students would be expected to respond yes?
110/150 = 0.73333 = 73%. Why?
How many students do we expect to fall into the cell defined by (female, yes)?
(95)(0.73333) = 69.7. Why?
What was the observed cell count? 73
observed − expected = 73 − 69.7 = 3.3
Now compute the expected values for all cells.
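The expected-count arithmetic on this slide can be sketched in Python. This is a minimal illustration rather than part of the course material: the function name `expected_counts` is hypothetical, and only the count 73 and the totals 95, 110 and 150 come from the slide. The other three cell counts (22, 37, 18) are inferred from those totals, assuming a 2 × 2 table with rows female/male and columns yes/no.

```python
def expected_counts(table):
    """Expected cell counts under independence: (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: female, male; columns: yes, no.
# Assumption: 22, 37 and 18 are inferred from the margins 95, 110 and 150.
observed = [[73, 22], [37, 18]]
expected = expected_counts(observed)

print(round(expected[0][0], 1))                   # 69.7, expected (female, yes)
print(round(observed[0][0] - expected[0][0], 1))  # 3.3, observed - expected
```

The expected counts always reproduce the row and column totals of the observed table, which is a quick sanity check on a hand calculation.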
The statistic has a sampling distribution that is approximately χ² with degrees of freedom df = (r − 1)(c − 1), where r = # rows, c = # columns.
In a 2 × 2 table, df = 1.
Important values from the distribution are given in Table A.8 on page A-26.
[Figure: χ² critical values table (Moore/McCabe, page T-20). Table entry for p is the critical value (χ²)* with probability p lying to its right.]
Binomial, Normal, t, F, χ²
CHI-SQUARE (χ²) TEST
If the rows and columns are independent, the observeds and expecteds shouldn't be very different, and the χ² value will be small.
If there is an association between rows and columns, the observeds may be far from the expecteds, leading to a large χ² value.
So, reject the null hypothesis when the χ² value is sufficiently large (look only at the upper tail of the chi-square distribution).
Like the F-test in ANOVA, the alternative is inherently two-sided, but the right tail area is not doubled.
          Area in the Upper Tail
df      0.05      0.025     0.01
1       3.841     5.024     6.635
2       5.991     7.378     9.210
3       7.815     9.348     11.345
........
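As a check on the tabled critical values, the upper-tail area of a χ² distribution has a simple closed form for small degrees of freedom. The sketch below uses only the Python standard library. The function name `chi2_upper_tail` is hypothetical, and the closed forms are coded only for df = 1, 2, 3; a general routine would need the incomplete gamma function.

```python
import math

def chi2_upper_tail(x, df):
    """Upper-tail area P(X > x) for a chi-squared variable with df = 1, 2 or 3."""
    if df == 1:
        # A chi-squared(1) variable is the square of a standard normal.
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        # A chi-squared(2) variable is exponential with mean 2.
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("closed forms are only coded for df = 1, 2, 3")

# Critical values copied from the table: {df: {upper-tail area: critical value}}
critical_values = {
    1: {0.05: 3.841, 0.025: 5.024, 0.01: 6.635},
    2: {0.05: 5.991, 0.025: 7.378, 0.01: 9.210},
    3: {0.05: 7.815, 0.025: 9.348, 0.01: 11.345},
}
for df, row in critical_values.items():
    for area, crit in row.items():
        print(df, area, crit, round(chi2_upper_tail(crit, df), 4))
```

Each printed tail area should agree with the column heading to about three decimal places, since the tabled critical values are themselves rounded.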
determined by any one of the cell counts. Upper left cell is typically used.
For the sampling distribution: Use the conditional distribution of values for the upper left cell, given the row and column totals and under the hypothesis of independence.
A one-sided p-value is the probability of observing a value as extreme as or more extreme than the cell count in the upper left corner.
Two-sided p-values are even more complicated.
We will only consider two-sided tests in contingency tables.
(obs − exp)² / exp
Note that this table has the outcome variable in the columns, since that is the way it is displayed in the text. That does not affect the value of the χ² statistic.
Values in parentheses are expected counts.

                         Certificate Status
             Confirmed      Inaccurate     Incorrect
Hospital     Accurate       No Change      Recoded
Community    157 (169.3)    18 (24.7)      54 (35.0)
Teaching     268 (255.7)    44 (37.3)      34 (53.0)
Total        425            62             88

\[ \chi^2 = \sum_{\text{all cells}} \frac{(\text{obs} - \text{exp})^2}{\text{exp}} = 21.52 \]

df = (r − 1)(c − 1) = (1)(2) = 2
In Table A.8, the critical value of χ² for df = 2 and upper tail area 0.001 is 13.82, so the p-value is < 0.001 and we reject the null hypothesis that accuracy in death certificates is independent of hospital type.
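The statistic for the death certificate table can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_chi2` is hypothetical. The p-value line relies on the fact that, for df = 2 only, the upper-tail area of the χ² distribution is exp(−x/2).

```python
import math

def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: community, teaching; columns: the three certificate status categories.
observed = [[157, 18, 54], [268, 44, 34]]
stat, df = pearson_chi2(observed)
p_value = math.exp(-stat / 2)  # valid only because df == 2

print(round(stat, 2), df, p_value < 0.001)  # 21.52 2 True
```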
STATA . . .
. tabi 157 18 54 \268 44 34, expected chi exact

....some output not shown

           |               col
       row |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       157         18         54 |       229
           |     169.3       24.7       35.0 |     229.0
-----------+---------------------------------+----------
         2 |       268         44         34 |       346
           |     255.7       37.3       53.0 |     346.0
-----------+---------------------------------+----------
     Total |       425         62         88 |       575
           |     425.0       62.0       88.0 |     575.0

          Pearson chi2(2) =  21.5235   Pr = 0.000
           Fisher's exact =                 0.000
SOME COMMENTS
The χ² test for r × c tables does not take into account any natural ordering of rows or columns that might be present in data.
The text mentions the Yates continuity correction (p 346), sometimes used in calculating the χ² statistic in small samples. It is used far less often now; better to use Fisher's exact test.
Fisher's exact test can be used to assess associations in general r × c tables.
Very common now for papers to report Fisher's exact test, even in moderately large samples.
COMMENTS . . .
The following rule of thumb is used for the validity of the Pearson χ² test:
In 2 × 2 tables, each expected cell count (calculated under the hypothesis of independence) should be at least 5.
In tables with more than 4 cells (excluding the cells with the row and column totals), the average expected count should be at least 5, and no expected count should be smaller than 1.
When these conditions do not hold, use Fisher's exact test.
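The rule of thumb translates directly into a small check on the expected counts. The sketch below is one possible reading of the rule, not a standard library routine: the function name `pearson_rule_of_thumb_ok` is hypothetical, and it assumes the table has at least two rows and two columns.

```python
def pearson_rule_of_thumb_ok(table):
    """True if the expected counts satisfy the rule of thumb for the Pearson test.

    Assumes `table` holds observed counts with at least 2 rows and 2 columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    if len(expected) == 4:
        # 2 x 2 table: every expected count must be at least 5.
        return min(expected) >= 5
    # Larger tables: average expected count at least 5, none below 1.
    return sum(expected) / len(expected) >= 5 and min(expected) >= 1

print(pearson_rule_of_thumb_ok([[157, 18, 54], [268, 44, 34]]))  # True
print(pearson_rule_of_thumb_ok([[3, 1], [1, 3]]))                # False
```

When the function returns False, the slide's advice is to use Fisher's exact test instead.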
COMMENTS . . .
Alternate form of the χ² test for 2 × 2 tables. If the table has entries

                   Group
Outcome     Sample 1    Sample 2
Success     a           b
Failure     c           d
Total       n1          n2

then the χ² test can be written

\[ \chi^2 = \frac{n(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}, \qquad df = 1 \]

where n = n1 + n2 is the total sample size.
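A short Python check shows that the shortcut form agrees with the sum over cells of (obs − exp)²/exp. This is a sketch only: the function names `chi2_shortcut` and `chi2_sum_form` are hypothetical. The example counts are the EFM and Caesarean delivery counts that appear later in these slides, for which the Stata output reports chi2(1) = 37.95.

```python
def chi2_shortcut(a, b, c, d):
    """Shortcut chi-squared statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

def chi2_sum_form(a, b, c, d):
    """Same statistic as the sum over cells of (obs - exp)^2 / exp."""
    table = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    stat = 0.0
    for i in range(2):
        for j in range(2):
            exp = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - exp) ** 2 / exp
    return stat

print(round(chi2_shortcut(358, 229, 2492, 2745), 2))  # 37.95
print(round(chi2_sum_form(358, 229, 2492, 2745), 2))  # 37.95
```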
INTRODUCTION
The χ² test and Fisher's exact test provide methods for testing the null hypothesis of independence between row and column variables.
But neither test provides an estimate of the nature of the association when the hypothesis of independence is rejected.
We will use odds ratios for estimating association between row and column variables.
To study odds ratios, we first need to study odds.
1 to 37 for you
BETTING IN ROULETTE
For the game to be fair,
Casino keeps your $1 if 00 does not come up
Casino pays $37 if 00 comes up, and you keep your $1 bet
If X represents your winnings from a $1 bet and E(X) the average winnings in many such bets,
E(X) = −1(37/38) + 37(1/38) = 0
Casinos stay in business
by paying out 35 to 1, the casinos ensure that roulette
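The expected winnings on a $1 bet can be checked with exact fractions. This is a small sketch, not part of the slides: the function name `expected_winnings` is hypothetical, and the model assumes the bet is lost ($1) with probability 37/38 and pays the stated amount with probability 1/38.

```python
from fractions import Fraction

def expected_winnings(payout, p_win=Fraction(1, 38)):
    """Expected winnings per $1 bet: lose $1 with prob 1 - p_win, win `payout` otherwise."""
    return -1 * (1 - p_win) + payout * p_win

fair = expected_winnings(37)    # payout that makes the game fair
casino = expected_winnings(35)  # the 35 to 1 payout

print(fair)    # 0
print(casino)  # -1/19
```

With a 35 to 1 payout the player loses 1/19 of a dollar, a little over 5 cents, per $1 bet on average.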
If p is small (say p < 0.10), then (1 − p) ≈ 1 and so odds ≈ p. The approximation improves as p approaches 0.
ODDS VS. PROBABILITIES

Probability       Odds = p/(1 − p)       Odds
0                 0/1 = 0                0
1/100 = 0.01      1/99 = 0.0101          1 : 99
1/10 = 0.10       1/9 = 0.11             1 : 9
1/4               1/3                    1 : 3
1/3               1/2                    1 : 2
1/2               (1/2)/(1/2) = 1        1 : 1
2/3               (2/3)/(1/3) = 2        2 : 1
3/4               3                      3 : 1
1                 1/0 = ∞                ∞
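The conversion from probability to odds, and the way odds approach p when p is small, can be sketched in a few lines of Python. The function name `odds` is hypothetical, and exact fractions are used so the results can be compared with the table entries directly.

```python
from fractions import Fraction

def odds(p):
    """Odds corresponding to a probability p, for p strictly less than 1."""
    return p / (1 - p)

probabilities = [Fraction(1, 100), Fraction(1, 10), Fraction(1, 4),
                 Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(3, 4)]
for p in probabilities:
    o = odds(p)
    # Columns: probability, odds, and the gap between odds and probability.
    print(p, o, round(float(o - p), 4))
```

The last column shrinks toward 0 as p gets smaller, which is the sense in which odds ≈ p for small p.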
This is the odds of disease for smokers divided by the odds of disease for non-smokers.
OR > 1 implies smokers have higher probability of
is equal to the odds ratio of exposure, comparing diseased vs. non-diseased subjects. We will derive this later using a simple formula for OR in a 2 × 2 table.
TWO IMPORTANT POINTS
Why is this surprising?
Because when cases and controls are sampled and exposure is determined retrospectively, it is only possible to estimate Pr(E|D) or Pr(E|D^c), not Pr(D|E) and Pr(D|E^c).
Why is this important?
Because even when exposure is estimated by sampling from cases and controls, it is possible to estimate the correct OR.
who develops disease, as in a cohort study design
Retrospective studies of diseased vs. healthy subjects, to see who is exposed, as in a case-control study design
Both types of studies can estimate the OR of disease, comparing exposed to unexposed.
In this case

\[ \text{OR} \approx \frac{\Pr(D \mid E)}{\Pr(D \mid E^c)}, \]
\[
\text{OR} = \frac{P(E \mid D)/(1 - P(E \mid D))}{P(E \mid D^c)/(1 - P(E \mid D^c))}
= \frac{(a/(a + b))/(b/(a + b))}{(c/(c + d))/(d/(c + d))}
= \frac{ad}{bc}
\]
\[
\text{OR} = \frac{P(D \mid E)/(1 - P(D \mid E))}{P(D \mid E^c)/(1 - P(D \mid E^c))}
= \frac{(a/(a + c))/(c/(a + c))}{(b/(b + d))/(d/(b + d))}
= \frac{ad}{bc}
\]
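The equality of the two odds ratios can be confirmed numerically. The sketch below assumes the layout implied by the two derivations: rows are diseased (a, b) and non-diseased (c, d), columns are exposed and unexposed. The function names are hypothetical, and the example counts are the EFM data used later in these slides.

```python
import math

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

def or_exposure(a, b, c, d):
    """Odds of exposure among the diseased over odds of exposure among the non-diseased."""
    return odds(a / (a + b)) / odds(c / (c + d))

def or_disease(a, b, c, d):
    """Odds of disease among the exposed over odds of disease among the unexposed."""
    return odds(a / (a + c)) / odds(b / (b + d))

a, b, c, d = 358, 229, 2492, 2745
print(round(or_exposure(a, b, c, d), 4))
print(round(or_disease(a, b, c, d), 4))
print(round(a * d / (b * c), 4))
print(math.isclose(or_exposure(a, b, c, d), or_disease(a, b, c, d)))
```

All three printed values coincide, which is the numerical counterpart of both derivations ending in ad/bc.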
This is a case-control study, sampled according to type of delivery. EFM is the exposure variable.
according to type of delivery, it is possible to estimate the odds ratio

\[ \text{OR} = \frac{\Pr(C \mid E)/(1 - \Pr(C \mid E))}{\Pr(C \mid E^c)/(1 - \Pr(C \mid E^c))}, \]

where C = (woman delivers by C-Section) and E = (EFM used during pre-natal care).
Invoke the rare disease assumption to estimate

\[ \frac{\Pr(\text{C-section} \mid \text{EFM})}{\Pr(\text{C-section} \mid \text{no EFM})} \]
\[ \text{Relative Risk} \approx \text{Odds Ratio} = \frac{(358)(2745)}{(229)(2492)} = 1.72 \]

Can we check the rare disease assumption with these data?
\[ \text{OR} = \frac{P(D \mid E)/(1 - P(D \mid E))}{P(D \mid E^c)/(1 - P(D \mid E^c))} \]

where D = disease, E = exposed, E^c = unexposed.
We will calculate a confidence interval for log(OR), then convert that to a confidence interval for OR.
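\[ \text{OR} = \frac{ad}{bc} \]

\[ \text{se}(\log(\text{OR})) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \]

where the log function is the natural log or log base e, sometimes denoted by ln.
Confidence intervals for log(OR) have the form

\[ \log(\text{OR}) \pm z_{\alpha/2} \, \text{se}(\log(\text{OR})) \]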
EFM MONITORING
Caesarean            EFM Exposure
Delivery        Yes           No            Total
Yes             a = 358       b = 229       587
No              c = 2,492     d = 2,745     5,237
Total           2,850         2,974         n = 5,824

\[ \text{OR} = \frac{(358 \times 2745)}{(229 \times 2492)} = 1.72; \qquad \log(\text{OR}) = 0.542 \]

\[ \text{se}(\log(\text{OR})) = \sqrt{\frac{1}{358} + \frac{1}{229} + \frac{1}{2492} + \frac{1}{2745}} = 0.089 \]
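The remaining step, turning log(OR) and its standard error into a confidence interval for OR, can be sketched as follows. The function name `woolf_ci` is hypothetical, and z = 1.96 is assumed as the multiplier for a 95% interval. The limits can be compared with the Woolf interval in the Stata output for these data.

```python
import math

def woolf_ci(a, b, c, d, z=1.96):
    """Odds ratio, se of log(OR), and the interval exp(log(OR) -/+ z * se)."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, se_log_or, (lower, upper)

odds_ratio, se_log_or, (lower, upper) = woolf_ci(358, 229, 2492, 2745)
print(round(odds_ratio, 2), round(se_log_or, 3))  # 1.72 0.089
print(round(lower, 2), round(upper, 2))           # 1.45 2.05
```

The interval is built on the log scale and then exponentiated, so it is not symmetric about the point estimate 1.72.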
STATA AGAIN . . .
cci 358 229 2492 2745, woolf
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |       358         229  |        587       0.6099
        Controls |      2492        2745  |       5237       0.4758
-----------------+------------------------+------------------------
           Total |      2850        2974  |       5824       0.4894
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.722035       |    1.446314    2.050318 (W
 Attr. frac. ex. |         .4192916       |     .308587    .5122708 (W
 Attr. frac. pop |         .2557178       |
                 +-------------------------------------------------
                               chi2(1) =    37.95  Pr>chi2 = 0.0000

The notation (Woolf) has been clipped from the output, next to the confidence intervals.
for cci.
Exact CIs are numerically more difficult to estimate, but easy in software.
Exact CIs are now the default in Stata.
Use the exact method whenever Stata will compute it in a reasonable amount of time.
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |       358         229  |        587       0.6099
        Controls |      2492        2745  |       5237       0.4758
-----------------+------------------------+------------------------
           Total |      2850        2974  |       5824       0.4894
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.722035       |     1.44092    2.058247 (e
 Attr. frac. ex. |         .4192916       |    .3059991    .5141496 (e
 Attr. frac. pop |         .2557178       |
                 +-------------------------------------------------
                               chi2(1) =    37.95  Pr>chi2 = 0.0000

The notation (exact) has been clipped from the output, next to the confidence intervals.
cases in the exposed group attributed to the exposure. Attr. frac. pop is an abbreviation for the fraction of cases in the whole population attributed to the exposure. Before we give the formulas, important to note that these two concepts only have meaning if there is a clear causal relationship between exposure and outcome. Rarely possible to draw causal inference from a case-control study. Nevertheless . . .
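\[ \text{Attr. frac. ex.} = \frac{\text{OR} - 1}{\text{OR}} = \frac{1.722 - 1}{1.722} = 0.419 \]

\[ \text{Attr. frac. pop} = (\text{Attr. frac. ex.})(\text{proportion exposed cases}) = (0.419)(0.6099) = 0.2557 \]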
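The two attributable fractions can be recomputed from the odds ratio and the proportion of cases who were exposed. This is a minimal sketch of the arithmetic on this slide: the function name `attributable_fractions` is hypothetical, and the inputs are the odds ratio 1.722035 and the proportion 358/587 = 0.6099 taken from the Stata output.

```python
def attributable_fractions(odds_ratio, prop_cases_exposed):
    """Attributable fraction among the exposed and in the population."""
    af_exposed = (odds_ratio - 1) / odds_ratio
    af_population = af_exposed * prop_cases_exposed
    return af_exposed, af_population

af_exposed, af_population = attributable_fractions(1.722035, 358 / 587)
print(round(af_exposed, 3))     # 0.419
print(round(af_population, 4))  # 0.2557
```

As the slides caution, these quantities are only meaningful when the relationship between exposure and outcome is clearly causal.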
MAIN IDEAS
Inference for two-sample binomial framed as a 2 × 2 table, with the χ² test used to test independence between rows and columns. Can extend this to r × c tables.
Odds ratios (OR) in a 2 × 2 table and confidence intervals for OR used to quantify the association.
OR for exposure, given disease, is the same as OR for disease, given exposure.
OR approximates relative risk when disease is rare.
In a rare disease, OR from a case-control study can be used to estimate relative risk.
In the next unit, we will extend the use of odds ratios to stratified 2 × 2 tables to adjust for possible confounders.