Bibliographical Note
Library of Congress Cataloging-in-Publication Data
Sokal, Robert R.
Introduction to Biostatistics / Robert R. Sokal and F. James Rohlf. Dover ed. p. cm. Originally published: 2nd ed. New York: W. H. Freeman, 1969. Includes bibliographical references and index. ISBN-13: 9780486469614 ISBN-10: 0486469611 1. Biometry. I. Rohlf, F. James, 1936- II. Title. QH323.5.S633 2009 570.1'5195 dc22 2008048052
Contents
Preface
1. Introduction
2. Data in Biostatistics
3. Descriptive Statistics
4. Introduction to Probability Distributions: The Binomial and Poisson Distributions
5-10. [chapter titles not recoverable]
11. Regression
12. Correlation
13. Analysis of Frequencies
Appendixes
A1 Mathematical appendix
A2 Statistical tables
We are pleased and honored to see the reissue of the second edition of our Introduction to Biostatistics by Dover Publications. On reviewing the copy, we find there is little in it that needs changing for an introductory textbook of biostatistics for an advanced undergraduate or beginning graduate student. The book furnishes an introduction to most of the statistical topics such students are likely to encounter in their courses and readings in the biological and biomedical sciences.
The reader may wonder what we would change if we were to write this book anew. Because of the vast changes that have taken place in modalities of computation in the last twenty years, we would deemphasize computational formulas that were designed for precomputer desk calculators (an age before spreadsheets and comprehensive statistical computer programs) and refocus the reader's attention on structural formulas that not only explain the nature of a given statistic, but are also less prone to rounding error in calculations performed by computers. In this spirit, we would omit equation (3.8) on page 39 and draw the reader's attention to equation (3.7) instead. Similarly, we would use structural formulas in Boxes 3.1 and 3.2 on pages 41 and 42, respectively; on page 161 and in Box 8.1 on pages 163/164, as well as in Box 12.1 on pages 278/279.
Secondly, we would put more emphasis on permutation tests and resampling methods. Permutation tests and bootstrap estimates are now quite practical. We have found this approach to be not only easier for students to understand but in many cases preferable to the traditional parametric methods that are emphasized in this book.
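The remark about rounding error can be illustrated with a small sketch. The equations themselves are not reproduced in this part of the text, so the sketch assumes that the structural formula is the sum of squared deviations from the mean and that the desk-calculator shortcut is the algebraically equivalent form, the sum of Y squared minus the squared sum of Y divided by n. The function names and the data are hypothetical and serve only as an illustration.

```python
def ss_structural(ys):
    """Sum of squares computed as squared deviations from the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def ss_computational(ys):
    """Algebraically equivalent shortcut: sum(Y^2) - (sum(Y))^2 / n."""
    total = sum(ys)
    return sum(y * y for y in ys) - total * total / len(ys)


# Hypothetical measurements with a large mean and a small spread.
data = [100000001.0, 100000002.0, 100000003.0]

# The deviations from the mean are -1, 0, and 1, so the true sum of squares is 2.
print(ss_structural(data))
# The shortcut subtracts two huge, nearly equal numbers, so the
# low-order digits that carry the answer are lost in double precision.
print(ss_computational(data))
```

With data of this kind the structural form returns the true value, while the shortcut does not. For small, well-scaled data sets the two forms usually agree closely.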
Preface
The favorable reception that the first edition of this book received from teachers and students encouraged us to prepare a second edition. In this revised edition, we provide a thorough foundation in biological statistics for the undergraduate student who has a minimal knowledge of mathematics. We intend Introduction to Biostatistics to be used in comprehensive biostatistics courses, but it can also be adapted for short courses in medical and professional schools; thus, we include examples from the health-related sciences.
We have extracted most of this text from the more-inclusive second edition of our own Biometry. We believe that the proven pedagogic features of that book, such as its informal style, will be valuable here.
We have modified some of the features from Biometry; for example, in Introduction to Biostatistics we provide detailed outlines for statistical computations but we place less emphasis on the computations themselves. Why? Students in many undergraduate courses are not motivated to and have few opportunities to perform lengthy computations with biological research material; also, such computations can easily be made on electronic calculators and microcomputers. Thus, we rely on the course instructor to advise students on the best computational procedures to follow.
We present material in a sequence that progresses from descriptive statistics to fundamental distributions and the testing of elementary statistical hypotheses; we then proceed immediately to the analysis of variance and the familiar t test (which is treated as a special case of the analysis of variance and relegated to several sections of the book). We do this deliberately for two reasons: (1) since today's biologists all need a thorough foundation in the analysis of variance, students should become acquainted with the subject early in the course; and (2) if analysis of variance is understood early, the need to use the t distribution is reduced. (One would still want to use it for the setting of confidence limits and in a few other special situations.) All t tests can be carried out directly as analyses of variance, and the amount of computation of these analyses of variance is generally equivalent to that of t tests.
This larger second edition includes the Kolmogorov-Smirnov two-sample test, nonparametric regression, stem-and-leaf diagrams, hanging histograms, and the Bonferroni method of multiple comparisons. We have rewritten the chapter on the analysis of frequencies in terms of the G statistic rather than $X^2$, because the former has been shown to have more desirable statistical properties. Also, because of the availability of logarithm functions on calculators, the computation of the G statistic is now easier than that of the earlier chi-square test. Thus, we reorient the chapter to emphasize log-likelihood-ratio tests. We have also added new homework exercises.
We call special, double-numbered tables "boxes." They can be used as convenient guides for computation because they show the computational methods for solving various types of biostatistical problems. They usually contain all the steps necessary to solve a problem from the initial setup to the final result. Thus, students familiar with material in the book can use them as quick summary reminders of a technique.
We found in teaching this course that we wanted students to be able to refer to the material now in these boxes. We discovered that we could not cover even half as much of our subject if we had to put this material on the blackboard during the lecture, and so we made up and distributed boxes and asked students to refer to them during the lecture. Instructors who use this book may wish to use the boxes in a similar manner.
We emphasize the practical applications of statistics to biology in this book; thus, we deliberately keep discussions of statistical theory to a minimum. Derivations are given for some formulas, but these are consigned to Appendix A1, where they should be studied and reworked by the student. Statistical tables to which the reader can refer when working through the methods discussed in this book are found in Appendix A2.
We are grateful to K. R. Gabriel, R. C. Lewontin, and M. Kabay for their extensive comments on the second edition of Biometry and to M. D. Morgan, E. Russek-Cohen, and M. Singh for comments on an early draft of this book. We also appreciate the work of our secretaries, Resa Chapey and Cheryl Daly, with preparing the manuscripts, and of Donna DiGiovanni, Patricia Rohlf, and Barbara Thomson with proofreading.
Robert R. Sokal
F. James Rohlf
INTRODUCTION TO BIOSTATISTICS
CHAPTER 1
Introduction
This chapter sets the stage for your study of biostatistics. In Section 1.1, we define the field itself. We then cast a necessarily brief glance at its historical development in Section 1.2. Then in Section 1.3 we conclude the chapter with a discussion of the attitudes that the person trained in statistics brings to biological research.
1.1 Some definitions
We shall define biostatistics as the application of statistical methods to the solution of biological problems. The biological problems of this definition are those arising in the basic biological sciences as well as in such applied areas as the health-related sciences and the agricultural sciences. Biostatistics is also called biological statistics or biometry.
The definition of biostatistics leaves us somewhat up in the air: "statistics" has not been defined. Statistics is a science well known by name even to the layman. The number of definitions you can find for it is limited only by the number of books you wish to consult. We might define statistics in its modern sense as the scientific study of numerical data based on natural phenomena. All parts of this definition are important and deserve emphasis:
Scientific study: Statistics must meet the commonly accepted criteria of validity of scientific evidence. We must always be objective in presentation and evaluation of data and adhere to the general ethical code of scientific methodology, or we may find that the old saying that "figures never lie, only statisticians do" applies to us.
Data: Statistics generally deals with populations or groups of individuals; hence it deals with quantities of information, not with a single datum. Thus, the measurement of a single animal or the response from a single biochemical test will generally not be of interest.
Numerical: Unless data of a study can be quantified in one way or another, they will not be amenable to statistical analysis. Numerical data can be measurements (the length or width of a structure or the amount of a chemical in a body fluid, for example) or counts (such as the number of bristles or teeth).
Natural phenomena: We use this term in a wide sense to mean not only all those events in animate and inanimate nature that take place outside the control of human beings, but also those evoked by scientists and partly under their control, as in experiments. Different biologists will concern themselves with different levels of natural phenomena; other kinds of scientists, with yet different ones. But all would agree that the chirping of crickets, the number of peas in a pod, and the age of a woman at menopause are natural phenomena. The heartbeat of rats in response to adrenalin, the mutation rate in maize after irradiation, or the incidence or morbidity in patients treated with a vaccine may still be considered natural, even though scientists have interfered with the phenomenon through their intervention. The average biologist would not consider the number of stereo sets bought by persons in different states in a given year to be a natural phenomenon. Sociologists or human ecologists, however, might so consider it and deem it worthy of study. The qualification "natural phenomena" is included in the definition of statistics mostly to make certain that the phenomena studied are not arbitrary ones that are entirely under the will and control of the researcher, such as the number of animals employed in an experiment.
The word "statistics" is also used in another, though related, way. It can be the plural of the noun statistic, which refers to any one of many computed or estimated statistical quantities, such as the mean, the standard deviation, or the correlation coefficient. Each one of these is a statistic.
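To make the second sense of the word concrete, the following sketch computes three statistics from a small sample. The paired measurements are hypothetical values chosen only for illustration, and the helper name `correlation` is an assumption of this sketch, not notation from the text.

```python
import math
import statistics

# Hypothetical paired measurements on five individuals (illustrative values only).
lengths = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [2.0, 4.0, 5.0, 4.0, 5.0]


def correlation(xs, ys):
    """Product-moment correlation coefficient of two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


mean_length = statistics.mean(lengths)  # one statistic: the mean
sd_length = statistics.stdev(lengths)   # another: the sample standard deviation
r = correlation(lengths, weights)       # a third: the correlation coefficient

print(mean_length, sd_length, r)
```

Each of the three printed numbers is a single statistic in this sense: a quantity computed from the sample as a whole rather than a property of any one individual.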
1.2 The development of biostatistics
Modern statistics appears to have developed from two sources as far back as the seventeenth century. The first source was political science; a form of statistics developed as a quantitative description of the various aspects of the affairs of a government or state (hence the term "statistics"). This subject also became known as political arithmetic. Taxes and insurance caused people to become interested in problems of censuses, longevity, and mortality. Such considerations assumed increasing importance, especially in England as the country prospered during the development of its empire. John Graunt (1620-1674) and William Petty (1623-1687) were early students of vital statistics, and others followed in their footsteps.
At about the same time, the second source of modern statistics developed: the mathematical theory of probability engendered by the interest in games of chance among the leisure classes of the time. Important contributions to this theory were made by Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665), both Frenchmen. Jacques Bernoulli (1654-1705), a Swiss, laid the foundation of modern probability theory in Ars Conjectandi. Abraham de Moivre (1667-1754), a Frenchman living in England, was the first to combine the statistics of his day with probability theory in working out annuity values and to approximate the important normal distribution through the expansion of the binomial.
A later stimulus for the development of statistics came from the science of astronomy, in which many individual observations had to be digested into a coherent theory. Many of the famous astronomers and mathematicians of the eighteenth century, such as Pierre Simon Laplace (1749-1827) in France and Karl Friedrich Gauss (1777-1855) in Germany, were among the leaders in this field. The latter's lasting contribution to statistics is the development of the method of least squares.
Perhaps the earliest important figure in biostatistic thought was Adolphe Quetelet (1796-1874), a Belgian astronomer and mathematician, who in his work combined the theory and practical methods of statistics and applied them to problems of biology, medicine, and sociology. Francis Galton (1822-1911), a cousin of Charles Darwin, has been called the father of biostatistics and eugenics. The inadequacy of Darwin's genetic theories stimulated Galton to try to solve the problems of heredity. Galton's major contribution to biology was his application of statistical methodology to the analysis of biological variation, particularly through the analysis of variability and through his study of regression and correlation in biological measurements. His hope of unraveling the laws of genetics through these procedures was in vain. He started with the most difficult material and with the wrong assumptions. However, his methodology has become the foundation for the application of statistics to biology.
Karl Pearson (1857-1936), at University College, London, became interested in the application of statistical methods to biology, particularly in the demonstration of natural selection. Pearson's interest came about through the influence of W. F. R. Weldon (1860-1906), a zoologist at the same institution. Weldon, incidentally, is credited with coining the term "biometry" for the type of studies he and Pearson pursued. Pearson continued in the tradition of Galton and laid the foundation for much of descriptive and correlational statistics.
The dominant figure in statistics and biometry in the twentieth century has been Ronald A. Fisher (1890-1962). His many contributions to statistical theory will become obvious even to the cursory reader of this book.
Statistics today is a broad and extremely active field whose applications touch almost every science and even the humanities. New applications for statistics are constantly being found, and no one can predict from what branch of statistics new applications to biology will be made.
1.3 The statistical frame of mind
A brief perusal of almost any biological journal reveals how pervasive the use of statistics has become in the biological sciences. Why has there been such a marked increase in the use of statistics in biology? Apparently, because biologists have found that the interplay of biological causal and response variables does not fit the classic mold of nineteenth-century physical science. In that century, biologists such as Robert Mayer, Hermann von Helmholtz, and others tried to demonstrate that biological processes were nothing but physicochemical phenomena. In so doing, they helped create the impression that the experimental methods and natural philosophy that had led to such dramatic progress in the physical sciences should be imitated fully in biology.
Many biologists, even to this day, have retained the tradition of strictly mechanistic and deterministic concepts of thinking (while physicists, interestingly enough, as their science has become more refined, have begun to resort to statistical approaches). In biology, most phenomena are affected by many causal factors, uncontrollable in their variation and often unidentifiable. Statistics is needed to measure such variable phenomena, to determine the error of measurement, and to ascertain the reality of minute but important differences.
A misunderstanding of these principles and relationships has given rise to the attitude of some biologists that if differences induced by an experiment, or observed by nature, are not clear on plain inspection (and therefore are in need of statistical analysis), they are not worth investigating. There are few legitimate fields of inquiry, however, in which, from the nature of the phenomena studied, statistical investigation is unnecessary.
Statistical thinking is not really different from ordinary disciplined scientific thinking, in which we try to quantify our observations. In statistics we express our degree of belief or disbelief as a probability rather than as a vague, general statement. For example, a statement that individuals of species A are larger than those of species B, or that women suffer more often from disease X than do men, is of a kind commonly made by biological and medical scientists. Such statements can and should be more precisely expressed in quantitative form.
In many ways the human mind is a remarkable statistical machine, absorbing many facts from the outside world, digesting these, and regurgitating them in simple summary form. From our experience we know certain events to occur frequently, others rarely. "Man smoking cigarette" is a frequently observed event, "Man slipping on banana peel," rare. We know from experience that Japanese are on the average shorter than Englishmen and that Egyptians are on the average darker than Swedes. We associate thunder with lightning almost always, flies with garbage cans in the summer frequently, but snow with the southern Californian desert extremely rarely. All such knowledge comes to us as a result of experience, both our own and that of others, which we learn about by direct communication or through reading. All these facts have been processed by that remarkable computer, the human brain, which furnishes an abstract. This abstract is constantly under revision, and though occasionally faulty and biased, it is on the whole astonishingly sound; it is our knowledge of the moment.
Although statistics arose to satisfy the needs of scientific research, the development of its methodology in turn affected the sciences in which statistics is applied. Thus, through positive feedback, statistics, created to serve the needs of natural science, has itself affected the content and methods of the biological sciences. To cite an example: Analysis of variance has had a tremendous effect in influencing the types of experiments researchers carry out. The whole field of quantitative genetics, one of whose problems is the separation of environmental from genetic effects, depends upon the analysis of variance for its realization, and many of the concepts of quantitative genetics have been directly built around the designs inherent in the analysis of variance.
CHAPTER 2
Data in Biostatistics
In Section 2.1 we explain the statistical meaning of the terms "sample" and "population," which we shall be using throughout this book. Then, in Section 2.2, we come to the types of observations that we obtain from biological research material; we shall see how these correspond to the different kinds of variables upon which we perform the various computations in the rest of this book. In Section 2.3 we discuss the degree of accuracy necessary for recording data and the procedure for rounding off figures. We shall then be ready to consider in Section 2.4 certain kinds of derived data frequently used in biological science, among them ratios and indices, and the peculiar problems of accuracy and distribution they present us. Knowing how to arrange data in frequency distributions is important because such arrangements give an overall impression of the general pattern of the variation present in a sample and also facilitate further computational procedures. Frequency distributions, as well as the presentation of numerical data, are discussed in Section 2.5. In Section 2.6 we briefly describe the computational handling of data.
2.1 Samples and populations
We shall now define a number of important terms necessary for an understanding of biological data. The data in biostatistics are generally based on individual observations. They are observations or measurements taken on the smallest sampling unit. These smallest sampling units frequently, but not necessarily, are also individuals in the ordinary biological sense. If we measure weight in 100 rats, then the weight of each rat is an individual observation; the hundred rat weights together represent the sample of observations, defined as a collection of individual observations selected by a specified procedure. In this instance, one individual observation (an item) is based on one individual in a biological sense, that is, one rat. However, if we had studied weight in a single rat over a period of time, the sample of individual observations would be the weights recorded on one rat at successive times. If we wish to measure temperature in a study of ant colonies, where each colony is a basic sampling unit, each temperature reading for one colony is an individual observation, and the sample of observations is the temperatures for all the colonies considered. If we consider an estimate of the DNA content of a single mammalian sperm cell to be an individual observation, the sample of observations may be the estimates of DNA content of all the sperm cells studied in one individual mammal.
We have carefully avoided so far specifying what particular variable was being studied, because the terms "individual observation" and "sample of observations" as used above define only the structure but not the nature of the data in a study. The actual property measured by the individual observations is the character, or variable. The more common term employed in general statistics is "variable." However, in biology the word "character" is frequently used synonymously. More than one variable can be measured on each smallest sampling unit. Thus, in a group of 25 mice we might measure the blood pH and the erythrocyte count. Each mouse (a biological individual) is the smallest sampling unit, blood pH and red cell count would be the two variables studied, the pH readings and cell counts are individual observations, and two samples of 25 observations (on pH and on erythrocyte count) would result. Or we might speak of a bivariate sample of 25 observations, each referring to a pH reading paired with an erythrocyte count.
Next we define population. The biological definition of this term is well known. It refers to all the individuals of a given species (perhaps of a given life-history stage or sex) found in a circumscribed area at a given time. In statistics, population always means the totality of individual observations about which inferences are to be made, existing anywhere in the world or at least within a definitely specified sampling area limited in space and time. If you take five men and study the number of leucocytes in their peripheral blood and you are prepared to draw conclusions about all men from this sample of five, then the population from which the sample has been drawn represents the leucocyte counts of all extant males of the species Homo sapiens. If, on the other hand, you restrict yourself to a more narrowly specified sample, such as five male Chinese, aged 20, and you are restricting your conclusions to this particular group, then the population from which you are sampling will be leucocyte numbers of all Chinese males of age 20.
A common misuse of statistical methods is to fail to define the statistical population about which inferences can be made. A report on the analysis of a sample from a restricted population should not imply that the results hold in general. The population in this statistical sense is sometimes referred to as the universe.
A population may represent variables of a concrete collection of objects or creatures, such as the tail lengths of all the white mice in the world, the leucocyte counts of all the Chinese men in the world of age 20, or the DNA content of all the hamster sperm cells in existence; or it may represent the outcomes of experiments, such as all the heartbeat frequencies produced in guinea pigs by injections of adrenalin. In cases of the first kind the population is generally finite. Although in practice it would be impossible to collect, count, and examine all hamster sperm cells, all Chinese men of age 20, or all white mice in the world, these populations are in fact finite. Certain smaller populations, such as all the whooping cranes in North America or all the recorded cases of a rare but easily diagnosed disease X, may well lie within reach of a total census. By contrast, an experiment can be repeated an infinite number of times (at least in theory). A given experiment, such as the administration of adrenalin to guinea pigs, could be repeated as long as the experimenter could obtain material and his or her health and patience held out. The sample of experiments actually performed is a sample from an infinite number that could be performed.
Some of the statistical methods to be developed later make a distinction between sampling from finite and from infinite populations. However, though populations are theoretically finite in most applications in biology, they are generally so much larger than samples drawn from them that they can be considered de facto infinite-sized populations.
2.2 Variables in biostatistics
Each biological discipline has its own set of variables, which may include conventional morphological measurements; concentrations of chemicals in body fluids; rates of certain biological processes; frequencies of certain events, as in genetics, epidemiology, and radiation biology; physical readings of optical or electronic machinery used in biological research; and many more.
We have already referred to biological variables in a general way, but we have not yet defined them. We shall define a variable as a property with respect to which individuals in a sample differ in some ascertainable way. If the property does not differ within a sample at hand or at least among the samples being studied, it cannot be of statistical interest. Length, height, weight, number of teeth, vitamin C content, and genotypes are examples of variables in ordinary, genetically and phenotypically diverse groups of organisms. Warm-bloodedness in a group of mammals is not, since mammals are all alike in this regard.
Variables
    Measurement variables
        Continuous variables
        Discontinuous variables
    Ranked variables
    Attributes
Measurement variables are those measurements and counts that are expressed numerically. Measurement variables are of two kinds. The first kind consists of continuous variables, which at least theoretically can assume an infinite number of values between any two fixed points. For example, between the two length measurements 1.5 and 1.6 cm there are an infinite number of lengths that could be measured if one were so inclined and had a precise enough method of calibration. Any given reading of a continuous variable, such as a length of 1.57 mm, is therefore an approximation to the exact reading, which in practice is unknowable. Many of the variables studied in biology are continuous variables. Examples are lengths, areas, volumes, weights, angles, temperatures, periods of time, percentages, concentrations, and rates.
Contrasted with continuous variables are the discontinuous variables, also known as meristic or discrete variables. These are variables that have only certain fixed numerical values, with no intermediate values possible in between. Thus the number of segments in a certain insect appendage may be 4 or 5 or 6 but never 5½ or 4.3. Examples of discontinuous variables are numbers of a given structure (such as segments, bristles, teeth, or glands), numbers of offspring, numbers of colonies of microorganisms or animals, or numbers of plants in a given quadrat.
Some variables cannot be measured but at least can be ordered or ranked by their magnitude. Thus, in an experiment one might record the rank order of emergence of ten pupae without specifying the exact time at which each pupa emerged.
In such cases we code the data as a ranked variable, the order of emergence. Special methods for dealing with such variables have been developed, and several are furnished in this book. By expressing a variable as a series of ranks, such as 1, 2, 3, 4, 5, we do not imply that the difference in magnitude between, say, ranks 1 and 2 is identical to or even proportional to the difference between ranks 2 and 3.
Variables that cannot be measured but must be expressed qualitatively are called attributes, or nominal variables. These are all properties, such as black or white, pregnant or not pregnant, dead or alive, male or female. When such attributes are combined with frequencies, they can be treated statistically. Of 80 mice, we may, for instance, state that four were black, two agouti, and the rest gray. When attributes are combined with frequencies into tables suitable for statistical analysis, they are referred to as enumeration data. Thus the enumeration data on color in mice would be arranged as follows:
Color      Frequency
Black          4
Agouti         2
Gray          74
Total         80
In some cases attributes can be changed into measurement variables if this is desired. Thus colors can be changed into wavelengths or color-chart values. Certain other attributes that can be ranked or ordered can be coded to become ranked variables. For example, three attributes referring to a structure as "poorly developed," "well developed," and "hypertrophied" could be coded 1, 2, and 3.
A term that has not yet been explained is variate. In this book we shall use it as a single reading, score, or observation of a given variable. Thus, if we have measurements of the length of the tails of five mice, tail length will be a continuous variable, and each of the five readings of length will be a variate. In this text we identify variables by capital letters, the most common symbol being Y. Thus Y may stand for tail length of mice. A variate will refer to a given length measurement; $Y_i$ is the measurement of tail length of the ith mouse, and $Y_4$ is the measurement of tail length of the fourth mouse in our sample.
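The coding of attributes and the indexing of variates can be sketched in a few lines of Python. This is only an illustration: the mouse counts come from the text, but the structure scores and tail lengths are made-up values, and the names `rank_code` and `tail_lengths` are hypothetical.

```python
from collections import Counter

# Coat color recorded for each of 80 mice (counts taken from the text).
colors = ["black"] * 4 + ["agouti"] * 2 + ["gray"] * 74

# Enumeration data: each attribute class paired with its frequency.
enumeration = Counter(colors)

# Ordered attributes coded as ranks 1, 2, 3.
# The ranks carry order only, not the size of the differences.
rank_code = {"poorly developed": 1, "well developed": 2, "hypertrophied": 3}
observed = ["well developed", "poorly developed", "hypertrophied"]  # made-up scores
ranks = [rank_code[state] for state in observed]

# Variates of a variable Y: tail_lengths[i - 1] plays the role of Y_i.
tail_lengths = [7.1, 6.8, 7.4, 6.9, 7.0]  # made-up values for five mice
Y_4 = tail_lengths[3]  # the fourth variate
```

Python lists are indexed from zero, so the fourth variate sits at index 3.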
2.3 Accuracy and precision of data "Accuracy" and "precision" are used synonymously in everyday speech, but in statistics we define them more rigorously. Accuracy is the closeness of a measured or computed value to its true value. Precision is the closeness of repeated measurements. A biased but sensitive scale might yield inaccurate but precise weight. By chance, an insensitive scale might result in an accurate reading, which would, however, be imprecise, since a repeated weighing would be unlikely to yield an equally accurate weight. Unless there is bias in a measuring instrument, precision will lead to accuracy. We need therefore mainly be concerned with the former.
Precise variates are usually, but not necessarily, whole numbers. Thus, when we count four eggs in a nest, there is no doubt about the exact number of eggs in the nest if we have counted correctly; it is 4, not 3 or 5, and clearly it could not be 4 plus or minus a fractional part. Meristic, or discontinuous, variables are generally measured as exact numbers. Seemingly continuous variables derived from meristic ones can under certain conditions also be exact numbers. For instance, ratios between exact numbers are themselves also exact. If in a colony of animals there are 18 females and 12 males, the ratio of females to males (a seemingly continuous variable) is 1.5, an exact number.
Most continuous variables, however, are approximate. We mean by this that the exact value of the single measurement, the variate, is unknown and probably unknowable. The last digit of the measurement stated should imply precision; that is, it should indicate the limits on the measurement scale between which we believe the true measurement to lie. Thus, a length measurement of 12.3 mm implies that the true length of the structure lies somewhere between 12.25 and 12.35 mm. Exactly where between these implied limits the real length is we do not know. But where would a true measurement of 12.25 fall? Would it not equally likely fall in either of the two classes 12.2 and 12.3, clearly an unsatisfactory state of affairs? Such an argument is correct, but when we record a number as either 12.2 or 12.3, we imply that the decision whether to put it into the higher or lower class has already been taken. This decision was not taken arbitrarily, but presumably was based on the best available measurement. If the scale of measurement is so precise that a value of 12.25 would clearly have been recognized, then the measurement should have been recorded originally to four significant figures. Implied limits, therefore, always carry one more figure beyond the last significant one measured by the observer.
Hence, it follows that if we record the measurement as 12.32, we are implying that the true value lies between 12.315 and 12.325. Unless this is what we mean, there would be no point in adding the last decimal figure to our original measurements. If we do add another figure, we must imply an increase in precision.
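The rule for implied limits is mechanical enough to sketch in code. The function below is an illustration only; its name, `implied_limits`, is an assumption. It takes the measurement as a string, so that the number of recorded decimal places (including trailing zeros) is preserved, and uses Python's `decimal` module to avoid binary floating-point noise.

```python
from decimal import Decimal

def implied_limits(recorded: str) -> tuple[Decimal, Decimal]:
    """Return the implied limits of a measurement given as a decimal string."""
    value = Decimal(recorded)
    # One unit step is one unit in the last recorded digit, e.g. 0.1 for "12.3".
    unit_step = Decimal(1).scaleb(value.as_tuple().exponent)
    half_step = unit_step / 2
    return value - half_step, value + half_step

lower, upper = implied_limits("12.3")       # 12.25 and 12.35
lower2, upper2 = implied_limits("12.32")    # 12.315 and 12.325
```

Passing the value as a string rather than a float is a deliberate choice: a float cannot distinguish a recorded 7.8 from a recorded 7.80, although the two imply different limits.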
We see, therefore, that accuracy and precision in numbers are not absolute concepts, but are relative. Assuming there is no bias, a number becomes increasingly more accurate as we are able to write more significant figures for it (increase its precision). To illustrate this concept of the relativity of accuracy, consider the following three numbers:
We may imagine these numbers to be recorded measurements of the same structure. Let us assume that we had extramundane knowledge that the true length of the given structure was 192.758 units. If that were so, the three measurements would increase in accuracy from the top down, as the interval between their implied limits decreased. You will note that the implied limits of the topmost measurement are wider than those of the one below it, which in turn are wider than those of the third measurement.
Meristic variates, though ordinarily exact, may be recorded approximately when large numbers are involved. Thus when counts are reported to the nearest thousand, a count of 36,000 insects in a cubic meter of soil, for example, implies that the true number lies somewhere from 35,500 to 36,500 insects.
To how many significant figures should we record measurements? If we array the sample by order of magnitude from the smallest individual to the largest
one, an easy rule to remember is that the number of unit steps from the smallest to the largest measurement in an array should usually be between 30 and 300. Thus, if we are measuring a series of shells to the nearest millimeter and the largest is 8 mm and the smallest is 4 mm wide, there are only four unit steps between the largest and the smallest measurement. Hence, we should measure our shells to one more significant decimal place. Then the two extreme measurements might be 8.2 mm and 4.1 mm, with 41 unit steps between them (counting the last significant digit as the unit); this would be an adequate number of unit steps. The reason for such a rule is that an error of 1 in the last significant digit of a reading of 4 mm would constitute an inadmissible error of 25%, but an error of 1 in the last digit of 4.1 is less than 2.5%. Similarly, if we measured the height of the tallest of a series of plants as 173.2 cm and that of the shortest of these plants as 26.6 cm, the difference between these limits would comprise 1466 unit steps (of 0.1 cm), which is far too many. It would therefore be advisable to record the heights to the nearest centimeter, as follows: 173 cm for the tallest and 27 cm for the shortest. This would yield 146 unit steps. Using the rule we have stated for the number of unit steps, we shall record two or three digits for most measurements.
The last digit should always be significant; that is, it should imply a range for the true measurement of from half a "unit step" below to half a "unit step" above the recorded score, as illustrated earlier. This applies to all digits, zero included. Zeros should therefore not be written at the end of approximate numbers to the right of the decimal point unless they are meant to be significant digits. Thus 7.80 must imply the limits 7.795 to 7.805. If 7.75 to 7.85 is implied, the measurement should be recorded as 7.8.
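The unit-step rule can be checked with a short sketch. The helper names `unit_steps` and `precision_is_adequate` are hypothetical, and the bounds of 30 and 300 are simply the ones stated in the text. Measurements are passed as strings so that the last recorded digit defines the unit.

```python
from decimal import Decimal

def unit_steps(smallest: str, largest: str) -> int:
    """Count unit steps between the extreme measurements of an array."""
    lo, hi = Decimal(smallest), Decimal(largest)
    # The unit is one step in the last recorded digit (assumed shared by both).
    unit = Decimal(1).scaleb(min(lo.as_tuple().exponent, hi.as_tuple().exponent))
    return int((hi - lo) / unit)

def precision_is_adequate(smallest: str, largest: str,
                          low: int = 30, high: int = 300) -> bool:
    """Apply the rule that unit steps should usually fall between 30 and 300."""
    return low <= unit_steps(smallest, largest) <= high

shells_mm = unit_steps("4", "8")             # too few steps
shells_tenths = unit_steps("4.1", "8.2")     # adequate
plants_tenths = unit_steps("26.6", "173.2")  # too many steps
plants_cm = unit_steps("27", "173")          # adequate
```

The rule is a guideline rather than a strict test, so a result just outside the bounds calls for judgment, not automatic rejection.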
When the number of significant digits is to be reduced, we carry out the process of rounding off numbers. The rules for rounding off are very simple. A digit to be rounded off is not changed if it is followed by a digit less than 5. If the digit to be rounded off is followed by a digit greater than 5 or by 5 followed by other nonzero digits, it is increased by 1. When the digit to be rounded off is followed by a 5 standing alone or a 5 followed by zeros, it is unchanged if it is even but increased by 1 if it is odd. The reason for this last rule is that when such numbers are summed in a long series, we should have as many digits raised as are being lowered, on the average; these changes should therefore balance out. Practice the above rules by rounding off the following numbers to the indicated number of significant digits:
Significant digits desired: 2, 5, 3, 3, 2, 3
Answer: 8.000, 17.3
Most pocket calculators or larger computers round off their displays using a different rule: they increase the preceding digit when the following digit is a 5 standing alone or with trailing zeros. However, since most of the machines usable for statistics also retain eight or ten significant figures internally, the accumulation of rounding errors is minimized. Incidentally, if two calculators give answers with slight differences in the final (least significant) digits, suspect a different number of significant digits in memory as a cause of the disagreement.
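The two rounding conventions just described can be compared directly. The sketch below is illustrative: the function name `round_significant` is an assumption and the test numbers are made up rather than taken from the practice table. It relies on the `decimal` module, whose `ROUND_HALF_EVEN` mode matches the even-digit rule given above and whose `ROUND_HALF_UP` mode matches the calculator rule.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_significant(number: str, digits: int,
                      rounding: str = ROUND_HALF_EVEN) -> Decimal:
    """Round a decimal string to the given number of significant digits."""
    value = Decimal(number)
    if value == 0:
        return value
    # adjusted() is the exponent of the leading digit, e.g. 1 for 26.58.
    last_kept_exponent = value.adjusted() - digits + 1
    return value.quantize(Decimal(1).scaleb(last_kept_exponent), rounding=rounding)

even_rule = round_significant("2.65", 2)                  # 5 alone, 6 is even
odd_rule = round_significant("2.75", 2)                   # 5 alone, 7 is odd
nonzero_tail = round_significant("2.651", 2)              # 5 followed by nonzero
calculator = round_significant("2.65", 2, ROUND_HALF_UP)  # calculator convention
```

Working from strings matters here as well: a binary float such as 2.65 is not stored exactly, so rounding it can go either way regardless of the rule intended.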
2.4 Derived variables The majority of variables in biometric work are observations recorded as direct measurements or counts of biological material or as readings that are the output of various types of instruments. However, there is an important class of variables in biological research that we may call the derived or computed variables. These are generally based on two or more independently measured variables whose relations are expressed in a certain way. We are referring to ratios, percentages, concentrations, indices, rates, and the like.
A ratio expresses as a single value the relation that two variables have, one to the other. In its simplest form, a ratio is expressed as in 64:24, which may represent the number of wild-type versus mutant individuals, the number of males versus females, a count of parasitized individuals versus those not parasitized, and so on. These examples imply ratios based on counts. A ratio based on a continuous variable might be similarly expressed as 1.2:1.8, which may represent the ratio of width to length in a sclerite of an insect or the ratio between the concentrations of two minerals contained in water or soil. Ratios may also be expressed as fractions; thus, the two ratios above could be expressed as $\frac{64}{24}$ and $\frac{1.2}{1.8}$. However, for computational purposes it is more useful to express the ratio as a quotient. The two ratios cited would therefore be 2.666 . . . and 0.666 . . . , respectively. These are pure numbers, not expressed in measurement units of any kind. It is this form for ratios that we shall consider further. Percentages are also a type of ratio. Ratios, percentages, and concentrations are basic quantities in much biological research, widely used and generally familiar.
An index is the ratio of the value of one variable to the value of a so-called standard one. A well-known example of an index in this sense is the cephalic index in physical anthropology. Conceived in the wide sense, an index could be the average of two measurements, either simply, such as $\frac{1}{2}(\text{length of } A + \text{length of } B)$, or in weighted fashion, such as $\frac{1}{3}[(2 \times \text{length of } A) + \text{length of } B]$.
Rates are important in many experimental fields of biology. The amount of a substance liberated per unit weight or volume of biological material, weight gain per unit time, reproductive rates per unit population size and time (birth rates), and death rates would fall in this category.
The use of ratios and percentages is deeply ingrained in scientific thought. Often ratios may be the only meaningful way to interpret and understand certain types of biological problems. If the biological process being investigated
operates on the ratio of the variables studied, one must examine this ratio to understand the process. Thus, Sinnott and Hammond (1935) found that inheritance of the shapes of squashes of the species Cucurbita pepo could be interpreted through a form index based on a length-width ratio, but not through the independent dimensions of shape. By similar methods of investigation, we should be able to find selection affecting body proportions to exist in the evolution of almost any organism.
There are several disadvantages to using ratios. First, they are relatively inaccurate. Let us return to the ratio $\frac{1.2}{1.8}$ mentioned above and recall from the previous section that a measurement of 1.2 implies a true range of measurement of the variable from 1.15 to 1.25; similarly, a measurement of 1.8 implies a range from 1.75 to 1.85. We realize, therefore, that the true ratio may vary anywhere from $\frac{1.15}{1.85}$ to $\frac{1.25}{1.75}$, or from 0.622 to 0.714. We note a possible maximal error of 4.2% if 1.2 is an original measurement: $(1.25 - 1.2)/1.2$; the corresponding maximal error for the ratio is 7.0%: $(0.714 - 0.667)/0.667$. Furthermore, the best estimate of a ratio is not usually the midpoint between its possible ranges. Thus, in our example the midpoint between the implied limits is 0.668 and the ratio based on $\frac{1.2}{1.8}$ is 0.666 . . . ; while this is only a slight difference, the discrepancy may be greater in other instances.
A second disadvantage to ratios and percentages is that they may not be approximately normally distributed (see Chapter 5) as required by many statistical tests.
This difficulty can frequently be overcome by transformation of the variable (as discussed in Chapter 10). A third disadvantage of ratios is that in using them one loses information about the relationships between the two variables except for the information about the ratio itself.
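The first disadvantage, the loss of accuracy in a ratio, can be reproduced numerically. The sketch below uses the 1.2:1.8 example from the text; the function name `ratio_limits` is hypothetical. Note that the text computes the 7.0% figure from values already rounded to three decimals, whereas the unrounded calculation gives about 7.1%.

```python
def ratio_limits(numerator: float, denominator: float,
                 half_step: float) -> tuple[float, float]:
    """Extreme values of a ratio, given the implied limits of both measurements."""
    lowest = (numerator - half_step) / (denominator + half_step)
    highest = (numerator + half_step) / (denominator - half_step)
    return lowest, highest

low, high = ratio_limits(1.2, 1.8, 0.05)   # about 0.622 and 0.714
ratio = 1.2 / 1.8                          # 0.666...
max_error_measurement = 0.05 / 1.2         # about 4.2%
max_error_ratio = (high - ratio) / ratio   # about 7.1% before rounding
midpoint = (low + high) / 2                # about 0.668, not 0.666...
```

The midpoint differs from the ratio itself because dividing is not a symmetric operation: raising the numerator and lowering the denominator shifts the quotient further than the reverse change does.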
2.5 Frequency distributions If we were to sample a population of birth weights of infants, we could represent each sampled measurement by a point along an axis denoting magnitude of birth weight. This is illustrated in Figure 2.1A, for a sample of 25 birth weights. If we sample repeatedly from the population and obtain 100 birth weights, we shall probably have to place some of these points on top of other points in order to record them all correctly (Figure 2.1B). As we continue sampling additional hundreds and thousands of birth weights (Figure 2.1C and D), the assemblage of points will continue to increase in size but will assume a fairly definite shape. The outline of the mound of points approximates the distribution of the variable. Remember that a continuous variable such as birth weight can assume an infinity of values between any two points on the abscissa. The refinement of our measurements will determine how fine the number of recorded divisions between any two points along the axis will be.
The distribution of a variable is of considerable biological interest. If we find that the distribution is asymmetrical and drawn out in one direction, it tells us that there is, perhaps, selection that causes organisms to fall preferentially in one of the tails of the distribution, or possibly that the scale of measurement
chosen is such as to bring about a distortion of the distribution. If, in a sample of immature insects, we discover that the measurements are bimodally distributed (with two peaks), this would indicate that the population is dimorphic. This means that different species or races may have become intermingled in our sample. Or the dimorphism could have arisen from the presence of both sexes or of different instars.
FIGURE 2.1 Samples of birth weights of infants. Abscissa: birth weight (oz).
FIGURE 2.2 Carex flacca. Abscissa: number of plants per quadrat.
There are several characteristic shapes of frequency distributions. The most common is the symmetrical bell shape (approximated by the bottom graph in Figure 2.1), which is the shape of the normal frequency distribution discussed in Chapter 5. There are also skewed distributions (drawn out more at one tail than the other), L-shaped distributions as in Figure 2.2, U-shaped distributions, and others, all of which impart significant information about the relationships they represent. We shall have more to say about the implications of various types of distributions in later chapters and sections.
After researchers have obtained data in a given study, they must arrange the data in a form suitable for computation and interpretation. We may assume that variates are randomly ordered initially or are in the order in which the measurements have been taken. A simple arrangement would be an array of the data by order of magnitude. Thus, for example, the variates 7, 6, 5, 7, 8, 9, 6, 7, 4, 6, 7 could be arrayed in order of decreasing magnitude as follows: 9, 8, 7, 7, 7, 7, 6, 6, 6, 5, 4.
Where there are some variates of the same value, such as the 6's and 7's in this fictitious example, a timesaving device might immediately have occurred to you, namely, to list a frequency for each of the recurring variates; thus: 9, 8, 7 (4×), 6 (3×), 5, 4. Such a shorthand notation is one way to represent a frequency distribution, which is simply an arrangement of the classes of variates with the frequency of each class indicated. Conventionally, a frequency distribution is stated in tabular form; for our example, this is done as follows:
Variable    Frequency
   Y            f
   9            1
   8            1
   7            4
   6            3
   5            1
   4            1
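The array and the frequency distribution for this fictitious example can both be produced with a few lines of Python. This is a minimal sketch using `collections.Counter` from the standard library; the variable names are arbitrary.

```python
from collections import Counter

variates = [7, 6, 5, 7, 8, 9, 6, 7, 4, 6, 7]

# Array: the variates in order of decreasing magnitude.
array = sorted(variates, reverse=True)

# Frequency distribution: each class of variate with its frequency f.
frequency = Counter(variates)
table = [(y, frequency[y]) for y in sorted(frequency, reverse=True)]

for y, f in table:
    print(f"{y:>3} {f:>3}")
```

The frequencies must sum to the number of variates, which is a quick check that no reading was dropped or counted twice.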
The above is an example of a quantitative frequency distribution, since Y is clearly a measurement variable. However, arrays and frequency distributions need not be limited to such variables. We can make frequency distributions of attributes, called qualitative frequency distributions. In these, the various classes are listed in some logical or arbitrary order. For example, in genetics we might have a qualitative frequency distribution as follows:
Phenotype     f
A            86
aa           32
This tells us that there are two classes of individuals, those identified by the A phenotype, of which 86 were found, and those comprising the homozygote recessive aa, of which 32 were seen in the sample.
An example of a more extensive qualitative frequency distribution is given in Table 2.1, which shows the distribution of melanoma (a type of skin cancer) over body regions in men and women.
TABLE 2.1 Two qualitative frequency distributions. Number of cases of skin cancer (melanoma) distributed over body regions of 4599 men and 4786 women.
                                    Observed frequency, f
Anatomic site                       Men       Women
Head and neck                        949        645
Trunk and limbs                     3243       3645
Buccal cavity                          8         11
Rest of gastrointestinal tract         5         21
Genital tract                         12         93
Eye                                  382        371
Total cases                         4599       4786
Source: Data from Lee
This table tells us that the trunk and limbs are the most frequent sites for melanomas and that the buccal cavity, the rest of the gastrointestinal tract, and the genital tract are rarely afflicted by this
type of cancer. We often encounter other examples of qualitative frequency distributions in ecology in the form of tables, or species lists, of the inhabitants of a sampled ecological area. Such tables catalog the inhabitants by species or at a higher taxonomic level and record the number of specimens observed for each. The arrangement of such tables is usually alphabetical, or it may follow a special convention, as in some botanical species lists.
A quantitative frequency distribution based on meristic variates is shown in Table 2.2. This is an example from plant ecology: the number of plants per quadrat sampled is listed at the left in the variable column; the observed frequency is shown at the right.
Quantitative frequency distributions based on a continuous variable are the most commonly employed frequency distributions; you should become thoroughly familiar with them. An example is shown in Box 2.1. It is based on 25 femur lengths measured in an aphid population. The 25 readings are shown at the top of Box 2.1 in the order in which they were obtained as measurements. (They could have been arrayed according to their magnitude.) The data are next set up in a frequency distribution. The variates increase in magnitude by unit steps of 0.1. The frequency distribution is prepared by entering each variate in turn on the scale and indicating a count by a conventional tally mark. When all of the items have been tallied in the corresponding class, the tallies are converted into numerals indicating frequencies in the next column. Their sum is indicated by $\sum f$. What have we achieved in summarizing our data?
The original 25 variates are now represented by only 15 classes. We find that variates 3.6, 3.8, and 4.3 have the highest frequencies. However, we also note that there are several classes, such as 3.4 or 3.7, that are not represented by a single aphid. This gives the
entire frequency distribution a drawn-out and scattered appearance. The reason for this is that we have only 25 aphids, too few to put into a frequency distribution with 15 classes. To obtain a more cohesive and smooth-looking distribution, we have to condense our data into fewer classes. This process is known as grouping of classes of frequency distributions; it is illustrated in Box 2.1 and described in the following paragraphs.
We should realize that grouping individual variates into classes of wider range is only an extension of the same process that took place when we obtained the initial measurement. Thus, as we have seen in Section 2.3, when we measure an aphid and record its femur length as 3.3 units, we imply thereby that the true measurement lies between 3.25 and 3.35 units, but that we were unable to measure to the second decimal place. In recording the measurement initially as 3.3 units, we estimated that it fell within this range. Had we estimated that it exceeded the value of 3.35, for example, we would have given it the next higher score, 3.4. Therefore, all the measurements between 3.25 and 3.35 were in fact grouped into the class identified by the class mark 3.3. Our class interval was 0.1 units. If we now wish to make wider class intervals, we are doing nothing but extending the range within which measurements are placed into one class.
Reference to Box 2.1 will make this process clear. We group the data twice in order to impress upon the reader the flexibility of the process. In the first example of grouping, the class interval has been doubled in width; that is, it has been made to equal 0.2 units.
If we start at the lower end, the implied class limits will now be from 3.25 to 3.45, the limits for the next class from 3.45 to 3.65, and so forth. Our next task is to find the class marks. This was quite simple in the frequency distribution shown at the left side of Box 2.1, in which the original measurements were used as class marks. However, now we are using a class interval twice as wide as before, and the class marks are calculated by taking the midpoint of the new class intervals. Thus, to find the class mark of the first class, we take the midpoint between 3.25 and 3.45, which turns out to be 3.35. We note that the class mark has one more decimal place than the original measurements. We should not now be led to believe that we have suddenly achieved greater precision. Whenever we designate a class interval whose last significant digit is even (0.2 in this case), the class mark will carry one more decimal place than the original measurements. On the right side of the table in Box 2.1 the data are grouped once again, using a class interval of 0.3. Because of the odd last significant digit, the class mark now shows as many decimal places as the original variates, the midpoint between 3.25 and 3.55 being 3.4.
Once the implied class limits and the class mark for the first class have been correctly found, the others can be written down by inspection without any special computation. Simply add the class interval repeatedly to each of the values. Thus, starting with the lower limit 3.25, by adding 0.2 we obtain 3.45, 3.65, 3.85, and so forth; similarly, for the class marks, we obtain 3.35, 3.55, 3.75, and so forth. It should be obvious that the wider the class intervals, the more compact the data become but also the less precise. However, looking at
the frequency distribution of aphid femur lengths in Box 2.1, we notice that the initial rather chaotic structure is being simplified by grouping. When we group the frequency distribution into five classes with a class interval of 0.3 units, it becomes notably bimodal (that is, it possesses two peaks of frequencies).
BOX 2.1 Preparation of frequency distribution and grouping into fewer classes with wider class intervals.
Twenty-five femur lengths of the aphid Pemphigus. Measurements are in mm × 10^-1.
Original frequency distribution
Implied limits     Y       f
3.25-3.35          3.3     1
3.35-3.45          3.4     0
3.45-3.55          3.5     1
3.55-3.65          3.6     4
3.65-3.75          3.7     0
3.75-3.85          3.8     4
3.85-3.95          3.9     3
3.95-4.05          4.0     0
4.05-4.15          4.1     2
4.15-4.25          4.2     1
4.25-4.35          4.3     4
4.35-4.45          4.4     3
4.45-4.55          4.5     1
4.55-4.65          4.6     0
4.65-4.75          4.7     1
                   Σf      25
Grouping into 8 classes of interval 0.2
Implied limits     Class mark     f
3.25-3.45          3.35           1
3.45-3.65          3.55           5
3.65-3.85          3.75           4
3.85-4.05          3.95           3
4.05-4.25          4.15           3
4.25-4.45          4.35           7
4.45-4.65          4.55           1
4.65-4.85          4.75           1
                   Σf             25
Grouping into 5 classes of interval 0.3
Implied limits     Class mark     f
3.25-3.55          3.4            2
3.55-3.85          3.7            8
3.85-4.15          4.0            5
4.15-4.45          4.3            8
4.45-4.75          4.6            2
                   Σf             25
Source: Data from R. R. Sokal.
Histogram of the original frequency distribution shown above and of the grouped distribution with 5 classes. Line below abscissa shows class marks for the grouped frequency distribution. Shaded bars represent original frequency distribution; hollow bars represent grouped distribution.
In setting up frequency distributions, from 12 to 20 classes should be established. This rule need not be slavishly adhered to, but it should be employed with some of the common sense that comes from experience in handling statistical data. The number of classes depends largely on the size of the sample studied. Samples of less than 40 or 50 should rarely be given as many as 12 classes, since that would provide too few frequencies per class. On the other hand, samples of several thousand may profitably be grouped into more than 20 classes. If the aphid data of Box 2.1 need to be grouped, they should probably not be grouped into more than 6 classes.
If the original data provide us with fewer classes than we think we should have, then nothing can be done if the variable is meristic, since this is the nature of the data in question. However, with a continuous variable a scarcity of classes would indicate that we probably had not made our measurements with sufficient precision. If we had followed the rules on number of significant digits for measurements stated in Section 2.3, this could not have happened.
Whenever we come up with more than the desired number of classes, grouping should be undertaken. When the data are meristic, the implied limits of continuous variables are meaningless.
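The grouping of the aphid data can be sketched in code. The frequencies below are read from the original frequency distribution in Box 2.1; the function name `group_classes` and its parameters are assumptions made for illustration. Class marks are kept as `Decimal` values so that limits such as 3.25 and intervals such as 0.3 are represented exactly.

```python
from decimal import Decimal

# Original frequency distribution of Box 2.1: class mark -> frequency.
original = {
    "3.3": 1, "3.4": 0, "3.5": 1, "3.6": 4, "3.7": 0,
    "3.8": 4, "3.9": 3, "4.0": 0, "4.1": 2, "4.2": 1,
    "4.3": 4, "4.4": 3, "4.5": 1, "4.6": 0, "4.7": 1,
}

def group_classes(distribution: dict, lower_limit: str, interval: str) -> dict:
    """Regroup a frequency distribution into wider classes.

    Returns a mapping of new class mark -> summed frequency.
    """
    lower, width = Decimal(lower_limit), Decimal(interval)
    grouped = {}
    for mark, f in distribution.items():
        # Index of the wider class whose implied limits contain this mark.
        index = int((Decimal(mark) - lower) // width)
        # New class mark: midpoint of the wider class.
        class_mark = lower + width * index + width / 2
        grouped[class_mark] = grouped.get(class_mark, 0) + f
    return dict(sorted(grouped.items()))

grouped_02 = group_classes(original, "3.25", "0.2")  # class marks 3.35, 3.55, ...
grouped_03 = group_classes(original, "3.25", "0.3")  # class marks 3.4, 3.7, ...
```

With the interval of 0.3 the five class frequencies come out as 2, 8, 5, 8, 2, which shows the two peaks mentioned in the text.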
Yet with many meristic variables, such as a bristle number varying from a low of 13 to a high of 81, it would probably be wise to group the variates into classes, each containing several counts. This can best be done by using an odd number as a class interval so that the class mark representing the data will be a whole rather than a fractional number. Thus, if we were to group the bristle numbers 13, 14, 15, and 16 into one class, the class mark would have to be 14.5, a meaningless value in terms of bristle number. It would therefore be better to use a class ranging over 3 bristles or 5 bristles, giving the integral value 14 or 15 as a class mark.
Grouping data into frequency distributions was necessary when computations were done by pencil and paper. Nowadays even thousands of variates can be processed efficiently by computer without prior grouping. However, frequency distributions are still extremely useful as a tool for data analysis. This is especially true in an age in which it is all too easy for a researcher to obtain a numerical result from a computer program without ever really examining the data for outliers or for other ways in which the sample may not conform to the assumptions of the statistical methods.
Rather than using tally marks to set up a frequency distribution, as was done in Box 2.1, we can employ Tukey's stem-and-leaf display. This technique is an improvement, since it not only results in a frequency distribution of the variates of a sample but also permits easy checking of the variates and ordering them into an array (neither of which is possible with tally marks).
This technique will therefore be useful in computing the median of a sample (see Section 3.3) and in computing various tests that require ordered arrays of the sample variates.
To learn how to construct a stem-and-leaf display, let us look ahead to Table 3.1 in the next chapter, which lists 15 blood neutrophil counts. The unordered measurements are as follows: 4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1, 2.3, 3.6, 18.0, 3.7, 7.3, 4.4, and 9.8. To prepare a stem-and-leaf display, we scan the variates in the sample to discover the lowest and highest leading digit or digits. Next, we write down the entire range of leading digits in unit increments to the left of a vertical line (the "stem"), as shown in the accompanying illustration. We then put the next digit of the first variate (a "leaf") at that level of the stem corresponding to its leading digit(s). The first observation in our sample is 4.9. We therefore place a 9 next to the 4. The next variate is 4.6. It is entered by finding the stem level for the leading digit 4 and recording a 6 next to the 9 that is already there. Similarly, for the third variate, 5.5, we record a 5 next to the leading digit 5. We continue in this way until all 15 variates have been entered (as "leaves") in sequence along the appropriate leading digits of the stem. The completed array is the equivalent of a frequency distribution and has the appearance of a histogram or bar diagram (see the illustration). Moreover, it permits the efficient ordering of the variates. Thus, from the completed array it becomes obvious that the appropriate ordering of the 15 variates is 2.3, 3.6, 3.7, 4.4, 4.6, 4.9, 5.5, 6.4, 7.1, 7.3, 9.1, 9.8, 12.7, 16.3, 18.0. The median can easily be read off the stem-and-leaf display. It is clearly 6.4.
For very large samples, stem-and-leaf displays may become awkward. In such cases a conventional frequency distribution as in Box 2.1 would be preferable.
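Step 1      Step 2      Step 7      Completed array (Step 15)
2  |        2  |        2  |        2  | 3
3  |        3  |        3  |        3  | 67
4  | 9      4  | 96     4  | 96     4  | 964
5  |        5  |        5  | 5      5  | 5
6  |        6  |        6  | 4      6  | 4
7  |        7  |        7  |        7  | 13
8  |        8  |        8  |        8  |
9  |        9  |        9  | 1      9  | 18
10 |        10 |        10 |        10 |
11 |        11 |        11 |        11 |
12 |        12 |        12 | 7      12 | 7
13 |        13 |        13 |        13 |
14 |        14 |        14 |        14 |
15 |        15 |        15 |        15 |
16 |        16 |        16 | 3      16 | 3
17 |        17 |        17 |        17 |
18 |        18 |        18 |        18 | 0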
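The construction just described can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function name `stem_and_leaf` is hypothetical, and the sketch assumes that every variate is recorded to one decimal place, as the neutrophil counts are.

```python
from collections import defaultdict

def stem_and_leaf(variates):
    """Return {stem: [leaves]}, leaves kept in their order of entry.

    Assumes one decimal place per variate (hypothetical helper).
    """
    leaves = defaultdict(list)
    for y in variates:
        tenths = round(y * 10)            # 4.9 -> 49
        stem, leaf = divmod(tenths, 10)   # 49 -> stem 4, leaf 9
        leaves[stem].append(leaf)
    return leaves

# The 15 blood neutrophil counts quoted in the text, in their original order.
counts = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
          2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]

display = stem_and_leaf(counts)
for stem in range(min(display), max(display) + 1):
    row = "".join(str(d) for d in display.get(stem, []))
    print(f"{stem:>2} | {row}")

# Sorting the leaves within each stem gives the ordered array.
ordered = [(10 * stem + leaf) / 10
           for stem in sorted(display)
           for leaf in sorted(display[stem])]
median = ordered[len(ordered) // 2]       # n = 15 is odd: the 8th variate
print(ordered)
print(median)
```

Because the leaves are stored in order of entry, the printed rows reproduce the completed array of the illustration, while the sorted version yields the ordered array from which the median is read.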
When the shape of a frequency distribution is of particular interest, we may wish to present the distribution in graphic form when discussing the results. This is generally done by means of frequency diagrams, of which there are two common types. For a distribution of meristic data we employ a bar
FIGURE 2.3
Frequency polygon. Birth weights of 9465 male infants. Chinese third-class patients in Singapore, 1950 and 1951. Data from Millis and Seng (1954).
the variable (in our case, the number of plants per quadrat), and the ordinate represents the frequencies. The important point about such a diagram is that the bars do not touch each other, which indicates that the variable is not continuous. By contrast, continuous variables, such as the frequency distribution of the femur lengths of aphid stem mothers, are graphed as a histogram. In a histogram the width of each bar along the abscissa represents a class interval of the frequency distribution and the bars touch each other to show that the actual limits of the classes are contiguous. The midpoint of the bar corresponds to the class mark. At the bottom of Box 2.1 are shown histograms of the frequency distribution of the aphid data, ungrouped and grouped. The height of each bar represents the frequency of the corresponding class.
To illustrate that histograms are appropriate approximations to the continuous distributions found in nature, we may take a histogram and make the class intervals more narrow, producing more classes. The histogram would then clearly have a closer fit to a continuous distribution. We can continue this process until the class intervals become infinitesimal in width. At this point the histogram becomes the continuous distribution of the variable.
Occasionally the class intervals of a grouped continuous frequency distribution are unequal. For instance, in a frequency distribution of ages we might have more detail on the different ages of young individuals and less accurate identification of the ages of old individuals.
In such cases, the class intervals for the older age groups would be wider, those for the younger age groups, narrower. In representing such data, the bars of the histogram are drawn with different widths.
Figure 2.3 shows another graphical mode of representation of a frequency distribution of a continuous variable (in this case, birth weight in infants). As we shall see later, the shapes of distributions seen in such frequency polygons can reveal much about the biological situations affecting the given variable.
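The advice given earlier in this section about grouping meristic variables (use an odd class interval so that the class mark is a whole number) can be checked with a short Python sketch. The helper names `class_mark` and `group_meristic` and the bristle counts are hypothetical, invented only for illustration.

```python
from collections import Counter

def class_mark(lowest, highest):
    """Midpoint of a class covering the counts lowest..highest inclusive."""
    return (lowest + highest) / 2

def group_meristic(counts, start, width):
    """Tally integer counts into classes of `width` consecutive values.

    The result is keyed by class mark (hypothetical helper).
    """
    tally = Counter()
    for c in counts:
        k = (c - start) // width          # index of the class containing c
        low = start + k * width           # lowest count in that class
        tally[class_mark(low, low + width - 1)] += 1
    return dict(sorted(tally.items()))

print(class_mark(13, 16))   # four counts per class: fractional mark 14.5
print(class_mark(13, 15))   # three counts per class: whole mark 14.0
print(class_mark(13, 17))   # five counts per class: whole mark 15.0

# Hypothetical bristle numbers, not data from the book.
bristles = [13, 14, 14, 17, 18, 21, 22, 22, 23, 30]
print(group_meristic(bristles, start=13, width=3))
```

With a class interval of 3, every class mark printed by `group_meristic` is a whole number of bristles, which is the point of choosing an odd interval.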
2.6 The handling of data
Data must be handled skillfully and expeditiously so that statistics can be practiced successfully. Readers should therefore acquaint themselves with the var
In this book we ignore "pencil-and-paper" shortcut methods for computations, found in earlier textbooks of statistics, since we assume that the student has access to a calculator or a computer. Some statistical methods are very easy to use because special tables exist that provide answers for standard statistical problems; thus, almost no computation is involved. An example is Finney's table, a 2-by-2 contingency table containing small frequencies that is used for the test of independence (Pearson and Hartley, 1958, Table 38). For small problems, Finney's table can be used in place of Fisher's method of finding exact probabilities, which is very tedious. Other statistical techniques are so easy to carry out that no mechanical aids are needed. Some are inherently simple, such as the sign test (Section 10.3). Other methods are only approximate but can often serve the purpose adequately; for example, we may sometimes substitute an easy-to-evaluate median (defined in Section 3.3) for the mean (described in Sections 3.1 and 3.2), which requires computation.
We can use many new types of equipment to perform statistical computations, many more than we could have when Introduction to Biostatistics was first published. The once-standard electrically driven mechanical desk calculator has completely disappeared. Many new electronic devices, from small pocket calculators to larger desk-top computers, have replaced it. Such devices are so diverse that we will not try to survey the field here. Even if we did, the rate of advance in this area would be so rapid that whatever we might say would soon become obsolete.
We cannot really draw the line between the more sophisticated electronic calculators, on the one hand, and digital computers. There is no abrupt increase in capabilities between the more versatile programmable calculators and the simpler microcomputers, just as there is none as we progress from microcomputers to minicomputers and so on up to the large computers that one associates with the central computation center of a large university or research laboratory. All can perform computations automatically and be controlled by a set of detailed instructions prepared by the user. Most of these devices, including programmable small calculators, are adequate for all of the computations described in this book, even for large sets of data.
The material in this book consists of relatively standard statistical computations that are available in many statistical programs. BIOMstat* is a statistical software package that includes most of the statistical methods covered in this book.
The use of modern data processing procedures has one inherent danger. One can all too easily either feed in erroneous data or choose an inappropriate program. Users must select programs carefully to ensure that those programs perform the desired computations, give numerically reliable results, and are as free from error as possible. When using a program for the first time, one should test it using data from textbooks with which one is familiar. Some programs
* For information or to order, contact Exeter Software. Website: http://www.exetersoftware.com. E-mail: sales@exetersoftware.com. These programs are compatible with Windows XP and Vista.
are notorious because the programmer has failed to guard against excessive rounding errors or other problems. Users of a program should carefully check the data being analyzed so that typing errors are not present. In addition, programs should help users identify and remove bad data values and should provide them with transformations so that they can make sure that their data satisfy the assumptions of various analyses.
Exercises
2.1 Round the following numbers to three significant figures: 106.55, 0.06819, 3.0495, 7815.01, 2.9149, and 20.1500. What are the implied limits before and after rounding? Round these same numbers to one decimal place. ANS. For the first value: 107; 106.545-106.555; 106.5-107.5; 106.6
2.2 Differentiate between the following pairs of terms and give an example of each. (a) Statistical and biological populations. (b) Variate and individual. (c) Accuracy and precision (repeatability). (d) Class interval and class mark. (e) Bar diagram and histogram. (f) Abscissa and ordinate.
2.3 Given 200 measurements ranging from 1.32 to 2.95 mm, how would you group them into a frequency distribution? Give class limits as well as class marks.
2.4 Group the following 40 measurements of interorbital width of a sample of domestic pigeons into a frequency distribution and draw its histogram (data from Olson and Miller, 1958). Measurements are in millimeters.
12.2  12.9  11.8  11.9  11.6  11.1  12.3  12.2  11.8  11.8
10.7  11.5  11.3  11.2  11.6  11.9  13.3  11.2  10.5  11.1
12.1  11.9  10.4  10.7  10.8  11.0  11.9  10.2  10.9  11.6
10.8  11.6  10.4  10.7  12.0  12.4  11.7  11.8  11.3  11.1
2.5 How precisely should you measure the wing length of a species of mosquitoes in a study of geographic variation if the smallest specimen has a length of about 2.8 mm and the largest a length of about 3.5 mm?
2.6 Transform the 40 measurements in Exercise 2.4 into common logarithms (use a table or calculator) and make a frequency distribution of these transformed variates. Comment on the resulting change in the pattern of the frequency distribution from that found before.
2.7 For the data of Tables 2.1 and 2.2 identify the individual observations, samples, populations, and variables.
2.8 Make a stem-and-leaf display of the data given in Exercise 2.4.
2.9 The distribution of ages of striped bass captured by hook and line from the East River and the Hudson River during 1980 was reported as follows (Young, 1981):
Age        1    2    3    4    5
Frequency  13   49   96   28   16
Show this distribution in the form of a bar diagram.
CHAPTER 3
Descriptive Statistics
An early and fundamental stage in any science is the descriptive stage. Until phenomena can be accurately described, an analysis of their causes is premature. The question "What?" comes before "How?" Unless we know something about the usual distribution of the sugar content of blood in a population of guinea pigs, as well as its fluctuations from day to day and within days, we shall be unable to ascertain the effect of a given dose of a drug upon this variable. In a sizable sample it would be tedious to obtain our knowledge of the material by contemplating each individual observation. We need some form of summary to permit us to deal with the data in manageable form, as well as to be able to share our findings with others in scientific talks and publications. A histogram or bar diagram of the frequency distribution would be one type of summary. However, for most purposes, a numerical summary is needed to describe concisely, yet accurately, the properties of the observed frequency distribution. Quantities providing such a summary are called descriptive statistics. This chapter will introduce you to some of them and show how they are computed.
Two kinds of descriptive statistics will be discussed in this chapter: statistics of location and statistics of dispersion. The statistics of location (also known as
measures of central tendency) describe the position of a sample along a given dimension representing a variable. For example, after we measure the length of the animals within a sample, we will then want to know whether the animals are closer, say, to 2 cm or to 20 cm. To express a representative value for the sample of observations for the length of the animals we use a statistic of location. But statistics of location will not describe the shape of a frequency distribution. The shape may be long or very narrow, may be humped or U-shaped, may contain two humps, or may be markedly asymmetrical. Quantitative measures of such aspects of frequency distributions are required. To this end we need to define and study the statistics of dispersion.
The arithmetic mean, described in Section 3.1, is undoubtedly the most important single statistic of location, but others (the geometric mean, the harmonic mean, the median, and the mode) are briefly mentioned in Sections 3.2, 3.3, and 3.4. A simple statistic of dispersion (the range) is briefly noted in Section 3.5, and the standard deviation, the most common statistic for describing dispersion, is explained in Section 3.6. Our first encounter with contrasts between sample statistics and population parameters occurs in Section 3.7, in connection with statistics of location and dispersion. In Section 3.8 there is a description of practical methods for computing the mean and standard deviation. The coefficient of variation (a statistic that permits us to compare the relative amount of dispersion in different samples) is explained in the last section (Section 3.9).
The techniques that will be at your disposal after you have mastered this chapter will not be very powerful in solving biological problems, but they will be indispensable tools for any further work in biostatistics. Other descriptive statistics, of both location and dispersion, will be taken up in later chapters.
An important note: We shall first encounter the use of logarithms in this chapter. To avoid confusion, common logarithms have been consistently abbreviated as log, and natural logarithms as ln. Thus, $\log x$ means $\log_{10} x$ and $\ln x$ means $\log_e x$.
3.1 The arithmetic mean
The most common statistic of location is familiar to everyone. It is the arithmetic mean, commonly called the mean or average. The mean is calculated by summing all the individual observations or items of a sample and dividing this sum by the number of items in the sample. For instance, as the result of a gas analysis in a respirometer an investigator obtains the following four readings of oxygen percentages and sums them:
14.9
10.8
12.3
23.3
Sum = 61.3
The investigator calculates the mean oxygen percentage as the sum of the four items divided by the number of items. Thus the average oxygen percentage is
$$\text{Mean} = \frac{61.3}{4} = 15.325\%$$
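As a quick check of this arithmetic, the calculation can be repeated in Python. The variable names are chosen only for this illustration; the four readings are those quoted in the text.

```python
readings = [14.9, 10.8, 12.3, 23.3]   # oxygen percentages from the text

total = sum(readings)                 # sum of the items
n = len(readings)                     # number of items in the sample
mean = total / n                      # arithmetic mean

print(f"Sum = {total:.1f}, n = {n}, mean = {mean:.3f}%")
```

The printed values are formatted to the same number of decimals as in the text, since binary floating-point sums carry tiny rounding errors.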
Calculating a mean presents us with the opportunity for learning statistical symbolism. We have already seen (Section 2.2) that an individual observation is symbolized by $Y_i$, which stands for the $i$th observation in the sample. Four observations could be written symbolically as follows:
$$Y_1, Y_2, Y_3, Y_4$$
We shall define $n$, the sample size, as the number of items in a sample. In this particular instance, the sample size $n$ is 4. Thus, in a large sample, we can symbolize the array from the first to the $n$th item as follows:
$$Y_1, Y_2, \ldots, Y_n$$
The sum of these items is written as follows:
$$\sum_{i=1}^{i=n} Y_i = Y_1 + Y_2 + \cdots + Y_n$$
The capital Greek sigma, $\Sigma$, simply means the sum of the items indicated. The $i = 1$ means that the items should be summed, starting with the first one and ending with the $n$th one, as indicated by the $i = n$ above the $\Sigma$. The subscript and superscript are necessary to indicate how many items should be summed. The "$i =$" in the superscript is usually omitted as superfluous. For instance, if we had wished to sum only the first three items, we would have written $\sum_{i=1}^{3} Y_i$. On the other hand, had we wished to sum all of them except the first one, we would have written $\sum_{i=2}^{n} Y_i$. With some exceptions (which will appear in later chapters), it is desirable to omit subscripts and superscripts, which generally add to the apparent complexity of the formula and, when they are unnecessary, distract the student's attention from the important relations expressed by the formula. Below are seen increasing simplifications of the complete summation notation shown at the extreme left:
$$\sum_{i=1}^{i=n} Y_i = \sum_{i=1}^{n} Y_i = \sum_{i} Y_i = \sum^{n} Y = \sum Y$$
The third symbol might be interpreted as meaning, "Sum the $Y_i$'s over all available values of $i$." This is a frequently used notation, although we shall not employ it in this book. The next, with $n$ as a superscript, tells us to sum $n$ items of $Y$; note that the $i$ subscript of the $Y$ has been dropped as unnecessary. Finally, the simplest notation is shown at the right. It merely says sum the $Y$'s. This will be the form we shall use most frequently: if a summation sign precedes a variable, the summation will be understood to be over $n$ items (all the items in the sample) unless subscripts or superscripts specifically tell us otherwise.
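For readers who find code easier to follow than subscripts, the summation limits can be mirrored in Python. This is an illustrative sketch only; it reuses the four oxygen readings of this section as $Y_1$ to $Y_4$, and the variable names are assumptions.

```python
Y = [14.9, 10.8, 12.3, 23.3]   # Y_1 .. Y_4, the oxygen readings
n = len(Y)

# Python lists are indexed from 0, so the item Y_i is stored at Y[i - 1].
sum_all = sum(Y[i - 1] for i in range(1, n + 1))            # i = 1, ..., n
sum_first_three = sum(Y[i - 1] for i in range(1, 4))        # i = 1, 2, 3
sum_without_first = sum(Y[i - 1] for i in range(2, n + 1))  # i = 2, ..., n

# The simplest notation, "sum the Y's", corresponds to sum(Y).
print(round(sum_all, 1), round(sum_first_three, 1), round(sum_without_first, 1))
```

The explicit index ranges correspond to the fully subscripted notation, and the bare `sum(Y)` corresponds to the simplified form used in most of the book.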
We shall use the symbol $\bar{Y}$ for the arithmetic mean of the variable $Y$. Its formula is written as follows:
$$\bar{Y} = \frac{1}{n} \sum Y = \frac{\sum Y}{n} \qquad (3.1)$$
This formula tells us, "Sum all the ($n$) items and divide the sum by $n$."
The mean of a sample is the center of gravity of the observations in the sample. If you were to draw a histogram of an observed frequency distribution on a sheet of cardboard and then cut out the histogram and lay it flat against a blackboard, supporting it with a pencil beneath, chances are that it would be out of balance, toppling to either the left or the right. If you moved the supporting pencil point to a position about which the histogram would exactly balance, this point of balance would correspond to the arithmetic mean.
We often must compute averages of means or of other statistics that may differ in their reliabilities because they are based on different sample sizes. At other times we may wish the individual items to be averaged to have different weights or amounts of influence. In all such cases we compute a weighted average. A general formula for calculating the weighted average of a set of values $Y_i$ is as follows:
$$\bar{Y}_w = \frac{\sum^{n} w_i Y_i}{\sum^{n} w_i} \qquad (3.2)$$
where $n$ variates, each weighted by a factor $w_i$, are being averaged. The values of $Y_i$ in such cases are unlikely to represent variates. They are more likely to be sample means $\bar{Y}_i$ or some other statistics of different reliabilities.
The simplest case in which this arises is when the $Y_i$ are not individual variates but are means. Thus, if the following three means are based on differing sample sizes, as shown,
$\bar{Y}_i$    $n_i$
3.85    12
5.21    25
4.70    8
their weighted average will be
$$\bar{Y}_w = \frac{(12)(3.85) + (25)(5.21) + (8)(4.70)}{12 + 25 + 8} = \frac{214.05}{45} = 4.76$$
Note that in this example, computation of the weighted mean is exactly equivalent to adding up all the original measurements and dividing the sum by the total number of the measurements. Thus, the sample with 25 observations, having the highest mean, will influence the weighted average in proportion to its size.
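The weighted average of Expression (3.2) can be sketched in Python as follows. The function name `weighted_mean` is hypothetical, and the third sample size of 8 is an inference from the totals quoted in the text (a weighted sum of 214.05 and a total sample size of 45), so it should be read as an assumption.

```python
means = [3.85, 5.21, 4.70]   # sample means quoted in the text
sizes = [12, 25, 8]          # sample sizes used as weights; the 8 is inferred

def weighted_mean(values, weights):
    """Sum of w_i * Y_i divided by the sum of the w_i, as in Expression (3.2)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

numerator = sum(w * y for w, y in zip(sizes, means))
print(round(numerator, 2), sum(sizes))          # weighted sum and total n
print(round(weighted_mean(means, sizes), 2))    # weighted average
```

Using the sample sizes as weights is what makes the result identical to pooling all 45 original measurements and averaging them directly.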
3.2 Other means
We shall see in Chapters 10 and 11 that variables are sometimes transformed into their logarithms or reciprocals. If we calculate the means of such transformed variables and then change the means back into the original scale, these means will not be the same as if we had computed the arithmetic means of the original variables. The resulting means have received special names in statistics. The back-transformed mean of the logarithmically transformed variables is called the geometric mean. It is computed as follows:
$$GM_Y = \operatorname{antilog} \frac{1}{n} \sum \log Y \qquad (3.3)$$
which indicates that the geometric mean $GM_Y$ is the antilogarithm of the mean of the logarithms of variable $Y$. Since addition of logarithms is equivalent to multiplication of their antilogarithms, there is another way of representing this quantity; it is
$$GM_Y = \sqrt[n]{Y_1 Y_2 Y_3 \cdots Y_n} \qquad (3.4)$$
The geometric mean permits us to become familiar with another operator symbol: capital pi, $\Pi$, which may be read as "product." Just as $\Sigma$ symbolizes summation of the items that follow it, so $\Pi$ symbolizes the multiplication of the items that follow it. The subscripts and superscripts have exactly the same meaning as in the summation case. Thus, Expression (3.4) for the geometric mean can be rewritten more compactly as follows:
$$GM_Y = \sqrt[n]{\prod_{i=1}^{n} Y_i} \qquad (3.4a)$$
The computation of the geometric mean by Expression (3.4a) is quite tedious. In practice, the geometric mean has to be computed by transforming the variates into logarithms.
The reciprocal of the arithmetic mean of reciprocals is called the harmonic mean. If we symbolize it by $H_Y$, the formula for the harmonic mean can be written in concise form (without subscripts and superscripts) as
$$\frac{1}{H_Y} = \frac{1}{n} \sum \frac{1}{Y}$$
You may wish to convince yourself that the geometric mean and the harmonic mean of the four oxygen percentages are 14.65% and 14.09%, respectively. Unless the individual items do not vary, the geometric mean is always less than the arithmetic mean, and the harmonic mean is always less than the geometric mean.
Some beginners in statistics have difficulty in accepting the fact that measures of location or central tendency other than the arithmetic mean are permissible or even desirable. They feel that the arithmetic mean is the "logical"
average, and that any other mean would be a distortion. This whole problem relates to the proper scale of measurement for representing data; this scale is not always the linear scale familiar to everyone, but is sometimes by preference a logarithmic or reciprocal scale. If you have doubts about this question, we shall try to allay them in Chapter 10, where we discuss the reasons for transforming variables.
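The geometric and harmonic means quoted in this section for the four oxygen percentages can be verified with a short Python sketch. It follows the logarithmic route of Expression (3.3) and the verbal definition of the harmonic mean; the variable names are chosen only for illustration.

```python
import math

readings = [14.9, 10.8, 12.3, 23.3]   # oxygen percentages from Section 3.1
n = len(readings)

# Geometric mean: antilog of the mean of the common logarithms.
geometric = 10 ** (sum(math.log10(y) for y in readings) / n)

# Harmonic mean: reciprocal of the arithmetic mean of the reciprocals.
harmonic = 1 / (sum(1 / y for y in readings) / n)

arithmetic = sum(readings) / n

print(f"arithmetic = {arithmetic:.3f}")
print(f"geometric  = {geometric:.3f}")
print(f"harmonic   = {harmonic:.3f}")
```

For these readings the three values come out in the order harmonic < geometric < arithmetic, as the text states for any sample whose items vary.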
3.3 The median
The median is a statistic of location occasionally useful in biological research. It is defined as that value of the variable (in an ordered array) that has an equal number of items on either side of it. Thus, the median divides a frequency distribution into two halves. In the following sample of five measurements,
14, 15, 16, 19, 23
the median is 16, since the third observation has an equal number of observations on both sides of it. We can visualize the median easily if we think of an array from largest to smallest, for example, a row of men lined up by their heights. The median individual will then be that man having an equal number of men on his right and left sides. His height will be the median height of the sample considered. This quantity is easily evaluated from a sample array with an odd number of individuals. When the number in the sample is even, the median is conventionally calculated as the midpoint between the $(n/2)$th and the $[(n/2) + 1]$th variate. Thus, for the sample of four measurements
14, 15, 16, 19
the median would be the midpoint between the second and third items, or 15.5.
Whenever any one value of a variate occurs more than once, problems may develop in locating the median. Computation of the median item becomes more involved because all the members of a given class in which the median item is located will have the same class mark. The median then is the $(n/2)$th variate in the frequency distribution. It is usually computed as that point between the class limits of the median class where the median individual would be located (assuming the individuals in the class were evenly distributed).
The median is just one of a family of statistics dividing a frequency distribution into equal areas.
It divides the distribution into two halves. The three quartiles cut the distribution at the 25, 50, and 75% points, that is, at points dividing the distribution into first, second, third, and fourth quarters by area (and frequencies). The second quartile is, of course, the median. (There are also quintiles, deciles, and percentiles, dividing the distribution into 5, 10, and 100 equal portions, respectively.)
Medians are most often used for distributions that do not conform to the standard probability models, so that nonparametric methods (see Chapter 10) must be used. Sometimes the median is a more representative measure of location than the arithmetic mean. Such instances almost always involve asymmetric
distributions. An often quoted example from economics would be a suitable measure of location for the "typical" salary of an employee of a corporation. The very high salaries of the few senior executives would shift the arithmetic mean, the center of gravity, toward a completely unrepresentative value. The median, on the other hand, would be little affected by a few high salaries; it would give the particular point on the salary scale above which lie 50% of the salaries in the corporation, the other half being lower than this figure.
In biology an example of the preferred application of a median over the arithmetic mean may be in populations showing skewed distribution, such as weights. Thus a median weight of American males 50 years old may be a more meaningful statistic than the average weight. The median is also of importance in cases where it may be difficult or impossible to obtain and measure all the items of a sample. For example, suppose an animal behaviorist is studying the time it takes for a sample of animals to perform a certain behavioral step. The variable he is measuring is the time from the beginning of the experiment until each individual has performed. What he wants to obtain is an average time of performance. Such an average time, however, can be calculated only after records have been obtained on all the individuals. It may take a long time for the slowest animals to complete their performance, longer than the observer wishes to spend. (Some of them may never respond appropriately, making the computation of a mean impossible.) Therefore, a convenient statistic of location to describe these animals may be the median time of performance.
Thus, so long as the observer knows what the total sample size is, he need not have measurements for the right-hand tail of his distribution. Similar examples would be the responses to a drug or poison in a group of individuals (the median lethal or effective dose, $\mathrm{LD}_{50}$ or $\mathrm{ED}_{50}$) or the median time for a mutation to appear in a number of lines of a species.
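The point that the median needs only the total sample size and the lower part of the distribution can be sketched in Python. This is a minimal illustration, not a standard routine: the function name, the response times, and the assumption that every unmeasured animal is slower than every measured one are all introduced here for the example.

```python
def median_with_censored_tail(observed_times, n_total):
    """Median of n_total items when only the smallest ones were measured.

    observed_times: measurements for the individuals that have responded
    n_total: total sample size, including individuals not yet measured
    Assumption: every unmeasured item exceeds every measured one.
    """
    ordered = sorted(observed_times)
    k = len(ordered)
    if n_total < k:
        raise ValueError("n_total cannot be smaller than the number observed")
    if n_total % 2 == 1:
        mid = n_total // 2              # zero-based index of the middle item
        if mid >= k:
            raise ValueError("the median item has not been observed yet")
        return ordered[mid]
    upper = n_total // 2                # middle items are upper - 1 and upper
    if upper >= k:
        raise ValueError("the median items have not been observed yet")
    return (ordered[upper - 1] + ordered[upper]) / 2


# Hypothetical times (seconds) for 6 of 9 animals; three have not responded.
times = [12.0, 7.5, 9.1, 15.3, 8.2, 11.4]
print(median_with_censored_tail(times, n_total=9))
```

With nine animals the median is the fifth ordered time, which is already known even though the three slowest animals were never timed. A mean could not be computed from the same records.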
3.4 The mode
When seen on a frequency distribution, the mode is the value of the variable at which the curve peaks. In grouped frequency distributions the mode as a point has little meaning. It usually suffices to identify the modal class. In biology, the mode does not have many applications. Distributions having two peaks (equal or unequal in height) are called bimodal; those with more than two peaks are multimodal. In those rare distributions that are U-shaped, we refer to the low point at the middle of the distribution as an antimode.
In evaluating the relative merits of the arithmetic mean, the median, and the mode, a number of considerations have to be kept in mind. The mean is generally preferred in statistics, since it has a smaller standard error than other statistics of location (see Section 6.2), it is easier to work with mathematically, and it has an additional desirable property (explained in Section 6.1): it will tend to be normally distributed even if the original data are not. The mean is markedly affected by outlying observations; the median and mode are not. The mean is generally more sensitive to changes in the shape of a frequency distribution, and if it is desired to have a statistic reflecting such changes, the mean may be preferred.

FIGURE 3.1
[Figure omitted: a frequency distribution, n = 120; horizontal axis: percent butterfat; vertical axis: frequency.]

In symmetrical, unimodal distributions the mean, the median, and the mode are all identical. A prime example of this is the well-known normal distribution of Chapter 5. In a typical asymmetrical distribution, such as the one shown in Figure 3.1, the relative positions of the mode, median, and mean are generally these: the mean is closest to the drawn-out tail of the distribution, the mode is farthest, and the median is between these. An easy way to remember this sequence is to recall that they occur in alphabetical order from the longer tail of the distribution.
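Both claims, the ordering of the three statistics in a skewed sample and the sensitivity of the mean to an outlying observation, can be checked on a small example. The sample below is hypothetical and was chosen only because it has a long right-hand tail; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical right-skewed sample (long tail toward large values).
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mode = statistics.mode(sample)        # most frequent value
median = statistics.median(sample)    # middle of the ordered values
mean = statistics.mean(sample)        # center of gravity
print(mode, median, mean)

# Replace the largest value with an extreme outlying observation.
with_outlier = sample[:-1] + [140]
print(statistics.mode(with_outlier),
      statistics.median(with_outlier),
      statistics.mean(with_outlier))
```

Reading from the long tail inward, the statistics fall in the order mean, median, mode. The outlier moves the mean a long way but leaves the median and the mode where they were.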
3.5 The range
We now turn to measures of dispersion. Figure 3.2 demonstrates that radically different-looking distributions may possess the identical arithmetic mean. It is [...]

FIGURE 3.2
[Figure omitted: three frequency distributions possessing the identical arithmetic mean; vertical axes: frequency.]

[...] the range of the four oxygen percentages listed earlier (Section 3.1) is

$$\text{Range} = 23.3 - 10.8 = 12.5\%$$

and the range of the aphid femur lengths (Box 2.1) is

$$\text{Range} = 4.7 - 3.3 = 1.4 \text{ units of } 0.1 \text{ mm}$$

Since the range is a measure of the span of the variates along the scale of the variable, it is in the same units as the original measurements. The range is clearly affected by even a single outlying value and for this reason is only a rough estimate of the dispersion of all the items in the sample.
3.6 The standard deviation
We desire that a measure of dispersion take all items of a distribution into consideration, weighting each item by its distance from the center of the distribution. We shall now try to construct such a statistic. In Table 3.1 we show a sample of 15 blood neutrophil counts from patients with tumors. Column (1) shows the variates in the order in which they were reported. The computation of the mean is shown below the table. The mean neutrophil count turns out to be 7.713. The distance of each variate from the mean is computed as the following deviation:

$$y = Y - \bar{Y}$$

Each individual deviation, or deviate, is by convention computed as the individual observation minus the mean, $Y - \bar{Y}$, rather than the reverse, $\bar{Y} - Y$. Deviates are symbolized by lowercase letters corresponding to the capital letters of the variables. Column (2) in Table 3.1 gives the deviates computed in this manner.

TABLE 3.1
The standard deviation. Long method, not recommended for hand or calculator computations but shown here to illustrate the meaning of the standard deviation. The data are blood neutrophil counts (divided by 1000) per microliter, in 15 patients with nonhematological tumors.

            (1)          (2)            (3)
             Y        y = Y - Ȳ          y²

            4.9         -2.81          7.9148
            4.6         -3.11          9.6928
            5.5         -2.21          4.8988
            9.1          1.39          1.9228
           16.3          8.59         73.7308
           12.7          4.99         24.8668
            6.4         -1.31          1.7248
            7.1         -0.61          0.3762
            2.3         -5.41         29.3042
            3.6         -4.11         16.9195
           18.0         10.29        105.8155
            3.7         -4.01         16.1068
            7.3         -0.41          0.1708
            4.4         -3.31         10.9782
            9.8          2.09          4.3542

  Total   115.7          0.05        308.7770

Mean:

$$\bar{Y} = \frac{\sum Y}{n} = \frac{115.7}{15} = 7.713$$

We now wish to calculate an average deviation that will sum all the deviates and divide them by the number of deviates in the sample. But note that when we sum our deviates, negative and positive deviates cancel out, as is shown by the sum at the bottom of column (2); this sum appears to be unequal to zero only because of a rounding error. Deviations from the arithmetic mean always sum to zero because the mean is the center of gravity. Consequently, an average based on the sum of deviations would also always equal zero. You are urged to study Appendix A1.1, which demonstrates that the sum of deviations around the mean of a sample is equal to zero.

Squaring the deviates gives us column (3) of Table 3.1 and enables us to reach a result other than zero. (Squaring the deviates also holds other mathematical advantages, which we shall take up in Sections 7.5 and 11.3.) The sum of the squared deviates (in this case, 308.7770) is a very important quantity in statistics. It is called the sum of squares and is identified symbolically as $\sum y^2$. Another common symbol for the sum of squares is SS.

The next step is to obtain the average of the squared deviations. The resulting quantity is known as the variance, or the mean square:

$$\text{Variance} = \frac{\sum y^2}{n} = \frac{308.7770}{15} = 20.5851$$
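The long method of Table 3.1 translates directly into a few lines of Python. The sketch below uses the 15 neutrophil counts from column (1); the variable names are our own. Because the program carries the mean at full precision instead of rounding the deviates to two decimals, the last digits can differ slightly from the printed table.

```python
# Neutrophil counts (thousands per microliter), column (1) of Table 3.1.
Y = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
     2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]

n = len(Y)
mean = sum(Y) / n                           # arithmetic mean
deviates = [value - mean for value in Y]    # y = Y - mean, column (2)
sum_of_deviates = sum(deviates)             # zero apart from floating-point error
sum_of_squares = sum(d * d for d in deviates)   # total of column (3)
variance_n = sum_of_squares / n             # divisor n, as in this section

print(round(mean, 3), round(sum_of_squares, 4), round(variance_n, 4))
```

The sum of the deviates comes out as zero to within floating-point error, which is the computational counterpart of the rounding error noted for column (2).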
The variance is a measure of fundamental importance in statistics, and we shall employ it throughout this book. At the moment, we need only remember that because of the squaring of the deviations, the variance is expressed in squared units. To undo the effect of the squaring, we now take the positive square root of the variance and obtain the standard deviation:

$$\text{Standard deviation} = +\sqrt{\frac{\sum y^2}{n}} = \sqrt{20.5851} = 4.537$$

Thus, standard deviation is again expressed in the original units of measurement, since it is a square root of the squared units of the variance.

An important note: The technique just learned and illustrated in Table 3.1 is not the simplest for direct computation of a variance and standard deviation. However, it is often used in computer programs, where accuracy of computations is an important consideration. Alternative and simpler computational methods are given in Section 3.8.

The observant reader may have noticed that we have avoided assigning any symbol to either the variance or the standard deviation. We shall explain why in the next section.

3.7 Sample statistics and parameters
Up to now we have calculated statistics from samples without giving too much thought to what these statistics represent. When correctly calculated, a mean and standard deviation will always be absolutely true measures of location and dispersion for the samples on which they are based. Thus, the true mean of the four oxygen percentage readings in Section 3.1 is 15.325%. The standard deviation of the 15 neutrophil counts is 4.537. However, only rarely in biology (or
[...] only as descriptive summaries of the samples we have studied. Almost always we are interested in the populations from which the samples have been taken. What we want to know is not the mean of the particular four oxygen percentages, but rather the true oxygen percentage of the universe of readings from which the four readings have been sampled. Similarly, we would like to know the true mean neutrophil count of the population of patients with nonhematological tumors, not merely the mean of the 15 individuals measured. When studying dispersion we generally wish to learn the true standard deviations of the populations and not those of the samples. These population statistics, however, are unknown and (generally speaking) are unknowable. Who would be able to collect all the patients with this particular disease and measure their neutrophil counts? Thus we need to use sample statistics as estimators of population statistics or parameters.

It is conventional in statistics to use Greek letters for population parameters and Roman letters for sample statistics. Thus, the sample mean $\bar{Y}$ estimates $\mu$, the parametric mean of the population. Similarly, a sample variance, symbolized by $s^2$, estimates a parametric variance, symbolized by $\sigma^2$. Such estimators should be unbiased. By this we mean that samples (regardless of the sample size) taken from a population with a known parameter should give sample statistics that, when averaged, will give the parametric value. An estimator that does not do so is called biased. The sample mean $\bar{Y}$ is an unbiased estimator of the parametric mean $\mu$.
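What "unbiased" means can be checked by brute force on a population small enough to enumerate. The sketch below rests on assumptions of our own: a hypothetical population of three items, samples of size 2, and sampling with replacement so that all nine ordered samples are equally likely. It averages the sample mean over every possible sample, and does the same for the sum of squares divided by $n$ and by $n - 1$, the two variance estimators compared in the text that follows.

```python
from itertools import product

# Hypothetical population of three items, so the parameters are known exactly.
population = [1, 2, 3]
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N    # parametric variance

n = 2   # sample size
# All equally likely samples of size n, drawn with replacement.
samples = list(product(population, repeat=n))


def mean(values):
    return sum(values) / len(values)


def sum_of_squares(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


avg_sample_mean = mean([mean(s) for s in samples])
avg_var_div_n = mean([sum_of_squares(s) / n for s in samples])
avg_var_div_n_minus_1 = mean([sum_of_squares(s) / (n - 1) for s in samples])

print(mu, avg_sample_mean)
print(sigma2, avg_var_div_n, avg_var_div_n_minus_1)
```

In this enumeration the averaged sample means reproduce $\mu$. The averaged variances reproduce $\sigma^2$ only when the divisor is $n - 1$; with the divisor $n$ the average falls below $\sigma^2$.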
However, the sample variance as computed in Section 3.6 is not unbiased. On the average, it will underestimate the magnitude of the population variance $\sigma^2$. To overcome this bias, mathematical statisticians have shown that when sums of squares are divided by $n - 1$ rather than by $n$ the resulting sample variances will be unbiased estimators of the population variance. For this reason, it is customary to compute variances by dividing the sum of squares by $n - 1$. The formula for the standard deviation is therefore customarily given as follows:

$$s = +\sqrt{\frac{\sum y^2}{n - 1}} \tag{3.6}$$

In the neutrophil-count data the standard deviation would thus be computed as

$$s = \sqrt{\frac{308.7770}{14}} = \sqrt{22.0555} = 4.696$$

We note that this value is slightly larger than our previous estimate of 4.537. Of course, the greater the sample size, the less difference there will be between division by $n$ and by $n - 1$. However, regardless of sample size, it is good practice to divide a sum of squares by $n - 1$ when computing a variance or standard deviation. It may be assumed that when the symbol $s^2$ is encountered, it refers to a variance obtained by division of the sum of squares by the degrees of freedom, as the quantity $n - 1$ is generally referred to.

Division of the sum of squares by $n$ is appropriate only when the interest of the investigator is limited to the sample at hand and to its variance and standard deviation as descriptive statistics of the sample. This would be in contrast to using these as estimates of the population parameters. There are also the rare cases in which the investigator possesses data on the entire population; in such cases division by $n$ is perfectly justified, because then the investigator is not estimating a parameter but is in fact evaluating it. Thus the variance of the wing lengths of all adult whooping cranes would be a parametric value; similarly, if the heights of all winners of the Nobel Prize in physics had been measured, their variance would be a parameter since it would be based on the entire population.

3.8 Practical methods for computing mean and standard deviation
Three steps are necessary for computing the standard deviation: (1) find $\sum y^2$, the sum of squares; (2) divide by $n - 1$ to give the variance; and (3) take the square root of the variance to obtain the standard deviation. The procedure used to compute the sum of squares in Section 3.6 can be expressed by the following formula:

$$\sum y^2 = \sum (Y - \bar{Y})^2 \tag{3.7}$$
This formulation explains most clearly the meaning of the sum of squares, although it may be inconvenient for computation by hand or calculator, since one must first compute the mean before one can square and sum the deviations. A quicker computational formula for this quantity is

$$\sum y^2 = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n} \tag{3.8}$$
Let us see exactly what this formula represents. The first term on the right side of the equation, $\sum Y^2$, is the sum of all individual $Y$'s, each squared, as follows:

$$\sum Y^2 = Y_1^2 + Y_2^2 + Y_3^2 + \cdots + Y_n^2$$
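The two routes to the sum of squares, Expressions (3.7) and (3.8), can be compared numerically. In the sketch below the function names are ours, the first data set is the neutrophil counts of Table 3.1, and the second is a hypothetical set with a very large mean and a tiny spread, chosen to show what can happen when the two terms of Expression (3.8) are both large and nearly equal and the arithmetic carries a limited number of significant figures (about 16 for Python floats).

```python
def ss_deviations(Y):
    """Sum of squares from deviations about the mean, Expression (3.7)."""
    mean = sum(Y) / len(Y)
    return sum((value - mean) ** 2 for value in Y)


def ss_shortcut(Y):
    """Sum of Y squared minus (sum of Y) squared over n, Expression (3.8)."""
    return sum(value * value for value in Y) - sum(Y) ** 2 / len(Y)


neutrophils = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
               2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]
print(ss_deviations(neutrophils), ss_shortcut(neutrophils))

# Hypothetical data: large mean, tiny spread. The true sum of squares is 10,
# the same as for the values 1, 2, 3, 4, 5.
large = [100000001.0, 100000002.0, 100000003.0, 100000004.0, 100000005.0]
print(ss_deviations(large), ss_shortcut(large))
```

For the neutrophil counts the two expressions agree to many decimal places. For the second data set the deviation method returns exactly 10, while the shortcut does not, because the squared values exceed the precision that the floating-point numbers can hold.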
When referred to by name, $\sum Y^2$ should be called the "sum of Y squared" and should be carefully distinguished from $\sum y^2$, "the sum of squares of Y." These names are unfortunate, but they are too well established to think of amending them. The other quantity in Expression (3.8) is $\left(\sum Y\right)^2/n$. It is often called the correction term (CT). The numerator of this term is the square of the sum of the $Y$'s; that is, all the $Y$'s are first summed, and this sum is then squared. In general, this quantity is different from $\sum Y^2$, which first squares the $Y$'s and then sums them. These two terms are identical only if all the $Y$'s are equal. If you are not certain about this, you can convince yourself of this fact by calculating these two quantities for a few numbers.

The disadvantage of Expression (3.8) is that the quantities $\sum Y^2$ and $\left(\sum Y\right)^2/n$ may both be quite large, so that accuracy may be lost in computing their difference unless one takes the precaution of carrying sufficient significant figures.

Why is Expression (3.8) identical with Expression (3.7)? The proof of this identity is very simple and is given in Appendix A1.2. You are urged to work through it to build up your confidence in handling statistical symbols and formulas.

It is sometimes possible to simplify computations by recoding variates into simpler form. We shall use the term additive coding for the addition or subtraction of a constant (since subtraction is only addition of a negative number). We shall similarly use multiplicative coding to refer to the multiplication or division by a constant (since division is multiplication by the reciprocal of the divisor). We shall use the term combination coding to mean the application of both additive and multiplicative coding to the same set of data. In Appendix A1.3 we examine the consequences of the three types of coding in the computation of means, variances, and standard deviations.

For the case of means, the formula for combination coding and decoding is the most generally applicable one. If the coded variable is $Y_c = D(Y + C)$, then

$$\bar{Y} = \frac{\bar{Y}_c}{D} - C$$
where $C$ is an additive code and $D$ is a multiplicative code.

On considering the effects of coding variates on the values of variances and standard deviations, we find that additive codes have no effect on the sums of squares, variances, or standard deviations. The mathematical proof is given in Appendix A1.3, but we can see this intuitively, because an additive code has no effect on the distance of an item from its mean. The distance from an item of 15 to its mean of 10 would be 5. If we were to code the variates by subtracting a constant of 10, the item would now be 5 and the mean zero. The difference between them would still be 5. Thus, if only additive coding is employed, the only statistic in need of decoding is the mean. But multiplicative coding does have an effect on sums of squares, variances, and standard deviations. The standard deviations have to be divided by the multiplicative code, just as had to be done for the mean. However, the sums of squares or variances have to be divided by the multiplicative codes squared, because they are squared terms, and the multiplicative factor becomes squared during the operations. In combination coding the additive code can be ignored.

When the data are unordered, the computation of the mean and standard deviation proceeds as in Box 3.1, which is based on the unordered neutrophil-count data shown in Table 3.1. We chose not to apply coding to these data, since it would not have simplified the computations appreciably.

When the data are arrayed in a frequency distribution, the computations can be made much simpler.
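The decoding rules for combination coding can be verified on a small example. The variates and the codes $C = -10$ and $D = 0.5$ below are hypothetical, chosen only for illustration; `statistics.stdev` is the standard-library function that divides the sum of squares by $n - 1$.

```python
import statistics


def code(values, C, D):
    """Combination coding: Yc = D * (Y + C)."""
    return [D * (v + C) for v in values]


# Hypothetical variates; C and D are chosen for illustration only.
Y = [10.0, 12.0, 15.0, 11.0, 12.0]
C, D = -10.0, 0.5
Yc = code(Y, C, D)

mean_c = statistics.mean(Yc)
s_c = statistics.stdev(Yc)            # divisor n - 1

decoded_mean = mean_c / D - C         # undo both codes for the mean
decoded_s = s_c / D                   # the additive code is ignored for s
decoded_var = s_c ** 2 / D ** 2       # variances are divided by D squared

print(decoded_mean, statistics.mean(Y))
print(decoded_s, statistics.stdev(Y))
print(decoded_var, statistics.variance(Y))
```

Each decoded statistic matches the one computed directly from the uncoded variates, and the additive code enters only the decoding of the mean.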
When computing the statistics, you can often avoid the need for manual entry of large numbers of individual variates if you first set up a frequency distribution. Sometimes the data will come to you already in the form of a frequency distribution, having been grouped by the researcher. The computation of $\bar{Y}$ and $s$ from a frequency distribution is illustrated in Box 3.2. The data are the birth weights of male Chinese children, first encountered in Figure 2.3. The calculation is simplified by coding to remove the awkward class marks. This is done by subtracting 59.5, the lowest class mark of the array.
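BOX 3.1
Calculation of $\bar{Y}$ and $s$ from unordered data.
Neutrophil counts, unordered as shown in Table 3.1.

Computation

$$
\begin{aligned}
n &= 15 \\
\sum Y &= 115.7 \\
\bar{Y} &= \frac{1}{n}\sum Y = 7.713 \\
\sum Y^2 &= 1201.21 \\
\sum y^2 &= \sum Y^2 - \frac{\left(\sum Y\right)^2}{n} = 308.7773 \\
s^2 &= \frac{\sum y^2}{n - 1} = \frac{308.7773}{14} = 22.056 \\
s &= \sqrt{22.056} = 4.696
\end{aligned}
$$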
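The steps of Box 3.1 can be followed in Python and cross-checked against the standard library. The sketch uses Expression (3.8) for the sum of squares; `statistics.stdev` divides by $n - 1$ and `statistics.pstdev` divides by $n$, so the two should correspond to the values 4.696 and 4.537 discussed in Section 3.7.

```python
import math
import statistics

# Neutrophil counts, unordered as in Table 3.1.
Y = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
     2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]

n = len(Y)
sum_Y = sum(Y)
sum_Y2 = sum(v * v for v in Y)
mean = sum_Y / n
ss = sum_Y2 - sum_Y ** 2 / n          # sum of squares via the correction term
s2 = ss / (n - 1)                     # variance, divisor n - 1
s = math.sqrt(s2)                     # standard deviation

print(n, round(sum_Y, 1), round(mean, 3), round(sum_Y2, 2), round(s, 3))

# Cross-check against the standard library.
print(round(statistics.stdev(Y), 3))      # divisor n - 1
print(round(statistics.pstdev(Y), 3))     # divisor n
```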
The resulting class marks are values such as 0, 8, 16, 24, 32, and so on. They are then divided by 8, which changes them to 0, 1, 2, 3, 4, and so on, which is the desired format. The details of the computation can be learned from the box.

When checking the results of calculations, it is frequently useful to have an approximate method for estimating statistics so that gross errors in computation can be detected. A simple method for estimating the mean is to average the largest and smallest observation to obtain the so-called midrange. For the neutrophil counts of Table 3.1, this value is (2.3 + 18.0)/2 = 10.15 (not a very good estimate). Standard deviations can be estimated from ranges by appropriate division of the range, as follows:
[Table omitted: divisors of the range for samples of various sizes.]
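These rough checks are easy to script. In the sketch below the function names are ours, and the divisor is left as a parameter because it depends on the sample size; the value 4 is the one the text applies to the 15 neutrophil counts.

```python
def midrange(values):
    """Average of the smallest and largest observations."""
    return (min(values) + max(values)) / 2


def sd_from_range(values, divisor):
    """Rough standard deviation: the range divided by a chosen divisor."""
    return (max(values) - min(values)) / divisor


neutrophils = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
               2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]

print(midrange(neutrophils))                     # rough estimate of the mean
print(sd_from_range(neutrophils, divisor=4))     # rough estimate of s
```

Such estimates are meant only to catch gross errors; they are not substitutes for the computed mean and standard deviation.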
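BOX 3.2
Calculation of $\bar{Y}$, $s$, and $V$ from a frequency distribution.
Birth weights of male Chinese in ounces.

       (1)            (2)            (3)
    Class mark                   Coded class mark
        Y              f              Y_c

       59.5             2              0
       67.5             6              1
       75.5            39              2
       83.5           385              3
       91.5           888              4
       99.5          1729              5
      107.5          2240              6
      115.5          2007              7
      123.5          1233              8
      131.5           641              9
      139.5           201             10
      147.5            74             11
      155.5            14             12
      163.5             5             13
      171.5             1             14

                     9465 = n

Computation

$$
\begin{aligned}
\sum f y_c^2 &= \sum f Y_c^2 - CT = 27{,}327.450 \\
s_c^2 &= \frac{\sum f y_c^2}{n - 1} = 2.888 \\
s_c &= 1.6991
\end{aligned}
$$

Coding and decoding

Code: $Y_c = \dfrac{Y - 59.5}{8}$

To decode $\bar{Y}_c$: $\bar{Y} = 8\bar{Y}_c + 59.5 = 109.9$ oz

To decode $s_c$: $s = 8 s_c = 13.593$ oz

$$V = \frac{s}{\bar{Y}} \times 100 = \frac{13.593}{109.9} \times 100 = 12.369\%$$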
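The computation of Box 3.2 can be sketched in Python using the same coding, $Y_c = (Y - 59.5)/8$. The variable names are ours. Because the program carries full precision through the square root, the standard deviation and coefficient of variation may differ from the printed box in the last digit or two.

```python
import math

class_marks = [59.5 + 8 * i for i in range(15)]      # 59.5, 67.5, ..., 171.5 oz
freqs = [2, 6, 39, 385, 888, 1729, 2240, 2007,
         1233, 641, 201, 74, 14, 5, 1]

# Code the class marks: subtract 59.5, then divide by 8, giving 0, 1, ..., 14.
coded = [(y - 59.5) / 8 for y in class_marks]

n = sum(freqs)
sum_fY = sum(f * yc for f, yc in zip(freqs, coded))
sum_fY2 = sum(f * yc * yc for f, yc in zip(freqs, coded))
ss_coded = sum_fY2 - sum_fY ** 2 / n          # coded sum of squares
s_coded = math.sqrt(ss_coded / (n - 1))       # coded standard deviation

mean = 8 * (sum_fY / n) + 59.5                # decode the mean
s = 8 * s_coded                               # decode the standard deviation
V = 100 * s / mean                            # coefficient of variation (%)

print(n, round(mean, 1), round(s, 2), round(V, 2))
```

Note that only the mean needs both codes undone; the standard deviation is decoded by the multiplicative code alone, as Section 3.8 explains.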
The range of the neutrophil counts is 15.7. When this value is divided by 4, we get an estimate for the standard deviation of 3.925, which compares with the calculated value of 4.696 in Box 3.1. However, when we estimate mean and standard deviation of the aphid femur lengths of Box 2.1 in this manner, we obtain 4.0 and 0.35, respectively. These are good estimates of the actual values of 4.004 and 0.3656, the sample mean and standard deviation.

3.9 The coefficient of variation
Having obtained the standard deviation as a measure of the amount of variation in the data, you may be led to ask, "Now what?" At this stage in our comprehension of statistical theory, nothing really useful comes of the computations we have carried out. However, the skills just learned are basic to all later statistical work. So far, the only use that we might have for the standard deviation is as an estimate of the amount of variation in a population. Thus, we may wish to compare the magnitudes of the standard deviations of similar populations and see whether population A is more or less variable than population B.

When populations differ appreciably in their means, the direct comparison of their variances or standard deviations is less useful, since larger organisms usually vary more than smaller ones. For instance, the standard deviation of the tail lengths of elephants is obviously much greater than the entire tail length of a mouse. To compare the relative amounts of variation in populations having different means, the coefficient of variation, symbolized by $V$ (or occasionally $CV$), has been developed. This is simply the standard deviation expressed as a percentage of the mean. Its formula is

$$V = \frac{s \times 100}{\bar{Y}}$$
For example, the coefficient of variation of the birth weights in Box 3.2 is 12.37%, as shown at the bottom of that box. The coefficient of variation is independent of the unit of measurement and is expressed as a percentage.

Coefficients of variation are used when one wishes to compare the variation of two populations without considering the magnitude of their means. (It is probably of little interest to discover whether the birth weights of the Chinese children are more or less variable than the femur lengths of the aphid stem mothers. However, we can calculate $V$ for the latter as $(0.3656 \times 100)/4.004 = 9.13\%$, which would suggest that the birth weights are more variable.) Often, we shall wish to test whether a given biological sample is more variable for one character than for another. Thus, for a sample of rats, is body weight more variable than blood sugar content? A second, frequent type of comparison, especially in systematics, is among different populations for the same character. Thus, we may have measured wing length in samples of birds from several localities. We wish to know whether any one of these populations is more variable than the others. An answer to this question can be obtained by examining the coefficients of variation of wing length in these samples.
Exercises

3.1 Find $\bar{Y}$, $s$, $V$, and the median for the following data (mg of glycine per mg of creatinine in the urine of 37 chimpanzees; from Gartler, Firschein, and Dobzhansky, 1956). ANS. $\bar{Y} = 0.115$, $s = 0.10404$.

    .008  .025  .018  .036  .060  .155  .056  .043  .070  .370
    .055  .100  .050  .019  .135  .120  .080  .100  .052  .110
    .110  .100  .077  .100  .110  .116  .026  .350  .120  .440
    .100  .133  .300  .300  .100  .011  .100

3.2 Find the mean, standard deviation, and coefficient of variation for the pigeon data given in Exercise 2.4. Group the data into ten classes, recompute $\bar{Y}$ and $s$, and compare them with the results obtained from ungrouped data. Compute the median for the grouped data.

3.3 The following are percentages of butterfat from 120 registered three-year-old Ayrshire cows selected at random from a Canadian stock record book.
(a) Calculate $\bar{Y}$, $s$, and $V$ directly from the data.
(b) Group the data in a frequency distribution and again calculate $\bar{Y}$, $s$, and $V$. Compare the results with those of (a). How much precision has been lost by grouping? Also calculate the median.

    4.32  3.96  3.74  4.10  4.33  4.23  4.28  4.15  4.49  4.67
    4.60  4.00  4.71  4.38  4.06  3.97  4.31  4.30  4.51  4.24
    3.94  4.17  4.06  3.93  4.38  4.22  3.95  4.35  4.09  4.28
    4.24  4.48  4.42  4.00  4.16  4.67  4.03  4.29  4.05  4.11
    4.38  4.46  3.96  4.16  4.08  3.97  3.70  4.17  3.86  4.05
    3.89  3.82  3.89  4.20  4.14  3.47  4.38  3.91  4.34  3.98
    4.29  3.89  4.20  4.33  3.88  3.74  4.42  4.27  3.97  4.24
    3.72  4.82  3.66  3.77  3.66  4.20  3.83  3.97  4.36  4.05
    4.58  3.70  4.07  3.89  4.66  3.92  4.12  4.10  4.09  3.86
    4.00  4.02  3.87  3.81  4.81  4.25  4.09  4.38  4.32  5.00
    3.99  3.91  4.10  4.40  4.70  4.41  4.24  4.20  4.18  3.56
    3.99  4.33  3.58  4.60  3.97  4.91  4.52  4.09  4.88  4.58

3.4 What effect would adding a constant 5.2 to all observations have upon the numerical values of the following statistics: $\bar{Y}$, $s$, $V$, average deviation, median, mode, range? What would be the effect of adding 5.2 and then multiplying the sums by 8.0? Would it make any difference in the above statistics if we multiplied by 8.0 first and then added 5.2?

3.5 Estimate $\mu$ and $\sigma$ using the midrange and the range (see Section 3.8) for the data in Exercises 3.1, 3.2, and 3.3. How well do these estimates agree with the estimates given by $\bar{Y}$ and $s$? ANS. Estimates of $\mu$ and $\sigma$ for Exercise 3.2 are 0.224 and 0.1014.

3.6 Show that the equation for the variance can also be written as

$$s^2 = \frac{\sum Y^2 - n\bar{Y}^2}{n - 1}$$

3.7 Using the striped bass age distribution given in Exercise 2.9, compute the following statistics: $\bar{Y}$, $s^2$, $s$, $V$, median, and mode. ANS. $\bar{Y} = 3.043$, $s^2 = 1.2661$, $s = 1.125$, $V = 36.98\%$, median = 2.948, mode = 3.

3.8 Use a calculator and compare the results of using Equations 3.7 and 3.8 to compute $s^2$ for the following artificial data sets:
(a) 1, 2, 3, 4, 5
(b) 9001, 9002, 9003, 9004, 9005
(c) 90001, 90002, 90003, 90004, 90005
(d) 900001, 900002, 900003, 900004, 900005
Compare your results with those of one or more computer programs. What is the correct answer? Explain your results.
CHAPTER 4
Introduction to Probability Distributions: The Binomial and Poisson Distributions
In Section 2.5 we first encountered frequency distributions. For example, Table 2.2 shows a distribution for a meristic, or discrete (discontinuous), variable, the number of sedge plants per quadrat. Examples of distributions for continuous variables are the femur lengths of aphids in Box 2.1 and the human birth weights in Box 3.2. Each of these distributions informs us about the absolute frequency of any given class and permits us to compute the relative frequencies of any class of variable. Thus, most of the quadrats contained either no sedges or one or two plants. In the 139.5-oz class of birth weights, we find only 201 out of the total of 9465 babies recorded; that is, approximately only 2.1% of the infants are in that birth weight class.

We realize, of course, that these frequency distributions are only samples from given populations. The birth weights, for example, represent a population of male Chinese infants from a given geographical area. But if we knew our sample to be representative of that population, we could make all sorts of predictions based upon the sample frequency distribution. For instance, we could say that approximately 2.1% of male Chinese babies born in this population should weigh between 135.5 and 143.5 oz at birth. Similarly, we might say that the probability that the weight at birth of any one baby in this population will be in the 139.5-oz birth class is quite low. If all of the 9465 weights were mixed up in a hat and a single one pulled out, the probability that we would pull out one of the 201 in the 139.5-oz class would be very low indeed: only 2.1%. It would be much more probable that we would sample an infant of 107.5 or 115.5 oz, since the infants in these classes are represented by frequencies 2240 and 2007, respectively. Finally, if we were to sample from an unknown population of babies and find that the very first individual sampled had a birth weight of 170 oz, we would probably reject any hypothesis that the unknown population was the same as that sampled in Box 3.2. We would arrive at this conclusion because in the distribution in Box 3.2 only one out of almost 10,000 infants had a birth weight that high. Though it is possible that we could have sampled from the population of male Chinese babies and obtained a birth weight of 170 oz, the probability that the first individual sampled would have such a value is very low indeed. It seems much more reasonable to suppose that the unknown population from which we are sampling has a larger mean than the one sampled in Box 3.2.

We have used this empirical frequency distribution to make certain predictions (with what frequency a given event will occur) or to make judgments and decisions (is it likely that an infant of a given birth weight belongs to this population?).
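The relative frequencies quoted above follow directly from the frequency distribution of Box 3.2. The short sketch below, with variable names of our own choosing, turns the absolute frequencies into relative frequencies and reads off the classes mentioned in the text.

```python
class_marks = [59.5 + 8 * i for i in range(15)]      # 59.5, 67.5, ..., 171.5 oz
freqs = [2, 6, 39, 385, 888, 1729, 2240, 2007,
         1233, 641, 201, 74, 14, 5, 1]

n = sum(freqs)
rel_freq = {mark: f / n for mark, f in zip(class_marks, freqs)}

# Percentage of infants in the 139.5-oz class.
print(round(100 * rel_freq[139.5], 1))

# Percentage in the two most frequent classes, 107.5 and 115.5 oz, combined.
print(round(100 * (rel_freq[107.5] + rel_freq[115.5]), 1))

# Relative frequency of the heaviest class, which contains the single
# infant weighing about 170 oz.
print(rel_freq[171.5])
```

Treating these relative frequencies as probabilities is justified only to the extent that the sample is representative of the population, as the text cautions.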
In many cases in biology, however, we shall make such predictions not from empirical distributions, but on the basis of theoretical considerations that in our judgment are pertinent. We may feel that the data should be distributed in a certain way because of basic assumptions about the nature of the forces acting on the example at hand. If our actually observed data do not conform sufficiently to the values expected on the basis of these assumptions, we shall have serious doubts about our assumptions. This is a common use of frequency distributions in biology. The assumptions being tested generally lead to a theoretical frequency distribution known also as a probability distribution. This may be a simple two-valued distribution, such as the 3:1 ratio in a Mendelian cross; or it may be a more complicated function, as it would be if we were trying to predict the number of plants in a quadrat. If we find that the observed data do not fit the expectations on the basis of theory, we are often led to the discovery of some biological mechanism causing this deviation from expectation. The phenomena of linkage in genetics, of preferential mating between different phenotypes in animal behavior, of congregation of animals at certain favored places or, conversely, their territorial dispersion are cases in point. We shall thus make use of probability theory to test our assumptions about the laws of occurrence of certain biological phenomena. We should point out to the reader, however, that probability theory underlies the entire structure of statistics, since, owing to the nonmathematical orientation of this book, this may not be entirely obvious.
In this chapter we shall first discuss probability, in Section 4.1, but only to the extent necessary for comprehension of the sections that follow at the intended level of mathematical sophistication. Next, in Section 4.2, we shall take up the binomial frequency distribution, which is not only important in certain types of studies, such as genetics, but also fundamental to an understanding of the various kinds of probability distributions to be discussed in this book. The Poisson distribution, which follows in Section 4.3, is of wide applicability in biology, especially for tests of randomness of occurrence of certain events. Both the binomial and Poisson distributions are discrete probability distributions. The most common continuous probability distribution is the normal frequency distribution, discussed in Chapter 5.
4.1 Probability, random sampling, and hypothesis testing
We shall start this discussion with an example that is not biometrical or biological in the strict sense. We have often found it pedagogically effective to introduce new concepts through situations thoroughly familiar to the student, even if the example is not relevant to the general subject matter of biostatistics.
Let us betake ourselves to Matchless University, a state institution somewhere between the Appalachians and the Rockies. Looking at its enrollment figures, we notice the following breakdown of the student body: 70% of the students are American undergraduates (AU) and 26% are American graduate students (AG); the remaining 4% are from abroad. Of these, 1% are foreign undergraduates (FU) and 3% are foreign graduate students (FG). In much of our work we shall use proportions rather than percentages as a useful convention. Thus the enrollment consists of 0.70 AU's, 0.26 AG's, 0.01 FU's, and 0.03 FG's. The total student body, corresponding to 100%, is therefore represented by the figure 1.0.
If we were to assemble all the students and sample 100 of them at random, we would intuitively expect that, on the average, 3 would be foreign graduate students. The actual outcome might vary. There might not be a single FG student among the 100 sampled, or there might be quite a few more than 3. The ratio of the number of foreign graduate students sampled divided by the total number of students sampled might therefore vary from zero to considerably greater than 0.03.
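The variation in the sampled ratio can be illustrated with a short simulation. The sketch below is not part of the original text: it assumes sampling with replacement from a population in which the FG proportion is exactly 0.03, and the function name, sample sizes, and seed are arbitrary illustrative choices.

```python
import random

P_FG = 0.03  # proportion of foreign graduate students, from the text


def fg_ratio(sample_size, rng):
    """Sample students with replacement; return the fraction that are FG."""
    hits = sum(1 for _ in range(sample_size) if rng.random() < P_FG)
    return hits / sample_size


rng = random.Random(1)  # arbitrary fixed seed, only for repeatability
ratios = {n: fg_ratio(n, rng) for n in (100, 1000, 100_000)}
for n, ratio in ratios.items():
    print(f"n = {n:>6}: FG ratio = {ratio:.4f}")
```

With small samples the printed ratio can be noticeably above or below 0.03; with the largest sample it should lie very close to 0.03.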
If we increased our sample size to 500 or 1000, it is less likely that the ratio would fluctuate widely around 0.03. The greater the sample taken, the closer the ratio of FG students sampled to the total students sampled will approach 0.03. In fact, the probability of sampling a foreign student can be defined as the limit as sample size keeps increasing of the ratio of foreign students to the total number of students sampled. Thus, we may formally summarize the situation by stating that the probability that a student at Matchless University will be a foreign graduate student is P[FG] = 0.03. Similarly, the probability of sampling a foreign undergraduate is P[FU] = 0.01, that of sampling an American undergraduate is P[AU] = 0.70, and that for American graduate students, P[AG] = 0.26.
Now let us imagine the following experiment: We try to sample a student at random from among the student body at Matchless University. This is not as easy a task as might be imagined. If we wanted to do this operation physically,
we would have to set up a collection or trapping station somewhere on campus. And to make certain that the sample was truly random with respect to the entire student population, we would have to know the ecology of students on campus very thoroughly. We should try to locate our trap at some station where each student had an equal probability of passing. Few, if any, such places can be found in a university. The student union facilities are likely to be frequented more by independent and foreign students, less by those living in organized houses and dormitories. Fewer foreign and graduate students might be found along fraternity row. Clearly, we would not wish to place our trap near the International Club or House, because our probability of sampling a foreign student would be greatly enhanced. In front of the bursar's window we might sample students paying tuition. But those on scholarships might not be found there. We do not know whether the proportion of scholarships among foreign or graduate students is the same as or different from that among the American or undergraduate students. Athletic events, political rallies, dances, and the like would all draw a differential spectrum of the student body; indeed, no easy solution seems in sight. The time of sampling is equally important, in the seasonal as well as the diurnal cycle.
Those among the readers who are interested in sampling organisms from nature will already have perceived parallel problems in their work.
If we were to sample only students wearing turbans or saris, their probability of being foreign students would be almost 1. We could no longer speak of a random sample. In the familiar ecosystem of the university these violations of proper sampling procedure are obvious to all of us, but they are not nearly so obvious in real biological instances where we are unfamiliar with the true nature of the environment. How should we proceed to obtain a random sample of leaves from a tree, of insects from a field, or of mutations in a culture? In sampling at random, we are attempting to permit the frequencies of various events occurring in nature to be reproduced unalteredly in our records; that is, we hope that on the average the frequencies of these events in our sample will be the same as they are in the natural situation. Another way of saying this is that in a random sample every individual in the population being sampled has an equal probability of being included in the sample.
We might go about obtaining a random sample by using records representing the student body, such as the student directory, selecting a page from it at random and a name at random from the page. Or we could assign an arbitrary number to each student, write each on a chip or disk, put these in a large container, stir well, and then pull out a number.
Imagine now that we sample a single student physically by the trapping method, after carefully planning the placement of the trap in such a way as to make sampling random. What are the possible outcomes?
Clearly, the student could be either an AU, AG, FU, or FG. The set of these four possible outcomes exhausts the possibilities of this experiment. This set, which we can represent as {AU, AG, FU, FG}, is called the sample space. Any single trial of the experiment described above would result in only one of the four possible outcomes (elements) in the set. A single element in a sample space is called a simple event. It is distinguished from an event, which is any subset of the sample space. Thus, in the sample space defined above {AU}, {AG}, {FU}, and {FG} are each simple events. The following sampling results are some of the possible events: {AU, AG, FU}, {AU, AG, FG}, {AG, FG}, {AU, FG}, . . . By the definition of "event," simple events as well as the entire sample space are also events. The meaning of these events should be clarified. Thus {AU, AG, FU} implies being either an American or an undergraduate, or both.
Given the sampling space described above, the event A = {AU, AG} encompasses all possible outcomes in the space yielding an American student. Similarly, the event B = {AG, FG} summarizes the possibilities for obtaining a graduate student. The intersection of events A and B, written $A \cap B$, describes only those events that are shared by A and B. Clearly only AG qualifies, as can be seen below:
A = {AU, AG}
B = {AG, FG}
Thus, $A \cap B$ is that event in the sample space giving rise to the sampling of an American graduate student. When the intersection of two events is empty, as in $B \cap C$, where C = {AU, FU}, events B and C are mutually exclusive. Thus there is no common element in these two events in the sampling space.
We may also define events that are unions of two other events in the sample space. Thus $A \cup B$ indicates that A or B or both A and B occur. As defined above, $A \cup B$ would describe all students who are either American students, graduate students, or American graduate students.
Why are we concerned with defining sample spaces and events? Because these concepts lead us to useful definitions and operations regarding the probability of various outcomes. If we can assign a number p, where $0 \le p \le 1$, to each simple event in a sample space such that the sum of these p's over all simple events in the space equals unity, then the space becomes a (finite) probability space. In our example above, the following numbers were associated with the appropriate simple events in the sample space:
{AU, AG, FU, FG}
{0.70, 0.26, 0.01, 0.03}
Given this probability space, we are now able to make statements regarding the probability of given events. For example, what is the probability that a student sampled at random will be an American graduate student? Clearly, it is P[{AG}] = 0.26. What is the probability that a student is either American or a graduate student? In terms of the events defined earlier, this is
$P[A \cup B] = P[\{AU, AG\}] + P[\{AG, FG\}] - P[\{AG\}] = 0.96 + 0.29 - 0.26 = 0.99$
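The event algebra above can be checked mechanically. The following sketch is an illustration added here, not the authors' method: it represents events as Python sets and uses exact fractions; the helper name `P` and the dictionary `prob` are hypothetical names chosen for this example.

```python
from fractions import Fraction

# Probability space from the text: simple events and their probabilities.
prob = {
    "AU": Fraction(70, 100),
    "AG": Fraction(26, 100),
    "FU": Fraction(1, 100),
    "FG": Fraction(3, 100),
}


def P(event):
    """Probability of an event, given as a set of simple events."""
    return sum((prob[e] for e in event), Fraction(0))


A = {"AU", "AG"}  # American students
B = {"AG", "FG"}  # graduate students
C = {"AU", "FU"}  # undergraduate students

intersection = A & B
union = A | B
p_union_direct = P(union)                         # sum over AU, AG, FG
p_union_formula = P(A) + P(B) - P(intersection)   # subtract the overlap once

print(sorted(intersection))    # ['AG']
print(float(p_union_formula))  # 0.99
print(B & C == set())          # True: B and C are mutually exclusive
```

Both routes to the union give 0.99, which shows why the shared simple event AG must be subtracted once.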
We subtract P[{AG}] from the sum on the right-hand side of the equation because if we did not do so it would be included twice, once in P[A] and once in P[B], and would lead to the absurd result of a probability greater than 1.
Now let us assume that we have sampled our single student from the student body of Matchless University and that student turns out to be a foreign graduate student. What can we conclude from this? By chance alone, this result would happen 0.03, or 3%, of the time, which is not very frequently. The assumption that we have sampled at random should probably be rejected, since if we accept the hypothesis of random sampling, the outcome of the experiment is improbable. Please note that we said improbable, not impossible. It is obvious that we could have chanced upon an FG as the very first one to be sampled. However, it is not very likely. The probability is 0.97 that a single student sampled will be a non-FG. If we could be certain that our sampling method was random (as when drawing student numbers out of a container), we would have to decide that an improbable event has occurred. The decisions of this paragraph are all based on our definite knowledge that the proportion of students at Matchless University is indeed as specified by the probability space. If we were uncertain about this, we would be led to assume a higher proportion of foreign graduate students as a consequence of the outcome of our sampling experiment.
We shall now extend our experiment and sample two students rather than just one.
What are the possible outcomes of this sampling experiment? The new sampling space can best be depicted by a diagram (Figure 4.1) that shows the set of the 16 possible simple events as points in a lattice. The simple events are the following possible combinations. Ignoring which student was sampled first, they are (AU, AU), (AU, AG), (AU, FU), (AU, FG), (AG, AG), (AG, FU), (AG, FG), (FU, FU), (FU, FG), and (FG, FG).
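[Figure 4.1: Sample space for sampling two students, drawn as a lattice of 16 points. Both axes (first student, second student) list AU (0.70), AG (0.26), FU (0.01), and FG (0.03); each lattice point carries its probability, for example 0.0210 for AU with FG and 0.0009 for FG with FG.]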
What are the expected probabilities of these outcomes? We know the expected outcomes for sampling one student from the former probability space, but what will be the probability space corresponding to the new sampling space of 16 elements? Now the nature of the sampling procedure becomes quite important. We may sample with or without replacement: we may return the first student sampled to the population (that is, replace the first student), or we may keep him or her out of the pool of the individuals to be sampled. If we do not replace the first individual sampled, the probability of sampling a foreign graduate student will no longer be exactly 0.03. This is easily seen. Let us assume that Matchless University has 10,000 students. Then, since 3% are foreign graduate students, there must be 300 FG students at the university. After sampling a foreign graduate student first, this number is reduced to 299 out of 9999 students. Consequently, the probability of sampling an FG student now becomes 299/9999 = 0.0299, a slightly lower probability than the value of 0.03 for sampling the first FG student.
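The arithmetic of sampling without replacement can be verified with exact fractions. This is a small sketch under the text's own assumption of 10,000 students; the variable names are illustrative and not from the original.

```python
from fractions import Fraction

N_STUDENTS = 10_000  # enrollment assumed in the text's example
N_FG = 300           # 3% of 10,000

p_first = Fraction(N_FG, N_STUDENTS)
# After one FG student is removed, 299 FG remain among 9999 students.
p_second_without = Fraction(N_FG - 1, N_STUDENTS - 1)
# With replacement the pool is unchanged, so the probability is unchanged.
p_second_with = p_first

print(float(p_first))                      # 0.03
print(round(float(p_second_without), 4))   # 0.0299
```

The difference between the two second-draw probabilities is about 0.0001, which is why large populations can often be treated as effectively infinite.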
If, on the other hand, we return the original foreign student to the student population and make certain that the population is thoroughly randomized before being sampled again (that is, give the student a chance to lose him- or herself among the campus crowd or, in drawing student numbers out of a container, mix up the disks with the numbers on them), the probability of sampling a second FG student is the same as before, 0.03. In fact, if we keep on replacing the sampled individuals in the original population, we can sample from it as though it were an infinite-sized population.
Biological populations are, of course, finite, but they are frequently so large that for purposes of sampling experiments we can consider them effectively infinite whether we replace sampled individuals or not. After all, even in this relatively small population of 10,000 students, the probability of sampling a second foreign graduate student (without replacement) is only minutely different from 0.03. For the rest of this section we shall consider sampling to be with replacement, so that the probability level of obtaining a foreign student does not change.
There is a second potential source of difficulty in this design. We have to assume not only that the probability of sampling a second foreign student is equal to that of the first, but also that it is independent of it. By independence of events we mean that the probability that one event will occur is not affected by whether or not another event has occurred.
In the case of the students, if we have sampled one foreign student, is it more or less likely that a second student sampled in the same manner will also be a foreign student? Independence of the events may depend on where we sample the students or on the method of sampling. If we have sampled students on campus, it is quite likely that the events are not independent; that is, if one foreign student has been sampled, the probability that the second student will be foreign is increased, since foreign students tend to congregate. Thus, at Matchless University the probability that a student walking with a foreign graduate student is also an FG will be greater than 0.03.
Events D and E in a sample space will be defined as independent whenever $P[D \cap E] = P[D]P[E]$. The probability values assigned to the sixteen points in the sample-space lattice of Figure 4.1 have been computed to satisfy the above condition. Thus, letting P[D] equal the probability that the first student will be an AU, that is, $P[\{AU_1AU_2, AU_1AG_2, AU_1FU_2, AU_1FG_2\}]$, and letting P[E] equal the probability that the second student will be an FG, that is, $P[\{AU_1FG_2, AG_1FG_2, FU_1FG_2, FG_1FG_2\}]$, we note that the intersection $D \cap E$ is $\{AU_1FG_2\}$. This has a value of 0.0210 in the probability space of Figure 4.1. We find that this value is the product $P[\{AU\}]P[\{FG\}] = 0.70 \times 0.03 = 0.0210$. These mutually independent relations have been deliberately imposed upon all points in the probability space. Therefore, if the sampling probabilities for the second student are independent of the type of student sampled first, we can compute the probabilities of the outcomes simply as the product of the independent probabilities. Thus the probability of obtaining two FG students is $P[\{FG\}]P[\{FG\}] = 0.03 \times 0.03 = 0.0009$.
The probability of obtaining one AU and one FG student in the sample should be the product $0.70 \times 0.03$. However, it is in fact twice that probability. It is easy to see why. There is only one way of obtaining two FG students, namely, by sampling first one FG and then again another FG. Similarly, there is only one way to sample two AU students.
However, sampling one of each type of student can be done by sampling first an AU and then an FG or by sampling first an FG and then an AU. Thus the probability is $2P[\{AU\}]P[\{FG\}] = 2 \times 0.70 \times 0.03 = 0.0420$.
If we conducted such an experiment and obtained a sample of two FG students, we would be led to the following conclusions. Only 0.0009 of the samples (9/100 of 1%, or 9 out of 10,000 cases) would be expected to consist of two foreign graduate students. It is quite improbable to obtain such a result by chance alone. Given P[{FG}] = 0.03 as a fact, we would therefore suspect that sampling was not random or that the events were not independent (or that both assumptions, random sampling and independence of events, were incorrect).
Random sampling is sometimes confused with randomness in nature. The former is the faithful representation in the sample of the distribution of the events in nature; the latter is the independence of the events in nature. The first of these generally is or should be under the control of the experimenter and is related to the strategy of good sampling. The second generally describes an innate property of the objects being sampled and thus is of greater biological interest. The confusion between random sampling and independence of events arises because lack of either can yield observed frequencies of events differing from expectation. We have already seen how lack of independence in samples of foreign students can be interpreted from both points of view in our illustrative example from Matchless University.
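The two-student probabilities discussed above can be reproduced by enumeration. The sketch below assumes independent draws with replacement, as the text does for this section; the dictionary names `ordered` and `unordered` are hypothetical and chosen only for this illustration.

```python
from itertools import product

p = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

# Ordered pairs (first student, second student): the 16 lattice points.
# Under independence each probability is the product of the two marginals.
ordered = {(a, b): p[a] * p[b] for a, b in product(p, repeat=2)}

# Ignoring order: merge (a, b) with (b, a), giving 10 combinations.
unordered = {}
for (a, b), prob in ordered.items():
    key = tuple(sorted((a, b)))
    unordered[key] = unordered.get(key, 0.0) + prob

print(len(ordered), len(unordered))        # 16 10
print(round(ordered[("AU", "FG")], 4))     # 0.021
print(round(unordered[("FG", "FG")], 4))   # 0.0009
print(round(unordered[("AU", "FG")], 4))   # 0.042
```

Mixed pairs appear twice in the ordered lattice, which is where the factor of 2 in the probability of one AU and one FG comes from.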
The above account of probability is adequate for our present purposes but far too sketchy to convey an understanding of the field. Readers interested in extending their knowledge of the subject are referred to Mosimann (1968) for a simple introduction.
4.2 The binomial distribution
For purposes of the discussion to follow we shall simplify our sample space to consist of only two elements, foreign and American students, and ignore whether the students are undergraduates or graduates; we shall represent the sample space by the set {F, A}. Let us symbolize the probability space by {p, q}, where p = P[F], the probability that the student is foreign, and q = P[A], the probability that the student is American. As before, we can compute the probability space of samples of two students as follows:
{FF, FA, AA}
$\{p^2, 2pq, q^2\}$
If we were to sample three students independently, the probability space of samples of three students would be as follows:
{FFF, FFA, FAA, AAA}
$\{p^3, 3p^2q, 3pq^2, q^3\}$
Samples of three foreign or three American students can again be obtained in only one way, and their probabilities are $p^3$ and $q^3$, respectively. However, in samples of three there are three ways of obtaining two students of one kind and one student of the other. As before, if A stands for American and F stands for foreign, then the sampling sequence can be AFF, FAF, FFA for two foreign students and one American. Thus the probability of this outcome will be $3p^2q$. Similarly, the probability for two Americans and one foreign student is $3pq^2$.
A convenient way to summarize these results is by means of the binomial expansion, which is applicable to samples of any size from populations in which objects occur independently in only two classes: students who may be foreign or American, or individuals who may be dead or alive, male or female, black or white, rough or smooth, and so forth. This is accomplished by expanding the binomial term $(p + q)^k$, where k equals sample size, p equals the probability of occurrence of the first class, and q equals the probability of occurrence of the second class. By definition, $p + q = 1$; hence q is a function of p: $q = 1 - p$. We shall expand the expression for samples of k from 1 to 3:
For samples of 1, $(p + q)^1 = p + q$
For samples of 2, $(p + q)^2 = p^2 + 2pq + q^2$
For samples of 3, $(p + q)^3 = p^3 + 3p^2q + 3pq^2 + q^3$
It will be seen that these expressions yield the same probability spaces discussed previously. The coefficients (the numbers before the powers of p and q) express the number of ways a particular outcome is obtained. An easy method for evaluating the coefficients of the expanded terms of the binomial expression is Pascal's triangle:
k = 1            1   1
k = 2          1   2   1
k = 3        1   3   3   1
k = 4      1   4   6   4   1
k = 5    1   5  10  10   5   1
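The rows of the triangle can also be generated programmatically, since each interior value is the sum of the two values above it. The function below is a minimal sketch; the name `pascal_row` is an illustrative choice and not from the original text.

```python
def pascal_row(k):
    """Return the binomial coefficients for sample size k."""
    row = [1]
    for _ in range(k):
        # Each interior value is the sum of the two values above it;
        # every row begins and ends with 1.
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return row


for k in range(1, 7):
    print(k, pascal_row(k))
```

The last line printed, for k = 6, is `[1, 6, 15, 20, 15, 6, 1]`.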
Pascal's triangle provides the coefficients of the binomial expression, that is, the number of possible outcomes of the various combinations of events. For k = 1 the coefficients are 1 and 1. For the second line (k = 2), write 1 at the left-hand margin of the line. The 2 in the middle of this line is the sum of the values to the left and right of it in the line above. The line is concluded with a 1. Similarly, the values at the beginning and end of the third line are 1, and the other numbers are sums of the values to their left and right in the line above; thus 3 is the sum of 1 and 2. This principle continues for every line. You can work out the coefficients for any size sample in this manner. The line for k = 6 would consist of the following coefficients: 1, 6, 15, 20, 15, 6, 1. The p and q values bear powers in a consistent pattern, which should be easy to imitate for any value of k. We give it here for k = 4:
$p^4q^0 + p^3q^1 + p^2q^2 + p^1q^3 + p^0q^4$
The power of p decreases from 4 to 0 (k to 0 in the general case) as the power of q increases from 0 to 4 (0 to k in the general case). Since any value to the power 0 is 1 and any term to the power 1 is simply itself, we can simplify this expression as shown below and at the same time provide it with the coefficients from Pascal's triangle for the case k = 4:
$p^4 + 4p^3q + 6p^2q^2 + 4pq^3 + q^4$
Thus we are able to write down almost by inspection the expansion of the binomial to any reasonable power. Let us now practice our newly learned ability to expand the binomial.
Suppose we have a population of insects, exactly 40% of which are infected with a given virus X. If we take samples of k = 5 insects each and examine each insect separately for presence of the virus, what distribution of samples could we expect if the probability of infection of each insect in a sample were independent of that of other insects in the sample? In this case p = 0.4, the proportion infected, and q = 0.6, the proportion not infected. It is assumed that the population is so large that the question of whether sampling is with or without replacement is irrelevant for practical purposes. The expected proportions would be the expansion of the binomial:
$(p + q)^k = (0.4 + 0.6)^5$
With the aid of Pascal's triangle this expansion is
$p^5 + 5p^4q + 10p^3q^2 + 10p^2q^3 + 5pq^4 + q^5$
or
$(0.4)^5 + 5(0.4)^4(0.6) + 10(0.4)^3(0.6)^2 + 10(0.4)^2(0.6)^3 + 5(0.4)(0.6)^4 + (0.6)^5$
representing the expected proportions of samples of five infected insects, four infected and one noninfected insects, three infected and two noninfected insects, and so on.
The reader has probably realized by now that the terms of the binomial expansion actually yield a type of frequency distribution for these different outcomes. Associated with each outcome, such as "five infected insects," there is a probability of occurrence, in this case $(0.4)^5 = 0.01024$. This is a theoretical frequency distribution or probability distribution of events that can occur in two classes. It describes the expected distribution of outcomes in random samples of five insects from a population in which 40% are infected. The probability distribution described here is known as the binomial distribution, and the binomial expansion yields the expected frequencies of the classes of the binomial distribution.
A convenient layout for presentation and computation of a binomial distribution is shown in Table 4.1. The first column lists the number of infected insects per sample, the second column shows decreasing powers of p from $p^5$ to $p^0$, and the third column shows increasing powers of q from $q^0$ to $q^5$. The binomial coefficients from Pascal's triangle are shown in column (4). The relative expected frequencies, which are the probabilities of the various outcomes, are shown in column (5). We label such expected frequencies $\hat{f}_{rel}$. They are simply the product of columns (2), (3), and (4). Their sum is equal to 1.0, since the events listed in column (1) exhaust the possible outcomes.
TABLE 4.1
Expected frequencies of infected insects in samples of 5 insects sampled from an infinitely large population with an assumed infection rate of 40%.
Columns: (1) number of infected insects per sample, Y; (2) powers of p = 0.4; (3) powers of q = 0.6; (4) binomial coefficients; (5) relative expected frequencies, $\hat{f}_{rel}$; (6) absolute expected frequencies, $\hat{f}$; (7) observed frequencies, f.
(1)                  (2)       (3)       (4)   (5)       (6)       (7)
5                    0.01024   1.00000     1   0.01024      24.8      29
4                    0.02560   0.60000     5   0.07680     186.1     197
3                    0.06400   0.36000    10   0.23040     558.3     535
2                    0.16000   0.21600    10   0.34560     837.4     817
1                    0.40000   0.12960     5   0.25920     628.0     643
0                    1.00000   0.07776     1   0.07776     188.4     202
$\sum f$ (= n)                                 1.00000    2423.0    2423
$\sum Y$                                       2.00000    4846.1    4815
Mean                                           2.00000   2.00004  1.98721
Standard deviation                             1.09545   1.09543  1.11934
We see from column (5) in Table 4.1 that only about 1% of samples are expected to consist of 5 infected insects, and 25.9% are expected to contain 1 infected and 4 noninfected insects. We shall test whether these predictions hold in an actual experiment.
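The relative expected frequencies of column (5) can be recomputed directly from the binomial terms. The sketch below uses Python's `math.comb` for the coefficients in place of Pascal's triangle; the variable names are illustrative and not from the original text.

```python
from math import comb

k, p = 5, 0.4
q = 1 - p

# Relative expected frequency of Y infected insects in a sample of k:
# coefficient * p^Y * q^(k - Y), listed from Y = 5 down to Y = 0.
f_rel = {y: comb(k, y) * p**y * q**(k - y) for y in range(k, -1, -1)}

for y, f in f_rel.items():
    print(y, f"{f:.5f}")
print(f"sum = {sum(f_rel.values()):.5f}")
```

The printed values should match column (5) of Table 4.1, from 0.01024 for five infected insects to 0.07776 for none, and they sum to 1.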
Experiment 4.1. Simulate the sampling of infected insects by using a table of random numbers such as Table I in Appendix A1. These are randomly chosen one-digit numbers in which each digit 0 through 9 has an equal probability of appearing. The numbers are grouped in blocks of 25 for convenience. Such numbers can also be obtained from random number keys on some pocket calculators and by means of pseudorandom number-generating algorithms in computer programs. (In fact, this entire experiment can be programmed and performed automatically even on a small computer.) Since there is an equal probability for any one digit to appear, you can let any four digits (say, 0, 1, 2, 3) stand for the infected insects and the remaining digits (4, 5, 6, 7, 8, 9) stand for the noninfected insects. The probability that any one digit selected from the table will represent an infected insect (that is, will be a 0, 1, 2, or 3) is therefore 40%, or 0.4, since these are four of the ten possible digits. Also, successive digits are assumed to be independent of the values of previous digits. Thus the assumptions of the binomial distribution should be met in this experiment. Enter the table of random numbers at an arbitrary point (not always at the beginning!) and look at successive groups of five digits, noting in each group how many of the digits are 0, 1, 2, or 3. Take as many groups of five as you can find time to do, but no fewer than 100 groups.
Column (7) in Table 4.1 shows the results of one such experiment during one year by a biostatistics class. A total of 2423 samples of five numbers were obtained from the table of random numbers; the distribution of the four digits simulating the percentage of infection is shown in this column. The observed frequencies are labeled $f$. To calculate the expected frequencies for this actual example we multiplied the relative frequencies $\hat{f}_{\mathrm{rel}}$ of column (5) by $n = 2423$, the number of samples taken. This results in absolute expected frequencies, labeled $\hat{f}$, shown in column (6). When we compare the observed frequencies in column (7) with the expected frequencies in column (6), we note general agreement between the two columns of figures. The two distributions are also illustrated in Figure 4.2. If the observed frequencies did not fit expected frequencies, we might believe that the lack of fit was due to chance alone. Or we might be led to reject one or more of the following hypotheses: (1) that the true proportion of digits 0, 1, 2, and 3 is 0.4 (rejection of this hypothesis would normally not be reasonable, for we may rely on the fact that the proportion of digits 0, 1, 2, and 3 in a table of random numbers is 0.4 or very close to it); (2) that sampling was at random; and (3) that events were independent.
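Experiment 4.1 and the comparison just described can be sketched in code. This is only an illustration under stated assumptions: a pseudorandom generator stands in for the table of random numbers, the seed is arbitrary, and the function name `simulate_experiment` is hypothetical. The simulated counts will not reproduce the class results of column (7); they are one more sample from the same process.

```python
import random
from math import comb

def simulate_experiment(n_samples, k=5, infected_digits=(0, 1, 2, 3), seed=1):
    """For each of n_samples groups of k random digits, count the digits standing for infected insects."""
    rng = random.Random(seed)           # arbitrary seed, used only to make the run repeatable
    observed = [0] * (k + 1)
    for _ in range(n_samples):
        digits = [rng.randrange(10) for _ in range(k)]
        y = sum(d in infected_digits for d in digits)
        observed[y] += 1
    return observed

n, k, p = 2423, 5, 0.4
observed = simulate_experiment(n, k)
# Absolute expected frequencies: relative expected frequency times n
expected = [n * comb(k, y) * p**y * (1 - p) ** (k - y) for y in range(k + 1)]

for y in range(k, -1, -1):
    print(y, observed[y], round(expected[y], 1))
```

Changing the seed gives a different set of observed frequencies, which is a convenient way to see how much departure from expectation chance alone produces.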
FIGURE 4.2 [Observed and expected frequencies; horizontal axis: number of infected insects per sample; vertical axis: frequency.]

These statements can be reinterpreted in terms of the original infection model with which we started this discussion. If, instead of a sampling experiment of digits by a biostatistics class, this had been a real sampling experiment of insects, we would conclude that the insects had indeed been randomly sampled and that we had no evidence to reject the hypothesis that the proportion of infected insects was 40%. If the observed frequencies had not fitted the expected frequencies, the lack of fit might be attributed to chance, or to the conclusion that the true proportion of infection was not 0.4; or we would have had to reject one or both of the following assumptions: (1) that sampling was at random and (2) that the occurrence of infected insects in these samples was independent.

Experiment 4.1 was designed to yield random samples and independent events. How could we simulate a sampling procedure in which the occurrences of the digits 0, 1, 2, and 3 were not independent? We could, for example, instruct the sampler to sample as indicated previously, but, every time a 3 was found among the first four digits of a sample, to replace the following digit with another one of the four digits standing for infected individuals. Thus, once a 3 was found, the probability would be 1.0 that another one of the indicated digits would be included in the sample. After repeated samples, this would result in higher frequencies of samples of two or more indicated digits and in lower frequencies than expected (on the basis of the binomial distribution) of samples of one such digit. A variety of such different sampling schemes could be devised. It should be quite clear to the reader that the probability of the second event's occurring would be different from that of the first and dependent on it.

How would we interpret a large departure of the observed frequencies from expectation?
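The nonindependent sampling scheme described above can be tried out directly. The sketch below is one possible reading of that scheme, not a definitive implementation: it assumes the replacement digit is drawn with equal probability from all four indicated digits (0, 1, 2, 3), so a replacement may itself be a 3 and trigger a further replacement. The function names and the seed are hypothetical choices for this illustration.

```python
import random

INFECTED = (0, 1, 2, 3)

def independent_sample(rng, k=5):
    """k independent random digits, as in Experiment 4.1."""
    return [rng.randrange(10) for _ in range(k)]

def dependent_sample(rng, k=5):
    """Whenever a 3 occurs among the first k-1 digits, force the next digit to be one of 0-3."""
    digits = []
    for i in range(k):
        if i > 0 and digits[i - 1] == 3:
            digits.append(rng.choice(INFECTED))   # assumed replacement rule
        else:
            digits.append(rng.randrange(10))
    return digits

def count_distribution(sampler, n_samples, seed=1, k=5):
    """Frequency distribution of the number of indicated digits per sample."""
    rng = random.Random(seed)
    counts = [0] * (k + 1)
    for _ in range(n_samples):
        y = sum(d in INFECTED for d in sampler(rng, k))
        counts[y] += 1
    return counts

n = 20000
indep = count_distribution(independent_sample, n)
dep = count_distribution(dependent_sample, n)
print("Y  independent  dependent")
for y in range(5, -1, -1):
    print(y, indep[y], dep[y])
```

With a sample this large, the dependent scheme should show fewer samples with exactly one indicated digit and more samples with two or more, in line with the argument in the text.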
We have not as yet learned techniques for testing whether observed frequencies differ from those expected by more than can be attributed to chance alone. This will be taken up in Chapter 13. Assume that such a test has been carried out and that it has shown us that our observed frequencies are significantly different from expectation. Two main types of departure from expectation can be characterized: (1) clumping and (2) repulsion, shown in fictitious examples in Table 4.2. In actual examples we would have no a priori notions about the magnitude of p, the probability of one of the two possible outcomes. In such cases it is customary to obtain p from the observed sample and to calculate the expected frequencies, using the sample p. This would mean that the hypothesis that p is a given value cannot be tested, since by design the expected frequencies will have the same p value as the observed frequencies. Therefore, the hypotheses tested are whether the samples are random and the events independent.

TABLE 4.2 [Table body not recoverable. Recoverable column headings: (1) number of infected insects per sample, Y (5, 4, 3, 2, 1, 0); (3) clumped (contagious) frequencies, f; (4) deviation from expectation; (5) repulsed frequencies, f; (6) deviation from expectation. Deviations are shown as plus or minus signs.]

The clumped frequencies in Table 4.2 have an excess of observations at the tails of the frequency distribution and consequently a shortage of observations at the center. Such a distribution is also said to be contagious. (Remember that the total number of items must be the same in both observed and expected frequencies in order to make them comparable.) In the repulsed frequency distribution there are more observations than expected at the center of the distribution and fewer at the tails. These discrepancies are most easily seen in columns (4) and (6) of Table 4.2, where the deviations of observed from expected frequencies are shown as plus or minus signs.

What do these phenomena imply? In the clumped frequencies, more samples were entirely infected (or largely infected), and similarly, more samples were entirely noninfected (or largely noninfected) than you would expect if probabilities of infection were independent. This could be due to poor sampling design.
If, for example, the investigator in collecting samples of five insects always tended to pick out like ones (that is, infected ones or noninfected ones), then such a result would likely appear. But if the sampling design is sound, the results become more interesting. Clumping would then mean that the samples of five are in some way related, so that if one insect is infected, others in the same sample are more likely to be infected. This could be true if they come from adjacent locations in a situation in which neighbors are easily infected. Or they could be siblings jointly exposed to a source of infection. Or possibly the infection might spread among members of a sample between the time that the insects are sampled and the time they are examined.

The opposite phenomenon, repulsion, is more difficult to interpret biologically. There are fewer homogeneous groups and more mixed groups in such a distribution. This involves the idea of a compensatory phenomenon: if some of the insects in a sample are infected, the others in the sample are less likely to be. If the infected insects in the sample could in some way transmit immunity to their associates in the sample, such a situation could arise logically, but it is biologically improbable. A more reasonable interpretation of such a finding is that for each sampling unit, there were only a limited number of pathogens available; then once several of the insects have become infected, the others go free of infection, simply because there is no more infectious agent. This is an unlikely situation in microbial infections, but in situations in which a limited number of parasites enter the body of the host, repulsion may be more reasonable.

From the expected and observed frequencies in Table 4.1, we may calculate the mean and standard deviation of the number of infected insects per sample. These values are given at the bottom of columns (5), (6), and (7) in Table 4.1.
We note that the means and standard deviations in columns (5) and (6) are almost identical and differ only trivially because of rounding errors. Column (7), being a sample from a population whose parameters are the same as those of the expected frequency distribution in column (5) or (6), differs somewhat. The mean is slightly smaller and the standard deviation is slightly greater than in the expected frequencies. If we wish to know the mean and standard deviation of expected binomial frequency distributions, we need not go through the computations shown in Table 4.1. The mean and standard deviation of a binomial frequency distribution are, respectively,

$$\mu = kp \qquad \sigma = \sqrt{kpq}$$
Substituting the values k = 5, p = 0.4, and q = 0.6 of the above example, we obtain $\mu$ = 2.0 and $\sigma$ = 1.095,45, which are identical to the values computed from column (5) in Table 4.1. Note that we use the Greek parametric notation here because $\mu$ and $\sigma$ are parameters of an expected frequency distribution, not sample statistics, as are the mean and standard deviation in column (7). The proportions p and q are parametric values also, and strictly speaking, they should be distinguished from sample proportions. In fact, in later chapters we resort to $\hat{p}$ and $\hat{q}$ for parametric proportions (rather than $\pi$, which conventionally is used as the ratio of the circumference to the diameter of a circle). Here, however, we prefer to keep our notation simple. If we wish to express our variable as a proportion rather than as a count (that is, to indicate mean incidence of infection in the insects as 0.4, rather than as 2 per sample of 5), we can use other formulas for the mean and standard deviation in a binomial distribution:

$$\mu = p \qquad \sigma = \sqrt{\frac{pq}{k}}$$
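The shortcut formulas for the mean and standard deviation can be compared with a direct computation from the expected distribution. The following is a minimal sketch, assuming k = 5 and p = 0.4 as in the example; the variable names are chosen only for this illustration.

```python
from math import comb, sqrt

k, p = 5, 0.4
q = 1 - p
f_rel = [comb(k, y) * p**y * q**(k - y) for y in range(k + 1)]

# Mean and standard deviation computed directly from the expected distribution
mean = sum(y * f for y, f in enumerate(f_rel))
sd = sqrt(sum((y - mean) ** 2 * f for y, f in enumerate(f_rel)))

# Count scale: compare with kp and sqrt(kpq)
print(round(mean, 5), round(k * p, 5))
print(round(sd, 5), round(sqrt(k * p * q), 5))

# Proportion scale: compare with p and sqrt(pq/k)
print(round(mean / k, 5), p)
print(round(sd / k, 5), round(sqrt(p * q / k), 5))
```

Dividing the count-scale mean and standard deviation by k gives the proportion-scale values, which is why the second pair of formulas follows from the first.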
It is interesting to look at the standard deviations of the clumped and repulsed frequency distributions of Table 4.2. We note that the clumped distribution has a standard deviation greater than expected, and that of the repulsed one is less than expected. Comparison of sample standard deviations with their expected values is a useful measure of dispersion in such instances.

We shall now employ the binomial distribution to solve a biological problem. On the basis of our knowledge of the cytology and biology of species A, we expect the sex ratio among its offspring to be 1:1. The study of a litter in nature reveals that of 17 offspring 14 were females and 3 were males. What conclusions can we draw from this evidence? Assuming that p♀ (the probability of being a female offspring) = 0.5 and that this probability is independent among the members of the sample, the pertinent probability distribution is the binomial for sample size k = 17. Expanding the binomial to the power 17 is a formidable task, which, as we shall see, fortunately need not be done in its entirety. However, we must have the binomial coefficients, which can be obtained either from an expansion of Pascal's triangle (fairly tedious unless once obtained and stored for future use) or by working out the expected frequencies for any given class of Y from the general formula for any term of the binomial distribution

$$C(k, Y)\,p^{Y}q^{k-Y} \tag{4.1}$$
The expression C(k, Y) stands for the number of combinations that can be formed from k items taken Y at a time. This can be evaluated as k!/[Y!(k − Y)!], where ! means "factorial." In mathematics k factorial is the product of all the integers from 1 up to and including k. Thus, 5! = 1 × 2 × 3 × 4 × 5 = 120. By convention, 0! = 1. In working out fractions containing factorials, note that any factorial will always cancel against a higher factorial. Thus 5!/3! = (5 × 4 × 3!)/3! = 5 × 4. For example, the binomial coefficient for the expected frequency of samples of 5 items containing 2 infected insects is C(5, 2) = 5!/2!3! = (5 × 4)/2 = 10.

The setup of the example is shown in Table 4.3. Decreasing powers of p♀ from p♀^17 down and increasing powers of q♂ are computed (from power 0 to power 4). Since we require the probability of 14 females, we note that for the purposes of our problem, we need not proceed beyond the term for 13 females and 4 males. Calculating the relative expected frequencies in column (6), we note that the probability of 14 females and 3 males is 0.005,188,40, a very small value. If we add to this value all "worse" outcomes (that is, all outcomes that are even more unlikely than 14 females and 3 males on the assumption of a 1:1 hypothesis), we obtain a probability of 0.006,363,42, still a very small value. (In statistics, we often need to calculate the probability of observing a deviation as large as or larger than a given value.)
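Expression (4.1) and the tail sum described above are easy to evaluate by machine. The sketch below is an illustration only; the function name `binomial_term` is hypothetical. Evaluated without intermediate rounding, the two probabilities differ from the values quoted in the text by less than 0.000,001, apparently because the tabled values start from a rounded power of 0.5.

```python
from math import comb, factorial

def binomial_term(k, y, p):
    """General term C(k, Y) p^Y q^(k-Y) of Expression (4.1)."""
    return comb(k, y) * p**y * (1 - p) ** (k - y)

# The binomial coefficient worked out in the text: C(5, 2) = 5!/(2! 3!)
print(comb(5, 2), factorial(5) // (factorial(2) * factorial(3)))

k, p = 17, 0.5
p_14 = binomial_term(k, 14, p)                                   # exactly 14 females
p_14_or_more = sum(binomial_term(k, y, p) for y in range(14, k + 1))
print(f"{p_14:.8f}")
print(f"{p_14_or_more:.8f}")
```

Summing from 14 upward includes the "worse" outcomes of 15, 16, and 17 females.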
TABLE 4.3
Some expected frequencies of males and females for samples of 17 offspring on the assumption that the sex ratio is 1:1 [p♀ = 0.5, q♂ = 0.5; (p♀ + q♂)^k = (0.5 + 0.5)^17].

(1)       (2)     (3)       (4)        (5)             (6)
Females   Males   p♀^Y      q♂^(k−Y)   Binomial        Relative expected
                                       coefficients    frequencies
17        0       0.5^17    1          1               0.000,007,63
16        1       0.5^16    0.5        17              0.000,129,71
15        2       0.5^15    0.25       136             0.001,037,68
14        3       0.5^14    0.125      680             0.005,188,40
13        4       0.5^13    0.0625     2380            0.018,157,91

The first four entries of column (6) are bracketed; their sum is 0.006,363,42.
On the basis of these findings one or more of the following assumptions is unlikely: (1) that the true sex ratio in species A is 1:1, (2) that we have sampled at random in the sense of obtaining an unbiased sample, or (3) that the sexes of the offspring are independent of one another. Lack of independence of events may mean that although the average sex ratio is 1:1, the individual sibships, or litters, are largely unisexual, so that the offspring from a given mating would tend to be all (or largely) females or all (or largely) males. To confirm this hypothesis, we would need to have more samples and then examine the distribution of samples for clumping, which would indicate a tendency for unisexual sibships.

We must be very precise about the questions we ask of our data. There are really two questions we could ask about the sex ratio. First, are the sexes unequal in frequency so that females will appear more often than males? Second, are the sexes unequal in frequency? It may be that we know from past experience that in this particular group of organisms the males are never more frequent than females; in that case, we need be concerned only with the first of these two questions, and the reasoning followed above is appropriate. However, if we know very little about this group of organisms, and if our question is simply whether the sexes among the offspring are unequal in frequency, then we have to consider both tails of the binomial frequency distribution; departures from the 1:1 ratio could occur in either direction.
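The two questions correspond to summing one tail or both tails of the binomial distribution. The following is a minimal sketch for the symmetrical case p = 0.5 only, where the lower tail is the mirror image of the upper tail; the function and variable names are hypothetical, and the probabilities are computed without intermediate rounding.

```python
from math import comb

def binomial_term(k, y, p):
    """Probability of exactly y events in a sample of k."""
    return comb(k, y) * p**y * (1 - p) ** (k - y)

k, females = 17, 14

# First question: 14 or more females (one tail)
upper = sum(binomial_term(k, y, 0.5) for y in range(females, k + 1))

# Second question: add the mirror-image tail, 14 or more males (3 or fewer females)
lower = sum(binomial_term(k, y, 0.5) for y in range(0, k - females + 1))
both = upper + lower

print(f"one tail:   {upper:.6f}")
print(f"both tails: {both:.6f}")
```

For p other than 0.5 the two tails are no longer equal, so this mirror-image shortcut would not apply.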
We should then consider not only the probability of samples with 14 females and 3 males (and all worse cases) but also the probability of samples of 14 males and 3 females (and all worse cases in that direction). Since this probability distribution is symmetrical (because p♀ = q♂ = 0.5), we simply double the cumulative probability of 0.006,363,42 obtained previously, which results in 0.012,726,84. This new value is still very small, making it quite unlikely that the true sex ratio is 1:1.

This is your first experience with one of the most important applications of statistics: hypothesis testing. A formal introduction to this field will be deferred until Section 6.8. We may simply point out here that the two approaches followed above are known appropriately as one-tailed tests and two-tailed tests, respectively. Students sometimes have difficulty knowing which of the two tests to apply. In future examples we shall try to point out in each case why a one-tailed or a two-tailed test is being used.

We have said that a tendency for unisexual sibships would result in a clumped distribution of observed frequencies. An actual case of this nature is a classic in the literature, the sex ratio data obtained by Geissler (1889) from hospital records in Saxony. Table 4.4 reproduces sex ratios of 6115 sibships of 12 children each from the more extensive study by Geissler. All columns of the table should by now be familiar. The expected frequencies were not calculated on the basis of a 1:1 hypothesis, since it is known that in human populations the sex ratio at birth is not 1:1. As the sex ratio varies in different human populations, the best estimate of it for the population in Saxony was simply obtained using the mean proportion of males in these data. This can be obtained by calculating the average number of males per sibship ($\bar{Y}$ = 6.230,58) for the 6115 sibships and converting this into a proportion. This value turns out to be 0.519,215. Consequently, the proportion of females = 0.480,785. In the deviations of the observed frequencies from the absolute expected frequencies shown in column (9) of Table 4.4, we notice considerable clumping. There are many more instances of families with all male or all female children (or nearly so) than independent probabilities would indicate.
The genetic basis for this is not clear, but it is evident that there are some families which "run to girls" and similarly those which "run to boys." Evidence of clumping can also be seen from the fact that $s^2$ is much larger than we would expect on the basis of the binomial distribution ($\sigma^2 = kpq$ = 12(0.519,215)(0.480,785) = 2.995,57).

There is a distinct contrast between the data in Table 4.1 and those in Table 4.4. In the insect infection data of Table 4.1 we had a hypothetical proportion of infection based on outside knowledge. In the sex ratio data of Table 4.4 we had no such knowledge; we used an empirical value of p obtained from the data, rather than a hypothetical value external to the data. This is a distinction whose importance will become apparent later. In the sex ratio data of Table 4.3, as in much work in Mendelian genetics, a hypothetical value of p is used.
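The arithmetic for the Geissler data can be retraced in a few lines. This sketch uses only the figures quoted in the text (12 children per sibship and a mean of 6.230,58 males per sibship); the variable names are chosen for this illustration.

```python
k = 12                  # children per sibship
mean_males = 6.23058    # average number of males per sibship, as quoted in the text

p = mean_males / k      # empirical proportion of males
q = 1 - p               # proportion of females
variance = k * p * q    # variance expected under the binomial distribution

print(round(p, 6), round(q, 6), round(variance, 5))
```

An observed variance well above this expected value is the sign of clumping referred to above.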
TABLE 4.4 [Sex ratios in 6115 sibships of 12 children each in Saxony (Geissler, 1889). Table body not recoverable.]

4.3 The Poisson distribution

In the typical application of the binomial we had relatively small samples (2 students, 5 insects, 17 offspring, 12 siblings) in which two alternative states occurred at varying frequencies (American and foreign, infected and noninfected, male and female). Quite frequently, however, we study cases in which sample size k is very large and one of the events (represented by probability q) is very much more frequent than the other (represented by probability p). We have seen that the expansion of the binomial $(p + q)^k$ is quite tiresome when k is large. Suppose you had to expand the expression $(0.001 + 0.999)^{1000}$. In such cases we are generally interested in one tail of the distribution only. This is the tail represented by the terms

$$p^0q^k,\quad C(k,1)\,p^1q^{k-1},\quad C(k,2)\,p^2q^{k-2},\quad C(k,3)\,p^3q^{k-3},\ \ldots$$
The first term represents no rare events and k frequent events in a sample of k events. The second term represents one rare event and k − 1 frequent events. The third term represents two rare events and k − 2 frequent events, and so forth. The expressions of the form C(k, i) are the binomial coefficients, represented by the combinatorial terms discussed in the previous section. Although the desired tail of the curve could be computed by this expression, as long as sufficient decimal accuracy is maintained, it is customary in such cases to compute another distribution, the Poisson distribution, which closely approximates the desired results. As a rule of thumb, we may use the Poisson distribution to approximate the binomial distribution when the probability of the rare event is less than 0.1 and the product kp (sample size × probability) is less than 5.

The Poisson distribution is also a discrete frequency distribution of the number of times a rare event occurs. But, in contrast to the binomial distribution, the Poisson distribution applies to cases where the number of times that an event does not occur is infinitely large. For purposes of our treatment here, a Poisson variable will be studied in samples taken over space or time. An example of the first would be the number of moss plants in a sampling quadrat on a hillside or the number of parasites on an individual host; an example of a temporal sample is the number of mutations occurring in a genetic strain in the time interval of one month or the reported cases of influenza in one town during one week. The Poisson variable Y will be the number of events per sample.
It can assume discrete values from 0 on up. To be distributed in Poisson fashion the variable must have two properties: (1) Its mean must be small relative to the maximum possible number of events per sampling unit. Thus the event should be "rare." But this means that our sampling unit of space or time must be large enough to accommodate a potentially substantial number of events. For example, a quadrat in which moss plants are counted must be large enough that a substantial number of moss plants could occur there physically if the biological conditions were such as to favor the development of numerous moss plants in the quadrat. A quadrat consisting of a 1-cm square would be far too small for mosses to be distributed in Poisson fashion. Similarly, a time span of 1 minute would be unrealistic for reporting new influenza cases in a town, but within 1 week a great many such cases could occur. (2) An occurrence of the event must be independent of prior occurrences within the sampling unit. Thus, the presence of one moss plant in a quadrat must not enhance or diminish the probability that other moss plants are developing in the quadrat. Similarly, the fact that one influenza case has been reported must not affect the probability of reporting subsequent influenza cases. Events that meet these conditions (rare and random events) should be distributed in Poisson fashion.

The purpose of fitting a Poisson distribution to numbers of rare events in nature is to test whether the events occur independently with respect to each other. If they do, they will follow the Poisson distribution. If the occurrence of one event enhances the probability of a second such event, we obtain a clumped, or contagious, distribution. If the occurrence of one event impedes that of a second such event in the sampling unit, we obtain a repulsed, or spatially or temporally uniform, distribution. The Poisson can be used as a test for randomness or independence of distribution not only spatially but also in time, as some examples below will show.

The Poisson distribution is named after the French mathematician Poisson, who described it in 1837. It is an infinite series whose terms add to 1 (as must be true for any probability distribution). The series can be represented as
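$$\frac{1}{e^{\mu}},\quad \frac{\mu}{1!\,e^{\mu}},\quad \frac{\mu^{2}}{2!\,e^{\mu}},\quad \frac{\mu^{3}}{3!\,e^{\mu}},\quad \frac{\mu^{4}}{4!\,e^{\mu}},\ \ldots \tag{4.2}$$

where the terms are the relative expected frequencies corresponding to the following counts of the rare event Y:

0, 1, 2, 3, 4, ...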
Thus, the first of these terms represents the relative expected frequency of samples containing no rare event; the second term, one rare event; the third term, two rare events; and so on. The denominator of each term contains $e^{\mu}$, where e is the base of the natural, or Napierian, logarithms, a constant whose value, accurate to 5 decimal places, is 2.718,28. We recognize $\mu$ as the parametric mean of the distribution; it is a constant for any given problem. The exclamation mark after the coefficient in the denominator means "factorial," as explained in the previous section.

One way to learn more about the Poisson distribution is to apply it to an actual case. At the top of Box 4.1 is a well-known result from the early statistical literature based on the distribution of yeast cells in 400 squares of a hemacytometer, a counting chamber such as is used in making counts of blood cells and other microscopic objects suspended in liquid. Column (1) lists the number of yeast cells observed in each hemacytometer square, and column (2) gives the observed frequency, that is, the number of squares containing a given number of yeast cells. We note that 75 squares contained no yeast cells, but that most squares held either 1 or 2 cells. Only 17 squares contained 5 or more yeast cells.

Why would we expect this frequency distribution to be distributed in Poisson fashion? We have here a relatively rare event, the frequency of yeast cells per hemacytometer square, the mean of which has been calculated and found to be 1.8. That is, on the average there are 1.8 cells per square.
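The terms of Expression (4.2) can be evaluated directly and compared with the binomial terms they approximate. The sketch below is an illustration under stated assumptions: it uses the binomial $(0.001 + 0.999)^{1000}$ mentioned earlier, for which kp = 1, and the function names are hypothetical.

```python
from math import comb, exp, factorial

def poisson_term(y, mu):
    """Relative expected frequency of Y rare events: mu^Y / (Y! e^mu), as in Expression (4.2)."""
    return mu**y / (factorial(y) * exp(mu))

def binomial_term(k, y, p):
    """Probability of exactly y rare events in a sample of k."""
    return comb(k, y) * p**y * (1 - p) ** (k - y)

k, p = 1000, 0.001
mu = k * p                      # Poisson mean matched to the binomial mean kp

print("Y  binomial  Poisson")
for y in range(5):
    print(y, round(binomial_term(k, y, p), 5), round(poisson_term(y, mu), 5))

# The terms of the series add to 1; 50 terms are far more than needed for mu = 1
print(round(sum(poisson_term(y, mu) for y in range(50)), 10))
```

Here p is below 0.1 and kp is below 5, so the rule of thumb is satisfied and the two columns should agree closely.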
Relative to the amount of space provided in each square and the number of cells that could have come to rest in any one square, the actual number found is low indeed. We might also expect that the occurrence of individual yeast cells in a square is independent of the occurrence of other yeast cells. This is a commonly encountered class of application of the Poisson distribution.

The mean of the rare event is the only quantity that we need to know to calculate the relative expected frequencies of a Poisson distribution. Since we do not know the parametric mean of the yeast cells in this problem, we employ an estimate (the sample mean) and calculate expected frequencies of a Poisson distribution with $\mu$ equal to the mean of the observed frequency distribution of Box 4.1. It is convenient for the purpose of computation to rewrite Expression (4.2) as a recursion formula as follows:

$$\hat{f}_i = \hat{f}_{i-1}\left(\frac{\bar{Y}}{i}\right) \quad \text{for } i = 1, 2, \ldots, \quad \text{where } \hat{f}_0 = e^{-\bar{Y}} \tag{4.3}$$

BOX 4.1

(1)                   (3)
Number of cells       Absolute expected
per square, Y         frequencies
0                     66.1
1                     119.0
2                     107.1
3                     64.3
4                     28.9
5                     10.4 \
6                     3.1   |
7                     0.8   |  14.5
8                     0.2   |
9                     0.0  /
Total                 399.9

[Column (2), observed frequencies: entries not recoverable.]

Computational steps

Flow of computation based on Expression (4.3) multiplied by n, since we wish to obtain absolute expected frequencies, $\hat{f}$.

1. Find $e^{\bar{Y}}$ in a table of exponentials or compute it using an exponential key: $e^{\bar{Y}} = e^{1.8} = 6.0496$
2. $\hat{f}_0 = n/e^{\bar{Y}} = 400/6.0496 = 66.12$
3. $\hat{f}_1 = \hat{f}_0\,\bar{Y} = 66.12(1.8) = 119.02$
4. $\hat{f}_2 = \hat{f}_1(\bar{Y}/2) = 119.02(1.8/2) = 107.11$
5. $\hat{f}_3 = \hat{f}_2(\bar{Y}/3) = 107.11(1.8/3) = 64.27$
6. $\hat{f}_4 = \hat{f}_3(\bar{Y}/4) = 64.27(1.8/4) = 28.92$
7. $\hat{f}_5 = \hat{f}_4(\bar{Y}/5) = 28.92(1.8/5) = 10.41$
8. $\hat{f}_6 = \hat{f}_5(\bar{Y}/6) = 10.41(1.8/6) = 3.12$
9. $\hat{f}_7 = \hat{f}_6(\bar{Y}/7) = 3.12(1.8/7) = 0.80$
10. $\hat{f}_8 = \hat{f}_7(\bar{Y}/8) = 0.80(1.8/8) = 0.18$

Total: 399.95
$\hat{f}_9$ and beyond: 0.05

At step 3 enter $\bar{Y}$ as a constant multiplier. Then multiply it by $n/e^{\bar{Y}}$ (quantity 2). At each subsequent step multiply the result of the previous step by $\bar{Y}$ and then divide by the appropriate integer.
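The recursion of Expression (4.3), scaled by n as in the computational steps of Box 4.1, can be sketched as follows. The function name `poisson_expected` is hypothetical; the inputs (mean 1.8, n = 400 squares) are those of the yeast cell example.

```python
from math import exp

def poisson_expected(mean, n, max_count):
    """Absolute expected frequencies for Y = 0..max_count via f_i = f_(i-1) * mean / i."""
    freqs = [n / exp(mean)]                  # first term: n / e^mean
    for i in range(1, max_count + 1):
        freqs.append(freqs[-1] * mean / i)   # multiply by the mean, divide by the integer
    return freqs

expected = poisson_expected(mean=1.8, n=400, max_count=9)
for y, f in enumerate(expected):
    print(y, round(f, 2))
print("total through Y = 9:", round(sum(expected), 2))
```

Because each term is built from the one before it, an error in any step carries through all later terms, which is one reason to let a program do the chain multiplication.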
Note first of all that in Expression (4.3) the parametric mean has been replaced by the sample mean. Each term developed by this recursion formula is mathematically exactly the same as its corresponding term in Expression (4.2). It is important to make no computational error, since in such a chain multiplication the correctness of each term depends on the accuracy of the term before it. Expression (4.3) yields relative expected frequencies. If, as is more usual, absolute expected frequencies are desired, simply set the first term $\hat{f}_0$ to $n/e^{\bar{Y}}$, where $n$ is the number of samples, and then proceed with the computational steps as before. The actual computation is illustrated in Box 4.1, and the expected frequencies so obtained are listed in column (3) of the frequency distribution.

What have we learned from this computation? When we compare the observed with the expected frequencies, we notice quite a good fit of our observed frequencies to a Poisson distribution of mean 1.8, although we have not as yet learned a statistical test for goodness of fit (this will be covered in Chapter 13). No clear pattern of deviations from expectation is shown. We cannot test a hypothesis about the mean, because the mean of the expected distribution was taken from the sample mean of the observed variates. As in the binomial distribution, clumping or aggregation would indicate that the probability that a second yeast cell will be found in a square is not independent of the presence of the first one, but is higher than the probability for the first cell. This would result in a clumping of the items in the classes at the tails of the distribution so that there would be some squares with larger numbers of cells than expected, others with fewer.

The biological interpretation of the dispersion pattern varies with the problem. The yeast cells seem to be randomly distributed in the counting chamber, indicating thorough mixing of the suspension. Red blood cells, on the other hand, will often stick together because of an electrical charge unless the proper suspension fluid is used. This so-called rouleaux effect would be indicated by clumping of the observed frequencies.

Note that in Box 4.1, as in the subsequent tables giving examples of the application of the Poisson distribution, we group the low frequencies at one tail of the curve, uniting them by means of a bracket. This tends to simplify the patterns of distribution somewhat. However, the main reason for this grouping is related to the G test for goodness of fit (of observed to expected frequencies), which is discussed in Section 13.2. For purposes of this test, no expected frequency $\hat{f}$ should be less than 5.

Before we turn to other examples, we need to learn a few more facts about the Poisson distribution. You probably noticed that in computing expected frequencies, we needed to know only one parameter, the mean of the distribution. By comparison, in the binomial distribution we needed two parameters, $p$ and $k$. Thus, the mean completely defines the shape of a given Poisson distribution.
From this it follows that the variance is some function of the mean. In a Poisson distribution, we have a very simple relationship between the two: $\mu = \sigma^2$, the variance being equal to the mean. The variance of the number of yeast cells per square based on the observed frequencies in Box 4.1 equals 1.965, not much larger than the mean of 1.8, indicating again that the yeast cells are distributed in Poisson fashion, hence randomly. This relationship between variance and mean suggests a rapid test of whether an observed frequency distribution is distributed in Poisson fashion even without fitting expected frequencies to the data. We simply compute a coefficient of dispersion

$$CD = \frac{s^2}{\bar{Y}}$$
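As a rough sketch of this computation, the coefficient of dispersion (taken here as the sample variance divided by the sample mean) can be obtained directly from a frequency table. The function name and the list layout are illustrative assumptions, not part of the text. The counts are those of the weevil example in Table 4.6, read as 61, 50, 1, 0, and 0 beans containing 0 to 4 weevils.

```python
def coefficient_of_dispersion(counts):
    """Return (mean, variance, CD); counts[y] is the frequency of samples with y events."""
    n = sum(counts)
    total = sum(y * f for y, f in enumerate(counts))
    total_sq = sum(y * y * f for y, f in enumerate(counts))
    mean = total / n
    # Sample variance with n - 1 in the denominator.
    variance = (total_sq - total ** 2 / n) / (n - 1)
    return mean, variance, variance / mean

weevils = [61, 50, 1, 0, 0]  # beans with 0, 1, 2, 3, 4 emerging weevils
mean, variance, cd = coefficient_of_dispersion(weevils)
print(f"mean = {mean:.4f}, variance = {variance:.4f}, CD = {cd:.3f}")
```

The result agrees with the mean of 0.4643 and the CD of 0.579 reported with Table 4.6.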
This coefficient will be near 1 in distributions that are essentially Poisson distributions, will be > 1 in clumped samples, and will be < 1 in cases of repulsion. In the yeast cell example, CD = 1.092.

The shapes of five Poisson distributions of different means are shown in Figure 4.3 as frequency polygons (a frequency polygon is formed by the line connecting successive midpoints in a bar diagram). We notice that for the low value of $\mu = 0.1$ the frequency polygon is extremely L-shaped, but with an increase in the value of $\mu$ the distributions become humped and eventually nearly symmetrical.

We conclude our study of the Poisson distribution with a consideration of two examples. The first example (Table 4.5) shows the distribution of a number
FIGURE 4.3 Frequency polygons of the Poisson distribution for various values of the mean. Abscissa: number of rare events per sample.
of accidents per woman from an accident record of 647 women working in a munitions factory during a five-week period. The sampling unit is one woman during this period. The rare event is the number of accidents that happened to a woman in this period. The coefficient of dispersion is 1.488, and this is clearly reflected in the observed frequencies, which are greater than expected in the tails and less than expected in the center. This relationship is easily seen in the deviations in the last column (observed minus expected frequencies) and shows a characteristic clumped pattern. The model assumes, of course, that the accidents are not fatal or very serious and thus do not remove the individual from further exposure. The noticeable clumping in these data probably arises
TABLE 4.5

Number of accidents per woman Y    Observed frequencies f
0                                  447
1                                  132
2                                   42
3
4
5+
Total                              647

$\bar{Y} = 0.4652$
TABLE 4.6
Azuki bean weevils (Callosobruchus chinensis) emerging from 112 Azuki beans (Phaseolus radiatus).

(1) Number of weevils emerging per bean Y    (2) Observed frequencies f
0                                             61
1                                             50
2                                              1
3                                              0
4                                              0
Total                                        112

$\bar{Y} = 0.4643$    CD = 0.579
Source: Utida (1943).
either because some women are accident-prone or because some women have more dangerous jobs than others. Using only information on the distributions of accidents, one cannot distinguish between the two alternatives, which suggest very different changes that should be made to reduce the numbers of accidents.

The second example (Table 4.6) is extracted from an experimental study of the effects of different densities of the Azuki bean weevil. Larvae of these weevils enter the beans, feed and pupate inside them, and then emerge through an emergence hole. Thus the number of holes per bean is a good measure of the number of adults that have emerged. The rare event in this case is the presence of the weevil in the bean. We note that the distribution is strongly repulsed. There are many more beans containing one weevil than the Poisson distribution would predict. A statistical finding of this sort leads us to investigate the biology of the phenomenon. In this case it was found that the adult female weevils tended to deposit their eggs evenly rather than randomly over the available beans. This prevented the placing of too many eggs on any one bean and precluded heavy competition among the developing larvae on any one bean. A contributing factor was competition among remaining larvae feeding on the same bean, in which generally all but one were killed or driven out. Thus, it is easily understood how the above biological phenomena would give rise to a repulsed distribution.

Exercises

4.1 The two columns below give fertility of eggs of the CP strain of Drosophila melanogaster raised in 100 vials of 10 eggs each (data from R. R. Sokal). Find the expected frequencies on the assumption of independence of mortality for
each egg in a vial. Use the observed mean. Calculate the expected variance and compare it with the observed variance. Interpret results, knowing that the eggs of each vial are siblings and that the different vials contain descendants from different parent pairs. ANS. $\sigma^2 = 2.417$, $s^2 = 6.636$. There is evidence that mortality rates are different for different vials.
Number of eggs hatched Y    Number of vials f
0                            1
1                            3
2                            8
3                           10
4                            6
5                           15
6                           14
7                           12
8                           13
9                            9
10                           9
4.2 In human beings the sex ratio of newborn infants is about 100 ♀♀ : 105 ♂♂. Were we to take 10,000 random samples of 6 newborn infants from the total population of such infants for one year, what would be the expected frequency of groups of 6 males, 5 males, 4 males, and so on?

4.3 The Army Medical Corps is concerned over the intestinal disease X. From previous experience it knows that soldiers suffering from the disease invariably harbor the pathogenic organism in their feces and that to all practical purposes every stool specimen from a diseased person contains the organism. However, the organisms are never abundant, and thus only 20% of all slides prepared by the standard procedure will contain some. (We assume that if an organism is present on a slide it will be seen.) How many slides should laboratory technicians be directed to prepare and examine per stool specimen, so that in case a specimen is positive, it will be erroneously diagnosed negative in fewer than 1% of the cases (on the average)? On the basis of your answer, would you recommend that the Corps attempt to improve its diagnostic methods? ANS. 21 slides.

4.4 Calculate Poisson expected frequencies for the frequency distribution given in Table 2.2 (number of plants of the sedge Carex flacca found in 500 quadrats).

4.5 A cross is made in a genetic experiment in Drosophila in which it is expected that $\frac{1}{4}$ of the progeny will have white eyes and $\frac{1}{2}$ will have the trait called "singed bristles." Assume that the two gene loci segregate independently. (a) What proportion of the progeny should exhibit both traits simultaneously? (b) If four flies are sampled at random, what is the probability that they will all be white-eyed? (c) What is the probability that none of the four flies will have either white eyes or "singed bristles"? (d) If two flies are sampled, what is the probability that at least one of the flies will have either white eyes or "singed bristles" or both traits? ANS. (a) $\frac{1}{8}$; (b) $(\frac{1}{4})^4$; (c) $[(1 - \frac{1}{4})(1 - \frac{1}{2})]^4$; (d) $1 - [(1 - \frac{1}{4})(1 - \frac{1}{2})]^2$.

4.6 Those readers who have had a semester or two of calculus may wish to try to prove that Expression (4.1) tends to Expression (4.2) as $k$ becomes indefinitely large (and $p$ becomes infinitesimal, so that $\mu = kp$ remains constant). HINT: $\left(1 - \frac{x}{n}\right)^n \to e^{-x}$ as $n \to \infty$.

4.7 If the frequency of the gene A is $p$ and the frequency of the gene a is $q$, what are the expected frequencies of the zygotes AA, Aa, and aa (assuming a diploid zygote represents a random sample of size 2)? What would the expected frequency be for an autotetraploid (for a locus close to the centromere a zygote can be thought of as a random sample of size 4)? ANS. $P\{AA\} = p^2$, $P\{Aa\} = 2pq$, $P\{aa\} = q^2$, for a diploid; and $P\{AAAA\} = p^4$, $P\{AAAa\} = 4p^3q$, $P\{AAaa\} = 6p^2q^2$, $P\{Aaaa\} = 4pq^3$, $P\{aaaa\} = q^4$, for a tetraploid.

4.8 Summarize and compare the assumptions and parameters on which the binomial and Poisson distributions are based.

4.9 A population consists of three types of individuals, $A_1$, $A_2$, and $A_3$, with relative frequencies of 0.5, 0.2, and 0.3, respectively. (a) What is the probability of obtaining only individuals of type $A_1$ in samples of size 1, 2, 3, ..., n? (b) What would be the probabilities of obtaining only individuals that were not of type $A_1$ or $A_2$ in a sample of size n? (c) What is the probability of obtaining a sample containing at least one representation of each type in samples of size 1, 2, 3, 4, 5, ..., n? ANS. (a) $\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots, 1/2^n$. (b) $(0.3)^n$. (c) 0, 0, 0.18, 0.36, 0.507, and for n:

$$\sum_{i=1}^{n-2} \sum_{j=1}^{n-i-1} \frac{n!}{i!\,j!\,(n-i-j)!}\,(0.5)^i (0.2)^j (0.3)^{n-i-j}$$

4.10 If the average number of weed seeds found in a ¼-ounce sample of grass seed is 1.1429, what would you expect the frequency distribution of weed seeds to be in ninety-eight ¼-ounce samples? (Assume there is random distribution of the weed seeds.)
CHAPTER 5
The Normal Probability Distribution
The theoretical frequency distributions in Chapter 4 were discrete. Their variables assumed values that changed in integral steps (that is, they were meristic variables). Thus, the number of infected insects per sample could be 0 or 1 or 2 but never an intermediate value between these. Similarly, the number of yeast cells per hemacytometer square is a meristic variable and requires a discrete probability function to describe it. However, most variables encountered in biology either are continuous (such as the aphid femur lengths or the infant birth weights used as examples in Chapters 2 and 3) or can be treated as continuous variables for most practical purposes, even though they are inherently meristic (such as the neutrophil counts encountered in the same chapters). Chapter 5 will deal more extensively with the distributions of continuous variables.

Section 5.1 introduces frequency distributions of continuous variables. In Section 5.2 we show one way of deriving the most common such distribution, the normal probability distribution. Then we examine its properties in Section 5.3. A few applications of the normal distribution are illustrated in Section 5.4. A graphic technique for pointing out departures from normality and for estimating mean and standard deviation in approximately normal distributions is given in Section 5.5, as are some of the reasons for departure from normality in observed frequency distributions.

5.1 Frequency distributions of continuous variables
FIGURE 5.1 A probability distribution of a continuous variable.
of the curve will approach the Y axis rapidly enough that the portion of the area beyond a certain point will for all practical purposes be zero and the frequencies it represents will be infinitesimal. We may fit continuous frequency distributions to some sets of meristic data (for example, the number of teeth in an organism). In such cases, we have reason to believe that underlying biological variables that cause differences in numbers of the structure are really continuous, even though expressed as a discrete variable. We shall now proceed to discuss the most important probability density function in statistics, the normal frequency distribution.
Half the animals would have intensity 1, the other half 0. With k = 2 factors present in the population (the factors are assumed to occur independently of each other), the distribution of pigmentation intensities would be represented by
FIGURE 5.2 Histogram for ten factors (abscissa: Y, from 0 to 10; ordinate: from 0 to 0.4).
the expansion of the binomial $(p + q)^2$:

{FF, Ff, ff}            pigmentation classes (probability space)
{0.25, 0.50, 0.25}      expected frequency
{2, 1, 0}               pigmentation intensity
One-fourth of the individuals would have pigmentation intensity 2; one-half, intensity 1; and the remaining fourth, intensity 0.

The number of classes in the binomial increases with the number of factors. The frequency distributions are symmetrical, and the expected frequencies at the tails become progressively less as k increases. The binomial distribution for k = 10 is graphed as a histogram in Figure 5.2 (rather than as a bar diagram, as it should be drawn). We note that the graph approaches the familiar bell-shaped outline of the normal frequency distribution (seen in Figures 5.3 and 5.4). Were we to expand the expression for k = 20, our histogram would be so close to a normal frequency distribution that we could not show the difference between the two on a graph the size of this page.

At the beginning of this procedure, we made a number of severe limiting assumptions for the sake of simplicity. What happens when these are removed? First, when $p \neq q$, the distribution also approaches normality as k approaches infinity. This is intuitively difficult to see, because when $p \neq q$, the histogram is at first asymmetrical. However, it can be shown that when k, p, and q are such that kpq > 3, the normal distribution will be closely approximated. Second, in a more realistic situation, factors would be permitted to occur in more than two states, one state making a large contribution, a second state a smaller contribution, and so forth. However, it can also be shown that the multinomial $(p + q + r + \cdots + z)^k$ approaches the normal frequency distribution as k approaches infinity.
Third, different factors may be present in different frequencies and may have different quantitative effects. As long as these are additive and independent, normality is still approached as k approaches infinity.

Lifting these restrictions makes the assumptions leading to a normal distribution compatible with innumerable biological situations. It is therefore not surprising that so many biological variables are approximately normally distributed.
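The binomial expansions used in this derivation are easy to reproduce. The following sketch is illustrative only: the function name `binomial_expansion` is an assumption, and p = q = 0.5 reflects the simplifying case described above. It prints the expected frequencies for k = 2 and k = 10 factors, and the quantity kpq that appears in the rule of thumb for unequal p and q.

```python
import math

def binomial_expansion(k, p=0.5):
    """Expected relative frequencies of 0, 1, ..., k factors from (p + q)^k."""
    q = 1.0 - p
    return [math.comb(k, y) * p ** y * q ** (k - y) for y in range(k + 1)]

two_factors = binomial_expansion(2)   # pigmentation intensities 0, 1, 2
ten_factors = binomial_expansion(10)  # the case graphed in Figure 5.2

print("k = 2: ", [round(f, 4) for f in two_factors])
print("k = 10:", [round(f, 4) for f in ten_factors])
print("kpq for k = 10, p = q = 0.5:", 10 * 0.5 * 0.5)
```

With equal p and q the frequencies are symmetrical about k/2, and the tail classes shrink rapidly as k grows (each tail class is 1/1024 for k = 10).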
Let us summarize the conditions that tend to produce normal frequency distributions: (1) that there be many factors; (2) that these factors be independent in occurrence; (3) that the factors be independent in effect, that is, that their effects be additive; and (4) that they make equal contributions to the variance. The fourth condition we are not yet in a position to discuss; we mention it here only for completeness. It will be discussed in Chapter 7.
$$Z = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(Y - \mu)^2 / 2\sigma^2} \tag{5.1}$$
Here $Z$ indicates the height of the ordinate of the curve, which represents the density of the items. It is the dependent variable in the expression, being a function of the variable $Y$. There are two constants in the equation: $\pi$, well known to be approximately 3.14159, making $1/\sqrt{2\pi}$ approximately 0.39894, and $e$, the base of the natural logarithms, whose value approximates 2.71828.

There are two parameters in a normal probability density function. These are the parametric mean $\mu$ and the parametric standard deviation $\sigma$, which determine the location and shape of the distribution. Thus, there is not just one normal distribution, as might appear to the uninitiated who keep encountering the same bell-shaped image in textbooks. Rather, there are an infinity of such curves, since these parameters can assume an infinity of values. This is illustrated by the three normal curves in Figure 5.3, representing the same total frequencies.
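A minimal sketch of Expression (5.1) in Python may help make the roles of the two constants and the two parameters concrete. It assumes the expression is the usual normal density with parametric mean and standard deviation; the function name `normal_density` and the parameter values for the two example curves are hypothetical and are not taken from Figure 5.3.

```python
import math

def normal_density(y, mu, sigma):
    """Height Z of the normal curve at y, following Expression (5.1)."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((y - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# With mu = 0 and sigma = 1 the peak height is 1 / sqrt(2 * pi).
print(round(normal_density(0.0, mu=0.0, sigma=1.0), 5))

# Hypothetical curves with the same mean; the second has half the standard deviation.
peak_wide = normal_density(8.0, mu=8.0, sigma=1.0)
peak_narrow = normal_density(8.0, mu=8.0, sigma=0.5)
print(peak_wide, peak_narrow)
```

Halving the standard deviation doubles the height at the mean, which is why a curve with a smaller standard deviation looks narrower and taller when the total frequency is the same.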
Curves A and B differ in their locations and hence represent populations with different means. Curves B and C represent populations that have identical means but different standard deviations. Since the standard deviation of curve C is only half that of curve B, it presents a much narrower appearance.

In theory, a normal frequency distribution extends from negative infinity to positive infinity along the axis of the variable (labeled Y, although it is frequently the abscissa). This means that a normally distributed variable can assume any value, however large or small, although values farther from the mean than plus or minus three standard deviations are quite rare, their relative expected frequencies being very small. This can be seen from Expression (5.1). When $Y$ is very large or very small, the term $(Y - \mu)^2/2\sigma^2$ will necessarily become very large. Hence $e$ raised to the negative power of that term will be very small, and $Z$ will therefore be very small.

The curve is symmetrical around the mean. Therefore, the mean, median, and mode of the normal distribution are all at the same point. The following percentages of items in a normal frequency distribution lie within the indicated limits:

$\mu \pm \sigma$ contains 68.27% of the items
$\mu \pm 2\sigma$ contains 95.45% of the items
$\mu \pm 3\sigma$ contains 99.73% of the items

Conversely,

50% of the items fall in the range $\mu \pm 0.674\sigma$
95% of the items fall in the range $\mu \pm 1.960\sigma$
99% of the items fall in the range $\mu \pm 2.576\sigma$

These relations are shown in Figure 5.4.

How have these percentages been calculated? The direct calculation of any portion of the area under the normal curve requires an integration of the function shown as Expression (5.1).
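The percentages listed above can be checked numerically. The sketch below is not the method used in the text; it relies on the standard identity that the cumulative normal distribution equals ½[1 + erf(z/√2)], where erf is the error function available in Python's `math` module. The function names are illustrative.

```python
import math

def normal_cdf(z):
    """Proportion of a standard normal distribution lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_within(z):
    """Proportion of items within z standard deviations on either side of the mean."""
    return normal_cdf(z) - normal_cdf(-z)

for z in (1.0, 2.0, 3.0, 0.674, 1.960, 2.576):
    print(f"mean +/- {z:5.3f} sd contains {100 * area_within(z):6.2f}% of the items")
```

Because the curve is symmetrical, `area_within(z)` is simply twice the area between the mean and a point z standard deviations above it.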
Fortunately, for those of you who do not know calculus (and even for those of you who do) the integration has already been carried out and is presented in an alternative form of the normal distribution: the normal distribution function (the theoretical cumulative distribution function of the normal probability density function), also shown in Figure 5.4. It gives the total frequency from negative infinity up to any point along the abscissa. We can therefore look up directly the probability that an observation will be less than a specified value of Y. For example, Figure 5.4 shows that the total frequency up to the mean is 50.00% and the frequency up to a point one standard deviation below the mean is 15.87%. These frequencies are found, graphically, by raising a vertical line from a point, such as $-\sigma$, until it intersects the cumulative distribution curve, and then reading the frequency (15.87%) off the ordinate. The probability that an observation will fall between two arbitrary points can be found by subtracting the probability that an observation will fall below the
FIGURE 5.4 (areas marked on the figure include 95.45% and 99.73%)
lower point from the probability that an observation will fall below the upper point. For example, we can see from Figure 5.4 that the probability that an observation will fall between the mean and a point one standard deviation below the mean is 0.5000 − 0.1587 = 0.3413.

The normal distribution function is tabulated in Table II in Appendix A2, "Areas of the normal curve," where, for convenience in later calculations, 0.5 has been subtracted from all of the entries. This table therefore lists the proportion of the area between the mean and any point a given number of standard deviations above it. Thus, for example, the area between the mean and the point 0.50 standard deviations above the mean is 0.1915 of the total area of the curve. Similarly, the area between the mean and the point 2.64 standard deviations above the mean is 0.4959 of the curve. A point 4.0 standard deviations from the mean includes 0.499968 of the area between it and the mean. However, since the normal distribution extends from negative to positive infinity, one needs
to go an infinite distance from the mean to reach an area of 0.5. The use of the table of areas of the normal curve will be illustrated in the next section.

A sampling experiment will give you a "feel" for the distribution of items sampled from a normal distribution.

Experiment 5.1. You are asked to sample from two populations. The first one is an approximately normal frequency distribution of 100 wing lengths of houseflies. The second population deviates strongly from normality. It is a frequency distribution of the total annual milk yield of 100 Jersey cows. Both populations are shown in Table 5.1. You are asked to sample from them repeatedly in order to simulate sampling from an infinite population. Obtain samples of 35 items from each of the two populations. This can be done by obtaining two sets of 35 two-digit random numbers from the table of random numbers (Table I), with which you became familiar in Experiment 4.1. Write down the random numbers in blocks of five, and copy next to them the value of Y (for either wing length or milk yield) corresponding to the random number. An example of such a block of five numbers and the computations required for it are shown in the listing that follows Table 5.1.
TABLE 5.1
Populations of wing lengths and milk yields. Column 1. Rank number. Column 2. Lengths (in mm × 10⁻¹) of 100 wings of houseflies arrayed in order of magnitude; $\mu = 45.5$, $\sigma^2 = 15.21$, $\sigma = 3.90$; distribution approximately normal. Column 3. Total annual milk yield (in hundreds of pounds) of 100 two-year-old registered Jersey cows arrayed in order of magnitude; $\mu = 66.61$, $\sigma^2 = 124.4779$, $\sigma = 11.1597$; distribution departs strongly from normality.

(1) 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20
(2) 36 37 38 38 39 39 40 40 40 40 41 41 41 41 41 41 42 42 42 42
(3) 51 51 51 53 53 53 54 55 55 56 56 56 57 57 57 57 57 57 57 57

(1) 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
(2) 42 42 42 43 43 43 43 43 43 43 43 44 44 44 44 44 44 44 44 44
(3) 58 58 58 58 58 58 58 58 58 58 58 59 59 59 60 60 60 60 60 61

(1) 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
(2) 45 45 45 45 45 45 45 45 45 45 46 46 46 46 46 46 46 46 46 46
(3) 61 61 61 61 61 62 62 62 62 63 63 63 64 65 65 65 65 65 67 67

(1) 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
(2) 47 47 47 47 47 47 47 47 47 48 48 48 48 48 48 48 48 49 49 49
(3) 67 67 68 68 69 69 69 69 69 69 70 72 73 73 74 74 74 74 75 76

(1) 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00
(2) 49 49 49 49 50 50 50 50 50 50 51 51 51 51 52 52 53 53 54 55
(3) 76 76 79 80 80 81 82 82 82 82 83 85 87 88 88 89 93 94 96 98

Source: Column 2, data adapted from Sokal and Hunter (1955). Column 3, data from Canadian government records.
Random number    Wing length Y
16               41
59               46
99               54
36               44
21               42

$\sum Y = 227$
$\sum Y^2 = 10{,}413$
$\bar{Y} = 45.4$
Those with ready access to a computer may prefer to program this exercise and take many more samples. These samples and the computations carried out for each sample will be used in subsequent chapters. Therefore, preserve your data carefully!

In this experiment, consider the 35 variates for each variable as a single sample, rather than breaking them down into groups of five. Since the true mean and standard deviation ($\mu$ and $\sigma$) of the two distributions are known, you can calculate the expression $(Y_i - \mu)/\sigma$ for each variate $Y_i$. Thus, for the first housefly wing length sampled above, you compute

$$\frac{41 - 45.5}{3.90} = -1.1538$$
This means that the first wing length is 1.1538 standard deviations below the true mean of the population. The deviation from the mean measured in standard deviation units is called a standardized deviate or standard deviate. The arguments of Table II, expressing distance from the mean in units of $\sigma$, are called standard normal deviates. Group all 35 variates in a frequency distribution; then do the same for milk yields. Since you know the parametric mean and standard deviation, you need not compute each deviate separately, but can simply write down class limits in terms of the actual variable as well as in standard deviation form. The class limits for such a frequency distribution are shown in Table 5.2. Combine the results of your sampling with those of your classmates and study the percentage of the items in the distribution one, two, and three standard deviations to each side of the mean. Note the marked differences in distribution between the housefly wing lengths and the milk yields.
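For readers who prefer to program Experiment 5.1, the following sketch shows one possible approach for the wing lengths. It is an illustration under stated assumptions: the names `WING_COUNTS`, `variate_for`, `standard_deviate`, and `draw_sample` are hypothetical, the population is column 2 of Table 5.1 stored as counts per value, a pseudorandom generator stands in for the table of random numbers, and the seed is arbitrary.

```python
import random

# Wing-length population of Table 5.1, column 2, stored as value -> count.
WING_COUNTS = {36: 1, 37: 1, 38: 2, 39: 2, 40: 4, 41: 6, 42: 7, 43: 8, 44: 9,
               45: 10, 46: 10, 47: 9, 48: 8, 49: 7, 50: 6, 51: 4, 52: 2,
               53: 2, 54: 1, 55: 1}
WINGS = [value for value, count in sorted(WING_COUNTS.items()) for _ in range(count)]
MU, SIGMA = 45.5, 3.90  # parametric mean and standard deviation from Table 5.1

def variate_for(random_number):
    """Map a two-digit random number to a variate; 00 stands for rank 100."""
    rank = 100 if random_number == 0 else random_number
    return WINGS[rank - 1]

def standard_deviate(y, mu=MU, sigma=SIGMA):
    """Deviation from the parametric mean in standard deviation units."""
    return (y - mu) / sigma

def draw_sample(size, rng):
    """Draw `size` variates with replacement using random numbers 00 to 99."""
    return [variate_for(rng.randrange(100)) for _ in range(size)]

# The block of five random numbers worked through in the text.
block = [variate_for(r) for r in (16, 59, 99, 36, 21)]
print(block, sum(block), sum(y * y for y in block), sum(block) / len(block))
print(round(standard_deviate(block[0]), 4))

# One sample of 35 variates and its standard deviates (the seed is arbitrary).
sample = draw_sample(35, random.Random(1))
deviates = [standard_deviate(y) for y in sample]
```

Sampling with replacement is what lets a population of 100 ranked values stand in for an infinite population, as the experiment requires.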
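TABLE 5.2
Table for recording frequency distributions of standard deviates $(Y_i - \mu)/\sigma$ for samples of Experiment 5.1.

Class limits, in standard deviation units, for both variables: $-\infty$, $-3\sigma$, $-2\tfrac{1}{2}\sigma$, $-2\sigma$, $-1\tfrac{1}{2}\sigma$, $-\sigma$, $-\tfrac{1}{2}\sigma$, $\mu$, $\tfrac{1}{2}\sigma$, $\sigma$, $1\tfrac{1}{2}\sigma$, $2\sigma$, $2\tfrac{1}{2}\sigma$, $3\sigma$, $+\infty$.

Wing lengths ($\mu = 45.5$). Variates falling between these limits: 36, 37 | 38, 39 | 40, 41 | 42, 43 | 44, 45 | 46, 47 | 48, 49 | 50, 51 | 52, 53 | 54, 55

Milk yields ($\mu = 66.61$). Variates falling between these limits: 51-55 | 56-61 | 62-66 | 67-72 | 73-77 | …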
1. We sometimes have to know whether a given sample is normally distributed before we can apply a certain test to it. To test whether a given sample is normally distributed, we have to calculate expected frequencies for a normal curve of the same mean and standard deviation using the table of areas of the normal curve. In this book we shall employ only approximate graphic methods for testing normality. These are featured in the next section.

2. Knowing whether a sample is normally distributed may confirm or reject certain underlying hypotheses about the nature of the factors affecting the phenomenon studied. This is related to the conditions making for normality in a frequency distribution, discussed in Section 5.2. Thus, if we find a given variable to be normally distributed, we have no reason for rejecting the hypothesis that the causal factors affecting the variable are additive and independent and of equal variance. On the other hand, when we find departure from normality, this may indicate certain forces, such as selection, affecting the variable under study. For instance, bimodality may indicate a mixture
of observations from two populations. Skewness of milk yield data may indicate that these are records of selected cows and substandard milk cows have not been included in the record.

3. If we assume a given distribution to be normal, we may make predictions and tests of given hypotheses based upon this assumption. (An example of such an application follows.)

You will recall the birth weights of male Chinese children, illustrated in Box 3.2. The mean of this sample of 9465 birth weights is 109.9 oz, and its standard deviation is 13.593 oz. If you sample at random from the birth records of this population, what is your chance of obtaining a birth weight of 151 oz or heavier? Such a birth weight is considerably above the mean of our sample, the difference being 151 − 109.9 = 41.1 oz. However, we cannot consult the table of areas of the normal curve with a difference in ounces. We must express it in standardized units, that is, divide it by the standard deviation to convert it into a standard deviate. When we divide the difference by the standard deviation, we obtain 41.1/13.593 = 3.02. This means that a birth weight of 151 oz is 3.02 standard deviation units greater than the mean. Assuming that the birth weights are normally distributed, we may consult the table of areas of the normal curve (Table II), where we find a value of 0.4987 for 3.02 standard deviations. This means that 49.87% of the area of the curve lies between the mean and a point 3.02 standard deviations from it.
Conversely, 0.0013, or 0.13%, of the area lies beyond 3.02 standard deviation units above the mean. Thus, assuming a normal distribution of birth weights and a value of σ = 13.593, only 0.13%, or 13 out of 10,000, of the infants would have a birth weight of 151 oz or farther from the mean. It is quite improbable that a single sampled item from that population would deviate by so much from the mean, and if such a random sample of one weight were obtained from the records of an unspecified population, we might be justified in doubting whether the observation did in fact come from the population known to us.

The above probability was calculated from one tail of the distribution. We found the probability that an individual would be greater than the mean by 3.02 or more standard deviations. If we are not concerned whether the individual is either heavier or lighter than the mean but wish to know only how different the individual is from the population mean, an appropriate question would be: Assuming that the individual belongs to the population, what is the probability of observing a birth weight of an individual deviant by a certain amount from the mean in either direction? That probability must be computed by using both tails of the distribution. The previous probability can be simply doubled, since the normal curve is symmetrical. Thus, 2 × 0.0013 = 0.0026. This, too, is so small that we would conclude that a birth weight as deviant as 151 oz is unlikely to have come from the population represented by our sample of male Chinese children.
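The calculation above can be checked numerically. What follows is a minimal sketch in Python, assuming that the standard-library class `statistics.NormalDist` may stand in for the table of areas of the normal curve (Table II); the function and variable names are illustrative and do not come from the text.

```python
from statistics import NormalDist

# Sample statistics quoted in the text (birth weights in oz).
MEAN_OZ = 109.9
SD_OZ = 13.593


def standard_deviate(y, mean=MEAN_OZ, sd=SD_OZ):
    """Express the deviation of y from the mean in standard deviation units."""
    return (y - mean) / sd


def upper_tail_area(z):
    """Area of the standard normal curve lying beyond z (one tail only)."""
    return 1.0 - NormalDist().cdf(z)


z = standard_deviate(151.0)
one_tail = upper_tail_area(z)
# Doubling is valid only because the normal curve is symmetrical.
two_tail = 2.0 * one_tail

print(f"standard deviate:       {z:.2f}")
print(f"one-tailed probability: {one_tail:.4f}")
print(f"two-tailed probability: {two_tail:.4f}")
```

Because the code keeps more decimal places of the standard deviate than the rounded value 3.02 used with the table, the printed areas may differ in the last digit from 0.0013 and 0.0026.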
We can learn one more important point from this example. Our assumption has been that the birth weights are normally distributed. Inspection of the
frequency distribution in Box 3.2, however, shows clearly that the distribution is asymmetrical, tapering to the right. Though there are eight classes above the mean class, there are only six classes below the mean class. In view of this asymmetry, conclusions about one tail of the distribution would not necessarily pertain to the second tail. We calculated that 0.13% of the items would be found beyond 3.02 standard deviations above the mean, which corresponds to 151 oz. In fact, our sample contains 20 items (14 + 5 + 1) beyond the 147.5-oz class, the upper limit of which is 151.5 oz, almost the same as the single birth weight. However, 20 items of the 9465 of the sample is approximately 0.21%, more than the 0.13% expected from the normal frequency distribution. Although it would still be improbable to find a single birth weight as heavy as 151 oz in the sample, conclusions based on the assumption of normality might be in error if the exact probability were critical for a given test. Our statistical conclusions are only as valid as our assumptions about the population from which the samples are drawn.
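The comparison between the observed and the expected tail can be sketched in a few lines of Python. This is only an illustration under the normality assumption discussed above; `NormalDist` from the standard library stands in for the table of areas, and the variable names are hypothetical.

```python
from statistics import NormalDist

N_ITEMS = 9465
# Frequencies of the three classes lying beyond the 147.5-oz class (from the text).
upper_class_counts = [14, 5, 1]

# Fraction of the sample actually observed in the upper tail.
observed_fraction = sum(upper_class_counts) / N_ITEMS

# Fraction expected beyond 151 oz if birth weights were normally distributed
# with the sample mean and standard deviation.
expected_fraction = 1.0 - NormalDist(mu=109.9, sigma=13.593).cdf(151.0)

print(f"observed in upper tail:   {100 * observed_fraction:.2f}%")
print(f"expected under normality: {100 * expected_fraction:.2f}%")
```

The expected percentage is computed from the unrounded standard deviate, so it may differ in the last digit from the 0.13% read from the table; the observed tail is heavier in either case.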
5.5 Departures from normality: Graphic methods

In many cases an observed frequency distribution will depart obviously from normality. We shall emphasize two types of departure from normality. One is skewness, which is another name for asymmetry; skewness means that one tail of the curve is drawn out more than the other. In such curves the mean and the median will not coincide. Curves are said to be skewed to the right or left, depending upon whether the right or left tail is drawn out.

The other type of departure from normality is kurtosis, or "peakedness" of a curve. A leptokurtic curve has more items near the mean and at the tails, with fewer items in the intermediate regions relative to a normal distribution with the same mean and variance. A platykurtic curve has fewer items at the mean and at the tails than the normal curve but has more items in intermediate regions. A bimodal distribution is an extreme platykurtic distribution.

Graphic methods have been developed that examine the shape of an observed distribution for departures from normality. These methods also permit estimates of the mean and standard deviation of the distribution without computation.

The graphic methods are based on a cumulative frequency distribution. In Figure 5.4 we saw that a normal frequency distribution graphed in cumulative fashion describes an S-shaped curve, called a sigmoid curve. In Figure 5.5 the ordinate of the sigmoid curve is given as relative frequencies expressed as percentages. The slope of the cumulative curve reflects changes in height of the frequency distribution on which it is based.
Thus the steep middle segment of the cumulative normal curve corresponds to the relatively greater height of the normal curve around its mean.

The ordinate in Figures 5.4 and 5.5 is in linear scale, as is the abscissa in Figure 5.4. Another possible scale is the normal probability scale (often simply called probability scale), which can be generated by dropping perpendiculars from the cumulative normal curve, corresponding to given percentages on the ordinate, to the abscissa (as shown in Figure 5.5). The scale represented by the abscissa compensates for the nonlinearity of the cumulative normal curve. It contracts the scale around the median and expands it at the low and high cumulative percentages. This scale can be found on arithmetic or normal probability graph paper (or simply probability graph paper), which is generally available. Such paper usually has the long edge graduated in probability scale, while the short edge is in linear scale. Note that there are no 0% or 100% points on the ordinate. These points cannot be shown, since the normal frequency distribution extends from negative to positive infinity, and thus however long we made our line we would never reach the limiting values of 0% and 100%.

[FIGURE 5.5: abscissa labeled "Cumulative percent in probability scale"; graphic not reproduced.]

If we graph a cumulative normal distribution with the ordinate in normal probability scale, it will lie exactly on a straight line. Figure 5.6A shows such a graph drawn on probability paper, while the other parts of Figure 5.6 show a series of frequency distributions variously departing from normality. These are graphed both as ordinary frequency distributions with density on a linear scale (ordinate not shown) and as cumulative distributions as they would appear on probability paper. They are useful as guidelines for examining the distributions of data on probability paper.

[FIGURE 5.6: panel title "Normal distributions"; graphic not reproduced.]

Box 5.1 shows you how to use probability paper to examine a frequency distribution for normality and to obtain graphic estimates of its mean and standard deviation. The method works best for fairly large samples (n > 50). The method does not permit the plotting of the last cumulative frequency, 100%,
BOX 5.1
Graphic test for normality of a frequency distribution and estimate of mean and standard deviation. Use of arithmetic probability paper.

Birth weights of male Chinese in ounces, from Box 3.2.

(1)          (2)            (3)     (4)            (5)
Class mark   Upper class            Cumulative     Percent cumulative
Y            limit          f       frequencies F  frequencies

 59.5         63.5             2        2            0.02
 67.5         71.5             6        8            0.08
 75.5         79.5            39       47            0.50
 83.5         87.5           385      432            4.6
 91.5         95.5           888     1320           13.9
 99.5        103.5          1729     3049           32.2
107.5        111.5          2240     5289           55.9
115.5        119.5          2007     7296           77.1
123.5        127.5          1233     8529           90.1
131.5        135.5           641     9170           96.9
139.5        143.5           201     9371           99.0
147.5        151.5            74     9445           99.79
155.5        159.5            14     9459           99.94
163.5        167.5             5     9464           99.99
171.5        175.5             1     9465          100.0
Computational steps

1. Prepare a frequency distribution as shown in columns (1), (2), and (3).
2. Form a cumulative frequency distribution as shown in column (4). It is obtained by successive summation of the frequency values. In column (5) express the cumulative frequencies as percentages of total sample size n, which is 9465 in this example. These percentages are 100 times the values of column (4) divided by 9465.
3. Graph the upper class limit of each class along the abscissa (in linear scale) against percent cumulative frequency along the ordinate (in probability scale) on normal probability paper (see Figure 5.7). A straight line is fitted to the points by eye, preferably using a transparent plastic ruler, which permits all the points to be seen as the line is drawn. In drawing the line, most weight should be given to the points between cumulative frequencies of 25% and 75%. This is because a difference of a single item may make appreciable changes in the percentages at the tails. We notice that the upper frequencies deviate to the right of the straight line. This is typical of data that are skewed to the right (see Figure 5.6D).
4. Such a graph permits the rapid estimation of the mean and standard deviation of a sample. The mean is approximated by a graphic estimation of the median. The more normal the distribution is, the closer the mean will be to the median. The median is estimated by dropping a perpendicular from the intersection of the 50% point on the ordinate and the cumulative frequency curve to the abscissa (see Figure 5.7). The estimate of the mean of 110.7 oz is quite close to the computed mean of 109.9 oz.
5. The standard deviation can be estimated by dropping similar perpendiculars from the intersections of the 15.9% and the 84.1% points with the cumulative curve, respectively. These points enclose the portion of a normal curve represented by μ ± σ. By measuring the difference between these perpendiculars and dividing this by 2, we obtain an estimate of one standard deviation. In this instance the estimate is s = 13.6, since the difference is 27.2 oz divided by 2. This is a close approximation to the computed value of 13.59 oz.

[FIGURE 5.7: the birth weight data of Box 5.1 graphed on normal probability paper; graphic not reproduced.]
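The steps of Box 5.1 have a simple numerical analogue, sketched below in Python. It is not the graphic method itself: instead of fitting a straight line by eye, the sketch converts the cumulative percentages to standard deviates with `statistics.NormalDist().inv_cdf` (an assumed stand-in for the probability scale of the paper) and interpolates linearly between neighboring points. The function names are hypothetical, and the first cumulative frequency (2) is inferred from the 0.02% entry of the box.

```python
from statistics import NormalDist

# Upper class limits (oz) and cumulative frequencies F, as in Box 5.1.
upper_limits = [63.5, 71.5, 79.5, 87.5, 95.5, 103.5, 111.5, 119.5,
                127.5, 135.5, 143.5, 151.5, 159.5, 167.5, 175.5]
cumulative_f = [2, 8, 47, 432, 1320, 3049, 5289, 7296,
                8529, 9170, 9371, 9445, 9459, 9464, 9465]
n = cumulative_f[-1]


def probit_points(limits, cum_freqs, total):
    """Return (limit, standard deviate) pairs for plotting on a probability scale.

    The 100% point is skipped because it corresponds to an infinite
    distance from the mean and cannot be plotted.
    """
    std = NormalDist()
    return [(x, std.inv_cdf(f / total))
            for x, f in zip(limits, cum_freqs) if f < total]


def limit_at(points, proportion):
    """Abscissa at which the cumulative curve reaches the given proportion,
    found by linear interpolation between the two neighboring points."""
    target = NormalDist().inv_cdf(proportion)
    for (x0, z0), (x1, z1) in zip(points, points[1:]):
        if z0 <= target <= z1:
            return x0 + (x1 - x0) * (target - z0) / (z1 - z0)
    raise ValueError("proportion lies outside the plotted range")


points = probit_points(upper_limits, cumulative_f, n)
median_estimate = limit_at(points, 0.50)
# Half the distance between the 15.9% and 84.1% points estimates one standard deviation.
sd_estimate = (limit_at(points, 0.841) - limit_at(points, 0.159)) / 2

print(f"estimated median (approximates the mean): {median_estimate:.1f} oz")
print(f"estimated standard deviation:             {sd_estimate:.1f} oz")
```

Because the interpolation uses only the two points adjacent to each percentage rather than a line fitted to all the points, its estimates need not coincide with the graphic values of 110.7 oz and 13.6 oz.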
since that corresponds to an infinite distance from the mean. If you are interested in plotting all observations, you can plot, instead of cumulative frequencies F, the quantity F − ½ expressed as a percentage of n.

Often it is desirable to compare observed frequency distributions with their expectations without resorting to cumulative frequency distributions. One method of doing so would be to superimpose a normal curve on the histogram of an observed frequency distribution. Fitting a normal distribution as a curve superimposed upon an observed frequency distribution in the form of a histogram is usually done only when graphic facilities (plotters) are available. Ordinates are computed by modifying Expression (5.1) to conform to a frequency distribution:

$$\hat{Z} = \frac{ni}{s\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{Y - \bar{Y}}{s}\right)^{2}} \tag{5.2}$$

In this expression n is the sample size and i is the class interval of the frequency distribution. If this needs to be done without a computer program, a table of ordinates of the normal curve is useful. In Figure 5.8A we show the frequency distribution of birth weights of male Chinese from Box 5.1 with the ordinates of the normal curve superimposed. There is an excess of observed frequencies at the right tail due to the skewness of the distribution.

You will probably find it difficult to compare the heights of bars against the arch of a curve. For this reason, John Tukey has suggested that the bars of the histograms be suspended from the curve. Their departures from expectation can then be easily observed against the straight-line abscissa of the graph. Such a hanging histogram is shown in Figure 5.8B for the birth weight data. The departure from normality is now much clearer.

Because important departures are frequently noted in the tails of a curve, it has been suggested that square roots of expected frequencies should be compared with the square roots of observed frequencies. Such a "hanging rootogram" is shown in Figure 5.8C for the Chinese birth weight data. Note the accentuation of the departure from normality. Finally, one can also use an analogous technique for comparing expected with observed histograms. Figure 5.8D shows the same data plotted in this manner. Square roots of frequencies are again shown. The excess of observed over expected frequencies in the right tail of the distribution is quite evident.
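The ordinates needed for a superimposed normal curve can be computed as sketched below. The sketch assumes that Expression (5.2) scales the normal ordinate by the sample size n and the class interval i, as the definitions in the text indicate; the function name is hypothetical, and the class marks are taken to be 8 oz apart starting at 59.5 oz, as in Box 5.1.

```python
import math


def expected_frequency(y, mean, sd, n, class_interval):
    """Ordinate of the normal curve scaled to a frequency distribution.

    Assumed form of Expression (5.2): n * i / (s * sqrt(2 * pi)) * exp(-z**2 / 2),
    where z is the standard deviate of the class mark y.
    """
    z = (y - mean) / sd
    scale = n * class_interval / (sd * math.sqrt(2.0 * math.pi))
    return scale * math.exp(-0.5 * z * z)


# Birth weight statistics quoted in the text (oz).
N, MEAN, SD, INTERVAL = 9465, 109.9, 13.593, 8.0
class_marks = [59.5 + INTERVAL * k for k in range(15)]

expected = [expected_frequency(y, MEAN, SD, N, INTERVAL) for y in class_marks]

# The square roots are the quantities compared in a hanging rootogram.
for y, f_hat in zip(class_marks, expected):
    print(f"{y:6.1f}  expected f = {f_hat:8.1f}  sqrt = {math.sqrt(f_hat):6.2f}")
```

Subtracting each observed frequency (or its square root) from the corresponding expected value gives the departures displayed by a hanging histogram (or hanging rootogram).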
Exercises
5.1 Using the information given in Box 3.2, what is the probability of obtaining an individual with a negative birth weight? What is this probability if we assume that birth weights are normally distributed? ANS. The empirical estimate is zero. If a normal distribution can be assumed, it is the probability that a standard normal deviate is less than (0 − 109.9)/13.593 = −8.085. This value is beyond the range of most tables, and the probability can be considered zero for practical purposes.
5.2 Carry out the operations listed in Exercise 5.1 on the transformed data generated in Exercise 2.6.

5.3 Assume you know that the petal length of a population of plants of species X is normally distributed with a mean of μ = 3.2 cm and a standard deviation of σ = 1.8. What proportion of the population would be expected to have a petal length (a) greater than 4.5 cm? (b) Greater than 1.78 cm? (c) Between 2.9 and 3.6 cm? ANS. (a) = 0.2353, (b) = 0.7845, and (c) = 0.154.

5.4 Perform a graphic analysis of the butterfat data given in Exercise 3.3, using probability paper. In addition, plot the data on probability paper with the abscissa in logarithmic units. Compare the results of the two analyses.

5.5 Assume that traits A and B are independent and normally distributed with parameters μ_A = 28.6, σ_A = 4.8, μ_B = 16.2, and σ_B = 4.1. You sample two individuals at random. (a) What is the probability of obtaining samples in which both individuals measure less than 20 for the two traits? (b) What is the probability that at least one of the individuals is greater than 30 for trait B? ANS. (a) P{A < 20}P{B < 20} = (0.3654)(0.08238) = 0.030; (b) 1 − (P{A < 30})(P{B < 30}) = 1 − (0.6147)(0.9960) = 0.3856.

5.6 Perform the following operations on the data of Exercise 2.4. (a) If you have not already done so, make a frequency distribution from the data and graph the results in the form of a histogram. (b) Compute the expected frequencies for each of the classes based on a normal distribution with μ = Ȳ and σ = s. (c) Graph the expected frequencies in the form of a histogram and compare them with the observed frequencies. (d) Comment on the degree of agreement between observed and expected frequencies.

5.7 Let us approximate the observed frequencies in Exercise 2.9 with a normal frequency distribution. Compare the observed frequencies with those expected when a normal distribution is assumed. Compare the two distributions by forming and superimposing the observed and the expected histograms and by using a hanging histogram. ANS. The expected frequencies for the age classes are: 17.9, 48.2, 72.0, 51.4, 17.5, 3.0. This is clear evidence for skewness in the observed distribution.

5.8 Perform a graphic analysis on the following measurements. Are they consistent with what one would expect in sampling from a normal distribution?

11.44  15.81   5.60  12.88   9.46
14.20  11.06  21.27   6.60   7.02
 9.72  10.42  10.25   6.37   8.18
 6.26   5.40  11.09   7.92   3.21
 8.74  12.53   6.50   6.74   3.40

5.9 The following data are total lengths (in cm) of bass from a southern lake:

29.9  19.1  41.4  17.2  40.2  34.7  13.6  13.3  37.8  33.5
32.2  37.7  19.7  18.3  24.3  12.6  30.0  19.4  19.1  39.6
29.7  27.3  37.4  24.6  19.4  38.2  23.8  18.6  39.2  16.2
33.3  18.0  24.7  36.8  31.6  33.7  20.4  33.1  20.1  38.2

Compute the mean, the standard deviation, and the coefficient of variation. Make a histogram of the data. Do the data seem consistent with a normal distribution on the basis of a graphic analysis? If not, what type of departure is suggested? ANS. Ȳ = 27.4475, s = 8.9035, V = 32.438. There is a suggestion of bimodality.
CHAPTER 6
Estimation and Hypothesis Testing
In this chapter we provide methods to answer two fundamental statistical questions that every biologist must ask repeatedly in the course of his or her work: (1) how reliable are the results I have obtained? and (2) how probable is it that the differences between observed results and those expected on the basis of a hypothesis have been produced by chance alone? The first question, about reliability, is answered through the setting of confidence limits to sample statistics. The second question leads into hypothesis testing. Both subjects belong to the field of statistical inference. The subject matter in this chapter is fundamental to an understanding of any of the subsequent chapters.

In Section 6.1 we consider the form of the distribution of means and their variance. In Section 6.2 we examine the distributions and variances of statistics other than the mean. This brings us to the general subject of standard errors, which are statistics measuring the reliability of an estimate. Confidence limits provide bounds to our estimates of population parameters. We develop the idea of a confidence limit in Section 6.3 and show its application to samples where the true standard deviation is known. However, one usually deals with small, more or less normally distributed samples with unknown standard deviations,
in which case the t distribution must be used. We shall introduce the t distribution in Section 6.4. The application of t to the computation of confidence limits for statistics of small samples with unknown population standard deviations is shown in Section 6.5. Another important distribution, the chi-square distribution, is explained in Section 6.6. Then it is applied to setting confidence limits for the variance in Section 6.7. The theory of hypothesis testing is introduced in Section 6.8 and is applied in Section 6.9 to a variety of cases exhibiting the normal or t distributions. Finally, Section 6.10 illustrates hypothesis testing for variances by means of the chi-square distribution.
6.1 Distribution and variance of means

We commence our study of the distribution and variance of means with a sampling experiment.

Experiment 6.1. You were asked to retain from Experiment 5.1 the means of the seven samples of 5 housefly wing lengths and the seven similar means of milk yields. We can collect these means from every student in a class, possibly adding them to the sampling results of previous classes, and construct a frequency distribution of these means. For each variable we can also obtain the mean of the seven means, which is a mean of a sample of 35 items. Here again we shall make a frequency distribution of these means, although it takes a considerable number of samplers to accumulate a sufficient number of samples of 35 items for a meaningful frequency distribution.

In Table 6.1 we show a frequency distribution of 1400 means of samples of 5 housefly wing lengths. Consider columns (1) and (3) for the time being. Actually, these samples were obtained not by biostatistics classes but by a digital computer, enabling us to collect these values with little effort. Their mean and standard deviation are given at the foot of the table. These values are plotted on probability paper in Figure 6.1. Note that the distribution appears quite normal, as does that of the means based on 200 samples of 35 wing lengths shown in the same figure. This illustrates an important theorem: The means of samples from a normally distributed population are themselves normally distributed regardless of sample size n.
Thus, we note that the means of samples from the normally distributed housefly wing lengths are normally distributed whether they are based on 5 or 35 individual readings.

Similarly obtained distributions of means of the heavily skewed milk yields, as shown in Figure 6.2, appear to be close to normal distributions. However, the means based on five milk yields do not agree with the normal nearly as well as do the means of 35 items. This illustrates another theorem of fundamental importance in statistics: As sample size increases, the means of samples drawn from a population of any distribution will approach the normal distribution. This theorem, when rigorously stated (about sampling from populations with finite variances), is known as the central limit theorem. The importance of this theorem is that if n is large enough, it permits us to use the normal distribution to make statistical inferences about means of populations in which the items are not at all normally distributed. The necessary size of n depends upon the distribution. (Skewed populations require larger sample sizes.)

TABLE 6.1
Frequency distribution of means of 1400 random samples of 5 housefly wing lengths. (Data from Table 5.1.) Class marks chosen to give intervals of $\tfrac{1}{2}\sigma_{\bar{Y}}$ to each side of the parametric mean μ.

(1)                  (2)                              (3)
Class mark Y         Class mark
(in mm × 10⁻¹)       (in $\sigma_{\bar{Y}}$ units)    f

39.832               −3.25                              1
40.704               −2.75                             11
41.576               −2.25                             19
42.448               −1.75                             64
43.320               −1.25                            128
44.192               −0.75
45.064               −0.25
          μ = 45.5
45.936                0.25
46.808                0.75
47.680                1.25
48.552                1.75
49.424                2.25
50.296                2.75
51.168                3.25

Ȳ = 45.480     s = 1.778     $\sigma_{\bar{Y}}$ = 1.744

The next fact of importance that we note is that the range of the means is considerably less than that of the original items. Thus, the wing-length means range from 39.4 to 51.6 in samples of 5 and from 43.9 to 47.4 in samples of 35, but the individual wing lengths range from 36 to 55. The milk-yield means range from 54.2 to 89.0 in samples of 5 and from 61.9 to 71.3 in samples of 35, but the individual milk yields range from 51 to 98. Not only do means show less scatter than the items upon which they are based (an easily understood phenomenon if you give some thought to it), but the range of the distribution of the means diminishes as the sample size upon which the means are based increases.

The differences in ranges are reflected in differences in the standard deviations of these distributions. If we calculate the standard deviations of the means in the four distributions under consideration, we obtain the following values:
                 n = 5      n = 35
Wing lengths     1.778      0.584
Milk yields      5.040      1.799

[FIGURE 6.1: Graphic analysis of means of 1400 random samples of 5 housefly wing lengths (from Table 6.1) and of means of 200 random samples of 35 housefly wing lengths. Panels: "Samples of 5" and "Samples of 35"; abscissa: housefly wing lengths in $\sigma_{\bar{Y}}$ units. Graphic not reproduced.]

[FIGURE 6.2: Graphic analysis of means of 1400 random samples of 5 milk yields and of means of 200 random samples of 35 milk yields. Panels: "Samples of 5" and "Samples of 35"; abscissa: milk yields in $\sigma_{\bar{Y}}$ units. Graphic not reproduced.]
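A sampling experiment of this kind is easy to imitate on a computer. The sketch below is only an illustration: since the wing-length and milk-yield data of Table 5.1 are not reproduced here, it assumes a hypothetical, strongly right-skewed population (exponential with mean 1 and standard deviation 1), and the function names and the seed are arbitrary choices.

```python
import random
import statistics


def skewed_draw(rng):
    """One item from a hypothetical right-skewed population (exponential, mean 1)."""
    return rng.expovariate(1.0)


def sample_means(draw, sample_size, n_samples, rng):
    """Means of n_samples random samples, each of sample_size items."""
    return [statistics.fmean([draw(rng) for _ in range(sample_size)])
            for _ in range(n_samples)]


rng = random.Random(6)  # fixed seed so that the run is repeatable

means_5 = sample_means(skewed_draw, 5, 1400, rng)
means_35 = sample_means(skewed_draw, 35, 200, rng)

sd_means_5 = statistics.stdev(means_5)
sd_means_35 = statistics.stdev(means_35)

print(f"samples of 5:  range of means {min(means_5):.2f} to {max(means_5):.2f}, "
      f"standard deviation {sd_means_5:.3f}")
print(f"samples of 35: range of means {min(means_35):.2f} to {max(means_35):.2f}, "
      f"standard deviation {sd_means_35:.3f}")
```

In such a run the means of the larger samples scatter noticeably less than the means of the smaller ones, which is the pattern shown by the text table for wing lengths and milk yields.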
Note that the standard deviations of the sample means based on 35 items are considerably less than those based on 5 items. This is also intuitively obvious. Means based on large samples should be close to the parametric mean, and means based on large samples will not vary as much as will means based on small samples. The variance of means is therefore partly a function of the sample size on which the means are based. It is also a function of the variance of the items in the samples. Thus, in the text table above, the means of milk yields have a much greater standard deviation than means of wing lengths based on comparable sample size simply because the standard deviation of the individual milk yields (11.1597) is considerably greater than that of individual wing lengths (3.90).

It is possible to work out the expected value of the variance of sample means. By expected value we mean the average value to be obtained by infinitely repeated sampling. Thus, if we were to take samples of a means of n items repeatedly and were to calculate the variance of these a means each time, the average of these variances would be the expected value. We can visualize the mean as a weighted average of the n independently sampled observations with each weight $w_i$ equal to 1. From Expression (3.2) we obtain

$$\bar{Y}_w = \frac{\sum_{i=1}^{n} w_i Y_i}{\sum_{i=1}^{n} w_i}$$

for the weighted mean. We shall state without proof that the variance of the weighted sum of independent items $\sum_{i=1}^{n} w_i Y_i$ is

$$\mathrm{Var}\left(\sum_{i=1}^{n} w_i Y_i\right) = \sum_{i=1}^{n} w_i^2 \sigma_i^2 \tag{6.1}$$

where $\sigma_i^2$ is the variance of $Y_i$. It follows that

$$\sigma_{\bar{Y}_w}^2 = \frac{\sum_{i=1}^{n} w_i^2 \sigma_i^2}{\left(\sum_{i=1}^{n} w_i\right)^2}$$

Since the weights $w_i$ in this case equal 1, $\sum_{i=1}^{n} w_i = n$, and we can rewrite the above expression as

$$\sigma_{\bar{Y}}^2 = \frac{\sum_{i=1}^{n} \sigma_i^2}{n^2}$$

If we assume that the variances $\sigma_i^2$ are all equal to $\sigma^2$, the expected variance of the mean is

$$\sigma_{\bar{Y}}^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \tag{6.2}$$

and consequently the expected standard deviation of means is

$$\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} \tag{6.2a}$$

From this formula it is clear that the standard deviation of means is a function of the standard deviation of items as well as of sample size of means. The greater the sample size, the smaller will be the standard deviation of means. In fact, as sample size increases to a very large number, the standard deviation of means becomes vanishingly small. This makes good sense. Very large sample sizes, averaging many observations, should yield estimates of means closer to the population mean and less variable than those based on a few items.

When working with samples from a population, we do not, of course, know its parametric standard deviation σ, and we can obtain only a sample estimate s of the latter. Also, we would be unlikely to have numerous samples of size n from which to compute the standard deviation of means directly. Customarily, we therefore have to estimate the standard deviation of means from a single sample by using Expression (6.2a), substituting s for σ:

$$s_{\bar{Y}} = \frac{s}{\sqrt{n}} \tag{6.3}$$

Thus, from the standard deviation of a single sample, we obtain $s_{\bar{Y}}$, an estimate of the standard deviation of means we would expect were we to obtain a collection of means based on equal-sized samples of n items from the same population. As we shall see, this estimate of the standard deviation of a mean is a very important and frequently used statistic.

Table 6.2 illustrates some estimates of the standard deviations of means that might be obtained from random samples of the two populations that we have been discussing.
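Expressions (6.2a) and (6.3) share one arithmetic form, a standard deviation divided by the square root of the sample size. A minimal sketch in Python follows; the function name is hypothetical, and the inputs are the parametric standard deviations of wing lengths (3.90) and milk yields (11.1597) quoted earlier in this section.

```python
import math


def standard_deviation_of_means(sd_items, n):
    """Standard deviation of means of samples of n items.

    With the parametric sigma this is Expression (6.2a); with a sample
    estimate s in its place it is Expression (6.3).
    """
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return sd_items / math.sqrt(n)


for label, sigma in (("wing lengths", 3.90), ("milk yields", 11.1597)):
    for n in (5, 35):
        value = standard_deviation_of_means(sigma, n)
        print(f"{label:12s}  n = {n:2d}  expected standard deviation of means = {value:.3f}")
```

The same function applied to the standard deviation s of a single sample gives the estimate of Expression (6.3).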
The means of 5 samples of wing lengths based on 5 individuals ranged from 43.6 to 46.8, their standard deviations from 1.095 to 4.827, and the estimate of standard deviation of the means from 0.490 to 2.159. Ranges for the other categories of samples in Table 6.2 similarly include the parametric values of these statistics. The estimates of the standard deviations of the means of the milk yields cluster around the expected value, since they are not dependent on normality of the variates. However, in a particular sample in which by chance the sample standard deviation is a poor estimate of the population standard deviation (as in the second sample of 5 milk yields), the estimate of the standard deviation of means is equally wide of the mark.

We should emphasize one point of difference between the standard deviation of items and the standard deviation of sample means. If we estimate a population standard deviation through the standard deviation of a sample, the magnitude of the estimate will not change as we increase our sample size. We may expect that the estimate will improve and will approach the true standard
TABLE 6.2
Means, standard deviations, and standard deviations of means (standard errors) of five random samples of 5 and 35 housefly wing lengths and Jersey cow milk yields, respectively. (Data from Table 5.1.) Parametric values for the statistics are given in the sixth line of each category.

            Wing lengths                          Milk yields
            (1) Ȳ    (2) s       (3) s_Ȳ          (1) Ȳ    (2) s        (3) s_Ȳ

n = 5       45.8     1.095       0.490                     6.205        2.775
            45.6     3.209       1.435                     4.278        1.913
                     4.827       2.159                     16.072       7.188
                     4.764       2.131                     14.195       6.348
                     1.095       0.490                     5.215        2.332
                     σ = 3.90    σ_Ȳ = 1.744               σ = 11.160   σ_Ȳ = 4.991

n = 35               3.812       0.644                     11.003       1.860
                     3.850       0.651                     11.221       1.897
                     3.576       0.604                     9.978        1.687
                     4.198       0.710                     9.001        1.521
                     3.958       0.669                     12.415       2.099
                     σ = 3.90    σ_Ȳ = 0.659               σ = 11.160   σ_Ȳ = 1.886
deviation of the population. However, its order of magnitude will be the same, whether the sample is based on 3, 30, or 3000 individuals. This can be seen clearly in Table 6.2. The values of $s$ are closer to $\sigma$ in the samples based on $n = 35$ than in samples of $n = 5$. Yet the general magnitude is the same in both instances. The standard deviation of means, however, decreases as sample size increases, as is obvious from Expression (6.3). Thus, means based on 3000 items will have a standard deviation only one-tenth that of means based on 30 items. This is obvious from

$$\frac{s}{\sqrt{3000}} = \frac{s}{\sqrt{30} \times \sqrt{100}} = \frac{s}{\sqrt{30} \times 10}$$
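The dependence of the standard error on sample size can be checked numerically. The short Python sketch below is only an illustration under our own naming (the function `standard_error_of_mean` is hypothetical, not part of any library); it applies Expression (6.3) to the first wing-length sample of 5 in Table 6.2 and then compares samples of 30 and 3000 items.

```python
import math

def standard_error_of_mean(s, n):
    """Estimate the standard deviation of means from one sample: s / sqrt(n)."""
    return s / math.sqrt(n)

# First wing-length sample of 5 in Table 6.2 (s = 1.095).
se_small = standard_error_of_mean(1.095, 5)

# Ratio of the standard errors for n = 3000 and n = 30 with the same s.
ratio = standard_error_of_mean(1.095, 3000) / standard_error_of_mean(1.095, 30)

print(f"{se_small:.3f}")  # close to the tabled 0.490
print(f"{ratio:.3f}")     # one-tenth
```

The value of $s$ cancels in the ratio, so the one-tenth result holds for any sample standard deviation.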
6.2 Distribution and variance of other statistics

Just as we obtained a mean and a standard deviation from each sample of the wing lengths and milk yields, so we could also have obtained other statistics from each sample, such as a variance, a median, or a coefficient of variation. After repeated sampling and computation, we would have frequency distributions for these statistics and would be able to compute their standard deviations, just as we did for the frequency distribution of means. In many cases the statistics are normally distributed, as was true for the means. In other cases the statistics will be distributed normally only if they are based on samples from a normally distributed population, or if they are based on large samples, or if both these conditions hold. In some instances, as in variances, their distribution is never normal. An illustration is given in Figure 6.3, which shows a frequency distribution of the variances from the 1400 samples of 5 housefly wing lengths. We notice that the distribution is strongly skewed to the right, which is characteristic of the distribution of variances.

Standard deviations of various statistics are generally known as standard errors. Beginners sometimes get confused by an imagined distinction between standard deviations and standard errors. The standard error of a statistic such as the mean (or $V$) is the standard deviation of a distribution of means (or $V$'s) for samples of a given sample size $n$.
Thus, the terms "standard error" and "standard deviation" are used synonymously, with the following exception: it is not customary to use "standard error" as a synonym of "standard deviation" for items in a sample or population. Standard error or standard deviation has to be qualified by referring to a given statistic, such as the standard deviation
FIGURE 6.3
Frequency distribution of the variances from the 1400 samples of 5 housefly wing lengths.
of $V$, which is the same as the standard error of $V$. Used without any qualification, the term "standard error" conventionally implies the standard error of the mean. "Standard deviation" used without qualification generally means standard deviation of items in a sample or population. Thus, when you read that means, standard deviations, standard errors, and coefficients of variation are shown in a table, this signifies that arithmetic means, standard deviations of items in samples, standard deviations of their means (= standard errors of means), and coefficients of variation are displayed. The following summary of terms may be helpful:

Standard deviation $= s = \sqrt{\sum y^2/(n - 1)}$.
Standard deviation of a statistic $St$ = standard error of a statistic $St$ $= s_{St}$.
Standard error = standard error of a mean = standard deviation of a mean $= s_{\bar{Y}}$.

Standard errors are usually not obtained from a frequency distribution by repeated sampling but are estimated from only a single sample and represent the expected standard deviation of the statistic in case a large number of such samples had been obtained. You will remember that we estimated the standard error of a distribution of means from a single sample in this manner in the previous section.

Box 6.1 lists the standard errors of four common statistics. Column (1) lists the statistic whose standard error is described; column (2) shows the formula
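BOX 6.1

     (1) Statistic   (2) Estimate of standard error                             (3) df     (4) Comments on applicability

1    $\bar{Y}$       $s_{\bar{Y}} = s/\sqrt{n}$                                 $n - 1$    True for any population with finite variance
2    Median          $s_{med} = (1.2533)s_{\bar{Y}}$                            $n - 1$    Large samples from normal populations
3    $s$             $s_s = (0.7071068)\,s/\sqrt{n}$                            $n - 1$    Samples from normal populations ($n > 15$)
4    $V$             $s_V \approx \frac{V}{\sqrt{2n}}\sqrt{1 + 2(V/100)^2}$     $n - 1$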
for the estimated standard error; column (3) gives the degrees of freedom on which the standard error is based (their use is explained in Section 6.5); and column (4) provides comments on the range of application of the standard error. The uses of these standard errors will be illustrated in subsequent sections.
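As a rough illustration of how the formulas of Box 6.1 are applied, the following Python sketch evaluates the four standard errors for one sample. The function names are our own and purely illustrative. The data are the aphid femur lengths used in Section 6.5 ($\bar{Y} = 4.004$, $s = 0.3656$, $n = 25$), and for the coefficient of variation we use the short approximation $V/\sqrt{2n}$ quoted in that section.

```python
import math

def se_mean(s, n):
    """Standard error of the mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def se_median(s, n):
    """Standard error of the median: 1.2533 times the standard error of the mean."""
    return 1.2533 * se_mean(s, n)

def se_std_dev(s, n):
    """Standard error of the standard deviation: 0.7071068 * s / sqrt(n)."""
    return 0.7071068 * s / math.sqrt(n)

def se_coeff_variation(v, n):
    """Approximate standard error of the coefficient of variation: V / sqrt(2n)."""
    return v / math.sqrt(2 * n)

# Aphid femur lengths (values taken from the text).
mean, s, n = 4.004, 0.3656, 25
v = 100 * s / mean  # coefficient of variation in percent

print(f"s_mean   = {se_mean(s, n):.4f}")
print(f"s_median = {se_median(s, n):.4f}")
print(f"s_s      = {se_std_dev(s, n):.4f}")
print(f"V = {v:.2f}, s_V = {se_coeff_variation(v, n):.2f}")
```

Note that all four quantities are obtained from a single sample; none requires repeated sampling.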
6.3 Introduction to confidence limits

The various sample statistics we have been obtaining, such as means or standard deviations, are estimates of population parameters $\mu$ or $\sigma$, respectively. So far we have not discussed the reliability of these estimates. We first of all wish to know whether the sample statistics are unbiased estimators of the population parameters, as discussed in Section 3.7. But knowing, for example, that $\bar{Y}$ is an unbiased estimate of $\mu$ is not enough. We would like to find out how reliable a measure of $\mu$ it is. The true values of the parameters will almost always remain unknown, and we commonly estimate reliability of a sample statistic by setting confidence limits to it.

To begin our discussion of this topic, let us start with the unusual case of a population whose parametric mean and standard deviation are known to be $\mu$ and $\sigma$, respectively. The mean of a sample of $n$ items is symbolized by $\bar{Y}$. The expected standard error of the mean is $\sigma/\sqrt{n}$. As we have seen, the sample means will be normally distributed. Therefore, from Section 5.3, the region from $1.96\sigma/\sqrt{n}$ below $\mu$ to $1.96\sigma/\sqrt{n}$ above $\mu$ includes 95% of the sample means of size $n$. Another way of stating this is to consider the ratio $(\bar{Y} - \mu)/(\sigma/\sqrt{n})$. This is the standard deviate of a sample mean from the parametric mean. Since they are normally distributed, 95% of such standard deviates will lie between $-1.96$ and $+1.96$. We can express this statement symbolically as follows:

$$P\left\{-1.96 \le \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \le +1.96\right\} = 0.95$$
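This means that the probability $P$ that the sample means $\bar{Y}$ will differ by no more than 1.96 standard errors $\sigma/\sqrt{n}$ from the parametric mean $\mu$ equals 0.95. The expression between the brackets is an inequality, all terms of which can be multiplied by $\sigma/\sqrt{n}$ to yield

$$\left\{-1.96\frac{\sigma}{\sqrt{n}} \le (\bar{Y} - \mu) \le +1.96\frac{\sigma}{\sqrt{n}}\right\}$$

We can rewrite this expression as

$$\left\{-1.96\frac{\sigma}{\sqrt{n}} \le (\mu - \bar{Y}) \le +1.96\frac{\sigma}{\sqrt{n}}\right\}$$

because $-a \le b \le a$ implies $a \ge -b \ge -a$, which can be written as $-a \le -b \le a$. And finally, we can transfer $-\bar{Y}$ across the inequality signs, just as in an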
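equation it could be transferred across the equal sign. This yields the final desired expression:

$$P\left\{\bar{Y} - \frac{1.96\sigma}{\sqrt{n}} \le \mu \le \bar{Y} + \frac{1.96\sigma}{\sqrt{n}}\right\} = 0.95 \tag{6.4}$$

or

$$P\{\bar{Y} - 1.96\sigma_{\bar{Y}} \le \mu \le \bar{Y} + 1.96\sigma_{\bar{Y}}\} = 0.95 \tag{6.4a}$$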
This means that the probability $P$ that the term $\bar{Y} - 1.96\sigma_{\bar{Y}}$ is less than or equal to the parametric mean $\mu$ and that the term $\bar{Y} + 1.96\sigma_{\bar{Y}}$ is greater than or equal to $\mu$ is 0.95. The two terms $\bar{Y} - 1.96\sigma_{\bar{Y}}$ and $\bar{Y} + 1.96\sigma_{\bar{Y}}$ we shall call $L_1$ and $L_2$, respectively, the lower and upper 95% confidence limits of the mean.

Another way of stating the relationship implied by Expression (6.4a) is that if we repeatedly obtained samples of size $n$ from the population and constructed these limits for each, we could expect 95% of the intervals between these limits to contain the true mean, and only 5% of the intervals would miss $\mu$. The interval from $L_1$ to $L_2$ is called a confidence interval.

If you were not satisfied to have the confidence interval contain the true mean only 95 times out of 100, you might employ 2.576 as a coefficient in place of 1.960. You may remember that 99% of the area of the normal curve lies in the range $\mu \pm 2.576\sigma$. Thus, to calculate 99% confidence limits, compute the two quantities $L_1 = \bar{Y} - 2.576\sigma/\sqrt{n}$ and $L_2 = \bar{Y} + 2.576\sigma/\sqrt{n}$ as lower and upper confidence limits, respectively. In this case 99 out of 100 confidence intervals obtained in repeated sampling would contain the true mean. The new confidence interval is wider than the 95% interval (since we have multiplied by a greater coefficient). If you were still not satisfied with the reliability of the confidence limit, you could increase it, multiplying the standard error of the mean by 3.291 to obtain 99.9% confidence limits. This value could be found by inverse interpolation in a more extensive table of areas of the normal curve or directly in a table of the inverse of the normal probability distribution.
The new coefficient would widen the interval further. Notice that you can construct confidence intervals that will be expected to contain $\mu$ an increasingly greater percentage of the time. First you would expect to be right 95 times out of 100, then 99 times out of 100, finally 999 times out of 1000. But as your confidence increases, your statement becomes vaguer and vaguer, since the confidence interval lengthens.

Let us examine this by way of an actual sample. We obtain a sample of 35 housefly wing lengths from the population of Table 5.1 with known mean ($\mu = 45.5$) and standard deviation ($\sigma = 3.90$). Let us assume that the sample mean is 44.8. We can expect the standard deviation of means based on samples of 35 items to be $\sigma_{\bar{Y}} = \sigma/\sqrt{n} = 3.90/\sqrt{35} = 0.6592$. We compute confidence limits as follows:

The lower limit is $L_1 = 44.8 - (1.960)(0.6592) = 43.51$.
The upper limit is $L_2 = 44.8 + (1.960)(0.6592) = 46.09$.
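The computation above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a known parametric standard deviation as in the text; the function name `normal_confidence_limits` is our own, and the coefficient is obtained from the inverse normal distribution in the standard library (`statistics.NormalDist.inv_cdf`) rather than from a printed table.

```python
import math
from statistics import NormalDist

def normal_confidence_limits(sample_mean, sigma, n, confidence=0.95):
    """Confidence limits for the mean when the parametric sigma is known."""
    standard_error = sigma / math.sqrt(n)
    # Standard normal deviate that leaves (1 - confidence)/2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return sample_mean - z * standard_error, sample_mean + z * standard_error

# Sample of 35 wing lengths with mean 44.8; sigma = 3.90 is known.
lower, upper = normal_confidence_limits(44.8, 3.90, 35)
print(round(lower, 2), round(upper, 2))  # 43.51 46.09
```

Passing `confidence=0.99` or `confidence=0.999` to the same function gives the wider limits that correspond to the coefficients 2.576 and 3.291.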
Remember that this is an unusual case in which we happen to know the true mean of the population ($\mu = 45.5$) and hence we know that the confidence limits enclose the mean. We expect 95% of such confidence intervals obtained in repeated sampling to include the parametric mean. We could increase the reliability of these limits by going to 99% confidence intervals, replacing 1.960 in the above expression by 2.576 and obtaining $L_1 = 43.10$ and $L_2 = 46.50$. We could have greater confidence that our interval covers the mean, but we could be much less certain about the true value of the mean because of the wider limits. By increasing the degree of confidence still further, say, to 99.9%, we could be virtually certain that our confidence limits ($L_1 = 42.63$, $L_2 = 46.97$) contain the population mean, but the bounds enclosing the mean are now so wide as to make our prediction far less useful than previously.

Experiment 6.2. For the seven samples of 5 housefly wing lengths and the seven similar samples of milk yields last worked with in Experiment 6.1 (Section 6.1), compute 95% confidence limits to the parametric mean for each sample and for the total sample based on 35 items. Base the standard errors of the means on the parametric standard deviations of these populations (housefly wing lengths $\sigma = 3.90$, milk yields $\sigma = 11.1597$). Record how many in each of the four classes of confidence limits (wing lengths and milk yields, $n = 5$ and $n = 35$) are correct, that is, contain the parametric mean of the population. Pool your results with those of other class members.

We tried the experiment on a computer for the 200 samples of 35 wing lengths each, computing confidence limits of the parametric mean by employing the parametric standard error of the mean, $\sigma_{\bar{Y}} = 0.6592$.
Of the 200 confidence intervals plotted parallel to the ordinate, 194 (97.0%) cross the parametric mean of the population.

To reduce the width of the confidence interval, we have to reduce the standard error of the mean. Since $\sigma_{\bar{Y}} = \sigma/\sqrt{n}$, this can be done only by reducing the standard deviation of the items or by increasing the sample size. The first of these alternatives is not always available. If we are sampling from a population in nature, we ordinarily have no way of reducing its standard deviation. However, in many experimental procedures we may be able to reduce the variance of the data. For example, if we are studying the effect of a drug on heart weight in rats and find that its variance is rather large, we might be able to reduce this variance by taking rats of only one age group, in which the variation of heart weight would be considerably less. Thus, by controlling one of the variables of the experiment, the variance of the response variable, heart weight, is reduced. Similarly, by keeping temperature or other environmental variables constant in a procedure, we can frequently reduce the variance of our response variable and hence obtain more precise estimates of population parameters.

A common way to reduce the standard error is to increase sample size. It is obvious from Expression (6.2) that as $n$ increases, the standard error decreases; hence, as $n$ approaches infinity, the standard error and the lengths of confidence intervals approach zero. This ties in with what we have learned: in samples whose size approaches infinity, the sample mean would approach the parametric mean.
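A sampling experiment of this kind is easy to imitate on a computer. The sketch below is an illustration only: it draws variates from an idealized normal population with the parametric values of the wing lengths ($\mu = 45.5$, $\sigma = 3.90$), which is an assumption on our part and not the actual population of Table 5.1, and the function name `coverage` and the fixed seed are likewise our own choices. It records the fraction of 95% confidence intervals that cover $\mu$ and shows how the width of the interval responds to sample size.

```python
import math
import random
from statistics import NormalDist

MU, SIGMA = 45.5, 3.90  # parametric mean and standard deviation (from the text)

def coverage(n, trials, confidence=0.95, seed=1):
    """Fraction of intervals covering MU, and the common interval width."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * SIGMA / math.sqrt(n)  # sigma known, so width is fixed
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(MU, SIGMA) for _ in range(n)) / n
        if sample_mean - half_width <= MU <= sample_mean + half_width:
            hits += 1
    return hits / trials, 2 * half_width

fraction_35, width_35 = coverage(35, 2000)
fraction_140, width_140 = coverage(140, 2000)

print(f"n = 35:  coverage {fraction_35:.3f}, width {width_35:.3f}")
print(f"n = 140: coverage {fraction_140:.3f}, width {width_140:.3f}")
```

The coverage should stay near 0.95 for both sample sizes, whereas quadrupling $n$ halves the width of the interval, as expected from $\sigma/\sqrt{n}$.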
We must guard against a common mistake in expressing the meaning of the confidence limits of a statistic. When we have set lower and upper limits ($L_1$ and $L_2$, respectively) to a statistic, we imply that the probability that this interval covers the mean is, for example, 0.95, or, expressed in another way, that on the average 95 out of 100 confidence intervals similarly obtained would cover the mean. We cannot state that there is a probability of 0.95 that the true mean is contained within a given pair of confidence limits, although this may seem to be saying the same thing. The latter statement is incorrect because the true mean is a parameter; hence it is a fixed value, and it is therefore either inside the interval or outside it. It cannot be inside the given interval 95% of the time. It is important, therefore, to learn the correct statement and meaning of confidence limits.

So far we have considered only means based on normally distributed samples with known parametric standard deviations. We can, however, extend the methods just learned to samples from populations where the standard deviation is unknown but where the distribution is known to be normal and the samples are large, say, $n > 100$. In such cases we use the sample standard deviation for computing the standard error of the mean. However, when the samples are small ($n < 100$) and we lack knowledge of the parametric standard deviation, we must take into consideration the reliability of our sample standard deviation. To do so, we must make use of the so-called $t$ or Student's distribution.
We shall learn how to set confidence limits employing the $t$ distribution in Section 6.5. Before that, however, we shall have to become familiar with this distribution in the next section.

6.4 Student's t distribution

The deviations $\bar{Y} - \mu$ of sample means from the parametric mean of a normal distribution are themselves normally distributed. If these deviations are divided by the parametric standard deviation, the resulting ratios, $(\bar{Y} - \mu)/\sigma_{\bar{Y}}$, are still normally distributed, with $\mu = 0$ and $\sigma = 1$. Subtracting the constant $\mu$ from every $\bar{Y}_i$ is simply an additive code (Section 3.8) and will not change the form of the distribution of sample means, which is normal (Section 6.1). Dividing each deviation by the constant $\sigma_{\bar{Y}}$ reduces the variance to unity, but proportionately so for the entire distribution, so that its shape is not altered and a previously normal distribution remains so.

If, on the other hand, we calculate the variance $s_i^2$ of each of the samples and calculate the deviation for each mean $\bar{Y}_i$ as $(\bar{Y}_i - \mu)/s_{\bar{Y}_i}$, where $s_{\bar{Y}_i}$ stands for the estimate of the standard error of the mean of the $i$th sample, we will find the distribution of the deviations wider and more peaked than the normal distribution. This is illustrated in Figure 6.4, which shows the ratio $(\bar{Y}_i - \mu)/s_{\bar{Y}_i}$ for the 1400 samples of 5 housefly wing lengths of Table 6.1. The new distribution ranges wider than the corresponding normal distribution, because the denominator is the sample standard error rather than the parametric standard error and will sometimes be smaller and sometimes greater than expected. This increased variation will be reflected in the greater variance of the ratio $(\bar{Y} - \mu)/s_{\bar{Y}}$. The
FIGURE 6.4
Distribution of quantity $t_s = (\bar{Y} - \mu)/s_{\bar{Y}}$ along abscissa computed for 1400 samples of 5 housefly wing lengths presented as a histogram and as a cumulative frequency distribution. Right-hand ordinate represents frequencies for the histogram; left-hand ordinate is cumulative frequency in probability scale.
expected distribution of this ratio is called the $t$ distribution, also known as "Student's" distribution, named after W. S. Gosset, who first described it, publishing under the pseudonym "Student." The $t$ distribution is a function with a complicated mathematical formula that need not be presented here.

The $t$ distribution shares with the normal the properties of being symmetric and of extending from negative to positive infinity. However, it differs from the normal in that it assumes different shapes depending on the number of degrees of freedom. By "degrees of freedom" we mean the quantity $n - 1$, where $n$ is the sample size upon which a variance has been based. It will be remembered that $n - 1$ is the divisor in obtaining an unbiased estimate of the variance from a sum of squares. The number of degrees of freedom pertinent to a given Student's distribution is the same as the number of degrees of freedom of the standard deviation in the ratio $(\bar{Y} - \mu)/s_{\bar{Y}}$. Degrees of freedom (abbreviated $df$ or sometimes $\nu$) can range from 1 to infinity. A $t$ distribution for $df = 1$ deviates most markedly from the normal. As the number of degrees of freedom increases, Student's distribution approaches the shape of the standard normal distribution ($\mu = 0$, $\sigma = 1$) ever more closely, and in a graph the size of this page a $t$ distribution of $df = 30$ is essentially indistinguishable from a normal distribution. At
$df = \infty$, the $t$ distribution is the normal distribution. Thus, we can think of the $t$ distribution as the general case, considering the normal to be a special case of Student's distribution with $df = \infty$. Figure 6.5 shows $t$ distributions for 1 and 2 degrees of freedom compared with a normal frequency distribution.

We were able to employ a single table for the areas of the normal curve by coding the argument in standard deviation units. However, since the $t$ distributions differ in shape for differing degrees of freedom, it would be necessary to have a separate $t$ table, corresponding in structure to the table of the areas of the normal curve, for each value of $df$. This would make for very cumbersome and elaborate sets of tables. Conventional $t$ tables are therefore differently arranged. Table III shows degrees of freedom and probability as arguments and the corresponding values of $t$ as functions. The probabilities indicate the proportion of the area in both tails of the curve (to the right and left of the mean) beyond the indicated value of $t$. Thus, looking up the critical value of $t$ at probability $P = 0.05$ and $df = 5$, we find $t = 2.571$ in Table III. Since this is a two-tailed table, the probability of 0.05 means that 0.025 of the area will fall to the left of a $t$ value of $-2.571$ and 0.025 will fall to the right of $t = +2.571$. You will recall that the corresponding value for infinite degrees of freedom (for the normal curve) is 1.960. Only those probabilities generally used are shown in Table III.

You should become very familiar with looking up $t$ values in this table. This is one of the most important tables to be consulted.
A fairly conventional symbolism is $t_{\alpha[\nu]}$, meaning the tabled $t$ value for $\nu$ degrees of freedom and proportion $\alpha$ in both tails ($\alpha/2$ in each tail), which is equivalent to the $t$ value for the cumulative probability of $1 - (\alpha/2)$. Try looking up some of these values to become familiar with the table. For example, convince yourself that $t_{0.05[7]}$, $t_{0.01[3]}$, $t_{0.02[10]}$, and $t_{0.05[\infty]}$ correspond to 2.365, 5.841, 2.764, and 1.960, respectively.

We shall now employ the $t$ distribution for the setting of confidence limits to means of small samples.
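Readers without a printed table at hand can approximate these critical values numerically. The following Python sketch is one possible approach, not the method by which Table III was prepared: it integrates the density of Student's distribution by Simpson's rule and locates $t_{\alpha[\nu]}$ by bisection. The function names `t_density`, `two_tailed_area`, and `t_critical` are our own.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_area(t, df, steps=1000):
    """Area beyond -t and +t together, by Simpson's rule on [0, t]."""
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 1.0 - 2.0 * (total * h / 3.0)

def t_critical(alpha, df):
    """Value of t that leaves proportion alpha in both tails together."""
    low, high = 0.0, 1.0
    while two_tailed_area(high, df) > alpha:  # widen until the root is bracketed
        low, high = high, 2.0 * high
    for _ in range(40):                       # bisection
        mid = (low + high) / 2.0
        if two_tailed_area(mid, df) > alpha:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

# Compare with the tabled values 2.571, 2.365, 5.841, and 2.764.
for alpha, df in [(0.05, 5), (0.05, 7), (0.01, 3), (0.02, 10)]:
    print(f"t_{alpha}[{df}] = {t_critical(alpha, df):.3f}")
```

Calling `t_critical(0.05, df)` with increasing `df` shows the value falling toward 1.960, the coefficient for the normal curve.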
FIGURE 6.5
$t$ distributions for 1 and 2 degrees of freedom compared with a normal frequency distribution. Abscissa in $t$ units.
6.5 Confidence limits based on sample statistics

Armed with a knowledge of the $t$ distribution, we are now able to set confidence limits to the means of samples from a normal frequency distribution whose parametric standard deviation is unknown. The limits are computed as $L_1 = \bar{Y} - t_{\alpha[n-1]}s_{\bar{Y}}$ and $L_2 = \bar{Y} + t_{\alpha[n-1]}s_{\bar{Y}}$ for confidence limits of probability $P = 1 - \alpha$. Thus, for 95% confidence limits we use values of $t_{0.05[n-1]}$. We can rewrite Expression (6.4a) as

$$P\{L_1 \le \mu \le L_2\} = P\{\bar{Y} - t_{\alpha[n-1]}s_{\bar{Y}} \le \mu \le \bar{Y} + t_{\alpha[n-1]}s_{\bar{Y}}\} = 1 - \alpha \tag{6.5}$$
BOX 6.2
Confidence limits for $\mu$.

Aphid stem mother femur lengths from Box 2.1: $\bar{Y} = 4.004$; $s = 0.366$; $n = 25$.

Values for $t_{\alpha[n-1]}$ from a two-tailed $t$ table (Table III), where $1 - \alpha$ is the proportion expressing confidence and $n - 1$ are the degrees of freedom:

$t_{0.05[24]} = 2.064$        $t_{0.01[24]} = 2.797$

The 95% confidence limits for the population mean $\mu$ are given by the equations

$$L_1 \text{ (lower limit)} = \bar{Y} - t_{0.05[n-1]}\frac{s}{\sqrt{n}}$$
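The limits called for in Box 6.2 can be evaluated directly from the quantities given there. The Python sketch below is a minimal illustration; the function name `t_confidence_limits` is our own, and the two $t$ values are simply the tabled entries for 24 degrees of freedom quoted in the box, not values computed by the program.

```python
import math

# Aphid stem mother femur lengths (Box 6.2).
mean, s, n = 4.004, 0.366, 25
t_05, t_01 = 2.064, 2.797  # tabled t values for 24 degrees of freedom

def t_confidence_limits(sample_mean, s, n, t_value):
    """Limits sample_mean -/+ t * s / sqrt(n), with t taken from a t table."""
    standard_error = s / math.sqrt(n)
    return sample_mean - t_value * standard_error, sample_mean + t_value * standard_error

limits_95 = t_confidence_limits(mean, s, n, t_05)
limits_99 = t_confidence_limits(mean, s, n, t_01)

print(f"95%: {limits_95[0]:.3f} to {limits_95[1]:.3f}")  # 3.853 to 4.155
print(f"99%: {limits_99[0]:.3f} to {limits_99[1]:.3f}")  # 3.799 to 4.209
```

As with the normal coefficients of Section 6.3, the larger $t$ value for 99% confidence yields the wider interval.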
We can convince ourselves of the appropriateness of the $t$ distribution for setting confidence limits to means of samples from a normally distributed population with unknown $\sigma$ through a sampling experiment.

Experiment 6.3. Repeat the computations and procedures of Experiment 6.2 (Section 6.3), but base standard errors of the means on the standard deviations computed for each sample and use the appropriate $t$ value in place of a standard normal deviate.

Figure 6.6 shows 95% confidence limits of 200 sampled means of 35 housefly wing lengths, computed with $t$ and $s_{\bar{Y}}$ rather than with the normal curve and $\sigma_{\bar{Y}}$. We note that 191 (95.5%) of the 200 confidence intervals cross the parametric mean.

We can use the same technique for setting confidence limits to any given statistic as long as it follows the normal distribution. This will apply in an approximate way to all the statistics of Box 6.1. Thus, for example, we may set confidence limits to the coefficient of variation of the aphid femur lengths of Box 6.2. These are computed as

$$P\{V - t_{\alpha[n-1]}s_V \le V_P \le V + t_{\alpha[n-1]}s_V\} = 1 - \alpha$$
FIGURE 6.6
95% confidence limits of 200 sampled means of 35 housefly wing lengths, computed with $t$ and $s_{\bar{Y}}$. Abscissa: number of trials.
where $V_P$ stands for the parametric value of the coefficient of variation. Since the standard error of the coefficient of variation equals approximately $s_V = V/\sqrt{2n}$, we proceed as follows:

$$V = \frac{100s}{\bar{Y}} = \frac{100(0.3656)}{4.004} = 9.13$$

$$s_V = \frac{9.13}{\sqrt{2 \times 25}} = \frac{9.13}{7.0711} = 1.29$$

$$L_1 = V - t_{0.05[24]}s_V = 9.13 - (2.064)(1.29) = 9.13 - 2.66 = 6.47$$

$$L_2 = V + t_{0.05[24]}s_V = 9.13 + 2.66 = 11.79$$

When sample size is very large or when $\sigma$ is known, the distribution is effectively normal. However, rather than turn to the table of areas of the normal curve, it is convenient to simply use $t_{\alpha[\infty]}$, the $t$ distribution with infinite degrees of freedom.

Although confidence limits are a useful measure of the reliability of a sample statistic, they are not commonly given in scientific publications, the statistic plus or minus its standard error being cited in their place. Thus, you will frequently see column headings such as "Mean ± S.E." This indicates that the reader is free to use the standard error to set confidence limits if so inclined.

It should be obvious to you from your study of the $t$ distribution that you cannot set confidence limits to a statistic without knowing the sample size on which it is based, $n$ being necessary to compute the correct degrees of freedom. Thus, the occasional citing of means and standard errors without also stating sample size $n$ is to be strongly deplored.

It is important to state a statistic and its standard error to a sufficient number of decimal places. The following rule of thumb helps. Divide the standard error by 3, then note the decimal place of the first nonzero digit of the quotient; give the statistic significant to that decimal place and provide one further decimal for the standard error. This rule is quite simple, as an example will illustrate.
If the mean and standard error of a sample are computed as 2.354 ± 0.363, we divide 0.363 by 3, which yields 0.121. Therefore the mean should be reported to one decimal place, and the standard error should be reported to two decimal places. Thus, we report this result as 2.4 ± 0.36. If, on the other hand, the same mean had a standard error of 0.243, dividing this standard error by 3 would have yielded 0.081, and the first nonzero digit would have been in the second decimal place. Thus the mean should have been reported as 2.35 ± 0.243.
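The rule of thumb lends itself to a small helper function. The Python sketch below is our own illustration; the name `report` is hypothetical, and the treatment of quotients of 1 or more (no decimals for the statistic, one for the standard error) is an assumption, since the rule as stated covers only quotients below 1.

```python
import math

def report(statistic, standard_error):
    """Format a statistic and its standard error by the divide-by-3 rule."""
    quotient = standard_error / 3
    # Decimal place of the first nonzero digit of the quotient
    # (assumed to be 0 when the quotient is 1 or more).
    places = max(0, -math.floor(math.log10(quotient)))
    return f"{statistic:.{places}f}", f"{standard_error:.{places + 1}f}"

print(report(2.354, 0.363))  # ('2.4', '0.36')
print(report(2.354, 0.243))  # ('2.35', '0.243')
```

The function expects a positive standard error; a standard error of zero has no first nonzero digit and is not handled.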
6.6 The chi-square distribution

Another continuous distribution of great importance in statistics is the distribution of $\chi^2$ (read chi-square). We need to learn it now in connection with the distribution and confidence limits of variances.

The chi-square distribution is a probability density function whose values range from zero to positive infinity. Thus, unlike the normal distribution or $t$, the function approaches the horizontal axis asymptotically only at the right-hand tail of the curve, not at both tails. The function describing the $\chi^2$ distribution is complicated and will not be given here. As in $t$, there is not merely one $\chi^2$ distribution, but there is one distribution for each number of degrees of freedom. Therefore, $\chi^2$ is a function of $\nu$, the number of degrees of freedom. Figure 6.7 shows probability density functions for the $\chi^2$ distributions for 1, 2, 3, and 6 degrees of freedom. Notice that the curves are strongly skewed to the right, L-shaped at first, but more or less approaching symmetry for higher degrees of freedom.

We can generate a $\chi^2$ distribution from a population of standard normal deviates. You will recall that we standardize a variable $Y_i$ by subjecting it to the operation $(Y_i - \mu)/\sigma$. Let us symbolize a standardized variable as $Y'_i = (Y_i - \mu)/\sigma$. Now imagine repeated samples of $n$ variates $Y_i$ from a normal population with mean $\mu$ and standard deviation $\sigma$. For each sample, we transform every variate $Y_i$ to $Y'_i$, as defined above. The quantities $\sum^{n} Y'^{2}_{i}$ computed for each sample will be distributed as a $\chi^2$ distribution with $n$ degrees of freedom.
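This construction can be imitated by simulation. The Python sketch below is an illustration under stated assumptions: the normal population is given the parametric values of the housefly wing lengths ($\mu = 45.5$, $\sigma = 3.90$) purely for concreteness, the function name `simulate_chi_square` and the seed are our own, and 5 variates per sample are used so that the result can be compared with the tabled mean and median of $\chi^2$ for 5 degrees of freedom discussed later in this section.

```python
import random
from statistics import mean, median

def simulate_chi_square(df, samples, mu=45.5, sigma=3.90, seed=1):
    """Sums of df squared standardized normal variates, one sum per sample."""
    rng = random.Random(seed)
    sums = []
    for _sample in range(samples):
        total = 0.0
        for _item in range(df):
            y = rng.gauss(mu, sigma)          # variate from the normal population
            y_standardized = (y - mu) / sigma
            total += y_standardized ** 2
        sums.append(total)
    return sums

values = simulate_chi_square(df=5, samples=20000)

print(f"mean   = {mean(values):.2f}")    # should fall near the degrees of freedom, 5
print(f"median = {median(values):.2f}")  # should fall near the tabled 50% point, 4.351
```

The simulated sums are never negative, and their median lies below their mean, reflecting the skewness to the right described above.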
FIGURE 6.7
Frequency curves of $\chi^2$ distribution for 1, 2, 3, and 6 degrees of freedom.
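Using the definition of $Y'_i$, we can rewrite $\sum^{n} Y'^{2}_{i}$ as

$$\sum^{n} \frac{(Y_i - \mu)^2}{\sigma^2} = \frac{1}{\sigma^2}\sum^{n} (Y_i - \mu)^2 \tag{6.6}$$

When we change the parametric mean $\mu$ to a sample mean, this expression becomes

$$\frac{1}{\sigma^2}\sum^{n} (Y_i - \bar{Y})^2 \tag{6.7}$$

which is simply the sum of squares of the variable divided by a constant, the parametric variance. Another common way of stating this expression is

$$\frac{(n - 1)s^2}{\sigma^2} \tag{6.8}$$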
Here we have replaced the numerator of Expression (6.7) with $n - 1$ times the sample variance, which, of course, yields the sum of squares.

If we were to sample repeatedly $n$ items from a normally distributed population, Expression (6.8) computed for each sample would yield a $\chi^2$ distribution with $n - 1$ degrees of freedom. Notice that, although we have samples of $n$ items, we have lost a degree of freedom because we are now employing a sample mean rather than the parametric mean. Figure 6.3, a sample distribution of variances, has a second scale along the abscissa, which is the first scale multiplied by the constant $(n - 1)/\sigma^2$. This scale converts the sample variances $s^2$ of the first scale into Expression (6.8). Since the second scale is proportional to $s^2$, the distribution of the sample variance will serve to illustrate a sample distribution approximating $\chi^2$. The distribution is strongly skewed to the right, as would be expected in a $\chi^2$ distribution.

Conventional $\chi^2$ tables as shown in Table IV give the probability levels customarily required and degrees of freedom as arguments and list the $\chi^2$ corresponding to the probability and the $df$ as the functions. Each chi-square in Table IV is the value of $\chi^2$ beyond which the area under the $\chi^2$ distribution for $\nu$ degrees of freedom represents the indicated probability. Just as we used subscripts to indicate the cumulative proportion of the area as well as the degrees of freedom represented by a given value of $t$, we shall subscript $\chi^2$ as follows: $\chi^2_{\alpha[\nu]}$ indicates the $\chi^2$ value to the right of which is found proportion $\alpha$ of the area under a $\chi^2$ distribution for $\nu$ degrees of freedom.

Let us learn how to use Table IV. Looking at the distribution of $\chi^2_{[2]}$, we note that 90% of all values of $\chi^2_{[2]}$ would be to the right of 0.211, but only 5% of all values of $\chi^2_{[2]}$ would be greater than 5.991.
It can be shown that the expected value of $\chi^2_{[\nu]}$ (the mean of a $\chi^2$ distribution) equals its degrees of freedom $\nu$. Thus the expected value of a $\chi^2_{[5]}$ distribution is 5. When we examine 50% values (the medians) in the $\chi^2$ table, we notice that they are generally lower than the expected value (the means). Thus, for $\chi^2_{[5]}$ the 50% point is 4.351. This
illustrates the asymmetry of the $\chi^2$ distribution, the mean being to the right of the median. Our first application of the $\chi^2$ distribution will be in the next section. However, its most extensive use will be in connection with Chapter 13.

6.7 Confidence limits for variances

We saw in the last section that the ratio $(n - 1)s^2/\sigma^2$ is distributed as $\chi^2$ with $n - 1$ degrees of freedom. We take advantage of this fact in setting confidence limits to variances. First, we can make the following statement about the ratio $(n - 1)s^2/\sigma^2$:

$$P\left\{\chi^2_{(1-(\alpha/2))[n-1]} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{(\alpha/2)[n-1]}\right\} = 1 - \alpha$$

This expression is similar to those encountered in Section 6.3 and implies that the probability $P$ that this ratio will be within the indicated boundary values of $\chi^2_{[n-1]}$ is $1 - \alpha$. Simple algebraic manipulation of the quantities in the inequality within brackets yields

$$P\left\{\frac{(n-1)s^2}{\chi^2_{(\alpha/2)[n-1]}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{(1-(\alpha/2))[n-1]}}\right\} = 1 - \alpha \tag{6.9}$$

Since $(n - 1)s^2 = \sum y^2$, the sum of squares, Expression (6.9) can also be written as

$$P\left\{\frac{\sum y^2}{\chi^2_{(\alpha/2)[n-1]}} \le \sigma^2 \le \frac{\sum y^2}{\chi^2_{(1-(\alpha/2))[n-1]}}\right\} = 1 - \alpha \tag{6.10}$$
This still looks like a formidable expression, but it simply means that if we divide the sum of squares $\sum y^2$ by the two values of $\chi^2_{[n-1]}$ that cut off tails each amounting to $\alpha/2$ of the area of the $\chi^2_{[n-1]}$ distribution, the two quotients will enclose the true value of the variance $\sigma^2$ with a probability of $P = 1 - \alpha$.

An actual numerical example will make this clear. Suppose we have a sample of 5 housefly wing lengths with a sample variance of $s^2 = 13.52$. If we wish to set 95% confidence limits to the parametric variance, we evaluate Expression (6.10) for the sample variance $s^2$. We first calculate the sum of squares for this sample: $4 \times 13.52 = 54.08$. Then we look up the values for $\chi^2_{0.025[4]}$ and $\chi^2_{0.975[4]}$. Since 95% confidence limits are required, $\alpha$ in this case is equal to 0.05. These $\chi^2$ values span between them 95% of the area under the $\chi^2$ curve. They correspond to 11.143 and 0.484, respectively, and the limits in Expression (6.10) then become

$$L_1 = \frac{54.08}{11.143} \quad \text{and} \quad L_2 = \frac{54.08}{0.484}$$

or

$$L_1 = 4.85 \quad \text{and} \quad L_2 = 111.74$$
This confidence interval is very wide, but we must not forget that the sample variance is, after all, based on only 5 individuals. Note also that the interval
BOX 6.3
Confidence limits for $\sigma^2$. Method of shortest unbiased confidence intervals.

Aphid stem mother femur lengths from Box 11: $n = 25$; $s^2 = 0.1337$.

The factors from Table VII for $\nu = n - 1 = 24$ df and confidence coefficient $(1 - \alpha) = 0.95$ are

$$f_1 = 0.5943 \qquad f_2 = 1.876$$

and for a confidence coefficient of 0.99 they are

$$f_1 = 0.5139 \qquad f_2 = 2.351$$

The 95% confidence limits for the population variance $\sigma^2$ are given by the equations

$$L_1 = \text{(lower limit)} = f_1 s^2 = 0.5943(0.1337) = 0.079{,}46$$
$$L_2 = \text{(upper limit)} = f_2 s^2 = 1.876(0.1337) = 0.2508$$

The 99% confidence limits are

$$L_1 = f_1 s^2 = 0.5139(0.1337) = 0.068{,}71$$
$$L_2 = f_2 s^2 = 2.351(0.1337) = 0.3143$$
is asymmetrical around 13.52, the sample variance. This is in contrast to the confidence intervals encountered earlier, which were symmetrical around the sample statistic.

The method described above is called the equal-tails method, because an equal amount of probability is placed in each tail (for example, 2½%). It can be shown that in view of the skewness of the distribution of variances, this method does not yield the shortest possible confidence intervals. One may wish the confidence interval to be "shortest" in the sense that the ratio $L_2/L_1$ be as small as possible. Box 6.3 shows how to obtain these shortest unbiased confidence intervals for $\sigma^2$ using Table VII, based on the method of Tate and Klett (1959). This table gives $(n - 1)/\chi^2_{p[n-1]}$, where $p$ is an adjusted value of $\alpha/2$ or $1 - (\alpha/2)$ designed to yield the shortest unbiased confidence intervals. The computation is very simple.
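As a rough check on the equal-tails example for the 5 housefly wing lengths, the following sketch evaluates Expression (6.10) without a printed table. It assumes the closed-form upper-tail area of a $\chi^2$ distribution with 4 degrees of freedom and finds the two critical values by bisection; the helper names are illustrative and not part of the text.

```python
import math

def chi2_upper_tail_df4(x):
    """Area to the right of x under a chi-square curve with 4 df (closed form)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

def chi2_critical_df4(upper_area, lo=0.0, hi=100.0):
    """Find x whose upper-tail area equals upper_area, by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail_df4(mid) > upper_area:
            lo = mid   # too much area still lies to the right, so move right
        else:
            hi = mid
    return (lo + hi) / 2.0

n, s2, alpha = 5, 13.52, 0.05
sum_of_squares = (n - 1) * s2                  # 4 x 13.52
chi2_hi = chi2_critical_df4(alpha / 2)         # close to the table value 11.143
chi2_lo = chi2_critical_df4(1 - alpha / 2)     # close to the table value 0.484
lower_limit = sum_of_squares / chi2_hi
upper_limit = sum_of_squares / chi2_lo
print(round(lower_limit, 2), round(upper_limit, 2))
```

Because the critical values are not rounded to three decimals here, the upper limit comes out near 111.6 rather than the 111.74 obtained from the table value 0.484; the lower limit agrees with 4.85.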
6.8 Introduction to hypothesis testing

The most frequent application of statistics in biological research is to test some scientific hypothesis. Statistical methods are important in biology because results of experiments are usually not clear-cut and therefore need statistical tests to support decisions between alternative hypotheses. A statistical test examines a set of sample data and, on the basis of an expected distribution of the data, leads to a decision on whether to accept the hypothesis underlying the expected distribution or to reject that hypothesis and accept an alternative
one. The nature of the tests varies with the data and the hypothesis, but the same general philosophy of hypothesis testing is common to all tests and will be discussed in this section. Study the material below very carefully, because it is fundamental to an understanding of every subsequent chapter in this book!

We would like to refresh your memory on the sample of 17 animals of species A, 14 of which were females and 3 of which were males. These data were examined for their fit to the binomial frequency distribution presented in Section 4.2, and their analysis was shown in Table 4.3. We concluded from Table 4.3 that if the sex ratio in the population was 1:1 ($p_{♀} = q_{♂} = 0.5$), the probability of obtaining a sample with 14 females and 3 males would be 0.005,188, making it very unlikely that such a result could be obtained by chance alone. We learned that it is conventional to include all "worse" outcomes, that is, all those that deviate even more from the outcome expected on the hypothesis $p_{♀} = q_{♂} = 0.5$. Including all worse outcomes, the probability is 0.006,363, still a very small value.

The above computation is based on the idea of a one-tailed test, in which we are interested only in departures from the 1:1 sex ratio that show a preponderance of females. If we have no preconception about the direction of the departures from expectation, we must calculate the probability of obtaining a sample as deviant as 14 females and 3 males in either direction from expectation. This requires the probability either of obtaining a sample of 3 females and 14 males (and all worse samples) or of obtaining 14 females and 3 males (and all worse samples). Such a test is two-tailed, and since the distribution is symmetrical, we double the previously discussed probability to yield 0.012,726.

What does this probability mean?
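The three probabilities quoted above follow directly from the binomial distribution with 17 trials and $p_{♀} = 0.5$. A minimal sketch, using only the Python standard library (the function name `binom_prob` is illustrative):

```python
from math import comb

def binom_prob(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p_female = 17, 0.5
exactly_14 = binom_prob(14, n, p_female)
# One-tailed: 14 females and all "worse" outcomes (15, 16, 17 females)
one_tailed = sum(binom_prob(k, n, p_female) for k in range(14, n + 1))
# Two-tailed: doubling is valid here only because the distribution is symmetrical
two_tailed = 2 * one_tailed
print(f"{exactly_14:.6f} {one_tailed:.6f} {two_tailed:.6f}")
```

With an asymmetrical null distribution (any $p_{♀}$ other than 0.5) the two tails would have to be summed separately instead of doubled.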
It is our hypothesis that $p_{♀} = q_{♂} = 0.5$. Let us call this hypothesis $H_0$, the null hypothesis, which is the hypothesis under test. It is called the null hypothesis because it assumes that there is no real difference between the true value of $p$ in the population from which we sampled and the hypothesized value of $p = 0.5$. Applied to the present example, the null hypothesis implies that the only reason our sample does not exhibit a 1:1 sex ratio is because of sampling error. If the null hypothesis $p_{♀} = q_{♂} = 0.5$ is true, then approximately 13 samples out of 1000 will be as deviant as or more deviant than this one in either direction by chance alone. Thus, it is quite possible to have arrived at a sample of 14 females and 3 males by chance, but it is not very probable, since so deviant an event would occur only about 13 out of 1000 times, or 1.3% of the time.

If we actually obtain such a sample, we may make one of two decisions. We may decide that the null hypothesis is in fact true (that is, the sex ratio is 1:1) and that the sample obtained by us just happened to be one of those in the tail of the distribution, or we may decide that so deviant a sample is too improbable an event to justify acceptance of the null hypothesis. We may therefore decide that the hypothesis that the sex ratio is 1:1 is not true. Either of these decisions may be correct, depending upon the truth of the matter. If in fact the 1:1 hypothesis is correct, then the first decision (to accept the null hypothesis) will be correct. If we decide to reject the hypothesis under these circumstances, we commit an error. The rejection of a true null hypothesis is called a type I error. On the other hand, if in fact the true sex ratio of the population is other than 1:1, the first decision (to accept the 1:1 hypothesis) is an error, a so-called type II error, which is the acceptance of a false null hypothesis. Finally, if the 1:1 hypothesis is not true and we do decide to reject it, then we again make the correct decision.

Thus, there are two kinds of correct decisions: accepting a true null hypothesis and rejecting a false null hypothesis, and there are two kinds of errors: type I, rejecting a true null hypothesis, and type II, accepting a false null hypothesis. These relationships between hypotheses and decisions can be summarized in the following table:
                                 Statistical decision
                                 Null hypothesis
Actual situation                 Accepted            Rejected
Null hypothesis     True         Correct decision    Type I error
                    False        Type II error       Correct decision
Before we carry out a test, we have to decide what magnitude of type I error (rejection of true hypothesis) we are going to allow. Even when we sample from a population of known parameters, there will always be some samples that by chance are very deviant. The most deviant of these are likely to mislead us into believing our hypothesis $H_0$ to be untrue. If we permit 5% of samples to lead us into a type I error, then we shall reject 5 out of 100 samples from the population, deciding that these are not samples from the given population. In the distribution under study, this means that we would reject all samples of 17 animals containing 13 or more of one sex (and 4 or fewer of the other sex). This can be seen by referring to column (3) of Table 6.3, where the expected frequencies of the various outcomes on the hypothesis $p_{♀} = q_{♂} = 0.5$ are shown. This table is an extension of the earlier Table 4.3, which showed only a tail of this distribution. Actually, you obtain a type I error slightly less than 5% if you sum relative expected frequencies for both tails starting with the class of 13 of one sex and 4 of the other. From Table 6.3 it can be seen that the relative expected frequency in the two tails will be $2 \times 0.024{,}520{,}9 = 0.049{,}041{,}8$. In a discrete frequency distribution, such as the binomial, we cannot calculate errors of exactly 5% as we can in a continuous frequency distribution, where we can measure off exactly 5% of the area. If we decide on an approximate 1% error, we will reject the hypothesis $p_{♀} = q_{♂}$ for all samples of 17 animals having 14 or more of one sex. (From Table 6.3 we find the $f_{rel}$ in the tails equals $2 \times 0.006{,}362{,}9 = 0.012{,}725{,}8$.) Thus, the smaller the type I error we are prepared to accept, the more deviant a sample has to be for us to reject the null hypothesis $H_0$.

Your natural inclination might well be to have as little error as possible.
You may decide to work with an extremely small type I error, such as 0.1% or even 0.01%, accepting the null hypothesis unless the sample is extremely deviant. The difficulty with such an approach is that, although guarding against a type I error, you might be falling into a type II error, accepting the null hypothesis
TABLE 6.3
Relative expected frequencies for samples of 17 animals under two hypotheses. Binomial distribution.

(1)      (2)      (3)                      (4)
                  H0: p♀ = q♂ = 1/2        H1: p♀ = 2q♂ = 2/3
♀♀       ♂♂       f_rel                    f_rel
17        0       0.0000076                0.0010150
16        1       0.0001297                0.0086272
15        2       0.0010376                0.0345086
14        3       0.0051880                0.0862715
13        4       0.0181580                0.1509752
12        5       0.0472107                0.1962677
11        6       0.0944214                0.1962677
10        7       0.1483765                0.1542104
 9        8       0.1854706                0.0963815
 8        9       0.1854706                0.0481907
 7       10       0.1483765                0.0192763
 6       11       0.0944214                0.0061334
 5       12       0.0472107                0.0015333
 4       13       0.0181580                0.0002949
 3       14       0.0051880                0.0000421
 2       15       0.0010376                0.0000042
 1       16       0.0001297                0.0000002
 0       17       0.0000076                0.0000000
Total             1.0000002                0.9999999
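The relative expected frequencies in Table 6.3, and the two-tailed type I errors derived from them, can be regenerated from the binomial formula. The sketch below is only an illustration; the names `f_rel_h0`, `f_rel_h1` and `two_tail_error` are assumptions made for this example.

```python
from math import comb

def binom_prob(k, n, p):
    """Probability of exactly k females among n animals."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 17
# List index = number of females in the sample
f_rel_h0 = [binom_prob(k, n, 1 / 2) for k in range(n + 1)]   # column (3)
f_rel_h1 = [binom_prob(k, n, 2 / 3) for k in range(n + 1)]   # column (4)

def two_tail_error(cutoff):
    """Type I error when H0 is rejected for `cutoff` or more of either sex."""
    return 2 * sum(f_rel_h0[cutoff:])

for females in range(n, -1, -1):
    print(f"{females:2d} {n - females:2d} "
          f"{f_rel_h0[females]:.7f} {f_rel_h1[females]:.7f}")
print(round(two_tail_error(13), 7), round(two_tail_error(14), 7))
```

Rejecting for 13 or more of one sex gives an attainable error just under 5%, and rejecting for 14 or more gives about 1.3%; no cutoff yields exactly 5% or 1%, which is the point made in the text about discrete distributions.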
when in fact it is not true and an alternative hypothesis $H_1$ is true. Presently, we shall show how this comes about.

First, let us learn some more terminology. Type I error is most frequently expressed as a probability and is symbolized by $\alpha$. When a type I error is expressed as a percentage, it is also known as the significance level. Thus a type I error of $\alpha = 0.05$ corresponds to a significance level of 5% for a given test. When we cut off on a frequency distribution those areas proportional to $\alpha$ (the type I error), the portion of the abscissa under the area that has been cut off is called the rejection region or critical region of a test. The portion of the abscissa that would lead to acceptance of the null hypothesis is called the acceptance region. Figure 6.8A is a bar diagram showing the expected distribution of outcomes in the sex ratio example, given $H_0$. The dashed lines separate rejection regions from the 99% acceptance region.

Now let us take a closer look at the type II error. This is the probability of accepting the null hypothesis when in fact it is false. If you try to evaluate the probability of type II error, you immediately run into a problem. If the null hypothesis $H_0$ is false, some other hypothesis $H_1$ must be true. But unless you can specify $H_1$, you are not in a position to calculate type II error. An example will make this clear immediately. Suppose in our sex ratio case we have only two reasonable possibilities: (1) our old hypothesis $H_0$: $p_{♀} = q_{♂}$, or (2) an alternative
FIGURE 6.8
Expected distributions of outcomes when sampling 17 animals from two hypothetical populations. (A) $H_0$: $p_{♀} = q_{♂} = \tfrac{1}{2}$. (B) $H_1$: $p_{♀} = 2q_{♂} = \tfrac{2}{3}$. Abscissa: number of females in samples of 17 animals; ordinate: relative expected frequency. Dashed lines separate critical regions from acceptance region of the distribution of part A. Type I error $\alpha$ equals approximately 0.01.
hypothesis $H_1$: $p_{♀} = 2q_{♂}$, which states that the sex ratio is 2:1 in favor of females so that $p_{♀} = \tfrac{2}{3}$ and $q_{♂} = \tfrac{1}{3}$. We now have to calculate expected frequencies for the binomial distribution $(p_{♀} + q_{♂})^k = (\tfrac{2}{3} + \tfrac{1}{3})^{17}$ to find the probabilities of the various outcomes under the alternative hypothesis. These are shown graphically in Figure 6.8B and are tabulated and compared with expected frequencies of the earlier distribution in Table 6.3.

Suppose we had decided on a type I error of $\alpha \approx 0.01$ ($\approx$ means "approximately equal to") as shown in Figure 6.8A. At this significance level we would accept the $H_0$ for all samples of 17 having 13 or fewer animals of one sex. Approximately 99% of all samples will fall into this category. However, what if $H_0$ is not true and $H_1$ is true? Clearly, from the population represented by hypothesis $H_1$ we could also obtain outcomes in which one sex was represented
13 or fewer times in samples of 17. We have to calculate what proportion of the curve representing hypothesis $H_1$ will overlap the acceptance region of the distribution representing hypothesis $H_0$. In this case we find that 0.8695 of the distribution representing $H_1$ overlaps the acceptance region of $H_0$ (see Figure 6.8B). Thus, if $H_1$ is really true (and $H_0$ correspondingly false), we would erroneously accept the null hypothesis 86.95% of the time. This percentage corresponds to the proportion of samples from $H_1$ that fall within the limits of the acceptance regions of $H_0$. This proportion is called $\beta$, the type II error expressed as a proportion. In this example $\beta$ is quite large. Clearly, a sample of 17 animals is unsatisfactory to discriminate between the two hypotheses. Though 99% of the samples under $H_0$ would fall in the acceptance region, fully 87% would do so under $H_1$. A single sample that falls in the acceptance region would not enable us to reach a decision between the hypotheses with a high degree of reliability. If the sample had 14 or more females, we would conclude that $H_1$ was correct. If it had 3 or fewer females, we might conclude that neither $H_0$ nor $H_1$ was true. As $H_1$ approached $H_0$ (as in $H_1$: $p_{♀} = 0.55$, for example), the two distributions would overlap more and more and the magnitude of $\beta$ would increase, making discrimination between the hypotheses even less likely. Conversely, if $H_1$ represented $p_{♀} = 0.9$, the distributions would be much farther apart and type II error $\beta$ would be reduced. Clearly, then, the magnitude of $\beta$ depends, among other things, on the parameters of the alternative hypothesis $H_1$ and cannot be specified without knowledge of the latter.

When the alternative hypothesis is fixed, as in the previous example ($H_1$: $p_{♀} = 2q_{♂}$), the magnitude of the type I error $\alpha$ we are prepared to tolerate will determine the magnitude of the type II error $\beta$.
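The overlap of 0.8695 can be reproduced by summing the binomial probabilities under $H_1$ over the acceptance region of $H_0$ (4 through 13 females). The sketch below also evaluates the two other alternatives mentioned in the text, $p_{♀} = 0.55$ and $p_{♀} = 0.9$; the function names are illustrative only.

```python
from math import comb

def binom_prob(k, n, p):
    """Probability of exactly k females among n animals."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 17
# Acceptance region of H0 at alpha of about 0.01: 13 or fewer of either sex
acceptance = range(4, 14)

def type_ii_error(p_female_alt):
    """Proportion of the alternative distribution inside the acceptance region."""
    return sum(binom_prob(k, n, p_female_alt) for k in acceptance)

alpha = 1 - sum(binom_prob(k, n, 0.5) for k in acceptance)
beta_two_to_one = type_ii_error(2 / 3)
print(round(alpha, 4), round(beta_two_to_one, 4))
print(round(type_ii_error(0.55), 4), round(type_ii_error(0.9), 4))
```

The second line of output illustrates the dependence of $\beta$ on the alternative: it is larger for an alternative close to $H_0$ and smaller for one far from it.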
The smaller the rejection region $\alpha$ in the distribution under $H_0$, the greater will be the acceptance region $1 - \alpha$ in this distribution. The greater $1 - \alpha$, however, the greater will be its overlap with the distribution representing $H_1$, and hence the greater will be $\beta$. Convince yourself of this in Figure 6.8. By moving the dashed lines outward, we are reducing the critical regions representing type I error $\alpha$ in diagram A. But as the dashed lines move outward, more of the distribution of $H_1$ in diagram B will lie in the acceptance region of the null hypothesis. Thus, by decreasing $\alpha$, we are increasing $\beta$ and in a sense defeating our own purposes.

In most applications, scientists would wish to keep both of these errors small, since they do not wish to reject a null hypothesis when it is true, nor do they wish to accept it when another hypothesis is correct. We shall see in the following what steps can be taken to decrease $\beta$ while holding $\alpha$ constant at a preset level.

Although significance levels $\alpha$ can be varied at will, investigators are frequently limited because, for many tests, cumulative probabilities of the appropriate distributions have not been tabulated and so published probability levels must be used. These are commonly 0.05, 0.01, and 0.001, although several others are occasionally encountered. When a null hypothesis has been rejected at a specified level of $\alpha$, we say that the sample is significantly different from the parametric or hypothetical population at probability $P \le \alpha$. Generally, values
of $\alpha$ greater than 0.05 are not considered to be statistically significant. A significance level of 5% ($P = 0.05$) corresponds to one type I error in 20 trials, a level of 1% ($P = 0.01$) to one error in 100 trials. Significance levels of 1% or less ($P \le 0.01$) are nearly always adjudged significant; those between 5% and 1% may be considered significant at the discretion of the investigator. Since statistical significance has a special technical meaning ($H_0$ rejected at $P \le \alpha$), we shall use the adjective "significant" only in this sense; its use in scientific papers and reports, unless such a technical meaning is clearly implied, should be discouraged. For general descriptive purposes synonyms such as important, meaningful, marked, noticeable, and others can serve to underscore differences and effects.

A brief remark on null hypotheses represented by asymmetrical probability distributions is in order here. Suppose our null hypothesis in the sex ratio case had been $H_0$: $p_{♀} = \tfrac{2}{3}$, as discussed above. The distribution of samples of 17 offspring from such a population is shown in Figure 6.8B. It is clearly asymmetrical, and for this reason the critical regions have to be defined independently. For a given two-tailed test we can either double the probability $P$ of a deviation in the direction of the closer tail and compare $2P$ with $\alpha$, the conventional level of significance; or we can compare $P$ with $\alpha/2$, half the conventional level of significance. In this latter case, 0.025 is the maximum value of $P$ conventionally considered significant.

We shall review what we have learned by means of a second example, this time involving a continuous frequency distribution: the normally distributed housefly wing lengths, of parametric mean $\mu = 45.5$ and variance $\sigma^2 = 15.21$. Means based on 5 items sampled from these will also be normally distributed, as was demonstrated in Table 6.1 and Figure 6.1.
Let us assume that someone presents you with a single sample of 5 housefly wing lengths and you wish to test whether they could belong to the specified population. Your null hypothesis will be $H_0$: $\mu = 45.5$ or $H_0$: $\mu = \mu_0$, where $\mu$ is the true mean of the population from which you have sampled and $\mu_0$ stands for the hypothetical parametric mean of 45.5. We shall assume for the moment that we have no evidence that the variance of our sample is very much greater or smaller than the parametric variance of the housefly wing lengths. If it were, it would be unreasonable to assume that our sample comes from the specified population. There is a critical test of the assumption about the sample variance, which we shall take up later.

The curve at the center of Figure 6.9 represents the expected distribution of means of samples of 5 housefly wing lengths from the specified population. Acceptance and rejection regions for a type I error $\alpha = 0.05$ are delimited along the abscissa. The boundaries of the critical regions are computed as follows (remember that $t_{[\infty]}$ is equivalent to the normal distribution):

$$L_1 = \mu_0 - t_{0.05[\infty]}\,\sigma_{\bar{Y}} = 45.5 - (1.96)(1.744) = 42.08$$

and

$$L_2 = \mu_0 + t_{0.05[\infty]}\,\sigma_{\bar{Y}} = 45.5 + (1.96)(1.744) = 48.92$$
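The two boundaries can be recomputed from the parametric mean and variance given above. This is a minimal sketch using the standard-library `statistics.NormalDist` in place of a $t_{[\infty]}$ table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0, variance, n, alpha = 45.5, 15.21, 5, 0.05

se_mean = sqrt(variance / n)                    # standard error of the mean, about 1.744
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # normal deviate cutting off 2.5% per tail, about 1.96

lower_bound = mu_0 - z_crit * se_mean
upper_bound = mu_0 + z_crit * se_mean
print(round(se_mean, 3), round(lower_bound, 2), round(upper_bound, 2))
```

Changing `n` shows how quickly the acceptance region narrows as sample size grows, a point taken up later in this section.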
FIGURE 6.9
Expected distribution of means of samples of 5 housefly wing lengths from normal populations specified by $\mu$ as shown above curves and $\sigma_{\bar{Y}} = 1.744$. Center curve represents null hypothesis, $H_0$: $\mu = 45.5$; curves at sides represent alternative hypotheses, $\mu = 37$ or $\mu = 54$. Vertical lines delimit 5% rejection regions for the null hypothesis (2½% in each tail, shaded).
Thus, we would consider it improbable for means less than 42.08 or greater than 48.92 to have been sampled from this population. For such sample means we would therefore reject the null hypothesis. The test we are proposing is two-tailed because we have no a priori assumption about the possible alternatives to our null hypothesis. If we could assume that the true mean of the population from which the sample was taken could only be equal to or greater than 45.5, the test would be one-tailed.

Now let us examine various alternative hypotheses. One alternative hypothesis might be that the true mean of the population from which our sample stems is 54.0, but that the variance is the same as before. We can express this assumption as $H_1$: $\mu = 54.0$ or $H_1$: $\mu = \mu_1$, where $\mu_1$ stands for the alternative parametric mean 54.0. From Table II ("Areas of the normal curve") and our knowledge of the variance of the means, we can calculate the proportion of the distribution implied by $H_1$ that would overlap the acceptance region implied by $H_0$. We find that 54.0 is 5.08 measurement units from 48.92, the upper boundary of the acceptance region of $H_0$. This corresponds to $5.08/1.744 = 2.91\,\sigma_{\bar{Y}}$ units. From Table II we find that 0.0018 of the area will lie beyond $2.91\sigma$ at one tail of the curve. Thus, under this alternative hypothesis, 0.0018 of the distribution of $H_1$ will overlap the acceptance region of $H_0$. This is $\beta$, the type II error under this alternative hypothesis. Actually, this is not entirely correct. Since the left tail of the $H_1$ distribution goes all the way to negative infinity, it will leave the acceptance region and cross over into the left-hand rejection region of $H_0$. However, this represents only an infinitesimal amount of the area of $H_1$ (the lower critical boundary of $H_0$, 42.08, is $6.83\,\sigma_{\bar{Y}}$ units from $\mu_1 = 54.0$) and can be ignored.

Our alternative hypothesis $H_1$ specified that $\mu_1$ is 8.5 units greater than $\mu_0$.
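The type II error for the alternative $\mu_1 = 54.0$ can be sketched as the area of the alternative distribution of means that falls between the two acceptance boundaries. The code assumes the rounded values used in the text (42.08, 48.92 and $\sigma_{\bar{Y}} = 1.744$) and uses `statistics.NormalDist` in place of Table II.

```python
from statistics import NormalDist

se_mean = 1.744                            # standard error of means of 5 wing lengths
lower_bound, upper_bound = 42.08, 48.92    # acceptance region under H0 (alpha = 0.05)

def type_ii_error(mu_alt):
    """Area of the alternative distribution lying inside the acceptance region."""
    alt = NormalDist(mu=mu_alt, sigma=se_mean)
    return alt.cdf(upper_bound) - alt.cdf(lower_bound)

beta_54 = type_ii_error(54.0)
# Part of the H1 distribution that spills past the lower boundary into the
# left-hand rejection region; the text treats it as negligible
spill_over = NormalDist(mu=54.0, sigma=se_mean).cdf(lower_bound)
print(round(beta_54, 4), spill_over)
```

Subtracting the spill-over, as `type_ii_error` does, changes $\beta$ only far beyond the fourth decimal place, which is why it can be ignored.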
However, as said before, we may have no a priori reason to believe that the true mean of our sample is either greater or less than $\mu_0$. Therefore, we may simply assume that it is 8.5 measurement units away from 45.5. In such a case we must similarly calculate $\beta$ for the alternative hypothesis that $\mu_1 = \mu_0 - 8.5$. Thus the
alternative hypothesis becomes $H_1$: $\mu = 54.0$ or 37.0, or $H_1$: $\mu = \mu_1$, where $\mu_1$ represents either 54.0 or 37.0, the alternative parametric means. Since the distributions are symmetrical, $\beta$ is the same for both alternative hypotheses. Type II error for hypothesis $H_1$ is therefore 0.0018, regardless of which of the two alternative hypotheses is correct. If $H_1$ is really true, 18 out of 10,000 samples will lead to an incorrect acceptance of $H_0$, a very low proportion of error. These relations are shown in Figure 6.9.

You may rightly ask what reason we have to believe that the alternative parametric value for the mean is 8.5 measurement units to either side of $\mu_0 = 45.5$. It would be quite unusual if we had any justification for such a belief. As a matter of fact, the true mean may just as well be 7.5 or 6.0 or any number of units to either side of $\mu_0$. If we draw curves for $H_1$: $\mu = \mu_0 \pm 7.5$, we find that $\beta$ has increased considerably, the curves for $H_0$ and $H_1$ now being closer together. Thus, the magnitude of $\beta$ will depend on how far the alternative parametric mean is from the parametric mean of the null hypothesis. As the alternative mean approaches the parametric mean, $\beta$ increases up to a maximum value of $1 - \alpha$, which is the area of the acceptance region under the null hypothesis. At this maximum, the two distributions would be superimposed upon each other. Figure 6.10 illustrates the increase in $\beta$ as $\mu_1$ approaches $\mu_0$, starting with the test illustrated in Figure 6.9. To simplify the graph, the alternative distributions are shown for one tail only. Thus, we clearly see that $\beta$ is not a fixed value but varies with the nature of the alternative hypothesis.

An important concept in connection with hypothesis testing is the power of a test. It is $1 - \beta$, the complement of $\beta$, and is the probability of rejecting the null hypothesis when in fact it is false and the alternative hypothesis is correct.
Obviously, for any given test we would like the quantity $1 - \beta$ to be as large as possible and the quantity $\beta$ as small as possible. Since we generally cannot specify a given alternative hypothesis, we have to describe $\beta$ or $1 - \beta$ for a continuum of alternative values. When $1 - \beta$ is graphed in this manner, the result is called a power curve for the test under consideration. Figure 6.11 shows the power curve for the housefly wing length example just discussed. This figure can be compared with Figure 6.10, from which it is directly derived. Figure 6.10 emphasizes the type II error $\beta$, and Figure 6.11 graphs the complement of this value, $1 - \beta$. We note that the power of the test falls off sharply as the alternative hypothesis approaches the null hypothesis. Common sense confirms these conclusions: we can make clear and firm decisions about whether our sample comes from a population of mean 45.5 or 60.0. The power is essentially 1. But if the alternative hypothesis is that $\mu_1 = 45.6$, differing only by 0.1 from the value assumed under the null hypothesis, it will be difficult to decide which of these hypotheses is true, and the power will be very low.

To improve the power of a given test (or decrease $\beta$) while keeping $\alpha$ constant for a stated null hypothesis, we must increase sample size. If instead of sampling 5 wing lengths we had sampled 35, the distribution of means would be much narrower. Thus, rejection regions for the identical type I error would now commence at 44.21 and 46.79. Although the acceptance and rejection regions have
FIGURE 6.10
Diagram to illustrate increases in type II error $\beta$ as alternative hypothesis $H_1$ approaches null hypothesis $H_0$, that is, $\mu_1$ approaches $\mu_0$. Shading represents $\beta$. Vertical lines mark off 5% critical regions (2½% in each tail) for the null hypothesis. To simplify the graph the alternative distributions are shown for one tail only. Data identical to those in Figure 6.9.

FIGURE 6.11
Power curves for testing $H_0$: $\mu = 45.5$, $H_1$: $\mu \ne 45.5$ for $n = 5$
remained the same proportionately, the acceptance region has become much narrower in absolute value. Previously, we could not, with confidence, reject the null hypothesis for a sample mean of 48.0. Now, when based on 35 individuals, a mean as deviant as 48.0 would occur only 15 times out of 100,000 and the hypothesis would, therefore, be rejected.

What has happened to type II error? Since the distribution curves are not as wide as before, there is less overlap between them; if the alternative hypothesis $H_1$: $\mu = 54.0$ or 37.0 is true, the probability that the null hypothesis could be accepted by mistake (type II error) is infinitesimally small. If we let $\mu_1$ approach $\mu_0$, $\beta$ will increase, of course, but it will always be smaller than the corresponding value for sample size $n = 5$. This comparison is shown in Figure 6.11, where the power for the test with $n = 35$ is much higher than that for $n = 5$. If we were to increase our sample size to 100 or 1000, the power would be still further increased. Thus, we reach an important conclusion: If a given test is not sensitive enough, we can increase its sensitivity (= power) by increasing sample size.

There is yet another way of increasing the power of a test. If we cannot increase sample size, the power may be raised by changing the nature of the test. Different statistical techniques testing roughly the same hypothesis may differ substantially both in the actual magnitude and in the slopes of their power curves. Tests that maintain higher power levels over substantial ranges of alternative hypotheses are clearly to be preferred.
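The effect of sample size on power can be made concrete with a small calculation. The sketch below assumes the same two-tailed test of $H_0$: $\mu = 45.5$ with $\sigma^2 = 15.21$ and $\alpha = 0.05$ and evaluates $1 - \beta$ for a few alternative means; the function `power` and the chosen alternative values are illustrative, not taken from Figure 6.11.

```python
from math import sqrt
from statistics import NormalDist

MU_0, VARIANCE, ALPHA = 45.5, 15.21, 0.05

def power(mu_alt, n):
    """Probability of rejecting H0: mu = MU_0 (two-tailed) when the true mean is mu_alt."""
    se = sqrt(VARIANCE / n)                      # standard error of the mean
    z = NormalDist().inv_cdf(1 - ALPHA / 2)
    lower, upper = MU_0 - z * se, MU_0 + z * se  # acceptance region under H0
    alt = NormalDist(mu=mu_alt, sigma=se)
    beta = alt.cdf(upper) - alt.cdf(lower)       # type II error
    return 1 - beta

for mu_alt in (45.6, 47.0, 48.0, 50.0, 54.0):
    print(mu_alt, round(power(mu_alt, 5), 3), round(power(mu_alt, 35), 3))
```

At every alternative the power for 35 wing lengths exceeds that for 5, and when the alternative coincides with $\mu_0$ the "power" reduces to $\alpha$ itself.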
The popularity of the various nonparametric tests, mentioned in several places in this book, has grown not only because of their computational simplicity but also because their power curves are less affected by failure of assumptions than are those of the parametric methods. However, it is also true that nonparametric tests have lower overall power than parametric ones, when all the assumptions of the parametric test are met.

Let us briefly look at a one-tailed test. The null hypothesis is $H_0$: $\mu_0 = 45.5$, as before. However, the alternative hypothesis assumes that we have reason to believe that the parametric mean of the population from which our sample has been taken cannot possibly be less than $\mu_0 = 45.5$: if it is different from that value, it can only be greater than 45.5. We might have two grounds for such a hypothesis. First, we might have some biological reason for such a belief. Our parametric flies might be a dwarf population, so that any other population from which our sample could come must be bigger. A second reason might be that we are interested in only one direction of difference. For example, we may be testing the effect of a chemical in the larval food intended to increase the size of the flies in the sample. Therefore, we would expect that $\mu_1 > \mu_0$, and we would not be interested in testing for any $\mu_1$ that is less than $\mu_0$, because such an effect is the exact opposite of what we expect. Similarly, if we are investigating the effect of a certain drug as a cure for cancer, we might wish to compare the untreated population that has a mean fatality rate $\theta$ (from cancer) with the treated population, whose rate is $\theta_1$. Our alternative hypotheses will be $H_1$: $\theta_1 < \theta$.
T h a t is, we arc not interested in any 0 { that is greater than 0, bccause if o u r d r u g will increase mortality f r o m cancer, it certainly is not m u c h of a prospect for a cure.
FIGURE 6.12 One-tailed significance test for the distribution of Figure 6.9. The vertical line now cuts off a 5% rejection region from one tail of the distribution (the corresponding area of the curve has been shaded).
When such a one-tailed test is performed, the rejection region along the abscissa is under only one tail of the curve representing the null hypothesis. Thus, for our housefly data (distribution of means of sample size $n = 5$), the rejection region will be in one tail of the curve only and for a 5% type I error will appear as shown in Figure 6.12. We compute the critical boundary as $45.5 + (1.645)(1.744) = 48.37$. The 1.645 is $t_{0.10[\infty]}$, which corresponds to the 5% value for a one-tailed test. Compare this rejection region, which rejects the null hypothesis for all means greater than 48.37, with the two rejection regions in Figure 6.10, which reject the null hypothesis for means lower than 42.08 and greater than 48.92. The alternative hypothesis is considered for one tail of the distribution only, and the power curve of the test is not symmetrical but is drawn out with respect to one side of the distribution only.

6.9 Tests of simple hypotheses employing the t distribution

We shall proceed to apply our newly won knowledge of hypothesis testing to a simple example involving the $t$ distribution. Government regulations prescribe that the standard dosage in a certain biological preparation should be 600 activity units per cubic centimeter. We prepare 10 samples of this preparation and test each for potency. We find that the mean number of activity units per sample is 592.5 units per cc and the standard deviation of the samples is 11.2. Does our sample conform to the government standard? Stated more precisely, our null hypothesis is $H_0$: $\mu = \mu_0$. The alternative hypothesis is that the dosage is not equal to 600, or $H_1$: $\mu \neq \mu_0$. We proceed to calculate the significance of the deviation $\bar{Y} - \mu_0$ expressed in standard deviation units.
The appropriate standard deviation is that of means (the standard error of the mean), not the standard deviation of items, because the deviation is that of a sample mean around a parametric mean. We therefore calculate $s_{\bar{Y}} = s/\sqrt{n} = 11.2/\sqrt{10} = 3.542$. We next test the deviation $(\bar{Y} - \mu_0)/s_{\bar{Y}}$. We have seen earlier, in Section 6.4, that a deviation divided by an estimated
standard deviation will be distributed according to the $t$ distribution with $n - 1$ degrees of freedom. We therefore write
$$t_s = \frac{\bar{Y} - \mu_0}{s_{\bar{Y}}} \tag{6.11}$$
This indicates that we would expect this deviation to be distributed as a $t$ variate. Note that in Expression (6.11) we wrote $t_s$. In most textbooks you will find this ratio simply identified as $t$, but in fact the $t$ distribution is a parametric and theoretical distribution that generally is only approached, but never equaled, by observed, sampled data. This may seem a minor distinction, but readers should be quite clear that in any hypothesis testing of samples we are only assuming that the distributions of the tested variables follow certain theoretical probability distributions. To conform with general statistical practice, the $t$ distribution should really have a Greek letter (such as $\tau$), with $t$ serving as the sample statistic. Since this would violate long-standing practice, we prefer to use the subscript $s$ to indicate the sample value. The actual test is very simple. We calculate Expression (6.11),
$$t_s = \frac{592.5 - 600}{3.542} = \frac{-7.5}{3.542} = -2.12 \qquad df = n - 1 = 9$$
and compare it with the expected values for $t$ at 9 degrees of freedom. Since the $t$ distribution is symmetrical, we shall ignore the sign of $t_s$ and always look up its positive value in Table III. The two values on either side of $t_s$ are $t_{0.05[9]} = 2.26$ and $t_{0.10[9]} = 1.83$. These are $t$ values for two-tailed tests, appropriate in this instance because the alternative hypothesis is that $\mu \neq 600$: that is, it can be smaller or greater. It appears that the significance level of our value of $t_s$ is between 5% and 10%; if the null hypothesis is actually true, the probability of obtaining a deviation as great as or greater than 7.5 is somewhere between 0.05 and 0.10. By customary levels of significance, this is insufficient for declaring the sample mean significantly different from the standard. We consequently accept the null hypothesis. In conventional language, we would report the results of the statistical analysis as follows: "The sample mean is not significantly different from the accepted standard." Such a statement in a scientific report should always be backed up by a probability value, and the proper way of presenting this is to write "$0.10 > P > 0.05$." This means that the probability of such a deviation is between 0.05 and 0.10. Another way of saying this is that the value of $t_s$ is not significant (frequently abbreviated as ns). A convention often encountered is the use of asterisks after the computed value of the significance test, as in $t_s = 2.86$**. The symbols generally represent the following probability ranges:
* = $0.05 > P > 0.01$
** = $0.01 > P > 0.001$
*** = $P < 0.001$
However, since some authors occasionally imply other ranges by these asterisks, the meaning of the symbols has to be specified in each scientific report.
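The arithmetic of this test is easy to retrace. The sketch below uses only the figures quoted above (10 samples, mean 592.5, standard deviation 11.2, standard 600) and the two tabled critical values for 9 degrees of freedom. The variable names are our own, and the critical values are copied from the text rather than computed.

```python
from math import sqrt

n = 10         # number of samples of the preparation
mean = 592.5   # observed mean, activity units per cc
s = 11.2       # standard deviation of the samples
mu0 = 600.0    # government standard

se = s / sqrt(n)           # standard error of the mean
t_s = (mean - mu0) / se    # Expression (6.11)
df = n - 1                 # degrees of freedom

# Two-tailed critical values of t for 9 df, as quoted in the text from Table III.
t_05, t_10 = 2.26, 1.83

significant_at_5pct = abs(t_s) > t_05
between_5_and_10pct = t_10 < abs(t_s) <= t_05

print(f"s_Y = {se:.3f}, t_s = {t_s:.2f}, df = {df}")
print("significant at 5% (two-tailed):", significant_at_5pct)
print("0.10 > P > 0.05:", between_5_and_10pct)
```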
It might be argued that in a biological preparation the concern of the tester should not be whether the sample differs significantly from a standard, but whether it is significantly below the standard. This may be one of those biological preparations in which an excess of the active component is of no harm but a shortage would make the preparation ineffective at the conventional dosage. Then the test becomes one-tailed, performed in exactly the same manner except that the critical values of $t$ for a one-tailed test are at half the probabilities of the two-tailed test. Thus 2.26, the former 0.05 value, becomes $t_{0.025[9]}$, and 1.83, the former 0.10 value, becomes $t_{0.05[9]}$, making our observed $t_s$ value of 2.12 "significant at the 5% level" or, more precisely stated, significant at $0.05 > P > 0.025$. If we are prepared to accept a 5% significance level, we would consider the preparation significantly below the standard. You may be surprised that the same example, employing the same data and significance tests, should lead to two different conclusions, and you may begin to wonder whether some of the things you hear about statistics and statisticians are not, after all, correct. The explanation lies in the fact that the two results are answers to different questions. If we test whether our sample is significantly different from the standard in either direction, we must conclude that it is not different enough for us to reject the null hypothesis. If, on the other hand, we exclude from consideration the possibility that the true mean $\mu$ of the sampled population could be greater than the established standard $\mu_0$, the difference as found by us is clearly significant. It is obvious from this example that in any statistical test one must clearly state whether a one-tailed or a two-tailed test has been performed if the nature of the example is such that there could be any doubt about the matter.
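The two readings of the same $t_s$ can be placed side by side. This is a small sketch, assuming only the two tabled values for 9 degrees of freedom quoted in the text. The helper `p_bounds` is a hypothetical name of our own, and it assumes that the observed deviation lies in the direction named by the one-tailed alternative.

```python
# Two-tailed probabilities and critical t values for 9 df, as quoted in the text.
CRITICAL_T_9DF = [(0.10, 1.83), (0.05, 2.26)]


def p_bounds(t_s, tails):
    """Return (lower, upper) bounds on P; None marks an open end of the bracket."""
    scale = 1.0 if tails == 2 else 0.5   # one-tailed P is half the two-tailed P
    lower, upper = None, None
    for p_two_tailed, t_crit in CRITICAL_T_9DF:
        if abs(t_s) > t_crit:
            upper = p_two_tailed * scale   # P lies below this value
        else:
            lower = p_two_tailed * scale   # P lies above this value
            break
    return lower, upper


def significant(t_s, tails, alpha=0.05):
    """True when the table shows P to be no greater than alpha."""
    _, upper = p_bounds(t_s, tails)
    return upper is not None and upper <= alpha


t_s = -2.12
print("two-tailed:", p_bounds(t_s, 2), significant(t_s, 2))
print("one-tailed:", p_bounds(t_s, 1), significant(t_s, 1))
```

The same statistic is bracketed as $0.10 > P > 0.05$ in the two-tailed reading and as $0.05 > P > 0.025$ in the one-tailed reading, which is why the two tests answer different questions.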
We should also point out that such a difference in the outcome of the results is not necessarily typical. It arises only because the outcome in this case is in a borderline area between clear significance and nonsignificance. Had the difference between sample and standard been 10.5 activity units, the sample would have been unquestionably significantly different from the standard by the one-tailed or the two-tailed test. The promulgation of a standard mean is generally insufficient for the establishment of a rigid standard for a product. If the variance among the samples is sufficiently large, it will never be possible to establish a significant difference between the standard and the sample mean. This is an important point that should be quite clear to you. Remember that the standard error can be increased in two ways: by lowering sample size or by increasing the standard deviation of the replicates. Both of these are undesirable aspects of any experimental setup. The test described above for the biological preparation leads us to a general test for the significance of any statistic, that is, for the significance of a deviation of any statistic from a parametric value, which is outlined in Box 6.4. Such a test applies whenever the statistics are expected to be normally distributed. When the standard error is estimated from the sample, the $t$ distribution is used. However, since the normal distribution is just a special case $t_{[\infty]}$ of the $t$ distribution, most statisticians uniformly apply the $t$ distribution with the appropriate degrees of freedom from 1 to infinity. An example of such a test is the $t$ test for the significance of a regression coefficient shown in step 2 of Box 11.4.

BOX 6.4 Testing the significance of a statistic, that is, the significance of a deviation of a sample statistic from a parametric value. For normally distributed statistics.
Computational steps
1. Compute $t_s$ as the following ratio:
$$t_s = \frac{St - St_p}{s_{St}}$$
where $St$ is a sample statistic, $St_p$ is the parametric value against which the sample statistic is to be tested, and $s_{St}$ is its estimated standard error, obtained from Box 6.1, or elsewhere in this book.
2. The pertinent hypotheses are $H_0$: $St = St_p$, $H_1$: $St \neq St_p$ for a two-tailed test, and $H_0$: $St = St_p$, $H_1$: $St > St_p$ or $H_0$: $St = St_p$, $H_1$: $St < St_p$ for a one-tailed test.
3. In the two-tailed test, look up the critical value of $t_{\alpha[\nu]}$, where $\alpha$ is the type I error agreed upon and $\nu$ is the degrees of freedom pertinent to the standard error employed (see Box 6.1). In the one-tailed test look up the critical value of $t_{2\alpha[\nu]}$ for a significance level of $\alpha$.
4. Accept or reject the appropriate hypothesis in 2 on the basis of the $t_s$ value in 1 compared with critical values of $t$ in 3.

6.10 Testing the hypothesis $H_0$: $\sigma^2 = \sigma_0^2$

The method of Box 6.4 can be used only if the statistic is normally distributed. In the case of the variance, this is not so. As we have seen, in Section 6.6, sums of squares divided by $\sigma^2$ follow the $\chi^2$ distribution. Therefore, for testing the hypothesis that a sample variance is different from a parametric variance, we must employ the $\chi^2$ distribution. Let us use the biological preparation of the last section as an example. We were told that the standard deviation was 11.2 based on 10 samples. Therefore, the variance must have been 125.44. Suppose the government postulates that the variance of samples from the preparation should be no greater than 100.0. Is our sample variance significantly above 100.0? Remembering from
Section 6.6 that $(n - 1)s^2/\sigma^2$ is distributed as $\chi^2$ with $n - 1$ degrees of freedom, we proceed to calculate
$$X^2 = \frac{(n - 1)s^2}{\sigma^2} = \frac{(9)125.44}{100} = 11.290$$
Note that we call the quantity $X^2$ rather than $\chi^2$. This is done to emphasize that we are obtaining a sample statistic that we shall compare with the parametric distribution. Following the general outline of Box 6.4, we next establish our null and alternative hypotheses, which are $H_0$: $\sigma^2 = \sigma_0^2$ and $H_1$: $\sigma^2 > \sigma_0^2$; that is, we are to perform a one-tailed test. The critical value of $\chi^2$ is found next as $\chi^2_{\alpha[\nu]}$, where $\alpha$ is the proportion of the $\chi^2$ distribution to the right of the critical value, as described in Section 6.6, and $\nu$ is the pertinent degrees of freedom. You see now why we used the symbol $\alpha$ for that portion of the area. It corresponds to the probability of a type I error. For $\nu = 9$ degrees of freedom, we find in Table IV that
$$\chi^2_{0.05[9]} = 16.919 \qquad \chi^2_{0.10[9]} = 14.684 \qquad \chi^2_{0.50[9]} = 8.343$$
We notice that the probability of getting a $\chi^2$ as large as 11.290 is therefore less than 0.50 but higher than 0.10, assuming that the null hypothesis is true. Thus $X^2$ is not significant at the 5% level, we have no basis for rejecting the null hypothesis, and we must conclude that the variance of the 10 samples of the biological preparation may be no greater than the standard permitted by the government. If we had decided to test whether the variance is different from the standard, permitting it to deviate in either direction, the hypotheses for this two-tailed test would have been $H_0$: $\sigma^2 = \sigma_0^2$ and $H_1$: $\sigma^2 \neq \sigma_0^2$, and a 5% type I error would have yielded the following critical values for the two-tailed test:
$$\chi^2_{0.975[9]} = 2.700 \qquad \chi^2_{0.025[9]} = 19.023$$
The values represent chi-squares at points cutting off 2½% rejection regions at each tail of the $\chi^2$ distribution. A value of $X^2 < 2.700$ or $> 19.023$ would have been evidence that the sample variance did not belong to this population. Our value of $X^2 = 11.290$ would again have led to an acceptance of the null hypothesis. In the next chapter we shall see that there is another significance test available to test the hypotheses about variances of the present section. This is the mathematically equivalent $F$ test, which is, however, a more general test, allowing us to test the hypothesis that two sample variances come from populations with equal variances.
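The variance test of this section reduces to one ratio and a table lookup. The sketch below retraces it with the figures given above; the critical values are copied from the text (Table IV, 9 degrees of freedom) rather than computed, and the variable names are our own.

```python
n = 10             # number of samples
s2 = 125.44        # sample variance (11.2 squared)
sigma2_0 = 100.0   # variance permitted by the standard

X2 = (n - 1) * s2 / sigma2_0   # sample statistic compared with chi-square

# Critical values of chi-square for 9 df, as quoted in the text from Table IV.
chi2_05 = 16.919                     # one-tailed test, alpha = 0.05
chi2_975, chi2_025 = 2.700, 19.023   # two-tailed test, 2.5% in each tail

reject_one_tailed = X2 > chi2_05
reject_two_tailed = X2 < chi2_975 or X2 > chi2_025

print(f"X2 = {X2:.3f}")
print("reject H0 (one-tailed):", reject_one_tailed)
print("reject H0 (two-tailed):", reject_two_tailed)
```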
Exercises
6.1 Since it is possible to test a statistical hypothesis with any size sample, why are larger sample sizes preferred? ANS. When the null hypothesis is false, the probability of a type II error decreases as $n$ increases.
6.2 Differentiate between type I and type II errors. What do we mean by the power of a statistical test?
6.3 Set 99% confidence limits to the mean, median, coefficient of variation, and variance for the birth weight data given in Box 3.2. ANS. The lower limits are 109.540, 109.060, 12.136, and 178.698, respectively.
6.4 The 95% confidence limits for $\mu$ as obtained in a given sample were 4.91 and 5.67 g. Is it correct to say that 95 times out of 100 the population mean, $\mu$, falls inside the interval from 4.91 to 5.67 g? If not, what would the correct statement be?
6.5 In a study of mating calls in the tree toad Hyla ewingi, Littlejohn (1965) found the note duration of the call in a sample of 39 observations from Tasmania to have a mean of 189 msec and a standard deviation of 32 msec. Set 95% confidence intervals to the mean and to the variance. ANS. The 95% confidence limits for the mean are from 178.6 to 199.4. The 95% shortest unbiased limits for the variance are from 679.5 to 1646.6.
6.6 Set 95% confidence limits to the means listed in Table 6.2. Are these limits all correct? (That is, do they contain $\mu$?)
6.7 In Section 4.3 the coefficient of dispersion was given as an index of whether or not data agreed with a Poisson distribution. Since in a true Poisson distribution, the mean $\mu$ equals the parametric variance $\sigma^2$, the coefficient of dispersion is analogous to Expression (6.8). Using the mite data from Table 4.5, test the hypothesis that the true variance is equal to the sample mean, in other words, that we have sampled from a Poisson distribution (in which the coefficient of dispersion should equal unity). Note that in these examples the chi-square table is not adequate, so that approximate critical values must be computed using the method given with Table IV. In Section 7.3 an alternative significance test that avoids this problem will be presented. ANS. $X^2 = (n - 1) \times CD = 1308.30$; the approximate critical value of $\chi^2$ is 645.708.
6.8 Using the method described in Exercise 6.7, test the agreement of the observed distribution with a Poisson distribution by testing the hypothesis that the true coefficient of dispersion equals unity for the data of Table 4.6.
6.9 In a study of bill measurements of the dusky flycatcher, Johnson (1966) found that the bill length for the males had a mean of 8.14 ± 0.021 and a coefficient of variation of 4.67%. On the basis of this information, infer how many specimens must have been used. ANS. Since $V = 100s/\bar{Y}$ and $s_{\bar{Y}} = s/\sqrt{n}$, $\sqrt{n} = V\bar{Y}/(100\,s_{\bar{Y}})$. Thus $n = 328$.
6.10 In direct klinokinetic behavior relating to temperature, animals turn more often in the warm end of a gradient and less often in the colder end, the direction of turning being at random, however. In a computer simulation of such behavior, the following results were found. The mean position along a temperature gradient was found to be 1.352. The standard deviation was 12.267, and $n$ equaled 500 individuals. The gradient was marked off in units: zero corresponded to the middle of the gradient, the initial starting point of the animals; minus corresponded to the cold end; and plus corresponded to the warmer end. Test the hypothesis that direct klinokinetic behavior did not result in a tendency toward aggregation in either the warmer or colder end; that is, test the hypothesis that $\mu$, the mean position along the gradient, was zero.
6.11 In an experiment comparing yields of three new varieties of corn, the following results were obtained.

Variety      1        2        3
$\bar{Y}$    22.86    43.21    38.56
$n$          20       20       20

To compare the three varieties the investigator computed a weighted mean of the three means using the weights 2, −1, −1. Compute the weighted mean and its 95% confidence limits, assuming that the variance of each value for the weighted mean is zero. ANS. $\bar{Y}_w = -36.05$, $s^2_{\bar{Y}_w} = 34.458$, the 95% confidence limits are −47.555 to −24.545, and the weighted mean is significantly different from zero even at the $P < 0.001$ level.
CHAPTER 7
Introduction to Analysis of Variance
We now proceed to a study of the analysis of variance. This method, developed by R. A. Fisher, is fundamental to much of the application of statistics in biology and especially to experimental design. One use of the analysis of variance is to test whether two or more sample means have been obtained from populations with the same parametric mean. Where only two samples are involved, the $t$ test can also be used. However, the analysis of variance is a more general test, which permits testing two samples as well as many, and we are therefore introducing it at this early stage in order to equip you with this powerful weapon for your statistical arsenal. We shall discuss the $t$ test for two samples as a special case in Section 8.4. In Section 7.1 we shall approach the subject on familiar ground, the sampling experiment of the housefly wing lengths. From these samples we shall obtain two independent estimates of the population variance. We digress in Section 7.2 to introduce yet another continuous distribution, the $F$ distribution, needed for the significance test in analysis of variance. Section 7.3 is another digression; here we show how the $F$ distribution can be used to test whether two samples may reasonably have been drawn from populations with the same variance. We are now ready for Section 7.4, in which we examine the effects of subjecting the samples to different treatments. In Section 7.5, we describe the partitioning of
sums of squares and of degrees of freedom, the actual analysis of variance. The last two sections (7.6 and 7.7) take up in a more formal way the two scientific models for which the analysis of variance is appropriate, the so-called fixed treatment effects model (Model I) and the variance component model (Model II). Except for Section 7.3, the entire chapter is largely theoretical. We shall postpone the practical details of computation to Chapter 8. However, a thorough understanding of the material in Chapter 7 is necessary for working out actual examples of analysis of variance in Chapter 8. One final comment. We shall use J. W. Tukey's acronym "anova" interchangeably with "analysis of variance" throughout the text.

7.1 The variances of samples and their means

We shall approach analysis of variance through the familiar sampling experiment of housefly wing lengths (Experiment 5.1 and Table 5.1), in which we combined seven samples of 5 wing lengths to form samples of 35. We have reproduced one such sample in Table 7.1. The seven samples of 5, here called groups, are listed vertically in the upper half of the table. Before we proceed to explain Table 7.1 further, we must become familiar with added terminology and symbolism for dealing with this kind of problem. We call our samples groups; they are sometimes called classes or are known by yet other terms we shall learn later. In any analysis of variance we shall have two or more such samples or groups, and we shall use the symbol $a$ for the number of groups. Thus, in the present example $a = 7$. Each group or sample is based on $n$ items, as before; in Table 7.1, $n = 5$. The total number of items in the table is $a$ times $n$, which in this case equals $7 \times 5$ or 35. The sums of the items in the respective groups are shown in the row underneath the horizontal dividing line.
[Table 7.1: seven groups of 5 housefly wing lengths, with the sum, mean, $\sum^n Y^2$, and $\sum^n y^2$ of each group, and the computation of the sum of squares of means; the table entries are not recoverable.]

In an anova, summation signs can no longer be as simple as heretofore. We can sum either the items of one group only or the items of the entire table. We therefore have to use superscripts with the summation symbol. In line with our policy of using the simplest possible notation, whenever this is not likely to lead to misunderstanding, we shall use $\sum^n Y$ to indicate the sum of the items of a group and $\sum^{an} Y$ to indicate the sum of all the items in the table. The sum of the items of each group is shown in the first row under the horizontal line. The mean of each group, symbolized by $\bar{Y}$, is in the next row and is computed simply as $\sum^n Y/n$. The remaining two rows in that portion of Table 7.1 list $\sum^n Y^2$ and $\sum^n y^2$, separately for each group. These are the familiar quantities, the sum of the squared $Y$'s and the sum of squares of $Y$. From the sum of squares for each group we can obtain an estimate of the population variance of housefly wing length. Thus, in the first group $\sum^n y^2 = 29.2$. Therefore, our estimate of the population variance is
$$s^2 = \frac{\sum^n y^2}{n - 1} = \frac{29.2}{4} = 7.3$$
a rather low estimate compared with those obtained in the other samples. Since we have a sum of squares for each group, we could obtain an estimate of the population variance from each of these. However, it stands to reason that we would get a better estimate if we averaged these separate variance estimates in some way. This is done by computing the weighted average of the variances by Expression (3.2) in Section 3.1. Actually, in this instance a simple average would suffice, since all estimates of the variance are based on samples of the same size. However, we prefer to give the general formula, which works equally well for this case as well as for instances of unequal sample sizes, where the weighted average is necessary. In this case each sample variance $s_i^2$ is weighted by its degrees of freedom, $w_i = n_i - 1$, resulting in a sum of squares ($\sum y_i^2$), since $(n_i - 1)s_i^2 = \sum y_i^2$. Thus, the numerator of Expression (3.2) is the sum of the sums of squares. The denominator is $\sum^a (n_i - 1) = 7 \times 4$, the sum of the degrees of freedom of each group. The average variance, therefore, is
$$s^2 = \frac{448.8}{28} = 16.029$$
This quantity is an estimate of 15.21, the parametric variance of housefly wing lengths. This estimate, based on 7 independent estimates of variances of groups, is called the average variance within groups or simply variance within groups. Note that we use the expression within groups, although in previous chapters we used the term variance of groups. The reason we do this is that the variance estimates used for computing the average variance have so far all come from sums of squares measuring the variation within one column. As we shall see in what follows, one can also compute variances among groups, cutting across group boundaries. To obtain a second estimate of the population variance, we treat the seven group means as though they were a sample of seven observations. The resulting statistics are shown in the lower right part of Table 7.1, headed "Computation of sum of squares of means." There are seven means in this example; in the general case there will be $a$ means. We first compute $\sum^a \bar{Y}$, the sum of the means. Note that this is rather sloppy symbolism. To be entirely proper, we should identify this quantity as $\sum_{i=1}^{i=a} \bar{Y}_i$, summing the means of group 1 through group $a$. The next quantity computed is $\bar{\bar{Y}}$, the grand mean of the group means, computed as $\bar{\bar{Y}} = \sum^a \bar{Y}/a$. The sum of the seven means is $\sum^a \bar{Y} = 317.4$, and the grand mean is $\bar{\bar{Y}} = 45.34$, a fairly close approximation to the parametric mean $\mu = 45.5$. The sum of squares represents the deviations of the group means from the grand mean, $\sum^a (\bar{Y} - \bar{\bar{Y}})^2$. For this we first need the quantity $\sum^a \bar{Y}^2$, which equals 14,417.24. The customary computational formula for sum of squares applied to these means is $\sum^a \bar{Y}^2 - [(\sum^a \bar{Y})^2/a] = 25.417$. From the sum of squares of the means we obtain a variance among the means in the conventional way as follows: $\sum^a (\bar{Y} - \bar{\bar{Y}})^2/(a - 1)$.
We divide by $a - 1$ rather than $n - 1$ because the sum of squares was based on $a$ items (means). Thus, variance of the means $s_{\bar{Y}}^2 = 25.417/6 = 4.2362$. We learned in Chapter 6, Expression (6.1), that when we randomly sample from a single population,
$$\sigma_{\bar{Y}}^2 = \frac{\sigma^2}{n}$$
and hence
$$\sigma^2 = n\sigma_{\bar{Y}}^2$$
Thus, we can estimate a variance of items by multiplying the variance of means by the sample size on which the means are based (assuming we have sampled at random from a common population). When we do this for our present example, we obtain $s^2 = 5 \times 4.2362 = 21.181$. This is a second estimate of the parametric variance 15.21. It is not as close to the true value as the previous estimate based on the average variance within groups, but this is to be expected, since it is based on only 7 "observations." We need a name describing this variance to distinguish it from the variance of means from which it has been computed, as well as from the variance within groups with which it will be compared. We shall call it the variance among groups; it is $n$ times the variance of means and is an independent estimate of the parametric variance $\sigma^2$ of the housefly wing lengths. It may not be clear at this stage why the two estimates of $\sigma^2$ that we have obtained, the variance within groups and the variance among groups, are independent. We ask you to take on faith that they are. Let us review what we have done so far by expressing it in a more formal way. Table 7.2 represents a generalized table for data such as the samples of housefly wing lengths. Each individual wing length is represented by $Y$, subscripted to indicate the position of the quantity in the data table. The wing length of the $j$th fly from the $i$th sample or group is given by $Y_{ij}$. Thus, you will notice that the first subscript changes with each column representing a group in the table, and the second subscript changes with each row representing an individual item.

TABLE 7.2 Data arranged for simple analysis of variance, single classification, completely randomized.
$$\begin{array}{l|ccccccc}
 & \text{Groups} \\
\text{Items} & 1 & 2 & 3 & \cdots & i & \cdots & a \\
\hline
1 & Y_{11} & Y_{21} & Y_{31} & \cdots & Y_{i1} & \cdots & Y_{a1} \\
2 & Y_{12} & Y_{22} & Y_{32} & \cdots & Y_{i2} & \cdots & Y_{a2} \\
3 & Y_{13} & Y_{23} & Y_{33} & \cdots & Y_{i3} & \cdots & Y_{a3} \\
\vdots & \vdots & \vdots & \vdots & & \vdots & & \vdots \\
j & Y_{1j} & Y_{2j} & Y_{3j} & \cdots & Y_{ij} & \cdots & Y_{aj} \\
\vdots & \vdots & \vdots & \vdots & & \vdots & & \vdots \\
n & Y_{1n} & Y_{2n} & Y_{3n} & \cdots & Y_{in} & \cdots & Y_{an} \\
\hline
\text{Sums} & \sum^n Y_1 & \sum^n Y_2 & \sum^n Y_3 & \cdots & \sum^n Y_i & \cdots & \sum^n Y_a \\
\text{Means} & \bar{Y}_1 & \bar{Y}_2 & \bar{Y}_3 & \cdots & \bar{Y}_i & \cdots & \bar{Y}_a
\end{array}$$

Using this notation, we can compute the variance of sample 1 as
$$\frac{1}{n - 1}\sum_{j=1}^{j=n}(Y_{1j} - \bar{Y}_1)^2$$
The variance within groups, which is the average variance of the samples, is computed as
$$\frac{1}{a(n - 1)}\sum_{i=1}^{i=a}\sum_{j=1}^{j=n}(Y_{ij} - \bar{Y}_i)^2$$
Note the double summation. It means that we start with the first group, setting $i = 1$ ($i$ being the index of the outer $\sum$). We sum the squared deviations of all items from the mean of the first group, changing index $j$ of the inner $\sum$ from 1 to $n$ in the process. We then return to the outer summation, set $i = 2$, and sum the squared deviations for group 2 from $j = 1$ to $j = n$. This process is continued until $i$, the index of the outer $\sum$, is set to $a$. In other words, we sum all the squared deviations within one group first and add this sum to similar sums from all the other groups. The variance among groups is computed as
$$\frac{n}{a - 1}\sum_{i=1}^{i=a}(\bar{Y}_i - \bar{\bar{Y}})^2$$
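The two estimates can be written out as short functions. This is a sketch under stated assumptions: the function names are our own, the among-groups function assumes equal sample sizes as in the formula above, and the small data set is a hypothetical toy example, since the individual wing lengths of Table 7.1 are not reproduced here. The last part uses only the summary quantities quoted in the text.

```python
from statistics import mean, variance


def variance_within(groups):
    """Average variance within groups: pooled sums of squares over pooled df."""
    ss, df = 0.0, 0
    for g in groups:
        g_mean = mean(g)
        ss += sum((y - g_mean) ** 2 for y in g)   # inner summation over j
        df += len(g) - 1
    return ss / df


def variance_among(groups):
    """n times the variance of the group means (equal sample sizes assumed)."""
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal sample sizes")
    return n * variance([mean(g) for g in groups])


# Hypothetical toy data (a = 3 groups of n = 4 items), not the housefly samples.
toy_groups = [[41, 44, 48, 43], [48, 49, 49, 46], [40, 50, 44, 46]]
print(variance_within(toy_groups), variance_among(toy_groups))

# Summary quantities quoted in the text for the housefly groups (a = 7, n = 5).
a, n = 7, 5
sum_means, sum_sq_means = 317.4, 14417.24
ss_means = sum_sq_means - sum_means ** 2 / a   # sum of squares of the means
var_means = ss_means / (a - 1)                 # variance of the means
among_from_text = n * var_means                # variance among groups
within_from_text = 448.8 / (a * (n - 1))       # variance within groups
print(round(among_from_text, 3), round(within_from_text, 3))
```

With equal sample sizes the pooled computation in `variance_within` gives the same result as a simple average of the group variances.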
Now that we have two independent estimates of the population variance, what shall we do with them? We might wish to find out whether they do in fact estimate the same parameter. To test this hypothesis, we need a statistical test that will evaluate the probability that the two sample variances are from the same population. Such a test employs the $F$ distribution, which is taken up next.

7.2 The F distribution

Let us devise yet another sampling experiment. This is quite a tedious one without the use of computers, so we will not ask you to carry it out. Assume that you are sampling at random from a normally distributed population, such as the housefly wing lengths with mean $\mu$ and variance $\sigma^2$. The sampling procedure consists of first sampling $n_1$ items and calculating their variance $s_1^2$, followed by sampling $n_2$ items and calculating their variance $s_2^2$. Sample sizes $n_1$ and $n_2$ may or may not be equal to each other, but are fixed for any one sampling experiment. Thus, for example, we might always sample 8 wing lengths for the first sample ($n_1$) and 6 wing lengths for the second sample ($n_2$). After each pair of values ($s_1^2$ and $s_2^2$) has been obtained, we calculate
$$F_s = \frac{s_1^2}{s_2^2}$$
This will be a ratio near 1, because these variances are estimates of the same quantity. Its actual value will depend on the relative magnitudes of variances $s_1^2$ and $s_2^2$. If we repeatedly take such pairs of samples and calculate the ratios $F_s$ of their variances, the average of these ratios will in fact approach the quantity $(n_2 - 1)/(n_2 - 3)$, which is close to 1.0 when $n_2$ is large. The distribution of this statistic is called the $F$ distribution, in honor of R. A. Fisher. This is another distribution described by a complicated mathematical function that need not concern us here. Unlike the $t$ and $\chi^2$ distributions, the shape of the $F$ distribution is determined by two values for degrees of freedom, $\nu_1$ and $\nu_2$ (corresponding to the degrees of freedom of the variance in the numerator and the variance in the denominator, respectively). Thus, for every possible combination of values $\nu_1$, $\nu_2$, each ranging from 1 to infinity, there exists a separate $F$ distribution. Remember that the $F$ distribution is a theoretical probability distribution, like the $t$ distribution and the $\chi^2$ distribution. Variance ratios $s_1^2/s_2^2$ based on sample variances are sample statistics that may or may not follow the $F$ distribution. We have therefore distinguished the sample variance ratio by calling it $F_s$, conforming to our convention of separate symbols for sample statistics as distinct from probability distributions (such as $t_s$ and $X^2$ contrasted with $t$ and $\chi^2$). We have discussed how to generate an $F$ distribution by repeatedly taking two samples from the same normal distribution. We could also have generated it by sampling from two separate normal distributions differing in their mean but identical in their parametric variances; that is, with $\mu_1 \neq \mu_2$ but $\sigma_1^2 = \sigma_2^2$. Thus, we obtain an $F$ distribution whether the samples come from the same normal population or from different ones, so long as their variances are identical. Figure 7.1 shows several representative $F$ distributions. For very low degrees of freedom the distribution is L-shaped, but it becomes humped and strongly skewed to the right as both degrees of freedom increase. Table V in Appendix
A2 shows the cumulative probability distribution of F for three selected probability values. The values in the table represent $F_{\alpha[\nu_1,\nu_2]}$, where $\alpha$ is the proportion of the F distribution to the right of the given F value (in one tail) and $\nu_1$, $\nu_2$ are the degrees of freedom pertaining to the variances in the numerator and the denominator of the ratio, respectively. The table is arranged so that across the top one reads $\nu_1$, the degrees of freedom pertaining to the upper (numerator) variance, and along the left margin one reads $\nu_2$, the degrees of freedom pertaining to the lower (denominator) variance. At each intersection of degree of freedom values we list three values of F, arranged in order of decreasing $\alpha$. For example, the critical value of an F distribution with $\nu_1 = 6$, $\nu_2 = 24$ is 2.51 at $\alpha = 0.05$. By that we mean that 0.05 of the area under the curve lies to the right of $F = 2.51$. Figure 7.2 illustrates this. Only 0.01 of the area under the curve lies to the right of $F = 3.67$. Thus, if we have a null hypothesis $H_0$: $\sigma_1^2 = \sigma_2^2$, with the alternative hypothesis $H_1$: $\sigma_1^2 > \sigma_2^2$, we use a one-tailed F test, as illustrated by Figure 7.2. We can now test the two variances obtained in the sampling experiment of Section 7.1 and Table 7.1. The variance among groups based on 7 means was 21.180, and the variance within 7 groups of 5 individuals was 16.029. Our null hypothesis is that the two variances estimate the same parametric variance; the alternative hypothesis in an anova is always that the parametric variance estimated by the variance among groups is greater than that estimated by the variance within groups.
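The sampling description of the F distribution can be checked with a small simulation. The sketch below is illustrative only: the population mean and standard deviation (45.5 and 3.9), the seed, the number of repetitions, and the helper names are assumptions, not values from this section. It draws pairs of samples of 7 and 25 items (so $\nu_1 = 6$, $\nu_2 = 24$) from one normal population and compares the simulated variance ratios with the quantity $(n_2 - 1)/(n_2 - 3)$ and with the tabled values 2.51 and 3.67.

```python
import random


def sample_variance(values):
    """Unbiased sample variance: sum of squares divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values) / (len(values) - 1)


def simulate_variance_ratios(n1, n2, reps, mu=45.5, sigma=3.9, seed=1):
    """Return `reps` ratios s1^2 / s2^2 for pairs of samples (sizes n1, n2)
    drawn from the same normal population. mu, sigma and seed are
    assumed, illustrative settings."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        s1 = sample_variance([rng.gauss(mu, sigma) for _ in range(n1)])
        s2 = sample_variance([rng.gauss(mu, sigma) for _ in range(n2)])
        ratios.append(s1 / s2)
    return ratios


n1, n2 = 7, 25  # degrees of freedom: nu1 = 6, nu2 = 24
ratios = simulate_variance_ratios(n1, n2, reps=20000)

mean_ratio = sum(ratios) / len(ratios)
expected_mean = (n2 - 1) / (n2 - 3)  # the quantity the average should approach
tail_05 = sum(f > 2.51 for f in ratios) / len(ratios)  # area right of F = 2.51
tail_01 = sum(f > 3.67 for f in ratios) / len(ratios)  # area right of F = 3.67

print(f"mean F_s = {mean_ratio:.3f} (expected about {expected_mean:.3f})")
print(f"proportion above 2.51 = {tail_05:.3f}, above 3.67 = {tail_01:.3f}")
```

Because the variance ratio does not depend on the scale or location of the population, any other choice of mean and standard deviation should give proportions close to 0.05 and 0.01 as well.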
The reason for this restrictive alternative hypothesis, which leads to a one-tailed test, will be explained in Section 7.4. We calculate the variance ratio $F_s = s_1^2/s_2^2 = 21.181/16.029 = 1.32$.

FIGURE 7.2

Before we can inspect the F table, we have to know the appropriate degrees of freedom for this variance ratio. We shall learn simple formulas for degrees of freedom in an anova later, but at the moment let us reason it out for ourselves. The upper variance (among groups) was based on the variance of 7 means; hence it should have $a - 1 = 6$ degrees of freedom. The lower variance was based on an average of 7 variances, each of them based on 5 individuals yielding 4 degrees of freedom per variance: $a(n - 1) = 7 \times 4 = 28$ degrees of freedom. Thus, the upper variance has 6, the lower variance 28 degrees of freedom. If we check Table V for $\nu_1 = 6$, $\nu_2 = 24$, the closest arguments in the table, we find that $F_{0.05[6,24]} = 2.51$. For $F = 1.32$, corresponding to the $F_s$ value actually obtained, $\alpha$ is clearly $> 0.05$. Thus, we may expect more than 5% of all variance ratios of samples based on 6 and 28 degrees of freedom, respectively, to have $F_s$ values greater than 1.32. We have no evidence to reject the null hypothesis and conclude that the two sample variances estimate the same parametric variance. This corresponds, of course, to what we knew anyway from our sampling experiment. Since the seven samples were taken from the same population, the estimate using the variance of their means is expected to yield another estimate of the parametric variance of housefly wing length.

Whenever the alternative hypothesis is that the two parametric variances are unequal (rather than the restrictive hypothesis $H_1$: $\sigma_1^2 > \sigma_2^2$), the sample variance $s_1^2$ can be smaller as well as greater than $s_2^2$. This leads to a two-tailed test, and in such cases a 5% type I error means that rejection regions of 2½% will occur at each tail of the curve. In such a case it is necessary to obtain F values for $\alpha > 0.5$ (that is, in the left half of the F distribution). Since these values are rarely tabulated, they can be obtained by using the simple relationship
$$F_{\alpha[\nu_1,\nu_2]} = \frac{1}{F_{(1-\alpha)[\nu_2,\nu_1]}}$$

For example, $F_{0.05[5,24]} = 2.62$. If we wish to obtain $F_{0.95[5,24]}$ (the F value to the right of which lies 95% of the area of the F distribution with 5 and 24 degrees of freedom, respectively), we first have to find $F_{0.05[24,5]} = 4.53$. Then $F_{0.95[5,24]}$ is the reciprocal of 4.53, which equals 0.221. Thus 95% of an F distribution with 5 and 24 degrees of freedom lies to the right of 0.221.

There is an important relationship between the F distribution and the $\chi^2$ distribution. You may remember that the ratio $X^2 = \sum y^2/\sigma^2$ was distributed as a $\chi^2$ with $n - 1$ degrees of freedom. If you divide the numerator of this expression by $n - 1$, you obtain the ratio $F_s = s^2/\sigma^2$, which is a variance ratio with an expected distribution of $F_{[n-1,\infty]}$. The upper degrees of freedom are $n - 1$ (the degrees of freedom of the sum of squares or sample variance). The lower degrees of freedom are infinite, because only on the basis of an infinite number of items can we obtain the true, parametric variance of a population. Therefore, by dividing a value of $X^2$ by $n - 1$ degrees of freedom, we obtain an $F_s$ value with $n - 1$ and $\infty$ df, respectively. In general,

$$\frac{\chi^2_{\alpha[\nu]}}{\nu} = F_{\alpha[\nu,\infty]}$$

We can convince ourselves of this by inspecting the F and $\chi^2$ tables. From the $\chi^2$ table (Table IV) we find that $\chi^2_{0.05[10]} = 18.307$. Dividing this value by 10 df, we obtain 1.8307, which agrees with the value $F_{0.05[10,\infty]} = 1.83$ given in the F table.
Thus, the two statistics of significance are closely related and, lacking a $\chi^2$ table, we could make do with an F table alone, using the values of $\nu F_{[\nu,\infty]}$ in place of $\chi^2_{[\nu]}$.

Before we return to analysis of variance, we shall first apply our newly won knowledge of the F distribution to testing a hypothesis about two sample variances.
BOX 7.1
Testing the significance of differences between two variances.

Survival in days of the cockroach Blattella vaga when kept without food or water.

Females    $n_1 = 10$    $\bar{Y}_1 = 8.5$ days    $s_1^2 = 3.6$
Males      $n_2 = 10$    $\bar{Y}_2 = 4.8$ days    $s_2^2 = 0.9$

$H_0$: $\sigma_1^2 = \sigma_2^2$        $H_1$: $\sigma_1^2 \neq \sigma_2^2$

The alternative hypothesis is that the two variances are unequal. We have no reason to suppose that one sex should be more variable than the other. In view of the alternative hypothesis this is a two-tailed test. Since only the right tail of the F distribution is tabled extensively in Table V and in most other tables, we calculate $F_s$ as the ratio of the greater variance over the lesser one:

$$F_s = \frac{s_1^2}{s_2^2} = \frac{3.6}{0.9} = 4.00$$

Because the test is two-tailed, we look up the critical value $F_{\alpha/2[\nu_1,\nu_2]}$, where $\alpha$ is the type I error accepted and $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ are the degrees of freedom for the upper and lower variance, respectively. Whether we look up $F_{\alpha/2[\nu_1,\nu_2]}$ or $F_{\alpha/2[\nu_2,\nu_1]}$ depends on whether sample 1 or sample 2 has the greater variance and has been placed in the numerator.

From Table V we find $F_{0.025[9,9]} = 4.03$ and $F_{0.05[9,9]} = 3.18$. Because this is a two-tailed test, we double these probabilities. Thus, the F value of 4.03 represents a probability of $\alpha = 0.05$, since the right-hand tail area of $\alpha = 0.025$ is matched by a similar left-hand area to the left of $F_{0.975[9,9]} = 1/F_{0.025[9,9]} = 0.248$. Therefore, assuming the null hypothesis is true, the probability of observing an F value greater than 4.00 or smaller than $1/4.00 = 0.25$ is $0.10 > P > 0.05$. Strictly speaking, the two sample variances are not significantly different; the two sexes are equally variable in their duration of survival. However, the outcome is close enough to the 5% significance level to make us suspicious that possibly the variances are in fact different. It would be desirable to repeat this experiment with larger sample sizes in the hope that more decisive results would emerge.
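The procedure of Box 7.1 can be written out as a short sketch. The function name and return layout are assumptions made for illustration; the variances, sample sizes, and the two critical values are those quoted in the box from Table V.

```python
def variance_ratio_two_tailed(var1, n1, var2, n2):
    """Return (F_s, df_numerator, df_denominator), placing the greater
    sample variance in the numerator as a two-tailed test requires."""
    if var1 >= var2:
        return var1 / var2, n1 - 1, n2 - 1
    return var2 / var1, n2 - 1, n1 - 1


# Box 7.1: females s^2 = 3.6 (n = 10), males s^2 = 0.9 (n = 10)
f_s, df_num, df_den = variance_ratio_two_tailed(3.6, 10, 0.9, 10)

# Critical values quoted in the box (one-tailed areas from Table V)
F_025_9_9 = 4.03  # right-tail area 0.025, two-tailed probability 0.05
F_05_9_9 = 3.18   # right-tail area 0.05,  two-tailed probability 0.10

significant_at_05 = f_s > F_025_9_9
significant_at_10 = f_s > F_05_9_9

print(f"F_s = {f_s:.2f} with {df_num} and {df_den} df")
print(f"significant at two-tailed 0.05: {significant_at_05}")
print(f"significant at two-tailed 0.10: {significant_at_10}")
```

The observed ratio falls between the two critical values, which is the statement $0.10 > P > 0.05$ made in the box.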
7.3 The hypothesis $H_0$: $\sigma_1^2 = \sigma_2^2$

A test of the null hypothesis that two normal populations represented by two samples have the same variance is illustrated in Box 7.1. As will be seen later, some tests leading to a decision about whether two samples come from populations with the same mean assume that the population variances are equal. However, this test is of interest in its own right. We will repeatedly have to test whether two samples have the same variance. In genetics we may need to know whether an offspring generation is more variable for a character than the parent generation. In systematics we might like to find out whether two local populations are equally variable. In experimental biology we may wish to demonstrate under which of two experimental setups the readings will be more variable. In general, the less variable setup would be preferred; if both setups were equally variable, the experimenter would pursue the one that was simpler or less costly to undertake.

7.4 Heterogeneity among sample means

We shall now modify the data of Table 7.1, discussed in Section 7.1. Suppose the seven groups of houseflies did not represent random samples from the same population but resulted from the following experiment. Each sample was reared in a separate culture jar, and the medium in each of the culture jars was prepared in a different way. Some had more water added, others more sugar, yet others more solid matter. Let us assume that sample 7 represents the standard medium against which we propose to compare the other samples. The various changes in the medium affect the sizes of the flies that emerge from it; this in turn affects the wing lengths we have been measuring.
We shall assume the following effects resulting from treatment of the medium:

Medium 1 decreases average wing length of a sample by 5 units
Medium 2 decreases average wing length of a sample by 2 units
Medium 3 does not change average wing length of a sample
Medium 4 increases average wing length of a sample by 1 unit
Medium 5 increases average wing length of a sample by 1 unit
Medium 6 increases average wing length of a sample by 5 units
Medium 7 (control) does not change average wing length of a sample

The effect of treatment $i$ is usually symbolized as $\alpha_i$. (Please note that this use of $\alpha$ is not related to its use as a symbol for the probability of a type I error.) Thus $\alpha_i$ assumes the following values for the above treatment effects:

$\alpha_1 = -5 \quad \alpha_2 = -2 \quad \alpha_3 = 0 \quad \alpha_4 = 1 \quad \alpha_5 = 1 \quad \alpha_6 = 5 \quad \alpha_7 = 0$
Note that the $\alpha_i$'s have been defined so that $\sum^a \alpha_i = 0$; that is, the effects cancel out. This is a convenient property that is generally postulated, but it is unnecessary for our argument. We can now modify Table 7.1 by adding the appropriate values of $\alpha_i$ to each sample. In sample 1 the value of $\alpha_1$ is $-5$; therefore, the first wing length, which was 41 (see Table 7.1), now becomes 36; the second wing length, formerly 44, becomes 39; and so on. For the second sample $\alpha_2$ is $-2$, changing the first wing length from 48 to 46. Where $\alpha_i$ is 0, the wing lengths do not change; where $\alpha_i$ is positive, they are increased by the magnitude indicated. The changed values can be inspected in Table 7.3, which is arranged identically to Table 7.1.

We now repeat our previous computations. We first calculate the sum of squares of the first sample to find it to be 29.2. If you compare this value with the sum of squares of the first sample in Table 7.1, you find the two values to be identical. Similarly, all other values of $\sum^n y^2$, the sum of squares of each group, are identical to their previous values. Why is this so? The effect of adding $\alpha_i$ to each group is simply that of an additive code, since $\alpha_i$ is constant for any one group. From Appendix A1.2 we can see that additive codes do not affect sums of squares or variances. Therefore, not only is each separate sum of squares the same as before, but the average variance within groups is still 16.029. Now let us compute the variance of the means. It is $100.617/6 = 16.770$, which is a value much higher than the variance of means found before, 4.236.
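The invariance of a sum of squares under an additive code is easy to confirm numerically. In the sketch below only the first two wing lengths of sample 1 (41 and 44) are quoted in the text; the remaining three values are assumptions chosen so that the sum of squares equals the quoted 29.2.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


# Sample 1: 41 and 44 are quoted in the text; 48, 43, 42 are assumed
# illustrative values that reproduce the quoted sum of squares of 29.2.
sample1 = [41, 44, 48, 43, 42]
alpha1 = -5  # treatment effect of medium 1

coded = [y + alpha1 for y in sample1]  # additive code: add a constant

ss_before = sum_of_squares(sample1)
ss_after = sum_of_squares(coded)

print(coded[:2])                       # first two coded values
print(round(ss_before, 3), round(ss_after, 3))
```

The mean shifts by exactly $\alpha_1$, so every deviation from the mean, and hence the sum of squares, is unchanged.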
When we multiply by $n = 5$ to get an estimate of $\sigma^2$, we obtain the variance of groups, which now is 83.848 and is no longer even close to an estimate of $\sigma^2$. We repeat the F test with the new variances and find that $F_s = 83.848/16.029 = 5.23$, which is much greater than the closest critical value of $F_{0.05[6,24]} = 2.51$. In fact, the observed $F_s$ is greater than $F_{0.01[6,24]} = 3.67$. Clearly, the upper variance, representing the variance among groups, has become significantly larger. The two variances are most unlikely to represent the same parametric variance.

What has happened? We can easily explain it by means of Table 7.4, which represents Table 7.3 symbolically in the manner that Table 7.2 represented Table 7.1. We note that each group has a constant $\alpha_i$ added and that this constant changes the sums of the groups by $n\alpha_i$ and the means of these groups by $\alpha_i$. In Section 7.1 we computed the variance within groups as

$$\frac{1}{a(n-1)} \sum_{i=1}^{a} \sum_{j=1}^{n} (Y_{ij} - \bar{Y}_i)^2$$

When we try to repeat this, our formula becomes more complicated, because to each $Y_{ij}$ and each $\bar{Y}_i$ there has now been added $\alpha_i$. We therefore write

$$\frac{1}{a(n-1)} \sum_{i=1}^{a} \sum_{j=1}^{n} \left[ (Y_{ij} + \alpha_i) - (\bar{Y}_i + \alpha_i) \right]^2$$
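Then we open the parentheses inside the square brackets, so that the second $\alpha_i$ changes sign and the $\alpha_i$'s cancel out, leaving the expression exactly as before, substantiating our earlier observation that the variance within groups does not change despite the treatment effects.

TABLE 7.4

                                          Groups
  j       1                      2                      3                      ...   i                      ...   a
  1       $Y_{11} + \alpha_1$    $Y_{21} + \alpha_2$    $Y_{31} + \alpha_3$    ...   $Y_{i1} + \alpha_i$    ...   $Y_{a1} + \alpha_a$
  2       $Y_{12} + \alpha_1$    $Y_{22} + \alpha_2$    $Y_{32} + \alpha_3$    ...   $Y_{i2} + \alpha_i$    ...   $Y_{a2} + \alpha_a$
  3       $Y_{13} + \alpha_1$    $Y_{23} + \alpha_2$    $Y_{33} + \alpha_3$    ...   $Y_{i3} + \alpha_i$    ...   $Y_{a3} + \alpha_a$
  ...
  j       $Y_{1j} + \alpha_1$    $Y_{2j} + \alpha_2$    $Y_{3j} + \alpha_3$    ...   $Y_{ij} + \alpha_i$    ...   $Y_{aj} + \alpha_a$
  ...
  n       $Y_{1n} + \alpha_1$    $Y_{2n} + \alpha_2$    $Y_{3n} + \alpha_3$    ...   $Y_{in} + \alpha_i$    ...   $Y_{an} + \alpha_a$
  Sums    $\sum^n Y_1 + n\alpha_1$   $\sum^n Y_2 + n\alpha_2$   $\sum^n Y_3 + n\alpha_3$   ...   $\sum^n Y_i + n\alpha_i$   ...   $\sum^n Y_a + n\alpha_a$
  Means   $\bar{Y}_1 + \alpha_1$     $\bar{Y}_2 + \alpha_2$     $\bar{Y}_3 + \alpha_3$     ...   $\bar{Y}_i + \alpha_i$     ...   $\bar{Y}_a + \alpha_a$

The variance of means was previously calculated by the formula

$$\frac{1}{a-1} \sum_{i=1}^{a} (\bar{Y}_i - \bar{\bar{Y}})^2$$

However, from Table 7.4 we see that the new grand mean equals

$$\frac{1}{a} \sum_{i=1}^{a} (\bar{Y}_i + \alpha_i) = \frac{1}{a} \sum_{i=1}^{a} \bar{Y}_i + \frac{1}{a} \sum_{i=1}^{a} \alpha_i = \bar{\bar{Y}} + \bar{\alpha}$$

When we substitute the new values for the group means and the grand mean, the formula appears as

$$\frac{1}{a-1} \sum_{i=1}^{a} \left[ (\bar{Y}_i + \alpha_i) - (\bar{\bar{Y}} + \bar{\alpha}) \right]^2$$

which in turn yields

$$\frac{1}{a-1} \sum_{i=1}^{a} \left[ (\bar{Y}_i - \bar{\bar{Y}}) + (\alpha_i - \bar{\alpha}) \right]^2$$

Squaring the expression in the square brackets, we obtain the terms

$$\frac{1}{a-1} \sum_{i=1}^{a} (\bar{Y}_i - \bar{\bar{Y}})^2 + \frac{1}{a-1} \sum_{i=1}^{a} (\alpha_i - \bar{\alpha})^2 + \frac{2}{a-1} \sum_{i=1}^{a} (\bar{Y}_i - \bar{\bar{Y}})(\alpha_i - \bar{\alpha})$$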
The first of these terms we immediately recognize as the previous variance of the means, $s_{\bar{Y}}^2$. The second is a new quantity, but is familiar by general appearance; it clearly is a variance or at least a quantity akin to a variance. The third expression is a new type; it is a so-called covariance, which we have not yet encountered. We shall not be concerned with it at this stage except to say that
in cases such as the present one, where the magnitude of the treatment effects $\alpha_i$ is assumed to be independent of the $\bar{Y}_i$ to which they are added, the expected value of this quantity is zero; hence it does not contribute to the new variance of means.

The independence of the treatment effects and the sample means is an important concept that we must understand clearly. If we had not applied different treatments to the medium jars, but simply treated all jars as controls, we would still have obtained differences among the wing length means. Those are the differences found in Table 7.1 with random sampling from the same population. By chance, some of these means are greater, some are smaller. In our planning of the experiment we had no way of predicting which sample means would be small and which would be large. Therefore, in planning our treatments, we had no way of matching up a large treatment effect, such as that of medium 6, with the mean that by chance would be the greatest, as that for sample 2. Also, the smallest sample mean (sample 4) is not associated with the smallest treatment effect. Only if the magnitude of the treatment effects were deliberately correlated with the sample means (this would be difficult to do in the experiment designed here) would the third term in the expression, the covariance, have an expected value other than zero.

The second term in the expression for the new variance of means is clearly added as a result of the treatment effects. It is analogous to a variance, but it cannot be called a variance, since it is not based on a random variable, but rather on deliberately chosen treatments largely under our control. By changing the magnitude and nature of the treatments, we can more or less alter the variance-like quantity at will. We shall therefore call it the added component due to treatment effects.
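The three terms of the new variance of means can be evaluated directly. The treatment effects below are those assumed in this section. The seven group means are an assumption: this section quotes only $\bar{Y}_7 = 45.4$ and the grand mean 45.34, so the other means are illustrative values chosen to be consistent with the quoted variances of means (4.236 before and 16.770 after the effects are added).

```python
# Assumed group means (only the seventh, 45.4, is quoted in the text);
# they reproduce the quoted variance of means, 4.236.
means = [43.6, 48.0, 46.4, 42.4, 44.2, 47.4, 45.4]
alphas = [-5, -2, 0, 1, 1, 5, 0]  # treatment effects from the text


def average(values):
    return sum(values) / len(values)


a = len(means)
grand_mean = average(means)
alpha_bar = average(alphas)

# Term 1: previous variance of the means
var_means = sum((m - grand_mean) ** 2 for m in means) / (a - 1)
# Term 2: added component due to treatment effects
added_component = sum((al - alpha_bar) ** 2 for al in alphas) / (a - 1)
# Term 3: covariance-type term (expected value zero, not zero in one sample)
covariance_term = 2 * sum(
    (m - grand_mean) * (al - alpha_bar) for m, al in zip(means, alphas)
) / (a - 1)

# Direct computation of the variance of the shifted means
new_means = [m + al for m, al in zip(means, alphas)]
new_grand_mean = average(new_means)
new_var_means = sum((m - new_grand_mean) ** 2 for m in new_means) / (a - 1)

print(round(var_means, 3), round(added_component, 3), round(covariance_term, 3))
print(round(new_var_means, 3))
```

With these assumed means the covariance term is not zero in this particular sample; the text's statement concerns its expected value over repeated sampling.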
Since the $\alpha_i$'s are arranged so that $\bar{\alpha} = 0$, we can rewrite the middle term as

$$\frac{1}{a-1} \sum_{i=1}^{a} (\alpha_i - \bar{\alpha})^2 = \frac{1}{a-1} \sum_{i=1}^{a} \alpha_i^2$$

In analysis of variance we multiply the variance of the means by $n$ in order to estimate the parametric variance of the items. As you know, we call the quantity so obtained the variance of groups. When we do this for the case in which treatment effects are present, we obtain

$$n \left( s_{\bar{Y}}^2 + \frac{1}{a-1} \sum_{i=1}^{a} \alpha_i^2 \right) = s^2 + \frac{n}{a-1} \sum_{i=1}^{a} \alpha_i^2$$

Thus we see that the estimate of the parametric variance of the population is increased by the quantity

$$\frac{n}{a-1} \sum_{i=1}^{a} \alpha_i^2$$

which is $n$ times the added component due to treatment effects. We found the variance ratio $F_s$ to be significantly greater than could be reconciled with the null hypothesis. It is now obvious why this is so. We were testing the variance ratio expecting to find F approximately equal to $\sigma^2/\sigma^2 = 1$. In fact, however,

$$F \approx \frac{\sigma^2 + \dfrac{n}{a-1} \sum_{i=1}^{a} \alpha_i^2}{\sigma^2}$$
It is clear from this formula (deliberately displayed in this lopsided manner) that the F test is sensitive to the presence of the added component due to treatment effects.

At this point, you have an additional insight into the analysis of variance. It permits us to test whether there are added treatment effects; that is, whether a group of means can simply be considered random samples from the same population, or whether treatments that have affected each group separately have resulted in shifting these means so much that they can no longer be considered samples from the same population. If the latter is so, an added component due to treatment effects will be present and may be detected by an F test in the significance test of the analysis of variance. In such a study, we are generally not interested in the magnitude of the added component due to treatment effects,

$$\frac{1}{a-1} \sum_{i=1}^{a} \alpha_i^2$$

but we are interested in the magnitude of the separate values of $\alpha_i$. In our example these are the effects of different formulations of the medium on wing length. If, instead of housefly wing length, we were measuring blood pressure in samples of rats and the different groups had been subjected to different drugs or different doses of the same drug, the quantities $\alpha_i$ would represent the effects of drugs on the blood pressure, which is clearly the issue of interest to the investigator. We may also be interested in studying differences of the type $\alpha_1 - \alpha_2$, leading us to the question of the significance of the differences between the effects of any two types of medium or any two drugs. But we are a little ahead of our story.

When analysis of variance involves treatment effects of the type just studied, we call it a Model I anova. Later in this chapter (Section 7.6), Model I will be defined precisely. There is another model, called a Model II anova, in which the added effects for each group are not fixed treatments but are random effects. By this we mean that we have not deliberately planned or fixed the treatment for any one group, but that the actual effects on each group are random and only partly under our control. Suppose that the seven samples of houseflies in Table 7.3 represented the offspring of seven randomly selected females from a population reared on a uniform medium. There would be genetic differences among these females, and their seven broods would reflect this. The exact nature of these differences is unclear and unpredictable.
Before actually measuring them, we have no way of knowing whether brood 1 will have longer wings than brood 2, nor have we any way of controlling this experiment so that brood 1 will in fact grow longer wings. So far as we can ascertain, the genetic factors
for wing length are distributed in an unknown manner in the population of houseflies (we might hope that they are normally distributed), and our sample of seven is a random sample of these factors.

In another example for a Model II anova, suppose that instead of making up our seven cultures from a single batch of medium, we have prepared seven batches separately, one right after the other, and are now analyzing the variation among the batches. We would not be interested in the exact differences from batch to batch. Even if these were measured, we would not be in a position to interpret them. Not having deliberately varied batch 3, we have no idea why, for example, it should produce longer wings than batch 2. We would, however, be interested in the magnitude of the variance of the added effects. Thus, if we used seven jars of medium derived from one batch, we could expect the variance of the jar means to be $\sigma^2/5$, since there were 5 flies per jar. But when based on different batches of medium, the variance could be expected to be greater, because all the imponderable accidents of formulation and environmental differences during medium preparation that make one batch of medium different from another would come into play. Interest would focus on the added variance component arising from differences among batches. Similarly, in the other example we would be interested in the added variance component arising from genetic differences among the females.

We shall now take a rapid look at the algebraic formulation of the anova in the case of Model II.
In Table 7.3 the second row at the head of the data columns shows not only $\alpha_i$ but also $A_i$, which is the symbol we shall use for a random group effect. We use a capital letter to indicate that the effect is a variable. The algebra of calculating the two estimates of the population variance is the same as in Model I, except that in place of $\alpha_i$ we imagine $A_i$ substituted in Table 7.4. The estimate of the variance among means now represents the quantity

$$\frac{1}{a-1} \sum_{i=1}^{a} (\bar{Y}_i - \bar{\bar{Y}})^2 + \frac{1}{a-1} \sum_{i=1}^{a} (A_i - \bar{A})^2 + \frac{2}{a-1} \sum_{i=1}^{a} (\bar{Y}_i - \bar{\bar{Y}})(A_i - \bar{A})$$

The first term is the variance of means $s_{\bar{Y}}^2$, as before, and the last term is the covariance between the group means and the random effects $A_i$, the expected value of which is zero (as before), because the random effects are independent of the magnitude of the means. The middle term is a true variance, since $A_i$ is a random variable. We symbolize it by $s_A^2$ and call it the added variance component among groups. It would represent the added variance component among females or among medium batches, depending on which of the designs discussed above we were thinking of. The existence of this added variance component is demonstrated by the F test. If the groups are random samples, we may expect F to approximate $\sigma^2/\sigma^2 = 1$; but with an added variance component, the expected ratio, again displayed lopsidedly, is

$$F \approx \frac{\sigma^2 + n\sigma_A^2}{\sigma^2}$$
Note that $\sigma_A^2$, the parametric value of $s_A^2$, is multiplied by $n$, since we have to multiply the variance of means by $n$ to obtain an independent estimate of the variance of the population. In a Model II anova we are interested not in the magnitude of any $A_i$ or in differences such as $A_1 - A_2$, but in the magnitude of $\sigma_A^2$ and its relative magnitude with respect to $\sigma^2$, which is generally expressed as the percentage $100 s_A^2/(s^2 + s_A^2)$. Since the variance among groups estimates $\sigma^2 + n\sigma_A^2$, we can calculate $s_A^2$ as

$$\frac{1}{n}(\text{variance among groups} - \text{variance within groups}) = \frac{1}{n}\left[(s^2 + n s_A^2) - s^2\right] = \frac{1}{n}(n s_A^2) = s_A^2$$

For the present example, $s_A^2 = \frac{1}{5}(83.848 - 16.029) = 13.56$. This added variance component among groups is

$$\frac{100 \times 13.56}{16.029 + 13.56} = \frac{1356}{29.589} = 45.83\%$$

of the sum of the variances among and within groups. Model II will be formally discussed at the end of this chapter (Section 7.7); the methods of estimating variance components are treated in detail in the next chapter.
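The arithmetic for the added variance component can be restated as a short sketch. The function name is an assumption; the mean squares (83.848 and 16.029) and the group size ($n = 5$) are the values used in the text.

```python
def added_variance_component(ms_among, ms_within, n):
    """Estimate s_A^2 as (MS among groups - MS within groups) / n."""
    return (ms_among - ms_within) / n


ms_among, ms_within, n = 83.848, 16.029, 5  # values quoted in the text

s2_A = added_variance_component(ms_among, ms_within, n)
percent_among = 100 * s2_A / (ms_within + s2_A)

print(f"s_A^2 = {s2_A:.2f}")
print(f"percentage among groups = {percent_among:.2f}%")
```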
7.5 Partitioning the total sum of squares and degrees of freedom

So far we have ignored one other variance that can be computed from the data in Table 7.1. If we remove the classification into groups, we can consider the housefly data to be a single sample of $an = 35$ wing lengths and calculate the mean and variance of these items in the conventional manner. The various quantities necessary for this computation are shown in the last column at the right in Tables 7.1 and 7.3, headed "Computation of total sum of squares." We obtain a mean of $\bar{\bar{Y}} = 45.34$ for the sample in Table 7.1, which is, of course, the same as the quantity computed previously from the seven group means. The sum of squares of the 35 items is 575.886, which gives a variance of 16.938 when divided by 34 degrees of freedom. Repeating these computations for the data in Table 7.3, we obtain $\bar{\bar{Y}} = 45.34$ (the same as in Table 7.1 because $\sum^a \alpha_i = 0$) and $s^2 = 27.997$, which is considerably greater than the corresponding variance from Table 7.1. The total variance computed from all $an$ items is another estimate of $\sigma^2$. It is a good estimate in the first case, but in the second sample (Table 7.3), where added components due to treatment effects or added variance components are present, it is a poor estimate of the population variance.

However, the purpose of calculating the total variance in an anova is not for using it as yet another estimate of $\sigma^2$, but for introducing an important mathematical relationship between it and the other variances. This is best seen when we arrange our results in a conventional analysis of variance table, as
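TABLE 7.5

                                  (1)                    (2)    (3)                  (4)
                                  Source of variation    df     Sum of squares SS    Mean square MS
$\bar{Y} - \bar{\bar{Y}}$         Among groups            6     127.086              21.181
$Y - \bar{Y}$                     Within groups          28     448.800              16.029
$Y - \bar{\bar{Y}}$               Total                  34     575.886              16.938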
shown in Table 7.5. Such a table is divided into four columns. The first identifies the source of variation as among groups, within groups, and total (groups amalgamated to form a single sample). The column headed df gives the degrees of freedom by which the sums of squares pertinent to each source of variation must be divided in order to yield the corresponding variance. The degrees of freedom for variation among groups is $a - 1$, that for variation within groups is $a(n - 1)$, and that for the total variation is $an - 1$. The next two columns show sums of squares and variances, respectively. Notice that the sums of squares entered in the anova table are the sum of squares among groups, the sum of squares within groups, and the sum of squares of the total sample of $an$ items. You will note that variances are not referred to by that term in anova, but are generally called mean squares, since, in a Model I anova, they do not estimate a population variance. These quantities are not true mean squares, because the sums of squares are divided by the degrees of freedom rather than sample size. The sum of squares and mean square are frequently abbreviated SS and MS, respectively.

The sums of squares and mean squares in Table 7.5 are the same as those obtained previously, except for minute rounding errors. Note, however, an important property of the sums of squares. They have been obtained independently of each other, but when we add the SS among groups to the SS within groups we obtain the total SS. The sums of squares are additive! Another way of saying this is that we can decompose the total sum of squares into a portion due to variation among groups and another portion due to variation within groups. Observe that the degrees of freedom are also additive and that the total of 34 df can be decomposed into 6 df among groups and 28 df within groups.
Thus, if we know any two of the sums of squares (and their appropriate degrees of freedom), we can compute the third and complete our analysis of variance. Note that the mean squares are not additive. This is obvious, since generally $(a + b)/(c + d) \neq a/c + b/d$.

We shall use the computational formula for sum of squares (Expression (3.8)) to demonstrate why these sums of squares are additive. Although it is an algebraic derivation, it is placed here rather than in the Appendix because these formulas will also lead us to some common computational formulas for analysis of variance. Depending on computational equipment, the formulas we
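have used so far to obtain the sums of squares may not be the most rapid procedure. The sum of squares of means in simplified notation is

$$
\begin{aligned}
SS_{\text{means}} &= \sum^{a} (\bar{Y} - \bar{\bar{Y}})^2 \\
&= \sum^{a} \bar{Y}^2 - \frac{1}{a} \left( \sum^{a} \bar{Y} \right)^2 \\
&= \sum^{a} \left( \frac{1}{n} \sum^{n} Y \right)^2 - \frac{1}{a} \left( \sum^{a} \frac{1}{n} \sum^{n} Y \right)^2 \\
&= \frac{1}{n^2} \sum^{a} \left( \sum^{n} Y \right)^2 - \frac{1}{an^2} \left( \sum^{a} \sum^{n} Y \right)^2
\end{aligned}
$$

Note that the deviation of means from the grand mean is first rearranged to fit the computational formula (Expression (3.8)), and then each mean is written in terms of its constituent variates. Collection of denominators outside the summation signs yields the final desired form. To obtain the sum of squares of groups, we multiply $SS_{\text{means}}$ by $n$, as before. This yields

$$SS_{\text{groups}} = n \times SS_{\text{means}} = \frac{1}{n} \sum^{a} \left( \sum^{n} Y \right)^2 - \frac{1}{an} \left( \sum^{a} \sum^{n} Y \right)^2$$

Next we evaluate the sum of squares within groups:

$$SS_{\text{within}} = \sum^{a} \sum^{n} (Y - \bar{Y})^2 = \sum^{a} \left[ \sum^{n} Y^2 - \frac{1}{n} \left( \sum^{n} Y \right)^2 \right] = \sum^{a} \sum^{n} Y^2 - \frac{1}{n} \sum^{a} \left( \sum^{n} Y \right)^2$$

The total sum of squares is

$$SS_{\text{total}} = \sum^{a} \sum^{n} (Y - \bar{\bar{Y}})^2 = \sum^{a} \sum^{n} Y^2 - \frac{1}{an} \left( \sum^{a} \sum^{n} Y \right)^2$$

We now copy the formulas for these sums of squares, slightly rearranged as follows:

$$
\begin{aligned}
SS_{\text{groups}} &= \frac{1}{n} \sum^{a} \left( \sum^{n} Y \right)^2 - \frac{1}{an} \left( \sum^{a} \sum^{n} Y \right)^2 \\
SS_{\text{within}} &= -\frac{1}{n} \sum^{a} \left( \sum^{n} Y \right)^2 + \sum^{a} \sum^{n} Y^2 \\
SS_{\text{total}} &= \sum^{a} \sum^{n} Y^2 - \frac{1}{an} \left( \sum^{a} \sum^{n} Y \right)^2
\end{aligned}
$$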
Adding the expression for SSgroaps to that for SS w i t h i n , we o b t a i n a q u a n t i t y that is identical to the one we have j u s t developed as SStotal. This d e m o n s t r a t i o n explains why the sums of squares are additive. We shall not go t h r o u g h any derivation, but simply state that the degrees of freedom pertaining to the sums of squares are also additive. The total degrees of freedom are split u p into the degrees of freedom corresponding to variation a m o n g groups a n d those of variation of items within groups. Before we continue, let us review the m e a n i n g of the three m e a n squares in the anova. T h e total MS is a statistic of dispersion of the 35 (an) items a r o u n d their mean, the g r a n d m e a n 45.34. It describes the variance in the entire sample due to all the sundry causes and estimates 2 when there are n o a d d e d treatment effects or variance c o m p o n e n t s a m o n g groups. T h e withingroup MS, also k n o w n as the individual or intragroup or error mean square, gives the average dispersion of the 5 () items in each g r o u p a r o u n d the g r o u p means. If the a groups are r a n d o m samples f r o m a c o m m o n h o m o g e n e o u s p o p u l a t i o n , the withingroup MS should estimate a1. The MS a m o n g groups is based on the variance of g r o u p means, which describes the dispersion of the 7 (a) g r o u p means a r o u n d the g r a n d mean. If the groups are r a n d o m samples from a h o m o geneous population, the expected variance of their m e a n will be 2/. Therefore, in order to have all three variances of the same order of magnitude, we multiply the variance of means by to obtain the variance a m o n g groups. If there are n o added treatment effects o r variance c o m p o n e n t s , the MS a m o n g groups is an estimate of 2 . Otherwise, it is an estimate of
$$\sigma^2 + \frac{n}{a-1}\sum^{a}\alpha^2 \qquad \text{or} \qquad \sigma^2 + n\sigma_A^2$$
depending on whether the anova at hand is Model I or II. The additivity relations we have just learned are independent of the presence of added treatment or random effects. We could show this algebraically, but it is simpler to inspect Table 7.6, which summarizes the anova of Table 7.3 in which $\alpha_i$ or $A_i$ is added to each sample. The additivity relation still holds, although the values for group SS and the total SS are different from those of Table 7.5.
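TABLE 7.6
Anova table for the data of Table 7.3.

(1) Source of variation                          (2) df   (3) Sum of squares SS   (4) Mean square MS
$\bar{Y} - \bar{\bar{Y}}$   Among groups          6        …                       …
$Y - \bar{Y}$   Within groups                     28       …                       …
$Y - \bar{\bar{Y}}$   Total                       34       …                       …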
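The additivity of the sums of squares and of the degrees of freedom can be checked numerically. The following is a minimal sketch, not the book's computation: the data and the added effects are hypothetical values chosen only for illustration, and `partition` is a helper name introduced here.

```python
def partition(groups):
    """Return (SS_groups, SS_within, SS_total) for equal-sized groups."""
    a = len(groups)
    n = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (a * n)
    means = [sum(g) / n for g in groups]
    # n times the sum of squared deviations of group means from the grand mean
    ss_groups = n * sum((m - grand_mean) ** 2 for m in means)
    # squared deviations of items from their own group mean
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    # squared deviations of items from the grand mean
    ss_total = sum((y - grand_mean) ** 2 for g in groups for y in g)
    return ss_groups, ss_within, ss_total


# Hypothetical data: a = 3 groups of n = 4 items each
base = [[41, 44, 48, 43], [48, 49, 49, 45], [40, 50, 44, 48]]
effects = [-5, 0, 3]  # hypothetical added effects, one per group
shifted = [[y + e for y in g] for g, e in zip(base, effects)]

a, n = len(base), len(base[0])
df_groups, df_within, df_total = a - 1, a * (n - 1), a * n - 1

base_ss = partition(base)
shifted_ss = partition(shifted)
```

In this sketch the among-groups and total sums of squares change when the effects are added, the within-groups sum of squares does not, and the additivity relation holds in both cases.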
Another way of looking at the partitioning of the variation is to study the deviation from means in a particular case. Referring to Table 7.1, we can look at the wing length of the first individual in the seventh group, which happens to be 41. Its deviation from its group mean is

$$Y_{71} - \bar{Y}_7 = 41 - 45.4 = -4.4$$

The deviation of the group mean from the grand mean is

$$\bar{Y}_7 - \bar{\bar{Y}} = 45.4 - 45.34 = 0.06$$

and the deviation of the individual wing length from the grand mean is

$$Y_{71} - \bar{\bar{Y}} = 41 - 45.34 = -4.34$$
Note that these deviations are additive. The deviation of the item from the group mean and that of the group mean from the grand mean add to the total deviation of the item from the grand mean. These deviations are stated algebraically as $(Y - \bar{Y}) + (\bar{Y} - \bar{\bar{Y}}) = (Y - \bar{\bar{Y}})$. Squaring and summing these deviations for $an$ items will result in

$$\sum^{a}\sum^{n}(Y - \bar{Y})^2 + n\sum^{a}(\bar{Y} - \bar{\bar{Y}})^2 = \sum^{a}\sum^{n}(Y - \bar{\bar{Y}})^2$$
Before squaring, the deviations were in the relationship $a + b = c$. After squaring, we would expect them to take the form $a^2 + b^2 + 2ab = c^2$. What happened to the cross-product term corresponding to $2ab$? This is

$$2\sum^{a}\sum^{n}(Y - \bar{Y})(\bar{Y} - \bar{\bar{Y}}) = 2\sum^{a}\left[(\bar{Y} - \bar{\bar{Y}})\sum^{n}(Y - \bar{Y})\right]$$

a covariance-type term that is always zero, since $\sum^{n}(Y - \bar{Y}) = 0$ for each of the $a$ groups (proof in Appendix A1.1). We identify the deviations represented by each level of variation at the left margins of the tables giving the analysis of variance results (Tables 7.5 and 7.6). Note that the deviations add up correctly: the deviation among groups plus the deviation within groups equals the total deviation of items in the analysis of variance, $(\bar{Y} - \bar{\bar{Y}}) + (Y - \bar{Y}) = (Y - \bar{\bar{Y}})$.
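The vanishing of the cross-product term can be illustrated with a short numerical sketch. The three deviations for the individual discussed above come from the text; the grouped data further down are hypothetical values used only to show that the within-group deviations sum to zero in every group.

```python
# Worked values from the text: Y = 41, group mean 45.4, grand mean 45.34
y, group_mean, grand_mean = 41, 45.4, 45.34
within_dev = y - group_mean          # deviation of item from group mean
among_dev = group_mean - grand_mean  # deviation of group mean from grand mean
total_dev = y - grand_mean           # deviation of item from grand mean

# Hypothetical data: a = 3 groups of n = 5 items each
groups = [[41, 44, 48, 43, 42], [48, 49, 49, 45, 43], [40, 50, 44, 48, 50]]
n = len(groups[0])
grand = sum(sum(g) for g in groups) / (len(groups) * n)

cross_product = 0.0
for g in groups:
    mean = sum(g) / n
    dev_sum = sum(v - mean for v in g)  # zero (up to rounding) in each group
    cross_product += 2 * (mean - grand) * dev_sum
```

Because each group's deviations from its own mean sum to zero, every term of the cross product is zero regardless of how far the group mean lies from the grand mean.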
7.6 Model I anova

An important point to remember is that the basic setup of data, as well as the actual computation and significance test, in most cases is the same for both models. The purposes of analysis of variance differ for the two models. So do some of the supplementary tests and computations following the initial significance test.

Let us now try to resolve the variation found in an analysis of variance case. This will not only lead us to a more formal interpretation of anova but will also give us a deeper understanding of the nature of variation itself. For
purposes of discussion, we return to the housefly wing lengths of Table 7.3. We ask the question, What makes any given housefly wing length assume the value it does? The third wing length of the first sample of flies is recorded as 43 units. How can we explain such a reading? If we knew nothing else about this individual housefly, our best guess of its wing length would be the grand mean of the population, which we know to be $\mu = 45.5$. However, we have additional information about this fly. It is a member of group 1, which has undergone a treatment shifting the mean of the group downward by 5 units. Therefore, $\alpha_1 = -5$, and we would expect our individual $Y_{13}$ (the third individual of group 1) to measure $45.5 - 5 = 40.5$ units. In fact, however, it is 43 units, which is 2.5 units above this latest expectation. To what can we ascribe this deviation? It is individual variation of the flies within a group because of the variance of individuals in the population ($\sigma^2 = 15.21$). All the genetic and environmental effects that make one housefly different from another housefly come into play to produce this variance. By means of carefully designed experiments, we might learn something about the causation of this variance and attribute it to certain specific genetic or environmental factors. We might also be able to eliminate some of the variance. For instance, by using only full sibs (brothers and sisters) in any one culture jar, we would decrease the genetic variation in individuals, and undoubtedly the variance within groups would be smaller. However, it is hopeless to try to eliminate all variance completely. Even if we could remove all genetic variance, there would still be environmental variance.
And even in the most improbable case in which we could remove both types of variance, measurement error would remain, so that we would never obtain exactly the same reading even on the same individual fly. The within-groups MS always remains as a residual, greater or smaller from experiment to experiment, part of the nature of things. This is why the within-groups variance is also called the error variance or error mean square. It is not an error in the sense of our making a mistake, but in the sense of a measure of the variation you have to contend with when trying to estimate significant differences among the groups. The error variance is composed of individual deviations for each individual, symbolized by $\epsilon_{ij}$, the random component of the $j$th individual variate in the $i$th group. In our case, $\epsilon_{13} = 2.5$, since the actual observed value is 2.5 units above its expectation of 40.5.

We shall now state this relationship more formally. In a Model I analysis of variance we assume that the differences among group means, if any, are due to the fixed treatment effects determined by the experimenter. The purpose of the analysis of variance is to estimate the true differences among the group means. Any single variate can be decomposed as follows:
$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \qquad (7.2)$$
where $i = 1, \ldots, a$, $j = 1, \ldots, n$; and $\epsilon_{ij}$ represents an independent, normally distributed variable with mean $\bar{\epsilon}_{ij} = 0$ and variance $\sigma^2_\epsilon = \sigma^2$. Therefore, a given reading is composed of the grand mean $\mu$ of the population, a fixed deviation $\alpha_i$
of the mean of group $i$ from the grand mean $\mu$, and a random deviation $\epsilon_{ij}$ of the $j$th individual of group $i$ from its expectation, which is $(\mu + \alpha_i)$. Remember that both $\alpha_i$ and $\epsilon_{ij}$ can be positive as well as negative. The expected value (mean) of the $\epsilon_{ij}$'s is zero, and their variance is the parametric variance of the population, $\sigma^2$. For all the assumptions of the analysis of variance to hold, the distribution of $\epsilon_{ij}$ must be normal.

In a Model I anova we test for differences of the type $\alpha_1 - \alpha_2$ among the group means by testing for the presence of an added component due to treatments. If we find that such a component is present, we reject the null hypothesis that the groups come from the same population and accept the alternative hypothesis that at least some of the group means are different from each other, which indicates that at least some of the $\alpha_i$'s are unequal in magnitude. Next, we generally wish to test which $\alpha_i$'s are different from each other. This is done by significance tests, with alternative hypotheses such as $H_1: \alpha_1 > \alpha_2$ or $H_1: \tfrac{1}{2}(\alpha_1 + \alpha_2) > \alpha_3$. In words, these test whether the mean of group 1 is greater than the mean of group 2, or whether the mean of group 3 is smaller than the average of the means of groups 1 and 2.

Some examples of Model I analyses of variance in various biological disciplines follow. An experiment in which we try the effects of different drugs on batches of animals results in a Model I anova. We are interested in the results of the treatments and the differences between them. The treatments are fixed and determined by the experimenter. This is true also when we test the effects of different doses of a given factor: a chemical, or the amount of light to which a plant has been exposed, or temperatures at which culture bottles of insects have been reared.
The treatment does not have to be entirely understood and manipulated by the experimenter. So long as it is fixed and repeatable, Model I will apply. If we wanted to compare the birth weights of the Chinese children in the hospital in Singapore with weights of Chinese children born in a hospital in China, our analysis would also be a Model I anova. The treatment effects then would be "China versus Singapore," which sums up a whole series of different factors, genetic and environmental, some known to us but most of them not understood. However, this is a definite treatment we can describe and also repeat: we can, if we wish, again sample birth weights of infants in Singapore as well as in China.

Another example of Model I anova would be a study of body weights for animals of several age groups. The treatments would be the ages, which are fixed. If we find that there are significant differences in weight among the ages, we might proceed with the question of whether there is a difference from age 2 to age 3 or only from age 1 to age 2.

To a very large extent, Model I anovas are the result of an experiment and of deliberate manipulation of factors by the experimenter. However, the study of differences such as the comparison of birth weights from two countries, while not an experiment proper, also falls into this category.
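The decomposition of Expression (7.2) can be traced for the single wing length discussed in this section. This is a small sketch using the values given in the text ($\mu = 45.5$, $\alpha_1 = -5$, $Y_{13} = 43$); the function name `decompose` is introduced here only for illustration.

```python
def decompose(y, mu, alpha):
    """Split a variate into (mu, alpha_i, epsilon_ij) under Model I."""
    expectation = mu + alpha   # expected value of any variate in the group
    epsilon = y - expectation  # random deviation of this individual
    return mu, alpha, epsilon


mu = 45.5        # parametric grand mean given in the text
alpha_1 = -5.0   # fixed treatment effect of group 1
y_13 = 43.0      # third wing length of group 1

expected_13 = mu + alpha_1
_, _, epsilon_13 = decompose(y_13, mu, alpha_1)
```

Adding the three components back together recovers the observed reading of 43 units.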
7.7 Model II anova

The structure of variation in a Model II anova is quite similar to that in Model I:

$$Y_{ij} = \mu + A_i + \epsilon_{ij} \qquad (7.3)$$
where $i = 1, \ldots, a$; $j = 1, \ldots, n$; $\epsilon_{ij}$ represents an independent, normally distributed variable with mean $\bar{\epsilon}_{ij} = 0$ and variance $\sigma^2_\epsilon = \sigma^2$; and $A_i$ represents a normally distributed variable, independent of all $\epsilon$'s, with mean $\bar{A}_i = 0$ and variance $\sigma^2_A$. The main distinction is that in place of fixed-treatment effects $\alpha_i$, we now consider random effects $A_i$ that differ from group to group. Since the effects are random, it is uninteresting to estimate the magnitude of these random effects on a group, or the differences from group to group. But we can estimate their variance, the added variance component among groups $\sigma^2_A$. We test for its presence and estimate its magnitude $s^2_A$, as well as its percentage contribution to the variation in a Model II analysis of variance.

Some examples will illustrate the applications of Model II anova. Suppose we wish to determine the DNA content of rat liver cells. We take five rats and make three preparations from each of the five livers obtained. The assay readings will be for $a = 5$ groups with $n = 3$ readings per group. The five rats presumably are sampled at random from the colony available to the experimenter. They must be different in various ways, genetically and environmentally, but we have no definite information about the nature of the differences. Thus, if we learn that rat 2 has slightly more DNA in its liver cells than rat 3, we can do little with this information, because we are unlikely to have any basis for following up this problem. We will, however, be interested in estimating the variance of the three replicates within any one liver and the variance among the five rats; that is, does variance $\sigma^2_A$ exist among rats in addition to the variance $\sigma^2$ expected on the basis of the three replicates?
The variance among the three preparations presumably arises only from differences in technique and possibly from differences in DNA content in different parts of the liver (unlikely in a homogenate). Added variance among rats, if it existed, might be due to differences in ploidy or related phenomena. The relative amounts of variation among rats and "within" rats (= among preparations) would guide us in designing further studies of this sort. If there was little variance among the preparations and relatively more variation among the rats, we would need fewer preparations and more rats. On the other hand, if the variance among rats was proportionately smaller, we would use fewer rats and more preparations per rat.

In a study of the amount of variation in skin pigment in human populations, we might wish to study different families within a homogeneous ethnic or racial group and brothers and sisters within each family. The variance within families would be the error mean square, and we would test for an added variance component among families. We would expect an added variance component $\sigma^2_A$ because there are genetic differences among families that determine amount
of skin pigmentation. We would be especially interested in the relative proportions of the two variances $\sigma^2$ and $\sigma^2_A$, because they would provide us with important genetic information. From our knowledge of genetic theory, we would expect the variance among families to be greater than the variance among brothers and sisters within a family.

The above examples illustrate the two types of problems involving Model II analysis of variance that are most likely to arise in biological work. One is concerned with the general problem of the design of an experiment and the magnitude of the experimental error at different levels of replication, such as error among replicates within rat livers and among rats, error among batches, experiments, and so forth. The other relates to variation among and within families, among and within females, among and within populations, and so forth. Such problems are concerned with the general problem of the relation between genetic and phenotypic variation.
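The estimate $s^2_A$ and its percentage contribution can be sketched from the two mean squares. The estimator below, $(MS_{groups} - MS_{within})/n$, is an assumption that follows from the expected mean square $\sigma^2 + n\sigma^2_A$ stated in Section 7.5; setting a negative estimate to zero is a common convention assumed here. The mean squares are hypothetical, and only the layout ($n = 3$ preparations per rat) comes from the text.

```python
def added_variance_component(ms_groups, ms_within, n):
    """Estimate s2_A = (MS_groups - MS_within) / n, floored at zero."""
    return max(0.0, (ms_groups - ms_within) / n)


def percent_among(s2_a, s2_within):
    """Share of the summed variance components due to groups, in percent."""
    return 100.0 * s2_a / (s2_a + s2_within)


n = 3                              # preparations per rat, as in the text
ms_groups, ms_within = 20.0, 5.0   # hypothetical mean squares

s2_a = added_variance_component(ms_groups, ms_within, n)
share = percent_among(s2_a, ms_within)
```

With these hypothetical values half of the variation would lie among rats, which, following the reasoning above, would argue for more rats rather than more preparations per rat.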
Exercises
7.1 In a study comparing the chemical composition of the urine of chimpanzees and gorillas (Gartler, Firschein, and Dobzhansky, 1956), the following results were obtained. For 37 chimpanzees the variance for the amount of glutamic acid in milligrams per milligram of creatinine was 0.01069. A similar study based on six gorillas yielded a variance of 0.12442. Is there a significant difference between the variability in chimpanzees and that in gorillas? ANS. $F_s = 11.639$, $F_{0.025[5,36]} \approx 2.90$.

7.2 The following data are from an experiment by Sewall Wright. He crossed Polish and Flemish giant rabbits and obtained 27 $F_1$ rabbits. These were inbred and 112 $F_2$ rabbits were obtained. We have extracted the following data on femur length of these rabbits.

        $n$    $\bar{Y}$   $s$
$F_1$   27     83.39       1.65
$F_2$   112    80.5        3.81

Is there a significantly greater amount of variability in femur lengths among the $F_2$ than among the $F_1$ rabbits? What well-known genetic phenomenon is illustrated by these data?

7.3 For the following data obtained by a physiologist, estimate $\sigma^2$ (the variance within groups), $\alpha_i$ (the fixed treatment effects), the variance among the groups, and the added component due to treatment $\sum\alpha^2/(a - 1)$, and test the hypothesis that the last quantity is zero.

Treatment    A      B      C      D
$\bar{Y}$    6.12   4.34   5.12   7.28
$s^2$        2.85   6.70   4.06   2.03
$n$          10     10     10     10

ANS. $s^2 = 3.91$, $\hat{\alpha}_1 = 0.405$, $\hat{\alpha}_2 = -1.375$, $\hat{\alpha}_3 = -0.595$, $\hat{\alpha}_4 = 1.565$, MS among groups = 124.517, and $F_s = 31.846$ (which is significant beyond the 0.01 level).

7.4 For the data in Table 7.3, make tables to represent partitioning of the value of each variate into its three components, $\bar{\bar{Y}}$, $(\bar{Y}_i - \bar{\bar{Y}})$, $(Y_{ij} - \bar{Y}_i)$. The first table would then consist of 35 values, all equal to the grand mean. In the second table all entries in a given column would be equal to the difference between the mean of that column and the grand mean. And the last table would consist of the deviations of the individual variates from their column means. These tables represent estimates of the individual components of Expression (7.3). Compute the mean and sum of squares for each table.

7.5 A geneticist recorded the following measurements taken on two-week-old mice of a particular strain. Is there evidence that the variance among mice in different litters is larger than one would expect on the basis of the variability found within each litter?

Litters 1 2 3 4 5 6 7

ANS. $s^2 = 5.987$, $MS_{among} = 4.416$, $s^2_A = 0$, and $F_s = 0.7375$, which is clearly not significant at the 5% level.

7.6 Show that it is possible to represent the value of an individual variate as follows: $Y_{ij} = (\bar{\bar{Y}}) + (\bar{Y}_i - \bar{\bar{Y}}) + (Y_{ij} - \bar{Y}_i)$. What does each of the terms in parentheses estimate in a Model I anova and in a Model II anova?
CHAPTER 8
Single-Classification Analysis of Variance
We are now ready to study actual cases of analysis of variance in a variety of applications and designs. The present chapter deals with the simplest kind of anova, single-classification analysis of variance. By this we mean an analysis in which the groups (samples) are classified by only a single criterion. Either interpretation of the seven samples of housefly wing lengths (studied in the last chapter), different medium formulations (Model I) or progenies of different females (Model II), would represent a single criterion for classification. Other examples would be different temperatures at which groups of animals were raised or different soils in which samples of plants have been grown.

We shall start in Section 8.1 by stating the basic computational formulas for analysis of variance, based on the topics covered in the previous chapter. Section 8.2 gives an example of the common case with equal sample sizes. We shall illustrate this case by means of a Model I anova. Since the basic computations for the analysis of variance are the same in either model, it is not necessary to repeat the illustration with a Model II anova. The latter model is featured in Section 8.3, which shows the minor computational complications resulting from unequal sample sizes, since all groups in the anova need not necessarily have the same sample size. Some computations unique to a Model II anova are also shown; these estimate variance components. Formulas become especially simple for the two-sample case, as explained in Section 8.4. In Model I of this case, the mathematically equivalent t test can be applied as well.

When a Model I analysis of variance has been found to be significant, leading to the conclusion that the means are not from the same population, we will usually wish to test the means in a variety of ways to discover which pairs of means are different from each other and whether the means can be divided into groups that are significantly different from each other. To this end, Section 8.5 deals with so-called planned comparisons designed before the test is run; and Section 8.6, with unplanned multiple-comparison tests that suggest themselves to the experimenter as a result of the analysis.

8.1 Computational formulas

We saw in Section 7.5 that the total sum of squares and degrees of freedom can be additively partitioned into those pertaining to variation among groups and those to variation within groups. For the analysis of variance proper, we need only the sum of squares among groups and the sum of squares within groups. But when the computation is not carried out by computer, it is simpler to calculate the total sum of squares and the sum of squares among groups, leaving the sum of squares within groups to be obtained by the subtraction $SS_{total} - SS_{groups}$. However, it is a good idea to compute the individual variances so we can check for heterogeneity among them (see Section 10.1). This will also permit an independent computation of $SS_{within}$ as a check. In Section 7.5 we arrived at the following computational formulas for the total and among-groups sums of squares:

$$SS_{total} = \sum^{a}\sum^{n} Y^2 - \frac{1}{an}\left(\sum^{a}\sum^{n} Y\right)^2$$

$$SS_{groups} = \frac{1}{n}\sum^{a}\left(\sum^{n} Y\right)^2 - \frac{1}{an}\left(\sum^{a}\sum^{n} Y\right)^2$$
These formulas assume equal sample size for each group and will be modified in Section 8.3 for unequal sample sizes. However, they suffice in their present form to illustrate some general points about computational procedures in analysis of variance. We note that the second, subtracted term is the same in both sums of squares. This term can be obtained by summing all the variates in the anova (this is the grand total), squaring the sum, and dividing the result by the total number of variates. It is comparable to the second term in the computational formula for the ordinary sum of squares (Expression (3.8)). This term is often called the correction term (abbreviated CT). The first term for the total sum of squares is simple. It is the sum of all squared variates in the anova table. Thus the total sum of squares, which describes the variation of a single unstructured sample of $an$ items, is simply the familiar sum-of-squares formula of Expression (3.8).
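The correction term and the total sum of squares described above can be sketched in a few lines. The ten values are hypothetical and serve only to compare the computational formula with the definitional sum of squared deviations from the mean; the function names are introduced here for illustration.

```python
def correction_term(values):
    """CT: square of the grand total divided by the number of variates."""
    return sum(values) ** 2 / len(values)


def ss_total(values):
    """Total SS: sum of squared variates minus the correction term."""
    return sum(y * y for y in values) - correction_term(values)


# Hypothetical unstructured sample of 10 items
values = [41, 44, 48, 43, 42, 48, 49, 49, 45, 43]

mean = sum(values) / len(values)
# Definitional form: squared deviations from the mean
direct = sum((y - mean) ** 2 for y in values)
computed = ss_total(values)
```

Both routes give the same sum of squares; the computational form avoids computing each deviation and is the one used in hand calculation.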
The first term of the sum of squares among groups is obtained by squaring the sum of the items of each group, dividing each square by its sample size, and summing the quotients from this operation for each group. Since the sample size of each group is equal in the above formulas, we can first sum all the squares of the group sums and then divide their sum by the constant $n$. From the formula for the sum of squares among groups emerges an important computational rule of analysis of variance: To find the sum of squares among any set of groups, square the sum of each group and divide by the sample size of the group, sum the quotients of these operations, and subtract from the sum a correction term. To find this correction term, sum all the items in the set, square the sum, and divide it by the number of items on which this sum is based.

8.2 Equal n

We shall illustrate a single-classification anova with equal sample sizes by a Model I example. The computation up to and including the first test of significance is identical for both models. Thus, the computation of Box 8.1 could also serve for a Model II anova with equal sample sizes. The data are from an experiment in plant physiology. They are the lengths in coded units of pea sections grown in tissue culture with auxin present. The purpose of the experiment was to test the effects of the addition of various sugars on growth as measured by length. Four experimental groups, representing three different sugars and one mixture of sugars, were used, plus one control without sugar. Ten observations (replicates) were made for each treatment. The term "treatment" already implies a Model I anova. It is obvious that the five groups do not represent random samples from all possible experimental conditions but were deliberately designed to test the effects of certain sugars on the growth rate.
We are interested in the effect of the sugars on length, and our null hypothesis will be that there is no added component due to treatment effects among the five groups; that is, the population means are all assumed to be equal. The computation is illustrated in Box 8.1. After quantities 1 through 7 have been calculated, they are entered into an analysis-of-variance table, as shown in the box. General formulas for such a table are shown first; these are followed by a table filled in for the specific example. We note 4 degrees of freedom among groups, there being five treatments, and 45 df within groups, representing 5 times $(10 - 1)$ degrees of freedom. We find that the mean square among groups is considerably greater than the error mean square, giving rise to a suspicion that an added component due to treatment effects is present. If the $MS_{groups}$ is equal to or less than the $MS_{within}$, we do not bother going on with the analysis, for we would not have evidence for the presence of an added component. You may wonder how it could be possible for the $MS_{groups}$ to be less than the $MS_{within}$. You must remember that these two are independent estimates. If there is no added component due to treatment or variance component among groups, the estimate of the variance among groups is as likely to be less as it is to be greater than the variance within groups.
Expressions for the expected values of the mean squares are also shown in the first anova table of Box 8.1. They are the expressions you learned in the previous chapter for a Model I anova.
BOX 8.1
Single-classification anova with equal sample sizes.

The effect of the addition of different sugars on length, in ocular units (× 0.114 = mm), of pea sections grown in tissue culture with auxin present: $n = 10$ (replications per group). This is a Model I anova.

Treatments ($a = 5$)

Treatment                         Observations                      $\sum^{n} Y$   $\bar{Y}$
Control                           75 67 … 68                        701            70.1
2% Glucose added                  57 58 60 59 62 60 60 57 59 61     593            59.3
2% Fructose added                 58 61 56 58 57 56 61 60 57 58     582            58.2
1% Glucose + 1% Fructose added    58 59 58 61 57 56 58 57 57 59     580            58.0
2% Sucrose added                  62 66 65 63 64 62 65 65 62 67     641            64.1

Preliminary computations

1. Grand total $= \sum^{a}\sum^{n} Y = 701 + 593 + \cdots + 641 = 3097$

2. Sum of the squared observations $= \sum^{a}\sum^{n} Y^2 = 75^2 + 67^2 + \cdots + 68^2 + 57^2 + \cdots + 67^2 = 193{,}151$

3. Sum of the squared group totals divided by $n$ $= \frac{1}{n}\sum^{a}\left(\sum^{n} Y\right)^2 = \frac{1}{10}(701^2 + 593^2 + \cdots + 641^2) = \frac{1}{10}(1{,}929{,}055) = 192{,}905.50$

4. Grand total squared and divided by total sample size = correction term $CT = \frac{1}{an}\left(\sum^{a}\sum^{n} Y\right)^2 = \frac{(3097)^2}{5 \times 10} = \frac{9{,}591{,}409}{50} = 191{,}828.18$

5. $SS_{total} = \sum^{a}\sum^{n} Y^2 - CT$ = quantity 2 − quantity 4 $= 193{,}151 - 191{,}828.18 = 1322.82$

6. $SS_{groups} = \frac{1}{n}\sum^{a}\left(\sum^{n} Y\right)^2 - CT$ = quantity 3 − quantity 4 $= 192{,}905.50 - 191{,}828.18 = 1077.32$

7. $SS_{within} = SS_{total} - SS_{groups}$ = quantity 5 − quantity 6 $= 1322.82 - 1077.32 = 245.50$

The anova table is constructed as follows.

Source of variation                           df           SS           MS                         $F_s$                      Expected MS
$\bar{Y} - \bar{\bar{Y}}$   Among groups      $a - 1$      quantity 6   quantity 6 / $(a - 1)$     $MS_{groups}/MS_{within}$   $\sigma^2 + \frac{n}{a-1}\sum^{a}\alpha^2$
$Y - \bar{Y}$   Within groups                 $a(n - 1)$   quantity 7   quantity 7 / $a(n - 1)$                                $\sigma^2$
$Y - \bar{\bar{Y}}$   Total                   $an - 1$     quantity 5

Substituting the computed values into the above table, we obtain the following:

Anova table

Source of variation                                                  df    SS        MS       $F_s$
$\bar{Y} - \bar{\bar{Y}}$   Among groups (among treatments)          4     1077.32   269.33   49.33**
$Y - \bar{Y}$   Within groups (error, replicates)                    45    245.50    5.46
$Y - \bar{\bar{Y}}$   Total                                          49    1322.82

$F_{0.05[4,45]} = 2.58$     $F_{0.01[4,45]} = 3.77$

Conclusions. There is a highly significant ($P < 0.01$) added component due to treatment effects in the mean square among groups (treatments). The different sugar treatments clearly have a significant effect on growth of the pea sections. See Sections 8.5 and 8.6 for the completion of a Model I analysis of variance: that is, the method for determining which means are significantly different from each other.
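Quantities 1 through 7 of Box 8.1 can be retraced from the five group totals and the sum of squared observations alone. The sketch below takes those figures from the box; the variable names are introduced here for illustration. Note that the unrounded variance ratio comes out near 49.37, whereas the box's 49.33 corresponds to dividing the rounded mean squares (269.33 / 5.46).

```python
group_totals = [701, 593, 582, 580, 641]  # group sums from Box 8.1
n = 10                                    # replicates per group
a = len(group_totals)                     # number of treatments
sum_sq_obs = 193_151                      # quantity 2, as given in the box

grand_total = sum(group_totals)                 # quantity 1
q3 = sum(t * t for t in group_totals) / n       # quantity 3
ct = grand_total ** 2 / (a * n)                 # quantity 4, correction term
ss_total = sum_sq_obs - ct                      # quantity 5
ss_groups = q3 - ct                             # quantity 6
ss_within = ss_total - ss_groups                # quantity 7

df_groups, df_within = a - 1, a * (n - 1)
ms_groups = ss_groups / df_groups
ms_within = ss_within / df_within
f_s = ms_groups / ms_within                     # unrounded variance ratio
```

Obtaining the within-groups sum of squares by subtraction mirrors the hand computation in the box; with the raw observations available it could also be computed directly as an independent check.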
It may seem that we are carrying an unnecessary number of digits in the computations in Box 8.1. This is often necessary to ensure that the error sum of squares, quantity 7, has sufficient accuracy. Since $\nu_2$ is relatively large, the critical values of $F$ have been computed by harmonic interpolation in Table V (see footnote to Table III for harmonic interpolation). The critical values have been given here only to present a complete record of the analysis. Ordinarily, when confronted with this example, you would not bother working out these values of $F$. Comparison of the observed variance ratio $F_s = 49.33$ with $F_{0.01[4,40]} = 3.83$, the conservative critical value (the next tabled $F$ with fewer degrees of freedom), would convince you that the null hypothesis should be rejected. The probability that the five groups differ as much as they do by chance is almost infinitesimally small. Clearly, the sugars produce an added treatment effect, apparently inhibiting growth and consequently reducing the length of the pea sections.

At this stage we are not in a position to say whether each treatment is different from every other treatment, or whether the sugars are different from the control but not different from each other. Such tests are necessary to complete a Model I analysis, but we defer their discussion until Sections 8.5 and 8.6.

8.3 Unequal n

This time we shall use a Model II analysis of variance for an example. Remember that up to and including the $F$ test for significance, the computations are exactly the same whether the anova is based on Model I or Model II. We shall point out the stage in the computations at which there would be a divergence of operations depending on the model. The example is shown in Table 8.1.
It concerns a series of morphological measurements of the width of the scutum (dorsal shield) of samples of tick larvae obtained from four different host individuals of the cottontail rabbit. These four hosts were obtained at random from one locality. We know nothing about their origins or their genetic constitution. They represent a random sample of the population of host individuals from the given locality. We would not be in a position to interpret differences between larvae from different hosts, since we know nothing of the origins of the individual rabbits. Population biologists are nevertheless interested in such analyses because they provide an answer to the following question: Are the variances of means of larval characters among hosts greater than expected on the basis of variances of the characters within hosts? We can calculate the average variance of width of larval scutum on a host. This will be our "error" term in the analysis of variance. We then test the observed mean square among groups and see if it contains an added component of variance. What would such an added component of variance represent? The mean square within host individuals (that is, of larvae on any one host) represents genetic differences among larvae and differences in environmental experiences of these larvae. Added variance among hosts demonstrates significant differentiation among the larvae possibly due to differences among the hosts affecting the larvae. It also may be due to genetic differences among
TABLE 8.1
Data and anova table for a single-classification anova with unequal sample sizes. Width of scutum (dorsal shield) of larvae of the tick Haemaphysalis leporispalustris in samples from 4 cottontail rabbits. Measurements in microns. This is a Model II anova.

Hosts ($a = 4$)

Host   Observations                                           $\sum^{n_i} Y$   $n_i$   $\sum^{n_i} Y^2$   $s^2$
1      …                                                      2978             8       1,108,940          54.21
2      350 356 358 376 338 342 366 350 344 364                …                …       …                  …
3      354 360 362 352 366 372 362 344 342 358 351 348 348    4619             13      1,642,121          79.56
4      …                                                      …                …       …                  …

Anova table

Source of variation                                                         df    SS    MS      $F_s$
$\bar{Y} - \bar{\bar{Y}}$   Among groups (among hosts)                      3     …     602.6   5.26**
$Y - \bar{Y}$   Within groups (error; among larvae on a host)               33    …     114.5
$Y - \bar{\bar{Y}}$   Total                                                 36    …

$F_{0.05[3,33]} = 2.89$     $F_{0.01[3,33]} = 4.44$
the larvae, should each host carry a family of ticks, or at least a population whose individuals are more related to each other than they are to tick larvae on other host individuals.

The emphasis in this example is on the magnitudes of the variances. In view of the random choice of hosts this is a clear case of a Model II anova. Because this is a Model II anova, the means for each host have been omitted from Table 8.1. We are not interested in the individual means or possible differences
among them. A possible reason for looking at the means would be at the beginning of the analysis. One might wish to look at the group means to spot outliers, which might represent readings that for a variety of reasons could be in error. The computation follows the outline furnished in Box 8.1, except that the symbol $n$ now needs to be written $n_i$, since sample sizes differ for each group. Steps 1, 2, and 4 through 7 are carried out as before. Only step 3 needs to be modified appreciably. It is:

3. Sum of the squared group totals, each divided by its sample size,

$$= \sum^{a} \frac{\left(\sum^{n_i} Y\right)^2}{n_i}$$
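The modified step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ss_among_groups` is hypothetical, and the example uses just the two host samples whose individual readings are listed in Table 8.1 as reproduced here, so the result is not the among-hosts sum of squares of the full four-host analysis.

```python
def ss_among_groups(groups):
    """Sum of squares among groups when sample sizes n_i differ.

    `groups` is a list of lists of variates, one inner list per group.
    """
    # Step 3: sum of the squared group totals, each divided by its own n_i
    quantity_3 = sum(sum(g) ** 2 / len(g) for g in groups)
    # Correction term: squared grand total divided by the total sample size
    grand_total = sum(sum(g) for g in groups)
    total_n = sum(len(g) for g in groups)
    correction_term = grand_total ** 2 / total_n
    return quantity_3 - correction_term


# Readings for the hosts with n = 10 and n = 13 (Table 8.1)
host_2 = [350, 356, 358, 376, 338, 342, 366, 350, 344, 364]
host_3 = [354, 360, 362, 352, 366, 372, 362, 344, 342, 358, 351, 348, 348]

ss_two_hosts = ss_among_groups([host_2, host_3])
print(round(ss_two_hosts, 3))
```

Dividing each squared total by its own $n_i$, rather than by a common $n$, is the only change from the equal-sample-size outline.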
The critical 5% and 1% values of $F$ are shown below the anova table in Table 8.1 (2.89 and 4.44, respectively). You should confirm them for yourself in Table V. Note that the argument $\nu_2 = 33$ is not given. You therefore have to interpolate between arguments representing 30 and 40 degrees of freedom, respectively. The values shown were computed using harmonic interpolation. However, again, it was not necessary to carry out such an interpolation. The conservative value of $F$, $F_{\alpha[3,30]}$, is 2.92 and 4.51, for $\alpha = 0.05$ and $\alpha = 0.01$, respectively. The observed value $F_s$ is 5.26, considerably above the interpolated as well as the conservative value of $F_{0.01}$. We therefore reject the null hypothesis ($H_0$: $\sigma^2_A = 0$) that there is no added variance component among groups and that the two mean squares estimate the same variance, allowing a type I error of less than 1%. We accept, instead, the alternative hypothesis of the existence of an added variance component $\sigma^2_A$. What is the biological meaning of this conclusion? For some reason, the ticks on different host individuals differ more from each other than do individual ticks on any one host. This may be due to some modifying influence of individual hosts on the ticks (biochemical differences in blood, differences in the skin, differences in the environment of the host individual, all of them rather unlikely in this case), or it may be due to genetic differences among the ticks. Possibly the ticks on each host represent a sibship (that is, are descendants of a single pair of parents) and the differences in the ticks among host individuals represent genetic differences among families; or perhaps selection has acted differently on the tick populations on each host, or the hosts have migrated to the collection locality from different geographic areas in which the ticks differ in width of scutum.
Of these various possibilities, genetic differences among sibships seem most reasonable, in view of the biology of the organism. The computations up to this point would have been identical in a Model I anova. If this had been Model I, the conclusion would have been that there is a significant treatment effect rather than an added variance component. Now, however, we must complete the computations appropriate to a Model II anova. These will include the estimation of the added variance component and the calculation of percentage variation at the two levels.
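The harmonic interpolation used for the critical values at $\nu_2 = 33$ can be sketched as follows. This is a minimal illustration, not the authors' program: the function name is hypothetical, and the tabled values for 40 degrees of freedom (2.84 and 4.31) are standard $F$-table entries that are not quoted in the passage itself. Harmonic interpolation is taken here to mean linear interpolation in $1/\nu_2$.

```python
def harmonic_interpolate(df_low, f_low, df_high, f_high, df_target):
    """Interpolate a tabled critical value linearly in 1/df."""
    weight = (1 / df_low - 1 / df_target) / (1 / df_low - 1 / df_high)
    return f_low + weight * (f_high - f_low)


# Tabled critical values of F with nu_1 = 3 at nu_2 = 30 and nu_2 = 40
f_05 = harmonic_interpolate(30, 2.92, 40, 2.84, 33)
f_01 = harmonic_interpolate(30, 4.51, 40, 4.31, 33)
print(round(f_05, 2), round(f_01, 2))
```

Under these assumptions the interpolated values round to the 2.89 and 4.44 shown beneath Table 8.1.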
Since sample size $n_i$ differs among groups in this example, we cannot write $\sigma^2 + n\sigma^2_A$ for the expected $MS_{\text{groups}}$. It is obvious that no single value of $n$ would be appropriate in the formula. We therefore use an average $n$; this, however, is not simply $\bar{n}$, the arithmetic mean of the $n_i$'s, but is

$$n_0 = \frac{1}{a-1}\left(\sum^{a} n_i - \frac{\sum^{a} n_i^2}{\sum^{a} n_i}\right) \qquad (8.1)$$

which is an average usually close to but always less than $\bar{n}$, unless sample sizes are equal, in which case $n_0 = \bar{n}$. In this example,

$$n_0 = \frac{1}{4-1}\left[(8 + 10 + 13 + 6) - \frac{8^2 + 10^2 + 13^2 + 6^2}{8 + 10 + 13 + 6}\right] = 9.009$$
Since the Model II expected $MS_{\text{groups}}$ is $\sigma^2 + n\sigma^2_A$ and the expected $MS_{\text{within}}$ is $\sigma^2$, it is obvious how the variance component among groups $\sigma^2_A$ and the error variance $\sigma^2$ are obtained. Of course, the values that we obtain are sample estimates and therefore are written as $s^2_A$ and $s^2$. The added variance component $s^2_A$ is estimated as $(MS_{\text{groups}} - MS_{\text{within}})/n$. Whenever sample sizes are unequal, the denominator becomes $n_0$. In this example, $(602.7 - 114.5)/9.009 = 54.190$. We are frequently not so much interested in the actual values of these variance components as in their relative magnitudes. For this purpose we sum the components and express each as a percentage of the resulting sum. Thus $s^2 + s^2_A = 114.5 + 54.190 = 168.690$, and $s^2$ and $s^2_A$ are 67.9% and 32.1% of this sum, respectively; relatively more variation occurs within groups (larvae on a host) than among groups (larvae on different hosts).
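The Model II computations above (Expression (8.1), the added variance component, and the percentages at the two levels) can be retraced with a short script. It is a sketch using the mean squares quoted in the text; the function and variable names are illustrative only.

```python
def average_sample_size(sample_sizes):
    """n_0 of Expression (8.1) for groups of unequal size."""
    a = len(sample_sizes)
    total = sum(sample_sizes)
    sum_of_squares = sum(n * n for n in sample_sizes)
    return (total - sum_of_squares / total) / (a - 1)


sample_sizes = [8, 10, 13, 6]            # n_i for the four hosts
ms_groups, ms_within = 602.7, 114.5      # mean squares quoted in the text

n_0 = average_sample_size(sample_sizes)
s2_added = (ms_groups - ms_within) / n_0  # added variance component s^2_A
total_variance = ms_within + s2_added
pct_within = 100 * ms_within / total_variance
pct_among = 100 * s2_added / total_variance

print(round(n_0, 3), round(s2_added, 2))
print(round(pct_within, 1), round(pct_among, 1))
```

With equal sample sizes the function returns the common $n$, which is the special case noted after Expression (8.1).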
8.4 Two groups

A frequent test in statistics is to establish the significance of the difference between two means. This can easily be done by means of an analysis of variance for two groups. Box 8.2 shows this procedure for a Model I anova, the common case. The example in Box 8.2 concerns the onset of reproductive maturity in water fleas, Daphnia longispina. This is measured as the average age (in days) at beginning of reproduction. Each variate in the table is in fact an average, and a possible flaw in the analysis might be that the averages are not based on equal sample sizes. However, we are not given this information and have to proceed on the assumption that each reading in the table is an equally reliable variate. The two series represent different genetic crosses, and the seven replicates in each series are clones derived from the same genetic cross. This example is clearly a Model I anova, since the question to be answered is whether series I differs from series II in average age at the beginning of reproduction. Inspection of the data shows that the mean age at beginning of reproduction
BOX 8.2

Testing the difference in means between two groups.

Average age (in days) at beginning of reproduction in Daphnia longispina (each variate is a mean based on approximately similar numbers of females). Two series derived from different genetic crosses and containing seven clones each are compared; $n = 7$ clones per series. This is a Model I anova.

Series (a = 2)

            I                      II
Y           7.2 7.1 9.1 7.2 …      8.8 7.5 7.7 7.6 7.4 6.7 7.2
ΣY          52.6                   52.9
$\bar{Y}$   7.5143                 7.5571
ΣY²         398.28                 402.23
s²          0.5047                 0.4095

Single-classification anova with two groups with equal sample sizes

Anova table

Source of variation                                                          df    MS        Fs
$\bar{Y} - \bar{\bar{Y}}$   Between groups (series)                          1     0.00643   0.0141
$Y - \bar{Y}$   Within groups (error; clones within series)                  12    0.45714
$Y - \bar{\bar{Y}}$   Total                                                  13

$F_{0.05[1,12]} = 4.75$

Conclusions. Since $F_s \ll F_{0.05[1,12]}$, the null hypothesis is accepted. The means of the two series are not significantly different; that is, the two series do not differ in average age at beginning of reproduction.

A t test of the hypothesis that two sample means come from a population with equal $\mu$; also confidence limits of the difference between two means

This test assumes that the variances in the populations from which the two samples were taken are identical. If in doubt about this hypothesis, test by method of Box 7.1, Section 7.3.
BOX 8.2 Continued

The appropriate formula for $t_s$ is one of the following:

Expression (8.2), when sample sizes are unequal and $n_1$ or $n_2$ or both sample sizes are small (< 30): $df = n_1 + n_2 - 2$
Expression (8.3), when sample sizes are identical (regardless of size): $df = 2(n - 1)$
Expression (8.4), when $n_1$ and $n_2$ are unequal but both are large (> 30): $df = n_1 + n_2 - 2$

For the present data, since sample sizes are equal, we choose Expression (8.3):

$$t_s = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{1}{n}\left(s_1^2 + s_2^2\right)}}$$

We are testing the null hypothesis that $\mu_1 - \mu_2 = 0$. Therefore we replace this quantity by zero in this example. Then

$$t_s = \frac{7.5143 - 7.5571}{\sqrt{(0.5047 + 0.4095)/7}} = \frac{-0.0428}{\sqrt{0.9142/7}} = \frac{-0.0428}{0.3614} = -0.1184$$

The degrees of freedom for this example are $2(n - 1) = 2 \times 6 = 12$. The critical value of $t_{0.05[12]} = 2.179$. Since the absolute value of our observed $t_s$ is less than the critical $t$ value, the means are found to be not significantly different, which is the same result as was obtained by the anova.

Confidence limits of the difference between two means

$$L_1 = (\bar{Y}_1 - \bar{Y}_2) - t_{\alpha[\nu]}\, s_{\bar{Y}_1 - \bar{Y}_2}$$
$$L_2 = (\bar{Y}_1 - \bar{Y}_2) + t_{\alpha[\nu]}\, s_{\bar{Y}_1 - \bar{Y}_2}$$

In this case $\bar{Y}_1 - \bar{Y}_2 = -0.0428$, $t_{0.05[12]} = 2.179$, and $s_{\bar{Y}_1 - \bar{Y}_2} = 0.3614$, as computed earlier for the denominator of the $t$ test. Therefore

$$L_1 = -0.0428 - (2.179)(0.3614) = -0.8303$$
$$L_2 = -0.0428 + (2.179)(0.3614) = 0.7447$$

The 95% confidence limits contain the zero point (no difference), as was to be expected, since the difference $\bar{Y}_1 - \bar{Y}_2$ was found to be not significant.
is very similar for the two series. It would surprise us, therefore, to find that they are significantly different. However, we shall carry out a test anyway. As you realize by now, one cannot tell from the magnitude of a difference whether it is significant. This depends on the magnitude of the error mean square, representing the variance within series. The computations for the analysis of variance are not shown. They would be the same as in Box 8.1. With equal sample sizes and only two groups, there is a computational shortcut: the sum of squares between groups can be computed directly as

$$SS_{\text{groups}} = \frac{\left(\sum Y_1 - \sum Y_2\right)^2}{2n} = \frac{(52.6 - 52.9)^2}{14} = 0.00643$$
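As a quick check, the sketch below compares this two-group shortcut with the general rule (sum of squared group totals divided by $n$, minus the correction term). Function names are hypothetical; the group sums 52.6 and 52.9 and $n = 7$ are those of Box 8.2.

```python
def ss_groups_general(group_sums, n):
    """General rule for equal n: sum of squared totals / n minus the correction term."""
    grand_total = sum(group_sums)
    correction_term = grand_total ** 2 / (len(group_sums) * n)
    return sum(s ** 2 for s in group_sums) / n - correction_term


def ss_groups_two_equal(sum_1, sum_2, n):
    """Shortcut for exactly two groups of equal size n."""
    return (sum_1 - sum_2) ** 2 / (2 * n)


general = ss_groups_general([52.6, 52.9], 7)
shortcut = ss_groups_two_equal(52.6, 52.9, 7)
print(round(general, 5), round(shortcut, 5))
```

The two routes agree up to floating-point rounding; the shortcut avoids subtracting two large, nearly equal numbers.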
There is only 1 degree of freedom between the two groups. The critical value of $F_{0.05[1,12]}$ is given underneath the anova table, but it is really not necessary to consult it. Inspection of the mean squares in the anova shows that $MS_{\text{groups}}$ is much smaller than $MS_{\text{within}}$; therefore the value of $F_s$ is far below unity, and there cannot possibly be an added component due to treatment effects between the series. In cases where $MS_{\text{groups}} < MS_{\text{within}}$, we do not usually bother to calculate $F_s$, because the analysis of variance could not possibly be significant.

There is another method of solving a Model I two-sample analysis of variance. This is a $t$ test of the differences between two means. This $t$ test is the traditional method of solving such a problem; it may already be familiar to you from previous acquaintance with statistical work. It has no real advantage in either ease of computation or understanding, and as you will see, it is mathematically equivalent to the anova in Box 8.2. It is presented here mainly for the sake of completeness. It would seem too much of a break with tradition not to have the $t$ test in a biostatistics text.

In Section 6.4 we learned about the $t$ distribution and saw that a $t$ distribution of $n - 1$ degrees of freedom could be obtained from a distribution of the term $(\bar{Y}_i - \mu)/s_{\bar{Y}_i}$, where $s_{\bar{Y}_i}$ has $n - 1$ degrees of freedom and $\bar{Y}$ is normally distributed. The numerator of this term represents a deviation of a sample mean from a parametric mean, and the denominator represents a standard error for such a deviation. We now learn that the expression

$$t_s = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\left[\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\right]\left(\dfrac{n_1 + n_2}{n_1 n_2}\right)}} \qquad (8.2)$$
is also distributed as $t$. Expression (8.2) looks complicated, but it really has the same structure as the simpler term for $t$. The numerator is a deviation, this time, not between a single sample mean and the parametric mean, but between a single difference between two sample means, $\bar{Y}_1$ and $\bar{Y}_2$, and the true difference between the means of the populations represented by these means. In a test of this sort our null hypothesis is that the two samples come from the same population; that is, they must have the same parametric mean. Thus, the difference $\mu_1 - \mu_2$ is assumed to be zero. We therefore test the deviation of the difference $\bar{Y}_1 - \bar{Y}_2$ from zero. The denominator of Expression (8.2) is a standard error, the standard error of the difference between two means $s_{\bar{Y}_1 - \bar{Y}_2}$. The left portion of the expression, which is in square brackets, is a weighted average of the variances of the two samples, $s_1^2$ and $s_2^2$, computed
in the manner of Section 7.1. The right term of the standard error is the computationally easier form of $(1/n_1) + (1/n_2)$, which is the factor by which the average variance within groups must be multiplied in order to convert it into a variance of the difference of means. The analogy with the multiplication of a sample variance $s^2$ by $1/n$ to transform it into a variance of a mean $s^2_{\bar{Y}}$ should be obvious. The test as outlined here assumes equal variances in the two populations sampled. This is also an assumption of the analyses of variance carried out so far, although we have not stressed this. With only two variances, equality may be tested by the procedure in Box 7.1. When sample sizes are equal in a two-sample test, Expression (8.2) simplifies to the expression

$$t_s = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{1}{n}\left(s_1^2 + s_2^2\right)}} \qquad (8.3)$$

which is what is applied in the present example in Box 8.2. When the sample sizes are unequal but rather large, so that the differences between $n_i$ and $n_i - 1$ are relatively trivial, Expression (8.2) reduces to the simpler form

$$t_s = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_2} + \dfrac{s_2^2}{n_1}}} \qquad (8.4)$$
The simplification of Expression (8.2) to Expressions (8.3) and (8.4) is shown in Appendix A1.3. The pertinent degrees of freedom for Expressions (8.2) and (8.4) are $n_1 + n_2 - 2$, and for Expression (8.3) $df$ is $2(n - 1)$. The test of significance for differences between means using the $t$ test is shown in Box 8.2. This is a two-tailed test because our alternative hypothesis is $H_1$: $\mu_1 \neq \mu_2$. The results of this test are identical to those of the anova in the same box: the two means are not significantly different. We can demonstrate this mathematical equivalence by squaring the value for $t_s$. The result should be identical to the $F_s$ value of the corresponding analysis of variance. Since $t_s = -0.1184$ in Box 8.2, $t_s^2 = 0.0140$. Within rounding error, this is equal to the $F_s$ obtained in the anova ($F_s = 0.0141$). Why is this so? We learned that $t_{[\nu]} = (\bar{Y} - \mu)/s_{\bar{Y}}$, where $\nu$ is the degrees of freedom of the variance of the mean $s^2_{\bar{Y}}$; therefore $t^2_{[\nu]} = (\bar{Y} - \mu)^2/s^2_{\bar{Y}}$. However, this expression can be regarded as a variance ratio. The denominator is clearly a variance with $\nu$ degrees of freedom. The numerator is also a variance. It is a single deviation squared, which represents a sum of squares possessing 1 rather than zero degrees of freedom (since it is a deviation from the true mean $\mu$ rather than a sample mean). A sum of squares based on 1 degree of freedom is at the same time a variance. Thus, $t^2$ is a variance ratio, since $t^2_{[\nu]} = F_{[1,\nu]}$, as we have seen. In Appendix A1.4 we demonstrate algebraically that the $t_s^2$ and the $F_s$ value obtained in Box 8.2 are identical quantities. Since $t$ approaches the normal distribution as $\nu \to \infty$, $F_{[1,\nu]}$ approaches the distribution of
the square of the normal deviate as $\nu \to \infty$. We also know (from Section 7.2) that $\chi^2_{[\nu_1]}/\nu_1 = F_{[\nu_1,\infty]}$. Therefore, when $\nu_1 = 1$ and $\nu_2 = \infty$, $\chi^2_{[1]} = F_{[1,\infty]} = t^2_{[\infty]}$ (this can be demonstrated from Tables IV, V, and III, respectively):

$$\chi^2_{0.05[1]} = 3.841 \qquad F_{0.05[1,\infty]} = 3.84 \qquad t_{0.05[\infty]} = 1.960,\quad t^2 = 3.8416$$
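The equivalence of the $t$ test and the two-group anova can be checked numerically from the summary statistics of Box 8.2. The sketch below is illustrative only (function names are hypothetical); it evaluates Expression (8.2) and its equal-$n$ simplification (8.3) with $\mu_1 - \mu_2 = 0$, then compares $t_s^2$ with the $F_s$ formed from the mean squares of the anova table.

```python
import math


def t_pooled(mean_1, mean_2, var_1, var_2, n_1, n_2):
    """Expression (8.2) with the hypothesized difference mu_1 - mu_2 set to 0."""
    pooled_var = ((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / (n_1 + n_2 - 2)
    se_diff = math.sqrt(pooled_var * (n_1 + n_2) / (n_1 * n_2))
    return (mean_1 - mean_2) / se_diff


def t_equal_n(mean_1, mean_2, var_1, var_2, n):
    """Expression (8.3): equal sample sizes, mu_1 - mu_2 set to 0."""
    return (mean_1 - mean_2) / math.sqrt((var_1 + var_2) / n)


# Means and variances of series I and II, n = 7 clones each (Box 8.2)
t_s = t_equal_n(7.5143, 7.5571, 0.5047, 0.4095, 7)
t_general = t_pooled(7.5143, 7.5571, 0.5047, 0.4095, 7, 7)

# F_s from the mean squares of the anova table in Box 8.2
f_s = 0.00643 / 0.45714

print(round(t_s, 4), round(t_s ** 2, 4), round(f_s, 4))
```

Because only means and variances enter these functions, the same approach applies when the original variates are unavailable.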
The $t$ test for differences between two means is useful when we wish to set confidence limits to such a difference. Box 8.2 shows how to calculate 95% confidence limits to the difference between the series means in the Daphnia example. The appropriate standard error and degrees of freedom depend on whether Expression (8.2), (8.3), or (8.4) is chosen for $t_s$. It does not surprise us to find that the confidence limits of the difference in this case enclose the value of zero, ranging from $-0.8303$ to $+0.7447$. This must be so when a difference is found to be not significantly different from zero. We can interpret this by saying that we cannot exclude zero as the true value of the difference between the means of the two series. Another instance when you might prefer to compute the $t$ test for differences between two means rather than use analysis of variance is when you are lacking the original variates and have only published means and standard errors available for the statistical test. Such an example is furnished in Exercise 8.4.
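The confidence limits quoted above can be reproduced from the summary statistics in Box 8.2. This is a minimal sketch with hypothetical variable names; the critical value $t_{0.05[12]} = 2.179$ is taken from the box rather than computed.

```python
import math

mean_diff = 7.5143 - 7.5571                 # difference between the series means
se_diff = math.sqrt((0.5047 + 0.4095) / 7)  # standard error of the difference, equal n
t_critical = 2.179                          # t_0.05[12], as given in Box 8.2

lower_limit = mean_diff - t_critical * se_diff
upper_limit = mean_diff + t_critical * se_diff
contains_zero = lower_limit < 0 < upper_limit

print(round(lower_limit, 4), round(upper_limit, 4), contains_zero)
```

An interval that contains zero corresponds to a nonsignificant two-tailed test at the same $\alpha$.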
An important point about such tests is that they are designed and chosen independently of the results of the experiment. They should be planned before the experiment has been carried out and the results obtained. Such comparisons are called planned or a priori comparisons. Such tests are applied regardless of the results of the preliminary overall anova. By contrast, after the experiment has been carried out, we might wish to compare certain means that we notice to be markedly different. For instance, sucrose, with a mean of 64.1, appears to have had less of a growth-inhibiting effect than fructose, with a mean of 58.2. We might therefore wish to test whether there is in fact a significant difference between the effects of fructose and sucrose. Such comparisons, which suggest themselves as a result of the completed experiment, are called unplanned or a posteriori comparisons. These tests are performed only if the preliminary overall anova is significant. They include tests of the comparisons between all possible pairs of means. When there are $a$ means, there can, of course, be $a(a - 1)/2$ possible comparisons between pairs of means. The reason we make this distinction between a priori and a posteriori comparisons is that the tests of significance appropriate for the two comparisons are different. A simple example will show why this is so. Let us assume we have sampled from an approximately normal population of heights of men. We have computed their mean and standard deviation. If we sample two men at a time from this population, we can predict the difference between them on the basis of ordinary statistical theory.
Some men will be very similar, others relatively very different. Their differences will be distributed normally with a mean of 0 and an expected variance of $2\sigma^2$, for reasons that will be learned in Section 12.2. Thus, if we obtain a large difference between two randomly sampled men, it will have to be a sufficient number of standard deviations greater than zero for us to reject our null hypothesis that the two men come from the specified population. If, on the other hand, we were to look at the heights of the men before sampling them and then take pairs of men who seemed to be very different from each other, it is obvious that we would repeatedly obtain differences within pairs of men that were several standard deviations apart. Such differences would be outliers in the expected frequency distribution of differences, and time and again we would reject our null hypothesis when in fact it was true. The men would be sampled from the same population, but because they were not being sampled at random but being inspected before being sampled, the probability distribution on which our hypothesis testing rested would no longer be valid. It is obvious that the tails in a large sample from a normal distribution will be anywhere from 5 to 7 standard deviations apart. If we deliberately take individuals from each tail and compare them, they will appear to be highly significantly different from each other, according to the methods described in the present section, even though they belong to the same population.
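A small simulation can make this argument concrete. The sketch below is an illustration under assumed settings, none of which come from the text: samples of 20 "men" drawn from a standard normal population, 2000 repetitions, and a fixed random seed. It compares how often a pair chosen without inspection, and the tallest versus shortest pair chosen after inspection, differ by more than the 5% two-tailed critical difference $1.96\sqrt{2}\,\sigma$.

```python
import random


def rejection_rates(trials=2000, sample_size=20, seed=1):
    """Fraction of 'significant' differences for random versus inspected pairs."""
    rng = random.Random(seed)
    # Differences between two random individuals have variance 2 * sigma^2 (sigma = 1 here)
    critical_difference = 1.96 * 2 ** 0.5
    random_hits = 0
    extreme_hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        # Pair fixed in advance: the first two individuals drawn
        if abs(sample[0] - sample[1]) > critical_difference:
            random_hits += 1
        # Pair chosen after looking at the data: tallest versus shortest
        if max(sample) - min(sample) > critical_difference:
            extreme_hits += 1
    return random_hits / trials, extreme_hits / trials


random_rate, extreme_rate = rejection_rates()
print(random_rate, extreme_rate)
```

The pair fixed in advance should be rejected at roughly the nominal 5% rate, while the inspected pair is rejected in the large majority of samples even though every individual comes from the same population.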
When we compare means differing greatly from each other as the result of some treatment in the analysis of variance, we are doing exactly the same thing as taking the tallest and the shortest men from the frequency distribution of
heights. If we wish to know whether these are significantly different from each other, we cannot use the ordinary probability distribution on which the analysis of variance rests, but we have to use special tests of significance. These unplanned tests will be discussed in the next section. The present section concerns itself with the carrying out of those comparisons planned before the execution of the experiment.

The general rule for making a planned comparison is extremely simple; it is related to the rule for obtaining the sum of squares for any set of groups (discussed at the end of Section 8.1). To compare $k$ groups of any size $n_i$, take the sum of each group, square it, divide the result by the sample size $n_i$, and sum the $k$ quotients so obtained. From the sum of these quotients, subtract a correction term, which you determine by taking the grand sum of all the groups in this comparison, squaring it, and dividing the result by the number of items in the grand sum. If the comparison includes all the groups in the anova, the correction term will be the main CT of the study. If, however, the comparison includes only some of the groups of the anova, the CT will be different, being restricted only to these groups. These rules can best be learned by means of an example. Table 8.2 lists the means, group sums, and sample sizes of the experiment with the pea sections from Box 8.1.
You will recall that there were highly significant differences among the groups. We now wish to test whether the mean of the control differs from that of the four treatments representing addition of sugar. There will thus be two groups, one the control group and the other the "sugars" groups, the latter with a sum of 2396 and a sample size of 40. We therefore compute

$$SS\ (\text{control versus sugars}) = \frac{(701)^2}{10} + \frac{(593 + 582 + 580 + 641)^2}{40} - \frac{(701 + 593 + 582 + 580 + 641)^2}{50} = \frac{(701)^2}{10} + \frac{(2396)^2}{40} - \frac{(3097)^2}{50} = 832.32$$

In this case the correction term is the same as for the anova, because it involves all the groups of the study. The result is a sum of squares for the comparison between these two groups.
TABLE 8.2
Means, group sums, and sample sizes from the data in Box 8.1. Length of pea sections grown in tissue culture (in ocular units).

             Control   Glucose   Fructose   Glucose + fructose   Sucrose   Σ
$\bar{Y}$    70.1      59.3      58.2       58.0                 64.1      (61.94 = $\bar{\bar{Y}}$)
ΣY           701       593       582        580                  641       3097
n            10        10        10         10                   10        50
$$F_s = \frac{MS\ (\text{control versus sugars})}{MS_{\text{within}}} = \frac{832.32}{5.46} = 152.44$$

$$F_{0.05[1,45]} = 4.05, \qquad F_{0.01[1,45]} = 7.23$$
This comparison is highly significant, showing that the additions of sugars have significantly retarded the growth of the pea sections. Next we test whether the mixture of sugars is significantly different from the pure sugars. Using the same technique, we calculate

$$SS\ (\text{mixed sugars versus pure sugars}) = \frac{(580)^2}{10} + \frac{(593 + 582 + 641)^2}{30} - \frac{(593 + 582 + 580 + 641)^2}{40} = \frac{(580)^2}{10} + \frac{(1816)^2}{30} - \frac{(2396)^2}{40} = 48.13$$

Here the CT is different, since it is based on the sum of the sugars only. The appropriate test statistic is

$$F_s = \frac{MS\ (\text{mixed sugars versus pure sugars})}{MS_{\text{within}}} = \frac{48.13}{5.46} = 8.82$$

This is significant in view of the critical values of $F_{\alpha[1,45]}$ given in the preceding paragraph. A final test is among the three sugars. This mean square has 2 degrees of freedom, since it is based on three means. Thus we compute

$$SS\ (\text{among pure sugars}) = \frac{(593)^2}{10} + \frac{(582)^2}{10} + \frac{(641)^2}{10} - \frac{(1816)^2}{30} = 196.87$$

$$MS\ (\text{among pure sugars}) = \frac{196.87}{2} = 98.433$$

$$F_s = \frac{MS\ (\text{among pure sugars})}{MS_{\text{within}}} = \frac{98.433}{5.46} = 18.03$$

This $F_s$ is highly significant, since even $F_{0.01[2,40]} = 5.18$. We conclude that the addition of the three sugars retards growth in the pea sections, that mixed sugars affect the sections differently from pure sugars, and that the pure sugars are significantly different among themselves, probably because the sucrose has a far higher mean. We cannot test the sucrose against the other two, because that would be an unplanned test, which suggests itself to us after we have looked at the results. To carry out such a test, we need the methods of the next section.
Our a priori tests might have been quite different, depending entirely on our initial hypotheses. Thus, we could have tested control versus sugars initially, followed by disaccharides (sucrose) versus monosaccharides (glucose, fructose, glucose + fructose), followed by mixed versus pure monosaccharides and finally by glucose versus fructose. The pattern and number of planned tests are determined by one's hypotheses about the data. However, there are certain restrictions. It would clearly be a misuse of statistical methods to decide a priori that one wished to compare every mean against every other mean ($a(a - 1)/2$ comparisons). For $a$ groups, the sum of the degrees of freedom of the separate planned tests should not exceed $a - 1$. In addition, it is desirable to structure the tests in such a way that each one tests an independent relationship among the means (as was done in the example above). For example, we would prefer not to test if means 1, 2, and 3 differed if we had already found that mean 1 differed from mean 3, since significance of the latter suggests significance of the former. Since these tests are independent, the three sums of squares we have so far obtained, based on 1, 1, and 2 $df$, respectively, together add up to the sum of squares among treatments of the original analysis of variance based on 4 degrees of freedom. Thus:

                                               df
SS (control versus sugars)       =  832.32     1
SS (mixed versus pure sugars)    =   48.13     1
SS (among pure sugars)           =  196.87     2
SS (among treatments)            = 1077.32     4
This again illustrates the elegance of analysis of variance. The treatment sums of squares can be decomposed into separate parts that are sums of squares in their own right, with degrees of freedom pertaining to them. One sum of squares measures the difference between the controls and the sugars, the second that between the mixed sugars and the pure sugars, and the third the remaining variation among the three sugars. We can present all of these results as an anova table, as shown in Table 8.3.
TABLE 8.3
Anova table from Box 8.1, with treatment sum of squares decomposed into planned comparisons.

Source of variation          df    SS        MS       Fs
Treatments                   4     1077.32   269.33   49.33**
  Control vs. sugars         1     832.32    832.32   152.44**
  Mixed vs. pure sugars      1     48.13     48.13    8.82**
  Among pure sugars          2     196.87    98.43    18.03**
Within                       45    …         5.46
Total                        49    …
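The rule for planned comparisons, and the way the three sums of squares add up to the treatment sum of squares, can be retraced with the group sums of Table 8.2. This is a sketch with hypothetical function and variable names; the error mean square 5.46 is the value quoted in the text.

```python
def comparison_ss(group_sums, group_sizes):
    """Sum of squares for a comparison among groups given their sums and sizes.

    Sum the squared group sums divided by their sample sizes, then subtract
    a correction term based only on the groups in this comparison.
    """
    quotients = sum(s ** 2 / n for s, n in zip(group_sums, group_sizes))
    correction_term = sum(group_sums) ** 2 / sum(group_sizes)
    return quotients - correction_term


# Group sums from Table 8.2; every group has n = 10
control, glucose, fructose, mixed, sucrose = 701, 593, 582, 580, 641
n = 10
ms_within = 5.46

ss_control_vs_sugars = comparison_ss(
    [control, glucose + fructose + mixed + sucrose], [n, 4 * n])
ss_mixed_vs_pure = comparison_ss(
    [mixed, glucose + fructose + sucrose], [n, 3 * n])
ss_among_pure = comparison_ss([glucose, fructose, sucrose], [n, n, n])
ss_treatments = comparison_ss(
    [control, glucose, fructose, mixed, sucrose], [n] * 5)

f_control_vs_sugars = ss_control_vs_sugars / ms_within
f_mixed_vs_pure = ss_mixed_vs_pure / ms_within
f_among_pure = (ss_among_pure / 2) / ms_within   # 2 degrees of freedom

print(round(ss_control_vs_sugars, 2), round(ss_mixed_vs_pure, 2),
      round(ss_among_pure, 2), round(ss_treatments, 2))
```

Pooling groups inside a comparison is done simply by adding their sums and sample sizes before calling the function.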
When the planned comparisons are not independent, and when the number of comparisons planned is less than the total number of comparisons possible between all pairs of means, which is $a(a - 1)/2$, we carry out the tests as just shown but we adjust the critical values of the type I error $\alpha$. In comparisons that are not independent, if the outcome of a single comparison is significant, the outcomes of subsequent comparisons are more likely to be significant as well, so that decisions based on conventional levels of significance might be in doubt. For this reason, we employ a conservative approach, lowering the type I error of the statistic of significance for each comparison so that the probability of making any type I error at all in the entire series of tests does not exceed a predetermined value $\alpha$. This value is called the experimentwise error rate. Assuming that the investigator plans a number of comparisons, adding up to $k$ degrees of freedom, the appropriate critical values will be obtained if the probability $\alpha'$ is used for any one comparison, where

$$\alpha' = \frac{\alpha}{k}$$
The approach using this relation is called the Bonferroni method; it assures us of an experimentwise error rate $\leq \alpha$. Applying this approach to the pea section data, as discussed above, let us assume that the investigator has good reason to test the following comparisons between and among treatments, given here in abbreviated form: (C) versus (G, F, S, G + F); (G, F, S) versus (G + F); and (G) versus (F) versus (S); as well as (G, F) versus (G + F). The 5 degrees of freedom in these tests require that each individual test be adjusted to a significance level of

$$\alpha' = \frac{\alpha}{k} = \frac{0.05}{5} = 0.01$$

for an experimentwise critical $\alpha = 0.05$. Thus, the critical value for the $F_s$ ratios of these comparisons is $F_{0.01[1,45]}$ or $F_{0.01[2,45]}$, as appropriate. The first three tests are carried out as shown above. The last test is computed in a similar manner:

$$SS\ (\text{average of glucose and fructose vs. glucose and fructose mixed}) = \frac{(593 + 582)^2}{20} + \frac{(580)^2}{10} - \frac{(593 + 582 + 580)^2}{30} = \frac{(1175)^2}{20} + \frac{(580)^2}{10} - \frac{(1755)^2}{30} = 3.75$$

In spite of the change in critical value, the conclusions concerning the first three tests are unchanged. The last test, the average of glucose and fructose versus a mixture of the two, is not significant, since $F_s = 3.75/5.46 = 0.687$. Adjusting the critical value is a conservative procedure: individual comparisons using this approach are less likely to be significant.
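The Bonferroni adjustment and the last comparison can be retraced as below. The sketch uses hypothetical names; the group sums come from Table 8.2 and the error mean square 5.46 from the text.

```python
def bonferroni_alpha(alpha, k):
    """Per-comparison type I error for comparisons totalling k degrees of freedom."""
    return alpha / k


alpha_prime = bonferroni_alpha(0.05, 5)

# (G, F) versus (G + F): glucose and fructose pooled, against their mixture
glucose, fructose, mixed = 593, 582, 580
ss_last = ((glucose + fructose) ** 2 / 20
           + mixed ** 2 / 10
           - (glucose + fructose + mixed) ** 2 / 30)
f_last = ss_last / 5.46   # 1 degree of freedom, so SS is also the mean square

print(round(alpha_prime, 4), round(ss_last, 2), round(f_last, 3))
```

An $F_s$ below 1 cannot be significant at any conventional level, so no critical value needs to be looked up for this last comparison.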
The Bonferroni method generally will not employ the standard, tabled arguments of $\alpha$ for the $F$ distribution. Thus, if we were to plan tests involving altogether 6 degrees of freedom, the value of $\alpha'$ would be 0.0083. Exact tables for Bonferroni critical values are available for the special case of single-degree-of-freedom tests. Alternatively, we can compute the desired critical value by means of a computer program. A conservative alternative is to use the next smaller tabled value of $\alpha$. For details, consult Sokal and Rohlf (1981), Section 9.6.

The Bonferroni method (or a more recent refinement, the Dunn-Šidák method) should also be employed when you are reporting confidence limits for more than one group mean resulting from an analysis of variance. Thus, if you wanted to publish the means and $1 - \alpha$ confidence limits of all five treatments in the pea section example, you would not set confidence limits to each mean as though it were an independent sample, but you would employ $t_{\alpha'[\nu]}$, where $\nu$ is the degrees of freedom of the entire study and $\alpha'$ is the adjusted type I error explained earlier. Details of such a procedure can be learned in Sokal and Rohlf (1981), Section 14.10.
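The two adjustments named above can be compared numerically. The sketch below assumes the usual form of the Dunn-Šidák adjustment, $\alpha' = 1 - (1 - \alpha)^{1/k}$, which is not written out in this section; the function names are illustrative.

```python
def bonferroni(alpha, k):
    """Bonferroni per-comparison level: alpha / k."""
    return alpha / k


def dunn_sidak(alpha, k):
    """Dunn-Sidak per-comparison level (assumed form): 1 - (1 - alpha)^(1/k)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


for k in (5, 6):
    print(k, round(bonferroni(0.05, k), 4), round(dunn_sidak(0.05, k), 4))
```

For any $k > 1$ the Dunn-Šidák level is slightly larger than the Bonferroni level, which is why it is the less conservative of the two.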
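8.6 Comparisons among means: Unplanned comparisons

A single-classification anova is said to be significant if

$$\frac{MS_{\text{groups}}}{MS_{\text{within}}} > F_{\alpha[a-1,\,a(n-1)]} \qquad (8.5)$$

Since $MS_{\text{groups}}/MS_{\text{within}} = SS_{\text{groups}}/[(a-1)\,MS_{\text{within}}]$, we can rewrite Expression (8.5) as

$$SS_{\text{groups}} > (a-1)\,MS_{\text{within}}\,F_{\alpha[a-1,\,a(n-1)]} \qquad (8.6)$$

For example, in Box 8.1, where the anova is significant, $SS_{\text{groups}} = 1077.32$. Substituting into Expression (8.6), we obtain

$$1077.32 > (5-1)(5.46)(2.58) = 56.35 \qquad \text{for } \alpha = 0.05$$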
It is therefore possible to compute a critical SS value for a test of significance of an anova. Thus, another way of calculating overall significance would be to see whether the $SS_{\text{groups}}$ is greater than this critical SS. It is of interest to investigate why the $SS_{\text{groups}}$ is as large as it is and to test for the significance of the various contributions made to this SS by differences among the sample means. This was discussed in the previous section, where separate sums of squares were computed based on comparisons among means planned before the data were examined. A comparison was called significant if its $F_s$ ratio was $> F_{\alpha[k-1,\,a(n-1)]}$, where $k$ is the number of means being compared. We can now also state this in terms of sums of squares: An SS is significant if it is greater than $(k-1)\,MS_{\text{within}}\,F_{\alpha[k-1,\,a(n-1)]}$.

The above tests were a priori comparisons. One procedure for testing a posteriori comparisons would be to set $k = a$ in this last formula, no matter how many means we compare; thus the critical value of the SS will be larger than in the previous method, making it more difficult to demonstrate the significance of a sample SS. Setting $k = a$ allows for the fact that we choose for testing those differences between group means that appear to be contributing substantially to the significance of the overall anova.

For an example, let us return to the effects of sugars on growth in pea sections (Box 8.1). We write down the means in ascending order of magnitude: 58.0 (glucose + fructose), 58.2 (fructose), 59.3 (glucose), 64.1 (sucrose), 70.1 (control). We notice that the first three treatments have quite similar means and suspect that they do not differ significantly among themselves and hence do not contribute substantially to the significance of the $SS_{\text{groups}}$. To test this, we compute the SS among these three means by the usual formula:
$$SS = \frac{(593)^2 + (582)^2 + (580)^2}{10} - \frac{(593 + 582 + 580)^2}{30} = 102{,}677.3 - 102{,}667.5 = 9.8$$

The differences among these means are not significant, because this SS is less than the critical SS (56.35) calculated above. The sucrose mean looks suspiciously different from the means of the other sugars. To test this we compute

$$SS = \frac{(641)^2}{10} + \frac{(593 + 582 + 580)^2}{30} - \frac{(641 + 593 + 582 + 580)^2}{40} = 41{,}088.1 + 102{,}667.5 - 143{,}520.4 = 235.2$$

which is greater than the critical SS. We conclude, therefore, that sucrose retards growth significantly less than the other sugars tested. We may continue in this fashion, testing all the differences that look suspicious or even testing all possible sets of means, considering them 2, 3, 4, and 5 at a time. This latter approach may require a computer if there are more than 5 means to be compared, since there are very many possible tests that could be made. This procedure was proposed by Gabriel (1964), who called it a sum of squares simultaneous test procedure (SS-STP).
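The SS-STP comparisons described above reduce to computing one critical SS and comparing each subset SS with it. The following sketch reproduces the numbers of this section; the function names are illustrative, and the critical $F$ value 2.58 is simply copied from the text rather than computed.

```python
def critical_ss(k, ms_within, f_crit):
    """Critical sum of squares (k - 1) * MS_within * F, as in Expression (8.6)."""
    return (k - 1) * ms_within * f_crit


def ss_among_sets(set_totals, set_sizes):
    """SS among sets of means: sum(T^2 / n) - (sum T)^2 / (sum n)."""
    grand_total = sum(set_totals)
    grand_n = sum(set_sizes)
    return sum(t * t / n for t, n in zip(set_totals, set_sizes)) - grand_total ** 2 / grand_n


# k = a = 5 treatments, MS_within = 5.46, tabled F = 2.58 (all quoted in the text).
crit = critical_ss(5, 5.46, 2.58)

# Glucose, fructose, and glucose + fructose (treatment totals, 10 items each).
ss_three = ss_among_sets([593, 582, 580], [10, 10, 10])

# Sucrose (total 641, n = 10) against the three other sugars pooled (n = 30).
ss_sucrose = ss_among_sets([641, 593 + 582 + 580], [10, 30])

for label, ss in (("three sugars", ss_three), ("sucrose vs. others", ss_sucrose)):
    verdict = "significant" if ss > crit else "not significant"
    print(f"{label}: SS = {ss:.1f}, critical SS = {crit:.2f}, {verdict}")
```

Because the same critical SS is reused for every subset, adding further comparisons needs no new critical value, only another call to `ss_among_sets`.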
In the SS-STP and in the original anova, the chance of making any type I error at all is $\alpha$, the probability selected for the critical $F$ value from Table V. By "making any type I error at all" we mean making such an error in the overall test of significance of the anova and in any of the subsidiary comparisons among means or sets of means needed to complete the analysis of the experiment. This probability $\alpha$ therefore is an experimentwise error rate. Note that though the probability of any error at all is $\alpha$, the probability of error for any particular test of some subset, such as a test of the difference among three or between two means, will always be less than $\alpha$. Thus, for the test of each subset one is really using a significance level $\alpha'$, which may be much less than the experimentwise $\alpha$, and if there are many means in the anova, this actual error rate $\alpha'$ may be one-tenth, one one-hundredth, or even one one-thousandth of the experimentwise $\alpha$ (Gabriel, 1964). For this reason, the unplanned tests discussed above and the overall anova are not very sensitive to differences between individual means or differences within small subsets. Obviously, not many differences are going to be considered significant if $\alpha'$ is minute. This is the price we pay for not planning our comparisons before we examine the data: if we were to make planned tests, the error rate of each would be greater, hence less conservative.

The SS-STP procedure is only one of numerous techniques for multiple unplanned comparisons. It is the most conservative, since it allows a large number of possible comparisons. Differences shown to be significant by this method can be reliably reported as significant differences. However, more sensitive and powerful comparisons exist when the number of possible comparisons is circumscribed by the user. This is a complex subject, to which a more complete introduction is given in Sokal and Rohlf (1981), Section 9.7.

Exercises

8.1 The following is an example with easy numbers to help you become familiar with the analysis of variance. A plant ecologist wishes to test the hypothesis that the height of plant species X depends on the type of soil it grows in.
He has measured the height of three plants in each of four plots representing different soil types, all four plots being contained in an area of two miles square. His results are tabulated below. (Height is given in centimeters.) Does your analysis support this hypothesis? ANS. Yes, since $F_s = 6.951$ is larger than $F_{0.05[3,8]} = 4.07$.

Observation        Localities
number           1     2     3     4
1               15    25    17    10
2                9    21    23    13
3               14    19    20    16
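As a check on the tabulated answer, the single-classification anova of Exercise 8.1 can be sketched in a few lines. The function name `one_way_anova` and the dictionary layout are illustrative choices; the data are the four localities of the table above, read column by column.

```python
def one_way_anova(groups):
    """Single-classification anova from raw data (one list of variates per group)."""
    a = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_total = sum(sum(g) for g in groups)
    ct = grand_total ** 2 / n_total                       # correction term
    ss_total = sum(y * y for g in groups for y in g) - ct
    ss_groups = sum(sum(g) ** 2 / len(g) for g in groups) - ct
    ss_within = ss_total - ss_groups
    df_groups, df_within = a - 1, n_total - a
    ms_groups = ss_groups / df_groups
    ms_within = ss_within / df_within
    return {
        "df": (df_groups, df_within),
        "SS": (ss_groups, ss_within),
        "MS": (ms_groups, ms_within),
        "F": ms_groups / ms_within,
    }


# Plant heights (cm) in localities 1 to 4, three plants each.
heights = [[15, 9, 14], [25, 21, 19], [17, 23, 20], [10, 13, 16]]
result = one_way_anova(heights)
print(result["df"], round(result["F"], 2))
```

The computed ratio agrees with the quoted $F_s = 6.951$ to within rounding and exceeds the tabled $F_{0.05[3,8]} = 4.07$.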
8.2 The following are measurements (in coded micrometer units) of the thorax length of the aphid Pemphigus populitransversus. The aphids were collected in 28 galls on the cottonwood Populus deltoides. Four alate (winged) aphids were randomly selected from each gall and measured. The alate aphids of each gall are isogenic (identical twins), being descended parthenogenetically from one stem mother. Thus, any variance within galls can be due to environment only. Variance between galls may be due to differences in genotype and also to environmental differences between galls. If this character, thorax length, is affected by genetic variation, significant intergall variance must be present. The converse is not necessarily true: significant variance between galls need not indicate genetic variation; it could as well be due to environmental differences between galls (data by Sokal, 1952). Analyze the variance of thorax length. Is there significant intergall variance present? Give estimates of the added component of intergall variance, if present. What percentage of the variance is controlled by intragall and what percentage by intergall factors? Discuss your results.
Gall no.                              Gall no.
 1.   6.1, 6.0, 5.7, 6.0              15.   6.3, 6.5, 6.1, 6.3
 2.   6.2, 5.1, 6.1, 5.3              16.   5.9, 6.1, 6.1, 6.0
 3.   6.2, 6.2, 5.3, 6.3              17.   5.8, 6.0, 5.9, 5.7
 4.   5.1, 6.0, 5.8, 5.9              18.   6.5, 6.3, 6.5, 7.0
 5.   4.4, 4.9, 4.7, 4.8              19.   5.9, 5.2, 5.7, 5.7
 6.   5.7, 5.1, 5.8, 5.5              20.   5.2, 5.3, 5.4, 5.3
 7.   6.3, 6.6, 6.4, 6.3              21.   5.4, 5.5, 5.2, 6.3
 8.   4.5, 4.5, 4.0, 3.7              22.   4.3, 4.7, 4.5, 4.4
 9.   6.3, 6.2, 5.9, 6.2              23.   6.0, 5.8, 5.7, 5.9
10.   5.4, 5.3, 5.0, 5.3              24.   5.5, 6.1, 5.5, 6.1
11.   5.9, 5.8, 6.3, 5.7              25.   4.0, 4.2, 4.3, 4.4
12.   5.9, 5.9, 5.5, 5.5              26.   5.8, 5.6, 5.6, 6.1
13.   5.8, 5.9, 5.4, 5.5              27.   4.3, 4.0, 4.4, 4.6
14.   5.6, 6.4, 6.4, 6.1              28.   6.1, 6.0, 5.6, 6.5
8.3 Millis and Seng (1954) published a study on the relation of birth order to the birth weights of infants. The data below on first-born and eighth-born infants are extracted from a table of birth weights of male infants of Chinese third-class patients at the Kandang Kerbau Maternity Hospital in Singapore in 1950 and 1951.

Birth order    1    8
3:0 3:8 4:0 4:8 5:0 5:8 6:0 6:8 7:0 7:8 8:0 8:8 9:0 9:8 10:0 10:8
Which birth order appears to be accompanied by heavier infants? Is this difference significant? Can you conclude that birth order causes differences in birth weight? (Computational note: The variable should be coded as simply as possible.) Reanalyze, using the $t$ test, and verify that $t_s^2 = F_s$. ANS. $t_s = 11.016$ and $F_s = 121.352$.

8.4 The following cytochrome oxidase assessments of male Periplaneta roaches in cubic millimeters per ten minutes per milligram were taken from a larger study
$\bar{Y}$      $s_{\bar{Y}}$
24.8     0.9
19.7     1.4
Are the two means significantly different?

8.5 P. E. Hunter (1959, detailed data unpublished) selected two strains of D. melanogaster, one for short larval period (SL) and one for long larval period (LL). A nonselected control strain (CS) was also maintained. At generation 42 these data were obtained for the larval period (measured in hours). Analyze and interpret.
Strain               SL      CS      LL
$n_i$                80      69      33
$\sum^{n_i} Y$       8070    7291    3640

$\sum^{a}\sum^{n_i} Y^2 = 1{,}994{,}650$
Note that part of the computation has already been performed for you. Perform unplanned tests among the three means (short vs. long larval periods and each against the control). Set 95% confidence limits to the observed differences of means for which these comparisons are made. ANS. $MS_{(\text{SL vs. LL})} = 2076.6697$.

8.6 These data are measurements of five random samples of domestic pigeons collected during January, February, and March in Chicago in 1955. The variable is the length from the anterior end of the narial opening to the tip of the bony beak and is recorded in millimeters. Data from Olson and Miller (1958).
Samples
5.1 5.5 5.9 6.1 5.2 5.0 5.9 5.0 4.9 5.3 5.3 5.1 4.9 5.8 5.0 5.6 6.1 5.1 4.8 4.9
5.4 5.3 5.2 4.5 5.0 5.4 3.8 5.9 5.4 5.1 5.4 4.1 5.2 4.8 4.6 5.7 5.9 5.8 5.0 5.0
5.2 5.1 4.7 5.0 5.9 5.3 6.0 5.2 6.6 5.6 5.1 5.7 5.1 4.7 6.5 5.1 5.4 5.8 5.8 5.9
5.5 4.7 4.8 4.9 5.9 5.2 4.8 4.9 6.4 5.1 5.1 4.5 5.3 4.8 5.3 5.4 4.9 4.7 4.8 5.0
5.1 4.6 5.4 5.5 5.2 5.0 4.8 5.1 4.4 6.5 4.8 4.9 6.0 4.8 5.7 5.5 5.8 5.6 5.5 5.0
8.7 The following data were taken from a study of blood protein variations in deer (Cowan and Johnston, 1962). The variable is the mobility of serum protein fraction II expressed as $10^{-5}$ cm$^2$/volt-seconds.
                                $\bar{Y}$
Sitka                           2.8
California blacktail            2.5
Vancouver Island blacktail      2.9
Mule deer                       2.5
Whitetail                       2.8
$n = 12$ for each mean. Perform an analysis of variance and a multiple-comparison test, using the sums of squares STP procedure. ANS. $MS_{\text{within}} = 0.0416$; maximal nonsignificant sets (at $P = 0.05$) are samples 1, 3, 5 and 2, 4 (numbered in the order given).

8.8 For the data from Exercise 7.3 use the Bonferroni method to test for differences between the following 5 pairs of treatment means:

A, B
A, C
A, D
A, (B + C + D)/3
B, (C + D)/2
CHAPTER 9

Two-Way Analysis of Variance

From the single-classification anova of Chapter 8 we progress to the two-way anova of the present chapter by a single logical step. Individual items may be grouped into classes representing the different possible combinations of two treatments or factors. Thus, the housefly wing lengths studied in earlier chapters, which yielded samples representing different medium formulations, might also be divided into males and females. Suppose we wanted to know not only whether medium 1 induced a different wing length than medium 2 but also whether male houseflies differed in wing length from females. Obviously, each combination of factors should be represented by a sample of flies. Thus, for seven media and two sexes we need at least $7 \times 2 = 14$ samples. Similarly, the experiment testing five sugar treatments on pea sections (Box 8.1) might have been carried out at three different temperatures. This would have resulted in a two-way analysis of variance of the effects of sugars as well as of temperatures.

It is the assumption of this two-way method of anova that a given temperature and a given sugar each contribute a certain amount to the growth of a pea section, and that these two contributions add their effects without influencing each other. In Section 9.1 we shall see how departures from the assumption are measured; we shall also consider the expression for decomposing variates in a two-way anova.

The two factors in the present design may represent either Model I or Model II effects or one of each, in which case we talk of a mixed model.

The computation of a two-way anova for replicated subclasses (more than one variate per subclass or factor combination) is shown in Section 9.1, which also contains a discussion of the meaning of interaction as used in statistics. Significance testing in a two-way anova is the subject of Section 9.2. This is followed by Section 9.3, on two-way anova without replication, or with only a single variate per subclass. The well-known method of paired comparisons is a special case of a two-way anova without replication.

We will now proceed to illustrate the computation of a two-way anova. You will obtain closer insight into the structure of this design as we explain the computations.
9.1 Two-way anova with replication

We illustrate the computation of a two-way anova in a study of oxygen consumption by two species of limpets at three concentrations of seawater. Eight replicate readings were obtained for each combination of species and seawater concentration. We have continued to call the number of columns $a$ and are calling the number of rows $b$. The sample size for each cell (row and column combination) of the table is $n$. The cells are also called subgroups or subclasses.

The data are featured in Box 9.1. The computational steps labeled Preliminary computations provide an efficient procedure for the analysis of variance, but we shall undertake several digressions to ensure that the concepts underlying this design are appreciated by the reader. We commence by considering the six subclasses as though they were six groups in a single-classification anova. Each subgroup or subclass represents eight oxygen consumption readings. If we had no further classification of these six subgroups by species or salinity, such an anova would test whether there was any variation among the six subgroups over and above the variance within the subgroups. But since we have the subdivision by species and salinity, our only purpose here is to compute some quantities necessary for the further analysis. Steps 1 through 3 in Box 9.1 correspond to the identical steps in Box 8.1, although the symbolism has changed slightly, since in place of $a$ groups we now have $ab$ subgroups. To complete the anova, we need a correction term, which is labeled step 6 in Box 9.1. From these quantities we obtain $SS_{\text{total}}$ and $SS_{\text{within}}$ in steps 7, 8, and 12, corresponding to steps 5, 6, and 7 in the layout of Box 8.1. The results of this preliminary anova are featured in Table 9.1.

The computation is continued by finding the sums of squares for rows and columns of the table. This is done by the general formula stated at the end of Section 8.1. Thus, for columns, we square the column sums, sum the resulting squares, and divide the result by 24, the number of items per column. This is step 4 in Box 9.1. A similar quantity is computed for rows (step 5). From these
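BOX 9.1

[Data table not recoverable: oxygen consumption readings of two species of limpets, Acmaea scabra and A. digitalis, at three seawater concentrations, with $n = 8$ readings per subgroup.]

Preliminary computations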
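1. Grand total $= \sum^{a}\sum^{b}\sum^{n} Y = 461.74$

2. Sum of the squared observations $= \sum^{a}\sum^{b}\sum^{n} Y^2 = \cdots + (12.30)^2 = 5065.1530$

3. Sum of the squared subgroup (cell) totals, divided by the sample size of the subgroups

$$= \frac{\sum^{a}\sum^{b}\left(\sum^{n} Y\right)^2}{n} = \frac{(84.49)^2 + \cdots + (98.61)^2}{8} = 4663.6317$$

4. Sum of the squared column totals divided by the sample size of a column

$$= \frac{\sum^{a}\left(\sum^{b}\sum^{n} Y\right)^2}{bn} = 4458.3844$$

5. Sum of the squared row totals divided by the sample size of a row

$$= \frac{\sum^{b}\left(\sum^{a}\sum^{n} Y\right)^2}{an} = \frac{(143.92)^2 + (121.82)^2 + (196.00)^2}{2 \times 8} = 4623.0674$$

6. Grand total squared and divided by the total sample size = correction term CT

$$= \frac{\left(\sum^{a}\sum^{b}\sum^{n} Y\right)^2}{abn} = \frac{(461.74)^2}{2 \times 3 \times 8} = 4441.7464$$

7. $SS_{\text{total}} = \sum^{a}\sum^{b}\sum^{n} Y^2 - \text{CT}$ = quantity 2 - quantity 6 = 5065.1530 - 4441.7464 = 623.4066

8. $SS_{\text{subgr}}$ = quantity 3 - quantity 6 = 4663.6317 - 4441.7464 = 221.8853

9. $SS_A$ (SS of columns) = quantity 4 - quantity 6 = 4458.3844 - 4441.7464 = 16.6380

10. $SS_B$ (SS of rows) = quantity 5 - quantity 6 = 4623.0674 - 4441.7464 = 181.3210

11. $SS_{A \times B}$ (interaction SS) = $SS_{\text{subgr}} - SS_A - SS_B$ = quantity 8 - quantity 9 - quantity 10 = 221.8853 - 16.6380 - 181.3210 = 23.9263

12. $SS_{\text{within}}$ (within subgroups; error SS) = $SS_{\text{total}} - SS_{\text{subgr}}$ = quantity 7 - quantity 8 = 623.4066 - 221.8853 = 401.5213

As a check on your computations, ascertain that the following relations hold for some of the above quantities: $2 \ge 3 \ge 4 \ge 6$; $3 \ge 5 \ge 6$.

Explicit formulas for these sums of squares suitable for computer programs are as follows:

9a. $SS_A = nb \sum^{a} (\bar{Y}_A - \bar{\bar{Y}})^2$

10a. $SS_B = na \sum^{b} (\bar{Y}_B - \bar{\bar{Y}})^2$

11a. $SS_{A \times B} = n \sum^{a}\sum^{b} (\bar{Y} - \bar{Y}_A - \bar{Y}_B + \bar{\bar{Y}})^2$

12a. $SS_{\text{within}} = \sum^{a}\sum^{b}\sum^{n} (Y - \bar{Y})^2$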
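Source of variation                                                    df                 SS    MS                         Expected MS (Model I)
$\bar{Y}_A - \bar{\bar{Y}}$   A (columns)                              $a - 1$            9     $9/(a-1)$                  $\sigma^2 + \frac{nb}{a-1}\sum^{a}\alpha^2$
$\bar{Y}_B - \bar{\bar{Y}}$   B (rows)                                 $b - 1$            10    $10/(b-1)$                 $\sigma^2 + \frac{na}{b-1}\sum^{b}\beta^2$
$\bar{Y} - \bar{Y}_A - \bar{Y}_B + \bar{\bar{Y}}$   A × B (interaction)   $(a-1)(b-1)$       11    $11/[(a-1)(b-1)]$          $\sigma^2 + \frac{n}{(a-1)(b-1)}\sum^{a}\sum^{b}(\alpha\beta)^2$
$Y - \bar{Y}$   Within subgroups                                       $ab(n-1)$          12    $12/[ab(n-1)]$             $\sigma^2$
$Y - \bar{\bar{Y}}$   Total                                            $abn - 1$          7

(The numbers in the SS and MS columns refer to the quantities computed in the steps above.)

Since the present example is a Model I anova for both factors, the expected MS above are correct. The corresponding expressions for Model II are:

Source of variation       Model II
A                         $\sigma^2 + n\sigma^2_{AB} + nb\sigma^2_A$
B                         $\sigma^2 + n\sigma^2_{AB} + na\sigma^2_B$
A × B                     $\sigma^2 + n\sigma^2_{AB}$
Within subgroups          $\sigma^2$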
Anova table

Source of variation             df    SS          MS         $F_s$
A (columns; species)             1    16.6380     16.6380    1.740 ns
B (rows; salinities)             2    181.3210    90.6605    9.483**
A × B (interaction)              2    23.9263     11.9632    1.251 ns
Within subgroups (error)        42    401.5213    9.5600
Total                           47    623.4066

$F_{0.05[1,42]} = 4.07$     $F_{0.05[2,42]} = 3.22$     $F_{0.01[2,42]} = 5.15$

Since this is a Model I anova, all mean squares are tested over the error MS. For a discussion of significance tests, see Section 9.2.

Conclusions. Oxygen consumption does not differ significantly between the two species of limpets but differs with the salinity. At 50% seawater, the O$_2$ consumption is increased. Salinity appears to affect the two species equally, for there is insufficient evidence of a species × salinity interaction.
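The subtraction steps of Box 9.1 can be sketched as a small function that starts from the five preliminary quantities. The function name and the dictionary keys are illustrative. The values of quantities 2, 3, 5, and 6 are those of the box; the value used for quantity 4 is an assumption in the sense that it is taken as the sum of the quoted column SS (16.6380) and the correction term.

```python
def two_way_anova(q2, q3, q4, q5, q6, a, b, n):
    """Two-way anova with replication from the preliminary quantities of Box 9.1.

    q2: sum of squared observations      q3: squared subgroup totals / n
    q4: squared column totals / (b * n)  q5: squared row totals / (a * n)
    q6: correction term
    """
    ss = {"A": q4 - q6, "B": q5 - q6}
    ss_subgr = q3 - q6
    ss["AxB"] = ss_subgr - ss["A"] - ss["B"]      # interaction by subtraction
    ss["within"] = (q2 - q6) - ss_subgr           # total SS minus subgroup SS
    df = {"A": a - 1, "B": b - 1, "AxB": (a - 1) * (b - 1), "within": a * b * (n - 1)}
    ms = {key: ss[key] / df[key] for key in df}
    # Model I: every mean square is tested over the error mean square.
    f_s = {key: ms[key] / ms["within"] for key in ("A", "B", "AxB")}
    return ss, df, ms, f_s


ss, df, ms, f_s = two_way_anova(
    q2=5065.1530, q3=4663.6317,
    q4=4458.3844,               # assumed: SS_A + CT = 16.6380 + 4441.7464
    q5=4623.0674, q6=4441.7464,
    a=2, b=3, n=8,
)
for key in ("A", "B", "AxB"):
    print(key, df[key], round(ss[key], 4), round(ms[key], 4), round(f_s[key], 3))
print("within", df["within"], round(ss["within"], 4), round(ms["within"], 4))
```

Only the ratio for rows (salinities) exceeds the tabled critical values listed in the box, in line with the conclusions stated there.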
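TABLE 9.1
Preliminary anova of the six subgroups (data from Box 9.1).

Source of variation                              df                  SS          MS
$\bar{Y} - \bar{\bar{Y}}$   Among subgroups       5   ($ab - 1$)      221.8853    44.377**
$Y - \bar{Y}$   Within subgroups                 42   ($ab(n-1)$)     401.5213    9.560
$Y - \bar{\bar{Y}}$   Total                      47   ($abn - 1$)     623.4066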
quotients we subtract the correction term, computed as quantity 6. These subtractions are carried out as steps 9 and 10, respectively. Since the rows and columns are based on equal sample sizes, we do not have to obtain a separate quotient for the square of each row or column sum but carry out a single division after accumulating the squares of the sums.

Let us return for a moment to the preliminary analysis of variance in Table 9.1, which divided the total sum of squares into two parts: the sum of squares among the six subgroups; and that within the subgroups, the error sum of squares. The new sums of squares pertaining to row and column effects clearly are not part of the error, but must contribute to the differences that comprise the sum of squares among the six subgroups. We therefore subtract row and column SS from the subgroup SS. The latter is 221.8853. The row SS is 181.3210, and the column SS is 16.6380. Together they add up to 197.9590, almost but not quite the value of the subgroup sum of squares. The difference represents a third sum of squares, called the interaction sum of squares, whose value in this case is 23.9263.

We shall discuss the meaning of this new sum of squares presently. At the moment let us say only that it is almost always present (but not necessarily significant) and generally that it need not be independently computed but may be obtained as illustrated above by the subtraction of the row SS and the column SS from the subgroup SS. This procedure is shown graphically in Figure 9.1, which illustrates the decomposition of the total sum of squares into the subgroup SS and error SS. The former is subdivided into the row SS, column SS, and interaction SS. The relative magnitudes of these sums of squares will differ from experiment to experiment. In Figure 9.1 they are not shown proportional to their actual values in the limpet experiment; otherwise the area representing the row SS would have to be about 11 times that allotted to the column SS. Before we can intelligently test for significance in this anova we must understand the meaning of interaction. We can best explain interaction in a two-way anova by means of an artificial illustration based on the limpet data we have just studied. If we interchange the readings for 75% and 50% for A. digitalis only, we obtain the data table shown in Table 9.2. Only the sums of the subgroups, rows, and columns are shown. We complete the analysis of variance in the manner presented above and note the results at the foot of Table 9.2. The total and error SS are the same as before (Table 9.1). This should not be
FIGURE 9.1
Diagrammatic representation of the partitioning of the total sums of squares in a two-way orthogonal anova. The areas of the subdivisions are not shown proportional to the magnitudes of the sums of squares. Labels in the diagram: Total SS = 623.4066; Subgroup SS = 221.8853; Row SS = 181.3210; Column SS = 16.6380; Interaction SS = 23.9263; Error SS = 401.5213.
surprising, since we are using the same data. All that we have done is to interchange the contents of the lower two cells in the right-hand column of the table. When we partition the subgroup SS, we do find some differences. We note that the SS between species (between columns) is unchanged. Since the change we made was within one column, the total for that column was not altered and consequently the column SS did not change. However, the sums
TABLE 9.2
An artificial example to illustrate the meaning of interaction. The readings for 75% and 50% seawater concentrations of Acmaea digitalis in Box 9.1 have been interchanged. Only subgroup and marginal totals are given below.

Seawater concentration      Species: A. scabra      A. digitalis
[subgroup and marginal totals not recoverable]

Completed anova
Source of variation      df      SS      MS
[five lines with df = 1, 2, 2, 42, and 47; SS and MS not recoverable]
of the second and third rows have been altered appreciably as a result of the interchange of the readings for 75% and 50% salinity in A. digitalis. The sum for 75% salinity is now very close to that for 50% salinity, and the difference between the salinities, previously quite marked, is now no longer so. By contrast, the interaction SS, obtained by subtracting the sums of squares of rows and columns from the subgroup SS, is now a large quantity. Remember that the subgroup SS is the same in the two examples. In the first example we subtracted sums of squares due to the effects of both species and salinities, leaving only a tiny residual representing the interaction. In the second example these two main effects (species and salinities) account only for little of the subgroup sum of squares, leaving the interaction sum of squares as a substantial residual. What is the essential difference between these two examples?

In Table 9.3 we have shown the subgroup and marginal means for the original data from Table 9.1 and for the altered data of Table 9.2. The original results are quite clear: at 75% salinity, oxygen consumption is lower than at the other two salinities, and this is true for both species. We note further that A. scabra consumes more oxygen than A. digitalis at two of the salinities. Thus our statements about differences due to species or to salinity can be made largely independent of each other. However, if we had to interpret the artificial data (lower half of Table 9.3), we would note that although A. scabra still consumes more oxygen than A. digitalis (since column sums have not changed), this difference depends greatly on the salinity. At 100% and 50%, A. scabra consumes considerably more oxygen than A. digitalis, but at 75% this relationship is reversed. Thus, we are no longer able to make an unequivocal statement about the amount of oxygen taken up by the two species. We have to qualify our statement by the seawater concentration at which they are kept. At 100%
TABLE 9.3
Comparison of means of the data in Box 9.1 and Table 9.2.

Seawater concentration      Species: A. scabra      A. digitalis      Mean
[means not recoverable]
and 50%, $\bar{Y}_{\text{scabra}} > \bar{Y}_{\text{digitalis}}$, but at 75%, $\bar{Y}_{\text{scabra}} < \bar{Y}_{\text{digitalis}}$. If we examine the effects of salinity in the artificial example, we notice a mild increase in oxygen consumption at 75%. However, again we have to qualify this statement by the species of the consuming limpet; scabra consumes least at 75%, while digitalis consumes most at this concentration.

This dependence of the effect of one factor on the level of another factor is called interaction. It is a common and fundamental scientific idea. It indicates that the effects of the two factors are not simply additive but that any given combination of levels of factors, such as salinity combined with any one species, contributes a positive or negative increment to the level of expression of the variable. In common biological terminology a large positive increment of this sort is called synergism. When drugs act synergistically, the result of the interaction of the two drugs may be above and beyond the sum of the separate effects of each drug. When levels of two factors in combination inhibit each other's effects, we call it interference. (Note that "levels" in anova is customarily used in a loose sense to include not only continuous factors, such as the salinity in the present example, but also qualitative factors, such as the two species of limpets.) Synergism and interference will both tend to magnify the interaction SS.
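Non-additivity of this kind can be made concrete by measuring how far each subgroup mean departs from the value predicted by its row and column means alone. The sketch below uses small hypothetical tables of subgroup means (not the limpet data, whose subgroup means are not reproduced here) and assumes equal sample sizes $n$ in all subgroups; the function name is illustrative.

```python
def interaction_ss(cell_means, n):
    """Interaction SS = n * sum over cells of (cell - row mean - column mean + grand mean)^2.

    cell_means is a list of rows; equal subgroup sizes n are assumed.
    """
    rows, cols = len(cell_means), len(cell_means[0])
    grand = sum(sum(row) for row in cell_means) / (rows * cols)
    row_means = [sum(row) / cols for row in cell_means]
    col_means = [sum(cell_means[i][j] for i in range(rows)) / rows for j in range(cols)]
    return n * sum(
        (cell_means[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(rows)
        for j in range(cols)
    )


# Hypothetical subgroup means: 3 rows (levels of one factor) by 2 columns.
additive = [[10.0, 8.0], [7.0, 5.0], [12.0, 10.0]]   # column difference is 2 in every row
swapped = [[10.0, 8.0], [7.0, 10.0], [12.0, 5.0]]    # lower two cells of column 2 interchanged

print(interaction_ss(additive, n=8))   # purely additive effects
print(interaction_ss(swapped, n=8))    # same column means, strong interaction
```

Interchanging two cells within one column leaves the column means unchanged, as in the artificial limpet example, yet the interaction SS goes from zero to a large value.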
Testing for interaction is an important procedure in analysis of variance. If the artificial data of Table 9.2 were real, it would be of little value to state that 75% salinity led to slightly greater consumption of oxygen. This statement would cover up the important differences in the data, which are that scabra consumes least at this concentration, while digitalis consumes most.

We are now able to write an expression symbolizing the decomposition of a single variate in a two-way analysis of variance in the manner of Expression (7.2) for single-classification anova. The expression below assumes that both factors represent fixed treatment effects, Model I. This would seem reasonable, since species as well as salinity are fixed treatments. Variate $Y_{ijk}$ is the $k$th item in the subgroup representing the $i$th group of treatment A and the $j$th group of treatment B. It is decomposed as follows:
Yijk
= / < + , + / i , + (=r/i),7 +
(9.1)
where equals the p a r a m e t r i c mean of the p o p u l a t i o n , is the fixed treatment effect for the ;th g r o u p of treatment , , is the fixed treatment effect of the /th g r o u p of t r e a t m e n t , (of/0,, is the interaction effect in the s u b g r o u p representing the /th g r o u p of factor A a n d the /lh g r o u p of factor B, and t,jk is the e r r o r term of the fctli item in s u b g r o u p ij. We m a k e the usual a s s u m p t i o n that ej;Jl is normally distributed with a mean of 0 and a variance of a 2 . If one or both of the factors represent Model II effects, we replace the a, a n d / o r ftj in Ihe f o r m u l a by A, a n d / ,. In previous c h a p t e r s we have seen that each sum of s q u a r e s represents a sum of s q u a r e d deviations. W h a t actual deviations does an interaction SS represent? Wc can see this easily by referring back to t h e j u i o v a s of T a b l e 9.1. T h e variation a m o n g s u b g r o u p s is represented by ( F V), where V s t a n d s for the
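$$(\bar{Y} - \bar{\bar{Y}}) - (\bar{R} - \bar{\bar{Y}}) - (\bar{C} - \bar{\bar{Y}}) = \bar{Y} - \bar{\bar{Y}} - \bar{R} + \bar{\bar{Y}} - \bar{C} + \bar{\bar{Y}} = \bar{Y} - \bar{R} - \bar{C} + \bar{\bar{Y}}$$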
This somewhat involved expression is the deviation due to interaction. When we evaluate one such expression for each subgroup, square it, sum the squares, and multiply the sum by n, we obtain the interaction SS. This partition of the deviations also holds for their squares. This is so because the sums of the products of the separate terms cancel out. A simple method for revealing the nature of the interaction present in the data is to inspect the means of the original data table. We can do this in Table 9.3. The original data, showing no interaction, yield the following pattern of relative magnitudes:
          Scabra    Digitalis
100%
            ∨           ∨
75%
            ∧           ∧
50%

The relative magnitudes of the means in the lower part of Table 9.3 can be summarized as follows:
          Scabra    Digitalis
100%
            ∨           ∧
75%
            ∧           ∨
50%

When the pattern of signs expressing relative magnitudes is not uniform as in this latter table, interaction is indicated. As long as the pattern of means is consistent, as in the former table, interaction may not be present. However, interaction is often present without change in the direction of the differences; sometimes only the relative magnitudes are affected. In any case, the statistical test needs to be performed to test whether the deviations are larger than can be expected from chance alone. In summary, when the effect of two treatments applied together cannot be predicted from the average responses of the separate factors, statisticians call this phenomenon interaction and test its significance by means of an interaction
mean square. This is a very common phenomenon. If we say that the effect of density on the fecundity or weight of a beetle depends on its genotype, we imply that a genotype × density interaction is present. If the success of several alternative surgical procedures depends on the nature of the postoperative treatment, we speak of a procedure × treatment interaction. Or if the effect of temperature on a metabolic process is independent of the effect of oxygen concentration, we say that temperature × oxygen interaction is absent. Significance testing in a two-way anova will be deferred until the next section. However, we should point out that the computational steps 4 and 9 of Box 9.1 could have been shortened by employing the simplified formula for a sum of squares between two groups, illustrated in Section 8.4. In an analysis with only two rows and two columns the interaction SS can be computed directly as

$$\frac{(\text{sum of one diagonal} - \text{sum of other diagonal})^2}{abn}$$
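As an illustrative sketch (not part of Box 9.1), the following Python code computes an interaction SS in two ways for a small 2 × 2 layout: from the interaction deviations (subgroup mean minus row mean minus column mean plus grand mean, squared, summed, and multiplied by n) and from the diagonal shortcut for two rows and two columns. The data values and the function names are hypothetical, and equal subgroup sizes are assumed.

```python
def interaction_ss(cells):
    """Interaction SS from the deviations (cell - row - column + grand mean).

    cells[i][j] is the list of n replicate values in row i, column j.
    Equal subgroup sizes are assumed.
    """
    r, c = len(cells), len(cells[0])
    n = len(cells[0][0])
    cell_mean = [[sum(cells[i][j]) / n for j in range(c)] for i in range(r)]
    row_mean = [sum(cell_mean[i]) / c for i in range(r)]
    col_mean = [sum(cell_mean[i][j] for i in range(r)) / r for j in range(c)]
    grand_mean = sum(row_mean) / r
    total = 0.0
    for i in range(r):
        for j in range(c):
            dev = cell_mean[i][j] - row_mean[i] - col_mean[j] + grand_mean
            total += dev ** 2
    return n * total


def interaction_ss_2x2(cells):
    """Shortcut for a 2 x 2 layout: (difference of diagonal sums)^2 / abn."""
    n = len(cells[0][0])
    t = [[sum(cells[i][j]) for j in range(2)] for i in range(2)]
    diagonal_difference = (t[0][0] + t[1][1]) - (t[0][1] + t[1][0])
    return diagonal_difference ** 2 / (2 * 2 * n)


# Hypothetical data: 2 rows x 2 columns, n = 3 items per subgroup.
example = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [1.0, 2.0, 3.0]],
]

print(interaction_ss(example))      # general formula
print(interaction_ss_2x2(example))  # 2 x 2 shortcut, same value
```

Both routes give the same quantity here; the shortcut applies only when a = b = 2.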
Error = $MS_{within}$.
When we do this in the example of Box 9.1, we find only factor B, salinity, significant. Neither factor A nor the interaction is significant. We conclude that the differences in oxygen consumption are induced by varying salinities (O2 consumption responds in a V-shaped manner), and there does not appear to be sufficient evidence for species differences in oxygen consumption. The tabulation of the relative magnitudes of the means in the previous section shows that the
pattern of signs in the two lines is identical. However, this may be misleading, since the mean of A. scabra is far higher at 100% seawater than at 75%, but that of A. digitalis is only very slightly higher. Although the oxygen consumption curves of the two species when graphed appear far from parallel (see Figure 9.2), this suggestion of a species × salinity interaction cannot be shown to be significant when compared with the within-subgroups variance. Finding a significant difference among salinities does not conclude the analysis. The data suggest that at 75% salinity there is a real reduction in oxygen consumption. Whether this is really so could be tested by the methods of Section 8.6. When we analyze the results of the artificial example in Table 9.2, we find only the interaction MS significant. Thus, we would conclude that the response to salinity differs in the two species. This is brought out by inspection of the data, which show that at 75% salinity A. scabra consumes least oxygen and A. digitalis consumes most. In the last (artificial) example the mean squares of the two factors (main effects) are not significant, in any case. However, many statisticians would not even test them once they found the interaction mean square to be significant, since in such a case an overall statement for each factor would have little meaning. A simple statement of response to salinity would be unclear.
The presence of interaction makes us qualify our statements: "The pattern of response to changes in salinity differed in the two species." We would consequently have to describe separate, nonparallel response curves for the two species. Occasionally, it becomes important to test for overall significance in a Model I anova in spite of the presence of interaction. We may wish to demonstrate the significance of the effect of a drug, regardless of its significant interaction with age of the patient. To support this contention, we might wish to test the mean square among drug concentrations (over the error MS), regardless of whether the interaction MS is significant.
FIGURE 9.2 Oxygen consumption by two species of limpets (horizontal axis: % seawater, with ticks at 50, 75, and 100).
Box 9.1 also lists expected mean squares for a Model II anova and a mixed-model two-way anova. Here, variance components for columns (factor A), for rows (factor B), and for interaction make their appearance, and they are designated $\sigma^2_A$, $\sigma^2_B$, and $\sigma^2_{AB}$, respectively. In the Model II anova note that the two main effects contain the variance component of the interaction as well as their own variance component. In a Model II anova we first test (A × B)/Error. If the interaction is significant, we continue testing A/(A × B) and B/(A × B). But when A × B is not significant, some authors suggest computation of a pooled error MS = $(SS_{A \times B} + SS_{within})/(df_{A \times B} + df_{within})$ to test the significance of the main effects. The conservative position is to continue to test the main effects over the interaction MS, and we shall follow this procedure in this book. Only one type of mixed model is shown in Box 9.1, in which factor A is assumed to be fixed and factor B to be random. If the situation is reversed, the expected mean squares change accordingly. In the mixed model, it is the mean square representing the fixed treatment that carries with it the variance component of the interaction, while the mean square representing the random factor contains only the error variance and its own variance component and does not include the interaction component. We therefore test the MS of the random main effect over the error, but test the fixed treatment MS over the interaction.
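The choice of denominator for each F ratio can be summarized in a short sketch. The Python code below is only an illustration of the testing rules just described (Model I: all mean squares over the error; Model II: main effects over the interaction; mixed model with A fixed and B random: A over the interaction, B over the error). The function names, the model labels, and the numerical mean squares are hypothetical.

```python
def f_ratios(ms_a, ms_b, ms_ab, ms_within, model):
    """Return F ratios for factor A, factor B, and the A x B interaction.

    model is "I" (both factors fixed), "II" (both random), or
    "mixed" (A fixed, B random, as in the text).
    """
    if model == "I":
        denom_a = denom_b = ms_within
    elif model == "II":
        denom_a = denom_b = ms_ab
    elif model == "mixed":
        denom_a = ms_ab       # fixed treatment tested over the interaction
        denom_b = ms_within   # random factor tested over the error
    else:
        raise ValueError("model must be 'I', 'II', or 'mixed'")
    return {
        "A": ms_a / denom_a,
        "B": ms_b / denom_b,
        "AxB": ms_ab / ms_within,  # interaction is always tested over the error
    }


def pooled_error_ms(ss_ab, df_ab, ss_within, df_within):
    """Pooled error MS suggested by some authors when A x B is not significant."""
    return (ss_ab + ss_within) / (df_ab + df_within)


# Hypothetical mean squares, for illustration only.
print(f_ratios(12.0, 8.0, 4.0, 2.0, "I"))
print(f_ratios(12.0, 8.0, 4.0, 2.0, "II"))
print(f_ratios(12.0, 8.0, 4.0, 2.0, "mixed"))
print(pooled_error_ms(8.0, 2, 24.0, 12))
```

The pooled error MS is included only because the text mentions it; the procedure followed in this book keeps testing the main effects over the interaction MS.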
Factor A: Time (a = 3): before alcohol ingestion; immediately after ingestion; 12 hours later
Factor B: Individuals (b = 8): 1 2 3 4 5 6 7 8

The eight sets of three readings are treated as replications (blocks) in this analysis. Time is a fixed treatment effect, while differences between individuals are considered to be random effects. Hence, this is a mixed-model anova.
Preliminary computations

1. Grand total = $\sum^a \sum^b Y = 413.40$

4. Sum of squared row totals divided by sample size of a row = $\dfrac{\sum^b \left(\sum^a Y\right)^2}{a}$

5. Grand total squared and divided by the total sample size = correction term CT = $\dfrac{\left(\sum^a \sum^b Y\right)^2}{ab} = \dfrac{(\text{quantity 1})^2}{ab} = \dfrac{(413.40)^2}{24} = 7120.8150$

6. $SS_{total} = \sum^a \sum^b Y^2 - CT = 1228.5988$

7. $SS_A = 128.9428$

9. SS error (remainder; discrepance) = $SS_{total} - SS_A - SS_B$ = quantity 6 − quantity 7 − quantity 8 = 1228.5988 − 128.9428 − 1006.9909 = 92.6651
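The arithmetic of steps 5 and 9 can be checked with a few lines of Python. This is a sketch that uses only the quantities printed in the box (grand total, total SS, SS for columns, SS for rows); the variable names are hypothetical, and the degrees of freedom follow from a = 3 and b = 8.

```python
a, b = 3, 8                      # time periods (columns), individuals (rows)

grand_total = 413.40             # quantity 1
ct = grand_total ** 2 / (a * b)  # correction term, step 5

ss_total = 1228.5988             # quantity 6
ss_a = 128.9428                  # quantity 7, columns (time)
ss_b = 1006.9909                 # quantity 8, rows (individuals)
ss_remainder = ss_total - ss_a - ss_b   # step 9

df_a = a - 1
df_b = b - 1
df_remainder = (a - 1) * (b - 1)

ms_a = ss_a / df_a
ms_b = ss_b / df_b
ms_remainder = ss_remainder / df_remainder

# With no replication, the remainder MS is the only available error term.
f_time = ms_a / ms_remainder
f_individuals = ms_b / ms_remainder

print(round(ct, 4), round(ss_remainder, 4))
print(round(f_time, 2), round(f_individuals, 2))
```

The variance ratio for time comes out at about 9.74 from these rounded inputs.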
FIGURE 9.3 Total SS = 1228.5988; Column SS = 128.9428; Interaction SS = 92.6651 = remainder; Error SS = 0.
in this example is the same as the total sum of squares. If this is not immediately apparent, consult Figure 9.3, which, when compared with Figure 9.1, illustrates that the error sum of squares based on variation within subgroups is missing in this example. Thus, after we subtract the sum of squares for columns (factor A) and for rows (factor B) from the total SS, we are left with only a single sum of squares, which is the equivalent of the previous interaction SS but which is now the only source for an error term in the anova. This SS is known as the remainder SS or the discrepance. If you refer to the expected mean squares for the two-way anova in Box 9.1, you will discover why we made the statement earlier that for some models and tests in a two-way anova without replication we must assume that the interaction is not significant. If interaction is present, only a Model II anova can be entirely tested, while in a mixed model only the fixed level can be tested over the remainder mean square. But in a pure Model I anova, or for the random factor in a mixed model, it would be improper to test the main effects over the remainder unless we could reliably assume that no added effect due to interaction is present. General inspection of the data in Box 9.2 convinces us that the trends with time for any one individual are faithfully reproduced for the other individuals. Thus, interaction is unlikely to be present. If, for example, some individuals had not responded with a lowering of their S-PLP levels after ingestion of alcohol, interaction would have been apparent, and the test of the mean square among individuals carried out in Box 9.2 would not have been legitimate.
Since we assume no interaction, the row and column mean squares are tested over the error MS. The results are not surprising; casual inspection of the data would have predicted our findings. Differences with time are highly significant, yielding an F value of 9.741. The added variance among individuals is also highly significant, assuming there is no interaction. A common application of two-way anova without replication is the repeated testing of the same individuals. By this we mean that the same group of individuals
is tested repeatedly over a period of time. The individuals are one factor (usually considered as random and serving as replication), and the time dimension is the second factor, a fixed treatment effect. For example, we might measure growth of a structure in ten individuals at regular intervals. When we test for the presence of an added variance component (due to the random factor), we again must assume that there is no interaction between time and the individuals; that is, the responses of the several individuals are parallel through time. Another use of this design is found in various physiological and psychological experiments in which we test the same group of individuals for the appearance of some response after treatment. Examples include increasing immunity after antigen inoculations, altered responses after conditioning, and measures of learning after a number of trials. Thus, we may study the speed with which ten rats, repeatedly tested on the same maze, reach the end point. The fixed-treatment effect would be the successive trials to which the rats have been subjected. The second factor, the ten rats, is random, presumably representing a random sample of rats from the laboratory population. One special case, common enough to merit separate discussion, is repeated testing of the same individuals in which only two treatments (a = 2) are given. This case is also known as paired comparisons, because each observation for one treatment is paired with one for the other treatment. This pair is composed of the same individuals tested twice or of two individuals with common experiences, so that we can legitimately arrange the data as a two-way anova.
Let us elaborate on this point. Suppose we test the muscle tone of a group of individuals, subject them to severe physical exercise, and measure their muscle tone once more. Since the same group of individuals will have been tested twice, we can arrange our muscle tone readings in pairs, each pair representing readings on one individual (before and after exercise). Such data are appropriately treated by a two-way anova without replication, which in this case would be a paired-comparisons test because there are only two treatment classes. This "before and after treatment" comparison is a very frequent design leading to paired comparisons. Another design simply measures two stages in the development of a group of organisms, time being the treatment intervening between the two stages. The example in Box 9.3 is of this nature. It measures lower face width in a group of girls at age five and in the same group of girls when they are six years old. The paired comparison is for each individual girl, between her face width when she is five years old and her face width at six years. Paired comparisons often result from dividing an organism or other individual unit so that half receives treatment 1 and the other half treatment 2, which may be the control. Thus, if we wish to test the strength of two antigens or allergens we might inject one into each arm of a single individual and measure the diameter of the red area produced. It would not be wise, from the point of view of experimental design, to test antigen 1 on individual 1 and antigen 2 on individual 2. These individuals may be differentially susceptible to these antigens, and we may learn little about the relative potency of the
BOX 9.3
Paired comparisons (randomized blocks with a = 2).
Lower face width (skeletal bigonial diameter in cm) for 15 North American white girls measured when 5 and again when 6 years old.

                 (1)            (2)            (3)        (4)
Individuals   5-year-olds    6-year-olds       Σ          D = Y_i2 − Y_i1 (difference)
 1              7.33           7.53           14.86       0.20
 2              7.49           7.70           15.19        .21
 3              7.27           7.46           14.73        .19
 4              7.93           8.21           16.14        .28
 5              7.56           7.81           15.37        .25
 6              7.81           8.01           15.82        .20
 7              7.46           7.72           15.18        .26
 8              6.94           7.13           14.07        .19
 9              7.49           7.68           15.17        .19
10              7.44           7.66           15.10        .22
11              7.95           8.11           16.06        .16
12              7.47           7.66           15.13        .19
13              7.04           7.20           14.24        .16
14              7.10           7.25           14.35        .15
15              7.64           7.79           15.43        .15
ΣY            111.92         114.92          226.84       3.00
ΣY²           836.3300       881.8304       3435.6992     0.6216
Two-way anova without replication

Anova table

Source of variation             df    SS        MS              F_s           Expected MS
Ages (columns; factor A)         1    0.3000    0.3000          388.89**
Individuals (rows; factor B)    14    2.6367    0.188,34        (244.14)**
Remainder                       14    0.0108    0.000,771,43                  σ² + σ²_AB
Total                           29    2.9475

F_0.01[1,14] = 8.86        F_0.01[12,12] =
Conclusions. The variance ratio for ages is highly significant. We conclude that faces of 6-year-old girls are wider than those of 5-year-olds. If we are willing
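$$t_s = \frac{\bar{D} - (\mu_1 - \mu_2)}{s_{\bar{D}}}$$

where $\bar{D}$ is the mean difference between the paired observations,

$$\bar{D} = \frac{\sum D}{b} = \frac{3.00}{15} = 0.20$$

and $s_{\bar{D}} = s_D/\sqrt{b}$ is the standard error of $\bar{D}$ calculated from the observed differences in column (4):

$$s_D = \sqrt{\frac{\sum D^2 - (\sum D)^2/b}{b - 1}} = \sqrt{\frac{0.6216 - (3.00)^2/15}{14}} = \sqrt{\frac{0.0216}{14}}$$

We assume that the true difference between the means of the two groups, $\mu_1 - \mu_2$, equals zero:

$$t_s = \frac{\bar{D} - 0}{s_{\bar{D}}} = \frac{0.20 - 0}{0.010,141,9} = 19.7203$$

with $b - 1 = 14$ degrees of freedom.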
antigens, since this would be confounded by the differential responses of the subjects. A much better design would be first to inject antigen 1 into the left arm and antigen 2 into the right arm of a group of n individuals and then to analyze the data as a two-way anova without replication, with n rows (individuals) and 2 columns (treatments). It is probably immaterial whether an antigen is injected into the right or left arm, but if we were designing such an experiment and knew little about the reaction of humans to antigens, we might, as a precaution, randomly allocate antigen 1 to the left or right arm for different subjects, antigen 2 being injected into the opposite arm. A similar example is the testing of certain plant viruses by rubbing a concentration of the virus over the surface of a leaf and counting the resulting lesions. Since different leaves are susceptible in different degrees, a conventional way of measuring the strength of the virus is to
wipe it over the half of the leaf on one side of the midrib, rubbing the other half of the leaf with a control or standard solution. Another design leading to paired comparisons is to apply the treatment to two individuals sharing a common experience, be this genetic or environmental. Thus, a drug or a psychological test might be given to groups of twins or sibs, one of each pair receiving the treatment, the other one not. Finally, the paired-comparisons technique may be used when the two individuals to be compared share a single experimental unit and are thus subjected to common environmental experiences. If we have a set of rat cages, each of which holds two rats, and we are trying to compare the effect of a hormone injection with a control, we might inject one of each pair of rats with the hormone and use its cage mate as a control. This would yield a 2 × n anova for n cages. One reason for featuring the paired-comparisons test separately is that it alone among the two-way anovas without replication has an equivalent, alternative method of analysis: the t test for paired comparisons, which is the traditional method of analyzing it. The paired-comparisons case shown in Box 9.3 analyzes face widths of five- and six-year-old girls, as already mentioned. The question being asked is whether the faces of six-year-old girls are significantly wider than those of five-year-old girls. The data are shown in columns (1) and (2) for 15 individual girls. Column (3) features the row sums that are necessary for the analysis of variance.
The computations for the two-way anova without replication are the same as those already shown for Box 9.2 and thus are not shown in detail. The anova table shows that there is a highly significant difference in face width between the two age groups. If interaction is assumed to be zero, there is a large added variance component among the individual girls, undoubtedly representing genetic as well as environmental differences. The other method of analyzing paired-comparisons designs is the well-known t test for paired comparisons. It is quite simple to apply and is illustrated in the second half of Box 9.3. It tests whether the mean of sample differences between pairs of readings in the two columns is significantly different from a hypothetical mean, which the null hypothesis puts at zero. The standard error over which this is tested is the standard error of the mean difference. The difference column has to be calculated and is shown in column (4) of the data table in Box 9.3. The computations are quite straightforward, and the conclusions are the same as for the two-way anova. This is another instance in which we obtain the value of $F_s$ when we square the value of $t_s$. Although the paired-comparisons t test is the traditional method of solving this type of problem, we prefer the two-way anova. Its computation is no more time-consuming and has the advantage of providing a measure of the variance component among the rows (blocks). This is useful knowledge, because if there is no significant added variance component among blocks, one might simplify the analysis and design of future, similar studies by employing single-classification anova.
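The equivalence of the two methods can be checked numerically. The following Python sketch applies both analyses to the face widths in columns (1) and (2) of Box 9.3; the function and variable names are hypothetical, and the code is an illustration of the computations described above rather than a reproduction of the box.

```python
import math

# Lower face widths (cm) from columns (1) and (2) of Box 9.3.
five = [7.33, 7.49, 7.27, 7.93, 7.56, 7.81, 7.46, 6.94,
        7.49, 7.44, 7.95, 7.47, 7.04, 7.10, 7.64]
six = [7.53, 7.70, 7.46, 8.21, 7.81, 8.01, 7.72, 7.13,
       7.68, 7.66, 8.11, 7.66, 7.20, 7.25, 7.79]


def paired_anova_f(y1, y2):
    """F_s for treatments in a two-way anova without replication (a = 2)."""
    a, b = 2, len(y1)
    ct = (sum(y1) + sum(y2)) ** 2 / (a * b)            # correction term
    ss_total = sum(v * v for v in y1 + y2) - ct
    ss_a = (sum(y1) ** 2 + sum(y2) ** 2) / b - ct      # columns (ages)
    ss_b = sum((u + v) ** 2 for u, v in zip(y1, y2)) / a - ct  # rows
    ss_remainder = ss_total - ss_a - ss_b
    ms_a = ss_a / (a - 1)
    ms_remainder = ss_remainder / ((a - 1) * (b - 1))
    return ms_a / ms_remainder


def paired_t(y1, y2):
    """t_s for paired comparisons, null hypothesis of zero mean difference."""
    d = [v - u for u, v in zip(y1, y2)]
    b = len(d)
    d_bar = sum(d) / b
    s_d = math.sqrt((sum(x * x for x in d) - sum(d) ** 2 / b) / (b - 1))
    return d_bar / (s_d / math.sqrt(b))


f_s = paired_anova_f(five, six)
t_s = paired_t(five, six)
print(round(f_s, 2), round(t_s, 4), round(t_s ** 2, 2))
```

Squaring the t value returns the variance ratio for ages, as stated in the text.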
Exercises
9.1 Swanson, Latshaw, and Tague (1921) determined soil pH electrometrically for various soil samples from Kansas. An extract of their data (acid soils) is shown below. Do subsoils differ in pH from surface soils (assume that there is no interaction between localities and depth for pH reading)?

County        Soil type                   Surface pH   Subsoil pH
Finney        Richfield silt loam           6.57         8.34
Montgomery    Summit silty clay loam        6.77         6.13
Doniphan      Brown silt loam               6.53         6.32
Jewell        Jewell silt loam              6.71         8.30
Jewell        Colby silt loam               6.72         8.44
Shawnee       Crawford silty clay loam      6.01         6.80
Cherokee      Oswego silty clay loam        4.99         4.42
Greenwood     Summit silty clay loam        5.49         7.90
Montgomery    Cherokee silt loam            5.56         5.20
Montgomery    Oswego silt loam              5.32         5.32
Cherokee      Bates silt loam               5.92         5.21
Cherokee      Cherokee silt loam            6.55         5.66
Cherokee      Neosho silt loam              6.53         5.66

ANS. MS between surface and subsoils = 0.6246, $MS_{residual}$ = 0.6985, $F_s$ = 0.894, which is clearly not significant at the 5% level.

9.2 The following data were extracted from a Canadian record book of purebred dairy cattle. Random samples of 10 mature (five-year-old and older) and 10 two-year-old cows were taken from each of five breeds (honor roll, 305-day class). The average butterfat percentages of these cows were recorded. This gave us a total of 100 butterfat percentages, broken down into five breeds and into two age classes. The 100 butterfat percentages are given below. Analyze and discuss your results. You will note that the tedious part of the calculation has been done for you.
Y (10 cows per group)                                          ΣY      Ȳ
3.74  4.01  3.77  3.78  4.10  4.06  4.27  3.94  4.11  4.25    40.03   4.003
4.44  4.37  4.25  3.71  4.08  3.90  4.41  4.11  4.37  3.53    41.17   4.117
3.92  4.95  4.47  4.28  4.07  4.10  4.38  3.98  4.46  5.05    43.66   4.366
4.29  5.24  4.43  4.00  4.62  4.29  4.85  4.66  4.40  4.33    45.11   4.511
4.54  5.18  5.75  5.04  4.64  4.79  4.72  3.88  5.28  4.66    48.48   4.848
5.30  4.50  4.59  5.04  4.83  4.55  4.97  5.38  5.39  5.97    50.52   5.052
3.40  3.55  3.83  3.95  4.43  3.70  3.30  3.93  3.58  3.54    37.21   3.721
3.79  3.66  3.58  3.38  3.71  3.94  3.59  3.55  3.55  3.43    36.18   3.618
4.80  6.45  5.18  4.49  5.24  5.70  5.41  4.77  5.18  5.23    52.45   5.245
5.75  5.14  5.25  4.76  5.18  4.22  5.98  4.85  6.55  5.72    53.40   5.340

ΣY² = 2059.6109
9.3 Blakeslee (1921) studied length-width ratios of second seedling leaves of two types of Jimson weed called globe (G) and nominal (N). Three seeds of each type were planted in 16 pots. Is there sufficient evidence to conclude that globe and nominal differ in length-width ratio?

                  Types
Pot          G                     N
16533    1.67  1.53  1.61     2.18  2.23  2.32
16534    1.68  1.70  1.49     2.00  2.12  2.18
16550    1.38  1.76  1.52     2.41  2.11  2.60
16668    1.66  1.48  1.69     1.93  2.00  2.00
16767    1.38  1.61  1.64     2.32  2.23  1.90
16768    1.70  1.71  1.71     2.48  2.11  2.00
16770    1.58  1.59  1.38     2.00  2.18  2.16
16771    1.49  1.52  1.68     1.94  2.13  2.29
16773    1.48  1.44  1.58     1.93  1.95  2.10
16775    1.28  1.45  1.50     1.77  2.03  2.08
16776    1.55  1.45  1.44     2.06  1.85  1.92
16777    1.29  1.57  1.44     2.00  1.94  1.80
16780    1.36  1.22  1.41     1.87  1.87  2.26
16781    1.47  1.43  1.61     2.24  2.00  2.23
16787    1.52  1.56  1.56     1.79  2.08  1.89
16789    1.37  1.38  1.40     1.85  2.10  2.00
ANS. $MS_{within}$ = 0.0177, $MS_{T \times P}$ = 0.0203, $MS_{types}$ = 7.3206 ($F_s$ = 360.62**), $MS_{pots}$ = 0.0598 ($F_s$ = 3.378**). The effect of pots is considered to be a Model II factor, and types, a Model I factor.

9.4 The following data were extracted from a more extensive study by Sokal and Karten (1964). The data represent mean dry weights (in mg) of three genotypes of beetles, Tribolium castaneum, reared at a density of 20 beetles per gram of flour. The four series of experiments represent replications.

Genotypes: ++, +b, bb
Series: 1, 2, 3, 4

Test whether the genotypes differ in mean dry weight.

9.5 The mean length of developmental period (in days) for three strains of houseflies at seven densities is given. (Data by Sullivan and Sokal, 1963.) Do these flies differ in development period with density and among strains? You may assume absence of strain × density interaction.
ANS. $MS_{residual}$ = 0.3426, $MS_{strains}$ = 1.3943 ($F_s$ = 4.070*), $MS_{density}$ = 2.0905 ($F_s$ = 6.1019**).

9.6 The following data are extracted from those of French (1976), who carried out a study of energy utilization in the pocket mouse Perognathus longimembris during hibernation at different temperatures. Is there evidence that the amount of food available affects the amount of energy consumed at different temperatures during hibernation?
Ad-libitum food
18°C
Energy used (kcal/g): 95.73 63.95 144.30 144.30
Energy used (kcal/g): 101.19 76.88 74.08 81.40
Animal no.: 1 3 4
Animal no.: 5 6 7 8
Animal no.: 13 14 15 16
Animal no.: 17 18 19 20
CHAPTER 10
Assumptions of Analysis of Variance
We shall now examine the underlying assumptions of the analysis of variance, methods for testing whether these assumptions are valid, the consequences for an anova if the assumptions are violated, and steps to be taken if the assumptions cannot be met. We should stress that before you carry out any anova on an actual research problem, you should assure yourself that the assumptions listed in this chapter seem reasonable. If they are not, you should carry out one of several possible alternative steps to remedy the situation. In Section 10.1 we briefly list the various assumptions of analysis of variance. We describe procedures for testing some of them and briefly state the consequences if the assumptions do not hold, and we give instructions on how to proceed if they do not. The assumptions include random sampling, independence, homogeneity of variances, normality, and additivity. In many cases, departure from the assumptions of analysis of variance can be rectified by transforming the original data by using a new scale. The
rationale behind this is given in Section 10.2, together with some of the common transformations. When transformations are unable to make the data conform to the assumptions of analysis of variance, we must use other techniques of analysis, analogous to the intended anova. These are the nonparametric or distribution-free techniques, which are sometimes used by preference even when the parametric method (anova in this case) can be legitimately employed. Researchers often like to use the nonparametric methods because the assumptions underlying them are generally simple and because they lend themselves to rapid computation on a small calculator. However, when the assumptions of anova are met, these methods are less efficient than anova. Section 10.3 examines three nonparametric methods in lieu of anova for two-sample cases only.
ical process of randomly allocating the treatments to the experimental plots ensures that the $\epsilon$'s will be independent. Lack of independence of the $\epsilon$'s can result from correlation in time rather than space. In an experiment we might measure the effect of a treatment by recording weights of ten individuals. Our balance may suffer from a maladjustment that results in giving successive underestimates, compensated for by several overestimates. Conversely, compensation by the operator of the balance may result in regularly alternating over- and underestimates of the true weight. Here again, randomization may overcome the problem of nonindependence of errors. For example, we may determine the sequence in which individuals of the various groups are weighed according to some random procedure. There is no simple adjustment or transformation to overcome the lack of independence of errors. The basic design of the experiment or the way in which it is performed must be changed. If the $\epsilon$'s are not independent, the validity of the usual F test of significance can be seriously impaired.
Homogeneity of variances.
In S e c t i o n 8.4 a n d B o x 8.2, in w h i c h we described t h e t test for t h e difference b e t w e e n t w o m e a n s , y o u w e r e told t h a t the statistical test w a s valid o n l y if we c o u l d a s s u m e t h a t t h e v a r i a n c e s of t h e t w o s a m p l e s were e q u a l . A l t h o u g h w e h a v e n o t stressed it so far, this a s s u m p tion t h a t t h e e ; / s h a v e identical v a r i a n c e s a l s o u n d e r l i e s t h e e q u i v a l e n t a n o v a test for t w o s a m p l e s a n d in fact a n y t y p e of a n o v a . Equality of variances in a set of s a m p l e s is a n i m p o r t a n t p r e c o n d i t i o n for several statistical tests. Syno n y m s for this c o n d i t i o n a r e homogeneity of variances a n d homoscedasticity. T h i s latter t e r m is c o i n e d f r o m G r e e k r o o t s m e a n i n g e q u a l scatter; t h e c o n v e r s e c o n d i t i o n (inequality of v a r i a n c e s a m o n g s a m p l e s ) is called heteroscedasticity. Because we a s s u m e t h a t e a c h s a m p l e v a r i a n c e is a n e s t i m a t e of t h e s a m e p a r a m e t r i c e r r o r v a r i a n c e , the a s s u m p t i o n of h o m o g e n e i t y of v a r i a n c e s m a k e s intuitive sense. W e h a v e a l r e a d y seen h o w t o test w h e t h e r t w o s a m p l e s a r c h o m o s c e d a s t i c p r i o r t o a t test of the differences b e t w e e n t w o m e a n s (or t h e m a t h e m a t i c a l l y e q u i v a l e n t t w o  s a m p l e a n a l y s i s of variance): we use a n F test for the h y p o t h e s e s H n : a \ = o \ a n d , : ] \ , as illustrated in Scction 7.3 a n d Box 7.1. F o r m o r e t h a n t w o s a m p l e s t h e r e is a " q u i c k a n d d i r t y " m e t h o d , p r e f e r r e d by m a n y b e c a u s e of its simplicity. T h i s is the F m . lx lest. 
This test relies on the tabled cumulative probability distribution of a statistic that is the variance ratio of the largest to the smallest of several sample variances. This distribution is shown in Table VI. Let us assume that we have six anthropological samples of 10 bone lengths each, for which we wish to carry out an anova. The variances of the six samples range from 1.2 to 10.8. We compute the maximum variance ratio $s^2_{\max}/s^2_{\min} = 10.8/1.2 = 9.0$ and compare it with $F_{\max\,\alpha[a,\nu]}$, critical values of which are found in Table VI. For $a = 6$ and $\nu = n - 1 = 9$, $F_{\max}$ is 7.80 and 12.1 at the 5% and 1% levels, respectively. Since the observed ratio of 9.0 exceeds the 5% critical value, we conclude that the variances of the six samples are significantly heterogeneous.

What may cause such heterogeneity? In this case, we suspect that some of the populations are inherently more variable than others. Some races or species
are relatively uniform for one character, while others are quite variable for the same character. In an anova representing the results of an experiment, it may well be that one sample has been obtained under less standardized conditions than the others and hence has a greater variance. There are also many cases in which the heterogeneity of variances is a function of an improper choice of measurement scale. With some measurement scales, variances vary as functions of means. Thus, differences among means bring about heterogeneous variances. For example, in variables following the Poisson distribution the variance is in fact equal to the mean, and populations with greater means will therefore have greater variances. Such departures from the assumption of homoscedasticity can often be easily corrected by a suitable transformation, as discussed later in this chapter.

A rapid first inspection for heteroscedasticity is to check for correlation between the means and variances or between the means and the ranges of the samples. If the variances increase with the means (as in a Poisson distribution), the ratios $s^2/\bar{Y}$ or $s/\bar{Y} = V$ will be approximately constant for the samples. If means and variances are independent, these ratios will vary widely.

The consequences of moderate heterogeneity of variances are not too serious for the overall test of significance, but single degree of freedom comparisons may be far from accurate. If transformation cannot cope with heteroscedasticity, nonparametric methods (Section 10.3) may have to be resorted to.

Normality. We have assumed that the error terms $\epsilon_{ij}$ of the variates in each sample will be independent, that the variances of the error terms of the several samples will be equal, and, finally, that the error terms will be normally distributed. If there is serious question about the normality of the data, a graphic test, as illustrated in Section 5.5, might be applied to each sample separately.

The consequences of nonnormality of error are not too serious. Only a very skewed distribution would have a marked effect on the significance level of the F test or on the efficiency of the design. The best way to correct for lack of normality is to carry out a transformation that will make the data normally distributed, as explained in the next section. If no simple transformation is satisfactory, a nonparametric test, as carried out in Section 10.3, should be substituted for the analysis of variance.

Additivity. In two-way anova without replication it is necessary to assume that interaction is not present if one is to make tests of the main effects in a Model I anova. This assumption of no interaction in a two-way anova is sometimes also referred to as the assumption of additivity of the main effects. By this we mean that any single observed variate can be decomposed into additive components representing the treatment effects of a particular row and column as well as a random term special to it. If interaction is actually present, then the F test will be very inefficient, and possibly misleading if the effect of the interaction is very large. A check of this assumption requires either more than a single observation per cell (so that an error mean square can be computed)
or an independent estimate of the error mean square from previous comparable experiments.

Interaction can be due to a variety of causes. Most frequently it means that a given treatment combination, such as level 2 of factor A when combined with level 3 of factor B, makes a variate deviate from the expected value. Such a deviation is regarded as an inherent property of the natural system under study, as in examples of synergism or interference. Similar effects occur when a given replicate is quite aberrant, as may happen if an exceptional plot is included in an agricultural experiment, if a diseased individual is included in a physiological experiment, or if by mistake an individual from a different species is included in a biometric study. Finally, an interaction term will result if the effects of the two factors A and B on the response variable Y are multiplicative rather than additive. An example will make this clear.

In Table 10.1 we show the additive and multiplicative treatment effects in a hypothetical two-way anova. Let us assume that the expected population mean $\mu$ is zero. Then the mean of the sample subjected to treatment 1 of factor A and treatment 1 of factor B should be 2, by the conventional additive model. This is so because each factor at level 1 contributes unity to the mean.
Similarly, the expected subgroup mean subjected to level 3 for factor A and level 2 for factor B is 8, the respective contributions to the mean being 3 and 5. However, if the process is multiplicative rather than additive, as occurs in a variety of physicochemical and biological phenomena, the expected values will be quite different. For treatment $A_1B_1$, the expected value equals 1, which is the product of 1 and 1. For treatment $A_3B_2$, the expected value is 15, the product of 3 and 5. If we were to analyze multiplicative data of this sort by a conventional anova, we would find that the interaction sum of squares would be greatly augmented because of the nonadditivity of the treatment effects. In this case, there is a simple remedy. By transforming the variable into logarithms (Table 10.1), we are able to restore the additivity of the data. The third item in each cell gives the logarithm of the expected value, assuming multiplicative relations. Notice that the increments are strictly additive again ($SS_{A \times B} = 0$). As a matter of fact, on a logarithmic scale we could simply write $\alpha_1 = 0$, $\alpha_2 = 0.30$, $\alpha_3 = 0.48$, $\beta_1 = 0$, $\beta_2 = 0.70$. Here is a good illustration of how transformation of scale, discussed in detail in Section 10.2, helps us meet the assumptions of analysis of variance.

TABLE 10.1
Additive and multiplicative treatment effects in a hypothetical two-way anova.

                                               Factor A
Factor B   Effect                          α1 = 1   α2 = 2   α3 = 3
β1 = 1     Additive effects                2        3        4
           Multiplicative effects          1        2        3
           Log of multiplicative effects   0        0.30     0.48
β2 = 5     Additive effects                6        7        8
           Multiplicative effects          5        10       15
           Log of multiplicative effects   0.70     1.00     1.18
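The claim that logarithms restore additivity in Table 10.1 can be checked numerically. The following is a minimal sketch, not a full anova: the helper name `interaction_ss` is our own, and it simply sums the squared deviations of each cell from the value predicted by row and column means, which is the interaction sum of squares for a table with one value per cell.

```python
import math

def interaction_ss(table):
    """Interaction sum of squares for a two-way table with one value per cell."""
    rows, cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (rows * cols)
    row_means = [sum(row) / cols for row in table]
    col_means = [sum(table[i][j] for i in range(rows)) / rows for j in range(cols)]
    # Each cell's departure from the additive prediction (row + column - grand mean)
    return sum(
        (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(rows)
        for j in range(cols)
    )

alpha = [1, 2, 3]  # contributions of factor A, as in Table 10.1
beta = [1, 5]      # contributions of factor B, as in Table 10.1

additive = [[a + b for a in alpha] for b in beta]
multiplicative = [[a * b for a in alpha] for b in beta]
logged = [[math.log10(v) for v in row] for row in multiplicative]

print(interaction_ss(additive))        # additive effects: no interaction
print(interaction_ss(multiplicative))  # multiplicative effects: interaction appears
print(interaction_ss(logged))          # logarithms: interaction vanishes (up to rounding)
```

The multiplicative table yields a clearly nonzero interaction sum of squares, whereas the additive and log-transformed tables give zero apart from floating-point rounding.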
10.2 Transformations

If the evidence indicates that the assumptions for an analysis of variance or for a t test cannot be maintained, two courses of action are open to us. We may carry out a different test not requiring the rejected assumptions, such as one of the distribution-free tests in lieu of anova, discussed in the next section. A second approach would be to transform the variable to be analyzed in such a manner that the resulting transformed variates meet the assumptions of the analysis.

Let us look at a simple example of what transformation will do. A single variate of the simplest kind of anova (completely randomized, single-classification, Model I) decomposes as follows: $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$. In this model the components are additive, with the error term $\epsilon_{ij}$ normally distributed. However, we might encounter a situation in which the components were multiplicative in effect, so that $Y_{ij} = \mu\alpha_i\epsilon_{ij}$, which is the product of the three terms. In such a case the assumptions of normality and of homoscedasticity would break down. In any one anova, the parametric mean $\mu$ is constant but the treatment effect $\alpha_i$ differs from group to group. Clearly, the scatter among the variates $Y_{ij}$ would double in a group in which $\alpha_i$ is twice as great as in another. Assume that $\mu = 1$, the smallest $\epsilon_{ij} = 1$, and the greatest, 3; then if $\alpha_i = 1$, the range of the Y's will be $3 - 1 = 2$. However, when $\alpha_i = 4$, the corresponding range will be four times as wide, from $4 \times 1 = 4$ to $4 \times 3 = 12$, a range of 8. Such data will be heteroscedastic. We can correct this situation simply by transforming our model into logarithms.
We would therefore obtain $\log Y_{ij} = \log\mu + \log\alpha_i + \log\epsilon_{ij}$, which is additive and homoscedastic. The entire analysis of variance would then be carried out on the transformed variates.

At this point many of you will feel more or less uncomfortable about what we have done. Transformation seems too much like "data grinding." When you learn that often a statistical test may be made significant after transformation of a set of data, though it would not be so without such a transformation, you may feel even more suspicious. What is the justification for transforming the data? It takes some getting used to the idea, but there is really no scientific necessity to employ the common linear or arithmetic scale to which we are accustomed. You are probably aware that teaching of the "new math" in elementary schools has done much to dispel the naive notion that the decimal system of numbers is the only "natural" one. In a similar way, with some experience in science and in the handling of statistical data, you will appreciate the fact that the linear scale, so familiar to all of us from our earliest experience, occupies a similar position with relation to other scales of measurement as does the decimal system of numbers with respect to the binary and octal numbering systems and others. If a system is multiplicative on a linear scale, it may be much more convenient to think of it as an additive system on a logarithmic scale.

Another frequent transformation is the square root of a variable. The square root of the surface area of an organism is often a more appropriate measure of the fundamental biological variable subjected to physiological and evolutionary forces than is the area. This is reflected in the normal distribution of the square root of the variable as compared to the skewed distribution of areas. In many cases experience has taught us to express experimental variables not in linear scale but as logarithms, square roots, reciprocals, or angles. Thus, pH values are logarithms and dilution series in microbiological titrations are expressed as reciprocals. As soon as you are ready to accept the idea that the scale of measurement is arbitrary, you simply have to look at the distributions of transformed variates to decide which transformation most closely satisfies the assumptions of the analysis of variance before carrying out an anova.

A fortunate fact about transformations is that very often several departures from the assumptions of anova are simultaneously cured by the same transformation to a new scale. Thus, simply by making the data homoscedastic, we also make them approach normality and ensure additivity of the treatment effects.
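The numerical example of the multiplicative model given earlier in this section (parametric mean 1, error factors running from 1 to 3, treatment effects of 1 and 4) can be retraced in a few lines. This is only an illustrative sketch; the function name `value_range` and the choice of common logarithms are our own assumptions.

```python
import math

MU = 1.0                      # parametric mean assumed in the example
EPS_MIN, EPS_MAX = 1.0, 3.0   # smallest and greatest error factors in the example

def value_range(alpha, transform=lambda y: y):
    """Range of Y = mu * alpha * epsilon on the scale given by `transform`."""
    lowest = transform(MU * alpha * EPS_MIN)
    highest = transform(MU * alpha * EPS_MAX)
    return highest - lowest

raw_range_1 = value_range(1)              # linear scale, alpha = 1
raw_range_4 = value_range(4)              # linear scale, alpha = 4
log_range_1 = value_range(1, math.log10)  # logarithmic scale, alpha = 1
log_range_4 = value_range(4, math.log10)  # logarithmic scale, alpha = 4

print(raw_range_1, raw_range_4)
print(log_range_1, log_range_4)
```

On the linear scale the range grows fourfold with the treatment effect; on the logarithmic scale both groups have the same range, log 3, which is the sense in which the transformed model is homoscedastic.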
When a transformation is applied, tests of significance are performed on the transformed data, but estimates of means are usually given in the familiar untransformed scale. Since the transformations discussed in this chapter are nonlinear, confidence limits computed in the transformed scale and changed back to the original scale would be asymmetrical. Stating the standard error in the original scale would therefore be misleading. In reporting results of research with variables that require transformation, furnish means in the back-transformed scale followed by their (asymmetrical) confidence limits rather than by their standard errors.

An easy way to find out whether a given transformation will yield a distribution satisfying the assumptions of anova is to plot the cumulative distributions of the several samples on probability paper. By changing the scale of the second coordinate axis from linear to logarithmic, square root, or any other one, we can see whether a previously curved line, indicating skewness, straightens out to indicate normality (you may wish to refresh your memory on these graphic techniques studied in Section 5.5). We can look up upper class limits on transformed scales or employ a variety of available probability graph papers whose second axis is in logarithmic, angular, or other scale. Thus, we not only test whether the data become more normal through transformation, but we can also get an estimate of the standard deviation under transformation as measured by the slope of the fitted line. The assumption of homoscedasticity implies that the slopes for the several samples should be the same. If the slopes are very heterogeneous, homoscedasticity has not been achieved. Alternatively, we can
examine goodness of fit tests for normality (see Chapter 13) for the samples under various transformations. That transformation yielding the best fit over all samples will be chosen for the anova. It is important that the transformation not be selected on the basis of giving the best anova results, since such a procedure would distort the significance level.

The logarithmic transformation. The most common transformation applied is conversion of all variates into logarithms, usually common logarithms. Whenever the mean is positively correlated with the variance (greater means are accompanied by greater variances), the logarithmic transformation is quite likely to remedy the situation and make the variance independent of the mean. Frequency distributions skewed to the right are often made more symmetrical by transformation to a logarithmic scale. We saw in the previous section and in Table 10.1 that logarithmic transformation is also called for when effects are multiplicative.

The square root transformation. We shall use a square root transformation as a detailed illustration of transformation of scale. When the data are counts, as of insects on a leaf or blood cells in a hemacytometer, we frequently find the square root transformation of value. You will remember that such distributions are likely to be Poisson-distributed rather than normally distributed and that in a Poisson distribution the variance is the same as the mean. Therefore, the mean and variance cannot be independent but will vary identically.
Transforming the variates to square roots will generally make the variances independent of the means. When the counts include zero values, it has been found desirable to code all variates by adding 0.5. The transformation then is $\sqrt{Y + \tfrac{1}{2}}$.

Table 10.2 shows an application of the square root transformation. The sample with the greater mean has a significantly greater variance prior to transformation. After transformation the variances are not significantly different. For reporting means the transformed means are squared again and confidence limits are reported in lieu of standard errors.

The arcsine transformation. This transformation (also known as the angular transformation) is especially appropriate to percentages and proportions. You may remember from Section 4.2 that the standard deviation of a binomial distribution is $\sigma = \sqrt{pq/k}$. Since $\mu = p$, $q = 1 - p$, and $k$ is constant for any one problem, it is clear that in a binomial distribution the variance would be a function of the mean. The arcsine transformation preserves the independence of the two.

The transformation finds $\theta = \arcsin\sqrt{p}$, where $p$ is a proportion. The term "arcsin" is synonymous with inverse sine or $\sin^{-1}$, which stands for "the angle whose sine is" the given quantity. Thus, if we compute or look up $\arcsin\sqrt{0.431} = \arcsin 0.6565$, we find 41.03°, the angle whose sine is 0.6565. The arcsine transformation stretches out both tails of a distribution of percentages or proportions and compresses the middle. When the percentages in the original data fall between 30% and 70%, it is generally not necessary to apply the arcsine transformation.
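For readers who prefer to compute rather than look up the angular transformation, the following small sketch reproduces the worked value above. The function name `arcsine_transform` is our own; the result is expressed in degrees, as in the text.

```python
import math

def arcsine_transform(p):
    """Angle in degrees whose sine is the square root of the proportion p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.degrees(math.asin(math.sqrt(p)))

theta = arcsine_transform(0.431)
print(round(theta, 2))  # the text gives 41.03 degrees

# Equal steps in p near the tails map to larger steps in the angle
# than equal steps near the middle of the scale.
tail_step = arcsine_transform(0.06) - arcsine_transform(0.01)
middle_step = arcsine_transform(0.55) - arcsine_transform(0.50)
print(tail_step, middle_step)
```

The last two lines illustrate the stretching of the tails: a difference of five percentage points near 0% corresponds to a larger angular difference than the same difference near 50%.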
TABLE 10.2
An application of the square root transformation. The data represent the number of adult Drosophila emerging from single-pair cultures for two different medium formulations (medium A contained DDT).

(1)      (2)      (3)         (4)
Y        √Y       Medium A    Medium B
                  f           f
0        0.00     1           -
1        1.00     5           -
2        1.41     6           -
3        1.73     -           -
4        2.00     3           -
5        2.24     -           -
6        2.45     -           -
7        2.65     -           2
8        2.83     -           1
9        3.00     -           2
10       3.16     -           3
11       3.32     -           1
12       3.46     -           1
13       3.61     -           1
14       3.74     -           1
15       3.87     -           1
16       4.00     -           2
Total             15          15

                                      Medium A    Medium B
Untransformed variable
  Mean, Ȳ                             1.933       11.133
  Variance, s²                        1.495       9.410
Square root transformation
  Mean of √Y                          1.299       3.307
  Variance of √Y                      0.2634      0.2099

Tests of equality of variances
  Untransformed:  F_s = 9.410/1.495 = 6.294**
  Transformed:    F_s = 0.2634/0.2099 = 1.255 ns

Back-transformed (squared) means      1.687       10.937

95% confidence limits (transformed scale)
  L1 = mean of √Y - t_0.05 × (standard error of mean of √Y)    ...    ...
  L2 = mean of √Y + t_0.05 × (standard error of mean of √Y)    ...    ...

Back-transformed (squared) confidence limits
  L1²                                 ...         9.324
  L2²                                 ...         12.681
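The computations summarized in Table 10.2 can be retraced from the frequency columns. The sketch below is an illustration under stated assumptions, not the authors' procedure: the variable and function names are ours, the frequencies are those recoverable from the table, and the critical value 2.145 (two-tailed 5% value of t for 14 degrees of freedom) is typed in by hand rather than looked up in a table.

```python
import math
from statistics import mean, variance

# Number of flies per culture -> number of cultures (from Table 10.2)
medium_a = {0: 1, 1: 5, 2: 6, 4: 3}
medium_b = {7: 2, 8: 1, 9: 2, 10: 3, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 2}

T_CRIT = 2.145  # assumed: two-tailed 5% critical t for n - 1 = 14 df

def expand(freq):
    """Turn a frequency table into a list of individual counts."""
    return [float(y) for y, f in freq.items() for _ in range(f)]

def summarize(values):
    """Return (mean, sample variance)."""
    return mean(values), variance(values)

def back_transformed_limits(values):
    """95% limits computed on the square root scale, then squared."""
    roots = [math.sqrt(y) for y in values]
    m, s2 = summarize(roots)
    half_width = T_CRIT * math.sqrt(s2 / len(roots))
    return (m - half_width) ** 2, (m + half_width) ** 2

a, b = expand(medium_a), expand(medium_b)
raw_a, raw_b = summarize(a), summarize(b)
sqrt_a = summarize([math.sqrt(y) for y in a])
sqrt_b = summarize([math.sqrt(y) for y in b])

f_raw = raw_b[1] / raw_a[1]     # larger variance over smaller, untransformed
f_sqrt = sqrt_a[1] / sqrt_b[1]  # larger variance over smaller, transformed
limits_b = back_transformed_limits(b)

print(raw_a, raw_b)
print(sqrt_a, sqrt_b)
print(f_raw, f_sqrt)
print(sqrt_b[0] ** 2, limits_b)
```

Small differences in the third decimal place from the tabled values are to be expected, because the table works from rounded intermediate results.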
10.3 / NONPARAMETRIC METHODS IN LIEU OF ANOVA
BOX 10.1
Mann-Whitney U test for two samples, ranked observations, not paired.

A measure of heart function (left ventricle ejection fraction) measured in two samples of patients admitted to the hospital under suspicion of heart attack. The patients were classified on the basis of physical examinations during admission into different so-called Killip classes of ventricular dysfunction. We compare the left ventricle ejection fraction for patients classified as Killip classes I and III. The higher Killip class signifies patients with more severe symptoms. The findings were already graphed in the source publication, and step 1 illustrates that only a graph of the data is required for the Mann-Whitney U test. Designate the sample size of the larger sample as $n_1$ and that of the smaller sample as $n_2$. In this case, $n_1 = 29$, $n_2 = 8$. When the two samples are of equal size it does not matter which is designated as $n_1$.

1. Graph the two samples as shown below. Indicate the ties by placing dots at the same level.
[Figure: left ventricle ejection fraction plotted against Killip class. Class I: 0.49 ± 0.13, n = 29; class III: 0.28 ± 0.08, n = 8.]
2. For each observation in one sample (it is convenient to use the smaller sample), count the number of observations in the other sample which are lower in value (below it in this graph). Count ½ for each tied observation. For example, there are 1½ observations in class I below the first observation in class III. The half is introduced because of the variate in class I tied with the lowest variate in class III. There are 2½ observations below the tied second and third observations in class III. There are 3 observations below the fourth and fifth variates in class III, 4 observations below the sixth variate, and 6 and 7 observations, respectively, below the seventh and eighth variates in class III. The sum of these counts C = 29½. The Mann-Whitney statistic $U_s$ is the greater of the two quantities C and $(n_1 n_2 - C)$, in this case 29½ and [(29 × 8) − 29½] = 202½.
3. Compare $U_s$ with the critical value for $U_{\alpha[n_1, n_2]}$ in Table XI. The null hypothesis is rejected if the observed value is too large. In cases where $n_1 > 20$, calculate the following quantity

$$t_s = \frac{U_s - n_1 n_2/2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}}$$

which is approximately normally distributed. The denominator 12 is a constant. Look up the significance of $t_s$ in Table III against critical values of $t_{\alpha[\infty]}$ for a one-tailed or two-tailed test as required by the hypothesis. In our case this would yield

$$t_s = \frac{202.5 - (29)(8)/2}{\sqrt{(29)(8)(29 + 8 + 1)/12}} = \frac{86.5}{\sqrt{734.667}} = 3.191$$
A further complication arises from observations tied between the two groups. Our example is a case in point. There is no exact test. For sample sizes $n_1 < 20$, use Table XI, which will then be conservative. Larger sample sizes require a more elaborate formula. But it takes a substantial number of ties to affect the outcome of the test appreciably. Corrections for ties increase the $t_s$ value slightly; hence the uncorrected formula is more conservative. We may conclude that the two samples with a $t_s$ value of 3.191 by the uncorrected formula are significantly different at P < 0.01.
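The arithmetic of Box 10.1 can be verified with a short sketch. It starts from the eight counts read off the graph in step 2 (the raw ejection fractions are not listed in the box) and applies the uncorrected large-sample formula; the function name `mann_whitney_ts` is our own.

```python
import math

def mann_whitney_ts(u_s, n1, n2):
    """Large-sample normal deviate for the Mann-Whitney statistic (no tie correction)."""
    expected = n1 * n2 / 2
    standard_deviation = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u_s - expected) / standard_deviation

# Counts of class I observations below each of the 8 class III observations,
# with one half counted for each tie (step 2 of Box 10.1)
counts = [1.5, 2.5, 2.5, 3, 3, 4, 6, 7]
n1, n2 = 29, 8

c = sum(counts)
u_s = max(c, n1 * n2 - c)  # the statistic is the greater of C and n1*n2 - C
t_s = mann_whitney_ts(u_s, n1, n2)

print(c, u_s, round(t_s, 3))
```

The printed values agree with C = 29½, $U_s$ = 202½, and $t_s$ = 3.191 in the box.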
it. Conversely, all the points of the lower-valued sample would be below every point of the higher-valued one if we started out with the latter. Our total count would therefore be the total count of one sample multiplied by every observation in the second sample, which yields $n_1 n_2$. Thus, since we are told to take the greater of the two values, the sum of the counts C or $n_1 n_2 - C$, our result in this case would be $n_1 n_2$. On the other hand, if the two samples coincided completely, then for each point in one sample we would have those points below it plus a half point for the tied value representing that observation in the second sample which is at exactly the same level as the observation under consideration. A little experimentation will show this value to be $[n(n - 1)/2] + (n/2) = n^2/2$. Clearly, the range of possible U values must be between this and $n_1 n_2$, and the critical value must be somewhere within this range.

Our conclusion as a result of the tests in Box 10.1 is that the two admission classes characterized by physical examination differ in their ventricular dysfunction as measured by left ventricular ejection fraction. The sample characterized as more severely ill has a lower ejection fraction than the sample characterized ...
The Mann-Whitney U test is based on ranks, and it measures differences in location. A nonparametric test that tests differences between two distributions is the Kolmogorov-Smirnov two-sample test. Its null hypothesis is identity in distribution for the two samples, and thus the test is sensitive to differences in location, dispersion, skewness, and so forth. This test is quite simple to carry out. It is based on the unsigned differences between the relative cumulative frequency distributions of the two samples. Expected critical values can be looked up in a table or evaluated approximately. Comparison between observed and expected values leads to decisions whether the maximum difference between the two cumulative frequency distributions is significant.

Box 10.2 shows the application of the method to samples in which both $n_1$ and $n_2 \le 25$. The example in this box features morphological measurements
BOX 10.2
Kolmogorov-Smirnov two-sample test, testing differences in distributions of two samples of continuous observations. (Both $n_1$ and $n_2 \le 25$.)

Two samples of nymphs of the chigger Trombicula lipovskyi. Variate measured is length of cheliceral base stated as micrometer units. The sample sizes are $n_1 = 16$, $n_2 = 10$.

Sample A, Y: 104 109 112 114 116 118 118 119 121 123 125 126 126 128 128 128
Sample B, Y: 100 105 107 107 108 111 116 120 121 123
1. Form cumulative frequencies F of the items in samples 1 and 2. Thus in column (2) we note that there are 3 measurements in sample A at or below 112.5 micrometer units. By contrast there are 6 such measurements in sample B (column (3)).
2. Compute relative cumulative frequencies by dividing frequencies in columns (2) and (3) by $n_1$ and $n_2$, respectively, and enter in columns (4) and (5).
3. Compute d, the absolute value of the difference between the relative cumulative frequencies in columns (4) and (5), and enter in column (6).
4. Locate the largest unsigned difference D. It is 0.475.
5. Multiply D by $n_1 n_2$. We obtain (16)(10)(0.475) = 76.
6. Compare $n_1 n_2 D$ with its critical value in Table XIII, where we obtain a value of 84 for $\alpha = 0.05$. Since 76 is less than 84, we accept the null hypothesis that the two samples have been taken from populations with the same distribution.

The Kolmogorov-Smirnov test is less powerful than the Mann-Whitney U test shown in Box 10.1 with respect to the alternative hypothesis of the latter, i.e., differences in location. However, Kolmogorov-Smirnov tests differences in both shape and location of the distributions and is thus a more comprehensive test.
(1)    (2)        (3)        (4)      (5)      (6)
Y      Sample A   Sample B   F1/n1    F2/n2    d = |F1/n1 - F2/n2|
       F1         F2
100    0          1          0        0.100    0.100
101    0          1          0        0.100    0.100
102    0          1          0        0.100    0.100
103    0          1          0        0.100    0.100
104    1          1          0.062    0.100    0.038
105    1          2          0.062    0.200    0.138
106    1          2          0.062    0.200    0.138
107    1          4          0.062    0.400    0.338
108    1          5          0.062    0.500    0.438
109    2          5          0.125    0.500    0.375
110    2          5          0.125    0.500    0.375
111    2          6          0.125    0.600    0.475 <- D
112    3          6          0.188    0.600    0.412
113    3          6          0.188    0.600    0.412
114    4          6          0.250    0.600    0.350
115    4          6          0.250    0.600    0.350
116    5          7          0.312    0.700    0.388
117    5          7          0.312    0.700    0.388
118    7          7          0.438    0.700    0.262
119    8          7          0.500    0.700    0.200
120    8          8          0.500    0.800    0.300
121    9          9          0.562    0.900    0.338
122    9          9          0.562    0.900    0.338
123    10         10         0.625    1.000    0.375
124    10         10         0.625    1.000    0.375
125    11         10         0.688    1.000    0.312
126    13         10         0.812    1.000    0.188
127    13         10         0.812    1.000    0.188
128    16         10         1.000    1.000    0
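The steps of Box 10.2 lend themselves to a direct computation. The following is a minimal sketch using only the standard library; the function name `ks_two_sample_d` is our own, and the critical value 84 is simply copied from step 6 of the box rather than derived. Because the relative cumulative frequencies change only at observed values, it is enough to evaluate them there instead of at every class mark.

```python
sample_a = [104, 109, 112, 114, 116, 118, 118, 119,
            121, 123, 125, 126, 126, 128, 128, 128]
sample_b = [100, 105, 107, 107, 108, 111, 116, 120, 121, 123]

def ks_two_sample_d(x, y):
    """Largest unsigned difference between two relative cumulative frequency distributions."""
    n1, n2 = len(x), len(y)
    largest = 0.0
    for value in sorted(set(x) | set(y)):
        rel_cum_1 = sum(1 for xi in x if xi <= value) / n1
        rel_cum_2 = sum(1 for yi in y if yi <= value) / n2
        largest = max(largest, abs(rel_cum_1 - rel_cum_2))
    return largest

d_max = ks_two_sample_d(sample_a, sample_b)
statistic = len(sample_a) * len(sample_b) * d_max

CRITICAL_5_PERCENT = 84  # value quoted in Box 10.2 from Table XIII
print(round(d_max, 3), round(statistic, 1), statistic >= CRITICAL_5_PERCENT)
```

The maximum difference occurs at 111 micrometer units, where 2 of 16 measurements in sample A and 6 of 10 in sample B have been accumulated.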
of two samples of chigger nymphs. We use the symbol F for cumulative frequencies, which are summed with respect to the class marks shown in column (1), and we give the cumulative frequencies of the two samples in columns (2) and (3). Relative cumulative frequencies are obtained in columns (4) and (5) by dividing by the respective sample sizes, while column (6) features the unsigned difference between relative cumulative frequencies. The maximum unsigned difference is D = 0.475. It is multiplied by $n_1 n_2$ to yield 76. The critical value for this statistic can be found in Table XIII, which furnishes critical values for the two-tailed two-sample Kolmogorov-Smirnov test. We obtain $n_1 n_2 D_{0.10} = 76$ and $n_1 n_2 D_{0.05} = 84$. Thus, there is a 10% probability of obtaining the observed difference by chance alone, and we conclude that the two samples do not differ significantly in their distributions.

When these data are subjected to the Mann-Whitney U test, however, one finds that the two samples are significantly different at 0.05 > P > 0.02. This contradicts the findings of the Kolmogorov-Smirnov test in Box 10.2.
But that is because the two tests differ in their sensitivities to different alternative hypotheses: the Mann-Whitney U test is sensitive to the number of interchanges in rank (shifts in location) necessary to separate the two samples, whereas the Kolmogorov-Smirnov test measures differences in the entire distributions of the two samples and is thus less sensitive to differences in location only. It is an underlying assumption of all Kolmogorov-Smirnov tests that the variables studied are continuous. Goodness of fit tests by means of this statistic are treated in Chapter 13.

Finally, we shall present a nonparametric method for the paired-comparisons design, discussed in Section 9.3 and illustrated in Box 9.3. The most widely used method is that of Wilcoxon's signed-ranks test, illustrated in Box 10.3. The example to which it is applied has not yet been encountered in this book. It records mean litter size in two strains of guinea pigs kept in large colonies during the years 1916 through 1924. Each of these values is the average of a large number of litters. Note the parallelism in the changes in the variable in the two strains. During 1917 and 1918 (war years for the United States), a shortage of caretakers and of food resulted in a decrease in the number of offspring per litter. As soon as better conditions returned, the mean litter size increased. Notice that a subsequent drop in 1922 is again mirrored in both lines, suggesting that these fluctuations are environmentally caused.
It is therefore quite appropriate that the data be treated as paired comparisons, with years as replications and the strain differences as the fixed treatments to be tested. Column (3) in Box 10.3 lists the differences on which a conventional paired-comparisons t test could be performed. For Wilcoxon's test these differences are ranked without regard to sign in column (4), so that the smallest absolute difference is ranked 1 and the largest absolute difference (of the nine differences) is ranked 9. Tied ranks are computed as averages of the ranks; thus if the fourth and fifth differences have the same absolute magnitude they will both be assigned rank 4.5. After the ranks have been computed, the original sign of each difference
BOX 10.3 Wilcoxon's signed-ranks test for two groups, arranged as paired observations.
Mean litter size of two strains of guinea pigs, compared over 9 years.
Columns: Year; (1) Strain B; (2) Strain 13; (3) D; (4) Rank (R)
Signed ranks in column (4): +9, +8, +2, +3, +7, +6, +5, +4, and -1
Sum of positive ranks: 44
Absolute sum of negative ranks: 1
Procedure
1. Compute the differences between the pairs of observations. These are entered in column (3), labeled D.
2. Rank these differences from the smallest to the largest without regard to sign.
3. Assign to the ranks the original signs of the differences.
4. Sum the positive and negative ranks separately. The sum that is smaller in absolute value, $T_s$, is compared with the values in Table XII for $n = 9$.

Since $T_s = 1$, which is equal to or less than the entry for one-tailed $\alpha = 0.005$ in the table, our observed difference is significant at the 1% level. Litter size in strain B is significantly different from that of strain 13.

For large samples ($n > 50$) compute
$$t_s = \frac{T_s - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n\left(n + \frac{1}{2}\right)(n+1)}{12}}}$$
where $T_s$ is as defined in step 4 above. Compare the computed value with $t_{\alpha[\infty]}$ in Table III.
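The four steps of the procedure, together with the large-sample formula, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are our own, the differences are hypothetical values (not the guinea pig data), and we assume that no difference is exactly zero.

```python
import math

def signed_ranks(differences):
    """Rank |d| from smallest to largest (ties get the average rank),
    then give each rank the sign of its difference.
    Assumes no difference is exactly zero."""
    order = sorted(range(len(differences)), key=lambda i: abs(differences[i]))
    ranks = [0.0] * len(differences)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next absolute difference is tied
        while (end + 1 < len(order)
               and abs(differences[order[end + 1]]) == abs(differences[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1   # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            d = differences[order[k]]
            ranks[order[k]] = math.copysign(average_rank, d)
        start = end + 1
    return ranks

def wilcoxon_ts(differences):
    """Smaller (in absolute value) of the positive and negative rank sums."""
    ranks = signed_ranks(differences)
    positive = sum(r for r in ranks if r > 0)
    negative = -sum(r for r in ranks if r < 0)
    return min(positive, negative)

def large_sample_ts(t_s, n):
    """Normal approximation given in the box for large n."""
    return (t_s - n * (n + 1) / 4) / math.sqrt(n * (n + 0.5) * (n + 1) / 12)

# Hypothetical paired differences: eight positive, one small negative.
hypothetical_d = [0.30, 0.20, 0.05, 0.06, 0.15, -0.02, 0.11, 0.09, 0.08]
print(signed_ranks(hypothetical_d))
print(wilcoxon_ts(hypothetical_d))
```

With these made-up differences the only negative value also has the smallest magnitude, so the smaller rank sum is 1, the same pattern as in the box.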
is assigned to the corresponding rank. The sum of the positive or of the negative ranks, whichever one is smaller in absolute value, is then computed (it is labeled $T_s$) and is compared with the critical value in Table XII for the corresponding sample size. In view of the significance of the rank sum, it is clear that strain B has a litter size different from that of strain 13. This is a very simple test to carry out, but it is, of course, not as efficient as the corresponding parametric t test, which should be preferred if the necessary assumptions hold. Note that one needs minimally six differences in order to carry out Wilcoxon's signed-ranks test. With only six paired comparisons, all differences must be of like sign for the test to be significant at the 5% level. For a large sample an approximation using the normal curve is available, which is given in Box 10.3. Note that the absolute magnitudes of the differences play a role only insofar as they affect the ranks of the differences.

A still simpler test is the sign test, in which we count the number of positive and negative signs among the differences (omitting all differences of zero). We then test the hypothesis that the plus and minus signs are sampled from a population in which the two kinds of signs are present in equal proportions, as might be expected if there were no true difference between the two paired samples.
Such sampling should follow the binomial distribution, and the test of the hypothesis that the parametric frequency of the plus signs is $p = 0.5$ can be made in a number of ways. Let us learn these by applying the sign test to the guinea pig data of Box 10.3. There are nine differences, of which eight are positive and one is negative. We could follow the methods of Section 4.2 (illustrated in Table 4.3) in which we calculate the expected probability of sampling one minus sign in a sample of nine on the assumption of $p = q = 0.5$. The probability of such an occurrence and all "worse" outcomes equals 0.0195. Since we have no a priori notions that one strain should have a greater litter size than the other, this is a two-tailed test, and we double the probability to 0.0390. Clearly, this is an improbable outcome, and we reject the null hypothesis that $p = q = 0.5$.

Since the computation of the exact probabilities may be quite tedious if no table of cumulative binomial probabilities is at hand, we may take a second approach, using Table IX, which furnishes confidence limits for $p$ for various sample sizes and sampling outcomes. Looking up sample size 9 and $Y = 1$ (number showing the property), we find the 95% confidence limits to be 0.0028 and 0.4751 by interpolation, thus excluding the value $p = q = 0.5$ postulated by the null hypothesis.
At least at the 5% significance level we can conclude that it is unlikely that the number of plus and minus signs is equal. The confidence limits imply a two-tailed distribution; if we intend a one-tailed test, we can infer a 0.025 significance level from the 95% confidence limits and a 0.005 level from the 99% limits. Obviously, such a one-tailed test would be carried out only if the results were in the direction of the alternative hypothesis. Thus, if the alternative hypothesis were that strain 13 in Box 10.3 had greater litter size than strain B, we would not bother testing this example at all, since the
observed proportion of years showing this relation is less than half. For larger samples, we can use the normal approximation to the binomial distribution as follows: $t_s = (Y - \mu)/\sigma_Y = (Y - kp)/\sqrt{kpq}$, where we substitute the mean and standard deviation of the binomial distribution learned in Section 4.2. In our case, we let $n$ stand for $k$ and assume that $p = q = 0.5$. Therefore, $t_s = (Y - \tfrac{1}{2}n)/\sqrt{\tfrac{1}{4}n} = (Y - \tfrac{1}{2}n)/(\tfrac{1}{2}\sqrt{n})$. The value of $t_s$ is then compared with $t_{\alpha[\infty]}$ in Table III, using one tail or two tails of the distribution as warranted. When the sample size $n > 12$, this is a satisfactory approximation. A third approach we can use is to test the departure from the expectation that $p = q = 0.5$ by one of the methods of Chapter 13.
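The first and the large-sample approaches to the sign test are easy to check numerically. The sketch below is illustrative only (the function names are our own): it sums the binomial terms for one or fewer minus signs among nine differences with $p = q = 0.5$, and evaluates the normal approximation for a hypothetical larger sample of 16 differences of which 4 are negative.

```python
from math import comb, sqrt

def binomial_tail(y, n, p=0.5):
    """P(Y <= y) for a binomial count Y out of n with parametric frequency p."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(y + 1))

def sign_test_normal(y, n):
    """t_s = (Y - n/2) / (sqrt(n)/2), the normal approximation for p = q = 0.5."""
    return (y - n / 2) / (sqrt(n) / 2)

# One minus sign among nine differences (the guinea pig example).
one_tailed = binomial_tail(1, 9)
two_tailed = 2 * one_tailed
print(round(one_tailed, 4), round(two_tailed, 4))

# Hypothetical larger sample: 4 minus signs among 16 differences.
print(sign_test_normal(4, 16))
```

The one-tailed probability agrees with the 0.0195 quoted in the text; doubling the unrounded value gives about 0.039.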
Exercises
10.1 Allee and Bowen (1932) studied survival time of goldfish (in minutes) when placed in colloidal silver suspensions. Experiment no. 9 involved 5 replicates, and experiment no. 10 involved 10 replicates. Do the results of the two experiments differ? Addition of urea, NaCl, and Na$_2$S to a third series of suspensions apparently prolonged the life of the fish.
150 180 210 240 240 120 180 240 120 150
330 300 300 420 360 270 360 360 300 120
Analyze and interpret. Test equality of variances. Compare anova results with those obtained using the Mann-Whitney U test for the two comparisons under study. To test the effect of urea it might be best to pool Experiments 9 and 10, if they prove not to differ significantly. ANS. Test for homogeneity of Experiments 9 and 10, $U_s = 33$, ns. For the comparison of Experiments 9 and 10 versus urea and salts, $U_s = 136$, $P < 0.001$.

10.2 In a study of flower color in butterflyweed (Asclepias tuberosa), Woodson (1964) obtained the following results:
Geographic region
CI SW2 SW3
226 94 23
The variable recorded was a color score (ranging from 1 for pure yellow to 40 for deep orange-red) obtained by matching flower petals to sample colors in Maerz and Paul's Dictionary of Color. Test whether the samples are homoscedastic.

10.3 Test for a difference in surface and subsoil pH in the data of Exercise 9.1, using Wilcoxon's signed-ranks test. ANS. $T_s = 38$; $P > 0.10$.

10.4 Number of bacteria in 1 cc of milk from three cows counted at three periods (data from Park, Williams, and Krumwiede, 1924):
Cow no.    At time of milking    After 24 hours    After 48 hours
1
2
(a) Calculate means and variances for the three periods and examine the relation between these two statistics. Transform the variates to logarithms and compare means and variances based on the transformed data. Discuss.
(b) Carry out an anova on transformed and untransformed data. Discuss your results.

10.5 Analyze the measurements of the two samples of chigger nymphs in Box 10.2 by the Mann-Whitney U test. Compare the results with those shown in Box 10.2 for the Kolmogorov-Smirnov test. ANS. $U_s = 123.5$, $P < 0.05$.

10.6 Allee et al. (1934) studied the rate of growth of Ameiurus melas in conditioned and unconditioned well water and obtained the following results for the gain in average length of a sample fish. Although the original variates are not available, we may still test for differences between the two treatment classes. Use the sign test to test for differences in the paired replicates.
Replicate
1 2 3 4 5 6 7 8 9 10
2.20 1.05 3.25 2.60 1.90 1.50 2.25 1.00 0.09 0.83
1.06 0.06 3.55 1.00 1.10 0.60 1.30 0.90 0.59 0.58
CHAPTER 11
Regression
We now turn to the simultaneous analysis of two variables. Even though we may have considered more than one variable at a time in our studies so far (for example, seawater concentration and oxygen consumption in Box 9.1, or age of girls and their face widths in Box 9.3), our actual analyses were of only one variable. However, we frequently measure two or more variables on each individual, and we consequently would like to be able to express more precisely the nature of the relationships between these variables. This brings us to the subjects of regression and correlation. In regression we estimate the relationship of one variable with another by expressing the one in terms of a linear (or a more complex) function of the other. We also use regression to predict values of one variable in terms of the other. In correlation analysis, which is sometimes confused with regression, we estimate the degree to which two variables vary together. Chapter 12 deals with correlation, and we shall postpone our effort to clarify the relation and distinction between regression and correlation until then. The variables involved in regression and correlation are either continuous or meristic; if meristic, they are treated as though they were continuous. When variables are qualitative (that is, when they are attributes), the methods of regression and correlation cannot be used.
In Section 11.1 we review the notion of mathematical functions and introduce the new terminology required for regression analysis. This is followed in Section 11.2 by a discussion of the appropriate statistical models for regression analysis. The basic computations in simple linear regression are shown in Section 11.3 for the case of one dependent variate for each independent variate. The case with several dependent variates for each independent variate is treated in Section 11.4. Tests of significance and computation of confidence intervals for regression problems are discussed in Section 11.5. Section 11.6 serves as a summary of regression and discusses the various uses of regression analysis in biology. How transformation of scale can straighten out curvilinear relationships for ease of analysis is shown in Section 11.7. When transformation cannot linearize the relation between variables, an alternative approach is by a nonparametric test for regression. Such a test is illustrated in Section 11.8.
FIGURE 11.1
Blood pressure of an animal in mmHg as a function of drug concentration in μg per cc of blood. Lines shown: drug A on animal P, drug B on animal P, and drug B on animal Q. Abscissa: micrograms of drug/cc blood.
animals. The relationships depicted in this graph can be expressed by the formula $Y = a + bX$. Clearly, $Y$ is a function of $X$. We call the variable $Y$ the dependent variable, while $X$ is called the independent variable. The magnitude of blood pressure $Y$ depends on the amount of the drug $X$ and can therefore be predicted from the independent variable, which presumably is free to vary. Although a cause would always be considered an independent variable and an effect a dependent variable, a functional relationship observed in nature may actually be something other than a cause-and-effect relationship. The highest line is of the relationship $Y = 20 + 15X$, which represents the effect of drug A on animal P. The quantity of drug is measured in micrograms, the blood pressure in millimeters of mercury. Thus, after 4 μg of the drug have been given, the blood pressure would be $Y = 20 + (15)(4) = 80$ mmHg. The independent variable $X$ is multiplied by a coefficient $b$, the slope factor. In the example chosen, $b = 15$; that is, for an increase of one microgram of the drug, the blood pressure is raised by 15 mm.

In biology, such a relationship can clearly be appropriate over only a limited range of values of $X$. Negative values of $X$ are meaningless in this case; it is also unlikely that the blood pressure will continue to increase at a uniform rate. Quite probably the slope of the functional relationship will flatten out as the drug level rises. But, for a limited portion of the range of variable $X$ (micrograms of the drug), the linear relationship $Y = a + bX$ may be an adequate description of the functional dependence of $Y$ on $X$.
By this formula, when the independent variable equals zero, the dependent variable equals $a$. This point is the intersection of the function line with the $Y$ axis. It is called the $Y$ intercept. In Figure 11.1, when $X = 0$, the function just studied will yield a blood pressure of 20 mmHg, which is the normal blood pressure of animal P in the absence of the drug.

The two other functions in Figure 11.1 show the effects of varying both $a$, the $Y$ intercept, and $b$, the slope. In the lowest line, $Y = 20 + 7.5X$, the $Y$ intercept remains the same but the slope has been halved. We visualize this as the effect of a different drug, B, on the same organism P. Obviously, when no drug is administered, the blood pressure should be at the same $Y$ intercept, since the identical organism is being studied. However, a different drug is likely to exert a different hypertensive effect, as reflected by the different slope. The third relationship also describes the effect of drug B, which is assumed to remain the same, but the experiment is carried out on a different species, Q, whose normal blood pressure is assumed to be 40 mmHg. Thus, the equation for the effect of drug B on species Q is written as $Y = 40 + 7.5X$. This line is parallel to that corresponding to the second equation.

From your knowledge of analytical geometry you will have recognized the slope factor $b$ as the slope of the function $Y = a + bX$, generally symbolized by $m$. In calculus, $b$ is the derivative of that same function ($dY/dX = b$). In biostatistics, $b$ is called the regression coefficient, and the function is called a regression equation. When we wish to stress that the regression coefficient is of variable $Y$ on variable $X$, we write $b_{Y \cdot X}$.
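The three functional relationships just described can be evaluated directly. The short sketch below is only an illustration; the function name and the dictionary layout are our own choices, while the intercepts and slopes are those given in the text.

```python
def blood_pressure(x_micrograms, a, b):
    """Y = a + bX: blood pressure in mmHg for X micrograms of drug per cc of blood."""
    return a + b * x_micrograms

# (a, b) = (Y intercept, slope) for the three lines discussed in the text.
lines = {
    "drug A on animal P": (20, 15),
    "drug B on animal P": (20, 7.5),
    "drug B on animal Q": (40, 7.5),
}

for label, (a, b) in lines.items():
    # X = 0 gives the Y intercept; X = 4 reproduces the worked example for drug A.
    print(label, blood_pressure(0, a, b), blood_pressure(4, a, b))
```

Because the second and third lines share the slope 7.5, their predicted values differ by the same 20 mmHg at every dose, which is what is meant by the lines being parallel.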
FIGURE 11.2
Blood pressure of an animal in mmHg as a function of drug concentration in μg per cc of blood. Repeated sampling for a given drug concentration. Abscissa: micrograms of drug/cc blood.
frequency distribution of blood pressure responses $Y$ to the independent variates $X$ = 2, 4, 6, 8, and 10 μg. In view of the inherent variability of biological material, the responses to each dosage would not be the same in every individual; you would obtain a frequency distribution of values of $Y$ (blood pressure) around the expected value. Assumption 3 states that these sample values would be independently and normally distributed. This is indicated by the normal curves which are superimposed about several points in the regression line in Figure 11.2. A few are shown to give you an idea of the scatter about the regression line. In actuality there is, of course, a continuous scatter, as though these separate normal distributions were stacked right next to each other, there being, after all, an infinity of possible intermediate values of $X$ between any two dosages. In those rare cases in which the independent variable is discontinuous, the distributions of $Y$ would be physically separate from each other and would occur only along those points of the abscissa corresponding to independent variates. An example of such a case would be weight of offspring ($Y$) as a function of number of offspring ($X$) in litters of mice. There may be three or four offspring per litter but there would be no intermediate value of $X$ representing 3.25 mice per litter.

Not every experiment will have more than one reading of $Y$ for each value of $X$. In fact, the basic computations we shall learn in the next section are for only one value of $Y$ per value of $X$, this being the more common case.
However, you should realize that even in such instances the basic assumption of Model I regression is that the single variate of $Y$ corresponding to the given value of $X$ is a sample from a population of independently and normally distributed variates.

4. The final assumption is a familiar one. We assume that these samples along the regression line are homoscedastic; that is, that they have a common variance $\sigma^2$, which is the variance of the $\epsilon$'s in the expression in item 3. Thus, we assume that the variance around the regression line is constant and independent of the magnitude of $X$ or $Y$.

Many regression analyses in biology do not meet the assumptions of Model I regression. Frequently both $X$ and $Y$ are subject to natural variation and/or measurement error. Also, the variable $X$ is sometimes not fixed, that is, not under the control of the investigator. Suppose we sample a population of female flies and measure wing length and total weight of each individual. We might be interested in studying wing length as a function of weight or we might wish to predict wing length for a given weight. In this case the weight, which we treat as an independent variable, is not fixed and certainly not the "cause" of differences in wing length. The weights of the flies will vary for genetic and environmental reasons and will also be subject to measurement error. The general case where both variables show random variation is called Model II regression. Although, as will be discussed in the next chapter, cases of this sort are much better
analyzed by the methods of correlation analysis, we sometimes wish to describe the functional relationship between such variables. To do so, we need to resort to the special techniques of Model II regression. In this book we shall limit ourselves to a treatment of Model I regression.
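A small simulation may help to fix the Model I assumptions in mind: the values of $X$ are fixed by the investigator, and at each $X$ the responses $Y$ scatter independently and normally, with a common variance, about a straight line. The sketch below is only an illustration under those assumptions; the function name, the parametric intercept and slope (20 and 7.5), the standard deviation, and the number of replicates are all hypothetical choices of ours.

```python
import random

def simulate_model_one(x_levels, alpha, beta, sigma, replicates, seed=0):
    """Draw `replicates` values of Y at each fixed X, with
    Y = alpha + beta * X + error and error ~ Normal(0, sigma)."""
    rng = random.Random(seed)           # seeded so that reruns give the same sample
    return {
        x: [alpha + beta * x + rng.gauss(0.0, sigma) for _ in range(replicates)]
        for x in x_levels
    }

# Fixed dosages as in Figure 11.2; all parameter values are hypothetical.
sample = simulate_model_one([2, 4, 6, 8, 10], alpha=20, beta=7.5, sigma=5.0, replicates=4)
for x, ys in sample.items():
    print(x, [round(y, 1) for y in ys])
```

Setting `sigma` to zero removes the scatter entirely, so that every simulated value falls exactly on the line.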
FIGURE 11.3
Weight loss (in mg) of nine batches of 25 Tribolium beetles after six days of starvation at nine different relative humidities. Data from Table 11.1, after Nelson (1964). Abscissa: % relative humidity.
method for finding the mean of $Y$ would be to draw a series of horizontal lines across a graph, calculate the sum of squares of deviations from each, and choose that line yielding the smallest sum of squares. In linear regression, we still draw a straight line through our observations, but it is no longer necessarily horizontal. A sloped regression line will indicate for each value of the independent variable $X_i$ an estimated value of the dependent variable. We should distinguish the estimated value of $Y_i$, which we shall hereafter designate as $\hat{Y}_i$ (read: Y-hat or Y-caret), and the observed values, conventionally designated as $Y_i$. The regression equation therefore should read
$$\hat{Y} = a + bX \qquad (11.1)$$
FIGURE 11.4
which indicates that for given values of $X$, this equation calculates estimated values $\hat{Y}$ (as distinct from the observed values $Y$ in any actual case). The deviation of an observation $Y_i$ from the regression line is $(Y_i - \hat{Y}_i)$ and is generally symbolized as $d_{Y \cdot X}$. These deviations can still be drawn parallel to the $Y$ axis, but they meet the sloped regression line at an angle (see Figure 11.5). The sum of these deviations is again zero ($\sum d_{Y \cdot X} = 0$), and the sum of their squares yields a quantity $\sum (Y - \hat{Y})^2 = \sum d^2_{Y \cdot X}$ analogous to the sum of squares $\sum y^2$. For reasons that will become clear later, $\sum d^2_{Y \cdot X}$ is called the unexplained sum of squares. The least squares linear regression line through a set of points is defined as that straight line which results in the smallest value of $\sum d^2_{Y \cdot X}$. Geometrically, the basic idea is that one would prefer using a line that is in some sense close to as many points as possible. For purposes of ordinary Model I regression analysis, it is most useful to define closeness in terms of the vertical distances from the points to a line, and to use the line that makes the sum of the squares of these deviations as small as possible. A convenient consequence of this criterion is that the line must pass through the point $\bar{X}, \bar{Y}$. Again, it would be possible but impractical to calculate the correct regression slope by pivoting a ruler around the point $\bar{X}, \bar{Y}$ and calculating the unexplained sum of squares $\sum d^2_{Y \cdot X}$ for each of the innumerable possible positions. Whichever position gave the smallest value of $\sum d^2_{Y \cdot X}$ would be the least squares regression line.

The formula for the slope of a line based on the minimum value of $\sum d^2_{Y \cdot X}$ is obtained by means of the calculus. It is
$$b_{Y \cdot X} = \frac{\sum xy}{\sum x^2} \qquad (11.2)$$
Let us calculate $b = \sum xy / \sum x^2$ for our weight loss data. We first compute the deviations from the respective means of $X$ and $Y$, as shown in columns (3) and (4) of Table 11.1. The sums of these deviations,
$\sum x$ and $\sum y$, are slightly different from their expected value of zero because of rounding errors. The squares of these deviations yield sums of squares and variances in columns (5) and (7). In column (6) we have computed the products $xy$, which in this example are all negative because the deviations are of unlike sign. An increase in humidity results in a decrease in weight loss. The sum of these products $\sum xy$ is a new quantity, called the sum of products. This is a poor but well-established term, referring to $\sum xy$, the sum of the products of the deviations, rather than $\sum XY$, the sum of the products of the variates. You will recall that $\sum y^2$ is called the sum of squares, while $\sum Y^2$ is the sum of the squared variates. The sum of products is analogous to the sum of squares. When divided by the degrees of freedom, it yields the covariance, by analogy with the variance resulting from a similar division of the sum of squares. You may recall first having encountered covariances in Section 7.4. Note that the sum of products can be negative as well as positive. If it is negative, this indicates a negative slope of the regression line: as $X$ increases, $Y$ decreases. In this respect it differs from a sum of squares, which can only be positive. From Table 11.1 we find that $\sum xy = -441.8176$, $\sum x^2 = 8301.3889$, and $b = \sum xy / \sum x^2 = -0.053{,}22$. Thus, for a one-unit increase in $X$, there is a decrease of 0.053,22 units of $Y$. Relating it to our actual example, we can say that for a 1% increase in relative humidity, there is a reduction of 0.053,22 mg in weight loss.

You may wish to convince yourself that the formula for the regression coefficient is intuitively reasonable. It is the ratio of the sum of products of deviations for $X$ and $Y$ to the sum of squares of deviations for $X$. If we look at the product for $X_i$, a single value of $X$, we obtain $x_i y_i$. Similarly, the squared deviation for $X_i$ would be $x_i^2$, or $x_i x_i$.
Thus the ratio $x_i y_i / x_i x_i$ reduces to $y_i / x_i$. Although $\sum xy / \sum x^2$ only approximates the average of $y_i / x_i$ for the values of $X_i$, the latter ratio indicates the direction and magnitude of the change in $Y$ for a unit change in $X$. Thus, if $y_i$ on the average equals $x_i$, $b$ will equal 1. When $y_i = -x_i$, $b = -1$. Also, when $|y_i| > |x_i|$, $|b| > 1$; and conversely, when $|y_i| < |x_i|$, $|b| < 1$.

How can we complete the equation $\hat{Y} = a + bX$? We have stated that the regression line will go through the point $\bar{X}, \bar{Y}$. At $\bar{X} = 50.39$, $\hat{Y} = 6.022$; that is, we use $\bar{Y}$, the observed mean of $Y$, as an estimate $\hat{Y}$ of the mean. We can substitute these means into Expression (11.1):
$$\hat{Y} = a + bX$$
$$\bar{Y} = a + b\bar{X}$$
$$a = \bar{Y} - b\bar{X}$$
$$a = 6.022 - (-0.053{,}22)(50.39) = 8.7038$$
Therefore,
$$\hat{Y} = 8.7038 - 0.053{,}22X$$
This is the equation that relates weight loss to relative humidity. Note that when $X$ is zero (humidity zero), the estimated weight loss is greatest. It is then equal to $a = 8.7038$ mg. But as $X$ increases to a maximum of 100, the weight loss decreases to 3.3818 mg.

We can use the regression formula to draw the regression line: simply estimate $\hat{Y}$ at two convenient points of $X$, such as $X = 0$ and $X = 100$, and draw a straight line between them. This line has been added to the observed data and is shown in Figure 11.6. Note that it goes through the point $\bar{X}, \bar{Y}$. In fact, for drawing the regression line, we frequently use the intersection of the two means and one other point.
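The slope and intercept can be recomputed from the deviations, following Expression (11.2) and the substitution of the means. The sketch below is illustrative (the function and variable names are our own) and uses the weight loss and relative humidity values listed in Box 11.1. Because it carries full precision rather than the rounded table entries, the last decimal place may differ slightly from the printed values.

```python
# Percent relative humidity (X) and weight loss in mg (Y), as listed in Box 11.1.
humidity = [0, 12.0, 29.5, 43.0, 53.0, 62.5, 75.5, 85.0, 93.0]
weight_loss = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]

def fit_line(xs, ys):
    """Least squares slope b = sum(xy) / sum(x^2) and intercept a = Ybar - b * Xbar,
    where x and y are deviations from the respective means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # sum of products
    sum_x2 = sum((x - x_bar) ** 2 for x in xs)                        # sum of squares of X
    b = sum_xy / sum_x2
    a = y_bar - b * x_bar
    return a, b

a, b = fit_line(humidity, weight_loss)
print(round(b, 5), round(a, 3))
print(round(a + b * 0, 3), round(a + b * 100, 3))   # the two points used to draw the line
```

Evaluating the fitted equation at $X = 0$ and $X = 100$ gives the two end points suggested in the text for drawing the regression line.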
$$\hat{Y} - \bar{Y} = bx$$
$$\hat{y} = bx \qquad (11.3)$$
where $\hat{y}$ is defined as the deviation $\hat{Y} - \bar{Y}$. Next, using Expression (11.1), we estimate $\hat{Y}$ for every one of our given values of $X$. The estimated values $\hat{Y}$ are shown in column (8) of Table 11.1. Compare them with the observed values
of $Y$ in column (2). Overall agreement between the two columns of values is good. Note that except for rounding errors, $\sum \hat{Y} = \sum Y$ and hence the mean of the $\hat{Y}$ values equals $\bar{Y}$. However, our actual $Y$ values usually are different from the estimated values $\hat{Y}$. This is due to individual variation around the regression line. Yet, the regression line is a better base from which to compute deviations than the arithmetic average $\bar{Y}$, since the value of $X$ has been taken into account in constructing it.

When we compute deviations of each observed $Y$ value from its estimated value $(Y - \hat{Y}) = d_{Y \cdot X}$ and list these in column (9), we notice that these deviations exhibit one of the properties of deviations from a mean: they sum to zero except for rounding errors. Thus $\sum d_{Y \cdot X} = 0$, just as $\sum y = 0$. Next, we compute in column (10) the squares of these deviations and sum them to give a new sum of squares, $\sum d^2_{Y \cdot X} = 0.6160$. When we compare $\sum (Y - \bar{Y})^2 = \sum y^2 = 24.1307$ with $\sum (Y - \hat{Y})^2 = \sum d^2_{Y \cdot X} = 0.6160$, we note that the new sum of squares is much less than the previous one. What has caused this reduction? Allowing for different magnitudes of $X$ has eliminated most of the variance of $Y$ from the sample. Remaining is the unexplained sum of squares $\sum d^2_{Y \cdot X}$, which expresses that portion of the total SS of $Y$ that is not accounted for by differences in $X$. It is unexplained with respect to $X$. The difference between the total SS, $\sum y^2$, and the unexplained SS, $\sum d^2_{Y \cdot X}$, is not surprisingly called the explained sum of squares, $\sum \hat{y}^2$, and is based on the deviations $\hat{y} = \hat{Y} - \bar{Y}$. The computation of these deviations and their squares is shown in columns (11) and (12). Note that $\sum \hat{y}$ approximates zero and that $\sum \hat{y}^2 = 23.5130$. Add the unexplained SS (0.6160) to this and you obtain $\sum y^2 = \sum \hat{y}^2 + \sum d^2_{Y \cdot X} = 24.1290$, which is equal (except for rounding errors) to the independently calculated value of 24.1307 in column (7). We shall return to the meaning of the unexplained and explained sums of squares in later sections.
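The partition of the total sum of squares into explained and unexplained portions can be verified numerically. The following sketch is illustrative only (names are our own); it uses the Box 11.1 data and full-precision arithmetic, so the explained sum of squares comes out near 23.514 rather than the 23.5130 obtained from the rounded table columns.

```python
# Percent relative humidity (X) and weight loss in mg (Y), as listed in Box 11.1.
humidity = [0, 12.0, 29.5, 43.0, 53.0, 62.5, 75.5, 85.0, 93.0]
weight_loss = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]

def sums_of_squares(xs, ys):
    """Return (total, unexplained, explained) sums of squares of Y
    for the least squares line fitted to (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    y_hat = [a + b * x for x in xs]                                  # estimated values
    total = sum((y - y_bar) ** 2 for y in ys)                        # sum of y^2
    unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))     # sum of d^2
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)               # sum of yhat^2
    return total, unexplained, explained

total, unexplained, explained = sums_of_squares(humidity, weight_loss)
print(round(total, 4), round(unexplained, 4), round(explained, 4))
```

The explained and unexplained portions add up to the total apart from floating-point error, which is the identity used in the text.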
We conclude this section with a discussion of calculator formulas for computing the regression equation in cases where there is a single value of Y for each value of X. The regression coefficient $\sum xy / \sum x^2$ can be rewritten as

$$b_{Y \cdot X} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} \qquad (11.4)$$

The denominator of this expression is the sum of squares of X. Its computational formula, as first encountered in Section 3.9, is $\sum x^2 = \sum X^2 - (\sum X)^2/n$. We shall now learn an analogous formula for the numerator of Expression (11.4), the sum of products. The customary formula is

$$\sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n} \qquad (11.5)$$

The quantity $\sum XY$ is simply the accumulated product of the two variables. Expression (11.5) is derived in Appendix A1.5. The actual computations for a
regression equation (single value of Y per value of X) are illustrated in Box 11.1, employing the weight loss data of Table 11.1. To compute regression statistics, we need six quantities initially. These are $n$, $\sum X$, $\sum X^2$, $\sum Y$, $\sum Y^2$, and $\sum XY$. From these the regression equation is calculated as shown in Box 11.1, which also illustrates how to compute the explained and unexplained sums of squares.
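BOX 11.1
Computation of regression statistics. Single value of Y for each value of X.

Data from Table 11.1.

| Percent relative humidity (X) | Weight loss in mg (Y) |
|---|---|
| 0 | 8.98 |
| 12.0 | 8.14 |
| 29.5 | 6.67 |
| 43.0 | 6.08 |
| 53.0 | 5.90 |
| 62.5 | 5.83 |
| 75.5 | 4.68 |
| 85.0 | 4.20 |
| 93.0 | 3.72 |

Basic computations

1. Compute sample size, sums, sums of the squared observations, and the sum of the XY's.

$n = 9$, $\sum X = 453.5$, $\sum Y = 54.20$, $\sum X^2 = 31{,}152.75$, $\sum Y^2 = 350.5350$, $\sum XY = 2289.260$

2. The means, sums of squares, and sum of products are

$\bar{X} = 50.389$, $\bar{Y} = 6.022$, $\sum x^2 = 8301.3889$, $\sum y^2 = 24.1306$

$$\sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 2289.260 - \frac{(453.5)(54.20)}{9} = -441.8178$$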
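That the unexplained sum of squares can be obtained as

$$\sum d^2_{Y \cdot X} = \sum y^2 - \frac{(\sum xy)^2}{\sum x^2} \qquad (11.6)$$

is demonstrated in Appendix A1.6. The term subtracted from $\sum y^2$ is obviously the explained sum of squares, as shown in Expression (11.7) below:

$$\sum \hat{y}^2 = \sum b^2 x^2 = b^2 \sum x^2 = \frac{(\sum xy)^2}{(\sum x^2)^2} \sum x^2 = \frac{(\sum xy)^2}{\sum x^2} \qquad (11.7)$$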
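As a check on the calculator formulas, the following Python sketch recomputes the basic quantities of Box 11.1 from the nine pairs of observations. The variable names are our own choices for illustration and are not notation from the text; only the Python standard library is assumed.

```python
# Percent relative humidity (X) and weight loss in mg (Y), Box 11.1.
X = [0, 12.0, 29.5, 43.0, 53.0, 62.5, 75.5, 85.0, 93.0]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]

# The six quantities needed initially.
n = len(X)
sum_X = sum(X)
sum_Y = sum(Y)
sum_X2 = sum(x * x for x in X)
sum_Y2 = sum(y * y for y in Y)
sum_XY = sum(x * y for x, y in zip(X, Y))

# Calculator formulas for the sums of squares and the sum of products.
ss_x = sum_X2 - sum_X ** 2 / n        # sum of squares of X
ss_y = sum_Y2 - sum_Y ** 2 / n        # total sum of squares of Y
sp_xy = sum_XY - sum_X * sum_Y / n    # sum of products, Expression (11.5)

# Regression statistics derived from them.
b = sp_xy / ss_x                      # regression coefficient
a = sum_Y / n - b * sum_X / n         # Y intercept
ss_explained = sp_xy ** 2 / ss_x      # Expression (11.7)
ss_unexplained = ss_y - ss_explained  # total SS minus explained SS

print(f"sum of squares of X = {ss_x:.4f}")
print(f"sum of products     = {sp_xy:.4f}")
print(f"b = {b:.5f}, a = {a:.4f}")
print(f"explained SS = {ss_explained:.4f}, unexplained SS = {ss_unexplained:.4f}")
```

The printed sums of squares and sum of products should agree, within rounding, with the values listed in Box 11.1.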
11.4 More than one value of Y for each value of X
We now take up Model I regression as originally defined in Section 11.2 and illustrated by Figure 11.2. For each value of the treatment X we sample Y repeatedly, obtaining a sample distribution of Y values at each of the chosen points of X. We have selected an experiment from the laboratory of one of us (Sokal) in which Tribolium beetles were reared from eggs to adulthood at four different densities. The percentage survival to adulthood was calculated for varying numbers of replicates at these densities. Following Section 10.2, these percentages were given arcsine transformations, which are listed in Box 11.2. These transformed values are more likely to be normal and homoscedastic than are percentages. The arrangement of these data is very much like that of a single-classification Model I anova. There are four different densities and several survival values at each density. We now would like to determine whether there are differences in survival among the four groups, and also whether we can establish a regression of survival on density.

A first approach, therefore, is to carry out an analysis of variance, using the methods of Section 8.3 and Table 8.1. Our aim in doing this is illustrated in Figure 11.7 (see page 247). If the analysis of variance were not significant, this would indicate, as shown in Figure 11.7A, that the means are not significantly different from each other, and it would be unlikely that a regression line fitted to these data would have a slope significantly different from zero. However, although both the analysis of variance and linear regression test the same null hypothesis (equality of means), the regression test is more powerful (less type II error; see Section 6.8) against the alternative hypothesis that there is a linear relationship between the group means and the independent variable X. Thus, when the means increase or decrease slightly as X increases, it may be that they are not different enough for the mean square among groups to be significant by anova but that a significant regression can still be found. When we find a marked regression of the means on X, as shown in Figure 11.7B, we usually will find a significant difference among the means by an anova. However, we cannot turn
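BOX 11.2
Computation of regression with more than one value of Y per value of X.

The variates Y are arcsine transformations of the percentage survival of the beetle Tribolium castaneum at 4 densities (X = number of eggs per gram of flour medium).

| Density = X (a = 4) | 5/g | 20/g | 50/g | 100/g |
|---|---|---|---|---|
| Survival, in degrees | 61.68 | 68.21 | 58.69 | 53.13 |
| | 58.37 | 66.72 | 58.37 | 49.89 |
| | 69.30 | 63.44 | 58.37 | 49.82 |
| | 61.68 | 60.84 | | |
| | 69.30 | | | |
| $\sum Y$ | 320.33 | 259.21 | 175.43 | 152.84 |
| $n_i$ | 5 | 4 | 3 | 3 |
| $\bar{Y}$ | 64.07 | 64.80 | 58.48 | 50.95 |

$\sum n_i = 15$, $\sum\sum Y = 907.81$

Source: Data by Sokal (1967).

Anova table

| Source of variation | df | SS | MS | $F_s$ |
|---|---|---|---|---|
| Among groups | 3 | 423.7016 | 141.2339 | 11.20** |
| Within groups | 11 | 138.6867 | 12.6079 | |
| Total | 14 | 562.3883 | | |

The groups differ significantly with respect to survival. We proceed to test whether the differences among the survival values can be accounted for by linear regression on density. If $F_s < [1/(a - 1)] F_{\alpha[1, \sum n_i - a]}$, it is impossible for regression to be significant.

Computation for regression analysis

1. Sum of X weighted by sample size $= \sum n_i X = 5(5) + 4(20) + 3(50) + 3(100) = 555$

2. Sum of $X^2$ weighted by sample size $= \sum n_i X^2 = 5(5)^2 + 4(20)^2 + 3(50)^2 + 3(100)^2 = 39{,}225$

3. Sum of products of X and $\bar{Y}$ weighted by sample size $= \sum n_i X \bar{Y} = \sum X (\sum Y) = 5(320.33) + 20(259.21) + 50(175.43) + 100(152.84) = 30{,}841.35$

4. Correction term for X $= CT_X = \dfrac{(\text{quantity 1})^2}{\sum n_i} = \dfrac{(555)^2}{15} = 20{,}535.00$

5. Sum of squares of X $= \sum x^2 = \text{quantity 2} - CT_X = 39{,}225 - 20{,}535 = 18{,}690$

6. Sum of products $= \sum xy = \text{quantity 3} - \dfrac{\text{quantity 1} \times \sum\sum Y}{\sum n_i} = 30{,}841.35 - \dfrac{(555)(907.81)}{15} = -2747.62$

7. Explained sum of squares $= \sum \hat{y}^2 = \dfrac{(\sum xy)^2}{\sum x^2} = \dfrac{(\text{quantity 6})^2}{\text{quantity 5}} = \dfrac{(-2747.62)^2}{18{,}690} = 403.9281$

8. Unexplained sum of squares $= \sum d^2_{Y \cdot X} = SS_{\text{groups}} - \sum \hat{y}^2 = 423.7016 - 403.9281 = 19.7735$

Completed anova table with regression

| Source of variation | | df | SS | MS | $F_s$ |
|---|---|---|---|---|---|
| Among densities (groups) | $\bar{Y} - \bar{\bar{Y}}$ | 3 | 423.7016 | 141.2339 | 11.20** |
| Linear regression | $\hat{Y} - \bar{\bar{Y}}$ | 1 | 403.9281 | 403.9281 | 40.86* |
| Deviations from regression | $\bar{Y} - \hat{Y}$ | 2 | 19.7735 | 9.8868 | < 1 ns |
| Within groups | $Y - \bar{Y}$ | 11 | 138.6867 | 12.6079 | |
| Total | $Y - \bar{\bar{Y}}$ | 14 | 562.3883 | | |

In addition to the familiar mean squares, $MS_{\text{groups}}$ and $MS_{\text{within}}$, we now have the mean square due to linear regression, $MS_{\hat{Y}}$, and the mean square for deviations from regression, $MS_{Y \cdot X}$ $(= s^2_{Y \cdot X})$. To test if the deviations from linear regression are significant, compare the ratio $F_s = MS_{Y \cdot X}/MS_{\text{within}}$ with $F_{\alpha[a - 2, \sum n_i - a]}$. Since we find $F_s < 1$, we accept the null hypothesis that the deviations from linear regression are zero. To test for the presence of linear regression, we therefore tested $MS_{\hat{Y}}$ over the mean square of deviations from regression $s^2_{Y \cdot X}$ and, since $F_s = 403.9281/9.8868 = 40.86$ is greater than $F_{0.05[1,2]} = 18.5$, we clearly reject the null hypothesis that there is no regression, or that $\beta = 0$.

9. Regression coefficient (slope of regression line) $= b_{Y \cdot X} = \dfrac{\sum xy}{\sum x^2} = \dfrac{\text{quantity 6}}{\text{quantity 5}} = \dfrac{-2747.62}{18{,}690} = -0.147{,}01$

10. Y intercept $= a = \bar{\bar{Y}} - b_{Y \cdot X} \bar{X} = \dfrac{\sum\sum Y}{\sum n_i} - \dfrac{\text{quantity 9} \times \text{quantity 1}}{\sum n_i} = \dfrac{907.81}{15} - \dfrac{(-0.147{,}01)(555)}{15} = 60.5207 + 5.4394 = 65.9601$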
this argument around and say that a significant difference among means as shown by an anova necessarily indicates that a significant linear regression can be fitted to these data. In Figure 11.7C, the means follow a U-shaped function (a parabola). Though the means would likely be significantly different from each other, clearly a straight line fitted to these data would be a horizontal line halfway between the upper and the lower points. In such a set of data, linear regression can explain only little of the variation of the dependent variable. However, a curvilinear parabolic regression would fit these data and remove
FIGURE 11.7
most of the variance of Y. A similar case is shown in Figure 11.7D, in which the means describe a periodically changing phenomenon, rising and falling alternatingly. Again the regression line for these data has slope zero. A curvilinear (cyclical) regression could also be fitted to such data, but our main purpose in showing this example is to indicate that there could be heterogeneity among the means of Y apparently unrelated to the magnitude of X. Remember that in real examples you will rarely ever get a regression as clear-cut as the linear case in 11.7B, or the curvilinear one in 11.7C, nor will you necessarily get heterogeneity of the type shown in 11.7D, in which any straight line fitted to the data would be horizontal. You are more likely to get data in which linear regression can be demonstrated, but which will not fit a straight line well. Sometimes the residual deviations of the means around linear regression can be removed by changing from linear to curvilinear regression (as is suggested by the pattern of points in Figure 11.7E), and sometimes they may remain as inexplicable residual heterogeneity around the regression line, as indicated in Figure 11.7F.

We carry out the computations following the by now familiar outline for analysis of variance and obtain the anova table shown in Box 11.2. The three degrees of freedom among the four groups yield a mean square that would be
highly significant if tested over the within-groups mean square. The additional steps for the regression analysis follow in Box 11.2. We compute the sum of squares of X, the sum of products of X and Y, the explained sum of squares of Y, and the unexplained sum of squares of Y. The formulas will look unfamiliar because of the complication of the several Y's per value of X. The computations for the sum of squares of X involve the multiplication of X by the number of items in the study. Thus, though there may appear to be only four densities, there are, in fact, as many densities (although of only four magnitudes) as there are values of Y in the study. Having completed the computations, we again present the results in the form of an anova table, as shown in Box 11.2. Note that the major quantities in this table are the same as in a single-classification anova, but in addition we now have a sum of squares representing linear regression, which is always based on one degree of freedom. This sum of squares is subtracted from the SS among groups, leaving a residual sum of squares (of two degrees of freedom in this case) representing the deviations from linear regression.

We should understand what these sources of variation represent. The linear model for regression with replicated Y per X is derived directly from Expression (7.2), which is

$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$$

The treatment effect $\alpha_i = \beta x_i + D_i$, where $\beta x_i$ is the component due to linear regression and $D_i$ is the deviation of the mean $\bar{Y}_i$ from regression, which is assumed to have a mean of zero and a variance of $\sigma^2_D$. Thus we can write

$$Y_{ij} = \mu + \beta x_i + D_i + \epsilon_{ij}$$

The SS due to linear regression represents that portion of the SS among groups that is explained by linear regression on X. The SS due to deviations from regression represents the residual variation or scatter around the regression line as illustrated by the various examples in Figure 11.7. The SS within groups is a measure of the variation of the items around each group mean.

We first test whether the mean square for deviations from regression ($MS_{Y \cdot X} = s^2_{Y \cdot X}$) is significant by computing the variance ratio of $MS_{Y \cdot X}$ over the within-groups MS. In our case, the deviations from regression are clearly not significant, since the mean square for deviations is less than that within groups. We now test the mean square for regression, $MS_{\hat{Y}}$, over the mean square for deviations from regression and find it to be significant. Thus linear regression on density has clearly removed a significant portion of the variation of survival values. Significance of the mean square for deviations from regression could mean either that Y is a curvilinear function of X or that there is a large amount of random heterogeneity around the regression line (as already discussed in connection with Figure 11.7; actually a mixture of both conditions may prevail).

Some workers, when analyzing regression examples with several Y variates at each value of X, proceed as follows when the deviations from regression are
not significant. They add the sum of squares for deviations and that within groups as well as their degrees of freedom. Then they calculate a pooled mean square by dividing the pooled sums of squares by the pooled degrees of freedom. The mean square for regression is then tested over the pooled mean square, which, since it is based on more degrees of freedom, will be a better estimator of the error variance and should permit more sensitive tests. Other workers prefer never to pool, arguing that pooling the two sums of squares confounds the pseudoreplication of having several Y variates at each value of X with the true replication of having more X points to determine the slope of the regression line. Thus if we had only three X points but one hundred Y variates at each, we would be able to estimate the mean value of Y for each of the three X values very well, but we would be estimating the slope of the line on the basis of only three points, a risky procedure. The second attitude, forgoing pooling, is more conservative and will decrease the likelihood that a nonexistent regression will be declared significant.

We complete the computation of the regression coefficient and regression equation as shown at the end of Box 11.2. Our conclusions are that as density increases, survival decreases, and that this relationship can be expressed by a significant linear regression of the form $\hat{Y} = 65.9601 - 0.147{,}01X$, where X is density per gram and Y is the arcsine transformation of percentage survival. This relation is graphed in Figure 11.8. The sums of products and regression slopes of both examples discussed so far have been negative, and you may begin to believe that this is always so. However, it is only an accident of choice of these two examples. In the exercises at the end of this chapter a positive regression coefficient will be encountered.

When we have equal sample sizes of Y values for each value of X, the computations become simpler. First we carry out the anova in the manner of Box 8.1. Steps 1 through 8 in Box 11.2 become simplified because the unequal sample sizes $n_i$ are replaced by a constant sample size n, which can generally be factored out of the various expressions. Also, $\sum n_i = an$. Significance tests applied to such cases are also simplified.
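The computations of Box 11.2 can be retraced with a short Python sketch. It is offered only as an illustration under our own naming choices (the dictionary `groups` and the names `q1` to `q3` for quantities 1 to 3 are assumptions, not notation from the text), and it handles unequal sample sizes by weighting each density by its number of Y values.

```python
# Arcsine-transformed percentage survival of Tribolium castaneum (Box 11.2),
# keyed by density X (number of eggs per gram of flour medium).
groups = {
    5: [61.68, 58.37, 69.30, 61.68, 69.30],
    20: [68.21, 66.72, 63.44, 60.84],
    50: [58.69, 58.37, 58.37],
    100: [53.13, 49.89, 49.82],
}

a = len(groups)                                   # number of groups
N = sum(len(ys) for ys in groups.values())        # sum of the n_i
grand_total = sum(sum(ys) for ys in groups.values())

# Single-classification anova.
ct_y = grand_total ** 2 / N
ss_total = sum(y * y for ys in groups.values() for y in ys) - ct_y
ss_groups = sum(sum(ys) ** 2 / len(ys) for ys in groups.values()) - ct_y
ss_within = ss_total - ss_groups

# Regression quantities, each X weighted by its sample size.
q1 = sum(len(ys) * x for x, ys in groups.items())      # quantity 1
q2 = sum(len(ys) * x * x for x, ys in groups.items())  # quantity 2
q3 = sum(x * sum(ys) for x, ys in groups.items())      # quantity 3
ss_x = q2 - q1 ** 2 / N                                # quantity 5
sp_xy = q3 - q1 * grand_total / N                      # quantity 6
ss_regression = sp_xy ** 2 / ss_x                      # quantity 7
ss_deviations = ss_groups - ss_regression              # quantity 8

# Mean squares and variance ratios of the completed anova table.
ms_regression = ss_regression / 1
ms_deviations = ss_deviations / (a - 2)
ms_within = ss_within / (N - a)
F_deviations = ms_deviations / ms_within
F_regression = ms_regression / ms_deviations

# Regression equation.
b = sp_xy / ss_x
intercept = grand_total / N - b * q1 / N

print(f"SS regression = {ss_regression:.4f}, SS deviations = {ss_deviations:.4f}")
print(f"F (deviations) = {F_deviations:.2f}, F (regression) = {F_regression:.2f}")
print(f"Y-hat = {intercept:.4f} + ({b:.5f}) X")
```

Here the mean square for regression is tested over the mean square for deviations from regression, as in Box 11.2; a worker who prefers pooling would instead divide by `(ss_deviations + ss_within) / (N - 2)`.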
FIGURE 11.8 (abscissa: density, number of eggs/g of medium)
11.5 Tests of significance in regression
We have so far interpreted regression as a method for providing an estimate, $\hat{Y}_i$, given a value of $X_i$. Another interpretation is as a method for explaining some of the variation of the dependent variable Y in terms of the variation of the independent variable X. The SS of a sample of Y values, $\sum y^2$, is computed by squaring and summing deviations $y = Y - \bar{Y}$. In Figure 11.9 we can see that the deviation $y$ can be decomposed into two parts, $\hat{y}$ and $d_{Y \cdot X}$. It is also clear from Figure 11.9 that the deviation $\hat{y} = \hat{Y} - \bar{Y}$ represents the deviation of the estimated value $\hat{Y}$ from the mean of Y. The height of $\hat{y}$ is clearly a function of $x$. We have already seen that $\hat{y} = bx$ (Expression (11.3)). In analytical geometry this is called the point-slope form of the equation. If $b$, the slope of the regression line, were steeper, $\hat{y}$ would be relatively larger for a given value of $x$. The remaining portion of the deviation $y$ is $d_{Y \cdot X}$. It represents the residual variation of the variable Y after the explained variation has been subtracted. We can see that $y = \hat{y} + d_{Y \cdot X}$ by writing out these deviations explicitly as $Y - \bar{Y} = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$. For each of these deviations we can compute a corresponding sum of squares. Appendix A1.6 gives the calculator formula for the unexplained sum of squares,
$$\sum d^2_{Y \cdot X} = \sum y^2 - \frac{(\sum xy)^2}{\sum x^2}$$

Transposed, this yields

$$\sum y^2 = \frac{(\sum xy)^2}{\sum x^2} + \sum d^2_{Y \cdot X}$$

Of course, $\sum y^2$ corresponds to $y$, $\sum d^2_{Y \cdot X}$ to $d_{Y \cdot X}$, and $(\sum xy)^2/\sum x^2$ corresponds to $\hat{y}$ (as shown in the previous section).

FIGURE 11.9 Schematic diagram to show relations involved in partitioning the sum of squares of the dependent variable.

Thus we are able to partition the sum of squares of the dependent variable in regression in a way analogous to the partition of the total SS in analysis of variance. You may wonder how the additive relation of the deviations can be matched by an additive relation of their squares without the presence of any cross products. Some simple algebra in Appendix A1.7 will show that the cross products cancel out. The magnitude of the unexplained deviation $d_{Y \cdot X}$ is independent of the magnitude of the explained deviation $\hat{y}$, just as in anova the magnitude of the deviation of an item from the sample mean is independent of the magnitude of the deviation of the sample mean from the grand mean. This relationship between regression and analysis of variance can be carried further. We can undertake an analysis of variance of the partitioned sums of squares as follows:
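| Source of variation | df | SS | MS | Expected MS |
|---|---|---|---|---|
| Explained, $\hat{Y} - \bar{Y}$ (estimated Y from mean of Y) | $1$ | $\sum \hat{y}^2 = \dfrac{(\sum xy)^2}{\sum x^2}$ | $s^2_{\hat{Y}}$ | $\sigma^2_{Y \cdot X} + \beta^2 \sum x^2$ |
| Unexplained, error, $Y - \hat{Y}$ (observed Y from estimated Y) | $n - 2$ | $\sum d^2_{Y \cdot X} = \sum y^2 - \sum \hat{y}^2$ | $s^2_{Y \cdot X}$ | $\sigma^2_{Y \cdot X}$ |
| Total, $Y - \bar{Y}$ (observed Y from mean of Y) | $n - 1$ | $\sum y^2 = \sum Y^2 - \dfrac{(\sum Y)^2}{n}$ | $s^2_Y$ | |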
The explained mean square, or mean square due to linear regression, measures the amount of variation in Y accounted for by variation of X. It is tested over the unexplained mean square, which measures the residual variation and is used as an error MS. The mean square due to linear regression, $s^2_{\hat{Y}}$, is based on one degree of freedom, and consequently $(n - 2)$ df remain for the error MS, since the total sum of squares possesses $n - 1$ degrees of freedom. The test is of the null hypothesis $H_0$: $\beta = 0$. When we carry out such an anova on the weight loss data of Box 11.1, we obtain the following results:
| Source of variation | df | SS | MS | $F_s$ |
|---|---|---|---|---|
| Explained (due to linear regression) | 1 | 23.5145 | 23.5145 | 267.18** |
| Unexplained (error around regression line) | 7 | 0.6161 | 0.08801 | |
| Total | 8 | 24.1306 | | |
The significance test is $F_s = s^2_{\hat{Y}}/s^2_{Y \cdot X}$. It is clear from the observed value of $F_s$ that a large and significant portion of the variance of Y has been explained by regression on X.
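The partition just described can be verified numerically. The sketch below, with variable names of our own choosing, computes the explained and unexplained deviations for the weight loss data directly from their definitions, confirms that their cross products sum to essentially zero, and forms the variance ratio of the anova table above.

```python
# Percent relative humidity (X) and weight loss in mg (Y), Box 11.1.
X = [0, 12.0, 29.5, 43.0, 53.0, 62.5, 75.5, 85.0, 93.0]
Y = [8.98, 8.14, 6.67, 6.08, 5.90, 5.83, 4.68, 4.20, 3.72]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

ss_x = sum((x - mean_x) ** 2 for x in X)
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
b = sp_xy / ss_x

# Estimated values and the two parts of each deviation y = Y - mean of Y.
Y_hat = [mean_y + b * (x - mean_x) for x in X]
explained_dev = [yh - mean_y for yh in Y_hat]        # y-hat
unexplained_dev = [y - yh for y, yh in zip(Y, Y_hat)]  # d_{Y.X}

ss_total = sum((y - mean_y) ** 2 for y in Y)
ss_explained = sum(v * v for v in explained_dev)
ss_unexplained = sum(v * v for v in unexplained_dev)

# The cross products cancel, so the squared deviations are additive.
cross_products = sum(u * v for u, v in zip(explained_dev, unexplained_dev))

# Anova of the partitioned sums of squares.
ms_explained = ss_explained / 1
ms_unexplained = ss_unexplained / (n - 2)
F_s = ms_explained / ms_unexplained

print(f"explained SS + unexplained SS = {ss_explained + ss_unexplained:.4f}")
print(f"total SS                      = {ss_total:.4f}")
print(f"sum of cross products         = {cross_products:.2e}")
print(f"F_s = {F_s:.2f}")
```

Because the deviations are computed without intermediate rounding, small discrepancies in the last decimal place relative to the hand computation are to be expected.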
BOX 11.3 (standard errors and degrees of freedom of regression statistics; left-hand column for more than one Y value per value of X, right-hand column for a single Y value per value of X)
We now proceed to the standard errors for various regression statistics, their employment in tests of hypotheses, and the computation of confidence limits. Box 11.3 lists these standard errors in two columns. The right-hand column is for the case with a single Y value for each value of X. The first row of the table gives the standard error of the regression coefficient, which is simply the square root of the ratio of the unexplained variance to the sum of squares of X. Note that the unexplained variance $s^2_{Y \cdot X}$ is a fundamental quantity that is a part of all standard errors in regression. The standard error of the regression coefficient permits us to test various hypotheses and to set confidence limits to our sample estimate of $b$. The computation of $s_b$ is illustrated in step 1 of Box 11.4, using the weight loss example of Box 11.1.
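BOX 11.4
Significance tests and computation of confidence limits of regression statistics. Single value of Y for each value of X.

Based on standard errors and degrees of freedom of Box 11.3; using example of Box 11.1.

$n = 9$, $\bar{X} = 50.389$, $\bar{Y} = 6.022$, $b_{Y \cdot X} = -0.053{,}22$, $\sum x^2 = 8301.3889$

$$s^2_{Y \cdot X} = \frac{\sum d^2_{Y \cdot X}}{(n - 2)} = \frac{0.6161}{7} = 0.088{,}01$$

1. Standard error of the regression coefficient:

$$s_b = \sqrt{\frac{s^2_{Y \cdot X}}{\sum x^2}} = \sqrt{\frac{0.088{,}01}{8301.3889}} = 0.003{,}256{,}1$$

2. Testing significance of the regression coefficient:

$$t_s = \frac{b - 0}{s_b} = \frac{-0.053{,}22}{0.003{,}256{,}1} = -16.345$$

$t_{0.001[7]} = 5.408$, $P < 0.001$

3. 95% confidence limits for regression coefficient:

$t_{0.05[7]} s_b = 2.365(0.003{,}256{,}1) = 0.007{,}70$
$L_1 = b - t_{0.05[7]} s_b = -0.053{,}22 - 0.007{,}70 = -0.060{,}92$
$L_2 = b + t_{0.05[7]} s_b = -0.053{,}22 + 0.007{,}70 = -0.045{,}52$

4. Standard error of the sampled mean $\bar{Y}$ (at $\bar{X}$):

$$s_{\bar{Y}} = \sqrt{\frac{s^2_{Y \cdot X}}{n}} = \sqrt{\frac{0.088{,}01}{9}} = 0.098{,}888{,}3$$

5. 95% confidence limits for the mean $\mu_Y$ corresponding to $\bar{X}$ ($\bar{Y} = 6.022$):

$t_{0.05[7]} s_{\bar{Y}} = 2.365(0.098{,}888{,}3) = 0.233{,}871$
$L_1 = \bar{Y} - t_{0.05[7]} s_{\bar{Y}} = 6.022 - 0.2339 = 5.7881$
$L_2 = \bar{Y} + t_{0.05[7]} s_{\bar{Y}} = 6.022 + 0.2339 = 6.2559$

6. Standard error of $\hat{Y}_i$, an estimated Y for a given value of $X_i$; here $X_i = 100\%$ relative humidity, for which $\hat{Y}_i = 3.3817$:

$$s_{\hat{Y}} = \sqrt{s^2_{Y \cdot X}\left[\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum x^2}\right]} = \sqrt{0.088{,}01\left[\frac{1}{9} + \frac{(100 - 50.389)^2}{8301.3889}\right]} = \sqrt{0.088{,}01(0.407{,}60)} = \sqrt{0.035{,}873} = 0.189{,}40$$

7. 95% confidence limits for $\mu_{Y_i}$ corresponding to the estimate $\hat{Y}_i = 3.3817$ at $X_i = 100\%$ relative humidity:

$t_{0.05[7]} s_{\hat{Y}} = 2.365(0.189{,}40) = 0.447{,}93$
$L_1 = \hat{Y}_i - t_{0.05[7]} s_{\hat{Y}} = 3.3817 - 0.4479 = 2.9338$
$L_2 = \hat{Y}_i + t_{0.05[7]} s_{\hat{Y}} = 3.3817 + 0.4479 = 3.8296$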
The significance test illustrated in step 2 tests the "significance" of the regression coefficient; that is, it tests the null hypothesis that the sample value of $b$ comes from a population with a parametric value $\beta = 0$ for the regression coefficient. This is a t test, the appropriate degrees of freedom being $n - 2 = 7$. If we cannot reject the null hypothesis, there is no evidence that the regression is significantly deviant from zero in either the positive or negative direction. Our conclusions for the weight loss data are that a highly significant negative regression is present. We saw earlier (Section 8.4) that $t_s^2 = F_s$. When we square $t_s = -16.345$ from Box 11.4, we obtain 267.16, which (within rounding error) equals the value of $F_s$ found in the anova earlier in this section. The significance test in step 2 of Box 11.4 could, of course, also be used to test whether $b$ is significantly different from a parametric value $\beta$ other than zero.
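Steps 1 through 3 of Box 11.4 amount to only a few lines of arithmetic. The following sketch starts from the summary values quoted in that box and uses the tabled critical value $t_{0.05[7]} = 2.365$ given there rather than computing it; the variable names are ours, and the optional argument `beta` (a hypothetical parametric value to test against) is an illustrative addition.

```python
import math

# Summary statistics for the weight loss data, as quoted in Box 11.4.
n = 9
b = -0.05322              # regression coefficient
ss_x = 8301.3889          # sum of squares of X
ms_unexplained = 0.08801  # unexplained mean square
t_crit = 2.365            # tabled t for 0.05 and n - 2 = 7 df (from Box 11.4)


def t_for_slope(b, s_b, beta=0.0):
    """t statistic for the hypothesis that the parametric slope is beta."""
    return (b - beta) / s_b


# Step 1: standard error of the regression coefficient.
s_b = math.sqrt(ms_unexplained / ss_x)

# Step 2: test of the null hypothesis beta = 0, with n - 2 degrees of freedom.
t_s = t_for_slope(b, s_b)

# Step 3: 95% confidence limits for the regression coefficient.
half_width = t_crit * s_b
lower, upper = b - half_width, b + half_width

print(f"s_b = {s_b:.7f}")
print(f"t_s = {t_s:.3f}, t_s squared = {t_s ** 2:.2f}")
print(f"95% limits: {lower:.5f} to {upper:.5f}")
```

Squaring the t statistic gives a value close to the $F_s$ of the regression anova, which illustrates the equivalence of the two tests.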
Setting confidence limits to the regression coefficient presents no new features, since $b$ is normally distributed. The computation is shown in step 3 of Box 11.4. In view of the small magnitude of $s_b$, the confidence interval is quite narrow. The confidence limits are shown in Figure 11.10 as dashed lines representing the 95% bounds of the slope. Note that the regression line as well as its
FIGURE 11.10 (abscissa: % relative humidity)
confidence limits passes through the means of X and Y. Variation in $b$ therefore rotates the regression line about the point $\bar{X}$, $\bar{Y}$.
Next, we calculate a standard error for the observed sample mean $\bar{Y}$. You will recall from Section 6.1 that $s^2_{\bar{Y}} = s^2_Y/n$. However, now that we have regressed Y on X, we are able to account for (that is, hold constant) some of the variation of Y in terms of the variation of X. The variance of Y around the point $\bar{X}$, $\bar{Y}$ on the regression line is less than $s^2_Y$; it is $s^2_{Y \cdot X}$. At $\bar{X}$ we may therefore compute confidence limits of $\bar{Y}$, using as a standard error of the mean $s_{\bar{Y}} = \sqrt{s^2_{Y \cdot X}/n}$ with $n - 2$ degrees of freedom. This standard error is computed in step 4 of Box 11.4, and 95% confidence limits for the sampled mean $\bar{Y}$ at $\bar{X}$ are calculated in step 5. These limits (5.7881 to 6.2559) are considerably narrower than the confidence limits for the mean based on the conventional standard error $s_{\bar{Y}}$, which would be from 4.687 to 7.357. Thus, knowing the relative humidity greatly reduces the uncertainty in weight loss.

The standard error for $\bar{Y}$ is only a special case of the standard error for any estimated value $\hat{Y}$ along the regression line. A new factor, whose magnitude is in part a function of the distance of a given value $X_i$ from its mean $\bar{X}$, now enters the error variance. Thus, the farther away $X_i$ is from its mean, the greater will be the error of estimate. This factor is seen in the third row of Box 11.3 as the deviation $X_i - \bar{X}$, squared and divided by the sum of squares of X. The standard error for an estimate $\hat{Y}_i$ for a relative humidity $X_i = 100\%$ is given in step 6 of Box 11.4. The 95% confidence limits for $\mu_{Y_i}$, the parametric value corresponding to the estimate $\hat{Y}_i$, are shown in step 7 of that box. Note that the width of the confidence interval is $3.8296 - 2.9338 = 0.8958$, considerably wider than the confidence interval at $\bar{X}$ calculated in step 5, which was $6.2559 - 5.7881 = 0.4678$. If we calculate a series of confidence limits for different values of $X_i$, we obtain a biconcave confidence belt as shown in Figure 11.11. The farther we get away from the mean, the less reliable are our estimates of Y, because of the uncertainty about the true slope, $\beta$, of the regression line.
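The widening of the confidence belt away from $\bar{X}$ can be illustrated with a small function. This is a sketch under the same assumptions as Box 11.4 (single Y per X, summary values and tabled $t_{0.05[7]} = 2.365$ taken from that box); the function name `estimate_with_limits` is hypothetical.

```python
import math

# Summary statistics for the weight loss data, as quoted in Box 11.4.
n = 9
mean_x = 50.389
mean_y = 6.022
b = -0.05322
ss_x = 8301.3889
ms_unexplained = 0.08801
t_crit = 2.365  # tabled t for 0.05 and 7 df (from Box 11.4)


def estimate_with_limits(x_i):
    """Return the estimate, its standard error, and 95% limits at X = x_i."""
    y_hat = mean_y + b * (x_i - mean_x)
    # The second term grows with the squared distance of x_i from the mean.
    se = math.sqrt(ms_unexplained * (1 / n + (x_i - mean_x) ** 2 / ss_x))
    half_width = t_crit * se
    return y_hat, se, y_hat - half_width, y_hat + half_width


# At the mean of X the formula reduces to the standard error of the mean.
at_mean = estimate_with_limits(mean_x)
# At 100% relative humidity the interval is considerably wider.
at_100 = estimate_with_limits(100.0)

for label, (y_hat, se, lower, upper) in (("X-bar", at_mean), ("X = 100", at_100)):
    print(f"{label}: Y-hat = {y_hat:.4f}, se = {se:.5f}, "
          f"limits {lower:.4f} to {upper:.4f}")
```

Evaluating the function over a range of X values traces out the biconcave belt described above.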
Furthermore, the linear regressions that we fit are often only rough approximations to the more complicated functional relationships between biological variables. Very often there is an approximately linear relation along a certain range of the independent variable, beyond which range the slope changes rapidly. For example, heartbeat of a poikilothermic animal will be directly proportional to temperature over a range of tolerable temperatures, but beneath and above this range the heartbeat will eventually decrease as the animal freezes or suffers heat prostration. Hence common sense indicates that one should be very cautious about extrapolating from a regression equation if one has any doubts about the linearity of the relationship.

The confidence limits for $\alpha$, the parametric value of $a$, are a special case of those for $\mu_{Y_i}$ at $X_i = 0$, and the standard error of $a$ is therefore

$$s_a = \sqrt{s^2_{Y \cdot X}\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum x^2}\right]}$$
Tests of significance in regression analyses where there is more than one variate Y per value of X are carried out in a manner similar to that of Box 11.4, except that the standard errors in the left-hand column of Box 11.3 are employed.

Another significance test in regression is a test of the differences between two regression lines. Why would we be interested in testing differences between regression slopes? We might find that different toxicants yield different dosage-mortality curves or that different drugs yield different relationships between dosage and response (see, for example, Figure 11.1). Or genetically differing cultures might yield different responses to increasing density, which would be important for understanding the effect of natural selection in these cultures. The regression slope of one variable on another is as fundamental a statistic of a sample as is the mean or the standard deviation, and in comparing samples
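it may be as important to compare regression coefficients as it is to compare these other statistics. The test for the difference between two regression coefficients can be carried out as an F test. We compute

$$F_s = \frac{(b_1 - b_2)^2}{\dfrac{\sum x_1^2 + \sum x_2^2}{(\sum x_1^2)(\sum x_2^2)} \, \bar{s}^2_{Y \cdot X}}$$

where $\bar{s}^2_{Y \cdot X}$ is the weighted average $s^2_{Y \cdot X}$ of the two groups. Its formula is

$$\bar{s}^2_{Y \cdot X} = \frac{\sum y_1^2 - \dfrac{(\sum xy)_1^2}{\sum x_1^2} + \sum y_2^2 - \dfrac{(\sum xy)_2^2}{\sum x_2^2}}{\nu_2}$$

For one Y per value of X, $\nu_2 = n_1 + n_2 - 4$, but when there is more than one variate Y per value of X, $\nu_2 = a_1 + a_2 - 4$. Compare $F_s$ with $F_{\alpha[1, \nu_2]}$.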
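A minimal sketch of this test for the case of one Y per value of X is given below. The function `compare_slopes` and its argument names are hypothetical. The first sample uses the summary statistics of the weight loss data; the second sample is invented purely for illustration and does not come from the text. The resulting $F_s$ would still have to be compared with a tabled critical value.

```python
def compare_slopes(b1, ss_x1, ss_unexpl1, n1, b2, ss_x2, ss_unexpl2, n2):
    """F statistic for the difference between two regression coefficients.

    ss_x1, ss_x2: sums of squares of X in the two samples.
    ss_unexpl1, ss_unexpl2: unexplained sums of squares of Y.
    Assumes a single Y value for each value of X in both samples.
    """
    nu2 = n1 + n2 - 4
    # Weighted average of the two unexplained variances.
    pooled_var = (ss_unexpl1 + ss_unexpl2) / nu2
    denominator = (ss_x1 + ss_x2) / (ss_x1 * ss_x2) * pooled_var
    return (b1 - b2) ** 2 / denominator, nu2


# Sample 1: weight loss data of Box 11.1.
# Sample 2: HYPOTHETICAL values, made up for this illustration only.
F_s, nu2 = compare_slopes(
    b1=-0.05322, ss_x1=8301.3889, ss_unexpl1=0.6161, n1=9,
    b2=-0.04000, ss_x2=8000.0, ss_unexpl2=0.7000, n2=9,
)
print(f"F_s = {F_s:.3f} with 1 and {nu2} degrees of freedom")
```

Since $(\sum x_1^2 + \sum x_2^2)/[(\sum x_1^2)(\sum x_2^2)] = 1/\sum x_1^2 + 1/\sum x_2^2$, the denominator is simply the sum of the two squared standard errors of the slopes, each computed with the pooled variance.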
11.6 The uses of regression
We have been so busy learning the mechanics of regression analysis that we have not had time to give much thought to the various applications of regression. We shall take up four more or less distinct applications in this section. All are discussed in terms of Model I regression.

First, we might mention the study of causation. If we wish to know whether variation in a variable Y is caused by changes in another variable X, we manipulate X in an experiment and see whether we can obtain a significant regression of Y on X. The idea of causation is a complex, philosophical one that we shall not go into here. You have undoubtedly been cautioned from your earliest scientific experience not to confuse concomitant variation with causation. Variables may vary together, yet this covariation may be accidental or both may be functions of a common cause affecting them. The latter cases are usually examples of Model II regression with both variables varying freely. When we manipulate one variable and find that such manipulations affect a second variable, we generally are satisfied that the variation of the independent variable X is the cause of the variation of the dependent variable Y (not the cause of the variable!). However, even here it is best to be cautious. When we find that heartbeat rate in a cold-blooded animal is a function of ambient temperature, we may conclude that temperature is one of the causes of differences in heartbeat rate. There may well be other factors affecting rate of heartbeat. A possible mistake is to invert the cause-and-effect relationship. It is unlikely that anyone would suppose that heartbeat rate affects the temperature of the general environment, but we might be mistaken about the cause-and-effect relationships between two chemical substances in the blood, for instance. Despite these cautions, regression analysis is a commonly used device for
screening out causal relationships. While a significant regression of on X d o e s not p r o v e that changes in X are the cause of variations in Y, the converse statem e n t is true. W h e n we find n o significant regression of Y on X , we c a n in all but the most complex cases infer quite safely (allowing for the possibility of type II error) t h a t deviations of X d o n o t affect Y. T h e description of scientific laws a n d prediction are a second general area of application of regression analysis. Science aims at m a t h e m a t i c a l description of relations between variables in nature, a n d M o d e l I regression analysis permits us to estimate f u n c t i o n a l relationships between variables, one of which is subject t o error. These f u n c t i o n a l relationships d o not always have clearly interp r e t a b l e biological meaning. T h u s , in m a n y cases it may be difficult to assign a biological i n t e r p r e t a t i o n to the statistics a a n d b, or their c o r r e s p o n d i n g p a r a m e t e r s a n d . W h e n we can d o so, we speak of a structural mathematical model, one whose c o m p o n e n t p a r t s have clear scientific meaning. However, m a t h e m a t i c a l curves that a r e not s t r u c t u r a l models arc also of value in science. M o s t regression lines a r e empirically fitted curves, in which the f u n c t i o n s simply represent the best m a t h e m a t i c a l fit (by a criterion such as least squares) to an observed set of d a t a . Comparison of dependent rariates is a n o t h e r application of regression. As soon as it is established t h a t a given variable is a function of a n o t h e r one, as in Box 11.2, where we f o u n d survival of beetles to be a f u n c t i o n of density, o n e is b o u n d to ask to w h a t degree any observed difference in survival between two s a m p l e s of beetles is a function of the density at which they have been raised. 
It would be unfair to compare beetles raised at very high density (and expected to have low survival) with those raised under optimal conditions of low density. This is the same point of view that makes us disinclined to compare the mathematical knowledge of a fifth-grader with that of a college student. Since we could undoubtedly obtain a regression of mathematical knowledge on years of schooling in mathematics, we should be comparing how far a given individual deviates from his or her expected value based on such a regression. Thus, relative to his or her classmates and age group, the fifth-grader may be far better than is the college student relative to his or her peer group. This suggests that we calculate adjusted Y values that allow for the magnitude of the independent variable X. A conventional way of calculating such adjusted Y values is to estimate the Y value one would expect if the independent variable were equal to its mean $\bar{X}$ and the observation retained its observed deviation ($d_{Y \cdot X}$) from the regression line. Since $\hat{Y} = \bar{Y}$ when $X = \bar{X}$, the adjusted Y value can be computed as
$$Y_{\text{adj}} = \bar{Y} + d_{Y \cdot X} = Y - bx \tag{11.8}$$
where $x = X - \bar{X}$ is the deviation of X from its mean.
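The adjustment just described can be sketched in a few lines of Python. This is a minimal illustration only: the function name `adjusted_y` and the density and survival figures are hypothetical, invented for the example, and are not the beetle data of Box 11.2.

```python
def adjusted_y(xs, ys):
    """Adjust each Y to the mean of X: Y_adj = Y - b * (X - mean of X)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products and sum of squares of X, both as deviations from the means.
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_x2 = sum((x - x_bar) ** 2 for x in xs)
    b = sum_xy / sum_x2  # least-squares regression coefficient
    return [y - b * (x - x_bar) for x, y in zip(xs, ys)]


# Hypothetical data, invented for illustration only.
density = [1.0, 2.0, 3.0, 4.0, 5.0]
survival = [9.0, 8.5, 6.0, 5.5, 3.0]
adjusted = adjusted_y(density, survival)
```

Because each adjusted value is the mean of Y plus the observation's own residual, the adjusted values keep the same mean as the original Y values while the linear effect of X is removed.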
Statistical control is an application of regression that is not widely known among biologists and represents a scientific philosophy that is not well established in biology outside agricultural circles. Biologists frequently categorize work as either descriptive or experimental, with the implication that only the latter can be analytical. However, statistical approaches applied to descriptive
work can, in a number of instances, take the place of experimental techniques quite adequately; occasionally they are even to be preferred. These approaches are attempts to substitute statistical manipulation of a concomitant variable for control of the variable by experimental means. An example will clarify this technique.
Let us assume that we are studying the effects of various diets on blood pressure in rats. We find that the variability of blood pressure in our rat population is considerable, even before we introduce differences in diet. Further study reveals that the variability is largely due to differences in age among the rats of the experimental population. This can be demonstrated by a significant linear regression of blood pressure on age. To reduce the variability of blood pressure in the population, we should keep the age of the rats constant. The reaction of most biologists at this point will be to repeat the experiment using rats of only one age group; this is a valid, commonsense approach, which is part of the experimental method. An alternative approach is superior in some cases, when it is impractical or too costly to hold the variable constant. We might continue to use rats of variable ages and simply record the age of each rat as well as its blood pressure. Then we regress blood pressure on age and use an adjusted mean as the basic blood pressure reading for each individual. We can now evaluate the effect of differences in diet on these adjusted means. Or we can analyze the effects of diet on unexplained deviations, $d_{Y \cdot X}$, after the experimental blood pressures have been regressed on age (which amounts to the same thing).
What are the advantages of such an approach? Often it will be impossible to secure adequate numbers of individuals all of the same age. By using regression we are able to utilize all the individuals in the population.
The use of statistical control assumes that it is relatively easy to record the independent variable X and, of course, that this variable can be measured without error, which would be generally true of such a variable as age of a laboratory animal. Statistical control may also be preferable because we obtain information over a wider range of both X and Y and also because we obtain added knowledge about the relations between these two variables, which would not be so if we restricted ourselves to a single age group.
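As a rough sketch of the rat example, the following Python fragment regresses blood pressure on age and then averages the unexplained deviations within each diet. Everything here is an assumption made for illustration: the function names, the two diets "A" and "B", and the ages and pressures are all invented, and a single regression pooled over both diets is used for simplicity.

```python
from collections import defaultdict


def regression_residuals(xs, ys):
    """Deviations of each Y from the least-squares line of Y on X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return [(y - y_bar) - b * (x - x_bar) for x, y in zip(xs, ys)]


def mean_residual_by_group(groups, xs, ys):
    """Average the unexplained deviations separately for each group."""
    by_group = defaultdict(list)
    for g, d in zip(groups, regression_residuals(xs, ys)):
        by_group[g].append(d)
    return {g: sum(ds) / len(ds) for g, ds in by_group.items()}


# Hypothetical rats: age in weeks, blood pressure in arbitrary units.
diet = ["A", "A", "A", "A", "B", "B", "B", "B"]
age = [10, 20, 30, 40, 10, 20, 30, 40]
pressure = [105.0, 110.0, 115.0, 120.0, 109.0, 114.0, 119.0, 124.0]
diet_effect = mean_residual_by_group(diet, age, pressure)
```

In these invented data the raw pressures range from 105 to 124, mostly because of age; once age is allowed for, the deviations separate cleanly by diet, which is the purpose of statistical control.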
11.7 Residuals and transformations in regression
An examination of regression residuals, $d_{Y \cdot X}$, may detect outliers in a sample. Such outliers may reveal systematic departures from regression that can be adjusted by transformation of scale, or by the fitting of a curvilinear regression line. When it is believed that an outlier is due to an observational or recording error, or to contamination of the sample studied, removal of such an outlier may improve the regression fit considerably. In examining the magnitude of residuals, we should also allow for the corresponding deviation from $\bar{X}$. Outlying values of $Y_i$ that correspond to deviant variates $X_i$ will have a greater influence in determining the slope of the regression line than will variates close
to $\bar{X}$. We can examine the residuals in column (9) of Table 11.1 for the weight loss data. Although several residuals are quite large, they tend to be relatively close to $\bar{Y}$. Only the residual for 0% relative humidity is suspiciously large and, at the same time, is the single most deviant observation from $\bar{X}$. Perhaps the reading at this extreme relative humidity does not fit into the generally linear relations described by the rest of the data.
In transforming either or both variables in regression, we aim at simplifying a curvilinear relationship to a linear one. Such a procedure generally increases the proportion of the variance of the dependent variable explained by the independent variable, and the distribution of the deviations of points around the regression line tends to become normal and homoscedastic. Rather than fit a complicated curvilinear regression to points plotted on an arithmetic scale, it is far more expedient to compute a simple linear regression for variates plotted on a transformed scale. A general test of whether transformation will improve linear regression is to graph the points to be fitted on ordinary graph paper as well as on other graph paper in a scale suspected to improve the relationship. If the function straightens out and the systematic deviation of points around a visually fitted line is reduced, the transformation is worthwhile. We shall briefly discuss a few of the transformations commonly applied in regression analysis.
Square root and arcsine transformations (Section 10.2) are not mentioned below, but they are also effective in regression cases involving data suited to such transformations.
The logarithmic transformation is the most frequently used. Anyone doing statistical work is therefore well advised to keep a supply of semilog paper handy. Most frequently we transform the dependent variable Y. This transformation is indicated when percentage changes in the dependent variable vary directly with changes in the independent variable. Such a relationship is indicated by the equation $Y = ae^{bX}$, where a and b are constants and e is the base of the natural logarithm. After the transformation, we obtain $\log Y = \log a + b(\log e)X$. In this expression log e is a constant which when multiplied by b yields a new constant factor b' which is equivalent to a regression coefficient. Similarly, log a is a new Y intercept, a'. We can then simply regress log Y on X to obtain the function $\log Y = a' + b'X$ and obtain all our prediction equations and confidence intervals in this form. Figure 11.12 shows an example of transforming the dependent variate to logarithmic form, which results in considerable straightening of the response curve.
A logarithmic transformation of the independent variable in regression is effective when proportional changes in the independent variable produce linear responses in the dependent variable. An example might be the decline in weight of an organism as density increases, where the successive increases in density need to be in a constant ratio in order to effect equal decreases in weight.
This belongs to a well-known class of biological phenomena, another example of which is the Weber-Fechner law in physiology and psychology, which states that a stimulus has to be increased by a constant proportion in order to produce a constant increment in response. Figure 11.13 illustrates how logarithmic
FIGURE 11.12
Logarithmic transformation of a dependent variable in regression. Chirp rate as a function of temperature in males of the tree cricket Oecanthus fultoni. Each point represents the mean chirp rate/min for all observations at a given temperature in °C. Original data in left panel, Y plotted on logarithmic scale in right panel. (Data from Block, 1966.)
FIGURE 11.13
Logarithmic transformation of the independent variable in regression. This illustrates size of electrical response to illumination in the cephalopod eye. Ordinate, millivolts; abscissa, relative brightness of illumination (times), in log scale. A proportional increase in X (relative brightness) produces a linear electrical response Y. (Data in Fröhlich, 1921.)
transformation of the independent variable results in the straightening of the regression line. For computations one would transform X into logarithms.
Logarithmic transformation for both variables is applicable in situations in which the true relationship can be described by the formula $Y = aX^b$. The regression equation is rewritten as $\log Y = \log a + b \log X$ and the computation is done in the conventional manner. Examples are the greatly disproportionate growth of various organs in some organisms, such as the sizes of antlers of deer or horns of stag beetles, with respect to their general body sizes. A double logarithmic transformation is indicated when a plot on log-log graph paper results in a straight-line graph.
Reciprocal transformation. Many rate phenomena (a given performance per unit of time or per unit of population), such as wing beats per second or number of eggs laid per female, will yield hyperbolic curves when plotted in original measurement scale. Thus, they form curves described by the general mathematical equations $bXY = 1$ or $(a + bX)Y = 1$. From these we can derive $1/Y = bX$ or $1/Y = a + bX$. By transforming the dependent variable into its reciprocal, we can frequently obtain straight-line regressions.
Finally, some cumulative curves can be straightened by the probit transformation. Refresh your memory on the cumulative normal curve shown in Figure 5.5. Remember that by changing the ordinate of the cumulative normal into probability scale we were able to straighten out this curve. We do the same thing here except that we graduate the probability scale in standard deviation units.
Thus, the 50% point becomes 0 standard deviations, the 84.13% point becomes +1 standard deviation, and the 2.27% point becomes -2 standard deviations. Such standard deviations, corresponding to a cumulative percentage, are called normal equivalent deviates (NED). If we use ordinary graph paper and mark the ordinate in NED units, we obtain a straight line when plotting the cumulative normal curve against it. Probits are simply normal equivalent deviates coded by the addition of 5.0, which will avoid negative values for most deviates. Thus, the probit value 5.0 corresponds to a cumulative frequency of 50%, the probit value 6.0 corresponds to a cumulative frequency of 84.13%, and the probit value 3.0 corresponds to a cumulative frequency of 2.27%.
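The coding of cumulative percentages into normal equivalent deviates and probits can be checked numerically. The sketch below uses `NormalDist.inv_cdf` from the Python standard library (Python 3.8 or later); the function names are our own, chosen for illustration.

```python
from statistics import NormalDist


def normal_equivalent_deviate(cumulative_percent):
    """Standard normal deviate that cuts off the given cumulative percentage."""
    # inv_cdf accepts only proportions strictly between 0 and 1, so
    # 0% and 100% cannot be converted (their deviates are infinite).
    return NormalDist().inv_cdf(cumulative_percent / 100.0)


def probit(cumulative_percent):
    """Normal equivalent deviate coded by the addition of 5.0."""
    return normal_equivalent_deviate(cumulative_percent) + 5.0


probit_50 = probit(50.0)      # the 50% point
probit_84 = probit(84.13)     # close to 6.0
probit_02 = probit(2.27)      # close to 3.0
```

The percentages 84.13 and 2.27 are rounded, so the corresponding probits agree with 6.0 and 3.0 only to about two or three decimal places.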
Figure 11.14 shows an example of mortality percentages for increasing doses of an insecticide. These represent differing points of a cumulative frequency distribution. With increasing dosages an ever greater proportion of the sample dies until at a high enough dose the entire sample is killed. It is often found that if the doses of toxicants are transformed into logarithms, the tolerances of many organisms to these poisons are approximately normally distributed. These transformed doses are often called dosages. Increasing dosages lead to cumulative normal distributions of mortalities, often called dosage-mortality curves. These curves are the subject matter of an entire field of biometric analysis, bioassay, to which we can refer only in passing here. The most common technique in this field is probit analysis. Graphic approximations can be carried out on so-called probit paper, which is probability graph paper in which the
FIGURE 11.14
Dosage-mortality data illustrating an application of the probit transformation. Data are mean mortalities for two replicates. Twenty Drosophila melanogaster per replicate were subjected to seven doses of an "unknown" insecticide in a class experiment. The point at dose 0.1, which yielded 0% mortality, has been assigned a probit value of 2.5 in lieu of $-\infty$, which cannot be plotted.
abscissa has been transformed into logarithmic scale. A regression line is fitted to dosage-mortality data graphed on probit paper (see Figure 11.14). From the regression line the 50% lethal dose is estimated by a process of inverse prediction; that is, we estimate the value of X (dosage) corresponding to a kill of probit 5.0, which is equivalent to 50%.
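The inverse prediction of a 50% lethal dose can be sketched as follows. This is a simplified illustration under stated assumptions, not a full probit analysis: it fits an ordinary unweighted least-squares line of probit on log dose, much as one would fit a line by eye on probit paper, and the doses and mortalities are hypothetical. The function names are our own.

```python
import math
from statistics import NormalDist


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


def lethal_dose_50(doses, percent_mortality):
    """Dose whose predicted probit is 5.0 (a 50% kill)."""
    dosages = [math.log10(d) for d in doses]  # doses in log scale
    probits = [NormalDist().inv_cdf(p / 100.0) + 5.0 for p in percent_mortality]
    a, b = fit_line(dosages, probits)
    dosage_50 = (5.0 - a) / b                 # inverse prediction of X from Y
    return 10.0 ** dosage_50                  # back to the original dose scale


# Hypothetical doses and mortalities, invented for illustration only.
ld50 = lethal_dose_50([1.0, 10.0, 100.0], [16.0, 50.0, 84.0])
```

Mortalities of exactly 0% or 100% have no finite probit and would have to be handled separately, as was done for the point at dose 0.1 in Figure 11.14.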
11.8 A nonparametric test for regression
When transformations are unable to linearize the relationship between the dependent and independent variables, the research worker may wish to carry out a simpler, nonparametric test in lieu of regression analysis. Such a test furnishes neither a prediction equation nor a functional relationship, but it does test whether the dependent variable Y is a monotonically increasing (or decreasing) function of the independent variable X. The simplest such test is the ordering test, which is equivalent to computing Kendall's rank correlation coefficient (see Box 12.3) and can be carried out most easily as such. In fact, in such a case the distinction between regression and correlation, which will be discussed in detail in Section 12.1, breaks down.
The test is carried out as follows. Rank variates X and Y. Arrange the independent variable X in increasing order of ranks and calculate the Kendall rank correlation of Y with X. The computational steps for the procedure are shown in Box 12.3. If we carry out this computation for the weight loss data of Box 11.1 (reversing the order of percent relative humidity, X, which is negatively related to weight loss, Y), we obtain a quantity N = 72, which is significant at P < 0.01 when looked up in Table XIV. There is thus a significant trend of weight loss as a function of relative humidity. The ranks of the weight losses are a perfect monotonic function of the ranks of the relative humidities.
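The idea behind the ordering test can be illustrated with a short Python sketch of Kendall's rank correlation. It assumes no tied values and counts concordant and discordant pairs directly; this is one common way of defining the coefficient and is not a transcription of the computational steps of Box 12.3. The data are hypothetical, and no significance test is attempted here.

```python
def kendall_tau(xs, ys):
    """Kendall's rank correlation coefficient, assuming no tied values."""
    n = len(xs)
    score = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            direction = (xs[j] - xs[i]) * (ys[j] - ys[i])
            if direction > 0:      # pair ordered the same way in X and Y
                score += 1
            elif direction < 0:    # pair ordered in opposite ways
                score -= 1
    return score / (n * (n - 1) / 2)


# Hypothetical data: Y falls steadily as X rises.
humidity = [0.0, 20.0, 40.0, 60.0, 80.0]
loss = [9.0, 7.5, 6.1, 5.0, 3.2]
tau_monotonic = kendall_tau(humidity, loss)

# One pair out of order weakens the trend.
tau_one_swap = kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 4, 5])
```

A coefficient of -1 or +1 means that the ranks of Y are a perfect monotonic function of the ranks of X; the test asks only about the direction and consistency of the trend, not about its shape.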
Exercises
11.1 The following temperatures (Y) were recorded in a rabbit at various times (X) after it was inoculated with rinderpest virus (data from Carter and Mitchell, 1958).
Graph the data. Clearly, the last three data points represent a different phenomenon from the first four pairs. For the first four points: (a) Calculate b. (b) Calculate the regression equation and draw in the regression line. (c) Test the hypothesis that $\beta = 0$ and set 95% confidence limits. (d) Set 95% confidence limits to your estimate of the rabbit's temperature 50 hours after the injection. ANS. a = 100, b = 0.1300, $F_s$ = 59.4288, P < 0.05, $\hat{Y}_{50}$ = 106.5.
11.2 The following table is extracted from data by Sokoloff (1955). Adult weights of female Drosophila persimilis reared at 24°C are affected by their density as larvae. Carry out an anova among densities. Then calculate the regression of weight on density and partition the sums of squares among groups into that explained and unexplained by linear regression. Graph the data with the regression line fitted to the means. Interpret your results.
Larval density 1 3 5 6 10 20 40
Mean weight of adults (in mg) 1.356 1.356 1.284 1.252 0.989 0.664 0.475
9 34 50 63 83 144 24
11.3 Davis (1955) reported the following results in a study of the amount of energy metabolized by the English sparrow, Passer domesticus, under various constant temperature conditions and a ten-hour photoperiod. Analyze and interpret. ANS. $MS_{\hat{Y}}$ = 657.5043, $MS_{Y \cdot X}$ = 8.2186, $MS_{\text{within}}$ = 3.9330, deviations are not
Temperature (°C)   0     4     10    18    26    34
                   6     4     4     5     7     7
s                  1.77  1.99  2.07  1.43  1.52  2.70
11.4 Using the complete data given in Exercise 11.1, calculate the regression equation and compare it with the one you obtained for the first four points. Discuss the effect of the inclusion of the last three points in the analysis. Compute the residuals from regression.
11.5 The following results were obtained in a study of oxygen consumption (microliters/mg dry weight per hour) in Heliothis zea by Phillips and Newsom (1966) under controlled temperatures and photoperiods.
Temperature (°C) Photoperiod (h) 10 14 1.61 1.64 1.73
18
21 24
Compute regression for each photoperiod separately and test for homogeneity of slopes. ANS. For 10 hours: b = 0.0633, $s^2_{Y \cdot X}$ = 0.019,267. For 14 hours: b =
Temperature (°F)                                  59.8  67.6  70.0  70.4  74.0  75.3  78.0  80.4  81.4  83.2  88.4  91.4
Mean length of developmental period in days (Y)   58.1  27.3  26.8  26.3  19.1  19.0  16.5  15.9  14.8  14.2  14.4  14.6
Analyze and interpret. Compute deviations from the regression line ($Y_i - \hat{Y}_i$) and plot against temperature.
11.7 The experiment cited in Exercise 11.3 was repeated using a 15-hour photoperiod, and the following results were obtained:
Temperature (°C)   0   10   18   26   34
Calories
6 7 8 10 6
Test for the equality of slopes of the regression lines for the 10-hour and 15-hour photoperiod. ANS. $F_s$ = 0.003.
11.8 Carry out a nonparametric test for regression in Exercises 11.1 and 11.6.
11.9 Water temperature was recorded at various depths in Rot Lake on August 1, 1952, by Vollenweider and Frei (1953).
0 24.8
1 23.2
2 22.2
3 21.2
6.3
12 5.8
15.5 5.6
Plot the data and then compute the regression line. Compute the deviations from regression. Does temperature vary as a linear function of depth? What do the residuals suggest? ANS. a = 23.384, b = -1.435, $F_s$ = 45.2398, P < 0.01.
CHAPTER 12
Correlation
In this chapter we continue our discussion of bivariate statistics. In Chapter 11, on regression, we dealt with the functional relation of one variable upon the other; in the present chapter, we treat the measurement of the amount of association between two variables. This general topic is called correlation analysis.
It is not always obvious which type of analysis (regression or correlation) one should employ in a given problem. There has been considerable confusion in the minds of investigators and also in the literature on this topic. We shall try to make the distinction between these two approaches clear at the outset in Section 12.1. In Section 12.2 you will be introduced to the product-moment correlation coefficient, the common correlation coefficient of the literature. We shall derive a formula for this coefficient and give you something of its theoretical background. The close mathematical relationship between regression and correlation analysis will be examined in this section. We shall also compute a product-moment correlation coefficient in this section. In Section 12.3 we will talk about various tests of significance involving correlation coefficients. Then, in Section 12.4, we will introduce some of the applications of correlation coefficients.
Section 12.5 contains a nonparametric method that tests for association. It is to be used in those cases in which the necessary assumptions for tests involving correlation coefficients do not hold, or where quick but less than fully efficient tests are preferred for reasons of speed in computation or for convenience.
12.1 Correlation and regression
There has been much confusion on the subject matter of correlation and regression. Quite frequently correlation problems are treated as regression problems in the scientific literature, and the converse is equally true. There are several reasons for this confusion. First of all, the mathematical relations between the two methods of analysis are quite close, and mathematically one can easily move from one to the other. Hence, the temptation to do so is great. Second, earlier texts did not make the distinction between the two approaches sufficiently clear, and this problem has still not been entirely overcome. At least one textbook synonymizes the two, a step that we feel can only compound the confusion. Finally, while an investigator may with good reason intend to use one of the two approaches, the nature of the data may be such as to make only the other approach appropriate.
Let us examine these points at some length. The many and close mathematical relations between regression and correlation will be detailed in Section 12.2. It suffices for now to state that for any given problem, the majority of the computational steps are the same whether one carries out a regression or a correlation analysis. You will recall that the fundamental quantity required for regression analysis is the sum of products. This is the very same quantity that serves as the base for the computation of the correlation coefficient. There are some simple mathematical relations between regression coefficients and correlation coefficients for the same data.
Thus the temptation exists to compute a correlation coefficient corresponding to a given regression coefficient. Yet, as we shall see shortly, this would be wrong unless our intention at the outset were to study association and the data were appropriate for such a computation.
Let us then look at the intentions or purposes behind the two types of analyses. In regression we intend to describe the dependence of a variable Y on an independent variable X. As we have seen, we employ regression equations for purposes of lending support to hypotheses regarding the possible causation of changes in Y by changes in X; for purposes of prediction, of variable Y given a value of variable X; and for purposes of explaining some of the variation of Y by X, by using the latter variable as a statistical control. Studies of the effects of temperature on heartbeat rate, nitrogen content of soil on growth rate in a plant, age of an animal on blood pressure, or dose of an insecticide on mortality of the insect population are all typical examples of regression for the purposes named above.
In correlation, by contrast, we are concerned largely with whether two variables are interdependent, or covary (that is, vary together). We do not express one as a function of the other. There is no distinction between independent and dependent variables. It may well be that of the pair of variables whose correlation is studied, one is the cause of the other, but we neither know nor assume this. A more typical (but not essential) assumption is that the two variables are both effects of a common cause. What we wish to estimate is the degree to which these variables vary together. Thus we might be interested in the correlation between amount of fat in diet and incidence of heart attacks in human populations, between foreleg length and hind leg length in a population of mammals, between body weight and egg production in female blowflies, or between age and number of seeds in a weed. Reasons why we would wish to demonstrate and measure association between pairs of variables need not concern us yet. We shall take this up in Section 12.4. It suffices for now to state that when we wish to establish the degree of association between pairs of variables in a population sample, correlation analysis is the proper approach.
Thus a correlation coefficient computed from data that have been properly analyzed by Model I regression is meaningless as an estimate of any population correlation coefficient. Conversely, suppose we were to evaluate a regression coefficient of one variable on another in data that had been properly computed as correlations. Not only would construction of such a functional dependence for these variables not meet our intentions, but we should point out that a conventional regression coefficient computed from data in which both variables are measured with error (as is the case in correlation analysis) furnishes biased estimates of the functional relation.
Even if we attempt the correct method in line with our purposes we may run afoul of the nature of the data. Thus we may wish to establish cholesterol content of blood as a function of weight, and to do so we may take a random sample of men of the same age group, obtain each individual's cholesterol content and weight, and regress the former on the latter. However, both these variables will have been measured with error. Individual variates of the supposedly independent variable X will not have been deliberately chosen or controlled by the experimenter.
The underlying assumptions of Model I regression do not hold, and fitting a Model I regression to the data is not legitimate, although you will have no difficulty finding instances of such improper practices in the published research literature. If it is really an equation describing the dependence of Y on X that we are after, we should carry out a Model II regression. However, if it is the degree of association between the variables (interdependence) that is of interest, then we should carry out a correlation analysis, for which these data are suitable. The converse difficulty is trying to obtain a correlation coefficient from data that are properly computed as a regression, that is, are computed when X is fixed. An example would be heartbeats of a poikilotherm as a function of temperature, where several temperatures have been applied in an experiment. Such a correlation coefficient is easily obtained mathematically but would simply be a numerical value, not an estimate
TABLE 12.1
The relations between correlation and regression. This table indicates the correct computation for any combination of purposes and variables, as shown.
Purpose of investigator: Establish and estimate dependence of one variable upon another. (Describe functional relationship and/or predict one in terms of the other.)
    Nature of the two variables, Y random, X fixed: Model I regression.
    Nature of the two variables, $Y_1$, $Y_2$ both random: Model II regression.
Purpose of investigator: Establish and estimate association (interdependence) between two variables.
    Nature of the two variables, Y random, X fixed: Meaningless for this case. If desired, an estimate of the proportion of the variation of Y explained by X can be obtained as the square of the correlation coefficient between X and Y.
    Nature of the two variables, $Y_1$, $Y_2$ both random: Correlation coefficient. (Significance tests entirely appropriate only if $Y_1$, $Y_2$ are distributed as bivariate normal variables.)
of a parametric measure of correlation. There is an interpretation that can be given to the square of the correlation coefficient that has some relevance to a regression problem. However, it is not in any way an estimate of a parametric correlation.
This discussion is summarized in Table 12.1, which shows the relations between correlation and regression. The two columns of the table indicate the two conditions of the pair of variables: in one case one random and measured with error, the other variable fixed; in the other case, both variables random. In this text we depart from the usual convention of labeling the pair of variables Y and X or $X_1$, $X_2$ for both correlation and regression analysis. In regression we continue the use of Y for the dependent variable and X for the independent variable, but in correlation both of the variables are in fact random variables, which we have throughout the text designated as Y. We therefore refer to the two variables as $Y_1$ and $Y_2$. The rows of the table indicate the intention of the investigator in carrying out the analysis, and the four quadrants of the table indicate the appropriate procedures for a given combination of intention of investigator and nature of the pair of variables.
12.2 The product-moment correlation coefficient
There are numerous correlation coefficients in statistics. The most common of these is called the product-moment correlation coefficient, which in its current formulation is due to Karl Pearson. We shall derive its formula below.
You have seen that the sum of products is a measure of covariation, and it is therefore likely that this will be the basic quantity from which to obtain a formula for the correlation coefficient. We shall label the variables whose correlation is to be estimated as $Y_1$ and $Y_2$. Their sum of products will therefore be $\sum y_1 y_2$ and their covariance $[1/(n-1)]\sum y_1 y_2 = s_{12}$. The latter quantity is analogous to a variance, that is, a sum of squares divided by its degrees of freedom. A standard deviation is expressed in original measurement units such as inches, grams, or cubic centimeters. Similarly, a regression coefficient is expressed as so many units of $Y$ per unit of $X$, such as 5.2 grams/day. However, a measure of association should be independent of the original scale of measurement, so that we can compare the degree of association in one pair of variables with that in another. One way to accomplish this is to divide the covariance by the standard deviations of variables $Y_1$ and $Y_2$. This results in dividing each deviation $y_1$ and $y_2$ by its proper standard deviation and making it into a standardized deviate. The expression now becomes the sum of the products of standardized deviates divided by $n - 1$:
$$r_{Y_1 Y_2} = \frac{\sum y_1 y_2}{(n-1)\, s_{Y_1} s_{Y_2}} \qquad (12.1)$$
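This is the formula for the product-moment correlation coefficient $r_{Y_1 Y_2}$ between variables $Y_1$ and $Y_2$. We shall simplify the symbolism to

$$r_{12} = \frac{\sum y_1 y_2}{(n-1)\, s_1 s_2} = \frac{s_{12}}{s_1 s_2} \qquad (12.2)$$

Since $s\sqrt{n-1} = \sqrt{\sum y^2/(n-1)}\,\sqrt{n-1} = \sqrt{\sum y^2}$, Expression (12.2) can be rewritten as

$$r_{12} = \frac{\sum y_1 y_2}{\sqrt{\sum y_1^2 \sum y_2^2}} \qquad (12.3)$$

To state Expression (12.2) more generally for variables $Y_j$ and $Y_k$, we can write it as

$$r_{jk} = \frac{\sum y_j y_k}{(n-1)\, s_j s_k} \qquad (12.4)$$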
The correlation coefficient $r_{jk}$ can range from $+1$ for perfect association to $-1$ for perfect negative association. This is intuitively obvious when we consider the correlation of a variable $Y_j$ with itself. Expression (12.4) would then yield $r_{jj} = \sum y_j y_j / \sqrt{\sum y_j^2 \sum y_j^2} = \sum y_j^2 / \sum y_j^2 = 1$, which yields a perfect correlation of $+1$. If deviations in one variable were paired with opposite but equal deviations in the other variable, the correlation would be $-1$,
because the sum of products in the numerator would be negative. Proof that the correlation coefficient is bounded by $+1$ and $-1$ will be given shortly. If the variates follow a specified distribution, the bivariate normal distribution, the correlation coefficient $r_{jk}$ will estimate a parameter of that distribution symbolized by $\rho_{jk}$. Let us approach the distribution empirically. Suppose you have sampled a hundred items and measured two variables on each item, obtaining two samples of 100 variates in this manner. If you plot these 100 items on a graph in which the variables $Y_1$ and $Y_2$ are the coordinates, you will obtain a scattergram of points as in Figure 12.3A. Let us assume that both variables, $Y_1$ and $Y_2$, are normally distributed and that they are quite independent of each other, so that the fact that one individual happens to be greater than the mean in character $Y_1$ has no effect whatsoever on its value for variable $Y_2$. Thus this same individual may be greater or less than the mean for variable $Y_2$. If there is absolutely no relation between $Y_1$ and $Y_2$ and if the two variables are standardized to make their scales comparable, you will find that the outline of the scattergram is roughly circular. Of course, for a sample of 100 items, the circle will be only imperfectly outlined; but the larger the sample, the more clearly you will be able to discern a circle with the central area around the intersection $\bar{Y}_1$, $\bar{Y}_2$ heavily darkened because of the aggregation there of many points. If you keep sampling, you will have to superimpose new points upon previous points, and if you visualize these points in a physical sense, such as grains of sand, a mound peaked in a bell-shaped fashion will gradually accumulate. This is a three-dimensional realization of a normal distribution, shown in perspective in Figure 12.1.
Regarded from either coordinate axis, the mound will present a two-dimensional appearance, and its outline will be that of a normal distribution curve, the two perspectives giving the distributions of $Y_1$ and $Y_2$, respectively. If we assume that the two variables $Y_1$ and $Y_2$ are not independent but are positively correlated to some degree, then if a given individual has a large value of $Y_1$, it is more likely than not to have a large value of $Y_2$ as well. Similarly, a small value of $Y_1$ will likely be associated with a small value of $Y_2$. Were you to sample items from such a population, the resulting scattergram (shown in Figure 12.3D) would become elongated in the form of an ellipse. This is so because those parts of the circle that formerly included individuals high for one variable and low for the other (and vice versa) are now scarcely represented. Continued sampling (with the sand grain model) yields a three-dimensional elliptic mound, shown in Figure 12.2. If correlation is perfect, all the data will fall along a single regression line (the identical line would describe the regression of $Y_1$ on $Y_2$ and of $Y_2$ on $Y_1$), and if we let them pile up in a physical model, they will result in a flat, essentially two-dimensional normal curve lying on this regression line.
The circular or elliptical shape of the outline of the scattergram and of the resulting mound is clearly a function of the degree of correlation between the two variables, and this is the parameter $\rho_{jk}$ of the bivariate normal distribution. By analogy with Expression (12.2), the parameter $\rho_{jk}$ can be defined as

$$\rho_{jk} = \frac{\sigma_{jk}}{\sigma_j \sigma_k} \qquad (12.5)$$
where $\sigma_{jk}$ is the parametric covariance of variables $Y_j$ and $Y_k$ and $\sigma_j$ and $\sigma_k$ are the parametric standard deviations of variables $Y_j$ and $Y_k$, as before. When two variables are distributed according to the bivariate normal, a sample correlation coefficient $r_{jk}$ estimates the parametric correlation coefficient $\rho_{jk}$. We can make some statements about the sampling distribution of $\rho_{jk}$ and set confidence limits to it. Regrettably, the elliptical shape of scattergrams of correlated variables is not usually very clear unless either very large samples have been taken or the parametric correlation $\rho_{jk}$ is very high. To illustrate this point, we show in Figure 12.3 several graphs illustrating scattergrams resulting from samples of 100 items from bivariate normal populations with differing values of $\rho_{jk}$. Note
FIGURE 12.3 Random samples from bivariate normal distributions with varying values of the parametric correlation coefficient $\rho$. Sample sizes $n = 100$ in all graphs except G, which has $n = 500$. (A) $\rho = 0$. (B) $\rho = 0.3$. (C) $\rho = 0.5$. (D) $\rho = 0.7$. (E) $\rho = -0.7$. (F) $\rho = 0.9$. (G) $\rho = 0.5$.
that in the first graph (Figure 12.3A), with $\rho_{jk} = 0$, the circular distribution is only very vaguely outlined. A far greater sample is required to demonstrate the circular shape of the distribution more clearly. No substantial difference is noted in Figure 12.3B, based on $\rho_{jk} = 0.3$. Knowing that this depicts a positive correlation, one can visualize a positive slope in the scattergram; but without prior knowledge this would be difficult to detect visually. The next graph (Figure 12.3C, based on $\rho_{jk} = 0.5$) is somewhat clearer, but still does not exhibit an unequivocal trend. In general, correlation cannot be inferred from inspection of scattergrams based on samples from populations with $\rho_{jk}$ between $-0.5$ and $+0.5$ unless there are numerous sample points. This point is illustrated in the last graph (Figure 12.3G), also sampled from a population with $\rho_{jk} = 0.5$ but based on a sample of 500. Here, the positive slope and elliptical outline of the scattergram are quite evident. Figure 12.3D, based on $\rho_{jk} = 0.7$ and $n = 100$, shows the trend more clearly than the first three graphs. Note that the next graph (Figure 12.3E), based on the same magnitude of $\rho_{jk}$ but representing negative correlation, also shows the trend but is more strung out than Figure 12.3D. The difference in shape of the ellipse has no relation to the negative nature of the correlation; it is simply a function of sampling error, and the comparison of these two figures should give you some idea of the variability to be expected on random sampling from a bivariate normal distribution. Finally, Figure 12.3F, representing a correlation of $\rho_{jk} = 0.9$, shows a tight association between the variables and a reasonable approximation to an ellipse of points.
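The sampling variability described for Figure 12.3 can be explored by simulation. The following is a minimal sketch and not part of the original figure: it assumes standardized variables and constructs each pair as $Y_2 = \rho Y_1 + \sqrt{1 - \rho^2}\,e$, with $Y_1$ and $e$ independent standard normal variates, which gives a bivariate normal population with parametric correlation $\rho$. The function names and the seed are illustrative choices.

```python
import math
import random


def sample_bivariate_normal(rho, n, rng):
    """Draw n pairs (y1, y2) of standard normal variates with parametric correlation rho."""
    pairs = []
    for _ in range(n):
        y1 = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)  # independent of y1
        y2 = rho * y1 + math.sqrt(1.0 - rho * rho) * e
        pairs.append((y1, y2))
    return pairs


def correlation(pairs):
    """Product-moment correlation, sum of products over root of product of sums of squares."""
    n = len(pairs)
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    sum_products = sum((a - mean1) * (b - mean2) for a, b in pairs)
    ss1 = sum((a - mean1) ** 2 for a, _ in pairs)
    ss2 = sum((b - mean2) ** 2 for _, b in pairs)
    return sum_products / math.sqrt(ss1 * ss2)


rng = random.Random(12)  # arbitrary seed, only to make the run repeatable
for rho in (0.0, 0.3, 0.5, 0.7, -0.7, 0.9):
    r = correlation(sample_bivariate_normal(rho, 100, rng))
    print(f"rho = {rho:+.1f}   sample r (n = 100) = {r:+.3f}")
```

Rerunning with other seeds shows how far a sample $r$ based on 100 items can stray from $\rho$, which is the point made by comparing Figures 12.3D and 12.3E.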
Now let us return to the expression for the sample correlation coefficient shown in Expression (12.3). Squaring this expression results in

$$r_{12}^2 = \frac{(\sum y_1 y_2)^2}{\sum y_1^2 \sum y_2^2} = \frac{(\sum y_1 y_2)^2}{\sum y_1^2} \cdot \frac{1}{\sum y_2^2}$$

Look at the left term of the last expression. It is the square of the sum of products of variables $Y_1$ and $Y_2$, divided by the sum of squares of $Y_1$. If this were a regression problem, this would be the formula for the explained sum of squares of variable $Y_2$ on variable $Y_1$, $\sum \hat{y}_2^2$. In the symbolism of Chapter 11, on regression, it would be $\sum \hat{y}^2 = (\sum xy)^2 / \sum x^2$. Thus, we can write

$$r_{12}^2 = \frac{\sum \hat{y}_2^2}{\sum y_2^2} \qquad (12.6)$$
The square of the correlation coefficient, therefore, is the ratio formed by the explained sum of squares of variable $Y_2$ divided by the total sum of squares of variable $Y_2$. Equivalently,

$$r_{12}^2 = \frac{\sum \hat{y}_1^2}{\sum y_1^2} \qquad (12.6a)$$
which can be derived just as easily. (Remember that since we are not really regressing one variable on the other, it is just as legitimate to have $Y_1$ explained by $Y_2$ as the other way around.) The ratio symbolized by Expressions (12.6) and (12.6a) is a proportion ranging from 0 to 1. This becomes obvious after a little contemplation of the meaning of this formula. The explained sum of squares of any variable must be smaller than its total sum of squares or, maximally, if all the variation of a variable has been explained, it can be as great as the total sum of squares, but certainly no greater. Minimally, it will be zero if none of the variation of the variable can be explained by the other variable with which the covariance has been computed. Thus, we obtain an important measure of the proportion of the variation of one variable determined by the variation of the other. This quantity, the square of the correlation coefficient, $r_{12}^2$, is called the coefficient of determination. It ranges from zero to 1 and cannot be negative, regardless of whether the correlation coefficient is negative or positive. Incidentally, here is proof that the correlation coefficient cannot vary beyond $-1$ and $+1$. Since its square is the coefficient of determination and we have just shown that the bounds of the latter are zero to 1, it is obvious that the bounds of its square root will be $\pm 1$.
The coefficient of determination is useful also when one is considering the relative importance of correlations of different magnitudes. As can be seen by a reexamination of Figure 12.3, the rate at which the scatter diagrams go from a distribution with a circular outline to one resembling an ellipse seems to be more directly proportional to $r^2$ than to $r$ itself. Thus, in Figure 12.3B, with $\rho^2 = 0.09$, it is difficult to detect the correlation visually. However, by the time we reach Figure 12.3D, with $\rho^2 = 0.49$, the presence of correlation is very apparent.
The coefficient of determination is a quantity that may be useful in regression analysis also. You will recall that in a regression we used anova to partition the total sum of squares into explained and unexplained sums of squares. Once such an analysis of variance has been carried out, one can obtain the ratio of the explained sum of squares over the total SS as a measure of the proportion of the total variation that has been explained by the regression. However, as already discussed in Section 12.1, it would not be meaningful to take the square root of such a coefficient of determination and consider it as an estimate of the parametric correlation of these variables.
We shall now take up a mathematical relation between the coefficients of correlation and regression. At the risk of being repetitious, we should stress again that though we can easily convert one coefficient into the other, this does not mean that the two types of coefficients can be used interchangeably on the same sort of data. One important relationship between the correlation coefficient and the regression coefficient can be derived as follows from Expression (12.3):
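$$r_{12} = \frac{\sum y_1 y_2}{\sqrt{\sum y_1^2 \sum y_2^2}}$$

Multiplying numerator and denominator of this expression by $\sqrt{\sum y_1^2}$, we obtain

$$r_{12} = \frac{\sum y_1 y_2}{\sum y_1^2} \cdot \frac{\sqrt{\sum y_1^2}}{\sqrt{\sum y_2^2}}$$

Dividing numerator and denominator of the right term of this expression by $\sqrt{n-1}$, we obtain

$$r_{12} = \frac{\sum y_1 y_2}{\sum y_1^2} \cdot \frac{\sqrt{\sum y_1^2/(n-1)}}{\sqrt{\sum y_2^2/(n-1)}} = b_{2 \cdot 1}\,\frac{s_1}{s_2} \qquad (12.7)$$

Similarly, we could demonstrate that

$$r_{12} = b_{1 \cdot 2}\,\frac{s_2}{s_1} \qquad (12.7a)$$

and hence

$$b_{1 \cdot 2} = r_{12}\,\frac{s_1}{s_2} \qquad (12.7b)$$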
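Expressions (12.6) and (12.7) through (12.7b) can be checked numerically. The sketch below is an illustration only: it uses the crab gill-weight and body-weight measurements tabulated in Box 12.1, computes everything from deviations about the means, and the variable names (`b21`, `b12`, and so on) are our own labels for $b_{2 \cdot 1}$ and $b_{1 \cdot 2}$.

```python
import math

# Crab data of Box 12.1: Y1 = gill weight (mg), Y2 = body weight (g)
gill = [159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210]
body = [14.40, 15.20, 11.30, 2.50, 22.70, 14.90, 1.41, 15.81, 4.19, 15.39, 17.25, 9.52]


def deviations(values):
    """Deviations of each variate from the sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


y1, y2 = deviations(gill), deviations(body)
n = len(gill)
ss1 = sum(d * d for d in y1)              # sum of squares of Y1
ss2 = sum(d * d for d in y2)              # sum of squares of Y2
sp = sum(a * b for a, b in zip(y1, y2))   # sum of products

r12 = sp / math.sqrt(ss1 * ss2)           # Expression (12.3)
b21 = sp / ss1                            # regression coefficient of Y2 on Y1
b12 = sp / ss2                            # regression coefficient of Y1 on Y2
s1 = math.sqrt(ss1 / (n - 1))
s2 = math.sqrt(ss2 / (n - 1))
explained_ss2 = sp ** 2 / ss1             # explained sum of squares of Y2 on Y1

print(f"r12               = {r12:.4f}")
print(f"b21 * s1 / s2     = {b21 * s1 / s2:.4f}   (12.7)")
print(f"b12 * s2 / s1     = {b12 * s2 / s1:.4f}   (12.7a)")
print(f"explained / total = {explained_ss2 / ss2:.4f}, r12 squared = {r12 ** 2:.4f}   (12.6)")
```

All three routes give the same value of $r_{12}$, although the two regression coefficients themselves differ greatly because the two variables are measured on very different scales.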
In these expressions $b_{2 \cdot 1}$ is the regression coefficient for variable $Y_2$ on $Y_1$. We see, therefore, that the correlation coefficient is the regression slope multiplied by the ratio of the standard deviations of the variables. The correlation coefficient may thus be regarded as a standardized regression coefficient. If the two standard deviations are identical, both regression coefficients and the correlation coefficient will be identical in value.
Now that we know about the coefficient of correlation, some of the earlier work on paired comparisons (see Section 9.3) can be put into proper perspective. In Appendix A1.8 we show for the corresponding parametric expressions that the variance of a sum of two variables is

$$s^2_{(Y_1 + Y_2)} = s_1^2 + s_2^2 + 2 r_{12} s_1 s_2 \qquad (12.8)$$
where $s_1$ and $s_2$ are standard deviations of $Y_1$ and $Y_2$, respectively, and $r_{12}$ is the correlation coefficient between these variables. Similarly, for a difference between two variables, we obtain

$$s^2_{(Y_1 - Y_2)} = s_1^2 + s_2^2 - 2 r_{12} s_1 s_2 \qquad (12.9)$$
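Expressions (12.8) and (12.9) hold exactly for sample statistics as well, which a few lines of arithmetic can confirm. The sketch below is illustrative only: the paired `before` and `after` measurements are hypothetical values invented for this example, not data from the text.

```python
import math


def variance(values):
    """Sample variance: sum of squares divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def correlation(a, b):
    """Product-moment correlation coefficient, Expression (12.3)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    sp = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return sp / math.sqrt(ss_a * ss_b)


# Hypothetical paired measurements on the same eight individuals
before = [10.2, 11.5, 9.8, 12.1, 10.9, 11.7, 10.4, 12.6]
after = [10.9, 12.3, 10.1, 13.0, 11.2, 12.5, 11.0, 13.1]

s1_sq, s2_sq = variance(before), variance(after)
r12 = correlation(before, after)
covariance_term = 2 * r12 * math.sqrt(s1_sq) * math.sqrt(s2_sq)  # twice the covariance

var_sum = variance([x + y for x, y in zip(before, after)])
var_diff = variance([x - y for x, y in zip(before, after)])

print(f"variance of sum:        {var_sum:.4f} vs {s1_sq + s2_sq + covariance_term:.4f}")
print(f"variance of difference: {var_diff:.4f} vs {s1_sq + s2_sq - covariance_term:.4f}")
print(f"sum of the two variances alone: {s1_sq + s2_sq:.4f}")
```

Because the hypothetical pairs are strongly positively correlated, the variance of the differences is far smaller than the simple sum of the two variances.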
What Expression (12.8) indicates is that if we make a new composite variable that is the sum of two other variables, the variance of this new variable will be the sum of the variances of the variables of which it is composed plus an added term, which is a function of the standard deviations of these two variables and of the correlation between them. It is shown in Appendix A1.8 that this added term is twice the covariance of $Y_1$ and $Y_2$. When the two variables
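BOX 12.1
Computation of the product-moment correlation coefficient.

Relationships between gill weight and body weight in the crab Pachygrapsus crassipes. n = 12.

(1) $Y_1$: Gill weight in milligrams    (2) $Y_2$: Body weight in grams
159        14.40
179        15.20
100        11.30
 45         2.50
384        22.70
230        14.90
100         1.41
320        15.81
 80         4.19
220        15.39
320        17.25
210         9.52

Computation

1. $\sum Y_1 = 159 + \cdots + 210 = 2347$
2. $\sum Y_1^2 = 159^2 + \cdots + 210^2 = 583{,}403$
3. $\sum Y_2 = 14.40 + \cdots + 9.52 = 144.57$
4. $\sum Y_2^2 = (14.40)^2 + \cdots + (9.52)^2 = 2204.1853$
5. $\sum Y_1 Y_2 = 14.40(159) + \cdots + 9.52(210) = 34{,}837.10$
6. Sum of squares of $Y_1$ $= \sum y_1^2 = \sum Y_1^2 - \dfrac{(\sum Y_1)^2}{n} = 583{,}403 - \dfrac{(2347)^2}{12} = 124{,}368.9167$
7. Sum of squares of $Y_2$ $= \sum y_2^2 = \sum Y_2^2 - \dfrac{(\sum Y_2)^2}{n} = 2204.1853 - \dfrac{(144.57)^2}{12} = 462.4782$
8. Sum of products $= \sum y_1 y_2 = \sum Y_1 Y_2 - \dfrac{(\sum Y_1)(\sum Y_2)}{n} = 34{,}837.10 - \dfrac{(2347)(144.57)}{12} = 6561.6175$
9. Product-moment correlation coefficient, by Expression (12.3):
$r_{12} = \dfrac{\sum y_1 y_2}{\sqrt{\sum y_1^2 \sum y_2^2}} = \dfrac{6561.6175}{\sqrt{(124{,}368.9167)(462.4782)}} = \dfrac{6561.6175}{7584.0565} = 0.8652$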
being summed are uncorrelated, this added covariance term will be zero, and the variance of the sum will simply be the sum of variances of the two variables. This is the reason why, in an anova or in a t test of the difference between the two means, we had to assume the independence of the two variables to permit us to add their variances. Otherwise we would have had to allow for a covariance term. By contrast, in the paired-comparisons technique we expect correlation between the variables, since the members in each pair share a common experience. The paired-comparisons test automatically subtracts a covariance term, resulting in a smaller standard error and consequently in a larger value of $t_s$, since the numerator of the ratio remains the same. Thus, whenever correlation between two variables is positive, the variance of their differences will be considerably smaller than the sum of their variances; this is the reason why the paired-comparisons test has to be used in place of the t test for difference of means. These considerations are equally true for the corresponding analyses of variance, single-classification and two-way anova.
The computation of a product-moment correlation coefficient is quite simple. The basic quantities needed are the same six required for computation of the regression coefficient (Section 11.3). Box 12.1 illustrates how the coefficient should be computed. The example is based on a sample of 12 crabs in which gill weight $Y_1$ and body weight $Y_2$ have been recorded. We wish to know whether there is a correlation between the weight of the gill and that of the body, the latter representing a measure of overall size.
The existence of a positive correlation might lead you to conclude that a bigger-bodied crab with its resulting greater amount of metabolism would require larger gills in order to
FIGURE 12.4 Scatter diagram for crab data of Box 12.1. (Horizontal axis: body weight in grams.)
provide the necessary oxygen. The computations are illustrated in Box 12.1. The correlation coefficient of 0.87 agrees with the clear slope and narrow elliptical outline of the scattergram for these data in Figure 12.4.
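The steps of Box 12.1 translate directly into a short program. The sketch below follows the same route as the box, from the raw sums and sums of squares to the sums of squares and products of deviations and finally to Expression (12.3); the variable names are our own.

```python
import math

# Crab data of Box 12.1
gill_mg = [159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210]   # Y1
body_g = [14.40, 15.20, 11.30, 2.50, 22.70, 14.90, 1.41, 15.81, 4.19,
          15.39, 17.25, 9.52]                                           # Y2
n = len(gill_mg)

# Basic quantities: sums, sums of squared variates, sum of cross products
sum_y1 = sum(gill_mg)
sum_y1_sq = sum(v * v for v in gill_mg)
sum_y2 = sum(body_g)
sum_y2_sq = sum(v * v for v in body_g)
sum_y1y2 = sum(a * b for a, b in zip(gill_mg, body_g))

# Sums of squares and sum of products of deviations from the means
ss_y1 = sum_y1_sq - sum_y1 ** 2 / n
ss_y2 = sum_y2_sq - sum_y2 ** 2 / n
sp_y1y2 = sum_y1y2 - sum_y1 * sum_y2 / n

# Product-moment correlation coefficient, Expression (12.3)
r12 = sp_y1y2 / math.sqrt(ss_y1 * ss_y2)
print(f"sum of products = {sp_y1y2:.4f}")
print(f"r12 = {r12:.4f}")
```

The printed coefficient rounds to 0.8652, the value carried forward to the significance tests of Section 12.3.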
12.3 Significance tests in correlation
The most common significance test is whether it is possible for a sample correlation coefficient to have come from a population with a parametric correlation coefficient of zero. The null hypothesis is therefore $H_0$: $\rho = 0$. This implies that the two variables are uncorrelated. If the sample comes from a bivariate normal distribution and $\rho = 0$, the standard error of the correlation coefficient is $s_r = \sqrt{(1 - r^2)/(n - 2)}$. The hypothesis is tested as a t test with $n - 2$ degrees of freedom, $t_s = (r - 0)/\sqrt{(1 - r^2)/(n - 2)} = r\sqrt{(n - 2)/(1 - r^2)}$. We should emphasize that this standard error applies only when $\rho = 0$, so that it cannot be applied to testing a hypothesis that $\rho$ is a specific value other than zero. The t test for the significance of $r$ is mathematically equivalent to the t test for the significance of $b$, in either case measuring the strength of the association between the two variables being tested. This is somewhat analogous to the situation in Model I and Model II single-classification anova, where the same F test establishes the significance regardless of the model.
Significance tests following this formula have been carried out systematically and are tabulated in Table VIII, which permits the direct inspection of a sample correlation coefficient for significance without further computation. Box 12.2 illustrates tests of the hypothesis $H_0$: $\rho = 0$, using Table VIII as well as the t test discussed at first.
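The t test just described takes only a few lines to compute. This is a minimal sketch; the function name is our own, and the critical value still has to be looked up in a table of the t distribution (or obtained from statistical software) for $n - 2$ degrees of freedom.

```python
import math


def t_for_correlation(r, n):
    """Return (t_s, df) for testing H0: rho = 0 from a sample r based on n pairs."""
    df = n - 2
    s_r = math.sqrt((1.0 - r * r) / df)   # standard error of r, valid only when rho = 0
    return (r - 0.0) / s_r, df


# Crab data of Box 12.1: r = 0.8652, n = 12
t_s, df = t_for_correlation(0.8652, 12)
print(f"t_s = {t_s:.4f} with {df} degrees of freedom")
```

For the crab data this gives $t_s \approx 5.456$ with 10 degrees of freedom, in agreement with the hand computation in Box 12.2.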
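BOX 12.2
Tests of significance and confidence limits for correlation coefficients.

Test of the null hypothesis $H_0$: $\rho = 0$ versus $H_1$: $\rho \neq 0$

The simplest procedure is to consult Table VIII, where the critical values of $r$ are tabulated for $df = n - 2$ from 1 to 1000. If the absolute value of the observed $r$ is greater than the tabulated value in the column for two variables, we reject the null hypothesis.

Examples. In Box 12.1 we found the correlation between body weight and gill weight to be 0.8652, based on a sample of $n = 12$. For 10 degrees of freedom the critical values are 0.576 at the 5% level and 0.708 at the 1% level of significance. Since the observed correlation is greater than both of these, we can reject the null hypothesis, $H_0$: $\rho = 0$, at $P < 0.01$.

Table VIII is based upon the following test, which may be carried out when the table is not available or when an exact test is needed at significance levels or at degrees of freedom other than those furnished in the table. The null hypothesis is tested by means of the t distribution (with $n - 2$ df) by using the standard error of $r$. When $\rho = 0$,

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

Therefore,

$$t_s = \frac{r - 0}{\sqrt{(1 - r^2)/(n - 2)}} = r\sqrt{\frac{n - 2}{1 - r^2}}$$

For the data of Box 12.1, this would be

$$t_s = 0.8652\sqrt{\frac{12 - 2}{1 - 0.8652^2}} = 0.8652\sqrt{\frac{10}{0.25143}} = 0.8652\sqrt{39.7725} = 0.8652(6.3065) = 5.4564 > t_{0.001[10]}$$

For a one-tailed test the 0.10 and 0.02 values of $t$ should be used for 5% and 1% significance tests, respectively. Such tests would apply if the alternative hypothesis were $H_1$: $\rho > 0$ or $H_1$: $\rho < 0$, rather than $H_1$: $\rho \neq 0$.

When $n$ is greater than 50, we can also make use of the $z$ transformation described in the text. Since $\sigma_z = 1/\sqrt{n - 3}$, we test

$$t_s = \frac{z - 0}{1/\sqrt{n - 3}} = z\sqrt{n - 3}$$

Since $z$ is normally distributed and we are using a parametric standard deviation, we compare $t_s$ with $t_{\alpha[\infty]}$ or employ Table II, "Areas of the normal curve." If we had a sample correlation of $r = 0.837$ between length of right- and left-wing veins of bees based on $n = 500$, we would find $z = 1.2111$ in Table X. Then

$$t_s = 1.2111\sqrt{497} = 26.9997$$

This value, when looked up in Table II, yields a very small probability ($< 10^{-6}$).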
Suppose we wish to test the null hypothesis $H_0$: $\rho = +0.5$ versus $H_1$: $\rho \neq +0.5$ for the case just considered. We would use the following expression:

$$t_s = \frac{z - \zeta}{1/\sqrt{n - 3}} = (z - \zeta)\sqrt{n - 3}$$

where $z$ and $\zeta$ are the $z$ transformations of $r$ and $\rho$, respectively. Again we compare $t_s$ with $t_{\alpha[\infty]}$ or look it up in Table II. From Table X we find

For $r = 0.837$, $z = 1.2111$
For $\rho = 0.500$, $\zeta = 0.5493$

Therefore

$$t_s = (1.2111 - 0.5493)\sqrt{497} = 14.7538$$

The probability of obtaining such a value of $t_s$ by random sampling is $P < 10^{-6}$ (see Table II). It is most unlikely that the parametric correlation between right- and left-wing veins is 0.5.

Confidence limits

If $n > 50$, we can set confidence limits to $r$ using the $z$ transformation. We first convert the sample $r$ to $z$, set confidence limits to this $z$, and then transform these limits back to the $r$ scale. We shall find 95% confidence limits for the above wing vein length data. For $r = 0.837$, $z = 1.2111$, $\alpha = 0.05$.

$$L_1 = z - t_{\alpha[\infty]}\sigma_z = z - \frac{t_{0.05[\infty]}}{\sqrt{n - 3}} = 1.2111 - \frac{1.960}{22.2935} = 1.2111 - 0.0879 = 1.1232$$

$$L_2 = z + \frac{t_{0.05[\infty]}}{\sqrt{n - 3}} = 1.2111 + 0.0879 = 1.2990$$

We retransform these values to the $r$ scale by finding the corresponding arguments for the $z$ function in Table X. $L_1 \approx 0.808$ and $L_2 \approx 0.862$ are the 95% confidence limits around $r = 0.837$.

Test of the difference between two correlation coefficients
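The test statistic is

$$t_s = \frac{z_1 - z_2}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}}$$

Since $z_1 - z_2$ is normally distributed and we are using a parametric standard deviation, we compare $t_s$ with $t_{\alpha[\infty]}$ or employ Table II, "Areas of the normal curve."

For example, the correlation between body weight and wing length in Drosophila pseudoobscura was found to be 0.552 in a sample of $n_1 = 39$ at the Grand Canyon and 0.665 in a sample of $n_2 = 20$ at Flagstaff, Arizona.

Grand Canyon: $z_1 = 0.6213$
Flagstaff: $z_2 = 0.8017$

$$t_s = \frac{0.6213 - 0.8017}{\sqrt{\dfrac{1}{36} + \dfrac{1}{17}}} = \frac{-0.1804}{\sqrt{0.086601}} = \frac{-0.1804}{0.29428} = -0.6130$$

From Table II we find the probability that a value of $t_s$ will lie between $\pm 0.6130$ to be about $2(0.22941) = 0.45882$, so we have no evidence on which to reject the null hypothesis.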
When $\rho$ is close to $\pm 1.0$, the distribution of sample values of $r$ is markedly asymmetrical, and, although a standard error is available for $r$ in such cases, it should not be applied unless the sample is very large ($n > 500$), a most infrequent case of little interest. To overcome this difficulty, we transform $r$ to a function $z$, developed by Fisher. The formula for $z$ is

$$z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right) \qquad (12.10)$$
You may recognize this as $z = \tanh^{-1} r$, the formula for the inverse hyperbolic tangent of $r$. This function has been tabulated in Table X, where values of $z$ corresponding to absolute values of $r$ are given. Inspection of Expression (12.10) will show that when $r = 0$, $z$ will also equal zero, since $\frac{1}{2}\ln 1$ equals zero. However, as $r$ approaches $\pm 1$, $(1 + r)/(1 - r)$ approaches $\infty$ and 0; consequently, $z$ approaches $\pm$ infinity. Therefore, substantial differences between $r$ and $z$ occur at the higher values for $r$. Thus, when $r$ is 0.115, $z = 0.1155$. For $r = -0.531$, we obtain $z = -0.5915$; $r = 0.972$ yields $z = 2.1273$. Note by how much $z$ exceeds $r$ in this last pair of values. By finding a given value of $z$ in Table X, we can also obtain the corresponding value of $r$. Inverse interpolation may be necessary. Thus, $z = 0.70$ corresponds to $r = 0.604$, and a value of $z = -2.76$ corresponds to $r = -0.992$. Some pocket calculators have built-in hyperbolic and inverse hyperbolic functions. Keys for such functions would obviate the need for Table X.
The advantage of the $z$ transformation is that while correlation coefficients are distributed in skewed fashion for values of $\rho \neq 0$, the values of $z$ are approximately normally distributed. The expected variance of $z$ is

$$\sigma_z^2 = \frac{1}{n - 3} \qquad (12.11)$$
This is an approximation adequate for sample sizes $n > 50$ and a tolerable approximation even when $n > 25$. An interesting aspect of the variance of $z$ evident from Expression (12.11) is that it is independent of the magnitude of $r$, but is simply a function of sample size $n$.
As shown in Box 12.2, for sample sizes greater than 50 we can also use the $z$ transformation to test the significance of a sample $r$ employing the hypothesis $H_0$: $\rho = 0$. In the second section of Box 12.2 we show the test of a null hypothesis that $\rho \neq 0$. We may have a hypothesis that the true correlation between two variables is a given value different from zero. Such hypotheses about the expected correlation between two variables are frequent in genetic work, and we may wish to test observed data against such a hypothesis. Although there is no a priori reason to assume that the true correlation between right and left sides of the bee wing vein lengths in Box 12.2 is 0.5, we show the test of such a hypothesis to illustrate the method. Corresponding to $\rho = 0.5$, there is $\zeta$, the parametric value of $z$. It is the $z$ transformation of $\rho$. We note that the probability that the sample $r$ of 0.837 could have been sampled from a population with $\rho = 0.5$ is vanishingly small.
Next, in Box 12.2 we see how to set confidence limits to a sample correlation coefficient $r$. This is done by means of the $z$ transformation; it will result in asymmetrical confidence limits when these are retransformed to the $r$ scale, as when setting confidence limits with variables subjected to square root or logarithmic transformations.
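Where Table X and Table II are not at hand, the $z$ transformation, the test against a specified $\rho$, and the confidence limits can all be computed directly. The following is a sketch under the large-sample normal approximation described above; the function names are our own, the default critical value 1.960 is the two-tailed 5% point of the normal curve, and the tail probability uses the complementary error function in place of Table II.

```python
import math


def fisher_z(r):
    """Expression (12.10); equivalent to math.atanh(r)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def z_test(r, n, rho0=0.0):
    """t_s = (z - zeta) * sqrt(n - 3) for H0: rho = rho0."""
    return (fisher_z(r) - fisher_z(rho0)) * math.sqrt(n - 3)


def two_tailed_p(t_s):
    """Area in both tails of the standard normal curve beyond |t_s|."""
    return math.erfc(abs(t_s) / math.sqrt(2.0))


def confidence_limits(r, n, t_crit=1.960):
    """Limits set on the z scale, then transformed back to the r scale."""
    z = fisher_z(r)
    half_width = t_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Bee wing vein example of Box 12.2: r = 0.837, n = 500
print(f"z = {fisher_z(0.837):.4f}")
print(f"t_s for H0: rho = 0.5 -> {z_test(0.837, 500, 0.5):.3f}")
lower, upper = confidence_limits(0.837, 500)
print(f"95% limits: {lower:.3f} to {upper:.3f}")
```

Because the functions are evaluated exactly rather than read from tables, the results can differ from the boxed values in the third decimal place.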
A test for the significance of the difference between two sample correlation coefficients is the final example illustrated in Box 12.2. A standard error for the difference is computed and tested against a table of areas of the normal curve. In the example the correlation between body weight and wing length in two Drosophila populations was tested, and the difference in correlation coefficients between the two populations was found not significant. The formula given is an acceptable approximation when the smaller of the two samples is greater than 25. It is frequently used with even smaller sample sizes, as shown in our example in Box 12.2.
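The comparison of two correlation coefficients can be sketched in the same way. Note the assumptions: the function name is our own, and the inputs $r_1 = 0.552$, $r_2 = 0.665$, and $n_2 = 20$ are back-calculated from the $z$ values (0.6213 and 0.8017) and the standard error quoted in Box 12.2, so they should be read as reconstructed rather than quoted figures.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Normal-approximation test of H0: rho1 = rho2 using the z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    t_s = (z1 - z2) / standard_error
    p_two_tailed = math.erfc(abs(t_s) / math.sqrt(2.0))  # both tails of the normal curve
    return t_s, p_two_tailed


# Drosophila example; r values and n2 are back-calculated (see the note above)
t_s, p = compare_correlations(0.552, 39, 0.665, 20)
print(f"t_s = {t_s:.4f}, two-tailed P = {p:.3f}")
```

The two-tailed probability is the complement of the area between $\pm t_s$ read from a table of the normal curve, so a large value here again indicates no evidence against the null hypothesis.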
12.4 Applications of correlation
The purpose of correlation analysis is to measure the intensity of association observed between any pair of variables and to test whether it is greater than could be expected by chance alone. Once established, such an association is likely to lead to reasoning about causal relationships between the variables. Students of statistics are told at an early stage not to confuse significant correlation with causation. We are also warned about so-called nonsense correlations, a well-known case being the positive correlation between the number of Baptist ministers and the per capita liquor consumption in cities with populations of over 10,000 in the United States. Individual cases of correlation must be carefully analyzed before inferences are drawn from them. It is useful to distinguish correlations in which one variable is the entire or, more likely, the partial cause of another from others in which the two correlated variables have a common cause and from more complicated situations involving both direct influence and common causes. The establishment of a significant correlation does not tell us which of many possible structural models is appropriate. Further analysis is needed to discriminate between the various models.
The traditional distinction of real versus nonsense or illusory correlation is of little use. In supposedly legitimate correlations, causal connections are known or at least believed to be clearly understood. In so-called illusory correlations, no reasonable connection between the variables can be found; or if one is demonstrated, it is of no real interest or may be shown to be an artifact of the sampling procedure. Thus, the correlation between Baptist ministers and liquor consumption is simply a consequence of city size. The larger the city, the more Baptist ministers it will contain on the average and the greater will be the liquor consumption. The correlation is of little interest to anyone studying either the distribution of Baptist ministers or the consumption of alcohol.
Some correlations have time as the common factor, and processes that change with time are frequently likely to be correlated, not because of any functional biological reasons but simply because the change with time in the two variables under consideration happens to be in the same direction. Thus, size of an insect population building up through the summer may be correlated with the height of some weeds, but this may simply be a function of the passage of time. There may be no ecological relation between the plant and the insects. Another situation in which the correlation might be considered an artifact is when one of the variables is in part a mathematical function of the other. Thus, for example, if $Y = Z/X$ and we compute the correlation of $X$ with $Y$, the existing relation will tend to produce a negative correlation.
Perhaps the only correlations properly called nonsense or illusory are those assumed by popular belief or scientific intuition which, when tested by proper statistical methodology using adequate sample sizes, are found to be not significant. Thus, if we can show that there is no significant correlation between amount of saturated fats eaten and the degree of atherosclerosis, we can consider this to be an illusory correlation. Remember also that when testing significance of correlations at conventional levels of significance, you must allow for type I error, which will lead to your judging a certain percentage of correlations significant when in fact the parametric value of $\rho = 0$.
Correlation coefficients have a history of extensive use and application dating back to the English biometric school at the beginning of the twentieth century.
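The artifact produced by a ratio such as $Y = Z/X$ is easy to demonstrate by simulation. The sketch below is a hypothetical illustration: $X$ and $Z$ are drawn independently from a uniform distribution on 1 to 10 (an arbitrary choice made only for this example), so any correlation between $X$ and $Y$ arises solely from the mathematical relation between them.

```python
import math
import random


def correlation(a, b):
    """Product-moment correlation coefficient, Expression (12.3)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    sp = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return sp / math.sqrt(ss_a * ss_b)


rng = random.Random(7)  # arbitrary seed for repeatability
n = 2000
x = [rng.uniform(1.0, 10.0) for _ in range(n)]   # hypothetical variable X
z = [rng.uniform(1.0, 10.0) for _ in range(n)]   # hypothetical variable Z, independent of X
y = [zi / xi for zi, xi in zip(z, x)]            # Y is in part a function of X

r_xz = correlation(x, z)
r_xy = correlation(x, y)
print(f"r between X and Z (independent): {r_xz:+.3f}")
print(f"r between X and Y = Z/X:         {r_xy:+.3f}")
```

The first coefficient stays near zero, whereas the second is clearly negative even though $X$ and $Z$ share no relation.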
Recent years have seen somewhat less application of this technique as increasing segments of biological research have become experimental. In experiments in which one factor is varied and the response of another variable to the
deliberate variation of the first is examined, the method of regression is more appropriate, as has already been discussed. However, large areas of biology and of other sciences remain where the experimental method is not suitable because variables cannot be brought under control of the investigator. There are many areas of medicine, ecology, systematics, evolution, and other fields in which experimental methods are difficult to apply. As yet, the weather cannot be controlled, nor can historical evolutionary factors be altered. Epidemiological variables are generally not subject to experimental manipulation. Nevertheless, we need an understanding of the scientific mechanisms underlying these phenomena as much as of those in biochemistry or experimental embryology. In such cases, correlation analysis serves as a first descriptive technique estimating the degrees of association among the variables involved.
12.5 Kendall's coefficient of rank correlation
Occasionally data are known not to follow the bivariate normal distribution, yet we wish to test for the significance of association between the two variables. One method of analyzing such data is by ranking the variates and calculating a coefficient of rank correlation. This approach belongs to the general family of nonparametric methods we encountered in Chapter 10, where we learned methods for analyses of ranked variates paralleling anova. In other cases especially suited to ranking methods, we cannot measure the variable on an absolute scale, but only on an ordinal scale. This is typical of data in which we estimate relative performance, as in assigning positions in a class. We can say that A is the best student, B is the second-best student, C and D are equal to each other and next-best, and so on. If two instructors independently rank a group of students, we can then test whether the two sets of rankings are independent (which we would not expect if the judgments of the instructors are based on objective evidence). Of greater biological and medical interest are the following examples. We might wish to correlate order of emergence in a sample of insects with a ranking in size, or order of germination in a sample of plants with rank order of flowering. An epidemiologist may wish to associate rank order of occurrence (by time) of an infectious disease within a community, on the one hand, with its severity as measured by an objective criterion, on the other. We present in Box 12.3 Kendall's coefficient of rank correlation, generally symbolized by $\tau$ (tau), although it is a sample statistic, not a parameter.
The formula for Kendall's coefficient of rank correlation is $\tau = N/[n(n-1)]$, where $n$ is the conventional sample size and $N$ is a count of ranks, which can be obtained in a variety of ways. A second variable $Y_2$, if it is perfectly correlated with the first variable $Y_1$, should be in the same order as the $Y_1$ variates. However, if the correlation is less than perfect, the order of the variates $Y_2$ will not entirely correspond to that of $Y_1$. The quantity $N$ measures how well the second variable corresponds to the order of the first. It has a maximal value of $n(n-1)$ and a minimal value of $-n(n-1)$. The following small example will make this clear.
BOX 12.3
Kendall's coefficient of rank correlation, $\tau$.
Computation of a rank correlation coefficient between the blood neutrophil counts ($Y_1$: $\times 10^{-3}$ per $\mu$l) and total marrow neutrophil mass ($Y_2$: $\times 10^{9}$ per kg) in 15 patients with nonhematological tumors; $n = 15$ pairs of observations.
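(1)       (2)    (3)    (4)    (5)
Patient   Y1     R1     Y2     R2
1         ...    ...    ...    1
2         ...    ...    ...    9
3         ...    ...    ...    6
4         ...    ...    ...    12
5         ...    ...    ...    15
6         ...    ...    ...    13
7         ...    ...    ...    4
8         ...    9      ...    5
9         ...    1      ...    10
10        ...    2      ...    8
11        ...    15     ...    14
12        ...    3      ...    11
13        ...    10     ...    2
14        ...    4      ...    7
15        ...    12     ...    3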
Computational steps

1. Rank variables $Y_1$ and $Y_2$ separately and then replace the original variates with the ranks (assign tied ranks if necessary so that for both variables you will always have $n$ ranks for $n$ variates). These ranks are listed in columns (3) and (5) above.
2. Write down the ranks of one of the two variables in order, paired with the rank values assigned for the other variable (as shown below). If only one variable has ties, order the pairs by the variable without ties. If both variables have ties, it does not matter which of the variables is ordered.
3. Obtain a sum of the counts $C_i$ as follows. Examine the first value in the column of ranks paired with the ordered column. In our case, this is rank 10. Count all ranks subsequent to it which are higher than the rank being considered. Thus, in this case, count all ranks greater than 10. There are fourteen ranks following the 10 and five of them are greater than 10. Therefore, we count a score of $C_1 = 5$. Now we look at the next rank (rank 8) and find that six of the thirteen subsequent ranks are greater than it; therefore, $C_2$ is equal to 6. The third rank is 11, and four following ranks are higher than it. Hence, $C_3 = 4$. Continue in this manner, taking each rank of the variable in turn and counting the number of higher ranks subsequent to it. This can usually be done in one's head, but we show it explicitly below so that the method will be entirely clear. Whenever a subsequent rank is tied in value with the pivotal rank, count $\tfrac{1}{2}$ instead of 1.
$R_1$   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

We then need the quantity

$$N = 4\sum^{n} C_i - n(n-1) = 4(59) - 15(14) = 236 - 210 = 26$$

The coefficient of rank correlation is

$$\tau = \frac{N}{\sqrt{\left[n(n-1) - \sum^{m} T_1\right]\left[n(n-1) - \sum^{m} T_2\right]}}$$

where $\sum^{m} T_1$ and $\sum^{m} T_2$ are the sums of correction terms for ties in the ranks of variable $Y_1$ and $Y_2$, respectively, defined as follows. A $T$ value equal to $t(t-1)$ is computed for each group of $t$ tied variates and summed over $m$ such groups. Thus if variable $Y_2$ had had two sets of ties, one involving $t = 2$ variates and a second involving $t = 3$ variates, one would have computed $\sum^{m} T_2 = 2(2-1) + 3(3-1) = 8$. It has been suggested that if the ties are due to lack of precision rather than being real, the coefficient should be computed by the simpler formula.
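The tie correction can be written out in a few lines. The following is a minimal sketch and not the box's own computation: the function names and the example ranks (one group of two ties and one group of three) are hypothetical and were made up only to reproduce the sum of 8 worked out in the text.

```python
from collections import Counter
from math import sqrt


def tie_correction(ranks):
    """Sum of t(t - 1) over every group of t tied ranks."""
    return sum(t * (t - 1) for t in Counter(ranks).values() if t > 1)


def tau_with_ties(count_n, ranks_1, ranks_2):
    """Kendall's tau from the count N, corrected for ties in either variable."""
    n = len(ranks_1)
    base = n * (n - 1)
    denominator = sqrt((base - tie_correction(ranks_1)) *
                       (base - tie_correction(ranks_2)))
    return count_n / denominator


# Hypothetical ranks: 2.5 occurs twice (t = 2) and 5 occurs three times (t = 3).
example_ranks = [1, 2.5, 2.5, 5, 5, 5, 7]
print(tie_correction(example_ranks))  # 2(2 - 1) + 3(3 - 1)

# Without ties both correction sums are zero and the formula reduces to N / n(n - 1).
untied = list(range(1, 16))
print(tau_with_ties(26, untied, untied))
```

With untied ranks the second call simply divides N = 26 by n(n - 1) = 210.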
For sample sizes $n > 40$, the significance of $\tau$ can be tested by the normal approximation

$$t_s = \frac{\tau}{\sqrt{\dfrac{2(2n + 5)}{9n(n - 1)}}}$$

compared with $t_{\alpha[\infty]}$. When $n \le 40$, this approximation is not accurate, and Table XIV must be consulted. The table gives various (two-tailed) critical values of $\tau$ for $n = 4$ to 40. The critical value at the 0.05 level is 0.390. Hence the observed value of $\tau$ is not significantly different from zero.
Suppose we have a sample of five individuals that have been arrayed by rank of variable $Y_1$ and whose rankings for a second variable $Y_2$ are entered paired with the ranks for $Y_1$:

$Y_1$   1   2   3   4   5
$Y_2$   1   3   2   5   4
Note that the ranking by variable $Y_2$ is not totally concordant with that by $Y_1$. The technique employed in Box 12.3 is to count the number of higher ranks following any given rank, sum this quantity for all ranks, multiply the sum $\sum^{n} C_i$ by 4, and subtract from the result a correction factor $n(n-1)$ to obtain a statistic $N$. If, for purposes of illustration, we undertake to calculate the correlation of variable $Y_1$ with itself, we will find $\sum^{n} C_i = 4 + 3 + 2 + 1 + 0 = 10$. Then we compute $N = 4\sum^{n} C_i - n(n-1) = 40 - 5(4) = 20$, to obtain the maximum possible score $N = n(n-1) = 20$. Obviously, $Y_1$, being ordered, is always perfectly concordant with itself. However, for $Y_2$ we obtain only $\sum^{n} C_i = 4 + 2 + 2 + 0 + 0 = 8$, and so $N = 4(8) - 5(4) = 12$. Since the maximum score of $N$ for $Y_1$ (the score we would have if the correlation were perfect) is $n(n-1) = 20$ and the observed score 12, an obvious coefficient suggests itself as $N/[n(n-1)] = [4\sum^{n} C_i - n(n-1)]/[n(n-1)] = 12/20 = 0.6$. Ties between individuals in the ranking process present minor complications that are dealt with in Box 12.3. The correlation in that box is between blood neutrophil counts and total marrow neutrophil mass in 15 cancer patients. The authors note that there is a product-moment correlation of 0.69 between these two variables, but when the data are analyzed by Kendall's rank correlation coefficient, the association between the two variables is low and nonsignificant. Examination of the data
reveals that there is marked skewness in both variables. The data cannot, therefore, meet the assumptions of bivariate normality. Although there is little evidence of correlation among most of the variates, the three largest variates for each variable are correlated, and this induces the misleadingly high product-moment correlation coefficient. The significance of $\tau$ for sample sizes greater than 40 can easily be tested by a standard error shown in Box 12.3. For sample sizes up to 40, look up critical values of $\tau$ in Table XIV.
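The counting procedure of this section is short enough to sketch in code. The version below is a minimal illustration that assumes there are no tied values in either variable; the function names are our own and are not taken from the text. It is applied to the five-individual example and reproduces the counts and the coefficient worked out above.

```python
from math import sqrt


def kendall_tau_no_ties(y1, y2):
    """Kendall's tau by the counting method, assuming no tied values."""
    n = len(y1)
    # Arrange the second variable in the order of the first variable.
    paired = [b for _, b in sorted(zip(y1, y2))]
    # C_i is the number of later values that exceed the i-th value.
    sum_c = sum(
        sum(1 for later in paired[i + 1:] if later > paired[i])
        for i in range(n)
    )
    count_n = 4 * sum_c - n * (n - 1)
    return count_n / (n * (n - 1))


def tau_standard_error(n):
    """Standard error of tau under the null hypothesis (intended for n > 40)."""
    return sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))


y1 = [1, 2, 3, 4, 5]
y2 = [1, 3, 2, 5, 4]
tau_example = kendall_tau_no_ties(y1, y2)  # N = 12, so 12/20
tau_self = kendall_tau_no_ties(y1, y1)     # N = 20, so 20/20
print(tau_example, tau_self)
```

For data with ties, the half counts and the correction terms of Box 12.3 would have to be added; this sketch deliberately leaves them out.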
Exercises
12.1 Graph the following data in the form of a bivariate scatter diagram. Compute the correlation coefficient and set 95% confidence intervals to $\rho$. The data were collected for a study of geographic variation in the aphid Pemphigus populitransversus. The values in the table represent locality means based on equal sample sizes for 23 localities in eastern North America. The variables, extracted from Sokal and Thomas (1965), are expressed in millimeters. $Y_1$ = tibia length, $Y_2$ = tarsus length. The correlation coefficient will estimate correlation of these two variables over localities. ANS. $r = 0.910$, $P < 0.01$.

Locality code number   Y1      Y2
1                      0.631   0.140
2                      0.644   0.139
3                      0.612   0.140
4                      0.632   0.141
5                      0.675   0.155
6                      0.653   0.148
7                      0.655   0.146
8                      0.615   0.136
9                      0.712   0.159
10                     0.626   0.140
11                     0.597   0.133
12                     0.625   0.144
13                     0.657   0.147
14                     0.586   0.134
15                     0.574   0.134
16                     0.551   0.127
17                     0.556   0.130
18                     0.665   0.147
19                     0.585   0.138
20                     0.629   0.150
21                     0.671   0.148
22                     0.703   0.151
23                     0.662   0.142
12.2 The following data were extracted from a larger study by Brower (1959) on speciation in a group of swallowtail butterflies. Morphological measurements are in millimeters coded $\times 8$.

Specimen number   Length of 8th tergite   Length of superuncus
1                 24.0                    14.0
2                 21.0                    15.0
3                 20.0                    17.5
4                 21.5                    16.5
5                 21.5                    16.0
6                 25.5                    16.0
7                 25.5                    17.5
8                 28.5                    16.5
9                 23.5                    15.0
10                22.0                    15.5
11                22.5                    17.5
12                20.5                    19.0
13                21.0                    13.5
14                19.5                    19.0
15                26.0                    18.0
16                23.0                    17.0
17                21.0                    18.0
18                21.0                    17.0
19                20.5                    16.0
20                22.5                    15.5
21                20.0                    11.5
22                21.5                    11.0
23                18.5                    10.0
24                20.0                    11.0
25                19.0                    11.0
26                20.5                    11.0
27                19.5                    11.0
28                19.0                    10.5
29                21.5                    11.0
30                20.0                    11.5
31                21.5                    10.0
32                20.5                    12.0
33                20.0                    10.5
34                21.5                    12.5
35                17.5                    12.0
36                21.0                    12.5
37                21.0                    11.5
38                21.0                    12.0
39                19.5                    10.5
40                19.0                    11.0
41                18.0                    11.5
42                21.5                    10.5
43                23.0                    11.0
44                22.5                    11.5
45                19.0                    13.0
46                22.5                    14.0
47                21.0                    12.5
Papilio rutulus
Compute the correlation coefficient separately for each species and test significance of each. Test whether the two correlation coefficients differ significantly.

12.3 A pathologist measured the concentration of a toxic substance in the liver and in the peripheral blood (in $\mu$g/kg) in order to ascertain if the liver concentration is related to the blood concentration. Calculate $\tau$ and test its significance. ANS. $\tau = 0.733$.

Liver   Blood
0.296   0.283
0.315   0.323
0.022   0.159
0.361   0.381
0.202   0.208
0.444   0.411
0.252   0.254
0.371   0.352
0.329   0.319
0.183   0.177
0.369   0.315
0.199   0.259
0.353   0.353
0.251   0.303
0.346   0.293

12.4 The following table of data is from an unpublished morphometric study of the cottonwood Populus deltoides by T. J. Crovello. Twenty-six leaves from one tree were measured when fresh and again after drying. The variables shown are fresh-leaf width ($Y_1$) and dry-leaf width ($Y_2$), both in millimeters. Calculate $r$ and test its significance.

$Y_1$   97 105 90 98 92 82 106 97 98 91 76 97
12.5 Brown and Comstock (1952) found the following correlations between the length of the wing and the width of a band on the wing of females of two samples of the butterfly Heliconius charitonius:

Sample   n     r
1        100   0.29
2        46    0.70

Test whether the samples were drawn from populations with the same value of $\rho$. ANS. No, $t_s = -3.104$, $P < 0.01$.

12.6 Test for the presence of association between tibia length and tarsus length in the data of Exercise 12.1 using Kendall's coefficient of rank correlation.
CHAPTER 13
Analysis of Frequencies
Almost all our work so far has dealt with estimation of parameters and tests of hypotheses for continuous variables. The present chapter treats an important class of cases, tests of hypotheses about frequencies. Biological variables may be distributed into two or more classes, depending on some criterion such as arbitrary class limits in a continuous variable or a set of mutually exclusive attributes. An example of the former would be a frequency distribution of birth weights (a continuous variable arbitrarily divided into a number of contiguous classes); one of the latter would be a qualitative frequency distribution such as the frequency of individuals of ten different species obtained from a soil sample. For any such distribution we may hypothesize that it has been sampled from a population in which the frequencies of the various classes represent certain parametric proportions of the total frequency. We need a test of goodness of fit for our observed frequency distribution to the expected frequency distribution representing our hypothesis. You may recall that we first realized the need for such a test in Chapters 4 and 5, where we calculated expected binomial, Poisson, and normal frequency distributions but were unable to decide whether an observed sample distribution departed significantly from the theoretical one.
In Section 13.1 we introduce the idea of goodness of fit, discuss the types of significance tests that are appropriate, explain the basic rationale behind such tests, and develop general computational formulas for these tests. Section 13.2 illustrates the actual computations for goodness of fit when the data are arranged by a single criterion of classification, as in a one-way quantitative or qualitative frequency distribution. This design applies to cases expected to follow one of the well-known frequency distributions such as the binomial, Poisson, or normal distribution. It applies as well to expected distributions following some other law suggested by the scientific subject matter under investigation, such as, for example, tests of goodness of fit of observed genetic ratios against expected Mendelian frequencies. In Section 13.3 we proceed to significance tests of frequencies in two-way classifications, called tests of independence. We shall discuss the common tests of $2 \times 2$ tables in which each of two criteria of classification divides the frequencies into two classes, yielding a four-cell table, as well as $R \times C$ tables with more rows and columns. Throughout this chapter we carry out goodness of fit tests by the G statistic. We briefly mention chi-square tests, which are the traditional way of analyzing such cases. But as is explained at various places throughout the text, G tests have general theoretical advantages over chi-square tests, as well as being computationally simpler, not only by computer, but also on most pocket or tabletop calculators.
13.1 Tests for goodness of fit: Introduction
The basic idea of a goodness of fit test is easily understood, given the extensive experience you now have with statistical hypothesis testing. Let us assume that a geneticist has carried out a crossing experiment between two $F_1$ hybrids and obtains an $F_2$ progeny of 90 offspring, 80 of which appear to be wild type and 10 of which are the mutant phenotype. The geneticist assumes dominance and expects a 3:1 ratio of the phenotypes. When we calculate the actual ratios, however, we observe that the data are in a ratio 80/10 = 8:1. Expected values for $p$ and $q$ are $\hat{p} = 0.75$ and $\hat{q} = 0.25$ for the wild type and mutant, respectively. Note that we use the caret (generally called "hat" in statistics) to indicate hypothetical or expected values of the binomial proportions. However, the observed proportions of these two classes are $p = 0.89$ and $q = 0.11$, respectively. Yet another way of noting the contrast between observation and expectation is to state it in frequencies: the observed frequencies are $f_1 = 80$ and $f_2 = 10$ for the two phenotypes. Expected frequencies should be $\hat{f}_1 = \hat{p}n = 0.75(90) = 67.5$ and $\hat{f}_2 = \hat{q}n = 0.25(90) = 22.5$, respectively, where $n$ refers to the sample size of offspring from the cross. Note that when we sum the expected frequencies they yield $67.5 + 22.5 = n = 90$, as they should. The obvious question that comes to mind is whether the deviation from the 3:1 hypothesis observed in our sample is of such a magnitude as to be improbable. In other words, do the observed data differ enough from the expected
values to cause us to reject the null hypothesis? For the case just considered, you already know two methods for coming to a decision about the null hypothesis. Clearly, this is a binomial distribution in which $p$ is the probability of being a wild type and $q$ is the probability of being a mutant. It is possible to work out the probability of obtaining an outcome of 80 wild type and 10 mutants as well as all "worse" cases for $\hat{p} = 0.75$ and $\hat{q} = 0.25$, and a sample of $n = 90$ offspring. We use the conventional binomial expression here, $(\hat{p} + \hat{q})^n$, except that $p$ and $q$ are hypothesized, and we replace the symbol $k$ by $n$, which we adopted in Chapter 4 as the appropriate symbol for the sum of all the frequencies in a frequency distribution. In this example, we have only one sample, so what would ordinarily be labeled $k$ in the binomial is, at the same time, $n$. Such an example was illustrated in Table 4.3 and Section 4.2, and we can compute the cumulative probability of the tail of the binomial distribution. When this is done, we obtain a probability of 0.000,849 for all outcomes as deviant or more deviant from the hypothesis. Note that this is a one-tailed test, the alternative hypothesis being that there are, in fact, more wild-type offspring than the Mendelian hypothesis would postulate. Assuming $\hat{p} = 0.75$ and $\hat{q} = 0.25$, the observed sample is, consequently, a very unusual outcome, and we conclude that there is a significant deviation from expectation.
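The tail probability quoted above can be checked by summing the binomial terms directly. The following is a minimal sketch using only the Python standard library; the function and variable names are our own choices.

```python
from math import comb


def binomial_upper_tail(n, k_min, p):
    """P(X >= k_min) for X distributed as binomial(n, p), summed term by term."""
    q = 1.0 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(k_min, n + 1))


n_offspring = 90
p_hat_wild = 0.75  # expected proportion of wild type under the 3:1 hypothesis

# Probability of 80 or more wild-type offspring (10 or fewer mutants).
tail = binomial_upper_tail(n_offspring, 80, p_hat_wild)
print(f"{tail:.6f}")
```

Because only outcomes with more wild-type offspring than observed are added, this corresponds to the one-tailed test described in the text.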
A less time-consuming approach based on the same principle is to look up confidence limits for the binomial proportions, as was done for the sign test in Section 10.3. Interpolation in Table IX shows that for a sample of $n = 90$, an observed percentage of 89% would yield approximate 99% confidence limits of 78 and 96 for the true percentage of wild-type individuals. Clearly, the hypothesized value for $p = 0.75$ is beyond the 99% confidence bounds. Now, let us develop a third approach by a goodness of fit test. Table 13.1 illustrates how we might proceed. The first column gives the observed frequencies $f$ representing the outcome of the experiment. Column (2) shows the observed frequencies as (observed) proportions $p$ and $q$ computed as $f_1/n$ and $f_2/n$, respectively. Column (3) lists the expected proportions for the particular null hypothesis being tested. In this case, the hypothesis is a 3:1 ratio, corresponding to expected proportions $\hat{p} = 0.75$ and $\hat{q} = 0.25$, as we have seen. In column (4) we show the expected frequencies, which we have already calculated for these proportions as $\hat{f}_1 = \hat{p}n = 0.75(90) = 67.5$ and $\hat{f}_2 = \hat{q}n = 0.25(90) = 22.5$. The log likelihood ratio test for goodness of fit may be developed as follows. Using Expression (4.1) for the expected relative frequencies in a binomial distribution, we compute two quantities of interest to us here:

$$C(90, 80)\left(\tfrac{80}{90}\right)^{80}\left(\tfrac{10}{90}\right)^{10} = 0.132{,}683{,}8$$

$$C(90, 80)\left(\tfrac{3}{4}\right)^{80}\left(\tfrac{1}{4}\right)^{10} = 0.000{,}551{,}754{,}9$$
The first quantity is the probability of observing the sampled results (80 wild type and 10 mutants) on the hypothesis that $\hat{p} = p$, that is, that the population parameter equals the observed sample proportion. The second is the probability of observing the sampled results assuming that $\hat{p} = \tfrac{3}{4}$, as per the Mendelian null
hypothesis. Note that these expressions yield the probabilities for the observed outcomes only, not for observed and all worse outcomes. Thus, $P$ = 0.000,551,8 is less than the earlier computed $P$ = 0.000,849, which is the probability of 10 and fewer mutants, assuming $\hat{p} = \tfrac{3}{4}$, $\hat{q} = \tfrac{1}{4}$. The first probability (0.132,683,8) is greater than the second (0.000,551,754,9), since the hypothesis is based on the observed data. If the observed proportion $p$ is in fact equal to the proportion $\hat{p}$ postulated under the null hypothesis, then the two computed probabilities will be equal and their ratio, $L$, will equal 1.0. The greater the difference between $p$ and $\hat{p}$ (the expected proportion under the null hypothesis), the higher the ratio will be (the probability based on $p$ is divided by the probability based on $\hat{p}$ or defined by the null hypothesis). This indicates that the ratio of these two probabilities or likelihoods can be used as a statistic to measure the degree of agreement between sampled and expected frequencies. A test based on such a ratio is called a likelihood ratio test. In our case, $L$ = 0.132,683,8/0.000,551,754,9 = 240.4761. It has been shown that the distribution of
$$G = 2 \ln L \qquad (13.1)$$

can be approximated by the $\chi^2$ distribution when sample sizes are large (for a definition of "large" in this case, see Section 13.2). The appropriate number of degrees of freedom in Table 13.1 is 1 because the frequencies in the two cells for these data add to a constant sample size, 90. The outcome of the sampling experiment could have been any number of mutants from 0 to 90, but the number of wild type consequently would have to be constrained so that the total would add up to 90. One of the cells in the table is free to vary, the other is constrained. Hence, there is one degree of freedom. In our case,

$$G = 2 \ln L = 2(5.482{,}62) = 10.9652$$
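The two likelihoods, their ratio, and the resulting G can be reproduced with a few lines of standard-library Python. This is a sketch for checking the arithmetic only; the variable names are ours, and the final line relies on the fact that, for one degree of freedom, the upper tail of the chi-square distribution beyond x equals erfc(sqrt(x/2)).

```python
from math import comb, erfc, log, sqrt


def binomial_probability(n, k, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)


n, f_wild = 90, 80

like_observed = binomial_probability(n, f_wild, f_wild / n)  # p taken as 80/90
like_null = binomial_probability(n, f_wild, 0.75)            # p taken as 3/4

L = like_observed / like_null
G = 2.0 * log(L)

# Upper-tail probability of chi-square with one degree of freedom.
p_value = erfc(sqrt(G / 2.0))
print(f"L = {L:.4f}, G = {G:.4f}, P = {p_value:.6f}")
```

The binomial coefficient is the same in both likelihoods and cancels in the ratio, which is why the computational formula developed next needs no coefficient at all.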
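If we compare this observed value with a $\chi^2$ distribution with one degree of freedom, we find that the result is significant ($P < 0.001$). Clearly, we reject the 3:1 hypothesis and conclude that the proportion of wild type is greater than 0.75. The geneticist must, consequently, look for a mechanism explaining this departure from expectation. We shall now develop a simple computational formula for G. Referring back to Expression (4.1), we can rewrite the two probabilities computed earlier as

$$C(n, f_1)\,p^{f_1} q^{f_2} \qquad (13.2)$$

and

$$C(n, f_1)\,\hat{p}^{f_1} \hat{q}^{f_2} \qquad (13.2a)$$

But

$$L = \frac{C(n, f_1)\,p^{f_1} q^{f_2}}{C(n, f_1)\,\hat{p}^{f_1} \hat{q}^{f_2}} = \left(\frac{p}{\hat{p}}\right)^{f_1}\left(\frac{q}{\hat{q}}\right)^{f_2}$$

Since $f_1 = np$ and $\hat{f}_1 = n\hat{p}$, and similarly $f_2 = nq$ and $\hat{f}_2 = n\hat{q}$,

$$L = \left(\frac{f_1}{\hat{f}_1}\right)^{f_1}\left(\frac{f_2}{\hat{f}_2}\right)^{f_2}$$

and

$$\ln L = f_1 \ln\left(\frac{f_1}{\hat{f}_1}\right) + f_2 \ln\left(\frac{f_2}{\hat{f}_2}\right) \qquad (13.3)$$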
The computational steps implied by Expression (13.3) are shown in columns (5) and (6) of Table 13.1. In column (5) are given the ratios of observed over expected frequencies. These ratios would be 1 in the unlikely case of a perfect fit of observations to the hypothesis. In such a case, the logarithms of these ratios entered in column (6) would be 0, as would their sum. Consequently, G, which is twice the natural logarithm of $L$, would be 0, indicating a perfect fit of the observations to the expectations. It has been shown that the distribution of G follows a $\chi^2$ distribution. In the particular case we have been studying (the two phenotype classes) the appropriate $\chi^2$ distribution would be the one for one degree of freedom. We can appreciate the reason for the single degree of freedom when we consider the frequencies in the two classes of Table 13.1 and their sum: 80 + 10 = 90. In such an example, the total frequency is fixed. Therefore, if we were to vary the frequency of any one class, the other class would have to compensate for changes in the first class to retain a correct total. Here the meaning of one degree of freedom becomes quite clear. One of the classes is free to vary; the other is not. The test for goodness of fit can be applied to a distribution with more than two classes. If we designate the number of frequency classes in the table as $a$, the operation can be expressed by the following general computational formula, whose derivation, based on the multinomial expectations (for more than two classes), is shown in Appendix A1.9:

$$G = 2\sum^{a} f_i \ln\left(\frac{f_i}{\hat{f}_i}\right) \qquad (13.4)$$
Thus the formula can be seen as the sum of the independent contributions of departures from expectation ($\ln[f_i/\hat{f}_i]$) weighted by the frequency of the particular class ($f_i$). If the expected values are given as a proportion, a convenient computational formula for G, also derived in Appendix A1.9, is

$$G = 2\left[\sum^{a} f_i \ln\left(\frac{f_i}{\hat{p}_i}\right) - n \ln n\right] \qquad (13.5)$$
To evaluate the outcome of our test of goodness of fit, we need to know the appropriate number of degrees of freedom to be applied to the $\chi^2$ distribution. For $a$ classes, the number of degrees of freedom is $a - 1$. Since the sum of
frequencies in any problem is fixed, this means that $a - 1$ classes are free to vary, whereas the $a$th class must constitute the difference between the total sum and the sum of the previous $a - 1$ classes. In some goodness of fit tests involving more than two classes, we subtract more than one degree of freedom from the number of classes, $a$. These are instances where the parameters for the null hypothesis have been extracted from the sample data themselves, in contrast with the null hypotheses encountered in Table 13.1. In the latter case, the hypothesis to be tested was generated on the basis of the investigator's general knowledge of the specific problem and of Mendelian genetics. The values of $\hat{p} = 0.75$ and $\hat{q} = 0.25$ were dictated by the 3:1 hypothesis and were not estimated from the sampled data. For this reason, the expected frequencies are said to have been based on an extrinsic hypothesis, a hypothesis external to the data. By contrast, consider the expected Poisson frequencies of yeast cells in a hemacytometer (Box 4.1). You will recall that to compute these frequencies, you needed values for $\mu$, which you estimated from the sample mean $\bar{Y}$. Therefore, the parameter of the computed Poisson distribution came from the sampled observations themselves. The expected Poisson frequencies represent an intrinsic hypothesis. In such a case, to obtain the correct number of degrees of freedom for the test of goodness of fit, we would subtract from $a$, the number of classes into which the data had been grouped, not only one degree of freedom for $n$, the sum of the frequencies, but also one further degree of freedom for the estimate of the mean. Thus, in such a case, a sample statistic G would be compared with chi-square for $a - 2$ degrees of freedom. Now let us introduce you to an alternative technique. This is the traditional approach with which we must acquaint you because you will see it applied in the earlier literature and in a substantial proportion of current research publications.
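Before the alternative technique is taken up, the G computations of Expressions (13.4) and (13.5) and the rule for degrees of freedom can be summarized in a short sketch. The function names are hypothetical, and the treatment of a class with zero observed frequency (it is skipped, since its contribution is taken as zero) is an assumption of this sketch rather than something stated in the text.

```python
from math import log


def g_from_expected_frequencies(observed, expected):
    """Expression (13.4): G = 2 * sum of f_i * ln(f_i / f-hat_i)."""
    # Assumption: classes with f_i = 0 contribute nothing and are skipped.
    return 2.0 * sum(f * log(f / f_hat)
                     for f, f_hat in zip(observed, expected) if f > 0)


def g_from_expected_proportions(observed, proportions):
    """Expression (13.5): G = 2 * [sum of f_i * ln(f_i / p-hat_i) - n ln n]."""
    n = sum(observed)
    total = sum(f * log(f / p_hat)
                for f, p_hat in zip(observed, proportions) if f > 0)
    return 2.0 * (total - n * log(n))


def degrees_of_freedom(a, estimated_parameters=0):
    """a - 1 for an extrinsic hypothesis, less one per parameter estimated from the sample."""
    return a - 1 - estimated_parameters


observed = [80, 10]  # wild type, mutant
g_a = g_from_expected_frequencies(observed, [67.5, 22.5])
g_b = g_from_expected_proportions(observed, [0.75, 0.25])
print(f"{g_a:.4f} {g_b:.4f} df = {degrees_of_freedom(len(observed))}")
```

Both routes give the same G for the genetic cross, and an intrinsic hypothesis with one estimated parameter would be handled by passing `estimated_parameters=1`.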
We turn once more to the genetic cross with 80 wild-type and 10 mutant individuals. The computations are laid out in columns (7), (8), and (9) in Table 13.1. We first measure $f - \hat{f}$, the deviation of observed from expected frequencies. Note that the sum of these deviations equals zero, for reasons very similar to those causing the sum of deviations from a mean to add to zero. Following our previous approach of making all deviations positive by squaring them, we square $(f - \hat{f})$ in column (8) to yield a measure of the magnitude of the deviation from expectation. This quantity must be expressed as a proportion of the expected frequency. After all, if the expected frequency were 13.0, a deviation of 12.5 would be an extremely large one, comprising almost 100% of $\hat{f}$, but such a deviation would represent only 10% of an expected frequency of 125.0. Thus, we obtain column (9) as the quotient of division of the quantity in column (8) by that in column (4). Note that the magnitude of the quotient is greater for the second line, in which the $\hat{f}$ is smaller. Our next step in developing our test statistic is to sum the quotients, which is done at the foot of column (9), yielding a value of 9.259,26. This test is called the chi-square test because the resultant statistic, $X^2$, is distributed as chi-square with $a - 1$ degrees of freedom. Many persons inappropriately call the statistic obtained as the sum of column (9) a chi-square. However, since the sample statistic is not a chi-square, we have followed the increasingly prevalent convention of labeling the sample statistic $X^2$ rather than $\chi^2$. The value of $X^2$ = 9.259,26 from Table 13.1, when compared with the critical value of $\chi^2$ (Table IV), is highly significant ($P < 0.005$). The chi-square test is always one-tailed. Since the deviations are squared, negative and positive deviations both result in positive values of $X^2$. Clearly, we reject the 3:1 hypothesis and conclude that the proportion of wild type is greater than 0.75. The geneticist must, consequently, look for a mechanism explaining this departure from expectation. Our conclusions are the same as with the G test. In general, $X^2$ will be numerically similar to G. We can apply the chi-square test for goodness of fit to a distribution with more than two classes as well. The operation can be described by the formula
$$X^2 = \sum^{a} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i} \qquad (13.6)$$
which is a generalization of the computations carried out in columns (7), (8), and (9) of Table 13.1. The pertinent degrees of freedom are again $a - 1$ in the case of an extrinsic hypothesis and vary in the case of an intrinsic hypothesis. The formula is straightforward and can be applied to any of the examples we show in the next section, although we carry these out by means of the G test.

13.2 Single-classification goodness of fit tests

Before we discuss in detail the computational steps involved in tests of goodness of fit of single-classification frequency distributions, some remarks on the choice of a test statistic are in order. We have already stated that the traditional method for such a test is the chi-square test for goodness of fit. However, the newer approach by the G test has been recommended on theoretical grounds. The major advantage of the G test is that it is computationally simpler, especially in more complicated designs. Earlier reservations regarding G when desk calculators are used no longer apply. The common presence of natural logarithm keys on pocket and tabletop calculators makes G as easy to compute as $X^2$. The G tests of goodness of fit for single-classification frequency distributions are given in Box 13.1. Expected frequencies in three or more classes can be based on either extrinsic or intrinsic hypotheses, as discussed in the previous section. Examples of goodness of fit tests with more than two classes might be as follows: A genetic cross with four phenotypic classes might be tested against an expected ratio of 9:3:3:1 for these classes. A phenomenon that occurs over various time periods could be tested for uniform frequency of occurrence, for example, number of births in a city over 12 months: Is the frequency of births equal in each month? In such a case the expected frequencies are computed as being equally likely in each class. Thus, for $a$ classes, the expected frequency for any one class would be $n/a$.
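The chi-square statistic of Expression (13.6) and the equal-frequency expectation $n/a$ are both easy to sketch. The code below is an illustration only, with function names of our own; the probability for one degree of freedom again uses the identity that the chi-square upper tail beyond x equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt


def chi_square_statistic(observed, expected):
    """Expression (13.6): sum of squared deviations, each divided by its expected frequency."""
    return sum((f - f_hat) ** 2 / f_hat for f, f_hat in zip(observed, expected))


def equal_expected_frequencies(observed):
    """Expected frequency n / a in each of a equally likely classes."""
    n = sum(observed)
    a = len(observed)
    return [n / a] * a


# Genetic cross of Section 13.1: 80 wild type and 10 mutants against a 3:1 ratio.
x2 = chi_square_statistic([80, 10], [67.5, 22.5])
p_value = erfc(sqrt(x2 / 2.0))  # valid for one degree of freedom only
print(f"X2 = {x2:.5f}, P = {p_value:.5f}")
```

For a test of uniform occurrence, such as monthly births, the expected list would come from `equal_expected_frequencies` and the statistic would be referred to chi-square with $a - 1$ degrees of freedom.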
302
CHAPTER 1 3
ANALYSIS OF FREQUENCIES
BOX 13.1 G Test for Goodness of Fit. Single Classification.

1. Frequencies divided into $a \ge 2$ classes: Sex ratio in 6115 sibships of 12 in Saxony. The fourth column gives the expected frequencies, assuming a binomial distribution. These were first computed in Table 4.4 but are here given to five-decimal-place precision to give sufficient accuracy to the computation of G. Classes joined by braces have been lumped.

(1)     (2)     (3)        (4)                           (5)
♂♂      ♀♀      f          f̂                             Sign of deviation
12       0  }                  2.347,27  }
11       1  }     52          26.082,46  }   28.429,73    +
10       2       181         132.835,70                   +
 9       3       478         410.012,56                   +
 8       4       829         854.246,65                   −
 7       5      1112        1265.630,31                   −
 6       6      1343        1367.279,36                   −
 5       7      1033        1085.210,70                   −
 4       8       670         628.055,01                   +
 3       9       286         258.475,13                   +
 2      10       104          71.803,17                   +
 1      11  }                 12.088,84  }
 0      12  }     27           0.932,84  }   13.021,68    +
Total           6115        6115.000,00
Since expected frequencies $\hat{f}_i < 3$ for $a = 13$ classes should be avoided, we lump the classes at both tails with the adjacent classes to create classes of adequate size. Corresponding classes of observed frequencies $f_i$ should be lumped to match. The number of classes after lumping is $a = 11$. Compute $G$ by Expression (13.4):

$$G = 2\sum^{a} f_i \ln\left(\frac{f_i}{\hat{f}_i}\right) = 2\left[52\ln\left(\frac{52}{28.429,73}\right) + 181\ln\left(\frac{181}{132.835,70}\right) + \cdots + 27\ln\left(\frac{27}{13.021,68}\right)\right] = 94.871,55$$
Since there are $a = 11$ classes remaining, the degrees of freedom would be $a - 1 = 10$, if this were an example tested against expected frequencies based on an extrinsic hypothesis. However, because the expected frequencies are based on a binomial distribution with mean $p_♂$ estimated from the $\hat{p}_♂$ of the sample, a further degree of freedom is removed, and the sample value of $G$ is compared with a $\chi^2$ distribution with $a - 2 = 11 - 2 = 9$ degrees of freedom. We applied Williams' correction to $G$, to obtain a better approximation to $\chi^2$. In the formula computed below, $\nu$ symbolizes the pertinent degrees of freedom of the problem:

$$q = 1 + \frac{a^2 - 1}{6n\nu} = 1 + \frac{11^2 - 1}{6(6115)(9)} = 1.000,363,4$$

$$G_{adj} = \frac{G}{q} = \frac{94.871,55}{1.000,363,4} = 94.837,09 > \chi^2_{.001[9]} = 27.877$$
The null hypothesis (that the sample data follow a binomial distribution) is therefore rejected decisively. Typically, the following degrees of freedom will pertain to G tests for goodness of fit with expected frequencies based on a hypothesis intrinsic to the sample data ($a$ is the number of classes after lumping, if any):

Distribution    Parameters estimated from sample    df
Binomial        $\hat{p}$                           $a - 2$
Normal          $\mu$, $\sigma$                     $a - 3$
Poisson         $\mu$                               $a - 2$
When the parameters for such distributions are estimated from hypotheses extrinsic to the sampled data, the degrees of freedom are uniformly $a - 1$.

2. Special case of frequencies divided in $a = 2$ classes: In an $F_2$ cross in Drosophila, the following 176 progeny were obtained, of which 130 were wild-type flies and 46 ebony mutants. Assuming that the mutant is an autosomal recessive, one would expect a ratio of 3 wild-type flies to each mutant fly. To test whether the observed results are consistent with this 3:1 hypothesis, we set up the data as follows.

Flies            f              Hypothesis     f̂
Wild type        $f_1 = 130$    $p = 0.75$     $pn = 132.0$
Ebony mutant     $f_2 = 46$     $q = 0.25$     $qn = 44.0$
                 $n = 176$                     176.0
BOX 13.1 Continued

Williams' correction for the two-cell case is $q = 1 + 1/2n$, which is

$$1 + \frac{1}{2(176)} = 1.002,84$$

in this example.

$$G_{adj} = \frac{G}{q} = \frac{0.120,02}{1.002,84} = 0.1197$$

Since $G_{adj} < \chi^2_{.05[1]} = 3.841$, we clearly do not have sufficient evidence to reject our null hypothesis.
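The two computations of Box 13.1 can be retraced with a short Python sketch. The function name `g_test_gof` is an assumption made for this illustration; the formula is Expression (13.4) with Williams' correction $q = 1 + (a^2 - 1)/6n\nu$, which reduces to $1 + 1/2n$ when $a = 2$ and $\nu = 1$. The frequencies are the lumped values given in the box.

```python
import math

def g_test_gof(observed, expected, df):
    """Return (G, q, G_adj) for a single-classification goodness of fit test."""
    # Expression (13.4): G = 2 * sum f_i ln(f_i / f-hat_i); every f_i must be > 0 here.
    g = 2.0 * sum(f * math.log(f / fh) for f, fh in zip(observed, expected))
    a = len(observed)                      # number of classes after lumping
    n = sum(observed)
    q = 1.0 + (a * a - 1) / (6.0 * n * df)  # Williams' correction factor
    return g, q, g / q

# Part 1: sex ratio in sibships of 12 (a = 11 classes after lumping, intrinsic: df = a - 2).
obs = [52, 181, 478, 829, 1112, 1343, 1033, 670, 286, 104, 27]
exp = [28.42973, 132.83570, 410.01256, 854.24665, 1265.63031, 1367.27936,
       1085.21070, 628.05501, 258.47513, 71.80317, 13.02168]
g_sex, q_sex, g_adj_sex = g_test_gof(obs, exp, df=len(obs) - 2)

# Part 2: 3:1 cross with 130 wild-type and 46 ebony flies (extrinsic: df = a - 1 = 1).
flies = [130, 46]
g_fly, q_fly, g_adj_fly = g_test_gof(flies, [0.75 * 176, 0.25 * 176], df=1)

print(round(g_sex, 3), round(g_adj_sex, 3))   # compare with 94.871,55 and 94.837,09
print(round(g_fly, 4), round(g_adj_fly, 4))   # compare with 0.120,02 and 0.1197
```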
The case presented in Box 13.1, however, is one in which the expected frequencies are based on an intrinsic hypothesis. We use the sex ratio data in sibships of 12, first introduced in Table 4.4, Section 4.2. As you will recall, the expected frequencies in these data are based on the binomial distribution, with the parametric proportion of males $p_♂$ estimated from the observed frequencies of the sample ($\hat{p}_♂ = 0.519,215$). The computation of this case is outlined fully in Box 13.1. The G test does not yield very accurate probabilities for small $\hat{f}_i$. The cells with $\hat{f}_i < 3$ (when $a \ge 5$) or $\hat{f}_i < 5$ (when $a < 5$) are generally lumped with adjacent classes so that the new $\hat{f}_i$ are large enough. The lumping of classes results in a less powerful test with respect to alternative hypotheses. By these criteria the classes of $\hat{f}_i$ at both tails of the distribution are too small. We lump them by adding their frequencies to those in contiguous classes, as shown in Box 13.1. Clearly, the observed frequencies must be lumped to match. The number of classes $a$ is the number after lumping has taken place. In our case, $a = 11$.
Because the actual type I error of G tests tends to be higher than the intended level, a correction for $G$ to obtain a better approximation to the chi-square distribution has been suggested by Williams (1976). He divides $G$ by a correction factor $q$ (not to be confused with a proportion) to be computed as $q = 1 + (a^2 - 1)/6n\nu$. In this formula, $\nu$ is the number of degrees of freedom appropriate to the G test. The effect of this correction is to reduce the observed value of $G$ slightly. Since this is an example with expected frequencies based on an intrinsic hypothesis, we have to subtract more than one degree of freedom from $a$ for the significance test. In this case, we estimated $p_♂$ from the sample, and therefore a second degree of freedom is subtracted from $a$, making the final number of degrees of freedom $a - 2 = 11 - 2 = 9$. Comparing the corrected sample value of $G_{adj} = 94.837,09$ with the critical value of $\chi^2$ at 9 degrees of freedom, we find it highly significant ($P \ll 0.001$, assuming that the null hypothesis is correct). We therefore reject this hypothesis and conclude that the sex ratios are not binomially distributed. As is evident from the pattern of deviations, there is an excess of sibships in which one sex or the other predominates. Had we applied the chi-square test to these data, the critical value would have been the same ($\chi^2_{\alpha[9]}$).

Next we consider the case for $a = 2$ cells. The computation is carried out by means of Expression (13.4), as before. In tests of goodness of fit involving only two classes, the value of $G$ as computed from this expression will typically result in type I errors at a level higher than the intended one. Williams' correction reduces the value of $G$ and results in a more conservative test. An alternative correction that has been widely applied is the correction for continuity, usually applied in order to make the value of $G$ or $X^2$ approximate the $\chi^2$ distribution more closely. We have found the continuity correction too conservative and therefore recommend that Williams' correction be applied routinely, although it will have little effect when sample sizes are large. For sample sizes of 25 or less, work out the exact probabilities as shown in Table 4.3, Section 4.2. The example of the two-cell case in Box 13.1 is a genetic cross with an expected 3:1 ratio. The G test is adjusted by Williams' correction. The expected frequencies differ very little from the observed frequencies, and it is no surprise, therefore, that the resulting value of $G_{adj}$ is far less than the critical value of $\chi^2$ at one degree of freedom. Inspection of the chi-square table reveals that roughly 80% of all samples from a population with the expected ratio would show greater deviations than the sample at hand.
13.3 Tests of independence: Two-way tables

The notion of statistical or probabilistic independence was first introduced in Section 4.1, where it was shown that if two events were independent, the probability of their occurring together could be computed as the product of their separate probabilities. Thus, if among the progeny of a certain genetic cross the probability that a kernel of corn will be red is $\frac{1}{2}$ and the probability that the kernel will be dented is $\frac{1}{3}$, the probability of obtaining a kernel both dented and red will be $\frac{1}{2} \times \frac{1}{3} = \frac{1}{6}$ if the joint occurrences of these two characteristics are statistically independent. The appropriate statistical test for this genetic problem would be to test the frequencies for goodness of fit to the expected ratios of 2 (red, not dented) : 2 (not red, not dented) : 1 (red, dented) : 1 (not red, dented). This would be a simultaneous test of two null hypotheses: that the expected proportions are $\frac{1}{2}$ and $\frac{1}{3}$ for red and dented, respectively, and that these two properties are independent. The first null hypothesis tests the Mendelian model in general. The second tests whether these characters assort independently, that is, whether they are determined by genes located in different linkage groups. If the second hypothesis
must be rejected, this is taken as evidence that the characters are linked, that is, located on the same chromosome. There are numerous instances in biology in which the second hypothesis, concerning the independence of two properties, is of great interest and the first hypothesis, regarding the true proportion of one or both properties, is of little interest. In fact, often no hypothesis regarding the parametric values $p_i$ can be formulated by the investigator. We shall cite several examples of such situations, which lead to the test of independence to be learned in this section. We employ this test whenever we wish to test whether two different properties, each occurring in two states, are dependent on each other. For instance, specimens of a certain moth may occur in two color phases, light and dark. Fifty specimens of each phase may be exposed in the open, subject to predation by birds. The number of surviving moths is counted after a fixed interval of time. The proportion predated may differ in the two color phases. The two properties in this example are color and survival. We can divide our sample into four classes: light-colored survivors, light-colored prey, dark survivors, and dark prey. If the probability of being preyed upon is independent of the color of the moth, the expected frequencies of these four classes can be simply computed as independent products of the proportion of each color (in our experiment, $\frac{1}{2}$) and the overall proportion preyed upon in the entire sample. Should the statistical test of independence explained below show that the two properties are not independent, we are led to conclude that one of the color phases is more susceptible to predation than the other. In this example, this is the issue of biological importance; the exact proportions of the two properties are of little interest here. The proportion of the color phases is arbitrary, and the proportion of survivors is of interest only insofar as it differs for the two phases.
A second example might relate to a sampling experiment carried out by a plant ecologist. A random sample is obtained of 100 individuals of a fairly rare species of tree distributed over an area of 400 square miles. For each tree the ecologist notes whether it is rooted in a serpentine soil or not, and whether the leaves are pubescent or smooth. Thus the sample of $n = 100$ trees can be divided into four groups: serpentine-pubescent, serpentine-smooth, nonserpentine-pubescent, and nonserpentine-smooth. If the probability that a tree is or is not pubescent is independent of its location, our null hypothesis of the independence of these properties will be upheld. If, on the other hand, the proportion of pubescence differs for the two types of soils, our statistical test will most probably result in rejection of the null hypothesis of independence. Again, the expected frequencies will simply be products of the independent proportions of the two properties: serpentine versus nonserpentine, and pubescent versus smooth. In this instance the proportions may themselves be of interest to the investigator. An analogous example may occur in medicine. Among 10,000 patients admitted to a hospital, a certain proportion may be diagnosed as exhibiting disease X. At the same time, all patients admitted are tested for several blood groups. A certain proportion of these are members of blood group Y. Is there some
association between membership in blood group Y and susceptibility to the disease X? The example we shall work out in detail is from immunology. A sample of 111 mice was divided into two groups: 57 that received a standard dose of pathogenic bacteria followed by an antiserum, and a control group of 54 that received the bacteria but no antiserum. After sufficient time had elapsed for an incubation period and for the disease to run its course, 38 dead mice and 73 survivors were counted. Of those that died, 13 had received bacteria and antiserum while 25 had received bacteria only. A question of interest is whether the antiserum had in any way protected the mice so that there were proportionally more survivors in that group. Here again the proportions of these properties are of no more interest than in the first example (predation on moths). Such data are conveniently displayed in the form of a two-way table as shown below. Two-way and multiway tables (more than two criteria) are often known as contingency tables. This type of two-way table, in which each of the two criteria is divided into two classes, is known as a 2 × 2 table.

                           Dead    Alive    Total
Bacteria and antiserum      13      44       57
Bacteria only               25      29       54
Total                       38      73      111
Thus 13 mice received bacteria and antiserum but died, as seen in the table. The marginal totals give the number of mice exhibiting any one property: 57 mice received bacteria and antiserum; 73 mice survived the experiment. Altogether 111 mice were involved in the experiment and constitute the total sample. In discussing such a table it is convenient to label the cells of the table and the row and column sums as follows:

  a        b        a + b
  c        d        c + d
a + c    b + d        n
From a two-way table one can systematically compute the expected frequencies (based on the null hypothesis of independence) and compare them with the observed frequencies. For example, the expected frequency for cell $d$ (bacteria, alive) would be

$$\hat{f}_{bact,alv} = n\hat{p}_{bact,alv} = n\hat{p}_{bact} \times \hat{p}_{alv} = n\left(\frac{c + d}{n}\right)\left(\frac{b + d}{n}\right) = \frac{(c + d)(b + d)}{n}$$

which in our case would be (54)(73)/111 = 35.514, a higher value than the observed frequency of 29. We can proceed similarly to compute the expected frequencies for each cell in the table by multiplying a row total by a column total, and dividing the product by the grand total. The expected frequencies can be conveniently displayed in the form of a two-way table:

                           Dead      Alive     Total
Bacteria and antiserum     19.514    37.486     57.000
Bacteria only              18.486    35.514     54.000
Total                      38.000    73.000    111.000
You will note that the row and column sums of this table are identical to those in the table of observed frequencies, which should not surprise you, since the expected frequencies were computed on the basis of these row and column totals. It should therefore be clear that a test of independence will not test whether any property occurs at a given proportion but can only test whether or not the two properties are manifested independently. The statistical test appropriate to a given 2 × 2 table depends on the underlying model that it represents. There has been considerable confusion on this subject in the statistical literature. For our purposes here it is not necessary to distinguish among the three models of contingency tables. The G test illustrated in Box 13.2 will give at least approximately correct results with moderate- to large-sized samples regardless of the underlying model. When the test is applied to the above immunology example, using the formulas given in Box 13.2, one obtains $G_{adj} = 6.7732$. One could also carry out a chi-square test on the deviations of the observed from the expected frequencies using Expression (13.2). This would yield $X^2 = 6.7966$, using the expected frequencies in the table above. Let us state without explanation that the observed $G$ or $X^2$ should be compared with $\chi^2$ for one degree of freedom. We shall examine the reasons for this at the end of this section. The probability of finding a fit as bad, or worse, to these data is $0.005 < P < 0.01$. We conclude, therefore, that mortality in these mice is not independent of the presence of antiserum. We note that the percentage mortality among those animals given bacteria and antiserum is (13)(100)/57 = 22.8%, considerably lower than the mortality of (25)(100)/54 = 46.3% among the mice to whom only bacteria had been administered. Clearly, the antiserum has been effective in reducing mortality.
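The rule "row total times column total, divided by the grand total" and the chi-square computation on the resulting deviations can be sketched as follows. The function names are assumptions made for this illustration. Computed at full precision, $X^2$ for the mouse data comes out near 6.80, marginally below the 6.7966 quoted above, which presumably reflects rounded expected frequencies in the hand computation.

```python
def expected_frequencies(table):
    """Expected cell frequencies under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Sum of (observed - expected)^2 / expected over all cells of the table."""
    expected = expected_frequencies(table)
    return sum((f - fh) ** 2 / fh
               for row, exp_row in zip(table, expected)
               for f, fh in zip(row, exp_row))

# Rows: bacteria and antiserum, bacteria only; columns: dead, alive.
mice = [[13, 44], [25, 29]]
expected = expected_frequencies(mice)
x2 = chi_square_independence(mice)

for row in expected:
    print([round(v, 3) for v in row])
print(round(x2, 3))
```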
In Box 13.2 we illustrate the G test applied to the sampling experiment in plant ecology, dealing with trees rooted in two different soils and possessing two types of leaves. With small sample sizes ($n < 200$), it is desirable to apply Williams' correction, the application of which is shown in the box. The result of the analysis shows clearly that we cannot reject the null hypothesis of independence between soil type and leaf type. The presence of pubescent leaves is independent of whether the tree is rooted in serpentine soils or not. Tests of independence need not be restricted to 2 × 2 tables. In the two-way cases considered in this section, we are concerned with only two properties, but each of these properties may be divided into any number of classes. Thus organisms may occur in four color classes and be sampled at five different times during the year, yielding a 4 × 5 test of independence. Such a test would examine whether the color proportions exhibited by the marginal totals are independent of the time of year at which the individuals have been sampled.
BOX 13.2 2 × 2 test of independence.

A plant ecologist samples 100 trees of a rare species from a 400-square-mile area. He records for each tree whether it is rooted in serpentine soils or not, and whether its leaves are pubescent or smooth.

Soil              Pubescent    Smooth    Totals
Serpentine           12          22        34
Not serpentine       16          50        66
Totals               28          72       100
Compute the following quantities.

1. $\sum f \ln f$ for the cell frequencies = 12 ln 12 + 22 ln 22 + 16 ln 16 + 50 ln 50 = 337.784,38
2. $\sum f \ln f$ for the row and column totals = 34 ln 34 + 66 ln 66 + 28 ln 28 + 72 ln 72 = 797.635,16
3. $n \ln n$ = 100 ln 100 = 460.517,02
4. Compute $G$ as follows:
   G = 2(quantity 1 − quantity 2 + quantity 3) = 2(337.784,38 − 797.635,16 + 460.517,02) = 2(0.666,24) = 1.332,49

Williams' correction for a 2 × 2 table is

$$q = 1 + \frac{\left(\dfrac{n}{a + b} + \dfrac{n}{c + d} - 1\right)\left(\dfrac{n}{a + c} + \dfrac{n}{b + d} - 1\right)}{6n} = 1 + \frac{\left(\dfrac{100}{34} + \dfrac{100}{66} - 1\right)\left(\dfrac{100}{28} + \dfrac{100}{72} - 1\right)}{6(100)} = 1.022,81$$

$$G_{adj} = \frac{G}{q} = \frac{1.332,49}{1.022,81} = 1.3028$$

Compare $G_{adj}$ with the critical value of $\chi^2$ for one degree of freedom. Since our observed $G_{adj}$ is much less than $\chi^2_{.05[1]} = 3.841$, we accept the null hypothesis that the leaf type is independent of the type of soil in which the tree is rooted.
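The steps of Box 13.2 translate directly into code. The sketch below is an illustration only; the function name `g_test_2x2` and the helper `xlnx` are assumptions, and the cell labels a, b, c, d follow the labeling of the 2 × 2 table given earlier in this section. The second call applies the same formulas to the mouse data from the immunology example.

```python
import math

def xlnx(x):
    """Return x ln x, taking 0 ln 0 as 0."""
    return x * math.log(x) if x > 0 else 0.0

def g_test_2x2(a, b, c, d):
    """Return (G, q, G_adj) for a 2 x 2 table with cells a, b (row 1) and c, d (row 2)."""
    n = a + b + c + d
    quantity1 = sum(xlnx(f) for f in (a, b, c, d))                   # cell frequencies
    quantity2 = sum(xlnx(t) for t in (a + b, c + d, a + c, b + d))   # row and column totals
    quantity3 = xlnx(n)                                              # grand total
    g = 2.0 * (quantity1 - quantity2 + quantity3)
    # Williams' correction for a 2 x 2 table.
    q = 1.0 + ((n / (a + b) + n / (c + d) - 1.0)
               * (n / (a + c) + n / (b + d) - 1.0)) / (6.0 * n)
    return g, q, g / q

g_tree, q_tree, g_adj_tree = g_test_2x2(12, 22, 16, 50)   # Box 13.2 (trees)
g_mice, q_mice, g_adj_mice = g_test_2x2(13, 44, 25, 29)   # immunology example

print(round(g_tree, 5), round(q_tree, 5), round(g_adj_tree, 4))
print(round(g_adj_mice, 4))
```

Both adjusted values are compared with $\chi^2_{.05[1]} = 3.841$: the tree data fall well below it, the mouse data well above.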
BOX 13.3 R × C test of independence using the G test.

Frequencies for the M and N blood groups in six populations from Lebanon.

Genotypes: MM, MN, NN
Populations ($b = 6$): Druse, Greek Catholic, Greek Orthodox, Maronites, Shiites, Sunni Moslems
Totals
Compute the following quantities.

1. Sum of transforms of the frequencies in the body of the contingency table
   $= \sum^{b}\sum^{a} f_{ij} \ln f_{ij}$ = 59 ln 59 + 100 ln 100 + ··· + 91 ln 91 = 240.575 + 460.517 + ··· + 410.488 = 12,752.715
2. Sum of transforms of the row totals
   = 203 ln 203 + ··· + 428 ln 428 = 1078.581 + ··· + 2593.305 = 15,308.461
3. Sum of the transforms of the column totals
   = 818 ln 818 + ··· + 494 ln 494 = 5486.213 + ··· + 3064.053 = 16,687.108
4. Transform of the grand total $= n \ln n$ = 2466 ln 2466 = 19,260.330
5. G = 2(quantity 1 − quantity 2 − quantity 3 + quantity 4) = 2(12,752.715 − 15,308.461 − 16,687.108 + 19,260.330) = 2(17.475) = 34.951
6. The lower bound estimate of $q$ using Williams' correction for an $a \times b$ table is

$$q_{min} = 1 + \frac{(a + 1)(b + 1)}{6n} = 1 + \frac{(3 + 1)(6 + 1)}{6(2466)} = 1.001,892$$

Thus $G_{adj} = G/q_{min}$ = 34.951/1.001,892 = 34.885. This value is to be compared with a $\chi^2$ distribution with $(a - 1)(b - 1)$ degrees of freedom, where $a$ is the number of columns and $b$ the number of rows in the table. In our case, $df = (3 - 1)(6 - 1) = 10$. Since $\chi^2_{.001[10]} = 29.588$, our $G$ value is significant at $P < 0.001$, and we must reject our null hypothesis that genotype frequency is independent of the population sampled.
Such tests are often called R × C tests of independence, R and C standing for the number of rows and columns in the frequency table. Another case, examined in detail in Box 13.3, concerns the MN blood groups, which occur in human populations in three genotypes: MM, MN, and NN. Frequencies of these blood groups can be obtained in samples of human populations and the samples compared for differences in these frequencies. In Box 13.3 we feature frequencies from six Lebanese populations and test whether the proportions of the three groups are independent of the populations sampled, or in other words, whether the frequencies of the three genotypes differ among these six populations. As shown in Box 13.3, the following is a simple general rule for computation of the G test of independence:

$$G = 2\left[\left(\sum f \ln f \text{ for the cell frequencies}\right) - \left(\sum f \ln f \text{ for the row and column totals}\right) + n \ln n\right]$$

The transformations can be computed using the natural logarithm function found on most calculators. In the formulas in Box 13.3 we employ a double subscript to refer to entries in a two-way table, as in the structurally similar case of two-way anova. The quantity $f_{ij}$ in Box 13.3 refers to the observed frequency in row $i$ and column $j$ of the table. Williams' correction is now more complicated. We feature a lower bound estimate of its correct value. The adjustment will be minor when sample size is large, as in this example, and need be carried out only when the sample size is small and the observed $G$ value is of marginal significance. The results in Box 13.3 show clearly that the frequency of the three genotypes is dependent upon the population sampled. We note the lower frequency of the
MM genotypes in the third population (Greek Orthodox) and the much lower frequency of the MN heterozygotes in the last population (Sunni Moslems). The degrees of freedom for tests of independence are always the same and can be computed using the rules given earlier (Section 13.2). There are $k$ cells in the table but we must subtract one degree of freedom for each independent parameter we have estimated from the data. We must, of course, subtract one degree of freedom for the observed total sample size, $n$. We have also estimated $a - 1$ row probabilities and $b - 1$ column probabilities, where $a$ and $b$ are the number of rows and columns in the table, respectively. Thus, there are $k - (a - 1) - (b - 1) - 1 = k - a - b + 1$ degrees of freedom for the test. But since $k = a \times b$, this expression becomes $(a \times b) - a - b + 1 = (a - 1) \times (b - 1)$, the conventional expression for the degrees of freedom in a two-way test of independence. Thus, the degrees of freedom in the example of Box 13.3, a 6 × 3 case, was $(6 - 1) \times (3 - 1) = 10$. In all 2 × 2 cases there is clearly only $(2 - 1) \times (2 - 1) = 1$ degree of freedom. Another name for test of independence is test of association. If two properties are not independent of each other they are associated. Thus, in the example testing relative frequency of two leaf types on two different soils, we can speak of an association between leaf types and soils. In the immunology experiment there is a negative association between presence of antiserum and mortality. Association is thus similar to correlation, but it is a more general term, applying to attributes as well as continuous variables. In the 2 × 2 tests of independence of this section, one way of looking for suspected lack of independence was to examine the percentage occurrence of one of the properties in the two classes based on the other property. Thus we compared the percentage of smooth leaves on the two types of soils, or we studied the percentage mortality with or without antiserum.
This way of looking at a test of independence suggests another interpretation of these tests as tests for the significance of differences between two percentages.
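The general rule for the G test of independence, together with the $(a - 1)(b - 1)$ degrees of freedom, can be sketched for a table of any size. The names `g_test_rxc` and `williams_q_min` are assumptions made for this illustration. The lower bound is taken here as $q_{min} = 1 + (a + 1)(b + 1)/6n$, which reproduces the value 1.001,892 quoted in Box 13.3 for a 3 × 6 table with $n = 2466$. Because the cell frequencies of Box 13.3 are not repeated here, the function is exercised on the 2 × 2 tree data of Box 13.2 instead.

```python
import math

def xlnx(x):
    """Return x ln x, taking 0 ln 0 as 0."""
    return x * math.log(x) if x > 0 else 0.0

def williams_q_min(a, b, n):
    """Lower-bound estimate of Williams' correction for an a x b table (assumed form)."""
    return 1.0 + (a + 1) * (b + 1) / (6.0 * n)

def g_test_rxc(table):
    """Return (G, df, G_adj) for an R x C table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 2.0 * (sum(xlnx(f) for row in table for f in row)   # cell frequencies
               - sum(xlnx(t) for t in row_totals)           # row totals
               - sum(xlnx(t) for t in col_totals)           # column totals
               + xlnx(n))                                   # grand total
    rows, cols = len(row_totals), len(col_totals)
    df = (rows - 1) * (cols - 1)
    return g, df, g / williams_q_min(cols, rows, n)

q_box = williams_q_min(3, 6, 2466)                # compare with 1.001,892 in Box 13.3
g, df, g_adj = g_test_rxc([[12, 22], [16, 50]])   # tree data of Box 13.2
print(round(q_box, 6), round(g, 5), df, round(g_adj, 4))
```

For a 2 × 2 table the lower bound (here 1.015) is smaller than the exact correction of Box 13.2 (1.022,81), so the adjusted value is reduced less than it would be by the exact formula.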
Exercises

13.1 In an experiment to determine the mode of inheritance of a green mutant, 146 wild-type and 30 mutant offspring were obtained when $F_1$ generation houseflies were crossed. Test whether the data agree with the hypothesis that the ratio of wild type to mutants is 3:1. ANS. $G = 6.4624$, $G_{adj} = 6.441$, 1 df, $\chi^2_{.05[1]} = 3.841$.

13.2 Locality A has been exhaustively collected for snakes of species S. An examination of the 167 adult males that have been collected reveals that 35 of these have pale-colored bands around their necks. From locality B, 90 miles away, we obtain a sample of 27 adult males of the same species, 6 of which show the bands. What is the chance that both samples are from the same statistical population with respect to frequency of bands?

13.3 Of 445 specimens of the butterfly Erebia epipsodea from mountainous areas, 2.5% have light color patches on their wings. Of 65 specimens from the prairie, 70.8% have such patches (unpublished data by P. R. Ehrlich). Is this difference significant? Hint: First work backwards to obtain original frequencies. ANS. $G = 175.5163$, 1 df, $G_{adj} = 171.4533$.

13.4 Test whether the percentage of nymphs of the aphid Myzus persicae that developed into winged forms depends on the type of diet provided. Stem mothers had been placed on the diets one day before the birth of the nymphs (data by Mittler and Dadd, 1966).
Type of diet        % winged forms      n
                         100           216
                          92           230
                          36            75
13.5 In a study of polymorphism of chromosomal inversions in the grasshopper Moraba scurra, Lewontin and White (1960) gave the following results for the composition of a population at Royalla "B" in 1958.
Chromosome
CD
St/Bl 96 56 6
Bl/Bl 75 64 6
22 8 0
Are the frequencies of the three different combinations of chromosome EK independent of those of the frequencies of the three combinations of chromosome CD? ANS. $G = 7.396$.

13.6 Test agreement of observed frequencies with those expected on the basis of a binomial distribution for the data given in Tables 4.1 and 4.2.

13.7 Test agreement of observed frequencies with those expected on the basis of a Poisson distribution for the data given in Table 4.5 and Table 4.6. ANS. For Table 4.5: $G = 49.9557$, 3 df, $G_{adj} = 49.8914$. For Table 4.6: $G = 20.6077$, 2 df, $G_{adj} = 20.4858$.

13.8 In clinical tests of the drug Nimesulide, Pfändner (1984) reports the following results. The drug was given, together with an antibiotic, to 20 persons. A control group of 20 persons with urinary infections were given the antibiotic and a placebo. The results, edited for purposes of this exercise, are as follows:
Antibiotic + Nimesulide      Antibiotic + placebo
          1                          16
         19                           4
Analyze and interpret the results.

13.9 Refer to the distributions of melanoma over body regions shown in Table 2.1. Is there evidence for differential susceptibility to melanoma of differing body regions in males and females? ANS. $G = 160.2366$, 5 df, $G_{adj} = 158.6083$.
APPENDIX 1

Mathematical Appendix

A1.1 Demonstration that the sum of the deviations from the mean is equal to zero.

We have to learn two common rules of statistical algebra. We can open a pair of parentheses with a $\sum$ sign in front of them by treating the $\sum$ as though it were a common factor. We have

$$\sum_{i=1}^{n}(A_i + B_i) = (A_1 + B_1) + (A_2 + B_2) + \cdots + (A_n + B_n) = (A_1 + A_2 + \cdots + A_n) + (B_1 + B_2 + \cdots + B_n)$$

Therefore,

$$\sum_{i=1}^{n}(A_i + B_i) = \sum_{i=1}^{n} A_i + \sum_{i=1}^{n} B_i$$

Also, when the summed quantity is a constant $C$,

$$\sum_{i=1}^{n} C = C + C + \cdots + C \quad (n \text{ terms}) = nC$$

Since in a given problem a mean is a constant value, $\sum^{n} \bar{Y} = n\bar{Y}$. If you wish, you may check these rules, using simple numbers. In the subsequent demonstration and others to follow, whenever all summations are over $n$ items, we have simplified the notation by dropping subscripts for variables and superscripts above summation signs. We wish to prove that $\sum y = 0$. By definition,

$$\sum y = \sum (Y - \bar{Y}) = \sum Y - n\bar{Y} = \sum Y - \frac{n \sum Y}{n} \quad \left(\text{since } \bar{Y} = \frac{\sum Y}{n}\right) = \sum Y - \sum Y$$
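Therefore, $\sum y = 0$.

A1.2 Demonstration that Expression (3.8), the computational formula for the sum of squares, equals Expression (3.7), the expression originally developed for this statistic.

We wish to prove that $\sum (Y - \bar{Y})^2 = \sum Y^2 - \left((\sum Y)^2/n\right)$. We have

$$\sum (Y - \bar{Y})^2 = \sum (Y^2 - 2Y\bar{Y} + \bar{Y}^2) = \sum Y^2 - 2\bar{Y}\sum Y + n\bar{Y}^2$$

$$= \sum Y^2 - \frac{2(\sum Y)^2}{n} + \frac{n(\sum Y)^2}{n^2} \quad \left(\text{since } \bar{Y} = \frac{\sum Y}{n}\right)$$

$$= \sum Y^2 - \frac{2(\sum Y)^2}{n} + \frac{(\sum Y)^2}{n}$$

Hence,

$$\sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n}$$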
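The text suggests checking such rules with simple numbers. A minimal numerical check of A1.1 and A1.2 is sketched below; the five observations are hypothetical and the function names are assumptions made for this illustration.

```python
def sum_of_deviations(ys):
    """Sum of y = Y - Ybar over all items (should be zero, A1.1)."""
    mean = sum(ys) / len(ys)
    return sum(y - mean for y in ys)

def ss_definitional(ys):
    """Sum of squares as sum (Y - Ybar)^2, Expression (3.7)."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def ss_computational(ys):
    """Sum of squares as sum Y^2 - (sum Y)^2 / n, Expression (3.8)."""
    return sum(y * y for y in ys) - sum(ys) ** 2 / len(ys)

ys = [14.0, 12.0, 13.0, 17.0, 15.0]   # hypothetical observations
print(sum_of_deviations(ys))           # zero up to floating-point rounding
print(ss_definitional(ys), ss_computational(ys))
```

With floating-point arithmetic the two sums of squares agree only up to rounding error, so comparisons should use a small tolerance rather than exact equality.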
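A1.3 Simplified formulas for standard error of the difference between two means.

The standard error squared from Expression (8.2) is

$$\left[\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\right]\left(\frac{n_1 + n_2}{n_1 n_2}\right)$$

When $n_1 = n_2 = n$, this simplifies to

$$\left[\frac{(n - 1)s_1^2 + (n - 1)s_2^2}{2n - 2}\right]\left(\frac{2n}{n^2}\right) = \frac{(n - 1)(s_1^2 + s_2^2)(2)}{2(n - 1)(n)} = \frac{1}{n}(s_1^2 + s_2^2)$$

When $n_1 \ne n_2$ but each is large, so that $(n_1 - 1) \approx n_1$ and $(n_2 - 1) \approx n_2$, the standard error squared of Expression (8.2) simplifies to

$$\left(\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2}\right)\left(\frac{n_1 + n_2}{n_1 n_2}\right) = \frac{n_1 s_1^2}{n_1 n_2} + \frac{n_2 s_2^2}{n_1 n_2} = \frac{s_1^2}{n_2} + \frac{s_2^2}{n_1}$$

which is the standard error squared of Expression (8.4).

A1.4 Demonstration that $t_s^2$ obtained from a test of significance of the difference between two means (as in Box 8.2) is identical to the $F_s$ value obtained in a single-classification anova of two equal-sized groups (in the same box).

$$t_s \text{ (from Box 8.2)} = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\dfrac{1}{n}(s_1^2 + s_2^2)}} = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\dfrac{1}{n(n - 1)}\left(\sum y_1^2 + \sum y_2^2\right)}}$$

$$t_s^2 = \frac{(\bar{Y}_1 - \bar{Y}_2)^2}{\dfrac{1}{n(n - 1)}\left(\sum y_1^2 + \sum y_2^2\right)} = \frac{n(n - 1)(\bar{Y}_1 - \bar{Y}_2)^2}{\sum y_1^2 + \sum y_2^2}$$

In the two-sample anova,

$$MS_{means} = \frac{1}{2 - 1}\sum^{2}(\bar{Y}_i - \bar{\bar{Y}})^2 = (\bar{Y}_1 - \bar{\bar{Y}})^2 + (\bar{Y}_2 - \bar{\bar{Y}})^2$$

$$= \left(\bar{Y}_1 - \frac{\bar{Y}_1 + \bar{Y}_2}{2}\right)^2 + \left(\bar{Y}_2 - \frac{\bar{Y}_1 + \bar{Y}_2}{2}\right)^2 \quad \left(\text{since } \bar{\bar{Y}} = \frac{\bar{Y}_1 + \bar{Y}_2}{2}\right)$$

$$= \left(\frac{\bar{Y}_1 - \bar{Y}_2}{2}\right)^2 + \left(\frac{\bar{Y}_2 - \bar{Y}_1}{2}\right)^2 = \frac{1}{2}(\bar{Y}_1 - \bar{Y}_2)^2$$

$$MS_{groups} = n \times MS_{means} = \frac{n}{2}(\bar{Y}_1 - \bar{Y}_2)^2$$

$$MS_{within} = \frac{\sum y_1^2 + \sum y_2^2}{2(n - 1)}$$

$$F_s = \frac{MS_{groups}}{MS_{within}} = \frac{\dfrac{n}{2}(\bar{Y}_1 - \bar{Y}_2)^2}{\left(\sum y_1^2 + \sum y_2^2\right)/[2(n - 1)]} = \frac{n(n - 1)(\bar{Y}_1 - \bar{Y}_2)^2}{\sum y_1^2 + \sum y_2^2} = t_s^2$$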
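The identity of A1.4 can be checked numerically for two equal-sized groups. The sketch below is an illustration only: the two samples are hypothetical and the function name `t_squared_and_f` is an assumption. It computes $t_s$ from the pooled standard error for equal $n$ and $F_s$ from the mean squares among groups and within groups.

```python
import math

def t_squared_and_f(y1, y2):
    """Return (t_s^2, F_s) for two groups of equal size n."""
    n = len(y1)
    if len(y2) != n:
        raise ValueError("the demonstration assumes equal-sized groups")
    m1, m2 = sum(y1) / n, sum(y2) / n
    ss1 = sum((y - m1) ** 2 for y in y1)       # sum of y_1^2 (deviations squared)
    ss2 = sum((y - m2) ** 2 for y in y2)
    # t test: standard error of the difference for equal sample sizes.
    se = math.sqrt((ss1 + ss2) / (n * (n - 1)))
    t = (m1 - m2) / se
    # Single-classification anova with two groups.
    grand = (m1 + m2) / 2
    ms_groups = n * ((m1 - grand) ** 2 + (m2 - grand) ** 2) / (2 - 1)
    ms_within = (ss1 + ss2) / (2 * (n - 1))
    return t * t, ms_groups / ms_within

t2, f = t_squared_and_f([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])  # hypothetical data
print(t2, f)
```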
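A1.5 Demonstration that Expression (11.5), the computational formula for the sum of products, equals $\sum (X - \bar{X})(Y - \bar{Y})$, the expression originally developed for this quantity.

All summations are over $n$ items. We have

$$\sum xy = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - \bar{X}\sum Y - \bar{Y}\sum X + n\bar{X}\bar{Y}$$

$$= \sum XY - \bar{X}n\bar{Y} - \bar{Y}n\bar{X} + n\bar{X}\bar{Y} \quad \left(\text{since } \frac{\sum Y}{n} = \bar{Y},\ \sum Y = n\bar{Y};\ \text{similarly, } \sum X = n\bar{X}\right)$$

$$= \sum XY - n\bar{X}\bar{Y} = \sum XY - \bar{X}\sum Y$$

Similarly,

$$\sum xy = \sum XY - \bar{Y}\sum X$$

and

$$\sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n} \tag{11.5}$$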
Derivation
of
computational
formula
for
((^)2/) . By definition, dY.. x Y. Since = , we can subtract Y from both Y and Y to obtain d y . = y y = y bx Therefore, </?* = (.  /'V)2 = >2 >'2  2 + 2^ " + b2 2
2
2
(since y  bx)
(*2)2 ^
= ^
2 *2
318
APPENDIX 1 /
or
l
d
x =
ly
(^)2
* 2
(11.6)
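A1.7 Demonstration that the sum of squares of the dependent variable in regression can be partitioned exactly into explained and unexplained sums of squares, the cross products canceling out.

By definition (Section 11.5),

$$y = \hat{y} + d_{Y \cdot X}$$

$$\sum y^2 = \sum (\hat{y} + d_{Y \cdot X})^2 = \sum \hat{y}^2 + \sum d_{Y \cdot X}^2 + 2\sum \hat{y}d_{Y \cdot X}$$

If we can show that $\sum \hat{y}d_{Y \cdot X} = 0$, then we have demonstrated the required identity. We have

$$\sum \hat{y}d_{Y \cdot X} = \sum bx(y - bx)$$

[since $\hat{y} = bx$ from Expression (11.3) and $d_{Y \cdot X} = y - bx$ from Appendix A1.6]

$$= b\sum xy - b^2\sum x^2 = b\sum xy - b\frac{\sum xy}{\sum x^2}\sum x^2 \quad \left(\text{since } b = \frac{\sum xy}{\sum x^2}\right)$$

$$= b\sum xy - b\sum xy = 0$$

Therefore, $\sum y^2 = \sum \hat{y}^2 + \sum d_{Y \cdot X}^2$.

A1.8 Proof that the variance of the sum of two variables is

$$\sigma^2_{(Y_1 + Y_2)} = \sigma_1^2 + \sigma_2^2 + 2\rho_{12}\sigma_1\sigma_2$$

where $\sigma_1$ and $\sigma_2$ are standard deviations of $Y_1$ and $Y_2$, respectively, and $\rho_{12}$ is the parametric correlation coefficient between $Y_1$ and $Y_2$.

If $Z = Y_1 + Y_2$, then

$$\bar{Z} = \frac{1}{n}\sum (Y_1 + Y_2) = \frac{1}{n}\sum Y_1 + \frac{1}{n}\sum Y_2 = \bar{Y}_1 + \bar{Y}_2$$

$$\sigma_Z^2 = \frac{1}{n}\sum (Z - \bar{Z})^2 = \frac{1}{n}\sum [(Y_1 + Y_2) - (\bar{Y}_1 + \bar{Y}_2)]^2 = \frac{1}{n}\sum [(Y_1 - \bar{Y}_1) + (Y_2 - \bar{Y}_2)]^2$$

$$= \frac{1}{n}\sum (y_1 + y_2)^2 = \frac{1}{n}\sum (y_1^2 + y_2^2 + 2y_1y_2) = \frac{1}{n}\sum y_1^2 + \frac{1}{n}\sum y_2^2 + \frac{2}{n}\sum y_1y_2 = \sigma_1^2 + \sigma_2^2 + 2\sigma_{12}$$

But, since $\rho_{12} = \sigma_{12}/\sigma_1\sigma_2$, we have

$$\sigma_{12} = \rho_{12}\sigma_1\sigma_2$$

Therefore

$$\sigma_Z^2 = \sigma^2_{(Y_1 + Y_2)} = \sigma_1^2 + \sigma_2^2 + 2\rho_{12}\sigma_1\sigma_2$$

Similarly,

$$\sigma_D^2 = \sigma^2_{(Y_1 - Y_2)} = \sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2$$

For sample statistics, the corresponding expressions are

$$s^2_{(Y_1 + Y_2)} = s_1^2 + s_2^2 + 2r_{12}s_1s_2 \tag{12.8}$$

$$s^2_{(Y_1 - Y_2)} = s_1^2 + s_2^2 - 2r_{12}s_1s_2 \tag{12.9}$$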
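A numerical check of the variance of a sum and of a difference is sketched below. The paired values are hypothetical and the function names are assumptions made for this illustration. Variances and the covariance are computed with divisor $n$, as in the derivation; the same identity holds with divisor $n - 1$ provided that divisor is used throughout.

```python
import math

def variance(ys):
    """Variance with divisor n."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def covariance(y1, y2):
    """Covariance with divisor n."""
    m1, m2 = sum(y1) / len(y1), sum(y2) / len(y2)
    return sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / len(y1)

y1 = [1.0, 2.0, 3.0, 4.0]   # hypothetical paired observations
y2 = [2.0, 1.0, 4.0, 3.0]

s1, s2 = math.sqrt(variance(y1)), math.sqrt(variance(y2))
r12 = covariance(y1, y2) / (s1 * s2)            # correlation coefficient

var_sum = variance([a + b for a, b in zip(y1, y2)])
var_diff = variance([a - b for a, b in zip(y1, y2)])

print(var_sum, s1 ** 2 + s2 ** 2 + 2 * r12 * s1 * s2)
print(var_diff, s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2)
```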
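A1.9 Proof that the general expression for the G test can be simplified to Expressions (13.4) and (13.5).

In general, $G$ is twice the natural logarithm of the ratio of the probability of the sample with all parameters estimated from the data and the probability of the sample assuming the null hypothesis is true. Assuming a multinomial distribution, this ratio is

$$L = \frac{\dfrac{n!}{f_1! f_2! \cdots f_a!}\, p_1^{f_1} p_2^{f_2} \cdots p_a^{f_a}}{\dfrac{n!}{f_1! f_2! \cdots f_a!}\, \hat{p}_1^{f_1} \hat{p}_2^{f_2} \cdots \hat{p}_a^{f_a}} = \prod_{i=1}^{a}\left(\frac{p_i}{\hat{p}_i}\right)^{f_i}$$

where $f_i$ is the observed frequency, $p_i$ is the observed proportion, and $\hat{p}_i$ the expected proportion of class $i$, while $n$ is sample size, the sum of the observed frequencies over the $a$ classes.

$$G = 2 \ln L = 2\sum^{a} f_i \ln\left(\frac{p_i}{\hat{p}_i}\right)$$

Since $f_i = np_i$ and $\hat{f}_i = n\hat{p}_i$,

$$G = 2\sum^{a} f_i \ln\left(\frac{f_i}{\hat{f}_i}\right) \tag{13.4}$$

If we now replace $\hat{f}_i$ by $n\hat{p}_i$,

$$G = 2\sum^{a} f_i \ln\left(\frac{f_i}{n\hat{p}_i}\right) = 2\left[\sum^{a} f_i \ln\left(\frac{f_i}{\hat{p}_i}\right) - \sum^{a} f_i \ln n\right] = 2\left[\sum^{a} f_i \ln\left(\frac{f_i}{\hat{p}_i}\right) - n \ln n\right] \tag{13.5}$$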
APPENDIX 2

Statistical Tables

I. Twenty-five hundred random digits
II. Areas of the normal curve
III. Critical values of Student's t distribution
IV. Critical values of the chi-square distribution
V. Critical values of the F distribution
VI. Critical values of $F_{max}$
VII. Shortest unbiased confidence limits for the variance
VIII. Critical values for correlation coefficients
IX. Confidence limits of percentages
X. The z transformation of correlation coefficient r
XI. Critical values of U, the Mann-Whitney statistic
XII. Critical values of the Wilcoxon rank sum
XIII. Critical values of the two-sample Kolmogorov-Smirnov statistic
XIV. Critical values for Kendall's rank correlation coefficient
TABLE I
Twenty-five hundred random digits.
2
14952 38149 25861 03370 58554 10412 62687 04281 24817 91751 04873 67835 39387 39246 50269 37696 48809 12825 06261 19595 87152 55937 75131 68523 91001 81842 88481 66829 84193 23796 26615 65541 71562 96742 24738 60599 81537 77192 07189 78392 54723 73460 28610 78901 92159 40964 68447 73328 78393 07625
3
72619 49692 38504 42806 16085 69189 91778 39979 81099 53512 54053 28302 78191 01350 67005 27965 36698 81744 54265 13687 20719 21417 72386 29850 52315 01076 61191 72838 57581 16919 43980 37937 95493 61486 67749 85828 .59527 50623 80539 11733 18227 88841 87957 59710 21971 98780 35665 13266 33021 05255
4
73689 31366 14752 11393 51555 85171 80354 03927 48940 23748 25955 45048 88415 99451 40442 30459 42453 28882 16203 74872 25215 49944 11689 67833 26430 99414 25013 08074 77252 99691 09810 41105 34112 43305 83748 19152 95674 41215 75927 57703 28449 39602 21497 27396 16901 72418 31530 54898 05867 83254
5
52059 52093 23757 71722 27501 29082 23512 82564 69554 65906 48518 56761 60269 61862 33100 91011 83061 27369 23340 89181 04349 38356 95727 05622 54175 31574 30272 57080 85604 80276 38289 70106 76895 34183 59