

A mini project report Submitted to the SRM University in partial fulfillment of the requirements for the award of the degree of MASTER OF BUSINESS ADMINISTRATION BY

Under the Supervision and Guidance of MS. RAMMA R Assistant Professor
















The chi-square (χ²) test is a nonparametric statistical method often used in experimental work where the data consist of frequencies or "counts" (for example, the number of boys and girls in a class who have had their tonsils out), as distinct from quantitative data obtained by measuring continuous variables such as temperature or height. The most common use of the test is to assess the probability of association or independence of facts. This paper summarizes the chi-squared test. Beginning from the basics, it discusses, with descriptive examples, where and how to apply the test. The purpose of the paper is to present a quick overview of the chi-square test, so that readers without much knowledge of statistics may use it as a beginner's guide.

1. Introduction


Suppose we have a sample of boys and girls from the 5th, 8th, and 12th grades of a school. We may want to know whether there is an association between the gender of students and their grade levels. Or, we may want to know whether the men and women in a liberal arts college differ in their selection of majors. These are the types of questions that the chi-squared (χ²) test is designed to answer. It was first introduced by the British statistician Karl Pearson in 1900. Chi is a letter of the Greek alphabet; the symbol is χ and it's pronounced like KYE, the sound in "kite." The chi-square test uses the statistic chi squared, written χ². The "test" that uses this statistic helps an investigator determine whether an observed set of results matches an expected outcome. In some types of research (genetics provides many examples) there may be a theoretical basis for expecting a particular result: not a guess, but a predicted outcome based on a sound theoretical foundation. Before going into the details of the chi-squared test, we first present some statistical preliminaries necessary to understand it clearly.

Population: A population is an individual or group that represents all the members of a certain group or category of interest. Populations do not have to include people. Suppose we want to know the average age of the cats in the city. The population in this study is made up of cats, not people. A population does not need to be large to count as a population. We may want to know the average height of the 3 kids in a family. In this study, the population comprises only 3 kids.

Sample: A sample is a subset of a given population. Samples are not necessarily good representations of the populations from which they are selected, but choosing representative samples is important for most experiments.

Parameter: A parameter is a value generated from, or applied to, a population.

Variable: A variable is pretty much anything that can be codified and can take its value from a set (domain) of more than one value. Variables may be quantitative or qualitative. A quantitative variable is one that is scored in such a way that its values indicate some sort of amount. For example, height is a quantitative variable. In contrast, a qualitative variable indicates some kind of category. A commonly used qualitative variable in social science research is the dichotomous variable, which has two different categories. For instance, gender has two categories: male and female. The chi-square test is applicable when we have qualitative variables classified into categories.

Nominally scaled variable: A nominally scaled variable is one in which the labels used to identify the different levels of the variable have no weight, or numeric value. For example, the sample may be divided into two groups labeled 0 and 1. In this case the value 1 does not indicate a higher score than the value 0. Rather, 0 and 1 are simply names, or labels, that have been assigned to each group.

Contingency table: When the members of a sample are doubly classified (i.e., classified in two separate ways), the results may be arranged in a rectangular table. Such a table is called a contingency table.

TYPES OF CHI SQUARE TESTS

Two types of χ² test:
- χ² Test for Goodness of Fit: involves a single categorical variable only
- χ² Test for Independence: involves two (or more) categorical variables

1. χ² Test for Goodness of Fit
- Generally involves just one categorical variable.
- The null hypothesis specifies the proportion of the population in each category.
- Determines how well the sample data conform to the proportions set forth by the null hypothesis.
- The test statistic (χ²) examines whether the proportions in the sample reliably differ from the null hypothesis.

2. χ² Test for Independence
- Examines the relationship between two (or more) categorical variables to determine if they are independent.

Two variables are said to be independent if there is no relationship between them, and dependent if there is a relationship between them. The test is similar to the correlation coefficient, except that instead of both variables being continuous, both variables are categorical.
- You can't correlate Academic Major with Favorite Barnyard Animal, but you can do a chi-square.

Null hypothesis: no relationship. Alternative hypothesis: some relationship.

Using Chi Square

1) Knowing which statistic to use to test the relationship between variables depends on the type of data you have (and sometimes the type of question you want to answer). Chi Square can be used when your variables are discrete.

2) Single Discrete Variable: Goodness of Fit χ². Allows us to test whether the group frequencies differ from chance patterns (base rate frequencies: the frequencies with which instances naturally occur in the environment). (df = k - 1)

Calculating Expected Frequencies

For Equal Chances Assumed: Ei = n/k
For Unequal Chances Assumed: Ei = n(pi)

where n = the total sample size, k = the number of groups, and pi = the probability of occurrence for outcome i, expressed as a proportion (p = % / 100, so 20% = .20). The observed and expected frequencies each sum to the sample size: ΣO = n, ΣE = n, so ΣO = ΣE.

Statistical Hypotheses:
Ho: Of = Ef
Ha: Of ≠ Ef

Example Research Question: Does the number of people who say they like cheesy poofs (Yes = 1) vs. those who do not like cheesy poofs (No = 0) differ significantly from the number expected by chance alone?
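The two rules for expected frequencies can be sketched in a few lines of Python. This is a minimal illustration, not part of the original report: the function name `expected_frequencies` and the sample size of 100 are assumptions chosen for the example.

```python
def expected_frequencies(n, k=None, probs=None):
    """Expected counts for a goodness-of-fit test.

    Equal chances assumed:   E_i = n / k
    Unequal chances assumed: E_i = n * p_i
    (function name and arguments are illustrative assumptions)
    """
    if probs is not None:
        # Unequal chances: the hypothesized proportions must cover all outcomes.
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")
        return [n * p for p in probs]
    if k is None or k < 1:
        raise ValueError("give either k (number of groups) or probs")
    # Equal chances: every group gets the same share of the sample.
    return [n / k] * k


# Hypothetical sample of 100 respondents (Yes / No answers).
equal = expected_frequencies(100, k=2)
unequal = expected_frequencies(100, probs=[0.20, 0.80])
print(equal, unequal)
```

In both cases the expected frequencies add up to the sample size, which is the ΣE = n condition stated above.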

The Decision to Reject Ho:
If χ² obtained is > χ² critical @ p < .05 for df = k-1, then reject Ho in favor of Ha.
If χ² obtained is < χ² critical @ p < .05 for df = k-1, then fail to reject Ho (Ha is not supported).

3) Discrete × Discrete: Test of independence. Allows us to test whether the cross-tabulation pattern of two nominal variables differs from the pattern expected by chance. (df = (R-1)(C-1)), where R = # of rows & C = # of columns.

Statistical Hypotheses:

Ho: Of = Ef
Ha: Of ≠ Ef

Example Research Question: Does the number of people who think they are Eric Cartman (yes = 1, no = 0), relative to whether or not they eat cheesy poofs (yes = 1, no = 0), significantly differ from the frequencies that are expected by chance alone?

4) Limitations on χ²:
a) Responses must be independent, mutually exclusive, and exhaustive. Each case from the sample should fit into one and only one cell of the cross-tab matrix. (This also means that repeated-measures designs cannot be tested using chi square.)
b) Low expected frequencies limit the validity of χ².

If df = 1 (e.g., a 2x2 matrix), then no expected frequency can be less than 5. If df = 2, all expected frequencies should exceed 2. If df = 3 or greater, then all expected frequencies except one should be 5 or greater, and that one cell needs to have an expected frequency of 1 or greater.

The interpretation

In every χ² test the calculated χ² value will either be (i) less than or equal to the critical χ² value or (ii) greater than the critical χ² value. If calculated χ² ≤ critical χ², then we conclude that there is no statistically significant difference between the two distributions. That is, the observed results are not significantly different from the expected results, and the numerical difference between observed and expected can be attributed to chance. If calculated χ² > critical χ², then we conclude that there is a statistically significant difference between the two distributions. That is, the observed results are significantly different from the expected results, and the numerical difference between observed and expected cannot be attributed to chance. That means that the difference found is due to some other factor. This test won't identify that other factor; it shows only that some factor other than chance is responsible for the difference between the two distributions.
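The low-expected-frequency rules stated above can be written as a small check to run before trusting a χ² result. This is a sketch that encodes those rules literally; the function name `expected_counts_adequate` is an assumption made for illustration.

```python
def expected_counts_adequate(expected, df):
    """Apply the expected-frequency rules quoted in the text.

    df = 1  : no expected frequency may be below 5
    df = 2  : every expected frequency should exceed 2
    df >= 3 : at most one cell may fall below 5, and that cell must be >= 1
    (function name is an illustrative assumption)
    """
    if df < 1:
        raise ValueError("df must be at least 1")
    if df == 1:
        return all(e >= 5 for e in expected)
    if df == 2:
        return all(e > 2 for e in expected)
    small = [e for e in expected if e < 5]
    return len(small) <= 1 and all(e >= 1 for e in small)


print(expected_counts_adequate([30, 45, 15, 15, 45], df=4))  # every cell is 5 or more
print(expected_counts_adequate([4, 20], df=1))               # one cell is below 5
```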





The chi-square (χ²) test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. Do the numbers of individuals or objects that fall in each category differ significantly from the numbers you would expect? Is the difference between the expected and observed frequencies due to sampling error, or is it a real difference?

Chi-Square Test Requirements
1. Quantitative data.
2. One or more categories.
3. Independent observations.
4. Adequate sample size (at least 10).
5. Simple random sample.
6. Data in frequency form.
7. All observations must be used.

Expected Frequencies

When you find the value for chi square, you determine whether the observed frequencies differ significantly from the expected frequencies. You find the expected frequencies for chi square in the following ways:

1. You hypothesize that all the frequencies are equal in each category. For example, you might expect that half of the entering freshman class of 200 at Tech College will be identified as women and half as men. You figure the expected frequency by dividing the number in the sample by the number of categories. In this example, where there are 200 entering freshmen and two categories, male and female, you divide your sample of 200 by 2, the number of categories, to get 100 (the expected frequency) in each category.

2. You determine the expected frequencies on the basis of some prior knowledge. Let's use the Tech College example again, but this time pretend we have prior knowledge of the frequencies of men and women from last year's entering class, when 60% of the freshmen were men and 40% were women. This year you might expect that 60% of the total would be men and 40% would be women. You find the expected frequencies by multiplying the sample size by each of the hypothesized population proportions. If the freshman total were 200, you would expect 120 to be men (60% x 200) and 80 to be women (40% x 200).

Now let's take a situation, find the expected frequencies, and use the chi-square test to solve the problem.

CASE: Thai, the manager of a car dealership, did not want to stock cars that were bought less frequently because of their unpopular color. The five colors that she ordered were red, yellow, green, blue, and white. According to Thai, the expected frequencies, or number of customers choosing each color, should follow the percentages of last year. She felt 20% would choose yellow, 30% would choose red, 10% would choose green, 10% would choose blue, and 30% would choose white. She then took a random sample of 150 customers and asked them their color preferences. The results of this poll are shown in Table 1 under the column labeled "observed frequencies."


SOLUTION: The expected frequencies in Table 2 are figured from last year's percentages.

Table 2: Expected Frequencies

Based on the percentages for last year, we would expect 20% to choose yellow. Figure the expected frequency for yellow by taking 20% of the 150 customers, which gives an expected frequency of 30 people for this category. For the color red we would expect 30% of 150, or 45 people, to fall in this category. Using this method, Thai figured out the expected frequencies: 30, 45, 15, 15, and 45.

Obviously, there are discrepancies between the colors preferred by customers in the poll taken by Thai and the colors preferred by the customers who bought their cars last year. Most striking is the difference in the green and white colors. If Thai were to follow the results of her poll, she would stock twice as many green cars as she would if she followed the customer color preference for green based on last year's sales. In the case of white cars, she would stock half as many this year. What to do?

Thai needs to know whether or not the discrepancies between last year's choices (expected frequencies) and this year's preferences on the basis of her poll (observed frequencies) demonstrate a real change in customer color preferences. It could be that the differences are simply a result of the random sample she chanced to select. If so, then the population of customers really has not changed from last year as far as color preferences go.

The NULL HYPOTHESIS states that there is no significant difference between the expected and observed frequencies. The ALTERNATIVE HYPOTHESIS states that they are different.
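As a quick check of the arithmetic, the expected frequencies can be computed directly from last year's percentages and the sample size of 150. The variable names below are illustrative; the percentages and the sample size are the ones given in the case.

```python
sample_size = 150

# Last year's color shares, in percent, as stated in the case.
last_year_percent = {"yellow": 20, "red": 30, "green": 10, "blue": 10, "white": 30}

# Expected frequency = sample size x hypothesized proportion.
expected = {color: sample_size * pct / 100 for color, pct in last_year_percent.items()}

for color, count in expected.items():
    print(f"{color:<7}{count:>6.1f}")
```

The five values sum to 150, the size of the sample, as expected frequencies always should.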

The level of significance is set at .05, the standard for most science experiments. Rejecting the null hypothesis at this level means that a difference as large as the one observed would be expected to arise from chance alone no more than 5% of the time. The chi-square formula used on these data is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where
O is the Observed Frequency in each category
E is the Expected Frequency in the corresponding category
df is the "degree of freedom" (n-1)
χ² is Chi Square

PROCEDURE

We are now ready to use our formula for χ² and find out if there is a significant difference between the observed and expected frequencies for the customers in choosing cars. We will set up a worksheet; then you will follow the directions to form the columns and solve the formula.

1. Directions for Setting Up Worksheet for Chi Square



Degrees of freedom (df) refers to the number of values that are free to vary after a restriction has been placed on the data. For instance, if you have four numbers with the restriction that their sum has to be 50, then three of these numbers can be anything (they are free to vary), but the fourth number is fixed. For example, the first three numbers could be 15, 20, and 5, adding up to 40; then the fourth number has to be 10 in order that they sum to 50. The degrees of freedom for these values are then three. The degrees of freedom here are defined as N - 1, the number in the group minus one restriction (4 - 1). In the car example there are five color categories, so df = 5 - 1 = 4.

Table 3: Table of chi-square critical values

Degrees of     Probability (P)
freedom (df)   0.05      0.01
 1              3.84      6.63
 2              5.99      9.21
 3              7.81     11.34
 4              9.49     13.28
 5             11.07     15.09
 6             12.59     16.81
 7             14.07     18.48
 8             15.51     20.09
 9             16.92     21.67
10             18.31     23.21
11             19.68     24.73
12             21.03     26.22
13             22.36     27.69
14             23.68     29.14
15             25.00     30.58
16             26.30     32.00
17             27.59     33.41
18             28.87     34.81
19             30.14     36.19
20             31.41     37.57
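The table lookup can also be expressed in code. The sketch below simply stores the critical values from Table 3 and indexes them by degrees of freedom; the names `CRITICAL_VALUES` and `critical_value` are assumptions made for illustration, and only the two probability levels in the table are supported.

```python
# Critical values copied from Table 3, indexed by df = 1..20.
CRITICAL_VALUES = {
    0.05: [3.84, 5.99, 7.81, 9.49, 11.07, 12.59, 14.07, 15.51, 16.92, 18.31,
           19.68, 21.03, 22.36, 23.68, 25.00, 26.30, 27.59, 28.87, 30.14, 31.41],
    0.01: [6.63, 9.21, 11.34, 13.28, 15.09, 16.81, 18.48, 20.09, 21.67, 23.21,
           24.73, 26.22, 27.69, 29.14, 30.58, 32.00, 33.41, 34.81, 36.19, 37.57],
}


def critical_value(df, alpha=0.05):
    """Look up the chi-square critical value for a given df and probability level."""
    if alpha not in CRITICAL_VALUES:
        raise ValueError("only P = 0.05 and P = 0.01 are tabulated")
    if not 1 <= df <= 20:
        raise ValueError("the table covers df = 1 to 20 only")
    return CRITICAL_VALUES[alpha][df - 1]


# Goodness of fit with five categories: df = 5 - 1 = 4.
df = 5 - 1
print(df, critical_value(df), critical_value(df, alpha=0.01))
```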


3. Find the table value for Chi Square. Begin by finding the df found in step 2 along the left-hand side of the table. Run your finger across the proper row until you reach the predetermined level of significance (.05) at the column heading at the top of the table. The table value for Chi Square at 4 df and the P = .05 level of significance is 9.49.

4. If the calculated chi-square value for the set of data you are analyzing (26.95) is equal to or greater than the table value (9.49), reject the null hypothesis. There IS a significant difference between the data sets that cannot be attributed to chance alone. If the number you calculate is LESS than the number you find in the table, then you can probably say that any differences are due to chance alone.

In this situation, the rejection of the null hypothesis means that the differences between the expected frequencies (based upon last year's car sales) and the observed frequencies (based upon this year's poll taken by Thai) are not due to chance. That is, they are not due to chance variation in the sample Thai took; there is a real difference between them. Therefore, in deciding what color autos to stock, it would be to Thai's advantage to pay careful attention to the results of her poll.
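The worksheet calculation can be sketched as follows. Note that Table 1 is not reproduced in this text, so the observed counts below are an assumption: they were chosen to sum to 150, to match the narrative (twice as many green and roughly half as many white as expected), and to give a statistic close to the reported 26.95. Treat them as illustrative rather than as the report's actual data.

```python
# Expected frequencies from the text (yellow, red, green, blue, white).
expected = [30, 45, 15, 15, 45]

# ASSUMED observed counts (Table 1 is missing from this text); illustrative only.
observed = [35, 50, 30, 10, 25]


def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


chi2 = chi_square(observed, expected)
df = len(observed) - 1   # five categories minus one restriction
critical = 9.49          # table value for df = 4 at P = .05

# Decision rule from step 4: reject when the statistic is >= the table value.
reject_null = chi2 >= critical
print(f"chi-square = {chi2:.2f}, df = {df}, reject null: {reject_null}")
```

With these assumed counts the unrounded statistic is about 26.94; rounding each term to two decimals before adding gives 26.95, the figure quoted in step 4.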


THE STEPS IN USING THE CHI-SQUARE TEST MAY BE SUMMARIZED AS FOLLOWS:

CHI-SQUARE TEST SUMMARY
1. Write the observed frequencies in column O.
2. Figure the expected frequencies and write them in column E.
3. Use the formula to find the chi-square value: $\chi^2 = \sum (O - E)^2 / E$
4. Find the df (N-1).
5. Find the table value (consult the Chi Square Table).
6. If your chi-square value is equal to or greater than the table value, reject the null hypothesis: differences in your data are not due to chance alone.



In 1956, the number of people on record who died of tuberculosis in England and Wales was 5375. Of these, 3804 were males and 1571 were females; 3534 males and 1319 females died of tuberculosis of the respiratory system, while the remainder died of other forms of tuberculosis. These data can be arranged in a contingency table as shown in Table 1. This 2 × 2 contingency table (the members of the sample having been dichotomized in two different ways) is an example of the simplest form.


Chi-squared Test

The entries in the cells of a contingency table may be frequencies, as in Table 1, or the frequencies may be transformed into proportions or percentages. However, it is important to note that in whatever form (frequencies, proportions, etc.) they are presented, the entries are not continuous measurements. The chi-squared test can be applied only to discrete data; of course, continuous data can often be put into discrete form by the use of intervals on a continuous scale. For instance, age is a continuous variable, but if people are classified into different age groups, then the intervals of time corresponding to these groups can be treated as if they were discrete units.

The chi-square (χ²) test is a nonparametric statistical test used to determine whether two or more classifications of the samples are independent or not. For explanation, let us consider the data presented in Table 1, where the people are classified according to two attributes: gender and type of tuberculosis. We may want to determine whether death caused by tuberculosis of the respiratory system or by another type of tuberculosis is dependent on gender. To get the answer, we may apply the chi-square test.

In order to carry out the chi-square test, it is helpful to rearrange Table 1, leaving the marginal frequencies as they are but replacing the frequencies in the cells of the body of the table by the letters E1 to E4. After this rearrangement we get Table 2. Looking at the column of marginal totals on the right of the table, we see that the proportion of deaths, males and females combined, due to tuberculosis of the respiratory system is (3534 + 1319)/5375 = 4853/5375 ≈ 0.903.


Now, if the two classifications are independent, that is, if the form of tuberculosis from which people die is independent of gender, we would expect the proportion of males who died from tuberculosis of the respiratory system to be equal to that of females who died from the same cause, and consequently to equal the proportion for the total group, 0.903. The expected values E1 and E2 then must be chosen so that the following holds.

Using equation 2 we find

Once the value of E1 is known, the values of E2, E3, and E4 can be deduced since the following facts are true.
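The equations referred to in this passage are not reproduced in this text. A reconstruction consistent with the totals quoted above is sketched below. It assumes that E1 and E2 are the expected male and female deaths from respiratory tuberculosis and that E3 and E4 are the expected male and female deaths from other forms; this labeling is inferred from the surrounding description rather than taken from Table 2 itself.

```latex
% Independence: the same proportion in each sex as in the total group
\frac{E_1}{3804} = \frac{E_2}{1571} = \frac{4853}{5375} \approx 0.903

% Solving for the first expected value
E_1 = 3804 \times \frac{4853}{5375} \approx 3434.6

% Marginal totals are fixed, so the remaining cells follow
E_1 + E_2 = 4853, \qquad E_3 + E_4 = 5375 - 4853 = 522
E_1 + E_3 = 3804, \qquad E_2 + E_4 = 1571

% Hence
E_2 \approx 1418.4, \qquad E_3 \approx 369.4, \qquad E_4 \approx 152.6
```

The four expected values sum to 5375, the total number of deaths, which serves as a check on the arithmetic.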

Calculating the values of E1, E2, E3, and E4 and putting these expected values in place of the corresponding letters in Table 2, we get Table 3. These are the values that one would expect to find in the cells in the body of Table 1 were the two methods of classification independent. Though the number of people cannot be fractional, fractional values may appear in the table of expected frequencies. Especially when the sample size is small, we retain the fractional values to increase the accuracy of subsequent calculations.

Now, if we refer again to the data, we see that the observed frequencies in Table 1 differ considerably from the expected frequencies in Table 3. The question is whether this difference could have arisen from random sampling error alone, or whether it indicates a real difference between the genders. Let us define our null hypothesis as follows. Null hypothesis: Among the men and women who died of tuberculosis in 1956, the form of tuberculosis (respiratory or other) is independent of sex.

The chi-square test, if properly applied, may give us the answer by rejecting the null hypothesis or failing to reject it. The test is based on the chi-square (χ²) distribution. To compare the observed and expected frequencies, we compute the chi-square (χ²) value using the formula stated in equation 8.

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \qquad (8)$$

In equation 8, Oi stands for the observed frequencies, Ei stands for the expected frequencies, and i runs over 1, 2, ..., n, where n is the number of cells in the contingency table. To perform the calculations it is useful to arrange the observed and expected frequencies as shown in Table 4. To assess the significance of the calculated value of χ², we refer to the standard chi-square table presented in the Appendix. This table contains the critical χ² values for different degrees of freedom and levels of probability.

Referring back to Table 3, we recall that once the value of any one of the Ei (i = 1, ..., 4) had been determined, all the other Ei could be deduced. In other words, when the marginal totals of a 2 × 2 contingency table are given, only one cell in the body of the table can be filled arbitrarily. This fact is expressed by saying that a 2 × 2 contingency table has only one degree of freedom. The degrees of freedom (df) of a contingency table with r rows and c columns are computed using the formula given in equation 9.

$$df = (r - 1)(c - 1) \qquad (9)$$
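The whole calculation for the tuberculosis data can be sketched in Python. The observed counts come from the figures quoted earlier (the "other forms" cells are obtained by subtraction: 3804 - 3534 = 270 and 1571 - 1319 = 252); the function name `chi_square_independence` and the row and column ordering are assumptions made for illustration.

```python
# Rows: respiratory TB, other forms of TB. Columns: males, females.
observed = [
    [3534, 1319],
    [3804 - 3534, 1571 - 1319],   # 270 males, 252 females
]


def chi_square_independence(table):
    """Return (chi-square, df, expected table) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence, E = row total x column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


chi2, df, expected = chi_square_independence(observed)
print(f"chi-square = {chi2:.1f}, df = {df}")
print([[round(e, 1) for e in row] for row in expected])
```

Computed with unrounded expected values, the statistic comes out at roughly 101.4. The report quotes 101.35; the small gap is presumably due to rounding the expected frequencies before summing.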

To assess the significance of our chi-square value, χ² = 101.35, we enter the chi-square table of the Appendix with df = 1; that is, we look at the first row, which corresponds to one degree of freedom. The largest value in that row is 10.828, under the probability (P) level 0.001. A value of chi-square equal to or greater than 10.828 would be expected to occur by chance only once in a thousand times if the null hypothesis is true. Since our chi-square value of 101.35 is much greater than 10.828, it would be expected to occur even less frequently. Hence our chi-square test rejects the null hypothesis. So, we conclude that the proportion of males who died from tuberculosis of the respiratory system, namely 3534/3804 = 0.929, is significantly different from the proportion of females who died from the same cause, namely 1319/1571 = 0.840.

Before carrying out the test we define a level of confidence, which is the probability level (P) we are going to accept. Once the χ² value is computed and the number of degrees of freedom is determined, we go to the chi-square table and look at the row corresponding to the given degrees of freedom. If our χ² value is less than the value corresponding to our level of confidence (that is, it lies to the left of that column), we fail to reject the null hypothesis. On the contrary, if our χ² value lies at or to the right of the level of confidence, indicating a smaller probability that the difference occurred by chance, our chi-square test rejects the null hypothesis. In that case we conclude that the classifications on the population are dependent on each other.

Limitations of Chi-square Test

As mentioned before, the chi-square test cannot be applied to continuous data. It can only be applied to qualitative data classified into categories, or labeled using nominally scaled variables [3, 4]. In the standard chi-square table presented in the Appendix, the chi-square values are computed using the formula in equation 8 on the assumption that the expected values (E) are large. Most statisticians warn against using the test when any of the expected values is less than 5. This warning implies that the use of the chi-square test is restricted to large samples.

CONCLUSION

The chi-square test tells us whether the classifications on a given population are dependent on each other or not. However, it is important to stress that the establishment of a statistical association by means of chi-square does not necessarily imply any causal relationship between the attributes being compared, but it does indicate that the reason for the association is worth investigating. For example, if further investigation were carried out on the deaths from tuberculosis, we might find that the reason why proportionally more men than women died from tuberculosis of the respiratory system is that there are more smokers among men than among women.