
Chapter 5: Discriminant Analysis

Discriminant analysis is the appropriate statistical technique when the dependent variable is categorical and the independent variables are quantitative. In many cases the dependent variable consists of two groups or classifications, for example male versus female, high versus low, or good credit risk versus bad credit risk. In other instances more than two groups are involved, such as a three-group classification involving low, medium and high classifications. The basic purpose of discriminant analysis is to estimate the relationship between a single categorical dependent variable and a set of quantitative independent variables.

Discriminant analysis has widespread application in situations where the primary objective is identifying the group to which an object (e.g. a person, firm or product) belongs. Potential applications include predicting the success or failure of a new product, deciding whether a student should be admitted to graduate school, classifying students as to vocational interests, determining what category of credit risk a person falls into, or predicting whether a firm will be a success. Discriminant analysis can handle either two groups or multiple groups; when three or more classifications are identified, the technique is referred to as multiple discriminant analysis (MDA).

Discriminant analysis involves deriving a variate: the linear combination of the two (or more) independent variables that will discriminate best between the defined groups. This linear combination, also known as the discriminant function, is derived from an equation that takes the following form:

Z = W1X1 + W2X2 + W3X3 + ... + WnXn

where:
Z  = discriminant score
Wi = discriminant weight for variable i
Xi = independent variable i
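As a minimal sketch, the discriminant score is just a weighted sum of the independent variables; the weights and observation values below are hypothetical, chosen only to illustrate the arithmetic:

```python
def discriminant_score(weights, values):
    """Z = W1*X1 + W2*X2 + ... + Wn*Xn."""
    return sum(w * x for w, x in zip(weights, values))

# Hypothetical weights for three independent variables and one observation.
z = discriminant_score([0.8, -0.4, 1.2], [5.0, 3.0, 2.0])  # -> 5.2
```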

Discriminant analysis is the appropriate statistical technique for testing the hypothesis that the group means of a set of independent variables for two or more groups are equal. A group mean is referred to as a centroid. The centroids indicate the most typical location of an individual from a particular group, and a comparison of the group centroids shows how far apart the groups are along the dimension being tested. A situation with three groups (1, 2 and 3) and two independent variables (X1 and X2) is plotted below.

The test for the statistical significance of the discriminant function is a generalized measure of the distance between the group centroids. If the overlap in the distributions is small, the discriminant function separates the groups well; if the overlap is large, the function is a poor discriminator between the groups.

Multiple discriminant analysis is unique in one characteristic among the dependence relationships we will study: if there are more than two groups in the dependent variable, discriminant analysis will calculate more than one discriminant function. In fact, it will calculate NG - 1 functions, where NG is the number of groups.

Step 1: Objectives Of Discriminant Analysis

Discriminant analysis can address any of the following research questions:

- Determining whether statistically significant differences exist between the average score profiles on a set of variables for two (or more) defined groups.
- Determining which of the independent variables account most for the differences in the average score profiles of the two or more groups.
- Establishing procedures for classifying statistical units (individuals or objects) into groups on the basis of their scores on a set of independent variables.
- Establishing the number and composition of the dimensions of discrimination between groups formed from the set of independent variables.

HATCO Example

One of the customer characteristics obtained by HATCO in its survey was a categorical variable (X11) indicating which purchasing approach a firm used: total value analysis (X11 = 1) versus specification buying (X11 = 0). HATCO's management team
expects that firms using these two approaches will emphasize different characteristics of suppliers in their selection decisions. The objective is to identify the perceptions of HATCO (X1 to X7) that differ significantly between firms using the two purchasing methods. The company would then be able to tailor its sales presentations and the benefits offered to best match the buyers' perceptions.

Step 2: Research Design for Discriminant Analysis

The number of dependent variable groups (categories) can be two or more, but these groups must be mutually exclusive and exhaustive. When three or more categories are created, the possibility arises of examining only the extreme groups in a two-group discriminant analysis. This procedure, called the polar-extremes approach, involves comparing only the two extreme groups and excluding the middle group from the discriminant analysis. The polar-extremes approach may be useful if we had three groups of cola drinkers (light, medium and heavy) with considerable overlap between the categories: we may not be able to clearly discriminate between all three groups, but the differences between light and heavy users may be more pronounced.

Independent variables are usually selected in two ways: from previous research, or from intuition (selecting variables for which no previous research or theory exists but that might logically be related to predicting the groups for the dependent variable).

Discriminant analysis is quite sensitive to the ratio of sample size to the number of predictor variables. Many studies suggest a ratio
of 20 observations for each predictor variable, although this will often be unachievable. At a minimum, though, the smallest group size must exceed the number of independent variables. Many times the sample is divided into two subsamples, one used for estimation of the discriminant function (the analysis sample) and another for validation purposes (the holdout sample). This method of validating the function is referred to as the split-sample or cross-validation approach.

No definite guidelines have been established for dividing the sample into analysis and holdout groups. The most popular procedure is to divide the total group so that one-half of the respondents are placed in the analysis sample and the other half in the holdout sample; some researchers prefer a 60-40 or a 75-25 split, however. When selecting the individuals for the analysis and holdout groups, one usually follows a proportionately stratified sampling procedure, i.e. if a sample consists of 40 males and 60 females, a half-sized holdout sample should consist of 20 males and 30 females. If the sample size isn't large enough to split in this way (if n < 100), one compromise is to develop the function on the entire sample and then use the function to classify the same group used to develop it. This gives an inflated idea of the predictive accuracy of the function, though.

Example (HATCO, continued)

The discriminant analysis will use the first seven variables from the database (X1 to X7) to discriminate between firms applying each purchasing method (X11). The sample of 100 observations meets the suggested minimum size and provides a 15-to-1 ratio of observations to independent variables.
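The proportionately stratified split described above can be sketched in plain Python (a hypothetical helper, not part of any statistics package): each group is shuffled and then split in the same proportion, so the holdout sample mirrors the group mix of the full sample.

```python
import random

def stratified_split(observations, group_of, holdout_frac=0.4, seed=1):
    """Proportionately stratified split: shuffle within each group, then
    place holdout_frac of each group into the holdout sample."""
    groups = {}
    for obs in observations:
        groups.setdefault(group_of(obs), []).append(obs)
    rng = random.Random(seed)
    analysis, holdout = [], []
    for members in groups.values():
        rng.shuffle(members)  # random order negates any bias in data ordering
        cut = round(len(members) * holdout_frac)
        holdout.extend(members[:cut])
        analysis.extend(members[cut:])
    return analysis, holdout

# Illustration with the text's 40 males / 60 females example and a 60-40 split.
sample = [("M", i) for i in range(40)] + [("F", i) for i in range(60)]
analysis, holdout = stratified_split(sample, group_of=lambda o: o[0])
```

With `holdout_frac=0.4` the holdout sample contains 16 of the 40 males and 24 of the 60 females, preserving the 40:60 proportion.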

We can split the sample of 100 into an analysis sample of 60 objects and a holdout sample of 40 objects. We should also make sure that we split the total sample using a proportionately stratified sampling procedure, and that the split is performed randomly to negate any possible bias in the ordering of our data.

Step 3: Assumptions of Discriminant Analysis

The key assumptions for deriving the discriminant function are multivariate normality of the independent variables and unknown (but equal) dispersion and covariance matrices for the groups. Data not meeting the multivariate normality assumption can cause problems in the estimation of the discriminant function; in that case it is suggested that logistic regression be used as an alternative technique, if possible. Unequal covariance matrices can adversely affect the classification process. If the sample sizes are small and the covariance matrices are unequal, the statistical significance of the estimation process is adversely affected. More likely, though, is the case of unequal covariances among groups of adequate sample size, whereby observations are overclassified into the groups with larger covariance matrices. Another characteristic of the data that can affect the results is multicollinearity among the independent variables. Finally, an implicit assumption is that all relationships are linear; nonlinear relationships are not reflected in the discriminant function unless specific variable transformations are made to represent nonlinear effects.

Example (HATCO continued)


Our previous examinations of the HATCO data set indicated no problems with multicollinearity, and tests of the assumptions of normality were performed in Chapter 2. There was not sufficient evidence to stop us from proceeding with our analysis.

Step 4: Estimation Of The Discriminant Function and Assessment Of Overall Fit

Simultaneous estimation involves computing the discriminant function so that all of the independent variables are considered concurrently. The discriminant function is thus computed based on the entire set of independent variables, regardless of the discriminating power of each one. The simultaneous method is appropriate when, for theoretical reasons, the analyst wants to include all the independent variables in the analysis and is not interested in seeing intermediate results based only on the most discriminating variables.

Stepwise estimation is an alternative to the simultaneous approach. It involves entering the independent variables into the discriminant function one at a time on the basis of their discriminating power. The stepwise procedure begins by choosing the single best discriminating variable. This initial variable is then paired with each of the other independent variables one at a time, and the variable best able to improve the discriminating power of the function in combination with the first is chosen. Eventually, either all independent variables will have been included in the function or the excluded variables will have been judged as not contributing significantly to further discrimination. The reduced set of variables is typically almost as good as, and sometimes better than, the complete set.

Wilks' lambda, Hotelling's trace and Pillai's criterion all evaluate the statistical significance of the discriminatory power of the
discriminant function(s). Roy's greatest characteristic root evaluates only the first discriminant function.

Assessing Overall Fit

As discussed earlier, the discriminant Z score of any discriminant function can be calculated for each observation by the following formula:

Zjk = a + W1X1k + W2X2k + ... + WnXnk

where:
Zjk = discriminant Z score of discriminant function j for object k
a   = intercept
Wi  = discriminant coefficient for independent variable i
Xik = independent variable i for object k

This score provides a direct means of comparing observations on each function. The statistical tests for assessing the significance of the discriminant functions do not tell how well the function predicts; we may have group means that are virtually identical yet find significant results simply because our sample size was large. To determine the predictive accuracy of a discriminant function, the analyst must construct classification matrices. A classification matrix contains numbers that reveal the predictive ability of the discriminant function: the numbers on the diagonal represent correct classifications, and the off-diagonal numbers represent misclassifications.

Before a classification matrix can be constructed, however, the analyst must determine into which group to assign each individual. If we have two groups (A and B) and a discriminant function for each
group (ZA and ZB), we will assign each individual to the group on which it has the higher discriminant score. The optimal solution must also consider the cost of misclassifying an individual into the wrong group. If the costs of misclassification are approximately equal, the optimal solution is the one that misclassifies the fewest individuals in each group; if the costs are unequal, the optimal solution is the one that minimizes the total cost of misclassification.

If the analyst is unsure whether the observed proportions in the sample are representative of the population proportions, then equal prior probabilities should be employed. However, if the sample is randomly drawn from the population, so that the groups do estimate the population proportions, then the best estimates of the prior probabilities are not equal values but the sample proportions.

To validate the discriminant function through the use of classification matrices, the sample should have been randomly divided into two groups. One group (the analysis sample) is used to compute the discriminant function; the other (the holdout, or validation, sample) is retained for use in developing the classification matrix. The procedure involves multiplying the weights generated from the analysis sample by the raw variable measurements of the holdout sample. The individual discriminant scores for the holdout sample are then calculated, and each individual is assigned to the group on which it has the higher discriminant score.

A statistical test for the discriminatory power of the classification matrix is Press's Q statistic. This simple measure compares the number of correct classifications with the total sample size and the number of groups. The calculated value is then compared with a
critical value from the chi-square distribution with 1 degree of freedom. If the calculated value exceeds this critical value, the classification matrix can be deemed statistically better than chance. The Q statistic is calculated by the following formula:

Press's Q = [N - (n x K)]^2 / [N x (K - 1)]

where:
N = total sample size
n = number of observations correctly classified
K = number of groups

One must be careful in drawing conclusions based solely on this statistic, however, because as the sample size becomes larger, a lower classification rate will still be deemed significant.

Example (HATCO continued)

First we will examine the group means for each of the independent variables, based on the 60 observations constituting the analysis sample. The comparison of group means is performed in the table below:

Variable                  X11=0    X11=1    F-value    Significance
X1, Delivery Speed        2.712    4.3343    36.53       <.0001
X2, Price Level           3.108    1.7686    22.95       <.0001
X3, Price Flexibility     6.800    8.8429    76.99       <.0001
X4, Manufacturer Image    5.168    5.2829     0.15        .7044
X5, Service               2.884    3.0143     0.41        .5226
X6, Salesforce Image      2.564    2.7200     0.52        .4730
X7, Product Quality       8.276    6.0172    51.95       <.0001

Because the objective of this analysis is to determine which variables are the most efficient in discriminating between firms using the two purchasing approaches, a stepwise procedure is used.

A recommended first step is to analyze the differences in the group means between the different levels of the dependent variable and determine whether any variables can be excluded at the start. It is recommended that any variable with an F-value of less than 1 be dropped from consideration immediately. It appears that variables X4, X5 and X6 don't have any impact on X11, so using them would just complicate our analysis unnecessarily.

The next step is to use the remaining variables (X1, X2, X3 and X7) in a stepwise procedure. SAS does not perform this procedure in discriminant analysis, although it doesn't take long to perform the iterations separately in SAS. After entering each of our four explanatory variables individually, X3 does the best job of any single variable in discriminating. Matching X3 with X1, X2 and X7 individually, we find that the combination of X3 and X7 works best, and we cannot find any substantial improvement after that. It appears that a solution using just X3 and X7 as explanatory variables will offer the best discrimination between the groups.

In the analysis sample of 60 observations, we know that the dependent variable consists of two groups: 25 firms follow the specification buying approach and the remaining 35 firms use the total value analysis method. Since our sample of firms is randomly drawn, we can be reasonably sure that it reflects the population proportions. Thus, this discriminant analysis uses the sample proportions to specify the prior probabilities for classification purposes. The values of X3 and X7 for individual 1 are plugged into the classification function for each of the two groups, and the individual is classified into the group that produces the higher value. This procedure is repeated for all 60 observations.
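The F-value screening above (dropping variables with F < 1) rests on the one-way ANOVA F statistic for a single variable across two groups. A minimal sketch, with hypothetical toy data:

```python
def f_value(group0, group1):
    """One-way ANOVA F statistic for one variable across two groups:
    F = between-group mean square / within-group mean square."""
    n0, n1 = len(group0), len(group1)
    m0 = sum(group0) / n0
    m1 = sum(group1) / n1
    grand = (sum(group0) + sum(group1)) / (n0 + n1)
    # Between-group mean square: df = K - 1 = 1 for two groups.
    msb = n0 * (m0 - grand) ** 2 + n1 * (m1 - grand) ** 2
    # Within-group mean square: df = n0 + n1 - 2.
    ssw = sum((x - m0) ** 2 for x in group0) + sum((x - m1) ** 2 for x in group1)
    msw = ssw / (n0 + n1 - 2)
    return msb / msw

# Hypothetical data; a variable returning F < 1 would be screened out.
f = f_value([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])  # -> 1.5
```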


Our classification matrix for our analysis sample is represented below:

                    Classified into
Actual group       0       1    Total
0                 23       2      25
1                  2      33      35
Total             25      35      60

Press's Q (analysis sample) = [60 - (56 x 2)]^2 / [60 x (2 - 1)] = 45.067

which is greater than our critical value of 6.63. Therefore our results exceed the classification accuracy expected by chance at a statistically significant level. Since this was calculated from our analysis sample though, we would have expected this to be the case. The next step is to see if our holdout sample performs as well. The classification matrix for this sample is represented below:

                    Classified into
Actual group       0       1    Total
0                 13       2      15
1                  6      19      25
Total             19      21      40

Press's Q (holdout sample) = [40 - (32 x 2)]^2 / [40 x (2 - 1)] = 14.4

which is also greater than our critical value of 6.63. Therefore our results still exceed the classification accuracy expected by chance at a statistically significant level.
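Both Press's Q values can be reproduced directly from the formula; the function below is a plain transcription of it:

```python
def press_q(n_total, n_correct, n_groups):
    """Press's Q = [N - (n * K)]^2 / [N * (K - 1)]."""
    return (n_total - n_correct * n_groups) ** 2 / (n_total * (n_groups - 1))

q_analysis = press_q(60, 56, 2)  # -> 45.067 (56 of 60 correct, 2 groups)
q_holdout = press_q(40, 32, 2)   # -> 14.4   (32 of 40 correct, 2 groups)
```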


The researcher must always use caution in the application of a holdout sample with small data sets. In this case the holdout sample size of 40 was adequate, but larger sizes are always more desirable.

Misclassifications

One important step after completing the classification procedure is to examine the misclassifications. From the output we can see that observations 7 and 13 were actually in group 0 (specification buying) but were predicted to be in group 1 (total value analysis). The opposite holds for observations 35 and 58, which were predicted to be in group 0 but were actually in group 1. Once the misclassified cases are identified, further analysis can be performed to understand the reasons for their misclassification. We can combine the misclassified cases from both the analysis and holdout samples and compare them to the correctly classified cases. The aim is to identify specific differences on the independent variables that may suggest either new variables to be added or common characteristics that should be considered.

Step 5: Interpretation of the Results

To interpret our results we need to examine the classification functions:

Variable                  Group 0      Group 1
Constant                -51.66546    -60.42779
X3, Price Flexibility     8.25616     10.94608
X7, Product Quality       5.49035      3.81958
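Applying the classification functions is mechanical: compute both scores and assign the observation to the group with the larger one. The coefficients below are taken from the table above; the observation values are hypothetical, chosen for illustration:

```python
def score_group0(x3, x7):
    """Classification function for group 0 (specification buying)."""
    return -51.66546 + 8.25616 * x3 + 5.49035 * x7

def score_group1(x3, x7):
    """Classification function for group 1 (total value analysis)."""
    return -60.42779 + 10.94608 * x3 + 3.81958 * x7

def classify(x3, x7):
    """Assign the observation to the group with the higher score."""
    return 0 if score_group0(x3, x7) > score_group1(x3, x7) else 1

# Hypothetical observations: high price flexibility favors group 1,
# high product quality favors group 0.
g_a = classify(9.0, 6.0)  # -> 1
g_b = classify(5.0, 9.0)  # -> 0
```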

To interpret the effect that each variable has on the different groups we really need to focus on the differences between the
coefficients. For example, the coefficient of X3 is larger in group 1 than in group 0, meaning that observations with a high perception of price flexibility are more likely to be in group 1. Conversely, the coefficient of X7 is lower in group 1 than in group 0, meaning that the higher an observation's perception of product quality, the less likely it is to be in group 1.

Step 6: Validation of the Results

The final stage of a discriminant analysis involves validating the discriminant results to provide assurance that the results have external as well as internal validity. Given the propensity of discriminant analysis to inflate the hit ratio if evaluated only on the analysis sample, cross-validation is an essential step. We can cross-validate by employing an additional sample as a holdout sample, as we have seen, or by profiling the groups on an additional set of variables.

Split-Sample or Cross-Validation Procedures

The justification for dividing the sample into two groups is that an upward bias will occur in the prediction accuracy of the discriminant function if the individuals used in developing the classification matrix are the same as those used in computing the function. The implications of this upward bias are particularly important when the researcher is concerned with the external validity of the findings. Other researchers have suggested, however, that greater confidence can be placed in the validity of the function by following this procedure several times: the researcher randomly divides the sample into analysis and holdout samples several times, each time testing the validity of the function through the development of a classification matrix and a hit ratio, and then averages the several hit ratios to obtain a single measure.

Another option is the U-method, which is based on the leave-one-out principle, in which the discriminant function is fitted to repeatedly drawn subsamples of the original sample. A dataset with 100 observations would involve 100 different discriminant analyses, each performed on 99 of the 100 observations. Each time the discriminant function is calculated, it is used to classify the one observation that was not involved in the calculation of the function. This is the CROSSVALIDATE method performed in SAS.

Profiling Group Differences

Another approach is to profile the groups on a separate set of variables that should mirror the observed group differences. This separate profile provides an assessment of external validity in that the groups vary on both the independent variables and the set of associated variables. This is similar in character to the process we used to validate clusters in cluster analysis.

Example (HATCO continued)

The final stage addresses the internal and external validity of the discriminant function. The primary means of validation is the use of the holdout sample and the assessment of its predictive validity. In this manner, validity is established if the discriminant function performs at an acceptable level in classifying observations that were not used in the estimation process. Our hit ratios of 93.3% (analysis sample) and 80% (holdout sample) certainly appear to validate our results well.
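The U-method's leave-one-out loop can be sketched with a toy classifier standing in for the discriminant function. The nearest-centroid rule below is an illustrative simplification (it assigns an observation to the group whose mean is closest, which is the flavor of what a discriminant rule does with equal priors), and the data are hypothetical:

```python
def nearest_centroid(train, x):
    """Toy classifier: assign x to the group whose mean (centroid) is closest."""
    groups = {}
    for label, v in train:
        groups.setdefault(label, []).append(v)
    best_label, best_dist = None, float("inf")
    for label, vals in groups.items():
        dist = abs(sum(vals) / len(vals) - x)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_hit_ratio(data):
    """U-method: fit on n - 1 observations, classify the one left out, repeat n times."""
    hits = 0
    for i, (label, x) in enumerate(data):
        remaining = data[:i] + data[i + 1:]
        if nearest_centroid(remaining, x) == label:
            hits += 1
    return hits / len(data)

# Hypothetical one-variable data for two well-separated groups.
data = [(0, 1.0), (0, 1.2), (0, 0.9), (1, 3.0), (1, 3.4), (1, 2.8)]
hit_ratio = leave_one_out_hit_ratio(data)  # -> 1.0
```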


Logistic Regression

As we have discussed, discriminant analysis is appropriate when the dependent variable is categorical. However, when the dependent variable has only two groups, logistic regression may be preferred for several reasons. First, discriminant analysis relies on strictly meeting the assumptions of multivariate normality and equal variance-covariance matrices across groups, assumptions that are not met in many situations. Logistic regression does not face these strict assumptions and is much more robust when they are not met, making its application appropriate in many more situations. Second, even if the assumptions are met, many researchers prefer logistic regression because it is similar to multiple regression: both have straightforward statistical tests and the ability to incorporate nonlinear effects. For these and more technical reasons, logistic regression is equivalent to two-group discriminant analysis and may be more suitable in many situations.

Our discussion of logistic regression does not cover each of the six steps of the decision process, but instead highlights the differences and similarities between logistic regression and discriminant analysis. In discriminant analysis, the categorical character of the dependent variable is accommodated by making predictions of group membership based on classification scores. Logistic regression approaches this task in a manner more similar to that found in multiple regression. It differs from multiple regression, however, in that it directly predicts the probability of an event occurring.


Since this probability must be between 0 and 1, our predicted value must be bounded to fall within that range. To accomplish this, logistic regression uses an assumed relationship between the independent and dependent variables that resembles an S-shaped curve.

We can see from the logistic regression model represented above that at very low levels of the independent variable the probability approaches zero, and as the independent variable increases, the probability increases up the curve, tending towards, but never exceeding, one. Ordinary regression models cannot accommodate a relationship like this, as it is inherently nonlinear. Moreover, such situations cannot be studied with ordinary regression, because doing so would violate several assumptions, including normality of the error term and constant variance. Logistic regression was developed specifically to deal with these issues. Its unique relationship between the dependent and independent variables requires a somewhat different approach to estimation, assessing goodness of fit and interpreting the coefficients.

Estimating the Logistic Regression Model

The nonlinear nature of the logistic transformation requires that a maximum likelihood procedure be used in an iterative manner to find the most likely estimates of the coefficients. This results in the use of the likelihood value, instead of the sum of squares, when calculating measures of overall fit. To estimate a logistic regression model, the logistic curve is fitted to the actual data. Below are two examples of fitting a logistic relationship. In the first case the logistic regression cannot fit the data well because there is considerable overlap between the two groups in terms of the explanatory variable; in the second case there is a much more well-defined relationship.
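The iterative maximum likelihood idea can be sketched for a one-predictor model. This uses plain gradient ascent on the log-likelihood for simplicity, whereas real packages use Newton-type iterations; the data are hypothetical, with some overlap between the groups (as in the first case described above — with complete separation the estimates diverge):

```python
import math

def fit_logistic(xs, ys, steps=3000, lr=0.01):
    """Maximum-likelihood estimates for P(y = 1) = 1 / (1 + exp(-(b0 + b1*x))),
    found by iterating until the coefficients stop changing."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # d(log-likelihood)/d(b0)
            g1 += (y - p) * x    # d(log-likelihood)/d(b1)
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Hypothetical overlapping groups: zeros at low x, ones at high x.
xs = [1, 2, 3, 4, 5, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

The fitted curve climbs from near zero at low x to near one at high x, with the steepness governed by how well separated the groups are.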


This simple example can be extended to include multiple independent variables, just as in regression. Interpreting The Coefficients One of the advantages of logistic regression is that we need to know only whether an event occurred (yes or no, credit risk or not) to use a binary variable as our dependent variable. From this binary variable, the procedure predicts the probability that the event will or will not occur. A predicted probability of more than .5 results in a prediction of yes, otherwise no. Logistic regression derives its name from the logistic transformation used with the dependent variable. When this transformation is used, however, the logistic regression and its coefficients take on a somewhat different meaning from those found in ordinary regression.


The procedure that calculates the logistic coefficients compares the probability of an event occurring with the probability of it not occurring. This odds ratio can be expressed as:

Prob(event) / Prob(no event) = exp{β0 + β1X1 + ... + βnXn}

The estimated coefficients (β0, β1, ..., βn) are actually measures of the changes in the ratio of the probabilities, termed the odds ratio. This transformation does not change the way we interpret the sign of a coefficient: a positive coefficient increases the predicted probability, whereas a negative coefficient decreases it. The probability of an event occurring is then:

Prob(event) = exp{β0 + β1X1 + ... + βnXn} / (1 + exp{β0 + β1X1 + ... + βnXn})
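The two formulas can be transcribed directly; `coefs` holds hypothetical coefficients, with β0 first:

```python
import math

def event_odds(coefs, xs):
    """Prob(event) / Prob(no event) = exp(b0 + b1*x1 + ... + bn*xn)."""
    b0, rest = coefs[0], coefs[1:]
    return math.exp(b0 + sum(b * x for b, x in zip(rest, xs)))

def event_probability(coefs, xs):
    """Prob(event) = odds / (1 + odds), always between 0 and 1."""
    odds = event_odds(coefs, xs)
    return odds / (1.0 + odds)

# With all coefficients zero the odds are exp(0) = 1, so the probability is 0.5.
p = event_probability([0.0], [])  # -> 0.5
```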

Assessing The Goodness Of Fit

Logistic regression is similar to multiple regression in many of its results, but it differs in the method of estimating coefficients: instead of minimizing the squared deviations, logistic regression maximizes the likelihood that the observed events occur. Using this alternative estimation technique also requires that we assess model fit in different ways. The overall measure of how well the model fits is given by the likelihood value. We often base our decision on -2 x log(likelihood value), referred to as -2LL. A well-fitting model will have a small value for -2LL, with the
minimum value of -2LL = 0 corresponding to a likelihood of 1 and a perfect fit. The researcher can also construct a pseudo-R² value for logistic regression, similar to the R² value in regression analysis:

R²logit = [-2LLnull - (-2LLmodel)] / -2LLnull

where -2LLnull is calculated from the logistic regression model with all non-intercept parameters set to zero. This value is also provided in the SAS output.

Testing Significance of the Coefficients

Logistic regression can also test the hypothesis that a coefficient differs from zero. In multiple regression the t value is used to assess the significance of each coefficient; in logistic regression we use the Wald statistic instead. It provides the statistical significance of each estimated coefficient so that hypothesis testing can occur.

Faced with a binary dependent variable, a researcher need not resort to methods designed to accommodate the limitations of multiple regression, nor be forced to employ discriminant analysis, especially if its statistical assumptions are violated. Logistic regression addresses these problems and provides a method developed to deal directly with this situation in the most efficient manner possible.

HATCO Example


The following example is identical to the two-group discriminant analysis discussed earlier, with logistic regression used this time for the estimation of the model.

Steps 1, 2 and 3: Research Objectives, Research Design and Statistical Assumptions

The issues addressed in the first three steps of the decision process are identical for the two-group discriminant analysis and logistic regression. The research problem is still to determine whether differences in perceptions of HATCO can distinguish between customers using specification buying versus total value analysis. The sample of 100 customers is divided into an analysis sample of 60 observations, with the remaining 40 observations constituting the holdout or validation sample. We can now focus on the results stemming from the use of logistic regression to estimate and understand the differences between these two types of customers.

Step 4: Estimation of the Logistic Regression Model and Assessing Overall Fit

As with discriminant analysis, where we didn't want to use all seven perception variables to discriminate between groups but merely the variables with the greatest differences in means between the two levels of X11, we have a similar objective in logistic regression. We will use the LOGISTIC procedure in SAS to fit our model, with a STEPWISE selection procedure that will narrow our choice of variables down to the most discriminating ones. The STEPWISE procedure is a model-building technique whereby variables can be entered into or removed from the model at any point based on the significance of their discriminatory power.

Since logistic regression uses an iterative procedure to estimate the coefficients, it is important to check your solution to make sure the model converges. Sometimes there is no unique solution, as several different sets of coefficients will give a solution of the same quality. This will often happen if there is complete separation between the 0s and 1s, meaning that SAS won't know how steeply the logistic curve should climb. If our model fails to converge, we may still wish to report the solution, although we should realize that it will often be unstable.

The STEPWISE procedure begins by adding X3 to the null model, followed by X7, and at both stages our model converges. At the next stage X6 is added; however, this leads to problems with convergence. It is then removed from the model, leaving us with only X3 and X7 in our logistic function. Our odds ratio is then given by:

Prob(event) / Prob(no event) = exp{-3.5904 + 1.9719X3 - 1.5973X7}

The probability of a particular event is given by:

Prob(event) = exp{-3.5904 + 1.9719X3 - 1.5973X7} / (1 + exp{-3.5904 + 1.9719X3 - 1.5973X7})

And our goodness of fit can be measured by:

R²logit = [-2LLnull - (-2LLmodel)] / -2LLnull = (81.503 - 21.322) / 81.503 = 60.181 / 81.503 = .73839

Our classification table for the analysis sample is as follows:

             Classified into
              0       1    Total
X11 = 0      23       2      25
X11 = 1       2      33      35
Total        25      35      60
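The fitted model and the R²logit computation can be checked numerically; the coefficients and -2LL values are from the text, while the observation values passed to the probability function are hypothetical:

```python
import math

def hatco_probability(x3, x7):
    """Estimated model from the text: probability of total value analysis (X11 = 1)."""
    z = -3.5904 + 1.9719 * x3 - 1.5973 * x7
    return math.exp(z) / (1.0 + math.exp(z))

def r2_logit(neg2ll_null, neg2ll_model):
    """R2_logit = [-2LL(null) - (-2LL(model))] / -2LL(null)."""
    return (neg2ll_null - neg2ll_model) / neg2ll_null

fit = r2_logit(81.503, 21.322)  # -> 0.73839

# Hypothetical firms: a probability above .5 predicts total value analysis.
p_high_flex = hatco_probability(9.0, 6.0)  # high X3, moderate X7
p_high_qual = hatco_probability(5.0, 9.0)  # moderate X3, high X7
```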


Since this classification table comes from our analysis sample, we would expect it to give an inflated impression of our hit ratio, which in this case is 56/60 = 93.3%. The classification table from our holdout sample looks like this:

             Classified into
              0       1    Total
X11 = 0      13       2      15
X11 = 1       6      19      25
Total        19      21      40


This is not as impressive, but it still yields a hit ratio of 80%. The two-variable model, including X3 and X7, demonstrates excellent model fit and statistical significance at the overall model level, as well as for the variables included in the model.

Step 5: Interpretation of the Results

The logistic regression model produces a variate very similar to that of the two-group discriminant analysis. In both cases X3 and X7 were the only variables included in the final solution, and the implications from both analyses were similar: price flexibility (X3) had a positive association and product quality (X7) a negative association with the dependent variable. Given that the dependent
variable (X11) had two groups, specification buying (X11 = 0) and total value analysis (X11 = 1), the coefficients imply that firms using total value analysis have lower perceptions of product quality while having higher perceptions of price flexibility.

Step 6: Validation of the Results

The validation of the logistic regression is accomplished here through the same method used in discriminant analysis: creation of analysis and holdout samples. Both methods yielded identical hit ratios (93.3% for the analysis sample, 80% for the holdout sample), leading to the conclusion that both approaches have strong empirical support.

