
ANALYSIS OF HETEROSCEDASTICITY

We want to estimate the regression of research and development (R&D) expenditures on profits. The following information on R&D expenditures and profits is available for 18 industry groupings:

Industry grouping                        R&D expenditures   Profits
 1. Containers and packaging                    62.5          185.1
 2. Nonbank financial industries                92.9         1569.5
 3. Service industries                         178.3          276.8
 4. Metals and mining                          258.4         2828.1
 5. Housing and construction                   494.7          225.9
 6. General manufacturing                     1083.0         3751.9
 7. Leisure and recreation industries         1620.6         2884.1
 8. Paper and forest products                  421.7         4645.7
 9. Food                                       509.2         5036.4
10. Health                                    6620.1        13869.9
11. Aerospace                                 3918.6         4487.8
12. Consumer products                         1595.3        10278.9
13. Electrical and electronics                6107.5         8787.3
14. Chemicals                                 4454.1        16438.8
15. Conglomerates                             3163.8         9761.4
16. Office equipment and computers           13210.7        19774.5
17. Fuel                                      1703.8        22626.6
18. Automotive                                9528.2        18415.4

Regression of R&D expenditures on profits

Model: GASTOS = β1 + β2 UTILIDADES + ε

1. Model estimated by ordinary least squares (OLS)
Dependent Variable: GASTOS
Method: Least Squares
Date: 11/04/03   Time: 18:15
Sample: 1 18
Included observations: 18

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                 114.3863     959.0321      0.119273    0.9065
UTILIDADES        0.363159     0.089151      4.073527    0.0009

R-squared            0.509106    Mean dependent var      3056.856
Adjusted R-squared   0.478426    S.D. dependent var      3705.973
S.E. of regression   2676.458    Akaike info criterion   18.72682
Sum squared resid    1.15E+08    Schwarz criterion       18.82575
Log likelihood      -166.5413    F-statistic             16.59362
Durbin-Watson stat   3.211849    Prob(F-statistic)       0.000884

Variance-covariance matrix of the coefficients

              C            UTILIDADES
C             919742.5     -64.39737
UTILIDADES    -64.39737     0.007948
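The OLS fit above can also be reproduced outside EViews. The following is a minimal sketch in Python using statsmodels (an assumption of this sketch, not a tool used in the original exercise), with the 18 observations typed in from the table above.

```python
import numpy as np
import statsmodels.api as sm

# R&D expenditures (GASTOS) and profits (UTILIDADES) for the 18 industry groups
gastos = np.array([62.5, 92.9, 178.3, 258.4, 494.7, 1083.0, 1620.6, 421.7, 509.2,
                   6620.1, 3918.6, 1595.3, 6107.5, 4454.1, 3163.8, 13210.7,
                   1703.8, 9528.2])
utilidades = np.array([185.1, 1569.5, 276.8, 2828.1, 225.9, 3751.9, 2884.1,
                       4645.7, 5036.4, 13869.9, 4487.8, 10278.9, 8787.3,
                       16438.8, 9761.4, 19774.5, 22626.6, 18415.4])

X = sm.add_constant(utilidades)     # intercept plus the single regressor
ols_fit = sm.OLS(gastos, X).fit()
print(ols_fit.summary())            # compare with the EViews output above
```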

OLS residuals

Obs    Actual      Fitted      Residual
  1     62.5000     181.607     -119.107
  2     92.9000     684.365     -591.465
  3    178.300      214.909      -36.6088
  4    258.400     1141.44      -883.038
  5    494.700      196.424      298.276
  6   1083.00      1476.92      -393.924
  7   1620.60      1161.77       458.826
  8    421.700     1801.52     -1379.82
  9    509.200     1943.40     -1434.20
 10   6620.10      5151.37      1468.73
 11   3918.60      1744.17      2174.43
 12   1595.30      3847.05     -2251.75
 13   6107.50      3305.58      2801.92
 14   4454.10      6084.29     -1630.19
 15   3163.80      3659.33      -495.531
 16  13210.7       7295.68      5915.02
 17   1703.80      8331.45     -6627.65
 18   9528.20      6802.11      2726.09

White heteroscedasticity test

White Heteroskedasticity Test:
F-statistic       23.89868    Probability    0.000022
Obs*R-squared     13.70046    Probability    0.001059

Test Equation:
Dependent Variable: RESID^2
Method: Least Squares
Date: 11/04/03   Time: 18:33
Sample: 1 18
Included observations: 18

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                 3514642.     3062704.      1.147562    0.2691
UTILIDADES       -1582.847     806.5273     -1.962546    0.0685
UTILIDADES^2      0.135479     0.037052      3.656483    0.0023

R-squared            0.761136    Mean dependent var      6367491.
Adjusted R-squared   0.729288    S.D. dependent var      12388674
S.E. of regression   6445827.    Akaike info criterion   34.34678
Sum squared resid    6.23E+14    Schwarz criterion       34.49517
Log likelihood      -306.1210    F-statistic             23.89868
Durbin-Watson stat   2.038419    Prob(F-statistic)       0.000022

The White test indicates a significant relationship between the squared residuals and squared profits, which leads us to use 1/UTILIDADES as the weighting series; the model to be estimated by GLS is then:
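As a sketch, the same diagnostic can be computed with statsmodels' het_white, reusing ols_fit and X from the earlier sketch; with a single regressor the auxiliary regression contains UTILIDADES and UTILIDADES^2, as in the EViews test equation.

```python
from statsmodels.stats.diagnostic import het_white

# Regress the squared OLS residuals on the regressor and its square
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(ols_fit.resid, X)
print(f"Obs*R-squared = {lm_stat:.5f}  (p = {lm_pvalue:.6f})")
print(f"F-statistic   = {f_stat:.5f}  (p = {f_pvalue:.6f})")
```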

Transformed model

GASTOS / UTILIDADES = β1 (1 / UTILIDADES) + β2 + ε / UTILIDADES

2. Applying weighted least squares (WLS) we obtain:


Dependent Variable: GASTOS
Method: Least Squares
Date: 11/04/03   Time: 19:17
Sample: 1 18
Included observations: 18
Weighting series: 1/UTILIDADES

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                 146.7962     64.28226      2.283619    0.0364
UTILIDADES        0.335094     0.120014      2.792122    0.0131

Weighted Statistics
R-squared            0.245814    Mean dependent var      517.3364
Adjusted R-squared   0.198677    S.D. dependent var      550.2620
S.E. of regression   492.5761    Akaike info criterion   15.34161
Sum squared resid    3882099.    Schwarz criterion       15.44054
Log likelihood      -136.0745    F-statistic             5.214916
Durbin-Watson stat   1.988109    Prob(F-statistic)       0.036398

Unweighted Statistics
R-squared            0.503135    Mean dependent var      3056.856
Adjusted R-squared   0.472081    S.D. dependent var      3705.973
S.E. of regression   2692.688    Sum squared resid       1.16E+08
Durbin-Watson stat   3.137861

Covariance matrix of the coefficients

              C            UTILIDADES
C             4132.209     -3.736920
UTILIDADES    -3.736920     0.014403
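A sketch of the weighted estimation in statsmodels: EViews' weighting series 1/UTILIDADES amounts to assuming Var(ε) proportional to UTILIDADES², and statsmodels expects weights proportional to the reciprocal of the variance, so the weights are 1/UTILIDADES² (this correspondence is an assumption of the sketch; gastos, utilidades and X come from the first sketch).

```python
# Weighted least squares counterpart of the EViews run above
wls_fit = sm.WLS(gastos, X, weights=1.0 / utilidades**2).fit()
print(wls_fit.params)   # compare with C = 146.7962, UTILIDADES = 0.335094
print(wls_fit.bse)      # weighted standard errors
```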

As we can see, the variances of the WLS coefficient estimates are larger than those reported by OLS: because OLS ignores the heteroscedasticity, it under-estimates the true variance of the coefficient estimators.

White heteroscedasticity test for the residuals obtained from the WLS estimation
White Heteroskedasticity Test:
F-statistic       3.454817    Probability    0.058334
Obs*R-squared     5.676655    Probability    0.058523

Test Equation:
Dependent Variable: STD_RESID^2
Method: Least Squares
Date: 11/04/03   Time: 19:30
Sample: 1 18
Included observations: 18

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                 623410.5     178884.0      3.484999    0.0033
UTILIDADES       -108.9450     47.10700     -2.312713    0.0353
UTILIDADES^2      0.004105     0.002164      1.896655    0.0773

R-squared            0.315370    Mean dependent var      215672.2
Adjusted R-squared   0.224086    S.D. dependent var      427403.5
S.E. of regression   376482.7    Akaike info criterion   28.66614
Sum squared resid    2.13E+12    Schwarz criterion       28.81454
Log likelihood      -254.9953    F-statistic             3.454817
Durbin-Watson stat   2.174881    Prob(F-statistic)       0.058334

We can see that the heterogeneity of the variance of the disturbances has decreased, to the point where it no longer reaches statistical significance.

3. Estimation with White's correction
Dependent Variable: GASTOS
Method: Least Squares
Date: 11/04/03   Time: 19:38
Sample: 1 18
Included observations: 18
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                 114.3863     713.9924      0.160207    0.8747
UTILIDADES        0.363159     0.145391      2.497807    0.0238

R-squared            0.509106    Mean dependent var      3056.856
Adjusted R-squared   0.478426    S.D. dependent var      3705.973
S.E. of regression   2676.458    Akaike info criterion   18.72682
Sum squared resid    1.15E+08    Schwarz criterion       18.82575
Log likelihood      -166.5413    F-statistic             16.59362
Durbin-Watson stat   3.211849    Prob(F-statistic)       0.000884

As the results show, the coefficient estimates, the R2, and the standard error of the regression coincide with those of OLS; the difference lies in the standard errors and in the t-tests of significance, since the estimation of the variances and covariances of the coefficients now takes into account that the disturbance variances are heterogeneous.

White covariance matrix of the coefficients
              C            UTILIDADES
C             509785.2     -92.53723
UTILIDADES    -92.53723     0.021139

This estimation does not produce more efficient estimators; rather, it provides a better estimate of the true variance of the estimators. The disturbances still have heterogeneous variances.
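A sketch of the same correction in statsmodels, reusing gastos and X from the first sketch; EViews' White correction applies a small-sample adjustment, which should correspond to the 'HC1' covariance type (that mapping is an assumption of the sketch, not something stated in the exercise).

```python
# OLS point estimates with heteroscedasticity-consistent standard errors
robust_fit = sm.OLS(gastos, X).fit(cov_type="HC1")
print(robust_fit.params)   # identical to the plain OLS coefficients
print(robust_fit.bse)      # compare with the robust std. errors above
```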


x28.txt
#
# Reference:
#
# Richard Gunst, Robert Mason,
# Regression Analysis and Its Applications: A Data-Oriented Approach,
# Dekker, 1980, pages 370-371,
# ISBN: 0824769937.

# Gary McDonald, Richard Schwing, # Instabilities of regression estimates relating air pollution to mortality, # Technometrics, # Volume 15, Number 3, pages 463-482, 1973. # # Helmut Spaeth, # Mathematical Algorithms for Linear Regression, # Academic Press, 1991, # ISBN 0-12-656460-4. # # Discussion: # # The death rate is to be represented as a function of other variables. # # There are 60 rows of data. The data includes: # # I, the index; # A1, the average annual precipitation; # A2, the average January temperature; # A3, the average July temperature; # A4, the size of the population older than 65; # A5, the number of members per household; # A6, the number of years of schooling for persons over 22; # A7, the number of households with fully equipped kitchens; # A8, the population per square mile; # A9, the size of the nonwhite population; # A10, the number of office workers; # A11, the number of families with an income less than $3000; # A12, the hydrocarbon pollution index; # A13, the nitric oxide pollution index; # A14, the sulfur dioxide pollution index; # A15, the degree of atmospheric moisture. # B, the death rate. # # We seek a model of the form: # # B = A1 * X1 + A2 * X2 + A3 * X3 + A4 * X4 + A5 * X5 # + A6 * X6 + A7 * X7 + A8 * X8 + A9 * X9 + A10 * X10 # + A11 * X11 + A12 * X12 + A13 * X13 + A14 * X14 + A15 * X15 # # Modified: # # 08 November 2010 # 17 columns 60 rows Index A1 average annual precipitation in inches A2 average January temperature in degrees Fahrenheit A3 average July temperature in degrees Fahrenheit A4 percent of 1960 SMSA population 65 years old or older A5 household size, 1960 A6 schooling for persons over 22 A7 household with full kitchens A8 population per square mile in urbanized areas A9 percent nonwhite population A10 percent office workers A11 poor families (annual income under $3000) A12 relative pollution potential of hydrocarbons A13 relative pollution potential of oxides of Nitrogen A14 relative pollution of Sulfur Dioxide A15 percent relative humidity, annual average at 1pm. B death rate 1 36 27 71 8.1 3.34 11.4 81.5 3243 8.8 42.6 11.7 21 15 59 59 2 35 23 72 11.1 3.14 11.0 78.8 4281 3.6 50.7 14.4 8 10 39 57 3 44 29 74 10.4 3.21 9.8 81.6 4260 0.8 39.4 12.4 6 6 33 54 4 47 45 79 6.5 3.41 11.1 77.5 3125 27.1 50.2 20.6 18 8 24 56 5 43 35 77 7.6 3.44 9.6 84.6 6441 24.4 43.7 14.3 43 38 206 55 6 53 45 80 7.7 3.45 10.2 66.8 3325 38.5 43.1 25.5 30 32 72 54 7 43 30 74 10.9 3.23 12.1 83.9 4679 3.5 49.2 11.3 21 32 62 56 8 45 30 73 9.3 3.29 10.6 86.0 2140 5.3 40.4 10.5 6 4 4 56 9 36 24 70 9.0 3.31 10.5 83.2 6582 8.1 42.5 12.6 18 12 37 61 10 36 27 72 9.5 3.36 10.7 79.3 4213 6.7 41.0 13.2 12 7 20 59 11 52 42 79 7.7 3.39 9.6 69.2 2302 22.2 41.3 24.2 18 8 27 56 12 33 26 76 8.6 3.20 10.9 83.4 6122 16.3 44.9 10.7 88 63 278 58 13 40 34 77 9.2 3.21 10.2 77.0 4101 13.0 45.7 15.1 26 26 146 57 14 35 28 71 8.8 3.29 11.1 86.3 3042 14.7 44.6 11.4 31 21 64 60

921.870 997.875 962.354 982.291 1071.289 1030.380 934.700 899.529 1001.902 912.347 1017.613 1024.885 970.467 985.950

15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

37 35 36 15 31 30 31 31 42 43 46 39 35 43 11 30 50 60 30 25 45 46 54 42 42 36 37 42 41 44 32 34 10 18 13 35 45 38 31 40 41 28 45 45 42 38

31 46 30 30 27 24 45 24 40 27 55 29 31 32 53 35 42 67 20 12 40 30 54 33 32 29 38 29 33 39 25 32 55 48 49 40 28 24 26 23 37 32 33 24 83 28

75 85 75 73 74 72 85 72 77 72 84 76 81 74 68 71 82 82 69 73 80 72 81 77 76 72 67 72 77 78 72 79 70 63 68 64 74 72 73 71 78 81 76 70 76 72

8.0 7.1 7.5 8.2 7.2 6.5 7.3 9.0 6.1 9.0 5.6 8.7 9.2 10.1 9.2 8.3 7.3 10.0 8.8 9.2 8.3 10.2 7.4 9.7 9.1 9.5 11.3 10.7 11.2 8.2 10.9 9.3 7.3 9.2 7.0 9.6 10.6 9.8 9.3 11.3 6.2 7.0 7.7 11.8 9.7 8.9

3.26 11.9 3.22 11.8 3.35 11.4 3.15 12.2 3.44 10.8 3.53 10.8 3.22 11.4 3.37 10.9 3.45 10.4 3.25 11.5 3.35 11.4 3.23 11.4 3.10 12.0 3.38 9.5 2.99 12.1 3.37 9.9 3.49 10.4 2.98 11.5 3.26 11.1 3.28 12.1 3.32 10.1 3.16 11.3 3.36 9.7 3.03 10.7 3.32 10.5 3.32 10.6 2.99 12.0 3.19 10.1 3.08 9.6 3.32 11.0 3.21 11.1 3.23 9.7 3.11 12.1 2.92 12.2 3.36 12.2 3.02 12.2 3.21 11.1 3.34 11.4 3.22 10.7 3.28 10.3 3.25 12.3 3.27 12.1 3.39 11.3 3.25 11.1 3.22 9.0 3.48 10.7

78.4 4259 79.9 1441 81.9 4029 84.2 4824 87.0 4834 79.5 3694 80.7 1844 82.8 3226 71.8 2269 87.1 2909 79.7 2647 78.6 4412 78.3 3262 79.2 3214 90.6 4700 77.4 4474 72.5 3497 88.6 4657 85.4 2934 83.1 2095 70.3 2682 83.2 3327 72.8 3172 83.5 7462 87.5 6092 77.6 3437 81.5 3387 79.5 3508 79.9 4843 79.9 3768 82.5 4355 76.8 5160 88.9 3033 87.7 4253 90.7 2702 82.5 3626 82.6 1883 78.0 4923 81.3 3249 73.8 1671 89.5 5308 81.0 3665 82.2 3152 79.8 3678 76.2 9699 79.8 3451

13.1 14.8 12.4 4.7 15.8 13.1 11.5 5.1 22.7 7.2 21.0 15.6 12.6 2.9 7.8 13.1 36.7 13.6 5.8 2.0 21.0 8.8 31.4 11.3 17.5 8.1 3.6 2.2 2.7 28.6 5.0 17.2 5.9 13.7 3.0 5.7 3.4 3.8 9.5 2.5 25.9 7.5 12.1 1.0 4.8 11.7

49.6 13.9 51.2 16.1 44.0 12.0 53.1 12.7 43.5 13.6 33.8 12.4 48.1 18.5 45.2 12.3 41.4 19.5 51.6 9.5 46.9 17.9 46.6 13.2 48.6 13.9 43.7 12.0 48.9 12.3 42.6 17.7 43.3 26.4 47.3 22.4 44.0 9.4 51.9 9.8 46.1 24.1 45.3 12.2 45.5 24.2 48.7 12.4 45.3 13.2 45.5 13.8 50.3 13.5 38.3 15.7 38.6 14.1 49.5 17.5 46.4 10.8 45.1 15.3 51.0 14.0 51.2 12.0 51.9 9.7 54.3 10.1 41.9 12.3 50.5 11.1 43.9 13.6 47.4 13.5 59.7 10.3 51.6 13.2 47.3 10.9 44.8 14.0 42.2 14.5 37.5 13.0

23 1 6 17 52 11 1 5 8 7 6 13 7 11 648 38 15 3 33 20 17 4 20 41 29 45 56 6 11 12 7 31 144 311 105 20 5 8 11 5 65 4 14 7 8 14

9 15 58 958.839 1 1 54 860.101 4 16 58 936.234 8 28 38 871.766 35 124 59 959.221 4 11 61 941.181 1 1 53 891.708 3 10 61 871.338 3 5 53 971.122 3 10 56 887.466 5 1 59 952.529 7 33 60 968.665 4 4 55 919.729 7 32 54 844.053 319 130 47 861.833 37 193 57 989.265 10 34 59 1006.490 1 1 60 861.439 23 125 64 929.150 11 26 50 857.622 14 78 56 961.009 3 8 58 923.234 17 1 62 1113.156 26 108 58 994.648 32 161 54 1015.023 59 263 56 991.290 21 44 73 893.991 4 18 56 938.500 11 89 54 946.185 9 48 53 1025.502 4 18 60 874.281 15 68 57 953.560 66 20 61 839.709 171 86 71 911.701 32 3 71 790.733 7 20 72 899.264 4 20 56 904.155 5 25 61 950.672 7 25 59 972.464 2 11 60 912.202 28 102 52 967.803 2 1 54 823.764 11 42 56 1003.502 3 8 56 895.696 8 49 54 911.817 13 39 58 954.442


x17.txt
#
# Reference:
#
# Helmut Spaeth,
# Mathematical Algorithms for Linear Regression,
# Academic Press, 1991,
# ISBN 0-12-656460-4.
#
# F S Wood,
# The Use of Individual Effects and Residuals in Fitting Equations to Data,
# Technometrics, Volume 15, 1973, pages 677-695.
#
# Discussion:
#
# In the investigation of the manufacturing process in a refinery, the octane
# rating of a particular petrol was measured as a function of the 3 raw
# materials, and a variable that characterized the manufacturing conditions.
#
# There are 82 rows of data. The data include:
#
# I, the index;

# A1, amount of material 1; # A2, amount of material 2; # A3, amount of material 3; # A4, manufacturing condition rating; # B, the octane rating; # # We seek a model of the form: # # B = A1 * X1 + A2 * X2 + A3 * X3 + A4 * X4. # 6 columns 82 rows Index Material 1 amount Material 2 amount Material 3 amount Condition Octane number 1 55.33 1.72 54 1.66219 92.19 2 59.13 1.20 53 1.58399 92.74 3 57.39 1.42 55 1.61731 91.88 4 56.43 1.78 55 1.66228 92.80 5 55.98 1.58 54 1.63195 92.56 6 56.16 2.12 56 1.68034 92.61 7 54.85 1.17 54 1.58206 92.33 8 52.83 1.50 58 1.54998 92.22 9 54.52 0.87 57 1.56230 91.56 10 54.12 0.88 57 1.57818 92.17 11 51.72 0.00 56 1.60401 92.75 12 51.29 0.00 58 1.59594 92.89 13 53.22 1.31 58 1.54814 92.79 14 54.76 1.67 58 1.63134 92.55 15 53.34 1.81 59 1.60228 92.42 16 54.84 2.87 60 1.54949 92.43 17 54.03 1.19 60 1.57841 92.77 18 51.44 0.42 59 1.61183 92.60 19 53.54 1.39 59 1.51081 92.30 20 57.88 1.28 62 1.56443 92.30 21 60.93 1.22 62 1.53995 92.48 22 59.59 1.13 61 1.56949 91.61 23 61.42 1.49 62 1.41330 91.30 24 56.60 2.10 62 1.54777 91.37 25 59.94 2.29 61 1.65523 91.25 26 58.30 3.11 62 1.29994 90.76 27 58.25 3.10 63 1.19975 90.90 28 55.53 2.88 64 1.20817 90.43 29 59.79 1.48 62 1.30621 90.83 30 57.51 0.87 60 1.29842 92.18 31 62.82 0.88 59 1.40483 91.73 32 62.57 0.42 60 1.45056 91.10 33 60.23 0.12 59 1.54357 91.74 34 65.08 0.10 60 1.68940 91.46 35 65.58 0.05 59 1.74695 91.44 36 65.64 0.05 60 1.74919 91.56 37 65.28 0.42 60 1.78053 91.90 38 65.03 0.65 59 1.78104 91.61 39 67.84 0.49 54 1.72387 92.09 40 73.74 0.00 54 1.73496 90.64 41 72.66 0.00 55 1.71966 91.09 42 71.31 3.44 55 1.60325 90.51 43 72.30 4.02 55 1.66783 90.24

44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81

68.81 66.61 63.66 63.85 67.25 67.19 62.34 62.98 69.89 73.13 65.09 64.71 64.05 63.97 70.48 71.11 69.05 71.99 72.03 69.90 72.16 70.97 70.55 69.73 69.93 70.60 75.54 49.14 49.10 44.66 44.64 4.23 5.53 17.11 67.60 64.81 63.13 63.48

6.88 2.31 2.99 0.24 0.00 0.00 0.00 0.00 0.00 0.00 1.01 0.61 1.64 2.80 4.64 3.56 2.51 1.28 1.28 2.19 0.51 0.09 0.05 0.06 0.05 0.00 0.00 0.00 0.00 4.99 3.73 10.76 7.99 5.06 1.84 2.24 1.60 3.46

55 52 52 50 53 52 48 47 55 57 57 55 57 60 60 60 60 55 56 56 56 55 52 54 55 55 55 40 42 42 44 41 40 47 55 54 52 52

1.69836 1.77967 1.81271 1.81485 1.72526 1.86782 2.00677 1.95366 1.89387 1.81651 1.45939 1.38934 1.33945 1.42094 1.57680 1.41229 1.54605 1.55182 1.60390 1.67265 1.55242 1.45728 1.26174 1.28802 1.36399 1.42210 1.67219 2.17140 2.31909 2.14314 2.08081 2.17070 1.99418 1.61437 1.64758 1.69592 1.66118 1.48216

91.01 91.90 91.92 92.16 91.36 92.16 92.68 92.88 92.59 91.35 90.29 90.71 90.41 90.43 89.87 89.98 90.00 89.66 90.08 90.67 90.59 91.06 90.69 91.11 90.32 90.36 90.57 94.17 94.39 93.42 94.65 97.61 97.08 95.12 91.86 91.61 92.17 91.56

82 62.25 3.56 50 1.49734 92.16

x20.txt
#
# Reference:
#
# Helmut Spaeth,
# Mathematical Algorithms for Linear Regression,
# Academic Press, 1991,
# ISBN 0-12-656460-4.
#
# K Brownlee,
# Statistical Theory and Methodology,
# Wiley, 1965, pages 464-465.
#
# Discussion:
#
# In various states, population and drinking data was recorded.
#
# There are 46 rows of data. The data includes:
#
# I, the index;
# A0, 1;
# A1, the size of the urban population,

# A2, the number of births to women between 45 to 49 # (actually, the reciprocal of that value, times 100) # A3, the consumption of wine per capita, # A4, the consumption of hard liquor per capita, # B, the death rate from cirrhosis. # # We seek a model of the form: # # B = A0 * X0 + A1 * X1 + A2 * X2 + A3 * X3 + A4 * X4. # 7 columns 46 rows Index One Urban population (percentage) Late births (reciprocal * 100) Wine consumption per capita Liquor consumption per capita Cirrhosis death rate 1 1 44 33.2 5 30 41.2 2 1 43 33.8 4 41 31.7 3 1 48 40.6 3 38 39.4 4 1 52 39.2 7 48 57.5 5 1 71 45.5 11 53 74.8 6 1 44 37.5 9 65 59.8 7 1 57 44.2 6 73 54.3 8 1 34 31.9 3 32 47.9 9 1 70 45.6 12 56 77.2 10 1 54 45.9 7 57 56.6 11 1 70 43.7 14 43 80.9 12 1 65 32.1 12 33 34.3 13 1 36 36.9 10 48 53.1 14 1 47 38.9 10 69 55.4 15 1 63 47.6 14 54 57.8 16 1 35 33.0 9 47 62.8 17 1 50 38.9 7 68 67.3 18 1 55 35.7 18 47 56.7 19 1 33 31.2 6 27 37.6 20 1 81 53.8 31 79 129.9 21 1 63 42.5 13 59 70.3 22 1 78 53.3 20 97 104.2 23 1 63 47.0 19 95 83.6 24 1 65 44.9 10 81 66.0 25 1 45 35.6 4 26 52.3 26 1 78 50.5 16 76 86.9 27 1 60 42.3 9 37 66.6 28 1 52 43.8 6 46 40.1 29 1 37 33.2 6 40 55.7 30 1 55 36.0 21 76 58.1 31 1 69 47.6 15 70 74.3 32 1 84 50.0 17 66 98.1 33 1 54 43.8 7 63 40.7 34 1 61 45.0 13 59 66.7 35 1 47 42.2 8 55 48.0 36 1 57 53.0 28 149 122.5 37 1 87 51.6 23 77 92.1 38 1 50 31.9 22 43 76.0 39 1 85 56.1 23 74 97.5 40 1 27 31.5 7 56 33.8 41 1 84 50.0 16 63 90.5 42 1 37 32.4 2 41 29.7

43 1 33 36.1 6 59 28.0
44 1 44 35.3 3 32 51.6
45 1 63 39.3 8 40 55.7
46 1 58 43.8 13 57 55.5

Multiple Regression with Many Predictor Variables

The purpose of multiple regression is to predict a single variable from one or more independent variables. Multiple regression with many predictor variables is an extension of linear regression with two predictor variables. A linear transformation of the X variables is done so that the sum of squared deviations of the observed and predicted Y is a minimum. The computations are more complex, however, because the interrelationships among all the variables must be taken into account in the weights assigned to the variables. The interpretation of the results of a multiple regression analysis is also more complex for much the same reason.

The prediction of Y is accomplished by the following equation:

Y'i = b0 + b1X1i + b2X2i + ... + bkXki

The "b" values are called regression weights and are computed in a way that minimizes the sum of squared deviations, in the same manner as in simple linear regression. In this case there are K predictor variables rather than two, and K + 1 regression weights must be estimated, one for each of the K predictor variables and one for the constant (b0) term.
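As an illustration of the equation above, the following minimal Python sketch (using statsmodels with simulated data; none of these numbers come from the chapter) estimates the K + 1 regression weights for K = 3 predictors.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                          # X1, X2, X3 for 20 cases
y = 5 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=20)

model = sm.OLS(y, sm.add_constant(X)).fit()           # least squares solution
print(model.params)                                   # b0, b1, b2, b3
print(model.rsquared, model.rsquared_adj)
```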

EXAMPLE DATA
The data used to illustrate the inner workings of multiple regression will be generated from the "Example Student." The data are presented below:

Life Satisfaction Simulated Data


Example Student
PSY645 Dr. Stockburger Due Date Finish LifeSat Income 1 1 1 1 0 22 20 42 48 33 33 21 26 37 26 15 88 73 14 38 45 16 64 19

Subject Age Gender Married IncomeC HealthC ChildC LifeSatC SES Smoke Spirit 1 2 3 4 5 6 7 8 9 10 16 28 16 23 18 30 19 19 34 16 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 0 0 0 0 38 38 52 51 52 43 55 52 60 53 0 0 1 0 0 2 0 2 17 16 39 22 25 53 28 17 20 21 17 21 40 31 38 36 41 52 56 27 1 30 39 30 60 32 39 51 35 23 29

0
16 6 7 25 19

1
0 0 0 1 0 0 0 0

0
1 1 1 0

0
29 0

2
0

11 12 13 14 15 16 17 18 19 20

25 16 16 16 16 32 19 17 24 26

1 1 0 1 1 0 1 0 1

0 1 0 1 0 1 0 1 0 1

3 1 0 18 0 26 0 10 17

39 42 43 54 52 54 46 55 52 57

0 0 0 1 0 1 0 2 0 1

18 31 15 34 20 39 17 48 16 39

34 29 28 38 38 37 25 53 36 41

1 1 1 0 0 0 0 0 0 0

61 58 39 40 27 30 36 43 54 32

1 1 1 0 1 1 0 1 1

40 35 32 37 35 47 26 42 38 42

56 70 71 44 25 38 39 6 75 67

Age
Gender (0=Male, 1=Female)
Married (0=No, 1=Yes)
IncomeC - Income in College (in thousands)
HealthC - Score on Health Inventory in College
ChildC - Number of Children while in College
LifeSatC - Score on Life Satisfaction Inventory in College
SES - Socio-Economic Status of Parents
Smoke - Smoker (0=No, 1=Yes)
SpiritC - Score on Spirituality Inventory in College
Finish - Finished the program in college (0=No, 1=Yes)
LifeSat - Score on Life Satisfaction Inventory seven years after College
Income - Income seven years after College (in thousands)

The major interest of this study is the prediction of life satisfaction seven years after college from the variables that can be measured while the student is in college. These data are available both as a text file and as an SPSS/WIN save file. After doing a univariate analysis to check for outliers, the first step in the analysis of data such as this is to explore the boundaries of the relationships. The minimum boundary will be the bivariate correlations of all possible predictor variables with the dependent measures, LifeSat and Income. The maximum boundary will be a linear regression model with all possible predictor variables in the regression model.
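A sketch of both boundaries in Python, assuming the simulated data have been loaded into a pandas DataFrame df from a hypothetical file lifesat.csv with the column names listed above.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("lifesat.csv")                      # hypothetical file name

predictors = ["Age", "Gender", "Married", "IncomeC", "HealthC", "ChildC",
              "LifeSatC", "SES", "Smoke", "SpiritC", "Finish"]

# Minimum boundary: bivariate correlations with one dependent measure
print(df[predictors + ["LifeSat"]].corr()["LifeSat"])

# Maximum boundary: the full model with every predictor entered at once
full = sm.OLS(df["LifeSat"], sm.add_constant(df[predictors]),
              missing="drop").fit()
print(full.rsquared, full.rsquared_adj)
```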

The Correlation Matrix


The correlation matrix is given below for all possible predictor variables and the two dependent measures, LifeSat and Income.

The best and only significant (α = .05) predictor of life satisfaction seven years after college was life satisfaction in college, with a correlation coefficient of .494. Other relatively high correlation coefficients included: Married (.454), Health in College (.386), Gender (.350, with females showing a generally higher level of life satisfaction), and Smoking (-.349, with non-smokers showing a generally higher level of life satisfaction). Income seven years after college was best predicted by knowing whether the student finished the college program or not (.499). Other variables that predicted income included the measure of spirituality (.340) and income in college (.282). The matrix of intercorrelations of all predictor variables is presented below.

The Full Model


The other boundary in multiple regression is called the full model, or model with all possible predictor variables included. To construct the full model, all predictor variables are included in the first block and the "Method" remains on the default value of "Enter." The three tables of output for life satisfaction seven years after college are presented below.

Note that the unadjusted multiple R for these data is .976, but that the adjusted multiple R is .779. This rather large change is due to the fact that a relatively small number of observations are being predicted with a relatively large number of variables. The unadjusted value means that all subsets of predictor variables will have a multiple R smaller than .976. Note also that these variables in combination do not significantly (Sig. F Change = .094) predict life satisfaction seven years after college.
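The adjustment being described is the usual shrinkage of R2 for the number of predictors relative to the sample size; the small function below sketches it and, using the figures quoted above (multiple R = .976, 11 predictors, 15 complete cases), returns roughly the .779 reported as the adjusted value.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Shrink R squared for the number of predictors relative to sample size."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(adjusted_r2(r2=0.976 ** 2, n_obs=15, n_predictors=11))   # about 0.78
```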

The middle table "ANOVA" doesn't provide much information in addition to the R2 change in the previous table. Note that the "Sig. F Change" in the preceding table is the same as the "Sig." value in the "ANOVA" table. This table was more useful in previous incarnation of multiple regression analysis.

The full model is not statistically significant (F = 5.493, df = 11,3, sig. = .094), even though life satisfaction in college was statistically significant (p < .05) by itself. The ANOVA table has a total of 14 degrees of freedom because five observations had missing data and were not included in the analysis. The other degree of freedom corresponds to the intercept (constant) of the regression line. The method of handling missing data is called "listwise" because all data for a particular observation are excluded if a single variable is missing.

The "Sig." column of the "Coefficients" table presents the statistical significance of each variable given that all the other variables have been entered into the model. Note that no variables are statistically significant in this table. The variable "Married" comes close (Sig. = .055), but close doesn't count in significance testing. Previously it was found that the correlation between being married and life satisfaction seven years after college was relatively high and positive (.454), meaning that individuals who were married in college were generally more satisfied with life seven years later. The regression weight for this same variable in the full model was negative (-20.542), meaning that over twenty points would be subtracted from an individual's predicted life satisfaction score seven years after college if they were married in college! Such are the nuances of multiple regression.

Partial output for the full model predicting the other dependent measure, income seven years after college, is presented below.

The results are similar to the prediction on life satisfaction, with an unadjusted multiple R of .905, giving an upper limit to the combined predictive power of all the predictor variables.

Fitting Sequential Models


After the boundaries of the regression analysis have been established, the area between the extremes may be examined to get an idea of the interplay between the predictor variables with respect to prediction. There are different schools of thought about how this should be accomplished. One school, hierarchical regression, argues that theory should drive the statistical model and that the decision of what terms enter the regression model, and when, should be determined by theoretical concerns. A second school of thought, stepwise regression, argues that the data can speak for themselves and allows the procedure to select predictor variables to enter the regression equation.

Hierarchical Regression
Hierarchical regression adds terms to the regression model in stages. At each stage, an additional term or terms are added to the model and the change in R2 is calculated. An hypothesis test is done to test whether the change in R2 is significantly different from zero.

Using the example data, suppose a researcher wishes to examine the prediction of life satisfaction seven years after college in several stages. In the first stage, he/she enters demographic variables that the individual has little or no control over: age, gender, and socio-economic status of parents. In the second block, variables are entered over which the individual has at least some control, such as smoking, having children, being married, etc. The third block consists of the two attitudinal variables, life satisfaction and spirituality. This is accomplished in SPSS/WIN by entering the independent variables in blocks. Be sure the R2 change box is selected as a "Statistics" option.

The first table is a listing of what variables were entered or removed at the different stages. The second table is a summary of the results of the different models.
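A sketch of the same blockwise entry in Python, reusing the hypothetical DataFrame df and sm import from the earlier sketch; the block memberships below are illustrative, and the R2 change at each stage is simply the difference in R2 between successive models.

```python
blocks = [
    ["Age", "Gender", "SES"],                    # block 1: demographic variables
    ["Smoke", "ChildC", "Married", "Finish"],    # block 2: variables under some control
    ["LifeSatC", "SpiritC"],                     # block 3: attitudinal variables
]

# For strictly comparable R2 values, listwise deletion over all variables
# should be applied before fitting the nested models.
entered, previous_r2 = [], 0.0
for block in blocks:
    entered += block
    fit = sm.OLS(df["LifeSat"], sm.add_constant(df[entered]),
                 missing="drop").fit()
    print(f"after {block}: R2 = {fit.rsquared:.3f}, "
          f"change = {fit.rsquared - previous_r2:.3f}")
    previous_r2 = fit.rsquared
```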

The largest change in R2 was from model 1 to model 2, with an R2 change of .708, from .102 to .810. This change was not statistically significant, however, and neither were the R2 changes associated with the other two models. The final model has the same multiple R as the full model presented in an earlier section. The third table presents the ANOVA significance table for the three models. The fourth table contains the regression weights and significance levels for each model. As before, the "Sig." column is an hypothesis test of the significance of that variable, given that all the other variables at that stage have been entered into the model.

Note how the values of the regression weights and significance levels change as a function of when they have been entered into the model and what other variables are present. The fifth table presents information about variables not in the regression equation at any particular stage, called excluded variables.

The value of "Beta In" is the size of the standardized regression weight if that variable had been entered into the model by itself in the next stage. The "Sig." column is the R2 change significance level that the variable would enter the regression equation. In this case, it can be seen that individually both INCOMEC and SPIRITC would significantly enter the regression model in the second stage. The "Partial Correlation" is the correlation between that variable and the residual of the previous model. The higher the partial correlation, the greater the change in R2 if that variable were entered into the equation by itself at the next stage. As described in the help files of SPSS/WIN, the "Collinearity Statistics Tolerance" is "calculated as 1 minus R squared for an independent variable when it is predicted by the other independent variables already included in the analysis." This statistic may be interpreted as "A variable with very low tolerance contributes little information to a model, and can cause computational problems." (SPSS/WIN help files.) In this case LIFESATC has a low Collinearity Statistics Tolerance (7.835E-02 or .07835) in model 2 and might cause problems if entered into the model at that point. Problems in collinearity were discussed in the earlier chapter on multiple regression with two variables.

Step-up Regression
At any stage, rather than entering all the variables as a block, step-up regression enters the variables one at a time, the order of entry determined by the variable that causes the greatest R2 increase, given the variables already entered into the model. To do a step-up regression using SPSS/WIN, enter all the variables in the first block and select "Method" as "Forward." The results of the step-up regression can be better understood if the correlation coefficients are recomputed between life satisfaction seven years after college and all the predictor variables, using the "Listwise" option for missing data.

Note that the correlation coefficients have changed from the original table and that the highest correlation is with SPIRITC, with a value of .587. The SPIRITC variable, then, would enter the step-up regression in the first step. The partial correlations of all the remaining variables with the residual of the first-stage model would then be computed. The variable with the largest partial correlation would be entered into the regression at the next step, given that it was statistically significant. The criteria for entering variables into the regression model may be optionally adjusted.
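SPSS/WIN performs this selection internally; the loop below is a hand-rolled sketch of the same forward logic with the hypothetical df and predictors used earlier, adding at each step the candidate that gives the largest R2 and enters significantly.

```python
candidates = set(predictors)
selected, alpha = [], 0.05

while candidates:
    best = None
    for var in sorted(candidates):
        fit = sm.OLS(df["LifeSat"],
                     sm.add_constant(df[selected + [var]]),
                     missing="drop").fit()
        # p-value of the newly added variable, given those already selected
        if fit.pvalues[var] < alpha and (best is None or fit.rsquared > best[1]):
            best = (var, fit.rsquared)
    if best is None:
        break                          # no remaining variable enters significantly
    selected.append(best[0])
    candidates.remove(best[0])
    print(selected, round(best[1], 3))
```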

The "Model Summary" table shows that two variables, SPIRITC and FINISH, are entered into the prediction model with a multiple R of .743. The SPIRITC variable was entered first (it had the largest correlation with life satisfaction) and FINISH was entered next.

The "Coefficients" table is presented next.

The final table presents information about variables not in the regression equation.

At the conclusion of the first model, both FINISH and HEALTHC would significantly (p < .05) enter the regression equation at the next step. Since FINISH had the larger partial correlation in absolute value (-.653, versus .544 for HEALTHC), it was entered into the equation at the next step. When FINISH was entered into the equation in model 2, HEALTHC would no longer significantly enter the regression model.

Step-down Regression
By starting with a full model and eliminating variables that do not significantly enter the regression equation, a partial model may be found. This can be accomplished in SPSS/WIN by selecting a "Method" of "Backward" in the linear regression procedure. As can be seen below, the results of this analysis differ greatly from the use of the Forward Method. The "Model Summary" table is presented below.

As the table above illustrates, this method starts with the full model, with an R2 of .978. The variable HEALTHC is eliminated at the first step because it has the lowest partial correlation of any variable given that all the other predictor variables are entered into the regression equation. The next variables eliminated, in order, were SMOKE, INCOMEC, and GENDER, resulting in a model with eight predictor variables and a multiple R of .981. Note that all variables in Model 5 were significant in the following table.

As before, the table of excluded values gives information about variables not in the regression equation at any point in time.

Note that none of these variables were significant in the final model.

Caveats and Options


Stepwise procedures allow the data to drive the theory. Some statisticians (I would have to include myself among them) object to the mindless application of statistical procedures to multivariate data. There is no guarantee that the Forward and Backward procedures would agree on the same model if the options were set to different values so that the same number of variables were entered into the model. At some point a variable may no longer contribute to the regression model because of other variables in the model, even if it did contribute at an earlier point in time. For that reason SPSS/WIN provides methods of "STEPWISE" and "REMOVE" which test at each stage to see if a variable still belongs in the model. These methods could be considered a combination of Forward and Backward methods. Using them still does not guarantee that the methods will converge on a single regression model.

CROSS-VALIDATION
The manner in which regression weights are computed guarantees that they will provide an optimal fit with respect to the least squares criterion for the existing set of data. If a statistician wishes to predict a different set of data, the regression weights are no longer optimal. There will be substantial shrinkage in the value of R2 if the weights estimated on one set of data are used on a second set of data. The amount of shrinkage can be estimated using a cross-validation procedure.

In cross-validation, regression weights are estimated using one set of data and are tested on a second set of data. If the regression weights estimated on the first set of data predict the second set of data, the weights are said to be cross-validated.

Suppose an industrial/organizational psychologist wished to predict job success using four different test scores. The psychologist could collect the four test scores from a randomly selected group of job applicants. After hiring all of the selected job applicants, regardless of their scores on the tests, a measure of success on the job is taken. Success on the job is now predicted from the four test scores using a multiple regression procedure. Stepwise procedures may be used to eliminate tests that are predicting similar variance in job success. In any case, the psychologist is now ready to predict job success from the test scores for a new set of job applicants. Not so fast! Careful application of multiple regression methods requires that the regression weights be cross-validated on a different set of job applicants. Another random sample of job applicants is taken. Each applicant is given the test battery and then hired, again regardless of what scores they made on the tests. After some time on the job a measure of job success is taken. Job success is then predicted by using the regression weights found using the first set of job applicants. If the new data are successfully predicted using the old regression weights, the regression procedure is said to be cross-validated. It is expected that the accuracy of prediction will not be as good for the second set of data. This is because the regression procedure is subject to variation in data from sample to sample, called "error". The greater the error in the regression procedure, the greater the shrinkage of the value of R2.

The above procedure is an idealized method of the use of multiple regression. In many real-life applications of the procedure, random samples of job applicants are not feasible. There may be considerable pressure from administration to select on the basis of the test battery for the first sample, let alone the second sample needed for cross-validation. In either case the multiple regression procedure is compromised. In most cases, application of regression procedures to a selected rather than a random sample will result in poorer predictions. All this must be kept in mind when evaluating research on prediction models.
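The following is a schematic sketch of that idea in Python (splitting the hypothetical df in half rather than collecting a genuinely new sample): weights estimated on the first half are applied to the second half, and the drop from the estimation-sample R2 to the validation-sample R2 is the shrinkage.

```python
import numpy as np

data = df[predictors + ["LifeSat"]].dropna()
half = len(data) // 2
train, test = data.iloc[:half], data.iloc[half:]

fit = sm.OLS(train["LifeSat"], sm.add_constant(train[predictors])).fit()
predicted = fit.predict(sm.add_constant(test[predictors]))

r2_train = fit.rsquared
r2_test = np.corrcoef(test["LifeSat"], predicted)[0, 1] ** 2
print(f"R2 in estimation sample {r2_train:.3f}, in validation sample {r2_test:.3f}")
```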

Conclusion
Multiple regression provides a powerful method to analyze multivariate data. Considerable caution, however, must be observed when interpreting the results of a multiple regression analysis. Personal recommendations include a theory that drives the selection of variables and cross-validation of the results of the analysis.

Multiple Regression with Categorical Variables

When a researcher wishes to include a categorical variable with more than two levels in a multiple regression prediction model, additional steps are needed to ensure that the results are interpretable. These steps include recoding the categorical variable into a number of separate, dichotomous variables. This recoding is called "dummy coding." In order for the rest of the chapter to make sense, some specific topics related to multiple regression will be reviewed at this time.

The Multiple Regression Model


Multiple regression is a linear transformation of the X variables such that the sum of squared deviations of the observed and predicted Y is minimized. The prediction of Y is accomplished by the following equation:

Y'i = b0 + b1X1i + b2X2i + ... + bkXki

The "b" values are called regression weights and are computed in a way that minimizes the sum of squared deviations.

Dichotomous Predictor Variables


Categorical variables with two levels may be directly entered as predictor or predicted variables in a multiple regression model. Their use in multiple regression is a straightforward extension of their use in simple linear regression. When entered as predictor variables, interpretation of regression weights depends upon how the variable is coded. If the dichotomous variable is coded as 0 and 1, the regression weight is added to the predicted value of Y for the group coded as 1, raising or lowering the prediction depending upon whether the weight is positive or negative. If the dichotomous variable is coded as -1 and 1, then if the regression weight is positive, it is subtracted from the predicted value for the group coded as -1 and added for the group coded as 1. If the regression weight is negative, then addition and subtraction are reversed. Dichotomous variables can be included in hypothesis tests for R2 change like any other variable.

Testing for Blocks of Variables


A block of variables can be entered simultaneously into an hierarchical regression analysis and tested as to whether, as a whole, they significantly increase R2, given the variables already entered into the regression equation. The degrees of freedom for the R2 change test correspond to the number of variables entered in the block of variables.

Correlated and Uncorrelated Predictor Variables


Adding variables to a linear regression model will always increase the unadjusted R2 value. If the additional predictor variables are correlated with the predictor variables already in the model, then the combined results are difficult to predict. In some cases, the combined result will provide only a slightly better prediction, while in other cases, a much better prediction than expected will be the outcome of combining two correlated variables.

If the additional predictor variables are uncorrelated (r = 0) with the predictor variables already in the model, then the result of adding additional variables to the regression model is easy to predict: the R2 change will be equal to the squared correlation between the added variable and the predicted variable. In this case it makes no difference in what order the predictor variables are entered into the prediction model. For example, if X1 and X2 were uncorrelated (r12 = 0) and the squared correlations with Y were r1y^2 = .3 and r2y^2 = .4, then R2 for X1 and X2 together would equal .3 + .4 = .7. The R2 change for X2 given that X1 was in the model would be .4. The R2 change for X2 given that no variable was in the model would also be .4. It would make no difference at what stage X2 was entered into the model; the R2 change would always be .4. Similarly, the R2 change value for X1 would always be .3. Because of this relationship, uncorrelated predictor variables will be preferred, when possible.
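A quick simulated check of this additivity (a sketch; the data below are generated, not taken from the chapter):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)            # generated independently, so r12 is near zero
y = 0.6 * x1 + 0.8 * x2 + rng.normal(size=n)

r2_x1 = sm.OLS(y, sm.add_constant(x1)).fit().rsquared
r2_x2 = sm.OLS(y, sm.add_constant(x2)).fit().rsquared
r2_both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().rsquared
print(r2_x1 + r2_x2, r2_both)      # approximately equal
```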

Example Data
The following simulated data was generated using Example Student. It is available as a text file and an SPSS/WIN sav file.

Faculty Salary Simulated Data


Faculty 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Salary 38 58 80 30 50 49 45 42 59 47 34 53 35 42 42 51 51 40 48 34 46 45 50 61 62 51 59 65 49 37 Gender 0 1 1 1 1 1 0 1 0 1 0 0 1 0 0 0 1 0 1 1 1 0 1 0 1 0 0 1 0 1 Rank 3 2 3 1 1 1 3 1 3 2 1 2 1 1 1 3 2 1 2 1 2 1 1 3 3 1 3 2 1 1 Dept 1 2 1 3 3 1 2 3 1 1 3 1 2 2 2 1 2 1 1 1 2 3 3 1 3 3 3 3 1 Years 0 8 9 0 0 1 4 0 3 0 3 0 1 2 2 7 8 3 1 7 2 6 2 3 2 8 0 5 0 9 Merit 1.47 4.38 3.65 1.64 2.54 2.06 4.76 3.05 2.73 3.14 4.42 2.36 4.29 3.81 3.84 3.15 5.07 2.73 3.56 3.54 2.71 5.18 2.66 3.7 3.75 3.96 2.88 3.37 2.84 5.12

Salary
Gender (0=Male, 1=Female)
Rank (1=Assistant, 2=Associate, 3=Full)
Dept - Department (1=Family Studies, 2=Biology, 3=Business)
Years - Years since making Rank
Merit - Average Merit Ranking

It is fairly clear that Gender could be directly entered into a regression model predicting Salary, because it is dichotomous. The problem is how to deal with the two categorical predictor variables with more than two levels (Rank and Dept).

Categorical Predictor Variables


Dummy Coding - making many variables out of one
Because categorical predictor variables cannot be entered directly into a regression model and be meaningfully interpreted, some other method of dealing with information of this type must be developed. In general, a categorical variable with k levels will be transformed into k-1 variables each with two levels. For example, if a categorical variable had six levels, then five dichotomous variables could be constructed that would contain the same information as the single categorical variable. Dichotomous variables have the advantage that they can be directly entered into the regression model. The process of creating dichotomous variables from categorical variables is called dummy coding. Depending upon how the dichotomous variables are constructed, additional information can be gleaned from the analysis. In addition, careful construction will result in uncorrelated dichotomous variables. As discussed earlier, these variables have the advantage of simplicity of interpretation and are preferred to correlated predictor variables.
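Outside SPSS/WIN the same recoding can be sketched with pandas; the column names below mirror the example that follows, and the Dept values are illustrative.

```python
import pandas as pd

dept = pd.Series([1, 2, 3, 1, 2, 3], name="Dept")           # toy values
dummies = pd.get_dummies(dept, prefix="Dept").drop(columns="Dept_3")
dummies.columns = ["FamilyS", "Biology"]   # Business (Dept = 3) is the reference
print(dummies)
```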

Dummy Coding with three levels


The simplest case of dummy coding is when the categorical variable has three levels and is converted to two dichotomous variables. For example, Dept in the example data has three levels, 1=Family Studies, 2=Biology, and 3=Business. This variable could be dummy coded into two variables, one called FamilyS and one called Biology. If Dept = 1, then FamilyS would be coded with a 1 and Biology with a 0. If Dept = 2, then FamilyS would be coded with a 0 and Biology would be coded with a 1. If Dept = 3, then both FamilyS and Biology would be coded with a 0. The dummy coding is represented below.

Dummy Coded Variables

Dept                  FamilyS   Biology
Family Studies (1)       1         0
Biology (2)              0         1
Business (3)             0         0

Using SPSS/WIN to Dummy Code Variables


The dummy coding can be done using SPSS/WIN and the "Transform," "Recode," and "Into different Variable" options. The Dept variable is the "Numeric Variable" that is going to be transformed. In this case the FamilyS variable is going to be created. The window on the screen should appear as follows:

Clicking on the "Change" button" and then on the "Old and New Values" button will result in the following window:

The "Old Value" is the level of the categorical variable to be changed, the "New Value" is the value on the transformed variable. In the example window above, a value of 3 on the Dept variable will be coded as a 0 on the FamilyS variable. The "Add" button must be pressed to add the recoding to the list. When all the recodings have been added, click on the "Continue" button and then the "OK" button. The recoding of the Biology is accomplished in the same manner. A listing of the data is presented below.

The correlation matrix of the dummy variables and the Salary variable is presented below.

Two things should be observed in the correlation matrix. The first is that the correlation between FamilyS and Biology is not zero, rather it is -.474. Second is that the correlation between the Salary variable and the two dummy variables is different from zero. The correlation between FamilyS and Salary is significantly different from zero. The results of predicting Salary from FamilyS and Biology using a multiple regression procedure are presented below. The first table enters FamilyS in the first block and Biology in the second. The second table reverses the order that the variables are entered into the regression equation. The model summary tables are presented below.

In the first table above both FamilyS and Biology are significant. In the second, only FamilyS is statistically significant. Note that both orderings end up with the same value for multiple R (.604). It makes a difference what order the variables are entered into the regression equation in the hierarchical analysis. In the next tables, both FamilyS and Biology have been entered in the first block. The model summary table, ANOVA, and Coefficients tables are presented below.

The ANOVA and model summary tables contain basically redundant information in this case. The Coefficients table can be interpreted as the Biology department making 8.886 thousand dollars less in salary per year than the Business department, while the Family Studies department makes 12.350 thousand dollars less than the Business department. Note that the "Sig." levels in the "Coefficients" table are the same as the significance levels in the model summary tables presented earlier when each of the dummy coded variables is entered into the regression equation last.

Similarity of Regression analysis and ANOVA


The results of the preceding analysis can be compared to the results of using the ANOVA procedure in SPSS/WIN with Salary as the dependent measure and Dept as the independent. The following table presents the table of means and ANOVA table.

Note first that the ANOVA tables produced using the ANOVA command and the LINEAR REGRESSION command are identical. ANOVA is a special case of linear regression when the variables have been dummy coded. The second notable comparison of the tables involves the regression weights and the actual differences between the means. Note that the regression weight for FamilyS in the regression procedure is -12.350 and the difference between the means of the Family Studies department (42.25) and the Business department (54.60) is -12.350.
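The equivalence can be sketched in a few lines of Python with statsmodels formulas, here using the reduced 12-case salary data that appears later in this chapter (the use of statsmodels rather than SPSS/WIN is an assumption of the sketch).

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

salary_df = pd.DataFrame({
    "Salary": [45, 34, 42, 42, 59, 53, 30, 47, 42, 58, 50, 49],
    "Dept":   [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
})

fit = smf.ols("Salary ~ C(Dept)", data=salary_df).fit()
print(sm.stats.anova_lm(fit))   # same F test as a one-way ANOVA on Dept
print(fit.params)               # dummy weights are mean differences from Dept 1
```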

Dummy Coding into Independent Variables


Selection of an appropriate set of dummy codes will result in new variables that are uncorrelated or independent of each other. In the case when the categorical variable has three levels, this can be accomplished by creating a new variable where one level of the categorical variable is assigned the value of -2 and the other levels are assigned the value of 1. The signs are arbitrary and may be reversed; that is, values of 2 and -1 would work equally well. The second variable created as a dummy code will have the level of the categorical variable coded as -2 given the value of 0, and the other values recoded as 1 and -1. In all cases the sum of the dummy coded variable will be zero. Trust me, this is actually much easier than it sounds.

Interpretation is straightforward. Each of the new dummy coded variables, called a contrast, compares levels coded with a positive number to levels coded with a negative number. Levels coded with a zero are not included in the interpretation.

For example, Dept in the example data has three levels, 1=Family Studies, 2=Biology, and 3=Business. This variable could be dummy coded into two variables, one called Business (comparing the Business Department with the other two departments) and one called FSvsBio (for Family Studies versus Biology). The Business contrast would create a variable where all members of the Business Department would be given a value of -2 and all members of the other two departments would be given a value of 1. The FSvsBio contrast would assign a value of 0 to members of the Business Department, 1 divided by the number of members of the Family Studies Department to members of the Family Studies Department, and -1 divided by the number of members of the Biology Department to members of the Biology Department. The FSvsBio variable could be coded as 1 and -1 for Family Studies and Biology respectively, but the recoded variable would no longer be uncorrelated with the first dummy coded variable (Business). In most practical applications, it makes little difference whether the variables are correlated or not, so the simpler 1 and -1 coding is generally preferred. The contrasts are summarized in the following table.

Dummy Coded Variables

Dept                  Business   FSvsBio
Family Studies (1)        1      1/N1 = 1/12 = .0833
Biology (2)               1      -1/N2 = -1/7 = -.1429
Business (3)             -2      0

The data matrix with the dummy coded variables would appear as follows .

The correlation matrix containing the two contrasts and the Salary variable is presented below.

Note that the correlation coefficient between the two contrasts is zero. The correlation between the Business contrast and Salary is -.585 with a squared correlation coefficient of .342. This correlation coefficient has a significance level of .001. The correlation coefficient between the FSvsBio contrast and Salary is -.150 with a squared value of .023. In this case entering Business or FSvsBio first makes no difference in the results of the regression analysis.

Entering both contrasts simultaneously into the regression equation produces the following ANOVA table.

Note that this table is identical to the two ANOVA tables presented in the previous section. It may be concluded that it does not make a difference what set of contrasts are selected when only the overall test of significance is desired. It does make a difference how contrasts are selected, however, if it is desired to make a meaningful interpretation of each contrast. The coefficient table for the simultaneous entry of both contrasts is presented below.

Note that the "Sig." level is identical to the value when each contrast was entered last into the regression model. In this case the Business contrast was significant and the FSvsBio contrast was not. The interpretation of these results would be that the Business Department was paid significantly more than the Family Studies and Biology Departments, but that no significant differences in salary were found between the Family Studies and Biology Departments. By carefully selecting the set of contrasts to be used in the regression with categorical variables, it is possible to construct tests of specific hypotheses. The hypotheses to be tested are generated by the theory used when designing the study.

Categorical Predictor Variables with Six Levels


If a categorical variable had six levels, five dummy coded contrasts would be necessary to use the categorical variable in a regression analysis. For example, suppose that a researcher at a headache care center did a study with six groups of four patients each (N is being deliberately kept small). The dependent measure is subjective experience of pain. The six groups consisted of six different treatment conditions.

Group   Treatment
  1     None
  2     Placebo
  3     Psychotherapy
  4     Acupuncture
  5     Drug 1
  6     Drug 2

An independent contrast is a contrast that is not a linear combination of any other set of contrasts. Any set of independent contrasts would work equally well if the end result was the simultaneous test of the five contrasts, as in an ANOVA. One of the many possible examples is presented below.

Dummy Coded Variables

Group               C1   C2   C3   C4   C5
None (1)             0    0    0    0    0
Placebo (2)          1    0    0    0    0
Psychotherapy (3)    0    1    0    0    0
Acupuncture (4)      0    0    1    0    0
Drug 1 (5)           0    0    0    1    0
Drug 2 (6)           0    0    0    0    1

Application of this dummy coding in a regression model, entering all contrasts in a single block, would result in an ANOVA table similar to the one obtained using the Means, ANOVA, or General Linear Model programs in SPSS/WIN. This solution would not be ideal, however, because there is considerable information available by setting the contrasts to test specific hypotheses. The levels of the categorical variable generally dictate the structure of the contrasts.

In the example study, it makes sense to contrast the two control groups (1 and 2) with the other four experimental groups (3, 4, 5, and 6). Any two numbers would work, one assigned to groups 1 and 2 and the other assigned to the other four groups, but it is conventional to have the sum of the contrasts equal to zero. One contrast that meets this criterion would be (-2, -2, 1, 1, 1, 1). Generally it is easiest to set up contrasts within subgroups of the first contrast. For example, a second contrast might test whether there are differences between the two control groups. This contrast would appear as (1, -1, 0, 0, 0, 0). A third contrast might compare the non-drug vs. drug treatment groups, groups 3 and 4 vs. groups 5 and 6 (0, 0, 1, 1, -1, -1). As can be seen, this would be a contrast within the experimental treatment groups. Within the non-drug treatments, a contrast comparing Group 3 with Group 4 might be appropriate (0, 0, 1, -1, 0, 0). Within the drug treatment conditions, a contrast comparing the two drug treatments would be the last contrast (0, 0, 0, 0, 1, -1). Combined, the contrasts are given in the following table.

Dummy Coded Variables

Group               C1   C2   C3   C4   C5
None (1)            -2    1    0    0    0
Placebo (2)         -2   -1    0    0    0
Psychotherapy (3)    1    0    1    1    0
Acupuncture (4)      1    0    1   -1    0
Drug 1 (5)           1    0   -1    0    1
Drug 2 (6)           1    0   -1    0   -1

The following table presents example data and dummy coded contrasts for this hypothetical study.

The correlation matrix of the five contrasts and the pain variable is presented below.

Note that the correlation coefficients between the five contrasts are all zero. This occurs because all groups have an equal number of subjects. Using pain as the dependent variable and the five contrasts as the independent variables, the regression results tables entering all variables in block 1 are presented below.
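The zero intercorrelations can be verified directly from the contrast codes in the table above; with four patients per group the codes simply repeat, as in this short sketch.

```python
import numpy as np

# One row of codes (C1..C5) per group: None, Placebo, Psychotherapy,
# Acupuncture, Drug 1, Drug 2
codes = np.array([
    [-2,  1,  0,  0,  0],
    [-2, -1,  0,  0,  0],
    [ 1,  0,  1,  1,  0],
    [ 1,  0,  1, -1,  0],
    [ 1,  0, -1,  0,  1],
    [ 1,  0, -1,  0, -1],
])
design = np.repeat(codes, 4, axis=0)                 # 6 groups of 4 patients
print(np.corrcoef(design, rowvar=False).round(10))   # identity matrix
```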

Of major interest is the "Sig." column in the "Coefficients" table. Note that all contrasts are statistically significant except C5. This can be interpreted as follows: (1) the treatment conditions were more effective than the control conditions; (2) the two control conditions differed significantly from one another, with placebo more effective than no treatment; (3) the drug groups were more effective in reducing pain than the non-drug conditions; (4) Acupuncture was significantly more effective than Psychotherapy; and (5) the two drug treatments were not significantly different from one another. The output from the "General Linear Model, Simple Factorial" program in SPSS/WIN is presented below.

Note that it is for practical purposes identical to the ANOVA table produced using the multiple regression program with the dummy coded contrasts. In effect what the General Linear Model program does is to automatically select a set of contrasts and then perform a regression analysis with those contrasts. The General Linear Model program allows the user to specify a special set of contrasts so that an analysis like the one done with dummy coding of contrasts in multiple regression might be performed. It is left for the reader to explore SPSS/WIN for this ability.

Combinations of Categorical Predictor Variables


In the original example data set for this chapter there were three obvious categorical variables: Gender, Rank, and Dept. Gender could be entered directly into the regression model. After being dummy coded into two contrasts each, Rank and Dept could also be entered directly. Difficulties arise, however, when combinations of these categorical variables must be considered. For example, consider Gender and Dept. Rather than two groups and three groups, this combination must be treated as six groups: Male Family Studies, Female Family Studies, Male Biology, Female Biology, Male Business, and Female Business. Dummy coding these data requires five contrasts. Three already exist, one for Gender and two for Dept, but two additional contrasts remain unaccounted for. These will be the focus of the next topic, interaction effects.

EQUAL SAMPLE SIZE


Because everything works out much more cleanly when sample sizes are equal, this case will be presented first. The example data set has been reduced to twelve subjects, two for each combination of Gender and Dept. The reduced data set is presented below.
Faculty   Salary   Gender   Dept
      7       45        0      1
     11       34        0      1
     14       42        0      2
     15       42        0      2
      9       59        0      3
     12       53        0      3
      4       30        1      1
     10       47        1      1
      8       42        1      2
      2       58        1      2
      5       50        1      3
      6       49        1      3

The levels of Gender and Dept will now be combined to produce six groups.
Salary   Gender   Dept   Group
    45        0      1       1
    34        0      1       1
    42        0      2       2
    42        0      2       2
    59        0      3       3
    53        0      3       3
    30        1      1       4
    47        1      1       4
    42        1      2       5
    58        1      2       5
    50        1      3       6
    49        1      3       6
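As a brief illustration (not part of the original text), the combination of the two factors can be computed directly. The following Python/pandas sketch assumes the column names used above and derives the group number as Dept plus 3 for the second Gender level.

import pandas as pd

# Reduced data set from the table above.
df = pd.DataFrame({
    "Salary": [45, 34, 42, 42, 59, 53, 30, 47, 42, 58, 50, 49],
    "Gender": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "Dept":   [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
})

# Groups 1-3 are the three departments for Gender = 0,
# groups 4-6 the same departments for Gender = 1.
df["Group"] = df["Dept"] + 3 * df["Gender"]
print(df)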

The situation is now analogous to the earlier case in which the categorical variable had six levels.

Main Effects

A categorical variable with six levels can be dummy coded into five contrasts. The first three contrasts have already been discussed. The first of these contrasts will compare males with females and will comprise the Gender Main Effect. The next two will compare the salaries of the three departments over levels of gender and will be called the Department Main Effect. The dummy codes for these main effects are presented below.
Salary   Group   Gender Main Effect   Department Main Effect
                               (C1)         (C2)       (C3)
    45       1                    1            1          1
    34       1                    1            1          1
    42       2                    1            1         -1
    42       2                    1            1         -1
    59       3                    1           -2          0
    53       3                    1           -2          0
    30       4                   -1            1          1
    47       4                   -1            1          1
    42       5                   -1            1         -1
    58       5                   -1            1         -1
    50       6                   -1           -2          0
    49       6                   -1           -2          0

This is basically the same coding as discussed earlier, except that it is simplified because of the equal number of subjects in each cell. It will be demonstrated below that the correlation coefficients between these dummy coded variables are all zero.

Interaction Effects

Two additional dummy coded variables are needed to fully account for the six-level categorical variable. These contrasts will comprise the Interaction Effect. In this case the easiest way to find the needed contrasts is to multiply the dummy coded contrast for Gender by each of the dummy coded contrasts for Department. This has the effect of changing the sign of the department contrasts for one gender but not the other. The results of this operation appear below.
Salary   Group   Gender Main Effect   Department Main Effect   Interaction Effect
                               (C1)         (C2)       (C3)        (C4)      (C5)
    45       1                    1            1          1           1         1
    34       1                    1            1          1           1         1
    42       2                    1            1         -1           1        -1
    42       2                    1            1         -1           1        -1
    59       3                    1           -2          0          -2         0
    53       3                    1           -2          0          -2         0
    30       4                   -1            1          1          -1        -1
    47       4                   -1            1          1          -1        -1
    42       5                   -1            1         -1          -1         1
    58       5                   -1            1         -1          -1         1
    50       6                   -1           -2          0           2         0
    49       6                   -1           -2          0           2         0
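The construction of these columns can be verified with a short Python/pandas sketch (illustrative only; the contrast names C1-C5 follow the text). It rebuilds the main-effect contrasts from the group number, forms the interaction contrasts as products, and prints their correlation matrix, which is all zeros here because every cell contains two subjects.

import pandas as pd

# Same twelve observations as in the table above.
df = pd.DataFrame({
    "Salary": [45, 34, 42, 42, 59, 53, 30, 47, 42, 58, 50, 49],
    "Group":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
})

# Main-effect contrast weights keyed by group number.
c1 = {1: 1, 2: 1, 3: 1, 4: -1, 5: -1, 6: -1}    # Gender main effect
c2 = {1: 1, 2: 1, 3: -2, 4: 1, 5: 1, 6: -2}     # Department main effect
c3 = {1: 1, 2: -1, 3: 0, 4: 1, 5: -1, 6: 0}     # Department main effect

df["C1"] = df["Group"].map(c1)
df["C2"] = df["Group"].map(c2)
df["C3"] = df["Group"].map(c3)

# Interaction contrasts are the products of the Gender contrast with each
# Department contrast.
df["C4"] = df["C1"] * df["C2"]
df["C5"] = df["C1"] * df["C3"]

# With two subjects in every cell, all off-diagonal correlations are zero.
print(df[["C1", "C2", "C3", "C4", "C5"]].corr().round(3))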

The correlation matrix for this data set is presented below.

Note that the contrasts all have a correlation coefficient of zero among themselves. The contrasts will be entered into the regression equation predicting salary in three blocks. The first block will contain C1, the second will contain C2 and C3, while the third will contain C4 and C5. The results of this analysis are presented below.

Entering the contrasts in the opposite order has no effect on R Square Change.

The value for "F Change" and "Sig. F change" is different, however, because different error terms are employed in each case. In this subset of the data, none of the contrasts are significant. The interpretation of the main effects and interaction effects will be the topic of discussion of the next chapter.

UNEQUAL SAMPLE SIZE


Equal sample sizes are seldom achieved in the real world, even in the best-designed experiments. Unequal sample sizes make the effects no longer independent. This implies that it makes a difference in hypothesis testing whether an effect is entered into the model first, in the middle, or last. The same dummy coding that was applied to equal sample sizes will now be applied to the original data with unequal sample sizes. The simplest way to do this is to recode GENDER into C1, DEPARTMENT into C2 and C3, and compute C4 and C5 by multiplying the corresponding contrasts together. For example, C4 could be created by multiplying C1 * C2 and C5 by multiplying C1 * C3. The data and dummy coded contrasts appear below.
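The recoding step can be sketched as follows in Python/pandas (the unbalanced Gender and Dept codes below are hypothetical, since the full faculty data set is not reproduced in this section); the point of the final line is that with unequal cell sizes the contrast columns are no longer uncorrelated.

import pandas as pd

# Hypothetical unbalanced data: cell sizes differ across Gender x Dept cells.
df = pd.DataFrame({
    "Gender": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Dept":   [1, 1, 2, 3, 3, 1, 1, 1, 2, 2, 2, 2, 3, 3],
})

# Recode Gender into C1 and Dept into C2 and C3, then form the interaction
# contrasts as products, as described in the text.
df["C1"] = df["Gender"].map({0: 1, 1: -1})
df["C2"] = df["Dept"].map({1: 1, 2: 1, 3: -2})
df["C3"] = df["Dept"].map({1: 1, 2: -1, 3: 0})
df["C4"] = df["C1"] * df["C2"]
df["C5"] = df["C1"] * df["C3"]

# With unequal cell sizes the contrasts are correlated, so R-squared change
# for a term depends on when that term enters the model.
print(df[["C1", "C2", "C3", "C4", "C5"]].corr().round(3))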

The correlation matrix of the contrasts is presented below.

Note that the correlation coefficients between the contrasts are not zero. This has the effect of changing the value of R2 Change for a term depending upon when that term was entered into the model. This is illustrated by entering the two contrasts associated with Dept (C2 and C3) first, second, and last.

Main Effects of Dept Entered First

Main Effects of Dept Entered Second

There are two different ways in which the main effect of Dept may be entered second in the regression model. The first is after Gender and is presented below.

As can be seen, the value of R2 change for adding C2 and C3 changes only slightly, from .379 to .376. A somewhat larger change in the R2 change value is observed if the interaction contrasts (C4 and C5) are entered before the main effect of Department.

Note that the value of R2 change is greater for Gender (C1) if it is entered last rather than first.

Main Effects of Dept Entered Third

Note that the value of R2 change for Dept changes only slightly depending upon when it is entered into the model. The pattern of results of the significance tests would not change.

Main Effect of Gender Given Rank, Dept, Gender X Rank, Gender X Dept, Years, Merit

The dummy coded contrasts can be used like any other variables in a multiple regression analysis. In order to find the significance of the effect of Gender given Rank, Dept, Gender X Rank, Gender X Dept, Years, and Merit, the Rank and Gender X Rank effects must first be created as dummy coded contrasts. In the following data file the Rank main effect consists of two contrasts: C2a, contrasting Full professors with Assistant and Associate professors, and C3a, contrasting Assistant with Associate professors. The Gender X Rank interaction contrasts (C4a and C5a) are constructed by multiplying the Gender contrast (C1) by the two contrasts for the main effect of Rank.

Gender   Rank   C1   C2a   C3a   C4a   C5a
     0      1   -1     1     1    -1    -1
     0      2   -1     1    -1    -1     1
     0      3   -1    -2     0     2     0
     1      1    1     1     1     1     1
     1      2    1     1    -1     1    -1
     1      3    1    -2     0    -2     0
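The same lookup table can be generated by code; the following Python/pandas sketch (illustrative only; the rank-to-code mapping follows the table above rather than any particular labeling of ranks) maps Gender and Rank to C1, C2a, and C3a and forms C4a and C5a as products.

import pandas as pd

# The six Gender x Rank combinations from the lookup table above.
lookup = pd.DataFrame({
    "Gender": [0, 0, 0, 1, 1, 1],
    "Rank":   [1, 2, 3, 1, 2, 3],
})

lookup["C1"]  = lookup["Gender"].map({0: -1, 1: 1})      # Gender contrast
lookup["C2a"] = lookup["Rank"].map({1: 1, 2: 1, 3: -2})  # Rank main effect
lookup["C3a"] = lookup["Rank"].map({1: 1, 2: -1, 3: 0})  # Rank main effect
lookup["C4a"] = lookup["C1"] * lookup["C2a"]             # Gender X Rank
lookup["C5a"] = lookup["C1"] * lookup["C3a"]             # Gender X Rank
print(lookup)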

The additional dummy coded variables are added to the data file in the following.

Salary is predicted in six blocks (only two are really needed) in the following multiple regression analysis. In a simplified analysis, the first block would contain all variables except Gender (C1) and the second would contain only Gender (C1).

As can be seen, the R2 change for Gender has increased to .120, which is significant. The value of multiple R is not literally 1.000, but it is very close to it. For that reason the error variance is extremely small, resulting in significant effects. This illustrates the problem of fitting too few data points with too many parameters. If all the effects mentioned above are entered into the model in a single block, the coefficients table appears as follows.

As has been described earlier, the "Sig." column gives the significance level of each variable if it is entered last in the regression model. Since t² = F, note that 77.205² is equal to 5960.619, within rounding error. In this case, every variable except C4 and Years is statistically significant.

The alert reader has probably noted that other interaction terms could be created and entered into the regression model. For example, four dummy coded contrasts could be created such that a Rank X Dept interaction could be found. Multiplying this by the Gender contrast (C1) would result in a three-way Gender X Rank X Dept interaction.
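For completeness, this construction could be sketched as follows in Python/pandas (the contrast values below are purely hypothetical, and the column names merely follow the conventions used in this chapter): the Rank X Dept contrasts are the pairwise products of the Rank and Dept contrasts, and multiplying each of these by the Gender contrast yields the three-way terms.

import itertools
import pandas as pd

# Hypothetical frame already containing the main-effect contrasts discussed
# in the text: C1 (Gender), C2a/C3a (Rank), C2/C3 (Dept); values are illustrative.
df = pd.DataFrame({
    "C1":  [-1, -1, -1, 1, 1, 1],
    "C2a": [1, 1, -2, 1, 1, -2],
    "C3a": [1, -1, 0, 1, -1, 0],
    "C2":  [1, -2, 1, 1, -2, 1],
    "C3":  [1, 0, -1, 1, 0, -1],
})

# Rank X Dept interaction: the 2 x 2 = 4 pairwise products of contrasts.
for rank_c, dept_c in itertools.product(["C2a", "C3a"], ["C2", "C3"]):
    df[f"{rank_c}x{dept_c}"] = df[rank_c] * df[dept_c]
    # Gender X Rank X Dept: multiply each two-way product by the Gender contrast.
    df[f"C1x{rank_c}x{dept_c}"] = df["C1"] * df[f"{rank_c}x{dept_c}"]

print(df)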

ANOVA using General Linear Model in SPSS/WIN


Although the dummy coding of variables in multiple regression results in considerable flexibility in the analysis of categorical variables, it can also be tedious to program. For this reason most statistical packages have made a program available that automatically creates dummy coded variables and performs the appropriate statistical analysis. In most cases the user is unaware of the calculations being performed in the computer program. This is the case with the General Linear Model program in SPSS/WIN. This program is selected in SPSS/WIN by "Statistics", "General Linear Model", and "GLM - General Factorial". To perform the Gender by Department analysis discussed earlier in this section, enter Salary as the dependent measure and Gender and Dept as fixed factors. The screen should appear as follows.

Click "OK" and the results are as follows.

Note that the "F" column and "Sig." column is identical to the results of the R2 change analysis presented earlier in this chapter if each of the effects is entered last. This is the meaning of the default "Type III Sum of Squares." The interpretation of "effects," the result of the dummy coding of categorical variables, is the subject of the next chapter.
