
Homework 1: Linear Regression, Linear Combinations, Inference, & Prediction

Dr. Timothy R. Johnson Spring Semester, 2014


This homework assignment is due no later than 3:00 on Wednesday, February 12th. Please read the instructions below carefully.

Homework Instructions
This assignment is due by 3:00 on Wednesday, February 12th. A hard copy is strongly preferred. Late assignments will be accepted only in extreme circumstances and only if arrangements have been made in advance.

Your solutions must be typed and very neatly organized. I will not try to infer your solutions if they are not clearly presented. Equations need not be typeset perfectly, but they should be clear. You may substitute letters for symbols (e.g., b1 for $\beta_1$), and you may write out equations (neatly) by hand if necessary.

Include with your solutions the relevant R output and the R scripts that created it. Include these within the text of your solutions using cut-and-paste. Try to include only the relevant output. I recommend using a monospace font (e.g., Consolas or Courier) for R scripts and output for clarity.

You may discuss the homework with other students in the course. However, you must still write your own R scripts, produce your own output, and write up your own solutions.

You are welcome to ask me questions concerning the homework. I will be particularly open to helping with any R problems: I want to evaluate your understanding of applied regression, not R. If you email me with an R question, it may be helpful to include your full R script so that I can replicate your problem.

The Statistics Assistance Center (SAC) and Statistical Consulting Center (SCC) are not designed to accommodate this course. Direct all questions to me.


Anatomical Abnormalities Associated with Schizophrenia


Researchers conducted an observational study that looked at anatomical abnormalities in schizophrenics.[1] To control for genetic and socioeconomic factors, the researchers looked at 15 pairs of monozygotic twins where one twin was schizophrenic and the other was not. One of the anatomical variables that the researchers considered was the volume (cm^3) of the left hippocampus, measured using magnetic resonance imaging. The data are in a data frame called case0202 that is included in the Sleuth3 package. In this problem you will use regression to do two basic statistical analyses: one-sample and two-sample inferences for means.

1. Consider the model $E(Y_i) = \beta_0$ where $Y_i$ is the difference in the volume of the left hippocampus between the twins in the i-th pair. Note that $\beta_0$ is then simply what would usually be referred to as the population mean difference ($\mu$). Provide a point and interval estimate of $\beta_0$, and conduct a test of the null hypothesis $H_0: \beta_0 = 0$ (i.e., that there is no difference, on average, in the volume of the left hippocampus between schizophrenics and non-schizophrenics). Use a significance level of $\alpha = 0.05$. Finally, comment on how the confidence interval for $\beta_0$ can be used to determine the outcome of the test.

2. The researchers used a matched-pairs design based on twins, which resulted in two dependent samples that could be collapsed into a single sample of differences. However, they could alternatively have used two independent samples: one of individuals affected by schizophrenia and another (unrelated) sample of individuals unaffected by schizophrenia. For the purpose of this problem you will do an analysis that assumes (incorrectly) that the samples are independent. But first it is necessary to reorganize the data. Use the following commands to create a new data frame.[2]
> volume <- c(case0202$Unaffected, case0202$Affected)
> status <- rep(c("unaffected","affected"), each = 15)
> newdata <- data.frame(volume, status)
[1] Suddath, R. L., Christison, G. W., Torrey, E. F., Casanova, M. F., & Weinberger, D. R. (1990). Anatomical abnormalities in the brains of monozygotic twins discordant for schizophrenia. New England Journal of Medicine, 322(12), 789-794.

[2] There are some very powerful functions available in R to reorganize data in various ways, but here we will use a simple approach. Note that rep is a function that creates replicates.

Consider the model $E(Y_i) = \beta_0 + \beta_1 x_i$ where now $Y_i$ is the volume of the left hippocampus of the i-th subject, and $x_i$ is an indicator variable for whether the i-th subject is affected or unaffected by schizophrenia.

(a) Estimate the parameters of the model. Report the parameter estimates and the confidence intervals for each parameter. Also explain how R has defined the indicator variable (i.e., when is it one and when is it zero?).


(b) Let $\mu_a$ and $\mu_u$ represent the population mean left hippocampus volumes for affected and unaffected subjects, respectively. Write these two parameters as a function of $\beta_0$ and $\beta_1$ based on the model above. Also write $\mu_u - \mu_a$ and $\mu_a - \mu_u$ as a function of $\beta_0$ and $\beta_1$.

(c) Use the contrast function to provide point and interval estimates for $\mu_a$, $\mu_u$, $\mu_u - \mu_a$, and $\mu_a - \mu_u$.[3]

(d) Which parts of the output from summary and contrast are useful in making inferences about the presence and magnitude of a difference in the volume of the left hippocampus between those affected by schizophrenia and those who are unaffected? Is the difference statistically significant, and what is the estimate of the magnitude of this difference?
[3] You can check your answers using the fact that the estimates of $\mu_a$ and $\mu_u$ are equal to the corresponding sample means. The sample means can be computed using the aggregate function.
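The check described in the footnote on sample means can be tried first on made-up data. The following is only a sketch: the data frame toy, its columns y and g, and the simulated values are hypothetical and are not the case0202 data. It illustrates that, in a model with a single two-level indicator, the expected responses estimated by the regression agree with the group sample means returned by aggregate.

```r
# Hypothetical toy data: a response y and a two-level factor g (not case0202).
set.seed(1)
toy <- data.frame(
  y = c(rnorm(6, mean = 10), rnorm(6, mean = 12)),
  g = factor(rep(c("a", "b"), each = 6))
)

# Regression on the indicator variable that R builds from the factor g.
fit <- lm(y ~ g, data = toy)

# Sample mean of y within each level of g.
means <- aggregate(y ~ g, data = toy, FUN = mean)

# Estimated expected response for each level of g, from the regression.
new_g <- data.frame(g = factor(c("a", "b"), levels = levels(toy$g)))
fitted_means <- predict(fit, newdata = new_g)

print(means)
print(fitted_means)

# The two sets of means agree up to rounding error.
stopifnot(isTRUE(all.equal(unname(fitted_means), means$y)))
```

Printing coef(fit) next to means is one way to see how the intercept and the indicator coefficient combine to give each group mean.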

Alcohol Metabolism in Men & Women


Researchers conducted an observational study to compare how men and women metabolise alcohol.[4] Women tend to show a higher blood-alcohol concentration than men after drinking, even when controlling for drinking history and body mass. It was thought that this difference could be due, in part, to differences between men and women in the enzymatic activity in the stomach that degrades some of the alcohol before it enters the bloodstream. This study measured the first-pass metabolism of alcohol in 18 women and 14 men by measuring the difference in blood alcohol concentration when a fixed dose of alcohol (0.3 g) was administered intravenously versus orally.[5] The gastric alcohol dehydrogenase of each subject was also measured using samples of stomach mucus.[6] The question here is if and how the first-pass metabolism measure differs between men and women when controlling for gastric alcohol dehydrogenase. The data are in the data frame case1101 that is part of the Sleuth3 package. The data also include a variable that indicates whether each subject was an alcoholic, but this information will be ignored here.

[4] Frezza, M., di Padova, C., Pozzato, G., Terpin, M., Baraona, E., & Lieber, C. S. (1990). High blood alcohol levels in women: The role of decreased gastric alcohol dehydrogenase activity and first-pass metabolism. New England Journal of Medicine, 322, 95-99.

[5] Measured in mmol/liter-hour.

[6] Measured in μmol/min/g. Alcohol dehydrogenase is the enzyme that breaks down alcohol.

1. Estimate a linear regression model using the first-pass metabolism measure as the response variable and the subject's gastric alcohol dehydrogenase and sex as explanatory variables. Include in your model a term for an interaction between the two explanatory variables. Report the results using the summary function, and explain how each of the three explanatory variables in the model $E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$ is computed. That is, what are $x_{i1}$, $x_{i2}$, and $x_{i3}$ specifically, meaning how would you compute them by hand? Note that, like the model for the whiteside data frame, this model defines a separate regression line for men and women when modeling the relationship between expected first-pass metabolism and gastric dehydrogenase activity. Write the intercept and the slope of each line as a function of the model parameters.[7] You may find it useful to use visreg to visualize the model and confirm your results.

2. Using contrast, estimate the following linear combinations based on the model you used in the previous problem: the slopes for men and women, and the differences in expected first-pass metabolism between men and women at gastric alcohol dehydrogenase levels of 0, 1, 2, 3, 4, and 5 μmol/min/g.[8] Report the output from contrast and be sure to annotate your output or indicate in some way which estimates represent which linear combinations.[9]

3. Compute the predicted first-pass metabolism for men and women, separately, at alcohol dehydrogenase levels of 0, 1, 2, 3, 4, and 5 μmol/min/g. Also compute the prediction intervals for each of the twelve combinations of these values of the explanatory variables.[10]

4. Which test reported by the summary function is the test of the null hypothesis that the regression lines for men and women are parallel? Do this test using full and null models with the anova function. Report the observed values of the test statistics and the p-values for each test, and verify that the p-values for the two tests are equal and that the square of the observed t test statistic equals the observed F test statistic.

5. Although most statistical packages will compute indicator variables and interaction terms automatically, it is sometimes useful to be able to do so manually. Create three new variables called x1, x2, and x3 that are equal to the explanatory variables created by R automatically. A helpful function for creating indicator variables is ifelse. To see how it works, try the following example.[11]
> u <- c("a","a","b","b","c")
> v <- ifelse(u == "b", 1, 0)

[7] Remember that it can be helpful to write the model case-wise, as with the whiteside data, where

$$
E(\text{gas}_i) =
\begin{cases}
\beta_0 + \beta_2\,\text{temp}_i, & \text{if before}, \\
\beta_0 + \beta_1 + (\beta_2 + \beta_3)\,\text{temp}_i, & \text{if after}.
\end{cases}
$$

This makes it clear that in that model the slope and intercept before insulation are $\beta_2$ and $\beta_0$, respectively, and that the slope and intercept after insulation are $\beta_2 + \beta_3$ and $\beta_0 + \beta_1$, respectively.

[8] Note that the estimated expected value of the first-pass metabolism at an alcohol dehydrogenase activity level of 0 is an intercept. This can be handy for checking your work.

[9] Note that you can check your work by computing the estimates by hand, using the estimates of the model parameters and your solution to the previous question.

[10] The function expand.grid would be useful here.

[11] The operator == tests whether one thing is equal to another. An easy mistake to make is to write = instead of ==, or vice versa.

The other thing you need to know is that the multiplication operator in R is *. For example:
> w <- c(10,20,30,40,50)
> z <- v * w

Report the output from summary for a model using x1, x2, and x3 as your explanatory variables to verify that the parameter estimates are identical to those you obtained in the first problem.


6. Examine the following two models using the summary function.


> lm(Metabol ~ Sex + Sex:Gastric, data = case1101)
> lm(Metabol ~ Sex + Sex:Gastric - 1, data = case1101)

Here it is crucial to use : to specify the interaction rather than *, since the latter will add additional terms that are not wanted. The term -1 tells R not to include $\beta_0$ in the model (which it usually does by default). You should notice that the first model has the form $E(Y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$ while the second model has the form $E(Y_i) = \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4}$. Each of these two models is a reparameterization of the model you considered earlier. For each model, explain how each $x_{ij}$ is defined (i.e., how would it be computed from the data?). You should be able to do this using the output from summary. Then, for each model and for each line that represents the relationship between the expected first-pass metabolism and gastric alcohol dehydrogenase, write the slope and intercept of the line as a function of the model parameters.[12] Do this by writing the model case-wise.

7. Given that the data are close to the origin and it is reasonable that the expected first-pass metabolism should be zero if the gastric alcohol dehydrogenase activity is also zero, a model in which the regression lines for both men and women pass through the origin (i.e., have a y-intercept of zero) could be justified. Specify such a model by modifying one of the models you considered in the previous two problems.[13] Note that since both lines pass through the origin, the model should have only two parameters: there are two slopes, but the intercepts are fixed. Report the output from the summary of this model. You may find it useful to use visreg to check your model.[14]
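Two of the facts the problems above rely on can be tried out on simulated data before turning to case1101. The sketch below is an illustration under stated assumptions, not a solution: the data frame toy, its columns x, g, and y, and the coefficients used to simulate it are all hypothetical. It checks (i) that the squared t statistic for a single coefficient equals the F statistic from anova when the null model drops only that term, and (ii) that columns built by hand with ifelse and * reproduce the coefficient estimates R obtains from its automatically created variables.

```r
# Hypothetical toy data (not case1101): response y, numeric x, two-level factor g.
set.seed(2)
toy <- data.frame(
  x = runif(20, min = 0, max = 5),
  g = factor(rep(c("a", "b"), times = 10))
)
toy$y <- 1 + 0.5 * toy$x + ifelse(toy$g == "b", 0.3 * toy$x, 0) +
  rnorm(20, sd = 0.5)

# Full model (a separate line per level of g) and null model (no interaction).
full <- lm(y ~ g * x, data = toy)
null <- lm(y ~ g + x, data = toy)

# t statistic for the interaction coefficient in the full model.
t_stat <- coef(summary(full))["gb:x", "t value"]

# F statistic from comparing the null and full models.
f_stat <- anova(null, full)$F[2]

# The same explanatory variables constructed by hand.
toy$x1 <- ifelse(toy$g == "b", 1, 0)  # indicator for level "b"
toy$x2 <- toy$x                       # the numeric variable itself
toy$x3 <- toy$x1 * toy$x2             # product of the two
manual <- lm(y ~ x1 + x2 + x3, data = toy)

print(c(t_squared = t_stat^2, F = f_stat))
print(cbind(automatic = coef(full), manual = coef(manual)))
```

The level that R treats as the reference (here "a") depends on the ordering of the factor levels, so the hand-built indicator has to match that ordering for the estimates to agree.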

[12] You may find it useful to use your results from earlier problems to verify your answers.

[13] There is more than one parameterization of such a model.

[14] I would have liked to have you use contrast with this model, but unfortunately it tends to have trouble with these alternative parameterizations. There are other ways to do linear combinations in R, but they are not as user-friendly.
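One base-R route to a linear combination, which may or may not be what the last footnote has in mind, uses only coef, vcov, and qt: for estimated coefficients $b$ with estimated covariance matrix $V$, the combination $a^\top b$ has standard error $\sqrt{a^\top V a}$, and a t-based interval follows. The sketch below applies this to simulated data; the data frame toy, its columns, and the weight vector a are hypothetical choices for illustration.

```r
# Hypothetical toy data (not case1101): response y, numeric x, two-level factor g.
set.seed(3)
toy <- data.frame(
  x = runif(20, min = 0, max = 5),
  g = factor(rep(c("a", "b"), times = 10))
)
toy$y <- 1 + 0.4 * toy$x + ifelse(toy$g == "b", 0.5 * toy$x, 0) +
  rnorm(20, sd = 0.5)

# Parameterization with one intercept and one slope per level of g.
# Coefficient order: ga, gb, ga:x, gb:x.
fit <- lm(y ~ g + g:x - 1, data = toy)

b <- coef(fit)   # estimated coefficients
V <- vcov(fit)   # their estimated covariance matrix

# Weights for the combination: slope for level "b" minus slope for level "a".
a <- c(0, 0, -1, 1)

est <- sum(a * b)                          # point estimate of a'b
se <- sqrt(drop(t(a) %*% V %*% a))         # standard error sqrt(a'Va)
t_crit <- qt(0.975, df = df.residual(fit)) # critical value for a 95% interval
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)

print(c(estimate = est, se = se, ci))

# Cross-check: in the y ~ g * x parameterization the same quantity is a
# single coefficient, so confint reports its interval directly.
check <- lm(y ~ g * x, data = toy)
print(confint(check)["gb:x", ])
```

The cross-check works because the two fits are reparameterizations of the same model, so the hand-computed interval and the one from confint should coincide up to rounding error.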
