Regression Analysis
Linear Regression — Bridge Course Material
I M.Sc. Biostatistics (Biostatistics Master's Program)
Dr. A. R. Muralidharan
Department of Statistics
College of Natural and Computational Sciences
Debre Berhan University
Contents
1. Introduction to regression
1.1 Dealing with relationships between variables (correlation and chi-square test of
association)
1.2 Historical background of regression
1.3 Steps and objectives of regression
2. Simple Linear Regression
2.1 Functional vs. Statistical Relation
2.2 Formal Statement of the Model
2.3 Graphical Representation of the Simple Linear Regression Model
2.3.1 Interpretation of the Regression Parameters
2.4 Estimation of the Regression Parameters
2.4.1 Method of Least Squares
2.4.2 Maximum Likelihood Estimation
2.5 Inferences in Regression Analysis
2.5.1 Inference for β1
2.5.2 Inference for β0
2.5.3 Inference for Mean Response
2.6 Predicting New Observations
2.6.1 Predicting a Single New Observation
2.6.2 Predicting the Mean of m New Observations
2.6.3 Confidence Band for the Regression Line
2.7 Analysis of Variance
2.8 F -Test versus t-Test
2.9 General Linear Test
2.10 Coefficient of Determination
3. Diagnostics and Remedial Measures
3.1. Departures from the Model
3.2. Residuals
3.3. Diagnostic Plots
3.3.1. Univariate Plots of X and Y
3.3.2. Bivariate Plots
3.4. Formal Tests
3.4.1. Tests for Normality
3.4.2. Test for Autocorrelation
3.4.3. Tests for Non-Constancy of Variance
3.4.4. Outlier Identification
3.4.5. Lack-of-Fit Test
3.5. Remedial Measures
3.5.1. Nonlinearity
3.5.2. Non-Constancy of Error Variance
3.5.3. Outliers
3.5.4. Non-Independence
3.5.5. Non-Normality
3.6. Transformations
3.6.1. Linearizing Transforms
3.6.2. Non-Normality or Unequal Error Variance
3.6.3. Box-Cox Family of Transformations
4. Simultaneous Inference and Other Topics
4.1. Joint Estimation
4.1.1. Statement vs. Family Confidence
4.1.2. Bonferroni Joint Confidence Intervals
4.2. Regression Through the Origin
4.3. Effects of Measurement Errors
4.3.1. Errors in the Response Variable
4.3.2. Errors in the Predictor Variable
5. Multiple Linear Regression
5.1. Multiple Regression Models
5.2. General Linear Regression Model
5.2.1. General Form
5.2.2. Specific Forms
5.3. Matrix Formulation
5.3.1. Estimation of Regression Coefficients
5.3.2. Fitted Values and Residuals
5.3.3. ANOVA Results
5.3.4. F-Test for Regression Relation
5.4. Coefficients
5.4.1. Multiple Determination
5.4.2. Multiple Correlation
5.5. Inference about the Regression Parameters
5.5.1. Interval Estimation of βk
5.6. Inference About Mean Response
5.6.1. Interval Estimation of E(Yh)
5.6.2. Confidence Region for Regression Surface
5.6.3. Simultaneous Confidence Intervals for Several Mean Responses
5.7. Predictions
5.7.1. New Observation, Yh(new)
5.7.2. Mean of m New Observations
5.7.3. New Observations
5.8. Diagnostics and Remedial Measures
5.8.1. Diagnostic Plots
5.8.2. Formal Tests
5.8.3. Remedial Measures
5.9. Extra Sum of Squares
5.10. Coefficient of Partial Determination
5.11. Standardized Multiple Regression
5.12. Multicollinearity
5.12.1. Uncorrelated Predictors
5.12.2. Correlated Predictors
5.12.3. Effects of Multicollinearity
6. Model Building, Diagnostics and Remedial Measures
6.1. Model Building Process
6.2. Criteria for Model Selection
6.2.1. R²p or SSEp Criterion
6.2.2. R²a,p or MSEp Criterion
6.2.3. Mallows' Cp Criterion
6.2.4. AIC and SBC Criteria
6.2.5. PRESSp Criterion
6.3. Automatic Search Procedures for Model Selection
6.3.1. Best Subsets Algorithms
6.3.2. Stepwise Regression
6.3.3. Forward Selection
6.3.4. Backward Selection
6.3.5. Dangers of Automatic Selection Procedures
6.4. Diagnostic Methods
6.4.1. Identifying Outlying Y Observations
6.4.2. Identifying Outlying X Observations
6.4.3. Identifying Influential Cases
6.4.4. Multicollinearity Diagnostics
6.5. Remedial Measures
6.5.1. Weighted Least Squares
6.5.2. Ridge Regression
6.5.3. Robust Regression
6.5.4. Regression with Autocorrelated Errors
6.5.5. Nonlinear Regression
Textbook:
Montgomery, D.C., Peck, E.A. and Vining, G.G. (2002). Introduction to Linear Regression
Analysis (3rd Edition). John Wiley & Sons, Inc., New York.
References:
1. Carroll, R.J. and Ruppert, D. (1988). Transformation and Weighting in Regression. Chapman & Hall, London.
2. Chatterjee, S. and Price, B. (1977). Regression Analysis by Example. Wiley, New York.
3. Draper, N.R. and Smith, H. (1998). Applied Regression Analysis (3rd Edition). John Wiley & Sons, Inc., New York.
4. Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley, Reading, Mass.
5. Myers, R.H. (1990). Classical and Modern Regression with Applications (2nd Edition). PWS-Kent Publishers, Boston.
Contents
1. Introduction to Regression
1.1. Dealing with relationships between variables (correlation and chi-square test of
association)
1.2. Historical background of regression
1.3. Steps and objectives of regression
Introduction to Regression Analysis
1. Introduction
Regression analysis is a statistical technique that can be used to develop a
mathematical equation showing how variables are related. Our objective is
therefore to establish a mathematical relationship that can help us determine
the value of the dependent variable. The graph obtained from such a
relationship may be a line or a curve (non-linear): if the relationship is
linear, the regression is called linear regression; if it is non-linear, the
regression is known as non-linear regression.
Because of its wide application and its necessity in research, there is strong
motivation to learn the concepts and applications of regression.
The predictor or explanatory variables are also called by other names such as
independent variables, covariates, regressors, factors, and carriers. The name
independent variable, though commonly used, is the least preferred, because in
practice the predictor variables are rarely independent of each other.
For instance, a researcher conducting a study of height and weight data wants
to know the relation between height and weight.
In any study that deals with relationships, it is important to investigate the
following questions:
(1) Which variables in the study move up or down together?
(2) Among the variables under study, which have stronger or weaker
relationships?
(3) Is it possible to examine the predictors?
To deal with this type of relationship or association we may use inferential
tests, which let us examine differences between groups. For studies involving
relationship or association, the commonly used statistical tests are
a. Correlation
b. Chi-square tests and
c. Simple or multiple linear regression
These tests may answer the following questions:
A. Is there a relationship between two variables?
B. Is there any association between the variables?
Regression Analysis Bridge course material I M.Sc (Bio Statistics)
a. Correlation
Correlation tests are used to examine the relationship between two or more
variables, in both degree and direction. If these variables are quantitative it
is usual to use Pearson's method; otherwise use Spearman's coefficient.
To carry out such tests, the following points must be known:
the number of variables involved
which are the independent and dependent variables
the types of the variables
If the test gives p < 0.05, the relationship between the variables is
statistically significant.
b. Chi-square tests
The chi-square test is used to test whether a statistically significant
association exists between two categorical variables. It usually accompanies a
cross tabulation of the two variables, popularly known as a contingency table.
This test can answer the following questions:
1. Is there any relationship/association between the variables involved in the
given study?
2. Does the data fit well?
c. Simple or multiple linear regression, dealt with in this course, finds the
relationship between variables.
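As a sketch of these two ideas, Pearson's correlation coefficient and a chi-square statistic can be computed directly; the data below are made up purely for illustration and are not from this course.

```python
import numpy as np

# Two quantitative variables: Pearson's correlation coefficient.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # 0.999 -> strong positive linear relationship

# Two categorical variables: chi-square statistic from a 2x2
# contingency table (cross tabulation).
obs = np.array([[30.0, 10.0],
                [20.0, 40.0]])
expected = obs.sum(axis=1, keepdims=True) @ obs.sum(axis=0, keepdims=True) / obs.sum()
chi2 = ((obs - expected) ** 2 / expected).sum()
print(round(chi2, 2))  # 16.67, far above the 5% critical value 3.84 (1 df)
```

In practice one would obtain the p-value from the relevant reference distribution (t for the correlation, chi-square with 1 df here) rather than hard-coding a critical value.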
1.2. History
The modern era in this area was inaugurated by Sir Francis Galton (1822–1911),
in his book Hereditary Genius: An Enquiry into its Laws and Consequences of
1869, and his paper 'Regression towards mediocrity in hereditary stature' of
1886. Galton's real interest was in intelligence, and how it is inherited. But
intelligence, though vitally important and easily recognisable, is an elusive
concept: human ability is infinitely variable (and certainly
multi-dimensional!), and although numerical measurements of general ability
exist (intelligence quotient, or IQ) and can be measured, they can serve only
as a proxy for intelligence itself. Galton had a passion for measurement, and
resolved to study something that could be easily measured; he chose human
height. In a classic study, he measured the heights of 928 adults, born to 205
sets of parents. He took the average of the father's and mother's height
('mid-parental height') as the predictor variable x, and the height of the
offspring as the response variable y. (Because men are on average taller than
women, one needs to take the gender of the offspring into account. It is
conceptually simpler to treat the sexes separately and focus on sons, say,
though Galton actually used an adjustment factor to compensate for women being
shorter.) When he displayed his data in tabular form, Galton noticed that it
showed elliptical contours: that is, squares in the (x, y)-plane containing
equal numbers of points seemed to lie approximately on ellipses. The
explanation for this lies in the bivariate normal distribution, which
underlies Galton's interpretation of the sample and population regression
lines, (SRL) and (PRL). In (PRL), the dispersions of x and y are measures of
the variability of height, and there is no sign that variability of height is
changing (though mean height has visibly increased from the first author's
generation to his children's). So (at least to a first approximation) we may
take these as equal, whereupon (PRL) simplifies. Hence Galton's celebrated
interpretation: for every inch of height above (or below) the average, the
parents transmit to their children on average ρ inches, where ρ is the
population correlation coefficient between parental height and offspring
height. A further generation will introduce a further factor ρ, so the parents
will transmit, on average, ρ² inches to their grandchildren. This becomes ρ³
inches for the great-grandchildren, and so on. Thus for every inch of height
above (or below) the average, the parents transmit to their descendants after
n generations on average ρⁿ inches of height. As we know, 0 < ρ < 1
(ρ > 0 as the genes for tallness or shortness are transmitted, and parental
and offspring height are positively correlated; ρ < 1 as ρ = 1 would imply
that parental height is completely informative about offspring height, which
is patently not the case). So ρⁿ → 0 as n → ∞:
the effect of each inch of height above or below the mean is damped out with
succeeding generations, and disappears in the limit. Galton summarised this as
'Regression towards mediocrity in hereditary stature', or more briefly,
regression towards the mean (Galton originally used the term reversion
instead, and indeed the term mean reversion still survives). This explains the
name of the whole subject.
1.3. Steps and Objectives
A. Steps
The process of regression analysis broadly involves the following steps:
1. Statement of the problem.
2. Selection of potentially relevant variables.
3. Data collection.
4. Model specification.
5. Choice of fitting method.
6. Model fitting.
7. Model validation and criticism.
8. Using the chosen model(s) for the solution of the posed problem.
B. Objectives
In general, the goals of a regression analysis are to predict or explain
differences in values of the outcome variable with information about values
of the explanatory variables. We are primarily interested in the following
issues:
1. The form of the relationship among the outcome and explanatory
variables, or what the equation that represents the relationship looks like.
2. The direction and strength of the relationships. As we shall learn, these are
based on the sign and size of the slope coefficients.
3. Which explanatory variables are important and which are not. This issue is
based on comparing the size of the slope coefficients (which has some
problems) and on the p-values (or confidence intervals).
4. Predicting a value or set of values of the outcome variable for a given set
of values of the explanatory variables.
EXERCISE
1. Define the term regression. Explain in detail.
2. Write down a few examples of applications of regression analysis.
3. What are the different types of regression? Explain.
4. Explain the following terms:
a. Predictor b. Regressor c. Regressand d. Data fitting e. Functions and
their types.
5. What techniques are available for dealing with relationships?
6. What are the uses of regression analysis?
7. Write short notes on the history of regression and its various
developments.
8. Discuss the steps and objectives in regression.
9. Give some model expressions and explain the terms involved.
10. What is a residual in regression? Explain it with its distribution.
Contents
2. Simple linear regression
2.1. Functional vs. Statistical relation
2.2. Formal statement of the model
2.3. Graphical representation of the Simple linear regression model
2.3.1. Interpretation of the regression parameters
2.4. Estimation of the regression parameters
2.4.1. Method of Least squares ( LS)
2.4.2. Maximum Likelihood Estimation (MLE)
2.5. Inferences in the regression analysis
2.5.1. Inference for β1
2.5.2. Inference for β0
2.5.3. Inferences for the mean response
2.6. Predicting new observations
2.6.1. Predicting a single new observation
2.6.2. Predicting the mean of m new observations
2.6.3. Confidence band for the regression line
2.7. Analysis of variance
2.8. F-test Versus t-test
2.9. General Linear test
2.10. Coefficient of Determination
Introduction to Regression Analysis
Usually we do not know the true relationship between X and Y, but we would
like to describe or somehow use this relationship. In the case of a
statistical relationship, the result is not exact: we get a TREND of the
association between the predictor and the response. We can use a diagrammatic
representation to show this relationship, but it is not a perfect one; the
observations for a statistical relation do not fall directly on the curve of
the relationship.
Y = f(X) + ε
The term f(X) approximates Y, and ε is the error committed. Since Y is not
known exactly for every X, we approximate the relation between X and Y.
Let us define the function f as linear:
f(X) = β0 + β1X
Then Y becomes
Y = β0 + β1X + ε
This is a very simple model to explain the relationship between two variables.
The function of X is an equation for a straight line.
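A minimal simulation sketch of this model; the parameter values β0 = 1, β1 = 2 and the error variance are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
b0, b1 = 1.0, 2.0                        # assumed (illustrative) parameters
x = np.linspace(0.0, 10.0, 200)
eps = rng.normal(0.0, 1.0, size=x.size)  # error term with mean 0
y = b0 + b1 * x + eps                    # Y = f(X) + error, with f linear

# The observations scatter around the straight line f(x) = b0 + b1*x
# rather than falling exactly on it; the average deviation is near zero.
print(round((y - (b0 + b1 * x)).mean(), 3))
```

Plotting y against x here would give exactly the kind of scatter around a straight line that the statistical (as opposed to functional) relation describes.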
As we know, in any statistical analysis graphs play an important role. A
pictorial representation of the given data gives an idea of the trend of the
data, showing its dispersion, variation, or scatter. Although graphs
supplement the data and are not a substitute for it, it is valuable to present
the data in a graph. Usually graphs are two-dimensional, that is, bivariate.
The fundamental graphical tool for looking at regression data is the
two-dimensional scatter plot. In regression problems with one predictor and
one response, the scatter plot of the response versus the predictor is the
starting point for regression analysis. In problems with many predictors,
several simple graphs will be required at the beginning of an analysis. A
scatter plot matrix is a convenient way to organize looking at many scatter
plots at once.
Consider a regression problem with one predictor, which we will generically
call X, and one response variable, which we will call Y. The data consist of
values (xi, yi), i = 1, . . . , n, of (X, Y) observed on each of n units or
cases. In any particular problem, both X and Y will have other names that are
more descriptive of the data to be analyzed. The goal of regression is to
understand how the values of Y change as X is varied over its range of
possible values. A first look at how Y changes as X is varied is available
from a scatter plot. Below is a graphical representation of a simple linear
regression.
Figure 2.2: Scatter plot of mothers' and daughters' heights in the Pearson and
Lee data. The original data have been jittered to avoid overplotting; if
rounded to the nearest inch they would return the original data provided by
Pearson and Lee.
This model has a single regressor x with a straight-line relationship to the
response y. The error term ε is assumed to have mean zero and unknown
variance σ². It is also assumed that the errors are uncorrelated, that is,
that the value of one error does not depend on the value of any other.
Since the errors have mean zero, the mean of the response at a given x is
E(y | x) = β0 + β1x
and the variance is
Var(y | x) = Var(β0 + β1x + ε) = σ²
Our aim is to estimate the unknown parameters of this model, β0 and β1, from
the given sample data. To estimate these parameters we apply the method of
least squares: we estimate the parameters so that the sum of squared
differences between the observations and the straight line is a minimum.
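The closed-form least-squares solutions, b1 = Sxy/Sxx and b0 = ȳ − b1·x̄, can be computed directly. A sketch on a small made-up data set:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = x.mean(), y.mean()
Sxx = ((x - xbar) ** 2).sum()           # corrected sum of squares of x
Sxy = ((x - xbar) * (y - ybar)).sum()   # corrected sum of cross-products

b1 = Sxy / Sxx          # slope: minimizes the sum of squared deviations
b0 = ybar - b1 * xbar   # intercept
print(round(b1, 2), round(b0, 2))  # 1.96 0.14
```

These are exactly the values that solve the normal equations obtained by setting the partial derivatives of the sum of squares to zero.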
The fitted value for case i is b0 + b1xi, for which we use the shorthand
notation ŷi. Although the ei are not parameters in the usual sense, we use
the same hat notation to specify the residuals: the residual for the ith case
is
êi = yi − ŷi,
which should be compared with the equation for the statistical errors,
ei = yi − (β0 + β1xi).
Table 2.1 also lists definitions for the usual univariate and bivariate
summary statistics: the sample averages (x̄, ȳ), sample variances (SD²x,
SD²y), and estimated covariance and correlation (sxy, rxy). The hat rule
described earlier would suggest that different symbols should be used for
these quantities; for example, ρ̂xy might be more appropriate for the sample
correlation if the population correlation is ρxy.
êi = yi − ŷi,  i = 1, 2, . . . , n.
We need these residuals to measure the adequacy of the fitted model, and some
questions about the fitted data are considered below. The estimated regression
line fitted by the method of least squares has a number of properties worth
noting.
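Two of these properties (the residuals sum to zero, and are orthogonal to the predictor) can be checked numerically; a sketch on a small made-up data set:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)               # e_i = y_i - yhat_i
print(abs(resid.sum()) < 1e-9)          # True: residuals sum to zero
print(abs((x * resid).sum()) < 1e-9)    # True: residuals orthogonal to x
print(round((b0 + b1 * x).mean(), 2))   # 6.02: fitted values average to ybar
```

Both properties follow from the normal equations, so they hold for any least-squares fit with an intercept, not just for this data set.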
Consider data (xi, yi) as pairs of n observations, and assume that the errors
in the regression model are NID(0, σ²); then the observations yi in this
sample are independent normal random variables with mean β0 + β1xi and
variance σ².
The likelihood function is the joint distribution of the observations, viewed
as a function of the unknown parameters.
For the simple linear regression model with normal errors, the likelihood
function is
L(β0, β1, σ²) = (2πσ²)^(−n/2) exp[ −(1/(2σ²)) Σi (yi − β0 − β1xi)² ]
It should be noted that MLEs have better statistical properties than least
squares estimators, but the MLE requires a full distributional assumption:
here, that the random errors follow a normal distribution with the same second
moments as required for the least squares estimates.
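The relationship can be illustrated numerically: under normal errors the ML estimates of the slope and intercept coincide with least squares, while the ML estimate of the error variance divides SSE by n rather than n − 2. A sketch on a small made-up data set:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
sse = ((y - b0 - b1 * x) ** 2).sum()   # residual sum of squares

sigma2_mle = sse / n          # maximum likelihood estimator (biased)
s2 = sse / (n - 2)            # least-squares-based unbiased estimator (MSE)
print(sigma2_mle < s2)        # True: the MLE of sigma^2 is smaller
```

The bias of the MLE shrinks as n grows, since n/(n − 2) → 1.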
In linear regression, estimation can be done using least squares, which does
not require the normality assumption. From least squares we obtain b1 and b0,
the point estimators of the parameters β1 and β0, respectively.
Statistical inference, however, is based on the normal distribution: the
errors εi are assumed to be mutually independent and identically normally
distributed. When normality is in doubt, we rely on the Central Limit Theorem
and treat the inference as approximately correct.
2.5.1. Inference for β1
Suppose that we wish to test the hypothesis that the slope equals a constant,
say H0: β1 = 0. For the linear regression model, this condition implies even
more than the absence of linear association between the variables. Since under
the model all probability distributions of y are normal with constant
variance, and under the null hypothesis the means are all equal, it follows
that the probability distributions of y are identical under the null
hypothesis. As depicted in the figure, the normal error regression model thus
implies, under H0, that there is not only no linear association between x and
y but no relation of any type between the variables, since the probability
distributions of y are then identical at all levels of x.
Sampling distribution of b1
Since the rearranged inequality holds for all possible values of β1, the
(1 − α) confidence limits for β1 are
b1 ± t(1 − α/2; n − 2) se(b1).
Here a two-sided alternative has been specified. Since the errors follow
NID(0, σ²), the observations yi are NID(β0 + β1xi, σ²). Now b1 is a linear
combination of the observations, so b1 is normally distributed, with
E(b1) = β1 and Var(b1) = σ²/Sxx.
In the t statistic, the denominator is often called the estimated standard
error (or simply the standard error) of the slope; that is,
se(b1) = sqrt(MSE / Sxx).
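These quantities can be computed directly. A sketch on a small made-up data set; the critical value t(0.975; 3) ≈ 3.182 is hard-coded from a t table to keep the example dependency-free:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

Sxx = ((x - x.mean()) ** 2).sum()
b1 = ((x - x.mean()) * (y - y.mean())).sum() / Sxx
b0 = y.mean() - b1 * x.mean()
mse = ((y - b0 - b1 * x) ** 2).sum() / (n - 2)

se_b1 = np.sqrt(mse / Sxx)     # estimated standard error of the slope
t_stat = b1 / se_b1            # tests H0: beta1 = 0
t_crit = 3.182                 # t(0.975; n-2 = 3 df), from a t table

print(round(t_stat, 1))        # 35.4 -> reject H0 at the 5% level
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(round(ci[0], 2), round(ci[1], 2))  # 95% CI for the slope
```

Note that the confidence interval excludes zero, agreeing with the rejection of H0 by the t test.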
There are only infrequent occasions when we wish to make inferences concerning
β0, the intercept of the regression line. These occur when the scope of the
model includes X = 0.
Sampling distribution of b0
The sampling distribution of b0 is normal, so the confidence interval and test
for this coefficient are based on the t distribution. The (1 − α) confidence
limits for β0 are obtained in the same manner as those for the slope:
b0 ± t(1 − α/2; n − 2) se(b0),
where se(b0) is the standard error of the intercept.
4. Power of tests
It is well known that one of the main goals of regression analysis is to
estimate the mean of the distribution of Y for a specific value of the
predictor X. Let xh denote the level of X for which we wish to estimate the
mean response; xh may be a value which occurred in the sample, or some other
value of the predictor variable within the scope of the model. The mean
response when X = xh is denoted by E(Yh), and its point estimator is
Ŷh = b0 + b1xh.
Note that the width of the confidence interval is minimized at xh = x̄. We
might expect our best estimates of y to be made at x values near the center of
the data, and the precision of estimation to deteriorate as we move toward the
boundary of the x-space.
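This behaviour is easy to verify: the standard error of the estimated mean response, sqrt(MSE·(1/n + (xh − x̄)²/Sxx)), grows with the distance of xh from x̄. A sketch on a small made-up data set; the xh values are arbitrary:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

Sxx = ((x - x.mean()) ** 2).sum()
b1 = ((x - x.mean()) * (y - y.mean())).sum() / Sxx
b0 = y.mean() - b1 * x.mean()
mse = ((y - b0 - b1 * x) ** 2).sum() / (n - 2)

def se_mean(xh):
    """Standard error of the estimated mean response at X = xh."""
    return np.sqrt(mse * (1.0 / n + (xh - x.mean()) ** 2 / Sxx))

# Narrowest at the centre of the data (xbar = 3), wider toward and
# beyond the edges of the observed x-range:
print(se_mean(3.0) < se_mean(1.0) < se_mean(-2.0))  # True
```

The quadratic term (xh − x̄)²/Sxx is what makes intervals balloon under extrapolation, which is the point made in the next paragraph.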
The above CI shows that the issue of extrapolation is subtle: the farther the
x value is from the centre of the data, the more variable the estimate Ŷh
becomes. The limits widen as the prediction approaches the boundary of the
data, and as we move beyond this boundary the prediction may deteriorate
rapidly. Further, the farther we move away from the original region of
x-space, the more likely it is that equation error or model error will play a
role in the process. This is not the same as saying "never extrapolate": the
CI supports such use of the prediction equation, but it does not support using
the regression model to forecast many periods into the future. In general, the
greater the extrapolation, the higher the chance of equation error or model
error impacting the results. The probability statement associated with the CI
holds only when a single confidence interval for the mean response is to be
constructed.
If x0 is the value of the regressor variable of interest, then the point
estimate of a new value of the response at X = x0 is
ŷ0 = b0 + b1x0.
To construct a prediction interval for a single new observation at X = x0, the
two sources of variability are the variability of the estimated mean response
at x0 and the variability of the new observation about its mean. The variance
of the prediction error is estimated as
MSE (1 + 1/n + (x0 − x̄)²/Sxx).
It should be noted that the estimated variance of a single new prediction is
always greater than the estimated variance of the estimated mean response. The
prediction interval is based on the t distribution with (n − 2) df.
If m new observations are all made at the same level X = xh, let Ȳh(new)
denote the mean response of these new cases. The estimator of this mean is the
predicted value Ŷh itself: for all cases at the same level of X, the
prediction is the same.
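The two variances can be compared directly: the prediction variance adds an extra MSE term for the new observation's own variability. A sketch on a small made-up data set; x0 = 4 is arbitrary:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

Sxx = ((x - x.mean()) ** 2).sum()
b1 = ((x - x.mean()) * (y - y.mean())).sum() / Sxx
b0 = y.mean() - b1 * x.mean()
mse = ((y - b0 - b1 * x) ** 2).sum() / (n - 2)

x0 = 4.0
var_mean = mse * (1.0 / n + (x0 - x.mean()) ** 2 / Sxx)        # mean response
var_pred = mse * (1.0 + 1.0 / n + (x0 - x.mean()) ** 2 / Sxx)  # new observation
print(var_pred > var_mean)  # True: prediction is always the wider interval
```

The difference between the two variances is exactly MSE, which is why a prediction interval is always wider than the confidence interval for the mean response at the same x0.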
We now consider the basic regression model from the perspective of the
analysis of variance (ANOVA). The ANOVA approach will be the basic approach
for testing the significance of regression in multiple regression models and
some other types of linear statistical models as well.
The measure of total variation is the total sum of squared deviations,
SST = Σ (Yi − Ȳ)²,
and this variation is separated into component parts. If all the observations
in the data are the same, SST = 0; the greater the variation among the
observations, the larger SST is. Thus SST is a measure of the uncertainty
pertaining to the data. When the variable X is taken into account, the total
variation is partitioned into the variation explained by the regression (SSR)
and the residual variation (SSE).
ANOVA
The F-statistic is used to test whether the regression model has any
predictive or explanatory ability; the null hypothesis H0 is that the
regression is not significant.
Decision rule:
The test statistic F follows the F(1, n − 2) distribution when H0 holds. When
the risk of Type I error is to be controlled at level α, the decision rule is:
reject H0 if F exceeds F(1 − α; 1, n − 2); otherwise do not reject.
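The ANOVA decomposition and the F test can be sketched numerically on a small made-up data set; the critical value F(0.95; 1, 3) ≈ 10.13 is hard-coded from an F table:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

sst = ((y - y.mean()) ** 2).sum()     # total sum of squares
ssr = ((yhat - y.mean()) ** 2).sum()  # regression sum of squares
sse = ((y - yhat) ** 2).sum()         # error sum of squares
print(abs(sst - (ssr + sse)) < 1e-9)  # True: SST = SSR + SSE

f_stat = (ssr / 1) / (sse / (n - 2))  # MSR / MSE, df = (1, n-2)
print(f_stat > 10.13)  # True: reject H0 of no linear relation at 5%
```

The decomposition SST = SSR + SSE is an algebraic identity for any least-squares fit with an intercept, which is what the first check confirms.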
2.8. F-test versus t-test
2.9. General linear test
The ANOVA test of a simple hypothesis is a special case of a general test for
a linear statistical model. It is worthwhile to develop this general test
approach in terms of the simple linear regression model, since the general
approach has wide application and makes the pattern of the linear regression
easy to understand. This general approach is termed the general linear test.
It involves three fundamental steps and is based on the reduction in the error
sum of squares.
We compare the two error sums of squares, SSE(F) and SSE(R). SSE(F) can never
exceed SSE(R); mathematically, SSE(F) ≤ SSE(R). The reason is that the more
parameters there are in the model, the better one can fit the data and the
smaller are the deviations around the fitted regression function. When SSE(F)
is not much less than SSE(R), using the full model does not account for much
more of the variability of the observations than does the reduced model, under
which H0 holds. In other words, when SSE(F) is close to SSE(R), the variation
of the observations around the fitted regression function for the full model
is almost as great as the variation around the fitted regression function for
the reduced model.
A small difference between the SSEs suggests that H0 holds. On the other hand,
a large difference implies that H0 does not hold, because the additional
parameters in the full model help to reduce substantially the variation of the
observations Yi around the fitted regression function.
The general linear test approach can be used for highly complex tests of
linear statistical models, as well as for simple tests.
How can we do this?
1. Fit the full model and obtain SSE(F).
2. Fit the reduced model and obtain SSE(R).
3. Compute F to test whether the full model significantly improves on the
reduced model.
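These three steps can be sketched with the simple linear model as the full model and the intercept-only model as the reduced model (H0: β1 = 0); for simple regression this F coincides with the ANOVA F. A sketch on a small made-up data set:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

# 1. Full model: y = b0 + b1*x + e
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
sse_f = ((y - b0 - b1 * x) ** 2).sum()
df_f = n - 2

# 2. Reduced model under H0 (beta1 = 0): y = b0 + e, fitted by ybar
sse_r = ((y - y.mean()) ** 2).sum()
df_r = n - 1

# 3. General linear test statistic
f_stat = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
print(sse_f <= sse_r)    # True: the full model never fits worse
print(round(f_stat, 1))  # large value -> reject H0
```

The same three steps carry over unchanged to much more complex comparisons, for example dropping several predictors at once from a multiple regression model.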
Exercise:
The table below gives purity (%) and hydrocarbon level (%) for 20 samples.

Sample            1     2     3     4     5     6     7     8     9    10
Purity (%)      86.91 89.85 90.28 86.34 92.58 87.33 86.29 91.86 95.61 89.86
Hydrocarbons (%) 1.02  1.11  1.43  1.11  1.01  0.95  1.11  0.87  1.43  1.02

Sample           11    12    13    14    15    16    17    18    19    20
Purity (%)      96.73 99.42 98.66 96.07 93.65 87.31 95.00 96.85 85.20 90.56
Hydrocarbons (%) 1.46  1.55  1.55  1.55  1.40  1.15  1.01  0.99  0.95  0.98

a. Fit a simple linear regression.
b. Test hypotheses about the regression coefficients.
c. Calculate and interpret R².
d. Find 95% confidence intervals.