
Linear Regression Analysis (STAT561)
I M.Sc (Biostatistics), 2016-17

Dr. A. R. Muralidharan
Department of Statistics
College of Natural and Computational Sciences
Debre Berhan University

Biostatistics Master Program
Contents
1. Introduction to Regression
1.1 Dealing with Relationships between Variables (Correlation and Chi-square Test of
Association)
1.2 Historical Background of Regression
1.3 Steps and Objectives of Regression
2. Simple Linear Regression
2.1 Functional vs. Statistical Relation
2.2 Formal Statement of the Model
2.3 Graphical Representation of the Simple Linear Regression Model
2.3.1 Interpretation of the Regression Parameters
2.4 Estimation of the Regression Parameters
2.4.1 Method of Least Squares
2.4.2 Maximum Likelihood Estimation
2.5 Inferences in Regression Analysis
2.5.1 Inference for β1
2.5.2 Inference for β0
2.5.3 Inference for Mean Response
2.6 Predicting New Observations
2.6.1 Predicting a Single New Observation
2.6.2 Predicting the Mean of m New Observations
2.6.3 Confidence Band for the Regression Line
2.7 Analysis of Variance
2.8 F-Test versus t-Test
2.9 General Linear Test
2.10 Coefficient of Determination
3. Diagnostics and Remedial Measures
3.1.Departures from the Model
3.2.Residuals
3.3. Diagnostics Plots
3.3.1. Univariate Plots of X and Y
3.3.2. Bivariate Plots
3.4.Formal Tests
3.4.1. Tests for Normality
3.4.2. Test for Autocorrelation
3.4.3. Tests for Non-Constancy of Variance
3.4.4. Outlier Identification
3.4.5. Lack-of-Fit Test
3.5.Remedial Measures
3.5.1. Nonlinearity
3.5.2. Non-Constancy of Error Variance
3.5.3. Outliers
3.5.4. Non-Independence
3.5.5. Non-Normality
3.6.Transformations
3.6.1. Linearizing Transforms
3.6.2. Non-Normality or Unequal Error Variance
3.6.3. Box-Cox Family of Transformations
4. Simultaneous Inference and Other Topics
4.1.Joint Estimation
4.1.1. Statement vs. Family Confidence
4.1.2. Bonferroni Joint Confidence Intervals
4.2.Regression Through the Origin
4.3.Effects of Measurement Errors
4.3.1. Errors in the Response Variable
4.3.2. Errors in the Predictor Variable
5. Multiple Linear Regression
5.1.Multiple Regression Models
5.2.General Linear Regression Model
5.2.1. General Form
5.2.2. Specific Forms
5.3.Matrix Formulation
5.3.1. Estimation of Regression Coefficients
5.3.2. Fitted Values and Residuals
5.3.3. ANOVA Results
5.3.4. F-Test for Regression Relation
5.4. Coefficients
5.4.1. Multiple Determination
5.4.2. Multiple Correlation
5.5. Inference about the Regression Parameters
5.5.1. Interval Estimation of βk
5.6.Inference About Mean Response
5.6.1. Interval Estimation of E(Yh)
5.6.2. Confidence Region for Regression Surface
5.6.3. Simultaneous Confidence Intervals for Several Mean Responses
5.7.Predictions
5.7.1. New Observation, Yh(new)
5.7.2. Mean of m New Observations
5.7.3. New Observations
5.8.Diagnostics and Remedial Measures
5.8.1. Diagnostic Plots
5.8.2. Formal Tests
5.8.3. Remedial Measures
5.9.Extra Sum of Squares
5.10. Coefficient of Partial Determination
5.11. Standardized Multiple Regression
5.12. Multicollinearity
5.12.1. Uncorrelated Predictors
5.12.2. Correlated Predictors
5.12.3. Effects of Multicollinearity
6. Model Building, Diagnostics and Remedial Measures
6.1.Model Building Process
6.2.Criteria for Model Selection
6.2.1. R²p or SSEp Criterion
6.2.2. R²a,p or MSEp Criterion
6.2.3. Mallows Cp Criterion
6.2.4. AIC and SBC Criteria
6.2.5. PRESSp Criterion
6.3.Automatic Search Procedures for Model Selection
6.3.1. Best Subsets Algorithms
6.3.2. Stepwise Regression
6.3.3. Forward Selection
6.3.4. Backward Selection
6.3.5. Danger of Automatic Selection Procedures
6.4.Diagnostic Methods
6.4.1. Identifying Outlying Y Observations
6.4.2. Identifying Outlying X Observations
6.4.3. Identifying Influential Cases
6.4.4. Multicollinearity Diagnostics
6.5.Remedial Measures
6.5.1. Weighted Least Squares
6.5.2. Ridge Regression
6.5.3. Robust Regression
6.5.4. Regression with Autocorrelated Errors
6.5.5. Nonlinear Regression
Textbook:
Montgomery, D.C., Peck, E.A. and Vining, G.G. (2002). Introduction to Linear Regression
Analysis (3rd Edition). John Wiley & Sons, Inc., New York.
References:
1. Carroll, R.J. and Ruppert, D. (1988). Transformation and Weighting in Regression.
Chapman & Hall, London.
2. Chatterjee, S. and Price, B. (1977). Regression Analysis by Example; Wiley, New York.
3. Draper, N.R. and Smith, H. (1998). Applied Regression Analysis (3rd Edition). John
Wiley & Sons, Inc., New York.
4. Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression: A Second Course in
Statistics. Addison-Wesley, Reading, Mass.
5. Myers, R.H. (1990). Classical and Modern Regression with Applications (2nd Edition).
PWS-Kent Publishers, Boston.
Contents
1. Introduction to Regression
1.1. Dealing with relationships between variables (Correlation and Chi-square test of
association)
1.2. Historical background of Regression
1.3. Steps and objectives of regression
Introduction to Regression Analysis

1. Introduction
Regression analysis is a statistical technique that can be used to develop a
mathematical equation showing how variables are related. Our objective is
therefore to establish a mathematical relationship that can help us in
determining the value of the dependent variable. The graph obtained from such a
relationship may be a line or a curve (non-linear). If the relationship is
linear, the regression is called linear regression; if the relationship is non-
linear, the regression is known as non-linear regression.

Regression analysis is a conceptually simple method for investigating
functional relationships among variables. The relationship is expressed in the
form of an equation or a model connecting the response or dependent variable
and one or more explanatory or predictor variables.

Regression is the estimation or prediction of values of one variable from
known values of one or more other variables. The variable whose value is to be
estimated or predicted is known as the dependent (predicted) variable, while the
variables whose values are used to determine the value of the dependent
variable are called independent (predictor) variables.

Regression can be classified into two types according to the number of
variables. If there are only two variables (one dependent and one independent),
the regression is called simple regression. If more than two variables are
involved, the regression is known as multiple regression.

Because of its wide application and its necessity in research, it is important
to learn the concepts and applications of regression.

Regression analysis (a brief overview)

Regression is a procedure to predict, estimate, or forecast some unknown
quantity under study.
Let us denote the response variable by Y and the set of predictor variables by
X1, ..., Xp, where p denotes the number of predictor variables. The true
relationship between Y and X1, ..., Xp can be approximated by the regression
model

Y = f(X1, X2, ..., Xp) + ε,

where ε is assumed to be a random error representing the discrepancy in the
approximation. It accounts for the failure of the model to fit the data exactly.
The function f(X1, X2, ..., Xp) describes the relationship between Y and X1, X2,
..., Xp. An example is the linear regression model

Y = β0 + β1X1 + β2X2 + ... + βpXp + ε.

The predictor or explanatory variables are also called by other names such as
independent variables, covariates, regressors, factors, and carriers. The name
independent variable, though commonly used, is the least preferred, because in
practice the predictor variables are rarely independent of each other.

1.1. Dealing with Relationships between Variables (Correlation and
Chi-square Test for Association)

When more than one variable is involved in a study, it is necessary to examine
the relationships, interactions, or influences among them. In the same way, it is
necessary to deal with relationships between variables in regression analysis.
The task in regression analysis is to check for influence or association, with a
cause-and-effect relation, among the variables under study. Only if some
relation exists among (at least some of) the variables can one carry out a
regression analysis.

Regression analysis discovers the relationship between one or more response
variables and the predictors. If a study involves more than a single
variable, the relationships between the variables have to be discussed. Any
researcher interested in relations between variables can use regression.

For instance, a researcher may conduct a study on height-weight data, which
leads to knowledge of the relation between height and weight.
In any study that deals with relationships, it is important to investigate the
following questions:
(1) Which variables in the study move up or down together?
(2) Among the variables under study, which are more or less strongly related?
(3) Is it possible to predict the response from the predictors?

To deal with this type of relationship or association we may use inferential
tests, which also allow us to examine differences between groups. For studies
that involve relationships or associations, the commonly used statistical tests
are
a. Correlation
b. Chi-square tests and
c. Simple or multiple linear regression
These tests can answer questions such as:
A. Is there a relationship between two variables?
B. Is there any association between the variables?

a. Correlation
Correlation tests are used to examine the relationship between two or more
variables, giving both its degree and its direction. If the variables are
quantitative it is usual to use Pearson's method; otherwise Spearman's
coefficient is used.
To carry out such tests the following points must be known:
- The number of variables involved
- Which are the independent and dependent variables
- The types of the variables
If a test gives P < 0.05, the relationship between the variables is
statistically significant.
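
As a brief illustration, here is a minimal sketch (using the SciPy library; the
data values and variable names are hypothetical, not from the course) of how
both coefficients can be computed:

```python
# Minimal sketch: Pearson and Spearman correlation with SciPy.
# The height/weight values below are made-up illustrative data.
from scipy import stats

height = [150, 155, 160, 165, 170, 175, 180]   # quantitative X
weight = [52, 55, 60, 63, 68, 72, 77]          # quantitative Y

r, p_pearson = stats.pearsonr(height, weight)      # for quantitative data
rho, p_spearman = stats.spearmanr(height, weight)  # rank-based alternative

print(f"Pearson r = {r:.3f} (P = {p_pearson:.4f})")
print(f"Spearman rho = {rho:.3f} (P = {p_spearman:.4f})")
# P < 0.05 indicates a statistically significant relationship.
```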
b. Chi-square tests
The chi-square test is used to test whether a statistically significant
association exists between two categorical variables. It usually accompanies a
cross tabulation of the two variables, popularly known as a contingency table.
By performing this test we can answer questions such as:
1. Is there any relationship/association between the variables involved in the
given study?
2. Does the data fit well (goodness of fit)?
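
As a minimal sketch (the contingency counts below are hypothetical), the test
can be carried out with SciPy as follows:

```python
# Minimal sketch: chi-square test of association for a 2x2
# contingency table of made-up counts.
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: improved / not improved.
table = [[30, 10],
         [20, 25]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.3f}, df = {dof}, P = {p:.4f}")
# P < 0.05 suggests a significant association between the two
# categorical variables.
```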
c. Simple or multiple linear regression, which this course deals with, is used
to find the form of the relationship between variables.
1.2. History
The modern era in this area was inaugurated by Sir Francis Galton (1822-
1911), in his book Hereditary Genius: An Enquiry into its Laws and
Consequences of 1869, and his paper "Regression towards mediocrity in
hereditary stature" of 1886. Galton's real interest was in intelligence, and how
it is inherited. But intelligence, though vitally important and easily
recognisable, is an elusive concept: human ability is infinitely variable (and
certainly multi-dimensional!), and although numerical measurements of general
ability exist (intelligence quotient, or IQ) and can be measured, they can serve
only as a proxy for intelligence itself. Galton had a passion for measurement,
and resolved to study something that could be easily measured; he chose human
height. In a classic study, he measured the heights of 928 adults, born to 205
sets of parents. He took the average of the father's and mother's heights (mid-
parental height) as the predictor variable x, and the height of the offspring as
the response variable y. (Because men are statistically taller than women, one
needs to take the gender of the offspring into account. It is conceptually
simpler to treat the sexes separately and focus on sons, say, though Galton
actually used an adjustment factor to compensate for women being shorter.) When
he displayed his data in tabular form, Galton noticed that it showed elliptical
contours; that is, squares in the (x, y)-plane containing equal numbers of
points seemed to lie approximately on ellipses. The explanation for this lies in
the bivariate normal distribution and Galton's interpretation of the sample and
population regression lines (SRL) and (PRL). In (PRL), σx and σy are measures of
variability in the parental and offspring generations. There is no reason to
think that the variability of height is changing (though mean height has visibly
increased from the first author's generation to his children). So (at least to a
first approximation) we may take these as equal, when (PRL) simplifies to

y − μy = ρ(x − μx).

Hence Galton's celebrated interpretation: for every inch of height above (or
below) the average, the parents transmit to their children on average ρ inches,
where ρ is the population correlation coefficient between parental height and
offspring height. A further generation will introduce a further factor ρ, so the
parents will transmit, on average, ρ² inches to their grandchildren.
This becomes ρ³ inches for the great-grandchildren, and so on. Thus for
every inch of height above (or below) the average, the parents transmit to their
descendants after n generations on average ρ^n inches of height. As we know,
0 < ρ < 1
(ρ > 0 as the genes for tallness or shortness are transmitted, and parental
and offspring height are positively correlated; ρ < 1 as ρ = 1 would imply
that parental height is completely informative about offspring height, which is
patently not the case). So ρ^n → 0 as n → ∞:
the effect of each inch of height above or below the mean is damped out with
succeeding generations, and disappears in the limit. Galton summarised this as
"regression towards mediocrity in hereditary stature", or more briefly,
regression towards the mean (Galton originally used the term reversion instead,
and indeed the term mean reversion still survives). This explains the name of
the whole subject.

1.3. Steps and objectives

A. STEPS IN REGRESSION ANALYSIS


Regression analysis includes the following steps:
1. Statement of the problem
2. Selection of potentially relevant variables
3. Data collection
4. Model specification
5. Choice of fitting method
6. Model fitting
7. Model validation and criticism
8. Using the chosen model(s) for the solution of the posed problem.
B. Objectives
In general, the goals of a regression analysis are to predict or explain
differences in values of the outcome variable with information about values
of the explanatory variables. We are primarily interested in the following
issues:
1. The form of the relationship among the outcome and explanatory
variables, or what the equation that represents the relationship looks like.
2. The direction and strength of the relationships. As we shall learn, these are
based on the sign and size of the slope coefficients.
3. Which explanatory variables are important and which are not. This issue is
based on comparing the size of the slope coefficients (which has some
problems) and on the p-values (or confidence intervals).
4. Predicting a value or set of values of the outcome variable for a given set
of values of the explanatory variables.

EXERCISE
1. Define the term regression. Explain in detail.
2. Write down a few examples of applications of regression analysis.
3. What are the different types of regression? Explain.
4. Explain the following terms:
a. Predictor b. Regressor c. Regressed d. Data fitting e. Functions and their
types.
5. What techniques are available for dealing with relationships?
6. What are the uses of regression analysis?
7. Write short notes on the history of regression and its various
developments.
8. Discuss the steps and objectives in regression.
9. Give some model expressions and explain the terms involved.
10. What is a residual in regression? Explain it with its distribution.
Contents
2. Simple linear regression
2.1. Functional vs. Statistical relation
2.2. Formal statement of the model
2.3. Graphical representation of the Simple linear regression model
2.3.1. Interpretation of the regression parameters
2.4. Estimation of the regression parameters
2.4.1. Method of Least Squares (LS)
2.4.2. Maximum Likelihood Estimation (MLE)
2.5. Inferences in the regression analysis
2.5.1. Inference for β1
2.5.2. Inference for β0
2.5.3. Inferences for Mean response
2.6. Predicting new observations
2.6.1. Predicting a single new observation
2.6.2. Predicting the Mean of m new observations
2.6.3. Confidence band for the regression line
2.7. Analysis of variance
2.8. F-test Versus t-test
2.9. General Linear test
2.10. Coefficient of Determination
Introduction to Regression Analysis

Simple Linear Regression:


In this chapter we deal with the type of regression involving only two
variables and having a linear relationship, i.e. simple linear regression. We
begin with simple regression in terms of functional relationships between two
variables.

2.1. Functional versus Statistical Relationships

It is important to be clear about the relationship between variables and its
types. We classify relationships into two kinds:
a. Functional (deterministic)
b. Statistical
A functional relationship between two variables is expressed by a
mathematical formula. If X denotes the independent variable and Y the
dependent variable, a functional relation has the form

Y = f(X).

A functional relation gives an exact relationship between the predictor X and
the response Y. For instance, consider the Fahrenheit (F) and Celsius (C)
relationship

F = (9/5)C + 32.

In this equation, when we give the value of C, we get an exact value of F.
That is, it is possible to write F = F(C) as the relationship between Fahrenheit
and Celsius.

Usually, we do not know the true relationship between X and Y, but
would like to describe or somehow use this relationship.
In the case of a statistical relationship, the relation is not exact.
We get a TREND of the association between the variables as predictor and
response. We can use a diagrammatic representation to show this relationship; it
is not a perfect one. The observations for a statistical relation do not fall
directly on the curve of relationship:

Y = f(X) + ε.

The term f(X) approximates Y, and ε is the error committed.
Because Y is not known exactly for every X, we can only approximate the
relation between X and Y.
Let us define the function f as linear,

f(X) = β0 + β1X.

Then Y becomes

Y = β0 + β1X + ε.

This is a very simple model to explain the relationship between two variables.
The function of X is the equation of a straight line.
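
To make the contrast concrete, here is a minimal simulation sketch (the
parameter values β0 = 3.5, β1 = 2, and σ = 1 are illustrative assumptions): the
functional relation returns Y exactly, while the statistical relation scatters
about the trend line.

```python
# Minimal sketch contrasting a functional and a statistical relation.
# The parameter values (beta0 = 3.5, beta1 = 2, sigma = 1) are assumed
# for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Functional (deterministic): Fahrenheit from Celsius, exact.
celsius = np.array([0.0, 20.0, 37.0, 100.0])
fahrenheit = 9 / 5 * celsius + 32          # no error term

# Statistical: Y = beta0 + beta1*X + error, so Y scatters about the line.
x = np.linspace(0, 10, 25)
y = 3.5 + 2.0 * x + rng.normal(0, 1.0, size=x.size)

print(fahrenheit)   # exactly determined by C
print(y[:5])        # varies around the trend line 3.5 + 2x
```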

2.2. Formal statement of the Model

The model can be stated as

Yi = β0 + β1Xi + εi,   i = 1, 2, ..., n.
[Figure 2.1a: Equation of a straight line, E(Y | X = x) = β0 + β1x.
Figure 2.1b: Approximating a curved mean function by a straight line adds a
fixed component to the errors.]
Assumptions:
1. In the above model, X is a known (independent) variable and Y is an
unknown (dependent) random variable.
2. β0 and β1 are parameters. These parameters are:
a. unknown
b. constant and not random
c. not dependent on the trials
3. εi is a random error term with mean E(εi) = 0 and constant variance σ².

2.3. Graphical representation of the simple linear regression model:

As we know, in any statistical analysis the graph plays an important role.
A pictorial representation of the given data gives an idea of the trend in the
data, and shows its dispersion, variation, or scatter. Although graphs
supplement the data and are not a substitute for it, it is useful to present
data in a graph. Usually such graphs are two-dimensional, that is, bivariate.
The fundamental graphical tool for looking at regression data is the two-
dimensional scatter plot. In regression problems with one predictor and one
response, the scatter plot of the response versus the predictor is the starting
point for regression analysis. In problems with many predictors, several simple
graphs will be required at the beginning of an analysis. A scatter plot matrix
is a convenient way to organize looking at many scatter plots at once.
Consider a regression problem with one predictor, which we will generically
call X, and one response variable, which we will call Y. The data consist of
values (xi, yi), i = 1, ..., n, of (X, Y) observed on each of n units or cases.
In any particular problem, both X and Y will have other names that are more
descriptive of the data to be analyzed. The goal of regression is to understand
how the values of Y change as X is varied over its range of possible values. A
first look at how Y changes as X is varied is available from a scatter plot.
Below is a graphical representation for a simple linear regression.

Figure 2.2
Figure 2.2 is a scatter plot of mothers' and daughters' heights in the Pearson
and Lee data. The original data have been jittered to avoid overplotting; if
rounded to the nearest inch they would return the original values provided by
Pearson and Lee.

Here are some important characteristics of Figure 2.2:


1. The range of heights appears to be about the same for mothers and for
daughters. Because of this, we draw the plot so that the lengths of the horizontal
and vertical axes are the same, and the scales are the same. If all mothers and
daughters had exactly the same height, then all the points would fall exactly
on a 45° line. Some computer programs for drawing a scatter plot are not
smart enough to figure out that the lengths of the axes should be the same,
so you might need to resize the plot or to draw it several times.
2. The original data that went into this scatterplot was rounded so each of the
heights was given to the nearest inch. If we were to plot the original data,
we would have substantial overplotting with many points at exactly the same
location. This is undesirable because we will not know if one point represents
one case or many cases, and this can be very misleading. The easiest solution
is to use jittering, in which a small uniform random number is added to each
value. In Figure 2.2, we used a uniform random number on the range from
−0.5 to +0.5, so the jittered values would round to the numbers given in the
original source.
3. One important function of the scatter plot is to decide if we might reasonably
assume that the response on the vertical axis is independent of the predictor on
the horizontal axis. This is clearly not the case here since as we move across
Figure 2.2 from left to right, the scatter of points is different for each value of
the predictor.
4. The scatter of points in the graph appears to be more or less elliptically
shaped, with the axis of the ellipse tilted upward.
5. Scatter plots are also important for finding separated points, which are either
points with values on the horizontal axis that are well separated from the other
points or points with values on the vertical axis that, given the value on the
horizontal axis, are either much too large or too small. In terms of this example,
this would mean looking for very tall or short mothers or, alternatively, for
daughters who are very tall or short, given the height of their mother.
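
As a minimal sketch of such a plot (the mother/daughter heights below are
simulated under assumed parameter values, not the actual Pearson and Lee data),
jittering and equal axis scales can be produced as follows:

```python
# Minimal sketch of a jittered scatter plot with equal axis scales.
# The heights are simulated; the generating values are assumptions.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
mother = rng.normal(62.5, 2.4, 500).round()              # heights in inches
daughter = (29 + 0.54 * mother + rng.normal(0, 2.3, 500)).round()

# Add uniform jitter on (-0.5, +0.5) so rounded values do not overplot.
jm = mother + rng.uniform(-0.5, 0.5, 500)
jd = daughter + rng.uniform(-0.5, 0.5, 500)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(jm, jd, s=8)
ax.set_xlabel("Mother's height (in)")
ax.set_ylabel("Daughter's height (in)")
ax.set_aspect("equal")    # same scale on both axes (45-degree check)
plt.show()
```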
2.4. Estimation of the Regression Parameters
Many methods are available for obtaining estimates of the parameters in a
model. Parameters are unknown quantities that characterize a model. Estimates
of parameters are computable functions of the data and are therefore
statistics. To keep this distinction clear, parameters are denoted by Greek
letters like β, σ, and ε, and estimates of parameters are denoted by putting a
hat over the corresponding Greek letter.
As we define a (simple linear) model

y = β0 + β1x + ε,                                        (1.1)

this model has a single regressor x that has a straight-line relationship with
the response y. The error term ε is assumed to have mean zero and unknown
variance σ². It is also to be noted that the errors are uncorrelated, that is,
they are independent.

The mean response at any value x of interest is

E(y | x) = β0 + β1x,

and the variance of y given any value of x is

Var(y | x) = Var(β0 + β1x + ε) = σ².

Thus, the true regression model is a line of mean values.

The slope β1 can be interpreted as the change in the mean of Y for a unit
change in X. Furthermore, the variability of Y at a point X is determined by
the variance of the error component of the model. This gives a distribution of
Y values at each X, and the variance of this distribution is the same at each
X. For instance, consider a true regression model relating height (X) and
weight (Y) as E(y | x) = 3.5 + 2x with variance σ² = 2, and use the normality
assumption for the error, i.e. for the random variation. Since y is the sum of
a constant (the mean) and a normally distributed error, y itself is normally
distributed.
To estimate or predict y at any given point x, we use the above equation. For
example, if x = 10 then E(y | 10) = 3.5 + 2(10) = 23.5, with the same variance.
The variance σ² determines the amount of variability, or noise, in the
observations Y. If the variance is small, the points will fall close to the
line (the TREND); if it is large, the points may deviate considerably from the
line. The regression equation gives an approximation to the true functional
relationship between the variables under study.
An important objective of any regression analysis is to estimate the unknown
parameters in the given model. This process is also called fitting the model.
There are several estimation techniques available to fit a given model. For our
purposes we give a brief discussion of two estimation procedures for finding
the regression parameters.
They are
1. Method of least squares
2. Maximum likelihood estimators
A regression model is used to establish an empirical relationship between the
variables (independent and dependent), which must have a basis in sample data,
rather than a cause-and-effect relationship between the variables. Any causal
relationship must be suggested by theoretical considerations.

Regression analysis can aid in confirming a cause-effect relationship, but it
cannot be the sole basis of such a claim. Regression analysis is one part of
problem solving by the data-analytic approach; this means the equations may not
be the main objective of the study. Gaining information about, and
understanding of, the system from the data is more important.
To get a BEST FIT we have to take care about the data and the process used to
collect it. Any regression analysis is only as good as the data on which it is
based. As in clinical trials, we have three important designs for collecting
data. They are
1. Retrospective studies
2. Observational studies and
3. Designed trials
To get a simplified analysis and a more suitable fitted model we need good
data. Now we can devise an estimation process to get the parameters.
In equation (1.1), the regressor x is controlled by the analyst, and the
discrepancy between the observed and fitted y is the error to be minimized.
The parameters β0 and β1 are usually called regression coefficients; these
coefficients have a simple and useful interpretation.

2.4.1. Method of least squares

Our aim is to estimate the unknown parameters in equation (1.1). Here the
unknown parameters are β0 and β1; they have to be estimated from the given
sample data. To estimate these parameters we apply the method of least squares:
we estimate the parameters so that the sum of squares of the differences
between the observations and the straight line is a minimum.

Let the model be

yi = β0 + β1xi + εi,   i = 1, 2, ..., n.

The above equation is a sample regression model with n pairs of data (xi, yi),
i = 1, 2, ..., n.
The least squares criterion is

S(β0, β1) = Σ (yi − β0 − β1xi)².

Hereafter we write b0 and b1 for the estimates of β0 and β1 to be found.
Differentiating S(β0, β1) with respect to the parameters and setting the
derivatives to zero gives the normal equations

∂S/∂β0 = −2 Σ (yi − b0 − b1xi) = 0   and
∂S/∂β1 = −2 Σ (yi − b0 − b1xi) xi = 0,

and after solving,

b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,   b0 = ȳ − b1x̄.

The fitted value for case i is given by b0 + b1xi, for which we use the
shorthand notation ŷi.

Although the ei are not parameters in the usual sense, we shall use the same
hat notation to specify the residuals: the residual for the ith case, denoted
ei, is given by the equation

ei = yi − ŷi = yi − (b0 + b1xi),

which should be compared with the equation for the statistical errors,

εi = yi − (β0 + β1xi).
All least squares computations for simple regression depend only on
averages, sums of squares and sums of cross-products. Definitions of the
quantities used are given in Table 2.1. Sums of squares and cross-products have
been centred by subtracting the average from each of the values before squaring
or taking cross-products. Alternative formulas for computing the corrected sums
of squares and cross-products from uncorrected sums of squares and
cross-products, often given in elementary textbooks, are useful for
mathematical proofs, but they can be highly inaccurate when used on a
computer and should be avoided.
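
As a minimal sketch (with illustrative data values), the centred computations
above translate directly into code:

```python
# Minimal sketch of the least squares computations using centred sums
# of squares and cross-products (the data values are illustrative).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.0, 7.8, 10.2])

xbar, ybar = x.mean(), y.mean()
SXX = np.sum((x - xbar) ** 2)          # centred sum of squares of x
SXY = np.sum((x - xbar) * (y - ybar))  # centred sum of cross-products

b1 = SXY / SXX                         # slope estimate
b0 = ybar - b1 * xbar                  # intercept estimate

yhat = b0 + b1 * x                     # fitted values
e = y - yhat                           # residuals; their sum is ~0

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, sum of residuals = {e.sum():.2e}")
```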

Table 2.1 also lists definitions for the usual univariate and bivariate summary
statistics: the sample averages (x̄, ȳ), the sample variances (SD²x, SD²y), and
the estimated covariance and correlation (sxy, rxy). The hat rule described
earlier would suggest that different symbols should be used for these
quantities; for example, ρ̂xy might be more appropriate for the sample
correlation if the population correlation is ρxy.

This inconsistency is deliberate, since in many regression situations these
statistics are not estimates of population parameters.
The OLS estimators are those values β̂0 and β̂1 that minimize the function

RSS(β0, β1) = Σ (yi − β0 − β1xi)².

When evaluated at (β̂0, β̂1), we call the quantity RSS(β̂0, β̂1) the residual
sum of squares, or just RSS.
They are given by the expressions

β̂1 = SXY / SXX,   β̂0 = ȳ − β̂1x̄.
The residual is the difference between the observed value yi and the
corresponding fitted value.

Mathematically, the ith residual is

ei = yi − ŷi,   i = 1, 2, ..., n.

To measure the adequacy of the fitted model we need these residuals, and we
also need them to detect departures from the original values.

After the fitting, it must be remembered that the interpretation of the
parameters and residuals is important.

Some questions to be taken into consideration for the fitted model include
the following:

1. How well does the model fit the data?
2. Is it useful for prediction?
3. Are any assumptions violated?

Properties of the Fitted Regression Line

The estimated regression line above, fitted by the method of least squares, has
a number of properties worth noting.

1. The sum of the residuals is zero, i.e., Σ ei = 0.
2. The sum of the squared residuals, Σ ei², is a minimum.
3. The sum of the observed values equals the sum of the fitted values:
Σ yi = Σ ŷi.
4. The sum of the weighted residuals is zero when the residual in the ith trial
is weighted by the level of the predictor variable in the ith trial: Σ xi ei = 0.
5. The regression line always goes through the point (x̄, ȳ).

Estimation of Variance (Error Terms)

The variance σ² of the error terms is estimated by the mean squared error,

MSE = SSE / (n − 2) = Σ (yi − ŷi)² / (n − 2),

which is an unbiased estimator of σ².

2.4.2. Maximum Likelihood Estimators

The least squares procedure gives the best linear unbiased estimators of the
regression coefficients. An alternative method of parameter estimation is
maximum likelihood estimation (MLE), which can be used when the form of the
distribution of the errors is known.

Consider data (x, y) consisting of n pairs of observations, and assume that the
errors in the regression model are NID(0, σ²); then the observations yi in this
sample are independent normal random variables with mean β0 + β1xi and
variance σ².

The likelihood function is the joint distribution of the observations, viewed
as a function of the unknown parameters.

For the simple linear regression model with normal errors, the likelihood
function is

L(β0, β1, σ²) = (2πσ²)^(−n/2) exp{ −(1/(2σ²)) Σ (yi − β0 − β1xi)² }.
The MLEs are the parameter values, say β̃0, β̃1, and σ̃², that maximize L. The
estimators β̃0 and β̃1 coincide with the least squares estimators b0 and b1,
while σ̃² = Σ (yi − ŷi)² / n. The unbiased estimator MSE differs only slightly
from the MLE σ̃², especially if n is not small.

It is to be noted that MLEs have better statistical properties than least
squares estimators. However, the MLEs require a full distributional assumption:
in this case, that the random errors follow a normal distribution with the same
second moments as required for the least squares estimates.
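
As a sketch of the standard argument, taking logarithms of the likelihood shows
why the MLEs of β0 and β1 coincide with the least squares estimates:

```latex
\ln L(\beta_0, \beta_1, \sigma^2)
  = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2 .
```

For any fixed σ², maximizing ln L over β0 and β1 amounts to minimizing
Σ (yi − β0 − β1xi)², which is exactly the least squares criterion; maximizing
over σ² then gives σ̃² = Σ (yi − ŷi)²/n, which differs from MSE only in
dividing by n rather than n − 2.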

2.5. Inferences in Regression analysis

We now consider testing hypotheses and constructing confidence intervals about
the model parameters. In regression analysis these procedures require the
additional assumption that the model errors are NID with mean 0 and variance
σ². We consider inferences about the regression parameters, covering both
interval estimation of these parameters and tests about them.

In linear regression, estimation can be done using LSE, which does not require
the normality assumption. From LSE, we obtain b1 and b0, which are the point
estimators of the parameters β1 and β0, respectively.
Statistical inference, however, is based on the normal distribution.
The errors εi are assumed to be mutually independent and (identically)
normally distributed.
When normality is in doubt, we rely on the Central Limit Theorem and assume
that the inference is approximately correct.
2.5.1. Inferences Concerning β1

It is important to estimate and draw inferences about the parameters; thus we
are interested in drawing inferences about one of the parameters, the slope β1
of the regression line in the model. This parameter describes the change in the
average response per unit change in x.

Suppose that we wish to test the hypothesis that the slope equals a constant,
say β10. The appropriate hypotheses are

H0: β1 = β10   versus   Ha: β1 ≠ β10.

[Figure: Regression model when β1 = 0.]

For the normal error regression model, the condition β1 = 0 implies even more
than no linear association between the variables. Since in this model all
probability distributions of y are normal with constant variance, and the means
are equal under the null hypothesis, it also follows that the probability
distributions of y are identical under the null hypothesis. As the figure
depicts, the normal error regression model with β1 = 0 implies not only that
there is no linear association between x and y but also that there is no
relation of any type between the variables, since the probability distributions
of y are then identical at all levels of x.

It is good to consider the sampling distribution of b1, the point estimator of
β1, when we draw inferences about the parameter β1.

Sampling distribution of b1

The point estimator b1 was expressed before; it is

b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)².

The sampling distribution of b1 refers to the different values of b1 that would
be obtained with repeated sampling when the levels of the predictor variable X
are held constant from sample to sample. For this sampling distribution under
the normal error regression model we have the mean and variance

E(b1) = β1,   Var(b1) = σ² / Σ (xi − x̄)².

Here b1 is a linear combination of the Yi. This leads to normality of the
sampling distribution of b1, since a linear combination of independent normal
random variables is normally distributed. The estimator b1 is unbiased for β1
(and, by the Gauss-Markov theorem, has minimum variance among all unbiased
linear estimators). The variance of b1 can be estimated from the expression
above by replacing σ² in the numerator with MSE; thus the unbiased estimator of
Var(b1) is

s²{b1} = MSE / Σ (xi − x̄)².

Taking the positive square root gives the estimated standard deviation s{b1}.

Sampling distribution of (b1 − β1)/s{b1}

The studentized statistic (b1 − β1)/s{b1} follows the t distribution with
n − 2 degrees of freedom.

Rearranging and relating the above inequalities, we obtain the (1 − α)
confidence limits for β1:

b1 ± t(1 − α/2; n − 2) s{b1}.

Now suppose we have specified a two-sided alternative. Since the errors follow
NID(0, σ²), the observations yi are NID(β0 + β1xi, σ²). Now b1 is a linear
combination of the observations, so it is normally distributed with the mean
and variance given above.

The test statistic is

t* = (b1 − β10) / s{b1},

which follows the t(n − 2) distribution when H0 holds.

In this t statistic, the denominator is often called the estimated standard
error (simply, the standard error) of the slope,

se(b1) = s{b1} = sqrt( MSE / Σ (xi − x̄)² ).
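
As a minimal sketch (illustrative data, with α = 0.05 assumed), the t test and
confidence interval for β1 can be computed as follows:

```python
# Minimal sketch: t test and confidence interval for the slope beta1
# (the data values are illustrative; alpha = 0.05 assumed).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])
n = x.size

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - ybar)) / Sxx
b0 = ybar - b1 * xbar

e = y - (b0 + b1 * x)
MSE = np.sum(e ** 2) / (n - 2)          # unbiased estimator of sigma^2
se_b1 = np.sqrt(MSE / Sxx)              # estimated standard error of b1

t_star = b1 / se_b1                     # tests H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_star), df=n - 2)

t_crit = stats.t.ppf(0.975, df=n - 2)   # 95% two-sided critical value
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(f"t* = {t_star:.2f}, P = {p_value:.4f}, 95% CI = {ci}")
```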

2.5.2. Inferences Concerning β0

There are occasions, though not frequent ones, when we wish to make inferences
concerning β0, the intercept of the regression line. These occur when the
scope of the model includes X = 0.

Sampling Distribution of b0

From the least squares estimates, b0 = ȳ − b1x̄. The sampling distribution of
b0 refers to the different values of b0 that would be obtained with repeated
sampling when the levels of the predictor variable x are held constant from
sample to sample. For this model the sampling distribution of b0 is normal, and
its mean and variance are

E(b0) = β0,   Var(b0) = σ² [ 1/n + x̄² / Σ (xi − x̄)² ].

The coefficient b0 follows normality because, like the other coefficient b1, it
is a linear combination of the observations Yi. An estimator of Var(b0) is
obtained by replacing σ² with MSE:

s²{b0} = MSE [ 1/n + x̄² / Σ (xi − x̄)² ].

Hence the confidence interval and the test for this coefficient use the t
distribution.

The 1 − α confidence limits for this coefficient are obtained in the same
manner as those of the previous coefficient and are expressed as

b0 ± t(1 − α/2; n − 2) s{b0},

and the test statistic is

t* = (b0 − β00) / s{b0},

where se(b0) = s{b0} is the standard error of the intercept.

Some considerations on making inferences concerning β0 and β1:

1. Effects of departures from normality
2. Interpretation of the confidence coefficient and risks of errors
3. Spacing of the X levels
4. Power of tests

2.5.3. Inferences for the Mean Response

It is well known that one of the main goals of regression analysis is to
estimate the mean of the distribution of Y for a specific value of the
predictor X.

Let Xh denote the level of X for which we wish to estimate the mean response.
Xh may be a value which occurred in the sample, or it may be some other value
of the predictor variable within the scope of the model. The mean response
when X = Xh is denoted by E{Yh}. The point estimator Ŷh of E{Yh} is

Ŷh = b0 + b1Xh.

First, we consider the sampling distribution of Ŷh at the value X = Xh.

The estimated mean response at X = Xh is unbiased, E{Ŷh} = E{Yh}, and its
variance is

Var(Ŷh) = σ² [ 1/n + (Xh − x̄)² / Σ (xi − x̄)² ].

[Figure: Effect on Ŷh of the variation in b1 from sample to sample.]

Since cov(ȳ, b1) = 0, the sampling distribution of Ŷh is normal, and
(Ŷh − E{Yh}) / s{Ŷh} follows the t(n − 2) distribution.

Note that the width of the confidence interval is minimum for Xh = x̄. We
might expect our best estimates of y to be made at x values near the center of
the data, and the precision of estimation to deteriorate as we move toward the
boundary of the x-space.

Many regression analysts never use a regression model to extrapolate beyond
the range of the original data. The term extrapolate means applying the
prediction equation beyond the boundary of the x-space.

The above CI points out that the issue of extrapolation is much more subtle:
the farther the x value is from the centre of the data, the more variable the
estimate Ŷh becomes.
Moreover, the limits widen as the prediction point approaches the boundary, and
as we move beyond this boundary the prediction may deteriorate rapidly.
Further, the farther we move away from the original region of x-space, the more
likely it is that equation or model error will play a role in the process. This
is not the same as saying "never extrapolate". The CI supports such use of the
prediction equation, but it does not support using the regression model to
forecast many periods into the future. In general, the greater the
extrapolation, the higher is the chance of equation error or model error
impacting the results. The probability statement associated with the CI holds
only when a single confidence interval for the mean response is to be
constructed.

2.6. Predicting New Observations

An important aspect of regression analysis is to predict new observations from
the given regression model: a new observation of y corresponding to a specified
level of the regressor variable x.

If x0 is the value of the regressor variable of interest, then the point
estimate of the new value of the response, y0, is

ŷ0 = b0 + b1x0.

Now consider obtaining an interval estimate for this future observation y0.

The earlier CI is inappropriate for this problem because it is an interval for
the mean of y, not a probability statement about future observations from the
distribution. Hence we develop a prediction interval for the future observation
y0.

In this prediction process we have different situations to consider, as below:

1. Predicting a single new observation
2. Predicting the mean of m new observations
3. Confidence band for the regression line
2.6.1. Predicting a single new observation

The new single observation y to be predicted is viewed as the result of a new
trial, independent of the trials on which the regression analysis is based. Let
the new trial be at X = x0 and denote the new observation on y by y0(new), or
sometimes simply y0.
It is assumed that the model appropriate for the basic sample data continues to
be appropriate for the new observation.

To construct a prediction interval for a single new observation at X = x0, the
two sources of variability are:

a. the variability in the possible location of the distribution of Y (the mean
response), and
b. the variability within the distribution of Y (the variability of the error).

As discussed above, the equation ŷ0 = b0 + b1x0 gives the point estimator of a
new observation; it is the same point estimator as for the mean response. The
difference now is that the prediction of a single new observation is more
variable than the estimate of the mean response. That is,

Var(pred) = σ² + Var(Ŷh),

which is estimated as

s²{pred} = MSE [ 1 + 1/n + (x0 − x̄)² / Σ (xi − x̄)² ].

It is to be noted that the estimated variance of a single new prediction is
always greater than the estimated variance of the estimated mean response.

Based on the t distribution with (n − 2) df, the prediction interval is

ŷ0 ± t(1 − α/2; n − 2) s{pred}.

This interval is called a prediction interval because we are predicting a new
observation of a random variable.
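
As a minimal sketch (illustrative data, with α = 0.05 assumed), the following
contrasts the confidence interval for the mean response with the prediction
interval at the same x0; note the extra "1 +" term in the prediction variance:

```python
# Minimal sketch: CI for the mean response vs. prediction interval for
# a single new observation at x0 (illustrative data; alpha = 0.05).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])
n = x.size

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar
MSE = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

x0 = 3.5
y0_hat = b0 + b1 * x0
t = stats.t.ppf(0.975, df=n - 2)

se_mean = np.sqrt(MSE * (1 / n + (x0 - xbar) ** 2 / Sxx))      # mean response
se_pred = np.sqrt(MSE * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))  # new observation

print(f"CI for E(Y|x0):      {y0_hat - t*se_mean:.3f} .. {y0_hat + t*se_mean:.3f}")
print(f"Prediction interval: {y0_hat - t*se_pred:.3f} .. {y0_hat + t*se_pred:.3f}")
```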

2.6.2. Predicting the Mean of m New Observations

Sometimes it is necessary to predict the mean of m new observations on y at a
given level of the predictor variable. Consider m new observations at the same
level X = Xh, and let Ȳh(new) denote the mean response of these m new cases.
The point estimator of Ȳh(new) is the predicted value Ŷh itself; since all
cases have the same level of X, the prediction is

Ŷh = b0 + b1Xh,   with   s²{predmean} = MSE [ 1/m + 1/n + (Xh − x̄)² / Σ (xi − x̄)² ].

Notice the difference between this variance and that for a single new
observation: the leading term 1 is replaced by 1/m.
2.6.3. Confidence Band for the Regression Line

Sometimes we would like to obtain a confidence band (the Working-Hotelling
band) for the entire regression line E{Y} = β0 + β1X. This band enables us to
see the region in which the entire regression line lies, and it is useful for
determining the appropriateness of a fitted regression function.

2.7. Analysis of Variance (ANOVA)

We now consider the basic regression model from the perspective of ANOVA. The
ANOVA approach is a basic approach to testing the significance of a regression,
and it carries over to multiple regression models and some other types of
linear statistical models.

The ANOVA is based on a partitioning of the total variability, and of its
degrees of freedom, in the response variable y. This variation is
conventionally measured in terms of the deviations of the Yi around their mean
Ȳ, that is,

Yi − Ȳ.

The measure of total variation is the sum of squares of these deviations (SST
or SSTO), and these deviations are separated into the other variation
components. Let

SST = Σ (Yi − Ȳ)².

If all the observations in the data are the same, SST = 0; the greater the
variation among the observations, the larger SST is. Thus SST is a measure of
the uncertainty pertaining to the Y observations. When the variable X is taken
into account, the variation reflecting the uncertainty concerning Y is that of
the observations around the regression line, that is, around what we estimated
or fitted (as shown in the figure above).

The measure of the variation in the observations remaining when the predictor
variable X is taken into account is the sum of squared deviations around the
fitted line, the residual (error) sum of squares

SSE = Σ (Yi − Ŷi)².

Now we consider the partition into components:

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi).

Summing the squares of these equivalent expressions gives

Σ (Yi − Ȳ)² = Σ (Ŷi − Ȳ)² + Σ (Yi − Ŷi)²,   i.e.,   SST = SSR + SSE.

The decomposed variation above involves two parts:

1. the part explained by the regression line (SSR, the explained part), and
2. the unexplained part (SSE).

For computation, the above expressions can be written as

SSR = Σ (Ŷi − Ȳ)² = b1² Σ (xi − x̄)²,   SSE = SST − SSR.

The degrees of freedom are partitioned correspondingly: SST has n − 1 degrees
of freedom, SSR has 1, and SSE has n − 2.

It is convenient to display the above components in an ANOVA table, which makes
the sums of squares easy to read.

ANOVA table

Source of Variation   SS                  df      MS                  F
Regression            SSR = Σ(Ŷi − Ȳ)²    1       MSR = SSR/1         F* = MSR/MSE
Error                 SSE = Σ(Yi − Ŷi)²   n − 2   MSE = SSE/(n − 2)
Total                 SST = Σ(Yi − Ȳ)²    n − 1

From the above table it is easy to calculate the F statistic:

F* = MSR / MSE.

The F statistic tests whether the regression model has any predictive or
explanatory ability; the null hypothesis (H0) is that it does not.

The hypothesis and its alternative are

H0: β1 = 0   versus   Ha: β1 ≠ 0.

Decision RULE:

The test statistic F* follows the F(1, n − 2) distribution when H0 holds. When
the risk of a Type I error is to be controlled at level α, the decision rule
is: if F* ≤ F(1 − α; 1, n − 2), conclude H0; if F* > F(1 − α; 1, n − 2),
conclude Ha.

2.8. F-Test versus t-Test

For a given α level, the F test of H0: β1 = 0 versus Ha: β1 ≠ 0 is equivalent
to the two-sided t test: the test statistics are related by (t*)² = F*, so the
two tests lead to the same conclusion. The t test, however, is more flexible,
since it can also be used for one-sided alternatives.
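
As a minimal sketch (illustrative data), the ANOVA decomposition, the F
statistic, and the identity (t*)² = F* can be verified numerically:

```python
# Minimal sketch: ANOVA decomposition for simple regression and the
# identity F* = (t*)^2 for testing beta1 = 0 (illustrative data).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])
n = x.size

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar
yhat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)     # total, df = n - 1
SSR = np.sum((yhat - y.mean()) ** 2)  # regression, df = 1
SSE = np.sum((y - yhat) ** 2)         # error, df = n - 2

MSR, MSE = SSR / 1, SSE / (n - 2)
F_star = MSR / MSE
t_star = b1 / np.sqrt(MSE / Sxx)

print(f"SST = SSR + SSE: {SST:.3f} = {SSR:.3f} + {SSE:.3f}")
print(f"F* = {F_star:.3f}, (t*)^2 = {t_star**2:.3f}")  # identical
print(f"P = {stats.f.sf(F_star, 1, n - 2):.4f}")
```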

2.9. General Linear Test (GL-test)

The ANOVA test of β1 = 0 is one instance of a general test for a linear
statistical model. It is useful to develop this general test approach in terms
of the simple linear regression model. The general approach permits wide
application and an easier understanding of the pattern or trend of the linear
regression. This general approach is termed the General Linear Test. It
involves three fundamental steps (listed at the end of this section) and is
based on the reduction in the error sum of squares.

We compare the two error sums of squares, SSE(F) for the full model and SSE(R)
for the reduced model; SSE(F) can never exceed SSE(R), mathematically
SSE(F) ≤ SSE(R).

The reason is that the more parameters there are in the model, the better one
can fit the data and the smaller are the deviations around the fitted
regression.

When SSE(F) is not much less than SSE(R), using the full model does not account
for much more of the variability of the observations than does the reduced
model under H0. In other words, when SSE(F) is close to SSE(R), the variation
of the observations around the fitted regression function for the full model is
almost as great as the variation around the fitted regression function for the
reduced model.

A small difference between the SSEs suggests that H0 holds. On the other hand,
a large difference suggests that Ha holds, because the additional parameters in
the full model help to reduce substantially the variation of the observations
Yi around the fitted regression.

The general linear test approach above can be used for highly complex tests of
linear statistical models, as well as for simple tests.
How do we do this?
1. Fit the full model and obtain SSE(F).
2. Fit the reduced model and obtain SSE(R).
3. Compute the F statistic to test whether the full model significantly
improves on the reduced model, as illustrated below.
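
The test statistic of step 3 is F* = [(SSE(R) − SSE(F)) / (dfR − dfF)] /
[SSE(F) / dfF], which follows the F(dfR − dfF, dfF) distribution under H0. As a
minimal sketch (illustrative data), for the simple linear model with reduced
model Y = β0 + ε:

```python
# Minimal sketch of the general linear test: fit the reduced model
# (Y = beta0 + error) and the full model (Y = beta0 + beta1*X + error),
# then compare via the reduction in SSE (illustrative data).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])
n = x.size

# Reduced model under H0: beta1 = 0, so the fitted value is ybar.
SSE_R, df_R = np.sum((y - y.mean()) ** 2), n - 1

# Full model: the least squares line.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
SSE_F, df_F = np.sum((y - b0 - b1 * x) ** 2), n - 2

F_star = ((SSE_R - SSE_F) / (df_R - df_F)) / (SSE_F / df_F)
print(f"F* = {F_star:.3f}, P = {stats.f.sf(F_star, df_R - df_F, df_F):.4f}")
```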

2.10. Coefficient of Determination (R²)

The measure R-square is called the coefficient of determination. There is no
single measure of the degree of linear association that is useful for every
purpose; what counts as useful precision varies from one application to
another. There are times when the degree of linear association is of interest
in its own right, and R² is the measure most frequently used in practice to
describe the degree of linear association between x and y.

The quantity R² can be expressed as

R² = SSR / SST = 1 − SSE / SST.

Since SST (or SSTO) is a measure of the variability in y without considering
the effect of the regressor variable x, and SSE is a measure of the variability
in y remaining after x has been considered, R² is often called the proportion
of variation explained by the regressor x. It is thus a natural measure of the
effect of x in reducing the variation in y, that is, in reducing the
uncertainty in predicting y: it expresses the reduction in variation as a
proportion of the total variation.
Since 0 ≤ SSE ≤ SST, it follows that 0 ≤ R² ≤ 1. We may interpret R² as the
proportionate reduction of total variation associated with the use of the
predictor variable x. The larger R² is, the more the total variation of y is
reduced by introducing the predictor variable x. In other words, values of the
coefficient that are close to one imply that most of the variability in y is
explained by the regression model.
The limiting values of R² occur as follows:

1. R² = 1 when all observations fall directly on the fitted regression line, so
that SSE = 0 and SSR = SST.
2. R² = 0 when the fitted regression line is horizontal, so that b1 = 0 and
Ŷi = Ȳ; then SSE = SST.

In practice, R² is not likely to be zero or one, but somewhere between these
limits. The closer R² is to one, the greater the association between x and y.
[Figure: diagrams representing the coefficient of determination.]

Common misunderstandings about R² (none of the following is necessarily true):

1. A high coefficient of determination indicates that useful predictions can
be made.
2. A high coefficient of determination indicates that the estimated line is a
good fit.
3. A coefficient of determination near zero indicates that X and Y are not
related.
The relation between the coefficient of correlation and the coefficient of
determination can be expressed as

r = ± sqrt(R²),

where the sign of r is the sign of the slope b1.
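
As a minimal sketch (illustrative data), R² and its relation to r can be
checked numerically:

```python
# Minimal sketch: compute R^2 and check r = sign(b1) * sqrt(R^2)
# (the data values are illustrative).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - b0 - b1 * x) ** 2)
R2 = 1 - SSE / SST

r = np.sign(b1) * np.sqrt(R2)
print(f"R^2 = {R2:.4f}, r = {r:.4f}")  # r matches np.corrcoef(x, y)[0, 1]
```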

Exercise:

1. Define and give examples of the following terms:
i. Functional relation  ii. Statistical relation
2. What are the basic concepts involved in statistical relation?
3. What are the models available in Regression analysis?
4. In constructing a regression model, what are the points to be
discussed?
5. Explain the following terms in the construction of regression models:
A. Selection of predictor variables
B. Functional form of regression relations
C. Scope of the model
6. Construct the linear regression equation based on the given information:

Also estimate Y for some given value of x.

7. In a test of the alternatives concerning β1, an analyst concluded H0. Does
this conclusion imply that there is no linear association between x and y?
Explain.
8. The purity of oxygen produced by a fractionation process is thought to be
related to the percentage of hydrocarbons in the main condenser of the
processing unit. Twenty samples are given below:

Sample            1     2     3     4     5     6     7     8     9     10
Purity (%)        86.91 89.85 90.28 86.34 92.58 87.33 86.29 91.86 95.61 89.86
Hydrocarbons (%)  1.02  1.11  1.43  1.11  1.01  0.95  1.11  0.87  1.43  1.02

Sample            11    12    13    14    15    16    17    18    19    20
Purity (%)        96.73 99.42 98.66 96.07 93.65 87.31 95.00 96.85 85.20 90.56
Hydrocarbons (%)  1.46  1.55  1.55  1.55  1.40  1.15  1.01  0.99  0.95  0.98
a. Fit a simple linear regression.
b. Test the hypotheses concerning β1 and β0.
c. Calculate and interpret R².
d. Find the 95% confidence intervals.

9. Consider the simple linear regression model y = β0 + β1x + ε; derive
the following:
A. Estimates of the coefficients using (i) the method of least squares and
(ii) maximum likelihood estimation.
B. The 100(1 − α)% confidence intervals for (i) β1 and (ii) β0.
10. Prove that the maximum value of R² is less than 1 if the data contain
repeated (different) observations on y at the same value of x.

