4. What is the difference between Linear Regression and Logistic Regression?
Ans: 1. Linear regression requires the dependent variable to be continuous, i.e. numeric values
(no categories or groups), while logistic regression requires the dependent variable to be
binary - two categories only (0/1).
2. Linear regression is based on ordinary least squares estimation, while logistic regression is
based on maximum likelihood estimation.
3. Linear regression needs a linear relationship between the dependent and independent
variables, while logistic regression does not need a linear relationship between the dependent
and independent variables.
4. In linear regression analysis, the error term should be normally distributed, while logistic
regression does not require the error term to be normally distributed.
5. Linear regression assumes that the residuals are approximately equal (homoscedastic) across
all predicted values of the dependent variable, while logistic regression does not need the
residuals to be equal for each level of the predicted dependent variable.
6. As a rule of thumb, linear regression requires at least 5 cases per independent variable in the
analysis, while logistic regression needs at least 10 events per independent variable.
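The contrast in points 1 and 2 (continuous vs. binary target, least squares vs. maximum likelihood) can be illustrated with a short sketch, assuming scikit-learn is available; the simulated data and coefficient values below are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Continuous target -> linear regression (ordinary least squares)
y_cont = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
lin = LinearRegression().fit(X, y_cont)

# Binary 0/1 target -> logistic regression (maximum likelihood)
y_bin = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 0).astype(int)
logreg = LogisticRegression().fit(X, y_bin)

print(lin.coef_)       # estimates of the linear coefficients
print(logreg.coef_)    # log-odds coefficients of the logistic model
```

The fitted linear model predicts values on a continuous scale, while the logistic model's predictions are probabilities that are thresholded to the two classes 0 and 1.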
The likelihood contribution of the ith observation is

    L_i(β; X_i) = π(X_i)^{Y_i} [1 − π(X_i)]^{1 − Y_i}

Since the observations are assumed to be independent, the likelihood function expresses the
values of β in terms of the known, fixed values for y:

    L(β) = ∏_{i=1}^{n} L_i(β; X_i)

    ln L(β) = Σ_{i=1}^{n} ln L_i(β; X_i)
Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like
iteratively reweighted least squares is used to find an estimate of the regression coefficients, β.
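A minimal NumPy sketch of this iteratively reweighted least squares (Newton-Raphson) update follows; the function name, simulated data, and true coefficient values are illustrative, not from the text:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Estimate logistic regression coefficients by iteratively
    reweighted least squares (Newton-Raphson on the log likelihood)."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # fitted probabilities pi(x_i)
        w = p * (1.0 - p)                       # diagonal of the weight matrix W
        # Newton step: beta += (X' W X)^{-1} X' (y - p)
        beta += np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return beta

# Simulated data with known (hypothetical) coefficients beta0=0.5, beta1=2.0
rng = np.random.default_rng(1)
x = rng.normal(size=500)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = rng.binomial(1, p_true)
print(irls_logistic(x, y))
```

Each iteration solves a weighted least-squares problem whose weights w = p(1 − p) come from the current fitted probabilities, which is where the method's name comes from.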
The likelihood ratio test statistic is

    G = −2 [ l(β̂^(0)) − l(β̂) ]

where l(β̂) is the log likelihood of the fitted (full) model (i.e., under H1) and l(β̂^(0)) is the
log likelihood of the (reduced) model specified by the null hypothesis (H0), evaluated at
the maximum likelihood estimate of that reduced model. This test statistic has a χ²
distribution with (p − r) degrees of freedom, where p and r are the numbers of unknown
parameters of the full and reduced models, respectively.
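The test can be sketched with SciPy's chi-square distribution; the log-likelihood values and parameter counts below are hypothetical numbers chosen for illustration:

```python
from scipy.stats import chi2

# Hypothetical log likelihoods: full model with p = 4 parameters,
# reduced model (under H0) with r = 2 parameters
ll_full, ll_reduced = -112.3, -118.9
p, r = 4, 2

G = -2.0 * (ll_reduced - ll_full)   # likelihood ratio test statistic
p_value = chi2.sf(G, df=p - r)      # chi-square upper tail probability
print(G, p_value)
```

A small p-value means the extra parameters in the full model significantly improve the fit, so H0 (the reduced model) would be rejected.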
8. Why does a linear regression model make no sense when your dependent variable Y is binary
(only takes the values 0 or 1)?
Ans: In least-squares regression, we model the dependent variable Y as a linear function of the
X variables plus a random error that is assumed to have a normal distribution. That is, the ith Y
observation is assumed to have been generated by the following equation:

    Y_i = β_0 + β_1 X_i1 + … + β_k X_ik + ε_i

where the error ε_i is drawn from a normal distribution with mean 0 and standard deviation σ.
This means the Y_i must be a continuous variable (i.e., one that takes values on an interval),
not a binary variable (i.e., a variable taking only the values 0 and 1).
Thus, if we use regular least-squares regression when the dependent variable is binary or
dichotomous (when we should be using logistic regression), some of the assumptions are
violated, such as the least-squares requirement that the regression errors have a normal distribution.
When the assumptions that underlie the least-squares regression model are violated, we can
no longer rely on the statistical inference (e.g., which regression coefficients are significant)
or predictions that are made based on the least-squares model.
Figure below shows the kind of data that is appropriate for regular least-squares regression Y
vs X:
Figure below shows the kind of data that is appropriate for logistic regression Y vs X:
The odds ratio (OR) compares the relative chance of an event happening under two different
conditions. So if the odds of thing A are M_A to N_A and the odds of thing B are M_B to N_B,
then the odds ratio is defined as

    OR = (M_A / N_A) / (M_B / N_B)
We can also write the odds ratio in terms of probabilities. Using the formula given above to
calculate the probabilities gives

    P_A = M_A / (M_A + N_A),    P_B = M_B / (M_B + N_B)

so that, expressed in terms of probabilities, the odds ratio is:

    OR = [ P_A / (1 − P_A) ] / [ P_B / (1 − P_B) ]
For Example: Suppose that the probability of a bad outcome is 0.2 if a patient takes the
existing treatment, but that this is reduced to 0.1 if they take the new treatment. The odds of a
bad outcome with the existing treatment are 0.2/0.8 = 0.25, while the odds with the new treatment
are 0.1/0.9 = 0.111 (recurring). The odds ratio comparing the new treatment to the old
treatment is then simply the corresponding ratio of odds: (0.1/0.9) / (0.2/0.8) = 0.111 / 0.25 =
0.444 (recurring). This means that the odds of a bad outcome if a patient takes the new
treatment are 0.444 times the odds of a bad outcome if they take the existing treatment. The
odds (and hence probability) of a bad outcome are reduced by taking the new treatment. We
could also express the reduction by saying that the odds are reduced by approximately 56%,
since the odds are reduced by a factor of 0.444.
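The arithmetic in this example can be checked in a few lines of Python:

```python
# Probabilities of a bad outcome, taken from the example in the text
p_old, p_new = 0.2, 0.1

odds_old = p_old / (1 - p_old)    # 0.2/0.8 = 0.25
odds_new = p_new / (1 - p_new)    # 0.1/0.9 = 0.111 (recurring)
odds_ratio = odds_new / odds_old  # 0.444 (recurring)

print(round(odds_ratio, 3))       # 0.444
```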
Ans: A logistic regression is used for predicting the probability of occurrence of an event by
fitting data to a logit function. Let us consider the fitted logit function

    logit(p_i) = β_0 + β_1 X_i

where

    logit(p_i) = log(odds) = log( p_i / (1 − p_i) )

Equivalently, when X = X_j the odds are

    odds1 = p_{X_j} / (1 − p_{X_j}) = e^{β_0 + β_1 X_j}

and when X = X_j + 1,

    odds2 = p_{X_j+1} / (1 − p_{X_j+1}) = e^{β_0 + β_1 (X_j + 1)}

Dividing the two gives

    OR = odds2 / odds1 = e^{β_1}

so that log OR = β_1: the coefficient β_1 is the log of the odds ratio for a one-unit increase in X.
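A quick numerical check, using hypothetical coefficient values b0 and b1, that the odds ratio for a one-unit increase in X equals e^{β_1} regardless of the starting value of X:

```python
import math

# Hypothetical fitted coefficients: logit(p) = b0 + b1 * X
b0, b1 = -1.0, 0.8

def odds(x):
    """Odds p/(1-p) implied by the fitted logit at X = x."""
    return math.exp(b0 + b1 * x)

# The ratio odds(x+1)/odds(x) is e^{b1} for every starting point x,
# because the intercept and the b1*x term cancel in the ratio.
for xj in (0.0, 1.7, -3.2):
    assert abs(odds(xj + 1) / odds(xj) - math.exp(b1)) < 1e-9

print(math.exp(b1))   # the odds ratio for a one-unit increase in X
```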
The Bernoulli probability function can be written in exponential-family form:

    f(y_i; π_i) = π_i^{y_i} (1 − π_i)^{1 − y_i}
                = (1 − π_i) [ π_i / (1 − π_i) ]^{y_i}
                = (1 − π_i) exp{ y_i log( π_i / (1 − π_i) ) }    for y_i = 0 and 1

The natural parameter Q = log( π / (1 − π) ), the log odds of response 1, is called the logit of π.
GLMs that use the logit link are called logit models.
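The logit link and its inverse (the logistic function) can be sketched in a few lines; the function names here are illustrative:

```python
import math

def logit(p):
    """Log odds of response 1: the logit link Q = log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(q):
    """Inverse link (the logistic function), mapping log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-q))

p = 0.8
q = logit(p)               # log(0.8/0.2) = log(4)
print(q, inv_logit(q))
```

Because the two functions are inverses, inv_logit(logit(p)) recovers p, which is what lets a GLM model the log odds on an unbounded linear scale while keeping fitted probabilities in (0, 1).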