
INTRODUCTION TO QUANTITATIVE ANALYSIS.

LEONARDO D. VILLAMIL.
HW6
12/05/2016
THE SETUP
Set the working directory.
setwd("E:/leonardo/education/gsu/introQuantitativeAnalysis/hw/hw6")
df <- read.csv("crime.csv")
plot(df)
A scatter-plot matrix for all pairs of variables.

plot(density(df$crime),main="crime")
plot(density(df$violent),main="violent")
Distribution for the variables crime and violent.

A scatter plot for the variables crime and violent.


plot(df$violent,df$crime,main="Scatter plot Crime vs. Violent",xlab =
"Violent",ylab="Crime")

Correlation between crime and violent.


cor(df$crime,df$violent)
0.7565051
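As a sketch of what cor() computes (the crime.csv data are not reproduced here, so a synthetic pair of variables stands in), the Pearson correlation is just the covariance rescaled by both standard deviations:

```r
set.seed(1)
violent <- rnorm(50)
crime   <- 2 * violent + rnorm(50)   # synthetic, positively related variables

r_builtin <- cor(crime, violent)
r_manual  <- cov(crime, violent) / (sd(crime) * sd(violent))
all.equal(r_builtin, r_manual)       # the two agree
```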
Bivariate regression model.
y <- lm(crime ~ violent, data = df)

Multivariate linear regression model.


y <- lm(crime ~ violent + funding + no_hs + in_college + college25, data = df)

summary(y)
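Since crime.csv is only available in the homework directory, the same fit-then-summarize pattern can be sketched on R's built-in mtcars data (the variables mpg, wt, and hp belong to mtcars, not to the crime data):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # response ~ predictors, as above
s   <- summary(fit)                      # note: summary() is lowercase in R

s$r.squared   # proportion of variance explained
coef(s)       # estimates, std. errors, t values, Pr(>|t|)
```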

2.
The typical information that we will encounter in a linear regression report is as follows:
the intercept and coefficients of the linear model;
five summary statistics of the residuals;
a test for significance of regression (model adequacy);
an analysis of variance.

The following is a linear regression report applied to the problem of analyzing
vending machine service routes and their distribution system. In this situation we are
interested in the amount of time required by the route driver to service the vending
machines in an outlet. Only two explanatory variables are used, since the report performs
the same analysis described above.
Besides the formula for the linear regression model, the output of the report presents an
analysis of variance with the following format:

The following is the report for the linear model describing the delivery time:

According to the regression table above:


Ŷ = 2.34 + 1.616a + 0.014b,
where Ŷ is the mean delivery time, a is no_cases (the number of cases delivered), and b
is the distance walked by the driver.
The intercept, in our problem, is essentially the expected time required to service a
machine when both predictors are zero, i.e., the fixed overhead of a service stop.
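Plugging illustrative numbers into the fitted equation above (the values a = 10 cases and b = 500 ft are made up for this sketch):

```r
# Mean delivery time from the fitted coefficients reported above.
predict_time <- function(a, b) 2.34 + 1.616 * a + 0.014 * b
predict_time(a = 10, b = 500)  # 2.34 + 16.16 + 7.00 = 25.5 minutes
```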

The residuals section gives five summary statistics of y − ŷ; here the min = −5.788 and the
max = 7.419.
H0 assumes that every parameter equals 0; the alternative is that at least one
differs from zero.
In testing the intercept we got a p-value of 0.01; the a parameter has a p-value ≈ 0, as
does the b parameter. These values are tabulated with *** in the row of
significance codes.
The coefficient standard error measures how much the coefficient estimate would vary
across repeated samples; ideally it is small relative to the coefficient itself.

The Pr(>|t|) entry found in the model output is the probability of observing
any value equal to or larger than |t|. A small p-value indicates that it is unlikely we would
observe such a relationship between a predictor (no_cases, distance) and the response
(delivery time) by chance. Typically, a p-value of 5% or less is a good cutoff point. In our
model example, the p-values are very close to zero. Note the Signif. codes associated with
each estimate: three stars (asterisks) represent a highly significant p-value.
Consequently, a small p-value for the intercept and the slopes indicates that we can reject
the null hypothesis, which allows us to conclude that there is a relationship between
delivery time and no_cases and distance.
Residual Standard Error is a measure of the quality of a linear regression fit.
Theoretically, every linear model is assumed to contain an error term ε. Due to the
presence of this error term, we are not capable of perfectly predicting our response
variable (delivery time) from the predictors (no_cases and distance). The Residual
Standard Error is the average amount that the response (delivery time) will deviate from
the true regression line. In our example, the actual delivery time required to maintain a
machine can deviate from the true regression line by approximately 3.259 min, on
average.
Since we estimate three parameters (intercept, no_cases, distance) from 25 observations,
the residual degrees of freedom are 25 − 3 = 22.
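Both quantities can be recomputed by hand; this sketch uses synthetic data (the delivery-time data are not reproduced here), with n = 25 observations and 3 estimated coefficients as in the report:

```r
set.seed(2)
n  <- 25
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

rss <- sum(residuals(fit)^2)
dfr <- n - 3                        # 25 observations - 3 parameters = 22
rse <- sqrt(rss / dfr)              # residual standard error by hand
all.equal(rse, summary(fit)$sigma)  # matches what summary() reports
```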
Multiple R-squared, Adjusted R-squared
The R-squared statistic (R²) provides a measure of how well the model fits the
actual data. It takes the form of a proportion of variance: R² measures the strength of the
linear relationship between our predictor variables (no_cases, distance) and our
response variable (delivery time). It always lies between 0 and 1: a number near 0
indicates a regression that explains little of the variance in the response variable, while a
number close to 1 indicates that most of the observed variance is explained. In our
example, R² is 0.9596, so roughly 95.96% of the variance found in the response variable
(delivery time) is explained by the predictors.
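A minimal sketch of this definition on synthetic data: R² is one minus the residual sum of squares over the total sum of squares, and matches what summary() reports:

```r
set.seed(3)
x <- rnorm(40)
y <- 5 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

tss <- sum((y - mean(y))^2)   # total variation in the response
rss <- sum(residuals(fit)^2)  # variation left unexplained by the model
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)
```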

F-Statistic
The F-statistic is a good indicator of whether there is a relationship between our predictors
and the response variable. The further the F-statistic is from 1, the stronger the evidence.
However, how much larger the F-statistic needs to be depends on both the number of data
points and the number of predictors. Generally, when the number of data points is large, an
F-statistic only slightly larger than 1 is already sufficient to reject the null hypothesis
(H0: there is no relationship between the predictors, no_cases and distance, and delivery
time). Conversely, when the number of data points is small, a large F-statistic is required to
conclude that there may be a relationship between predictors and response. In our example
the F-statistic is 261.2, which is much larger than 1 given the size of our data.
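As a cross-check, the F-statistic can be recovered from the reported R² alone: with k predictors and n observations, F = (R²/k) / ((1 − R²)/(n − k − 1)). Using the values stated above (R² = 0.9596, k = 2, n = 25):

```r
r2 <- 0.9596; k <- 2; n <- 25
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
round(f_stat, 1)  # about 261.3, in line with the reported 261.2
```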
Note that the model we ran above was just an example to illustrate what a linear model
output looks like in R and how we can start to interpret its components. Obviously the
model is not optimized. One way to start improving it would be to transform the
response variable.
3.
Ŷ: computed using R.
Standard error of b: computed using R.
