
April 28, 2016

Abstract
In this paper we test a claim made about the quality of American
universities by performing statistical analysis on data recorded in 2005[1].

1 Introduction

It must be noted that, on the Internet, it is surprisingly difficult to find freely
available datasets ready for statistical analysis. The data that we
have chosen to analyze consist of observed values for more than 30 different
variables from each of 1,302 North American universities sampled, out of the
total of 4,276 that existed during the year 2005 [2]. While the sample does account
for a significant proportion of the total population, it is not necessarily
random (i.e., the choice of universities from the population may have depended
on some unknown factor(s)). Nevertheless, because many of the
variables follow a normal distribution, and because the sample size is almost
one-third of the population size, we assume that the sample approximates a
random sample well enough.
We hope that, despite having to make this assumption, the analysis will be
sufficient for demonstrating that conceptually and procedurally the statistical
analysis techniques are well understood by our group.
Define
$P_t$ = the total population of 4,276 universities
$P_s$ = the population of 1,302 sampled universities
$S_i$ = a subset of $P_s$
for $i = 1$ to $j$, where $j$ depends on the test performed.
If our assumption is correct (that is, if $P_s$ was approximately randomly sampled
from $P_t$), then each $S_i$ will also be approximately random with respect to
$P_t$, and therefore our conclusion will apply to both $P_s$ and $P_t$. Otherwise our
conclusion will apply only to $P_s$. Thus, in the worst case, our conclusion is
guaranteed to be correct for at least a select 1,302 of all universities in the US.

2 Objectives

We will test the following claim:

In 2005, Northern US colleges were more effective than Southern
US colleges.
We ultimately expect to point out an interesting (albeit random) fact about the
data by testing this claim.

3 Test

3.1 Preparing the data for use with R

Before testing, we needed to make the dataset compatible with the R environment.
We first wrote a script for inserting comma (,) delimiters into the raw
textual data; we then used spreadsheet software to parse the data into tabular
form. Next, we used R's read.xlsx2() function (from the xlsx package) to directly
import the tabular data as a variable.
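As a rough illustration of the import step, the sketch below reads comma-delimited text directly with base R's read.csv() instead of going through a spreadsheet and read.xlsx2(). This is a swapped-in alternative, not the procedure we used. The rows are made up, the column names simply mirror the ones used later in this section, and treating "*" as a missing-value marker is an assumption.

```r
# Hypothetical raw lines, already comma-delimited. The rows are invented and
# the column names only mirror those used later in this section.
raw <- c(
  "Name,State,Graduation.Rate",
  "College A,NY,72",
  "College B,TX,51",
  "College C,OH,*"
)

# read.csv() can parse in-memory text; "*" is assumed to mark a missing value.
data <- read.csv(text = paste(raw, collapse = "\n"),
                 na.strings = "*", stringsAsFactors = FALSE)

# Inspect the column types that R inferred
str(data)
```

Reading the delimited text directly would likely leave the numeric columns as numbers, so less type conversion would be needed afterwards.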

3.2 Selecting relevant sample data

In our claim, we interpret effectiveness or level of success to mean graduation
rate. Typically, to test this we would first take $n_n$ samples of graduation rates
from colleges in the north, and $n_s$ samples of graduation rates from colleges
in the south. With these sample data we could then test a claim about the
difference of the two sample means.
In the case of our data, we used R to generate a subset containing only
sampled colleges in the north, and a subset containing only sampled colleges in
the south. To do this, we typed the following:
# State codes assigned to each region
# (note: "AK" is the postal code for Alaska; Arkansas is "AR")
n = c("CT","MA","RI","NH","ME","VT","NY","PA","OH","MI","IN","IL",
"WI","MN","ND","MT","ID","WA")
s = c("FL","GA","MD","NC","SC","VA","WV","DE","AL","KY","MS","TN",
"AK","LA","OK","TX")
# Keep only the rows whose State belongs to each region
north = data[which(data$State %in% n),]
south = data[which(data$State %in% s),]
We then selected the relevant data from these colleges by generating vectors
with only the graduation rates:
graten = as.integer(as.vector(t(north$Graduation.Rate)))
grates = as.integer(as.vector(t(south$Graduation.Rate)))
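The selection and conversion steps can be tried on a small, self-contained example. The data frame below is hypothetical (five made-up colleges and shortened region lists), and the graduation rates are stored as text to mimic an import that did not infer numeric types.

```r
# Toy data frame; every value here is hypothetical.
data <- data.frame(
  State = c("NY", "TX", "OH", "FL", "CA"),
  Graduation.Rate = c("72", "51", "64", "48", "80"),
  stringsAsFactors = FALSE
)

# Shortened, illustrative region lists
n <- c("NY", "OH")
s <- c("TX", "FL")

# Logical indexing keeps the rows whose State is in each list
north <- data[data$State %in% n, ]
south <- data[data$State %in% s, ]

# Text values must be converted to numbers before any arithmetic
graten <- as.integer(north$Graduation.Rate)
grates <- as.integer(south$Graduation.Rate)

print(graten)
print(grates)
```

If the column had been imported as a factor, calling as.integer() on it directly would return the internal level codes rather than the rates, which is why a conversion to character (as as.vector() does in our commands above) comes first.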

3.3 Histograms

Next we generated histograms for the north and south graduation rates, respectively. Each reveals an approximately normal distribution (though the first
dataset may appear to be slightly skewed):
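For reference, a minimal sketch of how two such histograms could be drawn side by side with base R's hist() follows. The vectors are synthetic stand-ins generated with rnorm(), not our data; the bin width of 10 and the axis labels are arbitrary choices for illustration.

```r
set.seed(1)

# Synthetic stand-ins for the two samples (NOT the real data),
# rounded and clamped to the valid 0-100 range of a percentage.
graten <- pmin(pmax(round(rnorm(300, mean = 67, sd = 18)), 0), 100)
grates <- pmin(pmax(round(rnorm(200, mean = 54, sd = 19)), 0), 100)

# Using the same bins for both plots makes the shapes comparable
breaks <- seq(0, 100, by = 10)

op <- par(mfrow = c(1, 2))  # two panels in one row
hn <- hist(graten, breaks = breaks, main = "North",
           xlab = "Graduation rate (%)")
hs <- hist(grates, breaks = breaks, main = "South",
           xlab = "Graduation rate (%)")
par(op)                     # restore the previous plot layout
```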

3.4 Summary Statistics

We then used R to compute all relevant summary statistics for both samples.
To do this, we typed the following:
> summary(graten)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  16.00   55.00   67.00   66.86   80.00  100.00
> summary(grates)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   8.00   40.25   53.00   54.17   66.00  100.00

Thus, the five-number summary for graduation rates of colleges sampled in the
north is [16, 55, 67, 80, 100], while the five-number summary for graduation rates
of colleges sampled in the south is [8, 40.25, 53, 66, 100]. We also see that the
mean graduation rate of colleges sampled in the north is $\bar{X}_N = 66.86$, while the
mean graduation rate of colleges sampled in the south is $\bar{X}_S = 54.17$.
We also used R to compute the sample variances:
> var(graten)
[1] 317.0165
> var(grates)
[1] 348.3213
Thus $s_N^2 = 317.0165$ and $s_S^2 = 348.3213$.
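To make clear what these functions report, the following sketch recomputes the same quantities on a tiny hypothetical vector. The values are made up, and the manual variance uses the usual n - 1 denominator, which is what var() uses.

```r
# Hypothetical graduation rates, chosen only for illustration
x <- c(16, 55, 67, 80, 100)

# Minimum, quartiles, and maximum (the five-number summary)
fivenum_x <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Sample mean and sample variance
mean_x <- mean(x)
var_x <- var(x)

# var() divides the sum of squared deviations by n - 1
manual_var <- sum((x - mean_x)^2) / (length(x) - 1)

print(fivenum_x)
print(c(mean = mean_x, var = var_x, manual_var = manual_var))
```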

3.5 Inferential Statistics

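Let N = graten and S = grates, and let $\mu_N$ and $\mu_S$ denote the corresponding
population mean graduation rates. We are to test the following:

$H_0 : \mu_N = \mu_S$ against $H_1 : \mu_N > \mu_S$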

We note that the sample variances are slightly different, and not similar enough
to be considered approximately equal. Thus, we apply a one-sided two-sample
Welch t-test to test our claim. To do this in R, we used the t.test() function.
Choosing a significance level of $\alpha = 0.01$, we obtained
> t.test(graten,grates,conf.level=0.99,alternative=c("greater"))

        Welch Two Sample t-test

data:  graten and grates
t = 10.355, df = 825.02, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
99 percent confidence interval:
 9.834299      Inf
sample estimates:
mean of x mean of y
 66.86355  54.17259
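The quantities in this output can be reproduced by hand. The sketch below does so on synthetic data (generated with rnorm(), not our samples) using the standard Welch formulas: the standard error sqrt(s_x^2/n_x + s_y^2/n_y), the Welch-Satterthwaite degrees of freedom, the upper-tail p-value, and the one-sided lower confidence bound. It then compares the results with t.test().

```r
set.seed(42)

# Synthetic samples (NOT the real data); sizes and parameters are arbitrary
x <- rnorm(120, mean = 67, sd = 18)
y <- rnorm(90, mean = 54, sd = 19)

# Per-sample variance of the mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch standard error and t statistic
se <- sqrt(vx + vy)
t_stat <- (mean(x) - mean(y)) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-sided (greater) p-value and 99% lower confidence bound
p_value <- pt(t_stat, df, lower.tail = FALSE)
lower <- (mean(x) - mean(y)) - qt(0.99, df) * se

# The built-in test should agree with the manual computation
res <- t.test(x, y, alternative = "greater", conf.level = 0.99)
print(c(manual_t = t_stat, builtin_t = unname(res$statistic)))
print(c(manual_df = df, builtin_df = unname(res$parameter)))
print(c(manual_lower = lower, builtin_lower = res$conf.int[1]))
```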
Thus, our p-value is less than $2.2 \times 10^{-16}$ and our (99%) confidence interval is
$[9.834299, \infty)$.

4 Conclusion

Since our p-value is less than $2.2 \times 10^{-16}$, and therefore less than $\alpha = 0.01$,
we reject $H_0$ and conclude that, in the year 2005,
the success of colleges located in the northern US was significantly greater than
that of colleges located in the southern US. Furthermore, we are 99% confident
that the mean graduation rate is at least 9.83 percentage points higher
(approximately) for colleges in the north.
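As a consistency check, the lower confidence bound can be recovered approximately from the reported figures alone. The standard error below is backed out from the rounded t statistic, and the critical value $t_{0.99,\,825.02} \approx 2.331$ is an approximate quantile of the t distribution that we supply here, so the result agrees with the R output only to about three decimal places.

```latex
\begin{align*}
\bar{X}_N - \bar{X}_S &= 66.86355 - 54.17259 = 12.69096 \\
\mathrm{SE} &\approx \frac{12.69096}{10.355} \approx 1.2256 \\
\text{lower bound} &\approx 12.69096 - t_{0.99,\,825.02} \cdot \mathrm{SE}
  \approx 12.69096 - 2.331 \times 1.2256 \approx 9.834
\end{align*}
```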

References
[1] U.S. News & World Report's Guide to America's Best Colleges, 1995.
Also listed at Amstat at http://www.amstat.org/publications/jse/datasets/usnews.txt
[2] U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics. Digest of Education Statistics.
Degree-granting Postsecondary Institutions, by Control and Level of
Institution: Selected Years, 1949-50 through 2012-13. Retrieved from
http://nces.ed.gov/programs/digest/d13/tables/dt13_317.10.asp
