
qplot from the ggplot2 package file:///E:/Kelly/Homework2_qplot.html

qplot from the ggplot2 package


In this rmd file you are to add R commands where indicated to answer questions through the
creation of graphs via qplot. Each of the ten questions is worth two points (correct = 2 points,
almost correct = 1 point, otherwise 0 points). Turn in both this edited rmd file and a pdf of the
rendered rmd file.

We will examine data concerning college enrollment. First, read the data (provided via a csv file)
into R.

colleges.df <- read.csv('E:/Kelly/college.csv')


head(colleges.df)

1 of 14 10/4/2016 12:07 PM

##                           college private application.count
## 1               Cazenovia College     Yes              3847
## 2     College of Mount St. Joseph     Yes               798
## 3              Lindenwood College     Yes               810
## 4             Harvey Mudd College     Yes              1377
## 5 Missouri Southern State College      No              1576
## 6              Grove City College     Yes              2491
##   acceptance.count enrollment.count enrolled.from.top.10.percent
## 1             3433              527                            9
## 2              620              238                           14
## 3              484              356                            6
## 4              572              178                           95
## 5             1326              913                           13
## 6             1110              573                           57
##   enrolled.from.top.25.percent fulltime.undergrad.count
## 1                           35                     1010
## 2                           41                     1165
## 3                           33                     2155
## 4                          100                      654
## 5                           50                     3689
## 6                           88                     2213
##   parttime.undergrad.count phd.granted.count terminal.degree.count
## 1                       12                22                    47
## 2                     1232                46                    46
## 3                      191                65                    85
## 4                        5               100                   100
## 5                     2200                52                    54
## 6                       35                65                    65
##   s.f.ratio perc.alumni expenditures graduation.rate
## 1      14.3          20         7697             100
## 2      11.1          35         6889             100
## 3      24.1           9         3480             100
## 4       8.2          46        21569             100
## 5      20.3           9         4172             100
## 6      18.4          18         4957             100

str(colleges.df)


## 'data.frame': 777 obs. of 15 variables:
## $ college : Factor w/ 777 levels "Abilene Christian Universi
## $ private : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2
## $ application.count : int 3847 798 810 1377 1576 2491 2961 4019 4302
## $ acceptance.count : int 3433 620 484 572 1326 1110 1932 2779 992 27
## $ enrollment.count : int 527 238 356 178 913 573 628 888 418 756 ...
## $ enrolled.from.top.10.percent: int 9 14 6 95 13 57 24 40 83 46 ...
## $ enrolled.from.top.25.percent: int 35 41 33 100 50 88 68 73 96 72 ...
## $ fulltime.undergrad.count : int 1010 1165 2155 654 3689 2213 2669 3891 1593
## $ parttime.undergrad.count : int 12 1232 191 5 2200 35 616 128 5 594 ...
## $ phd.granted.count : int 22 46 65 100 52 65 71 88 93 75 ...
## $ terminal.degree.count : int 47 46 85 100 54 65 82 92 98 89 ...
## $ s.f.ratio : num 14.3 11.1 24.1 8.2 20.3 18.4 14.1 13.9 8.4
## $ perc.alumni : int 20 35 9 46 9 18 42 19 63 32 ...
## $ expenditures : int 7697 6889 3480 21569 4172 4957 8189 10872 2
## $ graduation.rate : int 100 100 100 100 100 100 100 100 100 100 ...

dim(colleges.df)

## [1] 777 15
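
The str() output shows college and private stored as factors. That matches the read.csv default in R versions before 4.0.0; newer versions read text columns as character unless stringsAsFactors = TRUE is passed. A minimal sketch, assuming an inline two-row sample copied from the head() output (the text = argument and the name sample.df are only there to keep the example self-contained):

```r
# Two rows copied from the head() output, inlined so no csv file is needed
csv.text <- "college,private,application.count
Harvey Mudd College,Yes,1377
Missouri Southern State College,No,1576"

# stringsAsFactors = TRUE reproduces the Factor columns shown by str()
sample.df <- read.csv(text = csv.text, stringsAsFactors = TRUE)
str(sample.df)
```

Keeping private as a factor matters later, because qplot maps a factor to discrete fill and colour groups.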

The dataset consists of the college name and 14 attributes:

private: whether the college is private (Yes) or not private (No)
application.count: number of applications to the college
acceptance.count: of the applicants, the number who were accepted for enrollment
enrollment.count: of those accepted, the number who enrolled
enrolled.from.top.10.percent: percent of enrollees who were in the top 10% of their high
school class
enrolled.from.top.25.percent: percent of enrollees who were in the top 25% of their high
school class
fulltime.undergrad.count: count of undergraduate students enrolled full time
parttime.undergrad.count: count of undergraduate students enrolled part time
phd.granted.count: number of doctorate degrees awarded
terminal.degree.count: number of terminal degrees awarded
s.f.ratio: student-to-faculty ratio
perc.alumni: percent of alumni who donate to the school
expenditures: school annual expenses
graduation.rate: graduation rate

Added to the data is a graduation rate category: graduation rates from 0 to 33% are labeled
'low', 34 to 70% 'medium', and 81 and over 'high'.

colleges.df$graduation.category <- cut(colleges.df$graduation.rate, breaks = c(0,33,

colleges.df$graduation.category
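
The cut() call above is cut off in this copy after its first two break values, so the remaining breaks and the labels are not visible. As a rough sketch only, assuming breaks at 0, 33, 70 and 100 and the three labels named in the text (the breaks after 33 and the toy rates are assumptions, not the original code):

```r
# Hypothetical graduation rates, chosen to sit on both sides of each break
rates <- c(15, 33, 34, 70, 85, 100)

# Assumed breaks and labels; the source line is truncated after c(0,33,
# cut() uses right-closed intervals by default: (0,33], (33,70], (70,100]
category <- cut(rates,
                breaks = c(0, 33, 70, 100),
                labels = c("low", "medium", "high"))
print(data.frame(rates, category))
```

With these defaults a rate of exactly 0 falls outside the first interval and becomes NA unless include.lowest = TRUE is added.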


Some of the attributes we will be working with include graduation.rate, application.count,
acceptance.count, enrollment.count, enrolled.from.top.10.percent, perc.alumni, and whether the
school is private. Before examining attribute relationships, it is helpful to understand the
distribution of the data.

Q1) Create a histogram using the attribute graduation rate (graduation.rate) and map the
graduation category to the fill aesthetic; the breaks should appear in 5 unit increments.

qplot(x = graduation.rate, data = colleges.df, geom = 'histogram', binwidth = 5, fil

Q2) Create a density plot for alumni donation percentage (perc.alumni); smooth the curve (adjust
factor to 10) and fill it with the 'private' attribute (set the transparency to 40%).

qplot(x = perc.alumni, data = colleges.df, geom='density',adjust=10, fill = private,

## Warning: Computation failed in `stat_smooth()`:
## object 'y' not found
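
The adjust argument scales the kernel bandwidth, so adjust = 10 gives a much smoother curve than the default. A small base R sketch of that effect using density() from the stats package, with the first ten perc.alumni values shown in the str() output (it illustrates the smoothing factor only, not the qplot call; the variable names are hypothetical):

```r
# First ten perc.alumni values as printed by str(colleges.df)
perc.alumni <- c(20, 35, 9, 46, 9, 18, 42, 19, 63, 32)

default.fit  <- density(perc.alumni)               # adjust = 1
smoothed.fit <- density(perc.alumni, adjust = 10)  # bandwidth multiplied by 10

# The bandwidth actually used is the default bandwidth times adjust
print(c(default = default.fit$bw, smoothed = smoothed.fit$bw))
```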


Let's start with the assumption that students wish to graduate. Accordingly, demand
(application.count) for colleges with high graduation rates (graduation.rate) should be
relatively high.

Q3) Create a scatterplot examining this assumption about graduation rate and demand. Include a
line (colored as darkgoldenrod3) representing the relationship in the plot (do not include the
confidence interval). Label the x axis 'Graduation Rate' and the y axis 'Applications'. Only show
colleges with 25,000 or fewer applications (by manipulating the graph presentation, not by
manipulating the dataset). Color the glyphs skyblue.

qplot(x = graduation.rate, y = application.count, xlab="Graduation Rate", ylab="Appl
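
One way the requirements of Q3 could be met is sketched below. It is not the truncated call above: the data frame toy.df holds made-up values, and method = "lm" is an assumption, since the question only asks for a line. The point colour is wrapped in I() so that qplot treats it as a constant, and the trend line gets its own colour in a separate geom_smooth layer.

```r
library(ggplot2)

# Hypothetical toy data; the real values come from college.csv
toy.df <- data.frame(
  graduation.rate   = c(45, 52, 60, 68, 75, 83, 91, 100),
  application.count = c(800, 1500, 1200, 3000, 2600, 4800, 9000, 30000)
)

p <- qplot(x = graduation.rate, y = application.count, data = toy.df,
           geom = "point", colour = I("skyblue"),
           xlab = "Graduation Rate", ylab = "Applications",
           ylim = c(0, 25000)) +
  # method = "lm" is an assumption; se = FALSE hides the confidence interval
  geom_smooth(method = "lm", se = FALSE, colour = "darkgoldenrod3")
print(p)
```

Setting ylim this way drops the toy college with 30,000 applications from both layers, so ggplot2 warns about removed rows when the plot is drawn.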


In a sentence, how would you interpret the relationship between the two variables?

The upward slope of the darkgoldenrod3 line in the scatterplot indicates that graduation
rate and application count are positively correlated.

The relationship between graduation rates and applications may be better clarified by looking at
segments of graduation rates. In other words, applicants do not respond to slight differences in
graduation rates (10% vs. 12%) but to material differences (10% vs. 50%).

Q4) Create a single graph with three boxplots, one for each graduation category (the data frame
derived attribute), for applications, restricting the y axis to 7500.

qplot(x = graduation.category, y = application.count, data = colleges.df ,geom = "bo

## Warning: Removed 82 rows containing non-finite values (stat_boxplot).
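
The warning above is expected: ylim in qplot sets the limits of the y scale, and values outside those limits are dropped before the boxplot statistics are computed. A hedged sketch with made-up data (toy.df and its values are hypothetical) contrasting that with coord_cartesian(), which only zooms the view:

```r
library(ggplot2)

# Hypothetical data: two categories, one value above the 7500 cut-off
toy.df <- data.frame(
  graduation.category = factor(c("medium", "medium", "medium",
                                 "high", "high", "high")),
  application.count   = c(900, 1800, 2500, 3000, 6000, 12000)
)

# Scale limits: the 12000 row is removed before the statistics are computed
p.limits <- qplot(x = graduation.category, y = application.count,
                  data = toy.df, geom = "boxplot", ylim = c(0, 7500))

# Coordinate zoom: all rows are used, only the visible range changes
p.zoom <- qplot(x = graduation.category, y = application.count,
                data = toy.df, geom = "boxplot") +
  coord_cartesian(ylim = c(0, 7500))

# Number of rows that fall outside the scale limits
print(sum(toy.df$application.count > 7500))
```

The two plots can show different medians and whiskers for the same category, so the choice affects the interpretation and not only the appearance.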


In a sentence, how
would you interpret the relationship as evidenced by the boxplots?

As the boxplots show, the median application count for the high graduation category
is higher than for the other two categories, so colleges with high graduation rates
tend to receive more applications.

It seems reasonable to assert that students performing better in high school (as measured by
finishing in the top 10%) would seek 'better' schools all else equal (still, there may be selection
biases whereby better performing students earn admittance to more difficult schools which would
mitigate the relationship empirically). We will define 'better' in three ways:

popular, defined as application count
selective, calculated as the ratio of acceptance to application
attractive, calculated as ratio of enrolled to accepted

Q5) Create a scatterplot relating high school performance (enrolled.from.top.10.percent) to
school selectiveness (acceptance.count divided by application.count) with a trend line (as
demonstrated in class) without displaying a confidence interval and add an additional layer
that introduces a new geom type that lays a red horizontal line across the graph at the
average of the acceptance count to application count ratio, calculated as
mean(colleges.df$acceptance.count/colleges.df$application.count). NOTE: We have not
covered the horizontal line geom type in class. This question is included to have you
research on your own some of the qplot functions.


qplot(x = colleges.df$enrolled.from.top.10.percent, y =colleges.df$acceptance.co
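
The horizontal line asked for in Q5 can be added with ggplot2's geom_hline, which takes the line position through its yintercept argument. The sketch below is an illustration under assumptions, not the truncated call above: sample.df holds only the first nine rows as printed by str(), and method = "lm" is assumed for the trend line.

```r
library(ggplot2)

# First nine rows of three columns, copied from the str() output
sample.df <- data.frame(
  enrolled.from.top.10.percent = c(9, 14, 6, 95, 13, 57, 24, 40, 83),
  application.count = c(3847, 798, 810, 1377, 1576, 2491, 2961, 4019, 4302),
  acceptance.count  = c(3433, 620, 484, 572, 1326, 1110, 1932, 2779, 992)
)

# Average acceptance-to-application ratio, used as the line position
mean.ratio <- mean(sample.df$acceptance.count / sample.df$application.count)

p <- qplot(x = enrolled.from.top.10.percent,
           y = acceptance.count / application.count,
           data = sample.df, geom = "point") +
  geom_smooth(method = "lm", se = FALSE) +            # trend line, no interval
  geom_hline(yintercept = mean.ratio, colour = "red") # red horizontal line
print(p)
```

Passing data = sample.df and bare column names, instead of colleges.df$ vectors, lets the added layers inherit the same x and y mappings.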

In a sentence, how
would you interpret the relationship as evidenced by the scatterplot?

We can see in the scatterplot that there is a negative linear relationship between
high school performance and school selectiveness (the acceptance-to-application ratio).

Q6) Create a scatterplot relating high school performance (enrolled.from.top.10.percent) to
school attractiveness (enrollment.count divided by acceptance.count) with color
representing the school's private or public status. Include a line without the confidence
interval (note: qplot will recognize the split between public and private and create two lines).

qplot(x = colleges.df$enrolled.from.top.10.percent, y = colleges.df$enrollment.c


In a sentence, how
would you interpret the relationship as evidenced by the scatterplot?

It is evident from the scatterplot that students are more attracted to the public
colleges than the private colleges.

To see if a similar relationship exists between selectiveness, attractiveness, and the percent
of the student body in the top 25% of their high school class, we simply need to establish that
the top 10% and top 25% attributes are correlated.

Q7) Create a scatterplot using the top 10% and top 25% attributes.

qplot(x = enrolled.from.top.10.percent, y = enrolled.from.top.25.percent, data =


In a sentence, how
would you interpret the relationship as evidenced by the scatterplot?

The scatterplot shows a logarithmic-shaped correlation between
enrolled.from.top.10.percent and enrolled.from.top.25.percent.
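
The visual impression can be backed by a number using base R's cor(). A minimal sketch, assuming only the first ten values of each attribute as printed by str() (so the result describes those ten colleges, not the full dataset). Spearman's rank correlation is included because it does not assume a straight-line relationship, which is relevant if the pattern is curved.

```r
# First ten values of each attribute, copied from the str() output
top10 <- c(9, 14, 6, 95, 13, 57, 24, 40, 83, 46)
top25 <- c(35, 41, 33, 100, 50, 88, 68, 73, 96, 72)

pearson  <- cor(top10, top25)                       # linear association
spearman <- cor(top10, top25, method = "spearman")  # rank-based association

print(c(pearson = pearson, spearman = spearman))
```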

Q8) Create a bubblechart using the ratio of application to fulltime enrollment
(application.count/fulltime.undergrad.count) and ratio of acceptance to fulltime enrollment
(acceptance.count/fulltime.undergrad.count) for the position aesthetics, the ratio of
enrollment to fulltime enrollment (enrollment.count/fulltime.undergrad.count) as the size
aesthetic, and the private/public status as color.

qplot(x = application.count/fulltime.undergrad.count, y = acceptance.count/fullt
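
A bubblechart in qplot is a scatterplot with an extra variable mapped to size. The sketch below shows one possible form of the mappings described in Q8; it is not the truncated call above, and sample.df contains only the first nine rows as printed by str().

```r
library(ggplot2)

# First nine rows of the relevant columns, copied from the str() output
sample.df <- data.frame(
  private = factor(c("Yes", "Yes", "Yes", "Yes", "No",
                     "Yes", "Yes", "Yes", "Yes")),
  application.count = c(3847, 798, 810, 1377, 1576, 2491, 2961, 4019, 4302),
  acceptance.count  = c(3433, 620, 484, 572, 1326, 1110, 1932, 2779, 992),
  enrollment.count  = c(527, 238, 356, 178, 913, 573, 628, 888, 418),
  fulltime.undergrad.count = c(1010, 1165, 2155, 654, 3689,
                               2213, 2669, 3891, 1593)
)

# Position, size and colour are all mapped (no I()), so each gets a legend
p <- qplot(x = application.count / fulltime.undergrad.count,
           y = acceptance.count / fulltime.undergrad.count,
           size = enrollment.count / fulltime.undergrad.count,
           colour = private,
           data = sample.df, geom = "point")
print(p)
```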


In a sentence,
describe how the school that is below .5 on the y axis and above 2.5 on the x axis is
different than the overall trend represented in the graph.

As we can see in the bubblechart, public and private schools are scattered
throughout. However, the school that is below .5 on the y axis and above 2.5 on
the x axis is a public school.

Why are the public schools clustered where they are on the graph?

Public schools are clustered where y is between 0.4 and 0.6 and x is between
0.25 and 1.

Q9) The code for a scatterplot showing applications (count) against acceptances (count) by
college is below. Note that the axis ranges have been set the same so that numerical as well
as relative selectiveness can be read from the graph.

qplot(x = application.count, y = acceptance.count, xlim = c(0, 50000), ylim = c(


A pocket of exclusive schools appears in the range of applications over 10,000 with
acceptance less than 5,000. Create the same scatterplot but with an x axis between 10000
and 20000 and a y axis between 0 and 5000. Include in this graph the names of the
schools close to the data points (black text, size aesthetic set to 4). NOTE: We have not
covered the text labels in class. This question is included to have you research on your own
some of the qplot functions.

qplot(x = application.count, y = acceptance.count, xlim = c(10000, 20000), ylim

## Warning: Removed 771 rows containing missing values (geom_point).

## Warning: Removed 771 rows containing missing values (geom_text).
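
The two warnings are consistent with the axis limits: 771 of the 777 colleges fall outside the plotted window, leaving six labelled points. Text labels can be added with ggplot2's geom_text by mapping a column to the label aesthetic. A minimal sketch, assuming the six rows from the head() output as sample.df and a hypothetical vjust offset; the axis limits from the question are left out because these six colleges all have fewer than 10,000 applications.

```r
library(ggplot2)

# Six rows copied from the head() output
sample.df <- data.frame(
  college = c("Cazenovia College", "College of Mount St. Joseph",
              "Lindenwood College", "Harvey Mudd College",
              "Missouri Southern State College", "Grove City College"),
  application.count = c(3847, 798, 810, 1377, 1576, 2491),
  acceptance.count  = c(3433, 620, 484, 572, 1326, 1110)
)

p <- qplot(x = application.count, y = acceptance.count,
           data = sample.df, geom = "point") +
  # Black labels of size 4, nudged above each point (vjust is an assumption)
  geom_text(aes(label = college), colour = "black", size = 4, vjust = -0.8)
print(p)
```

Adding the labels as a separate layer keeps the size setting from also changing the points.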


Q10) Propose your own hypothesis of data relationships (only using the data available in
the colleges.csv file) by editing the rmd file below with your proposition. Explore your
proposition with the creation of a graph that you have defined.

Hypothesis: Colleges having high expenditures have high donations from alumni.

qplot(x = perc.alumni, y = expenditures, color = private, data =colleges.df)


As we can see in the scatterplot, colleges with higher expenditures tend to have
a higher percentage of alumni donating to the school.

