
Abdullah Afridi 5369-6871

Bret Ellenbogen 1186-1901


STA3024 Section 20297
December 4th, 2018

Selection of Variables
This report aims to assess the relationship between parental education levels and the
likelihood of children attending graduate or professional school immediately after
graduation. This was of interest because it shows the extent to which parental education levels
influence students’ motivation and ability to achieve higher education. Conversely, the
relationship between parental and child educational levels and aspirations indicates the degree to
which higher education is inaccessible to, or unwanted by, those without exposure to it.
This is a pressing issue, given that the highest level of education one receives is quite valuable
for one’s future economic prospects.
There were originally five possible responses to the question “Which best describes your
likelihood of attending graduate or professional school right after graduation?” and six possible
responses to the questions on maternal and paternal education level, which are summarized in the
table below. Both variables have more than three response options.

Table 1
Maternal/Paternal Education Level (MAT E/PAT E)    Likelihood of Further Education (FUR E)

Less than high school (<HS)                        Very Unlikely (VU)
High School Graduate (HS)                          Moderately Unlikely (MU)
Some College (SC)                                  Moderately Likely (ML)
Bachelor’s Degree (B)                              Very Likely (VL)
Master’s or Professional Degree (M/P)              *Other/Don’t Know
Doctoral Degree (D)

*Other: These responses and all associated data were not used in the analysis.

The assumptions for the χ² Test for Independence are outlined in the following table.

Table 2

Assumption: The data were obtained from an independent random sample.
Validity: The data were obtained from a survey completed by the large majority of students
taking this statistics course (STA3024 at UF). This presents an inherent sampling bias, as
there might be an association between people who had to or wanted to take this course and
many of the survey’s variables. This constitutes a reason to say that the data are not
necessarily random. However, for the purposes of this project, we will proceed anyway and
do so with caution.

Assumption: At least 80% of the cells have expected counts ≥ 5.
Validity: There are 9 cells in the contingency table. Referring to Table 6, of those 9, only 1
has an expected count less than 5 (the cell for students whose father has a bachelor’s degree
and who are unlikely to attend graduate school had an expected count of 4.67). The proportion
of cells that do not have an expected count of 5 or greater is thus 1/9, or 11.11%. This is
less than 20%, so the assumption is met.

Assumption: All of the cells have expected counts ≥ 1.
Validity: There are 9 cells in the contingency table, and, referring to Table 6, all of them
have expected counts greater than 1. Therefore, the assumption is met.

Contingency Table Description


The contingency table had levels of paternal education as the rows and the likelihood of
attending graduate or professional school as the columns. The maternal education levels were also
used to conduct a χ² test of independence, but the p-values were quite high (> .20), so it was
decided that the paternal education level would be used for the full analysis.

In order to meet the assumptions for a valid χ² Test, some question responses with very few data
points were combined with others so that at least 80% of all expected cell counts would be at
least 5. The responses for paternal education “Less than high school,” “High School Graduate,” and
“Some College” were all combined into one category, “Less than a Bachelor’s Degree” (< B).
Likewise, the category “Doctoral Degree” was combined with “Master’s or Professional
Degree” to create one category, “Master’s or Professional Degree or Doctoral Degree” (M/P +
D). Finally, there were very few responses in the “Very Unlikely” category
across all levels of paternal education, so the “Very Unlikely” category was combined with
“Moderately Unlikely” to create one category, “Unlikely” (U).

After combining these categories, the contingency table looked as follows.

Table 3
                        FUR E
PAT E          U      ML      VL    TOTAL
<B             8      23      38       69
B              2       7      40       49
M/P + D        8      11      52       71
TOTAL         18      41     130      189

Conditional Probability Distribution:


To calculate conditional probabilities, the formula for conditional probability was used:
P(A | B) = P(A ∩ B) / P(B). For example,
P(<B | U) = P(<B ∩ U) / P(U) = (8/189) / (18/189) = 8/18 = 0.4444

Table 4
Conditional Probability Distribution

               U        ML       VL
<B           .4444    .5610    .2923
B            .1111    .1707    .3077
M/P + D      .4444    .2683    .4000
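The column-wise conditioning behind Table 4 can be reproduced with a short sketch. This is only an illustration using the counts from Table 3; the variable names are hypothetical and it is not necessarily how the original figures were produced.

```python
# Observed counts from Table 3: rows are PAT E, columns are FUR E.
rows = ["<B", "B", "M/P + D"]
cols = ["U", "ML", "VL"]
observed = [
    [8, 23, 38],
    [2, 7, 40],
    [8, 11, 52],
]

# Column totals give the number of respondents in each FUR E category.
col_totals = [sum(row[j] for row in observed) for j in range(len(cols))]

# P(PAT E = row | FUR E = column) = cell count / column total.
conditional = [
    [observed[i][j] / col_totals[j] for j in range(len(cols))]
    for i in range(len(rows))
]

for name, probs in zip(rows, conditional):
    print(name, [round(p, 4) for p in probs])
```

Because each column is divided by its own total, the three probabilities in every column sum to 1.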


Figure 1 shows how the probability of each PAT E level varies across the different
FUR E responses. The chart shows that, given someone is very likely to attend graduate or
professional school, it is most likely that their father has a Master’s Degree or higher. Similarly,
among the group unlikely to immediately pursue further education, it is equally probable for
their fathers to have at least a Master’s Degree or less than a Bachelor’s Degree. However, in the
moderately likely category of responses, it is most probable that one’s father has not obtained at
least a Bachelor’s Degree.
The conditional distribution suggests that paternal education and aspirations of higher
education are not independent. If they were independent, then the probability of one response
would be the same across different levels of the second variable. However, it can be clearly seen
from the table that the probability of one’s father having less than a Bachelor’s Degree varies
across the responses for likelihood of attending graduate or professional school. The same can be
said for the probabilities of one’s father having a Bachelor’s Degree or having at least a Master’s
Degree. If the variables were independent, then all blue bars in Figure 1 would be the same
length, as would all the orange and grey bars.

Joint Probability Distribution:


To calculate a joint probability, the cell count was divided by the total number of
responses. For instance, P(<B ∩ U) = 8/189 = .0423

Table 5
Joint Probability Distribution

               U         ML        VL
<B           0.0423    0.1217    0.2011
B            0.0106    0.0370    0.2116
M/P + D      0.0423    0.0582    0.2751
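As a quick check of Table 5, the joint distribution can be recomputed from the Table 3 counts. The sketch below is illustrative only; the dictionary layout and names are assumptions made for this example.

```python
# Observed counts from Table 3, keyed by (PAT E level, FUR E response).
observed = {
    ("<B", "U"): 8, ("<B", "ML"): 23, ("<B", "VL"): 38,
    ("B", "U"): 2, ("B", "ML"): 7, ("B", "VL"): 40,
    ("M/P + D", "U"): 8, ("M/P + D", "ML"): 11, ("M/P + D", "VL"): 52,
}

grand_total = sum(observed.values())

# Joint probability of each cell = cell count / total number of responses.
joint = {cell: count / grand_total for cell, count in observed.items()}

# Marginal probability of each PAT E level = sum of that row's joint probabilities.
pat_marginal = {}
for (pat, _fur), p in joint.items():
    pat_marginal[pat] = pat_marginal.get(pat, 0.0) + p

most_probable = max(joint, key=joint.get)
least_probable = min(joint, key=joint.get)
print(most_probable, least_probable)
print({pat: round(p, 4) for pat, p in pat_marginal.items()})
```

The marginal probabilities printed at the end correspond to the comparison of fathers’ education levels in the sample as a whole.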

Figure 2 shows the probability of each of the 9 possible outcomes recorded in the
contingency table. The sum of these 9 probabilities therefore equals 1. The most probable outcome
is being very likely to attend graduate or professional school and having a father with at
least a Master’s Degree. The least probable outcome is being unlikely to immediately attend
graduate or professional school and having a father with a Bachelor’s Degree. In the sample,
it was narrowly most probable for someone’s father to have a Master’s Degree or higher,
followed closely by less than a Bachelor’s Degree, with a Bachelor’s Degree being the least
probable.

Statement of Hypotheses for χ² Test of Independence:


Symbolic
H0: ∀ (i, j) with i ∊ {U, ML, VL} and j ∊ {<B, B, M/P + D}, O_ij = E_ij
HA: ∃ (i, j) with i ∊ {U, ML, VL} and j ∊ {<B, B, M/P + D} such that O_ij ≠ E_ij

Verbal
Null Hypothesis: The further education plans of children are independent of their father’s
education level.
Alternative Hypothesis: The further education plans of children are dependent on their father’s
education level.

Data Analysis
Table 6
Expected Values
                        FUR E
PAT E          U        ML       VL     TOTAL
<B            6.57    14.97    47.46       69
B             4.67    10.63    33.70       49
M/P + D       6.76    15.40    48.84       71
TOTAL        18       41      130         189

The expected cell counts were calculated as follows, using cell (<B, U) as an example.
Under independence, the probability of a respondent belonging to both U and <B is the product
of the two marginal probabilities: the number of people in U, 18, divided by the total number of
respondents, 189, multiplied by the number of people in <B, 69, divided by 189. Multiplying
this joint probability by the total number of respondents, 189, gives the expected count for
the cell, 6.57. This formula simplifies to (row total × column total) / grand total. It was
applied using statistical software to find the expected values of all cells in the table. The
expected values represent the counts that would be observed if the two variables were truly
independent.
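The row total × column total / grand total rule, together with the two expected-count assumptions from Table 2, can be sketched as follows. This is a minimal illustration from the Table 3 counts, not the statistical software output the report relied on.

```python
# Observed counts from Table 3: rows are PAT E (<B, B, M/P + D),
# columns are FUR E (U, ML, VL).
observed = [
    [8, 23, 38],
    [2, 7, 40],
    [8, 11, 52],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Assumption checks: at least 80% of cells >= 5, and every cell >= 1.
cells = [e for row in expected for e in row]
share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
all_at_least_1 = all(e >= 1 for e in cells)

for row in expected:
    print([round(e, 2) for e in row])
print(round(share_at_least_5, 4), all_at_least_1)
```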
Following the standard procedure for a χ² Test of Independence, which quantifies the
difference between the counts expected under independence and the counts actually observed,
the expected values are subtracted from the observed values. Each difference is squared so that
negative and positive deviations do not cancel, and then divided by the expected value to scale
the deviation relative to the size of the cell. These standardized values are represented by
(O − E)²/E. Their sum over all cells is the test statistic:

χ² = Σ (O_ij − E_ij)² / E_ij, summed over all nine cells (i, j) of the table.


The value of the χ² test statistic for the test of independence was 12.135304.
The χ² distribution takes only non-negative values and has a mean equal to its degrees of
freedom. In this case, there were four df, resulting from the product (3 − 1)(3 − 1). The
standard deviation of the distribution, √(2·df), was √8, or 2.83. The value 12.135 is far to the
right of the mean of 4, almost 3 standard deviations above it, so the p-value, 0.0164, is low.
This p-value means that, if paternal education and pursuit of further education were actually
independent, there would be a probability of 0.0164 of obtaining a value of χ² at least as
extreme as 12.135304.
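The test statistic and p-value can be recomputed with the standard library alone. The sketch below assumes the Table 3 counts; the helper `chi2_upper_tail_even_df` is a hypothetical name, and it relies on the closed-form upper tail of the χ² distribution that holds when the degrees of freedom are even (here, 4).

```python
import math

observed = [
    [8, 23, 38],
    [2, 7, 40],
    [8, 11, 52],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Test statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-square variable; valid only for even df."""
    half = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** k / math.factorial(k) for k in range(half))

p_value = chi2_upper_tail_even_df(chi_square, df)

# Distance of the statistic from the mean (df) in standard deviations, sqrt(2 * df).
sd_above_mean = (chi_square - df) / math.sqrt(2 * df)

print(round(chi_square, 4), df, round(p_value, 4), round(sd_above_mean, 2))
```

For odd degrees of freedom this closed form does not apply, and a statistics library would be the more practical choice.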

Because the p-value is less than the significance level of α = 0.05, the null hypothesis is
rejected. There is enough evidence to conclude that the further education plans of students are
dependent on the level of paternal education.
Figure 3

The tables above show the contribution of each cell to the test statistic. The
standardized residuals show the strength and direction of the difference between the observed
and expected values. The contributions to the χ² statistic are all non-negative because each is a
squared difference divided by a positive expected count; they measure only the size of the
departure from independence, not its direction.
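The per-cell quantities can be illustrated with a small sketch. It assumes the Table 3 counts and uses Pearson residuals, (O − E)/√E, as the signed measure; the software behind Figure 3 may have reported a differently scaled (adjusted) residual, so the residual values here are an assumption rather than a reproduction.

```python
import math

rows = ["<B", "B", "M/P + D"]
cols = ["U", "ML", "VL"]
observed = [
    [8, 23, 38],
    [2, 7, 40],
    [8, 11, 52],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Contribution of each cell to the statistic: (O - E)^2 / E, never negative.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# Pearson residual: (O - E) / sqrt(E); the sign gives the direction of the deviation.
residuals = [
    [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for name, contrib_row, resid_row in zip(rows, contributions, residuals):
    print(name, [round(c, 3) for c in contrib_row], [round(r, 3) for r in resid_row])
```

Squaring each Pearson residual gives the corresponding contribution, which is why the contributions carry no sign.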

2x2 Contingency Table


Table
Likelihood of pursuing further education

2x2 Table      Unlikely    Likely
≤ B                  10       108
M/P + D               8        63
Odds Ratios and Relative Risk
Those whose fathers had an education level less than or equal to a bachelor’s degree are
10.8 times as likely to pursue further education as not (odds of 108/10), while those whose
fathers hold a Master’s, professional, or doctoral degree are 7.875 times as likely to pursue
further education as not (odds of 63/8). The odds ratio of pursuing further education, comparing
the ≤ B group with the M/P + D group, is therefore 10.8/7.875 = 1.3714. The relative risk is
calculated as the probability of being likely to pursue further education in the ≤ B group
(108/118) divided by the corresponding probability in the M/P + D group (63/71). This value is
1.03148, as shown in the figure below. It represents how many times as likely the ≤ B group is
to report being likely to pursue further education compared with the M/P + D group.
Specifically, a relative risk greater than one means that those in the ≤ B group are slightly
more likely to indicate that they will pursue further education immediately after graduation.
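The odds, odds ratio, and relative risk follow directly from the four counts in the 2x2 table. The sketch below is only a worked illustration with hypothetical variable names.

```python
# Counts from the 2x2 table: (Unlikely, Likely) for each paternal education group.
le_b_unlikely, le_b_likely = 10, 108
grad_unlikely, grad_likely = 8, 63

# Odds of being likely (versus unlikely) to pursue further education in each group.
odds_le_b = le_b_likely / le_b_unlikely
odds_grad = grad_likely / grad_unlikely

# Odds ratio: odds in the <=B group relative to the M/P + D group.
odds_ratio = odds_le_b / odds_grad

# Relative risk: probability of "likely" in the <=B group divided by the
# probability of "likely" in the M/P + D group.
p_likely_le_b = le_b_likely / (le_b_likely + le_b_unlikely)
p_likely_grad = grad_likely / (grad_likely + grad_unlikely)
relative_risk = p_likely_le_b / p_likely_grad

print(odds_le_b, odds_grad, round(odds_ratio, 4), round(relative_risk, 5))
```

The relative risk is much closer to 1 than the odds ratio because "likely" is a very common response in both groups.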
Figure 4

As seen in the figure above, the odds ratio and relative risk estimates comparing the
groups all have 95% confidence intervals that include 1. This means that none of them is
statistically significant for the 2x2 contingency table. This is most likely because information
was lost when categories from the 3x3 table were merged to make the 2x2 table.
Fisher’s Exact Test computes an exact p-value and tends to be more conservative than the
χ² test; it is typically used for smaller sample sizes. This is why it yielded a higher p-value.
Nonetheless, neither test found evidence against independence between paternal education and
likelihood of pursuing further education once the categories were merged. This suggests that the
differences arise mainly in the distinction between being moderately and very likely.
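Both conclusions about the 2x2 table can be checked with a short sketch. It assumes a Woolf (log-based) 95% interval for the odds ratio and the common two-sided definition of Fisher’s Exact Test (summing all tables no more probable than the observed one); the software behind Figure 4 may have used different methods, so the exact numbers need not match its output.

```python
import math

# 2x2 counts (rows: <=B, M/P + D; columns: Unlikely, Likely).
a, b = 10, 108
c, d = 8, 63

# Odds ratio of "likely", <=B versus M/P + D, with a Woolf 95% interval.
# 1.96 is the approximate 97.5th percentile of the standard normal distribution.
odds_ratio = (b / a) / (d / c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

def table_probability(k):
    """Hypergeometric probability of k 'Unlikely' responses in the <=B row."""
    row1, row2, col1 = a + b, c + d, a + c
    return math.comb(row1, k) * math.comb(row2, col1 - k) / math.comb(row1 + row2, col1)

# Range of possible values for the top-left cell with all margins held fixed.
k_min = max(0, (a + c) - (c + d))
k_max = min(a + c, a + b)

# Two-sided Fisher p-value: total probability of tables that are no more
# probable than the observed table (small tolerance for floating-point ties).
p_observed = table_probability(a)
p_fisher = sum(
    p
    for p in (table_probability(k) for k in range(k_min, k_max + 1))
    if p <= p_observed * (1 + 1e-9)
)

print(round(ci_low, 3), round(ci_high, 3), round(p_fisher, 4))
```

An interval that contains 1 and a Fisher p-value above 0.05 both point to the same conclusion: the merged table gives no evidence against independence.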
Logistic Regression
The report also aimed to understand the relationship between the probability of pursuing further
education, a categorical output, and college GPA, a quantitative input. This was done with a
logistic regression. To make the pursuit of higher education binary, “Other/Don’t Know”
responses were removed, “Very Unlikely” and “Moderately Unlikely” were considered failures
(0), and “Very Likely” and “Moderately Likely” were considered successes (1).

The regression model took the form of:

p̂ = e^(-0.28014415 + 0.71087544x) / (1 + e^(-0.28014415 + 0.71087544x))

where x is college GPA, and p̂ is the probability of pursuing graduate or professional school
immediately after graduation.
FIGURE 5
Logistic Regression of Probability of Pursuing Further Education against College GPA

The plot is shown above in Figure 5. It indicates that as GPA increases, the probability of
aspiring to further education increases. A vertical line is drawn at x = 0 because the college
GPA scale ranges from 0.0 to 4.0. To the left of x = 0 the overall trend can still be seen, but
only the portion to the right of the line is meaningful.
Ideally, because the GPA scale ranges from 0.0 to 4.0, the plot would approach the horizontal
asymptote p̂ = 1 near a GPA of 4.0 and the horizontal asymptote p̂ = 0 near a GPA of 0.0.
However, the predicted probability only approaches 0 at negative GPA values, which have no
practical meaning.

According to this model, people with a 4.0 GPA have a probability of .928 of considering it
likely that they will attend graduate or professional school immediately after graduation. This
probability drops to .758 at a GPA of 2.0. A hypothetical GPA of 0 would result in a probability
of .430, meaning that, according to the model, every student in the sample has at least a
moderate probability of planning to attend graduate school. The large number of positive
responses contributed to this. This was most likely due to sampling bias, because people in
advanced statistics classes might be taking the class in preparation for further education.
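The predicted probabilities quoted above can be reproduced from the fitted coefficients. The sketch below assumes only the intercept and slope reported in the model; the function and constant names are hypothetical.

```python
import math

# Coefficients reported for the fitted logistic regression model.
INTERCEPT = -0.28014415
SLOPE = 0.71087544

def predicted_probability(gpa):
    """Predicted probability of being likely to pursue further education."""
    linear = INTERCEPT + SLOPE * gpa
    # 1 / (1 + e^(-z)) is algebraically the same as e^z / (1 + e^z).
    return 1 / (1 + math.exp(-linear))

for gpa in (0.0, 2.0, 4.0):
    print(gpa, round(predicted_probability(gpa), 3))

# Multiplicative change in the odds for each additional GPA point.
odds_multiplier_per_point = math.exp(SLOPE)

# GPA at which the predicted probability equals 0.5.
gpa_at_half = -INTERCEPT / SLOPE
print(round(odds_multiplier_per_point, 3), round(gpa_at_half, 3))
```

The last two quantities show that the curve crosses p̂ = 0.5 at a GPA well below 1, which is why the predicted probability stays moderate to high over the whole 0.0 to 4.0 range.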
Summary of Findings

It was found that the level of paternal education and a given student’s self-perceived
likelihood of immediately attending graduate or professional school after graduation are
dependent. Individual cell χ² contributions indicate that fewer people whose father did not
complete a Bachelor’s Degree consider themselves “very likely” to pursue further education than
expected, whereas more people whose father has a Bachelor’s Degree or higher consider
themselves “very likely” than expected. The trend of self-perceived likelihood
strengthening with increasing paternal education is not fully consistent in the analysis, however.
For instance, among those whose father had at least a Master’s, more people than expected
considered themselves “unlikely” to pursue further education.

Overall, paternal education and further education plans are dependent, with those whose
fathers have higher levels of education often considering themselves more likely to
pursue higher education. When reduced to a 2x2 contingency table, however, the association was
no longer statistically significant, most likely due to a loss of information.

The logistic regression model above indicates that as GPA increases, so does the probability of
one considering oneself likely to attend graduate or professional school.
