Data Exploration Mini Project

Mingyo Lee

Introduction
I decided to observe the number of shoes that people own. The values of my data
represent the number of pairs of shoes each person owns. The population is LASA students in
the LASA Class of 2016 Facebook group. I used the unit pairs of shoes because shoes are sold
in pairs, not individually. To collect my data, I used Google Forms and I posted in the LASA
Class of 2016 Facebook group to request that people fill out my form. I collected data from the
Facebook group because I knew that most people in the group get a notification when someone
posts in the group, so people would be more likely to pay attention to my survey than if I had just
posted it on Facebook normally. Also, posting in the group was an easier way, and one more likely
to give a random sample, than calling or messaging specific people. My report starts with the
summary statistics and relevant graphs for the original data set, along with an explanation of how
I got those numbers. It then gives the same statistics and graphs for the set created by adding
100 to each number in the original data and for the set created by increasing each number in the
original data by 50%. After that, there are written analyses of each data set, followed by my
conclusion.
Original Data
Sample Size: 77
Mean: 11.51948
Median: 9
1st Quartile: 6
3rd Quartile: 15
IQR: 9
Minimum: 0
Maximum: 69
Standard Deviation: 9.398503
Variance: 88.33185
Range: 69
The sample size is the number of people who filled out my form, which is 77. I
found the mean by dividing the sum of the data (887) by the number of people who answered
(77). The maximum is the largest value in the data set (69) and the minimum is the smallest
value (0). I found the median by placing the numbers in order by value and taking the value in
the middle (9). The 1st Quartile is the median of the values between the minimum (0) and the
median of the full data set (9), which is 6. The 3rd Quartile is the median of the values
between the median of the full data set (9) and the maximum (69), which is 15. I found the IQR
by subtracting the 1st Quartile (6) from the 3rd Quartile (15), which gives 9. The range is found
by subtracting the minimum (0) from the maximum (69), which gives 69. The standard deviation is
the square root of the sum of the squared differences between each value and the mean, divided
by the number of values minus 1, or
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}$, where $x$ is each value, $\bar{x}$ is the mean
(11.51948), $N$ is the number of values (77), and $s$ is the standard deviation. The standard
deviation is therefore equal to 9.398503. Variance is the standard deviation (9.398503) squared,
which is around 88.33.
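As a rough check on the numbers above, the statistics can be recomputed from the values shown in the stem and leaf plot of the original data. The sketch below uses Python's standard library instead of RStudio. The names STEMS and expand are only illustrative, and the "inclusive" quantile method is an assumption about the quartile rule that was used, chosen because it reproduces the quartiles reported here.

```python
import statistics

# Leaves transcribed from the stem and leaf plot of the original data
# (stem = tens digit, each leaf character = ones digit of one response).
STEMS = {
    0: "022333344444555555666677778888888899999",
    1: "00000000000233444555556788",
    2: "000013346",
    3: "00",
    6: "9",
}


def expand(stems):
    """Turn {stem: leaf string} into a sorted list of integer values."""
    return sorted(10 * stem + int(leaf)
                  for stem, leaves in stems.items()
                  for leaf in leaves)


data = expand(STEMS)
n = len(data)
total = sum(data)
mean = total / n
median = statistics.median(data)
# Assumption: "inclusive" interpolation matches the quartile rule used in the report.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
data_range = max(data) - min(data)
sd = statistics.stdev(data)      # sample standard deviation, divides by N - 1
var = statistics.variance(data)  # sample variance, equal to sd squared

print(f"n={n} sum={total} mean={mean:.5f} median={median}")
print(f"Q1={q1} Q3={q3} IQR={iqr} range={data_range}")
print(f"sd={sd:.6f} variance={var:.5f}")
```

The stem and leaf plot keeps every individual value, which is why the full data set can be rebuilt from it without the original survey responses.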
Graphs
Histogram
Boxplot
Stem and Leaf Plot

0|022333344444555555666677778888888899999
1|00000000000233444555556788
2|000013346
3|00
4|
5|
6|9
Second Calculation
Sample Size: 77
Mean: 111.5195
Median: 109
1st Quartile: 106
3rd Quartile: 115
IQR: 9
Minimum: 100
Maximum: 169
Variance: 88.33185
Range: 69
Graphs
Histogram
Boxplot
Stem and Leaf Plot

10 | 0 2 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 8 8 8 8 9 9 9 9 9
11 | 0 0 0 0 0 0 0 0 0 0 0 2 3 3 4 4 4 5 5 5 5 5 6 7 8 8
12 | 0 0 0 0 1 3 3 4 6
13 | 0 0
14 |
15 |
16 | 9
Third Calculation
Sample Size: 77
Mean: 17.27922
Median: 13.5
1st Quartile: 9
3rd Quartile: 22.5
IQR: 13.5
Minimum: 0
Maximum: 103.5
Variance: 198.7467
Range: 103.5
Graphs
Histogram
Boxplot
Stem and Leaf Plot

0|0335555666668888889999
1|11112222222244444555555555558
2|00111333334677
3|000025569
4|55
5|
6|
7|
8|
9|
10 | 4
Data Analysis
In the original (first) data set, I found that most people had 1-10 pairs
of shoes. The data was strongly positively skewed as a result, and therefore the median was used
as the measure of the center of the data. I found that the values of 30 and 69 pairs of shoes
were outliers because they both lie more than 1.5x the IQR above the 3rd Quartile. 1.5x the IQR
is 13.5, so the outliers on the right are values past 28.5, since the 3rd Quartile is at 15 pairs
of shoes. Therefore, the values 30 and 69 are outliers.
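The 1.5x IQR rule can be written out as a short calculation. This is a minimal sketch that takes the quartiles reported for the original data (6 and 15) as inputs. The function name tukey_fences and the list of candidate values (the distinct values of 20 or more from the stem and leaf plot) are only illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs of the k x IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = tukey_fences(6, 15)

# Distinct values of 20 or more pairs, read from the stem and leaf plot.
candidates = [20, 21, 23, 24, 26, 30, 69]
outliers = [v for v in candidates if v < lower or v > upper]

print(lower, upper)  # cutoffs below Q1 and above Q3
print(outliers)      # values outside the cutoffs
```

The lower cutoff is negative, so no value can be an outlier on the left, since nobody can own fewer than 0 pairs of shoes.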
In the second calculation, compared with the first calculation, the reported mean increased by
100.00002, though the extra 0.00002 is due to RStudio rounding the displayed means (11.51948
and 111.5195). The true increase is exactly 100. The median also
increased by 100, while the standard deviation stayed the same. For the third calculation, the
mean, median, and standard deviation all increased by 50% compared to those of the first
calculation.
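The effect of adding 100 and of increasing by 50% can be checked directly on the values from the stem and leaf plot of the original data. This is only a sketch in Python rather than RStudio, and the names STEMS and center_and_spread are illustrative.

```python
import statistics

# Leaves transcribed from the stem and leaf plot of the original data.
STEMS = {
    0: "022333344444555555666677778888888899999",
    1: "00000000000233444555556788",
    2: "000013346",
    3: "00",
    6: "9",
}
data = [10 * stem + int(leaf) for stem, leaves in STEMS.items() for leaf in leaves]


def center_and_spread(values):
    """Return (mean, median, sample standard deviation)."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))


original = center_and_spread(data)
shifted = center_and_spread([x + 100 for x in data])  # second calculation
scaled = center_and_spread([x * 1.5 for x in data])   # third calculation

for label, (mean, median, sd) in [("original", original),
                                  ("+100", shifted),
                                  ("x1.5", scaled)]:
    print(f"{label:>8}: mean={mean:.5f} median={median} sd={sd:.6f}")
```

Adding a constant moves the center but not the spread, while multiplying by a constant changes both by the same factor.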
If we assume that the original data follows a normal distribution, then the percent of values
more than 5 units above the mean of the original data is 29.73631%. The value 5 units above the
mean is 16.51948, the mean is 11.51948, and the standard deviation is 9.398503. In RStudio, we can
enter pnorm(16.51948, 11.51948, 9.398503), which gives us the proportion of data smaller
than the value 5 units above the mean, 0.7026369. Since we want the percent of the data greater
than the value 5 units above the mean, we subtract 0.7026369 from 1 and multiply by 100,
getting 29.73631%.
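The same upper-tail calculation can be sketched in Python, where statistics.NormalDist plays the role of pnorm. The variable names are illustrative, and the mean and standard deviation are the values reported for the original data.

```python
from statistics import NormalDist

# Normal model with the mean and standard deviation of the original data.
model = NormalDist(mu=11.51948, sigma=9.398503)

# Proportion below the value 5 units above the mean,
# the same role as pnorm(16.51948, 11.51948, 9.398503).
below = model.cdf(11.51948 + 5)

# Percent above that value.
above_pct = (1 - below) * 100

print(round(below, 7), round(above_pct, 5))
```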
The percent of values between 3 units below the mean and 2 units above the mean is
20.94709%. In RStudio, pnorm(13.51948, 11.51948, 9.398503) tells us that the proportion
below the value 2 units above the mean is 0.5842585, while pnorm(8.51948, 11.51948, 9.398503)
tells us that the proportion below the value 3 units below the mean is 0.3747875. Therefore, if
we subtract 0.3747875 from 0.5842585 and multiply by 100, we get the percent that is between 3
units below the mean and 2 units above the mean, which is 20.94709%.
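The percent between two values is the difference of two cumulative proportions. This is a small sketch in Python of that subtraction, again with statistics.NormalDist in place of pnorm and with illustrative variable names.

```python
from statistics import NormalDist

# Normal model with the mean and standard deviation of the original data.
model = NormalDist(mu=11.51948, sigma=9.398503)

below_upper = model.cdf(11.51948 + 2)  # proportion below 2 units above the mean
below_lower = model.cdf(11.51948 - 3)  # proportion below 3 units below the mean

# Everything below the upper value, minus everything below the lower value.
between_pct = (below_upper - below_lower) * 100

print(round(below_upper, 7), round(below_lower, 7), round(between_pct, 4))
```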
The number of units required for the top 10% is about 23.55. Since 90% of the values are
below the top 10%, 0.9 must be the proportion that corresponds to the z-score of the top 10%.
That z-score is about 1.28, and so 1.28 = (x - mean)/standard deviation = (x - 11.51948) / 9.398503,
where x is the number of units required for the top 10%. Solving for x gives
x = 11.51948 + 1.28 * 9.398503, which is about 23.55. Therefore, the number of units required
for the top 10% is about 23.55.
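The equation above can be solved for x in a few lines. The sketch below does it twice: once with the table z-score of 1.28, and once with the unrounded z-score from the inverse normal function (NormalDist.inv_cdf in Python, which corresponds to qnorm in R). The variable names are illustrative.

```python
from statistics import NormalDist

mean = 11.51948
sd = 9.398503

# Using the rounded z-score from a z-table: x = mean + z * sd.
x_table = mean + 1.28 * sd

# Using the unrounded z-score for the 90th percentile.
z_exact = NormalDist().inv_cdf(0.9)
x_exact = NormalDist(mu=mean, sigma=sd).inv_cdf(0.9)

print(round(x_table, 4), round(z_exact, 4), round(x_exact, 4))
```

The two answers differ slightly in the second decimal place because 1.28 is a rounded z-score.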
Conclusion
According to the data I have collected, more people seem to own fewer shoes, and fewer
people seem to own more shoes. This is illustrated by the right-skewed histogram showing the
relationship between number of pairs of shoes owned and the frequency. 50 respondents out of
the total 77 (64.94%) owned 0-10 pairs of shoes, but that number quickly decreased when the
number of pairs of shoes was increased to 11-20 pairs. Only 3 out of 77 respondents (3.90%)
owned 30 or more pairs of shoes.
