
Quantitative Data Analysis

Basic Objectives in Data Analysis
1. Feel for the Data
2. Testing Goodness of Data
3. Hypothesis Testing
Case: Excelsior Enterprise, pp. 304-305

1. What kind of research is this?
2. What is the problem statement?
3. What are the variables (dependent and independent)?
4. What information related to the employees do you get from this case?
1. Feel for the Data

Frequencies/ Bar charts or Pie chart


Measures of Central Tendency:
The Mean, The Mode, The Median.
Measures of Dispersion: Range,
Variance, Standard Deviation
Relationships between variables:
Relationship between two nominal variables (chi-square, χ² test),
Correlations (Pearson correlation)
2. Testing Goodness of Data

Reliability: Cronbach's Alpha


Validity
3. Hypothesis Testing

Next chapter
1. Feel for the Data

Frequencies/ Bar charts or Pie chart

Frequencies refer to the number of


times various subcategories of a
certain phenomenon occur, from
which the percentage and the
cumulative percentage of their
occurrence can be easily calculated.
Ex. Excelsior Enterprise - frequencies

                    Frequency   Percent   Valid Percent   Cumulative Percent
Marketing                  13       7.5             7.5                  7.5
Production                 49      28.1            28.1                 35.6
Sales                      44      25.3            25.3                 60.9
Finance                     5       2.9             2.9                 63.8
Servicing                  34      19.5            19.5                 83.3
Maintenance                 5       2.9             2.9                 86.2
Personnel                  16       9.2             9.2                 95.4
Public Relations            3       1.7             1.7                 97.1
Accounting                  5       2.9             2.9                100.0
Total                     174     100.0           100.0                100.0
Bar Chart
[Bar chart of department frequencies; frequency axis 0-60]

Pie Chart
[Pie chart of department shares: Production, Sales, Finance, Servicing, Maintenance, Personnel, Public Relations, Accounting]
Frequencies for other variables

Gender       Men          Women
             79.9%        20.1%

Shift        First shift  Second shift  Third shift
             79.9%        20.1%         18%

Work hours   Full time    Part time
             16%          84%

Education    Less than    High school   College   Masters   Doctoral
             high school  diploma       degree    degree    degree
             8%           39%           32%       20%       1%
1. Feel for the Data

Measures of Central Tendency:


The Mean, The Mode, The Median.
Measures of Central
Tendency
The Mean. The mean or the average is a measure of
central tendency that offers a general picture of the
data without unnecessarily inundating one with each
of the observations in a data set.
For example, the production department might keep
detailed records on how many units of a product are
being produced each day. However, to estimate the
raw materials inventory, all that the manager might
want to know is how many units per month, on an
average, the department has been producing over the
past 6 months. This measure of central tendency, that
is, the mean, might offer the manager a good idea of
the quantity of materials that need to be stocked.
Measures of Central
Tendency
The Median. The median is the central item in a group of
observations when they are arrayed in either an ascending
or a descending order.
Let us take an example to examine how the median is
determined as a measure of central tendency. Let us say
the annual salaries of nine employees in a department are
$65,000, $30,000, $25,000, $64,000, $35,000, $63,000,
$32,000, $60,000, and $61,000.
The mean salary here works out to be about $48,333, but
the median is $60,000. That is, when arrayed in the
ascending order, the figures will be as follows: $25,000,
$30,000, $32,000, $35,000, $60,000, $61,000, $63,000,
$64,000, $65,000,
and the figure in the middle is $60,000. If there are an even
number of employees, then the median will be the average
of the middle two salaries.
Measures of Central
Tendency
The Mode. In some cases, a set of observations does not lend
itself to a meaningful representation through either the mean or
the median, but can be signified by the most frequently
occurring phenomenon.
For instance, in a department where there are 10 White women,
24 White men, 3 African American women, and 2 Asian women,
the most frequently occurring group (the mode) is the White
men. Neither a mean nor a median is calculable or applicable in
this case. There is also no way of indicating any measure of
dispersion.
As is evident from the above, nominal data lend themselves to
description only by the mode as a measure of central tendency.
It is possible that a data set could contain bimodal observations.
For example, using the foregoing scenario, there could also be
24 Asian men who are specially recruited for a project. Then we
have two modes, the White men and the Asian men.
Mode: The score that occurs most
frequently
Only statistic appropriate for nominal
data

Example:

2, 3, 4, 6, 7, 8, 8, 8, 9, 9, 10, 10, 10, 10 - Mode?
The mode is 10 (it occurs four times).
Median: The score associated with the 50th percentile
Scale of measurement:
Ordinal, interval or ratio
DO NOT use with nominal data
Methods of determination:
N = even:
List scores from low to high
The median is the mean of the (N/2)th and (N/2 + 1)th scores

N = odd:
List scores from low to high
The median is the ((N + 1)/2)th score
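The even/odd rules above can be sketched directly in code (a hypothetical helper, not part of the slides):

```python
def median(scores):
    """Return the median: the middle score if N is odd,
    the mean of the two middle scores if N is even."""
    s = sorted(scores)                  # list scores from low to high
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                      # N odd: the ((N+1)/2)th score
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2    # N even: mean of the two middle scores

print(median([7, 1, 5, 3, 9]))          # odd N: middle of 1 3 5 7 9 is 5
print(median([7, 1, 5, 3, 9, 11]))      # even N: mean of 5 and 7 is 6.0
```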
Mean: Arithmetic average
Most common measure of central
tendency
Scale of measurement:
Interval or ratio
An instructor recorded the average number
of absences for his students in one semester.
For a random sample the data are:
2 4 2 0 40 2 4 3 6

Calculate the mean, the median, and the mode

Mean: (2 + 4 + 2 + 0 + 40 + 2 + 4 + 3 + 6) / 9 = 63 / 9 = 7

Median: Sort data in order

0 2 2 2 3 4 4 6 40

The middle value is 3, so the median is 3.


Mode: The mode is 2 since it occurs the most times.
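The worked example above can be checked with Python's statistics module (a quick sketch):

```python
import statistics

absences = [2, 4, 2, 0, 40, 2, 4, 3, 6]

avg = statistics.mean(absences)     # (0+2+2+2+3+4+4+6+40) / 9 = 63 / 9 = 7
med = statistics.median(absences)   # middle value of the sorted data = 3
mo  = statistics.mode(absences)     # most frequent score = 2

print(avg, med, mo)
```

Note how the single outlier (40) pulls the mean well above the median; this is the same effect the skewness section later describes.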
Measures of Dispersion
There are three main measures of dispersion:
The range
The variance
The standard deviation
Measures of Dispersion
The measure of dispersion is also unique to
nominal and interval data.
Two sets of data might have the same
mean, but the dispersions could be
different.
For example, if Company A sold 30, 40, and
50 units of a product during the months of
April, May, and June, respectively, and
Company B sold 10, 40, and 70 units during
the same period, the average units sold per
month by both companies is the same (40
units), but the variability or the dispersion in
the latter company is larger.
Measures of Dispersion
The three measurements of dispersion
connected with the mean are the
range, the variance, and the standard
deviation.
Range. Range refers to the extreme values
in a set of observations.
The range is between 30 and 50 for
Company A (a dispersion of 20 units), while
the range is between 10 and 70 units (a
dispersion of 60 units) for Company B.
Another more useful measure of dispersion
is the variance.
The Range
The range is defined as the
difference between the largest score
in the set of data and the smallest
score in the set of data, XL - XS
What is the range of the following
data:
4 8 1 6 6 2 9 3 6 9
The largest score (XL) is 9; the
smallest score (XS) is 1; the range is
XL - XS = 9 - 1 = 8
When To Use the Range
The range is used when
you have ordinal data or
you are presenting your results to people
with little or no knowledge of statistics
The range is rarely used in scientific
work as it is fairly insensitive
It depends on only two scores in the set of
data, XL and XS
Two very different sets of data can have
the same range:
1 1 1 1 9 vs 1 3 5 7 9
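A quick sketch illustrating the insensitivity just described: the two very different data sets above share the same range.

```python
def data_range(scores):
    """Range = largest score minus smallest score (XL - XS)."""
    return max(scores) - min(scores)

a = [1, 1, 1, 1, 9]   # four identical scores and one outlier
b = [1, 3, 5, 7, 9]   # evenly spread scores

print(data_range(a), data_range(b))   # both ranges are 8, yet the spreads differ
```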
Measures of Dispersion
Variance. The variance is calculated by
subtracting the mean from each of the
observations in the data set, taking the
square of this difference, and dividing the
total of these by the number of
observations.
In the above example, the variance for
each of the two companies is:
Variance for Company A = [(30 - 40)² + (40 - 40)² + (50 - 40)²] / 3 = 66.7
Variance for Company B = [(10 - 40)² + (40 - 40)² + (70 - 40)²] / 3 = 600
As we can see, the variance is much
larger in Company B than in Company
A. This makes it more difficult for the
manager of Company B to estimate
how much stock to hold than it is for
the manager of Company A.
Thus, the variance gives an indication of
how dispersed the data in a data set
are.
What Does the Variance Formula
Mean?
Variance is the mean of the squared
deviation scores
The larger the variance is, the more
the scores deviate, on average, away
from the mean
The smaller the variance is, the less
the scores deviate, on average, from
the mean

Measures of Dispersion
Standard Deviation. The standard
deviation, which is another measure of
dispersion for interval and ratio scaled
data, offers an index of the spread of a
distribution or the variability in the data.
It is a very commonly used measure of
dispersion, and is simply the square
root of the variance. In the case of the
above two companies, the standard deviation for
Companies A and B would be √66.7 and
√600, or 8.167 and 24.495, respectively.
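The Company A / Company B figures above can be reproduced with Python's statistics module. These slides divide by N, so the population versions pvariance and pstdev are used (a sketch):

```python
import statistics

company_a = [30, 40, 50]   # units sold in April, May, June
company_b = [10, 40, 70]

var_a = statistics.pvariance(company_a)  # ((30-40)² + 0 + (50-40)²) / 3 = 66.67
var_b = statistics.pvariance(company_b)  # ((10-40)² + 0 + (70-40)²) / 3 = 600
sd_a = statistics.pstdev(company_a)      # sqrt(66.67) = 8.165 (slide rounds to 8.167)
sd_b = statistics.pstdev(company_b)      # sqrt(600) = 24.495

print(var_a, var_b, sd_a, sd_b)
```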
Standard Deviation - Definition
Measures of dispersion are
descriptive statistics that describe
how similar a set of scores are to
each other:
The more similar the scores are to each
other, the lower the measure of
dispersion will be.
The less similar the scores are to each
other, the higher the measure of
dispersion will be.
In general, the more spread out a
distribution is, the larger the measure of
dispersion will be.
Measures of Dispersion
Which of the two distributions of scores has the larger dispersion?
[Two bar charts over scores 1-10; frequency axis 0-125]
The upper distribution has more dispersion, because the scores are more
spread out; that is, they are less similar to each other.
Standard Deviation - Variance

Standard deviation = √variance

Variance = (standard deviation)²
The Relations between Means and Deviations
1. Coefficient of Variation: defined as
the ratio of the standard deviation to the
mean, CV = s / mean.
The coefficient of variation should be
computed only for data measured on a
ratio scale, as these are measurements
that can only take non-negative values.
The coefficient of variation may not have
any meaning for data on an interval scale.
Skewness
The skewness for anormal distributionis zero,
and any symmetric data should have a
skewness near zero. Negative values for the
skewness indicate data that are skewed left
and positive values for the skewness indicate
data that are skewed right. By skewed left, we
mean that the left tail is long relative to the
right tail. Similarly, skewed right means that
the right tail is long relative to the left tail. If
the data are multi-modal, then this may affect
the sign of the skewness
Skewness

S.K = (Mean - Mode) / Standard Deviation

S.K = 3 (Mean - Median) / Standard Deviation
Example

The owner of the Ches Tahoe
restaurant is interested in how much
people spend at the restaurant. He
examines 10 randomly selected
receipts for parties of four and writes
down the following data:
44, 50, 38, 96, 42, 47, 40, 39, 46, 50
Example
Find the following:
1. Mean
2. Mode
3. Median
4. Range
5. Variance
6. Standard Deviation
7. Coefficient of Variation (CV)
8. Skewness
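A sketch of one way to answer the exercise, assuming the population formulas used earlier in these slides (divide by N) and Pearson's second skewness coefficient, S.K = 3(mean - median) / s, as the skewness definition:

```python
import statistics

bills = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

mean_bill   = statistics.mean(bills)        # 492 / 10 = 49.2
mode_bill   = statistics.mode(bills)        # 50 (the only value occurring twice)
median_bill = statistics.median(bills)      # (44 + 46) / 2 = 45
rng         = max(bills) - min(bills)       # 96 - 38 = 58
var         = statistics.pvariance(bills)   # population variance
sd          = statistics.pstdev(bills)      # square root of the variance
cv          = sd / mean_bill                # coefficient of variation
skew        = 3 * (mean_bill - median_bill) / sd   # positive: pulled right by the $96 bill

print(mean_bill, mode_bill, median_bill, rng, round(var, 2), round(sd, 2))
```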
Relationships between variables
1. Relation between two nominal variables: chi-square (χ² test)
2. Correlations
Chi-Square Analysis Details
Decide whether or not the result is statistically
significant.
The results are statistically significant if the p-value is less
than alpha, where alpha is the significance level (usually α = 0.05).

Report the conclusion in the context of the situation:

The p-value is ______, which is < α; this result is
statistically significant. Reject H0 and conclude
that (the two variables) are related.
The p-value is ______, which is > α; this result is NOT
statistically significant. We cannot reject H0 and
cannot conclude that (the two variables) are related.
Detailed Example
1. Ho: There is no relationship between the geographical area in which a
student grew up and whether or not the student drinks alcohol.
Ha: There is a relationship between the geographical area in which a
student grew up and whether or not the student drinks alcohol.

2. To check the conditions we need to calculate the expected counts for
each cell:
E11 = (R1 x C1)/n = (86 x 87)/825 = 9.07
E12 = (R1 x C2)/n = (86 x 738)/825 = 76.93, and so on.
Detailed Example

Here is the Minitab output with the observed and expected counts for
each cell. We can see that the conditions are satisfied!

             No       Yes       All
Big_City     21       65        86
             9.07     76.93     86.00
Rural        11       130       141
             14.87    126.13    141.00
SmallTown    18       198       216
             22.78    193.22    216.00
Suburban     37       345       382
             40.28    341.72    382.00
All          87       738       825
             87.00    738.00    825.00
Detailed Example

3. Chi-square statistic and p-value:

χ² = Σ {(Observed - Expected)² / Expected}
   = (21 - 9.07)²/9.07 + (65 - 76.93)²/76.93
   + (11 - 14.87)²/14.87 + (130 - 126.13)²/126.13
   + (18 - 22.78)²/22.78 + (198 - 193.22)²/193.22
   + (37 - 40.28)²/40.28 + (345 - 341.72)²/341.72
   = 20.091
df = (4 - 1) x (2 - 1) = 3
p-value = P(χ² > 20.091) < P(χ² > 16.17) = 0.001 (Table A.4)

4. Since the p-value < 0.05, the test is significant, and we can
reject the null.

5. We can conclude that there is a relationship between the
geographical area in which a student grew up and whether or
not the student drinks alcohol.
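The chi-square computation above can be reproduced with only the standard library (a sketch; the dictionary keys mirror the Minitab rows):

```python
# Observed counts per area: (No, Yes) drinkers
observed = {
    "Big_City":  (21, 65),
    "Rural":     (11, 130),
    "SmallTown": (18, 198),
    "Suburban":  (37, 345),
}

col_totals = [sum(row[j] for row in observed.values()) for j in (0, 1)]  # [87, 738]
n = sum(col_totals)                                                      # 825

chi2 = 0.0
for no, yes in observed.values():
    row_total = no + yes
    for j, obs in enumerate((no, yes)):
        expected = row_total * col_totals[j] / n    # E = (R x C) / n
        chi2 += (obs - expected) ** 2 / expected    # sum of (O - E)² / E

df = (len(observed) - 1) * (2 - 1)                  # (rows - 1)(cols - 1) = 3
print(round(chi2, 3), df)
```

Using unrounded expected counts reproduces the slide's 20.091; rounding the expected counts to two decimals first gives almost the same value.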
Correlation
A correlation coefficient
measures the extent to which
two variables tend to change
together. The coefficient
describes both the strength
and the direction of the
relationship.

Smith/Davis (c) 2005 Prentice Hall


1. Pearson product moment
correlation
The Pearson correlation evaluates
the linear relationship between two
continuous variables. A relationship
is linear when a change in one
variable is associated with a
proportional change in the other
variable.
Correlations: Measuring and
Describing Relationships

A correlation typically evaluates
three aspects of the relationship:
the direction
the form
the degree
Correlations: Measuring and
Describing Relationships

The direction of the relationship is
measured by the sign of the
correlation (+ or -). A positive
correlation means that the two
variables tend to change in the same
direction; as one increases, the other
also tends to increase. A negative
correlation means that the two
variables tend to change in opposite
directions; as one increases, the other
tends to decrease.
Correlations: Measuring and
Describing Relationships

The most common form of
relationship is a straight-line or linear
relationship, which is measured by
the Pearson correlation.
Correlations: Measuring and
Describing Relationships

The degree of relationship (the
strength or consistency of the
relationship) is measured by the
numerical value of the correlation. A
value of 1.00 indicates a perfect
relationship and a value of zero
indicates no relationship.
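As a sketch of how the direction and degree are computed, here is Pearson's r for a small made-up data set (not the SPSS data shown later in these slides):

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = statistics.mean(x), statistics.mean(y)

# r = sum of cross-products of deviations, divided by the
# square root of the product of the two sums of squares
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
r = num / den

print(round(r, 3))   # the sign gives the direction, the magnitude the degree
```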
Discussing the Results of the SPSS Analyses

T-test (comparing two groups)
ANOVA (comparing three or more groups)
Correlation
Regression
Chi-square (nominal by nominal)
Multiple linear regression
Correlation and Regression

The correlation coefficient r takes values between -1 and +1:
+1 indicates a perfect positive relationship.
-1 indicates a perfect negative relationship.
0 indicates no relationship.

Types of correlation coefficients:
Pearson: for two continuous (interval- or ratio-scaled) variables.
Spearman: a rank correlation for ordinal variables.
Kendall's tau: a rank correlation for ordinal variables.
Phi: for two dichotomous variables.
Cramer's V: for two nominal variables.
Correlations

Pearson Correlation    1        .959**
Sig. (2-tailed)        .        .000
N                      10       10
Pearson Correlation    .959**   1
Sig. (2-tailed)        .000     .
N                      10       10
**. Correlation is significant at the 0.01 level
(2-tailed).

The 2-tailed significance is 0.000 and r = 0.959, well above 0.5,
so there is a strong, significant positive correlation between the
two variables.

Correlations


Pearson Correlation 1 .959** .780** .833**
Sig. (2-tailed) . .000 .008 .003
N 10 10 10 10
Pearson Correlation .959** 1 .746* .811**
Sig. (2-tailed) .000 . .013 .004
N 10 10 10 10
Pearson Correlation .780** .746* 1 .890**
Sig. (2-tailed) .008 .013 . .001
N 10 10 10 10
Pearson Correlation .833** .811** .890** 1
Sig. (2-tailed) .003 .004 .001 .
N 10 10 10 10
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

Correlations marked ** are significant at the 0.01 level; those
marked * are significant at the 0.05 level.

In the next output, r = 0.949 with a 1-tailed significance of .000:

Correlations

Pearson Correlation    1.000    .949
                       .949     1.000
Sig. (1-tailed)        .        .000
                       .000     .
N                      10       10
                       10       10

The correlation between Current Salary and Beginning Salary is 0.88:

Paired Samples Correlations

                                    N     Correlation   Sig.
Pair 1  Current Salary &
        Beginning Salary            474   .880          .000
T-Test

ANOVA (Analysis of Variance), due to Fisher, extends the
t-test from two groups to three or more groups.

T-test:
Independent samples t-test: compares the means of two
different groups.
Paired samples t-test: compares two measurements taken
on the same group.
Independent Samples Test (dependent variable: Current Salary)

Levene's Test for Equality of Variances: F = 119.669, Sig. = .000

t-test for Equality of Means:
                              t         df       Sig.        Mean          Std. Error    95% CI          95% CI
                                                 (2-tailed)  Difference    Difference    Lower           Upper
Equal variances assumed       -10.945   472      .000        -$15,409.86   $1,407.906    -$18,176.40     -$12,643.32
Equal variances not assumed   -11.688   344.262  .000        -$15,409.86   $1,318.400    -$18,003.00     -$12,816.73
Levene's test: F = 119.669, Sig. = 0.000, so equal variances are
not assumed and the second row of the t-test output is used.

t = -11.688, Sig. = 0.000.

Since Sig. (2-tailed) = 0.00 < 0.05, the difference in means
between the two groups is statistically significant.

Paired Samples Test

Paired Differences
                                 Mean         Std.         Std. Error   95% CI        95% CI        t        df    Sig.
                                              Deviation    Mean         Lower         Upper                        (2-tailed)
Pair 1  Current Salary -
        Beginning Salary         $17,403.48   $10,814.62   $496.732     $16,427.41    $18,379.56    35.036   473   .000

Analysis of Variance

Types of Analysis of Variance:
One-Way ANOVA
Two-Way ANOVA (two factors)

Multivariate Analysis of Variance:
One-Way MANOVA
Two-Way MANOVA

Two-Way Analysis of Variance examines the effects of two
factors on a single dependent variable at the same time.
Two-Way ANOVA

Tests three hypotheses:
- The main effect of the first factor on the dependent variable.
- The main effect of the second factor on the dependent variable.
- The interaction between the two factors.

Two-Way Analysis of Variance (example)

At the 0.05 significance level, test whether Gender, Employment
Category, and their interaction have an effect on Current Salary.

Between-Subjects Factors

                         Value Label    N
Gender        f          Female         216
              m          Male           258
Employment    1          Clerical       363
Category      2          Custodial      27
              3          Manager        84

Descriptive Statistics

Dependent Variable: Current Salary


Gender Employment Category Mean Std. Deviation N
Female Clerical $25,003.69 $5,812.838 206
Manager $47,213.50 $8,501.253 10
Total $26,031.92 $7,558.021 216
Male Clerical $31,558.15 $7,997.978 157
Custodial $30,938.89 $2,114.616 27
Manager $66,243.24 $18,051.570 74
Total $41,441.78 $19,499.214 258
Total Clerical $27,838.54 $7,567.995 363
Custodial $30,938.89 $2,114.616 27
Manager $63,977.80 $18,244.776 84
Total $34,419.57 $17,075.661 474

Gender: Sig. = 0.000 < 0.05, so gender has a significant
effect on current salary.
Jobcat: Sig. = 0.000 < 0.05, so employment category has a
significant effect on current salary.
Gender * Jobcat: Sig. = 0.000 < 0.05, so the interaction
effect is also significant.

Tests of Between-Subjects Effects

Dependent Variable: Current Salary


Type III Sum
Source of Squares df Mean Square F Sig.
Corrected Model 9.646E+10a 4 2.411E+10 272.780 .000
Intercept 1.773E+11 1 1.773E+11 2005.313 .000
GENDER 5247440732 1 5247440732 59.359 .000
JOBCAT 3.232E+10 2 1.616E+10 182.782 .000
GENDER * JOBCAT 1247682867 1 1247682867 14.114 .000
Error 4.146E+10 469 88401147.44
Total 6.995E+11 474
Corrected Total 1.379E+11 473
a. R Squared = .699 (Adjusted R Squared = .697)
Basic Objectives in Data Analysis
1. Feel for the Data
2. Testing Goodness of Data
3. Hypothesis Testing
Checking the Reliability of Measures:
Cronbach's Alpha
The interitem consistency reliability, or the
Cronbach's alpha reliability coefficients, of
the five independent and dependent
variables were obtained. They were all
above .80. A sample of the result obtained
for the Cronbach's alpha test for the
dependent variable, Intention to Leave,
together with instructions on how it is
obtained, is shown below:

Reliability Coefficients (6 items)
Alpha = .8172

The result indicates that the Cronbach's alpha
for the six-item Intention to Leave measure is
.82. The closer the reliability coefficient gets to
1.0, the better. In general, reliabilities less
than .60 are considered poor, those in the .70
range acceptable, and those over .80 good.
Cronbach's alpha for the other four
independent variables ranged from .81 to .85.
Thus, the internal consistency reliability of the
measures used in this study can be
considered good.
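As a sketch of what Cronbach's alpha computes, here is the standard formula, alpha = k/(k-1) x (1 - sum of item variances / variance of total score), applied to a small hypothetical data set (the study's actual item responses are not shown in these slides):

```python
import statistics

# Hypothetical Likert-scale responses: rows = respondents, columns = items
scores = [
    [4, 4, 5, 4],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [2, 3, 2, 2],
    [4, 5, 4, 4],
]

k = len(scores[0])                                      # number of items
item_vars = [statistics.pvariance(col) for col in zip(*scores)]
total_var = statistics.pvariance([sum(row) for row in scores])

# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```

When the items move together (respondents who score high on one item score high on the others), the total-score variance dominates and alpha approaches 1.0, matching the interpretation given above.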