Descriptive Statistics: Numerical Measures
DAY 4
Recap
Day 1: Introduction, types of statistics, data and its types
Definition of statistics; terminology: population, sample,
parameter, statistic; qualitative and quantitative data; levels of
measurement: nominal, ordinal, interval, and ratio; sources
of data: primary and secondary; applications of
statistics in various functions of management; data mining and
data warehousing
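Skewness
$\text{Skewness} = \dfrac{n}{(n-1)(n-2)} \sum \left( \dfrac{x_i - \bar{x}}{s} \right)^3$
Coefficient of skewness:
$S = \dfrac{3(\mu - M_d)}{\sigma}$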
If S < 0, the distribution is negatively skewed
(skewed to the left).
If S = 0, the distribution is symmetric (not
skewed).
If S > 0, the distribution is positively skewed
(skewed to the right).
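Coefficient of Skewness
$\mu_1 = 23, \quad \mu_2 = 26, \quad \mu_3 = 29$
$M_{d1} = 26, \quad M_{d2} = 26, \quad M_{d3} = 26$
$\sigma_1 = 12.3, \quad \sigma_2 = 12.3, \quad \sigma_3 = 12.3$
$S_1 = \dfrac{3(\mu_1 - M_{d1})}{\sigma_1} = \dfrac{3(23 - 26)}{12.3} = -0.73$
$S_2 = \dfrac{3(\mu_2 - M_{d2})}{\sigma_2} = \dfrac{3(26 - 26)}{12.3} = 0$
$S_3 = \dfrac{3(\mu_3 - M_{d3})}{\sigma_3} = \dfrac{3(29 - 26)}{12.3} = +0.73$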
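The three coefficients above can be checked with a few lines of Python. This is only a sketch: the function name `pearson_skewness` is an illustrative choice, and the inputs are the means, median, and standard deviation from the slide.

```python
def pearson_skewness(mean, median, std_dev):
    """Coefficient of skewness: S = 3 * (mean - median) / std_dev."""
    return 3 * (mean - median) / std_dev

# Values for the three distributions on the slide.
means = [23, 26, 29]
median = 26
std_dev = 12.3

coefficients = [pearson_skewness(m, median, std_dev) for m in means]

for i, s in enumerate(coefficients, start=1):
    if s < 0:
        shape = "negatively skewed"
    elif s > 0:
        shape = "positively skewed"
    else:
        shape = "symmetric"
    print(f"S{i} = {s:+.2f} ({shape})")
```

The sign of S depends only on whether the mean lies below or above the median, which is why the first distribution comes out negative and the third positive.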
Distribution Shape: Skewness
[Figure: relative frequency histogram]
Distribution Shape: Skewness
Moderately Skewed Left
Skewness is negative.
Mean will usually be less than the median.
Skewness = -.31
[Figure: relative frequency histogram, moderately skewed left]
Distribution Shape: Skewness
Moderately Skewed Right
Skewness is positive.
Mean will usually be more than the median.
Skewness = .31
[Figure: relative frequency histogram, moderately skewed right]
Distribution Shape: Skewness
Highly Skewed Right
Skewness is positive (often above 1.0).
Mean will usually be more than the median.
Skewness = 1.25
[Figure: relative frequency histogram, highly skewed right]
Distribution Shape: Skewness (For Practice)
Example: Apartment Rents
Seventy efficiency apartments were randomly
sampled in a college town. The monthly rent prices
for the apartments are listed below in ascending order.
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
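One way to work this practice example is sketched below. It assumes the bias-adjusted sample skewness formula, n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations; the helper name `sample_skewness` is illustrative.

```python
import math
import statistics

# Monthly rents for the 70 sampled apartments, as listed on the slide.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

def sample_skewness(data):
    """Bias-adjusted sample skewness (assumed formula, see lead-in)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

mean_rent = statistics.mean(rents)
median_rent = statistics.median(rents)
skew = sample_skewness(rents)

print(f"mean = {mean_rent:.2f}, median = {median_rent:.2f}, skewness = {skew:.2f}")
```

The mean exceeds the median here, which is the pattern the earlier slides associate with a right-skewed distribution.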
Distribution Shape: Skewness
[Figure: relative frequency histogram]
Kurtosis
Peakedness of a distribution
Leptokurtic: high and thin
Mesokurtic: normal in shape
Platykurtic: flat and spread out
[Figure: leptokurtic, mesokurtic, and platykurtic curves]
Salary data: 3310, 3355, 3450, 3480, 3480, 3490, 3520, 3540, 3550, 3650, 3730, 3925

Descriptive statistics: salary
Statistic             Value
Mean                  3540
Standard Error        47.81989569
Median                3505
Mode                  3480
Standard Deviation    165.6529779
Sample Variance       27440.90909
Kurtosis              1.718883645
Skewness              1.091108688
Range                 615
Minimum               3310
Maximum               3925
Sum                   42480
Count                 12
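The summary table can be reproduced from the twelve salary values. The sketch below assumes the bias-adjusted sample formulas for skewness and excess kurtosis that spreadsheet tools commonly use; the slide doesn't state which formulas produced its figures, so treat that choice as an assumption.

```python
import math
import statistics

salary = [3310, 3355, 3450, 3480, 3480, 3490,
          3520, 3540, 3550, 3650, 3730, 3925]

n = len(salary)
mean = statistics.mean(salary)
s = statistics.stdev(salary)  # sample standard deviation (divides by n - 1)

def sample_skewness(data, mean, s):
    n = len(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

def sample_excess_kurtosis(data, mean, s):
    n = len(data)
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

summary = {
    "Mean": mean,
    "Standard Error": s / math.sqrt(n),
    "Median": statistics.median(salary),
    "Mode": statistics.mode(salary),
    "Standard Deviation": s,
    "Sample Variance": statistics.variance(salary),
    "Kurtosis": sample_excess_kurtosis(salary, mean, s),
    "Skewness": sample_skewness(salary, mean, s),
    "Range": max(salary) - min(salary),
    "Minimum": min(salary),
    "Maximum": max(salary),
    "Sum": sum(salary),
    "Count": n,
}

for name, value in summary.items():
    print(f"{name:<20}{value:.4f}")
```

A positive kurtosis under this formula means heavier tails than a normal curve, and the positive skewness reflects the single large value of 3925.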
Relative Location: z-Score
* In addition to measures of location, variability, and
shape, we are also interested in the relative location of
values within a data set.
$z_i = \dfrac{x_i - \bar{x}}{s}$
[Figure: number line centered at $\bar{x}$, marked with z-scores from -3 to +3]
Detecting Outliers
An outlier is an unusually small or unusually large
value in a data set.
A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
It might be:
an incorrectly recorded data value
a data value that was incorrectly included in the
data set
a correctly recorded data value that belongs in
the data set
Detecting Outliers (For Practice)
Example: Apartment Rents
The most extreme z-scores are -1.20 and 2.27.
Using |z| > 3 as the criterion for an outlier, there
are no outliers in this data set.
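The z-score check for the apartment rents can be sketched as follows, using z = (x - mean) / s with the sample standard deviation. The variable names are illustrative.

```python
import statistics

# Monthly rents for the 70 sampled apartments, as listed earlier.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

mean = statistics.mean(rents)
s = statistics.stdev(rents)  # sample standard deviation

# z-score of every observation.
z_scores = [(x - mean) / s for x in rents]

# Flag values whose z-score is beyond +/- 3.
outliers = [x for x, z in zip(rents, z_scores) if abs(z) > 3]

print(f"most extreme z-scores: {min(z_scores):.2f} and {max(z_scores):.2f}")
print("outliers:", outliers)
```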
[Figure: box plot labeled Minimum, Q1, Q2, Q3, Maximum]
Steps to Construct a Box Plot
1. A box is drawn with the ends of the box located at the first and third quartiles. For
the salary data, Q1 = 3465 and Q3 = 3600. This box contains the middle 50% of the data.
2. A vertical line is drawn in the box at the location of the median (3505 for the salary
data).
3. By using the interquartile range, IQR = Q3 - Q1, limits are located. The limits for the
box plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3. For the salary data, IQR =
Q3 - Q1 = 3600 - 3465 = 135. Thus, the limits are 3465 - 1.5(135) = 3262.5 and
3600 + 1.5(135) = 3802.5. Data outside these limits are considered outliers.
4. The dashed lines are called whiskers. The whiskers are drawn from the
ends of the box to the smallest and largest values inside the limits computed in step 3.
Thus, the whiskers end at salary values of 3310 and 3730.
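The steps above can be sketched in Python for the salary data. The quartile rule in `textbook_percentile` is an assumption: the slide doesn't state how Q1 and Q3 were located, but this common textbook rule (average two neighbors when the index is a whole number, otherwise round up) reproduces the 3465 and 3600 shown.

```python
salary = sorted([3310, 3355, 3450, 3480, 3480, 3490,
                 3520, 3540, 3550, 3650, 3730, 3925])

def textbook_percentile(sorted_data, p):
    """p-th percentile by the index rule i = (p / 100) * n (assumed rule)."""
    n = len(sorted_data)
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is a whole number: average the i-th and (i + 1)-th values.
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    # Otherwise round i up and take the value in that position.
    return sorted_data[whole]

q1 = textbook_percentile(salary, 25)
median = textbook_percentile(salary, 50)
q3 = textbook_percentile(salary, 75)

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = [x for x in salary if lower_limit <= x <= upper_limit]
outliers = [x for x in salary if x < lower_limit or x > upper_limit]
whiskers = (min(inside), max(inside))

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"limits = ({lower_limit}, {upper_limit})")
print(f"whiskers end at {whiskers}, outliers = {outliers}")
```

Library routines often interpolate quartiles differently, so their Q1 and Q3 may not match the slide's values exactly.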
Box Plot
[Figure: box plot on an axis from 400 to 625, with Q1 = 445, Q2 = 475, Q3 = 525]
Covariance
$s_{xy} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ for samples
$\sigma_{xy} = \dfrac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$ for populations
Correlation Coefficient
$r_{xy} = \dfrac{s_{xy}}{s_x s_y}$ for samples
$\rho_{xy} = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$ for populations
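Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear relationship.
Values near +1 indicate a strong positive linear relationship.
r = 0 indicates no linear relationship.
[Figure: scatter plots illustrating r < 0, r > 0, and r = 0]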
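$r = \dfrac{SS_{XY}}{\sqrt{SS_X \, SS_Y}} = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$
$r = \dfrac{\sum XY - \dfrac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right]}}$
$-1 \le r \le 1$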
Covariance and Correlation Coefficient
x          y       (x_i - x̄)   (y_i - ȳ)   (x_i - x̄)(y_i - ȳ)
277.6      69       10.65       -1.0        -10.65
259.5      71       -7.45        1.0         -7.45
269.1      70        2.15        0            0
267.0      70        0.05        0            0
255.6      71      -11.35        1.0        -11.35
272.9      69        5.95       -1.0         -5.95
Average    267.0    70.0                    Total -35.40
Std. Dev.  8.2192   .8944
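The table's totals lead directly to the sample covariance and correlation coefficient. The sketch below applies the sample definitions, s_xy = sum of cross-deviations / (n - 1) and r_xy = s_xy / (s_x * s_y), to the six (x, y) pairs. Variable names are illustrative.

```python
import statistics

x = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]
y = [69, 71, 70, 70, 71, 69]

n = len(x)
x_bar = statistics.mean(x)
y_bar = statistics.mean(y)

# Sum of cross-deviations (the "Total" column in the table).
cross_total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

s_xy = cross_total / (n - 1)   # sample covariance
s_x = statistics.stdev(x)      # sample standard deviations
s_y = statistics.stdev(y)
r_xy = s_xy / (s_x * s_y)      # sample correlation coefficient

print(f"total = {cross_total:.2f}, s_xy = {s_xy:.2f}")
print(f"s_x = {s_x:.4f}, s_y = {s_y:.4f}, r_xy = {r_xy:.4f}")
```

The negative covariance and an r close to -1 indicate a strong negative linear relationship between x and y in this sample.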
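Covariance and Correlation Coefficient
$r = \dfrac{\sum XY - \dfrac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \dfrac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right]}}$
$= \dfrac{21{,}115.07 - \dfrac{(92.93)(2725)}{12}}{\sqrt{\left[720.22 - \dfrac{(92.93)^2}{12}\right]\left[619{,}207 - \dfrac{(2725)^2}{12}\right]}}$
$= .815$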
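The substitution above can be checked numerically. This sketch takes the column sums shown on the slide as given and applies the computational formula for r; the function name `correlation_from_sums` is illustrative.

```python
import math

def correlation_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Computational formula for r, using only column sums."""
    ss_xy = sum_xy - sum_x * sum_y / n
    ss_x = sum_x2 - sum_x ** 2 / n
    ss_y = sum_y2 - sum_y ** 2 / n
    return ss_xy / math.sqrt(ss_x * ss_y)

# Sums taken from the slide (n = 12 observations).
r = correlation_from_sums(
    n=12,
    sum_x=92.93,
    sum_y=2725,
    sum_xy=21115.07,
    sum_x2=720.22,
    sum_y2=619207,
)

print(f"r = {r:.3f}")
```

Because the sum of X squared is only given to two decimals, the term in the first bracket is sensitive to rounding, so small differences in later decimal places are expected.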
Scatter Plot and Correlation Matrix
for the Economics Example
[Figure: scatter plot of Futures Index (vertical axis) against Interest (horizontal axis)]
[Figure: scatter plot of Final Scores (vertical axis) against Test Scores (horizontal axis)]
Problem
An instructor is interested in finding out
how the number of absentees on a given
day is related to the mean temperature that day.
Data from a sample of 10 days:
Abs:   8  7  5  4  2  3  5  6  8  9
Temp: 10 20 25 30 40 45 50 55 59 60
Which is the dependent variable (DV) and which is the independent variable (IV)? Draw a scatter diagram.
Explain the shape of the diagram.
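As a numerical companion to the scatter diagram, the sketch below computes the sample correlation coefficient for the ten days, treating temperature as the independent variable and absentees as the dependent variable (one natural reading of the problem, since temperature cannot depend on attendance).

```python
import math

temp = [10, 20, 25, 30, 40, 45, 50, 55, 59, 60]  # independent variable
absent = [8, 7, 5, 4, 2, 3, 5, 6, 8, 9]          # dependent variable

n = len(temp)
t_bar = sum(temp) / n
a_bar = sum(absent) / n

ss_xy = sum((t - t_bar) * (a - a_bar) for t, a in zip(temp, absent))
ss_x = sum((t - t_bar) ** 2 for t in temp)
ss_y = sum((a - a_bar) ** 2 for a in absent)

r = ss_xy / math.sqrt(ss_x * ss_y)
print(f"r = {r:.3f}")

# Absences are lowest at moderate temperatures and higher at both ends.
coolest, mildest, hottest = absent[0], min(absent), absent[-1]
print(f"absentees at 10: {coolest}, minimum: {mildest}, at 60: {hottest}")
```

The coefficient comes out close to zero even though the points follow a clear U-shaped pattern, a reminder that r measures only the linear part of a relationship.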
Problem
Temperature Vs Absenteeism
[Figure: scatter plot of Absenteeism (vertical axis) against Temperature (horizontal axis)]
Weighted Mean
When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.
Weighted Mean
$\bar{x} = \dfrac{\sum w_i x_i}{\sum w_i}$
where:
xi = value of observation i
wi = weight for observation i
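A minimal sketch of the weighted mean, applied to a GPA calculation. The grade points and credit hours below are hypothetical values chosen for illustration, not data from the slides.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical transcript: grade points earned and credit hours per course.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 3]

gpa = weighted_mean(grade_points, credit_hours)
print(f"GPA = {gpa:.2f}")
```

With equal weights the function reduces to the ordinary arithmetic mean.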
Grouped Data
The weighted mean computation can be used to
obtain approximations of the mean, variance, and
standard deviation for the grouped data.
To compute the weighted mean, we treat the
midpoint of each class as though it were the mean
of all items in the class.
We compute a weighted mean of the class midpoints
using the class frequencies as weights.
Similarly, in computing the variance and standard
deviation, the class frequencies are used as weights.
Mean for Grouped Data
Sample Data
$\bar{x} = \dfrac{\sum f_i M_i}{n}$
Population Data
$\mu = \dfrac{\sum f_i M_i}{N}$
where:
fi = frequency of class i
Mi = midpoint of class i
Mean of Grouped Data
Weighted average of class midpoints
Class frequencies are the weights
$\mu = \dfrac{\sum fM}{\sum f} = \dfrac{\sum fM}{N} = \dfrac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \cdots + f_i M_i}{f_1 + f_2 + f_3 + \cdots + f_i}$
Calculation of Grouped Mean
Class Interval   Frequency (f)   Class Midpoint (M)   fM
20-under 30       6              25                    150
30-under 40      18              35                    630
40-under 50      11              45                    495
50-under 60      11              55                    605
60-under 70       3              65                    195
70-under 80       1              75                     75
Total            50                                   2150
$\mu = \dfrac{\sum fM}{\sum f} = \dfrac{2150}{50} = 43.0$
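The grouped mean calculation can be sketched as a weighted mean of class midpoints, with class frequencies as the weights. The tuple layout (lower limit, upper limit, frequency) is an illustrative choice.

```python
# (lower limit, upper limit, frequency) for each class in the table.
classes = [
    (20, 30, 6),
    (30, 40, 18),
    (40, 50, 11),
    (50, 60, 11),
    (60, 70, 3),
    (70, 80, 1),
]

total_f = 0
total_fm = 0.0
for lower, upper, freq in classes:
    midpoint = (lower + upper) / 2   # M: treated as the mean of the class
    total_f += freq
    total_fm += freq * midpoint      # fM

grouped_mean = total_fm / total_f
print(f"sum f = {total_f}, sum fM = {total_fm:.0f}, mean = {grouped_mean:.1f}")
```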
Median of Grouped Data
$\text{Median} = L + \dfrac{\frac{N}{2} - cf_p}{f_{med}} \, W$
Where:
L = the lower limit of the median class
cfp = cumulative frequency of class preceding the median class
fmed = frequency of the median class
W = width of the median class
N = total of frequencies
Median of Grouped Data -- Example
Class Interval   Frequency   Cumulative Frequency
20-under 30       6           6
30-under 40      18          24
40-under 50      11          35
50-under 60      11          46
60-under 70       3          49
70-under 80       1          50
N = 50
$M_d = L + \dfrac{\frac{N}{2} - cf_p}{f_{med}} \, W = 40 + \dfrac{\frac{50}{2} - 24}{11}(10) = 40.909$
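The same calculation can be sketched in code. The function locates the median class as the first class whose cumulative frequency reaches N/2, then applies the interpolation formula; the name `grouped_median` and the tuple layout are illustrative.

```python
def grouped_median(classes):
    """Median of grouped data.

    classes: list of (lower_limit, upper_limit, frequency) in ascending order.
    """
    total = sum(freq for _, _, freq in classes)
    half = total / 2
    cumulative = 0  # cfp: cumulative frequency before the current class
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= half:
            width = upper - lower
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("classes must contain at least one observation")

classes = [
    (20, 30, 6),
    (30, 40, 18),
    (40, 50, 11),
    (50, 60, 11),
    (60, 70, 3),
    (70, 80, 1),
]

median = grouped_median(classes)
print(f"grouped median = {median:.3f}")
```

Here the median class is 40-under 50, since the cumulative frequency first reaches 25 inside that class.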
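Variance and Standard Deviation
of Grouped Data
Population:
$\sigma^2 = \dfrac{\sum f (M - \mu)^2}{N}$
$\sigma = \sqrt{\sigma^2}$
Sample:
$S^2 = \dfrac{\sum f (M - \bar{X})^2}{n - 1}$
$S = \sqrt{S^2}$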
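Population Variance and Standard
Deviation of Grouped Data
Class Interval    f    M    fM     M - μ   (M - μ)²   f(M - μ)²
20-under 30       6   25   150     -18      324        1944
30-under 40      18   35   630      -8       64        1152
40-under 50      11   45   495       2        4          44
50-under 60      11   55   605      12      144        1584
60-under 70       3   65   195      22      484        1452
70-under 80       1   75    75      32     1024        1024
Total            50        2150                        7200
$\sigma^2 = \dfrac{\sum f (M - \mu)^2}{N} = \dfrac{7200}{50} = 144$
$\sigma = \sqrt{144} = 12$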
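A short sketch of the population variance and standard deviation for the grouped data, reusing the class intervals and frequencies from the grouped mean example. Variable names are illustrative.

```python
import math

# (lower limit, upper limit, frequency) for each class.
classes = [
    (20, 30, 6),
    (30, 40, 18),
    (40, 50, 11),
    (50, 60, 11),
    (60, 70, 3),
    (70, 80, 1),
]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [freq for _, _, freq in classes]

N = sum(freqs)
mu = sum(f * m for f, m in zip(freqs, midpoints)) / N

# Sum of f * (M - mu)^2 across classes.
sum_sq = sum(f * (m - mu) ** 2 for f, m in zip(freqs, midpoints))

pop_variance = sum_sq / N
pop_std_dev = math.sqrt(pop_variance)

print(f"mu = {mu}, sum f(M - mu)^2 = {sum_sq}")
print(f"variance = {pop_variance}, standard deviation = {pop_std_dev}")
```

For sample data the only change would be dividing by n - 1 instead of N.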
Parameters and Statistics
                             Population        Sample
Size                         N                 n
Mean                         $\mu$             $\bar{x}$
Variance                     $\sigma^2$        $S^2$
Standard Deviation           $\sigma$          $S$
Coefficient of Variation     CV                cv
Covariance                   $\sigma_{xy}$     $S_{xy}$
Coefficient of Correlation   $\rho$            r
Chapter 3 : page 123-168