
Descriptive Statistics

A. Describing Qualitative Data (Nominal & Ordinal Scales)

Definition 2.1 A class is one of the categories into which
qualitative data can be classified.

Definition 2.2 The class frequency is the number of
observations in the data set falling into a particular class.

Definition 2.3 The class relative frequency is the class
frequency divided by the total number of observations in the
data set.

Definition 2.4 The class percentage is the class relative
frequency multiplied by 100.

We commonly use tables, bar graphs, pie charts, or Pareto
diagrams to summarize a qualitative data set.
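
As a minimal sketch of Definitions 2.2 to 2.4, the following Python snippet builds a summary table of class frequency, relative frequency, and percentage. The function name `frequency_table` and the `grades` data are hypothetical and made up for illustration; they do not come from the slides.

```python
from collections import Counter

def frequency_table(observations):
    """Return {class: (frequency, relative frequency, percentage)}."""
    counts = Counter(observations)
    n = len(observations)
    # most_common() orders classes by descending frequency,
    # which is the bar order a Pareto diagram uses.
    return {
        cls: (freq, freq / n, 100 * freq / n)
        for cls, freq in counts.most_common()
    }

# Hypothetical qualitative data (illustration only).
grades = ["A", "B", "B", "C", "B", "A", "C", "B", "D", "B"]
table = frequency_table(grades)
for cls, (freq, rel, pct) in table.items():
    print(f"{cls}: frequency={freq}, relative={rel:.2f}, percent={pct:.0f}%")
```

The same three columns are what a bar graph, pie chart, or Pareto diagram would plot.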

Bar graph: The categories (classes) of the qualitative
variable are represented by bars, where the height of each
bar is either the class frequency, class relative frequency,
or class percentage.

Pie chart: The categories (classes) of the qualitative
variable are represented by slices of a pie (circle). The size
of each slice is proportional to the class relative frequency.

Pareto diagram: A bar graph with the categories
(classes) of the qualitative variable (i.e., the bars)
arranged by height in descending order from left to right.

B. Describing Quantitative Data (Interval & Ratio Scales)

For describing, summarizing, and detecting patterns in
a data set, we can use three graphical methods: dot
plots, stem-and-leaf displays, and histograms.

Dot plot: The numerical value of each quantitative
measurement in the data set is represented by a dot on
a horizontal scale. When data values repeat, the dots
are placed above one another vertically.

Stem-and-leaf display: The numerical value of the
quantitative variable is partitioned into a stem and a
leaf. The possible stems are listed in order in a column.
The leaf for each quantitative measurement in the data
set is placed in the corresponding stem row. Leaves for
observations with the same stem value are listed in
increasing order horizontally.
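
The stem-and-leaf construction can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: the measurements are non-negative integers, the stem is the tens part, and the leaf is the units digit. The function name `stem_and_leaf` and the sample values are hypothetical.

```python
def stem_and_leaf(values):
    """Map each stem (tens part) to its sorted leaves (units digits)."""
    ordered = sorted(values)
    # List every possible stem between the smallest and largest,
    # including stems that end up with no leaves.
    stems = range(ordered[0] // 10, ordered[-1] // 10 + 1)
    rows = {stem: [] for stem in stems}
    for v in ordered:
        rows[v // 10].append(v % 10)
    return rows

# Hypothetical measurements (illustration only).
display = stem_and_leaf([23, 41, 25, 37, 41, 48, 20])
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```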

Histogram: The possible numerical values of the
quantitative variable are partitioned into class intervals,
where each interval has the same width. These intervals
form the scale of the horizontal axis. The frequency or
relative frequency of observations in each class interval is
determined. A vertical bar is placed over each class
interval with height equal to either the class frequency or
class relative frequency.

Determining the Number of Classes in a Histogram

Number of Observations in Data Set    Number of Classes
Less than 25                          5-6
25-50                                 7-14
More than 50                          15-20

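
The histogram construction (equal-width class intervals, then a count per interval) can be sketched as follows. The function name `histogram_counts` and the sample values are hypothetical, and the sketch assumes the values are not all equal, so the class width is positive.

```python
def histogram_counts(values, num_classes):
    """Split [min, max] into equal-width classes and count observations."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_classes  # assumes hi > lo
    counts = [0] * num_classes
    for v in values:
        # The largest value sits on the right edge, so clamp it
        # into the last class instead of opening a new one.
        k = min(int((v - lo) / width), num_classes - 1)
        counts[k] += 1
    return lo, width, counts

# Hypothetical data with fewer than 25 observations (illustration only).
start, width, counts = histogram_counts([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], 4)
print(start, width, counts)
```

Dividing each count by the number of observations gives the class relative frequencies for the bar heights.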
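Summation Notation

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$

Example 1:
1. If $x_i = i$, $i = 1, 2, \ldots, 10$, then

$$\sum_{i=1}^{n} x_i = \sum_{i=1}^{10} i = 1 + 2 + 3 + \cdots + 10 = 55$$

2. $$\sum_{i=1}^{5} 3 = 3 + 3 + 3 + 3 + 3 = 15$$
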
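
The two sums in Example 1 can be checked with Python's built-in `sum`; the variable names here are illustrative only.

```python
# Example 1, part 1: x_i = i for i = 1, ..., 10
x = list(range(1, 11))
total = sum(x)

# Example 1, part 2: adding the constant 3 five times
constant_total = sum(3 for _ in range(5))

print(total, constant_total)
```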
Numerical Measures of Central Tendency

A large number of numerical methods are
available to describe quantitative data sets.
Most of these methods measure one of two
data characteristics:
1. The central tendency of the set of
measurements, that is, the tendency of the
data to cluster, or center, about certain
numerical values.
2. The variability of the set of measurements,
that is, the spread of the data.

Central Tendency Measures
Definition 2.5 The mean of a set of
quantitative data is the sum of the
measurements divided by the number of
measurements contained in the data set.

Formula for the sample mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Symbols:
1. $\bar{x}$ = Sample mean
2. $\mu$ = Population mean

Example 2. Calculate the mean of the
following five sample measurements: 1,
2, 3, 4, 5
Solution:

$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$$

Definition 2.6 The median of a quantitative
data set is the middle number when the
measurements are arranged in ascending (or
descending) order.

Calculating a sample median (m):
Arrange the n measurements from smallest to
largest.
1. If n is odd, m is the middle number.
2. If n is even, m is the mean of the middle two
numbers.

Example 3. Calculate the median m of the
following sample measurements:
a. 5, 7, 4, 5, 20, 6, 2
b. 5, 7, 4, 5, 20, 6

Solution: First we rank the measurements in
ascending order.
a. 2, 4, 5, 5, 6, 7, 20
The median m = 5.
b. 4, 5, 5, 6, 7, 20
The median m = (5+6)/2 = 5.5
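
The odd/even rule for the sample median can be sketched as a small Python function and checked against the two ranked lists in the solution above. The function name `sample_median` is hypothetical.

```python
def sample_median(measurements):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(measurements)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

median_a = sample_median([2, 4, 5, 5, 6, 7, 20])  # n = 7, odd
median_b = sample_median([4, 5, 5, 6, 7, 20])     # n = 6, even
print(median_a, median_b)
```

Python's standard library offers `statistics.median`, which follows the same rule.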
Definition 2.7 A data set is said to be
skewed if one tail of the distribution has more
extreme observations than the other tail.

Detecting skewness:
1. If the median is less than the mean then the
data set is skewed to the right.
2. If the median equals the mean then the data
set is symmetric.
3. If the mean is less than the median then the
data set is skewed to the left.
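
The three comparisons above can be sketched as a rule-of-thumb check in Python. The function name `skew_direction` is hypothetical, and the exact-equality test for "symmetric" is only sensible for small, exactly representable data such as the integers used here; with real measurements, a tolerance would be needed.

```python
from statistics import mean, median

def skew_direction(data):
    """Compare median and mean, following the three rules above."""
    m, avg = median(data), mean(data)
    if m < avg:
        return "right"
    if m > avg:
        return "left"
    return "symmetric"

# Sample (a) of Example 3: median 5, mean 49/7 = 7.
print(skew_direction([2, 4, 5, 5, 6, 7, 20]))
```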
Definition 2.7
The mode is the measurement that occurs
most frequently in the data set.

Example 4: Find the mode for the following
data:
8 7 9 6 8 10 9 9 5 7
Solution: Since 9 occurs most often, the mode
is 9.
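
A short Python sketch of the mode, using the data of Example 4. The function name `modes` is hypothetical; it returns a list because a data set can have more than one most frequent value, a case the definition above does not address.

```python
from collections import Counter

def modes(measurements):
    """Return every value that attains the highest frequency."""
    counts = Counter(measurements)
    top = max(counts.values())
    return [value for value, c in counts.items() if c == top]

example_4 = [8, 7, 9, 6, 8, 10, 9, 9, 5, 7]
print(modes(example_4))
```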
Numerical Measures of
Variability
Measures of central tendency provide only a
partial description of a quantitative data set.
The description is incomplete without a
measure of the variability, or spread, of the data
set. Knowledge of the data's variability along
with its center can help us visualize the shape
of a data set as well as its extreme values.

Definition 2.8 The range of a quantitative data
set is equal to the largest measurement minus
the smallest measurement.

Definition 2.9 The sample variance for a
sample of n measurements is equal to the sum
of the squared deviations from the mean
divided by (n-1). In symbols,

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Definition 2.10 The sample standard
deviation, s, is defined as the positive square
root of the sample variance. Thus

$$s = \sqrt{s^2}$$

Symbols:
1. $s^2$ = Sample variance
2. $s$ = Sample standard deviation
3. $\sigma^2$ = Population variance
4. $\sigma$ = Population standard deviation


Example 5. Calculate the means, the ranges, the
sample variances, and the sample standard
deviations of the following data:
a. Sample 1: 1, 2, 3, 4, 5
b. Sample 2: 2, 3, 3, 3, 4
Solution:
The mean for sample 1: $\bar{x} = (1+2+3+4+5)/5 = 3$
The mean for sample 2: $\bar{x} = (2+3+3+3+4)/5 = 3$
The range for sample 1 = 5 - 1 = 4
The range for sample 2 = 4 - 2 = 2
The sample variance for sample 1:

$$s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1} = \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{4} = \frac{10}{4} = 2.5$$

The sample variance for sample 2:

$$s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1} = \frac{(2-3)^2 + (3-3)^2 + (3-3)^2 + (3-3)^2 + (4-3)^2}{4} = \frac{2}{4} = 0.5$$

The sample standard deviation for sample 1:
$s = \sqrt{2.5} \approx 1.581$
The sample standard deviation for sample 2:
$s = \sqrt{0.5} \approx 0.707$
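
The calculations of Example 5 can be reproduced with a direct translation of Definitions 2.8 to 2.10 into Python. The function names are hypothetical; this is a sketch of the formulas, not a replacement for a statistics package.

```python
from math import sqrt

def sample_range(xs):
    """Largest measurement minus smallest measurement."""
    return max(xs) - min(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Positive square root of the sample variance."""
    return sqrt(sample_variance(xs))

sample_1 = [1, 2, 3, 4, 5]
sample_2 = [2, 3, 3, 3, 4]
for xs in (sample_1, sample_2):
    print(sample_range(xs), sample_variance(xs), round(sample_std(xs), 3))
```

Both samples share the mean 3, yet sample 1 has five times the variance of sample 2, which is the point of the example.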
Example 6
The following data are the percentages of revenues spent on
Research and Development (R&D) by 50 companies:
13.5 9.5 8.2 6.5 8.4 8.1 6.9
7.5 10.5 13.5 7.2 7.1 9.0 9.9
8.2 13.2 9.2 6.9 9.6 7.7 9.7
7.5 7.2 5.9 6.6 11.1 8.8 5.2 10.6
8.2 11.3 5.6 10.1 8.0 8.5 11.7 7.1
7.7 9.4 6.0 8.0 7.4 10.5 7.8 7.9
6.5 6.9 6.5 6.8 9.5
Using computer software (Excel) we get:
The mean = 8.492
The variance = 3.922792
The standard deviation = 1.980604
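
The slides obtain these figures with Excel. As an alternative sketch, the same three statistics can be computed with Python's standard `statistics` module; the list name `rd_percentages` is hypothetical, and the values are the 50 measurements listed above.

```python
from statistics import mean, stdev, variance

# The 50 R&D percentages from Example 6.
rd_percentages = [
    13.5, 9.5, 8.2, 6.5, 8.4, 8.1, 6.9,
    7.5, 10.5, 13.5, 7.2, 7.1, 9.0, 9.9,
    8.2, 13.2, 9.2, 6.9, 9.6, 7.7, 9.7,
    7.5, 7.2, 5.9, 6.6, 11.1, 8.8, 5.2, 10.6,
    8.2, 11.3, 5.6, 10.1, 8.0, 8.5, 11.7, 7.1,
    7.7, 9.4, 6.0, 8.0, 7.4, 10.5, 7.8, 7.9,
    6.5, 6.9, 6.5, 6.8, 9.5,
]

rd_mean = mean(rd_percentages)
rd_variance = variance(rd_percentages)  # sample variance, divisor n - 1
rd_std = stdev(rd_percentages)          # sample standard deviation

print(round(rd_mean, 3), round(rd_variance, 6), round(rd_std, 6))
```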
Interpreting the Standard Deviation

Chebyshev's Rule (for any data set):

1. No useful information is provided on the fraction
of measurements that fall within 1 standard
deviation of the mean.
2. At least 3/4 of the measurements will fall
within 2 standard deviations of the mean.
3. At least 8/9 of the measurements will fall
within 3 standard deviations of the mean.
4. Generally, for any number k greater than 1, at
least (1 - 1/k^2) of the measurements will fall
within k standard deviations of the mean.
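
The general bound in item 4 is easy to evaluate for any k. A minimal sketch, with the hypothetical function name `chebyshev_lower_bound`:

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule gives no useful bound for k <= 1")
    return 1 - 1 / k**2

for k in (2, 3, 1.5):
    print(k, chebyshev_lower_bound(k))
```

For k = 2 and k = 3 this reproduces the fractions 3/4 and 8/9 from items 2 and 3; k need not be an integer.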
Interpreting the Standard Deviation

The Empirical Rule (for data sets with
frequency distributions that are mound-shaped
and symmetric):

1. Approximately 68% of the measurements will
fall within 1 standard deviation of the mean.
2. Approximately 95% of the measurements will
fall within 2 standard deviations of the mean.
3. Approximately 99.7% of the measurements will
fall within 3 standard deviations of the mean.
Example 7
In Example 6, the mean and standard
deviation (rounded) of the R&D
percentages for 50 companies are 8.49
and 1.98, respectively. Calculate the
fraction of these measurements that lie
within the intervals
$\bar{x} \pm s$, $\bar{x} \pm 2s$, and $\bar{x} \pm 3s$.
Solution:
$(\bar{x} - s, \bar{x} + s) = (8.49 - 1.98, 8.49 + 1.98) = (6.51, 10.47)$
contains 34 of the 50 measurements, or 68%.

$(\bar{x} - 2s, \bar{x} + 2s) = (8.49 - 3.96, 8.49 + 3.96) = (4.53, 12.45)$
contains 47 of the 50 measurements, or 94%.

$(\bar{x} - 3s, \bar{x} + 3s) = (8.49 - 5.94, 8.49 + 5.94) = (2.55, 14.43)$
contains all, or 100%, of the measurements.
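
The counts in Example 7 can be checked by counting how many measurements fall inside each interval. This sketch uses the unrounded mean and standard deviation computed from the data, which gives the same counts as the rounded values 8.49 and 1.98; the names `rd_percentages` and `count_within` are hypothetical.

```python
from statistics import mean, stdev

# The 50 R&D percentages from Example 6.
rd_percentages = [
    13.5, 9.5, 8.2, 6.5, 8.4, 8.1, 6.9,
    7.5, 10.5, 13.5, 7.2, 7.1, 9.0, 9.9,
    8.2, 13.2, 9.2, 6.9, 9.6, 7.7, 9.7,
    7.5, 7.2, 5.9, 6.6, 11.1, 8.8, 5.2, 10.6,
    8.2, 11.3, 5.6, 10.1, 8.0, 8.5, 11.7, 7.1,
    7.7, 9.4, 6.0, 8.0, 7.4, 10.5, 7.8, 7.9,
    6.5, 6.9, 6.5, 6.8, 9.5,
]

def count_within(data, k):
    """Number of measurements strictly inside mean +/- k standard deviations."""
    x_bar, s = mean(data), stdev(data)
    return sum(1 for x in data if x_bar - k * s < x < x_bar + k * s)

for k in (1, 2, 3):
    inside = count_within(rd_percentages, k)
    print(k, inside, f"{100 * inside / len(rd_percentages):.0f}%")
```

The observed 68%, 94%, and 100% sit close to the Empirical Rule's 68%, 95%, and 99.7%, and well above the Chebyshev minimums.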


Example 8
A manufacturer of automobile batteries claims that the
average length of life for its grade A battery is 60 months.
However, the guarantee on this brand is for just 36
months. Suppose the standard deviation of the life length
is known to be 10 months, and the frequency distribution
of the life-length data is known to be mound-shaped.
a. Approximately what percentage of the manufacturer's
grade A batteries will last more than 50 months,
assuming the manufacturer's claim is true?
b. Approximately what percentage of the manufacturer's
grade A batteries will last less than 40 months, assuming
the manufacturer's claim is true?
c. Suppose your battery lasts 37 months. What could you
infer about the manufacturer's claim?
Solution:
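
One way to work the three parts is sketched below. It assumes that the mound-shaped distribution is also roughly symmetric, so the Empirical Rule applies, and it writes $\mu = 60$ for the claimed mean and $\sigma = 10$ for the known standard deviation; these symbols and the layout are assumptions of this sketch.

$$
\begin{aligned}
\text{a. } 50 &= \mu - \sigma: & \text{percentage above 50} &\approx \tfrac{1}{2}(68\%) + 50\% = 84\% \\
\text{b. } 40 &= \mu - 2\sigma: & \text{percentage below 40} &\approx \tfrac{1}{2}(100\% - 95\%) = 2.5\% \\
\text{c. } 37 &< \mu - 2\sigma: & \text{percentage at or below 37} &\text{ is less than about } 2.5\%
\end{aligned}
$$

For part c, a life of 37 months would be a rare event if the claim were true, so either this battery was unusually short-lived or the claimed 60-month average is open to doubt.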
Numerical Measures of Relative
Standing
Definition 2.11
For any set of n measurements
(arranged in ascending or descending
order), the pth percentile is a number
such that p% of the measurements fall
below the pth percentile and (100-p)%
fall above it.
Example 9
Locate the 25th percentile and 95th percentile of
the R&D data.
Solution:
Using SPSS, we find that the 25th percentile is 7.05
and the 95th percentile is 13.335.
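
Software packages interpolate percentiles in different ways, so results can differ slightly between tools. As a sketch, Python's `statistics.quantiles` with its default `"exclusive"` method reproduces the two values quoted above for the R&D data; whether SPSS uses exactly this rule in every configuration is not claimed here. The names `rd_percentages`, `p25`, and `p95` are hypothetical.

```python
from statistics import quantiles

# The 50 R&D percentages from Example 6.
rd_percentages = [
    13.5, 9.5, 8.2, 6.5, 8.4, 8.1, 6.9,
    7.5, 10.5, 13.5, 7.2, 7.1, 9.0, 9.9,
    8.2, 13.2, 9.2, 6.9, 9.6, 7.7, 9.7,
    7.5, 7.2, 5.9, 6.6, 11.1, 8.8, 5.2, 10.6,
    8.2, 11.3, 5.6, 10.1, 8.0, 8.5, 11.7, 7.1,
    7.7, 9.4, 6.0, 8.0, 7.4, 10.5, 7.8, 7.9,
    6.5, 6.9, 6.5, 6.8, 9.5,
]

# n=100 returns the 99 cut points between percentiles;
# index p - 1 holds the pth percentile.
cut_points = quantiles(rd_percentages, n=100)
p25 = cut_points[24]
p95 = cut_points[94]

print(round(p25, 3), round(p95, 3))
```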
Definition 2.12 The sample z-score for
a measurement x is

$$z = \frac{x - \bar{x}}{s}$$

The population z-score for a
measurement x is

$$z = \frac{x - \mu}{\sigma}$$

Example 10
Suppose 200 steelworkers are selected, and
the annual income of each is determined. The
mean and standard deviation are $34,000 and
$2,000, respectively. Suppose Joe Smith's
annual income is $32,000. What is his sample
z-score?
Solution:

$$z = \frac{x - \bar{x}}{s} = \frac{\$32{,}000 - \$34{,}000}{\$2{,}000} = -1.0$$

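
The z-score of Example 10 as a small Python sketch. The function name `z_score` and its parameter names are hypothetical; the same function covers the sample and population versions, depending on which mean and standard deviation are passed in.

```python
def z_score(x, center, spread):
    """Distance of x from the mean, in units of standard deviations."""
    return (x - center) / spread

# Joe Smith's income against the sample mean and standard deviation.
joe_z = z_score(32_000, 34_000, 2_000)
print(joe_z)
```

A negative z-score means the measurement lies below the mean; here it is exactly one standard deviation below.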
For mound-shaped distributions of data:

1. Approximately 68% of the measurements will
have a z-score between -1 and 1.
2. Approximately 95% of the measurements will
have a z-score between -2 and 2.
3. Approximately 99.7% of the measurements
will have a z-score between -3 and 3.
Outliers
Definition 2.13 An observation (or measurement)
that is unusually large or small relative to the other
values in a data set is called an outlier.

Definition 2.14 The lower quartile QL is the 25th
percentile of a data set. The middle quartile m is the
median. The upper quartile QU is the 75th percentile.

Definition 2.15 The interquartile range (IQR) is the
distance between the lower and upper quartiles:
IQR = QU - QL
Lower inner fence = QL - 1.5(IQR)
Upper inner fence = QU + 1.5(IQR)
Lower outer fence = QL - 3(IQR)
Upper outer fence = QU + 3(IQR)

Measurements that fall beyond the
outer fences are considered to be
outliers.
Example 11
Detect the outliers of the following data
sets:
a. 2, 5, 6, 8, 10, 15, 20, 22, 30
b. 2, 5, 6, 8, 10, 15, 20, 22, 40
Solution:
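
The fences can be computed as sketched below. The quartile rule is an assumption: `statistics.quantiles` with its default `"exclusive"` method is used here, and other quartile conventions can give different fences. The function names `fences` and `beyond` are hypothetical.

```python
from statistics import quantiles

def fences(data):
    """Inner and outer fences from the quartiles and the IQR."""
    q_l, _, q_u = quantiles(data, n=4)  # assumed quartile convention
    iqr = q_u - q_l
    return {
        "inner": (q_l - 1.5 * iqr, q_u + 1.5 * iqr),
        "outer": (q_l - 3 * iqr, q_u + 3 * iqr),
    }

def beyond(data, bounds):
    """Measurements that fall outside the given (low, high) fences."""
    low, high = bounds
    return [x for x in data if x < low or x > high]

data_a = [2, 5, 6, 8, 10, 15, 20, 22, 30]
data_b = [2, 5, 6, 8, 10, 15, 20, 22, 40]
for data in (data_a, data_b):
    f = fences(data)
    print(f, beyond(data, f["inner"]), beyond(data, f["outer"]))
```

Under this quartile convention both data sets have QL = 5.5, QU = 21, and IQR = 15.5, so no measurement in either set falls beyond the inner or outer fences, including the value 40 in data set b.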
