
Chapter 2 - Descriptive Statistics and Graphical Data Analysis


In the previous chapter, we discussed many measures used for summarizing data like
mean, median, variance, IQR, and coefficient of skewness. Computing statistical measures
without looking at a plot is an invitation to misunderstanding data. Graphs provide visual
summaries of data which more quickly and completely describe essential information than
do tables of numbers. Patterns and theories of how the system behaves are developed by
observing the data through graphs. Their results provide guidance for the selection of
appropriate deductive hypothesis testing procedures. This is known as Exploratory Data
Analysis (EDA).

This chapter discusses several commonly used graphical analysis tools.

Histograms

Histograms are plots used for visually inspecting the distribution of a data set. Histograms
are quite useful for depicting large differences in shape or symmetry, such as whether a
data set appears symmetric or skewed. To construct a histogram, the first step is to "bin"
the range of values (that is, divide the entire range of values into a series of intervals) and
then count how many values fall into each interval. The bins are usually specified as
consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent,
and are often (but are not required to be) of equal size. Histograms can be generated using
Excel with a built-in tool (Tutorial with example:
https://www.ablebits.com/office-addins-blog/2016/05/11/make-histogram-excel/).

Example:

We are going to analyze the annual rainfall pattern in Los Angeles (LA).

Table 1. Sample annual rainfall data in L.A.

Year Annual rainfall (inches)
1981 10.71
1982 31.25
1983 10.43
1984 12.82
1985 17.86
1986 7.66
1987 12.48
1988 8.08
1989 7.35
1990 11.47
1991 21
1992 27.36
1993 8.11
1994 24.35
1995 12.46
1996 12.4
1997 31.01
1998 9.09
1999 11.57
2000 17.94
2001 4.42
2002 16.49
2003 9.24
2004 37.25
2005 13.19
2006 3.21
2007 13.53
2008 9.08
2009 16.36
2010 20.2
2011 8.69
2012 5.85
2013 6.08
2014 8.52
2015 9.65
2016 19

Excel has a data analysis package that allows the user to easily generate a histogram. Fig 1
shows the histogram generated for annual rainfall data. Table 2 shows the frequency table
based on which the histogram is generated.
[Figure: histogram; horizontal axis: Bin (5 to 40 in steps of 5, plus "More"); vertical axis: Frequency]

Fig 1. Histogram of annual rainfall data

Table 2. Frequency table for rainfall histogram

Bin (upper limit) Frequency


5 2
10 12
15 10
20 5
25 3
30 1
35 2
40 1
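
The counts in Table 2 can be reproduced with a few lines of code. The following is a minimal sketch, not the Excel procedure itself: the helper name `bin_counts` is hypothetical, and it assumes each bin includes its upper limit and excludes its lower limit.

```python
# Annual rainfall in LA (inches), 1981-2016, copied from Table 1.
rainfall = [
    10.71, 31.25, 10.43, 12.82, 17.86, 7.66, 12.48, 8.08, 7.35, 11.47,
    21.0, 27.36, 8.11, 24.35, 12.46, 12.4, 31.01, 9.09, 11.57, 17.94,
    4.42, 16.49, 9.24, 37.25, 13.19, 3.21, 13.53, 9.08, 16.36, 20.2,
    8.69, 5.85, 6.08, 8.52, 9.65, 19.0,
]


def bin_counts(values, upper_limits):
    """Count the values in each bin (previous limit, upper limit]."""
    counts = []
    lower = float("-inf")  # the first bin has no lower limit
    for upper in upper_limits:
        counts.append(sum(1 for v in values if lower < v <= upper))
        lower = upper
    return counts


upper_limits = [5, 10, 15, 20, 25, 30, 35, 40]
frequencies = bin_counts(rainfall, upper_limits)

# Print a rough text histogram: upper limit, count, and one '#' per value.
for upper, count in zip(upper_limits, frequencies):
    print(f"{upper:>3} {count:>3} {'#' * count}")
```

The text histogram printed at the end is only a quick check of the shape; a charting tool would normally be used for the figure itself.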

To generate a histogram manually, one should first fix the number of bins (or classes) and
the bin size (or class interval). As a rule of thumb, one may use the Sturges formula:

$$m = 3.3 \log_{10}(n) + 1$$

where

m = number of class intervals, rounded up to the next integer, and
n = number of data points.

$$\text{Bin size} = \frac{\text{Range}}{m}$$

An alternative method to determine number of class intervals (m) is:


m =

You can create the intervals by starting with the minimum number or some number that
is close to it.

In the example given above, using the Sturges formula:

$$m = 3.3 \log_{10}(36) + 1 \approx 6.14$$ (round it up to 7)
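
As a quick check of the arithmetic, the rule of thumb can be evaluated directly. This is a small sketch; the function name `sturges_classes` is hypothetical, and it assumes "round to the next highest integer" means rounding up.

```python
import math


def sturges_classes(n):
    """Number of class intervals from m = 3.3*log10(n) + 1, rounded up."""
    return math.ceil(3.3 * math.log10(n) + 1)


n = 36                             # number of rainfall observations
m_raw = 3.3 * math.log10(n) + 1    # value before rounding
m = sturges_classes(n)
print(round(m_raw, 3), m)
```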

Table 3. Bins for annual rainfall data using Sturges formula


Lower limit   Upper limit   Class mark
3             10            6.5
10            17            13.5
17            24            20.5
24            31            27.5
31            38            34.5

Note: A class mark is defined as the center of a class.

To find the number of occurrences in each interval (frequency), use either the FREQUENCY
or COUNTIF functions. The formats are: FREQUENCY(Data range, Upper limit range) and
COUNTIF(Data range, "<="&Upper limit). COUNTIF with a "<=" criterion returns a cumulative
count, so the actual frequency for each class interval is obtained by subtracting the
previous cumulative count, as shown in Table 4. FREQUENCY returns the count for each
class interval directly.

Table 4. Bins for annual rainfall data using EXCEL functions

Lower limit   Upper limit   Cumulative frequency   Frequency
3             10            14                     14
10            17            26                     12
17            24            31                     5
24            31            33                     2
31            38            36                     3
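
The step from cumulative counts to class frequencies in Table 4 can be sketched as follows. This is an illustration outside Excel, and it assumes that a value equal to an upper limit is counted in that class (a "less than or equal to" comparison).

```python
# Annual rainfall in LA (inches), 1981-2016, copied from Table 1.
rainfall = [
    10.71, 31.25, 10.43, 12.82, 17.86, 7.66, 12.48, 8.08, 7.35, 11.47,
    21.0, 27.36, 8.11, 24.35, 12.46, 12.4, 31.01, 9.09, 11.57, 17.94,
    4.42, 16.49, 9.24, 37.25, 13.19, 3.21, 13.53, 9.08, 16.36, 20.2,
    8.69, 5.85, 6.08, 8.52, 9.65, 19.0,
]
upper_limits = [10, 17, 24, 31, 38]

# Cumulative count: how many values are <= each upper limit.
cumulative = [sum(1 for v in rainfall if v <= u) for u in upper_limits]

# Class frequency: difference between consecutive cumulative counts.
frequency = [cumulative[0]] + [
    later - earlier for earlier, later in zip(cumulative, cumulative[1:])
]

for u, c, f in zip(upper_limits, cumulative, frequency):
    print(u, c, f)
```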

[Figure: histogram; horizontal axis: bins (upper limits 10, 17, 24, 31, 38); vertical axis: Frequency]

Fig 2. Histogram for rainfall data created using FREQUENCY and/or COUNTIF functions

Notice that Fig 1 and 2 are slightly different because we chose a different bin size. Now,
what other information can you gather about the data by looking at the histogram?

Shape: It can be seen from the shape that the data has a right-skewed distribution. This
means the data tends to cluster near the lower bound of the horizontal axis. Consider this
behavior as something similar to a difficult exam: most students will perform poorly,
making the leftmost bars in the histogram taller.

Now, let us use some of the numerical measures we learned in the last chapter to
summarize the data.

Measures of location: We already looked at the definition and formula used to compute
mean and median. Based on that, for the annual rainfall data,
Mean = 14.06 inches
Median = 11.98 inches

The mean and the median are different since this is an asymmetrical, skewed distribution.
One can also use the histogram's frequency table to compute the mean and median:

$$\text{Median} = L + \frac{w}{f_m}\,(0.5\,n - cf_b)$$

Where
L = the lower limit of the interval that contains the median,
w = interval width,
f_m = frequency of the interval that contains the median,
n = number of data points, and
cf_b = cumulative frequency of the intervals before the median interval.

In order to use the above equation, you need to determine the interval that contains the
median. The first interval whose relative cumulative frequency exceeds 0.5 (50%) is the
interval that contains the median. The relative cumulative frequency is obtained by dividing
the cumulative frequency by the total number of observations (see table below).

Table 5. Frequency table for annual rainfall data


Lower limit   Upper limit   Frequency   Cumulative frequency   Relative cumulative frequency
3             10            14          14                     0.39
10            17            12          26                     0.72
17            24            5           31                     0.86
24            31            2           33                     0.92
31            38            3           36                     1

Accordingly, the second interval contains the median (its relative cumulative frequency is 0.72).
Then we can calculate the median using the above equation, where L = 10, w = 7, fm = 12,
n = 36, cfb = 14. Accordingly, the median is 10 + (7/12)(18 - 14) = 12.3 inches.
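
The interpolation can be written as a short routine. The sketch below is one possible implementation under the assumptions stated in the comments; the function name `grouped_median` is hypothetical.

```python
# Classes from Table 5 as (lower limit, upper limit, frequency).
classes = [(3, 10, 14), (10, 17, 12), (17, 24, 5), (24, 31, 2), (31, 38, 3)]


def grouped_median(classes):
    """Median from a frequency table: L + (w / fm) * (0.5 * n - cfb)."""
    n = sum(freq for _, _, freq in classes)
    half = 0.5 * n
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lower, upper, freq in classes:
        # The median class is the first whose cumulative frequency exceeds n/2.
        if cf_before + freq > half:
            width = upper - lower
            return lower + width / freq * (half - cf_before)
        cf_before += freq
    raise ValueError("classes must contain at least one observation")


median = grouped_median(classes)
print(round(median, 2))
```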

Similarly, mean can be calculated from frequency table using the formula:

$$\bar{X} = \sum_{i=1}^{k} \frac{f_i x_i}{n}$$

Where
xi = class mark,
k = number of classes,
fi = frequency of class i, and
n = number of measurements (total sum of frequencies).
In this example, the mean comes out to be about 14.28 inches.
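
A minimal sketch of this calculation, using the class marks from Table 3 and the frequencies from Table 5 (the variable names are illustrative only):

```python
class_marks = [6.5, 13.5, 20.5, 27.5, 34.5]  # centers of the five classes
frequencies = [14, 12, 5, 2, 3]              # observations in each class

n = sum(frequencies)
# Weighted average of the class marks, weighted by class frequency.
grouped_mean = sum(f * x for f, x in zip(frequencies, class_marks)) / n
print(n, round(grouped_mean, 2))
```

Because every observation in a class is replaced by its class mark, the result is an approximation of the mean computed from the raw data.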

In addition to the mean and median, there is another commonly used measure of location
for describing a histogram. It is known as the mode. The mode of a set of
measurements (observations) is defined as the measurement with the highest frequency
of occurrence. From the frequency table, it can be seen that the first interval has the
highest frequency (14). The midpoint (class mark) of that interval is then taken as the
mode. That is, the mode for the data is 6.5 inches.

Measures of spread: Measures of spread are numerical summaries that indicate the
degree of variability or dispersion, that is, how spread out the data are. The range,
which is the difference between the lowest data value and the highest data value, is a
very crude measure of spread. The range can also be computed from the frequency table
as the difference between the highest and lowest class marks (34.5 - 6.5 = 28). Of course,
the wider the range, the larger the degree of dispersion.

Another measure of spread we discussed in Chapter 1 was variance. Variance can be
computed from the frequency table using the following formula:

$$\text{Population variance} = \frac{1}{n} \sum_{i=1}^{k} (x_i - \bar{X})^2 f_i$$

Where
xi = class mark,
k = number of classes,
fi = frequency of class i, and
n = number of measurements (total sum of frequencies)
X = Average

The above-mentioned formula is for population variance. While dealing with sample
variance, n should be replaced with n-1. The formula then becomes,

$$\text{Sample variance} = \frac{1}{n-1} \sum_{i=1}^{k} (x_i - \bar{X})^2 f_i$$

Note that in the case of sample variance, we divide by n-1 instead of n to get an unbiased
estimate of the population variance (σ²). There is a mathematical justification for doing so,
which will be discussed in subsequent chapters. However, you can think of this as a
penalty (cost): the smaller the sample size, the more drastic the effect on s². As the
sample gets larger, our estimate of the variance improves and the effect of subtracting 1
from n becomes negligible. Using the sample variance formula, the variance for this data is
about 75.0 square inches. The square root of the variance gives the standard deviation, s.
For this dataset, s is 8.7 inches.
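
Both variance formulas can be checked numerically from the frequency table. The following is a sketch with illustrative variable names; it should give a sample variance of roughly 75 square inches and a standard deviation of roughly 8.7 inches.

```python
import math

class_marks = [6.5, 13.5, 20.5, 27.5, 34.5]  # centers of the five classes
frequencies = [14, 12, 5, 2, 3]              # observations in each class

n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, class_marks)) / n

# Frequency-weighted sum of squared deviations from the grouped mean.
sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, class_marks))

population_variance = sum_sq / n      # divide by n
sample_variance = sum_sq / (n - 1)    # divide by n - 1
sample_std = math.sqrt(sample_variance)

print(round(population_variance, 2), round(sample_variance, 2), round(sample_std, 2))
```

With n = 36 the two variances differ by less than 3%, which illustrates how the effect of subtracting 1 shrinks as the sample grows.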

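Another measure of spread is IQR. As discussed in the previous chapter, IQR = Q3 - Q1. Q3 and
Q1 can be computed using equations similar to the one for the median:

$$Q_1 = L + \frac{w}{f_m}\,(0.25\,n - cf_b)$$

$$Q_3 = L + \frac{w}{f_m}\,(0.75\,n - cf_b)$$

Here L, f_m, and cf_b refer to the interval that contains the quartile being computed.

Based on these formulas, for this dataset, Q1 is 7.5 inches and Q3 is 18.4 inches.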
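
The same interpolation works for any fraction p, so the quartiles can be computed with one routine. This is a sketch only; the function name `grouped_quantile` is hypothetical, and it assumes the class containing the quantile is the first one whose cumulative frequency reaches p times n.

```python
# Classes from Table 5 as (lower limit, upper limit, frequency).
classes = [(3, 10, 14), (10, 17, 12), (17, 24, 5), (24, 31, 2), (31, 38, 3)]


def grouped_quantile(classes, p):
    """Quantile from a frequency table: L + (w / f) * (p * n - cf_before)."""
    n = sum(freq for _, _, freq in classes)
    target = p * n
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lower, upper, freq in classes:
        if freq > 0 and cf_before + freq >= target:
            return lower + (upper - lower) / freq * (target - cf_before)
        cf_before += freq
    raise ValueError("p must lie between 0 and 1")


q1 = grouped_quantile(classes, 0.25)
q3 = grouped_quantile(classes, 0.75)
iqr = q3 - q1
print(round(q1, 2), round(q3, 2), round(iqr, 2))
```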

Another measure of dispersion is the coefficient of variation (COV). COV is defined as:

$$\text{COV} = \frac{s}{\bar{X}}$$

It is unit-less. This becomes useful when we need to compare two different variables with
different units (thus making it difficult to compare them in other meaningful ways). For
this data, COV is 0.61. Distributions with COV<1 are considered low-variance, and
distributions with COV>1 are considered high-variance.

Dot Diagrams

Dot diagrams are useful tools for depicting the shape of the frequency distribution of a
continuous random variable, when only small samples, with typical sizes of 25-30, are
available. This is a common situation in many hydrologic analyses due to usually limited
periods of records and data samples of short lengths.

To create a dot diagram, the data are first ranked in ascending order of their values and
then plotted on a single horizontal axis. This is particularly useful when you want
to compare different data sets. The figure below shows the dot diagram for monthly
rainfalls at Sacramento and Houston:
[Figure: dot diagram; horizontal axis from 0 to 2.5]

Fig 3. Dot diagram for monthly rainfall data at Sacramento and Houston
Immediately, you can observe that the dispersion (spread) in the first data set is larger
than the second one. Also, it helps to notice if there are outliers among the data points.

Stem and leaf Diagrams

Stem and leaf diagrams are like histograms turned on their side, with data magnitudes to
two significant digits presented rather than only bar heights. Individual values are easily
found. The S-L profile is identical to the histogram and can similarly be used to judge
shape and symmetry, but the numerical information adds greater detail. One S-L could
function as both a table and a histogram for small data sets.


For example, consider this data set of bullet penetration depths:

40.50, 38.35, 56.00, 42.55, 38.35, 27.75, 49.85, 43.60, 38.75, 51.25,
47.90, 48.15, 42.90, 43.85, 37.35, 47.30, 41.15, 51.60, 39.75, 41.00

27 | .75
28 |
29 |
30 |
31 |
32 |
33 |
34 |
35 |
36 |
37 | .35
38 | .35, .35, .75
39 | .75
40 | .50
41 | .00, .15
42 | .55, .90
43 | .60, .85
44 |
45 |
46 |
47 | .30, .90
48 | .15
49 | .85
50 |
51 | .25, .60
52 |
53 |
54 |
55 |
56 | .00
Fig 4. Stem and leaf plot
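
A display like Fig 4 can be built programmatically. The sketch below is one possible way to do it, assuming the integer part of each value is the stem and the two decimal digits are the leaf; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Bullet penetration depths from the example above.
depths = [
    40.50, 38.35, 56.00, 42.55, 38.35, 27.75, 49.85, 43.60, 38.75, 51.25,
    47.90, 48.15, 42.90, 43.85, 37.35, 47.30, 41.15, 51.60, 39.75, 41.00,
]


def stem_and_leaf(values):
    """Map every integer stem in the data range to its sorted leaves."""
    leaves = defaultdict(list)
    for value in sorted(values):
        # Work in hundredths so the split is done on integers, not floats.
        stem, leaf = divmod(round(value * 100), 100)
        leaves[stem].append(leaf)
    # Include empty stems so gaps in the data stay visible.
    return {s: leaves.get(s, []) for s in range(min(leaves), max(leaves) + 1)}


plot = stem_and_leaf(depths)
for stem, stem_leaves in plot.items():
    print(f"{stem} | " + ", ".join(f".{leaf:02d}" for leaf in stem_leaves))
```

Keeping the empty stems is a deliberate choice: the gap between 27 and 37 is what makes the lowest value stand out.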

Scatter Diagrams

The two-dimensional scatterplot is one of the most familiar graphical methods for data
analysis. It illustrates the relationship between two variables. Of usual interest is whether
that relationship appears to be linear or non-linear, whether different groups of data lie
in separate regions of the scatterplot, whether there are outliers, and whether the
relationship (or association) between the two variables is weak or strong. We will study
this in detail in coming chapters.
Fig 4a. Scatter plots as tools to identify the relationship between two variables. First
figure shows a positive linear correlation, second shows a negative linear correlation, and
third shows a non-linear relationship (Source: https://statistics.laerd.com/)

Fig 4b. Scatter plots as tools to identify outliers

Quantiles and Quantile Plots

A quantile is a number (from the sample or population) associated with a fraction p: the
fraction of the sample that is less than or equal to the value of the quantile. Quantiles of
importance are Q(0.25), Q(0.5), and Q(0.75). You are already familiar with these quantiles as
the lower quartile (Q1), median or mid-quartile (Q2), and upper quartile (Q3). In terms of
percentiles, they are the 25th, 50th, and 75th percentiles. Percentiles are quantiles that
divide a distribution into 100 equal parts. Hence, we can generalize and say that the k-th
percentile of a set of values divides them so that k% of the values lie below and (100-k)% of
the values lie above.

How to compute quantiles for a data set?


There are different approaches to compute quantiles. One such method is discussed
below. Keep in mind that different software (EXCEL, MATLAB, SAS, R) use different
approaches and as such, answers might differ. This is more obvious in the case of small
data sets.

Given a set of values x1, x2, ..., xn, we can define the quantiles for any fraction p as follows:

1. Sort the values in ascending order:

$$x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$$

2. The ordered values are called the order statistics of the original sample.
3. Take the order statistics to be the quantiles which correspond to the fractions (or
sample fractions):

$$p_i = \frac{i - 1}{n - 1}, \quad i = 1, \dots, n$$

4. In general, to define the quantile which corresponds to the fraction p, use linear
interpolation between the two nearest pi. If p lies a fraction f of the way from pi
to pi+1, define the p-th quantile to be:

$$Q(p) = (1 - f)\,Q(p_i) + f\,Q(p_{i+1})$$

The function Q is called the quantile function.

Let us consider an example data set:

3.7, 2.7, 3.3, 1.3, 2.2, 3.1

First, sort the values into order: 1.3, 2.2, 2.7, 3.1, 3.3, 3.7

The sample fractions p for these values are computed using the above-mentioned
formula:

Sample fraction (p)   0     0.2   0.4   0.6   0.8   1
Quantile              1.3   2.2   2.7   3.1   3.3   3.7

Let's say we want to find Q(0.25), the lower quartile. This lies between the sample
fractions 0.2 and 0.4, a fraction f = (0.25 - 0.2)/(0.4 - 0.2) = 0.25 of the way from 0.2 to 0.4:
Q(0.25) = (1-0.25)*Q(0.2)+0.25*Q(0.4) = 0.75*2.2+0.25*2.7 = 2.325
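
The four steps above can be sketched as a small function. This is an illustrative implementation of the interpolation rule with sample fractions (i-1)/(n-1), not the code used by any particular software package; the function name `quantile` is hypothetical.

```python
def quantile(values, p):
    """Q(p) by linear interpolation between order statistics,
    using sample fractions p_i = (i - 1) / (n - 1)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2 or not 0 <= p <= 1:
        raise ValueError("need at least two values and 0 <= p <= 1")
    position = p * (n - 1)                # zero-based fractional rank
    lower = min(int(position), n - 2)     # index of the order statistic at p_i
    f = position - lower                  # fraction of the way to p_(i+1)
    return (1 - f) * ordered[lower] + f * ordered[lower + 1]


data = [3.7, 2.7, 3.3, 1.3, 2.2, 3.1]
lower_quartile = quantile(data, 0.25)
median = quantile(data, 0.5)
print(round(lower_quartile, 3), round(median, 3))
```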

Excel can be used to compute the lower, middle, and upper quartiles. The function is
QUARTILE(array, quart) where quart is a value from 0 to 4 depending on which quartile
the user wants to compute.

Fig 5. Plot explaining the calculation of quantiles


There are many graphical approaches in statistics that are based on quantiles. We will
discuss two of the most commonly used quantile-based plots:

Quantile Plot
Quantile plots visually portray the quantiles, or percentiles (which equal the quantiles
times 100), of the distribution of sample data. Quantiles of importance such as the median
are easily discerned (quantile, or cumulative frequency = 0.5). Quantile plots have several
advantages over other plots:
1. Arbitrary categories are not required, as they are with histograms or stem and leaf diagrams.
2. All of the data are displayed, unlike a boxplot (which will be discussed in the next
section).
Quantile plots are sample approximations of the cumulative distribution function (cdf) of a
continuous random variable. We will study more about distribution functions in the
coming chapters.
Construction of a quantile plot
1. To construct a quantile plot, the data are ranked from smallest to largest. The smallest
data value is assigned a rank i=1, while the largest receives a rank i=n, where n is the
sample size of the data set. The data values themselves are plotted along one axis,
usually the horizontal axis.
2. On the other axis is the "plotting position", which is a function of the rank i and sample
size n (similar to the sample fraction discussed before). A number of plotting position
formulae are available (Table 6). The general formula is given as:

$$p_i = \frac{i - a}{n + 1 - 2a}$$

In the previous example, we used a = 1 (which gives the same results as Excel's
QUARTILE function). Some of the other commonly used formulas are given below.

Table 6. Plotting position formulas

Reference           a       Formula
Weibull (1939)      0       i/(n+1)
Blom (1958)         0.375   (i-0.375)/(n+0.25)
Cunnane (1978)      0.4     (i-0.4)/(n+0.2)
Gringorten (1963)   0.44    (i-0.44)/(n+0.12)
Hazen (1914)        0.5     (i-0.5)/n
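
All of the formulas in the table are special cases of the general expression, which the following sketch makes explicit. The function name `plotting_positions` is hypothetical, and n = 6 is chosen only to match the small example data set used earlier.

```python
def plotting_positions(n, a):
    """Return p_i = (i - a) / (n + 1 - 2a) for ranks i = 1, ..., n."""
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]


# Values of the constant a taken from the table above.
formulas = {
    "Weibull": 0.0,
    "Blom": 0.375,
    "Cunnane": 0.4,
    "Gringorten": 0.44,
    "Hazen": 0.5,
}

n = 6
for name, a in formulas.items():
    print(f"{name:<11}", [round(p, 3) for p in plotting_positions(n, a)])

# With a = 1 the positions reduce to the sample fractions (i - 1) / (n - 1).
print("a = 1      ", [round(p, 3) for p in plotting_positions(n, 1)])
```

Note that only a = 1 assigns the positions 0 and 1 to the smallest and largest values; the other choices keep every position strictly between 0 and 1.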


All of the above-mentioned formulas are commonly used in the context of hydrology.
Figure 6 shows the quantile plot for the annual LA rainfall dataset. Since the Weibull plotting
position formula is commonly used with precipitation data, we will use it for this example.
[Figure: quantile plot; horizontal axis: Annual rainfall (inches), 0 to 40; vertical axis: Quantiles, 0 to 1]

Fig 6. Quantile plot for annual LA rainfall
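
The points of such a quantile plot can be generated as follows. This is a sketch that assumes the Weibull positions i/(n+1); it only computes the coordinates and leaves the drawing to whatever charting tool is at hand.

```python
# Annual rainfall in LA (inches), 1981-2016, copied from Table 1.
rainfall = [
    10.71, 31.25, 10.43, 12.82, 17.86, 7.66, 12.48, 8.08, 7.35, 11.47,
    21.0, 27.36, 8.11, 24.35, 12.46, 12.4, 31.01, 9.09, 11.57, 17.94,
    4.42, 16.49, 9.24, 37.25, 13.19, 3.21, 13.53, 9.08, 16.36, 20.2,
    8.69, 5.85, 6.08, 8.52, 9.65, 19.0,
]

ordered = sorted(rainfall)   # rank 1 is the smallest value
n = len(ordered)

# One (rainfall, plotting position) pair per observation, Weibull formula.
points = [(value, i / (n + 1)) for i, value in enumerate(ordered, start=1)]

for value, position in points:
    print(f"{value:6.2f}  {position:.3f}")
```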

When comparing two datasets, we use Quantile-Quantile (Q-Q) plots. This is useful for
determining if two data sets come from populations with a common distribution. We will
discuss this in detail in coming chapters.

Box Plot

A very useful and concise graphical display for summarizing the distribution of a data set is
the boxplot. Boxplots provide visual summaries of:
1. The center of the data (the median: the center line of the box)
2. The variation or spread (interquartile range: the box height)
3. The skewness (quartile skew: the relative size of the box halves)
4. Presence or absence of unusual values (outside or far outside values, also known
as outliers)
The boxplot is a graphical representation of five points: the three quartiles and the
minimum and maximum measurements. The process of displaying the graph is illustrated
with the rainfall data. From the data, we can calculate the quartiles, the minimum, and the
maximum (all in inches).

Min = 3.21    Q1 = 8.6475    Q2 = 11.985    Q3 = 17.88    Max = 37.25    IQR = 9.23

Fig 7. Example boxplot (Source: www.web.mit.edu/11.220/www05/brushup/spss/boxplot.htm)

In addition to the box, there is an upper whisker and a lower whisker.

Upper whisker end = Q3 + 1.5 IQR or the maximum value in the data, whichever is smaller
Lower whisker end = Q1 - 1.5 IQR or the minimum value in the data, whichever is larger

Any data point that lies between 1.5 IQR and 3 IQR beyond the box is considered to be an
outlier. Any data point that lies more than 3 IQR beyond the box is considered to be an extreme.
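
The quantities behind a boxplot can be computed directly from the data. The sketch below assumes the interpolation rule with sample fractions (i-1)/(n-1) for the quartiles and measures distances from the nearer edge of the box; the function names `quantile` and `classify` are hypothetical.

```python
# Annual rainfall in LA (inches), 1981-2016, copied from Table 1.
rainfall = [
    10.71, 31.25, 10.43, 12.82, 17.86, 7.66, 12.48, 8.08, 7.35, 11.47,
    21.0, 27.36, 8.11, 24.35, 12.46, 12.4, 31.01, 9.09, 11.57, 17.94,
    4.42, 16.49, 9.24, 37.25, 13.19, 3.21, 13.53, 9.08, 16.36, 20.2,
    8.69, 5.85, 6.08, 8.52, 9.65, 19.0,
]


def quantile(values, p):
    """Q(p) by linear interpolation with sample fractions (i-1)/(n-1)."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = min(int(position), len(ordered) - 2)
    f = position - lower
    return (1 - f) * ordered[lower] + f * ordered[lower + 1]


q1, q2, q3 = (quantile(rainfall, p) for p in (0.25, 0.5, 0.75))
iqr = q3 - q1

# Whiskers stop at the 1.5 IQR fence or at the data extreme, whichever is closer to the box.
upper_whisker = min(q3 + 1.5 * iqr, max(rainfall))
lower_whisker = max(q1 - 1.5 * iqr, min(rainfall))


def classify(value):
    """Label a value by its distance beyond the box, in units of IQR."""
    distance = max(value - q3, q1 - value)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "outlier"
    return "regular"


flagged = {v: classify(v) for v in rainfall if classify(v) != "regular"}
print(round(q1, 4), round(q2, 4), round(q3, 4), round(iqr, 4))
print(round(lower_whisker, 2), round(upper_whisker, 2), flagged)
```

For the rainfall data this flags a single value, the 37.25 inches recorded in 2004, as an outlier.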

Excel 2016 has a box plot option among its different types of charts. However, to plot it in
earlier versions of Excel, use the instructions given below:
http://www.dummies.com/education/math/statistics/box-and-whisker-charts-for-excel/

Figure 8 shows the box plot for the annual LA rainfall.


[Figure: box plot; one point is labeled "Outlier"]

Fig 8. Box plot for annual LA rainfall

Run Sequence Plot

A run sequence plot is a graphical data analysis technique for preliminary scanning of the
data. It consists of:
Vertical axis = i-th observation;
Horizontal axis = dummy index i.
The run sequence plot is thus a plot of the raw data in the same order in which the values
are stored in the variable. This is a useful first step in the analysis of any data because it
provides information about trends, patterns in variation, and outliers. It also gives the
analyst an excellent feel for the data.
[Figure: run sequence plot; horizontal axis: index i, 1 to 36; vertical axis: 0 to 40]

Fig 9. Run sequence plot for annual LA rainfall

Acknowledgement: Dr. Ramzi for his help with preparation of the notes.
