Você está na página 1de 18

Quartiles for Box Plots

Jon Peltier
Peltier Technical Services, Inc.
http://peltiertech.com

Table of Contents
Quartiles for Box Plots ...................................................................................................................................................... 2
Median ..................................................................................................................................................................................... 2
Hinge Techniques for Determining Quartiles .......................................................................................................... 3
Alternative Hinge Definitions .................................................................................................................................... 3
Inclusionary Hinge Definition (Tukey) .......................................................................................................... 4
Exclusionary Hinge Definition (Moore and McCabe or M&M).......................................................... 4
Empirical Distribution Function (CDF) ......................................................................................................... 5
All Possible Cases of N ............................................................................................................................................. 5
8 (4k) Values................................................................................................................................................................ 5
9 (4k+1) Values .......................................................................................................................................................... 6
10 (4k+2) Values ........................................................................................................................................................ 6
11 (4k+3) Values ........................................................................................................................................................ 6
Interpolation Methods of Determining Quartiles ................................................................................................... 7
N-1, N, and N+1 Bases for Computing Quartiles................................................................................................. 7
(N-1) Basis .................................................................................................................................................................... 8
(N) Basis ........................................................................................................................................................................ 8
(N+1) Basis ................................................................................................................................................................... 8
8 (4k) Values................................................................................................................................................................ 8
9 (4k+1) Values ....................................................................................................................................................... 10
10 (4k+2) Values ..................................................................................................................................................... 10
11 (4k+3) Values ..................................................................................................................................................... 11
Comparison of Values from All Hinge and Quartile Methods ......................................................................... 12
Effect of N ....................................................................................................................................................................... 12
Doubling the Data Set ................................................................................................................................................ 15
Techniques Used by Software Packages ............................................................................................................. 17
Recommendations ...................................................................................................................................................... 17
Quartiles in the Peltier Tech Box Plot Utility......................................................................................................... 18
References........................................................................................................................................................................... 18
Quartiles for Box Plots 2

Quartiles for Box Plots

Any student of statistics, in fact, any middle school student, has constructed a box plot. A simple
box plot (or box-and-whisker chart), like the one above, needs five parameters for each
category. These are the minimum and maximum values for that category, the median value
(middle value), and the first and third quartiles. In fact, the median can also be called the second
quartile. The box plot above is not a meaningless sample of the chart type: it shows the variation
in quartiles determined by several different methods. These will be described in excruciating
detail in this tutorial.
A useful, if vague, definition of quartile is one of three values that approximately divide a sorted
data set into four parts of equal size. This division is easy and exact, if the number of values in the
set is evenly divisible by four. But in the majority of cases, it is less certain.
Many techniques have been put forth for determining quartiles, and mostly they resolve into the
handful of methods shown above, which are used by software packages. The techniques give
similar, though not exactly the same, results. In this document I will describe these definitions of
quartiles in hopes of shedding some light on this topic, which is more widely used than
understood.
I am not a statistician, but Ive had to understand quartiles to build my Box Plot Utility. Many of my
users wonder about how quartiles are calculated, so Ive decided to document my understanding.
If you have further questions, or if you find any mistakes, please let me know in the comments.
Median
The median is the central value in a sorted data set. If the values are listed from left to right in
order of increasing value, there are as many values to the left of the median as to the right.
Determining the median is easy. If there is an odd number of values, the median is the value in the
middle. For example, in this set of nine values, the median is the fifth value (in this case, 5), with
four values below it and four above.
Quartiles for Box Plots 3

If there is an even number of values, the median does not correspond to a value in the data set.
Instead the median is the average of the largest value in the lower half and the smallest value in
the higher half. In this set of eight values, the median separates the bottom four from the top four,
so we define it as the average of the fourth and fifth values, in this case, 4.5.

For a small number of simple data sets, the definition of quartiles is as easy, but usually its more
involved. Even when its easy, the statistical treatments make it seem harder than it is.

Hinge Techniques for Determining Quartiles


Hinges represent an easy definition for quartiles of a data set. Arrange the values in the shape of a
W, with equal length legs.

The central value of the data set is the value at the peak in the middle of the W, in this case, 5.
The values at the bottoms of the W, 3 and 7, are called Hinges, and they serve as quartiles in this
simple definition.

A data set makes a neat W if the number of points N can be defined by


N = 4k + 5
where k is a positive integer. For k=1, N=9, as in the example above. For k=2, N=13. A data set of
13 values is shown below.

Alternative Hinge Definitions


To make it easier to explain and comprehend the discussions that follow, the data sets will be laid
out along a number line such as this:

The gray numbers under the number line correspond to locations along the length of the set of
values (as a continuous variable), while numbers above the number line correspond to the index
of a particular value of the data set. A fractional number above the number line indicates that the
resulting value is interpolated between the adjacent values.
Quartiles for Box Plots 4

Using a W-shaped data layout works for some sizes of data sets, but how can we define hinges for
any arbitrarily sized data set? Such a set is shown below; it has 9 values, which we solved above,
but we will use it to illustrate the general case.

We can define the Median as the middle value of the data set, as always. We can define the lower
hinge as the median of the bottom half of the data, and the upper hinges as the median of the top
half of the data. Sounds easy, right?
Inclusionary Hinge Definition (Tukey)

When John Tukey was laying out his first box plots, he decided that the central data point (the
median) of an odd-numbered data set should be included with both lower and upper halves of the
data when determining the medians of these halves, that is, when determining the hinges.
As in the W layout example above, the median of this 9-value example is 5, while the hinges are
at 3 and 7.

Tukeys name is generally associated with this definition of hinges.


Exclusionary Hinge Definition (Moore and McCabe or M&M)

Some statisticians didnt like the idea of the median being counted twice by being part of both the
bottom and top halves of the data. Besides, it already has an important role as Median of the data
set.
These people thought that the hinges should be defined as the medians of the upper and lower
halves of the data set, excluding the median. In the 9-value example, the Median is unchanged at 5,
but now the hinges are 2.5 and 7.5.

This exclusionary approach to hinges is often referred to by the names of two of its proposers,
Moore and McCabe, or M&M if you like candy.
If the data set has an even number of values, of course, there is no distinction between the
inclusionary and exclusionary definitions, as there is no central median value to include or
exclude.
Quartiles for Box Plots 5

Empirical Distribution Function (CDF)

For data sets with an even number of observations, the inclusionary (Tukey) hinges and the
exclusionary (M&M) hinges are the same. For odd numbers of values, the inclusionary hinges are
always closer to the median than the exclusionary hinges.
A third approach is a compromise between the Tukey and Moore-McCabe approaches. Called the
Empirical Distribution Function or the Cumulative Distribution Function, or referred to by its
initials CDF, it says in the case of an odd number of values in the data set, include the central
median if including it will result in odd-numbered halves, and exclude the central median if
excluding it will result in odd-numbered halves. This compromise results in actual values from the
data set (as opposed to averages of two adjacent values) being used as hinges most of the time.
The CDF technique is the default quartile method used by the statistics package SAS, where its
called Empirical Distribution Function with Rounding.
All Possible Cases of N

We cant always lay the data out in a W shape. Since were dealing with splitting a data set into
four parts, we only need to consider four cases: when the number of values N is evenly divisible by
4, or when the remainder of this division is 1, 2, or 3. These cases can be written as
N = 4k
N = 4k + 1
N = 4k + 2
N = 4k + 3
We will look analyze 8 (4k), 9 (4k+1), 10 (4k+2), and 11 (4k+3) values. Nine values here is a
repeat of the Tukey and M&M illustration above.
8 (4k) Values

Here is our number line with a set of 8 (4k) observations.

The median is the average of the two values closest to the center, that is, 4.5.
The lower hinge is the average of the two central values in the bottom half of the data, 2.5. The
upper hinge is the average of the two central values in the top half of the data, 6.5.

No need to hurt our brains deciding whether to include or exclude the global median from
determination of the hinges. The hinges are the same for all three methods for N=4k.
Quartiles for Box Plots 6

9 (4k+1) Values

The 9 (4k+1) observation data set leads to two results, depending on whether the central median
value is included or excluded from determination of the hinges. In both cases, the median is the
central value, or 5.
In the inclusionary (Tukey) approach, the hinges are the midpoints of the data halves, or 3 and 7.

In the exclusionary (M&M) approach, the hinges are the averages of the two central values of each
half of the data set, 2.5 and 7.5.

The CDF and Tukey hinges are in agreement for N=4k+1. The M&M values are slightly further
from the median.
10 (4k+2) Values

In the 10 (4k+2) observation sample, the median is the average of the two central values, or 5.5.
The lower and upper hinges are the central values of the bottom and top halves of the data set, 3
and 8. No worries about the global median value here.

The three hinge definitions all agree on the hinge values for N=4k+2 values.
11 (4k+3) Values

The 11 (4k+3) observation data set leads to two results, as in the 9 (4k+1) observation data set. In
both cases, the median is the central value, or 6.
Quartiles for Box Plots 7

In the inclusionary determination, the hinges are the averages of the two central values of each
half of the data, 3.5 and 8.5.

In the exclusionary (M&M) approach, the hinges are the central values of the two halves of the
data set, 3 and 9.

The CDF and M&M hinges are the same for N=4k+3, with the Tukey hinges lying slightly closer to
the median.

Interpolation Methods of Determining Quartiles


As in the case of hinges, we need to consider four cases:
N = 4k
N = 4k + 1
N = 4k + 2
N = 4k + 3
We will look at 8 (4k), 9 (4k+1), 10 (4k+2), and 11 (4k+3) values. Conceptually they are the same,
but the simplified equations have differences.
We also need to look at three ways to distribute values along a number line. Ill refer to these as
the (N-1) Basis, the (N) Basis, and the (N+1) Basis. They are similar approaches, and give similar
results.
N-1, N, and N+1 Bases for Computing Quartiles
To make it easier to explain and comprehend the discussions that follow, the data sets will be laid
out along a number line such as this:

The gray numbers under the number line correspond to locations along the length of the set of
values (as a continuous variable), while numbers above the number line correspond to the index
of a particular value of the data set. A fractional number above the number line indicates that the
resulting value is interpolated between the adjacent values.
Quartiles for Box Plots 8

(N-1) Basis

The values are laid out on the integer positions of the number line, starting at zero and ending
with N-1. The N values are distributed over a length of N-1. For N=8, it looks like this:

The position of a percentile p is that percentage of the length of the number line, or p*(n-1).
The N-1 Basis method is used by Excels legacy QUARTILE function and by the QUARTILE.INC
function introduced in Excel 2010. It is apparently not used in any other calculation packages,
though it is equivalent to at least one obscure quartile determination method not discussed here.
(N) Basis

Each value is located in the middle of a one-unit length of number line. Value 1 is located midway
between 0 and 1, etc. The number line starts at zero and ends at N. The N values are distributed
over a length of N. Again, for N=8:

The position of a percentile p is that percentage of the length of the number line, or p*(n).
The N Basis method is apparently not used by any package, even though it seems to me to be the
most balanced method (while the others seem to have one kind of off-by-one error or the other).
(N+1) Basis

The values are laid out on the integer positions of the number line starting at 1 and ending at N.
The number line extends beyond the first value to zero, and for symmetry beyond the Nth value to
N+1. The values are then distributed over a length of N+1. For N=8:

The N+1 Basis is used by Minitab and JPM, and is available in SAS. It is also used in the
QUARTILE.EXC function introduced in Excel 2010.
8 (4k) Values

The formulas for locating the quartiles on the number line for N=8(4k) using the N-1 Basis are
shown here:

We need to increase the number line indices by 1 to get the value indices (or observation
numbers).

These value indices are not the values of the quartiles. The value indices indicate which value is to
be used as the quartile. A whole number M indicates that the Mth value is the quartile. A fractional
Quartiles for Box Plots 9

number of the form M+m means the quartile is located between the Mth and (M+1)th values, the
fraction m away from the Mth value.
For example, lets look at the 8-observation data set {2,3,5,8,11,12,14,17} in the chart below. For
N=8 using the N-1 basis, our quartiles are the 2.25th, 4.5th, and 6.75th values. For the first
quartile, 2.25 means the value 0.25 of the way from the 2nd to the 3rd values, or 0.25 of the way
from 3 to 5, or 3.5. For the median, 4.5 means halfway between the 4th and 5th values, or halfway
between 8 and 11, or 9.5. For the third quartile, 6.75 means the value 0.75 of the way from the 6th
to the 7th value, or 0.75 of the way from 12 to 14, or 13.5.

Locate the quartiles on the number line for N=8(4k) using the N basis using these formulas:

These number line indices must be incremented by 0.5 to get the value indices.

Finally, for N=8(4k) and the N+1 Basis, the number line indices are computed using:

The number line indices correspond exactly to the value indices.

The pattern you should notice is that while the medians are the same, the N-1 Basis quartiles are
closer to the median then the N+1 Basis quartiles, and the N Basis quartiles are in between these.
Quartiles for Box Plots 10

9 (4k+1) Values

Here are the formulas and number line representations for N=9 (4k+1) using the N-1 Basis:

Here are the formulas and number line representations for N=9 (4k+1) using the N Basis:

Here are the formulas and number line representations for N=9 (4k+1) using the N+1 Basis:

Same relative positions of the N-1, N, and N+1 Basis quartiles.


10 (4k+2) Values

Its probably getting tedious, but here are the formulas and number line representations for N=10
(4k+2) using the N-1 Basis:

Here are the formulas and number line representations for N=10 (4k+2) using the N Basis:
Quartiles for Box Plots 11

Here are the formulas and number line representations for N=10(4k+2) using the N+1 Basis:

Same pattern as before; N-1 quartiles closest to the median and the N+1 quartiles furthest.
11 (4k+3) Values

Finally, here are the formulas and number line representations for N=11 (4k+3) using the N-1
Basis:

Here are the formulas and number line representations for N=11 (4k+3) using the N Basis:

Here are the formulas and number line representations for N=11 (4k+3) using the N+1 Basis:

No deviation from the previously observed N-1, N, N+1 trends.


Quartiles for Box Plots 12

Comparison of Values from All Hinge and Quartile Methods


Effect of N
First and third quartiles (or first and second hinges) for N=8 through 15 are tabulated below for
all of the quartile determination methods described in the previous sections.
Quartiles for Box Plots 13

Below left, the hinges for the Tukey (inclusionary) and the Moore and McCabe (exclusionary)
methods are plotted. We see that for even N, the methods result in the same hinges, while for odd
N, Tukey is closer to the median, and M&M is further from the median. Below right, the CDF is
overlaid on the previous plot of Tukey and M&M. For even N, all techniques agree, while for odd N,
the CDF sticks with the method that yields a whole number index.

The chart below left plots the quartile indices for the N+1, N, and N-1 Basis approaches. The N-1
quartiles are closer to the median, the N+1 quartiles are further, and the N are in between. This is
the pattern we noticed in the number lines in the previous section. It becomes interesting (below
right) when we overlay the various hinge techniques on the N+1/N/N-1 plot. We see that the
Tukey hinges is bounded by the N-1 and N quartiles.
Quartiles for Box Plots 14

The M&M quartiles are bounded by the N and N+1 quartiles (below left). And the CDF hinges are
bounded by N+1 and N-1 (below right).

Finally, since these quartiles are intended for use in box plots, here are box plots comparing the six
techniques, one box plot each for N=8, 9, 10, and 11.
Quartiles for Box Plots 15

Doubling the Data Set


Before making any recommendations, lets see how the techniques compare when we double a
data set. For example, if we have a data set of {1,2,3,4,5} and another data set with the same
values, but two of each, {1,1,2,2,3,3,4,4,5,5}, we would expect to find the same quartiles. Here is
what all the techniques predict.
Quartiles for Box Plots 16

Forget staring at a table of numbers, the predictions are plotted in the following charts. Any pairs
that are not vertically aligned have different quartiles for the data set and its double. These
unmatched cases are drawn in orange in the charts below and also in the preceding table.

Of all the techniques evaluated, only CDF yields the same quartiles for all cases of a data set and its
double.
Quartiles for Box Plots 17

Techniques Used by Software Packages


The left-hand chart below rehashes the difference between the N+1 and N-1 techniques for
interpolating quartiles. Microsoft Excels legacy QUARTILE function uses the N-1 approach, while
Minitab, JMP, and other packages use the N+1 approach. Microsoft added two functions to Excel
2010: QUARTILE.INC, which is based on N-1 and is therefore identical to QUARTILE, and
QUARTILE.EXC, which is based on N+1. SAS also offers an N+1 option (see below).

The quartiles for the N-1 technique are closer to the median, so the interquartile range (IQR) is
smaller, and the limit for identifying outliers, 1.5IQR above Q3 and below Q1, are also closer to the
median.
This means that there are likely to be more outliers identified by Excel than by Minitab. This
difference in behavior was a mystery to me a decade ago when my employer provided us with
Minitab in addition to Excel, but now its very clear. (Many more things were a mystery to the
brilliant minds I was stuck working with, but thats a story for another day, when were killing
time and beers in the pub.)

The right-hand chart above shows the two SAS quartile options. The default is CDF (SAS option
PCTLDEF = 5), which as we have seen yields identical quartiles for a data set and the same data set
with two of each value. SAS also offers the N+1 option (PCTLDEF = 4), which is used by Minitab,
JMP, and Excels QUARTILE.EXC. SAS also offers three more options (PCTLDEF 1, 2, and 3), which
often produce asymmetric median and quartile definitions, because they round to the larger or
closer of two values instead of averaging.
The two sets of results are slightly different, and the CDF quartiles tend to be closer to the median
than the N+1 quartiles. As with Excels N-1 results, the CDF will have smaller IQR than N+1,
leading to identification of more borderline outliers.
Recommendations
So after all this noise, which quartile definition should you use?
Quartiles for Box Plots 18

The CDF approach is considered by Langford (Quartiles in Elementary Statistics) to be the all-
around best approach. It is also the default for the powerful software package SAS, though it
doesnt seem to be used in other packages. This may then be the option of choice.
However, an important consideration is consistency. If others you work with are using Minitab or
JMP, you should use the N+1 option for compatibility.
Quartiles in the Peltier Tech Box Plot Utility
The Peltier Tech Box Plotter provides all of the quartile method described here. The CDF
calculation is selected by default, but this option can be changed in the simple list box, inserted in
the worksheet next to the box plot.

References
I found innumerable sources for this information about quartiles. Most were either very basic, or
not useful at all. The following three are the most useful links I found.
Quartiles in Elementary Statistics
Eric Langford, California State University, Chico
Journal of Statistics Education Volume 14, Number 3 (2006).
This paper had an extensive and highly mathematical discussion of the methods described here,
and several others.
Quartiles: How to calculate them?
David Journet, iTSS Wallingford
This short paper provided a summary of the SAS, Minitab, and Excel methods, supporting the
information in the first reference.
Calculating Quartiles: Why Computer-Generated Results Dont Always Agree
Delmar E. Searles, Asbury University
This article was the only place Id ever seen a number line used to explain the difference between
the N-1 and N+1 approaches to percentile definitions. I found this description almost intuitive, and
decided to adopt it for all of my descriptions here. We are, after all, visual creatures, and most of us
are predominantly visual learners.

Você também pode gostar