
How to Read and Use a Box-and-Whisker Plot

Feb 15, 2008

The box-and-whisker plot is an exploratory graphic, created by John W. Tukey, used to show
the distribution of a dataset at a glance. Think of the type of data you might use a
histogram with, and the box-and-whisker (or box plot, for short) could probably be useful.
The box plot, although very useful, seems to get lost in areas outside of Statistics, but
I’m not sure why. It could be that people don’t know about it, or maybe they don’t know
how to interpret it. In any case, here’s how you read a box plot.

Reading a Box-and-Whisker Plot

Let’s say we ask 2,852 people (and they miraculously all respond) how many hamburgers
they’ve consumed in the past week. We’ll sort those responses from least to greatest and
then graph them with our box-and-whisker.
Take the top 50% of the group (1,426) who ate more hamburgers; they are represented
by everything above the median (the white line). Those in the top 25% of hamburger
eating (713) are shown by the top “whisker” and dots. Dots represent those who ate a
lot more than normal or a lot less than normal (outliers). If more than one outlier ate
the same number of hamburgers, dots are placed side by side.

Find Skews in the Data


The box-and-whisker of course shows you more than just four split groups. You can
also see which way the data sways. For example, if there are more people who eat a
lot of burgers than eat a few, the median is going to be higher or the top whisker could
be longer than the bottom one. Basically, it gives you a good overview of the data’s
distribution.

That’s all there is to it, so the next time you’re thinking of making a bar graph or a
histogram, think about using Tukey’s beloved box-and-whisker plot too.

Box Plot
Submitted by fabian on Sun, 02/03/2008 - 23:00

In my opinion the box plot is one of the most underestimated views in currently fashionable
information visualization approaches. Modern chart libraries come with a lot of available
charts, but almost all of them lack the box plot. Thus, I decided to write this article to put
the brilliant box plot back on the map and to provide a CSS/JavaScript solution for displaying
box plots. I checked all chart libraries mentioned in this outstanding Smashing Magazine
article (and a few more), and this is what I found: of 15 web-based charting engines, only one
provides box plots. Libraries and tools are a little more encouraging (although the listing is
definitely not exhaustive): of the 8 investigated, only one doesn’t come with box plots.
(You'll find detailed tables at the end of the article.)

History:
The box plot goes back to John Tukey, who in 1977 published this efficient method for
displaying robust statistics (Tukey77).

Best Practice:
The most impressive usage of a box plot I have found is in the World Freedom Atlas:

Let’s first look at the view at the top, which is itself a box plot. The red dot in the blue
bar is the median; the lines at the left and right represent the lower and upper quartiles (I
will explain later which numbers a box plot actually displays); 0 and 40 are the minimum
and maximum possible values. If you move the mouse over a country on the map, it is
highlighted in the box plot, as you can see in the picture above: the country with a raw
political rights score of 34 (it’s Mongolia, by the way). Another very nice feature is the
stacking of elements with the observed value on top of the blue bar. This indicates for
each value how many countries have that score, and thereby gives an immediately
comprehensible picture of the underlying distribution. Of course, this is only possible if
you have a predictable number of values to stack (otherwise you cannot determine the
necessary height) and if the values are integers (otherwise there is an infinite number of
possible positions and stacking is not possible).

The box plot at the bottom of the above picture follows the recommendation of Edward Tufte
(Tufte01). Again, the red dot represents the median; the ends of the lines towards the red
dot are the lower and upper quartiles, respectively; the ends of the lines towards the
borders are the minimum and maximum values. Another nice feature is the yellow line showing
the development of the displayed index (the raw political rights score) over the last years
for the selected country (the currently selected year is displayed in darker blue). The
values for the selected country in each year are connected by the yellow line. As one can
see immediately, the score is decreasing slightly.

As mentioned above, this is probably the most stunning example of a box plot; everything is
done correctly. Still, in my opinion, there are some drawbacks to Tufte's recommendation
for box plots. Usually, a box plot is displayed in the following way (this one was created
with the data exploration tool KNIME, for which I implemented this box plot myself):
Tufte's recommendation is based on the notion of avoiding chart junk and the principle of
maximizing data ink, i.e. the ink in the drawing should be used to display data and not
decoration or junk. While this is certainly a good guideline, the result is sometimes
difficult to read. In the example of the World Freedom Atlas, it is only possible to decipher
the actual values by looking at the box plot to the left. Maximizing the data ink sometimes
minimizes readability. In the example below, definitely more “ink” was used, but in my
opinion the essential information (the key values and their exact numbers) is immediately
visible. This might not be as appealing as the box plot above, but if you are really
interested in the values, this version might better fit your needs. (Maybe because I’m more
familiar with it?)

Theory:
But what is this all about? What values are displayed in a box plot? What are the advantages
of a box plot? The image below should at least clarify the terms used; their meaning is
explained afterwards. A small example should make things clear. Consider a small village
with 25 inhabitants. This is what they earn and the resulting box plot:

Citizen Nr.   Income      Key Value
25            3,001.25    Maximum
24            2,996.45
23            2,919.35
22            2,787.02
21            2,784.72
20            2,696.83
19            2,412.51    Q3: 0.75 * 25 = 18.75, rounded up to position 19
18            2,400.43
17            2,367.84
16            2,333.37
15            2,285.53
14            2,214.87
13            2,069.79    Median: 0.5 * 25 = 12.5, rounded up to position 13
12            1,923.62
11            1,819.22
10            1,773.34
9             1,597.54
8             1,589.48
7             1,494.65    Q1: 0.25 * 25 = 6.25, rounded up to position 7
6             1,423.74
5             1,391.92
4             1,334.88
3             1,184.53
2             1,125.78
1             1,005.85    Minimum

As you can see, the basic idea is to sort the data and then select the minimum, the maximum,
and the values at the corresponding positions: the median (0.5), the lower quartile Q1 (0.25),
and the upper quartile Q3 (0.75). Why are these values considered robust statistics? To
explain this, consider a similar village with one rich person and the following incomes:

Citizen Nr.   Income       Key Value
24            10,345.67    Maximum
23            2,919.35     Upper Bound
22            2,787.02
21            2,784.72
20            2,696.83
19            2,412.51
18            2,400.43     Q3: (18th + 19th) / 2 = 2,406.47
17            2,367.84
16            2,333.37
15            2,285.53
14            2,214.87
13            2,069.79
12            1,923.62     Median: (12th + 13th) / 2 = 1,996.71
11            1,819.22
10            1,773.34
9             1,597.54
8             1,589.48
7             1,494.65
6             1,423.74     Q1: (6th + 7th) / 2 = 1,459.2
5             1,391.92
4             1,334.88
3             1,184.53
2             1,125.78
1             1,005.85     Minimum / Lower Bound


Two things are important here:

1. Calculation of the quartiles (X0.25, X0.5, X0.75): Suppose we have 4 values. Then the
position of the median would be 4 * 0.5 = 2, which is not the middle of four values.
Actually, there is no value in the middle of four values, so we have to take the mean of
the 2nd and 3rd values. If we have 5 values, the position of the median is 5 * 0.5 = 2.5.
Rounded up, this gives 3, and the 3rd value is indeed in the middle of 5 values (2 above
and 2 below). The same holds for the other quartiles. To sum up, the quartiles are
calculated as follows:
1. calculate the position p (the number of values times 0.25, 0.5 or 0.75)
2. check whether p is an integer
- yes: take the mean of the values at positions p and p+1
- no: take the value at position ceil(p)

Almost all programming languages start counting at zero, so there the position has to be
floored rather than ceiled to get the correct index, and if p is an integer, the mean of
the values at indices p-1 and p has to be taken.

2. The horizontal bars outside of the box in the middle (called whiskers, hence the name
box-and-whisker plot) are not always the maximum and the minimum.

The whiskers mark the minimum and the maximum unless these values lie more than 1.5 * IQR
outside the box. The IQR is the interquartile range: the distance between Q1 and Q3.
Observations that lie more than 1.5 * IQR below Q1 or above Q3 are considered mild outliers,
and those more than 3 * IQR away are considered extreme outliers. In that case the whisker
ends at the most extreme value that is still within 1.5 * IQR of the box (the lower or upper
bound). The picture below depicts the concept in a qualitative way (distances are not to
scale).
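
The two points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation behind the article's plots: the function names `quartile` and `whiskers_and_outliers` are made up here, the data are the two village tables, and a value exactly on the 3 * IQR limit is counted as mild.

```python
def quartile(sorted_values, k):
    """Return quartile k (1, 2 or 3) of an ascending, non-empty list."""
    # The 1-based position is p = n * k / 4. Using divmod keeps the
    # "is p an integer?" check exact.
    whole, remainder = divmod(len(sorted_values) * k, 4)
    if remainder == 0:
        # p is an integer: mean of positions p and p+1 (zero-based p-1 and p).
        return (sorted_values[whole - 1] + sorted_values[whole]) / 2
    # p is not an integer: position ceil(p), i.e. zero-based index floor(p).
    return sorted_values[whole]


def whiskers_and_outliers(values):
    """Return (lower_bound, upper_bound, mild_outliers, extreme_outliers)."""
    data = sorted(values)
    q1, q3 = quartile(data, 1), quartile(data, 3)
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    # Whiskers end at the most extreme values within 1.5 * IQR of the box.
    inside = [v for v in data if inner_low <= v <= inner_high]
    mild = [v for v in data
            if outer_low <= v < inner_low or inner_high < v <= outer_high]
    extreme = [v for v in data if v < outer_low or v > outer_high]
    return inside[0], inside[-1], mild, extreme


# Incomes of the first village (25 inhabitants), sorted ascending.
village = [
    1005.85, 1125.78, 1184.53, 1334.88, 1391.92, 1423.74, 1494.65,
    1589.48, 1597.54, 1773.34, 1819.22, 1923.62, 2069.79, 2214.87,
    2285.53, 2333.37, 2367.84, 2400.43, 2412.51, 2696.83, 2784.72,
    2787.02, 2919.35, 2996.45, 3001.25,
]
# Second village: 24 inhabitants, one of them rich.
rich_village = village[:23] + [10345.67]

print([quartile(village, k) for k in (1, 2, 3)])
# → [1494.65, 2069.79, 2412.51]
print(whiskers_and_outliers(rich_village))
# → (1005.85, 2919.35, [], [10345.67])
```

For the rich village the upper whisker stops at 2,919.35, the "Upper Bound" row of the second table, and the income of 10,345.67 is reported as an extreme outlier.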

And here the robust statistics become relevant. Let’s compare the median with the mean (the
mean is the sum of all values divided by the number of values).

Robust Statistics:
In the first case we have a median of 2,069.79 and a mean of 2,037.38, so they are quite
comparable. In the second case, according to the mean of 2,303.44, the village is richer,
while the median incorruptibly keeps telling the truth (1,996.71), and the only rich person
is displayed as what he or she is in this village: an outlier. The same holds for the other
key values, of course.
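
The comparison can be reproduced with Python's standard `statistics` module. This is a minimal sketch using the incomes from the two tables; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Incomes of the first village (25 inhabitants).
village = [
    1005.85, 1125.78, 1184.53, 1334.88, 1391.92, 1423.74, 1494.65,
    1589.48, 1597.54, 1773.34, 1819.22, 1923.62, 2069.79, 2214.87,
    2285.53, 2333.37, 2367.84, 2400.43, 2412.51, 2696.83, 2784.72,
    2787.02, 2919.35, 2996.45, 3001.25,
]
# Second village: the two top earners are replaced by one rich person.
rich_village = village[:23] + [10345.67]

for name, incomes in (("first village", village), ("rich village", rich_village)):
    # One extreme income pulls the mean up, while the median barely moves.
    print(f"{name}: mean = {mean(incomes):,.3f}, median = {median(incomes):,.3f}")
```

For an even number of values, `median` averages the two middle values, which matches the rule used for the second table.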

Summary
At this point we can summarize what a box plot actually displays.

• At least 25% of all values are at or below the lower quartile Q1.
• At least 50% of all values are at or below (and at least 50% at or above) the median.
• At least 25% of all values are at or above the upper quartile Q3.
• The box contains 50% of the data (Q3 (75%) - Q1 (25%) = 50%).
• From the size of the box and the length of the whiskers you can read the distribution of
the values.
• Between the median and each quartile lie 25% of the data (75% - 50% = 25% and
50% - 25% = 25%), i.e. the position of the median inside the box indicates whether
the values are more densely packed towards the upper or the lower quartile.
• Not to mention the outliers, which are those values that are far away from most of
the other values.

Interpret the key results for Boxplot



Complete the following steps to interpret a boxplot.

In This Topic

• Step 1: Assess the key characteristics
• Step 2: Look for indicators of nonnormal or unusual data
• Step 3: Assess and compare groups

Step 1: Assess the key characteristics


Examine the center and spread of the distribution. Assess how the sample size may affect
the appearance of the boxplot.

Center and spread


Examine the following elements to learn more about the center and spread of your
sample data.

Median

The median is represented by the line in the box. The median is a common
measure of the center of your data.

Interquartile range box


The interquartile range box represents the middle 50% of the data.

Whiskers
The whiskers extend from either side of the box. The whiskers represent the
ranges for the bottom 25% and the top 25% of the data values, excluding
outliers.

Hold the pointer over the boxplot to display a tooltip that shows these statistics.
For example, the following boxplot of the heights of students shows that the
median height is 69. Most students have a height that is between 66 and 72, but
some students have heights that are as low as 61 and as high as 75.

Investigate any surprising or undesirable characteristics on the boxplot. For example, a
boxplot may show that the median length of wood boards is much lower than the target length
of 8 feet.

Sample size (N)


The sample size can affect the appearance of the graph. For example, although
the following boxplots seem quite different, both of them were created using
randomly selected samples of data from the same population.

[Figure: two boxplots, N = 15 and N = 500]

A boxplot works best when the sample size is at least 20. If the sample size is too
small, the quartiles and outliers shown by the boxplot may not be meaningful. If
the sample size is less than 20, consider using Individual Value Plot.
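
The effect of sample size can be illustrated with a small simulation. The sketch below rests on assumptions that are not in the source: the population is taken to be normal with mean 69 and standard deviation 3, and the helper name `spread_of_medians` is made up. It draws many samples of each size and measures how much the sample median moves from sample to sample.

```python
import random
from statistics import median, pstdev

random.seed(0)  # fixed seed so repeated runs give the same numbers


def spread_of_medians(sample_size, repetitions=200):
    """Standard deviation of the sample median over repeated samples."""
    medians = []
    for _ in range(repetitions):
        # Assumed population: normal, mean 69, standard deviation 3.
        sample = [random.gauss(69, 3) for _ in range(sample_size)]
        medians.append(median(sample))
    return pstdev(medians)


small = spread_of_medians(15)
large = spread_of_medians(500)
print(f"N = 15:  median varies with standard deviation {small:.2f}")
print(f"N = 500: median varies with standard deviation {large:.2f}")
```

With small samples the median (and likewise the quartiles) shifts noticeably between samples, which is why two boxplots from the same population can look quite different.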

Step 2: Look for indicators of nonnormal or unusual data

Skewed data indicate that data may be nonnormal. Outliers may indicate other
conditions in your data.

Skewed data
When data are skewed, the majority of the data are located on the high or low
side of the graph. Skewness indicates that the data may not be normally
distributed.

The following boxplots are skewed. The boxplot with right-skewed data shows
wait times. Most of the wait times are relatively short, and only a few wait times
are long. The boxplot with left-skewed data shows failure time data. A few items
fail immediately and many more items fail later.

[Figure: two boxplots, right-skewed and left-skewed]

Some analyses assume that your data come from a normal distribution. If your
data are skewed (nonnormal), read the data considerations topic for the analysis
to make sure that you can use data that are not normal.
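
The direction of skew can also be read off the quartiles numerically. The sketch below uses Bowley's quartile skewness, a measure that the source does not mention and that is swapped in here purely for illustration. Both data sets are hypothetical. Note that `statistics.quantiles` interpolates between values, so its quartiles can differ slightly from those of other tools.

```python
from statistics import quantiles


def quartile_skewness(values):
    """Bowley's quartile skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    q1, med, q3 = quantiles(values, n=4)
    # Positive: the median sits nearer Q1 (right skew).
    # Negative: the median sits nearer Q3 (left skew).
    return (q3 + q1 - 2 * med) / (q3 - q1)


# Hypothetical wait times: mostly short, a few long ones.
wait_times = [1, 1, 2, 2, 2, 3, 3, 4, 6, 9, 15]
# Hypothetical failure times: a few early failures, most later.
failure_times = [1, 7, 10, 12, 13, 13, 14, 14, 14, 15, 15]

print(quartile_skewness(wait_times))     # positive for right-skewed data
print(quartile_skewness(failure_times))  # negative for left-skewed data
```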

Outliers
Outliers, which are data values that are far away from other data values, can
strongly affect your results. Often, outliers are easiest to identify on a boxplot. On
a boxplot, outliers are identified by asterisks (*).

TIP
Hold the pointer over the outlier to identify the data point.
Try to identify the cause of any outliers. Correct any data-entry errors or
measurement errors. Consider removing data values that are associated with
abnormal, one-time events (special causes). Then, repeat the analysis.

Step 3: Assess and compare groups


If your boxplot has groups, assess and compare the center and spread of groups.

Centers
Look for differences between the centers of the groups. For example, the
following boxplot shows the thickness of wire from four suppliers. The median
thicknesses for some groups seem to be different.

Spreads
Look for differences between the spreads of the groups. For example, the
following boxplot shows the fill weights of cereal boxes from four production
lines. The median weights of the groups of cereal boxes are similar, but the
weights of some groups are more variable than others.
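
The same comparison can be made numerically by computing the median and interquartile range per group. The fill weights below are hypothetical values invented for this sketch, and `summarize` is an illustrative helper, not a Minitab function.

```python
from statistics import median, quantiles


def summarize(groups):
    """Map each group name to (median, interquartile range)."""
    summary = {}
    for name, values in groups.items():
        q1, _, q3 = quantiles(values, n=4)
        summary[name] = (median(values), q3 - q1)
    return summary


# Hypothetical fill weights in grams for two production lines.
fill_weights = {
    "Line 1": [362, 363, 364, 364, 365, 365, 365, 366, 366, 367, 368],
    "Line 2": [350, 355, 358, 361, 363, 365, 367, 369, 372, 375, 380],
}

for line, (center, spread) in summarize(fill_weights).items():
    print(f"{line}: median = {center}, IQR = {spread}")
```

In this made-up data both lines have the same median, but the second line has a much larger interquartile range, so its box would be visibly wider.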
