
Standard Deviation and Variance

Deviation just means how far from the normal


Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.
Its symbol is σ (the Greek letter sigma).
The formula is easy: it is the square root of the Variance. So now you
ask, "What is the Variance?"
Variance
The Variance is defined as:
The average of the squared differences from the Mean.
To calculate the Variance, follow these steps (a short code sketch of these steps appears just below):

1. Work out the Mean (the simple average of the numbers).

2. Then for each number: subtract the Mean and square the result (the squared difference).

3. Then work out the average of those squared differences. (Why square? See the footnote at the end.)
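Here is a minimal Python sketch of those three steps. The function name variance_of and the plain-list input are illustrative choices, not part of the original article:

```python
def variance_of(numbers):
    """Population variance: the average of the squared differences from the mean."""
    # Step 1: work out the mean (the simple average of the numbers)
    mean = sum(numbers) / len(numbers)
    # Step 2: for each number, subtract the mean and square the result
    squared_diffs = [(x - mean) ** 2 for x in numbers]
    # Step 3: work out the average of those squared differences
    return sum(squared_diffs) / len(squared_diffs)
```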
Example
You and your friends have just measured the heights of your dogs (in
millimeters):

The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and
300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Your first step is to find the Mean:

Answer:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

so the mean (average) height is 394 mm. Let's plot this on the chart:

Now we calculate each dog's difference from the Mean:
600 - 394 = 206, 470 - 394 = 76, 170 - 394 = -224, 430 - 394 = 36, 300 - 394 = -94

To calculate the Variance, take each difference, square it, and then average the result:

Variance: σ² = (206² + 76² + (-224)² + 36² + (-94)²) / 5
            = (42,436 + 5,776 + 50,176 + 1,296 + 8,836) / 5
            = 108,520 / 5
            = 21,704

So, the Variance is 21,704.


And the Standard Deviation is just the square root of Variance, so:
Standard Deviation: σ = √21,704 = 147.32... ≈ 147 (to the nearest mm)
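To double-check that arithmetic, here is a short Python sketch that reproduces each step for the dog heights (the variable names are ours):

```python
import math

heights = [600, 470, 170, 430, 300]   # the dog heights, in mm

mean = sum(heights) / len(heights)    # 1970 / 5 = 394.0

# differences from the mean: 206, 76, -224, 36, -94
diffs = [h - mean for h in heights]

# squared differences: 42436, 5776, 50176, 1296, 8836
squared = [d ** 2 for d in diffs]

variance = sum(squared) / len(squared)   # 108520 / 5 = 21704.0
std_dev = math.sqrt(variance)            # 147.32...

print(variance, round(std_dev))          # 21704.0 147
```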

And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147 mm) of the Mean, that is, between 394 - 147 = 247 mm and 394 + 147 = 541 mm.

So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
Rottweilers are tall dogs. And Dachshunds are a bit short ... but don't tell them!
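A small sketch of that "within one Standard Deviation" check, using the mean (394 mm) and standard deviation (147 mm) worked out above:

```python
heights = [600, 470, 170, 430, 300]
mean, std_dev = 394, 147

lower, upper = mean - std_dev, mean + std_dev        # 247 mm and 541 mm
within_one_sd = [h for h in heights if lower <= h <= upper]

print(within_one_sd)   # [470, 430, 300] -- 600 and 170 fall outside the band
```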
Now try the Standard Deviation Calculator.
But ... there is a small change with Sample Data
Our example was for a Population (the 5 dogs were the only dogs we
were interested in).
But if the data is a Sample (a selection taken from a bigger Population),
then the calculation changes!
When you have "N" data values that are:

The Population: divide by N when calculating Variance (like we


did)

A Sample: divide by N-1 when calculating Variance


All other calculations stay the same, including how we calculated the
mean.
Example: if our 5 dogs were just a sample of a bigger population of dogs,
we would divide by 4 instead of 5 like this:
Sample Variance = 108,520 / 4 = 27,130
Sample Standard Deviation = √27,130 = 164.71... ≈ 165 (to the nearest mm)
Think of it as a "correction" when your data is only a sample.
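In Python's statistics module the same distinction shows up as two pairs of functions: pvariance/pstdev for a population and variance/stdev for a sample. A brief sketch using the dog data:

```python
import statistics

heights = [600, 470, 170, 430, 300]

# Population: the 5 dogs are the only dogs we are interested in (divide by N)
print(statistics.pvariance(heights))   # 21704
print(statistics.pstdev(heights))      # 147.32...

# Sample: the 5 dogs are a selection from a bigger population (divide by N - 1)
print(statistics.variance(heights))    # 27130
print(statistics.stdev(heights))       # 164.71..., about 165
```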
Formulas
Here are the two formulas, explained at Standard Deviation Formulas if
you want to know more:

The "Population Standard Deviation":

The "Sample Standard Deviation":


Looks complicated, but the important change is to
divide by N-1 (instead of N) when calculating a Sample Variance.

*Footnote: Why square the differences?


If we just added up the differences from the mean ... the negatives would
cancel the positives:
(4 + 4 - 4 - 4) / 4 = 0 / 4 = 0
So that won't work. How about we use absolute values?
(|4| + |4| + |-4| + |-4|) / 4 = (4 + 4 + 4 + 4) / 4 = 4

That looks good (and is the Mean Deviation), but what about this case:
(|7| + |1| + |-6| + |-2|) / 4 = (7 + 1 + 6 + 2) / 4 = 4

Oh no! It also gives a value of 4, even though the differences are more spread out!
So let us try squaring each difference (and taking the square root at the
end):
√( (4² + 4² + 4² + 4²) / 4 ) = √(64 / 4) = √16 = 4

√( (7² + 1² + 6² + 2²) / 4 ) = √(90 / 4) = √22.5 = 4.74...

That is nice! The Standard Deviation is bigger when the differences are
more spread out ... just what we want!
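The comparison in this footnote is easy to reproduce in code: the mean absolute deviation is 4 for both sets of differences, while the root-mean-square measure tells them apart. A small sketch (the function names are ours):

```python
import math

def mean_absolute(diffs):
    """Average of the absolute differences (the Mean Deviation)."""
    return sum(abs(d) for d in diffs) / len(diffs)

def root_mean_square(diffs):
    """Square each difference, average them, then take the square root."""
    return math.sqrt(sum(d ** 2 for d in diffs) / len(diffs))

a = [4, 4, -4, -4]
b = [7, 1, -6, -2]

print(mean_absolute(a), mean_absolute(b))        # 4.0 4.0    -- cannot tell them apart
print(root_mean_square(a), root_mean_square(b))  # 4.0 4.74... -- b is more spread out
```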
In fact this method is a similar idea to distance between points, just
applied in a different way.
And it is easier to use algebra on squares and square roots than absolute
values, which makes the standard deviation easy to use in other areas of
mathematics.

Comparing the range, interquartile range and standard deviation


As stated earlier in this chapter, the range and the interquartile range are
much less sensitive to changes in the data than the standard deviation. If
a single value changes then the standard deviation, by definition, will also
change. Hence the standard deviation is a more powerful summary
measure as it makes more comprehensive use of the entire dataset.
However, situations when the mean might not be an appropriate measure
of centre were discussed previously. If the mean is not a meaningful
summary of the centre of the data, then it follows that the standard
deviation, which is calculated from distances around the mean, will not be
a useful summary of the spread of the values.
Therefore, if distributional assumptions (data is symmetric) can be made
and there are adequate numbers in the sample to check those
assumptions (as a rule of thumb it is often said that a sample size of at
least 20 would be adequate), then the mean and standard deviation
should be used to quantify the centre and spread of the measurements.
Alternatively, if the data distribution is skew and/or the sample size is
small then it is preferable to use the median and interquartile range to
summarise the measurements.
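As an illustration of that sensitivity (with made-up values rather than data from this chapter), the sketch below changes a single value: the standard deviation shifts noticeably, while the median and interquartile range do not.

```python
import statistics

original = [12, 14, 15, 15, 16, 17, 18, 19, 20, 21]
modified = original[:-1] + [45]   # change a single value: the 21 becomes 45

def iqr(data):
    q1, _, q3 = statistics.quantiles(data, n=4)   # returns Q1, median, Q3
    return q3 - q1

for name, data in [("original", original), ("modified", modified)]:
    print(name,
          round(statistics.stdev(data), 2),   # standard deviation
          statistics.median(data),            # median
          iqr(data))                          # interquartile range

# The standard deviation jumps from about 2.8 to about 9.4, while the
# median (16.5) and the IQR (4.5) are exactly the same for both lists.
```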

Describing Data: Why median and IQR are often better than mean
and standard deviation
Added by Jim Wahl, last edited by Jim Wahl on Mar 20, 2013

Grading on a curve in college instilled a habit of using mean and standard
deviation to describe a set of continuous data points. On any given
assessment, about 68 percent of students were within one standard
deviation of the mean. These were the "B" students. About 15 percent
were one standard deviation above; these students each received an "A"
and the rest were "other." At the end of the semester, there might be one
or two students in a 100-student freshman course who were two standard
deviations above the class average. These students would get a nice
letter from the head of the department.
The habit of using "mean" and "standard deviation" and the convenient
rule that 68 percent of samples are within one standard deviation of the
mean and 95 percent are within two standard deviations make these
measures attractive. Unfortunately, mean and standard deviation are
trickier to use than you might remember.

For starters, these statistics only work well on normally distributed "bell
curve" data. Such things as test scores or heights / weights of a
population are all generally "normal." Errors on a broadband service,
network capacity estimates, and stability metrics are generally not
normal. Error distributions, for example, are often "positively skewed" with
the left side of the bell curve compressed (most lines have low error
counts) and with a long tail on the right side of the mean (a few outliers
have continuous errors).
Another problem is that mean and standard deviation are not robust
against outliers. Below are two groups of data with identical mean and
standard deviation. But they're not identical: Group I has a wider
distribution below the mean and Group II has a single high outlier.
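The original figure's data values are not reproduced here, but the effect is easy to construct. The two made-up groups below share the same mean and standard deviation, yet Group I spreads out on both sides of the mean while Group II's spread comes almost entirely from a single high outlier:

```python
import statistics

group_1 = [2, 4, 4, 4, 5, 5, 7, 9]    # spread out on both sides of the mean
group_2 = [4, 4, 4, 4, 4, 4, 6, 10]   # tightly clustered, plus one high outlier

for name, data in [("Group I", group_1), ("Group II", group_2)]:
    print(name, statistics.mean(data), statistics.pstdev(data))

# Both groups print a mean of 5 and a population standard deviation of 2.0,
# even though their shapes are clearly different.
```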

Median and IQR


An alternative to mean and standard deviation is the median and interquartile range (IQR). IQR is the difference between the third and first quartiles (a.k.a. the 75th and 25th percentiles). IQR is often reported using the "five-number summary," which includes: minimum, first quartile, median, third quartile and maximum.
Like mean and standard deviation, median and IQR measure the central tendency and spread, respectively, but are robust against outliers and non-normal data. They have a couple of additional advantages (both are sketched in code after this list):

- Outlier Identification. IQR makes it easy to do an initial estimate of outliers by looking at values more than one-and-a-half times the IQR distance below the first quartile or above the third quartile.

- Skewness. Comparing the median to the quartile values shows whether data is skewed. For example, Group I has a high proportion of larger values, and the median is therefore closer to the third quartile than the first quartile. By contrast, the values in Group II are more evenly distributed.
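A Python sketch of both checks, using Python's statistics module and a small made-up dataset:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9, 31]   # made-up values with one obvious outlier

q1, median, q3 = statistics.quantiles(data, n=4)   # Q1, median, Q3
iqr = q3 - q1

# Outlier identification: anything more than 1.5 * IQR beyond the quartiles
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Skewness: is the median closer to the first quartile or to the third?
skewed_right = (q3 - median) > (median - q1)

print(q1, median, q3, iqr)      # 4.0 5.0 8.0 4.0
print(outliers, skewed_right)   # [31] True
```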

Below are the Group I and II data with the IQR and median. Now the red
median bar makes it clear that Group I's values are typically higher than
Group II's, but Group I also has a wider spread, as indicated by the wider
IQR shaded bar.

Calculating Median and IQR


When doing an analysis, I'll typically create rows for the five-number
summary, IQR, skew, and count of low and high outliers. Excel has built-in
functions for all of these, except outliers, which can be computed using a
basic formula.
Measure                    Excel Function
min and max                MIN(), MAX()
quartiles Q1 and Q3        QUARTILE.INC(, 1), QUARTILE.INC(, 3)
median                     MEDIAN() or QUARTILE.INC(, 2)
IQR                        Q3 - Q1
Outliers (low and high)    COUNTIF(, "<"&(Q1-1.5*IQR)), COUNTIF(, ">"&(Q3+1.5*IQR))

The open source statistical software R also makes it easy to calculate these values with the built-in summary function. I'll look at this in a future blog post and also look at R's box plot capability, which shows median, IQR, and outliers in a concise graphical summary that enables rapid scanning of many variables or of a single variable over time.
