
LCGC Europe Online Supplement

statistics and data analysis

Understanding the Structure of Scientific Data
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.

This is the first in a series of articles that aims to promote the better use of
statistics by scientists. The series intends to show everyone from bench
chemists to laboratory managers that the application of many statistical
methods does not require the services of a statistician or a mathematician
to convert chemical data into useful information. Each article will be a concise introduction to a small subset of methods. Wherever possible, diagrams
will be used and equations kept to a minimum; for those wanting more theory, references to relevant statistical books and standards will be included.
By the end of the series, the scientist should have an understanding of the
most common statistical methods and be able to perform the tests while
avoiding the pitfalls inherent in their misapplication.

In this article we look at the initial steps in data analysis (i.e., exploratory data analysis), and how to calculate the basic summary statistics (the mean and sample standard deviation). These two processes, which increase our understanding of the data structure, are vital if the correct selection of more advanced statistical methods and interpretation of their results are to be achieved. From that base we will progress to significance testing (t-tests and the F-test). These statistics allow a comparison between two sets of results in an objective and unbiased way. For example, significance tests are useful when comparing a new analytical method with an old method or when comparing the current day's production with that of the previous day.
Exploratory Data Analysis
Exploratory data analysis is a term used to
describe a group of techniques (largely
graphical in nature) that sheds light on the
structure of the data. Without this
knowledge the scientist, or anyone else,
cannot be sure they are using the correct
form of statistical evaluation.
The statistics and graphs referred to in this
first section are applicable to a single
column of data (i.e., univariate data), such
as the number of analyses performed in a
laboratory each month. For small amounts of data (<15 points), a blob plot (also known as a dot plot) can be used to explore how the data set is distributed (Figure 1). Blob plots are constructed simply by drawing a line, marking it off with a suitable scale and plotting the data along the axis.
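The construction just described is easy to sketch as a text-mode plot. This is my own minimal illustration (not from the article): each value is mapped onto a scaled axis, with 'o' marking positions where points overlap.

```python
# Minimal text-mode blob (dot) plot: draw an axis, scale it, and mark
# each data point along it; overlapping points are shown as 'o'.
def blob_plot(data, width=40):
    lo, hi = min(data), max(data)
    axis = [" "] * (width + 1)
    for x in data:
        pos = round((x - lo) / (hi - lo) * width)  # map value onto the axis
        axis[pos] = "*" if axis[pos] == " " else "o"
    return f"{lo:<8g}{''.join(axis)}{hi:>8g}"

print(blob_plot([9.8, 10.1, 10.2, 10.2, 10.3, 12.9]))
```

A cluster of marks with one isolated point at the far end of the axis is exactly the pattern that suggests a possible outlier.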
A stem-and-leaf plot is another method for examining patterns in the data set. These plots are complex to describe and can seem old fashioned given the modern graphical packages available today; for the sake of completeness they are described in Box 1.
For larger data sets, frequency histograms (Figure 2(a)) and Box and Whisker plots (Figure 2(b)) may be better options to display the data distribution. Once the data set is entered or, as is more usual with modern instrumentation, electronically imported, most modern PC statistical packages can construct these graph types with a few clicks of the mouse. All of these plots can give an
indication of the presence or absence of
outliers (1). The frequency histogram, stem
and leaf plot, and blob plot can also
indicate the type of distribution the data
belongs to. It should be remembered that if the data set is from a non-normal (2) distribution (Figure 2(a) and possibly Figure 1(a)), what looks like an outlier may in fact be a good piece of information. The apparent outliers are the most extreme points on the right-hand side of Figures 1(a) and 2(a). Note: outliers, outlier tests and robust methods will be the subject of a later article.
Assuming there are no obvious outliers,
we still have to do one more plot to make
sure we understand the data structure. The
individual results should be plotted against
a time index (i.e., the order the data were

figure 1 Blob plots of the raw data (panels (a) and (b); each shows the data along a scaled axis with the mean marked).


obtained). If any systematic trends are observed (Figures 3(a)–3(c)) then the reasons for this must be investigated. Normal statistical methods assume a random distribution about the mean with time (Figure 3(d)); if this is not the case the interpretation of the statistics can be erroneous.
Summary Statistics
Summary statistics are used to make sense
of large amounts of data. Typically, the
mean, sample standard deviation, range,
confidence intervals, quantiles (1), and
measures for skewness and
spread/peakedness of the distribution
(kurtosis) are reported (2). The mean and
sample standard deviation are the most
widely used and are discussed below

Box 1: Stem-and-leaf plot

A stem-and-leaf plot is another method of examining patterns in the data set. It shows the range of the data, where the values are concentrated and how symmetric the distribution is. The plot is constructed by splitting each data value into a stem (the leading digits; in the figure below, 0.1 to 0.6) and a leaf (the trailing digit). Thus, 0.216 is represented as 2|1 and 0.350 by 3|5. Note, the decimal places are truncated and not rounded in this type of plot. Reading the plot below, we can see that the data values range from 0.12 to 0.63. The column on the left contains the depth information (i.e., how many leaves lie on the lines closest to that end of the range). Thus, there are 13 points which lie between 0.40 and 0.63. The line containing the middle value is indicated differently, with a count (the number of items in the line) enclosed in parentheses.

Stem-and-leaf plot (units = 0.1, so 1|2 = 0.12; N = 42)

Count
  5   1|22677
 14   2|112224578
(15)  3|000011122333355
 13   4|0047889
  6   5|56669
  1   6|3
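The truncation rule in Box 1 can be sketched in a few lines of Python. This is my own illustration (not the article's code) and assumes values between 0 and 1, with the stem taken from the first decimal digit and the leaf from the second:

```python
# Build stem-and-leaf lines: for 0.216 the stem is '2', the leaf '1';
# later digits are simply dropped (truncated, not rounded).
def stem_and_leaf(data):
    stems = {}
    for x in sorted(data):
        digits = f"{x:.6f}"               # "0.216000" -> stem '2', leaf '1'
        stem, leaf = digits[2], digits[3]
        stems.setdefault(stem, []).append(leaf)
    return [f"{s}|{''.join(v)}" for s, v in sorted(stems.items())]

for line in stem_and_leaf([0.12, 0.16, 0.21, 0.24, 0.35, 0.35, 0.42]):
    print(line)
```

For the seven values above this prints the lines 1|26, 2|14, 3|55 and 4|2.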

together with how they relate to the confidence intervals for normally distributed data.
The Mean
The Mean
The average or arithmetic mean (3) is generally the first statistic everyone is taught to calculate. This statistic is easily found using a calculator or spreadsheet and simply involves summing the individual results (x1, x2, x3, ..., xn) and dividing by the number of results (n):

x̄ = (Σ xi)/n

where the sum runs over i = 1 to n:

Σ xi = x1 + x2 + x3 + ... + xn


The Standard Deviation (3)


The standard deviation is a measure of the
spread of data (dispersion) about the mean
and can again be calculated using a
calculator or spreadsheet. There is,
however, a slight added complication; if
you look at a typical scientific calculator
you will notice there are two types of

figure 2 Frequency histogram (a) and Box and Whisker plot (b). The histogram's vertical axis shows the frequency (number of data points in each bar); the Box and Whisker plot marks the lower quartile value, the median, the upper quartile value, whiskers at 1.5 × the interquartile range and any outliers plotted individually. The interquartile range is the range which contains the middle 50% of the data when it is sorted into ascending order.

[Figure 3, panels (a)–(c): magnitude plotted against time, each showing a systematic trend; the panel statistics include n = 7, mean = 6, standard deviation = 2.16 and n = 9, mean = 6, standard deviation = 2.65.]

Unfortunately, the mean is often reported as an estimate of the true value (μ) of whatever is being measured without considering the underlying distribution. This is a mistake. Before any statistic is calculated it is important that the raw data are carefully scrutinized and plotted as described above. An outlying point can have a big effect on the mean (compare Figure 1(a) with 1(b)).

[Figure 3, panel (d): magnitude plotted against time showing random variation about the mean; the remaining panel statistics are n = 9, mean = 6, standard deviation = 1.80 and n = 9, mean = 6, standard deviation = 2.06.]

figure 3 Time-indexed plots.


figure 4 The relationship between the normal distribution curve, the mean and standard deviation: 68% of results lie within ±1 standard deviation of the mean, 95% within ±2 and 99.7% within ±3.

standard deviation (denoted by the symbols σn and σn-1, or σ and s). The
correct one to use depends upon how the
problem is framed. For example, each
batch of a chemical contains 10 sub-units.
You are asked to analyse each sub-unit, in
a single batch, for mercury contamination
and report the mean mercury content and
standard deviation. Now, if the mean and
standard deviation are to be used solely
with this analysed batch, then the 10
results represent the whole population (i.e.,
all are tested) and the correct standard
deviation to use is the one for a population
(n). If, however, the intended use of the
results is to estimate the mercury

contamination for several batches of the chemical, the 10 results then represent a sample from the whole population and the correct standard deviation to use is that for a sample (σn-1). If you are using a statistical package you should always check that the correct standard deviation is being calculated for your particular problem.

σn = √(Σ(xi − μ)²/n)        σn-1 = s = √(Σ(xi − x̄)²/(n − 1))

figure 5 Comparison of different data sets. Panel annotations: (a) probably not different and would 'pass' the t-test (tcrit > tcalculated value); (b) probably different and would 'fail' the t-test (tcrit < tcalculated value); (c) could be different but not enough data to say for sure (i.e., would 'pass' the t-test, tcrit > tcalculated value); (d) practically identical means, but with so many data points there is a small but statistically significant ('real') difference and so would 'fail' the t-test (tcrit < tcalculated value); (e) spread in the data as measured by the variance is similar, so the sets would 'pass' the F-test (Fcrit > Fcalculated value); (f) spread in the data as measured by the variance is different, so the sets would 'fail' the F-test (Fcrit < Fcalculated value) and hence (i) gives more consistent results than (ii); (g) could be a different spread but not enough data to say for sure, so would 'pass' the F-test (Fcrit > Fcalculated value).
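Python's standard library makes the distinction between the two formulas explicit, which is a quick way to check what a given package is computing. The mercury-style results below are invented for illustration:

```python
import statistics

# pstdev divides by n (population form, sigma-n); stdev divides by
# n - 1 (sample form, sigma-n-1). The data are illustrative only.
results = [4.1, 3.9, 4.0, 4.2, 3.8, 4.0, 4.1, 3.9, 4.0, 4.0]

mean = statistics.mean(results)
pop_sd = statistics.pstdev(results)    # population standard deviation
samp_sd = statistics.stdev(results)    # sample standard deviation

print(f"mean = {mean:.2f}, sigma-n = {pop_sd:.4f}, sigma-n-1 = {samp_sd:.4f}")
```

The sample form is always slightly larger than the population form, and the gap shrinks as n grows.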

Interpreting the mean and standard deviation
If the distribution is normal (i.e., when the data are plotted they approximate the curve shown in Figure 4) then the mean is located at the centre of the distribution. Sixty-eight per cent of the results will be contained within ±1 standard deviation of the mean, 95% within ±2 standard deviations and 99.7% within ±3 standard deviations.
Using the above facts it is possible to estimate a standard deviation from a stated confidence interval and, vice versa, a confidence interval from a standard deviation. For example, if a mean value of 0.72 ± 0.02 g/L at the 95% confidence level is quoted then it follows that the standard deviation = 0.02/2 or 0.01 g/L. If the same figure was quoted at the 99.7% confidence level the standard deviation would be 0.02/3 or 0.0067 g/L.
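Both facts are easy to verify in code. The fractions in Figure 4 come from the normal cumulative distribution function (the fraction within k standard deviations is erf(k/√2)), and the interval conversion is the 0.72 ± 0.02 g/L example above:

```python
import math

# Fraction of a normal distribution within k standard deviations of
# the mean: erf(k / sqrt(2)) reproduces the 68/95/99.7 rule.
for k in (1, 2, 3):
    print(f"within +/-{k} sd: {math.erf(k / math.sqrt(2)):.1%}")

# Converting a quoted confidence interval back to a standard deviation:
# divide the half-width by the k that matches the confidence level.
half_width = 0.02
sd_from_95 = half_width / 2    # 95% level, k = 2 -> 0.01 g/L
sd_from_997 = half_width / 3   # 99.7% level, k = 3 -> ~0.0067 g/L
print(sd_from_95, round(sd_from_997, 4))
```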


Significance Testing
Suppose, for example, we have the following two sets of results for lead content in water: 17.3, 17.3, 17.4, 17.4 and 18.5, 18.6, 18.5, 18.6. It is fairly clear, by simply looking at the data, that the two sets are different. In reaching this conclusion you have probably considered the amount of data, the average for each set and the spread in the results. In many situations, however, the difference between two sets of data is not so clear-cut. The application of significance tests gives us a more systematic way of assessing the results, with the added advantage of allowing us to express our conclusion with a stated degree of confidence.
What does significance mean?
In statistics the words significant and significance have specific meanings. A significant difference means a difference that is unlikely to have occurred by chance. A significance test shows up differences unlikely to have arisen from purely random variation.
As previously mentioned, to decide if one
set of results is significantly different from
another depends not only on the
magnitude of the difference in the means
but also on the amount of data available and its spread. For example, consider the blob plots shown in Figure 5. For the two
data sets shown in Figure 5(a), the means
for set (i) and set (ii) are numerically
different. From the limited amount of
information available, however, they are
from a statistical point of view the same.
For Figure 5(b), the means for set (i) and
set (ii) are probably different but when
fewer data points are available, Figure 5(c),
we cannot be sure with any degree of
confidence that the means are different
even if they are a long way apart. With a
large number of data points, even a very
small difference, can be significant (Figure
5(d)). Similarly, when we are interested in
comparing the spread of results, for
example, when we want to know if
method (i) gives more consistent results
than method (ii), we have to take note of
the amount of information available (Figures 5(e)–5(g)).
Fortunately, tables are published that show how large a difference needs to be before it can be considered not to have occurred by chance: critical t-values for differences between means, and critical F-values for differences between the spread of results (4).
Note: significance is a function of sample size. Comparing very large samples will

Alternate hypothesis (H1): A statement describing the alternative to the null hypothesis (i.e., there is a difference between the means [see two-tailed], or mean1 is greater (or less) than mean2 [see one-tailed]).

Critical value (tcrit or Fcrit): The value obtained from statistical tables or statistical packages at a given confidence level against which the result of applying a significance test is compared.

Null hypothesis (H0): A statement describing what is being tested (i.e., there is no difference between the two means [mean1 = mean2]).

One-tailed: A one-tailed test is performed if the analyst is only interested in the answer when the result is different in one direction, for example, (1) the new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value ≥, >, < or ≤ is used in the alternate hypothesis). In these cases the calculation to determine the t-value is the same as that for the two-tailed t-test but the critical value is different.

Population: A large group of items or measurements under investigation (e.g., 2500 lots from a single batch of a certified reference material).

Sample: A group of items or measurements taken from the population (e.g., 25 lots of a certified reference material taken from a batch containing 2500 lots).

Two-tailed: A two-tailed t-test is performed if the analyst is interested in any change, for example, is method A different from method B (i.e., ≠ is used in the alternate hypothesis). Under most circumstances two-tailed t-tests should be performed.

table 1 Definitions of statistical terms used in significance testing.

nearly always lead to a significant difference, but a statistically significant result is not necessarily an important result. For example, in Figure 5(d) there is a statistically significant difference, but does it really matter in practice?
What is a t-test?
A t-test is a statistical procedure that can
be used to compare mean values. A lot of
jargon surrounds these tests (see Table 1 for definitions of the terms used below) but
they are relatively simple to apply using the
built-in functions of a spreadsheet like
Excel or a statistical software package.
Using a calculator is also an option but you
have to know the correct formula to apply
(see Table 2) and have access to statistical
tables to look up the so-called critical
values (4).
Three worked examples are shown in
Box 2 (5) to illustrate how the different
t-tests are carried out and how to interpret
the results.
What is an F-test?
An F-test compares the spread of results in
two data sets to determine if they could
reasonably be considered to come from the
same parent distribution. The test can,
therefore, be used to answer questions
such as are two methods equally precise?
The measure of spread used in the F-test is the variance, which is simply the square of the standard deviation. The variances are ratioed (i.e., the variance of one set of data is divided by the variance of the other) to get the test value:

F = s1²/s2²
This F value is then compared with a critical value that tells us how big the ratio needs to be to rule out the difference in spread occurring by chance. The Fcrit value is found from tables using (n1 − 1) and (n2 − 1) degrees of freedom, at the appropriate level of confidence. [Note: it is usual to arrange s1 and s2 so that F > 1.] If the standard deviations are to be considered to come from the same population then Fcrit > F. As an example we use the data in Example 2 (see Box 2).
F = 2.750²/1.471² = 3.49

Fcrit = 9.605 for (5 − 1) and (5 − 1) degrees of freedom at the 97.5% confidence level. As Fcrit > Fcalculated we can conclude that the spreads of results in the two data sets are not significantly different and it is, therefore, reasonable to combine the two standard deviations as we have done.
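This calculation takes only a couple of lines; the sketch below uses the selenium standard deviations from Example 2 and the tabulated critical value quoted above:

```python
# F-test sketch: ratio of variances, larger on top so that F >= 1.
s1, s2 = 1.471, 2.750   # sample standard deviations from Example 2 (n = 5 each)
F = max(s1, s2) ** 2 / min(s1, s2) ** 2
F_crit = 9.605          # from tables: 97.5% level, (4, 4) degrees of freedom

print(f"F = {F:.2f}; significant difference in spread: {F > F_crit}")
```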


Using statistical software (what is a p-value?)
When you use statistical software packages and some spreadsheet functions, the results of performing a significance test are often summarized as a p-value. The p-value represents an inverse index of the reliability of the statistic (i.e., the probability of error in accepting the observed result as valid). Thus, if we are comparing two means to see if they are different, a p-value of 0.10 is equivalent to saying we are 90% certain that the means are different; 0.05 is equivalent to saying we are 95% certain; and 0.01 that we are 99% certain, i.e., [(1 − p) × 100%]. It is usual when analysing chemical data (but somewhat arbitrary) to say that p-values ≤ 0.05 are statistically significant.
Some assumptions
behind significance testing
In most statistical tests it is
assumed that the sample correctly
represents the population and that the
population follows a normal distribution.
Although these assumptions are never
complied with precisely, in a large number
of situations where laboratory data is being
used they are not grossly violated.
Conclusions
• Always plot your data and understand the patterns in it before calculating any statistic, even the arithmetic mean.
• Make sure the correct standard deviation is calculated for your particular circumstance. This will nearly always be the sample standard deviation (σn-1).
• Significance tests are used to compare, in an unbiased way, the means or spread (variance) of two data sets.
• The tests are easily performed using statistical routines in spreadsheets and statistical packages.
• The p-value is a measure of confidence in the result obtained when applying a significance test.
Acknowledgement
The preparation of this paper was supported under a contract with the UK Department of Trade and Industry as part of the National Measurement System Valid Analytical Measurement (VAM) Programme (6).
References
(1) ISO 3534 part 1: Statistics Vocabulary and
Symbols. Part 1: Probability and General
Statistical Terms (1993).
(2) BS 2846 part 7: Tests for Departure from
Normality (1984).


(3) BS 2846 part 4 (ISO 2854): Techniques of


Estimation Relating to Means and Variances
(1976).
(4) D.V. Lindley and W.F. Scott, New Cambridge
Elementary Statistical Tables (ISBN: 0 521
48485 5). Cambridge University Press (1995).
(5) T.J. Farrant, Practical Statistics for the Analytical
Scientist: A Bench Guide (ISBN: 085 404 4426),
Royal Society of Chemistry (1997).
(6) M. Sargent, VAM Bulletin, Issue 13, 45,
(Laboratory of the Government Chemist,
Teddington, UK) Autumn 1995.

Shaun Burke currently works in the Food


Technology Department of RHM Technology
Ltd, High Wycombe, Buckinghamshire, UK.
However, these articles were produced while
he was working at LGC, Teddington,
Middlesex, UK (http://www.lgc.co.uk).

Bibliography
1. G.B. Wetherill, Elementary Statistical
Methods, Chapman and Hall, London,
UK.
2. J.C. Miller and J.N. Miller, Statistics for
Analytical Chemistry, Ellis Horwood PTR
Prentice Hall, London, UK.
3. J. Tukey, Exploratory Data Analysis, Addison-Wesley.
4. T.J. Farrant, Practical Statistics for the
Analytical Scientist: A Bench Guide
(ISBN: 085 404 4426), Royal Society of
Chemistry, London, UK (1997).

t-test to use when comparing | Equation

The long-term average (population mean, μ) with a sample mean:

t = (x̄ − μ) / (s/√n)

The difference between two means (e.g., two analytical methods, paired results):

for a two-tailed test, t = |d|√n / sd
for a one-tailed test (the sign is important), t = d√n / sd

Difference between independent sample means with equal variances:

t = (x̄1 − x̄2) / (sc √(1/n1 + 1/n2))

Difference between independent sample means with unequal variances:

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

where: x̄ is the sample mean, μ is the population mean, s is the standard deviation for the sample, n is the number of items in the sample, |d| is the absolute mean difference between pairs, d is the mean difference between pairs, sd is the sample standard deviation for the pairs, x̄1 and x̄2 are two independent sample means, n1 and n2 are the number of items making up each sample, and sc is the combined standard deviation found using

sc = √( (s1²(n1 − 1) + s2²(n2 − 1)) / (n1 + n2 − 2) )

where s1 and s2 are the sample standard deviations.

Note: the degrees of freedom (ν) used for looking up the critical t-value for independent sample means with unequal variances is given by

1/ν = s1⁴ / (k² n1² (n1 − 1)) + s2⁴ / (k² n2² (n2 − 1)), where k = s1²/n1 + s2²/n2

table 2 Summary of statistical formulae.
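The combined standard deviation and the unequal-variance degrees of freedom from Table 2 translate directly into small helper functions. The function names are mine, added for illustration:

```python
import math

# Pooled (combined) standard deviation for the equal-variance t-test.
def pooled_sd(s1, n1, s2, n2):
    return math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))

# Degrees of freedom for the unequal-variance t-test (Table 2 note).
def welch_df(s1, n1, s2, n2):
    k = s1**2 / n1 + s2**2 / n2
    inv = (s1**4 / (k**2 * n1**2 * (n1 - 1))
           + s2**4 / (k**2 * n2**2 * (n2 - 1)))
    return 1 / inv

print(round(pooled_sd(1.471, 5, 2.750, 5), 3))  # the sc used in Example 2
```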


Box 2
Example 1
A chemist is asked to validate a new economic method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 g/L. For the new method the mean is 23.5 g/L, based on 10 results with a standard deviation of 0.9 g/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different. In statistical terminology this is written as:
The null hypothesis (H0): new method mean = long-term check sample mean.
The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.

To test the null hypothesis we calculate the t-value as below. Note, the calculated t-value is the ratio of the difference between the means and a measure of the spread (standard deviation) and the amount of data available (n):

t = (23.5 − 22.7) / (0.9/√10) = 2.81

In the final step of the significance test we compare the calculated t-value with the critical t-value obtained from tables (4). To look up the critical value we need to know three pieces of information:
(i) Are we interested in the direction of the difference between the two means, or only that there is a difference? That is, are we performing a one-tailed or two-tailed t-test (see Table 1)? In the case above it is the latter, therefore the two-tailed critical value is used.
(ii) The degrees of freedom: this is simply the number of data points minus one (n − 1).
(iii) How certain do we want to be about our conclusions? It is normal practice in chemistry to select the 95% confidence level (i.e., about 1 in 20 times we perform the t-test we could arrive at an erroneous conclusion). However, in some situations this is an unacceptable level of error, such as in medical research. In these cases, the 99% or even the 99.9% confidence level can be chosen.

Method 1: 4.2, 4.5, 6.8, 7.2, 4.3 (mean = 5.40, standard deviation = 1.471)
Method 2: 9.2, 4.0, 1.9, 5.2, 3.5 (mean = 4.76, standard deviation = 2.750)

table 3 Results from two methods used to determine concentrations of selenium.

tcrit = 2.26 at the 95% confidence level for 9 degrees of freedom. As tcalculated > tcrit we can reject the null hypothesis and conclude that we are 95% certain that there is a significant difference between the new and old methods. [Note: this does not mean the new derivatization method should be abandoned. A judgement needs to be made on the economics and on whether the results are fit for purpose. The significance test is only one piece of information to be considered.]
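Example 1 can be reproduced in a few lines; the only tabulated number needed is the critical value:

```python
import math

# One-sample t-test: new method mean against the long-term check-sample mean.
long_term_mean = 22.7
new_mean, s, n = 23.5, 0.9, 10

t = (new_mean - long_term_mean) / (s / math.sqrt(n))
t_crit = 2.26   # from tables: two-tailed, 95% level, 9 degrees of freedom

print(f"t = {t:.2f}; reject H0: {abs(t) > t_crit}")
```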
Example 2 (5)
Two methods for determining the concentration of selenium are to be compared. The results from each method are shown in Table 3. Using the t-test for independent sample means we define the null hypothesis H0 as x̄1 = x̄2, meaning there is no difference between the means of the two methods (the alternative hypothesis is H1: x̄1 ≠ x̄2). If the two methods have sample standard deviations that are not significantly different then we can combine (or pool) the standard deviations (sc) (see What is an F-test?); if the standard deviations are significantly different then the t-test for unequal variances should be used instead (Table 2).

sc = √((1.471² × (5 − 1) + 2.750² × (5 − 1)) / (5 + 5 − 2)) = 2.205

Evaluating the test statistic t:

t = (5.40 − 4.76) / (2.205 × √(1/5 + 1/5)) = 0.64/1.395 = 0.459

The 95% critical value is 2.306 for 8 (= n1 + n2 − 2) degrees of freedom. This exceeds the calculated value of 0.459, thus the null hypothesis (H0) cannot be rejected and we conclude there is no significant difference between the means of the results given by the two methods.
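The whole of Example 2 can be recomputed from the raw selenium results in Table 3, which is a useful check on the hand calculation:

```python
import math

# Pooled two-sample t-test on the selenium results (Table 3).
m1 = [4.2, 4.5, 6.8, 7.2, 4.3]
m2 = [9.2, 4.0, 1.9, 5.2, 3.5]
n1, n2 = len(m1), len(m2)
x1, x2 = sum(m1) / n1, sum(m2) / n2
s1 = math.sqrt(sum((x - x1) ** 2 for x in m1) / (n1 - 1))
s2 = math.sqrt(sum((x - x2) ** 2 for x in m2) / (n2 - 1))
sc = math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))
t = (x1 - x2) / (sc * math.sqrt(1 / n1 + 1 / n2))

print(f"x1 = {x1:.2f}, x2 = {x2:.2f}, sc = {sc:.3f}, t = {t:.3f}")
```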
Example 3 (5)

Two methods are available for


determining the concentration of
vitamins in foodstuffs. To compare
the methods several different sample
matrices are prepared using the same
technique. Each sample preparation is
then divided into two aliquots and
readings are obtained using the two
methods, ideally commencing at the
same time to lessen the possible effects
of sample deterioration. The results are
shown in Table 4.

The null hypothesis is H0: d = 0 against the alternative H1: d ≠ 0. The test is a two-tailed test as we are interested in both d < 0 and d > 0. The mean of the differences is d = -0.475 (|d| = 0.475) and the sample standard deviation of the paired differences is sd = 0.700.
Evaluating the test statistic:

t = |d|√n / sd = (0.475 × √8) / 0.700 = 1.918

The tabulated value of tcrit (with 7 (= n − 1) degrees of freedom, at the 95% confidence level) is 2.365. Since the calculated value is less than the critical value, H0 cannot be rejected and it follows that there is no significant difference between the two techniques.

Matrix:            1      2      3      4      5      6      7      8
Method A (mg/g):  2.52   3.13   4.33   2.25   2.79   3.04   2.19   2.16
Method B (mg/g):  3.17   5.00   4.03   2.38   3.68   2.94   2.83   2.18
Difference (d):  -0.65  -1.87   0.30  -0.13  -0.89   0.10  -0.64  -0.02

table 4 Comparison of two methods used to determine the concentration of vitamins in foodstuffs.
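A paired t-test on the differences in Table 4 reproduces Example 3:

```python
import math

# Paired t-test on the method A - method B differences from Table 4.
d = [-0.65, -1.87, 0.30, -0.13, -0.89, 0.10, -0.64, -0.02]
n = len(d)
d_bar = sum(d) / n
s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
t = abs(d_bar) * math.sqrt(n) / s_d
t_crit = 2.365   # from tables: two-tailed, 95% level, 7 degrees of freedom

print(f"|d| = {abs(d_bar):.3f}, sd = {s_d:.3f}, t = {t:.3f}; reject H0: {t > t_crit}")
```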
