Understanding the Structure of Scientific Data
Shaun Burke, RHM Technology Ltd, High Wycombe, Buckinghamshire, UK.
This is the first in a series of articles that aims to promote the better use of
statistics by scientists. The series intends to show everyone from bench
chemists to laboratory managers that the application of many statistical
methods does not require the services of a statistician or a mathematician
to convert chemical data into useful information. Each article will be a concise introduction to a small subset of methods. Wherever possible, diagrams
will be used and equations kept to a minimum; for those wanting more theory, references to relevant statistical books and standards will be included.
By the end of the series, the scientist should have an understanding of the
most common statistical methods and be able to apply them while
avoiding the pitfalls that are inherent in their misapplication.
[Figure: two panels, (a) and (b), each showing data on a scale with the mean marked.]
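$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

where,

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$$

[Figure: (a) stem-and-leaf plot; (b) box-and-whisker plot.]

(a) Stem-and-leaf plot. Key: 1|2 = 0.12. The count column gives the cumulative number of values from the nearer end of the data; the bracketed count is the number of values in the row that contains the median.

Count   Stem|Leaf
5       1|22677
14      2|112224578
(15)    3|000011122333355
13      4|0047889
6       5|56669
1       6|3

(b) Box-and-whisker plot, labelled from top to bottom: 1.5 × interquartile range; upper quartile value; interquartile range; median; lower quartile value; 1.5 × interquartile range; outlier (*).

Note: the interquartile range is the range that contains the middle 50% of the data when it is sorted into ascending order.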
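The stem-and-leaf display and the box-plot labels above can be tied together with a short calculation. The sketch below decodes the leaves using the key 1|2 = 0.12, then computes the median, the quartiles and the 1.5 × interquartile-range limits. The variable names are illustrative, and the quartile convention is the default of Python's `statistics.quantiles`, which is an assumption: other software may report slightly different quartiles.

```python
import statistics

# Stem-and-leaf display from the text; key: 1|2 = 0.12
stem_and_leaf = {
    1: "22677",
    2: "112224578",
    3: "000011122333355",
    4: "0047889",
    5: "56669",
    6: "3",
}

# Decode each stem/leaf pair into a value, e.g. stem 1, leaf 2 -> 0.12
data = sorted(
    (10 * stem + int(leaf)) / 100
    for stem, leaves in stem_and_leaf.items()
    for leaf in leaves
)

median = statistics.median(data)

# Assumption: default quartile convention of statistics.quantiles;
# other conventions give slightly different values.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # range containing the middle 50% of the sorted data

# Limits drawn at 1.5 x interquartile range beyond the quartiles
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(f"n = {len(data)}, median = {median:.3f}")
print(f"lower quartile = {q1:.3f}, upper quartile = {q3:.3f}, IQR = {iqr:.3f}")
print(f"points outside the 1.5 x IQR limits: {outliers}")
```

Any value outside the two limits would be drawn individually as an outlier on the box plot.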
[Figure: four panels, (a) to (d), each plotting Magnitude (0 to 10) against Time. Summary statistics shown with the panels: n = 7, mean = 6, standard deviation = 2.16; n = 9, mean = 6, standard deviation = 2.65; n = 9, mean = 6, standard deviation = 1.80; n = 9, mean = 6, standard deviation = 2.06.]
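[Figure: normal distribution centred on the mean, with bands containing 68%, 95% and 99.7% of the values, marked at 1, 2 and 3 standard deviations on either side of the mean.]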
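The 68%, 95% and 99.7% bands marked on the normal-distribution figure can be checked numerically. This is a minimal sketch using Python's standard-library `NormalDist`; the names are illustrative. Note that the figure's percentages are rounded: two standard deviations actually cover slightly more than 95%.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Fraction of values within k standard deviations either side of the mean
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, fraction in coverage.items():
    print(f"within {k} standard deviation(s): {100 * fraction:.1f}%")
```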
n ((xi )2 / n)
[Figure: panels each comparing two data sets, (i) and (ii). Panel annotations:
Practically identical means, but with so many data points there is a small but statistically significant ('real') difference, so the pair would 'fail' the t-test (tcrit < tcalculated value).
The spreads in the data, as measured by the variance, are similar, so the pair would 'pass' the F-test (Fcrit > Fcalculated value).]
Significance Testing
Suppose, for example, we have the
following two sets of results for lead
content in water: 17.3, 17.3, 17.4, 17.4
and 18.5, 18.6, 18.5, 18.6. It is fairly clear,
by simply looking at the data, that the two
sets are different. In reaching this
conclusion you have probably considered
the amount of data, the average for each
set and the spread in the results. The
difference between two sets of data is,
however, not so clear in many situations.
The application of significance tests gives
us a more systematic way of assessing the
results with the added advantage of
allowing us to express our conclusion with
a stated degree of confidence.
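The three things considered when judging the two lead data sets (the amount of data, the average and the spread) can be computed directly. The sketch below uses only the values quoted above; the function and variable names are illustrative.

```python
import statistics

# Lead results quoted in the text (units are not stated in the source)
set_1 = [17.3, 17.3, 17.4, 17.4]
set_2 = [18.5, 18.6, 18.5, 18.6]


def summarise(values):
    """Return the amount of data, the average and the spread (sample SD)."""
    return len(values), statistics.mean(values), statistics.stdev(values)


n1, mean1, sd1 = summarise(set_1)
n2, mean2, sd2 = summarise(set_2)
difference = mean2 - mean1

print(f"set 1: n = {n1}, mean = {mean1:.2f}, standard deviation = {sd1:.3f}")
print(f"set 2: n = {n2}, mean = {mean2:.2f}, standard deviation = {sd2:.3f}")
print(f"difference between the means = {difference:.2f}")
```

Here the difference between the means is many times larger than either standard deviation, which is why the two sets look different at a glance.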
What does significance mean?
In statistics the words significant and
significance have specific meanings. A
significant difference means a difference
that is unlikely to have occurred by chance.
A significance test shows up differences
that are unlikely to occur because of purely
random variation.
As previously mentioned, deciding whether one
set of results is significantly different from
another depends not only on the
magnitude of the difference in the means
but also on the amount of data available
and the spread in the results.
Jargon: Definition
Null hypothesis (H0):
One-tailed: ... the new production method results in a higher yield, or (2) the amount of waste product is reduced (i.e., a limit value ≤, >, < or ≥ is used in the alternative hypothesis). In these cases the calculation to determine the t-value is the same as that for the two-tailed t-test, but the critical value is different.
Population:
Sample:
Two-tailed:
Bibliography
1. G.B. Wetherill, Elementary Statistical Methods, Chapman and Hall, London, UK.
2. J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Ellis Horwood PTR Prentice Hall, London, UK.
3. J. Tukey, Exploratory Data Analysis, Addison-Wesley.
4. T.J. Farrant, Practical Statistics for the Analytical Scientist: A Bench Guide (ISBN: 085 404 4426), Royal Society of Chemistry, London, UK (1997).
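Equation

Sample mean compared with a population mean:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

Paired results, using the absolute mean difference:

$$t = \frac{|\bar{d}|\,\sqrt{n}}{s_d}$$

Paired results, using the mean difference:

$$t = \frac{\bar{d}\,\sqrt{n}}{s_d}$$

Two independent sample means, using the combined standard deviation:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_c\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

Two independent sample means, using the separate standard deviations:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where:
$\bar{x}$ is the sample mean, $\mu$ is the population mean, $s$ is the standard deviation for the sample, $n$ is the number of items in the sample, $|\bar{d}|$ is the absolute mean difference between pairs, $\bar{d}$ is the mean difference between pairs, $s_d$ is the sample standard deviation for the pairs, $\bar{x}_1$ and $\bar{x}_2$ are two independent sample means, and $n_1$ and $n_2$ are the number of items making up each sample.

The combined standard deviation $s_c$ is

$$s_c = \sqrt{\frac{s_1^2\,(n_1 - 1) + s_2^2\,(n_2 - 1)}{n_1 + n_2 - 2}}$$

The degrees of freedom, written here as $\nu$, for the form using the separate standard deviations is given by

$$\frac{1}{\nu} = \frac{s_1^4}{k^2\,n_1^2\,(n_1 - 1)} + \frac{s_2^4}{k^2\,n_2^2\,(n_2 - 1)} \quad \text{where} \quad k = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$$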
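The last equation in the table, the form that keeps the two standard deviations separate, and its degrees-of-freedom expression are the least obvious to apply by hand. The sketch below evaluates both. The function name `unequal_variance_t` is hypothetical, and the input values are the Method 1 and Method 2 summary statistics quoted later in this section, used here only as illustrative numbers.

```python
import math


def unequal_variance_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, degrees of freedom) without pooling the standard deviations."""
    # k is the sum of the two squared standard errors
    k = s1**2 / n1 + s2**2 / n2
    t = (mean1 - mean2) / math.sqrt(k)
    # Reciprocal of the degrees of freedom, as in the expression above
    inverse_dof = (
        s1**4 / (k**2 * n1**2 * (n1 - 1))
        + s2**4 / (k**2 * n2**2 * (n2 - 1))
    )
    return t, 1.0 / inverse_dof


# Illustrative inputs: means, standard deviations and sample sizes
t_value, dof = unequal_variance_t(5.40, 1.471, 5, 4.76, 2.750, 5)
print(f"t = {t_value:.3f}, degrees of freedom = {dof:.1f}")
```

The degrees of freedom are generally not a whole number in this form, so a critical value is usually looked up after rounding.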
Box 2
Example 1
A chemist is asked to validate a new economic method of derivatization before analysing a solution by a standard gas chromatography method. The long-term mean for the check samples using the old method is 22.7 g/L. For the new method the mean is 23.5 g/L, based on 10 results with a standard deviation of 0.9 g/L. Is the new method equivalent to the old? To answer this question we use the t-test to compare the two mean values. We start by stating exactly what we are trying to decide, in the form of two alternative hypotheses: (i) the means could really be the same, or (ii) the means could really be different. In statistical terminology this is written as:
The null hypothesis (H0): new method mean = long-term check sample mean.
The alternative hypothesis (H1): new method mean ≠ long-term check sample mean.
$$t = \frac{23.5 - 22.7}{0.9/\sqrt{10}} = 2.81$$
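The calculation for Example 1 can be reproduced in a few lines. This is a sketch under stated assumptions: the 95% confidence level and the critical value of 2.262 (two-tailed, 9 degrees of freedom, taken from standard t-tables) are not given in the text above and are chosen here for illustration.

```python
import math

long_term_mean = 22.7   # old method, long-term check sample mean (g/L)
new_method_mean = 23.5  # new method mean (g/L)
standard_deviation = 0.9
n_results = 10

# t for comparing a sample mean with a known long-term mean
t_calculated = (new_method_mean - long_term_mean) / (
    standard_deviation / math.sqrt(n_results)
)

# Assumption: two-tailed critical value at 95% confidence for
# n - 1 = 9 degrees of freedom, from standard t-tables
t_critical = 2.262

reject_null = abs(t_calculated) > t_critical
print(f"t calculated = {t_calculated:.2f}, t critical = {t_critical}")
print("means differ significantly" if reject_null else "no significant difference")
```

Under these assumptions the calculated value exceeds the critical value, so the null hypothesis would be rejected.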
                     Method 1   Method 2
                     4.2        9.2
                     4.5        4.0
                     6.8        1.9
                     7.2        5.2
                     4.3        3.5
Mean                 5.40       4.76
Standard deviation   1.471      2.750
hypothesis H0 as $\bar{x}_1 = \bar{x}_2$.
This means there is no difference between
the means of the two methods (the
alternative hypothesis is H1: $\bar{x}_1 \neq \bar{x}_2$). If
the two methods have sample standard
deviations that are not significantly
different, then we can combine (or pool)
the standard deviations ($s_c$).
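$$s_c = \sqrt{\frac{1.471^2\,(5 - 1) + 2.750^2\,(5 - 1)}{5 + 5 - 2}} = 2.205$$

$$t = \frac{5.40 - 4.76}{2.205\sqrt{\dfrac{1}{5} + \dfrac{1}{5}}} = \frac{0.64}{2.205 \times 0.632} = \frac{0.64}{1.395} = 0.459$$

For the paired differences (absolute mean difference 0.475, 8 pairs, standard deviation of the differences 0.700):

$$t = \frac{0.475\,\sqrt{8}}{0.700} = 1.918$$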
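The combined standard deviation and the resulting t-value for Method 1 and Method 2 can be recomputed from the raw results. The sketch below follows the pooled form of the equation; the 95% confidence level and the critical value of 2.306 (two-tailed, 8 degrees of freedom, from standard t-tables) are assumptions added for illustration.

```python
import math
import statistics

method_1 = [4.2, 4.5, 6.8, 7.2, 4.3]
method_2 = [9.2, 4.0, 1.9, 5.2, 3.5]

n1, n2 = len(method_1), len(method_2)
mean1, mean2 = statistics.mean(method_1), statistics.mean(method_2)
s1, s2 = statistics.stdev(method_1), statistics.stdev(method_2)

# Combined (pooled) standard deviation
s_c = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# t for two independent means with a pooled standard deviation
t_calculated = (mean1 - mean2) / (s_c * math.sqrt(1 / n1 + 1 / n2))

# Assumption: two-tailed critical value at 95% confidence for
# n1 + n2 - 2 = 8 degrees of freedom, from standard t-tables
t_critical = 2.306

print(f"s_c = {s_c:.3f}, t calculated = {t_calculated:.3f}")
print("significant" if abs(t_calculated) > t_critical else "not significant")
```

Under these assumptions the calculated value is well below the critical value, so the data give no evidence of a difference between the two method means.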
Table 4: Comparison of two methods used to determine the concentration of vitamins in foodstuffs. Each row is a different sample matrix; the difference is d = A − B.

Method A (mg/g)   Method B (mg/g)   Difference (d)
2.52              3.17              -0.65
3.13              5.00              -1.87
4.33              4.03               0.30
2.25              2.38              -0.13
2.79              3.68              -0.89
3.04              2.94               0.10
2.19              2.83              -0.64
2.16              2.18              -0.02
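The paired comparison in Table 4 can be worked through from the two columns of results. The sketch below computes the differences, their mean and standard deviation, and the paired t-value using the absolute mean difference. The 95% confidence level and the critical value of 2.365 (two-tailed, 7 degrees of freedom, from standard t-tables) are assumptions added for illustration.

```python
import math
import statistics

# Results from Table 4 (mg/g), one pair per sample matrix
method_a = [2.52, 3.13, 4.33, 2.25, 2.79, 3.04, 2.19, 2.16]
method_b = [3.17, 5.00, 4.03, 2.38, 3.68, 2.94, 2.83, 2.18]

# Difference for each pair, d = A - B
differences = [a - b for a, b in zip(method_a, method_b)]

n_pairs = len(differences)
d_mean = statistics.mean(differences)
s_d = statistics.stdev(differences)

# Paired t using the absolute mean difference
t_calculated = abs(d_mean) * math.sqrt(n_pairs) / s_d

# Assumption: two-tailed critical value at 95% confidence for
# n_pairs - 1 = 7 degrees of freedom, from standard t-tables
t_critical = 2.365

print(f"mean difference = {d_mean:.3f}, s_d = {s_d:.3f}")
print(f"t calculated = {t_calculated:.3f}, t critical = {t_critical}")
```

Pairing removes the large matrix-to-matrix variation from the comparison, which is why the differences, rather than the two columns, are tested.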