Environmetrics I
Introduction
In general, environmetrics is a discipline that deals with mathematical and statistical
methods for analysing and designing environmental measurements. However, in this course,
the emphasis is on methods related to environmental engineering.
When statistical methods are applied to other sciences, the term metrics is used,
as in biometrics, chemometrics, econometrics, environmetrics, psychometrics etc.
Some of these metric sciences already have a long history, e.g. biometrics or
psychometrics, but many of them are relatively new and not so well known, e.g.
chemometrics or environmetrics. Statistical methods are also of great importance in
informatics and especially in bioinformatics. The reason for the increasing importance
is the nature of modern measurement technology, which typically produces
multivariate measurement signals to be interpreted in a meaningful way. For example,
spectra (IR, NIR, XRF, Raman), chromatograms, zeta-potential curves and DNA-microarray
data all give signals that typically contain hundreds or thousands of
numbers instead of a single response. Mathematically speaking, the response is a
vector of numbers instead of a single number.
The same development has occurred also in monitoring industrial processes.
Modern on-line measurement technology gives a huge number of measurements
describing the state of a process at a given instant. Many of the statistical methods
used for analysing multivariate measurement signals can also be applied in
analysing multivariate process measurements.
Another field of applied statistics of increasing importance is related to improving
and controlling the quality of industrial processes, e.g. wastewater purification.
Statistical process control (SPC) and statistical design of experiments (DOE) have
become essential tools in all branches of industry. Historically, the use of statistical
methods in industry began within the production of industrial components during the
Second World War. Later the same ideas were introduced to the process industry as well,
especially by such statisticians as Box and Hunter. The Japanese developed these
ideas further, and statistical quality control was an essential part of the Japanese
success story. Some of the Japanese ideas developed for component production,
especially those of Taguchi, were transferred to the process industry somewhat
uncritically, neglecting the special nature of chemical and biological processes.
The most recent quality philosophy is the so-called Six Sigma quality policy,
developed by Motorola and adopted, for example, in leading telecommunication
companies like Nokia and in many chemical companies as well, e.g. Dow Chemicals
or Du Pont. Actually, Six Sigma is a collection of statistical methods applied both in
production and in product design, in addition to a general philosophy where statistical
aspects of all measurement processes are taken into account. The aim of this course
is to give the basic knowledge and skills for understanding and applying such quality
policies and applying statistical methods in process industry in general.
All statistical methods are computer intensive, i.e. applying these methods requires
the use of statistical software. In this introductory course, we shall use the statistical
capabilities of MS Excel, Matlab (or Octave) and R. Most of the tools found in Excel
can also be found in OpenOffice (www.openoffice.org). Plenty of commercial
statistical software is also available, such as MiniTab, SAS, BMDP, Systat,
Statgraphics or S-plus. There is also a powerful free statistical software called
R (http://www.r-project.org/), which we shall get acquainted with during the course (see
also http://users.metropolia.fi/~velimt/Environmetrics/I/tutorial4R.doc ). General
mathematical software, such as Matlab, Mathematica, MathCad or Maple, can also be
used for statistical calculations and analyses.
The aim of this course is to provide the statistical background knowledge to understand
and to learn the methods stated above. In addition, we will learn the use of the
most common statistical procedures. However, the emphasis is on understanding the
basic ideas, getting familiar with the statistical terminology and, moreover, learning
how to do some elementary statistical calculations in Excel or R.
The course is divided into the following topics:
- The nature of statistical variation
- Graphical tools for describing statistical properties of measurement data
- Computational (mathematical) tools for describing statistical properties of measurement data
- Basic concepts of probability, random variables and their distributions
- Measurement uncertainty
- Confidence and prediction intervals
- Principles of statistical testing and some most common tests
- Regression analysis and calibration
- Use of statistical software
Most of the examples in the text are of a general nature, but we shall also take up
environmental applications in the lectures.
The nature of statistical variation in measurement data
Some questions
- Why should an environmental engineer study statistics?
- Why is it important to take statistical variation into account in making conclusions from measurement data?
- How to discriminate between true (causal) dependencies and random (stochastic, statistical) differences?
- Why is it possible to estimate the amount and the nature of statistical variation?
- How can we make comparisons between different methods, equipment or models, taking the randomness in the measurement data into account?
- How to control processes minimizing the effects caused by uncertainties in process data?
Some facts
- Every measurement contains measurement errors.
- Measurement errors can be roughly divided into two categories: systematic and random.
- Random variation obeys some well-known laws.
- If random variation (statistical aspects) is not taken into account, seriously false conclusions can be made.
- Environmental laws and regulations are laced with statistical terms and concepts.
Exercise 1
Try to figure out how the above questions and facts are related, i.e. which facts
are important for, or maybe even give an answer to, a specific question.
Concepts related to statistical variation
Samples and population
A population in statistics means the collection of objects of interest. The objects
can be, e.g., individuals or measurement results. In most practical applications,
the population is a theoretical (hypothetical) concept, typically with infinitely
many objects. In such cases all analyses are based on finite samples. In many
applications of environmental measurements, finding a good sampling
scheme, one that will produce representative samples, is very important.
However, sampling theory would be too vast a field to be covered in this introductory
course, and our focus is on analysing sampled data, assuming that it is
representative enough. In some of the examples we shall work on some special
problems of sampling, e.g. testing the homogeneity of a material.
Precision and accuracy
Consider measuring the concentration of a chemical in a given sample. It is
natural to assume that the concentration is a constant, although unknown,
value. However, if we conduct repeated measurements under similar conditions
(with a sensitive enough analytical method), it is very unlikely to get two similar
results. The differences between the results of the replicate measurements
reflect the random error of the measurement. The difference between the true
value and the measured value is the measurement error. This error can be
divided into two parts: systematic (a bias) and random error. In most cases the
true value is not known and consequently the error is not known. In some cases
the true value can be considered as known, e.g. when making measurements
of a standard reference material.
Although repeated measurements vary (when talking about repeated measurements,
it is usually assumed that the measurements are conducted under
constant conditions!), we assume that they have a tendency to vary around a
fixed value, the expected value. The expected value (or the expectation) is a
hypothetical concept, which is almost never exactly known. However, its
existence can be justified by the so-called law of large numbers in the theory of
probability. In practice, this means that if we could make infinitely many repeated
measurements, their average (their mean value) would be the expected
value. The difference between the true value and the expected value is called
the bias, and it is the systematic part of the error.
The variation around the expected value, i.e. the differences between measured
values and the expected value, is the random (statistical) variation. The
closeness of measured values to the true value is called the accuracy of the
measurement, and the closeness of measured values to the expected value is
called the precision of the measurement. It is important to note that an accurate
measurement (a measurement having good accuracy) also has to be precise (a
measurement having good precision), but a precise measurement need not be
accurate!
Precision is often divided into repeatability and reproducibility.
Mathematically, a measurement result is considered a random variable (we
consider this concept more later). We also use the convention that random
variables are denoted by Latin capital letters and non-random
(deterministic) variables or constants are denoted by Greek letters or
Latin lower-case letters. Thus, if we denote the measurement result by Y,
the error by E, the expected value by $\mu$ and the true value by $y_0$, we have the
following decomposition:

$$E = Y - y_0 = (Y - \mu) + (\mu - y_0) \qquad (1)$$
The first difference on the right hand side of Eq. (1) corresponds to the random
error reflecting the precision and the second one to the systematic error (the
bias). From this equation we can see that both the systematic and the random
error are known only if we know the expected value and the true value. As
already mentioned, the true value is seldom known, but the expected value can
be estimated from repeated measurements. However, it is often assumed that
the measurement device is well calibrated, and consequently the systematic
error is zero or at least negligible.
Exercise 2
1) Consider shooting at a target (or playing darts). Depict the following cases
(use the space below): a) good accuracy and precision, b) good precision but
poor accuracy, c) poor accuracy and precision and d) poor precision without
bias. 2) Estimate the bias in your figures a, b and c.
a) b)
c) d)
The role of models in statistical reasoning
The learning process
Logical reasoning has two forms: inductive and deductive. Learning and the
accumulation of scientific knowledge is a result of both, typically one followed by
the other with experimentation in between. When we apply some theory, we are
using deductive logic, but when we form new theories, or hypotheses, based on
experimental results, we are using inductive logic. Typically, induction is based
on a contradiction between existing theories, which can be expressed as hypotheses,
and experimental results. It is good to keep this general framework in mind
when we work on analysing experimental results.
Models
It is important to understand that statistical reasoning, or scientific reasoning in general, is
impossible without mathematical models. Usually the models are stated as
hypotheses based on several assumptions. The measurement results either
support or falsify the model, but they never prove the model to be correct. Thus
all conclusions based on the measurement results contain uncertainty. The role
of statistics is, on the one hand, to minimize this uncertainty and, on the other
hand, to estimate the amount of uncertainty. Therefore, concepts of probability
are crucial in statistical reasoning.
Usually the models can be divided into a deterministic and a stochastic
(random) part, both based on certain assumptions related to the problem. We
illustrate this with a simple example, calibration by a straight line. Consider a
spectroscopic measurement: we measure the absorbance ($Y_i$) at several
known concentrations ($x_i$) and assume that the absorbance depends linearly on
the concentration; this is the deterministic part of the model. We also assume
that the measured values can be represented in the form $Y_i = y_i + E_i$, where the $y_i$'s
denote the expected values, which are the true values in the absence of
systematic errors, and the $E_i$'s denote the random errors. It is usually assumed that
the random errors are statistically independent, have expected value zero and
obey the normal (Gaussian) distribution. We come to these concepts later. This
is the stochastic part of the model. The deterministic part of the model of this
example can be expressed by the equation of a straight line

$$y_i = \beta_0 + \beta_1 x_i \qquad (2)$$
The task is to determine the unknown parameters in some optimal way. In
general, determining unknown parameters while taking random errors into account
is called parameter estimation, and the special case of estimating the parameters of
a known or hypothesized equation is called regression analysis; we shall discuss
both in more detail later on.
In practice, the stochastic part of the model is always related to a sample. A
sample is a collection of measurements of a hypothetical population (in some
other fields of science, the population need not be hypothetical). The question
of sampling is quite relevant to getting reliable results. Although problems of
sampling are not considered much in this course, their importance should be kept
in mind.
Variable types
Variables can be classified in many ways. The most important classification, from
an engineering point of view, is qualitative vs. quantitative. A qualitative variable
has values that have no quantitative interpretation, e.g. raw material types.
Qualitative variables can also be called categorical or, in some connections,
factors. It is said that a qualitative variable is measured on a nominal scale.
Quantitative variables can further be divided into discrete and continuous.
Some variables are semi-quantitative in the sense that the values can be
ordered, but they cannot be unequivocally quantized. Such variables are called
ordinal, or it is said that they are measured on an ordinal scale.
Graphical tools
We shall now present some graphical tools for describing measurement errors
that do not require such concepts as the mean, standard deviation, normal
distribution etc. Later on, after studying computational statistics, we shall present
some more graphical tools.
Histograms
A histogram is a special case of a bar chart. The measured values or, more
often, equally spaced intervals of measured values are put on the x-axis, and a
bar whose area is proportional to the number of observations falling into
each interval (or onto each value) is drawn on that interval. If the intervals have equal
lengths, the height is proportional to the number of observations falling into
the interval. The rules of thumb for creating good histograms are the following:
- Use equally spaced intervals whose end points are nice numbers
- The number of intervals should roughly be the square root of the number of observations (measurements)
- The union of the intervals should contain all observations
An example: suppose we have measured the pH of a solution repeatedly 30
times and got the following results:
9.11 9.03 9.00 9.18 8.91 9.00
8.98 9.05 8.94 9.03 9.01 9.05
8.85 9.13 9.21 8.89 8.99 8.93
9.00 8.95 8.97 9.06 8.88 8.93
9.01 9.03 8.86 9.13 9.12 9.03
Let us first sort the table (column-wise):
8.85 8.93 8.98 9.01 9.03 9.12
8.86 8.93 8.99 9.01 9.05 9.13
8.88 8.94 9.00 9.03 9.05 9.13
8.89 8.95 9.00 9.03 9.06 9.18
8.91 8.97 9.00 9.03 9.11 9.21
Because $\sqrt{30} \approx 5.5$, we make 5 intervals whose union contains the interval
[8.85, 9.21]. We get nice numbers if we divide the interval [8.80, 9.30] into 5
sub-intervals, whose limits are 8.8, 8.9, 9.0, 9.1, 9.2 and 9.3. This yields the
following histogram:
[Histogram of the 30 pH measurements with class limits 8.8, 8.9, 9.0, 9.1, 9.2 and 9.3; the bar heights are 4, 11, 9, 5 and 1.]
The figure tells us that the pH value tends to be around 9 and that the frequencies
become smaller the further the measurements are from 9. Actually, the
measurements seem to be quite normally distributed, and we shall come back to this
point later on.
In Excel, the histogram is drawn in the following way:
1. Type in the data into Excel
2. Type in the upper limits of the class intervals (these are called bins in
Excel)
3. Click Tools and then Data Analysis
4. Select the data range into the Input Range box
5. Select the bins range into the Bin Range box
6. Mark the Chart Output option
7. Click OK
Note that Excel uses the upper class interval limits on the x-axis and does not
automatically produce connected bars. These can be changed later!
[Excel output: a bar chart titled "Histogram" with the bins 8.90, 9.00, 9.10, 9.20, 9.30 and "More" on the x-axis and the frequency on the y-axis.]
It is useful to be able to make a histogram by hand as well. In R a
histogram is produced by the command hist(x), where x is a variable containing
the data.
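As an illustration (a sketch; the data vector below is simply the pH table above typed in), the histogram can be reproduced in R:

# the 30 pH measurements from the table above
ph <- c(9.11, 9.03, 9.00, 9.18, 8.91, 9.00,
        8.98, 9.05, 8.94, 9.03, 9.01, 9.05,
        8.85, 9.13, 9.21, 8.89, 8.99, 8.93,
        9.00, 8.95, 8.97, 9.06, 8.88, 8.93,
        9.01, 9.03, 8.86, 9.13, 9.12, 9.03)
# sqrt(30) is about 5.5, so use 5 equally spaced intervals with nice limits
hist(ph, breaks = seq(8.8, 9.3, by = 0.1))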
Sample distributions
Environmetrics I, 1.3.2013 / VMT 9
A histogram is an empirical representation of a so-called probability density
function, which in turn is the derivative of some (cumulative) probability
distribution function (we'll come to these concepts later). The empirical
distribution function is a staircase function where the sorted (in ascending
order) measurement values are on the x-axis, and the sequential number
divided by the total number is on the y-axis. If you want to see an example, type
in R

> x = rnorm(100)
> y = (1:100)/100
> plot(sort(x), y, type='b')
Or in Matlab

>> x = randn(100,1);
>> y = (1:100)/100;
>> plot(sort(x), y, '-o')
Scatter plots
Scatter plots are mainly used to look for correlations between variables, but
they can also be used to check for trends or statistical independence. In this
case, the measured values are plotted against the order of measurement (or
time; such plots are called time series plots). Let us suppose that the pH measurements
from our histogram example are ordered row-wise. Then the check-for-independence
scatter plot looks like
[Scatter plot of the 30 pH values (y-axis, from 8.8 to 9.25) against the order of measurement (x-axis, from 0 to 30).]
The figure does not show any clear patterns or trends supporting statistical
independence of the measurements. Note that the points should be connected
with lines only if the variable on the x-axis (here the order of measurements) is
ordered! In Excel a similar figure is produced by the Chart Wizard XY-Scatter
tool. Before using it, you have to rearrange the data into a single row or a
single column. In R you simply type plot(x).
[The same scatter plot produced in Excel (series "Series1", y-axis from 8.80 to 9.25, x-axis from 0 to 40).]
Box and whisker plots (box plots)
This is a common way of summarizing data classified by a categorical variable. It
is not easy to make a box and whisker plot in Excel, but it is very easy in R.
Example
This example is taken from Statistics for Environmental Engineers.
The data can be found on the course web-page.
The R-commands
> data = read.table('Ex3.6.data', header=TRUE)
> attach(data)
> plot(Location, TPH)
give the following graph:
[Box and whisker plots of TPH for the two locations, East and West; the y-axis runs from 0 to 250.]
The interpretation of a box and whisker plot can be found, e.g., at
http://en.wikipedia.org/wiki/Box_plot .
Exercise 3
a) Make a histogram and different scatter plots of data with at least two variables,
e.g. the weights and heights of different students.
b) Plot the empirical distribution function of the pH data.
c) Make all the different kinds of plots you can think of, using the data from the
previous example.
Basic statistics
A statistic is a number reflecting the behaviour of a sample of numbers. There
are two kinds of basic statistics: 1) statistics of central tendency and 2) statistics
of dispersion. The most common statistics of category 1 are the mean
value (average) and the median.
If the measurements are denoted by $x_1, x_2, \ldots, x_n$, the mean value is defined
by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (3)$$
Note that if the measurements are considered to be random variables, then the
mean value is also a random variable! The mean value of our previous pH
sample is approximately 9.01 (in Excel AVERAGE(range) and in R mean(x), where
range means a range of cells, e.g. A1:B10).
The median is defined to be the centremost number of the sample or, if the
size of the sample is an even number, the average of the two centremost
numbers. In our pH sample the two centremost numbers are 9.00 and 9.01,
and thus the median is 9.005 (in Excel MEDIAN(range) and in R median(x)).
There are several other averages (squared, geometric, harmonic etc.), but they
will be introduced when we need them (not all of them are needed in this course).
The median can be quite different from the mean value:
Exercise 4
Calculate the mean value and the median of the samples: 1, 2, 3, 4, 5 and 1, 2,
3, 4, 50.
For reasons obvious from the preceding exercise, the median is called a
robust statistic.
Exercise 5
Explain, in your own words, the concept of robustness in statistics. Try to figure
out a case where the use of robust statistics would be important.
The sample standard deviation of the sample is defined by

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} \qquad (4)$$

It is easy to show that also the following is true:

$$s = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)} \qquad (4b)$$

For the pH data the standard deviation is ca. 0.091 (check it yourself!).
The standard deviation is the root mean square distance of the values from
their mean value. More interpretations are given later on. In Excel the standard
deviation is given by STDEV(range) and in R by sd(x).
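For example, in R (a sketch, using the vector ph typed in for the histogram example above):

mean(ph)     # about 9.01
median(ph)   # 9.005
sd(ph)       # about 0.091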
Exercise 6
1) Calculate the standard deviation of the samples: 1, 2, 3, 4, 5 and 1, 2, 3, 4,
50.
2) Find a sample of real measurement data from a text book or from the internet,
calculate the mean, the median and the standard deviation, and draw a histogram of
that data. Try to give interpretations!
Random variables and their distributions
A random variable is a variable whose exact value cannot be told in advance.
Instead, we can assign probabilities to the possible values. Every measurement
is a random variable, because the exact value cannot be told until the measurement
is carried out. In addition, the next similar (repeated) measurement will
usually not yield the same value. To be able to make meaningful calculations
and conclusions, we need the concept of the distribution of a given random
variable. There are two different kinds of random variables: discrete and
continuous.
Discrete random variables
A discrete random variable can have only discrete values. Typical cases are
tossing a coin (heads or tails), rolling a die (1, 2, 3, 4, 5 or 6), the
number of defects in a product, the number of breaks in a process in a specified
time interval etc. The distribution of a discrete random variable is given by its
point density function (pdf). A pdf can be represented by a table or by a pdf-plot.
Let us consider rolling a die. We have obviously the following table connecting
the values and their probabilities:
value probability
1 1/6
2 1/6
3 1/6
4 1/6
5 1/6
6 1/6
The corresponding pdf-plot is
[pdf-plot: six points at the values 1 to 6, each with probability 1/6 ≈ 0.167.]
Exercise 7
1) Sketch the pdf of tossing a coin.
2) Sketch the pdf of the sum of the values when rolling two dice. Hint: to find out
the probabilities of the different possible sums, make a table of six
rows and six columns for the possible sums of two dice.
We shall get familiar with the two most common discrete probability
distributions: the binomial distribution and the Poisson distribution.
An experiment is called a binomial experiment, if the following conditions are
fulfilled:
1. The experiments are statistically independent, i.e. the result of one experiment
has no effect whatsoever on the results of the other experiments.
2. The outcome of each experiment can have only two values, one which is
called a success (S) and the other which is called a failure (F). Note that
in some applications, the success can mean, for example, a defect in a
product. Thus, the term success simply means something that we are
interested in.
3. The probability of a success is a constant value (p) in each experiment.
Now, if X is the number of successes (k) in n binomial experiments, the
probability that X gets the value k, i.e. P(X = k), the pdf of X, is given by the
following formula

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \qquad (5)$$
Example 1
A coin is tossed 5 times. a) What is the probability of getting exactly 3 heads?
b) What is the probability of getting at most 3 heads?
Solution: a) Tossing a coin is clearly a binomial experiment, and in this case n =
5, k = 3 and p = 0.5. Thus Eq. (5) gives us

$$P(X = 3) = \binom{5}{3} 0.5^3 (1 - 0.5)^2 = \frac{10}{32} = 0.3125$$

b) In b) you have to know the fact that if A and B are disjoint events, i.e. events
that cannot happen simultaneously, like getting 2 heads and getting 3 heads in the
same series of 5 tosses, the probability of either of the results is calculated
simply by adding the probabilities P(A) and P(B). (In mathematical notation this is
expressed by $P(A \cup B) = P(A) + P(B)$.) The probability asked for is

$$P(X \le 3) = P(X=0) + P(X=1) + P(X=2) + P(X=3) = \frac{1 + 5 + 10 + 10}{32} = 0.8125$$
The binomial coefficients can be calculated by the definition

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

or they can be obtained from Pascal's triangle.
In Excel the function for binomial probabilities is BINOMDIST(k;n;p;false/true);
the value false for the fourth argument gives the pdf-values and true corresponds
to the cumulative distribution function. A cumulative distribution function (cdf) is a
function F(x) which gives the probability $P(X \le x)$. In R the binomial probabilities
are given by dbinom(k,n,p) and the cumulative binomial probabilities by
pbinom(k,n,p).
In Excel the solution for a) is given by =BINOMDIST(3;5;0,5;false) and for b) by
=BINOMDIST(3;5;0,5;true) and in R the corresponding commands would be a)
dbinom(3,5,0.5) and b) pbinom(3,5,0.5).
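The computations of the example are easy to check in R:

dbinom(3, 5, 0.5)          # a) P(X = 3) = 10/32 = 0.3125
pbinom(3, 5, 0.5)          # b) P(X <= 3) = 26/32 = 0.8125
sum(dbinom(0:3, 5, 0.5))   # the same result by adding the point probabilities
choose(5, 3)               # the binomial coefficient C(5,3) = 10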
Exercise 8
1) What is the probability of getting at least 3 heads when tossing a coin 5
times?
2) What is the probability of getting at least three 6's when rolling a die 10 times?
3) On average, 1% of certain products are out of specification. What is the probability
that in a package of 20 products there are more than 2 products that are
out of specification?
In 1) you have to use the fact that P(A) = 1 − P(not A). This basic property
of probabilities will be used many times later on.
Poisson distribution
When n increases and p decreases in such a way that the product np approaches
a constant value $\lambda$, the limiting distribution is called the Poisson distribution.
Thus binomial probabilities with large n and small p can be approximated with the
Poisson distribution. However, the practical relevance of the Poisson distribution
relies on the fact that the Poisson distribution describes well the probabilities in
cases where something is sparsely distributed in a continuous medium. A
couple of examples will clarify what is meant by this:
- The number of microbes in a sample of a dilution is approximately Poisson distributed.
- The number of breaks per specified length in a thread coming from a spinning machine is approximately Poisson distributed.
- The number of flaws per specified area on a painted surface is approximately Poisson distributed.
- The number of process breaks per specified time period is approximately Poisson distributed.
The pdf of the Poisson distribution is given by

$$P(X = k) = \frac{\lambda^k}{k!}\,e^{-\lambda} \qquad (6)$$

In Excel: POISSON(k; λ; false/true); in R: dpois(k, λ) and ppois(k, λ).
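A minimal R sketch of the approximation idea; the values n = 100 and p = 0.02 are chosen here only for illustration:

n <- 100; p <- 0.02           # large n, small p
lambda <- n * p               # lambda = np = 2
k <- 0:6
round(dbinom(k, n, p), 4)     # exact binomial probabilities
round(dpois(k, lambda), 4)    # Poisson approximation, very close to the above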
Exercise 9
1) Calculate 3) of Exercise 8 using the Poisson approximation and compare the
results.
2) Consider a continuous process where the number of breaks per day is
Poisson(2) distributed. During one week 25 breaks were reported. Do you
consider this strong evidence for a claim that the mean break frequency of
the process has increased during that week? Base your explanation on
probabilities.
Continuous random variables
Ordinary physicochemical quantities (pressure, temperature, voltage, concentration
etc.) can in principle have any real-number values, at least within certain
limits. In practice, this is of course limited by the accuracy of the measurement
device. The infinite number of possible values (i.e. values whose probability is
positive) causes theoretical problems in trying to figure out a reasonable pdf.
This problem is overcome if we define probabilities only for intervals. Thus,
instead of the point values of a discrete pdf, we have a curve where the area under
the curve over a given interval is the probability of that interval. Such functions
are called probability density functions (pdf). So, the abbreviation pdf can mean
either a point density function or a probability density function.
Below is an example of a distribution of a continuous random variable that can
have values in the interval [1, 2]. By approximately counting the rectangles
under the curve, we get ca. 50 rectangles. The shaded area (the interval [1.3,
1.7]) contains ca. 28 rectangles. Thus the probability of the interval [1.3, 1.7] is
approximately 28/50 = 0.56. Remembering that areas under curves can be
calculated as integrals, we can get an accurate value: $P(1.3 \le X \le 1.7) = \int_{1.3}^{1.7} f(x)\,dx$,
where f is the function corresponding to the curve.
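Such areas can also be computed numerically, e.g. with R's integrate function. As a sketch (using the standard normal density as an example curve, not the parabola of the figure):

# numerical integration of a density over an interval
integrate(dnorm, -1, 1)   # P(-1 <= X <= 1) for N(0,1), about 0.683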
Exercise 10
a) Determine the defining expression for the function f based on the fact that it
is a parabola. b) Evaluate .
Naturally, the cumulative distribution function (cdf) is the integral up to a
given value x:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt \qquad (7)$$

Because, in the above example, the probabilities of all values under 1 are zero,
$F(x)$ would be $\int_{1}^{x} f(t)\,dt$.
Uniform distribution
A random variable is said to be uniformly distributed if all values in a given
interval are equally likely. This leads to the following pdf:

$$f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\[4pt] 0, & \text{otherwise} \end{cases} \qquad (8)$$
Exercise 11
The expected value EX (the theoretical mean value) of a continuous random
variable is defined by the integral

$$EX = \int_{-\infty}^{\infty} x f(x)\,dx \qquad (9)$$

and the (theoretical) variance of a continuous random variable is defined by the
integral

$$D^2X = \int_{-\infty}^{\infty} (x - EX)^2 f(x)\,dx \qquad (10)$$
Calculate the expected value and the variance of a random variable that is
uniformly distributed on the interval [a, b]. The square root of the theoretical
variance is called the theoretical standard deviation. It is important that you
understand the difference between theoretical statistics and sample statistics!
We shall discuss this important difference more later on.
Exponential distribution
In reliability theory, the lifetimes of different objects are studied. The simplest
lifetime distribution is the exponential distribution, which is well suited, for
example, to studying the lifetimes of light bulbs. Its pdf is

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0 \qquad (11)$$
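For instance, with a hypothetical rate λ = 0.5 failures per year (an assumed value, used only for illustration), the probability that a lifetime exceeds 3 years can be computed in R as:

lambda <- 0.5               # hypothetical rate parameter
1 - pexp(3, rate = lambda)  # P(X > 3), about 0.223
exp(-lambda * 3)            # the same value directly from the cdf formula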
Exercise 12
Instead of the lifetime of a car battery, we can consider its run-length (expressed
in km), i.e. how many kilometres have been run with the battery when it
finally stops functioning. Suppose that the run-length (X) is exponentially
distributed.
a) What is the probability that the battery will function for at least 100000 km?
b) Suppose that the battery has functioned for 100000 km. What is the probability
that it will last another 100000 km (i.e. a total of 200000 km)?
For b) you'll have to know how to calculate conditional probabilities. The
general formula is the following: let A and B be any events. Then the conditional
probability $P(A \mid B)$ (read: the probability of A given B) is calculated as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad (12)$$

where $A \cap B$ means A and B. The result of b) may be surprising!
Note that the above example shows a very strange property of the exponential
distribution: the random variable forgets the past, and the probabilities for an
old bulb whose lifetime so far is known can be calculated as if it were new. What is
really amazing is that light bulbs behave very closely to this.
The lifetime distributions of most other objects are more complicated, and the
most common of them is the Weibull distribution. You can easily find a lot of
information about common lifetime distributions on the internet.
Normal distribution
The most important continuous distribution is the normal, or Gaussian, distribution.
Its importance is based on a mathematical theorem, the central limit theorem
(CLT). The main idea behind the CLT is that the sum of any independent
random variables is approximately normally distributed. Of course, certain
conditions must be fulfilled, but we don't go into details here. The effect of the
CLT can be seen, for example, by rolling several dice and looking at the
distribution of the sum of the results. Already the distribution of the sum of five
dice looks very normal.
If we consider measurement errors, we notice that the final total error usually is
a sum of many small errors coming from all kinds of different sources. It is also
rare that these sub-errors should depend on each other. For these reasons,
most measurement errors are roughly normally distributed.
The probability density function of the normal distribution is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (13)$$

If a random variable X is normally distributed, this is denoted by $X \sim N(\mu, \sigma^2)$. It is
easy to show (by integration) that for a normally distributed X,
$EX = \mu$ and $D^2X = \sigma^2$.
Unfortunately, the normal pdf does not have an antiderivative that could be
expressed in terms of ordinary functions. Therefore one has to rely on numerical
integration or tables of the normal distribution. Below is a plot of the N(5, 1) distribution:
[Plot of the N(5, 1) density: a bell-shaped curve centred at 5 with maximum value about 0.4.]
Rules of thumb for normal probabilities
In many cases it is accurate enough to use the approximate normal probabilities
given by the following rules of thumb: if $X \sim N(\mu, \sigma^2)$, then
1. $P(\mu - \sigma \le X \le \mu + \sigma) \approx 0.68$
2. $P(\mu - 2\sigma \le X \le \mu + 2\sigma) \approx 0.95$
3. $P(\mu - 3\sigma \le X \le \mu + 3\sigma) \approx 0.997$
Note that practically all normally distributed measurements would be in the
interval $\mu \pm 3\sigma$. These rules of thumb are used in many tools of statistical quality
control and in the validation of laboratory analytical methods.
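The exact probabilities behind these rules of thumb are easy to compute in R:

pnorm(1) - pnorm(-1)   # about 0.6827
pnorm(2) - pnorm(-2)   # about 0.9545
pnorm(3) - pnorm(-3)   # about 0.9973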
Example
Let $X \sim N(10, 1^2)$. Now $\mu - 2\sigma$ is 8 and $\mu + 2\sigma$ is 12, and thus ca. 95% of the results
would on average be in the interval [8, 12]. We could also conclude that ca.
0.15% of the results would on average be greater than 13, because 0.3% would
be outside the interval [7, 13].
Such reasoning is much easier if you sketch by hand a plot of the
situation in question, similar to the one below (depicting the latter interval of the
previous example).
[Sketch of the N(10, 1) density with the interval [7, 13] marked: 99.7% of the probability mass lies between 7 and 13, and 0.15% in each tail.]
Exercise 13
Suppose that the body length of a male population is $N(180, 5^2)$ distributed.
Estimate the proportion of males whose length exceeds a) 185, b) 190 and c)
195.
If more accurate probabilities are needed, you have to use standard normal
tables or some statistical software. In Excel the pdf/cdf of the normal distribution
is NORMDIST(x; μ; σ; false/true). The option false gives the pdf
and the option true gives the cdf. In R the corresponding functions are
dnorm(x, μ, σ) and pnorm(x, μ, σ). The inverse of the cdf in Excel is NORMINV(p; μ; σ)
and in R qnorm(p, μ, σ), where p is a given probability. Note that the inverse cdf
gives the upper limit x of an interval $(-\infty, x]$ such that $P(X \le x) = p$. Again, a
sketch of the situation at hand helps in figuring out how to use these functions!
Examples
Let $X \sim N(10, 1^2)$. Then $P(X \le 12)$ is in Excel NORMDIST(12;10;1;true),
$P(X > 12)$ is in Excel 1-NORMDIST(12;10;1;true) and $P(8 < X \le 12)$ is in
Excel NORMDIST(12;10;1;true)-NORMDIST(8;10;1;true).
Figure out yourself the corresponding R formulae.
If we want to know the interval whose centre is 10 and whose probability
is 80%, we conclude (by symmetry) that the probability of getting a value below
the upper limit of this interval is 90%, and the probability of getting a value below the
lower limit is 10%. Thus the upper limit is given by
NORMINV(0.9;10;1) and the lower limit by NORMINV(0.1;10;1).
One should pay attention to the fact that these functions are always related to
upper limits of intervals starting from minus infinity.
Make sketches of the situations in the examples above!
Exercise 14
a) Plot pdfs of the normal distribution with different values of μ and σ. Find
out the roles of μ and σ in the shape and location of the curve!
b) A concentration measurement is N(0.35, 0.03²) distributed. What is,
on average, the proportion of measurements falling in the interval [0.30, 0.40]?
c) Consider the measurement in b). What is the interval whose centre is 0.35
that contains 90% of the measurements (on average)?
If computer programs are not available, one should use normal probability
tables. Normal probability tables are available only for the so-called standard
normal distribution N(0, 1). To be able to use them for any normal distribution,
one has to know the following:

$$\text{If } X \sim N(\mu, \sigma^2), \text{ then } Z = \frac{X - \mu}{\sigma} \sim N(0, 1) \qquad (14)$$

We need this formula in some other applications too.
Normal probability plots (Q-Q-plots)
There are many graphical ways to check the normality of a given sample of
measurement data. One way is to plot a histogram of the data and to see how
much it resembles the Gaussian curve.
For small samples, however, a better way exists: the normal probability plot.
The idea in normal probability plots is similar to that of using a logarithmic scale for
linearizing exponential behaviour; in this case a sample distribution function is
linearized. In Excel, producing normal probability plots is rather tedious. We
shall do it in the computer labs and you will get an example as an Excel
workbook. In R it is easy: one simply has to type qqnorm(x), where x is a
variable containing the data. As an example, 30 normally N(10, 2²) distributed
random numbers are generated and then a normal probability plot is drawn:
> x = rnorm(30,10,2)
> qqnorm(x)
> qqline(x)

These commands will give the plot below.
[Normal Q-Q plot of the 30 random numbers: Theoretical Quantiles on the x-axis, Sample Quantiles on the y-axis; the points fall roughly on the line drawn by qqline.]
The points are approximately on a straight line, as they should be.
Exercise 15
The table below contains 100 measured daily purities of oxygen delivered by a
certain supplier. The numbers are the two decimals after 99%, i.e. the purity
can be calculated by 99+x/100, where x is a number from the table. The data
are given in row-wise time order. a) Check by a suitable plot if there are any
trends in the data. b) Make a normal probability plot of the data. Explain how
the observed distribution deviates from the normal one. c) Make a histogram.
63 61 67 58 55 50 55 56 52 64
73 57 63 81 64 54 57 59 60 68
58 57 67 56 66 60 49 79 60 62
60 49 62 56 69 75 52 56 61 58
66 67 56 55 66 55 69 60 69 70
65 56 73 65 68 59 62 58 62 66
57 60 66 54 64 62 64 64 50 50
72 85 68 58 68 80 60 60 53 49
55 80 64 59 53 73 55 54 60 60
58 50 53 48 78 72 51 60 49 67
Note that you can copy the data in the pdf-file into Excel or read it into R.
Expectation and theoretical standard deviation
In Exercise 11 we defined the expected value EX (the theoretical mean value) of a
continuous random variable by the integral

$$EX = \int_{-\infty}^{\infty} x f(x)\,dx \qquad (9)$$

and the (theoretical) variance of a continuous random variable by the
integral

$$D^2X = \int_{-\infty}^{\infty} (x - EX)^2 f(x)\,dx \qquad (10)$$

The theoretical standard deviation is defined as the square root of the variance.
For discrete random variables, the integrals are replaced by sums and the
density function is replaced by point densities. Remember that for a normal
distribution $EX = \mu$ and $D^2X = \sigma^2$.
Exercise 16
Write down the definitions of EX and $D^2X$ for discrete random variables.
Estimation and estimators
It is important to make a clear distinction between theoretical and sample
quantities. Actually, the corresponding sample quantities are so-called estimates
or estimators of the theoretical ones. For example, the sample mean $\bar{X}$
is an estimator of $\mu$ if the data is normally distributed. The value of an estimator
is called an estimate. Of course, a good estimator is one that gives on
average the correct value for the theoretical quantity. Another desirable property is that
the variance of an estimator should be small. The first property is called
unbiasedness and the second one the minimum variance property. There are
different principles for constructing good estimators, the most common ones
being the method of maximum likelihood and the method of least squares,
which, under certain assumptions, are closely related. We are not going into the
details of the theory of statistical estimation in this course. However, we shall
introduce several estimators later, and it is important to understand what is
meant by an estimator or by an estimate.
Propagation of errors (properties of expectation and variance)
A linear combination of random variables is an expression of the form

$$Y = a_1X_1 + a_2X_2 + \cdots + a_pX_p$$

The expectation and variance have the following properties with respect to
linear combinations:

$$E\left(\sum_{i=1}^{p} a_iX_i\right) = \sum_{i=1}^{p} a_i\,EX_i \qquad (15)$$

$$D^2\left(\sum_{i=1}^{p} a_iX_i\right) = \sum_{i=1}^{p} a_i^2\,D^2X_i \qquad (16)$$

Important! Formula 16 holds only if the random variables are independent
of each other!
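A small simulation sketch (with arbitrary illustrative values) showing formula 16 in action for independent normal random variables:

set.seed(1)                            # for reproducibility
x1 <- rnorm(100000, mean = 0, sd = 2)  # D^2(X1) = 4
x2 <- rnorm(100000, mean = 0, sd = 3)  # D^2(X2) = 9
var(3*x1 + 2*x2)                       # close to 3^2*4 + 2^2*9 = 72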
Exercise 17
a) Write down (below!) a formula for the standard deviation of a linear combination,
i.e. for $D\left(\sum a_iX_i\right)$.
b) The sample mean (average) is a special case of a linear combination. Suppose
that p independent random variables $X_i$ have the same expected value $\mu$
and the same variance $\sigma^2$. Write down the formulas for the expected value, the
variance and the standard deviation of the mean of such random variables.
c) A difference of two random variables is a special case of a linear combination. Suppose
that two independent random variables $X_1$ and $X_2$ have the same expected value $\mu$
and the same variance $\sigma^2$. Write down the formulas for the expected value, the
variance and the standard deviation of the difference of two such random
variables.
Any constant can be considered a special random variable whose expected
value is the value of the constant itself and whose standard deviation is zero.
According to this, we get e.g.

$$E(aX + b) = a\,EX + b \qquad (17)$$

and

$$D^2(aX + b) = a^2\,D^2X \qquad (18)$$

If the random variables $X_i$, $i = 1, 2, \ldots, p$, are independent, then the expected
value of their product is the product of the expected values, i.e. the formula

$$E(X_1X_2\cdots X_p) = EX_1 \cdot EX_2 \cdots EX_p \qquad (19)$$

holds.
A special case
If the random variables $X_i$, $i = 1, 2, \ldots, p$, are independent and also normally
distributed, then linear combinations of them are also normally distributed, with
expected values and standard deviations given by the above formulae.
Propagation of errors (estimation of measurement uncertainty)
The formulae 15-19 are not of much use if we consider nonlinear functions of
independent random variables, i.e. expressions like $Y = f(X_1, X_2, \ldots, X_p)$. Fortunately,
for ordinary measurement quantities the standard errors are quite small
with respect to the expected values. Therefore, the variation of Y around the
expected values $\mu_1, \mu_2, \ldots, \mu_p$ can be approximated by a first-order Taylor
series expansion:

$$Y \approx f(\mu_1, \ldots, \mu_p) + \sum_{i=1}^{p} \frac{\partial f}{\partial x_i}\,(X_i - \mu_i)$$

Now, if we replace the variables by the corresponding random variables, we notice that
this is a linear combination plus a constant and, consequently, we can use
formulae 15-19:

$$EY \approx f(\mu_1, \mu_2, \ldots, \mu_p) \qquad (20)$$

$$D^2Y \approx \sum_{i=1}^{p} \left(\frac{\partial f}{\partial x_i}\right)^2 D^2X_i \qquad (21)$$

Note that all the derivatives in Eqs. 20 and 21 are evaluated at the expected values! Note
also that in practice the expected values are replaced by mean values and the
(theoretical) standard deviations by sample standard deviations.
It should also be noted that Eq. 21 is valid only if the random variables are
independent. If they are correlated, the formula is more complicated and beyond
the scope of this introductory course. However, it is easy to learn for anyone
with a good enough knowledge of matrix algebra.
A special case
If the random variables $X_i$, $i = 1, 2, \ldots, p$, are independent and also normally
distributed, and the standard deviations are small enough, then the random variable
Y is approximately normally distributed, with expected value and standard
deviation given by formulae 20 and 21.
Example
Let U be a voltage measurement with expected value 9 V and standard deviation
0.2 V, and let R be a measured resistance with expected value 3 Ω and standard
deviation 0.1 Ω. What is the expected value and the standard deviation of the
current I = U/R? Using Eqs. 20 and 21, we get

$$EI \approx \frac{9}{3} = 3 \ \text{(A)}$$

and

$$\left(\frac{DI}{EI}\right)^2 \approx \left(\frac{DU}{EU}\right)^2 + \left(\frac{DR}{ER}\right)^2 = \left(\frac{0.2}{9}\right)^2 + \left(\frac{0.1}{3}\right)^2$$

and thus the relative standard deviation of I is ca. 0.041, and DI (the standard deviation of I) is $3 \cdot 0.041 \approx 0.12$ A.
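The linearized result can be checked by simulation in R (a sketch, assuming independent normally distributed measurements):

set.seed(1)
U <- rnorm(100000, mean = 9, sd = 0.2)  # simulated voltage measurements
R <- rnorm(100000, mean = 3, sd = 0.1)  # simulated resistance measurements
I <- U / R
mean(I)   # close to 3 A
sd(I)     # close to the linearized value 0.12 A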
Exercise 18
a) The measured concentration of a sample is 12.5 ± 0.5 mmol/l and the
volume of the sample is 0.200 ± 0.003 l. The ±-values are standard
measurement uncertainties. Calculate the number of moles in the sample
and estimate its standard measurement uncertainty.
b) Consider that you have made replicate measurements of the voltage and
the current of a device. Which do you think is the better way to estimate the
power ($P = UI^2$): i) first calculate the means of U and I and calculate P
using the means, or ii) calculate as many P's as you have replicates, and
then calculate the mean of these P's?
Confidence intervals
A confidence interval is an interval that contains the true value of an unknown
parameter with a given (predefined) probability. It is very important to understand
that the upper and lower limits of a confidence interval are random variables
and thus vary from sample to sample! It is easy to construct a confidence
interval for the unknown expected value μ of a normal distribution based on a
sample of n replicate measurements with a known standard deviation σ, i.e.
$X_i \sim N(\mu, \sigma^2)$ for all i. According to the result of Exercise 17b, the mean $\bar{X}$ of the
random variables is normally distributed with standard deviation $\sigma/\sqrt{n}$. Therefore
(supply the missing details yourself!)

$$P\left(|\bar{X} - \mu| \le x\right) = P\left(\frac{|\bar{X} - \mu|}{\sigma/\sqrt{n}} \le \frac{x}{\sigma/\sqrt{n}}\right)$$

Now, because the right-hand side is a probability of the standard normal distribution,
for a predefined probability, say $1 - \alpha$, we know that (make a plot to see this!)

$$\frac{x}{\sigma/\sqrt{n}} = z_{1-\alpha/2}$$

where $z_{1-\alpha/2}$ is the inverse value of the cumulative standard normal distribution at
the probability $1 - \alpha/2$. Finally, solving for x, we get $x = z_{1-\alpha/2}\,\sigma/\sqrt{n}$ and consequently
the limits of the confidence interval are

$$\bar{X} \pm z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \qquad (22)$$

The probability $1 - \alpha$ is called the confidence level. Typical values of α are
0.05, 0.01 and 0.001, the corresponding confidence levels being 0.95, 0.99 and
0.999.
As an example, let us consider the previous example of estimating the standard
deviation of a current. Let us also assume that the values 9 V and 3 Ω are mean
values of 5 measurements. Now, what would be the 95% confidence interval for
the current I?
Using e.g. Excel we get that $z_{0.975}$ is ca. 1.96 (the rules of thumb would
give the value 2!). Therefore the confidence interval would be

$$3 \pm 1.96 \cdot \frac{0.12}{\sqrt{5}} \approx [2.9,\ 3.1]$$

Thus the true current should, with probability 95%, be in the interval [2.9 A, 3.1 A].
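In R the same interval can be computed directly from Eq. 22 (a sketch using the numbers of this example):

xbar <- 3; sigma.I <- 0.12; n <- 5; alpha <- 0.05
z <- qnorm(1 - alpha/2)                   # about 1.96
xbar + c(-1, 1) * z * sigma.I / sqrt(n)   # approximately [2.9, 3.1]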
Exercise 19
Suppose that in Exercise 18 ten replicate measurements of the concentration
and the volume had been made, that the mean values were the same as
the nominal values, and that the ±-values were known standard deviations.
What would then be the 99% confidence interval for the number of
moles?
In practice, the standard deviation σ is seldom known. Instead, we have to use
a value that has been estimated from replicate measurements, i.e. the sample
standard deviation (S) discussed in the beginning of this course. This causes
increased uncertainty in the confidence interval, which has to be taken into account.
Without going into the proof, we just state that in this case Eq. 22 has to
be replaced by the following one:

$$\bar{X} \pm t_{1-\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}} \qquad (23)$$

where $t_{1-\alpha/2,\,n-1}$ is obtained from the Student's t-distribution. The Student's t-distribution
has a parameter that is called the degrees of freedom (df). In general, the
degrees of freedom is the number of observations needed for an estimate (S in
this case) minus the number of mathematical restrictions implied in the estimate
(1 in this case: the sum of the deviations from the mean in the formula for S has to
be zero!). The cdf functions for the Student's t-distribution in Excel are TDIST and TINV,
but they function differently. In R the corresponding functions are pt(x,df) and
qt(p,df).
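As a sketch, a t-based 95% confidence interval for the expected pH, computed in R from the pH data (vector ph) typed in earlier:

n <- length(ph); alpha <- 0.05
mean(ph) + c(-1, 1) * qt(1 - alpha/2, n - 1) * sd(ph) / sqrt(n)
t.test(ph)$conf.int   # the same interval directly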
Exercise 20
a) Find out, using the Excel and R help facilities, how the R functions pt and
qt differ from the corresponding Excel functions TDIST and TINV.
b) Plot the pdf of the Student's t-distribution in the same plot with several
degrees of freedom.
c) Find out the functional form of the pdf of the Student's t-distribution from the
literature or from the internet.
The confidence interval for the variance $\sigma^2$ is calculated by

$$\left[\frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}},\ \frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}}\right] \qquad (24)$$

where the $\chi^2$ quantiles are obtained from the $\chi^2$-distribution with n−1 degrees of
freedom. The cdf functions for the $\chi^2$-distribution in Excel are CHIDIST and CHIINV, but
again they function differently. In R the corresponding functions are pchisq(x,df)
and qchisq(p,df).
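A sketch of Eq. 24 in R, again for the pH data:

n <- length(ph); alpha <- 0.05; S2 <- var(ph)
c((n - 1) * S2 / qchisq(1 - alpha/2, n - 1),
  (n - 1) * S2 / qchisq(alpha/2, n - 1))   # confidence interval for sigma^2
# the square roots of the limits give an interval for the standard deviation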
Exercise 21
a) Find out how the R functions pchisq and qchisq differ from the corresponding
Excel functions CHIDIST and CHIINV.
b) Plot the pdf of the $\chi^2$-distribution in the same plot with several degrees of
freedom.
c) Find out the functional form of the pdf of the $\chi^2$-distribution from the literature
or from the internet.
Exercise 22
The article "Multi-functional Pneumatic Gripper Operating under Constant Input
Actuation Air Pressure" by J. Przybyl (Journal of Engineering Technology,
1988) discusses the performance of a 6-digit pneumatic robotic gripper. One
part of the article concerns the gripping pressure (measured by manometers)
delivered to objects of different shapes for fixed input air pressures. The data
given here are the measurements (in psi) reported for an actuation pressure of
40 psi for (respectively) a 1.7 in. × 1.5 in. × 3.5 in. rectangular bar and a
circular bar of radius 1.0 in. and length 3.5 in.
a) Make 98% confidence intervals for the expected gripping pressure for
both objects.
b) Compare the confidence intervals and make proper conclusions.
c) Calculate the confidence intervals for the standard deviation of the
gripping pressure for both objects.
d) Compare the confidence intervals and make proper conclusions.
Rectangular bar: 76, 82, 85, 88, 82
Circular bar: 84, 87, 94, 80, 92
Prediction intervals
A prediction interval is an interval that contains a future result with a given
probability. The idea is that a future observation $X_{\text{new}}$ is predicted by the mean
value $\bar{X}$, and $X_{\text{new}} = \mu + E$, where E is the error having the same distribution as
the original sample. Thus the standard error of E is σ, which is estimated by S.
Using the rules of propagation of error, the estimated standard error of $X_{\text{new}} - \bar{X}$ is
$S\sqrt{1 + 1/n}$, and thus the prediction interval is

$$\bar{X} \pm t_{1-\alpha/2,\,n-1}\,S\sqrt{1 + \frac{1}{n}} \qquad (25)$$

If the standard deviation σ is known, S can be replaced by the known σ and the
t-value by the corresponding z-value.
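A sketch of Eq. 25 in R, again using the pH data as an example:

# 95% prediction interval for a single future pH measurement
n <- length(ph); alpha <- 0.05
mean(ph) + c(-1, 1) * qt(1 - alpha/2, n - 1) * sd(ph) * sqrt(1 + 1/n)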
Exercise 23
Calculate 95% prediction intervals for the gripping pressure of both objects of
the previous exercise.
Statistical tolerance intervals
Statistical tolerance intervals are used in statistical quality control (SPC). Their
construction is beyond the scope of this course, but it is good to know what they
are. A statistical tolerance interval is an interval that, with a given probability,
covers at least p% of the population (usually the results of a measurement).
You can find more information about the topic (and other topics within this course
as well) at the link below:
http://www.itl.nist.gov/div898/handbook/prc/section2/prc263.htm .
The homepage of this book is
http://www.itl.nist.gov/div898/handbook/ .
Comparing effects
A very common problem in production or in industrial R&D is the question of
deciding whether there is a real difference between two or more sets of results
or not; to be more precise, whether there is a statistically significant difference
or not. Namely, any real results will differ from each other, and the question is
whether the differences can be considered systematic or whether they could just come
from the inevitable random variation. Statistical tests give an objective tool for making
rational decisions in such cases.
Statistical tests
The basic idea behind the logic of statistical tests can be described by the
following extreme case. Suppose we have a hypothesis that in certain measurements
the results are between 9 and 11, that results outside those limits are
absolutely impossible, and that there is nothing wrong with our instrument. Now we make a
new measurement and get 12.5. What is the logical conclusion? The result contradicts
the assumptions, and consequently the assumptions cannot be true.
In real cases we seldom get results that are absolutely impossible with regard to
the assumptions but, of course, we can get extremely improbable results. In
such cases it is more rational to suspect the assumptions than to accept that
we were extremely lucky (in the sense of getting highly improbable results). Of
course, we have to set a limit for what is improbable enough. This limit is the
level of significance (α), and the most common values for α are 0.05, 0.01 and
0.001. Now we shall formalize what was said above, and we shall also get
another view of the level of significance and how to choose it.
Any statistical test consists of the following:
The hypotheses
$H_0$ (the null hypothesis) and $H_1$ (the alternative hypothesis or the research
hypothesis).
$H_1$ is something we want to prove, and $H_0$ is something which is rejected if
we accept $H_1$. There is an analogy to trials: $H_1$ is "guilty" and $H_0$ is "not
guilty". In most cases $H_0$ states that there is no difference between the
things that are compared. In our analogy, the presumption is that the
suspect is not guilty.
The statistical assumptions of the test
In order to be able to calculate probabilities, we have to assume something
about the measurements. The most common assumptions are that
the measurement errors are approximately normally distributed with a
constant variance and statistically independent.
The two types of decision errors
In making conclusions by a statistical test we have two possibilities to make
an error (and two possibilities to be correct):
An error of type I means rejecting $H_0$ when it actually is true, and
an error of type II means not rejecting $H_0$ when it actually is false.
The probability of a type I error is called a p-value, and a predefined upper
limit for it is the level of significance α. The p-value can be interpreted as
the improbability of the results, and consequently a p-value below α
means that the result is improbable enough ("impossible enough") for rejecting
$H_0$.
The probability of a type II error is denoted by β. Unfortunately, α and β
are not independent of each other, and β depends on α and, in addition,
also on the true difference between the things to be compared (often
denoted by Δ). The quantity 1−β is called the power of a test, and it is the
probability of rejecting $H_0$ when it actually is not true. Naturally, we would
like to have as powerful a test as possible. Indeed, the most common tests,
described below, are the most powerful ones under the common assumptions
of normality and independence.
The test statistic
The test statistic is a quantity, calculated from a sample of measurement
results, whose distribution is known if $H_0$ is true. For this reason, the null
hypothesis must always include an equality, and the test statistic
probabilities are calculated assuming the equality; thus the parameters
of the distribution are known. If the null hypothesis also includes an
inequality, i.e. it is of the type ≤ or ≥, the logic is that we need to test only
the case that is closest to the alternative hypothesis, i.e. the case of
equality.
In order to be able to calculate the power (or the probability of a type II
error), one has to know the distribution of the test statistic when $H_1$ is true.
However, if $H_1$ is true, we also have to know the true difference. Therefore,
the power can only be calculated for assumed differences.
Making the conclusion (decision)
After calculating the value of the test statistic, we can calculate the probability
of getting such a value, or an even more extreme value. This probability
is called the p-value of the test.
The logic is that if we would reject $H_0$ at the given value of the test statistic, we
would, of course, reject it also at more improbable values. If the p-value
is below α, the null hypothesis is rejected, and otherwise not.
Note that all statistical tests have the structure described above. Therefore, if
you learn how to conduct any single test, you actually know how to
conduct all statistical tests. The only new things in a new test are the null
hypothesis, how to calculate the test statistic, and its distribution under $H_0$.
The most common simple statistical tests
The two-sample t-test
This test is suitable whenever the expected values of two samples (A and B)
are compared with each other. To guarantee the independence assumption,
the experiments should be made in random order!
: (27)
: (two-sided alternative) or (28a)
: (one-sided alternative) or (28b)
: (one-sided alternative) (28c)
Important: The form of the alternative hypothesis, i.e. whether it is one- or two-
sided, must be decided before any experiments are made. Also the level of
significance should be decided beforehand. If a test that was decided to be
two-sided would not reject, but the corresponding one-sided test would, the
logical way of acting is to make new experiments to confirm the suspicion of a
difference. Also, it should be noted that in one-sided tests the actual null
hypothesis is μ_A ≤ μ_B (or μ_A ≥ μ_B). However, the p-value is always calculated
under the null hypothesis μ_A = μ_B. The logic is that if H0 would be
rejected with the alternative μ_A > μ_B, then naturally it would be rejected under
any smaller value of μ_A as well, because the discrepancy with the null hypothesis
would be greater.
The statistical assumptions of this test are the ordinary ones: the measurement
errors are approximately normally distributed with a constant variance and
statistically independent.
The test statistic (t) is calculated by the following formula:
t = (x̄_A − x̄_B) / (s_p·√(1/n_A + 1/n_B)), (29)
where s_p is the so-called pooled standard deviation, and it is calculated by:
s_p = √[((n_A − 1)s_A² + (n_B − 1)s_B²) / (n_A + n_B − 2)]. (30)
A random variable T corresponding to the test statistic is Student's t-distributed
with n_A + n_B − 2 degrees of freedom, denoted by t(n_A + n_B − 2). The p-value can
be calculated as the probability P(|T| ≥ |t|) for the two-sided hypothe-
sis (28a), or P(T ≥ t) or P(T ≤ t) for the one-sided hypotheses (28b and 28c).
In Excel you can calculate p-values using the function TDIST(t;df;tails).
Give tails the value 1 for a one-sided test and the value 2 for a two-sided test. Note
that TDIST calculates probabilities for T or |T| being greater than t, unlike e.g.
NORMDIST. In R you can use pt(t,df), which gives the lower-tail probability
P(T ≤ t). The inverse function of TDIST is
TINV(p;df) and the inverse function of pt is qt(p;df).
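For illustration, here is a minimal R sketch of these p-value calculations; the
values of t and df are hypothetical:
t  <- 2.1                        # hypothetical value of the test statistic
df <- 18                         # hypothetical degrees of freedom
2 * pt(-abs(t), df)              # two-sided p-value, P(|T| >= |t|)
pt(t, df, lower.tail = FALSE)    # one-sided p-value, P(T >= t)
pt(t, df)                        # one-sided p-value, P(T <= t)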
If we assume a true difference δ, the test statistic obeys the non-central
t-distribution with the non-centrality parameter
ncp = δ / (σ·√(1/n_A + 1/n_B)). (31)
In practice, σ has to be replaced by its estimate s_p. If you have a tool to calcu-
late probabilities of the non-central t-distribution, it is quite easy to calculate the
power for a given α and δ by calculating the probability
P(T' > t_{α/2}) + P(T' < −t_{α/2}), (32)
where T' obeys the non-central t-distribution with the non-centrality parameter
given by Eq. 31. For a one-sided test, use only the other inequality and t_α
instead of t_{α/2}! Actually, in most cases you could calculate only the probabil-
ity P(T' > t_{α/2}), because the other probability is
practically zero. Excel does not have a function for exact power calculations, and
approximate methods have to be used (described below). In R there is a nice func-
tion for carrying out power calculations of ordinary t-tests. The function is
power.t.test(n = N, delta = δ, sd = σ,
sig.level = α, power = 1 − β, type = type, alt = alternative),
where type is either "two.sample", "one.sample" or "paired" (the meaning of
these terms will become apparent later) and alternative is either "two.sided" or
"one.sided", e.g. power.t.test(power = .90, delta = 1, alt = "one.sided"). The default
values for sig.level, type and alt are 0,05; "two.sample" and "two.sided". The
function is used in such a way that any one of the first 5 arguments can be
omitted and the function will calculate the omitted one; for example
power.t.test(power = .90, sd = 1, delta = 1) would calculate n for a two-
sided two-sample t-test at the significance level 0,05, giving the following output:
     Two-sample t test power calculation
              n = 22.02110
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.9
    alternative = two.sided
NOTE: n is number in *each* group
The result means that we should make at least 22 experiments with both meth-
ods to be compared in order to have 90% power at significance level 0,05 if the
true difference between the methods would be 1.
In most cases it is easier to use an approximate formula which assumes that
n_A = n_B = n:
n ≈ 2(z_{1−α/2} + z_{1−β})² / (δ/σ)². (33)
To use Eq. 33 to calculate the power 1 − β, one has to solve it for z_{1−β}. Usually
Eq. 33 is used to calculate the number of experiments needed, i.e. the sample size,
in order to detect a difference of size δ with probability 1 − β. It is good to know
that the approximation 33 underestimates the sample size and, therefore, it is
reasonable to round the sample size upwards substantially.
Note that z_{1−α/2} must be replaced by z_{1−α} for one-sided tests.
Exercise 24
Calculate the number of experiments needed using Eq. 33 in the example of using
the R function power.t.test, and compare this to the value given by R (ca. 22 per
method).
Note also that the power can be calculated by direct simulation with random
numbers, i.e. by simulating samples of given sizes, performing the t-tests for
these samples and calculating the proportion of type II errors.
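A minimal R sketch of such a simulation (with the same assumptions as in the
power.t.test example above: 22 observations per group, a true difference of 1
and a standard deviation of 1):
set.seed(1)                       # for reproducibility
n     <- 22                       # sample size per group
delta <- 1                        # assumed true difference
nsim  <- 10000                    # number of simulated experiments
reject <- replicate(nsim, {
  A <- rnorm(n, mean = 0,     sd = 1)
  B <- rnorm(n, mean = delta, sd = 1)
  t.test(A, B, var.equal = TRUE)$p.value < 0.05
})
mean(reject)                      # estimated power, ca. 0,9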
Example
Suppose that you are producing something by an enzymatic reaction using an
enzyme produced by supplier A. Another supplier B claims that their more
expensive enzyme is more effective giving better yields. Before changing the
enzyme you decide to make experiments with both enzymes (it is important that
you conduct the experiments in a random order with respect to enzyme type!).
Now, suppose that you have got the following results put in Excel
We obviously have a two-sample test, and we will use a one-sided alternative
because of supplier B's claim. We could of course calculate the mean
values and standard deviations and then put these into the formulas above.
However, it is much easier to use the tools provided in Excel. Just click Tools
and then click Data Analysis... and you will get a list of alternatives;
choose t-Test: Two-Sample Assuming Equal Variances. Now you get the following window
Variable ranges mean the cell ranges of the two samples. The hypothesized
mean difference means the difference in expected values according to the null
hypothesis (usually 0, which is also the default value). If you include titles in the
ranges for the samples, you have to mark the box Labels. You can also give
the desired level of significance in the box Alpha (0,05 is the default value). In
the Output Range you can give a reference to a cell under which the results of
the test will be written. After filling in, the window will look like
The output will look like
We notice that the variances are quite different, and yet we chose a test assum-
ing equal variances. Was that correct? It was, if the difference in the variances
is not significant; we shall learn later how to check this. We also note that the p-
value is not less than 0,05, and therefore the null hypothesis is not rejected and
one should not believe supplier B's claim at this level of significance.
However, the p-value is quite near 0,05, and maybe we just did not conduct
enough experiments, taking the large random variability into account. The differ-
ence in means is slightly over 2 (percentage points). Now we can check approxi-
mately how many experiments would be needed assuming that the true differ-
ence is 2. For the calculation we need an estimate for the standard deviation.
The best estimate in this case is given by the square root of the so-called pool-
ed variance given in the Excel output, which is ca. 2,84. Let us also have α =
0,05 and 1 − β = 0,95. Now, z_{1−α} (one-sided test!) is given by NORMINV(1-
0,05;0;1) or NORMSINV(1-0,05) and it is ca. 1,645; z_{1−β} happens to be the
same in this example. (In R, qnorm(1-0.05).) Substituting these into the equa-
tion gives us the Excel formula =2*(1,645+1,645)^2/(2/2,84)^2, giving over
43,65, i.e. at least 44 experiments per supplier. The R function power.t.test
would give 44,34, i.e. at least 45 experiments per supplier. This example shows
how difficult it is to detect, with high reliability, true differences that are small
with respect to the standard error. It is possible only by conducting a large
number of experiments.
In R, the test could have been carried out easily using the t.test function. The
previous example could have been solved by
> A = c(851, 831, 835, 883, 863, 845, 824, 861, 814, 878) / 100
> B = c(843, 834, 911, 877, 844, 891, 851, 931, 878, 843) / 100
> t.test(A, B)
which gives the following output:
        Welch Two Sample t-test
data:  A and B
t = -1.7192, df = 16.065, p-value = 0.1048
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.48671686  0.05071686
sample estimates:
mean of x mean of y
    8.485     8.703
Note the small differences from the Excel output. The reason is that R actually
carries out the test without assuming the variances to be equal (note that variances
are usually considered equal if their difference is not statistically significant).
This so-called Welch t-test is available also in Excel; vice versa, if you add
the argument var.equal = TRUE in R, you will get the results of the classical t-test,
i.e. the same results as in Excel. The results of these two variants of the test are
very close to each other, unless the variances differ substantially.
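For example, the classical pooled-variance version of the test above, with the
one-sided alternative corresponding to supplier B's claim (μ_A < μ_B), would be:
> t.test(A, B, var.equal = TRUE, alternative = "less")
This gives a one-sided p-value just above 0,05, in agreement with the Excel
output discussed above.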
The one-sample t-test
This test is suitable whenever the expected value of a single sample is com-
pared to a nominal (or reference, or standard) value (μ0). To guarantee the
independence assumption, the experiments should be made in random order!
H0: μ = μ0 (34)
H1: μ ≠ μ0 (two-sided alternative) or (35a)
H1: μ > μ0 (one-sided alternative) or (35b)
H1: μ < μ0 (one-sided alternative) (35c)
Again, it should be noted that in one-sided tests the null hypothesis should be
changed into μ ≤ μ0 in 35b, and into μ ≥ μ0 in 35c. However, as already
pointed out, the test statistic probabilities (p-values) are calculated assuming
the equality (34).
The statistical assumptions of this test are the same as in the two-sample test.
The test statistic (t) is calculated by the following formula:
t = (x̄ − μ0) / (s/√n). (36)
A random variable T corresponding to the test statistic is Student's t-distributed
with n − 1 degrees of freedom, denoted by t(n − 1). The p-value is calculated
exactly as in the two-sample test, and the decision logic is similar as well. The
changes in the power calculations are quite obvious (the non-centrality parameter
becomes δ/(σ/√n) and the degrees of freedom n − 1), and in the approximation
Eq. 33 the multiplier 2 is simply omitted.
This is due to the fact that the standard deviation of a difference of two random
variables is √2 times larger than the standard deviation of a single random
variable and, in the present case, the other part of the difference is a constant.
In R, you simply choose type = "one.sample".
There is no tool for this test in Excel, so you either have to do the calculations
yourself or use the R function t.test.
two-sample tests. The next example shows how to use it in the one-sample
test.
Example
A nutrient producer produces a nutrient whose nitrogen concentration is
promised to be 30%. A client analyses 5 samples and gets the results
31,04 31,03 30,87 29,97 30,7 (%)
Should the client make a reclamation that the nitrogen concentration is not
30%?
Now the null hypothesis is that the nitrogen concentration is 30% (μ0 = 30) and the
alternative hypothesis is that it is not 30% (μ ≠ 30). The calculations in Excel
are shown below.
The p-value tells us that the null hypothesis has to be rejected at the significance
level α = 0,05. On the other hand, the mean value tells us that the true nitrogen
concentration is higher than 30% and, consequently, normally there should not
be a need for reclamation, assuming that a higher concentration is better.
In R, the test would be performed as follows:
> x = c(31.04, 31.03, 30.87, 29.97, 30.7)
> t.test(x, mu = 30)
        One Sample t-test
data:  x
t = 3.6469, df = 4, p-value = 0.02183
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
 30.17233 31.27167
sample estimates:
mean of x
   30.722
Note that R also calculates the confidence interval for the concentration, which
also shows that, with 95% confidence, the true concentration is over
30%. Of course, we could have calculated the confidence interval in Excel as
well. Actually, correctly interpreted, confidence intervals carry the same
information as statistical tests.
The two-sample F-test
This test is suitable whenever the variances of two samples (A and B) are
compared with each other. A very typical case would be one where two
analytical methods are compared in order to find out whether there is a differ-
ence in the measurement uncertainties of these two methods. Remember that
measurement uncertainties are measured by standard deviations and that
variances are squares of standard deviations.
The statistical assumptions are the same as in t-tests.
The hypotheses are:
H0: σ_A² = σ_B² (37)
H1: σ_A² ≠ σ_B² (two-sided alternative) or (38a)
H1: σ_A² > σ_B² (one-sided alternative) or (38b)
H1: σ_A² < σ_B² (one-sided alternative) (38c)
The test statistic (f) is calculated by the following formula:
f = s_A² / s_B². (39)
Note that the remark given after formulae 34-35c holds again, and also for all
other statistical tests.
A random variable F corresponding to the test statistic is Fisher's F-distributed
with n_A − 1 and n_B − 1 degrees of freedom, denoted by F(n_A − 1, n_B − 1). Be-
cause of the asymmetry of the F-distribution, it is not quite obvious how to calc-
ulate two-sided p-values in an F-test. The easiest, and the most commonly used,
way is to calculate the probability P(F > f), if f is greater than one, and the
probability P(F < f), if f is smaller than one, and then multiply the probabil-
ity obtained by two, as sketched in the code below.
In Excel you can calculate p-values using the function FDIST(f;n_A−1;n_B−1).
Note that FDIST calculates probabilities for F being greater than f, unlike e.g.
NORMDIST or the corresponding R functions. In R, the function is
pf(f,n_A−1,n_B−1). The inverse function of FDIST is FINV(p;df1;df2) and the
inverse function of pf is qf(p,df1,df2).
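A minimal R sketch of this calculation, with hypothetical summary statistics
(with the raw data vectors available, R's ready-made function var.test does
the same test directly):
sA <- 0.25; nA <- 8    # hypothetical standard deviation and size of sample A
sB <- 0.15; nB <- 8    # hypothetical standard deviation and size of sample B
f <- sA^2 / sB^2       # the test statistic (39)
if (f > 1) {
  p <- 2 * pf(f, nA - 1, nB - 1, lower.tail = FALSE)
} else {
  p <- 2 * pf(f, nA - 1, nB - 1)
}
p                      # two-sided p-value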
The power calculations are based on the non-central F-distribution.
Example
Two analytical methods are compared, the Kjeldahl method and a spectrosco-
pic (IR) method. The laboratory wants to find out whether the two methods have
equal repeatability standard deviations or not. This can be done by the two-
sample F-test. There is a tool for it in Excel's Data Analysis tools, but in this case
the calculations are so easy that the test is faster to perform by doing the calcula-
tions oneself, as shown below. The null hypothesis is that the standard
deviations are equal, which actually is converted to the equivalent hypothesis that
the variances are equal. The alternative is that they are not equal, i.e. it is a
two-sided alternative.
The p-value tells the laboratory that the standard deviations of these two meth-
ods are not significantly different.
Exercise 25
Naturally the laboratory is also interested in whether there is a significant
systematic difference in the expected values. Conduct a suitable test to test
this hypothesis. Note that the null hypothesis must be that there is no systematic
difference.
The one-sample F-test
This test is suitable whenever the variance of a single sample is compared to a
nominal (or reference, or standard) value (σ0²).
The hypotheses are:
H0: σ² = σ0² (37)
H1: σ² ≠ σ0² (two-sided alternative) or (38a)
H1: σ² > σ0² (one-sided alternative) or (38b)
H1: σ² < σ0² (one-sided alternative) (38c)
Remember that the null hypothesis changes in one-sided tests.
The test statistic (χ²) is calculated by the following formula:
χ² = (n − 1)s² / σ0². (39)
A random variable X² (this is a capital χ²) corresponding to the test statistic is
χ²-distributed with n − 1 degrees of freedom, denoted by χ²(n − 1). The p-
value in a two-sided test poses similar difficulties as in the two-
sample F-test. The most common way is to calculate the probability P(X² > χ²),
if the ratio s²/σ0² is greater than one, and the probability P(X² < χ²), if it is smaller
than one, and then multiply the probability obtained by two.
In Excel you can calculate p-values using the function CHIDIST(χ²;n-1). Note
that CHIDIST calculates probabilities for X² being greater than χ², unlike e.g.
NORMDIST. In R, the function is pchisq(χ²,n-1). The inverse function of CHIDI-
ST is CHIINV(p,df) and the inverse function of pchisq is qchisq(p,df).
The power calculations are based on the non-central χ²-distribution.
Example
The manufacturer of a diagnostic test claims that the relative repeatability standard
deviation of the results is 5%. A client wants to test the claim by making 10
replicate analyses and then performing a χ²-test for the variance. There isn't a
ready-made tool in Excel for this test, and you have to do the calculations your-
self, as shown below. Note that, in this example, you first have to calculate the
reference standard deviation, knowing that it should be 5% of the mean value.
The p-value is below 0,05, and we have to conclude that the estimate of the true
standard deviation is significantly higher than the claimed one.
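A minimal R sketch of the same test, with hypothetical replicate results:
x      <- c(98, 103, 105, 95, 107, 110, 92, 104, 108, 96)  # hypothetical replicates
n      <- length(x)
sigma0 <- 0.05 * mean(x)                # claimed standard deviation, 5% of the mean
chi2   <- (n - 1) * var(x) / sigma0^2   # the test statistic
if (var(x) > sigma0^2) {
  p <- 2 * pchisq(chi2, n - 1, lower.tail = FALSE)
} else {
  p <- 2 * pchisq(chi2, n - 1)
}
p                                       # two-sided p-value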
Exercise 26
Silicon wafers are produced batch-wise in lots of 1000 wafers. An improvement
to the process is suggested, but it would require an investment of 100 000 €.
Therefore the new, modified process is tested in pilot scale and compared to
the old one. Twenty batches, ten with the old and ten with the new process, are
made in random order. The numbers below are differences from the nominal
thickness. Would you suggest that the investment is worth making? State all
assumptions needed for the appropriate statistical tests and base your sugges-
tion on the results of the tests.
Old: -8.6 4.8 11.4 -4.7 10.2 9.0 -10.8 12.7 3.5 -12.3
New: -0.9 6.0 -2.1 6.7 -0.1 5.9 -0.9 2.4 -4.6 1.0
You should first check whether the difference in standard deviations is signific-
ant or not by a two-sample F-test. If it is not, you can use a two-sample t-test for
testing the difference in the expected values.
SPC and control charts
Statistical process control (SPC) was developed during the second world war,
mostly in the US, where it was largely forgotten in industrial processes after the war.
However, the ideas of SPC spread to Japan, where the methodology was widely
accepted by most industries, mainly due to the work of Deming, Ishikawa,
Juran and Taguchi. The success story of Japanese cars woke up American
industry, and the methodology was reinvented in the US. Today SPC is only a part
of more general quality philosophies, e.g. Six Sigma or Taguchi's philosophy.
Yet, SPC has a fundamental role in creating good quality.
The simplest SPC tools are the so called control charts. Their basic idea is that
processes should be adjusted only if one can be almost sure that the observed
changes in the process are not due to the normal, uncontrollable (random)
variation in the process. There are three kinds of control charts, those which
are aimed to detect abrupt large changes, those which are designed to detect
small but systematic changes and those which are designed to detect changes
in variability.
In order to make a control chart, process data are needed. These data cannot
be just any data; they should represent a process that is under normal
control, i.e. a period (or periods) when everything is going all right. These data
are called the construction data.
The Shewhart (x̄-) control chart
This control chart is primarily designed for detection of abrupt large changes,
though with some additional decision rules, it can be, and is, used for other
purposes as well. In most cases the Shewhart control chart does not follow
individual process variables, rather it uses means of consecutive measure-
ments. It should noticed that these means are not moving averages, but means
of distinct groups of constant group size (n). When the construction data have
been gathered, group means ( ) and group standard deviations ( ) are
calculated (N is the number of groups). Then a pooled standard deviation is
calculated as the mean square of the group standard deviations:
. (40)
After this, the upper and lower control limits (UCL and LCL) are calculated as
follows:
UCL = x̿ + 3·s_p/√n (41a)
LCL = x̿ − 3·s_p/√n (41b)
where x̿ stands for the mean of the group means (the grand mean) of the construction
data. Sometimes a nominal value or a target value is used instead.
It should be noted that the old-fashioned way of calculating the control limits is
based on the ranges of the samples instead of the standard deviations of the
samples, and it is still widely used. The range of a sample is the maximum value
minus the minimum value. The reason for using ranges is the fact that they are
simpler to calculate and were, at the time control charts were introduced, the only
statistics applicable to hand calculation.
Exercise 27 (in computer lab)
The data consist of yields of a process and the goal is to have a steady yield.
Your task is to construct a Shewhart control chart for these data.
Sometimes also warning limits x̿ ± 2·s_p/√n are plotted on the chart. These are
called the upper and lower warning limits, and some additional decision rules can
be applied to them. The decision rules are often called runs rules (see an
example at http://iew3.technion.ac.il/sqconline/runplot.html).
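As an illustration, a minimal R sketch of computing the Shewhart limits from
construction data (the data here are simulated; X has one row per group):
set.seed(2)
n <- 4                                      # group size
X <- matrix(rnorm(20 * n, mean = 10, sd = 0.5), nrow = 20)  # 20 simulated groups
s     <- apply(X, 1, sd)                    # group standard deviations
sp    <- sqrt(mean(s^2))                    # pooled standard deviation, Eq. 40
xbar  <- rowMeans(X)                        # group means
grand <- mean(xbar)                         # grand mean
UCL   <- grand + 3 * sp / sqrt(n)           # Eq. 41a
LCL   <- grand - 3 * sp / sqrt(n)           # Eq. 41b
plot(xbar, type = "b", ylim = range(c(xbar, UCL, LCL)))
abline(h = c(LCL, grand, UCL), lty = c(2, 1, 2))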
EWMA control charts
Exponentially weighted moving averages (EWMA) are widely used especially in
process industry. They are suitable for the same purposes as Shewhart control
charts but, in addition, they are also used for ordinary process control. The idea
is simple: the original process values (x_t) taken at equally spaced time steps
(Δt) are replaced by the values
z_t = λ·x_t + (1 − λ)·z_{t−1}, 0 < λ ≤ 1. (42)
The value for λ is chosen in such a way that the mean square error between the
EWMA values and the process values is minimized in the construction data, which
is assumed to be under control. The initial value z_0 in the EWMA plays an
important role in computing all the subsequent EWMAs (the z_t's). Setting z_0 to
the mean of the construction data is one method of
initialization. Another way is to set it to the target of the process.
Yet another possibility would be to use the average of the first four or five
observations. You can find more details e.g. in the link
http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc431.htm .
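A minimal R sketch of Eq. 42 (the value of λ and the initialization are choices to
be tuned, as discussed above):
ewma <- function(x, lambda, z0 = mean(x)) {
  z <- numeric(length(x))
  zprev <- z0
  for (t in seq_along(x)) {
    z[t]  <- lambda * x[t] + (1 - lambda) * zprev   # Eq. 42
    zprev <- z[t]
  }
  z
}
set.seed(3)
x <- rnorm(100, mean = 10, sd = 1)    # simulated process data
z <- ewma(x, lambda = 0.3)
plot(x, type = "l", col = "grey"); lines(z, lwd = 2)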
CUSUM control charts
In some applications it is important to be able to detect small systematic chan-
ges in the process which is not possible by using Shewhart control charts. For
such problems, the CUSUM (cumulative sum) control chart is the appropriate
tool.
There is a good discussion about EWMA and CUSUM control charts in Box,
Hunter & Hunter: Statistics for Experimenters.
Regression analysis
Simple regression and calibration
A very common problem in engineering and laboratory analyses is to de-
scribe experimental data by a mathematical model (an equation or a set of
equations). In general, this is called regression analysis. There are two differ-
ent situations that also typically lead to different techniques: regression related
to empirical or to mechanistic models. A mechanistic model is a model whose
equations can be derived from chemical, physical or biological theories; e.g.
the concentration of a chemical A in a first-order reaction obeys the equation
c_A = c_A0·e^(−kt). An empirical model is such that the equations describing the
dependency are not known, and the regression is based on simple functions
that are flexible enough to fit the data; typically such functions are polynomials.
In reality, most models are more or less in-between purely empirical and purely
mechanistic models.
Regression models can also be classified on other bases. The one that has
the most practical importance is whether the model is linear with respect to the
unknown parameters of the model. For example, the model c_A = c_A0·e^(−kt) is
not linear with respect to the parameter k. But if we take logarithms on both sides
we get ln c_A = ln c_A0 − kt, and by renaming y = ln c_A and a = ln c_A0 we get
a linear model y = a − kt. Many models can be made linear by such re-parameterisati-
on, but one should be aware of the fact that the re-parameterised model may
give less reliable estimates for the unknown parameters than the original
model.
By far the most common principle of estimating the unknowns is the method of
least squares, developed already by Gauss. The idea is simple: suppose that each
observation is modelled by
y_i = f(x_i; β) + e_i, (43)
where e_i represents the random error of the ith measurement, x_i represents
the value of the independent variable, or a vector of values of the independent
variables, of the ith measurement, and β is a vector containing the unknown
parameters, e.g. β = [a b]. Now the least squares estimate of β, denoted often
by b, is the value that minimizes the sum Σ(y_i − f(x_i; β))². The differences
e_i = y_i − f(x_i; b) are called the residuals, and they are the differences between
measured and modelled, i.e. fitted, values. Naturally, small residuals mean a
good fit.
Another useful view is to consider regression as solving an overdetermined
system of equations.
f(x_1; β) = y_1
f(x_2; β) = y_2
⋮
f(x_n; β) = y_n (44)
If such a system is linear, the regression is called linear regression.
The difference between linear and non-linear regression models is that in linear
models the minimum is obtained by a simple formula, whereas in non-linear models
iterative methods have to be used. However, using present mathematical software,
e.g. Excel's Solver, even iterative minimization is relatively easy.
If the system (44) is linear, the estimation problem is really simple, because
solving linear systems is simple. You simply have to be able to build the
coefficient matrix, and then solve the system using (computer) matrix algebra.
However, for the statistical analysis, many more calculations are needed, and
consequently, in practice, you will have to learn some dedicated regression tool,
and to interpret its output as well. The aforementioned system-of-equations
view is useful in applying many of the regression tools, because they require
the user to first build the coefficient matrix, though usually without the constant
column of ones corresponding to the intercept of the model.
Example
Consider the Monod equation , and the following data
S [mg/l] 81 162 244 366 460
[1/h] 0.0083 0.0151 0.0191 0.0216 0.023
obtained from a wastewater treatment plant.
If the data is substituted into the equation, we obtain the following system on
nonlinear equations:
Solving this nonlinear system requires iterative tools, which are available in
Excel, Matlab or R. However, we shall study how this can be converted into a
linear system. Now, let us manipulate the equation μ = μ_max·S/(K_S + S) first by
multiplying both sides with the denominator: μ·(K_S + S) = μ_max·S, i.e.
μ·K_S + μ·S = μ_max·S. Then let us rearrange it so that it corresponds to a standard
form of a linear system. First we have to have known values on the rhs. For that,
let us divide both sides with μ_max, and group the known (measured) values as
coefficients of the unknown values: (K_S/μ_max)·μ + (1/μ_max)·μ·S = S. Now, let us
reparameterise the equation by defining
K_S/μ_max = b_1, 1/μ_max = b_2, S = y, μ = x_1, and μ·S = x_2.
This gives b_1·x_1 + b_2·x_2 = y. Note that the unknowns are the b's, not the x's. A
system of such equations would make up a linear system in standard form.
However, we'll make a slight change to the standard form, because in regres-
sion analysis the lhs and the rhs of the equations are typically interchanged,
and the unknowns are typically represented as the coefficients of the model,
i.e. y = b_1·x_1 + b_2·x_2. It is important to notice that the coefficient matrix of the
corresponding linear system is made up of x's, not of b's. Notice also that this
equation doesn't have an intercept. You could easily build the coefficient
matrix of the corresponding linear system in e.g. Octave or FreeMat by the
following commands:
>> S = [81 162 244 366 460]';
>> mu = [0.0083 0.0151 0.0191 0.0216 0.023]';
>> CM = [mu mu.*S]
CM =
    0.0083    0.6723
    0.0151    2.4462
    0.0191    4.6604
    0.0216    7.9056
    0.0230   10.5800
You could also solve the system equally easily, if you remember that the \
operator gives the least squares solution of an overdetermined system:
>> rhs = S;
>> b = CM\rhs
b =
   5792.2
   30.543
Of course, after obtaining the b's, we have to solve for μ_max and K_S from the
equations K_S/μ_max = b_1 and 1/μ_max = b_2:
>> mu_max = 1/b(2), K_S = b(1)/b(2)
mu_max = 0.032741
K_S = 189.64
Whenever possible, you should visualize the goodness of the fit; here x and y are
a dense grid and the corresponding fitted values of the nonlinear model:
x = linspace(0, 500)';            % grid for plotting the fitted curve
y = mu_max*x./(K_S + x);          % fitted Monod model on the grid
figure(1), plot(S, mu, 'o', x, y, 'LineWidth', 2)
xlabel('S (mg/L COD)')
ylabel('mu')
The source from which this example was taken
(http://www.dnr.state.wi.us/org/water/wm/ww/biophos/4biol.htm) used a different linearization
(the so-called Lineweaver-Burk plot) and obtained the figure below (compare
the fits!).
In Excel the solution would be equally easy: you would just type in the columns
of the coefficient matrix and the rhs, and then use the regression tool. However,
in Excel, if the model doesn't contain an intercept, you have to tick the box
named Constant is Zero.
After obtaining the regression coefficients, it is important to assess the reliability
of the model obtained by statistical analyses. Many of the analysis methods are
based on the residuals of the model. In general, the ith residual e_i is defined as
the difference between the measured and the calculated (fitted) value, i.e. by the
expression e_i = y_i − ŷ_i. Naturally, calculating the residuals in Octave or
FreeMat (or Matlab) would be easy (note that we have residuals both for the
linearized model and for the original nonlinear model):
>> e_lin = S - CM*b
e_lin =
   12.3908
   -0.1764
   -8.9736
   -0.5725
    3.6340
>> e_nlin = mu - mu_max*S./(K_S + S)
e_nlin =
   1.0e-003 *
  -1.4990
   0.0164
   0.6775
   0.0337
  -0.1831
Note also that standard deviations (uncertainties) for the parameters are
obtained easily only for linear models, and for calculating the standard deviations
of the original parameters you have to use the reparametrisation equations
and the technique explained in the section Propagation of errors on p. 27. How-
ever, in this case Eq. 21 is not valid, because the estimates of the b's are
correlated. In such cases, you need the covariance matrix of the estimated
parameters, which is given by Cov(b) = s_res²·(X'X)⁻¹, where s_res is the residual
standard error calculated by s_res = √(Σe_i²/(n − p)), where p is the number of un-
known parameters, and X is the coefficient matrix, including the intercept
column if the model contains an intercept. Now, the covariance matrix for any
linear transformation of the parameters, c = Ab, is given by Cov(c) = A·Cov(b)·A'.
In the linearization case, the elements of A are the partial derivatives of the
transformations from the parameters of the linearized model back to the original
model. For example, in the Monod example, with c = [μ_max K_S]', A would be
A = [ 0        −1/b_2²
      1/b_2    −b_1/b_2² ].
The variances of the (back-) transformed parameters c are on the diagonal of
matrix Cov(c).
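As an illustration, the same computation can be sketched in R, where lm fits the
linearized Monod model and vcov returns Cov(b) directly:
S  <- c(81, 162, 244, 366, 460)
mu <- c(0.0083, 0.0151, 0.0191, 0.0216, 0.023)
fit <- lm(S ~ 0 + mu + I(mu * S))   # no-intercept fit of S = b1*mu + b2*mu*S
b   <- coef(fit)                    # estimates b1 and b2
V   <- vcov(fit)                    # Cov(b) = s_res^2 (X'X)^-1
A <- matrix(c(0,        -1 / b[2]^2,        # partial derivatives of mu_max = 1/b2
              1 / b[2], -b[1] / b[2]^2),    # and of K_S = b1/b2 w.r.t. b1 and b2
            nrow = 2, byrow = TRUE)
Vc <- A %*% V %*% t(A)              # Cov(c) = A Cov(b) A'
sqrt(diag(Vc))                      # standard errors of mu_max and K_S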
Exercise
Use Excel's regression tool to obtain the b's and the residual standard error
(called just Standard Error in Excel's output) of the previous example. Then
calculate μ_max and K_S, the covariance matrix of the b's and the standard errors
of μ_max and K_S in Octave, FreeMat or Matlab.
There are nice nonlinear regression tools, e.g. in R, that calculate such statis-
tics as the standard errors of the parameter estimates automatically. Also, if the
original model is simple, all conventional statistical analyses become simple,
because linear regression tools are widely available, also in Excel (the regress-
ion tool). We will get acquainted with this tool through fitting a straight line. This
is the most common application of regression analysis, typically in connection
with linear calibration. The formulas of linear calibration are given in the
appendix, and here we will just show how to use Excel's regression tool. We will
study the use of R in the computer labs together with some examples given in
the lectures.
Example
This example is taken from the EURACHEM/CITAC guide Quantifying Uncer-
tainty in Analytical Measurement (Example A5, Table A5.2, p. 75). In this
example cadmium is analysed using a spectroscopic method. The x's are the
known concentrations of the standard samples and the y's are the absorbances,
three replicates for each sample. The model is a straight line y_i = a + b·x_i,
i = 1, 2, ..., 5. First you have to type in the data as shown below
The column std contains the standard deviations of the replicate measure-
ments. Then you have to click Tools and Data Analysis, and choose Regres-
sion.
Then you fill in the form as below
The first part of the results looks like
The important figures are titled R Square (R²) and Standard Error. The
former tells the proportion of the variance in the y-data that is explained by the
model, and the latter is the estimate of the standard error of the residuals. In a
good model, R² should be near 100% and the standard error should be near
the standard deviation of the replicate measurements. However, it should be
noted that the residual standard error is also affected by a possible systematic
error in the model. If the residual standard error is significantly larger than the
replicate standard error, the model is said to suffer from lack of fit. If replicates
have been made, lack of fit can be tested by the so-called lack-of-fit test.
The second part of the regression output looks like
The most important figures in this part are
Significance F is the p-value of a test whose null hypothesis is that the
model could be reduced to the constant model y = a, i.e. the y-values do not
depend on the x-values. The very small value tells us that the model is highly
significant, as expected.
Coefficients gives the estimates for a and b (a is the Intercept).
Standard Error gives the estimated standard errors of the coefficients.
P-value gives the p-values of tests whose null hypotheses are that the
corresponding coefficient is zero. In this case both values are small, and
especially smaller than the most common level of significance 0,05, telling
us that both coefficients differ significantly from zero.
The third part of the regression output gives the residual information shown
below
Predicted absorbance gives the fitted values, i.e. the values calculated
by the straight-line equation using the estimated values of the coefficients.
Residuals gives the residuals, i.e. the differences between the measured
and fitted values.
Standard Residuals gives the residuals divided by their standard devia-
tions. These can be evaluated using the rules of thumb of the standard
normal distribution. Thus an absolute value clearly larger than two would
point to a suspect observation where something may have gone wrong.
A better way would be to make a normal probability plot of the residuals, or
to calculate so-called externally Studentized residuals. However, this is
beyond the scope of this course.
Exercise 28
Estimate, according to the model obtained, the concentration of a sample whose
absorbance is 0,152.
In real applications we would also be interested in estimating the uncertainty of
a concentration given by the calibration model. The calculations for this can be
found in the EURACHEM/CITAC guide, or you can follow the instructions given
in the appendix.
It is always good to make a plot of the fit. In Excel, you simply use the XY-
Scatter -type of a plot and a linear trend line as shown below
[Figure: XY scatter plot of the calibration data (absorbance vs. concentration) with a linear trend line.]
You can add the equation to the chart in the trend line options.
All that was done above in Excel could have been done easily also in R by using
the R function lm (linear model). You can find examples in the R tutorial
http://users.metropolia.fi/~velimt/TG09S/Environmetrics_I/tutorial4R.pdf
Multiple regression and multivariate calibration
If there is more than one explanatory variable (also called independent or x-
variables), you need multiple regression. Our model y = b_1·x_1 + b_2·x_2 on p. 54 is
an example of multiple linear regression, but the original Monod model, though
nonlinear, is not multiple, because the only variable is S. Regression models
with only one explanatory (independent) variable are called univariate.
An important application of multiple regression is the multivariate calibration
used in modern spectroscopic analyses, especially in NIR and IR spectroscopy.
In multivariate calibration, the known concentrations are modelled using more
than one absorbance. This has many benefits, including automatic baseline
correction and insensitivity to so-called matrix effects. Other typical applications
appear in so-called data analysis and in the analysis of designed experiments.
The nice thing is that you can use just the same tools in R, Matlab and Excel
for multiple and univariate regression. We will study more multiple regression
in Environmetrics II.
However, if you recall what was taught about over-determined equations in
Equations and matrices, you already know how to apply multiple regression, as
was shown on p. 54. The only things that you have to know more are:
Regression functions like lm in R, or the regression macro in Excel, add a
constant column of ones for the intercept by default. Therefore, if your
model doesn't have an intercept, you must explicitly ask for a no-intercept
model to be estimated (see the sketch after this list).
You have to learn the statistical interpretations of the least squares
solutions. However, this you can learn from univariate regression.
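For example, in R (a small sketch with simulated data), lm adds the intercept by
default, and 0 + (or - 1) in the formula removes it:
set.seed(4)
x1 <- runif(20); x2 <- runif(20)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(20, sd = 0.1)   # simulated data
fit1 <- lm(y ~ x1 + x2)        # intercept included by default
fit2 <- lm(y ~ 0 + x1 + x2)    # no-intercept model
summary(fit1)                  # the usual regression output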
Exercise 29
Bartholomew, Adv. Appl. Microbiology, vol. 2, p. 289, 1960, studied the yield of
vitamin B in batch fermentation. The table below gives the yield in g/g as a
function of a mass transfer parameter k [g mol O2/ml/h × 10^-4].
k:      0.9 1.5 1.8 2.6 3.6 4.2 4.5 5.5 5.9
vit. B: 2.1 4.2 3.8 4.5 4.3 3.5 3.9 3.0 2.3
In order to find the optimal value for the mass transfer coefficient, the author
suggested the following model for fitting the data:
y = k / (a + b·k + c·k²).
Figure out a suitable transformation and re-parameterization of the model that
makes the equation linear w.r.t. its parameters. Then estimate the transformed
parameters using multiple regression. After that, solve for the original parame-
ters a, b and c, and plot the model against the measured data. Also find out the
optimal mass transfer coefficient according to this model.
Appendix: Linear calibration
An example of Excel-calculations can be found at the end of this text.
Let us denote the quantity to be calibrated by x and the measured
responses by y. We shall consider the following two calibration models,
a) y = bx and b) y = a + bx, and their least squares estimates b̂ and (â, b̂),
obtained from the following equations:
b̂ = Σx_iy_i / Σx_i² (1a)
b̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)², â = ȳ − b̂x̄ (1b)
In the subsequent equations the summation index may have been omitted, but it is
always from 1 to n.
In reliability calculations we also need the estimates of the standard deviations of
the parameters. The formulas for these contain the residual standard error obtained
from either of the equations below (a refers to model a and b refers to model b).
s_res = √(Σ(y_i − b̂x_i)² / (n − 1)) (2a)
s_res = √(Σ(y_i − â − b̂x_i)² / (n − 2)) (2b)
Now we can state the formulas for the standard deviations of â and b̂:
s(b̂) = s_res / √(Σx_i²) (3a)
s(â) = s_res·√(Σx_i² / (n·Σ(x_i − x̄)²)) (3b-1)
s(b̂) = s_res / √(Σ(x_i − x̄)²) (3b-2)
In some problems we have to use both parameters, and then the covariance between
â and b̂ also has to be taken into account:
Cov(â, b̂) = −x̄·s_res² / Σ(x_i − x̄)². (4)
The general formula for applying the covariance in calculating the variance of a linear
combination is given below; c and d are any constants and X and Y are any random
variables. If you apply the formula to a case with the parameter estimates of the
straight line, you have X = â and Y = b̂.
Var(cX + dY) = c²·Var(X) + d²·Var(Y) + 2cd·Cov(X, Y). (5)
Normally, you would use this in a linearized form of some equation.
If you want to calculate the so-called confidence curves around the straight line, the
points on these two curves are calculated by the following formulas:
y_U = â + b̂x + t_{α/2}(n − 2)·s_res·√(1 + 1/n + (x − x̄)²/Σ(x_i − x̄)²) (6a)
y_L = â + b̂x − t_{α/2}(n − 2)·s_res·√(1 + 1/n + (x − x̄)²/Σ(x_i − x̄)²) (6b)
The number t_{α/2}(n − p), where p = 1 or 2 and 1 − α is the confidence level, is
obtained in Excel by TINV(α;n-p).
When you need an estimate for the standard error of the concentration (or the x-
value in general) of a new, unknown sample, you need the following approximate
formulas. First, let us denote the x-value of the new sample, solved from the
straight-line equation, by x̂0, whence x̂0 = (y0 − â)/b̂, or x̂0 = y0/b̂ in model a.
s(x̂0) ≈ (s_res/b̂)·√(1 + x̂0²/Σx_i²) (7a)
s(x̂0) ≈ (s_res/b̂)·√(1 + 1/n + (y0 − ȳ)²/(b̂²·Σ(x_i − x̄)²)) (7b)
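As a sketch, formula (7b) could be implemented in R as follows (x and y are the
calibration data and y0 is the measured response of the unknown sample; a
single measurement of y0 is assumed):
sx0 <- function(x, y, y0) {
  n    <- length(x)
  fit  <- lm(y ~ x)
  a    <- coef(fit)[1]
  b    <- coef(fit)[2]
  sres <- summary(fit)$sigma                # residual standard error, Eq. 2b
  x0   <- (y0 - a) / b                      # predicted x-value
  se   <- (sres / abs(b)) *
    sqrt(1 + 1/n + (y0 - mean(y))^2 / (b^2 * sum((x - mean(x))^2)))
  c(x0 = unname(x0), se = unname(se))       # estimate and its standard error (7b)
}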
An example (from the book Miller & Miller, Statistics for Analytical Chemistry)
The table on the right contains a series of calibration
measurements of standard samples by AAS (Atomic
Absorption Spectroscopy). The x-variable is the known
concentration (ng/l) and the y-variable is the absorban-
ce. The tasks are to a) determine the least squares estimates
for the parameters, b) plot the data, the calibration straight
line and the confidence curves, c) calculate the standard
errors of the parameters and the covariance between
them, and d) calculate the confidence limits for the concentra-
tion of an unknown sample whose absorbance is 0,456.
You can omit part of the calculations if you use Excel's
regression tool. However, the confidence curves and the
confidence limits for the unknown concentration
you have to calculate yourself.
First we have to complete the table so that it is easy to calculate the required sums
of squares, as shown in the table below. The formula in the cell A9 is =SUM(A2:A8)
and in the cell A10 it is =SUMSQ(A2:A8). The other sums are obtained by copying
the formulas. The estimates for the intercept and for the slope are obtained by the
functions INTERCEPT and SLOPE. Thus the formulas in the cells A13 and B13 are
=INTERCEPT($B$2:$B$8;$A$2:$A$8) and =SLOPE($B$2:$B$8;$A$2:$A$8). The
sum of squares Σ(x_i − x̄)² is obtained by first calculating the differences x_i − x̄. The
formula in the cell D2 is =A2-AVERAGE($A$2:$A$8). You have to use the absolute
reference in order to be able to copy the formula correctly. For the standard devia-
tions you have to calculate the residuals and their standard error. The formula for the
residual in the cell E2 is =B2-($A$13+$B$13*A2). The formula for the residual
standard error in the cell E11 is =SQRT(E10/(7-2)). Now, the standard errors for the
parameters are obtained by the formulas 3b-1 and 3b-2, whose Excel versions are in
the cells A14 and B14: =E11*SQRT(A11/(7*D10)) and =E11*1/SQRT(D10). All this
is shown in the table below.
The table also shows the confidence limits obtained by the formulas
â ± t_{α/2}(n − 2)·s(â) and b̂ ± t_{α/2}(n − 2)·s(b̂). For example, the cell A15
contains the formula =A$13-TINV(0.05;7-2)*A$14.
For the confidence curves you have to calculate values densely enough, e.g.
with x values 0, 1, 2, ..., 30. The y-values for the confidence curves are obtained
using formulas 6a and 6b. The formula in the cell C19 in the table below is
=$B19-$E$11*TINV(0.05;7-2)*SQRT(1+1/7+(A19-$A$9/7)^2/$D$10)
and the rest of the values are obtained by copying this formula. Note that
the rows from 22 to 47 are hidden in the table.
After plotting the data points, the straight line and the confidence curves we get
the figure below.
[Figure: the calibration line with the 95% upper and lower confidence curves and the measurement points.]