Keith A. Cherkauer
Created April 16th, 1997
Revised March 8th, 2007
K. A. Cherkauer
Department of Agricultural and Biological Engineering
Purdue University
225 S. University St.
West Lafayette, IN 47907
cherkaue at purdue.edu
Statistics Notes
TABLE OF CONTENTS
1) SAMPLE STATISTICS
   c) Independence
   d) Expectations
   e) Moments
   g) Cumulants
   h) Transformation of Variables
2) PROBABILITY DISTRIBUTIONS
   c) Log-Normal Family
      i) Normal or Gaussian Distribution
      ii) 2 Parameter Log Normal Distribution
      iii) 3 Parameter Log Normal Distribution
   d) Discrete Distributions
      i) Binomial Distribution
      ii) Poisson Distribution
3) PARAMETER ESTIMATION
   a) Maximum Likelihood
   b) Method of Moments
4) HYPOTHESIS TESTING
   a) Test Basics
5) REGRESSION ANALYSIS
   d) Multiple Regression
6) TREND ANALYSIS
   c) Correlation
   d) Storage-Related Statistics
   e) Time-Series Models
8) PROBABILITY OF EXTREMES
9) ANALYSIS OF VARIANCE
   c) Standard Regression
10) BIBLIOGRAPHY
1) Sample Statistics
a) Probability Density Function (PDF)
i) If f(X) is a PDF, then f(X) ≥ 0 for all X ∈ ℜ,
ii) ∫_{-∞}^{∞} f(X) dX = 1, and
iii) P(a < X < b) = ∫_{a}^{b} f(X) dX.
b) Cumulative Distribution Function (CDF)
i) The cumulative distribution F(X) of a continuous random variable X with density function f(X) is given by F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(t) dt, for -∞ < x < ∞.
c) Independence
i) Let X and Y be two random variables, discrete or continuous, with joint probability distribution ƒ(X,Y),
and marginal distributions g(X) and h(Y), respectively. The random variables X and Y are said to be
independent if and only if ƒ(X,Y) = g(X)h(Y), for all (x,y) within their range. Or events A and B are
independent if the probability of A occurring after B has occurred, P(A|B), is equal to the probability of A
occurring, P(A).
d) Expectations
i) The mean or mathematical expectation of a random variable X with PDF f(X) is E(X) = ∫_{-∞}^{∞} x f(x) dx, or E(X) = Σ_{i=1}^{N} x_i f(x_i) if the function is discrete.
ii) Mean ≡ E(X) = µ_x. The sample mean is x̄ = (1/N) Σ_{i=1}^{N} x_i.
iii) Variance ≡ E[X - E(X)]² = σ². The sample variance is S² = (1/(N-1)) Σ_{i=1}^{N} (x_i - x̄)².
The sample variance, with its divisor of N - 1, is an unbiased estimator of σ². First expand the sum of squares:

Σ_{i=1}^{N} (x_i - x̄)² = Σ_{i=1}^{N} [(x_i - µ) - (x̄ - µ)]²
= Σ (x_i - µ)² - 2(x̄ - µ) Σ (x_i - µ) + N(x̄ - µ)²

so that

E[S²] = (1/(N-1)) { Σ E(x_i - µ)² - 2 Σ E[(x_i - µ)(x̄ - µ)] + Σ E(x̄ - µ)² }.

The first term is

Σ_{i=1}^{N} E(x_i - µ)² = Σ σ_x² = Nσ².

The third term is

Σ_{i=1}^{N} E(x̄ - µ)² = Σ σ_x̄² = N (σ²/N) = σ².

The middle term is

2 Σ_{i=1}^{N} E[(x_i - µ)(x̄ - µ)] = 2 Σ E{ (x_i - µ) [ (x_1 + x_2 + … + x_N)/N - µ ] }
= (2/N) Σ E{ (x_i - µ) [ (x_1 - µ) + (x_2 - µ) + … + (x_i - µ) + … + (x_N - µ) ] }
= (2/N) Σ E[ (x_1 - µ)(x_i - µ) + (x_2 - µ)(x_i - µ) + … + (x_i - µ)² + … + (x_N - µ)(x_i - µ) ]
= (2/N) Σ E(x_i - µ)²    (the cross terms vanish for independent samples)
= (2/N) Nσ² = 2σ².

Combining the three terms:

E[S²] = (1/(N-1)) [ Nσ² - 2σ² + σ² ] = (N-1)σ² / (N-1) = σ².
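The unbiasedness result E[S²] = σ² can be checked numerically. The sketch below (a minimal Python Monte Carlo; the values of µ, σ, the sample size, and the trial count are illustrative) averages the variance estimator over many samples, dividing by N - 1 and, for contrast, by N:

```python
import random

# Monte Carlo check: dividing by N-1 gives an unbiased estimate of sigma^2,
# while dividing by N underestimates it by a factor of (N-1)/N.
random.seed(1)
mu, sigma, n, trials = 5.0, 2.0, 10, 20000

s2_unbiased = s2_biased = 0.0
for _ in range(trials):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    s2_unbiased += ss / (n - 1)  # sample variance S^2
    s2_biased += ss / n          # biased alternative

print(s2_unbiased / trials)  # close to sigma^2 = 4
print(s2_biased / trials)    # close to (n-1)/n * sigma^2 = 3.6
```

With only N = 10 points per sample the bias of the 1/N estimator is large enough to be visible even with modest trial counts.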
e) Moments
i) The rth moment for the probability density function f(x) is
µ_r′ = E(X^r) = ∫_{-∞}^{∞} x^r f(x) dx.
(1) The zero moment, µ₀′ = 1.
(2) The first moment, µ₁′ = E(X) = µ_x.
(3) The second moment, µ₂′ = E(X²) = µ_x² + σ_x².
ii) The rth central moment, or moment taken around the mean, for the PDF f(x) is
µ_r^C = E[X - E(X)]^r = ∫_{-∞}^{∞} (X - µ_X)^r f(X) dX.
(1) The first central moment, µ₁^C = 0.
(2) The second central moment, µ₂^C = E[X - E(X)]² = σ_x².
(3) The third central moment, µ₃^C = E[X - E(X)]³ = γσ_x³.
f) Moment Generating Functions
i) The moment generating function of a probability distribution function, ƒ(x), is:
M_X(t) = E(e^{xt}) = ∫_{-∞}^{∞} e^{xt} f(x) dx.
ii) To find the rth moment, µ_r′, solve µ_r′ = [d^r M_X(t) / dt^r]_{t=0}.
(1) Example: Find the first and second moments for an exponential distribution.
The PDF is f(x) = (1/β) exp[-(x - x₀)/β], x ≥ x₀.

First solve for the moment generating function:

M_x(t) = E(e^{xt}) = ∫_{x₀}^{∞} e^{xt} f(x) dx = ∫_{x₀}^{∞} e^{xt} (1/β) e^{-(x - x₀)/β} dx

M_x(t) = (1/β) ∫_{x₀}^{∞} exp[(xtβ - x + x₀)/β] dx

= (e^{x₀/β}/β) ∫_{x₀}^{∞} exp[-(1 - tβ)x / β] dx

= (e^{x₀/β}/β) [ -β/(1 - tβ) exp(-(1 - tβ)x / β) ]_{x₀}^{∞}

= [e^{x₀/β}/(1 - tβ)] exp[-(1 - tβ)x₀ / β] = e^{tx₀} / (1 - tβ)

Now solve for the first moment:

dM_x(t)/dt |_{t=0} = d/dt [ e^{tx₀}/(1 - tβ) ]_{t=0} = [ βe^{tx₀}/(1 - tβ)² + x₀e^{tx₀}/(1 - tβ) ]_{t=0}
= β + x₀ = µ₁′

the second moment:

d²M_x(t)/dt² |_{t=0} = d/dt [ βe^{tx₀}/(1 - tβ)² + x₀e^{tx₀}/(1 - tβ) ]_{t=0}
= [ 2β²e^{tx₀}/(1 - tβ)³ + 2x₀βe^{tx₀}/(1 - tβ)² + x₀²e^{tx₀}/(1 - tβ) ]_{t=0}
= 2β² + 2x₀β + x₀² = µ₂′

The variance follows as µ₂′ - (µ₁′)² = 2β² + 2x₀β + x₀² - (β + x₀)² = β².
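The moments derived from the MGF can be checked by simulation. A minimal Python sketch (the values x₀ = 1 and β = 2 are illustrative) samples the shifted exponential and compares the first two sample moments against β + x₀ and 2β² + 2x₀β + x₀²:

```python
import random

# Numerical check of the shifted-exponential moments:
# E[X] = beta + x0, E[X^2] = 2*beta^2 + 2*x0*beta + x0^2.
random.seed(2)
x0, beta, n = 1.0, 2.0, 200000

# expovariate takes a rate, so the mean-beta exponential uses 1/beta.
xs = [x0 + random.expovariate(1.0 / beta) for _ in range(n)]
m1 = sum(xs) / n                 # first moment -> 2 + 1 = 3
m2 = sum(x * x for x in xs) / n  # second moment -> 8 + 4 + 1 = 13
print(m1, m2)
```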
g) Cumulants
i) A cumulant is a moment-like quantity; cumulants of order two and higher are invariant with changes in the mean.
h) Transformation of Variables
i) Suppose that X is a continuous random variable with PDF f(X). Let Y = u(X) define a one-to-one correspondence between the values of X and Y, so that the equation y = u(x) can be uniquely solved for x in terms of y, say x = w(y). Then the PDF of Y is g(y) = f[w(y)] |dx/dy|.
Example: Log-Normal distribution
φ(y) = (1/√(2πσ_y²)) exp[-(1/2)((y - µ_y)/σ_y)²], where y = ln(x).

f(x) = φ(y) dy/dx = (1/√(2πσ_y²)) exp[-(1/2)((ln x - µ_y)/σ_y)²] d[ln(x)]/dx

= (1/(x√(2πσ_y²))) exp[-(1/2)((ln x - µ_y)/σ_y)²]
2) Probability Distributions
a) Extreme Value Family
i) The extreme value distribution comes from the analysis of extreme events. Given a data set Xi ∈ {X1, X2, …, XN}, the extreme value is Mi = max(Xi). If the set Xi is, for example, daily streamflow, then Mi is the maximum annual streamflow. Given multiple years of data, the extreme values will gradually approach a distribution that is a member of the extreme value family. If the Xi's are from an extreme value distribution, then the Mi's will have the same extreme value distribution. The branch of the family is determined by the value of k.
[Figure: Extreme value family plotted as functions of the Type 1 reduced variate, y₁, by the relation x = u + α(1 - e^{-ky₁})/k: EV1, k = 0 (solid); EV2, k < 0 (long dash); EV3, k > 0 (short dash). Increasing |k| increases curvature.]
(3) The PDF is f(x₃) = (1/α) [1 - k(x₃ - u)/α]^{1/k - 1} exp{-[1 - k(x₃ - u)/α]^{1/k}}.
(4) Where u is the location parameter, α is the scale parameter, and k is the shape parameter.
(5) The standardized variate is -y₃ = 1 - k(x₃ - u)/α, -∞ ≤ y₃ ≤ 0; with PDF g(y₃) = {(-y₃)^{1/k - 1} exp[-(-y₃)^{1/k}]}/k; and DF G(y₃) = exp[-(-y₃)^{1/k}].
(6) The values of G(y₃) depend on k, so values must be tabulated for different values of k.
v) General Extreme Value (GEV)
(1) The GEV has the same form as the EV2 or EV3, except that k is unrestricted. It is used when the type of extreme value function needed is unknown. When fitted to the data, the GEV returns an estimated value of k. If k is close to 0, the data should be reanalyzed with an EV1 distribution, since the EV2 and EV3 distributions approach the EV1 as k → 0.
(2) The variate of the GEV, x, is related to the variate of the EV1, y₁, by x = u + α(1 - e^{-ky₁})/k.
vi) Weibull Distribution
(1) As noted above, the Weibull distribution is a variation of the EV3 for minimum values. If X has an EV3 distribution, then -X has a Weibull distribution. It is an important distribution with useful properties, so it is treated here as a separate category.
(2) The PDF is f(x) = (k/α)(x/α)^{k-1} exp[-(x/α)^k].
(3) The CDF is F(x) = 1 - exp[-(x/α)^k].
(4) Where x > 0; α, k > 0.
(5) Mean = µ₁′ = E(x) = αΓ(1 + 1/k).
(6) Variance = σ² = E[x - E(x)]² = α²[Γ(1 + 2/k) - Γ²(1 + 1/k)].
(7) If k = 1, the distribution becomes exponential.
(8) If X has a Weibull distribution, then Y = -ln(X) has a Gumbel (EV1) distribution.
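The Weibull mean formula E(x) = αΓ(1 + 1/k) can be verified directly in Python, since the standard library provides both a Weibull sampler and the gamma function. The parameters α = 2 and k = 1.5 below are illustrative:

```python
import math
import random

# Monte Carlo check of the Weibull mean, E(x) = alpha * Gamma(1 + 1/k).
random.seed(3)
alpha, k, n = 2.0, 1.5, 100000

# random.weibullvariate takes (scale, shape), i.e. (alpha, k) here.
sample_mean = sum(random.weibullvariate(alpha, k) for _ in range(n)) / n
theory_mean = alpha * math.gamma(1.0 + 1.0 / k)
print(sample_mean, theory_mean)
```

Setting k = 1 reduces the sampler to an exponential distribution with mean α, consistent with item (7).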
(1) The PDF is f(x) = (1/β) e^{-(x - x₀)/β}.
(2) The CDF is F ( x) = 1 − e − ( x − x0 ) / β .
(3) Mean = E ( x ) = µ1′ = x0 + β .
(4) Variance = E[x – E(x)]2 = µ2 = β 2.
(5) Skewness = g = 2.
(6) The standardized variate is y = (x – x0)/β, with PDF, g(y) = e-y; and CDF, G(y) = 1 – e-y.
[Figure: Exponential distribution PDF.]
[Figure: Gamma distribution PDF.]
(2) The PDF is f(x) = (x - x₀)^{γ-1} e^{-(x - x₀)/β} / [β^γ Γ(γ)].
(3) Mean = E(x) = µ1’ = x0 + βγ.
(4) Variance = E[x – E(x)]2 = µ2 = β 2γ.
(5) Skewness = g = 2/√γ.
(6) The standardized variate is the same as that for the gamma distribution.
v) log Pearson Type 3
(1) Use Z = ln(x) for a 3 parameter log Pearson Type 3 of x, or Z = ln(x-x0), for a more general 4
parameter log Pearson Type 3 distribution of x. For the 3 parameter log Pearson Type 3:
(2) The PDF is f(x) = (ln x - z₀)^{η-1} e^{-(ln x - z₀)/λ} / [x λ^η Γ(η)].
c) Log-Normal Family
i) Normal or Gaussian Distribution
The sample variance is S_x² = (1/(N-1)) Σ_{i=1}^{N} (x_i - x̄)². These equations are also used as the estimators for the normal distribution (strictly, the maximum likelihood variance estimator divides by N rather than N - 1; see Section 3).
ii) 2 Parameter Log Normal Distribution
(1) f(x) = (1/(x√(2πσ_y²))) exp[-(1/2)((ln x - µ_y)/σ_y)²].
(2) This distribution may be used if data show a positive skew. Skewness can be determined from the coefficient of variation, CV = σ/µ, as γ_x = 3CV_x + CV_x³. As CV → 0, the log-normal distribution approaches the normal distribution.
(3) Distribution parameters estimated by the method of moments:
σ_y = [ln(1 + σ_x²/µ_x²)]^{1/2}, and µ_y = ln(µ_x) - σ_y²/2.
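The method-of-moments equations for the 2-parameter log-normal can be exercised in Python by sampling from a known log-normal and recovering its parameters from the sample mean and variance. The true values µ_y = 0.5 and σ_y = 0.4 are illustrative:

```python
import math
import random

# Method-of-moments recovery of log-normal parameters:
# sigma_y^2 = ln(1 + var_x/mean_x^2), mu_y = ln(mean_x) - sigma_y^2/2.
random.seed(4)
mu_y, sigma_y, n = 0.5, 0.4, 200000

xs = [random.lognormvariate(mu_y, sigma_y) for _ in range(n)]
mean_x = sum(xs) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)

sigma_y2_hat = math.log(1.0 + var_x / mean_x ** 2)
mu_y_hat = math.log(mean_x) - sigma_y2_hat / 2.0
print(mu_y_hat, math.sqrt(sigma_y2_hat))  # near 0.5 and 0.4
```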
iii) 3 Parameter Log Normal Distribution
(1) f(x) = (1/((x - a)√(2πσ_y²))) exp[-(1/2)((ln(x - a) - µ_y)/σ_y)²].
(2) Subtracting off a lower bound parameter, a, will sometimes make Y = ln(X - a) normally distributed.
(3) The method of moments is inefficient for the LN3 distribution. A simple and efficient method is to estimate the lower bound from the sample quantiles as â = (x₁x_N - x_median²)/(x₁ + x_N - 2x_median), where x₁ and x_N are the smallest and largest observations.
ii) Chi-Squared Distribution
(1) The chi-squared statistic, χ² = νS²/σ², is formed from a sample of a normally distributed random variable with population mean = µ and variance = σ², and estimated variance S².
(2) The distribution function for ν degrees of freedom is given by
f(x) = x^{ν/2 - 1} e^{-x/2} / [2^{ν/2} Γ(ν/2)], x > 0; and 0 elsewhere.
(3) Mean = µ = ν.
(4) Variance = σ² = 2ν.
(5) If the Vi ∈ {V1, V2, …, VN} are independent and each Vi has a chi-squared distribution, then the sum of all Vi's also has a chi-squared distribution.
iii) F-Distribution
(1) The F statistic is F = (U/ν₁)/(V/ν₂), where U and V are independent random variables having chi-squared distributions with ν₁ and ν₂ degrees of freedom.
(2) The distribution, with ν₁ and ν₂ degrees of freedom, is given by
h(f) = Γ[(ν₁ + ν₂)/2] (ν₁/ν₂)^{ν₁/2} f^{ν₁/2 - 1} / { Γ(ν₁/2) Γ(ν₂/2) [1 + ν₁f/ν₂]^{(ν₁+ν₂)/2} }, 0 < f < ∞; and 0 elsewhere.
(3) f_{1-α}(ν₁, ν₂) = 1 / f_α(ν₂, ν₁).
3) Parameter Estimation
a) Maximum likelihood
i) Maximum likelihood estimates a distribution's parameters by finding the values most likely to have produced the sample data.
ii) The likelihood function for the PDF f(X|A) is L(X|A) = ∏ f(X_i|A),
where X is the set of observations to be described (X₁, X₂, …, X_N), and A is the set of distribution parameters (A₁ = µ, A₂ = σ², …).
iii) The parameters are found by maximizing the likelihood function with respect to each parameter:
dL(X|A)/dA₁ = 0, dL(X|A)/dA₂ = 0, …, and then solving for those parameters.
iv) In some instances it may be easier to find the parameters that maximize the log likelihood, ln[L(X|A)]. The values of A that maximize the likelihood function also maximize the log likelihood.
v) Example: Find the maximum likelihood estimators of the normal distribution
L(X|µ, σ²) = ∏_{i=1}^{N} f(x_i) = ∏_{i=1}^{N} (1/√(2πσ²)) e^{-(1/2)[(x_i - µ)/σ]²}

= (2πσ²)^{-N/2} e^{-(1/2) Σ [(x_i - µ)/σ]²}

Setting the partial derivatives, ∂L/∂µ and ∂L/∂σ², to zero and solving for the distribution parameters yields the maximum likelihood estimators:

µ̂ = (1/N) Σ x_i = m₁
σ̂² = (1/N) Σ (x_i - x̄)² = m₂
b) Method of Moments
i) Using the moment generating formula, produce the same number of moments as the distribution has
parameters.
ii) Solve for the desired parameters.
iii) Compute the moments from the data set (X1, X2, …, XN), and use them to calculate estimates for the
distribution parameters.
iv) Example: Find the moment estimators for the exponential distribution.
The mean, µ, and standard deviation, σ, are related to the distribution parameters x₀ and β by
µ = x₀ + β
σ = β
Therefore, if x₀ is known and the sample mean x̄ is the estimate of µ, β̂ is obtained from β̂ = x̄ - x₀.
If both x₀ and β are unknown, then the equations for both parameters must be used. First β̂ = σ̂, where σ̂ is the sample estimate of σ; then x̂₀ = x̄ - β̂ = x̄ - σ̂.
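The two-unknowns case above can be sketched in Python: sample a shifted exponential with known x₀ and β (the values 2 and 3 are illustrative), then recover both from the sample mean and standard deviation:

```python
import random

# Moment estimators for the exponential distribution when both x0 and beta
# are unknown: beta_hat = s (sample std. dev.), x0_hat = xbar - s.
random.seed(6)
x0, beta, n = 2.0, 3.0, 200000

xs = [x0 + random.expovariate(1.0 / beta) for _ in range(n)]
xbar = sum(xs) / n
s = (sum((x - xbar) ** 2 for x in xs) / (n - 1)) ** 0.5

beta_hat = s           # sigma = beta
x0_hat = xbar - s      # mu = x0 + beta
print(x0_hat, beta_hat)  # near 2 and 3
```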
4) Hypothesis Testing
a) Test Basics
i) The Null Hypothesis, H0, is the condition assumed by the tester to be true.
ii) The Alternative Hypothesis, H1, is the condition assumed by the tester to be true if H0 is rejected.
iii) Test probability table:
              H0 True                 H1 True
H0 Accepted   Significance (1 - α)    Type II Error (β)
H0 Rejected   Type I Error (α)        Power (1 - β)
(1) The significance level, α, is set by the tester, and determines the probability that the null hypothesis
will be rejected if it is true.
(2) β, the Type II Error, is the probability of accepting the null hypothesis when it is in fact false.
(3) The power of the test, 1-β, can be determined if a specific alternative hypothesis is set, otherwise it is
unknown.
iv) A test statistic is computed from the data, and compared against the expected statistic computed from α.
Note that the distribution used in the comparison is that of the test statistic, not of the sampled data, or
population.
b) Types of Statistical Tests
Parametric tests assume a distribution, usually a normal distribution, while non-parametric tests make no assumption about the statistic's distribution.
i) Parametric Tests
(1) A Z test assumes a normal or Gaussian distribution, with both the population mean and variance known. The test statistic has the form Z = (x̄ - µ)/(σ/√N). This test is usually used to compare the means of two sets of normally distributed independent random data. The Central Limit Theorem states that if Z₁, Z₂, …, Z_N are random samples from any distribution f(X), then the distribution of their sample mean approaches a normal distribution as N becomes large.
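A one-sample Z test of the form above can be written in a few lines of Python using the standard library's normal CDF; the sample mean, population parameters, and sample size below are illustrative:

```python
from statistics import NormalDist

# Two-sided Z test: compare a sample mean xbar against a population mean mu
# with known sigma, for a sample of size n.
xbar, mu, sigma, n = 10.4, 10.0, 2.0, 100

z = (xbar - mu) / (sigma / n ** 0.5)        # test statistic -> 2.0
p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided p-value ~ 0.0455
print(z, p)
```

At the common α = 0.05 significance level this illustrative result would (narrowly) reject H0: µ = 10.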
(2) A T test is similar to the Z test, except that the population variance is not known; instead the sample variance is used, and the test assumes a t-distribution. The resulting test statistic is t = (x̄ - µ)/(S/√N). The t-distribution is wider than the normal distribution, to compensate for the use of the sample variance, but as N → ∞ the t-distribution approaches the normal distribution. The t-test assumes that the data are normally distributed around their respective means, and that they have the same variance.
(a) A paired t-test is commonly used for evaluating matched pairs of data. The test is conducted on
the differences between the data sets, Di = Xi – Yi, so the differences must be normally
distributed.
(3) The chi-squared test assumes a chi-squared distribution, which is a measure of the variance. The test statistic has the form χ² = Σ [(X_i - µ)/σ]².
ii) Non-Parametric Tests
(1) The rank-sum test compares two independent groups. It makes no assumptions about the type of distribution, but the data must be homoscedastic.
• First, the combined data set is ranked by the magnitude of the values: the greatest value receiving
the highest rank, and ties being assigned their average rank.
• Next, ranks of the smaller data set are summed, and compared against the probability distribution
created from all possible outcomes (bounded by sum of lowest rankings and sum of highest
rankings). For large sets of data (N > 10), a Z test can be used on the ranks.
Example: Determine the probability distribution for Xi ∈ { X1, X2, X3 }, and Yi ∈ { Y1, Y2, Y3, Y4 }:
Possible Rank Combinations
1,2,3 1,3,4 1,4,5 1,5,6 1,6,7
1,2,4 1,3,5 1,4,6 1,5,7
1,2,5 1,3,6 1,4,7
1,2,6 1,3,7
1,2,7
Sums of Ranks
6 8 10 12 14
7 9 11 13
8 10 12
9 11
10
[Figure: Probability distribution of the rank sum for the example; probability (0.00-0.20) versus rank sum (6-14).]
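The enumeration above (which lists the combinations containing rank 1) can be completed by brute force: every 3-subset of the 7 combined ranks is equally likely under H0, so the full rank-sum distribution for N₁ = 3, N₂ = 4 follows from `itertools.combinations`:

```python
from collections import Counter
from itertools import combinations

# Exhaustive rank-sum distribution for the smaller set of size 3 drawn from
# combined ranks 1..7; each of the C(7,3) = 35 subsets is equally likely.
counts = Counter(sum(c) for c in combinations(range(1, 8), 3))
total = sum(counts.values())

for rank_sum in sorted(counts):
    print(rank_sum, counts[rank_sum] / total)
```

The sums range from 1+2+3 = 6 up to 5+6+7 = 18, with the mode at 12 (probability 5/35).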
(2) The signed-rank sum test (also the Wilcoxon signed-rank test) is used to determine if the median
difference between two paired sets of data (X, Y), is equal to zero. It can also be used to test if the
median of one set of data is significantly different from zero.
• First, the absolute value of the differences between the data sets (Di = Xi – Yi), are ranked, with
highest rank being assigned to the greatest difference, and ties being assigned their average rank.
• Next, the signs of the differences (±) are applied to their respective ranks, and the positive ranks
are summed.
• The computed statistic is then compared with the probability distribution of all possible
combinations of positive rank (bounded by 0 and the sum of all N ranks). For large data sets, N
> 10, the distribution can be approximated with a normal distribution.
(3) Kendall's tau measures the strength of monotonic relationships between two data sets (X, Y). A
monotonic relationship exists if as the X values increase the dependent values of Y all increase or all
decrease. Kendall’s test does not assume a linear trend, like the correlation coefficient, but it does
require that the data have equal variance.
• First, data is sorted so that the independent variable, X, is increasing (Xi < Xj, for i < j).
• Next the differences are computed for all combinations of Yj - Yi with i < j (there are N*(N-1)/2
possibilities).
• Positive and negative differences are counted separately and the test statistic, S, is the number of
positive differences minus the number of negative differences.
• The tau statistic, a measure of the data’s correlation, is found by dividing S by N*(N-1)/2, the
total number of step combinations. Thus tau will be +1 if all steps are positive, and –1 if all steps
are negative.
• Trend can be tested for by comparing the S statistic to the distribution of all possible
combinations of pluses and minuses (bounded by 0 and N*(N-1)/2).
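The tau computation in the steps above maps directly to code. A minimal Python sketch (the five Y values are illustrative, already sorted on X):

```python
from itertools import combinations

# Kendall's tau: count the sign of Yj - Yi over all pairs with i < j,
# S = (#positive) - (#negative), tau = S / (N*(N-1)/2).
y = [1.2, 1.5, 1.4, 2.0, 2.3]  # Y values after sorting the pairs by X

s = 0
for i, j in combinations(range(len(y)), 2):
    diff = y[j] - y[i]
    if diff > 0:
        s += 1
    elif diff < 0:
        s -= 1

n = len(y)
tau = s / (n * (n - 1) / 2)
print(s, tau)  # S = 8 (9 positive, 1 negative steps), tau = 0.8
```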
5) Regression Analysis
a) Regression analysis tries to describe the relationship between two or more continuous variables by fitting a
linear model.
b) Ordinary Least Squares
i) Population model: y_i = α + βx_i + ε_i
(1) Where α is the intercept, β is the slope, and ε_i is the random error between the data pair and the model prediction.
ii) Estimated parameters: y_i = a + bx_i + e_i
iii) Linear estimation: ŷ_i = a + bx_i
iv) To find the parameters, minimize the sum of squares of the errors:
SSE = Σ_{i=1}^{N} e_i² = Σ (y_i - ŷ_i)² = Σ (y_i - a - bx_i)²,
by solving dSSE/da = 0 and dSSE/db = 0 for a and b.

dSSE/da = Σ(-2y_i + 2a + 2bx_i) = 0
dSSE/db = Σ(-2x_iy_i + 2ax_i + 2bx_i²) = 0

Σ(-y_i + a + bx_i) = 0
Σ(-y_i + a + bx_i)x_i = 0

Σy_i = Na + bΣx_i
Σx_iy_i = aΣx_i + bΣx_i²

a = (1/N)[Σy_i - bΣx_i]

Then substitute into the second equation:

Σx_iy_i = (1/N)[Σy_i - bΣx_i]Σx_i + bΣx_i²

and solve for the slope:

b = [NΣx_iy_i - Σx_iΣy_i] / [NΣx_i² - (Σx_i)²]
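The closed-form normal-equation solution can be computed directly; the small data set below is illustrative:

```python
# Ordinary least squares slope and intercept from the normal equations:
# b = (N*Sxy - Sx*Sy) / (N*Sxx - Sx^2), a = (Sy - b*Sx) / N.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope -> 1.96
a = (sy - b * sx) / n                          # intercept -> 0.14
print(a, b)
```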
vi) The assumptions used in an OLS regression vary with the purpose for which it is to be used:
Assumptions required, by purpose:
• Model form is correct (Y is linearly related to X): required for all purposes.
• Data used to fit the model are representative of the data of interest: required for all purposes.
• Variance of the residuals is constant (homoscedastic): required to predict Y and a variance for the prediction, to obtain the best linear unbiased estimator of Y, and to test hypotheses or estimate confidence intervals.
• The residuals are independent: required to predict Y and a variance for the prediction, and to test hypotheses or estimate confidence intervals.
• The residuals are normally distributed: required only to test hypotheses or estimate confidence intervals.
Simply predicting Y given X requires only the first two assumptions.
The covariance of β̂_OLS is COV[β̂_OLS] = γ²(XᵀX)⁻¹, where the residual covariance is assumed to be γ²I_N, I_N being the identity matrix.
v) Hypothesis testing can be used to determine whether one set of parameters is better than another, by testing
ŷ_i = β₀ + β₁x_{1i} + … + β_k x_{ki}
versus the extended model
ŷ_i = β₀ + β₁x_{1i} + … + β_k x_{ki} + β_{k+1} x_{(k+1)i} + … + β_m x_{mi},
to determine which model explains more of the sample variance.
6) Trend Analysis
a) Trend analysis fits a linear regression to data, where time is one of the parameters. The slope of the regression should be tested to see whether it is significantly different from zero; if it is, there is evidence to support a trend.
b) More advanced trend tests, like the Seasonal Kendall's Test, will also check for seasonal trends.
c) Correlation
i) Autocorrelation is a measure of how strongly the value at the current time step in a data set is related to values at previous time steps of the same data set. The lag-k autocorrelation is estimated as
r_k = Σ_{t=1}^{N-k} (x_{t+k} - x̄)(x_t - x̄) / Σ_{t=1}^{N} (x_t - x̄)²
[Figure: r(k) versus lag (0-5); the short-memory curve drops off more quickly than the long-memory curve. Example of autocorrelation plots for long and short memory Markov processes.]
ii) Cross-correlation is a measure of how strongly the value at the current time step in one data set, X, is
related to the value of a previous time step in another data set, Y.
(1) The lag-k cross-correlation, ρ_k, can be estimated as the lag-k cross-covariance, γ_{x_{t+k}y_t} = COV(X_{t+k}, Y_t), divided by the lag-0 cross-covariances:

ρ̂_k = r_k = [ (1/N) Σ_{t=1}^{N-k} (x_{t+k} - x̄)(y_t - ȳ) ] / [ (1/N) Σ_{t=1}^{N} (x_t - x̄)² · (1/N) Σ_{t=1}^{N} (y_t - ȳ)² ]^{1/2}, k ≥ 0.
iii) Seasonal correlations can also be computed, where the current season’s dependency on previous seasons
is checked.
iv) Plotting the computed correlation coefficients at various lags generates correlation plots. Correlation values vary from -1 to 1, and the lag-0 correlation is always 1. The rate at which the correlation values drop off with increasing lag indicates how dependent the series is on preceding time steps. The autocorrelation of a lag-1 Markov process decreases as a function of its lag-1 autocorrelation: r_k = r₁^k.
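The autocorrelation estimator used for these plots is short enough to write out directly; the ten-point series below is illustrative:

```python
# Sample autocorrelation at lag k: lag-k covariance about the mean divided
# by the lag-0 variance, so r(0) is always exactly 1.
def autocorr(x, k):
    n = len(x)
    xbar = sum(x) / n
    num = sum((x[t + k] - xbar) * (x[t] - xbar) for t in range(n - k))
    den = sum((xt - xbar) ** 2 for xt in x)
    return num / den

series = [1.0, 1.4, 1.2, 0.8, 0.6, 0.9, 1.3, 1.5, 1.1, 0.7]
print([round(autocorr(series, k), 3) for k in range(4)])
```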
v) The partial autocorrelation function (PACF) for an autoregressive process is defined by

| 1      ρ1     ⋯  ρk-1 | | φk1 |   | ρ1 |
| ρ1     1      ⋯  ρk-2 | | φk2 |   | ρ2 |
| ⋮      ⋮      ⋱  ⋮    | | ⋮   | = | ⋮  |
| ρk-1   ρk-2   ⋯  1    | | φkk |   | ρk |

where φkk is the PACF. The PACF is used to help detect the underlying model form. For an autoregressive model, φkk = 0 for k > p, the order of the model. Therefore the order of the needed AR model can be estimated from a plot of the PACF.
d) Storage-Related Statistics
i) When attempting to model hydrologic time series for simulation of reservoir systems some understanding
of the long term flow must be attained. The Hurst coefficient is a measure of long-term persistence of
flow, based on the cumulative departures from mean flow.
• Let {X} be the sequence of flows: X1, X2, …, XN.
• Let k be the size of a subset of flows from {X}: Xi+1, Xi+2, …, Xi+k. k = 1, 2, …, N.
• Define:
D_k = Σ_{i=1}^{k} X_i - (k/N) Σ_{i=1}^{N} X_i, D_max = max(D_k), D_min = min(D_k), for all subsets of length k.
• The Hurst Range, R_N, for the record length N is R_N = D_max - D_min.
• The Range is then normalized to yield the rescaled range R*_N = R_N/S_N ∝ N^H, where S_N is the standard deviation of the flow sequence and H is known as the Hurst Coefficient.
• Hurst showed that for approximately 900 geophysical time series the Hurst coefficient has an average value of 0.73 and a standard deviation of 0.09. Theoretically, and numerically, H for a Markov model is 0.5. Natural systems thus tend to have a Hurst coefficient higher than such models; this is known as the Hurst phenomenon. One interpretation is that H = 0.5 for short-memory models with a short-term dependence structure, while H > 0.5 for long-memory models with a long-term dependence structure.
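The rescaled range for the full record (k = N only, rather than all subset lengths) can be sketched in Python; white noise is used as illustrative input, for which H near 0.5 would be expected:

```python
import random

# Rescaled range R*_N = R_N / S_N from cumulative departures from the mean.
random.seed(7)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]
n = len(x)

mean_x = sum(x) / n
d, dmax, dmin = 0.0, 0.0, 0.0
for xi in x:
    d += xi - mean_x        # running cumulative departure D_k
    dmax = max(dmax, d)
    dmin = min(dmin, d)

rn = dmax - dmin            # Hurst range R_N
sn = (sum((xi - mean_x) ** 2 for xi in x) / n) ** 0.5
print(rn / sn)              # rescaled range R*_N
```

Estimating H itself requires repeating this over many record lengths N and fitting the slope of log(R*_N) versus log(N).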
e) Time-Series Models
i) A moving average model of order q, MA(q), is defined as y_t = µ + ε_t - Σ_{j=1}^{q} θ_j ε_{t-j}, where µ is the process mean, the ε's are uncorrelated, normally distributed random variables, and the θ's are the model parameters. The most commonly used form of this model is the MA(1): y_t = µ + ε_t - θ₁ε_{t-1}.
ii) An autoregressive or Markov model of order p, AR(p), is defined as y_t = µ + Σ_{i=1}^{p} φ_i(y_{t-i} - µ) + ε_t, where µ is the process mean; the ε's are uncorrelated, normally distributed random variables with zero mean and variance σ_ε²; and the φ's are model parameters. The most commonly used form of this model is the AR(1): y_t = µ + φ₁(y_{t-1} - µ) + ε_t.
iii) Combining the AR(p) and MA(q) models yields the more versatile autoregressive moving average model, ARMA(p,q): y_t = µ + Σ_{i=1}^{p} φ_i(y_{t-i} - µ) + ε_t - Σ_{j=1}^{q} θ_j ε_{t-j}, with p autoregressive parameters, φ_i, and q moving average parameters, θ_j. The most common form of this model is the ARMA(1,1): y_t = µ + φ₁(y_{t-1} - µ) + ε_t - θ₁ε_{t-1}.
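An AR(1) series is easy to simulate, and its lag-1 sample autocorrelation should fall near φ₁ (consistent with r_k = r₁^k for a lag-1 Markov process). The parameters µ = 10 and φ₁ = 0.6 below are illustrative:

```python
import random

# Simulate y_t = mu + phi1*(y_{t-1} - mu) + e_t and check that the lag-1
# sample autocorrelation recovers phi1.
random.seed(8)
mu, phi1, n = 10.0, 0.6, 50000

y = [mu]
for _ in range(n - 1):
    y.append(mu + phi1 * (y[-1] - mu) + random.gauss(0.0, 1.0))

ybar = sum(y) / n
num = sum((y[t + 1] - ybar) * (y[t] - ybar) for t in range(n - 1))
den = sum((yt - ybar) ** 2 for yt in y)
print(num / den)  # near phi1 = 0.6
```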
iv) Model Properties:
(1) The backwards operator, B, is defined as: BjZt = Zt-j.
(2) The moving average parameters are generated using θ(B) = 1 - Σ_{j=1}^{∞} θ_j B^j.
(3) The autoregressive parameters are generated using φ(B) = Σ_{j=0}^{∞} φ_j B^j, where φ₀ = 1. φ is also the partial autocorrelation function, φkk, previously defined. A useful property of the PACF is that φkk = 0 for k > p.
(4) θ(B) = φ⁻¹(B).
8) Probability of Extremes
a) The modeling of extreme events is of major importance to the general population. Levees and bridges need to be able to withstand some of the most extreme flooding events. The probability of extreme events can be used to predict the magnitude of floods with 50-, 100-, … year return periods, from data sets of peaks-over-threshold or annual maxima.
9) Analysis of Variance
a) ANOVA (ANalysis Of VAriance)
i) Analysis of variance (ANOVA) tests k independent groups, or treatments, for similarity in their means
(H0: µ0 = µ1 = … = µk).
                    Treatment
         1     2    ⋯    i    ⋯    k
         y11   y21  ⋯    yi1  ⋯    yk1
         y12   y22  ⋯    yi2  ⋯    yk2
         ⋮     ⋮         ⋮         ⋮
         y1N   y2N  ⋯    yiN  ⋯    ykN
Total    T1.   T2.  ⋯    Ti.  ⋯    Tk.   T..
Mean     ȳ1.   ȳ2.  ⋯    ȳi.  ⋯    ȳk.   ȳ..
ii) The one-way classification analysis-of-variance model is y_ij = µ + α_i + ε_ij, for each y in i = 1…k treatments and j = 1…N measurements per treatment. The model is subject to the constraint Σ_{i=1}^{k} α_i = 0.
iii) To test multiple contrasts, they must be orthogonal: Σ_{i=1}^{k} b_i c_i n_i = 0, where b_i and c_i are coefficients of different contrast functions.
b) Regression Approach to ANOVA
i) The one-way ANOVA in matrix notation is:
| y11 |   | 1  1  0  ⋯  0 |          | ε11 |
| y12 |   | 1  1  0  ⋯  0 |  | µ  |  | ε12 |
|  ⋮  |   | ⋮  ⋮  ⋮     ⋮ |  | α1 |  |  ⋮  |
| y1N |   | 1  1  0  ⋯  0 |  | α2 |  | ε1N |
| y21 | = | 1  0  1  ⋯  0 |  | ⋮  | + | ε21 |
| y22 |   | 1  0  1  ⋯  0 |  | αk |  | ε22 |
|  ⋮  |   | ⋮  ⋮  ⋮     ⋮ |          |  ⋮  |
| y2N |   | 1  0  1  ⋯  0 |          | ε2N |
|  ⋮  |   | ⋮  ⋮  ⋮     ⋮ |          |  ⋮  |
| yk1 |   | 1  0  0  ⋯  1 |          | εk1 |
| yk2 |   | 1  0  0  ⋯  1 |          | εk2 |
|  ⋮  |   | ⋮  ⋮  ⋮     ⋮ |          |  ⋮  |
| ykN |   | 1  0  0  ⋯  1 |          | εkN |
ii) Apply the least square approach (Ab = g, where A = XTX, and b = XTY), to the ANOVA model. The
normal equations are given by:
| Nk  N  N  ⋯  N | | µ̂  |   | T.. |
| N   N  0  ⋯  0 | | α̂1 |   | T1. |
| N   0  N  ⋯  0 | | α̂2 | = | T2. |
| ⋮   ⋮  ⋮      ⋮ | |  ⋮ |   |  ⋮  |
| N   0  0  ⋯  N | | α̂k |   | Tk. |
iii) This matrix is singular (the last k rows of the (k+1)x(k+1) matrix add up to the top row), so the
parameters are not estimable. However, the α’s as used by the ANOVA model are actually deviations of
the treatment means from the overall mean, µ. Therefore testing the equality of population means is
equivalent to testing that the αi’s are all zero.
iv) Using the constraint that all αi’s sum to zero, the matrix form of the model can be rewritten as:
| 0   N  N  ⋯  N | | µ̂  |   | 0   |
| N   N  0  ⋯  0 | | α̂1 |   | T1. |
| N   0  N  ⋯  0 | | α̂2 | = | T2. |
| ⋮   ⋮  ⋮      ⋮ | |  ⋮ |   |  ⋮  |
| N   0  0  ⋯  N | | α̂k |   | Tk. |

which can be solved, yielding the estimating equations µ̂ = T../(Nk) = ȳ.. and α̂_i = T_i./N - T../(Nk) = ȳ_i. - ȳ.., i = 1, 2, …, k.
c) Standard Regression
i) The amount of variance that can be described by a linear regression is determined by the correlation coefficient, r. The value of r² × 100 is the percentage of total variance described by the linear regression. An r of 1 or -1 (r² of 1) indicates that all of the variance is described by a linear model.
10) Bibliography
(1975). Flood Studies Report: Volume I, Hydrological Studies. London, Natural Environment Research
Council.
Benjamin, J. R., and C. Allin Cornell (1970). Probability, Statistics and Decision for Civil Engineers. San Francisco, McGraw-Hill Book Company.
Bunt, L. N., and Alan Barton (1967). Probability and Hypothesis Testing. Toronto, George G. Harrap and Co. Ltd.
Draper, N. R., and H. Smith (1966). Applied Regression Analysis. New York, John Wiley and Sons, Inc.
Maidment, D. R., Ed. (1993). Handbook of Hydrology. San Francisco, McGraw-Hill, Inc.
Meyers, B. L., and Norbert L. Enrick (1970). Statistical Functions: A Source of Practical Derivations Based
on Elementary Mathematics. Kent State, Kent State University Press.
Shaw, E. M. (1983). Hydrology in Practice. London, Van Nostrand Reinhold (UK) Co. Ltd.