SAMPLING AND THE LAW OF LARGE NUMBERS
The Law of Large Numbers (LLN) tells us it’s possible to estimate
certain information about a population from just the data
measured, calculated, or observed from a sample of the population.
A whitepaper by
John C. Goodpasture, PMP
Managing Principal
Square Peg Consulting, LLC
Sampling and the Law of Large Numbers
Examples for the Project Manager
Analysis by sampling is called ‘drawing an inference’, and the branch of statistics from which it
comes is called ‘inferential statistics’. Drawing an inference is similar to ‘inductive reasoning’.
In both cases, inference and induction, one works from a set of specific observations back to the
more general case, or to the rules that govern the observations.
Risk assessments
There are two risk assessments to be made. Examples in this paper will illustrate these
two assessments.
1. “Margin of error”, which refers to the estimated error around the measurement,
observation, or calculation of statistics within the interval of the sample data, and
Margin of error is the interval expressed as a percentage of the statistic being
estimated:
Margin of error (%) = +/- (interval half-width / estimated statistic) x 100
Because margin of error is a ratio, the risk manager has to be concerned with both
the numerator and the denominator: for a small statistical value [a small denominator]
the interval [the numerator] must likewise be small, and a small interval is achieved by
having a large sample size, N.
2. “Confidence interval”, which refers to the interval within which the true
population parameters are likely to be with a specified probability.
Confidence intervals have their own risk. The principal risk is that the sample
misrepresents the population. If confidence is stated as 95% for some interval, then there
is a 5% chance that the true population parameter lies outside the interval. Consider this
case: a population with a real parameter value of 8 is sampled [of course, the real value
of 8 is unknown to the project team]. Also unknown to the project team, the sample may
be influenced by some infrequent outliers in the population.
From the sample data the sample average may be calculated to be 10. The question is:
what is the quality of this metric value? We will use confidence interval and margin of
error as surrogates for the quality of sample statistics.
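The 5% risk described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normally distributed around the real value of 8, and the standard deviation, the sample size, the number of trials, and the function name `coverage_rate` are hypothetical choices made for the illustration, not values from this paper.

```python
import math
import random
import statistics

def coverage_rate(true_mean=8.0, sigma=4.0, n=50, trials=2000, z=1.96, seed=1):
    """Fraction of simulated samples whose ~95% interval contains true_mean.

    sigma, n, trials, and seed are hypothetical illustration values.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        avg = statistics.fmean(sample)
        # Interval half-width: z x sample sigma / sqrt(N)
        half_width = z * statistics.stdev(sample) / math.sqrt(n)
        if avg - half_width <= true_mean <= avg + half_width:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"Intervals that contain the real value: {rate:.1%}")
print(f"Intervals that miss the real value:    {1 - rate:.1%}")
```

Roughly one interval in twenty should miss the real value, which is the risk the project team accepts when it states 95% confidence.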
Deciding upon the sample size, meaning the value of N, introduces a tension between
the project’s budget and/or schedule managers and the risk managers. Tension is another
word for risk.
Budget managers want to limit the cost of gathering more data than is needed and
thereby limit cost risk; in other words, they want to avoid oversampling.
Risk managers want to limit the impact of not having enough data and thereby
limit functional, feature, or performance risk.
Sampling policy
The risk plan customarily invokes a project management policy regarding the degree of
risk that is acceptable:
“Margin of error” is customarily accepted between +/- 3% and +/- 5%.
“Confidence interval” is customarily stated at a pre-selected confidence level
between 80% and 99%, most commonly 95% or 99%.
The sampling protocol for a given project is designed by the risk manager to support
these policy objectives.
General examples
Below are several population examples that are common in project situations. They fall
into one of two population types, discrete proportions and continuous data.
Project managers and the project office often deal with proportions.
Project control account managers and team leaders often deal with “continuous
data”.
Proportions are category data; in Six Sigma, such data is called ‘attribute data’. For
example, a semiconductor wafer fits either into a category of ‘defect free’ or into another
category of ‘defective’. The metric is the count in each category.
Proportion is often notated as ‘p’ for the proportional count in one category, and ‘1-p’ for
the other. ‘1-p’ is sometimes denoted ‘q’.
Continuous data is descriptive: the data values describe features and attributes, like size,
weight, density, and the like. Collections or sets of continuous data values are
characterized with descriptive statistics, like average weight, or average hours of
experience; and other statistics that can be calculated from data, like standard deviation
and variance.
Six Sigma refers to such populations as having ‘continuous or variable data’ metrics,
referring to the idea that such metrics can be measured on a continuous scale.
Project Estimates
Regardless of the nature of the population, the issues for the project manager are the
same:
Sample size [count of values]: Sample size [count of values in the sample] is driven by
risk tolerance for the possible error in the sample results. A larger count reduces error
possibilities. There are formulas for sample size that take into account risk tolerance.
Margin of error: The margin of error in the estimated statistic improves with increasing
count of data values in the sample.
Population parameter confidence: Confidence that the actual population parameter is
within the sample data interval improves as the interval is made wider for a given number
of sample values. Thus, for a sample of 30 values, the confidence interval for 99%
confidence is wider than for 90% confidence.
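These relationships can be sketched numerically. The example assumes the usual normal-approximation form for continuous data, in which the interval half-width is a multiplier times the sample σ divided by the square root of N, and it uses the rounded multipliers listed in the appendix. The sample σ of 10 and the name `half_width` are hypothetical, chosen only for illustration.

```python
import math

# Rounded multipliers for each confidence level, as listed in the appendix.
MULTIPLIER = {80: 1.3, 90: 1.7, 95: 2.0, 99: 2.7}

def half_width(confidence_pct, n, sample_sigma):
    """Half-width of the interval around a sample average (assumed form)."""
    return MULTIPLIER[confidence_pct] * sample_sigma / math.sqrt(n)

sigma = 10.0  # hypothetical sample standard deviation

# Same sample of 30 values: higher confidence needs a wider interval.
w90 = half_width(90, 30, sigma)
w99 = half_width(99, 30, sigma)
print(f"N = 30, 90% confidence: +/- {w90:.2f}")
print(f"N = 30, 99% confidence: +/- {w99:.2f}")

# Same confidence: four times the sample size halves the interval.
w95_n30 = half_width(95, 30, sigma)
w95_n120 = half_width(95, 120, sigma)
print(f"95% confidence, N = 30:  +/- {w95_n30:.2f}")
print(f"95% confidence, N = 120: +/- {w95_n120:.2f}")
```

Because the half-width shrinks with the square root of N rather than with N itself, each further reduction in error costs progressively more data, which is the source of the tension between budget and risk.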
The project manager elects to sample the data record population to determine the
proportionality, p, of records that are Category-1 so that the scheduling manager has
information to guide project scheduling.
The project risk management plan requires estimates to have 95% confidence for design
parameters, and a margin of error of less than +/- 5% on sample data values.
Sample design: With no a priori hypothesis of the expected proportionality ‘p’, some
iteration may be required. A good starting point is to assume p = 0.5. The risk manager
refers to the chart given in the appendix entitled “Proportion ‘p’ vs +/- Margin of Error
%”, which is a plot of error percentage for a confidence of 95%. From that chart, the risk
manager finds that for a +/- 5% margin of error of ‘p’ with 95% confidence a sample size
greater than 1,000 but smaller than 3,000 is needed.
Solving the margin of error equation for N in fact gives 1,536 as the appropriate starting
point for N.
Starting with N = 1,536, if the first sample returns a ‘p’ value that is 0.5 or greater, the
margin of error is likely less than +/- 5%; no further sampling is required. Otherwise, a
larger sample size is required.
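The starting value of N can be reproduced with a short calculation. This is a sketch that assumes the normal approximation for a proportion, a 95% multiplier of 1.96, and margin of error taken as the interval half-width divided by ‘p’; the function names are hypothetical.

```python
import math

def relative_margin(p, n, z=1.96):
    """Margin of error as a fraction of p (assumed normal approximation)."""
    return z * math.sqrt(p * (1.0 - p) / n) / p

def sample_size_for_proportion(p, margin, z=1.96):
    """Solve relative_margin(p, N) = margin for N."""
    return (z / margin) ** 2 * (1.0 - p) / p

n_start = sample_size_for_proportion(p=0.5, margin=0.05)
print(f"Starting N for p = 0.5 and +/- 5%: {n_start:.1f}")

# With N fixed at 1,536 the margin shrinks as p rises above 0.5
# and grows as p falls below it.
for p in (0.3, 0.5, 0.7):
    print(f"p = {p}: margin of error = +/- {relative_margin(p, 1536):.1%}")
```

The factor (1 - p) / p explains the stopping rule: it equals 1 at p = 0.5 and falls below 1 for larger ‘p’, so a sample sized for p = 0.5 is large enough whenever the observed ‘p’ is 0.5 or greater.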
Sample analysis: Assume the sample returns a value of ‘p’ of 0.7. From the confidence
interval equation for proportions given in the appendix, the 95% confidence interval for
the estimated proportion is calculated to be 67% to 73%, centered on 70%.
Risk management analysis: There is a 5% probability that the proportion ‘p’ is not
within the confidence interval of 67% to 73%. There is not enough information to
forecast whether the proportion ‘p’ is more likely less than 67% or greater than 73%.
From the chart in the appendix for margin of error, the margin of error of the
proportionality value 0.7 is about +/- 4.7%, or +/- 0.032, from 0.668 to 0.732.
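As a cross-check, the interval for the proportion can also be computed directly. The sketch below assumes the common normal-approximation interval for a proportion with a 95% multiplier of 1.96 and the starting sample size of 1,536; the function name is hypothetical. Values computed this way can differ a little from values read off a chart.

```python
import math

def proportion_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion (assumed form)."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

low, high = proportion_interval(p_hat=0.7, n=1536)
print(f"95% interval for p: {low:.3f} to {high:.3f}")
```

Under these assumptions the computed interval falls inside the rounded range of 67% to 73%.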
The project manager elects to sample the pilot population rather than weigh every pilot.
The project risk management plan requires estimates to have 95% confidence for design
parameters and +/- 3% margin of error for sample data statistics.
Sample Frame: From the chart in the appendix entitled “% Margin of Error v N, 95%
Confidence” the risk manager finds that a sample of size 85 is required to meet the +/-
3% policy metric and simultaneously meet the 95% confidence interval metric. So, in this
example, 85 pilots are weighed from a population frame of active duty military pilots,
both men and women.
Assume the sample average is found from the sample data to be 175 lbs [79.4 kg], and
the Sample σ is calculated from the sample data by spreadsheet function. Assume the
Sample σ is calculated to be 25 lbs [11.3 kg].
Sample analysis: From the equation given in the appendix for continuous data, the 95%
confidence interval for the estimated average weight of the pilot population is estimated
to be about +/- 5.4 lbs [+/- 2.4 kg], or from 169.6 to 180.4 lbs [76.9 to 81.8 kg].
Risk management analysis: There is a 5% probability that the average pilot weight is
not within the confidence interval.
The sample average of 175 pounds is estimated to have a margin of error of +/- 3%, or
+/- 5.2 pounds [+/- 2.4 kg].
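The pilot-weight figures can be checked with a few lines of arithmetic. The sketch assumes the 95% interval for continuous data has a half-width of 2 x Sample σ / √N, using the rounded 95% multiplier of 2; the variable names are hypothetical.

```python
import math

n = 85                   # pilots weighed
sample_avg_lbs = 175.0   # sample average
sample_sigma_lbs = 25.0  # sample standard deviation
multiplier_95 = 2.0      # rounded 95% multiplier (assumption noted above)

half_width_lbs = multiplier_95 * sample_sigma_lbs / math.sqrt(n)
low_lbs = sample_avg_lbs - half_width_lbs
high_lbs = sample_avg_lbs + half_width_lbs
margin_pct = 100.0 * half_width_lbs / sample_avg_lbs

print(f"95% interval: +/- {half_width_lbs:.1f} lbs, "
      f"from {low_lbs:.1f} to {high_lbs:.1f} lbs")
print(f"Margin of error: +/- {margin_pct:.1f}%")
```

The half-width of about 5.4 lbs is roughly 3% of the 175 lb average, so a sample of 85 sits at the edge of the policy limit for this sample σ.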
Appendix
The confidence objective, expressed as a %, is read as, for example, 80% confidence
the real population parameter is within the interval, and 20% confidence the real
population parameter is outside the interval.
80% Interval = Sample average +/- (1.3 / √N) x Sample σ [narrowest interval]
90% Interval = Sample average +/- (1.7 / √N) x Sample σ
95% Interval = Sample average +/- (2 / √N) x Sample σ
99% Interval = Sample average +/- (2.7 / √N) x Sample σ [widest interval]
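The multipliers in these formulas appear to be rounded values of the two-sided standard normal quantiles. That reading is an assumption rather than something the paper states; the sketch below compares the two using Python's standard library.

```python
from statistics import NormalDist

# Rounded multipliers used in the interval formulas above.
ROUNDED = {80: 1.3, 90: 1.7, 95: 2.0, 99: 2.7}

def exact_multiplier(confidence_pct):
    """Two-sided standard normal quantile for the given confidence level."""
    tail = (1.0 - confidence_pct / 100.0) / 2.0
    return NormalDist().inv_cdf(1.0 - tail)

exact = {c: exact_multiplier(c) for c in ROUNDED}
for c in sorted(ROUNDED):
    print(f"{c}% confidence: rounded {ROUNDED[c]}, normal quantile {exact[c]:.3f}")
```

Each rounded multiplier is slightly larger than the corresponding quantile, so intervals computed with the rounded values are a little wider, which errs on the cautious side.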
The following is a plot for the margin of error as a function of the sample size, N