
Fitting Distributions to Data

Practical Issues in the Use of Probabilistic Risk Assessment


Sarasota, Florida February 28 - March 2, 1999

Presented by Bill Huber, Quantitative Decisions, PA

Overview
Experiments, data, and distributions
Fitting distributions to data
Implications for PRA: managing risk

Fitting Distributions to Data, March 1, 1999

Objectives
By the end of this talk you should know:
exactly what a distribution is
several ways to picture a distribution
how to compare distributions
how to evaluate which discrepancies are important
how to determine whether a fitted distribution is appropriate for a probabilistic risk analysis

Part I Experiments, data, and distributions



Outcomes
An outcome is the result of an experiment or sequence of observations. Examples:
The results of an opinion poll.
Data from a medical study.
Analytical results of soil samples collected for an environmental investigation.

Sample spaces
A sample space is a collection of possible outcomes. Examples:
The set of answers that could be given by 1,052 respondents to the question, "Do you believe that the Flat Earth theory should be taught to all third graders?"
The set of arsenic concentrations that could be produced by measurements of 38 soil samples.
The set of all groups of people who might be selected for a drug trial.

Events
An event is a set of possible outcomes. Examples:
The event that 5% of respondents answer "yes." This event contains many outcomes because it does not specify exactly which respondents answer "yes."
The event that the average arsenic concentration is less than 20 ppm. This event includes infinitely many outcomes.


Distributions
A distribution describes the frequency or probability of possible events. When the outcomes can be described by numbers (such as measurements), the sample space is a set of numbers and events are formed from intervals of numbers.

An example
Experiment: sample a U.S. adult at random. Measure the skin surface area. The sample space is the set of all surface areas for all U.S. adults. This set (the population) is constantly changing as children become adults and others die. Therefore there is no static population and there is no one distribution that is demonstrably the correct one. At best we can hope to find a succinct mathematical description that approximately agrees with the frequencies at which various skin surface areas will be observed in independent repetitions of this experiment.

Where data come in


To help identify a good distribution, we sample the present population. This means we conduct a small number of independent repetitions of the experiment. The results are the data. But: how do you go about finding a distribution that will describe the frequencies of future repetitions of the experiment? We will probe this issue by picturing and comparing distributions.

Picturing distributions: histograms


One approach is to graph the distribution's value for many tiny, equal-size, nonoverlapping intervals (called bins). The values on the vertical axis are relative frequencies.
[Figure: histogram with relative frequencies on the vertical axis and bins from 8 to 34 on the horizontal axis]

This is a portrait of a square-root normal distribution. It could describe natural variation in skin surface area, for example (units are 1000 cm²).
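The slides do not include code, but the histogram recipe can be sketched in Python. "Square-root normal" means the square root of the quantity is normally distributed; the parameters below are illustrative guesses chosen so the values span roughly 8-34, as in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Square-root normal": the square root of the quantity is normally
# distributed.  The parameters are illustrative, chosen so the values
# span roughly 8-34 (thousands of cm^2) as in the slide's figure.
sqrt_x = rng.normal(loc=4.25, scale=0.55, size=10_000)
x = sqrt_x ** 2

# Relative frequencies over equal-size, nonoverlapping bins.
counts, edges = np.histogram(x, bins=np.arange(8, 36, 2))
rel_freq = counts / len(x)

for lo, f in zip(edges[:-1], rel_freq):
    print(f"{lo:4.0f}-{lo + 2:2.0f}: {f:.3f}")
```

Dividing counts by the sample size is what turns a raw histogram into the relative-frequency picture the slide describes.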

Comparing distributions (1)


The histogram method, although good, has a problem: distributions that are almost the same can look different, depending on the choice of bins. Small random variations are also magnified.
The lower distribution shows the frequencies from 200 measurements of people randomly selected from the upper distribution.
[Figures: the reference histogram (upper) and the histogram of the 200 measurements (lower), both over roughly 8-36]

Comparing sets of numbers


A more powerful way to compare two sets of numbers is to pair them and plot them in two dimensions as a scatterplot.
X    Y
5    7
15   20
30   27
41   39
57   55

How quickly can you determine the relationship among these numbers? How confident are you of your answer?


Comparing sets of numbers


The numbers are closely associated--have the same statistical pattern--when the scatterplot is close to a straight line. This approach also works nicely for comparing distributions--but first we have to find a way to pair off values in the distributions.
X    Y
5    7
15   20
30   27
41   39
57   55

[Figure: scatterplot of Y against X; the points fall close to a straight line]

Comparing distributions (2)


Given: one data set. To compare its distribution to a reference, generate the same number of values from the reference. Pair data and reference values smallest-with-smallest up through largest-with-largest, as illustrated.
Order   Percentile   Reference   Measured
1       0.25%        -2.81       10.0
2       0.75%        -2.43       10.8
3       1.25%        -2.24       11.0
4       1.75%        -2.11       11.2
...
197     98.25%       2.11        29.9
198     98.75%       2.24        30.3
199     99.25%       2.43        34.2
200     99.75%       2.81        35.3

[Figures: histograms of the Reference and Measured distributions]
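A minimal Python sketch of this pairing recipe (the measurements are simulated stand-ins; the plotting positions and the first reference quantile, about -2.81, match the table):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(19, 5, size=200)   # stand-in for 200 measurements

n = len(data)
# Plotting positions (i - 0.5)/n: 0.25%, 0.75%, ..., 99.75% for n = 200.
percentiles = (np.arange(1, n + 1) - 0.5) / n
reference = stats.norm.ppf(percentiles)   # standard normal quantiles
measured = np.sort(data)                  # pair smallest with smallest

# A straight scatterplot of (reference, measured) means the two shapes
# agree; the correlation of the paired values summarizes straightness.
r = np.corrcoef(reference, measured)[0, 1]
print(f"first reference quantile: {reference[0]:.2f}")
print(f"straightness r = {r:.3f}")
```

Plotting `measured` against `reference` gives exactly the probability plot described on the next slide.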

Probability plots
Finally, draw the scatterplot. It is close to a straight line: the data and the reference distribution therefore have the same shape (although one might be shifted and rescaled relative to the other). This is a probability plot.
[Figure: probability plot of the measured values against expected values for the standard normal distribution]

Reading probability plots


To read a probability plot, you apply a magnifying glass to the areas that do not follow the trend. This is a general principle: to characterize data, you provide a simple description of the general mass, and then highlight any discrepant results (they are the interesting ones!).
[Figure: the same probability plot, with the off-trend points at the upper right highlighted]

Statistical magnification
The two largest points are slightly higher than the line. Interpretation: our largest measurements have a slight tendency to be larger than the largest measurements in the reference distribution. (The amount by which they are larger is inconsequential, though.)

Interpretation issues
What do we use for a reference distribution? Why? Which deviations from the reference should concern us? How much of a deviation is important? What risk do we run if a mistake is made in the interpretation? All these questions are interrelated.

Progress update
We have a scientific framework and language for discussing measurements and observations, events and distributions. You have learned to picture distributions using histograms. You have learned to compare (and depict) distributions using probability plots. You have learned to use statistical magnification to evaluate deviations from a reference standard.

[Figures: a histogram and a normal probability plot of measured arsenic concentrations (As, ppm)]

Part II
Fitting distributions to data


An example data set


Let's take a close look at some arsenic measurements of soil samples. What is the first thing you would do with these data?
As, ppm (38 samples, sorted):
3.9    5.2    6.0    7.5    9.9    10.7   15.7   21.9   25.1   31.3
32.9   32.9   36.0   37.6   43.9   45.4   45.4   50.1   51.7   68.9
68.9   78.4   79.9   84.6   89.3   89.3   94.0   111.3  112.8  116.0
116.0  117.5  150.4  172.4  219.4  220.9  222.5  264.8

The first thing to do


Ask why. If you don't know how the data will be used to make a decision or take an action, then any analysis you attempt is likely to be misleading or irrelevant. Do not be tempted to embark on an analysis of data simply because they are there and you have some tools to do it with.

The purpose and its implications


In our example, the arsenic measurements will be used to develop a concentration term for a human health risk assessment. Therefore:
We want to characterize the arithmetic mean concentration.
We do not want to grossly underestimate the mean.
We should focus on characterizing the largest values best.

The second thing to do


Draw a picture.
[Figure: histogram of the arsenic data (AS, ppm, 0-300), with counts on the left axis and proportion per bar on the right]

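Alongside the picture, the arsenic table can be summarized numerically; a short Python check (the list is the data set above, sorted):

```python
import statistics

# The 38 arsenic measurements (ppm) from the table, in sorted order.
arsenic = [
    3.9, 5.2, 6.0, 7.5, 9.9, 10.7, 15.7, 21.9, 25.1, 31.3,
    32.9, 32.9, 36.0, 37.6, 43.9, 45.4, 45.4, 50.1, 51.7, 68.9,
    68.9, 78.4, 79.9, 84.6, 89.3, 89.3, 94.0, 111.3, 112.8, 116.0,
    116.0, 117.5, 150.4, 172.4, 219.4, 220.9, 222.5, 264.8,
]

n = len(arsenic)
mean = sum(arsenic) / n
median = statistics.median(arsenic)
# Right-skewed data: the mean sits well above the median.
print(f"n = {n}, mean = {mean:.1f} ppm, median = {median:.1f} ppm")
```

The median, 60.3 ppm, is the M line of the letter summary in the portrait gallery; the mean exceeding it is the first hint of the right skew visible in every plot.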

A portrait gallery
[Figures: the same arsenic data shown as a stem-and-leaf plot, box-and-whisker plot, kernel density estimate, dot plot, stripe plot, and histogram (Arsenic, ppm, 0-300)]

Letter summary:
        Depth   Lower   Mid     Upper
M       19h             60.3
H       10      31.3    72.1    112.8
E       5h      10.3    85.9    161.4
D       3       6.0     113.5   220.9
X       1       3.9     134.4   264.8

The third thing to do: compare


[Figures: probability plots of the arsenic data (AS) against six reference distributions: Uniform (on a log scale), Exponential, Gamma(2), Beta(1,5), Normal, and Chi-square(3)]

Shades of normal
[Figures: probability plots of the arsenic data against the normal distribution on three scales: untransformed (Normal), cube root (Cube root normal), and logarithmic (Lognormal)]
The middle plot fits best: the distribution is neither normal nor lognormal. But none of the three fit as well as some of the previous reference distributions.
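The straightness of each of these plots can be scored with a probability-plot correlation coefficient. A sketch, using the arsenic data from the earlier table (the comparison, not the exact r values, is the point):

```python
import numpy as np
from scipy import stats

arsenic = np.array([
    3.9, 5.2, 6.0, 7.5, 9.9, 10.7, 15.7, 21.9, 25.1, 31.3,
    32.9, 32.9, 36.0, 37.6, 43.9, 45.4, 45.4, 50.1, 51.7, 68.9,
    68.9, 78.4, 79.9, 84.6, 89.3, 89.3, 94.0, 111.3, 112.8, 116.0,
    116.0, 117.5, 150.4, 172.4, 219.4, 220.9, 222.5, 264.8,
])

n = len(arsenic)
ref = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

# Probability-plot correlation for each candidate scale: the closer r
# is to 1, the straighter the corresponding probability plot.
rs = {}
for name, values in [("normal", arsenic),
                     ("cube root normal", np.cbrt(arsenic)),
                     ("lognormal", np.log(arsenic))]:
    rs[name] = np.corrcoef(ref, np.sort(values))[0, 1]
    print(f"{name:18s} r = {rs[name]:.4f}")
```

The cube-root scale scores higher than the untransformed scale, in line with "the middle fits best."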

A closer look at a good fit


The fit to the upper 75% of data--the large ones, the ones that really count--is beautiful. Yet, the uniform distribution has lower and upper limits. Do you really think the arsenic concentrations at a site would be so definitely limited?
[Figure: probability plot of Arsenic, ppm (log scale) against expected values for the uniform distribution]

A key point, repeated


Keep asking: what effect could a (potential) characterization of the data have on the decision or action? If we describe the example data in terms of a loguniform distribution, then we have decided to treat them as having an upper bound (of about 300 ppm) and are therefore implicitly not considering the possibility that there may be much higher concentrations present. This is usually not a good assumption to make when few measurements are available.

What the many comparisons show


The question is not "what is the best fit?" (So don't go on a distribution hunt!) The issue is to select a reference distribution that:
fits the data well enough
has a basis in theory or empirical experience
manages the risk attached to using the reference distribution for further analysis and decision making.

Measuring "fits well enough"


Recall that statistical magnification explores the differences between the data and the reference distribution. These differences can be graphed.

[Figure: the measured-minus-fitted differences plotted against the theoretical fit]


What it means to fit well enough


Statistical theory provides:
methods to measure these differences
methods to determine the chance that such differences are just random deviations from the reference distribution: Kolmogorov-Smirnov, Anderson-Darling, Shapiro-Wilk.
However, do not become a slave to the P-value: it's a measurement, not a rule.
[Figure: the difference plot again, annotated with P = 20.3428864%]
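All three tests named above are available in scipy; a sketch on simulated data (drawn from the reference itself, so the P-values should usually be unremarkable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(size=200)   # hypothetical data from the reference

# Kolmogorov-Smirnov against a fully specified standard normal.
ks_stat, ks_p = stats.kstest(sample, "norm")
# Shapiro-Wilk test of normality (parameters estimated from the data).
sw_stat, sw_p = stats.shapiro(sample)
# Anderson-Darling, which weights the tails more heavily.
ad = stats.anderson(sample, dist="norm")

print(f"K-S P = {ks_p:.3f}, Shapiro-Wilk P = {sw_p:.3f}")
print(f"Anderson-Darling statistic = {ad.statistic:.3f} "
      f"(critical values: {ad.critical_values})")
```

Each test weights the differences between data and reference differently (Anderson-Darling emphasizes the tails), which is one more reason to treat any single P-value as a measurement, not a rule.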

The role of theory and experience


Observations or measurements may vary for many reasons, including
Accumulated independent random errors not controlled by the experimenter.
Natural variation in the population sampled.

The form of variation can also depend on


The mechanism of sample selection or observation: random, stratified, focused, etc.
What is being observed: representative objects, extreme objects (such as floods), etc.

Both theory and experience often suggest a form for the reference distribution in these cases.

Examples of reference distributions


Normal distribution: variation arises from independent additive errors.
Lognormal: variation arises from independent multiplicative errors. Often observed in concentrations of substances in the environment. (Often confused with mixtures of normal distributions, too!)
Many other well-known mathematical distributions describe extreme events, waiting times between random events, etc.

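The multiplicative-error origin of the lognormal can be seen in a quick simulation (the factor range, number of factors, and scale of 100 are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Each observation is a product of many small independent random
# factors, so its logarithm is a sum of many small independent terms
# and is therefore approximately normal (central limit theorem).
factors = rng.uniform(0.8, 1.25, size=(10_000, 50))
values = 100.0 * factors.prod(axis=1)

logs = np.log(values)
skew_raw = ((values - values.mean()) ** 3).mean() / values.std() ** 3
skew_log = ((logs - logs.mean()) ** 3).mean() / logs.std() ** 3
print(f"skewness of values:      {skew_raw:.2f}")
print(f"skewness of log(values): {skew_log:.2f}")
```

The raw products are strongly right-skewed while their logarithms are nearly symmetric, which is exactly the pattern the arsenic data showed on the lognormal scale.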

Part III
Managing Risk


Managing risk
Begin by considering how discrepancies between reality and the model might affect the decision. Example: the concentration of pollutant due to direct deposition onto plant surfaces is estimated as

Pd = [Dyd + (Fw * Dyw)] * Rp * [1 - exp(-kp * Tp)] / (Yp * kp)

where Dyd = yearly dry deposition rate, Dyw = yearly wet deposition rate, Fw = adhering fraction of wet deposition, Rp = interception fraction of the edible portion, kp = plant surface loss coefficient, Tp = time of the plant's exposure to deposition, and Yp = yield (USEPA 1990: Methodology for Assessing Health Risks Associated With Indirect Exposure to Combustor Emissions).

A large value of a red (italic) variable or a small value of a blue variable creates a large value of Pd.
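The Pd formula is simple enough to check numerically. A sketch; the input values below are purely illustrative, not values from the slide or USEPA 1990:

```python
import math

def plant_deposition(Dyd, Dyw, Fw, Rp, kp, Tp, Yp):
    """Pd = [Dyd + Fw*Dyw] * Rp * [1 - exp(-kp*Tp)] / (Yp * kp)."""
    return (Dyd + Fw * Dyw) * Rp * (1.0 - math.exp(-kp * Tp)) / (Yp * kp)

# Illustrative inputs only -- not values from the slide or USEPA 1990.
base = plant_deposition(Dyd=1.0, Dyw=2.0, Fw=0.6, Rp=0.4,
                        kp=18.0, Tp=0.12, Yp=2.0)
print(f"Pd = {base:.4f}")

# Doubling a "red" variable (Dyd) raises Pd; doubling a "blue"
# variable (Yp) lowers it, matching the color coding on the slide.
print(plant_deposition(2.0, 2.0, 0.6, 0.4, 18.0, 0.12, 2.0) > base)
print(plant_deposition(1.0, 2.0, 0.6, 0.4, 18.0, 0.12, 4.0) < base)
```

Probing each input this way is a direct check of which variables are "red" and which are "blue" before any distributions are fitted.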

Managing risk, continued


Pd = [Dyd + (Fw * Dyw)] * Rp * [1 - exp(-kp * Tp)] / (Yp * kp)

A large value of a red variable or a small value of a blue variable creates a large value of Pd. Suppose these variables are modeled as distributions in a probabilistic risk assessment, and the decision will be influenced by the large values of Pd (the pollutant concentration potentially ingested by people). Then look at the important tail:
Make sure the upper tail fits the data well for a red variable.
Make sure the lower tail fits the data well for a blue variable.

Managing risk: examples


A large value of a red variable or a small value of a blue variable creates a large value of Pd. On the left, a blue variable: the reference distribution (straight line) underestimates the data values; therefore using the reference distribution in a PRA may overestimate the pollutant concentration. Now you evaluate the middle and right pictures for a blue and red variable, respectively.
[Figures: three probability plots of the arsenic data, against normal, normal, and uniform reference distributions, respectively]

Evaluation Checklist
A defensible choice of distribution simultaneously:
Can be effectively incorporated in subsequent analyses; is mathematically or computationally tractable.
Fits the data well at the important tail.
Has a scientific, theoretical, or empirical rationale.

Red flags (any of which is cause for scepticism):


A distribution was fit from a very large family of possible distributions using an automated computer procedure.
The distribution was never pictured.
The distribution is unusual and has no scientific rationale.
The important tail of the distribution deviates from the data in an anti-conservative way.

Conclusion
You should now know
exactly what a distribution is
several ways to picture a distribution
how to compare distributions
how to evaluate which discrepancies are important
how to determine whether a fitted distribution is appropriate for a probabilistic risk analysis.

