
HELSINKI UNIVERSITY OF TECHNOLOGY
Department of Engineering Physics and Mathematics
Systems Analysis Laboratory
Mat-2.4108 Independent Research Projects in Applied Mathematics

Sample Size Requirement for Monte Carlo Simulations Using Latin Hypercube Sampling

Anna Matala 60968U 20.5.2008

Contents
1 Introduction
2 Monte Carlo Integration
3 Latin Hypercube Sampling (LHS)
3.1 Method
3.2 Variance analysis
3.3 Estimate of variance
4 Sample size
5 Examples
5.1 Example of sampling with lognormal distribution
5.2 Example of multinormal distribution
5.3 Example of binary variables
5.4 Discussion
6 Conclusions

1 Introduction

The goal of probabilistic fire risk analysis is to discover the probabilities of the harmful consequences of fire. The varying initial and boundary conditions that lead to the growth of fire are taken into account statistically. The consequences are represented as a limit state function. The probability is calculated by integrating the limit state function over a multidimensional joint distribution. In practice, the integration is done using the Monte Carlo method, because the physical models of fire are non-linear and often very complex. In the Probabilistic Fire Simulator (PFS, developed by VTT) [1] the sampling can be done by Simple Random sampling (also known as Monte Carlo sampling) or Latin Hypercube sampling. The purpose of this work is to study the accuracy of Latin Hypercube sampling and to find a simple way to evaluate the sample size. Estimates and experiences are sought from the literature. Finally, the results are tested in simple simulations.

2 Monte Carlo Integration

Random sampling has been used for solving numerical problems as early as 1777, when Comte de Buffon made his experiment by dropping a needle onto a board ruled with equally spaced parallel lines [2]. If the length of the needle is L and the distance between the lines is d (d > L), the probability that the needle intersects a line is

p = \frac{2L}{\pi d}.    (1)

Later Laplace realized that this method could be used for estimating the value of \pi.

The modern Monte Carlo (SRS) method was born in the 1940s when Stanislaw Ulam, John von Neumann and others started to use random numbers to examine physics from the stochastic perspective. One of the most effective uses of the Monte Carlo method is to evaluate definite integrals which would be too difficult to find analytically [3]. If the problem is

y = \int_a^b f(x) \, dx,    (2)

the solution can be found by the hit-or-miss (acceptance-rejection) method: A box is drawn around the function, from a to b and from 0 to y_0 (y_0 > f(x)). After this, N random points are drawn from a uniform distribution over the box and plotted (as in Figure 1). The number of points that fall below f(x) is counted and denoted N_0. Now the estimate of the integral can be calculated using the equation

y_{est} = \frac{N_0}{N} \, y_0 (b - a).    (3)

The estimate becomes more and more precise as N is increased and converges to the correct answer when N \to \infty. In this example, x had only one dimension for simplicity, but the method can be successfully applied to multidimensional problems as well.
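To make the procedure concrete, here is a minimal hit-or-miss sketch in Python; the integrand, the interval and the bounding height y0 are arbitrary choices for the illustration, not taken from the text.

```python
import random

def hit_or_miss(f, a, b, y0, n):
    """Estimate the integral of f over [a, b] by the hit-or-miss method.

    Assumes 0 <= f(x) <= y0 on [a, b]; y0*(b - a) is the area of the
    bounding box from which points are drawn uniformly.
    """
    hits = 0
    for _ in range(n):
        x = random.uniform(a, b)
        y = random.uniform(0, y0)
        if y <= f(x):                      # the point falls below the curve
            hits += 1
    return (hits / n) * y0 * (b - a)       # equation (3)

# Example: the integral of x^2 over [0, 1] is 1/3; the estimate
# converges to it as n grows.
print(hit_or_miss(lambda x: x**2, 0.0, 1.0, 1.0, 100_000))
```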

Figure 1: Hit-or-miss method in the integration of the curve f(x).

3 Latin Hypercube Sampling (LHS)

3.1 Method

Latin Hypercube Sampling [4]-[6] is a type of stratified Monte Carlo sampling in which all areas of the sample space are represented by input values. To generate a sample of size N from K variables x = [x_1, x_2, ..., x_K]^T with probability density function f(x), the procedure is as follows. The range of each variable is partitioned into N non-overlapping intervals of equal probability 1/N (Figure 2). From each interval one value is selected randomly according to the probability density within the interval. The N values of x_1 are paired in a random manner with the values of x_2 (Figure 3); these pairs are then paired similarly with the values of x_3, and so on, until N samples of the K variables are formed. The cube created this way has N^K cells which cover the sample space. Denote the input vector X = [x_1, ..., x_N]. A sketch of this construction is given below.
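The pairing procedure takes only a few lines of Python; mapping the stratified uniforms through the inverse CDF of a standard normal is just one possible choice of marginal distribution, used here for illustration.

```python
import numpy as np
from scipy.stats import norm

def latin_hypercube(n, k, rng=None):
    """Draw an LHS sample of size n for k independent N(0,1) variables.

    Each variable's range is split into n equal-probability strata;
    one point is drawn uniformly inside each stratum, and the strata
    are permuted independently per variable (the random pairing).
    """
    rng = np.random.default_rng(rng)
    u = np.empty((n, k))
    for j in range(k):
        # stratified uniforms: one point in each interval [i/n, (i+1)/n)
        strata = (np.arange(n) + rng.random(n)) / n
        u[:, j] = rng.permutation(strata)  # random pairing across variables
    return norm.ppf(u)                     # probabilities -> values

X = latin_hypercube(5, 2, rng=1)           # N = 5 samples of K = 2 variables
print(X)
```

The key property is that each column of u hits every probability stratum exactly once, which is what distinguishes LHS from simple random sampling.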

Figure 2: Density function of variable x_i divided into N = 5 non-overlapping intervals.

Figure 3: Pairing x1 with x2 when N=5.

3.2 Variance analysis

Let h(x) be the transformation function of the input x [4],[7]. The estimate of the expectation E(h(x)) is

\bar{h} = N^{-1} \sum_{j=1}^{N} h(x_j).    (4)

The variance of the estimate depends on the sampling method used. For Simple Random sampling the variance is

var(\bar{h}) = N^{-1} var(h(x)),    (5)

and for Latin Hypercube sampling

var(\bar{h}) = N^{-1} var(h(x)) + \frac{N-1}{N} cov(h(x_1), h(x_2)).    (6)

Thus LHS sampling decreases the variance if and only if cov(h(x_1), h(x_2)) < 0. In their article, McKay et al. [4] proved that the covariance is negative whenever h(X) is monotonic in each of its components. However, in many problems the monotonicity does not hold. Stein [7] showed in 1987 that when N \to \infty, the covariance term is asymptotically non-positive. Define the function
g_k(x_k) = \int h(x) \prod_{i=1, i \neq k}^{K} dF_i(x_i)    (7)

and the samples \{X_{jN}\} (where j = 1, ..., N and N = 1, 2, ...). Stein's Theorem 1 shows that if E(h^2) < \infty and N \to \infty,

cov(h(X_{1N}), h(X_{2N})) = \frac{K}{N} E(h)^2 - N^{-1} \sum_{k=1}^{K} \int g_k(x)^2 \, dF_k(x) + o(N^{-1}).    (8)

By Jensen's inequality,

\int g_k(x)^2 \, dF_k(x) \geq \left( \int g_k(x) \, dF_k(x) \right)^2 = (E h)^2,    (9)

so the highest-order term in the expansion of cov(h(x_1), h(x_2)) is non-positive. That means that

\lim_{N \to \infty} N \, cov(h(x_1), h(x_2)) \leq 0.    (10)

This can be written more simply if we define

h_a(x) = \sum_{k=1}^{K} g_k(x_k) - (K - 1) E(h)    (11)

and

r(x) = h(x) - h_a(x).    (12)

The function h_a(x) is the best additive fit to h(x), which means that for any functions h_k,

\int r^2(x) \, dF(x) \leq \int \left( h(x) - \sum_{k=1}^{K} h_k(x_k) \right)^2 dF(x).    (13)

Now an approximation of the variance can be written, for N \to \infty, as

var\left( N^{-1} \sum_{j=1}^{N} h(X_{jN}) \right) = N^{-1} \int r(x)^2 \, dF(x) + o(N^{-1}).    (14)

This holds whenever N is much larger than K, i.e., the sample size is much greater than the number of variables.

3.3 Estimate of variance

A simple way to get an estimate of the variance of the average is called replicated Latin Hypercube sampling [7]. The idea is to produce several independent LHS samples and then compare the variances between the samples. If the number of samples is \alpha and the size of each is M, then N = \alpha M. If

\bar{h}_i = M^{-1} \sum_{j=1}^{M} h(X_j^i),    (15)

then an unbiased estimator for the variance is

var\left( \alpha^{-1} \sum_{i=1}^{\alpha} \bar{h}_i \right) = \frac{1}{\alpha(\alpha - 1)} \left[ \sum_{i=1}^{\alpha} \bar{h}_i^2 - \alpha^{-1} \left( \sum_{i=1}^{\alpha} \bar{h}_i \right)^2 \right].    (16)

This is exactly the same as the sample variance of the mean in normal sampling. Now if \alpha is too small, the estimate of the variance is not precise. On the other hand, increasing \alpha while N is fixed increases this estimator of variance. However, as long as the ratio M/K is large, the increase in variance is small.
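As a sketch, here is the estimator (16) in Python, assuming the replicate means h̄_i have already been computed from \alpha independent LHS runs; the five values in the call are made-up replicate means for illustration.

```python
import numpy as np

def replicated_lhs_variance(replicate_means):
    """Unbiased estimator (16) of the variance of the overall mean,
    given the means of alpha independent LHS replicates."""
    h = np.asarray(replicate_means, dtype=float)
    alpha = len(h)
    # algebraically equivalent to np.var(h, ddof=1) / alpha
    return (np.sum(h**2) - np.sum(h)**2 / alpha) / (alpha * (alpha - 1))

# e.g. alpha = 5 hypothetical replicate means from LHS runs of size M
print(replicated_lhs_variance([3.28, 3.31, 3.30, 3.29, 3.32]))
```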

4 Sample size

In the literature, almost no recommendations or estimates for the sample size of LHS sampling were found. However, it is possible to study published examples and apply them to fire Monte Carlo simulation. In fire simulations, the most limiting factor is time. Often the samples are simulated with the Fire Dynamics Simulator (FDS), which makes the Monte Carlo simulation quite slow. As an example of the order of magnitude, 1000 samples can already require a computation time of several months. Typically the number of variables is less than 10. All the following examples concern LHS sampling unless otherwise mentioned.

Stein [7] gives one example where the ratio M/K = 10. In his example N = 1000, M = 200, K = 20 and \alpha = 5. For fire simulation, this could mean 5 independent samples of size 100, making a total sample size of 500. In his other example there are 6 variables and the estimator of variance is calculated from 100 independent replications of samples of size 100. Now M/K would be nearly 17. This is an unnecessarily high value and clearly impossible for fire simulation. However, according to the first example it is possible to get reliable results by using the values \alpha = 5 and M/K > 10.
Jianfang Jia et al. [8] give an example of an NF-κB signaling pathway model. They did a global sensitivity analysis using 64 independent variables and 1000 samples, so the ratio of samples to variables is 15.6. Applied to 10 variables, this example would suggest a sample size of 156.

As mentioned before, Latin Hypercube sampling converges at least as fast as simple random sampling if N is big enough. Swidzinski et al. [9] go further, stating that LHS can be nearly five times more effective in yield estimation than traditional sampling methods. They justify this with their study at Texas A&M. In the tests, they generated 10 iterations of samples of size 200 using both LHS and simple random sampling. After that, a sample of size 1000 was generated 10 times using simple random sampling. The results showed that the variation in the yield estimate over the 10 iterations was similar when the LHS sample size was 200 and the simple random sample size was 1000. The theory of simple random sampling is better known, so this gives a suggestive size for LHS samples.

Eduard Hofer [10] studied computationally intensive models, where the sample sizes were at most 100 and the samples were generated by Simple Random sampling. As the distributions of the parameters are usually not known, the fractiles of 5% and 95% cannot be immediately obtained, and hence Monte Carlo simulations are used. The resulting distributions are often long-tailed, so the fractile estimates depend strongly on the sample size N. Therefore, the possible impact of the sampling error needs to be quantified. That can be done by calculating statistical tolerance limits (u(%), v(%)), where u is the percentage of the combined influence of the quantified uncertainties and v is the confidence level of it. In other words, with confidence level v(%) we can say that u(%) of the samples are inside the tolerance limits. The tolerance interval is chosen to be from the smallest to the largest observation in the sample.

Hofer gives an example: two-sided statistical tolerance limits (95%, 95%) for a Simple Random sample require only 93 model runs. This is a result of nonparametric statistics and depends only on the percentages u and v, not on the number of uncertainties taken into account. Besides, it does not assume any particular type of underlying distribution. This result is general and really promising from the point of view of fire simulation, especially because LHS sampling can be expected to show less variability than simple random sampling.

Table 1: Sample sizes for two-sided nonparametric tolerance limits. 1 - \alpha is the confidence coefficient and q is the fraction of the population that lies between the smallest and the largest observation of the sample.

1-\alpha \ q   0.5   0.7   0.75  0.8   0.85  0.9   0.95  0.975  0.98  0.99
0.5              3     6     7     9    11    17    34     67     84   168
0.7              5     8    10    12    16    24    49     97    122   244
0.75             5     9    10    13    18    27    53    107    134   269
0.8              5     9    11    14    19    29    59    119    149   299
0.85             6    10    13    16    22    33    67    134    168   337
0.9              7    12    15    18    25    38    77    155    194   388
0.95             8    14    18    22    30    46    93    188    236   473
0.975            9    17    20    26    35    54   110    221    277   555
0.98             9    17    21    27    37    56   115    231    290   581
0.99            11    20    24    31    42    64   130    263    330   662
0.995           12    22    27    34    47    72   146    294    369   740
0.999           14    27    33    42    58    89   181    366    458   920


The solution for this example can be found in Table 1, originally extracted from Conover's book ([11]: Appendix A6). The table shows the sample sizes required for given confidence coefficients and population fractions. If Table 1 does not offer a direct answer, it is possible to use the approximation

N \approx \frac{1}{4} \chi^2_{1-\alpha} \frac{1+q}{1-q} + \frac{1}{2}(r + m - 1),    (17)

where the r-th smallest observation is the lower bound of the tolerance limit and the (N + 1 - m)-th observation (i.e., the m-th largest) is the upper bound. The sample must be sorted into ascending order. r = m = 1 is commonly used, so the lower bound is the smallest observation and the upper bound the largest. \chi^2_{1-\alpha} is the 1 - \alpha quantile of a chi-square random variable with 2(r + m) degrees of freedom. Redoing the example above gives the same result:

N \approx \frac{1}{4} \cdot 9.488 \cdot \frac{1 + 0.95}{1 - 0.95} + \frac{1}{2}(1 + 1 - 1) = 93.008.
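Equation (17) is straightforward to evaluate; a small sketch using scipy's chi-square quantile function reproduces the (95%, 95%) example.

```python
from scipy.stats import chi2

def wilks_sample_size(q, conf, r=1, m=1):
    """Approximation (17) of the sample size for two-sided nonparametric
    tolerance limits: a fraction q of the population lies between the
    r-th smallest and the m-th largest observation with confidence conf."""
    c = chi2.ppf(conf, 2 * (r + m))   # (1 - alpha) quantile, 2(r+m) dof
    return 0.25 * c * (1 + q) / (1 - q) + 0.5 * (r + m - 1)

print(wilks_sample_size(0.95, 0.95))   # about 93.0, matching the example
```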

However, statistical tolerance limits cannot be computed from an LHS sample. The reason is that the sample estimate of the 95% fractile can be significantly below the true 95% fractile for long-tailed distributions. Without statistical tolerance limits it is not possible to know how likely this shortcoming is, and therefore the analysis has to be made using simple random sampling. Of course, once the sample size N has been found this way, it can be considered an upper limit on the real LHS sample size needed.


5 Examples

5.1 Example of sampling with lognormal distribution

The following example concerns samples from a lognormal distribution. The lognormal distribution is chosen because it is important in fire simulations.

Lognormal variables can be written y = e^x, where x is normally distributed with mean \mu and variance \sigma^2. The density function of the lognormal distribution [12] is

f(y) = \frac{1}{y \sigma \sqrt{2\pi}} e^{-\frac{(\ln(y) - \mu)^2}{2\sigma^2}}    (18)

with mean m = e^{\mu + \sigma^2/2} and variance v^2 = e^{2\mu + \sigma^2}(e^{\sigma^2} - 1). The mean and variance of ln(y) (\mu and \sigma^2, respectively) can be calculated backwards from m and v^2 using the following equations:

\mu = \ln(m) - \frac{1}{2} \ln\left( \frac{v^2}{m^2} + 1 \right)    (19)

and

\sigma^2 = \ln\left( \frac{v^2}{m^2} + 1 \right).    (20)

Monte Carlo simulations were made with sample sizes N = 50, 100, 500, 1000, 5000 and 10000, using both Simple Random sampling (SRS) and Latin Hypercube sampling (LHS). The idea was to compare the means and variances of the samples. The lognormal distribution used had mean 27.4 and variance 16, which by equations (19) and (20) give \mu = 3.3000 and \sigma^2 = 0.0211. The sample means and variances are calculated using the equations [13]

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i    (21)

and

s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2.    (22)

This is valid for both SRS and LHS sampling [4].
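A sketch of this experiment in Python: the LHS branch reuses the stratified-uniform construction from Section 3.1, and the random seed is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

def lognormal_params(mean, var):
    """Equations (19)-(20): parameters of ln(y) from the lognormal
    mean and variance."""
    sigma2 = np.log(var / mean**2 + 1)
    return np.log(mean) - sigma2 / 2, sigma2

def sample(n, mu, sigma, lhs=True, rng=None):
    rng = np.random.default_rng(rng)
    if lhs:
        u = rng.permutation((np.arange(n) + rng.random(n)) / n)
    else:
        u = rng.random(n)                    # simple random sampling
    return np.exp(mu + sigma * norm.ppf(u))  # y = e^x, x ~ N(mu, sigma^2)

mu, s2 = lognormal_params(27.4, 16.0)        # mu = 3.3000, sigma^2 = 0.0211
for n in (50, 100, 500, 1000, 5000, 10000):
    y = sample(n, mu, np.sqrt(s2), lhs=True, rng=0)
    print(n, y.mean(), y.var(ddof=1))        # compare to 27.4 and 16
```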


Figure 4: Sample means as a function of sample size N.

Figures 4 and 5 show the sample means and variances with SRS and LHS sampling. In this simple case of only one variable, LHS sampling is better for every N. The differences between the correct and estimated mean and variance values are listed in Table 2.

Table 2: Difference between the correct and estimated mean and variance.

N      mean (SRS)    variance (SRS)  mean (LHS)    variance (LHS)
50     0.169231071   3.549920858     0.039925728   0.068954216
100    0.333796062   2.874371726     0.007375298   0.155918985
500    0.235102161   0.102067908     0.000254559   0.032888354
1000   0.071753115   0.424214694     0.000419834   0.003795222
5000   0.012967127   0.153119925     0.000180192   0.004282639
10000  0.023052468   0.054337911     0.0000689647  0.00197263


Figure 5: Sample variances as a function of sample size N.


5.2 Example of multinormal distribution

Consider next a situation where the variable x depends on five independent, normally distributed variables (with parameters \mu_i, \sigma_i). The x is defined as

x = \sum_{i=1}^{5} x_i,    (23)

where x_i \sim N(\mu_i, \sigma_i). The parameters used in this example are listed in Table 3, and the simulations were made with N = 50, 100, 500, 1000 and 5000. A sum of normally distributed variables is also normally distributed, with parameters \mu = \sum \mu_i and \sigma = \sqrt{\sum \sigma_i^2}. In this example, the mean is 16.5 and the standard deviation \sigma = 5.7879 (variance 33.4998). Figures 6 and 7 show the sample mean and variance with different sample sizes. It seems that LHS is closer to the real parameter values than SRS for each N.

Table 3: Parameters of the variables.

i          1     2     3    4    5
\mu_i      1     2.5   -4   10   7
\sigma_i   0.5   2     5    2    0.5

Study now the distributions of the samples using Pearson's chi-square test, which compares the sample distribution F(x) to some known distribution G(x). The null hypothesis is H_0: F(x) = G(x). To calculate the p-value, the sample has to be divided into m categories. The test statistic is

\chi^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i},    (24)

where O_i is the observed frequency in category i and E_i = N P_i is the expected frequency of category i (calculated from the known distribution). The number of degrees of freedom is df = m - l - 1, where l is the number of parameters of the known distribution. The prerequisites are N > 50, E_i > 2 for all i, and at most 20% of the categories may have E_i < 5. If the p-value is less than the chosen risk level, H_0 is rejected. The risk level is normally 5%.
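As a sketch, here is the test in Python using scipy; the sample is synthetic data drawn from the hypothesised distribution itself, just to show the mechanics, and the class bounds are the ones used in Table 4 below.

```python
import numpy as np
from scipy.stats import norm, chisquare

def pearson_test(sample, bounds, mu, sigma, n_params=2):
    """Pearson chi-square goodness-of-fit test against N(mu, sigma),
    equation (24). `bounds` are the inner class boundaries."""
    edges = np.concatenate(([-np.inf], bounds, [np.inf]))
    observed = np.histogram(sample, bins=edges)[0]
    # expected frequency E_i = N * P_i from the hypothesised distribution
    p = np.diff(norm.cdf(edges, loc=mu, scale=sigma))
    expected = len(sample) * p
    # ddof = number of estimated parameters l, so df = m - l - 1
    return chisquare(observed, expected, ddof=n_params)

rng = np.random.default_rng(0)
x = rng.normal(16.5, 5.7879, size=500)     # hypothetical sample
print(pearson_test(x, [9.09, 12, 14, 16, 18, 20, 23], 16.5, 5.7879))
```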


Figure 6: Sample mean as a function of sample size N.

The null hypothesis is now H_0: F(x) = N(16.5, 5.7879). Divide the distribution into the 8 classes listed in Table 4. The probabilities come from the standard normal distribution table, where z = (x - \mu)/\sigma. It can be seen that both prerequisites are fulfilled, because when N \geq 50, the expected frequency N P_i can be expected to be > 5 in every category.

The \chi^2 values with different sample sizes are presented in Table 5. Now df = 8 - 2 - 1 = 5. In the \chi^2 table the maximum value to be accepted at the 5% risk level is 11.07. In this example, all the \chi^2 values are in the acceptance region, so all the samples are normally distributed with parameters \mu = 16.5 and \sigma = 5.7879. In this case, with five independent variables, a sample size of 50 was enough in both SRS and LHS sampling.


Figure 7: Sample variance as a function of sample size N.

Table 4: Categories of the normal distribution.

Category   x              P
I          (-inf, 9.09)   0.1003
II         [9.09, 12)     0.1174
III        [12, 14)       0.1159
IV         [14, 16)       0.1305
V          [16, 18)       0.1385
VI         [18, 20)       0.1231
VII        [20, 23)       0.1429
VIII       [23, inf)      0.1314


Table 5: \chi^2 values.

N      \chi^2 (SRS)   \chi^2 (LHS)
50     10.37789289    10.21332038
100    3.388478928    6.533513491
500    3.706511017    2.131056413
1000   9.148392622    2.156814332
5000   4.64044467     5.451614406
10000  6.611654417    0.857279318


5.3 Example of binary variables

Consider a nuclear power plant where, in case of fire, the failure time t_1 of some important component is lognormally distributed with parameters \mu_1 and \sigma_1. The failure can be prevented by the fire service if they arrive before the failure happens. The arrival time of the fire service, t_2, is lognormally distributed with parameters \mu_2 and \sigma_2. The interesting quantity is the probability that the failure can be prevented, i.e., that t_2 < t_1. This is the probability p(A) of a Bernoulli distribution, where X = 1 if A happens and X = 0 if not. In a lognormal distribution negative values are not allowed, so the probability P(t_2 < t_1) has to be calculated from the joint distribution. The joint lognormal distribution is

f(t_1, t_2) = f(t_1) f(t_2) = \frac{1}{t_1 \sigma_1 \sqrt{2\pi}} e^{-\frac{(\ln(t_1) - \mu_1)^2}{2\sigma_1^2}} \cdot \frac{1}{t_2 \sigma_2 \sqrt{2\pi}} e^{-\frac{(\ln(t_2) - \mu_2)^2}{2\sigma_2^2}},    (25)

where the \mu_i and \sigma_i are calculated from equations (19) and (20). Now we are interested in the probability of t_2 < t_1, and that can be found by integrating

F(t_2 < t_1) = \int_0^{\infty} \int_0^{t_1} f(t_1) f(t_2) \, dt_2 \, dt_1.    (26)

This can be integrated numerically, e.g., in Matlab using the function trapz.

In this example the means and standard deviations of the two times are m_1 = 12, s_1 = 3, m_2 = 13 and s_2 = 5, so \mu_1 = 2.455, \sigma_1 = 0.246, \mu_2 = 2.496 and \sigma_2 = 0.371. The distributions are plotted in Figure 8. The expected probability, calculated by integrating the joint distribution, has the value 0.463.
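A sketch of this calculation in Python: scipy's trapezoid routines play the role of Matlab's trapz, and the grid endpoint 60 and resolution are arbitrary choices that comfortably cover both densities.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid, trapezoid

def lognormal_params(mean, sd):
    """Equations (19)-(20), with the variance v^2 = sd**2."""
    s2 = np.log(sd**2 / mean**2 + 1)
    return np.log(mean) - s2 / 2, np.sqrt(s2)

def lognorm_pdf(t, mu, sigma):
    return np.exp(-(np.log(t) - mu)**2 / (2 * sigma**2)) \
           / (t * sigma * np.sqrt(2 * np.pi))

mu1, sg1 = lognormal_params(12, 3)   # component failure time t1
mu2, sg2 = lognormal_params(13, 5)   # fire service arrival time t2

t = np.linspace(1e-9, 60, 5001)      # grid covering both densities
f1 = lognorm_pdf(t, mu1, sg1)
f2 = lognorm_pdf(t, mu2, sg2)
# inner integral of (26): F2(t) = P(t2 < t); then integrate f1 * F2
F2 = cumulative_trapezoid(f2, t, initial=0.0)
print(trapezoid(f1 * F2, t))         # approximately 0.463
```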

Confidence intervals for the Bernoulli distribution are calculated as

\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},    (27)

where 1 - \alpha is the confidence level (often 95%), z_{\alpha/2} is the corresponding quantile of N(0,1), \hat{p} is the estimated probability and n is the sample size. This is valid only for simple random sampling. In m independent repetitions, the real value of p should fall inside the confidence interval about (1 - \alpha) m times on average.
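A sketch of equation (27) in Python; the counts used in the call are made up for illustration.

```python
import math

def bernoulli_ci(p_hat, n, z=1.96):
    """Equation (27): normal-approximation confidence interval for a
    Bernoulli probability; z = 1.96 gives the 95 % level."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# hypothetical outcome: 232 successes out of n = 500 SRS runs
print(bernoulli_ci(232 / 500, 500))   # interval contains the true p = 0.463
```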

The simulations were repeated 14 times (averages as a function of sample size N are shown in Figure 9), and the confidence intervals were calculated as in equation (27).

Figure 8: Lognormal distributions f_1(\mu_1, \sigma_1) and f_2(\mu_2, \sigma_2).

The idea is now to compare the confidence intervals of Monte Carlo samples to intervals calculated in the same way for Latin Hypercube samples. These are not valid confidence intervals for an LHS sample, but comparing them can still give information. Choose the 95% confidence level, so z_{0.025} = 1.96. The confidence intervals were calculated for each simulation, and the number of times the real value of p fell inside the confidence interval in the 14 repetitions was counted. The results are listed in Table 6. It would be interesting to test the null hypothesis H_0: P(the real value of p is inside the 95% confidence interval in LHS sampling) = 0.95, but that is not possible for two reasons. First, to make that test, m should be big: the rule is that m\hat{p} \geq 10 and m(1 - \hat{p}) \geq 10. The estimates are near 0.95, so the minimum m would be 200, which is too much considering that in PFS (Probabilistic Fire Simulator) the maximum number of variables is 15 and two variables are needed for each p. This would lead to nearly 30 simulations for each N. However, with a smaller number of repetitions it is possible to get an idea of the result of the test. The other problem is that the test statistic is


Table 6: n is the number of times that p is inside the confidence interval in 14 repetitions.

N      n (SRS)   n (LHS)
25     13        14
50     12        14
100    12        14
500    14        14
1000   12        14
5000   13        14
10000  13        14

z = \frac{\hat{p} - p}{\sqrt{\hat{p}(1 - \hat{p})/m}},    (28)

which goes to infinity if \hat{p} = 1. In that case the null hypothesis would always be rejected.

After 14 repetitions the estimate of the probability is \hat{p} = 1 for each N in LHS sampling. The test statistic is then infinite and the null hypothesis is rejected. For LHS sampling, the 95% confidence intervals are significantly smaller than for SRS sampling, so for every N the LHS sampling gives at least as good results as SRS.


Figure 9: Averages of 14 independent repetitions.


5.4 Discussion

Determining the sample size is not a simple task. Many guides have focused on Simple Random sampling, but only one common rule for Latin Hypercube samples was found in the literature: LHS gives at least as good results as Simple Random sampling when the sample size is big enough. The sample size required may depend on the distribution, the number of variables, or something else. The only way to see whether the sample has converged is to make more simulations and check that the result stays similar. Even this does not work for Latin Hypercube sampling, where the sample size should be known beforehand.

In all three examples the Latin Hypercube samples were closer to the real value for each N. For five independent, normally distributed variables a sample size of 50 was enough to conclude that the sample distribution did not differ significantly from the original distribution. In the last example it was seen that the samples generated with LHS had smaller confidence intervals, so the results of LHS sampling are more accurate.

6 Conclusions

The target of this work was to study the accuracy of Latin Hypercube sampling and to find some guidance on how to choose the sample size N. Very few directions were given in the literature, and in the tests it became clear that general rules are difficult to find. However, both the literature and the examples suggest that Latin Hypercube samples converge with a smaller sample size than simple random samples. The sample sizes needed in simple random sampling can therefore be used as upper limits for the Latin Hypercube sample size.

References

[1] Hostikka, S., Keski-Rahkonen, O. Probabilistic simulation of fire scenarios. Nuclear Engineering and Design 224 (2003) 301-311.

[2] Veach, E. Robust Monte Carlo methods for light transport simulation (1997).

[3] Landau, D., Binder, K. A Guide to Monte Carlo Simulations in Statistical Physics (2000).

[4] McKay, M., Beckman, R., Conover, W. A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics, vol. 21, no. 2, May 1979.

[5] Keramat, M., Kielbasa, R. Modified Latin Hypercube Sampling Monte Carlo (MLHSMC) Estimation for Average Quality Index. Analog Integrated Circuits and Signal Processing, vol. 19, no. 1, April 1999.

[6] Wyss, G., Jorgensen, K. A User's Guide to LHS: Sandia's Latin Hypercube Sampling Software (1998).

[7] Stein, M. Large Sample Properties of Simulations Using Latin Hypercube Sampling. Technometrics, vol. 29, no. 2, May 1987.

[8] Jia, J., Yue, H., Liu, T., Wang, H. Global Sensitivity Analysis of Cell Signalling Transduction Networks Based on Latin Hypercube Sampling Method. Bioinformatics and Biomedical Engineering, 2007.

[9] Swidzinski, J., Chang, K. A novel nonlinear statistical modeling technique for microwave devices. Microwave Symposium Digest, 2000 IEEE MTT-S International (0-7803-5687-X), June 2000, vol. 2, p. 887.

[10] Hofer, E. Sensitivity analysis in the context of uncertainty analysis for computationally intensive models. Computer Physics Communications 117 (1999) 21-34.

[11] Conover, W. Practical Nonparametric Statistics, 2nd ed. (Wiley, New York, 1980).

[12] Råde, L., Westergren, B. Mathematics Handbook for Science and Engineering, 4th edition (1998).

[13] Laininen, P. Sovelletun todennäköisyyslaskun kaavoja ja taulukoita.

