Statistik Data Analysis PDF

Modern Methods of Data Analysis
Lecture II (27.04.10)
Contents:
● Characterize data samples
● Characterize distributions
● Correlations, covariance
Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Reminder: Average of a Sample
● arithmetic mean of data set: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
● weighted mean of data set: $\bar{x}_w = \sum_i w_i x_i \,/\, \sum_i w_i$
● mode: most probable value (peak of the distribution, not necessarily unique)
● median: smallest value that is greater than or equal to at least 50% of the events

The median is often a better choice than the mean because it is more robust against outliers!
● quantiles are defined similarly: the median is the 50% quantile
● truncated mean: useful if the underlying distribution is expected to be asymmetric
Measure the Spread of a Sample
● How to characterize width/spread?
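● First thought: mean deviation from the mean, $\frac{1}{N}\sum_i (x_i - \bar{x})$, but this is always zero by the definition of $\bar{x}$.
● Could consider the average absolute deviation: $\frac{1}{N}\sum_i |x_i - \bar{x}|$
However, the absolute value is hard to handle mathematically.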

Sample Variance
● A much better quantity:
mean square deviation, called sample variance s² or V: $s^2 = V = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$
● For any random variable :

Sample Variance
● For data analysis, preferably loop only once over the data:
mean square minus square of the mean: $V = \overline{x^2} - \bar{x}^2$
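A minimal Python sketch of the single-loop computation, assuming the 1/N normalisation of the sample variance; the function name `single_pass_variance` is illustrative and not part of the lecture.

```python
def single_pass_variance(data):
    """Return (mean, variance) from one loop over the data.

    Uses V = <x^2> - <x>^2 with the 1/N normalisation (assumed convention).
    """
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for x in data:  # single pass: accumulate count, sum and sum of squares
        n += 1
        sum_x += x
        sum_x2 += x * x
    if n == 0:
        raise ValueError("need at least one data point")
    mean = sum_x / n
    return mean, sum_x2 / n - mean * mean


mean, variance = single_pass_variance([4.0, 7.0, 13.0, 16.0])
```

Only three running sums are kept, so the data never have to be stored or read twice.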

Sample Variance
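For large numbers, it is safer to shift the distribution by an
estimated mean $x_0$ before accumulating the sums: $V = \overline{(x - x_0)^2} - \left(\overline{x - x_0}\right)^2$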
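When the values are large compared with their spread, "mean square minus square of the mean" subtracts two huge, nearly equal numbers and loses precision. The sketch below shifts every value by a rough estimate of the mean first. Taking the first data point as that estimate is an assumption made here for illustration, and `shifted_variance` is a hypothetical name.

```python
def shifted_variance(data, shift):
    """Single-pass (mean, variance) computed on the shifted values x - shift."""
    n = 0
    sum_d = 0.0
    sum_d2 = 0.0
    for x in data:
        d = x - shift  # small number if shift is close to the mean
        n += 1
        sum_d += d
        sum_d2 += d * d
    if n == 0:
        raise ValueError("need at least one data point")
    mean_d = sum_d / n
    # The variance is unchanged by the shift; the mean is shifted back.
    return shift + mean_d, sum_d2 / n - mean_d * mean_d


# Values near 1e9 with a spread of order 10 (illustrative numbers).
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]
mean, variance = shifted_variance(data, shift=data[0])
```

Any shift works mathematically; the closer it is to the true mean, the smaller the accumulated numbers and the smaller the rounding error.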

Standard Deviation (RMS), FWHM
● standard deviation σ or RMS (root mean square): $\sigma = \sqrt{V}$
[“standard” is a joke, there are several standards in the literature ...]
● FWHM: full width at half maximum

more robust against outliers, but harder to determine at low
statistics because of fluctuations; for Gaussian distributed events: FWHM = $2\sqrt{2\ln 2}\,\sigma \approx 2.35\sigma$
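The Gaussian factor 2.35 is the rounded value of 2·sqrt(2·ln 2). A short Python check, with illustrative function names, confirms that a Gaussian evaluated at half the FWHM away from its peak is at half its maximum:

```python
import math

# Exact Gaussian relation: FWHM = 2 * sqrt(2 * ln 2) * sigma
FWHM_PER_SIGMA = 2.0 * math.sqrt(2.0 * math.log(2.0))


def gaussian_pdf(x, mu, sigma):
    """Normalised Gaussian density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_fwhm(sigma):
    """Full width at half maximum of a Gaussian with width sigma."""
    return FWHM_PER_SIGMA * sigma


mu, sigma = 5.0, 1.0
half_width = 0.5 * gaussian_fwhm(sigma)
ratio = gaussian_pdf(mu + half_width, mu, sigma) / gaussian_pdf(mu, mu, sigma)
```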

Example:
● Give sample variance, RMS and FWHM:

Expectation Values
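● So far we characterized a given realization of an
experiment (sum over N events) by its sample mean,
sample spread ...
● Now we talk about the mean and spread of a distribution with PDF f(x): $E[x] = \mu = \int x\, f(x)\, dx$
Note: in general the sample mean $\bar{x}$ is not equal to $\mu$.
However, for N → ∞ the sample mean converges to $\mu$ (law of large numbers).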
Variance of a Distribution:
● $V[x] = E[(x-\mu)^2]$
● $V[x] = \int (x-\mu)^2 f(x)\, dx$, with f(x) the PDF
● $V[x] = E[x^2] - \mu^2$
V[x] is the measure of the spread of the distribution,
not of how well the mean is measured!

Example:
[Figure: histograms of samples with N = 100, N = 1000 and N = 10000 events from a distribution with µ = 5, σ = 1]

How to determine uncertainty on the mean?
● $E[\bar{x}]$ = ???
● $V[\bar{x}]$ = ???

Expectation Value of sample mean
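One standard way to obtain this, assuming N measurements $x_i$ that all follow the same distribution with $E[x_i] = \mu$, uses only the linearity of the expectation value:

```latex
E[\bar{x}]
  = E\!\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right]
  = \frac{1}{N}\sum_{i=1}^{N} E[x_i]
  = \frac{1}{N}\, N\mu
  = \mu
```

The sample mean is therefore correct on average, whatever the shape of the distribution.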

Variance of the Sample Mean
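A sketch of the usual derivation, assuming the N measurements are independent and each has variance $\sigma^2$ (independence makes the variance of a sum equal to the sum of the variances):

```latex
V[\bar{x}]
  = V\!\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right]
  = \frac{1}{N^2}\sum_{i=1}^{N} V[x_i]
  = \frac{1}{N^2}\, N\sigma^2
  = \frac{\sigma^2}{N}
\qquad\Rightarrow\qquad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}
```

The spread of the distribution stays $\sigma$, while the uncertainty on the mean falls as $1/\sqrt{N}$.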

m(B0) = 5279.63 ± 0.53 (stat) ± 0.33 (sys) MeV
● CDF has a mass resolution of 16 MeV:
the reconstructed mass of a single B meson is spread
around the true B mass with σ = 16 MeV
● The B mass can be measured with much better precision than the single-event resolution, because the uncertainty on the mean shrinks as $\sigma/\sqrt{N}$
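As a rough illustration in Python, assume the statistical uncertainty is purely σ/√N. This is a simplification: it ignores background, fit details and anything else in the real measurement. Under that assumption one can estimate how many events a 16 MeV resolution needs to reach 0.53 MeV. The function names are hypothetical.

```python
import math


def uncertainty_on_mean(sigma, n_events):
    """Uncertainty of the sample mean for n_events independent measurements."""
    return sigma / math.sqrt(n_events)


def events_needed(sigma, target_uncertainty):
    """Smallest number of events with sigma / sqrt(N) <= target_uncertainty."""
    return math.ceil((sigma / target_uncertainty) ** 2)


RESOLUTION_MEV = 16.0        # single-event mass resolution quoted above
STAT_UNCERTAINTY_MEV = 0.53  # statistical uncertainty quoted above

n_events = events_needed(RESOLUTION_MEV, STAT_UNCERTAINTY_MEV)
achieved = uncertainty_on_mean(RESOLUTION_MEV, n_events)
```

Under this idealisation, of order a thousand signal events are enough to measure the mass about thirty times more precisely than the single-event resolution.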

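Unbiased Estimators:
Unbiased estimator (German: “erwartungstreuer Schätzer”): its expectation value equals the true value
The unbiased estimator for the true mean µ is the sample mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
For N data points, we estimate the true variance V(x) by the
“sample variance s²”:
- If the true mean µ is known: $s^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
- If the true mean is unknown, then an unbiased estimator
for the variance σ² is the “sample variance s²”: $s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$
Beware of the N-1!
“One single value is not enough to determine mean and spread.”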
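The effect of N-1 can be checked exactly on a toy case. The Python sketch below enumerates every equally likely sample of size n from a small discrete distribution and averages both variance estimators. The fair coin with values 0 and 1 is an assumption chosen for illustration, and all names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def true_variance(values):
    """Variance of a distribution that takes each value with equal probability."""
    values = [Fraction(v) for v in values]
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def expected_variance_estimates(values, n):
    """Average the 1/n and 1/(n-1) estimators over all samples of size n."""
    values = [Fraction(v) for v in values]
    total_biased = Fraction(0)
    total_unbiased = Fraction(0)
    count = 0
    for sample in product(values, repeat=n):  # all samples are equally likely
        mean = sum(sample) / n
        squares = sum((x - mean) ** 2 for x in sample)
        total_biased += squares / n
        total_unbiased += squares / (n - 1)
        count += 1
    return total_biased / count, total_unbiased / count


biased, unbiased = expected_variance_estimates([0, 1], n=3)
```

For the coin the true variance is 1/4. The 1/(N-1) estimator averages to exactly 1/4, while the 1/N estimator averages to 1/6, too small by the factor (N-1)/N.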

Solution: Unbiased Estimator for V(x)

Solution: Unbiased Estimators for V(x)
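A sketch of the standard argument for the unknown-mean case, assuming N independent measurements with true mean $\mu$ and variance $\sigma^2$, and using $V[\bar{x}] = \sigma^2/N$:

```latex
\sum_{i=1}^{N} (x_i - \bar{x})^2
  = \sum_{i=1}^{N} (x_i - \mu)^2 - N(\bar{x} - \mu)^2

E\!\left[\sum_{i=1}^{N} (x_i - \bar{x})^2\right]
  = N\sigma^2 - N\,V[\bar{x}]
  = N\sigma^2 - \sigma^2
  = (N-1)\,\sigma^2

\Rightarrow\quad
E\!\left[\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2\right] = \sigma^2
```

One degree of freedom is used up by estimating the mean from the same data, which is why the sum is divided by N-1 rather than N.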

Efficiency of Estimators
● Optimal Estimator: ”optimal” ↔ smallest variance

(Likelihood maximization gives optimal estimator, will
be proven in later lecture)
● Efficiency of Estimator:
“variance of optimal estimator/variance of estimator”
● For a Gaussian distribution the sample mean $\bar{x}$ is the optimal estimator of µ
● non-optimal estimators can nevertheless be more robust, i.e. less sensitive to outliers and to the assumed shape of the distribution
● E.g. the median of a Gaussian distribution has 64% efficiency ($2/\pi \approx 0.64$ for large N)
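The 64% figure can be checked with a small pseudo-experiment study in Python. Efficiency is taken here as the variance of the sample mean divided by the variance of the sample median. The sample size, the number of pseudo-experiments and the seed are arbitrary choices for this sketch.

```python
import random
import statistics


def median_efficiency(n_events=101, n_experiments=4000, seed=1):
    """Estimate V[mean] / V[median] for Gaussian samples by simulation."""
    rng = random.Random(seed)
    means = []
    medians = []
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
        means.append(sum(sample) / n_events)
        medians.append(statistics.median(sample))
    # Spread of each estimator over the pseudo-experiments
    return statistics.pvariance(means) / statistics.pvariance(medians)


efficiency = median_efficiency()
```

The result fluctuates by a few percent from seed to seed and should come out near 2/π ≈ 0.64 for large samples.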

Symmetric truncated Mean
● truncated mean (German: “getrimmter Mittelwert”):
– e.g. r = 40% truncated mean:
● the 10% lowest and 10% highest values are
ignored; calculate the mean of the central 80%
(a fraction 2r) of the values
– r = 50% truncated mean -> arithmetic mean
– r -> 0% -> median
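A minimal Python sketch of the symmetric truncated mean. It assumes that r is the fraction of values kept on each side of the centre, so that r = 0.5 keeps everything and r -> 0 approaches the median. The rounding of the number of dropped values is a choice made for this sketch, and `truncated_mean` is a hypothetical name.

```python
import statistics


def truncated_mean(data, r):
    """Mean of the central fraction 2*r of the sorted data (0 <= r <= 0.5)."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("r must lie between 0 and 0.5")
    values = sorted(data)
    n = len(values)
    # Number of values dropped at each end; the small offset guards against
    # floating-point results such as 0.9999999999999998 instead of 1.
    cut = int((0.5 - r) * n + 1e-9)
    kept = values[cut:n - cut]
    if not kept:  # everything was cut away: fall back to the median
        return statistics.median(values)
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # one large outlier
robust = truncated_mean(data, 0.4)  # drops the lowest and highest value
plain = truncated_mean(data, 0.5)   # ordinary arithmetic mean
```

With the outlier in the example, the r = 0.4 truncated mean stays at the centre of the bulk of the data while the arithmetic mean is pulled far away.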

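[Figure: efficiency of the truncated mean as a function of r for different distributions, including Laplace (double exponential) and Cauchy]
The r = 0.23 truncated mean is the best estimator
for an unknown symmetric distribution.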
Moments
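● r-th algebraic moment: $\mu'_r = E[x^r]$
● r-th central moment: $\mu_r = E[(x-\mu)^r]$
Expectation value: 1st algebraic moment

Variance: 2nd central moment
Skewness (German: “Schiefe”): $\gamma_1 = \mu_3/\sigma^3$
- positive for distributions with a longer right tail
Kurtosis (German: “Wölbung”): $\gamma_2 = \mu_4/\sigma^4 - 3$
- measure of the weight of the tails relative to the core
- positive kurtosis: longer tails than a Gaussian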
Skewness & Kurtosis
[Figure: example distributions with kurtosis < 0 and kurtosis > 0]
A Gaussian distribution has kurtosis = 0
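A short Python sketch of sample skewness and kurtosis built from central moments. It assumes the conventions skewness = m3 / m2^1.5 and kurtosis = m4 / m2² - 3, where the subtraction of 3 makes a Gaussian come out at zero, and it uses 1/N normalised moments. The function names are illustrative.

```python
def central_moment(data, r):
    """r-th central moment of the sample (1/N normalisation)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** r for x in data) / n


def skewness(data):
    """Third central moment in units of sigma^3."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Fourth central moment in units of sigma^4, minus 3 (zero for a Gaussian)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


flat = [1, 2, 3, 4, 5]           # symmetric, no tails
right_tailed = [1, 1, 1, 1, 10]  # long tail to the right
```

The flat sample is symmetric (skewness 0) and has negative kurtosis, while the right-tailed sample has positive skewness.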

Which fraction of events is within 1, 2, 3 σ?
1σ: 68.27%
2σ: 95.45%
3σ: 99.73%
4σ: 99.9937%
This is only true for Gaussian distributions!

Bienaymé-Tchebycheff Inequality
For every distribution with finite variance the following inequality is valid: $P(|x-\mu| \ge k\sigma) \le 1/k^2$
Fraction of events outside kσ:
k    Gauss      Tchebycheff
1    0.317      1.0
2    0.0455     0.25
3    0.0027     0.1111
4    0.000063   0.0625
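The two columns can be reproduced with a few lines of Python: the Gaussian fraction outside kσ follows from the complementary error function, and the distribution-independent bound is 1/k². The function names are illustrative.

```python
import math


def gaussian_tail(k):
    """P(|x - mu| >= k * sigma) for a Gaussian distribution."""
    return math.erfc(k / math.sqrt(2.0))


def chebyshev_bound(k):
    """Upper limit on P(|x - mu| >= k * sigma) valid for any distribution."""
    return min(1.0, 1.0 / k ** 2)


for k in (1, 2, 3, 4):
    print(f"{k}  {gaussian_tail(k):.6f}  {chebyshev_bound(k):.4f}")
```

The bound is far from tight for a Gaussian, which is the price for not assuming anything about the shape of the distribution.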

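Solution: Bienaymé-Tchebycheff Inequality
Given a PDF f(x) and a positive function w(x) ≥ 0: $P(w(x) \ge C) \le E[w(x)]/C$ for any C > 0, because $E[w] = \int w(x) f(x)\,dx \ge C \int_{w(x) \ge C} f(x)\,dx$
with $w(x) = (x-\mu)^2$ and $C = k^2\sigma^2$: $P(|x-\mu| \ge k\sigma) \le \sigma^2/(k^2\sigma^2) = 1/k^2$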

Two Dimensional Distributions
Multiple ways to visualize 2-dim distributions
● box plot
● lego plot
● surface plot
● numbers
● scatter plot
● color map
● contour plot
● ...

Two dimensional Distributions
● straightforward generalization of 1-dim PDFs
A 2-dim PDF is a function f(x,y) ≥ 0 with $\int\!\!\int f(x,y)\,dx\,dy = 1$

Marginal Distributions
● Marginal distributions (German: “Randverteilungen”): projections onto the axes, $f_x(x) = \int f(x,y)\,dy$ and $f_y(y) = \int f(x,y)\,dx$

Conditional Probability
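● conditional PDF of y for a given x: $f(y|x) = f(x,y)/f_x(x)$, with $f_x(x) = \int f(x,y)\,dy$ the marginal distribution in x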
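A discrete analogue may help to fix ideas: on a grid of joint probabilities the marginal distributions are row and column sums, and a conditional distribution is one row divided by its marginal. The 2 x 3 table below is made up for illustration and the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint probabilities P(x, y); rows index x, columns index y.
joint = [
    [Fraction(1, 10), Fraction(2, 10), Fraction(1, 10)],
    [Fraction(3, 10), Fraction(1, 10), Fraction(2, 10)],
]


def marginal_x(table):
    """P(x): sum over y, i.e. over each row."""
    return [sum(row) for row in table]


def marginal_y(table):
    """P(y): sum over x, i.e. over each column."""
    return [sum(column) for column in zip(*table)]


def conditional_y_given_x(table, ix):
    """P(y | x): the row for x divided by its marginal probability."""
    p_x = sum(table[ix])
    return [p / p_x for p in table[ix]]


p_x = marginal_x(joint)
p_y = marginal_y(joint)
p_y_given_x0 = conditional_y_given_x(joint, 0)
```

Each conditional distribution is normalised to one by construction, just like the continuous conditional PDF.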

Exercise:
● Compute
● Compute
