Basic Statistics: Cheat Book

Module 1:- Basic Statistics
Measures of central tendency: mean, median, mode. Measures of dispersion: range, mean deviation, quartile deviation and standard deviation. Moments, skewness and kurtosis. Linear correlation, Karl Pearson's coefficient of correlation, rank correlation and linear regression.

MEASURES OF CENTRAL TENDENCY
An average is a single significant figure which sums up the characteristics of a group of figures. An average is called a measure of central tendency since it determines the central value around which the items in a series tend to cluster. Important measures of central tendency are:

(i) Mean
The arithmetic mean is obtained by dividing the sum of the values of the items by the number of items.
Mean = ∑x / n OR ∑fx / n (for a frequency distribution, n = ∑f)
Merits:
• It is rigidly defined
• Easy to calculate
• Simple to understand
• It is based on all observations in a series
• It is less affected by sampling fluctuations
• It is amenable to further algebraic treatment
Demerits:
• Affected by extreme values
• Cannot be calculated from frequency tables with open-end classes
• Sometimes the mean is a value not found in the series, and it may be an absurd value
• Cannot be applied to qualitative data

(ii) Median
The median is the middlemost item in an arranged series. It is called a positional average.
Median = value of the ((n+1)/2)th item
OR
Median = L + ((N/2 - m) × C) / f
where L = lower limit of the median class, m = cumulative frequency of the class preceding it, f = frequency of the median class, C = class interval.
Merits:
• Can be easily calculated
• Simple to understand
• Not affected by extreme values
• Can be correctly estimated from frequency tables with open-end classes
• Can be estimated graphically
• Most appropriate average for qualitative data
Demerits:
• Not based on all the items in a series
• Not capable of further algebraic treatment
• More affected by sampling fluctuations
• If estimated from a small number of items, the median need not be a good representative of the data

(iii) Mode
The mode is the most frequently occurring item in a series.
Mo = value of the item with the highest frequency
OR
Mo = L1 + (C × f2) / (f1 + f2)
where L1 = lower limit of the modal class, f1 = frequency of the class preceding it, f2 = frequency of the class succeeding it, C = class interval.
Merits:
• Can be easily calculated and readily understood
• Not affected by extreme values
• Can be calculated from frequency tables with open-end classes
• It is the most popular value in a series and gives a true representative value
Demerits:
• Not rigidly defined
• Not based on all items in a series
• Cannot be used for further algebraic treatment
• Much affected by sampling fluctuations
• Some series may have more than one mode and others may have no mode at all

MEASURES OF DISPERSION
Dispersion may be defined as the spread of the items in a series around its average. A measure of dispersion can be expressed either in absolute form or in relative form. Absolute measures of dispersion are expressed in the same unit in which the data are collected.
Absolute measures of dispersion include:

(i) Range
Range is the difference between the largest and smallest values in a series.
Range = L - S
where L = largest item, S = smallest item.
Coefficient of range = (L - S) / (L + S)

(ii) Quartile deviation
Quartile deviation is defined as half of the distance between the third and the first quartiles.
QD = (Q3 - Q1) / 2
Q1 is the value of the [(n+1)/4]th item
OR
Q1 = L1 + ((N/4 - m1) / f1) × C
Q3 is the value of the [3(n+1)/4]th item
OR
Q3 = L3 + ((3N/4 - m3) / f3) × C
where, for the respective quartile class, L = lower limit, m = cumulative frequency of the preceding class, f = frequency of the class, C = class interval.

(iii) Mean deviation
Mean deviation is defined as the arithmetic mean of the absolute deviations of items from an average.
MD = ∑|d| / n

(iv) Standard deviation
Standard deviation is the positive square root of the arithmetic mean of the squares of deviations from the arithmetic mean.
SD = √(∑x²/n - (∑x/n)²)
OR
SD = √(∑fx²/n - (∑fx/n)²)
Combined SD = √[(n1(σ1² + d1²) + n2(σ2² + d2²)) / (n1 + n2)]
where d1 and d2 are the differences between each group mean and the combined mean.

A relative measure of dispersion is the ratio of a measure of dispersion to the average from which the deviations are measured.
(i) Coefficient of range = (L - S) / (L + S)
(ii) Coefficient of quartile deviation = (Q3 - Q1) / (Q3 + Q1)
(iii) Coefficient of mean deviation = MD / M, where M is the average from which the deviations are taken
(iv) Coefficient of variation = SD × 100 / mean

MOMENTS
Moments are the arithmetic means of the powers of the deviations in a series, taken either from its mean or from an arbitrary origin.

Central moments
If moments are computed by taking deviations of items from the arithmetic mean M, they are called central moments.
μ1 = ∑(x-M) / n or ∑f(x-M) / n
μ2 = ∑(x-M)² / n or ∑f(x-M)² / n
μ3 = ∑(x-M)³ / n or ∑f(x-M)³ / n
μ4 = ∑(x-M)⁴ / n or ∑f(x-M)⁴ / n

Raw moments
If moments are computed by taking deviations of items from an arbitrary origin a, they are called raw moments.
μ′1 = ∑(x-a) / n or ∑f(x-a) / n
μ′2 = ∑(x-a)² / n or ∑f(x-a)² / n
μ′3 = ∑(x-a)³ / n or ∑f(x-a)³ / n
μ′4 = ∑(x-a)⁴ / n or ∑f(x-a)⁴ / n

SKEWNESS
Lack of symmetry is called skewness. A distribution that is not symmetrical is called a skewed distribution.
Positively skewed distribution:
If the frequency curve has a longer tail to the right, the distribution is positively skewed and Mean > Median > Mode.
Negatively skewed distribution:
If the frequency curve has a longer tail to the left, the distribution is negatively skewed and Mean < Median < Mode.
Measures of skewness
Karl Pearson's coefficient of skewness = (Mean - Mode) / SD
Bowley's coefficient of skewness = [(Q1 + Q3) - 2 × Median] / (Q3 - Q1)

KURTOSIS
Kurtosis measures the degree of peakedness or flatness of a frequency distribution, usually taken relative to a normal distribution.
When a frequency curve is more peaked than the normal curve, it is called leptokurtic. When it is more flat-topped than the normal curve, it is called platykurtic. The normal curve, which is smooth, continuous and bell shaped, is called mesokurtic.
Measure of kurtosis
β = μ4 / (μ2)²
When β = 3, the curve is mesokurtic. When β < 3, the curve is platykurtic. When β > 3, the curve is leptokurtic.
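The Module 1 formulas for the mean, standard deviation, coefficient of variation, central moments and the kurtosis coefficient β can be checked on a small series. The following is a minimal sketch for ungrouped data only; the function names and the sample series are illustrative assumptions, not part of the cheat sheet.

```python
from math import sqrt


def central_moment(values, k):
    """k-th central moment: arithmetic mean of (x - M)**k, M being the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def summary(values):
    """Collect the Module 1 measures for an ungrouped series."""
    n = len(values)
    mean = sum(values) / n
    mu2 = central_moment(values, 2)
    mu4 = central_moment(values, 4)
    sd = sqrt(mu2)  # SD is the positive square root of the second central moment
    return {
        "mean": mean,
        "range": max(values) - min(values),  # L - S
        "sd": sd,
        "cv": sd * 100 / mean,               # coefficient of variation
        "beta": mu4 / mu2 ** 2,              # kurtosis coefficient
    }


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up series for illustration
stats = summary(data)
print(stats)
```

For this series the mean is 5, the SD is 2 and β is below 3, so by the rule above the series is platykurtic.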
CORRELATION
The relationship between two or more variables in a given series is called correlation, and the numerical measurement of the degree of relationship is called the correlation coefficient.

Positive and negative correlation
Positive correlation:
Correlation in the same direction is called positive correlation. If one variable increases the other also increases, and if one variable decreases the other also decreases.
Negative correlation:
Correlation in the opposite direction is called negative correlation. If one variable increases the other decreases, and vice versa.

Perfect correlation
If any change in the value of one variable changes the value of the other variable in a fixed proportion, the correlation between them is said to be perfect. It is indicated numerically as +1 or -1.
Perfect positive correlation:
If the values of both variables move in the same direction in a fixed proportion, the correlation is perfectly positive. It is indicated numerically as +1.
Perfect negative correlation:
If the values of both variables move in opposite directions in a fixed proportion, the correlation is perfectly negative. It is indicated numerically as -1.

Linear and non-linear correlation
Linear:
Correlation is said to be linear if the ratio of change between the two variables is constant.
Non-linear:
Correlation is said to be non-linear if the ratio of change is not constant.

Simple, partial and multiple correlation
In simple correlation, the relationship between two variables is considered.
In partial correlation, we study the relationship between one variable and one of the other variables, holding the remaining variables constant.
In multiple correlation, the relationship between more than two variables is studied simultaneously.

Methods of studying correlation
1. Scatter diagram
A scatter diagram is a visual device for studying the nature of the relationship between two variables.
2. Correlation graph
It is a graphical representation of the relationship between two variables.

Karl Pearson's coefficient of correlation
r = (n∑xy - ∑x × ∑y) / (√(n∑x² - (∑x)²) × √(n∑y² - (∑y)²))

Rank correlation
The correlation coefficient obtained from the ranks is called rank correlation.
Spearman's rank correlation = 1 - 6∑D² / (n(n² - 1))
where D is the difference between the two ranks of an item and n is the number of pairs.

REGRESSION ANALYSIS
Regression analysis means the estimation or prediction of the unknown value of one variable (the dependent variable) from the known value of the other (the independent variable).

Dependent and independent variables
The variable whose value is influenced or is to be predicted is called the dependent variable, and the variable which influences the values or is used for prediction is called the independent variable.

Simple and multiple regression
When there are only two variables, the regression equation obtained is called simple regression. Regression analysis which studies the relationship between more than two variables at a time is called multiple regression.

Linear and non-linear regression
Regression is said to be linear if a unit change in the value of the independent variable always leads to a constant change in the value of the dependent variable.
Regression is non-linear if the ratio of the change in the value of the independent variable to the resultant change in the value of the dependent variable is not fixed.

Regression lines
A regression line is a graphic technique to show the functional relationship between the dependent and independent variables. There are two regression lines:
Regression line y on x
Regression line x on y

Regression equations
Regression equation y on x
Y = a + bX
byx = (n∑xy - ∑x × ∑y) / (n∑x² - (∑x)²)
Regression equation x on y
X = a + bY
bxy = (n∑xy - ∑x × ∑y) / (n∑y² - (∑y)²)
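The correlation and regression formulas of this module share the same sums, which a short computation makes visible. This is a minimal sketch on made-up data; the helper names are hypothetical, the rank function assumes there are no tied values, and the intercept a = mean(y) - b × mean(x) is the usual least-squares choice, which the cheat sheet itself does not state.

```python
from math import sqrt


def _sums(x, y):
    """Return n, ∑x, ∑y, ∑xy, ∑x², ∑y² for paired data."""
    return (len(x), sum(x), sum(y),
            sum(a * b for a, b in zip(x, y)),
            sum(a * a for a in x),
            sum(b * b for b in y))


def pearson_r(x, y):
    n, sx, sy, sxy, sxx, syy = _sums(x, y)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))


def slope_y_on_x(x, y):
    n, sx, sy, sxy, sxx, _ = _sums(x, y)
    return (n * sxy - sx * sy) / (n * sxx - sx ** 2)


def slope_x_on_y(x, y):
    n, sx, sy, sxy, _, syy = _sums(x, y)
    return (n * sxy - sx * sy) / (n * syy - sy ** 2)


def spearman_rho(x, y):
    """Spearman's formula; assumes no ties, so each value has a unique rank."""
    def ranks(values):
        ordered = sorted(values)
        return [ordered.index(v) + 1 for v in values]
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


x = [1, 2, 3, 4, 5]  # illustrative data only
y = [2, 3, 5, 4, 6]
r = pearson_r(x, y)
byx = slope_y_on_x(x, y)
bxy = slope_x_on_y(x, y)
a = sum(y) / len(y) - byx * sum(x) / len(x)  # assumed least-squares intercept
rho = spearman_rho(x, y)
print(r, byx, bxy, a, rho)
```

A useful cross-check is that r² equals byx × bxy, since both slopes use the same numerator as r.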
Module 2:- Probability Theory
Sample space, Events, Different approaches to probability, Addition and multiplication theorems on probability, Independent events, Conditional probability, Bayes' Theorem

PROBABILITY
The probability of an event may be defined as the numerical expression of the likelihood of the occurrence of that event. It is a number lying between 0 and 1.

Approaches to probability
Classical approach
P(A) = m/n
where
m = number of favourable cases
n = total number of outcomes
Relative frequency approach
P(A) = lim (m/n) as n → ∞
Subjective approach
This approach measures the confidence of an individual regarding the occurrence of an event.
Axiomatic approach
It is based on set theory and on the following principles:
P(S) = 1
0 ≤ P(A) ≤ 1
If two events are mutually exclusive, then P(A U B) = P(A) + P(B)

Theorems of probability
Addition theorem
Case 1: when events are mutually exclusive
P(A U B) = P(A) + P(B)
Case 2: when events are not mutually exclusive
P(A U B) = P(A) + P(B) - P(A ∩ B)
Multiplication theorem
Case 1: when events are independent
P(A ∩ B) = P(A) × P(B)
Case 2: when events are dependent
P(A ∩ B) = P(A) × P(B/A)
OR
P(A ∩ B) = P(B) × P(A/B)

Random experiment
An experiment that is repeated under the same conditions, and whose outcome cannot be predicted in any repetition, is called a random experiment.
Random variable
If the value of a variable is determined by the outcome of a random experiment, it is called a random variable.
Probability distribution
It is a schedule that shows the various values of the random variable and the probability associated with each value.
Sample space
The set which contains all the possible outcomes of a random experiment as its elements is the sample space of that experiment.
Events
An event is a subset of the sample space of a random experiment.
Simple event
An event that contains only a single sample point as its element is a simple event.
Compound events
Events having two or more sample points as their elements are called compound events.
Sure event
An event whose occurrence is inevitable is called a sure event.
Impossible event
If an event cannot occur when a random experiment is conducted, it is called an impossible event.
Uncertain event
An event is said to be uncertain if its happening is neither sure nor impossible.
Equally likely events
Two events are said to be equally likely if neither of them can be expected to occur in preference to the other.
Mutually exclusive events
Two events are said to be mutually exclusive if they cannot occur simultaneously.
Partially overlapping events
Two events are partially overlapping if both of them can occur simultaneously.
Exhaustive events
Two events are said to be collectively exhaustive if, taken together, they contain all of the possible outcomes of the random experiment.
Mutually exclusive and exhaustive events
A set of events is said to be mutually exclusive and exhaustive if one of them must occur and only one can occur.
Dependent and independent events
Events are said to be independent if the occurrence or non-occurrence of one event does not influence the probability of the other event.
Events are said to be dependent if the occurrence or non-occurrence of one event influences the probability of the other event.
Complement of an event
The event 'A' and the event 'not A' are called complementary events.
Union of two events
The union of two events A and B, denoted by A U B, is the set of sample points in A, in B or in both.
Intersection of two events
The intersection of two events A and B, denoted by A ∩ B, is the set of sample points common to both.

Conditional probability
The probability of an event A, given that B has happened, is called the conditional probability of A given B and is denoted by P(A/B).
P(A/B) = P(A ∩ B) / P(B)
P(B/A) = P(A ∩ B) / P(A)

Bayes' theorem
Combining the two conditional probability formulas above gives
P(A/B) = P(B/A) × P(A) / P(B)

PERMUTATIONS
The number of permutations of n objects taking r at a time is
nPr = n! / (n-r)!

COMBINATIONS
The number of combinations of n objects taking r at a time is
nCr = n! / (r! (n-r)!)

THEORY OF SAMPLING
Population
A population is the aggregate of all the units under study in any field of enquiry. A population can be finite or infinite. A population containing a finite number of items is known as a finite population. A population having an infinite number of objects is termed an infinite population.
Sample
A sample is a finite subset of a population, selected from it with the objective of investigating its properties. A good sample has the following characteristics: (i) representativeness (ii) adequacy (iii) independence (iv) homogeneity.
Merits:
• More economical, as data are collected from only a part of the population
• Data can be collected and classified more quickly
• A small number of investigators is needed
• If collection of data results in the destruction of units, this is the only possible method
• It can be effectively used to verify the accuracy of data collected on a census basis
Demerits:
• If sample units are not carefully selected, sample results may be inaccurate and misleading
• If the sample size is inadequate, the selected sample may not honestly represent the universe
• Possibility of sampling errors
• The sampling method is of no use if the size of the population is small
• In the absence of well trained and experienced investigators, data collected through sample surveys are not reliable

Types of sampling
1. Probability sampling
Simple random sampling
(a) Lottery method
Under this method, all items are numbered or named on separate slips, which are folded and mixed in a container, and a blind selection of slips is made.
(b) Table of random numbers
Several standard tables of random numbers are available, among which the one prepared by Tippett is the most popular. A page from the table is opened and the item corresponding to the first random number obtained is selected.
Complex random sampling
(a) Systematic sampling
A systematic sample is formed by selecting one unit at random and then selecting additional units at evenly spaced intervals until the sample has been formed.
(b) Stratified sampling
In this method, the population is subdivided into homogeneous groups called strata and sample items are selected from each stratum.
(c) Cluster sampling
In cluster sampling, the given population is divided into small groups or clusters. Some of these clusters are then selected at random and sample items are taken from the selected clusters.
(d) Multi-stage sampling
In this method, sampling is carried out in two or more stages.
(e) Sequential sampling
In this method, a number of items are drawn and tested one after another in sequence until a satisfactory sample lot is obtained.

2. Non-probability sampling
(a) Judgment sampling
In this method, the selection of sample units is based exclusively on the personal judgment of the investigator.
(b) Convenience sampling
A convenience sample is obtained by selecting convenient units of the population for the purpose of data collection.
(c) Quota sampling
In this method, a quota is fixed for collecting samples from each group in a universe according to certain specific traits.
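Three of the probability sampling schemes described in this module (simple random, systematic and stratified) can be sketched with Python's standard `random` module. The population of numbered units, the two strata and the helper names are assumptions made for illustration, and the fixed seed is only there to make the run repeatable.

```python
import random


def systematic_sample(units, k, rng):
    """Pick a random start among the first k units, then every k-th unit after it."""
    start = rng.randrange(k)
    return units[start::k]


def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample of fixed size from each stratum."""
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}


rng = random.Random(42)            # seeded generator, for a repeatable run
population = list(range(1, 101))   # hypothetical units numbered 1..100

# Simple random sampling: the software analogue of the lottery method.
simple = rng.sample(population, 10)

# Systematic sampling with interval k = 10.
systematic = systematic_sample(population, 10, rng)

# Stratified sampling from two hypothetical strata.
strata = {"low": population[:50], "high": population[50:]}
stratified = stratified_sample(strata, 5, rng)

print(simple, systematic, stratified, sep="\n")
```

Cluster sampling would follow the same pattern, with `rng.sample` first choosing whole groups and then drawing units inside the chosen groups.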
Module 3:- Random variables and distribution

Random variables
If the value of a variable is determined by the outcome of a random experiment, it is called a random variable.
Random variables can be classified into two types:
Discrete random variable
If a random variable takes specific discrete values such as 1, 2, 5, ½ etc., it is called a discrete random variable.
Continuous random variable
If a random variable can take any of the infinitely many values between two limits, it is called a continuous random variable. E.g. 0 < X < 10
A probability distribution is a schedule that shows the various values of the random variable and the probability associated with each value.

Mathematical expectation
The expected value (or expectation, or mathematical expectation, or mean) of a random variable is the weighted average of all possible values that the random variable can take. The weights used in computing this average are the probabilities in the case of a discrete random variable, or the densities in the case of a continuous random variable.
If x is a discrete random variable, the expected value is given as:
E(x) = x1p1 + x2p2 + ... + xnpn
This can be written as:
E(x) = ∑ x p
If x is a continuous random variable, the expected value is given as:
E(x) = ∫ x f(x) dx, where f(x) is the probability density function
E(C) = C, where C is a constant
E(X + Y) = E(X) + E(Y)
E(aX + b) = aE(X) + b, where a and b are constants
E(XY) = E(X) × E(Y), where X and Y are independent
Var(X) = E(X²) - [E(X)]²

Discrete probability distributions
Binomial distribution
The binomial distribution is a theoretical distribution which deals with discrete random variables. It is also known as the Bernoulli distribution, after the Swiss mathematician James Bernoulli.
A random variable x is said to follow a binomial distribution with parameters n and p if its probability function is
P(x) = nCx p^x q^(n-x)
where
n = number of trials
p = probability of success
q = probability of failure (1 - p)
x = number of successes
nCx = n! / (x! (n-x)!)
Properties of binomial distribution
Mean = np
Variance = npq
Std Dev = √(npq)

Poisson distribution
The Poisson distribution is a discrete probability distribution discovered by the French mathematician Siméon Denis Poisson.
A random variable X is said to follow a Poisson distribution with parameter λ if its probability distribution is given by
P(x) = (λ^x e^(-λ)) / x!
where
λ > 0
e ≈ 2.718
The Poisson distribution can be used as an approximation of the binomial distribution when (1) n tends to infinity, (2) the probability p tends to zero and (3) np remains finite.
Properties of Poisson distribution
• Mean = λ = np
• Variance = λ
• Std Dev = √λ
• Probability that the event will happen, P = λ/n
• Probability that the event will not happen, Q = 1 - (λ/n)
• The Poisson distribution is positively skewed and leptokurtic

Continuous probability distribution
Normal distribution
The normal distribution is the most popular distribution which deals with continuous random variables. It was introduced by the French-born mathematician Abraham De Moivre, who worked in England, as a limiting form of the binomial distribution.
The total area under a normal curve represents the total probability, i.e. 1.
Properties of a normal distribution
• The normal curve is bell shaped. The perpendicular drawn from the base line through the point x = mean divides the total area under the curve into two equal parts.
• Mean = median = mode
• It is a symmetric distribution; skewness is 0
• The normal curve is mesokurtic (kurtosis = 3)
• The normal curve is continuous from minus infinity to plus infinity
• The first and third quartiles are equidistant from the median: M - Q1 = Q3 - M
• The curve is asymptotic to the x axis
• The two points of inflection of a normal curve occur at x = mean + SD and x = mean - SD
• The area under the normal curve is distributed as follows:
  mean ± SD covers 68.27% of the total area
  mean ± 2SD covers 95.45% of the total area
  mean ± 3SD covers 99.73% of the total area

Standard normal distribution
The standardized form of the normal distribution, with mean 0 and standard deviation 1, is known as the standard normal distribution. A normal variable x can be converted into the standard normal variable Z using the following formula:
Z = (x - m) / SD
where
m = mean
SD = standard deviation

Module 4:- Theory of Estimation
Estimation
Estimation refers to the process by which one makes inferences about a population based on information obtained from a sample.
Point estimator
An estimate of a population parameter given by a single number is known as a point estimate of the parameter. The sample mean (X̄) is an example of a point estimator.
Interval estimator
An estimate of a parameter given by two numbers between which the parameter is expected to lie is called an interval estimate of the parameter.

Properties of a good estimator
Unbiasedness
A statistic β is said to be an unbiased estimator of a parameter B if the expected value of β is B:
E(β) = B
Consistency
A desirable property of a good estimator is that its accuracy should increase as the sample size becomes larger.
Efficiency (minimum variance)
Among two consistent estimators, the estimator with the smaller variance is said to be more efficient.
Sufficiency
An estimator β is said to be a sufficient estimator of B if it contains all the information in the sample about B.
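The unbiasedness condition E(β) = B can be verified exactly on a tiny population by listing every possible sample. The sketch below assumes a made-up population of three values and samples of size 2 drawn with replacement, each equally likely; it compares the sample mean with the sample variance computed with divisor n. The names are illustrative only.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6]  # hypothetical population
n = 2                   # sample size, drawn with replacement


def mean(values):
    """Exact arithmetic mean as a Fraction."""
    return Fraction(sum(values), len(values))


def var_n(values):
    """Variance with divisor n (sum of squared deviations / n)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


samples = list(product(population, repeat=n))  # all 9 equally likely samples

# Expected value of each estimator = average over all possible samples.
expected_sample_mean = mean([mean(s) for s in samples])
expected_sample_var = mean([var_n(s) for s in samples])

print(expected_sample_mean, mean(population))   # both equal: unbiased
print(expected_sample_var, var_n(population))   # differ: biased for this n
```

Under these assumptions the sample mean reproduces the population mean exactly, while the divisor-n variance averages to only half of the population variance, so it is not unbiased for n = 2.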
Maximum Likelihood estimation
It is a method of point estimation with some stronger theoretical properties than the ordinary least squares method. The method of maximum likelihood selects the values of the model parameters that produce a distribution that gives the observed data the greatest probability. The ML estimator of variance is
∑(X - X̄)² / n
The ML method is a large-sample method. It has broad application in that it can be applied to regression models that are non-linear in parameters.

Central Limit Theorem
It can be shown that even when the original population is not normal, if we draw samples of n items from it and obtain the distribution of sample means, the distribution of the sample means becomes more and more normal as the sample size n increases. This is known as the central limit theorem.
Let X1, X2, ..., Xn denote n independent random variables, all of which have the same PDF with mean μ and variance σ². Let X̄ = ∑Xi / n (i.e., the sample mean). Then as n increases indefinitely (i.e., n → ∞),
X̄ ~ N(μ, σ²/n)
That is, X̄ approaches the normal distribution with mean μ and variance σ²/n. Notice that this result holds true regardless of the form of the PDF.
If we take samples of size n from an arbitrary population and calculate their means, the sampling distribution of the mean will approach the normal distribution as the sample size n increases.
The central limit theorem states that, given certain conditions, the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed.

Module 5:- Testing of hypothesis

Null and alternative hypothesis
A statistical statement about a population parameter, assumed before taking the sample, for possible rejection on the basis of the outcome of the sample data is known as a null hypothesis.
A hypothesis is said to be an alternative hypothesis if it is complementary to the null hypothesis.
For instance, a certain drug may reduce the chance of having a heart attack. Possible null hypotheses are "this drug does not reduce the chances of having a heart attack" or "this drug has no effect on the chances of having a heart attack". The test of the hypothesis consists of administering the drug to half of the people in a study group as a controlled experiment. If the data show a statistically significant change in the people receiving the drug, the null hypothesis is rejected.
Example
Given the test scores of two random samples of men and women, does one group differ from the other? A possible null hypothesis is that the mean male score is the same as the mean female score:
H0: μ1 = μ2
where
H0 = the null hypothesis
μ1 = the mean of population 1
μ2 = the mean of population 2

Types of errors
A type I error, also known as an error of the first kind, is the wrong decision that is made when a test rejects a true null hypothesis. The significance level of a test, denoted by α, is equal to the probability of a type I error.
A type II error, also known as an error of the second kind, is the wrong decision that is made when a test fails to reject a false null hypothesis. The rate of the type II error is denoted by β and is related to the power of a test (which equals 1 - β).

Level of significance
The amount of evidence required to accept that an event is unlikely to have arisen by chance is known as the significance level or critical p-value.
The p-value is the probability of observing data at least as extreme as that observed, given that the null hypothesis is true. If the obtained p-value is small, then either the null hypothesis is false or an unusual event has occurred. The significance level is usually denoted by α. Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005) and 0.1% (0.001). If a test of significance gives a p-value lower than the significance level α, the null hypothesis is rejected. Such results are informally referred to as 'statistically significant'.

Critical region
The critical region, or region of rejection, is the set of values of the test statistic for which the null hypothesis is rejected.
The critical value is the threshold value delimiting the regions of acceptance and rejection for the test statistic.
The region of acceptance is the set of values of the test statistic for which we fail to reject the null hypothesis.

t Test for single mean
One-sample t-test
In testing the null hypothesis that the population mean is equal to a specified value μ0, one uses the statistic
t = (x̄ - μ0) / (s / √n)
where x̄ is the sample mean, s is the sample standard deviation and n is the sample size. The degrees of freedom used in this test is n - 1.

t Test for difference of means
The t statistic to test whether the means are different can be calculated as follows (equal group sizes):
t = (X̄1 - X̄2) / (sp × √(2/n))
where
sp = √((s1² + s2²) / 2)
Here sp is the pooled standard deviation, 1 = group one, 2 = group two. The denominator of t is the standard error of the difference between the two means.
For significance testing, the degrees of freedom for this test is 2n - 2, where n is the number of participants in each group.

Paired t-test
Formula:
t = (X̄D - μ0) / (sD / √n)
where X̄D and sD are the mean and standard deviation of the paired differences, μ0 is the hypothesised mean difference and n is the number of pairs.

Chi-square test for goodness of fit
Chi-square is defined as
χ² = ∑ (O - E)² / E
where O is an observed frequency and E the corresponding expected frequency.
If the calculated chi-square is greater than the tabulated chi-square, reject the null hypothesis. Otherwise do not reject it.

F test: test for equality of two population variances
Two-sample F test for equality of variances. F is defined as
F = S1² / S2²
where S1² and S2² are the variances of the two samples.
Arrange so that S1² > S2², and reject H0 when the calculated F is greater than the tabulated F.

Chi Square (Karl Pearson)
It is a statistical technique to test a null hypothesis for possible rejection.
If the calculated χ² is greater than the tabulated χ², reject the null hypothesis.
Uses
• Test of goodness of fit
• Test of independence of attributes
• Test for population variance
• Test for equality of several population proportions

t (Student)
It is a statistical technique to test a null hypothesis for possible rejection.
If the calculated t is greater than the tabulated t, reject the null hypothesis.
Uses
• Test for single mean
• Test for difference between two sample means (paired t test)
• Test of an observed sample correlation coefficient
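The two t statistics described in this module, for a single mean and for the difference of two means with equal group sizes, can be sketched with the standard library. The data are made up and the function names are hypothetical. `statistics.stdev` uses the n - 1 divisor, which is assumed here to be the "sample standard deviation" the text refers to. The tabulated critical value still has to be looked up separately.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1


def two_sample_t_equal_n(group1, group2):
    """Difference-of-means t for two groups of equal size n, with 2n - 2 df."""
    if len(group1) != len(group2):
        raise ValueError("this form of the statistic assumes equal group sizes")
    n = len(group1)
    sp = sqrt((stdev(group1) ** 2 + stdev(group2) ** 2) / 2)  # pooled SD
    t = (mean(group1) - mean(group2)) / (sp * sqrt(2 / n))
    return t, 2 * n - 2


scores_a = [5, 6, 7, 8, 9]  # illustrative data only
scores_b = [3, 4, 5, 6, 7]

t1, df1 = one_sample_t(scores_a, mu0=6)
t2, df2 = two_sample_t_equal_n(scores_a, scores_b)
print(t1, df1)
print(t2, df2)
```

The calculated t is then compared with the tabulated t for the returned degrees of freedom at the chosen significance level.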
F (Fisher)
It is a statistical technique to test a null hypothesis for possible rejection.
If the calculated F is greater than the tabulated F, reject the null hypothesis.
Uses
• Test for equality of several population means
• Test for equality of population variances
• Testing the significance of an observed sample correlation ratio
• Testing the significance of an observed sample multiple correlation
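For the equality-of-variances use listed above, the calculated F is the ratio of the two sample variances with the larger one on top. The sketch below is a minimal illustration on made-up data. The function name is hypothetical, and `statistics.variance` (divisor n - 1) is an assumed choice of sample variance, since the cheat sheet does not spell out the divisor.

```python
from statistics import variance


def f_ratio(sample1, sample2):
    """Return (F, df_numerator, df_denominator) with the larger variance on top."""
    v1, v2 = variance(sample1), variance(sample2)
    if v1 >= v2:
        return v1 / v2, len(sample1) - 1, len(sample2) - 1
    return v2 / v1, len(sample2) - 1, len(sample1) - 1


group1 = [5, 6, 7, 8, 9]    # illustrative data only
group2 = [2, 6, 7, 8, 12]

f_value, df_num, df_den = f_ratio(group1, group2)
print(f_value, df_num, df_den)
```

The null hypothesis of equal variances is rejected when this calculated F exceeds the tabulated F for the two degrees of freedom at the chosen significance level.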
