Lesson 2
Agenda
After completing this course, you will be able to understand:
Confidence Interval
Statistical Methods
Descriptive Statistics
Inferential Statistics
Sample
Population
Estimation
Measures of dispersion
Hypothesis Testing
It is important that the investigator carefully and completely defines the population before
collecting the sample, including a description of the members to be included.
A sample is a group of units selected from a larger group (the population). By studying the sample it
is hoped to draw valid conclusions about the larger group.
A sample is generally selected for study because the population is too large to study in its entirety.
The sample should be representative of the general population. This is often best achieved by
random sampling.
Copyright 2014, Simplilearn, All rights reserved.
Sampling techniques
Probability sampling: Simple Random, Systematic, Stratified, Cluster
Non-Probability sampling: Convenience, Judgmental, Quota, Snowball
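As a rough illustration of three of the probability sampling techniques, the sketch below draws samples of 10 from a hypothetical population of 100 unit IDs. The population, the two strata, the sample size and the seed are all assumptions made for the example, not values from the lesson.

```python
import random

# Hypothetical population: 100 unit IDs split into two strata (illustrative only).
population = list(range(1, 101))
strata = {"A": population[:60], "B": population[60:]}
sample_size = 10

rng = random.Random(42)  # fixed seed so the draw is repeatable

# Simple random sampling: every unit has the same chance of being selected.
simple_sample = rng.sample(population, sample_size)

# Systematic sampling: pick a random start, then take every k-th unit.
k = len(population) // sample_size
start = rng.randrange(k)
systematic_sample = population[start::k]

# Stratified sampling: sample each stratum in proportion to its size.
stratified_sample = []
for units in strata.values():
    share = round(sample_size * len(units) / len(population))
    stratified_sample.extend(rng.sample(units, share))

print(simple_sample)
print(systematic_sample)
print(stratified_sample)
```

Cluster sampling and the non-probability techniques are not shown; they depend on how the groups or respondents are chosen rather than on a random draw of individual units.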
Descriptive Statistics
To describe the sample.
To indicate the error associated with results and graphical output.
Player's Age    No of players
>35             7
25-35           15
15-25           34
[Bar chart: No of players by Player's Age]
Mean
Median
Mode
[Chart: Frequency on the vertical axis, with the Mean, Median and Mode labelled]
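A minimal sketch of the three measures of central tendency, using Python's standard `statistics` module. The list of ages is hypothetical and is not the data behind the chart.

```python
import statistics

# Hypothetical player ages (illustrative only, not taken from the slides).
ages = [18, 21, 21, 24, 27, 30, 36]

mean_age = statistics.mean(ages)      # arithmetic average of all values
median_age = statistics.median(ages)  # middle value once the data is sorted
mode_age = statistics.mode(ages)      # most frequently occurring value

print(mean_age, median_age, mode_age)
```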
Measure of Dispersion
Variance
Standard Deviation
Variance $= 20 / 5 = 4$
Standard Deviation $= \sqrt{4} = 2$
Valid    N       Min     Max       Mean     Standard Deviation
1000     1000    0.90    99.95     11.72    10.36
1000     475     0.00    173.00    13.27    16.90
1000     386     0.00    77.70     14.21    19.07
1000     678     0.00    109.25    13.78    14.08
1000     296     0.00    111.95    11.58    19.72
On average, customers spend the most on equipment rental, but there is a lot of variation in the
amount spent.
Customers with calling card service spend only slightly less, on average, than equipment rental
customers, and there is much less variation in the values.
The real problem here is that most customers don't have every service, so a lot of 0's are being
counted. One solution to this problem is to treat 0's as missing values so that the analysis for each
service becomes conditional on having that service.
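The effect of treating 0's as missing can be sketched as follows. The spend values are hypothetical; a 0 is assumed to mean the customer does not have the service.

```python
import statistics

# Hypothetical monthly spend on one service; 0 means "does not have the service".
spend = [0.0, 0.0, 12.5, 0.0, 30.0, 22.5, 0.0, 15.0]

# Unconditional mean: the zeros drag the average down.
all_mean = statistics.mean(spend)

# Conditional analysis: keep only customers who actually have the service.
subscribers = [x for x in spend if x != 0]
cond_mean = statistics.mean(subscribers)
cond_sd = statistics.stdev(subscribers)

print(all_mean, cond_mean, cond_sd)
```

With the zeros excluded, the mean and standard deviation describe spending among customers who have the service, which is usually the quantity of interest when comparing services.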
Probability Theory
Probability is a branch of mathematics that deals with the uncertainty of an event happening in
the future.
A probability value always lies between 0 and 1, inclusive.
HEAD
TAIL
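Assigning Probabilities
Classical method: based on equally likely outcomes.
E.g.: Rolling a die.
Probabilities can also be assigned from observed frequencies, as in this record of cars used over 60 days:
No. of cars used    No. of days    Probability
...                 3              (3/60)=0.05
...                 10             (10/60)=0.17
...                 16             (16/60)=0.27
...                 15             (15/60)=0.25
...                 9              (9/60)=0.15
...                 7              (7/60)=0.12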
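The probabilities in the table are each day count divided by the total number of days. A short sketch, using the day counts that appear in the numerators of the fractions:

```python
# Day counts taken from the numerators in the table (3/60, 10/60, ...).
days = [3, 10, 16, 15, 9, 7]

total_days = sum(days)

# Probability of each row = its day count divided by the total number of days.
probabilities = [d / total_days for d in days]

for d, p in zip(days, probabilities):
    print(f"{d}/{total_days} = {p:.2f}")
```

Note that 7/60 rounds to 0.12 at two decimal places, so the rounded values add up to slightly more than 1 even though the exact fractions sum to 1.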
Probability Distribution
Probability distribution for a random variable gives information about how the probabilities are
distributed over the values of that random variable.
It is defined by f(x), which gives the probability of each value.
E.g. Suppose we have sales data for AC sales over the last 300 days.
Units sold    No. of days    Probability of units sold, f(x)
0             10             0.03
...           55             0.18
...           150            0.5
...           55             0.18
...           25             0.08
...           5              0.02
[Chart: Probability of units sold, f(x)]
Probability Theory
Binomial Distribution satisfies:
A fixed number of trials
Each trial is independent of the others
(Assume that the conditions of the binomial distribution apply: the outcomes for Amir's purchases are
independent, and the population of chocolate bars is effectively infinite.)
p = 1/6
q = 5/6
2.
3.
4.
= 0.979
5.
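Number of purchase days required so that the probability of success is greater than 0.95:
$P(X \ge 1) > 0.95$
$\Rightarrow 1 - P(X = 0) > 0.95$
$\Rightarrow P(X = 0) = q^n = (5/6)^n < 0.05$
$\Rightarrow n > \ln(0.05) / \ln(5/6) \approx 16.43$ (applying the log function)
$\Rightarrow n = 17$ days.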
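The last step can be checked numerically. This sketch assumes, as stated earlier, p = 1/6 and q = 5/6, and that each day is an independent trial; the variable names are illustrative.

```python
import math

p = 1 / 6          # probability of success on a single day
q = 1 - p          # probability of failure on a single day
target = 0.95      # required probability of at least one success

# P(at least one success in n days) = 1 - q**n, so we need q**n < 1 - target.
n_exact = math.log(1 - target) / math.log(q)
n_days = math.ceil(n_exact)

print(round(n_exact, 2), n_days)
print(1 - q ** n_days)  # probability of at least one success after n_days
```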
Normal Distribution
A Normal distribution is a theoretical model of the whole population.
It is perfectly symmetrical about its central value, the mean, which is represented by zero on the standardized scale.
The mean, median and mode coincide at this central value.
Poisson distribution
Discrete probability distribution for events that happen randomly in time
The following conditions need to be satisfied:
The event results in a success or failure
The average number of successes, λ, is known
Probability of success is proportional to the region/time
Probability of success in an extremely small region/time is almost zero.
Properties: The mean and variance are equal, and both are denoted by λ.
Examples
Average number of houses sold by a company is 5 per day. What is the probability that exactly 4
houses will be sold tomorrow?
Average number of births in a hospital is 2.1 births per hour. What is the probability that there
will be exactly 6 births in the next two hours?
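Both examples can be worked with the Poisson probability mass function, $P(X = k) = e^{-\lambda} \lambda^k / k!$. The sketch below is one way to compute it with the standard library; the helper name `poisson_pmf` is illustrative. For the second example the rate is scaled to the two-hour window (2.1 × 2 = 4.2).

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Example 1: average of 5 houses per day, exactly 4 sold tomorrow.
p_houses = poisson_pmf(4, 5.0)

# Example 2: 2.1 births per hour, so 4.2 on average over two hours; exactly 6 births.
p_births = poisson_pmf(6, 2.1 * 2)

print(round(p_houses, 4), round(p_births, 4))
```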
N       Skewness                   Kurtosis
        Statistic    Std. Error    Statistic    Std. Error
1000    2.966        0.077         14.012       0.155
475     3.465        0.112         26.735       0.224
386     0.756        0.124         0.641        0.248
678     2.150        0.094         7.572        0.187
296     1.359        0.142         3.079        0.282
The equipment-last-month data is more accurate in nature, and its SD is comparatively lower than that of the
other measures.
Conclusion: Equipment is the segment where the telecom company is making more profit than in
the others, so it can invest more there.
Confidence interval
It is a rule that uses sample information to determine an interval that is likely to include a
population parameter.
Suppose random samples are taken repeatedly from the population and an interval is calculated
from each sample in the same way; a certain percentage of those intervals will contain the unknown value.
For example, if 95% of the intervals calculated in that fashion contain the true value of the
unknown parameter, each such interval is said to be a 95% confidence interval for that parameter
(for instance, the population proportion).
Data Requirements
Confidence level
Statistic
Margin of error
Range of the confidence interval = sample statistic ± margin of error.
The uncertainty associated with the confidence interval is specified by the confidence level.
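A minimal sketch of how the three ingredients (confidence level, statistic, margin of error) combine for a sample mean. The data is hypothetical, and a normal (z) critical value is assumed for simplicity; for a sample this small a t critical value would normally be used instead.

```python
import math
import statistics

# Hypothetical sample (illustrative only).
sample = [12.1, 9.8, 11.4, 10.6, 13.0, 10.9, 11.7, 12.3]
confidence = 0.95

mean = statistics.mean(sample)                               # the sample statistic
std_err = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean

# Two-sided critical value from the standard normal distribution (about 1.96 for 95%).
z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z * std_err
interval = (mean - margin, mean + margin)

print(interval)
```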
Tests of Significance
Tests used in assessing the evidence in favor of or against a given assumption
Begins with a Null Hypothesis, H0
Tests either fail to reject the null hypothesis, or reject it in favor of an Alternate Hypothesis, Ha
Two types of tests
One sided tests
Two sided tests
Results decided by calculating the p-value
Interpretation:
If the p-value is less than the significance level α, reject the null hypothesis.
Common values of α are 0.05 and 0.01.
General Assumptions:
The distribution is approximately normal
The samples have approximately equal variances
Tests of Significance
One sample z-test
Two sample z-test
One sample t-test
Two sample t-test
Paired t-test
Chi Squared test
F test - Analysis of Variance (ANOVA)
F test - Regression
                Disease
Exposure    Yes    No    Total
Yes         37     13    50
No          17     53    70
Total       54     66    120
$\chi^2 = \sum \frac{(O - E)^2}{E} = 29.1$
Calculate the degrees of freedom:
(Number of rows − 1) × (Number of columns − 1)
df = (2 − 1) × (2 − 1) = 1
Calculate the p-value from the chi-squared table
For chi-squared value 29.1 and degrees of freedom = 1, from the table, p-value is < 0.001
Interpretation: There is less than a 0.001 chance of obtaining such discrepancies between expected and
observed values if there is no association.
Conclusion: There is an association between the exposure and disease.
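The chi-squared value can be reproduced from the four observed counts (37, 13, 17, 53). The sketch below computes the expected counts from the row and column totals; the p-value shortcut via `math.erfc` is an assumption that holds only for 1 degree of freedom and stands in for the table lookup.

```python
import math

# Observed counts: rows are exposure (yes, no), columns are the two outcome groups.
observed = [[37, 13],
            [17, 53]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if there were no association.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only: P(chi-squared > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 1), df, p_value)
```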
ANOVA
Analysis of Variance, used to compare more than two groups
Extension of the independent t-tests
Factor variable: the variable defining the groups
Marks: 82, 83, 38, 83, 78, 59, 97, 68, 55
Basic Idea : Partition the total variation in the data into the variance between groups and variance
within groups.
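The partition of variation can be sketched with the nine marks listed above. The assignment of marks to groups is not recoverable from the slide, so the grouping below (three groups of three, in listed order) is purely an assumption for illustration, and the resulting F value should not be read as the lesson's result.

```python
marks = [82, 83, 38, 83, 78, 59, 97, 68, 55]

# ASSUMPTION: three groups of three, taken in listed order (illustrative only).
groups = [marks[0:3], marks[3:6], marks[6:9]]

grand_mean = sum(marks) / len(marks)

def mean(values):
    return sum(values) / len(values)

# Total variation: squared deviations of every mark from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in marks)

# Between-group variation: group means compared with the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Within-group variation: marks compared with their own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1
df_within = len(marks) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(ss_total, ss_between + ss_within, f_stat)
```

The first two printed values agree, which is the identity the F test relies on: total variation = between-group variation + within-group variation.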
Parametric: ANOVA
Non-Parametric: Sign Test, Kruskal-Wallis
Correlation
Measure of association between variables
Positive and negative correlation, ranging between +1 and -1
Positive correlation example:
Earning and expenditure
Negative correlation example
Speed and time
Correlation coefficient
r : correlation coefficient
+1 : Perfectly positive
-1 : Perfectly negative
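A small sketch of the Pearson correlation coefficient r for the two examples. All figures are hypothetical; the travel times assume a fixed 240 km trip, and the helper name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical earning and expenditure: expenditure rises with earning.
earning = [20, 30, 40, 50, 60]
expenditure = [15, 22, 31, 38, 47]

# Hypothetical speed (km/h) and travel time (hours) for a 240 km trip.
speed = [20, 40, 60, 80, 120]
travel_time = [12.0, 6.0, 4.0, 3.0, 2.0]

r_positive = pearson_r(earning, expenditure)
r_negative = pearson_r(speed, travel_time)

print(r_positive, r_negative)
```

The speed and time example gives a negative r that is not exactly -1, because r measures linear association and time falls with speed along a curve rather than a straight line.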
Summary
Here is a quick recap of what we have learned in this lesson:
Probability distribution
Quiz
QUIZ 1
a. Mean
b. Median
c. Mode
d. Standard Deviation
Answer: d
Explanation: Standard Deviation is used to measure dispersion and not to measure central tendency.
QUIZ 2
Calculate the mean, median and mode of the following data and choose the right option:
Answer: a.
Explanation: Mean is the average of all the values, median is the middle value and the mode is the most
commonly occurring value.
QUIZ 3
a.
b. +1
c. -1
d.
Answer: b.
Explanation: Correlation coefficient ranges from +1 to -1, with +1 being perfectly positive
and -1 being perfectly negative.
QUIZ 4
Calculate the variance of the following data and choose the right option:
5,10,12,4,8,9,16
a. 15.41
b. 14.41
c. 9.14
d. 12.41
Answer: b.
Explanation: Variance is the average of squared deviations about the mean, given by
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
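The answer can be checked with Python's standard `statistics` module. Here `pvariance` divides by N (the population variance, which matches option b), while `variance` divides by N − 1 and gives a different figure.

```python
import statistics

data = [5, 10, 12, 4, 8, 9, 16]

mean = statistics.mean(data)

# Population variance: average of squared deviations about the mean (divide by N).
population_variance = statistics.pvariance(data)

# Sample variance divides by N - 1 instead, so it is larger.
sample_variance = statistics.variance(data)

print(round(mean, 2), round(population_variance, 2), round(sample_variance, 2))
```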
QUIZ 5
From the research question below, choose the alternative hypothesis from the
following options.
a. = 0
b. > 0
c. < 0
d.
Answer: b.
Explanation: The question forms a one sided hypothesis, checking if the average
temperature has increased, that is, if the tested parameter is > 0
QUIZ 6
Choose the commonly used value for significance level from the values given below
a. 0.1
b. 0.5
c. 1.0
d. 0.05
Answer: d.
Explanation: The commonly used values for the significance level are 0.01 and 0.05.
QUIZ 7
a. Paired t-test
b. Independent t-test
c. Sign Test
d. Analysis of Variance
Answer: c.
Explanation: T-tests and ANOVA are used in parametric testing.
QUIZ 8
Answer: d.
Explanation: Normal distribution. The rest of the conditions are satisfied by the binomial distribution.
QUIZ 9
a. Sample
b. Measure of central tendency
c. Measures of dispersion
d. Hypothesis testing
Answer: d.
Explanation: Hypothesis testing is not a part of descriptive statistics; it is a part of inferential
statistics.
QUIZ 10
a. Median
b. Mode
c. Mean
d. Standard deviation
Answer: c.
Explanation: Mean is used to calculate the central value or average of a given set of
numbers.
QUIZ 11
a. Median
b. Mode
c. Mean
d. Standard deviation
Answer: b.
Explanation: Mode is the value that occurs with the highest frequency in a
given set of numbers.
QUIZ 12
a. Median
b. Mode
c. Mean
d. Standard deviation
Answer: d.
Explanation: Standard deviation is used to measure the dispersion in a given set of numbers.
QUIZ 13
a. 0 and 1
b. -1 and 1
c. Negative and positive
d. Only positive
Answer: a.
Explanation: The probability of an event always lies between 0 and 1, where 0 means the event
cannot occur and 1 means it is certain to occur.
QUIZ 14
a. Skewness
b. Outlier
c. Kurtosis
d. Variance
Answer: c.
Explanation: Kurtosis is mainly used to measure the peakedness of the distribution of a
particular data set.
QUIZ 15
a. Skewness
b. Outlier
c. Kurtosis
d. Variance
Answer: a.
Explanation: Skewness is the measure of deviation from symmetry; a distribution may be left
skewed or right skewed.
QUIZ 16
Answer: b.
Explanation: If the p-value is less than 0.05, i.e., p < 0.05, we reject the null hypothesis.
Thank You