- Sonal Ghanshani
Statistics
Why Statistics?
▪ Data are everywhere
▪ Statistical techniques are used to make many decisions that affect our lives
▪ No matter what your career, you will make professional decisions that involve data. An
understanding of statistical methods will help you make these decisions effectively
What is Statistics?
▪ Statistics is the science of assigning a probability to an event based on experiments. It is the
application of quantitative principles to the collection, analysis, and interpretation of numerical
data.
▪ Statistics presents a rigorous scientific method for gaining insight into data.
▪ For example, suppose we are studying the ages of engineering graduates in a research study.
▪ With so many measurements, simply looking at the data fails to provide an informative account.
▪ However, statistics can give an instant overall picture of the data through graphical presentation or
numerical summarization, irrespective of the number of data points.
▪ Besides data summarization, another important task of statistics is to make inferences and predict
relationships between variables.
How Does Statistics Work?
▪ Statistics utilizes data from a population to analyse and draw conclusions. Based on these
conclusions, a decision is taken.
▪ The population may be a community, an organization, sales details, weather details, etc.
▪ Statisticians determine the quantitative model that suits a given type of problem. Then
they decide the kind of data that should be collected and examined.
Types of Statistics
[Diagram: Statistical Methods branch into Descriptive and Inferential]
▪ Descriptive statistics
• Methods of organizing, summarizing, and presenting data in an informative way
▪ Inferential statistics
• The methods used to determine something about a population on the basis of a sample
o Population – The entire set of individuals or objects of interest, or the measurements obtained from all
individuals or objects of interest
o Sample – A portion, or part, of the population of interest
Descriptive statistics
Collect Data
Summarize Data
Present Data
Population and Sample
▪ A population is the set of all measurements of interest to the study.
▪ A sample is a selected subset of measurements of a population to represent the
population.
Population and Sample
▪ Market Share of a Product
• For example, suppose you need to estimate the market share of a specific detergent product, say, Tide
• The population here is the entire market, i.e. all supermarkets/shops
• The sample is a set of supermarkets/shops
• Market share is calculated on the sample, not the population
Sources of Data
▪ Primary Data
• Surveys
o Mail: Lowest rate of response, usually the lowest cost
o Web: Faster response and inexpensive
o Telephone: Fastest response
o Personal Interview: Usually focus groups. Most costly. Interviewer effects can be seen
▪ Secondary Data
• This is the data that has been compiled or published elsewhere
• Example: Census Data
• Advantages: It can be gathered quickly and inexpensively
• Disadvantages: May be outdated. May not be accurate
Errors
▪ Response Errors
▪ Subject lies
▪ Subject makes a mistake
▪ Interviewer makes a mistake
▪ Interviewer effects
Sample 2
▪ N = 1,000,000
▪ Response rate = 20%
Which is better?
▪ A small but representative sample can be useful in making inferences
▪ A large but unrepresentative sample is biased and therefore useless. There is no
way to correct for it
▪ Therefore, sample 1 is better than sample 2
Example
▪ Television Ratings
• Nielsen publishes TRP (Television Rating Points) for media content
• Ratings are based on a sample, not the population
• Population size is around 400 million and sample size is usually 50,000
• If a show has a 15.2 rating, it means 15.2% of the sample were watching the show
▪ The problem here is that we don't know how representative the sample is of the population
Selecting a Sample
▪ Probability Samples
▪ A probability sample is collected in such a way that each element of the population has a known chance of being selected
• Simple Random Sample (SRS)
o Every population element has an equal chance of being selected
• Systematic Random Sample
o Choose the first element at random and then select every kth element, where k = N/n (N is the population size and n is the sample size)
• Stratified Sample
o The population is subdivided into strata based on a characteristic, and an SRS is taken within each stratum
• Cluster Sample
o First take a random sample of clusters from the population of clusters, then take an SRS within each selected cluster.
Examples of clusters: election districts, orchards
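The four probability sampling schemes listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 numbered units, the two strata, the five "district" clusters, and all function names are hypothetical and do not come from the slides.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every element has an equal chance of being selected."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Random start, then every k-th element, with k = N // n."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """SRS of a fixed size within each stratum (dict: name -> list of elements)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Random sample of clusters, then an SRS within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
population = list(range(1, 101))       # hypothetical population of 100 units

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)

# Hypothetical strata: units split by a characteristic (here, low vs high id).
strata = {"low": population[:50], "high": population[50:]}
stratified = stratified_sample(strata, 5, rng)

# Hypothetical clusters: five "districts" of 20 units each.
clusters = {f"district_{d}": population[d * 20:(d + 1) * 20] for d in range(5)}
clustered = cluster_sample(clusters, 2, 4, rng)
```

Note the difference between the last two schemes: stratified sampling draws from every stratum, while cluster sampling draws only from the clusters that were randomly chosen.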
Types of Data
Categorical Data
▪ This refers to data that can be classified into separate groups.
▪ It is also called qualitative data.
▪ This data represents characteristics.
▪ For example, the gender of a person can be male or female. It can also be coded numerically,
e.g. 1 for male and 0 for female.
▪ Categorical data can be further classified as nominal or ordinal.
Types of Data
Numerical Data
▪ Data that can be measured is called numerical data.
▪ It is also called quantitative data.
▪ Discrete Data:
• If the values can be clearly separated from each other, then it is discrete data.
• Example: Number of children
▪ Continuous Data:
• Example: Height of a person
Types of Data
Numerical Data
▪ One simple way to check whether the data is continuous or discrete is to check whether we can
add more decimal places to the data
▪ You might say you are 5'11" tall, but in reality you may be 5'11.23432" tall
▪ If you say you have 2 children, you cannot have 2.234545 children
Types of Data
Scales of Measurement
Scales of Measurement - Nominal
▪ All we can say is that one category is different from another
▪ Gender: Male, Female, Transgender
▪ Eye color: Blue, Green, Brown, Hazel
▪ Type of house: Bungalow, Duplex, Ranch
▪ Type of pet: Dog, Cat, Rodent, Fish, Bird
Scales of Measurement - Ordinal
▪ Ordinal scale of measurement refers to ordered series of relationships or rank order.
▪ The ordinal scale contains data that can be placed in order.
▪ Ordinal scales do not represent a measurable quantity. It is difficult to measure the
interval between the values.
▪ It is clear from the table that the ratio scale satisfies all four properties of scales of measurement
What do we do with the data?
Descriptive statistics
Collect Data
Summarize Data
Present Data
Taxonomy of Statistics
[Diagram: Statistical Methods branch into Descriptive and Inferential; Univariate measures comprise Measure of Central Tendency, Measure of Dispersion, and Measure of Shape]
Measures of Central Tendency
▪ A measure of central tendency is a summary measure that attempts to describe a whole
set of data with a single value that represents the middle or centre of its distribution.
▪ There are three main measures of central tendency: the mean, median, and mode. Each
of these measures gives a different indication of the typical or central value in the
distribution.
Measures of Central Tendency
Mean
▪ The arithmetic mean is the most widely used average.
▪ For any set of data on the variable x, the mean is denoted by $\bar{x}$ and is obtained by
dividing the sum of the observations by their number:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Example
Calculate the mean for the following dataset:
1 2 2 4 5 10
Solution:
$\mu = \frac{1 + 2 + 2 + 4 + 5 + 10}{6} = \frac{24}{6} = 4$
1 2 2 4 5 70
Solution:
$\mu = \frac{1 + 2 + 2 + 4 + 5 + 70}{6} = \frac{84}{6} = 14$
Properties of Mean
▪ It is greatly affected by extreme values
▪ The sum of the deviations of the observations from the mean is zero: $\sum_{i=1}^{n} (x_i - \mu) = 0$
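As a quick check of the two worked examples and the properties of the mean, here is a minimal Python sketch. The helper name `mean` and the variable names are chosen for illustration only; the two datasets are the ones used in the examples.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)

data = [1, 2, 2, 4, 5, 10]
with_outlier = [1, 2, 2, 4, 5, 70]   # same data, last value replaced by an extreme one

mu = mean(data)                       # 24 / 6
mu_outlier = mean(with_outlier)       # 84 / 6

# Deviations from the mean cancel out: (-3) + (-2) + (-2) + 0 + 1 + 6
deviations_sum = sum(x - mu for x in data)

print(mu, mu_outlier, deviations_sum)  # 4.0 14.0 0.0
```

Changing a single observation from 10 to 70 moves the mean from 4 to 14, which is what "greatly affected by extreme values" means in practice.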
▪ Ordered data
50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 135, 140
Measures of Central Tendency
Median
▪ Robert hit 11 balls at Grimsby driving range. The recorded distances of his drives,
measured in yards, are given below. Find the median distance for his drives.
85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70
▪ Ordered data
50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 140
▪ Median = 85 yards (the 6th of the 11 ordered values)
• In golf stroke mechanics, a drive, also known as a tee shot, is a long-distance shot played from the tee box, intended to move the ball a great
distance down the fairway towards the green.
• A tee is a stand used to support a stationary ball so that the player can strike it
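The median calculation can be sketched in Python as follows. The function name `median` is illustrative. The second call adds a drive of 135 yards, which reproduces the twelve-value ordered list that also appears in these slides, to show the even-sized case where the median is the average of the two middle values.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

drives = [85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70]   # 11 drives, in yards

print(median(drives))           # 85   (6th of 11 ordered values)
print(median(drives + [135]))   # 90.0 (average of 85 and 95, the 6th and 7th of 12)
```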
Properties of Median
▪ Median is unique for a dataset
▪ Median is not affected by extreme values
▪ Any observation selected at random is just as likely to be greater than the median as less
than the median
Measures of Central Tendency
Mode
▪ The mode is the most commonly occurring value in a distribution.
▪ Example:
1 1 1 2 3 4 5
Mode = 1
5 5 5 6 8 10 10 10
Mode = 5, 10
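Both mode examples can be reproduced with `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value tied for the highest frequency. This is a small sketch for checking the examples, not part of the original slides.

```python
from statistics import multimode

single = multimode([1, 1, 1, 2, 3, 4, 5])
double = multimode([5, 5, 5, 6, 8, 10, 10, 10])

print(single)   # [1]
print(double)   # [5, 10]
```

The second dataset has two modes because 5 and 10 each occur three times.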
▪ All the measures are easy to interpret and not too difficult to compute.
▪ Only the mean directly depends on all the observations. A change in any one of the observations influences
the value of the mean. The median and mode are not so sensitive.
▪ The mean is, generally, the best measure of central tendency. In the presence of extreme values, the median is
a better measure of central tendency.
Scale of Measurement and Measure of Central Tendency
Nominal: Mode
Ordinal: Mode and Median
Interval: Mode, Median and Mean
Ratio: Mode, Median and Mean
Example
▪ For studying smoking habits, consider two possible survey questions:
A. Do you smoke? Yes or No
B. How many cigarettes did you smoke in the last 3 days?
▪ A yields ordinal data
▪ B yields interval data, which is more informative than ordinal data
Measures Of Non-central Tendency
Quartiles
▪ Quartiles split the ordered dataset into four equal parts
▪ The quartiles, like the median, either take the value of one of the observations, or the value
halfway between two observations
Quartiles
▪ 𝑄1 - 25% of the observations are smaller than 𝑄1 and 75% of the observations are greater than 𝑄1
▪ 𝑄2 - 50% of the observations are smaller than 𝑄2 and 50% of the observations are greater than 𝑄2
▪ 𝑄3 - 75% of the observations are smaller than 𝑄3 and 25% of the observations are greater than 𝑄3
▪ If n/4 is an integer, the first quartile has the value halfway between the (n/4)th observation and the next
observation
▪ If n/4 is not an integer, the first quartile has the value of the observation whose position
corresponds to the next higher integer
Quartiles
▪ Score of students in a test
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10
▪ No. of observations = 11; 11/4 = 2.75 is not an integer, so round up to the 3rd value
▪ The 3rd value from the left and the 3rd value from the right in the ordered data will be 𝑄1 and 𝑄3
▪ Ordered Data
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
▪ 𝑸𝟏 = 4; 𝑸𝟐 = 8; 𝑸𝟑 = 10
Quartiles
▪ Score of students in a test
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10, 8
▪ No. of observations = 12; 12/4 = 3 is an integer, so take the average of the 3rd and 4th values
▪ The average of the 3rd and 4th values from the left and from the right in the ordered data will be 𝑄1
and 𝑄3
▪ Ordered Data
3, 4, 4, 6, 8, 8, 8, 8, 9, 10, 10, 15
▪ 𝑸𝟏 = 5; 𝑸𝟐 = 8; 𝑸𝟑 = 9.5
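The quartile rule used in these two examples can be sketched in Python as below. The function name `quartiles` is illustrative, and treating 𝑄3 as the mirror image of 𝑄1 counted from the right follows the worked examples. Library routines such as `statistics.quantiles` use different interpolation conventions and may return slightly different values for the same data.

```python
import math

def quartiles(values):
    """Q1, Q2, Q3 using the positional rule from the slides (positions are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 4 == 0:
        k = n // 4
        # Halfway between the k-th and (k+1)-th values, from the left and from the right.
        q1 = (ordered[k - 1] + ordered[k]) / 2
        q3 = (ordered[n - k] + ordered[n - k - 1]) / 2
    else:
        k = math.ceil(n / 4)
        # The k-th value from the left and the k-th value from the right.
        q1 = ordered[k - 1]
        q3 = ordered[n - k]
    mid = n // 2
    q2 = ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2
    return q1, q2, q3

scores_11 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]
scores_12 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10, 8]

print(quartiles(scores_11))   # (4, 8, 10)
print(quartiles(scores_12))   # (5.0, 8.0, 9.5)
```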