What is statistics?
Statistics is the science of describing or making inferences about the world from a sample of data.
Statistics
Descriptive
Inferential
Descriptive Statistics
Descriptive statistics are methods for organizing and summarizing data.
Inferential Statistics
Two main methods:
Definitions
A variable is a characteristic or condition that can change or take on different values. A datum is one observation of the variable being measured. Data are a collection of observations.
The goal of statistics is to help researchers organize and interpret the data.
TYPES OF VARIABLES
Variables
Qualitative: nominal, ordinal
Quantitative: interval, ratio (discrete or continuous)
Qualitative variables
A nominal (categorical) variable is qualitative data made up of categories that cannot be rank-ordered; each category is just different.
Examples:
What is your gender? (please tick): Male, Female
Response options: Real Madrid, Barcelona, None
Ordered response options (an ordinal scale): Somewhat satisfied, Neutral, Somewhat dissatisfied, Very dissatisfied
Quantitative variables
Interval variables
Interval variables are measured on a continuous scale and have no true zero point.
Examples:
Time moves along a continuous measure of seconds, minutes and so on, and has no zero point.
Temperature (in degrees Celsius or Fahrenheit) moves along a continuous measure of degrees and has no true zero.
Ratio variables
Ratio data can be measured on a continuous scale with a true zero point.
Examples: Age, Weight, Height
Ratio data can also be measured on a discrete scale with a true zero point.
Example: Number of children
Population
The entire group of individuals is called the population.
Sample
Usually populations are so large that a researcher cannot examine the entire group. Therefore, a sample is selected to represent the population in a research study.
Why sample?
Measuring all units is impractical, if not impossible.
Sampling just a few units saves money.
Sampling just a few units saves time.
Some measurements are destructive.
Design matrix
Individual   Sex      Age   Smoke   Country    Married
1            Female   23    Yes     USA        Yes
2            Male     43    Yes     Colombia   Yes
3            Male     19    No      Brazil     Yes
4            Male     23    Yes     Brazil     No
5            Female   56    No      Canada     Yes
6            Female   78    Yes     USA        Yes
7            Male     54    No      Spain      No
8            Male     76    Yes     Colombia   No
9            Female   43    No      Peru       Yes
9 Individuals
5 Variables
Dimension 9 x 5
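A minimal sketch of how the design matrix above could be held in code, assuming Python and plain tuples. The names `columns` and `rows` are illustrative and not from the slides, and the yes/no answers are stored as booleans for convenience.

```python
# Design matrix: one row per individual, one column per variable.
columns = ["Sex", "Age", "Smoke", "Country", "Married"]
rows = [
    ("Female", 23, True,  "USA",      True),
    ("Male",   43, True,  "Colombia", True),
    ("Male",   19, False, "Brazil",   True),
    ("Male",   23, True,  "Brazil",   False),
    ("Female", 56, False, "Canada",   True),
    ("Female", 78, True,  "USA",      True),
    ("Male",   54, False, "Spain",    False),
    ("Male",   76, True,  "Colombia", False),
    ("Female", 43, False, "Peru",     True),
]

n_individuals = len(rows)    # number of rows
n_variables = len(columns)   # number of columns
print(f"Dimension {n_individuals} x {n_variables}")
```

Sex and Country are qualitative columns, Age is quantitative, and the two boolean columns are nominal with two categories.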
Statistical tools
Tables
One-way frequency table

Number of passengers   Absolute frequency   Relative frequency
2                      2                    2/93
4                      23                   23/93
5                      41                   41/93
6                      18                   18/93
7                      8                    8/93
8                      1                    1/93
Total                  93
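The relative frequencies in the one-way table are each absolute frequency divided by the total count. A short sketch in Python, with the dictionary name `absolute` chosen here for illustration:

```python
# Absolute frequencies from the one-way table: number of passengers -> count.
absolute = {2: 2, 4: 23, 5: 41, 6: 18, 7: 8, 8: 1}

total = sum(absolute.values())
# Relative frequency = absolute frequency / total number of observations.
relative = {value: count / total for value, count in absolute.items()}

for value, count in absolute.items():
    print(f"{value}: {count}/{total} = {relative[value]:.3f}")
print("Total:", total)
```

The relative frequencies always add up to 1, which is a quick check that no category was missed.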
Tables
Two-way frequency table

Sex \ Hobby        Dance   Sports   TV   Total
(row label lost)   2       10       8    20
(row label lost)   16      6        8    30
Total              18      16       16   50
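The totals in a two-way table are the row sums and column sums of the inner cells. The sketch below assumes the cells are read with one row per sex category (the row labels did not survive in the slide) and one column per hobby; the variable names are illustrative.

```python
# Inner cells of the two-way table.
# Rows: the two sex categories (labels not recoverable from the slide).
# Columns: Dance, Sports, TV.
hobbies = ["Dance", "Sports", "TV"]
cells = [
    [2, 10, 8],
    [16, 6, 8],
]

row_totals = [sum(row) for row in cells]             # one total per sex category
col_totals = [sum(col) for col in zip(*cells)]       # one total per hobby
grand_total = sum(row_totals)

print("Row totals:", row_totals)
print("Column totals:", dict(zip(hobbies, col_totals)))
print("Grand total:", grand_total)
```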
Tables
Frequency table
For quantitative variables.

Age     Absolute frequency   Relative frequency (%)
10-14   2                    5
15-19   16                   40
20-24   18                   45
25-29   3                    7.5
30-34   1                    2.5
Total   40                   100
Graphs
Bar chart, pie chart, pictograms, histogram, density plot, scatter plot, time series plot, boxplot
Statistical pictograms
Not recommended
Recommended book
http://www.laeditorialvirtual.com.ar/Pages2/Huff_Darrell/Huff_ComoMentirConEstadisticas.html#_Toc334380216
Recommended videos
http://www.youtube.com/watch?v=nUJNstRFvvo
http://www.youtube.com/watch?v=ETbc8GIhfHo
Mean
The mean of a data set is the sum of the data entries divided by the number of entries.
Population mean (read "mu"): $\mu = \frac{\sum x}{N}$
Sample mean (read "x-bar"): $\bar{x} = \frac{\sum x}{n}$
Example: the following are the ages of all seven employees of a small company: 53 32 61 57 39 44 57
$\mu = \frac{\sum x}{N} = \frac{343}{7} = 49$ years
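The same calculation can be sketched in Python. The seven employees are the whole population here, so the result is the population mean; the variable names are illustrative.

```python
# Ages of all seven employees (the full population).
ages = [53, 32, 61, 57, 39, 44, 57]

total = sum(ages)   # sum of the data entries
N = len(ages)       # number of entries
mu = total / N      # population mean

print(f"sum = {total}, N = {N}, mean = {mu} years")
```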
Median
The median of a data set is the value that lies in the middle of the data when the data set is ordered. If the data set has an odd number of entries, the median is the middle data entry. If the data set has an even number of entries, the median is the mean of the two middle data entries.
Example: calculate the median age of the seven employees. 53 32 61 57 39 44 57
Ordered: 32 39 44 53 57 57 61. The middle (fourth) entry is 53, so the median age is 53 years.
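The odd and even cases of the median rule can be sketched as a small Python function. The function name `median` is chosen here for illustration; Python's standard `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the middle entry
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two middle entries

ages = [53, 32, 61, 57, 39, 44, 57]
print(sorted(ages))
print("median age:", median(ages))
```

With an even number of entries, for instance `median([1, 2, 3, 4])`, the function averages the two middle values.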
Mode
The mode of a data set is the data entry or category that occurs with the greatest frequency. If no entry is repeated, the data set has no mode. If two entries occur with the same greatest frequency, each entry is a mode and the data set is called bimodal.
Example: find the mode of the ages of the seven employees. 53 32 61 57 39 44 57
The mode is 57 because it occurs the most times.
An outlier is a datum that lies far from the other entries in the data set.
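A sketch of the mode rule in Python, assuming a non-empty data set. The helper `modes` is hypothetical; it returns every value that ties for the greatest frequency and an empty list when no entry is repeated.

```python
from collections import Counter

def modes(values):
    """Return all modes of a non-empty data set, or [] if no entry is repeated."""
    counts = Counter(values)
    top = max(counts.values())   # greatest frequency
    if top == 1:
        return []                # no entry is repeated: no mode
    return sorted(v for v, c in counts.items() if c == top)

ages = [53, 32, 61, 57, 39, 44, 57]
print(modes(ages))              # one mode
print(modes([1, 1, 2, 2, 3]))   # two modes: bimodal
print(modes([1, 2, 3]))         # no mode
```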
Weighted Mean
A weighted mean is the mean of a data set whose entries have varying weights. A weighted mean is given by
$\bar{x} = \frac{\sum (x \cdot w)}{\sum w}$
where w is the weight of each entry x.
Example: grades in a statistics class are weighted as follows. Tests are worth 50% of the grade, homework is worth 30% of the grade and the final is worth 20% of the grade. A student receives a total of 80 points on tests, 100 points on homework, and 85 points on his final. What is his current grade?
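Begin by organizing the data in a table.

Source     Score, x   Weight, w   xw
Tests      80         0.50        40
Homework   100        0.30        30
Final      85         0.20        17
Total                 1.00        87

$\bar{x} = \frac{\sum (x \cdot w)}{\sum w} = \frac{87}{1.00} = 87$, that is, 87 points out of 100 (0.87).
The student's current grade is 87%.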
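The weighted mean of the grade example can be sketched in Python. The weights 0.50, 0.30 and 0.20 are the percentages from the example written as fractions, and the dictionary names are illustrative.

```python
# Scores (out of 100) and their weights from the grade example.
scores = {"Tests": 80, "Homework": 100, "Final": 85}
weights = {"Tests": 0.50, "Homework": 0.30, "Final": 0.20}

weighted_sum = sum(scores[k] * weights[k] for k in scores)   # sum of x * w
weight_total = sum(weights.values())                         # sum of w
grade = weighted_sum / weight_total                          # weighted mean

print(f"weighted mean = {grade:.1f} points out of 100")
```

Dividing by the sum of the weights means the result is the same whether the weights are written as 0.50, 0.30, 0.20 or as 50, 30, 20.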
Shapes of distributions
[Figure: histogram and density plot]
Shapes of distributions
A frequency distribution is symmetric when a vertical line can be drawn through the middle of a graph of the distribution and the resulting halves are approximately mirror images.
Shapes of distributions
A frequency distribution is uniform (or rectangular) when all entries, or classes, in the distribution have equal frequencies. A uniform distribution is also symmetric.
Shapes of distributions
A frequency distribution is skewed if the tail of the graph elongates more to one side than to the other. A distribution is skewed left (negatively skewed) if its tail extends to the left. A distribution is skewed right (positively skewed) if its tail extends to the right.
Measures of Variation
The mean is a good indicator of the central tendency of a set of data, but it does not provide the whole picture about the data set.
Example 1: comparison of the distribution of two data sets

             Data                Mean   Median
Data set 1   5  6  7  8   9      7      7
Data set 2   1  2  7  12  13     7      7
Example 2: Suppose that in a hospital, each patient's pulse rate is taken in the morning, at noon, and in the evening. On a certain day, the pulse rates were:

            Morning   Noon   Evening   Mean   Median
Patient A   72        76     74        74     74
Patient B   72        91     59        74     72
Note: The mean pulse rate is the same for both patients. While patient A's pulse rate is stable, patient B's fluctuates widely.
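The pulse-rate example can be checked with Python's standard `statistics` module. This is only a sketch; the dictionary `pulse` and the `spread` value (highest reading minus lowest reading) are names introduced here to make the difference between the patients visible.

```python
from statistics import mean, median

# Morning, noon and evening pulse rates for each patient.
pulse = {
    "A": [72, 76, 74],
    "B": [72, 91, 59],
}

summary = {}
for patient, rates in pulse.items():
    spread = max(rates) - min(rates)   # gap between highest and lowest reading
    summary[patient] = (mean(rates), median(rates), spread)
    print(f"Patient {patient}: mean={mean(rates)}, median={median(rates)}, spread={spread}")
```

Both patients have the same mean, yet the spread for patient B is eight times larger, which the mean alone does not reveal.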
Range
The range of a data set is the difference between the maximum and minimum data entries in the set.
Range = (Maximum data entry) - (Minimum data entry)
Example: The following data are the closing prices for a certain stock on ten successive Fridays. Find the range.
Stock: 56 56 57 58 61 63 63 67 67 67
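A worked version of the stock example, sketched in Python. The variable names are illustrative.

```python
# Closing prices on ten successive Fridays.
prices = [56, 56, 57, 58, 61, 63, 63, 67, 67, 67]

# Range = maximum data entry - minimum data entry.
price_range = max(prices) - min(prices)
print(f"range = {max(prices)} - {min(prices)} = {price_range}")
```

The range uses only the two extreme entries, so it says nothing about how the other eight prices are spread between them.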
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
The sample standard deviation of a sample data set of n entries is the square root of the sample variance.
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
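A sketch of the sample variance and sample standard deviation in Python, assuming the usual definitions with n - 1 in the denominator. It is applied to the two data sets from Example 1 (5, 6, 7, 8, 9 and 1, 2, 7, 12, 13), which share a mean and a median of 7. The function names are illustrative; the standard library's `statistics.stdev` uses the same definition.

```python
import math
import statistics

def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def sample_std(data):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(data))

set_1 = [5, 6, 7, 8, 9]
set_2 = [1, 2, 7, 12, 13]

for name, data in (("set 1", set_1), ("set 2", set_2)):
    print(name, "variance:", sample_variance(data), "std:", round(sample_std(data), 2))

# Cross-check against the standard library.
assert math.isclose(sample_std(set_1), statistics.stdev(set_1))
```

The second data set has the same center but a much larger standard deviation, which is the extra information the mean and median do not carry.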
[Figure: two histograms of frequency against data value, one with $\bar{x} = 4$ and $s = 1.18$, the other with $\bar{x} = 4$ and $s = 0$.]
Galton board
Normal Distribution
The most widely used distribution is the normal distribution, also known as the Gaussian distribution.
Empirical Rule
$P(\mu - \sigma < X < \mu + \sigma) = 0.6827$
$P(\mu - 2\sigma < X < \mu + 2\sigma) = 0.9545$
$P(\mu - 3\sigma < X < \mu + 3\sigma) = 0.9973$
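The three probabilities of the empirical rule can be reproduced numerically. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper `prob_within` is a name introduced here.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma < X < mu + k*sigma) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

The result does not depend on the particular mean and standard deviation, so the default N(0, 1) is enough to check the rule.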
N(0,1)
On table
Answer: 0.1949
On table
Answer: 0.9147
On table
On table
Standardizing
When you weigh a sample of bags you get these results: 1007 g, 1032 g, 1002 g, 983 g, 1004 g, ... (a hundred measurements)
Mean = 986 g
Standard deviation = 20 g
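The standardizing formula itself did not survive on the slide, so the sketch below assumes the usual z-score, z = (x - mean) / standard deviation, and applies it to the five weights that are listed. The function name `z_score` is illustrative.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std

mean_g = 986   # sample mean in grams, as given
std_g = 20     # sample standard deviation in grams, as given
weights_g = [1007, 1032, 1002, 983, 1004]   # the five listed measurements

z_values = [z_score(w, mean_g, std_g) for w in weights_g]
for w, z in zip(weights_g, z_values):
    print(f"{w} g -> z = {z:+.2f}")
```

A standardized value can then be looked up in an N(0, 1) table, whatever the original units were.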