
1-1 Statistics Introduction

Goal of statistics Learn about a large group by examining data from some of its members. We collect data
from a portion of a larger group so that we can learn something about the larger group.
Objective: To understand the different types of data and how they can be collected.
An important goal in statistics is to use sample data to draw conclusions or generalizations about the population. This is
an important concept because it is usually too expensive and time consuming to collect data from every member of the
population, which would be a census.
Data Collection of facts such as values, observations, measurements, genders, survey responses.
Data can be Descriptive (like "high" or "fast") or Numerical (numbers).
Statistics Science of planning studies and experiments, obtaining data, and then organizing, summarizing,
analyzing, interpreting, presenting, and drawing conclusions based on the data.
Statistics- the process of collecting, organizing, analyzing, and interpreting data.
Population Complete collection of ALL individuals (scores, people, measurements, etc.) to be studied.
The collection or set of all subjects of interest in a given situation
Parameter- a numerical measurement describing some characteristic of a population
Census Collection of data from every member of the population.
Sample Subcollection of members selected from a population. A good sample is representative. This means that
each sample point represents the attributes of a known number of population elements.
A portion, or subset, of a population
Statistic- a numerical measurement describing some characteristic of a sample
Sample data must be collected in an appropriate way, such as through a process of random selection.
If data are not collected in an appropriate way, the data may be so completely useless that no amount
of statistical torturing can salvage them.

Descriptive Statistics- presenting data by using numerical descriptions or tables and charts that summarize the data.
Statistical Inference- making generalizations about a larger population based on a representative sample.

Ways to collect data:
1. Observational study - we observe and measure specific characteristics, but no treatment is given
2. Experiment - some treatment is applied specifically to observe its effects on the subjects.
Experimental units receive the treatment; if it is a human they are called subjects.
*Experiments are the only true way to show cause and effect

1-2 Statistical Thinking
Don't rely on blind acceptance of mathematical calculations. Factors to consider:
Context of the data WHAT is the context of the data? What is the data about? What does the data represent?
Why was the data collected?
Source of the data WHERE was the data collected? What is the source?
Sampling Method HOW was the data obtained?
Conclusions THEN what can we conclude from the data?
Practical Implications THIS IMPLIES what practical implications result from the analysis?

Determine the goal of the study [the issue] [Ask a question. Justify a statement. Determine if the data
support or contradict a statement.]
Always consider the context of the data because it will directly affect the statistical procedures we use.
Source Is the source objective or is there incentive to be biased?
Method used to collect sample data can [introduce biased behavior and] influence the validity of the conclusions.
Ex. Voluntary Response Sample (Self Selection, respondents themselves decide inclusion)
A sample you get when you ask for volunteers.
When sample members are self-selected volunteers.
Respondents who have a special interest in a subject or feel strongly about it are more likely to participate.
Conclusion Should be clear statements that are simple to understand by individuals with no statistical knowledge.
Statements should be derived from within the realm of the statistical analysis. (not outside)
Q: p7 Example 4 They concluded that the freshman 15 weight gain is a myth.
Isn't ANY conclusion meaningless if the Sampling Method is not appropriate?
[Some statement about action or non-action regarding a finding based on the statistical analysis.]
Practical Significance Based on available sample data, methods of statistics can be used to reach a
conclusion that some treatment or finding is effective, but common sense might suggest that the
treatment or finding does not make enough of a difference to justify its use or be practical.
Based on the available sample data, methods of statistics can be used to reach a conclusion, [for
example], that some treatment or finding is effective.
In statistics, a result is called "statistically significant" if it is unlikely to have occurred by chance.
A result that is not likely to occur randomly, but rather is likely to be attributable to a specific cause.
Significant means that you are very sure that the statistic is reliable.
Significance is a statistical term that tells how sure you are that a difference or relationship exists.
Significant means that you are very sure that the difference is real (i.e., it didn't happen by fluke).
Where common sense might suggest the treatment or finding does not make enough difference to justify
its use or to be practical.
Whether or not the magnitude of the observed effect or relationship justifies action.

Factors to consider when data is collected:
Context-description of what the values represent, where they came from, and why they were collected.
Source of Data-Where did the data come from? Can it be biased?
Sampling Method-how sample data is collected. This can greatly influence the validity of conclusions.
Conclusions-make statements clear and based on the data.
Practical Implications-relate any practical information that might appear important.
Statistical Significance-event unlikely to happen by chance alone. Discussed throughout the course.

Example 1 Page 35 #8 For a study in which subjects are treated with a new drug and then observed, is the study an
observational study or an experiment?

1-3 Types of Data
A goal of statistics is to make inferences or generalizations about a population
Parameter A numerical measurement describing
some characteristic of a population. (The
entire population of something)

Statistic A numerical measurement describing some characteristic of a sample.
Example 1
1. Parameter: There are exactly 100 Senators in the 109th Congress of the U.S., and 55% of them are
Republicans. The figure of 55% is a parameter because it is based on the entire population of all 100
Senators.
2. Statistic: In 1936, Literary Digest polled 2.3 million adults in the U.S. and 57% said they would vote for
Alf Landon for the presidency. That figure of 57% is a statistic because it is based on a sample, not the
entire population of all adults in the U.S.
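
The parameter/statistic distinction in Example 1 can be sketched in a few lines of Python. This is only an illustration: the list `senators`, the sample size of 20, and the seed are assumptions made for the sketch, not part of the original example.

```python
import random

# Population: all 100 senators from the example; 55 are labeled "R" (Republican).
senators = ["R"] * 55 + ["other"] * 45

# Parameter: computed from every member of the population.
parameter = senators.count("R") / len(senators)

# Statistic: computed from a sample only.
# The sample size (20) and the seed are arbitrary choices for this sketch.
rng = random.Random(1)
sample = rng.sample(senators, 20)
statistic = sample.count("R") / len(sample)

print(f"parameter (whole population): {parameter:.2f}")
print(f"statistic (sample of 20):     {statistic:.2f}")
```

The parameter is always 0.55 here, while the statistic depends on which 20 senators happen to be drawn.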

Variable - In statistics, a variable has two defining characteristics:
A variable is an attribute that describes a person, place, thing, or idea.
The value of the variable can "vary" from one entity to another.

Types of data
Quantitative (or numerical) data consists of numbers representing counts or measurements.
Discrete data result when the number of possible values is either a finite number or a countable
number. Count-type data that can only take certain values (like the whole numbers 0, 1, 2, 3).
Continuous data result from infinitely many possible values that correspond to some continuous scale that
covers a range of values without gaps, interruptions or jumps. Can take any value (within a
range) [Height, weight, length, time]
Discrete data is counted, Continuous data is measured.
Categorical (or qualitative or attribute) data consists of names or labels that are not numbers representing counts
or measurements. Places data in a category, hence categorical!
Example: Final grade for a course.
Example 2
1. Quantitative Data: The ages (in years) of survey respondents.
2. Categorical Data: The political party affiliations (Democrat, Republican, Independent, other) of survey
respondents.
3. Categorical Data: The jersey numbers 24, 28, 17, 54, and 31 are sewn on the shirts of the LA Lakers
starting basketball team. The numbers are substitutes for names. They don't count or measure
anything, so they are categorical data.
Example 3
Discrete Data: The numbers of eggs that hens lay are discrete data because they represent counts.
Continuous Data: The amounts of milk from cows are continuous data because they are measurements
that can assume any value over a continuous span.

Example 2 - Quantitative data or Categorical (qualitative) data
a. Race- Categorical
b. Smoker- Categorical
c. Systolic blood pressure- Quantitative
d. Level of calcium in the blood- Quantitative

Example 3 - Discrete or continuous data set?
a. Number of rooms in an owner occupied housing unit. No half bathroom. Discrete
b. The amount of water in a 32 ounce container. Continuous

If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called
a discrete variable.

Example 4 - Determine if the numerical value is a parameter or a statistic.
a. A recent survey of a sample of MBAs reported that the average salary for an MBA is more than $82,000.
b. Starting salaries for the 667 MBA graduates from the University of Chicago Graduate School of Business
increased 8.5% from the previous year.
Based on all 667 MBA graduates so this is a population parameter.
Example 5 - Identify the population and the sample.
A survey of 1906 households in the United States found that 13% have a high definition television.
Sample: the 1906 households surveyed. Population: all households in the United States.

Four levels of measurement
Levels of data- classified by how they are categorized, counted, or measured
Nominal Characterized by data that consists of names, labels or categories only. Ordering or calculations on the
data have no significance and are meaningless.
Examples: country of origin, political affiliation, gender
Ordinal Data that can be arranged in some order, but the differences between data values either cannot be
determined or are meaningless.
Examples: level of education, movie ratings, course grades, ranks.
Interval Like the Ordinal level, with the additional property that the difference between any two data values is
meaningful. However, data at this level does not have a natural zero starting point (where none of the
quantity is present).
Data that consist of categories that can be arranged in ranking order, and the mathematical difference
between two data values is meaningful. Data must be quantitative.
Examples: temperature, dates
Ratio Like the Interval level, with the additional property that there is a natural zero starting point (where zero
indicates that none of the quantity is present). For values at this level, differences and ratios are both
meaningful. (Twice one number in relation to another is meaningful.)
Examples: age, length, weight

Example 4
Examples of sample data at the nominal level of measurement.
1. Yes/No/undecided: Survey responses of yes, no, and undecided.
2. Political Party: The political party affiliations of survey respondents (Democrat, Republican,
Independent, other).
Example 5
Examples at the ordinal level of measurement.
1. Course grades Cannot determine the difference between the grades.
2. Ranks - (First, second, third) The difference is meaningless because it does not represent an exact
quantity that can be compared to other such differences.
Example 6
Examples of interval level of measurement.
1. Temperatures Values are ordered and differences can be determined. However, there is no natural
starting point. 0 is not a natural starting point and does not represent the total absence of heat.
2. Years Time did not begin in the year 0. Not a natural starting point.
Example 7
Examples of ratio level of measurement.
1. Distances Distances traveled by cars. 0 represents no distance traveled. 400 km is twice as far as 200 km.
2. Prices Prices of college textbooks. $0 represents no cost and $100 does cost twice as much as a
$50 book.
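
A short Python check can make the interval-versus-ratio distinction in the temperature example concrete. The two temperatures (20 and 10 degrees Celsius) are made-up values for illustration, and the Kelvin conversion is brought in only as a scale that does have a natural zero; it is not part of the original notes.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return c + 273.15

a_c, b_c = 20.0, 10.0  # hypothetical temperatures in degrees Celsius

# Differences are meaningful on an interval scale: same gap on both scales.
difference_c = a_c - b_c
difference_k = celsius_to_kelvin(a_c) - celsius_to_kelvin(b_c)

# Ratios are not: "20 is twice 10" holds only for the Celsius numbers,
# not for the amount of heat they describe.
ratio_c = a_c / b_c
ratio_k = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)

print(f"difference: {difference_c:.1f} C, {difference_k:.1f} K")
print(f"ratio on Celsius numbers: {ratio_c:.2f}")
print(f"ratio on Kelvin numbers:  {ratio_k:.4f}")
```

The difference agrees on both scales, but the ratio of 2 seen in the Celsius numbers disappears once a scale with a natural zero is used.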

Example 6 Identify the level of measurement
a. Types of music played by a radio station - Nominal
b. Average Monthly Precipitation in inches for Tampa, Fl - Ratio
c. Eye color - Nominal
d. Rank your math skills from 1 to 10 - Ordinal

1-4 Critical Thinking
Focus on [interpreting] the meaning of information obtained by studying data.
Don't blindly use formulas and procedures. Think carefully about context, source, data collection method,
conclusions reached.
Misuse of Statistics - Evil intent on the part of dishonest persons. Unintentional errors. Bad sampling

Misuse of Statistics
Voluntary Response Sample A sample you get when you ask for volunteers. The respondents themselves decide whether or not to
be included. People with strong opinions are more likely to participate. So it is possible that responses
are not representative of the whole population. We can only make valid conclusions about the specific
group of people who choose to participate.
The resulting sample tends to over-represent individuals who have strong opinions.

Example: Polls conducted through the internet.
Mail in Polls.
Telephone call in (From newspaper, radio, television)
Correlation and Causality
Misinterpretation by finding a statistical association between two variables and concluding that one
of the variables causes or directly affects the other.
Reported Results People lie or make mistakes, so collect the results yourself!
Small Samples Conclusions should not be based on samples that are too small or not collected in an appropriate way.
Some studies cite misleading or unclear percentages. References made to percentages that exceed
Example: lost baggage improved 100% means that currently no baggage is lost.
Loaded Questions The wording of questions can influence the results of a study. Survey questions can be loaded or
intentionally worded to elicit a desired response.
Order of Questions Questions are sometimes unintentionally loaded by the ordering of the items to be considered.
Nonresponse Person refuses to respond or is not available.
Sometimes, individuals chosen for the sample are unwilling or unable to participate in the survey.
Nonresponse bias is the bias that results when respondents differ in meaningful ways from nonrespondents.
Missing Data
People drop out of studies or fail to report their information (Ex. Low income groups, homeless)
[inability to reach out and include those who should be part of a representation of a population.]
Self-Interest Study Self-serving promotions of studies (Ex. Kiwi). The sponsor may influence the results for self-interest.
Precise Numbers Precise does not mean accurate. Precise numbers may still be estimates.
Deliberate Distortions The practice of fabricating studies. (Ex. Avis)

1-5 Collecting Sample Data
The method used to collect sample data influences the quality of statistical analysis.
If sample data are not collected in an appropriate way, the data may be so completely useless that no
amount of statistical torturing can salvage them.
Keep in mind that we are collecting data from a sample to make conclusions about the population. All data collected can easily be
put into your calculator and used to draw conclusions about your population, but the way the data is collected is critical.
Garbage in, garbage out: if you use bad samples, you get bad data, and your conclusions about the population will be worthless.

Part 1: Basics of Collecting Data
Observational Studies Characteristics are observed and measured, but subjects being studied are not modified.
Experiment Some treatment is applied to a subject or subjects (experimental units) and the effects on the
subject or subjects are observed.
It is important to select the sample of subjects in such a way that the sample is likely to be representative of the
larger population. [The target population of the study.]

The quality of a sample statistic (i.e., accuracy, precision, representativeness) is strongly affected by the way that sample
observations are chosen; that is, by the sampling method.
Simple Random Sample
A sample of n subjects is selected in such a way that every possible SAMPLE OF THE
SAME SIZE n, has the same chance of being selected. [Every combination of n size can be
(The best way and often required)- a sampling method in which each subject of the
population has an equal chance of being selected and all groups of size n have an equal
chance of being selected.
Examples: pulling names out of a hat, using a random number generator
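
A minimal Python sketch of a simple random sample, using the random number generator approach mentioned in the examples. The population of 100 numbered subjects, the sample size of 10, and the seed are assumptions made for illustration.

```python
import random

# Hypothetical population: subject IDs 1 through 100.
population = list(range(1, 101))

# random.Random.sample draws without replacement, so every possible
# group of 10 subjects has the same chance of being the sample.
rng = random.Random(42)  # fixed seed only so the sketch is repeatable
simple_random_sample = rng.sample(population, 10)

print(simple_random_sample)
```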
Random Sample
Members from the population are selected in such a way that each INDIVIDUAL MEMBER in
the population has an equal chance of being selected. We expect all components of the
population to be (approximately) proportionately represented.
Q: What is meant by component?
Probability Sample
Selecting members from a population in such a way that each member of the population has
a known (but not necessarily the same) chance of being selected. [Vague!]
A sample in which each element of the population has a known non-zero probability of
being selected.
Non-Probability Sample
We do not know the probability that each population element will be chosen, and/or we
cannot be sure that each population element has a non-zero chance of being chosen.
Non-probability sampling methods offer two potential advantages - convenience and cost.
The main disadvantage is that non-probability sampling methods do not allow you to estimate
the extent to which sample statistics are likely to differ from population parameters. Only
probability sampling methods permit that kind of analysis.
Systematic Sampling
We select a starting point, then select every kth (e.g., every 50th) element in the population.
Systematic sampling will yield the same sample over and over again if conditions remain the
same.
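
Systematic sampling can be sketched as list slicing in Python. The helper name `systematic_sample`, the population of 500 numbered elements, and the random choice of starting point are assumptions for this illustration.

```python
import random

def systematic_sample(population, k, rng):
    """Pick a starting point among the first k elements, then take every kth element."""
    start = rng.randrange(k)
    return population[start::k]

# Hypothetical population: elements numbered 1 through 500.
population = list(range(1, 501))

sample = systematic_sample(population, 50, random.Random(0))
print(sample)

# With the same starting point and the same k, the same sample comes back.
repeat = systematic_sample(population, 50, random.Random(0))
print(repeat == sample)
```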
Convenience Sampling
Use results that are easy to get. (Readily available and convenient) The researcher using
such a sample cannot scientifically make generalizations about the total population from this
sample because it would not be representative enough.
Stratified Sampling
[Divide, then select
subjects from each]
Subdivide the population into at least 2 different subgroups (or strata) so that the subjects in
the same group share the SAME CHARACTERISTICS (gender, age bracket), then draw a
SAMPLE from each [and every] subgroup (stratum).
Often used to guarantee that each group is properly represented within the sample.
Examples: group students by graduating class then randomly select students from each
class or grouping the population into males and females and then take a random sample
from each
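
The graduating-class example can be sketched in Python as follows. The roster of 40 made-up students, the helper name `stratified_sample`, and the choice of 3 students per class are assumptions for illustration only.

```python
import random

def stratified_sample(units, stratum_of, per_stratum, rng):
    """Group units into strata, then draw a random sample from EVERY stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], per_stratum))
    return sample

# Hypothetical roster: (student id, graduating class), 10 students per class.
classes = ["freshman", "sophomore", "junior", "senior"]
students = [(f"s{i}", classes[i % 4]) for i in range(40)]

chosen = stratified_sample(students, lambda s: s[1], 3, random.Random(7))
for student in chosen:
    print(student)
```

Every class contributes exactly 3 students, which is the guarantee of representation that stratification is meant to provide.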
Cluster Sampling
[Divide, then select entire
clusters]
Divide the population into sections (clusters), then randomly select some of those clusters,
then select ALL of the members of the selected clusters.
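
For contrast with stratified sampling, here is a small Python sketch of cluster sampling. The eight blocks of five households each, and the helper name `cluster_sample`, are invented for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters, then keep ALL members of the picked clusters."""
    picked = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in picked for m in clusters[name]]
    return picked, members

# Hypothetical population: 8 blocks with 5 households in each.
clusters = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(5)]
    for b in range(8)
}

picked, members = cluster_sample(clusters, 2, random.Random(3))
print(picked)
print(members)
```

Only some clusters appear in the sample, but each selected cluster appears in full.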

P28 Q: Bias/Tampering?
- Pollsters can adjust or weight the results of stratified or cluster sampling to correct for any
disproportionate representations of groups.

For a fixed sample size, if you randomly select subjects from different strata, you are likely
to get more consistent (and less variable) results than by selecting a random sample from the
general population. [Assumption: Provided that the determined strata are meaningful,
important and appropriate to the study. Hence, there is a dependency here.]
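
The claim about consistency can be explored with a small simulation. This is a sketch under strong assumptions: a made-up population with two equally sized strata whose values differ a lot (means of 10 and 50), a sample size of 20, and 1000 repetitions. With strata that do not differ on the measured variable, the advantage would shrink.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: two strata of 500 values each, with very different levels.
stratum_a = [rng.gauss(10, 2) for _ in range(500)]
stratum_b = [rng.gauss(50, 2) for _ in range(500)]
population = stratum_a + stratum_b

n = 20  # fixed total sample size for both methods
srs_means, stratified_means = [], []
for _ in range(1000):
    # Simple random sample from the whole population.
    srs_means.append(statistics.mean(rng.sample(population, n)))
    # Stratified sample: half of the sample from each stratum.
    halves = rng.sample(stratum_a, n // 2) + rng.sample(stratum_b, n // 2)
    stratified_means.append(statistics.mean(halves))

srs_spread = statistics.stdev(srs_means)
stratified_spread = statistics.stdev(stratified_means)
print(f"spread of sample means, simple random: {srs_spread:.2f}")
print(f"spread of sample means, stratified:    {stratified_spread:.2f}")
```

Under these assumptions the stratified sample means vary much less from sample to sample, because each sample is forced to contain the same mix of the two strata.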
Multistage Sampling
Using a combination of the basic sampling methods. Pollsters select a sample in different
stages, and each stage might use a different method of sampling.

Example: U.S. government's unemployment statistics based on surveyed households.
(A random, stratified and cluster approach.)

Current Population Survey (factors such as unemployment, college enrollments, weekly
earnings amounts, etc.)
1. Partition the entire U.S. into 2007 different regions (Primary Sampling Units, or PSUs:
metropolitan areas, large counties, or groups of smaller counties).
2. Select a sample of PSUs in each of the 50 states. (792 in total: the 432 with the largest populations plus
360 randomly selected PSUs from the other 1575.) [Unclear - How are all states
represented? Some states' counties are bigger than other states' metro areas!]
3. Partition each of the 792 PSUs into blocks, using stratified sampling to select a
sample of blocks. [What characteristic?]
4. In each selected block, identify clusters of households that are close to each other.
Randomly select clusters and interview all households in the selected
clusters.

Example 7 Page 35 #12 The town of Poughkeepsie Police set up a sobriety checkpoint at which every fifth driver is
stopped and interviewed. What type of sampling is used? Systematic
Example 8 Pg 35 #14 The U.S. Department of Corrections collects data about returning prisoners by randomly selecting
five federal prisons and surveying all of the prisoners in each of the prisons. What type of sampling is used? Cluster
Example 9 pg 35 #18 In a study of college programs, 820 students are randomly selected from those majoring in
communications, 1463 students are randomly selected from those majoring in business, and 760 students
are randomly selected from those majoring in history. What type of sampling is used? Stratified

Part 2: Beyond the Basics of collecting data Types of observational studies and experiment design.
Cross-sectional Study Data are observed, measured and collected at one point in time. [A snapshot]
Retrospective Study
(Case-control Study)
Data are collected from the past by going back in time (examination of records, interviews)
Example: Car crash victims who died and those who did not die.
Prospective Study
(longitudinal, cohort)
Data are collected in the future from groups sharing common factors such as smokers,
nonsmokers (called cohorts)
[Active and ongoing]

Design of Experiments
Randomization When subjects are assigned to different groups through a process of random selection.
Example: The Salk Vaccine Experiment.
Use chance as a method to create 2 groups that are similar [in ways that are important to the
experiment]. When it is very difficult to directly assign subjects to two groups having similar
characteristics of age, sex, weight, height, diet, etc
Found to be an extremely effective method for assigning subjects to groups.
Replication The repetition of an experiment on more than one subject. Samples should be large enough to
eliminate erratic behavior that is characteristic of very small samples. But, APPROPRIATE
(The repetition or duplication of an experiment so that the results can be verified/confirmed.)

Use a sample size that is large enough to reveal the true nature of any effects, and obtain the
sample using an appropriate method such as random sampling.
Blinding When the subject is not aware of whether he or she is receiving the treatment or the placebo. Allows us to
determine whether the treatment effect is significantly different from the Placebo Effect which occurs
when an untreated subject reports an improvement in symptoms. Minimizes the Placebo Effect or
allows investigators to account for it. [Enables appropriate handling when the Placebo Effect is
encountered. The treatment effect needs to be greater than/surpass the Placebo Effect]
Double-blind Blinding occurs at the subject level and at the treatment administration level.

Three key elements to experimentation:
1. Randomization- subjects are assigned to different groups through a process of random selection. This
helps ensure that the groups are similar.
2. Replication-the repetition of an experiment on more than one subject. Allows us to recognize the effects of
different treatments.
3. Control-helps minimize outside variables.
Forms of control:
Placebo- a dummy treatment with no active ingredient, often a sugar pill. Often given to a control group, which
is for comparison purposes.
Placebo effect- an untreated subject reports an improvement in symptoms.
Blinding- a technique in which the subject doesn't know whether he or she is receiving a treatment or a placebo.
Double Blind- blinding occurs at two levels: neither the subject nor the doctor knows which treatment the
subject is receiving.

Controlling Effects of Variables
Confounding
(also confounding factor,
lurking variable)
Occurs in an experiment when the effects of different factors are not distinguishable.
Example: 2 groups Placebo group=Men and Treatment group=Women. Gender can be a
factor that affects the treatment results [along with the treatment itself].
Confounding occurs when the experimental controls do not allow the experimenter to
reasonably eliminate plausible alternative explanations for an observed relationship between
independent and dependent variables.
Ex. Ice cream causes drowning <> Friends and loved ones of drowning victims console with
ice cream.
Ex. relationship between heavy drinking and lung cancer. Here, the data should be controlled
for smoking as it is related to both drinking and lung cancer.
Completely Randomized
Experimental Design
Randomness is used to assign subjects to the treatment group and placebo group. The
objective of this experimental design is to control [isolate] the effect of the treatment so that it can
be clearly distinguished from the effect of the placebo.
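
Random assignment in a completely randomized design can be sketched in Python by shuffling the subjects and splitting the list. The 20 made-up subject labels, the equal group sizes, and the helper name `randomize` are assumptions for illustration.

```python
import random

def randomize(subjects, rng):
    """Shuffle a copy of the subjects and split it into two groups by chance alone."""
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subjects.
subjects = [f"subject_{i}" for i in range(1, 21)]

treatment_group, placebo_group = randomize(subjects, random.Random(11))
print("treatment:", treatment_group)
print("placebo:  ", placebo_group)
```

Because chance alone decides the groups, no characteristic of a subject (age, sex, weight, and so on) can steer that subject toward one group.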
Randomized Block
A block is a group of subjects with similar characteristics, but blocks differ in ways that might
affect the experiment. If testing one or more different treatments with different blocks, 1. Form
the blocks with different characteristics; 2. Randomly assign treatments to the subjects within
each block. [Same number of treatments in each block?]
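
A sketch of the two steps of a randomized block design in Python. The blocks (men and women, six subjects each), the two treatment labels, and the helper name `randomized_block` are assumptions for illustration; alternating treatments through a shuffled block is one simple way to keep the treatments balanced within each block.

```python
import random

def randomized_block(blocks, treatments, rng):
    """Within each block, shuffle the subjects and deal out the treatments in turn."""
    assignment = {}
    for block_name, members in blocks.items():
        shuffled = members[:]
        rng.shuffle(shuffled)
        for i, subject in enumerate(shuffled):
            assignment[subject] = (block_name, treatments[i % len(treatments)])
    return assignment

# Step 1: form blocks of subjects with similar characteristics (hypothetical).
blocks = {
    "men": [f"m{i}" for i in range(6)],
    "women": [f"w{i}" for i in range(6)],
}

# Step 2: randomly assign treatments to the subjects within each block.
assignment = randomized_block(blocks, ["treatment", "placebo"], random.Random(5))
for subject, (block_name, treatment) in sorted(assignment.items()):
    print(subject, block_name, treatment)
```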
Rigorously Controlled
Carefully assign [similar] subjects to different treatment groups so that those given each
treatment have characteristics [factors] that are similar in ways that are important to the
experiment. [Q: How is this different than matched?]
Matched Pairs Design
Compare exactly two treatment groups (such as treatment and placebo) by using subjects
matched in pairs that are somehow related or have similar characteristics. May include
measurements before and after.
Example: Crest experiment on pairs of twins.

Summary - 3 very important considerations in the design of experiments
1. Use randomization to assign subjects to different groups.
2. Use replication by repeating the experiment on enough subjects so that effects of treatments and other factors
can be seen.
3. Control the effects of variables by using techniques such as blinding and a completely randomized experimental
design.

Sampling Errors The same experiment among different groups can have different results.
Sampling Error The difference between a sample result and the true population result; such that an error results from
chance sample fluctuations.
This is the difference between the sample statistic and the true population parameter caused by
chance variation in selecting a sample. This means that results will vary from sample to sample.
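
Sampling error can be illustrated by drawing several samples from one population and comparing each sample mean with the population mean. The population of 10,000 made-up values, the sample size of 50, and the seed are assumptions for this sketch.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical population of 10,000 values between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(10_000)]
parameter = statistics.mean(population)  # true population mean

# Draw five correctly collected random samples and record each sampling error.
errors = []
for _ in range(5):
    sample = rng.sample(population, 50)
    statistic = statistics.mean(sample)
    errors.append(statistic - parameter)

for e in errors:
    print(f"sampling error: {e:+.2f}")
```

Each sample is collected properly, yet the errors differ from sample to sample; that variation is due to chance, not to any mistake in collecting or recording the data.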
Nonsampling Error When the sample data are incorrectly collected, recorded or analyzed (biased sample, defective
instruments, copying the data incorrectly).

Bad Sampling Techniques:
Voluntary response sample-one in which the respondents themselves decide whether to be included.
Convenience sampling- use the results that are easiest to get.

Errors that occur while collecting the sample:
Under-coverage- some group of population is left out of the process when choosing a sample.
Example: Convenience sampling; a census that misses homeless people.
Wording of the question- when the question is leading or wordy
Non-response- failure to obtain data from an individual selected for the sample. The individual may refuse to
participate, or the interviewer may be unable to reach the individual (people not answering the phone!)
Example: Nielsen ratings
Response error- occurs when a subject gives an incorrect response; this could be a misunderstanding of the question or
the respondent lying.

Errors that occur after the sample has been obtained:
Processing errors- mistakes in mechanical tasks, for example doing arithmetic or entering responses into a computer
incorrectly (a type of non-sampling error).


Example 10 The Internet service provider America Online ran a survey of its users and asked if they preferred a real
Christmas tree or a fake one. AOL received 7073 responses, and 4650 of them preferred a real tree. Given that 4650 is
66% of the 7073 responses, can we conclude that about 66% of people who observe Christmas prefer a real tree? Why
or why not?
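
The arithmetic in Example 10 is easy to verify in Python. Note what the check does and does not show: it confirms the percentage among the people who responded, and, as described under voluntary response samples, it says nothing reliable about everyone who observes Christmas.

```python
responses = 7073    # people who chose to answer the AOL survey
prefer_real = 4650  # of those, how many preferred a real tree

share = prefer_real / responses
print(f"share of respondents preferring a real tree: {share:.1%}")
print(f"rounded to the nearest percent: {round(share * 100)}%")
```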

Example 11 The author received a telephone call in which the caller claimed to be conducting a national opinion
research poll. The author was asked if his opinion about Congressional candidate John Sweeney would change if he
knew that in 2001, Sweeney had a car crash while driving under the influence of alcohol. Does this appear to be an
objective question or one designed to influence voters' opinions in favor of Sweeney's opponent, Kirsten Gillibrand?

Example 12 An experiment is conducted in which each subject is given medication to reduce blood pressure. If a subject is
also exercising, the treatment (medication) is confounded with the exercise, and we would be unsure of what is causing the
reduction in blood pressure.

Example 13 There is a positive association between reading scores and shoe size: as reading scores increase, so does
shoe size. There is no direct relationship between the two variables; instead there is a lurking variable, age.
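
Example 13 can be imitated with simulated data. Everything numeric here is invented for illustration: children aged 6 to 12, and made-up formulas in which shoe size and reading score each depend on age plus independent random noise. The helper `pearson` computes the usual correlation coefficient by hand.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(4)

# Lurking variable: age. Both other variables depend on it (made-up relations),
# but neither depends on the other.
ages = [rng.randint(6, 12) for _ in range(2000)]
shoe = [age * 1.0 + rng.gauss(0, 1) for age in ages]
reading = [age * 10.0 + rng.gauss(0, 10) for age in ages]

overall = pearson(shoe, reading)

# Hold the lurking variable fixed: look only at 9-year-olds.
same_age = [(s, r) for a, s, r in zip(ages, shoe, reading) if a == 9]
within = pearson([s for s, _ in same_age], [r for _, r in same_age])

print(f"correlation across all ages:  {overall:.2f}")
print(f"correlation among 9-year-olds: {within:.2f}")
```

Across all ages the two variables are strongly associated, but among children of the same age the association largely disappears, which is what a lurking variable looks like in data.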