Methodology
Notes
Topics
Inferential Statistics
Correlation
T-test and ANOVA
Linear Regression
Report Writing
Literature Review
Based on the assumption that knowledge accumulates & that we
learn from & build on what others have done.
Goals of Literature Review
To show familiarity & establish credibility of researcher & the
importance of the problem
Provide theoretical background to the study
To show the path of prior research & its linkage with the current
project
To learn & improve research methodology, measures used
To integrate & summarize what is known in an area & identify gaps
To learn from others & stimulate new ideas
To contextualize the findings of the study & integrate them into the
existing body of knowledge once data collection & analysis is
complete
Types of Reviews
Scholarly journals
Books
Dissertations
Government documents/Policy reports
Conference papers/ Working papers,
monographs
Web search- www.scholar.google.com
Electronic databases- Sagepub, LexisNexis,
ERIC, JSTOR etc.
Objectives
Concepts
Variables
Hypothesis
Assumptions
Selection of Topic
Personal experience
Curiosity
The state of knowledge in the field
Solving a problem
Personal values
Everyday life
Concepts
Variables
Dimensions of Variables
Types of Variables
Levels of Measurements
The categories of a variable should be exhaustive &
mutually exclusive
o Nominal variables- the values comprise a list of
names (religion, states, occupation)- qualitative
measurement-- involves only classification
o Ordinal variables- the categories have names and
these values can be rank ordered (socio-economic
class, opinions)-- involves classification + rank
ordering
o All ordinal variables can be treated as nominal
variables but not the other way round.
Types of Variables
Categorical variables- measured at nominal or
ordinal levels where variables are divided into
multiple categories
Continuous variables- have continuity in
measurement and are measured at an
interval or ratio level.
Units of Analysis
Populations of people
o Farms
o Communities
o Myths
o Cities
o Countries
Always collect data on the lowest unit of analysis. It is
easy to aggregate data collected on individuals but
not possible to disaggregate data collected on
groups
Hypotheses
Positive
Negative
Curvilinear
Hypothesis
Criterion of a good hypothesis
Conceptual clarity
Should have empirical referents
Specific
Should be related to available technique of testing
Should be related to a body of theory
Causal Hypothesis
Covariation
Hypothesis
Measurement
Statistical or verbal
summarization
Observations
Use of Research
Research Design
Purpose of research
Researcher's interest
General use of theory
Exploratory Research
Descriptive Research
Explanatory Research
Applied Research
Studies may have multiple purposes but
one purpose is usually dominant.
Case Study
Case Studies
Longitudinal Designs
Experimental Design
Quasi Experimental
Time
1( pretest)
2( Post Test)
Experimental
Measure dependent
variable
Control
Measure dependent
variable
Experimental (no
pretest)
It takes care of
Testing the main effects of the
experiment
Understanding the interaction effect of
testing
Combined effect of maturation and
history
Experimental Study
No pre test
Post test only
Controlled group may be given alternate
treatment or placebo
Not as reliable as the other experimental
designs
Eliminates threats to internal validity and
can establish causality
Ethics in Research
Context specific.
Can be universal
Can be specific to a particular context
Can be specific to a particular locality
08/09/11
COMPOSITE MEASURES
QUANTITATIVE RESEARCH
METHODOLOGY
Madhura Nagchoudhuri
Composite Measures
Composite measures are used to measure variables that are
complex or multifaceted such that they cannot be measured
using a single item on a questionnaire e.g. stress, quality of life,
human development
Two types of composite measures:
Indexes
Scales
Composite Measures
Factors to keep in mind while selecting items
to create a scale or an index:
Face validity
Items should have adequate variance- useful in
distinguishing people from each other. People
should not all give the same answer.
Types of Scales
The 3 types of scales most commonly used are:
Likert scales
Semantic differential scales
Guttman scales
Likert Scale
Format frequently used in contemporary survey
questionnaires.
Respondent is presented with a series of statements to which
s/he is to respond indicating whether s/he strongly agrees,
agrees, undecided/neutral, disagrees or strongly disagrees.
There is an unambiguous ordinality in the response categories.
Usually 3-point, 5-point or 7-point (an odd number of categories, with a midpoint)
Assumes that each item on the scale has equal intensity
Lends itself to a simple method of scaling with the possibility of
scoring being done in a uniform way e.g. scores of 1 to 5 may
be assigned, where a score of 5 is assigned to strongly agree
for positive items and to strongly disagree for negative items.
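The uniform scoring rule can be sketched in Python. This is only an illustration: the item names, the 1-to-5 scoring and the choice of which item is negatively worded are assumptions, not part of the notes.

```python
# Response labels mapped to scores for a 5-point Likert item (assumed 1-5 scoring).
SCORES = {
    "strongly disagree": 1,
    "disagree": 2,
    "undecided": 3,
    "agree": 4,
    "strongly agree": 5,
}

def score_item(response, negative=False):
    """Score one response; negatively worded items are reverse-scored."""
    value = SCORES[response.lower()]
    return 6 - value if negative else value

def score_scale(responses, negative_items=()):
    """Sum the item scores of one respondent.

    responses: dict mapping item name -> response label
    negative_items: names of negatively worded items (hypothetical here)
    """
    return sum(
        score_item(answer, negative=item in negative_items)
        for item, answer in responses.items()
    )

respondent = {
    "q1": "strongly agree",     # positive item, scored 5
    "q2": "strongly disagree",  # negative item, reverse-scored to 5
    "q3": "undecided",          # midpoint, scored 3
}
total = score_scale(respondent, negative_items={"q2"})
print(total)
```

Reverse-scoring the negative items keeps a high total meaning the same thing for every item on the scale.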
Guttman Scales
Clear difference in intensity in the way items
are structured moving from the least intense to
the most intense.
If a respondent agrees to the more intense
items (harder items) then one may assume that
s/he will agree to the less intense or easier
items.
E.g. Bogardus Social Distance Scale
Reliability
Deals with the indicators of dependability
A reliable indicator or measure gives the
same result every time
Three types of Reliability:
1. Stability reliability- reliability across time
2. Representative reliability- across
subpopulations, groups of people
3. Equivalence reliability- consistency across
different indicators
Sources of Error
Testing Reliability
Reliability is determined by obtaining two or
more measures of the same thing and seeing
how closely they agree.
Four methods of testing reliability
Test retest
Alternate form
Split Half
Observer reliability
Test-Retest
Repeatedly administering the same instrument to
the same set of people on separate occasions
They should not be subjects in the actual study
If the results of repeated tests are similar, then
the reliability is high
Drawback- the first test has an influence on the
next
Measuring instruments that are strongly
affected by memory or repetition should not be
tested for reliability using this method
Alternate Form
Different but equivalent forms of the same test
are administered to the same group of
individuals usually close in time and then
compared
Drawback- developing equivalent tests can be
time consuming
Some problems associated with test-retest are
not completely eliminated
Split Half
Items of the instrument are divided into
comparable halves
The test is administered and the scores of the
two halves are compared.
If the scores of the two halves agree closely, then the test is reliable
The major problem lies in designing two halves that
are equivalent
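A minimal sketch of the split-half comparison, assuming the halves are formed from odd- and even-numbered items and that agreement is judged by the correlation between the two half scores. The respondent scores are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def split_half(item_scores):
    """item_scores holds one list of item scores per respondent."""
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return pearson(odd_half, even_half)

# Hypothetical scores of five respondents on a six-item instrument
data = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 3, 4, 3, 3],
    [5, 5, 4, 5, 5, 5],
    [1, 2, 1, 1, 2, 2],
]
r_halves = split_half(data)
print(round(r_halves, 2))
```

A correlation close to 1 indicates that the two halves rank respondents in nearly the same way.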
Observer Reliability
Comparing administration of an instrument
done by different observers or interviewers
The observers need to be thoroughly
trained
At least two people will code the content of
the responses according to certain criterion
Validity
Validity: A measure is valid if it measures
what it is supposed to measure
Four Types of Measurement Validity
- Face validity
- Content validity
- Criterion validity
- Construct validity
Face Validity
The easiest type of validity to achieve and
most basic
It is the judgment by the scientific community
that the indicator really measures the construct
Content Validity
It is a special type of face validity
Whether it captures the entire meaning
Is the full content of the definition
represented?
E.g. Feminism, empowerment
Criterion Validity
The validity of an indicator is verified by
comparing it with another measure of
the same construct in which the researcher has
confidence
Two subtypes:
Concurrent
Predictive
Construct Validity
It is for measures with multiple indicators
Two types
Convergent
Discriminant
Sampling
Introduction
Inferential statistical methods use sample statistics to
make predictions about population parameters.
The quality of inferences depends crucially on how well
the sample represents the population.
To ensure a good sample representation,
randomization is essential.
What is randomization?
Randomization is the mechanism for ensuring that the
sample representation is adequate for inferential
methods.
Methods of Sampling
Sampling is quite often used in our day-to-day
practical life, where our purpose is to
determine the population characteristics
only by observing a finite subset of
individuals taken from it.
Sampling methods can be classified under
two heads namely,
1. Probability Sampling Methods
2. Non-probability Sampling Methods
Probability Sampling
Methods
1. Simple Random Sampling
2. Systematic Sampling
3. Stratified Sampling
4. Cluster Sampling
5. Multistage Sampling
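The first three methods can be sketched with Python's standard `random` module. The sampling frame of 100 numbered units and the rural/urban strata are hypothetical, and the function names are chosen only for this illustration.

```python
import random

def simple_random_sample(frame, n, rng):
    """Every unit in the frame has the same chance of selection."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Pick a random start, then take every k-th unit (k = N // n)."""
    k = len(frame) // n
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, rng):
    """Draw the same sampling fraction independently within each stratum."""
    sample = []
    for units in strata.values():
        size = max(1, round(len(units) * fraction))
        sample.extend(rng.sample(units, size))
    return sample

rng = random.Random(42)              # fixed seed so the sketch is repeatable
frame = list(range(1, 101))          # hypothetical sampling frame of 100 units
strata = {"rural": frame[:60], "urban": frame[60:]}  # hypothetical strata

srs = simple_random_sample(frame, 10, rng)
sys_sample = systematic_sample(frame, 10, rng)
strat = stratified_sample(strata, 0.1, rng)
print(srs, sys_sample, strat, sep="\n")
```

In each case the selection is left to a random mechanism, which is what allows probabilities of selection to be known.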
Non-Probability Sampling
Methods
Social research is often conducted in situations that do
not allow the kinds of probability sampling discussed
so far for large-scale social surveys. Suppose we want to
study homelessness. A list of all homeless
individuals is neither available nor can it be created. Moreover,
there are times when probability sampling wouldn't be
appropriate. Such situations call for non-probability
sampling.
Methods of Non-Probability
Sampling
1. Purposive or Judgement Sampling;
2. Volunteer Sampling;
3. Snowball Sampling;
4. Quota Sampling; and
5. Selecting Informants
Snowball Sampling
The snowball sampling method is appropriate when the
members of a special population or individuals with a
rare characteristic are difficult to locate, such as
persons in a village who were bitten by a snake. In this
method, the researcher collects data from the few
members of the target population who can be located, then asks those
individuals to provide the information needed to locate
other members of the target population whom they
happen to know.
Quota Sampling
Quota sampling begins with a matrix, or table, describing the
relevant characteristics of the target population. Depending on
your research purposes, you may need to know what proportion
of the population is male and what proportion female as well as
what proportion of each gender fall into various age categories,
educational levels, ethnic groups, and so forth.
Once you have created such a matrix and assigned a relative
proportion to each cell in the matrix, you proceed to collect data
from people having all the characteristics of a given cell. You
then assign a weight to all the people in a given cell that is
appropriate to their share of the total population. When all the sample
elements are so weighted, the overall data should provide a
reasonable representation of the total population.
Sampling Error
The sampling error of a statistic is
the error that occurs when a statistic
based on a sample estimates or
predicts the value of a population
parameter.
Other Sources of Variability/Error
1. Under-coverage
2. Response Bias
3. Non-response
4. Missing Data
Sampling Distribution
A sampling distribution is a probability
distribution that determines probabilities of
the possible values of a sample statistic.
Standard Error
The standard deviation of the sampling
distribution of the sample statistic is called
the standard error of the statistic
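The definition can be checked by simulation. The sketch below draws repeated samples from a hypothetical population and compares the standard deviation of the sample means with the usual formula for the standard error of the mean, the population standard deviation divided by the square root of n. The population values, sample size and number of replications are assumptions.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical population of 10,000 values
population = [rng.gauss(50, 10) for _ in range(10_000)]
sigma = statistics.pstdev(population)

n = 25
# Approximate the sampling distribution of the mean with 2,000 samples
sample_means = [
    statistics.mean(rng.sample(population, n)) for _ in range(2_000)
]

empirical_se = statistics.pstdev(sample_means)   # SD of the sampling distribution
theoretical_se = sigma / n ** 0.5                # sigma / sqrt(n)
print(round(empirical_se, 2), round(theoretical_se, 2))
```

The two printed values should be close to each other, which is the sense in which the standard error describes sample-to-sample variation.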
Research Methods
Tools & Techniques for
Data Collection
Interview
Questionnaire
Focus Groups
Observation
Methods of Interview
Interview
Types of Interviews
Informal
Unstructured
Interviewing
Unstructured Interviewing-
Interview
Semi-structured Interviewing-
Different Components
The Interviewer
The Researched/Respondent
Skills in Interviewing
Probing
- Silent Probe
- Echo Probe
- Uh-huh Probe
- Tell-Me-More Probe
- Long-Question Probe
Points to Note.
Importance of Language
Pace of the Study
Being Yourself
The little things!!
Using a Tape Recorder (recording equipment
etc)
Taking Notes
Response Effects
Types of Interviews
Face to face
Telephone
Telephone Interviews
Advantages:
Have the impersonal quality of the questionnaire
Inexpensive, need less time and energy
Can reach everyone who has a phone
Less influence of the interviewer's personality
Disadvantages:
Not useful for people without a telephone connection
Cannot be a long schedule
Data can be false if investigators are not monitored
properly
Questionnaire
No method is perfect
On average, the interview method can ensure
82% of fully filled schedules as against 68%
by the questionnaire method
For short schedules and a population having
telephone connections, telephone
interviews are possible
Focus Groups
Observation
Types of Observation
Participant Observation
Advantages
Direct observation
Video recording and then analysing
Content Analysis
Frequency
Direction
Intensity
Space
What is Data ?
Data refers to a collection of organized
information, usually the result of experience,
observation or experiment, other information
within a computer system, or a set of premises.
This may consist of numbers, words, or
images, particularly as measurements or
observations of a set of variables.
Data Processing
Survey data collected from the field
require certain operations before they can be used
for analysis.
The data processing requirements are to be
specified in an earlier stage of any research
study in terms of time, cost, manpower,
materials, etc.
Processing of collected data is required for
drawing out meaningful results. Data
processing involves the various steps, from
editing of questionnaires to analysis and
report-writing.
There are different stages of data processing:
Editing and Scrutinising
Coding and Data Entry
Validation, Checking & Updating
Analysis
Completeness
By completeness, it is meant that the filled-in schedule is complete in all respects. The first
point to check is whether there are answers for
every question. If an interviewer forgets to ask
a question or to record an answer, it may be
possible to deduce from other data on the
questionnaire what the answer should have
been and thus fill the gap at the editing stage.
Accuracy
By accuracy, it is meant that the answers are
correctly filled-in. It is not enough to check
that questions are answered; one must try to
check whether the answers are accurate.
Answers needing arithmetic even of the
simplest kind, should be edited carefully.
Uniformity
The editing stage gives every opportunity for
checking that interviewers have interpreted
questions and instructions uniformly.
For example, suppose a question on the occurrence of a
calamity is to be asked as follows:
"Whether any calamity has occurred in your
village during the past two years". Every
investigator should confine the question to the
past two years only, so that the reference period
is uniform.
Coding
Coding is translating answers into numerical values or
assigning numbers to the various categories of a
variable to be used in data analysis.
Coding is generally done while preparing the questions
and interview schedules. Fieldwork is thus done with
pre-coded questions. However, sometimes, when
questions are not pre-coded, coding is done after the
fieldwork.
Coding is done on the basis of the instructions given in
the codebook. The codebook gives a numerical code
for each variable.
Example
If Age = 9 years,
then code 0 9 (not 9 0)
Types of Questions
Different types of questions should be examined
before coding:
(iv) Open-ended Questions: These questions are left
completely open for the interviewers, and no
alternatives are suggested in the questionnaire. The
reason for this may be either of the following:
(a) The alternatives are known, but there are too many to
make it practicable to list them all (Example :
Contraceptive methods).
(b) The possible replies cannot be foreseen, and as a
consequence the answers are taken down verbatim
and later classified in manageable groups (Example :
Occupational Status).
(v) Multi-Coded Questions: Multi-coded
questions belong to the group of `fixed
alternative' questions, as the number of
possible replies is fixed. However, in multi-coded
questions, the answers are not
necessarily mutually exclusive, so that two or
more answers are allowed for the same
respondent. The codes for this are developed
differently from the other types.
For this type of question, a Binary System of
codes is used, rather than a consecutive order.
The idea is that the codes of all the categories ticked can be
added together to form one code without any
loss of information, as each `sum' represents a
unique combination of answers.
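A small sketch of the binary system of codes, assuming each category is given a power of two (1, 2, 4, 8, ...). The question and its four categories are hypothetical.

```python
# Hypothetical answer categories for a multi-coded question;
# each category gets a power of two as its code.
CATEGORIES = ["radio", "television", "newspaper", "internet"]
CODES = {name: 2 ** i for i, name in enumerate(CATEGORIES)}  # 1, 2, 4, 8

def encode(ticked):
    """Add the codes of all ticked categories into one number."""
    return sum(CODES[name] for name in ticked)

def decode(code):
    """Recover the ticked categories from the combined code."""
    return [name for name in CATEGORIES if code & CODES[name]]

combined = encode(["radio", "newspaper"])   # 1 + 4
print(combined)
print(decode(combined))
```

Because every category contributes a different binary digit, no two combinations of answers can add up to the same code.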
A detailed coding manual or set of instructions should
be prepared before the coding begins. Since the
editing and coding operations are related, the timing
of the coding depends on that of the editing. In
general, the coding should not begin until there are an
adequate number of edited questionnaires available,
and there is assurance that there will be a continuous
flow of questionnaires. Once the coding starts, there
should not be delays due to the unavailability of
edited questionnaires. There must be adequate office
space so that questionnaires can be checked as they
are returned from the field.
The unedited questionnaires should be kept separate
from the edited ones. Likewise, those that have been
coded should be stored separately from those not yet
coded. Adequate working space should be provided
for each individual coder so that there is no
overcrowding and the work can proceed satisfactorily.
All coders should be given specific training for
sufficient understanding of the job. The most effective
way to train is to ensure that they are given enough
on-the-job practice, followed up with careful
evaluation of the work performed.
Data Entry: The data are entered into a
computer, for example into the SPSS
package.
Editing: The data are checked and corrected
on computer for format and structure errors to
ensure that all and only required data are
present. Also, the data are checked and
corrected for out of range and inconsistent
responses.
Recoding: The edited data are transformed
from the actual responses to a set of variables
convenient for analysis.
Tabulation: The recorded data are tabulated
according to the specifications laid down for
writing reports.
Archiving and further analysis: The different
data files with complete documentation are
organized for further research.
It is very important for any meaningful interpretation
of data that all possible errors and inconsistencies are
corrected before the analysis phase.
Thus cleaning or machine editing of data is an
extremely important function involving both the
researchers and data processors.
Essentially, computer editing is a repetition of the
manual editing and is necessary both because of
human error in the manual operation and to correct
errors introduced during coding and punching.
Machine Editing
After the office editing, a more comprehensive
checking must be carried out by the computer.
Machine editing can be divided into two main stages.
A) Format and structure check, which involves
checking the following items:
Each part of the identification (e.g. sample area,
household, and line number) contains a valid value.
All sample households are present.
B) Range and Consistency Checks:
All codes are within the ranges specified for them in
the code book.
All skips in the questionnaire have been correctly
executed.
The information recorded is internally consistent.
Dates in the event histories flow in a sequential order
with a specified minimum elapsed time between
events.
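A minimal sketch of range, skip and consistency checks on coded records. The variable names, the codebook ranges and the skip rule are hypothetical; the function only reports errors and leaves the data untouched.

```python
# Hypothetical codebook: valid range of codes for each variable.
CODEBOOK = {"sex": (1, 2), "age": (0, 98), "ever_married": (1, 2)}

def check_record(record):
    """Return a list of error messages; the record itself is not changed."""
    errors = []
    # Range check: every code must lie within the codebook range.
    for var, (low, high) in CODEBOOK.items():
        value = record.get(var)
        if value is None or not low <= value <= high:
            errors.append(f"{record['id']}: {var}={value} out of range")
    # Skip check (hypothetical rule): age at marriage is asked only of
    # respondents coded 1 (yes) on ever_married.
    married_age = record.get("age_at_marriage")
    if record.get("ever_married") == 2 and married_age is not None:
        errors.append(f"{record['id']}: skip not followed")
    # Consistency check: age at marriage cannot exceed current age.
    if married_age is not None and record.get("age") is not None:
        if married_age > record["age"]:
            errors.append(f"{record['id']}: age_at_marriage > age")
    return errors

records = [
    {"id": "001", "sex": 1, "age": 34, "ever_married": 1, "age_at_marriage": 22},
    {"id": "002", "sex": 3, "age": 25, "ever_married": 2, "age_at_marriage": 30},
]
report = [msg for rec in records for msg in check_record(rec)]
for line in report:
    print(line)
```

Each message carries the record identification so that the original questionnaire can be looked up.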
The computer is used to locate errors and not
to make corrections.
During the format, structure and consistency
checks, error reports are produced from the
computer.
Correct values are looked up in the original
questionnaires and written into suitable update
forms along with the identification of the
record to be corrected. This work is usually
done by the office editors.
It is, therefore, very important that: (i) the error
reports from the computer are clear and easily
comprehensible to the non-data processing
staff, and (ii) the update forms for writing
down the corrections are simple to fill out. The
way corrections are to be made on the computer
should be carefully organized.
Questionnaires should be easily
accessible and located on shelves clearly
labelled with the cluster/region to which they
belong.
The editing staff looking up the corrections
must be thoroughly trained on how to interpret
error listings from the computer, how to look
up appropriate corrections and how to fill out
the update forms.
The contents of update forms are key punched
and used to update the computer files. The
whole checking and correction procedure must
be repeated until no more errors are
encountered.
What is a codebook?
A codebook describes and documents the
questions asked or items collected in a survey.
The codebook will describe the subject of the
survey or data collection, the sample and how
it was constructed, and how the data were
coded, entered, and processed.
The questionnaire or survey instrument will be
included along with a description or layout of
how the data file is organized.
Broadly, what do the answers to these questions mean in real terms while
answering the research questions of the study?
Data Analysis
Univariate Table
Table 4.3 HELP TAKEN FOR ADMISSION PROCESS
Self                         18 (47.3%)
[label missing in source]     8 (21.5%)
Trustees or volunteers        3 (7.8%)
[label missing in source]     3 (7.8%)
Hostel seniors                2 (5.2%)
Friends                       2 (5.2%)
Don't remember                2 (5.2%)
Cell entries: Count (% within Sex)
          no              yes              Total
female    2095 (21.1%)    7847 (78.9%)     9942 (100.0%)
male      1517 (13.3%)    9874 (86.7%)     11391 (100.0%)
Total     3612 (16.9%)    17721 (83.1%)    21333 (100.0%)
Graphical Representation
Different types of Graphs: The primary purpose of
graphical representation is to highlight important
features of the data.
- Line graph- displays trends in data
- Bar chart/graph- used to show nominal or ordinal
level data
- Pie chart- used to show nominal or ordinal data
- Histogram- used to represent distributions of
interval or ratio data.
[Figure: bar chart of Count by SES (low, middle, high); pie chart of SES with labels low 23.5% and middle 47.5%]
the distribution.
It is very sensitive to extreme scores.
The mean is least subject to sampling variation as compared
to other measures of central tendency. If repeated samples
were taken from a population, the mean would vary
somewhat from sample to sample but it would vary a lot less
than the median or mode. This is the reason why it is used so
frequently in inferential statistics.
Measures of Dispersion/Variability
Topics
Measures of Association
Correlation
Chi square test
t-test ( for independent samples)
One way ANOVA
Types of Statistics
Broadly, statistics are of two types:
I. Descriptive and II. Inferential
Descriptive statistical procedures summarise large
groups of numbers. They are also called summary
statistics.
Ex: Measures of Central tendency, variance, S.D.,
correlation and so on.
The second category of statistics is called inferential
statistics. Inferential statistics are the statistical
techniques used by researchers to generalize from
characteristics of a small group to a larger group not
measured by the researcher.
Ex: t-test, ANOVA, Chi-Square etc.
a) parametric and
b) non-parametric or
distribution-free statistics.
Parametric statistics are only for interval/ratio levels of
measurement, though some use them on ordinal data also.
They assume homogeneity of variance.
Non-parametric statistics do not specify normality or
homogeneity of variance. Some researchers prefer to use
these statistics when these two assumptions are violated.
Non-parametric statistics are used for nominal/ordinal
levels of measurement.
Non-Parametric
Spearman's rank correlation (rho)
Sign test
Mann-Whitney U test
One-way ANOVA
Kruskal-Wallis
ANOVA of ranks
One-way
ANOVA
repeated measures
one-way
Parametric tests:
The observations must be independent and the
sample a random one.
Observations must be drawn from normally
distributed populations.
At least, the level of measurement must be on
interval scale.
Homogeneity of variance.
Non-parametric tests:
A non-parametric statistical test is a test whose model
does not specify conditions about the parameters of
the population from which the sample was drawn.
Most non-parametric tests apply to data in an ordinal
scale, and some to data in a nominal scale.
Simply increasing the size of N increases the
efficiency of non-parametric statistics.
Descriptive Stats:
Frequency Distributions
A frequency distribution is a display of the frequency
of occurrence of each value/score. It can be
presented either in a tabular form or as a graph.
Bar charts are suitable for nominal variables; for
interval/ratio variables, histograms and frequency
polygons are useful.
Measures of Central tendency- mean, mode and
median.
The measures of variation include range, standard
deviation and variance.
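These measures can be computed with Python's standard `statistics` module. The scores below are hypothetical, and the variance and standard deviation use the sample (n - 1) form.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores

mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
value_range = max(scores) - min(scores)
variance = statistics.variance(scores)   # sample variance (n - 1 denominator)
sd = statistics.stdev(scores)            # sample standard deviation

print(mean, median, mode, value_range, round(variance, 2), round(sd, 2))
```

Replacing the 9 with a much larger value would pull the mean upward while leaving the median and mode unchanged, which illustrates the mean's sensitivity to extreme scores.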
31 Aug 2012
Descriptive Stats:
Frequency Distributions
To obtain a frequency table, measures of central tendency, and variability:
Analyze > Descriptive Statistics > Frequencies > select
variables > Statistics > Continue > Charts (Histograms) >
OK
Descriptive stats
To explain about the differences between
M,Md,Mdn,SD, range, variance.
To show the calculation of Standard deviation.
Mention briefly if necessary about z scores.
Then go to Cross tabulation and chi-square
Multiple response analysis
Correlation.
Cross tabulation
Helps us explore the relationship between
variables. It goes beyond descriptive
statistics.
Chi-square, in turn, tells us whether two
variables are related or independent.
Chi-Square test
There are different types of Chi-square analysis:
Test for goodness of fit
Applies to the analysis of single categorical
variable and determines if differences in
frequency exist across response categories
compared to the population from which the
sample is drawn.
Test of Independence
Applies to independence or relatedness
between two categorical variables. This is a
very common method used by researchers.
Chi-Square test
Questions addressed by Chi-square test.
Chi-Square test
Degrees of freedom = (r-1)(c-1).
If the calculated value is higher than the table (critical)
value, or equivalently if the significance reported in the
output is below the chosen level, then we conclude that
there is a significant association between the two variables.
Significance level usually selected will be: .05 or
lower.
How to report: there is a significant relationship
between sex of respondent and the possession
of land as assets ($\chi^2$ = 34.21, df = 1, p < 0.001).
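The test of independence can be sketched in plain Python. The counts are those of the sex-by-response cross tabulation given earlier in these notes (rows female and male, columns no and yes); the function name is chosen for this illustration, and the critical value 3.841 is the standard chi-square table value for df = 1 at the .05 level.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and df for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)   # (r-1)(c-1)
    return chi2, df

# Rows: female, male; columns: no, yes
observed = [[2095, 7847], [1517, 9874]]
chi2, df = chi_square_independence(observed)

CRITICAL_05_DF1 = 3.841   # table value for df = 1 at the .05 level
print(round(chi2, 2), df, chi2 > CRITICAL_05_DF1)
```

Since the calculated value is far above the table value, the two variables in that table would be judged to be associated.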
Objectives
Explain what is meant by a chi-square goodness of fit test
Conduct a chi-square goodness of fit test
Given a two-way table, compute conditional distributions
Conduct a chi-square test for homogeneity of populations
Conduct a chi-square test for association / independence
Use technology to conduct a chi-square significance test
Chi-Square Distribution
Total area under a chi-square curve is equal to 1
It is not symmetric, it is skewed right
The shape of the chi-square distribution depends on the
degrees of freedom (just like t-distribution)
As the number of degrees of freedom increases, the chi-square
distribution becomes more nearly symmetric
The values of $\chi^2$ are nonnegative; that is, values of $\chi^2$ are
always greater than or equal to zero (0); the curve rises to a
peak and then asymptotically approaches 0
Conditions
All Chi-Square tests (Goodness of Fit, Homogeneity,
Independence):
Independent SRSs
All expected counts are greater than or equal to 1
(all $E_i \geq 1$)
No more than 20% of expected counts are less than 5
Remember it is the expected counts, not the observed counts,
that are the critical conditions
Test of Association/Independence
Correlation
value
Increase
Increase
Decrease
Decrease
Increase
Decrease
Decrease
Increase
No association
Correlation
Correlation and causation.
Association between variables doesn't mean causation.
Sometimes variables may be spuriously correlated.
Problems relating to multicollinearity
How to report the result?
r(N = 75) = .78, p < 0.05.
Person   Height (x)   y      x*y      x*x     y*y
1        68           4.1    278.8    4624    16.81
2        71           4.6    326.6    5041    21.16
3        62           3.8    235.6    3844    14.44
4        75           4.4    330      5625    19.36
5        58           3.2    185.6    3364    10.24
6        60           3.1    186      3600    9.61
7        67           3.8    254.6    4489    14.44
8        68           4.1    278.8    4624    16.81
9        71           4.3    305.3    5041    18.49
10       69           3.7    255.3    4761    13.69
11       68           3.5    238      4624    12.25
12       67           3.2    214.4    4489    10.24
13       63           3.7    233.1    3969    13.69
14       62           3.3    204.6    3844    10.89
15       60           3.4    204      3600    11.56
16       63           4      252      3969    16
17       65           4.1    266.5    4225    16.81
18       67           3.8    254.6    4489    14.44
19       63           3.4    214.2    3969    11.56
20       61           3.6    219.6    3721    12.96
Sum =    1308         75.1   4937.6   85912   285.45
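The column sums of the height table are exactly what the computational formula for the sample correlation coefficient needs. The sketch below recomputes them from the 20 pairs; the second variable is unnamed in the notes, so it is simply called `y` here, and the function name is an assumption.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Computational formula for r using the column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

height = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
          68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
y = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
     3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

r = pearson_r(height, y)
print(round(r, 2))
```

With these data the coefficient comes out at about .73, a fairly strong positive linear relationship.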
MULTIPLE RESPONSE/CHOICE
ANALYSIS
i. To buy groceries
ii. To visit relatives
iii. To visit friends
iv. To run errands
v. To attend social function
vi. To places of worship
vii. Any other..
%*
172 (73.2)
To buy groceries
160(68.1)
To visit relatives
157(66.8)
To places of worship
146(62.1)
138(58.7)
For a stroll
114 (48.5)
To run errands
87(37.0)
Shopping/visit mall
71(30.2)
36(15.3)
To park
34(14.5)
22(9.4)
Statistical Inference
The process of generalization in prescribed manner
from a sample to its universe is known as
Statistical Inference.
Population Parameters
$\mu$: Population mean
$\sigma$: Population standard deviation
Sample Statistic
$\bar{x}$: Sample mean
s: Sample standard deviation
Universe/Population
SAMPLE
HYPOTHESIS TESTING
Hypothesis testing in inferential statistics involves
making inferences about the nature of the
population on the basis of observations of a sample
drawn from the population.
What is Statistical Hypothesis?
A Hypothesis is a statement/conjecture about one or
more population parameters.
Null Hypothesis
$H_0: \mu = 455$
OR
$H_0: \mu - 455 = 0$
Where
$\mu$ = population mean
455 = Hypothesized value to be tested
                                Null hypothesis is true    Null hypothesis is false
Reject null hypothesis          Type I error               Correct decision
Do not reject null hypothesis   Correct decision           Type II error
$z = \dfrac{\bar{X} - \mu}{\sigma_{\bar{X}}}$
4. Decide about H0
Suppose we had found that the sample mean ($\bar{X}$) for 144
students was not 535, but 465. Our hypotheses, sampling
distribution, and critical values (+1.96 and -1.96) remain
the same, but now the test statistic is
$z = \dfrac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \dfrac{465 - 455}{8.33} = 1.20$
[Figure: sampling distribution with the critical values -1.96 and +1.96 marked; the test statistic 1.20 lies between them]
Note that the test statistic does not exceed the critical value; it does not fall
into the region of rejection, and we should not reject the null
hypothesis
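The decision rule can be sketched with the standard library. The population standard deviation of 100 is an assumption: it is the value that reproduces the standard error of 8.33 used above when n = 144. The function name is also chosen only for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z test; returns (z, critical value, reject?)."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    return z, critical, abs(z) > critical

# sigma = 100 is assumed so that sigma / sqrt(144) = 8.33, as in the text.
z, critical, reject = z_test(sample_mean=465, mu0=455, sigma=100, n=144)
print(round(z, 2), round(critical, 2), reject)
```

With a sample mean of 465 the statistic stays inside the critical values, so the null hypothesis is not rejected; calling the same function with 535 leads to rejection.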
$H_0: \mu = 455$
$H_1: \mu < 455$
Here the critical region lies on the left tail of the distribution
$s_{\bar{X}} = \dfrac{s}{\sqrt{n}}$
Student's t Distributions
Does the adjustment of using s to estimate $\sigma$ have an effect on the statistical test?
Actually, it does, especially for small samples. The effect is that the normal
distribution is inappropriate as the sampling distribution of the mean. In the
beginning of the 20th century William S. Gosset found that, for small samples,
sampling distribution departed substantially from the normal distribution and that,
as sample sizes changed, the distributions changed. This gave rise to not one
distribution but a family of distributions.
The t distributions are a family of symmetrical, bell-shaped distributions that
change as the sample size changes.
Degrees of Freedom : The number of degrees of freedom is a mathematical
concept defined as the number of observations less the number of restrictions
placed on them.
$t = \dfrac{\bar{X} - \mu}{s_{\bar{X}}}$
Where
$s_{\bar{X}} = \dfrac{s}{\sqrt{n}}$
$\text{Test Statistic} = \dfrac{\text{Statistic} - \text{Parameter}}{\text{Standard error of the Statistic}}$
This test statistic is then compared to the critical value. If the test statistic exceeds the critical
values in absolute value, then the null hypothesis is rejected
Confidence Interval
When $\sigma^2$ is Known
$CI = \bar{X} \pm (z_{CV})(\sigma_{\bar{X}})$
Where
$\bar{X}$ = Sample mean
$z_{CV}$ = Critical value using the normal distribution and
$\sigma_{\bar{X}}$ = Standard error of the mean
Confidence Interval
When $\sigma^2$ is Unknown
$CI = \bar{X} \pm (t_{CV})(s_{\bar{X}})$
Where
$\bar{X}$ = Sample mean
$t_{CV}$ = Critical value using appropriate t distribution and
$s_{\bar{X}}$ = estimated standard error of the mean from the sample
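A minimal sketch of the known-variance interval, reusing the sample mean 465 and standard error 8.33 from the earlier z-test example. The function name `z_confidence_interval` is hypothetical. For the unknown-variance case the same arithmetic applies, with the critical value taken from a t table instead of the normal distribution.

```python
from statistics import NormalDist


def z_confidence_interval(sample_mean, std_error, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval.

    Hypothetical helper: CI = mean +/- z_cv * standard error.
    """
    z_cv = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z_cv * std_error
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(sample_mean=465, std_error=8.33)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```

The interval contains the hypothesized value 455, which agrees with the decision not to reject the null hypothesis in the two-tailed test.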
Goals
After this, you should be able to:
[Figure: scatter plots of y against x illustrating curvilinear relationships and weak relationships]
Correlation Coefficient
(continued)
Features of $\rho$ and r
Unit free
Range between -1 and 1
The closer to -1, the stronger the negative
linear relationship
The closer to 1, the stronger the positive
linear relationship
The closer to 0, the weaker the linear
relationship
Examples of Approximate
r Values
[Figure: scatter plots with r = -1, r = -.6, r = 0, r = +.3, r = +1]
Calculating the
Correlation Coefficient
Sample correlation coefficient:
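$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{[\sum (x - \bar{x})^2][\sum (y - \bar{y})^2]}}$
or the algebraic equivalent:
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}$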
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
Calculation Example
Tree Height (y) | Trunk Diameter (x) | xy | y^2 | x^2
35 | 8 | 280 | 1225 | 64
49 | 9 | 441 | 2401 | 81
27 | 7 | 189 | 729 | 49
33 | 6 | 198 | 1089 | 36
60 | 13 | 780 | 3600 | 169
21 | 7 | 147 | 441 | 49
45 | 11 | 495 | 2025 | 121
51 | 12 | 612 | 2601 | 144
Σ=321 | Σ=73 | Σ=3142 | Σ=14111 | Σ=713
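Calculation Example
(continued)
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} = \dfrac{8(3142) - (73)(321)}{\sqrt{[8(713) - (73)^2][8(14111) - (321)^2]}} = 0.886$
[Figure: scatter plot of Tree Height, y, against Trunk Diameter, x]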
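The calculation can be checked with a short Python sketch of the computational formula. The function name `pearson_r` is hypothetical. The trunk diameters are taken to be 8, 9, 7, 6, 13, 7, 11, 12, values that are consistent with the xy and x^2 columns and with the column sums (Σx = 73, Σxy = 3142).

```python
from math import sqrt

# Tree example data: heights (y) and trunk diameters (x)
heights = [35, 49, 27, 33, 60, 21, 45, 51]
diameters = [8, 9, 7, 6, 13, 7, 11, 12]


def pearson_r(x, y):
    """Sample correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(diameters, heights)
print(f"r = {r:.3f}")
```

Rounded to three decimals, the result agrees with the 0.886 obtained by hand.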
No Relationship
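$y = \beta_0 + \beta_1 x + \varepsilon$
where
$\beta_0$ = Population intercept
$\beta_1$ = Population slope coefficient
$x$ = Independent variable
$\varepsilon$ = Random error term, or residual
$\beta_0 + \beta_1 x$ is the linear component; $\varepsilon$ is the random error component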
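$y = \beta_0 + \beta_1 x + \varepsilon$
(continued)
[Figure: regression line with intercept $\beta_0$ and slope $\beta_1$; at $x_i$, the random error $\varepsilon_i$ is the gap between the observed value of y and the predicted value of y]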
$\hat{y}_i = b_0 + b_1 x$
where
$b_0$ = Estimate of the regression intercept
$b_1$ = Estimate of the regression slope
$x$ = Independent variable
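The slides do not say how the intercept and slope estimates are obtained. A common choice is ordinary least squares, sketched below under that assumption. The function name `least_squares` and the five data points are hypothetical and used only for illustration.

```python
def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1 * x.

    Assumes ordinary least squares, which the slides do not state explicitly.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    b1 = s_xy / s_xx           # estimated slope
    b0 = mean_y - b1 * mean_x  # estimated intercept
    return b0, b1


# Made-up data for illustration only
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x_values, y_values)
print(f"y-hat = {b0:.1f} + {b1:.1f} x")
```

With these made-up points the fitted line is y-hat = 2.2 + 0.6 x: each unit increase in x raises the predicted y by 0.6.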
Introduction
Correlation: measures the strength of the linear relationship between two
variables
Regression analysis: determines the nature of the relationship
[Figure: scatter plots illustrating r = +1, r = -1, r = 0, and r = 0.6]
Scatter plot
BMD: dependent variable (make inferences about)
Calcium intake: independent variable
[Figures: Non-Normal data; Normalised]
Interpreting correlation
strong correlation
Agreement
measuring devices
users
techniques
Non-parametric correlation
Make no assumptions
Carried out on ranks
Spearman's $\rho$
  easy to calculate
Kendall's $\tau$
  has some advantages over $\rho$
  distribution has better statistical properties
  easier to identify concordant / discordant pairs
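Both rank-based coefficients can be sketched in plain Python. This is an illustration under simplifying assumptions: the data are made up, there are no tied values (ties would need average ranks and a tie-corrected formula), and the helper names `ranks`, `spearman_rho`, and `kendall_tau` are hypothetical.

```python
from itertools import combinations


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rho from squared rank differences (no-ties formula)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move the same way
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up data for illustration only
x_values = [1, 2, 3, 4, 5]
y_values = [2, 1, 4, 3, 5]
rho = spearman_rho(x_values, y_values)
tau = kendall_tau(x_values, y_values)
print(f"Spearman rho = {rho:.1f}, Kendall tau = {tau:.1f}")
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so tau = 0.6, while rho = 0.8.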
Role of regression
linear
curvilinear
dependent variable Y: BMD
independent variable X: dietary intake of Calcium
Y = a + bX
a = value of Y when X = 0
b = change in Y when X increases by 1
Role of regression
Used to predict
extrapolation risky!
relation between age and bone age
Multiple regression
Logistic regression
outcome is yes / no
e.g. predict whether a patient with Type 1 diabetes
will undergo limb amputation given history of
prior ulcer, time diabetic etc.
result is a probability
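To show how a logistic model turns predictors into a probability, here is a minimal sketch. The coefficients and predictor values are invented for illustration and do not come from any fitted model or from the slides. The function name `logistic_probability` is hypothetical.

```python
from math import exp


def logistic_probability(intercept, coefficients, predictors):
    """Probability of a 'yes' outcome from a logistic model.

    Hypothetical helper: p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + ...))).
    """
    linear = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1 / (1 + exp(-linear))


# Invented coefficients, NOT estimates from real data:
# predictors are (prior ulcer: 1 = yes, years diabetic)
p = logistic_probability(intercept=-3.0,
                         coefficients=[1.2, 0.05],
                         predictors=[1, 20])
print(f"Predicted probability = {p:.2f}")
```

Whatever the inputs, the result always lies between 0 and 1, which is why the output can be read as a probability.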
Summary
Correlation
Regression
Regression:
Checking the Model
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Objectives of session
Recognise the need to check fit of
the model
Carry out checks of assumptions in
SPSS for simple linear regression
Understand predictive model
Understand residuals
[Figure: scatter plot of the dependent variable (y) against the explanatory variable (x), with SPSS coefficient tables showing Sig. = .000]
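H0 : slope b = 0
Test t = slope/se = -0.008/0.002 = -4.546 (the rounded values shown give -4; the reported
statistic presumably reflects unrounded estimates) with p<0.001, so statistically significant
Predicted LDL = 2.024 - 0.008 x Age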
Age | Predicted LDL
45 | 1.664
55 | 1.584
65 | 1.504
75 | 1.424
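The predictions follow directly from the fitted equation. A minimal sketch, where the function name `predicted_ldl` is hypothetical and the coefficients are the rounded values quoted in the slide:

```python
def predicted_ldl(age, intercept=2.024, slope=-0.008):
    """Predicted LDL from the fitted line: 2.024 - 0.008 * Age."""
    return intercept + slope * age


for age in (45, 55, 65, 75):
    print(age, round(predicted_ldl(age), 3))
```

Each additional 10 years of age lowers the predicted LDL by 0.08, as the tabulated values show.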
Assumptions of Regression
1. Relationship is linear
2. Outcome variable and hence
residuals or error terms are approx.
Normally distributed
Definition of a residual
A residual is the difference between
the actual value and the predicted value (fitted line),
i.e. the unexplained variation
$r_i = y_i - E(y_i)$
Or
$r_i = y_i - (a + b x_i)$
Residuals
Output:
Scatterplot of residuals vs. predicted
Note
1) Mean of residuals = 0
2) Most of data lie within + or - 3 SDs of mean
Output:
Histogram of standardised residuals
Plot of residuals with normal curve superimposed
Output:
Cumulative probability plot
Look for deviation from diagonal line to indicate non-normality
Output:
Description of residuals
Descriptive statistics for residuals
Residuals Statistics(a)
Statistic | Minimum | Maximum | N
Predicted Value | 1.314867 | 1.843205 | 1383
Residual | -1.65389 | 4.0658469 | 1383
Std. Predicted Value | -2.750 | 3.264 | 1383
Std. Residual | -2.302 | 5.660 | 1383
Maximum Std. Residual of 5.660: worth investigation?
Casewise Diagnostics(a)
Case Number | Std. Residual | Observed Value | Predicted Value | Residual
164 | 5.660 | 5.5840 | 1.518153 | 4.0658471
209 | 4.395 | 4.5260 | 1.368685 | 3.1573148
250 | 3.143 | 3.7875 | 1.529325 | 2.2581750
268 | 3.064 | 3.8730 | 1.671664 | 2.2013357
274 | 3.227 | 4.0953 | 1.777153 | 2.3180975
362 | 4.095 | 4.5350 | 1.593460 | 2.9415398
517 | 3.636 | 4.3240 | 1.711788 | 2.6122125
849 | 3.968 | 4.3290 | 1.478113 | 2.8508873
1047 | 4.207 | 4.4360 | 1.413686 | 3.0223141
1075 | 3.885 | 4.4040 | 1.613219 | 2.7907805
1103 | 3.519 | 3.9905 | 1.462584 | 2.5279157
1229 | 3.016 | 3.7660 | 1.599254 | 2.1667456
1290 | 3.975 | 4.2345 | 1.379107 | 2.8553933
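The casewise listing can be tied back to the definition of a residual with a short sketch. It reads the three 13-value columns of the listing as observed value, residual, and predicted value; that reading is an interpretation, supported by observed minus predicted reproducing the listed residuals. The residual standard deviation is not given in the output, so it is inferred here from one row (residual divided by standardised residual). The helper names are hypothetical.

```python
# (case number, observed value, predicted value) for three of the flagged cases
cases = [
    (164, 5.5840, 1.518153),
    (209, 4.5260, 1.368685),
    (250, 3.7875, 1.529325),
]

# Inferred, not reported: residual / standardised residual for case 164
RESIDUAL_SD = 4.0658471 / 5.660  # roughly 0.718


def residual(observed, predicted):
    """Residual = actual value minus predicted (fitted) value."""
    return observed - predicted


def is_outlier(observed, predicted, residual_sd, threshold=3.0):
    """Flag a case whose standardised residual exceeds the threshold in size."""
    return abs(residual(observed, predicted) / residual_sd) > threshold


for case_number, observed, predicted in cases:
    r = residual(observed, predicted)
    flagged = is_outlier(observed, predicted, RESIDUAL_SD)
    print(case_number, round(r, 4), round(r / RESIDUAL_SD, 3), flagged)
```

All three cases exceed 3 standard deviations, which is why they appear in the casewise diagnostics.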
Output:
Model fit and serial correlation
Model Summary
Model 1: R = .121
Summary
After fitting any regression model, check assumptions
Functional form: linearity is the default but often not the best fit; consider quadratic
Check residuals for approx. normality
Check residuals for outliers (> 3 SDs)
All accomplished within SPSS
What is ANOVA?
A statistical method for testing whether the means of a dependent variable are equal
across two or more groups (i.e., assessing the probability that any differences in means
across several groups are due solely to sampling error).
Variables in ANOVA (Analysis of Variance):
Dependent variable is metric.
Independent variable(s) is nominal with two or more levels, also called
treatment, manipulation, or factor.
One-way ANOVA: only one independent variable with two or more levels.
Two-way ANOVA: two independent variables, each with two or more levels.
With ANOVA, a single metric dependent variable is tested as the outcome of a
treatment or manipulation.
With MANOVA (Multivariate Analysis of Variance), two or more metric dependent
variables are tested as the outcome of a treatment(s).
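A one-way ANOVA compares the variation between group means with the variation within groups. The sketch below computes the F ratio for three made-up groups; the function name `one_way_anova_f` and the data are hypothetical. The resulting F would then be compared with a critical value from an F table for the stated degrees of freedom.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Hypothetical helper: F = mean square between / mean square within.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Made-up data: one nominal factor with three levels, metric outcome
treatment_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio, df_b, df_w = one_way_anova_f(treatment_groups)
print(f"F({df_b}, {df_w}) = {f_ratio:.1f}")
```

For these made-up groups the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F(2, 6) = 13.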
Components of an Empirical
Research Paper in Economics
Title
Abstract
Table of Contents
Introduction and Literature Survey
Theoretical Analysis
Empirical Testing
Conclusions
References
Introduction
The purpose of the introduction to the
research report is to provide the rationale for
the research. This rationale should address
four issues:
What is the nature of the issue or problem the
research investigates?
Why is this worthy of investigation?
What have previous researchers discovered
about this issue or problem?
What does your research attempt to prove?
The Answer
It depends on how many major studies have been
completed on the topic.
If you only report one or two sources, readers may
suspect that you have not put enough effort into
searching the literature. You don't want to miss a
major study, since at best it will make you look
careless and at worst it may weaken the rationale for
your research.
Theoretical Analysis
The purpose of this part of research is to
present the theoretical analysis of the issue or
problem you are investigating. This is also
described as presenting your theoretical
model.
Conclusions
The purpose of this part of the research report is to
summarize your findings, that is, to restate your
argument and conclude whether or not it is valid. In
light of the statistical results, what can you infer
about your hypothesis? To what extent did your
empirical testing confirm your analysis?
Abstract or Summary
Introduction
Review of Literature
Methods
Results
Conclusions and Discussion
References
Note:
Abstract or Summary
The abstract or summary tells the reader very briefly what the main
points and findings of the paper are.
This allows the reader to decide whether the paper is useful to them.
Get into the habit of reading only abstracts while searching for
papers that are relevant to your research.
Read the body of a paper only when you think it will be useful to
you.
Introduction
The introduction tells the reader:
Introductions should:
Review of Literature
The literature review tells the reader what other researchers
have discovered about the paper's topic or tells the reader
about other research that is relevant to the topic. Often what
students call a research paper is merely a literature review.
Along the way it states facts and ideas about the social world
and supports those facts and ideas with evidence of where
they came from (empiricism).
Review of Literature
If an idea is used, but cannot be substantiated by the
community of sociologists, the literature review clearly shows
that the author is speculating and details the logic of the
speculation.
Note: Explaining why social events occur as they do requires use (and
testing) of explanations that have worked before. THESE
EXPLANATIONS ARE CALLED THEORIES.
Research hypotheses
Hypotheses are statements of the expected relationship(s)
between two (or more) variables
For example:
Men will have higher investment income than women.
Older Americans are more likely to oppose abortion for a
woman who doesn't want her baby because she is poor.
Methods
A METHODS SECTION MUST CONTAIN:
Results
The results section includes:
References
Book Chapter:
Last Name, first name. Year. Chapter Name. Pages in the book in Book Name, edited
by first name last name. City of Publisher: Publisher.
Bianciardi, Roberto. 1997. "Growing Up Italian in New York City." Pp. 179-213 in Adult
Narratives of Immigrant Childhoods, edited by Ana Relles. Rose Hill, PA:
Narrative Press.
Book:
Last name, first name. Year. Book Name. City of Publisher: Publisher.
Stryker, Sheldon. 1980. Symbolic Interactionism: A Social Structural Version. Menlo
Park, CA: Benjamin/Cummings.
What they said about your topic in the journals, books, and
other publications
From you:
What is it?
Original sources of research are all the proof we have for some facts.
Without the paper trail of academic thought:
To avoid plagiarism: