Revised Chapter 3

Chapter 3
Conceptual Framework
3.1 Definition of Terms:
Bias is defined as the difference between the expected value of an estimator and the true
value of the parameter being estimated. The bias of an estimator can be positive,
negative or zero. An estimator having zero bias is said to be an unbiased
estimator. The bias is expressed by:
$$\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$
where $\hat{\theta}$ is the estimator of the parameter and $\theta$ is the true value of
the parameter.
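The definition can be checked on a small case. The sketch below is illustrative only: the three-value population, the sample size, and the function names are assumptions, not part of the study. It enumerates every equally likely sample of size 2 drawn with replacement and computes the bias of two variance estimators exactly.

```python
from fractions import Fraction
from itertools import product

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / divisor

def bias(estimator, population, n):
    """Bias = E[estimate] - true value, with the expectation taken over
    every equally likely ordered sample of size n drawn with replacement."""
    samples = list(product(population, repeat=n))
    expected = sum(estimator(s) for s in samples) / len(samples)
    true_value = variance(population, len(population))
    return expected - true_value

population = [1, 2, 3]   # hypothetical population; its true variance is 2/3
n = 2                    # hypothetical sample size

bias_div_n = bias(lambda s: variance(s, len(s)), population, n)
bias_div_n_minus_1 = bias(lambda s: variance(s, len(s) - 1), population, n)

print(bias_div_n)          # -1/3, a negative bias
print(bias_div_n_minus_1)  # 0, an unbiased estimator
```

The divisor-n estimator has a negative bias, while the divisor-(n-1) estimator has zero bias and is therefore unbiased in the sense defined above.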
Accuracy is a measure of how close the estimates are to the true value.
Precision is a measure of how close the estimates are to one another.
Efficiency is a measure of how well a job is accomplished, judged against a set
of criteria, with a minimum waste of time, effort or skill.
Nonresponse (NR) is the failure to obtain a valid response from a unit in the survey.
3.2 Types of NR
The types of nonresponse are distinguished by how much information is missing from a
sampled unit. Kalton (1983) stressed the importance of differentiating the types of
nonresponse: total (unit) nonresponse, item nonresponse and partial nonresponse.

Unit (or total) nonresponse takes place when no information is collected from a
sampling unit. There are many causes of this type of nonresponse, namely, failure to
contact the respondent (not at home, moved, or unit not found), refusal to provide
information, inability of the unit to cooperate (possibly due to an illness or a language
barrier), or questionnaires that are lost.
Item nonresponse, on the other hand, happens when the information collected from a
unit is incomplete because some of the questions were not answered. There are many
causes of item nonresponse: the informant lacks the information needed to answer the
question; the informant fails to make the effort required to establish the information
by retrieving it from memory or by consulting records; the informant refuses to answer
because the question is sensitive, embarrassing, or considered irrelevant to his
perception of the survey's objectives; the interviewer fails to record an answer; or
the response is subsequently rejected at an edit check on the grounds that it is
inconsistent with other responses (which may include an inconsistency arising from a
coding or punching error occurring in the transfer of the response to the computer data
file).
Partial nonresponse is the failure to collect large sets of items for a responding unit.
It occurs when a sampled unit fails to provide responses in one or more waves of a panel
survey, in later phases of a multi-phase data collection procedure (e.g. the second
visit of the FIES), or for later items in the questionnaire after breaking off a telephone
interview. Other causes include data that remain unavailable after all possible checking
and follow-up; responses that fail to satisfy natural or reasonable constraints known as
edits, so that one or more items are designated as unacceptable and are therefore
artificially missing; and causes similar to those of unit (total) nonresponse. In this
study, the researchers dealt with partial nonresponse occurring in the second visit of the
FIES 1997.
3.3 Patterns of NR
A critical issue in addressing the problem of nonresponse is identifying the pattern of
nonresponse. Determining the patterns of nonresponse is important because it influences
how missing data should be handled. There are three patterns of nonresponse, namely,
Missing Completely at Random (MCAR), Missing at Random (MAR) and Non-Ignorable
Nonresponse (NIN).
Missing data are said to be MCAR if the probability of having a missing value for Y is
unrelated to the value of Y itself or to any other variable in the data set. Data that are
MCAR reflect the highest degree of randomness and show no underlying reasons for
missing observations that can potentially lead to biased research findings (Musil et al.,
2002). Hence, the missing data are randomly distributed across all cases, such that the
occurrence of missing data is independent of the other variables in the data set. An
example of the MCAR pattern would be a laboratory sample that is dropped without any
apparent reason.
Another pattern of nonresponse is the MAR case. Missing data are considered to be
MAR if the probability of missing data on Y is unrelated to the value of Y after
controlling for other variables in the analysis. This means that the likelihood of a case
having incomplete information on a variable can be explained by other variables in the
data set. Suppose that, in the MCAR example, the laboratory sample was instead dropped
because of an error in handling the amount of drug to be used. Given the amount of drug
to be used, missingness does not depend on the laboratory sample itself.
Meanwhile, NIN is regarded as the most problematic nonresponse pattern. When the
probability of missing data on Y is related to the value of Y, and possibly to some other
variable Z, even after other variables are controlled for in the analysis, the case is
termed NIN. NIN missing data have systematic, nonrandom factors underlying the
occurrence of the missing values that are not apparent or otherwise measured. NIN missing
data are the most problematic because they limit the generalizability of research
findings and may create biased parameter estimates, such as means, standard deviations,
correlation coefficients or regression coefficients (Musil et al., 2002). In the same
laboratory example, the pattern would be NIN if the sample was dropped because of hidden
causes, such as an invalid type of drug, that remain even after controlling for the
amount of drug that was used.
The pattern of nonresponse is an important assumption to establish before any imputation
takes place. For an imputation procedure to work and achieve statistically acceptable and
reliable estimates, the pattern of nonresponse must satisfy either the MCAR or the MAR
assumption. For this study, the researchers created missing observations that satisfy the
MCAR assumption.
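The three patterns can be contrasted with a small simulation. The sketch below is only an illustration under stated assumptions: the function name `make_missing`, the median thresholds, and the missingness rates are hypothetical choices and do not describe how the missing observations in this study were created. Each record holds an always-observed variable x and a variable y that may go missing.

```python
import random

def make_missing(records, mechanism, rng, rate=0.3):
    """Return a copy of `records` with y set to None according to `mechanism`.

    records: list of (x, y) pairs; x is always observed.
    """
    median_x = sorted(x for x, _ in records)[len(records) // 2]
    median_y = sorted(y for _, y in records)[len(records) // 2]
    result = []
    for x, y in records:
        if mechanism == "MCAR":
            p = rate                               # same chance for every unit
        elif mechanism == "MAR":
            p = 2 * rate if x > median_x else 0.0  # depends only on observed x
        elif mechanism == "NIN":
            p = 2 * rate if y > median_y else 0.0  # depends on y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        result.append((x, None if rng.random() < p else y))
    return result

rng = random.Random(1997)                            # fixed seed for repeatability
records = [(i, 10 + 2 * i) for i in range(20)]       # hypothetical data
for mechanism in ("MCAR", "MAR", "NIN"):
    data = make_missing(records, mechanism, rng)
    print(mechanism, sum(1 for _, y in data if y is None), "missing")
```

Under the MCAR branch every unit has the same chance of being missing, which is the assumption adopted in this study. The MAR branch can be explained by the observed x alone, while the NIN branch cannot be explained without the unobserved y values.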
3.4 NR Bias
In most surveys, missing data can easily invalidate the results of subsequent analysis.
Missing data can be discarded, ignored or substituted through
some procedure. When data are deleted or ignored in generating estimates,
nonresponse bias becomes a problem (Kalton, 1983). The effect of deleting the missing
data on NR bias is illustrated below.
Suppose the population is divided into two groups or strata. The first group consists of
all units from which information was obtained. These units are known as the respondents.
The second group consists of all units from which no information or incomplete information
was obtained. These units are known as the nonrespondents.
Let R be the number of respondents and M (M stands for missing) be the number of
nonrespondents in the population, with N = R + M. Assume that a simple random sample
is drawn from each group. The corresponding sample quantities are r and m, with r + m = n.
Let $\bar{R} = R/N$ and $\bar{M} = M/N$ be the proportions of respondents and
nonrespondents in the population, and let $\bar{r} = r/n$ and $\bar{m} = m/n$ be the
response and nonresponse rates in the sample (Kalton, 1983).
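If no compensation is made for nonresponse, the respondent sample mean $\bar{y}_r$ is used
to estimate $\bar{Y}$. Its bias is given by $B(\bar{y}_r) = E(\bar{y}_r) - \bar{Y}$. The
expectation of $\bar{y}_r$ can be obtained in two stages, first conditional on fixed r and
then over different values of r, i.e. $E(\bar{y}_r) = E_1 E_2(\bar{y}_r)$, where $E_2$ is
the conditional expectation for fixed r and $E_1$ is the expectation over different values
of r. Thus,
$$E(\bar{y}_r) = E_1 E_2(\bar{y}_r) = E_1(\bar{Y}_r) = \bar{Y}_r,$$
where $\bar{Y}_r$ is the population mean of the respondents. Since the overall population
mean is $\bar{Y} = \bar{R}\,\bar{Y}_r + \bar{M}\,\bar{Y}_m$, where $\bar{Y}_m$ is the
population mean of the nonrespondents and $\bar{R}$ and $\bar{M}$ are the population
proportions of respondents and nonrespondents, the bias of $\bar{y}_r$ is given by
$$B(\bar{y}_r) = \bar{Y}_r - \bar{Y} = \bar{M}(\bar{Y}_r - \bar{Y}_m).$$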
The equation above shows that $\bar{y}_r$ is approximately unbiased for $\bar{Y}$ if either
the proportion of nonrespondents $\bar{M}$ is small or the mean for nonrespondents,
$\bar{Y}_m$, is close to that for respondents, $\bar{Y}_r$. Since the survey analyst
usually has no direct empirical evidence on the magnitude of $(\bar{Y}_r - \bar{Y}_m)$,
the only situation in which he can have confidence that the bias is small is when the
nonresponse rate is low. However, in practice, even with moderate $\bar{M}$ many survey
results escape sizable biases because $(\bar{Y}_r - \bar{Y}_m)$ is fortunately often not
large (Kalton, 1983).
Many procedures can be applied to reduce the nonresponse bias caused by missing data,
and one of these procedures is imputation. In this study, imputation procedures
are applied to compensate for nonresponse and to reduce the bias of the estimates.
Imputation is briefly defined as the substitution of values for the nonresponse
observations.
3.5 Imputation Process
Imputation is one of the many procedures that can be used to deal with nonresponse and
to reduce the bias of the results. Imputation is defined as a process of replacing a
missing value, through available statistical and mathematical techniques, with a value
that is considered to be a reasonable substitute for the missing information (Kalton,
1983).
Imputation has certain advantages. First, imputation methods help reduce biases in survey
estimates. Second, imputation makes analysis easier and the results are simpler to
present. Imputation does not make use of complex algorithms to estimate the population
parameters in the presence of missing data; hence, much processing time is saved. Lastly,
using imputation techniques can ensure consistency of results across analyses, a feature
that an incomplete data set cannot fully provide.
On the other hand, imputation also has several disadvantages. There is no guarantee that
the results obtained after applying imputation methods will be less biased than those
based on the incomplete data set. There is a possibility that the biases from the results
using imputation could be greater. Hence, the use of imputation methods depends on the
suitability of the assumptions built into the imputation procedures used. Even if the biases
of univariate statistics are reduced, there is no assurance that the distribution of the data
and the relationships between variables will remain. More importantly, imputation is just
a fabrication of data. Many naive researchers falsely treat the imputed data as a complete
data set for n respondents as if it were a straightforward sample of size n.
Four imputation methods (IMs) are applied in this study, namely, Overall (Grand) Mean
Imputation, Hot Deck Imputation, Deterministic Regression Imputation and Stochastic
Regression Imputation. Most imputation methods require imputation classes to be defined
before the IMs can be performed. Problems may arise for methods that rely on imputation
classes if the classes are not formed with caution. One such problem is the number of
imputation classes, which must be fixed for each method. The larger the number of
imputation classes, the greater the possibility of having few observations in a class,
which can increase the variance of the estimates for that class. On the other hand, the
smaller the number of imputation classes, the greater the possibility of having many
observations in a class, which burdens the estimates with aggregation bias.
3.4.1 Overall Mean Imputation
The mean imputation method is the process by which a missing value is imputed by the
mean of the available units of the imputation class to which it belongs (Cheng &
Sy, 1999). One type of this method is the Overall Mean Imputation (OMI)
method. The OMI method simply replaces each missing value by the overall mean of the
available (responding) units in the same population. The overall mean is given by:
$$\bar{y}_r = \frac{1}{r}\sum_{i=1}^{r} y_i$$
where $y_i$ is the value for the i-th responding unit and r is the number of responding
units.
The imputation class for this method is the entire population itself. In fact, in much
of the related literature, imputation classes are not required and are often ignored in
performing this method.
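The OMI rule described above can be written in a few lines. This is a minimal sketch rather than the procedure used in the study: the function name `overall_mean_impute`, the use of `None` to mark a missing value, and the sample values are assumptions made for illustration.

```python
def overall_mean_impute(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    overall_mean = sum(observed) / len(observed)
    return [overall_mean if v is None else v for v in values]

incomes = [10.0, None, 14.0, 12.0, None]   # hypothetical data
print(overall_mean_impute(incomes))        # [10.0, 12.0, 14.0, 12.0, 12.0]
```

Every missing value receives the same number, the mean of the three observed values, and no imputation classes are involved.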
This method has both advantages and disadvantages. The advantage of using
this method is its universality: it can be applied to any data set.
Moreover, this method does not require homogeneous imputation classes
or highly correlated variables. Without imputation classes, the method becomes
easier to use and results are generated faster. Among the related literature included in
this study, this is the most used method for imputing missing data.
However, this method has serious disadvantages. First, since missing values are
imputed by a single value, the distribution of the data becomes distorted (see Figure 1).
The distribution of the data becomes too peaked, making it unsuitable for many subsequent
analyses. Second, it produces large biases and variances because it does not allow
variability in the imputation of missing values. Much of the related literature states
that this is the least effective method and strongly discourages its use.
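The peaking of the distribution can be seen numerically. In the sketch below, five hypothetical observed values are completed with five copies of their mean, as OMI would do for five nonrespondents; the values and counts are assumptions chosen only for illustration.

```python
from statistics import mean, variance

observed = [8.0, 10.0, 12.0, 14.0, 16.0]   # hypothetical responding units
n_missing = 5                              # hypothetical number of nonrespondents

# OMI fills every missing value with the mean of the observed values.
completed = observed + [mean(observed)] * n_missing

print(mean(observed), mean(completed))     # the mean is unchanged
print(variance(observed))                  # 10.0
print(variance(completed))                 # noticeably smaller than 10.0
```

The mean is unchanged, but the sample variance of the completed data is less than half that of the observed values, because the imputed values add cases at the centre of the distribution without adding any spread.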

3.4.2 Hot Deck Imputation
One of the most popular and widely known methods is the Hot Deck Imputation
(HDI) method. The HDI method is the process by which the missing observations are
imputed by choosing a value from the set of available units. This value is either selected
at random (traditional hot deck), in some deterministic way with or without
replacement (deterministic hot deck), or based on a measure of distance (nearest-neighbor
hot deck). To perform this method, let Y be the variable that contains missing data and X
the set of variables that have no missing data. The missing data are imputed as follows:
1. Find a set of categorical X variables that are highly associated with Y. The X
variables selected will be the matching variables in this imputation.
2. Form a contingency table based on the X variables.
3. If a cell of the table contains cases with a missing Y value, select a case from the
set of available units in that cell and impute its Y value for the missing value. The
donor and the case with the missing value must have similar or exactly the same
characteristics.
If the matching variables are closely associated with the variable being imputed, the
nonresponse bias should be reduced, which is similar to the advantage of imputation
classes stated earlier.
Example 1: Suppose that a simulation study is conducted to investigate the effect of
imputation on the bias of the estimates. Assume that three people in the survey refused
to answer some of the questions in the study. Missing answers from each unobserved unit
are replaced by a known value from an observed unit with similar characteristics, such
as sex, degree or course (Course), Dean's Lister (DL), honor student in high school
(HS2), and hours of study classes (HSC). Suppose the set of X matching variables is
DL and HS2.
Table 1: Imputed values of GPA using the HDI
*Values in brackets are imputed values

Person  Sex  DL  HS2  HSC  GPA*
1       M    Y   Y    2    [3.999]
2       F    Y   N    1    3.567
3       F    N   N    0    1.298
4       F    N   Y    0    2.781
5       M    N   Y    1    2.344
6       M    N   N    0    1.111
7       M    N   Y    1    [2.781]
8       F    Y   N    1    3.246
9       F    Y   N    1    [3.246]
10      F    Y   Y    1    3.999
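The matching in Table 1 can be reproduced with a short sketch of the traditional (random) hot deck. The code is an illustration under stated assumptions: the function name `hot_deck`, the use of `None` for a missing GPA, and the seed are hypothetical, and donors are drawn at random with replacement within each (DL, HS2) cell, so the donor chosen for persons 7 and 9 may differ from the one shown in the table.

```python
import random
from collections import defaultdict

# Rows of Table 1 as (person, DL, HS2, GPA); None marks a missing GPA.
# Sex and HSC are omitted because they are not matching variables here.
rows = [
    (1, "Y", "Y", None), (2, "Y", "N", 3.567), (3, "N", "N", 1.298),
    (4, "N", "Y", 2.781), (5, "N", "Y", 2.344), (6, "N", "N", 1.111),
    (7, "N", "Y", None), (8, "Y", "N", 3.246), (9, "Y", "N", None),
    (10, "Y", "Y", 3.999),
]

def hot_deck(rows, rng):
    """Impute each missing GPA with a random donor from the same (DL, HS2) cell."""
    donors = defaultdict(list)
    for _, dl, hs2, gpa in rows:
        if gpa is not None:
            donors[(dl, hs2)].append(gpa)
    imputed = []
    for person, dl, hs2, gpa in rows:
        if gpa is None:
            cell = donors.get((dl, hs2))
            if not cell:
                raise ValueError(f"no donor available for person {person}")
            gpa = rng.choice(cell)   # donors are sampled with replacement
        imputed.append((person, dl, hs2, gpa))
    return imputed

completed = hot_deck(rows, random.Random(1997))
for row in completed:
    print(row)
```

Person 1 has a single possible donor (person 10), while persons 7 and 9 each have two candidate donors in their cells, which is why a random hot deck can give different completed data sets from run to run.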
Table 2: Original Data vs. Imputed Data for OMI and HDI
Figures 2, 3, and 4 show the distributions of the original data and of the data imputed
using the OMI and HDI methods.
Figure 2: Bar Graph of the imputed data using HDI
Figure 3: Bar Graph of the imputed data using OMI
Figure 4: Bar Graph of the Original data

The use of hot deck imputation is justified here. First, because the imputed values come
from the same imputation class, the nonresponse bias of the data decreases. This is
because the observations within an imputation class are homogeneous. If the OMI method
were used here, the bias would be expected to increase. More importantly, the distribution
of the data is preserved. Under OMI, the distribution is certain to be distorted, since
only one value would be substituted for all the missing values.
As with OMI, there are certain advantages in using this method. One major attraction of
this method cited by Kazemi (2005) is that the imputed values are all actual observed
values. More importantly, the shape of the distribution is preserved. Since imputation
classes are introduced, the chance of distorting the distribution decreases.
On the other hand, it also has a set of disadvantages. First, in order to form imputation
classes, all X variables must be categorical. Second, if values are imputed with
replacement, the possibility of generating a distorted data set increases as the
nonresponse rate increases: observations from the same donor record might be used
repeatedly for the missing values, causing the shape of the distribution to become
distorted. Third, the number of imputation classes must be limited to ensure that every
missing value will have a donor in its class.
3.4.3 Regression Imputation
As in MI and HDI methods, this procedure is one of the widely known used imputation
methods. The method of imputing missing values via the least-squares regression is
known to be the Regression Imputation (RI) method. There are many ways in creating
a regression model. The y-variable for which imputations are needed is regressed on the
auxiliary variable (x1, x2, ..., xp) for the units providing a response on y. These auxiliary
variables may be quantitative or qualitative, the latter being incorporated into the regres-
sion model by means of dummy variables. In other related studies, categories of the
matching variable are transformed into dummy variables because they provide
3.4.3.1 Deterministic Regression Imputation
The use of the predicted value from the model, given the values of the auxiliary variables
(which contain no missing data) for the record with a missing response in the variable y,
is called Deterministic Regression Imputation (DRI). This method is seen as a
generalization of the mean imputation method. The model for DRI is given by:
$$\hat{y}_k = \hat{\beta}_0 + \sum_{i=1}^{p} \hat{\beta}_i x_{ik}$$
where
$\hat{y}_k$ is the predicted value to be imputed for the k-th nonresponding unit
$\hat{\beta}_0$ and $\hat{\beta}_i$ are the parameter estimates
$x_{ik}$ is the i-th auxiliary variable, which can be either a quantitative variable or a
dummy variable, for the k-th nonresponding unit
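A minimal sketch of DRI follows, under a simplifying assumption that is not part of the study: only one quantitative auxiliary variable x is used, so the least-squares estimates have a closed form. The function names and the data are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def deterministic_regression_impute(records):
    """records: list of (x, y) pairs with y possibly None.

    The model is fitted on the responding units only, and each missing y
    is replaced by its predicted value b0 + b1 * x.
    """
    respondents = [(x, y) for x, y in records if y is not None]
    b0, b1 = fit_simple_regression(*zip(*respondents))
    return [(x, b0 + b1 * x if y is None else y) for x, y in records]

records = [(1.0, 3.0), (2.0, 5.0), (3.0, None), (4.0, 9.0)]   # hypothetical data
print(deterministic_regression_impute(records))
```

The three responding units lie on the line y = 1 + 2x, so the missing value at x = 3 is imputed as approximately 7. Every nonrespondent with the same x would receive the same imputed value, which is why the method is described as deterministic.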
There are advantages and disadvantages of using DRI. DRI has the potential to produce
imputed values that are close to the true values of the nonresponse observations. For the
method to be effective, that is, to impute a predicted value near the actual value, a high
$R^2$ is needed. Although this method has the potential to produce close imputed values,
it is a time-consuming operation, and it is often unrealistic to consider applying it to
all the items with missing values in a survey.
Using DRI can also underestimate the variance of the estimates. It can also distort the
distribution of the data. One major disadvantage of this method is that it can produce
out-of-range or infeasible values (e.g. predicting a negative age).
3.4.3.2 Stochastic Regression Imputation
The use of the predicted value from the model has undesirable distributional properties
similar to those of the mean imputation method. To compensate for this, an estimated
residual is added to the predicted value. The use of this predicted value plus some type
of randomly chosen estimated residual is called the Stochastic Regression Imputation (SRI)
method. The model for SRI is given by:
$$\hat{y}_k = \hat{\beta}_0 + \sum_{i=1}^{p} \hat{\beta}_i x_{ik} + \hat{e}_k$$
where
$\hat{y}_k$ is the value to be imputed for the k-th nonresponding unit
$\hat{\beta}_0$ and $\hat{\beta}_i$ are the parameter estimates
$x_{ik}$ is the i-th auxiliary variable, which can be either a quantitative variable or a
dummy variable, for the k-th nonresponding unit
$\hat{e}_k$ is the randomly chosen residual for the k-th nonresponding unit
There are various ways in which this could be done depending on the assumptions made
about the residuals. The following are some possibilities:

There are advantages and disadvantages in using SRI. Similar to DRI, this method can
produce imputed values that are close to the true values of the nonresponse observations
if the model has a high $R^2$. This method is also a time-consuming operation, and it is
often unrealistic to consider applying it to all the items with missing values in a
survey. This method can also produce out-of-range values even when the predicted value
without the added residual is in range: under SRI, adding the residual to a feasible
deterministic imputation could result in an infeasible value.
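The SRI model described in this section can be sketched as follows. Two assumptions are made purely for illustration and are not taken from the study: a single quantitative auxiliary variable is used, and the residual added to each prediction is drawn at random, with replacement, from the residuals of the responding units. Other residual choices are possible. The function name and data are hypothetical.

```python
import random

def stochastic_regression_impute(records, rng):
    """records: list of (x, y) pairs with y possibly None.

    Imputes b0 + b1 * x plus a residual drawn at random from the
    residuals observed among the responding units.
    """
    resp = [(x, y) for x, y in records if y is not None]
    n = len(resp)
    x_bar = sum(x for x, _ in resp) / n
    y_bar = sum(y for _, y in resp) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in resp)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in resp) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [y - (b0 + b1 * x) for x, y in resp]
    completed = []
    for x, y in records:
        if y is None:
            y = b0 + b1 * x + rng.choice(residuals)   # prediction plus residual
        completed.append((x, y))
    return completed, (b0, b1), residuals

records = [(1.0, 2.0), (2.0, 4.5), (3.0, None), (4.0, 8.0), (5.0, 9.5)]
completed, coefficients, residuals = stochastic_regression_impute(
    records, random.Random(1997)
)
print(completed)
```

Because the added residual can be negative or positive, an imputed value can fall outside the feasible range even when the deterministic prediction is feasible, which is the risk noted above.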
