A Thesis
Presented to
The Faculty of the Mathematics Department
College of Science
De La Salle University - Manila
In Partial Fulfillment
of the Requirements for the Degree
Bachelor of Science in Statistics Major in Actuarial Science
by
Diana Camille B. Cortes
James Edison T. Pangan
August 2007
Approval Sheet
PANEL OF EXAMINERS
Acknowledgments
The researchers would like to extend their warmest gratitude to the following
people, who have undoubtedly contributed to the success of this study:
• To Dr. Jun Pacificador Jr., for his supervision, suggestions and guidance
throughout the duration of this thesis.
• To our panelists, Dr. Rechel Arcilla, Prof. Imelda de Mesa and Ms. Michele
Tan for helping us improve our thesis.
• To Dr. Ederlina Nocon, for providing us the software LaTeX during THSMTH1.
• To our parents, especially Jed's mother, Prof. Erlinda Pangan, for constantly
reminding the researchers about the thesis (i.e. "Is your thesis done yet?").
• To Mark Nanquil and Norman Rodrigo, for helping us in using LaTeX and
for their unwavering support to our thesis.
• To our classmates, friends from COSCA, La Salle Debate Society and Math
Circle for their continuous encouragement and support.
• Lastly, to The Lord Almighty, for providing us the strength, patience, wis-
dom and determination to finish this thesis.
Table of Contents
Title Page
Approval Sheet
Acknowledgments
Table of Contents
Abstract
1.1 Introduction
3 Conceptual Framework
Deterministic Regression
Stochastic Regression
4 Methodology
6 Conclusion
List of Tables
• Table 1: Imputed Values of GPA Using HDI
• Table 15: Ranking of the Different Imputation Methods: 10% NRR
• Table 16: Ranking of the Different Imputation Methods: 20% NRR
• Table 17: Ranking of the Different Imputation Methods: 30% NRR
List of Figures

Abstract
Several imputation methods have been developed for imputing missing responses,
and it is often unclear which method is best suited to a particular situation.
In choosing an imputation method, several factors should be considered, such as
the types of estimates that will be generated, the type and pattern of nonresponse,
and the availability of auxiliary data that are highly correlated with the
characteristic of interest or with the response propensity.
This study compared the effectiveness of four imputation methods, namely
Overall Mean, Hot Deck, Deterministic Regression and Stochastic Regression
Imputation, using the first-visit variable as the auxiliary variable. Values of the
second-visit variables Total Income and Total Expenditures (TOTIN2 and
TOTEX2) were set to nonresponse to satisfy the assumption of partial
nonresponse. The results of the study provide some support for the following
conclusions: (a) for the 1997 FIES data, the Hot Deck Imputation and Overall
Mean Imputation methods are not appropriate for handling partial nonresponse
data; (b) Stochastic Regression Imputation was selected as the best imputation
method; and (c) the imputation classes must be homogeneous to produce less
biased estimates.
Chapter 1
1.1 Introduction
Missing data in sample surveys is inevitable. The problem of missing data occurs
for various reasons, such as when the respondent has moved to another location,
refuses to participate in the survey, or is unable to answer specific items. This
failure to obtain responses from the units selected in the sample is called
nonresponse. There are several types of nonresponse: (a) unit nonresponse
refers to the failure to collect any data from a sample unit; (b) item
nonresponse refers to the failure to collect valid responses to one or more items
from a responding sample unit (e.g. in surveys with only one phase, or when a
single phase is considered and the others ignored); and (c) partial nonresponse
occurs when there is a failure to collect responses for a large set or block of items
from a responding unit (e.g. in a two-phase survey, when the respondent cannot
answer in the second phase, so the second-phase items are missing).
The effect of nonresponse must not be ignored, since it leads to biased estimates.
In practice, there are three ways of handling missing data: discarding the missing
values, applying weighting adjustments, or using imputation methods. Discarding
the missing values, otherwise known as the Available Case Method, excludes
the nonresponse records when analyzing the variable of interest. The problem
with this method is that it does not account for differences in characteristics
between the responding and nonresponding units. Hence, methods for
compensating for missing data are applied. Another method is weighting
adjustments, which matches nonrespondents to respondents in terms of the
data available on nonrespondents and increases the weights of the matched
respondents to account for the missing values: the respondents' weights are
multiplied by the inverse of the response rate. This is often applied to unit
nonresponse.
On the other hand, imputation is also used by statisticians to account for non-
response, usually in the case of item and partial nonresponse. In imputation, a
missing value is replaced by a reasonable substitute for the missing information.
Once nonresponse has been dealt with, whether by weighting adjustments or im-
putation, then researchers can proceed with their data analysis.
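A toy sketch of the three approaches just described, using made-up numbers in
which None marks a nonresponding unit:

```python
# Hypothetical responses: None marks a nonresponding unit.
responses = [120, None, 95, 130, None, 110, 105, None]

# (1) Available Case Method: drop the missing records and analyze the rest.
observed = [y for y in responses if y is not None]
available_case_mean = sum(observed) / len(observed)

# (2) Weighting adjustment: inflate each respondent's weight by the inverse
# of the response rate so respondents stand in for the nonrespondents.
response_rate = len(observed) / len(responses)
weights = [1 / response_rate] * len(observed)
weighted_total = sum(w * y for w, y in zip(weights, observed))

# (3) Imputation: replace each missing value with a reasonable substitute
# (here, the respondent mean) and keep the full sample size.
imputed = [y if y is not None else available_case_mean for y in responses]
```

The available case estimate uses only five of the eight units, while the weighted
and imputed versions compensate for the three nonrespondents in different ways.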
With the 1997 FIES as the data set for this study, this paper will focus on dealing
with partial nonresponse through the use of imputation methods. It aims to exam-
ine the effects of imputed values in coming up with estimates for the missing data
at various nonresponse rates. Furthermore, the study aims to determine which
imputation method is appropriate for the FIES data by applying some of the
methods discussed in the study of the 1978 Research Panel Survey for the
Income Survey Development Program (ISDP) entitled Compensating for Missing
Data by Kalton (1983).
1. Which imputation method is the most appropriate for the FIES data?
2. How do varying nonresponse rates affect the results for each imputation
method?
Since most statistical packages require the use of complete data before conducting
any procedure for data analysis, the use of imputation methods can ensure con-
sistency of results across analyses, something that an incomplete data set cannot
fully provide.
More importantly, given the great impact of this survey on the country,
employing imputation methods helps statisticians handle nonresponse, which
could lead to more meaningful generalizations about our country's income
distribution, spending patterns and poverty incidence. Hence, estimates with
less bias and more consistent results can help our policymakers and economists
provide better solutions for improving the lives of Filipinos.
Throughout this paper, only the 1997 Family Income and Expenditure Survey
(FIES) will be used to tackle the problem of nonresponse and to examine the
impact of the different imputation methods applied to the dataset. With regard
to the extent to which these imputation methods will be applied and evaluated,
this paper covers only the partial nonresponse occurring in the National Capital
Region (NCR), since NCR is noted as the region with the highest nonresponse
rate. Also, the variables to be imputed in this study are Total Income (TOTIN2)
and Total Expenditures (TOTEX2) in the second visit of the FIES data.
The researchers will only focus on using the 1997 FIES data on the first visit
to impute the partial nonresponse present in the second visit. This paper
also assumes that the first-visit data are complete and that the pattern of
nonresponse follows the Missing Completely at Random (MCAR) case. The
MCAR case holds if the probability of response to Y is unrelated to the value of
Y or to any other variables, making the missing data randomly distributed
across all cases (Musil et al., 2002). If the pattern of nonresponse does not
satisfy the MCAR assumption, imputation methods may not achieve their
purpose.
As for the imputation methods, only four will be applied for this paper namely:
Overall Mean Imputation (OMI), Hot Deck Imputation (HDI), Deterministic Re-
gression Imputation (DRI) and Stochastic Regression Imputation (SRI).
In evaluating the efficacy and appropriateness of the four imputation methods,
this study is limited to the following: (a) the bias of the mean of the imputed
data, (b) an assessment of the distributions of the imputed versus the actual
data, and (c) the criteria mentioned in the report entitled Compensating for
Missing Data (Kalton, 1983), namely the Mean Deviation, Mean Absolute
Deviation and the Root Mean Square Deviation.
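The three evaluation criteria can be sketched as follows; the formulas assume
the usual definitions (signed mean difference, mean absolute difference, and root
mean squared difference between imputed and actual values):

```python
import math

def mean_deviation(actual, imputed):
    # Mean Deviation: average signed difference between imputed and actual
    # values; a negative value indicates systematic underestimation.
    return sum(yi - ya for ya, yi in zip(actual, imputed)) / len(actual)

def mean_absolute_deviation(actual, imputed):
    # Mean Absolute Deviation: average magnitude of the differences,
    # measuring the ability to reconstruct the deleted values.
    return sum(abs(yi - ya) for ya, yi in zip(actual, imputed)) / len(actual)

def root_mean_square_deviation(actual, imputed):
    # Root Mean Square Deviation: like MAD but penalizing large misses more.
    return math.sqrt(
        sum((yi - ya) ** 2 for ya, yi in zip(actual, imputed)) / len(actual)
    )
```

The Mean Deviation measures bias (errors can cancel), while the other two
measure accuracy of reconstruction (errors cannot cancel).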
Chapter 2
Much research effort has been devoted to the efficacy of various imputation
methods. In the report entitled Compensating for Missing Survey Data, two
simulation studies were carried out using data from the 1978 Income Survey
Development Program (ISDP) Research Panel to compare some imputation
methods. The first
study compared imputation methods for the variable Hourly Rate of Pay while
the second dealt with the imputation of the variable Quarterly Earnings. For both
studies, the author stratified the data into its imputation classes, constructed data
sets with missing values by randomly deleting some of the recorded values in the
original dataset and then applied the various imputation methods to fill in the
missing values. This process was replicated ten times to ensure consistency of the
results. Once the imputation methods have been applied, the three measures for
evaluating the effectiveness of imputation methods namely the Mean Deviation,
Mean Absolute Deviation and the Root Mean Square Deviation were obtained
and averaged across the ten trials (Kalton, 1983).
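Kalton's simulation design described above (delete recorded values at random,
impute, repeat, and average the criteria across trials) can be sketched as
follows; the 10% nonresponse rate and the mean-imputation rule in the example
are illustrative choices, not the study's exact settings:

```python
import random

def simulate(original, nonresponse_rate, impute_rule, n_trials=10, seed=7):
    # Kalton-style simulation sketch: delete a random subset of recorded
    # values, impute them with impute_rule(observed), and average the
    # Mean Deviation of the imputations across the trials.
    rng = random.Random(seed)
    n_missing = int(round(nonresponse_rate * len(original)))
    trial_deviations = []
    for _ in range(n_trials):
        missing_idx = rng.sample(range(len(original)), n_missing)
        observed = [y for i, y in enumerate(original) if i not in missing_idx]
        md = sum(impute_rule(observed) - original[i]
                 for i in missing_idx) / n_missing
        trial_deviations.append(md)
    return sum(trial_deviations) / n_trials

# Overall-mean imputation as the rule being evaluated.
def mean_rule(observed):
    return sum(observed) / len(observed)
```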
For the first study of imputing the variable Hourly Rate of Pay, eight methods
were used, namely the Grand Mean Imputation (GM), the Class Mean Imputation
using eight imputation classes (CM8), the Class Mean Imputation using ten
imputation classes (CM10), Random Imputation with eight imputation classes
(RM8), Random Imputation with ten imputation classes (RM10), Multiple Re-
gression Imputation (MI), Multiple Regression Imputation plus a random residual
chosen from a normal distribution (MN) and Multiple Regression Imputation plus
a randomly chosen respondent residual (MR). Using the Mean Deviation criteria,
the results showed that all mean deviations were negative, indicating that the im-
puted values underestimated the actual values. Moreover, the results show that
the Grand Mean Imputation (GM) has the greatest underestimation among the
eight procedures. Meanwhile for the Mean Absolute Deviation and Root Mean
Square Deviation, which measures the ability to reconstruct the deleted value, the
results showed that the Grand Mean Imputation fared the worst for both criteria.
In addition, it also showed that the Multiple Regression Imputation (MI) ob-
tained the best measures for the two criteria and that the procedures with greater
number of imputation classes (i.e.CM8 VS. CM10, RC8 VS. RC10) yield slightly
better results for the two criteria (Kalton, 1983).
For the second study, which is the imputation of Quarterly Earnings, ten impu-
tation procedures were used. These are the Grand Mean Imputation (GM), the
Class Mean Imputation using eight imputation classes (CM8), the Class Mean Im-
putation using twelve imputation classes (CM12), Random Imputation with eight
imputation classes (RM8), Random Imputation with twelve imputation classes
(RM12), Multiple Regression Imputation (MI), Multiple Regression Imputation
plus a random residual chosen from a normal distribution (MN), Multiple
Regression Imputation plus a randomly chosen respondent residual (MR), Mixed De-
ductive and Random Imputation using eight imputation classes (DI8) and Mixed
Deductive and Random Imputation using twelve imputation classes (DI12). Using
the first criterion, the Mean Deviation, the results showed that the Grand Mean
(GM) obtained a positive bias, implying that grand mean imputation is not an
effective method for this study. The results also showed that the regression
imputation procedures gave similar, almost unbiased estimates. In addition, the
Class Mean Imputation methods (CM8 and CM12) had measures similar to
those of the Random Imputation methods. Nevertheless, all methods produced
relatively small mean deviations except for the last two. Comparing the Mean
Absolute Deviations and Root Mean Square Deviations, the results show that
the Grand Mean Imputation obtained values similar to the regression procedures
with residuals (i.e. MN, Multiple Regression Imputation plus a random residual
chosen from a normal distribution, and MR, Multiple Regression Imputation plus
a randomly chosen respondent residual). The results also show that the measures
for the RM8, RM12, MN and MR procedures are over one third larger than those
of deterministic procedures such as CM8, CM12 and MI (Kalton, 1983).
To further investigate the relatively larger biases of the DI8 and DI12 procedures,
the author further divided the data into deductive and non-deductive cases.
This shed further light on the Mean Deviations and Mean Absolute Deviations
of the various imputation methods. The mean deviations were found to be
positive in the deductive cases and negative in the non-deductive cases for all of
the procedures. This explains the relatively small deviations in the previous
results, since the measures for the two cases tend to cancel out. It also showed
that the DI8 and DI12 results are similar to those of RM8, RM12, CM8 and
CM12 in the non-deductive cases but differ greatly in the deductive cases,
which explains the larger values of DI8 and DI12 in the previous results
(Kalton, 1983).
Taken together, the two studies showed that the imputation procedures tend
to underestimate the Hourly Rate of Pay and overestimate the Quarterly
Earnings. Moreover, mean imputation appears to be the weakest imputation
method in the studies, since it distorts the distribution of the original data.
Lastly, Kalton's study shows that increasing the number of imputation classes
yields better values for the three criteria.
Using the Center for Epidemiological Studies data on stress and health ratings of
older adults, Musil et al. (2002) imputed a single variable, the functional health
rating. Of the 492 cases, 20% were deleted in an effort to maximize the effects
of each imputation method. Except for Listwise Deletion and Mean Imputation,
the researchers used the SPSS Missing Value Analysis function for Deterministic
Regression, Stochastic Regression and the EM Method. For the correlations, they
obtained the correlation values of the original data and of the five imputed
versions of the variable with age, gender and self-assessed health rating (Musil et
al., 2002). The results show that, compared with the mean of the original data,
all five methods underestimated the mean. The closest to the original data was
Stochastic Regression, followed very closely by the EM Method, Deterministic
Regression, Listwise Deletion and Mean Imputation. The same results also hold
for the standard deviations. For the correlations, however, the EM Method
produced the correlation values closest to the original data, followed closely by
Stochastic Regression, Deterministic Regression, Listwise Deletion and Mean
Imputation. Hence, the findings suggest that Stochastic Regression and the EM
Method performed better, while Mean Imputation is the least effective (Musil et
al., 2002).
In a report describing simulation experiments and practical examples of the Hot
Deck Method, Nordholt (1998) carried out two simulation experiments. The
first study examined whether the Hot Deck Method performs better than leaving
the records with nonresponse out of the data set when analyzing the variable,
which is known as the Available Case Method. This was done by constructing
a fictitious data set with four variables, two of which were used for the
imputation. Nonresponse rates of 5%, 10% and 20% were then applied, and the
simulation process was replicated 50 times. The data set containing the missing
values was first analyzed using the Available Case Method, then using Hot Deck
Imputation. As in the methodology of Musil et al. (2002), descriptive statistics
such as the mean, variance and correlation were computed. Moreover, the
absolute differences between the original data and the Available Case Method,
and between the original data and the Hot Deck Method, were computed. Based
on these criteria, the results show that Hot Deck performs better than the
Available Case Method. Also, while the Hot Deck results were closer to the
original data, the method tended to underestimate the values. In terms of the
absolute differences, these values were observed to increase as the percentage of
missing values increases.
In the second study, on real-life data, the original values in categories 13 and
22 (the latter worth 300,000) were changed into missing values. The rationale
for this choice was to ensure that the original values from these categories would
not be used as replacements for the variable being imputed, since they are no
longer in the file. Imputation classes were then created once the missing values
had been identified. A table showing the number of respondents before and after
imputation showed that in every category except 13 and 22, which were set as
missing, the number of respondents increased after the imputation. This showed
that the remaining records had an equal probability of becoming a donor record
for an imputation and that not all imputations give values near categories 13
or 22. Nordholt also explored the Available Case Method and Hot Deck Method
for this real-life data. As in the first study, the Hot Deck fared better than the
Available Case Method (Nordholt, 1998).
There were two undergraduate theses that conducted a similar study on imputa-
tion. The first undergraduate thesis was by Salvino and Yu (1996). They assessed
the efficiency of the Mean Imputation versus Hot Deck Imputation Technique by
applying these techniques on the 1991 Census on Agriculture and Fisheries (CAF)
data. In their research, they generated an incomplete data set using the GAUSS
software for the imputed variables, which were the counts of cattle, hogs and
chickens. To determine which of the two methods was better, the variances were
compared; on this basis, the Hot Deck Imputation Technique was judged better.
The design effect was also considered by taking the ratio of the variance under
Hot Deck Imputation to that under Mean Imputation; since the ratio was less
than one, they again concluded that the Hot Deck Imputation Technique is the
better option.
The second thesis created a tool for implementing the imputations. When the
results were obtained, the efficacy of the imputation techniques was assessed
by looking at the accuracy and precision of the estimates: accuracy was
measured by the percentage error, and the variance of these percentage errors
was the basis for precision. The results show that Linear Regression was the
best method, followed closely by Multiple Regression, then Hot Deck and finally
Mean Imputation. It can be noted, however, that the criteria for determining the
best imputation model were not extensive, as only the percentage error and its
variance were used. Also, the study did not explore the use of imputation classes
to improve the accuracy and precision of the imputation methods.
Chapter 3
Conceptual Framework
Bias is defined as the difference between the expected value of an estimator and
the true value of the parameter being estimated. The bias is expressed by:
Bias(θ̂) = E[θ̂] − θ
where θ̂ is the estimator of the parameter and θ is the true value of the
parameter. The bias of an estimator can be positive, negative or zero; an
estimator with nonzero bias is said to be a biased estimator, while one with zero
bias is unbiased.
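The bias definition above can be checked numerically. The sketch below is a
made-up Monte Carlo setup (not part of this study) that estimates the bias of
the "divide by n" variance estimator, a textbook example of a biased estimator:

```python
import random

def estimate_bias(estimator, sampler, theta, n_reps=5000, seed=1):
    # Monte Carlo approximation of Bias(theta_hat) = E[theta_hat] - theta.
    rng = random.Random(seed)
    mean_estimate = sum(estimator(sampler(rng)) for _ in range(n_reps)) / n_reps
    return mean_estimate - theta

def var_n(xs):
    # Sample variance with a divisor of n: a classic biased estimator.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def draw(rng):
    # Five draws from N(0, 1), whose true variance is 1.
    return [rng.gauss(0, 1) for _ in range(5)]
```

For samples of size n = 5, theory gives a bias of −1/n = −0.2, and the Monte
Carlo estimate comes out close to that value.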
Accuracy is the extent to which estimates are close to the value of the parameter.
Precision is the extent to which estimates are close to one another.
Efficiency refers to how well a method performs, as assessed through a set of
criteria.
Nonresponse is the failure to collect a valid response for a particular unit.
The types of nonresponse are distinguished by the way in which observations
come to be missing. Kalton (1983) stressed the importance of differentiating the
types of nonresponse: unit (or total) nonresponse, item nonresponse, and partial
nonresponse. Unit (or total) nonresponse takes place when no information
was collected from a sampling unit. There are many causes of this nonresponse,
namely the failure to contact the respondent (not at home, moved, or the unit not
being found), refusal to give information, inability of the unit to cooperate
(perhaps due to illness or a language barrier), or lost questionnaires.
Item nonresponse, on the other hand, happens when the information collected
from a unit is incomplete because some questions were not answered. Its many
causes include: refusal to answer a question because the informant lacks the
necessary information; failure to make the effort required to retrieve the
information from memory or from records; refusal to answer because the question
is sensitive; the interviewer failing to record an answer; or the response being
subsequently rejected at an edit check on the grounds that it is inconsistent with
other responses (including inconsistencies arising from coding or punching errors
in transferring the response to the computer data file).
Lastly, partial nonresponse is the failure to collect large sets of items for a
responding unit. A sampled unit may fail to provide responses in one or more
waves of a panel survey, in later phases of a multi-phase data collection procedure
(e.g. the second visit of the FIES), or for later items in a questionnaire after
breaking off a telephone interview. Other causes include data that remain
unavailable after all possible checking and follow-up, responses that fail to satisfy
natural or reasonable constraints known as edits (so that one or more items are
designated unacceptable and are therefore artificially missing), and causes similar
to those given for unit (total) nonresponse. In this study, the researchers dealt
with partial nonresponse occurring in the second visit of the 1997 FIES.
An example of the MCAR pattern is when a sample unit in the survey fails to
provide an answer to the total monthly expenditure because the unit cannot be
reached.
Using the example in the MAR pattern, however, suppose the sampling unit
also did not provide an answer because he was a high income earner. This is
considered NIN, since the nonresponse also depends on the income group even
when the gender of the unit is controlled (Musil et al., 2002).
Suppose the population is divided into two groups or strata: the first group
consists of all units in the population for which measurements will be obtained
(respondents), and the second group consists of those units for which no
measurements will be obtained (nonrespondents).
To arrive at the proper estimation of the nonresponse bias, the following
quantities are defined. Let R be the number of respondents and M (M stands for
missing) be the number of nonrespondents in the population, with R + M = N.
Assume that a Simple Random Sample (SRS) with replacement is drawn from
each group. The corresponding sample quantities are r and m, with r + m = n.
Let R̄ = R/N and M̄ = M/N be the proportions of respondents and
nonrespondents in the population, and let r̄ = r/n and m̄ = m/n be the
response and nonresponse rates in the sample.
The population total and mean are given by Y = Yr + Ym = RȲr + M Ȳm and
Ȳ = R̄Ȳr + M̄ Ȳm, where Yr and Ȳr are the total and mean for respondents and
Ym and Ȳm are the same quantities for the nonrespondents. The corresponding
sample quantities are y = yr + ym = rȳr + mȳm and ȳ = r̄ȳr + m̄ȳm
(Kalton, 1983). If the respondent mean ȳr is used to estimate Ȳ, its bias is

Bias(ȳr) = Ȳr − Ȳ = M̄ (Ȳr − Ȳm).
The equation above shows that ȳr is approximately unbiased for Ȳ if either the
proportion of nonrespondents M̄ is small or the mean for nonrespondents, Ȳm ,
is close to the mean for respondents, Ȳr. Since the survey analyst usually has no
direct empirical evidence on the magnitude of (Ȳr − Ȳm), the only situation in
which he can have confidence that the bias is small is when the nonresponse rate
is low.
However, in practice, even with moderate M̄ many survey results escape sizable
biases because (Ȳr − Ȳm ) is fortunately often not large (Kalton, 1983).
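The decomposition Ȳ = R̄Ȳr + M̄ Ȳm and the resulting bias M̄ (Ȳr − Ȳm)
can be verified on made-up numbers; the two strata and their values below are
entirely hypothetical:

```python
# Hypothetical strata: respondents and nonrespondents with known values.
respondents = [50, 60, 55, 65, 70]        # R = 5 units
nonrespondents = [90, 100, 95]            # M = 3 units

N = len(respondents) + len(nonrespondents)
Rbar, Mbar = len(respondents) / N, len(nonrespondents) / N
Ybar_r = sum(respondents) / len(respondents)
Ybar_m = sum(nonrespondents) / len(nonrespondents)

# The population mean decomposes exactly into the two stratum means.
Ybar = (sum(respondents) + sum(nonrespondents)) / N
assert abs(Ybar - (Rbar * Ybar_r + Mbar * Ybar_m)) < 1e-9

# Using the respondent mean alone carries a bias of Mbar * (Ybar_r - Ybar_m).
bias = Ybar_r - Ybar
assert abs(bias - Mbar * (Ybar_r - Ybar_m)) < 1e-9
```

Here the respondent mean (60) understates the population mean (73.125) by
13.125, exactly M̄ (Ȳr − Ȳm) = (3/8)(60 − 95).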
In reducing nonresponse bias caused by missing data, there are many procedures
that can be applied and one of these procedures is imputation. In this study,
imputation methods are applied to eliminate nonresponse and reduce the bias of
the estimates. Imputation is briefly defined as the substitution of values for the
nonresponse observations.
Imputation is listed as one of the many procedures that can be used to deal with
nonresponse in order to generate more unbiased results. Imputation is the process
of replacing a missing value through available statistical and mathematical tech-
niques, with a value that is considered to be a reasonable substitute for the missing
information (Kalton,1983).
Imputation has certain advantages. First, utilizing imputation methods helps
reduce biases in survey estimates. Second, imputation makes analysis easier and
the results simpler to present. Imputation does not make use of complex
algorithms to estimate the population parameters in the presence of missing data;
hence, much processing time is saved. Lastly, using imputation methods can en-
sure consistency of results across analyses, a feature that an incomplete data set
cannot fully provide.
On the other hand, imputation also has several disadvantages. There is no
guarantee that the results obtained after applying imputation methods will be
less biased than those based on the incomplete data set; hence, the usefulness of
imputation depends on the suitability of the assumptions built into the
imputation procedures used. Even if the biases of univariate statistics are
reduced, there is no assurance that the distribution of the data and the
relationships between variables will be preserved. More importantly, imputation
is a fabrication of data, and naive researchers may falsely treat the imputed data
set as if it were a straightforward complete sample of size n.
There are four Imputation Methods (IMs) applied in this study, namely, the
Overall (Grand) Mean Imputation (OMI), Hot Deck Imputation (HDI),
Deterministic Regression Imputation (DRI) and Stochastic Regression
Imputation (SRI). For most imputation methods, imputation classes need
to be defined before the imputation is performed.
Imputation classes are stratification classes that divide the data into groups
before imputation takes place. The formation of imputation classes is most useful
when the classes are homogeneous groups, that is, groups of units with similar
characteristics that have some propensity to provide the same response. The
variables used to define imputation classes are called matching variables. The
values substituted for the nonresponse observations are taken from a group of
responding observations on the variable; these records are called donors, while
the records with missing observations to be filled in are called recipients.
Problems might arise if imputation classes are not formed with caution. One
issue is choosing the number of imputation classes for each method. The larger
the number of imputation classes, the greater the possibility of having fewer
observations in a class, which can cause the variance of the estimates under that
class to increase. On the other hand, the smaller the number of imputation
classes, the greater the possibility of having more observations in a class, making
the estimates burdened with aggregation bias.
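The mechanics of donors, recipients and imputation classes can be sketched in
code; the records, the "region" matching variable and the values below are all
hypothetical, and the imputation rule shown (class means) is just one simple
choice:

```python
# Hypothetical records; "region" is the matching variable defining classes.
records = [
    {"region": "NCR", "income": 120}, {"region": "NCR", "income": None},
    {"region": "NCR", "income": 100}, {"region": "CAR", "income": 80},
    {"region": "CAR", "income": None}, {"region": "CAR", "income": 90},
]

# Donors: responding records, grouped by imputation class.
donors = {}
for rec in records:
    if rec["income"] is not None:
        donors.setdefault(rec["region"], []).append(rec["income"])
class_means = {cls: sum(v) / len(v) for cls, v in donors.items()}

# Recipients: records with a missing value receive their class mean.
for rec in records:
    if rec["income"] is None:
        rec["income"] = class_means[rec["region"]]
```

With homogeneous classes, each recipient receives a value typical of units like
itself rather than of the whole sample.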
The mean imputation method is the process by which missing data are imputed
by the mean of the available units in the same imputation class to which the
record belongs (Cheng and Sy, 1999). One type of this method is the Overall
Mean Imputation (OMI) method, one of the most widely used methods for
imputing missing data. The OMI method simply replaces each missing value
with the overall mean of the available (responding) units in the same population.
The overall mean is given by
ȳomi = (Σᵢ₌₁ʳ yri) / r = ȳr

where ȳomi, the value imputed for each missing observation, is the mean of the
r responding observations yri on the y variable, i.e. ȳr.
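A minimal sketch of the OMI rule above, assuming missing values are coded as
None:

```python
def overall_mean_imputation(values):
    # OMI: every missing value (None) gets the overall mean of the
    # responding units; no imputation classes are used.
    responding = [y for y in values if y is not None]
    y_bar = sum(responding) / len(responding)
    return [y if y is not None else y_bar for y in values]
```

Note that every recipient receives the same value, which is the source of the
distributional distortion discussed later in this section.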
This method has both advantages and disadvantages. Its main advantage is
universality: it can be applied to any data set. Moreover, it does not require
imputation classes, which makes it easier to use and faster to run.
However, this method has serious disadvantages. Since all missing values are
imputed with a single value, the distribution of the data becomes distorted
(Figure 1): it becomes too peaked, making it unsuitable for many
post-imputation analyses. Second, it produces large biases and variances because
it does not allow variability in the imputed values. Much of the related literature
identifies this as the least effective method; thus, it is rarely recommended.
One of the most popular and widely known methods used is the Hot Deck
Imputation (HDI) method. The HDI method is the process by which the
missing observations are imputed by choosing a value from the set of available
units. This value is either selected at random (traditional hot deck), or in
some deterministic way with or without replacement (deterministic hot deck),
or based on a measure of distance (nearest-neighbor hot deck). To perform
this method, let Y be the variable that contains missing data and X a set of
variables with no missing data. To impute the missing data:
1. Find a set of categorical X variables that are highly associated with Y . The
selected X variables will be the matching variables in this imputation.
2. Cross-classify the matching variables to form a table of imputation cells,
and sort both respondents and nonrespondents into these cells.
3. If there are cases missing within a particular cell of the table, select a
case from the set of available units on the Y variable and impute the chosen Y
value to the missing value. The donor chosen must have similar or exactly the
same characteristics as the recipient.
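The steps above can be sketched as a traditional (random) hot deck; the record
layout and field names here are hypothetical:

```python
import random

def hot_deck_impute(records, match_var, target, seed=None):
    # Traditional (random) hot deck: within each imputation class defined
    # by match_var, replace missing target values with a value drawn at
    # random from that class's donors.
    rng = random.Random(seed)
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[match_var], []).append(rec[target])
    for rec in records:
        if rec[target] is None:
            rec[target] = rng.choice(donors[rec[match_var]])
    return records
```

Because every imputed value is drawn from an actual donor in the same class,
the imputed values are always real observed values.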
Cheng and Sy (1999) stated that the HDI method gives estimates that reflect
the actual data more accurately when imputation classes are used. If the
matching variables are closely associated with the variable being imputed, the
nonresponse bias should be reduced.
Like OMI, this method has certain advantages. One major attraction cited by
Kazemi (2005) is that the imputed values are all actual values. More
importantly, the shape of the distribution is preserved: since imputation classes
are introduced, the chance of distorting the distribution decreases.
On the other hand, it also has a set of disadvantages. First, in order to form
imputation classes, all X variables must be categorical. Second, the possibility
of generating a distorted data set increases when values are imputed with
replacement and the nonresponse rate rises: observations from the donor record
might be used repeatedly for the missing values, distorting the shape of the
distribution. Third, the number of imputation classes must be limited to ensure
that every missing value has a donor in its class.
Like the OMI and HDI methods, this procedure is one of the most widely used im-
putation methods. The method of imputing missing values via least-squares
regression is known as the Regression Imputation (RI) Method. There
are many ways of building a regression model for imputing the missing
observations. The y-variable for which imputations are needed is regressed on
the auxiliary variables (x1, x2, x3, ..., xp) for the units providing a response on y.
These auxiliary variables may be quantitative or qualitative, the latter being in-
corporated into the regression model by means of dummy variables. There are
two basic types of the RI method: (a) Deterministic Regression Imputation and
(b) Stochastic Regression Imputation.
When comparing the accuracy and efficiency of RI methods, it is helpful if the
methods being compared use the same imputation classes.
Deterministic Regression
Deterministic Regression Imputation (DRI) uses, for a record with a missing
response in the variable y, the predicted value from the model given the values of
the auxiliary variables, which contain no missing data. This method can be seen
as a generalization of the mean imputation method. The model for DRI is given by:

ŷk = β̂0 + Σi β̂i Xik

where ŷk is the predicted value for the k-th nonresponding unit to be imputed, β̂0
and β̂i are the parameter estimates, and Xik is the i-th auxiliary variable, which
can be either a quantitative variable or a dummy variable, for the k-th
nonresponding unit.
There are advantages and disadvantages to using DRI. DRI has the potential
to produce imputed values close to the missing observations; for the method to
be effective, that is, to impute predicted values near the actual values, a high R2
is needed. However, the method is time-consuming, and it is often unrealistic to
consider its application to all items with missing values in a survey. Using DRI
can also underestimate the variance of the estimates, and one major disadvantage
of the method is that it can distort the distribution of the data.
Stochastic Regression
The predicted value from the deterministic regression model has undesirable
distributional properties similar to those of the mean imputation method. To
compensate, an estimated residual is added to the predicted value. The use
of this predicted value plus some type of randomly chosen estimated residual is
called the Stochastic Regression Imputation (SRI) method. The model for
SRI is given by:

ŷk = β̂0 + Σi β̂i Xik + êk
where ŷk is the predicted value for the k-th nonresponding unit to be imputed,
β̂0 and β̂i are the parameter estimates, Xik is the i-th auxiliary variable, which
can be either a quantitative variable or a dummy variable, for the k-th
nonresponding unit, and êk is the randomly chosen residual for the k-th
nonresponding unit.
There are various ways in which this could be done, depending on the assump-
tions made about the residuals. The following are some possibilities:
1. Assume that the errors are homoscedastic and normally distributed, N(0, σe²).
Then σe² can be estimated by the residual variance from the regression, se²,
and the residual for a recipient can be chosen at random from N(0, se²).
2. Assume that the errors are heteroscedastic and normally distributed, with
σej² being the residual variance in some group j. Estimate σej² by sej², and
choose the residual for a recipient in group j at random from N(0, sej²).
3. Assume that the residuals all come from the same, unspecified, distribution.
Then estimate yk by ŷk + êk, where êk is the estimated residual from a
randomly chosen donor.
4. The assumption in (3) accepts the linearity and additivity of the model.
If there are doubts about these assumptions, it may be better to take not
a randomly chosen donor but one close to the recipient in terms of
its x-values (Kalton, 1983). In the limit, if a donor with the same set of
x-values is found, this procedure reduces to assigning that donor's y-value
to the recipient.
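Possibility (1) can be sketched as follows; the data are hypothetical, and the least-squares line is computed by hand for transparency:

```python
import random
import statistics

random.seed(2)

# Hypothetical responding pairs (x, y) used to fit an ordinary least-squares line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)

xbar, ybar = statistics.mean(xs), statistics.mean(ys)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)
b0 = ybar - b1 * xbar

# Possibility (1): homoscedastic, normally distributed errors. Estimate the
# residual variance s_e^2 from the fitted regression (n - 2 degrees of freedom).
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
s2e = sum(e ** 2 for e in residuals) / (n - 2)

# Stochastic regression imputation for a nonrespondent with x = 6: predicted
# value plus a residual drawn at random from N(0, s_e^2).
x_missing = 6.0
y_imputed = b0 + b1 * x_missing + random.gauss(0.0, s2e ** 0.5)
```

Setting the drawn residual to zero instead would give the deterministic (DRI) imputation for the same unit.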
There are advantages and disadvantages to using SRI. Like DRI, this method
can produce imputed values close to the missing observations if the model has a
high R2. It is likewise time-consuming, and often unrealistic to apply to all items
with missing values in a survey. Unlike the bare predicted value, this method can
also produce out-of-range values: after adding a residual to a deterministic
imputation that is feasible, an unfeasible value could result.
Chapter 4
Methodology
The purpose of this section is to give an overview of the data used in this study,
the 1997 Family Income and Expenditures Survey (FIES). The 1997 FIES is a
nationwide survey, with two visits per survey period to the same households,
conducted by the National Statistics Office (NSO) every three years. The
objectives of the survey are as follows:
1. to gather data on family income and family living expenditures and related
information affecting income and expenditure levels and patterns in the
Philippines;
The sampling design for the 1997 FIES is a stratified multi-stage design consisting
of 3,416 Primary Sampling Units (PSUs) for the provincial estimates; the PSUs
in the 1997 FIES are barangays. A subsample of 2,247 PSUs comprises the master
sample for the regional-level estimates (NSO, 1997-2005).
This multi-stage sampling design involved three stages. First is the selection
of sample barangays. Second is the selection of sample enumeration areas, which
are subdivisions of barangays. This was followed by the selection of sample
households. The sampling frame and stratification of the three stages were based
on the 1995 Census of Population (POPCEN) and the 1990 Census of Population
and Housing (CPH). From this design, a sample of 41,000 households participated
in the survey (NSO, 1997-2005).
The 1997 FIES questionnaire contains about 800 data items, with questions asked
by the interviewer of the respondent of the selected sample household. A respondent
is defined as the household head, the person who manages the finances of the
family, or any member of the family who can give reliable information for the
questionnaire (NSO, 1997-2005).
The items or variables gathered in the 1997 FIES are listed in Appendix A.
Two types of nonresponse occurred in the 1997 FIES. The first type, which
resulted from factors such as being unaware of the question, being unwilling to
provide an answer, or omission of the question during the interview, is called
item nonresponse. This type of nonresponse totaled only 2.1% of the total number
of respondents (NSO, 1997-2005).
The NSO devised only deductive imputation to address the problem of item
nonresponse, while no specific method was mentioned to compensate for partial
nonresponse (NSO, 1997-2005).
Only a subset of the 1997 FIES data was used to apply the imputation methods.
The National Capital Region (NCR) was chosen because it was noted as the
region with the highest nonresponse rate. The data consist of 4,130 households;
39 variables are categorical and the rest are continuous variables pertaining to
the income and expenditures of the respondents. As for the variables to be
imputed, the researchers chose two, namely the second-visit Total Income
(TOTIN2) and Total Expenditure (TOTEX2). These variables were chosen
because of their importance to the FIES and the frequency of missing values
among their observations.
1. A matrix of random numbers was generated.
2. Each observation from the matrix of random numbers was assigned to the
observations of the 1997 FIES second-visit variables TOTIN2 and TOTEX2.
This was done to satisfy the assumptions that the data have partial
nonresponse and that the missing observations follow the Missing Com-
pletely At Random (MCAR) nonresponse pattern.
3. The second-visit observations of both variables were sorted in ascending
order of their corresponding random numbers.
4. The first 10% of the sorted second-visit data for both variables were selected
and set as missing observations. The same procedure was applied to the
data sets containing 20% and 30% nonresponse rates, respectively.
5. The missing observations were flagged. This was done to distinguish the
imputed from the actual values during the data analysis.
This simulation was implemented with the Decimal Basic program
SIMULATION.BAS (Appendix B), where the files Simulated Values for
Income (SIMI) and Simulated Values for Expenditure (SIME), matrices containing
the missing observations for income and expenditure, were stored for use in the
application of the imputation methods.
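The simulation steps can be sketched as follows (not the SIMULATION.BAS program itself), using hypothetical data in place of the FIES records:

```python
import random

random.seed(3)

# Hypothetical second-visit observations (e.g. TOTIN2) for 20 households.
data = [float(v) for v in range(100, 120)]
n = len(data)

# Steps 1-2: assign a uniform random number to every observation, so that
# the induced missingness is MCAR.
tagged = [(random.random(), i) for i in range(n)]

# Step 3: sort the observations by their random number.
tagged.sort()

# Step 4: set the first 10% of the sorted cases to missing (None); the 20%
# and 30% nonresponse data sets would be built the same way.
n_missing = int(0.10 * n)
missing_idx = {i for _, i in tagged[:n_missing]}
simulated = [None if i in missing_idx else v for i, v in enumerate(data)]

# Step 5: flag the deleted cases so imputed values can be told apart from
# actual values during the analysis.
flags = [i in missing_idx for i in range(n)]
```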
Imputation classes are stratification classes that divide the data in order to pro-
duce groups that have similar characteristics. Assuming that the units that have
the same characteristics have the propensity to give the same response, the for-
mation of imputation classes would help reduce the bias of the estimates.
The steps undertaken in the formation of the imputation classes are as follows:
1. The researchers identified the potential matching variables, which are the
candidate variables that could have an association with the variables of
interest (i.e. TOTEX2 and TOTIN2).
2. The categorical variables from the first-visit data had to meet three criteria
to be selected as candidate variables. First, the variable must be known.
Second, the candidate variable must be easy to measure. Lastly, the
probability of missing observations for the candidate variable must be small.
A first-visit variable that fits all three criteria can be used as a candidate
variable.
3. For variables with many categories, the researchers reduced the number of
categories. The rationale is that having too many categories can increase
heterogeneity and the bias of the estimates. This was done with the Recode
function of the software Statistica.
All these tests were performed using the statistical packages Statistica and SPSS.
The results of these tests are presented in the next chapter.
The Overall Mean Imputation (OMI) is an imputation procedure in which the
missing observations are replaced with the mean of the available units of the
variable. As noted in the Conceptual Framework, this imputation method does
not require the formation of imputation classes, which makes it the simplest of
the four methods in this study.
The procedures in applying the Overall Mean Imputation (OMI) are as follows:
1. The overall mean of the first-visit variables of interest, TOTIN1 and
TOTEX1, was computed. The formula used for the overall mean is:

ȳomi = ( Σ yri ) / r ,  i = 1, 2, ..., r

where ȳomi is the overall mean of the first-visit variable of interest, TOTEX1
or TOTIN1, yri is the i-th responding observation of that variable, and r is
the total number of responding units for the first-visit variable TOTEX1 or
TOTIN1.
2. Using the nonresponse data sets generated, the missing observations for the
second visit variables TOTEX2 and TOTIN2 were replaced with the overall
means of the first visit TOTEX1 and TOTIN1.
The implementation of the Overall Mean Imputation (OMI) was made through
the Decimal Basic program OMI.BAS (Appendix B).
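As a rough illustration of these two steps (not the OMI.BAS program itself), using hypothetical first- and second-visit values:

```python
# Hypothetical first- and second-visit values for one variable of interest;
# None marks a nonresponding unit in the second visit.
first_visit = [50.0, 60.0, 70.0, 80.0, 90.0]
second_visit = [52.0, None, 71.0, None, 93.0]

# Step 1: overall mean of the first-visit responding units.
respondents = [y for y in first_visit if y is not None]
y_omi = sum(respondents) / len(respondents)

# Step 2: replace every missing second-visit observation with that mean.
imputed = [y if y is not None else y_omi for y in second_visit]
```

Every missing case receives the same value, which is exactly what distorts the distribution as described in the Conceptual Framework.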
The Hot Deck Imputation (HDI) is an imputation procedure in which the missing
observations are replaced by values chosen from the set of available units.
The steps undertaken in applying the Hot Deck Imputation (HDI) are as follows:
1. The donor and recipient record for each imputation class and variable were
first identified.
2. The missing observations of the second-visit TOTIN2 and TOTEX2 were
assigned to their respective recipient records for each imputation class, while
the first-visit TOTIN1 and TOTEX1 observations were placed in their re-
spective donor records for each imputation class.
3. The values that were substituted for the missing observations were randomly
chosen from the donor record for each imputation class.
The implementation of the Hot Deck Imputation (HDI) was made through the
Decimal Basic program HOT DECK.BAS (Appendix B).
1. A logarithmic transformation was applied to the first-visit variables of in-
terest, TOTEX1 and TOTIN1, as well as to the second-visit variables of
interest, TOTEX2 and TOTIN2. The rationale for this transformation is
that the income and expenditure variables are not normally distributed.
Moreover, logarithmic transformations help correct the non-linearity of the
regression equation.
2. The regression equations were formed after the transformation. For this
study, only one predictor variable was used, and the general form of the
regression equation is:

ŷ = β̂0 + β̂1 x + êi

where ŷ is the predicted observation for the second-visit variable TOTIN2 or
TOTEX2, β̂0 and β̂1 are the parameter estimates, x is the first-visit variable,
and êi is the random residual term. Note that for DRI, êi = 0.
3. For the stochastic regression, which involves the computation of the error
term, the following steps were made:
(a) The residuals were grouped into class intervals, and the frequency of
each interval was obtained.
(b) The class means of the frequency distributions were used to obtain the
error terms for the regression equation.
4. The diagnostic checking requires the fitted model to satisfy the following
assumptions:
(a) Linearity
(b) Normality of the error terms
(c) Independence of the error terms
(d) Constancy of variance
The results of the diagnostic checking of each regression equation used in
this study are presented in Appendix C.
5. The missing observations were replaced by the predicted value using the
corresponding regression equation.
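The imputation steps can be sketched as follows; the data are hypothetical, and for brevity the error term is drawn directly from the observed residuals rather than from the class means of their frequency distribution as in the study:

```python
import math
import random
import statistics

random.seed(5)

# Hypothetical first-visit (x) and second-visit (y) values for the units that
# responded in both visits; x_missing holds first-visit values of nonrespondents.
x_obs = [40.0, 55.0, 70.0, 90.0, 120.0, 150.0]
y_obs = [44.0, 58.0, 76.0, 95.0, 130.0, 155.0]
x_missing = [65.0, 100.0]

# Step 1: logarithmic transformation of both variables.
lx = [math.log(x) for x in x_obs]
ly = [math.log(y) for y in y_obs]

# Step 2: least-squares fit of ln(y) on ln(x).
xbar, ybar = statistics.mean(lx), statistics.mean(ly)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(lx, ly)) / sum(
    (a - xbar) ** 2 for a in lx
)
b0 = ybar - b1 * xbar

# Step 3 (stochastic variant): the error term is drawn from the observed
# residuals; for the deterministic variant the error term is simply 0.
residuals = [b - (b0 + b1 * a) for a, b in zip(lx, ly)]

def impute(x, stochastic=True):
    e = random.choice(residuals) if stochastic else 0.0
    # Step 5: back-transform the predicted log value to the original scale.
    return math.exp(b0 + b1 * math.log(x) + e)

dri_values = [impute(x, stochastic=False) for x in x_missing]
sri_values = [impute(x, stochastic=True) for x in x_missing]
```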
To compute the bias of the mean of the imputed data, the following procedures
were implemented:
1. The mean of the imputed data, ȳ′, was computed. For the Hot Deck and
Stochastic Regression Imputations, the means of the 1,000 simulated data
sets were averaged.
2. The mean of the actual data was computed.
3. The bias of the mean of the imputed data was obtained as the difference
between (1) and (2).
Actual Data
In order to determine which imputation method was able to maintain the
distribution of the actual data, a goodness-of-fit test was utilized. For this
study, the researchers chose the Kolmogorov-Smirnov (K-S) test, a goodness-of-fit
test concerned with the degree of agreement between the distribution of a set of
sampled (observed) values and some specified theoretical distribution (Siegel,
1988). In this study, the researchers were concerned with how the imputation
methods affected the distribution of the 1997 FIES data.
The following steps were made for the Kolmogorov-Smirnov test:
1. Income and Expenditure deciles were created. The creation of these deciles
was based on the second visit actual 1997 FIES data.
2. The obtained deciles were used as upper bounds of the frequency classes.
3. A Frequency Distribution Table (FDT) for each trial was created. For this
part, the researchers used the SPSS aggregate function to generate the FDT.
4. The FDT includes the Relative Cumulative Frequency (RCF) for both the
imputed and actual distribution. RCFs are computed by dividing the cu-
mulative frequency by the total number of observations.
5. The absolute value of the difference between the actual-data RCF and the
imputed RCF was computed using Microsoft Excel.
6. The test statistic for the Kolmogorov-Smirnov test, the maximum deviation
D, was determined using this formula:
D = max |RCFimputed − RCFactual|
7. Since this is a large-sample case, and assuming a 0.05 level of significance,
the critical value is computed as 1.36/√N, with N = 4,130.
8. If D is less than the critical value, it is concluded that the imputed data
maintain the same distribution as the actual data.
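The eight steps can be sketched as follows with small hypothetical samples (the study itself used SPSS and Excel for these computations):

```python
# Hypothetical actual and imputed samples for one trial; the actual sample is
# already sorted, which the decile computation below relies on.
actual = [float(v) for v in range(1, 101)]   # 100 observations: 1..100
imputed = actual[:90] + [50.0] * 10          # 10 values collapsed to one value
n = len(actual)

# Steps 1-2: deciles of the actual data serve as upper class bounds.
bounds = [actual[int(n * k / 10) - 1] for k in range(1, 11)]

# Steps 3-5: relative cumulative frequencies (RCF) per class for both data sets.
def rcf(values):
    return [sum(v <= b for v in values) / len(values) for b in bounds]

# Step 6: the K-S statistic is the maximum absolute RCF difference.
D = max(abs(a - b) for a, b in zip(rcf(actual), rcf(imputed)))

# Step 7: large-sample critical value at the 0.05 level.
critical = 1.36 / n ** 0.5

# Step 8: the imputed data keep the actual distribution if D < critical.
same_distribution = D < critical
```

Here D = 0.1 against a critical value of 0.136, so this hypothetical imputation would (narrowly) pass the test.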
1. Income and Expenditure deciles were created. The deciles that were used
in the previous test were the same deciles used here.
2. The obtained deciles were used as upper bounds of the frequency classes.
3. A Frequency Distribution Table (FDT) for both the imputed and actual
values was generated.
4. For Hot Deck and Stochastic Regression, which had 1,000 simulated data
sets, the Relative Frequencies (RF) for each frequency class were averaged
over the 1,000 RFs.
Imputation Methods
Lastly, the researchers adopted measures used by Kalton (1983) in his report enti-
tled Compensating for Missing Data for evaluating the effectiveness of imputation
methods. These measures are: (a) Mean Deviation (MD), (b) Mean Ab-
solute Deviation (MAD) and (c) Root Mean Square Deviation (RMSD).
The Mean Deviation (MD) measures the bias of the imputed values. This is
represented by the formula:

MD = ( Σ (ŷmi − ymi) ) / m ,  i = 1, 2, ..., m

where ŷmi is the imputed value of the variable TOTEX2 or TOTIN2 and ymi is
the actual value of the variable TOTEX2 or TOTIN2 for case i = 1, 2, ..., m.
The Mean Absolute Deviation (MAD) measures the closeness with which the
deleted values are reconstructed. This is represented by the formula:

MAD = ( Σ |ŷmi − ymi| ) / m ,  i = 1, 2, ..., m

where ŷmi is the imputed value of the variable TOTEX2 or TOTIN2, and ymi is
the actual value of the variable TOTEX2 or TOTIN2 for case i = 1, 2, ..., m.
The Root Mean Square Deviation (RMSD) is the square root of the mean of the
squared deviations between the imputed and actual observations. Like the MAD,
it measures the closeness with which the deleted values are reconstructed. This is
expressed as:

RMSD = √( Σ (ŷmi − ymi)² / m ) ,  i = 1, 2, ..., m

where ŷmi is the imputed value of the variable TOTEX2 or TOTIN2, and ymi is
the actual value of the variable TOTEX2 or TOTIN2 for case i = 1, 2, ..., m.
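The three criteria can be computed as follows on hypothetical imputed and actual values for the m deleted cases:

```python
# Hypothetical imputed and actual values for the m deleted cases.
imputed = [10.0, 12.0, 9.0, 15.0]
actual = [11.0, 11.0, 10.0, 13.0]
m = len(actual)

deviations = [yhat - y for yhat, y in zip(imputed, actual)]

# Mean Deviation: the bias of the imputed values (signed errors can cancel).
MD = sum(deviations) / m

# Mean Absolute Deviation: how closely the deleted values are reconstructed.
MAD = sum(abs(d) for d in deviations) / m

# Root Mean Square Deviation: like MAD, but penalizing large misses more.
RMSD = (sum(d ** 2 for d in deviations) / m) ** 0.5
```

Note how MD stays near zero when positive and negative errors cancel, while MAD and RMSD do not: this is why the study uses all three.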
These three criteria for measuring the performance of the imputation methods
were implemented in the Decimal Basic program. After each imputation
method is performed, the program computes the Mean Deviation, Mean Absolute
Deviation and Root Mean Square Deviation and saves them in the corresponding
Criteria for Expenditure (CRITEX) and Criteria for Income (CRITIN) files.
To answer the primary objective of this study, determining the best or most
appropriate imputation technique for the 1997 FIES, the researchers ranked the
four imputation methods based on the criteria discussed in the previous sections.
The selection of the best method is made independently for each variable of inter-
est and nonresponse rate. The ranking of the imputation methods covered the
following: Bias of the Mean of the Imputed Data; Estimated Percentage
of Correct Distribution of the Imputed Data (PCD), which refers to the
proportion, out of the total number of simulated data sets, of imputed data sets
that reconstructed the actual data set; Mean Deviation (MD); Mean
Absolute Deviation (MAD); and Root Mean Square Deviation (RMSD).
1. For each criterion mentioned above, the imputation methods were ranked
on a scale of 1 to 4, with 1 indicating the best imputation method and 4
the worst.
2. For each variable of interest (i.e. TOTEX2, TOTIN2), the rankings obtained
by a particular imputation method across the criteria were added.
3. The imputation method with the lowest total is considered the best
imputation method for the respective variable of interest and nonresponse
rate.
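The ranking procedure, including the averaging of tied ranks, can be sketched on hypothetical criterion scores (assuming, for simplicity, that a lower score is better for every criterion, which in practice would not hold for PCD):

```python
# Hypothetical criterion scores for the four imputation methods on one
# variable of interest and NRR; lower is assumed better for every criterion.
scores = {
    "OMI": [5.0, 0.10, 4.0, 6.0, 7.0],
    "HDI": [4.0, 0.20, 3.0, 5.0, 6.0],
    "DRI": [2.0, 0.05, 2.0, 3.0, 4.0],
    "SRI": [1.0, 0.04, 1.0, 2.0, 3.0],
}
methods = list(scores)
n_criteria = len(next(iter(scores.values())))

# Step 1: rank the methods 1 (best) to 4 (worst) within each criterion,
# averaging the ranks in case of ties.
totals = {m: 0.0 for m in methods}
for c in range(n_criteria):
    vals = sorted(scores[m][c] for m in methods)
    for m in methods:
        ranks = [i + 1 for i, v in enumerate(vals) if v == scores[m][c]]
        totals[m] += sum(ranks) / len(ranks)  # step 2: add the per-criterion ranks

# Step 3: the method with the smallest rank total is declared the best.
best = min(totals, key=totals.get)
```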
The results of the ranking procedure are presented in the next chapter.
Chapter 5
Variables
Table 2 shows the descriptive statistics of the second-visit variables of interest
(VI), TOTEX2 and TOTIN2. These were computed to give a brief idea of how
much a household spends and earns in a period of time, to measure the differences
in the statistics between the two variables, and to compare the results with other
tests later on.
The average total spending of a household in the National Capital Region (NCR) is
about Php 102,389.80, while the average total earnings amounted to Php 134,119.40,
a difference of more than thirty thousand pesos. It can be noted that the observa-
tions of TOTIN2 have a larger mean and standard deviation than those of
TOTEX2. The dispersion can also be seen by looking at the minimum and
maximum of the two variables.
Tables 3, 4 and 5 show the candidate matching variables along with their respective
categories and scope. The candidate MVs that were tested are the provincial area
codes (PROV), recoded education status (CODES1) and recoded total employed
household members (CODEP1). The candidate PROV has four categories and is
the only matching variable that was not recoded. The other candidates, CODEP1
(recoded total employed household members) and CODES1 (recoded education
status), were reduced to a smaller number of groups, since the original numbers
of categories for these two candidate MVs were 7 and 99, respectively. As
mentioned in the previous chapters, the number of categories was reduced into
smaller groups to minimize the heterogeneity and the bias of the estimates.
The Chi-squared test of association for the candidates and the variables of inter-
est showed that PROV, CODES1 and CODEP1 are associated with CODIN1 and
CODEX1. The p-values for all the candidates were less than 0.0001, indicating
that the associations are highly significant. The results of the succeeding measures
of association determine which of the three candidates is chosen as the MV
of the study.
The measures of association showed small degrees of association with the vari-
ables CODIN1 and CODEX1. This kind of result is expected in real complex
data, given the larger variability among the observations. Table 7 clearly shows
that CODES1 is the MV exhibiting the largest association with the variables
and, therefore, the MV that can best ensure that the ICs are homogeneous. Thus,
CODES1 was chosen as the MV for these data.
The table above indicates that IC1 is the imputation class with the smallest
standard deviation of the three ICs. IC2 and IC3 produced large standard
deviations, but these are offset by the low value of IC1, which holds the largest
proportion of the data. A possible reason why the standard deviation and the
mean of IC3 are large is that the majority of the extreme values fall in that class.
Results in Table 9 show the means of both second-visit VIs, TOTEX2 and
TOTIN2, under all NRRs. These were generated for use as inputs in the eval-
uation of the means of the imputed data for each IM.
The means of the observations set to nonresponse and of the observations retained
showed contrasting results. For both variables, TOTEX2 and TOTIN2, as the
nonresponse rate increases, the mean of the observations set to missing (deleted)
also increases. Conversely, the mean of the observations retained decreases as the
nonresponse rate increases. Perhaps the large values that were set to nonresponse
increased the means of the data sets containing nonresponse for the varying rates
of nonresponse. Hence, as the number of missing values increases, the deviation
between the means of the actual and retained data slowly increases.
Table 10 shows the different regression models for all VIs and nonresponse rates
(NRRs) that were checked for adequacy. The columns are represented as follows:
(a) VI, (b) the nonresponse rate (NRR), (c) IC, (d) the prediction model, (e)
the coefficient of determination (R2 ) and (f) the F-statistic and its corresponding
p-value in parenthesis.
For the notation used in Table 10, the codes IC1, IC2 and IC3 represent the first,
second and third imputation classes, respectively. Meanwhile, for the regression
equations used in the regression imputation, ŷi represents the dependent variable,
the predicted value of the second-visit variable TOTIN2 or TOTEX2.
Logarithmic transformations were utilized in order to correct the non-linearity of
the regression equations. The code LNFVE1i is the logarithmic transformation
of the observation of the first-visit variable Total Expenditure (TOTEX1) under
IC1. Similarly, LNFVI1i is the logarithmic transformation of the first-visit
observation of the variable Total Income (TOTIN1) under IC1. The same
notation applies to LNFVE2i and LNFVE3i under IC2 and IC3 for the variable
TOTEX1, and to LNFVI2i and LNFVI3i under IC2 and IC3 for the variable
TOTIN1, respectively.
Table 10 shows the regression models used for the regression imputations
under their respective VIs and ICs. Before using these equations to impute
missing values, diagnostic checking of the models was performed, covering
Linearity, Normality of the Error Terms, Independence of the Error Terms and
Constancy of Variance.
Second, the models were checked for the assumption of linearity. This
was done using the ANOVA tables presented in Appendix C. The results
of the diagnostic checking showed that all models satisfied the assumption of
linearity: the p-values for all the models were less than 0.0001, indicating that
the linearity of the models is highly significant.
Third, the models were checked for the assumption of normality. For this study,
the researchers examined the Normal Probability Plots (NPP) of the regression
models, which can be found in Appendix C. The normal probability plots of all
models moderately follow an S-shaped pattern, which indicates that the residuals
are not normal but rather lognormal. However, the shape of the NPPs improved
after the ln transformation was applied, even though the models were not linear
previously. Since the data come from a complex survey, the models were used
even though the assumption of normality of the residuals is not perfectly
satisfied.
Fourth, to test the assumption of independence of the error terms, the Durbin-
Watson test was implemented. Results in Appendix C show that all of the models
satisfy the assumption of independence.
Hence, given this discussion, the results show that the assumptions checked in
the diagnostics of the regression equations used for the regression imputations
were adequately satisfied.
To determine the effect of the nonresponse rates on the results of each imputation
method (IM), an evaluation of the different IMs was performed, with the results
of each IM discussed independently. For each IM, the discussion of results
proceeds as follows: (1) bias of the mean of the imputed data, (2) distribution
of the imputed data using the Kolmogorov-Smirnov goodness-of-fit test, and (3)
other measures of variability, namely the mean deviation (MD), mean absolute
deviation (MAD), and root mean square deviation (RMSD).
The tables of results contain the following columns: (a) variable of interest
(VI), (b) nonresponse rate (NRR), (c) the bias of the mean of the imputed data,
Bias(ȳ′), (d) the percentage of correct distribution of the imputed data relative
to the actual data set out of 1,000 trials (PCD), (e) MD, (f) MAD, and (g) RMSD.
Table 11 shows the results of the different criteria in evaluating the imputed data
using the overall mean imputation (OMI) method.
On the other hand, the results for TOTIN2 were the opposite of those for
TOTEX2 as the NRR increases.
For TOTIN2, the data with twenty percent imputed observations have the
highest values in all three measures of variability, while, unlike for TOTEX2,
the values of the three measures under the highest NRR are surprisingly the
lowest.
Table 12 shows the results of the different criteria in evaluating the imputed data
using the Hot Deck Imputation (HDI) method with three imputation classes.
For the TOTEX2 variable, the data with twenty percent NRR provided the
least bias. On the other hand, the data with the lowest NRR yielded the
smallest bias for TOTIN2.
For TOTEX2 and TOTIN2, the data with the highest number of imputed
observations failed to maintain the distribution of the actual data. Worse, none
of the simulated data sets for TOTEX2 registered the same distribution as the
actual data, while only a lone TOTIN2 data set maintained it. The researchers
attribute this to the possibility that more than one recipient shared the same
donor.
For the variable TOTIN2, the following results were obtained: (i) all three
criteria increase as the NRR increases, (ii) the results of the three criteria are
larger than those for TOTEX2, and (iii) the data with the largest number of
imputations generated the highest values in the three criteria.
Table 13 shows the results of the different criteria in evaluating the imputed data
using the Deterministic Regression Imputation method with three imputation
classes (DRI).
Table 14 shows the results of the different criteria in evaluating the imputed data
using the Stochastic Regression Imputation method with three imputation classes
(SRI).
For the OMI method, the figures clearly illustrate the distortion of the distribu-
tion. Since the OMI method assigns the mean of the first-visit VI to all the
missing cases, the imputed values concentrate in one particular frequency class.
The three other methods, which implemented imputation classes, gave a better
outcome than OMI by spreading the distribution of the imputed data.
For the HDI method, all the figures clearly illustrate the over-representation in
the first frequency class, that is, less than 37,859.5 for TOTEX2 and less than
40,570.0 for TOTIN2. Over-representation can also be seen in Figure 5 under the
second frequency class, and in Figures 6 and 7 under the last frequency class.
Alongside this over-representation under HDI, under-representation was also
observed: in Figures 2, 3 and 4, the seventh frequency class (128,000.0 - 161,669.0)
for TOTIN2 is under-represented.
The two regression imputation methods, unlike HDI and OMI which produced
major clusters, gave more spread-out distributions, although some areas are
under-represented. The failure to consider a random residual term in the
deterministic regression resulted in a severe under-representation of the data,
particularly in the first frequency class in all the figures. In Figures 2, 3 and 4,
under-representation can also be seen in the last frequency class for TOTIN2.
SRI, which adds a random residual, provided better results than DRI. However,
in some areas the added random residual produced a significant excess, mostly
in the last frequency class, as can be seen in Figures 5, 6 and 7 for TOTEX2 and
Figures 2 and 4 for TOTIN2.
In this section, the rankings across all the tests are the basis for determining
which of the IMs will be chosen as the best IM for this particular study and data.
The selection of the best method is done independently for each VI and NRR. The
rankings are based on a four-point system wherein a rank value of 4 denotes the
worst IM for a specific criterion and 1 denotes the best IM for that criterion. In
case of ties, average ranks are substituted. The IM with the smallest rank
total is declared the best IM for the particular VI and NRR. The ranking
of the IMs covers the following criteria: (a) bias of the mean of the imputed
data, (b) percentage of correct distributions (PCD), and (c) other measures of
variability, namely MD, MAD and RMSD. All in all, there are five criteria on
which each IM is ranked.
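The tie-aware ranking scheme described above can be sketched in Python (the thesis programs themselves are written in True BASIC; the criterion scores below are illustrative only, not results from the study):

```python
def average_ranks(scores):
    """Rank competing methods by a criterion score (smaller = better,
    rank 1 = best); tied scores receive the average of the tied ranks."""
    order = sorted(scores)
    ranks = []
    for s in scores:
        first = order.index(s) + 1      # first 1-based position of s
        count = order.count(s)          # number of methods tied at s
        ranks.append(first + (count - 1) / 2.0)
    return ranks

# Hypothetical criterion scores for (OMI, HDI, DRI, SRI); smaller is better.
per_criterion = [[3.0, 4.0, 1.0, 1.0],   # e.g. |bias of the mean|
                 [4.0, 3.0, 1.0, 2.0]]   # e.g. MD
rank_totals = [sum(r) for r in zip(*(average_ranks(c) for c in per_criterion))]
# The method with the smallest rank total is declared the best IM.
```

With these hypothetical scores the rank totals are [7.0, 7.0, 2.5, 3.5], so the third method would be selected.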
Tables 12, 13 and 14 show the ranking of the different imputation methods for the
10%, 20% and 30% NRR, respectively. For each NRR, the table containing the
rankings of the IMs is organized as follows: (a) VIs, (b) criteria, (c) OMI,
(d) HDI, (e) DRI, and (f) SRI.
The rankings show that the two regression IMs provided better results than their
model-free counterparts. For all the nonresponse rates under the TOTIN2 vari-
able, the two regression imputation methods tied as the best IM, and, surprisingly,
the HDI finished as the worst IM, behind OMI. Under the TOTEX2 variable, mixed
rankings were seen across the nonresponse rates, though the regression methods
still provided good results. The SRI method finished first in the 10% and 30%
NRR and third in the 20% NRR, while the DRI method finished third, first and
second in the 10%, 20% and 30% NRR, respectively. While the HDI was the worst
IM for TOTIN2, the OMI was judged the worst IM for TOTEX2, ranking last for
both the 10% and 20% NRR and third for the 30% NRR.
In conclusion, the best imputation method for this study is the Stochastic Re-
gression Imputation using the 1997 FIES data. It is very closely followed by
the Deterministic Regression Imputation. In none of the results did the SRI
method rank last across the criteria, NRRs and VIs, unlike DRI, which was the
worst IM for the bias of the mean of the imputed data and the MD criteria. The
researchers selected the HDI as the worst IM in this study: the HDI method rated
poorly in the majority of the results for the different criteria under each NRR
and VI.
Chapter 6
Conclusion
Anyone faced with having to make decisions about imputation procedures will
usually have to choose some compromise between what is technically effective and
what is operationally expedient. If resources are limited, this is a hard choice.
This study aims to help future researchers in choosing the most appropriate
imputation method.
For our particular implementation, all of the methods were programmed from
scratch due to the unavailability of software that can generate imputations for
all the methods needed for this study. Of all the methods, the overall mean
imputation was the easiest to use and to program. The other three methods
required the formation of imputation classes. The two regression imputations
were the hardest to program and the most time-consuming imputation methods.
The results show that the choice of imputation method significantly affected the
estimates of the actual data. The similarities between the two best methods,
namely the Deterministic and Stochastic Regression imputation methods, were due
in part to the adequacy and predictive power of the models.
The bias and variance estimates of the imputed data varied considerably across
imputation methods, and it was unexpected that the Hot Deck Imputation method
rendered the highest estimates in the majority of the nonresponse rates and
variables. Stochastic Regression, on the other hand, was the best method for
that criterion, since the majority of the test results produced relatively small
biases and variances.
The distributions of the imputed data of each method were checked for the preser-
vation of the distribution using the Kolmogorov-Smirnov goodness-of-fit test.
Among the methods used in this study, both regression imputation methods retained
the distribution of the data, especially the Deterministic Regression Imputation,
which generated exactly the same distribution as the actual data.
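The two-sample Kolmogorov-Smirnov statistic behind this check can be sketched in Python (a simplified illustration, not the procedure used in the thesis programs; the comparison against a critical value is omitted):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the two empirical CDFs, evaluated over the pooled sample."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of observations less than or equal to x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)
```

A statistic large relative to the critical value indicates that the imputed data no longer follow the distribution of the actual data.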
In the other tests of accuracy and precision, namely the mean deviation, mean
absolute deviation and root mean square deviation, the different methods provided
mixed results across all nonresponse rates. Some methods did not consistently
yield clearly good results. Only half of the methods provided strong results in
one particular criterion, the preservation of the distribution of the data. In
the other results, inconsistency was evident in the alternating rankings of the
methods.
Given the criteria and procedures for judging the best imputation procedure among
the four methods, the selection of the best method was difficult. Consequently,
in order to determine the best method of imputing nonresponse observations for
each variable in the study, the methods were ranked according to several criteria.
Methods ranked 1 indicate the best imputation method, while methods ranked 4 are
the worst for that particular criterion.
After comparing the methods, the two regression methods, namely the Determinis-
tic and Stochastic Regression Imputation, gave the outstanding results. Therefore,
it can be concluded that the Stochastic Regression Imputation procedure is the
best imputation method for this study, since it did not rank poorly in any
criterion under all NRRs and VIs.
The efficiency of the imputation method was supported by the R2 of the model
and by the random residual added to the deterministic imputed value. The random
residuals added to the deterministic imputation made the estimates less biased
than their deterministic counterparts.
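The idea of adding a randomly drawn observed residual to the deterministic prediction can be sketched in Python (a simplified, single-class version with simple linear regression; the thesis programs work on log-transformed values in True BASIC, and the function and variable names here are illustrative):

```python
import random

def stochastic_regression_impute(x_obs, y_obs, x_miss, seed=None):
    """Fit y = b0 + b1*x on the observed pairs, then impute each missing y
    as the deterministic prediction plus a residual drawn at random from
    the observed residuals (deterministic regression + random residual)."""
    n = len(x_obs)
    xbar = sum(x_obs) / n
    ybar = sum(y_obs) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(x_obs, y_obs))
    sxx = sum((x - xbar) ** 2 for x in x_obs)
    b1 = sxy / sxx                      # least-squares slope
    b0 = ybar - b1 * xbar               # least-squares intercept
    residuals = [y - (b0 + b1 * x) for x, y in zip(x_obs, y_obs)]
    rng = random.Random(seed)
    return [b0 + b1 * x + rng.choice(residuals) for x in x_miss]
```

Dropping the `rng.choice(residuals)` term gives the deterministic (DRI) imputed value.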
In this study, we compared four imputation methods commonly used in dealing
with partial nonresponse, under the assumption of MCAR. However, other methods
are currently being developed and improved. For example, the multiple imputation
method involves independently imputing more than one value for each nonresponse
value. Multiple imputation is an important and powerful form of imputation and
has the advantage that variance estimation under imputation can be carried out
comparatively easily (Kalton, 1983).
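The comparatively easy variance estimation referred to above is usually done with Rubin's combining rules. A minimal sketch, assuming each of the m completed data sets yields a point estimate and a within-imputation variance:

```python
def rubin_pool(point_estimates, within_variances):
    """Rubin's rules for multiple imputation: pool m point estimates into
    one estimate and a total variance T = W + (1 + 1/m) * B, where W is
    the average within-imputation variance and B is the between-imputation
    variance of the point estimates."""
    m = len(point_estimates)
    q_bar = sum(point_estimates) / m
    w = sum(within_variances) / m
    b = sum((q - q_bar) ** 2 for q in point_estimates) / (m - 1)
    return q_bar, w + (1 + 1 / m) * b
```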
Regarding variance estimation, further studies should implement proper variance
estimators such as the jackknife variance estimator, which is often used in
comparing the variance estimates of imputation methods. Rao and Shao (1992)
proposed an adjusted jackknife variance estimator for use with imputation methods
related to the Hot Deck procedure. This variance estimator is said to be asymptot-
ically unbiased.
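For reference, the basic delete-one jackknife (without the Rao-Shao adjustment, which additionally re-adjusts the imputed values within each replicate) can be sketched as:

```python
def jackknife_variance(data, statistic):
    """Delete-one jackknife variance estimate of `statistic` on `data`:
    recompute the statistic with each observation left out in turn, then
    scale the spread of the replicates by (n - 1) / n."""
    n = len(data)
    replicates = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    mean_rep = sum(replicates) / n
    return (n - 1) / n * sum((t - mean_rep) ** 2 for t in replicates)
```

For the sample mean, this reproduces the familiar estimate s^2 / n of the variance of the mean.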
Future researchers may test other methods on the same data set and compare
the results with those presented in this paper. They could also compare the re-
sults of this study with those of multiple imputation and the Rao-Shao jackknife
variance estimator. There is a need, however, for more advanced knowledge of
statistics, particularly Bayesian statistics, in using the above procedures. The
complexity of the methods, especially both regression imputations, could hinder
future researchers in the use of modern variance estimators.
It is also suggested that the matching variable be selected through advanced
statistical methods such as CHAID analysis. The acronym CHAID stands for
Chi-squared Automatic Interaction Detector. It is one of the oldest tree
classification methods, originally proposed by Kass (1980); according to Ripley
(1996), the CHAID algorithm is a descendant of THAID, developed by Morgan and
Messenger (1973). CHAID builds non-binary trees (i.e., trees where more than two
branches can attach to a single root or node) based on a relatively simple
algorithm that is particularly well suited to the analysis of larger datasets.
Also, because the CHAID algorithm often effectively yields many multi-way
frequency tables (e.g., when classifying a categorical response variable with
many categories based on categorical predictors with many classes), it has been
particularly popular in marketing research, in the context of market segmen-
tation studies (StatSoft, 2003).
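The splitting criterion at the heart of CHAID — scoring a candidate predictor by the chi-squared statistic of its cross-tabulation with the response — can be illustrated as follows (the bare statistic only; full CHAID also merges categories and applies Bonferroni-adjusted p-values):

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic of independence for a contingency
    table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat
```

The candidate predictor whose cross-tabulation gives the most significant statistic becomes the next split in the tree.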
the model. These dummy variables are the categories of the matching variables.
It would definitely save time and money since only one model is created and tested.
The researchers strongly recommend using a statistical package that can gen-
erate imputations faster and more easily, with less biased estimates, than custom
programming. It would save the time otherwise spent writing a computer program,
where a majority of the research time goes to debugging and to preventing crashes
due to computer memory overload.
Bibliography
[4] Musil, C., Warner, C., Yobas, P. K. and Jones, S. A Comparison of Imputation
Techniques for Handling Missing Data. Western Journal of Nursing Research,
Vol. 24, No. 7, 815-829 (2002).
[5] National Statistics Office (NSO) (1997 - 2005). Technical Notes on the 1997
Family Income and Expenditure Survey (FIES). Retrieved 18 June 2007,
from http://www.census.gov.ph/data/technotes/notefies.html
[6] Neter, J., Wasserman, W. and Kutner, M.H. Applied Linear Statistical Models,
2nd ed. Homewood, Illinois: Richard D. Irwin, Inc.
[8] Obanil, R. (2006, October 3). Topmost Floor of NSO Building Gutted by
Fire. The Manila Bulletin Online. Retrieved 28 August 2007, from
http://www.mb.com.ph/issues/2006/10/03/MTN2061037203.html
[12] StatSoft, Inc. (2005). STATISTICA (data analysis software system), version
7.1. www.statsoft.com
Appendix
Appendix A
Items and Information Gathered in the FIES
1997
Part I – Identification and Other Information
A. Identification of the Household
B. Other Information:
1. Particulars about the Head of the Family
a) Sex
b) Age as of Last Birthday
c) Marital Status
d) Highest Grade Completed
e) Employment Status
f) Occupation
g) Kind of Industry / Business
h) Class of Worker
2. Other information about the Household
a) Type of Household
b) Number of Family Members Enumerated
c) Number of boarders, helpers and other non-relatives
d) Number of Family Members who are Employed for Pay or Profit
Part II – Expenditures and Other Disbursements
A. Food, Alcoholic Beverages and Tobacco
1. Food Consumed at Home
a) Cereals and Cereal Preparations
b) Roots and Tubers
c) Fruits and Vegetables
d) Meat and Meat Preparations
e) Dairy Products and Eggs
f) Fish and Marine Products
g) Coffee, Cocoa and Tea
h) Non-Alcoholic Beverages
i) Food Not Elsewhere Classified
2. Food Regularly Consumed Outside the Home
3. Alcoholic Beverages
4. Tobacco
5. Food Items, Alcoholic Beverages and
Tobacco Received as Gifts
B. Fuel, Light and Water, Transportation and
Communication and Household Operation
C. Personal Care and Effects, Clothing, Footwear and Other
Wear
D. Education, Recreation and Medical Care
E. Furnishings and Equipment
F. Taxes
G. Housing, House Maintenance and Minor Repairs
H. Miscellaneous Expenditures
I. Other Disbursements
Part III – Income and Other Receipts
A. Salaries and Wages from Employment
B. Net Share of Crops, Fruits and Vegetables Produced or Livestock and Poultry
Raised by Other Households
C. Other Sources of Income
1. Cash Receipts, Gifts, Support, Relief and Other Forms
of Assistance From Abroad
2. Cash Receipts, Support, Assistance and Relief from
Domestic Source
3. Rentals Received From Non-Agricultural Lands,
Buildings, Spaces and Other Properties
4. Interest
5. Pension and Retirement, Workmen's Compensation and
Social Security Benefits
6. Net Winnings from Gambling, Sweepstakes and Raffle
7. Dividends From Investment
8. Profits from Sale of Stocks, Bonds and Real and
Personal Property
9. Back pay and Proceeds from Insurance
10. Inheritance
D. Other Receipts
Appendix B
Source Codes of the Imputation Programs
SIMULATION.BAS
RANDOMIZE
LET TOT = 4130 ! TOT = TOTAL NUMBER OF OBSERVATIONS !
! Creating a new file to save the original data with an additional column with
nonresponse observations !
FOR I = 1 TO TOT
LET RN(I,1) = RND
FOR COL = 1 TO 3
LET IMISS(I,COL) = IFIES(I,COL)
LET EMISS(I,COL) = EFIES(I,COL)
NEXT COL
LET IMISS(I,4) = IFIES(I,3)
LET EMISS(I,4) = EFIES(I,3)
NEXT I
FOR B = 1 TO NON
LET IMISS(B,4) = -1 ! Setting the observation to nonresponse !
LET EMISS(B,4) = -1
NEXT B
CLOSE #3
CLOSE #4
END
OMI.BAS
! Opening and creation of files to be used to upload the data and write the results in the
program !
OPEN #1: NAME "E:\MISSI30%.CSV"
MAT READ #1: SIMI
CLOSE #1
OPEN #2: NAME "E:\MISSE30%.CSV"
MAT READ #2: SIME
CLOSE #2
OPEN #3: NAME "E:\IDATA30%.TXT"
ERASE #3
OPEN #4: NAME "E:\EDATA30%.TXT"
ERASE #4
REM Computation of the overall mean of the first visit nonresponse variables to be
imputed later on the program
FOR I = 1 TO TOT
LET OMEX = OMEX + SIME(I,2)
LET OMIN = OMIN + SIMI(I,2)
NEXT I
LET OMEX = OMEX/TOT
LET OMIN = OMIN/TOT
FOR J = 1 TO NON
LET SIMI(j,4) = OMIN
LET SIME(J,4) = OMEX
NEXT J
REM Computation of the mean deviation, mean absolute deviation, root mean square
deviation
! INCDOMI = deviation of the imputed and the actual observation for the income
variable under OMI !
! EXPDOMI = deviation of the imputed and the actual observation for the expenditure
variable under OMI !
LET INCDOMI = 0
LET EXPDOMI = 0
! INCMDOMI = mean deviation of the imputed and the actual observation for the income
variable under OMI !
! EXPMDOMI = mean deviation of the imputed and the actual observation for the
expenditure variable under OMI !
LET INCMDOMI = 0
LET EXPMDOMI = 0
! INCMADOMI = mean absolute deviation of the imputed and the actual observation for
the income variable under OMI !
! EXPMADOMI = mean absolute deviation of the imputed and the actual observation for
the expenditure variable under OMI !
LET INCMADOMI = 0
LET EXPMADOMI = 0
! INCRMSDOMI = root mean square deviation of the imputed and the actual observation
for the income variable under OMI !
! EXPRMSDOMI = root mean square deviation of the imputed and the actual observation
for the expenditure variable under OMI !
LET INCRMSDOMI = 0
LET EXPRMSDOMI = 0
FOR I = 1 TO NON
! Deviations of the imputed values (column 4) from the actual values (column 3) !
LET INCDOMI = SIMI(I,4) - SIMI(I,3)
LET EXPDOMI = SIME(I,4) - SIME(I,3)
LET INCMDOMI = INCMDOMI + INCDOMI
LET EXPMDOMI = EXPMDOMI + EXPDOMI
LET INCMADOMI = INCMADOMI + ABS(INCDOMI)
LET EXPMADOMI = EXPMADOMI + ABS(EXPDOMI)
LET INCRMSDOMI = INCRMSDOMI + INCDOMI^2
LET EXPRMSDOMI = EXPRMSDOMI + EXPDOMI^2
NEXT I
LET INCMDOMI = INCMDOMI/NON
LET EXPMDOMI = EXPMDOMI/NON
LET INCMADOMI = INCMADOMI/NON
LET EXPMADOMI = EXPMADOMI/NON
LET INCRMSDOMI = SQR(INCRMSDOMI/NON)
LET EXPRMSDOMI = SQR(EXPRMSDOMI/NON)
CLOSE #3
CLOSE #4
CLOSE #5
CLOSE #6
END
HOT DECK.BAS
RANDOMIZE
LET TOT = 4130 ! TOT = Total number of observations !
LET TRIALS = 100 ! Number of trials !
LET N1 = 2635 ! N1 = Number of total observations under employment status 1 !
LET N2 = 1434 ! N2 = Number of total observations under employment status 2 !
LET N3 = 61 ! N3 = Number of total observations under employment status 3 !
LET NRR = 0.3 ! NRR = Nonresponse rate !
LET NON = NRR * TOT ! NON = Number of total nonresponse observations !
DIM SIMI1(N1,5), SIMI2(N2,5), SIMI3(N3,5) ! Matrices that will contain the income observations for each imputation class !
! Matrices that contain the criteria that will be computed later in the program !
DIM CRITIN(TRIALS,3), CRITEX(TRIALS,3)
! Opening and creation of files to be used to upload the data and write the results in the
program !
DO
LET TRIAL = TRIAL + 1 ! Trial count !
! PECK# = Observation number that was chosen randomly for the expenditure
variable under education status # !
! PICK# = Observation number that was chosen randomly for the income variable
under education status # !
FOR I= 1 TO N1
IF SIME1(I,5) = 0 THEN
LET PECK1 = INT(RND*N1) + 1
LET SIME1(I,4) = SIME1(PECK1,2)
END IF
IF SIMI1(I,5) = 0 THEN
LET PICK1 = INT(RND*N1) + 1
LET SIMI1(I,4) = SIMI1(PICK1,2)
END IF
NEXT I
FOR J = 1 TO N2
IF SIME2(J,5) = 0 THEN
LET PECK2 = INT(RND*N2) + 1
LET SIME2(J,4) = SIME2(PECK2,2)
END IF
IF SIMI2(J,5) = 0 THEN
LET PICK2 = INT(RND*N2) + 1
LET SIMI2(J,4) = SIMI2(PICK2,2)
END IF
NEXT J
FOR K = 1 TO N3
IF SIME3(K,5) = 0 THEN
LET PECK3 = INT(RND*N3) + 1
LET SIME3(K,4) = SIME3(PECK3,2)
END IF
IF SIMI3(K,5) = 0 THEN
LET PICK3 = INT(RND*N3) + 1
LET SIMI3(K,4) = SIMI3(PICK3,2)
END IF
NEXT K
LET MDSIMI1 = 0
LET MDSIMI2 = 0
LET MDSIMI3 = 0
LET MADSIMI1 = 0
LET MADSIMI2 = 0
LET MADSIMI3 = 0
LET RMSDSIMI1 = 0
LET RMSDSIMI2 = 0
LET RMSDSIMI3 = 0
LET MDSIME1 = 0
LET MDSIME2 = 0
LET MDSIME3 = 0
LET MADSIME1 = 0
LET MADSIME2 = 0
LET MADSIME3 = 0
LET RMSDSIME1 = 0
LET RMSDSIME2 = 0
LET RMSDSIME3 = 0
FOR A = 1 TO N1
LET NUM = NUM + 1
IF SIMI1(A,5) = 0 THEN
LET DIFFI1 = SIMI1(A,4) - SIMI1(A,3)
LET MDSIMI1 = MDSIMI1 + DIFFI1
LET MADSIMI1 = MADSIMI1 + (ABS(DIFFI1))
LET RMSDSIMI1 = RMSDSIMI1 + (DIFFI1)^2
END IF
IF SIME1(A,5) = 0 THEN
LET DIFFE1 = SIME1(A,4) - SIME1(A,3)
LET MDSIME1 = MDSIME1 + DIFFE1
LET MADSIME1 = MADSIME1 + (ABS(DIFFE1))
LET RMSDSIME1 = RMSDSIME1 + (DIFFE1)^2
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI1(A,COL)
LET SIME(NUM,COL) = SIME1(A,COL)
NEXT COL
CLOSE #7
CLOSE #8
CLOSE #9
CLOSE #10
END
DRI.BAS
! Opening and creation of files to be used to upload the data and write the results in the
program !
FOR J = 1 TO N2
LET LNFVE2(J,1) = LOG(SIME2(J,2))
IF SIME2(J,5) = 1 THEN
LET NOR2 = NOR2 + 1
FOR K = 1 TO N3
LET LNFVE3(K,1) = LOG(SIME3(K,2))
IF SIME3(K,5) = 1 THEN
LET NOR3 = NOR3 + 1
LET LNSVE3(K,1) = LOG(SIME3(K,3))
LET EYBAR3 = EYBAR3 + LNSVE3(K,1)
LET EXBAR3 = EXBAR3 + LNFVE3(K,1)
END IF
FOR A = 1 TO N1
IF SIME1(A,5) = 1 THEN
LET EDXY1 = EDXY1 + ((LNFVE1(A,1) - EXBAR1)*(LNSVE1(A,1) - EYBAR1))
LET EDXSQR1 = EDXSQR1 + (LNFVE1(A,1) - EXBAR1)^2
END IF
IF SIMI1(A,5) = 1 THEN
LET IDXY1 = IDXY1 + ((LNFVI1(A,1) - IXBAR1)*(LNSVI1(A,1) - IYBAR1))
LET IDXSQR1 = IDXSQR1 + (LNFVI1(A,1) - IXBAR1)^2
END IF
NEXT A
FOR B = 1 TO N2
IF SIME2(B,5) = 1 THEN
LET EDXY2 = EDXY2 + ((LNFVE2(B,1) - EXBAR2)*(LNSVE2(B,1) - EYBAR2))
LET EDXSQR2 = EDXSQR2 + (LNFVE2(B,1) - EXBAR2)^2
END IF
IF SIMI2(B,5) = 1 THEN
LET IDXY2 = IDXY2 + ((LNFVI2(B,1) - IXBAR2)*(LNSVI2(B,1) - IYBAR2))
LET IDXSQR2 = IDXSQR2 + (LNFVI2(B,1) - IXBAR2)^2
END IF
NEXT B
FOR C = 1 TO N3
IF SIME3(C,5) = 1 THEN
LET EDXY3 = EDXY3 + ((LNFVE3(C,1) - EXBAR3)*(LNSVE3(C,1) - EYBAR3))
LET EDXSQR3 = EDXSQR3 + (LNFVE3(C,1) - EXBAR3)^2
END IF
IF SIMI3(C,5) = 1 THEN
LET IDXY3 = IDXY3 + ((LNFVI3(C,1) - IXBAR3)*(LNSVI3(C,1) - IYBAR3))
LET IDXSQR3 = IDXSQR3 + (LNFVI3(C,1) - IXBAR3)^2
END IF
NEXT C
FOR Y1 = 1 TO N1
LET DREGE1(Y1,1) = EBZERO1 + EBONE1*LNFVE1(Y1,1)
LET DREGI1(Y1,1) = IBZERO1 + IBONE1*LNFVI1(Y1,1)
NEXT Y1
FOR Y2 = 1 TO N2
LET DREGE2(Y2,1) = EBZERO2 + EBONE2*LNFVE2(Y2,1)
LET DREGI2(Y2,1) = IBZERO2 + IBONE2*LNFVI2(Y2,1)
NEXT Y2
FOR Y3 = 1 TO N3
LET DREGE3(Y3,1) = EBZERO3 + EBONE3*LNFVE3(Y3,1)
LET DREGI3(Y3,1) = IBZERO3 + IBONE3*LNFVI3(Y3,1)
NEXT Y3
! Computation of residuals !
FOR A = 1 TO N1
IF SIME1(A,5) = 1 THEN
LET EH1 = EH1 + 1
LET ECRES1(EH1,1) = DREGE1(A,1) - LNSVE1(A,1)
END IF
IF SIMI1(A,5) = 1 THEN
LET IH1 = IH1 + 1
LET ICRES1(IH1,1) = DREGI1(A,1) - LNSVI1(A,1)
END IF
NEXT A
FOR B = 1 TO N2
IF SIME2(B,5) = 1 THEN
LET EH2 = EH2 + 1
FOR C = 1 TO N3
IF SIME3(C,5) = 1 THEN
LET EH3 = EH3 + 1
LET ECRES3(EH3,1) = DREGE3(C,1) - LNSVE3(C,1)
END IF
IF SIMI3(C,5) = 1 THEN
LET IH3 = IH3 + 1
LET ICRES3(IH3,1) = DREGI3(C,1) - LNSVI3(C,1)
END IF
NEXT C
FOR Y1 = 1 TO N1
LET DREGE1(Y1,1) = EXP(DREGE1(Y1,1))
LET DREGI1(Y1,1) = EXP(DREGI1(Y1,1))
NEXT Y1
FOR Y2 = 1 TO N2
LET DREGE2(Y2,1) = EXP(DREGE2(Y2,1))
LET DREGI2(Y2,1) = EXP(DREGI2(Y2,1))
NEXT Y2
FOR Y3 = 1 TO N3
LET DREGE3(Y3,1) = EXP(DREGE3(Y3,1))
LET DREGI3(Y3,1) = EXP(DREGI3(Y3,1))
NEXT Y3
FOR A = 1 TO N1
LET NUM = NUM + 1
IF SIME1(A,5) = 0 THEN
LET MDDREGE1 = MDDREGE1 + (DREGE1(A,1) - SIME1(A,3))
LET MADDREGE1 = MADDREGE1 + (ABS(DREGE1(A,1) - SIME1(A,3)))
LET RMSDDREGE1 = RMSDDREGE1 + ((DREGE1(A,1) - SIME1(A,3))^2)
LET SIME1(A,4) = DREGE1(A,1)
END IF
IF SIMI1(A,5) = 0 THEN
LET MDDREGI1 = MDDREGI1 + (DREGI1(A,1) - SIMI1(A,3))
LET MADDREGI1 = MADDREGI1 + (ABS(DREGI1(A,1) - SIMI1(A,3)))
LET RMSDDREGI1 = RMSDDREGI1 + ((DREGI1(A,1) - SIMI1(A,3))^2)
LET SIMI1(A,4) = DREGI1(A,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI1(A,COL)
LET SIME(NUM,COL) = SIME1(A,COL)
NEXT COL
NEXT A
FOR B = 1 TO N2
LET NUM = NUM + 1
IF SIME2(B,5) = 0 THEN
LET MDDREGE2 = MDDREGE2 + (DREGE2(B,1) - SIME2(B,3))
LET MADDREGE2 = MADDREGE2 + (ABS(DREGE2(B,1) - SIME2(B,3)))
LET RMSDDREGE2 = RMSDDREGE2 + ((DREGE2(B,1) - SIME2(B,3))^2)
LET SIME2(B,4) = DREGE2(B,1)
END IF
IF SIMI2(B,5) = 0 THEN
LET MDDREGI2 = MDDREGI2 + (DREGI2(B,1) - SIMI2(B,3))
LET MADDREGI2 = MADDREGI2 + (ABS(DREGI2(B,1) - SIMI2(B,3)))
LET RMSDDREGI2 = RMSDDREGI2 + ((DREGI2(B,1) - SIMI2(B,3))^2)
LET SIMI2(B,4) = DREGI2(B,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI2(B,COL)
LET SIME(NUM,COL) = SIME2(B,COL)
NEXT COL
NEXT B
FOR C = 1 TO N3
LET NUM = NUM + 1
IF SIME3(C,5) = 0 THEN
LET MDDREGE3 = MDDREGE3 + (DREGE3(C,1) - SIME3(C,3))
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI3(C,COL)
LET SIME(NUM,COL) = SIME3(C,COL)
NEXT COL
NEXT C
PRINT " MDDREGE ", " MADDREGE ", " RMSDDREGE "
PRINT MDDREGE, MADDREGE, RMSDDREGE
PRINT " MDDREGI ", " MADDREGI ", " RMSDDREGI "
PRINT MDDREGI, MADDREGI, RMSDDREGI
CLOSE #7
CLOSE #8
END
SRI.BAS
RANDOMIZE
LET TOT = 4130 ! TOT = Total number of observations !
LET TRIALS = 100 ! Number of trials !
LET N1 = 2635 ! N1 = Number of total observations under employment status 1 !
LET N2 = 1434 ! N2 = Number of total observations under employment status 2 !
LET N3 = 61 ! N3 = Number of total observations under employment status 3 !
LET NRR = 0.3 ! NRR = Nonresponse rate !
LET NON = NRR * TOT ! NON = Number of total nonresponse observations !
DIM SIMI1(N1,5), SIMI2(N2,5), SIMI3(N3,5) ! Matrices that will contain the income observations for each imputation class !
! Matrices that contain the criteria that will be computed later in the program !
DIM CRITIN(TRIALS,3), CRITEX(TRIALS,3)
! Opening and creation of files to be used to upload the data and write the results in the
program !
FOR J = 1 TO N2
LET LNFVE2(J,1) = LOG(SIME2(J,2))
IF SIME2(J,5) = 1 THEN
FOR K = 1 TO N3
LET LNFVE3(K,1) = LOG(SIME3(K,2))
IF SIME3(K,5) = 1 THEN
LET NOR3 = NOR3 + 1
LET LNSVE3(K,1) = LOG(SIME3(K,3))
LET EYBAR3 = EYBAR3 + LNSVE3(K,1)
LET EXBAR3 = EXBAR3 + LNFVE3(K,1)
END IF
FOR A = 1 TO N1
IF SIME1(A,5) = 1 THEN
LET EDXY1 = EDXY1 + ((LNFVE1(A,1) - EXBAR1)*(LNSVE1(A,1) - EYBAR1))
LET EDXSQR1 = EDXSQR1 + (LNFVE1(A,1) - EXBAR1)^2
END IF
IF SIMI1(A,5) = 1 THEN
LET IDXY1 = IDXY1 + ((LNFVI1(A,1) - IXBAR1)*(LNSVI1(A,1) - IYBAR1))
LET IDXSQR1 = IDXSQR1 + (LNFVI1(A,1) - IXBAR1)^2
END IF
NEXT A
FOR B = 1 TO N2
IF SIME2(B,5) = 1 THEN
LET EDXY2 = EDXY2 + ((LNFVE2(B,1) - EXBAR2)*(LNSVE2(B,1) - EYBAR2))
LET EDXSQR2 = EDXSQR2 + (LNFVE2(B,1) - EXBAR2)^2
END IF
IF SIMI2(B,5) = 1 THEN
LET IDXY2 = IDXY2 + ((LNFVI2(B,1) - IXBAR2)*(LNSVI2(B,1) - IYBAR2))
LET IDXSQR2 = IDXSQR2 + (LNFVI2(B,1) - IXBAR2)^2
END IF
NEXT B
FOR C = 1 TO N3
IF SIME3(C,5) = 1 THEN
LET EDXY3 = EDXY3 + ((LNFVE3(C,1) - EXBAR3)*(LNSVE3(C,1) - EYBAR3))
LET EDXSQR3 = EDXSQR3 + (LNFVE3(C,1) - EXBAR3)^2
END IF
IF SIMI3(C,5) = 1 THEN
LET IDXY3 = IDXY3 + ((LNFVI3(C,1) - IXBAR3)*(LNSVI3(C,1) - IYBAR3))
LET IDXSQR3 = IDXSQR3 + (LNFVI3(C,1) - IXBAR3)^2
END IF
NEXT C
FOR Y1 = 1 TO N1
LET DREGE1(Y1,1) = EBZERO1 + EBONE1*LNFVE1(Y1,1)
LET DREGI1(Y1,1) = IBZERO1 + IBONE1*LNFVI1(Y1,1)
NEXT Y1
FOR Y2 = 1 TO N2
LET DREGE2(Y2,1) = EBZERO2 + EBONE2*LNFVE2(Y2,1)
LET DREGI2(Y2,1) = IBZERO2 + IBONE2*LNFVI2(Y2,1)
NEXT Y2
FOR Y3 = 1 TO N3
LET DREGE3(Y3,1) = EBZERO3 + EBONE3*LNFVE3(Y3,1)
LET DREGI3(Y3,1) = IBZERO3 + IBONE3*LNFVI3(Y3,1)
NEXT Y3
! Computation of residuals !
FOR A = 1 TO N1
IF SIME1(A,5) = 1 THEN
LET EH1 = EH1 + 1
LET ECRES1(EH1,1) = DREGE1(A,1) - LNSVE1(A,1)
END IF
IF SIMI1(A,5) = 1 THEN
LET IH1 = IH1 + 1
LET ICRES1(IH1,1) = DREGI1(A,1) - LNSVI1(A,1)
END IF
NEXT A
FOR B = 1 TO N2
IF SIME2(B,5) = 1 THEN
FOR C = 1 TO N3
IF SIME3(C,5) = 1 THEN
LET EH3 = EH3 + 1
LET ECRES3(EH3,1) = DREGE3(C,1) - LNSVE3(C,1)
END IF
IF SIMI3(C,5) = 1 THEN
LET IH3 = IH3 + 1
LET ICRES3(IH3,1) = DREGI3(C,1) - LNSVI3(C,1)
END IF
NEXT C
FOR Y1 = 1 TO N1
LET DREGE1(Y1,1) = EXP(DREGE1(Y1,1))
LET DREGI1(Y1,1) = EXP(DREGI1(Y1,1))
NEXT Y1
FOR Y2 = 1 TO N2
LET DREGE2(Y2,1) = EXP(DREGE2(Y2,1))
LET DREGI2(Y2,1) = EXP(DREGI2(Y2,1))
NEXT Y2
FOR Y3 = 1 TO N3
LET DREGE3(Y3,1) = EXP(DREGE3(Y3,1))
LET DREGI3(Y3,1) = EXP(DREGI3(Y3,1))
NEXT Y3
NEXT ROW1
END IF
NEXT ROW2
NEXT ROW3
LET CLRESID = 4
! The use of the class means of the frequency classes of the frequency distribution of the
residuals !
FOR I1 = 1 TO NOR1
FOR CL = 1 TO 4
NEXT I1
FOR I2 = 1 TO NOR2
FOR CL = 1 TO 4
LET LBI2 = MINIRES2 + (CWI2*(CL-1))
LET UBI2 = MINIRES2 + (CWI2*(CL))
LET LBE2 = MINERES2 + (CWE2*(CL-1))
LET UBE2 = MINERES2 + (CWE2*(CL))
NEXT I2
FOR I3 = 1 TO NOR3
FOR CL = 1 TO 4
LET LBI3 = MINIRES3 + (CWI3*(CL-1))
LET UBI3 = MINIRES3 + (CWI3*(CL))
LET LBE3 = MINERES3 + (CWE3*(CL-1))
LET UBE3 = MINERES3 + (CWE3*(CL))
NEXT CL
DO
LET TRIAL = TRIAL + 1
FOR B1 = 1 TO N1
IF SIMI1(B1,5) = 0 THEN
LET PICK1 = INT(RND*NOR1) + 1
LET SREGI1(B1,1) = DREGI1(B1,1) + STORIS1(PICK1,1)
END IF
IF SIME1(B1,5) = 0 THEN
LET PECK1 = INT(RND*NOR1) + 1
LET SREGE1(B1,1) = DREGE1(B1,1) + STORES1(PECK1,1)
END IF
NEXT B1
FOR B2 = 1 TO N2
IF SIMI2(B2,5) = 0 THEN
LET PICK2 = INT(RND*NOR2) + 1
LET SREGI2(B2,1) = DREGI2(B2,1) + STORIS2(PICK2,1)
END IF
IF SIME2(B2,5) = 0 THEN
LET PECK2 = INT(RND*NOR2) + 1
LET SREGE2(B2,1) = DREGE2(B2,1) + STORES2(PECK2,1)
END IF
NEXT B2
FOR B3 = 1 TO N3
IF SIMI3(B3,5) = 0 THEN
LET PICK3 = INT(RND*NOR3) + 1
LET SREGI3(B3,1) = DREGI3(B3,1) + STORIS3(PICK3,1)
END IF
IF SIME3(B3,5) = 0 THEN
LET PECK3 = INT(RND*NOR3) + 1
LET SREGE3(B3,1) = DREGE3(B3,1) + STORES3(PECK3,1)
END IF
NEXT B3
FOR Y1 = 1 TO N1
LET SREGE1(Y1,1) = EXP(SREGE1(Y1,1))
LET SREGI1(Y1,1) = EXP(SREGI1(Y1,1))
NEXT Y1
FOR Y2 = 1 TO N2
LET SREGE2(Y2,1) = EXP(SREGE2(Y2,1))
LET SREGI2(Y2,1) = EXP(SREGI2(Y2,1))
NEXT Y2
FOR Y3 = 1 TO N3
LET SREGE3(Y3,1) = EXP(SREGE3(Y3,1))
LET SREGI3(Y3,1) = EXP(SREGI3(Y3,1))
NEXT Y3
LET MDSREGI1 = 0
LET MDSREGI2 = 0
LET MDSREGI3 = 0
LET MADSREGI1 = 0
LET MADSREGI2 = 0
LET MADSREGI3 = 0
LET RMSDSREGI1 = 0
LET RMSDSREGI2 = 0
LET RMSDSREGI3 = 0
LET MDSREGE1 = 0
LET MDSREGE2 = 0
LET MDSREGE3 = 0
LET MADSREGE1 = 0
LET MADSREGE2 = 0
LET MADSREGE3 = 0
LET RMSDSREGE1 = 0
LET RMSDSREGE2 = 0
LET RMSDSREGE3 = 0
FOR A = 1 TO N1
LET NUM = NUM + 1
IF SIME1(A,5) = 0 THEN
LET MDSREGE1 = MDSREGE1 + (SREGE1(A,1) - SIME1(A,3))
LET MADSREGE1 = MADSREGE1 + (ABS(SREGE1(A,1) - SIME1(A,3)))
LET RMSDSREGE1 = RMSDSREGE1 + ((SREGE1(A,1) - SIME1(A,3))^2)
LET SIME1(A,4) = SREGE1(A,1)
END IF
IF SIMI1(A,5) = 0 THEN
LET MDSREGI1 = MDSREGI1 + (SREGI1(A,1) - SIMI1(A,3))
LET MADSREGI1 = MADSREGI1 + (ABS(SREGI1(A,1) - SIMI1(A,3)))
LET RMSDSREGI1 = RMSDSREGI1 + ((SREGI1(A,1) - SIMI1(A,3))^2)
LET SIMI1(A,4) = SREGI1(A,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI1(A,COL)
LET SIME(NUM,COL) = SIME1(A,COL)
NEXT COL
NEXT A
FOR B = 1 TO N2
LET NUM = NUM + 1
IF SIME2(B,5) = 0 THEN
LET MDSREGE2 = MDSREGE2 + (SREGE2(B,1) - SIME2(B,3))
LET MADSREGE2 = MADSREGE2 + (ABS(SREGE2(B,1) - SIME2(B,3)))
LET RMSDSREGE2 = RMSDSREGE2 + ((SREGE2(B,1) - SIME2(B,3))^2)
LET SIME2(B,4) = SREGE2(B,1)
END IF
IF SIMI2(B,5) = 0 THEN
LET MDSREGI2 = MDSREGI2 + (SREGI2(B,1) - SIMI2(B,3))
LET MADSREGI2 = MADSREGI2 + (ABS(SREGI2(B,1) - SIMI2(B,3)))
LET RMSDSREGI2 = RMSDSREGI2 + ((SREGI2(B,1) - SIMI2(B,3))^2)
LET SIMI2(B,4) = SREGI2(B,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI2(B,COL)
LET SIME(NUM,COL) = SIME2(B,COL)
NEXT COL
NEXT B
FOR C = 1 TO N3
LET NUM = NUM + 1
IF SIME3(C,5) = 0 THEN
LET MDSREGE3 = MDSREGE3 + (SREGE3(C,1) - SIME3(C,3))
LET MADSREGE3 = MADSREGE3 + (ABS(SREGE3(C,1) - SIME3(C,3)))
LET RMSDSREGE3 = RMSDSREGE3 + ((SREGE3(C,1) - SIME3(C,3))^2)
LET SIME3(C,4) = SREGE3(C,1)
END IF
IF SIMI3(C,5) = 0 THEN
LET MDSREGI3 = MDSREGI3 + (SREGI3(C,1) - SIMI3(C,3))
LET MADSREGI3 = MADSREGI3 + (ABS(SREGI3(C,1) - SIMI3(C,3)))
LET RMSDSREGI3 = RMSDSREGI3 + ((SREGI3(C,1) - SIMI3(C,3))^2)
LET SIMI3(C,4) = SREGI3(C,1)
END IF
FOR COL = 1 TO 5
LET SIMI(NUM,COL) = SIMI3(C,COL)
NEXT C
CLOSE #7
CLOSE #8
CLOSE #9
CLOSE #10
END
Appendix C
Model Validation of the Regression Equations
used in the Regression Imputation Procedures
TOTEX, 10% Nonresponse Rate, First Imputation Class
MULTIPLE R             0.853
MULTIPLE R2            0.728
ADJUSTED R2            0.728
F-STAT                 6363.590
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.277

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     486.5398  1     486.5398  6363.590  0.00
Residual  181.7378  2377  0.0765
Total     668.2776  2378

[Predicted values (95% confidence) vs. residuals scatter plot omitted]

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.075187   -0.038196

[Normal probability plot of residuals omitted]
MULTIPLE R             0.884
MULTIPLE R2            0.782
ADJUSTED R2            0.781
F-STAT                 4574.954
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.309

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     436.3063  1     436.3063  4574.954  0.00
Residual  121.8809  1278  0.0954
Total     558.1871  1279

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.043782   -0.022334

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.9499
MULTIPLE R2            0.9023
ADJUSTED R2            0.9005
F-STAT                 516.8993
P-VALUE                0.0000
STD. ERR. OF ESTIMATE  0.3314

Analysis of Variance
SV        SS        df   MS        F         p-level
Model     56.78424  1    56.78424  516.8993  0.000000
Residual  6.15191   56   0.10986
Total     62.93615  57

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.167374   -0.094747

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.857
MULTIPLE R2            0.734
ADJUSTED R2            0.734
F-STAT                 5786.271
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.273

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     429.7392  1     429.7392  5786.271  0.00
Residual  155.8159  2098  0.0743
Total     585.5551  2099

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.013534   -0.006859

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.887
MULTIPLE R2            0.787
ADJUSTED R2            0.787
F-STAT                 4268.380
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.308

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     405.0530  1     405.0530  4268.380  0.00
Residual  109.3204  1152  0.0949
Total     514.3734  1153

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.014977   -0.007528

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
Expenditure, 20% Nonresponse Rate, Third Imputation Class
MULTIPLE R             0.9490
MULTIPLE R2            0.9006
ADJUSTED R2            0.8985
F-STAT                 434.6591
P-VALUE                0.0000
STD. ERR. OF ESTIMATE  0.3336

Analysis of Variance
SV        SS        df   MS        F         p-level
Model     48.36771  1    48.36771  434.6591  0.000000
Residual  5.34131   48   0.11128
Total     53.70902  49

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.268400   -0.167040

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.840
MULTIPLE R2            0.705
ADJUSTED R2            0.705
F-STAT                 4382.102
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.290

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     367.4335  1     367.4335  4382.102  0.00
Residual  153.5270  1831  0.0838
Total     520.9605  1832

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.072061   -0.036173

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R             0.890
MULTIPLE R2            0.791
ADJUSTED R2            0.791
F-STAT                 3841.345
P-VALUE                0.000
STD. ERR. OF ESTIMATE  0.300

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     346.1192  1     346.1192  3841.345  0.00
Residual  91.1849   1012  0.0901
Total     437.3041  1013

Durbin-Watson Test
          DW STAT    Serial Correlation
Estimate  2.023625   -0.012021

[Normal probability plot of residuals omitted]

[Predicted values (95% confidence) vs. residuals scatter plot omitted]
MULTIPLE R              0.9425
MULTIPLE R2             0.8882
ADJUSTED R2             0.8856
F-STAT                333.7148
P-VALUE                 0.0000
STD. ERR. OF ESTIMATE   0.3237

Analysis of Variance
SV        SS        df  MS        F         p-level
Model     34.97366  1   34.97366  333.7148  0.000000
Residual  4.40164   42  0.10480
Total     39.37531  43

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.589756  -0.326722
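In the small imputation classes the gap between R² and adjusted R² becomes visible (0.8882 versus 0.8856 in the panel above), since adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1) charges for the degrees of freedom used. An illustrative recomputation from the panel's ANOVA entries (one regressor, so p = 1 and n − p − 1 = 42):

```python
# Recompute adjusted R-squared from the ANOVA entries above:
# SS_Model = 34.97366, SS_Residual = 4.40164, residual df = 42.
ss_model, ss_resid = 34.97366, 4.40164
df_resid = 42
n = df_resid + 2              # one regressor plus the intercept

r2 = ss_model / (ss_model + ss_resid)
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
print(round(r2, 4), round(adj_r2, 4))
```

For the large samples elsewhere in this appendix (n in the thousands), the penalty factor (n − 1)/(n − p − 1) is essentially 1, which is why R² and adjusted R² coincide to three decimals there.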
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.840
MULTIPLE R2             0.706
ADJUSTED R2             0.706
F-STAT               5703.605
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.331

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     625.0487  1     625.0487  5703.605  0.00
Residual  260.4915  2377  0.1096
Total     885.5402  2378

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.047121  -0.023913
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
Income Variable, 10% Nonresponse Rate, Second Imputation Class
MULTIPLE R              0.897
MULTIPLE R2             0.805
ADJUSTED R2             0.804
F-STAT               5261.480
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.331

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     576.7623  1     576.7623  5261.480  0.00
Residual  140.0941  1278  0.1096
Total     716.8564  1279

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  1.934528  0.031428
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.9591
MULTIPLE R2             0.9199
ADJUSTED R2             0.9185
F-STAT                642.9754
P-VALUE                 0.0000
STD. ERR. OF ESTIMATE   0.3171

Analysis of Variance
SV        SS        df  MS        F         p-level
Model     64.67059  1   64.67059  642.9753  0.000000
Residual  5.63249   56  0.10058
Total     70.30308  57

Durbin-Watson Test for Independence of Error Terms
          DW Stat   Serial Correlation
Estimate  2.157647  -0.079104
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.838
MULTIPLE R2             0.703
ADJUSTED R2             0.702
F-STAT               4954.234
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.331

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     542.9226  1     542.9226  4954.233  0.00
Residual  229.9148  2098  0.1096
Total     772.8375  2099

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  1.985916  0.007035
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
Income, 20% Nonresponse Rate, Second Imputation Class
MULTIPLE R              0.906
MULTIPLE R2             0.821
ADJUSTED R2             0.821
F-STAT               5275.064
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.318

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     532.7126  1     532.7126  5275.063  0.00
Residual  116.3370  1152  0.1010
Total     649.0496  1153

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  1.980753  0.008036
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.9566
MULTIPLE R2             0.9151
ADJUSTED R2             0.9133
F-STAT                517.0385
P-VALUE                 0.0000
STD. ERR. OF ESTIMATE   0.3277

Analysis of Variance
SV        SS        df  MS        F         p-level
Model     55.51737  1   55.51737  517.0385  0.000000
Residual  5.15403   48  0.10738
Total     60.67140  49

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.280743  -0.142227
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
MULTIPLE R              0.845
MULTIPLE R2             0.713
ADJUSTED R2             0.713
F-STAT               4557.328
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.330

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     496.0775  1     496.0775  4557.328  0.00
Residual  199.3093  1831  0.1089
Total     695.3868  1832

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.094357  -0.047223
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]
Income, 30% Nonresponse Rate, Second Imputation Class
MULTIPLE R              0.909
MULTIPLE R2             0.826
ADJUSTED R2             0.826
F-STAT               4793.392
P-VALUE                 0.000
STD. ERR. OF ESTIMATE   0.310

Analysis of Variance
SV        SS        df    MS        F         p-level
Model     460.5995  1     460.5995  4793.392  0.00
Residual  97.2436   1012  0.0961
Total     557.8431  1013

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.072614  -0.038549
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot (dependent variable: LN SV) with 95% confidence bands]
MULTIPLE R              0.9654
MULTIPLE R2             0.9319
ADJUSTED R2             0.9303
F-STAT                574.8240
P-VALUE                 0.0000
STD. ERR. OF ESTIMATE   0.2753

Analysis of Variance
SV        SS        df  MS        F         p-level
Model     43.55594  1   43.55594  574.8240  0.000000
Residual  3.18245   42  0.07577
Total     46.73840  43

Durbin-Watson Test
          DW Stat   Serial Correlation
Estimate  2.249448  -0.190671
[Figure: Normal probability plot of residuals]
[Figure: Predicted vs. residuals scatterplot with 95% confidence bands]