
Estimation of a rare sensitive attribute using Poisson distribution

Article  in  Statistics: A Journal of Theoretical and Applied Statistics · June 2012


DOI: 10.1080/02331888.2010.524300



Statistics
A Journal of Theoretical and Applied Statistics

ISSN: 0233-1888 (Print) 1029-4910 (Online) Journal homepage: http://www.tandfonline.com/loi/gsta20


Margaret Land , Sarjinder Singh & Stephen A. Sedory

To cite this article: Margaret Land , Sarjinder Singh & Stephen A. Sedory (2012) Estimation
of a rare sensitive attribute using Poisson distribution, Statistics, 46:3, 351-360, DOI:
10.1080/02331888.2010.524300

To link to this article: http://dx.doi.org/10.1080/02331888.2010.524300

Published online: 24 Jan 2011.


Statistics, Vol. 46, No. 3, June 2012, 351–360

Estimation of a rare sensitive attribute using Poisson distribution


Margaret Land^a, Sarjinder Singh^b* and Stephen A. Sedory^b
^a TX-ESA Environmental Consultants, 1713 Santa Monica Blvd., Kingsville, TX 78363-3460, USA;
^b Department of Mathematics, Texas A&M University-Kingsville, MSC172, 700 University Blvd.,
Kingsville, TX 78363, USA

(Received 12 December 2009; final version received 1 September 2010)

In this paper, a new method to estimate the mean of the number of persons possessing a rare sensitive
attribute is proposed by utilizing the Poisson distribution in survey sampling. Two situations are discussed:
that when the proportion of persons possessing a rare unrelated attribute is known and that when it is
unknown. Unbiased estimators of the mean number of persons possessing the rare sensitive attribute
under two different situations are proposed. The variance expressions are derived in each situation. The
relative efficiencies of the proposed estimators over the direct question method estimator are investigated for
different choices of parameters and are discussed. A technical point is made that the traditional randomized
response models cannot be used to estimate the mean of the Poisson random variable.

Keywords: randomized response sampling; variance; rare sensitive variables; efficiency

1. Introduction

Warner [1] considered a case in which the respondents in a population can be divided into two
mutually exclusive groups: one group with stigmatizing/sensitive characteristic A and the other
group without it. For estimating π , the proportion of respondents in the population belonging to
the sensitive group A, a simple random sample with replacement (SRSWR) of n respondents
is selected from the population. To collect information on the sensitive characteristic, Warner [1]
made use of a randomization device. One such device is a deck of cards with each card having
one of the following two statements: (i) ‘I belong to group A’; (ii) ‘I do not belong to group
A’. The statements occur with relative frequencies, P0 and (1 − P0 ), respectively, in the deck of
cards. Each respondent in the sample is asked to select a card at random from the well-shuffled
deck. Without showing the card to the interviewer, the interviewee answers the question, ‘Is the
statement on the card true for you?’ The number of respondents n1 that answer ‘yes’ is binomially
distributed with parameters n and P0 π + (1 − P0 )(1 − π ). The maximum likelihood estimator
of π exists for P0 ≠ 0.5 and is given by

$$\hat{\pi}_w = \frac{(n_1/n) - (1 - P_0)}{2P_0 - 1}. \tag{1}$$

*Corresponding author. Email: sarjinder@yahoo.com



© 2012 Taylor & Francis

The above estimator is unbiased with variance

$$V(\hat{\pi}_w) = \frac{\pi(1 - \pi)}{n} + \frac{P_0(1 - P_0)}{n(2P_0 - 1)^2}. \tag{2}$$
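As an illustration of Equations (1) and (2), the following minimal Python sketch computes Warner's estimator and its variance, and checks the estimator on simulated responses. The function names, the simulation helper, and the parameter values (π = 0.2, P0 = 0.7, n = 10,000) are assumptions made for this example only; they do not come from the paper.

```python
import random


def warner_estimate(n_yes, n, p0):
    """Equation (1): estimate pi from the number of 'yes' answers."""
    if p0 == 0.5:
        raise ValueError("the estimator is undefined for P0 = 0.5")
    return (n_yes / n - (1 - p0)) / (2 * p0 - 1)


def warner_variance(pi, n, p0):
    """Equation (2): variance of Warner's estimator."""
    return pi * (1 - pi) / n + p0 * (1 - p0) / (n * (2 * p0 - 1) ** 2)


def simulate_yes_count(pi, n, p0, rng):
    """Hypothetical helper: simulate n respondents drawing one card each."""
    yes = 0
    for _ in range(n):
        in_group_a = rng.random() < pi
        drew_statement_i = rng.random() < p0  # card (i): 'I belong to group A'
        # The answer is 'yes' when the drawn statement is true for the respondent.
        if in_group_a == drew_statement_i:
            yes += 1
    return yes


rng = random.Random(0)  # fixed seed so the illustration is repeatable
n_yes = simulate_yes_count(pi=0.2, n=10_000, p0=0.7, rng=rng)
estimate = warner_estimate(n_yes, 10_000, 0.7)
print("estimate of pi:", estimate)
print("theoretical variance:", warner_variance(0.2, 10_000, 0.7))
```

The second term of Equation (2) grows without bound as P0 approaches 0.5, which is why the guard against P0 = 0.5 is included.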
In Warner’s model, both statements and the associated questions refer to the same sensitive characteristic
A or its complement A^c. Greenberg et al. [2] felt that, to protect the privacy of the respondents,
it is desirable that the two questions be unrelated, and suggested an unrelated question model.
In Greenberg et al.’s unrelated question model, the data-gathering randomization device consists
of two questions: (i) Are you a member of group A? (ii) Are you a member of group Y? Here the
characteristic Y and its complement are innocuous and unrelated to A. For example, in estimating
the proportion of persons having extramarital relations in a certain community, the two questions
may be: (a) Are you having extramarital relations? (b) Were you born in the month of March?
Clearly, the second question has nothing to do with extramarital relations. Greenberg et al. [2],
in their theoretical development, dealt with two situations involving πy (the proportion of persons
with the unrelated characteristic Y): that where it is known and that where it is unknown. Greenberg
et al. [2] suggested that, for optimal choices of Pi, i = 1, 2, one should be close to one and the
other close to zero. The value πy should be chosen close to zero or one according as π < 0.5 or
π > 0.5. Since the work by Warner [1], a huge literature has emerged on the use and construction
of different randomization devices to estimate the population proportion of a sensitive attribute
in survey sampling. For example, one could refer to Tracy and Mangat [3] and Fox and Tracy [4],
among others.
In the present paper, we consider a different and unique problem where the number of persons
possessing a rare sensitive attribute is very small and a huge sample size is required to estimate this
number. The capacity of our communication systems is increasing rapidly; so it should soon be
possible to conduct such large randomized response surveys over the internet, by telephone, etc.
It is worth mentioning that at present, some respondents do not trust the use of randomization
devices in surveys, but people are becoming more educated about their benefits as time goes on.
There seems to be a need for some kind of conference, organized by sociologists across the
world, to teach the general public about the role of randomized response devices in human surveys
and to help create trust in their use.

2. Proposed estimator when proportion of a rare unrelated attribute is known

Let π1 be the true proportion of the rare sensitive attribute A1 in the population. Examples include the
proportion of AIDS patients who continue having affairs with strangers; the proportion of persons
who have witnessed a murder; the proportion of persons who have been told by their doctors that they will
not survive long due to a ghastly disease; and the proportion of girls raped by their own fathers. Note
that crime and criminals have no limits; thus there is no end of such issues. Consider selecting a
large sample of n persons from the population such that as n → ∞ and π1 → 0 then nπ1 = λ1
(finite). Let π2 be the true proportion of the population having the rare unrelated attribute A2 such
that as n → ∞ and π2 → 0 then nπ2 = λ2 (finite and known). For example, π2 might be the
proportion of persons who are born exactly at 12:00 o’clock; the proportion of babies born blind;
the proportion of triplet births delivered by ladies, etc. Each respondent selected in the sample is
requested to rotate a spinner bearing two types of statements:
(a) Do you possess the rare sensitive attribute A1 ?
and
(b) Do you possess the rare unrelated attribute A2 ?
with probabilities P and (1 − P ), respectively.

Sometimes it may not be feasible to go door by door with a spinner to collect data from the
respondents. In such situations, we suggest replacing the use of a spinner with a question based
on a known proportion of a third unrelated characteristic, say A3. Let P be the known proportion
of persons with this characteristic, say those born in summer. Each respondent could then be asked, through a secure e-mail system or
by telephone, to do the following: If you were born in summer and you possess the
rare sensitive attribute A1, then report ‘yes’, otherwise report ‘no’. If you were not born in summer
and you possess the rare unrelated non-sensitive attribute A2, then report ‘yes’, otherwise report
‘no’. In this way, privacy is maintained whether one uses a spinner or some
third characteristic.
In either one of these cases, the probability of a ‘yes’ answer is given by

θ0 = P π1 + (1 − P )π2 . (3)

Note that both attributes A1 and A2 are very rare in the population. As before, we assume that
as n → ∞ and θ0 → 0, nθ0 = λ0 (finite). Let y1, y2, ..., yn be a random sample of n
observations from the Poisson distribution with parameter λ0. The likelihood function
of the random sample of n observations is given by

$$L = \prod_{i=1}^{n} \frac{e^{-\lambda_0} \lambda_0^{y_i}}{y_i!}. \tag{4}$$

The natural log-likelihood function is given by

$$\ln(L) = -n[P\lambda_1 + (1 - P)\lambda_2] + \sum_{i=1}^{n} y_i \ln[P\lambda_1 + (1 - P)\lambda_2] - \sum_{i=1}^{n} \ln(y_i!). \tag{5}$$

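On setting

$$\frac{\partial \ln(L)}{\partial \lambda_1} = 0, \tag{6}$$

the maximum-likelihood estimator of λ1 is given by

$$\hat{\lambda}_1 = \frac{1}{P}\left[\frac{1}{n}\sum_{i=1}^{n} y_i - (1 - P)\lambda_2\right]. \tag{7}$$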

Thus, we have the following theorems:

Theorem 2.1 The estimator λ̂1 is an unbiased estimator of the parameter λ1 .

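Proof Since yi ∼ P(λ0), that is, yi follows a Poisson distribution with parameter λ0 = Pλ1 +
(1 − P)λ2, we have

$$\begin{aligned}
E(\hat{\lambda}_1) &= \frac{1}{P}\left[\frac{1}{n}\sum_{i=1}^{n} E(y_i) - (1 - P)\lambda_2\right] = \frac{1}{P}\left[\frac{1}{n}\sum_{i=1}^{n} \lambda_0 - (1 - P)\lambda_2\right] \\
&= \frac{1}{P}[\lambda_0 - (1 - P)\lambda_2] = \lambda_1,
\end{aligned}$$

which proves the theorem.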

Theorem 2.2 The variance of the estimator λ̂1 is given by

$$V(\hat{\lambda}_1) = \frac{\lambda_1}{nP} + \frac{(1 - P)\lambda_2}{nP^2}. \tag{8}$$

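Proof Since yi ∼ P(λ0), that is, yi follows a Poisson distribution with parameter λ0 = Pλ1 +
(1 − P)λ2, and all are independent, we have

$$\begin{aligned}
V(\hat{\lambda}_1) &= V\left[\frac{1}{P}\left\{\frac{1}{n}\sum_{i=1}^{n} y_i - (1 - P)\lambda_2\right\}\right] = \frac{1}{P^2 n^2}\sum_{i=1}^{n} V(y_i) \\
&= \frac{1}{P^2 n^2}\sum_{i=1}^{n} \lambda_0 = \frac{\lambda_0}{nP^2} = \frac{P\lambda_1 + (1 - P)\lambda_2}{nP^2} \\
&= \frac{\lambda_1}{nP} + \frac{(1 - P)\lambda_2}{nP^2}.
\end{aligned}$$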

Theorem 2.3 An unbiased estimator of the variance of the estimator λ̂1 is

$$\hat{v}(\hat{\lambda}_1) = \frac{1}{n^2 P^2}\sum_{i=1}^{n} y_i. \tag{9}$$

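Proof Taking the expected value on both sides of Equation (9), we have

$$E[\hat{v}(\hat{\lambda}_1)] = \frac{1}{n^2 P^2} E\left(\sum_{i=1}^{n} y_i\right) = \frac{1}{n^2 P^2}\sum_{i=1}^{n} \lambda_0 = \frac{\lambda_0}{nP^2} = \frac{P\lambda_1 + (1 - P)\lambda_2}{nP^2},$$

which proves the theorem.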
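The estimator in Equation (7), its variance in Equation (8) and the variance estimator in Equation (9) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the small data vector are hypothetical, and λ2 is assumed known, as in this section.

```python
def lambda1_hat(y, p, lam2):
    """Equation (7): estimate lambda_1 from responses y when lambda_2 is known."""
    n = len(y)
    return (sum(y) / n - (1 - p) * lam2) / p


def var_lambda1_hat(lam1, lam2, p, n):
    """Equation (8): theoretical variance of the estimator."""
    return lam1 / (n * p) + (1 - p) * lam2 / (n * p ** 2)


def var_estimate(y, p):
    """Equation (9): unbiased estimator of the variance."""
    n = len(y)
    return sum(y) / (n ** 2 * p ** 2)


# Hypothetical responses, used only to show the calls.
y = [2, 1, 0, 3]
print("lambda_1 estimate:", lambda1_hat(y, p=0.8, lam2=0.5))
print("estimated variance:", var_estimate(y, p=0.8))
print("variance for lambda_1 = 1.5:", var_lambda1_hat(1.5, 0.5, p=0.8, n=len(y)))
```

Note that Equation (8) is simply λ0/(nP²) split into its λ1 and λ2 parts, so a smaller P inflates the variance through the factor 1/P².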

2.1. Relative efficiency

The per cent relative efficiency of the proposed estimator λ̂1 with respect to the direct question
method based estimator (equivalently where P = 1) reduces to

$$RE = \frac{\lambda_1 P^2}{P\lambda_1 + (1 - P)\lambda_2} \times 100\%. \tag{10}$$

From Equation (10), it is clear that the relative efficiency of the proposed estimator is free from
the sample size n. It is also clear that if the values of the two parameters λ1 and λ2 are equal, then the
relative efficiency is a function of the randomization device parameter P only. If P = 1, the relative
efficiency attains its maximum of 100%, but the choice of P = 1 is not practicable. To look at
the magnitude of the relative efficiency, we chose three different pairs of (λ1 : λ2) as (1.5:1.5),
(1.5:0.5) and (0.5:1.5), and we changed the values of P from 0.60 to 0.90 with a step of 0.0001.
For the choice of (λ1 : λ2) as (1.5:0.5), the relative efficiency remains a bit higher than in the other
two cases, which indicates that it is good to use, as the rare unrelated attribute A2, one with a
mean value less than that of the rare sensitive attribute A1, without affecting the cooperation of the
respondents in using the proposed randomization device such as the spinner shown in Figure 1.
The results based on the proposed method are presented in Figure 2. The relative efficiency of the
proposed estimator could be retained from 60% to 80% while choosing the value of P from 0.70
to 0.85 at the cost of protection of the respondents’ privacy. The choice of P should be made such
that the respondents do not feel that their privacy is threatened.

Figure 1. Spinner useful for rare attributes. The spinner points to the rare sensitive attribute A1 with probability P and to the rare unrelated attribute A2 with probability (1 − P).

Figure 2. Relative efficiency for different choices of λ1 and λ2. Relative efficiency (%) is plotted against P for λ1 = λ2 = 1.5; λ1 = 1.5, λ2 = 0.5; and λ1 = 0.5, λ2 = 1.5.
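To make the behaviour of Equation (10) concrete, the short Python sketch below tabulates the relative efficiency for the three (λ1 : λ2) pairs considered above. The function name and the coarse grid of P values are choices made for this illustration; the paper itself uses a much finer grid.

```python
def relative_efficiency(lam1, lam2, p):
    """Equation (10): per cent relative efficiency versus direct questioning."""
    return 100.0 * lam1 * p ** 2 / (p * lam1 + (1 - p) * lam2)


# The three (lambda_1, lambda_2) pairs discussed in the text.
pairs = [(1.5, 1.5), (1.5, 0.5), (0.5, 1.5)]
p_values = [0.60, 0.70, 0.80, 0.90]  # coarse illustrative grid

for lam1, lam2 in pairs:
    row = [round(relative_efficiency(lam1, lam2, p), 1) for p in p_values]
    print(f"lambda1={lam1}, lambda2={lam2}:", row)
```

When λ1 = λ2 the expression collapses to 100 P², which is why the efficiency then depends on P alone.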
The main problem with the method proposed in Section 2 is that the mean
value of the rare unrelated attribute is sometimes unknown. In the next section, we suggest a method
that is free from this limitation.

3. Proposed estimator when proportion of the rare unrelated attribute is unknown

In this method, each respondent in the sample of n persons, selected using SRSWR from the given
population, is requested to rotate two spinners one after the other. Each respondent in the sample
is requested to use spinner-I with the statements:
(a) Do you possess the rare sensitive attribute A1 ?
and
(b) Do you possess the rare unrelated attribute A2 ?
with probabilities P and (1 − P ) respectively.
Next, the respondent is requested to use spinner-II with the statements:
(a) Do you possess the rare sensitive attribute A1 ?
and
(b) Do you possess the rare unrelated attribute A2 ?
with probabilities T and (1 − T), respectively. Spinners I and II are shown in Figure 3. We feel
that the cost of the survey will not be affected much if each respondent is requested to rotate two
spinners rather than one. The use of spinners could also be replaced with two known unrelated
characteristics as discussed earlier if one anticipates doing a large-scale survey through e-mails
or telephone surveys.

Figure 3. Two spinners useful for rare attributes. Spinner-I points to the rare sensitive attribute A1 with probability P and to the rare unrelated attribute A2 with probability (1 − P); Spinner-II points to A1 with probability T and to A2 with probability (1 − T).
As before, the probabilities of a ‘yes’ answer in the use of spinners I and II are given,
respectively, by

θ1 = Pπ1 + (1 − P)π2 and θ2 = Tπ1 + (1 − T)π2.

We assume that as n → ∞, θ1 → 0 and θ2 → 0, nθ1 = λ1* (say, finite) and nθ2 = λ2* (say,
finite). By following Section 2, we have

$$P\hat{\lambda}_1 + (1 - P)\hat{\lambda}_2 = \frac{1}{n}\sum_{i=1}^{n} y_{1i}, \tag{11}$$

and

$$T\hat{\lambda}_1 + (1 - T)\hat{\lambda}_2 = \frac{1}{n}\sum_{i=1}^{n} y_{2i}, \tag{12}$$

where y1i and y2i denote the observed values in the first and the second response from the ith
respondent, respectively. Solving Equations (11) and (12) for λ̂1, we have:

Theorem 3.1 An unbiased estimator of the parameter λ1 for the rare sensitive attribute A1 is
given by

$$\hat{\lambda}_1 = \frac{1}{n(P - T)}\left[(1 - T)\sum_{i=1}^{n} y_{1i} - (1 - P)\sum_{i=1}^{n} y_{2i}\right], \tag{13}$$

with T ≠ P.

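Proof Since y1i ∼ P(λ1*) and y2i ∼ P(λ2*), by taking expected values on both sides of
Equation (13), we have

$$\begin{aligned}
E(\hat{\lambda}_1) &= \frac{1}{n(P - T)}\left[(1 - T)\sum_{i=1}^{n} E(y_{1i}) - (1 - P)\sum_{i=1}^{n} E(y_{2i})\right] \\
&= \frac{1}{n(P - T)}\left[(1 - T)\sum_{i=1}^{n} \lambda_1^* - (1 - P)\sum_{i=1}^{n} \lambda_2^*\right] \\
&= \frac{1}{(P - T)}\left[(1 - T)\{P\lambda_1 + (1 - P)\lambda_2\} - (1 - P)\{T\lambda_1 + (1 - T)\lambda_2\}\right] \\
&= \frac{1}{(P - T)}\left[(1 - T)P\lambda_1 + (1 - T)(1 - P)\lambda_2 - (1 - P)T\lambda_1 - (1 - P)(1 - T)\lambda_2\right] \\
&= \lambda_1,
\end{aligned}$$

which proves the theorem.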

Theorem 3.2 The variance of the unbiased estimator of the parameter λ1 is given by

$$\begin{aligned}
V(\hat{\lambda}_1) = \frac{1}{n(P - T)^2}\big[&\{P(1 - T)^2 + T(1 - P)^2 - 2PT(1 - P)(1 - T)\}\lambda_1 \\
&+ \{(1 - P)(1 - T)(2 - P - T) - 2(1 - P)^2(1 - T)^2\}\lambda_2\big].
\end{aligned} \tag{14}$$

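Proof Since y1i ∼ P(λ1*) and y2i ∼ P(λ2*), we have V(y1i) = λ1* and V(y2i) = λ2*. Note that the two
responses are not independent; thus we have

$$\begin{aligned}
V(\hat{\lambda}_1) &= \frac{1}{n^2(P - T)^2}\left[(1 - T)^2\sum_{i=1}^{n} V(y_{1i}) + (1 - P)^2\sum_{i=1}^{n} V(y_{2i}) - 2(1 - T)(1 - P)\sum_{i=1}^{n} \mathrm{Cov}(y_{1i}, y_{2i})\right] \\
&= \frac{1}{n^2(P - T)^2}\left[(1 - T)^2\sum_{i=1}^{n} \lambda_1^* + (1 - P)^2\sum_{i=1}^{n} \lambda_2^* - 2(1 - T)(1 - P)\sum_{i=1}^{n} \lambda_{12}^*\right],
\end{aligned} \tag{15}$$

where

$$\lambda_1^* = V(y_{1i}) = E(y_{1i}) = P\lambda_1 + (1 - P)\lambda_2, \tag{16}$$

$$\lambda_2^* = V(y_{2i}) = E(y_{2i}) = T\lambda_1 + (1 - T)\lambda_2, \tag{17}$$

and

$$\begin{aligned}
\lambda_{12}^* &= \mathrm{Cov}(y_{1i}, y_{2i}) = E(y_{1i} y_{2i}) - E(y_{1i})E(y_{2i}) \\
&= PT(\lambda_1^2 + \lambda_1) + (1 - P)(1 - T)(\lambda_2^2 + \lambda_2) + P(1 - T)\lambda_1\lambda_2 + (1 - P)T\lambda_1\lambda_2 \\
&\quad - [P\lambda_1 + (1 - P)\lambda_2][T\lambda_1 + (1 - T)\lambda_2] \\
&= PT\lambda_1 + (1 - P)(1 - T)\lambda_2.
\end{aligned} \tag{18}$$

On substituting Equations (16), (17) and (18) into Equation (15), we have the theorem.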

Corollary 3.1 An unbiased estimator to estimate the parameter λ2 for the rare unrelated
attribute A2 is given by

$$\hat{\lambda}_2 = \frac{1}{n(T - P)}\left[T\sum_{i=1}^{n} y_{1i} - P\sum_{i=1}^{n} y_{2i}\right], \tag{19}$$

with approximate variance:

$$\begin{aligned}
V(\hat{\lambda}_2) = \frac{1}{n(T - P)^2}\big[&\{TP(T + P) - 2P^2T^2\}\lambda_1 \\
&+ \{T^2(1 - P) + P^2(1 - T) - 2PT(1 - P)(1 - T)\}\lambda_2\big].
\end{aligned} \tag{20}$$

Proof Analogous to the proofs of Theorems 3.1 and 3.2. 

Corollary 3.2 An unbiased estimator of the variance of the estimator λ̂1 is given by

$$\begin{aligned}
\hat{v}(\hat{\lambda}_1) = \frac{1}{n(P - T)^2}\big[&\{P(1 - T)^2 + T(1 - P)^2 - 2PT(1 - P)(1 - T)\}\hat{\lambda}_1 \\
&+ \{(1 - P)(1 - T)(2 - P - T) - 2(1 - P)^2(1 - T)^2\}\hat{\lambda}_2\big],
\end{aligned} \tag{21}$$

and an unbiased estimator of the variance of the estimator λ̂2 is given by

$$\begin{aligned}
\hat{v}(\hat{\lambda}_2) = \frac{1}{n(T - P)^2}\big[&\{TP(T + P) - 2P^2T^2\}\hat{\lambda}_1 \\
&+ \{T^2(1 - P) + P^2(1 - T) - 2PT(1 - P)(1 - T)\}\hat{\lambda}_2\big],
\end{aligned} \tag{22}$$

where λ̂1 and λ̂2 are defined in Equations (13) and (19), respectively.
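The two-spinner estimators in Equations (13) and (19), together with the variance expressions in Equations (14) and (20), can be sketched in Python as follows. Plugging the estimates into the variance functions gives the variance estimators of Equations (21) and (22). The function names and the response vectors are hypothetical and serve only to illustrate the formulas.

```python
def two_spinner_estimates(y1, y2, p, t):
    """Equations (13) and (19): estimates of lambda_1 and lambda_2."""
    if p == t:
        raise ValueError("the estimators require T != P")
    n = len(y1)
    s1, s2 = sum(y1), sum(y2)
    lam1_hat = ((1 - t) * s1 - (1 - p) * s2) / (n * (p - t))
    lam2_hat = (t * s1 - p * s2) / (n * (t - p))
    return lam1_hat, lam2_hat


def var_lambda1(lam1, lam2, p, t, n):
    """Equation (14); with estimates plugged in it gives Equation (21)."""
    a = p * (1 - t) ** 2 + t * (1 - p) ** 2 - 2 * p * t * (1 - p) * (1 - t)
    b = (1 - p) * (1 - t) * (2 - p - t) - 2 * (1 - p) ** 2 * (1 - t) ** 2
    return (a * lam1 + b * lam2) / (n * (p - t) ** 2)


def var_lambda2(lam1, lam2, p, t, n):
    """Equation (20); with estimates plugged in it gives Equation (22)."""
    a = t * p * (t + p) - 2 * p ** 2 * t ** 2
    b = t ** 2 * (1 - p) + p ** 2 * (1 - t) - 2 * p * t * (1 - p) * (1 - t)
    return (a * lam1 + b * lam2) / (n * (t - p) ** 2)


# Hypothetical first and second responses from ten respondents.
y1 = [1, 2, 1, 1, 2, 1, 1, 2, 1, 1]
y2 = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
p, t = 0.8, 0.2

lam1_hat, lam2_hat = two_spinner_estimates(y1, y2, p, t)
print("estimates:", lam1_hat, lam2_hat)
print("v(lambda1_hat):", var_lambda1(lam1_hat, lam2_hat, p, t, len(y1)))
print("v(lambda2_hat):", var_lambda2(lam1_hat, lam2_hat, p, t, len(y1)))
```

Both variance functions divide by (P − T)², so the estimates become unstable as the two spinner probabilities approach each other.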

3.1. Relative efficiency

The per cent relative efficiency of the proposed estimator λ̂1 with respect to the direct question
method based estimator (say, with P = 1 and T = 1 situation) is defined as

$$RE = \frac{\lambda_1 (P - T)^2 \times 100\%}{\{P(1 - T)^2 + T(1 - P)^2 - 2PT(1 - P)(1 - T)\}\lambda_1 + (1 - P)(1 - T)\{(2 - P - T) - 2(1 - P)(1 - T)\}\lambda_2}. \tag{23}$$

From Equation (23), it is clear that the relative efficiency of the proposed estimator is free from
the sample size n. It is also clear that if the values of the two parameters λ1 and λ2 are equal, then the
relative efficiency is a function of the randomization device parameters P and T only. If P = 1 and
T = 0, the relative efficiency attains the maximum of 100%, but the choice of P = 1 and T = 0
is not practicable. To investigate the magnitude of the relative efficiency, we chose three different
pairs of (λ1 : λ2) as (1.5:1.5), (1.5:0.5) and (0.5:1.5), and we let the value of P range from 0.60 to
0.90 with a step of 0.001 and let that of T range from 0.1 to 0.4 with a step of 0.001. For the choice
of (λ1 : λ2) as (1.5:0.5), the relative efficiency remains a bit higher than in the other two cases, which
suggests that it is good to use, as the rare unrelated attribute A2, one with a mean value less than
that of the rare sensitive attribute A1, without affecting the cooperation of the respondents while
using the proposed randomization device such as the spinners shown in Figure 3. The results based
on the proposed method are presented in the three graphs in Figure 4. The relative efficiency of
the proposed estimator could be constrained between 60% and 80% when choosing the value of
P between 0.70 and 0.85 and that of T between 0.15 and 0.30. The choice of P and T should
be made in such a way that the respondents do not feel that their privacy is threatened, while
the difference (P − T) should be kept as large as possible so that the variance of the proposed
estimator remains small.

Figure 4. Relative efficiency versus P and T.
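Equation (23) can be evaluated in the same way as Equation (10). The Python sketch below, with a hypothetical function name and a deliberately coarse grid of (P, T) values, shows how the relative efficiency changes with the two spinner probabilities for the three (λ1 : λ2) pairs.

```python
def relative_efficiency_two_spinners(lam1, lam2, p, t):
    """Equation (23): per cent relative efficiency of the two-spinner estimator."""
    a = p * (1 - t) ** 2 + t * (1 - p) ** 2 - 2 * p * t * (1 - p) * (1 - t)
    b = (1 - p) * (1 - t) * ((2 - p - t) - 2 * (1 - p) * (1 - t))
    return 100.0 * lam1 * (p - t) ** 2 / (a * lam1 + b * lam2)


pairs = [(1.5, 1.5), (1.5, 0.5), (0.5, 1.5)]
grid = [(0.70, 0.30), (0.80, 0.20), (0.85, 0.15)]  # coarse illustrative (P, T) grid

for lam1, lam2 in pairs:
    row = [round(relative_efficiency_two_spinners(lam1, lam2, p, t), 1) for p, t in grid]
    print(f"lambda1={lam1}, lambda2={lam2}:", row)
```

Setting P = 1 and T = 0 in this function returns 100, in agreement with the maximum stated in the text.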
Important remarks: (1) Following Greenberg et al.’s unrelated question model, it is also
possible to take two independent large samples to solve the problem of the unknown rare
unrelated attribute, but this may increase the cost of the survey too much because of the rarity of the attributes.
(2) Note that other types of randomized response models, such as those due to Warner [1], Kuk
[5] and Franklin [6], are not easily extendable to estimating the mean of a Poisson random
variable, because one cannot eliminate the factor (1 − π), which results in a very high probability
of getting ‘yes’ answers in the case of rare sensitive attributes; this may lead to an inconsistent
estimator of the mean of the Poisson random variable if one tries to use these models.

Acknowledgements
The authors are thankful to the Editor-in-Chief Professor O. Bunke and a learned referee for the valuable comments on
the original version of the manuscript.

References

[1] S.L. Warner, Randomized response: A survey technique for eliminating evasive answer bias, J. Amer. Statist. Assoc.
60 (1965), pp. 63–69.
[2] B.G. Greenberg, A.L.A. Abul-Ela, W.R. Simmons, and D.G. Horvitz, The unrelated question randomized response
model – theoretical framework, J. Amer. Statist. Assoc. 64 (1969), pp. 520–539.
[3] D.S. Tracy and N.S. Mangat, Some developments in randomized response sampling during the last decade – a follow
up of review by Chaudhuri and Mukerjee, J. Appl. Statist. Sci. 4(2/3) (1996), pp. 147–158.
[4] J.A. Fox and P.E. Tracy, Randomized Response: A Method for Sensitive Surveys, SAGE Publications, Beverly Hills,
CA, 1986.
[5] A.Y.C. Kuk, Asking sensitive questions indirectly, Biometrika 77 (1990), pp. 436–438.
[6] L.A. Franklin, A comparison of estimators for randomized response sampling with continuous distribution from a
dichotomous population, Comm. Statist. Theory Methods 18 (1989), pp. 489–505.
