
The VLDB Journal (2010) 19:385–410

DOI 10.1007/s00778-009-0170-1

REGULAR PAPER

Suppressing microdata to prevent classification based inference


Ayça Azgin Hintoglu · Yücel Saygın

Received: 24 April 2008 / Revised: 11 August 2009 / Accepted: 3 October 2009 / Published online: 19 November 2009
© Springer-Verlag 2009

Abstract  The revolution of the Internet together with the progression in computer technology makes it easy for institutions to collect an unprecedented amount of personal data. This pervasive data collection effort, coupled with the increasing necessity of disseminating and sharing non-aggregated data, i.e., microdata, has raised many concerns about privacy. One method to ensure privacy is to selectively hide the confidential, i.e. sensitive, information before disclosure. However, with data mining techniques, it is now possible for an adversary to predict the hidden confidential information from the disclosed data sets. In this paper, we concentrate on one such data mining technique called classification. We extend our previous work on microdata suppression to prevent both probabilistic and decision tree classification based inference. We also provide experimental results showing the effectiveness of not only the proposed methods but also the hybrid methods, i.e., methods suppressing microdata against both classification models, on real-life data sets.

Keywords  Privacy · Disclosure protection · Data suppression · Data perturbation · Data mining

This work was partially funded by the Information Society Technologies programme of the European Commission, Future and Emerging Technologies under the IST-6FP-014915 GeoPKDD project.

A. Azgin Hintoglu · Y. Saygın (B)
Sabancı University, Istanbul, Turkey
e-mail: ysaygin@sabanciuniv.edu

A. Azgin Hintoglu
e-mail: aycah@sabanciuniv.edu

1 Introduction

In tandem with the advances in networking and storage technologies, the private sector as well as the public sector have increased their efforts to gather, manipulate, and commodify information on a large scale. Non-governmental organizations collect large amounts of personal information about their customers or members for many reasons, including better customer relationship management and high-level decision making. Public safety, on the other hand, is the major motivation for large-scale personal information collection efforts initiated by governmental organizations. This pervasive data-harvesting effort, coupled with the increasing need to share the data with other institutions or with the public, has raised concerns about privacy. Privacy is the ability of an individual to prevent information about himself from becoming known to other people without his approval [1]. More specifically, it is the right of individuals to have control over the data they provide. This includes controlling (1) how the data are going to be used, (2) who is going to use them, and (3) for what purpose.

Widespread usage of powerful data analysis tools and data mining techniques, enabling institutions to extract previously unknown and strategically useful information from huge collections of data sets and thus gain competitive advantages, has also contributed to the fears about privacy. Data mining techniques can be used for many purposes, including but not limited to national security warning and decision making [2] for government agencies, and providing better business intelligence and customer relationship management for enterprises. On the other hand, they can also be used by adversaries to infer hidden confidential information about individuals from the disclosed data sets, and thus pose a great threat to privacy. The security and privacy threats due to the use of data mining techniques were first pointed out


Table 1 Academic Health medical records

ID  Name   Zipcode  Gender  Age  Indigestion  Chest pain  Palpitation  Diagnosis
1   Alice  90302    Female  29   Y            N           Y            Dyspepsia
2   Bob    90410    Male    22   N            Y           Y            Angina Pectoris
3   John   90301    Male    27   Y            N           N            Dyspepsia
4   Lisa   90310    Female  43   Y            N           N            Gastritis
5   Chris  90301    Male    52   N            Y           Y            Gastritis
6   Leo    90410    Male    47   Y            Y           Y            Angina Pectoris
7   Prue   90305    Female  30   N            N           Y            Angina Pectoris
8   Joe    90402    Male    36   N            Y           Y            Angina Pectoris
9   Ross   90301    Male    52   Y            Y           Y            Gastritis

Table 2 Academic Health medical records shared with Academic Research Institute

Zipcode  Gender  Age  Indigestion  Chest pain  Palpitation  Diagnosis
90302    Female  29   Y            N           Y            Dyspepsia
90410    Male    22   N            Y           Y            ?
90301    Male    27   Y            N           N            Dyspepsia
90310    Female  43   Y            N           N            Gastritis
90301    Male    52   N            Y           Y            Gastritis
90410    Male    47   Y            Y           Y            Angina Pectoris
90305    Female  30   N            N           Y            Angina Pectoris
90402    Male    36   N            Y           Y            Angina Pectoris
90301    Male    52   Y            Y           Y            Gastritis

by O'Leary [3] and were discussed further in a symposium on knowledge discovery in databases and personal privacy [4–7]. Since then, privacy issues have become one of the most important aspects of database and data mining research.

Example 1 Consider an on-line federation of hospitals and research organizations collaborating with each other, named HealthFed. Each federated hospital collects medical records of its patients together with their privacy preferences, and interacts with research organizations within the federation to share this information. In particular, assume that the city clinic Academic Health and Academic Research Institute, both being part of the HealthFed federation, collaborate with each other for research purposes. More specifically, Academic Health shares patients' medical records with Academic Research Institute after ensuring that the privacy preferences of each patient are preserved. Table 1 shows such patients who gave consent to Academic Health to disclose their medical records to third parties for research purposes provided that their ID and name attributes are removed before disclosure. However, Bob, knowing that it might still be possible to link his medical records with other data sources through potentially identifying attributes like gender, zipcode, and age, required that not only his name but also his diagnosis information be hidden before disclosure. Therefore, Academic Health removed not only the ID and name attributes but also the diagnosis information from Bob's medical records before sharing it, as shown in Table 2. Unfortunately, given these medical records, Academic Research Institute can easily find Bob's diagnosis to be Angina Pectoris using a predictive data mining technique called classification.

In this paper, we address this particular problem of privacy preserving microdata disclosure. We assume that each individual might have different preferences regarding their privacy. Therefore, the confidential attributes might differ for each individual. In such a setting, one method to ensure privacy while disclosing a microdata set is to selectively hide¹ the confidential data values. This method ensures privacy from a micro-level perspective. But this is not the case from the macro-level perspective, as with powerful data analysis tools and data mining techniques it is now possible for an adversary to predict hidden confidential information using the rest of the disclosed data set. We concentrate on one such possible threat, namely classification, a data mining technique widely used for prediction purposes.

¹ Deleting or replacing with a symbol denoting unknown.


We extend our previous work [8] on microdata suppression (1) to prevent not only probabilistic but also decision tree classification based inference, and (2) to handle not only single but also multiple confidential data value suppression in order to reduce the side-effects. We achieve this by finding a set of data values that might cause confidential information to be disclosed from the released microdata set indirectly, i.e., through classification-based inference, and replacing them with a symbol indicating suppression.

Much research has taken place in the area of disclosure protection in order to preserve privacy. One such popular disclosure protection approach is cell suppression [9,10]. The basic idea of cell suppression is to protect sensitive information in aggregated data sets, i.e., statistical tables. The problems addressed by cell suppression and microdata suppression are quite similar in principle. Nevertheless, the methodologies used are completely different, as the characteristics of the data sets they are trying to protect are different. In statistical tables, inference results from the marginal totals given along with the aggregated data itself. In a microdata set, on the other hand, it results from the statistical correlations between non-aggregated attributes.

Another method, specifically proposed to ensure privacy of disclosed microdata sets, is anonymization. Anonymization aims at modifying microdata sets before disclosure such that the identities of individuals cannot be recognized using other microdata sets that are available to the public. Various approaches [11–13,15,16], employing generalization and suppression on potentially identifying portions of the data sets, were proposed to address the anonymization problem. However, all of them have the inherent problem of assuming that the set of potentially identifying attributes is known in advance. This assumption is a strong one considering that we live in an internetworked society in which institutions are increasingly required to make their data electronically available. For example, in the United States, all governmental records, except those covered by a specified set of exceptions including The Privacy Act of 1974, are freely available for public access according to the Freedom of Information Act. Examples of such records include birth, death, and marriage records, drivers' records, real estate ownership records, court records, etc. [17,18]. With diverse sources of information available online, mostly unanonymized, it is hard to determine the potentially identifying attributes with 100% confidence. Moreover, it has been shown in various works that the proposed approaches to solve the anonymization problem fail to provide anonymity, and hence do not protect privacy [19–21]. Finally, and most importantly, approaches addressing anonymization assume that each individual having a record in the microdata set has the same privacy preferences, which is far from being realistic.

The remainder of this paper is organized as follows: Sect. 2 presents the probabilistic and decision tree classifiers, the Microdata Suppression problem, the modification schemes, and the evaluation metrics. The details of the suppression algorithms are described in Sect. 3. Section 4 provides a discussion on the effectiveness of the proposed suppression algorithms. Section 5 reports on our empirical results. The related work on data suppression and privacy is discussed in Sect. 6. Finally, Sect. 7 concludes the paper and outlines future research directions of this study.

2 Preliminaries

2.1 Problem formulation

Let Λ = {α_1, α_2, ..., α_n} be the set of attributes with associated domains² V_{α_1}, V_{α_2}, ..., V_{α_n}, and extended domains³ eV_{α_1}, eV_{α_2}, ..., eV_{α_n}, respectively. Let D = {d_1, d_2, ..., d_m} be the microdata set where each tuple d_i ∈ eV_{α_1} × eV_{α_2} × ... × eV_{α_n} is an ordered list of values.

For each attribute α_j ∈ Λ, there is a mapping α_j[d_i] : eV_{α_1} × eV_{α_2} × ... × eV_{α_n} → eV_{α_j}. The mapping α_j[d_i] represents the value of attribute α_j of microdata tuple d_i. Similarly, for each microdata set D, there is a mapping D[constraint] : D → S ∈ 2^D from D = {d_1, d_2, ..., d_m} into S ∈ 2^D. The mapping D[constraint] represents the set of all tuples satisfying the constraint, expressed in conjunctive normal form, on attribute values. Examples of valid constraint expressions include the following:

– α_1[d] = val_1,
– ¬ α_1[d] = val_1,
– α_i[d] = val_i ∧ ¬ α_j[d] = val_j, and
– α_1[d] = val_1 ∧ α_2[d] = val_2 ∧ ... ∧ α_n[d] = val_n.

Definition 2.1 (Classifiers (Σ)) Σ denotes the set of all classifiers that aim to predict the value of a single attribute, i.e., the target attribute⁴ α_τ, in terms of the predictor attributes.⁵

Each classifier ς ∈ Σ is defined in the context of a training data set and a target attribute. For example, a classifier of type Naïve Bayesian built using the data set D with α_τ ∈ Λ as the target attribute is denoted as ς_nb^{D,α_τ}. If the type of the classifier, the training data set, or the target attribute is unknown or not relevant in a given context, then the special symbol ⊥ is used instead of the respective symbol. For example, ς_⊥^{D,α_τ} denotes the set of all classifiers built using the data set D with α_τ ∈ Λ as the target attribute.

² The domain of an attribute is represented by a finite set of discrete values excluding the unknown (i.e. null) value denoted by ν.
³ The extended domain of an attribute is represented by a finite set of discrete values including the unknown (i.e. null) value denoted by ν, such that eV_{α_j} = V_{α_j} ∪ {ν}.
⁴ Also called the class attribute or the dependent attribute.
⁵ Also called the independent attributes.
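To make the D[constraint] selection mapping and the unknown value ν concrete, the following is a minimal Python sketch (not from the paper; the dictionary-based tuple representation, the None encoding of ν, and the function name select are illustrative assumptions).

```python
# Minimal sketch of the D[constraint] mapping of Sect. 2.1 (illustrative, not the authors' code).
# A microdata set D is a list of tuples, each tuple a dict mapping attribute names to values;
# the unknown value nu is encoded as None.

NU = None  # the unknown (null) value, denoted nu in the paper

D = [
    {"Zipcode": "90302", "Gender": "Female", "Indigestion": "Y", "Diagnosis": "Dyspepsia"},
    {"Zipcode": "90410", "Gender": "Male",   "Indigestion": "N", "Diagnosis": NU},
    {"Zipcode": "90301", "Gender": "Male",   "Indigestion": "Y", "Diagnosis": "Dyspepsia"},
]

def select(data, constraint):
    """Return the tuples of `data` satisfying `constraint`, i.e., D[constraint]."""
    return [d for d in data if constraint(d)]

# Example: D[Gender[d] = Male AND NOT Diagnosis[d] = Dyspepsia]
matches = select(D, lambda d: d["Gender"] == "Male" and not d["Diagnosis"] == "Dyspepsia")
print(len(matches))  # -> 1 (the tuple whose diagnosis is hidden)
```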


Following the training phase, each classifier ς ∈ Σ can be viewed as a function that takes a microdata tuple and predicts the most probable value of the target attribute based on the other attributes' values. For example, if ς ∈ ς_⊥^{⊥,α_τ}, then ς : (eV_{α_1} × eV_{α_2} × ... × eV_{α_n}) → V_{α_τ}.

Definition 2.2 (Naïve Bayesian Classifier (Σ_nb)) Let the jth attribute value of tuple d_i, that is α_j[d_i], be unknown. According to Bayes' theorem the probability that α_j[d_i] has value v ∈ V_{α_j} is equal to the posterior probability of v conditioned on d_i and is given by

  p(v|d_i) = p(v) p(d_i|v) / p(d_i)    (1)

where p(v) and p(d_i) are the prior probabilities of v and d_i, respectively, and p(d_i|v) is the posterior probability of d_i conditioned on v. The Naïve Bayesian classifier is a probabilistic classifier based on Bayes' theorem with the class conditional independence assumption, that is, the effect of an attribute value on another attribute (i.e. the class attribute) is independent of the values of the remaining attributes. Due to the class conditional independence assumption, we can rewrite the posterior probability p(d_i|v) as follows:

  p(d_i|v) = Π_{k=1}^{j−1} p(α_k[d_i]|v) · Π_{k=j+1}^{n} p(α_k[d_i]|v)    (2)

The Naïve Bayesian classifier ς_nb^{D−d_i,α_j} built using D − d_i as the training data set will predict the most probable value for α_j[d_i] as v_π ∈ V_{α_j} if and only if the following condition holds:

  p(v_π|d_i) > p(v|d_i) | ∀v ∈ V_{α_j} − v_π    (3)

Since p(d_i) is the same for all v ∈ V_{α_j}, it can be ignored as shown below:

  p(v_π) p(d_i|v_π) > p(v) p(d_i|v) | ∀v ∈ V_{α_j} − v_π    (4)
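A minimal Python sketch of this Naïve Bayesian prediction (Eqs. (1)–(4)) over the dictionary-based representation used above; it is illustrative only, and the handling of ν (skipping unknown values when counting) is an assumption rather than a detail fixed by the paper.

```python
from collections import Counter

NU = None  # unknown value

def nb_posteriors(train, tuple_d, target, attrs):
    """Unnormalized Naive Bayes posteriors p(v) * prod_k p(alpha_k[d_i]|v) for every v (Eqs. 1-4).
    Tuples with an unknown target value and unknown predictor values are skipped (assumption)."""
    labeled = [d for d in train if d[target] is not NU]
    class_counts = Counter(d[target] for d in labeled)
    total = len(labeled)
    scores = {}
    for v, cnt in class_counts.items():
        score = cnt / total  # prior p(v)
        rows_v = [d for d in labeled if d[target] == v]
        for a in attrs:
            if a == target or tuple_d[a] is NU:
                continue  # the target itself and unknown predictor values contribute no factor
            match = sum(1 for d in rows_v if d[a] == tuple_d[a])
            score *= match / cnt  # class-conditional likelihood p(alpha[d_i] | v)
        scores[v] = score
    return scores

def nb_predict(train, tuple_d, target, attrs):
    """Return the most probable target value v_pi (Eq. 3)."""
    scores = nb_posteriors(train, tuple_d, target, attrs)
    return max(scores, key=scores.get)
```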
Definition 2.3 (ID3 Classifier (Σ_id3)) Let α_j[d_i] be unknown. The ID3 classifier ς_id3^{D−d_i,α_j} built using D − d_i as the training data set is a decision tree where each internal node represents a decision node, each branch represents an outcome of the decision, and each leaf node represents a possible value v ∈ V_{α_j} for α_j[d_i]. Such a classifier will predict the most probable value for α_j[d_i] as v_π ∈ V_{α_j} if and only if the test of the remaining attributes of d_i against the decision tree leads a path from the root node to a leaf node labeled with v_π.

Definition 2.4 (Suppressing a Confidential Data Item) Let D′ be the microdata set after applying a set of modifications to D. The confidential data value α_j[d_i] will be suppressed with respect to D′, if and only if there exists no classifier that can correctly predict the confidential data value.

  ς(d_i) ≠ α_j[d_i]  ∀ς ∈ ς_⊥^{D′−d_i,α_j}    (5)

In this work, we relax the above statement such that there exists no Naïve Bayesian or ID3 classifier that can correctly predict the confidential data value.

  ς(d_i) ≠ α_j[d_i]  ∀ς ∈ ς_nb^{D′−d_i,α_j} ∪ ς_id3^{D′−d_i,α_j}    (6)

2.2 Modification strategies for microdata suppression

There are two possible modification strategies that can be adopted to address the microdata suppression problem.

Modification Strategy 1. Deleting an Attribute Value. This modification scheme, also referred to as hiding, involves replacement of attribute values, including the confidential data values, with a special symbol denoting the unknown (i.e. null) value ν. Replacing attribute values with ν introduces uncertainty into the microdata set. For example, in the simplest case of a binary attribute, an unknown value can be either 0 or 1. Assuming that the value was 0 will contribute to the resulting classification model in a contradicting way compared to the assumption that it was 1. By carefully selecting the tuples and attributes to replace with an unknown value, we can decrease the precision of the classification models which can be built to predict the confidential data values. The details of how to select the values to replace with ν are provided below in the form of downgrade strategies.

Modification Strategy 2. Generalizing an Attribute Value. This modification scheme involves generalization of attribute values, including the confidential data values, using a concept hierarchy or an ontology.

Within the scope of this study, we propose suppression algorithms utilizing the deletion scheme, and leave the generalization scheme as part of our future work.

2.3 Downgrade strategies for microdata suppression

There are two possible downgrade strategies that can be adopted to address the Microdata Suppression Problem.

Downgrade Strategy 1. Classification Model Downgrade. Let D be the original microdata set with the confidential data value α_j[d_i]. Classification model downgrade aims at transforming the original microdata set D into D′ that satisfies the following constraints:

i. α_j[d′_i] = ν,
ii. ∀α ∈ Λ − α_j, α[d′_i] = α[d_i],
iii. D − d_i ≠ D′ − d′_i, iff ∃ς ∈ ς_⊥^{D−d_i,α_j} : ς(d_i) = α_j[d_i],
iv. ∀ς ∈ ς_⊥^{D′−d′_i,α_j} : ς(d′_i) ≠ α_j[d_i].

This scheme aims at degrading all classification models ς ∈ ς_⊥^{D′−d′_i,α_j} by modifying the tuples d ∈ D − d_i.


Downgrade Strategy 2. Microdata Tuple Downgrade. Let D be the original microdata set with the confidential data value α_j[d_i]. Microdata tuple downgrade aims at transforming the original microdata set D into D′ that satisfies the following constraints:

i. α_j[d′_i] = ν,
ii. ∃α ∈ Λ − α_j, α[d′_i] ≠ α[d_i], iff ∃ς ∈ ς_⊥^{D−d_i,α_j} : ς(d_i) = α_j[d_i],
iii. D − d_i = D′ − d′_i,
iv. ς_⊥^{D′−d′_i,α_j} = ς_⊥^{D−d_i,α_j}, and
v. ∀ς ∈ ς_⊥^{D−d_i,α_j} : ς(d′_i) ≠ α_j[d_i].

Unlike the classification model downgrade, this scheme degrades only the microdata tuple containing the confidential data value d_i, such that the classification models ς ∈ ς_⊥^{D−d_i,α_j} cannot correctly predict the confidential data value α_j[d_i].

Within the scope of this study, we propose two suppression algorithms for downgrading the classification model and two suppression algorithms for downgrading the microdata tuple to prevent probabilistic and decision tree classification based inference.

2.4 Evaluation measures

The two important issues in microdata suppression are (1) minimization of information loss, enabling further use of the resulting microdata set, and (2) maximization of uncertainty, enabling protection of confidential data values from classification based inference. In the following, seven metrics for measuring the information loss and the uncertainty incurred by the suppression process are introduced.

2.4.1 Information loss metrics

In this work, three different metrics are used to measure the information loss: the Direct Distance, the Sum of Kullback Leibler Distances, and the Average Change in Mutual Information. The Direct Distance, the simplest of all information loss metrics, basically counts the number of attribute values hidden during the suppression process.

Definition 2.5 (Direct Distance) Let D and D′ be the original and modified microdata sets, respectively. The direct distance between D and D′ can be defined as the number of non-matching attribute values.

  DD(D, D′) = Σ_{i=1}^{m} Σ_{j=1}^{n} dist_{ij}    (7)

where dist_{ij} = 0 if α_j[d_i] = α_j[d′_i], and dist_{ij} = 1 otherwise.

The second information loss metric, utilized within the scope of this work, is the Sum of Kullback Leibler Distances. This metric measures the information loss in terms of the distance between the first-order probability distributions of the original and the modified microdata sets.

Definition 2.6 (Kullback Leibler Distance) Let D and D′ be the original and modified microdata sets, respectively. Let α ∈ Λ be an attribute with probability distribution p_α in D and p′_α in D′. The Kullback Leibler distance between D and D′ in terms of attribute α can be defined as the distance between the first-order probability distributions of α in D and D′.

  KLD(D, D′) = D(p_α || p′_α) = Σ_{v∈V_α} p_α(v) log ( p_α(v) / p′_α(v) )    (8)

Definition 2.7 (Sum of Kullback Leibler Distances) Let D and D′ be the original and modified microdata sets, respectively. The sum of Kullback Leibler distances between D and D′ over all attributes α ∈ Λ can be defined as follows:

  SKLD(D, D′) = Σ_{α∈Λ} D(p_α || p′_α)    (9)

The last information loss metric used is the Average Change in Mutual Information. This metric measures the information loss by finding the average change in the joint probability distributions of all attributes.

Definition 2.8 (Mutual Information) Let α_k ∈ Λ and α_l ∈ Λ be two attributes of the microdata set D with probability distributions p_{α_k} and p_{α_l}, respectively, and joint probability distribution p_{α_k,α_l}. The mutual information between α_k and α_l in D, measuring their mutual dependence, can be defined as follows.

  I_D(α_k; α_l) = D(p_{α_k,α_l} || p_{α_k} p_{α_l}) = Σ_{v_k∈V_{α_k}} Σ_{v_l∈V_{α_l}} p_{α_k,α_l}(v_k, v_l) log ( p_{α_k,α_l}(v_k, v_l) / (p_{α_k}(v_k) p_{α_l}(v_l)) )    (10)

Definition 2.9 (Average Change in Mutual Information) Let D and D′ be the original and modified microdata sets, respectively. The average change in mutual information over all attribute pairs can be defined as follows:

  ACMI(D, D′) = ( 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} I_D(α_i; α_j) / I_{D′}(α_i; α_j) ) / ( n(n − 1) )    (11)

2.4.2 Uncertainty metrics

In this work, a single metric, the Sum of Conditional Entropies, is used to measure the uncertainty introduced into the modified data set. Higher uncertainty implies better privacy.


Fig. 1 Microdata suppression algorithms

Definition 2.10 (Conditional Entropy) Let D and D′ be the original and modified microdata sets, respectively, and α ∈ Λ be an attribute. Let X_α^D on V_α be a random variable with instances α[d_1], α[d_2], ..., α[d_m] and probability distribution p_α. Let X_α^{D′} on V_α be a random variable with instances α[d′_1], α[d′_2], ..., α[d′_m] and probability distribution p′_α. The conditional entropy of X_α^D given X_α^{D′} can be defined as follows:

  H(X_α^D | X_α^{D′}) = − Σ_{v∈V_α} Σ_{v′∈V_α} p(v, v′) log( p(v|v′) )    (12)

Definition 2.11 (Sum of Conditional Entropies) Let D and D′ be the original and modified microdata sets, respectively. The sum of conditional entropies of D given D′ can be defined as follows:

  SCE(D, D′) = Σ_{α∈Λ} H(X_α^D | X_α^{D′})    (13)

The detailed descriptions of the information theoretic metrics introduced in this section can be found in [22].
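A minimal Python sketch of two of these metrics, the Direct Distance (Eq. 7) and the Sum of Kullback Leibler Distances (Eq. 9), over the dictionary-based representation used earlier; the natural logarithm and the skipping of zero-probability terms are assumptions not fixed by the paper.

```python
import math
from collections import Counter

def direct_distance(D, D_mod, attrs):
    """Eq. (7): number of attribute values that differ between D and the modified D'."""
    return sum(1 for d, d2 in zip(D, D_mod) for a in attrs if d[a] != d2[a])

def first_order_dist(D, attr):
    """First-order probability distribution p_alpha of one attribute."""
    counts = Counter(d[attr] for d in D)
    total = len(D)
    return {v: c / total for v, c in counts.items()}

def sum_kl_distances(D, D_mod, attrs):
    """Eq. (9): sum over attributes of D(p_alpha || p'_alpha); zero-probability terms are skipped."""
    total = 0.0
    for a in attrs:
        p, q = first_order_dist(D, a), first_order_dist(D_mod, a)
        total += sum(pv * math.log(pv / q[v])
                     for v, pv in p.items() if pv > 0 and q.get(v, 0) > 0)
    return total
```

Under these assumptions, hiding k values changes direct_distance by exactly k, while sum_kl_distances grows with how much the per-attribute value frequencies shift.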
3 Our approaches for suppressing microdata

As pointed out in Sect. 2, hiding a confidential data value alone may not be enough to protect it, in case the whole data set is going to be disclosed. This results from the fact that an adversary can build a classification model using the rest of the data set as the training data set and could use it to predict the actual confidential data value. In order to avoid such attacks, we propose four algorithms, each suppressing one confidential data value at a time, against two popular classifier types: probabilistic and decision tree classifiers, as shown in Fig. 1. We select Naïve Bayesian and ID3 as typical representatives of probabilistic and decision tree classifiers, respectively, and develop our heuristics accordingly. Moreover, we propose enhancements to two of the proposed algorithms to suppress multiple confidential data values at a time in order to reduce the side effects.

3.1 Suppression against probabilistic classification models

In the following, we present three algorithms for preventing probabilistic classification-based inference. The proposed algorithms aim to suppress a confidential data value such that it is no longer in the Top-1 Probable value set.

Definition 3.1 (Top-k Probable) Let α_j[d_i] be confidential and thus be replaced by ν. The Naïve Bayesian classifier ς_nb^{D−d_i,α_j} built using D−d_i as the training data set will predict the Top-k Probable value set for α_j[d_i] as Ω_k^{α_j[d_i]} ⊆ V_{α_j}. The Top-k Probable value set satisfies the following constraints:

i. Its size is equal to k.

  |Ω_k^{α_j[d_i]}| = k    (14)

ii. The probability of α_j[d_i] being equal to the least probable value in the Top-k Probable value set is greater than the probability of α_j[d_i] being equal to the most probable value among the remaining attribute values.

  p(ω|d_i) > p(v|d_i) | ∀v ∈ V_{α_j} − Ω_k^{α_j[d_i]} ∧ ω ∈ Ω_k^{α_j[d_i]}    (15)

The proposed suppression algorithms aim at either reducing p(α_j[d_i]|d_i) below that of a randomly selected attribute value, called the Random Next Best Guess, among the Top-k Probable⁶ value set, or increasing the probability of a set of selected attribute values, called the Next Best Guess Set, above p(α_j[d_i]|d_i).

Definition 3.2 (Random Next Best Guess) The random next best guess, v_rnbg ∈ V_{α_j}, is a randomly selected value from V_{α_j} satisfying the following conditions:

i. It is different from α_j[d_i].

  v_rnbg ≠ α_j[d_i]    (16)

⁶ The effect of changing k is further discussed in Sect. 4.


ii. It is among the Top-k Probable value set.

  v_rnbg ∈ Ω_k^{α_j[d_i]}    (17)

iii. The probability of the α_j-th attribute of d_i being equal to v_rnbg is smaller than that of the confidential data value α_j[d_i] and greater than zero.

  p(α_j[d_i]|d_i) > p(v_rnbg|d_i) > 0    (18)

3.1.1 DECP algorithm

The DECP algorithm aims at suppressing the confidential data value α_j[d_i] so that it cannot be correctly predicted by the downgraded classification model ς_nb^{D′−d_i,α_j}. It accomplishes its goal by decreasing the probability p(α_j[d_i]|d_i) below that of the random next best guess v_rnbg.

Definition 3.3 (Maximum Impact Attribute) The attribute with maximum impact on p(α_j[d_i]|d_i), denoted by α_MI^{α_j[d_i]}, is the one that satisfies the following conditions:

  α_MI^{α_j[d_i]} = arg min_{α∈Λ} |D[α_j[d] = α_j[d_i] ∧ α[d] = α[d_i]] − d_i|
  ∧ |D[α_j[d] = α_j[d_i] ∧ α[d] = α[d_i]] − d_i| > 1    (19)

Definition 3.4 (Maximum Impact Data Values) The maximum impact data values are the instances of α_MI^{α_j[d_i]} in the tuples d ∈ D[α_j[d] = α_j[d_i] ∧ α_MI^{α_j[d_i]}[d] = α_MI^{α_j[d_i]}[d_i]], excluding d_i.

In each iteration, the DECP algorithm identifies the maximum impact attribute α_MI^{α_j[d_i]} and modifies the tuples d, such that d ∈ D[α_j[d] = α_j[d_i] ∧ α_MI^{α_j[d_i]}[d] = α_MI^{α_j[d_i]}[d_i]] − d_i, by replacing α_MI^{α_j[d_i]}[d] with ν until the goal is achieved, that is, until p(α_j[d_i]|d_i) becomes less than p(v_rnbg|d_i). Each such replacement results in the maximum possible reduction in p(α_j[d_i]|d_i), thus requiring a smaller number of modifications.

Theorem 3.1 Let α_MI^{α_j[d_i]} be the maximum impact attribute satisfying Eq. (19). Then, every replacement of a maximum impact data value with ν causes the maximum decrease in p(α_j[d_i]|d_i), thus resulting in fewer data values to be modified.

Proof Let us first find the effect of replacing a maximum impact data value with ν on p(α_j[d_i]) p(d_i|α_j[d_i]). Remember that, since p(d_i) is the same for all v ∈ V_{α_j}, it can be ignored when calculating p(α_j[d_i]|d_i).

  p(α_j[d_i]|d_i) = p(α_j[d_i]) p(d_i|α_j[d_i]) / p(d_i)
                 ∝ p(α_j[d_i]) p(d_i|α_j[d_i])
                 = p(α_j[d_i]) · p(α_MI^{α_j[d_i]}[d_i] | α_j[d_i]) · Π_{α ∈ Λ − {α_j, α_MI^{α_j[d_i]}}} p(α[d_i] | α_j[d_i])

Let us assume that

– the size of the microdata set D[α_j[d] = α_j[d_i] ∧ α_MI^{α_j[d_i]}[d] = α_MI^{α_j[d_i]}[d_i]] − d_i is F_{α_j[d_i], α_MI}, and
– the size of the microdata set D[α_j[d] = α_j[d_i]] − d_i is F_{α_j[d_i]}.

A single replacement of a maximum impact data value causes p(α_MI^{α_j[d_i]}[d_i] | α_j[d_i]) to decrease from F_{α_j[d_i], α_MI} / F_{α_j[d_i]} to (F_{α_j[d_i], α_MI} − 1) / F_{α_j[d_i]}. This, in turn, decreases p(d_i|α_j[d_i]) by a factor of (F_{α_j[d_i], α_MI} − 1) / F_{α_j[d_i], α_MI}, as shown below.

  p′(d_i|α_j[d_i]) = p′(α_MI^{α_j[d_i]}[d_i] | α_j[d_i]) · Π_{α ∈ Λ − {α_j, α_MI^{α_j[d_i]}}} p′(α[d_i] | α_j[d_i])
                  = ( (F_{α_j[d_i], α_MI} − 1) / F_{α_j[d_i]} ) · Π_{α ∈ Λ − {α_j, α_MI^{α_j[d_i]}}} p(α[d_i] | α_j[d_i])
                  = ( (F_{α_j[d_i], α_MI} − 1) / F_{α_j[d_i], α_MI} ) · ( F_{α_j[d_i], α_MI} / F_{α_j[d_i]} ) · Π_{α ∈ Λ − {α_j, α_MI^{α_j[d_i]}}} p(α[d_i] | α_j[d_i])
                  = ( (F_{α_j[d_i], α_MI} − 1) / F_{α_j[d_i], α_MI} ) · p(α_MI^{α_j[d_i]}[d_i] | α_j[d_i]) · Π_{α ∈ Λ − {α_j, α_MI^{α_j[d_i]}}} p(α[d_i] | α_j[d_i])
                  = ( (F_{α_j[d_i], α_MI} − 1) / F_{α_j[d_i], α_MI} ) · p(d_i|α_j[d_i])

Now let us assume that there is another attribute α_k which decreases p(α_j[d_i]|d_i) more than α_MI^{α_j[d_i]} does. This implies

the following:

  (F_{α_j[d_i], α_k} − 1) / F_{α_j[d_i], α_k} < (F_{α_j[d_i], α_MI} − 1) / F_{α_j[d_i], α_MI}
  (F_{α_j[d_i], α_k} − 1) F_{α_j[d_i], α_MI} < (F_{α_j[d_i], α_MI} − 1) F_{α_j[d_i], α_k}
  F_{α_j[d_i], α_k} < F_{α_j[d_i], α_MI}

which contradicts the definition of the Maximum Impact Attribute. So, we can conclude that every replacement of a maximum impact data value with ν causes the highest decrease in p(α_j[d_i]|d_i), which in turn implies that the number of data values that should be modified is minimal. □

The algorithm works as follows: Let α_j[d_i] be confidential. As the first step, the algorithm verifies the need for suppression. It finds p(v|d_i) for all v ∈ V_{α_j} and checks the truth value of the following assertion:

  p(α_j[d_i]|d_i) > p(v|d_i) | ∀v ∈ V_{α_j} − α_j[d_i]    (20)

If Assertion (20) is true, it picks a random next best guess v_rnbg from V_{α_j}. Next, in each iteration it finds the maximum impact attribute α_MI^{α_j[d_i]} and replaces the maximum impact data values by ν as long as p(α_j[d_i]|d_i) > p(v_rnbg|d_i). After processing all maximum impact attributes, it re-checks the truth value of Assertion (20). If Assertion (20) is still true, it reverts all changes and deletes the tuple d_i from the microdata set. An overview of the algorithm is depicted in Fig. 2a.

If |V_{α_j}| = 2, then suppressing the confidential data value might result in an adversary guessing it correctly with 100% confidence. Therefore, the decision to suppress a confidential data value is randomized for the case where |V_{α_j}| = 2. This results in an adversary guessing the actual confidential data value with 50% confidence, which is the maximum uncertainty that can be achieved under such circumstances.
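The DECP loop can be sketched in a few lines of Python on top of the hypothetical nb_posteriors helper introduced in Sect. 2; the function and variable names are illustrative assumptions, and the sketch omits the revert-and-delete fallback and the |V_αj| = 2 randomization.

```python
def decp(D, i, target, attrs, rnbg):
    """Sketch of DECP: repeatedly hide one maximum impact data value until the confidential
    value of tuple D[i] is no longer more probable than the random next best guess `rnbg`."""
    secret = D[i][target]
    D[i][target] = NU                      # the confidential value itself is always hidden
    train = D[:i] + D[i+1:]                # D - d_i (same dict objects, so edits are visible in D)
    while True:
        post = nb_posteriors(train, D[i], target, attrs)
        if post.get(secret, 0.0) < post.get(rnbg, 0.0):
            return True                     # goal reached
        # maximum impact attribute: minimum co-occurrence count with the secret value (Eq. 19),
        # required to be greater than 1
        counts = {a: sum(1 for d in train if d[target] == secret and d[a] == D[i][a])
                  for a in attrs if a != target and D[i][a] is not NU}
        counts = {a: c for a, c in counts.items() if c > 1}
        if not counts:
            return False                    # DECP would revert its changes and delete d_i
        a_mi = min(counts, key=counts.get)
        victim = next(d for d in train
                      if d[target] == secret and d[a_mi] == D[i][a_mi])
        victim[a_mi] = NU                   # hide one maximum impact data value
```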
Lemma 3.1 Let α_j[d_i] be the confidential data value, n be the number of attributes, and N be the number of tuples in D[α_j[d] = α_j[d_i]] − d_i. Then, the upper bound for the number of data values that can be modified by the DECP algorithm is equal to (n − 1)(N − 1).

Proof The proof of this statement is straightforward. The DECP algorithm modifies the maximum impact data values from the tuples d ∈ D[α_j[d] = α_j[d_i] ∧ α_MI^{α_j[d_i]}[d] = α_MI^{α_j[d_i]}[d_i]] − d_i. As D[α_j[d] = α_j[d_i] ∧ α_MI^{α_j[d_i]}[d] = α_MI^{α_j[d_i]}[d_i]] ⊆ D[α_j[d] = α_j[d_i]], the number of tuples that can be modified for each maximum impact attribute is bounded by N − 1. At each iteration, the DECP algorithm picks a different maximum impact attribute and replaces the instances of this attribute with ν. Since there are n − 1 different alternatives for a maximum impact attribute, we can conclude that the DECP algorithm can replace at most (n − 1)(N − 1) data values with ν for suppressing a confidential data value. □

Example 2 Now, let us illustrate how the DECP algorithm suppresses Bob's confidential diagnosis.

Step 1 Initially, the Naïve Bayesian classification model is constructed to find the probabilities p(v|d_i) for all v ∈ V_{α_j} = {dyspepsia, angina pectoris, gastritis}. The Naïve Bayesian classification model constructed using the medical records of Table 2 is shown in Table 3. According to the model, the probabilities are p(dyspepsia|d_2) = 0, p(angina pectoris|d_2) = 1/6, and p(gastritis|d_2) = 1/18.

Step 2 The probability p(angina pectoris|d_2) is greater than both p(dyspepsia|d_2) and p(gastritis|d_2). As Bob's diagnosis can be correctly predicted, the suppression process starts.

Step 3 Let us assume that gastritis is selected as the random next best guess. From this point on, the DECP algorithm will try to decrease p(angina pectoris|d_2) below p(gastritis|d_2).

Step 4 To select the maximum impact attribute, the counts for each symptom attribute are found as follows:

– count_indigestion = |D[diagnosis[d] = diagnosis[d_2] ∧ indigestion[d] = indigestion[d_2]]| = 2
– count_chest pain = |D[diagnosis[d] = diagnosis[d_2] ∧ chest pain[d] = chest pain[d_2]]| = 2
– count_palpitation = |D[diagnosis[d] = diagnosis[d_2] ∧ palpitation[d] = palpitation[d_2]]| = 3

Since they have the same minimum count, both indigestion and chest pain can be the maximum impact attribute. Let us assume that indigestion is selected as the maximum impact attribute.

Step 5 All tuples d satisfying the constraint indigestion[d] = N ∧ diagnosis[d] = angina pectoris are found. Tuples 7 and 8 satisfy the aforementioned constraint.

Step 6 The indigestion attribute is hidden from tuple 7. With this replacement, p(angina pectoris|d_2) decreases by 1/2 to 1/12. As p(angina pectoris|d_2) is still greater than p(gastritis|d_2), the suppression process continues with the next maximum impact attribute, which is chest pain.

Step 7 All tuples d satisfying the constraint chest pain[d] = Y ∧ diagnosis[d] = angina pectoris are found. Tuples 6 and 8 satisfy the aforementioned constraint.

Step 8 The chest pain attribute is hidden from tuple 6. With this replacement, p(angina pectoris|d_2) decreases by 1/2 to 1/24. As p(angina pectoris|d_2) is smaller than p(gastritis|d_2), the suppression process stops. The resulting microdata can be seen in Table 4.


Fig. 2 Pseudocode of DECP and INCP algorithms. a Pseudocode of DECP algorithm. b Pseudocode of INCP algorithm


Table 3 Naïve Bayesian classification model constructed using the medical records shown in Table 2

Diagnosis        p(Diagnosis)  p(Symptom|Diagnosis)                    p(Diagnosis|d_2)
                               Indigestion  Chest pain  Palpitation
Dyspepsia        2/8           0            0           1/2            0
Gastritis        3/8           1/3          2/3         2/3            1/18
Angina pectoris  3/8           2/3          2/3         1              1/6
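As a sanity check under the assumptions of the hypothetical nb_posteriors sketch from Sect. 2, the entries of Table 3 can be reproduced directly; the records list below is simply Table 2 re-typed with the symptom and diagnosis attributes.

```python
NU = None  # as in the earlier sketch; nb_posteriors is assumed to be in scope

records = [
    {"Indigestion": "Y", "Chest pain": "N", "Palpitation": "Y", "Diagnosis": "Dyspepsia"},
    {"Indigestion": "N", "Chest pain": "Y", "Palpitation": "Y", "Diagnosis": NU},   # Bob (d_2)
    {"Indigestion": "Y", "Chest pain": "N", "Palpitation": "N", "Diagnosis": "Dyspepsia"},
    {"Indigestion": "Y", "Chest pain": "N", "Palpitation": "N", "Diagnosis": "Gastritis"},
    {"Indigestion": "N", "Chest pain": "Y", "Palpitation": "Y", "Diagnosis": "Gastritis"},
    {"Indigestion": "Y", "Chest pain": "Y", "Palpitation": "Y", "Diagnosis": "Angina Pectoris"},
    {"Indigestion": "N", "Chest pain": "N", "Palpitation": "Y", "Diagnosis": "Angina Pectoris"},
    {"Indigestion": "N", "Chest pain": "Y", "Palpitation": "Y", "Diagnosis": "Angina Pectoris"},
    {"Indigestion": "Y", "Chest pain": "Y", "Palpitation": "Y", "Diagnosis": "Gastritis"},
]
attrs = ["Indigestion", "Chest pain", "Palpitation", "Diagnosis"]
bob = records[1]
train = records[:1] + records[2:]   # D - d_2
print(nb_posteriors(train, bob, "Diagnosis", attrs))
# -> roughly {'Dyspepsia': 0.0, 'Gastritis': 0.0555..., 'Angina Pectoris': 0.1666...}
#    i.e. 0, 1/18 and 1/6, matching the last column of Table 3
```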

Table 4 Academic Health medical records after DECP execution

Zipcode  Gender  Age  Indigestion  Chest pain  Palpitation  Diagnosis
90302    Female  29   Y            N           Y            Dyspepsia
90410    Male    22   N            Y           Y            ?
90301    Male    27   Y            N           N            Dyspepsia
90310    Female  43   Y            N           N            Gastritis
90301    Male    52   N            Y           Y            Gastritis
90410    Male    47   Y            ?           Y            Angina Pectoris
90305    Female  30   ?            N           Y            Angina Pectoris
90402    Male    36   N            Y           Y            Angina Pectoris
90301    Male    52   Y            Y           Y            Gastritis

3.1.2 INCP algorithm

The INCP algorithm aims at suppressing the confidential data value α_j[d_i] so that it cannot be correctly predicted by the downgraded classification model ς_nb^{D′−d_i,α_j}. It accomplishes its goal, as its name implies, by increasing the probabilities p(v|d_i) for all v in the next best guess set, S_nbg, above p(α_j[d_i]|d_i).

Definition 3.5 (Next Best Guess Set) The next best guess set, S_nbg ⊆ V_{α_j}, for microdata tuple d_i is the set of all attribute values v ∈ V_{α_j} − α_j[d_i] satisfying the following condition:

  S_nbg = { v | v ∈ V_{α_j} − α_j[d_i] ∧ p(v|d_i) ≥ p(v_rnbg|d_i) }    (21)

For each v ∈ S_nbg, the INCP algorithm identifies the tuples d ∈ D[α_j[d] = v] having no common attribute value with d_i and modifies them by replacing α_j[d] with ν in order to increase p(v|d_i).

The algorithm works as follows: Let α_j[d_i] be confidential. As the first step, the algorithm verifies the need for suppression. It finds p(v|d_i) for all v ∈ V_{α_j} and checks the truth value of Assertion (20). If Assertion (20) is true, it picks a random next best guess v_rnbg from V_{α_j} and forms S_nbg by finding the attribute values v ∈ V_{α_j} satisfying p(v|d_i) ≥ p(v_rnbg|d_i). Next, for each v ∈ S_nbg, the algorithm finds the tuples d ∈ D[¬ α_1[d] = α_1[d_i] ∧ ... ∧ ¬ α_{j−1}[d] = α_{j−1}[d_i] ∧ α_j[d] = v ∧ ¬ α_{j+1}[d] = α_{j+1}[d_i] ∧ ... ∧ ¬ α_n[d] = α_n[d_i]] and modifies them by replacing α_j[d] with ν until the goal is achieved, that is, until p(α_j[d_i]|d_i) becomes less than or equal to p(v|d_i). After processing all attribute values v ∈ S_nbg, it re-checks the truth value of Assertion (20). If Assertion (20) is still true, then the DECP algorithm is executed to complete the suppression. An overview of the algorithm is depicted in Fig. 2b.

Lemma 3.2 Let α_j[d_i] be the confidential data value, m be the number of tuples in D, and N be the number of tuples in D[α_j[d] = α_j[d_i]] − d_i. Assuming that there is a sufficient number of tuples that can be used for the suppression process (i.e. no need for executing DECP), the upper bound for the number of data values that can be modified by the INCP algorithm is equal to m − N − 1 − |S_nbg|.

Proof The proof of this statement is straightforward. The INCP algorithm modifies the tuples d ∈ D[¬ α_1[d] = α_1[d_i] ∧ ... ∧ ¬ α_{j−1}[d] = α_{j−1}[d_i] ∧ α_j[d] = v ∧ ¬ α_{j+1}[d] = α_{j+1}[d_i] ∧ ... ∧ ¬ α_n[d] = α_n[d_i]] for each v ∈ S_nbg. In the worst case, S_nbg contains all possible values of attribute α_j except α_j[d_i]. This implies Σ_{v∈S_nbg} |D[α_j[d] = v]| = m − N − 1. Moreover, due to the definitions of the next best guess set and the random next best guess, the probability p(v|d_i) for each v ∈ S_nbg must be greater than zero. This implies that, in the worst case, there exists at least one tuple which has the same data values as d_i (except α_j) for each v ∈ S_nbg. So, we can conclude that the INCP algorithm can replace at most m − N − 1 − |S_nbg| data values with ν for suppressing a confidential data value. □
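A corresponding Python sketch of the INCP step for a single value v of the next best guess set, again building on the hypothetical nb_posteriors helper (illustrative names; the DECP fallback and the construction of S_nbg are omitted).

```python
def incp_for_value(D, i, target, attrs, v):
    """Sketch of one INCP pass: hide the class label of tuples labeled `v` that share no
    attribute value with tuple D[i], until v becomes at least as probable as the hidden value."""
    secret = D[i][target]
    D[i][target] = NU
    train = D[:i] + D[i+1:]
    others = [a for a in attrs if a != target]
    donors = [d for d in train
              if d[target] == v and all(d[a] != D[i][a] for a in others)]
    for d in donors:
        post = nb_posteriors(train, D[i], target, attrs)
        if post.get(secret, 0.0) <= post.get(v, 0.0):
            break                     # goal reached for this value of the next best guess set
        d[target] = NU                # hiding the label of a non-matching tuple raises p(v|d_i)
    post = nb_posteriors(train, D[i], target, attrs)
    return post.get(secret, 0.0) <= post.get(v, 0.0)
```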


Table 5 Academic Health medical records after INCP execution

Zipcode  Gender  Age  Indigestion  Chest pain  Palpitation  Diagnosis
90302    Female  29   Y            N           Y            Dyspepsia
90410    Male    22   N            Y           Y            ?
90301    Male    27   Y            N           N            Dyspepsia
90310    Female  43   Y            N           N            ?
90301    Male    52   N            Y           Y            Gastritis
90410    Male    47   Y            Y           Y            Angina Pectoris
90305    Female  30   ?            N           Y            Angina Pectoris
90402    Male    36   N            Y           Y            Angina Pectoris
90301    Male    52   Y            Y           Y            Gastritis

Example 3 Now, let us illustrate how the INCP algorithm suppresses Bob's confidential diagnosis.

Step 1 Initially, the Naïve Bayesian classification model is constructed to find the probabilities p(v|d_i) for all v ∈ V_{α_j} = {dyspepsia, angina pectoris, gastritis}. The Naïve Bayesian classification model constructed using the medical records of Table 2 is shown in Table 3. According to the model, the probabilities are p(dyspepsia|d_2) = 0, p(angina pectoris|d_2) = 1/6, and p(gastritis|d_2) = 1/18.

Step 2 The probability p(angina pectoris|d_2) is greater than both p(dyspepsia|d_2) and p(gastritis|d_2). As Bob's diagnosis can be correctly predicted, the suppression process starts.

Step 3 Let us assume that gastritis is selected as the random next best guess. From this point on, the INCP algorithm will try to increase p(gastritis|d_2) above p(angina pectoris|d_2).

Step 4 All tuples d which have no common symptoms with Bob among D[diagnosis[d] = gastritis] are found. Tuple 4 satisfies the aforementioned constraint.

Step 5 The diagnosis attribute is hidden from tuple 4. After this replacement, p(gastritis|d_2) increases to 1/7, and p(angina pectoris|d_2) increases to 4/21. As p(angina pectoris|d_2) is still greater than p(gastritis|d_2), the suppression process continues.

Step 6 Since there are no more tuples which have no common symptoms with Bob among D[diagnosis[d] = gastritis], the suppression process continues with the DECP execution.

Step 7 To select the maximum impact attribute, the counts for each symptom attribute are found as follows:

– count_indigestion = |D[diagnosis[d] = diagnosis[d_2] ∧ indigestion[d] = indigestion[d_2]]| = 2
– count_chest pain = |D[diagnosis[d] = diagnosis[d_2] ∧ chest pain[d] = chest pain[d_2]]| = 2
– count_palpitation = |D[diagnosis[d] = diagnosis[d_2] ∧ palpitation[d] = palpitation[d_2]]| = 3

Since they have the same minimum count, both indigestion and chest pain can be the maximum impact attribute. Let us assume that indigestion is selected as the maximum impact attribute.

Step 8 All tuples d satisfying the constraint indigestion[d] = N ∧ diagnosis[d] = angina pectoris are found. Tuples 7 and 8 satisfy the aforementioned constraint.

Step 9 The indigestion attribute is hidden from tuple 7. With this replacement, p(angina pectoris|d_2) decreases by 1/2 to 2/21. As p(angina pectoris|d_2) is smaller than p(gastritis|d_2), the suppression process stops. The resulting microdata can be seen in Table 5.

3.1.3 DROPP algorithm

The DROPP algorithm aims at suppressing the confidential data value α_j[d_i] so that it cannot be correctly predicted by the classification model ς_nb^{D−d_i,α_j}. It achieves this by dropping the probability p(α_j[d_i]|d_i) below that of the random next best guess v_rnbg. Unlike the DECP and INCP algorithms, it accomplishes its goal by downgrading the tuple d_i, instead of downgrading the classification model ς_nb^{D−d_i,α_j}.

The algorithm employs the following modified definition of the Maximum Impact Attribute:

Definition 3.6 (Maximum Impact Attribute′) The attribute with maximum impact on p(α_j[d_i]|d_i), denoted by α_MI′^{α_j[d_i]}, is the one that satisfies the following conditions:

  α_MI′^{α_j[d_i]} = arg max_{α∈Λ} ( |D[α_j[d] = α_j[d_i] ∧ α[d] = α[d_i]] − d_i| / |D[α_j[d] = v_rnbg ∧ α[d] = α[d_i]]| )
  ∧ p(α_MI′^{α_j[d_i]}[d_i] | α_j[d_i]) > p(α_MI′^{α_j[d_i]}[d_i] | v_rnbg)    (22)
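A minimal Python sketch of the DROPP iteration built around the ratio of Eq. (22), again using the hypothetical nb_posteriors helper and illustrative names; the revert-and-delete fallback for an unsuccessful run is omitted, and treating a tie as success (as in Example 4, where a tie already defeats Assertion (20)) is an assumption.

```python
def dropp(D, i, target, attrs, rnbg):
    """Sketch of DROPP: hide values of tuple D[i] itself, choosing at each step the attribute
    with the largest co-occurrence ratio of Eq. (22), until the hidden value is no longer
    more probable than the random next best guess `rnbg`."""
    secret = D[i][target]
    D[i][target] = NU
    train = D[:i] + D[i+1:]
    while True:
        post = nb_posteriors(train, D[i], target, attrs)
        if post.get(secret, 0.0) <= post.get(rnbg, 0.0):
            return True                 # a tie already defeats Assertion (20)
        n_secret = sum(1 for d in train if d[target] == secret)
        n_rnbg = sum(1 for d in train if d[target] == rnbg)
        ratios = {}
        for a in attrs:
            if a == target or D[i][a] is NU:
                continue
            with_secret = sum(1 for d in train if d[target] == secret and d[a] == D[i][a])
            with_rnbg = sum(1 for d in train if d[target] == rnbg and d[a] == D[i][a])
            if with_secret == 0:
                continue
            if with_rnbg == 0:
                ratios[a] = float("inf")   # hiding this value can only help the next best guess
            elif with_secret * n_rnbg > with_rnbg * n_secret:   # second condition of Eq. (22)
                ratios[a] = with_secret / with_rnbg
        if not ratios:
            return False                # DROPP would revert its changes and delete d_i
        a_mi = max(ratios, key=ratios.get)
        D[i][a_mi] = NU                 # hide the maximum impact data value' from d_i itself
```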


Definition 3.7 (Maximum Impact Data Value′) The maximum impact data value′ is the instance of the maximum impact attribute′ α_MI′^{α_j[d_i]} in tuple d_i.

It must be noted that the maximum impact data values′ have a higher probability of occurrence in the tuples d ∈ D[α_j[d] = α_j[d_i]] − d_i than in the tuples d ∈ D[α_j[d] = v_rnbg]. Therefore, they are the key to decreasing p(α_j[d_i]|d_i) below p(v_rnbg|d_i).

In each iteration, the DROPP algorithm identifies α_MI′^{α_j[d_i]} and modifies the tuple d_i by replacing α_MI′^{α_j[d_i]}[d_i] with ν until the goal is achieved, that is, until p(α_j[d_i]|d_i) becomes less than p(v_rnbg|d_i). Each such replacement results in the maximum possible reduction in p(α_j[d_i]|d_i) / p(v_rnbg|d_i), thus requiring a smaller number of modifications.

Theorem 3.2 Let α_MI′^{α_j[d_i]} be the maximum impact attribute′ satisfying Eq. (22). Then, every replacement of a maximum impact data value′ with ν causes the maximum decrease in p(α_j[d_i]|d_i) / p(v_rnbg|d_i), thus resulting in fewer data values to be modified.

Proof Let us first find the effect of replacing a maximum impact data value′ with ν on p(α_j[d_i]|d_i) and p(v_rnbg|d_i). Remember that, since p(d_i) is the same for all v ∈ V_{α_j}, it can be ignored when calculating p(α_j[d_i]|d_i).

  p(α_j[d_i]|d_i) = p(α_j[d_i]) p(d_i|α_j[d_i]) / p(d_i)
                 ∝ p(α_j[d_i]) p(d_i|α_j[d_i])
                 ≅ p(α_j[d_i]) · p(α_MI′^{α_j[d_i]}[d_i] | α_j[d_i]) · Π_{α ∈ Λ − {α_j, α_MI′}} p(α[d_i] | α_j[d_i])

Similarly,

  p(v_rnbg|d_i) = p(v_rnbg) p(d_i|v_rnbg) / p(d_i)
               ∝ p(v_rnbg) p(d_i|v_rnbg)
               ≅ p(v_rnbg) · p(α_MI′^{α_j[d_i]}[d_i] | v_rnbg) · Π_{α ∈ Λ − {α_j, α_MI′}} p(α[d_i] | v_rnbg)

Let the size of the microdata set D[α_j[d] = α_j[d_i] ∧ α_MI′^{α_j[d_i]}[d] = α_MI′^{α_j[d_i]}[d_i]] − d_i be F_{α_j[d_i], α_MI′} and the size of the microdata set D[α_j[d] = α_j[d_i]] − d_i be F_{α_j[d_i]}. Let the size of the microdata set D[α_j[d] = v_rnbg ∧ α_MI′^{α_j[d_i]}[d] = α_MI′^{α_j[d_i]}[d_i]] be F_{v_rnbg, α_MI′} and the size of the microdata set D[α_j[d] = v_rnbg] be F_{v_rnbg}.

Replacement of the maximum impact data value′ causes p(α_j[d_i]|d_i) / p(v_rnbg|d_i) to decrease by a factor of (F_{v_rnbg} / F_{α_j[d_i]}) × (F_{α_j[d_i], α_MI′} / F_{v_rnbg, α_MI′}), as shown below.

  p′(α_j[d_i]|d_i) ≅ p′(α_j[d_i]) p′(d_i|α_j[d_i])
                  = p(α_j[d_i]) · Π_{α ∈ Λ − {α_j, α_MI′}} p(α[d_i] | α_j[d_i])
                  = p(α_j[d_i]|d_i) · F_{α_j[d_i]} / F_{α_j[d_i], α_MI′}

  p′(v_rnbg|d_i) ≅ p′(v_rnbg) p′(d_i|v_rnbg)
                = p(v_rnbg) · Π_{α ∈ Λ − {α_j, α_MI′}} p(α[d_i] | v_rnbg)
                = p(v_rnbg|d_i) · F_{v_rnbg} / F_{v_rnbg, α_MI′}

  p′(α_j[d_i]|d_i) / p′(v_rnbg|d_i) = ( p(α_j[d_i]|d_i) / p(v_rnbg|d_i) ) · ( F_{α_j[d_i]} / F_{α_j[d_i], α_MI′} ) · ( F_{v_rnbg, α_MI′} / F_{v_rnbg} )

  ( p(α_j[d_i]|d_i) / p(v_rnbg|d_i) ) / ( p′(α_j[d_i]|d_i) / p′(v_rnbg|d_i) ) = ( F_{v_rnbg} / F_{α_j[d_i]} ) × ( F_{α_j[d_i], α_MI′} / F_{v_rnbg, α_MI′} )

Now, let us assume that there is another attribute α_k which decreases p(α_j[d_i]|d_i) / p(v_rnbg|d_i) more than α_MI′^{α_j[d_i]} does. This implies the following:

  ( F_{v_rnbg} / F_{α_j[d_i]} ) × ( F_{α_j[d_i], α_k} / F_{v_rnbg, α_k} ) > ( F_{v_rnbg} / F_{α_j[d_i]} ) × ( F_{α_j[d_i], α_MI′} / F_{v_rnbg, α_MI′} )

  F_{α_j[d_i], α_k} / F_{v_rnbg, α_k} > F_{α_j[d_i], α_MI′} / F_{v_rnbg, α_MI′}

However, this contradicts the definition of the Maximum Impact Attribute′. So, we can conclude that every replacement of a maximum impact data value′ with ν causes the highest decrease in p(α_j[d_i]|d_i) / p(v_rnbg|d_i), which in turn implies that the number of data values that should be modified is minimal. □


Fig. 3 Pseudocode of DROPP algorithm

The algorithm works as follows: Let α_j[d_i] be confidential. As the first step, the algorithm verifies the need for suppression. It finds p(v|d_i) for all v ∈ V_{α_j} and checks the truth value of Assertion (20). If Assertion (20) is true, it picks a random next best guess v_rnbg from V_{α_j}. Next, in each iteration it finds the maximum impact attribute′ α_MI′^{α_j[d_i]} and replaces the maximum impact data value′ α_MI′^{α_j[d_i]}[d_i] with ν. After each iteration, it re-checks the truth value of Assertion (20) to decide whether to continue execution. If Assertion (20) is still true after all possible maximum impact attributes′ have been processed, it reverts all changes and deletes the tuple d_i from the microdata set. An overview of the algorithm is depicted in Fig. 3.

Lemma 3.3 Let α_j[d_i] be the confidential data value and n be the number of attributes. Then, the upper bound for the number of data values that can be modified by the DROPP algorithm is equal to n − 1.

Proof The proof of this statement is straightforward. The DROPP algorithm modifies just the tuple d_i, which has n − 1 data values excluding the confidential data value. So, we can conclude that the DROPP algorithm can replace at most n − 1 data values with ν for suppressing a confidential data value. □

Example 4 Now, let us illustrate how the DROPP algorithm suppresses Bob's confidential diagnosis.

Step 1 Initially, the Naïve Bayesian classification model is constructed to find the probabilities p(v|d_i) for all v ∈ V_{α_j} = {dyspepsia, angina pectoris, gastritis}. The Naïve Bayesian classification model constructed using the medical records of Table 2 is shown in Table 3. According to the model, the probabilities are p(dyspepsia|d_2) = 0, p(angina pectoris|d_2) = 1/6, and p(gastritis|d_2) = 1/18.

Step 2 The probability p(angina pectoris|d_2) is greater than both p(dyspepsia|d_2) and p(gastritis|d_2). As Bob's diagnosis can be correctly predicted, the suppression process starts.

Step 3 Let us assume that gastritis is selected as the random next best guess. From this point on, the DROPP algorithm will try to drop p(angina pectoris|d_2) below p(gastritis|d_2).

Table 6 Academic Health medical records after DROPP execution

Zipcode  Gender  Age  Indigestion  Chest pain  Palpitation  Diagnosis
90302    Female  29   Y            N           Y            Dyspepsia
90410    Male    22   ?            Y           ?            ?
90301    Male    27   Y            N           N            Dyspepsia
90310    Female  43   Y            N           N            Gastritis
90301    Male    52   N            Y           Y            Gastritis
90410    Male    47   Y            Y           Y            Angina Pectoris
90305    Female  30   N            N           Y            Angina Pectoris
90402    Male    36   N            Y           Y            Angina Pectoris
90301    Male    52   Y            Y           Y            Gastritis

Step 4 To select the maximum impact attribute′, the following counts and ratios are found:

– count_indigestion^{angina pectoris} = |D[diagnosis[d] = angina pectoris ∧ indigestion[d] = indigestion[d_2]]| = 2
– count_indigestion^{gastritis} = |D[diagnosis[d] = gastritis ∧ indigestion[d] = indigestion[d_2]]| = 1
– count_chest pain^{angina pectoris} = |D[diagnosis[d] = angina pectoris ∧ chest pain[d] = chest pain[d_2]]| = 2
– count_chest pain^{gastritis} = |D[diagnosis[d] = gastritis ∧ chest pain[d] = chest pain[d_2]]| = 2
– count_palpitation^{angina pectoris} = |D[diagnosis[d] = angina pectoris ∧ palpitation[d] = palpitation[d_2]]| = 3
– count_palpitation^{gastritis} = |D[diagnosis[d] = gastritis ∧ palpitation[d] = palpitation[d_2]]| = 2
– ratio_chest pain = 1
– ratio_indigestion = 2
– ratio_palpitation = 3/2

As indigestion has the maximum ratio, it is selected as the maximum impact attribute′.

Step 5 The indigestion attribute is hidden from tuple 2. With this replacement, p(angina pectoris|d_2) increases to 1/4, and p(gastritis|d_2) increases to 1/6. As p(angina pectoris|d_2) is still greater than p(gastritis|d_2), the suppression process continues with the next maximum impact attribute′, which is palpitation.

Step 6 The palpitation attribute is hidden from tuple 2. With this replacement, p(angina pectoris|d_2) remains the same, but p(gastritis|d_2) increases to 1/4. As p(angina pectoris|d_2) is equal to p(gastritis|d_2), the suppression process stops. The resulting microdata can be seen in Table 6.

3.2 Suppression against decision tree classification models

In the following, we present the HID3 algorithm for preventing decision tree classification based inference.

3.2.1 HID3 algorithm

The HID3 algorithm aims at suppressing the confidential data value α_j[d_i] so that the ID3 classification model ς_id3^{D−d_i,α_j} cannot correctly predict its actual value. Similar to the DROPP algorithm, it achieves its goal by downgrading the tuple d_i.

The algorithm works as follows: Let α_j[d_i] be confidential. As the first step, the algorithm builds the decision tree using D−d_i and verifies the need for suppression. If ς_id3^{D−d_i,α_j} can correctly predict the confidential data value, it calls the recursive ID3Hide function. The ID3Hide function first checks whether the root node is a leaf. If it is a leaf and its value is different from the confidential data value α_j[d_i], it returns true, which in turn terminates the recursive function successfully. Otherwise, it returns false. If the root node is not a leaf, then it finds the most probable value v_π ∈ V_{α_j} for α_j[d_i], and checks whether v_π is equal to α_j[d_i]. If the most probable value v_π is not equal to the actual confidential data value α_j[d_i], it returns true. Otherwise, it further explores the child nodes of the root in order to suppress α_j[d_i]. Let the decision attribute of the root node be α_root, the most common child of the root (i.e. the child with the highest training population) be child_MC, and the child containing α_root[d_i] be child_Match. If α_root[d_i] = ν or child_Match = child_MC, it tries to suppress the confidential data value using child_MC. Otherwise, it uses child_Match for suppression. After exploring all possible sub-branches, if the algorithm fails to suppress the confidential data value, it reverts all changes and deletes the tuple d_i from the microdata set. An overview of the algorithm is depicted in Fig. 4.
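A compact Python sketch of the recursive descent performed by ID3Hide over an already-built decision tree; the dict-based tree representation is an assumption made here for illustration (the paper gives the procedure only as pseudocode in Fig. 4), and rebuilding the tree after each hidden value is omitted.

```python
# Assumed decision-tree representation (illustrative):
#   leaf node:     {"leaf": True, "value": <class label>}
#   internal node: {"leaf": False, "attr": <attribute name>,
#                   "children": {<attr value>: subtree, ...},
#                   "majority": <attr value of the most populated child>}
# NU = None as in the earlier sketches.

def id3_hide(node, tuple_d, secret):
    """Walk the tree as ID3Hide does; hide values of tuple_d until the followed path no
    longer ends in a leaf labeled with the confidential value `secret`."""
    if node["leaf"]:
        return node["value"] != secret
    if predict(node, tuple_d) != secret:
        return True                                   # this subtree already mispredicts
    a, val = node["attr"], tuple_d[node["attr"]]
    if val is NU or val not in node["children"] or val == node["majority"]:
        branch = node["majority"]                     # use the most common child
    else:
        branch = val                                  # use the matching child
    if id3_hide(node["children"][branch], tuple_d, secret):
        return True
    tuple_d[a] = NU                                   # recursive call failed: hide this attribute
    return id3_hide(node["children"][node["majority"]], tuple_d, secret)

def predict(node, tuple_d):
    """Standard ID3 prediction: follow the branches of tuple_d's values down to a leaf."""
    while not node["leaf"]:
        val = tuple_d[node["attr"]]
        branch = val if val in node["children"] else node["majority"]
        node = node["children"][branch]
    return node["value"]
```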


Fig. 5 Decision tree constructed using the medical records shown in Table 2

Fig. 4 Pseudocode of HID3 algorithm

If |V_{α_j}| = 2, then suppressing the confidential data value might result in an adversary guessing it correctly with 100% confidence. Therefore, the decision to suppress a confidential data value is randomized for the case where |V_{α_j}| = 2. This results in an adversary guessing the actual confidential data value with 50% confidence, which is the maximum uncertainty that can be achieved under such circumstances.

Lemma 3.4 Let α_j[d_i] be the confidential data value and n be the number of attributes. Then, the upper bound for the number of data values that can be modified by the HID3 algorithm is equal to n − 1.

Proof The proof of this statement is straightforward. The HID3 algorithm modifies just the tuple d_i, which has n − 1 data values excluding the confidential data value. So, we can conclude that the HID3 algorithm can replace at most n − 1 data values with ν for suppressing a confidential data value. □

Example 5 For this specific example, let us assume Bob does not have any chest pain, and illustrate how the HID3 algorithm suppresses his confidential diagnosis.

Step 1 Initially, the ID3 classification model shown in Fig. 5 is constructed based on the medical records shown in Table 2.

Step 2 Starting from the root (node 1), the ID3Hide function checks whether it is possible to correctly predict Bob's diagnosis. Since Bob's diagnosis can be correctly predicted using the path chest pain = N ∧ indigestion = N, the suppression process starts.

Step 3 Using the whole microdata set, the ID3Hide function checks whether the majority of the tuples have chest pain = N. Since the number of tuples having chest pain = N is equal to the number of tuples having chest pain = Y, the function calls itself with root = node 3.

Step 4 Starting from the subtree root (node 3), the ID3Hide function checks whether it is possible to correctly predict Bob's diagnosis. Since Bob's diagnosis can be correctly predicted using the path indigestion = N, the suppression process continues.

Step 5 Using only the tuples with chest pain = N, the ID3Hide function checks whether the majority of the tuples have indigestion = N. Since the number of tuples having indigestion = N is smaller than the number of tuples having indigestion = Y, the function calls itself with root = node 5.

Step 6 As node 5 is a leaf, the ID3Hide function checks whether the most probable value, i.e., angina pectoris, and the confidential diagnosis are equal. As they are equal, the function returns from the recursive call signaling an unsuccessful run.

Step 7 As the recursive call to ID3Hide was unsuccessful, the current node's attribute, i.e., the indigestion attribute, is hidden from Bob's tuple. Next, the function calls itself with root = node 4, as the tuples with indigestion = Y constitute the majority among the tuples with chest pain = N.


Table 7 Academic Health medical records after HID3 execution

Zipcode  Gender  Age  Indigestion  Chest pain  Palpitation  Diagnosis
90302    Female  29   Y            N           Y            Dyspepsia
90410    Male    22   ?            Y           Y            ?
90301    Male    27   Y            N           N            Dyspepsia
90310    Female  43   Y            N           N            Gastritis
90301    Male    52   N            Y           Y            Gastritis
90410    Male    47   Y            Y           Y            Angina Pectoris
90305    Female  30   N            N           Y            Angina Pectoris
90402    Male    36   N            Y           Y            Angina Pectoris
90301    Male    52   Y            Y           Y            Gastritis

Step 8 Starting from the subtree root (node 4), the ID3Hide function checks whether it is possible to correctly predict Bob's diagnosis. Since Bob's diagnosis cannot be correctly predicted using the path palpitation = Y, the suppression process stops. The resulting microdata can be seen in Table 7.

3.3 Suppression of multiple confidential data values

In the following, we present the enhanced versions of the DECP and DROPP algorithms for preventing probabilistic classification based inference. The proposed algorithms aim to reduce the side-effects while suppressing multiple confidential data values.

3.3.1 e-DECP algorithm

The enhanced DECP algorithm aims at suppressing multiple confidential data values at a time so that none of them can be correctly predicted by the downgraded classification model ς_nb^{D′,α_j}. The proposed algorithm reduces the side-effects of the original DECP algorithm when (1) all confidential data values belong to a single attribute, and (2) all confidential data values have the same value. The generic case, which handles the suppression of multiple confidential values that belong to different attributes, requires exhaustive modeling of dependencies and will be investigated as part of our future work.

The algorithm works as follows: Let α_j be the confidential attribute, and S ⊂ D be the set of tuples for which α_j, satisfying the constraint α_j[d] = conf_value for all d ∈ S, is confidential. As the first step, the algorithm replaces all confidential data values with ν. Then, it identifies the candidate maximum impact data values and initializes their primary and secondary impacts. The primary impact is the number of tuples which will be affected (i.e. whose probabilities will be affected) if an instance of the maximum impact data value is replaced with ν. The secondary impact, on the other hand, is the number of tuples that support both the confidential data value (i.e. α_j = conf_value) and the maximum impact data value. Next, for each tuple d ∈ S, the need for suppression is verified by finding p(v|d) for all v ∈ V_{α_j} and checking the truth value of the following assertion:

  p(α_j[d]|d) > p(v|d) | ∀v ∈ V_{α_j} − α_j[d]    (23)

If Assertion (23) is true for a tuple d ∈ S, it picks a random next best guess v_rnbg^d from V_{α_j}. Next, the candidate maximum impact data values are sorted. Different from the original DECP, which uses only the secondary impact to determine which maximum impact data value to use, e-DECP also uses the primary impact in order to guarantee suppression of the maximum number of confidential data values within a single iteration. With the maximum impact values sorted, the rest of the execution is quite similar to the original DECP, involving replacement of maximum impact data value instances, calculation of probabilities, and re-checking of Assertion (23). An overview of the algorithm is depicted in Fig. 6a.

3.3.2 e-DROPP algorithm

The enhanced DROPP algorithm aims at suppressing multiple confidential data values at a time so that none of them can be correctly predicted by the corresponding classification models ς_nb^{D,α}. The proposed algorithm reduces the side-effects of the original DROPP algorithm when all confidential data values belong to a single tuple. The generic case, which handles the suppression of multiple confidential values that belong to multiple tuples, requires exhaustive modeling of dependencies and will be investigated as part of our future work.

The algorithm works as follows: Let d_i be the tuple containing multiple confidential data values, and S be the set of attributes containing a confidential data value in d_i. As the first step, the algorithm verifies the need for suppression for each confidential data value. More specifically, for each α ∈ S, it finds p(v|d_i) where v ∈ V_α and checks the truth
123
Suppressing microdata to prevent classification based inference 401

(a)

(b)
Fig. 6 Pseudocode of e-DECP and e-DROPP algorithms. a. Pseudocode of e-DECP algorithm. b. Pseudocode of e-DROPP algorithm
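The candidate ordering that distinguishes e-DECP from the original DECP can be sketched as follows. This is a minimal illustration under one plausible reading of the primary and secondary impact counts described above; the Dataset and Candidate types, the rankCandidates and blankBestInstance helpers, and the exact counting rules are assumptions made for the example, not the paper's implementation.

```cpp
// Sketch: rank candidate maximum impact data values by (primary, secondary)
// impact, then blank one instance of the best candidate. Hypothetical types.
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <tuple>
#include <vector>

using Row = std::vector<std::string>;   // one microdata tuple, "?" stands for ν
using Dataset = std::vector<Row>;

struct Candidate {
    std::size_t attr = 0;   // attribute index (never the confidential attribute)
    std::string value;      // candidate maximum impact data value
    int primary = 0;        // tuples in S carrying this value (their posteriors change)
    int secondary = 0;      // tuples supporting both conf_value and this value
};

// j: index of the confidential attribute, S: indices of the tuples being suppressed.
std::vector<Candidate> rankCandidates(const Dataset& d, const std::vector<std::size_t>& S,
                                      std::size_t j, const std::string& confValue) {
    std::vector<Candidate> out;
    if (d.empty()) return out;
    for (std::size_t a = 0; a < d[0].size(); ++a) {
        if (a == j) continue;
        std::set<std::string> domain;
        for (const Row& r : d)
            if (r[a] != "?") domain.insert(r[a]);
        for (const std::string& v : domain) {
            Candidate c;
            c.attr = a;
            c.value = v;
            for (std::size_t i : S)
                if (d[i][a] == v) ++c.primary;
            for (const Row& r : d)
                if (r[j] == confValue && r[a] == v) ++c.secondary;
            out.push_back(c);
        }
    }
    // e-DECP: order by primary impact first, secondary impact as the tie-breaker
    // (the original DECP used only the secondary impact).
    std::sort(out.begin(), out.end(), [](const Candidate& x, const Candidate& y) {
        return std::tie(y.primary, y.secondary) < std::tie(x.primary, x.secondary);
    });
    return out;
}

// Blank one instance of the best candidate inside a tuple that still violates
// Assertion (23); the caller then recomputes the posteriors and re-checks.
bool blankBestInstance(Dataset& d, const std::vector<std::size_t>& violating, const Candidate& best) {
    for (std::size_t i : violating)
        if (d[i][best.attr] == best.value) { d[i][best.attr] = "?"; return true; }
    return false;
}

int main() {
    Dataset d = {{"A", "x", "flu"}, {"A", "y", "flu"}, {"B", "x", "cold"}};
    std::vector<std::size_t> S = {0, 1};          // tuples whose diagnosis "flu" is confidential
    auto ranked = rankCandidates(d, S, 2, "flu");
    if (!ranked.empty()) blankBestInstance(d, S, ranked.front());
    return 0;
}
```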

3.3.2 e-DROPP algorithm

The enhanced DROPP algorithm aims at suppressing multiple confidential data values at a time so that none of them can be correctly predicted by the corresponding classification models ς_nb^{D,α}. The proposed algorithm reduces the side-effects of the original DROPP algorithm when all confidential data values belong to a single tuple. The generic case, which handles the suppression of multiple confidential values that belong to multiple tuples, requires exhaustive modeling of dependencies and will be investigated as part of our future work.

The algorithm works as follows: Let d_i be the tuple containing multiple confidential data values, and S be the set of attributes containing a confidential data value in d_i. As the first step, the algorithm verifies the need for suppression for each confidential data value. More specifically, for each α ∈ S, it finds p(v|d_i) where v ∈ V_α and checks the truth value of the following assertion:

p(α[d_i] | d_i) > p(v | d_i), ∀v ∈ V_α − {α[d_i]}    (24)

If Assertion (24) is true, it picks a random next best guess v^α_rnbg from V_α. Next, it identifies the candidate maximum impact data values, and initializes their impacts on each confidential value.⁷ To identify the maximum impact data value in each iteration, the impacts of the candidates are averaged and sorted. With the maximum impact values sorted, the rest of the execution is quite similar to the original DROPP, which involves replacement of maximum impact data value instances from d_i, calculation of probabilities and re-checking of Assertion (24). An overview of the algorithm is depicted in Fig. 6b.

⁷ Please refer to Eq. (22) for the calculation of impact.
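For concreteness, the per-attribute check of Assertion (24) can be sketched with a small categorical Naïve Bayes estimator. The posterior, needsSuppression and randomNextBestGuess helpers, the Laplace smoothing, and the treatment of "?" (ν) as a skipped cell are illustrative assumptions rather than the estimator used in the paper.

```cpp
// Sketch: Naïve Bayes posterior for one attribute of a single tuple and the
// Assertion (24) check. Hypothetical helpers; "?" plays the role of ν.
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

using Row = std::vector<std::string>;   // "?" stands for ν
using Dataset = std::vector<Row>;

// Naïve Bayes posterior p(v | d) for target attribute a, Laplace-smoothed;
// hidden cells of d are skipped.
std::map<std::string, double> posterior(const Dataset& D, const Row& d, std::size_t a) {
    std::map<std::string, int> classCount;
    for (const Row& r : D)
        if (r[a] != "?") ++classCount[r[a]];
    std::map<std::string, double> p;
    for (const auto& [v, cnt] : classCount) {
        double score = static_cast<double>(cnt) / D.size();
        for (std::size_t k = 0; k < d.size(); ++k) {
            if (k == a || d[k] == "?") continue;
            int match = 0;
            for (const Row& r : D)
                if (r[a] == v && r[k] == d[k]) ++match;
            score *= (match + 1.0) / (cnt + 2.0);   // Laplace smoothing
        }
        p[v] = score;
    }
    double z = 0.0;
    for (const auto& [v, s] : p) z += s;
    if (z > 0.0)
        for (auto& [v, s] : p) s /= z;
    return p;
}

// Assertion (24): the actual value is strictly more probable than every alternative.
bool needsSuppression(const std::map<std::string, double>& p, const std::string& actual) {
    if (!p.count(actual)) return false;
    for (const auto& [v, prob] : p)
        if (v != actual && prob >= p.at(actual)) return false;
    return true;
}

// Random next best guess: any other value with non-zero posterior.
std::string randomNextBestGuess(const std::map<std::string, double>& p, const std::string& actual) {
    std::vector<std::string> others;
    for (const auto& [v, prob] : p)
        if (v != actual && prob > 0.0) others.push_back(v);
    return others.empty() ? actual : others[std::rand() % others.size()];
}

int main() {
    Dataset D = {{"Y", "Y", "gastritis"}, {"N", "Y", "angina"}, {"N", "N", "angina"}};
    Row di = {"N", "Y", "angina"};          // attribute 2 holds the confidential diagnosis
    auto p = posterior(D, di, 2);
    std::string guess = needsSuppression(p, "angina") ? randomNextBestGuess(p, "angina") : "";
    (void)guess;
    return 0;
}
```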

4 Discussion on the effectiveness of suppression algorithms

The motivation of the suppression algorithms presented in this paper is to make a given set of confidential data values non-discoverable, while minimizing the effect on the usefulness of the data for purposes other than predicting the confidential data values. But, how can we make sure that an adversary would not be able to predict the suppressed confidential data values? Certainly this might be a problem if randomization is not employed in various stages of the algorithms. Let us assume that an adversary knows not only D′, the transformed microdata set, but also V_{α_j}, the domain of the confidential data value α_j[d_i], and analyze how randomization avoids prediction of the actual confidential data value.

First, let us assume that a modified version of DECP is used in order to suppress the confidential data value α_j[d_i]. This version of DECP aims at decreasing p(α_j[d_i]|d_i) below that of the next best guess v_nbg instead of v_rnbg.

Definition 4.1 (Next Best Guess) The next best guess, v_nbg ∈ V_{α_j}, is a randomly selected value from V_{α_j} satisfying the following conditions:

i. It is different from α_j[d_i],

v_nbg ≠ α_j[d_i]    (25)

ii. It is among the top-2 probable set,

v_nbg ∈ Ω_2^{α_j[d_i]}    (26)

iii. The probability of the α_j-th attribute of d_i being equal to v_nbg is smaller than that of the confidential data value α_j[d_i] and greater than zero,

p(α_j[d_i] | d_i) > p(v_nbg | d_i) > 0    (27)

This leads to a change in the ordering of the top-2 probable set Ω_2^{α_j[d_i]} = {α_j[d_i], v_nbg}. Knowing this fact, an adversary can predict the actual confidential data value to be the one with the second highest probability in Ω_2^{α_j[d_i]} with a confidence equal to the success rate of the algorithm. That is to say, if the success rate of the algorithm is 100%, then the adversary can predict the actual confidential data value with 100% confidence. This problem exists not only in DECP but also in the INCP and DROPP algorithms. Therefore, the random next best guess is employed during suppression in order to reduce the confidence of an adversary predicting the actual confidential value, as shown below:

Confidence = Success Rate / k    (28)

The second issue, which is inherent in all suppression algorithms, occurs when |V_{α_j}| = 2. Let us assume that the decision to suppress the confidential data value α_j[d_i] is not randomized when |V_{α_j}| = 2. In this case, the algorithms will try to suppress the confidential data value with the maximum possible success rate. Knowing this fact, an adversary can predict the actual confidential data value to be the one with the second highest probability in V_{α_j} with a confidence equal to the success rate of the algorithm. In order to avoid such attacks, we randomly decide whether to suppress a confidential data value for microdata sets with |V_{α_j}| = 2.

Another issue is the effectiveness of the suppression algorithms against different classification models. Remember that two of the proposed algorithms, the DECP and INCP algorithms, aim at downgrading the classification model by modifying D − d_i. In the first method, the probability of resemblance of the tuple containing the confidential data value to other tuples d ∈ D satisfying α_j[d] = α_j[d_i] is reduced. And, in the latter method, the probability of resemblance of the tuple containing the confidential data value to the tuples d ∈ D satisfying α_j[d] ≠ α_j[d_i] is increased. On the other hand, the DROPP and HID3 algorithms aim at downgrading the microdata tuple containing the confidential data value. Both methods find the attributes that enable correct prediction of the actual confidential value and hide them from the tuple containing the confidential data value. As a result, the probability of similarity of the tuple containing the confidential data value to the other tuples d ∈ D satisfying α_j[d] = α_j[d_i] is reduced. Since all classification methods tend to find the target attribute value of a tuple based on its resemblance to other tuples in the training data set, the proposed suppression algorithms are expected to achieve their goal even when used with other classification methods. In order to verify this, we measured the effectiveness of each algorithm against both Naïve Bayesian and ID3 classification. The results can be found in Sect. 5.

The final issue that needs to be discussed is the side effect of the proposed algorithms, which is related to the number of attribute values hidden excluding the confidential data value. Remember that, for each suppression algorithm, we derived an upper bound for the number of attribute values that will be modified in the previous section. According to these derivations we can conclude the following:

i. The upper bound for the number of data values that can be modified by the INCP algorithm depends on m, the number of tuples in D,
ii. the upper bound for the number of data values that can be modified by the DROPP and HID3 algorithms depends on n, the number of attributes in D,
iii. the upper bound for the number of data values that can be modified by the DECP algorithm depends on n·m, the number of attributes in D times the number of tuples in D.

Now, let us assume that m ≫ n. In this case, the worst case performance of the DROPP and HID3 algorithms should be much better than the worst case performance of the DECP and INCP algorithms with respect to side effects. However, for data sets satisfying n ≫ m, e.g., gene expression data, the worst case performance of the INCP algorithm will outperform the DECP, DROPP, and HID3 algorithms with respect to the side effects. Note that the DECP algorithm will perform slightly worse than the other algorithms it is grouped with, as in both cases either m or n loses its significance with respect to the other term.
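A sketch of how a next best guess satisfying conditions (25)-(27) could be selected from the posteriors, together with the confidence bound of Eq. (28), is given below. The helper names are illustrative, and the reading of k as the number of equally likely candidates the random next best guess is drawn from is an assumption made only for this illustration.

```cpp
// Sketch: select v_nbg per Definition 4.1 and bound the adversary's confidence.
#include <map>
#include <optional>
#include <string>

// Posteriors p(v | d_i) over the domain of the confidential attribute.
using Posterior = std::map<std::string, double>;

// Conditions (25)-(27): different from the actual value, strictly less probable
// than it, and with non-zero probability; the highest such value is the other
// member of the top-2 set.
std::optional<std::string> nextBestGuess(const Posterior& p, const std::string& actual) {
    if (!p.count(actual)) return std::nullopt;
    std::optional<std::string> best;
    for (const auto& [v, prob] : p) {
        if (v == actual || prob <= 0.0 || prob >= p.at(actual)) continue;
        if (!best || prob > p.at(*best)) best = v;
    }
    return best;
}

// Eq. (28): when the guess is drawn at random from k equally likely candidates,
// the adversary's confidence drops from the raw success rate to successRate / k.
double adversaryConfidence(double successRate, int k) {
    return k > 0 ? successRate / k : successRate;
}

int main() {
    Posterior p = {{"gastritis", 0.5}, {"angina pectoris", 0.3}, {"dyspepsia", 0.2}};
    auto g = nextBestGuess(p, "gastritis");            // -> "angina pectoris"
    double c = adversaryConfidence(1.0, 2);            // 100% success, 2 candidates -> 0.5
    (void)g; (void)c;
    return 0;
}
```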


Table 8 Data sets used in the experiments

Data set                  No. of instances  No. of attributes  No. of unknowns
Wisconsin breast cancer   699               10                 16
Car evaluation            1,728             7                  0

5 Experimental results

This section presents the experimental results. The primary objective of the experiments is to compare the suppression algorithms in terms of CPU time performance, rate of success, information loss, and uncertainty.

5.1 Data sets and implementation details

In order to conduct the experiments we selected two data sets from the University of California at Irvine repository [28], the Wisconsin Breast Cancer data set [24] and the Car Evaluation data set. Table 8 provides a description of the data sets, including the number of instances, the number of attributes, and the number of unknowns.

We implemented the proposed algorithms using the C++ programming language. To evaluate the performance of the algorithms, we performed experiments on a 2.20 GHz Celeron PC with 256 MB of memory running the Windows operating system. As the suppression algorithms contain random components, the experimental results presented are averages of five realizations unless stated otherwise. Moreover, in order to illustrate the power of the algorithms we chose to suppress confidential data values for which the domain size of the corresponding attribute is greater than 2.

5.2 Results and analysis of algorithms

In this study, we first measured the average execution times⁸ required to suppress a confidential data value. The results, as depicted in Fig. 7, show that the suppression algorithms performed remarkably similarly with respect to execution time.

Fig. 7 Average execution times of proposed algorithms

Another performance criterion is the percent of successful suppressions.⁹ For each suppression algorithm, we first measured the percent of successful suppressions against the algorithm's primary¹⁰ classification model. As illustrated in Fig. 5a,b, the proposed algorithms successfully suppressed all confidential data values with respect to their primary classification model.

Next, we investigated the correctness of the following hypotheses:

Hypothesis 5.1 All proposed algorithms suppressing confidential data values against probabilistic classification models also block decision tree classification based inference.

Hypothesis 5.2 All proposed algorithms suppressing confidential data values against decision tree classification models also block probabilistic classification based inference.

Hypothesis 5.3 All proposed algorithms suppressing confidential data values also block inference based on more complex classification models (e.g. SVM).

⁸ In order to find the average execution times, we suppressed a data value from each instance of the data sets and averaged the CPU time results.
⁹ A successful suppression implies that the confidential data value is suppressed without deleting the microdata tuple containing it.
¹⁰ Naïve Bayesian classification model for DECP, INCP and DROPP algorithms, and ID3 classification model for HID3 algorithm.
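The measurement protocol of footnotes 8 and 9 can be sketched as a simple evaluation loop. The Dataset type, the suppress and predicts callbacks, and the timing code below are illustrative assumptions, not the harness used for the reported experiments.

```cpp
// Sketch of the evaluation loop: suppress one value per instance, time it,
// and count a success when the released tuple no longer yields the true value.
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using Row = std::vector<std::string>;   // "?" stands for ν
using Dataset = std::vector<Row>;

struct Outcome { double avgMillis = 0.0; double successRate = 0.0; };

// suppress: runs one suppression algorithm in place, returns false if it had to
//           drop the tuple (which footnote 9 counts as unsuccessful).
// predicts: true if a classifier trained on the released data still predicts
//           the true confidential value for the given tuple.
Outcome evaluate(Dataset data, std::size_t confAttr,
                 const std::function<bool(Dataset&, std::size_t)>& suppress,
                 const std::function<bool(const Dataset&, std::size_t, std::size_t)>& predicts) {
    Outcome o;
    std::size_t successes = 0;
    double totalMs = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        Dataset released = data;                      // fresh copy per trial
        auto t0 = std::chrono::steady_clock::now();
        bool kept = suppress(released, i);
        auto t1 = std::chrono::steady_clock::now();
        totalMs += std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (kept && !predicts(released, i, confAttr)) ++successes;
    }
    if (!data.empty()) {
        o.avgMillis = totalMs / data.size();
        o.successRate = 100.0 * successes / data.size();
    }
    return o;
}

int main() {
    Dataset d = {{"Y", "N", "flu"}, {"N", "N", "cold"}};
    auto suppress = [](Dataset& rel, std::size_t i) { rel[i][2] = "?"; return true; };
    auto predicts = [](const Dataset&, std::size_t, std::size_t) { return false; };
    evaluate(d, 2, suppress, predicts);
    return 0;
}
```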


Table 9 Success of proposed algorithms against different classification models

           Percent of successful suppressions
           Against Naïve Bayesian classification   Against ID3 classification              Against SVM
Algorithm  W. Breast Cancer   Car evaluation       W. Breast Cancer   Car evaluation       W. Breast Cancer   Car evaluation
DECP       100%               100%                 77%                86%                  41%                65%
INCP       100%               100%                 75%                84%                  41%                56%
DROPP      100%               100%                 90%                85%                  89%                88%
HID3       84%                65%                  100%               100%                 86%                80%

We measured the percent of successful suppressions achieved by each algorithm against (1) its secondary¹¹ classification model, and (2) SVM, using the SVM-Struct [25], as it is a more powerful classification technique. As illustrated in Table 9, the proposed algorithms exceeded a success rate of 65% against their secondary classification models, thus proving the correctness of Hypotheses 5.1 and 5.2. For SVM, the success rates range from 41 to 89%. This proves that the proposed algorithms still provide a protection of the confidential data values against more powerful classification techniques.

The next performance criterion is the information loss caused by the suppression algorithms. We used three evaluation metrics in order to measure the information loss: the Direct Distance, the Sum of Kullback Leibler Distances, and the Average Change in Mutual Information. The details of these information loss metrics can be found in Sect. 2.4.1. As a benchmark, we used the Naïve Row Deletion (NRD) algorithm. The NRD algorithm suppresses a confidential data value via deleting the microdata tuple to which it belongs, i.e., replacing each and every data value forming the microdata tuple, including the confidential data value, with ν.

The first information loss metric we used was the average direct distance, which measures the average number of unknowns introduced due to suppression of a single confidential data value. The average direct distance results for the suppression algorithms are shown in Fig. 8. As can be seen from the figure, the HID3 algorithm causes the least amount of information loss in terms of direct distance, followed by the DROPP algorithm. Actually, both of these algorithms are bounded by the NRD algorithm, as they aim at downgrading the microdata tuple instead of the classification models. On the other hand, the DECP and INCP algorithms perform relatively worse than the others, as they aim at downgrading the classification model instead.

Fig. 8 Average direct distance results of proposed algorithms

Apart from the average direct distance, we also measured the total direct distance versus the number of confidential data values suppressed. We realized this experiment for two sets of confidential data values:¹² one randomly selected from the Wisconsin breast cancer data set and one randomly selected from the Car Evaluation data set. The results are shown in Fig. 9. Similar to the average direct distance results, the HID3 algorithm causes the least amount of information loss in terms of direct distance, followed by the DROPP, INCP and DECP algorithms.

Fig. 9 Total direct distance results of proposed algorithms. a. Car evaluation data set. b. Wisconsin breast cancer data set

The second information loss metric we used was the sum of Kullback Leibler distances, which measures the distance between the first order probability distributions¹³ of the original and the new data sets. The performance of the suppression algorithms in terms of the sum of Kullback Leibler distances is shown in Fig. 10. Similar to the direct distance results, the HID3 algorithm causes the least amount of information loss in terms of the sum of Kullback Leibler distances, followed by the DROPP and NRD algorithms. The DECP and INCP algorithms perform relatively worse than the other algorithms.

Fig. 10 Sum of Kullback Leibler Distance results of proposed algorithms. a. Car evaluation data set. b. Wisconsin breast cancer data set

The last information loss metric we used was the average change in mutual information, which measures the average distance between the second-order probability distributions of the original and modified data sets. The performance of the suppression algorithms in terms of average change in mutual information is shown in Fig. 11. The results show that (1) the HID3 algorithm causes the least amount of change in the correlations within the data sets, followed by the DROPP, NRD, INCP, and DECP algorithms, and (2) the DECP and INCP algorithms prevent inference of confidential data values better than the DROPP and HID3 algorithms against different classification algorithms, as they distort correlations within the data sets more.

Fig. 11 Average change in mutual information results of proposed algorithms. a. Car evaluation data set. b. Wisconsin breast cancer data set

The final performance criterion is the uncertainty introduced by the suppression algorithms. We used the sum of conditional entropies in order to measure the expected value of uncertainty introduced into the modified data sets. The performance of the suppression algorithms in terms of the sum of conditional entropies is shown in Fig. 12. The results show that (1) the HID3 algorithm introduces the least amount of uncertainty in the modified data sets, followed by the DROPP, NRD, INCP, and DECP algorithms, and (2) the DECP and INCP algorithms prevent inference of confidential data values better than the DROPP and HID3 algorithms against different classification algorithms, as they cause more uncertainty within the data sets.

Fig. 12 Sum of conditional entropy results of proposed algorithms. a. Car evaluation data set. b. Wisconsin breast cancer data set

¹¹ ID3 classification model for DECP, INCP and DROPP, and Naïve Bayesian classification model for HID3.
¹² Note that the same set of confidential data values is used throughout the rest of the experiments measuring the information loss and uncertainty.
¹³ The probability distributions of the original and new data sets are derived using the extended domains of the attributes.

We can summarize the presented experimental results as follows:

1. There is a tradeoff between the rate of successful suppressions and the information loss caused by the suppression process.
2. The DECP algorithm achieves the highest success rate while causing the highest amount of information loss and uncertainty. This justifies Lemma 3.1, which states that the upper bound for the number of data values that can be modified by the DECP algorithm is equal to (n − 1)(N − 1) < nm.
3. The INCP algorithm achieves the second highest success rate while causing the second highest information loss and uncertainty. It is followed by the DROPP and HID3 algorithms. This ordering is completely due to (1) the characteristics of the Wisconsin Breast Cancer and Car Evaluation data sets satisfying the inequality m ≫ n, i.e., the number of transactions is much larger than the number of attributes, and (2) the upper bounds for the number of data values that can be modified by the algorithms. With a data set satisfying the inequality n ≫ m, the ordering for success, information loss, and uncertainty will be reversed and be either DROPP-HID3-INCP or HID3-DROPP-INCP.
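Two of the metrics referred to above, the direct distance and a per-attribute Kullback Leibler term computed over the extended domains (ν included), can be sketched as follows. The exact definitions used in the experiments are those of Sect. 2.4.1; the helper names and the smoothing constant here are assumptions made for illustration.

```cpp
// Sketch: direct distance (number of unknowns introduced) and the sum of
// per-attribute Kullback Leibler distances between original and released data.
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Row = std::vector<std::string>;   // "?" stands for ν
using Dataset = std::vector<Row>;

// Direct distance: how many cells were turned into ν by the suppression run.
std::size_t directDistance(const Dataset& original, const Dataset& released) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < original.size(); ++i)
        for (std::size_t k = 0; k < original[i].size(); ++k)
            if (released[i][k] == "?" && original[i][k] != "?") ++count;
    return count;
}

// First-order distribution of one attribute over its extended domain (ν included).
std::map<std::string, double> distribution(const Dataset& d, std::size_t attr) {
    std::map<std::string, double> p;
    for (const Row& r : d) p[r[attr]] += 1.0;
    for (auto& [v, c] : p) c /= d.size();
    return p;
}

// Sum over attributes of KL(original || released), with light smoothing so the
// released distribution never assigns zero mass to an observed value.
double sumKullbackLeibler(const Dataset& original, const Dataset& released) {
    double total = 0.0;
    const double eps = 1e-9;
    for (std::size_t a = 0; a < original[0].size(); ++a) {
        auto p = distribution(original, a);
        auto q = distribution(released, a);
        for (const auto& [v, pv] : p)
            if (pv > 0.0) total += pv * std::log(pv / (q.count(v) ? q[v] + eps : eps));
    }
    return total;
}

int main() {
    Dataset orig = {{"Y", "flu"}, {"N", "cold"}};
    Dataset rel  = {{"?", "flu"}, {"N", "cold"}};
    directDistance(orig, rel);      // 1 unknown introduced
    sumKullbackLeibler(orig, rel);  // distance between first-order distributions
    return 0;
}
```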

5.3 Results and analysis of hybrid algorithms

In this study, we merged each Naïve Bayesian suppression algorithm with the Decision Tree suppression algorithm HID3 in a round robin fashion,¹⁴ to demonstrate the performance of the hybrid algorithms against both classification methods.
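The round robin combination can be sketched as a driver that alternates the two suppression routines until neither classification model recovers the confidential value. The callback signatures and the iteration cap below are illustrative assumptions; in the experiments the decision tree and Naïve Bayesian suppression algorithms themselves play these roles, as described in footnote 14.

```cpp
// Sketch: alternate a decision-tree-targeted and a Naïve-Bayes-targeted
// suppression step until both models fail to predict the confidential value.
#include <functional>
#include <string>
#include <vector>

using Row = std::vector<std::string>;   // "?" stands for ν
using Dataset = std::vector<Row>;

// suppressA / suppressB: run one suppression pass (e.g. HID3, then DECP).
// exposedA / exposedB: true if the corresponding classifier still predicts
// the true confidential value from the released data.
bool hybridSuppress(Dataset& d,
                    const std::function<void(Dataset&)>& suppressA,
                    const std::function<void(Dataset&)>& suppressB,
                    const std::function<bool(const Dataset&)>& exposedA,
                    const std::function<bool(const Dataset&)>& exposedB,
                    int maxRounds = 10) {
    for (int round = 0; round < maxRounds; ++round) {
        if (exposedA(d)) suppressA(d);      // e.g. hide against ID3 first
        if (exposedB(d)) suppressB(d);      // then against Naïve Bayes
        if (!exposedA(d) && !exposedB(d)) return true;
    }
    return false;                           // gave up; caller may fall back to row deletion
}

int main() {
    Dataset d = {{"Y", "N", "flu"}};
    auto hide  = [](Dataset& x) { x[0][0] = "?"; };
    auto blank = [](Dataset& x) { x[0][1] = "?"; };
    auto exposedTree  = [](const Dataset& x) { return x[0][0] != "?"; };
    auto exposedBayes = [](const Dataset& x) { return x[0][1] != "?"; };
    hybridSuppress(d, hide, blank, exposedTree, exposedBayes);
    return 0;
}
```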
First, we measured the average execution times required by each hybrid algorithm in order to suppress a confidential data value. The results, as depicted in Fig. 13, show that the original and hybrid suppression algorithms performed remarkably similarly with respect to execution time. Next, we measured the percent of successful suppressions achieved by each hybrid algorithm against both classification methods. The hybrid algorithms successfully suppressed all confidential data values with respect to both classification models, and achieved a 100% success rate. Finally, we measured the average direct distance for the hybrid algorithms, as shown in Fig. 14. Similar to the previous results, the DROPP+HID3 algorithm caused the least amount of information loss in terms of average direct distance, followed by the INCP+HID3 and DECP+HID3 algorithms.

Fig. 13 Average execution times of hybrid algorithms

Fig. 14 Average direct distance results of hybrid algorithms

¹⁴ For example, for the HID3+DECP hybrid algorithm, first the HID3 algorithm is executed to suppress the confidential data value against decision tree classification based inference. Then, the DECP algorithm is executed to suppress the confidential data value against probabilistic classification based inference.
5.4 Results and analysis of e-DECP and e-DROPP algorithms

In this study, we also enhanced the DECP and DROPP algorithms to suppress multiple confidential data values at a time. In order to demonstrate the performance of the enhanced algorithms when compared to the original ones, we carried out 1,000 trials with randomly chosen confidential data value sets. We measured the average number of unknowns introduced due to suppression of multiple confidential data values, excluding the confidential data values themselves. The results showed that the enhanced algorithms performed remarkably better than the original versions, and reduced the side-effects by 50%.

6 Related work

The problem of protecting privacy while disclosing public-use data sets was previously investigated in the context of statistical databases (SDBs) as the statistical disclosure limitation problem (also referred to as the inference problem) [26,54]. The statistics literature, motivated by the need to publish statistical data sets with one or more contingency tables containing aggregate statistics, focused on identifying and protecting sensitive cells which may lead to derivation of aggregate confidential information. An extensive survey of statistical database security can be found in [26], and more recent work on disclosure control in statistical databases can be found in [28,29]. According to [26], the methods proposed for securing SDBs from inference attacks can be mainly classified into four categories: conceptual, query restriction, data perturbation and output perturbation. Conceptual approaches include techniques that detect and remove inference channels during the database design, mainly at the conceptual data model level. Query restriction approaches provide protection by restricting the query set size, controlling the overlap among successive queries, or making query results of small size unavailable to users of the database. On the other hand, data perturbation approaches introduce noise in the data by transforming the original database into a perturbed one. These approaches either replace the whole data set with a new one coming from the same probability distribution or perturb some of the attribute values once and for all. Finally, the output perturbation approaches perturb the answers to queries while leaving the data in the database unchanged.

One popular disclosure protection approach is cell suppression [9,10]. Cell suppression consists of two sub-approaches: primary suppression and secondary (i.e. complementary) suppression. The basic idea of primary cell suppression is to find all sensitive cells that might cause confidential information to be disclosed from the released statistical data set and replace them by a symbol indicating the suppression. Yet, primary suppression itself is not enough to protect the sensitive cells due to inference channels existing in the data set. In order to reach the desired protection for sensitive cells, other cells, i.e., marginal totals, containing non-confidential information that might lead to inference of suppressed sensitive cells also need to be suppressed; this is called secondary (complementary) cell suppression. Moreover, while finding a set of complementary suppressions protecting all sensitive cells, the information loss associated with the suppressed entries has to be minimized. This combinatorial optimization problem is known as the Cell Suppression Problem (CSP) in the statistics literature. Since CSP belongs to the class of NP-hard problems [30–32], many heuristic methods have been proposed [9,10,32–34] (see Willenborg and De Waal [35] for more references) to address the problem. The CSP is similar to the Microdata Suppression Problem (MSP), which this work is trying to address. Nevertheless, the methodologies used to address these problems are quite different due to the difference in the types of data sets they are trying to protect. In statistical data sets, inference results from the marginal totals given along with the data itself. On the other hand, in microdata sets, inference results from the statistical correlations between attributes like income and education.

Another popular disclosure protection approach that belongs to the data perturbation family is microaggregation [36–43]. Different from cell suppression, microaggregation aims at protecting numeric microdata by clustering individual records into small aggregates and replacing actual values of individual records by group means prior to publication. As Domingo-Ferrer and Torra pointed out in their work [36], microaggregation assumes that the confidentiality rules in use allow publication of microdata sets if the individual records correspond to groups of k or more individuals. While an efficient polynomial algorithm exists for optimal univariate microaggregation [41], microaggregation of multivariate data guaranteeing minimum information loss is known to be NP-hard [38]. Hence, several heuristic methods have been proposed [36,42,43] to address this problem. Recently, new heuristics employing genetic algorithms have been proposed to further lower the information loss [39,40]. Moreover, microaggregation has been extended to handle categorical data by means
of employing different clustering algorithms [37]. Different from the MSP, microaggregation assumes all respondents contributing to the microdata set have the same privacy preferences. It is meaningful to use microaggregation in such a setting where the sensitive attributes are the same for all respondents. Nevertheless, if respondents' privacy preferences differ, then unnecessary attribute values will be generalized, which will result in more information loss.

The security and privacy issues arising from the inference problem, which results in private-sensitive data being inferred from public-insensitive data, have also been investigated by multilevel secure databases research [44–48] and general purpose databases research [49–52]. Methods proposed within the database context mainly focus on detection and removal of meta-data, i.e., database constraints like functional and multi-valued dependencies, based inferences either during database design [47–49,53] or during query time [46,54,55]. However, they do not take into account the statistical correlations among database attributes which can be discovered by various data mining techniques and hence result in imprecise inferences like the rule 'A implies B' with 25% confidence.

There are also other approaches investigating the privacy issues arising during microdata disclosure within the scope of the anonymization problem. K-anonymity [11–13], being one of those approaches, aims at preserving anonymity during the data dissemination process using generalizations and suppressions on potentially identifying portions of the data set. Other approaches addressing the anonymization problem include [15,16,56,57]. In his work [15], Iyengar uses suppression and generalization approaches to satisfy privacy constraints. Moreover, he examines the tradeoff between privacy and information loss within different data usage contexts and proposes a genetic algorithm to find the optimal anonymization. On the other hand, in [16] Ohrn uses boolean reasoning, and in [56,57] Domingo-Ferrer uses microaggregation to address the anonymization problem. Besides the fact that these approaches successfully preserve privacy through anonymization, none of them addresses the inference threat to privacy due to data mining approaches. Therefore, they do not directly apply to the MSP. Moreover, similar to microaggregation, anonymization approaches assume that each respondent who contributed to the microdata set has the same privacy preferences, i.e., wants to be anonymous, which is not realistic.

Another approach, proposed by Wang et al. [58], addresses the threats caused by data mining abilities using a template-based approach. The proposed approach aims to (1) preserve the information for a wanted classification analysis and (2) limit the usefulness of unwanted sensitive inferences, i.e., classification rules, that may be derived from the data. More specifically, it focuses on suppressing sensitive rules, instead of sensitive data values.

The work closest to ours is proposed by Chang et al. [59]. In his work, Chang proposes a new paradigm for dealing with the inference problem, which combines the application of decision tree analysis with the concept of parsimonious downgrading. He shows how classification models can be used to predict suppressed confidential data values and concludes that some feedback mechanism is needed to protect suppressed data values against classification models.

7 Conclusion

In this paper we pointed out the possible privacy breaches induced by data mining algorithms on hidden microdata values. We considered two classification models that could be used for prediction purposes by adversaries. As an initial step to attack the problem, we proposed six heuristics to suppress the selected confidential data values so that they cannot be inferred using probabilistic and decision tree classification models. Our methods are based on modifying the original microdata set by inserting unknown values with as little damage as possible. Naïve Bayesian and ID3 classification models were selected as representatives of the probabilistic and decision tree classification models, respectively. Our experiments with real data sets showed that the proposed algorithms are effective in blocking the inference channels that are based both on their target classification models and on their secondary classification model. And, this verifies our statement that each of the suppression algorithms will also prevent inference when used with other classification methods. Moreover, we measured the side effects of the algorithms using different metrics and observed that there is a tradeoff between the rate of successful suppressions and the information loss caused by the suppression process. This observation, together with the experimental results, verifies our discussion on the side effects of the suppression algorithms:

– Side effects of the DECP algorithm depend both on the number of attributes and the number of transactions, ensuring that the success rate of the DECP algorithm will always be higher than the other algorithms in any situation.
– Side effects of the INCP algorithm depend on the number of transactions, ensuring that the success rate of the INCP algorithm will always be higher than the other algorithms if the number of transactions is higher than the number of attributes.
– Side effects of the DROPP and HID3 algorithms depend on the number of attributes, ensuring that the success rate of these algorithms will always be higher than the other algorithms if the number of attributes is higher than the number of transactions.


Next, to increase the success of the overall suppression process with respect to both classification models, we used a hybrid approach which suppresses against both classification models. The experimental results for the hybrid algorithms showed that the success rate with respect to both classification methods is 100%, meaning all confidential data values are successfully suppressed.

Finally, to decrease the side-effects of the overall suppression process when there are multiple confidential data values to suppress, we used the enhanced heuristics, e-DECP and e-DROPP. The experimental results for the enhanced algorithms showed that the side-effects can be reduced by more than 50%.

As part of our future work, we plan to investigate the following:

– Suppressing confidential data values against other classification algorithms, e.g., logistic regression,
– Suppressing multiple confidential data values at a time (generic version having no constraints),
– Developing a generic suppression technique, independent from individual classification methods, based on information theory,
– Using generalization as a fine grained method, and
– Suppressing evolving (i.e., continuously updated) microdata.

References

1. Wikipedia: Privacy—Wikipedia, the free encyclopedia. Available at http://en.wikipedia.org/wiki/Privacy (2005)
2. Report to Congress regarding the Terrorism Information Awareness Program, May 20, 2003
3. O'Leary, D.E.: Knowledge discovery as a threat to database security. In: Piatetsky-Shapiro, G., Frawley, W. (eds.) Knowledge Discovery in Databases, pp. 507–516. AAAI Press, MIT Press, Menlo Park, California (1991)
4. O'Leary, D.E.: Some privacy issues in knowledge discovery: the OECD personal privacy guidelines. IEEE Expert: Intelligent Syst. Appl. 10(2), 48–52 (1995)
5. Klosgen, W.: Knowledge discovery in databases and data privacy. IEEE Expert, April 1995
6. Piatetsky-Shapiro, G.: Knowledge discovery in databases vs. personal privacy. IEEE Expert, April 1995
7. Selfridge, P.: Privacy and knowledge discovery in databases. IEEE Expert, April 1995
8. Azgın Hintoğlu, A., Saygın, Y.: Suppressing microdata to prevent probabilistic classification based inference. In: Proceedings of the Workshop on Secure Data Management (SDM'05) (2005)
9. Cox, L.H.: Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc. 75(370), 377–385 (1980)
10. Sande, G.: Automated cell suppression to preserve confidentiality of business statistics. In: Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346–353 (1983)
11. Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. IEEE Sympos. Res. Security Privacy (1998)
12. Samarati, P.: Protecting respondents' identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 1010–1027 (2001)
13. Sweeney, L.: k-Anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl. Syst. 10(5), 557–570 (2002)
14. USC Annenberg School—Center for the Digital Future: The Highlights of the Digital Future Report, Year Five, Ten Years Ten Trends. Available at http://www.digitalcenter.org/pdf/Center-for-the-Digital-Future-2005-Highlights.pdf
15. Iyengar, V.S.: Transforming data to satisfy privacy constraints. SIGKDD (2002)
16. Øhrn, A., Ohno-Machado, L.: Using boolean reasoning to anonymize databases. Artif. Intell. Med. 15(3), 235–254 (1999)
17. Nissenbaum, H.: Protecting privacy in an information age: the problem of privacy in public. Law Philos. 17, 559–596 (1998)
18. Sweeney, L.: Information explosion. In: Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Zayatz, L., Doyle, P., Theeuwes, J., Lane, J. (eds.). Urban Institute, Washington, DC (2001)
19. Dreiseitl, S., Vinterbo, S., Ohno-Machado, L.: Disambiguation data: extracting information from anonymized sources. In: Proceedings of the 2001 American Medical Informatics Annual Symposium, pp. 144–148 (2001)
20. Aggarwal, C.: On k-anonymity and the curse of dimensionality. In: Proceedings of the 31st VLDB Conference (2005)
21. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: ℓ-Diversity: privacy beyond k-anonymity. In: Proceedings of the 22nd IEEE International Conference on Data Engineering (2006)
22. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (1991)
23. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLSummary.html
24. Mangasarian, O.L., Wolberg, W.H.: Cancer diagnosis via linear programming. SIAM News 23(5), 1–18 (1990)
25. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–1484 (2005)
26. Adam, N.R., Wortmann, J.C.: Security-control methods for statistical databases: a comparative study. ACM Comput. Surv. 21(4), 515–556 (1989)
27. Denning, D.E.: Cryptography and Data Security. Addison-Wesley (1982)
28. Domingo-Ferrer, J. (ed.): Inference Control in Statistical Databases. Lecture Notes in Computer Science, vol. 2316. Springer-Verlag, Berlin (2002)
29. Farkas, C., Jajodia, S.: The inference problem: a survey. SIGKDD Explorations (2003)
30. Geurts, J.: Heuristics for cell suppression in tables. Technical Paper, Netherlands Central Bureau of Statistics (1992)
31. Kao, M.Y.: Data security equals graph connectivity. SIAM J. Discret. Math. 9, 87–100 (1996)
32. Kelly, J.P., Golden, B.L., Assad, A.A.: Cell suppression: disclosure protection for sensitive tabular data. Networks 22, 397–417 (1992)
33. Fischetti, M., Salazar, J.J.: Models and algorithms for the 2-dimensional cell suppression problem in statistical disclosure control. Math. Program. 84, 283–312 (1999)
34. Fischetti, M., Salazar, J.J.: Models and algorithms for optimizing cell suppression in tabular data with linear constraints. J. Am. Stat. Assoc. 95(451), 916–928 (2000)
35. Willenborg, L., De Waal, T.: Statistical Disclosure Control in Practice. Lecture Notes in Statistics, vol. 111. Springer Verlag, New York (1996)
36. Domingo-Ferrer, J., Torra, V.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. Knowl. Data Eng. 14(1), 189–201 (2002)


37. Torra, V.: Microaggregation for categorical variables: a median based approach. In: Domingo-Ferrer, J., Torra, V. (eds.) Privacy in Statistical Databases, vol. 3050, pp. 162–174 (2004)
38. Oganian, A., Domingo-Ferrer, J.: On the complexity of optimal microaggregation for statistical disclosure control. Stat. J. U.N. Econ. Comm. Eur. 18(4), 345–354 (2001)
39. Solanas, A., Martinez-Balleste, A., Mateo-Sanz, J.M., Domingo-Ferrer, J.: Towards microaggregation with genetic algorithms. In: Proceedings of the Third IEEE Conference on Intelligent Systems, pp. 65–70 (2006)
40. Martinez-Balleste, A., Solanas, A., Domingo-Ferrer, J., Mateo-Sanz, J.M.: A genetic approach to multivariate microaggregation for database privacy. In: Proceedings of the 23rd IEEE International Conference on Data Engineering, pp. 180–185 (2007)
41. Hansen, S.L., Mukherjee, S.: A polynomial algorithm for optimal univariate microaggregation. IEEE Trans. Knowl. Data Eng. 15(4), 1043–1044 (2003)
42. Laszlo, M., Mukherjee, S.: Minimum spanning tree partitioning algorithm for microaggregation. IEEE Trans. Knowl. Data Eng. 17(7), 902–911 (2005)
43. Sande, G.: Exact and approximate methods for data directed microaggregation in one or more dimensions. Int. J. Uncertain. Fuzziness Knowl. Syst. 10(5), 459–476 (2002)
44. Jajodia, S., Meadows, C.: Inference problems in multilevel secure database management systems. In: Abrams, M.D., Jajodia, S., Podell, H.J. (eds.) Information Security—An Integrated Collection of Essays, pp. 570–584. IEEE C. S. Press (1989)
45. Quian, X., Stickel, M.E., Karp, P.D., Lunt, T.F., Garvey, T.D.: Detection and elimination of inference channels in multilevel relational database systems. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 196–205 (1993)
46. Stachour, P., Thuraisingham, B.: Design of LDV: a multilevel secure relational database management system. IEEE Trans. Knowl. Data Eng. 2(2), 190–209 (1990)
47. Su, T., Ozsoyoglu, G.: Inference in MLS database systems. IEEE Trans. Knowl. Data Eng. 3(2–3), 147–168 (1991)
48. Marks, D.: Inference in MLS database systems. IEEE Trans. Knowl. Data Eng. 8(1), 46–55 (1996)
49. Delugach, H., Hinke, T.: Wizard: a database inference analysis and detection system. IEEE Trans. Knowl. Data Eng. 8(1), 56–66 (1996)
50. Hinke, T., Delugach, H., Wolf, R.P.: Protecting databases from inference attacks. Comput. Secur. 16(8), 687–708 (1997)
51. Dawson, S., di Vimercati, S.D.C., Lincoln, P., Samarati, P.: Minimal data upgrading to prevent inference and association. In: Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 114–125. ACM Press (1999)
52. Brodsky, A., Farkas, C., Jajodia, S.: Secure databases: constraints, inference channels and monitoring disclosure. IEEE Trans. Knowl. Data Eng. 12(6), 900–919 (2000)
53. Hinke, T.H., Delugach, H.S., Chandrasekhar, A.: A fast algorithm for detecting second paths in database inference analysis. J. Comput. Secur. 3(2,3), 147–168 (1995)
54. Denning, D.: Commutative filters for reducing inference threats in multilevel database systems. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 134–146 (1985)
55. Thuraisingham, B.: Security checking in relational database management systems augmented with inference engines. Comput. Secur. 6, 479–492 (1987)
56. Domingo-Ferrer, J., Torra, V.: Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min. Knowl. Discov. 11(2), 195–212 (2005)
57. Domingo-Ferrer, J., Solanas, A., Martinez-Balleste, A.: Privacy in statistical databases: k-anonymity through microaggregation. In: Proceedings of IEEE Granular Computing (2006)
58. Wang, K., Fung, B.C.M., Yu, P.S.: Template-based privacy preservation in classification problems. In: ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 466–473 (2005)
59. Chang, L., Moskowitz, I.S.: Parsimonious downgrading and decision trees applied to the inference problem. In: Proceedings of the Workshop of New Security Paradigms, pp. 82–89 (1999)
