DOI 10.1007/s00778-009-0170-1
REGULAR PAPER
Received: 24 April 2008 / Revised: 11 August 2009 / Accepted: 3 October 2009 / Published online: 19 November 2009
© Springer-Verlag 2009
386 A. Azgin Hintoglu, Y. Saygın
by O’Leary [3] and was discussed further in a symposium on knowledge discovery in databases and personal privacy [4–7]. Since then, privacy issues have become one of the most important aspects of database and data mining research.

Example 1 Consider an on-line federation of hospitals and research organizations collaborating with each other, named HealthFed. Each federated hospital collects medical records of its patients together with their privacy preferences, and interacts with research organizations within the federation to share this information. In particular, assume that the city clinic Academic Health and Academic Research Institute, both being part of the HealthFed federation, collaborate with each other for research purposes. More specifically, Academic Health shares patients' medical records with Academic Research Institute after ensuring that the privacy preferences of each patient are preserved. Table 1 shows such patients, who gave consent to Academic Health to disclose their medical records to third parties for research purposes provided that their ID and name attributes are removed before disclosure. However, Bob, knowing that it might still be possible to link his medical records with other data sources through potentially identifying attributes like gender, zipcode, and age, required that not only his name but also his diagnosis information be hidden before disclosure. Therefore, Academic Health removed not only the ID and name attributes but also the diagnosis information from Bob's medical records before sharing them, as shown in Table 2. Unfortunately, given these medical records, Academic Research Institute can easily find Bob's diagnosis to be Angina Pectoris using a predictive data mining technique called classification.

In this paper, we address this particular problem of privacy preserving microdata disclosure. We assume that each individual might have different preferences regarding their privacy. Therefore, the confidential attributes might differ for each individual. In such a setting, one method to ensure privacy while disclosing a microdata set is to selectively hide¹ the confidential data values. This method ensures privacy from a micro-level perspective. But this is not the case from the macro-level perspective: with powerful data analysis tools and data mining techniques it is now possible for an adversary to predict hidden confidential information using the rest of the disclosed data set. We concentrate on one such possible threat, classification, which is a data mining technique widely used for prediction purposes. We extend our previous work [8] on microdata suppression

¹ Deleting or replacing with a symbol denoting unknown.
Suppressing microdata to prevent classification based inference 387
(1) to prevent not only probabilistic but also decision tree classification based inference, and (2) to handle not only single but also multiple confidential data value suppression to reduce the side effects. We achieve this by finding a set of data values that might cause confidential information to be disclosed from the released microdata set indirectly, i.e., through classification-based inference, and replacing them with a symbol indicating suppression.

Much research has taken place in the area of disclosure protection in order to preserve privacy. One such popular disclosure protection approach is cell suppression [9,10]. The basic idea of cell suppression is to protect sensitive information in aggregated data sets, i.e., statistical tables. The problems addressed by cell suppression and microdata suppression are quite similar in principle. Nevertheless, the methodologies used are completely different, as the characteristics of the data sets they are trying to protect are different. In statistical tables, inference results from the marginal totals given along with the aggregated data itself. In a microdata set, on the other hand, it results from the statistical correlations between non-aggregated attributes.

Another method, specifically proposed to ensure the privacy of disclosed microdata sets, is anonymization. Anonymization aims at modifying microdata sets before disclosure such that the identities of individuals cannot be recognized using other microdata sets that are available to the public. Various approaches [11–13,15,16], employing generalization and suppression on potentially identifying portions of the data sets, were proposed to address the anonymization problem. However, all of them have the inherent problem of assuming that the set of potentially identifying attributes is known in advance. This assumption is a strong one considering that we live in an internetworked society in which institutions are increasingly required to make their data electronically available. For example, in the United States, all governmental records, except those covered by a specified set of exceptions including The Privacy Act of 1974, are freely available for public access according to the Freedom of Information Act. Examples of such records include birth, death, and marriage records, drivers' records, real estate ownership records, court records, etc. [17,18]. With diverse sources of information available online, mostly unanonymized, it is hard to determine the potentially identifying attributes with 100% confidence. Moreover, it has been shown in various works that the proposed approaches to solve the anonymization problem fail to provide anonymity, and hence do not protect privacy [19–21]. Finally and most importantly, approaches addressing anonymization assume that each individual having a record in the microdata set has the same privacy preferences, which is far from being realistic.

The remainder of this paper is organized as follows: Sect. 2 presents the probabilistic and decision tree classifiers, the Microdata Suppression problem, the modification schemes, and the evaluation metrics. The details of the suppression algorithms are described in Sect. 3. Section 4 provides a discussion on the effectiveness of the proposed suppression algorithms. Section 5 reports our empirical results. The related work on data suppression and privacy is discussed in Sect. 6. Finally, Sect. 7 concludes the paper and outlines future research directions of this study.

2 Preliminaries

2.1 Problem formulation

Let Λ = {α1, α2, ..., αn} be the set of attributes with associated domains² Vα1, Vα2, ..., Vαn, and extended domains³ eVα1, eVα2, ..., eVαn, respectively. Let D = {d1, d2, ..., dm} be the microdata set, where each tuple di ∈ eVα1 × eVα2 × ··· × eVαn is an ordered list of values.

For each attribute αj ∈ Λ, there is a mapping αj[di] : eVα1 × eVα2 × ··· × eVαn → eVαj from eVα1 × eVα2 × ··· × eVαn into the extended domain eVαj. The mapping αj[di] represents the value of attribute αj of microdata tuple di.

Similarly, for each microdata set D, there is a mapping D[constraint] : D → S ∈ 2^D from D = {d1, d2, ..., dm} into S ∈ 2^D. The mapping D[constraint] represents the set of all tuples satisfying the constraint, expressed in conjunctive normal form, on attribute values. Examples of valid constraint expressions include the following:

– α1[d] = val1,
– ¬ α1[d] = val1,
– αi[d] = vali ∧ ¬ αj[d] = valj, and
– α1[d] = val1 ∧ α2[d] = val2 ∧ ··· ∧ αn[d] = valn.

Definition 2.1 (Classifiers (Σ)) Σ denotes the set of all classifiers that aim to predict the value of a single attribute, i.e., the target attribute⁴ ατ, in terms of the predictor attributes.⁵

Each classifier ς ∈ Σ is defined in the context of a training data set and a target attribute. For example, a classifier of type Naïve Bayesian built using the data set D with ατ ∈ Λ as the target attribute is denoted as ς_nb^{D,ατ}. If the type of the classifier, the training data set or the target attribute is unknown or not relevant in a given context, then a special symbol ⊥ is used instead of the respective symbol. For example, ς_⊥^{D,ατ}

² The domain of an attribute is represented by a finite set of discrete values excluding the unknown (i.e. null) value denoted by ν.
³ The extended domain of an attribute is represented by a finite set of discrete values including the unknown (i.e. null) value denoted by ν, such that eVαj = Vαj ∪ {ν}.
⁴ Also called the class attribute or the dependent attribute.
⁵ Also called the independent attributes.
denotes the set of all classifiers built using the data set D with ατ ∈ Λ as the target attribute. Following the training phase, each classifier ς ∈ Σ can be viewed as a function that takes a microdata tuple and predicts the most probable value of the target attribute based on the other attributes' values. For example, if ς ∈ ς_⊥^{⊥,ατ}, then ς : (eVα1 × eVα2 × ··· × eVαn) → Vατ.

Definition 2.2 (Naïve Bayesian Classifier (Σnb)) Let the jth attribute value of tuple di, that is αj[di], be unknown. According to Bayes' theorem, the probability that αj[di] has value v ∈ Vαj is equal to the posterior probability of v conditioned on di and is given by

p(v|di) = p(v) p(di|v) / p(di)    (1)

where p(v) and p(di) are the prior probabilities of v and di, respectively, and p(di|v) is the posterior probability of di conditioned on v. The Naïve Bayesian classifier is a probabilistic classifier based on Bayes' theorem with the class conditional independence assumption, that is, the effect of an attribute value on another attribute (i.e. the class attribute) is independent of the values of the remaining attributes. Due to the class conditional independence assumption, we can rewrite the posterior probability p(di|v) as follows:

p(di|v) = ∏_{k=1}^{j−1} p(αk[di]|v) · ∏_{k=j+1}^{n} p(αk[di]|v)    (2)

The Naïve Bayesian classifier ς_nb^{D−di,αj} built using D − di as the training data set will predict the most probable value for αj[di] as vπ ∈ Vαj if and only if the following condition holds:

p(vπ|di) > p(v|di)  ∀v ∈ Vαj − vπ    (3)

Since p(di) is the same for all v ∈ Vαj, it can be ignored as shown below:

p(vπ) p(di|vπ) > p(v) p(di|v)  ∀v ∈ Vαj − vπ    (4)

Definition 2.3 (ID3 Classifier (Σid3)) Let αj[di] be unknown. The ID3 classifier ς_id3^{D−di,αj} built using D − di as the training data set is a decision tree where each internal node represents a decision node, each branch represents an outcome of the decision, and each leaf node represents a possible value v ∈ Vαj for αj[di]. Such a classifier will predict the most probable value for αj[di] as vπ ∈ Vαj if and only if the test of the remaining attributes of di against the decision tree leads along a path from the root node to a leaf node labeled with vπ.

Definition 2.4 (Suppressing a Confidential Data Item) Let D′ be the microdata set after applying a set of modifications to D. The confidential data value αj[di] will be suppressed with respect to D′ if and only if there exists no classifier that can correctly predict the confidential data value:

ς(di) ≠ αj[di]  ∀ς ∈ ς_⊥^{D′−di,αj}    (5)

In this work, we relax the above statement such that there exists no Naïve Bayesian or ID3 classifier that can correctly predict the confidential data value:

ς(di) ≠ αj[di]  ∀ς ∈ ς_nb^{D′−di,αj} ∪ ς_id3^{D′−di,αj}    (6)

2.2 Modification strategies for microdata suppression

There are two possible modification strategies that can be adopted to address the microdata suppression problem.

Modification Strategy 1. Deleting an Attribute Value. This modification scheme, also referred to as hiding, involves replacement of attribute values, including the confidential data values, with a special symbol denoting the unknown (i.e. null) value ν. Replacing attribute values with ν results in uncertainty in the microdata set. For example, in the simplest case of a binary attribute, an unknown value can be either 0 or 1. Assuming that the value was 0 will contribute to the resulting classification model in a contradicting way compared to the assumption that it was 1. By carefully selecting the tuples and attributes to replace with an unknown value, we can decrease the precision of the classification models which can be built to predict the confidential data values. The details of how to select these values to replace with ν are provided below in the form of downgrade strategies.

Modification Strategy 2. Generalizing an Attribute Value. This modification scheme involves generalization of attribute values, including the confidential data values, using a concept hierarchy or an ontology.

Within the scope of this study, we propose suppression algorithms utilizing the deletion scheme, and leave the generalization scheme as part of our future work.

2.3 Downgrade strategies for microdata suppression

There are two possible downgrade strategies that can be adopted to address the Microdata Suppression Problem.

Downgrade Strategy 1. Classification Model Downgrade. Let D be the original microdata set with the confidential data value αj[di]. Classification model downgrade aims at transforming the original microdata set D into D′ such that the following constraints are satisfied:

i. αj[d′i] = ν,
ii. ∀α ∈ Λ − αj, α[d′i] = α[di],
iii. D′ − di ≠ D − di iff ∃ς ∈ ς_⊥^{D−di,αj} with ς(di) = αj[di], and
iv. ∀ς ∈ ς_⊥^{D′−di,αj}, ς(di) ≠ αj[di].

This scheme aims at degrading all classification models ς ∈ ς_⊥^{D′−di,αj} by modifying the tuples d ∈ D − di.
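The Naïve Bayesian half of the relaxed requirement in Eq. (6) can be checked mechanically: train on the modified set minus di, score every candidate value with Eqs. (1)–(4), and verify that the top-scoring value differs from the true confidential one. The sketch below is illustrative only: the record layout, attribute names, and toy values are assumptions (not the paper's Tables 1–3), and ν is modeled as Python's None, with hidden cells simply skipped in the product.

```python
NULL = None  # stands for the unknown (null) value ν

def nb_predict(train, di, target):
    """Naive Bayesian prediction (Eqs. 1-4): argmax_v p(v) · Π_k p(α_k[d_i]|v),
    skipping the target attribute and any hidden (ν) cell."""
    classes = sorted({d[target] for d in train if d[target] is not NULL})
    best, best_score = None, -1.0
    for v in classes:
        Dv = [d for d in train if d[target] == v]
        score = len(Dv) / len(train)  # prior p(v)
        for attr, x in di.items():
            if attr == target or x is NULL:
                continue
            score *= sum(1 for d in Dv if d[attr] == x) / len(Dv)
        if score > best_score:
            best, best_score = v, score
    return best

def satisfies_eq6_nb(D_mod, i, target, true_value):
    """Naive Bayesian part of Eq. (6): the classifier trained on D' − d_i
    must NOT recover the confidential value of d_i."""
    train = D_mod[:i] + D_mod[i + 1:]
    return nb_predict(train, D_mod[i], target) != true_value

# Toy records (assumed, not the paper's tables); diagnosis is confidential.
D = [
    {"indigestion": "Y", "chest_pain": "Y", "diagnosis": "angina pectoris"},
    {"indigestion": "Y", "chest_pain": "Y", "diagnosis": "angina pectoris"},
    {"indigestion": "N", "chest_pain": "N", "diagnosis": "gastritis"},
    {"indigestion": "N", "chest_pain": "N", "diagnosis": "gastritis"},
    {"indigestion": "Y", "chest_pain": "Y", "diagnosis": NULL},  # Bob's row, hidden
]
# Hiding Bob's diagnosis alone is not enough: the classifier recovers it.
print(satisfies_eq6_nb(D, 4, "diagnosis", "angina pectoris"))  # → False
```

This reproduces the failure mode of Example 1: micro-level hiding alone does not satisfy Eq. (6), which is exactly what the downgrade strategies are designed to repair.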
Definition 2.10 (Conditional Entropy) Let D and D′ be the original and modified microdata sets, respectively, and α ∈ Λ be an attribute. Let X_α^D on Vα be a random variable with instances α[d1], α[d2], ..., α[dm] and probability distribution pα. Let X_α^{D′} on Vα be a random variable with instances α[d′1], α[d′2], ..., α[d′m] and probability distribution p′α. The conditional entropy of X_α^{D′} given X_α^D can be defined as follows:

H(X_α^{D′} | X_α^D) = − Σ_{v∈Vα} Σ_{v′∈Vα} p(v, v′) log(p(v′|v))    (12)

Definition 2.11 (Sum of Conditional Entropies) Let D and D′ be the original and modified microdata sets, respectively. The sum of conditional entropies of D′ given D can be defined as follows:

SCE(D, D′) = Σ_{α∈Λ} H(X_α^{D′} | X_α^D)    (13)

The detailed descriptions of the information theoretic metrics introduced in this section can be found in [22].

3 Our approaches for suppressing microdata

As pointed out in Sect. 2, hiding a confidential data value alone may not be enough to protect it in case the whole data set is going to be disclosed. This results from the fact that an adversary can build a classification model using the rest of the data set as the training data set and use it to predict the actual confidential data value. In order to avoid such attacks, we propose four algorithms, each suppressing only one confidential data value at a time, against two popular classifier types: probabilistic and decision tree classifiers, as shown in Fig. 1. We select Naïve Bayesian and ID3 as typical representatives of probabilistic and decision tree classifiers, respectively, and develop our heuristics accordingly. Moreover, we propose enhancements to two of the proposed algorithms to suppress multiple confidential data values at a time to reduce the side effects.

3.1 Suppression against probabilistic classification models

In the following, we present three algorithms for preventing probabilistic classification-based inference. The proposed algorithms aim to suppress a confidential data value such that it is no longer among the Top-1 Probable value set.

Definition 3.1 (Top-k Probable) Let αj[di] be confidential and thus be replaced by ν. The Naïve Bayesian classifier ς_nb^{D−di,αj} built using D − di as the training data set will predict the Top-k Probable value set for αj[di] as Ω_k^{αj[di]} ⊆ Vαj. The Top-k Probable value set satisfies the following constraints:

i. Its size is equal to k.

|Ω_k^{αj[di]}| = k    (14)

ii. The probability of αj[di] being equal to the least probable value in the Top-k Probable value set is greater than the probability of αj[di] being equal to the most probable value among the remaining attribute values.

p(ω|di) > p(v|di)  ∀v ∈ Vαj − Ω_k^{αj[di]} ∧ ω ∈ Ω_k^{αj[di]}    (15)

The proposed suppression algorithms aim at either reducing p(αj[di]|di) below that of a randomly selected attribute value, called the Random Next Best Guess, among the Top-k Probable⁶ value set, or increasing the probability of a set of selected attribute values, called the Next Best Guess Set, above p(αj[di]|di).

Definition 3.2 (Random Next Best Guess) The random next best guess, v_rnbg ∈ Vαj, is a randomly selected value from Vαj satisfying the following conditions:

i. It is different from αj[di].

v_rnbg ≠ αj[di]    (16)

⁶ The effect of changing k is further discussed in Sect. 4.
ii. It is among the Top-k Probable value set.

v_rnbg ∈ Ω_k^{αj[di]}    (17)

iii. The probability of the αj-th attribute of di being equal to v_rnbg is smaller than that of the confidential data value αj[di], and greater than zero.

p(αj[di]|di) > p(v_rnbg|di) > 0    (18)

3.1.1 DECP algorithm

The DECP algorithm aims at suppressing the confidential data value αj[di] so that it cannot be correctly predicted by the downgraded classification model ς_nb^{D′−di,αj}. It accomplishes its goal by decreasing the probability p(αj[di]|di) below that of the random next best guess v_rnbg.

Definition 3.3 (Maximum Impact Attribute) The attribute with maximum impact on p(αj[di]|di), denoted by αMI^{αj[di]}, is the one that satisfies the following conditions:

αMI^{αj[di]} = arg min_{α∈Λ} |D[αj[d] = αj[di] ∧ α[d] = α[di]] − di|
  ∧ |D[αj[d] = αj[di] ∧ α[d] = α[di]] − di| > 1    (19)

In each iteration, the DECP algorithm identifies the maximum impact attribute αMI^{αj[di]} and modifies the tuples d ∈ D[αj[d] = αj[di] ∧ αMI^{αj[di]}[d] = αMI^{αj[di]}[di]] − di by replacing αMI^{αj[di]}[d] with ν until the goal is achieved, that is, until p(αj[di]|di) becomes less than p(v_rnbg|di). Each such replacement results in the maximum possible reduction in p(αj[di]|di), thus requiring a smaller number of modifications.

Theorem 3.1 Let αMI^{αj[di]} be the maximum impact attribute satisfying Equation (19). Then, every replacement of a maximum impact data value with ν causes the maximum decrease in p(αj[di]|di), thus resulting in fewer data values to be modified.

Proof Let us first find the effect of replacing a maximum impact data value with ν on p(αj[di]) p(di|αj[di]). Remember that, since p(di) is the same for all v ∈ Vαj, it can be ignored when calculating p(αj[di]|di).

p(αj[di]|di) = p(αj[di]) p(di|αj[di]) / p(di)
  ≅ p(αj[di]) p(di|αj[di])
  ≅ p(αj[di]) · p(αMI^{αj[di]}[di] | αj[di]) · ∏_{α∈Λ−{αj,αMI}} p(α[di] | αj[di])

Let us assume that

– the size of the microdata set D[αj[d] = αj[di] ∧ αMI^{αj[di]}[d] = αMI^{αj[di]}[di]] − di is F_{αj[di],αMI}, and
– the size of the microdata set D[αj[d] = αj[di]] − di is F_{αj[di]}.

A single replacement of a maximum impact data value causes p(αMI^{αj[di]}[di] | αj[di]) to decrease from F_{αj[di],αMI} / F_{αj[di]} to (F_{αj[di],αMI} − 1) / F_{αj[di]}. This, in turn, decreases p(di | αj[di]) by a factor of (F_{αj[di],αMI} − 1) / F_{αj[di],αMI}, as shown below.

p′(di | αj[di]) = p′(αMI^{αj[di]}[di] | αj[di]) · ∏_{α∈Λ−{αj,αMI}} p(α[di] | αj[di])
  = [(F_{αj[di],αMI} − 1) / F_{αj[di]}] · ∏_{α∈Λ−{αj,αMI}} p(α[di] | αj[di])
  = [(F_{αj[di],αMI} − 1) / F_{αj[di],αMI}] · p(αMI^{αj[di]}[di] | αj[di]) · ∏_{α∈Λ−{αj,αMI}} p(α[di] | αj[di])
  = [(F_{αj[di],αMI} − 1) / F_{αj[di],αMI}] · p(di | αj[di])

Now let us assume that there is another attribute αk which decreases p(αj[di]|di) more than αMI^{αj[di]} does. This implies
the following:

(F_{αj[di],αk} − 1) / F_{αj[di],αk} < (F_{αj[di],αMI} − 1) / F_{αj[di],αMI}
(F_{αj[di],αk} − 1) · F_{αj[di],αMI} < (F_{αj[di],αMI} − 1) · F_{αj[di],αk}
F_{αj[di],αk} < F_{αj[di],αMI}

which contradicts the definition of the Maximum Impact Attribute. So, we can conclude that every replacement of a maximum impact data value with ν causes the highest decrease in p(αj[di]|di), which in turn implies that the number of data values that should be modified is minimal.

The algorithm works as follows: Let αj[di] be confidential. As the first step, the algorithm verifies the need for suppression. It finds p(v|di) for all v ∈ Vαj and checks the truth value of the following assertion:

p(αj[di]|di) > p(v|di)  ∀v ∈ Vαj − αj[di]    (20)

If Assertion (20) is true, it picks a random next best guess v_rnbg from Vαj. Next, in each iteration it finds the maximum impact attribute αMI^{αj[di]} and replaces the maximum impact data values by ν as long as p(αj[di]|di) > p(v_rnbg|di). After processing all maximum impact attributes, it re-checks the truth value of Assertion (20). If Assertion (20) is still true, it reverts all changes and deletes the tuple di from the microdata set. An overview of the algorithm is depicted in Fig. 2a.

If |Vαj| = 2, then suppressing the confidential data value might result in an adversary guessing it correctly with 100% confidence. Therefore, the decision to suppress a confidential data value is randomized for the case where |Vαj| = 2. This results in an adversary guessing the actual confidential data value with 50% confidence, which is the maximum uncertainty that can be achieved under such circumstances.

Lemma 3.1 Let αj[di] be the confidential data value, n be the number of attributes, and N be the number of tuples in D[αj[d] = αj[di]] − di. Then, the upper bound for the number of data values that can be modified by the DECP algorithm is equal to (n − 1)(N − 1).

Thus, the DECP algorithm can replace at most (n − 1)(N − 1) data values with ν for suppressing a confidential data value.

Example 2 Now, let us illustrate how the DECP algorithm suppresses Bob's confidential diagnosis.

Step 1 Initially, the Naïve Bayesian classification model is constructed to find the probabilities p(v|di) for all v ∈ Vαj = {dyspepsia, angina pectoris, gastritis}. The Naïve Bayesian classification model constructed using the medical records of Table 2 is shown in Table 3. According to the model, the probabilities are p(dyspepsia|d2) = 0, p(angina pectoris|d2) = 1/6, and p(gastritis|d2) = 1/18.

Step 2 The probability p(angina pectoris|d2) is greater than both p(dyspepsia|d2) and p(gastritis|d2). As Bob's diagnosis can be correctly predicted, the suppression process starts.

Step 3 Let us assume that gastritis is selected as the random next best guess. From this point on, the DECP algorithm will try to decrease p(angina pectoris|d2) below p(gastritis|d2).

Step 4 To select the maximum impact attribute, the counts for each symptom attribute are found as follows:

– count_indigestion = |D[diagnosis[d] = diagnosis[d2] ∧ indigestion[d] = indigestion[d2]]| = 2
– count_chest pain = |D[diagnosis[d] = diagnosis[d2] ∧ chest pain[d] = chest pain[d2]]| = 2
– count_palpitation = |D[diagnosis[d] = diagnosis[d2] ∧ palpitation[d] = palpitation[d2]]| = 3

Since they have the same minimum count, both indigestion and chest pain can be the maximum impact attribute. Let us assume that indigestion is selected as the maximum impact attribute.

Step 5 All tuples d satisfying the constraint indigestion[d] = N ∧ diagnosis[d] = angina pectoris are found. Tuples 7 and 8 satisfy the aforementioned constraint.

Step 6 The indigestion attribute is hidden from tuple 7. With this replacement, p(angina pectoris|d2) decreases by 1/2 to 1/12. As p(angina pectoris|d2) is still greater than p(gastritis|d2), the suppression process continues.
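Step 4's counts are an instance of Definition 3.3 (Eq. 19). The following is a minimal sketch of that selection under assumed attribute names and toy tuples mirroring the counts above; note the paper breaks ties between equal-count attributes arbitrarily, while this sketch breaks them alphabetically.

```python
NULL = None  # stands for the unknown (null) value ν

def max_impact_attribute(D, di, target, conf_value):
    """Eq. (19) sketch: among predictor attributes, return the one whose value
    in d_i co-occurs least often (but more than once) with the confidential
    class value. Ties are broken alphabetically here."""
    counts = {}
    for attr, x in di.items():
        if attr == target or x is NULL:
            continue
        c = sum(1 for d in D
                if d is not di and d[target] == conf_value and d[attr] == x)
        if c > 1:
            counts[attr] = c
    return min(sorted(counts), key=counts.get) if counts else None

# Toy tuples with counts 2 / 2 / 3, mirroring Step 4 (values assumed).
bob = {"indigestion": "N", "chest_pain": "Y", "palpitation": "N", "diagnosis": NULL}
D = [
    {"indigestion": "N", "chest_pain": "Y", "palpitation": "N", "diagnosis": "angina pectoris"},
    {"indigestion": "N", "chest_pain": "N", "palpitation": "N", "diagnosis": "angina pectoris"},
    {"indigestion": "Y", "chest_pain": "Y", "palpitation": "N", "diagnosis": "angina pectoris"},
    {"indigestion": "Y", "chest_pain": "N", "palpitation": "Y", "diagnosis": "angina pectoris"},
    bob,
]
print(max_impact_attribute(D, bob, "diagnosis", "angina pectoris"))  # → chest_pain
```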
Fig. 2 Pseudocode of DECP and INCP algorithms. a Pseudocode of DECP algorithm. b Pseudocode of INCP algorithm
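To make the control flow of Fig. 2a concrete, here is a condensed, illustrative sketch of the DECP loop. It is not the paper's pseudocode: the posterior uses the unnormalized score of Eqs. (1)–(4), ν is modeled as None and skipped, candidate values are hidden in co-occurrence order rather than strictly by the maximum impact criterion, and the data and attribute names are assumptions.

```python
import copy
import random

NULL = None  # stands for the unknown (null) value ν

def posterior(train, di, target, v):
    """Unnormalized p(v|d_i) from Eqs. (1)-(4): p(v) · Π p(α[d_i]|v)."""
    Dv = [d for d in train if d[target] == v]
    if not Dv:
        return 0.0
    score = len(Dv) / len(train)
    for attr, x in di.items():
        if attr != target and x is not NULL:
            score *= sum(1 for d in Dv if d[attr] == x) / len(Dv)
    return score

def decp(D, i, target, conf_value, seed=0):
    """Condensed DECP loop (cf. Fig. 2a): hide predictor values co-occurring
    with the confidential class until the random next best guess catches up;
    revert everything and delete d_i if suppression fails."""
    D = copy.deepcopy(D)
    di, train = D[i], D[:i] + D[i + 1:]  # train shares tuple objects with D
    others = sorted({d[target] for d in train if d[target] is not NULL} - {conf_value})
    def beats_all():  # Assertion (20)
        p = posterior(train, di, target, conf_value)
        return all(p > posterior(train, di, target, v) for v in others)
    if not others or not beats_all():
        return D                          # no suppression needed
    v_rnbg = random.Random(seed).choice(others)
    snapshot = copy.deepcopy(D)
    while posterior(train, di, target, conf_value) > posterior(train, di, target, v_rnbg):
        cands = [(d, a) for d in train if d[target] == conf_value
                 for a in d if a != target and d[a] is not NULL and d[a] == di[a]]
        if not cands:
            break
        d, a = cands[0]
        d[a] = NULL                       # hide one co-occurring value
    if beats_all():                       # still predictable: revert, drop d_i
        return [d for k, d in enumerate(snapshot) if k != i]
    return D

# Toy data (assumed): Bob at index 4, confidential diagnosis "angina pectoris".
D = [
    {"indigestion": "Y", "chest_pain": "Y", "diagnosis": "angina pectoris"},
    {"indigestion": "Y", "chest_pain": "Y", "diagnosis": "angina pectoris"},
    {"indigestion": "Y", "chest_pain": "N", "diagnosis": "gastritis"},
    {"indigestion": "N", "chest_pain": "Y", "diagnosis": "gastritis"},
    {"indigestion": "Y", "chest_pain": "Y", "diagnosis": NULL},
]
out = decp(D, 4, "diagnosis", "angina pectoris")
print(len(out), out[0])  # Bob's tuple survives; two of tuple 0's values were hidden
```

On this toy set the loop hides two co-occurring symptom values, after which the confidential value no longer strictly dominates the next best guess, so Assertion (20) fails and the tuple is kept.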
3.1.2 INCP algorithm

The INCP algorithm aims at suppressing the confidential data value αj[di] so that it cannot be correctly predicted by the downgraded classification model ς_nb^{D′−di,αj}. It accomplishes its goal, as its name implies, by increasing the probabilities p(v|di) for all v in the next best guess set, Snbg, above p(αj[di]|di).

Definition 3.5 (Next Best Guess Set) The next best guess set, Snbg ⊆ Vαj, for microdata tuple di is the set of all attribute values v ∈ Vαj − αj[di] satisfying the following condition:

Snbg = {v | v ∈ Vαj − αj[di] ∧ p(v|di) ≥ p(v_rnbg|di)}    (21)

For each v ∈ Snbg, the INCP algorithm identifies the tuples d ∈ D[αj[d] = v] having no common attribute value with di and modifies them by replacing αj[d] with ν in order to increase p(v|di).

The algorithm works as follows: Let αj[di] be confidential. As the first step, the algorithm verifies the need for suppression. It finds p(v|di) for all v ∈ Vαj and checks the truth value of Assertion (20). If Assertion (20) is true, it picks a random next best guess v_rnbg from Vαj and forms Snbg by finding the attribute values v ∈ Vαj satisfying p(v|di) ≥ p(v_rnbg|di). Next, for each v ∈ Snbg, the algorithm finds the tuples d ∈ D[¬ α1[d] = α1[di] ∧ ··· ∧ ¬ αj−1[d] = αj−1[di] ∧ αj[d] = v ∧ ¬ αj+1[d] = αj+1[di] ∧ ··· ∧ ¬ αn[d] = αn[di]] and modifies them by replacing αj[d] with ν until the goal is achieved, that is, until p(v|di) becomes greater than p(αj[di]|di). After processing all attribute values v ∈ Snbg, it re-checks the truth value of Assertion (20). If Assertion (20) is still true, then the DECP algorithm is executed to complete the suppression. An overview of the algorithm is depicted in Fig. 2b.

Lemma 3.2 Let αj[di] be the confidential data value, m be the number of tuples in D, and N be the number of tuples in D[αj[d] = αj[di]] − di. Assuming that there are enough tuples that can be used for the suppression process (i.e. no need for executing DECP), the upper bound for the number of data values that can be modified by the INCP algorithm is equal to m − N − 1 − |Snbg|.

Proof The proof of this statement is straightforward. The INCP algorithm modifies the tuples d ∈ D[¬ α1[d] = α1[di] ∧ ··· ∧ ¬ αj−1[d] = αj−1[di] ∧ αj[d] = v ∧ ¬ αj+1[d] = αj+1[di] ∧ ··· ∧ ¬ αn[d] = αn[di]] for each v ∈ Snbg. In the worst case, Snbg contains all possible values of attribute αj except αj[di]. This implies |⋃_{v∈Snbg} D[αj[d] = v]| = m − N − 1. Moreover, due to the definition of the next best guess set and the random next best guess, the probability p(v|di) for each v ∈ Snbg must be greater than zero. This implies that, in the worst case, there exists at least one tuple which has the same data values as di (except αj) for each v ∈ Snbg. So, we can conclude that the INCP algorithm can replace at most m − N − 1 − |Snbg| data values with ν for suppressing a confidential data value.

Example 3 Now, let us illustrate how the INCP algorithm suppresses Bob's confidential diagnosis.
Step 1 Initially, the Naïve Bayesian classification model is constructed to find the probabilities p(v|di) for all v ∈ Vαj = {dyspepsia, angina pectoris, gastritis}. The Naïve Bayesian classification model constructed using the medical records of Table 2 is shown in Table 3. According to the model, the probabilities are p(dyspepsia|d2) = 0, p(angina pectoris|d2) = 1/6, and p(gastritis|d2) = 1/18.

Step 2 The probability p(angina pectoris|d2) is greater than both p(dyspepsia|d2) and p(gastritis|d2). As Bob's diagnosis can be correctly predicted, the suppression process starts.

Step 3 Let us assume that gastritis is selected as the random next best guess. From this point on, the INCP algorithm will try to increase p(gastritis|d2) above p(angina pectoris|d2).

Step 4 All tuples d among D[diagnosis[d] = gastritis] which have no common symptoms with Bob are found. Tuple 4 satisfies the aforementioned constraint.

Step 5 The diagnosis attribute is hidden from tuple 4. After this replacement, p(gastritis|d2) increases to 1/7, and p(angina pectoris|d2) increases to 4/21. As p(angina pectoris|d2) is still greater than p(gastritis|d2), the suppression process continues.

Step 6 Since there are no more tuples among D[diagnosis[d] = gastritis] which have no common symptoms with Bob, the suppression process continues with the DECP execution.

Step 7 To select the maximum impact attribute, the counts for each symptom attribute are found as follows:

– count_indigestion = |D[diagnosis[d] = diagnosis[d2] ∧ indigestion[d] = indigestion[d2]]| = 2
– count_chest pain = |D[diagnosis[d] = diagnosis[d2] ∧ chest pain[d] = chest pain[d2]]| = 2
– count_palpitation = |D[diagnosis[d] = diagnosis[d2] ∧ palpitation[d] = palpitation[d2]]| = 3

Since they have the same minimum count, both indigestion and chest pain can be the maximum impact attribute. Let us assume that indigestion is selected as the maximum impact attribute.

Step 8 All tuples d satisfying the constraint indigestion[d] = N ∧ diagnosis[d] = angina pectoris are found. Tuples 7 and 8 satisfy the aforementioned constraint.

Step 9 The indigestion attribute is hidden from tuple 7. With this replacement, p(angina pectoris|d2) decreases by 1/2 to 2/21. As p(angina pectoris|d2) is now smaller than p(gastritis|d2), the suppression process stops. The resulting microdata can be seen in Table 5.

3.1.3 DROPP algorithm

The DROPP algorithm aims at suppressing the confidential data value αj[di] by dropping the probability p(αj[di]|di) below that of the random next best guess v_rnbg, so that it cannot be correctly predicted by the classification model ς_nb^{D−di,αj}. Unlike the DECP and INCP algorithms, it achieves its goal by downgrading the tuple di itself, instead of downgrading the classification model ς_nb^{D−di,αj}.

The algorithm employs the following modified definition of the Maximum Impact Attribute:

Definition 3.6 (Maximum Impact Attribute′) The attribute with maximum impact on p(αj[di]|di), denoted by αMI^{αj[di]}, is the one that satisfies the following conditions:

αMI^{αj[di]} = arg max_{α∈Λ} |D[αj[d] = αj[di] ∧ α[d] = α[di]] − di| / |D[αj[d] = v_rnbg ∧ α[d] = α[di]]|
  ∧ p(αMI^{αj[di]}[di] | αj[di]) > p(αMI^{αj[di]}[di] | v_rnbg)    (22)
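The ratio in Eq. (22) can be computed directly from co-occurrence counts. The following is an illustrative sketch under assumed record and attribute names (toy data, not the paper's tables); the second condition of Eq. (22) is enforced by comparing the class-conditional frequencies.

```python
NULL = None  # stands for the unknown (null) value ν

def max_impact_attribute_dropp(D, di, target, conf_value, v_rnbg):
    """Eq. (22) sketch: arg max over predictor attributes α of
    |D[α_j = conf ∧ α = α[d_i]] − d_i| / |D[α_j = v_rnbg ∧ α = α[d_i]]|,
    subject to p(α[d_i]|conf) > p(α[d_i]|v_rnbg)."""
    F_conf = sum(1 for d in D if d is not di and d[target] == conf_value)
    F_rnbg = sum(1 for d in D if d[target] == v_rnbg)
    best, best_ratio = None, 0.0
    for attr, x in di.items():
        if attr == target or x is NULL:
            continue
        f_conf = sum(1 for d in D if d is not di
                     and d[target] == conf_value and d[attr] == x)
        f_rnbg = sum(1 for d in D if d[target] == v_rnbg and d[attr] == x)
        if f_rnbg == 0 or f_conf / F_conf <= f_rnbg / F_rnbg:
            continue  # zero denominator, or Eq. (22)'s second condition fails
        if f_conf / f_rnbg > best_ratio:
            best, best_ratio = attr, f_conf / f_rnbg
    return best

# Toy data (assumed): indigestion is far more typical of the confidential
# class than of the next-best-guess class, so it is the value to hide in d_i.
bob = {"indigestion": "Y", "chest_pain": "Y", "diagnosis": NULL}
D = [
    {"indigestion": "Y", "chest_pain": "Y", "diagnosis": "angina pectoris"},
    {"indigestion": "Y", "chest_pain": "N", "diagnosis": "angina pectoris"},
    {"indigestion": "Y", "chest_pain": "N", "diagnosis": "angina pectoris"},
    {"indigestion": "Y", "chest_pain": "Y", "diagnosis": "gastritis"},
    {"indigestion": "N", "chest_pain": "Y", "diagnosis": "gastritis"},
    bob,
]
print(max_impact_attribute_dropp(D, bob, "diagnosis",
                                 "angina pectoris", "gastritis"))  # → indigestion
```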
Definition 3.7 (Maximum Impact Data Value) The maximum impact data value is the instance of the maximum impact attribute α_MI^{α_j[d_i]} in tuple d_i.

It must be noted that the maximum impact data values have a higher probability of occurrence in tuples d ∈ D[α_j[d] = α_j[d_i]] − d_i than in tuples d ∈ D[α_j[d] = v_rnbg]. Therefore, they are the key to decreasing p(α_j[d_i]|d_i) below p(v_rnbg|d_i).

In each iteration, the DROPP algorithm identifies α_MI^{α_j[d_i]} and modifies the tuple d_i by replacing α_MI^{α_j[d_i]}[d_i] with ν until the goal is achieved, that is, until p(α_j[d_i]|d_i) becomes less than p(v_rnbg|d_i). Each such replacement results in the maximum possible reduction in p(α_j[d_i]|d_i) / p(v_rnbg|d_i), thus requiring fewer modifications.

Theorem 3.2 Let α_MI^{α_j[d_i]} be the maximum impact attribute satisfying Eq. (22). Then, every replacement of a maximum impact data value with ν causes the maximum decrease in p(α_j[d_i]|d_i) / p(v_rnbg|d_i), thus resulting in fewer data values to be modified.

Proof Let us first find the effect of replacing a maximum impact data value with ν on p(α_j[d_i]|d_i) and p(v_rnbg|d_i). Remember that, since p(d_i) is the same for all v ∈ V_{α_j}, it can be omitted:

p(v_rnbg|d_i) = p(v_rnbg) p(d_i|v_rnbg) / p(d_i)
  ≅ p(v_rnbg) p(d_i|v_rnbg)
  ≅ p(v_rnbg) p(α_MI^{α_j[d_i]}[d_i]|v_rnbg) × ∏_{α ∈ Λ−{α_j, α_MI^{α_j[d_i]}}} p(α[d_i]|v_rnbg)

Let the size of the microdata set D[α_j[d] = α_j[d_i] ∧ α_MI^{α_j[d_i]}[d] = α_MI^{α_j[d_i]}[d_i]] − d_i be F_{α_j[d_i], α_MI^{α_j[d_i]}} and the size of the microdata set D[α_j[d] = α_j[d_i]] − d_i be F_{α_j[d_i]}. Let the size of the microdata set D[α_j[d] = v_rnbg ∧ α_MI^{α_j[d_i]}[d] = α_MI^{α_j[d_i]}[d_i]] be F_{v_rnbg, α_MI^{α_j[d_i]}} and the size of the microdata set D[α_j[d] = v_rnbg] be F_{v_rnbg}.

Replacement of the maximum impact data value causes p(α_j[d_i]|d_i) / p(v_rnbg|d_i) to decrease by a factor of (F_{v_rnbg} / F_{α_j[d_i]}) × (F_{α_j[d_i], α_MI^{α_j[d_i]}} / F_{v_rnbg, α_MI^{α_j[d_i]}}), as shown below:

p'(α_j[d_i]|d_i) ≅ p(α_j[d_i]) p(d_i|α_j[d_i])
  = p(α_j[d_i]) × ∏_{α ∈ Λ−{α_j, α_MI^{α_j[d_i]}}} p(α[d_i]|α_j[d_i])
  = p(α_j[d_i]|d_i) × F_{α_j[d_i]} / F_{α_j[d_i], α_MI^{α_j[d_i]}}

p'(v_rnbg|d_i) ≅ p(v_rnbg) p(d_i|v_rnbg)
  = p(v_rnbg) × ∏_{α ∈ Λ−{α_j, α_MI^{α_j[d_i]}}} p(α[d_i]|v_rnbg)
  = p(v_rnbg|d_i) × F_{v_rnbg} / F_{v_rnbg, α_MI^{α_j[d_i]}}

p'(α_j[d_i]|d_i) / p'(v_rnbg|d_i) = [p(α_j[d_i]|d_i) × F_{α_j[d_i]} / F_{α_j[d_i], α_MI^{α_j[d_i]}}] / [p(v_rnbg|d_i) × F_{v_rnbg} / F_{v_rnbg, α_MI^{α_j[d_i]}}]

Similarly, assume that replacing the data value of some other attribute α_k with ν caused a larger decrease in this ratio, that is,

(F_{v_rnbg} / F_{α_j[d_i]}) × (F_{α_j[d_i], α_k} / F_{v_rnbg, α_k}) > (F_{v_rnbg} / F_{α_j[d_i]}) × (F_{α_j[d_i], α_MI^{α_j[d_i]}} / F_{v_rnbg, α_MI^{α_j[d_i]}})

which simplifies to

F_{α_j[d_i], α_k} / F_{v_rnbg, α_k} > F_{α_j[d_i], α_MI^{α_j[d_i]}} / F_{v_rnbg, α_MI^{α_j[d_i]}}

However, this contradicts the definition of Maximum Impact Attribute. So, we can conclude that every replacement of a maximum impact data value with ν causes the highest decrease in p(α_j[d_i]|d_i) / p(v_rnbg|d_i), which in turn implies that the number of data values that should be modified is minimal.

The algorithm works as follows: Let α_j[d_i] be confidential. As the first step, the algorithm verifies the need for suppression. It finds p(v|d_i) for all v ∈ V_{α_j} and checks the truth value of Assertion (20). If Assertion (20) is true, it picks a random next best guess v_rnbg from V_{α_j}. Next, in each
Suppressing microdata to prevent classification based inference 397
iteration it finds the maximum impact attribute α_MI^{α_j[d_i]} and replaces the maximum impact data value α_MI^{α_j[d_i]}[d_i] by ν. After each iteration, it re-checks the truth value of Assertion (20) to decide whether to continue execution. If Assertion (20) is still true after all possible maximum impact attributes are processed, it reverts all changes and deletes the tuple d_i from the microdata set. An overview of the algorithm is depicted in Fig. 3.

Lemma 3.3 Let α_j[d_i] be the confidential data value and n be the number of attributes. Then, the upper bound for the number of data values that can be modified by the DROPP algorithm is equal to n − 1.

Proof The proof of this statement is straightforward. The DROPP algorithm modifies just the tuple d_i, which has n − 1 data values excluding the confidential data value. So, we can conclude that the DROPP algorithm can replace at most n − 1 data values with ν for suppressing a confidential data value.

Example 4 Now, let us illustrate how the DROPP algorithm suppresses Bob's confidential diagnosis.

Step 1 Initially, the Naïve Bayesian classification model is constructed to find the probabilities p(v|d_i) for all v ∈ V_{α_j} = {dyspepsia, angina pectoris, gastritis}. The Naïve Bayesian classification model constructed using the medical records of Table 2 is shown in Table 3. According to the model the probabilities are p(dyspepsia|d2) = 0, p(angina pectoris|d2) = 1/6, and p(gastritis|d2) = 1/18.

Step 2 The probability p(angina pectoris|d2) is greater than both p(dyspepsia|d2) and p(gastritis|d2). As Bob's diagnosis can be correctly predicted, the suppression process starts.

Step 3 Let us assume that gastritis is selected as the random next best guess. From this point on, the DROPP algorithm will try to drop p(angina pectoris|d2) below p(gastritis|d2).
Step 4 To select the maximum impact attribute, the following counts and ratios are found:

– count^{angina pectoris}_indigestion = |D[diagnosis[d] = angina pectoris ∧ indigestion[d] = indigestion[d2]]| = 2
– count^{gastritis}_indigestion = |D[diagnosis[d] = gastritis ∧ indigestion[d] = indigestion[d2]]| = 1
– count^{angina pectoris}_chest pain = |D[diagnosis[d] = angina pectoris ∧ chest pain[d] = chest pain[d2]]| = 2
– count^{gastritis}_chest pain = |D[diagnosis[d] = gastritis ∧ chest pain[d] = chest pain[d2]]| = 2
– count^{angina pectoris}_palpitation = |D[diagnosis[d] = angina pectoris ∧ palpitation[d] = palpitation[d2]]| = 3
– count^{gastritis}_palpitation = |D[diagnosis[d] = gastritis ∧ palpitation[d] = palpitation[d2]]| = 2
– ratio_chest pain = 1
– ratio_indigestion = 2
– ratio_palpitation = 3/2

As indigestion has the maximum ratio, it is selected as the maximum impact attribute.

Step 5 The indigestion attribute is hidden from tuple 2. With this replacement p(angina pectoris|d2) increases to 1/4, and p(gastritis|d2) increases to 1/6. As p(angina pectoris|d2) is still greater than p(gastritis|d2), the suppression process continues with the next maximum impact attribute, which is palpitation.

Step 6 The palpitation attribute is hidden from tuple 2. With this replacement p(angina pectoris|d2) remains the same, but p(gastritis|d2) increases to 1/4. As p(angina pectoris|d2) is equal to p(gastritis|d2), the suppression process stops. The resulting microdata can be seen in Table 6.

3.2 Suppression against decision tree classification models

In the following, we present the HID3 algorithm for preventing decision tree classification based inference.

3.2.1 HID3 algorithm

The HID3 algorithm aims at suppressing the confidential data value α_j[d_i] so that the ID3 classification model ς_id3^{D−d_i,α_j} cannot correctly predict its actual value. Similar to the DROPP algorithm, it achieves its goal by downgrading the tuple d_i.

The algorithm works as follows: Let α_j[d_i] be confidential. As the first step, the algorithm builds the decision tree using D−d_i and verifies the need for suppression. If ς_id3^{D−d_i,α_j} can correctly predict the confidential data value, it calls the recursive ID3Hide function. Then, the ID3Hide function checks whether the root node is a leaf. If it is a leaf and its value is different from the confidential data value α_j[d_i], it returns true, which in turn terminates the recursive function successfully. Or else, it returns false. If the root node is not a leaf, then it finds the most probable value v_π ∈ V_{α_j} for α_j[d_i], and checks whether v_π is equal to α_j[d_i]. If the most probable value v_π is not equal to the actual confidential data value α_j[d_i], it returns true. Otherwise, it further explores the child nodes of the root in order to suppress α_j[d_i]. Let the decision attribute of the root node be α_root, the most common child of the root (i.e. the child with the highest training population) be child_MC and the child containing α_root[d_i] be child_Match. If α_root[d_i] = ν or child_Match = child_MC, it tries to suppress the confidential data value using child_MC. Or else, it uses child_Match for suppression. After exploring all possible sub-branches, if the algorithm fails to suppress the confidential data value, it reverts all changes and deletes the tuple d_i from the microdata set. An overview of the algorithm is depicted in Fig. 4.
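The recursive traversal just described can be sketched as follows. This is a much-simplified, hypothetical Python rendering of the ID3Hide idea, not the paper's implementation: the Node class is our own assumption, and the sketch omits the v_π check, the exploration of multiple sub-branches, and the revert-on-failure step.

```python
from dataclasses import dataclass, field

NU = None  # the unknown value "ν"

@dataclass
class Node:
    """Hypothetical decision-tree node: a leaf carries a class label;
    an internal node has a decision attribute and children keyed by value."""
    label: str = None                             # set for leaves
    attr: str = None                              # decision attribute (internal nodes)
    children: dict = field(default_factory=dict)  # attribute value -> Node
    population: int = 0                           # training tuples reaching this node

def id3_hide(node, d_i, conf_value):
    """Return True once the path taken by d_i no longer predicts conf_value,
    hiding d_i's decision-attribute values along the way.  Assumes d_i's
    value appears among the children when it is not hidden."""
    if node.attr is None:                  # leaf: success iff it predicts another value
        return node.label != conf_value
    child_mc = max(node.children.values(), key=lambda c: c.population)
    child_match = node.children.get(d_i.get(node.attr))
    if d_i.get(node.attr) is NU or child_match is child_mc:
        branch = child_mc                  # use the most common child
    else:
        branch = child_match               # otherwise follow d_i's own branch
    d_i[node.attr] = NU                    # hide the decision-attribute value
    return id3_hide(branch, d_i, conf_value)
```

On the tiny tree of Step 8's flavor (a palpitation test whose Y branch leads to a gastritis leaf), the sketch hides the palpitation value and reports success because the remaining path no longer predicts the confidential diagnosis.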
Step 8 Starting from the subtree root = node 4, the ID3Hide function checks whether it is possible to correctly predict Bob's diagnosis. Since Bob's diagnosis cannot be correctly predicted using the path palpitation = Y, the suppression process stops. The resulting microdata can be seen in Table 7.

3.3 Suppression of multiple confidential data values

In the following, we present the enhanced versions of the DECP and DROPP algorithms for preventing probabilistic classification based inference. The proposed algorithms aim to reduce the side-effects while suppressing multiple confidential data values.

3.3.1 e-DECP algorithm

The enhanced DECP algorithm aims at suppressing multiple confidential data values at a time so that none of them can be correctly predicted by the downgraded classification model ς_nb^{D,α_j}. The proposed algorithm reduces the side-effects of the original DECP algorithm when (1) all confidential data values belong to a single attribute, and (2) all confidential data values have the same value. The generic case, which handles the suppression of multiple confidential values that belong to different attributes, requires exhaustive modeling of dependencies and will be investigated as part of our future work.

The algorithm works as follows: Let α_j be the confidential attribute, and let S ⊂ D be the set of tuples for which α_j, satisfying the constraint α_j[d] = conf_value for all d ∈ S, is confidential. As the first step, the algorithm replaces all confidential data values with ν. Then, it identifies the candidate maximum impact data values and initializes their primary and secondary impacts. The primary impact is the number of tuples which will be affected (i.e. the probabilities will be affected) if an instance of the maximum impact data value is replaced with ν. The secondary impact, on the other hand, is the number of tuples that support both the confidential data value (i.e. α_j = conf_value) and the maximum impact data value. Next, for each tuple d ∈ S, the need for suppression is verified by finding p(v|d) for all v ∈ V_{α_j} and checking the truth value of the following assertion:

p(α_j[d]|d) > p(v|d), ∀v ∈ V_{α_j} − α_j[d]   (23)

If Assertion (23) is true for a tuple d ∈ S, it picks a random next best guess v^d_rnbg from V_{α_j}. Next, the candidate maximum impact data values are sorted. Different from the original DECP, which uses only the secondary impact to determine which maximum impact data value to use, e-DECP also uses the primary impact in order to guarantee suppression of the maximum number of confidential data values within a single iteration. With the maximum impact values sorted, the rest of the execution is quite similar to the original DECP, which involves replacement of maximum impact data value instances, calculation of probabilities, and re-checking of Assertion (23). An overview of the algorithm is depicted in Fig. 6a.

3.3.2 e-DROPP algorithm

The enhanced DROPP algorithm aims at suppressing multiple confidential data values at a time so that none of them can be correctly predicted by the corresponding classification models ς_nb^{D,α}. The proposed algorithm reduces the side-effects of the original DROPP algorithm when all confidential data values belong to a single tuple. The generic case, which handles the suppression of multiple confidential values that belong to multiple tuples, requires exhaustive modeling of dependencies and will be investigated as part of our future work.

The algorithm works as follows: Let d_i be the tuple containing multiple confidential data values, and S be the set of attributes containing a confidential data value in d_i. As the first step, the algorithm verifies the need for suppression for each confidential data value. More specifically, for each α ∈ S, it finds p(v|d_i) where v ∈ V_α and checks the truth
Fig. 6 Pseudocode of e-DECP and e-DROPP algorithms. a Pseudocode of e-DECP algorithm. b Pseudocode of e-DROPP algorithm

value of the following assertion:

p(α[d_i]|d_i) > p(v|d_i), ∀v ∈ V_α − α[d_i]   (24)

If Assertion (24) is true, it picks a random next best guess v^α_rnbg from V_α. Next, it identifies the candidate maximum impact data values and initializes their impacts on each confidential value (see Eq. (22) for the calculation of impact). To identify the maximum impact data value in each iteration, the impacts of the candidates are averaged and sorted. With the maximum impact values sorted, the rest of the execution is quite similar to the original DROPP,
which involves replacement of maximum impact data value instances from d_i, calculation of probabilities, and re-checking of Assertion (24). An overview of the algorithm is depicted in Fig. 6b.

4 Discussion on the effectiveness of suppression algorithms

The motivation of the suppression algorithms presented in this paper is to make a given set of confidential data values non-discoverable, while minimizing the effect on the usefulness of the data for purposes other than predicting the confidential data values. But how can we make sure that an adversary would not be able to predict the suppressed confidential data values? Certainly this might be a problem if randomization is not employed in various stages of the algorithms. Let us assume that an adversary knows not only D′, the transformed microdata set, but also V_{α_j}, the domain of the confidential data value α_j[d_i], and analyze how randomization avoids prediction of the actual confidential data value.

First, let us assume that a modified version of DECP is used in order to suppress the confidential data value α_j[d_i]. This version of DECP aims at decreasing p(α_j[d_i]|d_i) below that of the next best guess v_nbg instead of v_rnbg.

Definition 4.1 (Next Best Guess) The next best guess, v_nbg ∈ V_{α_j}, is a randomly selected value from V_{α_j} satisfying the following conditions:

i. It is different from α_j[d_i],

   v_nbg ≠ α_j[d_i]   (25)

ii. It is among the top-2 probable set,

   v_nbg ∈ Ω_2^{α_j[d_i]}   (26)

iii. The probability of the α_j-th attribute of d_i being equal to v_nbg is smaller than that of the confidential data value α_j[d_i] and greater than zero,

   p(α_j[d_i]|d_i) > p(v_nbg|d_i) > 0   (27)

This leads to a change in the ordering of the top-2 probable set Ω_2^{α_j[d_i]} = {α_j[d_i], v_nbg}. Knowing this fact, an adversary can predict the actual confidential data value to be the one with the second highest probability in Ω_2^{α_j[d_i]} with a confidence equal to the success rate of the algorithm. That is to say, if the success rate of the algorithm is 100%, then the adversary can predict the actual confidential data value with 100% confidence. This problem exists not only in DECP but also in the INCP and DROPP algorithms. Therefore, the random next best guess is employed during suppression in order to reduce the confidence of an adversary predicting the actual confidential value, as shown below:

Confidence = Success Rate / k   (28)

The second issue, inherent in all suppression algorithms, occurs when |V_{α_j}| = 2. Let us assume that the decision to suppress the confidential data value α_j[d_i] is not randomized when |V_{α_j}| = 2. In this case, the algorithms will try to suppress the confidential data value with the maximum possible success rate. Knowing this fact, an adversary can predict the actual confidential data value to be the one with the second highest probability in V_{α_j} with a confidence equal to the success rate of the algorithm. In order to avoid such attacks, we randomly decide whether to suppress a confidential data value for microdata sets with |V_{α_j}| = 2.

Another issue is the effectiveness of the suppression algorithms against different classification models. Remember that two of the proposed algorithms, the DECP and INCP algorithms, aim at downgrading the classification model by modifying D − d_i. In the first method, the probability of resemblance of the tuple containing the confidential data value to other tuples d ∈ D satisfying α_j[d] = α_j[d_i] is reduced. And, in the latter method, the probability of resemblance of the tuple containing the confidential data value to the tuples d ∈ D satisfying α_j[d] ≠ α_j[d_i] is increased. On the other hand, the DROPP and HID3 algorithms aim at downgrading the microdata tuple containing the confidential data value. Both methods find the attributes that enable correct prediction of the actual confidential value and hide them from the tuple containing the confidential data value. As a result, the probability of similarity of the tuple containing the confidential data value to the other tuples d ∈ D satisfying α_j[d] = α_j[d_i] is reduced. Since all classification methods tend to find the target attribute value of a tuple based on its resemblance to other tuples in the training data set, the proposed suppression algorithms are expected to achieve their goal even when used with other classification methods. In order to verify this, we measured the effectiveness of each algorithm against both Naïve Bayesian and ID3 classification. The results can be found in Sect. 5.

The final issue that needs to be discussed is the side effect of the proposed algorithms, which is related to the number of attribute values hidden excluding the confidential data value. Remember that, for each suppression algorithm, we derived an upper bound for the number of attribute values that will be modified in the previous section. According to these derivations we can conclude the following:

i. The upper bound for the number of data values that can be modified by the INCP algorithm depends on m, the number of tuples in D,
ii. the upper bound for the number of data values that can be modified by the DROPP and HID3 algorithms depends on n, the number of attributes in D,
iii. the upper bound for the number of data values that can be modified by the DECP algorithm depends on n ∗ m, the number of attributes in D times the number of tuples in D.

Now, let us assume that m >> n. In this case, the worst case performance of the DROPP and HID3 algorithms should be much better than the worst case performance of the DECP and INCP algorithms with respect to side effects. However, for data sets satisfying n >> m, e.g., gene expression data, the worst case performance of the INCP algorithm will outperform the DECP, DROPP, and HID3 algorithms with respect to the side effects. Note that the DECP algorithm will perform slightly worse than the other algorithms it is grouped with, as in both cases either m or n loses its significance with respect to the other term.

5 Experimental results

This section presents the experimental results. The primary objective of the experiments is to compare the suppression algorithms in terms of CPU time performance, rate of success, information loss, and uncertainty.

5.1 Data sets and implementation details

In order to conduct the experiments we selected two data sets from the University of California at Irvine repository [28], the Wisconsin Breast Cancer data set [24] and the Car Evaluation data set. Table 8 provides a description of the data sets including the number of instances, the number of attributes, and the number of unknowns.

We implemented the proposed algorithms using the C++ programming language. To evaluate the performance of the algorithms, we performed experiments on a 2.20 GHz Celeron PC with 256 MB of memory running the Windows operating system. As the suppression algorithms contain random components, the experimental results presented are averages of five realizations unless stated otherwise. Moreover, in order to illustrate the power of the algorithms we chose to suppress confidential data values for which the domain size of the corresponding attribute is greater than 2.

Fig. 7 Average execution times of proposed algorithms

5.2 Results and analysis of algorithms

In this study, we first measured the average execution times^8 required to suppress a confidential data value. The results, as depicted in Fig. 7, show that the suppression algorithms performed remarkably similarly with respect to execution time.

Another performance criterion is the percent of successful suppressions.^9 For each suppression algorithm, we first measured the percent of successful suppressions against the algorithm's primary^10 classification model. As illustrated in Fig. 5a, b, the proposed algorithms successfully suppressed all confidential data values with respect to their primary classification model.

Next, we investigated the correctness of the following hypotheses:

Hypothesis 5.1 All proposed algorithms suppressing confidential data values against probabilistic classification models also block decision tree classification based inference.

Hypothesis 5.2 All proposed algorithms suppressing confidential data values against decision tree classification models also block probabilistic classification based inference.

Hypothesis 5.3 All proposed algorithms suppressing confidential data values also block inference based on more complex classification models (e.g. SVM).

^8 In order to find the average execution times, we suppressed a data value from each instance of the data sets and averaged the CPU time results.
^9 A successful suppression implies that the confidential data value is suppressed without deleting the microdata tuple containing it.
^10 Naïve Bayesian classification model for DECP, INCP and DROPP algorithms, and ID3 classification model for HID3 algorithm.
Fig. 9 Total direct distance results of proposed algorithms. a Car evaluation data set. b Wisconsin breast cancer data set
Fig. 10 Sum of Kullback–Leibler distance results of proposed algorithms. a Car evaluation data set. b Wisconsin breast cancer data set
of the original and modified data sets. The performance of suppression algorithms in terms of average change in mutual information is shown in Fig. 11. The results show that (1) the HID3 algorithm causes the least amount of change in the correlations within the data sets, followed by the DROPP, NRD, INCP, and DECP algorithms, and (2) the DECP and INCP algorithms prevent inference of confidential data values better than the DROPP and HID3 algorithms against different classification algorithms, as they distort correlations within the data sets more.

The final performance criterion is the uncertainty introduced by the suppression algorithms. We used the sum of conditional entropies in order to measure the expected value of uncertainty introduced into the modified data sets. The performance of suppression algorithms in terms of sum of conditional entropies is shown in Fig. 12. The results show that (1) the HID3 algorithm introduces the least amount of uncertainty in the modified data sets, followed by the DROPP, NRD, INCP, and DECP algorithms, and (2) the DECP and INCP algorithms prevent inference of confidential data values better than the DROPP and HID3 algorithms against different classification algorithms, as they cause more uncertainty within the data sets.

We can summarize the presented experimental results as follows:

1. There is a tradeoff between the rate of successful suppressions and the information loss caused by the suppression process.
2. The DECP algorithm achieves the highest success rate while causing the highest amount of information loss and uncertainty. This justifies Lemma 3.1, which states that the upper bound for the number of data values that can be modified by the DECP algorithm is equal to (n − 1)(N − 1) < nm.
3. The INCP algorithm achieves the second highest success rate while causing the second highest information loss and uncertainty. It is followed by the DROPP and HID3 algorithms. This ordering is completely due to (1) the characteristics of the Wisconsin Breast Cancer and Car Evaluation data sets satisfying the inequality m >> n, i.e., the number of transactions is much more than the number of attributes, and (2) the upper bounds for the number of data values that can be modified by the algorithms. With a data set satisfying the inequality n >> m, the ordering for success, information loss,
Fig. 11 Average change in mutual information results of proposed algorithms. a Car evaluation data set. b Wisconsin breast cancer data set
Fig. 12 Sum of conditional entropy results of proposed algorithms. a Car evaluation data set. b Wisconsin breast cancer data set
of employing different clustering algorithms [37]. Different from the MSP problem, microaggregation assumes all respondents contributing to the microdata set have the same privacy preferences. It is meaningful to use microaggregation in such a setting, where the sensitive attributes are the same for all respondents. Nevertheless, if respondents' privacy preferences differ, then unnecessary attribute values will be generalized, which will result in more information loss.

The security and privacy issues arising from the inference problem, which results in private-sensitive data being inferred from public-insensitive data, have also been investigated by multilevel secure databases research [44–48] and general purpose databases research [49–52]. Methods proposed within the database context mainly focus on detection and removal of meta-data, i.e., database constraints like functional and multi-valued dependencies, based inferences either during database design [47–49,53] or during query time [46,54,55]. However, they do not take into account the statistical correlations among database attributes which can be discovered by various data mining techniques and hence result in imprecise inferences like the rule 'A implies B' with 25% confidence.

There are also other approaches investigating the privacy issues arising during microdata disclosure within the scope of the anonymization problem. K-anonymity [11–13], being one of those approaches, aims at preserving anonymity during the data dissemination process using generalizations and suppressions on potentially identifying portions of the data set. Other approaches addressing the anonymization problem include [15,16,56,57]. In his work [15], Iyengar uses suppression and generalization approaches to satisfy privacy constraints. Moreover, he examines the tradeoff between privacy and information loss within different data usage contexts and proposes a genetic algorithm to find the optimal anonymization. On the other hand, in [16] Ohrn uses boolean reasoning, and in [56,57] Ferrer uses microaggregation to address the anonymization problem. Although these approaches successfully preserve privacy through anonymization, none of them addresses the inference threat to privacy due to data mining approaches. Therefore, they do not directly apply to MSP. Moreover, similar to microaggregation, anonymization approaches assume that each respondent contributing to the microdata set has the same privacy preferences, i.e., wants to be anonymous, which is not realistic.

Another approach, proposed by Wang et al. [58], addresses the threats caused by data mining abilities using a template-based approach. The proposed approach aims to (1) preserve the information for a wanted classification analysis and (2) limit the usefulness of unwanted sensitive inferences, i.e., classification rules, that may be derived from the data. More specifically, it focuses on suppressing sensitive rules instead of sensitive data values.

The work closest to ours is proposed by Chang et al. [59]. In his work, Chang proposes a new paradigm for dealing with the inference problem, which combines the application of decision tree analysis with the concept of parsimonious downgrading. He shows how classification models can be used to predict suppressed confidential data values and concludes that some feedback mechanism is needed to protect suppressed data values against classification models.

7 Conclusion

In this paper we pointed out the possible privacy breaches induced by data mining algorithms on hidden microdata values. We considered two classification models that could be used for prediction purposes by adversaries. As an initial step to attack the problem, we proposed six heuristics to suppress the selected confidential data values so that they cannot be inferred using probabilistic and decision tree classification models. Our methods are based on modifying the original microdata set by inserting unknown values with as little damage as possible. Naïve Bayesian and ID3 classification models were selected as representatives of the probabilistic and decision tree classification models, respectively. Our experiments with real data sets showed that the proposed algorithms are effective in blocking the inference channels that are based both on their target classification models and on their secondary classification model. And this verifies our statement that each of the suppression algorithms will also prevent inference when used with other classification methods. Moreover, we measured the side effects of the algorithms using different metrics and observed that there is a tradeoff between the rate of successful suppressions and the information loss caused by the suppression process. This observation, together with the experimental results, verifies our discussion on the side effects of the suppression algorithms:

– Side effects of the DECP algorithm depend both on the number of attributes and the number of transactions, ensuring that the success rate of the DECP algorithm will always be higher than the other algorithms in any situation.
– Side effects of the INCP algorithm depend on the number of transactions, ensuring that the success rate of the INCP algorithm will always be higher than the other algorithms if the number of transactions is higher than the number of attributes.
– Side effects of the DROPP and HID3 algorithms depend on the number of attributes, ensuring that the success rate of these algorithms will always be higher than the other algorithms if the number of attributes is higher than the number of transactions.
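The side-effect bounds restated in the bullets above can be compared with a small helper. A hypothetical Python sketch in which the INCP and DECP bounds are order-of-magnitude stand-ins (m and n·m) rather than the paper's exact constants:

```python
def preferred_algorithms(n, m):
    """Rank the suppression heuristics by worst-case side effects (number
    of non-confidential data values modified).  DROPP/HID3 use the n - 1
    bound of Lemma 3.3; the INCP and DECP entries are order-of-magnitude
    stand-ins for bounds that depend on m and on n * m, respectively."""
    bounds = {
        "DROPP": n - 1,
        "HID3": n - 1,
        "INCP": m,
        "DECP": n * m,
    }
    return sorted(bounds, key=bounds.get)
```

Under m >> n (e.g. the UCI data sets used in Sect. 5) the helper ranks DROPP and HID3 first, while under n >> m (e.g. gene expression data) INCP comes out ahead, matching the discussion in Sect. 4.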
Next, to increase the success of the overall suppression process with respect to both classification models, we used a hybrid approach which suppresses against both classification models. The experimental results for the hybrid algorithms showed that the success rate with respect to both classification methods is 100%, meaning all confidential data values are successfully suppressed.

Finally, to decrease the side-effects of the overall suppression process, when there are multiple confidential data values to suppress, we used the enhanced heuristics, e-DECP and e-DROPP. The experimental results for the enhanced algorithms showed that the side-effects can be reduced by more than 50%.

As part of our future work, we plan to investigate the following:

– Suppressing confidential data values against other classification algorithms, e.g., logistic regression,
– Suppressing multiple confidential data values at a time (generic version having no constraints),
– Developing a generic suppression technique, independent from individual classification methods, based on information theory,
– Using generalization as a fine-grained method, and
– Suppressing evolving (i.e., continuously updated) microdata.

References

1. Wikipedia: Privacy—Wikipedia, the free encyclopedia. Available at http://en.wikipedia.org/wiki/Privacy (2005)
2. Report to Congress regarding the Terrorism Information Awareness Program, May 20, 2003
3. O'Leary, D.E.: Knowledge discovery as a threat to database security. In: Piatetsky-Shapiro, G., Frawley, W. (eds.) Knowledge Discovery in Databases, pp. 507–516. AAAI Press/MIT Press, Menlo Park, California (1991)
4. O'Leary, D.E.: Some privacy issues in knowledge discovery: the OECD personal privacy guidelines. IEEE Expert: Intelligent Syst. Appl. 10(2), 48–52 (1995)
5. Klosgen, W.: Knowledge discovery in databases and data privacy. IEEE Expert, April 1995
6. Piatetsky-Shapiro, G.: Knowledge discovery in databases vs. personal privacy. IEEE Expert, April 1995
7. Selfridge, P.: Privacy and knowledge discovery in databases. IEEE Expert, April 1995
8. Azgın Hintoǧlu, A., Saygın, Y.: Suppressing microdata to prevent probabilistic classification based inference. In: Proceedings of the Workshop on Secure Data Management (SDM'05) (2005)
9. Cox, L.H.: Suppression methodology and statistical disclosure control. J. Am. Stat. Assoc. 75(370), 377–385 (1980)
10. Sande, G.: Automated cell suppression to preserve confidentiality of business statistics. In: Proceedings of the 2nd International Workshop on Statistical Database Management, pp. 346–353 (1983)
11. Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. IEEE Symposium on Research in Security and Privacy (1998)
12. Samarati, P.: Protecting respondents' identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 1010–1027 (2001)
13. Sweeney, L.: k-Anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl. Syst. 10(5), 557–570 (2002)
14. USC Annenberg School—Center for the Digital Future: The Highlights of the Digital Future Report, Year Five, Ten Years Ten Trends. Available at http://www.digitalcenter.org/pdf/Center-for-the-Digital-Future-2005-Highlights.pdf
15. Iyengar, V.S.: Transforming data to satisfy privacy constraints. SIGKDD (2002)
16. Øhrn, A., Ohno-Machado, L.: Using boolean reasoning to anonymize databases. Artif. Intell. Med. 15(3), 235–254 (1999)
17. Nissenbaum, H.: Protecting privacy in an information age: the problem of privacy in public. Law Philos. 17, 559–596 (1998)
18. Sweeney, L.: Information explosion. In: Zayatz, L., Doyle, P., Theeuwes, J., Lane, J. (eds.) Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Urban Institute, Washington, DC (2001)
19. Dreiseitl, S., Vinterbo, S., Ohno-Machado, L.: Disambiguation data: extracting information from anonymized sources. In: Proceedings of the 2001 American Medical Informatics Annual Symposium, pp. 144–148 (2001)
20. Aggarwal, C.: On k-anonymity and the curse of dimensionality. In: Proceedings of the 31st VLDB Conference (2005)
21. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: ℓ-Diversity: privacy beyond k-anonymity. In: Proceedings of the 22nd IEEE International Conference on Data Engineering (2006)
22. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (1991)
23. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLSummary.html
24. Mangasarian, O.L., Wolberg, W.H.: Cancer diagnosis via linear programming. SIAM News 23(5), 1–18 (1990)
25. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res. 6, 1453–1484 (2005)
26. Adam, N.R., Wortmann, J.C.: Security-control methods for statistical databases: a comparative study. ACM Comput. Surv. 21(4), 515–556 (1989)
27. Denning, D.E.: Cryptography and Data Security. Addison-Wesley (1982)
28. Domingo-Ferrer, J. (ed.): Inference Control in Statistical Databases. Lecture Notes in Computer Science, vol. 2316. Springer-Verlag, Berlin (2002)
29. Farkas, C., Jajodia, S.: The inference problem: a survey. SIGKDD Explorations (2003)
30. Geurts, J.: Heuristics for cell suppression in tables. Technical Paper, Netherlands Central Bureau of Statistics (1992)
31. Kao, M.Y.: Data security equals graph connectivity. SIAM J. Discret. Math. 9, 87–100 (1996)
32. Kelly, J.P., Golden, B.L., Assad, A.A.: Cell suppression: disclosure protection for sensitive tabular data. Networks 22, 397–417 (1992)
33. Fischetti, M., Salazar, J.J.: Models and algorithms for the 2-dimensional cell suppression problem in statistical disclosure control. Math. Program. 84, 283–312 (1999)
34. Fischetti, M., Salazar, J.J.: Models and algorithms for optimizing cell suppression in tabular data with linear constraints. J. Am. Stat. Assoc. 95(451), 916–928 (2000)
35. Willenborg, L., De Waal, T.: Statistical Disclosure Control in Practice. Lecture Notes in Statistics, vol. 111. Springer-Verlag, New York (1996)
36. Domingo-Ferrer, J., Torra, V.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans. Knowl. Data Eng. 14(1), 189–201 (2002)
37. Torra, V.: Microaggregation for categorical variables: a median based approach. In: Domingo-Ferrer, J., Torra, V. (eds.) Privacy in Statistical Databases, vol. 3050, pp. 162–174 (2004)
38. Oganian, A., Domingo-Ferrer, J.: On the complexity of optimal microaggregation for statistical disclosure control. Stat. J. U.N. Econ. Comm. Eur. 18(4), 345–354 (2001)
39. Solanas, A., Martinez-Balleste, A., Mateo-Sanz, J.M., Domingo-Ferrer, J.: Towards microaggregation with genetic algorithms. In: Proceedings of the Third IEEE Conference on Intelligent Systems, pp. 65–70 (2006)
40. Martinez-Balleste, A., Solanas, A., Domingo-Ferrer, J., Mateo-Sanz, J.M.: A genetic approach to multivariate microaggregation for database privacy. In: Proceedings of the 23rd IEEE International Conference on Data Engineering, pp. 180–185 (2007)
41. Hansen, S.L., Mukherjee, S.: A polynomial algorithm for optimal univariate microaggregation. IEEE Trans. Knowl. Data Eng. 15(4), 1043–1044 (2003)
42. Laszlo, M., Mukherjee, S.: Minimum spanning tree partitioning algorithm for microaggregation. IEEE Trans. Knowl. Data Eng. 17(7), 902–911 (2005)
43. Sande, G.: Exact and approximate methods for data directed microaggregation in one or more dimensions. Int. J. Uncertain. Fuzziness Knowl. Syst. 10(5), 459–476 (2002)
44. Jajodia, S., Meadows, C.: Inference problems in multilevel secure database management systems. In: Abrams, M.D., Jajodia, S., Podell, H.J. (eds.) Information Security—An Integrated Collection of Essays, pp. 570–584. IEEE C. S. Press (1989)
45. Qian, X., Stickel, M.E., Karp, P.D., Lunt, T.F., Garvey, T.D.: Detection and elimination of inference channels in multilevel relational database systems. In: Proceedings of IEEE Symposium on Security and Privacy, pp. 196–205 (1993)
46. Stachour, P., Thuraisingham, B.: Design of LDV: a multilevel secure relational database management system. IEEE Trans. Knowl. Data Eng. 2(2), 190–209 (1990)
47. Su, T., Ozsoyoglu, G.: Inference in MLS database systems. IEEE Trans. Knowl. Data Eng. 3(2–3), 147–168 (1991)
48. Marks, D.: Inference in MLS database systems. IEEE Trans. Knowl. Data Eng. 8(1), 46–55 (1996)
49. Delugach, H., Hinke, T.: Wizard: a database inference analysis and detection system. IEEE Trans. Knowl. Data Eng. 8(1), 56–66 (1996)
50. Hinke, T., Delugach, H., Wolf, R.P.: Protecting databases from inference attacks. Comput. Secur. 16(8), 687–708 (1997)
51. Dawson, S., di Vimercati, S.D.C., Lincoln, P., Samarati, P.: Minimal data upgrading to prevent inference and association. In: Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 114–125. ACM Press (1999)
52. Brodsky, A., Farkas, C., Jajodia, S.: Secure databases: constraints, inference channels and monitoring disclosure. IEEE Trans. Knowl. Data Eng. 12(6), 900–919 (2000)
53. Hinke, T.H., Delugach, H.S., Chandrasekhar, A.: A fast algorithm for detecting second paths in database inference analysis. J. Comput. Secur. 3(2–3), 147–168 (1995)
54. Denning, D.: Commutative filters for reducing inference threats in multilevel database systems. In: Proceedings of IEEE Symposium on Security and Privacy, pp. 134–146 (1985)
55. Thuraisingham, B.: Security checking in relational database management systems augmented with inference engines. Comput. Secur. 6, 479–492 (1987)
56. Domingo-Ferrer, J., Torra, V.: Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min. Knowl. Discov. 11(2), 195–212 (2005)
57. Domingo-Ferrer, J., Solanas, A., Martinez-Balleste, A.: Privacy in statistical databases: k-anonymity through microaggregation. In: Proceedings of IEEE Granular Computing (2006)
58. Wang, K., Fung, B.C.M., Yu, P.S.: Template-based privacy preservation in classification problems. In: ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 466–473 (2005)
59. Chang, L., Moskowitz, I.S.: Parsimonious downgrading and decision trees applied to the inference problem. In: Proceedings of the Workshop of New Security Paradigms, pp. 82–89 (1999)