
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1

A Method for Mining Infrequent Causal Associations and Its Application in Finding
Adverse Drug Reaction Signal Pairs
Yanqing Ji, Hao Ying, Fellow, IEEE, John Tran, Peter Dews, Ayman Mansour, R. Michael
Massanari
Abstract: In many real-world applications, it is important to mine causal relationships where an event or event pattern causes
certain outcomes with low probability. Discovering this kind of causal relationship can help us prevent or correct negative
outcomes caused by their antecedents. In this paper, we propose an innovative data mining framework and apply it to mine
potential causal associations in electronic patient datasets where the drug-related events of interest occur infrequently.
Specifically, we created a novel interestingness measure, exclusive causal-leverage, based on a computational, fuzzy
recognition-primed decision (RPD) model that we previously developed. On the basis of this new measure, a data mining
algorithm was developed to mine the causal relationship between drugs and their associated adverse drug reactions (ADRs).
The algorithm was tested on real patient data retrieved from the Veterans Affairs Medical Center in Detroit, Michigan. The
retrieved data included 16,206 patients (15,605 male, 601 female). The exclusive causal-leverage was employed to rank the
potential causal associations between each of the three selected drugs (i.e., enalapril, pravastatin and rosuvastatin) and 3,954
recorded symptoms, each of which corresponded to a potential ADR. The top 10 drug-symptom pairs for each drug were
evaluated by the physicians on our project team. The numbers of symptoms considered as likely real ADRs for enalapril,
pravastatin and rosuvastatin were 8, 7, and 6, respectively. These preliminary results indicate the usefulness of our method in
finding potential ADR signal pairs for further analysis (e.g., epidemiology study) and investigation (e.g., case review) by drug
safety professionals.
Index Terms: Adverse drug reactions, association rules, data mining algorithms, interestingness measure, recognition-primed
decision model.

1 INTRODUCTION
Finding causal associations between two events or sets of events with relatively low
frequency is very useful for various real-world applications. For example, a drug used
at an appropriate dose may cause one or more adverse drug reactions (ADRs), although
the probability is low. Discovering this kind of causal relationship can help us prevent
or correct negative outcomes caused by their antecedents. However, mining these
relationships is challenging due to the difficulty of capturing causality among events
and the infrequent nature of the events of interest in these applications.
In this paper, we employ a knowledge-based approach to capture the degree of causality
of an event pair within each sequence, since the determination of causality is often
ultimately application or domain dependent. We then develop an interestingness measure
that incorporates the causalities across all the sequences in a database. Our study was
motivated by the need to discover ADR signals in postmarketing surveillance, even though
the proposed framework can be applied to many different applications. ADRs represent a
serious worldwide problem [1, 2]. They can complicate a patient's medical condition or
contribute to increased morbidity, even death. Studies have shown that ADRs contribute
to about 5% of all hospital admissions and represent the fifth most common cause of
death in hospitals [3].
Even though premarketing clinical trials are required for all new drugs before they are
approved for marketing, these trials are necessarily limited in sample size and
duration, and thus are not capable of detecting rare ADRs. In general, an ADR cannot be
recognized by these trials if its occurrence rate is less than 0.1% [1]. Therefore, drug
safety depends heavily on postmarketing surveillance; that is, the monitoring of impacts
of medicines once they have been made available to consumers. In the U.S., current
postmarketing surveillance methods primarily rely on the FDA's spontaneous reporting
system MedWatch(TM). Because ADR reports are filed at the discretion of the users of the
system, there is gross underreporting

- Y. Ji is with the Department of Electrical and Computer Engineering, Gonzaga
  University, Spokane, WA 99258, USA. Email: ji@gonzaga.edu.
- H. Ying is with the Department of Electrical and Computer Engineering, Wayne State
  University, Detroit, MI 48202, USA. Email: hao.ying@wayne.edu.
- J. Tran is with Spokane Mental Health, Spokane, WA 99202, USA. Email: jtran@smhca.org.
- P. Dews is with the Department of Medicine, St. Mary Mercy Hospital (Trinity Health),
  Livonia, MI 48154, USA. Email: dewsp@trinityealth.org.
- A. Mansour is with the Department of Electrical and Computer Engineering, Wayne State
  University, Detroit, MI 48202, USA. Email: el_akaber@yahoo.com.
- R. Michael Massanari is with the Research for the Critical Junctures Institute,
  Bellingham, WA 98225, USA. Email: Michael.Massanari@wwu.edu.
Digital Object Identifier 10.1109/TKDE.2012.28 1041-4347/12/$31.00 © 2012 IEEE
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
[4, 5]. Consequently, the current approach may require years to identify and withdraw
problematic drugs from the market, resulting in unnecessary mortality, morbidity, and
healthcare cost. Studies have shown that only half of newly discovered serious ADRs are
detected and documented in the Physicians' Desk Reference within 7 years after drug
approval [6].
Systematic methods for the detection of suspected safety problems from spontaneous
reports have been studied and practically implemented. For example, the FDA currently
adopts a data mining algorithm called Multi-item Gamma Poisson Shrinker [7] for
detecting potential signals from its spontaneous reports. Another important signal
detection strategy is the Bayesian Confidence Propagation Neural Network, which has been
used by the Uppsala Monitoring Center in routine pharmacovigilance with its World Health
Organization database [8]. Various other methods such as proportional reporting ratios
[9], empirical Bayes screening [10], and reporting odds ratios [11] have been used in
the spontaneous reporting centers of other nations (e.g., England and Australia). These
methods have shown better performance than traditional methods. However, the performance
of these techniques could be highly situation-dependent due to the weaknesses and
potential biases inherent in spontaneous reporting [12].
As electronic patient records become more and more easily accessible in various health
organizations such as hospitals, medical centers, and insurance companies, they provide
a new source of information that has great potential to generate ADR signals much
earlier [13, 14]. Note that each patient case can be considered as an event sequence
where various events such as drug prescription, occurrence of a symptom and lab test
occur at different times. In the literature, there exist a couple of studies [14, 15]
that attempted to find the associations between drugs and potential ADRs by mining their
temporal relationships. That is, they tried to mine temporal association rules
(represented as $X \xrightarrow{T} Y$) where $Y$ occurs after $X$ within a time window
of length $T$. These studies obtained promising results based on administrative health
data. However, temporal association was the only parameter used for linking a symptom
with a drug within each patient case in their work. Temporal association assumes that
cause precedes effect. Other parameters such as dechallenge and rechallenge can also
give direct or indirect cues of the potential causal association of a drug-symptom pair.
Dechallenge is defined as the relationship between withdrawal of the drug and abatement
of the adverse effect. Rechallenge describes the relationship between reintroduction of
the drug and recurrence of the adverse event. In addition, their approaches suffer from
the sharp boundary problem. On the one hand, the symptom events near the time boundaries
are either ignored or overemphasized. On the other hand, two symptom events contribute
equally to the interestingness measure as long as they occur within the hazard period
$T$. That is, the length of time between exposure to the drug and occurrence of the
symptom has no effect on the interestingness measure. This is not true in reality
because if an ADR symptom occurs within a shorter period, it is usually more likely to
be caused by the drug.
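The sharp boundary problem can be illustrated with a minimal Python sketch. A crisp rule counts a symptom fully if it falls inside the hazard period and not at all otherwise, whereas a graded score changes smoothly with the delay. The linear decay, the function names, and the 30-day hazard period are illustrative assumptions only; they are not the fuzzy membership functions used in the fuzzy RPD model.

```python
def crisp_temporal_match(delay_days: float, hazard_period: float) -> float:
    """Return 1.0 if the symptom occurs within the hazard period after the drug, else 0.0."""
    return 1.0 if 0 <= delay_days <= hazard_period else 0.0


def graded_temporal_match(delay_days: float, hazard_period: float) -> float:
    """Hypothetical graded score: 1.0 at delay 0, falling linearly to 0.0 at 2 * hazard_period."""
    if delay_days < 0:
        return 0.0
    return max(0.0, 1.0 - delay_days / (2.0 * hazard_period))


T = 30.0  # assumed hazard period in days
for delay in (1.0, 29.0, 31.0):
    # Delays of 29 and 31 days get opposite crisp values but nearly equal graded values.
    print(delay, crisp_temporal_match(delay, T), round(graded_temporal_match(delay, T), 3))
```

With the crisp rule, delays of 1 and 29 days contribute equally while 31 days contributes nothing; the graded score separates 1 from 29 days and treats 29 and 31 days almost alike.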
To mine infrequent causal associations more effectively, it is necessary to develop a
new data mining framework. The current paper is a substantial extension of our previous
work [16, 17], where an interestingness measure called causal-leverage was developed on
the basis of a computational fuzzy recognition-primed decision (RPD) model [18] that we
previously developed. In the present paper, we focus on mining infrequent causal
associations. We significantly extended our previous work in both theory and
experiments:
(1) We developed and incorporated an exclusion mechanism that can effectively reduce the
undesirable effects caused by frequent events. Our new measure is named the exclusive
causal-leverage measure.
(2) We proposed a data mining algorithm to mine ADR signal pairs from electronic patient
databases based on the new measure. The algorithm's computational complexity is
analyzed.
(3) We compared our new exclusive causal-leverage measure with our previously proposed
causal-leverage measure as well as two traditional measures in the literature: leverage
and risk ratio.
(4) To establish the superiority of our new measure, we performed extensive experiments.
In our previous work, we tested the effectiveness of the causal-leverage measure using a
single drug in the experiment. In the current paper, we selected three drugs and
evaluated the top 10 ICD-9 (International Classification of Diseases, 9th Revision)
codes ranked by the exclusive causal-leverage measure for each drug. We also tested how
the length of the hazard period $T$ affects the performance of the exclusive
causal-leverage measure.
2 RELATED WORK IN THE LITERATURE
Discovering the association between two event sets is an active and important area of
data mining research [19-22]. Many measures have been proposed to mine association rules
in the form of $X \rightarrow Y$, where $X$ and $Y$ are two disjoint event sets (i.e.,
$X \cap Y = \emptyset$). For instance, the support and confidence measures were the
original interestingness measures proposed for association rules [23]. The support of an
association rule, supp($X \rightarrow Y$), is the proportion of sequences in which both
$X$ and $Y$ occur at least once, among all the event sequences. The confidence of an
association rule is defined as conf($X \rightarrow Y$) = supp($X \rightarrow Y$) /
supp($X$), where supp($X$) is the proportion of sequences that contain $X$. Given these
two measures, the association rule mining problem can be formalized as finding those
rules whose support and confidence are greater than prespecified thresholds minsupp and
minconf, respectively. In the literature, researchers have also proposed various other
interestingness measures such as IS [24], Klosgen's measure [25], weighted relative
accuracy [26], and so on. Originating from areas such as probability, statistics or
information theory, these traditional interestingness measures are primarily designed
for finding frequent association rules. Thus, they
JI ET AL.: MINING INFREQUENT CAUSAL ASSOCIATIONS IN ELECTRONIC HEALTH DATABASES 3
have limited capability in discovering patterns that occur
infrequently.
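The support and confidence definitions above can be sketched in a few lines of Python. Each event sequence is reduced here to the set of events it contains; the function names and the toy data are illustrative assumptions, not part of the original method.

```python
from typing import Iterable, List, Set


def supp(sequences: List[Set[str]], events: Iterable[str]) -> float:
    """Proportion of sequences that contain every event in `events` at least once."""
    wanted = set(events)
    return sum(1 for seq in sequences if wanted <= seq) / len(sequences)


def conf(sequences: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    """conf(X -> Y) = supp(X -> Y) / supp(X); assumes X occurs in at least one sequence."""
    return supp(sequences, x | y) / supp(sequences, x)


# Hypothetical toy database of four event sequences.
sequences = [{"d1", "s1"}, {"d1"}, {"d1", "s2"}, {"s1"}]
print(supp(sequences, {"d1", "s1"}))    # 0.25
print(conf(sequences, {"d1"}, {"s1"}))  # 0.25 / 0.75
```

In this toy database a rule with support 0.25 would pass only a low minsupp threshold, which is the situation rare events such as ADRs typically create.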
There exist some works on rare eventset (or itemset) mining in the literature [27-35]. A
straightforward approach to discovering infrequent eventsets is to relax the uniform
minimal support criterion (i.e., minsupp) as indicated in [27]. One drawback of this
mining approach is the extremely high computational cost [28, 29, 36]. In addition, if
this approach were used to discover rare events like ADRs, the thresholds minsupp and
minconf would have to be set very small, possibly leading to many false associations
[14].
A couple of studies attempted to use less uniform support criteria [28-30]. Yun et al.
adopted a relative support (a rate against the relative frequency of the data existing
in a database) approach [29], while Liu and his colleagues proposed a method that relied
on multiple minimal support thresholds specified by users [28]. These approaches output
all frequent eventsets and association rules together with a subset of all infrequent
ones.
Several other studies explored rare eventsets (rules) with support lower than the
threshold [31-35]. These studies implemented different strategies to traverse the power
set lattice of a dataset. For example, some of them take a bottom-up approach to moving
across the lattice [31, 32, 35], while another algorithm named Rarity (proposed by
Troiano et al.) takes a top-down approach [33]. Koh and Rountree proposed an algorithm
called Apriori-Inverse where sporadic rules are discovered by simply discarding all
eventsets above a support threshold [34].
The above approaches to detection of rare association rules are based on traditional
interestingness measures like support and confidence. One pitfall of these measures is
that they simply find the statistical correlation between $X$ and $Y$. For example, the
confidence of an association rule $X \rightarrow Y$ determines how frequently $Y$
appears in those event sequences that contain $X$. That is, traditional measures like
confidence try to find the statistical significance of the coexistence of two eventsets.
They do not indicate any temporal relationship between $X$ and $Y$. In addition, they
are not able to capture the causal relationships between two event sets. They do not
specify whether $X$ causes $Y$, or vice versa, or a third event causes the coexistence
of $X$ and $Y$. Hence, statistical associations established by traditional measures may
or may not represent temporal or causal associations.

Causal modeling and inference have been widely studied in the field of machine learning
and traditional probability and statistics theory [37-39]. Compared with statistical
coexistence, causal relationships have a couple of unique properties. First, there is an
intrinsic asymmetry in a cause-effect relationship. That is, when saying event $X$
causes event $Y$, one would not expect $X$ to be influenced by $Y$. Second, the concept
of causality is linked to time dependencies. That is, the causes must precede their
effects. Researchers have proposed various models to model causal relationships for the
purpose of prediction or data modeling (i.e., estimation of a probability law that has
generated the data). These models include artificial neural networks [40], Markov models
[41], various graphical models [38, 39], etc. One of the most famous models is Pearl's
causal model, where causal modeling and inferences are based on representations by means
of Directed Acyclic Graphs (DAGs) [38]. For a more extensive and detailed review of the
state of the art of causality, readers are referred to a recent survey paper [37].
With a surge of interest in association rule mining in recent years, some of the above
models have been borrowed to mine the causal relationships between two events or
eventsets. For instance, the Bayesian network and its variants have been adapted to mine
causal structures in a couple of studies [42, 43]. One criticism of Bayesian analysis is
its requirement to specify prior distributions for all variables, which can be very
difficult since, in many cases, prior knowledge is vague or nonexistent, or can be
tedious if the number of variables is large [37, 38]. Silverstein et al. proposed a
constraint-based algorithm that narrowed down the search space and thus made the mining
process more scalable [39]. In addition, Hidden Markov Models (HMMs) have been utilized
to predict protein structures and analyze genome sequences based on massive amounts of
observed data [44, 45]. However, these approaches are inherently probability-based and
designed to find frequent patterns. Thus, they have limited capability in discovering
infrequent associations.
3 BACKGROUND ON FUZZY RPD MODEL
The RPD model [40, 46] represents a popular cognitive decision model. It is particularly
useful for modeling how human experts make decisions based on their prior experiences.
It was found that about 50% to 80% of all decisions were made in this way [41]. The
original RPD model is descriptive and is not directly implementable on a computer.
Hence, we developed a fuzzy RPD model [18], which is not only computational but also
capable of handling vague and subjective information using fuzzy logic [47, 48].
Experiences play a key role in the RPD model. The ADR detection experiences were
acquired through the joint efforts of our engineering and medical team members after
careful analysis of the relevant literature. According to the classification scheme in
[49], a particular pattern of cue values characterizes a specific degree of causality,
which may require certain courses of action to handle the ADR. Therefore, we can define
four experiences, each of which is associated with a degree of causality (i.e., very
likely, probable, possible and unlikely). These experiences were stored in an experience
knowledge base. Fig. 1 presents a sample experience illustrated in natural language for
easier understanding.
As shown, an experience consists of four components: cues, goals, actions, and
expectancies. Cues represent the higher-level information (synthesized from elementary
or environmental data) that a decision maker must pay attention to. Expectancies
describe what will happen next as the current situation continues to evolve in a
changing context. Goals represent an end state that the decision maker is trying to
achieve. Actions represent a set of potential decisions that the decision maker can take
in the
current situation. Cues are used to match the current situation with prior experiences
and determine which experience can be reused to solve a new problem. This sample
experience has four cues: temporal association, dechallenge, rechallenge, and other
explanations. The first three cues have been explained in the Introduction. Other
explanations denote alternative explanations by concurrent disease or other drugs. The
type of a cue could be quantitative, nominal or fuzzy in the proposed computational
fuzzy RPD model. For instance, the cue "temporal association" may have fuzzy values
(e.g., unlikely, possible, likely). The weights for these cues are design parameters and
are assigned by domain experts. Table 1 shows how the four cues are related to the
degree of causality of a signal pair in the four experiences. These mappings were given
by the physicians in our research project. For instance, when the cues temporal
association, rechallenge and dechallenge have fuzzy cue values possible, unlikely, and
possible, respectively, and there are no other explanations, this cue value pattern
represents a possible causal association between the drug of interest and the suspected
ADR from the perspective of a physician. In general, these mappings are determined by
domain experts.
After representing the experiences, the next step is to extract the cue values from
elementary data for the current situation (e.g., a new patient case in which a
drug-symptom pair needs to be evaluated). Fuzzy rules and fuzzy reasoning are used to
achieve this task. For example, to obtain the fuzzy value of temporal association, a
fuzzy rule of the type "if the time duration between taking the drug and the occurrence
of the adverse event is short, then temporal association is likely" is used. An embedded
fuzzy inference engine is employed to perform fuzzy reasoning. The inference engine is
what drives the RPD model, updating cues once new information is detected, and
monitoring expectancies and goals. After the cue values of the current situation are
known, similarity measures are applied to measure the degree of matching between the
current situation and prior experiences. The actions in the best-matching experience are
then used to solve the current problem. For more details of the fuzzy RPD model as well
as concrete examples, the reader is referred to our previous work [18].
The fuzzy RPD model was preliminarily validated in our previous study [18] by using it
to calculate the extent of causality between cisapride and some of its adverse effects
for 100 simulated patients created based on the profiles of 1,015 patients of the VA
Medical Center. The simulated patients were used because we found that the number of
apparent signal pairs involving potentially interesting symptoms (i.e., cisapride and
torsades de pointes and/or syncope) in the real patient data was very limited.
Therefore, we used the real patients to create simulated patient cases, all of which
contained drug-symptom pairs of interest with various degrees of causality. The model's
validity was then established by comparing the decisions made by the model and those by
two independent experienced physicians for the 100 simulated patients. The levels of
agreement were measured by the weighted Kappa statistic, which is a measure of agreement
between two raters after chance agreement is controlled. The results suggested good to
excellent agreement [50].
4 DEVELOPING A NEW MEASURE AND APPLYING IT TO SEARCHING ADR SIGNAL PAIRS
In this section, we first introduce a new measure we recently developed. We then discuss
how to mine potential ADRs from electronic patient records using this measure.
4.1 A New Exclusive Causal-Leverage Measure
We use $\langle X,Y \rangle$ and $C_{\langle X,Y \rangle}$ to represent a pair of events
and the degree of causality of the pair in a sequence, respectively.
$C_{\langle X,Y \rangle}$ is calculated by matching the cue values extracted from a
sequence with the cue values of the defined experiences, each of which corresponds to a
unique category of causality (i.e., very likely, probable, possible, and unlikely).
Specifically, we use a vector $V = (c_1, c_2, \ldots, c_i, \ldots, c_m)$ to represent
the set of cue values extracted from an event sequence, where $c_i$ represents the value
of the $i$-th cue and $m$ is the total number of cues. Similarly, we use
$V' = (c_1', c_2', \ldots, c_i', \ldots, c_m')$ to represent the set of cue values in an
experience. We define the similarity between a pair of cue values $c_i$ and $c_i'$ as
the local similarity $S_L(c_i, c_i')$, whose calculation depends on the type of the cue
(i.e., quantitative, nominal or fuzzy); that is,

Fig. 1. A sample experience.
TABLE 1
RELATING CUES TO CAUSALITY CATEGORY OF A SINGLE PAIR
$$
S_L(c_i, c_i') =
\begin{cases}
0, & \text{if } c_i \text{ or } c_i' \text{ is unknown} \\
\mathrm{overlap}(c_i, c_i'), & \text{if cue } i \text{ is nominal} \\
1 - \mathrm{normalized\_diff}(c_i, c_i'), & \text{if cue } i \text{ is quantitative} \\
\mathrm{fuzzy\_sim}(c_i, c_i'), & \text{if cue } i \text{ is a fuzzy value}
\end{cases}
\tag{1}
$$

In the above definition, the cue similarity is set to 0 if either of the cue values is
unknown. The function overlap() is defined as:
$$
\mathrm{overlap}(c_i, c_i') =
\begin{cases}
1, & \text{if } c_i = c_i' \\
0, & \text{otherwise}
\end{cases}
\tag{2}
$$
That is, for nominal cue values, the similarity is 1 if the cue values are equal;
otherwise it is 0. Computing the similarity for quantitative cue values involves the
distance measure normalized_diff:
$$
\mathrm{normalized\_diff}(c_i, c_i') = \frac{|c_i - c_i'|}{\Delta_i}
\tag{3}
$$
where $\Delta_i = a_i - b_i$ is employed to normalize the cue difference, and $a_i$ and
$b_i$ are the maximum and minimum values for cue $i$, respectively.
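The nominal and quantitative branches of Eq. (1), together with Eqs. (2) and (3), can be sketched in Python as follows. Representing an unknown cue value by `None`, passing the cue type as a string, and the example cue range are assumptions made for illustration; the fuzzy branch is left out of this sketch.

```python
from typing import Optional, Union

CueValue = Optional[Union[str, float]]


def overlap(c: str, c_exp: str) -> float:
    """Eq. (2): 1 if the nominal values are equal, otherwise 0."""
    return 1.0 if c == c_exp else 0.0


def normalized_diff(c: float, c_exp: float, a: float, b: float) -> float:
    """Eq. (3): |c - c'| divided by the cue range a - b (a = maximum, b = minimum)."""
    return abs(c - c_exp) / (a - b)


def local_similarity(c: CueValue, c_exp: CueValue, kind: str,
                     a: Optional[float] = None, b: Optional[float] = None) -> float:
    """Eq. (1) for nominal and quantitative cues; None stands for an unknown value."""
    if c is None or c_exp is None:
        return 0.0
    if kind == "nominal":
        return overlap(c, c_exp)
    if kind == "quantitative":
        return 1.0 - normalized_diff(c, c_exp, a, b)
    raise ValueError("fuzzy cues are not handled in this sketch")


# Hypothetical quantitative cue ranging from 0 to 100.
print(local_similarity(20.0, 35.0, "quantitative", a=100.0, b=0.0))
print(local_similarity("yes", "no", "nominal"))
```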
The last expression in Eq. (1), $\mathrm{fuzzy\_sim}(c_i, c_i')$, deals with cues
represented by fuzzy sets. Let cue $i$ be defined on the universe of discourse $X$ and
$x$ be an element of $X$. In this case, fuzzy cue values $c_i$ and $c_i'$ are two fuzzy
sets whose membership functions are defined in terms of $x$. Their similarity could be
defined on the basis of possibility measures on the [0, 1] interval [51]:
$$
\mathrm{fuzzy\_sim}(c_i, c_i') =
\begin{cases}
\mathrm{poss}(c_i, c_i'), & \text{if } \mathrm{poss}(\bar{c}_i, c_i') < 0.5 \\
\bigl(1.5 - \mathrm{poss}(\bar{c}_i, c_i')\bigr)\,\mathrm{poss}(c_i, c_i'), & \text{otherwise}
\end{cases}
\tag{4}
$$
where
$$
\mathrm{poss}(c_i, c_i') = \max_{x} \min\bigl(\mu_{c_i}(x), \mu_{c_i'}(x)\bigr), \quad x \in X
\tag{5}
$$
Here, $\mu_{c_i}(x)$ and $\mu_{c_i'}(x)$ are the membership functions of fuzzy sets
$c_i$ and $c_i'$, respectively. The min() and max() are standard fuzzy operators, and
hence $\max_{x} \min\bigl(\mu_{c_i}(x), \mu_{c_i'}(x)\bigr)$ computes the maximum of all
the minimums between pairs $\mu_{c_i}(x)$ and $\mu_{c_i'}(x)$ for all the elements $x$.
$\bar{c}_i$ is the complement of $c_i$ (i.e., $\mu_{\bar{c}_i}(x) = 1 - \mu_{c_i}(x)$).
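A minimal Python sketch of the fuzzy similarity follows. It assumes that both fuzzy sets are sampled at the same points of the universe of discourse and that Eq. (4) returns poss(c, c') when the possibility of the complement of c against c' is below 0.5, and (1.5 minus that possibility) times poss(c, c') otherwise. The membership grades are made up for illustration.

```python
from typing import List, Sequence


def poss(mu_c: Sequence[float], mu_c_exp: Sequence[float]) -> float:
    """Eq. (5) on a sampled universe: max over x of min(mu_c(x), mu_c'(x))."""
    return max(min(u, v) for u, v in zip(mu_c, mu_c_exp))


def fuzzy_sim(mu_c: Sequence[float], mu_c_exp: Sequence[float]) -> float:
    """Eq. (4) as assumed above, using the complement 1 - mu_c(x) of the first set."""
    complement: List[float] = [1.0 - u for u in mu_c]
    p = poss(mu_c, mu_c_exp)
    p_comp = poss(complement, mu_c_exp)
    if p_comp < 0.5:
        return p
    return (1.5 - p_comp) * p


# Hypothetical membership grades sampled at five points of the universe.
likely = [0.0, 0.25, 0.5, 0.75, 1.0]
possible = [0.0, 0.5, 1.0, 0.5, 0.0]
crisp_high = [0.0, 0.0, 0.0, 0.0, 1.0]

print(fuzzy_sim(likely, possible))
print(fuzzy_sim(crisp_high, crisp_high))
```

Sampling the universe is a simplification; with continuous membership functions the maximum in Eq. (5) would be taken over the whole universe.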
The similarity between two sets of cue values $V$ and $V'$ is named the global
similarity $S_G(V, V')$. It is defined as the weighted sum of all the local similarities
with respect to each pair of cue values. That is,
$$
S_G(V, V') = \frac{\sum_{i=1}^{n} w_i \, S_L(c_i, c_i')}{\sum_{i=1}^{n} w_i}
\tag{6}
$$
where the sums run over the cues and $w_i \in [0, 1]$ is the weight for cue $i$, which
represents the relative significance of the cue and is assigned by the user.
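Eq. (6) amounts to a weighted average of the local similarities. The short Python sketch below shows the computation; the local similarity values and cue weights are hypothetical.

```python
from typing import Sequence


def global_similarity(local_sims: Sequence[float], weights: Sequence[float]) -> float:
    """Eq. (6): weighted sum of local similarities divided by the sum of the weights."""
    if len(local_sims) != len(weights):
        raise ValueError("one weight per cue is required")
    return sum(w * s for w, s in zip(weights, local_sims)) / sum(weights)


# Hypothetical local similarities and weights for four cues.
local_sims = [1.0, 0.5, 0.0, 1.0]
weights = [1.0, 0.5, 0.5, 1.0]
print(global_similarity(local_sims, weights))  # (1 + 0.25 + 0 + 1) / 3
```

Dividing by the sum of the weights keeps the result in [0, 1] whenever every local similarity lies in [0, 1].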
The above global similarity is used to find the best-matching experience, whose
associated causality category can characterize the causal association of a pair of
interest in a particular event sequence. We use this approach to obtain a similarity
value between the current pair and each of the experiences. After that, these similarity
values are normalized so that their sum is equal to 1. These normalized values are then
used to represent the membership values of the corresponding categories for the pair of
interest in a particular sequence. For example, if the normalized similarity value
between a pair and the experience exhibiting a "possible" association is 0.6, we would
say the causality of the pair is "possible" with a membership value of 0.6 in that
sequence. If we use d and s to represent a drug and a symptom, respectively, Table 2
gives a couple of sample drug-symptom pairs as well as their membership values related
to each causality category for the patient cases that contain the pairs. Note that,
given a particular drug-symptom pair, there may exist no, one, or multiple patient cases
that contain this pair. On the other hand, the same patient case may contain multiple
different pairs. For example, Patient 4 in the table contains both
$\langle d_1, s_2 \rangle$ and $\langle d_3, s_5 \rangle$. $C_{\langle X,Y \rangle}$
within one patient case is equal to the weighted sum of the membership values of all
causality categories.
In general, if there exist $m$ experiences that classify the causality between $X$ and
$Y$ into $m$ distinct categories, the degree of causality is defined as
$$
C_{\langle X,Y \rangle} = \sum_{i=1}^{m} w_i \mu_i
\tag{7}
$$
where $\mu_i$ is the membership of the $i$-th causality category for the pair, and $w_i$
represents the corresponding weight when combining all the causality categories into
one. Moreover,
$$
\sum_{i=1}^{m} \mu_i = 1.
$$
The selection of $w_i$ is based on two considerations. First, causality categories
representing stronger causal associations should have higher weights. That is, if we
assume the causality levels represented by the $m$ experiences are in a
TABLE 2
MEMBERSHIP OF EACH CAUSALITY CATEGORY FOR SAMPLE
SIGNAL PAIRS (PID MEANS PATIENT IDENTIFICATION NUMBER)
decreasing order, then $w_m > \cdots > w_2 > w_1$ must be satisfied, where $w_m$ and
$w_1$ correspond to the highest and lowest causality levels, respectively. Second, the
range of $C_{\langle X,Y \rangle}$ should be [0, 1]. That is, $C_{\langle X,Y \rangle}$
should be 0 for the extreme situation $\mu = \{0, 0, \ldots, 1\}$ where the evidence in
a sequence strongly shows an unlikely association of the pair. If all the evidence in
the patient case supports a very likely association (i.e., $\mu = \{1, 0, \ldots, 0\}$),
$C_{\langle X,Y \rangle}$ should be 1. Otherwise, $C_{\langle X,Y \rangle}$ is between 0
and 1. In general, the weight for the $i$-th causality category is calculated by
$$
w_i = \frac{i - 1}{m - 1}
\tag{8}
$$
where $m \ge i \ge 1$. For example, if there are 4 causality categories (i.e., $m = 4$),
then the set of weights for them would be {1.0, 0.667, 0.333, 0.0}, listed from the
highest to the lowest causality level. One can see that the computed
$C_{\langle X,Y \rangle}$ using these weights satisfies the above two criteria.
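The following Python sketch works through the degree-of-causality computation for one patient case. It assumes the weights are evenly spaced between 0 (lowest causality level, i = 1) and 1 (highest level, i = m), which reproduces the weight set {1.0, 0.667, 0.333, 0.0} for m = 4, and it normalizes made-up global similarity values into memberships that sum to 1. The function names and the numbers are illustrative.

```python
from typing import List, Sequence


def category_weights(m: int) -> List[float]:
    """Assumed form of Eq. (8): w_i = (i - 1) / (m - 1) for i = 1..m, so w_1 = 0, w_m = 1."""
    return [(i - 1) / (m - 1) for i in range(1, m + 1)]


def memberships(similarities: Sequence[float]) -> List[float]:
    """Normalize the global similarities to the m experiences so that they sum to 1."""
    total = sum(similarities)
    return [s / total for s in similarities]


def degree_of_causality(mu: Sequence[float]) -> float:
    """Eq. (7): C = sum_i w_i * mu_i, with mu ordered from lowest to highest category."""
    w = category_weights(len(mu))
    return sum(wi * mi for wi, mi in zip(w, mu))


# Order used here: unlikely, possible, probable, very likely (i = 1..4).
mu = memberships([0.25, 1.5, 0.5, 0.25])   # roughly [0.1, 0.6, 0.2, 0.1]
print(category_weights(4))
print(degree_of_causality(mu))
```

The two boundary conditions in the text can be checked directly: a membership of 1 in the lowest category yields 0, and a membership of 1 in the highest category yields 1.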
We introduce the causal association rule (CAR), denoted by $X \xrightarrow{C} Y$,
meaning that $X$ potentially causes $Y$. We define the support of a CAR,
supp($X \xrightarrow{C} Y$), as the accumulated votes over all sequences. The vote of
the $j$-th sequence is represented by $C^{j}_{\langle X,Y \rangle}$. That is,
$$
\mathrm{supp}(X \xrightarrow{C} Y) = \frac{\sum_{j=1}^{n} C^{j}_{\langle X,Y \rangle}}{N}
\tag{9}
$$
where $n$ is the number of sequences that contain the pair and $N$ is the total number
of sequences in the database. The advantage of this definition is that the degree of
causality is incorporated into the measure such that sequences exhibiting higher
causality will contribute more to the measure.
Based on the above definition of the support of a CAR, we define an interestingness
measure called causal-leverage:
$$
\text{causal-leverage}(X \xrightarrow{C} Y)
= \mathrm{supp}(X \xrightarrow{C} Y) - \mathrm{supp}(\overset{C}{X}) \times \mathrm{supp}(Y)
\tag{10}
$$
where supp($\overset{C}{X}$) is the proportion of sequences whose votes are greater than
0 with respect to the pair $\langle X,Y \rangle$, and supp($Y$) is the proportion of
sequences that contain $Y$. This measure extends the traditional leverage measure (i.e.,
leverage($X \rightarrow Y$) = supp($X \rightarrow Y$) $-$ supp($X$) $\times$ supp($Y$))
by incorporating the degree of causality of the pair among all sequences into it.
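A small Python sketch of the CAR support and the causal-leverage measure is given below. It takes as input the votes (degrees of causality) of the sequences that contain the pair, the total number of sequences N, and the number of sequences that contain Y. The function names and the toy numbers are illustrative assumptions.

```python
from typing import Sequence


def causal_support(votes: Sequence[float], n_total: int) -> float:
    """Eq. (9): accumulated votes over the sequences containing the pair, divided by N."""
    return sum(votes) / n_total


def causal_leverage(votes: Sequence[float], n_total: int, n_with_y: int) -> float:
    """Eq. (10): CAR support minus (share of sequences with a positive vote) * supp(Y)."""
    supp_x_c = sum(1 for v in votes if v > 0) / n_total  # sequences whose vote is > 0
    supp_y = n_with_y / n_total                           # sequences that contain Y
    return causal_support(votes, n_total) - supp_x_c * supp_y


# Hypothetical database: N = 100 sequences, 4 contain the pair, 10 contain Y.
votes = [0.9, 0.6, 0.0, 0.5]
print(causal_support(votes, n_total=100))
print(causal_leverage(votes, n_total=100, n_with_y=10))
```

Because each vote lies in [0, 1], a sequence with strong causal evidence adds more to the support than one that merely contains both events.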
With better capability of capturing causal associations, the causal-leverage measure can be employed to mine the causal association of pair <X,Y> given a collection of event sequences. However, its ability to discover the causal associations between drugs and their potential ADRs is still limited because of the complexity and special characteristics of electronic patient data. Recall that the degree of causal association of a pair within a particular patient case is determined by four cues: temporal association, rechallenge, dechallenge, and other explanations. One can see that three of them are time-related. For a particular pair, their values can be abstracted from a specific patient case using fuzzy rules, and these values collectively determine the degree of causal association of the pair in the patient case. However, the obtained cue values may not represent the actual causal association of the pair because of the complexity of the electronic patient data. One of the complexities is that frequently prescribed drugs are easily temporally associated with any symptom by coincidence. For example, if the hazard period T is equal to 90 days and the start date of a drug appears once every two months within the same case, then the drug has a temporal association with any symptom that exists in the patient case. Similarly, a particular drug tends to coincidentally form temporal associations with symptoms (e.g., fever) caused by common diseases. Note that symptoms of a potential ADR cannot be differentiated from those of an underlying disease, both of which are encoded using the same ICD-9 codes in electronic patient databases. For a drug-symptom pair, the associated symptom may be caused either by the drug or by a disease after the drug is started. If both a drug and a symptom have high frequencies in a health database, they can even easily form a rechallenge by coincidence. As a result, the obtained causal-leverage value can be high for the pair even though this value does not imply any degree of causality. Therefore, it is necessary to find an exclusion mechanism to reduce the undesirable effects caused by highly frequent events (e.g., either drugs or symptoms).
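The coincidence effect in the 90-day example can be checked with a small sketch. It assumes a crisp rule (a symptom is temporally associated with a drug if it falls within T days after some start date of that drug), whereas the paper derives this cue with fuzzy rules; the function name and the dates are hypothetical.

```python
from datetime import date, timedelta

def has_temporal_association(drug_starts, symptom_date, hazard_days=90):
    """True if the symptom falls within hazard_days after any drug start date.

    Crisp simplification (assumption): the paper uses fuzzy rules instead.
    """
    return any(
        timedelta(0) <= symptom_date - start <= timedelta(days=hazard_days)
        for start in drug_starts
    )

# Hypothetical patient case: the drug is started six times, once every 60 days.
first_start = date(2008, 1, 1)
drug_starts = [first_start + timedelta(days=60 * k) for k in range(6)]

# Check every calendar day from the first start to 90 days after the last start.
last_day = drug_starts[-1] + timedelta(days=90)
n_days = (last_day - first_start).days + 1
covered = [
    has_temporal_association(drug_starts, first_start + timedelta(days=d))
    for d in range(n_days)
]
print(all(covered))
```

Under this simplified rule, any symptom recorded on or after the first start date is associated with the drug regardless of what the symptom is, which is why a frequency-insensitive correction is needed.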
Intuitively, if a highly frequent event Y randomly occurs after another event X within time period T, then it should have the same chance to also occur before X within the same period T among many event sequences. Case1 in Fig. 2 demonstrates a pattern where the symptom s1 occurs before the drug d1 within time period T. A reasonable explanation of this pattern is that the underlying disease of the symptom s1 causes the prescription of the drug d1. Case2 indicates a pattern where the symptom s1 occurs after the drug d1 within time period T. If we look at Case2 only, a reasonable explanation could be that the symptom s1 represents a possible ADR caused by the drug d1 since they exhibit a reasonable temporal association. However, if we look at these two patient cases together, it is more likely that the symptom s1 simply occurs randomly. In this case, both the pair <d1, s1> and its reversed pair <s1, d1> exhibit temporal associations in various patient cases. In some cases, the above two patterns can coexist in the same patient case, as shown in Case3. For such a pattern, there exist several possible explanations. First, d1 is caused by the first s1 and the second s1 occurs coincidentally. Second, the second s1 is caused by d1 and the first s1 occurs coincidentally. Third, s1 and d1 simply occur randomly without any causal association. A more realistic patient case can exhibit an even more complex pattern, as shown in Case4. It denotes that d1 has temporal and rechallenge relationships with s1 since their temporal association occurs twice over time. However, this pattern can be explained differently. For example, the second d1 occurs after the first s1, which can be explained as the underlying disease of the symptom causing the prescription of the drug. The temporal association between s1 and d1 also occurs twice, and they also exhibit a rechallenge relationship. Considering all four types of patterns together, if drugs and/or symptoms occur randomly and frequently, they may falsely form temporal association and rechallenge relationships. However, the number of cases exhibiting these relationships for a pair <drug, symptom> and that for its reversed pair (i.e., <symptom, drug>) should be similar.

Fig. 2. Illustration of patient cases that exhibit different types of event patterns.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

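Given the above observations, we define a reverse causal-leverage of CAR, denoted as reverse causal-leverage($X \overset{C}{\rightarrow} Y$), with respect to pair <X,Y>.

$$\text{reverse causal-leverage}(X \overset{C}{\rightarrow} Y) = \text{causal-leverage}(Y \overset{C}{\rightarrow} X) = \mathrm{supp}(Y \overset{C}{\rightarrow} X) - \mathrm{supp}(Y^{C}) \times \mathrm{supp}(X) \qquad (11)$$
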
where $\mathrm{supp}(Y \overset{C}{\rightarrow} X)$ is the support of the CAR $Y \overset{C}{\rightarrow} X$, $\mathrm{supp}(Y^{C})$ is the proportion of sequences whose votes are greater than 0 with respect to the pair <Y,X>, and $\mathrm{supp}(X)$ is the proportion of sequences that contain X. The measure represents the degree of reverse causality between X and Y. That is, it denotes the strength with which Y causes X. Intuitively, if X and Y are randomly distributed among the patient cases in a database and do not have any causal relationship, causal-leverage($X \overset{C}{\rightarrow} Y$) should have the same or a similar value as reverse causal-leverage($X \overset{C}{\rightarrow} Y$), no matter how high the frequency of the pair is. Otherwise, if X causes Y, the former should be greater than the latter. On the other hand, if Y causes X, then the former should be smaller than the latter. Hence, we define an exclusive causal-leverage measure as below:

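$$\begin{aligned}
\text{exclusive causal-leverage}(X \overset{C}{\rightarrow} Y)
&= \text{causal-leverage}(X \overset{C}{\rightarrow} Y) - \text{reverse causal-leverage}(X \overset{C}{\rightarrow} Y) \\
&= \mathrm{supp}(X \overset{C}{\rightarrow} Y) - \mathrm{supp}(X^{C}) \times \mathrm{supp}(Y) - \left[\mathrm{supp}(Y \overset{C}{\rightarrow} X) - \mathrm{supp}(Y^{C}) \times \mathrm{supp}(X)\right] \qquad (12)
\end{aligned}$$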

If this measure has a positive value, it denotes that X has causality with Y. The bigger the value, the stronger the causal association. If its value is negative, it indicates a reverse causality. A value close to zero is a sign of no causal association between X and Y. Thus, this measure not only provides a way to quantify the degree of causal association but also excludes undesirable effects caused by high-frequency events that occur randomly.
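As a minimal numeric sketch of Eqs. (11) and (12), the functions below take the supports as plain numbers. The function and parameter names are hypothetical, and the input values are made up for illustration; they are not taken from the experiments.

```python
def causal_leverage(supp_car, supp_antecedent_c, supp_consequent):
    """supp(A ->C B) - supp(A^C) * supp(B) for a CAR with antecedent A, consequent B."""
    return supp_car - supp_antecedent_c * supp_consequent

def exclusive_causal_leverage(supp_xy, supp_x_c, supp_y, supp_yx, supp_y_c, supp_x):
    """Eq. (12): forward causal-leverage minus reverse causal-leverage (Eq. (11))."""
    forward = causal_leverage(supp_xy, supp_x_c, supp_y)
    reverse = causal_leverage(supp_yx, supp_y_c, supp_x)
    return forward - reverse

# Hypothetical frequent but unrelated pair: forward and reverse patterns are
# equally common, so the two terms cancel.
symmetric = exclusive_causal_leverage(0.01, 0.02, 0.3, 0.01, 0.02, 0.3)

# Hypothetical pair where Y mostly follows X: the forward term dominates.
asymmetric = exclusive_causal_leverage(0.01, 0.02, 0.3, 0.001, 0.002, 0.3)

print(symmetric, asymmetric)
```

The first call returns zero because its forward and reverse inputs are identical; the second is positive, which Eq. (12) reads as X having causality with Y.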
4.2 Mining ADR Signal Pairs from Electronic Patient Databases
Drugs and their associated ADRs have causal relationships. In this subsection, we examine how to search for potential ADR signal pairs from an electronic patient database using the above exclusive causal-leverage measure. We assume that patient data are stored in relational tables in a database and can be retrieved using a database language like Structured Query Language (SQL). These tables are linked through patient identification numbers (PIDs). We also assume that the drug-related data and symptom-related data are stored in two tables called Patient Drug Table and Patient Symptom Table, respectively.
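The table layout assumed above can be illustrated with a small in-memory database. This is only a sketch: the table names, column names, and rows are hypothetical stand-ins for the Patient Drug Table and Patient Symptom Table, and SQLite is used here purely for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_drug (pid INTEGER, drug TEXT, start_date TEXT);
CREATE TABLE patient_symptom (pid INTEGER, icd9 TEXT, symptom_date TEXT);
""")
# Hypothetical rows; dates are ISO strings.
conn.executemany("INSERT INTO patient_drug VALUES (?, ?, ?)", [
    (1, "enalapril", "2008-01-10"),
    (1, "enalapril", "2008-06-02"),
    (2, "enalapril", "2008-03-15"),
    (3, "pravastatin", "2008-02-20"),
])
conn.executemany("INSERT INTO patient_symptom VALUES (?, ?, ?)", [
    (1, "276.7", "2008-02-01"),
    (2, "428.0", "2008-01-05"),
    (3, "276.7", "2008-04-11"),
])

def supporting_pids(conn, drug, icd9):
    """PIDs of patient cases that contain both the drug and the symptom."""
    rows = conn.execute(
        "SELECT DISTINCT d.pid FROM patient_drug AS d "
        "JOIN patient_symptom AS s ON s.pid = d.pid "
        "WHERE d.drug = ? AND s.icd9 = ? ORDER BY d.pid",
        (drug, icd9),
    ).fetchall()
    return [pid for (pid,) in rows]

print(supporting_pids(conn, "enalapril", "276.7"))
```

`DISTINCT` matters because a patient with several start dates for the same drug would otherwise be returned more than once by the join.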
Fig. 3 shows an overall picture of the data mining algorithm. First, preprocessing is needed to get two types of information: 1) a list of all drugs D(d1, d2, ..., dm) in the database and the support count for each drug; 2) a list of all symptoms S(s1, s2, ..., sn) in the database and the support count for each symptom. The lists of drugs and symptoms are needed to form all possible drug-symptom pairs whose causal strengths will be assessed. Since a patient database normally contains only a subset of all drugs on the market and a subset of all symptoms, it is necessary to search the Patient Drug Table and the Patient Symptom Table to get the drugs and symptoms covered by the database. Using the drugs and symptoms discovered in the database (instead of all drugs on the market and all available symptoms) avoids forming unnecessary pairs, which reduces the computational complexity of the data mining algorithm. In addition, the support count for each drug or symptom will be used to calculate the exclusive causal-leverage value for related pairs. Hence, their values also need to be computed.

Fig. 3. Algorithm for mining causal association rules.

Fig. 4. Data structures for storing drug/symptom support counts. (a) drug hash table, (b) symptom hash table.

Algorithm 1 shows how to search a database (DB) for the list of drugs and the support count for each drug. The discovered drugs and their support counts are stored in a hash table as shown in Fig. 4a. Initially, the hash table is empty. We use Dk to represent the set of drugs taken by the k-th patient pk in the database. For each patient, we first retrieve all the drugs taken by the patient. For each of these drugs, we then check whether the hash table contains the drug. If the hash table does not contain the drug, the drug's support count is set to 1, and both the drug and its support count are added to the hash table. Otherwise, the drug's support count is increased by 1. After all the patient cases are searched, the hash table is returned. Note that, when calculating the support count for a drug, it is counted only once for one patient case even if the drug appears several times in that patient case. One can see that the computational complexity of this searching process is affected by the total number of patients N in the database and the average number of drugs taken by a patient. The latter is normally determined by the type of patients in the database. For instance, elderly patients often have multiple diseases and thus may take multiple drugs either at the same time or at different times.
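The counting procedure just described can be sketched in a few lines. A Python dictionary plays the role of the hash table; the function name and the input format (one list of drug names per patient) are assumptions made for illustration.

```python
def drug_support_counts(patient_drugs):
    """Return {drug: number of patient cases containing the drug}.

    patient_drugs: iterable with one list of drug names per patient
    (a drug may be repeated within a patient's list).
    """
    counts = {}  # the hash table, initially empty
    for drugs in patient_drugs:
        # set() ensures a drug is counted once per patient case
        for drug in set(drugs):
            if drug not in counts:
                counts[drug] = 1
            else:
                counts[drug] += 1
    return counts

# Hypothetical data: patient 1 received enalapril twice.
example = [
    ["enalapril", "enalapril", "pravastatin"],
    ["enalapril"],
    [],
]
print(drug_support_counts(example))
```

The same function applies to symptoms if the per-patient lists hold ICD-9 codes instead of drug names.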
The algorithm used to search the Patient Symptom Table for a list of symptoms covered by the database as well as the support count for every symptom (Fig. 4b) is similar to Algorithm 1. If users are only interested in mining the potential ADRs of a particular drug or a couple of drugs, they can specify the drugs of interest. Similarly, users can also specify the list of symptoms if they want to analyze which drugs can cause the symptoms of interest. In both cases, however, the Patient Drug Table and the Patient Symptom Table still need to be searched in order to get the support count for each drug or symptom.
After getting the list of drugs D(d1, d2, ..., dm) and the list of symptoms S(s1, s2, ..., sn), the next step is to generate all the possible pairs, each of which represents a CAR. Algorithm 2 shows the process for pair generation and evaluation. Please note that most existing data mining methods mine all interesting association rules that combine all possible events or items in a database. We are only interested in mining drug-symptom pairs, which can be easily generated given D and S. That is, we are not interested in other patterns like drug-drug pairs, symptom-symptom pairs, or combinations of multiple drugs and symptoms. Thus, our algorithm generates far fewer candidate rules, which implies much lower complexity. The complexity of this pair generation process is O(mn), where m and n are the number of drugs and symptoms, respectively. In addition, since we are interested in mining infrequent patterns, it is inappropriate to prune pairs using the support measure (i.e., support > minsupp). However, in postmarketing surveillance, a signal pair generated by a data mining method is generally not considered valid if only one or two patient cases contain the pair [9]. Therefore, we utilize a minimum support count mincount=3 (instead of minsupp) to further reduce the number of pairs that will be evaluated. Specifically, we retrieve the PIDs of those patient cases that contain both the drug and the symptom of each pair. This is done by sending a SQL statement to query the database. Modern database management systems can utilize optimization techniques such as indexing to speed up this process. If the number of PIDs that support a pair is greater than or equal to mincount (Line 4 of Algorithm 2), the pair is evaluated. To evaluate the strength of the causal association of the pair, its causal-leverage value is first computed. Then, the pair is reversed. That is, the pair <di, sj> becomes <sj, di>. The reverse causal-leverage value of pair <di, sj> is equal to the causal-leverage value of its reversed pair <sj, di>. After that, the exclusive causal-leverage value of the pair is computed by subtracting its reverse causal-leverage value from its normal causal-leverage value.
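The pair generation and evaluation loop can be sketched as follows. This is an illustrative outline rather than the paper's Algorithm 2 listing: `pids_with` and `causal_leverage` are hypothetical callables standing in for the SQL query and for Algorithm 3, and the final sort reflects the ranking step of the overall method.

```python
from itertools import product

MIN_COUNT = 3  # minimum support count (mincount) stated in the text

def rank_pairs(drugs, symptoms, pids_with, causal_leverage):
    """Score every drug-symptom pair by exclusive causal-leverage.

    pids_with(drug, symptom) -> set of PIDs containing both events (hypothetical).
    causal_leverage(x, y, pids) -> causal-leverage of the pair <x, y> (hypothetical).
    """
    results = []
    for drug, symptom in product(drugs, symptoms):  # O(mn) candidate pairs
        pids = pids_with(drug, symptom)
        if len(pids) < MIN_COUNT:
            continue  # too few supporting cases to be a valid signal
        forward = causal_leverage(drug, symptom, pids)
        reverse = causal_leverage(symptom, drug, pids)  # reversed pair
        results.append((drug, symptom, forward - reverse))
    results.sort(key=lambda r: r[2], reverse=True)  # decreasing order
    return results

# Toy stand-ins, for illustration only.
def toy_pids(drug, symptom):
    return {1, 2, 3} if symptom == "s1" else {1}

def toy_leverage(x, y, pids):
    return 0.5 if x == "d1" else 0.25

ranked = rank_pairs(["d1"], ["s1", "s2"], toy_pids, toy_leverage)
print(ranked)
```

In the toy run, the pair with symptom `s2` is skipped because only one patient case supports it.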
Algorithm 3 shows how to compute the causal-leverage value of a general pair between events X and Y. Both X and Y could be either a drug event or a symptom event. First, the drug or symptom hash table is searched in order to get the support count for event Y. Then, for each PID that supports the pair, a process called cue abstraction is used to extract a set of cue values V from the related patient case. Specifically, a list of drug start dates and a list of symptom dates are retrieved from the Patient Drug Table and the Patient Symptom Table, respectively. Note that, since each patient case records the patient's history for a long period of time, the same drug may be prescribed many times, and thus multiple drug start dates may exist within one patient case. Similarly, there may exist multiple symptom dates for the symptom within the same patient case. Therefore, we must compare each start date of the drug with each symptom date of the symptom in order to obtain the temporal patterns of interest for the pair within the case. The complexity of this process is O(LdLs), where Ld and Ls are the length of the list of drug start dates and the length of the list of symptom dates, respectively. Ld and Ls often depend on the characteristics of the patient. For example, a patient with chronic diseases tends to take the same drug many times and have repeated symptoms. After getting the temporal patterns, cue values for temporal association, rechallenge, and dechallenge are derived from these patterns using fuzzy rules. Note that, in order to make our algorithm more generic, the cue other explanations was not utilized in this study since it is drug-dependent. Readers are referred to our previous paper for the specific fuzzy rules that were used in this process [16].
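The pairwise date comparison described above can be sketched as a nested loop. The sketch only collects candidate (start date, symptom date) pairs under a crisp window, which is an assumption; deriving the actual cue values requires the fuzzy rules of [16], which are not reproduced here. The function name and dates are hypothetical.

```python
from datetime import date

def temporal_pairs(drug_starts, symptom_dates, hazard_days=90):
    """Compare every drug start date with every symptom date: O(Ld * Ls).

    Returns the (start, symptom) date pairs where the symptom occurs within
    hazard_days after the start (crisp window; an illustrative assumption).
    """
    pairs = []
    for start in drug_starts:
        for onset in symptom_dates:
            gap = (onset - start).days
            if 0 <= gap <= hazard_days:
                pairs.append((start, onset))
    return pairs

# Hypothetical patient case with two drug starts and three symptom dates.
starts = [date(2008, 1, 1), date(2008, 6, 1)]
onsets = [date(2008, 2, 1), date(2008, 7, 15), date(2007, 12, 1)]
found = temporal_pairs(starts, onsets)
print(found)
```

Swapping the two arguments gives the patterns for the reversed pair, which is what the reverse causal-leverage computation needs.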
After the set of cue values V of the pair is extracted from the related patient case, similarity values are computed between V and each set of cue values stored in the experience knowledge base (EKB). This step is also called feature matching in the Fuzzy RPD model. These similarity values are then normalized so that they are transformed to the format shown in Table 2. After that, the degree of causality C<X,Y> within the current patient case is computed. If C<X,Y> is greater than 0, it is added to the accumulated votes and the number of cases whose votes are greater than 0 is increased by one. After the above computation is done for all the supporting cases of the pair, the causal-leverage value of the pair is computed (refer to Eq. (10)) and returned.
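The accumulation step can be sketched as below. Two assumptions are made loudly here: the per-case degrees of causality are supplied as ready-made numbers (the fuzzy RPD computation is not reproduced), and the final formula follows the form of the causal-leverage term that appears inside Eq. (12), with the support of the CAR taken as the accumulated votes divided by the total number of cases N. Eq. (10) itself should be consulted for the exact definition.

```python
def causal_leverage_from_votes(votes, n_total, supp_count_y):
    """Accumulate positive votes and return a causal-leverage value.

    votes: degree of causality C<X,Y> for each supporting patient case.
    n_total: total number of patient cases N in the database.
    supp_count_y: support count of event Y from the hash table.
    """
    accumulated = 0.0
    positive_cases = 0
    for c in votes:
        if c > 0:  # only positive votes are accumulated
            accumulated += c
            positive_cases += 1
    supp_car = accumulated / n_total    # assumed form of supp(X ->C Y)
    supp_x_c = positive_cases / n_total  # proportion of cases with votes > 0
    supp_y = supp_count_y / n_total
    return supp_car - supp_x_c * supp_y

# Hypothetical votes for four supporting cases in a database of 100 cases.
value = causal_leverage_from_votes([0.8, -0.2, 0.4, 0.0], n_total=100, supp_count_y=10)
print(value)
```

With these made-up inputs, two cases contribute positive votes totalling 1.2, so the result is 0.012 - 0.02 * 0.1, up to floating-point rounding.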
Finally, after all the exclusive causal-leverage values are computed, we rank all the pairs in decreasing order of these values. The higher the value, the more likely it is that a causal association exists in the pair or CAR.


5 EXPERIMENTS
5.1 Experiment data
Our experiment data were from the Veterans Affairs Medical Center in Detroit, Michigan. We were interested in two classes of drugs: statin drugs and ACEI (angiotensin-converting enzyme inhibitor) drugs. De-identified electronic data of all the patients who received at least one of the drugs of interest during the period from September 4, 2007 to September 16, 2009 were retrieved. Event data such as dispensing of drugs, office visits, and certain laboratory tests were retrieved for all the patients. For each event, certain details were obtained. For example, the data for dispensing of a drug include the name of the drug, quantity of the drug dispensed, dose of the drug, drug start date, drug schedule, and the number of refills.
The retrieved data included 16,206 patients, of whom 15,605 (96.3%) were male and 601 (3.7%) were female. Their average age was 68.0. The drugs enalapril, pravastatin, and rosuvastatin were selected to test the proposed data mining framework. Pravastatin and rosuvastatin are both statin drugs, and enalapril is an ACEI drug. These three drugs allow us to examine the effectiveness of our data mining framework in identifying potential ADRs for drugs both in the same drug class and in different drug classes. The numbers of patients who took enalapril, pravastatin, and rosuvastatin at least once were 1021, 1383, and 882, respectively. Note that multiple drugs might be prescribed for the same patient either on the same date or on different dates. The total number of distinct ICD-9 codes in the data was 3,954. Each code represented a potential ADR that might be causally associated with any drug.
The data were stored in five relational tables in a Microsoft Access database. These tables contained patients' demographic data, visit data, diagnostic data, drug-related data, and laboratory data. SQL was used to query the tables and obtain the desired information. The query results were returned to Java programs that implemented the data mining algorithm. Fuzzy sets and fuzzy rules were coded using Fuzzyjess [52], a Java-based fuzzy inference engine.
5.2 Experiment Results
Table 3 gives the causal-leverage, reverse causal-leverage, and exclusive causal-leverage between drug enalapril and each of seven ICD-9 codes: 276.7 (hyperkalemia), 366.02 (posterior subcapsular polar cataract), 428.0 (congestive heart failure), 327.27 (central sleep apnea in conditions classified elsewhere), 305.60 (cocaine abuse, unspecified), 275.42 (hypercalcemia), and 525.9 (unspecified disorder of the teeth and supporting structures). The code 276.7 represents a well-known ADR caused by the drug. The table shows that its causal-leverage is much larger than its reverse causal-leverage, and thus its exclusive causal-leverage is close to its causal-leverage. The code 366.02 was considered by our physicians as neither an ADR of enalapril nor a medical condition that leads to its prescription. One can see that its causal-leverage and reverse causal-leverage are very close, even though either one is larger than the causal-leverage of 276.7. The code 428.0 denotes a medical condition for which enalapril is taken. The table indicates that its causal-leverage is smaller than its reverse causal-leverage, which results in a negative exclusive causal-leverage value. The table also includes four other disease conditions that are not related to drug enalapril. That is, they are neither ADRs caused by enalapril nor medical conditions that cause the prescription of enalapril. The code 327.27 has the smallest positive exclusive causal-leverage value among all those unrelated codes. Based on the codes 276.7 and 327.27, one can see that the positive range of exclusive causal-leverage values is 2.94E-7 to 1.36E-4 with respect to drug enalapril. The code 305.60 has the largest positive exclusive causal-leverage value among those unrelated codes. Its reverse causal-leverage is 0, which means that no cases exhibit reasonable temporal patterns for the reverse pair. According to Table 4, the rank of this code is as high as 6. This indicates that some unrelated codes could still be ranked high due to the complexity of the data. The codes 275.42 and 525.9 are simply two randomly selected symptoms that are not related to drug enalapril. They exhibit properties similar to those of codes 366.02 and 327.27. That is, their causal-leverage and reverse causal-leverage values are relatively large, but the absolute values of their exclusive causal-leverage are small. If these seven ICD-9 codes are ranked by their causal-leverage values, the code 428.0 demonstrates the strongest causal association with drug enalapril. But if exclusive causal-leverage is used, the code 276.7 (i.e., the known ADR) has the strongest association with the drug. Therefore, the table indicates that our algorithm can indeed effectively remove some undesirable effects caused by frequent events.

TABLE 3
COMPARING THE DEGREE OF ASSOCIATION BETWEEN ENALAPRIL AND EACH OF SEVEN ICD-9 CODES UNDER DIFFERENT MEASURES (HAZARD PERIOD T = 90 DAYS)

For each of the three selected drugs, we computed the exclusive causal-leverage values for all the pairs between the drug and each of the 3,954 ICD-9 codes. We ranked the codes in decreasing order according to each pair's exclusive causal-leverage value. The top 10 codes were evaluated by our physicians to determine whether they were likely real ADRs associated with the drug. The physicians made decisions based on the Physicians' Desk Reference [53], literature reports, and their clinical experience.
We also ranked the association between the drug and each ICD-9 code using three other measures: causal-leverage, traditional leverage without considering causality, and risk ratio. Risk ratio is defined as the probability that a case coexists with the ICD-9 code relative to the probability that a non-case coexists with the same ICD-9 code. Risk ratio and leverage represent two typical frequency-based interestingness measures. The hazard period was set to T=90 days when causal-leverage and exclusive causal-leverage were calculated. The value of this parameter needs to be chosen carefully: if it is too large, some unrelated symptoms could form false temporal relationships with the drug; if it is too small, real signals may be excluded because some ADRs only happen after a drug has been taken for as long as tens of days. After balancing these two situations, our physicians considered 90 days an appropriate value for this parameter.
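For reference, the two frequency-based baselines can be sketched from raw counts. The sketch assumes that a "case" is a patient who took the drug and a "non-case" is a patient who did not; the function names and the counts are hypothetical and are not taken from the study data.

```python
def risk_ratio(exposed_with, exposed_total, unexposed_with, unexposed_total):
    """P(code | took the drug) / P(code | did not take the drug).

    Raises ZeroDivisionError if no unexposed patient has the code.
    """
    return (exposed_with / exposed_total) / (unexposed_with / unexposed_total)

def leverage(n_both, n_x, n_y, n_total):
    """Traditional leverage: supp(X -> Y) - supp(X) * supp(Y)."""
    return n_both / n_total - (n_x / n_total) * (n_y / n_total)

# Hypothetical counts: 1,000 of 16,000 patients took the drug; 20 of them and
# 150 of the remaining 15,000 patients have the ICD-9 code.
rr = risk_ratio(20, 1000, 150, 15000)
lev = leverage(n_both=20, n_x=1000, n_y=170, n_total=16000)
print(rr, lev)
```

Neither quantity uses event order or timing, which is the sense in which these measures are frequency-based.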
Table 4 lists the top 10 ICD-9 codes identified by the exclusive causal-leverage measure for drug enalapril. Eight of them were considered likely real ADRs associated with enalapril. Their ranks under the three other measures are also included in the table. One can see that neither risk ratio nor leverage ranked the eight ICD-9 codes highly. The table also indicates that the results generated by these two measures (i.e., risk ratio and leverage) are much more similar to each other than those generated by the other two measures. Judging by the ranks of the eight ICD-9 codes, the causal-leverage measure performed much better than the risk ratio and leverage measures since it could capture the causal relationship between a pair. Even though the traditional leverage measure obtained comparable or even better ranks for two of the eight ICD-9 codes (i.e., 276.7 and 233.7), it ranked the remaining six ICD-9 codes as low as around 1000. With its capability of reducing the undesirable effects of frequent ICD-9 codes, the exclusive causal-leverage measure achieved the best performance.

TABLE 4
COMPARING RANKS OF ICD-9 CODES GIVEN DRUG ENALAPRIL UNDER FOUR DIFFERENT INTERESTINGNESS MEASURES

TABLE 5
COMPARING RANKS OF ICD-9 CODES GIVEN DRUG PRAVASTATIN UNDER FOUR DIFFERENT INTERESTINGNESS MEASURES (NOS - NOT OTHERWISE SPECIFIED; NEC - NOT ELSEWHERE CLASSIFIED)

The experiment results for drug pravastatin are given in Table 5. Seven out of the top 10 ICD-9 codes were identified as likely ADRs by our physicians. The causal-leverage measure gives higher ranks for all seven of these ICD-9 codes than both the risk ratio and traditional leverage measures. Another interesting observation is that the causal-leverage measure placed three of the seven ICD-9 codes in its top 10 list. This implies that this measure does have the potential to discover some potential ADRs because of its capability of capturing the causal relationship between a drug and a symptom. Consistently, the exclusive causal-leverage measure again achieved the best performance.
Table 6 gives the experiment results for drug rosuvastatin. Six out of the top 10 ICD-9 codes were found to be likely ADRs of this drug. In terms of the relative abilities of the four interestingness measures to discover potential ADRs, these results are consistent with the results for the other two drugs.
Since the length of hazard period T affects the performance of our exclusive causal-leverage measure, we tested the rankings of three randomly selected known ADRs of drug enalapril for different T values (Table 7). Neither 30 nor 150 days appears to be an appropriate value for T since some known ADRs were ranked very low. Similar rankings were achieved when T was set to 60, 90, or 120 days.
6 DISCUSSION
The above experiment results showed that our exclusive causal-leverage measure outperformed traditional interestingness measures like risk ratio and leverage because of its ability to capture suspected causal relationships and exclude undesirable effects caused by frequent events. However, due to the complexity, incompleteness, and potential bias (e.g., tolerable ADRs sometimes may not be recorded) of the data, some ICD-9 codes may still be falsely ranked high based on the exclusive causal-leverage. That is, false associations cannot be completely avoided using our method. Please note that, in postmarketing surveillance, data mining is primarily used to shorten the long list of potential drug-ADR pairs. As with other data mining methods, the signal pairs found by our approach will be subject to further analysis (e.g., epidemiology study), case review, and interpretation by drug safety professionals experienced in the nuances of pharmacoepidemiology and clinical medicine [54].
When compared to the ADRs from clinical trials, the ADRs identified by the exclusive causal-leverage measure for the statins (pravastatin and rosuvastatin) are unique and different. Clinical trials revealed some similar ADRs between the two statins, including nausea, headaches, myalgia, and dizziness. However, the ADRs in this study, as listed in Tables 5 and 6, revealed no overlap between the two statins. One reason for this disparity could be that we only evaluated the top 10 symptoms ranked by our measure. The difference may also be accounted for by the differences in setting and sampling of cases. Clinical trial data come from double-blinded placebo studies, but the hospital data represent a naturalistic study. In clinical trials, the participants are individuals with hyperlipidemia as their only diagnosis or the primary diagnosis, and they are not allowed to have any complicated medication regimen or unstable medical conditions. The data set in our study was derived from hospitalized patients, who are primarily male (96.3%) with a mean age of 68, and who normally have multiple diagnoses of various severities and take a variety of medications. As a result, the ADRs derived from the exclusive causal-leverage measure represent more naturalistic and real patients in clinical care. The method allows us to examine and evaluate complicated samples of patients to draw useful causal information.
TABLE 6
COMPARING RANKS OF ICD-9 CODES GIVEN DRUG ROSUVASTATIN UNDER FOUR DIFFERENT INTERESTINGNESS MEASURES

TABLE 7
RANKINGS OF THREE KNOWN ADRS OF DRUG ENALAPRIL UNDER DIFFERENT LENGTHS OF HAZARD PERIOD T

The focus of this paper is on how to design a novel interestingness measure and apply the result to the important adverse drug reaction problem (i.e., finding interesting ADR signal pairs). Algorithm design and efficiency analysis become more important when one studies how to efficiently mine all possible rare event sets and association rules based on minimal support. Another difference is that, according to the literature, existing rare association rule mining research assumes that all the data are stored in a single table. In contrast, our work must use data from a relational (medical) database composed of several related tables. Although designing efficient SQL statements and data models to handle large volumes of data in a relational database is beyond the scope of this paper, we will consider developing an efficient algorithm suitable for relational databases and analyzing its complexity and efficiency in the future.
7 CONCLUSION
Mining the causal association between two events is very important and useful in many real applications. It can help people discover the causality of a type of event and avoid its potential adverse effects. However, mining these associations is very difficult, especially when events of interest occur infrequently. We have developed a new interestingness measure, exclusive causal-leverage, based on an experience-based fuzzy RPD model. This measure can be used to quantify the degree of association of a CAR. Moreover, the measure was designed to mask the undesirable effects caused by high-frequency events. We have applied this measure to detect the causal associations between each of the three drugs (i.e., enalapril, pravastatin, and rosuvastatin) and the ICD-9 codes. A data mining algorithm was developed to search a real electronic patient database for potential ADR signals. Experimental results showed that our algorithm could effectively make known ADRs rank high among all the symptoms in the database.
ACKNOWLEDGMENT
We would like to thank Ellison Floyd and Richard E. Miller for retrieving the patient information from the VAMC Detroit's databases. This work was supported in part by the National Institutes of Health under grant 1 R21 GM082821-01A1.
REFERENCES
[1] J. Talbot and P. Waller, Stephens Detection of New Adverse Drug
Reactions, 5 ed.: John Wiley & Sons, 2004.
[2] L. T. Kohn, J. M. Corrigan, and M. S. Donaldson, To error is
human: building a safer health system. Washington, DC: National
Academy Press, 2000.
[3] J. Lazarou, B. H. Pomeranz, and P. N. Corey, Incidence of
adverse drug reactions in hospitalized patients: a metaanalysis
of prospective studies, JAMA, vol. 279, pp. 12005, Apr 15 1998.
[4] K. Hartmann, A. K. Doser, and M. Kuhn, Postmarketing safety
information: how useful are spontaneous reports?,
Pharmacoepidemiology & Drug Safety, vol. 8, pp. 6571, 1999.
[5] L. Hazell and S. A. W. Shakir, UnderReporting of Adverse
Drug Reactions A Systematic Review, Drug Safety, vol. 29, pp.
385396, 2006.
[6] K. E. Lasser, P. D. Allen, S. J. Woolhandler, D. U. Himmelstein,
S. M. Wolfe, and D. H. Bor, Timing of new black box warnings
and withdrawals for prescription medications, Journal of the
American Medical Association, vol. 287, pp. 221520, 2002.
[7] A. Szarfman, J. M. Tonning, and P. M. Doraiswamy,
Pharmacovigilance in the 21st century: new systematic tools
for an old problem., Pharmacotherapy, vol. 24, pp. 1099104,
2004.
[8] M. Lindquist, I. R. Edwards, A. Bate, H. Fucik, A. M. Nunes,
and M. Stahl, From association to alerta revised approach to
international signal analysis, Pharmacoepidemiology & Drug
Safety, vol. 1, pp. 1525, 1999.
[9] S. J. Evans, P. C. Waller, and S. Davis, Use of proportional
reporting ratios (PRRs) for signal generation from spontaneous
adverse drug reaction reports, Pharmacoepidemiology & Drug
Safety, vol. 10, pp. 4836, 2001.
[10] W. DuMouchel, Bayesian Data Mining in Large Frequency
Tables, With an Application to the FDA Spontaneous Reporting
System, Am Stat pp. 177190, 1999.
[11] P. Purcell and S. Barty, Statistical techniques for signal
generation: the Australian experience, Drug Safety, vol. 25, pp.
41521, 2002.
[12] M. Hauben, Early postmarketing drug safety surveillance: data
mining points to consider, Annals of Pharmacotherapy, vol. 38,
pp. 162530, Oct 2004.
[13] Y. Ji, H. Ying, M. S. Farber, J. Yen, P. Dews, R. E. Miller, and R.
M. Massanari, A Distributed, Collaborative Intelligent Agent
System Approach for Proactive Postmarketing Drug Safety
Surveillance, IEEE Transactions on Information Technology in
Biomedicine, vol. 14, pp. 826837, 2010.
[14] H. Jin, J. Chen, H. He, G. Williams, C. Kelman, and C. OKeefe,
Mining unexpected temporal associations: applications in
detecting adverse drug reactions, IEEE Transactions on
Information Technology in Biomedicine, vol. 12, pp. 488500, 2008.
[15] G. N. Norn, J. Hopstadius, A. Bate, K. Star, and I. R. Edwards,
Temporal pattern discovery in longitudinal electronic patient
records Data Mining and Knowledge Discovery, vol. 20, pp. 361
387, 2010.
[16] Y. Ji, H. Ying, P. Dews, M. S. Farber, A. Mansour, J. Tran, R. E.
Miller, and R. M. Massanari, A Fuzzy RecognitionPrimed
Decision ModelBased Causal Association Mining Algorithm
for Detecting Adverse Drug Reactions in Postmarketing
Surveillance, presented at the Proceedings of The IEEE
International Conference on Fuzzy Systems, IEEE World
Congress on Computational Intelligence, Barcelon, Spain, 2010.
[17] Y. Ji, H. Ying, P. Dews, A. Mansour, J. Tran, R. E. Miller, and R.
M. Massanari, "A Potential Causal Association Mining
Algorithm for Screening Adverse Drug Reactions in
Postmarketing Surveillance," IEEE Transactions on Information
Technology in Biomedicine, vol. 15, pp. 428-437, 2011.
[18] Y. Ji, R. M. Massanari, J. Ager, J. Yen, R. E. Miller, and H. Ying,
"A Fuzzy Logic-Based Computational Recognition-Primed
Decision Model," Information Sciences, vol. 177, pp. 4338-4353,
2007.
[19] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to
Data Mining, 2005.
[20] L. Geng and H. J. Hamilton, "Interestingness Measures for Data
Mining: A Survey," ACM Computing Surveys, vol. 38, 2006.
[21] A. K. H. Tung, H. Lu, J. Han, and L. Feng, "Efficient mining of
intertransaction association rules," IEEE Transactions on
Knowledge and Data Engineering, vol. 15, pp. 43-56, 2003.
[22] C. Marinica and F. Guillet, "Knowledge-Based Interactive
Postmining of Association Rules Using Ontologies," IEEE
Transactions on Knowledge and Data Engineering, vol. 22, pp.
784-797, 2010.
[23] R. Agrawal and R. Srikant, "Fast algorithms for mining
association rules," presented at the Proceedings of the 20th
International Conference on Very Large Databases, Santiago,
Chile, 1994.
[24] P.-N. Tan and V. Kumar, "Interestingness Measures for
Association Patterns: A Perspective," Department of Computer
Science, University of Minnesota, 2000.
[25] W. Klosgen, "Explora: a multipattern and multistrategy
discovery assistant," in Advances in knowledge discovery and data
mining, U. M. Fayyad, et al., Eds., 1st ed. Cambridge, MA: MIT
Press, 1996, pp. 249-271.
[26] N. Lavrac, P. Flach, and B. Zupan, "Rule Evaluation Measures:
A Unifying View," presented at the Proceedings of the 9th
International Workshop on Inductive Logic Programming,
Bled, Slovenia, 1999.
[27] G. M. Weiss, "Mining with rarity: a unifying framework,"
SIGKDD Explor. Newsl., vol. 6, pp. 7-19, 2004.
[28] B. Liu, W. Hsu, and Y. Ma, "Mining association rules with
multiple minimum supports," presented at the Proceedings of
the fifth ACM SIGKDD international conference on Knowledge
discovery and data mining, San Diego, California, United
States, 1999.
[29] H. Yun, D. Ha, B. Hwang, and K. H. Ryu, "Mining association
rules on significant rare data using relative support," J. Syst.
Softw., vol. 67, pp. 181-191, 2003.
[30] Y.-H. Hu and Y.-L. Chen, "Mining association rules with
multiple minimum supports: a new mining algorithm and a
support tuning mechanism," Decis. Support Syst., vol. 42, pp.
1-24, 2006.
[31] L. Szathmary, P. Valtchev, and A. Napoli, "Finding minimal
rare itemsets and rare association rules," presented at the
Proceedings of the 4th international conference on Knowledge
science, engineering and management, Belfast, Northern
Ireland, UK, 2010.
[32] L. Szathmary, A. Napoli, and P. Valtchev, "Towards Rare
Itemset Mining," presented at the Proceedings of the 19th IEEE
International Conference on Tools with Artificial Intelligence -
Volume 01, 2007.
[33] L. Troiano, G. Scibelli, and C. Birtolo, "A Fast Algorithm for
Mining Rare Itemsets," presented at the Proceedings of the 2009
Ninth International Conference on Intelligent Systems Design
and Applications, 2009.
[34] Y. S. Koh and N. Rountree, "Finding sporadic rules using
Apriori-Inverse," vol. 3518 LNAI, ed, 2005, pp. 97-106.
[35] M. Adda, L. Wu, and Y. Feng, "Rare itemset mining," 2007, pp.
73-80.
[36] C. Zhang and S. Zhang, Association Rule Mining: Models and
Algorithms. New York: Springer-Verlag, 2002.
[37] I. Guyon, D. Janzing, and B. Schölkopf, "Causality: Objectives
and Assessment," JMLR Workshop and Conference Proceedings,
vol. 6, pp. 1-38, 2010.
[38] J. Pearl, Causality: Models, Reasoning and Inference, 2nd ed.:
Cambridge University Press, 2009.
[39] D. Koller and N. Friedman, Probabilistic Graphical Models:
Principles and Techniques, 1st ed.: The MIT Press, 2009.
[40] M. Hassoun, Fundamentals of Artificial Neural Networks: The MIT
Press, 2003.
[41] G. A. Fink, Markov Models for Pattern Recognition: From Theory to
Applications: Springer, 2010.
[42] D. Heckerman, "Bayesian Networks for Data Mining," Data
Min. Knowl. Discov., vol. 1, pp. 79-119, 1997.
[43] G. F. Cooper, "A Simple Constraint-Based Algorithm for
Efficiently Mining Observational Databases for Causal
Relationships," Data Min. Knowl. Discov., vol. 1, pp. 203-224,
1997.
[44] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler,
"Hidden Markov models in computational biology.
Applications to protein modeling," J Mol Biol, vol. 235, pp.
1501-31, 1994.
[45] S. R. Eddy, "Profile hidden Markov models," Bioinformatics, vol.
14, pp. 755-63, 1998.
[46] G. A. Klein, "A recognition-primed decision making model of
rapid decision making," Decision Making In Action: Models and
Methods, pp. 138-147, 1993.
[47] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, pp.
338-353, 1965.
[48] H. Ying, Fuzzy Control and Modeling: Analytical Foundations and
Applications: IEEE Press, 2000.
[49] I. R. Edwards and J. K. Aronson, "Adverse drug reactions:
definitions, diagnosis, and management," Lancet, vol. 356, pp.
1255-9, 2000.
[50] Y. Ji, H. Ying, J. Yen, and R. M. Massanari, "A Fuzzy
Logic-Based Computational Recognition-Primed Decision Model,"
Information Sciences, vol. 177, pp. 4338-4353, 2007.
[51] M. Cayrol, H. Farreny, and H. Prade, "Fuzzy Pattern
Matching," Kybernetes, vol. 11, pp. 103-116, 1982.
[52] R. Orchard, "Fuzzy Reasoning in Jess: The FuzzyJ Toolkit and
FuzzyJess," in Third International Conference on Enterprise
Information Systems, Setubal, Portugal, 2001, pp. 533-542.
[53] P. Staff, Ed., Physicians' Desk Reference 2011. PDR Network,
2010.
[54] M. Hauben and X. Zhou, "Quantitative methods in
pharmacovigilance: focus on signal detection," Drug Safety, vol.
26, pp. 159-86, 2003.

Yanqing Ji received the B.Eng. degree in industrial automation from
Qingdao University, Qingdao, China, in 1997, the M.Sc. degree in physical
electronics from the University of Science and Technology of China, Hefei,
China, in 2001, and the Ph.D. degree in computer engineering from Wayne
State University, Detroit, MI, in 2007. He is currently an Assistant
Professor with the Department of Electrical and Computer Engineering,
Gonzaga University, Spokane, WA. His research interests include multiagent
systems, parallel and distributed systems, data mining, and their
biomedical applications. Dr. Ji is a member of the Program Committees of
various international conferences and workshops, including the 2009
International Workshop on Graphics Processing Unit Technologies and
Applications, and a reviewer for various international journals.
Hao Ying (S'88-M'90-SM'97-F'12) received the B.S. degree in electrical
engineering and the M.S. degree in computer engineering from Donghua
University, Shanghai, China, in 1982 and 1984, respectively, and the Ph.D.
degree in biomedical engineering from The University of Alabama,
Birmingham, in 1990. He is currently a Professor with the Department of
Electrical and Computer Engineering, Wayne State University, Detroit, MI.
He has authored a book entitled Fuzzy Control and Modeling: Analytical
Foundations and Applications (IEEE Press, 2000), over 90 peer-reviewed
journal papers, and more than 130 conference papers. Prof. Ying is an
Associate Editor or a member of the Editorial Board for 10 international
journals and is a member of the Fuzzy Systems Technical Committee of the
IEEE Computational Intelligence Society. He was Program Chair for The 2005
North American Fuzzy Information Processing Society (NAFIPS) Conference
and Co-Program Chair for the 2010 NAFIPS Conference as well as for The
International Joint Conference of NAFIPS Conference, the Industrial Fuzzy
Control and Intelligent System Conference, and the NASA Joint Technology
Workshop on Neural Networks and Fuzzy Logic held in 1994.
John Tran received his B.S. in Chemistry from Santa Clara University,
California, in 1990 and his M.D. from the School of Medicine, St. Louis
University, in 1995. He completed his Residency Training in General
Psychiatry at Washington University in St. Louis, Missouri, in 1999 and his
Fellowship Training in Geriatric Psychiatry at the University of Washington
(UW), Seattle, Washington, in 2000. He completed his Post-Doctoral Research
Fellowship in Geriatric Psychiatry at UW in 2002. He is currently Board
Certified in General and Geriatric Psychiatry. He currently serves as the
Medical Director and Chief of Geriatric Psychiatry for Frontier Behavioral
Health, where he is responsible for directing and providing psychiatric
services to adult and elderly patients in an outpatient psychiatric clinic
as well as in long-term care settings, including skilled nursing
facilities, assisted living facilities, and adult family homes. He
currently supervises 11 psychiatrists and 10 Advanced Registered Practice
Nurses. He has a strong interest in clinical research and teaching. He is
the Director and Primary Investigator for Frontier Institute, where he
conducts NIH- and industry-sponsored clinical trials of new medications for
psychiatric disorders including depression, anxiety, schizophrenia, and
dementias. He also serves as Associate Clinical Professor in the School of
Medicine, University of Washington, and as Adjunct Associate Clinical
Professor in the School of Pharmacy and the School of Nursing, Washington
State University, teaching and training medical students, psychiatry
residents, and pharmacy and nursing students.
Peter Dews received his medical degree from Wayne State University in 1986.
He also holds a bachelor's degree in microbiology and a master's degree in
clinical research design and statistical analysis from the University of
Michigan. He is currently the Chief Medical Officer and Vice President of
Medical Affairs for St. Mary Mercy Hospital, Trinity Health. He was the
medical director of ambulatory services for the Wayne State University
Physician Group's Department of Internal Medicine and had practiced general
medicine for nearly 15 years before he joined Trinity Health. Dr. Dews is a
member of the Society of General Internal Medicine, the American College of
Physicians, the Association of Health Services Research, the American
Medical Informatics Society, and the American College of Physician
Executives.
Ayman Mansour received his B.Sc. degree in Electrical and Electronics
Engineering from the University of Sharjah, Sharjah, U.A.E., in 2004 and
his M.Sc. degree in Electrical Engineering from the University of Jordan,
Amman, Jordan, in 2006. He joined the Department of Electrical and Computer
Engineering at Wayne State University, Detroit, Michigan, in 2008 as a
Ph.D. student. His research interests include multiagent systems,
intelligent systems, and data mining.
R. Michael Massanari graduated from the Feinberg School of Medicine,
Northwestern University, Chicago, IL, in 1967. He received post-graduate
training at the University of Pennsylvania in Philadelphia and at
Northwestern University in Chicago. He also received an M.S. degree in
Preventive Medicine and Epidemiology in 1985 from the University of Iowa in
Iowa City, Iowa. He is currently Executive Director of the Critical
Junctures Institute in Bellingham, WA. He served as Director of the Center
for Healthcare Effectiveness Research (CHER) and Professor of Medicine and
Community Medicine at Wayne State University until his retirement in 2009.
Dr. Massanari is a Fellow of the American College of Physicians and a
member of the Academy of Health and American Public Health Association.