
THE MECHANICS OF PROBABILISTIC RECORD MATCHING
Jeffrey Tyzzer
Why Does this Deck Exist?
! I struggled while studying probabilistic matching--
reading, e.g., the works of Fellegi and Sunter,
Newcombe, Schumacher, and Herzog, et al.--and
wanted to summarize my findings as much to help
others understand it as to check my own
understanding. To that end, please direct any errors
and constructive feedback to me at
jefftyzzer@sbcglobal.net
2
Agenda
! Recall that Master Data Management (MDM)
enables the consolidation and syndication of
trusted, authoritative data
! In this presentation, we focus on the consolidation--
or unification--of master data, which is the heart of
all MDM systems
3
Matches
! In a data set, constructs (i.e. records) are proxies for
real-world objects
! Matches are entity instances (records) that have the
same values for those properties (attributes) that
serve to identify them
! One of the goals of Master Data Management is to
ensure that there is a 1:1 correspondence between
the real and proxy objects
4
Ways of Matching
! There are two principal ways to match: deterministically and
probabilistically
! Deterministic matching is rules-based, e.g. IF R1a1 = R2a1 AND
R1a2 = R2a2 THEN Link ELSE NonLink (a sketch follows after this list)
! Deterministic matching is binary--all or nothing
! Probabilistic matching is likelihood-based
! Probabilistic matching is analog--it's based on a range of
agreement
! The pioneers of probabilistic matching were Newcombe, et
al., Tepping, and Fellegi & Sunter.
! Probabilistic matching is particularly useful in the absence of
unique identifiers, when only so-called quasi-identifiers are
available, such as names and dates of birth
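As an illustration of the rules-based bullet above, here is a minimal Python sketch of a deterministic rule (the record layout and attribute names are hypothetical, not part of the original deck):

```python
# Hypothetical deterministic rule: link two records only if every listed
# attribute agrees exactly; anything less is a non-link (all or nothing).
def deterministic_match(r1, r2, attributes=("last_name", "dob")):
    return "Link" if all(r1[a] == r2[a] for a in attributes) else "NonLink"

r1 = {"last_name": "Tyzzer", "dob": "5/26/19xx"}
r2 = {"last_name": "Tyzzer", "dob": "5/26/19xx"}
print(deterministic_match(r1, r2))  # Link -- there is no "maybe" outcome
```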
5
Consider
! R1 Name: Jeff Tyzzer Address: 848 Swanston Dr.
Phone: (916) 555-1212
! R2 Name: Jeffrey Tyzzer Address: 884 Swanson Dr.
Phone: 555-1212
! Would you consider these two records to be
matches? Why? Would they be deterministic or
probabilistic matches?
6
Hypothesis Testing
! In classic probabilistic matching, we take our cue
from inferential statistics when comparing two
records probabilistically:
! H0 - The null hypothesis: The records do not represent
the same real-world object, i.e. they are not matches
! HA - The alternate hypothesis: The records represent the
same real-world object, i.e. they are matches
! Typically, H0 is rejected if the p-value of our test statistic
is less than .05 (the significance level)
7
Hypothesis Testing, cont'd
! A Type I error, designated by the Greek letter α
(alpha), occurs when we incorrectly reject H0
! A Type II error, designated by the Greek letter β
(beta), occurs when we incorrectly fail to reject H0
8
Record Linkage and Type I & II Errors
! Since we've decided that H0 indicates that the
records are different, if we commit a Type I error
(incorrectly rejecting H0) we're (wrongly) asserting
that the records match. This is a false positive
! Since we've decided that HA indicates that the
records are the same (matches), if we commit a
Type II error (incorrectly failing to reject H0) we're
(wrongly) asserting that the records do not match.
This is a false negative
9
Agreement Probabilities
! We must first decide on our match attributes, a domain-
specific decision. For this presentation, we will use First
Name, Last Name, and DoB
! For our purposes, when comparing these attributes
between records there are two possible outcomes: they
will agree or they won't
! We calculate the probabilities of these attributes
agreeing under each of the preceding hypotheses.
There are several methods for computing these; among
them are sampling, prior studies, and Maximum
Likelihood Estimation (MLE) using Expectation
Maximization (EM). (A minimal sketch of the sampling approach follows below)
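As one possible illustration of the sampling approach (EM itself is not shown here), a minimal Python sketch that estimates the per-attribute m and u probabilities from a hypothetical set of record pairs already labeled as matches or non-matches:

```python
# labeled_pairs: iterable of (record1, record2, is_match) triples, where each
# record is a dict of attribute values and is_match is a known (sampled) label.
def estimate_m_u(labeled_pairs, attributes):
    agree = {"match": dict.fromkeys(attributes, 0),
             "nonmatch": dict.fromkeys(attributes, 0)}
    totals = {"match": 0, "nonmatch": 0}
    for r1, r2, is_match in labeled_pairs:
        label = "match" if is_match else "nonmatch"
        totals[label] += 1
        for a in attributes:
            if r1[a] == r2[a]:
                agree[label][a] += 1
    # m[a] = P(a agrees | true match); u[a] = P(a agrees | true non-match)
    m = {a: agree["match"][a] / totals["match"] for a in attributes}
    u = {a: agree["nonmatch"][a] / totals["nonmatch"] for a in attributes}
    return m, u
```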
10
Example
Attribute     Non-match (H0)   Match (HA)
Last Name     .05              .95
First Name    .15              .90
DoB           .25              .85
! Using one of the techniques mentioned in slide 10's
last bullet point, say we find that, for our data, when
the two records do in fact represent the same entity
the last names match 95% of the time, the first names
90%, and the DoBs 85%. When the two records are
known to represent different entities, the match rates
are much lower--5%, 15%, and 25%, respectively
11
Match Attribute Possibilities
! Since for simplicity's sake we're saying that the
attributes must simply either match or not--designating
1 for a match and 0 for a non-match--then for our three
attributes we have the following 2^3 = 8 agreement
possibilities:
LN FN DoB
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1
12
Match Attribute Probabilities
! The space of all possible agreement patterns is referred to by the
Greek letter Γ (gamma)
! Given the agreement probabilities listed on slide 11, we next
compute two probabilities for each of the eight agreement patterns
(slide 12) in Γ (in the same attribute order): the m (match)
probability and the u (non-match) probability. (A sketch of this computation follows below)
! Example - the m probability for the (0,0,0) pattern (i.e. none match):
(1 - .95) * (1 - .90) * (1 - .85) = 0.00075
! Example - the u probability for the (1,0,1) pattern (match on LN and
DoB):
(.05) * (1-.15) * (.25) = 0.01063
! Each agreement pattern can thus be viewed as a value of a discrete
random variable whose range is the set of all possible comparison outcomes
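A minimal Python sketch of this computation, using the per-attribute probabilities from slide 11 and the conditional-independence assumption noted on slide 15 (attribute order LN, FN, DoB):

```python
from itertools import product

m_attr = [0.95, 0.90, 0.85]  # P(attribute agrees | true match), slide 11
u_attr = [0.05, 0.15, 0.25]  # P(attribute agrees | true non-match), slide 11

def pattern_prob(pattern, probs):
    # Multiply p where the attribute agrees (1) and (1 - p) where it does
    # not (0), assuming the attributes are conditionally independent.
    result = 1.0
    for agrees, p in zip(pattern, probs):
        result *= p if agrees else (1 - p)
    return result

for pattern in product([0, 1], repeat=3):  # the 2^3 patterns from slide 12
    m = pattern_prob(pattern, m_attr)
    u = pattern_prob(pattern, u_attr)
    print(pattern, round(m, 5), round(u, 5))  # e.g. (0, 0, 0) 0.00075 0.60563
```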
13
Match Attribute Probabilities, cont'd
! The completed table looks like this:
Agreement Pattern   m         u
0,0,0               .00075    .60563
1,0,0               .01425    .03188
0,1,0               .00675    .10688
0,0,1               .00425    .20188
1,1,0               .12825    .00563
1,0,1               .08075    .01063
0,1,1               .03825    .03563
1,1,1               .72675    .00188
14
Observations
! Given the agreement probabilities on slide 11, only 72.675% of
the true matches agree on all three attributes (and so would be
caught by an all-attributes deterministic rule), and only 60.563% of
the true non-matches disagree on all three attributes
! Both columns (must) sum to 1
! Probabilistic matching gives us maybe in addition to yes and
no as a possible outcome--it lets us deal with those situations
where not all attributes match, but some do (recall your
answers to the questions on slide 6)
! This technique assumes conditional independence among the
match attributes, which may not always be the case
(consider the correlation between name and gender)
15
Almost There
! The next two steps are:
! Calculate the log-likelihood ratio test statistic T, the
base-2 logarithm of the ratio of m to u,
e.g., T = log2(0.03825/0.03563) = 0.10237,
and order the results ascending by T
! Compute the cumulative probabilities, summing m from the top
down and u from the bottom up (a sketch of both steps follows below)
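A sketch of both steps in Python, assuming the (pattern, m, u) triples from a calculation like the one sketched after slide 13 have been collected into a list:

```python
from math import log2

def weight_table(rows):
    # rows: list of (agreement_pattern, m_probability, u_probability) triples.
    # Compute T = log2(m/u), sort ascending by T, then accumulate m from the
    # top down and u from the bottom up, as in the table on slide 17.
    scored = sorted(((p, m, u, log2(m / u)) for p, m, u in rows),
                    key=lambda r: r[3])
    table, cum_m = [], 0.0
    for p, m, u, t in scored:
        cum_m += m
        table.append([p, m, u, t, cum_m, None])
    cum_u = 0.0
    for row in reversed(table):
        cum_u += row[2]
        row[5] = cum_u  # cumulative u, summed bottom-up
    return table  # each row: [pattern, m, u, T, cumulative m, cumulative u]
```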
16
The Test Statistic & Cumulative Probs
Agreement Pattern   m         u         T          m (Σ)     u (Σ)
0,0,0               0.00075   0.60563   -9.65733   0.00075   1.00000
0,0,1               0.00425   0.20188   -5.56989   0.00500   0.39441
0,1,0               0.00675   0.10688   -3.98496   0.01175   0.19253
1,0,0               0.01425   0.03188   -1.16169   0.02600   0.08565
0,1,1               0.03825   0.03563    0.10237   0.06425   0.05377
1,0,1               0.08075   0.01063    2.92532   0.14500   0.01814
1,1,0               0.12825   0.00563    4.50968   0.27325   0.00751
1,1,1               0.72675   0.00188    8.59458   1.00000   0.00188
17
Deciding on the Thresholds
! We have three choices when confronted with a pair of records: definitely
link them, definitely do not link them, and maybe link them. How do we
decide? By establishing thresholds for each of the three possibilities,
resulting in three discrete (and disjoint) T regions (slide 17)
! If, as we said on slide 7, we reject H0 when the p-value is less than .05,
then we've decided that we're willing to accept an alpha of .05, meaning
that we're OK with a Type I error (a false positive, given our definitions of
H0 and HA) 5% of the time. In other words, we're willing to accept that
truly non-matching record pairs are erroneously linked up to 5% of the time
! Assume that beta, our tolerance for a Type II error (a false negative, given
our definitions of H0 and HA), is also .05. (Note that the false positive and
negative tolerances are domain-specific--what's the possible harm of a
false positive in a hospital setting versus one for, say, a direct marketer
compiling a household address list?)
18
Deciding on the Thresholds, contd
! The cumulative m probabilities (summed from the top) give the false
negative rate associated with a candidate lower cut-off, and the
cumulative u probabilities (summed from the bottom) give the false
positive rate associated with a candidate upper cut-off. The last two
columns in the table on slide 17, respectively, show these
! Our settings of alpha and beta dictate that any pair of
records with a T of -1.16169 (λ) or less is a definite
non-link and that any pair of records with a T of
2.92532 (μ) or greater is a definite link. Thus,
those with an agreement pattern of (0,1,1) are
our maybes. This is known as the clerical review
region. (A sketch of this threshold selection follows below)
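Assuming the table of rows produced by the earlier weight_table sketch, locating the two cut-offs and classifying a pair might look like this (a sketch, not the canonical Fellegi-Sunter procedure):

```python
def find_thresholds(table, alpha=0.05, beta=0.05):
    # Lower cut-off: the largest T whose cumulative m (the false negative rate
    # if everything at or below it is called a non-link) stays within beta.
    # Upper cut-off: the smallest T whose cumulative u (the false positive rate
    # if everything at or above it is called a link) stays within alpha.
    lower = max((t for _, _, _, t, cum_m, _ in table if cum_m <= beta), default=None)
    upper = min((t for _, _, _, t, _, cum_u in table if cum_u <= alpha), default=None)
    return lower, upper

def classify(t, lower, upper):
    if t <= lower:
        return "non-link"
    if t >= upper:
        return "link"
    return "clerical review"

# With the values on slide 17 this yields lower = -1.16169 and upper = 2.92532,
# so classify(0.10237, -1.16169, 2.92532) returns "clerical review".
```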
19
A Graphical Representation
[Chart: the cumulative m and u probabilities (0.0 to 1.0) plotted against the eight T values from slide 17, with a red vertical line at the lower threshold (λ) and a green vertical line at the upper threshold (μ)]
20
Interpretation
! Record pairs to the left of the red line (lambda) are a
certain no and those to the right of the green line (mu)
are a certain yes. In-between the two lines is the
maybe region, whose record pairs require human
review
! Fellegi & Sunter's technique assures us that the maybe
region is as small as possible given our settings for
alpha and beta (ref. the Neyman-Pearson lemma)
! The width of the clerical region is a function of the
values of α (alpha) and β (beta) (slide 8)
21
Example I
Record   LN       FN     DoB
1        Tyzzer   John   5/26/19xx
2        Tyzzer   Jeff   5/26/19xx
! The agreement pattern is (1,0,1). Given its
corresponding T value, these records would be
classified as a match
22
Example II
Record   LN       FN     DoB
1        Smith    Jeff   5/26/19xx
2        Tyzzer   Jeff   5/26/19xx
! The agreement pattern is (0,1,1). Given its
corresponding T value, these records would be
classified as a maybe and queued for clerical
review
23
Some Final Thoughts
! To compute the agreement probabilities (slide 11), the expectation
maximization (EM) technique is usually employed. These probabilities drive
all subsequent results
! The demonstrated scenario and examples are deliberately trivial
! A more realistic situation would likely include more match columns and
several more possible configurations of them instead of simple agreement
or disagreement
! A more realistic situation would also have accommodated fuzzy matches
and incorporated value-specific frequencies into the probability
calculations. For last name, say, the agreement pattern would then be
interpreted as "the LN agrees and is <>," e.g. Smith
! To reduce the number of record-to-record comparisons from n(n-1)/2
(intrafile) or n*m (interfile) to something manageable, blocking (e.g. on zip
code or the phonetic encoding of the surname) is typically used; a sketch follows below
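A minimal Python sketch of blocking (the zip field name is hypothetical): only records sharing a blocking key are paired for comparison, rather than all n(n-1)/2 pairs:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, block_key="zip"):
    # Group records by the blocking key, then generate candidate pairs only
    # within each block instead of across the whole file.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)
```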
24
References
! Do, Chuong B., and Serafim Batzoglou. "What Is the Expectation
Maximization Algorithm?" Nature Biotechnology 26.8 (2008): 897-899.
! Fellegi, Ivan P., and Alan B. Sunter. "A Theory for Record Linkage."
Journal of the American Statistical Association 64.328 (1969):
1183-1210.
! Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data
Quality and Record Linkage Techniques. New York: Springer Science
+ Business Media, 2007.
! ---. "Record Linkage." WIREs Computational Statistics 2.5 (2010):
535-543.
! Kirkendall, Nancy. "Weights in Computer Matching: Applications and
an Information Theoretic Point of View." Record Linkage
Techniques--1985. Internal Revenue Service.
25
