Você está na página 1de 179

Entity Resolution: Tutorial

EntityResolution:Tutorial
Li Getoor
Lise
G t

Ashwin Machanavajjhala

UniversityofMaryland
College Park MD
CollegePark,MD

DukeUniversity
Durham,NC
,

http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
http://goo.gl/f5eym

WhatisEntityResolution?
Problemofidentifyingandlinking/groupingdifferent
manifestationsofthesamerealworldobject.
f
f
j
Examplesofmanifestationsandobjects:
p
j
Differentwaysofaddressing(names,emailaddresses,FaceBook
accounts)thesamepersonintext.
Webpageswithdifferingdescriptionsofthesamebusiness.
Differentphotosofthesameobject.

Ironically,EntityResolutionhasmanyduplicatenames
Recordlinkage
Coreference resolution

Duplicatedetection

Referencereconciliation

Fuzzymatch
y

Object consolidation
Objectconsolidation
Objectidentification

Deduplication
Approximatematch

Entity clustering
Entityclustering

Identityuncertainty

Hardeningsoftdatabases
Doubles

Merge/purge

Household matching
Householdmatching

Householdingg

Referencematchingg

ERMotivatingExamples

LinkingCensusRecords
Public Health
PublicHealth
Websearch
Comparison shopping
Comparisonshopping
Counterterrorism
Spam detection
Spamdetection
MachineReading

ERandNetworkAnalysis

before

after

Motivation:NetworkScience
Measuringthetopologyoftheinternetusing
traceroute

IPAliasingProblem[Willinger etal.2009]

IPAliasingProblem[Willinger etal.2009]

TraditionalChallengesinER
Name/Attributeambiguity

Thomas Cruise
ThomasCruise

MichaelJordan

TraditionalChallengesinER
Name/Attributeambiguity
Errors due to data entry
Errorsduetodataentry

TraditionalChallengesinER
Name/Attributeambiguity
Errors due to data entry
Errorsduetodataentry
MissingValues

[Gilletal;Univ ofOxford2003]

TraditionalChallengesinER

Name/Attributeambiguity
Errors due to data entry
Errorsduetodataentry
MissingValues
Changing Attributes
ChangingAttributes

Dataformatting

Abbreviations/DataTruncation
/

BigDataERChallenges

BigDataERChallenges
LargerandmoreDatasets
Needefficientparalleltechniques

MoreHeterogeneity
M
H t
it
Unstructured,UncleanandIncompletedata.Diversedatatypes.
Nolongerjustmatchingnameswithnames,butAmazonprofileswith
No longer just matching names with names, but Amazon profiles with
browsinghistoryonGoogleandfriendsnetworkinFacebook.

BigDataERChallenges
LargerandmoreDatasets
Needefficientparalleltechniques

MoreHeterogeneity
M
H t
it
Unstructured,UncleanandIncompletedata.Diversedatatypes.

Morelinked
More linked
Needtoinferrelationshipsinadditiontoequality

MultiRelational
Dealwithstructureofentities(AreWalmart andWalmart Pharmacy
thesame?)

Multidomain
l d
Customizablemethodsthatspanacrossdomains

Multipleapplications
Multiple applications (websearchversuscomparisonshopping)
(web search versus comparison shopping)
Servediverseapplicationwithdifferentaccuracyrequirements

Outline
1.
2
2.
3.
4
4.

AbstractProblemStatement
Algorithmic Foundations of ER
AlgorithmicFoundationsofER
ScalingERtoBigData
Challenges & Future Directions
Challenges&FutureDirections

Outline
1. AbstractProblemStatement
2 AlgorithmicFoundationsofER
2.
Algorithmic Foundations of ER
a)
b)
c)
d)

DataPreparationandMatchFeatures
Pairwise ER
ConstraintsinER
Algorithms

RecordLinkage
Record
Linkage
Deduplication
CollectiveER

3. ScalingERtoBigData
4. Challenges&FutureDirections

5minutebreak

Outline
1. AbstractProblemStatement
2 AlgorithmicFoundationsofER
2.
Algorithmic Foundations of ER
3. ScalingERtoBigData
a) Blocking/CanopyGeneration
Blocking/Canopy Generation
b) DistributedER

4. Challenges&FutureDirections
Challenges & Future Directions

Outline
1.
2
2.
3.
4
4.

AbstractProblemStatement
Algorithmic Foundations of ER
AlgorithmicFoundationsofER
ScalingERtoBigData
Challenges & Future Directions
Challenges&FutureDirections

ScopeoftheTutorial
Whatwecover:
FundamentalalgorithmicconceptsinER
ScalingERtobigdatasets
TaxonomyofcurrentERalgorithms

Whatwedonotcover:

Schema/ontologyresolution
Schema/ontology
resolution
Datafusion/integration/exchange/cleaning
Entity/InformationExtraction
PrivacyaspectsofEntityResolution
Detailsonsimilaritymeasures
Technical details and proofs
Technicaldetailsandproofs

ERReferences
Book/SurveyArticles
DataQualityandRecordLinkageTechniques
[T Herzog F Scheuren W Winkler Springer 07]
[T.Herzog,F.Scheuren,W.Winkler,Springer,
07]
DuplicateRecordDetection[A.Elmagrid,P.Ipeirotis,V.Verykios,TKDE07]
AnIntroductiontoDuplicateDetection[F.Naumann,M.Herschel,M&P
y
]
synthesislectures2010]
EvaluationofEntityResolutionApproachedonRealworldMatchProblems
[H.Kopke,A.Thor,E.Rahm,PVLDB2010]
DataMatching[P.Christen,Springer2012]

Tutorials
RecordLinkage:SimilaritymeasuresandAlgorithms
[N.Koudas,S.Sarawagi,D.Srivatsava SIGMOD06]
DatafusionResolvingdataconflictsforintegration
[X.Dong,F.Naumann VLDB09]
EntityResolution:Theory,PracticeandOpenChallenges
Entity Resolution: Theory Practice and Open Challenges
http://goo.gl/Ui38o [L.Getoor,A.Machanavajjhala AAAI12]

PART 1
PART1
ABSTRACT PROBLEM STATEMENT
ABSTRACTPROBLEMSTATEMENT

AbstractProblemStatement
RealWorld

DigitalWorld

Records/
/
Mentions

Deduplication ProblemStatement
Clustertherecords/mentionsthatcorrespondtosame
entity
y

Deduplication ProblemStatement
Clustertherecords/mentionsthatcorrespondtosame
entity
y
Intensional Variant:Computeclusterrepresentative

RecordLinkageProblemStatement
Linkrecordsthatmatchacrossdatabases
A

ReferenceMatchingProblem
Matchnoisyrecordstocleanrecordsinareferencetable

Reference
Table
bl

AbstractProblemStatement
RealWorld

DigitalWorld

Deduplication ProblemStatement

Deduplication withCanonicalization

GraphAlignment(&motifsearch)
Graph1

Graph2

Relationshipsarecrucial

Relationshipsarecrucial

Notation

R:setofrecords/mentions(typed)
H: set of relations / hyperedges (typed)
H:setofrelations/hyperedges
M:setofmatches(recordpairsthatcorrespondtosameentity)
N: set of non matches (recordpairscorrespondingtodifferententities)
N:setofnonmatches
(
d i
di t diff
t titi )
E:setofentities
L set of links
L:setoflinks

TTrue(M
(Mtrue,N
Ntrue,EEtrue,LLtrue):accordingtorealworld
)
di
l
ld
vs Predicted(Mpred,Npred,Epred,Lpred ):byalgorithm

RelationshipbetweenMtrue andMpred
Mtrue (SameAs ,Equivalence)
(Similar representations and similar attributes)
Mpred (Similarrepresentationsandsimilarattributes)
Mtrue

RxR

Mpred

Metrics
Pairwise metrics
Precision/Recall,F1
#ofpredictedmatchingpairs

Clusterlevelmetrics
purity,completeness,complexity
Precision/Recall/F1:Clusterlevel,closestcluster,MUC,B
Precision/Recall/F1: Cluster level closest cluster MUC B3 ,
RandIndex
Generalizedmergedistance[Menestrina etal,PVLDB10]

Littleworkthatevaluationscorrectpredictionoflinks

TypicalAssumptionsMade
Eachrecord/mentionisassociatedwithasinglereal
worldentity.
y

Inrecordlinkage,noduplicatesinthesamesource
Iftworecords/mentionsareidentical,thentheyaretrue
If two records/mentions are identical then they are true
matches

(
(,)
) Mtrue

ERversusClassification
Findingmatchesvs nonmatchesisaclassificationproblem
Imbalanced:typicallyO(R)matches,O(R^2)nonmatches
Instancesarepairsofrecords.PairsarenotIID

(,) Mtrue
AND

( , ) Mtrue
(,)

(
(,)
) Mtrue

ERvs (Multirelational)Clustering
Computingentitiesfromrecordsisaclusteringproblem
Intypicalclusteringalgorithms(kmeans,LDA,etc.)
number of clusters is a constant or sub linear in R.
numberofclustersisaconstantorsublinearinR.
In
InER:numberofclustersislinearinR,andaverage
ER: number of clusters is linear in R and average
clustersizeisaconstant.Significantfractionofclusters
aresingletons.
g

PART 2
PART2
ALGORITHMIC FOUNDATIONS OF ER
ALGORITHMICFOUNDATIONSOFER

OutlineofPart2
a) DataPreparationandMatchFeatures
b) Pairwise ER

Determiningwhetherornotapairofrecordsmatch

c) ConstraintsinER
5minutebreak

d) Algorithms

Recordlinkage(Propagationthroughexclusitivity negativeconstraint),
Deduplication (Propagationthroughtransitivitypositiveconstraint),
Collective(PropagationthroughGeneralConstraints)

MOTIVATINGEXAMPLE:
MOTIVATING
EXAMPLE:
BIBLIOGRAPHICDOMAIN

Entities&RelationsinBibliographicDomain
Wrote

Paper
p
Title
# of Authors
Topic
Word1
Word 2

WordN
Cites

Author

Name
Research Area

WorksAt

I tit ti
Institution
Name

Author Mention
NameString

Institute Mention
NameString

Paper Mention
TitleString

AppearsIn
pp

Venue

:entityrelationships
:cooccurrencerelationships
:resolutionrelationships

Name

Venue Mention
NameString
g

PART 2
PART2a
DATAPREPARATION&
DATA
PREPARATION &
MATCHFEATURES

Normalization

Schemanormalization
SchemaMatching e.g.,contactnumberandphonenumber
Compoundattributes
C
d tt ib t
f ll dd
fulladdressvs
str,city,state,zip
t it t t i
Nestedattributes
Listoffeaturesinonedataset(airconditioning,parking)vs eachfeaturea
boolean attribute
Setvaluedattributes
Setofphonesvs primary/secondaryphone
Recordsegmentationfromtext
Record segmentation from text

Datanormalization
Oftenconverttoalllower/allupper;removewhitespace
detectingandcorrectingvaluesthatcontainknowntypographicalerrorsor
d
i
d
i
l
h
i k
hi l
variations,
expandingabbreviationsandreplacingthemwithstandardforms;replacing
nicknames with their proper name forms
nicknameswiththeirpropernameforms
Usuallydonebasedondictionaries(e.g.,commercialdictionaries,postaladdresses,
etc.)

MatchingFeatures
Fortworeferencesxandy,computeacomparisonvectorof
similarityscoresofcomponentattribute.
[1stauthormatchscore,
papermatchscore,
venuematchscore,
yearmatchscore,.]

Similarityscores
y
Boolean(matchornotmatch)
Realvaluesbasedondistancefunctions

SummaryofMatchingFeatures
Handle
H
dl
Typographicalerrors

Equalityonaboolean predicate
Editdistance
Levenstein,SmithWaterman,Affine

VectorBased
Cosine
Cosinesimilarity,TFIDF
similarity, TFIDF
GoodforTextlike
reviews/tweets

AlignmentbasedorTwotiered
JaroWinkler,SoftTFIDF,MongeElkan

PhoneticSimilarity
Soundex

Setsimilarity
Jaccard,Dice

GoodforNames

Translationbased
Numericdistancebetweenvalues
Domainspecific
p

Useful packages
Usefulpackages:
SecondString,http://secondstring.sourceforge.net/
Simmetrics:http://sourceforge.net/projects/simmetrics/
LingPipe,http://aliasi.com/lingpipe/index.html
Li Pi
htt // li i
/li
i /i d ht l

Usefulfor
abbreviations,
alternate names.
alternatenames.

RelationalMatchingFeatures
Relationalfeaturesareoftensetbased
Setofcoauthorsforapaper
Setofcitiesinacountry
f ii i
Setofproductsmanufacturedbymanufacturer

Canusesetsimilarityfunctionsmentionedearlier
CommonNeighbors:Intersectionsize
Jaccard
Jaccardss Coefficient:Normalizebyunionsize
Coefficient: Normalize by union size
AdarCoefficient:Weightedsetsimilarity

Canreasonaboutsimilarityinsetsofvalues
C
b t i il it i
t f l
AverageorMax
Otheraggregates

PART 2 b
PART2b
PAIRWISE MATCHING

Pairwise MatchScore
Problem:Givenavectorofcomponentwisesimilaritiesforapairof
records(x,y),computeP(xandymatch).
Solutions:
1. Weightedsumoraverageofcomponentwisesimilarityscores.
Thresholddeterminesmatchornonmatch.

0.5*1stauthormatchscore +0.2*venuematchscore +0.3*ppapermatchscore


p
.

Hardtopickweights.

Matchonlastnamematchmorepredictivethanloginname.
Match on Smith
Matchon
Smith lesspredictivethanmatchon
less predictive than match on Getoor
Getoor or
or Machanavajjhala.

Hardtotuneathreshold.

Pairwise MatchScore
Problem:Givenavectorofcomponentwisesimilaritiesforapairof
records(x,y),computeP(xandymatch).
Solutions:
1. Weightedsumoraverageofcomponentwisesimilarityscores.
Thresholddeterminesmatchornonmatch.
2 Formulaterulesaboutwhatconstitutesamatch.
2.
Formulate rules about what constitutes a match

(1stauthormatchscore>0.7ANDvenuematchscore>0.8)
OR(papermatchscore>0.9ANDvenuematchscore>0.9)

M
Manuallyformulatingtherightsetofrulesishard.
ll f
l ti th i ht t f l i h d

BasicMLApproach
r =(x,y)isrecordpair, iscomparisonvector,M matches,U non
matches
Decisionrule

P( | r M )
R
P( | r U )

R t r Match
R t r Non - Match

Fellegi &Sunter Model[FS,Science69]


r =(x,y)isrecordpair, iscomparisonvector,M matches,U non
matches
Decisionrule

P( | r M )
R
P( | r U )

R tl r Match
tl R tu r Potential Match
R tu r Non - Match
NaveBayes Assumption: P( | r M ) i P( i | r M )

MLPairwise Approaches
Supervisedmachinelearningalgorithms
Decisiontrees
[Cochinwala
[Co hin ala etal,IS01]
et al IS01]

Supportvectormachines
[Bilenko &Mooney,KDD03];[Christen,KDD08]

Ensemblesofclassifiers
Ensembles of classifiers
[Chenetal.,SIGMOD09]

ConditionalRandomFields(CRF)
[Gupta&Sarawagi,VLDB09]
[Gupta & Sarawagi VLDB09]

Issues:
Training
Trainingsetgeneration
set generation
Imbalancedclasses manymorenegativesthanpositives(evenafter
eliminatingobviousnonmatchesusingBlocking)
Misclassificationcost
Misclassification cost

CreatingaTrainingSetisakeyissue
Constructingatrainingsetishard sincemostpairsof
recordsareeasynonmatches.
y
100recordsfrom100cities.
Only106 pairsoutoftotal108 (1%)comefromthesamecity

Somepairsarehardtojudgeevenbyhumans
Inherentlyambiguous
E.g.,ParisHilton(personorbusiness)

Missingattributes
Starbucks,Torontovs Starbucks,QueenStreet,Toronto

AvoidingTrainingSetGeneration
Unsupervised/SemisupervisedTechniques
EMbasedtechniquestolearnparameters
[Winkler06,Herzogetal07]

GenerativeModels
[Ravikumar &Cohen,UAI04]
& Cohen UAI04]

ActiveLearning
CommitteeofClassifiers
[Sarawagi etalKDD00,Tajeda etalIS01]

Provablyoptimizingprecision/recall
Provably optimizing precision/recall
[Arasu etalSIGMOD10,Bellare etalKDD12]

Crowdsourcing
[WangetalVLDB12,MarcusetalVLDB12,]
[W
t l VLDB 12 M
t l VLDB 12 ]

CommitteeofClassifiers [Tejada etal,IS01]

ActiveLearningwithProvableGuarantees
Mostactivelearningtechniquesminimize01loss
[Beygelzimer etalNIPS2010].

However,ERisveryimbalanced:
Numberofnonmatches>100*numberofmatches.
Classifyingallpairsasnonmatcheshaslow01loss(<1%).
y g p
(
)

Hence,needactivelearningtechniquesthatminimize
precision/recall.
precision/recall

ActiveLearningwithProvableGuarantees
Monotonicity ofPrecision[Arasu etalSIGMOD10]

Thereisalargerfractionof
matchesinC1thaninC2.

Algorithmsearchesforthe
p
g
y
optimalclassifierusingbinary
searchoneachdimension

ActiveLearningwithProvableGuarantees
[Bellare etalKDD12]
O (log2 n)callstoablackbox
O(log
n) calls to a blackbox 01lossactivelearningalgorithm.
0 1 loss active learning algorithm
Exponentiallysmallerlabelcomplexitythan[Arasu etalSIGMOD10]
(in the worst case).
(intheworstcase).
1.
2.
3
3.

PrecisionConstrained Weighted01LossProblem
(using a Lagrange Multiplier )
(usingaLagrangeMultiplier).
Givenafixedvaluefor,weighted01Losscanbeoptimizedby(onecallto)a
blackbox activelearningclassifier.
Ri ht l
Rightvalueof
f iscomputedbysearchingoveralloptimalclassifiers.
i
t db
hi
ll ti l l ifi
Classifiersareembeddedina2dplane(precision/recall)
Searchisalongtheconvexhulloftheembeddedclassifiers

Crowdsourcing
Growinginterestinintegratinghumancomputationindeclarative
workflowengines.
ERisanimportantproblem(e.g.,forevaluatingfuzzyjoins)
[WangetalVLDB12,MarcusetalVLDB12,]

Opportunity:utilizecrowdsourcing forcreatingtrainingsets,
orforactivelearning.
Keyopenissue:Handlingerrorsinhumanjudgments
InanexperimentonAmazonMechanicalTurk:
Pairwise matchingjudgment,eachgivento5differentpeople

Majorityofworkersagreedontruthononly90%ofpairwise judgments.

SummaryofSingleEntityERAlgorithms
Manyalgorithmsforindependentclassificationofpairsofrecords
asmatch/nonmatch
MLbasedclassification&FellegiSunter
Pro:Advancedstateoftheart
P Ad
d
f h
Con:Buildinghighfidelitytrainingsetsisahardproblem

ActiveLearning&Crowdsourcing forERareactiveareasof
research.

PART 2
PART2c
CONSTRAINTS

Constraints
Importantformsofconstraints:
Transitivity:IfM1andM2match,M2andM3match,thenM1and
y
M3match
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2
FunctionalDependency:IfM1andM2match,thenM3andM4must
Functional Dependency: If M1 and M2 match then M3 and M4 must
match

Transitivityiskeytodeduplication
Exclusivityiskeytorecordlinkage
Functionaldependenciesfordatacleaning,e.g.,
[Ananthakrishna etal.,VLDB02][Fan,PODS08][Bohannonet
al ICDE07]
al,ICDE07]

Positive&NegativeEvidence
Positive
Transitivity:IfM1andM2match,M2andM3match,thenM1and
y
M3match
Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwith
M2
FunctionalDependency:IfM1andM2match,thenM3andM4must
match

Negative
Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1
and M3 do not match
andM3donotmatch
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2
FunctionalDependency:IfM1andM2donotmatch,thenM3and
M4cannotmatch

Positive&NegativeEvidence
Positive
Transitivity:IfM1andM2match,M2andM3match,thenM1and
y
M3match
Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwith
M2
FunctionalDependency:IfM1andM2match,thenM3andM4must
match

Negative
Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1
and M3 do not match
andM3donotmatch
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2
FunctionalDependency:IfM1andM2donotmatch,thenM3and
M4cannotmatch

ConstraintTypes
HardConstraint

SoftConstraint

PositiveEvidence

IfM1,M2matchthenM3,M4must
match
If t
Iftwopapersmatch,theirvenues
t h th i
match

IfM1,M2matchthenM3,
M4morelikelytomatch
If t
Iftwovenuesmatch,then
t h th
theirpapersaremorelikely
tomatch

NegativeEvidence

MentionM1andM2mustreferto
distinctentities(Uniqueness)
Coauthors are distinct
Coauthorsaredistinct

IfM1,M2dontmatchthen
M3,M4lesslikelytomatch
If institutions donttmatch,
Ifinstitutionsdon
match
thenauthorslesslikelyto
match

IfM1,M2dontmatchthenM3,M4
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch

ConstraintTypes
HardConstraint

SoftConstraint

PositiveEvidence

IfM1,M2matchthenM3,M4must
match
If t
Iftwopapersmatch,theirvenues
t h th i
match

IfM1,M2matchthenM3,
M4morelikelytomatch
If t
Iftwovenuesmatch,then
t h th
theirpapersaremorelikely
tomatch

NegativeEvidence

MentionM1andM2mustreferto
distinctentities(Uniqueness)
Coauthors are distinct
Coauthorsaredistinct

IfM1,M2dontmatchthen
M3,M4lesslikelytomatch
If institutions donttmatch,
Ifinstitutionsdon
match
thenauthorslesslikelyto
match

IfM1,M2dontmatchthenM3,M4
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch

ConstraintTypes
HardConstraint

SoftConstraint

PositiveEvidence

IfM1,M2matchthenM3,M4must
match
If t
Iftwopapersmatch,theirvenues
t h th i
match

IfM1,M2matchthenM3,
M4morelikelytomatch
If t
Iftwovenuesmatch,then
t h th
theirpapersaremorelikely
tomatch

NegativeEvidence

MentionM1andM2mustreferto
distinctentities(Uniqueness)
Coauthors are distinct
Coauthorsaredistinct

IfM1,M2dontmatchthen
M3,M4lesslikelytomatch
If institutions donttmatch,
Ifinstitutionsdon
match
thenauthorslesslikelyto
match

IfM1,M2dontmatchthenM3,M4
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch

ConstraintTypes
HardConstraint

SoftConstraint

PositiveEvidence

IfM1,M2matchthenM3,M4must
match
If t
Iftwopapersmatch,theirvenues
t h th i
match

IfM1,M2matchthenM3,
M4morelikelytomatch
If t
Iftwovenuesmatch,then
t h th
theirpapersaremorelikely
tomatch

NegativeEvidence

MentionM1andM2mustreferto
distinctentities(Uniqueness)
Coauthors are distinct
Coauthorsaredistinct

IfM1,M2dontmatchthen
M3,M4lesslikelytomatch
If institutions donttmatch,
Ifinstitutionsdon
match
thenauthorslesslikelyto
match

IfM1,M2dontmatchthenM3,M4
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch

ConstraintTypes
HardConstraint
PositiveEvidence

SoftConstraint

IfM1,M2matchthenM3,
IfM1,M2matchthenM3,M4must
Notethatsomeofthe
M4morelikelytomatch
match
constraintsmayberelational
y
If t
Iftwopapersmatch,theirvenues
t h th i
If t
Iftwovenuesmatch,then
t h th
andrequirejoins
match
theirpapersaremorelikely
tomatch

Maybedirectional
May
be directional
orbidirectional
NegativeEvidence
MentionM1andM2mustreferto

IfM1,M2dontmatchthen
distinctentities(Uniqueness)
M3,M4lesslikelytomatch
Constraints can
be recursive
Coauthors are distinct Constraintscanberecursive,
Coauthorsaredistinct
If institutions
Ifinstitutionsdon
donttmatch,
match
thenauthorslesslikelyto
e.g.,iftwoauthorshave
match
IfM1,M2dontmatchthenM3,M4
matchingcoauthors,then
cannotmatch
theymatch
h
h
Iftwovenuesdontmatch,thentheir
papersdontmatch

AdditionalConstraints
AggregateConstraints[Chaudhuri etal.SIGMOD07]
countconstraints
count constraints
EntityAcanlinktoatmostNBs
Authorshaveatmost5papersatanyconference

Otheraggregateslikesum,averagemorecomplex
Oth
t lik
l

Again,
Again,thesecanbeeitherhardorsoftconstraints,
these can be either hard or soft constraints,
providepositiveornegativeevidence

MatchDependencies

Whenmatchingdecisionsdependonother
matching decisions (in other words, matching
matchingdecisions(inotherwords,matching
decisionsarenotmadeindependently),we
refer to the approach as collective
refertotheapproachascollective

MatchExtent
Global: Iftwopapersmatch,thentheirvenuesmatch
This
Thisconstraintcanbeappliedtoallinstancesofvenue
constraint can be applied to all instances of venue
mentions
AlloccurrencesofSIGMODcanbematchedtoInternational
Conference on Management of Data
ConferenceonManagementofData

Local:Iftwopapersmatch,thentheirauthorsmatch
Thisconstraintcanonlybeappliedlocally
This constraint can only be applied locally
DontwanttomatchalloccurrencesofJ.SmithwithJeffSmith,onlyin
thecontextofthecurrentpaper

Ex.SemanticIntegrityConstraints
Type

Example

Aggregate

C1=Noresearcher haspublishedmorethanfiveAAAIpapersinayear

Subsumption

C2=IfacitationXfromDBLP matchesacitationYinahomepage,then
eachauthormentionedinYmatchessomeauthormentionedinX

Neighborhood

C3=IfauthorsXandYsharesimilarnamesandsomecoauthors,they
arelikelytomatch
lik l t
t h

Incompatible

C4 =NoresearcherexistswhohaspublishedinbothHCIandnumerical
analysis

Layout

C5=Iftwomentionsinthesamedocumentsharesimilarnames,they
f
i
i h
d
h
i il
h
arelikelytomatch

Key/Uniqueness

C6=MentionsinthePClistingofaconferenceisto different
researchers

Ordering

C7=Iftwo citationsmatch,thentheirauthorswillbematchedinorder

Individual

C8=TheresearcherwiththenameMayssam Sariahasfewerthan
fi
fivementionsinDBLP(newgraduatestudent)
ti
i DBLP (
d t t d t)
[Shen,Li&Doan,AAAI05]

AlgorithmsforHandlingConstraints

Recordlinkage propagationthroughexclusivity

Deduplication propagationthroughtransitivity

Weightedkpartitematching

C
Correlationclustering
l i
l
i

Collective propagationthroughgeneralconstraints
Collective
propagation through general constraints

Similaritypropagation

P b bili ti
Probabilisticapproaches
h

Dependencygraphs,CollectiveRelationalClustering
LDA,CRFs,MarkovLogicNetworks,ProbabilisticRelationalModels,

Hybridapproaches

Dedupalog

PART 2 d
PART2d
ALGORITHMS

RECORD LINKAGE
RECORDLINKAGE

11assumption
Matchingbetween(almost)deduplicated databases.
Each record in one database matches at most one record
Eachrecordinonedatabasematchesatmostonerecord
inanotherdatabase.
Pairwise ERmaymatcharecordinonedatabasewith
morethanonerecordinseconddatabase

WeightedKPartiteMatching
Weighted
Edges

Weighted
Edges

Edgesbetweenpairsofrecordsfromdifferentdatabases
Edgeweights
o Pairwise matchscore
o Logoddsofmatching
L
dd f
hi

WeightedKPartiteMatching

Findamatching(eachrecordmatchesatmostoneotherrecord
fromotherdatabase)thatmaximizethesumofweights.
)
GeneralproblemisNPhard(3Dmatching)
Successivebipartitematchingistypicallyused.
Successive bipartite matching is typically used [Gupta&Sarawagi,VLDB
[G pta & Sara agi VLDB
09]

DEDUPLICATION

Deduplication =>Transitivity
Oftenpairwise ERalgorithmoutputinconsistentresults
(x,y) Mpred ,(y,z) Mpred ,but(x,z) Mpred

Idea:Correctthisbyaddingadditionalmatchesusingtransitive
closure
l
Incertaincases,thisisabadidea.
In certain cases this is a bad idea
Graphsresultingfrompairwise ERhave
diameter>20
[Rastogi etalCorr
et al Corr 12]
12]

Addedby
Transitive
Transitive
Closure

Needclusteringsolutionsthatdealwiththisproblemdirectlyby
reasoningaboutrecordsjointly.

ClusteringbasedER
Resolutiondecisionsarenotmadeindependentlyfor
eachpairofrecords
Basedonvarietyofclusteringalgorithms,but
Numberofclustersunknownaprioiri
Many,manysmall(possiblysingleton)clusters

Oftentakeapairwisesimilaritygraphasinput
Mayrequiretheconstructionofaclusterrepresentative
orcanonicalentity

ClusteringMethodsforER
HierarchicalClustering
[[Bilenko etal,ICDM05]
,
]

NearestNeighborbasedmethods
[Chaudhuri etal,ICDE05]

CorrelationClustering
[SoonetalCL01,Bansal etalML04,NgetalACL02,
Ailon etalJACM08,Elsner etalACL08,Elsner etalILPNLP09]

IntegerLinearProgrammingviewofER
rxy {0,1},rxy =1ifrecordsx andyareinthesamecluster.
w+xy [0,1],costofclusteringxandytogether
[ ]
g
y g
w xy [0,1],costofplacingxandyindifferentclusters

Transitive
closure

CorrelationClustering

Clustermentionssuchthat
totalcostisminimized
total cost is minimi ed
Solidedgescontributew+

totheobjective
Dashededgescontributew xyy totheobjective
xy

2
1
3

Costbasedonpairwise similarities
Additive:w+xy =pxy andw xy =(1pxy)
oga t
c +xy =log(p
og(pxy))andw
a d xy =log(1p
og( pxy)
Logarithmic:w

CorrelationClustering
SolvingtheILPisNPhard[Ailon etal2008JACM]
Anumberofheuristics[Elsner etal2009ILPNLP]
GreedyBEST/FIRST/VOTEalgorithms
Greedy BEST/FIRST/VOTE algorithms
GreedyPIVOTalgorithm(5approximation)
LocalSearch

GreedyAlgorithms
SStep1:Permutethenodesaccordingarandom
1 P
h
d
di
d
Step2:AssignrecordxtotheclusterthatmaximizesQuality
Start
a new cluster if Quality < 0
StartanewclusterifQuality<0
Quality:
BEST:Clustercontainingtheclosestmatch
[Ngetal2002ACL]

FIRST:Clustercontainsthemostrecentvertexywithw+xy >0
[Soonetal2001CL]
[Soon et al 2001 CL]

VOTE:Assigntoclusterthatminimizesobjectivefunction.
[Elsner etal08ACL]
PracticalNote:
Runthealgorithmformanyrandompermutations,andpicktheclusteringwith
bestobjectivevalue(betterthanaveragerun)

Greedywithapproximationguarantees
PIVOTAlgorithm
[Ailon etal2008JACM]
Pickarandom(pivot)recordp.
(p
)
p
Newcluster=
2

={1,2,3,4}C={{1,2,3,4}}
={2,4,1,3}C={{1,2},{4},{3}}
{ , , , }
{{ , }, { }, { }}
={3,2,4,1}C={{1,3},{2},{4}}

Whenweightsare0/1,
Forw+xy +wxy =1,

1
3
4

E(cost(greedy))<3 OPT
E(cost(greedy))<5 OPT

[Elsner etal,ILPNLP09]:Comparisonofvariouscorrelationclusteringalgorithms

PART2d

CANONICALIZATION

Canonicalization
Mergeinformationfromduplicatementionstoconstruct
aclusterrepresentativewithmaximalinformation
p

Starbucks,
3457HillsboroughRoad
Starbucks
Durham,NC
3457HillsboroughRoad,Durham,NC
Ph:null
Ph:(919)3334444
Starbacks,
HillsboroughRd,Durham
Ph:(919)3334444
CriticallyimportantinWebportalswhere

usersmust beshownaconsolidatedview
Eachmentiononlycontainsasubsetofthe
attributes
Mentionscontainvariations(ofnames,
addresses)
Someofthementionshaveincorrectvalues

CanonicalizationAlgorithms
Rulebased:
Fornames:typicallylongestnamesareused.
Forsetvaluesattributes:UNIONisused.

FForstrings,[Culotta
ti
[C l tt etalKDD07]learnaneditdistanceforfinding
t l KDD07] l
dit di t
f fi di
themostrepresentativecentroid.
Canusemajorityruletofixerrors
(if4outof5sayabusinessisclosed,thenbusinessisclosed).
Thismaynotalwaysworkduetocopying[DongetalVLDB09],orwhen
underlyingdatachanges[PaletalWWW11]

CanonicalizationforEfficiency
StanfordEntityResolutionFramework[Benjelloun VLDBJ09]
Considerablackbox matchandmergefunction
Matchisapairwise boolean operator
Merge:constructcanonicalversionofamatchingpair

Canminimizetimetocomputematchesbyinterleavingmatching
andmerging
esp.,whenmatchandmergefunctions
satisfymonotonicity properties.

r345

r12
r1

r45
r2

r3

r4

r5

COLLECTIVE ENTITY RESOLUTION


COLLECTIVEENTITYRESOLUTION

CollectiveApproaches
Decisionsforclustermembershipdependsonotherclusters
Nonprobabilisticapproaches
p
pp
SimilarityPropagation

ProbabilisticModels
GenerativeModels
G
i M d l
UndirectedModels

HybridApproaches
y
pp

SIMILARITY PROPAGATION
SIMILARITYPROPAGATION

SimilarityPropagationApproaches
Similaritypropagationalgorithmsdefineagraphwhichencodes
thesimilaritybetweenentitymentionsandmatchingdecisions,
andcomputematchingdecisionsbypropagatingsimilarityvalues.
Detailsofconstructedgraphandhowthesimilarityiscomputedvaries
Algorithmsareusuallydefinedprocedurally
Algorithms are usually defined procedurally
Whileprobabilitiesmaybeencodedinvariouswaysinthealgorithms,there
isnoglobalprobabilisticmodeldefined

Approachesoftenmorescalablethanglobalprobabilisticmodels

DependencyGraph
[Dong et al SIGMOD05 ]
[Dongetal.,SIGMOD05]
Constructagraphwherenodesrepresentsimilaritycomparisons
between attribute values (realvalued)
betweenattributevalues(real
valued)andmatchdecisionsbased
and match decisions based
onmatchingdecisionsofassociatednodes(booleanvalued)
Asmentionsareresolved,enrichedtocontainassociatednodesof
allmatchedmentions
Similaritypropagateduntilfixedpointisreached
Negativeconstraints(notmatchnodes)arecheckedaftersimilarity
propagationisperformed,andinconsistenciesarefixed

ExploittheDependencyGraph
Slid f
Slidesfrom[Dongetal,SIGMOD05]
[D
t l SIGMOD05]

(a1,a4)

(Distributed,Distributed )

(RobertS.Epstein,Epstein,R.S.)

(169180,169180)

(a2,a5)
(p1,p2)
(MichaelStonebraker,Stonebraker,M.)
(a3,a6)
(EugeneWong,Wong,E.)

Referencesimilarity

(v1,v2)

(ACM,ACMSIGMOD)

(1978,1978)

Attributesimilarity

ExploittheDependencyGraph
(a1,a4)

(Distributed,Distributed )

(RobertS.Epstein,Epstein,R.S.)

(169180,169180)

(a2,a5)
(p1,p2)
(MichaelStonebraker,Stonebraker,M.)
(v1,v2)

(a3,a6)
(EugeneWong,Wong,E.)

Reconciled

(ACM,ACMSIGMOD)

Similar

(1978,1978)

CollectiveRelationalClustering
[Bh
[Bhattacharya&Getoor,TKDD07]
h
&G
TKDD07]

Constructagraphwhereleafnodesareindividual
mentions
i
Performhierarchicalagglomerativeclusteringtomerge
clustersofmentions
l t
f
ti
Similaritycomputedbasedonacombinationofattribute
and relational similarity
andrelationalsimilarity
Whenclustersaremerged,updatethesimilaritiesofany
related clusters (clusters corresponding to mentions
relatedclusters(clusterscorrespondingtomentions
whichcooccurwithmergedmentions)

ObjectiveFunction

Mi i i
Minimize:

w sim
A

weightfor
attributes
b

(ci ,c j ) wR simR (ci , c j )

similarityof
attributes
b

weightfor
relations
l

Similaritybasedonrelationaledges
b
betweenc
d j
i andc

Greedy clustering algorithm: merge cluster pair with max


reduction in objective function
where for example

simA (ci , c j )
and

sim(c , c )

aAttributes

*
i

*
j

for cluster representative c*

simR (ci , c j ) sim jaccard ( N (ci ),


) N (c j ))

where N(c) are the relational neighbors of c

RelationalClusteringAlgorithm
1.
2.
3.

Findsimilarreferencesusingblocking
Bootstrapclustersusingattributesandrelations
Computesimilaritiesforclusterpairsandinsertintopriorityqueue

4.
5.
6.
7
7.
8.
9.
10.

Repeatuntilpriorityqueueisempty
p
p
yq
py
Findclosestclusterpair
Stopifsimilaritybelowthreshold
If no negative constraints violated
Ifnonegativeconstraintsviolated
Mergetocreatenewcluster
Constructcanonicalclusterrepresentative
Updatesimilarityforrelatedclusters

O(nklogn)algorithmw/efficientimplementation

SimilaritypropagationApproaches
Method

Notes

Constraints

Evaluation

RelDC
[Kalashnikovet
al,TODS06]
l TODS06]

Reference
disambiguation
usingusing
i
i
Relationship
baseddata
cleaning(RelDC)
g(
)

Modelchoice
nodesidentified
usingfeature
i f
basedsimilarity

Context
attraction
measuresthe
h
relational
similarity

Accuracyand
runtime forAuthor
resolutionand
l i
d
directorresolution
inMovie database

Reference
Reconciliation
[Dongetal,
SIGMOD05]

Dependency
Graphfor
propagating
similarities+
enforcenon
match
constraints

Reference
Both positive
enrichment
andnegative
Explicitly handle constraints
missingvalues
Parametersset
byhand

Precision/Recall,
F1onpersonal
information
managementdata
(PIM),Coradataset

Collective
Relational
Clustering
g
[Bhattacharya&
Getoor,TKDD07]

Modified
hierarchical
agglomerative
gg
clustering
approach

Constructs
canonical entity
asmergesare
g
made

Precision/Recall,
F1onthree
bibliographic
g p
datasets:CiteSeer,
ArXiv,andBioBase,
andsyntheticdata

Focuson
coauthor
resolution and
propagation

PROBABILISTICMODELS:
PROBABILISTIC
MODELS:
GENERATIVEAPPROACHES

GenerativeProbabilisticApproaches
ProbabilisticsemanticsbasedonDirectedModels
Model
Modeldependenciesbetweenmatchdecisionsinagenerative
dependencies between match decisions in a generative
manner
Disadvantage:acyclicity requirement

Varietyofapproaches
BasedonLatentDirichlet Allocation,BayesianNetworks

Examples
LatentDirichlet Allocation[Bhattacharya&Getoor,SDM07]
ProbabilisticRelationalModels[Pasula etal,NIPS02]

LDAforEntityResolution:Discovering
Groups from CoOccurrence
GroupsfromCo
OccurrenceRelations
Relations
Stephen P Johnson
Chris Walshaw

Kevin McManus

M k Cross
Mark
C ss

M tin E
Martin
Everett
tt

Parallel Processing Research Group

Stephen C Johnson
Alfred V Aho

Ravi Sethi

J ff
Jeffrey
D Ullman
Ull
Bell Labs Group

P1: C. Walshaw, M. Cross, M. G. Everett,


S. Johnson

P4: Alfred V. Aho, Stephen C. Johnson,


J ff
Jefferey
D
D. Ull
Ullman

P2: C. Walshaw, M. Cross, M. G. Everett,


S. Johnson, K. McManus

P5: A. Aho, S. Johnson, J. Ullman

P3: C. Walshaw, M. Cross, M. G. Everett

P6: A. Aho, R. Sethi, J. Ullman

LDAERModel

Entity label a and group label z for


each reference r
: mixture
mixture of groups for each cooccurrence
z: multinomial for choosing entity a
for each group z
Va: multinomial for choosing
reference r from entity a
Dirichlet priors with and

T
r

V
R

InferenceusingblockedGibbssampling
forefficiency(andimprovedaccuracy)

GenerativeApproaches
Method

Learning/Inference
Method

Evaluation

[Li,Morie,&
[Li
Morie &
Generative
Generative
Roth,AAAI04] modelfor
mentionsin
documents

Truncated EMtolearn
EM to learn
parametersandMAP
inferenceforentities
(unsupervised)

F1on
F1
on person
person
names,
locationsand
organizationsin
TRECdataset

Probabilistic
Probabilistic
Relational
Relational
M d l [P l Models
Models[Pasula
M d l
etal.,NIPS03]

Parameterslearned
onseparatedcorpora,
i f
inferencedoneusing
d
i
MCMC

%ofcorrectly
identified
clusters
l
on
subsetsof
CiteSeer data

LatentDirichlet
Latent
Dirichlet Latent
LatentDirichlet
Dirichlet BlockedGibbs
Blocked Gibbs
Allocation
Allocation
Sampling
Unsupervised
[Bhattacharya Model
&Getoor,
approach
SDM06]

Precision/Recall
/F1onCiteSeer
andHEPdata

PROBABILISTICMODELS:
PROBABILISTIC
MODELS:
UNDIRECTEDAPPROACHES

UndirectedProbabilisticApproaches
ProbabilisticsemanticsbasedonMarkovNetworks
Advantage:noacyclicity
Advantage: no acyclicity requirements

Insomecases,syntaxbasedonfirstorderlogic
Advantage:declarative
g

Examples
ConditionalRandomFields(CRFs)[McCallum&Wellner,
NIPS04]
MarkovLogicNetworks(MLNs)[Singla &Domingos,ICDM06]
ProbabilisticSimilarityLogic[Broecheler &Getoor,UAI10]

MarkovLogic
AlogicalKBisasetofhardconstraints onthesetof
possibleworlds
Makethemsoftconstraints;whenaworldviolatesa
formula,itbecomeslessprobablebutnotimpossible
Giveeachformulaaweight
Higherweight
Higher weight Strongerconstraint
Stronger constraint

P(world) exp weights o f formula s it sat isfies

[Richardson&Domingos,06]

MarkovLogic
AMarkovLogicNetwork(MLN) isasetofpairs(F,w)
where
F isaformulainfirstorderlogic
w isarealnumber
# true groundings
of ith clause

P( X ) exp wi ni ( x)
Z
iF

Normalization Constant

Iterate over all first-order MLN formulas

[Richardson&Domingos,06]

ERProblemFormulationinMLNs
Given
A
ADBofrecordsrepresentingmentionsofentitiesinthereal
DB of records representing mentions of entities in the real
world,e.g.papermentions
Asetoffieldse.g.author,title,venue
A set of fields e g author title venue
Eachrecordrepresentedasasetoftypedpredicatese.g.
HasAuthor(paper,author),HasVenue(paper,venue)
(p p ,
),
(p p ,
)

Goal
Todeterminewhichoftherecords/fieldsrefertothesame
d
h h f h
d /f ld f
h
underlyingentity
Slidesfrom[Singla &Domingos,ICDM06]

HandlingEquality
IntroduceEquals(x,y) orx=y
Introducetheaxiomsofequality
I t d
th
i
f
lit
Reflexivity: x=x
Symmetry: x=y y=x
Transitivity: x=y y=z z=x
PredicateEquivalence:
x11 =xx2 y11 y22 (R(x1,y
, y1)
) R(x2,y2))

Positive,SoftEvidence
Introducereversepredicateequivalence
SSamerelationwiththesameentitygivesevidenceabout
l ti
ith th
tit i
id
b t
twoentitiesbeingsame
R(x1,y1) R(x2,y2) x1 =x2 y2 =y2

Nottruelogically,butgivesusefulinformation
Example
HasAuthor(C1,J.Cox)
HasAuthor(C1
J Cox) HasAuthor(C2,CoxJ.)
HasAuthor(C2 Cox J )
(J.Cox=CoxJ.)

C1 = C2
C1=C2

FieldComparison
Eachfieldisastringcomposedoftokens
Introduce HasWord(field word)
IntroduceHasWord(field,word)
Usereversepredicateequivalence
HasWord(f1,w1) HasWord(f2,w2) w1= w2 f1=f2

Example
HasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) (Cox=Cox)
(J.Cox=CoxJ.)

Canhavedifferentweightforeachword
h
d ff
h f
h
d

TwolevelSimilarity
Individualwordsasunits:Cantdealwithspelling
mistakes
Breakeachwordintongrams:Introduce
HasNgram(word,ngram)
Usereversepredicateequivalenceforwordcomparisons

RecordMatching
SimplestVersion:Fieldsimilaritiesmeasuredby
presence/absenceofwordsincommon
HasWord(f1,w1) HasWord(f2,w2) HasField(r1,f1)
HasField(r2,f2) w1= w2 r1=r2

Example
E
l
HasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) HasAuthor(P1,
J.Cox)
) HasAuthor(P2,CoxJ.)
( ,
) ((Cox=Cox)) ((P1=P2))

Transitivity
(f11 =ff2)
) (f2 =ff3)
) ( f3 =ff1)

AdditionalConstraints
HasAuthor(c,a
HasAuthor(c
a1)
) HasAuthor(c,a
HasAuthor(c a2)
) Coauthor(a1,a
a2)
Coauthor(a1,a2) Coauthor(a3,a4) a1=a3 a2=a4

Inference
Usecheapheuristics(e.g.TFIDFbasedsimilarity)to
identifyplausiblepairs
yp
p
Inference/learningoverplausiblepairs
Inferencemethod:lazygrounding+MaxWalkSAT
Learning:supervisedandtransfer(learn/handsetonone
g p
(
/
domainandtransferred)

ProbabilisticSoftLogic

[Broecheler &Getoor,UAI10]
& Getoor UAI10]

Declarativelanguagefordefiningconstrainedcontinuous
Markov random field (CCMRF) using first order logic
Markovrandomfield(CCMRF)usingfirstorderlogic
(FOL)
Softlogic:truthvaluesin[0,1]
Soft logic: truth values in [0 1]
LogicaloperatorsrelaxedusingLukasiewicz tnorms
Mechanismsforincorporatingsimilarityfunctions,and
Mechanisms for incorporating similarity functions and
reasoningaboutsets
MAPinferenceisaconvexoptimization
MAP inference is a convex optimization
Efficientsamplingmethodformarginalinference

FOLtoCCMRF
PSLconvertsaweightedruleintopotentialfunctionsby
penalizing its distance to satisfaction
penalizingitsdistancetosatisfaction,

isthetruthvalueofgroundruleunder
interpretationx
Thedistributionovertruthvaluesis

::weightofruler
weight of rule r
:allgroundingsofruler
p g
:PSLprogram

UndirectedApproaches
Method

Learning/Inference
Method

Evaluation

[McCallum&
[McCallum
&
Wellner,
NIPS04]

Conditional
Conditional
RandomFields
(CRFs)
capturing
transitivity
constraints

Graphpartitioning
Graph
partitioning
(Boykov etal.1999),
performedvia
correlationclustering

F1on
F1
on DARPA
DARPA
MUC&ACE
datasets

[Singla &
D i
Domingos,
ICDM06]

MarkovLogic
N
Networks
k
(MLNs)

Supervisedlearning
ConditionalLog
andinferenceusing
di f
i
lik lih d and
likelihood
d
MaxWalkSAT &MCMC AUConCora
andBibServ
data

[Broecheler & Probabilistic


Getoor,UAI10] SimilarityLogic
(PSL)

Supervisedlearning
andinferenceusing
continuous
optimization

Precision/Recall
/F1Ontology
Alignment

HYBRID APPROACHES
HYBRIDAPPROACHES

HybridApproaches
Constraintbasedapproachesexplicitlyencoderelational
constraints
Theycanbeformulatedashybridofconstraintsand
probabilisticmodels
Orasconstraintoptimizationproblem

Examples
ConstraintbasedEntityMatching[Shen,Li&Doan,AAAI05]
Dedupalog [Arasu,Re,Suciu,ICDE09]

Dedupalog [Arasu etal.,ICDE09]


PaperRef(id, title, conference, publisher, year)
Wrote(id authorName,
Wrote(id,
authorName Position)
TitleSimilar(title1,title2)
A th Si il ( th 1 th 2)
AuthorSimilar(author1,author2)

Datatobe
deduplicated
(Thresholded)Fuzzy
J i O t t
JoinOutput

Step(0)Createinitialapproximatematches;thisisinputtoDedupalog.

Step(1)Declaretheentities

ClusterPapers,Publishers,&Authors

Paper!(id) :- PaperRef(id,-,-,-)
Publisher!(p) :- PaperRef(-,-,-,p,-)
Author!(a)
( ) :- Wrote(-,a,-)
(, ,)

Dedupalog isflexible:
UniqueNamesAssumption(UNA)
P bli h (UNA) d P
Publishers(UNA)andPapers(NOTUNA)
(NOT UNA)
Slidesbasedon[Arasu,Re,Suciu,ICDE09]

Step(2)DeclareClusters
InputintheDB

PaperRef(id, title, conference, publisher, year) Clusterpapers,


publishers and authors
publishers,andauthors
Wrote(id authorName,
Wrote(id,
authorName Position)
TitleSimilar(title1,title2)
A th Si il ( th 1 th 2)
AuthorSimilar(author1,author2)

Paper!(id) :- PaperRef(id,-,-,-)
Publisher!(p) :- PaperRef(-,-,-,p,-)
PaperRef(- - - p -)
Author!(a) :- Wrote(-,a,-)

Clustersaredeclared using*(likeIDBsorViews):Theseareoutput

Author*(a1,a2) <-> AuthorSimilar(a1,a2)


Clusterauthorswithsimilarnames
*IDBsareequivalencerelations:
q
Symmetric,Reflexive,&Transitively
ClosedRelations:i.e.,Clusters

Author1
AA

Author2
Arvind Arasu

Arvind A Arvind Arasu

ADedupalog
A
D d
l program isa
i
setofdataloglikerules

152

SimpleConstraints
Paperswithsimilartitlesshouldlikelybeclusteredtogether
Paper*(id1,id2) <-> PaperRef(id1,t1,-), PaperRef(id2,t2,-),TitleSimilar(t1,t2)
Author*(a1,a2) <-> AuthorSimilar(a1,a2)

Paper*(id
Paper
(id1,id
id2) <= PaperEq(id1,id
id2 )
Paper*(id1,id2) <= PaperNeq(id1,id2)

((<>)Softconstraints:
)
Payacostifviolated.
(<=)Hardconstraints:Any
clusteringmustsatisfythese

PapersinPaperEQ must beclusteredtogether,


thoseinPaperNEQ mustnot beclusteredtogether
1. PaperEQ,PaperNEQ arerelations(EDBS)
2. denotesNegationhere.

Additional Constraints
AdditionalConstraints
Clusteringtwopapers,thenmustclustertheirfirstauthors

Author*(a1,a2) <= Paper*(id1,id2), Wrote(id1,a1,1), Wrote(id2,a2,1)


Cl
Clusteringtwopapersmakesitlikelyweshouldclustertheirpublisher
i
k i lik l
h ld l
h i
bli h
Publisher*(x,y) <- Publishes(x,p1), Publishes(x,p2),Paper*(p1,p2)

if
iftwoauthorsdonotsharecoauthors,thendonotclusterthem
two authors do not share coauthors, then do not cluster them
Author (x, y) <- (Wrote(x, p1,), Wrote(y, p2,), Wrote(z, p1,),
Wrote(z, p2,), Author(x, y))

Dedupalog viaCC
Semantics:TranslateaDedupalog Programtoasetofgraphs
Nodesarereferences(inthe!Relation)

EntityReferences:Conference!(c)
VLDBJ

Conference*(c1,c2) <-> ConfSim(c1,c2)

VLDB

Positiveedges
[ ] Negativeedgesareimplicit
[]
Negative edges are implicit

VLDBconf
ICDE

ICDT

InternationalConf.DE

Forasinglegraphw.o.hardconstraints
we can reuse prior work for O(1) apx
wecanreusepriorworkforO(1)apx.
156

CorrelationClustering
Soft

Conference*(c
Conference
(c1,cc2) << ConfSim(c1,cc2)
Conference*(c1,c2) <= ConfEQ(c1,c2)
Conference*(c
C f
*( 1,c2) <=
< ConfNEQ(c
C fNEQ( 1,c2)

Hard
Equal

Positive
[]Negative

NotEqual

VLDBJ
VLDB

VLDBconf
ICDE

ICDT
1. Pick a random order of edges
2. While there is a soft edge do
1. Pick first soft edge in order
2. If
turn into
[-]
] turn into
3. Else is [
4. Deduce labels
3. Return Transitively closed subsets

InternationalConf.DE

Voting
Extendalgorithmtowhole languageviavotingtechnique.
Support different entity types recursive programs etc
Supportdifferententitytypes,recursiveprograms,etc.
Manydedupalog
y
p gp
programs
g
haveanO(1)apx
Thm:AllsoftprogramsO(1)

Thm: Arecursivehard
constraintsnoO(1)apx
Expert:multiwaycuthard

Systemproperties:
(1)Streamingalgorithm
(2)linearin#ofmatches(notn2)
(3)Userinteraction
Features:Supportforweights,referencetables
(partially),andcorrespondinghardnessresults.

HybridApproaches
Method

Evaluation

Constraint
basedEntity
Matching
[Shen,Li&
Doan,
AAAI05];
buildson(Li,
Morie,&Roth,
AIMag 2004)

Twolayermodel:
Layer1:Generativemodelfordatasetsthatsatisfy
constraints;
Layer2:EMalgorithmandtherelaxationlabeling
algorithmtoperformmatching.Ineachiteration,use
EM to estimate parameters of the generative model
EMtoestimateparametersofthegenerativemodel
andamatchingassignment,thenemploysrelaxation
labelingtoexploittheconstraints

Researchers
andIMDBwith
noiseadded

Dedupalog
[Arasu,Re,
Suciu,ICDE09]

Declarativespecificationforrichcollectionof
constraintswithnicesyntactic sugaraddedtodatalog
forER.Inference:Correlationclustering+voting

Precision/Recall
onCora,subset
ofACMdataset

Summary:CollectiveApproaches
Decisionsforclustermembershipdependsonotherclusters
Similaritypropagationapproaches
yp p g
pp
ProbabilisticModels
GenerativeModels
UndirectedModels
U di
dM d l

HybridApproaches
Nonprobabilisticapproachesoftenscalebetterthangenerative
probabilisticapproaches
Undirected/constraintbasedmodelsareofteneasiertospecify
Scalingundirectedmodelsactiveareaofresearch

PART 3
PART3
SCALING ER TO BIGDATA
SCALINGERTOBIG
DATA

ScalingERtoBigData
Blocking/CanopyGeneration
Distributed ER
DistributedER

PART 3
PART3a
BLOCKING/CANOPY GENERATION
BLOCKING/CANOPYGENERATION

Blocking:Motivation
Navepairwise:|R|2 pairwise comparisons
1000
1000businesslistingseachfrom1,000differentcitiesacross
business listings each from 1,000 different cities across
theworld
1trillioncomparisons
11.6days (ifeachcomparisonis1s)

Mentionsfromdifferentcitiesareunlikelytobematches
BlockingCriterion:City
1billioncomparisons
16minutes(ifeachcomparisonis1s)

Blocking:Motivation
Mentionsfromdifferentcitiesareunlikelytobematches
Maymisspotentialmatches
May miss potential matches

Blocking:Motivation
MatchingPairs
ofRecords

PairsofRecords
satisfying
Blockingcriterion
SetofallPairs
ofRecords

BlockingAlgorithms1
Hashbasedblocking

EachblockCi isassociatedwithahashkeyhi.
Mentionx ishashedtoCi ifhash(x)=hi.
Withinablock,allpairsarecompared.
Each hash function results in disjoint blocks.
Eachhashfunctionresultsindisjointblocks.

Whathashfunction?
Deterministicfunctionofattributevalues
BooleanFunctionsoverattributevalues
[[Bilenko etalICDM06,MichelsonetalAAAI06,
,
,
DasSarma etalCIKM12]
minHash (minwiseindependentpermutations)
[Broder etalSTOC98]

BlockingAlgorithms2
Pairwise Similarity/Neighborhoodbasedblocking
Nearby
Nearbynodesaccordingtoasimilaritymetricareclustered
nodes according to a similarity metric are clustered
together
Resultsinnondisjointcanopies.

Techniques
SortedNeighborhoodApproach[HernandezetalSIGMOD95]
CanopyClustering[McCallumetalKDD00]

SimpleBlocking:InvertedIndexonaKey
Examplesofblockingkeys:

Firstthreecharactersoflastname
First
three characters of last name
City+State+Zip
CharacterorTokenngrams
Minimuminfrequentngrams

LearningOptimalBlockingFunctions
Usingoneormoreblockingkeysmaybeinsufficient
2,376,206American
2,376,206 AmericansssharedthesurnameSmithinthe2000US
shared the surname Smith in the 2000 US
NULLvaluesmaycreatelargeblocks.

Solution:Constructblockingfunctionsbycombining
simplefunctions

ComplexBlockingFunctions
Conjunctionoffunctions[MichelsonetalAAAI06,Bilenko etalICDM06]
{City}AND{lastfourdigitsofphone}

Chaintrees[DasSarma etalCIKM12]
If
If({City}=NULLorLA)then
({Ci } NULL LA) h {lastfourdigitsofphone}AND{areacode}
{l f
di i f h
} AND {
d }
else {lastfourdigitsofphone}AND{City}

BlkTrees [DasSarma etalCIKM12]

LearninganOptimalfunction[Bilenko etalICDM06]
Findkblockingfunctionsthateliminatethemostnon
matches,whileretainingalmostallmatches.
,
g
Needatrainingsetofpositiveandnegativepairs

AlgorithmIdea:RedBlueSetCover
PositiveExamples
Blocking Keys
BlockingKeys
NegativeExamples

PickkBlockingkeyssuchthat
(a)Atmost bluenodesare
notcovered
(b)Numberofrednodes
coveredisminimized

LearninganOptimalfunction[Bilenko etalICDM06]
AlgorithmIdea:RedBlueSetCover
PositiveExamples
p
BlockingKeys

Pick k Blocking keys such that


PickkBlockingkeyssuchthat
(a)Atmost bluenodesare
notcovered
(b)Numberofrednodes
coveredisminimized

NegativeExamples

GreedyAlgorithm:
Greedy Algorithm
Constructgoodconjunctionsofblockingkeys{p1,p2,}.
Pick
k conjunctions {pi1,p
pi2,,p
pik}},suchthatthefollowingis
such that the following is
Pickkconjunctions{p
minimized

minHash (Minwise IndependentPermutations)


LetFx beasetoffeaturesformentionx
(functionsof)attributevalues
(functions of) attribute values
characterngrams
optimalblockingfunctions

Let bearandompermutationoffeaturesinFx
E.g.,orderimposedbyarandomhashfunction

minHash(x)=minimumelementinFx accordingto

WhyminHash works?
Surprisingproperty:Forarandompermutation,

Howtobuildablockingschemesuchthatonlypairswith
How to build a blocking scheme such that only pairs with
Jacquardsimilarity>sfallinthesameblock(withhighprob)?
Probabilitythat
(x,y)mentionsare
bl k d t th
blockedtogether

Similarity(x,y)

BlockingusingminHashes
ComputeminHashes usingr*kpermutations(hash
functions)
)
Band
of r minHashes
Bandofr

k blocks

Signatures
Signature sthatmatchon1outofk
that match on 1 out of k bands,gotothe
bands go to the
sameblock.

minHash Analysis
FalseNegatives:(missingmatches)
P(pair x y notinthesameblock
P(pairx,y
not in the same block
withJacquardsim =s)
shouldbeverylowforhighsimilaritypairs
should be very low for high similarity pairs

False
Positives: (blocking nonmatches)
FalsePositives:(blockingnonmatches)
P(pairx,y inthesameblock
with Jacquard sim =s)
withJacquardsim
s)

Sim(s)

P(notsame
block)

0.9

108

0.8

0.00035

0.7

0.025

0.6

0.2

0.5

0.52

0.4

0.81

0.3

0.95

0.2

0.994

0.1

0.9998

CanopyClustering[McCallumetalKDD00]
Input:MentionsM,
d(x,y),adistancemetric,
thresholdsT1 >T2
Algorithm:
1 PickarandomelementxfromM
1.
Pi k
d
l
t f
M
2. CreatenewcanopyCx using
mentionsys.t.d(x,y)<T
y
( ,y) 1
3. Deleteallmentionsy fromM
s.t.d(x,y)<T2(fromconsiderationinthisalgorithm)
4. ReturntoStep1ifM isnotempty.

Inmultiple
canopies

Eachelement
h
hasasingle
i l
primarycanopy

PART 3 b
PART3b
DISTRIBUTED ER
DISTRIBUTEDER

DistributedER
Mapreduceisverypopularforlargetasks
M
d
i
l f l
k
Simpleprogrammingmodelformassivelydistributeddata

Hadoop providesfaulttoleranceandisopensource
MapPhase
(perrecordcomputation)

Shuffle

ReducePhase
(globalcomputation)

ERwithDisjointBlocking
ComputeBlocksinMap
Map Phase
MapPhase
(perrecordcomputation)

RemainingERinReduce
Reduce Phase
ReducePhase
(globalcomputation)

Shuffle

BlockID

Noneedtocompare
recordsacross
reducers

NondisjointBlocking
Howtoblock?
Hashbased:
Hash based:needanefficienttechniquetogrouprecordsif
need an efficient technique to group records if
theymatchonnoutofkblockingkeys[Vernica etalSIGMOD10]
Distancebased:canopyclusteringonmapreduce[Mahout]
IterativeBlocking[Whang etalSIGMOD09]

Problem:Informationneededforarecordisin
p
multiplereducers.
Informationneededforarecordisinmultiplereducers.
Example1:
Example 1:
Reducer1:amatcheswithb
Reducer2:amatcheswithc
Needtocommunicateinordertocorrectlyresolvea,b,c
N dt
i t i
d t
tl
l b

Problem:Informationneededforarecordisin
p
multiplereducers.

Example2:Dedup papersandauthors
Id

Author1

Author2

Paper

A1

JohnSmith

RichardJohnson IndicesandViews

A2

JSmith

RJohnson

SQLQueries

A3

Dr Smyth
Dr.Smyth

R Johnson
RJohnson

Indices and Views


IndicesandViews

Slideadaptedfrom[Rastogi etalVLDB11]talk

Problem:Informationneededforarecordisin
p
multiplereducers.

Canopyclusteringresultsinnondisjointclusters
[McCallumetalKDD00]
J.Smith

Canopy
py
for
Richard

JohnS.

John
Richard
Smith JohnJacob
Johnson Richard
J h J b
Smith
RichardM.
R.Smith
Johnson
C
Canopy
forSmith

Canopy
f
for
John

Slideadaptedfrom[Rastogi etalVLDB11]talk

Problem:Informationneededforarecordisin
p
multiplereducers.
CoAuthor(A1,B1) CoAuthor(A2,B2) match(B1,B2) match(A1,A2)

CoAuthor rulegroundstothecorrelation
match(RichardJohnson,RJohnson)=>match(J.Smith,JohnSmith)

J.Smith
Richard
Johnson

Canopy
Canopy
for
Johnson

R
Johnson

Steve
Johnson

JohnS.

John
Smith JohnJacob
R.Smith
Canopy
forSmith

Canopy
for
John

Slideadaptedfrom[Rastogi etalVLDB11]talk

Problem:Informationneededforarecordisin
p
multiplereducers.
Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,
KangetalICDM2009]

+CorrelationClustering/CollectiveERineachcomponent
C
l i Cl
i / C ll i ER i
h
Solution2:CorrelationClustering/CollectiveERineachcanopy
g/
py
+MessagePassing[Rastogi etalVLDB11]

Problem:Informationneededforarecordisin
p
multiplereducers.
Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,
KangetalICDM2009]

+CorrelationClustering/CollectiveERineachcomponent
C
l i Cl
i / C ll i ER i
h
Connectedcomponentscanbelargeinrelational/multientityER.
p
g
/
y
Solution2:CorrelationClustering/CollectiveERineachcanopy
+MessagePassing[Rastogi etalVLDB11]

MessagePassing
Simple Message Passing (SMP)
SimpleMessagePassing(SMP)
1. RunentitymatcherM locallyineachcanopy
2. IfM findsamatch(r1,r2)insomecanopy,passitas
evidencetoallcanopies
3. RerunM withineachcanopyusingnewevidence
4. Repeatuntilnonewmatchesfoundineachcanopy
il
h f
di
h

Runtime:O(k
Runtime:
O(k2 f(k)c)
f(k) c)

k:maximumsizeofacanopy
f(k):TimetakenbyERoncanopyofsizek
c:numberofcanopies
Slideadaptedfrom[Rastogi etalVLDB11]talk

FormalProperties
forawellbehavedERmethod
Convergence:No.ofstepsno.ofmatches
Consistency:Outputindependentofthecanopyorder
y
p
p
py
Soundness:Eachoutputmatchisactuallyatruematch
Completeness:Eachtruematchisalsoaoutputmatch
J.Smith

JohnS.

John
Richard
Smith JohnJacob
Johnson Richard
Smith
RichardM.
R.Smith
Johnson
Slideadaptedfrom[Rastogi etalVLDB11]talk

Completeness
Papers2and3matchonlyifacanopy
knows that
knowsthat
match(a1,a2)
match(b2,b3)
match(c2,c3)
t h( 2 3)

Simplemessagepassingwillnotfindanymatches
thus,nomessagesarepassed,noprogress
Solution:Maximalmessagepassing
Sendamessageifthereisapotentialformatch

Slideadaptedfrom[Rastogi etalVLDB11]talk

SummaryofScalability
O(|R|2)pairwise computationscanbeprohibitive.
Blockingeliminatescomparisonsonalargefractionofnonmatches.

EqualitybasedBlocking:
Construct(oneormore)blockingkeysfromfeatures
Recordsnotmatchingonanykeyarenotcompared.
Records not matching on any key are not compared

Neighbohood basedBlocking:
Formoverlappingcanopiesofrecordsbasedonsimilarity.
Onlycomparerecordswithinacluster.

Computingconnectedcomponents/MessagePassinginadditionto
blocking can help distribute ER
blockingcanhelpdistributeER.

P t4
Part4
CHALLENGESANDFUTURE
CHALLENGES
AND FUTURE
DIRECTIONS

Challenges
Sofar,wehaveviewedERasaonetimeprocessappliedtoentire
database;noneoftheseholdinrealworld.
TemporalER
Temporal ER
ERalgorithmsneedtoaccountforchangeinrealworld
Reasoningaboutmultiplesources[Pal&M etal.WWW12]
Modeltransitions[LietalVLDB11]

Reasoningaboutsourcequality
Sourcesarenotindependent
CopyingProblem[DongetalVLDB09]

QueryTimeER
Howdoweselectivelydeterminethesmallestnumberofrecordstoresolve,so
wegetaccurateresultsforaparticularquery?
Collectiveresolutionforqueries
Collective resolution for queries [Bhattacharya&GetoorJAIR07]
[Bhattacharya & Getoor JAIR07]

ER&Usergenerateddata
Deduplicated entitiesinteractwithusersintherealworld
Userstag/associatephotos/reviewswithbusinessesonGoogle/Yahoo

Whatshouldbedonetosupportinteractions?

OpenIssues
ERisoftenpartofbiggerinferenceproblem
Pipelinedapproachesandjointapproachestoinformationextraction
and graph identification
andgraphidentification
HowcanwecharacterizehowERerrorsaffectoverallqualityof
results?

ERTheory
ER Theory
Needbettersupportfortheorywhichcangiverelationallearning
bounds

ER&Privacy
ER & Privacy
ERenablesrecordreidentification
HowdowedevelopatheoryofprivacypreservingER?

ERBenchmarks
ER B h
k
NeedforlargescalerealworldERdatasetswithgroundtruth
Syntheticdatausefulforscalingbuthardtocapturerichcomplexities
ofrealworld
f
l
ld

Summary
Growingomnipresenceofmassivelinkeddata,andtheneed
forcreatingknowledgebasesfromtextandunstructureddata
motivate a number of challenges in ER
motivateanumberofchallengesinER
EspeciallyinterestingchallengesandopportunitiesforERand
socialmedia/usergenerateddata
/
As
Asdata,noise,andknowledgegrows,greaterneeds&
data noise and knowledge grows greater needs &
opportunitiesforintelligentreasoningaboutentityresolution
Manyotherchallenges
M
th
h ll
Largescaleidentitymanagement
Understandingtheoreticalpotentials&limitsofER

THANK YOU!
THANKYOU!

References Intro
W.Willinger etal,MathematicsandtheInternet:ASourceofEnormousConfusionand
GreatPotential,NoticesoftheAMS56(5),2009
L Gill and M Goldcare English
L.GillandM.Goldcare,
EnglishNationalRecordLinkageofHospitalEpisodeStatisticsand
National Record Linkage of Hospital Episode Statistics and
DeathRegistrationRecords,ReporttotheDepartmentofHealth,2003
T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer2007
A. Elmagrid etal,
A.Elmagrid
et al, Duplicate
DuplicateRecordDetection
Record Detection,,TKDE2007
TKDE 2007
P.Christen,DataMatching,Springer2012
N.Koudas etal,RecordLinkage:SimilaritymeasuresandAlgorithms,SIGMOD2006
X. Dong & F. Naumann, Data
X.Dong&F.Naumann,
Datafusion
fusionResolving
Resolvingdataconflictsforintegration
data conflicts for integration,,VLDB2009
VLDB 2009
L.Getoor &A.Machanavajjhala,EntityResolution:Theory,PracticeandOpenChallenges,
AAAI2012

References SingleEntityER
D.Menestrina etal,EvaluationEntityResolutionResults,PVLDB3(12),2010
M.Cochinwala etal,Efficientdatareconciliation,InformationSciences137(14),2001
M Bilenko &R.Mooney,
M.Bilenko
& R Mooney Adaptive
AdaptiveDuplicateDetectionUsingLearnableStringSimilarity
Duplicate Detection Using Learnable String Similarity
Measures,KDD2003
P.Christen,Automaticrecordlinkageusingseedednearestneighbour andsupportvector
machineclassification.,KDD2008
Z.Chenetal,Exploitingcontextanalysisforcombiningmultipleentityresolutionsystems,
SIGMOD2009
A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun
Coreference,NIPS2004
f

H.Newcombe etal,Automaticlinkageofvitalrecords,Science1959
I.Fellegi &A.Sunter,ATheoryforRecordLinkage,JASA1969
W.Winkler,OverviewofRecordLinkageandCurrentResearchDirections,ResearchReport
Series,USCensus,2006
T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer,2007
P R ik
P.Ravikumar
&W C h
&W.Cohen,AHierarchicalGraphicalModelforRecordLinkage,UAI2004
A Hi
hi l G hi l M d l f R
d Li k UAI 2004

References SingleEntityER(contd.)
S.Sarawagi etal,InteractiveDeduplication usingActiveLearning,KDD2000
S.Tejada etal,LearningObjectIdentificationRulesforInformationIntegration,IS2001
A Arasu etal,
A.Arasu
et al On
Onactivelearningofrecordmatchingpackages
active learning of record matching packages,SIGMOD2010
SIGMOD 2010
K.Bellare etal,Activesamplingforentitymatching,KDD2012
A.Beygelzimer etal,AgnosticActiveLearningwithoutConstraints,NIPS2010
J Wang et al CrowdER:
J.Wangetal,
CrowdER:Crowdsourcing
Crowdsourcing EntityResolution
Entity Resolution,PVLDB5(11),2012
PVLDB 5(11) 2012
A.Marcusetal,HumanpoweredSortsandJoins,PVLDB5(1),2011

References SingleEntityER(contd.)

R.Gupta&S.Sarawagi,
R.
Gupta & S. Sarawagi, Answering
AnsweringTableAugmentationQueriesfromUnstructuredListsontheWeb
Table Augmentation Queries from Unstructured Lists on the Web,,
PVLDB2(1),2009
A.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,CIKM
2012
M Bilenko etal,
M.Bilenko
et al Adaptive
AdaptiveProductNormalization:UsingOnlineLearningforRecordLinkagein
Product Normalization: Using Online Learning for Record Linkage in
ComparisonShopping,ICDM2005
S.Chaudhuri etal,RobustIdentificationofFuzzyDuplicates,ICDE2005
W.Soonetal,Amachinelearningapproachtocoreference resolutionofnounphrases,
C
ComputationalLinguistics27(4)2001
t ti
l Li
i ti 27(4) 2001
N.Bansal etal,CorrelationClustering,MachineLearning56(13),2004
V.Ng&C.Cardie,Improvingmachinelearningapproachestocoreference resolution,ACL2002
,
g
p
g
M.Elsner &E.Charnaik,Youtalkingtome?acorpusandalgorithmforconversation
disentanglement,ACLHLT2008
M.Elsner &W.Schudy,BoundingandComparingMethodsforCorrelationClusteringBeyondILP,
ILPNLP2009
N Ailon etal,
N.Ailon
et al Aggregating
Aggregatinginconsistentinformation:Rankingandclustering
inconsistent information: Ranking and clustering,JACM55(5),2008
JACM 55(5) 2008
X.Dongetal,IntegratingConflictingData:TheRoleofSourceDependence,PVLDB2(1),2009
A.Paletal,InformationIntegrationoverTimeinUnreliableandUncertainEnvironments,WWW
2012
A.Culotta etal,CanonicalizationofDatabaseRecordsusingAdaptiveSimilarityMeasures,KDD2007
O.Benjelloun etal,Swoosh:AgenericapproachtoEntityResolution,VLDBJ18(1),2009

References Constraints&MultiRelationalER
R.Ananthakrishna
A
h k i h et.al,Eliminatingfuzzyduplicatesindatawarehouses,VLDB2002
l li i i f
d li
i d
h
VL 2002
A.Arasu etal,LargeScaleDeduplication withConstraintsusingDedupalog,ICDE2009
S.Chaudhuri etal.,Leveragingaggregateconstraintsfordeduplication,SIGMOD07
X.Dongetal,ReferenceRecounciliation inComplexInformationSpaces,SIGMOD2005
I.Bhattacharya&L.Getoor,CollectiveEntityResolutioninRelationalData,TKDD2007
I.Bhattacharya&L.Getoor,ALatentDirichlet ModelforUnsupervisedEntityResolution,SDM2007
P.Bohannonetal.,ConditionalFunctionalDependenciesforDataCleaning,ICDE 2007
M.Broecheler &L.Getoor,ProbabilisticSimilarityLogic,UAI2010
,
y g ,
W.Fan,Dependenciesrevisitedforimprovingdataquality,PODS2008
H.Pasula etal,IdentityUncertaintyandCitationMatching,NIPS2002
D.Kalashnikovetal,DomainIndependentDataCleaningviaAnalysisofEntityRelationshipGraph,
TODS06
J.Laffertyetal,ConditionalRandomFields:ProbabilisticModelsforSegmentingandLabeling
SequenceData.,ICML2001
X.Lietal,IdentificationandTracingofAmbiguousNames:DiscriminativeandGenerative
Approaches,AAAI2004
A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun
Coreference,NIPS2004
M.Richardson&P.Domingos,MarkovLogic,MachineLearning62,2006
W.Shen etal.,ConstraintbasedEntityMatching,AAAI2005
P.Singla &P.Domingos,EntityResolutionwithMarkovLogic,ICDM2006
Whang etal.,GenericEntityResolutionwithNegativeRules,VLDBJ2009
Whang elal.,JointEntityResolution,ICDE2012

References Blocking
M.Bilenko etal,AdaptiveBlocking:LearningtoScaleUpRecordLinkageandClustering,ICDM
2006
M.Michelson&C.Knoblock,LearningBlockingSchemesforRecordLinkage,AAAI2006
g
g
g
A.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,
CIKM2012
A.Broder etal,MinWiseIndependentPermutations,STOC1998
G P di etal,Beyond100millionentities:largescaleblockingbasedresolutionfor
G.Papadias
l B
d 100 illi
ii l
l bl ki b d
l i f
heterogenous data,WSDM2012
M.Hernandez&S.Stolfo,Themerge/purgeproblemforlargedatabases,SIGMOD1995
A. McCallum et al, Efficient
A.McCallumetal,
Efficientclusteringofhigh
clustering of highdimensional
dimensionaldatasetswithapplicationto
data sets with application to
referencematching,KDD2000
L.Kolbetal,Dedoop:Efficientdeduplication withHadoop,(demo)PVLDB5(12),2012
R.Vernica etal,EfficientParallelSetSimilarityJoinsUsingMapReduce,SIGMOD2010
ApacheMahout:ScalableMachineLearningandDataMining,http://mahout.apache.org/
S.Whang etal,EntityResolutionwithIterativeBlocking,SIGMOD2009
U.Kangetal,PEGASUS:APetaScaleGraphMiningSystem Implementationand
Observations,ICDM2009
Observations
ICDM 2009
V.Rastogi etal,FindingConnectedComponentsonMapreduceinPolyLogRounds,Corr 2012
V.Rastogi etal,LargeScaleCollectiveEntityMatching,PVLDB4(4),2011

References Challenges&FutureDirections
I.BhattacharyaandL.Getoor,"QuerytimeEntityResolution",JAIR2007
X.Dong,L.BertiEquille,D.Srivastava,Truthdiscoveryandcopyingdetectioninadynamic
world,VLDB2009
world
VLDB 2009
P.Li,X.Dong,A.Maurino,D.Srivastava,LinkingTemporalRecords,VLDB2011
A. Pal,V.Rastogi,A.Machanavajjhala,P.Bohannon,Informationintegrationovertimein
unreliable and uncertain environments,,WWW2012
unreliableanduncertainenvironments
WWW 2012

Você também pode gostar