
Active Sampling

for Entity Matching


Kedar Bellare1, Suresh Iyengar1, Aditya Parameswaran2 & Vibhor Rastogi1
1 Yahoo! Research
2 Stanford University

KDD 2012
(slide)[http://shrdocs.com/presentations/9266/index.html]

5
Takuya Makino

Saturday, August 3, 13

Active Sampling for Entity Matching with Guarantees

ACM Transactions on Embedded Computing Systems, 2010

Kedar Bellare1, Suresh Iyengar2, Aditya Parameswaran3 & Vibhor Rastogi4
1 Facebook Inc.
2 Microsoft Research Lab India
3 Stanford University
4 Google Inc.


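Entity matching

Task: decide whether a pair of records refers to the same entity (match == True/False)
Labels are collected with active learning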


Entity Matching
Imbalanced data:
even after blocking, non-match : match = 100:1

Imbalanced data [Arasu, 11]
precision and recall


precision and recall

sub-linear label complexity


Importance Weighted Active Learning


[Beygelzimer+, 09]
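For each x_t, compute a query probability p_t
p_t is determined from the loss

If the label of x_t is queried, x_t gets weight 1/p_t
Learn by minimizing the importance-weighted loss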
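The weighting step can be sketched as follows. This is a minimal illustration of importance weighting only, not the full IWAL procedure of [Beygelzimer+, 09]: the rule that produces p_t is passed in as a hypothetical `query_prob` function, and all function names are assumptions made for this sketch. It shows that weighting each queried point by 1/p_t keeps the loss estimate unbiased.

```python
import random

def importance_weighted_sample(xs, query_prob, rng):
    """Keep each x_t with probability p_t and attach the weight 1/p_t."""
    sample = []
    for x in xs:
        p = query_prob(x)            # p_t in (0, 1]; hypothetical rule
        if rng.random() < p:         # coin flip with bias p_t
            sample.append((x, 1.0 / p))
    return sample

def weighted_total_loss(sample, loss):
    """Importance-weighted loss: sum of loss(x_t) / p_t over queried points."""
    return sum(w * loss(x) for x, w in sample)

def mean_estimate(xs, query_prob, loss, trials, seed=0):
    """Average the weighted loss over repeated runs of the sampler."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = importance_weighted_sample(xs, query_prob, rng)
        total += weighted_total_loss(sample, loss)
    return total / trials
```

With p_t = 1 for every point the estimate equals the exact total loss; with smaller p_t fewer labels are requested and the estimate is noisier but still centered on the same value.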

Overview

IWAL [Beygelzimer+, 09]



CONVEXHULL Algorithm

IWAL [Beygelzimer+, 09]



precision and recall
maximize RECALL(h),
subject to PRECISION(h) >= r,

=>

maximize -fn(h),
subject to tp(h) - α fp(h) >= 0,

α = r/(1-r)

h = argmax -fn(h) + λ (tp(h) - α fp(h))

h = argmax X(h) + λ Y(h),
X(h) = -fn(h), Y(h) = tp(h) - α fp(h)

black box B(λ)

h is obtained by calling the black box
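The rewriting of the precision constraint can be checked numerically. The sketch below uses exact fractions to confirm that PRECISION(h) >= r holds exactly when tp(h) - (r/(1-r)) fp(h) >= 0; the function names are assumptions made for this illustration.

```python
from fractions import Fraction

def precision(tp, fp):
    """Precision as an exact fraction tp / (tp + fp)."""
    return Fraction(tp, tp + fp)

def constraint_y(tp, fp, r):
    """Linearized constraint value: tp - (r / (1 - r)) * fp."""
    alpha = r / (1 - r)
    return tp - alpha * fp

def agree(tp, fp, r):
    """True when both forms of the constraint give the same answer."""
    return (precision(tp, fp) >= r) == (constraint_y(tp, fp, r) >= 0)
```

For example, with tp = 9, fp = 1 and r = 9/10 the precision is exactly 9/10 and the linearized value is exactly 0, so both forms sit on the boundary together.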


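Search over the hypothesis space H

Goal: the h with maximum recall among those with Y(h) >= 0
P = {(X(h), Y(h)) : h ∈ H}
H is mapped to the point set P in the plane
The black box returns the point of P touched by a line of slope -1/λ, where λ is the weight passed to the black box
Binary search on λ finds an h with Y(h) >= 0 in O(log n) calls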
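A toy version of this picture, assuming a finite set of hypotheses whose (X(h), Y(h)) values are already known. The four points and the names `POINTS` and `black_box` are made up for illustration; in the real setting the black box is a learner, not a lookup. Maximizing X(h) + λ Y(h) over the set only ever returns points on the upper-right boundary of P, so a dominated point such as h4 is never selected.

```python
# Hypothetical (X(h), Y(h)) pairs for four hypotheses.
POINTS = {"h1": (-1, -3), "h2": (-2, 1), "h3": (-4, 2), "h4": (-3, -1)}

def black_box(lam, points=POINTS):
    """Return the hypothesis maximizing X(h) + lam * Y(h)."""
    return max(points, key=lambda h: points[h][0] + lam * points[h][1])
```

Small values of the weight favor X(h) (recall) and return h1; larger values favor Y(h) (the precision constraint) and return h2 and then h3.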

[Figure: each hypothesis h plotted as a point, with X(h) on the horizontal axis and Y(h) on the vertical axis; labels mark the precision constraint (Y(h) >= 0) and the h with maximum recall.]

while min < max do
mid = (min + max) / 2
h = B(mid)
if Y(h) >= 0, max = mid
else min = mid

[Figure, three frames: the interval [min, max] and its midpoint mid on the plot of Y(h) against X(h); the interval narrows toward the h that meets the precision constraint with maximum recall.]
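A runnable sketch of the search loop, under stated assumptions: the black box is passed in as a function, the loop stops when the interval is shorter than a tolerance `tol` (the slides do not give a stopping rule), and the toy points are made up for illustration. It relies on Y(B(λ)) not decreasing as the weight λ grows.

```python
def search(black_box, y_of, lo, hi, tol=1e-6):
    """Bisect on the weight; return the last feasible h and the call count."""
    best = None
    calls = 0
    while hi - lo > tol:              # assumed stopping rule
        mid = (lo + hi) / 2.0
        h = black_box(mid)
        calls += 1
        if y_of(h) >= 0:              # precision constraint satisfied
            best, hi = h, mid         # try a smaller weight for more recall
        else:
            lo = mid
    return best, calls

# Hypothetical (X(h), Y(h)) pairs standing in for a real learner.
POINTS = {"h1": (-1, -3), "h2": (-2, 1), "h3": (-4, 2), "h4": (-3, -1)}

def toy_black_box(lam):
    """Return the toy hypothesis maximizing X(h) + lam * Y(h)."""
    return max(POINTS, key=lambda h: POINTS[h][0] + lam * POINTS[h][1])

def toy_y(h):
    """Y(h) of a toy hypothesis."""
    return POINTS[h][1]
```

On the toy points, `search(toy_black_box, toy_y, 0.0, 10.0)` settles on h2, the feasible point with the largest X(h), and the number of black-box calls grows with the logarithm of the interval length divided by the tolerance.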

REJECTION SAMPLING Algorithm

IWAL [Beygelzimer+, 09]



black box B(*)


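minimize (β fn(h) + (1 - β) fp(h))/n, with a weight β between 0 and 1
RECALL (0-1 LOSS)
min fn(h)/n (0-1 LOSS, WEIGHTED)

Weights: β on false negatives, (1 - β) on false positives
The weighted loss is reduced to a 0-1 LOSS black box B

http://www.machinedlearnings.com/2012/01/cost-sensitive-binary-classification.html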
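One way to see why a weighted 0-1 loss is the right target: writing tp(h) = pos - fn(h), where pos is the number of true matches, the objective -fn(h) + λ (tp(h) - α fp(h)) equals λ pos - (1 + λ) fn(h) - λ α fp(h). Maximizing it is therefore the same as minimizing β fn(h) + (1 - β) fp(h) with β = (1 + λ)/(1 + λ + λ α). The symbols λ, α and β, this normalization, and the toy error counts are assumptions made for this sketch rather than notation confirmed by the slides.

```python
def weight_from_lambda(lam, alpha):
    """Assumed normalization: beta = (1 + lam) / (1 + lam + lam * alpha)."""
    return (1.0 + lam) / (1.0 + lam + lam * alpha)

def lagrangian(fn, fp, pos, lam, alpha):
    """-fn + lam * (tp - alpha * fp), using tp = pos - fn."""
    tp = pos - fn
    return -fn + lam * (tp - alpha * fp)

def weighted_loss(fn, fp, n, beta):
    """(beta * fn + (1 - beta) * fp) / n."""
    return (beta * fn + (1.0 - beta) * fp) / n

def best_by_lagrangian(hyps, pos, lam, alpha):
    """Hypothesis with the largest Lagrangian value."""
    return max(hyps, key=lambda h: lagrangian(hyps[h][0], hyps[h][1], pos, lam, alpha))

def best_by_weighted_loss(hyps, n, lam, alpha):
    """Hypothesis with the smallest weighted 0-1 loss."""
    beta = weight_from_lambda(lam, alpha)
    return min(hyps, key=lambda h: weighted_loss(hyps[h][0], hyps[h][1], n, beta))
```

Both selectors pick the same hypothesis for a given pair of weights, which is what lets a cost-sensitive learner play the role of the black box.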

Reference: http://www.slideshare.net/pfi/20120105-pfi


REJECTION SAMPLING

[Figure, two frames: matching pairs are kept with probability β and non-matching pairs with probability 1 - β, so that the 0-1 loss on the kept sample estimates the weighted loss.]

(β fn(h) + (1 - β) fp(h))/n
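The reduction can be sketched in a few lines, assuming labels 1 for a match and 0 for a non-match and a weight written here as `beta`; the function names are hypothetical. A false negative is a misclassified match, which survives sampling with probability beta, and a false positive is a misclassified non-match, which survives with probability 1 - beta, so the plain error count on the kept sample, divided by n, has the weighted loss as its expectation.

```python
import random

def rejection_sample(examples, beta, rng):
    """Keep matches (y == 1) w.p. beta and non-matches w.p. 1 - beta."""
    kept = []
    for x, y in examples:
        keep_prob = beta if y == 1 else 1.0 - beta
        if rng.random() < keep_prob:
            kept.append((x, y))
    return kept

def zero_one_errors(h, examples):
    """Number of examples that h labels incorrectly."""
    return sum(1 for x, y in examples if h(x) != y)

def weighted_loss(h, examples, beta):
    """(beta * fn + (1 - beta) * fp) / n on the full data."""
    fn = sum(1 for x, y in examples if y == 1 and h(x) == 0)
    fp = sum(1 for x, y in examples if y == 0 and h(x) == 1)
    return (beta * fn + (1.0 - beta) * fp) / len(examples)
```

Averaging `zero_one_errors` over many resampled sets and dividing by the original n approaches `weighted_loss`, which is why an ordinary 0-1 loss learner can be reused on the resampled data.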

Label Complexity
Label complexity of B, times max(β/(1 - β), (1 - β)/β), where β is the weight on fn(h) in the weighted loss
O(log n)


Conclusion
entity matching: precision and recall

active learning
black box
label & computational complexity
recall
outperforms the state of the art
