
Probabilistic/Uncertain Data Management -- III

1. Dalvi, Suciu. Efficient query evaluation on probabilistic databases, VLDB 2004.
2. Das Sarma et al. Working models for uncertain data, ICDE 2006.

Slides based on the Suciu/Dalvi SIGMOD05 tutorial



What is a Probabilistic Database?

Whether an item belongs to the database is a probabilistic event
  Tuple-existence uncertainty
  Attribute-value uncertainty

Whether a tuple is an answer to the query is a probabilistic event
Can be extended to all data models; we discuss only probabilistic relational data

Possible Worlds Semantics


The set of all possible database instances: INST = {I1, I2, I3, ..., IN}

Definition: A probabilistic database Ip is a probability distribution on INST

Pr : INST → [0,1]   s.t.   Σ_{i=1..N} Pr(Ii) = 1

Definition: A possible world is an instance I s.t. Pr(I) > 0

Query Semantics

Given a query Q and a probabilistic database Ip, what is the meaning of Q(Ip)?

Query Semantics
Semantics 1: Possible Answers
  A probability distribution on sets of tuples
  ∀ A. Pr(Q = A) = Σ_{I ∈ INST, Q(I) = A} Pr(I)

Semantics 2: Possible Tuples
  A probability function on tuples
  ∀ t. Pr(t ∈ Q) = Σ_{I ∈ INST, t ∈ Q(I)} Pr(I)
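The two semantics can be contrasted on a toy example. The sketch below is only an illustration: the three worlds, their probabilities, and the `cities` query are hypothetical and do not come from the slides. It computes both the distribution over whole answer sets and the per-tuple marginals.

```python
from collections import defaultdict

def possible_answers(worlds, query):
    """Semantics 1: distribution over whole answer sets, Pr(Q = A)."""
    dist = defaultdict(float)
    for instance, prob in worlds:
        dist[frozenset(query(instance))] += prob
    return dict(dist)

def possible_tuples(worlds, query):
    """Semantics 2: marginal probability of each answer tuple, Pr(t in Q)."""
    marginals = defaultdict(float)
    for instance, prob in worlds:
        for t in set(query(instance)):
            marginals[t] += prob
    return dict(marginals)

# Hypothetical probabilistic database: three possible worlds of (name, city) tuples.
WORLDS = [
    (frozenset({("John", "Seattle"), ("Sue", "Boston")}), 0.5),
    (frozenset({("John", "Seattle")}), 0.3),
    (frozenset(), 0.2),
]

def cities(instance):
    """Example query, roughly SELECT DISTINCT city."""
    return {city for _name, city in instance}

answers = possible_answers(WORLDS, cities)
marginals = possible_tuples(WORLDS, cities)
```

Here `answers` keeps the information that Boston never appears without Seattle, while `marginals` only reports one number per city, which is why the second semantics is simpler but cannot be composed.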

Possible Worlds Query Semantics


Possible answers semantics
  Precise
  Can be used to compose queries
  Difficult user interface
Possible tuples semantics
  Less precise, but simple; sufficient for most apps
  Cannot be used to compose queries
  Simple user interface

Possible Worlds Semantics: Summary


Complete model; clean formal semantics for SQL queries
Not very useful as a representation or implementation tool
  HUGE number of possible worlds!
Need more effective representation formalisms
  Something that users can understand/explore
  Allow more efficient query execution
    Avoid possible worlds explosion
  Perhaps giving up completeness

Representation Formalisms
Problem
  Need a good representation formalism
  Will be interpreted as possible worlds
  Several formalisms exist, but no winner

Main open problem in probabilistic db

Evaluation of Formalisms
Completeness?
  What possible worlds can it represent?
  What probability distributions on worlds?
Closure?
  Is it closed under evaluation of query operators?

A Complete Formalism: Intensional Databases


Atomic event ids: e1, e2, e3, ...
Probabilities: p1, p2, p3, ... ∈ [0,1]
Event expressions (built with ∧, ∨, ¬): e3 ∧ (e5 ∨ ¬e2)

Intensional probabilistic database J: each tuple t has an event attribute t.E

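Intensional DB ⇒ Possible Worlds

J =
  Name   Address   E
  John   Seattle   e1 ∧ (e2 ∨ e3)
  Sue    Denver    (e1 ∧ e2) ∨ (e2 ∧ e3)

Ip = one instance per truth assignment of e1 e2 e3:

  e1e2e3   Instance
  000      ∅
  001      ∅
  010      ∅
  011      {Sue Denver}
  100      ∅
  101      {John Seattle}
  110      {John Seattle, Sue Denver}
  111      {John Seattle, Sue Denver}

  Instance                      Probability
  ∅                             (1-p1)(1-p2)(1-p3) + (1-p1)(1-p2)p3 + (1-p1)p2(1-p3) + p1(1-p2)(1-p3)
  {John Seattle}                p1(1-p2)p3
  {Sue Denver}                  (1-p1)p2p3
  {John Seattle, Sue Denver}    p1p2(1-p3) + p1p2p3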
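The translation from an intensional database to its possible worlds can be checked mechanically. The following is a minimal sketch, assuming the two event expressions are e1 ∧ (e2 ∨ e3) for John and (e1 ∧ e2) ∨ (e2 ∧ e3) for Sue, and using made-up values for p1, p2, p3 (the slide keeps them symbolic). The function name `worlds_of` is hypothetical.

```python
from itertools import product
from collections import defaultdict

# Event expressions of the example, written as predicates over (e1, e2, e3).
J = {
    ("John", "Seattle"): lambda e1, e2, e3: e1 and (e2 or e3),
    ("Sue", "Denver"): lambda e1, e2, e3: (e1 and e2) or (e2 and e3),
}

def worlds_of(J, probs):
    """Enumerate all truth assignments and group them by the instance they induce."""
    dist = defaultdict(float)
    for assignment in product([False, True], repeat=len(probs)):
        weight = 1.0
        for value, p in zip(assignment, probs):
            weight *= p if value else 1.0 - p
        instance = frozenset(t for t, event in J.items() if event(*assignment))
        dist[instance] += weight
    return dict(dist)

# Assumed probabilities for e1, e2, e3; chosen only for illustration.
p1, p2, p3 = 0.5, 0.4, 0.3
Ip = worlds_of(J, [p1, p2, p3])
```

With these values the four distinct instances receive the probabilities given by the closed-form expressions above, for example p1(1-p2)p3 for the world containing only John.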

Possible Worlds ⇒ Intensional DB

Ip =
  Instance                                      Probability
  {John Seattle, John Boston, Sue Seattle}      p1
  {John Seattle, Sue Seattle}                   p2
  {Sue Seattle}                                 p3
  {John Boston}                                 p4

Prefix code:
  E1 = e1
  E2 = ¬e1 ∧ e2
  E3 = ¬e1 ∧ ¬e2 ∧ e3
  E4 = ¬e1 ∧ ¬e2 ∧ ¬e3 ∧ e4

  Pr(e1) = p1
  Pr(e2) = p2/(1-p1)
  Pr(e3) = p3/(1-p1-p2)
  Pr(e4) = p4/(1-p1-p2-p3)

J =
  Name   Address   E
  John   Seattle   E1 ∨ E2
  John   Boston    E1 ∨ E4
  Sue    Seattle   E1 ∨ E2 ∨ E3

Intensional DBs are complete
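The prefix-code construction can be sanity-checked numerically. This is a sketch under assumed world probabilities p1..p4 (the slide leaves them symbolic); the helper names are hypothetical. It derives Pr(ek) as above and then confirms that the events Ek reproduce the original world probabilities.

```python
from itertools import product

def prefix_code_probs(world_probs):
    """Pr(e_k) = p_k / (1 - p_1 - ... - p_{k-1})."""
    probs, remaining = [], 1.0
    for p in world_probs:
        probs.append(p / remaining)
        remaining -= p
    return probs

def world_distribution(event_probs):
    """World k is chosen when E_k holds, i.e. e_k is the first true atomic event."""
    dist = [0.0] * len(event_probs)
    for assignment in product([False, True], repeat=len(event_probs)):
        weight = 1.0
        for value, p in zip(assignment, event_probs):
            weight *= p if value else 1.0 - p
        if True in assignment:
            dist[assignment.index(True)] += weight
    return dist

# Assumed values for p1..p4; they must sum to 1.
P = [0.1, 0.2, 0.3, 0.4]
event_probs = prefix_code_probs(P)
recovered = world_distribution(event_probs)
```

Because the Ek are mutually exclusive by construction, any distribution over a finite set of worlds can be encoded this way, which is the sense in which intensional databases are complete.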

Closure Under Operators


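  Operator        Input tuples            Output tuple
  Selection σ     v [E]                   v [E]
  Projection Π    v [E1], v [E2], ...     v [E1 ∨ E2 ∨ ...]
  Product ×       v1 [E1], v2 [E2]        v1 v2 [E1 ∧ E2]
  Difference -    v [E1], v [E2]          v [E1 ∧ ¬E2]

One still needs to compute the probability of the event expression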

Summary on Intensional Databases


Event expression for each tuple
  Possible worlds: any subset
  Probability distribution: any
Complete but impractical
  Must evaluate the probability of long event expressions
Important abstraction: consider restrictions

Related to c-tables [Imielinski&Lipski:1984]

A Restricted Formalism: Explicit Independent Tuples


Tuple independent probabilistic database

TUP = {t1, t2, ..., tM} = all tuples
INST = P(TUP), N = 2^M
pr : TUP → [0,1]   (no restrictions)

Pr(I) = Π_{t ∈ I} pr(t) · Π_{t ∉ I} (1 - pr(t))

Tuple Prob. ⇒ Possible Worlds

J =
  Name   City      pr
  John   Seattle   p1 = 0.8
  Sue    Boston    p2 = 0.6
  Fred   Boston    p3 = 0.9

Ip =
  World   Tuples             Probability
  I1      ∅                  (1-p1)(1-p2)(1-p3)
  I2      John               p1(1-p2)(1-p3)
  I3      Sue                (1-p1)p2(1-p3)
  I4      Fred               (1-p1)(1-p2)p3
  I5      John, Sue          p1p2(1-p3)
  I6      John, Fred         p1(1-p2)p3
  I7      Sue, Fred          (1-p1)p2p3
  I8      John, Sue, Fred    p1p2p3

  The eight probabilities sum to 1

E[ size(Ip) ] = 2.3 tuples
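The numbers on this slide can be reproduced directly from the product formula for Pr(I). The sketch below uses the tuple probabilities given above (0.8, 0.6, 0.9); the function and variable names are hypothetical.

```python
from itertools import combinations

# Tuple probabilities from the example table.
PR = {("John", "Seattle"): 0.8, ("Sue", "Boston"): 0.6, ("Fred", "Boston"): 0.9}

def world_prob(instance, pr):
    """Pr(I): multiply pr(t) for tuples in I and 1 - pr(t) for tuples not in I."""
    prob = 1.0
    for t, p in pr.items():
        prob *= p if t in instance else 1.0 - p
    return prob

tuples = list(PR)
# All 2^M subsets of TUP are possible worlds.
worlds = [frozenset(c) for k in range(len(tuples) + 1) for c in combinations(tuples, k)]
total = sum(world_prob(w, PR) for w in worlds)
expected_size = sum(len(w) * world_prob(w, PR) for w in worlds)
```

The expected size equals 0.8 + 0.6 + 0.9 = 2.3 by linearity of expectation, so the enumeration is only needed as a cross-check.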

Tuple-Independent DBs are Incomplete


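Ip =
  Instance                        Probability
  ∅                               1 - p1 - p1p2
  {John Seattle}                  p1
  {John Seattle, Sue Seattle}     p1p2

Tuple-independent table over the same tuples:
  Name   Address   pr
  John   Seattle   p1
  Sue    Seattle   p2

In general no choice of independent tuple probabilities yields Ip: Sue appears only in worlds that also contain John.

Very limited: cannot capture correlations across tuples
Not closed: query operators can introduce complex correlations!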

Tuple Prob. ⇒ Query Evaluation

Person
  Name   City      pr
  John   Seattle   p1
  Sue    Boston    p2
  Fred   Boston    p3

Purchase
  Customer   Product   Date   pr
  John       Gizmo     ...    q1
  John       Gadget    ...    q2
  John       Gadget    ...    q3
  Sue        Camera    ...    q4
  Sue        Gadget    ...    q5
  Sue        Gadget    ...    q6
  Fred       Gadget    ...    q7

SELECT DISTINCT x.city
FROM Person x, Purchase y
WHERE x.Name = y.Customer and y.Product = 'Gadget'

  Tuple     Probability
  Seattle   p1(1-(1-q2)(1-q3))
  Boston    1-(1-p2(1-(1-q5)(1-q6)))(1-p3q7)

Application: Similarity Predicates


Person
  Name   City      Profession
  John   Seattle   statistician
  Sue    Boston    musician
  Fred   Boston    physicist

Purchase
  Cust   Product   Category
  John   Gizmo     dishware
  John   Gadget    instrument
  John   Gadget    instrument
  Sue    Camera    musicware
  Sue    Gadget    microphone
  Sue    Gadget    instrument
  Fred   Gadget    microphone

SELECT DISTINCT x.city
FROM Person x, Purchase y
WHERE x.Name = y.Cust and y.Product = 'Gadget'
  and x.profession ~ 'scientist' and y.category ~ 'music'

Step 1: evaluate ~ predicates

Application: Similarity Predicates


Step 1: evaluate ~ predicates

Person^p
  Name   City      Profession     pr
  John   Seattle   statistician   p1=0.8
  Sue    Boston    musician       p2=0.2
  Fred   Boston    physicist      p3=0.9

Purchase^p
  Cust   Product   Category     pr
  John   Gizmo     dishware     q1=0.2
  John   Gadget    instrument   q2=0.6
  John   Gadget    instrument   q3=0.6
  Sue    Camera    musicware    q4=0.9
  Sue    Gadget    microphone   q5=0.7
  Sue    Gadget    instrument   q6=0.6
  Fred   Gadget    microphone   q7=0.7

Step 2: evaluate rest of query

SELECT DISTINCT x.city
FROM Person^p x, Purchase^p y
WHERE x.Name = y.Cust and y.Product = 'Gadget'
  and x.profession ~ 'scientist' and y.category ~ 'music'

  Tuple     Probability
  Seattle   p1(1-(1-q2)(1-q3))
  Boston    1-(1-p2(1-(1-q5)(1-q6)))(1-p3q7)
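With the similarity scores above plugged in, the two answer probabilities can be computed and cross-checked against a brute-force sum over all possible worlds. This is a sketch only: it assumes all ten tuples are independent, treats the scores as tuple probabilities, and uses hypothetical function names.

```python
from itertools import product

PERSON = [("John", "Seattle", 0.8), ("Sue", "Boston", 0.2), ("Fred", "Boston", 0.9)]
PURCHASE = [("John", "Gizmo", 0.2), ("John", "Gadget", 0.6), ("John", "Gadget", 0.6),
            ("Sue", "Camera", 0.9), ("Sue", "Gadget", 0.7), ("Sue", "Gadget", 0.6),
            ("Fred", "Gadget", 0.7)]

def closed_form():
    """Evaluate the two formulas from the result table."""
    p1, p2, p3 = (p for _, _, p in PERSON)
    q1, q2, q3, q4, q5, q6, q7 = (q for _, _, q in PURCHASE)
    seattle = p1 * (1 - (1 - q2) * (1 - q3))
    boston = 1 - (1 - p2 * (1 - (1 - q5) * (1 - q6))) * (1 - p3 * q7)
    return {"Seattle": seattle, "Boston": boston}

def brute_force():
    """Sum the weights of all 2^10 worlds in which each city is an answer."""
    rows = [("person", r) for r in PERSON] + [("purchase", r) for r in PURCHASE]
    result = {"Seattle": 0.0, "Boston": 0.0}
    for present in product([False, True], repeat=len(rows)):
        weight = 1.0
        for keep, (_, row) in zip(present, rows):
            weight *= row[2] if keep else 1.0 - row[2]
        people = [r for keep, (kind, r) in zip(present, rows) if keep and kind == "person"]
        buys = [r for keep, (kind, r) in zip(present, rows) if keep and kind == "purchase"]
        cities = {city for name, city, _ in people
                  for cust, item, _ in buys
                  if cust == name and item == "Gadget"}
        for city in cities:
            result[city] += weight
    return result

formula = closed_form()
exact = brute_force()
```

Both routes should give roughly 0.672 for Seattle and 0.695 for Boston; the closed form needs a handful of multiplications, while the enumeration grows exponentially with the number of tuples.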

Summary on Explicit Independent Tuples


Independent tuples
  Possible worlds: subsets
  Probability distribution: restricted
  Closure: no

Query Evaluation on Probabilistic DBs


Focus on possible tuple semantics
  Compute likelihood of individual answer tuples

Probability of Boolean expressions
  Needed for query processing
Complexity of query evaluation

Probability of Boolean Expressions


E = X1X3 ∨ X1X4 ∨ X2X5 ∨ X2X6

Randomly make each variable true with the following probabilities:
Pr(X1) = p1, Pr(X2) = p2, ..., Pr(X6) = p6

What is Pr(E)?

Answer: re-group cleverly
E = X1(X3 ∨ X4) ∨ X2(X5 ∨ X6)

Pr(E) = 1 - (1-p1(1-(1-p3)(1-p4))) (1-p2(1-(1-p5)(1-p6)))
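The regrouping works because, after factoring, every variable occurs exactly once, so each AND and OR combines independent sub-expressions. A small sketch of that evaluation follows; the helper names `p_and`, `p_or`, and `pr_regrouped` are hypothetical.

```python
def p_and(*ps):
    """Probability that all of several independent events hold."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def p_or(*ps):
    """Probability that at least one of several independent events holds."""
    miss = 1.0
    for p in ps:
        miss *= 1.0 - p
    return 1.0 - miss

def pr_regrouped(p1, p2, p3, p4, p5, p6):
    """Pr of E = X1(X3 or X4) or X2(X5 or X6), evaluated bottom-up.

    Each variable appears once, so siblings at every level are independent.
    """
    return p_or(p_and(p1, p_or(p3, p4)), p_and(p2, p_or(p5, p6)))

# Assumed probabilities, for illustration only.
value = pr_regrouped(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
```

The cost is linear in the size of the expression, in contrast to summing over all 64 truth assignments.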

Now let's try this:

E = X1X2 ∨ X1X3 ∨ X2X3

No clever grouping seems possible. Brute force:

  X1   X2   X3   E    Pr
  0    0    0    0
  0    0    1    0
  0    1    0    0
  0    1    1    1    (1-p1)p2p3
  1    0    0    0
  1    0    1    1    p1(1-p2)p3
  1    1    0    1    p1p2(1-p3)
  1    1    1    1    p1p2p3

Pr(E) = (1-p1)p2p3 + p1(1-p2)p3 + p1p2(1-p3) + p1p2p3

Seems inefficient in general
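The brute-force approach generalizes to any positive DNF, at a cost of 2^n assignments for n variables. A minimal sketch, with a hypothetical function name and assumed probability values:

```python
from itertools import product

def pr_dnf_bruteforce(clauses, probs):
    """Sum the weights of all satisfying assignments of a positive DNF.

    clauses: list of tuples of 0-based variable indices (one tuple per conjunct).
    probs:   probs[i] is the probability that variable i is true.
    Runs over all 2**len(probs) assignments.
    """
    total = 0.0
    for xs in product([False, True], repeat=len(probs)):
        if any(all(xs[i] for i in clause) for clause in clauses):
            weight = 1.0
            for x, p in zip(xs, probs):
                weight *= p if x else 1.0 - p
            total += weight
    return total

# E = X1X2 or X1X3 or X2X3, with assumed probabilities.
E = [(0, 1), (0, 2), (1, 2)]
p1, p2, p3 = 0.5, 0.4, 0.3
brute = pr_dnf_bruteforce(E, [p1, p2, p3])
four_terms = (1 - p1) * p2 * p3 + p1 * (1 - p2) * p3 + p1 * p2 * (1 - p3) + p1 * p2 * p3
```

For three variables this is trivial, but the loop doubles with every additional variable, which is the inefficiency the slide points at.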

[Valiant:1979]

Complexity of Boolean Expression Probability


Theorem [Valiant:1979] For a boolean expression E, computing Pr(E) is #P-complete

NP = class of problems of the form "is there a witness?" (e.g. SAT)
#P = class of problems of the form "how many witnesses?" (e.g. #SAT)

The decision problem for 2CNF is in PTIME
The counting problem for 2CNF is #P-complete

Summary on Boolean Expression Probability


#P-complete
It's hard even in simple cases: 2DNF
Can approximate through Monte Carlo (MC) simulation
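A naive Monte Carlo estimator illustrates the approximation idea: sample truth assignments and count how often E is satisfied. This sketch is plain sampling, not a specific algorithm from the cited papers; the function name, sample count, and seed are assumptions.

```python
import random

def pr_dnf_monte_carlo(clauses, probs, samples=100_000, seed=0):
    """Estimate Pr(E) for a positive DNF by independent sampling.

    clauses: list of tuples of 0-based variable indices.
    probs:   probs[i] is the probability that variable i is true.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    hits = 0
    for _ in range(samples):
        xs = [rng.random() < p for p in probs]
        if any(all(xs[i] for i in clause) for clause in clauses):
            hits += 1
    return hits / samples

# E = X1X2 or X1X3 or X2X3 with assumed probabilities; the exact value is 0.35.
estimate = pr_dnf_monte_carlo([(0, 1), (0, 2), (1, 2)], [0.5, 0.4, 0.3])
```

The standard error shrinks as 1/sqrt(samples), so naive sampling needs many samples when Pr(E) is very small; that case usually calls for a more careful estimator.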

Query Complexity
Data complexity of a query Q:
  Compute Q(Ip), for probabilistic database Ip
Simplest scenario only:
  Possible tuples semantics for Q
  Independent tuples for Ip

[Fuhr&Roellke:1997,Dalvi&Suciu:2004]

Extensional Query Evaluation


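Relational ops compute probabilities

  Operator        Input tuples            Output tuple
  Selection σ     v [p]                   v [p]
  Projection Π    v [p1], v [p2], ...     v [1-(1-p1)(1-p2)...]
  Product ×       v1 [p1], v2 [p2]        v1 v2 [p1 p2]
  Difference -    v [p1], v [p2]          v [p1(1-p2)]

Unlike intensional evaluation, data complexity: PTIME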
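The extensional rules can be written as ordinary relational operators that carry a probability along with each tuple. The sketch below assumes a relation is a list of (tuple, probability) pairs with no duplicate tuples; the function names are hypothetical, and every operator treats its inputs as independent, which is exactly the assumption that can fail for some plans.

```python
from collections import defaultdict

def select(rel, pred):
    """Selection: keep tuples satisfying pred; probabilities are unchanged."""
    return [(v, p) for v, p in rel if pred(v)]

def project(rel, key):
    """Projection: merge tuples with equal key(v) as 1 - product of (1 - p)."""
    miss = defaultdict(lambda: 1.0)
    for v, p in rel:
        miss[key(v)] *= 1.0 - p
    return [(k, 1.0 - m) for k, m in miss.items()]

def join(left, right, on):
    """Join/product: concatenate matching tuples and multiply probabilities."""
    return [(v1 + v2, p1 * p2) for v1, p1 in left for v2, p2 in right if on(v1, v2)]

def difference(left, right):
    """Difference: a tuple survives with probability p1 * (1 - p2)."""
    removed = dict(right)
    return [(v, p1 * (1.0 - removed.get(v, 0.0))) for v, p1 in left]

# Tiny usage example with assumed probabilities.
merged = project([(("Jon",), 0.5), (("Jon",), 0.5)], key=lambda v: v)
```

Each operator touches every input tuple a bounded number of times, so a fixed plan runs in time polynomial in the size of the data.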

[Dalvi&Suciu:2004]

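SELECT DISTINCT x.City
FROM Person^p x, Purchase^p y
WHERE x.Name = y.Cust and y.Product = 'Gadget'

Person^p:    Jon Sea p1
Purchase^p:  Jon q1, Jon q2, Jon q3

Plan 1: join, then project on City. Wrong!
  Join output:         Jon Sea p1q1, Jon Sea p1q2, Jon Sea p1q3
  Projection output:   Sea 1-(1-p1q1)(1-p1q2)(1-p1q3)
  The three join tuples all depend on the same Person tuple, so they are not independent.

Plan 2: project Purchase on Cust, then join. Correct
  Projection output:   Jon 1-(1-q1)(1-q2)(1-q3)
  Join output:         Jon Sea p1(1-(1-q1)(1-q2)(1-q3))

Depends on plan !!!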
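The disagreement between the two plans is easy to see numerically. The following sketch evaluates both formulas and compares them with the exact answer obtained by enumerating possible worlds; the probability values and function names are assumptions made for illustration.

```python
from itertools import product

def plan_join_then_project(p1, qs):
    """Unsafe plan: join first, then project assuming the join tuples are independent."""
    miss = 1.0
    for q in qs:
        miss *= 1.0 - p1 * q
    return 1.0 - miss

def plan_project_then_join(p1, qs):
    """Safe plan: project Purchase on Cust first, then join with Person."""
    miss = 1.0
    for q in qs:
        miss *= 1.0 - q
    return p1 * (1.0 - miss)

def exact_by_worlds(p1, qs):
    """Ground truth: total weight of worlds with Jon and at least one purchase."""
    total = 0.0
    for person, *buys in product([False, True], repeat=1 + len(qs)):
        weight = p1 if person else 1.0 - p1
        for b, q in zip(buys, qs):
            weight *= q if b else 1.0 - q
        if person and any(buys):
            total += weight
    return total

# Assumed values: p1 = 0.5 and q1 = q2 = q3 = 0.5.
unsafe = plan_join_then_project(0.5, [0.5, 0.5, 0.5])
safe = plan_project_then_join(0.5, [0.5, 0.5, 0.5])
truth = exact_by_worlds(0.5, [0.5, 0.5, 0.5])
```

With these values the safe plan and the enumeration agree on 0.4375, while the unsafe plan overestimates the probability as 0.578125 because it counts Jon's existence three times.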

[Dalvi&Suciu:2004]

Query Complexity
Sometimes ∄ correct (safe) extensional plan

Qbad :- R(x), S(x,y), T(y)

Data complexity is #P-complete

Theorem: The following are equivalent
  Q has PTIME data complexity
  Q admits an extensional plan (and one finds it in PTIME)
  Q does not have Qbad as a subquery

Computing a Safe SPJ Extensional Plan


Problem is due to projection operations
  An unsafe extensional projection combines correlated tuples as if they were independent
Projection over a join that projects away at least one of the join attrs ⇒ unsafe projection!
  Intuitive: joins create correlated output tuples

Computing a Safe SPJ Extensional Plan


Algorithm for Safe Extensional SPJ Evaluation
  Apply safe projections as late as possible in the plan
  If no more safe projections exist, look for joins where all attributes are included in the output
    Recurse on the LHS, RHS of the join

Sound and complete safe SPJ evaluation algorithm
  If a safe plan exists, the algo finds it!

Summary on Query Complexity


Extensional query evaluation:
  Very popular
  Guarantees polynomial complexity
  However, the result depends on the query plan, and a correct plan does not always exist!
General query complexity
  #P-complete (not surprising, given #SAT)
  Already #P-hard for a very simple query (Qbad)

Probabilistic databases have high query complexity
