Probability Boolean Expressions

Probabilistic/Uncertain Data Management -- III
1. Dalvi, Suciu. Efficient query evaluation on probabilistic databases, VLDB 2004.
2. Das Sarma et al. Working models for uncertain data, ICDE 2006.
Slides based on the Suciu/Dalvi SIGMOD05 tutorial

What is a Probabilistic Database ?

"An item belongs to the database" is a probabilistic event
  Tuple-existence uncertainty
  Attribute-value uncertainty
"A tuple is an answer to the query" is a probabilistic event
Can be extended to all data models; we discuss only probabilistic relational data

Possible Worlds Semantics

The set of all possible database instances: INST = $\{I_1, I_2, I_3, \ldots, I_N\}$
Definition: A probabilistic database $I^p$ is a probability distribution on INST,
  $\Pr : \mathrm{INST} \to [0,1]$ s.t. $\sum_{i=1}^{N} \Pr(I_i) = 1$
Definition: A possible world is an instance $I$ s.t. $\Pr(I) > 0$

Query Semantics
Given a query $Q$ and a probabilistic database $I^p$, what is the meaning of $Q(I^p)$?
Query Semantics
Semantics 1: Possible Answers
  A probability distribution on sets of tuples:
  $\forall A.\ \Pr(Q = A) = \sum_{I \in \mathrm{INST},\ Q(I) = A} \Pr(I)$
Semantics 2: Possible Tuples
  A probability function on tuples:
  $\forall t.\ \Pr(t \in Q) = \sum_{I \in \mathrm{INST},\ t \in Q(I)} \Pr(I)$

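The two semantics can be compared on a tiny example. The sketch below is only an illustration: the three worlds, their probabilities, and the query `query_cities` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical toy probabilistic database: each possible world is a
# frozenset of (name, city) tuples paired with its probability.
WORLDS = [
    (frozenset({("John", "Seattle"), ("Sue", "Boston")}), 0.5),
    (frozenset({("John", "Seattle")}), 0.3),
    (frozenset(), 0.2),
]


def query_cities(instance):
    """Illustrative query Q: project every tuple onto its city."""
    return frozenset(city for _name, city in instance)


def possible_answers(worlds, query):
    """Semantics 1: Pr(Q = A) is the sum of Pr(I) over worlds with Q(I) = A."""
    dist = defaultdict(float)
    for instance, prob in worlds:
        dist[query(instance)] += prob
    return dict(dist)


def possible_tuples(worlds, query):
    """Semantics 2: Pr(t in Q) is the sum of Pr(I) over worlds with t in Q(I)."""
    dist = defaultdict(float)
    for instance, prob in worlds:
        for t in query(instance):
            dist[t] += prob
    return dict(dist)
```

The first function returns a distribution over whole answer sets, so its values sum to 1; the second returns one marginal probability per tuple, and those values need not sum to 1.
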
Possible Worlds Query Semantics

Possible answers semantics
  Precise
  Can be used to compose queries
  Difficult user interface
Possible tuples semantics
  Less precise, but simple; sufficient for most applications
  Cannot be used to compose queries
  Simple user interface

Possible Worlds Semantics: Summary

Complete model; clean formal semantics for SQL queries
Not very useful as a representation or implementation tool: HUGE number of possible worlds!
Need more effective representation formalisms
  Something that users can understand and explore
  Allow more efficient query execution
    Avoid possible-worlds explosion
    Perhaps giving up completeness

Representation Formalisms
Problem: need a good representation formalism
  Will be interpreted as possible worlds
  Several formalisms exist, but no winner
Main open problem in probabilistic databases

Evaluation of Formalisms
Completeness?
  What possible worlds can it represent?
  What probability distributions on worlds?
Closure?
  Is it closed under evaluation of query operators?

A Complete Formalism: Intensional Databases

Atomic event ids: $e_1, e_2, e_3, \ldots$
Probabilities: $p_1, p_2, p_3, \ldots \in [0,1]$
Event expressions built with $\wedge, \vee, \neg$, e.g. $e_3 \wedge (e_5 \vee \neg e_2)$
Intensional probabilistic database $J$: each tuple $t$ has an event attribute $t.E$

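Intensional DB ⇒ Possible Worlds

J =
  Name  Address  E
  John  Seattle  $e_1 \wedge (e_2 \vee e_3)$
  Sue   Denver   $(e_1 \wedge e_2) \vee (e_2 \wedge e_3)$

$I^p$ = (one row per answer world; assignments list the truth values of $e_1 e_2 e_3$)
  Assignments         World                       Probability
  000, 001, 010, 100  $\emptyset$                 $(1-p_1)(1-p_2)(1-p_3) + (1-p_1)(1-p_2)p_3 + (1-p_1)p_2(1-p_3) + p_1(1-p_2)(1-p_3)$
  101                 {John Seattle}              $p_1(1-p_2)p_3$
  011                 {Sue Denver}                $(1-p_1)p_2p_3$
  110, 111            {John Seattle, Sue Denver}  $p_1p_2(1-p_3) + p_1p_2p_3$
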
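The translation from an intensional database to its possible worlds can be sketched by enumerating all truth assignments of the atomic events. This is only an illustration under stated assumptions: the function name `world_distribution`, the dictionary-based encoding of events, and the numeric probabilities in `EVENT_PROBS` are hypothetical; only the two tuples and their event expressions come from the John/Sue example.

```python
from itertools import product


def world_distribution(tuples, event_probs):
    """Map an intensional database to a distribution over possible worlds.

    tuples: list of (tuple_value, event_fn), where event_fn takes a dict
            {event_id: bool} and returns whether the tuple is present.
    event_probs: dict {event_id: probability}; events are independent.
    """
    names = sorted(event_probs)
    dist = {}
    for bits in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, bits))
        # Probability of this truth assignment under independence.
        weight = 1.0
        for name in names:
            p = event_probs[name]
            weight *= p if assignment[name] else 1.0 - p
        world = frozenset(t for t, event in tuples if event(assignment))
        dist[world] = dist.get(world, 0.0) + weight
    return dist


# Event expressions from the John/Sue example.
J = [
    (("John", "Seattle"), lambda e: e["e1"] and (e["e2"] or e["e3"])),
    (("Sue", "Denver"), lambda e: (e["e1"] and e["e2"]) or (e["e2"] and e["e3"])),
]

# Assumed numeric values for p1, p2, p3 (not given in the slides).
EVENT_PROBS = {"e1": 0.5, "e2": 0.4, "e3": 0.2}
DIST = world_distribution(J, EVENT_PROBS)
```

The loop visits $2^n$ assignments for $n$ events, which is exactly the blow-up that makes this formalism impractical for large event sets.
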
Possible Worlds ⇒ Intensional DB

$I^p$ =
  World                                     Probability
  {John Seattle, John Boston, Sue Seattle}  $p_1$
  {John Seattle, Sue Seattle}               $p_2$
  {Sue Seattle}                             $p_3$
  {John Boston}                             $p_4$

Prefix code:
  $E_1 = e_1$
  $E_2 = \neg e_1 \wedge e_2$
  $E_3 = \neg e_1 \wedge \neg e_2 \wedge e_3$
  $E_4 = \neg e_1 \wedge \neg e_2 \wedge \neg e_3 \wedge e_4$

  $\Pr(e_1) = p_1$
  $\Pr(e_2) = p_2/(1-p_1)$
  $\Pr(e_3) = p_3/(1-p_1-p_2)$
  $\Pr(e_4) = p_4/(1-p_1-p_2-p_3)$

J =
  Name  Address  E
  John  Seattle  $E_1 \vee E_2$
  John  Boston   $E_1 \vee E_4$
  Sue   Seattle  $E_1 \vee E_2 \vee E_3$

Intensional DBs are complete

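The prefix-code construction can be checked numerically: choosing $\Pr(e_k) = p_k / (1 - p_1 - \cdots - p_{k-1})$ makes the event $E_k$ occur with probability exactly $p_k$. The sketch below assumes independent atomic events and strictly positive world probabilities that sum to 1; the function names and the sample values are hypothetical.

```python
def prefix_code_probs(world_probs):
    """Return Pr(e_k) = p_k / (1 - p_1 - ... - p_{k-1}) for each world k."""
    event_probs = []
    remaining = 1.0  # probability mass not yet assigned to earlier worlds
    for p in world_probs:
        event_probs.append(p / remaining)
        remaining -= p
    return event_probs


def recovered_world_probs(event_probs):
    """Return Pr(E_k) = Pr(e_k) * prod_{j<k} (1 - Pr(e_j))."""
    probs = []
    none_earlier = 1.0  # probability that e_1 .. e_{k-1} are all false
    for pe in event_probs:
        probs.append(none_earlier * pe)
        none_earlier *= 1.0 - pe
    return probs


# Assumed world probabilities p1..p4 (illustrative values only).
WORLD_PROBS = [0.1, 0.2, 0.3, 0.4]
EVENT_PROBS = prefix_code_probs(WORLD_PROBS)
RECOVERED = recovered_world_probs(EVENT_PROBS)
```

Since the events $E_k$ are mutually exclusive, a tuple annotated with a disjunction of them appears with the sum of the corresponding world probabilities.
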
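Closure Under Operators

  Operator            Input tuples                       Output tuple
  Selection $\sigma$  $v$: $E$                           $v$: $E$
  Product $\times$    $v_1$: $E_1$; $v_2$: $E_2$         $(v_1, v_2)$: $E_1 \wedge E_2$
  Projection $\Pi$    $v$: $E_1$; $v$: $E_2$; ...        $v$: $E_1 \vee E_2 \vee \cdots$
  Difference $-$      $v$: $E_1$; $v$: $E_2$             $v$: $E_1 \wedge \neg E_2$

One still needs to compute the probability of the event expression
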
Summary on Intensional Databases

Event expression for each tuple
  Possible worlds: any subset
  Probability distribution: any
Complete, but impractical
  Must evaluate the probability of long event expressions
Important abstraction: consider restrictions
Related to c-tables [Imielinski&Lipski:1984]

A Restricted Formalism: Explicit Independent Tuples

Tuple-independent probabilistic database
  TUP = $\{t_1, t_2, \ldots, t_M\}$ = all tuples
  $pr : \mathrm{TUP} \to [0,1]$ (no restrictions)
  INST = $\mathcal{P}(\mathrm{TUP})$, $N = 2^M$
  $\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$

Tuple Prob. ⇒ Possible Worlds

J =
  Name  City     pr
  John  Seattle  $p_1 = 0.8$
  Sue   Boston   $p_2 = 0.6$
  Fred  Boston   $p_3 = 0.9$

$I^p$ =
  World  Tuples                                    Probability
  $I_1$  $\emptyset$                               $(1-p_1)(1-p_2)(1-p_3)$
  $I_2$  {John Seattle}                            $p_1(1-p_2)(1-p_3)$
  $I_3$  {Sue Boston}                              $(1-p_1)p_2(1-p_3)$
  $I_4$  {Fred Boston}                             $(1-p_1)(1-p_2)p_3$
  $I_5$  {John Seattle, Sue Boston}                $p_1p_2(1-p_3)$
  $I_6$  {John Seattle, Fred Boston}               $p_1(1-p_2)p_3$
  $I_7$  {Sue Boston, Fred Boston}                 $(1-p_1)p_2p_3$
  $I_8$  {John Seattle, Sue Boston, Fred Boston}   $p_1p_2p_3$
  The eight probabilities sum to 1

E[size($I^p$)] = 2.3 tuples

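The eight worlds and the expected size of 2.3 tuples can be reproduced by direct enumeration. This is a minimal sketch using the three tuple probabilities from the example (0.8, 0.6, 0.9); the names `PR`, `world_prob`, and `all_worlds` are hypothetical.

```python
from itertools import combinations

# Tuple probabilities from the example: p1 = 0.8, p2 = 0.6, p3 = 0.9.
PR = {
    ("John", "Seattle"): 0.8,
    ("Sue", "Boston"): 0.6,
    ("Fred", "Boston"): 0.9,
}


def world_prob(world, pr):
    """Pr(I): product of pr(t) for t in I and (1 - pr(t)) for t not in I."""
    prob = 1.0
    for t, p in pr.items():
        prob *= p if t in world else 1.0 - p
    return prob


def all_worlds(pr):
    """Yield every subset of TUP as a frozenset (2^M worlds)."""
    tuples = list(pr)
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)


TOTAL = sum(world_prob(w, PR) for w in all_worlds(PR))
EXPECTED_SIZE = sum(len(w) * world_prob(w, PR) for w in all_worlds(PR))
```

By linearity of expectation the expected size equals the sum of the tuple probabilities, 0.8 + 0.6 + 0.9 = 2.3, which is what the enumeration computes the long way.
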
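Tuple-Independent DBs are Incomplete

$I^p$ =
  World                        Probability
  {John Seattle}               $p_1$
  {John Seattle, Sue Seattle}  $p_1p_2$
  $\emptyset$                  $1 - p_1 - p_1p_2$

A tuple-independent J over the same tuples cannot represent this $I^p$:
  Name  Address  pr
  John  Seattle  $p_1$
  Sue   Seattle  $p_2$
(In $I^p$, Sue never appears without John; independent tuples cannot express that correlation.)

Very limited: cannot capture correlations across tuples
Not closed: query operators can introduce complex correlations!
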
Tuple Prob. ⇒ Query Evaluation

Person
  Name  City     pr
  John  Seattle  $p_1$
  Sue   Boston   $p_2$
  Fred  Boston   $p_3$

Purchase
  Customer  Product  Date  pr
  John      Gizmo    ...   $q_1$
  John      Gadget   ...   $q_2$
  John      Gadget   ...   $q_3$
  Sue       Camera   ...   $q_4$
  Sue       Gadget   ...   $q_5$
  Sue       Gadget   ...   $q_6$
  Fred      Gadget   ...   $q_7$

SELECT DISTINCT x.city
FROM Person x, Purchase y
WHERE x.Name = y.Customer and y.Product = 'Gadget'

  Tuple    Probability
  Seattle  $p_1(1-(1-q_2)(1-q_3))$
  Boston   $1-(1-p_2(1-(1-q_5)(1-q_6)))(1-p_3q_7)$

Application: Similarity Predicates

Person
  Name  City     Profession
  John  Seattle  statistician
  Sue   Boston   musician
  Fred  Boston   physicist

Purchase
  Cust  Product  Category
  John  Gizmo    dishware
  John  Gadget   instrument
  John  Gadget   instrument
  Sue   Camera   musicware
  Sue   Gadget   microphone
  Sue   Gadget   instrument
  Fred  Gadget   microphone

SELECT DISTINCT x.city
FROM Person x, Purchase y
WHERE x.Name = y.Cust and y.Product = 'Gadget'
  and x.profession ~ 'scientist' and y.category ~ 'music'

Step 1: evaluate ~ predicates

Application: Similarity Predicates

Step 1: evaluate ~ predicates

Person^p
  Name  City     Profession    pr
  John  Seattle  statistician  $p_1 = 0.8$
  Sue   Boston   musician      $p_2 = 0.2$
  Fred  Boston   physicist     $p_3 = 0.9$

Purchase^p
  Cust  Product  Category    pr
  John  Gizmo    dishware    $q_1 = 0.2$
  John  Gadget   instrument  $q_2 = 0.6$
  John  Gadget   instrument  $q_3 = 0.6$
  Sue   Camera   musicware   $q_4 = 0.9$
  Sue   Gadget   microphone  $q_5 = 0.7$
  Sue   Gadget   instrument  $q_6 = 0.6$
  Fred  Gadget   microphone  $q_7 = 0.7$

SELECT DISTINCT x.city
FROM Person^p x, Purchase^p y
WHERE x.Name = y.Cust and y.Product = 'Gadget'
  and x.profession ~ 'scientist' and y.category ~ 'music'

Step 2: evaluate rest of query

  Tuple    Probability
  Seattle  $p_1(1-(1-q_2)(1-q_3))$
  Boston   $1-(1-p_2(1-(1-q_5)(1-q_6)))(1-p_3q_7)$

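The two answer probabilities can be cross-checked against the possible-worlds definition by enumerating all $2^{10}$ worlds of the two tables. The sketch below uses the probabilities from the example and assumes that rows with identical visible values (for instance the two John/Gadget purchases) are distinct, independent tuples; the names `PERSON`, `PURCHASE`, and `answer_probs` are hypothetical.

```python
from itertools import product

# (name, city, probability) after step 1.
PERSON = [("John", "Seattle", 0.8), ("Sue", "Boston", 0.2), ("Fred", "Boston", 0.9)]

# (customer, product, probability) after step 1; duplicates are separate tuples.
PURCHASE = [
    ("John", "Gizmo", 0.2), ("John", "Gadget", 0.6), ("John", "Gadget", 0.6),
    ("Sue", "Camera", 0.9), ("Sue", "Gadget", 0.7), ("Sue", "Gadget", 0.6),
    ("Fred", "Gadget", 0.7),
]


def answer_probs():
    """Pr(city in Q) summed over every possible world of both tables."""
    rows = PERSON + PURCHASE
    n_person = len(PERSON)
    result = {}
    for present in product([False, True], repeat=len(rows)):
        weight = 1.0
        for keep, row in zip(present, rows):
            weight *= row[2] if keep else 1.0 - row[2]
        persons = [r for keep, r in zip(present[:n_person], PERSON) if keep]
        purchases = [r for keep, r in zip(present[n_person:], PURCHASE) if keep]
        # Evaluate the join and the DISTINCT projection in this world.
        cities = {
            city
            for name, city, _ in persons
            for cust, item, _ in purchases
            if name == cust and item == "Gadget"
        }
        for city in cities:
            result[city] = result.get(city, 0.0) + weight
    return result


ANSWERS = answer_probs()
```

With these inputs the enumeration agrees with the closed-form expressions for Seattle and Boston, at the cost of visiting every world.
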
Summary on Explicit Independent Tuples

Independent tuples
  Possible worlds: subsets
  Probability distribution: restricted
  Closure: no

Query Evaluation on Probabilistic DBs

Focus on possible tuples semantics
  Compute the likelihood of individual answer tuples
Probability of Boolean expressions (needed for query processing)
Complexity of query evaluation

Probability of Boolean Expressions

$E = X_1X_3 \vee X_1X_4 \vee X_2X_5 \vee X_2X_6$
Randomly and independently make each variable true with the following probabilities:
$\Pr(X_1) = p_1, \Pr(X_2) = p_2, \ldots, \Pr(X_6) = p_6$
What is $\Pr(E)$?

Answer: re-group cleverly
$E = X_1(X_3 \vee X_4) \vee X_2(X_5 \vee X_6)$
$\Pr(E) = 1 - (1 - p_1(1-(1-p_3)(1-p_4)))(1 - p_2(1-(1-p_5)(1-p_6)))$

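The re-grouping works because, after factoring, every variable occurs exactly once, so the operands of each AND and OR are independent and their probabilities compose directly. A minimal sketch of that composition follows; the helper names `p_and`, `p_or`, and `pr_regrouped` are hypothetical.

```python
def p_and(*probs):
    """Probability that all of several independent events hold."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def p_or(*probs):
    """Probability that at least one of several independent events holds."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none


def pr_regrouped(p1, p2, p3, p4, p5, p6):
    """Pr(X1(X3 or X4) or X2(X5 or X6)), valid because no variable repeats."""
    return p_or(p_and(p1, p_or(p3, p4)), p_and(p2, p_or(p5, p6)))
```

Applying the same helpers to the unfactored form would be incorrect, since $X_1X_3$ and $X_1X_4$ share $X_1$ and are therefore not independent.
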
Now let's try this:
$E = X_1X_2 \vee X_1X_3 \vee X_2X_3$

No clever grouping seems possible. Brute force:

  X1  X2  X3  E  Pr
  0   0   0   0
  0   0   1   0
  0   1   0   0
  0   1   1   1  $(1-p_1)p_2p_3$
  1   0   0   0
  1   0   1   1  $p_1(1-p_2)p_3$
  1   1   0   1  $p_1p_2(1-p_3)$
  1   1   1   1  $p_1p_2p_3$

$\Pr(E) = (1-p_1)p_2p_3 + p_1(1-p_2)p_3 + p_1p_2(1-p_3) + p_1p_2p_3$
Seems inefficient in general

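The brute-force computation is a sum over the rows of the truth table where the expression is true. A minimal sketch, assuming independent variables; the function names `pr_expr` and `two_of_three` are hypothetical.

```python
from itertools import product


def pr_expr(expr, probs):
    """Sum the probabilities of all truth assignments that satisfy expr.

    expr: function taking a tuple of booleans, one per variable.
    probs: probability that each variable is true (independent variables).
    """
    total = 0.0
    for bits in product([False, True], repeat=len(probs)):
        if expr(bits):
            weight = 1.0
            for bit, p in zip(bits, probs):
                weight *= p if bit else 1.0 - p
            total += weight
    return total


def two_of_three(x):
    """E = X1X2 or X1X3 or X2X3."""
    x1, x2, x3 = x
    return (x1 and x2) or (x1 and x3) or (x2 and x3)
```

The loop runs over $2^n$ assignments for $n$ variables, so this approach is only usable for small expressions.
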
Complexity of Boolean Expression Probability

Theorem [Valiant:1979]: For a Boolean expression $E$, computing $\Pr(E)$ is #P-complete
NP = class of problems of the form "is there a witness?" (e.g. SAT)
#P = class of problems of the form "how many witnesses?" (e.g. #SAT)
The decision problem for 2CNF is in PTIME
The counting problem for 2CNF is #P-complete

Summary on Boolean Expression Probability

#P-complete
It's hard even in simple cases: 2DNF
Can approximate through Monte Carlo (MC) simulation

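A naive Monte Carlo estimate samples truth assignments and reports the fraction that satisfy the expression. The sketch below is the simplest such estimator, not necessarily the one the tutorial has in mind; the function names, the default sample count, and the fixed seed are assumptions for illustration.

```python
import random


def monte_carlo_pr(expr, probs, samples=100_000, seed=0):
    """Estimate Pr(expr) by sampling independent truth assignments.

    expr: function taking a tuple of booleans, one per variable.
    probs: probability that each variable is true.
    seed: fixed only so that the example is reproducible.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        bits = tuple(rng.random() < p for p in probs)
        if expr(bits):
            hits += 1
    return hits / samples


def two_of_three(x):
    """E = X1X2 or X1X3 or X2X3."""
    x1, x2, x3 = x
    return (x1 and x2) or (x1 and x3) or (x2 and x3)
```

The standard error of this estimator shrinks as $1/\sqrt{\text{samples}}$, so it is a poor fit when $\Pr(E)$ is very small and a tight relative error is needed.
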
Query Complexity
Data complexity of a query $Q$: compute $Q(I^p)$ for a probabilistic database $I^p$
Simplest scenario only:
  Possible tuples semantics for $Q$
  Independent tuples for $I^p$

[Fuhr&Roellke:1997,Dalvi&Suciu:2004]
Extensional Query Evaluation

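Relational operators compute probabilities directly:

  Operator            Input tuples                   Output tuple
  Selection $\sigma$  $v$: $p$                       $v$: $p$
  Product $\times$    $v_1$: $p_1$; $v_2$: $p_2$     $(v_1, v_2)$: $p_1p_2$
  Projection $\Pi$    $v$: $p_1$; $v$: $p_2$         $v$: $1-(1-p_1)(1-p_2)$
  Difference $-$      $v$: $p_1$; $v$: $p_2$         $v$: $p_1(1-p_2)$

Unlike intensional evaluation, data complexity: PTIME
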
[Dalvi&Suciu:2004]
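SELECT DISTINCT x.City
FROM Person^p x, Purchase^p y
WHERE x.Name = y.Cust and y.Product = 'Gadget'

Input: Person^p holds (Jon, Sea) with probability $p_1$; Purchase^p holds three matching Jon tuples with probabilities $q_1, q_2, q_3$

Plan 1 (join, then project): Wrong!
  Join:        (Jon, Sea) $p_1q_1$; (Jon, Sea) $p_1q_2$; (Jon, Sea) $p_1q_3$
  Projection:  Sea  $1-(1-p_1q_1)(1-p_1q_2)(1-p_1q_3)$

Plan 2 (project Purchase^p on Cust, then join): Correct
  Projection:  Jon  $1-(1-q_1)(1-q_2)(1-q_3)$
  Join:        (Jon, Sea)  $p_1(1-(1-q_1)(1-q_2)(1-q_3))$

Depends on plan!
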
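The plan dependence can be reproduced numerically by applying the extensional rules for join ($p_1p_2$) and projection ($1-\prod(1-p_i)$) in both orders and comparing against a possible-worlds enumeration. A minimal sketch under the assumption of independent input tuples; all function names are hypothetical.

```python
from itertools import product


def independent_join(p, q):
    """Extensional join rule: multiply the two tuple probabilities."""
    return p * q


def independent_project(probs):
    """Extensional projection rule: 1 - prod(1 - p_i), assuming independence."""
    none = 1.0
    for p in probs:
        none *= 1.0 - p
    return 1.0 - none


def join_then_project(p1, qs):
    """Plan 1: the joined tuples all share the Person tuple, so the
    independence assumption made by the projection is violated."""
    return independent_project([independent_join(p1, q) for q in qs])


def project_then_join(p1, qs):
    """Plan 2: project Purchase first (independent tuples), then join."""
    return independent_join(p1, independent_project(qs))


def exact(p1, qs):
    """Possible-worlds answer: Person present and at least one Purchase present."""
    probs = [p1] + list(qs)
    total = 0.0
    for bits in product([False, True], repeat=len(probs)):
        if bits[0] and any(bits[1:]):
            weight = 1.0
            for bit, p in zip(bits, probs):
                weight *= p if bit else 1.0 - p
            total += weight
    return total
```

For example, with $p_1 = 0.5$ and $q_1 = q_2 = q_3 = 0.5$, plan 2 and the exact enumeration give 0.4375, while plan 1 gives 0.578125.
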
[Dalvi&Suciu:2004]
Query Complexity
Sometimes there is no correct ("safe") extensional plan:
  Qbad :- R(x), S(x,y), T(y)
  Data complexity is #P-complete
Theorem: The following are equivalent:
  Q has PTIME data complexity
  Q admits an extensional plan (and one finds it in PTIME)
  Q does not have Qbad as a subquery

Computing a Safe SPJ Extensional Plan

Problem is due to projection operations
  An unsafe extensional projection combines correlated tuples while assuming independence
  A projection over a join that projects away at least one of the join attributes is an unsafe projection!
  Intuition: joins create correlated output tuples

Computing a Safe SPJ Extensional Plan

Algorithm for safe extensional SPJ evaluation
  Apply safe projections as late as possible in the plan
  If no more safe projections exist, look for joins where all attributes are included in the output
    Recurse on the LHS and RHS of the join
Sound and complete safe SPJ evaluation algorithm
  If a safe plan exists, the algorithm finds it!

Summary on Query Complexity

Extensional query evaluation:
  Very popular
  Guarantees polynomial complexity
  However, the result depends on the query plan, and a correct plan does not always exist!
General query complexity
  #P-complete (not surprising, given #SAT)
  Already #P-hard for a very simple query (Qbad)
Probabilistic databases have high query complexity
