[Diagram: examples of document collections (school documents, professional documents, cultural documents).]
[Diagram: an IR / document retrieval system receives a query, searches a document collection and returns an answer list; e.g. Google retrieving from the Web.]
Indexing-based IR

Both documents and queries are transformed into internal representations by indexing.

Main problems in IR
- Document and query indexing: how to best represent their contents?
- System evaluation:
  - Are the retrieved documents relevant? (precision)
  - Are all the relevant documents retrieved? (recall)
Document indexing

Goal: transform a document into an internal representation of its contents.

Factors to consider:
- Accuracy to represent meanings (semantics)
- Exhaustiveness (cover all the contents)
- Facility for the computer to manipulate

What is the best representation of contents?
- Character string (character trigrams): not precise enough
- Word: good coverage, not precise
- Phrase: poor coverage, more precise
- Concept: poor coverage, precise

Frequency / informativity
- Informativity is maximal for middle-frequency words
- Simple method: index documents using middle-frequency words
[Figure: informativity vs. word frequency curve, with its maximum at middle frequencies.]
tf*idf weighting
- tf (term frequency): the higher the tf, the higher the importance (weight) of the term for the doc., e.g. tf(t, D) = log[freq(t, D)]
- idf (inverse document frequency): n = #docs containing t, N = #docs in the collection; idf(t) = log(N/n), so the fewer documents contain t, the more it discriminates
- weight(t, D) = tf(t, D) * idf(t)
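A minimal sketch of this weighting on a toy collection (the collection, the whitespace tokenization and the absence of a stoplist and stemming are assumptions for illustration only):

import math
from collections import Counter

# Toy collection (assumption: lowercase whitespace tokenization, no stoplist, no stemming).
docs = ["information retrieval retrieval system",
        "database system",
        "retrieval of web documents"]

N = len(docs)
tokenized = [d.lower().split() for d in docs]
df = Counter(t for toks in tokenized for t in set(toks))   # n = #docs containing t

def tf(term, toks):
    freq = toks.count(term)
    # Slide's form: tf(t, D) = log[freq(t, D)]; many systems use 1 + log(freq)
    # instead so that a single occurrence still contributes.
    return math.log(freq) if freq > 0 else 0.0

def idf(term):
    return math.log(N / df[term]) if df[term] else 0.0     # idf(t) = log(N / n)

def weight(term, toks):
    return tf(term, toks) * idf(term)                      # weight(t, D) = tf * idf

print(weight("retrieval", tokenized[0]))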
Stopwords / stoplist
- Function words (e.g. of, in, with) do not bear useful information for IR; they are removed from the representation.

Document length normalization
- Sometimes additional normalizations are applied, e.g. dividing term weights by the document length.
Stemming

Reason: different word forms may bear similar meaning (e.g. search, searching); create a "standard" representation for them.

Stemming = removing some endings of words:
  computer
  compute
  computes
  computing     ->  comput
  computed
  computation

Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137)

Step 1: plurals and past participles
  SSES -> SS                  caresses    -> caress
  (*v*) ING ->                motoring    -> motor
Step 2: adj -> n, n -> v, n -> adj, ...
  (m>0) OUSNESS -> OUS        callousness -> callous
  (m>0) ATIONAL -> ATE        relational  -> relate
Step 3:
  (m>0) ICATE -> IC           triplicate  -> triplic
Step 4:
  (m>1) AL ->                 revival     -> reviv
  (m>1) ANCE ->               allowance   -> allow
Step 5:
  (m>1) E ->                  probate     -> probat
  (m>1 and *d and *L) -> single letter    controll -> control
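The full Porter algorithm is usually taken from an existing implementation; a quick way to reproduce the examples above, assuming the NLTK package is available (its stemmer includes some later extensions, so a few stems may differ slightly from the table):

from nltk.stem import PorterStemmer   # assumption: nltk is installed

stemmer = PorterStemmer()
for w in ["caresses", "motoring", "callousness", "relational", "triplicate",
          "revival", "allowance", "probate", "computer", "computing", "computation"]:
    print(w, "->", stemmer.stem(w))   # e.g. motoring -> motor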
Retrieval

The problem: how are document and query representations compared to compute a degree of relevance? (the IR model)

Cases:
- 1-word query: retrieve the documents containing the word and sort them in decreasing order of the weight of the word in the document.
- Multi-word query: combine the weights of the query words in each document, according to the IR model.
Problems with the strict Boolean model: a document either matches the query or not, so answers cannot be ranked; a simple extension is to sum the weights a_i of the terms t_i, where t_i is in Q.

Extension of the Boolean model (fuzzy interpretation), with a_i = weight of t_i in D:
  R(D, t_i) = µ_{t_i}(D) = a_i
  R(D, Q1 ∧ Q2) = min(R(D, Q1), R(D, Q2))
  R(D, Q1 ∨ Q2) = max(R(D, Q1), R(D, Q2))
  R(D, ¬Q1) = 1 - R(D, Q1)

Vector space model:
  Document D = <a_1, a_2, a_3, ..., a_n>, where a_i = weight of t_i in D
  Query    Q = <b_1, b_2, b_3, ..., b_n>, where b_i = weight of t_i in Q
  R(D, Q) = Sim(D, Q)
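A small sketch of the fuzzy evaluation rules, with the query encoded as a hypothetical nested tuple structure (this representation and the weights are assumptions for illustration):

# Query representation (assumption): a term string, or a tuple
# ("and", q1, q2), ("or", q1, q2), ("not", q1).
def R(doc_weights, query):
    """doc_weights: dict term -> a_i (degree of membership of D in the fuzzy set of t_i)."""
    if isinstance(query, str):
        return doc_weights.get(query, 0.0)           # R(D, t_i) = a_i
    op, *args = query
    if op == "and":
        return min(R(doc_weights, q) for q in args)  # R(D, Q1 AND Q2) = min(...)
    if op == "or":
        return max(R(doc_weights, q) for q in args)  # R(D, Q1 OR Q2) = max(...)
    if op == "not":
        return 1.0 - R(doc_weights, args[0])         # R(D, NOT Q1) = 1 - R(D, Q1)
    raise ValueError(op)

D = {"information": 0.8, "retrieval": 0.6, "database": 0.1}
print(R(D, ("and", "information", ("or", "retrieval", "database"))))  # min(0.8, max(0.6, 0.1)) = 0.6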
Matrix representation

The document space is a document-by-term matrix of weights; the query is one more row:

       t1    t2    t3   ...   tn
  D1   a11   a12   a13  ...   a1n
  D2   a21   a22   a23  ...   a2n
  ...
  Q    b1    b2    b3   ...   bn

[Figure: documents D and query Q drawn as vectors in the term vector space (axes t1, t2, ...).]

Some formulas for Sim

  Dot product:  Sim(D, Q) = Σ_i (a_i * b_i)
  Cosine:       Sim(D, Q) = Σ_i (a_i * b_i) / sqrt(Σ_i a_i^2 * Σ_i b_i^2)
  Dice:         Sim(D, Q) = 2 Σ_i (a_i * b_i) / (Σ_i a_i^2 + Σ_i b_i^2)
  Jaccard:      Sim(D, Q) = Σ_i (a_i * b_i) / (Σ_i a_i^2 + Σ_i b_i^2 - Σ_i (a_i * b_i))
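A direct transcription of the four formulas into Python, assuming D and Q are given as dense weight vectors over the same term ordering:

import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(sum(ai * ai for ai in a) * sum(bi * bi for bi in b))
    return dot(a, b) / denom if denom else 0.0

def dice(a, b):
    denom = sum(ai * ai for ai in a) + sum(bi * bi for bi in b)
    return 2 * dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    denom = sum(ai * ai for ai in a) + sum(bi * bi for bi in b) - dot(a, b)
    return dot(a, b) / denom if denom else 0.0

D = [0.5, 0.8, 0.3, 0.0]   # a_i: weights of t1..t4 in D
Q = [0.0, 1.0, 0.0, 1.0]   # b_i: weights of t1..t4 in Q
print(dot(D, Q), cosine(D, Q), dice(D, Q), jaccard(D, Q))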
Implementation of the vector space model

  Sim(D, Q) = Σ_i (a_i * b_i) / sqrt(Σ_j a_j^2 * Σ_j b_j^2)

- Use Σ_j a_j^2 and Σ_j b_j^2 to normalize the term weights in D and Q in advance
- Then only a dot product (or Jaccard) needs to be computed at retrieval time

Probabilistic model

  P(R|D) = P(D|R) * P(R) / P(D)   (P(D) and P(R) constant)   ∝   P(D|R)

  D = {t1 = x1, t2 = x2, ...},  x_i = 1 if t_i is present, 0 if absent

  P(D|R)  = ∏_{(t_i = x_i) ∈ D} P(t_i = x_i | R)  = ∏ p_i^{x_i} (1 - p_i)^{(1 - x_i)}
  P(D|NR) = ∏ P(t_i = 1 | NR)^{x_i} P(t_i = 0 | NR)^{(1 - x_i)}  = ∏ q_i^{x_i} (1 - q_i)^{(1 - x_i)}

  where p_i = P(t_i = 1 | R) and q_i = P(t_i = 1 | NR).
Prob. model (cont'd)

Given relevant and irrelevant samples, the counts for a term t_i are:

                 Rel. doc.    Irrel. doc.          Doc.
  with t_i       r_i          n_i - r_i            n_i
  without t_i    R - r_i      N - R - n_i + r_i    N - n_i
  total          R            N - R                N

For document ranking:

  Odd(D) = Σ_{t_i ∈ Q, D} x_i log[ p_i (1 - q_i) / (q_i (1 - p_i)) ] + Σ_i log[ (1 - p_i) / (1 - q_i) ]
         = Σ_{t_i ∈ Q, D} x_i w_i   (the second sum does not depend on D)

  w_i = log[ r_i (N - R - n_i + r_i) / ((R - r_i)(n_i - r_i)) ]

Smoothing (Robertson-Sparck-Jones formula): add 0.5 to each count,

  w_i = log[ (r_i + 0.5)(N - R - n_i + r_i + 0.5) / ((R - r_i + 0.5)(n_i - r_i + 0.5)) ]

When no relevance sample is available: p_i = 0.5, q_i ≈ n_i / N.

Prob. model (cont'd): BM25 (Okapi) ranking formula

  Score(D, Q) = Σ_{t_i ∈ Q, D} w_i * ((k1 + 1) tf) / (K + tf) * ((k3 + 1) qtf) / (k3 + qtf) + k2 |Q| (avdl - dl) / (avdl + dl)

  K = k1 ((1 - b) + b * dl / avdl)

  k1, k2, k3, b: parameters
  qtf: query term frequency
  dl: document length
  avdl: average document length
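A minimal sketch of the Okapi scoring above. When no relevance information is available, w_i is approximated here by a plain idf; the parameter values are common defaults, not prescribed by the slides:

import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, k2=0.0, k3=8.0, b=0.75):
    """query, doc: lists of tokens; docs: list of token lists (the collection)."""
    N = len(docs)
    avdl = sum(len(d) for d in docs) / N
    dl = len(doc)
    K = k1 * ((1 - b) + b * dl / avdl)          # K = k1((1 - b) + b*dl/avdl)
    doc_tf, q_tf = Counter(doc), Counter(query)
    score = 0.0
    for t in q_tf:
        n = sum(1 for d in docs if t in d)      # #docs containing t
        if n == 0 or t not in doc_tf:
            continue
        w = math.log(N / n)                     # simple idf as a stand-in for the RSJ weight w_i
        tf, qtf = doc_tf[t], q_tf[t]
        score += w * ((k1 + 1) * tf) / (K + tf) * ((k3 + 1) * qtf) / (k3 + qtf)
    # Optional length-correction term from the slide (k2 = 0 disables it).
    score += k2 * len(query) * (avdl - dl) / (avdl + dl)
    return score

docs = [["information", "retrieval", "system"], ["database", "system"], ["web", "retrieval"]]
print(bm25_score(["information", "retrieval"], docs[0], docs))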
An illustration of P/R calculation

[Figure: a ranked answer list (Doc1, Doc2, Doc3, Doc4, Doc5, ..., with Rel? judgments; assume 5 relevant documents in total) and the precision/recall point obtained at each relevant document, e.g. (0.2, 1.0), (0.4, 0.67), (0.6, 0.6), ...]

General form of precision/recall calculation

[Figure: precision/recall curve, recall on the x-axis (0.2-1.0), precision on the y-axis.]

- Precision changes w.r.t. recall (it is not a fixed point)
- Systems cannot be compared at a single precision/recall point
- Average precision (on 11 points of recall: 0.0, 0.1, ..., 1.0)
MAP example: two queries; query 1 has 3 relevant documents, retrieved at ranks 1, 5 and 10; query 2 has 2 relevant documents, retrieved at ranks 4 and 8.

  MAP = 1/2 * [ 1/3 * (1/1 + 2/5 + 3/10) + 1/2 * (1/4 + 2/8) ]
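The same calculation in code; ranked_rel flags whether each retrieved document is relevant and n_rel is the total number of relevant documents for the query (the two example rankings reproduce the numbers above):

def average_precision(ranked_rel, n_rel):
    """ranked_rel: list of booleans, one per retrieved document, in rank order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank          # precision at each relevant document
    return total / n_rel if n_rel else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_rel, n_rel) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

# Query 1: relevant docs at ranks 1, 5, 10 (3 relevant in total);
# Query 2: relevant docs at ranks 4, 8 (2 relevant in total).
q1 = [i in (1, 5, 10) for i in range(1, 11)]
q2 = [i in (4, 8) for i in range(1, 9)]
print(mean_average_precision([(q1, 3), (q2, 2)]))   # = 1/2 * [(1/1 + 2/5 + 3/10)/3 + (1/4 + 2/8)/2]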
Test corpus: compare different IR systems on the same collection of documents and queries.

An evaluation example (SMART output):
  Run number:    1        2
  Num_queries:   52       52
  11-pt Avg:     0.2859   0.3092   (average precision over all recall points)
The TREC experiments
- Once per year
- A set of documents and queries are distributed to the participants (the standard answers are unknown) (April)
- Participants work (very hard) to construct and fine-tune their systems, and submit the answers (1000/query) at the deadline (July)
- NIST people manually evaluate the answers and provide correct answers (and a classification of IR systems) (July - August)
- TREC conference (November)

TREC evaluation methodology
- Known document collection (>100K) and query set (50)
- Submission of 1000 documents for each query by each participant
- Merge the 100 first documents of each participant -> global pool
- Human relevance judgment of the global pool
- The other documents are assumed to be irrelevant
- Evaluation of each system (with 1000 answers)
- Partial relevance judgments, but stable for system ranking
Impact of TREC
- Provide large collections for further experiments
- Compare different systems/techniques on realistic data
- Similar experiments are organized in other areas (NLP, machine translation, summarization, ...)

Some techniques to improve IR effectiveness

Interaction with the user (relevance feedback):
- Keywords only cover part of the contents
- The user can help by indicating relevant/irrelevant documents

  Q_new = α*Q_old + β*Rel_d - γ*NRel_d

  where Rel_d  = centroid of the relevant documents
        NRel_d = centroid of the non-relevant documents
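A minimal sketch of this update, assuming queries and documents are represented as term-to-weight dicts (the α, β, γ values are illustrative defaults, not from the slides):

from collections import defaultdict

def centroid(docs):
    """docs: list of {term: weight} vectors."""
    c = defaultdict(float)
    if not docs:
        return c
    for d in docs:
        for t, w in d.items():
            c[t] += w / len(docs)
    return c

def rocchio(q_old, rel_docs, nrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    rel_c, nrel_c = centroid(rel_docs), centroid(nrel_docs)
    q_new = {}
    for t in set(q_old) | set(rel_c) | set(nrel_c):
        # Q_new = alpha*Q_old + beta*Rel_d - gamma*NRel_d
        w = alpha * q_old.get(t, 0) + beta * rel_c.get(t, 0) - gamma * nrel_c.get(t, 0)
        if w > 0:                        # negative weights are usually dropped
            q_new[t] = w
    return q_new

q = {"information": 1.0, "retrieval": 1.0}
rel = [{"retrieval": 0.8, "search": 0.5}, {"retrieval": 0.6, "indexing": 0.4}]
nrel = [{"database": 0.9}]
print(rocchio(q, rel, nrel))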
Effect of RF

[Figure: relevant (*) and irrelevant (x) documents around the original query Q; after feedback, the moved query Q_new lies closer to the relevant region R and further from the non-relevant region NR, so the 2nd retrieval improves on the 1st retrieval.]

Modified relevance feedback
- Users usually do not cooperate (e.g. AltaVista in its early years)
- Pseudo-relevance feedback (blind RF):
  - Use the top-ranked documents as if they were relevant
  - Select m terms from the n top-ranked documents
- One can usually obtain about 10% improvement
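A sketch of this blind-feedback step: treat the n top-ranked documents as relevant and add the m most frequent new terms to the query (selecting terms by raw frequency and the stoplist handling are simplifying assumptions):

from collections import Counter

def pseudo_feedback_terms(ranked_docs, query_terms, n=10, m=5, stoplist=()):
    """ranked_docs: token lists in decreasing score order (from the 1st retrieval)."""
    counts = Counter()
    for doc in ranked_docs[:n]:                  # treat the n top-ranked docs as relevant
        counts.update(doc)
    for t in set(query_terms) | set(stoplist):   # keep only new, non-stopword terms
        counts.pop(t, None)
    return [t for t, _ in counts.most_common(m)]  # m expansion terms

ranked = [["information", "retrieval", "search", "engine"],
          ["retrieval", "ranking", "search"],
          ["database", "sql"]]
print(pseudo_feedback_terms(ranked, ["information", "retrieval"], n=2, m=3))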
Query expansion using term relationships:
  Q  = information retrieval
  Q' = (information + data + knowledge + ...) (retrieval + search + seeking + ...)

Global (corpus) analysis:
- Two terms that often co-occur are related (mutual information)
- Two terms that co-occur with the same words are related (e.g. T-shirt and coat with wear, ...)

Local analysis:
- Combine pseudo-relevance feedback and term co-occurrences
- More effective than global analysis
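A toy sketch of the first criterion, computing pointwise mutual information for term pairs that co-occur within a document (the corpus and the document-level co-occurrence window are assumptions):

import math
from collections import Counter
from itertools import combinations

def pmi_pairs(docs):
    """docs: list of token lists; returns PMI of term pairs co-occurring in a document."""
    N = len(docs)
    term_df, pair_df = Counter(), Counter()
    for doc in docs:
        terms = set(doc)
        term_df.update(terms)
        pair_df.update(frozenset(p) for p in combinations(sorted(terms), 2))
    pmi = {}
    for pair, df in pair_df.items():
        t1, t2 = tuple(pair)
        # PMI(t1, t2) = log[ P(t1, t2) / (P(t1) * P(t2)) ]
        pmi[(t1, t2)] = math.log((df / N) / ((term_df[t1] / N) * (term_df[t2] / N)))
    return pmi

docs = [["wear", "t-shirt"], ["wear", "coat"], ["coat", "winter"], ["wear", "t-shirt", "jeans"]]
print(sorted(pmi_pairs(docs).items(), key=lambda kv: -kv[1])[:3])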
Related applications

IR for (semi-)structured documents:
- Using structural information to assign weights to keywords (Introduction, Conclusion, ...)
- Hierarchical indexing

PageRank in Google

[Figure: pages I1, I2, ... link to page A, which links to page B.]

  PR(A) = (1 - d) + d * Σ_i PR(I_i) / C(I_i)

  where the I_i are the pages linking to A, C(I_i) is the number of outgoing links of I_i, and d is a damping factor.
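A small sketch of iterating this formula to a fixed point, with a toy link graph (the graph and the damping factor d = 0.85 are assumptions):

def pagerank(links, d=0.85, iterations=50):
    """links: dict page -> list of pages it links to."""
    pages = set(links) | {p for out in links.values() for p in out}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # PR(A) = (1 - d) + d * sum over pages I_i linking to A of PR(I_i) / C(I_i)
            incoming = (pr[q] / len(out) for q, out in links.items() if p in out)
            new[p] = (1 - d) + d * sum(incoming)
        pr = new
    return pr

links = {"I1": ["A"], "I2": ["A", "B"], "A": ["B"], "B": ["A"]}
print(pagerank(links))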
Why is IR difficult
- Vocabulary mismatch
  - Synonymy: e.g. car vs. automobile
  - Polysemy: e.g. table
- Queries are ambiguous; they are only a partial specification of the user's need
- Content representation may be inadequate and incomplete
- The user is the ultimate judge, but we don't know how the judge judges…
  - The notion of relevance is imprecise, and context- and user-dependent