
Introduction to Information Retrieval

Jian-Yun Nie
University of Montreal
Canada

Outline
- What is the IR problem?
- How to organize an IR system? (Or the main processes in IR)
  - Indexing
  - Retrieval
  - System evaluation
- Some current research topics

The problem of IR
- Goal = find documents relevant to an information need from a large document set

Example
- Information need -> Query -> IR system (e.g. Google) -> Answer list
- The system searches a document collection (e.g. the Web)

IR problem
- First applications: in libraries (1950s)
  ISBN: 0-201-12227-8
  Author: Salton, Gerard
  Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
  Editor: Addison-Wesley
  Date: 1989
  Content: <Text>
- External attributes and internal attribute (content)
- Search by external attributes = search in a DB
- IR: search by content

Possible approaches
1. String matching (linear search in documents)
   - Slow
   - Difficult to improve
2. Indexing (*)
   - Fast
   - Flexible to further improvement
Indexing-based IR
- Documents and queries are both indexed (query analysis) to produce representations (keywords)
- Query evaluation compares the two representations to produce a ranked answer list

Main problems in IR
- Document and query indexing
  - How to best represent their contents?
- Query evaluation (or retrieval process)
  - To what extent does a document correspond to a query?
- System evaluation
  - How good is a system?
  - Are the retrieved documents relevant? (precision)
  - Are all the relevant documents retrieved? (recall)
Document indexing
- Goal = find the important meanings and create an internal representation
- Factors to consider:
  - Accuracy to represent meanings (semantics)
  - Exhaustiveness (cover all the contents)
  - Facility for the computer to manipulate
- What is the best representation of contents?
  - Character string (char trigrams): not precise enough
  - Word: good coverage, not precise
  - Phrase: poor coverage, more precise
  - Concept: poor coverage, precise
- Coverage (recall) decreases and accuracy (precision) increases along the scale:
  String -> Word -> Phrase -> Concept

Keyword selection and weighting
- How to select important keywords?
- Simple method: use middle-frequency words
- [Figure: informativity of words plotted against their frequency rank; very frequent and very rare words are the least informative]

tf*idf weighting scheme
- tf = term frequency
  - frequency of a term/keyword in a document
  - The higher the tf, the higher the importance (weight) for the doc.
- df = document frequency
  - number of documents containing the term
  - reflects the distribution of the term
- idf = inverse document frequency
  - the unevenness of the term distribution in the corpus
  - the specificity of the term to a document
  - The more evenly a term is distributed, the less specific it is to a document

  weight(t,D) = tf(t,D) * idf(t)

Some common tf*idf schemes
- tf(t,D) = freq(t,D)
- tf(t,D) = log[freq(t,D)]
- tf(t,D) = log[freq(t,D)] + 1
- tf(t,D) = freq(t,D) / Max_t'[freq(t',D)]
- idf(t) = log(N/n), where n = #docs containing t and N = #docs in the corpus
- weight(t,D) = tf(t,D) * idf(t)
- Normalization: cosine normalization, /max, …
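A minimal sketch of the tf*idf weighting above (log-tf variant with idf = log(N/n)); the toy corpus of tokenized documents is a made-up example, not from the slides.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute tf*idf weights per document: tf = log(freq)+1, idf = log(N/n)."""
    N = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        w = {t: (math.log(f) + 1) * math.log(N / df[t]) for t, f in freq.items()}
        weights.append(w)
    return weights

# Hypothetical toy corpus (already tokenized and stemmed)
docs = [["comput", "architect", "comput"], ["comput", "network"], ["network", "protocol"]]
print(tfidf_weights(docs))
```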
Document length normalization
- Sometimes, additional normalizations are applied, e.g. for document length:

  pivoted(t,D) = weight(t,D) / (1 + (slope / ((1 - slope) * pivot)) * normalized_weight(t,D))

- [Figure: probability of relevance and probability of retrieval as a function of document length; the pivot is the length at which the two curves cross, and the slope controls the correction]

Stopwords / Stoplist
- Function words do not bear useful information for IR:
  of, in, about, with, I, although, …
- Stoplist: contains stopwords, which are not used as index terms
  - Prepositions
  - Articles
  - Pronouns
  - Some adverbs and adjectives
  - Some frequent words (e.g. "document")
- The removal of stopwords usually improves IR effectiveness
- A few "standard" stoplists are commonly used.

Stemming
- Reason:
  - Different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them
- Stemming:
  - Removing some endings of words:
    computer, compute, computes, computing, computed, computation -> comput

Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137)
- Step 1: plurals and past participles
  - SSES -> SS            caresses -> caress
  - (*v*) ING ->          motoring -> motor
- Step 2: adj -> n, n -> v, n -> adj, …
  - (m>0) OUSNESS -> OUS  callousness -> callous
  - (m>0) ATIONAL -> ATE  relational -> relate
- Step 3:
  - (m>0) ICATE -> IC     triplicate -> triplic
- Step 4:
  - (m>1) AL ->           revival -> reviv
  - (m>1) ANCE ->         allowance -> allow
- Step 5:
  - (m>1) E ->            probate -> probat
  - (m>1 and *d and *L) -> single letter   controll -> control
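A quick illustration of the effect of Porter stemming, here using the implementation shipped with NLTK (assuming the nltk package is installed); outputs may differ slightly from the 1980 paper in edge cases.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["caresses", "motoring", "callousness", "relational",
         "triplicate", "revival", "allowance", "probate",
         "computer", "computes", "computing", "computation"]
for w in words:
    # stem() applies the suffix-stripping steps listed above
    print(w, "->", stemmer.stem(w))
```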

Lemmatization
- Transform words to a standard form according to their syntactic category.
  E.g. verb + ing -> verb
       noun + s -> noun
- Needs POS tagging
- More accurate than stemming, but needs more resources
- It is crucial to choose the stemming/lemmatization rules well:
  - noise vs. recognition rate
  - compromise between precision and recall:
    light/no stemming: -recall +precision
    severe stemming:   +recall -precision

Result of indexing
- Each document is represented by a set of weighted keywords (terms):
  D1 -> {(t1, w1), (t2, w2), …}
  e.g. D1 -> {(comput, 0.2), (architect, 0.3), …}
       D2 -> {(comput, 0.1), (network, 0.5), …}
- Inverted file:
  comput -> {(D1, 0.2), (D2, 0.1), …}
  The inverted file is used during retrieval for higher efficiency (see the sketch below).
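A minimal sketch of building an inverted file from the weighted-keyword representations above; the document IDs and weights are the hypothetical values from the slide.

```python
from collections import defaultdict

def build_inverted_file(doc_weights):
    """doc_weights: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    inverted = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for term, w in weights.items():
            inverted[term].append((doc_id, w))
    return inverted

docs = {"D1": {"comput": 0.2, "architect": 0.3},
        "D2": {"comput": 0.1, "network": 0.5}}
inv = build_inverted_file(docs)
print(inv["comput"])   # [('D1', 0.2), ('D2', 0.1)]
```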
Retrieval
- The problems underlying retrieval:
  - Retrieval model
    - How is a document represented with the selected keywords?
    - How are document and query representations compared to calculate a score? (IR model)
  - Implementation

Cases
- 1-word query:
  The documents to be retrieved are those that include the word
  - Retrieve the inverted list for the word
  - Sort in decreasing order of the weight of the word
- Multi-word query?
  - Combining several lists
  - How to interpret the weight?

IR models
- Matching score model
  - Document D = a set of weighted keywords
  - Query Q = a set of non-weighted keywords
  - R(D, Q) = Σ_i w(t_i, D), where t_i is in Q

Boolean model
- Document = logical conjunction of keywords
- Query = Boolean expression of keywords
- R(D, Q) = D -> Q
  e.g. D = t1 ∧ t2 ∧ … ∧ tn
       Q = (t1 ∧ t2) ∨ (t3 ∧ ¬t4)
       D -> Q, thus R(D, Q) = 1.
- Problems:
  - R is either 1 or 0 (unordered set of documents)
  - Many documents or few documents may be returned
  - End users cannot manipulate Boolean operators correctly
    E.g. documents about kangaroos and koalas
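A small sketch of Boolean retrieval over the conjunctive representation above; the documents are hypothetical term sets and the query (t1 ∧ t2) ∨ (t3 ∧ ¬t4) is hard-coded.

```python
# Each document is the set of its keywords (a logical conjunction of terms).
docs = {
    "D1": {"t1", "t2", "t5"},
    "D2": {"t3"},
    "D3": {"t3", "t4"},
}

def matches(terms):
    """Evaluate the Boolean query (t1 AND t2) OR (t3 AND NOT t4) on one document."""
    return ("t1" in terms and "t2" in terms) or ("t3" in terms and "t4" not in terms)

# The answer is an unordered set: R is either 1 or 0 for each document.
print([d for d, terms in docs.items() if matches(terms)])  # ['D1', 'D2']
```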

Extensions to Boolean model (for document ordering)
- D = {…, (t_i, w_i), …}: weighted keywords
- Interpretation:
  - D is a member of class t_i to degree w_i
  - In terms of fuzzy sets: μ_ti(D) = w_i
- A possible evaluation:
  R(D, t_i) = μ_ti(D)
  R(D, Q1 ∧ Q2) = min(R(D, Q1), R(D, Q2))
  R(D, Q1 ∨ Q2) = max(R(D, Q1), R(D, Q2))
  R(D, ¬Q1) = 1 - R(D, Q1)

Vector space model
- Vector space = all the keywords encountered: <t1, t2, t3, …, tn>
- Document: D = <a1, a2, a3, …, an>, where a_i = weight of t_i in D
- Query:    Q = <b1, b2, b3, …, bn>, where b_i = weight of t_i in Q
- R(D, Q) = Sim(D, Q)
Matrix representation
- Document space: each row is a document vector, each column a term

        t1   t2   t3   …  tn
  D1    a11  a12  a13  …  a1n
  D2    a21  a22  a23  …  a2n
  D3    a31  a32  a33  …  a3n
  …
  Dm    am1  am2  am3  …  amn
  Q     b1   b2   b3   …  bn

- [Figure: documents and the query as vectors in term space]

Some formulas for Sim
- Dot product:  Sim(D, Q) = Σ_i (a_i * b_i)
- Cosine:       Sim(D, Q) = Σ_i (a_i * b_i) / sqrt(Σ_i a_i^2 * Σ_i b_i^2)
- Dice:         Sim(D, Q) = 2 Σ_i (a_i * b_i) / (Σ_i a_i^2 + Σ_i b_i^2)
- Jaccard:      Sim(D, Q) = Σ_i (a_i * b_i) / (Σ_i a_i^2 + Σ_i b_i^2 - Σ_i (a_i * b_i))
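A compact sketch of the four similarity measures above on dense vectors; the document and query vectors are made-up values for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def dice(a, b):
    return 2 * dot(a, b) / (sum(x * x for x in a) + sum(y * y for y in b))

def jaccard(a, b):
    num = dot(a, b)
    return num / (sum(x * x for x in a) + sum(y * y for y in b) - num)

D = [0.2, 0.0, 0.3, 0.5]   # hypothetical document vector
Q = [0.1, 0.0, 0.0, 0.4]   # hypothetical query vector
print(dot(D, Q), cosine(D, Q), dice(D, Q), jaccard(D, Q))
```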

Implementation (space)
- The matrix is very sparse: a few hundred terms per document, and a few terms per query, while the term space is large (~100k)
- Stored as:
  D1 -> {(t1, a1), (t2, a2), …}
  t1 -> {(D1, a1), …}

Implementation (time)
- The implementation of VSM with dot product:
  - Naïve implementation: O(m*n)
  - Implementation using an inverted file:
    Given a query = {(t1, b1), (t2, b2)}:
    1. find the sets of related documents through the inverted file for t1 and t2
    2. calculate the score of each document for each weighted query term:
       (t1, b1) -> {(D1, a1*b1), …}
    3. combine the sets and sum the weights (Σ)
  - O(|Q|*n)
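A minimal sketch of steps 1-3 above: dot-product scoring driven by the inverted file built earlier, so only documents containing at least one query term are touched; the query weights are assumptions.

```python
from collections import defaultdict

def retrieve(query, inverted):
    """query: {term: b}; inverted: {term: [(doc_id, a), ...]} -> ranked [(doc_id, score)]."""
    scores = defaultdict(float)
    for term, b in query.items():                 # step 1: fetch the posting list of each query term
        for doc_id, a in inverted.get(term, []):  # step 2: partial score a*b for each posting
            scores[doc_id] += a * b               # step 3: sum the partial scores per document
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

inverted = {"comput": [("D1", 0.2), ("D2", 0.1)], "network": [("D2", 0.5)]}
print(retrieve({"comput": 1.0, "network": 0.5}, inverted))
```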

Other similarities
- Cosine:

  Sim(D, Q) = Σ_i (a_i * b_i) / (sqrt(Σ_j a_j^2) * sqrt(Σ_j b_j^2))
            = Σ_i [a_i / sqrt(Σ_j a_j^2)] * [b_i / sqrt(Σ_j b_j^2)]

  - Use sqrt(Σ_j a_j^2) and sqrt(Σ_j b_j^2) to normalize the weights after indexing
  - Cosine then reduces to a dot product
  - (Similar operations do not apply to Dice and Jaccard)

Probabilistic model
- Given D, estimate P(R|D) and P(NR|D)
- P(R|D) = P(D|R)*P(R)/P(D)   (P(D), P(R) constant)
         ∝ P(D|R)
- D = {t1 = x1, t2 = x2, …}, where x_i = 1 if t_i is present, 0 if absent

  P(D|R)  = Π_{(ti=xi)∈D} P(t_i = x_i | R)
          = Π_ti P(t_i = 1 | R)^x_i * P(t_i = 0 | R)^(1-x_i)
          = Π_ti p_i^x_i * (1 - p_i)^(1-x_i)

  P(D|NR) = Π_ti P(t_i = 1 | NR)^x_i * P(t_i = 0 | NR)^(1-x_i)
          = Π_ti q_i^x_i * (1 - q_i)^(1-x_i)
Prob. model (cont'd)
- For document ranking:

  Odd(D) = log [P(D|R) / P(D|NR)]
         = log [Π_ti p_i^x_i (1 - p_i)^(1-x_i) / Π_ti q_i^x_i (1 - q_i)^(1-x_i)]
         = Σ_ti x_i log [p_i (1 - q_i) / (q_i (1 - p_i))] + Σ_ti log [(1 - p_i) / (1 - q_i)]
         ∝ Σ_ti x_i log [p_i (1 - q_i) / (q_i (1 - p_i))]

Prob. model (cont'd)
- How to estimate p_i and q_i?
- Use a set of relevant and irrelevant samples:

               Rel. doc.   Irrel. doc.           Doc.
  with t_i     r_i         n_i - r_i             n_i
  without t_i  R_i - r_i   N - R_i - n_i + r_i   N - n_i
  total        R_i         N - R_i               N (samples)

  p_i = r_i / R_i        q_i = (n_i - r_i) / (N - R_i)

Prob. model (cont'd)
- Substituting the estimates:

  Odd(D) = Σ_ti x_i log [p_i (1 - q_i) / (q_i (1 - p_i))]
         = Σ_ti x_i log [r_i (N - R_i - n_i + r_i) / ((R_i - r_i)(n_i - r_i))]

- Smoothing (Robertson-Sparck-Jones formula):

  Odd(D) = Σ_{ti∈D} x_i log [(r_i + 0.5)(N - R_i - n_i + r_i + 0.5) / ((R_i - r_i + 0.5)(n_i - r_i + 0.5))]
         = Σ_{ti∈D} w_i

- When no sample is available:
  p_i = 0.5,  q_i = (n_i + 0.5)/(N + 0.5) ≈ n_i / N
- May be implemented as a VSM

BM25

  Score(D, Q) = Σ_{t∈Q} w * [(k1 + 1) tf / (K + tf)] * [(k3 + 1) qtf / (k3 + qtf)]
                + k2 * |Q| * (avdl - dl) / (avdl + dl)

  K = k1 * ((1 - b) + b * dl / avdl)

- k1, k2, k3, b: parameters
- qtf: query term frequency
- dl: document length
- avdl: average document length
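A minimal sketch of the BM25 score above, omitting the query-length correction term (k2) and using the sample-free RSJ-style weight as w; the parameter values and toy statistics are assumptions, not from the slides.

```python
import math

def bm25_score(query_tf, doc_tf, dl, avdl, N, df, k1=1.2, b=0.75, k3=8.0):
    """query_tf/doc_tf: {term: freq}; df: {term: doc frequency}; N: #docs in corpus."""
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        w = math.log((N - df[t] + 0.5) / (df[t] + 0.5))     # RSJ weight without relevance samples
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score

# Hypothetical statistics
print(bm25_score({"comput": 1}, {"comput": 3, "network": 1},
                 dl=4, avdl=5, N=1000, df={"comput": 50}))
```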

(Classic) Presentation of results
- The query evaluation result is a list of documents, sorted by their similarity to the query.
- E.g.
  doc1  0.67
  doc2  0.65
  doc3  0.54
  …

System evaluation
- Efficiency: time, space
- Effectiveness:
  - How capable is a system of retrieving relevant documents?
  - Is a system better than another one?
- Metrics often used (together):
  - Precision = retrieved relevant docs / retrieved docs
  - Recall = retrieved relevant docs / relevant docs
  [Figure: the relevant and retrieved sets; their intersection is the "retrieved relevant" set]
General form of precision/recall calculation
- Example ranked list (assume 5 relevant docs in total):
  Doc1  relevant  -> (recall 0.2, precision 1.0)
  Doc2            -> (recall 0.2, precision 0.5)
  Doc3  relevant  -> (recall 0.4, precision 0.67)
  Doc4  relevant  -> (recall 0.6, precision 0.75)
  Doc5            -> (recall 0.6, precision 0.6)
  …
- Precision changes w.r.t. recall (not a fixed point)
- Systems cannot be compared at a single precision/recall point
- Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)

An illustration of P/R
- [Figure: the points above plotted on a precision-recall curve: (0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6)]
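A short sketch that reproduces the precision/recall points of the example above from a ranked list of relevance judgments.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """ranked_relevance: list of booleans, one per retrieved doc in rank order."""
    points, rel_so_far = [], 0
    for i, is_rel in enumerate(ranked_relevance, start=1):
        rel_so_far += is_rel
        points.append((rel_so_far / total_relevant, rel_so_far / i))  # (recall, precision)
    return points

# Doc1, Doc3 and Doc4 are relevant; 5 relevant docs exist in total
print(precision_recall_points([True, False, True, True, False], total_relevant=5))
# approx. [(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6)]
```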

MAP (Mean Average Precision)

  MAP = (1/n) Σ_{Qi} (1/|Ri|) Σ_{Dj∈Ri} (j / r_ij)

- r_ij = rank of the j-th relevant document for Q_i
- |Ri| = number of relevant documents for Q_i
- n = number of test queries
- E.g. two queries, with relevant documents found at ranks:
  Query 1: 1, 5, 10  (1st, 2nd, 3rd relevant doc)
  Query 2: 4, 8      (1st, 2nd relevant doc)

  MAP = (1/2) [ (1/3)(1/1 + 2/5 + 3/10) + (1/2)(1/4 + 2/8) ]

Some other measures
- Noise = retrieved irrelevant docs / retrieved docs
- Silence = non-retrieved relevant docs / relevant docs
  - Noise = 1 – Precision; Silence = 1 – Recall
- Fallout = retrieved irrelevant docs / irrelevant docs
- Single value measures:
  - F-measure = 2 P * R / (P + R)
  - Average precision = average at 11 points of recall
  - Precision at n documents (often used for Web IR)
  - Expected search length (no. of irrelevant documents to read before obtaining n relevant docs)
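A small sketch of the MAP formula above, assuming all relevant documents appear in each ranked list; the ranks reproduce the two-query example.

```python
def mean_average_precision(relevant_ranks_per_query):
    """relevant_ranks_per_query: list of lists; each inner list holds the ranks
    of the relevant documents retrieved for one query, in increasing order."""
    ap_values = []
    for ranks in relevant_ranks_per_query:
        # the j-th relevant document found at rank r_ij contributes precision j / r_ij
        ap = sum(j / r for j, r in enumerate(ranks, start=1)) / len(ranks)
        ap_values.append(ap)
    return sum(ap_values) / len(ap_values)

print(mean_average_precision([[1, 5, 10], [4, 8]]))  # (0.567 + 0.25) / 2 ≈ 0.41
```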

Test corpus (SMART)
- Compare different IR systems on the same test corpus
- A test corpus contains:
  - A set of documents
  - A set of queries
  - Relevance judgments for every document-query pair (the desired answers for each query)
- The results of a system are compared with the desired answers.

An evaluation example (SMART output, two runs)

  Run number:                        1         2
  Num_queries:                       52        52
  Total number of documents over all queries
    Retrieved:                       780       780
    Relevant:                        796       796
    Rel_ret:                         246       229
  Recall - Precision Averages:
    at 0.00                          0.7695    0.7894
    at 0.10                          0.6618    0.6449
    at 0.20                          0.5019    0.5090
    at 0.30                          0.3745    0.3702
    at 0.40                          0.2249    0.3070
    at 0.50                          0.1797    0.2104
    at 0.60                          0.1143    0.1654
    at 0.70                          0.0891    0.1144
    at 0.80                          0.0891    0.1096
    at 0.90                          0.0699    0.0904
    at 1.00                          0.0699    0.0904
  Average precision for all points
    11-pt Avg:                       0.2859    0.3092
    % Change:                                  8.2
  Recall:
    Exact:                           0.4139    0.4166
    at 5 docs:                       0.2373    0.2726
    at 10 docs:                      0.3254    0.3572
    at 15 docs:                      0.4139    0.4166
    at 30 docs:                      0.4139    0.4166
  Precision:
    Exact:                           0.3154    0.2936
    At 5 docs:                       0.4308    0.4192
    At 10 docs:                      0.3538    0.3327
    At 15 docs:                      0.3154    0.2936
    At 30 docs:                      0.1577    0.1468
The TREC experiments
- Once per year
- A set of documents and queries is distributed to the participants (the standard answers are unknown) (April)
- Participants work (very hard) to construct and fine-tune their systems, and submit the answers (1000/query) at the deadline (July)
- NIST people manually evaluate the answers and provide correct answers (and a classification of IR systems) (July – August)
- TREC conference (November)

TREC evaluation methodology
- Known document collection (>100K) and query set (50)
- Submission of 1000 documents for each query by each participant
- Merge the first 100 documents of each participant -> global pool
- Human relevance judgment of the global pool
- The other documents are assumed to be irrelevant
- Evaluation of each system (with 1000 answers)
- Partial relevance judgments
  - But stable for system ranking

Tracks (tasks)
- Ad hoc track: given document collection, different topics
- Routing (filtering): stable interests (user profile), incoming document flow
- CLIR: ad hoc, but with queries in a different language
- Web: a large set of Web pages
- Question-Answering: When did Nixon visit China?
- Interactive: put users into action with the system
- Spoken document retrieval
- Image and video retrieval
- Information tracking: new topic / follow-up

CLEF and NTCIR
- CLEF = Cross-Language Evaluation Forum
  - for European languages
  - organized by Europeans
  - one cycle per year (March – Oct.)
- NTCIR:
  - organized by NII (Japan)
  - for Asian languages
  - cycle of 1.5 years

Impact of TREC
- Provide large collections for further experiments
- Compare different systems/techniques on realistic data
- Develop new methodologies for system evaluation
- Similar experiments are organized in other areas (NLP, machine translation, summarization, …)

Some techniques to improve IR effectiveness
- Interaction with the user (relevance feedback)
  - Keywords only cover part of the contents
  - The user can help by indicating relevant/irrelevant documents
- The use of relevance feedback to improve the query expression:

  Qnew = α*Qold + β*Rel_d - γ*Nrel_d

  where Rel_d = centroid of relevant documents
        Nrel_d = centroid of non-relevant documents

  (a sketch follows below)
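A minimal sketch of this feedback formula (commonly associated with Rocchio) over sparse term-weight dictionaries; the α, β, γ values and the example vectors are assumptions.

```python
from collections import defaultdict

def centroid(doc_vectors):
    """Average a non-empty list of sparse {term: weight} vectors."""
    acc = defaultdict(float)
    for vec in doc_vectors:
        for t, w in vec.items():
            acc[t] += w / len(doc_vectors)
    return dict(acc)

def feedback(q_old, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Qnew = alpha*Qold + beta*centroid(rel) - gamma*centroid(nonrel)."""
    rel_c, nrel_c = centroid(rel_docs), centroid(nonrel_docs)
    terms = set(q_old) | set(rel_c) | set(nrel_c)
    q_new = {t: alpha * q_old.get(t, 0) + beta * rel_c.get(t, 0) - gamma * nrel_c.get(t, 0)
             for t in terms}
    return {t: w for t, w in q_new.items() if w > 0}   # keep only positive weights

q = {"information": 1.0, "retrieval": 1.0}
rel = [{"retrieval": 0.8, "search": 0.6}]
nonrel = [{"furniture": 0.9}]
print(feedback(q, rel, nonrel))
```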
Effect of RF
- [Figure: document space with relevant (R) and non-relevant (NR) documents; the 2nd retrieval with Qnew covers more relevant documents than the 1st retrieval with Q]

Modified relevance feedback
- Users usually do not cooperate (e.g. AltaVista in its early years)
- Pseudo-relevance feedback (blind RF)
  - Use the top-ranked documents as if they were relevant:
    - Select m terms from the n top-ranked documents
  - One can usually obtain about 10% improvement

Query expansion
- A query contains only part of the important words
- Add new (related) terms into the query
  - Manually constructed knowledge base/thesaurus (e.g. WordNet)
    - Q = information retrieval
    - Q' = (information + data + knowledge + …) (retrieval + search + seeking + …)
  - Corpus analysis:
    - Two terms that often co-occur are related (mutual information)
    - Two terms that co-occur with the same words are related (e.g. T-shirt and coat with wear, …)

Global vs. local context analysis
- Global analysis: use the whole document collection to calculate term relationships
- Local analysis: use the query to retrieve a subset of documents, then calculate term relationships
  - Combine pseudo-relevance feedback and term co-occurrences
  - More effective than global analysis

Some current research topics: Go beyond keywords
- Keywords are not perfect representatives of concepts
  - Ambiguity: table = data structure or furniture?
  - Lack of precision: "operating", "system" less precise than "operating_system"
- Suggested solutions
  - Sense disambiguation (difficult due to the lack of contextual information)
  - Using compound terms (no complete dictionary of compound terms, variation in form)
  - Using noun phrases (syntactic patterns + statistics)
- Still a long way to go

Theory …
- Bayesian networks
  - P(Q|D)
  - [Figure: a network with document nodes D1 … Dm, term nodes t1 … tn, and concept nodes c1 … cl; inference over the network, revision of Q]
- Language models
Logical models
- How to describe the relevance relation as a logical relation?
  D => Q
- What are the properties of this relation?
- How to combine uncertainty with a logical framework?
- The problem: what is relevance?

Related applications: Information filtering
- IR: changing queries on a stable document collection
- IF: incoming document flow with stable interests (queries)
  - yes/no decision (instead of ordering documents)
  - Advantage: the description of the user's interest may be improved using relevance feedback (the user is more willing to cooperate)
  - Difficulty: adjust the threshold to keep/ignore a document
- The basic techniques used for IF are the same as those for IR – "two sides of the same coin"
  [Figure: a document stream (… doc3, doc2, doc1) enters the IF system, which keeps or ignores each document according to the user profile]

IR for (semi-)structured documents
- Use structural information to assign weights to keywords (Introduction, Conclusion, …)
- Hierarchical indexing
- Querying within some structure (search in title, etc.)
- INEX experiments
- Using hyperlinks in indexing and retrieval (e.g. Google)
- …

PageRank in Google

  PR(A) = (1 - d) + d * Σ_i PR(I_i) / C(I_i)

  where the I_i are the pages linking to A and C(I_i) is the number of outgoing links of I_i

- Assign a numeric value to each page
- The more a page is referred to by important pages, the more this page is important
- d: damping factor (0.85)
- Many other criteria are used as well: e.g. proximity of query words
  - "… information retrieval …" better than "… information … retrieval …"
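A minimal sketch of iterating the PR formula above on a tiny hypothetical link graph; the damping factor follows the slide (0.85), and the (1 - d) form is used as written, without normalizing by the number of pages.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}. Iterates PR(A) = (1-d) + d * sum(PR(I)/C(I))."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {p: (1 - d) for p in pages}
        for src, targets in links.items():
            if targets:
                share = d * pr[src] / len(targets)   # PR(I)/C(I), scaled by d
                for t in targets:
                    new_pr[t] += share
        pr = new_pr
    return pr

# Hypothetical link graph: I1 and I2 both point to A; A points to B
print(pagerank({"I1": ["A"], "I2": ["A"], "A": ["B"]}))
```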

IR on the Web
- No stable document collection (spider, crawler)
- Invalid documents, duplication, etc.
- Huge number of documents (partial collection)
- Multimedia documents
- Great variation of document quality
- Multilingual problem
- …

Final remarks on IR
- IR is related to many areas:
  - NLP, AI, databases, machine learning, user modeling…
  - library, Web, multimedia search, …
- Relatively weak theories
- Very strong tradition of experiments
- Many remaining (and exciting) problems
- Difficult area: intuitive methods do not necessarily improve effectiveness in practice
Why is IR difficult
- Vocabulary mismatch
  - Synonymy: e.g. car vs. automobile
  - Polysemy: table
- Queries are ambiguous; they are only a partial specification of the user's need
- Content representation may be inadequate and incomplete
- The user is the ultimate judge, but we don't know how the judge judges…
  - The notion of relevance is imprecise, context- and user-dependent
- But how rewarding it is to gain a 10% improvement!
