
Introduction to Information Retrieval

Jian-Yun Nie
University of Montreal
Canada

Outline
- What is the IR problem?
- How to organize an IR system? (Or the main processes in IR)
  - Indexing
  - Retrieval
  - System evaluation
- Some current research topics

The problem of IR
- Goal = find documents relevant to an information need from a large document set

Example
- Information need -> Query -> IR system (e.g. Google) -> Answer list
- The system searches a document collection (e.g. the Web)

IR problem
- First applications: in libraries (1950s)
  ISBN: 0-201-12227-8
  Author: Salton, Gerard
  Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
  Editor: Addison-Wesley
  Date: 1989
  Content: <Text>
- External attributes and internal attribute (content)
- Search by external attributes = search in a DB
- IR: search by content

Possible approaches
1. String matching (linear search in documents)
   - Slow
   - Difficult to improve
2. Indexing (*)
   - Fast
   - Flexible to further improvement
Indexing-based IR
- Documents and queries are both indexed (query analysis) to produce representations (keywords)
- Query evaluation compares the two representations to produce a ranked answer list

Main problems in IR
- Document and query indexing
  - How to best represent their contents?
- Query evaluation (or retrieval process)
  - To what extent does a document correspond to a query?
- System evaluation
  - How good is a system?
  - Are the retrieved documents relevant? (precision)
  - Are all the relevant documents retrieved? (recall)
Document indexing
- Goal = find the important meanings and create an internal representation
- Factors to consider:
  - Accuracy to represent meanings (semantics)
  - Exhaustiveness (cover all the contents)
  - Facility for the computer to manipulate
- What is the best representation of contents?
  - Character string (char trigrams): not precise enough
  - Word: good coverage, not precise
  - Phrase: poor coverage, more precise
  - Concept: poor coverage, precise
- Coverage (recall) decreases and accuracy (precision) increases along the scale:
  String -> Word -> Phrase -> Concept

Keyword selection and weighting
- How to select important keywords?
- Simple method: use middle-frequency words
- [Figure: informativity of words plotted against their frequency rank; very frequent and very rare words are the least informative]

tf*idf weighting scheme
- tf = term frequency
  - frequency of a term/keyword in a document
  - The higher the tf, the higher the importance (weight) for the doc.
- df = document frequency
  - number of documents containing the term
  - reflects the distribution of the term
- idf = inverse document frequency
  - the unevenness of the term distribution in the corpus
  - the specificity of the term to a document
  - The more evenly a term is distributed, the less specific it is to a document

  weight(t,D) = tf(t,D) * idf(t)

Some common tf*idf schemes
- tf(t,D) = freq(t,D)
- tf(t,D) = log[freq(t,D)]
- tf(t,D) = log[freq(t,D)] + 1
- tf(t,D) = freq(t,D) / Max_t'[freq(t',D)]
- idf(t) = log(N/n), where n = #docs containing t and N = #docs in the corpus
- weight(t,D) = tf(t,D) * idf(t)
- Normalization: cosine normalization, /max, …
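A minimal sketch of the tf*idf weighting above (log-tf variant with idf = log(N/n)); the toy corpus of tokenized documents is a made-up example, not from the slides.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute tf*idf weights per document: tf = log(freq)+1, idf = log(N/n)."""
    N = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        w = {t: (math.log(f) + 1) * math.log(N / df[t]) for t, f in freq.items()}
        weights.append(w)
    return weights

# Hypothetical toy corpus (already tokenized and stemmed)
docs = [["comput", "architect", "comput"], ["comput", "network"], ["network", "protocol"]]
print(tfidf_weights(docs))
```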
Document length normalization
- Sometimes, additional normalizations are applied, e.g. for document length:

  pivoted(t,D) = weight(t,D) / (1 + (slope / ((1 - slope) * pivot)) * normalized_weight(t,D))

- [Figure: probability of relevance and probability of retrieval as a function of document length; the pivot is the length at which the two curves cross, and the slope controls the correction]

Stopwords / Stoplist
- Function words do not bear useful information for IR:
  of, in, about, with, I, although, …
- Stoplist: contains stopwords, which are not used as index terms
  - Prepositions
  - Articles
  - Pronouns
  - Some adverbs and adjectives
  - Some frequent words (e.g. "document")
- The removal of stopwords usually improves IR effectiveness
- A few "standard" stoplists are commonly used.

Stemming
- Reason:
  - Different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them
- Stemming:
  - Removing some endings of words:
    computer, compute, computes, computing, computed, computation -> comput

Porter algorithm
(Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3):130-137)
- Step 1: plurals and past participles
  - SSES -> SS            caresses -> caress
  - (*v*) ING ->          motoring -> motor
- Step 2: adj -> n, n -> v, n -> adj, …
  - (m>0) OUSNESS -> OUS  callousness -> callous
  - (m>0) ATIONAL -> ATE  relational -> relate
- Step 3:
  - (m>0) ICATE -> IC     triplicate -> triplic
- Step 4:
  - (m>1) AL ->           revival -> reviv
  - (m>1) ANCE ->         allowance -> allow
- Step 5:
  - (m>1) E ->            probate -> probat
  - (m>1 and *d and *L) -> single letter   controll -> control
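A quick illustration of the effect of Porter stemming, here using the implementation shipped with NLTK (assuming the nltk package is installed); outputs may differ slightly from the 1980 paper in edge cases.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["caresses", "motoring", "callousness", "relational",
         "triplicate", "revival", "allowance", "probate",
         "computer", "computes", "computing", "computation"]
for w in words:
    # stem() applies the suffix-stripping steps listed above
    print(w, "->", stemmer.stem(w))
```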

Lemmatization
- Transform words to a standard form according to their syntactic category.
  E.g. verb + ing -> verb
       noun + s -> noun
- Needs POS tagging
- More accurate than stemming, but needs more resources
- It is crucial to choose the stemming/lemmatization rules well:
  - noise vs. recognition rate
  - compromise between precision and recall:
    light/no stemming: -recall +precision
    severe stemming:   +recall -precision

Result of indexing
- Each document is represented by a set of weighted keywords (terms):
  D1 -> {(t1, w1), (t2, w2), …}
  e.g. D1 -> {(comput, 0.2), (architect, 0.3), …}
       D2 -> {(comput, 0.1), (network, 0.5), …}
- Inverted file:
  comput -> {(D1, 0.2), (D2, 0.1), …}
  The inverted file is used during retrieval for higher efficiency (see the sketch below).
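A minimal sketch of building an inverted file from the weighted-keyword representations above; the document IDs and weights are the hypothetical values from the slide.

```python
from collections import defaultdict

def build_inverted_file(doc_weights):
    """doc_weights: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    inverted = defaultdict(list)
    for doc_id, weights in doc_weights.items():
        for term, w in weights.items():
            inverted[term].append((doc_id, w))
    return inverted

docs = {"D1": {"comput": 0.2, "architect": 0.3},
        "D2": {"comput": 0.1, "network": 0.5}}
inv = build_inverted_file(docs)
print(inv["comput"])   # [('D1', 0.2), ('D2', 0.1)]
```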
Retrieval
- The problems underlying retrieval:
  - Retrieval model
    - How is a document represented with the selected keywords?
    - How are document and query representations compared to calculate a score? (IR model)
  - Implementation

Cases
- 1-word query:
  The documents to be retrieved are those that include the word
  - Retrieve the inverted list for the word
  - Sort in decreasing order of the weight of the word
- Multi-word query?
  - Combining several lists
  - How to interpret the weight?

IR models
- Matching score model
  - Document D = a set of weighted keywords
  - Query Q = a set of non-weighted keywords
  - R(D, Q) = Σ_i w(t_i, D), where t_i is in Q

Boolean model
- Document = logical conjunction of keywords
- Query = Boolean expression of keywords
- R(D, Q) = D -> Q
  e.g. D = t1 ∧ t2 ∧ … ∧ tn
       Q = (t1 ∧ t2) ∨ (t3 ∧ ¬t4)
       D -> Q, thus R(D, Q) = 1.
- Problems:
  - R is either 1 or 0 (unordered set of documents)
  - Many documents or few documents may be returned
  - End users cannot manipulate Boolean operators correctly
    E.g. documents about kangaroos and koalas
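A small sketch of Boolean retrieval over the conjunctive representation above; the documents are hypothetical term sets and the query (t1 ∧ t2) ∨ (t3 ∧ ¬t4) is hard-coded.

```python
# Each document is the set of its keywords (a logical conjunction of terms).
docs = {
    "D1": {"t1", "t2", "t5"},
    "D2": {"t3"},
    "D3": {"t3", "t4"},
}

def matches(terms):
    """Evaluate the Boolean query (t1 AND t2) OR (t3 AND NOT t4) on one document."""
    return ("t1" in terms and "t2" in terms) or ("t3" in terms and "t4" not in terms)

# The answer is an unordered set: R is either 1 or 0 for each document.
print([d for d, terms in docs.items() if matches(terms)])  # ['D1', 'D2']
```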

Extensions to Boolean model (for document ordering)
- D = {…, (t_i, w_i), …}: weighted keywords
- Interpretation:
  - D is a member of class t_i to degree w_i
  - In terms of fuzzy sets: μ_ti(D) = w_i
- A possible evaluation:
  R(D, t_i) = μ_ti(D)
  R(D, Q1 ∧ Q2) = min(R(D, Q1), R(D, Q2))
  R(D, Q1 ∨ Q2) = max(R(D, Q1), R(D, Q2))
  R(D, ¬Q1) = 1 - R(D, Q1)

Vector space model
- Vector space = all the keywords encountered: <t1, t2, t3, …, tn>
- Document: D = <a1, a2, a3, …, an>, where a_i = weight of t_i in D
- Query:    Q = <b1, b2, b3, …, bn>, where b_i = weight of t_i in Q
- R(D, Q) = Sim(D, Q)
Matrix representation
- Document space: each row is a document vector, each column a term

        t1   t2   t3   …  tn
  D1    a11  a12  a13  …  a1n
  D2    a21  a22  a23  …  a2n
  D3    a31  a32  a33  …  a3n
  …
  Dm    am1  am2  am3  …  amn
  Q     b1   b2   b3   …  bn

- [Figure: documents and the query as vectors in term space]

Some formulas for Sim
- Dot product:  Sim(D, Q) = Σ_i (a_i * b_i)
- Cosine:       Sim(D, Q) = Σ_i (a_i * b_i) / sqrt(Σ_i a_i^2 * Σ_i b_i^2)
- Dice:         Sim(D, Q) = 2 Σ_i (a_i * b_i) / (Σ_i a_i^2 + Σ_i b_i^2)
- Jaccard:      Sim(D, Q) = Σ_i (a_i * b_i) / (Σ_i a_i^2 + Σ_i b_i^2 - Σ_i (a_i * b_i))
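A compact sketch of the four similarity measures above on dense vectors; the document and query vectors are made-up values for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def dice(a, b):
    return 2 * dot(a, b) / (sum(x * x for x in a) + sum(y * y for y in b))

def jaccard(a, b):
    num = dot(a, b)
    return num / (sum(x * x for x in a) + sum(y * y for y in b) - num)

D = [0.2, 0.0, 0.3, 0.5]   # hypothetical document vector
Q = [0.1, 0.0, 0.0, 0.4]   # hypothetical query vector
print(dot(D, Q), cosine(D, Q), dice(D, Q), jaccard(D, Q))
```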

Implementation (space)
- The matrix is very sparse: a few hundred terms per document, and a few terms per query, while the term space is large (~100k)
- Stored as:
  D1 -> {(t1, a1), (t2, a2), …}
  t1 -> {(D1, a1), …}

Implementation (time)
- The implementation of VSM with dot product:
  - Naïve implementation: O(m*n)
  - Implementation using an inverted file:
    Given a query = {(t1, b1), (t2, b2)}:
    1. find the sets of related documents through the inverted file for t1 and t2
    2. calculate the score of each document for each weighted query term:
       (t1, b1) -> {(D1, a1*b1), …}
    3. combine the sets and sum the weights (Σ)
  - O(|Q|*n)
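A minimal sketch of steps 1-3 above: dot-product scoring driven by the inverted file built earlier, so only documents containing at least one query term are touched; the query weights are assumptions.

```python
from collections import defaultdict

def retrieve(query, inverted):
    """query: {term: b}; inverted: {term: [(doc_id, a), ...]} -> ranked [(doc_id, score)]."""
    scores = defaultdict(float)
    for term, b in query.items():                 # step 1: fetch the posting list of each query term
        for doc_id, a in inverted.get(term, []):  # step 2: partial score a*b for each posting
            scores[doc_id] += a * b               # step 3: sum the partial scores per document
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

inverted = {"comput": [("D1", 0.2), ("D2", 0.1)], "network": [("D2", 0.5)]}
print(retrieve({"comput": 1.0, "network": 0.5}, inverted))
```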

Other similarities
- Cosine:

  Sim(D, Q) = Σ_i (a_i * b_i) / (sqrt(Σ_j a_j^2) * sqrt(Σ_j b_j^2))
            = Σ_i [a_i / sqrt(Σ_j a_j^2)] * [b_i / sqrt(Σ_j b_j^2)]

  - Use sqrt(Σ_j a_j^2) and sqrt(Σ_j b_j^2) to normalize the weights after indexing
  - Cosine then reduces to a dot product
  - (Similar operations do not apply to Dice and Jaccard)

Probabilistic model
- Given D, estimate P(R|D) and P(NR|D)
- P(R|D) = P(D|R)*P(R)/P(D)   (P(D), P(R) constant)
         ∝ P(D|R)
- D = {t1 = x1, t2 = x2, …}, where x_i = 1 if t_i is present, 0 if absent

  P(D|R)  = Π_{(ti=xi)∈D} P(t_i = x_i | R)
          = Π_ti P(t_i = 1 | R)^x_i * P(t_i = 0 | R)^(1-x_i)
          = Π_ti p_i^x_i * (1 - p_i)^(1-x_i)

  P(D|NR) = Π_ti P(t_i = 1 | NR)^x_i * P(t_i = 0 | NR)^(1-x_i)
          = Π_ti q_i^x_i * (1 - q_i)^(1-x_i)
Prob. model (cont'd)
- For document ranking:

  Odd(D) = log [P(D|R) / P(D|NR)]
         = log [Π_ti p_i^x_i (1 - p_i)^(1-x_i) / Π_ti q_i^x_i (1 - q_i)^(1-x_i)]
         = Σ_ti x_i log [p_i (1 - q_i) / (q_i (1 - p_i))] + Σ_ti log [(1 - p_i) / (1 - q_i)]
         ∝ Σ_ti x_i log [p_i (1 - q_i) / (q_i (1 - p_i))]

Prob. model (cont'd)
- How to estimate p_i and q_i?
- Use a set of relevant and irrelevant samples:

               Rel. doc.   Irrel. doc.           Doc.
  with t_i     r_i         n_i - r_i             n_i
  without t_i  R_i - r_i   N - R_i - n_i + r_i   N - n_i
  total        R_i         N - R_i               N (samples)

  p_i = r_i / R_i        q_i = (n_i - r_i) / (N - R_i)

Prob. model (cont'd)
- Substituting the estimates:

  Odd(D) = Σ_ti x_i log [p_i (1 - q_i) / (q_i (1 - p_i))]
         = Σ_ti x_i log [r_i (N - R_i - n_i + r_i) / ((R_i - r_i)(n_i - r_i))]

- Smoothing (Robertson-Sparck-Jones formula):

  Odd(D) = Σ_{ti∈D} x_i log [(r_i + 0.5)(N - R_i - n_i + r_i + 0.5) / ((R_i - r_i + 0.5)(n_i - r_i + 0.5))]
         = Σ_{ti∈D} w_i

- When no sample is available:
  p_i = 0.5,  q_i = (n_i + 0.5)/(N + 0.5) ≈ n_i / N
- May be implemented as a VSM

BM25

  Score(D, Q) = Σ_{t∈Q} w * [(k1 + 1) tf / (K + tf)] * [(k3 + 1) qtf / (k3 + qtf)]
                + k2 * |Q| * (avdl - dl) / (avdl + dl)

  K = k1 * ((1 - b) + b * dl / avdl)

- k1, k2, k3, b: parameters
- qtf: query term frequency
- dl: document length
- avdl: average document length
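A minimal sketch of the BM25 score above, omitting the query-length correction term (k2) and using the sample-free RSJ-style weight as w; the parameter values and toy statistics are assumptions, not from the slides.

```python
import math

def bm25_score(query_tf, doc_tf, dl, avdl, N, df, k1=1.2, b=0.75, k3=8.0):
    """query_tf/doc_tf: {term: freq}; df: {term: doc frequency}; N: #docs in corpus."""
    K = k1 * ((1 - b) + b * dl / avdl)
    score = 0.0
    for t, qtf in query_tf.items():
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        w = math.log((N - df[t] + 0.5) / (df[t] + 0.5))     # RSJ weight without relevance samples
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score

# Hypothetical statistics
print(bm25_score({"comput": 1}, {"comput": 3, "network": 1},
                 dl=4, avdl=5, N=1000, df={"comput": 50}))
```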

(Classic) Presentation of results
- The query evaluation result is a list of documents, sorted by their similarity to the query.
- E.g.
  doc1  0.67
  doc2  0.65
  doc3  0.54
  …

System evaluation
- Efficiency: time, space
- Effectiveness:
  - How capable is a system of retrieving relevant documents?
  - Is a system better than another one?
- Metrics often used (together):
  - Precision = retrieved relevant docs / retrieved docs
  - Recall = retrieved relevant docs / relevant docs
  [Figure: the relevant and retrieved sets; their intersection is the "retrieved relevant" set]
General form of precision/recall calculation
- Example ranked list (assume 5 relevant docs in total):
  Doc1  relevant  -> (recall 0.2, precision 1.0)
  Doc2            -> (recall 0.2, precision 0.5)
  Doc3  relevant  -> (recall 0.4, precision 0.67)
  Doc4  relevant  -> (recall 0.6, precision 0.75)
  Doc5            -> (recall 0.6, precision 0.6)
  …
- Precision changes w.r.t. recall (not a fixed point)
- Systems cannot be compared at a single precision/recall point
- Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)

An illustration of P/R
- [Figure: the points above plotted on a precision-recall curve: (0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6)]
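A short sketch that reproduces the precision/recall points of the example above from a ranked list of relevance judgments.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """ranked_relevance: list of booleans, one per retrieved doc in rank order."""
    points, rel_so_far = [], 0
    for i, is_rel in enumerate(ranked_relevance, start=1):
        rel_so_far += is_rel
        points.append((rel_so_far / total_relevant, rel_so_far / i))  # (recall, precision)
    return points

# Doc1, Doc3 and Doc4 are relevant; 5 relevant docs exist in total
print(precision_recall_points([True, False, True, True, False], total_relevant=5))
# approx. [(0.2, 1.0), (0.2, 0.5), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6)]
```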

MAP (Mean Average Precision)

  MAP = (1/n) Σ_{Qi} (1/|Ri|) Σ_{Dj∈Ri} (j / r_ij)

- r_ij = rank of the j-th relevant document for Q_i
- |Ri| = number of relevant documents for Q_i
- n = number of test queries
- E.g. two queries, with relevant documents found at ranks:
  Query 1: 1, 5, 10  (1st, 2nd, 3rd relevant doc)
  Query 2: 4, 8      (1st, 2nd relevant doc)

  MAP = (1/2) [ (1/3)(1/1 + 2/5 + 3/10) + (1/2)(1/4 + 2/8) ]

Some other measures
- Noise = retrieved irrelevant docs / retrieved docs
- Silence = non-retrieved relevant docs / relevant docs
  - Noise = 1 – Precision; Silence = 1 – Recall
- Fallout = retrieved irrelevant docs / irrelevant docs
- Single value measures:
  - F-measure = 2 P * R / (P + R)
  - Average precision = average at 11 points of recall
  - Precision at n documents (often used for Web IR)
  - Expected search length (no. of irrelevant documents to read before obtaining n relevant docs)
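A small sketch of the MAP formula above, assuming all relevant documents appear in each ranked list; the ranks reproduce the two-query example.

```python
def mean_average_precision(relevant_ranks_per_query):
    """relevant_ranks_per_query: list of lists; each inner list holds the ranks
    of the relevant documents retrieved for one query, in increasing order."""
    ap_values = []
    for ranks in relevant_ranks_per_query:
        # the j-th relevant document found at rank r_ij contributes precision j / r_ij
        ap = sum(j / r for j, r in enumerate(ranks, start=1)) / len(ranks)
        ap_values.append(ap)
    return sum(ap_values) / len(ap_values)

print(mean_average_precision([[1, 5, 10], [4, 8]]))  # (0.567 + 0.25) / 2 ≈ 0.41
```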

Test corpus (SMART)
- Compare different IR systems on the same test corpus
- A test corpus contains:
  - A set of documents
  - A set of queries
  - Relevance judgments for every document-query pair (the desired answers for each query)
- The results of a system are compared with the desired answers.

An evaluation example (SMART output, two runs)

  Run number:                        1         2
  Num_queries:                       52        52
  Total number of documents over all queries
    Retrieved:                       780       780
    Relevant:                        796       796
    Rel_ret:                         246       229
  Recall - Precision Averages:
    at 0.00                          0.7695    0.7894
    at 0.10                          0.6618    0.6449
    at 0.20                          0.5019    0.5090
    at 0.30                          0.3745    0.3702
    at 0.40                          0.2249    0.3070
    at 0.50                          0.1797    0.2104
    at 0.60                          0.1143    0.1654
    at 0.70                          0.0891    0.1144
    at 0.80                          0.0891    0.1096
    at 0.90                          0.0699    0.0904
    at 1.00                          0.0699    0.0904
  Average precision for all points
    11-pt Avg:                       0.2859    0.3092
    % Change:                                  8.2
  Recall:
    Exact:                           0.4139    0.4166
    at 5 docs:                       0.2373    0.2726
    at 10 docs:                      0.3254    0.3572
    at 15 docs:                      0.4139    0.4166
    at 30 docs:                      0.4139    0.4166
  Precision:
    Exact:                           0.3154    0.2936
    At 5 docs:                       0.4308    0.4192
    At 10 docs:                      0.3538    0.3327
    At 15 docs:                      0.3154    0.2936
    At 30 docs:                      0.1577    0.1468
The TREC experiments
- Once per year
- A set of documents and queries is distributed to the participants (the standard answers are unknown) (April)
- Participants work (very hard) to construct and fine-tune their systems, and submit the answers (1000/query) at the deadline (July)
- NIST people manually evaluate the answers and provide correct answers (and a classification of IR systems) (July – August)
- TREC conference (November)

TREC evaluation methodology
- Known document collection (>100K) and query set (50)
- Submission of 1000 documents for each query by each participant
- Merge the first 100 documents of each participant -> global pool
- Human relevance judgment of the global pool
- The other documents are assumed to be irrelevant
- Evaluation of each system (with 1000 answers)
- Partial relevance judgments
  - But stable for system ranking

Tracks (tasks)
- Ad hoc track: given document collection, different topics
- Routing (filtering): stable interests (user profile), incoming document flow
- CLIR: ad hoc, but with queries in a different language
- Web: a large set of Web pages
- Question-Answering: When did Nixon visit China?
- Interactive: put users into action with the system
- Spoken document retrieval
- Image and video retrieval
- Information tracking: new topic / follow-up

CLEF and NTCIR
- CLEF = Cross-Language Evaluation Forum
  - for European languages
  - organized by Europeans
  - one cycle per year (March – Oct.)
- NTCIR:
  - organized by NII (Japan)
  - for Asian languages
  - cycle of 1.5 years

Impact of TREC
- Provide large collections for further experiments
- Compare different systems/techniques on realistic data
- Develop new methodologies for system evaluation
- Similar experiments are organized in other areas (NLP, machine translation, summarization, …)

Some techniques to improve IR effectiveness
- Interaction with the user (relevance feedback)
  - Keywords only cover part of the contents
  - The user can help by indicating relevant/irrelevant documents
- The use of relevance feedback to improve the query expression:

  Qnew = α*Qold + β*Rel_d - γ*Nrel_d

  where Rel_d = centroid of relevant documents
        Nrel_d = centroid of non-relevant documents

  (a sketch follows below)
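A minimal sketch of this feedback formula (commonly associated with Rocchio) over sparse term-weight dictionaries; the α, β, γ values and the example vectors are assumptions.

```python
from collections import defaultdict

def centroid(doc_vectors):
    """Average a non-empty list of sparse {term: weight} vectors."""
    acc = defaultdict(float)
    for vec in doc_vectors:
        for t, w in vec.items():
            acc[t] += w / len(doc_vectors)
    return dict(acc)

def feedback(q_old, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Qnew = alpha*Qold + beta*centroid(rel) - gamma*centroid(nonrel)."""
    rel_c, nrel_c = centroid(rel_docs), centroid(nonrel_docs)
    terms = set(q_old) | set(rel_c) | set(nrel_c)
    q_new = {t: alpha * q_old.get(t, 0) + beta * rel_c.get(t, 0) - gamma * nrel_c.get(t, 0)
             for t in terms}
    return {t: w for t, w in q_new.items() if w > 0}   # keep only positive weights

q = {"information": 1.0, "retrieval": 1.0}
rel = [{"retrieval": 0.8, "search": 0.6}]
nonrel = [{"furniture": 0.9}]
print(feedback(q, rel, nonrel))
```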
Effect of RF
- [Figure: document space with relevant (R) and non-relevant (NR) documents; the 2nd retrieval with Qnew covers more relevant documents than the 1st retrieval with Q]

Modified relevance feedback
- Users usually do not cooperate (e.g. AltaVista in its early years)
- Pseudo-relevance feedback (blind RF)
  - Use the top-ranked documents as if they were relevant:
    - Select m terms from the n top-ranked documents
  - One can usually obtain about 10% improvement

Query expansion
- A query contains only part of the important words
- Add new (related) terms into the query
  - Manually constructed knowledge base/thesaurus (e.g. WordNet)
    - Q = information retrieval
    - Q' = (information + data + knowledge + …) (retrieval + search + seeking + …)
  - Corpus analysis:
    - Two terms that often co-occur are related (mutual information)
    - Two terms that co-occur with the same words are related (e.g. T-shirt and coat with wear, …)

Global vs. local context analysis
- Global analysis: use the whole document collection to calculate term relationships
- Local analysis: use the query to retrieve a subset of documents, then calculate term relationships
  - Combine pseudo-relevance feedback and term co-occurrences
  - More effective than global analysis

Some current research topics: Go beyond keywords
- Keywords are not perfect representatives of concepts
  - Ambiguity: table = data structure or furniture?
  - Lack of precision: "operating", "system" less precise than "operating_system"
- Suggested solutions
  - Sense disambiguation (difficult due to the lack of contextual information)
  - Using compound terms (no complete dictionary of compound terms, variation in form)
  - Using noun phrases (syntactic patterns + statistics)
- Still a long way to go

Theory …
- Bayesian networks
  - P(Q|D)
  - [Figure: a network with document nodes D1 … Dm, term nodes t1 … tn, and concept nodes c1 … cl; inference over the network, revision of Q]
- Language models
Logical models
- How to describe the relevance relation as a logical relation?
  D => Q
- What are the properties of this relation?
- How to combine uncertainty with a logical framework?
- The problem: what is relevance?

Related applications: Information filtering
- IR: changing queries on a stable document collection
- IF: incoming document flow with stable interests (queries)
  - yes/no decision (instead of ordering documents)
  - Advantage: the description of the user's interest may be improved using relevance feedback (the user is more willing to cooperate)
  - Difficulty: adjust the threshold to keep/ignore a document
- The basic techniques used for IF are the same as those for IR – "two sides of the same coin"
  [Figure: a document stream (… doc3, doc2, doc1) enters the IF system, which keeps or ignores each document according to the user profile]

IR for (semi-)structured documents
- Use structural information to assign weights to keywords (Introduction, Conclusion, …)
- Hierarchical indexing
- Querying within some structure (search in title, etc.)
- INEX experiments
- Using hyperlinks in indexing and retrieval (e.g. Google)
- …

PageRank in Google

  PR(A) = (1 - d) + d * Σ_i PR(I_i) / C(I_i)

  where the I_i are the pages linking to A and C(I_i) is the number of outgoing links of I_i

- Assign a numeric value to each page
- The more a page is referred to by important pages, the more this page is important
- d: damping factor (0.85)
- Many other criteria are used as well: e.g. proximity of query words
  - "… information retrieval …" better than "… information … retrieval …"
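A minimal sketch of iterating the PR formula above on a tiny hypothetical link graph; the damping factor follows the slide (0.85), and the (1 - d) form is used as written, without normalizing by the number of pages.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}. Iterates PR(A) = (1-d) + d * sum(PR(I)/C(I))."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {p: (1 - d) for p in pages}
        for src, targets in links.items():
            if targets:
                share = d * pr[src] / len(targets)   # PR(I)/C(I), scaled by d
                for t in targets:
                    new_pr[t] += share
        pr = new_pr
    return pr

# Hypothetical link graph: I1 and I2 both point to A; A points to B
print(pagerank({"I1": ["A"], "I2": ["A"], "A": ["B"]}))
```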

IR on the Web
- No stable document collection (spider, crawler)
- Invalid documents, duplication, etc.
- Huge number of documents (partial collection)
- Multimedia documents
- Great variation of document quality
- Multilingual problem
- …

Final remarks on IR
- IR is related to many areas:
  - NLP, AI, databases, machine learning, user modeling…
  - library, Web, multimedia search, …
- Relatively weak theories
- Very strong tradition of experiments
- Many remaining (and exciting) problems
- Difficult area: intuitive methods do not necessarily improve effectiveness in practice
Why is IR difficult
- Vocabulary mismatch
  - Synonymy: e.g. car vs. automobile
  - Polysemy: table
- Queries are ambiguous; they are only a partial specification of the user's need
- Content representation may be inadequate and incomplete
- The user is the ultimate judge, but we don't know how the judge judges…
  - The notion of relevance is imprecise, context- and user-dependent
- But how rewarding it is to gain a 10% improvement!
