Synonymy - widespread synonym occurrences - decreased recall. Polysemy - retrieval of irrelevant documents - poor precision. Noise - Boolean search on specific words - retrieval of documents with unrelated content.
relationships between terms and documents. To find out what terms are "really" implied by a query. LSI allows the user to search for concepts rather than specific words. LSI can retrieve documents related to a user's query even when the query and the documents do not share any common terms.
Example
Q: Light waves. D1: Particle and wave models of
models for
photons.
all documents and terms. Each dimension in that space corresponds to a concept existing in the collection. Thus the underlying topics of a document are encoded in a vector. Common related terms in a document and a query will pull the document and query vectors close to each other.
Drawback!
Computing the LSI model from a truncated SVD is costly. Its execution efficiency lags far behind that of the simpler Boolean models, especially on large data sets.
SVD
The key to working with the SVD of any rectangular matrix $A$ is to consider $AA^T$ and $A^TA$. The columns of $U$, which is t by t, are eigenvectors of $AA^T$. The columns of $V$, which is d by d, are eigenvectors of $A^TA$. The singular values on the diagonal of $S$, which is t by d, are the positive square roots of the nonzero eigenvalues of both $AA^T$ and $A^TA$.
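The relationship above can be checked numerically. The following is a minimal sketch using NumPy; the 4 by 3 matrix is a made-up example, not data from these slides.

```python
import numpy as np

# Hypothetical 4-term by 3-document matrix (t = 4, d = 3), for illustration only.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Full SVD: U is t x t, Vt is d x d, s holds the min(t, d) singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
r = len(s)

# Eigenvalues of the symmetric matrices A A^T and A^T A, in decreasing order.
eig_T = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
eig_D = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Squared singular values match the nonzero eigenvalues of both products.
print(np.allclose(s**2, eig_T[:r]), np.allclose(s**2, eig_D[:r]))

# Columns of U are eigenvectors of A A^T; columns of V are eigenvectors of A^T A.
print(np.allclose(A @ A.T @ U[:, :r], U[:, :r] * s**2))
print(np.allclose(A.T @ A @ Vt.T, Vt.T * s**2))
```

The extra eigenvalue of $AA^T$ (there are t of them but only d singular values here) is zero up to rounding.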
SVD
Eigenvalue-eigenvector factorization
$A = USV^T$
SVD-property
Diagonals are ordered in magnitude: $s_1 \ge s_2 \ge \dots \ge s_r > 0$, where $r$ is the rank of $A$.
Computing SVD
$T = AA^T$ and $D = A^TA$:
for T and D
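As an illustration of computing the SVD through these products, here is a minimal sketch that builds a thin SVD from the eigendecomposition of $D = A^TA$ alone. The function name `svd_via_eig` and the test matrix are hypothetical, and the sketch assumes $A$ has full column rank so that no singular value is zero. Library routines such as `numpy.linalg.svd` are usually preferable, because forming $A^TA$ squares the condition number.

```python
import numpy as np

def svd_via_eig(A):
    """Thin SVD of A from the eigendecomposition of D = A^T A.

    Assumes A has full column rank, so every singular value is nonzero.
    """
    D = A.T @ A
    eigvals, V = np.linalg.eigh(D)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, V = eigvals[order], V[:, order]
    s = np.sqrt(np.clip(eigvals, 0.0, None))  # singular values
    U = (A @ V) / s                           # u_i = A v_i / s_i
    return U, s, V.T

# Hypothetical 4 x 3 matrix, for illustration only.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
U, s, Vt = svd_via_eig(A)
print(np.allclose((U * s) @ Vt, A))           # the factors reproduce A
```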
Computing SVD(2)
Truncated-SVD
Create a rank-k
Truncated-SVD
Using truncated SVD, underlying latent
LSI-Procedure
Obtain the term-document matrix. Compute the SVD. Truncate the SVD into the reduced k-dimensional LSI space.
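The three steps can be sketched as follows. The four toy documents, the whitespace tokenization, and the function name `build_lsi` are assumptions made for illustration; scaling the document coordinates by the singular values is one common convention, not necessarily the one used in these slides.

```python
import numpy as np

def build_lsi(A, k):
    """Return U_k (t x k), s_k (k,), V_k (d x k) from the truncated SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

# Step 1: term-document matrix from a tiny, made-up collection (raw term counts).
docs = [
    "parallel algorithm design",
    "parallel network algorithm",
    "database theory",
    "database management systems",
]
terms = sorted({w for doc in docs for w in doc.split()})
A = np.array([[doc.split().count(t) for doc in docs] for t in terms], dtype=float)

# Steps 2 and 3: compute the SVD and keep only the k largest singular triplets.
k = 2
U_k, s_k, V_k = build_lsi(A, k)

A_k = (U_k * s_k) @ V_k.T        # rank-k approximation of A
doc_coords = V_k * s_k           # one row per document in the k-space
print(doc_coords.shape)          # (4, 2)
```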
Query processing
Map the query to reduced k-space
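A commonly used mapping is $\hat{q} = q^T U_k S_k^{-1}$, after which documents are ranked by cosine similarity to $\hat{q}$. The sketch below assumes that mapping and compares $\hat{q}$ with the rows of $V_k$; the matrix, the term list, and the function names are hypothetical.

```python
import numpy as np

def map_query(q, U_k, s_k):
    """Map a term-space query vector q (length t) to k-space: q^T U_k S_k^{-1}."""
    return (q @ U_k) / s_k

def cosine_to_documents(q_hat, V_k):
    """Cosine between the mapped query and each document row of V_k."""
    return (V_k @ q_hat) / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat))

# Hypothetical 8-term by 4-document matrix, for illustration only.
terms = ["algorithm", "database", "design", "management",
         "network", "parallel", "systems", "theory"]
A = np.array([
    [1, 1, 0, 0],   # algorithm
    [0, 0, 1, 1],   # database
    [1, 0, 0, 0],   # design
    [0, 0, 0, 1],   # management
    [0, 1, 0, 0],   # network
    [1, 1, 0, 0],   # parallel
    [0, 0, 0, 1],   # systems
    [0, 0, 1, 0],   # theory
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Query containing only the term "design", which occurs in document 0 alone.
q = np.zeros(len(terms))
q[terms.index("design")] = 1.0

sims = cosine_to_documents(map_query(q, U_k, s_k), V_k)
print(np.round(sims, 3))
```

In this toy collection, document 1 does not contain the query term but shares other terms with document 0, so it lands close to the query in the reduced space.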
Updating
Folding-in
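Folding-in places new documents or terms into the existing k-space without recomputing the SVD. A minimal sketch, assuming the usual projections $\hat{d} = d^T U_k S_k^{-1}$ for a new document and $\hat{t} = t\, V_k S_k^{-1}$ for a new term; the function names and the matrix are hypothetical.

```python
import numpy as np

def fold_in_documents(D_new, U_k, s_k):
    """Project new document columns (t x p) into the existing k-space."""
    return (D_new.T @ U_k) / s_k      # p x k, one row per new document

def fold_in_terms(T_new, V_k, s_k):
    """Project new term rows (q x d) into the existing k-space."""
    return (T_new @ V_k) / s_k        # q x k, one row per new term

# Hypothetical 5-term by 4-document matrix, for illustration only.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# One new document that uses the first and fourth existing terms.
d_new = np.array([[1.0], [0.0], [0.0], [1.0], [0.0]])
V_k_updated = np.vstack([V_k, fold_in_documents(d_new, U_k, s_k)])
print(V_k_updated.shape)              # (5, 2)
```

Because $U_k$ and $S_k$ stay fixed, folded-in items do not change the positions of the existing terms and documents; recomputing the SVD on the enlarged matrix does.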
Example: Collection

Label  Course Title
C1     Parallel Programming Languages Systems
C2     Parallel Processing for Noncommercial Applications
C3     Algorithm Design for Parallel Computers
C4     Networks and Algorithms for Parallel Computation
C5     Application of Computer Graphics
C6     Database Theory
C7     Distributed Database Systems
C8     Topics in Database Management Systems
C9     Data Organization and Management
C10    Network Theory
C11    Computer Organization
A versus $A_2$
Observations
Lower entry values.
Higher values.
Negative Entries.
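These observations can be explored with a rank-2 approximation of a term-document matrix built from the eleven course titles. This is only a sketch: the ten index terms are taken from the labels on the plots, their alphabetical ordering is my own, and matching a term to any title word that starts with it is a hypothetical stand-in for whatever stemming was actually used.

```python
import numpy as np

titles = [
    "Parallel Programming Languages Systems",
    "Parallel Processing for Noncommercial Applications",
    "Algorithm Design for Parallel Computers",
    "Networks and Algorithms for Parallel Computation",
    "Application of Computer Graphics",
    "Database Theory",
    "Distributed Database Systems",
    "Topics in Database Management Systems",
    "Data Organization and Management",
    "Network Theory",
    "Computer Organization",
]
# Assumed index terms (prefix stems); the ordering here is not the source's.
terms = ["algorithm", "application", "comput", "database", "management",
         "network", "organization", "parallel", "systems", "theory"]

# Binary term-document matrix: 1 if any title word starts with the term.
A = np.array([[float(any(w.startswith(t) for w in title.lower().split()))
               for title in titles] for t in terms])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_2 = (U[:, :k] * s[:k]) @ Vt[:k, :]     # rank-2 approximation

np.set_printoptions(precision=2, suppress=True)
print(A_2)
print("entries equal to 1 in A that are lower in A_2:", int(np.sum((A == 1) & (A_2 < 1))))
print("entries equal to 0 in A that are positive in A_2:", int(np.sum((A == 0) & (A_2 > 0))))
print("negative entries in A_2:", int(np.sum(A_2 < 0)))
print("Frobenius norm of A - A_2:", np.linalg.norm(A - A_2))
```

The Frobenius error equals the square root of the sum of the squared discarded singular values, which gives a quick sanity check on the truncation.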
Mapping
[Figure: the terms (application, network, algorithm, comput, parallel, organization, theory, management, systems, database) and the courses C1 to C11 plotted in the two-dimensional LSI space.]
$q^T = [\,0\ 1\ 0\ 0\ 0\ 0\ 1\ 0\ 0\ 1\,]$

Update:

Label  Course Title
C12    Parallel Programming for Scientific Computations
C13    Data Structures for Parallel Programming
Query
[Figure: the query Q plotted in the same two-dimensional LSI space as the terms and the courses C1 to C11.]
Fold-in
[Figure: the two-dimensional LSI space after fold-in, showing the courses C1 to C13 and the terms, including data and programming.]
Recomputed Space
Some Applications
Information Retrieval
Information Filtering
Relevance Feedback
Cross-language retrieval