
Latent Semantic Indexing by Singular Value Decomposition

Problems in Lexical Matching

- Synonymy: widespread use of synonyms decreases recall.
- Polysemy: words with several meanings cause irrelevant documents to be retrieved, which lowers precision.
- Noise: a Boolean search on specific words retrieves documents whose content is unrelated to the query.

Motivation for LSI

- To find and fit a useful model of the relationships between terms and documents.
- To find out which terms are "really" implied by a query.
- LSI allows the user to search for concepts rather than specific words.
- LSI can retrieve documents related to a user's query even when the query and the documents share no terms.

Example
Q: Light waves.
D1: Particle and wave models of light.
D2: Surfing on the waves under star lights.
D3: Electro-magnetic models for photons.
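The failure of lexical matching on this example can be sketched in a few lines. The tokenizer below, including its crude rule for stripping a trailing "s", is an assumption made only for illustration and is not part of the slides.

```python
def tokens(text):
    """Lowercase, strip punctuation, and crudely strip a trailing 's' (illustrative only)."""
    words = set()
    for raw in text.lower().replace("-", " ").split():
        word = raw.strip(".,:;")
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        words.add(word)
    return words

query = "Light waves"
docs = {
    "D1": "Particle and wave models of light",
    "D2": "Surfing on the waves under star lights",
    "D3": "Electro-magnetic models for photons",
}

q = tokens(query)
# Terms shared between the query and each document.
overlap = {name: sorted(q & tokens(text)) for name, text in docs.items()}
for name, shared in overlap.items():
    print(name, shared)
# D1 ['light', 'wave']
# D2 ['light', 'wave']
# D3 []
```

Term overlap ranks D2 (surfing) as highly as D1 and misses D3 entirely, although D3 is about the physics of light and D2 is not.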

How Does LSI Work?

- LSI uses a multidimensional vector space in which all documents and terms are placed.
- Each dimension in that space corresponds to a concept existing in the collection.
- Thus the underlying topics of a document are encoded in a vector.
- Related terms shared by a document and a query pull the document and query vectors close to each other.

Drawback!
Computing the LSI model from a truncated SVD is costly. Its execution efficiency lags far behind that of the simpler Boolean models, especially on large data sets.

SVD
The key to working with the SVD of any rectangular matrix $A$ is to consider $AA^T$ and $A^TA$.

- The columns of $U$, which is t by t, are eigenvectors of $AA^T$.
- The columns of $V$, which is d by d, are eigenvectors of $A^TA$.
- The singular values on the diagonal of $S$, which is t by d, are the positive square roots of the nonzero eigenvalues of both $AA^T$ and $A^TA$.
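These relationships can be checked numerically. The sketch below assumes NumPy and uses a small hypothetical 4-term by 3-document matrix that does not come from the slides.

```python
import numpy as np

# Hypothetical 4-term x 3-document count matrix, chosen only for illustration.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Full SVD: U is t x t, Vt is d x d, s holds the min(t, d) singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
V = Vt.T
n = len(s)

# Columns of U are eigenvectors of A A^T with eigenvalues s_i^2.
left_ok = np.allclose((A @ A.T) @ U[:, :n], U[:, :n] * s**2)

# Columns of V are eigenvectors of A^T A with the same eigenvalues.
right_ok = np.allclose((A.T @ A) @ V, V * s**2)

print(left_ok, right_ok)
```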

SVD
Eigenvalue-eigenvector factorization:

$A = USV^T$

- $UU^T = I$
- $VV^T = I$
- $S$: diagonal matrix of singular values

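SVD-property
The diagonal entries of $S$ are ordered by magnitude:

$s_1 \ge s_2 \ge \dots \ge s_r > s_{r+1} = \dots = s_n = 0$, where $r$ is the rank of $A$ and $n = \min(t, d)$.

The truncated matrix $A_k$ is the best rank-$k$ approximation to $A$ in the least-squares (Frobenius norm) sense.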

Computing SVD
T = AAT and D = ATA :

Eigenvector and Eigenvalue computation

for T and D
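A minimal sketch of this route, assuming NumPy and a hypothetical full-column-rank matrix: the eigendecomposition of $D = A^TA$ gives $V$ and the squared singular values, and each left singular vector follows as $u_i = A v_i / s_i$. Production libraries generally avoid forming $A^TA$ explicitly for numerical reasons, so this is illustrative only.

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix with full column rank (illustration only).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

D = A.T @ A                              # d x d, symmetric
eigvals, eigvecs = np.linalg.eigh(D)     # eigh returns eigenvalues in ascending order

# Reorder so that the largest eigenvalue comes first, as in the SVD convention.
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], eigvecs[:, order]

# Singular values are the positive square roots of the eigenvalues.
s = np.sqrt(np.clip(eigvals, 0.0, None))

# Left singular vectors: u_i = A v_i / s_i (valid here because every s_i > 0).
U = (A @ V) / s

reconstructed = U @ np.diag(s) @ V.T
print(np.allclose(reconstructed, A))
print(np.allclose(s, np.linalg.svd(A, compute_uv=False)))
```

The division by `s` is only safe when no singular value is zero; a rank-deficient matrix would need the zero singular values handled separately.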

Computing SVD(2)

Truncated-SVD
Create a rank-$k$ approximation to $A$, with $k \le r_A$:

$A_k = U_k S_k V_k^T$
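The truncation can be sketched as follows, assuming NumPy. The helper name `truncated_svd` and the example matrix are hypothetical. The final check uses the standard result that the Frobenius error of $A_k$ equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, s_k, Vt_k: the k largest singular triplets of A (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical 4-term x 3-document matrix of rank 3 (illustration only).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

k = 2
Uk, sk, Vtk = truncated_svd(A, k)
Ak = Uk @ np.diag(sk) @ Vtk              # rank-k approximation A_k = U_k S_k V_k^T

s_all = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - Ak, "fro")      # approximation error
expected_err = np.sqrt(np.sum(s_all[k:] ** 2))

print(np.linalg.matrix_rank(Ak))
print(np.isclose(err, expected_err))
```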

Truncated-SVD
- Using the truncated SVD, the underlying latent structure is represented in a reduced $k$-dimensional space.
- Noise in word usage is eliminated.

LSI-Procedure
- Obtain the term-document matrix.
- Compute the SVD.
- Truncate the SVD to the reduced $k$-dimensional LSI space:
  - $k$-dimensional semantic structure
  - similarity in the reduced space: term-term, term-document, document-document

Query processing
- Map the query into the reduced $k$-space: $\hat{q} = q^T U_k S_k^{-1}$.
- Retrieve the documents or terms within a given proximity of $\hat{q}$:
  - by cosine threshold
  - by best $m$
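A minimal sketch of query processing, assuming NumPy. The matrix, the query vector, and the helper names `project` and `cosine` are hypothetical. Document coordinates are taken to be the rows of $V_k$, which is one common convention; scaling them by $S_k$ is another.

```python
import numpy as np

def project(vec, Uk, sk):
    """Map a term-space vector into k-space: vec^T U_k S_k^{-1}."""
    return (vec @ Uk) / sk

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-term x 3-document matrix (illustration only).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T    # rows of Vk are document coordinates

q = np.array([1.0, 0.0, 0.0, 1.0])           # hypothetical query using terms 0 and 3
q_hat = project(q, Uk, sk)

# Score every document by cosine similarity, then keep the best m.
scores = [cosine(q_hat, Vk[j]) for j in range(Vk.shape[0])]
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
m = 2
best_m = ranking[:m]
print(best_m, [round(scores[j], 3) for j in best_m])
```

Projecting an existing document column with the same formula reproduces its row of $V_k$, which is why queries and documents can be compared directly in the reduced space.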

Updating
- Folding-in: $\hat{d} = d^T U_k S_k^{-1}$, similar to query projection.
- SVD re-computation.
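The two update strategies can be contrasted in a short sketch, assuming NumPy and hypothetical data. Folding-in projects the new document with the existing $U_k$ and $S_k$ and leaves all other coordinates untouched. Re-computation appends the new column to $A$ and runs the SVD again.

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix (illustration only).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

# Hypothetical new document containing terms 1 and 3.
d_new = np.array([0.0, 1.0, 0.0, 1.0])

# Folding-in: d^T U_k S_k^{-1}, appended as a new row of document coordinates.
d_hat = (d_new @ Uk) / sk
Vk_folded = np.vstack([Vk, d_hat])

# Re-computation: rebuild the SVD from the enlarged term-document matrix.
A_new = np.column_stack([A, d_new])
U2, s2, Vt2 = np.linalg.svd(A_new, full_matrices=False)
Vk_recomputed = Vt2[:k, :].T

print(Vk_folded.shape, Vk_recomputed.shape)
print("singular values before:", np.round(sk, 3))
print("singular values after: ", np.round(s2[:k], 3))
```

Folding-in is cheap but the new document has no influence on the latent space. Re-computation lets it reshape the space, at the cost of a full SVD.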

Example: Collection

Label  Course Title
C1     Parallel Programming Languages Systems
C2     Parallel Processing for Noncommercial Applications
C3     Algorithm Design for Parallel Computers
C4     Networks and Algorithms for Parallel Computation
C5     Application of Computer Graphics
C6     Database Theory
C7     Distributed Database Systems
C8     Topics in Database Management Systems
C9     Data Organization and Management
C10    Network Theory
C11    Computer Organization
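A term-document matrix for this collection can be built with a short sketch. The ten index terms are the labels that appear in the plots of the slides. The alphabetical row order and the prefix-matching rule (so that "comput" matches "Computer", "Computers" and "Computation") are assumptions, so the row order need not match the one used in the original matrix.

```python
titles = {
    "C1": "Parallel Programming Languages Systems",
    "C2": "Parallel Processing for Noncommercial Applications",
    "C3": "Algorithm Design for Parallel Computers",
    "C4": "Networks and Algorithms for Parallel Computation",
    "C5": "Application of Computer Graphics",
    "C6": "Database Theory",
    "C7": "Distributed Database Systems",
    "C8": "Topics in Database Management Systems",
    "C9": "Data Organization and Management",
    "C10": "Network Theory",
    "C11": "Computer Organization",
}

# Index terms taken from the plot labels; the alphabetical order is an assumption.
terms = ["algorithm", "application", "comput", "database", "management",
         "network", "organization", "parallel", "systems", "theory"]

def count(term, title):
    """Number of words in the title that start with the given term (crude stemming)."""
    return sum(word.startswith(term) for word in title.lower().split())

# Rows are terms, columns are courses C1..C11.
matrix = [[count(term, title) for title in titles.values()] for term in terms]

for term, row in zip(terms, matrix):
    print(f"{term:<13}", *row)
```

Each of these terms occurs in at least two titles, while words such as "programming" and "data" occur in only one, which is consistent with them showing up only as new terms after the update.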

A versus $A_2$

Observations
- Lower entry values.
- Higher values.
- Negative entries.

Mapping
[Figure: the terms (application, network, algorithm, comput, parallel, organization, theory, management, systems, database) and the courses C1 to C11 plotted in the two-dimensional LSI space. Legend: courses, words.]

Example: Query and New terms

Query: computer database organizations

$q^T = [\,0\ 1\ 0\ 0\ 0\ 0\ 1\ 0\ 0\ 1\,]$

Update:

Label  Course Title
C12    Parallel Programming for Scientific Computations
C13    Data Structures for Parallel Programming

Query
[Figure: the query Q plotted alongside the terms and the courses C1 to C11 in the two-dimensional LSI space, with the relevance space around Q marked. Legend: courses, words.]

Comparison with Lexical Matching

Fold-in
[Figure: the new courses C12 and C13 and the new terms data and programming folded into the two-dimensional LSI space alongside the existing terms and courses. Legend: courses, words.]

Recomputed Space

Some Applications
- Information Retrieval
- Information Filtering
- Relevance Feedback
- Cross-language retrieval
