
COLING 2012

24th International Conference on


Computational Linguistics
Revisiting Dimensionality Reduction
Techniques for NLP
(Tutorial Notes)

Jagadeesh Jagarlamudi, Raghavendra Udupa


08 December 2012
Mumbai, India

Diamond sponsors
Tata Consultancy Services
Linguistic Data Consortium for Indian Languages (LDC-IL)

Gold Sponsors
Microsoft Research
Beijing Baidu Netcon Science Technology Co. Ltd.

Silver sponsors
IBM, India Private Limited
Crimson Interactive Pvt. Ltd.
Yahoo
Easy Transcription & Software Pvt. Ltd.

Revisiting Dimensionality Reduction Techniques for NLP (Tutorial Notes)


Jagadeesh Jagarlamudi and Raghavendra Udupa
Preprint edition
Published by The COLING 2012 Organizing Committee
Mumbai, 2012
© 2012 The COLING 2012 Organizing Committee.
This volume is licensed under the Creative Commons Attribution-Noncommercial-Share Alike
3.0 Unported license.

http://creativecommons.org/licenses/by-nc-sa/3.0/
Some rights reserved.

Contributed content copyright the contributing authors.


Used with permission.


Revisiting Dimensionality Reduction Techniques for NLP


Jagadeesh Jagarlamudi (1), Raghavendra Udupa (2)
(1) University of Maryland, College Park, Maryland, USA
(2) Microsoft Research Lab India Private Limited, Bangalore, India

jags@umiacs.umd.edu, raghavu@microsoft.com

ABSTRACT
Many natural language processing (NLP) applications represent words and documents as
vectors in a very high dimensional space. The inherently high-dimensional nature of these
applications leads to sparse vectors resulting in poor performance of downstream applications.
Dimensionality reduction aims to find a lower dimensional subspace (or simply subspace)
which captures the essential information required by the downstream applications. Although
it received a lot of attention in the beginning, its popularity in NLP, unlike other fields such
as computer vision, has declined over time. This is partly because, traditionally, it was
studied in an unsupervised fashion and hence the learnt subspace may not be optimal for the
task at hand. But recent advances in learning low-dimensional representations in the presence
of input and output variables enable us to learn task-specific subspaces that are as effective
as the state-of-the-art approaches. In this tutorial, we aim to demonstrate the simplicity and
effectiveness of these techniques on a diverse set of NLP tasks. By the end of the tutorial, we
hope that attendees will be able to decide "whether or not dimensionality reduction can help
their task and if so, how?".
The tutorial begins with an introduction to dimensionality reduction and its importance to NLP. As many of the dimensionality reduction techniques discussed in the tutorial make
use of Linear Algebra, we discuss some important concepts including Linear Transformation,
Positive Definite Matrices, eigenvalues and eigenvectors. Next, we look at some important
dimensionality reduction techniques for data with single view (PCA, SVD, OPCA). We then
take up applications of these techniques to some important NLP problems (word-sense
discrimination, POS tagging and Information Retrieval). As NLP often involves more than one
language, we look at dimensionality reduction of multiview data using Canonical Correlation
Analysis. We discuss some interesting applications of multiview dimensionality reduction
(bilingual document projections and mining word-level translations). We also discuss some
advanced topics in dimensionality reduction such as Non-Linear and Neural techniques and
some application inspired techniques such as Discriminative Reranking, Supervised Semantic
Analysis, and Multilingual Hashing.
We do not assume attendees know anything about dimensionality reduction (though
the tutorial should be interesting even to those who know some), but we *do* assume some
basic knowledge of linear algebra.

Revisiting Dimensionality Reduction Techniques for NLP (Tutorial Notes), pages 1-75,
COLING 2012, Mumbai, December 2012.

Road Map
Introduction
NLP and Dimensionality Reduction
Mathematical Background
Data with Single View
Techniques
Applications
Advanced Topics
Data with Multiple Views
Techniques
Applications
Advanced Topics

Summary

Dimensionality Reduction:
Motivation
Many applications involve high dimensional
(and often sparse) data
High dimensional data poses several challenges
  Computational
  Difficulty of interpretation
  Overfitting

However, data often lies (approximately) on a
low dimensional manifold embedded in the
high dimensional space

Dimensionality Reduction:
Goal
Given high dimensional data, discover the
underlying low dimensional structure

[Figure: 560-dimensional face data and its 2D embedding; He et al., Face Recognition Using Laplacianfaces]

Dimensionality Reduction:
Benefits
Computational Efficiency
K-Nearest Neighbor Search

Data Compression
Less storage; millions of data points in RAM

Data Visualization
2D and 3D Scatter Plots

Latent Structure and Semantics


Feature Extraction
Removing distracting variance from data sets

Dimensionality Reduction:
Techniques
Projective Methods
find low dimensional projections that extract
useful information from the data, by maximizing
a suitable objective function
PCA, ICA, LDA

Manifold Modeling Methods


find low dimensional subspace that best
preserves the manifold structure in the data, by
modelling the manifold structure
LLE, Isomap, Laplacian Eigenmaps

Dimensionality Reduction:
Relevance to NLP
High dimensional data in NLP
Text Documents
Context Vectors

How can Dimensionality Reduction help?


'Semantic' similarity of documents
Correlate semantically related terms
Crosslingual similarity

Data Centering
Dataset: $X = [x_1, \ldots, x_n]$, $x_i \in \mathbb{R}^d$
Mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Centering: $x_i \leftarrow x_i - \mu$
Centered dataset: $X = [x_1, \ldots, x_n]$
Mean after centering:
  $\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu) = \mu - \mu = 0$
Mean after a linear transformation $A$:
  $\frac{1}{n}\sum_{i=1}^{n} A(x_i - \mu) = A\mu - A\mu = 0$

Data Variance
Dataset: $X = [x_1, \ldots, x_n]$, $x_i \in \mathbb{R}^d$
Centered: $\frac{1}{n}\sum_{i=1}^{n} x_i = 0$
Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \|x_i\|^2$
Centering doesn't change the data variance
$\sigma^2 = \frac{1}{n}\,\mathrm{Tr}(X X^\top) = \mathrm{Tr}(C)$,
where $C = \frac{1}{n} X X^\top$ (sample covariance)

Transformed dataset: $AX$
Variance after the transformation:
  $\frac{1}{n}\sum_{i=1}^{n} \|A x_i\|^2 = \frac{1}{n}\,\mathrm{Tr}(A X X^\top A^\top) = \mathrm{Tr}(A C A^\top)$
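These identities are easy to check numerically. A minimal Python/numpy sketch with synthetic data (all variable names are illustrative, not part of the tutorial):

import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
X = rng.normal(size=(d, n)) * rng.uniform(1, 3, size=(d, 1))  # raw data, one column per point

X = X - X.mean(axis=1, keepdims=True)       # centering: x_i <- x_i - mu
C = (X @ X.T) / n                           # sample covariance C = X X^T / n

var_direct = np.mean(np.sum(X**2, axis=0))  # (1/n) sum_i ||x_i||^2
assert np.isclose(var_direct, np.trace(C))  # variance equals Tr(C)

A = rng.normal(size=(3, d))                 # any linear map A
var_transformed = np.mean(np.sum((A @ X)**2, axis=0))
assert np.isclose(var_transformed, np.trace(A @ C @ A.T))   # equals Tr(A C A^T)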

Positive Definite Matrices


Real: $M \in \mathbb{R}^{p \times q}$
Square: $p = q$
Symmetric: $M = M^\top$
Positive: $x^\top M x > 0$ for all $x \neq 0$
Examples: the identity matrix, $C$, $A C A^\top$
Cholesky decomposition: $M = L L^\top$

Eigenvalues and Eigenvectors
$M \in \mathbb{R}^{d \times d}$
$M u = \lambda u$, where $u$ is a vector (an eigenvector) and $\lambda$ is a scalar (an eigenvalue)
$\{\lambda_1, \ldots, \lambda_d\}$: the eigenvalues of $M$
Trace: $\mathrm{Tr}(M) = \sum_{i=1}^{d} M_{ii} = \sum_{i=1}^{d} \lambda_i$
Rank: the number of non-zero eigenvalues

Eigensystem of Positive-Definite Matrices
$M \in \mathbb{R}^{d \times d}$
Positive eigenvalues: $\lambda_i > 0$
Real-valued eigenvectors: $u_i \in \mathbb{R}^d$
Orthonormal eigenvectors: $u_i^\top u_j = 0$ for $i \neq j$ and $u_i^\top u_i = 1$ (i.e., $U^\top U = I$)
Full rank: $\mathrm{rank}(M) = d$
Eigen decomposition: $M = U D U^\top$

Data Variance and Eigenvalues
Centered dataset: $X = [x_1, \ldots, x_n]$, $x_i \in \mathbb{R}^d$
Data variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \|x_i\|^2 = \mathrm{Tr}(C)$
Eigen decomposition: $C = U D U^\top$
Data variance: $\sigma^2 = \mathrm{Tr}(C) = \sum_{i=1}^{d} \lambda_i$

Principal Components Analysis (PCA)
Centered dataset: $X = [x_1, \ldots, x_n]$, $x_i \in \mathbb{R}^d$
Goal: find an orthonormal linear transformation $T: \mathbb{R}^d \to \mathbb{R}^k$ that maximizes the data variance
  $T(x) = A x$ (linear transformation)
  $A A^\top = I$ (orthonormal basis)
  $\mathrm{Tr}(A C A^\top)$ (data variance)
Mathematical formulation:
  $A^{*} = \arg\max_{A \in \mathbb{R}^{k \times d},\; A A^\top = I}\; \mathrm{Tr}(A C A^\top)$

PCA: Solution
Eigen decomposition of $C$: $C = U D U^\top$
  $U = [u_1, u_2, \ldots, u_d]$
  $D = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$
$A = [u_1, u_2, \ldots, u_k]^\top$
$T(x) = A x = [u_1, u_2, \ldots, u_k]^\top x$
MATLAB function: princomp()

PCA: Solution (contd.)
Data variance after the transformation:
  $A X = [u_1, \ldots, u_k]^\top X$
  $A C A^\top = [u_1, \ldots, u_k]^\top U D U^\top [u_1, \ldots, u_k] = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$
  $\mathrm{Tr}(A C A^\top) = \sum_{i=1}^{k} \lambda_i$
Contribution of the $i$-th component to the data variance: $\lambda_i / \sum_{j=1}^{d} \lambda_j$
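The whole recipe (center, eigendecompose the covariance, keep the top-k eigenvectors) fits in a few lines of numpy. This is a minimal sketch with synthetic data, not a reference implementation:

import numpy as np

def pca(X, k):
    """X: d x n data matrix (one column per point). Returns (A, projected data)."""
    X = X - X.mean(axis=1, keepdims=True)           # center the data
    C = (X @ X.T) / X.shape[1]                      # sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # sort descending
    A = eigvecs[:, order[:k]].T                     # A = [u_1 ... u_k]^T
    return A, A @ X                                 # projection T(x) = A x

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 500))
A, Z = pca(X, k=3)

# A C A^T is (approximately) diagonal: PCA decorrelates the data
Xc = X - X.mean(axis=1, keepdims=True)
C = (Xc @ Xc.T) / X.shape[1]
print(np.round(A @ C @ A.T, 3))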

PCA: Properties
PCA decorrelates the dataset: $A C A^\top = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$
PCA gives the rank-$k$ reconstruction with minimum squared error
PCA is sensitive to the scaling of the original features

Singular Value Decomposition (SVD)
Dataset: $X = [x_1, \ldots, x_n]$, $x_i \in \mathbb{R}^d$
$X = U \Sigma V^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$, where $r = \mathrm{rank}(X)$
  $U \in \mathbb{R}^{d \times r}$ such that $U^\top U = I$ (left singular vectors)
  $V \in \mathbb{R}^{n \times r}$ such that $V^\top V = I$ (right singular vectors)
  $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r) \in \mathbb{R}^{r \times r}$ (singular values)
Low-rank approximation:
  $X_k = U \Sigma_k V^\top = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, where $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)$ and $k \le d$

SVD and Data Sphering
Centered dataset: $X = [x_1, \ldots, x_n]$, $x_i \in \mathbb{R}^d$
$X = U \Sigma V^\top = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$
$X X^\top = U \Sigma^2 U^\top = \sum_{i=1}^{r} \sigma_i^2 u_i u_i^\top$ (note that $\|u_i\| = 1$)
Let $U_k = [u_1, \ldots, u_k]$, $V_k = [v_1, \ldots, v_k]$, $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$, $k \le r$.
$\Sigma_k^{-1} U_k^\top X X^\top U_k \Sigma_k^{-1} = I$
$A X X^\top A^\top = I$ where $A = \Sigma_k^{-1} U_k^\top$
The linear transformation $A = \Sigma_k^{-1} U_k^\top$ decorrelates (spheres) the data set.

SVD and Eigen Decomposition
Dataset: $X = [x_1, \ldots, x_n]$, $x_i \in \mathbb{R}^d$
$X = U \Sigma V^\top$
$X X^\top = U \Sigma V^\top V \Sigma U^\top = U \Sigma^2 U^\top$ (eigen decomposition)
$X^\top X = V \Sigma U^\top U \Sigma V^\top = V \Sigma^2 V^\top$ (eigen decomposition)
SVD and PCA: the SVD of the centered $X$ is the same as PCA on $X$.
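A small numpy illustration of the truncated SVD and of the SVD/eigen correspondence above (synthetic centered data; names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 200))
X = X - X.mean(axis=1, keepdims=True)        # center, so the SVD of X corresponds to PCA

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # rank-k approximation sum_i s_i u_i v_i^T
print("rank-k reconstruction error:", np.linalg.norm(X - X_k))

# the eigen decomposition of X X^T has eigenvalues s_i^2 and eigenvectors u_i
eigvals = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
assert np.allclose(eigvals[:len(s)], s**2)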

Oriented Principal Components Analysis (OPCA)
A generalization of PCA
Along with the signal covariance $C_s$, a noise covariance $C_n$ is available
When $C_n = I$ (white noise), OPCA = PCA
Seeks projections that maximize the ratio of the variance of the projected signal to the variance of the projected noise
Mathematical formulation:
  $A^{*} = \arg\max_{A} \dfrac{\mathrm{Tr}(A C_s A^\top)}{\mathrm{Tr}(A C_n A^\top)}$

OPCA: Solution
Generalized eigenvalue problem: $C_s U = C_n U D$
Equivalent eigenvalue problem: $C_n^{-1/2} C_s C_n^{-1/2} V = V D$, where $V = C_n^{1/2} U$
  $U = [u_1, u_2, \ldots, u_d]$
  $D = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$
$A = [u_1, u_2, \ldots, u_k]^\top$
$T(x) = A x = [u_1, u_2, \ldots, u_k]^\top x$
MATLAB function: eig()
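A sketch of the OPCA solution using SciPy's symmetric generalized eigensolver; the signal and noise covariances below are synthetic stand-ins, and the parameter choices are illustrative:

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, k = 6, 2
S = rng.normal(size=(d, 3 * d)); C_s = S @ S.T / S.shape[1]                      # signal covariance
N = rng.normal(size=(d, 3 * d)); C_n = N @ N.T / N.shape[1] + 0.1 * np.eye(d)    # noise covariance

# generalized eigenvalue problem  C_s u = lambda C_n u
eigvals, eigvecs = eigh(C_s, C_n)            # eigenvalues in ascending order
A = eigvecs[:, ::-1][:, :k].T                # top-k generalized eigenvectors as rows

x = rng.normal(size=d)
print(A @ x)                                 # OPCA projection of a data point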

OPCA: Properties
Projections remain the same when the noise
and signal vectors are globally scaled with
two different scale factors
Projected data is not necessarily uncorrelated
Can be extended to multiview data [Platt et
al, EMNLP 2010]


Popular Feature Space Models
Vector Space Model
  A document is represented as a bag of words
  Features: words
  Feature weight: TF(w, d) or some variant
Word Space Model
  A word is represented in terms of its context words
  Features: words (with or without position)
  Feature weight: Freq(w_i, w_j)
Turney and Pantel 2010

Curse of Dimensionality
We have observations $x_i \in \mathbb{R}^d$, and $d$ is usually very large
Vector Space Models
  $d$ = vocabulary size (the number of words in a language)
Word Space Models
  $d$ = vocabulary size (if position is ignored)
  $d = |V| \times L$, where $L$ is the window length

Word Sense Discrimination
Aim: cluster the contexts of a word based on their meaning
Steps:
  1. Represent each word as a point in a vector space (word vectors)
  2. Represent each context as a point and reduce its dimensionality (context vectors)
  3. Cluster the points using a clustering algorithm (sense vectors)
Use words as the features; the feature weight is the co-occurrence strength

Word Sense Discrimination: Dimensionality Reduction
Reduce the dimensionality of the word vectors

  W =           legal   clothes
  judge           210        75
  robe             50       250
  law             240        50
  suit            147       157
  dismisses        96       152

$W = U \Sigma V^\top$
$W \leftarrow [u_1, \ldots, u_k]$ (keep the top-$k$ left singular vectors)
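The same reduction on the toy co-occurrence matrix above, as a numpy sketch (keeping a single latent dimension; purely illustrative):

import numpy as np

words = ["judge", "robe", "law", "suit", "dismisses"]
W = np.array([[210,  75],     # co-occurrence with "legal" and "clothes"
              [ 50, 250],
              [240,  50],
              [147, 157],
              [ 96, 152]], dtype=float)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_reduced = U[:, :1] * s[:1]          # W <- top-k left singular vectors (scaled), here k = 1
for word, value in zip(words, W_reduced[:, 0]):
    print(f"{word:10s} {value:8.2f}")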

Word Sense Discrimination: Results & Discussion
Averaged results (accuracy) on 20 words:

  chi-square weighting, term features   76
  chi-square weighting, SVD features    90
  frequency weighting, term features    81
  frequency weighting, SVD features     88

Schütze 1998

Part-of-Speech (POS) Tagging
Given a sentence, label its words with their POS tags:
  I/NN  ate/VB  an/DT  apple/NN  ./.
Unsupervised approaches
  Attempt to cluster words
  Align each cluster with a POS tag
  Do not assume a dictionary of tags
Schütze 1995, Lamar et al 2010

Part-of-Speech Tagging
Steps
  1. Represent words in an appropriate vector space (with dimensionality reduction)
  2. Cluster using your favorite algorithm
The vector space should capture syntactic properties
  Use the most frequent $d$ words as features
  Feature weight: the frequency of a word in the context

Part-of-Speech Tagging: Pass 1
Construct left and right context matrices $L$ and $R$, each of size $V \times d$
Dimensionality reduction: take a rank-$r_1$ approximation
  $L \approx U_L \Sigma_L V_L^\top$ (top $r_1$ components); $L^{*} = U_L \Sigma_L$; $L^{*} \leftarrow$ row-normalized $L^{*}$
  $R \approx U_R \Sigma_R V_R^\top$ (top $r_1$ components); $R^{*} = U_R \Sigma_R$; $R^{*} \leftarrow$ row-normalized $R^{*}$
$D = [L^{*}\; R^{*}]$ is a $V \times 2r_1$ matrix
Run weighted $k$-means on $D$ with $k_1$ clusters (see the sketch below)
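A minimal sketch of Pass 1, assuming `corpus` is a list of tokenized sentences; plain (unweighted) k-means from scikit-learn stands in for the weighted k-means of Lamar et al., and the default parameter values are illustrative:

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans   # stands in for the paper's weighted k-means

def pass1(corpus, d=500, r1=50, k1=17):
    """corpus: list of sentences, each a list of tokens. Returns (vocab, cluster labels)."""
    freq = Counter(w for sent in corpus for w in sent)
    vocab = [w for w, _ in freq.most_common()]            # all word types, by frequency
    feats = {w: i for i, w in enumerate(vocab[:d])}       # most frequent d words as features
    idx = {w: i for i, w in enumerate(vocab)}
    L = np.zeros((len(vocab), d)); R = np.zeros((len(vocab), d))
    for sent in corpus:
        for j, w in enumerate(sent):
            if j > 0 and sent[j - 1] in feats:
                L[idx[w], feats[sent[j - 1]]] += 1        # left-context counts
            if j + 1 < len(sent) and sent[j + 1] in feats:
                R[idx[w], feats[sent[j + 1]]] += 1        # right-context counts
    def reduce(M):
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        M_star = U[:, :r1] * s[:r1]                       # rank-r1 approximation U_r1 Sigma_r1
        return M_star / np.maximum(np.linalg.norm(M_star, axis=1, keepdims=True), 1e-12)
    D = np.hstack([reduce(L), reduce(R)])                 # V x 2*r1 descriptor matrix
    return vocab, KMeans(n_clusters=k1, n_init=10).fit_predict(D)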

Part-of-Speech Tagging: Pass 2
The Pass-1 clusters are not optimal because of sparsity
Construct $L$ and $R$ of size $V \times k_1$ (the context features are now the Pass-1 cluster labels)
Dimensionality reduction: take a rank-$r_2$ approximation
  $L \approx U_L \Sigma_L V_L^\top$ (top $r_2$ components); $L^{*} = U_L \Sigma_L$; $L^{*} \leftarrow$ row-normalized $L^{*}$
  $R \approx U_R \Sigma_R V_R^\top$ (top $r_2$ components); $R^{*} = U_R \Sigma_R$; $R^{*} \leftarrow$ row-normalized $R^{*}$
$D = [L^{*}\; R^{*}]$ is a $V \times 2r_2$ matrix
Run weighted $k$-means on $D$

Part-of-Speech Tagging: Results
Penn Treebank (1.1M tokens, 43K types); 17 and 45 tags; many-to-1 accuracies

                        PTB17         PTB45
  SVD2                  0.730         0.660
  HMM-EM                0.647         0.621
  HMM-VB                0.637         0.605
  HMM-GS                0.674         0.660
  HMM-Sparse(32)        0.702 (2.2)   0.654 (1.0)
  VEM (10^-1, 10^-1)    0.682 (0.8)   0.546 (1.7)

Lamar et al 2010

Part-of-Speech Tagging: Discussion
Sensitivity to parameters
  Scaling with the singular values
  The $k$-means algorithm
Weighted $k$-means
  Clusters are initialized to the most frequent word types
A non-disambiguating tagger
A very simple algorithm

Information Retrieval
Rank documents $d$ in response to a query $q$
Vector Space Model
  The query and the documents are represented as bags of words
  Features: words
  Feature weight: TF-IDF
Lexical gap
  Polysemy and synonymy

Information Retrieval: Lexical Gap
[Term $\times$ Document matrix $C$ with rows such as ship, boat, ocean, voyage, trip; with TF-IDF weighting, which document is the better match?]

Information Retrieval: Latent Semantic Analysis
Term $\times$ Document matrix $C$
Steps:
  1. Dimensionality reduction of the term $\times$ document matrix: $C \approx U_k \Sigma_k V_k^\top$
  2. Folding-in queries (and new documents):
     $d \approx U_k \Sigma_k \hat{d}$, so $\hat{d} = \Sigma_k^{-1} U_k^\top d$ and likewise $\hat{q} = \Sigma_k^{-1} U_k^\top q$
  3. Compute the semantic similarity:
     $\mathrm{score}(q, d) \leftarrow \mathrm{sim}(\hat{q}, \hat{d}) = \dfrac{\langle \hat{q}, \hat{d} \rangle}{\|\hat{q}\|\,\|\hat{d}\|}$, where $\langle \cdot, \cdot \rangle$ denotes the dot product

Deerwester 1988; Dumais 2005
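A compact numpy sketch of the three steps (SVD of a toy term-document count matrix, folding in a query, cosine scoring); the data and the choice of k are illustrative:

import numpy as np

C = np.random.default_rng(0).poisson(1.0, size=(50, 20)).astype(float)  # terms x documents
k = 5

U, s, Vt = np.linalg.svd(C, full_matrices=False)
Uk, Sk, Dk = U[:, :k], np.diag(s[:k]), Vt[:k].T        # Dk: reduced document vectors (docs x k)

def fold_in(q):
    """Map a term-space vector into the k-dimensional latent space: q_hat = S_k^-1 U_k^T q."""
    return np.linalg.inv(Sk) @ (Uk.T @ q)

q = np.zeros(50); q[[3, 7, 11]] = 1.0                  # toy query over three terms
q_hat = fold_in(q)
scores = Dk @ q_hat / (np.linalg.norm(Dk, axis=1) * np.linalg.norm(q_hat) + 1e-12)
print("top documents:", np.argsort(-scores)[:3])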

Information Retrieval: Lexical Gap Revisited
[Term $\times$ Document matrix $C$ with rows ship, boat, ocean, voyage, trip]

New document representations (after dimensionality reduction):

  ship   -1.62   -0.60   -0.44   -0.97   -0.70   -0.26
  boat   -0.46   -0.84   -0.30    1.00    0.35    0.65

Information Retrieval: Results & Discussion
Results on four test collections:

              MED    CRAN   CACM   CISI
  Cos+tfidf   49     35.2   21.9   20.2
  LSA         64.6   38.7   23.8   21.9
  PLSI-U      69.5   38.9   25.3   23.3
  PLSI-Q      63.2   38.6   26.6   23.1

Fold in new documents as well
  Deviates from the optimal as we add more documents
Probabilistic Latent Semantic Analysis: Hofmann 1999

Non-linear Dimensionality Reduction
Laplacian Eigenmaps
  Weight matrix $W$ with pairwise similarities over a local neighbourhood
  $D_{ii} = \sum_j W_{ij}$ and $L = D - W$
  $\min_{U} \mathrm{Tr}(U^\top L U)$ s.t. $U^\top D U = I$
  Generalized eigenvalue problem: $L u = \lambda D u$
  Note: $u^\top L u = \frac{1}{2} \sum_{i,j} W_{ij} (u_i - u_j)^2$
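A small sketch of Laplacian Eigenmaps on a nearest-neighbour graph with Gaussian similarities (the neighbourhood size and kernel width are illustrative choices, not values from the tutorial):

import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def laplacian_eigenmaps(X, dim=2, n_neighbors=10, sigma=1.0):
    """X: n x d points. Returns an n x dim embedding."""
    dist = cdist(X, X)
    W = np.exp(-dist**2 / (2 * sigma**2))              # Gaussian similarities
    # keep only each point's nearest neighbours (symmetrised)
    keep = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(len(X)), n_neighbors)
    mask[rows, keep.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = eigh(L, D)                      # generalized problem L u = lambda D u
    return eigvecs[:, 1:dim + 1]                       # skip the trivial constant eigenvector

X = np.random.default_rng(0).normal(size=(200, 5))
Y = laplacian_eigenmaps(X)
print(Y.shape)   # (200, 2)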

Neural Embeddings
Each word is represented as a vector of size $m$
Learning
  Optimize so that the log-likelihood is maximized
  Gradient ascent
  Learns the model parameters and the word vectors simultaneously
  The learned word vectors capture semantics
Learn to perform multiple tasks simultaneously
Bengio et al 2003; Collobert and Weston 2008

Canonical Correlation Analysis (CCA)
Centered datasets:
  $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d_X \times n}$, $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{d_Y \times n}$
Project $X$ and $Y$ along $a \in \mathbb{R}^{d_X}$ and $b \in \mathbb{R}^{d_Y}$:
  $s = (a^\top x_1, \ldots, a^\top x_n)$, $t = (b^\top y_1, \ldots, b^\top y_n)$
Data correlation after the transformation:
  $\cos(s, t) = \dfrac{a^\top X Y^\top b}{\sqrt{a^\top X X^\top a}\,\sqrt{b^\top Y Y^\top b}}$

CCA (contd.)
Covariance matrices:
  $C_{XY} = X Y^\top$, $C_{XX} = X X^\top$, $C_{YY} = Y Y^\top$
Correlation in terms of the covariance matrices:
  $\cos(s, t) = \dfrac{a^\top C_{XY} b}{\sqrt{a^\top C_{XX} a\; b^\top C_{YY} b}}$
Directions that maximize the data correlation:
  $(a^{*}, b^{*}) = \arg\max_{a, b} \dfrac{a^\top C_{XY} b}{\sqrt{a^\top C_{XX} a\; b^\top C_{YY} b}}$

CCA: Formulation
Goal: find linear transformations $A$, $B$ that maximize the data correlation
Optimization problem:
  $(A^{*}, B^{*}) = \arg\max_{A, B} \mathrm{Tr}(A^\top X Y^\top B)$
  subject to $A^\top X X^\top A = I$ and $B^\top Y Y^\top B = I$

CCA: Solution
Generalized eigenvalue problem:
  $C_{XY} B = C_{XX} A D$
  $C_{YX} A = C_{YY} B D$
It can be shown that the two problems share the same $D$.
  $B = C_{YY}^{-1} C_{YX} A D^{-1}$
  $C_{XY} C_{YY}^{-1} C_{YX} A = C_{XX} A D^{2}$
MATLAB function: canoncorr()
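A numpy/SciPy sketch of this solution via the symmetric generalized eigenproblem above; a small ridge term is added for numerical stability, and all names and data are illustrative:

import numpy as np
from scipy.linalg import eigh

def cca(X, Y, k, reg=1e-6):
    """X: dx x n, Y: dy x n, both centered. Returns projection matrices A (dx x k), B (dy x k)."""
    n = X.shape[1]
    Cxx = X @ X.T / n + reg * np.eye(X.shape[0])
    Cyy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
    Cxy = X @ Y.T / n
    # C_xy C_yy^-1 C_yx A = C_xx A D^2   (symmetric generalized eigenproblem)
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    eigvals, eigvecs = eigh(M, Cxx)
    A = eigvecs[:, ::-1][:, :k]                        # top-k directions
    corrs = np.sqrt(np.clip(eigvals[::-1][:k], 0, 1))  # canonical correlations D
    B = np.linalg.solve(Cyy, Cxy.T @ A) / corrs        # B = C_yy^-1 C_yx A D^-1
    return A, B

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 500))                          # shared latent signal
X = rng.normal(size=(8, 3)) @ Z + 0.1 * rng.normal(size=(8, 500))
Y = rng.normal(size=(6, 3)) @ Z + 0.1 * rng.normal(size=(6, 500))
X -= X.mean(axis=1, keepdims=True); Y -= Y.mean(axis=1, keepdims=True)
A, B = cca(X, Y, k=2)
print(np.corrcoef((A.T @ X)[0], (B.T @ Y)[0])[0, 1])   # close to 1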

Bilingual Document Projections
Applications:
  Comparable and parallel document retrieval
  Cross-language text categorization
Steps:
  1. Represent each document as a vector (two different vector spaces, one per language)
  2. Use CCA to find linear transformations $(A, B)$
  3. Find newly aligned documents using $A$ and $B$

Bilingual Document Projections
Steps:
  1. Represent each document as a vector
Vector space:
  Features: the most frequent 20K content words
  Feature weight: TF-IDF weighting
Training data:
  $x_i \in \mathbb{R}^{d_E}$: bag of English words; $y_i \in \mathbb{R}^{d_H}$: bag of Hindi words
  $(x_i, y_i)$, $i = 1, \ldots, n$; $X = [x_1, x_2, \ldots, x_n]$, $Y = [y_1, y_2, \ldots, y_n]$

Bilingual Document Projections
Steps:
  1. Represent each document as a vector
  2. Use CCA to find linear transformations $A$ and $B$
  3. Find newly aligned documents using $A$ and $B$
Scoring:
  $\mathrm{score}(x, y) \leftarrow \langle A^\top x, B^\top y \rangle$
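Once $A$ and $B$ are learned (for example with the `cca` helper sketched earlier, which is an assumption of this illustration rather than tutorial code), mate retrieval is cosine search in the shared space:

import numpy as np

def mate_retrieval(A, B, X_en, Y_hi):
    """Rank Hindi documents (columns of Y_hi) for each English document (columns of X_en)."""
    E = A.T @ X_en                                    # project into the shared CCA space
    H = B.T @ Y_hi
    E /= np.linalg.norm(E, axis=0, keepdims=True) + 1e-12
    H /= np.linalg.norm(H, axis=0, keepdims=True) + 1e-12
    scores = E.T @ H                                  # cosine similarities, one row per English doc
    return np.argsort(-scores, axis=1)                # ranked Hindi indices per English doc

# usage: ranks = mate_retrieval(A, B, X_test_en, Y_test_hi); ranks[i, 0] is the top match for doc i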

Bilingual Document Projections: Results & Discussion

                        Accuracy   MRR
  OPCA                  72.55      77.34
  Word-by-word          70.33      74.67
  CCA                   68.94      73.78
  Word-by-word (5000)   67.86      72.36
  CL-LSI                53.02      61.30
  Untranslated          46.92      53.83
  CPLSA                 45.79      51.30
  JPLSA                 33.22      36.19

Platt et al 2010

Mining Word-Level Translations
Training data: word-level seed translations

  English    Spanish    P(s|e)
  state      estado     0.5
  state      declarar   0.3
  society    sociedad   0.4
  society    compañía   0.35
  company    sociedad   0.8

Task: mine translations for new words (words that are not in the seed lexicon)
Resources: monolingual comparable corpora

Mining Word-Level Translations
Applications:
  Lexicon induction for resource-poor languages
  Mining translations for unknown words in MT
Steps:
  1. Start from the training data of seed translation pairs
  2. Represent each word as a vector (two different feature spaces, one per language)
  3. Use CCA to find transformations $A$ and $B$
  4. Use $A$ and $B$ to mine new word translations

Mining Word-Level Translations
Steps:
  1. Start from the training data of seed translation pairs
  2. Represent each word as a vector
Vector space:
  Features: context words (word space model); orthographic features
  Feature weights: TF-IDF weights
  Can be computed using ONLY comparable corpora
$(x_i, y_i)$, $i = 1, \ldots, n$; $X = [x_1, x_2, \ldots, x_n]$, $Y = [y_1, y_2, \ldots, y_n]$

Mining Word-Level Translations
Steps:
  1. Start from the training data of seed translation pairs
  2. Represent each word as a vector
  3. Use CCA to find transformations $A$ and $B$
  4. Use $A$ and $B$ to mine new word translations
Scoring:
  $\mathrm{score}(e, s) = \langle A^\top x_e, B^\top y_s \rangle$

Mining Word-Level Translations: Results & Discussion
Seed lexicon of size 100; bootstrapping; best-F1

  EditDist   58.6   62.6   61.1
  Ortho      76.0   81.3   80.1   52.3   55.0
  Context    91.1   81.3   80.2   65.3   58.0
  Both       87.2   89.7   89.0   89.7   72.0   47.4

Results are lower for other language pairs
Haghighi et al 2008

Mining Word-Level Translations: Results & Discussion
Mining translations for unknown (OOV) words for MT domain adaptation
MT accuracies (BLEU):

          French                  German
          Baseline  +ve change    Baseline  +ve change
  News    23.00     0.80          27.30     0.36
  Emea    26.62     1.44          40.46     1.51
  Subs    10.26     0.13          16.91     0.61
  PHP     38.67     0.28          28.12     0.68

Daumé and Jagarlamudi 2011

Supervised Semantic Indexing
Task: learn to rank ads $a$ for a given document $d$
Training data:
  Pairs $(d, a^{+})$ of web pages and clicked ads
  Randomly chosen pairs $(d, a^{-})$
Steps:
  1. Represent an ad $a$ and a document $d$ as vectors
  2. Learn a scoring function $f(a, d)$
  3. Rank the ads for a given document
Bai et al 2009

Supervised Semantic Indexing
Steps:
  1. Represent ads and documents as vectors
Vector space:
  Bag-of-words representation
  Features: words
  Feature weights: TF-IDF weight
  $a$ and $d$ are vectors of size $V$

Supervised Semantic Indexing
Steps:
  1. Represent ads and documents as vectors
  2. Learn a scoring function $f(a, d)$
Scoring function: $f(a, d) = d^\top W a$, with parameters $W \in \mathbb{R}^{V \times V}$
  $W = I$: cosine-style similarity
  $W = D$ (diagonal): reweighting of words
  $W = U^\top V + I$: dimensionality reduction, with a different treatment of ads and documents
  $W = U^\top U + I$: dimensionality reduction, with the SAME treatment of ads and documents

Supervised Semantic Indexing: Learn the Scoring Function
Max-margin objective: $f(d, a^{+}) - f(d, a^{-}) > 1$
  $\min_{W} \sum \max\big(0,\; 1 - f(d, a^{+}) + f(d, a^{-})\big)$
Optimized with sub-gradient descent (see the sketch below)
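A sketch of the low-rank SSI scorer and a single sub-gradient step on the margin ranking loss; the learning rate, initialization, and sampling of (d, a+, a-) triples are placeholders, not the settings of Bai et al.:

import numpy as np

class SSI:
    """Score f(a, d) = d^T (U^T V + I) a with low-rank factors U, V (k x vocab)."""
    def __init__(self, vocab, k, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.U = 0.01 * rng.normal(size=(k, vocab))
        self.V = 0.01 * rng.normal(size=(k, vocab))
        self.lr = lr

    def score(self, d, a):
        return (self.U @ d) @ (self.V @ a) + d @ a

    def step(self, d, a_pos, a_neg):
        """Sub-gradient step on max(0, 1 - f(d, a+) + f(d, a-))."""
        loss = 1.0 - self.score(d, a_pos) + self.score(d, a_neg)
        if loss > 0:
            Ud = self.U @ d
            self.U += self.lr * np.outer(self.V @ (a_pos - a_neg), d)
            self.V += self.lr * np.outer(Ud, a_pos - a_neg)
        return max(loss, 0.0)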

Supervised Semantic Indexing
Steps:
  1. Represent ads and documents as vectors
  2. Learn a scoring function $f(a, d)$
  3. Rank the ads for a given document
Ranking ads: compute the score with $f(a, d)$ and rank

Supervised Semantic Indexing: Results & Discussion
1.9M pairs for training, 100K pairs for testing

                             Parameters   Rank Loss
  TFIDF                                   45.60
  SSI: $W = U^\top V + I$    50 x 10k     25.83
  SSI: $W = U^\top V + I$    50 x 20k     26.68
  SSI: $W = U^\top V + I$    50 x 30k     26.98

Bai et al 2009

Supervised Semantic Indexing: Results & Discussion
Ranking Wikipedia pages for queries (Rank Loss):

                                     K=5    K=10   K=20
  TFIDF                              21.6   14.0   9.14
  $\alpha$LSI + $(1-\alpha)$TFIDF    14.2   9.73   6.36
  SSI: $W = U^\top U + I$            4.80   3.10   1.87
  SSI: $W = U^\top V + I$            4.37   2.91   1.80

SSI performs better when the training data is big
Bai et al 2009

Discriminative Reranking
Input $x \in \mathbb{R}^{d_1}$ and candidate output $y \in \mathbb{R}^{d_2}$
Joint feature vector: the outer product of $x$ and $y$, a vector of length $d_1 \times d_2$ with entries $x_i y_j$
The reranker operates in this outer-product space
The weight vector is constrained to an outer-product form, $w = a \otimes b$ [Bai et al. 10]
[Szedmak et al., 2006; Wang et al., 2007]

Low-Dimensional Reranking
Find $A$ and $B$ such that, in the projected space $(A^\top x, B^\top y)$, the correct candidate scores highest

Idea:
  1. Score: $\mathrm{score}(x, y) = a^\top x\, y^\top b$
  2. Add constraints to penalize incorrect candidates:
     $\mathrm{score}(x, y^{*}) \ge \mathrm{score}(x, y) + 1 - \xi$, i.e., margin $m \ge 1 - \xi$
[Tsochantaridis et al. 04]

Discriminative Model
Initialize the Lagrangian variables
Repeat
  $(A, B) \leftarrow$ Softened-Disc        (get the current solution)
  $m_i = \mathrm{score}(x_i, y_i^{*}) - \mathrm{score}(x_i, y_i)$        (compute the margins)
  $\xi_i = \max(0,\, 1 - m_i)$        (compute the slacks)
  If $\xi_i > 0$, update the corresponding Lagrangian variable
Until convergence (the slack doesn't change)

POS Tagging
Combine with the Viterbi score; the interpolation parameter is tuned
Training: input sentences and reference tag sequences; candidates with their score and loss values

  Sentence: Buyers stepped in to the futures pit
  Score      Candidate tag sequence
  -0.1947    NNS VBD RP TO DT NNS NN
  -6.8068    NNS VBD RB TO DT NNS NN
  -7.0514    NNS VBD IN TO DT NNS NN
  -7.1408    NNS VBD RP TO DT NNS VB
  -13.752    NNS VBD RB TO DT NNS VB

Testing: rerank the candidate tag sequences

POS Tagging
Combine with the Viterbi score; the interpolation parameter is tuned
Results (tagging accuracy):

               English   Chinese   French   Swedish
  Baseline     96.15     92.31     97.41    93.23
  Collins      96.06     92.81     97.35    93.44
  Regularized  96.00     92.88     97.38    93.35
  Oracle       98.39     98.19     99.00    96.48

POS Tagging
Combine with the Viterbi score; the interpolation parameter is tuned
Results (tagging accuracy):

                  English   Chinese   French   Swedish
  Baseline        96.15     92.31     97.41    93.23
  Collins         96.06     92.81     97.35    93.44
  Regularized     96.00     92.88     97.38    93.35
  Softened-Disc   96.32     92.87     97.53    93.24
  Discriminative  96.3      92.91     97.53    93.36
  Oracle          98.39     98.19     99.00    96.48

POS Tagging: Gains over the Baseline

                   English   Chinese   French   Swedish
  Softened-Disc    +0.17     +0.56     +0.12    +0.01
  Discriminative   +0.15     +0.6      +0.12    +0.13
  Softened-Disc*   +0.92     +4.31     +1.12    +0.08
  Discriminative*  +0.88     +4.77     +0.9     +0.73

Interpolation with the Viterbi score is crucial
Softened-Disc
  Independent of the number of training examples
  Easy to code and can be solved exactly
Jagarlamudi and Daumé 2012

Similarity Search: Challenges
Computing nearest neighbors in high dimensions using geometric search techniques is very difficult
  All such methods degrade to brute-force linear search, which is expensive
  Approximate techniques such as ANN perform efficiently in dimensions as high as 20; in higher dimensions, the results are rather spotty
Need to run the search on commodity hardware
Cross-language search

What is the advantage?
Scales easily to very large databases
Compact, language-independent representation
  32 bits per object
Search is effective and efficient
  Hamming nearest-neighbor search
  A few milliseconds per query for searching a million objects (single thread on a single processor)

What is the challenge?
Language/script-independent hash codes
Learning hash functions from training data

Learning Hash Functions: Summary
Given a set of parallel names as training data, find the top $K$ projection vectors for each language using Canonical Correlation Analysis.
Each projection vector gives a 1-bit hash function.
The hash code for a name is computed by projecting its feature vector onto the projection vectors, followed by binarization.
Udupa & Kumar, 2010
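A sketch of the resulting hashing and search steps; it assumes a feature vector `x` for a name and a learned matrix `A` whose 32 columns are the CCA projection vectors for that language (both are assumptions of this illustration):

import numpy as np

def hash_code(x, A):
    """x: feature vector of a name; A: d x 32 matrix of CCA projection vectors.
    Each projection gives one bit; the sign of the projection is the bit value."""
    bits = (A.T @ x) > 0
    return np.packbits(bits.astype(np.uint8)).view(np.uint32)[0]   # 32-bit hash code

def hamming_search(query_code, codes, top=10):
    """codes: array of uint32 database hash codes. Returns indices of the nearest codes."""
    dist = np.array([bin(int(query_code) ^ int(c)).count("1") for c in codes])
    return np.argsort(dist)[:top]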

Fuzzy Name Search: Experimental Setup
Test sets:
  DUMBTIONARY: 1231 misspelled names
  INTRANET: 200 misspelled names
Name directories:
  DUMBTIONARY: 550K names from Wikipedia
  INTRANET: 150K employee names
Training data:
  15K pairs of single-token names in English and Hindi
Baselines:
  Two popular search engines, Double Metaphone, BM25

Multilingual: Experimental Setup
Test sets:
  1000 multi-word names each in Russian, Hebrew, Kannada, Tamil, and Hindi
Name directory:
  English Wikipedia titles (6 million titles, 2 million unique words)
Baseline:
  State-of-the-art machine transliteration (NEWS 2009)

Dimensionality Reduction
[Chart: number of dimensionality-reduction papers per year, 1990-2010, in vision vs. NLP venues]

Dimensionality Reduction
[Chart: popularity of dimensionality reduction compared to Bayesian approaches, 1990-2010, vision vs. NLP]

Summary
Dimensionality reduction has merits for NLP
  Computational benefits and feature correlations
It has traditionally been explored in an unsupervised fashion
  But there are recent novel developments, e.g., for multi-view data
If you can formulate your problem as a mapping, try dimensionality reduction
  You can often solve for the global optimum

Summary
Spectral learning
  Provides a way to learn the global optimum for generative models
Enriching the existing models
  Using word embeddings instead of words
Scalability of the techniques
  Doesn't depend on the number of training examples
  Large-scale SVD

References

HOTELLING, Harold, 1933. Analysis of a Complex of Statistical Variables into Principal Components. Journal
of Educational Psychology, 24(6 & 7), 417-441 & 498-520.
Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), Numerical Recipes: The Art of Scientific
Computing (3rd ed.), New York: Cambridge University Press.
John C. Platt, Kristina Toutanova, and Wen-tau Yih. 2010. Translingual document representations from
discriminative projections. In Proceedings of EMNLP 2010, pages 251-261.
Hyvärinen, A.; Oja, E. Independent component analysis: algorithms and applications. Neural Networks,
Volume 13, Issues 4-5, May-June 2000, Pages 411-430.
Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. "Latent Dirichlet allocation". Journal of
Machine Learning Research 3 (4-5), 2003: pp. 993-1022.
S. T. Roweis and L. K. Saul, Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science Vol
290, 22 December 2000, 2323-2326.
Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics 24, 1, 97-124.
Hinrich Schütze. 1995. Distributional part-of-speech tagging. In EACL 7, pages 141-148.
Michael Lamar, Yariv Maron, Mark Johnson, Elie Bienenstock, SVD and clustering for unsupervised POS
tagging, In ACL 2010, p. 215-219, July 11-16.
Deerwester, S., et al, Improving Information Retrieval with Latent Semantic Indexing, Proceedings of the
51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36-40.
Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology
38: 188.
Thomas Hofmann, Probabilistic Latent Semantic Indexing, In SIGIR 1999

References

J. B. Tenenbaum, Vin de Silva, and John C. Langford, "A Global Geometric Framework for Nonlinear
Dimensionality Reduction," Science, vol. 290, 22 December 2000.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation.
Neural Computation, 15, 1373-1396.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of
Artificial Intelligence Research.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Janvin, A neural probabilistic language
model, The Journal of Machine Learning Research, 3, 2003.
Joseph Turian, Lev Ratinov, Yoshua Bengio, Word representations: a simple and general method for semi-supervised learning, In ACL 2010, p. 384-394, July 11-16, 2010.
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing
(almost) from scratch. JMLR, 2011.
Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28:321-377.
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from
monolingual corpora. In ACL, pages 771-779.
Hal Daumé III and Jagadeesh Jagarlamudi, Domain adaptation for machine translation by mining unseen
words, In ACL 2011, June 19-24.
Bing Bai , Jason Weston , David Grangier , Ronan Collobert , Kunihiko Sadamasa , Yanjun Qi , Olivier
Chapelle , Kilian Weinberger, Supervised semantic indexing, In CIKM 2009, November 02-06, Hong Kong,
China


References

Jagadeesh Jagarlamudi and Hal Daumé III, Low-Dimensional Discriminative Reranking, in HLT-NAACL 2012.
Shaishav Kumar and Raghavendra Udupa, Learning Hash Functions for Cross-View Similarity Search, in
IJCAI-11, IJCAI, 20 July 2011
Raghavendra Udupa and Shaishav Kumar, Hashing-based Approaches to Spelling Correction of Personal
Names, in Proceedings of EMNLP 2010, October 2010
Raghavendra Udupa and Mitesh Khapra, Transliteration Equivalence using Canonical Correlation Analysis,
in ECIR 2010, 2010
Jagadeesh Jagarlamudi and Hal Daumé III, Regularized Interlingual Projections: Evaluation on Multilingual
Transliteration, in Proceedings of EMNLP-CoNLL 2012.

