Revisiting Dimensionality Reduction Techniques for NLP
jags@umiacs.umd.edu, raghavu@microsoft.com
ABSTRACT
Many natural language processing (NLP) applications represent words and documents as vectors in a very high-dimensional space. The inherently high-dimensional nature of these applications leads to sparse vectors, resulting in poor performance in downstream applications. Dimensionality reduction aims to find a lower-dimensional subspace (or simply subspace) that captures the essential information required by the downstream applications. Although it received a lot of attention early on, its popularity in NLP, unlike in other fields such as computer vision, has declined over time. This is partly because, traditionally, it was studied in an unsupervised fashion, and hence the learnt subspace may not be optimal for the task at hand. But recent advances in learning low-dimensional representations in the presence of input and output variables enable us to learn task-specific subspaces that are as effective as state-of-the-art approaches. In this tutorial, we aim to demonstrate the simplicity and effectiveness of these techniques on a diverse set of NLP tasks. By the end of the tutorial, we hope the attendees will be able to decide "whether or not dimensionality reduction can help their task, and if so, how?".
The tutorial begins with an introduction to dimensionality reduction and its importance to NLP. As many of the dimensionality reduction techniques discussed in the tutorial make use of linear algebra, we discuss some important concepts, including linear transformations, positive definite matrices, and eigenvalues and eigenvectors. Next, we look at some important dimensionality reduction techniques for data with a single view (PCA, SVD, OPCA). We then take up applications of these techniques to some important NLP problems (word-sense discrimination, POS tagging, and information retrieval). As NLP often involves more than one language, we look at dimensionality reduction of multiview data using Canonical Correlation Analysis. We discuss some interesting applications of multiview dimensionality reduction (bilingual document projections and mining word-level translations). We also discuss some advanced topics in dimensionality reduction, such as non-linear and neural techniques, and some application-inspired techniques, such as discriminative reranking, supervised semantic analysis, and multilingual hashing.
We do not assume attendees know anything about dimensionality reduction (though
the tutorial should be interesting even to those who know some), but we *do* assume some
basic knowledge of linear algebra.
Revisiting Dimensionality Reduction Techniques for NLP (Tutorial Notes), pages 175,
COLING 2012, Mumbai, December 2012.
Road Map
Introduction
NLP and Dimensionality Reduction
Mathematical Background
Data with Single View
Techniques
Applications
Advanced Topics
Data with Multiple Views
Techniques
Applications
Advanced Topics
Summary
Dimensionality Reduction: Motivation
Many applications involve high-dimensional (and often sparse) data
High-dimensional data poses several challenges:
Computational cost
Difficulty of interpretation
Overfitting
Dimensionality Reduction: Goal
Given high-dimensional data, discover the underlying low-dimensional structure
[Figure: 2D embedding of face images; He et al., Face Recognition Using LaplacianFaces]
Dimensionality Reduction: Benefits
Computational efficiency: k-nearest-neighbor search
Data compression: less storage; millions of data points in RAM
Data visualization: 2D and 3D scatter plots
Dimensionality Reduction: Techniques
Projective methods find low-dimensional projections that extract useful information from the data by maximizing a suitable objective function
Examples: PCA, ICA, LDA
Dimensionality Reduction: Relevance to NLP
High-dimensional data in NLP: text documents, context vectors
Data Centering
Dataset: $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
Mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Centering: $x_i \leftarrow x_i - \mu$
Mean after centering: $\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu) = \mu - \mu = 0$
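A minimal numpy sketch of this step (the shapes and variable names are illustrative, not from the tutorial):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 5, 100                        # dimensionality d, number of points n
    X = rng.normal(size=(d, n))          # columns are the data points x_1 ... x_n

    mu = X.mean(axis=1, keepdims=True)   # mean vector mu (d x 1)
    Xc = X - mu                          # centering: x_i <- x_i - mu

    # the mean after centering is the zero vector (up to floating-point error)
    assert np.allclose(Xc.mean(axis=1), 0.0)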
Data Variance
Dataset: $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
Centered: $\frac{1}{n}\sum_{i=1}^{n} x_i = 0$
Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \lVert x_i \rVert^2 = \frac{1}{n}\mathrm{Tr}(XX^T) = \mathrm{Tr}(C)$, where $C = \frac{1}{n}XX^T$
Centering doesn't change the data variance
Transformed dataset: $AX$
Variance after transformation: $\frac{1}{n}\sum_{i=1}^{n} \lVert Ax_i \rVert^2 = \frac{1}{n}\mathrm{Tr}(AXX^TA^T) = \mathrm{Tr}(ACA^T)$
Both $C$ and $ACA^T$ are positive semi-definite
Cholesky decomposition: $M = LL^T$
Trace: $\mathrm{Tr}(M) = \sum_{i=1}^{d} M_{ii} = \sum_i \lambda_i$
Rank: number of non-zero eigenvalues
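The trace identity for the transformed variance can be checked numerically; a small sketch (shapes illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    d, k, n = 5, 2, 100
    X = rng.normal(size=(d, n))
    X -= X.mean(axis=1, keepdims=True)   # centered data
    C = (X @ X.T) / n                    # covariance matrix C = (1/n) X X^T

    A = rng.normal(size=(k, d))          # an arbitrary linear transformation

    # mean squared norm of the transformed points equals Tr(A C A^T)
    var_transformed = np.mean(np.sum((A @ X) ** 2, axis=0))
    assert np.isclose(var_transformed, np.trace(A @ C @ A.T))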
Eigensystem of Positive-Definite Matrices
$M \in \mathbb{R}^{d \times d}$
Positive eigenvalues: $\lambda_i > 0$
Real-valued eigenvectors: $u_i \in \mathbb{R}^d$
Orthonormal eigenvectors: $u_i^T u_j = 0$ for $i \neq j$ and $u_i^T u_i = 1$ (i.e., $U^T U = I$)
Full rank: $\mathrm{rank}(M) = d$
Eigendecomposition: $M = UDU^T$
Data variance in terms of eigenvalues: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \lVert x_i \rVert^2 = \mathrm{Tr}(C) = \sum_i \lambda_i$
PCA: Formulation
Find a linear transformation $A$ onto an orthonormal basis that preserves maximum data variance
Mathematical formulation: $A^* = \arg\max_{A \in \mathbb{R}^{k \times d},\ AA^T = I} \mathrm{Tr}(ACA^T)$
PCA: Solution
Eigendecomposition of $C$: $C = UDU^T$, with $U = [u_1\ u_2\ \dots\ u_d]$ and $D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$
Take $A = [u_1\ u_2\ \dots\ u_k]^T$; the projection is $x \mapsto Ax = (u_1^T x, \dots, u_k^T x)^T$
MATLAB function: princomp()
Projected covariance: $AXX^TA^T = [u_1 \dots u_k]^T\, UDU^T\, [u_1 \dots u_k] = \mathrm{diag}(\lambda_1, \dots, \lambda_k)$
Variance after projection: $\mathrm{Tr}(ACA^T) = \sum_{i=1}^{k} \lambda_i$
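The eigendecomposition route to PCA takes only a few lines of numpy; a minimal sketch (MATLAB's princomp wraps an equivalent computation):

    import numpy as np

    def pca(X, k):
        """Top-k PCA of centered data X (d x n): returns A (k x d) and A X."""
        C = (X @ X.T) / X.shape[1]       # covariance matrix
        lam, U = np.linalg.eigh(C)       # eigenvalues in ascending order
        A = U[:, ::-1][:, :k].T          # rows = top-k eigenvectors u_1 ... u_k
        return A, A @ X

    # the projected data is decorrelated: A C A^T = diag(lambda_1 ... lambda_k)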
PCA: Properties
PCA decorrelates the dataset: $ACA^T = \mathrm{diag}(\lambda_1, \dots, \lambda_k)$
Singular Value Decomposition (SVD)
For $X \in \mathbb{R}^{d \times n}$: $X = U\Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T$, where $r = \mathrm{rank}(X)$
$U \in \mathbb{R}^{d \times r}$ such that $U^T U = I$ (left singular vectors)
$V \in \mathbb{R}^{n \times r}$ such that $V^T V = I$ (right singular vectors)
$\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r) \in \mathbb{R}^{r \times r}$ (singular values)
Rank-$k$ approximation ($k \le r$): $X_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, i.e., $\Sigma_k = \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0)$
Note that $XX^T = \sum_{i=1}^{r} \sigma_i^2 u_i u_i^T$: the left singular vectors of $X$ are the eigenvectors of $XX^T$
Whitening: $\Sigma^{-1} U^T XX^T U \Sigma^{-1} = I$
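A rank-k approximation via numpy's SVD, as a minimal sketch:

    import numpy as np

    def low_rank(X, k):
        """Best rank-k approximation of X (Eckart-Young theorem)."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]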
OPCA: Formulation
Maximize the signal variance relative to the noise variance: $A^* = \arg\max_A \mathrm{Tr}(A C_s A^T)$ subject to $A C_n A^T = I$, where $C_s$ is the signal covariance and $C_n$ the noise covariance
OPCA: Solution
Generalized eigenvalue problem: $C_s U = C_n U D$
Equivalent eigenvalue problem: $C_n^{-1/2} C_s C_n^{-1/2} V = VD$, where $V = C_n^{1/2} U$
$U = [u_1\ u_2\ \dots\ u_d]$, $D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$
Take $A = [u_1\ u_2\ \dots\ u_k]^T$; project $x \mapsto Ax$
MATLAB function: eig()
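Assuming signal and noise covariance estimates are at hand, the generalized eigenproblem can be solved directly; a sketch using scipy (scipy.linalg.eigh accepts a second matrix for the generalized symmetric problem):

    import numpy as np
    from scipy.linalg import eigh

    def opca(C_signal, C_noise, k):
        """Top-k OPCA directions: solve C_signal u = lambda * C_noise u."""
        lam, U = eigh(C_signal, C_noise)   # generalized symmetric eigenproblem
        return U[:, ::-1][:, :k].T         # k x d projection matrix A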
OPCA: Properties
Projections remain the same when the noise and signal vectors are globally scaled by two different scale factors
The projected data is not necessarily uncorrelated
Can be extended to multiview data [Platt et al., EMNLP 2010]
Curse of Dimensionality
We have observations $x_i \in \mathbb{R}^d$
$d$ is usually very large
Vector space models: $d$ = vocabulary size (the number of words in a language)
Word-Sense Discrimination
Steps:
1. Word vectors (vector space)
2. Context vectors
3. Sense vectors
Dimensionality reduction is applied to the word vectors
Dimensionality Reduction
Reduce the dimensionality of word vectors
Example word-by-context co-occurrence matrix $W$:

           legal  clothes
judge      210    75
robe       50     250
law        240    50
suit       147    157
dismisses  96     152

$W = U\Sigma V^T$; project the word vectors onto the top-$k$ left singular vectors $[u_1 \dots u_k]$

Results (word-sense discrimination accuracy, Schütze 1998):

                   Accuracy (%)
chi-square, terms  76
chi-square, SVD    90
Frequency, terms   81
Frequency, SVD     88
Part-of-Speech Tagging
Example: ate/VB an/DT apple/NN ./.
Unsupervised approaches:
Attempt to cluster words
Align each cluster with a POS tag
Do not assume a dictionary of tags
(Schütze 1995, Lamar et al. 2010)
Part-of-Speech Tagging: Steps
1. Represent words in an appropriate vector space
2. Dimensionality reduction

Part-of-Speech Tagging: Pass 1
Construct left and right context matrices $L$ and $R$ of size $V \times d$
Dimensionality reduction: take the rank-$r_1$ approximation
$L = U_L \Sigma_L V_L^T$; descriptor $L^* = U_L \Sigma_L$; $L^* \leftarrow$ row-normalized $L^*$
$R = U_R \Sigma_R V_R^T$; descriptor $R^* = U_R \Sigma_R$; $R^* \leftarrow$ row-normalized $R^*$
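A sketch of this rank-r descriptor step (names are illustrative; Lamar et al. additionally scale and cluster these descriptors):

    import numpy as np

    def reduced_descriptors(M, r):
        """Rank-r descriptors of a context matrix M (V x d): rows of U_r Sigma_r,
        normalized to unit length (all-zero rows are left as-is)."""
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        D = U[:, :r] * s[:r]                # one descriptor row per word type
        norms = np.linalg.norm(D, axis=1, keepdims=True)
        return D / np.where(norms == 0.0, 1.0, norms)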
Part-of-Speech Tagging: Pass 2
The Pass-1 clusters are not optimal because of sparsity
Construct new context matrices $L$ and $R$ of size $V \times k_1$, using the Pass-1 clusters as contexts
Dimensionality reduction: take the rank-$r_2$ approximation
$L = U_L \Sigma_L V_L^T$; descriptor $L^* = U_L \Sigma_L$; $L^* \leftarrow$ row-normalized $L^*$
$R = U_R \Sigma_R V_R^T$; descriptor $R^* = U_R \Sigma_R$; $R^* \leftarrow$ row-normalized $R^*$
Part-of-Speech Tagging: Results
Penn Treebank (1.1M tokens, 43K types); 17 and 45 tags; many-to-1 accuracies:

                   PTB17        PTB45
SVD2               0.730        0.660
HMM-EM             0.647        0.621
HMM-VB             0.637        0.605
HMM-GS             0.674        0.660
HMM-Sparse(32)     0.702 (2.2)  0.654 (1.0)
VEM(10^-1, 10^-1)  0.682 (0.8)  0.546 (1.7)

(Lamar et al. 2010)
Part-of-Speech Tagging: Discussion
Sensitivity to parameters
Scaling with singular values
k-means algorithm: weighted k-means, with clusters initialized to the most frequent word types
A non-disambiguating tagger
A very simple algorithm
Information Retrieval
Rank documents $d$ in response to a query $q$
Vector space model: query and documents are represented as bags of words
Features: words; feature weight: TF-IDF
Lexical gap: polysemy and synonymy
Information Retrieval: Lexical Gap
Term × document matrix $C$ (terms: ship, boat, ocean, voyage, trip)
TF-IDF weighting
Which document is better?
Information Retrieval: Latent Semantic Analysis
Term × document matrix $C$
Steps:
1. Dimensionality reduction: $C = U\Sigma V^T$; keep the top-$k$ singular directions
2. Folding-in documents and queries: $\hat{d} = \Sigma_k^{-1} U_k^T d$ and $\hat{q} = \Sigma_k^{-1} U_k^T q$
3. Semantic similarity: $\mathrm{Score}(q, d) = \cos(\hat{q}, \hat{d}) = \frac{\hat{q}^T \hat{d}}{\lVert \hat{q} \rVert\, \lVert \hat{d} \rVert}$
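A compact sketch of these three steps (function names are illustrative):

    import numpy as np

    def lsa(C, k):
        """Rank-k LSA of a term-document matrix C (terms x docs)."""
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        Uk, sk = U[:, :k], s[:k]
        fold_in = lambda v: (Uk.T @ v) / sk   # v_hat = Sigma_k^{-1} U_k^T v
        docs = Vt[:k, :].T                    # row i = representation of doc i
        return fold_in, docs

    def score(q_hat, d_hat):
        """Semantic similarity: cosine in the latent space."""
        return (q_hat @ d_hat) / (np.linalg.norm(q_hat) * np.linalg.norm(d_hat))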
Information Retrieval: Lexical Gap Revisited
Term × document matrix $C$ (binary counts for ship, boat, ocean, voyage, trip)
After rank-2 LSA, each document gets a 2-dimensional representation:

       d1     d2     d3     d4     d5     d6
dim 1  -1.62  -0.60  -0.44  -0.97  -0.70  -0.26
dim 2  -0.46  -0.84  -0.30  1.00   0.35   0.65

Documents that share no terms can now be similar, narrowing the lexical gap
Information Retrieval: Results & Discussion
Average precision on four test collections:

           MED   CRAN  CACM  CISI
Cos+TFIDF  49.0  35.2  21.9  20.2
LSA        64.6  38.7  23.8  21.9
PLSI-U     69.5  38.9  25.3  23.3
PLSI-Q     63.2  38.6  26.6  23.1

(Hofmann 1999)
Non-Linear Techniques: Laplacian Eigenmaps
Generalized eigenvalue problem: $Lu = \lambda Du$
Objective: $u^T L u = \frac{1}{2}\sum_{i,j} W_{ij}(u_i - u_j)^2$
(Belkin & Niyogi 2003)
Neural Embeddings
Each word is represented as a vector of size $m$
Learning: optimize the parameters so that the log-likelihood is maximized, via gradient ascent
Learns the model parameters and the word vectors simultaneously
The learned word vectors capture semantics
Canonical Correlation Analysis (CCA)
Two views of the same data: $X = [x_1 \dots x_n]$, $x_i \in \mathbb{R}^{d_x}$, and $Y = [y_1 \dots y_n]$, $y_i \in \mathbb{R}^{d_y}$
Find directions $a$ and $b$ that maximize the correlation of the projections:
$\rho = \max_{a,b} \frac{a^T XY^T b}{\sqrt{(a^T XX^T a)(b^T YY^T b)}}$
CCA (contd.)
Covariance matrices: $C_{XY} = \frac{1}{n}XY^T$, $C_{XX} = \frac{1}{n}XX^T$, $C_{YY} = \frac{1}{n}YY^T$
For projections $s = X^T a$ and $t = Y^T b$:
$\cos(s, t) = \frac{a^T C_{XY} b}{\sqrt{a^T C_{XX} a}\ \sqrt{b^T C_{YY} b}}$
CCA: Formulation
Goal: find linear transformations $A$ and $B$ that maximize data correlation
Optimization problem:
$(A^*, B^*) = \arg\max_{A,B} \mathrm{Tr}(A\, XY^T B^T)$
subject to $\mathrm{Tr}(A\, XX^T A^T) = 1$ and $\mathrm{Tr}(B\, YY^T B^T) = 1$
CCA: Solution
Generalized eigenvalue problems:
$C_{XY} B^T = C_{XX} A^T \Lambda_A$ and $C_{YX} A^T = C_{YY} B^T \Lambda_B$
It can be shown that $\Lambda_A = \Lambda_B = \Lambda$
$B^T = C_{YY}^{-1} C_{YX} A^T \Lambda^{-1}$
Substituting: $C_{XY} C_{YY}^{-1} C_{YX} A^T = C_{XX} A^T \Lambda^2$
MATLAB function: canoncorr()
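A sketch of this solution in numpy/scipy; the small ridge term is our addition for numerical stability, not part of the formulation above:

    import numpy as np
    from scipy.linalg import eigh

    def cca(X, Y, k, reg=1e-8):
        """Top-k canonical directions from views X (dx x n) and Y (dy x n)."""
        n = X.shape[1]
        Cxx = X @ X.T / n + reg * np.eye(X.shape[0])
        Cyy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
        Cxy = X @ Y.T / n
        # C_xy C_yy^{-1} C_yx a = rho^2 C_xx a, a generalized eigenproblem
        M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
        rho2, A = eigh(M, Cxx)
        A = A[:, ::-1][:, :k]                  # top-k x-side directions
        B = np.linalg.solve(Cyy, Cxy.T @ A)    # y-side directions (up to scaling)
        return A.T, B.T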
Bilingual Document Projections
Steps:
1. Represent each document as a vector
Vector space: features are the 20K most frequent content words; feature weight: TF-IDF
Training data: $x_i \in \mathbb{R}^K$ a bag of English words, $y_i \in \mathbb{R}^L$ a bag of Hindi words
$(x_i, y_i),\ i = 1 \dots n$; $X = [x_1\ x_2\ \dots\ x_n]$, $Y = [y_1\ y_2\ \dots\ y_n]$
Scoring: $\mathrm{Score}(x, y) \propto (Ax)^T (By)$
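A sketch of mate retrieval with this score, assuming $A$ and $B$ were learned (e.g., by the CCA or OPCA routines above); the MRR computation is the standard one:

    import numpy as np

    def mate_retrieval_mrr(A, B, X, Y):
        """MRR of retrieving the aligned document: score = (A x_i)^T (B y_j)."""
        S = (A @ X).T @ (B @ Y)      # n x n matrix of scores
        # rank of the true mate j = i among all candidates (ties count against us)
        ranks = (S >= S.diagonal()[:, None]).sum(axis=1)
        return float(np.mean(1.0 / ranks))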
Results (MRR):

                     MRR
OPCA                 72.55  77.34
Word-by-word         70.33  74.67
CCA                  68.94  73.78
Word-by-word (5000)  67.86  72.36
CL-LSI               53.02  61.30
Untranslated         46.92  53.83
CPLSA                45.79  51.30
JPLSA                33.22  36.19

(Platt et al. 2010)
Mining Word Translations
Example translation probabilities:

English  Spanish   P(s|e)
state    estado    0.5
state    declarar  0.3
society  sociedad  0.4
society  compañía  0.35
company  sociedad  0.8

Steps:
1. Mining of "word pairs"
2. Represent each word as a vector
Vector Space
Features: context words (WSM); orthography
Feature weights: TF-IDF weights
Can be computed using ONLY comparable corpora
$(x_i, y_i),\ i = 1 \dots n$; $X = [x_1\ x_2\ \dots\ x_n]$, $Y = [y_1\ y_2\ \dots\ y_n]$
Mining of "Word Pairs"
Represent each word as a vector
Use CCA to find transformations $A$ and $B$
Use $A$ and $B$ to mine new word translations
Scoring: $\mathrm{Score}(e, s) \propto (Ax_e)^T (By_s)$
Results (Best-F1 / precision of mined translations):

EditDist  58.6  62.6  61.1
Ortho     76.0  81.3  80.1  52.3  55.0
Context   91.1  81.3  80.2  65.3  58.0
Both      87.2  89.7  89.0  89.7  72.0  47.4
MT accuracies (BLEU):

        French                 German
Domain  Baseline  +ve change   Baseline  +ve change
News    23.00     0.80         27.30     0.36
Emea    26.62     1.44         40.46     1.51
Subs    10.26     0.13         16.91     0.61
PHP     38.67     0.28         28.12     0.68
Supervised Semantic Indexing (SSI)
Steps:
1. Represent an ad $a$ and a document $d$ as vectors
2. Learn a scoring function $f(a, d)$
3. Rank the ads for a given document
(Bai et al. 2009)
Vector space: bag-of-words representation; features: words; feature weights: TF-IDF
Scoring function: $f(a, d) = d^T W a$, with parameters $W \in \mathbb{R}^{V \times V}$
$W = I$: cosine similarity
$W = D$ (diagonal): reweighting of words
$W = U^T V + I$: dimensionality reduction, with different treatment for ads and documents
$W = U^T U + I$: dimensionality reduction, with the SAME treatment for ads and documents
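A sketch of the low-rank scorer (the class name is hypothetical; U and V play the roles in $W = U^T V + I$, with $k \ll V$ so $W$ is never materialized):

    import numpy as np

    class SSIScorer:
        """f(a, d) = d^T W a with the low-rank parameterization W = U^T V + I."""
        def __init__(self, U, V):
            self.U, self.V = U, V            # U, V: k x vocab matrices

        def score(self, a, d):
            low_rank = (self.U @ d) @ (self.V @ a)   # d^T U^T V a
            return low_rank + d @ a                  # identity term: d^T a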
Ranking Ads
Compute the score using $f(a, d)$ and rank
Rank loss: TFIDF 45.60; the low-rank SSI models ($W = U^T V + I$) do far better, e.g. 50×10k: 25.83, 50×20k: 26.68, 50×30k: 26.98
(Bai et al. 2009)
Rank loss at several truncation levels:

                           K=10  K=20
TFIDF               21.6   14.0  9.14
LSI + (1-λ)TFIDF    14.2   9.73  6.36
SSI: W = U^T U + I  4.80   3.10  1.87
SSI: W = U^T V + I  4.37   2.91  1.80

(Bai et al. 2009)
Discriminative Reranking
Input features $x \in \mathbb{R}^{d_1}$ and output features $y \in \mathbb{R}^{d_2}$
Joint feature vector of length $d_1 \times d_2$: $(x_1 y_1,\ x_1 y_2,\ \dots,\ x_{d_1} y_{d_2})$
Weight vector: $w = a \otimes b$
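The low-rank structure is exactly what makes this tractable: with $w = a \otimes b$, the score over the $d_1 d_2$-dimensional joint feature vector factorizes as $(a^T x)(b^T y)$, so the joint vector never has to be built explicitly. A sketch:

    import numpy as np

    def joint_score(a, b, x, y):
        """w^T (x kron y) with w = a kron b equals (a^T x)(b^T y)."""
        return (a @ x) * (b @ y)

    def joint_score_explicit(a, b, x, y):
        """Equivalent but O(d1*d2) explicit version, for checking on small dims."""
        return np.kron(a, b) @ np.kron(x, y)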
Low-Dimensional Reranking
Find $A$ and $B$ such that, under the projections $(A^T x_i, B^T y_i)$, correct candidates score higher than incorrect ones
Idea:
1. Score: $a^T x\, y^T b$ (rank-one case)
2. Margin: $m \ge 1 - \xi$
Discriminative Model
Initialize the Lagrangian variables        // Initialization
Repeat:
  $(A, B) \leftarrow$ Softened-Disc
  $m_i = a^T x_i y_i^T b - a^T \bar{x}_i \bar{y}_i^T b$   // Compute margins (correct vs. incorrect candidate)
  $\xi_i = \max(0,\ 1 - m_i)$              // Compute slack
  If $\xi_i > 0$: update the Lagrangian variables
Until convergence
POS Tagging
Combine with the Viterbi score; the interpolation parameter is tuned
Training: input sentences and reference tag sequences; candidates with score and loss values
Example sentence: "Buyers stepped in to the futures pit"

Score     Candidate tag sequence
-0.1947   NNS VBD RP TO DT NNS NN
-6.8068   NNS VBD RB TO DT NNS NN
-7.0514   NNS VBD IN TO DT NNS NN
-7.1408   NNS VBD RP TO DT NNS VB
-13.752   NNS VBD RB TO DT NNS VB
POS Tagging: Results
Combine with the Viterbi score; the interpolation parameter is tuned

                English  Chinese  French  Swedish
Baseline        96.15    92.31    97.41   93.23
Collins         96.06    92.81    97.35   93.44
Regularized     96.00    92.88    97.38   93.35
Softened-Disc   96.32    92.87    97.53   93.24
Discriminative  96.3     92.91    97.53   93.36
Oracle          98.39    98.19    99.00   96.48
POS Tagging: Improvements over the Baseline

                 English  Chinese  French  Swedish
Softened-Disc    +0.17    +0.56    +0.12   +0.01
Discriminative   +0.15    +0.6     +0.12   +0.13
Softened-Disc*   +0.92    +4.31    +1.12   +0.08
Discriminative*  +0.88    +4.77    +0.9    +0.73
Dimensionality Reduction: Trends
[Plot: number of dimensionality-reduction papers per year, 1990-2010, Vision vs. NLP]
[Plot: popularity of dimensionality reduction relative to Bayesian approaches, 1990-2010, Vision vs. NLP]
Summary
Dimensionality reduction has merits for NLP: computational benefits, and it captures feature correlations
Spectral learning provides a way to learn the global optimum for generative models
References
HOTELLING, Harold, 1933. Analysis of a Complex of Statistical Variables into Principal Components. Journal of Educational Psychology, 24(6 & 7), 417–441 & 498–520.
Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), Numerical Recipes: The Art of Scientific
Computing (3rd ed.), New York: Cambridge University Press
John C. Platt, Kristina Toutanova, and Wen-tau Yih. 2010. Translingual document representations from discriminative projection. In Proceedings of EMNLP 2010, pages 251–261.
Hyvärinen, A.; Oja, E. Independent component analysis: algorithms and applications. Neural Networks, 13(4–5), May–June 2000, pages 411–430.
Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. "Latent Dirichlet allocation" (John Lafferty, ed.). Journal of Machine Learning Research 3(4–5), 2003: pp. 993–1022.
S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, vol. 290, 22 December 2000, 2323–2326.
Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics 24(1), 97–124.
Hinrich Schütze. 1995. Distributional part-of-speech tagging. In EACL 7, pages 141–148.
Michael Lamar, Yariv Maron, Mark Johnson, and Elie Bienenstock. SVD and clustering for unsupervised POS tagging. In ACL 2010, pages 215–219, July 11–16.
Deerwester, S., et al. Improving Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36–40.
Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology
38: 188.
Thomas Hofmann, Probabilistic Latent Semantic Indexing, In SIGIR 1999
J. B. Tenenbaum, Vin de Silva, and John C. Langford. "A Global Geometric Framework for Nonlinear Dimensionality Reduction." Science, vol. 290, 22 December 2000.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of
Artificial Intelligence Research.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3, March 2003.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In ACL 2010, pages 384–394, July 11–16, 2010.
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing
(almost) from scratch. JMLR, 2011.
Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28:321–377.
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In ACL, pages 771–779.
Hal Daumé III and Jagadeesh Jagarlamudi. Domain adaptation for machine translation by mining unseen words. In ACL 2011, June 19–24.
Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, and Kilian Weinberger. Supervised semantic indexing. In CIKM 2009, November 02–06, Hong Kong, China.
Jagadeesh Jagarlamudi and Hal Daumé III. Low-Dimensional Discriminative Reranking. In HLT-NAACL 2012.
Shaishav Kumar and Raghavendra Udupa, Learning Hash Functions for Cross-View Similarity Search, in
IJCAI-11, IJCAI, 20 July 2011
Raghavendra Udupa and Shaishav Kumar, Hashing-based Approaches to Spelling Correction of Personal
Names, in Proceedings of EMNLP 2010, October 2010
Raghavendra Udupa and Mitesh Khapra, Transliteration Equivalence using Canonical Correlation Analysis,
in ECIR 2010, 2010
Jagadeesh Jagarlamudi and Hal Daumé III. Regularized Interlingual Projections: Evaluation on Multilingual Transliteration. In Proceedings of EMNLP-CoNLL 2012.