Revisiting Dimensionality Reduction Techniques for NLP
jags@umiacs.umd.edu, raghavu@microsoft.com
ABSTRACT
Many natural language processing (NLP) applications represent words and documents as vectors in a very high-dimensional space. The inherently high-dimensional nature of these applications leads to sparse vectors, resulting in poor performance in downstream applications. Dimensionality reduction aims to find a lower-dimensional subspace (or simply subspace) that captures the essential information required by the downstream applications. Although it received a lot of attention early on, its popularity in NLP, unlike in other fields such as computer vision, has declined over time. This is partly because, traditionally, it was studied in an unsupervised fashion, and hence the learnt subspace may not be optimal for the task at hand. But recent advances in learning low-dimensional representations in the presence of input and output variables enable us to learn task-specific subspaces that are as effective as state-of-the-art approaches. In this tutorial, we aim to demonstrate the simplicity and effectiveness of these techniques on a diverse set of NLP tasks. By the end of the tutorial, we hope the attendees will be able to decide "whether or not dimensionality reduction can help their task, and if so, how?".
The tutorial begins with an introduction to dimensionality reduction and its importance to NLP. As many of the dimensionality reduction techniques discussed in the tutorial make use of linear algebra, we discuss some important concepts, including linear transformations, positive definite matrices, and eigenvalues and eigenvectors. Next, we look at some important dimensionality reduction techniques for data with a single view (PCA, SVD, OPCA). We then take up applications of these techniques to some important NLP problems (word-sense discrimination, POS tagging, and information retrieval). As NLP often involves more than one language, we look at dimensionality reduction of multiview data using Canonical Correlation Analysis. We discuss some interesting applications of multiview dimensionality reduction (bilingual document projections and mining word-level translations). We also discuss some advanced topics in dimensionality reduction, such as non-linear and neural techniques, and some application-inspired techniques, such as discriminative reranking, supervised semantic analysis, and multilingual hashing.
We do not assume attendees know anything about dimensionality reduction (though
the tutorial should be interesting even to those who know some), but we *do* assume some
basic knowledge of linear algebra.
Revisiting Dimensionality Reduction Techniques for NLP (Tutorial Notes), pages 175,
COLING 2012, Mumbai, December 2012.
Road Map
Introduction
NLP and Dimensionality Reduction
Mathematical Background
Data with Single View
Techniques
Applications
Advanced Topics
Data with Multiple Views
Techniques
Applications
Advanced Topics
Summary
Dimensionality Reduction: Motivation
Many applications involve high-dimensional (and often sparse) data
High-dimensional data poses several challenges:
Computational cost
Difficulty of interpretation
Overfitting
Dimensionality Reduction: Goal
Given high-dimensional data, discover the underlying low-dimensional structure
[Figure: 2D embedding of face images; He et al., Face Recognition Using LaplacianFaces]
Dimensionality Reduction: Benefits
Computational efficiency: k-nearest-neighbor search
Data compression: less storage; millions of data points in RAM
Data visualization: 2D and 3D scatter plots
Dimensionality Reduction: Techniques
Projective methods find low-dimensional projections that extract useful information from the data by maximizing a suitable objective function
Examples: PCA, ICA, LDA
Dimensionality Reduction: Relevance to NLP
High-dimensional data in NLP: text documents, context vectors
Data Centering
Dataset: $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
Mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Centering: $x_i \leftarrow x_i - \mu$
Mean after centering: $\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu) = \mu - \mu = 0$
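A minimal numpy sketch of this step (the shapes and variable names are illustrative, not from the tutorial):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 5, 100                        # dimensionality d, number of points n
    X = rng.normal(size=(d, n))          # columns are the data points x_1 ... x_n

    mu = X.mean(axis=1, keepdims=True)   # mean vector mu (d x 1)
    Xc = X - mu                          # centering: x_i <- x_i - mu

    # the mean after centering is the zero vector (up to floating-point error)
    assert np.allclose(Xc.mean(axis=1), 0.0)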
Data Variance
Dataset: $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
Centered: $\frac{1}{n}\sum_{i=1}^{n} x_i = 0$
Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \lVert x_i \rVert^2 = \frac{1}{n}\mathrm{Tr}(XX^T) = \mathrm{Tr}(C)$, where $C = \frac{1}{n}XX^T$
Centering doesn't change the data variance
Transformed dataset: $AX$
Variance after transformation: $\frac{1}{n}\sum_{i=1}^{n} \lVert Ax_i \rVert^2 = \frac{1}{n}\mathrm{Tr}(AXX^TA^T) = \mathrm{Tr}(ACA^T)$
Both $C$ and $ACA^T$ are positive semi-definite
Cholesky decomposition: $M = LL^T$
Trace: $\mathrm{Tr}(M) = \sum_{i=1}^{d} M_{ii} = \sum_i \lambda_i$
Rank: number of non-zero eigenvalues
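The trace identity for the transformed variance can be checked numerically; a small sketch (shapes illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    d, k, n = 5, 2, 100
    X = rng.normal(size=(d, n))
    X -= X.mean(axis=1, keepdims=True)   # centered data
    C = (X @ X.T) / n                    # covariance matrix C = (1/n) X X^T

    A = rng.normal(size=(k, d))          # an arbitrary linear transformation

    # mean squared norm of the transformed points equals Tr(A C A^T)
    var_transformed = np.mean(np.sum((A @ X) ** 2, axis=0))
    assert np.isclose(var_transformed, np.trace(A @ C @ A.T))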
Eigensystem of Positive-Definite Matrices
$M \in \mathbb{R}^{d \times d}$
Positive eigenvalues: $\lambda_i > 0$
Real-valued eigenvectors: $u_i \in \mathbb{R}^d$
Orthonormal eigenvectors: $u_i^T u_j = 0$ for $i \neq j$ and $u_i^T u_i = 1$ (i.e., $U^T U = I$)
Full rank: $\mathrm{rank}(M) = d$
Eigendecomposition: $M = UDU^T$
Data variance in terms of eigenvalues: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \lVert x_i \rVert^2 = \mathrm{Tr}(C) = \sum_i \lambda_i$
PCA: Formulation
Find a linear transformation $A$ onto an orthonormal basis that preserves maximum data variance
Mathematical formulation: $A^* = \arg\max_{A \in \mathbb{R}^{k \times d},\ AA^T = I} \mathrm{Tr}(ACA^T)$
PCA: Solution
Eigendecomposition of $C$: $C = UDU^T$, with $U = [u_1\ u_2\ \dots\ u_d]$ and $D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$
Take $A = [u_1\ u_2\ \dots\ u_k]^T$; the projection is $x \mapsto Ax = (u_1^T x, \dots, u_k^T x)^T$
MATLAB function: princomp()
Projected covariance: $AXX^TA^T = [u_1 \dots u_k]^T\, UDU^T\, [u_1 \dots u_k] = \mathrm{diag}(\lambda_1, \dots, \lambda_k)$
Variance after projection: $\mathrm{Tr}(ACA^T) = \sum_{i=1}^{k} \lambda_i$
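The eigendecomposition route to PCA takes only a few lines of numpy; a minimal sketch (MATLAB's princomp wraps an equivalent computation):

    import numpy as np

    def pca(X, k):
        """Top-k PCA of centered data X (d x n): returns A (k x d) and A X."""
        C = (X @ X.T) / X.shape[1]       # covariance matrix
        lam, U = np.linalg.eigh(C)       # eigenvalues in ascending order
        A = U[:, ::-1][:, :k].T          # rows = top-k eigenvectors u_1 ... u_k
        return A, A @ X

    # the projected data is decorrelated: A C A^T = diag(lambda_1 ... lambda_k)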
PCA: Properties
PCA decorrelates the dataset: $ACA^T = \mathrm{diag}(\lambda_1, \dots, \lambda_k)$
Singular Value Decomposition (SVD)
For $X \in \mathbb{R}^{d \times n}$: $X = U\Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T$, where $r = \mathrm{rank}(X)$
$U \in \mathbb{R}^{d \times r}$ such that $U^T U = I$ (left singular vectors)
$V \in \mathbb{R}^{n \times r}$ such that $V^T V = I$ (right singular vectors)
$\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r) \in \mathbb{R}^{r \times r}$ (singular values)
Rank-$k$ approximation ($k \le r$): $X_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, i.e., $\Sigma_k = \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0)$
Note that $XX^T = \sum_{i=1}^{r} \sigma_i^2 u_i u_i^T$: the left singular vectors of $X$ are the eigenvectors of $XX^T$
Whitening: $\Sigma^{-1} U^T XX^T U \Sigma^{-1} = I$
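A rank-k approximation via numpy's SVD, as a minimal sketch:

    import numpy as np

    def low_rank(X, k):
        """Best rank-k approximation of X (Eckart-Young theorem)."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]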
OPCA: Formulation
Maximize the signal variance relative to the noise variance: $A^* = \arg\max_A \mathrm{Tr}(A C_s A^T)$ subject to $A C_n A^T = I$, where $C_s$ is the signal covariance and $C_n$ the noise covariance
OPCA: Solution
Generalized eigenvalue problem: $C_s U = C_n U D$
Equivalent eigenvalue problem: $C_n^{-1/2} C_s C_n^{-1/2} V = VD$, where $V = C_n^{1/2} U$
$U = [u_1\ u_2\ \dots\ u_d]$, $D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$
Take $A = [u_1\ u_2\ \dots\ u_k]^T$; project $x \mapsto Ax$
MATLAB function: eig()
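Assuming signal and noise covariance estimates are at hand, the generalized eigenproblem can be solved directly; a sketch using scipy (scipy.linalg.eigh accepts a second matrix for the generalized symmetric problem):

    import numpy as np
    from scipy.linalg import eigh

    def opca(C_signal, C_noise, k):
        """Top-k OPCA directions: solve C_signal u = lambda * C_noise u."""
        lam, U = eigh(C_signal, C_noise)   # generalized symmetric eigenproblem
        return U[:, ::-1][:, :k].T         # k x d projection matrix A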
OPCA: Properties
Projections remain the same when the noise and signal vectors are globally scaled by two different scale factors
The projected data is not necessarily uncorrelated
Can be extended to multiview data [Platt et al., EMNLP 2010]
Curse of Dimensionality
We have observations $x_i \in \mathbb{R}^d$
$d$ is usually very large
Vector space models: $d$ = vocabulary size (the number of words in a language)
Word-Sense Discrimination
Steps:
1. Word vectors (vector space)
2. Context vectors
3. Sense vectors
Dimensionality reduction is applied to the word vectors
Dimensionality Reduction
Reduce the dimensionality of word vectors
Example word-by-context co-occurrence matrix $W$:

           legal  clothes
judge      210    75
robe       50     250
law        240    50
suit       147    157
dismisses  96     152

$W = U\Sigma V^T$; project the word vectors onto the top-$k$ left singular vectors $[u_1 \dots u_k]$

Results (word-sense discrimination accuracy, Schütze 1998):

                   Accuracy (%)
chi-square, terms  76
chi-square, SVD    90
Frequency, terms   81
Frequency, SVD     88
Part-of-Speech Tagging
Example: ate/VB an/DT apple/NN ./.
Unsupervised approaches:
Attempt to cluster words
Align each cluster with a POS tag
Do not assume a dictionary of tags
(Schütze 1995, Lamar et al. 2010)
Part-of-Speech Tagging: Steps
1. Represent words in an appropriate vector space
2. Dimensionality reduction

Part-of-Speech Tagging: Pass 1
Construct left and right context matrices $L$ and $R$ of size $V \times d$
Dimensionality reduction: take the rank-$r_1$ approximation
$L = U_L \Sigma_L V_L^T$; descriptor $L^* = U_L \Sigma_L$; $L^* \leftarrow$ row-normalized $L^*$
$R = U_R \Sigma_R V_R^T$; descriptor $R^* = U_R \Sigma_R$; $R^* \leftarrow$ row-normalized $R^*$
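A sketch of this rank-r descriptor step (names are illustrative; Lamar et al. additionally scale and cluster these descriptors):

    import numpy as np

    def reduced_descriptors(M, r):
        """Rank-r descriptors of a context matrix M (V x d): rows of U_r Sigma_r,
        normalized to unit length (all-zero rows are left as-is)."""
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        D = U[:, :r] * s[:r]                # one descriptor row per word type
        norms = np.linalg.norm(D, axis=1, keepdims=True)
        return D / np.where(norms == 0.0, 1.0, norms)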
Part-of-Speech Tagging: Pass 2
The Pass-1 clusters are not optimal because of sparsity
Construct new context matrices $L$ and $R$ of size $V \times k_1$, using the Pass-1 clusters as contexts
Dimensionality reduction: take the rank-$r_2$ approximation
$L = U_L \Sigma_L V_L^T$; descriptor $L^* = U_L \Sigma_L$; $L^* \leftarrow$ row-normalized $L^*$
$R = U_R \Sigma_R V_R^T$; descriptor $R^* = U_R \Sigma_R$; $R^* \leftarrow$ row-normalized $R^*$
Part-of-Speech Tagging: Results
Penn Treebank (1.1M tokens, 43K types); 17 and 45 tags; many-to-1 accuracies:

                   PTB17        PTB45
SVD2               0.730        0.660
HMM-EM             0.647        0.621
HMM-VB             0.637        0.605
HMM-GS             0.674        0.660
HMM-Sparse(32)     0.702 (2.2)  0.654 (1.0)
VEM(10^-1, 10^-1)  0.682 (0.8)  0.546 (1.7)

(Lamar et al. 2010)
Part-of-Speech Tagging: Discussion
Sensitivity to parameters
Scaling with singular values
k-means algorithm: weighted k-means, with clusters initialized to the most frequent word types
A non-disambiguating tagger
A very simple algorithm
Information Retrieval
Rank documents $d$ in response to a query $q$
Vector space model: query and documents are represented as bags of words
Features: words; feature weight: TF-IDF
Lexical gap: polysemy and synonymy
Information Retrieval: Lexical Gap
Term × document matrix $C$ (terms: ship, boat, ocean, voyage, trip)
TF-IDF weighting
Which document is better?
Information Retrieval: Latent Semantic Analysis
Term × document matrix $C$
Steps:
1. Dimensionality reduction: $C = U\Sigma V^T$; keep the top-$k$ singular directions
2. Folding-in documents and queries: $\hat{d} = \Sigma_k^{-1} U_k^T d$ and $\hat{q} = \Sigma_k^{-1} U_k^T q$
3. Semantic similarity: $\mathrm{Score}(q, d) = \cos(\hat{q}, \hat{d}) = \frac{\hat{q}^T \hat{d}}{\lVert \hat{q} \rVert\, \lVert \hat{d} \rVert}$
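A compact sketch of these three steps (function names are illustrative):

    import numpy as np

    def lsa(C, k):
        """Rank-k LSA of a term-document matrix C (terms x docs)."""
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        Uk, sk = U[:, :k], s[:k]
        fold_in = lambda v: (Uk.T @ v) / sk   # v_hat = Sigma_k^{-1} U_k^T v
        docs = Vt[:k, :].T                    # row i = representation of doc i
        return fold_in, docs

    def score(q_hat, d_hat):
        """Semantic similarity: cosine in the latent space."""
        return (q_hat @ d_hat) / (np.linalg.norm(q_hat) * np.linalg.norm(d_hat))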
Information Retrieval: Lexical Gap Revisited
Term × document matrix $C$ (binary counts for ship, boat, ocean, voyage, trip)
After rank-2 LSA, each document gets a 2-dimensional representation:

       d1     d2     d3     d4     d5     d6
dim 1  -1.62  -0.60  -0.44  -0.97  -0.70  -0.26
dim 2  -0.46  -0.84  -0.30  1.00   0.35   0.65

Documents that share no terms can now be similar, narrowing the lexical gap
Information Retrieval: Results & Discussion
Average precision on four test collections:

           MED   CRAN  CACM  CISI
Cos+TFIDF  49.0  35.2  21.9  20.2
LSA        64.6  38.7  23.8  21.9
PLSI-U     69.5  38.9  25.3  23.3
PLSI-Q     63.2  38.6  26.6  23.1

(Hofmann 1999)
Non-Linear Techniques: Laplacian Eigenmaps
Generalized eigenvalue problem: $Lu = \lambda Du$
Objective: $u^T L u = \frac{1}{2}\sum_{i,j} W_{ij}(u_i - u_j)^2$
(Belkin & Niyogi 2003)
Neural Embeddings
Each word is represented as a vector of size $m$
Learning: optimize the parameters so that the log-likelihood is maximized, via gradient ascent
Learns the model parameters and the word vectors simultaneously
The learned word vectors capture semantics
Canonical Correlation Analysis (CCA)
Two views of the same data: $X = [x_1 \dots x_n]$, $x_i \in \mathbb{R}^{d_x}$, and $Y = [y_1 \dots y_n]$, $y_i \in \mathbb{R}^{d_y}$
Find directions $a$ and $b$ that maximize the correlation of the projections:
$\rho = \max_{a,b} \frac{a^T XY^T b}{\sqrt{(a^T XX^T a)(b^T YY^T b)}}$
CCA (contd.)
Covariance matrices: $C_{XY} = \frac{1}{n}XY^T$, $C_{XX} = \frac{1}{n}XX^T$, $C_{YY} = \frac{1}{n}YY^T$
For projections $s = X^T a$ and $t = Y^T b$:
$\cos(s, t) = \frac{a^T C_{XY} b}{\sqrt{a^T C_{XX} a}\ \sqrt{b^T C_{YY} b}}$
CCA: Formulation
Goal: find linear transformations $A$ and $B$ that maximize data correlation
Optimization problem:
$(A^*, B^*) = \arg\max_{A,B} \mathrm{Tr}(A\, XY^T B^T)$
subject to $\mathrm{Tr}(A\, XX^T A^T) = 1$ and $\mathrm{Tr}(B\, YY^T B^T) = 1$
CCA: Solution
Generalized eigenvalue problems:
$C_{XY} B^T = C_{XX} A^T \Lambda_A$ and $C_{YX} A^T = C_{YY} B^T \Lambda_B$
It can be shown that $\Lambda_A = \Lambda_B = \Lambda$
$B^T = C_{YY}^{-1} C_{YX} A^T \Lambda^{-1}$
Substituting: $C_{XY} C_{YY}^{-1} C_{YX} A^T = C_{XX} A^T \Lambda^2$
MATLAB function: canoncorr()
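A sketch of this solution in numpy/scipy; the small ridge term is our addition for numerical stability, not part of the formulation above:

    import numpy as np
    from scipy.linalg import eigh

    def cca(X, Y, k, reg=1e-8):
        """Top-k canonical directions from views X (dx x n) and Y (dy x n)."""
        n = X.shape[1]
        Cxx = X @ X.T / n + reg * np.eye(X.shape[0])
        Cyy = Y @ Y.T / n + reg * np.eye(Y.shape[0])
        Cxy = X @ Y.T / n
        # C_xy C_yy^{-1} C_yx a = rho^2 C_xx a, a generalized eigenproblem
        M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
        rho2, A = eigh(M, Cxx)
        A = A[:, ::-1][:, :k]                  # top-k x-side directions
        B = np.linalg.solve(Cyy, Cxy.T @ A)    # y-side directions (up to scaling)
        return A.T, B.T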
Bilingual Document Projections
Steps:
1. Represent each document as a vector
Vector space: features are the 20K most frequent content words; feature weight: TF-IDF
Training data: $x_i \in \mathbb{R}^K$ a bag of English words, $y_i \in \mathbb{R}^L$ a bag of Hindi words
$(x_i, y_i),\ i = 1 \dots n$; $X = [x_1\ x_2\ \dots\ x_n]$, $Y = [y_1\ y_2\ \dots\ y_n]$
Scoring: $\mathrm{Score}(x, y) \propto (Ax)^T (By)$
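A sketch of mate retrieval with this score, assuming $A$ and $B$ were learned (e.g., by the CCA or OPCA routines above); the MRR computation is the standard one:

    import numpy as np

    def mate_retrieval_mrr(A, B, X, Y):
        """MRR of retrieving the aligned document: score = (A x_i)^T (B y_j)."""
        S = (A @ X).T @ (B @ Y)      # n x n matrix of scores
        # rank of the true mate j = i among all candidates (ties count against us)
        ranks = (S >= S.diagonal()[:, None]).sum(axis=1)
        return float(np.mean(1.0 / ranks))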
Results (MRR):

                     MRR
OPCA                 72.55  77.34
Word-by-word         70.33  74.67
CCA                  68.94  73.78
Word-by-word (5000)  67.86  72.36
CL-LSI               53.02  61.30
Untranslated         46.92  53.83
CPLSA                45.79  51.30
JPLSA                33.22  36.19

(Platt et al. 2010)
Mining Word Translations
Example translation probabilities:

English  Spanish   P(s|e)
state    estado    0.5
state    declarar  0.3
society  sociedad  0.4
society  compañía  0.35
company  sociedad  0.8

Steps:
1. Mining of "word pairs"
2. Represent each word as a vector
Vector Space
Features: context words (WSM); orthography
Feature weights: TF-IDF weights
Can be computed using ONLY comparable corpora
$(x_i, y_i),\ i = 1 \dots n$; $X = [x_1\ x_2\ \dots\ x_n]$, $Y = [y_1\ y_2\ \dots\ y_n]$
Mining of "Word Pairs"
Represent each word as a vector
Use CCA to find transformations $A$ and $B$
Use $A$ and $B$ to mine new word translations
Scoring: $\mathrm{Score}(e, s) \propto (Ax_e)^T (By_s)$
Results (Best-F1 / precision of mined translations):

EditDist  58.6  62.6  61.1
Ortho     76.0  81.3  80.1  52.3  55.0
Context   91.1  81.3  80.2  65.3  58.0
Both      87.2  89.7  89.0  89.7  72.0  47.4
MT accuracies (BLEU):

        French                 German
Domain  Baseline  +ve change   Baseline  +ve change
News    23.00     0.80         27.30     0.36
Emea    26.62     1.44         40.46     1.51
Subs    10.26     0.13         16.91     0.61
PHP     38.67     0.28         28.12     0.68
Supervised Semantic Indexing (SSI)
Steps:
1. Represent an ad $a$ and a document $d$ as vectors
2. Learn a scoring function $f(a, d)$
3. Rank the ads for a given document
(Bai et al. 2009)
Vector space: bag-of-words representation; features: words; feature weights: TF-IDF
Scoring function: $f(a, d) = d^T W a$, with parameters $W \in \mathbb{R}^{V \times V}$
$W = I$: cosine similarity
$W = D$ (diagonal): reweighting of words
$W = U^T V + I$: dimensionality reduction, with different treatment for ads and documents
$W = U^T U + I$: dimensionality reduction, with the SAME treatment for ads and documents
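A sketch of the low-rank scorer (the class name is hypothetical; U and V play the roles in $W = U^T V + I$, with $k \ll V$ so $W$ is never materialized):

    import numpy as np

    class SSIScorer:
        """f(a, d) = d^T W a with the low-rank parameterization W = U^T V + I."""
        def __init__(self, U, V):
            self.U, self.V = U, V            # U, V: k x vocab matrices

        def score(self, a, d):
            low_rank = (self.U @ d) @ (self.V @ a)   # d^T U^T V a
            return low_rank + d @ a                  # identity term: d^T a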
Ranking Ads
Compute the score using $f(a, d)$ and rank
Rank loss: TFIDF 45.60; the low-rank SSI models ($W = U^T V + I$) do far better, e.g. 50×10k: 25.83, 50×20k: 26.68, 50×30k: 26.98
(Bai et al. 2009)
Rank loss at several truncation levels:

                           K=10  K=20
TFIDF               21.6   14.0  9.14
LSI + (1-λ)TFIDF    14.2   9.73  6.36
SSI: W = U^T U + I  4.80   3.10  1.87
SSI: W = U^T V + I  4.37   2.91  1.80

(Bai et al. 2009)
Discriminative Reranking
Input features $x \in \mathbb{R}^{d_1}$ and output features $y \in \mathbb{R}^{d_2}$
Joint feature vector of length $d_1 \times d_2$: $(x_1 y_1,\ x_1 y_2,\ \dots,\ x_{d_1} y_{d_2})$
Weight vector: $w = a \otimes b$
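The low-rank structure is exactly what makes this tractable: with $w = a \otimes b$, the score over the $d_1 d_2$-dimensional joint feature vector factorizes as $(a^T x)(b^T y)$, so the joint vector never has to be built explicitly. A sketch:

    import numpy as np

    def joint_score(a, b, x, y):
        """w^T (x kron y) with w = a kron b equals (a^T x)(b^T y)."""
        return (a @ x) * (b @ y)

    def joint_score_explicit(a, b, x, y):
        """Equivalent but O(d1*d2) explicit version, for checking on small dims."""
        return np.kron(a, b) @ np.kron(x, y)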
Low-Dimensional Reranking
Find $A$ and $B$ such that, under the projections $(A^T x_i, B^T y_i)$, correct candidates score higher than incorrect ones
Idea:
1. Score: $a^T x\, y^T b$ (rank-one case)
2. Margin: $m \ge 1 - \xi$
Discriminative Model
Initialize the Lagrangian variables        // Initialization
Repeat:
  $(A, B) \leftarrow$ Softened-Disc
  $m_i = a^T x_i y_i^T b - a^T \bar{x}_i \bar{y}_i^T b$   // Compute margins (correct vs. incorrect candidate)
  $\xi_i = \max(0,\ 1 - m_i)$              // Compute slack
  If $\xi_i > 0$: update the Lagrangian variables
Until convergence
POS Tagging
Combine with the Viterbi score; the interpolation parameter is tuned
Training: input sentences and reference tag sequences; candidates with score and loss values
Example sentence: "Buyers stepped in to the futures pit"

Score     Candidate tag sequence
-0.1947   NNS VBD RP TO DT NNS NN
-6.8068   NNS VBD RB TO DT NNS NN
-7.0514   NNS VBD IN TO DT NNS NN
-7.1408   NNS VBD RP TO DT NNS VB
-13.752   NNS VBD RB TO DT NNS VB
POS Tagging: Results
Combine with the Viterbi score; the interpolation parameter is tuned

                English  Chinese  French  Swedish
Baseline        96.15    92.31    97.41   93.23
Collins         96.06    92.81    97.35   93.44
Regularized     96.00    92.88    97.38   93.35
Softened-Disc   96.32    92.87    97.53   93.24
Discriminative  96.3     92.91    97.53   93.36
Oracle          98.39    98.19    99.00   96.48
POS Tagging: Improvements over the Baseline

                 English  Chinese  French  Swedish
Softened-Disc    +0.17    +0.56    +0.12   +0.01
Discriminative   +0.15    +0.6     +0.12   +0.13
Softened-Disc*   +0.92    +4.31    +1.12   +0.08
Discriminative*  +0.88    +4.77    +0.9    +0.73
Dimensionality Reduction: Trends
[Plot: number of dimensionality-reduction papers per year, 1990-2010, Vision vs. NLP]
[Plot: popularity of dimensionality reduction relative to Bayesian approaches, 1990-2010, Vision vs. NLP]
Summary
Dimensionality reduction has merits for NLP: computational benefits, and it captures feature correlations
Spectral learning provides a way to learn the global optimum for generative models
References
HOTELLING, Harold, 1933. Analysis of a Complex of Statistical Variables into Principal Components. Journal of Educational Psychology, 24(6 & 7), 417–441 & 498–520.
Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), Numerical Recipes: The Art of Scientific
Computing (3rd ed.), New York: Cambridge University Press
John C. Platt, Kristina Toutanova, and Wen-tau Yih. 2010. Translingual document representations from discriminative projection. In Proceedings of EMNLP 2010, pages 251–261.
Hyvärinen, A.; Oja, E. Independent component analysis: algorithms and applications. Neural Networks, 13(4–5), May–June 2000, pages 411–430.
Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. "Latent Dirichlet allocation" (John Lafferty, ed.). Journal of Machine Learning Research 3(4–5), 2003: pp. 993–1022.
S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, vol. 290, 22 December 2000, 2323–2326.
Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics 24(1), 97–124.
Hinrich Schütze. 1995. Distributional part-of-speech tagging. In EACL 7, pages 141–148.
Michael Lamar, Yariv Maron, Mark Johnson, and Elie Bienenstock. SVD and clustering for unsupervised POS tagging. In ACL 2010, pages 215–219, July 11–16.
Deerwester, S., et al. Improving Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36–40.
Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology
38: 188.
Thomas Hofmann, Probabilistic Latent Semantic Indexing, In SIGIR 1999
J. B. Tenenbaum, Vin de Silva, and John C. Langford. "A Global Geometric Framework for Nonlinear Dimensionality Reduction." Science, vol. 290, 22 December 2000.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of
Artificial Intelligence Research.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3, March 2003.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In ACL 2010, pages 384–394, July 11–16, 2010.
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing
(almost) from scratch. JMLR, 2011.
Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28:321–377.
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In ACL, pages 771–779.
Hal Daumé III and Jagadeesh Jagarlamudi. Domain adaptation for machine translation by mining unseen words. In ACL 2011, June 19–24.
Bing Bai, Jason Weston, David Grangier, Ronan Collobert, Kunihiko Sadamasa, Yanjun Qi, Olivier Chapelle, and Kilian Weinberger. Supervised semantic indexing. In CIKM 2009, November 02–06, Hong Kong, China.
Jagadeesh Jagarlamudi and Hal Daumé III. Low-Dimensional Discriminative Reranking. In HLT-NAACL 2012.
Shaishav Kumar and Raghavendra Udupa, Learning Hash Functions for Cross-View Similarity Search, in
IJCAI-11, IJCAI, 20 July 2011
Raghavendra Udupa and Shaishav Kumar, Hashing-based Approaches to Spelling Correction of Personal
Names, in Proceedings of EMNLP 2010, October 2010
Raghavendra Udupa and Mitesh Khapra, Transliteration Equivalence using Canonical Correlation Analysis,
in ECIR 2010, 2010
Jagadeesh Jagarlamudi and Hal Daumé III. Regularized Interlingual Projections: Evaluation on Multilingual Transliteration. In Proceedings of EMNLP-CoNLL 2012.