Você está na página 1de 11

Linguistic Structured Sparsity in Text Categorization

Dani Yogatama and Noah A. Smith
Language Technologies Institute, School of Computer Science
Carnegie Mellon University, Pittsburgh, PA 15213, USA
dyogatama@cs.cmu.edu, nasmith@cs.cmu.edu

Abstract

We introduce three linguistically motivated structured regularizers based on parse trees, topics, and hierarchical word clusters for text categorization. These regularizers impose linguistic bias in feature weights, enabling us to incorporate prior knowledge into conventional bag-of-words models. We show that our structured regularizers consistently improve classification accuracies compared to standard regularizers that penalize features in isolation (such as lasso, ridge, and elastic net regularizers) on a range of datasets for various text prediction problems: topic classification, sentiment analysis, and forecasting.

1 Introduction

What is the best way to exploit linguistic information in statistical text processing models? For tasks like text classification, sentiment analysis, and text-driven forecasting, this is an open question, as cheap bag-of-words models often perform well. Much recent work in NLP has focused on linguistic feature engineering (Joshi et al., 2010) or representation learning (Glorot et al., 2011; Socher et al., 2013).

In this paper, we propose a radical alternative. We embrace the conventional bag-of-words representation of text, instead bringing linguistic bias to bear on regularization. Since the seminal work of Chen and Rosenfeld (2000), the importance of regularization in discriminative models of text (including language modeling, structured prediction, and classification) has been widely recognized. The emphasis, however, has largely been on one specific kind of inductive bias: avoiding large weights (i.e., coefficients in a linear model).

Recently, structured (or composite) regularization has been introduced; simply put, it reasons about different weights jointly. The most widely explored variant, the group lasso (Yuan and Lin, 2006), seeks to avoid large ℓ2 norms for groups of weights. The group lasso has been shown useful in a range of applications, including computational biology (Kim and Xing, 2008), signal processing (Lv et al., 2011), and NLP (Eisenstein et al., 2011; Martins et al., 2011; Nelakanti et al., 2013). For text categorization problems, Yogatama and Smith (2014) proposed groups based on sentences, an idea generalized here to take advantage of richer linguistic information.

In this paper, we show how linguistic information of various kinds (parse trees, thematic topics, and hierarchical word clusterings) can be used to construct group lasso variants that impose linguistic bias without introducing any new features. Our experiments demonstrate that structured regularizers can squeeze higher performance out of conventional bag-of-words models on seven out of eight text categorization tasks tested, in six cases with more compact models than the best-performing unstructured-regularized model.

2 Notation

We represent each document as a feature vector x ∈ ℝ^V, where V is the vocabulary size and x_v is the frequency of the vth word (i.e., this is a bag-of-words model).

Consider a linear model that predicts a binary response y ∈ {−1, +1} given x and a weight vector w ∈ ℝ^V. We denote our training data of D documents in the corpus by {x_d, y_d}_{d=1}^{D}. The goal of the learning procedure is to estimate w by minimizing the regularized training data loss:

w = argmin_w Ω(w) + Σ_{d=1}^{D} L(x_d, w, y_d),

where L(x_d, w, y_d) is the loss function for document d and Ω(w) is the regularizer. In this work, we use the log loss:

L(x_d, w, y_d) = log(1 + exp(−y_d w^⊤ x_d)).
Other loss functions (e.g., hinge loss, squared loss) can also be used with any of the regularizers discussed in this paper.

Our focus is on the regularizer, Ω(w). For high-dimensional data such as text, regularization is crucial to avoid overfitting.¹

¹ A Bayesian interpretation of regularization is as a prior on the weight vector w; in many cases Ω can be understood as a log-prior representing beliefs about the model held before exposure to data. For lasso regression, the prior is a zero-mean Laplace distribution, whereas for ridge regression the prior is a zero-mean Gaussian distribution. For non-overlapping group lasso, the prior is a two-level hierarchical Bayes model (Figueiredo, 2002). The Bayesian interpretation of overlapping group lasso is not yet well understood.

The usual starting points for regularization are the lasso (Tibshirani, 1996) and the ridge (Hoerl and Kennard, 1970), based respectively on the ℓ1 and squared ℓ2 norms:

Ω_las(w) = λ_las ‖w‖_1 = λ_las Σ_j |w_j|
Ω_rid(w) = λ_rid ‖w‖_2^2 = λ_rid Σ_j w_j^2

Both methods disprefer weights of large magnitude; smaller (relative) magnitude means a feature (here, a word) has a smaller effect on the prediction, and zero means a feature has no effect.² The hyperparameter λ in each case is typically tuned on a development dataset. A linear combination of ridge and lasso is known as the elastic net (Zou and Hastie, 2005). The lasso, ridge, and elastic net are three strong baselines in our experiments.

² The lasso leads to strongly sparse solutions, in which many elements of the estimated w are actually zero. This is an attractive property for efficiency and (perhaps) interpretability. The ridge encourages weights to go toward zero, but usually not all the way to zero; for this reason its solutions are known as weakly sparse.

3 Group Lasso

Structured regularizers penalize estimates of w in which collections of weights are penalized jointly. For example, in the group lasso (Yuan and Lin, 2006), predefined groups of weights (subvectors of w) are encouraged to either go to zero (as a group) or not (as a group); this is known as group sparsity.³

³ Other structured regularizers include the fused lasso (Tibshirani et al., 2005) and the elitist lasso (Kowalski and Torresani, 2009).

The variant of group lasso we explore here uses an ℓ1,2 norm. Let g index the G predefined groups of weights and w_g denote the subvector of w containing weights for group g:

Ω_glas(w) = λ_glas Σ_{g=1}^{G} λ_g ‖w_g‖_2,

where λ_glas is a hyperparameter tuned on development data, and λ_g is a group-specific weight. Typically the groups are non-overlapping, which offers computational advantages, but this need not be the case (Jacob et al., 2009; Jenatton et al., 2011).
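As an illustration of the penalty above (a sketch under assumed inputs, not the authors' implementation), the ℓ1,2 term can be computed directly from a list of index groups; here the group-specific weight λ_g is taken to be √size(g), the convention used for the regularizers in Section 4.

import numpy as np

def group_lasso_penalty(w, groups, lam_glas):
    # Omega_glas(w) = lam_glas * sum_g lam_g * ||w_g||_2,
    # with the group-specific weight lam_g = sqrt(|g|).
    total = 0.0
    for g in groups:                      # g is a list of indices into w
        lam_g = np.sqrt(len(g))
        total += lam_g * np.linalg.norm(w[g], 2)
    return lam_glas * total

# Two (possibly overlapping) groups over a 5-dimensional weight vector.
w = np.array([0.3, -1.2, 0.0, 0.8, 0.0])
groups = [[0, 1, 2], [2, 3, 4]]
print(group_lasso_penalty(w, groups, lam_glas=0.1))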
4 Structured Regularizers for Text

Past work applying the group lasso to NLP problems has considered four ways of defining the groups. Eisenstein et al. (2011) defined groups of coefficients corresponding to the same independent variable applied to different (continuous) output variables in multi-output regression. Martins et al. (2011) defined groups based on feature templates used in chunking and parsing tasks. Nelakanti et al. (2013) defined groups based on n-gram histories for language modeling. In each of these cases, the groups were defined based on information from feature types alone; given the features to be used, the groups were known.

Here we build on a fourth approach that exploits structure in the data.⁴ Yogatama and Smith (2014) introduced the sentence regularizer, which uses patterns of word cooccurrence in the training data to define groups. We review this method, then apply the idea to three more linguistically informed kinds of structure in text data.

⁴ This provides a compelling reason not to view such methods in a Bayesian framework: if the regularizer is informed by the data, then it does not truly correspond to a prior.

4.1 Sentence Regularizer

The sentence regularizer exploits sentence boundaries in each training document. The idea is to define a group g_{d,s} for every sentence s in every training document d. The group contains coefficients for words that occur in its sentence. This means that a word is a member of one group for every distinct (training) sentence it occurs in, and that the regularizer is based on word tokens, not types as in the approaches of Martins et al. (2011) and Nelakanti et al. (2013). The regularizer is:

Ω_sen(w) = Σ_{d=1}^{D} Σ_{s=1}^{S_d} λ_{d,s} ‖w_{d,s}‖_2,

where S_d is the number of sentences in document d. This regularizer results in tens of thousands to millions of heavily overlapping groups, since a standard corpus typically contains thousands to millions of sentences and many words that appear in more than one sentence.

If the norm of w_{g_{d,s}} is driven to zero, then the learner has deemed the corresponding sentence irrelevant to the prediction. It is important to point out that, while the regularizer prefers to zero out the weights for all words in irrelevant sentences, it also prefers not to zero out weights for words in relevant sentences. Since the groups overlap and may work against each other, the regularizer may not be able to drive many weights to zero on its own. Yogatama and Smith (2014) used a linear combination of the sentence regularizer and the lasso (a kind of sparse group lasso; Friedman et al., 2010) to also encourage weights of irrelevant word types to go to zero.⁵

⁵ Formally, this is equivalent to including one additional group for each word type.
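A minimal sketch of how the sentence groups g_{d,s} could be assembled from a tokenized corpus; the toy corpus and the helper function name are assumptions made for illustration, not a prescribed implementation.

# Each training document is a list of sentences; each sentence is a list of tokens.
corpus = [
    [["the", "actors", "are", "fantastic", "."], ["i", "loved", "it", "."]],
    [["boring", "plot", "."]],
]
vocab = {w: i for i, w in enumerate(sorted({t for d in corpus for s in d for t in s}))}

def sentence_groups(corpus, vocab):
    # One group per (document, sentence): indices of word types in that sentence.
    groups = []
    for doc in corpus:
        for sent in doc:
            groups.append(sorted({vocab[t] for t in sent}))
    return groups

groups = sentence_groups(corpus, vocab)
# Omega_sen(w) then sums lam_{d,s} * ||w_{d,s}||_2 over these groups, as in Section 3.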
4.2 Parse Tree Regularizer

Sentence boundaries are a rather superficial kind of linguistic structure; syntactic parse trees provide more fine-grained information. We introduce a new regularizer, the parse tree regularizer, in which groups are defined for every constituent in every parse of a training data sentence.

Figure 1 illustrates the group structures derived from an example sentence from the Stanford sentiment treebank (Socher et al., 2013). This regularizer captures the idea that phrases might be selected as relevant or (in most cases) irrelevant to a task, and is expected to be especially useful in sentence-level prediction tasks.

[Figure 1 (tree drawing omitted in this text version): An example of a parse tree from the Stanford sentiment treebank, which annotates sentiment at the level of every constituent (indicated by + and ++; no marking indicates neutral sentiment). The sentence is "The actors are fantastic." Our regularizer constructs nine groups for this sentence, corresponding to c_0, c_1, ..., c_8. g_{c_0} consists of 5 weights, ⟨w_the, w_actors, w_are, w_fantastic, w_.⟩, exactly the same as the group in the sentence regularizer; g_{c_1} consists of 2 words, g_{c_4} of 3 words, etc. Notice that c_2, c_3, c_6, c_7, and c_8 each consist of only 1 word. As in this example, most constituents are annotated as neutral.]

The parse-tree regularizer (omitting the group coefficients and λ) for one sentence with the parse tree shown in Figure 1 is:

Ω_tree(w) = √(|w_the|² + |w_actors|² + |w_are|² + |w_fantastic|² + |w_.|²)
  + √(|w_the|² + |w_actors|²) + √(|w_are|² + |w_fantastic|² + |w_.|²)
  + √(|w_are|² + |w_fantastic|²)
  + |w_the| + |w_actors| + |w_are| + |w_fantastic| + |w_.|

The groups have a tree structure, in that assigning zero values to the weights in a group corresponding to a higher-level constituent implies the same for those constituents that are dominated by it. This resembles the tree-guided group lasso of Kim and Xing (2008), although the leaf nodes in their tree represent tasks in multi-task regression. Of course, in a corpus there are many parse trees (one per sentence, so the number of parse trees is the number of sentences). The parse-tree regularizer is:

Ω_tree(w) = Σ_{d=1}^{D} Σ_{s=1}^{S_d} Σ_{c=1}^{C_{d,s}} λ_{d,s,c} ‖w_{d,s,c}‖_2,

where λ_{d,s,c} = λ_glas √size(g_{d,s,c}), d ranges over (training) documents, and c ranges over constituents in the parse of sentence s in document d. Similar to the sentence regularizer, the parse-tree regularizer operates on word tokens. Note that, since each word token is itself a constituent, the parse tree regularizer naturally includes terms just like the lasso, penalizing the absolute value of each word's weight in isolation. For the lasso-like penalty on each word, instead of letting each word type's group weight scale with its number of tokens, we tune one group weight for all word types on development data. As a result, besides λ_glas, we have an additional hyperparameter, denoted by λ_las.

To gain an intuition for this regularizer, consider the case where we apply the penalty only for a single tree (sentence), which for ease of exposition is assumed not to use the same word more than once (i.e., every word occurs at most once). Because it instantiates the tree-structured group lasso, the regularizer will require "bigger" constituents to be included (i.e., their words given nonzero weight) before smaller constituents can be included. The result is that some words may not be included. Of course, in some sentences, some words will occur more than once, and the parse tree regularizer instantiates groups for constituents in every sentence in the training corpus, and these groups may work against each other. The parse tree regularizer should therefore be understood as encouraging group behavior of syntactically grouped words, or sharing of information by syntactic neighbors.

In sentence-level prediction tasks, such as sentence-level sentiment analysis, it is known that most constituents (especially those that correspond to shorter phrases) in a parse tree are uninformative (neutral sentiment). This was verified by Socher et al. (2013) when annotating phrases in a sentence for building the Stanford sentiment treebank. Our regularizer incorporates our prior expectation that most constituents should have no effect on prediction.
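To make the constituent groups concrete, the sketch below walks a bracketed parse and emits one group per constituent; the nested-list encoding of the tree and the tiny example mirroring Figure 1 are assumptions made for illustration.

# A constituent is a (possibly nested) list of its child constituents; a leaf is a token string.
parse = [[["The"], ["actors"]], [[["are"], ["fantastic"]], ["."]]]

def constituent_groups(node, vocab, groups):
    # Post-order walk: collect the word-type indices under every constituent.
    if isinstance(node, str):
        return {vocab[node]}
    covered = set()
    for child in node:
        covered |= constituent_groups(child, vocab, groups)
    groups.append(sorted(covered))        # one group per constituent (incl. single tokens)
    return covered

vocab = {"The": 0, "actors": 1, "are": 2, "fantastic": 3, ".": 4}
groups = []
constituent_groups(parse, vocab, groups)
print(len(groups), groups)   # nine groups for this tree, as in Figure 1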
4.3 LDA Regularizer

Another type of structure to consider is topics. For example, if we want to predict whether a paper will be cited or not (Yogatama et al., 2011), the model can perform better if it knows beforehand the collections of words that represent certain themes (e.g., in ACL papers, these might include machine translation, parsing, etc.). As a result, the model can focus on which topics will increase the probability of getting citations, and penalize weights for words in the same topic together, instead of treating each word separately.

We do this by inferring topics in the training corpus by estimating the latent Dirichlet allocation (LDA) model (Blei et al., 2003). Note that LDA is an unsupervised method, so we can infer topical structures from any collection of documents that are considered related to the target corpus (e.g., training documents, text from the web, etc.). This contrasts with typical semi-supervised learning methods for text categorization that combine unlabeled and labeled data within a generative model, such as multinomial naive Bayes, via expectation-maximization (Nigam et al., 2000) or semi-supervised frequency estimation (Su et al., 2011). Our method does not use unlabeled data to obtain more training documents or estimate the joint distributions of words better, but it allows the use of unlabeled data to induce topics. We leave comparison with other semi-supervised methods for future work.

There are many ways to associate inferred topics with group structure. In our experiments, we choose the R most probable words given a topic and create a group for them.⁶ The LDA regularizer can be written as:

Ω_lda(w) = Σ_{k=1}^{K} λ_k ‖w_k‖_2,

where k ranges over the K topics. Similar to our earlier notation, w_k corresponds to the subvector of w whose features are present in topic k. Note that in this case we can also have overlapping groups, since words can appear in the top R of more than one topic.

⁶ Another possibility is to group the smallest set of words that accounts for P (e.g., 0.99) of the probability mass of a topic. Preliminary experiments found this not to work well.

Table 1: A toy example of K = 4 topics; the top R = 5 words in each topic are displayed.
  k=1: soccer, striker, midfielder, goal, defender
  k=2: injury, knee, ligament, shoulder, cruciate
  k=3: physics, gravity, moon, sun, relativity
  k=4: monday, tuesday, april, june, sunday
The LDA regularizer will construct four groups from these topics. The first group is ⟨w_soccer, w_striker, w_midfielder, w_goal, w_defender⟩, the second group is ⟨w_injury, w_knee, w_ligament, w_shoulder, w_cruciate⟩, etc. In this example, there are no words occurring in the top R of more than one topic, but that need not be the case in general.

To gain an intuition for this regularizer, consider the toy example in Table 1, where we have K = 4 topics and select the R = 5 top words from each topic. Suppose that we want to classify whether an article is a sports article or a science article. The regularizer might encourage the weights for the fourth topic's words toward zero, since they are less useful for the task. Additionally, the regularizer will penalize words in each of the other three groups collectively. Therefore, if (for example) ligament is deemed a useful feature for classifying an article to be about sports, then the other words in that topic will have a smaller effective penalty for getting nonzero weights, even weights of the opposite sign as w_ligament. It is important to distinguish this from unstructured regularizers such as the lasso, which penalize each word's weight on its own without regard for related word types.

Unlike the parse tree regularizer, the LDA regularizer is not tree-structured. Since the lasso-like penalty does not occur naturally in a non-tree-structured regularizer, we add an additional lasso penalty for each word type (with hyperparameter λ_las) to also encourage weights of irrelevant words to go to zero. Our LDA regularizer is an instance of the sparse group lasso (Friedman et al., 2010).
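A sketch of turning fitted topics into groups by taking the R most probable words per topic; the random topic-word matrix below is stand-in data (in the paper, topics come from LDA run on the training documents), and the function name is ours.

import numpy as np

rng = np.random.default_rng(0)
K, V, R = 4, 1000, 5
topic_word = rng.random((K, V))            # stand-in for LDA's topic-word probabilities
topic_word /= topic_word.sum(axis=1, keepdims=True)

def lda_groups(topic_word, R):
    # One group per topic: indices of the R most probable words given that topic.
    groups = []
    for k in range(topic_word.shape[0]):
        top_r = np.argsort(-topic_word[k])[:R]
        groups.append(sorted(top_r.tolist()))
    return groups

groups = lda_groups(topic_word, R)
# Omega_lda(w) sums lam_k * ||w_k||_2 over these groups, plus a lasso term on each word.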
4.4 Brown Cluster Regularizer

Brown clustering is a commonly used unsupervised method for grouping words into a hierarchy of clusters (Brown et al., 1992). Because it uses local information, it tends to discover words with similar syntactic behavior, though semantic groupings are often evident, especially at the more fine-grained end of the hierarchy.

We incorporate Brown clusters into a regularizer in a similar way to the topical word groups inferred using LDA in Section 4.3, but here we make use of the hierarchy. Specifically, we construct tree-structured groups, one per cluster (i.e., one per node in the hierarchy). The Brown cluster regularizer is:

Ω_brown(w) = Σ_{v=1}^{N} λ_v ‖w_v‖_2,

where v ranges over the N nodes in the Brown cluster tree. As a tree-structured regularizer, this regularizer enforces the constraint that a node v's group is given nonzero weights only if the nodes that dominate v (i.e., are on a path from v to the root) have their groups selected.

[Figure 2 (tree drawing omitted in this text version): An illustrative example of Brown clusters over nine words (goal, striker, midfielder, knee, injury, moon, sun, monday, sunday). The Brown cluster regularizer constructs 17 groups, one per node in this tree, v_0, v_1, ..., v_16. v_0 contains all nine words, v_1 contains five, etc. Note that the leaves, v_8, v_9, ..., v_16, each contain one word.]

Consider a toy example similar to the one for the LDA regularizer (sports vs. science) and the hierarchical clustering of words in Figure 2. In this case, the Brown cluster regularizer will create 17 groups, one for every node in the clustering tree. The regularizer for this tree (omitting the group coefficients and λ) is:

Ω_brown(w) = Σ_{i=0}^{7} ‖w_{v_i}‖_2 + |w_goal| + |w_striker| + |w_midfielder| + |w_knee|
  + |w_injury| + |w_moon| + |w_sun| + |w_monday| + |w_sunday|

The regularizer penalizes words in a cluster together, exploiting discovered syntactic relatedness. Additionally, the regularizer can zero out weights of words corresponding to any of the internal nodes, such as v_7 if the words monday and sunday are deemed irrelevant to prediction.

Note that the regularizer already includes terms like the lasso naturally. Similar to the parse tree regularizer, for the lasso-like penalty on each word, we tune one group weight for all word types on development data, with a hyperparameter λ_las.
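Brown clustering tools commonly emit a bit-string path for each word type; the sketch below derives one group per node of the implied hierarchy from such paths. The toy paths are invented for illustration and only loosely mirror Figure 2.

from collections import defaultdict

# word -> bit-string path in the Brown hierarchy (invented toy paths).
paths = {"goal": "0000", "striker": "0001", "midfielder": "001",
         "knee": "010", "injury": "011",
         "moon": "100", "sun": "101", "monday": "110", "sunday": "111"}
vocab = {w: i for i, w in enumerate(sorted(paths))}

def brown_groups(paths, vocab):
    # One group per node: every word whose path passes through that node.
    # Prefixes of a word's bit string identify its ancestors; '' is the root.
    node_to_words = defaultdict(set)
    for word, path in paths.items():
        for i in range(len(path) + 1):      # include the root ('') and the leaf itself
            node_to_words[path[:i]].add(vocab[word])
    return [sorted(ws) for ws in node_to_words.values()]

groups = brown_groups(paths, vocab)
print(len(groups))   # 17 nodes for these toy paths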
A key difference between the Brown cluster regularizer and the parse tree regularizer is that there is only one tree for the Brown cluster regularizer, whereas the parse tree regularizer can have millions (one per sentence in the training data). The LDA and Brown cluster regularizers offer ways to incorporate unlabeled data, if we believe that the unlabeled data can help us infer better topics or clusters. Note that the processes of learning topics or clusters, or parsing training data sentences, are a separate stage that precedes learning our predictive model.
5 Learning

There are many optimization methods for learning models with structured regularizers, particularly the group lasso (Jacob et al., 2009; Jenatton et al., 2011; Chen et al., 2011; Qin and Goldfarb, 2012; Yuan et al., 2013). We choose the optimization method of Yogatama and Smith (2014), since it handles millions of overlapping groups effectively. The method is based on the alternating direction method of multipliers (ADMM; Hestenes, 1969; Powell, 1969). We review it here in brief, for completeness, and show how it can be applied to tree-structured regularizers (such as the parse tree and Brown cluster regularizers in Section 4) in particular.

Our learning problem is, generically:

min_w Ω(w) + Σ_{d=1}^{D} L(x_d, w, y_d).

Separating the lasso-like penalty for each word type from our group regularizers, we can rewrite this problem as:

min_{w,v} Ω_las(w) + Ω_glas(v) + Σ_{d=1}^{D} L(x_d, w, y_d)   s.t. v = Mw,

where v consists of copies of the elements of w. Notice that we work directly on w instead of the copies for the lasso-like penalty, since it does not have overlaps and has its own hyperparameter λ_las. For the remaining groups, those with size greater than one, we create copies v of size L = Σ_{g=1}^{G} size(g). M ∈ {0, 1}^{L×V} is a matrix whose 1s link elements of w to their copies.⁷

⁷ For the parse tree regularizer, L is the sum, over all training-data word tokens t, of the number of constituents t belongs to. For the LDA regularizer, L = R × K. For the Brown cluster regularizer, L = V − 1.
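A sketch of building the 0/1 copy matrix M from the groups; as described above, only groups of size greater than one receive copies. The dense representation and function name are assumptions for illustration; at realistic scale M would be stored as a sparse matrix.

import numpy as np

def build_copy_matrix(groups, V):
    # One row of M per (group, member) pair, restricted to groups with more than
    # one member; M[i, j] = 1 links copy i to weight w_j.
    cols = [j for g in groups if len(g) > 1 for j in g]
    M = np.zeros((len(cols), V))
    for i, j in enumerate(cols):
        M[i, j] = 1.0
    return M

# Two multi-word groups and one singleton (the singleton gets no copies).
M = build_copy_matrix([[0, 1, 2], [2, 3], [4]], V=5)
print(M.shape)   # (5, 5): L = 3 + 2 copies, V = 5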
We now have a constrained optimization problem, from which we can create an augmented Lagrangian problem; let u be the Lagrange variables:

Ω_las(w) + Ω_glas(v) + L(w) + u^⊤(v − Mw) + (ρ/2) ‖v − Mw‖_2^2

ADMM proceeds by iteratively updating each of w, v, and u, amounting to the following subproblems:

min_w Ω_las(w) + L(w) − u^⊤Mw + (ρ/2) ‖v − Mw‖_2^2    (1)
min_v Ω_glas(v) + u^⊤v + (ρ/2) ‖v − Mw‖_2^2           (2)
u = u + ρ (v − Mw)                                      (3)

Yogatama and Smith (2014) show that Eq. 1 can be rewritten in a form quite similar to ℓ2-regularized loss minimization.⁸

⁸ The difference lies in that the squared ℓ2 norm in the penalty penalizes the difference between w and a vector that depends on the current values of u and v. This does not affect the algorithm or its convergence in any substantive way.

Eq. 2 is the proximal operator of (1/ρ) Ω_glas applied to Mw − u/ρ. As such, it depends on the form of M. Note that when applied to the collection of copies of the parameters, v, Ω_glas no longer has overlapping groups. Define M_g as the rows of M corresponding to weight copies assigned to group g. Let z_g = M_g w − u_g/ρ, and denote λ̃_g = λ_glas √size(g). The problem can be solved by applying the proximal operator used in the non-overlapping group lasso to each subvector:

v_g = prox_{λ̃_g/ρ}(z_g) = 0                                  if ‖z_g‖_2 ≤ λ̃_g/ρ,
                          ((‖z_g‖_2 − λ̃_g/ρ) / ‖z_g‖_2) z_g  otherwise.
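A sketch of the v- and u-updates implied by Eqs. 2 and 3, i.e., group soft-thresholding applied to each copied subvector followed by the dual step. The w-update of Eq. 1 (an ℓ2-regularized loss minimization) is not shown, and the assumption that each group's copies occupy contiguous rows of M follows the construction sketched earlier; function and variable names are ours.

import numpy as np

def prox_group(z_g, lam_g_tilde, rho):
    # Group soft-thresholding: 0 if ||z_g||_2 <= lam_g_tilde / rho,
    # otherwise shrink z_g toward zero by lam_g_tilde / rho.
    norm = np.linalg.norm(z_g, 2)
    thresh = lam_g_tilde / rho
    if norm <= thresh:
        return np.zeros_like(z_g)
    return ((norm - thresh) / norm) * z_g

def update_v_and_u(w, u, M, group_slices, lam_glas, rho):
    # One ADMM pass over the copies: z = Mw - u/rho, prox per group, then dual step.
    z = M.dot(w) - u / rho
    v = np.empty_like(z)
    for sl, size in group_slices:            # sl indexes this group's rows of M
        v[sl] = prox_group(z[sl], lam_glas * np.sqrt(size), rho)
    u = u + rho * (v - M.dot(w))
    return v, u

# Toy usage with one 3-word group copied into rows 0..2 of M.
M = np.eye(3)
w = np.array([0.5, -0.4, 0.1])
u = np.zeros(3)
v, u = update_v_and_u(w, u, M, [(slice(0, 3), 3)], lam_glas=0.1, rho=1.0)
print(v)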
For a tree-structured regularizer, we can get speedups by working from the root node towards the leaf nodes when applying the proximal operator in the second step. If g is a node in a tree that is driven to zero, all of its children h (whose groups satisfy g_h ⊂ g_g) will also be driven to zero. Eq. 3 is a simple update of the dual variable u. Algorithm 1 summarizes our learning procedure.⁹

⁹ We use relative changes in the ℓ2 norm of the parameter vector w as our convergence criterion (threshold of 10^−3), and set the maximum number of iterations to 100. Other criteria can also be used.

Algorithm 1: ADMM for overlapping group lasso
Input: augmented Lagrangian variable ρ, regularization strengths λ_glas and λ_las
while stopping criterion not met do
  w = argmin_w Ω_las(w) + L(w) + (ρ/2) Σ_{i=1}^{V} N_i (w_i − μ_i)^2
  for g = 1 to G do
    v_g = prox_{λ̃_g/ρ}(z_g)
  end for
  u = u + ρ (v − Mw)
end while
(Here N_i and μ_i collect, for each weight w_i, the number of its copies and a target value that depends on the current u and v; see Yogatama and Smith, 2014, and footnote 8.)

6 Experiments

6.1 Datasets

We use publicly available datasets to evaluate our model; they are described in more detail below.

Topic classification. We consider four binary categorization tasks from the 20 Newsgroups dataset.¹⁰ Each task involves categorizing a document according to two related categories: comp.sys: ibm.pc.hardware vs. mac.hardware; rec.sport: baseball vs. hockey; sci: med vs. space; and alt.atheism vs. soc.religion.christian.

¹⁰ http://qwone.com/jason/20Newsgroups

Sentiment analysis. One task in sentiment analysis is predicting the polarity of a piece of text, i.e., whether the author is favorably inclined toward a (usually known) subject of discussion or proposition (Pang and Lee, 2008). Sentiment analysis, even at the coarse level of polarity we consider here, can be confused by negation, stylistic use of irony, and other linguistic phenomena. Our sentiment analysis datasets consist of movie reviews from the Stanford sentiment treebank (Socher et al., 2013)¹¹ and floor speeches by U.S. Congressmen alongside yea/nay votes on the bill under discussion (Thomas et al., 2006).¹² For the Stanford sentiment treebank, we only predict binary classifications (positive or negative) and exclude neutral reviews.

¹¹ http://nlp.stanford.edu/sentiment/
¹² http://www.cs.cornell.edu/ainur/data.html

Text-driven forecasting. Forecasting from text requires identifying textual correlates of a response variable revealed in the future, most of which will be weak and many of which will be spurious (Kogan et al., 2009). We consider two such problems. The first is predicting whether a scientific paper will be cited or not within three years of its publication (Yogatama et al., 2011); the dataset comes from the ACL Anthology and consists of research papers from the Association for Computational Linguistics and citation data (Radev et al., 2009). The second task is predicting whether a legislative bill will be recommended by a Congressional committee (Yano et al., 2012).¹³

¹³ http://www.ark.cs.cmu.edu/bills
Table 2 summarizes statistics about the datasets used in our experiments. In total, we evaluate our method on eight binary classification tasks.

Table 2: Descriptive statistics about the datasets.
  Task         Dataset  D       # Dev.  # Test  V
  20N          science  952     235     790     30,154
  20N          sports   958     239     796     20,832
  20N          relig.   870     209     717     24,528
  20N          comp.    929     239     777     20,868
  Sentiment    movie    6,920   872     1,821   17,576
  Sentiment    vote     1,175   257     860     24,508
  Forecasting  science  3,207   280     539     42,702
  Forecasting  bill     37,850  7,341   6,571   10,001

6.2 Setup

In all our experiments, we use unigram features plus an additional bias term which is not regularized. We compare our new regularizers with state-of-the-art methods for document classification: lasso, ridge, and elastic net regularization, as well as the sentence regularizer discussed in Section 4.1 (Yogatama and Smith, 2014).¹⁴

¹⁴ Hyperparameters are tuned on a separate development dataset, using accuracy as the evaluation criterion. For lasso and ridge models, we choose λ from {10^−2, 10^−1, 1, 10, 10^2, 10^3}. For elastic net, we perform grid search on the same set of values as the ridge and lasso experiments for λ_rid and λ_las. For the sentence, Brown cluster, and LDA regularizers, we perform grid search on the same set of values as the ridge and lasso experiments for ρ, λ_glas, and λ_las. For the parse tree regularizer, because there are many more groups than for the other regularizers, we choose λ_glas from {10^−4, 10^−3, 10^−2, 10^−1, 10} and λ_las from the same set of values as the ridge and lasso experiments. If there is a tie on development data, we choose the model with the smallest number of nonzero weights.

We parsed all corpora using the Berkeley parser (Petrov and Klein, 2007).¹⁵ For the LDA regularizers, we ran LDA¹⁶ on the training documents with K = 1,000 and R = 10. For the Brown cluster regularizers, we ran Brown clustering¹⁷ on the training documents with 5,000 clusters for the topic classification and sentiment analysis datasets, and 1,000 clusters for the larger text forecasting datasets (since they are bigger datasets that took more time).

¹⁵ https://code.google.com/p/berkeleyparser/
¹⁶ http://www.cs.princeton.edu/blei/lda-c/
¹⁷ https://github.com/percyliang/brown-cluster

6.3 Results

Table 3 shows the results of our experiments on the eight datasets. The results demonstrate the superiority of structured regularizers: one of them achieved the best result on all but one dataset.¹⁸ It is also worth noting that in most cases all variants of the structured regularizers outperformed lasso, ridge, and elastic net. In four cases, the new regularizers in this paper outperform the sentence regularizer.

¹⁸ The bill dataset, where they offered no improvement, is the largest by far (37,850 documents), and therefore the one where regularizers should matter the least. Note that the differences are small across regularizers for this dataset.
Table 3: Classification accuracies (%) on various datasets. m.f.c. is the most frequent class baseline.
  Task         Dataset   m.f.c.  lasso  ridge  elastic  sentence  parse  Brown  LDA
  20N          science   50.13   90.63  91.90  91.65    96.20     92.66  93.04  93.67
  20N          sports    50.13   91.08  93.34  93.71    95.10     93.09  93.71  94.97
  20N          religion  55.51   90.52  92.47  92.47    92.75     94.98  92.89  93.03
  20N          computer  50.45   85.84  86.74  87.13    90.86     89.45  86.36  88.42
  Sentiment    movie     50.08   78.03  80.45  80.40    80.72     81.55  80.34  78.36
  Sentiment    vote      58.37   73.14  72.79  72.79    73.95     73.72  66.86  73.14
  Forecasting  science   50.28   64.00  66.79  66.23    67.71     66.42  69.02  69.39
  Forecasting  bill      87.40   88.36  87.70  88.48    88.11     87.98  88.20  88.27

We can see that the parse tree regularizer performed the best for the movie review dataset. The task is to predict sentence-level sentiment, so each training example is a sentence. Since constituent-level annotations are available for this dataset, we only constructed groups for neutral constituents (i.e., we drive neutral constituents to zero during training). It has been shown that syntactic information is helpful for sentence-level predictions (Socher et al., 2013), so the parse tree regularizer is naturally suitable for this task.

The Brown cluster and LDA regularizers performed best for the forecasting scientific articles dataset. The task is to predict whether an article will be cited or not within three years after publication. Regularizers that exploit knowledge of semantic relations (e.g., topical categories), such as the Brown cluster and LDA regularizers, are therefore suitable for this type of prediction.

Table 4 shows model sizes obtained by each of the regularizers for each dataset. While the lasso prunes more aggressively, it almost always performs worse. Our structured regularizers were able to obtain significantly smaller models (27%, 34%, and 19% as large on average for the parse tree, Brown, and LDA regularizers, respectively) compared to the ridge model.

Table 4: Model sizes (percentages of nonzero features in the resulting models) on various datasets.
  Task         Dataset   lasso  ridge  elastic  sentence  parse  Brown  LDA
  20N          science   1      100    34       12        2      42     9
  20N          sports    2      100    15       3         3      16     9
  20N          religion  0.3    100    48       94        72     41     15
  20N          computer  2      100    24       10        5      24     8
  Sentiment    movie     10     100    54       83        87     59     12
  Sentiment    vote      2      100    44       6         2      30     4
  Forecasting  science   31     100    43       99        9      50     90
  Forecasting  bill      7      100    7        8         37     7      7

Topic and cluster features. Another way to incorporate LDA topics and Brown clusters into a linear model is by adding them as additional features. For the 20N datasets, we also ran lasso, ridge, and elastic net with additional LDA topic and Brown cluster features.¹⁹ Note that these new baselines use more features than our model. We can also add these additional features to our model and treat them as regular features (i.e., they do not belong to any groups and are regularized with a standard regularizer such as the lasso penalty). The results in Table 5 show that for these datasets, models that incorporate this information through structured regularizers outperformed models that encode this information as additional features in 4 out of 4 cases (LDA) and 2 out of 4 cases (Brown). Sparse models with Brown clusters appear to overfit badly; recall that the clusters were learned on only the training data, and clusters from a larger dataset would likely give stronger results. Of course, better performance might also be achieved by incorporating new features as well as using structured regularizers.

¹⁹ For LDA, we took the top 10 words in a topic as a feature. For Brown clusters, we add a cluster as an additional feature if its size is less than 50.

Table 5: Classification accuracies on the 20N datasets for lasso, ridge, and elastic net models with additional LDA features (top) and Brown cluster features (bottom). The last column shows the structured regularized models from Table 3.
  + LDA features
  Dataset   lasso  ridge  elastic  LDA reg.
  science   90.63  91.90  91.90    93.67
  sports    91.33  93.47  93.84    94.97
  religion  91.35  92.47  91.35    93.03
  computer  85.20  86.87  86.35    88.42
  + Brown features
  Dataset   lasso  ridge  elastic  Brown reg.
  science   86.96  90.51  91.14    93.04
  sports    82.66  88.94  85.43    93.71
  religion  94.98  96.93  96.93    92.89
  computer  55.72  96.65  67.57    86.36

6.4 Examples

To gain insight into the models, we inspect group sparsity patterns in the learned models by looking at the parameter copies v. This lets us see which groups are considered important (i.e., selected vs. removed). For each of the proposed regularizers, we inspect the model for a task in which it performed well.

For the parse tree regularizer, we inspect the model for the 20N:religion task. We observed that the model included most of the sentences (root node groups), but in some cases removed phrases from the parse trees, such as "ozzy osbourne" in the sentence "ozzy osbourne , ex-singer and main character of the black sabbath of good ole days past , is and always was a devout catholic ."

For the LDA regularizer, we inspect zero and nonzero groups (topics) in the forecasting scientific articles task. In this task, we observed that 642 out of 1,000 topics are driven to zero by our model. Table 6 shows examples of zero and nonzero topics for the dev.-tuned hyperparameter values. We can see that in this particular case, the model kept meaningful topics such as parsing and speech processing, and discarded general topics that are not correlated with the content of the papers (e.g., acknowledgment, document metadata, equation, etc.). Note that most weights for non-selected groups, even in w, are near zero.
Table 6: Examples of LDA regularizer-removed and -selected groups (in v) in the forecasting scientific articles dataset. (In the original, words with weights in w of magnitude greater than 10^−3 were highlighted by predicted class, not cited vs. cited.)
  Removed (= 0):
    acknowledgment: workshop arpa program session darpa research papers spoken technology systems
    document metadata: university references proceedings abstract work introduction new been research both
    equation: pr w h probability wi gram context z probabilities complete
    translation: translation target source german english length alignment hypothesis translations position
  Selected (≠ 0):
    translation: korean translation english rules sentences parsing input evaluation machine verb
    speech processing: speaker identification topic recognition recognizer models acoustic test vocabulary independent
    parsing: parser parsing probabilistic prediction parse pearl edges chart phase theory
    classification: documents learning accuracy bayes classification wt document naive method selection

For the Brown cluster regularizer, we inspect the model from the 20N:science task. 771 out of 5,775 groups were driven to zero for the best model tuned on the development set. Examples of zero and nonzero groups are shown in Table 7. Similar to the LDA example, the groups that were driven to zero tend to contain generic words that are not relevant to the predictions. We can also see the tree structure effect in the regularizer. The group {underwater, industrial} was driven to zero, but not once it was combined with other words such as hpr, std, and obsessive. Note that we ran Brown clustering only on the training documents; running it on a larger collection of (unlabeled) documents relevant to the prediction task (i.e., semi-supervised learning) is worth exploring in future work.

Table 7: Examples of Brown cluster regularizer-removed and -selected groups (in v) in the 20N:science task. # denotes any numeral. (In the original, words with weights in w of magnitude greater than 10^−3 were highlighted by predicted class, space vs. medical.)
  Removed (= 0):
    underwater industrial
    spotted hit reaped rejuvenated destroyed stretched undertake shake run
    seeing developing tingles diminishing launching finding investigating receiving maintaining
    adds engage explains builds
  Selected (≠ 0):
    failure reproductive ignition reproduction
    cyanamid planetary nikola fertility astronomical geophysical # lunar cometary supplying astronautical
    magnetic atmospheric
    std underwater hpr wordscan exclusively aneutronic industrial peoples obsessive
    congenital rare simple bowel hereditary breast

7 Related and Future Work

Overall, our results demonstrate that linguistic structure in the data can be used to improve bag-of-words models, through structured regularization. State-of-the-art approaches to some of these problems have used additional features and representations (Yessenalina et al., 2010; Socher et al., 2013). For example, for the vote sentiment analysis dataset, the latent variable models of Yessenalina et al. (2010) achieved a superior result of 77.67%. To do so, they sacrificed convexity and had to rely on side information for initialization. Our experimental focus has been on a controlled comparison between regularizers for a fixed model family (the simplest available: linear with bag-of-words features). However, the improvements offered by our regularization methods can be applied in future work to other model families with more carefully engineered features, metadata features (especially important in forecasting), latent variables, etc. In particular, note that other kinds of weights (e.g., metadata) can be penalized conventionally, or incorporated into the structured regularization where it makes sense to do so (e.g., n-grams, as in Nelakanti et al., 2013).

8 Conclusion

We introduced three data-driven, linguistically informed structured regularizers based on parse trees, topics, and hierarchical word clusters. We empirically showed that models regularized using our methods consistently outperformed standard regularizers that penalize features in isolation (such as lasso, ridge, and elastic net) on a range of datasets for various text prediction problems: topic classification, sentiment analysis, and forecasting.

Acknowledgments

The authors thank Brendan O'Connor for help with visualization and three anonymous reviewers for helpful feedback on an earlier draft of this paper. This research was supported in part by computing resources provided by a grant from the Pittsburgh Supercomputing Center, a Google research award, and the Intelligence Advanced Research Projects Activity via Department of Interior National Business Center contract number D12PC00347. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.

Stanley F. Chen and Ronald Rosenfeld. 2000. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50.

Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, and Eric P. Xing. 2011. Smoothing proximal gradient method for general structured sparse learning. In Proc. of UAI.

Jacob Eisenstein, Noah A. Smith, and Eric P. Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proc. of ACL.

Mario A. T. Figueiredo. 2002. Adaptive sparseness using Jeffreys prior. In Proc. of NIPS.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2010. A note on the group lasso and a sparse group lasso. Technical report, Stanford University.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proc. of ICML.

Magnus R. Hestenes. 1969. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303–320.

Arthur E. Hoerl and Robert W. Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. 2009. Group lasso with overlap and graph lasso. In Proc. of ICML.

Rodolphe Jenatton, Jean-Yves Audibert, and Francis Bach. 2011. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777–2824.

Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A. Smith. 2010. Movie reviews and revenues: An experiment in text regression. In Proc. of NAACL.

Seyoung Kim and Eric P. Xing. 2008. Feature selection via block-regularized regression. In Proc. of UAI.

Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. 2009. Predicting risk from financial reports with regression. In Proc. of HLT-NAACL.

Matthieu Kowalski and Bruno Torresani. 2009. Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients. Signal, Image and Video Processing, 3(3):251–264.

Xiaolei Lv, Guoan Bi, and Chunru Wan. 2011. The group lasso for stable recovery of block-sparse signal representations. IEEE Transactions on Signal Processing, 59(4):1371–1382.

Andre F. T. Martins, Noah A. Smith, Pedro M. Q. Aguiar, and Mario A. T. Figueiredo. 2011. Structured sparsity in structured prediction. In Proc. of EMNLP.

Anil Nelakanti, Cedric Archambeau, Julien Mairal, Francis Bach, and Guillaume Bouchard. 2013. Structured penalties for log-linear language models. In Proc. of EMNLP.

Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2–3):103–134.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proc. of HLT-NAACL.

M. J. D. Powell. 1969. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press.

Zhiwei (Tony) Qin and Donald Goldfarb. 2012. Structured sparsity via alternating direction methods. Journal of Machine Learning Research, 13:1435–1468.

Dragomir R. Radev, Pradeep Muthukrishnan, and Vahed Qazvinian. 2009. The ACL Anthology network corpus. In Proc. of ACL Workshop on Natural Language Processing and Information Retrieval for Digital Libraries.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Chris Manning, Andrew Ng, and Chris Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of EMNLP.

Jiang Su, Jelber Sayyad-Shirabad, and Stan Matwin. 2011. Large scale text classification using semi-supervised multinomial naive Bayes. In Proc. of ICML.

Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In Proc. of EMNLP.

Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67(1):91–108.

Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288.

Tae Yano, Noah A. Smith, and John D. Wilkerson. 2012. Textual predictors of bill survival in congressional committees. In Proc. of NAACL.

Ainur Yessenalina, Yisong Yue, and Claire Cardie. 2010. Multi-level structured models for document sentiment classification. In Proc. of EMNLP.

Dani Yogatama and Noah A. Smith. 2014. Making the most of bag of words: Sentence regularization with alternating direction method of multipliers. In Proc. of ICML.

Dani Yogatama, Michael Heilman, Brendan O'Connor, Chris Dyer, Bryan R. Routledge, and Noah A. Smith. 2011. Predicting a scientific community's response to an article. In Proc. of EMNLP.

Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67.

Lei Yuan, Jun Liu, and Jieping Ye. 2013. Efficient methods for overlapping group lasso. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2104–2116.

Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320.
