Escolar Documentos
Profissional Documentos
Cultura Documentos
in College Students
1348
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1348–1353,
Seattle, Washington, USA, 18-21 October 2013.
2013
c Association for Computational Linguistics
on precision (with no decrease in recall), as well However, results adding these features to LIWC did
as comparison with human performance by clinical not differ significantly from LIWC alone, and for
psychologists. Third, we show that LDA has iden- brevity we do not report them.3
tified meaningful, population-specific themes that
LDA features. We use vanilla LDA as imple-
go beyond the pre-defined LIWC categories and are
mented in the Mallet toolkit (McCallum, 2002), de-
psychologically relevant.
veloping a k-topic model on just training documents,
2 Predicting neuroticism and using the posterior topic distribution for each
training and test document as a set of k features.
2.1 Experimental framework Mallet’s stoplist and default parameters were used
Data. We utilize a collection of 6,459 stream-of- for burn-in, lag, number of iterations, priors, etc.
consciousness essays collected from college stu- Details on train/test splits and number k of topics
dents by Pennebaker and King (1999) between 1997 appear below.
and 2008, averaging approximately 780 words each.
LIWC+LDA features. The union of the LIWC
Students were asked to think about their thoughts,
features (one feature per category) and LDA features
sensations, and feelings in the moment and “write
(one feature per topic).
your thoughts as they come to you”.
Each essay is accompanied by metadata for its Prediction. We utilize linear regression in the
author, which includes scores for the Big-5 person- WEKA toolkit (Hall et al., 2009), estimated on train-
ality traits (John and Srivastava, 1999): agreeable- ing documents, to predict the neuroticism score as-
ness, conscientiousness, extraversion, neuroticism, sociated with the author of each test document.4
and openness. Because Big-5 assessment can be
done using a variety of different survey instruments 2.2 Results
(John et al., 2008), and different instruments were Table 1 shows the quality of prediction via linear
used from year to year, we treat the data from each regression, averaged over the eleven datasets, 1997
year as an independent dataset. through 2008, using Pearson correlation (r) as the
Any author missing any Big-5 attribute was ex- evaluation metric. For each year, we used 10-fold
cluded. Essays were tokenized using NLTK (Bird cross-validation to ensure proper separation of train-
et al., 2009) and lowercased, eliminating those that ing and test data. We experimented with LDA us-
failed tokenization because of encoding issues. This ing 20, 30, 40, and 50 topics.5
resulted in a dataset containing 4,777 essays with as- A first thing to observe is that the multiple re-
sociated Big-5 metadata. gressions using all LIWC categories produce much
stronger correlations with neuroticism than the indi-
LIWC features. For each document, we calcu-
vidual category correlations reported by Pennebaker
late the number of observed words in each of Pen-
and King.6 There the strongest individual corre-
nebaker and King’s 64 LIWC categories. These in-
lations with neuroticism for any LIWC categories
clude, among others, syntactic categories (e.g. pro-
are .16 (negative emotion words) and -.13 (positive
nouns, verbs), affect categories (e.g. negative emo-
emotion words), though it should be noted that their
tion words, anger words), semantic categories (e.g.
goal was to validate their categories as a meaningful
causation words, cognitive words), and topical cat-
3
egories (e.g. health, religion). For instance, the Using richer measures of complexity, e.g. Pakhomov et al.
anger category contains 190 word patterns specify- (2011), is a topic for future work.
4
In previous work we have found that multiple linear regression
ing, for example, words descriptive of contexts in- is competitive with more complicated techniques such as SVM
volving anger (e.g. brutal, hostile, shoot) and words regression, though we plan to explore the latter in future work.
5
that would be used by someone when angry (e.g. Full year-by-year data appears in supplemental materi-
bullshit, hate, moron). We also explored includ- als at http://umiacs.umd.edu/˜resnik/papers/
emnlp2013-supplemental/.
ing essays’ average sentence length and total word- 6
The comparison is not perfect, since they used Big-5 data col-
count, as an initial proxy for language complexity, lected between 1993 and 1998, and we also eliminated some
which often figures into psychological assessments. files during preprocessing.
1349
Feature set LIWC LDA20 LIWC+LDA20 LDA30 LIWC+LDA30 LDA40 LIWC+LDA40 LDA50 LIWC+LDA50
Average r 0.413 0.384 0.430 0.407 0.442* 0.420 0.459** 0.440 0.443*
Table 1: Prediction quality for neuroticism, for alternative feature sets (Pearson’s r). *p< .03, **p < .02
way to explore personality differences, not predic- ing depression.8 They were asked to “decide how
tion. at-risk this person might be for depression”, assign-
As noted in Table 1, paired t-tests (df=10, α = ing 0 (no significant concerns), 1 (mild concerns, but
.05) establish that, in comparing the cross-year av- does not require futher evaluation), or 2 (requires at-
erages, augmenting LIWC with topic features im- tention, refer for further evaluation). Following rec-
proves average correlation significantly over using ommended practice for cases where different labels
LIWC features alone for 30, 40, and 50 topics. are not equally distinct from each other (Artstein
LDA features alone do not improve significantly and Poesio, 2008), we evaluate inter-coder agree-
over LIWC. ment using Krippendorff’s α; our α, computed for
ordinal data, is 0.722.
3 Predicting Depression
3.1 Experimental framework Features. We ran 50-topic LDA on the 4,777 es-
says from §2.1 plus the BDI training items, using the
Data. We use essays collected by Rude et al.
posterior topic distributions as features as in §2.1.
(2004) similarly to §2.1; in this case, students were
As in §2.1, the LIWC features comprised one count
asked to “describe your deepest thoughts and feel-
per LIWC category, and LIWC+LDA features were
ings about being in college”. Each essay is accom-
the union of the two.
panied by the author’s Beck Depression Inventory
score (BDI). BDI (Beck et al., 1961) is a widely
used 21-question instrument that correlates strongly 3.2 Results
with ratings of depression by psychiatrists. Follow-
ing Rude et al., we treat BDI ≥ 14 as the threshold Regression on LIWC features alone achieved r =
for a positive instance of depression.7 Text prepro- .288, and adding topic features improved this sub-
cessing was done as in §2.1, with 124 documents in stantially to r = .416. Treating BDI ≥ 14 as the
total averaging around 390 words each. threshold for positive instances (i.e. that an author is
depressed), Table 2 shows that adding topic features
Training/test split. Because only 12 of 124 au- improves precision without harming recall. Auto-
thors met the BDI ≥ 14 threshold, we did not split matic prediction is more conservative than human
randomly, lest the test sample include too few pos- ratings, trading recall for precision to achieve com-
itive instances to be useful. Instead we included a parable F-measure on this test set.9
random 6 of the 12 above-threshold cases, plus 24
more items sampled at random, to create a 30-item
8
test set. To form the training set from the comple- These psychologists all have doctoral degrees, are licensed,
and spend significant time primarily in assessment and diag-
mentary items, we added two more copies of each nosis of psychological disorders. None were familiar with the
positive instance to help address class imbalance specifics of this study.
9
(Batista et al., 2004) and, following Rude et al., we A reviewer observes, correctly, that in a scenario where a sys-
excluded items with with BDI of 0 or 1 as potentially tem is providing preliminary screenings to aid psychologists,
the precision/recall tradeoff demonstrated here would poten-
invalid. tially be undesirable, since a presumed goal would be to not
miss any cases, even at the risk of some false positives. We
Human comparison. We created a set of expert note, however, that the real world is unfortunately replete with
human results for comparison by asking three prac- situations where there is significant cost or social/professional
ticing clinical psychologists to review the test doc- stigma associated with interventions or follow-up testing; in
uments and rate whether or not the author is suffer- such situations it might be high precision that is desirable.
These are challenging questions, and the ability to trade off
7
Each question contributes a score value from 0 to 3, so BDI precision versus recall more flexibly is a topic we are inter-
scores range from 0 to 63. ested in investigating in future work.
1350
P R F1 autistic spectrum disorders (Van Santen et al., 2010;
LIWC .43 .50 .46
Prudhommeaux et al., 2011; Lehr et al., 2013) and
LIWC+LDA .50 .50 .50
Rater 1 .38 .83 .52 dementia (Pakhomov et al., 2011; Lehr et al., 2012;
Rater 2 .33 .83 .47 Roark et al., 2011).
Rater 3 .33 .66 .44 With regard to depression, Neuman et al. (2012)
Table 2: Prediction quality for depression. develop a corpus-based “depression lexicon” and
produce promising screening results, and De Choud-
hury et al. (2013) predict social network behav-
4 Qualitative themes ior changes related to post-partum depression. Nei-
In order to explore the relevance of the themes ther, however, evaluates using formal instruments
uncovered by LDA, the third author, a practicing for clinical assessment.
clinical psychologist, reviewed the 50 LDA cate- Related investigations involving LDA include
gories created in §3.1. Each category, represented Zhai et al. (2012), who use LIWC to provide pri-
by its 20 highest-probability words, was given a ors for corpus-specific emotion categories; Stark et
readable description. Then, for each category, she al. (2012), who combine LIWC and LDA-based
was asked: “If you were conducting a clinical in- features in classification of social relationships; and
terview, would observing these themes in a patient’s Schwartz et al. (2013), who use lexical and topic-
responses make you more (less) likely to feel that the based features in Twitter to predict life satisfaction.
patient merited further evaluation for depression?”
6 Conclusions
Table 3 shows the seven topics selected as par-
ticularly indicative of further evaluation.10 These In this paper, we have aimed for a small, fo-
capture population-specific properties in ways that cused contribution, investigating the value-add of
LIWC cannot — for example, although LIWC does topic modeling in text analysis for depression, and
have a body category, it does not have a category for neuroticism as a strongly associated personal-
that corresponds to somatic complaints, which often ity measure. Our contribution here is not techni-
co-occur with depression. Similarly, some words re- cal: corpus-specific topics/themes are anticipated by
lated to energy level, e.g. tired, would be captured Zhai et al. (2012), and Stark et al. (2012) employ
in LIWC’s body, bio, and/or health category, but topic-based features for prediction in a supervised
the LDA theme corresponding to low energy or lack setting. Rather, our contribution here has been to
of sleep, another potential depression cue, contains show that topic models can get us beyond the LIWC
words that make sense there only in context (e.g. to- categories to relevant, population-specific themes
morrow, late). Other themes, such as the one labeled related to neuroticism and depression, and to sup-
HOMESICKNESS , are clearly relevant (potentially in- port that claim using evaluation against formal clin-
dicative of an adjustment disorder), but even more ical assessments. More data (e.g. Kosinski et al.
specific to the population and context. (2013)) and more sophisticated models (e.g. super-
vised LDA, Blei and McAuliffe (2008), and exten-
5 Related Work sions such as Nguyen et al. (2013)) will be the key
The application of NLP to psychological variables to further progress.
has seen a recent uptick in community activity. One
Acknowledgments
recent shared task brings together research on the
Big-5 personality traits (Celli et al., 2013; Kosin- We are grateful to Jamie Pennebaker for the LIWC
ski et al., 2013), and another involved research on lexicon and for allowing us to use data from Pen-
identification of emotion in suicide notes (Pestian et nebaker and King (1999) and Rude et al. (2004),
al., 2012). Other examples include NLP research on to the three psychologists who kindly took the time
10
All 50 can be found in the supplemental materials at
to provide human ratings, and to our reviewers for
http://umiacs.umd.edu/˜resnik/papers/ helpful comments. This work has been supported in
emnlp2013-supplemental/. part by NSF grant IIS-1211153.
1351
VEGETATIVE / ENERGY LEVEL sleep tired night bed morning class early tomorrow wake late asleep long hours day sleeping nap today fall stay
time
SOMATIC hurts sick eyes hurt cold head tired back nose itches hate stop starting water neck hand stomach feels kind sore
NEGATIVE / TROUBLE COPING don(’t) hate doesn care didn(’t) understand anymore feel isn(’t) stupid make won(’t) wouldn talk scared wanted
wrong mad stop shouldn(’t)
ANGER / FRUSTRATION hate damn stupid sucks hell shit crap man ass god don blah thing bad suck doesn fucking fuck freaking real
HOMESICKNESS home miss friends back school family weekend austin parents college mom lot boyfriend left houston visit weeks
wait high homesick
EMOTIONAL STRESS feel feeling thinking makes make felt feels things nervous scared lonely feelings afraid moment happy worry
comfortable stress excited guilty
ANXIETY feel happy things lot sad good makes bad make hard mind happen crazy cry day worry times talk great wanted
1352
text analysis. Artif. Intell. Med., 56(1):19–25, Septem- [Tellegen et al.2003] A. Tellegen, Y.S. Ben-Porath, J.L.
ber. McNulty, P.A. Arbisi, and B. Graham, J.R.and Kaem-
[Nguyen et al.2013] Viet-An Nguyen, Jordan Boyd- mer. 2003. The MMPI-2 Restructured Clinical
Graber, and Philip Resnik. 2013. Lexical and Scales: Development, validation, and interpretation.
hierarchical topic regression. In Neural Information Minneapolis, MN: University of Minnesota Press.
Processing Systems. [Van Santen et al.2010] Jan PH Van Santen, Emily T
[Pakhomov et al.2011] Serguei Pakhomov, Dustin Cha- Prud’hommeaux, Lois M Black, and Margaret
con, Mark Wicklund, and Jeanette Gundel. 2011. Mitchell. 2010. Computational prosodic markers for
Computerized assessment of syntactic complexity in autism. Autism, 14(3):215–236.
alzheimers disease: a case study of iris murdochs writ- [Zhai et al.2012] Ke Zhai, Jordan Boyd-Graber, Nima
ing. Behavior Research Methods, 43(1):136–144. Asadi, and Mohamad L. Alkhouja. 2012. Mr. lda:
[Pennebaker and King1999] James W Pennebaker and a flexible large scale topic modeling package using
Laura A King. 1999. Linguistic styles: language use variational inference in mapreduce. In Proceedings of
as an individual difference. Journal of personality and the 21st international conference on World Wide Web,
social psychology, 77(6):1296. WWW ’12, pages 879–888, New York, NY, USA.
[Pestian et al.2012] John P Pestian, Pawel Matykiewicz, ACM.
Michelle Linn-Gust, Brett South, Ozlem Uzuner, Jan
Wiebe, K Bretonnel Cohen, John Hurdle, Christopher
Brew, et al. 2012. Sentiment analysis of suicide
notes: A shared task. Biomedical Informatics Insights,
5(Suppl. 1):3.
[Prudhommeaux et al.2011] Emily T Prudhommeaux,
Brian Roark, Lois M Black, and Jan van Santen.
2011. Classification of atypical language in autism.
ACL HLT 2011, page 88.
[Roark et al.2011] Brian Roark, Margaret Mitchell, J Ho-
som, Kristy Hollingshead, and Jeffrey Kaye. 2011.
Spoken language derived measures for detecting mild
cognitive impairment. Audio, Speech, and Language
Processing, IEEE Transactions on, 19(7):2081–2090.
[Rude et al.2004] Stephanie Rude, Eva-Maria Gortner,
and James Pennebaker. 2004. Language use of de-
pressed and depression-vulnerable college students.
Cognition & Emotion, 18(8):1121–1133.
[Schwartz et al.2013] H. Andrew Schwartz, Johannes C.
Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski,
Megha Agrawal, Gregory J. Park, Shrinidhi K. Lak-
shmikanth, Sneha Jha, Martin E. P. Seligman, and
Lyle Ungar. 2013. Characterizing geographic varia-
tion in well-being using tweets. In Seventh Interna-
tional AAAI Conference on Weblogs and Social Media
(ICWSM 2013).
[Sibelius2013] Kathleen Sibelius. 2013. Increas-
ing access to mental health services, April.
http://www.whitehouse.gov/blog/2013/04/10/increasing-
access-mental-health-services.
[Stark et al.2012] Anthony Stark, Izhak Shafran, and Jef-
frey Kaye. 2012. Hello, who is calling?: can words
reveal the social nature of conversations? In Proceed-
ings of the 2012 Conference of the North American
Chapter of the Association for Computational Linguis-
tics: Human Language Technologies, pages 112–119.
Association for Computational Linguistics.
1353