
Content Analysis in the Social Sciences

Stuart Soroka, University of Michigan


Content Analysis in the Social Sciences

From Manual to Automated Approaches


Schedule:
Session 1. Introduction, and Building a Corpus
This brief session introduces some central ideas and objectives in content analysis, and then
focuses on issues of sampling and pre-processing, as they relate to both human and
automated approaches.

Session 2. Exploring Frequent, Discriminating, and Co-occurring Words
This session offers a first foray into computer-automated content analysis in R, focused on approaches that look at individual words. We look at word counts, word clouds, and some basic clustering methods that explore connections between words.

Session 3. Reliability and Validity in Human Coding and Dictionary-Based Approaches
This session focuses on capturing the frames, topics, or tone of text, comparing human coding with automated approaches, both dictionary-based and machine-learning.

Session 4. Supervised Learning, Unsupervised Learning, and Topic Modeling


This session offers an introduction to supervised and unsupervised learning approaches, and a preliminary application of LDA topic modeling.


Main Resources:
Klaus Krippendorff. 1989. Content analysis. Pp. 403-7 in the International Encyclopedia of
Communications, Erik Barnouw et al., eds., Oxford: Oxford University Press.
H. Andrew Schwartz and Lyle H. Ungar. 2015. Data-Driven Content Analysis of Social Media: A Systematic Overview of Automated Methods. The Annals of the American Academy of Political and Social Science 659: 78-94.
Justin Grimmer and Brandon M. Stewart. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis 21(3): 267-297.

Kenneth Benoit and Alexander Herzog. 2015. Text Analysis: Estimating Policy Preferences
From Written and Spoken Words. In Analytics, Policy and Governance, eds. Jennifer
Bachner, Kathryn Wagner Hill, and Benjamin Ginsberg.
Daniel Riffe, Stephen Lacy and Frederick Fico. 2014. Analyzing Media Messages: Chapters 6
(Reliability) and 7 (Validity).
Lori Young and Stuart Soroka. 2012. Affective News: The Automated Coding of Sentiment
in Political Texts, Political Communication 29: 205-231.
Cornelius Puschmann and Tatjana Scheffler. 2016. Topic Modeling for Media and Communication Research: A Short Primer. HIIG Discussion Paper Series, no. 2016-05.


Selected Supplementary Resources:
Kimberly A. Neuendorf. 2017. The Content Analysis Guidebook, 2nd ed. Sage.
Klaus Krippendorff. 2013. Content Analysis: An Introduction to Its Methodology, 3rd ed.
Sage.
Klaus Krippendorff and Mary Angela Bock, eds. 2009. The Content Analysis Reader. Sage.
Selections from Dhavan Shah, Joseph Cappella and W. Russell Neuman, eds. 2015. Toward
Computational Social Science: Big Data in Digital Environments, special issue of The Annals
of the American Academy of Political and Social Science 659.


Prep
- have R and RStudio installed, including the following packages: foreign, stargazer, tm,
wordcloud, stringr, irr, topicmodels, stm; use install.packages(x)
- request and download Lexicoder, and the Lexicoder Sentiment Dictionary
- download course files at www.snsoroka.com/files/UT.2017.zip
- dump all of this into a content analysis folder, ca, on your desktop, which in the end should look like this: [screenshot of the ca folder omitted]
Session 1

This session introduces some central ideas and objectives in content analysis, and then focuses on issues of sampling and pre-processing, as they relate to both human and automated approaches.

1.1 Foundations
1.2 The content analysis process
1.3 Typologies of content analysis
1.4 Principles of Content Analysis
1.5 Finding a Corpus
1.6 Spreadsheets & Software
1.7 Preparing your Corpus

1.1 Foundations

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.


1.1 Foundations

Harold D. Lasswell. 1942. The Politically Significant Content of the Press: Coding Procedures. Journalism Quarterly 19: 12-23.

"In general terms we know that response (R) is a function of environment (E) and predisposition (P), and no response is adequately explained until it has been related to both sets of determinants. The contents of the press represent part of the environment (E) of its readers, and we need to solve the technical problems connected with content analysis before we can assemble the facts needed to confirm basic hypotheses in the science of communication."
1.1 Foundations

Bernard Berelson. 1952. Content Analysis in Communication Research. Glencoe: Free Press.

"Content analysis is a research technique for the objective, systematic and quantitative description of the manifest content of communication."

Ole Holsti. 1969. Content Analysis for the Social Sciences and Humanities. Reading, MA: Addison-Wesley.

"Content analysis is any technique for making inferences by objectively and systematically identifying specified characteristics of messages."

Klaus Krippendorff. 2013. Content Analysis: An Introduction to Its Methodology, 3rd ed. Thousand Oaks, CA: Sage.

"Content analysis is a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use."
1.1 Foundations

There are some differences in these accounts of content analysis, including (drawn from Krippendorff):

Berelson takes content to be contained in a text.
Holsti takes content to be a property of the source of a text.
Krippendorff takes content to emerge in the process of a researcher analyzing a text relative to a particular context.

Source: Krippendorff, Content Analysis: An Introduction to its Methodology, 3rd ed.


1.1 Foundations

These relate to a standard theme in the field, focused on manifest versus latent content.

Manifest content: elements that are physically present and countable (Gray & Densten 1998).
Latent content: unobserved concept(s) that cannot be measured directly but can be represented or measured by one or more indicators (Hair et al. 2010).

I am not convinced that the manifest-latent distinction is useful; but the different approaches to content matter fundamentally to the kind of analyses we choose, manual or automated.

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.



1.2 The content analysis process

Source: Krippendorff. 1989. Content analysis.


1.3 Typologies of content analysis

An overview of text as data methods [diagram omitted; one branch: No Categories (just looking at words)]
Source: Grimmer and Stewart. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.
1.3 Typologies of content analysis

A categorization of content analysis techniques

Source: Schwartz and Ungar. 2015. Data-Driven Content Analysis of Social Media: A Systematic Overview of Automated Methods.
1.4 Principles of Content Analysis

Grimmer & Stewart's four principles of quantitative text analysis:

(1) All quantitative models of language are wrong, but some are useful.
(2) Quantitative methods for text amplify resources and augment humans.
(3) There is no globally best method for automated text analysis.
(4) Validate, Validate, Validate.

Source: Grimmer and Stewart. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.
1.5 Finding a Corpus

There are many possible sources of text:

Standard full-text resources (Lexis-Nexis, Factiva, etc.), which offer limited (but useful) access.
See Neuendorf's Message Archives:
http://academic.csuohio.edu/neuendorf_ka/content/archive.html
Plus lots of online content: scraping of news content, social media, parliamentary debates, movie scripts, etc.

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.


1.5 Finding a Corpus

There are many possible formats.

Source: Benoit and Herzog. 2015. Text Analysis: Estimating Policy Preferences From Written and Spoken Words.
1.6 Spreadsheets & Software

The standard approach for human-coded data is to use some form of simple questionnaire, on paper or digitally.

Source: Riffe, Lacy and Fico. 2014. Analyzing Media Messages, 3rd ed.
1.6 Spreadsheets & Software

For automated content analysis, there are many options for software. Our approach is going to be to use R, which has the following advantages:

(1) there are a number of very good packages for content analysis written for R, and it's free;
(2) R can manage relatively large bodies of data, textual and otherwise; and
(3) R can be both the software you use to produce the content analysis, and the software you use to estimate models.
1.6 Spreadsheets & Software

Examples in the lab portion of the class will use R; my own demonstrations will rely on RStudio as well (which is just a nice interface for R). We'll draw on Lexicoder as well, which is written in Java but can be used directly from R.
1.7 Preparing your Corpus

Once you have a corpus and an approach (/software), you need to prepare the corpus to work with that approach (/software).

For human coding, we typically worry about removing information you don't want coders to see.
For automated coding, researchers consider removing extra white space, upper-case letters, numbers, and stopwords, and stemming.
There also may be custom stopwords/pre-whitening for specific analyses.
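The automated-coding steps above can be sketched in a few lines. The course labs use R's tm package; here is the same logic in plain Python, with a tiny illustrative stopword list (not tm's):

```python
import re

# A small illustrative stopword list; tm's English list is much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "is", "was", "not"}

def preprocess(text, stopwords=STOPWORDS):
    """Apply the standard automated-coding steps: lower-case,
    strip numbers and punctuation, collapse extra white space,
    and drop stopwords."""
    text = text.lower()                      # remove upper-case letters
    text = re.sub(r"[0-9]+", " ", text)      # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)    # remove punctuation
    tokens = text.split()                    # splitting also collapses white space
    return [t for t in tokens if t not in stopwords]

print(preprocess("The  2 cats and a dog!"))  # ['cats', 'dog']
```

The order matters: lower-casing must come before the stopword check, since the stopword list is lower-case.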
1.7 Preparing your Corpus

Standard stopwords in the tm package in R:


[1] "i" "me" "my" "myself" "we" "our" "ours" "ourselves"
[9] "you" "your" "yours" "yourself" "yourselves" "he" "him" "his"
[17] "himself" "she" "her" "hers" "herself" "it" "its" "itself"
[25] "they" "them" "their" "theirs" "themselves" "what" "which" "who"
[33] "whom" "this" "that" "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being" "have" "has" "had"
[49] "having" "do" "does" "did" "doing" "would" "should" "could"
[57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're" "they're"
[65] "i've" "you've" "we've" "they've" "i'd" "you'd" "he'd" "she'd"
[73] "we'd" "they'd" "i'll" "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't" "haven't" "hadn't" "doesn't"
[89] "don't" "didn't" "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot"
[97] "couldn't" "mustn't" "let's" "that's" "who's" "what's" "here's" "there's"
[105] "when's" "where's" "why's" "how's" "a" "an" "the" "and"
[113] "but" "if" "or" "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about" "against" "between" "into"
[129] "through" "during" "before" "after" "above" "below" "to" "from"
[137] "up" "down" "in" "out" "on" "off" "over" "under"
[145] "again" "further" "then" "once" "here" "there" "when" "where"
[153] "why" "how" "all" "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no" "nor" "not" "only"
[169] "own" "same" "so" "than" "too" "very"
1.7 Preparing your Corpus

Selections from the pre-processing script from the Lexicoder Sentiment Dictionary:

To make sure we catch negations:
not been a -> not
not been any -> not
not could have -> not
not had a -> not

To make sure we don't mis-code positives:
child care -> child xcare
day care -> day xcare
foster care -> foster xcare
geriatric care -> geriatric xcare
great difficulty -> xgreat difficulty
great disappointment -> xgreat disappointment
great disservice -> xgreat disservice
mad cow -> xmad cow

To make sure we don't mis-code negatives:
tears of joy -> xtears of joy
waste disposal -> xwaste disposal
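Recodes like these amount to an ordered find-and-replace pass applied before dictionary counting. A minimal Python sketch, using a small subset of the rules above:

```python
# A sketch of applying Lexicoder-style recodes before dictionary counting.
# The rules are a small subset of those shown above.
RECODES = [
    ("not been a", "not"),
    ("child care", "child xcare"),
    ("great difficulty", "xgreat difficulty"),
    ("tears of joy", "xtears of joy"),
]

def apply_recodes(text, rules=RECODES):
    # Apply longer phrases first, so a short rule never clobbers
    # part of a longer one.
    for old, new in sorted(rules, key=lambda r: -len(r[0])):
        text = text.replace(old, new)
    return text

print(apply_recodes("she took great difficulty with child care"))
# she took xgreat difficulty with child xcare
```

The "x" prefix simply moves the phrase out of the dictionary's reach: "xcare" matches no positive entry, so "child care" no longer counts as positive.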
1.7 Preparing your Corpus

Selections from the pre-processing script used to look at open-ended responses to a survey question about what you have read, seen or heard about Clinton or Trump:

To make sure we catch important bigrams and trigrams:
paul ryan -> paulryan
mike pence -> mikepence
miss universe -> missuniverse
make america great -> makeamericagreat

To make sure we lump together related terms:
miss piggy -> missuniverse
russians -> russia

To make sure we lump together singular and plural forms:
immigrant -> immigration
immigrants -> immigration
emails -> email

[word cloud of responses omitted]
Source: electiondynamics.org.
1.7 Preparing your Corpus

An illustration of converting Presidential inaugural speeches into quantitative data

Source: Benoit and Herzog. 2015. Text Analysis: Estimating Policy Preferences From Written and Spoken Words.
Session 2

This session offers a first foray into computer-automated content analysis in R, focused on approaches that look at individual words. We look at word counts, word clouds, and methods that explore connections between words.

2.1 Taking words seriously
2.2 Humans vs computers
2.3 Pre-whitening for word-based analyses
2.4 Word mentions
2.5 Word clouds
2.6 Co-occurrences
2.1 Taking words seriously

Murray Edelman. 1985. Political Language and Political Reality. PS (Winter 1985): 10-19.

"It is language about political events and developments that people experience; even events that are close by take their meaning from the language used to depict them. So political language is political reality; there is no other so far as the meaning of events to actors and spectators is concerned."

2.1 Taking words seriously

Murray Edelman. 1985. Political Language and Political Reality. PS (Winter 1985): 10-19.

"In short, it is not reality in any testable or observable sense that matters in shaping political consciousness and behavior, but rather the beliefs that language helps evoke about the causes of discontents and satisfactions, about policies that will bring about a future closer to the heart's desire, and about other observables."
2.1 Taking words seriously

Deborah Stone. 1989. Causal Stories and the Formation of Policy Agendas. Political Science Quarterly 104(2): 281-300.

"Problem definition is a process of image making, where the images have to do fundamentally with attributing cause, blame and responsibility."

"Conditions, difficulties, or issues thus do not have inherent properties that make them more or less likely to be seen as problems or to be expanded. Rather, political actors deliberately portray them in ways calculated to gain support for their side."
2.1 Taking words seriously

Deborah Stone. 1989. Causal Stories and the Formation of Policy Agendas. Political Science Quarterly 104(2): 281-300.
2.1 Taking words seriously

Erving Goffman. 1974. Frame Analysis: An Essay on the Organization of Experience. London: Harper and Row.

Amos Tversky and Daniel Kahneman. 1981. The Framing of Decisions and the Psychology of Choice. Science 211(4481): 453-458.

Shanto Iyengar. 1991. Is Anyone Responsible? Chicago: University of Chicago Press.
2.1 Taking words seriously

Robert M. Entman. 1993. Framing: Toward Clarification of a Fractured Paradigm. Journal of Communication 43(4): 51-58.

"To frame is to select some aspects of a perceived reality and make them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described."
2.2 Humans vs computers

A focus on words highlights questions about whether the aim of automated content analysis should be to mimic, or improve on, human-coded analysis. For instance:

Can computers reliably identify frames?
Are frames reliably indicated by words?
Can computers reliably count words?
Is it possible that computer-coded frames are different from, but more reliable than, human-coded frames?
2.2 Humans vs computers

Which would code these variables more reliably: humans or computers?

Source: Maxwell McCombs and Donald Shaw, The Agenda-Setting Function of Mass Media, Chapter 2.7 in Krippendorff and Bock.
2.2 Humans vs computers

Which would code the readability/complexity of text more reliably: humans or computers?

Source: Klaus Krippendorff, Inferring the Readability of Text, Chapter 3.9 in Krippendorff and Bock.
2.3 Pre-whitening for word-based analyses

The steps in pre-whitening data will vary based on how you intend to analyze the data.

Simple word-based analysis may not require dealing with negations:
"scared" is the same as "not scared"; what matters are mentions of fear.
Or it may require that you pre-whiten so that negated words are counted separately:
"not scared" should be changed to "not_scared" or "notscared", so that it gets counted differently.
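The negation re-code described above amounts to one regular-expression substitution. A minimal Python sketch (the labs themselves use R):

```python
import re

def join_negations(text):
    """Re-write 'not X' as 'not_X' so that negated words are
    counted separately from their plain forms."""
    return re.sub(r"\bnot\s+(\w+)", r"not_\1", text)

print(join_negations("i am not scared, just not sure"))
# i am not_scared, just not_sure
```

A fuller version would also catch other negators ("no", "never", "n't"), but the principle is the same.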
2.3 Pre-whitening for word-based analyses

The steps in pre-whitening data will vary based on how you intend to analyze the data.

You may also use some very simple re-codes, to capture very similar words as a single word:
"horses" and "horse" should be counted together.
2.3 Pre-whitening for word-based analyses

One option is to stem your text, reducing words to their word stem:
"fishing", "fished", "fisher" -> "fish"
"argue", "argued", "argues", "arguing" -> "argu"

Another option is to lemmatize your text, grouping together all inflected versions of a word into linguistically valid lemmas:
"good", "better" -> "good"
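To illustrate the idea of stemming, here is a toy suffix-stripper in Python. It is not a real stemming algorithm; for actual work use a proper stemmer such as the Porter stemmer (which R's tm relies on for stemDocument):

```python
# A toy suffix-stripper, just to illustrate the idea of stemming.
# A real stemmer (e.g. Porter) applies many more, carefully ordered rules.
SUFFIXES = ["ing", "ed", "es", "er", "s", "e"]  # longest first

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least a three-letter stem so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["fishing", "fished", "fisher"]])  # ['fish', 'fish', 'fish']
print([toy_stem(w) for w in ["argue", "argued", "arguing"]])   # ['argu', 'argu', 'argu']
```

Note that, like a real stemmer, it maps "horse" and "horses" to the same (non-word) stem "hors"; stems need not be valid words, which is exactly what distinguishes stemming from lemmatization.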
2.3 Pre-whitening for word-based analyses

I don't usually recommend any kind of heavy-handed pre-whitening. Rather, I suggest looking at results, considering problems in common words, and then going back to re-process and re-analyze the data.

If your corpus never deals with fish, then pre-processing that deals with fishing is of no consequence. And there often are unique entries in a corpus that a standard approach to stemming won't deal with.
2.4 Word mentions

The simplest approach to automated content analysis is to count all the words in a corpus. The first (and nearly the only) step is to create a term-document matrix (tdm), with terms as rows and documents as columns.
2.4 Word mentions

Or, alternatively, a document-term matrix, with documents as rows and terms as columns.
2.4 Word mentions

The tdm can then be used to look at, for instance, the words with the highest occurrence in the corpus.
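A term-document matrix is just a table of counts. The labs use tm's TermDocumentMatrix; here is a minimal Python sketch of building one from two invented documents and reading off the most frequent words:

```python
from collections import Counter

# Two invented toy documents.
docs = {
    "doc1": "budget tax cuts and the budget debate",
    "doc2": "election debate on tax policy",
}

# Term-document matrix: one Counter of term frequencies per document.
tdm = {name: Counter(text.split()) for name, text in docs.items()}

# Sum across documents to get corpus-wide occurrences...
totals = Counter()
for counts in tdm.values():
    totals.update(counts)

# ...and read off the words with the highest occurrence:
# "budget", "tax" and "debate" each occur twice here.
print(totals.most_common(3))
```

In practice you would run the pre-processing steps from Session 1 (lower-casing, stopword removal, etc.) before counting.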
2.4 Word mentions

Source: Mary Angela Bock, Impressionistic Content Analysis: Word Counting in Popular Media, Chapter 1.7 in Krippendorff and Bock
2.5 Word Clouds

Words and Phrases (Left) and Topics (Right) Most Distinguishing Women (Top) from Men (Bottom), Facebook Posts

Source: Schwartz and Ungar. 2015. Data-Driven Content Analysis of Social Media: A Systematic Overview of Automated Methods.
2.5 Word Clouds

Word cloud plots for Presidential inaugural speeches since 1981, by (a) Democrat and (b) Republican presidents

Source: Benoit and Herzog. 2015. Text Analysis: Estimating Policy Preferences From Written and Spoken Words.
2.5 Word Clouds

Word cloud of differentiating words from open-ended responses to a survey question about what you have read, seen or heard about Clinton or Trump [comparison word cloud omitted]

Source: electiondynamics.org.
2.5 Word Clouds

A comparison cloud is based on the following:

Let p_i,j be the rate at which word i occurs in document j, and p_j be the average across documents,
p_j = sum_i p_i,j / ndocs.
The size of each word is mapped to its maximum deviation,
max_i (p_i,j - p_j),
and its angular position is determined by the document where that maximum occurs.

Source: description file for wordcloud package.
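The same calculation can be sketched in Python. This follows the prose of the package description: a word's rate is assumed to be its frequency normalized by document length, the average is taken across documents, and the deviation is maximized across documents (the toy documents are invented):

```python
# A sketch of comparison-cloud weighting, assuming a word's "rate"
# is its frequency divided by document length.
docs = {
    "dem": "jobs jobs health health care",
    "rep": "taxes freedom jobs taxes taxes",
}

# rates[word][doc]: rate of each word in each document
rates = {}
for j, text in docs.items():
    words = text.split()
    for w in words:
        rates.setdefault(w, {})
        rates[w][j] = rates[w].get(j, 0) + 1 / len(words)

weights = {}
for w, r in rates.items():
    per_doc = {j: r.get(j, 0.0) for j in docs}
    mean = sum(per_doc.values()) / len(docs)     # average rate across documents
    j_max = max(per_doc, key=per_doc.get)        # document of the maximum deviation
    weights[w] = (per_doc[j_max] - mean, j_max)  # font size ~ max deviation

# "taxes" is the most over-represented word, and it sits in "rep",
# so it would be drawn largest, in the "rep" segment of the cloud.
print(max(weights, key=lambda w: weights[w][0]))
```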


2.5 Word Clouds

So the math behind both standard word clouds and comparison clouds is relatively simple. So too is connecting frequency to font size; so constructing various size-weighted visualizations can be relatively straightforward.

[Figure: most-frequent words in responses, by week]
Jul 10-16: email berniesanders fbi liar scandal
Jul 17-23: email convention liar benghazi speech
Jul 24-30: convention email speech dnc president
Jul 31-6: email liar convention speech campaign
Aug 7-13: email liar tax foundation scandal
Aug 14-20: email foundation liar health going
Aug 21-27: email foundation liar scandal health
Aug 28-3: email foundation liar scandal fbi
Sep 4-10: email liar fbi scandal health talking
Sep 11-17: email liar issues campaign
Sep 18-24: email liar debate campaign health
Sep 25-1: email liar health night
Oct 2-8: debate email tax liar campaign
Oct 9-15: email debate liar leaks wiki
Oct 16-22: email debate wikileaks liar leaks
Oct 23-30: email debate wikileaks liar campaign
Oct 31-5: email fbi investigation reopening scandal
Nov 6-7: email fbi foundation investigation scandal
Source: electiondynamics.org.
2.6 Co-occurrences

So too is the calculation of co-occurrences, based on the tdm. Which words are most strongly correlated with mentions of word i? This is a simple matter of bivariate correlations between i and all other words in the tdm. Doing this manually would be tedious, but the tm package includes a function that will do this for you.
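The calculation behind tm's findAssocs() is just a column of Pearson correlations between term vectors. A minimal Python sketch with a toy term-document matrix (the terms and counts are invented):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Rows: terms; columns: counts in each of four toy documents.
tdm = {
    "email":   [3, 0, 2, 0],
    "scandal": [2, 0, 1, 0],
    "jobs":    [0, 2, 0, 1],
}

target = "email"
assocs = {w: pearson(tdm[target], counts)
          for w, counts in tdm.items() if w != target}

# "scandal" rises and falls with "email"; "jobs" moves the other way.
print(sorted(assocs, key=assocs.get, reverse=True))  # ['scandal', 'jobs']
```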
2.6 Co-occurrences

In the context of survey data, this kind of analysis can be useful as a way of understanding the language that is correlated with a survey response as well.

Words correlated (at > .049) with:
Positive favorability for Clinton: debate (.07), campaign (.06), campaigning (.05), president (.05), speech (.05)
Negative favorability for Clinton: liar (.16), email (.13), scandal (.10), benghazi (.08), foundation (.06), dishonest (.05)
Positive favorability for Trump: economy (.06), jobs (.06), makeamericagreat (.06), speech (.06), america (.05), country (.05)
Negative favorability for Trump: women (.07), sexual (.05)

Source: Gallup, Michigan, Georgetown Working Group


Session 3

This session focuses on capturing the frames, topics, or tone of text, comparing human coding with automated approaches, both dictionary-based and machine-learning.

3.1 Dictionary-Based Coding & Sentiment Analysis
3.2 Human coding
3.3 The classic approach to reliability
3.4 New concerns about reliability
3.5 The classic approach to measurement validity
3.6 New concerns about measurement validity
3.7 Picking a dictionary
3.1 Dictionary-Based Coding & Sentiment Analysis

Source: James Pennebaker and Cindy Chung, Computerized Text Analysis of Al Qaeda Transcripts, Chapter 7.7 in Krippendorff and Bock.
3.1 Dictionary-Based Coding & Sentiment Analysis

Source: Megan Bayagich, Laura Cohen, Lauren Farfel, Andrew Krowitz, Emily Kuchman, Sarah Lindenberg, Natalie Sochacki, and Hannah Suh, Exploring the Tone of the 2016 Campaign, CPS Blog.
3.1 Dictionary-Based Coding & Sentiment Analysis

Some standard sources for automated dictionaries include:
General Inquirer
Linguistic Inquiry and Word Count (LIWC)
Diction
Some Lexicoder-built dictionaries (adaptable to other software, including simple R coding):
Lexicoder Sentiment Dictionary
Lexicoder Topic Dictionary
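At its core, dictionary-based sentiment coding is a membership count: how many of a text's words appear in the positive list, how many in the negative list. A minimal Python sketch; the word lists here are tiny illustrative stand-ins, not the Lexicoder Sentiment Dictionary:

```python
# Tiny illustrative word lists; real dictionaries contain thousands
# of entries (and, as above, require pre-processing for negations).
POSITIVE = {"good", "great", "success"}
NEGATIVE = {"bad", "crisis", "failure"}

def net_tone(text):
    """Net tone: % positive words minus % negative words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 100 * (pos - neg) / len(words)

# 2 positive, 1 negative, 6 words: (2 - 1) / 6, i.e. roughly 16.7.
print(round(net_tone("a great success despite the crisis"), 1))
```

Normalizing by the number of words matters: it keeps long and short texts comparable.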
3.1 Dictionary-Based Coding & Sentiment Analysis

How do you make a decision about which dictionaries to use?
Always read the dictionary, and if you can't, don't use the dictionary.
Always check the use of words you are unsure of, ideally using the corpus you want to analyze.
Don't hesitate to change a dictionary to suit the context in which you're working with it (including dealing with suffixes), but keep careful track of how it's changed, and make the results available to others.
3.2 Human coding

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.

3.2 Human coding

Neuendorf's process of coder training

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.


Stuart Soroka, University of Michigan
Session 3. Dictionary-Based Approaches Content Analysis in the Social Sciences

3.2 Human coding

Source: Riffe, Lacy and Fico.


Stuart Soroka, University of Michigan
Session 3. Dictionary-Based Approaches Content Analysis in the Social Sciences

3.2 Human coding

Source: Riffe, Lacy and Fico.


Stuart Soroka, University of Michigan
Session 3. Dictionary-Based Approaches Content Analysis in the Social Sciences

3.3 The classic approach to reliability

[Figure: Neuendorf's comparison of reliability, precision, and accuracy.]

Intercoder reliability

Option 1: % agreement

Option 2: Scott's pi or Cohen's kappa
(% observed agreement - % expected agreement) / (1 - % expected agreement)

Option 3: Krippendorff's alpha
1 - (observed disagreement / expected disagreement)

Source: Riffe, Lacy and Fico.
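These agreement statistics are easy to compute directly. The Python sketch below implements % agreement and Scott's pi for two coders and nominal categories; Scott's pi computes expected agreement from the pooled distribution of codes across both coders (Cohen's kappa differs only in using each coder's own marginals).

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of units on which two coders assign the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def scotts_pi(a, b):
    """Scott's pi: chance-corrected agreement, with expected agreement
    taken from the pooled distribution of codes across both coders."""
    observed = percent_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    n = sum(pooled.values())
    expected = sum((c / n) ** 2 for c in pooled.values())
    return (observed - expected) / (1 - expected)
```

Note how chance correction bites: two coders who agree 75% of the time on a skewed two-category scheme can still have a pi well below 0.5.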


3.4 New concerns about reliability

What is the automated version of intercoder reliability?

Stuart Soroka. 2014. "Reliability and Validity in Automated Content Analysis," in Communication and Language Analysis in the Corporate World, Roderick P. Hart, ed., Hershey PA: IGI Global.
Do we always want intercoder reliability?

Usually we do for topics; we are probably less concerned with reliability for sentiment.
And whether we care about reliability depends in part on what we're using our data for.
For instance: Are we looking at what we think a government was doing, or at what an audience thought a government was doing?


3.5 The classic approach to measurement validity

Drawn from Riffe, Lacy and Fico's types of content analysis validity:

Face validity: based on a persuasive argument that the measure looks the way we would expect.
Concurrent validity: correlate the current measure with another (pre-existing) similar one.
Predictive validity: correlate the current measure with a predicted outcome.
Construct validity: observe whether the measure behaves as we expect alongside (potentially abstract) concepts.
3.6 New concerns about measurement validity

Comparing automation and human coding:

[Figures: Pairwise correlations, automated dictionaries; comparing results across dictionaries; automated dictionaries and manual coding; media tone and vote shares in the 2006 Canadian election.]

Source: Lori Young and Stuart Soroka. 2012. "Affective News: The Automated Coding of Sentiment in Political Texts," Political Communication 29: 205-231.
Session 4. (Un)Supervised Learning

This session offers an introduction to a range of supervised and unsupervised learning approaches, including a preliminary application of LDA topic modeling.

4.1 Various Approaches, Unsupervised & Supervised
4.2 Examples, Background
4.3 Probabilistic Topic Models (LDA, CTM, STM)
4.1 Various Approaches, Unsupervised & Supervised

Source: Grimmer and Stewart. 2013. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts."
4.2 Examples, Background

Supervised ideological scaling: Wordscores
Laver, Benoit, and Garry (2003) use humans to estimate the policy position of a set of reference texts (i.e., party platforms), then use a computer to count words common in source texts and assign an ideological value to other texts based on word occurrences.

Unsupervised ideological scaling: Wordfish
Slapin and Proksch (2008) use a scaling algorithm to estimate a dimension based on word use in texts (i.e., party platforms).
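The core Wordscores step can be sketched compactly: each word receives a score equal to the average of the reference-text scores, weighted by how strongly the word is associated with each reference text; a new ("virgin") text is then scored by the mean of its words' scores. The Python below is a simplified illustration with made-up texts, not the full Laver, Benoit, and Garry procedure (which also rescales virgin scores and computes uncertainty).

```python
from collections import Counter

def wordscores(ref_texts, ref_scores, virgin_text):
    """Score a new text from hand-scored reference texts (simplified Wordscores)."""
    # Relative frequency of each word within each reference text
    ref_freqs = []
    for text in ref_texts:
        counts = Counter(text.split())
        total = sum(counts.values())
        ref_freqs.append({w: c / total for w, c in counts.items()})
    vocab = set().union(*(f.keys() for f in ref_freqs))
    # Each word's score: average of reference scores, weighted by P(ref | word)
    word_scores = {}
    for w in vocab:
        fw = [f.get(w, 0.0) for f in ref_freqs]
        denom = sum(fw)
        word_scores[w] = sum(f / denom * s for f, s in zip(fw, ref_scores))
    # Virgin-text score: mean score of its words that appear in the references
    tokens = [t for t in virgin_text.split() if t in word_scores]
    if not tokens:
        raise ValueError("no scorable words in the virgin text")
    return sum(word_scores[t] for t in tokens) / len(tokens)
```

A virgin text drawing equally on both reference vocabularies lands at the midpoint of the two reference scores.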
Note that Wordscores, insofar as it trains a computer to match human codes, is similar to the machine-learning approach in RTextTools, aimed at text classification by training a model to match human codes (based on word frequencies).

It is also similar to less computationally complex, but often very powerful, ways of building dictionaries using seed words. This is more common in big-data work, where we might use the words co-occurring with a given hashtag to capture non-hashtagged Tweets with a similar theme or sentiment.
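A minimal version of the seed-word idea: collect the words that frequently co-occur with a seed hashtag, then use that lexicon to flag tweets that share the theme but lack the hashtag. The Python sketch below uses invented example data; real applications would filter stopwords and choose the co-occurrence threshold empirically.

```python
from collections import Counter

def expand_seed(tweets, seed, min_count=2):
    """Words co-occurring with the seed hashtag at least min_count times."""
    counts = Counter(w for t in tweets if seed in t.split()
                     for w in t.split() if w != seed)
    return {w for w, c in counts.items() if c >= min_count}

def find_matches(tweets, seed, lexicon):
    """Tweets that lack the seed hashtag but contain an expansion word."""
    return [t for t in tweets
            if seed not in t.split() and lexicon & set(t.split())]
```

This is exactly the kind of step a human should then double-check, since high-frequency co-occurring words are often generic rather than topical.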
That kind of seed-word-focused dictionary-building, as part of an iterative process in which humans double-check results and make adjustments to both dictionaries and pre-whitening scripts, is not far off from a supervised automated approach.

Nor is the use of Wordscores or Wordfish, each of which extracts dimensions that are useful only insofar as humans can identify what they mean. (If the dimensions don't work, then we might use different reference texts for Wordscores; we might change the corpus for Wordfish.)
In short, all good automated work requires some degree of human intervention, both at the pre-processing stage (pre-whitening, or initial dictionary-building) and at the post-processing stage (to change the pre-whitening, or the dictionary, or the algorithm, or the reference texts, etc.).

The same is true even for fully-automated clustering, where the clusters are derived automatically, but are only of value if we can interpret them (and we might still need to go back and adjust the text some more).
4.3 Probabilistic Topic Models

There are many approaches to topic modeling, although all focus on revealing a hidden structure to the occurrence of words, within topics, across texts.

Latent Dirichlet Allocation (LDA) is the simplest (ha!) probabilistic topic model.
Correlated Topic Models (CTM) relax the assumption that the occurrence of different topics is uncorrelated.
Structural Topic Models (STM) allow for the incorporation of metadata in fitting a topic model.
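For intuition, LDA can be fit with a short collapsed Gibbs sampler: repeatedly resample each token's topic in proportion to how common that topic is in the document and how common the word is in the topic. The pure-Python sketch below is an illustration only, not production code; in practice one would use R's topicmodels or stm packages, or gensim/scikit-learn in Python.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k=2, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Fit LDA with a collapsed Gibbs sampler (illustrative, not optimized)."""
    random.seed(seed)
    v = len({w for d in docs for w in d})       # vocabulary size
    # z[d][i] is the topic assigned to token i of document d
    z = [[random.randrange(k) for _ in d] for d in docs]
    ndk = [[0] * k for _ in docs]               # topic counts per document
    nkw = [defaultdict(int) for _ in range(k)]  # word counts per topic
    nk = [0] * k                                # total tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove this token's current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Resample its topic in proportion to P(topic | everything else)
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                t = random.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # Posterior-mean document-topic proportions
    theta = [[(ndk[d][j] + alpha) / (len(docs[d]) + k * alpha) for j in range(k)]
             for d in range(len(docs))]
    return theta, nkw
```

theta[d] gives document d's mixture over the k topics; inspecting the highest-count words in each nkw[j] is how a human labels the topics, which is where interpretation re-enters the automated pipeline.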
