
Content Analysis in the Social Sciences

Stuart Soroka, University of Michigan


Content Analysis in the Social Sciences

From Manual to Automated Approaches


Schedule:
Session 1. Introduction, and Building a Corpus
This brief session introduces some central ideas and objectives in content analysis, and then
focuses on issues of sampling and pre-processing, as they relate to both human and
automated approaches.

Session 2. Exploring Frequent, Discriminating, and Co-occurring Words
This session offers a first foray into computer-automated content analysis in R, focused on approaches that look at individual words. We look at word counts, word clouds, and some basic clustering methods that explore connections between words.

Session 3. Reliability and Validity in Human Coding and Dictionary-Based Approaches
This session focuses on capturing the frames, topics, or tone of text, comparing human coding with automated approaches, both dictionary-based and machine-learning.

Session 4. Supervised Learning, Unsupervised Learning, and Topic Modeling


This session offers an introduction to supervised and unsupervised learning approaches, and a preliminary application of LDA topic modeling.


Main Resources:
Klaus Krippendorff. 1989. Content analysis. Pp. 403-7 in the International Encyclopedia of
Communications, Erik Barnouw et al., eds., Oxford: Oxford University Press.
H. Andrew Schwartz and Lyle H. Ungar. 2015. Data-Driven Content Analysis of Social Media: A Systematic Overview of Automated Methods. The Annals of the American Academy of Political and Social Science 659: 78-94.
Justin Grimmer and Brandon M. Stewart. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis 21(3): 267-297.

Kenneth Benoit and Alexander Herzog. 2015. Text Analysis: Estimating Policy Preferences
From Written and Spoken Words. In Analytics, Policy and Governance, eds. Jennifer
Bachner, Kathryn Wagner Hill, and Benjamin Ginsberg.
Daniel Riffe, Stephen Lacy and Frederick Fico. 2014. Analyzing Media Messages: Chapters 6
(Reliability) and 7 (Validity).
Lori Young and Stuart Soroka. 2012. Affective News: The Automated Coding of Sentiment
in Political Texts, Political Communication 29: 205-231.
Cornelius Puschmann and Tatjana Scheffler. 2016. Topic Modeling for Media and Communication Research: A Short Primer. HIIG Discussion Paper Series, no. 2016-05.


Selected Supplementary Resources:
Kimberly A. Neuendorf. 2017. The Content Analysis Guidebook, 2nd ed. Sage.
Klaus Krippendorff. 2013. Content Analysis: An Introduction to Its Methodology, 3rd ed.
Sage.
Klaus Krippendorff and Mary Angela Bock, eds. 2009. The Content Analysis Reader. Sage.
Selections from Dhavan Shah, Joseph Cappella and W. Russell Neuman, eds. 2015. Toward
Computational Social Science: Big Data in Digital Environments, special issue of The Annals
of the American Academy of Political and Social Science 659.


Prep
- have R and RStudio installed, including the following packages: foreign, stargazer, tm,
wordcloud, stringr, irr, topicmodels, stm; use install.packages(x)
- request and download Lexicoder, and the Lexicoder Sentiment Dictionary
- download course files at www.snsoroka.com/files/UT.2017.zip
- dump all of this into a content analysis folder, ca, on your desktop, which in the end should look like this: [screenshot of the ca folder omitted]
Session 1

This session introduces some central ideas and objectives in content analysis, and then focuses on issues of sampling and pre-processing, as they relate to both human and automated approaches.

1.1 Foundations
1.2 The content analysis process
1.3 Typologies of content analysis
1.4 Principles of Content Analysis
1.5 Finding a Corpus
1.6 Spreadsheets & Software
1.7 Preparing your Corpus

1.1 Foundations

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.


1.1 Foundations

Harold D. Lasswell. 1942. The Politically Significant Content of the Press: Coding Procedures. Journalism Quarterly 19: 12-23.

"In general terms we know that response (R) is a function of environment (E) and predisposition (P), and no response is adequately explained until it has been related to both sets of determinants. The contents of the press represent part of the environment (E) of its readers, and we need to solve the technical problems connected with content analysis before we can assemble the facts needed to confirm basic hypotheses in the science of communication."
1.1 Foundations

Bernard Berelson. 1952. Content Analysis in Communication Research. Glencoe: Free Press.

"Content analysis is a research technique for the objective, systematic and quantitative description of the manifest content of communication."

Ole Holsti. 1969. Content Analysis for the Social Sciences and Humanities. Reading, MA: Addison-Wesley.

"Content analysis is any technique for making inferences by objectively and systematically identifying specified characteristics of messages."

Klaus Krippendorff. 2013. Content Analysis: An Introduction to Its Methodology, 3rd ed. Thousand Oaks, CA: Sage.

"Content analysis is a research technique for making replicable and valid inferences from texts (or other meaningful matter) to the contexts of their use."
1.1 Foundations

There are some differences in these accounts of content analysis, including (drawn from Krippendorff):

Berelson takes content to be contained in a text.
Holsti takes content to be a property of the source of a text.
Krippendorff takes content to emerge in the process of a researcher analyzing a text relative to a particular context.

Source: Krippendorff, Content Analysis: An Introduction to its Methodology, 3rd ed.


1.1 Foundations

These relate to a standard theme in the field, focused on manifest versus latent content.

Manifest content: elements that are physically present and countable (Gray & Densten 1998).
Latent content: unobserved concept(s) that cannot be measured directly but can be represented or measured by one or more indicators (Hair et al. 2010).

I am not convinced that the manifest-latent distinction is useful; but the different approaches to content matter fundamentally to the kind of analyses we choose, manual or automated.

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.



1.2 The content analysis process

Source: Krippendorff. 1989. Content analysis.


1.3 Typologies of content analysis

An overview of text as data methods [diagram omitted; one branch: No Categories (just looking at words)]
Source: Grimmer and Stewart. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.
1.3 Typologies of content analysis

A categorization of content analysis techniques

Source: Schwartz and Ungar. 2015. Data-Driven Content Analysis of Social Media: A Systematic Overview of Automated Methods.
1.4 Principles of Content Analysis

Grimmer & Stewart's four principles of quantitative text analysis:

(1) All quantitative models of language are wrong, but some are useful.
(2) Quantitative methods for text amplify resources and augment humans.
(3) There is no globally best method for automated text analysis.
(4) Validate, Validate, Validate.

Source: Grimmer and Stewart. 2013. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.
1.5 Finding a Corpus

There are many possible sources of text:

Standard full-text resources (Lexis-Nexis, Factiva, etc.), which offer limited (but useful) access.
See Neuendorf's Message Archives:
http://academic.csuohio.edu/neuendorf_ka/content/archive.html
Plus lots of online content: scraping of news content, social media, parliamentary debates, movie scripts, etc.

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.


1.5 Finding a Corpus

There are many possible formats.

Source: Benoit and Herzog. 2015. Text Analysis: Estimating Policy Preferences From Written and Spoken Words.
1.6 Spreadsheets & Software

The standard approach for human-coded data is to use some form of simple questionnaire, on paper or digitally.

Source: Riffe, Lacy and Fico. 2014. Analyzing Media Messages, 3rd ed.
1.6 Spreadsheets & Software

For automated content analysis, there are many options for software. Our approach is going to be to use R, which has the following advantages:

(1) there are a number of very good packages for content analysis written for R, and it's free;
(2) R can manage relatively large bodies of data, textual and otherwise; and
(3) R can be both the software you use to produce the content analysis, and the software you use to estimate models.
1.6 Spreadsheets & Software

Examples in the lab portion of the class will use R; my own demonstrations will rely on RStudio as well (which is just a nice interface for R). We'll draw on Lexicoder as well, which is written in Java but can be used directly from R.
1.7 Preparing your Corpus

Once you have a corpus and an approach (/software), you need to prepare the corpus to work with that approach (/software).

For human coding, we typically worry about removing information you don't want coders to see.
For automated coding, researchers consider removing extra white space, upper-case letters, numbers, and stopwords, and stemming.
There also may be custom stopwords/pre-whitening for specific analyses.
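The automated-coding steps above can be sketched in a few lines. The course labs use R's tm package; here is the same logic in plain Python, with a tiny illustrative stopword list (not tm's):

```python
import re

# A small illustrative stopword list; tm's English list is much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "is", "was", "not"}

def preprocess(text, stopwords=STOPWORDS):
    """Apply the standard automated-coding steps: lower-case,
    strip numbers and punctuation, collapse extra white space,
    and drop stopwords."""
    text = text.lower()                      # remove upper-case letters
    text = re.sub(r"[0-9]+", " ", text)      # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)    # remove punctuation
    tokens = text.split()                    # splitting also collapses white space
    return [t for t in tokens if t not in stopwords]

print(preprocess("The  2 cats and a dog!"))  # ['cats', 'dog']
```

The order matters: lower-casing must come before the stopword check, since the stopword list is lower-case.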
1.7 Preparing your Corpus

Standard stopwords in the tm package in R:


[1] "i" "me" "my" "myself" "we" "our" "ours" "ourselves"
[9] "you" "your" "yours" "yourself" "yourselves" "he" "him" "his"
[17] "himself" "she" "her" "hers" "herself" "it" "its" "itself"
[25] "they" "them" "their" "theirs" "themselves" "what" "which" "who"
[33] "whom" "this" "that" "these" "those" "am" "is" "are"
[41] "was" "were" "be" "been" "being" "have" "has" "had"
[49] "having" "do" "does" "did" "doing" "would" "should" "could"
[57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're" "they're"
[65] "i've" "you've" "we've" "they've" "i'd" "you'd" "he'd" "she'd"
[73] "we'd" "they'd" "i'll" "you'll" "he'll" "she'll" "we'll" "they'll"
[81] "isn't" "aren't" "wasn't" "weren't" "hasn't" "haven't" "hadn't" "doesn't"
[89] "don't" "didn't" "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot"
[97] "couldn't" "mustn't" "let's" "that's" "who's" "what's" "here's" "there's"
[105] "when's" "where's" "why's" "how's" "a" "an" "the" "and"
[113] "but" "if" "or" "because" "as" "until" "while" "of"
[121] "at" "by" "for" "with" "about" "against" "between" "into"
[129] "through" "during" "before" "after" "above" "below" "to" "from"
[137] "up" "down" "in" "out" "on" "off" "over" "under"
[145] "again" "further" "then" "once" "here" "there" "when" "where"
[153] "why" "how" "all" "any" "both" "each" "few" "more"
[161] "most" "other" "some" "such" "no" "nor" "not" "only"
[169] "own" "same" "so" "than" "too" "very"
1.7 Preparing your Corpus

Selections from the pre-processing script from the Lexicoder Sentiment Dictionary:

To make sure we catch negations:
not been a -> not
not been any -> not
not could have -> not
not had a -> not

To make sure we don't mis-code positives:
child care -> child xcare
day care -> day xcare
foster care -> foster xcare
geriatric care -> geriatric xcare
great difficulty -> xgreat difficulty
great disappointment -> xgreat disappointment
great disservice -> xgreat disservice
mad cow -> xmad cow

To make sure we don't mis-code negatives:
tears of joy -> xtears of joy
waste disposal -> xwaste disposal
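Recodes like these amount to an ordered find-and-replace pass applied before dictionary counting. A minimal Python sketch, using a small subset of the rules above:

```python
# A sketch of applying Lexicoder-style recodes before dictionary counting.
# The rules are a small subset of those shown above.
RECODES = [
    ("not been a", "not"),
    ("child care", "child xcare"),
    ("great difficulty", "xgreat difficulty"),
    ("tears of joy", "xtears of joy"),
]

def apply_recodes(text, rules=RECODES):
    # Apply longer phrases first, so a short rule never clobbers
    # part of a longer one.
    for old, new in sorted(rules, key=lambda r: -len(r[0])):
        text = text.replace(old, new)
    return text

print(apply_recodes("she took great difficulty with child care"))
# she took xgreat difficulty with child xcare
```

The "x" prefix simply moves the phrase out of the dictionary's reach: "xcare" matches no positive entry, so "child care" no longer counts as positive.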
1.7 Preparing your Corpus

Selections from the pre-processing script used to look at open-ended responses to a survey question about what you have read, seen or heard about Clinton or Trump:

To make sure we catch important bigrams and trigrams:
paul ryan -> paulryan
mike pence -> mikepence
miss universe -> missuniverse
make america great -> makeamericagreat

To make sure we lump together related terms:
miss piggy -> missuniverse
russians -> russia

To make sure we lump together singular and plural forms:
immigrant -> immigration
immigrants -> immigration
emails -> email

[word cloud of responses omitted]
Source: electiondynamics.org.
1.7 Preparing your Corpus

An illustration of converting Presidential inaugural speeches into quantitative data

Source: Benoit and Herzog. 2015. Text Analysis: Estimating Policy Preferences From Written and Spoken Words.
Session 2

This session offers a first foray into computer-automated content analysis in R, focused on approaches that look at individual words. We look at word counts, word clouds, and methods that explore connections between words.

2.1 Taking words seriously
2.2 Humans vs computers
2.3 Pre-whitening for word-based analyses
2.4 Word mentions
2.5 Word clouds
2.6 Co-occurrences
2.1 Taking words seriously

Murray Edelman. 1985. Political Language and Political Reality. PS (Winter 1985): 10-19.

"It is language about political events and developments that people experience; even events that are close by take their meaning from the language used to depict them. So political language is political reality; there is no other so far as the meaning of events to actors and spectators is concerned."

2.1 Taking words seriously

Murray Edelman. 1985. Political Language and Political Reality. PS (Winter 1985): 10-19.

"In short, it is not reality in any testable or observable sense that matters in shaping political consciousness and behavior, but rather the beliefs that language helps evoke about the causes of discontents and satisfactions, about policies that will bring about a future closer to the heart's desire, and about other observables."
2.1 Taking words seriously

Deborah Stone. 1989. Causal Stories and the Formation of Policy Agendas. Political Science Quarterly 104(2): 281-300.

"Problem definition is a process of image making, where the images have to do fundamentally with attributing cause, blame and responsibility."

"Conditions, difficulties, or issues thus do not have inherent properties that make them more or less likely to be seen as problems or to be expanded. Rather, political actors deliberately portray them in ways calculated to gain support for their side."
2.1 Taking words seriously

Deborah Stone. 1989. Causal Stories and the Formation of Policy Agendas. Political Science Quarterly 104(2): 281-300.
2.1 Taking words seriously

Erving Goffman. 1974. Frame Analysis: An Essay on the Organization of Experience. London: Harper and Row.

Amos Tversky and Daniel Kahneman. 1981. The Framing of Decisions and the Psychology of Choice. Science 211(4481): 453-458.

Shanto Iyengar. 1991. Is Anyone Responsible? Chicago: University of Chicago Press.
2.1 Taking words seriously

Robert M. Entman. 1993. Framing: Toward Clarification of a Fractured Paradigm. Journal of Communication 43(4): 51-58.

"To frame is to select some aspects of a perceived reality and make them more salient in a communicating text, in such a way as to promote a particular problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described."
2.2 Humans vs computers

A focus on words highlights questions about whether the aim of automated content analysis should be to mimic, or improve on, human-coded analysis. For instance:

Can computers reliably identify frames?
Are frames reliably indicated by words?
Can computers reliably count words?
Is it possible that computer-coded frames are different from, but more reliable than, human-coded frames?
2.2 Humans vs computers

Which would code these variables more reliably: humans or computers?

Source: Maxwell McCombs and Donald Shaw, The Agenda-Setting Function of Mass Media, Chapter 2.7 in Krippendorff and Bock.
2.2 Humans vs computers

Which would code the readability/complexity of text more reliably: humans or computers?

Source: Klaus Krippendorff, Inferring the Readability of Text, Chapter 3.9 in Krippendorff and Bock.
2.3 Pre-whitening for word-based analyses

The steps in pre-whitening data will vary based on how you intend to analyze the data.

Simple word-based analysis may not require dealing with negations:
"scared" is the same as "not scared"; what matters are mentions of fear.
Or it may require that you pre-whiten so that negated words are counted separately:
"not scared" should be changed to "not_scared" or "notscared", so that it gets counted differently.
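The negation re-code described above amounts to one regular-expression substitution. A minimal Python sketch (the labs themselves use R):

```python
import re

def join_negations(text):
    """Re-write 'not X' as 'not_X' so that negated words are
    counted separately from their plain forms."""
    return re.sub(r"\bnot\s+(\w+)", r"not_\1", text)

print(join_negations("i am not scared, just not sure"))
# i am not_scared, just not_sure
```

A fuller version would also catch other negators ("no", "never", "n't"), but the principle is the same.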
2.3 Pre-whitening for word-based analyses

The steps in pre-whitening data will vary based on how you intend to analyze the data.

You may also use some very simple re-codes, to capture very similar words as a single word:
"horses" and "horse" should be counted together.
2.3 Pre-whitening for word-based analyses

One option is to stem your text, reducing words to their word stem:
"fishing", "fished", "fisher" -> "fish"
"argue", "argued", "argues", "arguing" -> "argu"

Another option is to lemmatize your text, grouping together all inflected versions of a word into linguistically valid lemmas:
"good", "better" -> "good"
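To illustrate the idea of stemming, here is a toy suffix-stripper in Python. It is not a real stemming algorithm; for actual work use a proper stemmer such as the Porter stemmer (which R's tm relies on for stemDocument):

```python
# A toy suffix-stripper, just to illustrate the idea of stemming.
# A real stemmer (e.g. Porter) applies many more, carefully ordered rules.
SUFFIXES = ["ing", "ed", "es", "er", "s", "e"]  # longest first

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least a three-letter stem so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ["fishing", "fished", "fisher"]])  # ['fish', 'fish', 'fish']
print([toy_stem(w) for w in ["argue", "argued", "arguing"]])   # ['argu', 'argu', 'argu']
```

Note that, like a real stemmer, it maps "horse" and "horses" to the same (non-word) stem "hors"; stems need not be valid words, which is exactly what distinguishes stemming from lemmatization.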
2.3 Pre-whitening for word-based analyses

I don't usually recommend any kind of heavy-handed pre-whitening. Rather, I suggest looking at results, considering problems in common words, and then going back to re-process and re-analyze the data.

If your corpus never deals with fish, then pre-processing that deals with fishing is of no consequence. And there often are unique entries in a corpus that a standard approach to stemming won't deal with.
2.4 Word mentions

The simplest approach to automated content analysis is to count all the words in a corpus. The first (and nearly the only) step is to create a term-document matrix (tdm), with terms as rows and documents as columns.
2.4 Word mentions

Or, alternatively, a document-term matrix, with documents as rows and terms as columns.
2.4 Word mentions

The tdm can then be used to look at, for instance, the words with the highest occurrence in the corpus.
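A term-document matrix is just a table of counts. The labs use tm's TermDocumentMatrix; here is a minimal Python sketch of building one from two invented documents and reading off the most frequent words:

```python
from collections import Counter

# Two invented toy documents.
docs = {
    "doc1": "budget tax cuts and the budget debate",
    "doc2": "election debate on tax policy",
}

# Term-document matrix: one Counter of term frequencies per document.
tdm = {name: Counter(text.split()) for name, text in docs.items()}

# Sum across documents to get corpus-wide occurrences...
totals = Counter()
for counts in tdm.values():
    totals.update(counts)

# ...and read off the words with the highest occurrence:
# "budget", "tax" and "debate" each occur twice here.
print(totals.most_common(3))
```

In practice you would run the pre-processing steps from Session 1 (lower-casing, stopword removal, etc.) before counting.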
2.4 Word mentions

Source: Mary Angela Bock, Impressionistic Content Analysis: Word Counting in Popular Media, Chapter 1.7 in Krippendorff and Bock
2.5 Word Clouds

Words and Phrases (Left) and Topics (Right) Most Distinguishing Women (Top) from Men (Bottom), Facebook Posts

Source: Schwartz and Ungar. 2015. Data-Driven Content Analysis of Social Media: A Systematic Overview of Automated Methods.
2.5 Word Clouds

Word cloud plots for Presidential inaugural speeches since 1981, by (a) Democrat and (b) Republican presidents

Source: Benoit and Herzog. 2015. Text Analysis: Estimating Policy Preferences From Written and Spoken Words.
2.5 Word Clouds

Word cloud of differentiating words from open-ended responses to a survey question about what you have read, seen or heard about Clinton or Trump [comparison word cloud omitted]

Source: electiondynamics.org.
2.5 Word Clouds

A comparison cloud is based on the following:

Let p_i,j be the rate at which word i occurs in document j, and p_j be the average across documents,
p_j = sum_i p_i,j / ndocs.
The size of each word is mapped to its maximum deviation,
max_i (p_i,j - p_j),
and its angular position is determined by the document where that maximum occurs.

Source: description file for wordcloud package.
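The same calculation can be sketched in Python. This follows the prose of the package description: a word's rate is assumed to be its frequency normalized by document length, the average is taken across documents, and the deviation is maximized across documents (the toy documents are invented):

```python
# A sketch of comparison-cloud weighting, assuming a word's "rate"
# is its frequency divided by document length.
docs = {
    "dem": "jobs jobs health health care",
    "rep": "taxes freedom jobs taxes taxes",
}

# rates[word][doc]: rate of each word in each document
rates = {}
for j, text in docs.items():
    words = text.split()
    for w in words:
        rates.setdefault(w, {})
        rates[w][j] = rates[w].get(j, 0) + 1 / len(words)

weights = {}
for w, r in rates.items():
    per_doc = {j: r.get(j, 0.0) for j in docs}
    mean = sum(per_doc.values()) / len(docs)     # average rate across documents
    j_max = max(per_doc, key=per_doc.get)        # document of the maximum deviation
    weights[w] = (per_doc[j_max] - mean, j_max)  # font size ~ max deviation

# "taxes" is the most over-represented word, and it sits in "rep",
# so it would be drawn largest, in the "rep" segment of the cloud.
print(max(weights, key=lambda w: weights[w][0]))
```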


2.5 Word Clouds

So the math behind both standard word clouds and comparison clouds is relatively simple. So too is connecting frequency to font size; so constructing various size-weighted visualizations can be relatively straightforward.

[Figure: most-frequent words in responses, by week]
Jul 10-16: email berniesanders fbi liar scandal
Jul 17-23: email convention liar benghazi speech
Jul 24-30: convention email speech dnc president
Jul 31-6: email liar convention speech campaign
Aug 7-13: email liar tax foundation scandal
Aug 14-20: email foundation liar health going
Aug 21-27: email foundation liar scandal health
Aug 28-3: email foundation liar scandal fbi
Sep 4-10: email liar fbi scandal health talking
Sep 11-17: email liar issues campaign
Sep 18-24: email liar debate campaign health
Sep 25-1: email liar health night
Oct 2-8: debate email tax liar campaign
Oct 9-15: email debate liar leaks wiki
Oct 16-22: email debate wikileaks liar leaks
Oct 23-30: email debate wikileaks liar campaign
Oct 31-5: email fbi investigation reopening scandal
Nov 6-7: email fbi foundation investigation scandal
Source: electiondynamics.org.
2.6 Co-occurrences

So too is the calculation of co-occurrences, based on the tdm. Which words are most strongly correlated with mentions of word i? This is a simple matter of bivariate correlations between i and all other words in the tdm. Doing this manually would be tedious, but the tm package includes a function that will do this for you.
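The calculation behind tm's findAssocs() is just a column of Pearson correlations between term vectors. A minimal Python sketch with a toy term-document matrix (the terms and counts are invented):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Rows: terms; columns: counts in each of four toy documents.
tdm = {
    "email":   [3, 0, 2, 0],
    "scandal": [2, 0, 1, 0],
    "jobs":    [0, 2, 0, 1],
}

target = "email"
assocs = {w: pearson(tdm[target], counts)
          for w, counts in tdm.items() if w != target}

# "scandal" rises and falls with "email"; "jobs" moves the other way.
print(sorted(assocs, key=assocs.get, reverse=True))  # ['scandal', 'jobs']
```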
2.6 Co-occurrences

In the context of survey data, this kind of analysis can be useful as a way of understanding the language that is correlated with a survey response as well.

Words correlated (at > .049) with:
Positive favorability for Clinton: debate (.07), campaign (.06), campaigning (.05), president (.05), speech (.05)
Negative favorability for Clinton: liar (.16), email (.13), scandal (.10), benghazi (.08), foundation (.06), dishonest (.05)
Positive favorability for Trump: economy (.06), jobs (.06), makeamericagreat (.06), speech (.06), america (.05), country (.05)
Negative favorability for Trump: women (.07), sexual (.05)

Source: Gallup, Michigan, Georgetown Working Group


Session 3

This session focuses on capturing the frames, topics, or tone of text, comparing human coding with automated approaches, both dictionary-based and machine-learning.

3.1 Dictionary-Based Coding & Sentiment Analysis
3.2 Human coding
3.3 The classic approach to reliability
3.4 New concerns about reliability
3.5 The classic approach to measurement validity
3.6 New concerns about measurement validity
3.7 Picking a dictionary
3.1 Dictionary-Based Coding & Sentiment Analysis

Source: James Pennebaker and Cindy Chung, Computerized Text Analysis of Al Qaeda Transcripts, Chapter 7.7 in Krippendorff and Bock.
3.1 Dictionary-Based Coding & Sentiment Analysis

Source: Megan Bayagich, Laura Cohen, Lauren Farfel, Andrew Krowitz, Emily Kuchman, Sarah Lindenberg, Natalie Sochacki, and Hannah Suh, Exploring the Tone of the 2016 Campaign, CPS Blog.
3.1 Dictionary-Based Coding & Sentiment Analysis

Some standard sources for automated dictionaries include:
General Inquirer
Linguistic Inquiry and Word Count (LIWC)
Diction
Some Lexicoder-built dictionaries (adaptable to other software, including simple R coding):
Lexicoder Sentiment Dictionary
Lexicoder Topic Dictionary
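At its core, dictionary-based sentiment coding is a membership count: how many of a text's words appear in the positive list, how many in the negative list. A minimal Python sketch; the word lists here are tiny illustrative stand-ins, not the Lexicoder Sentiment Dictionary:

```python
# Tiny illustrative word lists; real dictionaries contain thousands
# of entries (and, as above, require pre-processing for negations).
POSITIVE = {"good", "great", "success"}
NEGATIVE = {"bad", "crisis", "failure"}

def net_tone(text):
    """Net tone: % positive words minus % negative words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 100 * (pos - neg) / len(words)

# 2 positive, 1 negative, 6 words: (2 - 1) / 6, i.e. roughly 16.7.
print(round(net_tone("a great success despite the crisis"), 1))
```

Normalizing by the number of words matters: it keeps long and short texts comparable.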
3.1 Dictionary-Based Coding & Sentiment Analysis

How do you make a decision about which dictionaries to use?
Always read the dictionary, and if you can't, don't use the dictionary.
Always check the use of words you are unsure of, ideally using the corpus you want to analyze.
Don't hesitate to change a dictionary to suit the context in which you're working with it (including dealing with suffixes), but keep careful track of how it's changed, and make the results available to others.
3.2 Human coding

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.

3.2 Human coding

Neuendorf's process of coder training

Source: Neuendorf, The Content Analysis Guidebook, 2nd ed.


Stuart Soroka, University of Michigan
Session 3. Dictionary-Based Approaches Content Analysis in the Social Sciences

3.2 Human coding

Source: Riffe, Lacy and Fico.


Stuart Soroka, University of Michigan
Session 3. Dictionary-Based Approaches Content Analysis in the Social Sciences

3.2 Human coding

Source: Riffe, Lacy and Fico.


Stuart Soroka, University of Michigan
Session 3. Dictionary-Based Approaches Content Analysis in the Social Sciences

3.3 The classic approach to reliability

[Figure: Neuendorf's comparison of reliability, precision, and accuracy.]

Intercoder reliability

Option 1: % agreement

Option 2: Scott's pi or Cohen's kappa
(% observed agreement - % expected agreement) / (1 - % expected agreement)

Option 3: Krippendorff's alpha
1 - (observed disagreement / expected disagreement)

Source: Riffe, Lacy and Fico.
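These agreement statistics are easy to compute directly. The Python sketch below implements % agreement and Scott's pi for two coders and nominal categories; Scott's pi computes expected agreement from the pooled distribution of codes across both coders (Cohen's kappa differs only in using each coder's own marginals).

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of units on which two coders assign the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def scotts_pi(a, b):
    """Scott's pi: chance-corrected agreement, with expected agreement
    taken from the pooled distribution of codes across both coders."""
    observed = percent_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    n = sum(pooled.values())
    expected = sum((c / n) ** 2 for c in pooled.values())
    return (observed - expected) / (1 - expected)
```

Note how chance correction bites: two coders who agree 75% of the time on a skewed two-category scheme can still have a pi well below 0.5.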


3.4 New concerns about reliability

What is the automated version of intercoder reliability?

Stuart Soroka. 2014. "Reliability and Validity in Automated Content Analysis," in Communication and Language Analysis in the Corporate World, Roderick P. Hart, ed., Hershey PA: IGI Global.
Do we always want intercoder reliability?

Usually we do for topics; we are probably less concerned with reliability for sentiment.
And whether we care about reliability depends in part on what we're using our data for.
For instance: Are we looking at what we think a government was doing, or at what an audience thought a government was doing?


3.5 The classic approach to measurement validity

Drawn from Riffe, Lacy and Fico's types of content analysis validity:

Face validity: based on a persuasive argument that the measure looks the way we would expect.
Concurrent validity: correlate the current measure with another (pre-existing) similar one.
Predictive validity: correlate the current measure with a predicted outcome.
Construct validity: observe whether the measure behaves as we expect alongside (potentially abstract) concepts.
3.6 New concerns about measurement validity

Comparing automation and human coding:

[Figures: Pairwise correlations, automated dictionaries; comparing results across dictionaries; automated dictionaries and manual coding; media tone and vote shares in the 2006 Canadian election.]

Source: Lori Young and Stuart Soroka. 2012. "Affective News: The Automated Coding of Sentiment in Political Texts," Political Communication 29: 205-231.
Session 4. (Un)Supervised Learning

This session offers an introduction to a range of supervised and unsupervised learning approaches, including a preliminary application of LDA topic modeling.

4.1 Various Approaches, Unsupervised & Supervised
4.2 Examples, Background
4.3 Probabilistic Topic Models (LDA, CTM, STM)
4.1 Various Approaches, Unsupervised & Supervised

Source: Grimmer and Stewart. 2013. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts."
4.2 Examples, Background

Supervised ideological scaling: Wordscores
Laver, Benoit, and Garry (2003) use humans to estimate the policy position of a set of reference texts (i.e., party platforms), then use a computer to count words common in source texts and assign an ideological value to other texts based on word occurrences.

Unsupervised ideological scaling: Wordfish
Slapin and Proksch (2008) use a scaling algorithm to estimate a dimension based on word use in texts (i.e., party platforms).
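The core Wordscores step can be sketched compactly: each word receives a score equal to the average of the reference-text scores, weighted by how strongly the word is associated with each reference text; a new ("virgin") text is then scored by the mean of its words' scores. The Python below is a simplified illustration with made-up texts, not the full Laver, Benoit, and Garry procedure (which also rescales virgin scores and computes uncertainty).

```python
from collections import Counter

def wordscores(ref_texts, ref_scores, virgin_text):
    """Score a new text from hand-scored reference texts (simplified Wordscores)."""
    # Relative frequency of each word within each reference text
    ref_freqs = []
    for text in ref_texts:
        counts = Counter(text.split())
        total = sum(counts.values())
        ref_freqs.append({w: c / total for w, c in counts.items()})
    vocab = set().union(*(f.keys() for f in ref_freqs))
    # Each word's score: average of reference scores, weighted by P(ref | word)
    word_scores = {}
    for w in vocab:
        fw = [f.get(w, 0.0) for f in ref_freqs]
        denom = sum(fw)
        word_scores[w] = sum(f / denom * s for f, s in zip(fw, ref_scores))
    # Virgin-text score: mean score of its words that appear in the references
    tokens = [t for t in virgin_text.split() if t in word_scores]
    if not tokens:
        raise ValueError("no scorable words in the virgin text")
    return sum(word_scores[t] for t in tokens) / len(tokens)
```

A virgin text drawing equally on both reference vocabularies lands at the midpoint of the two reference scores.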
Note that Wordscores, insofar as it trains a computer to match human codes, is similar to the machine-learning approach in RTextTools, aimed at text classification by training a model to match human codes (based on word frequencies).

It is also similar to less computationally complex, but often very powerful, ways of building dictionaries using seed words. This is more common in big-data work, where we might use the words co-occurring with a given hashtag to capture non-hashtagged Tweets with a similar theme or sentiment.
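A minimal version of the seed-word idea: collect the words that frequently co-occur with a seed hashtag, then use that lexicon to flag tweets that share the theme but lack the hashtag. The Python sketch below uses invented example data; real applications would filter stopwords and choose the co-occurrence threshold empirically.

```python
from collections import Counter

def expand_seed(tweets, seed, min_count=2):
    """Words co-occurring with the seed hashtag at least min_count times."""
    counts = Counter(w for t in tweets if seed in t.split()
                     for w in t.split() if w != seed)
    return {w for w, c in counts.items() if c >= min_count}

def find_matches(tweets, seed, lexicon):
    """Tweets that lack the seed hashtag but contain an expansion word."""
    return [t for t in tweets
            if seed not in t.split() and lexicon & set(t.split())]
```

This is exactly the kind of step a human should then double-check, since high-frequency co-occurring words are often generic rather than topical.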
That kind of seed-word-focused dictionary-building, as part of an iterative process in which humans double-check results and make adjustments to both dictionaries and pre-whitening scripts, is not far off from a supervised automated approach.

Nor is the use of Wordscores or Wordfish, each of which extracts dimensions that are useful only insofar as humans can identify what they mean. (If the dimensions don't work, then we might use different reference texts for Wordscores; we might change the corpus for Wordfish.)
In short, all good automated work requires some degree of human intervention, both at the pre-processing stage (pre-whitening, or initial dictionary-building) and at the post-processing stage (to change the pre-whitening, or the dictionary, or the algorithm, or the reference texts, etc.).

The same is true even for fully-automated clustering, where the clusters are derived automatically, but are only of value if we can interpret them (and we might still need to go back and adjust the text some more).
4.3 Probabilistic Topic Models

There are many approaches to topic modeling, although all focus on revealing a hidden structure to the occurrence of words, within topics, across texts.

Latent Dirichlet Allocation (LDA) is the simplest (ha!) probabilistic topic model.
Correlated Topic Models (CTM) relax the assumption that the occurrence of different topics is uncorrelated.
Structural Topic Models (STM) allow for the incorporation of metadata in fitting a topic model.
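For intuition, LDA can be fit with a short collapsed Gibbs sampler: repeatedly resample each token's topic in proportion to how common that topic is in the document and how common the word is in the topic. The pure-Python sketch below is an illustration only, not production code; in practice one would use R's topicmodels or stm packages, or gensim/scikit-learn in Python.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k=2, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Fit LDA with a collapsed Gibbs sampler (illustrative, not optimized)."""
    random.seed(seed)
    v = len({w for d in docs for w in d})       # vocabulary size
    # z[d][i] is the topic assigned to token i of document d
    z = [[random.randrange(k) for _ in d] for d in docs]
    ndk = [[0] * k for _ in docs]               # topic counts per document
    nkw = [defaultdict(int) for _ in range(k)]  # word counts per topic
    nk = [0] * k                                # total tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove this token's current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Resample its topic in proportion to P(topic | everything else)
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + v * beta)
                           for j in range(k)]
                t = random.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # Posterior-mean document-topic proportions
    theta = [[(ndk[d][j] + alpha) / (len(docs[d]) + k * alpha) for j in range(k)]
             for d in range(len(docs))]
    return theta, nkw
```

theta[d] gives document d's mixture over the k topics; inspecting the highest-count words in each nkw[j] is how a human labels the topics, which is where interpretation re-enters the automated pipeline.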
