3, 1990
Julia C. Jorgensen¹
Accepted November 22, 1989 -- Revised March 25, 1990
This paper tests psychologists' frequent assumption that dictionaries are psychologically realistic
models of polysemy in the mental lexicon. Psychologists have not often explored the nature of
polysemy, and lexicographers' methods have not involved scientific sampling of usages or informants.
It is argued, however, that the lexicographic technique of citation sorting is an effective way
of discovering sense differences. Here this technique was used in three tasks involving usage samples
for 24 high- and low-frequency nouns varying widely in degree of polysemy in the dictionary.
Analyses of agreement within and between subjects showed that subjects consistently judged and
substantially agreed upon the major senses of most nouns, but that few nouns in either frequency
group were perceived to have more than three significant senses. Additionally, the hypothesis that
larger usage samples bias people to make more sense groupings was not supported,
suggesting that the larger number of senses lexicographers create for high-frequency words is not
an artifact of larger usage samples.
This research was partially supported by grants from the Spencer and Sloan Foundations to George
A. Miller.
1 Correspondence should be addressed to the author at Psychology Department, Lehman College,
The City University of New York, Bedford Park Boulevard West, Bronx, New York 10468.
seven or eight sense definitions, while even rarer words such as nepotism,
ecumenical, or indemnity may be found to have only one or two.
Accurately identifying and describing the senses of particular words is
important in several types of endeavor. First, lexicographers hope that diction-
aries will be helpful in teaching word use, and teachers often believe this to be
true (Deese, 1967; Miller & Gildea, 1987). The ultimate test of a pedagogically
good dictionary may be whether the sense distinctions and descriptions it con-
tains for any given word can convey the native's understanding of that word's
meaning to a naive user, such as a child or a second-language learner (Edwards,
1983).
Second, psycholinguists (e.g., Zipf, 1945; Johnson-Laird and Quinn, 1976)
often count or identify word senses as part of experimental procedure, and they
frequently obtain this information from dictionaries. Miller, Fellbaum, Kegl, &
Miller (1988) have remarked, "By and large, psycholinguistic experiments pre-
suppose the validity of the general structures that linguists and lexicographers
have identified and try instead to test hypotheses concerning the way such struc-
tures arise or how they contribute to other cognitive processes" (p. 4).
Third, artificially intelligent programs to process natural language must
identify and describe word senses, and researchers are beginning to use machine-
readable versions of ordinary dictionaries as the basic sources of information
for the large, computerized lexicons of such programs (Byrd et al., 1987).
Therefore, the assumption that dictionaries are accurate descriptions of our lex-
ical knowledge and of the extent of polysemy in the language is apparently
widespread and has important practical consequences.
However, this assumption begs the important psychological questions of
how one can accurately identify the senses of a particular word, what the mental
representation of a sense is actually like, and how a particular sense is chosen
as the intended meaning of a word in disambiguating a sentence, possibly because
these questions have proven to be among the most intractable in understanding
natural language. The major difficulty in saying what a sense is involves
clarifying the distinction between the ambiguity of a word and the diversity or
generality of its use: when are perceived differences in the meaning of a word
in different contexts indicative of "true" ambiguity or sense difference? At least
three levels of perceived meaning difference have been discussed in the literature
on polysemy. These may be called homonymy, ambiguity, and microdistinction
or generality of use.
What has been called homonymy occurs when native speakers can see no
obvious semantic relation between two different uses of a word (even though
those uses may have derived from a common ancestor) (Panman, 1982; Zgusta,
1971). Common examples of homonymy are the relation between ball, a dance,
and ball, a round projectile; port, a wine, and port, a docking area for boats;
Psychological Reality of Word Senses 169
(This could be resolved in context by either "And black mink was also her
favorite for warmth" or "She had never liked dark-colored coats.")
(This could be resolved in context by either "And here he is now" or "If only
it could have been bigger.")
In these cases of ambiguity, the word clearly has more than one sense. But
there are words that rarely, if ever, display this kind of ambiguity, yet are, as
Quine says, multiply applicable, and may therefore seem to have more than one
sense. His example of this sort of word is hard, which is multiply applicable
in that it can be applied to chairs as well as to questions. It does, in fact, have
twelve adjectival senses (or applications, in Quine's terms) in Webster's Col-
legiate Dictionary, and many of them fit the pattern of having one or a few
particular referents characteristically tied to them. For example, there is "hard
money," meaning metallic money (or, by extension, good money); "hard worsted,"
meaning firmly twisted worsted; and "hard evidence," meaning definite
evidence.
170 Jorgensen
candy") may coexist in a lexical entry with meanings which may be judged as
closely related (such as "hard" in "hard money" and "hard candy").
Making a judgment as to whether two related usages of a word constitute
two different senses, or, instead, two applications of a single sense, must then
rest on a judgment as to whether the two usages are sufficiently remote se-
mantically rather than on any secure principle or test. Kelly and Stone believe
that this intuitive criterion is the only one possible, that the line between polysemy
(and thus "true" ambiguity in Quine's sense) and generality is impossible
to draw in any principled way.
nitions, a finding of such task bias for our subjects could imply that some senses
which appear in dictionaries are a product of a biasing effect of sample size.
Past studies of polysemy have been limited in one of two ways: Either they
have relied on judgments of only one informant or they have limited judgments
to uses of only one word. Lexicographers, in building dictionaries (and as in
the computerized disambiguation system of Kelly and Stone, 1975), have es-
sentially done sample studies of uses of the entire vocabulary of the language,
but they have relied on one individual's intuitions in most judgments. In addition,
lexicographers have failed to adopt systematic and unbiased procedures for
sampling the language.
Psychologists, on the other hand, have pooled the intuitions of many in-
formants. However, they have generally failed to use systematic samples of
vocabulary. (Weinreich, 1980, and Clark, 1973, comment on this failure; studies
by Caramazza et al., 1974, and Osgood, Suci, & Tannenbaum, 1957, exemplify
this failure.)
The sorting tasks included here sample from a large and representative
corpus of citations from written English, collected by Kucera and Francis (1967),
and multiple informants were used.
An advantage of having multiple judges lay in being able to evaluate in-
terpersonal consistency. In addition, having subjects make judgments twice, for
large samples of citations, allowed some evaluation of intrapersonal consistency.
The stability of judgments in both these realms acts as a measure of confidence
in subjects' intuitions about sense groupings.
METHODOLOGICAL BACKGROUND
The method of sorting has been used by Miller (1969) and others to study
the organization of lexical information in memory. Miller has argued that a
sorting task using individual nouns as items results in clusters of words that
reflect their common conceptual features while discounting their idiosyncratic
features. Sorting sentences according to the meaning of some key word that
appears in them would seem to be based on a similar process. In fact, Miller's
subjects had said that they sorted the words by trying to put them into sentence
contexts and then finding commonalities in the contexts.
Lexicographers use the method of sorting sentence citations as the means
of distinguishing various word senses for which to write definitions. The ex-
periments described here were designed to mimic the method of lexicography;
that is, subjects were given sentences using a particular word and asked to sort
them into groups, according to the similarities that they perceived in the uses
or meanings of the word in those sentences. So, comparing any two sentence
citations that used a given word, the subject was required to ask himself or
herself whether that word had the same meaning in those two sentences. In order
to eliminate possible effects of short-term memory limitations on the number of
groupings subjects would make, subjects were allowed to keep all grouped stacks
of citations before them at all times, as well as notations they desired to make
concerning the senses represented by the groupings. In addition, no time limi-
tation on task completion was imposed.
The three sorting tasks used here are actually variants of the same task. In
Task 1, subjects sorted citations and wrote definitions for 12 high-frequency
nouns of high or low polysemy, using citation samples of varying size. In Task
2, the same subjects sorted these citations again, but dictionary definitions were
available to them for use as guides in sorting. In Task 3, a new group of subjects
sorted citations and wrote definitions for 12 low-frequency nouns. Using com-
bined data from these experiments, assessments of individual consistency, agree-
ment between subjects, agreement between subject and dictionary estimates of
polysemy, and the biasing effects of sample size were carried out.
METHODS
Task 1
Subjects. Seven graduate students and two undergraduates served as sub-
jects. They were chosen because they were native English speakers who had
done well on verbal ability tests (SAT or GRE verbal sections). They were
volunteers who received a small amount of money for participating.
Design and Procedure. Sentences using 12 high-frequency nouns were
drawn from the Kucera and Francis (1967) corpus of one million running words
of written English. For every occurrence of a given noun in the corpus, a single
stimulus sentence was available. All such sentences for each of the 12 nouns
were individually typed on three-by-five cards, with the noun itself underlined
on each card. The total number of citations (equal to the frequency) for each
word is given in Table I. Uses of the words as proper nouns were eliminated
from our sample and from the count given in the table.
Six of the words were chosen because they are highly polysemous (11-21
nonarchaic senses), according to Webster's New Collegiate Dictionary, 1975,
and six were chosen for their comparatively low degree of polysemy (2-4 non-
archaic senses). Table I also gives the number of senses found in Webster's for
each word.
Card packets for subjects to sort were made in three sizes: they contained
20, 100, or 200 citation cards. Cards for each packet were drawn at random
from the entire set of citations for a given word, except that care was taken to
sample equally across all of the 15 genres represented in the corpus (see Kucera
and Francis, 1967, for a description of the genres). Each of the nine subjects
received a packet for each of the 12 words; four packets for each subject con-
tained 20 cards each, four contained 100 cards each, and four contained 200
cards each, so that over the nine subjects each word was represented by each
citation sample size three times. No two subjects got exactly the same set of
cards, although some overlap necessarily existed between sets for the same
word.
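The counterbalancing just described (four packets of each size per subject, and each word shown at each size three times across the nine subjects) can be checked with a short sketch. The rotation rule `(subject + word) % 3` is only one assignment that satisfies those constraints; the text does not say how sizes were actually assigned, so the rule and every name below are assumptions made for illustration.

```python
from collections import Counter

SIZES = (20, 100, 200)   # packet sizes reported in the text
N_SUBJECTS = 9
N_WORDS = 12


def packet_size(subject: int, word: int) -> int:
    """Hypothetical rotation rule: shift the size cycle by one per subject."""
    return SIZES[(subject + word) % len(SIZES)]


def design() -> dict:
    """Map every (subject, word) pair to a packet size."""
    return {(s, w): packet_size(s, w)
            for s in range(N_SUBJECTS) for w in range(N_WORDS)}


if __name__ == "__main__":
    d = design()
    # Sizes seen by subject 0: four packets of each size.
    print(Counter(d[(0, w)] for w in range(N_WORDS)))
    # Sizes used for word 0 across subjects: three of each size.
    print(Counter(d[(s, 0)] for s in range(N_SUBJECTS)))
```

Any assignment with the same row and column counts would serve equally well; the rotation is simply the shortest one to write down.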
Each subject was given his or her 12 card packets in a random order. The
following written instructions were given: (1) Sort the sentences into groups
according to similarity in meaning or use of the underlined word. Make as many
(or few) groups as you wish. (2) As you work, attach a written label to each
card pile to help you remember your criteria for grouping those sentences to-
gether. Start labeling at the beginning, and make notations as you need them.
(3) After you finish sorting the cards, go back to your notations (and look at
the cards if you wish) and write a definition for each group. This definition
should be detailed enough to allow another person to write a meaningful sentence
using the word in that particular way/sense, even if the word were unfamiliar
to him. (4) If there are any sentences you cannot understand, set them aside and
ask the experimenter for substitutes. (5) All the key words in this task are nouns.
Some sentences may contain adjectival uses of these words, however, such as
"night sky" or "group therapy." You should not need to characterize these
uses by form (that is, adjective vs. noun) per se, since your task is to sort
according to similarities in meaning. (6) Please don't look up any of these words
in a dictionary or thesaurus until you have finished all experimental tasks.
There was no time limit for the sorting task, and subjects completed parts
of it on different days, as it was very time-consuming (although any one packet
had to be finished at one sitting). Just as subjects were allowed to add to or
change their labels for the groups as needed, they were also allowed to alter
their groupings of the citations at any point in the task. Following the sorting
task, subjects were asked to give written responses to a set of questions about
the sorting strategies they used.
Task 2
Task 2 was designed to determine how subjects would change their strategies and
groupings in the same sorting task when given dictionary definitions to use as
guides.
Subjects. The nine subjects from Task 1 also participated in Task 2.
Design and Procedure. The stimuli were the same cards allotted to each
subject in Task 1, organized into the same packets. The order of presentation
was changed, both for the individual cards (which were shuffled) and for the
packets themselves (which were in a new random order).
Each of the 12 packets a subject received was attached to a set of cards
that had individual dictionary definitions (from Webster's New Collegiate Dic-
tionary) for the key word typed on them. These definitions corresponded to
those counted in "Number of senses" in Table I. Thus, the card packets for
head were each accompanied by 21 definition cards, packets for group by 4
definition cards, etc. The definition cards themselves were randomized before
being attached to the citation card packets.
A period of at least 1 week, but less than 2 weeks, elapsed between any
subject's completion of Task 1 and beginning of Task 2.
Instructions were the same as those for Task 1, except that the definition
cards were to be read prior to, and used in, the sorting task. Subjects were told
to try to categorize all the citation cards in a packet according to the definition
cards; for any citations they could not categorize this way, they were told to
make their own categories and definitions. Subjects were advised that they could
use as many or as few of the given definition cards as they chose to, and, again,
that they could make as many or few categories for the citations as they wished.
Subjects were also asked to report any ambiguities they noticed in the dictionary
definitions.
After completing the sorting task, subjects were again asked to reply in
writing to a series of questions about their sorting strategies.
Task 3
categories according to the meaning or use of the underlined word, to keep notes
during the process, and to write definitions after sorting.
RESULTS
Number of Categories. Table III shows overall results for the three sorting
tasks in terms of the mean number of definition categories subjects made for
each word type and for each task type (that is, with or without dictionary
definitions as guides), as compared to the mean number of dictionary definitions
for each word type.
Except for words characterized by the greatest polysemy (averaging about
14.6 dictionary senses), subjects distinguished around three senses for any given
word in the task without dictionary definitions.
When provided with dictionary definitions to use in sorting, subjects ap-
peared to change their estimates of the number of senses. For the words of
greatest polysemy (dictionary average of 14.6 senses), subjects significantly
increased the number of sense categories (p < .01, matched t-test), although
they did not adopt the even larger number of categories suggested by the dic-
tionary. For words which have few dictionary senses (an average of three),
subjects modified their estimates to be more in harmony with the number of
categories suggested by the dictionary in a few cases, but not consistently. The
overall difference between the sorting results for the low-polysemy words with
and without dictionary definitions was not significant; the independent estimates
of our subjects and the dictionary both converged on an average of three senses
to begin with.
Consistency of Categories. Since information is available about which in-
dividual citations were grouped together in particular categories by particular
subjects in the sortings, it is possible to examine the consistency in grouping
between the first sorting task (without dictionary definitions) and the second
sorting task (with dictionary definitions).
As a measure of such consistency, the Agreement-Disagreement (A-D)
ratio, devised by George A. Miller, and used and described by Shipstone (1960),
is useful, as it was designed for sorting data. This ratio is a statistic devised to
calculate the amount of agreement between different sortings of the same ma-
terial by taking into account differences in grouping of particular items (in this
study, the sentence citations). The ratio is a kind of correlational technique:
when two groupings are identical, it yields a value of +1.0, and when two
groupings are completely diverse, it yields a value of -1.0. Roughly speaking,
it expresses the observed number of agreements between the groupings as di-
vided by the total number of possible agreements. The appendix at the end of
this article gives details of calculation of the ratio, along with notes regarding
possible bias in the technique.
Table IV presents the Agreement-Disagreement ratios between the group-
ings for individual words by particular subjects in the first sorting task and in
the second sorting task. An analysis of variance showed that consistency values
for individual words were significantly different, with F(11, 88) = 6.98, p <
.0001, as were overall consistency values for the six high-polysemy words
compared to the six low-polysemy words, with F(1, 8) = 36.82, p < .0003.
Duncan's Multiple Range Test (at p < .05) yielded groupings of words
which were significantly different in consistency values, as shown in Table V.
The overall difference in consistency between words of high and low po-
lysemy (as measured by number of senses in dictionary entries or by number of
senses assigned by subjects in the second sorting task) goes against the natural
direction of bias in the Agreement-Disagreement ratio; the ratio is more likely
to be inflated when categories are larger (thus, having fewer categories is infla-
tionary). However, it is possible that some difference in the way citations are
distributed in categories for various words could give some words one unusually
large category along with some extremely small ones, while other words could
have several medium-size categories. Perhaps in that case the words with one
                                     Subject
Word^a                1      2      3      4      5      6      7      8      9   Mean
head (21)           .99    .63    .88    .68    .93    .54    .81    .97    .99    .83
life (18)           .30    .00    .69    .52    .31    .58    .26    .49    .30    .38
world (14)          .23    .08    .18    .34    .08    .45    .25    .47    .37    .27
way (12)            .45    .66    .47    .36    .86    .69    .55    .54    .45    .56
side (12)           .69    .12    .64    .42    .26    .36    .33    .66    .62    .45
hand (11)           .65    .82    .95    .87    .70    .76    .49    .88    .87    .78
fact (4)            .27   -.31    .59    .49   1.00   -.05   -.12    .29    .63    .31
group (4)          -.32   -.15   -.04   -.18   -.31    .79    .04    .44   -.29  -.005
night (3)          -.18    .02    .25    .01   -.04    .33    .11    .53    .41    .16
development (3)    -.11    .51   -.18   -.09    .60    .32    .70   -.19    .13    .18
something (2)       .02  -.007    .13    .13    .79   -.19   1.00   -.03    .24    .23
war (2)             .63    .70   -.04    .15    .52    .32    .53    .01    .02    .31
Mean                .30    .25    .37    .31    .47    .41    .41    .42    .39
a Number of dictionary senses is given in parentheses.
consistent words. For instance, development, something, and group each have
a major category with a larger proportion of citations in it than do head or hand,
which are much more consistent. So the difference in consistency by degree of
polysemy should not be attributable to bias in the ratio.
The size of the citation sample provided to the sorting subject (which could
be 20, 100, or 200 citation cards) had no effect on the size of the Agreement-
Disagreement ratio.
It seems likely, then, that some dictionary definitions are much more like
subjects' initial intuitions about features of similarity and difference between
the sentences that comprise sense groupings than are others. In addition, al-
though seeing the dictionary definitions for the high-polysemy words seems to
encourage subjects to add new senses, there seems to be a greater basic con-
gruence between initial intuitions and the definitions for these words than there
is between initial intuitions and the definitions for low-polysemy words.
Interpersonal Agreement. Table VI summarizes the extent of subject agreement
about proportional allotment of citations to various senses in Task 2. That
is, for each dictionary sense label, we calculated the proportion of citations each
subject allotted to it, and then we calculated the mean proportion of citations
for each sense (using N = 9, total number of subjects, regardless of possible
empty cells) and the standard deviation. Considering the standard deviation a
measure of interpersonal agreement, we computed the coefficient of variation
(V = s/m) to allow a comparison of interpersonal agreement between senses
with different means. Because different senses may have different numbers of
empty cells (subjects who chose not to use the sense), V is not always a precise
measure of comparison (as empty cells increase, the mean decreases, while the
standard deviation grows). Since N was set equal to the total number of subjects,
V is a measure of agreement about whether or not to choose the sense at all, as
well as a measure of agreement about the proportions of citations which belong
with the sense.
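As a rough illustration of how V behaves, the sketch below computes V = s/m for one sense from nine per-subject proportions, counting a subject who never used the sense as a zero so that N stays fixed. Two points are assumptions rather than details from the text: the sample (n - 1) standard deviation is used, and the input values are hypothetical, not the study's data.

```python
import statistics


def coefficient_of_variation(proportions):
    """Return V = s / m for one sense across all subjects.

    `proportions` holds one value per subject; a subject who never used
    the sense contributes 0 (an "empty cell"), so N stays fixed.
    Assumption: s is the sample standard deviation (n - 1 denominator).
    """
    m = statistics.mean(proportions)
    if m == 0:
        raise ValueError("sense was never used; V is undefined")
    return statistics.stdev(proportions) / m


if __name__ == "__main__":
    # Hypothetical percentages of citations for nine subjects.
    all_used = [60, 70, 80, 65, 75, 70, 68, 72, 70]
    two_empty = [60, 70, 80, 65, 75, 70, 68, 0, 0]
    print(round(coefficient_of_variation(all_used), 3))
    print(round(coefficient_of_variation(two_empty), 3))
```

Replacing two values with zeros lowers the mean and raises the standard deviation, so V grows for both reasons at once, which is the imprecision the paragraph above warns about.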
Table VI is based on all groupings which comprised 5% or more of any
subject's sample of citations (since 5% is equivalent to one citation for samples
of 20). A smaller value of V reflects greater agreement about the proportion of
citations which should be put in the sense, keeping in mind the biasing effect
of any empty cells.
Considering the number of dictionary definitions from which subjects had
to choose (especially for the high-polysemy words), as well as the fact that no
two subjects had the same set of citations for a given word (in spite of some
overlap), the extent of interpersonal agreement about which senses are important
seems impressive. However, the extent of agreement about proportional allot-
ment of citations varies considerably for different words. The primary sense of
head is an example of the cases with strong agreement: To obtain a coefficient
                                    Sense
Word^a                  1      2      3      4      5      6      7
head
  percent^b          69.0    6.8    1.7    1.6    1.3    0.8    0.6
  V^c                0.06   0.63   2.06   3.01   1.98   3.02   3.03
  empty cells^d         0      2      7      8      7      8      8
hand
  percent            63.0   21.0    1.8
  V                  0.20   0.30   1.38
  empty cells           0      0      5
way
  percent            44.0   11.0    9.0    5.3    4.8    3.4    1.7
  V                  0.21   0.46   1.14   1.72   1.53   1.31   2.07
  empty cells           0      1      3      6      6      5      7
side
  percent            27.7   11.0   10.8   10.7    9.4    5.8    3.4
  V                  0.56   0.68   0.40   1.11   1.13   0.90   1.98
  empty cells           1      2      0      3      4      3      7
life
  percent            30.4   14.7   10.8   10.6    9.7
  V                  0.50   0.50   0.99   1.34   0.93
  empty cells           1      0      3      5      3
war
  percent            85.4   12.9
  V                  0.09   0.44
  empty cells           0      0
fact
  percent            42.7   37.6   18.3    3.1
  V                  0.37   0.52   0.96   1.99
  empty cells           0      0      1      5
world
  percent            27.8   13.3    9.7    7.0    6.2    4.0
  V                  0.66   0.65   0.86   0.85   0.90   1.02
  empty cells           1      1      2      2      3      4
something
  percent            67.7   20.5
  V                  0.51   1.00
  empty cells           1      1
development
  percent            86.8    7.2    5.8
  V                  0.09   1.14   0.83
  empty cells           0      2      1
night
  percent            55.5   39.0    4.0
  V                  0.31   0.31   0.84
  empty cells           0      0      2
group
  percent            81.2   17.1    0.9    0.2
  V                  0.38   1.84   1.61   3.00
  empty cells           1      1      4      8
a Words are listed in descending order of Agreement-Disagreement ratios. Senses included are those
for which at least one subject apportioned at least 5% of a citation sample (equivalent to one
citation for a sample of 20). Head, hand, and side each had a few very small senses with low
agreement, which were omitted for lack of space.
b Percent = mean proportion of citations allotted, when n = 9, total number of subjects in Task
2.
c V = coefficient of variation (s/m).
d Empty cells = the number of subjects who did not use the sense at all in labeling.
of variation equal to .062, one needed a standard deviation of only 4.3 against
a mean of 69. The primary sense of world is an example of weak agreement,
with a standard deviation of 18.59 against a mean of 27.88. If interpersonal
agreement is taken as a criterion for saying that a grouping of citations exem-
plifies a psychologically real sense (which is a matter of degree, just as is
intrapersonal agreement), then it could be argued that none of these words has
much more than three real senses, if having V less than (around) 1 (so that the
standard deviation is less than the mean) is used as the criterial limit.
It is notable that interpersonal agreement does not seem to correlate with
the A-D ratio number (or with number of dictionary senses), i.e., there is as
much agreement about major senses and proportions for war, development, and
night, as for head, hand, and way.
Patterns of sense distribution also vary considerably among these words.
For instance, "hand" and "head" show less agreement on minor senses than
do life and world, but more agreement on the primary sense. It is possible that
the more important senses for life and world (or the ways they relate to dictionary
definitions) are more alike and confusable than those for hand and head.
One interesting practical outcome of having empirically derived information
about patterns of sense distribution is that one can use them to estimate the
probabilities with which a computerized disambiguation program will make er-
rors in processing a given amount of text if it lacks a representation for a
particular sense. Some very infrequent senses might not be worth including in
such a system if their absence generates very low error rates. Kelly and Stone
(1975) describe a similar error prediction procedure.
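A minimal sketch of this kind of estimate follows. It is not Kelly and Stone's procedure, which is not described here. It assumes that each occurrence of the word independently carries the missing sense with probability equal to that sense's observed proportion, and that every such occurrence is processed wrongly. Both function names are hypothetical.

```python
def expected_errors(sense_proportion: float, occurrences: int) -> float:
    """Expected number of misprocessed tokens if the sense is not represented."""
    return sense_proportion * occurrences


def prob_at_least_one_error(sense_proportion: float, occurrences: int) -> float:
    """Chance that the missing sense turns up at least once in the text."""
    return 1.0 - (1.0 - sense_proportion) ** occurrences


if __name__ == "__main__":
    # Seventh sense of "head" in Table VI: 0.6% of citations.
    p = 0.006
    for n in (100, 1000):
        print(n, expected_errors(p, n), round(prob_at_least_one_error(p, n), 3))
```

Under these assumptions, omitting a sense that accounts for 0.6% of citations costs about six errors per thousand occurrences of the word, a figure a system designer could weigh against the cost of representing the sense.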
Effect of Sample Size on Number of Categories. In using sorting tasks to
assess the polysemy of particular words, lexicographers may naturally often find
that the number of citations available for low-frequency words is much smaller
than for high-frequency words. If the sorting task just involves judgments of
meaning difference, the number of senses of a word actually represented in a
given citation sample will vary as a function of the size of the sample and the
probability of occurrence of the various senses of the word. However, the sorting
process may be contaminated by biases of various kinds, so that it is not an
accurate measure of the meaning differences people actually perceive in ordinary
comprehension. Since lexicographers do not have equal numbers of citations
for words of differing frequencies, it seems important to ask whether or not
citation sample size alone has an effect on the number of categories people will
make when they are asked to sort. Is the greater polysemy of high-frequency
words in the dictionary at least partly an effect of this kind of bias, so that the
larger the sample, the more categories one feels inclined to make?
Since the high-frequency words in the sorting tasks used here were pre-
sented to subjects in citation samples of three different sizes, the results of these
tasks may offer some insight concerning the possibility of this kind of sample
size bias in lexicographic sorting tasks. Table VII presents the mean number of
senses created for each word in each of the three sample size conditions, both
for Task 2 and for Task 1, sorting with and without dictionary definitions as
guides.
To evaluate the possibility of a sample size bias, one needs to know what
to expect if such a bias does not exist, i.e., what are the expected outcomes of
our sorting conditions when those outcomes are solely a function of sample size
and the probabilities of occurrence of the senses of each given word? Such a
model of expectations is difficult to find because the true numbers and relative
probabilities of senses are not known. There are no independent data that
provide information about sense distributions; West (1953) has estimates for some
words, but one cannot tell how they were obtained. Also, it is clear from the
earlier results of these studies that senses are not at all equiprobable, but analytic
ways of deriving a model of expectations, such as the Poisson distribution
(Feller, 1950), are not applicable to nonequiprobable categories of outcome.
In order to get around these problems, one may use differences in the
proportional occurrence of senses in samples of different sizes as a clue to the
presence or absence of a sample size bias in sorting. It is possible to use the
proportions of citations allotted to the various senses for a given word by our
subjects in Task 2 as an estimate of the true probabilities of occurrence of those
senses. The dictionary definition labels let one combine proportional information
about the senses from different subjects, since the subjects are in good agreement
about the use and relative importance of the various labels (even though no two
subjects had the same set of citations), as Table VI shows.
The estimate of sense proportions for any word, then, was based on data
from three subjects from the largest sample size condition (200 citations), by
averaging proportions of citations allotted to each sense category chosen by
subjects for that word. This averaging did not violate any subject's weightings
of senses, except in the case of a few very small senses.
These average sense proportions were taken as estimates of probabilities.
For each word, a Monte Carlo simulation generated events according to these
estimated probability distributions. Results of this procedure are estimates, for
a given word, of the probability of occurrence of a given number of senses in
the samples of 20 and 100 citations.
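The simulation step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the program used in the study: citations are drawn independently from the estimated sense probabilities, the number of trials and the seed are arbitrary, and the probabilities in the usage example are hypothetical.

```python
import random
from collections import Counter


def simulate_sense_counts(probs, sample_size, trials=2000, seed=0):
    """Estimate P(k distinct senses appear) in a sample of `sample_size`.

    `probs` are estimated sense probabilities (they should sum to 1).
    Returns a dict mapping k to the fraction of trials that showed k senses.
    """
    rng = random.Random(seed)          # fixed seed keeps runs repeatable
    senses = range(len(probs))
    tally = Counter()
    for _ in range(trials):
        draw = rng.choices(senses, weights=probs, k=sample_size)
        tally[len(set(draw))] += 1
    return {k: n / trials for k, n in sorted(tally.items())}


def expected_senses(dist):
    """Mean number of distinct senses under a simulated distribution."""
    return sum(k * p for k, p in dist.items())


if __name__ == "__main__":
    # Hypothetical sense probabilities, not taken from the paper's data.
    probs = [0.69, 0.20, 0.07, 0.03, 0.01]
    for size in (20, 100):
        dist = simulate_sense_counts(probs, size)
        print(size, dist, round(expected_senses(dist), 2))
```

The resulting distribution for each sample size is what an observed number of categories would be compared against: it describes how many senses should appear if sorting reflected only sample size and sense probabilities.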
The numbers of categories that were created by each individual subject in
each condition (20 or 100 citations) for each of the words in Task 2 may be
taken as data to be compared to the distribution of expected values resulting
from the Monte Carlo simulation. Unfortunately, to firmly reject the hypothesis
that sample size creates a bias in sorting, one would need a larger number of
subjects for each word in each condition (n = 3 for this in Task 2). However,
one may examine the probabilities of the individual outcomes across all 12 words
to see if there is any clear pattern of deviation above or below the median of
expected values, which would indicate a good possibility of bias. A pattern
below the median (since we are extrapolating from the sample of 200 to the
samples of 20 and 100) would be indicative of the bias to make disproportion-
APPENDIX
15
Step 3. Calculate the A-D ratio from these three numbers, 29, 27, 15, by
substituting them into the following equation:
\[ r = 1 - \frac{x + y - 2 \cdot \text{intersection}}{60} \]
\[ r = 1 - \frac{29 + 27 - 30}{60} = 1 - \frac{26}{60} = 1 - .433 = .57 \]
The denominator in the above formula (60 in this example) represents the
maximum number of agreements possible, x(x - 1)/4.
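The arithmetic of Step 3 can be checked with a short Python sketch. The function and parameter names are hypothetical, and the three counts are taken as given from the earlier steps rather than recomputed here.

```python
def ad_ratio(x, y, intersection, max_agreements):
    """Compute r = 1 - (x + y - 2 * intersection) / max_agreements.

    x, y and intersection are the three counts used in Step 3;
    max_agreements is the denominator (60 in the worked example).
    """
    if max_agreements <= 0:
        raise ValueError("max_agreements must be positive")
    return 1 - (x + y - 2 * intersection) / max_agreements


# Worked example from Step 3.
r = ad_ratio(29, 27, 15, 60)
print(round(r, 3))  # prints 0.567
```

Two identical counts that coincide completely with their intersection give x + y - 2 * intersection = 0, so the ratio reaches its maximum of 1.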
In this paper, we were concerned with comparing the sorting for a given
word in sorting experiment I with its sorting in experiment II, by individual
subject, rather than with comparing sortings made by different subjects (since
no two subjects had the same set of citations).
Shipstone points out that the ratio is biased in that the measure of agreement
for a set rises with the increase in the number of instances; that is, the A-D
ratio grows somewhat disproportionately as categories get larger.
If each citation added to a set is not compared with every item already in
the set (that is, if the subject simply assumes his or her criteria are transitive
across all items), then the assumption that every item in a set adds a unit of
agreement for its conjunction with every other item in the set may also spuriously
inflate the ratio for large sets.
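To see why large sets weigh so heavily, suppose (as the sentence above is read here, which is an interpretation) that a set of k items contributes one unit of agreement per pair of items, that is, k(k - 1)/2 units. The short Python sketch below, with a hypothetical function name, shows how quickly that count grows.

```python
def within_set_pairs(k):
    """Number of item pairs in a set of k items: k(k - 1)/2."""
    return k * (k - 1) // 2


# Doubling the size of a set roughly quadruples its pair count.
for k in (2, 5, 10, 20):
    print(k, within_set_pairs(k))
```

Under this reading, a single category of 20 citations contributes 190 pairs, whereas four categories of 5 citations contribute only 40 pairs in total, so one large category can dominate the measure.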
REFERENCES
Anderson, R. C., & Ortony, A. (1975). On putting apples into bottles--A problem of polysemy.
Cognitive Psychology, 7, 167-180.
Barclay, J. R., Bransford, J. D., Franks, J. J., McCarrell, N. S., & Nitsch, K. (1974). Compre-
hension and semantic flexibility. Journal of Verbal Learning and Verbal Behavior, 13, 471-
481.
Bierwisch, M. (1981). Basic issues in the development of word meaning. In W. Deutsch (Ed.),
The child's construction of language (pp. 341-380). New York: Academic Press.
Byrd, R. J., Calzolari, N., Chodorow, M. S., Klavans, J. L., Neff, M. S., & Rizk, O. A. (1987).
Tools and methods for computational lexicology (RC 12642, No. 56847). Yorktown Heights,
NY: IBM Thomas J. Watson Research Center.
Caramazza, A., Grober, E. H., & Zurif, E. B. (1974). A psycholinguistic investigation of polysemy:
The meanings of LINE. Unpublished manuscript, The Johns Hopkins University, Baltimore.
Churchland, P. M. (1979). Scientific realism and the plasticity of mind. Cambridge: Cambridge
University Press.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics in
psychological research. Journal of Verbal Learning and Verbal Behavior, 12, 335-359.
Deese, J. (1967). Meaning and change of meaning. American Psychologist, 22, 641-651.
Durkin, K., & Manning, J. (1989). Polysemy and the subjective lexicon: Semantic relatedness and
the salience of intraword senses. Journal of Psycholinguistic Research, 18, 577-611.
Edwards, D. (1983). Foundationalism in the philosophy of science and in semantics. Unpublished
manuscript, Princeton University, Princeton, NJ.
Feller, W. (1950). An introduction to probability theory and its applications. New York: John
Wiley and Sons.
Jackendoff, R. (1983). Semantics and cognition. Cambridge, MA: MIT Press.
Johnson-Laird, P. N., & Quinn, J. G. (1976). To define true meaning. Nature, 264, 635-636.
Katz, J. J., & Fodor, J. A. (1963). The structure of a semantic theory. Language, 39, 170-210.
Kelly, E., & Stone, P. (1975). Computer recognition of English word senses. Amsterdam: North
Holland.
Kucera, H., & Francis, W. (1967). Computational analysis of present-day American English.
Providence, RI: Brown University Press.
Kurylowitz, J. (1955). [Notes on word meanings]. Voprosy Jazykoznanija, 3, 73-81.
MacNamara, J. (1971). Parsimony and the lexicon. Language, 47, 359-374.
Miller, G. A. (1969). A psychological method to investigate verbal concepts. Journal of Mathe-
matical Psychology, 6, 169-191.
Miller, G. A., Fellbaum, C., Kegl, J., & Miller, K. (1988). Wordnet: An electronic lexical reference
system based on theories of lexical memory (Report No. 11). Princeton, NJ: Princeton Uni-
versity, Cognitive Science Laboratory.
Miller, G. A., & Gildea, P. M. (1987). How children learn words. Scientific American, 258, 94-
99.
Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. Urbana:
University of Illinois Press.
Panman, O. (1982). Homonymy and polysemy. Lingua, 58, 105-136.
Quine, W. V. O. (1960). Word and object. Cambridge, MA: MIT Press.
Reder, L. M., Anderson, J. R., & Bjork, R. A. (1974). A semantic interpretation of encoding
specificity. Journal of Experimental Psychology, 102, 648-656.
Rosch, E., & Mervis, C. (1975). Family resemblances: Studies in the internal structure of categories.
Cognitive Psychology, 7, 573-605.
Schoen, L. M. (1988). Semantic flexibility and core meaning. Journal of Psycholinguistic Research,
17, 113-123.
Shipstone, E. I. (1960). Some variables affecting pattern conception. Psychological Monographs:
General and Applied, 74, 1-41.
Tulving, E., & Thomson, D. M. (1973). Encoding specificity and retrieval processes in episodic
memory. Psychological Review, 80, 352-373.
Webster's New Collegiate Dictionary (1975). Springfield, MA: G. & C. Merriam.
Weinreich, U. (1980). On semantics. Philadelphia: University of Pennsylvania Press.
West, M. (1953). A general service list of English words with semantic frequencies and a
supplementary word-list for the writing of popular science and technology. London: Longmans Green.
Wittgenstein, L. (1958). Philosophical investigations. New York: Macmillan.
Zgusta, L. (1971). Manual of lexicography. The Hague: Mouton.
Zipf, G. K. (1945). The meaning-frequency relationship of words. Journal of General Psychology,
33, 251-256.
Zipf, G. K. (1949). Human behavior and the principle of least effort. Cambridge, MA: Addison-
Wesley.