1. INTRODUCTION
The need for a statistical methodology for analyzing linguistic data is,
I believe, vital when these data are a function of either geographical or
sociolinguistic factors. In analyzing these data one is called upon to
draw inferences from a large number of linguistic phenomena from a
large set of informants. Wide experience in other behavioral sciences
has shown, however, that if objective inferences are to be obtained,
quantification of these linguistic events is necessary. And since linguistic
phenomena of this kind usually do not constitute measurement which is
fully reproducible, as it is in the physical sciences, and are generally
subject to error, they are best analyzed statistically. Statistics provides,
first of all, empirically tested formulae for drawing accurate inferences
about the differences or similarities of a given population from a sample,
even though any particular set of data may be very inaccurate. And
secondly, these statistical formulae provide an estimate of the degree of
error involved in making these inferences. Lack of this estimate of error
usually renders statements about a population from a sample unreliable.1
Although David W. Reed and John L. Spicer proposed as early as
1952 correlation methods which provided a rigorous means by which
* An earlier version of this paper was presented at the meeting of the Midwest
Modern Language Association in Chicago on 8 May 1965. I am especially indebted
to John W. Bowers, Associate Professor of Speech, University of Iowa, for his numerous editorial suggestions and help with the statistical design and analysis. I also wish
to express my thanks to Robert Howren, Jr., Associate Professor of English, University
of Iowa, Pavle Ivic, Faculty of Philosophy, Novi Sad University, Yugoslavia, and
Roger Shuy, Assistant Professor of English, Michigan State University, for the advice,
comments, and encouragement they have given me in the pursuit of this study.
1
The statistical concepts expressed in this introduction were obtained from the
introductory material in T. G. Connolly and W. Sluckin, Statistics for the social
sciences (London, 1962), 1-3; George A. Ferguson, Statistical analysis in psychology
and education (New York, 1959), 1-12; and Sidney Siegel, Nonparametric statistics for
the behavioral sciences (New York, 1965), 1-5.
CHARLES L. HOUCK
basis of published linguistic atlas materials, the level of statistical sophistication in the field of linguistic geography has remained low and static
since the appearance of their article in 1952.7
The purpose of this pilot study is fivefold: (1) It will apply the concept
of density as an attempt to obtain statistical data which largely eliminates
the chance factor. It will do this by providing the phi coefficient and the
tetrachoric correlation with input which is derived, first, from a larger
informant sample concentrated in a smaller geographical area, and second,
from a much larger sample of lexical test questions and response items.
(2) It will explore the use of fourfold correlation analyses in which the
correlation coefficients are used not only terminally, as in Reed and
Spicer,8 but also instrumentally, as input for factor analyses which
attempt to determine from these intercorrelations whether the "variation
represented can be accounted for adequately by a number of basic
categories smaller than that with which the investigation was started".9
(3) It will provide a simple frequency count analysis which, first, provides
a basis for the rejection or retention of test questions and response items
in the questionnaire, and, second, provides data in an accessible form so
that a statistical test for differences can be executed on the response to
dialect lexical items and a given geographical area can be classified
dialectally. (4) It will provide computer programs which can process
a large amount of linguistic data for correlation, count, and factor
analyses. (5) It will report substantive findings of the pilot study.
2. METHOD
The following sections will describe the methodology used in this study.
2.1. GEOGRAPHICAL AREA. Since the orientation of this study is primarily methodological, no attempt was made to pick a county which
would have important dialectal findings. Johnson County, Iowa, however, falls within the Davenport-Cedar Rapids-Dubuque triangle of
Iowa which is described by Harold B. Allen as showing, in his view,
strong Northern elements although the contrasts to Midland features
are not as strong as they are at the major boundary.10 The resultant
7
See Glenna Ruth Pickford, "American linguistic geography: a sociological appraisal", Word, 12 (1956), 211-233, for a comment on linguistic geography methodology which is still apropos. For a more recent comment, see Charles A. Ferguson,
Social science research council, vol. 19, no. 1 (1965).
8
Reed and Spicer, Language, 28, 348-359.
9
Benjamin Fruchter, Introduction to factor analysis (Princeton, New Jersey, 1954), 1.
10
Harold B. Allen, "The primary dialect areas of the upper midwest", in Harold
B. Allen (ed), Readings in applied English linguistics (New York, 1964), 233 and 241.
The township was discarded as the basic geographical unit because it was too small
a unit, especially in the relatively sparsely settled areas, to provide enough informants
meeting the requirements set forth in 2.3; moreover, farmers, at least in Johnson
County, many times move from township to township in quest of better farms and
living conditions, or simply to town to retire, but remain in the county, and, most
of the time, in the same sections of the county as devised for this study.
12
A word geography of the eastern United States (Ann Arbor, U. of Michigan, 1949).
13
"The speech of Ocracoke, North Carolina", American Speech, 37 (1962), 163-175.
A copy of the questionnaire was also made available to me by Robert Howren.
sounds; (8) calls to farm animals; (9) landscape; (10) fishing; (11) roads;
(12) food; (13) nature; (14) kinship terms (primarily parental); (15)
idioms; (16) childhood terms for playthings and games; and (17) miscellaneous. In Table 3 is a sample of fifty-four key questions, their
respective response items, and the frequency with which each was chosen.
The questionnaires were distributed in person to help insure the high
return necessary for a methodological pilot study. The informants were
provided with a stamped envelope for the return of the questionnaire.
2.5. THE PHI COEFFICIENT WITH THE CHI-SQUARE TEST. The phi coefficient, or fourfold point correlation, measures, like other tests I will
describe later, what statistical relationships exist among informants on
the criterion of lexical similarity. It assumes that a given lexical response
item is either present or absent in a given idiolect. Given a phi coefficient, one can determine by referring to appropriate theoretical distributions the likelihood of the apparent relationships having occurred by
chance. Such a test is the chi-square test for significance. By referring
to a chi-square table for the critical value required for significance at an
accepted significance level for the appropriate degrees of freedom, one
can determine whether the values for the differences between the observed
and the expected frequencies are significant and cannot reasonably be
explained by sampling fluctuation or chance.14 The phi coefficient is
used here to provide input for Guttman's Radex Analysis15 and the
cluster analysis.
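The relation between the fourfold table, the phi coefficient, and the chi-square value can be illustrated with a minimal Python sketch. It is not the program used in the study; the function names and the two toy response vectors are hypothetical, and the sketch assumes the standard identities phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)) and chi-square = N x phi squared for one degree of freedom.

```python
import math

def fourfold_counts(x, y):
    """Tally the four cells for two equal-length presence/absence (1/0) vectors."""
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # item used by both informants
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)  # used by the first only
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)  # used by the second only
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # used by neither
    return a, b, c, d

def phi_and_chi_square(a, b, c, d):
    """Return (phi, chi-square) for a fourfold table; chi-square has df = 1."""
    n = a + b + c + d
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("a marginal total is zero; phi is undefined")
    phi = (a * d - b * c) / denom
    return phi, n * phi ** 2

# Hypothetical responses of two informants to ten response items.
informant_x = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
informant_y = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]

a, b, c, d = fourfold_counts(informant_x, informant_y)
phi, chi2 = phi_and_chi_square(a, b, c, d)
print(a, b, c, d, round(phi, 2), round(chi2, 2))
```

With these toy data phi is about .60 and chi-square about 3.60, which falls short of the 6.64 required at the .01 level with one degree of freedom; so few items cannot establish significance, which is the practical argument for a long questionnaire.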
2.6. THE TETRACHORIC CORRELATION. The tetrachoric correlation is
also a fourfold correlation which treats the dichotomy, presence and
absence, as though it is on a continuum; i.e. sometimes present, sometimes
absent, depending, e.g. on the speech situation.16 The tetrachoric is
used here primarily to provide input for the multiple factor analysis.
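An exact tetrachoric coefficient requires fitting a bivariate normal distribution to the fourfold table, which is too long to show here. As a rough illustration only, the sketch below uses the classical cosine-pi approximation, r = cos(pi / (1 + sqrt(ad/bc))). This is a substitute technique chosen for brevity, not the routine used in the study, and the function name is hypothetical.

```python
import math

def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation of a fourfold table.

    a, d are the agreeing cells (both present, both absent);
    b, c are the disagreeing cells.
    """
    if b * c == 0:
        if a * d == 0:
            raise ValueError("table is degenerate; the approximation is undefined")
        return 1.0  # no disagreements at all: limiting value of the formula
    return math.cos(math.pi / (1.0 + math.sqrt((a * d) / (b * c))))

# Same hypothetical table as a phi of .60 would come from: a=4, b=1, c=1, d=4.
print(round(tetrachoric_cos_pi(4, 1, 1, 4), 2))
```

For the same table the approximation yields a larger value than phi, which agrees in direction with the results reported below, where the tetrachoric intercorrelations run consistently higher than the phi coefficients.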
2.7. FACTOR ANALYSIS. Three kinds of factor analysis are employed
in this study: (1) Guttman's Radex approach to factor analysis;17
(2) a multiple factor analysis computer program assembled by Professor
Harold Bechtoldt, Department of Psychology, University of Iowa;18 and
(3) Robert C. Tryon's cluster analysis.19 All three of these factor analyses
provide "a mathematical model which can be used to describe certain
areas [of linguistic behavior such as the use of lexical items]. A series ...
of measures [e.g. responses to lexical items in a questionnaire] are
intercorrelated to determine the number of dimensions the test space
occupies, and to identify these dimensions in terms [of linguistic or
socio-geographical categories]. The interpretations are done by observing which tests fall on a given dimension and inferring what these
tests have in common [e.g. geography, occupation, age, sex, or education]
that is absent from tests not falling on the dimension. Tests correlate
to the extent that they measure common traits ... [Responses to a checklist questionnaire or to a fieldworker can be studied] to detect possible
common sources of variation or variance, [or factors; and factors
represent] the fundamental underlying sources of variation operating in a
given set of scores or other data observed under a specified set of conditions."20
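The idea of reducing a matrix of intercorrelations to a smaller number of dimensions can be illustrated with a short Python sketch. It extracts only the first principal factor by power iteration, which is a simplification and not any of the three procedures named above; the function name is hypothetical. The matrix holds the phi coefficients for informants 2, 7, 18, 23, and 28 as reported in Tables 1 and 2.

```python
import math

# Phi coefficients among informants 2, 7, 18, 23, 28 (Tables 1 and 2).
R = [
    [1.00, 0.50, 0.49, 0.47, 0.47],
    [0.50, 1.00, 0.47, 0.45, 0.46],
    [0.49, 0.47, 1.00, 0.39, 0.42],
    [0.47, 0.45, 0.39, 1.00, 0.37],
    [0.47, 0.46, 0.42, 0.37, 1.00],
]

def first_principal_factor(matrix, iterations=200):
    """Return (largest eigenvalue, loadings) of a symmetric matrix by power iteration."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n  # start from a uniform unit vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue belonging to the unit vector v.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n))
    loadings = [math.sqrt(eigenvalue) * x for x in v]
    return eigenvalue, loadings

eigenvalue, loadings = first_principal_factor(R)
print(round(eigenvalue, 2), [round(x, 2) for x in loadings])
```

All five informants load positively on this first factor, and it absorbs more than half of the total variance of five, which is one way of expressing numerically the homogeneity discussed in the results. A full multiple factor analysis would go on to extract and rotate further factors.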
2.8. FREQUENCY COUNT OF ITEMS ACROSS INFORMANTS. The purpose
of the frequency count is to provide a tabulation of response items
across informants, so that the total number of responses to particular
response items in the questionnaire can be readily determined and analyzed. This is important for editorial purposes, for the count can determine meaningfulness of response items in the questionnaire for a particular geographical area. The frequency count also provides input for
the t-test.21
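A frequency count of this kind can be sketched in a few lines of Python. The informant identifiers and their responses below are invented for illustration, and the function name is hypothetical; only the response items themselves (pail, bucket, swill, slop, creek, prong) are taken from the questionnaire.

```python
from collections import Counter

# Hypothetical checklist returns: informant -> set of response items chosen.
responses = {
    "inf01": {"pail", "swill", "creek"},
    "inf02": {"bucket", "slop", "creek"},
    "inf03": {"pail", "slop", "creek"},
}
questionnaire_items = ["pail", "bucket", "swill", "slop", "creek", "prong"]

def frequency_count(responses, items):
    """Count, for every questionnaire item, how many informants chose it."""
    counts = Counter()
    valid = set(items)
    for chosen in responses.values():
        counts.update(chosen & valid)  # ignore anything not on the questionnaire
    return {item: counts[item] for item in items}  # zero counts are kept

counts = frequency_count(responses, questionnaire_items)
unused = [item for item, n in counts.items() if n == 0]
print(counts)
print("items chosen by no informant:", unused)
```

Keeping the zero counts in the tabulation is deliberate: the items that no informant chose are exactly the ones an editor would consider deleting from the questionnaire.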
2.9. THE t-TEST FOR THE DIFFERENCE BETWEEN TWO MEANS. The t-test
determines whether an apparent difference between two means can
easily be accounted for by chance.22 In this study, it will be used to
determine whether Johnson County natives employ Northern lexical
items significantly more often than they use Midland lexical items.
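The computation behind such a test can be sketched as follows. The sketch assumes two independent samples and a pooled variance estimate; if the Northern and Midland counts are taken from the same informants, a paired formula would be the appropriate choice instead. The function name and the two samples are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Return (t, degrees of freedom) for two independent samples, pooled variance."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(sample1) - mean(sample2)) / standard_error, n1 + n2 - 2

# Hypothetical numbers of Northern and Midland items chosen by five informants each.
northern = [14, 16, 15, 17, 13]
midland = [13, 14, 12, 15, 11]

t, df = pooled_t(northern, midland)
print(round(t, 2), df)
```

The resulting t is then compared with the tabled critical value for the given degrees of freedom at the chosen significance level.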
available upon request from the State University of Iowa Computer Center, Iowa
City, Iowa 52240.
19
R. C. Tryon, Cluster analysis: correlation profile and orthometric (factor) analysis for
the isolation of unities in mind and personality (Ann Arbor, 1939), especially 41-48.
20
Fruchter, op. cit., 2-4 (see fn. 9).
21
The frequency count analysis has been expanded to three types. The primary
addition is the tabulation and percentage of response items across informant profile,
which includes sex, age, education, and occupation. The program identifies the
profile, totals the number of informants who belong to each profile, and indicates how
each profile responds in toto to each lexical item in the questionnaire. This output can
then be fed into a Type I analysis of variance which tests whether each profile differs
significantly in relation to each lexical response item.
22
George A. Ferguson, op. cit., 126-128 (see fn. 1).
2.10. THE COMPUTER. The study was designed to make full use of
the computer for two reasons: (1) accuracy, for correlation and factor
analysis studies entail a great amount of intricate mathematical computation and counting which, by their very nature, are greatly error prone
when done by hand; and (2) efficiency, for, since there is a great amount
of mathematical computation and counting, the computer saves time.
It is, of course, in this area that a computer provides the linguistic
geographer with his greatest boon, for it allows him to increase his informant sample for more reliable results. In this study, for example, the
estimated time for manual computation of a 32 × 32 phi coefficient and
tetrachoric matrix was more than one thousand hours. The estimated
time for programming, keypunching, and eliminating program errors
('de-bugging') is around one hundred hours. Although the saving of
time here is large, the real saving comes when data from a new study
are to be analyzed, for all that remains is the preparation of the data,
a minor part of the process.
In this study, then, a computer program was used for each type of
analysis except for the t-test and the cluster analysis. A computer
program for the cluster analysis is now operational.23
The following sequence was used for analysis on the computer: (1) The
data were readied on computer data cards. (2) The frequency count
program was then run. This program not only provided the necessary input for the t-test, but also provided automatically another input
deck for the phi and tetrachoric program in which all the response items
that none of the informants responded to were deleted. This was done so
that the 'D' cell of the fourfold contingency table for the two correlation
computations would not be inflated, thus providing greater correlation
discrimination. (3) The phi and tetrachoric program was then run. This
program also automatically provided an input deck in the form of a
symmetrical tetrachoric intercorrelation matrix for the multiple factor
analysis program. (4) The multiple factor analysis was run in two stages:
(a) exploratory; and (b) confirmatory.
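The effect of step (2) on the 'D' cell can be made concrete with a small Python sketch. The data matrix is invented and the function names are hypothetical; the point is only that items chosen by nobody count as joint absences for every pair of informants and therefore push all the phi coefficients upward together.

```python
import math

def phi(x, y):
    """Phi coefficient for two presence/absence (1/0) vectors."""
    a = sum(1 for i, j in zip(x, y) if i and j)
    b = sum(1 for i, j in zip(x, y) if i and not j)
    c = sum(1 for i, j in zip(x, y) if not i and j)
    d = sum(1 for i, j in zip(x, y) if not i and not j)  # the 'D' cell: joint absence
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def drop_unused_items(matrix):
    """Keep only the columns (response items) chosen by at least one informant."""
    keep = [k for k in range(len(matrix[0])) if any(row[k] for row in matrix)]
    return [[row[k] for k in keep] for row in matrix]

# Hypothetical data: three informants (rows) by ten response items (columns).
# The last five items were chosen by nobody.
matrix = [
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
]
filtered = drop_unused_items(matrix)

before = phi(matrix[0], matrix[1])
after = phi(filtered[0], filtered[1])
print(round(before, 2), round(after, 2))
```

In this toy case the coefficient for the first two informants drops from about .52 to about .17 once the five unused items are removed, which shows how much of the apparent agreement came from the inflated 'D' cell.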
23
The use of the computer was first made possible through the interest of Garry A.
Flint, a computer programmer at the Indiana University Computing Research Center
in the summer of 1964. He was responsible for the phi and tetrachoric program used
in this pilot study. I am also indebted to him for his help in learning the basics of
computer programming. Since the completion of this pilot study I have expanded the
analyses and have increased the data processing capacity of the various computer
programs through the generous help of the University of Iowa Computer Center.
This expanded methodology has been applied to the Iowa Atlas checklist materials,
and the results will appear in a monograph by Robert Howren, Jr. and myself, to
be published by the Iowa State University Press, Ames, Iowa. The complete methodology will also be described in my doctoral dissertation.
The overall results were encouraging, for the degree of density provided
highly reliable data input for the phi, tetrachoric, and count analyses.
The phi and tetrachoric intercorrelation matrices consistently showed
middle to relatively high but homogeneous intercorrelations, indicating
perhaps dialectal homogeneity, while at the same time revealing idiolectal
discrimination. The range for the phi coefficient intercorrelations was .05
to .60; the range for the tetrachoric intercorrelations was .09 to .83.
All the intercorrelations except four were significant (if χ² > 6.64, df =
1, p < .01); i.e. if chi-square is greater than 6.64 at one degree of freedom,
the probability is that fewer than one intercorrelation out of 100 would
be due to chance. A randomly selected sample of phi (with their χ²
values) and tetrachoric intercorrelations is shown in Table 1.
TABLE 1
A phi coefficient and tetrachoric intercorrelation matrix of randomly
selected informants from the five county-sections of Johnson County

Informants          2          7         18         23         28
2                1.00        .50        .49        .47        .47
                             .73*       .72        .70        .70
                          176.59**   168.80     118.42     152.48
7                           1.00        .47        .45        .46
                                        .71        .68        .69
                                     161.72     143.04     146.52
18                                     1.00        .39        .42
                                                   .60        .64
                                                104.63     123.04
23                                                1.00        .37
                                                              .58
                                                            94.43
28                                                           1.00

* Tetrachoric intercorrelations.
** Chi-square values.
The four non-significant intercorrelations were caused by one informant who also showed marked deviation from the rest of the informants,
even though he correlated significantly with them in some respects. No
explanation can be offered for this deviation, for there is nothing in his
biographical data which would indicate even a post hoc explanation for
the deviation. On the criteria set up for the selection of informants in the
Linguistic Atlas of the United States and Canada, he would have been an
ideal informant: he was a native and life-long resident of Johnson
County, Iowa; he was 71 years old; he was a farmer who owned his own
farm; and he had only four years of education. I believe this case of
deviance points up rather concretely the need to exercise care in assuming
that an informant who meets the informant selection criteria of the
Linguistic Atlas of the United States and Canada necessarily represents
the norm of his geographical area, and to note that he may in fact
contribute spurious data to a survey.
TABLE 2
The Guttman 'quasi-simplex covariance structure'

Informants          2          7         18         28         23
2                1.00        .50        .49        .47        .47
                             .73*       .72        .70        .70
                          176.59**   168.80     152.48     118.42
7                           1.00        .47        .46        .45
                                        .71        .69        .68
                                     161.72     146.52     143.04
18                                     1.00        .42        .39
                                                   .64        .60
                                                123.04     104.63
28                                                1.00        .37
                                                              .58
                                                            94.43
23                                                           1.00

* Tetrachoric intercorrelations.
** Chi-square values.
There were three hundred and forty-two of these. All informants responded to only ten response-items. In terms of the above correlational
analyses, the informants were correlated over 738 response-items rather
than the 1,080 items of the original list. This type of information is
important for editorial purposes. A questionnaire of 1,080 lexical
response items presents a formidable task for many informants; thus,
a frequency count analysis which indicates that three hundred and
forty-two of these one thousand and eighty response items were responded to by none of the informants means that this questionnaire
contained considerable excess baggage. Since studies in other social
sciences show that short questionnaires obtained a better return percentage than long ones, it seems almost mandatory that those three hundred
and forty-two response items be deleted in this case.
4. SUBSTANTIVE RESULTS
The sample of fifty-four key lexical test questions and their respective response
items in Table 3 were chosen on the basis of the findings in H. Kurath's A word geography of the eastern United States, Roger W. Shuy's monograph, The northern-midland dialect boundary in Illinois (Publication of the American Dialect Society,
no. 38) (U. of Alabama, 1962), and the previously cited Davis and McDavid article.
These lexical response items consistently showed dialectal variation as a function of
geography. Words marked N, M, SM, and S are Northern, Midland, South Midland,
and Southern respectively. This dialectal classification is not, however, absolute,
for the drawing of isoglosses tends to be more of an art than a science, and dialectal
overlap is the rule rather than the exception; but there is, for the most part, general
agreement that the classified response items in Table 3 represent that particular
dialect area. The unclassified response items are either nondiscriminate or more
restricted in relation to dialect areas. The frequency with which each response item
was chosen was compiled by the frequency count program.
pattern shows dialect mixture within a set of response-items since the modal
distribution is not so extreme between Northern and Midland response-items, and, in some instances, almost bi-modal, as in 14 and 21. In this
sample, only test questions 47 and 51 show true bi-modal distribution.
These results show, then, that, while dialectal mixture occurs, there is
no central tendency for Northern or Midland response items to occur
overall more frequently, thus indicating once again dialectal homogeneity.
TABLE 3
mf
20*
5
11
16*
15
eavetroughs (N)
eavestrough (N)
gutters (M,S)
spouts (M)
13*
8
11
coil
pile
neap (N)
tongue
pole (N)
spear
8. TWIN POLES OF BUGGY
shafts (M)
shavs
thills (N)
31*
0
0
4
hayrick (M)
2
haymow
0
0
Dutch cap
barrack (N)
0
haystack
30*
6. SMALL STACK FOR DRYING HAY
IN FIELD:
haycock (N)
19*
tumble (N)
0
doodle (M)
0
heap (N)
0
cock
4
* Modal response
1
5
drafts
0
31*
1
0
28*
5
1
0
0
whiffletree (N)
whippletree (N)
swingletree
singletree (M)
10. PIVOTED CROSSBAR FOR TWO
HORSES:
evener (N)
doubletree
spreader
double singletree
11.
0
0
1
31*
0
31*
0
1
WOOD IN WAGON:
hauling (M)
drawing (N)
carting (N)
teaming
32*
0
0
0
by name (N)
(N)
drag (N)
harrow
13. SETTING HEN:
cluck (M)
cluck hen
setting hen
hatching hen
brooder
0
32*
21.
horse (N)
9*
8
/z<?r,y (M)
lead horse
3
leader
1
wheel horse
0
saddle horse
1
//e Aor^e
15. SOUND MADE BY CALF AT FEEDING
TIME:
blat (N)
1
blare
0
2
29*
16. CALL TO CALVES:
24*
3
3
1
1
14*
0
0
10
whinny
whinner
nicker (M)
whicker
/*/-/?
12*
9
9
0
4
co-jack
kwope
kope (M)
come up
0
4
8
0
coo-sheep
coo-nannie
coo-nan
kudack
kuday
fe (M)
7
0
27*
0
1
sook, calf (M)
sook, sook (M)
come calfy
come, boss
no special call
17. CALL TO COWS:
co, boss (N)
saw
madam
()
20.
10
13*
pail (N)
bucket (M)
22.
6
25*
16
20*
swill (N)
slop (M)
23.
5
1
0
0
0
19*
25*
5
0
1
0
0
28*
0
0
0
3
0
tarw
0
armful (N)
14
armload (M)
19*
26. "A"-FRAME SUPPORT USED BY
CARPENTERS:
trestle (M)
7
sawhorse
22*
horse
2
sawbuck
1
27.
-?///*/ (M)
(N)
27*
0
8
22*
7
poke (M)
sack
bag
1
5
burlap sack
burlap bag
gunny sack (M)
potato sack
grain sack
8
3
25*
0
4
20*
7
5
harmonica
mouth organ (N)
french harp (SM)
breath harp
mouth harp
harp
juice harp
jew's harp
11
19*
1
0
6
0
0
1
LAMPS:
4
5
2
23*
IN
0
31*
tied quilt
0
comforter (N)
19*
comfort (M)
11
comfortable (N)
1
/H f
0
35. A SMALL FRESH BODY OF RUNNING
WATER:
creek
28*
stream
4
prong
0
/ ()
fork
0
branch (M)
2
(N)
3
rindet
riverlet
glitter
0
0
0
corn bread
johnny cake (N)
cornpone (M)
/X?77
29*
2
1
0
8
19*
3
0
0
0
17*
1
0
1
5
1
6
4
1
2
1
0
12
1
0
0
25*
bite (N)
snack (M,S)
piece (M)
lunch
4
15*
5
8
42.
seed (M)
pit (N)
stone
kernel
heart
43.
16
19*
1
0
0
CENTER OF A PEACH:
stone (N)
seed (M)
pit (N)
44.
ear-sewer
CENTER OF A CHERRY:
14
16*
5
hull (M)
shuck (N)
25*
2
AWJ:
Shell
45.
- BEANS:
shell (N)
hull (M)
shuck
3
19*
9
0
6
46.
KIND OF WORM:
earthworm (N)
angleworm (N)
zY worm
mudworm
redworm (M)
fishworm (M)
fishing worm (M)
eelworm
rainworm
eaceworm
47. LARGE WINGED INSECT SEEN
AROUND WATER:
darning needle (N)
devil's darning needle (N)
sewing needle
mosquito hawk
snake feeder (M)
dragonfly
7
5
1
1
1
26*
1
0
1
0
2
4
7
0
0
11*
11*
firefly (N)
lightning bug
firebug (M)
candlefly
49.
5
25*
4
0
maple
/re^ (M)
wwzpfe (N)
maple (N)
5
0
22*
2
maple grove
(N)
13*
2
(N)
orchard
maple grove
camp (M)
(M)
51. HE IS SICK
0
10
4
2
16*
16*
1
0
quoits (N)
quates
horseshoes
serenade
chivaree (N)
belling (M)
dish-panning
skimmelton (N)
callathump
creeps (N)
crawls (M)
1
0
30*
1
31*
1
0
0
0
23*
10
considered as a critical part of future dialect studies. Although the sampling techniques used in the study were relatively unsophisticated and
definitely need to be improved, the reliability of the data was apparent
in all of the significance tests. The degree of density in relation to questionnaire size will have to be revised in the light of the above findings.
(2) Although the phi coefficient as well as the tetrachoric correlation
describes the relatedness of linguistic phenomena under analysis reliably,
they function more crucially as input for either the multiple factor analysis
or cluster analysis, for the factors they describe must meet some significance criterion if they are to be considered valid. (3) Although the Guttman quasi-simplex covariance structure can show apparent 'complexity'
among idiolects, it is unreliable as an indicator of the statistically significant factors which underlie the intercorrelations. (4) The count analysis
proved to be editorially informative as well as an instrument to provide
an accurate frequency-count and input for the t-test. (5) The results
of the proposed statistical methodology overwhelmingly did not support
previous assumptions about lexical usage in Johnson County, Iowa, and
demonstrated the need for an analytic methodology which can test for
significant differences. It should be pointed out at this point that the
proposed methodology can also be used to analyze phonological,
morphological, and syntactical dialect materials. (6) The computer
can make the necessary degree of density feasible and be an extremely
time-saving and powerful tool in counting and computation for the
linguistic geographer.
30 vi 1966
University of Iowa
Iowa City, Iowa 52240
U.S.A.