
This manuscript was accepted for publication in Journal of Memory and Language

on July 5, 2015.
The final text and layout can be obtained through the publisher.
Citation information:
Baart, M., & Samuel, A. G. (in press). Turning a blind eye to the lexicon: ERPs show no
cross-talk between lip-read and lexical context during speech sound processing. Journal
of Memory and Language.

Turning a blind eye to the lexicon: ERPs show no cross-talk between lip-read and
lexical context during speech sound processing.

Martijn Baart a and Arthur G. Samuel a,b,c

a BCBL. Basque Center on Cognition, Brain and Language, Donostia, Spain.

b IKERBASQUE, Basque Foundation for Science.

c Stony Brook University, Dept. of Psychology, Stony Brook, NY, the United States of America.

Corresponding author:
Martijn Baart
Basque Center on Cognition, Brain and Language
Paseo Mikeletegi 69, 2nd floor
20009 Donostia (San Sebastián)
Spain
Tel: +34 943 309 300 (ext. 228)
Email: m.baart@bcbl.eu

Abstract
Electrophysiological research has shown that pseudowords elicit more negative Event-Related Potentials (i.e., ERPs) than words within 250 ms after the lexical status of a speech token is defined (e.g., after hearing the onset of "ga" in the Spanish word "lechuga", versus "da" in the pseudoword "lechuda"). Since lip-read context also affects speech sound
processing within this time frame, we investigated whether these two context effects on
speech perception operate together. We measured ERPs while listeners were presented with
auditory-only, audiovisual, or lip-read-only stimuli, in which the critical syllable that
determined lexical status was naturally-timed (Experiment 1) or delayed by ~800 ms
(Experiment 2). We replicated the electrophysiological effect of stimulus lexicality, and also
observed substantial effects of audiovisual speech integration for words and pseudowords.
Critically, we found several early time-windows (< 400 ms) in which both contexts
influenced auditory processes, but we never observed any cross-talk between the two types of
speech context. The absence of any interaction between the two types of speech context
supports the view that lip-read and lexical context mainly function separately, and may have
different neural bases and purposes.

Key-words
ERPs, N200-effect, P2, lexical processing, audiovisual speech integration

Introduction
Over the course of a half century of research on speech perception and spoken word
recognition, the central observation has been that despite the enormous variability in the
speech signal (both between and within speakers), listeners are generally able to correctly
perceive the spoken message. Researchers have found that in addition to extra-segmental
cues like prosody, context is used whenever possible to support the interpretation of the
auditory stream. Two substantial literatures exist in this domain: One body of research has
shown that perceivers rely on the visible articulatory gestures of the speaker, here referred to
as lip-reading (e.g., Sumby & Pollack, 1954). The second widely studied context effect is
based on the mental lexicon: the knowledge that informs listeners about existing words
within the language (e.g., Ganong, 1980). In the current study, we use electrophysiological
techniques to examine the relationship between these two types of contextual support during
spoken word recognition.
The studies of lip-read context and of lexical context have generally developed
independently of each other (for exceptions, see e.g., Barutchu, Crewther, Kiely, Murphy, &
Crewther, 2008; Brancazio, 2004), but there are several ways in which the two context types
seem to be quite similar. One commonality is that both context types appear to be most
effective when the auditory speech signal is degraded or ambiguous. For example, in Sumby
and Pollack's (1954) classic study of audiovisual (henceforth AV) speech, participants tried to recognize words in sentences under various levels of noise masking, either with auditory-only or with AV presentation. When the listening conditions were good, the lip-read
information had little effect, but when the signal to noise ratio was substantially reduced,
there was a very large advantage for the AV over the audio-only condition. Lexical context
also seems to be particularly potent when the auditory information is unclear. For instance, an
ambiguous sound in between "g" and "k" will be heard as "g" when followed by "ift" and as "k" when followed by "iss" because "gift" and "kiss" are lexically valid items and "kift" and "giss" are not (Ganong, 1980); under very clear listening conditions, listeners are certainly able to identify nonlexical tokens like "kift" or "giss". Ambiguous speech sounds (e.g., a sound in between /b/ and /d/) can be disambiguated through lip-reading, with the sound heard as /b/ when listeners see a speaker pronouncing /b/, and heard as /d/ when combined with a visual /d/ (e.g., Bertelson, Vroomen, & De Gelder, 2003). There are many studies showing such disambiguation of the auditory input by both lexical context and by lip-read context.
In addition, in the last decade there have been parallel developments of studies
showing that both lexical context and lip-read context can be used by listeners to recalibrate
their phonetic category boundaries. The seminal report for lexical context was done by
Norris, McQueen, and Cutler (2003), and the original work showing such effects for lip-read
speech was by Bertelson, Vroomen, and de Gelder (2003). The recalibration effect occurs as
a result of exposure to ambiguous speech that is disambiguated by lip-read or lexical context.
It manifests as a subsequent perceptual shift in the auditory phoneme boundary such that the
initially ambiguous sound is perceived in accordance with the phonetic identity provided by
the lip-read context (e.g., Baart, de Boer-Schellekens, & Vroomen, 2012; Baart & Vroomen,
2010; Bertelson et al., 2003; van Linden & Vroomen, 2007; Vroomen & Baart, 2009a,
2009b; Vroomen, van Linden, de Gelder, & Bertelson, 2007; Vroomen, van Linden, Keetels,
de Gelder, & Bertelson, 2004) or the lexical context (Eisner & McQueen, 2006; Kraljic,
Brennan, & Samuel, 2008; Kraljic & Samuel, 2005, 2006, 2007; Norris et al., 2003; van
Linden & Vroomen, 2007). In both cases, the assumption is that the context acts as a teaching
signal that drives a change in the perceived phoneme boundary that can be observed when
later ambiguous auditory speech tokens are presented in isolation.

In addition to these parallel behavioral patterns for lexical and lip-read context,
electrophysiological studies have revealed that both types of context modulate auditory
processing within 250 ms after onset of the critical speech segment. For lip-reading,
electrophysiological evidence for such early context induced modulations of auditory
processing is found in studies that relied on the mismatch negativity (i.e., MMN, e.g.,
Näätänen, Gaillard, & Mäntysalo, 1978), which is a negative component in the event-related potentials (ERPs) usually occurring 150-200 ms post-stimulus in response to a deviant
sound in a sequence of standard sounds (the standards are all the same). As demonstrated by
McGurk and MacDonald (1976), lip-read context can change perceived sound identity, and
when it does, it triggers an auditory MMN response when the illusory AV stimulus is
embedded in a string of congruent AV stimuli (e.g., Colin, Radeau, Soquet, & Deltenre,
2004; Colin et al., 2002; Saint-Amour, De Sanctis, Molholm, Ritter, & Foxe, 2007). When
sound onset is sudden and does not follow repeated presentations of standard sounds, it
triggers an N1/P2 complex (a negative peak at 100 ms followed by a positive peak at ~200
ms) and it is well-documented that amplitude and latency of both peaks are modulated by lip-read speech (e.g., Alsius, Möttönen, Sams, Soto-Faraco, & Tiippana, 2014; Baart, Stekelenburg, & Vroomen, 2014; Besle, Fort, Delpuech, & Giard, 2004; Frtusova, Winneke, & Phillips, 2013; Klucharev, Möttönen, & Sams, 2003; Stekelenburg, Maes, van Gool,
Sitskoorn, & Vroomen, 2013; Stekelenburg & Vroomen, 2007, 2012; van Wassenhove,
Grant, & Poeppel, 2005; Winneke & Phillips, 2011). Thus, studies measuring both the MMN
and the N1/P2 peaks indicate that lip-reading affects sound processing within 200 to 250 ms
after sound onset.
The electrophysiological literature on lexical context paints a similar picture, as both
the MMN and the P2 are modulated by lexical properties of the speech signal. For instance,
when a spoken syllable completes a word it elicits a larger MMN response than when the

same syllable completes a non-word (Pulvermüller et al., 2001). Similarly, single-syllable deviant words (e.g., "day") in a string of non-word standards ("de") elicit larger MMNs than non-word deviants in a sequence of word standards (Pettigrew et al., 2004). Recently, we
demonstrated that the auditory P2 is also sensitive to lexical processes (Baart & Samuel,
2015). We presented listeners with spoken, naturally timed, three-syllable tokens in which the lexical status of each token was determined at third syllable onset (e.g., the Spanish word "lechuga" [lettuce] versus the pseudoword "lechuda"). The lexical context effect occurred by about 200 ms after onset of the third syllable, with pseudowords eliciting a larger negativity
than words. In previous studies, a comparable ERP pattern was observed for sentential
context rather than within-item lexical context, and was referred to as an N200 effect
(Connolly, Phillips, Stewart, & Brake, 1992; van den Brink, Brown, & Hagoort, 2001; van
den Brink & Hagoort, 2004). In our study (Baart & Samuel, 2015), we sought to determine
whether the N200 effect is robust against violations of temporal coherence within the
stimulus, and we therefore added ~440 or ~800 ms of silence before onset of the third
syllable. The lexicality effect survived the delay and pseudowords again elicited more
negative ERPs than words at around 200 ms. As expected, the ERPs following the onset of
the third syllable after a period of silence had a different morphology than the ERPs obtained
for the naturally timed items. That is, the delayed syllables elicited an auditory N1/P2
complex, in which the lexicality effect was now observed at the P2. This result demonstrated
that the N200-effect of lexicality was robust against violations of temporal coherence.
As with lip-read context, effects of lexical context can be found even before the N200
effect. Recently, it was demonstrated that brain activity in response to words may start to
differentiate from the response to pseudowords as early as 50 ms after the information needed
to identify a word becomes available (MacGregor, Pulvermüller, van Casteren, & Shtyrov,
2012). There are also linguistic context effects later than the N200, including a negative-going deflection in the waveform at around 400 ms (i.e., the N400 effect) in response to
meaningful stimuli such as spoken words, written words and sign language (see Kutas &
Federmeier, 2011 for a review). In the auditory domain, the N400 is larger (more negative)
when words are preceded by unrelated words than when words are preceded by related words
(i.e., a semantic priming effect, see Holcomb & Neville, 1990). The N400 has been argued to
reflect cognitive/linguistic processing in response to auditory input (Connolly et al., 1992),
and is proposed to be functionally distinct from the earlier N200 (Connolly et al., 1992;
Connolly, Stewart, & Phillips, 1990; van den Brink et al., 2001; van den Brink & Hagoort,
2004).
Looking at the literature on lexical context effects, and the literature on the effect of
lip-read context, we see multiple clear parallels: In both cases, the context effect strongly
influences how listeners interpret ambiguous or unclear speech. In both cases, the context
guides recalibration of phonemic category boundaries, bringing them into alignment with the
contextually-determined interpretation. In both cases, the context effects can be detected in
differing patterns of electrophysiological activity within approximately 250 ms after the
context is available. Based on these strong commonalities, it seems plausible that in the
processing stream that starts with activity on the basilar membrane, and results in a
recognized word, contextual influences operate together very rapidly.
Despite the appealing parsimony of this summary, there are some findings in the
literature that suggest that lexical and lip-read context might not operate together, and may
actually have rather different properties. We will briefly review two types of evidence that
challenge the idea that lexical and lip-read context operate in the same way. The domains of
possible divergence include differing outcomes in selective adaptation studies, and possibly
different semantic priming effects.

Both lip-read and lexical context have been employed in the selective adaptation
paradigm (Eimas & Corbit, 1973; Samuel, 1986), with divergent results. In an adaptation
study, subjects identify tokens along some speech continuum (e.g., a continuum from "ba" to "da"), before and after an adaptation phase. In the adaptation phase, an adaptor is played repeatedly. For example, in one condition an adaptor might be "ba", and in a second condition it might be "da". The adaptation effect is a reduction in report of sounds that are similar to the adaptor (e.g., in this case, listeners would report fewer items as "ba" after adaptation with "ba", and fewer items as "da" after adaptation with "da"). Multiple studies
have tested whether a percept generated via lip-read context will produce an adaptation effect
that matches the effect for an adaptor that is specified in a purely auditory way, and all such
studies have shown no such effect. For example, Roberts and Summerfield (1981) had
subjects identify items along a "b"-"d" continuum before and after adaptation, with adaptors that were either purely auditory or that were audiovisual. Critically, in the AV case, a pairing of an auditory "b" with a visual "g" consistently led the subjects to identify the AV combination as "d" (see McGurk & MacDonald, 1976). Despite this lip-read generated percept, the pattern of adaptation effects for the AV adaptor perfectly matched the pattern for the simple auditory "b". This failure by lip-read context to produce an adaptation shift has been replicated repeatedly (Saldaña & Rosenblum, 1994; Samuel & Lieblich, 2014; Shigeno, 2002).
This inefficacy contrasts with adaptors in which the critical sound is generated by
lexical context. Samuel (1997) constructed adaptors in which a "b" (e.g., in "alphabet") or a "d" (e.g., in "armadillo") was replaced by white noise. The lexical context in such words caused listeners to perceive the missing "b" or "d" (via phonemic restoration: Samuel, 1981; Warren, 1970), and these lexically-driven sounds successfully generated adaptation shifts. Similarly, when a segment was phonetically ambiguous (e.g., midway between "s" and "sh"), lexical context (e.g., "arthriti_" versus "demoli_") caused listeners to hear the appropriate sound (see Ganong, 1980; Pitt & Samuel, 1993), again successfully producing adaptation
shifts. Thus, across multiple tests, the percepts generated as a function of lip-read information
cannot support adaptation, despite being phenomenologically compelling; lexically-driven
percepts are both compelling and able to generate adaptation.
A recent study examining semantic priming effects also suggests that there may be an
important dissociation between the phenomenological experience produced by lip-read
context and actual activation of the apparently perceived word. Ostrand, Blumstein, and
Morgan (2011) presented AV primes followed by auditory test items. For some primes an auditory nonword (e.g., "bamp") was combined with a video of a speaker producing a real word (e.g., "damp"), leading subjects to report that they heard the real word ("damp") due to the lip-read context. Other primes were made with an auditory word (e.g., "beef") combined with a visual nonword (e.g., "deef"), leading to a nonword percept (e.g., "deef") due to the lip-read context. Ostrand et al. (2011) found that if the auditory component of the AV prime was a word, then semantic priming was found even if the perception of the AV prime was a nonword (e.g., even if people said they heard the prime as "deef", "beef" would prime "pork"). No priming occurred when a prime percept (e.g., "deef") was based on an AV combination that was purely a nonword (e.g., both audio and video "deef"). These semantic
priming results again indicate that lip-read context may not be activating representations of
words in the same way that lexical context does. Samuel and Lieblich (2014) have suggested
that lip-read context may have strong and direct effects on the immediate percept, but may
not directly affect linguistic encoding.
The current study builds on electrophysiological evidence for effects of lexical and
lip-read context on auditory processing within approximately 250 ms after the relevant
stimulus. Two experiments explore the neural correlates of the two types of context, with the

specific goal of assessing whether they show interactive patterns of neural activation, which
would indicate a mutual influence on auditory processing. This approach has been used for
decades in the behavioral literature, and more recently, in the ERP literature. It is grounded in
the additive factors logic developed by Sternberg (Sternberg, 1969; see Sternberg, 2013, for a
recent and thoughtful discussion of the method). The fundamental idea is that if two factors
are working together during at least some part of processing, they have the opportunity to
produce over-additive or under-additive effects, whereas if they operate independently then
each will produce its own effect, and the two effects will simply sum.
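To make the additive-factors logic concrete, the cell means of a 2 × 2 design (two contexts, two levels each) can be written with an explicit interaction term; this formalization is our own illustration rather than notation taken from Sternberg:

```latex
% Cell mean for level i of factor 1 and level j of factor 2 (i, j = 1, 2)
\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}
% Independence (pure additivity) corresponds to \gamma_{ij} = 0,
% i.e., a zero interaction contrast:
(\mu_{11} - \mu_{21}) - (\mu_{12} - \mu_{22}) = 0
```

Under this formulation, any reliable cross-talk between the two factors would surface as a non-zero interaction term, i.e., an over-additive or under-additive pattern in the observed means.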
In the behavioral literature this approach has typically been applied to reaction time
measurements. With ERPs, the approach gets instantiated by looking for additive versus non-additive effects on the evoked responses. ERPs have proven to be sensitive enough to reveal
interactions between various properties of a stimulus. For instance, lip-read induced
suppression of the N1 is modulated by AV spatial congruence (Stekelenburg & Vroomen,
2012). Interactions between phonological and semantic stimulus properties (Perrin & García-Larrea, 2003) and between sentence context and concreteness of a target word (Holcomb,
Kounios, Anderson, & West, 1999) have been found for the N400 (see also Kutas &
Federmeier, 2011). Here, we measure ERPs while listeners are presented with the same
auditory words and pseudowords used in our recent study (Baart & Samuel, 2015), but now
the third syllable (that determines whether the item is a word or pseudoword) is presented in
auditory form (as in the previous study), in the visual modality (requiring lip-reading), or
audiovisually. We expect, based on prior work, to see an early ERP effect of lexicality, and
an early ERP effect of lip-read context. Our central question is whether these two effects are
independent and simply sum, or if there is evidence for some coordination of the two context
effects that yields non-additive effects.

We provide two quite different stimulus conditions to look for any interaction of the
two context types. In Experiment 1, we use naturally timed items in which we should observe
a lexicality effect in the form of an auditory N200 effect, as we found previously with these
words and pseudowords. The goal is to determine whether lip-read context modulates the
effects of lexicality. A complication with using naturally timed speech is that we
will be looking for an effect of lip-read context that occurs during ongoing speech (i.e., at the
beginning of the third syllable in our stimuli). There is nothing fundamentally wrong with
this, but almost all of the existing studies that show early ERP consequences of lip-reading
have done so for utterances that follow silence. Such conditions provide a clean N1/P2
complex to look at, a complex that is not found during ongoing speech.
To provide a test with these properties, in Experiment 2, we use items in which the
third syllable is delayed, providing conditions that should elicit an N1/P2 response. Effects of
lexical context and lip-read speech were both expected to occur at the P2 peak. The existing
literature predicts that these effects will push the P2 in opposite directions (i.e., lip-read
information suppresses the P2, lexical information yields more positive ERPs), but the critical
issue is whether the two effects are additive or interactive. Assuming that lip-read induced
modulations at the N1 reflect integration of relatively low-level features, the P2 is the most
promising component to reveal interactions between the two contexts. By testing both
normally-timed speech, and speech delayed to offer a clean P2, we maximize the opportunity
to observe any interaction of the two types of speech context.

Experiment 1: Naturally timed items


In Experiment 1, participants were presented with three-syllable words or
pseudowords that were identical through the first two syllables, but diverged at the onset of
the third syllable. In all cases the first two syllables were presented in auditory form only.
The third syllable was then presented either in auditory-only form (A), in visual-only form (V), or audiovisually (AV). We measured the ERP patterns for the six resulting conditions (word/pseudoword × A/V/AV), focusing on whether there was any interaction of
the lexical and lip-read context effects following third syllable onset.

Methods
Participants. 20 adults with normal hearing and normal or corrected-to-normal vision
participated in the experiment in return for a payment of €10 per hour. Participants were only
eligible for participation if Spanish was their dominant language. Language proficiency in
other languages (e.g., English, German, French, Basque) was variable. All participants gave
their written informed consent prior to testing, and the experiment was conducted in
accordance with the Declaration of Helsinki. Two participants were excluded from the
analyses, one because of poor data quality (see EEG recording and analyses below) and one
because one of the experimental blocks was accidentally repeated. The mean age in the final
sample of 18 participants (7 females) was 22 years (SD = 2.1).
Stimuli. The auditory stimuli were the same as those used in our recent study (Baart &
Samuel, 2015). We selected 6 three-syllable nouns from the EsPal subtitle database (Duchon, Perea, Sebastián-Gallés, Martí, & Carreiras, 2013) that were matched on four criteria (frequency, stress, absence of embedded lexical items, and absence of higher frequency phonological neighbors). The resulting nouns (i.e., "brigada" [brigade], "lechuga" [lettuce], "granuja" [rascal], "laguna" [lagoon], "pellejo" [hide/skin] and "boleto" [(lottery) ticket])
were produced by a male native speaker of Spanish. The speaker was recorded with a digital
video camera (at a rate of 25 frames/second) and its internal microphone (Canon Legria HF
G10), framing the video as a headshot. The speaker was asked to produce several tokens of
each item, and also to produce versions in which the onset consonant of the third syllable was
replaced by "ch" or "sh". These tokens were used for splicing purposes as "ch" and "sh" have relatively weak co-articulatory effects on the preceding vowel, and they also would not provide accurate predictive information after the splicing process. Word stimuli were created by splicing the final syllable of the original recording onto the first two syllables of the "ch" or "sh" item (e.g., "na" from "laguna" was spliced onto "lagu" from "lagucha"). Pseudowords were created by rotating the final syllables, which led to "brigaja", "lechuda", "granuna", "laguga", "pelleto" and "bolejo", and splicing the final syllables of the original word recordings onto the first two syllables of the "ch" item (e.g., "na" from "laguna" was spliced onto "granu" from "granucha"). All of these naturally-timed stimuli sounded natural,
with no audible clicks or irregularities. The original video recordings that corresponded to the
six final syllables were converted to bitmap sequences and matched on total duration (11
bitmaps; 440 ms, including 3 bitmaps of anticipatory lip-read motion before sound onset).
Procedures. Participants were seated in a sound-attenuated, dimly lit, and electrically
shielded booth at ~80 cm from a 19-inch CRT monitor (100 Hz refresh). Auditory stimuli
were presented at 65 dB(A) (measured at ear-level) and played through a regular computer
speaker placed directly above the monitor. In total, 648 experimental trials were delivered:
216 A trials, 216 V trials, and 216 AV trials. For each modality, half of the trials were words
and half were pseudowords. An additional 108 catch-trials (14% of the total number of trials)
were included to keep participants fixated on the monitor and minimize head-movement.
During a catch-trial, a small white dot briefly appeared at auditory onset of the third syllable
(120 ms, ⌀ = 4 mm) between the nose and upper lip of the speaker on AV and V trials, and in
the same location on the screen on A trials. Participants were instructed to press the space-bar
of a regular keyboard whenever they detected the dot.
We used Presentation software for stimulus delivery. As can be seen in Figure 1, each
trial started with a 400 ms fixation cross followed by 600, 800 or 1000 ms of silence before
the auditory stimulus was delivered. Auditory onset of the first syllable was slightly variable,
as we added silence before the stimulus (in the .wav files) to ensure that third syllable onset
was always at 640 ms, which coincided with bitmap number 17 from a bitmap sequence that
was initiated at sound onset (see Figure 1). Bitmaps were 164 mm (W) × 179 mm (H) in size,
and on AV and V trials, auditory onset of the critical third-syllable was preceded by 3 fade-in
bitmaps (still frames) and 3 bitmaps (120 ms) of anticipatory lip-read information (which
were bitmaps 11 to 16 in the bitmap sequence, see Figure 1). In all conditions, the ITI
between sound offset and fixation onset was 1800 ms. The trials were pseudo-randomly
distributed across six experimental blocks. Before the experiment started, participants
completed a 12-trial practice session to familiarize them with the procedures.

[Insert Figure 1 about here]

EEG recording and analyses. The EEG was recorded at a 500 Hz sampling rate
through a 32-channel BrainAmp system (Brain Products GmbH). 28 Ag/AgCl electrodes
were placed in an EasyCap recording cap and EEG was recorded from sites Fp1, Fp2, F7, F3,
Fz, F4, F8, FC5, FC1, FC2, FC6, T7, C3, Cz, C4, T8, CP5, CP1, CP2, CP6, P7, P3, Pz, P4,
P8, O1 and O2 (FCz served as ground). Four electrodes (2 on the orbital ridge above and
below the right eye and 2 on the lateral junctions of both eyes) recorded the vertical- and
horizontal Electro-oculogram (EOG). Two additional electrodes were placed on the mastoids,
of which the left was used to reference the signal on-line. Impedance was kept below 5 kΩ for mastoid and scalp electrodes, and below 10 kΩ for EOG electrodes. The EEG signal was analyzed using Brain Vision Analyzer 2.0. The signal was referenced off-line to an average of the two mastoid electrodes and band-pass filtered (Butterworth Zero Phase Filter, 0.1-30 Hz, 24 dB/octave). Additional interference was removed by a 50 Hz notch filter. ERPs were
time-locked to auditory onset of the third syllable and the raw data were segmented into 1100
ms epochs (from 200 ms before to 900 ms after third syllable onset). After EOG correction
(i.e., ERPs were subtracted from the raw signal, the proportion of ocular artifacts was calculated in each channel and also subtracted from the EEG, and ERPs were then added back to the signal, see Gratton, Coles, & Donchin, 1983, for details), segments that corresponded to catch-trials and segments with an amplitude change > 120 µV at any channel were rejected. One participant was excluded from analyses because 40% of the trials contained artifacts, versus 4% or less for the other participants. ERPs were averaged per modality (A, V and AV) for words and pseudowords separately, and baseline-corrected (using the 200 ms before onset of the third syllable).
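For readers who want to reproduce this kind of pipeline, the sketch below shows how the reported preprocessing steps (mastoid re-referencing, 0.1-30 Hz band-pass plus 50 Hz notch, epoching around third-syllable onset, baseline correction, ocular correction, and the 120 µV rejection threshold) could be approximated in MNE-Python. This is an illustration only, not the authors' Brain Vision Analyzer pipeline: the file name, event codes, and mastoid channel labels are hypothetical, and the Gratton et al. (1983) EOG correction is approximated by MNE's regression-based correction.

```python
# Illustrative reconstruction of the reported preprocessing (not the authors' pipeline).
# File name, event codes, and mastoid channel labels ("M1", "M2") are hypothetical.
import mne
from mne.preprocessing import EOGRegression

raw = mne.io.read_raw_brainvision("subject01.vhdr", preload=True)  # BrainAmp recording

# Off-line re-reference to the average of the two mastoid electrodes
raw.set_eeg_reference(ref_channels=["M1", "M2"])

# Band-pass 0.1-30 Hz (IIR Butterworth, as reported) plus a 50 Hz notch filter
raw.filter(l_freq=0.1, h_freq=30.0, method="iir",
           iir_params=dict(order=4, ftype="butter"))
raw.notch_filter(freqs=50.0)

# Epoch from -200 to +900 ms around third-syllable onset; the six condition codes
# (word/pseudoword x A/V/AV) are placeholders
events, _ = mne.events_from_annotations(raw)
event_id = {"word/A": 11, "word/V": 12, "word/AV": 13,
            "pseudo/A": 21, "pseudo/V": 22, "pseudo/AV": 23}
epochs = mne.Epochs(raw, events, event_id, tmin=-0.2, tmax=0.9,
                    baseline=(-0.2, 0.0), preload=True)   # 200 ms pre-syllable baseline

# Regression-based ocular correction (requires EOG channels), approximating
# the Gratton et al. (1983) procedure used in Brain Vision Analyzer
epochs = EOGRegression().fit(epochs).apply(epochs)

# Reject remaining segments with a peak-to-peak change > 120 microvolts at any channel
epochs.drop_bad(reject=dict(eeg=120e-6))

# Per-condition averages, e.g. the auditory-only word ERP
evoked_word_A = epochs["word/A"].average()
```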
Next, we subtracted the visual ERPs from the audiovisual ERPs. This was done in
order to compare the AV − V difference with the A condition, which captures AV integration
effects (see e.g., Alsius et al., 2014; Baart et al., 2014; Besle, Fort, Delpuech, et al., 2004; A.
Fort, Delpuech, Pernier, & Giard, 2002; Giard & Peronnet, 1999; Klucharev et al., 2003;
Stekelenburg & Vroomen, 2007; van Wassenhove et al., 2005; Vroomen & Stekelenburg,
2010). These ERPs are provided in Figure 2, and ERPs for all experimental conditions (A, V
and AV) are provided in Appendix 1 (Figure 4).
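As an illustration of this subtraction logic (our sketch, not the authors' code), the difference wave can be computed directly from the per-condition averages; the Evoked objects are assumed to come from a pipeline like the one sketched above.

```python
import mne

def av_minus_v(evoked_AV, evoked_V):
    """Visual-subtracted audiovisual ERP (AV - V), to be compared against the A-only ERP."""
    return mne.combine_evoked([evoked_AV, evoked_V], weights=[1, -1])
```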
The hypothesis that lip-read and lexical context effects are independent predicts that
their impact on ERPs will be additive, i.e., that there will be no interaction of these two
factors. The absence of an interaction is of course a null effect, and it is not statistically
possible to prove a null effect. What can be done is to show that each factor itself produces a
robust effect, and to then look at every possible way that an interaction might manifest itself
if it exists. Towards this end, in addition to conducting the test both with normal timing
(Experiment 1) and with a delay that offers a clear P2 (Experiment 2), in each experiment we
first analyze a window large enough to contain any interaction that might occur (the first 400
ms), and then provide a focused analysis within the window that prior work indicates is the
most likely one for any interaction (150-250 ms). All ANOVAs are Greenhouse-Geisser
corrected.
[Insert Figure 2 about here]

Results
On average, participants detected 99% of the catch-trials (S.D. = 2%, ranging from
94% [N = 1] to 100% [N = 9]), indicating that they had attended the screen as instructed.

The 0-400 ms window


We averaged the data in eight 50 ms bins spanning a time-window from 0 to 400 ms.
This window was chosen to ensure that we did not miss any of the context effects of interest,
since these clearly occur before 400 ms (e.g., Besle, Fort, Delpuech, et al., 2004; Klucharev
et al., 2003; MacGregor et al., 2012; Pettigrew et al., 2004; Stekelenburg & Vroomen, 2007;
van Linden, Stekelenburg, Tuomainen, & Vroomen, 2007; van Wassenhove et al., 2005). In a
first analysis, we submitted the data to an 8 (Time-window; eight 50 ms bins from 0 to 400
ms) × 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV − V)
ANOVA. The results of this omnibus ANOVA are summarized in Table 1.
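As a rough illustration of this analysis step (not the authors' code), the sketch below bins single-trial amplitudes into eight 50 ms windows and runs a reduced 2 × 2 repeated-measures ANOVA on one bin at one electrode; the data layout and variable names are assumptions. Because Lexicality and Modality have only two levels each, sphericity (and hence the Greenhouse-Geisser correction) is only at stake for the multi-level Time-window and Electrode factors of the full omnibus design, which would typically be handled in dedicated software.

```python
# Illustrative sketch of the 50-ms binning and a reduced 2 x 2 repeated-measures
# ANOVA (Lexicality x Modality); not the authors' analysis code.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def bin_means(data, times):
    """Average single-trial data (trials x channels x samples) in eight 50 ms bins, 0-400 ms."""
    bins = []
    for start in np.arange(0.0, 0.4, 0.05):
        mask = (times >= start) & (times < start + 0.05)
        bins.append(data[:, :, mask].mean(axis=-1))
    return np.stack(bins, axis=-1)            # trials x channels x 8 bins

def lexicality_by_modality_anova(df):
    """df columns: subject, lexicality, modality, amplitude (one bin, one electrode)."""
    model = AnovaRM(df, depvar="amplitude", subject="subject",
                    within=["lexicality", "modality"], aggregate_func="mean")
    return model.fit().anova_table
```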

[Insert Table 1 about here]

As indicated in Table 1, both contexts produced significant effects. Lexicality yielded a main effect in the entire 0-400 ms epoch, and also showed an interaction with Time-window. Modality also interacted with Time-window, as well as with Electrode. The Time-window × Electrode × Modality interaction was also significant. Thus, the results meet the first critical criterion of demonstrating effects of both Lexicality and Modality.
The second critical criterion is the finding of purely additive effects for these two
factors, and that is exactly what Table 1 shows: None of the interactions involving Lexicality
and Modality even approached significance. As noted, the first, broad analysis was intended
to cast a wide net in looking for any evidence of an interaction of the two types of context.
Since both interacted with Time-window, we analyzed the data separately for each 50 ms bin,
using eight 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV − V) ANOVAs. The results of those ANOVAs are summarized in Figure 3a, where the p-values for the main and interaction effects involving Lexicality and Modality are plotted. The top panel shows the main effects of Lexicality and Modality, and each one's interaction with
Electrode; the bottom panel shows the interaction of the two context types with each other,
and their three-way interaction with Electrode. As the bottom panel illustrates, at no point
within the 400 ms window were there any interactions between Modality and Lexicality, Fs <
1. There were never any interactions between Electrode, Modality and Lexicality, Fs(26,442)
< 1.35, ps > .26, ηp²s < .08, presumably because Modality had no influence on the effect that
Lexicality had on auditory processing, and vice versa.
In contrast, in the top panel, there are robust effects of the two factors, in overlapping
time windows. The time-windows where both contexts yielded significant results (as a main
effect and/or interaction effect with Electrode) are shaded in grey. In all time-windows,
ANOVAs yielded significant main effects of Electrode, Fs(26,442) > 4.00, ps < .01, ηp²s >
.18.

[Insert Figure 3 about here]

The main effects of Lexicality in the time-windows 100-150 ms, 250-300 ms, 300-350 ms and 350-400 ms, Fs(1,17) > 4.59, ps < .05, ηp²s > .20, indicated that in all four time-windows, words yielded more positive ERPs than pseudowords (differences ranged between .39 µV and .91 µV). The interactions between Lexicality and Electrode (in the 300-350 ms
window and the 350-400 ms window) were followed-up by paired-samples t-tests that tested
activity for words against pseudowords at each electrode. Family-wise error was controlled
by applying a step-wise Holm-Bonferroni correction (Holm, 1979) to the t-tests. The
Appendix provides these tests and topography maps across the 400 ms window (Figure 6).
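For concreteness, a minimal sketch of the Holm-Bonferroni step-down procedure is given below (our illustration; the electrode-wise p-values are assumed to come from paired t-tests like those just described).

```python
# Holm-Bonferroni step-down correction over a family of electrode-wise tests.
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Return a boolean array marking which hypotheses are rejected."""
    p = np.asarray(p_values)
    order = np.argsort(p)                   # sort p-values ascending
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (m - rank):    # step-down threshold: alpha / (m - rank)
            reject[idx] = True
        else:
            break                           # stop at the first non-significant test
    return reject
```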
The ANOVAs showed a main effect of Modality in the 150-200 ms time-window,
F(1,17) = 5.80, p = .03, ηp² = .25, because amplitude across the scalp was 1.01 µV more negative for A than for AV − V. Interactions between Electrode and Modality were observed in all time-windows, Fs(26,442) > 6.64, ps < .01, ηp²s > .27, because auditory activity was more negative than AV − V at bilateral (centro)frontal locations, with largest effects in a 100-250 ms time-frame. Interactions were again followed-up as before (see the topography maps
in the Appendix for details).

The 150-250 ms window


Looking across the entire 400 ms window, we saw multiple significant effects of both
lexical context and lip-read context, but no hint anywhere of their interacting. We now focus
on the time range that a priori has the greatest chance of including an interaction, the period
around 200 ms that has been implicated in previous studies as showing both lexical and lip-reading effects.
A 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV − V) ANOVA on the averaged data in a 150-250 ms window showed a main effect of Electrode, F(26,442) = 14.98, p < .01, ηp² = .47, as activity was most positive at central
electrodes (i.e., the maximal activity was 2.00 µV at Cz). There was no main effect of Lexicality in the 150-250 ms window (as Figure 3a shows, the significant Lexicality effect in Experiment 1 surrounded this window), and there was also no interaction between Electrode and Lexicality, F < 1. There was no main effect of Modality, F(1,17) = 3.66, p = .07, ηp² = .18, but there was an Electrode × Modality interaction, F(26,442) = 14.01, p < .01, ηp² = .45, that was already explored in the 0-400 ms window analyses. Critically, Lexicality did not interact with Modality, F < 1, and the interaction between Electrode, Lexicality and Modality was also not significant, F < 1. To quantify this finding, we constructed the AV − V − A difference waves for words, and compared them with the difference waves for pseudowords. The rationale was that residual activity after the AV − V − A subtraction fully captures the effect of AV integration. Critically, if AV integration were modulated by Lexicality, this effect would be different for words and pseudowords. However, as can be seen in Figure 3b, the effect of AV integration was similar for words and pseudowords, and none of the pair-wise comparisons (testing the AV − V − A wave for words against pseudowords) reached significance, ts(17) < 1.78, ps > .09, let alone survived a correction for multiple comparisons.
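To illustrate this comparison (again a sketch under assumed data structures, not the authors' code), the AV − V − A residual can be computed per participant and electrode for words and pseudowords and contrasted with paired t-tests:

```python
# Sketch of the residual-based test: (AV - V) - A computed separately for words and
# pseudowords, then compared per electrode with paired t-tests. The dictionary
# amp[modality][lexicality] holds mean window amplitudes as
# (n_subjects x n_electrodes) arrays; this layout is hypothetical.
import numpy as np
from scipy import stats

def residual_lexicality_tests(amp):
    res_word = amp["AV"]["word"] - amp["V"]["word"] - amp["A"]["word"]
    res_pseudo = amp["AV"]["pseudo"] - amp["V"]["pseudo"] - amp["A"]["pseudo"]
    t_vals, p_vals = stats.ttest_rel(res_word, res_pseudo, axis=0)  # one test per electrode
    return t_vals, p_vals
```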

Experiment 1: Discussion
Experiment 1 yielded four main findings: 1) lexical context modulated auditory
processing in the 100-150 ms and 250-400 ms windows, 2) lip-read context modulated auditory processing in the entire 0-400 ms epoch, 3) both contexts had a similarly directed effect; words yielded more positive ERPs relative to pseudowords, and effects of AV integration (AV − V) were manifest through more positive ERPs than auditory-only presentations, and 4) there were no interactions between the two context types; AV integration was statistically alike for words and pseudowords, and likewise, lexical processing was statistically alike for A and AV − V.

For lexical context, the pattern of results was as expected, although the N200-effect
was somewhat smaller, earlier and shorter-lived (.39 µV in a 100-150 ms window) compared to prior findings for sentential auditory context (i.e., the effect ranged between .71 and .94 µV in a 150-250 ms window, see van den Brink et al., 2001; van den Brink & Hagoort,
2004). The N200 effect we observed was statistically alike across the scalp, which is
consistent with previous findings (van den Brink et al., 2001). It has been argued that the
N200 effect is distinct from the N400 effect that follows it (e.g., Connolly et al., 1992;
Hagoort, 2008; van den Brink et al., 2001; van den Brink & Hagoort, 2004), and given that
our ERPs also show two distinct effects of Lexicality, it appears that the N200 at 100-150 ms
was indeed followed by an (early) N400 effect starting at ~250 ms. As argued before (Baart
& Samuel, 2015), the N200 effect obtained with final syllables and sentence final words may
be quite similar, given that both ultimately reflect phonological violations of lexical
predictions (see e.g., Connolly & Phillips, 1994, who argued that such early negativity may
be attributed to a phonemic deviation from the lexical form).
Effects of AV integration were largest in a 100-250 ms time-window at bilateral
frontal electrodes. The time-course of these effects is thus quite consistent with the literature
that shows AV integration at the N1 and P2 peaks (or in between both peaks, e.g., Alsius et
al., 2014). The effect became manifest as more negative A-only ERPs than for the AV − V difference waves, which is consistent with the ERPs observed by Brunellière et al. (2013) for highly salient lip-read conditions, although these authors did not find this effect to be significant. Since Brunellière et al. (2013) used ongoing AV speech sentences whereas we
only presented lip-read information in the final syllable, it seems likely that the differences
are related to the differences in experimental procedures. Possibly, insertion of a lip-read
signal during ongoing auditory stimulation renders it highly unexpected for participants. This
could have triggered a positive P3a component that has an anterior distribution and peaks at
around 300 ms (Courchesne, Hillyard, & Galambos, 1975), which is essentially what we
observed in the ERPs (see Figure 2). However, since we know of no prior research in which a
lip-read signal was inserted during ongoing auditory stimulation, this suggestion is clearly
speculative.
The central question of Experiment 1 was whether there is electrophysiological
evidence for an interaction between lip-read and lexical speech context. Both contexts
produced significant effects in four time-windows, yet we never found interactions between
Lexicality and Modality. Our more focused test was less conclusive because the 150-250 ms
time-window did not show clear effects of both contexts. Experiment 2 provides another
opportunity to look for evidence of the two context types working together. By delaying the
onset of the critical third syllable, the effect of each context type should potentially be clearer
because the silent period should allow a clear auditory P2 to emerge, and to possibly be
affected by each type of context.

Experiment 2: Delayed third syllable


In Experiment 2, we use the same basic materials and the same approach, but with
stimuli in which we delayed the onset of the third syllable. By inserting ~800 ms of silence
before the onset of the third syllable, the syllable's onset should produce the usual ERP
pattern that has been studied in most previous research on AV speech integration. More
specifically, there should be an N1/P2 complex for such cases, and prior studies of the effect
of visual context have found significant effects in this time window (e.g., Stekelenburg &
Vroomen, 2007; van Wassenhove et al., 2005). In our previous work (Baart & Samuel, 2015)
we demonstrated a robust lexical effect at about 200 ms post onset (i.e. at the P2) with this
kind of delay in the onset of the third syllable. Thus, Experiment 2 offers an opportunity to
test for any interaction of the two context types under conditions that have been more widely
examined in the literature on audiovisual context effects.

Methods
Participants. 20 new adults with Spanish as their dominant language and with normal
hearing and normal or corrected-to-normal vision participated. Two participants were
excluded because of poor data quality (see below). The final sample included 8 females, and
the mean age across participants was 21 years (S.D. = 1.9).
Stimuli. Stimulus material was the same as in Experiment 1.
Procedures. Experimental procedures were the same as in Experiment 1, except that
in all conditions, third syllable onset was delayed relative to offset of the second syllable.
This delay was 760, 800 or 840 ms (216 trials per delay, 72 per modality) and was realized by adding silence in the sound files after offset of the second syllable, and adding 19, 20 or 21
black bitmaps (40 ms each) to the sequence triggered at stimulus onset (see Figure 1).
EEG recording and analyses. EEG recording and analyses procedures were the same
as before. Two participants were excluded from the analyses because more than a third (i.e.
43% and 50%) of the trials did not survive artifact rejection, whereas the proportion of trials
with artifacts for the remaining participants was 17% (N = 1), 11% (N = 1) or < 7% (N = 16).
To facilitate a comparison across experiments, the analysis protocol was similar to that in
Experiment 1. As before, we first look across the full 400 ms window for any evidence of
interaction, and then focus on the most likely time window for these effects around 200 ms.

Results
On average, participants detected 96% of the catch-trials (S.D. = 5%, range from 85%
[N = 1] to 100% [N = 5]), indicating that they attended the screen as instructed.

The 0-400 ms window


The results of the 8 (Time-window; eight 50 ms bins from 0 to 400 ms) × 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV − V) ANOVA
are summarized in Table 1. As in Experiment 1, Lexicality yielded a main effect and also
showed an interaction with Time-window. Modality interacted with Time-window, as well as
with Electrode. The Time-window × Electrode × Modality interaction was also significant.
Critically, none of the interactions involving Lexicality and Modality reached significance.
The 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV − V)
ANOVAs conducted on each time-window are summarized in Figure 3c; topography maps
are shown in the Appendix. As in Experiment 1, the critical overarching finding is that in
none of the time windows does the interaction of Modality and Lexicality (or their interaction
together with Electrode) approach significance (see the bottom panel of Figure 3c), Fs(1,17)
< 2.35, ps > .13, ηp²s < .13. Modality did not influence the effect that Lexicality had on
auditory processing, and vice versa.
In contrast, as summarized in the top panel of Figure 3c, there were strong effects of
both Lexicality and Modality in several critical time periods. Across the scalp, words elicited
more positive ERPs than pseudowords in five consecutive time-windows from 150-400 ms
(words − pseudowords differences ranged between .55 µV and .82 µV). There was also a main effect of Modality in the 350-400 ms window as averaged activity for A was 1.17 µV more positive than for AV − V. Lexicality interacted with Electrode in the 150-200 ms window, the 250-300 ms window, and the 350-400 ms window, Fs(26,442) > 3.09, ps < .05, ηp²s > .14. However, none of the significant paired comparisons between words and
pseudowords in the 150-200 ms window survived the Holm-Bonferroni correction, whereas a
cluster of six occipito-parietal electrodes in the 250-300 ms window, and four parietal
electrodes in the 350-400 ms window did show reliable Lexicality effects in the anticipated
direction (i.e., activity for pseudowords was more negative than for words).
Modality interacted with Electrode in the 50-100 ms window, the 200-250 ms
window, the 300-350 ms window, and the 350-400 ms window, Fs(26,442) > 3.20, ps < .04,
ηp²s > .15. In those four time-windows, auditory stimuli elicited more positive ERPs than AV − V, although the interaction was only reliable at electrode T8 in the 50-100 ms, 200-250 ms
and 300-350 ms windows, whereas AV integration in the 350-400 ms window was observed
in a large cluster of left-, mid- and right-central electrodes, with most prominent interactions
at right scalp locations.

The 150-250 ms window


As in Experiment 1, the broad-window analyses showed strong Lexical and Modality
effects, in overlapping time periods, and no hint of any interactions between the two types of
context. Looking at the time period around the expected critical delay of 200 ms, we
conducted a targeted analysis as in Experiment 1.
A 27 (Electrode) × 2 (Lexicality; Words vs. Pseudowords) × 2 (Modality; A vs. AV − V) ANOVA on the averaged data in a 150-250 ms window showed a main effect of Electrode, F(26,442) = 13.10, p < .01, ηp² = .44, as activity was most positive at central electrodes (i.e., the maximal activity was 2.33 µV at Cz, consistent with the topography of the P2). There was a main effect of Lexicality, F(1,17) = 9.61, p < .01, ηp² = .36, as Lexical context increased positive activity in the ERPs (i.e., average amplitudes were 1.43 µV for words and .87 µV for pseudowords). There was also an interaction between Electrode and Lexicality, F(26,442) = 3.04, p = .02, ηp² = .15, with stronger lexical effects at some sites than others. Lip-read context decreased positivity (i.e., average amplitudes were 1.22 µV for A and 1.09 µV for AV − V). This effect varied with location, as the main effect of Modality
was not significant, F < 1, but the interaction of Electrode and Modality was, F(26,442) =
3.37, p = .02, ηp² = .17. Critically, Lexicality did not interact with Modality, F < 1. The interaction between Electrode, Lexicality and Modality was also not significant, F(26,442) = 1.91, p = .13, ηp² = .10, as underscored by comparisons of the averaged amplitudes of the AV integration effect (AV − V − A) for words against pseudowords at each electrode, which all yielded ps > .09 (see Figure 3d).

Experiment 2: Discussion
Experiment 2 yielded four main findings: 1) lexical context modulated auditory
processing in the 150-400 ms window, with relatively constant effects in terms of ERP amplitude across that time-frame, 2) lip-read context modulated auditory processing modestly in the 50-100 ms, 200-250 ms, and 300-350 ms epochs, whereas effects in the 350-400 ms window were clearly larger, 3) the effects of context were in opposite directions; words yielded more positive ERPs relative to pseudowords, and effects of AV integration (AV − V) were manifest through more negative ERPs than auditory-only presentation, and 4) there were no interactions between the two types of context; AV integration was statistically alike for words and pseudowords, and likewise, lexical processing was statistically alike for A and AV − V.
As can be seen in Figure 2, the effect of lexicality starting at around 150 ms was
characterized by a more negative P2 peak in the ERPs for pseudowords than words. In our
previous study that included a comparable delay condition for auditory stimuli, we observed a
similar effect (Baart & Samuel, 2015). Given that the lexicality effect is thus similar to the
effect observed with naturally timed stimuli, it appears that the N200 effect can survive the
~800 ms delay, producing an N200 effect superimposed on the obligatory P2. Despite the
clear central topography (as often found for the P2, see e.g., Stekelenburg & Vroomen, 2007),
effects of lexicality were statistically alike across the scalp, providing additional support for
the hypothesis that the effect is an N200-effect with a central distribution (Connolly et al.,
1990) that is nonetheless statistically alike across the scalp (van den Brink et al., 2001).
Although effects of lip-read context were largest in the 300-350 ms epoch, the same
pattern of AV integration (i.e., more negative ERP amplitudes for AV − V than for A) started
to appear at the P2 peak for a number of mid-central electrodes (e.g., C3, Cz, C4). The trend
is clearly visible in Figure 2, though as noted not significant until later. The patterns in the
ERPs are consistent with previous studies that have shown that lip-read information
suppresses the auditory P2 (e.g., van Wassenhove et al., 2005). The weaker/later effect here
may be due to the overall modest size of the N1/P2 complex, which presumably stems from
our materials being more complex than the speech typically used in previous studies. Such
studies usually employ single, stressed syllables beginning with a stop consonant (i.e., /pa/,
/py/, /po/, /pi/, Besle, Fort, Delpuech, et al., 2004), and/or that are extremely well articulated
(Stekelenburg & Vroomen, 2007). Such stimuli produce sharper sound onsets than our
critical syllables because ours were taken from natural 3-syllable utterances with second-syllable stress, with initial consonants belonging to different consonant classes (i.e., the voiced dental fricative /d/ [ð], the velar approximant /g/ [ɣ], the voiceless velar fricative /j/ [x], and the nasal /n/ [n]). Of note, however, is that others have observed N1/P2 peaks that are comparable in size (i.e., a ~4 µV peak-to-peak amplitude, Ganesh, Berthommier, Vilain,
Sato, & Schwartz, 2014), or even smaller than what we observed here (Winneke & Phillips,
2011). Moreover, even when N1/P2 amplitudes are of the usual size, lip-read information
does not always suppress the N1 (Baart et al., 2014) and/or the P2 (Alsius et al., 2014). AV
P2 peaks may even be larger than for A-only stimulation (see e.g. Figure 2 in Treille, Vilain,
& Sato, 2014), indicating that the effects partially depend on procedural details.

General discussion
The current study was designed to examine the neural consequences of lip-read and
lexical context when listening to speech input, and in particular, to determine whether these
two types of context exert a mutual influence. Our approach is based on the additive factors
method that has been used very productively in both behavioral work (e.g., Sternberg, 1969,
2013) and in prior ERP studies (e.g., Alsius et al., 2014; Baart et al., 2014; Besle, Fort,
Delpuech, et al., 2004; Besle, Fort, & Giard, 2004; Stekelenburg & Vroomen, 2007). This
approach depends on finding clear effects of each of the two factors of interest, and then
determining whether the two effects simply sum, or if instead they operate in a non-additive
fashion. Independent factors produce additive effects, while interactive ones can yield over-additive or under-additive patterns.
In two experiments that were built on quite different listening situations (naturally
timed versus delayed onset of a word's third syllable) we found pervasive effects of both
lexicality and lip-read context. In comparing items in which the third syllable determined
whether an item was a word or a pseudoword, we replicated the recent finding of an auditory
N200-effect of stimulus lexicality (Baart & Samuel, 2015). In naturally timed items, ERPs for
pseudowords were more negative than for words just before and just after 200 ms, and for the
delayed stimuli this effect reached significance by 200 ms and remained robust after that.
Effects of lip-read context were widespread for both the naturally timed and the delayed
stimuli, and critically, overlapped the lexicality effect at various time-windows in both
experiments. Thus, our two experiments produced the necessary conditions for detecting an
interaction of these two factors, if such an interaction exists. The fundamental finding of the
current study is that the interaction of the two types of context never approached significance,
in any time window. We believe we provided every opportunity for an interaction to appear
by running the test with two such different timing situations, and by analyzing the ERPs both
broadly and with a specific focus in the 200 ms region. Thus, the most appropriate conclusion
is that lexical context and lip-read context operate independently, rather than being combined
to form a broader context effect.
As we noted previously, the claim of independent effects that comes from finding an
additive pattern is inherently a kind of null effect: the absence of an interaction. The fact that
the main effects were robust, and that there were many other interactions (e.g., with
Electrode) suggests that the lack of the critical interaction is not due to insufficient power.
More substantively, as Sternberg (2013) points out in his comments on the additive factors
approach, any finding of this sort needs to be placed in a broader empirical and theoretical
context; the test provides one type of potential converging evidence. We believe that the
broader empirical and theoretical context does in fact converge with our conclusion. There
are three relevant bodies of empirical work, and some recent theoretical developments that
provide a very useful context for them.
One set of empirical findings comes from behavioral studies that examined the
potential interaction between lip-read context and the lexicon. These studies tested whether
the impact of lip-read context varies as a function of whether the auditory signal is a real
word or not. The evidence from these studies is rather mixed. Dekle, Fowler, and Funnell (1992) found McGurk-like lip-read biases on auditory words, and a study in Finnish (Sams, Manninen, Surakka, Helin, & Kättö, 1998) also showed strong and similar McGurk effects
for syllables, words and words embedded in sentences. These initial findings are consistent
with independent effects of lip-read and lexical context. However, Brancazio (2004)
suggested that the stimuli used by Sams et al. (1998) may have been too variable to determine
any effect of the lexicon on AV integration. In particular, the position of the audiovisual
phonetic incongruency differed between words and non-words, as did the vowel that followed
the incongruence. Brancazio (2004) therefore fixed the position of AV congruency at
stimulus onset and controlled subsequent vowel identity across stimuli, and observed that
McGurk effects were larger when the critical phoneme produced a word rather than a non-word. In a subsequent study by Barutchu et al. (2008), the observed likelihood of McGurk-influenced responses was the same for words and pseudowords when the incongruency came
at stimulus onset (which seems reasonable given that lexical information at sound onset is
minimal at best), but was lower for words than for pseudowords at stimulus offset. Overall,
studies using this approach may provide some suggestion of an interaction between lip-read
and lexical context, but the evidence is quite variable and subject to a complex set of stimulus
properties.
As discussed in the Introduction, two other lines of research have shown a clear
divergence between the effect of lip-read context and lexical context studies using selective
adaptation, and semantic priming. Recall that Ostrand et al. (2011) recently reported that
when an audiovisual prime is constructed with conflicting auditory (e.g., "beef") and visual (e.g., "deef") components, listeners often perceive the prime on the basis of visual capture (e.g., they report perceiving "deef"), but semantic priming is dominated by the unperceived auditory component ("beef" primes "pork", even though the subject claims not to have heard "beef"). This pattern demonstrates a dissociation between a percept that is dominated by lip-read context, and internal processing (lexical/semantic priming) that is dominated by the auditory stimulus.
Several studies using the selective adaptation paradigm have shown a comparable
dissociation (Roberts & Summerfield, 1981; Saldaña & Rosenblum, 1994; Samuel &
Lieblich, 2014; Shigeno, 2002). In these studies, the critical adaptors were audiovisual stimuli
in which the reported percept depended on the lip-read visual component, but in all cases the
observed adaptation shifts were completely determined by the unperceived auditory
component of the audiovisual adaptors. Samuel and Lieblich (2014) tried to strengthen the
audiovisual percept by making it lexical (e.g., pairing a visual "armagillo" with an auditory
"armibillo", to generate the perceived adaptor "armadillo"), but the results were entirely like
those from the studies using simple syllables as adaptors: the shifts were always determined
by the auditory component, not by the perceived stimulus. The results from all of these
studies contrast with those from multiple experiments in which the perceived identity of the
adaptor was determined by lexical context, either through phonemic restoration (Samuel,
1997), or via the Ganong effect (Samuel, 2001). In the lexical cases, the perceived identity of
the adaptors matched the observed adaptation shifts.
Looking at all of these findings, it is clear that both lip-read context and lexical
context produce reliable and robust effects on what listeners say that they are perceiving. At
the same time, there is now sufficient behavioral and electrophysiological evidence to sustain
the claim that these two types of context operate differently, and at least mostly
independently. Samuel and Lieblich (2014) suggested that the adaptation data are consistent
with somewhat different roles for lip-read information and for lexical information. They
argued that people's reported percepts indicate that a primary role for lip-read context is to
aid in the overt perception of the stimulus: such context directly affects what people think
they are hearing. Lexical context can also do this, but in addition, it seems to have a direct
impact on the linguistic encoding of the speech. Samuel and Lieblich argued that the
successful adaptation by lexically-determined speech sounds was evidence for the linguistic
encoding of those stimuli, while the unsuccessful adaptation found repeatedly for lip-read-based
percepts shows that this type of context does not directly enter into the linguistic
processing chain. This distinction is grounded in the idea that speech is simultaneously both a
linguistic object, and a perceptual one; lip-read context seems to be primarily affecting the
latter.


This distinction in fact closely matches one that Poeppel and his colleagues have
drawn on the basis of a wide set of other types of evidence. As Poeppel (2003) put it, "Sound-based
representations interface in task-dependent ways with other systems. An acoustic-phonetic-articulatory
coordinate transformation occurs in a dorsal pathway ... that links
auditory representations to motor representations in superior temporal/parietal areas. A
second, ventral pathway interfaces speech derived representations with lexical semantic
representations" (p. 247). The first pathway in this approach aligns with Samuel and
Lieblich's (2014) more perceptual analysis, while the second one clearly matches the more
linguistic encoding. In fact, Poeppel's first pathway explicitly connects to motor
representations, exactly the kinds of representations that naturally would be associated with
lip-read information. And, his second pathway explicitly involves lexical semantic
representations, obviously a close match to the lexical context effects examined here. The
theoretical distinctions made by both Poeppel and by Samuel and Lieblich provide a natural
basis for the hypothesized independence of lip-read context and lexical context: Each type of
context is primarily involved with somewhat different properties of the speech signal,
reflecting its dual nature as both a perceptual and a linguistic stimulus. Moreover, Poeppel's
analysis suggests that the two functions are subserved by different neural circuits, providing
an explanation for the observed additive effects: There is little or no interaction because the
different types of context are being routed separately.
We should stress that the independence of the two pathways is a property of the
online processing of speech. The initial independence does not preclude one type of context
from eventually affecting the other kind of information, sometime later. For example, lip-read-only
words can prime the semantic categories an auditory target belongs to (Dodd,
Oerlemans, & Robinson, 1989), and lip-reading the first syllable of a low-frequency word may
prime auditory recognition of that word (M. Fort et al., 2012). However, the fact that lip-read
information may affect later auditory perception is not evidence for on-line interaction
between the two contexts while processing a speech sound, or even for the visual information
directly affecting linguistic encoding; an initial perceptual effect may eventually have
downstream linguistic consequences. The consistently independent effects of the two context
types on the observed ERPs in the current study, together with the accumulating findings
from multiple behavioral paradigms, provide strong support for the view that lip-reading
primarily guides perceptual analysis while lexical context primarily supports linguistic
encoding of the speech signal.


Appendix
Figures 4 and 5 display the ERPs for each stimulus type (A, V and AV), from ~600
ms before onset of the critical third syllable, to ~500 ms after. As can be seen, all conditions
yield similar ERPs before video-onset, which is to be expected given that two auditory
syllables were played during this time in all conditions. Onset of the third syllable's lip-read
information (indicated by "video") clearly led to a response in the V and AV conditions (there
was no lip-read information in the A condition), and V and AV start to diverge after auditory
onset of the third syllable (indicated by "audio"). ERPs for words and pseudowords look quite
similar in all conditions, but to capture effects of AV integration, analyses were conducted on
A and on the AV - V difference waves (see Figure 2).
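To make the difference-wave computation concrete, the following is a minimal sketch, assuming subject-level ERP averages stored as NumPy arrays; the file names, array shapes, and epoch offset are illustrative and not the actual pipeline used here.

import numpy as np

# Hypothetical subject-level ERP averages, shape (n_subjects, n_electrodes, n_samples),
# time-locked to third-syllable onset.
erp_A = np.load("erp_A.npy")    # auditory-only
erp_V = np.load("erp_V.npy")    # lip-read-only
erp_AV = np.load("erp_AV.npy")  # audiovisual

# Subtracting the visual-only response isolates auditory-related activity in the AV
# trials, so that it can be compared against the auditory-only ERP (additive-model logic).
erp_AV_minus_V = erp_AV - erp_V

def window_mean(erp, sfreq, t_start, t_end, epoch_offset=0.6):
    """Mean amplitude per subject and electrode in [t_start, t_end) seconds relative
    to third-syllable onset; the epoch is assumed to start epoch_offset s earlier."""
    i0 = int(round((epoch_offset + t_start) * sfreq))
    i1 = int(round((epoch_offset + t_end) * sfreq))
    return erp[:, :, i0:i1].mean(axis=2)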

[Insert Figure 4 about here]

[Insert Figure 5 about here]

Figure 6 displays topography maps of the averaged activity in eight consecutive 50
ms time-windows. The maps are symmetrically scaled with different maximal amplitudes for
each time-window. Negativity is indicated by black contour lines and positivity by white
ones. Main effects of Lexicality are indicated by rectangles around the topography maps
corresponding to the time-windows where main effects were observed. Interactions between
Lexicality or Modality and Electrode in any given time-window were assessed through
pair-wise t-tests at each electrode (testing activity for words vs. pseudowords, or for A vs.
AV - V, depending on the factor that interacted with Electrode), and electrodes for which the
obtained p-value survived a Holm-Bonferroni correction are indicated by asterisks. Test
parameters of interest are summarized below the maps. Delta values (Δ) indicate the minimal
amplitude difference observed in the significant tests. The minimal t- and maximal p-values
of the comparisons are provided, as well as the maximal 95% Confidence Interval around
delta, and the minimal effect size (Cohen's d). Figure 6 displays these maps for both
experiments, with the upper panel corresponding to Experiment 1, and the lower panel to
Experiment 2.
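As a rough illustration of the per-electrode follow-up procedure described above (paired t-tests at each electrode, Holm-Bonferroni corrected), a minimal Python sketch is given below; the input arrays are assumed to hold windowed mean amplitudes per subject and electrode, and this is not the authors' actual analysis code.

import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

def per_electrode_tests(cond1, cond2, alpha=0.05):
    """Paired t-tests at each electrode with Holm-Bonferroni correction.

    cond1, cond2: arrays of shape (n_subjects, n_electrodes), e.g. windowed mean
    amplitudes for words vs. pseudowords, or for A vs. AV - V. Returns the t-values,
    uncorrected p-values, and a boolean mask of electrodes surviving the correction
    (the electrodes that would be marked with asterisks).
    """
    n_electrodes = cond1.shape[1]
    t_vals = np.empty(n_electrodes)
    p_vals = np.empty(n_electrodes)
    for e in range(n_electrodes):
        t_vals[e], p_vals[e] = ttest_rel(cond1[:, e], cond2[:, e])
    reject, _, _, _ = multipletests(p_vals, alpha=alpha, method="holm")
    return t_vals, p_vals, reject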

[Insert Figure 6 about here]


Acknowledgements
This work was supported by Rubicon grant 446-11-014 from the Netherlands
Organization for Scientific Research (NWO) to MB, and by MINECO grant PSI2010-17781
from the Spanish Ministry of Economy and Competitiveness to AGS.


References
Alsius, A., Möttönen, R., Sams, M. E., Soto-Faraco, S., & Tiippana, K. (2014). Effect of
attentional load on audiovisual speech perception: evidence from ERPs. Frontiers in
Psychology, 5:727.
Baart, M., de Boer-Schellekens, L., & Vroomen, J. (2012). Lipread-induced phonetic
recalibration in dyslexia. Acta Psychologica, 140(1), 91-95.
Baart, M., & Samuel, A. G. (2015). Early processing of auditory lexical predictions revealed
by ERPs. Neuroscience Letters, 585, 98-102.
Baart, M., Stekelenburg, J. J., & Vroomen, J. (2014). Electrophysiological evidence for
speech-specific audiovisual integration. Neuropsychologia, 53, 115-121.
Baart, M., & Vroomen, J. (2010). Phonetic recalibration does not depend on working
memory. Experimental Brain Research, 203(3), 575-582.
Barutchu, A., Crewther, S. G., Kiely, P., Murphy, M. J., & Crewther, D. P. (2008). When
/b/ill with /g/ill becomes /d/ill: Evidence for a lexical effect in audiovisual speech
perception. European Journal of Cognitive Psychology, 20(1), 1-11.
Bertelson, P., Vroomen, J., & De Gelder, B. (2003). Visual recalibration of auditory speech
identification: a McGurk aftereffect. Psychological Science, 14(6), 592-597.
Besle, J., Fort, A., Delpuech, C., & Giard, M. H. (2004). Bimodal speech: early suppressive
visual effects in human auditory cortex. European Journal of Neuroscience, 20(8),
2225-2234.
Besle, J., Fort, A., & Giard, M. H. (2004). Interest and validity of the additive model in
electrophysiological studies of multisensory interactions. Cognitive Processing, 5(3),
189-192.
Brancazio, L. (2004). Lexical influences in audiovisual speech perception. Journal of
Experimental Psychology: Human Perception & Performance, 30(3), 445-463.
Brunellière, A., Sánchez-García, C., Ikumi, N., & Soto-Faraco, S. (2013). Visual information
constrains early and late stages of spoken-word recognition in sentence context.
International Journal of Psychophysiology, 89(1), 136-147.
Colin, C., Radeau, M., Soquet, A., & Deltenre, P. (2004). Generalization of the generation of
an MMN by illusory McGurk percepts: voiceless consonants. Clinical
Neurophysiology, 115(9), 1989-2000.
Colin, C., Radeau, M., Soquet, A., Demolin, D., Colin, F., & Deltenre, P. (2002). Mismatch
negativity evoked by the McGurk-MacDonald effect: a phonetic representation within
short-term memory. Clinical Neurophysiology, 113(4), 495-506.
Connolly, J. F., & Phillips, N. A. (1994). Event-related potential components reflect
phonological and semantic processing of the terminal word of spoken sentences.
Journal of Cognitive Neuroscience, 6(3), 256-266.
Connolly, J. F., Phillips, N. A., Stewart, S. H., & Brake, W. G. (1992). Event-related
potential sensitivity to acoustic and semantic properties of terminal words in
sentences. Brain and Language, 43(1), 1-18.
Connolly, J. F., Stewart, S. H., & Phillips, N. A. (1990). The effects of processing
requirements on neurophysiological responses to spoken sentences. Brain and
Language, 39, 302-318.
Courchesne, E., Hillyard, S. A., & Galambos, R. (1975). Stimulus novelty, task relevance and
the visual evoked potential in man. Electroencephalography and Clinical
Neurophysiology, 39(2), 131-143.
Dekle, D. J., Fowler, C. A., & Funnell, M. G. (1992). Audiovisual integration in perception
of real words. Perception & Psychophysics, 51(4), 355-362.


Dodd, B., Oerlemans, M., & Robinson, R. (1989). Cross-modal effects in repetition priming:
A comparison of lip-read graphic and heard stimuli. Visible Language, 22, 59-77.
Duchon, A., Perea, M., Sebastián-Gallés, N., Martí, A., & Carreiras, M. (2013). EsPal: One-stop
shopping for Spanish word properties. Behavior Research Methods, 1-13.
Easton, R. D., & Basala, M. (1982). Perceptual dominance during lipreading. Perception &
Psychophysics, 32(6), 562-570.
Eimas, P. D., & Corbit, J. D. (1973). Selective adaptation of linguistic feature detectors.
Cognitive Psychology, 4, 99-109.
Eisner, F., & McQueen, J. M. (2006). Perceptual learning in speech: stability over time.
Journal of the Acoustical Society of America, 119(4), 1950-1953.
Fort, A., Delpuech, C., Pernier, J., & Giard, M. H. (2002). Early auditory-visual interactions
in human cortex during nonredundant target identification. Cognitive Brain Research,
14, 20-30.
Fort, M., Kandel, S., Chipot, J., Savariaux, C., Granjon, L., & Spinelli, E. (2012). Seeing the
initial articulatory gestures of a word triggers lexical access. Language and Cognitive
Processes, 28(8), 1207-1223.
Frtusova, J. B., Winneke, A. H., & Phillips, N. A. (2013). ERP evidence that
auditory-visual speech facilitates working memory in younger and older adults.
Psychology and Aging, 28(2), 481-494.
Ganesh, A. C., Berthommier, F., Vilain, C., Sato, M., & Schwartz, J. L. (2014). A possible
neurophysiological correlate of audiovisual binding and unbinding in speech
perception. Frontiers in Psychology, 5:1340.
Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of
Experimental Psychology: Human Perception & Performance, 6(1), 110-125.
Giard, M. H., & Peronnet, F. (1999). Auditory-visual integration during multimodal object
recognition in humans: a behavioral and electrophysiological study. Journal of
Cognitive Neuroscience, 11(5), 473-490.
Gratton, G., Coles, M. G., & Donchin, E. (1983). A new method for off-line removal of
ocular artifact. Electroencephalography and Clinical Neurophysiology, 55(4), 468-484.
Hagoort, P. (2008). The fractionation of spoken language understanding by measuring
electrical and magnetic brain signals. Philosophical Transactions of the Royal Society
B: Biological Sciences, 363(1493), 1055-1069.
Holcomb, P. J., Kounios, J., Anderson, J. E., & West, W. C. (1999). Dual-coding,
context-availability, and concreteness effects in sentence comprehension: an
electrophysiological investigation. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 25(3), 721-742.
Holcomb, P. J., & Neville, H. J. (1990). Auditory and visual semantic priming in lexical
decision: A comparison using event-related brain potentials. Language and Cognitive
Processes, 5(4), 281-312.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian
Journal of Statistics, 6(2), 65-70.
Klucharev, V., Möttönen, R., & Sams, M. (2003). Electrophysiological indicators of phonetic
and non-phonetic multisensory interactions during audiovisual speech perception.
Cognitive Brain Research, 18(1), 65-75.
Kraljic, T., Brennan, S. E., & Samuel, A. G. (2008). Accommodating variation: dialects,
idiolects, and speech processing. Cognition, 107(1), 54-81.
Kraljic, T., & Samuel, A. G. (2005). Perceptual learning for speech: Is there a return to
normal? Cognitive Psychology, 51(2), 141-178.


Kraljic, T., & Samuel, A. G. (2006). Generalization in perceptual learning for speech.
Psychonomic Bulletin & Review, 13(2), 262-268.
Kraljic, T., & Samuel, A. G. (2007). Perceptual adjustments to multiple speakers. Journal of
Memory and Language, 56, 1-15.
Kutas, M., & Federmeier, K. D. (2011). Thirty years and counting: finding meaning in the
N400 component of the event-related brain potential (ERP). Annual Review of
Psychology, 62, 621-647.
MacGregor, L. J., Pulvermüller, F., van Casteren, M., & Shtyrov, Y. (2012). Ultra-rapid
access to words in the brain. Nature Communications, 3, 711.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.
Näätänen, R., Gaillard, A. W. K., & Mäntysalo, S. (1978). Early selective-attention effect in
evoked potential reinterpreted. Acta Psychologica, 42, 313-329.
Norris, D., McQueen, J. M., & Cutler, A. (2003). Perceptual learning in speech. Cognitive
Psychology, 47(2), 204-238.
Ostrand, R., Blumstein, S. E., & Morgan, J. L. (2011). When hearing lips and seeing voices
becomes perceiving speech: Auditory-visual integration in lexical access. Paper
presented at the 33rd Annual Conference of the Cognitive Science Society, Austin,
Texas.
Perrin, F., & Garca-Larrea, L. (2003). Modulation of the N400 potential during auditory
phonological/semantic interaction. Cognitive Brain Research, 17(1), 36-47.
Pettigrew, C. M., Murdoch, B. E., Ponton, C. W., Finnigan, S., Alku, P., Kei, J., . . . Chenery,
H. J. (2004). Automatic auditory processing of English words as indexed by the
mismatch negativity, using a multiple deviant paradigm. Ear and Hearing, 25(3),
284-301.
Pitt, M. A., & Samuel, A. G. (1993). An empirical and meta-analytic evaluation of the
phoneme identification task. Journal of Experimental Psychology: Human
Perception & Performance, 19(4), 699-725.
Poeppel, D. (2003). The analysis of speech in different temporal integration windows:
cerebral lateralization as asymmetric sampling in time. Speech Communication, 41,
245-255.
Pulvermüller, F., Kujala, T., Shtyrov, Y., Simola, J., Tiitinen, H., Alku, P., . . . Näätänen, R.
(2001). Memory traces for words as revealed by the mismatch negativity.
Neuroimage, 14(3), 607-616.
Roberts, M., & Summerfield, Q. (1981). Audiovisual presentation demonstrates that selective
adaptation in speech perception is purely auditory. Perception & Psychophysics,
30(4), 309-314.
Saint-Amour, D., De Sanctis, P., Molholm, S., Ritter, W., & Foxe, J. J. (2007). Seeing voices:
High-density electrical mapping and source-analysis of the multisensory mismatch
negativity evoked during the McGurk illusion. Neuropsychologia, 45(3), 587-597.
Saldaña, H. M., & Rosenblum, L. D. (1994). Selective adaptation in speech perception using
a compelling audiovisual adaptor. Journal of the Acoustical Society of America, 95(6),
3658-3661.
Sams, M., Manninen, P., Surakka, V., Helin, P., & Kättö, R. (1998). McGurk effect in
Finnish syllables, isolated words, and words in sentences: Effects of word meaning
and sentence context. Speech Communication, 26, 75-87.
Samuel, A. G. (1981). Phonemic restoration: insights from a new methodology. Journal of
Experimental Psychology: General, 110(4), 474-494.
Samuel, A. G. (1986). Red herring detectors and speech perception: in defense of selective
adaptation. Cognitive Psychology, 18(4), 452-499.


Samuel, A. G. (1997). Lexical activation produces potent phonemic percepts. Cognitive
Psychology, 32(2), 97-127.
Samuel, A. G. (2001). Knowing a word affects the fundamental perception of the sounds
within it. Psychological Science, 12(4), 348-351.
Samuel, A. G., & Lieblich, J. (2014). Visual speech acts differently than lexical context in
supporting speech perception. Journal of Experimental Psychology: Human
Perception & Performance, 40(4), 1479-1490.
Shigeno, S. (2002). Anchoring effects in audiovisual speech perception. Journal of the
Acoustical Society of America, 111(6), 2853-2861.
Stekelenburg, J. J., & Vroomen, J. (2007). Neural correlates of multisensory integration of
ecologically valid audiovisual events. Journal of Cognitive Neuroscience, 19(12),
1964-1973.
Stekelenburg, J. J., & Vroomen, J. (2012). Electrophysiological correlates of predictive
coding of auditory location in the perception of natural audiovisual events. Frontiers
in Integrative Neuroscience, 6:26.
Sternberg, S. (1969). The discovery of processing stages: Extensions of Donders' method. In
W. G. Koster (Ed.), Attention and performance II. Acta Psychologica (Vol. 30, pp.
276-315). Amsterdam: North-Holland.
Sternberg, S. (2013). The meaning of additive reaction-time effects: Some misconceptions.
Frontiers in Psychology, 4, 744.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise.
Journal of the Acoustical Society of America, 26, 212-215.
Treille, A., Vilain, C., & Sato, M. (2014). The sound of your lips: electrophysiological
cross-modal interactions during hand-to-face and face-to-face speech perception. Frontiers
in Psychology, 5:420.
van den Brink, D., Brown, C. M., & Hagoort, P. (2001). Electrophysiological evidence for
early contextual influences during spoken-word recognition: N200 versus N400
effects. Journal of Cognitive Neuroscience, 13(7), 967-985.
van den Brink, D., & Hagoort, P. (2004). The influence of semantic and syntactic context
constraints on lexical selection and integration in spoken-word comprehension as
revealed by ERPs. Journal of Cognitive Neuroscience, 16(6), 1068-1084.
van Linden, S., Stekelenburg, J. J., Tuomainen, J., & Vroomen, J. (2007). Lexical effects on
auditory speech perception: An electrophysiological study. Neuroscience Letters,
420(1), 49-52.
van Linden, S., & Vroomen, J. (2007). Recalibration of phonetic categories by lipread speech
versus lexical information. Journal of Experimental Psychology: Human Perception
& Performance, 33(6), 1483-1494.
van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural
processing of auditory speech. Proceedings of the National Academy of Sciences of
the United States of America, 102(4), 1181-1186.
Vroomen, J., & Baart, M. (2009a). Phonetic recalibration only occurs in speech mode.
Cognition, 110(2), 254-259.
Vroomen, J., & Baart, M. (2009b). Recalibration of phonetic categories by lipread speech:
Measuring aftereffects after a twenty-four hours delay. Language and Speech, 52,
341-350.
Vroomen, J., & Stekelenburg, J. J. (2010). Visual anticipatory information modulates
multisensory interactions of artificial audiovisual stimuli. Journal of Cognitive
Neuroscience, 22(7), 1583-1596.


Vroomen, J., van Linden, S., de Gelder, B., & Bertelson, P. (2007). Visual recalibration and
selective adaptation in auditory-visual speech perception: Contrasting build-up
courses. Neuropsychologia, 45(3), 572-577.
Vroomen, J., van Linden, S., Keetels, M., de Gelder, B., & Bertelson, P. (2004). Selective
adaptation and recalibration of auditory speech by lipread information: Dissipation.
Speech Communication, 44, 55-61.
Warren, R. M. (1970). Perceptual restoration of missing speech sounds. Science, 167(3917),
392-393.
Winneke, A. H., & Phillips, N. A. (2011). Does audiovisual speech offer a fountain of youth
for old ears? An event-related brain potential study of age differences in audiovisual
speech perception. Psychology and Aging, 26(2), 427-438.


Figure captions

Figure 1. Overview and timing for auditory-only (left) and audiovisual and visual-only
(right) trials in Experiments 1 and 2.

Figure 2. ERPs time-locked to third-syllable onset for auditory words and pseudowords and
the AV - V difference waves for Experiment 1 (upper panel) and Experiment 2 (lower panel).

Figure 3. Panel 3a depicts p-values of the main effects of Modality and Lexicality and their
interactions with Electrode, obtained with ANOVAs on averaged data in eight 50 ms
time-windows in Experiment 1. The lower panel in 3a displays significance of the interaction
between both contexts and their 3-way interaction with Electrode. The grey shaded areas
below the alpha threshold indicate time-windows in which both contexts yielded significant
results. Panel 3b displays the averaged effect of AV integration (AV - V - A) for words and
pseudowords in a 150-250 ms window, for each electrode. Panels 3c and 3d are like 3a and
3b respectively, but for Experiment 2 instead of Experiment 1.

Figure 4. ERPs for unimodal (A and V) and AV stimuli in Experiment 1 when the final
syllable made up a word (upper panel) or a pseudoword (lower panel). Onset of the lip-read
information is indicated by "video", and "audio" refers to auditory onset of the third syllable
(e.g., "ga"), which was presented immediately after the first two auditory syllables (e.g.,
"lechu").

Figure 5. ERPs for unimodal (A and V) and AV stimuli in Experiment 2 when the final
syllable made up a word (upper panel) or a pseudoword (lower panel). Onset of the lip-read
information is indicated by "video", and "audio" refers to auditory onset of the third syllable
(e.g., "ga"), which was preceded by ~800 ms of silence inserted after the first two
auditory syllables (e.g., "lechu").

Figure 6. Scalp topographies of average activity in 50 ms windows for words, pseudowords,
auditory, and AV - V in Experiments 1 (upper panel) and 2 (lower panel). Rectangles indicate
time-windows in which main effects of Lexicality (words vs. pseudowords) or Modality (AV - V
vs. A) were observed. Interactions between Electrode and Lexicality, or Electrode and
Modality, were followed up by pair-wise comparisons that tested the context effect at each
electrode. Those tests for which the p-value survived a Holm-Bonferroni correction are
indicated by asterisks. The corresponding test parameters are displayed below the maps and
include the minimal difference across electrodes (Δ), the minimal t-value, the maximal p-value,
the maximal 95% Confidence Interval around the difference, and the minimal effect size
(Cohen's d).


Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.
