

Cortical tracking of hierarchical linguistic structures


in connected speech
Nai Ding1,2, Lucia Melloni3–5, Hang Zhang1,6–8, Xing Tian1,9,10 & David Poeppel1,11
The most critical attribute of human language is its unbounded combinatorial nature: smaller elements can be combined into
larger structures on the basis of a grammatical system, resulting in a hierarchy of linguistic units, such as words, phrases and
sentences. Mentally parsing and representing such structures, however, poses challenges for speech comprehension. In speech,
hierarchical linguistic structures do not have boundaries that are clearly defined by acoustic cues and must therefore be internally
and incrementally constructed during comprehension. We found that, during listening to connected speech, cortical activity of
different timescales concurrently tracked the time course of abstract linguistic structures at different hierarchical levels, such as

words, phrases and sentences. Notably, the neural tracking of hierarchical linguistic structures was dissociated from the encoding
of acoustic cues and from the predictability of incoming words. Our results indicate that a hierarchy of neural processing
timescales underlies grammar-based internal construction of hierarchical linguistic structure.

To understand connected speech, listeners must construct a hierarchy of linguistic structures of different sizes, including syllables, words, phrases and sentences1–3. It remains puzzling how the brain simultaneously handles the distinct timescales of the different linguistic structures, which range from a few hundred milliseconds for syllables to a few seconds for sentences4–14. Previous studies have suggested that cortical activity is synchronized to acoustic features of speech, approximately at the syllabic rate, providing an initial timescale for speech processing15–19. But how the brain uses such syllabic-level phonological representations, which are closely aligned with the physical input, to build multiple levels of abstract linguistic structure, and how it represents these levels concurrently, is not known. We hypothesized that cortical dynamics emerge at all timescales required for the processing of different linguistic levels, including the timescales corresponding to larger linguistic structures such as phrases and sentences, and that the neural representation of each linguistic level evolves on a timescale matching that of the corresponding linguistic unit.

Although linguistic structure building can clearly benefit from prosodic20,21 or statistical cues22, it can also be achieved purely on the basis of the listener's grammatical knowledge. To experimentally isolate the neural representation of internally constructed hierarchical linguistic structure, we developed new speech materials in which the linguistic constituent structure was dissociated from prosodic or statistical cues. By manipulating the levels of linguistic abstraction, we found separable neural encoding of each linguistic level.

RESULTS
Cortical tracking of phrasal and sentential structures
In the first set of experiments, we sought to determine the neural representation of hierarchical linguistic structure in the absence of prosodic cues. We constructed hierarchical linguistic structures using an isochronous, 4-Hz sequence of syllables that were independently synthesized (Fig. 1a,b, Supplementary Fig. 1 and Supplementary Table 1). As a result of the acoustic independence between syllables (that is, no co-articulation), the linguistic constituent structure could only be extracted using lexical, syntactic and semantic knowledge, not prosodic cues. The materials were first developed in Mandarin Chinese, in which syllables are relatively uniform in duration and are also the basic morphological unit (always morphemes and, in most cases, monosyllabic words).

Cortical activity was recorded from native listeners of Mandarin Chinese using magnetoencephalography (MEG). Given that the different linguistic levels, that is, the monosyllabic morphemes, phrases and sentences, were presented at unique and constant rates, the hypothesized neural tracking of hierarchical linguistic structure was tagged at distinct frequencies.

The MEG response was analyzed in the frequency domain, and response power was extracted in every frequency bin using an optimal spatial filter (Online Methods). Consistent with our hypothesis, the response spectrum showed three peaks, at the syllabic rate (P = 1.4 × 10−5, paired one-sided t test, false discovery rate (FDR) corrected), the phrasal rate (P = 1.6 × 10−4, paired one-sided t test, FDR corrected) and the sentential rate (P = 9.6 × 10−7, paired one-sided t test, FDR corrected), and the response was highly consistent across listeners (Fig. 1c).

1Department of Psychology, New York University, New York, New York, USA. 2College of Biomedical Engineering and Instrument Sciences, Zhejiang University,
Hangzhou, China. 3Department of Neurology, New York University Langone Medical Center, New York, New York, USA. 4Department of Neurophysiology, Max-Planck
Institute for Brain Research, Frankfurt, Germany. 5Department of Psychiatry, Columbia University, New York, New York, USA. 6Department of Psychology and Beijing
Key Laboratory of Behavior and Mental Health, Peking University, Beijing, China. 7PKU-IDG/McGovern Institute for Brain Research, Peking University, Beijing, China.
8Peking-Tsinghua Center for Life Sciences, Beijing, China. 9New York University Shanghai, Shanghai, China. 10NYU-ECNU Institute of Brain and Cognitive Science

at NYU Shanghai, Shanghai, China. 11Neuroscience Department, Max-Planck Institute for Empirical Aesthetics, Frankfurt, Germany. Correspondence should be
addressed to N.D. (ding_nai@zju.edu.cn) or D.P. (david.poeppel@nyu.edu).

Received 12 August; accepted 3 November; published online 7 December 2015; doi:10.1038/nn.4186




Figure 1 Neural tracking of hierarchical linguistic structures. (a) Sequences of Chinese or English monosyllabic words were presented isochronously (4 Hz; 250 ms per syllable), forming phrases (2 Hz) and sentences (1 Hz); for example, "Dry fur rubs skin | New plans gave hope". (b) Spectrum of the stimulus intensity fluctuation revealed the syllabic rhythm, but no phrasal or sentential modulation. The shaded area covers ±2 s.e.m. across stimuli. (c) MEG-derived cortical response spectrum for Chinese listeners and materials (dark red curve, grand average; light red curves, individual listeners; N = 16, 0.11-Hz frequency resolution). Neural tracking of syllabic, phrasal and sentential rhythms was reflected by spectral peaks at the corresponding frequencies. Frequency bins with significantly stronger power than their neighbors (±0.5-Hz range) are marked (*P < 0.001, paired one-sided t test, FDR corrected). Topographical maps of response power across sensors are shown for the peak frequencies.
Given that the phrasal- and sentential-rate rhythms were not conveyed by acoustic fluctuations at the corresponding frequencies (Fig. 1b), cortical responses at the phrasal and sentential rates must be a consequence of internal, online structure building.

Cortical activity at all three peak frequencies was seen bilaterally (Fig. 1c). The response power averaged over the sensors in each hemisphere was significantly stronger in the left hemisphere at the sentential rate (P = 0.014, paired two-sided t test), but not at the phrasal (P = 0.20, paired two-sided t test) or syllabic rate (P = 0.40, paired two-sided t test).
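As an illustration of the frequency-tagging analysis described above, the sketch below computes a response spectrum from a trial-averaged recording and contrasts the power at a tagged frequency with the mean power of the neighboring bins. It is a minimal sketch under simplifying assumptions (a plain sensor average instead of the optimal spatial filter used in the actual analysis, synthetic data, and an illustrative sampling rate); it is not the authors' pipeline.

```python
import numpy as np

def response_spectrum(evoked, fs):
    """Amplitude spectrum (dB) of a trial-averaged response, averaged over channels."""
    spec = np.abs(np.fft.rfft(evoked, axis=-1)).mean(axis=0)
    freqs = np.fft.rfftfreq(evoked.shape[-1], d=1.0 / fs)
    return freqs, 20 * np.log10(spec + 1e-12)

def peak_minus_neighbors(freqs, power_db, f_target, half_width=0.5):
    """Power at the target frequency minus the mean power of neighboring bins
    within +/- half_width Hz (the target bin itself is excluded)."""
    i = int(np.argmin(np.abs(freqs - f_target)))
    nb = (np.abs(freqs - f_target) <= half_width) & (np.arange(freqs.size) != i)
    return power_db[i] - power_db[nb].mean()

if __name__ == "__main__":
    fs, dur = 100.0, 90.0                      # illustrative sampling rate and trial length
    t = np.arange(int(fs * dur)) / fs
    # toy response with a strong 4-Hz (syllabic) and weaker 2-Hz and 1-Hz components
    evoked = (np.sin(2 * np.pi * 4 * t) + 0.3 * np.sin(2 * np.pi * 2 * t)
              + 0.3 * np.sin(2 * np.pi * 1 * t))[None, :] + 0.1 * np.random.randn(1, t.size)
    freqs, p = response_spectrum(evoked, fs)
    for f0 in (1.0, 2.0, 4.0):
        print(f"{f0:.0f} Hz peak vs. neighbors: {peak_minus_neighbors(freqs, p, f0):.1f} dB")
```

Across listeners, analogous peak-minus-neighbor values can then be entered into paired one-sided t tests with FDR correction, as reported in the text.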
Dependence on syntactic structures
Are the responses at the phrasal and sentential rates indeed separate neural indices of processing at distinct linguistic levels, or are they merely sub-harmonics of the syllabic-rate response, generated by intrinsic cortical dynamics? We addressed this question by manipulating different levels of linguistic structure in the input. When the stimulus is a sequence of random syllables that preserves the acoustic properties of the Chinese sentences (Fig. 1 and Supplementary Fig. 2) but eliminates the phrasal/sentential structure, only syllabic (acoustic) level tracking occurs (P = 1.1 × 10−4 at 4 Hz, paired one-sided t test, FDR corrected; Fig. 2a). Furthermore, this manipulation preserves the position of each syllable in a sentence (Online Methods) and therefore further demonstrates that the phrasal- and sentential-rate responses are not a result of possible acoustic differences between the syllables in a sentence. When two adjacent syllables and morphemes combine into verb phrases, but there is no four-element sentential structure, phrasal-level tracking emerges at half of the syllabic rate (P = 8.6 × 10−4 at 2 Hz and P = 2.7 × 10−4 at 4 Hz, paired one-sided t test, FDR corrected; Fig. 2b). Similar responses are observed for noun phrases (Supplementary Fig. 3).

To test whether the phrase-level responses segregate from the sentence level, we constructed longer verb phrases that were unevenly divided into a monosyllabic verb followed by a three-syllable noun phrase (Fig. 2c). We expected the neural response to the long verb phrase to be tagged at 1 Hz, whereas the neural responses to the monosyllabic verb and the three-syllable noun phrase would appear as harmonics of 1 Hz. Consistent with this hypothesis, cortical dynamics emerged at one-fourth of the syllabic rate, whereas the response at half of the syllabic rate was no longer detectable (P = 1.9 × 10−4, 1.7 × 10−4 and 9.3 × 10−4 at 1, 3 and 4 Hz, respectively, paired one-sided t test, FDR corrected).

Dependence on language comprehension
When listening to Chinese sentences (Fig. 1a), listeners who did not understand Chinese showed responses only to the syllabic (acoustic) rhythm (P = 3.0 × 10−5 at 4 Hz, paired one-sided t test, FDR corrected; Fig. 2d), further supporting the argument that cortical responses to larger, abstract linguistic structures are a direct consequence of language comprehension.

If aligning cortical dynamics to the time course of linguistic constituent structure is a general mechanism required for comprehension, it must apply across languages. Indeed, when native English speakers were tested with English materials (Fig. 1a), their cortical activity also followed the time course of the larger linguistic structures, that is, phrases and sentences (P = 4.1 × 10−5, syllabic rate, Fig. 2e; P = 3.9 × 10−3, 4.3 × 10−3 and 6.8 × 10−6 at the sentential, phrasal and syllabic rates, respectively, Fig. 2f; paired one-sided t test, FDR corrected).

Figure 2 Tracking of different linguistic structures. Each panel shows the syntactic structure repeating in the stimulus (left) and the cortical response spectrum (right; shaded area indicates ±2 s.e.m. over listeners, N = 8). (a) Chinese listeners, Chinese materials: syllables were syntactically independent and cortical activity encoded only the acoustic, syllabic rhythm. (b,c) Additional tracking emerged with larger linguistic structures. Spectral peaks are marked by a star (black, P < 0.001; gray, P < 0.005; paired one-sided t test, FDR corrected). (d) English listeners, Chinese materials from Figure 1: acoustic tracking only, as there was no parsable structure. (e,f) English listeners, English materials: syllabic-rate (4/1.28 Hz) as well as sentential- and phrasal-rate responses to the parsable structure in the stimulus.




Figure 3 Dissociating sentential structures and transitional probability. (a,b) Grammar of an artificial Markovian stimulus set with constant (a) or variable (b) transitional probability. Each sentence consists of three acoustic chunks (C1, C2, C3), each containing 1–2 English words (for example, "the boy | ordered | beer" in a, "my cat | is so | lovely" in b). The listeners memorized the grammar before the experiments. (c) Schematic time course and spectrum of the transitional probability. (d) Neural response spectrum (shaded area covers ±2 s.e.m. over listeners, N = 8). Significant neural responses to sentences were seen for both Markovian languages. Spectral peaks are marked by an asterisk (P < 0.001, paired one-sided t test, FDR corrected, same color code as the spectrum). Responses were not significantly different between the two languages in any frequency bin (paired two-sided t test, P > 0.09, uncorrected).

Neural tracking of linguistic structures rather than probability cues
We found that concurrent neural tracking of multiple levels of linguistic structure was not confounded with the encoding of acoustic cues (Figs. 1 and 2). However, is it simply explained by the neural tracking of the predictability of smaller units? As a larger linguistic structure, such as a sentence, unfolds in time, its component units become more predictable. Thus, cortical networks solely tracking transitional probabilities across smaller units could show temporal dynamics matching the timescale of larger structures. To test this alternative hypothesis, we crafted a constant transitional probability Markovian Sentence Set (MSS) in which the transitional probability of lower-level units was dissociated from the higher-level structures (Fig. 3a and Supplementary Fig. 1e,f). The constant transitional probability MSS was contrasted with a varying transitional probability MSS, in which the transitional probability is low across sentential boundaries and high within a sentence (Fig. 3b,c). If cortical activity only encodes the transitional probability between lower-level units (for example, acoustic chunks in the MSS), independent of the underlying syntactic structure, it can show tracking of the sentential structure for the varying-probability MSS, but not for the constant-probability MSS. In contrast with this prediction, indistinguishable neural responses to sentences were observed for both MSS (Fig. 3d), demonstrating that neural tracking of sentences is not confounded by transitional probability. Specifically, for the constant transitional probability MSS, the response was statistically significant at the sentential rate, twice the sentential rate and the syllable rate (P = 1.8 × 10−4, 2.3 × 10−4 and 2.7 × 10−6, respectively). For the varying transitional probability MSS, the response was statistically significant at the sentential rate, twice the sentential rate and the syllable rate (P = 7.1 × 10−4, 7.1 × 10−4 and 4.8 × 10−6, respectively).

Given that the MSS involved real English sentences, listeners had prior knowledge of the transitional probabilities between acoustic chunks. To control for the effect of such prior knowledge, we created a set of Artificial Markovian Sentences (AMS). In the AMS, the transitional probability between syllables was the same within and across sentences (Supplementary Fig. 4a). The AMS was composed of Chinese syllables, but no meaningful Chinese expressions were embedded in the AMS sequences. As the AMS was not based on the grammar of Chinese, the listeners had to learn the AMS grammar to segment sentences. By comparing the neural responses to the AMS sequences before and after the grammar was learned, we were able to separate the effect of prior knowledge of transitional probability from the effect of grammar learning. Here, the grammar of the AMS refers to the set of rules that governs the sequencing of the AMS, that is, the rules specifying which syllables can follow which syllables.

The neural responses to the AMS before and after grammar learning were analyzed separately (Supplementary Fig. 4). Before learning, when the listeners were instructed that the stimulus was just a sequence of random syllables, the response showed a statistically significant peak at the syllabic rate (P = 0.0003, bootstrap), but not at the sentential rate. After the AMS grammar was learned, however, a significant response peak emerged at the sentential rate (P = 0.0001, bootstrap). A response peak was also observed at twice the sentential rate, possibly reflecting the second harmonic of the sentential response. This result further confirms that neural tracking of sentences is not confounded by neural tracking of transitional probability.
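The contrast between the two Markovian sentence sets can be made explicit with a short sketch of the transitional-probability time course (probability values taken from Fig. 3; the trial length and bin indexing below are illustrative assumptions, not the authors' code): a constant transitional probability carries no energy at the sentential rate, whereas a probability that drops at sentence boundaries fluctuates exactly once per sentence.

```python
import numpy as np

def tp_time_course(n_sentences, tp_within, tp_across):
    """Transitional probability attached to each chunk-to-chunk transition for
    three-chunk sentences played back to back (one value per acoustic chunk)."""
    per_sentence = [tp_across, tp_within, tp_within]  # into C1, C1->C2, C2->C3
    return np.array(per_sentence * n_sentences, dtype=float)

def sentential_rate_energy(x):
    """Spectral magnitude at one cycle per sentence (three chunks per sentence)."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    return spectrum[len(x) // 3]                      # bin at the sentential rate

constant = tp_time_course(60, tp_within=1 / 5, tp_across=1 / 5)   # as in Fig. 3a
varying = tp_time_course(60, tp_within=1.0, tp_across=1 / 25)     # as in Fig. 3b

print(sentential_rate_energy(constant))  # 0: no sentential-rate fluctuation
print(sentential_rate_energy(varying))   # > 0: fluctuates once per sentence
```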


Neural tracking of sentences varying in duration and structure
These results are based on sequences of sentences that have uniform duration and syntactic structure. We next addressed whether cortical tracking of larger linguistic structures generalizes to sentences that are variable in duration (4–8 syllables) and syntactic structure.

Figure 4 Neural tracking of sentences of varying structures. (a) Neural activity tracked the sentence duration, even when the sentence boundaries (dotted lines) were not separated by acoustic gaps. (b) Averaged response near a sentential boundary (dotted line). The power changed continuously throughout the duration of a sentence. Shaded area covers ±2 s.e.m. over listeners (N = 8). Significant power differences between time bins (shaded squares) are marked by asterisks (P = 0.01, one-sided t test, FDR corrected). (c) Confusion matrix for neural decoding of sentence duration. (d) Neural activity tracks noun phrase duration (3-syllable versus 4-syllable NP followed by a VP, shown at the bottom). Yellow areas show significant differences between curves (P = 0.005, bootstrap, FDR corrected).




Figure 5 Localizing cortical sources of the sentential- and phrasal-rate responses using ECoG (N = 5). Left, power envelope of high-gamma activity. Right, waveform of low-frequency activity. Electrodes in the right hemisphere were projected onto the left hemisphere; right-hemisphere (left-hemisphere) electrodes are shown by hollow (filled) circles. The figure displays only electrodes that showed statistically significant neural responses to sentences in Figure 2e and no significant response to the acoustic control shown in Figure 2f. Significance was determined by bootstrap (FDR corrected) at a significance level of 0.05. The response strength, that is, the response at the target frequency relative to the mean response averaged over a 1-Hz-wide neighboring region, is color coded. Electrodes with a response strength of less than 10 dB are shown by smaller symbols. Sentential- and phrasal-rate responses were seen in bilateral pSTG, the TPJ and left IFG.
These sentences were again built from isochronous Chinese syllables, intermixed and presented sequentially without any acoustic gap at the sentence boundaries. Examples translated into English include "Don't be nervous", "The book is hard to read" and "Over the street is a museum".

As these sentences have irregular durations that are not tagged by frequency, the MEG responses were analyzed in the time domain by averaging sentences of the same duration. To focus on sentential-level processing, we low-pass filtered the response at 3.5 Hz. The MEG response (root mean square, r.m.s., over all sensors) rapidly increased after a sentence boundary and changed continuously throughout the duration of a sentence (Fig. 4a). To illustrate the detailed temporal dynamics, we averaged the r.m.s. response over all sentences containing six or more syllables after aligning them to the sentence offset (Fig. 4b). During the last four syllables of a sentence, the r.m.s. response decreased continuously and significantly for every syllable, indicating that the neural response changes continuously during the course of a sentence rather than being a transient response occurring only at the sentence boundary.

A single-trial decoding analysis was performed to independently confirm that cortical activity tracks the duration of sentences (Fig. 4c). The decoder applied template matching to the response time course (leave-one-out cross-validation, Online Methods) and correctly determined the duration of 34.9 ± 0.6% of the sentences (mean ± s.e.m. over subjects, significantly above chance, P = 1.3 × 10−7, one-sided t test).
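The single-trial decoding analysis can be sketched as follows; this is a simplified stand-in for the Online Methods procedure, in which the templates are plain trial averages, the match score is a Pearson correlation, and the responses are assumed to be epoched over a fixed window aligned to sentence onset. The exact implementation may differ.

```python
import numpy as np

def decode_durations(responses, labels):
    """Leave-one-out template-matching decoder for sentence duration.
    responses: (n_trials, n_samples) low-pass-filtered r.m.s. response time courses,
               epoched over a fixed window aligned to sentence onset (assumption)
    labels:    (n_trials,) sentence duration in syllables (e.g., 4-8)"""
    responses = np.asarray(responses, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    decoded = np.empty_like(labels)
    for i in range(len(labels)):
        keep = np.arange(len(labels)) != i           # leave the test trial out
        # one template per duration: mean time course of the remaining trials
        templates = {c: responses[keep & (labels == c)].mean(axis=0) for c in classes}
        # assign the duration whose template correlates best with the held-out trial
        scores = {c: np.corrcoef(responses[i], t)[0, 1] for c, t in templates.items()}
        decoded[i] = max(scores, key=scores.get)
    return decoded

# chance level is 1/len(classes); accuracy = np.mean(decode_durations(X, y) == y)
```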
After demonstrating cortical tracking of sentences, we further tested whether cortical activity also tracks the phrasal structure inside a sentence. We constructed sentences that consisted of a noun phrase followed by a verb phrase and manipulated the duration of the noun phrase (three syllables or four syllables). The cortical responses closely followed the duration of the noun phrase: the r.m.s. response gradually decreased over the noun phrase and then showed a transient increase after the onset of the verb phrase (Fig. 4d).

Neural source localization using electrocorticography (ECoG)
We found that large-scale neural activity measured by MEG concurrently follows the hierarchical linguistic structure of speech, but which neural networks generate such activity? To address this question, we recorded the ECoG responses to English sentences (Fig. 2e) and an acoustic control (Fig. 2f). ECoG signals are mesoscopic neurophysiological signals recorded by intracranial electrodes implanted in epilepsy patients for clinical evaluation (see Supplementary Fig. 5 for the electrode coverage), and they possess better spatial resolution than MEG.

Figure 6 Spatial dissociation between sentential-rate, phrasal-rate and syllabic-rate responses (N = 5). (a) Power spectrum of the power envelope of high-gamma activity. (b) Power spectrum of the low-frequency ECoG waveform. The top panels (green curves) show the response averaged over all electrodes that showed a significant sentential-rate response but no significant phrasal-rate response. Significance was determined by bootstrap (FDR corrected) at a significance level of 0.05. The shaded area is ±1 s.d. over electrodes on each side. The blue curves show the response averaged over all electrodes that showed a significant phrasal-rate response but no significant sentential-rate response. The red curves show the response averaged over electrodes with a significant sentential-rate or phrasal-rate response but no significant syllabic-rate response. (c,d) Topographic distribution of the three groups of electrodes analyzed in a and b. As in Figure 5, electrodes showing a response stronger than 10 dB are shown by larger symbols than electrodes showing a response weaker than 10 dB.
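As a reference for how a high-gamma power envelope and its low-frequency spectrum (the quantity plotted in a) can be obtained, here is a generic recipe using a standard band-pass filter plus the Hilbert envelope; the paper's exact filter settings are given in the Online Methods and may differ from this sketch.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def high_gamma_power(ecog, fs, band=(70.0, 200.0)):
    """Band-pass one ECoG channel in the high-gamma range (requires fs > 400 Hz)
    and return the instantaneous power from the Hilbert envelope."""
    b, a = butter(4, [band[0] / (fs / 2.0), band[1] / (fs / 2.0)], btype="bandpass")
    return np.abs(hilbert(filtfilt(b, a, ecog))) ** 2

def power_envelope_spectrum(power, fs):
    """Spectrum of the slow fluctuations of high-gamma power; peaks at the
    sentential, phrasal and syllabic stimulus rates indicate tracking."""
    spec = np.abs(np.fft.rfft(power - power.mean()))
    freqs = np.fft.rfftfreq(power.size, d=1.0 / fs)
    return freqs, spec
```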




Figure 7 Syllabic-rate ECoG responses to English sentences and the acoustic control (N = 5). Left, power envelope of high-gamma activity; right, waveform of low-frequency activity. Top, electrodes showing statistically significant syllabic-rate ECoG responses to the acoustic control, that is, shuffled sequences, which had the same acoustic and syllabic rhythm as the English sentences but contained no hierarchical linguistic structures (Fig. 2f). Significance was determined by bootstrap (FDR corrected) at a significance level of 0.05. The responses were most strongly seen in bilateral STG for both high-gamma and low-frequency activity, and in bilateral pre-motor areas for low-frequency activity. Bottom, syllabic-rate ECoG responses specific to sentences: the electrodes displayed are those that showed statistically significant neural responses to sentences and no significant response to the acoustic control. These sentence-specific syllabic-rate responses were strong along bilateral STG for high-gamma activity and were widely distributed over the frontal and temporal lobes for low-frequency activity.

We first analyzed the power of the ECoG signal in the high-gamma band (70–200 Hz), as it correlates strongly with multiunit firing23. The electrodes exhibiting significant sentential-, phrasal- and syllabic-rate fluctuations in high-gamma power are shown separately (Fig. 5). The sentential-rate response clustered over the posterior and middle superior temporal gyrus (pSTG) bilaterally, with a second cluster over the left inferior frontal gyrus (IFG). Phrasal-rate responses were also observed over the pSTG bilaterally. Notably, although the sentential- and phrasal-rate responses were observed in similar cortical areas, electrodes showing phrasal-rate responses only partially overlapped with electrodes showing sentential-rate responses in the pSTG (Fig. 6). For electrodes showing a significant response at either the sentential rate or the phrasal rate, the strength of the sentential-rate response was negatively correlated with the strength of the phrasal-rate response (R = −0.32, P = 0.004, bootstrap). This phenomenon demonstrates spatially dissociable neural tracking of the sentential and phrasal structures.

Furthermore, some electrodes with a significant sentential- or phrasal-rate response showed no significant syllabic-rate response (P < 0.05, FDR corrected, Fig. 6). In other words, there are cortical circuits specifically encoding larger, abstract linguistic structures without responding to syllabic-level acoustic features of speech. In addition, although the syllabic responses were not significantly different (P > 0.05, FDR corrected) for English sentences and the acoustic control in the MEG results, they were spatially dissociable in the ECoG results (Fig. 7). Electrodes showing significant syllabic responses (P < 0.05, FDR corrected) to sentences, but not to the acoustic control, were seen in bilateral pSTG, bilateral anterior STG (aSTG) and left IFG.

We then analyzed neural tracking of the sentential, phrasal and syllabic rhythms in the low-frequency ECoG waveform (Fig. 5), which is a close neural correlate of MEG activity. Fourier analysis was applied directly to the ECoG waveform and the Fourier coefficients at 1, 2 and 4 Hz were extracted. Low-frequency ECoG activity is usually viewed as reflecting the dendritic input to a cortical area24. The low-frequency responses were more distributed than the high-gamma activity, possibly reflecting the fact that the neural representations of different levels of linguistic structure serve as inputs to broad cortical areas. Sentential- and phrasal-rate responses were strong in the STG, IFG and temporoparietal junction (TPJ). Compared with the acoustic control, the syllabic-rate response to sentences was stronger in broad cortical areas, including the temporal and frontal lobes (Fig. 7). Similar to the high-gamma activity, the low-frequency responses to the sentential and phrasal structures were not reflected in the same set of electrodes (Fig. 6). For electrodes showing a significant response at either the sentential rate or the phrasal rate, the strength of the sentential-rate response was again negatively correlated with the strength of the phrasal-rate response (R = −0.21, significantly different from 0, P = 0.023, bootstrap).

DISCUSSION
Our data show that the multiple timescales required for the processing of linguistic structures of different sizes emerge in cortical networks during speech comprehension. The neural sources of the sentential-, phrasal- and syllabic-rate responses are highly distributed and include cortical areas that have been found to be critical for prosodic (for example, right STG), syntactic and semantic (for example, left pSTG and left IFG) processing9,25–28. Neural integration on different timescales is likely to underlie the transformation from shorter-lived neural representations of smaller linguistic units to longer-lasting neural representations of larger linguistic structures11–14. These results underscore the existence of hierarchical structure building operations in language comprehension1,2 and can be applied to objectively assess language processing in children and difficult-to-test populations, as well as in animal preparations, allowing cross-species comparisons.

Relation to language comprehension
Concurrent neural tracking of hierarchical linguistic structures provides a plausible functional mechanism for temporally integrating smaller linguistic units into larger structures. In this form of concurrent neural tracking, the neural representation of smaller linguistic units is embedded at different phases of the neural activity tracking a higher-level structure. Thus, it provides a possible mechanism to transform the hierarchical embedding of linguistic structures into hierarchical embedding of neural dynamics, which may facilitate information integration over time10,11. Low-frequency neural tracking of linguistic structures may further modulate higher-frequency neural oscillations29–31, which have been proposed to play additional roles in structure building7. In addition, multiple resources and computations are needed for syntactic analysis, for example, access to combinatorial syntactic subroutines, and such operations may correspond to neural computations on distinct frequency scales that are coordinated by the low-frequency neural tracking of linguistic constituent structure. Furthermore, low-frequency neural activity and oscillations have been hypothesized to be critical mechanisms for generating predictions about future events32. For language processing, it is likely that concurrent neural tracking of hierarchical linguistic structures provides mechanisms to generate predictions on multiple linguistic levels and allows interactions across linguistic levels33.




Neural entrainment to speech
Recent work has shown that cortex tracks the slow acoustic fluctuations of speech below 10 Hz (refs. 15–18,34,35), and this phenomenon is commonly described as cortical entrainment to the syllabic rhythm of speech. It has been controversial whether such syllabic-level cortical entrainment reflects low-level auditory encoding or language processing6. Our findings demonstrate that processing goes well beyond stimulus-bound analysis: cortical activity is entrained to larger linguistic structures that are, by necessity, internally constructed on the basis of syntax. The emergence of slow cortical dynamics provides timescales suitable for the analysis of larger chunk sizes13,14.

A long-standing controversy concerns how the neural responses to sensory stimuli are related to intrinsic, ongoing neural oscillations. This question is heavily debated for the neural response entrained to the syllabic rhythm of speech36 and can also be asked for neural activity entrained to the time courses of larger linguistic structures. Our experiment was not designed to answer this question; however, we clearly found that cortical speech processing networks have the capacity to generate activity on very long timescales corresponding to larger linguistic structures, such as phrases and sentences. In other words, the timescales of larger linguistic structures fall within the timescales, or temporal receptive windows12,13, that the relevant cortical networks are sensitive to. Whether the capacity to generate low-frequency activity during speech processing relies on the same mechanisms that generate low-frequency spontaneous neural oscillations will need to be addressed in the future.

Nature of the linguistic representations
Language processing is not monolithic and involves partially segregated cortical networks for the processing of, for example, phonological, syntactic and semantic information9. The phonological, syntactic and semantic representations are all hierarchically organized37 and may rely on the same core structure building operations38. In natural speech, linguistic structure building can be facilitated by prosodic39 and statistical cues22, and some underlying neural signatures of these cues have been demonstrated6,8,20. Such cues, however, are not always available, and even when they are available, they are generally not sufficient. Thus, robust structure building relies on a listener's tacit syntactic knowledge, and our findings provide unique insights into the neural representation of abstract linguistic structures that are internally constructed on the basis of syntax alone. Although the construction of abstract structures is driven by syntactic analysis, once such structures are built, different aspects of the structure, including semantic information, can be integrated into the neural representation. Indeed, the wide distribution of cortical tracking of hierarchical linguistic structures suggests that it is a general neurophysiological mechanism for the combinatorial operations involved in hierarchical linguistic structure building in multiple linguistic processing networks (for example, phonological, syntactic and semantic). Furthermore, coherent synchronization to the correlated linguistic structures in different representational networks, for example, syntactic, semantic and phonological, provides a way to integrate multidimensional linguistic representations into a coherent language percept38,40, just as temporal synchronization between cortical networks provides a possible solution to the binding problem in sensory processing41.

Relation to event-related responses
Although many previous neurophysiological studies on structure building have focused on syntactic and semantic violations42–44, fewer have addressed normal structure building. On the lexical-semantic level, the N400/N400m has been identified as a marker of the semantic predictability of words43,45, and its amplitude is continuously reduced over the course of a sentence46,47. For syntactic processing, when two words combine into a short phrase, increased activity is seen in the temporal and frontal lobes4. Our results build on and extend these findings by demonstrating structure building at different levels of the linguistic hierarchy during online comprehension of connected speech materials in which the structural boundaries are neither physically cued nor confounded by the semantic predictability of the individual words (Fig. 3). Note that, although the two Markovian languages (compared in Fig. 3) differed in the transitional probability between acoustic chunks, they both had fully predictable syntactic structures. This equivalence in syntactic predictability is likely to account for the very similar responses in the two conditions.

Lastly, the emergence of slow neural dynamics tracking superordinate stimulus structures is reminiscent of what has been observed during decision making48, action planning49 and music perception50, suggesting a plausible common neural computational framework for integrating information over distinct timescales12. These findings invite MEG and EEG studies to extend from classic event-related designs to investigating continuous neural encoding of the internally constructed perceptual organization of an information stream.

Methods
Methods and any associated references are available in the online version of the paper.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

Acknowledgments
We thank J. Walker for MEG technical support, T. Thesen, W. Doyle and O. Devinsky for their instrumental help in collecting ECoG data, and G. Buzsáki, G. Cogan, S. Dehaene, A.-L. Giraud, G. Hickok, N. Hornstein, E. Lau, A. Marantz, N. Mesgarani, M. Peña, B. Pesaran, L. Pylkkänen, C. Schroeder, J. Simon and W. Singer for their comments on previous versions of the manuscript. This work was supported by US National Institutes of Health grant 2R01DC05660 (D.P.), Major Projects Program of the Shanghai Municipal Science and Technology Commission (STCSM) 15JC1400104 (X.T.) and National Natural Science Foundation of China grant 31500914 (X.T.).

AUTHOR CONTRIBUTIONS
N.D., L.M. and D.P. conceived and designed the experiments. N.D., H.Z. and X.T. performed the MEG experiments. L.M. performed the ECoG experiment. N.D., L.M. and D.P. wrote the paper. All of the authors discussed the results and edited the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Berwick, R.C., Friederici, A.D., Chomsky, N. & Bolhuis, J.J. Evolution, brain, and the nature of language. Trends Cogn. Sci. 17, 89–98 (2013).
2. Chomsky, N. Syntactic Structures (Mouton de Gruyter, 1957).
3. Phillips, C. Linear order and constituency. Linguist. Inq. 34, 37–90 (2003).
4. Bemis, D.K. & Pylkkänen, L. Basic linguistic composition recruits the left anterior temporal lobe and left angular gyrus during both listening and reading. Cereb. Cortex 23, 1859–1873 (2013).
5. Giraud, A.-L. & Poeppel, D. Cortical oscillations and speech processing: emerging computational principles and operations. Nat. Neurosci. 15, 511–517 (2012).
6. Sanders, L.D., Newport, E.L. & Neville, H.J. Segmenting nonsense: an event-related potential index of perceived onsets in continuous speech. Nat. Neurosci. 5, 700–703 (2002).
7. Bastiaansen, M., Magyari, L. & Hagoort, P. Syntactic unification operations are reflected in oscillatory dynamics during on-line sentence comprehension. J. Cogn. Neurosci. 22, 1333–1347 (2010).
8. Buiatti, M., Peña, M. & Dehaene-Lambertz, G. Investigating the neural correlates of continuous speech computation with frequency-tagged neuroelectric responses. Neuroimage 44, 509–519 (2009).




9. Pallier, C., Devauchelle, A.-D. & Dehaene, S. Cortical representation of the constituent structure of sentences. Proc. Natl. Acad. Sci. USA 108, 2522–2527 (2011).
10. Schroeder, C.E., Lakatos, P., Kajikawa, Y., Partan, S. & Puce, A. Neuronal oscillations and visual amplification of speech. Trends Cogn. Sci. 12, 106–113 (2008).
11. Buzsáki, G. Neural syntax: cell assemblies, synapsembles and readers. Neuron 68, 362–385 (2010).
12. Bernacchia, A., Seo, H., Lee, D. & Wang, X.-J. A reservoir of time constants for memory traces in cortical neurons. Nat. Neurosci. 14, 366–372 (2011).
13. Lerner, Y., Honey, C.J., Silbert, L.J. & Hasson, U. Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. J. Neurosci. 31, 2906–2915 (2011).
14. Kiebel, S.J., Daunizeau, J. & Friston, K.J. A hierarchy of time-scales and the brain. PLoS Comput. Biol. 4, e1000209 (2008).
15. Luo, H. & Poeppel, D. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54, 1001–1010 (2007).
16. Ding, N. & Simon, J.Z. Emergence of neural encoding of auditory objects while listening to competing speakers. Proc. Natl. Acad. Sci. USA 109, 11854–11859 (2012).
17. Zion Golumbic, E.M. et al. Mechanisms underlying selective neuronal tracking of attended speech at a cocktail party. Neuron 77, 980–991 (2013).
18. Peelle, J.E., Gross, J. & Davis, M.H. Phase-locked responses to speech in human auditory cortex are enhanced during comprehension. Cereb. Cortex 23, 1378–1387 (2013).
19. Pasley, B.N. et al. Reconstructing speech from human auditory cortex. PLoS Biol. 10, e1001251 (2012).
20. Steinhauer, K., Alter, K. & Friederici, A.D. Brain potentials indicate immediate use of prosodic cues in natural speech processing. Nat. Neurosci. 2, 191–196 (1999).
21. Peña, M., Bonatti, L.L., Nespor, M. & Mehler, J. Signal-driven computations in speech processing. Science 298, 604–607 (2002).
22. Saffran, J.R., Aslin, R.N. & Newport, E.L. Statistical learning by 8-month-old infants. Science 274, 1926–1928 (1996).
23. Ray, S. & Maunsell, J.H. Different origins of gamma rhythm and high-gamma activity in macaque visual cortex. PLoS Biol. 9, e1000610 (2011).
24. Einevoll, G.T., Kayser, C., Logothetis, N.K. & Panzeri, S. Modeling and analysis of local field potentials for studying the function of cortical circuits. Nat. Rev. Neurosci. 14, 770–785 (2013).
25. Hagoort, P. & Indefrey, P. The neurobiology of language beyond single words. Annu. Rev. Neurosci. 37, 347–362 (2014).
26. Grodzinsky, Y. & Friederici, A.D. Neuroimaging of syntax and syntactic processing. Curr. Opin. Neurobiol. 16, 240–246 (2006).
27. Hickok, G. & Poeppel, D. The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402 (2007).
28. Friederici, A.D., Meyer, M. & von Cramon, D.Y. Auditory language comprehension: an event-related fMRI study on the processing of syntactic and lexical information. Brain Lang. 74, 289–300 (2000).
29. Canolty, R.T. et al. High gamma power is phase-locked to theta oscillations in human neocortex. Science 313, 1626–1628 (2006).
30. Lakatos, P. et al. An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. J. Neurophysiol. 94, 1904–1911 (2005).
31. Sirota, A., Csicsvari, J., Buhl, D. & Buzsáki, G. Communication between neocortex and hippocampus during sleep in rodents. Proc. Natl. Acad. Sci. USA 100, 2065–2069 (2003).
32. Arnal, L.H. & Giraud, A.-L. Cortical oscillations and sensory predictions. Trends Cogn. Sci. 16, 390–398 (2012).
33. Poeppel, D., Idsardi, W.J. & van Wassenhove, V. Speech perception at the interface of neurobiology and linguistics. Phil. Trans. R. Soc. Lond. B 363, 1071–1086 (2008).
34. Peña, M. & Melloni, L. Brain oscillations during spoken sentence processing. J. Cogn. Neurosci. 24, 1149–1164 (2012).
35. Gross, J. et al. Speech rhythms and multiplexed oscillatory sensory coding in the human brain. PLoS Biol. 11, e1001752 (2013).
36. Ding, N. & Simon, J.Z. Cortical entrainment to continuous speech: functional roles and interpretations. Front. Hum. Neurosci. 8, 311 (2014).
37. Jackendoff, R. Foundations of Language: Brain, Meaning, Grammar, Evolution (Oxford University Press, 2002).
38. Hagoort, P. On Broca, brain, and binding: a new framework. Trends Cogn. Sci. 9, 416–423 (2005).
39. Cutler, A., Dahan, D. & van Donselaar, W. Prosody in the comprehension of spoken language: a literature review. Lang. Speech 40, 141–201 (1997).
40. Frazier, L., Carlson, K. & Clifton, C. Jr. Prosodic phrasing is central to language comprehension. Trends Cogn. Sci. 10, 244–249 (2006).
41. Singer, W. & Gray, C.M. Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci. 18, 555–586 (1995).
42. Friederici, A.D. Towards a neural basis of auditory sentence processing. Trends Cogn. Sci. 6, 78–84 (2002).
43. Kutas, M. & Federmeier, K.D. Electrophysiology reveals semantic memory use in language comprehension. Trends Cogn. Sci. 4, 463–470 (2000).
44. Neville, H., Nicol, J.L., Barss, A., Forster, K.I. & Garrett, M.F. Syntactically based sentence processing classes: evidence from event-related brain potentials. J. Cogn. Neurosci. 3, 151–165 (1991).
45. Lau, E.F., Phillips, C. & Poeppel, D. A cortical network for semantics: (de)constructing the N400. Nat. Rev. Neurosci. 9, 920–933 (2008).
46. Halgren, E. et al. N400-like magnetoencephalography responses modulated by semantic context, word frequency and lexical class in sentences. Neuroimage 17, 1101–1116 (2002).
47. Van Petten, C. & Kutas, M. Interactions between sentence context and word frequency in event-related brain potentials. Mem. Cognit. 18, 380–393 (1990).
48. O'Connell, R.G., Dockree, P.M. & Kelly, S.P. A supramodal accumulation-to-bound signal that determines perceptual decisions in humans. Nat. Neurosci. 15, 1729–1735 (2012).
49. Koechlin, E., Ody, C. & Kouneiher, F. The architecture of cognitive control in the human prefrontal cortex. Science 302, 1181–1185 (2003).
50. Nozaradan, S., Peretz, I., Missal, M. & Mouraux, A. Tagging the neuronal entrainment to beat and meter. J. Neurosci. 31, 10234–10240 (2011).



ONLINE METHODS
Participants. 34 native listeners of Mandarin Chinese (19–36 years old, mean 25 years old; 13 male) and 13 native listeners of American English (22–46 years old, mean 26 years old; 6 male) participated in the study. All Chinese listeners received high school education in China and 26 of them also received college education in China. None of the English listeners understood Chinese. All participants were right-handed51. Five experiments were run for Chinese listeners and two experiments for English listeners. Each experiment included eight listeners (except the AMS experiment, which involved five listeners) and each listener participated in at most two experiments. The number of listeners per experiment was chosen based on previous MEG experiments on neural tracking of continuous speech. The sample size in previous experiments was typically between three and 12 (refs. 15,16), and the basic phenomenon reported here was replicated in all seven experiments of the study (N = 47 in total). The experimental procedures were approved by the New York University Institutional Review Board, and written informed consent was obtained from each participant before the experiment.

Stimuli I: Chinese materials. All Chinese materials were constructed from an isochronous sequence of syllables. Even when the syllables were hierarchically grouped into linguistic constituents, no acoustic gaps were inserted between constituents. All syllables were synthesized independently using the NeoSpeech synthesizer (http://www.neospeech.com/, the male voice Liang). The synthesized syllables were 75–354 ms in duration (mean duration 224 ms) and were adjusted to 250 ms by truncation or by padding silence at the end. The last 25 ms of each syllable were smoothed by a cosine window.

Four-syllable sentences. 50 four-syllable sentences were constructed, in which the first two syllables formed a noun phrase and the last two syllables formed a verb phrase (Supplementary Table 1). The noun phrase was composed of either a single two-syllable noun or a one-syllable adjective followed by a one-syllable noun. The verb phrase was composed of either a two-syllable verb or a one-syllable verb followed by a one-syllable noun object. In a normal trial, ten sentences were played sequentially and no acoustic gaps were inserted between sentences (Supplementary Fig. 1a). Owing to the lack of phrasal- and sentential-level prosodic cues, the sound intensity of the stimulus, characterized by the sound envelope, fluctuates only at the syllabic rate and not at the phrasal or sentential rate (Supplementary Fig. 2). An outlier trial was the same as a normal trial except that the verb phrases in two sentences were exchanged, creating two nonsense sentences with incompatible subjects and predicates (an example in English would be "new plans rub skin").

Four-syllable verb phrases. Two types of four-syllable verb phrases were created. Type I verb phrases contained a one-syllable verb followed by a three-syllable noun phrase, which could be a compound noun or an adjective + noun phrase (Supplementary Fig. 1b and Supplementary Table 1). Type II verb phrases contained a two-syllable verb followed by a two-syllable noun (Supplementary Fig. 1c; all phrases are listed in Supplementary Table 1). 50 instances were created for each type of verb phrase. In a normal trial, ten phrases of the same type were presented sequentially. An outlier trial was the same as a normal trial except that the verbs in two phrases were exchanged, creating two nonsense verb phrases with incompatible verbs and objects (an example in English would be "drink a long walk").

Two-syllable phrases. The verb phrases (or the noun phrases) of the four-syllable sentences were presented in a sequence (Supplementary Fig. 1d). In a normal trial, 20 different phrases were played. In an outlier trial, one of the 20 phrases was replaced by two random syllables that did not constitute a sensible phrase.

Random syllabic sequences. The random syllabic sequences were generated from the four-syllable sentences. Each four-syllable sentence was transformed into four random syllables using the following rule: the first syllable in the sentence was replaced by the first syllable of a randomly chosen sentence, the second syllable was replaced by the second syllable of another randomly chosen sentence, and the same for the third and fourth syllables. This way, if there were any consistent acoustic differences between the syllables at different positions in a sentence, those acoustic differences were preserved in the random syllabic sequences. Each normal trial contained 40 syllables. In outlier trials, four consecutive syllables were replaced by a Chinese idiom.

Backward syllabic sequences. In normal trials, ten four-syllable sentences were played, but with all syllables played backward in time. An outlier trial was the same as a normal trial except that four consecutive syllables at a random position were replaced by four random syllables that were not reversed in time.

Four-syllable idioms. 50 common four-syllable idioms were selected (Supplementary Table 1), in which the first two syllables formed a noun phrase and the last two syllables formed a verb phrase. In a normal trial, ten idioms were played. An outlier trial was the same as a normal trial except that the noun phrases in two idioms were exchanged, creating two nonexistent and semantically nonsensical idioms.

Sentences with variable duration and syntactic structures. The sentence duration was varied between four and eight syllables. 40 sentences were constructed for each duration, resulting in a total of 200 sentences (listed in Supplementary Table 1). All 200 sentences were intermixed. In a normal trial, ten different sentences were played sequentially without inserting any acoustic gap between sentences. In an outlier trial, one of the ten sentences was replaced by a syntactically correct but semantically anomalous sentence. Examples of nonsense sentences, translated into English, included "ancient history is drinking tea" and "take part in his portable hard drive".

Sentences with variable NP durations. All sentences consisted of a noun phrase followed by a verb phrase (Supplementary Table 1). The noun phrase had three syllables for half of the sentences (N = 45) and four syllables for the other half. A three-syllable noun phrase was followed by either a four-syllable verb phrase (N = 20) or a five-syllable verb phrase (N = 25). A four-syllable noun phrase was followed by a three-syllable verb phrase (N = 20) or a four-syllable verb phrase (N = 25). Sentences with different noun phrase and verb phrase durations were intermixed. In a normal trial, ten different sentences were played sequentially, without inserting any acoustic gap between phrases or sentences. In an outlier trial, one sentence was replaced by a sentence with the same syntactic structure that was semantically anomalous.

AMS. Five sets of AMS were created. Each sentence consisted of three components, C1, C2 and C3. Each component (C1, C2 or C3) was independently chosen from three candidate syllables with equal probability. The grammar of the AMS is illustrated in Supplementary Figure 4a. In the experiments, sentences were played sequentially without any gap between sentences. Since all components were chosen independently and each component was chosen from three syllables with equal probability, all components were equally predictable regardless of their position in a sequence. In other words, P(C1) = P(C2) = P(C3) = P(C2|C1) = P(C3|C2) = P(C1|C3) = 1/3.
All Chinese syllables were synthesized independently and adjusted to 300 ms by truncation or by padding silence at the end. In each trial, 60 sentences were played and no additional gap was inserted between sentences. Therefore, the syllables were played at a constant rate of 3.33 Hz and the sentences at a constant rate of 1.11 Hz. To make sure that neural encoding of the AMS was not confounded by the acoustic properties of a particular set of syllables, five sets of AMS were created (Supplementary Table 1). No meaningful Chinese expressions were embedded in the AMS sequences.

Stimuli II: English materials. All English materials were synthesized using the MacinTalk synthesizer (male voice Alex, in Mac OS X 10.7.5).

Four-syllable English sentences. 60 four-syllable English sentences were constructed (Supplementary Table 1), and each syllable was a monosyllabic word. All sentences had the same syntactic structure: adjective/pronoun + noun + verb + noun. Each syllable was synthesized independently, and all the synthesized syllables (250–347 ms in duration) were adjusted to 320 ms by padding silence at the end or by truncation. The offset of each syllable was smoothed by a 25-ms cosine window. In each trial, 12 sentences were presented without any acoustic gap between them. In an outlier trial, three consecutive words starting from a random position were replaced by three random words so that the corresponding sentence(s) became ungrammatical.

Shuffled sequences. Shuffled sequences were constructed as unintelligible sound sequences that preserved the acoustic properties of the sentence sequences. All syllables in the four-syllable English sentences were segmented into five overlapping slices. Each slice was 72 ms in duration and overlapped with neighboring slices by 10 ms. The first 10 ms and the last 10 ms of each slice were smoothed by a linear ramp, except for the onset of the first slice and the offset of the last slice.

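To make the stimulus construction concrete, the following sketch assembles an isochronous trial from independently synthesized syllables, padding or truncating each to 250 ms and applying the 25-ms cosine offset ramp described above. It is an illustrative reconstruction, not the scripts actually used; the sampling rate and the reading that the ramp is applied to the final 25 ms of the fixed-length syllable are assumptions.

```python
import numpy as np

FS = 22050  # assumed audio sampling rate

def fit_syllable(wav, dur_s=0.25, ramp_s=0.025, fs=FS):
    """Truncate or zero-pad a synthesized syllable to a fixed duration and
    smooth its offset with a cosine (half-Hann) ramp."""
    n = int(round(dur_s * fs))
    out = np.zeros(n)
    m = min(len(wav), n)
    out[:m] = wav[:m]
    r = int(round(ramp_s * fs))
    out[n - r:] *= 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, r)))  # fade to silence
    return out

def build_trial(sentences, fs=FS):
    """Concatenate four-syllable sentences with no gaps, yielding a 4-Hz syllable
    rhythm, a 2-Hz phrasal rate and a 1-Hz sentential rate."""
    return np.concatenate([fit_syllable(syl, fs=fs) for sent in sentences for syl in sent])
```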


A shuffled sentence was constructed by shuffling all slices at the same position across the four-syllable sentences. For example, the first slice of the first syllable in a given sentence was replaced by the first slice of the first syllable in a different, randomly chosen sentence; likewise, the third slice of the fourth syllable in one sentence was replaced by the third slice of the fourth syllable in another randomly chosen sentence. In a normal trial, 12 different shuffled sentences were played sequentially, resulting in a trial that had the same duration as a trial of four-syllable English sentences. In an outlier trial, four consecutive shuffled syllables were replaced by four randomly chosen English words that did not construct a sentence.
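A minimal sketch of this slice-shuffling procedure is given below; the pre-sliced array layout is a convenience of the sketch, and the 10-ms overlap-add used when re-concatenating the slices is omitted.

```python
import numpy as np

def shuffle_slices(slices, seed=None):
    """Exchange syllable slices across sentences while keeping every slice at its
    original (syllable position, slice position) within the sentence.
    slices: array of shape (n_sentences, 4, 5, slice_samples)."""
    rng = np.random.default_rng(seed)
    out = np.empty_like(slices)
    n_sentences = slices.shape[0]
    for syllable in range(slices.shape[1]):       # position of the syllable in the sentence
        for piece in range(slices.shape[2]):      # position of the slice in the syllable
            perm = rng.permutation(n_sentences)   # reassign this slice across sentences
            out[:, syllable, piece] = slices[perm, syllable, piece]
    return out
```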
Markovian sentences. The pronunciation of an English syllable strongly depends on its neighbors. To simulate more natural English, we also synthesized English sentences based on isochronous multi-syllabic acoustic chunks. Every sentence was divided into three acoustic chunks that were roughly equal in duration. Each acoustic chunk consisted of 1–2 monosyllabic or bisyllabic words and was synthesized as a whole, independently of neighboring acoustic chunks. All synthesized acoustic chunks (250–364 ms in duration) were adjusted to 350 ms by truncation or by padding silence at the end. The offset of each chunk was smoothed by a 25-ms cosine window.
Two types of Markov chain sentences were generated based on isochronous sequences of acoustic chunks. In one type of Markovian sentences, called the constant predictability sentences, each acoustic chunk was equally predictable based on the preceding chunk, regardless of its position within a sentence. The constant predictability sentences were generated based on the grammar specified in Figure 3a and Supplementary Figure 1e. Listeners were familiarized with the grammar and were able to write down the full grammar table before participating in the experiment. In each trial, ten sentences were separately generated based on the grammar and sequentially presented without any acoustic gap between them.
The other type of Markovian sentences, called the predictable sentences, consisted of a finite number of sentences (N = 25, Supplementary Table 1) that were extensively repeated (11–12 times) in a ~7-min block. In these sentences, the second and the third acoustic chunks were uniquely determined by the first chunk. In each trial, ten different sentences were played sequentially without any acoustic gap between them.
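The grammar of Figure 3a is not reproduced here, so the transition table in the sketch below is a hypothetical stand-in; it only illustrates how constant-predictability sentences can be generated by a first-order Markov chain in which every chunk is equally predictable from the preceding chunk.

import random

# Hypothetical stand-in for the grammar of Figure 3a (not reproduced here): from every
# chunk the next chunk is drawn uniformly from a fixed set of continuations, so each
# chunk is equally predictable from the preceding one, regardless of sentence position.
initial_chunks = ["chunk_A", "chunk_B", "chunk_C"]
transitions = {
    "chunk_A": ["chunk_D", "chunk_E"],
    "chunk_B": ["chunk_D", "chunk_F"],
    "chunk_C": ["chunk_E", "chunk_F"],
    "chunk_D": ["chunk_G", "chunk_H"],
    "chunk_E": ["chunk_G", "chunk_I"],
    "chunk_F": ["chunk_H", "chunk_I"],
}

def generate_sentence(rng=random):
    # One sentence = three acoustic chunks produced by a first-order Markov walk.
    first = rng.choice(initial_chunks)
    second = rng.choice(transitions[first])
    third = rng.choice(transitions[second])
    return [first, second, third]

def generate_trial(n_sentences=10, rng=random):
    # A trial is ten separately generated sentences concatenated without acoustic gaps.
    return [chunk for _ in range(n_sentences) for chunk in generate_sentence(rng)]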
Acoustic analysis. The intensity fluctuation of the sound stimulus is characterized by its temporal envelope. To extract the temporal envelope, the sound signal is first half-wave rectified and then downsampled to 200 Hz. The Discrete Fourier Transform of the temporal envelope (without any windowing) is shown in Figure 1 and Supplementary Figure 2.
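A minimal sketch of this envelope computation is given below, assuming the stimulus waveform and its sampling rate are available; the resampling routine is one of several equivalent choices.

import numpy as np
from scipy.signal import resample_poly

def envelope_spectrum(stimulus, fs):
    # Half-wave rectify the stimulus, downsample the rectified signal to 200 Hz to
    # obtain the temporal envelope, and take its DFT without windowing.
    rectified = np.maximum(stimulus, 0.0)
    envelope = resample_poly(rectified, up=200, down=int(fs))   # envelope sampled at 200 Hz
    spectrum = np.abs(np.fft.rfft(envelope))
    freqs = np.fft.rfftfreq(envelope.size, d=1.0 / 200.0)
    return freqs, spectrum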
Experimental procedures. Seven experiments were run. Experiments 1–4 involved Chinese listeners listening to Chinese materials, experiment 5 involved English listeners listening to Chinese materials, and experiment 6 involved English listeners listening to English materials. Experiment 7 involved Chinese listeners listening to the AMS.
In all experiments except for experiment 5, listeners were instructed to detect outlier trials. At the end of each trial, listeners had to report whether it was a normal trial or an outlier trial via button press. Following the button press, the next trial was played after a delay randomized between 800 and 1,400 ms. In experiment 5, listeners performed a syllable counting task described below. Behavioral results are reported in Supplementary Table 2.
Experiment 1. Four-syllable Chinese sentences, four-syllable idioms, random syllabic sequences and backward syllabic sequences were presented in separate blocks. The order of the blocks was counterbalanced across listeners. Listeners took breaks between blocks. In each block, 20 normal trials and ten outlier trials were intermixed and presented in a random order.
Experiment 2. Four-syllable sentences, type I four-syllable verb phrases, type II four-syllable verb phrases, two-syllable noun phrases and two-syllable verb phrases were presented in separate blocks. The order of the blocks was counterbalanced across listeners. Listeners took breaks between blocks. In each block, 20 normal trials and five outlier trials were intermixed and presented in a random order.
Experiment 3. Sentences with variable durations and syntactic structures, as described above, were played in an intermixed order. Listeners took a break every 25 trials. In total, 80 normal trials and 20 outlier trials were presented.
Experiment 4. Sentences with variable NP durations, as described above, were presented in a single block, counterbalanced with three other blocks that presented language materials not analyzed here. In that block, 27 normal trials and seven outlier trials were presented. The order of the blocks was counterbalanced across listeners.
Experiment 5. Trials consisting of four-syllable sentences, four-syllable idioms, random syllabic sequences and backward syllabic sequences were intermixed and presented in a random order. Twenty normal trials for each type of material were presented. In each trial, the last one or two syllables were removed, each with 50% probability. Listeners were instructed to count the number of syllables in each trial in a cyclic way: 1, 2, 3, 4, 1, 2, 3, 4, 1, 2... The final count could only be 2 or 3, and the listeners had to report whether it was 2 or 3 at the end of each trial via button press.
Experiment 6. Four-syllable English sentences, shuffled sequences, constant predictability Markovian sentences and predictable Markovian sentences were presented in separate blocks. The order of the blocks was counterbalanced across listeners. Listeners took breaks between blocks. In each block, 22 normal trials and 8 outlier trials were intermixed and presented in a random order.
Experiment 7. The experiment involved the AMS and was divided into two sessions. In the first session, ten trials were presented (two trials from each AMS set; see the upper row in Supplementary Fig. 4b). In each trial, the last syllable was removed with 50% probability. The listeners were told that the stimulus was only a sequence of random syllables. They were asked to count the number of syllables in a cyclic way: 1, 2, 1, 2, 1, 2... and to report whether the final count was 1 or 2 at the end of each trial via button press. Since each trial contained 179 or 180 rapidly presented syllables, the listeners were not able to count accurately (mean performance 52 ± 9.7%, not significantly above chance, P > 0.8, t test). However, the listeners were asked to follow the rhythm and keep counting even when they lost count. After the first session of the experiment was finished, the listeners were told about the general structure of the AMS, and examples were given based on real Chinese sentences. In the second session of the experiment, the listeners had to learn the 5 sets of AMS separately (lower row, Supplementary Fig. 4b). For each set of the AMS, during training, the listeners listened to 20 sentences from the AMS set in a sequence, with a 300-ms gap inserted between sentences to facilitate learning. Then, the listeners listened to two trials of sentences from the same AMS set, which they had also listened to in the first session (shown by the symbol S in Supplementary Fig. 4b). They had to do the same cyclical counting task. However, they were told that the last count was 1 if the last sentence was incomplete and the last count was 2 if the last sentence was complete (mean performance 82 ± 8.0%, significantly above chance, P < 0.2, t test). At the end of the two trials, the listeners had to report the grammar of the AMS, that is, which three syllables could be the first syllable of a sentence, which three syllables could be the middle one, and which three syllables could be the last one. The grammatical roles of 77 ± 7.6% (mean ± s.e. across subjects) of syllables were reported correctly.
Neural recordings. Cortical neuromagnetic activity was recorded using a 157-channel whole-head MEG system (KIT) in a magnetically shielded room. The MEG signals were sampled at 1 kHz, with a 200-Hz low-pass filter and a 60-Hz notch filter applied online and a 0.5-Hz high-pass filter applied offline (time delay compensated). The environmental magnetic field was recorded using three reference sensors and regressed out from the MEG signals using time-shifted PCA52. Then, the MEG responses were further denoised using the blind source separation technique Denoising Source Separation (DSS)53. The MEG responses were decomposed into DSS components using a set of linear spatial filters, and the first 6 DSS components were retained for analysis and transformed back to the sensor space. DSS decomposes multi-channel MEG recordings to extract neural response components that are consistent over trials and has been demonstrated to be effective in denoising cortical responses to connected speech18,54,55. The DSS was applied to more accurately estimate the strength of neural activity phase-locked to the stimulus. Even when the DSS spatial filtering process was omitted, for the r.m.s. response over all MEG sensors, the sentential, phrasal and syllabic responses in Figure 1 were still statistically significant (P < 0.001, bootstrap).
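The following is a schematic re-implementation of the bias-based DSS idea described in refs. 52 and 53 (trial-averaged data as the bias function, generalized eigendecomposition, back-projection of the most reproducible components). It is a simplified sketch, not the authors' code; the ridge regularization is an added assumption.

import numpy as np
from scipy.linalg import eigh

def dss_denoise(trials, n_keep=6):
    # trials: float array (n_trials, n_channels, n_times). The bias function is the
    # trial average; spatial filters are ordered by trial-to-trial reproducibility.
    n_trials, n_ch, n_t = trials.shape
    X = trials - trials.mean(axis=2, keepdims=True)              # remove per-trial mean
    C0 = sum(x @ x.T for x in X) / (n_trials * n_t)              # covariance of raw data
    C0 += 1e-9 * np.trace(C0) / n_ch * np.eye(n_ch)              # small ridge (assumed)
    avg = X.mean(axis=0)
    C1 = (avg @ avg.T) / n_t                                     # covariance of trial average
    _, W = eigh(C1, C0)                                          # generalized eigenvectors
    W = W[:, ::-1][:, :n_keep]                                   # most reproducible components
    unmix = W.T                                                  # sensors -> components
    mix = np.linalg.pinv(unmix)                                  # components -> sensors
    return np.stack([mix @ (unmix @ x) for x in X])              # denoised sensor-space data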

Data analysis. Only the MEG responses to normal trials were analyzed.
Frequency domain analysis. In experiments 1, 2, 5 and 6, the linguistic structures of different hierarchies were presented at unique and constant rates, and neural tracking of those linguistic structures was analyzed in the frequency domain. For each trial, to avoid the transient response to the acoustic onset of the trial, the neural recordings were analyzed in a time window between the onset of the second sentence (or the fifth syllable if the stimulus contained no sentential structure) and the end of the trial. The single-trial responses were transformed into the frequency domain using the discrete Fourier transform (DFT). For all Chinese materials and the artificial Markovian language materials, nine sentences were analyzed in each trial, resulting in a frequency resolution of 1/9 of the sentential rate (~0.11 Hz). For the English sentences and the shuffled sequences, the trials were longer and the duration equivalent to 44 English syllables was analyzed, resulting in a frequency resolution of 1/44 of the syllabic rate, that is, 0.071 Hz.
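A minimal sketch of the single-trial frequency analysis is shown below; the sampling rate and variable names are illustrative.

import numpy as np

def response_spectrum(segment, fs):
    # segment: (n_channels, n_times) response starting at the onset of the second
    # sentence (or the fifth syllable). DFT without windowing, as described above.
    coef = np.fft.rfft(segment, axis=-1)
    freqs = np.fft.rfftfreq(segment.shape[-1], d=1.0 / fs)
    return freqs, np.abs(coef) ** 2

# For the Chinese materials, nine 1-s sentences are analyzed per trial, so the
# frequency resolution is 1/9 Hz and the sentential (1 Hz), phrasal (2 Hz) and
# syllabic (4 Hz) rates each fall on a DFT bin, e.g.:
#   freqs, power = response_spectrum(trial_segment, fs=1000.0)
#   sentential_bin = np.argmin(np.abs(freqs - 1.0))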
The response topography (Fig. 1c) showed the power of the DFT coefficients at a given frequency, and hemispheric lateralization was calculated by averaging the response power over the sensors in each hemisphere (N = 54).
Given that the properties of the neural responses to linguistic structures and background neural activity might vary in different frequency bands, to treat each frequency band equally, a separate spatial filter was designed for every frequency bin in the DFT output to optimally estimate the response strength. The linear spatial filter was the DSS filter56. The output of the DSS filter was a weighted summation over all MEG sensors, and the weights were optimized to extract neural activity consistent over trials. In brief, if the DFT of the MEG response averaged over trials is X(f) and the autocorrelation matrix of the single-trial MEG recordings is R(f), the spatial filter is w = R⁻¹(f)X(f) (see the appendix of ref. 56). The spatial filter w is a 157 × 1 vector (for the 157 sensors), the same size as X(f), and R(f) is a 157 × 157 matrix. The spatial filter could be viewed as a virtual sensor that was optimized to record phase-locked neural activity at each frequency. The power of the scalar output of the spatial filter, |Xᵀ(f)R⁻¹(f)X(f)|², was the power spectral density shown in the figures.
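A sketch of the per-frequency-bin spatial filter is given below. The stated formula is implemented literally (plain transpose rather than conjugate transpose), and estimating R(f) from the outer products of single-trial DFT coefficients is one natural reading of the description, not a detail given above.

import numpy as np

def phase_locked_power(trial_dft):
    # trial_dft: complex array (n_trials, n_channels) of single-trial DFT coefficients
    # at one frequency bin. X is the trial-averaged coefficient vector, R the
    # trial-averaged autocorrelation matrix of the single-trial coefficients, and the
    # reported quantity is |X^T R^-1 X|^2 as stated above.
    X = trial_dft.mean(axis=0)
    R = (trial_dft[:, :, None] * trial_dft[:, None, :].conj()).mean(axis=0)
    w = np.linalg.solve(R, X)                     # spatial filter w = R^-1 X (157 x 1 for this array)
    return np.abs(X @ w) ** 2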
Time domain analysis. The response to each sentence was baseline corrected based on the 100-ms period preceding the sentence onset, for each sensor. To remove the neural response to the 4-Hz isochronous syllabic rhythm and focus on the neural tracking of sentential/phrasal structures, we low-pass filtered the neural response waveforms using a 0.5-s duration linear-phase FIR filter (cut-off 3.5 Hz). The filter delay was compensated by filtering the neural signals twice, once forward and once backward. When the response power at 4 Hz was extracted separately by a Fourier analysis, it did not significantly change as a function of sentence duration (P > 0.19, one-way ANOVA). The r.m.s. of the MEG responses was calculated as the sum of the response power (that is, the square of the MEG response) over all sensors, and the r.m.s. response was further low-pass filtered by a 0.5-s duration linear-phase FIR filter (cut-off 3.5 Hz, delay compensated).
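A sketch of the low-pass filtering step using standard signal-processing routines; filter design details beyond the stated duration and cut-off are assumptions.

import numpy as np
from scipy.signal import firwin, filtfilt

def lowpass_sentence_tracking(response, fs=1000.0, cutoff=3.5, filt_dur=0.5):
    # 0.5-s linear-phase FIR low-pass filter (cut-off 3.5 Hz), applied once forward
    # and once backward so that the filter delay is compensated.
    numtaps = int(round(filt_dur * fs)) + 1       # odd-length 0.5-s FIR kernel
    b = firwin(numtaps, cutoff, fs=fs)
    return filtfilt(b, [1.0], response, axis=-1)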
A linear decoder was built to decode the duration of sentences. In the decoding analysis, the multi-channel MEG responses were compressed into a single channel, that is, the first DSS component, and the decoder solely relied on the time course of the neural response. A 2.25-s response epoch was extracted for each sentence, starting from the sentence onset. A leave-one-out cross-validation procedure was employed to evaluate the decoder's performance. Each time, the response to one sentence was used as the testing response, and the responses to all other sentences were treated as the training set. The training signals were averaged for sentences of the same duration, creating a template of the response time course for each sentence duration. The testing response was correlated with all the templates, and the category of the most correlated template was the decoder's output. For example, if the testing response was most correlated with the template for five-syllable sentences, the decoder's output would be that the testing response was generated by a five-syllable sentence.
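A sketch of the leave-one-out template decoder; array shapes and names are illustrative.

import numpy as np

def decode_duration(epochs, durations):
    # epochs: (n_sentences, n_times) single-channel responses (first DSS component),
    # 2.25 s from sentence onset; durations: duration label (e.g., number of
    # syllables) of the sentence that evoked each epoch.
    durations = np.asarray(durations)
    predictions = np.empty_like(durations)
    for i in range(len(epochs)):
        train = np.ones(len(epochs), dtype=bool)
        train[i] = False                                          # leave one sentence out
        labels = np.unique(durations[train])
        templates = np.stack([epochs[train & (durations == d)].mean(axis=0)
                              for d in labels])                   # one template per duration
        r = [np.corrcoef(epochs[i], t)[0, 1] for t in templates]
        predictions[i] = labels[int(np.argmax(r))]                # most correlated template
    return predictions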

Statistical analysis and significance tests. For spectral peaks (Figs. 1 and 2), a one-tailed paired t test was used to test whether the neural response in a frequency bin was significantly stronger than the average of the neighboring four frequency bins (two bins on each side). Such a test was applied to all frequency bins below 5 Hz, and an FDR correction for multiple comparisons was applied. Except for the analysis of the spectral peaks, two-tailed t tests were applied. For all the t tests applied in this study, data from the two conditions had comparable variance and showed no clear deviation from the normal distribution when checking the histograms. If the t test was replaced by a bias-corrected and accelerated bootstrap, all results remained significant.
In Figure 4, the s.e.m. over subjects was calculated using the bias-corrected and accelerated bootstrap57. In the bootstrap procedure, all the subjects were resampled with replacement 2,000 times. The s.d. of the resampled results was taken as the s.e.m. In Figure 4d, the statistical difference between the two curves, that is, the three-syllable NP condition and the four-syllable NP condition, was also tested using bootstrap. For each subject, the difference between the responses in these two conditions was taken. At each time point, the response difference was resampled with replacement 2,000 times across the eight subjects, and the percentage of the resampled differences that were larger or smaller than 0 (the smaller value) was taken as the significance level. An FDR correction was applied to the bootstrap results.
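The following sketch illustrates the spectral-peak test and a simplified subject-resampling step. The Benjamini-Hochberg procedure stands in for the unspecified FDR method, and the plain resampling s.e.m. omits the bias correction and acceleration of the method cited above.

import numpy as np
from scipy.stats import ttest_rel

def spectral_peak_test(power, n_neighbors=2):
    # power: (n_subjects, n_frequency_bins) response power; bins above 5 Hz should be
    # excluded by the caller. Each bin is compared with the mean of the two neighboring
    # bins on each side using a one-tailed paired t test, then corrected with the
    # Benjamini-Hochberg FDR procedure (an assumed choice of FDR method).
    n_sub, n_bins = power.shape
    pvals = np.full(n_bins, np.nan)
    for k in range(n_neighbors, n_bins - n_neighbors):
        neighbors = np.concatenate([power[:, k - n_neighbors:k],
                                    power[:, k + 1:k + 1 + n_neighbors]], axis=1)
        t, p_two = ttest_rel(power[:, k], neighbors.mean(axis=1))
        pvals[k] = p_two / 2 if t > 0 else 1 - p_two / 2          # one-tailed p value
    valid = np.flatnonzero(~np.isnan(pvals))
    p = pvals[valid]
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    adjusted = np.clip(np.minimum.accumulate(ranked[::-1])[::-1], 0, 1)
    corrected = np.full(n_bins, np.nan)
    corrected[valid[order]] = adjusted
    return corrected

def bootstrap_sem(values, rng, n_boot=2000):
    # Simplified subject-resampling step: the s.d. of the resampled means is the s.e.m.
    means = [rng.choice(values, size=len(values), replace=True).mean() for _ in range(n_boot)]
    return np.std(means)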
Code availability. The computer code used for the MEG analyses is available upon request.
Neural source localization using ECoG. ECoG participants. ECoG recordings were obtained from five patients (three female; average 33.6 years old, range 19–42 years old) diagnosed with pharmaco-resistant epilepsy and undergoing clinically motivated subdural electrode recordings at the New York University Langone Medical Center. Patients provided informed consent before participating in the study in accordance with the Institutional Review Board at the New York University Langone Medical Center. Three patients were right-handed and two were left-handed. All patients were native English speakers (one of them was a bilingual Spanish/English speaker), and all patients were left-hemisphere dominant for language.
ECoG recordings. Participants were implanted with 96–179 platinum-iridium clinical subdural grid or strip electrodes (three patients with a left-hemisphere implant and two patients with a right-hemisphere implant; additional depth electrodes were implanted for some patients but not analyzed). The electrode locations per patient are shown in Supplementary Figure 5. Electrode localization followed previously described procedures58. In brief, for each patient we obtained pre-op and post-op T1-weighted MRIs, which were co-registered with each other and normalized to an MNI-152 template, allowing the extraction of the electrode locations in MNI space. The ECoG signals were recorded with a Nicolet clinical amplifier at a sampling rate of 512 Hz. The ECoG recordings were re-referenced to the grand average over all electrodes (after removing artifact-laden or noisy channels). Electrodes from different subjects were pooled per hemisphere, resulting in 385/261 electrodes in the left/right hemispheres. High-gamma activity was extracted by high-pass filtering the ECoG signal above 70 Hz (with additional notch filters at 120 and 180 Hz). The energy envelope of high-gamma activity was extracted by taking the square of the high-gamma response waveform.
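A sketch of the high-gamma extraction is given below. The filter order and notch bandwidth are assumptions; only the 70-Hz high-pass corner, the 120- and 180-Hz notches and the squaring follow the description above.

import numpy as np
from scipy.signal import butter, iirnotch, filtfilt

def high_gamma_power(ecog, fs=512.0):
    # High-pass the ECoG signal above 70 Hz, notch out the 120- and 180-Hz line
    # harmonics, and square the waveform to obtain the high-gamma energy envelope.
    b_hp, a_hp = butter(4, 70.0, btype="highpass", fs=fs)
    x = filtfilt(b_hp, a_hp, ecog, axis=-1)
    for f0 in (120.0, 180.0):
        b_n, a_n = iirnotch(f0, Q=30.0, fs=fs)
        x = filtfilt(b_n, a_n, x, axis=-1)
    return x ** 2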
ECoG procedures. Participants performed the same task as the healthy subjects in the MEG experiment (Fig. 2e,f). In brief, they listened to a set of English sentences and control stimuli in the first and second blocks. The control stimulus, that is, the shuffled sequences, preserved the syllabic-level acoustic rhythm of the English sentences but contained no hierarchical linguistic structure. The procedure was the same as in the MEG experiment, except for a familiarization session in which the subjects listened to individual sentences with visual feedback. Sixty trials of sentences and control stimuli were played. The ECoG data from each electrode were analyzed separately and converted to the frequency domain via the DFT (frequency resolution 0.071 Hz). A significant response at the syllabic, phrasal or sentential rate was reported if the response power at the target frequency was stronger than the response power averaged over neighboring frequency bins (0.5-Hz range above and below the target frequency). The significance level for each electrode was first determined based on a bootstrap procedure that randomly sampled the 60 trials 1,000 times and then underwent FDR correction for multiple comparisons across all electrodes in the same hemisphere.
A Supplementary Methods Checklist is available.
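One plausible implementation of the per-electrode bootstrap test described above is sketched below; the exact resampling statistic is not specified, so the trial-wise difference between the target bin and its neighboring bins is an assumption.

import numpy as np

def electrode_peak_significance(trial_power, freqs, target, rng, n_boot=1000):
    # trial_power: (n_trials, n_frequency_bins) single-trial response power for one
    # electrode; freqs: DFT bin frequencies; target: syllabic, phrasal or sentential
    # rate. The trial-wise difference between the target bin and the mean of the
    # neighboring bins (within 0.5 Hz above and below) is resampled over trials.
    target_bin = int(np.argmin(np.abs(freqs - target)))
    neighbors = (np.abs(freqs - target) <= 0.5) & (np.arange(len(freqs)) != target_bin)
    diffs = trial_power[:, target_bin] - trial_power[:, neighbors].mean(axis=1)
    boot_means = [rng.choice(diffs, size=len(diffs), replace=True).mean()
                  for _ in range(n_boot)]
    return float(np.mean(np.array(boot_means) <= 0))   # one-sided bootstrap p value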
51. Oldfield, R.C. The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia 9, 97–113 (1971).
52. de Cheveigné, A. & Simon, J.Z. Denoising based on time-shift PCA. J. Neurosci. Methods 165, 297–305 (2007).
53. de Cheveigné, A. & Simon, J.Z. Denoising based on spatial filtering. J. Neurosci. Methods 171, 331–339 (2008).
54. Ding, N. & Simon, J.Z. Adaptive temporal encoding leads to a background-insensitive cortical representation of speech. J. Neurosci. 33, 5728–5735 (2013).
55. Ding, N. & Simon, J.Z. Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. J. Neurophysiol. 107, 78–89 (2012).
56. Wang, Y. et al. Sensitivity to temporal modulation rate and spectral bandwidth in the human auditory system: MEG evidence. J. Neurophysiol. 107, 2033–2041 (2012).
57. Efron, B. & Tibshirani, R. An Introduction to the Bootstrap (CRC Press, 1993).
58. Yang, A.I. et al. Localization of dense intracranial electrode arrays using magnetic resonance imaging. Neuroimage 63, 157–165 (2012).
