
Relationship between tongue positions and formant frequencies in female speakers

Jimin Lee, Susan Shaiman, and Gary Weismer

Citation: The Journal of the Acoustical Society of America 139, 426 (2016); doi: 10.1121/1.4939894
View online: http://dx.doi.org/10.1121/1.4939894
View Table of Contents: http://asa.scitation.org/toc/jas/139/1
Published by the Acoustical Society of America

Relationship between tongue positions and formant frequencies
in female speakers
Jimin Lee,1,a) Susan Shaiman,2 and Gary Weismer3
1 Department of Communication Sciences and Disorders, The Pennsylvania State University, 404A Ford Building, University Park, Pennsylvania 16802, USA
2 Department of Communication Science and Disorders, University of Pittsburgh, 4033 Forbes Tower, Pittsburgh, Pennsylvania 15260, USA
3 Department of Communication Sciences and Disorders, University of Wisconsin-Madison, 1975 Willow Drive, Madison, Wisconsin 53706, USA
(Received 10 November 2014; revised 15 December 2015; accepted 31 December 2015; published
online 21 January 2016)
This study examined the relationship (1) between acoustic vowel space and the corresponding
tongue kinematic vowel space and (2) between formant frequencies (F1 and F2) and tongue x-y
coordinates for the same time sampling point. Thirteen healthy female adults participated in this
study. Electromagnetic articulography and synchronized acoustic recordings were utilized to obtain
vowel acoustic and tongue kinematic data across ten speech tasks. Intra-speaker analyses showed
that for 10 of the 13 speakers the acoustic vowel space was moderately to highly correlated with
tongue kinematic vowel space; much weaker correlations were obtained for inter-speaker analyses.
Correlations of individual formants with tongue positions showed that F1 varied strongly with
tongue position variations in the y dimension, whereas F2 was correlated in equal magnitude with
variations in the x and y positions. For within-speaker analyses, the size of the acoustic vowel space
is likely to provide a reasonable inference of size of the tongue working space for most speakers;
unfortunately there is no a priori, obvious way to identify the speakers for whom the covariation is
not significant. A second conclusion is that F1 variations reflect tongue height, but F2 is a much
more complex reflection of tongue variation in both dimensions.
© 2016 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4939894]

[LK] Pages: 426–440

I. INTRODUCTION

The place and degree of vocal tract constriction play a significant role in vowel production and differ by target vowels (Fant, 1960). Vocal tract configuration for vowel production involves the whole vocal cavity (e.g., lips, tongue, hard palate, pharyngeal wall, etc.). However, the place and degree of vocal tract constriction are primarily controlled by position of the tongue (Stevens and House, 1955; Fant, 1960). As a result of the relationship between the place and degree of vocal tract constriction for vowels and formant frequencies, assumptions of simplified relationships between tongue position and formant frequencies have been generalized in the field. Traditional and more contemporary tube models (Fant, 1960; Arai, 2012), electrical circuit models (Stevens and House, 1955), human reproductions of Fant’s nomograms (Ladefoged and Bladon, 1982), and a recent, joint lingual kinematic and formant frequency analysis (Wang et al., 2013) largely support these assumptions. The assumptions are that in general the first formant frequency (F1) increases as the constricted area becomes larger, and the second formant frequency (F2) increases as the constricted area is located more anteriorly.

This understanding of the relationship between the first two formant frequencies and tongue position has influenced the interpretation of acoustic vowel space areas generated from F1 and F2 measurements of the four English corner vowels. Specifically, acoustic vowel space areas have been interpreted as reflecting articulatory working spaces (Neel, 2008). The interpretation is, all other things being as nearly equal as possible, a larger acoustic vowel space implies greater average differences between constriction degree and location across the corner vowels, as compared to a smaller acoustic vowel space (see relevant analyses in Wang et al., 2013). This interpretation has had a significant impact on various fields, especially in speech-language pathology and more general studies of acoustic phonetics.

In speech-language pathology, acoustic vowel space has been utilized to characterize speech motor skills of speakers with various disorders, such as motor speech disorders and hearing impairment. Across many studies and disorders, a strong relationship between size of the acoustic vowel space and speech intelligibility has been documented (Weismer et al., 2001; Tjaden and Wilding, 2004). Some researchers have argued that measures of the acoustic vowel space are sensitive to subtle, disease-related reductions in articulatory function even in the absence of perceptual evidence of a disturbance in vowel production (Skodda et al., 2012). Even among normal speakers, a relationship between size of the acoustic vowel space and speech intelligibility has been claimed (Bradlow et al., 1996). The results of these studies provide support for a meaningful link between acoustic and articulatory measures of vowel production, with the

a) Electronic mail: JXL91@psu.edu

426 J. Acoust. Soc. Am. 139 (1), January 2016 0001-4966/2016/139(1)/426/15/$30.00 © 2016 Acoustical Society of America
implication for therapeutic enhancements of speech intelligibility via well-differentiated tongue placements and movements for vowels.

In work on acoustic phonetics, researchers have reported an association between acoustic vowel space and manipulations of speech rate, intensity, and clarity. For example, larger acoustic vowel spaces were observed in clear speech, which was interpreted as an outcome of exaggerated articulatory behavior and therefore greater distinctiveness between the corner vowels (Smiljanic and Bradlow, 2005). Larger acoustic vowel spaces were also observed in slower speech when compared to faster speech (Fourakis, 1991), and in louder as compared to habitual speech (Tjaden and Wilding, 2004). In addition to rate, intensity, and clarity, consonant environment was also found to influence the acoustic vowel space. Iivonen (1995) reported different dispersion patterns of Finnish vowels in an F1-F2 plot, depending on consonant context. For example, vowels in the /h/-vowel-/h/ context generated a larger vowel space than vowels in the alveolar consonant context (/t/-vowel-/t/). Iivonen’s interpretation was that smaller lingual distinctions between vowels were made in the /t/-vowel-/t/ context. The same differences between acoustic vowel spaces for /h/-vowel-/d/ (/hVd/) versus /d/-vowel-/d/ (/dVd/) contexts can be extrapolated from data published by Stevens and House (1963) and Hillenbrand et al. (2001).

The traditional method to obtain the size of the acoustic vowel space is to measure F1 and F2 at the temporal midpoint (or a nearby point in time) of corner vowels. The midpoint is measured to minimize potential coarticulatory effects from surrounding consonants. As noted above, vowel spaces measured in this way have been reported as significant predictors of speech intelligibility in speakers with various disorders (e.g., dysarthria, hearing-impairment).

In previous studies, the relationship between formant frequencies and tongue positions has been observed descriptively and by means of dimension-reducing statistical analysis. Tongue kinematic vowel space was examined in a single subject by Alfonso and Baer (1982) using cinefluorography, and in four speakers with and without Down syndrome by Bunton and Leddy (2011) using x-ray microbeam data (Westbury, 1994). Bunton and Leddy (2011) reported that the relationships among acoustic vowel space, articulatory working space (tongue kinematic vowel space), and speech intelligibility were descriptively positive, that is, larger acoustic vowel spaces were associated with larger articulatory spaces and increased speech intelligibility. More recently, Wang et al. (2013) quantified articulatory vowel space in neurologically healthy women using a signal-processing scheme in which different vowel categories were characterized as shapes derived from tongue position data across the duration of the vowels. Wang et al. (2013) did not test the relationship between acoustic and tongue kinematic vowel spaces statistically; however, the dispersion pattern of articulatory vowel shapes derived in Wang et al. (2013) closely resembled an acoustic vowel space constructed from formant frequency measures collected synchronously with the kinematic data. The joint kinematic-acoustic analysis in Wang et al. (2013) added validity to the assumption that distinctiveness of vowel formant frequencies serves as a reasonable proxy for the distinctiveness of underlying lingual positions.

Despite the frequent inferences made from formant frequencies to tongue positions for vowels, empirical demonstration of the goodness of these inferences has not received adequate attention. If the generalized concept of a relationship between the first two formant frequencies and tongue position is true, the variance of one variable (e.g., formant frequency) that is explained by another (e.g., the x or y coordinates for tongue position) should be substantial. The current study approaches this question from two perspectives. First, the relationship of acoustic and articulatory vowel spaces is really a derived measure from multiple correspondences between tongue positions and formant frequencies: if acoustic vowel space as measured in the traditional method reflects tongue kinematic vowel space, the correlation between the two should be high when both are measured at (or very close to) the same time sampling point. Second, following work of Ladefoged et al. (1978), McGowan and Berger (2009), Mefferd and Green (2010), Dromey et al. (2013), and Wang et al. (2013), specific relationships are explored between tongue position variation in the x and y dimensions and variation of the first and second formant frequencies. The first question concerns the ability to infer lingual working space for vowels from acoustic vowel spaces, the second question the ability to infer specific variations in the dimensions of tongue height and advancement from variation in F1 and F2.

One of the challenges in testing the relationship between acoustic and tongue kinematic vowel spaces is the broad issue of differences across speakers. Speakers have different vocal tract sizes, the effect of which on formant frequencies is well documented (Lee et al., 1999). Evidence also exists for different articulatory kinematic patterns in male and female speakers, even when producing the same segments (Simpson, 2001). In addition, variable articulatory patterns within and across gender may depend on relative lengths of the oral and pharyngeal tracts (Fuchs et al., 2008). Like Wang et al. (2013), the present study used female speakers only to control at least one major variable (sex-related vocal tract size) in the determination of articulatory kinematic-vowel acoustic relations. In addition, to obtain variation in size of the vowel space but maintain the ability to perform intra-speaker analyses and therefore control the potential effect of variation in vocal tract size on either articulatory or acoustic data, or both, ten different speech tasks (two consonant environments combined with five levels of rate and intensity) were utilized to evaluate within-speaker covariation of lingual position and vowel formant frequency data. A similar manipulation was employed by Mefferd and Green (2010) to study the relationship between distance traveled by a single point on the tongue and the change in F1-F2 space for a low-back to high-front vowel sequence.

In summary, acoustic vowel space has been claimed and assumed to reflect tongue kinematic space; however, this relationship has been tested or quantified minimally in speech production. The following research questions were addressed in the current study:

(1) What is the relationship between the acoustic vowel space and tongue kinematic vowel space at both intra- and inter-speaker levels of analysis? Consistent with the typical concept of vowel “targets” and their use in quantification of size of the vowel space, it was hypothesized that at a single measurement point in time there is a statistically positive relationship between acoustically and kinematically determined vowel spaces. The strength of this relationship, both within and across speakers, remains an open question.
(2) At a more fine-grained level, what is the relationship between formant frequencies and tongue positions for vowel production? For the /hVd/ frame and ten American English vowels, it was hypothesized that a large amount of variance in F1 can be explained by variations in inferred tongue height (from the y dimension of tongue position), and in F2 by variations in inferred tongue advancement (from the x dimension of tongue position).

II. METHOD

A. Participants

Thirteen healthy female adults between 20 and 23 years of age participated in this study. All participants were from the Middle-Atlantic Northeast region of the United States (12 from Pennsylvania and one from Syracuse, New York). All participants met the following criteria: (a) no known speech, language, or cognitive disorders; (b) native speaker of American English; and (c) normal hearing as evidenced by passing a pure tone hearing screening bilaterally at 25 dB for the frequencies 500, 1000, 2000, and 4000 Hz. The current study was approved by the Institutional Review Board of the University of Pittsburgh.

B. Materials

Each target word contained one of the four corner vowels /i, u, ɑ, æ/ in /hVd/ and /dVd/ consonant environments to generate vowel spaces for two different phonetic contexts. In addition, the monophthongs /ɪ, ɛ, ʊ, ɔ, ʌ, o/ were collected in the /hVd/ frame to provide a finer test of the relationship between individual formant frequencies and tongue positions. The addition of non-corner vowels for this finer-grained analysis (hypothesis 2, above) was deemed necessary because the corner vowels alone would yield confirmation of the hypothesis (i.e., only the extremes of the vowel space would be included in the analysis, forcing high covariation between the acoustic and kinematic data). All /hVd/ and /dVd/ stimuli were embedded in the carrier phrase “I say a ___ again.” Participants read the materials from English orthography. Before the recording, all participants read out all the stimuli and practiced accurate pronunciation of the target words. Participants practiced less common and/or confusing words such as “dud” /dʌd/ and “hod” /hɑd/ until they produced the desired vowels.

Five different speaking styles (habitual, fast, slow, soft, and loud) were employed in this study, and evoked in the same fashion as in previous studies (Tjaden and Wilding, 2004; Mefferd and Green, 2010). For rate variation, speakers were asked to halve (slow speech) or double (fast speech) their habitual rate to produce the varied-rate conditions. Similarly, the varied loudness conditions were elicited by requesting speakers to halve (soft condition) or double (loud condition) their habitual speaking loudness, with the added proviso that the soft condition be produced without whispering. Each target word was repeated three times in each condition. The combination of two consonant environments with five different speaking styles was employed to obtain a varied range of acoustic and tongue kinematic vowel spaces for intra-speaker analyses. The habitual condition was always produced first; thereafter each speaker produced conditions and words in a randomized order.

C. Procedures

The Carstens AG-200, two-dimensional electromagnetic articulograph (EMA), was used to record tongue positions with a synchronized audio recording. Three sensors were attached to each speaker’s tongue along the midsagittal plane: one close to the tongue tip, one on the body, and one on the dorsum. The tongue tip sensor was positioned approximately 10 mm back from the tongue tip, the tongue body sensor approximately 15 mm back from the tongue tip sensor, and the tongue dorsum sensor approximately 15 mm back from the tongue body sensor. Euclidean distances between tongue sensors were measured with EMA sensor location data while speakers were in a still and relaxed position with their mouth closed. The average Euclidean distance between tongue tip and body sensors across the 13 speakers was 17 mm with a standard deviation of 5 mm; the average distance between tongue body and dorsum sensors was 15 mm with a standard deviation of 3 mm. Two reference sensors were used, one on the bridge of the nose and the other on the gingiva above the two upper central incisors. A bite plane was recorded before the speech tasks for post-data-collection processing.

Each speaker had a 3-min warm-up time at the beginning of the session to get used to the sensors on the tongue while speaking. During the warm-up time, speakers produced conversational speech and the Rainbow Passage (Fairbanks, 1960). A synchronized speech signal was recorded via a lavalier microphone placed 4 in. from the subject’s mouth. The speech acoustic signal was digitized with 12-bit resolution at 16 000 Hz and stored synchronously with kinematic data on a computer hard-drive. Kinematic data were sampled at 200 Hz and stored directly on a computer hard-drive. On the basis of this sampling rate for kinematic data, two consecutive kinematic data points were separated in time by 5 ms (see Sec. II D).

D. Analysis of acoustic data

Acoustic measures were obtained from the digital speech samples using a wideband spectrographic display in TF32 (Milenkovic, 2002), following established measurement criteria (Kent and Read, 2001, pp. 71–137). The following acoustic measurements were obtained for each of the target stimulus words produced by each participant: vowel duration; average dB root mean square (RMS) across the

vowel duration; and the first two formant frequencies (F1 and F2) at the temporal midpoint of the vowel.

When locating the temporal midpoint of the vowel, the time sampling point for formant frequency measures was adjusted to the nearest time sample for the kinematic data. This adjustment required changing the sampling point for formant frequency measures by no more than 2.5 ms from the “true” temporal midpoint of the vowel. F1 and F2 were measured in hertz using both the spectrographic and spectrum display in TF32, with a 30 ms window centered at the adjusted sampling point for the acoustic data. The formant frequency values were obtained manually, using the LPC solutions on the spectrum display (superimposed on the fast Fourier transform) together with the spectrographic display.

Vowel durations were measured primarily to verify the speaking rate manipulations, and relative vowel sound pressure levels were measured to verify the speech loudness manipulations. A total of 10 920 acoustic measurements were made in the current study: [2 (consonant environments) × 4 (corner vowels) × 5 (speaking conditions) × 3 (repetitions) × 13 (participants) × 4 (acoustic measures)] + [6 (other monophthongs) × 5 (speaking conditions) × 3 (repetitions) × 13 (participants) × 4 (acoustic measures)]. From these formant frequency measurements, acoustic vowel space area was calculated in Hz² using the following polygon area formula described by Johnson et al. (2004):

\[
\mathrm{area} = \tfrac{1}{2}\left(F1_{/i/}\,F2_{/u/} - F1_{/u/}\,F2_{/i/}\right) + \tfrac{1}{2}\left(F1_{/u/}\,F2_{/ɑ/} - F1_{/ɑ/}\,F2_{/u/}\right) + \tfrac{1}{2}\left(F1_{/ɑ/}\,F2_{/æ/} - F1_{/æ/}\,F2_{/ɑ/}\right) + \tfrac{1}{2}\left(F1_{/æ/}\,F2_{/i/} - F1_{/i/}\,F2_{/æ/}\right)
\]

1. Reliability

Inter-judge reliability was obtained for all acoustic measures. A second judge, trained in acoustic analyses, measured 12% of the full set of acoustic data. Pearson product-moment coefficients and absolute differences between the first and second sets of measures were calculated to obtain reliability data. Correlation values for the four acoustic variables ranged between 0.98 and 0.99. The mean absolute differences were 9.2 ms (vowel duration), 0.25 dB (dB RMS), 14 Hz (F1), and 27 Hz (F2). The reliability data for formant frequency measures are consistent with those reported in prior investigations and are within the acceptable range of error (Monsen and Engebretson, 1983).

E. Analysis of kinematic data

Kinematic data were rotated according to each speaker’s bite plane and smoothed using the Tailor software program (Carstens Medizinelektronik GmbH, Germany) prior to the analyses. The kinematic data were rotated so that the bite plane of each speaker was parallel to the x axis.

1. Positional normalization of tongue kinematic data

Secondary to the issue of individual differences across speakers, the normalization of articulator position needs to be handled carefully for data comparison across speakers. Depending on the purpose of the study, the unit of movement data has been transformed and mapped onto a model of an average speaker (Geng and Mooshammer, 2009; Wang et al., 2013), or the occlusal (bite plane) and central maxillary planes have been used with data expressed within this coordinate system (Westbury, 1994).

For the purpose of the current study, the raw unit (millimeters) of tongue position measurement was maintained for evaluation of covariance between articulatory positions and formant frequencies (including the derived vowel spaces from these measures). Therefore, previously utilized methods that require unit changes for articulatory positional normalization were not employed. Initially, a reference sensor attached to the gingiva between the upper incisors was applied as the origin of the coordinate system (Westbury, 1994). However, in the treated data, a significant discrepancy in tongue position data along the x axis was observed in one speaker when compared with other speakers. Specifically, speaker P3 presented more posterior tongue body x positions (approximately 20 mm on average) than other speakers. The distribution of data points within the outlying cluster along the x axis was found to be consistent with the distributions observed in other speakers. It was speculated that the outlying data points for P3 were secondary to anatomical differences in the speaker such as the thickness or shape of the maxillary gingiva or alveolar process. In addition, even among the data points of the remaining 12 speakers, slight discrepancies in tongue position data were observed [e.g., superior and anterior tongue positions (high y and low x coordinate values) in P1; superior tongue positions in P2 and P10; and inferior tongue positions (low y) in P6].

To resolve this issue, the centroid position was calculated across the corner vowels, without unit transformation, within speakers for each tongue sensor. The centroid value of each speaker represents a neutral position of each tongue region during vowel production. Use of the centroid for the tongue kinematic data did not influence the units of the kinematic vowel space, hence, the relationship between acoustic and tongue kinematic vowel space remained the same after processing the data according to the centroid. In sum, in the current study (1) kinematic data were rotated according to each speaker’s bite plane; (2) the centroid position for each speaker was calculated for each tongue sensor based on the corner vowel position data across the ten speech tasks; and (3) the centroid value of each speaker was utilized as the coordinate of the origin (0, 0). For example, if the original tongue body x and y values of /ɑ/ for a given speaker were (130, 115) and the calculated centroid values of the tongue body sensor were (120, 125), the converted tongue body x and y values of /ɑ/ in P1 were (10, −10) (130 − 120 and 115 − 125, respectively). Centroid values for speaker normalization have been used previously in treating vowel acoustic data (Nearey, 1978). For the current study, we employed this approach to resolve the tongue positional discrepancy discussed above. In this way, the relative positional changes across ten speech tasks within a speaker were maintained, and the comparison of tongue position across speakers was feasible without losing the original unit.

For each vowel, the x-y coordinate data of each sensor were measured using the EMALYSE program (Carstens Medizinelektronik GmbH, Germany) at the time sampling point where the formant frequencies were measured. With the

TABLE I. Pearson correlation coefficient values (r-values) between acoustic vowel space and kinematic vowel space across speakers (inter-speaker analyses) in each speaking style. The sample size of 26 in each speaking style is from 13 speakers and two consonant environments.

Condition        Sample size (n)   Tongue tip vowel space   Tongue body vowel space   Tongue dorsum vowel space
Habitual         26                0.580a                   0.554a                    0.483b
Fast             26                0.397b                   0.406b                    0.429b
Loud             26                0.294                    0.443b                    0.409b
Slow             26                0.193                    0.204                     0.342
Soft             26                0.283                    0.232                     0.448b
All conditions   130               0.384b                   0.414b                    0.451b

a p-value < 0.01.
b p-value < 0.05.
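
The vowel space areas correlated in Table I rest on the polygon area formula of Sec. II D and, for the kinematic data, on the centroid-based origin described in Sec. II E. The following is a minimal sketch of both steps, not the authors' software. The function names, the dictionary keys ("a" standing for /ɑ/ and "ae" for /æ/), and the use of an absolute value are assumptions introduced here for illustration.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]

# Corner-vowel order taken from the formula in Sec. II D: /i/ -> /u/ -> /ɑ/ -> /æ/.
# The ASCII keys "a" and "ae" are hypothetical stand-ins for /ɑ/ and /æ/.
CORNER_ORDER = ("i", "u", "a", "ae")


def vowel_space_area(corners: Dict[str, Point]) -> float:
    """Polygon area of the quadrilateral through the four corner vowels.

    Each point is (F1, F2) in Hz for the acoustic space, or (x, y) in mm
    for the kinematic space, so the result is in Hz^2 or mm^2.
    """
    pts = [corners[v] for v in CORNER_ORDER]
    twice_area = 0.0
    # Sum the cross terms over consecutive corner pairs, closing the polygon.
    for (p0, q0), (p1, q1) in zip(pts, pts[1:] + pts[:1]):
        twice_area += p0 * q1 - p1 * q0
    # Assumption: the absolute value is taken so the result does not depend
    # on whether the corners run clockwise or counterclockwise.
    return abs(twice_area) / 2.0


def centroid(corners: Dict[str, Point]) -> Point:
    """Mean position of the four corner vowels."""
    xs = [corners[v][0] for v in CORNER_ORDER]
    ys = [corners[v][1] for v in CORNER_ORDER]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def recenter(points: Dict[str, Point], origin: Point) -> Dict[str, Point]:
    """Express positions relative to a new origin, keeping the original unit."""
    ox, oy = origin
    return {v: (x - ox, y - oy) for v, (x, y) in points.items()}


if __name__ == "__main__":
    # Worked example from Sec. II E: (130, 115) with centroid (120, 125).
    print(recenter({"a": (130, 115)}, (120, 125)))  # {'a': (10, -10)}
```

Because recentering is a pure translation, `vowel_space_area` returns the same value before and after `recenter`, which is consistent with the statement in Sec. II E that the centroid step leaves the kinematic vowel space unchanged.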

x-y coordinate data of the four corner vowels, tongue kinematic vowel space was calculated for each of the three sensors using the formula presented above (Sec. II D; Johnson et al., 2004). The x and y coordinate values were used in place of F1 and F2 values to obtain tongue kinematic vowel spaces.

F. Experimental design and analysis

The current study employed a 2 (two consonants) × 5 (rate and loudness) repeated measures analysis for vowel duration and dB RMS measures to examine the protocol for eliciting the targeted rate and intensity changes. This design was also employed to evaluate variation of acoustic vowel space and all three tongue kinematic vowel spaces (tongue tip, tongue body, and tongue dorsum) across speaking styles. When a significant main effect was observed across speaking styles, each speaking style was compared to the habitual speaking style using post hoc tests.

To address the research question regarding the acoustic and tongue kinematic vowel space relationship at the intra-speaker level, Spearman’s rank correlation coefficients were utilized because of the relatively low n (n = 10, from ten speech tasks). To examine the relationship between acoustic and tongue kinematic vowel space at the inter-speaker level, Pearson product-moment correlation coefficients were used because of the larger n (n = 13 speakers × 2 consonant environments); n’s for the inter- and intra-speaker analyses of vowel space covariation are reported in Tables I and IV, respectively.

To examine the relationship between individual formant frequencies (F1, F2) and individual tongue positions (x and y positions), Pearson product-moment correlation coefficients and a simultaneous method of multiple linear regression were utilized. For the regression models, tongue x and y values were entered into two different blocks. The same model with a different order of variables within the block (e.g., tongue x in the first block, tongue y in the second block; tongue y in the first block and tongue x in the second block) was administered to obtain variance partitions. More specifically, the method was chosen to obtain variance values in the formant frequencies explained by only tongue x position, only tongue y position, and both tongue x and y positions (shared variance) by examining incremental R² changes. An alpha level of 0.05 was employed in all analyses as the criterion of statistical significance.

III. RESULTS

A. Verification of speaking style manipulations

The data showed that each condition yielded the expected differences compared to the habitual condition (Table III). Omnibus tests for vowel duration [F(4,96) = 73.02, p < 0.001] and sound pressure level [F(4,96) = 176.86, p < 0.001] were significant. In the /hVd/ condition, the average vowel duration was 46 ms shorter in the fast speaking style [F(1,24) = 62.07, p < 0.001] and 79 ms longer in the slow speaking style [F(1,24) = 72.97, p < 0.001] as compared to the habitual speaking style. The loud and soft speaking styles were 8 dB greater [F(1,24) = 141.80, p < 0.001] and 5 dB less [F(1,24) = 152.15, p < 0.001], respectively, when compared to the habitual speaking style. Task effects for the /dVd/ condition were similar in magnitude.

B. Relationship of acoustic and articulatory spaces: Corner vowels

A significant consonant environment effect was observed for the acoustic vowel space and all tongue kinematic vowel spaces. The direction of the effect was that, in the /dVd/ consonant environment, both acoustic and tongue kinematic spaces (tongue tip, body, and dorsum) were significantly smaller than in the /hVd/ environment. Although this effect (reduction of vowel space from /hVd/ to /dVd/ contexts) can be inferred
TABLE II. Pearson r-values between formant frequency and tongue x-y coordinates across 13 speakers.

                                                         Tongue regions
Formant frequency   Tongue coordinate   Sample size (n)   Tongue tip   Tongue body   Tongue dorsum
F1                  x                   2715              0.436a       0.397a        0.199a
F1                  y                   2715              0.788a       0.821a        0.818a
F2                  x                   2710              0.601a       0.620a        0.413a
F2                  y                   2710              0.546a       0.666a        0.593a

a p-value < 0.01.
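
The r-values in Table II and the variance partition described in Sec. II F (variance explained by tongue x only, by tongue y only, and shared) can be illustrated with a small sketch. It is not the authors' statistical procedure or software. It assumes exactly two predictors and uses the closed-form R² for a two-predictor linear model, so that each "unique" share equals the incremental R² when that predictor is entered in the second block. All function names and dictionary keys are hypothetical.

```python
import math
from typing import Dict, Sequence


def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partition_variance(
    formant: Sequence[float], x: Sequence[float], y: Sequence[float]
) -> Dict[str, float]:
    """Split R^2 of formant ~ x + y into unique-x, unique-y, and shared parts.

    Assumes x and y are not perfectly collinear (otherwise the denominator
    below is zero).
    """
    r_fx = pearson_r(formant, x)
    r_fy = pearson_r(formant, y)
    r_xy = pearson_r(x, y)
    r2_x_only = r_fx ** 2  # R^2 with x as the sole predictor
    r2_y_only = r_fy ** 2  # R^2 with y as the sole predictor
    # R^2 of the two-predictor model, written in terms of pairwise correlations.
    r2_full = (r2_x_only + r2_y_only - 2.0 * r_fx * r_fy * r_xy) / (1.0 - r_xy ** 2)
    unique_x = r2_full - r2_y_only  # increment when x enters after y
    unique_y = r2_full - r2_x_only  # increment when y enters after x
    shared = r2_full - unique_x - unique_y
    return {
        "full": r2_full,
        "unique_x": unique_x,
        "unique_y": unique_y,
        "shared": shared,
    }
```

In this formulation the shared term can come out negative when one predictor suppresses the other, so it is best read as a bookkeeping quantity rather than a proportion in the strict sense.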

TABLE III. Average and standard deviation values of vowel duration (ms), vowel intensity (dB), acoustic and tongue kinematic vowel space (Hz² and mm², respectively) across the 13 speakers. The dB values are expressed as differences from the habitual /hVd/ condition (reference value = 0).

Condition                  Duration (ms)   RMS (dB)   Acoustic vowel space (Hz²)   Tongue tip (mm²)   Tongue body (mm²)   Tongue dorsum (mm²)
/hVd/ Habitual   Average   153.00          0          392 625                      39.28              54.48               42.69
                 SD        40.27           4.94       97 758                       25.57              34.13               32.79
/hVd/ Fast       Average   107.24          0.15       285 774                      23.94              33.65               28.92
                 SD        28.54           4.73       117 916                      15.86              21.62               15.36
/hVd/ Loud       Average   149.62          7.56       403 856                      37.51              59.15               47.71
                 SD        25.45           4.40       122 316                      23.93              39.98               30.15
/hVd/ Slow       Average   232.18          3.03       412 433                      25.75              44.46               37.39
                 SD        72.38           5.28       103 499                      17.34              21.40               16.95
/hVd/ Soft       Average   149.19          5.51       342 619                      15.46              25.98               25.64
                 SD        36.72           4.84       146 760                      13.53              20.67               18.91
/dVd/ Habitual   Average   179.31          0.56       264 724                      19.17              27.16               22.41
                 SD        40.79           5.18       98 764                       12.76              16.00               11.45
/dVd/ Fast       Average   124.03          1.6        166 197                      13.30              18.39               13.00
                 SD        12.58           5.09       80 815                       11.58              9.22                7.92
/dVd/ Loud       Average   209.16          7.49       272 600                      19.24              31.49               29.48
                 SD        55.90           4.40       92 043                       16.28              23.93               19.42
/dVd/ Slow       Average   265.12          1.78       269 790                      20.98              25.84               19.34
                 SD        47.43           4.86       66 888                       9.21               19.09               12.26
/dVd/ Soft       Average   184.32          4.9        248 098                      12.83              17.74               15.94
                 SD        47.28           4.42       82 626                       7.27               15.06               14.37
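
As a rough illustration of the consonant environment effect visible in Table III, the sketch below computes how much smaller each mean vowel space is in /dVd/ than in /hVd/ for the habitual speaking style. The percent reduction is a descriptive quantity introduced here, not a statistic reported by the authors, and it is computed from group means rather than per-speaker data. The dictionary layout and function names are hypothetical; the numbers are the habitual averages from Table III.

```python
from typing import Dict

# Habitual-style averages copied from Table III (13 speakers).
HABITUAL_MEANS: Dict[str, Dict[str, float]] = {
    "hVd": {"acoustic_Hz2": 392625.0, "tip_mm2": 39.28, "body_mm2": 54.48, "dorsum_mm2": 42.69},
    "dVd": {"acoustic_Hz2": 264724.0, "tip_mm2": 19.17, "body_mm2": 27.16, "dorsum_mm2": 22.41},
}


def percent_reduction(reference: float, value: float) -> float:
    """Reduction of `value` relative to `reference`, in percent."""
    return 100.0 * (reference - value) / reference


def context_reduction(means: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Percent reduction from /hVd/ to /dVd/ for every measure in the table."""
    reference, other = means["hVd"], means["dVd"]
    return {name: percent_reduction(reference[name], other[name]) for name in reference}


if __name__ == "__main__":
    for measure, pct in context_reduction(HABITUAL_MEANS).items():
        print(f"{measure}: {pct:.1f}% smaller in /dVd/ than in /hVd/")
```

The same helper could be applied to the other speaking styles by swapping in the corresponding rows of Table III.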

from previous studies (e.g., Stevens and House, 1963; Hillenbrand et al., 2001), to our knowledge this is the first report of the corresponding lingual space reduction when the two contexts are compared. Table III presents the average and standard deviation values of each acoustic and tongue kinematic vowel space across the 13 speakers.

Figure 1 shows the consonant environment effect on acoustic and tongue body kinematic vowel spaces. Among the three tongue regions (tongue tip, body, and dorsum), the kinematic vowel space of the tongue body was chosen for Fig. 1 because it tended to have the strongest relationship with acoustic vowel space (see Table IV). In this and all subsequent figures showing acoustic and kinematic vowel spaces, the corner vowels are arranged with front vowels to the left and high vowels toward the top of the plots. On the kinematic plots, in which the origin of the coordinate space is based on the speaker-specific centroids described above, increasingly negative x values indicate tongue points closer to the lips whereas increasingly positive y values indicate tongue points closer to the superior, palatal boundaries of the vocal tract. The general correspondence of reduced vowel space for both acoustic and kinematic measures should be understood with respect to certain differences between the reduction effects for the two measures. For example, the reduction of acoustic vowel space from /hVd/ to /dVd/ is primarily the result of F2 changes for the high vowels, with a particularly large contribution from the F2 change for /u/. This disproportionate change for F2 of /u/ when preceded by an alveolar consonant is consistent with data reported by Stevens and House (1963) and Hillenbrand et al. (2001). On the other hand, average changes from the /hVd/ to /dVd/ context for x-y tongue positions, measured for nearly identical time samples as the formant frequency measurements, indicate an effect for all vowels. Most, but not all, of these position effects are in the expected direction of reduction of position along either the x or y axes of the movement coordinate space. For example, both /æ/ and /ɑ/ are less low in /dVd/ as compared to /hVd/ utterances, and /æ/ is less forward in the former as compared to the latter. Similarly, in /dVd/ utterances /i/ is less forward and /u/ more forward as compared to their positions along the x axis in /hVd/ utterances. The higher (greater mean y values) /i/ and /u/ in /dVd/ as compared to /hVd/ utterances seem to provide evidence against the expectation for articulatory reduction, at least when expressed strictly in terms of a tongue position in a single dimension.

A significant main effect of speaking style was observed for both the acoustic vowel space [F(4, 96) = 15.17, p < 0.001] and all three tongue kinematic vowel spaces [F(4, 96) = 8.48; F(4, 96) = 11.68; F(4, 96) = 8.15, for tip, body, and dorsum, respectively, all three: p < 0.001]. Figure 2

TABLE IV. Spearman’s rank correlation coefficient values between acoustic vowel space and tongue kinematic vowel space across the speech tasks (n = 10), within each individual.

Speaker  Tongue tip vowel space  Tongue body vowel space  Tongue dorsum vowel space
P1       0.455                   0.939a                   0.952a
P2       0.855a                  0.879a                   0.527
P3       0.394                   0.745b                   0.564
P4       0.685b                  0.636b                   0.224
P5       0.188                   0.152                    0.358
P6       0.721b                  0.661b                   0.697b
P7       0.236                   0.418                    0.455
P8       0.794a                  0.709b                   0.624
P9       0.164                   0.842a                   0.794a
P10      0.650b                  0.661b                   0.588
P11      0.612                   0.527                    0.830a
P12      0.588                   0.733b                   0.830a
P13      0.127                   0.067                    0.152
a p-value < 0.01.
b p-value < 0.05.
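
As an illustration of the rank correlation reported in Table IV, the following minimal Python sketch computes Spearman's rho between ten acoustic and ten tongue body vowel space areas. The function names are hypothetical. The input values are the group averages from Table III, used here only as convenient example data; the coefficients in Table IV were computed on each speaker's own ten values, which are not reproduced here, so the output of this sketch is not a result of the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(a, b):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Group averages from Table III, in task order:
# /hVd/ habitual, fast, loud, slow, soft; /dVd/ habitual, fast, loud, slow, soft.
acoustic_hz2 = [392625, 285774, 403856, 412433, 342619,
                264724, 166197, 272600, 269790, 248098]
body_mm2 = [54.48, 33.65, 59.15, 44.46, 25.98,
            27.16, 18.39, 31.49, 25.84, 17.74]

rho = spearman_rho(acoustic_hz2, body_mm2)
```

Working on ranks rather than raw areas makes the coefficient insensitive to the very different units (Hz² versus mm²) of the two spaces.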

J. Acoust. Soc. Am. 139 (1), January 2016 Lee et al. 431
FIG. 1. Average acoustic vowel space (a) and tongue body kinematic vowel space (b) for different consonant environments (/hVd/ and /dVd/) in habitual condition. The vertical and anterior-posterior tongue body positions were measured relative to the respective mean positions of vowels for each speaker.

FIG. 2. Average (a) acoustic vowel space and (b) tongue body kinematic vowel space for different speaking styles. The vertical and anterior-posterior tongue body positions were measured relative to the respective mean positions of vowels for each speaker.

shows the speaking style effect on acoustic and tongue kinematic vowel spaces. A larger acoustic vowel space was observed in the habitual speaking style as compared to the fast speaking style [F(1,24) = 39.657, p < 0.001]. For tongue kinematic vowel spaces, larger spaces were observed at all three tongue locations for the habitual speaking style as compared to the fast speaking style [F(1,24) = 9.36, p = 0.005; F(1,24) = 14.81, p = 0.001; and F(1,24) = 8.10, p = 0.009, for tip, body, and dorsum, respectively] and to the soft speaking style [F(1,24) = 24.23, p < 0.001; F(1,24) = 19.29, p < 0.001; F(1,24) = 8.43, p = 0.008].

1. Intra-speaker analyses

Ten acoustic vowel spaces and ten tongue kinematic vowel spaces (five speaking styles × 2 consonant environments) were correlated across tasks within each speaker. Figure 3 shows an example of the ten different acoustic and tongue body kinematic vowel spaces for a selected speaker (P1). Variation across tasks in the size of this speaker’s acoustic and articulatory vowel spaces is obvious; data from other speakers showed the same cross-task variability, in differing degrees. Table IV shows the Spearman rank correlation coefficients by speaker for each tongue kinematic vowel space against acoustic vowel space. The greatest number of statistically significant correlations was obtained for the tongue body sensor, for which 9 of 13 speakers had significant correlations in the expected direction (greater acoustic spaces associated with greater articulatory vowel spaces), ranging in magnitude from 0.636 to 0.939. Ten of the 13 speakers had at least one tongue sensor area significantly correlated with area of the acoustic vowel space, and 8 of the 13 speakers had at least two sensors whose areas were significantly correlated with area of the acoustic vowel space. Three speakers (P5, P7, P13) had no significant associations between the areas of the two spaces.

2. Inter-speaker analyses

Table I shows the Pearson correlation coefficient values (r-values) among acoustic and tongue kinematic vowel spaces across speakers in each speaking style. The bottom row in Table I shows r-values for acoustic vowel space against tongue kinematic vowel space in the full data set, across all conditions and speakers (n = 130). All correlations for the habitual condition were significant but, on average, weaker than those of the intra-speaker analysis (Table IV). Correlations in the other conditions were sometimes significant, but always weaker than those in the habitual condition. (There is the possibility that the inter-speaker correlations

FIG. 3. (a) Acoustic vowel spaces and (b) tongue body kinematic vowel spaces across the ten speech tasks (two consonant environments, five speaking styles) for a selected speaker (P1). The vertical and anterior-posterior tongue body positions were measured relative to the respective mean position (tongue body centroid) of vowels for P1. Solid lines show vowel spaces in /hVd/ environment, dotted lines vowel spaces in /dVd/ environment.

FIG. 4. (Color online) Scatterplot of F1 and tongue body (a) x position and (b) y position across 13 speakers. Individual speakers are shown by unique colors of data points.

between acoustic and kinematic vowel spaces might be improved by additional scaling of the kinematic data, specifically by dividing by individual speaker, kinematic vowel-space area. This analysis was performed and yielded results essentially identical to those reported in Table I.)

C. Individual formant frequencies and tongue positions

The relationships between F1, F2 and tongue x and y coordinates were examined for each of the three tongue sensors (tongue tip, body and dorsum). These analyses included the acoustic and position data for all ten monophthongs. Consistent and significant covariation between tongue kinematic and acoustic data was observed for the tongue body sensor. Figures 4 and 5 show scatterplots of tongue body x and y coordinates against F1 (Fig. 4), and F2 (Fig. 5) across ten monophthongs produced by the 13 speakers. Analyses were performed on these data at both the intra- and inter-speaker levels.

1. Intra-speaker analyses

When data were evaluated within speaker but across vowels and speaking conditions, F1 and F2 were significantly correlated with both x and y coordinates for at least one of the three tongue sensors. Appendixes A and B present within-speaker Pearson r values between formant frequency and tongue x-y coordinates for the tongue tip, body, and dorsum.

For the 13 speakers, the general pattern of the relationships was positive for F1-x [higher F1 values for increasingly positive x values, that is, to the right on the kinematic plots such as Fig. 1(b)], negative for F1-y, negative for F2-x, and positive for F2-y. These general trends are broadly consistent with expectations from the acoustic theory of speech production (Fant, 1960). That is, all other things being equal, F1 is predicted by the theory to decrease slightly with tongue advancement, especially for high vowels and from a starting point roughly midway between /i/ and /u/ [Fig. 4(a); note positive correlation, Table II], and to increase substantially with increased distance between the highest point on the tongue and superior boundary of the vocal tract [Fig. 4(b)]; F2 is predicted to increase substantially with tongue advancement [Fig. 5(a)] and to decrease with decreased distance between the highest point on the tongue and the superior border of the vocal tract, especially for back vowels [Fig. 5(b)].

At the intra-speaker level, the highest r-value across the three tongue regions was examined in each speaker. The F1-x correlations ranged between 0.20 and 0.74 with a mean of 0.45, and the F1-y correlations between 0.81 and 0.90, with a mean across speakers of 0.87. Of all possible formant frequency-tongue coordinate pairs (F1-x, F1-y, F2-x, and F2-y), F1-y correlations showed the greatest magnitudes. The

FIG. 5. (Color online) Scatterplot of F2 and tongue body (a) x position and (b) y position across 13 speakers. Individual speakers are shown by unique colors of data points.

FIG. 6. Bar graphs of variance in (a) F1 and (b) F2 explained by tongue x-y positions. The height of the bar shows the total R² value and the separate shadings show individual contributions to the total R² of tongue x and y positions.
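
One common way to apportion a total R² between two correlated predictors, as in the bar shadings of Fig. 6, is a commonality-style partition into a part unique to x, a part unique to y, and a shared part. The sketch below assumes that approach, which may differ in detail from the regression models used in the study. All function names and data values are hypothetical.

```python
def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def partition_r2(formant, pos_x, pos_y):
    """Split the R^2 of formant ~ pos_x + pos_y into unique and shared parts.

    Uses the closed-form R^2 for a two-predictor linear regression.
    The shared part can be negative when one predictor acts as a suppressor.
    """
    r_fx = pearson(formant, pos_x)
    r_fy = pearson(formant, pos_y)
    r_xy = pearson(pos_x, pos_y)
    total = (r_fx ** 2 + r_fy ** 2 - 2.0 * r_fx * r_fy * r_xy) / (1.0 - r_xy ** 2)
    unique_x = total - r_fy ** 2  # gain from adding x to a y-only model
    unique_y = total - r_fx ** 2  # gain from adding y to an x-only model
    shared = total - unique_x - unique_y
    return {"total": total, "unique_x": unique_x, "unique_y": unique_y, "shared": shared}


# Hypothetical tongue body positions (mm) and F1 values (Hz) for six tokens.
tongue_x = [-6.0, -4.0, -1.0, 2.0, 4.0, 5.0]
tongue_y = [5.0, -4.0, 3.0, -5.0, 4.0, -3.0]
f1 = [320.0, 780.0, 400.0, 820.0, 360.0, 700.0]

parts = partition_r2(f1, tongue_x, tongue_y)
```

In this scheme the three parts always sum to the total R², which is what allows them to be drawn as stacked shadings within a single bar.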

F2-x correlations ranged between 0.56 and 0.88 with a mean of 0.74, and the F2-y correlations between 0.44 and 0.83 with a mean of 0.71. The strongest correlations were therefore for the traditional tongue height and tongue advancement dimensions. Three subjects (P2, P4, P11) showed F1-x or F2-x correlations in the direction opposite that predicted from the acoustic theory of speech production.

2. Inter-speaker analyses: Variance partitions

When data were evaluated across individuals, vowels, repetitions, and speaking conditions, F1 and F2 were moderately to strongly correlated with both the x and y coordinates of the tongue tip, body, and dorsum (Table II). At the inter-speaker level, the relationship between F1 and the tongue y coordinate was the strongest among the four formant-by-tongue coordinate pairs (F1-x, F1-y, F2-x, and F2-y coordinate pairs).

The amount of variance in formant frequencies explained by tongue x and y positions is presented in Fig. 6, using the multiple linear regression block models described in Sec. II F. For each of the three sensors, the height of the bar shows the total R² value and the separate shadings show individual contributions to the total R² of tongue x and y positions, respectively. Using the simultaneous method, a significant model for F1 emerged: tongue tip [F(2, 2709) = 2328, p < 0.0001, adjusted R² = 0.632], body [F(2, 2712) = 2703, p < 0.0001, adjusted R² = 0.674], and dorsum [F(2, 2709) = 2778, p < 0.0001, adjusted R² = 0.672]. With the same method, a significant model for F2 was obtained: tongue tip [F(2, 2707) = 1165, p < 0.0001, adjusted R² = 0.462], body [F(2, 2707) = 1651, p < 0.0001, adjusted R² = 0.549], and dorsum [F(2, 2707) = 933, p < 0.0001, adjusted R² = 0.408]. Between 63% and 67% of F1 variance was explained by tongue position, especially tongue y position. Between 41% and 55% of F2 variance was explained by tongue position, both in the x and y dimensions. A notable amount of variance in F2 was explained by both tongue body x and y position (shared variance) [Fig. 6(b)]. The more posterior the tongue region, the greater the explained variance in formant frequencies by only tongue y position [Fig. 6(a)]; the more anterior the tongue region, the greater the explained variance in F2 solely by tongue x position [Fig. 6(b)].

IV. DISCUSSION

A. Acoustic and lingual vowel spaces: Intra-speaker findings

The overall purpose of the current study was to obtain simultaneous measurements of tongue positions and formant

frequencies for productions of American English vowels, and to evaluate the covariation between the two sets of measures. The corner vowels of English were produced in two “frames,” one the traditional /hVd/ frame meant to elicit canonical articulation of a vowel, relatively free of coarticulatory influence (Stevens and House, 1963), and the other a /dVd/ frame to promote deviation from this canonical articulation, with the vowel presumably subject to greater coarticulatory influences. Formant frequencies were measured at a single point in time during the vowel nucleus, and tongue positions in the form of x-y coordinates were measured at the same time (within the limits of the sampling-rate differences for the acoustic and kinematic signals). Using a similar, slice-in-time acoustic measurement for formant frequency “targets,” Stevens and House (1963), and Hillenbrand et al. (2001) showed that vowels in /dVd/ frames tended to undershoot target formant frequencies measured in /hVd/ frames. Because this effect in Stevens and House (1963) and Hillenbrand et al. (2001) appeared to be present for each of the corner vowels, albeit in varying degrees, the overall effect on the acoustic vowel space was a reduction in /dVd/ versus /hVd/ frames. This acoustic vowel space effect was replicated in the current study (Table III, see data for habitual condition), albeit with clear reduction effects for the high vowels, but low vowels that seemed resistant to the consonant context effect [see Fig. 1(a)]. A new finding of the current study was the demonstration of smaller articulatory spaces in /dVd/ frames as compared to /hVd/ frames, computed from the x-y coordinates measured at the same point in time as the formant frequencies from which the acoustic vowel space was derived. Thus, a general articulatory reduction due to consonant context, expressed as a smaller area enclosed by x-y lingual coordinates, has been demonstrated in parallel with the expected reduction in the acoustic vowel space. This finding verifies, to a limited extent as discussed below, the frequently encountered assumption that articulatory inferences of vowel “working space” can be made, at least within speakers, from the relative size of the acoustic vowel space. For example, Berisha et al. (2014, p. 421) offer the view that, “VSAs [vowel spaces areas] are acoustic proxy for the kinematic displacements of the articulators,” and Wenke et al. (2010, p. 204) state that, “spectral measures of vowels…including vowel formants and subsequent vowel space area, can provide useful information about articulatory movement in dysarthria.” As noted by McGowan and Berger (2009), assumptions such as these have been difficult to evaluate due to the scarcity of simultaneously collected position and acoustic data in both healthy speakers and speakers with speech disorders.

In the current study, the parallel reduction of the space enclosed by lingual coordinates for corner vowels and the first two formant frequencies may seem somewhat surprising. After all, formant frequencies are a product of the area function of the entire vocal tract, and point-parameterized articulatory measurements of tongue positions provide no information on the front (i.e., lips) or back (i.e., pharynx) ends of the vocal tract. The question of changes in vowel working spaces expressed by lingual positions on the one hand and formant frequency measures on the other, and potential relationships between these two sets of measures, was addressed in the current study by including within-speaker manipulations of speaking style known to have an effect on estimates of acoustic vowel space. Compared to the habitual condition, the fast rate manipulation was expected to result in smaller acoustic vowel spaces (Weismer et al., 2000), and loud voice and slow rate were expected to result in larger acoustic vowel spaces for most speakers (Tjaden et al., 2013). Expectations of larger acoustic vowel spaces for slow versus habitual speech, loud versus habitual speech, and habitual versus fast speech, were largely confirmed at the group level for both /hVd/ and /dVd/ frames (Table III). The soft-voice condition produced smaller acoustic vowel spaces than the habitual condition, but not to the same extent as the fast condition. Clearly though, the speaking condition manipulations caused systematic, within-speaker variation in the size of the acoustic vowel space. The important question was whether the articulatory vowel spaces would vary in the same way. At the group level, the answer seems to be a qualified “yes.” Across phonetic contexts and conditions, the articulatory vowel space varied from 12.8 to 39.3 mm² for the tongue tip, 17.7 to 59.2 mm² for the tongue body, and 13.0 to 47.7 mm² for the tongue dorsum (Table III). For each of the three tongue points, with just a very few exceptions, articulatory working space was smaller for fast rate and soft voice as compared to habitual rate, and larger for loud voice as compared to habitual and fast rates, and soft voice. The one finding that went against expectations from the literature on acoustic vowel space was the roughly equivalent or slightly smaller articulatory vowel spaces at the slow rate, as compared to the habitual condition; nevertheless, the articulatory spaces at slow rate were greater than those at the fast rate and in the soft voice condition.

Variation induced by the speaking conditions and contexts allowed statistical analysis of within-speaker covariation of lingual and acoustic vowel spaces. Nine of the 13 speakers had a positive, significant Spearman correlation between the two spaces for the tongue body sensor. For these 9 speakers, it seems reasonable to view variations in their acoustic vowel spaces as reflecting variations in their lingual working spaces for vowels. This finding is relevant to many reports in the dysarthria literature in which variation in the acoustic vowel space is interpreted as reflecting varying speech motor capabilities. For example, individual subject analyses have shown acoustic vowel spaces to increase with recovery from head injury (Ziegler and von Cramon, 1983) and as a result of speech therapy in persons with dysarthria following stroke (Mahler and Ramig, 2012) or with a variety of diseases associated with dysarthria (Wenke et al., 2010). An argument has also been made that in Parkinson’s disease the size of the acoustic vowel space can be used to track deterioration of speech production skills over time, although this conclusion is based solely on group data (Skodda et al., 2012). In the first three cases, wherein expansions and compressions of the vowel space are reported speaker by speaker and issues such as vocal tract size, dialect, and so forth are strictly controlled by the analysis unit (that is, the speaker), the current results suggest that variation in size of the

acoustic vowel space may reflect systematic variation in lingual behaviors.

On the other hand, the potential error in within-speaker inferences cannot be ignored. Several factors may contribute to these inferential errors. First, the fact that different vocal tract configurations can yield the same set of resonant frequencies may result in variance in the position data that is not mapped onto the formant frequencies. The point-parameterized data for tongue positions used in the current study and the majority of relevant literature reviewed above may not capture this partition of the variance. Second, the extent to which a speaker uses lip rounding/spreading for the differentiation of the corner vowels, and the critical or non-critical role of lip rounding in a vowel system may weaken the inference. McGowan and Berger (2009), however, showed for speakers of American English a minimal influence of lip position on kinematic-to-formant frequency mapping. Third, variation in the way the cross-section of the pharynx is modified over time may introduce uncertainty to the acoustic-to-articulatory inference, though Whalen et al. (1999) argued that much of the variance in pharyngeal width for vowels can be explained by tongue positions substantially forward to the pharynx (and presumably by the point parameterization of those more forward positions).

The rank-order, intra-speaker correlation data in Table IV indicate that an interpretation of lingual vowel space from acoustic vowel space is further complicated by variation in the strength of the covariation across speakers and tongue location. In the current data set, the tongue body sensor showed the most consistent, significant covariation between the lingual and acoustic vowel spaces. Nine of the 13 speakers had significant correlations for the tongue body, ranging from 0.636 to 0.939. Of the four speakers who did not have a significant correlation for the tongue body, three did not have a significant correlation for any of the tongue sensors. For these latter speakers, it seems safe to conclude that variation in the acoustic vowel space does not provide good information about underlying variation in tongue position. At the current time, unfortunately, it is not possible to identify a priori those speakers for whom the acoustic-to-articulatory inference is likely to be reasonable. In post hoc analyses we explored the possibility that speakers with the least lingual position variation across speaking conditions or contexts, or speakers with exceptionally small lingual working spaces in the habitual condition, were those for whom the correlations between the acoustic and position data were not significant; the analysis did not support these possibilities. Future work should focus on factors such as vocal tract morphology, dialect, gender, and so forth that may be determiners of significant covariation between acoustic and lingual vowel spaces.

Articulatory variation across speakers for a given vowel of English, even with control over dialect, is well known (Harshman et al., 1977; Johnson et al., 1993). When solutions for a forward model from articulatory behavior to acoustic output are explored, across-speaker variation is prominent. For example, McGowan and Berger (2009) performed a very dense-sampling analysis of the forward relationship between tongue point positions and vowel formant frequencies for each of four speakers. McGowan and Berger’s analysis revealed some similarities across speakers for articulatory-to-formant frequency mappings, but also non-trivial, speaker-to-speaker variation reminiscent of the findings of Harshman et al. (1977) and Johnson et al. (1993) in which different speakers (N = 5 in these two studies) employed different magnitudes or even styles of articulatory gestures for the same vowels. One source of across-speaker variation in the forward mappings is almost certainly due to speaker-to-speaker variation in vocal tract morphology, which has been linked to variation in tongue behavior for the production of certain vowels (Fuchs et al., 2008).

McGowan and Berger (2009) and Wang et al. (2013) studied the mapping between kinematic and acoustic vocalic data by analyzing their covariations across time, albeit in somewhat different ways. McGowan and Berger (2009) performed a very dense sampling of x-y coordinates of tongue points and the lips together with formant frequencies measured at each position sampling point, whereas Wang et al. (2013) mapped a “shape” derived from tongue and lip data to a single estimate of vowel formant frequencies. In contrast, the current study used a mapping based on measurements derived from a single point-in-time, to be consistent with acoustic measures of vowel space that have attained a prominent role in the literature on acoustic measures of speech motor control integrity. The current study, including aspects of the methods and results, is similar to Mefferd and Green (2010) in showing for a majority of healthy speakers a systematic relationship between lingual position and formant frequency measures; this conclusion is also broadly consistent with the results of McGowan and Berger (2009) and Wang et al. (2013). Taken together, the results of Mefferd and Green (2010) and of the current study, as well as the work of McGowan and Berger (2009), Wang et al. (2013), and recent work by Mefferd (2015) in speakers with dysarthria, suggest a reasonable starting point of using acoustic measures to draw some inferences concerning lingual behavior for vocalics, and possibly as an index of speech motor control integrity. The difference between the current study and the previous ones is the explicit comparison of the acoustically and kinematically derived working spaces for corner vowels. Depending on how the current results are viewed, the inference of lingual working space from acoustic vowel spaces, and by extension the use of vowel space areas as estimates of speech motor control integrity, is either partly supported or deeply flawed.

Partial support for the inference is found in the weight of statistical evidence showing positive relationships between size of the acoustic vowel space and size of the lingual kinematic vowel space. All but 3 of the 13 speakers showed at least one significant correlation in the expected direction. Moreover, this finding applies not only to the /hVd/ versus /dVd/ comparison, but also across the tasks which modified the size of the acoustic vowel space and typically were associated with similar changes in the kinematic vowel space: tasks inducing smaller acoustic vowel spaces tended to be associated with smaller kinematic vowel spaces.
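
The speaker counts cited in this discussion can be re-derived from the significance markers in Table IV. The short sketch below does so; the dictionary is a transcription of those markers (p < 0.05 or p < 0.01 for tip, body, and dorsum, in that order), and the variable names are hypothetical.

```python
# True where Table IV marks the correlation as significant (footnote a or b).
# Tuple order: (tongue tip, tongue body, tongue dorsum).
SIGNIFICANT = {
    "P1": (False, True, True),
    "P2": (True, True, False),
    "P3": (False, True, False),
    "P4": (True, True, False),
    "P5": (False, False, False),
    "P6": (True, True, True),
    "P7": (False, False, False),
    "P8": (True, True, False),
    "P9": (False, True, True),
    "P10": (True, True, False),
    "P11": (False, False, True),
    "P12": (False, True, True),
    "P13": (False, False, False),
}

# Speakers with a significant tongue body correlation.
body_count = sum(flags[1] for flags in SIGNIFICANT.values())
# Speakers with at least one, and at least two, significant sensors.
at_least_one = sum(any(flags) for flags in SIGNIFICANT.values())
at_least_two = sum(sum(flags) >= 2 for flags in SIGNIFICANT.values())
# Speakers with no significant correlation at any sensor.
no_significant = [spk for spk, flags in SIGNIFICANT.items() if not any(flags)]
```

The tallies agree with the text: 9 speakers for the tongue body, 10 with at least one significant sensor, 8 with at least two, and 3 (P5, P7, P13) with none.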

An additional observation worthy of note is the statistical support for the goodness of the inference from acoustic to kinematic vowel space, for the majority of speakers, even in the absence of other potentially relevant position data, such as from the lips, pharynx, and even the parasagittal configuration of the vocal tract. Even for a substantially underparameterized analysis, both in the type of measures obtained and the single-point-in-time sampling, the hypothesized results were obtained in most cases. The more important finding of the present study, however, may be the failure to confirm statistically significant covariation between acoustically and kinematically derived vowel spaces for three of the speakers. Some preliminary, post hoc analyses performed in these speakers failed to identify possible reasons for these findings. Possibly these three speakers are precisely the ones for whom data on lip positions, palatal morphology, and pharyngeal dimensions are critical in understanding the mapping from acoustic to kinematic vowel spaces. The problem is that at the current time there is no way to know which speakers are those for whom the acoustic-to-kinematic vowel space inference is reasonable. For a given speaker, therefore, the current data suggest that an inference of articulatory vowel space from acoustic data alone may be risky.

B. Acoustic and lingual vowel spaces: Inter-speaker findings versus intra-speaker findings

Computed across speakers, correlations between size of the acoustic and articulatory vowel spaces (Table I) were much weaker than within-speaker correlations (Table IV). Although many across-speaker correlations were statistically significant, they reflected a substantial amount of unexplained variance (slightly more than 65%, in the best case) between the two types of vowel space. The relative weakness of the inter-speaker, compared to the intra-speaker correlations is perhaps not surprising when considering all the “noise variables” in an inter-speaker analysis. These include variations in vocal tract size, vocal tract morphology, speaker to speaker variations in task production (note in Table I the much lower correlations for the non-habitual speaking tasks), and dialect variation when it is not controlled (see summary of such factors in Johnson et al., 1993).

An excellent example of the potential effect of variations in vocal tract morphology on inferences from acoustic to lingual position vowel spaces is palatal shape. For vowel production, the tongue is shaped to form specific constrictions. Because it is known that palatal morphology affects lingual positions for vowel constrictions (Brunner et al., 2009; Rudy and Yunusova, 2013), the relative weakness of the inter-speaker correlations between acoustic and lingual vowel spaces may be due, at least in part, to the absence of data on palatal morphology differences among the current 13 speakers.

Nevertheless, the implication of the relatively weak correlations in the inter-speaker analyses is that comparisons of group acoustic vowel spaces, such as those across speaker groups or treatments (Sapir et al., 2007) or conditions (Tjaden et al., 2013) are even more difficult to interpret in articulatory terms as compared to variation of acoustic vowel space within an individual.

C. The x and y coordinates and variations in F1 and F2

If the acoustic vowel space is a proxy for articulatory capabilities, interpretation of acoustic data in terms of the classical articulatory dimensions for vowels seems a sensible extension of the more general notion of “vowel working space.” In virtually every study of articulatory-to-acoustic mappings for vowels, a two dimensional articulatory solution accounts for a very large portion of the variance in formant frequency data. These two dimensions include either the classical ones from speech acoustic theory (Fant, 1960), wherein tongue height variation accounts largely for variations in F1 and tongue advancement for variations in F2 (three of four speakers in McGowan and Berger, 2009; Wang et al., 2013) or a pair of factors in which one (or both) incorporates the tongue height dimension of the classical theory and the other is a somewhat more complex version of the tongue advancement/height dimensions (Harshman et al., 1977; Ladefoged et al., 1978). As summarized by McGowan and Berger (2009, pp. 2030–2031), the more complex mapping between tongue advancement/height and variation in F2 and F1 can be explained within the tube-resonator laws of Fant’s (1960) acoustic theory of vowel production. What is clear from these studies is that two-dimensional articulatory solutions account for a substantial amount of the variance in F1 and F2; the immediate question is the degree to which individual articulatory dimensions can be inferred from single-formant measures.

The current data are consistent with a fairly straightforward relationship between tongue height (including jaw contributions) and variation in F1. Data plotted in Fig. 4 and reported in Table II show a tight inverse relationship between the lingual y position and value of F1, as predicted by the acoustic theory of vowel production. These fairly strong correlations, which suggest approximately 65% of the variance in F1 is explained by tongue point variation in the y dimension, are consistent across the three sampled tongue sensors. The across-vowel range of y values, roughly 25 mm in Fig. 4, is consistent with the range of vertical values plotted by McGowan and Berger (2009, their Fig. 3, p. 2019) for x-ray microbeam data across a wide range of vowels, as well as with the range of Euclidean distances between /i/ and /ɑ/ reported by Mefferd and Green (2010). Collectively, the results of the present study, and those of a few previous investigations, point to an ability to infer vowel tongue height, or perhaps more precisely openness of the vocal tract, from the value of the first formant frequency. Correlations between the x dimension of tongue position and F1 were also statistically significant, but the largest coefficient, for the tongue tip, explained only 19% of the variance in F1 (Table II).

The pattern of correlations for tongue positions and F2 was less clear than the case for F1. Correlations for the x and y positions with F2 were roughly of the same magnitude,

both accounting for about 40% of the variance between the position and acoustic data (with the exception of the very low correlation of F2-x at the tongue dorsum sensor). The finding that F2-position correlations explain a smaller amount of variance than F1-y correlations may not be surprising, partially because F2 is particularly sensitive to variations in lip and pharyngeal sections of the vocal tract (Fant, 1960), which were not parameterized in the current study. Of greater interest, perhaps, are the apparently equivalent effects of x and y position variation on F2. In theory, tongue position variation in the x dimension, conceptually aligned with tongue advancement/retraction, might be expected to be more strongly associated with F2 than variations in the y dimension. From the perspective of Fant’s tube theory of articulatory-to-acoustic mappings, however, the effects of tongue height and advancement on F2 interact due to the relationship of constriction location and size to the pressure (or velocity) distribution for the second resonant mode of a tube closed at one end. Examination of Fant’s nomogram with the lip section removed (Fant, 1960, p. 84) shows the expected increase in F2 with position of the major vowel constriction advancing toward the lip end of the model (the so-called tongue advancement rule) but also shows the magnitude to depend dramatically on the area of the major constriction (that is, the height of the tongue). The corresponding curves for F1 show only small effects of constriction location, but a fairly strong effect for constriction area, even when the areas are restricted to three fairly tight constrictions. Fant’s theoretical formulations are supported by the current correlational analyses: whereas variations in F1 provide reasonably good information on tongue height, variations in F2 reflect a more composite effect of tongue height and advancement.

D. Limitations and conclusions of the current study

A major conclusion of this study is that inferences to the size of the lingual articulatory space from acoustic vowel space measures seem at first glance to be reasonable in a majority of the speakers studied, but there is no way, at least at the current time, to identify those speakers for whom the inference is not reasonable. More specifically, the inferences are clearly more consistent for within-speaker variation in the acoustic vowel space as compared to across-speaker variation, but the uncertainty mentioned above is still present for intra-speaker inferences such as might be made across time to index progress due to speech therapy or recovery of impaired speech skills from effects of a stroke or head injury. Additional work is required to determine the ultimate use of acoustic vowel space measures in understanding speech motor control limitations among persons with speech disorders. Variables contributing to the uncertainty have been discussed above, and include the underparameterization of the articulatory data in the current study, and even possibly of the acoustic data. Data from the current study on the inference from formant frequencies to specific characteristics of vowel production suggest that F1 provides relatively good information on relative openness of the tract, whereas F2 is related in a more complex way to openness and tongue advancement.

ACKNOWLEDGMENTS

We thank Neil Szuminsky for assistance with equipment operational support and data collection. We thank Zora McFarlane-Blake, Jamie Meta, and Taylor Baker Rogan for assistance with data collection and data analyses. Portions of these data were presented at the 164th Meeting of the Acoustical Society of America, Kansas City, MO and 2013 American Speech-Language-Hearing Convention, Chicago, IL. The authors thank one of the reviewers for raising the possibility that the inter-speaker correlations between acoustic and kinematic vowel spaces might be improved by additional scaling of the kinematic data, specifically by dividing by individual speaker, kinematic vowel-space area.

APPENDIX A

Please see Table V for additional results.

TABLE V. Pearson r-values between the first formant frequency (F1) and tongue x-y coordinates of tongue tip, body and dorsum within each speaker (Intra-speaker level analyses). The sample size (n) ranged between 203 and 210 within each speaker [(4 corner vowels × 5 speaking styles × 2 consonant environments × 3 repetitions) + (6 other monophthongs × 5 speaking styles × 3 repetitions)]. The bolded values are the highest r-value in the expected direction across the three tongue regions examined in each speaker. These values are used in text in Sec. III C 1.

Formant  Speaker  Tongue tip x  Tongue body x  Tongue dorsum x
F1       P1       0.341a        0.458a         0.271a
         P2       0.534a        0.449a         0.369a
         P3       0.387a        0.403a         0.312a
         P4       0.298a        0.199a         0.454a
         P5       0.569a        0.525a         0.230a
         P6       0.101         0.325a         0.141b
         P7       0.737a        0.672a         0.683a
         P8       0.585a        0.634a         0.720a
         P9       0.553a        0.679a         0.640a
         P10      0.662a        0.683a         0.656a
         P11      0.388a        0.354a         0.328a
         P12      0.053         0.061          0.196a
         P13      0.593a        0.542a         0.319a

Formant  Speaker  Tongue tip y  Tongue body y  Tongue dorsum y
F1       P1       0.830a        0.847a         0.894a
         P2       0.755a        0.825a         0.810a
         P3       0.775a        0.861a         0.833a
         P4       0.816a        0.903a         0.841a
         P5       0.703a        0.827a         0.832a
         P6       0.901a        0.849a         0.858a
         P7       0.872a        0.855a         0.904a
         P8       0.850a        0.871a         0.904a
         P9       0.765a        0.898a         0.904a
         P10      0.845a        0.862a         0.870a
         P11      0.791a        0.790a         0.809a
         P12      0.774a        0.826a         0.854a
         P13      0.776a        0.808a         0.843a

a: p-value < 0.01.
b: p-value < 0.05.
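
The arithmetic behind the Table V caption and the "variance explained" figures quoted in the discussion can be checked with a short calculation. The following is a minimal sketch, not part of the authors' analysis: the function and constant names are hypothetical, the design counts are copied from the caption, and the two r magnitudes are taken from the tongue dorsum y column of Table V (speakers P1 and P11). Squaring r gives the proportion of shared variance regardless of the sign of the correlation.

```python
# Design cell counts copied from the Table V caption (names are illustrative).
CORNER_VOWELS = 4
OTHER_MONOPHTHONGS = 6
SPEAKING_STYLES = 5
CONSONANT_ENVIRONMENTS = 2
REPETITIONS = 3


def max_tokens_per_speaker() -> int:
    """Maximum number of vowel tokens per speaker implied by the design."""
    corner = (CORNER_VOWELS * SPEAKING_STYLES
              * CONSONANT_ENVIRONMENTS * REPETITIONS)
    other = OTHER_MONOPHTHONGS * SPEAKING_STYLES * REPETITIONS
    return corner + other


def variance_explained(r: float) -> float:
    """Proportion of variance shared by two variables with Pearson r."""
    return r * r


if __name__ == "__main__":
    print(max_tokens_per_speaker())  # 120 + 90 = 210
    # r magnitudes from Table V, F1 vs tongue dorsum y.
    for speaker, r in {"P1": 0.894, "P11": 0.809}.items():
        print(speaker, round(variance_explained(r), 2))
```

Under these assumptions the design yields at most 210 tokens per speaker, matching the upper end of the reported range of n, and the two example coefficients correspond to roughly 80% and 65% of shared variance.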

APPENDIX B

Please see Table VI for additional results.

TABLE VI. Pearson r-values between the second formant frequency (F2) and tongue x-y coordinates of tongue tip, body and dorsum within each speaker (Intra-speaker level analyses). The sample size (n) ranged between 203–210 within each speaker [(4 corner vowels × 5 speaking styles × 2 consonant environments × 3 repetitions) + (6 other monophthongs × 5 speaking styles × 3 repetitions)]. The bolded values are the highest r-value across the three tongue regions examined in each speaker. These values are used in text in Sec. III C 1.

Formant  Speaker  Tongue tip x  Tongue body x  Tongue dorsum x
F2       P1       0.470a        0.880a         0.820a
         P2       0.728a        0.736a         0.232a
         P3       0.705a        0.699a         0.606a
         P4       0.562a        0.542a         0.419a
         P5       0.585a        0.502a         0.451a
         P6       0.604a        0.797a         0.672a
         P7       0.609a        0.734a         0.715a
         P8       0.852a        0.783a         0.702a
         P9       0.709a        0.834a         0.790a
         P10      0.842a        0.854a         0.788a
         P11      0.581a        0.058          0.097
         P12      0.760a        0.734a         0.837a
         P13      0.528a        0.632a         0.472a

Formant  Speaker  Tongue tip y  Tongue body y  Tongue dorsum y
F2       P1       0.493a        0.634a         0.774a
         P2       0.541a        0.790a         0.617a
         P3       0.557a        0.773a         0.749a
         P4       0.668a        0.765a         0.646a
         P5       0.589a        0.725a         0.669a
         P6       0.561a        0.735a         0.689a
         P7       0.542a        0.671a         0.686a
         P8       0.749a        0.826a         0.773a
         P9       0.448a        0.696a         0.599a
         P10      0.745a        0.806a         0.637a
         P11      0.441a        0.306a         0.110
         P12      0.553a        0.662a         0.475a
         P13      0.538a        0.604a         0.575a

a: p-value < 0.01.
b: p-value < 0.05.

Alfonso, P. J., and Baer, T. J. (1982). “Dynamics of vowel articulation,” Lang. Speech 25, 151–173.
Arai, T. (2012). “Education in acoustics and speech science using vocal tract models,” J. Acoust. Soc. Am. 131, 2444–2454.
Berisha, V., Sandoval, S., Utianski, R., Liss, J., and Spanias, A. (2014). “Characterizing the distribution of the quadrilateral vowel space area,” J. Acoust. Soc. Am. 135, 421–427.
Bradlow, A. R., Torretta, G. M., and Pisoni, D. B. (1996). “Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics,” Speech Commun. 20, 255–272.
Brunner, J., Fuchs, S., and Perrier, P. (2009). “On the relationship between palate shape and articulatory behavior,” J. Acoust. Soc. Am. 125, 3936–3949.
Bunton, K., and Leddy, M. (2011). “An evaluation of articulatory working space area in vowel production of adults with Down syndrome,” Clin. Linguist. Phonet. 25, 321–334.
Dromey, C., Jang, G. O., and Hollis, K. (2013). “Assessing correlations between lingual movements and formants,” Speech Commun. 55, 315–328.
Fairbanks, G. (1960). Voice and Articulation Drillbook, 2nd ed. (Harper and Row, New York), p. 127.
Fant, G. (1960). Acoustic Theory of Speech Production (Mouton, the Hague, the Netherlands), pp. 63–214.
Fourakis, M. (1991). “Tempo, stress, and vowel reduction in American English,” J. Acoust. Soc. Am. 90, 1816–1827.
Fuchs, S., Winkler, R., and Perrier, P. (2008). “Do speakers’ vocal tract geometries shape their articulatory vowel space?,” in Proceedings of the ISSP, pp. 333–336, http://issp2008.loria.fr/Proceedings/PDF/issp2008-77.pdf.
Geng, C., and Mooshammer, C. (2009). “How to stretch and shrink vowel systems: Results from a vowel normalization procedure,” J. Acoust. Soc. Am. 125, 3278–3288.
Harshman, R., Ladefoged, P., and Goldstein, L. (1977). “Factor analysis of tongue shapes,” J. Acoust. Soc. Am. 62, 693–713.
Hillenbrand, J. M., Clark, M. J., and Nearey, T. M. (2001). “Effects of consonant environment on vowel formant patterns,” J. Acoust. Soc. Am. 109, 748–763.
Iivonen, A. (1995). “Explaining the dispersion of the single-vowel occurrences in an F1/F2 space,” Phonetica 52, 221–227.
Johnson, K., Flemming, E., and Wright, R. (2004). “Response to Whalen et al.,” Language 80, 646–648.
Johnson, K., Ladefoged, P., and Lindau, M. (1993). “Individual differences in vowel production,” J. Acoust. Soc. Am. 94, 701–714.
Kent, R., and Read, C. (2001). Acoustic Analysis of Speech, 2nd ed. (Singular/Thomson Learning, Albany, NY), pp. 71–137.
Ladefoged, P., and Bladon, A. (1982). “Attempts by human speakers to reproduce Fant’s nomograms,” Speech Commun. 1, 185–198.
Ladefoged, P., Harshman, R., Goldstein, L., and Rice, L. (1978). “Generating vocal tract shapes from formant frequencies,” J. Acoust. Soc. Am. 64, 1027–1035.
Lee, S., Potamianos, A., and Narayanan, S. (1999). “Acoustics of children’s speech: Developmental changes of temporal and spectral parameters,” J. Acoust. Soc. Am. 105, 1455–1468.
Mahler, L. A., and Ramig, L. O. (2012). “Intensive treatment of dysarthria secondary to stroke,” Clin. Linguist. Phonet. 26, 681–693.
McGowan, R. S., and Berger, M. A. (2009). “Acoustic-articulatory mapping in vowels by locally weighted regression,” J. Acoust. Soc. Am. 126, 2011–2032.
Mefferd, A. (2015). “Articulatory-to-acoustic relations in talkers with dysarthria: A first analysis,” J. Speech Hear. Res. 58, 576–589.
Mefferd, A. S., and Green, J. R. (2010). “Articulatory-to acoustic relations in response to speaking rate and loudness manipulations,” J. Speech Lang. Hear. Res. 53, 1206–1219.
Milenkovic, P. (2002). “TF32 [computer program]” (Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI).
Monsen, R. B., and Engebretson, A. M. (1983). “The accuracy of formant frequency measurements: A comparison of spectrographic analysis and linear prediction,” J. Speech Hear. Res. 26, 89–97.
Nearey, T. M. (1978). Phonetic Feature Systems for Vowels (Indiana University Linguistic Club, Bloomington, IN).
Neel, A. T. (2008). “Vowel space characteristics and vowel identification accuracy,” J. Speech Lang. Hear. Res. 51, 574–585.
Rudy, K., and Yunusova, Y. (2013). “The effect of anatomic factors on tongue position variability during consonants,” J. Speech Lang. Hear. Res. 56, 137–149.
Sapir, S., Spielman, J. L., Ramig, L. O., Story, B. H., and Fox, C. (2007). “Effects on intensive voice treatment (the Lee Silverman Voice Treatment [LSVT] on vowel articulation in dysarthric individuals with idiopathic Parkinson disease: Acoustic and perceptual findings,” J. Speech Lang. Hear. Res. 53, 114–125.
Simpson, A. P. (2001). “Dynamic consequences of differences in male and female vocal tract dimensions,” J. Acoust. Soc. Am. 109, 2153–2164.
Skodda, S., Grönheit, W., and Schlegel, U. (2012). “Impairment of vowel articulation as a possible marker of disease progression in Parkinson’s disease,” PLoS One 7(2), e32132.
Smiljanic, R., and Bradlow, A. R. (2005). “Production and perception of clear speech in Croatian and English,” J. Acoust. Soc. Am. 118, 1677–1688.
Stevens, K. N., and House, A. S. (1955). “Development of a quantitative description of vowel articulation,” J. Acoust. Soc. Am. 27, 484–493.
Stevens, K. N., and House, A. S. (1963). “Perturbation of vowel articulations by consonantal context: An acoustical study,” J. Speech Hear. Res. 6, 111–128.
Tjaden, K., Lam, J., and Wilding, G. (2013). “Vowel acoustics in Parkinson’s disease and multiple sclerosis: Comparison of clear, loud, and slow speaking conditions,” J. Speech Lang. Hear. Res. 56, 1485–1502.

Tjaden, K., and Wilding, G. E. (2004). “Rate and loudness manipulations in dysarthria: Acoustic and perceptual findings,” J. Speech Lang. Hear. Res. 47, 766–783.
Wang, J., Green, J. R., Samal, A., and Yunusova, Y. (2013). “Articulatory distinctiveness of vowels and consonants: A data-driven approach,” J. Speech Lang. Hear. Res. 56, 1539–1551.
Weismer, G., Jeng, J. Y., Laures, J. S., Kent, R. D., and Kent, J. F. (2001). “Acoustic and intelligibility characteristics of sentence production in neurogenic speech disorders,” Folia Phoniatr. Logop. 53, 1–18.
Weismer, G., Laures, J. S., Jeng, J. Y., Kent, R. D., and Kent, J. F. (2000). “Effect of speaking rate manipulations on acoustic and perceptual aspects of the dysarthria in amyotrophic lateral sclerosis,” Folia Phoniatr. Logop. 52, 201–219.
Wenke, R. J., Cornwell, P., and Theodoros, D. G. (2010). “Changes to articulation following LSVT® and traditional dysarthria therapy in non-progressive dysarthria,” Int. J. Speech Lang. Pathol. 12, 203–220.
Westbury, J. (1994). X-ray Microbeam: Speech Production Database User’s Handbook (Waisman Center, Madison, WI).
Whalen, D. H., Kang, A. M., Magen, H. S., Fulbright, R. K., and Gore, J. C. (1999). “Predicting midsagittal pharynx shape from tongue position during vowel production,” J. Speech Lang. Hear. Res. 42, 592–603.
Ziegler, W., and von Cramon, D. (1983). “Vowel distortion in traumatic dysarthria: A formant study,” Phonetica 40, 63–78.

