MICROPHONE ARRAYS
By
John E. Adcock
Sc.B., Computer Science, Brown University, 1989
Sc.M., Engineering, Brown University, 1993
Date
Harvey F. Silverman, Director
Date
Michael S. Brandstein, Reader
Date
Allan E. Pearson, Reader
Date
Peder J. Estrup
Dean of the Graduate School and Research
© Copyright 2001 by John E. Adcock
THE VITA OF JOHN E. ADCOCK
John was born October 19, 1967 in Boston, Massachusetts. As a child he spent five years with his family
in Paris, France, returning in 1980 to Pound Ridge, NY, where he lived while attending high school in
Bedford Hills, NY. John received the Bachelor of Science degree in Computer Science, magna cum laude,
from Brown University in 1989. He subsequently spent two years as an engineer in the Signal Processing
Center of Technology at Lockheed Sanders in Nashua, New Hampshire before returning to Brown
University to pursue advanced degrees in Engineering. John earned his Master of Science Degree in
Engineering from Brown in 1993. John received a University Fellowship in 1991, and the Doris I.
Eccleston ‘25 Fellowship in 1996 for support of his graduate studies. In 1993 John worked with Brown
Engineering alumnus Krishna Nathan in the handwriting recognition group at IBM Watson Research
laboratories in Hawthorne, NY and in 1995 as Brown Engineering alumnus Professor Michael
Brandstein’s teaching assistant at the Johns Hopkins University Center for Talented Youth summer
program in Lancaster, PA. In 1998 and 1999 John’s fingers provided the thundering bass lines for the
popular local rock band Wave. John is co-inventor of a 1998 Brown University patent describing a method
for microphone-array source location. John has worked as a self-employed programmer/technical
consultant and was briefly a partner in an Internet bingo venture.
ACKNOWLEDGMENTS
I thank my advisor, Professor Harvey Silverman, for his support and trust over the course of my graduate
career at Brown. Thanks to my readers, Professor Allan E. Pearson and especially Professor Michael
Brandstein whose critical input and encouragement were vital to the completion of this work.
In my time at Brown I have had the privilege of working and playing with many wonderful and talented
people. Michael Hochberg, Jon Foote and Alexis Tzannes were here with me at the beginning, and Joe
DiBiase, Michael Brandstein, Aaron Smith and Michael Blane shared my trials in the remainder. I
also thank my wonderful friends Lance Riek, Laura Maxwell and Carina Quezada for their relentless
support.
Thanks to Ginny Novak for the uncomplaining effort she has consistently made to extend my deadlines
and otherwise keep me in the good graces of the Registrar and Graduate School.
Finally, at the culmination of my formal education, I thank my parents for all they’ve taught me over the
course of many years.
CONTENTS
1 Introduction
  1.1 Methods for Acquiring Speech With Microphone Arrays
  1.2 The Scope of This Work
5 Optimal Filtering
  5.1 The Single Channel Wiener Filter
    5.1.1 Additive Uncorrelated Noise
  5.2 Multi-channel Wiener Filter
    5.2.1 Additive Uncorrelated Noise
    5.2.2 Direct Solution
    5.2.3 Filtered Signal Plus Additive Independent Noise
    5.2.4 Filtered Signal Plus Semi-Independent Noise Model
  5.3 A Non-Optimal Filter and Sum Framework
  5.4 Summary
Bibliography
LIST OF TABLES
LIST OF FIGURES
5.3 The attenuation of Φm(ω)
6.1 Average BSD, SSNR and peak SNR
6.2 Average BSD, peak SNR and SSNR
6.3 Average BSD, peak SNR and SSNR
7.1 The structure of the DSBF with optimal-SNR based channel filtering (OSNR)
7.2 Narrowband spectrogram of a noisy utterance processed with OSNR
7.3 The structure of the DSBF with Wiener post-filtering
7.4 Narrowband spectrograms for the WSF beamformer
7.5 The structure of the Wiener filter-and-sum (WFS) beamformer
7.6 Narrowband spectrograms from the WFS beamformer
7.7 Diagram of the optimal multi-channel Wiener (MCW) beamformer
7.8 Narrowband spectrograms from the MCW beamformer
7.9 Summary of word error rates
7.10 Word error rates with OSNR input
7.11 The best performing filtering schemes
7.12 Summary of FD and BSD values
7.13 Summary of SSNR and SNR values
7.14 Scatter plots of error rate and distortion measures
7.15 Scatter plots comparing correlation of distortion measures
7.16 Scatter plot of distortion measurements by algorithm type
7.17 Scatter plots of OSNR distortion measurements
CHAPTER 1:
INTRODUCTION
Microphone arrays are becoming an increasingly popular tool for speech capture[1, 2, 3, 4, 5, 6] and may
soon render the traditional desk-top or headset microphone obsolete. Unlike conventional directional
microphones, microphone arrays are electronically steerable which gives them the ability to acquire a
high-quality signal (or signals) from a desired direction (or directions) while attenuating off-axis noise or
interference. Because the steering is implemented by software and not by a physical realignment of
sensors, moving targets can be tracked anywhere within the receptive area of the microphone array and the
number of simultaneously active targets is limited only by the available processing power. The
applications for microphone-array speech interfaces include telephony and teleconferencing in home,
office[7, 4, 8] and car environments[3, 9], speech recognition and automatic dictation[6, 10, 11, 12], and
acoustic surveillance[13, 14] to name a few. To realize the promise of unobtrusive hands-free speech
interfaces that microphone arrays offer, they must perform effectively and robustly in a wide variety of
challenging environments.
Microphone-array systems face several sources of signal degradation:
To achieve a significant improvement in SNR (say 30 dB) with simple delay-and-sum beamforming
requires an impractical number of microphones, even under idealized noise conditions. This built-in
limitation of the delay-and-sum beamformer (DSBF) motivates research into supplemental processing
techniques to improve microphone-array performance.
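The delay-and-sum operation itself is straightforward. The following is an illustrative Python sketch, not part of the processing chain used in this work; the interface and the integer-sample steering are simplifying assumptions:

```python
import numpy as np

def delay_and_sum(channels, delays, fs):
    """Steer each channel by its delay (rounded to the nearest sample) and sum.

    channels : list of equal-rate 1-D arrays, one per microphone (hypothetical)
    delays   : per-channel steering delays in seconds
    fs       : sampling rate in Hz
    """
    n = min(len(c) for c in channels)
    out = np.zeros(n)
    for sig, d in zip(channels, delays):
        shift = int(round(d * fs))
        out += np.roll(sig[:n], -shift)   # advance the channel by its delay
    return out

# Two copies of a tone offset by one sample align coherently after steering:
fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(1024) / fs)
mics = [x, np.roll(x, 1)]                 # second microphone lags by one sample
y = delay_and_sum(mics, [0.0, 1 / fs], fs)
```

After steering, the in-phase signals add coherently while uncorrelated noise adds incoherently, which is the source of the SNR gain discussed above.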
Inversion Techniques
Multiple input/output inverse theorem (MINT) techniques are aimed at inverting the impulse response of
the room, which is assumed to be known a priori, thereby eliminating the effects of
reverberation[18, 19]. Although a room impulse response is not certain to be invertible[20, 21], under
certain constraints on the nature of the individual transfer functions between the source and each
microphone, a perfect inverse system can be realized for the multiple-input/single-output system[19].
Although this is an effective method for reducing the effects of reverberation it requires an accurate
estimate of the transfer function between the source and each microphone, a measurement that can be
quite difficult to make in practice. A simultaneous recording and playback system is required to accurately
measure the transfer function between the sound source and a microphone, and a separate transfer function
must be measured for every point in the room from which a talker may speak. The room transfer function
will change as the room configuration changes, whether due to rearrangement of furniture or the number
of occupants. To further complicate the task of measuring the room transfer functions, significant
variations in impulse response occur over time as temperature and humidity, and therefore the speed of
propagation of sound, vary[22]. Moreover, impulse-response inversion techniques may be very sensitive to
the accuracy of the measured impulse response[21].
Matched-filtering compensates for the effects of reverberation by whitening the received signal with the
time-reverse of the estimated impulse response [23]. Matched-filtering relies upon the generally
uncorrelated nature of the channel impulse response to strengthen the main impulse while spreading
energy away from the central peak. As such, it performs best when significant reverberant energy exists.
Although a sub-optimal inversion technique, matched-filtering does not require the inverse of the channel
impulse response. Like MINT techniques, matched-filtering requires prior knowledge of the impulse
response.
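A minimal sketch of the matched-filtering idea, using a toy impulse response (the channel, signals, and lengths here are hypothetical, chosen only to illustrate how the time-reversed response concentrates energy):

```python
import numpy as np

def matched_filter(received, h_est):
    """Filter the received signal with the time-reverse of the (assumed known)
    channel impulse response estimate h_est."""
    return np.convolve(received, h_est[::-1])

# Toy channel: direct path plus one echo. The effective end-to-end response
# h * reverse(h) is the autocorrelation of h, which concentrates energy into
# a central peak while spreading the echo energy away from it.
h = np.zeros(64)
h[0], h[40] = 1.0, 0.5
eff = np.convolve(h, h[::-1])
```

The central peak of `eff` carries the summed squared tap energy, while the echo contributes smaller side peaks, which is the strengthening effect described above.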
Adaptive Beamformers
Traditional adaptive beamformers[24, 25, 26, 15] optimize a set of channel filters under some set of
constraints. A typical optimization is for minimal output power subject to flat overall frequency response.
These techniques do well in narrowband, far-field applications and where the signal of interest has
generally stationary statistics but are not as well suited for use in speech applications where:
Adaptive noise cancellation (ANC)[27] techniques exploit knowledge of the interference signal to cancel it
from the desired input signal. In some ANC applications the interference signal can be obtained
independently (perhaps with a sensor located close to the interferer but far from the desired source). The
remaining problem is then to eliminate the interfering signal from a mix of desired and interfering signals
typically by matching the delay and gain of the measured noise and subtracting it from the beamformer
output. A particular sort of adaptive array employing ANC has been used with some success in
microphone-array applications [28, 29]. The generalized sidelobe canceler (GSC) uses an adaptive array
structure to measure a noise-only signal which is then canceled from the beamformer output. Obtaining a
noise measurement that is free from signal leakage, especially in reverberant environments, is generally
where the difficulty lies in implementing a robust and effective GSC.
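The core ANC adaptation can be sketched with a generic least-mean-squares (LMS) canceler. This is a textbook sketch under the idealized assumption of a leakage-free noise reference, not the GSC structure of [28, 29]; the signals and step size are hypothetical:

```python
import numpy as np

def lms_anc(primary, noise_ref, taps=8, mu=0.01):
    """LMS adaptive noise canceler: adapt an FIR filter so the filtered noise
    reference cancels the noise component of the primary input; the residual
    e is the estimate of the desired signal."""
    w = np.zeros(taps)
    out = np.zeros(len(primary))
    for n in range(taps, len(primary)):
        x = noise_ref[n - taps:n][::-1]   # most recent reference samples first
        e = primary[n] - w @ x            # residual = desired-signal estimate
        w += mu * e * x                   # LMS weight update
        out[n] = e
    return out

# Tone buried in interference that is a delayed, scaled copy of the reference:
rng = np.random.default_rng(0)
v = rng.standard_normal(4000)
s = 0.5 * np.sin(2 * np.pi * 440 * np.arange(4000) / 16000)
primary = s + 0.8 * np.roll(v, 1)
cleaned = lms_anc(primary, v)
```

With a clean reference the filter converges and the interference is largely removed; in practice, as noted above, signal leakage into the reference is the hard part.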
Another variant of adaptive beamformers that has found some application in microphone arrays is the
superdirective array[30, 31, 32]. Superdirective array structures generally exhibit greater off-axis noise
suppression than delay-and-sum arrays given similar aperture size but at the expense of being geometry
dependent (endfire arrays are a common superdirective configuration) which restricts the effective steering
range.
When evaluating a speech or signal enhancement technique, the intended application influences the choice
of benchmark. If the application is speech recognition then clearly recognition performance is the
objective measure of interest. If the microphone-array system is acquiring speech for a teleconferencing
task the intelligibility and subjective quality of the speech as perceived by human listeners is the important
measure. In this chapter some signal quality measures will be described. These measures will be used in
later chapters to evaluate the performance of proposed speech-enhancement algorithms.
The signal-to-noise ratio (SNR) is defined as

\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=1}^{N} x(n)^2}{\sum_{n=1}^{N} \left[ x(n) - y(n) \right]^2}    (2.1)

where x(n) is an undistorted reference signal and y(n) is the distorted (for instance with additive noise) test
signal. In some cases the signal and noise may not be available separately to form the ratio in Equation
(2.1). In such situations an alternative is to estimate the peak SNR directly from the measured signal. If the
recording under analysis has regions with only noise and no signal (recordings of speech almost always
do) and the noise is assumed to be stationary, these regions can be used to estimate the noise power.
Correspondingly a region where the signal is active can be used to estimate the signal+noise power. The
peak SNR can be formed from the ratio of these measured powers:

\mathrm{SNR}_{\mathrm{peak}} = 10 \log_{10} \frac{\sum_{k=1}^{K} y_{n_{\max}}(k)^2 - \sum_{k=1}^{K} y_{n_{\min}}(k)^2}{\sum_{k=1}^{K} y_{n_{\min}}(k)^2}    (2.2)
where the signal is broken down into frames of K samples, and y_n(k) indicates the kth sample of the nth
analysis frame of the observed signal y. The power in each frame is measured; n_max indicates the
analysis frame with the highest power and n_min denotes the analysis frame with the lowest power in the test
signal. The difference in the numerator of Equation (2.2) is based on the assumption that the noise and
signal are statistically independent, which implies that E[(s + n)²] = E[s²] + E[n²]. In this way an
estimate of the peak SNR is made without access to the reference clean signal as in Equation (2.1).
SNR is attractive for several reasons:
It is simple to compute, using Equation (2.1) if the reference signal and noise may be separated or
using Equation (2.2) if an estimate must be made only from an observed signal.
It is intuitively appealing, especially to electrical or communications engineers who are accustomed
to the idea that improved SNR indicates improved information transfer.
Unfortunately SNR correlates poorly with subjective speech quality[53, 48]. SNR as written in Equation
(2.1) is sensitive to phase shift (delay), bias and scaling which are often perceptually insignificant.
Meanwhile, the peak SNR as measured in Equation (2.2) has no clean signal reference to work from and is
really measuring the dynamic range rather than the distortion in the recording under test. As such it is
possible for the peak SNR to improve even as the signal is becoming more distorted. SNR and peak SNR
are still useful measurements, but care must be taken to interpret them in the proper context.
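Both measures are simple to express directly. An illustrative Python sketch (frame length and test signals are arbitrary choices for the example, not those used in this work):

```python
import numpy as np

def snr_db(x, y):
    """Equation (2.1): SNR of the test signal y against the clean reference x."""
    return 10 * np.log10(np.sum(x ** 2) / np.sum((x - y) ** 2))

def peak_snr_db(y, frame=256):
    """Equation (2.2): peak SNR estimated from the observed signal alone,
    using the strongest frame as signal+noise and the weakest as noise."""
    powers = [np.sum(y[i * frame:(i + 1) * frame] ** 2)
              for i in range(len(y) // frame)]
    p_max, p_min = max(powers), min(powers)
    return 10 * np.log10((p_max - p_min) / p_min)

# Illustration: a unit-amplitude tone plus noise at roughly 17 dB SNR,
# preceded by a noise-only region so peak SNR can be estimated blindly.
rng = np.random.default_rng(1)
nz = 0.1 * rng.standard_normal(4096)
x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
y = np.concatenate([nz[:2048], x[:2048] + nz[2048:]])
```

Note that `peak_snr_db` never sees the clean reference, which is exactly why it measures dynamic range rather than distortion.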
The segmental SNR (SSNR) averages the SNR over short analysis frames:

\mathrm{SSNR} = \frac{1}{N} \sum_{n=1}^{N} 10 \log_{10} \frac{\sum_{k=1}^{K} x_n(k)^2}{\sum_{k=1}^{K} \left[ x_n(k) - y_n(k) \right]^2}    (2.3)

where x_n(k) denotes the kth sample of the nth frame of the reference signal x, and y_n(k) the
corresponding frame and sample of the distorted test signal. Because the ratio is evaluated on individual
frames, loud and soft portions contribute equally to the measure. Speech detection is desirable to prevent
silence frames from unduly biasing the average with extremely low segment SNRs[53, 48]. Also,
long-term frequency response adjustment of the test signal may be desirable to avoid biasing the error with
frequency response effects that could be easily compensated for.
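The frame-averaged computation can be sketched as follows (illustrative only; the speech detection and frequency-response equalization mentioned above are omitted, and the frame length is an arbitrary choice):

```python
import numpy as np

def segmental_snr_db(x, y, frame=256):
    """Segmental SNR: per-frame SNR in dB averaged over all frames, so loud
    and soft segments contribute equally. Silence-frame exclusion is omitted
    in this sketch."""
    vals = []
    for i in range(len(x) // frame):
        xs = x[i * frame:(i + 1) * frame]
        ys = y[i * frame:(i + 1) * frame]
        vals.append(10 * np.log10(np.sum(xs ** 2) / np.sum((xs - ys) ** 2)))
    return float(np.mean(vals))

rng = np.random.default_rng(2)
tone = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
noisy = tone + 0.1 * rng.standard_normal(4096)
```

For a stationary signal and noise, as here, the segmental value is close to the global SNR; the two diverge when the signal level varies between frames.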
Figure 2.1: Relationship between Hz and Bark. The dashed line is from Equation (2.4) and the marks
indicate the band edges derived from the psycho-acoustical experiments undertaken by Zwicker[54].
Figure 2.2: Equal-loudness curves from Robinson[55]. Each line is at the constant phon value indicated on
the plot. The phon and intensity (dB) values are equal at 1 kHz by definition.
z = 13 \arctan(0.00076 f) + 3.5 \arctan\left[ \left( \frac{f}{7500} \right)^2 \right]    (2.4)
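As a concrete check on the warping of Equation (2.4), it can be written as a small helper (illustrative only, not code from this work):

```python
import math

def hz_to_bark(f):
    """Equation (2.4): Zwicker-style Bark (critical-band) warping of a
    frequency f given in Hz."""
    return 13 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500) ** 2)
```

For example, 1 kHz maps to roughly 8.5 Bark, consistent with the band edges plotted in Figure 2.1.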
In addition to frequency warping, the Bark spectrum incorporates spreading, preemphasis, and loudness
scaling to model the perceived excitation of the auditory nerve along the basilar membrane in the ear. In
this work the Bark spectrum, L_x(z), of a discrete time signal, x(k), is formed by the following steps:
The linear frequency spectrum of the DFT is warped onto the constant rate bandwidth Bark scale
(see Figure 2.1).
5. Apply spreading function: X_s(z) = SF(z) * X(z)
The loudness in dB is warped onto the (approximate) sone scale[50] where a doubling in perceived
loudness has a constant distance. Each 10 dB corresponds approximately to a doubling in perceived
loudness for phons above 40 dB. In the absence of an absolute loudness calibration the 40 dB offset
term in this expression is not especially meaningful.
The BSD measure itself[51] is computed by taking the mean difference between reference and test Bark
spectra and then normalizing by the mean Bark spectral power of the reference signal:
\mathrm{BSD}(x, y) = \frac{\frac{1}{N} \sum_{n=1}^{N} \sum_{z=1}^{Z} \left[ L_x(n, z) - L_y(n, z) \right]^2}{\frac{1}{N} \sum_{n=1}^{N} \sum_{z=1}^{Z} L_x(n, z)^2}    (2.5)
where L_x(n, z) and L_y(n, z) are the discrete Bark power-spectra in dB of the reference and test signals for
time frame index n and Bark frequency index z. Speech detection is performed by an energy thresholding
operation on the reference signal so that the distortion measure averages the distortion only over speech
segments. Also the test signal is filtered before computing the Bark spectrum to equalize the average
spectral power in 6 octave-spaced frequency bands to prevent the measure from being overly sensitive to
the long term spectral density of the test signal.
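Given the Bark spectra, the ratio of Equation (2.5) is a short computation. A sketch (the speech detection and octave-band equalization described above are omitted; the example arrays are arbitrary):

```python
import numpy as np

def bsd(Lx, Ly):
    """Equation (2.5): mean squared difference of reference and test Bark
    spectra (frames x Bark bands, in dB), normalized by the mean Bark
    spectral power of the reference."""
    num = np.mean(np.sum((Lx - Ly) ** 2, axis=1))
    den = np.mean(np.sum(Lx ** 2, axis=1))
    return num / den

# Arbitrary positive "Bark spectra" for illustration: 10 frames x 18 bands.
Lx = np.abs(np.random.default_rng(3).standard_normal((10, 18))) + 1.0
```

The normalization makes the measure scale-free: identical spectra give zero, and a test spectrum of zero gives exactly one.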
The Modified Bark Spectral Distortion (MBSD)[51] measure incorporates an explicit model of
simultaneous masking to determine if distortion at a particular frequency is audible. If the error between
test and reference signals at a particular frequency falls below the masking threshold, then that error is not
included in the error sum (since it is deemed to be inaudible). Accurate computation of the masking
threshold is a very involved process. Yang[51] cites a simplified method for determining the overall
masking threshold given by Johnston[56]. Even the simplified method is quite involved so this method
will be omitted in the use of BSD contained herein.
2.3 Speech Recognition Performance
The use of speech-recognition accuracy as a means of evaluating a speech enhancement method has some
advantages. Certainly if the goal is to achieve robust hands-free use of a speech recognition system,
recognition accuracy is the measurement of interest. Unfortunately the evaluation of the performance of a
speech recognition system is not without drawbacks. If the recognition system being used is sensitive to
the acoustic environment in which it is used (and all are), it may require some form of retraining or
adaptation. This retraining may require significant training data.
The feature distortion is computed as

\mathrm{FD}(R, T) = \frac{\sum_{m=1}^{3} \sum_{l=1}^{N} \sum_{k=1}^{K_m} \left[ R_k^m(l) - T_k^m(l) \right]^2}{\sum_{m=1}^{3} \sum_{l=1}^{N} \sum_{k=1}^{K_m} R_k^m(l)^2}    (2.6)

where FD(R, T) denotes the feature distortion between the reference signal, R, and the signal being tested,
T. R_k^m(l) denotes the kth feature of the mth sub-vector for analysis frame l for the features derived from
the reference signal. T_k^m(l) denotes the corresponding feature value for the test signal.
Although cepstral distance is used in a variety of speech processing applications including speech quality
assessment[61, 62], the measure presented above is referred to instead as the feature distortion because it
operates directly on the features used by the LEMS speech recognizer. Mel-cepstral distortion is similar in
nature to Bark spectral distortion as both measures are derived from a warped and smoothed (or liftered)
log spectral representation.
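A sketch of the feature-distortion computation for Equation (2.6); the grouping into three sub-vectors follows the text, but the normalization by reference energy is an assumption in this sketch, and the example arrays are arbitrary stand-ins for recognizer features:

```python
import numpy as np

def feature_distortion(R, T):
    """Squared distance between reference and test feature streams, summed
    over the three sub-vectors (each a frames x K_m array) and normalized
    by the reference feature energy. Normalization is an assumption."""
    num = sum(np.sum((Rm - Tm) ** 2) for Rm, Tm in zip(R, T))
    den = sum(np.sum(Rm ** 2) for Rm in R)
    return num / den

# Hypothetical feature streams: 4 frames, sub-vectors of widths 3, 2, 2.
R = [np.ones((4, 3)), np.ones((4, 2)), np.ones((4, 2))]
T = [np.zeros_like(Rm) for Rm in R]
```

As with the BSD, identical feature streams score zero and a zeroed test stream scores one.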
2.4 Summary
In this chapter several objective measures of speech quality were introduced, varying from the traditional
signal-to-noise ratio measure to a perceptually motivated Bark spectral distortion measure, and the
speech-recognition targeted feature distortion measure. In the following chapters the measures presented
here will be used to evaluate the performance of proposed microphone-array processing techniques. By
using a set of measures with different underlying principles a multifaceted view of the performance of the
algorithms to follow will be possible.
CHAPTER 3:
SPEECH RECOGNITION WITH MICROPHONE ARRAYS
In this chapter tests on the performance of a 16-element microphone array will be presented. A database of
multichannel recordings was collected and speech recognition tests performed on delay-and-sum
beamformer configurations using from 1 to 16 microphones. The methods used to record and process the
multichannel recordings will be described and the performance of each microphone-array configuration
will be presented. Each tested configuration will be evaluated with the measures described in Chapter 2:
feature distortion (FD), Bark spectral distortion (BSD), segmental signal-to-noise ratio (SSNR), peak SNR
(SNR). Also each array configuration will be used as a front end to the LEMS alphadigit speech
recognition system and its performance assessed in that role. Finally, the significance of and relationships
between the various performance measures will be discussed.
horizontally placed at a height of 1.6 m. Within each 8-microphone sub-array the microphones are
uniformly spaced at 16.5 cm intervals. The microphone array is in a partially walled-off area of a 7 x 8 m
acoustically-untreated workstation lab. Approximately 70% of the surface area of the enclosure walls is
covered with 7.5 cm acoustic foam, the 3 m ceiling is painted concrete, and the floor is carpeted. The
reverberation time within the enclosure is approximately 200 ms.
The utterances were recorded with the talker standing approximately 2m away from each of the
microphone sub-arrays. The microphone-array recording was performed with a custom-built 12-bit
20 kHz multichannel acquisition system[63]. The 20 kHz datastream was resampled to match the 16 kHz
sampling rate used by the recognition system. During recording, the talker wore a close-talking headset
Table 3.1: Breakdown of the experimental database by the number of talkers of each gender, the number of
utterances and the number of words in each of the training and test sets.
Figure 3.1: Layout of the LEMS microphone-array system using 16 pressure gradient microphones
(units: cm).
[Figure 3.2: block diagram of the acquisition chain: 16-microphone LEMS array, A/D, SBUS interface to a
SUN SparcII, and resampling.]
microphone. This is the same microphone used to collect the high-quality speech data for training the
baseline HMM system (see Section 3.1.3). Using the analog-to-digital conversion unit of a Sparc10
workstation, the signal from the close-talking microphone was digitized to 16 bits at 16 kHz
simultaneously with the 16 remote microphones in the array system. Both the close-talking and the array
recordings were segmented by hand to remove leading and trailing silence and then the close-talking
recording was time-aligned to the first channel of the multi-channel recordings. See Figure 3.2.
Figures 3.3 and 3.4 show data from an example recording from the recognition database. Figure 3.3 shows
the time-sequences for a single utterance recorded from the close-talking microphone (a), a single
microphone in the array (b) and the output of the 16 channel DSBF (c). Figure 3.4 shows the
corresponding spectrograms for the sequences shown in Figure 3.3. The noise-suppressing effect of the
beamformer is evident in both Figures 3.3 and 3.4. Also evident is that the output of the beamformer,
though greatly improved from the single microphone, is quite a bit more noisy than the recording from the
close-talking microphone.
3.1.2 Beamforming
After recording and preliminary segmentation and alignment, the channels of every multichannel data file
were time-aligned with the reference close-talking microphone recording. Figure 3.5 shows an outline of
the process. The close-talking microphone recording was used as a reference to ensure the best possible
time alignment. This is critical when computing distortion measurements (BSD, SSNR, etc.) that assume
the test and reference signals are precisely aligned. The time-alignment was achieved by using an
Figure 3.3: Example recorded time sequences: the recording from the close-talking microphone (a), channel
1 from the microphone array (b) and the output of the beamformer (c). The talker is male. The spoken
text is "GRAPH 600".
implementation of the phase transform (PHAT)[13, 64, 65, 66]. The phase transform of two
time-domain sequences x(k) and y(k) is given by the inverse DFT of their magnitude-normalized
cross-spectrum:

\mathrm{PHAT}(x, y) = \mathcal{F}^{-1}\left[ \frac{X(\omega) Y^*(\omega)}{|X(\omega)| \, |Y(\omega)|} \right]    (3.1)

where \mathcal{F}^{-1} denotes the inverse Fourier transform and X(\omega) and Y(\omega) the Fourier transforms of x(k) and
y(k), respectively. A 512 point (32 ms) Hamming window with a 256 point (16 ms) shift was used in
computing the cross-spectra. The cross-spectra were smoothed by averaging 7 adjacent frames 1 then
magnitude normalized and an inverse Fourier transform applied. The resulting cross-correlation was then
upsampled by a factor of 20 and the lag corresponding to the peak value chosen. Some post-processing
was performed to eliminate spurious estimates and to constrain the delay estimates during non-speech
periods. Each channel was steered using the estimated time delays and a delay-steered version of the
1 Although not entirely still during the recording, the talker movements were generally limited to leaning or shifting, rarely resulting
in a change of more than 0.1 m in location. The resulting time-delay changes were generally small and slowly varying and not adversely
affected by the time-averaging of the cross-spectra.
2 This fairly high upsampling factor wasn’t chosen because resolution to 1/20 of a sample is necessary, but because the higher
sampling rate makes it more likely that the peak sampled value will correspond with the actual peak value of the underlying waveform.
Figure 3.4: Log-magnitude spectrograms of the example sequences shown in Figure 3.3: the recording
from the close-talking microphone (a), channel 1 from the microphone array (b) and the output of the
beamformer (c). The talker is male. The spoken text is "GRAPH 600". The analysis window is a 512 point
(32 ms) Hamming window with a half-window (16 ms) shift.
Figure 3.5: Outline of the method for delay steering the array recordings.
multichannel data file saved to disk. Figure 3.6(a) shows locations derived from the estimated delays using
a maximum-likelihood (ML) location estimator[67]. Figure 3.6(b) shows the distribution of the x and y
coordinates for ML location estimates of every analysis frame from the entire database. The talker location
estimates were generated for analysis only; the channel delays were not constrained to correspond to any
particular source radiation model during processing.
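The PHAT-based delay estimation described above can be sketched as follows. This is illustrative only: the windowing, cross-spectral smoothing, and post-processing used in this work are omitted, and the upsampling is done here by zero-padding the cross-spectrum (one of several equivalent implementations):

```python
import numpy as np

def phat_delay(x, y, upsample=20):
    """Delay of y relative to x, in samples, via the phase transform of
    Equation (3.1): magnitude-normalize the cross-spectrum, inverse
    transform, and locate the correlation peak on an upsampled lag grid."""
    n = len(x)
    X, Y = np.fft.rfft(x), np.fft.rfft(y)
    cs = np.conj(X) * Y                   # conjugate on X so positive lag = y delayed
    cs /= np.maximum(np.abs(cs), 1e-12)   # phase transform weighting
    r = np.fft.irfft(cs, n * upsample)    # zero-padding upsamples the correlation
    lag = int(np.argmax(r))
    if lag > n * upsample // 2:           # map wrapped lags to negative delays
        lag -= n * upsample
    return lag / upsample

rng = np.random.default_rng(4)
sig = rng.standard_normal(1024)
```

With a broadband signal the normalized cross-spectrum is a pure phase ramp, so the correlation collapses to a sharp peak at the true delay.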
For the purposes of establishing a baseline for the beamformer performance, no channel weighting or
normalization was performed. The channels were simply delay-steered according to the estimated delays
Figure 3.6: Measured talker locations. (a) Scatter plot of locations from a single talker. (b) Distribution of
the measured x and y talker locations taken over the entire database.
and summed³ in sequential order (see Figure 3.1 for the microphone numbering).
3 The sum was used rather than the mean since the microphone-array recordings are 12 bit and the recognition system uses 16 bit
PCM input data so the sum of the 16 12-bit channels will never overflow a 16 bit word. Using the mean of the channels instead of the
sum would involve a requantization step. Although the effect of this requantization is almost certainly unmeasurable by any method
used herein, there was no reason not to preserve the data with maximum precision.
the prior regardless of sample size t. This implies that the representation of the posterior remains fixed as
additional data is observed.
In the case of missing-data problems (e.g., HMMs), the expectation-maximization (EM) algorithm can be
used to provide an iterative solution for estimation of the MAP parameters[69]. The iterative EM MAP
estimation process can be combined with the recursive Bayes approach. The approach that incorporates
(3.2) and (3.3) with the incremental EM method[70] (That is, randomly selecting a subset of data from the
training set and immediately applying the updated model) is fully described in[68]. Also, Gauvain and Lee
have presented the expressions for computing the posterior distributions and MAP estimates of continuous
observation density HMM (CD-HMM) parameters[71]. Because the posterior is from the same family as
the prior, (3.2) and (3.3) are equivalent to the update expressions in[71] and are not repeated here.
Baseline Model
A baseline talker-independent continuous density hidden Markov model (CD-HMM) is obtained by a
conventional maximum likelihood (ML) training scheme using a training database of high-quality data
acquired with a close-talking headset microphone. The training set contained 3484 utterances
from 80 talkers. The initial parameters of the CD-HMM are derived from a discrete observation hidden
semi-Markov model (HSMM) using a Poisson distribution to model state duration. This model is then
converted to a tied-mixture HSMM by simply replacing each discrete symbol with a multivariate normal
distribution. Normal means and full covariances are estimated from the training data.
Prior Generation
The initial prior distributions are also derived from the training data set used to train the baseline HMM.
The employed prior distributions are the normal-Wishart distribution for the parameters of the normal
distribution and the Dirichlet distribution for the rest of model parameters. The parameters describing the
priors are set such that the mode of the distribution corresponds to the initial CD-HMM. The strength of
the prior (That is, the amount of observed data required for the posterior to significantly differ from the
prior) is set to “moderately” strong belief. A subjective measure of prior strength is used[6, 72, 73] where
a very weak prior is (almost) equivalent to a non-informative prior and a very strong prior (almost)
corresponds to impulses at the initial parameter values4 .
4 In [72, 6] Gotoh characterizes the initial prior weights as “weak” or “moderate” or “strong” often without attributing a specific
value. Presumably the intention was to drive me insane. The prior strength value used herein is 0.1 and corresponds to a “moderate”
prior strength.
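The role of the prior strength can be illustrated with the simplest conjugate case, the mean of a Gaussian. This is a generic textbook sketch; the actual CD-HMM update expressions follow Gauvain and Lee[71] and are not reproduced here:

```python
import numpy as np

def map_mean(prior_mean, tau, data):
    """Conjugate-prior MAP estimate of a Gaussian mean: a weighted average
    of the prior mean and the data, with the prior strength tau acting
    like tau pseudo-observations."""
    return (tau * prior_mean + np.sum(data)) / (tau + len(data))
```

With a small tau the estimate tracks the observed data; with a large tau it stays near the prior mode, which is the qualitative behavior the "weak" through "strong" prior settings control.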
Figure 3.7: Distortion measurements. (a) Feature distortion (FD), (b) Bark spectral distortion (BSD), (c)
segmental SNR (SSNR) and (d) peak SNR shown as a function of the number of microphones used in
a delay-and-sum beamformer and for each channel individually. The measurements shown were averaged
over all recorded utterances for the 11 test talkers. One set of markers indicates the beamformer, with the
x-axis showing the number of microphones included in the sum; the microphones were added in order,
starting with microphone 1 (see Figure 3.1). The other markers denote the average distortion or SNR value
for each channel taken alone.
The overall improvement in peak SNR from the 1-microphone beamformer to the 16-microphone
beamformer is 9.7dB. As expected, this is somewhat less than the ideal value derived in Chapter 4 due to
the non-ideal noise cancellation and signal reinforcement. Note also that this improvement figure depends
greatly on which microphone is chosen for the 1-microphone beamformer. Microphone 1 has one of
the lowest peak SNRs of any channel, so the 9.7dB figure is somewhat generous. Choosing
the single microphone with the highest SNR (microphone 12) instead, the total improvement in peak SNR is only
7.65dB.
Table 3.2: Distortion measurements plotted in Figure 3.7. Distortion is shown as a function of the number
of microphones included in a DSBF and as a function of the single microphone used.
(MAP-HMM) as described in Section 3.1.3. Note that after MAP training the best single-channel
recognition result is marginally better than the performance of the 2-channel beamformer, a testament to
the usefulness of the MAP training. A different choice of microphones for the 2-channel
beamformer could change this result, considering that microphone 1 is one of the poorest performing
individual channels. To put these values in perspective, the word error rate for the data collected with the
close-talking microphone is 8.16%. The improvement in MAP-HMM recognition accuracy from the 1-microphone
beamformer to the 16-microphone beamformer is 9.38% (a 44% reduction in error).
Comparing the 16-microphone beamformer against the best performing single microphone (12), the error
rate is reduced by 5.45% (a 31% reduction in error). The performance of the array is close enough to
that of the close-talking microphone that a small number of errors constitutes a large fraction of the remaining gap.
5 Pink noise contains equal power in each octave. This corresponds to a -3dB per octave slope of the power spectral density.
Figure 3.8: Word recognition error rates as a function of the number of microphones used in a delay-and-
sum beamformer and as a function of the single microphone used alone, before and after MAP training.
One pair of markers denotes the recognition performance of the beamformer with varying numbers of
microphones before MAP training and the beamformer performance after MAP training. A second pair of
markers denotes the recognition performance of each channel taken individually before MAP training and
the performance after MAP training. The microphones were added in numerical order for the beamformer
(see Figure 3.1). The strong line at the base of the graph corresponds to the error rate for the data acquired
with the close-talking microphone, 8.16%. These values are tabulated in Table 3.3.
Table 3.3: Word error rates (%) for the HMM before and after MAP training as a function of the number of
microphones included in the DSBF or the single microphone used. The beamformer values and the single-microphone
values are plotted in Figure 3.8.
Figure 3.9: Layout of the recording room as in Figure 3.1 but showing the position of the interfering noise
source.
Figure 3.10: PCM sequence of a talker recording with the pink noise recording added. This is the same talker
and utterance used in Figures 3.3 and 3.4. The top plot is channel 16 alone and the bottom plot is the output
of the 16-channel beamformer.
Each channel of the noise recording was added to the corresponding channel of the talker recordings. For
beamforming the inter-microphone delays estimated from the clean speech were used to steer both the
original speech channels and the added noise. Figures 3.10 and 3.11 show time and spectrogram plots for
channel 16 and the beamformer output for a talker recording with pink noise recording added.
The discerning reader may notice the pronounced bands of noise visible in Figure 3.11 around 2800 and
5600Hz and assume that they are the result of a resonance in the playback or recording system. In fact,
these bands result from the beamforming operation and the spatial aliasing inherent in the geometry of this
particular microphone array. The bands appear in all recordings but their exact location in frequency varies
with each recording as a function of the applied steering delays. To illustrate this, Figure 3.12 shows the
spectrum of the DSBF output with no speech present as the steering location is moved in a spiral of
increasing radius starting at x=2, y=2. The noise bands appear at harmonically spaced intervals which vary
Figure 3.11: Spectrograms of the noisy recordings shown in Figure 3.10. A single channel of a noise-added
recording on top and the output of the 16 channel DSBF on the bottom.
Figure 3.12: Aliasing spectral bands in the response of the beamformer to a stationary noise input as the
beamformer is steered to different locations indicated in the bottom plot. The x-coordinate of the steering
location is indicated with a solid line and the y-coordinate with the dashed line.
smoothly with the steering location. As the beamformer is steered to different locations the noise source
falls in different portions of the beamformer sidelobes and aliasing pattern. Spatial aliasing is a side
effect of an inter-microphone spacing larger than half the wavelength of the target signal. Microphone arrays may
be designed with a constant-width main lobe[75, 76, 77], which will eliminate the sort of aliasing seen here,
though the tradeoff is that the main lobe is wider throughout most of the bandwidth. The particular array
geometry used here is prone to aliasing; an explicit solution to this issue is outside the scope of this
work, though processing techniques presented in later chapters will significantly reduce this effect.
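The mechanism behind the aliasing bands can be sketched numerically. Steering a delay-and-sum beamformer leaves a residual delay on each channel for a source away from the steering point; at frequencies where those residuals align modulo one period the channels re-add coherently. The geometry below is an assumed illustration (an 8-microphone line with 10cm spacing), not the array of Figure 3.1:

```python
import numpy as np

# Sketch (assumed geometry): DSBF gain toward an off-steer point source
# as a function of frequency. Secondary peaks reveal aliasing bands.
c = 343.0                                             # speed of sound (m/s)
mics = np.array([[0.1 * i, 0.0] for i in range(8)])   # 8-mic line, 10 cm spacing
src = np.array([2.0, 1.0])                            # noise-source position (m)
steer = np.array([1.0, 2.0])                          # mismatched steering point (m)

# Residual delay per channel after steering.
delta = (np.linalg.norm(mics - src, axis=1)
         - np.linalg.norm(mics - steer, axis=1)) / c

freqs = np.linspace(0.0, 8000.0, 2001)
# Gain toward the source: |sum_m exp(-j 2 pi f delta_m)| / M
resp = np.abs(np.exp(-2j * np.pi * freqs[:, None] * delta[None, :])
              .sum(axis=1)) / len(mics)
mask = freqs > 500.0                                  # skip the main lobe near DC
peak = freqs[mask][resp[mask].argmax()]               # strongest off-DC band
print("strongest aliasing band near %.0f Hz" % peak)
```

Moving the steering point changes `delta` and therefore slides the peaks in frequency, matching the behavior seen in Figure 3.12.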
Table 3.4: Distortion measurements for the noisy database plotted in Figure 3.13. Distortion as a func-
tion of the number of microphones included in the delay-and-sum beamformer and as a function of each
microphone taken alone using the noisy speech data.
Figure 3.13: Distortion measurements for the added-noise database. (a) Feature distortion (FD), (b) Bark
spectral distortion (BSD), (c) segmental SNR (SSNR) and (d) peak SNR shown as a function of the number
of microphones used in a delay-and-sum beamformer and for each channel individually. The measurements
shown were averaged over all recorded utterances with added noise for the 11 test talkers. One set of markers indicates
the beamformer, with the x-axis showing the number of microphones included in the sum; the microphones
were added in order, starting with microphone 1 (see Figure 3.9). The other set of markers denotes the average distortion or
SNR value for each channel taken alone. These values are tabulated in Table 3.4.
Figure 3.14: Word recognition error rates as a function of the number of microphones used in a delay-and-
sum beamformer and as a function of channel, before and after MAP training. One pair of markers denotes
the performance of the beamformer with varying numbers of microphones before MAP training and the
beamformer performance after MAP training. A second pair of markers denotes the recognition performance
of each channel taken individually before MAP training and the performance after MAP training. The
microphones were added in numerical order for the beamformer (see Figure 3.9). These values are tabulated in Table 3.5.
Table 3.5: Word error rates (%) for the model before MAP training and the model after MAP training as a function
of the number of microphones included in the delay-and-sum beamformer and as a function of the single microphone
used alone. These values are plotted in Figure 3.14 for both the beamformer and single-microphone
recognition performance before and after MAP retraining.
The coefficient of correlation is given by [78]:
ρ = Cov(Y1, Y2) / (σ1 σ2)    (3.4)
Table 3.6: Matrix of correlation coefficients for the different types of measurements without per-talker
normalization.
where Cov(Y1, Y2) denotes the covariance of the joint distribution of the two variables under examination
and σ1, σ2 are the corresponding standard deviations, σ1 = (E[(Y1 - Ȳ1)²])^(1/2). The covariance is given by:

Cov(Y1, Y2) = E[Y1 Y2] - E[Y1] E[Y2]    (3.5)
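Equations (3.4) and (3.5) translate directly into code. The sketch below uses made-up data and population (rather than sample) statistics:

```python
import numpy as np

# Sketch: the correlation coefficient of Equations (3.4)-(3.5) computed
# directly from paired measurements (illustrative values).
def correlation(y1, y2):
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    cov = (y1 * y2).mean() - y1.mean() * y2.mean()   # Eq. (3.5)
    return cov / (y1.std() * y2.std())               # Eq. (3.4)

a = np.array([1.0, 2.0, 3.0, 4.0])
print(correlation(a, 2 * a + 1))    # perfectly correlated -> 1.0
print(correlation(a, -a))           # inversely correlated -> -1.0
```

The signed value is what allows the distortion measures (where smaller is better) to show up as negatively correlated with the SNR measures (where larger is better).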
Table 3.6 shows the inter-measure correlation coefficients for the recognition scores and distortion
measures presented above. For each talker the average value of each measure was computed over the set of
utterances for that talker. For each of the 11 test talkers 31 values of each measure were taken (15 values
measured from beamformers with a varying number of microphones > 1 and 16 values measured from
each microphone taken individually) both for the original database and the added-noise database for a total
of 682 values over which the correlation coefficient was measured. The correlation coefficients are shown
as signed values since the distortion measures and signal-to-noise ratio measures are inversely correlated.
The signs of the measured correlation coefficients are all appropriate for the measures being compared;
measures of goodness correlate positively with other measures of goodness and negatively with measures
of badness. Table 3.6 shows generally very strong correlations. The correlation between FD and the
baseline-HMM error rate is the strongest, with generally lower correlations between any other distortion or
SNR measure and any recognition error rate. The strong correlation between FD and baseline-HMM error
rate can be clearly seen in the scatter plots in Figure 3.15(c). Note that FD is (slightly) more strongly
correlated with the baseline-HMM error rate than the MAP-HMM error rate is. Although it is not
surprising that the recognizer performance, particularly before retraining, would be closely related to the
feature distortion, it is somewhat unexpected that the feature distortion would be more closely coupled
than the MAP-HMM error rate. Note also that all distortion and SNR measures are less strongly correlated
with the MAP-HMM error rate than with the baseline-HMM error rate.
The preceding analysis fails to take into account the talker-dependence of the error rates. That is, the
speech recognition error rates can vary considerably between talkers, or even between individual utterances from a
single talker. Using the results from the recordings acquired with the close-talking microphone, the
inter-talker standard deviation of the error rates of the 11 talkers in the test set is 5.9%. This is largely due
to a single talker with a very high error rate; this same talker stands out clearly in the scatter plots in
Figure 3.15. Excluding this one outlying talker, the standard deviation of the per-talker error rates is 2.0%.
This phenomenon is not necessarily an issue of sound quality but often one of the manner of pronunciation
or elocution6. To eliminate this talker variability, the per-talker word error rates were computed and the
mean error rate for each talker was subtracted from that talker's values. The result is a differential error
rate for each talker relative to their mean error rate. The same normalization was performed with the
distortion and SNR measures7, rendering them per-talker difference measures as well. Table 3.7 shows the
inter-measure correlation coefficients after this normalization. Figure 3.15 shows scatter plots of the more
6 In particular, the very worst performing talker is a female talker with very high pitch.
7 Although this may not have been strictly necessary, little is gained by maintaining the absolute distortion measurements in the
absence of absolute recognition error rates. That is, all the results were already rendered relative by the per-talker normalization of the
error rates.
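The per-talker normalization amounts to subtracting each talker's mean from that talker's values. A minimal sketch, with a hypothetical data layout of parallel value and talker-id arrays:

```python
import numpy as np

# Sketch (hypothetical data layout): remove each talker's mean so every
# value becomes a differential measure relative to that talker, as done
# for the error rates and distortion/SNR measures in the text.
def per_talker_normalize(values, talker_ids):
    values = np.asarray(values, float)
    ids = np.asarray(talker_ids)
    out = values.copy()
    for t in set(talker_ids):
        mask = ids == t
        out[mask] -= values[mask].mean()   # per-talker bias removal
    return out

err = [10.0, 12.0, 30.0, 34.0]             # two talkers, very different baselines
ids = ["a", "a", "b", "b"]
print(per_talker_normalize(err, ids))      # [-1.  1. -2.  2.]
```

After this step the large between-talker offsets no longer inflate or deflate the measured correlations.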
Figure 3.15: Scatter plots comparing the various distortion measures with recognition error rates. The black
points correspond to the values with per-talker bias removed and the lighter patches to the values without
any normalization. One point is shown for each of 62 measurements for 11 talkers (682 points in total).
strongly correlated pairs. Most obviously, and not surprisingly, the per-talker bias normalization has
generally increased the correlation values. This effect can be seen in the plots of Figure 3.15: the spread
of the data away from the primary linear trend is greatly reduced in the talker-normalized data. As in the
unnormalized case, FD is still most strongly correlated with the baseline-HMM error rate, though now
both BSD and peak SNR are much closer than in the unnormalized case.
Table 3.7: Matrix of correlation coefficients for the different measurements with per-talker bias normaliza-
tion of each of the 11 × 62 measurements.
Table 3.8: RMS linear fit error for linear predictors of recognition error rate. Each column corresponds to
the type of data used as the predictor and each row corresponds to the target of the prediction. Errors are in
the units of the recognition error rate, % words in error.
Figure 3.16: Scatter plots comparing the various distortion measures with recognition error rates. All talkers
are averaged together for each of 62 data points.
Table 3.9: RMS fit error for linear predictors of recognition error rate using the overall average value of
each measure. Each column corresponds to the type of data used as the predictor and each row corresponds
to the measure being predicted. Errors are in the units of the recognition error rate, % words in error.
(a) 1st order. E=6.15 (b) 2nd order. E=3.41 (c) 3rd order. E=0.79
(d) 4th order. E=0.74 (e) 5th order. E=0.61 (f) 6th order. E=0.59
Figure 3.17: Scatter plot of baseline-HMM error rate versus MAP-HMM error rate with polynomial fits of
various orders. One data point is shown for each error rate averaged over all the talkers.
(a) 1st order. E=6.81 (b) 2nd order. E=4.57 (c) 3rd order. E=3.66
(d) 4th order. E=3.66 (e) 5th order. E=3.66 (f) 6th order. E=3.66
Figure 3.18: Scatter plot of baseline-HMM error rate versus MAP-HMM error rate with polynomial fits of
various orders. One data point is shown for each talker and each processing type. Only talker normalized
data is shown.
ŷ(x) = Σ_{k=1}^{M} a_k F_k(x)    (3.6)

where each F_k(x) is an arbitrary fixed basis function of the input x. The optimal least squares coefficients,
a_k, are those that minimize the total squared error, χ², over the set of observations {x_i, y_i}:

χ² = Σ_{i=1}^{N} ((y_i - ŷ(x_i)) / σ_i)²    (3.7)

where σ_i is the standard deviation of measurement i, or 1 if it is unknown or they are all equal.
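With unit σ_i, the general least squares fit of Equations (3.6) and (3.7) reduces to an ordinary linear solve over a design matrix of basis-function values. A sketch with an invented synthetic target; the basis set mirrors the constant, linear, and squared functions used in the text:

```python
import numpy as np

# Sketch of the general least squares fit of Eqs. (3.6)-(3.7) with
# sigma_i = 1: build the design matrix of basis-function values and
# solve for the coefficients a_k by least squares.
def fit_basis(x, y, basis):
    A = np.column_stack([f(x) for f in basis])     # design matrix F_k(x_i)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def rms_error(x, y, basis, coeffs):
    A = np.column_stack([f(x) for f in basis])
    return float(np.sqrt(np.mean((y - A @ coeffs) ** 2)))

basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]
x = np.linspace(0.0, 2.0, 50)
y = 3.0 + 2.0 * x - 1.0 * x ** 2                   # synthetic "error rate"
a = fit_basis(x, y, basis)
print(rms_error(x, y, basis, a))                   # essentially 0
```

Adding more basis functions (squares, powers of the other error rate) only ever enlarges the column space, which is why the fit error in Tables 3.10 and 3.11 decreases monotonically across columns.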
Tables 3.10 and 3.11 show the RMS error for a general least squares fit of the baseline-HMM and
MAP-HMM error rates using different sets of measures as input. In each column another function of a
measure is added to the set of basis functions: F_1(x) = 1, F_2(x) = FD(x), F_3(x) = SNR(x), and so on. For
each set of basis functions the optimal linear coefficients were computed and the RMS error measured.
The standard deviation factor in Equation (3.7) was set to a constant value of 1. When fitting to the
baseline-HMM error rate, powers of the MAP-HMM error rate are added as basis functions, and when
fitting to the MAP-HMM error rate, powers of the baseline-HMM error rate are added as basis functions.
Except for error rate, the functions are added in decreasing order of their linear correlation. As would be
expected, the fit error decreases significantly as basis functions are added. In particular the fit of the
MAP-HMM error rates is greatly enhanced by the addition of the squared functions, which allows the
optimization to fit the curvature of the relationship and achieve a fit as good or nearly as good as the fit of
the baseline-HMM error rate. Figure 3.19 shows the scatter plots of the linear predictors including 1st and
                              FD    +SNR  +BSD  +SSNR  +FD²,SNR²,BSD²,SSNR²  +Err%^{1,2,3}
Without Talker Normalization
  Baseline Err%              7.86   7.83  7.53  7.42          7.00               3.42
  MAP Err%                   8.68   8.32  8.30  8.20          6.96               4.11
With Talker Normalization
  Baseline Err%              4.27   4.09  4.05  4.03          3.79               3.08
  MAP Err%                   5.41   4.74  4.70  4.58          3.71               2.76
Table 3.10: RMS errors for linear least squares estimators of recognition error rates using per-utterance
values. Units are % words in error. Basis functions are added to the least squares estimator starting with FD.
Each column adds another function to the set of basis functions and the 5th column adds squared functions
of the 4 distortion measures. For the column labeled "Err%^{1,2,3}", powers of the baseline-HMM error rate
are added to the basis functions for predicting the MAP-HMM error rate and vice versa. The optimization and
error computation is performed over the 682 per-talker values as in Table 3.8.
Table 3.11: RMS errors for linear least squares estimators of recognition error rates using the ensemble
average values of each measure. Units are % words in error. Basis functions are added to the least squares
estimator starting with FD. Each column adds another function to the set of basis functions and the 5th
column adds squared functions of the 4 distortion measures. For the column labeled "Err%^{1,2,3}", powers 1
through 3 of the baseline-HMM error rate are added to the basis functions for predicting the MAP-HMM error
rate and vice versa. The optimization and error computation is performed over the 62 average values as in
Table 3.9.
2nd powers of FD, SNR, BSD and SSNR. The fits with 3rd powers included are not shown since adding the
3rd power reduced the error only very marginally.
Figure 3.19 shows that the general least squares fit has a strong linear relationship with the predicted error
rate (compare Figure 3.15), but the variance around this linear trend remains significant. As before,
when the data is averaged across all the talkers (see Figure 3.20) the variance from the linear trend is (not
surprisingly) greatly reduced.
Figure 3.19: Scatter plots of the linear least squares fit including all 1st and 2nd powers of FD, SNR,
BSD and SSNR against the predicted error rate. The corresponding fit error is shown in Table 3.10. The
black points correspond to the values with per-talker bias removed and the lighter patches to the values
without any normalization. As before there are 682 data points plotted, one for each of 11 talkers times 62
measurements.
Figure 3.20: Scatter plots of the linear least squares fit including all 1st and 2nd powers of FD, SNR, BSD
and SSNR against the predicted error rate for the overall averages. The corresponding fit error is shown in
Table 3.11.
3.4 Summary
A set of multichannel recordings was made and processed through a uniformly weighted delay-and-sum
beamformer using from 1 to 16 microphones. The resulting enhanced recordings were evaluated with
distortion measures (FD, BSD, SSNR, peak SNR) and with the performance of a speech recognition
system. As expected, the performance of the DSBF improves as the number of microphones used
increases.
- Recognition performance improves steadily as microphones are added to the DSBF, resulting in
approximately a 40% decrease in error rate for the quiet data and a 70% decrease in error rate for the
noisy data (compared to the performance of microphone 1).
- Every distortion measure shows similar monotonic improvement as microphones are added to the
DSBF.
- For the added-noise case, the large range of performance measured for each single microphone
suggests that a non-uniform weighting of the microphones should provide an improvement over the
uniform weighting used here.
- The feature distortion measure (FD) tracked the recognition performance very closely and may
provide a way to predict the recognition performance without actually running a large data set
through the speech recognizer.
- The improvement in peak SNR was also quite well correlated with the improvement in speech
recognition score. In general this is not expected to be true; there are trivial operations that could
increase SNR while destroying the speech signal (adding noise only during times of active speech,
for instance). For the DSBF, however, the overall performance is well reflected by its ability to
suppress noise.
- MAP training greatly improved the speech recognition accuracy in every instance. The
improvement due to MAP training is of the same order of magnitude as the improvement from the
DSBF.
In the following chapters, methods intended to improve upon the performance of the simple unweighted
DSBF used here will be developed. Chapter 4 investigates some alternative weighting methods based upon
the array and source geometry. Chapter 5 develops an MMSE multi-input noise-suppression
filtering system, and Chapter 6 shows experimental results on the recorded database using those methods.
CHAPTER 4:
TOWARDS ENHANCING DELAY AND SUM BEAMFORMING
This chapter will examine the performance that can be expected from delay-and-sum beamformers in a
simple microphone-array scenario and investigate the improvements possible through an optimal
microphone weighting scheme. Reverberant room simulations with multiple noise sources will be used to
evaluate the impact of adding microphones to a linear array. The beamformer performance will be
assessed with objective measures including signal-to-noise ratio (SNR), signal-to-reverberation ratio
(SRR) and Bark spectral distortion (BSD).
zero mean. The signal received at each sensor is a delayed version of the original signal plus an additive
noise component:

y_m(t) = h_m s(t - τ_m) + n_m(t),    1 ≤ m ≤ M    (4.1)

where n_m(t) is a zero-mean normally distributed noise signal with variance σ²_m, τ_m is the time delay to the
mth sensor, and h_m is the signal attenuation at the mth sensor. n_m(t) is uncorrelated with s(t) and with all n_l(t)
for m ≠ l. Assuming that the τ_m are known, each received signal can be appropriately delayed. The
beamformed output is then the sum of M copies of the signal s(t) with M uncorrelated additive noise
sources:

y(t) = Σ_{m=1}^{M} [h_m s(t) + n_m(t + τ_m)] = s(t) Σ_{m=1}^{M} h_m + Σ_{m=1}^{M} n_m(t + τ_m)
Figure 4.1: Idealized SNR gain as a function of the number of sensors in a DSBF beamformer given that
the noise at each sensor is uncorrelated with the noise at all other sensors and the signal.
To simplify the analysis, assume that the noise power is identical in each channel, σ²_m = σ², and the signal
gain is likewise identical, h_m = 1 for all 1 ≤ m ≤ M. This assumption implies that no sensor contributes
more than any other sensor; the SNR at each sensor is identical. The signal-to-noise ratio of a single (delayed)
channel is

SNR_1 = E[s²(t)] / E[n²_1(t + τ_1)] = E[s²(t)] / σ²

and the signal-to-noise ratio of the beamformer output is

SNR_uniform = E[(Σ_{m=1}^{M} s(t))²] / E[(Σ_{m=1}^{M} n_m(t + τ_m))²] = M² E[s²(t)] / Σ_{m=1}^{M} σ²_m = M E[s²(t)] / σ²

Taking the log of the ratio of these two SNRs to get the improvement in dB yields:

10 log₁₀ (SNR_uniform / SNR_1) = 10 log₁₀ M    (4.2)
Figure 4.1 plots this gain function. For every doubling of the number of sensors a 3dB improvement in
output SNR is realized. Even in this idealized scenario 100 sensors are needed to achieve a 20dB
improvement in SNR.
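Equation (4.2) is easy to confirm by simulation: summing M aligned copies of the signal with M independent unit-variance noises should gain about 10 log10 M dB. A Monte Carlo sketch with an arbitrary test signal, not the thesis recordings:

```python
import numpy as np

# Sketch: Monte Carlo check of Equation (4.2). The coherent signal sum
# grows as M while the incoherent noise sum grows as sqrt(M), so the
# output SNR improves by 10*log10(M) dB: about 3dB per doubling.
rng = np.random.default_rng(0)
n = 200_000
s = np.sin(2 * np.pi * 0.01 * np.arange(n))          # arbitrary test signal

def snr_db(sig, noise):
    return 10 * np.log10(np.mean(sig ** 2) / np.mean(noise ** 2))

gains = []
for M in (1, 2, 4, 8):
    beam_noise = rng.standard_normal((M, n)).sum(axis=0)  # M summed noise channels
    gain = snr_db(M * s, beam_noise) - snr_db(s, rng.standard_normal(n))
    gains.append(gain)
    print(M, round(gain, 1))   # approximately 0, 3, 6, 9 dB
```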
The situation becomes more complicated if each sensor measures a different level of signal and noise,
that is, h_m ≠ h_l and σ²_m ≠ σ²_l. In this case the SNR of the simple delay-and-sum beamformer is

SNR_nonuniform = E[(Σ_{m=1}^{M} h_m s(t))²] / E[(Σ_{m=1}^{M} n_m(t))²] = (Σ_{m=1}^{M} h_m)² E[s²(t)] / Σ_{m=1}^{M} σ²_m    (4.3)
4.2 Delay-Weight-and-Sum
The output SNR of the beamformer can be maximized by weighting each y_m(t) before summing the
channels. If g_m is the weight applied to channel m, then the SNR of the output of this
delay-weight-and-sum beamformer is

SNR_weighted = (Σ_{m=1}^{M} g_m h_m)² E[s²(t)] / Σ_{m=1}^{M} g²_m σ²_m    (4.4)

A simple optimization can be performed on the expression in Equation (4.4) by taking the derivative with
respect to g_l and setting the result equal to zero:

∂SNR_weighted/∂g_l = [2 (Σ_{m=1}^{M} g_m h_m) h_l (Σ_{m=1}^{M} g²_m σ²_m) - (Σ_{m=1}^{M} g_m h_m)² 2 g_l σ²_l] E[s²(t)] / (Σ_{m=1}^{M} g²_m σ²_m)² = 0

which requires

g_l = h_l (Σ_{m=1}^{M} g²_m σ²_m) / (σ²_l Σ_{m=1}^{M} g_m h_m)

and is satisfied by

g_l = h_l / σ²_l    (4.5)

Substituting this weighting into Equation (4.4) gives the SNR of the optimally weighted beamformer:

SNR_optimalweighted = (Σ_{m=1}^{M} h²_m / σ²_m)² E[s²(t)] / Σ_{m=1}^{M} (h²_m / σ²_m) = (Σ_{m=1}^{M} h²_m / σ²_m) E[s²(t)]    (4.6)
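Equations (4.4) through (4.6) can be checked numerically: the weights g_m = h_m/σ²_m should beat any other weighting, and the resulting SNR should equal Σ h²_m/σ²_m times the signal power. A sketch with random illustrative gains and noise variances:

```python
import numpy as np

# Sketch: numerical check that the weights of Eq. (4.5) maximize the
# delay-weight-and-sum SNR of Eq. (4.4), and that the maximum matches
# Eq. (4.6). Channel gains and noise variances are random stand-ins.
rng = np.random.default_rng(1)
h = rng.uniform(0.2, 1.0, size=16)           # per-channel signal gains
var = rng.uniform(0.5, 2.0, size=16)         # per-channel noise variances

def snr(g):                                  # Eq. (4.4) with E[s^2] = 1
    return (g @ h) ** 2 / (g ** 2 @ var)

g_opt = h / var                              # Eq. (4.5)
best_random = max(snr(rng.standard_normal(16)) for _ in range(1000))
print(snr(g_opt) >= best_random)                     # True (Cauchy-Schwarz)
print(np.isclose(snr(g_opt), (h ** 2 / var).sum()))  # Eq. (4.6): True
```

The inequality is exactly the Cauchy-Schwarz bound (g·h)² ≤ (Σ g²σ²)(Σ h²/σ²), with equality when g ∝ h/σ².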
Figure 4.2: A linear microphone array with inter-microphone spacing d and distance from the talker to the
array midpoint r. The microphone labels in the figure (M, ..., 3, 1, 2, ..., M−1) indicate that microphones are
added alternately on either side of the central microphone 1.
Figure 4.3: Simulation showing the improvement in SNR for the optimally-weighted beamformer and the
unweighted beamformer for a near-field linear array. The ideal SNR curve from Figure 4.1 is included for
comparison.
To get a sense of what this could mean in practice, consider the example in Figure 4.2. A talker stands
r = 1m away from the center of a symmetrical linear microphone array with constant microphone spacing
of d = 10cm. Each microphone receives an equal level of independent noise (σ²_m = 1). The h_m (and g_m)
terms are inversely proportional to the distance from the talker to the microphone and given by

h_m = r / (r² + (⌊m/2⌋ d)²)^{1/2}

(this choice of distances results in unity gain at the center microphone). As
microphones are added at the ends of the array we can compute the improvement in SNR as a function of
the number of microphones with Equation (4.6). This is plotted in Figure 4.3.
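Under the stated model, the curves of Figure 4.3 can be reproduced in a few lines. The attenuation formula h_m = r/√(r² + (⌊m/2⌋d)²) used here is an assumed reading of the geometry (microphone 1 central, later microphones added alternately to either side):

```python
import numpy as np

# Sketch of the Figure 4.3 experiment: talker r = 1m from a linear array
# with d = 0.1m spacing and unit noise variance in every channel.
r, d, M = 1.0, 0.1, 100
m = np.arange(1, M + 1)
h = r / np.sqrt(r ** 2 + (np.floor(m / 2) * d) ** 2)   # assumed indexing

def db(x):
    return 10 * np.log10(x)

# SNR improvement over the single central microphone (h_1 = 1):
# unweighted DSBF from Eq. (4.3) with sigma^2 = 1, optimal from Eq. (4.6).
unweighted = db(np.cumsum(h) ** 2 / m)
optimal = db(np.cumsum(h ** 2))
ideal = db(m.astype(float))                            # Eq. (4.2)
print(round(unweighted[-1], 1), round(optimal[-1], 1), round(ideal[-1], 1))
```

The printed values show the qualitative behavior of Figure 4.3: both realizable curves flatten well below the ideal 10 log10 M line, with the optimal weighting only slightly ahead of the unweighted sum.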
It is apparent from Figure 4.3 that when even a simple model of signal attenuation is taken into account, the
realizable gain from a delay-and-sum or delay-weight-and-sum beamformer can be quite limited. As
microphones are added in this simple example, the SNR in each added microphone drops as the added
microphones lie further and further away from the source, eventually negating the gain of adding the
distant microphone at all1.
1 Granted, this is a contrived example: with the 10cm spacing described, the array is 4.1m wide when there are 40 microphones
in the array and 10m wide when there are 100 microphones in the array. Clearly this is not the ideal array geometry for a talker 1 meter
away from the array, but it makes the point.
Figure 4.4: Linear microphone array with interfering noise point-source located 2m to the right of the talker.
Figure 4.5: SNR improvement for DSBF with point-source noise and ambient noise.
point-noise source 2m to the right of the talker. Assume that the level of noise measured by the
microphone closest to the noise source is equal to the ambient noise level. This is depicted in Figure 4.4.
In this case the optimal weighting according to Equation (4.5) is no longer simply inversely proportional to
the distance from the talker, but also proportional to the square of the distance from the interfering noise
source (because in the simple spherical propagation model the noise power in the denominator of
Equation (4.5) is inversely proportional to the square of the distance from the noise source).
Figure 4.5 shows SNR improvement as a function of the number of microphones in the array2. This
particular result obviously will vary with the ratio of the ambient noise power to the source noise power. If
the ambient noise power is much greater than the point-noise power the result will approach the one
plotted in Figure 4.3.
The curves shown in Figures 4.3 and 4.5 show only a subtle difference. The achievable SNR is marginally
higher (about 1dB) in Figure 4.5 for both the unweighted and optimally weighted schemes, but in either
scenario the weighted solution enjoys only about 1dB of improvement in SNR over the unweighted
solution.
2 In this scenario the noise is no longer independent and the expression in Equation (4.6) is not truly applicable. Although correlated
noise may add destructively or constructively depending upon the geometry of the array and source, a reasonable (or pessimistic)
expectation is that the array will do worse in the presence of correlated noise. The SNR improvements shown here are over-estimates
under that expectation.
4.3 Delay-Filter-and-Sum
The obvious extension of the simple source model in Equation (4.1) is to generalize the signal scaling
factor h_m to a convolutional element or channel impulse response. Each sensor in this model receives a
filtered version of the signal plus an independent noise signal:

y_m(t) = h_m(t) ∗ s(t) + n_m(t)

(where ∗ denotes convolution) and to introduce a corresponding convolutional element into the channel
weighting in the beamformer:

y(t) = Σ_{m=1}^{M} g_m(t) ∗ [h_m(t) ∗ s(t) + n_m(t)]    (4.7)

The channel-dependent filtering function can be distributed to write the beamformer output in terms of the
signal-derived portion, y_s(t), and the noise-derived portion, y_n(t):

y_s(t) = Σ_{m=1}^{M} g_m(t) ∗ h_m(t) ∗ s(t)

y_n(t) = Σ_{m=1}^{M} g_m(t) ∗ n_m(t)
Rewriting these expressions in the frequency domain facilitates the formulation of a frequency-dependent
optimal weighting:

Y_s(ω) = Σ_{m=1}^{M} G_m(ω) H_m(ω) S(ω)

Y_n(ω) = Σ_{m=1}^{M} G_m(ω) N_m(ω)

where G_m(ω) is the frequency dependent weighting, Y_s(ω) is the Fourier transform of the signal-derived portion of the
beamformer output, and Y_n(ω) is the Fourier transform of the noise-derived portion of the beamformer
output. The frequency-dependent output SNR is then

SNR_weighted(ω) = E[|Y_s(ω)|²] / E[|Y_n(ω)|²] = E[|Σ_{m=1}^{M} G_m(ω) H_m(ω) S(ω)|²] / E[|Σ_{m=1}^{M} G_m(ω) N_m(ω)|²]    (4.8)

= |Σ_{m=1}^{M} G_m(ω) H_m(ω)|² E[|S(ω)|²] / Σ_{m=1}^{M} |G_m(ω)|² σ²_m(ω)    (4.9)

where σ²_m(ω) = E[|N_m(ω)|²]. Once again the assumption in place is that the noise in each channel is
uncorrelated with the signal and with the noise in every other channel. As before, the optimal weighting is
found by taking the derivative with respect to G_l and setting the result equal to 0. Dropping the (ω) notation
and omitting the limits on the summations (they are all over m = 1 to M):

∂SNR_weighted/∂G_l = [2 H_l (Σ G_m H_m) (Σ |G_m|² σ²_m) - |Σ G_m H_m|² 2 G_l σ²_l] E[|S|²] / (Σ |G_m|² σ²_m)² = 0    (4.10)
which simplifies to

G_l = H*_l (Σ |G_m|² σ²_m) (Σ G_m H_m) / (σ²_l |Σ G_m H_m|²)

and, not surprisingly, this is satisfied by

G_l(ω) = H*_l(ω) / σ²_l(ω)    (4.11)

The solution in Equation (4.11) features the time-reverse of the impulse response, H*_l(ω), and is a
noise-weighted variant of the matched-filter method[23]. The H*_l(ω) filter acts as a sort of pseudo-inverse
for H_l(ω).
Magnitude-only Solution
If the channel impulse response is modeled as a magnitude-only filter, H_l(ω) = |H_l(ω)|, then

G_l(ω) = |H_l(ω)| / σ²_l(ω)    (4.12)

and the filter-and-sum strategy in this case can be considered as a filterbank implementation where
Equation (4.5) is used to derive a real-valued weight for every filterbank frequency ω.
response to the beamformer the filters once derived from Equations (4.11) and (4.12) should have their
magnitudes normalized by a factor of M 1G ω for each analysis frequency, ω, to ensure that the gain of
∑m 1 m
the beamformer is uniform across frequency.
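The weight construction and per-frequency normalization described above can be sketched as follows (a minimal illustration, assuming the channel transfer functions and noise PSDs are given on a common frequency grid; the function name and array layout are my own):

```python
import numpy as np

def optimal_snr_filters(H, noise_psd, magnitude_only=False):
    """Per-channel filter responses per Eq. (4.11) / Eq. (4.12).

    H         : (M, F) complex channel transfer functions H_m(omega)
    noise_psd : (M, F) channel noise powers sigma_m^2(omega)
    Returns G : (M, F) filters, normalized at each frequency so the
                beamformer gain is uniform across frequency.
    """
    if magnitude_only:
        G = np.abs(H) / noise_psd        # Eq. (4.12): magnitude-only model
    else:
        G = np.conj(H) / noise_psd       # Eq. (4.11): matched-filter variant
    # Normalize the filter magnitudes at each analysis frequency.
    return G / np.sum(np.abs(G), axis=0, keepdims=True)
```

The beamformer output is then $Y(\omega) = \sum_m G_m(\omega)Y_m(\omega)$ on the delay-steered channels.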
It may seem an obvious step to apply a frequency weighting that implements the same sort of SNR
optimization as the microphone weighting, but if Equation (4.9) is rewritten as a function of frequency
weights instead of channel weights it will not lead to a similar solution for a frequency weighting.
Frequency weighting strategies are discussed in Chapter 5.
4.4.1 Methods
Figure 4.6 depicts the layout of the simulated room. Forty microphones are arranged in a linear array with 10 cm microphone separation across the short wall of a 4 x 6 m room. Three interfering noise sources are simulated with recordings of computer-equipment fan noise and placed as shown in Figure 4.6. A digital recording of a male talker made with a close-talking microphone is used as the desired source signal. The sampling rate is 16000 Hz. The impulse response from each source (the talker and 3 separate noise sources) to each microphone is simulated using the image method[80]; see Figure 4.7 for an example impulse response. The resultant reverberation time is approximately 250 ms and the unprocessed
Figure 4.6: Location of the source, noise, and microphones for the reverberant simulation.
Figure 4.7: The simulated impulse response for the talker received at microphone 1.
Figure 4.8: Optimal filter for microphone 1 for the 40 microphone case using the complex model of the
channel transfer function (based on Equation (4.11)).
signal-to-noise ratio at microphone 1 is approximately 3 dB. Signal or noise power in the following is computed by taking the mean squared value of the time signal over the length of the test signal³.
The microphone outputs are delay-steered to the source location, then the statistics of the noise and signal are measured over the entire 3-second utterance and used to derive the weight or filter for each channel. The channels are weighted (or filtered) and summed in the following 4 ways:
1. Uniformly weighted (unweighted). This is simply a delay and sum beamformer; each channel is
weighted equally.
2. Weighted according to Equation (4.5) (weighted). The noise power and the magnitude of the direct
path in the impulse response are measured for each microphone to form each weight according to
Equation (4.5).
3. Weighted at each analysis frequency according to Equation (4.11) (freq weighted). The PSD of the
noise and the time inverse of the impulse response for each microphone is used to generate the
optimal filter according to Equation (4.11).
4. Weighted at each analysis frequency according to Equation (4.12) (mag freq weighted). A
magnitude-only version of the filter computed in item 3 above is used to weight each frequency at
each microphone.
In this simulation a 512-point analysis window (32 ms) was used for measuring the power spectral density of the signal and noise. The filters in the filter-and-sum beamformer are 1024 points long for the complex model of the impulse response ($G_m = H_m^*/\sigma_m^2$; method (3), freq weighted) and 512 points for the magnitude-only case ($G_m = |H_m|/\sigma_m^2$; method (4), mag freq weighted). A 1024-point tapered truncation of the channel impulse responses, $h_m(t)$, is used to compute the numerator for the filters⁴. The shorter window length was used for the magnitude-only version only because the shorter window is a more practical analysis length for speech signals. For the complex case the window had to be lengthened to include a reasonable portion of the impulse response.
For cases (3) and (4) the overall array frequency response is normalized as described in Section 4.3.1. This
frequency normalization smoothes out the overall spectral shape of the beamformer response, but
especially in case (3) where the conjugate of the channel transfer function is being used as the beamformer
filter (see Figure 4.8), there are zeros in the derived channel filters; the resulting total beamformer
frequency response still contains these zeros. Because of this beamformer frequency response distortion
and the action of the matched filter the signal-to-noise and reverb ratios in Figures 4.9, 4.11 and 4.12 show
an improvement even for the case of a single microphone.
4.4.2 Results
Figure 4.9 depicts the SNR yielded by the different weighting strategies as a function of the number of microphones, added in order of increasing distance from the source. For the S/N ratio shown in Figure 4.9, the numerator is the energy of the direct-path signal of the talker and the denominator includes the energy of the reverberation due to the talker as well as all direct and reverberant energy from the noise sources. In other words, anything other than the talker direct path is considered to be noise in this ratio. The signal impulse response is divided into direct and reverberant components by applying a 4 ms⁵ wide window
³ If the unprocessed SNR seems low, consider that this is an average SNR. Even at 3 dB the target speech is intelligible. The peak SNR is approximately 5 dB.
⁴ The truncated impulse response was used to try to inject a little bit of practicality into the implementation of the filter-and-sum beamformers. This length can be increased, along with the lengths of the derived filters, to correspond to the total length of the simulated channel impulse responses at the cost of increased computational complexity, but the results do not significantly change.
⁵ This value was chosen fairly ad hoc. It should be noted that the measured signal power is very sensitive to the value of this parameter. The wider the time window that is considered to be direct-path energy, the higher the measured direct signal energy will be and subsequently the higher all the signal power ratios (SNR, SRR) will be.
Figure 4.9: Signal-to-noise+reverb improvement for a simulated room with 3 equipment fan noise sources
as a function of the number of microphones in the array.
Figure 4.10: BSD measure for the 4 different beamforming schemes as a function of the number of micro-
phones in the array.
around the impulse corresponding to the direct path in the simulated channel impulse response. The talker
signal is then convolved with the direct path component and the reverberant component separately and
summed separately for cases (1), (2), and (4). For case (3), one of the actions of the derived filter is to increase the power in the main lobe while spreading reverberant energy away from the main impulse[23]; consequently, for this case the derived filter is convolved with the simulated channel impulse response and then decomposed into direct and reverberant components so that the "matched-filtering" effect can be measured accurately.
For all methods the SNR improvement falls well short of the $10\log_{10}(40) \approx 16$ dB theoretical array gain from Equation (4.2). This is hardly surprising given the correlated nature of the noise included in this simulation, both from the simulated noise sources and from the talker reverberation. Figure 4.9 shows that
the simple weighted beamformer (2) and the magnitude-only filter-and-sum beamformer (4) hardly do any
better than the uniformly weighted beamformer (1). Although method (3) does noticeably better according
to the signal power ratio measures (Figures 4.9, 4.11 and 4.12) the Bark spectral distortion measure
(Figure 4.10) is only marginally lower than even the simplest unweighted beamformer. Note also how the
incremental improvement in SNR is quite similar for all methods. Method (3)’s higher SNR starts right at
the single microphone case suggesting that the main source of the higher ratios can be attributed to the
matched-filtering and spectral distortion effects rather than the optimization in the microphone weighting.
Figure 4.11: Signal-to-noise-only ratios as a function of the number of microphones in the array.
Figure 4.12: Signal-to-reverberation ratios as a function of the number of microphones in the array.
Informal listening tests confirm that method (3) sounds marginally clearer than the other methods, but this improvement comes at the cost of knowing the channel impulse response exactly. The computation of the optimal filters in simulation is trivial since the channel impulse responses are all known, but in a practical situation accurate channel impulse responses will be quite difficult to measure and will vary widely with changes in the room[21], not to mention the position of the talker.
The marginal results of the weighting methods investigated in this simulation suggest that a different
strategy may be more fruitful in providing improvement over the basic unweighted DSBF. Chapter 5 will
introduce another form of optimization and the resulting algorithm will be implemented in Chapter 7.
CHAPTER 5:
OPTIMAL FILTERING
In the previous chapter an optimal-SNR weighting strategy was derived based upon a combination of noise
statistics and geometric signal propagation models. The derived weights or filters were constant for a
particular arrangement of talker and noise sources in the room with no dependence on the signal received
at the array sensors. Additionally, no frequency shaping or distortion was permitted in the array frequency
response. In this chapter, data-dependent optimal filtering strategies that use spectral shaping in an attempt
to improve signal quality will be investigated, in particular the Wiener filter and a novel multi-channel
variant of the Wiener filter. Also a non-optimal application of Wiener pre-filtering to microphone arrays
will be introduced.
The Wiener filter estimates a desired signal, $s(t)$, from an observation, $y(t)$, corrupted by noise or other distortion. A filter $\phi(t)$ that when convolved with the received signal, $y(t)$, approximates $s(t)$ is given by
$$\hat{s}(t) = \phi(t) * y(t) \tag{5.1}$$
and the error by
$$e(t) = s(t) - \hat{s}(t)$$
If minimum mean-squared error is the criterion for choosing $\phi(t)$, an expression for the mean-squared error is required. Rewriting the expression for the error in the frequency domain and employing Parseval's relation yields:
$$\hat{S}(\omega) = \Phi(\omega)\,Y(\omega)$$
$$E(\omega) = S(\omega) - \hat{S}(\omega)$$
$$\xi = \frac{1}{2\pi}\int_{-\infty}^{\infty} E\left[\left|S(\omega) - \hat{S}(\omega)\right|^2\right] d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} E\left[\left|E(\omega)\right|^2\right] d\omega \tag{5.3}$$
where $e(t) = s(t) - \hat{s}(t)$ and $S(\omega)$, $\hat{S}(\omega)$, $\Phi(\omega)$, $Y(\omega)$ and $E(\omega)$ are the Fourier transforms of $s(t)$, $\hat{s}(t)$, $\phi(t)$, $y(t)$ and $e(t)$, respectively. The filter $\Phi(\omega)$ is chosen to minimize the total squared error, $\xi$. Moving the expected value operation inside the integral, taking the derivative¹ of the integrand with respect to $\Phi(\omega)$ and setting it equal to 0 yields
¹ A frequently omitted detail is that this is not, strictly speaking, the derivative of the total squared error, ξ. Nevertheless, the values
$$\frac{\partial}{\partial \Phi^*(\omega)} E\left[E(\omega)E^*(\omega)\right] = E\left[E(\omega)\frac{\partial E^*(\omega)}{\partial \Phi^*(\omega)}\right] = -E\left[E(\omega)Y^*(\omega)\right] = 0$$
$$E\left[\left(S(\omega) - \Phi(\omega)Y(\omega)\right)Y^*(\omega)\right] = 0$$
$$E\left[S(\omega)Y^*(\omega)\right] - \Phi(\omega)\,E\left[Y(\omega)Y^*(\omega)\right] = 0$$
$$\Phi(\omega)\,E\left[|Y(\omega)|^2\right] = E\left[S(\omega)Y^*(\omega)\right]$$
$$\Phi(\omega) = \frac{E\left[S(\omega)Y^*(\omega)\right]}{E\left[|Y(\omega)|^2\right]} \tag{5.4}$$
Expanding the expectations in Equation (5.4) produces a variety of cross terms unless some assumptions are made about the nature of the signal, $s(t)$, and noise, $n(t)$. A commonly made assumption that simplifies Equation (5.4) is that the signal and noise are uncorrelated. More explicitly, that the expected value of the cross-correlation of the signal and noise is equal to 0:
$$\int_{-\infty}^{\infty} E\left[S(\omega)N^*(\omega)\right] d\omega = 0 \tag{5.5}$$
where $E[\cdot]$ denotes expected value. Using the signal model $y(t) = s(t) + n(t)$, or its frequency-domain equivalent $Y(\omega) = S(\omega) + N(\omega)$, the Wiener filter becomes
$$\Phi = \frac{E[SY^*]}{E[|Y|^2]} = \frac{E\left[|S|^2\right]}{E\left[|S|^2\right] + E\left[|N|^2\right]} \tag{5.6}$$
or, rewriting this in terms of the measured signal $Y$ and the noise statistic $E\left[|N|^2\right]$:
$$\Phi = \frac{|Y|^2 - E\left[|N|^2\right]}{|Y|^2} \tag{5.7}$$
Equations (5.6) and (5.7) are the most commonly seen forms of the Wiener filter. Note that the filter coefficients are strictly real and non-negative. In Equation (5.6) it is clear that $\Phi \le 1$. The result of Equation (5.7) may not satisfy this condition if the noise power in the observation $|Y|^2$ is less than $E\left[|N|^2\right]$. Care must be taken in the implementation to ensure that noise in the observations doesn't push the filter values outside this range.
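A minimal sketch of Equation (5.7) with the clamping just described (the function name and floor constant are illustrative choices, not from the text):

```python
import numpy as np

def wiener_gain(observed_power, noise_power):
    """Wiener gain per Eq. (5.7): (|Y|^2 - E|N|^2) / |Y|^2.

    The result is clamped to [0, 1]: Eq. (5.6) shows the true gain
    never exceeds 1, and observation noise can make the numerator
    subtraction negative.
    """
    gain = (observed_power - noise_power) / np.maximum(observed_power, 1e-12)
    return np.clip(gain, 0.0, 1.0)
```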
A more general signal model includes a channel filter: $y(t) = h(t)*s(t) + n(t)$. In this case, using the same assumption about the lack of correlation between signal and noise,
$$\Phi = \frac{E\left[S\,(HS+N)^*\right]}{E\left[|HS+N|^2\right]} = \frac{H^*\,E\left[|S|^2\right]}{|H|^2 E\left[|S|^2\right] + E\left[|N|^2\right]} \tag{5.8}$$
Details of Implementation
Note that if the estimate of the signal power already incorporates the transfer function, $H$, then the formulation in Equation (5.8) shouldn't be applied directly. For instance, suppose a filter, $\hat{\Phi}$, is derived according to Equation (5.8), but the estimate of $|S|^2$, $|\hat{S}|^2$, is formed by subtracting the expected value of the noise power from the instantaneous measurement of the input signal:
$$|\hat{S}|^2 = |Y|^2 - E\left[|N|^2\right] = |HS+N|^2 - E\left[|N|^2\right] \approx |H|^2|S|^2$$
Then it is inappropriate to directly substitute this signal estimate into Equation (5.8),
$$\hat{\Phi} = \frac{H^*\,|\hat{S}|^2}{|Y|^2}$$
because according to the signal model $|\hat{S}|^2$ already includes a factor of $|H|^2$. To achieve the form in Equation (5.8), $|\hat{S}|^2$ needs to be divided by $H$ rather than multiplied by $H^*$. That is,
$$\hat{\Phi} = \frac{1}{H}\,\frac{|Y|^2 - E\left[|N|^2\right]}{|Y|^2} \approx \frac{1}{H}\,\frac{|H|^2|S|^2}{|Y|^2} = \frac{H^*|S|^2}{|Y|^2}$$
The simplest multi-channel extension sums the channels into a single beamformed output,
$$y_B(t) = \sum_{m=1}^{M} y_m(t)$$
and derives a filter in a manner analogous to the development of the Wiener filter above. Since the channels have been summed into a single output channel the MMSE solution is simply a Wiener post-filter on the beamformed signal:
$$\Phi(\omega) = \frac{E\left[S(\omega)Y_B^*(\omega)\right]}{E\left[|Y_B(\omega)|^2\right]} \tag{5.9}$$
Alternatively, each channel can be given its own filter, $\Phi_m(\omega)$, with the estimate formed as $\hat{S}(\omega) = \sum_{m=1}^{M}\Phi_m(\omega)Y_m(\omega)$. In a manner similar to the derivation of the single-channel Wiener filter a solution for $\Phi_m(\omega)$ can be derived starting from Equation (5.3). Substituting $\hat{S}(\omega)$ as it is defined in Equation (5.11) yields
$$\xi = \frac{1}{2\pi}\int_{-\infty}^{\infty} E\left[|E(\omega)|^2\right] d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} E\left[\left|S(\omega) - \sum_{m=1}^{M}\Phi_m(\omega)Y_m(\omega)\right|^2\right] d\omega \tag{5.12}$$
Taking the derivative of the integrand of Equation (5.12) with respect to $\Phi_m^*(\omega)$ and setting it equal to zero yields:
$$\frac{\partial}{\partial \Phi_m^*(\omega)} E\left[E(\omega)E^*(\omega)\right] = E\left[E(\omega)\frac{\partial E^*(\omega)}{\partial \Phi_m^*(\omega)}\right] = -E\left[E(\omega)Y_m^*(\omega)\right] = 0$$
This produces one equation for each of the $M$ channel filters, and the general solution can be written as:
$$\begin{bmatrix}\Phi_1\\ \Phi_2\\ \vdots\\ \Phi_M\end{bmatrix} = \left(E\begin{bmatrix}|Y_1|^2 & Y_2Y_1^* & \cdots & Y_MY_1^*\\ Y_1Y_2^* & |Y_2|^2 & \cdots & Y_MY_2^*\\ \vdots & \vdots & \ddots & \vdots\\ Y_1Y_M^* & Y_2Y_M^* & \cdots & |Y_M|^2\end{bmatrix}\right)^{-1} E\begin{bmatrix}SY_1^*\\ SY_2^*\\ \vdots\\ SY_M^*\end{bmatrix} \tag{5.13}$$
The discerning reader will note that the matrix in Equation (5.13) is the spatial correlation matrix[81] and can be written as the outer product of the input signal vector:
$$E\left(\begin{bmatrix}Y_1\\ Y_2\\ \vdots\\ Y_M\end{bmatrix}\begin{bmatrix}Y_1^* & Y_2^* & \cdots & Y_M^*\end{bmatrix}\right) \tag{5.14}$$
Measurement of the spatial correlation matrix in Equation (5.14) can be problematic in practice. To estimate this matrix by averaging different instances of the $Y$ vectors requires at least $M$ instances to achieve a spatial correlation matrix with full rank, since a particular instance of $YY^*$ has rank 1. Consider what this means for a typical speech signal, which can only be considered stationary for 40 ms or so; if there are 16 microphones in the array the spatial correlation estimate requires at least 16 independent frames within that 40 ms window to form a spatial-correlation matrix of full rank. With a half-overlap Hamming analysis window this would imply an individual analysis frame of 4.7 ms. At a 16 kHz sampling rate this results in a frequency resolution of approximately 212 Hz. For larger numbers of microphones the frequency resolution only gets worse.
Consider first the signal-plus-independent-noise model:
$$y_m(t) = s(t) + n_m(t) \qquad\qquad Y_m(\omega) = S(\omega) + N_m(\omega) \tag{5.15}$$
Assuming initially that not only are signal and noise uncorrelated, but that each $n_m(t)$ is uncorrelated with the others, the following simplifications hold (for $m \neq l$):
$$E\left[|Y_m|^2\right] = |S|^2 + \sigma_m^2 \qquad E\left[SY_m^*\right] = |S|^2 \qquad E\left[Y_mY_l^*\right] = |S|^2 \tag{5.16}$$
Incorporating the simplifications from Equation (5.16) into Equation (5.13) yields
$$\begin{bmatrix}\Phi_1\\ \Phi_2\\ \vdots\\ \Phi_M\end{bmatrix} = \begin{bmatrix}|S|^2 + \sigma_1^2 & |S|^2 & \cdots & |S|^2\\ |S|^2 & |S|^2 + \sigma_2^2 & \cdots & |S|^2\\ \vdots & \vdots & \ddots & \vdots\\ |S|^2 & |S|^2 & \cdots & |S|^2 + \sigma_M^2\end{bmatrix}^{-1}\begin{bmatrix}|S|^2\\ |S|^2\\ \vdots\\ |S|^2\end{bmatrix} \tag{5.17}$$
The matrix in Equation (5.17) can be written as the sum of a constant matrix and a diagonal matrix of noise autocorrelation values:
$$|S|^2\begin{bmatrix}1 & 1 & \cdots & 1\\ 1 & 1 & \cdots & 1\\ \vdots & \vdots & \ddots & \vdots\\ 1 & 1 & \cdots & 1\end{bmatrix} + \begin{bmatrix}\sigma_1^2 & 0 & \cdots & 0\\ 0 & \sigma_2^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma_M^2\end{bmatrix}$$
The diagonal matrix is full rank and non-negative, so its sum with a non-negative constant matrix is also full rank. The matrix inverse in Equation (5.17) therefore exists in general and can be formed without long-term averaging of observations of $YY^*$. Carrying out the inverse yields
$$\begin{bmatrix}\Phi_1\\ \vdots\\ \Phi_M\end{bmatrix} = \frac{|S|^2}{\prod_{m=1}^M \sigma_m^2 + |S|^2\sum_{k=1}^M \prod_{m\neq k}\sigma_m^2}\begin{bmatrix}\prod_{m\neq 1}\sigma_m^2\\ \vdots\\ \prod_{m\neq M}\sigma_m^2\end{bmatrix} \tag{5.18}$$
where $\prod_{m\neq k}\sigma_m^2$ denotes the product of all $\sigma_m^2$ terms except for the $m=k$ term. It may be clearer to view Equation (5.18) with the product of the $\sigma_m^2$'s factored out. Note that
$$\prod_{m\neq k}\sigma_m^2 = \frac{\prod_{m=1}^M \sigma_m^2}{\sigma_k^2}$$
so Equation (5.18) can be rewritten as:
$$\begin{bmatrix}\Phi_1\\ \Phi_2\\ \vdots\\ \Phi_M\end{bmatrix} = \frac{|S|^2\,\prod_{m=1}^M \sigma_m^2}{\prod_{m=1}^M \sigma_m^2\left(1 + |S|^2\sum_{k=1}^M \frac{1}{\sigma_k^2}\right)}\begin{bmatrix}\frac{1}{\sigma_1^2}\\ \frac{1}{\sigma_2^2}\\ \vdots\\ \frac{1}{\sigma_M^2}\end{bmatrix}$$
and the product terms in numerator and denominator cancel out to yield:
$$\begin{bmatrix}\Phi_1\\ \Phi_2\\ \vdots\\ \Phi_M\end{bmatrix} = \frac{|S|^2}{1 + |S|^2\sum_{k=1}^M \frac{1}{\sigma_k^2}}\begin{bmatrix}\frac{1}{\sigma_1^2}\\ \frac{1}{\sigma_2^2}\\ \vdots\\ \frac{1}{\sigma_M^2}\end{bmatrix} \tag{5.19}$$
or written more succinctly in terms of the weight for a single microphone (and including the dependence on $\omega$ previously omitted for brevity):
$$\Phi_m(\omega) = \frac{|S(\omega)|^2}{1 + |S(\omega)|^2\sum_{k=1}^M \frac{1}{\sigma_k^2(\omega)}}\cdot\frac{1}{\sigma_m^2(\omega)} \tag{5.20}$$
In this form it can be clearly seen that each $\Phi_m$ is the reciprocal of the noise power for that channel with a common overall weighting that is a function of $|S|^2$ and the $\sigma_m^2$'s. In this form it is also more apparent that the computational complexity of this solution is now $O(M)$ rather than the $O(M^3)$ typically required by the matrix inverse². Note that for $M=1$ this solution is identical to the Wiener filter in Equation (5.6). Also, if the noise power is the same in each channel, $\sigma_m^2(\omega) = \sigma^2(\omega)$, then the resulting $\Phi_m(\omega)$ is also the same for each channel:
$$\Phi_m(\omega) = \frac{1}{M}\cdot\frac{|S(\omega)|^2}{|S(\omega)|^2 + \sigma^2(\omega)/M}$$
which is precisely a Wiener filter on each individual channel. Since this filter is the same for each channel, by the principles of linear systems it can be applied after the beamforming summation, resulting in a beamformer with Wiener post-filter, as would be expected.
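Equation (5.20) can be evaluated directly and cross-checked against the matrix solution of Equation (5.17); a short sketch (names and array layout are illustrative):

```python
import numpy as np

def mcw_weights(signal_power, noise_psd):
    """Multi-channel Wiener weights, additive-noise model, Eq. (5.20).

    signal_power : (F,) estimate of |S(omega)|^2
    noise_psd    : (M, F) per-channel noise powers sigma_m^2(omega)
    O(M) per frequency bin instead of the O(M^3) matrix inverse.
    """
    inv_noise = 1.0 / noise_psd
    common = signal_power / (1.0 + signal_power * inv_noise.sum(axis=0))
    return common * inv_noise

# Cross-check against the matrix form of Eq. (5.17) for one frequency bin:
S2, sig2 = 1.0, np.array([1.0, 2.0])
A = S2 * np.ones((2, 2)) + np.diag(sig2)        # |S|^2 constant + diag(sigma_m^2)
phi_matrix = np.linalg.solve(A, S2 * np.ones(2))
phi_closed = mcw_weights(np.array([S2]), sig2[:, None])[:, 0]
```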
Now consider the convolutive model:
$$y_m(t) = h_m(t)*s(t) + n_m(t) \qquad\qquad Y_m(\omega) = H_m(\omega)S(\omega) + N_m(\omega) \tag{5.21}$$
In this case the following substitutions apply, where $P_{m,l}^Y = E\left[Y_mY_l^*\right]$:
$$E\left[|Y_m|^2\right] = |H_m|^2|S|^2 + \sigma_m^2 \qquad E\left[SY_m^*\right] = H_m^*|S|^2 \qquad P_{m,l}^Y = H_mH_l^*|S|^2 \tag{5.22}$$
$$\begin{bmatrix}\Phi_1\\ \Phi_2\\ \vdots\\ \Phi_M\end{bmatrix} = \begin{bmatrix}|H_1|^2|S|^2 + \sigma_1^2 & P_{2,1}^Y & \cdots & P_{M,1}^Y\\ P_{1,2}^Y & |H_2|^2|S|^2 + \sigma_2^2 & \cdots & P_{M,2}^Y\\ \vdots & \vdots & \ddots & \vdots\\ P_{1,M}^Y & P_{2,M}^Y & \cdots & |H_M|^2|S|^2 + \sigma_M^2\end{bmatrix}^{-1}\begin{bmatrix}H_1^*|S|^2\\ H_2^*|S|^2\\ \vdots\\ H_M^*|S|^2\end{bmatrix} \tag{5.23}$$
The matrix to be inverted in Equation (5.23) is the sum of a vector outer product and a diagonal matrix of noise autocorrelation values:
$$|S|^2\begin{bmatrix}H_1\\ H_2\\ \vdots\\ H_M\end{bmatrix}\begin{bmatrix}H_1^* & H_2^* & \cdots & H_M^*\end{bmatrix} + \begin{bmatrix}\sigma_1^2 & 0 & \cdots & 0\\ 0 & \sigma_2^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma_M^2\end{bmatrix}$$
The diagonal matrix is positive and full rank (barring a vanishing noise signal) so the sum is also full rank (barring a vanishing $H_m$), and the matrix inversion in Equation (5.23) above exists in general. As in the previous case this expression for the optimal filter can be rewritten in a simplified form that obviates the use of the generalized matrix inversion in Equation (5.23). The simplified solution is given by:
² A matrix inverse can be computed in a manner that has complexity $O(M^{\log_2 7})$ but at the expense of a very large constant factor[79].
$$\begin{bmatrix}\Phi_1\\ \Phi_2\\ \vdots\\ \Phi_M\end{bmatrix} = \frac{|S|^2}{\prod_{m=1}^M \sigma_m^2 + |S|^2\sum_{k=1}^M |H_k|^2\prod_{m\neq k}\sigma_m^2}\begin{bmatrix}H_1^*\prod_{m\neq 1}\sigma_m^2\\ H_2^*\prod_{m\neq 2}\sigma_m^2\\ \vdots\\ H_M^*\prod_{m\neq M}\sigma_m^2\end{bmatrix} \tag{5.24}$$
This result can be rewritten in a more revealing form by factoring out the product of the noise variances. As above, note that
$$\prod_{m\neq k}\sigma_m^2 = \frac{\prod_{m=1}^M \sigma_m^2}{\sigma_k^2}$$
So Equation (5.24) can be rewritten as:
$$\begin{bmatrix}\Phi_1\\ \Phi_2\\ \vdots\\ \Phi_M\end{bmatrix} = \frac{|S|^2\,\prod_{m=1}^M \sigma_m^2}{\prod_{m=1}^M \sigma_m^2\left(1 + |S|^2\sum_{k=1}^M \frac{|H_k|^2}{\sigma_k^2}\right)}\begin{bmatrix}\frac{H_1^*}{\sigma_1^2}\\ \frac{H_2^*}{\sigma_2^2}\\ \vdots\\ \frac{H_M^*}{\sigma_M^2}\end{bmatrix} = \frac{|S|^2}{1 + |S|^2\sum_{k=1}^M \frac{|H_k|^2}{\sigma_k^2}}\begin{bmatrix}\frac{H_1^*}{\sigma_1^2}\\ \frac{H_2^*}{\sigma_2^2}\\ \vdots\\ \frac{H_M^*}{\sigma_M^2}\end{bmatrix} \tag{5.25}$$
or written more succinctly in terms of the weight for a single microphone (and including the dependence on $\omega$ previously omitted for brevity):
$$\Phi_m(\omega) = \frac{|S(\omega)|^2}{1 + |S(\omega)|^2\sum_{k=1}^M \frac{|H_k(\omega)|^2}{\sigma_k^2(\omega)}}\cdot\frac{H_m^*(\omega)}{\sigma_m^2(\omega)} \tag{5.26}$$
As in the previous case, the computational complexity of the solution written in this form is only $O(M)$ as opposed to the $O(M^3)$ for the form including the matrix inversion. Equation (5.25) is very similar to the optimal-SNR weighting derived in Equation (4.11). Each $\Phi_m$ is the ratio of the conjugated channel transfer function and the channel noise power, but now also includes an overall weighting at each frequency. This is consistent with the optimal-SNR result of Equation (4.11).
Note that when $H_m = 1$, $m = 1,\ldots,M$, Equation (5.25) is identical to the solution for the model without signal filtering in Equation (5.19). Also, for the case where $M=1$ Equation (5.25) becomes
$$\Phi = \frac{H^*|S|^2}{|H|^2|S|^2 + \sigma^2}$$
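The convolutive-model weights of Equation (5.26) admit the same O(M) evaluation; a sketch with a numerical cross-check against the matrix form of Equation (5.23) (names are illustrative):

```python
import numpy as np

def mcw_weights_convolutive(signal_power, H, noise_psd):
    """Multi-channel Wiener weights, convolutive model, Eq. (5.26).

    signal_power : (F,) estimate of |S(omega)|^2
    H            : (M, F) channel transfer functions H_m(omega)
    noise_psd    : (M, F) per-channel noise powers sigma_m^2(omega)
    """
    ratio = np.conj(H) / noise_psd                                  # H_m^*/sigma_m^2
    denom = 1.0 + signal_power * (np.abs(H) ** 2 / noise_psd).sum(axis=0)
    return signal_power * ratio / denom

# Cross-check against Eq. (5.23) for one frequency bin (real H for simplicity):
S2 = 1.0
H = np.array([1.0, 2.0])
sig2 = np.array([1.0, 1.0])
A = S2 * np.outer(H, H.conj()) + np.diag(sig2)   # outer product + diagonal
phi_matrix = np.linalg.solve(A, S2 * H.conj())
phi_closed = mcw_weights_convolutive(np.array([S2]), H[:, None], sig2[:, None])[:, 0]
```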
For comparison, the optimal-SNR weighting of Equation (4.11) can be put in a normalized form:
$$\Phi_m^{(osnr,0)}(\omega) = \frac{H_m^*(\omega)\big/\sigma_m^2(\omega)}{\sum_{k=1}^M H_k^*(\omega)\big/\sigma_k^2(\omega)} \tag{5.27}$$
where the denominator is designed to normalize the gain of the array so that
$$\sum_{m=1}^{M}\Phi_m^{(osnr,0)}(\omega) = 1$$
Adding a Wiener weighting on top of this weighting adds in a factor of $\frac{|S(\omega)|^2}{E\left[|Y(\omega)|^2\right]}$, where $Y(\omega)$ is the output of the optimal-SNR weighted beamformer. Incorporating this factor into $\Phi_m^{(osnr,0)}(\omega)$ from Equation (5.27) yields a new weighting:
$$\phi_m^{(osnr,1)}(\omega) = \frac{|S(\omega)|^2\, H_m^*(\omega)\big/\sigma_m^2(\omega)}{\left(\sum_{k=1}^M H_k^*(\omega)\big/\sigma_k^2(\omega)\right) E\left[\left|\sum_{k=1}^M \Phi_k^{(osnr,0)}(\omega)X_k(\omega)\right|^2\right]} \tag{5.28}$$
where $X_k(\omega)$ denotes the $k$th channel input.
Using the independent noise model from above to simplify the expected value in Equation (5.28) (and dropping the $\omega$ notation for brevity's sake) yields:
$$E\left[\left|\sum_{k=1}^M \Phi_k^{(osnr,0)} X_k\right|^2\right] = E\left[\left|\sum_{k=1}^M \Phi_k^{(osnr,0)}\left(H_k S + N_k\right)\right|^2\right] = \left|\sum_{k=1}^M \Phi_k^{(osnr,0)} H_k\right|^2 E\left[|S|^2\right] + \sum_{k=1}^M \left|\Phi_k^{(osnr,0)}\right|^2\sigma_k^2$$
Comparing this to Equation (5.26), the $\frac{H_m^*(\omega)}{\sigma_m^2(\omega)}$ and $|S(\omega)|^2$ terms (both in numerator and denominator) are in common. What remains are the normalization terms in the denominators:
$$\left(\sum_{k=1}^M \frac{H_k^*}{\sigma_k^2}\right)\left(\left|\sum_{k=1}^M \Phi_k^{(osnr,0)} H_k\right|^2 |S|^2 + \sum_{k=1}^M \left|\Phi_k^{(osnr,0)}\right|^2\sigma_k^2\right) \stackrel{?}{=} 1 + |S|^2\sum_{k=1}^M \frac{|H_k|^2}{\sigma_k^2}$$
Substituting $\Phi_k^{(osnr,0)}$ from Equation (5.27) on the left side gives
$$\left(\sum_{l=1}^M \frac{H_l^*}{\sigma_l^2}\right)\frac{\left(\sum_{k=1}^M \frac{|H_k|^2}{\sigma_k^2}\right)^2|S|^2 + \sum_{k=1}^M \frac{|H_k|^2}{\sigma_k^2}}{\left|\sum_{l=1}^M \frac{H_l^*}{\sigma_l^2}\right|^2} \stackrel{?}{=} 1 + |S|^2\sum_{k=1}^M \frac{|H_k|^2}{\sigma_k^2}$$
which reduces to
$$\frac{\sum_{k=1}^M \frac{|H_k|^2}{\sigma_k^2}}{\sum_{l=1}^M \frac{H_l}{\sigma_l^2}}\left(1 + |S|^2\sum_{k=1}^M \frac{|H_k|^2}{\sigma_k^2}\right) \stackrel{?}{=} 1 + |S|^2\sum_{k=1}^M \frac{|H_k|^2}{\sigma_k^2}$$
In this form it is clear (or at least more clear) that the two weightings are not equivalent; there are
cross-terms introduced on the left side that will not be cancelled on the right side. The relative weighting
between the channels is the same as for the optimal-SNR weighting, but the overall weighting at each
frequency is not.
When the channel noises are allowed to be correlated with one another, the off-diagonal terms acquire noise cross-correlation components, $P_{m,l}^N = E\left[N_mN_l^*\right]$:
$$\begin{bmatrix}\Phi_1\\ \vdots\\ \Phi_M\end{bmatrix} = \begin{bmatrix}|H_1|^2|S|^2 + \sigma_1^2 & H_2H_1^*|S|^2 + P_{2,1}^N & \cdots & H_MH_1^*|S|^2 + P_{M,1}^N\\ \vdots & \vdots & \ddots & \vdots\\ H_1H_M^*|S|^2 + P_{1,M}^N & H_2H_M^*|S|^2 + P_{2,M}^N & \cdots & |H_M|^2|S|^2 + \sigma_M^2\end{bmatrix}^{-1}\begin{bmatrix}H_1^*|S|^2\\ \vdots\\ H_M^*|S|^2\end{bmatrix} \tag{5.29}$$
On the face of it this matrix might appear singular, but because it is the expected value of the noise covariance that is added it is generally not singular. That is, the matrix in Equation (5.29) can be written as the sum of two cross products:
$$|S|^2\begin{bmatrix}H_1\\ H_2\\ \vdots\\ H_M\end{bmatrix}\begin{bmatrix}H_1^* & H_2^* & \cdots & H_M^*\end{bmatrix} + E\left(\begin{bmatrix}N_1\\ N_2\\ \vdots\\ N_M\end{bmatrix}\begin{bmatrix}N_1^* & N_2^* & \cdots & N_M^*\end{bmatrix}\right)$$
Note that the second cross product is the expected value of the noise cross-correlation. This is a
Hermitian matrix and except under degenerate values of noise correlation it will be full rank, and therefore
the matrix inverse in Equation (5.29) will exist. In Equations (5.17) and (5.23) the noise in each channel
was assumed to be independent of the noise in any other channel simplifying this cross-correlation matrix
to a diagonal matrix of noise autocorrelation values. Effective estimation of the noise correlation matrix
through the averaging of multiple observations may be made if the noise is slowly varying. This is in
contrast to the spatial correlation matrix in Equation (5.14) which contains an estimate of the rapidly
varying speech signal.
Unlike the previous cases, the matrix in Equation (5.29) does not lend itself to a simplified inverse
operation. Also Equation (5.29) requires the estimation of the complete noise cross-correlation matrix
rather than just the noise autocorrelation terms used in Equations (5.17) and (5.23).
Figure 5.1: Flow diagram for a Wiener filter-and-sum beamformer using the delay-and-sum beamformer
output as the signal estimate for the Wiener filters.
The per-channel filters are derived iteratively, using the delay-and-sum output as the initial signal estimate:
$$\Phi_m^{(0)}(\omega) = 1 \qquad \hat{S}^{(k)}(\omega) = \frac{1}{M}\sum_{m=1}^{M}\Phi_m^{(k)}(\omega)Y_m(\omega) \qquad \Phi_m^{(k+1)}(\omega) = \Phi_m^{(k)}(\omega)\,\frac{E\left[\hat{S}^{(k)}(\omega)Y_m^*(\omega)\right]}{E\left[|Y_m(\omega)|^2\right]} \tag{5.30}$$
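A minimal sketch of this iteration (expectations are replaced here by single-frame estimates, with the real part taken and the gain clipped to [0, 1]; these are implementation choices of the sketch, not prescribed by the text):

```python
import numpy as np

def iterate_prefilters(Y, iterations=2):
    """Iterative Wiener pre-filtering sketch after Eq. (5.30).

    Y : (M, F) delay-steered channel spectra for one analysis frame.
    Starts from Phi_m = 1 (plain delay-and-sum estimate) and, at each
    iteration, derives a Wiener gain from the current estimate S_hat
    and the raw channel, compounding it onto the channel filter.
    """
    M = Y.shape[0]
    Phi = np.ones(Y.shape)
    for _ in range(iterations):
        S_hat = (Phi * Y).sum(axis=0) / M
        gain = np.real(S_hat[None, :] * np.conj(Y)) / np.maximum(np.abs(Y) ** 2, 1e-12)
        Phi = Phi * np.clip(gain, 0.0, 1.0)
    return (Phi * Y).sum(axis=0) / M, Phi
```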
This is illustrated in an idealized example. Consider an $M$-channel array. The noise in each channel is Gaussian, uncorrelated between the channels and of equal power in each channel. Let the signal of interest be a sine wave of frequency $\omega_0$ at a nominal power, $E\left[|S(\omega_0)|^2\right] = \psi^2$. The noise spectrum is flat with power $\sigma^2$ at every frequency, so that
$$E\left[|Y_m(\omega)|^2\right] = \begin{cases}\psi^2 + \sigma^2 & \omega = \omega_0\\ \sigma^2 & \omega \neq \omega_0\end{cases} \qquad\qquad E\left[|\hat{S}^{(0)}(\omega)|^2\right] = \begin{cases}\psi^2 + \frac{\sigma^2}{M} & \omega = \omega_0\\ \frac{\sigma^2}{M} & \omega \neq \omega_0\end{cases}$$
Now forming the ratio in Equation (5.30) to generate a Wiener filter for each channel results in a filter with the following transfer function:
$$\Phi_m^{(1)}(\omega) = \begin{cases}\dfrac{\psi^2 + \frac{\sigma^2}{M}}{\psi^2 + \sigma^2} & \omega = \omega_0\\[2ex] \dfrac{1}{M} & \omega \neq \omega_0\end{cases} \tag{5.31}$$
In the Wiener filter the factor of $\frac{1}{M}$ reappears, but now in magnitude rather than power, effectively doubling (in dB) the noise suppression achieved by the beamformer. The gain at the signal frequency is not unity, but will approach 1 for $\psi^2 \gg \sigma^2$, and as $M$ increases it approaches the minimum mean-squared-error optimal gain of $\frac{\psi^2}{\psi^2 + \sigma^2}$. Figure 5.2 shows the value of this term for varying numbers of microphones and signal-to-noise ratios. In this simple example the value of $\Phi_m^{(1)}(\omega)$ is directly related to the SNR of channel $m$. In a more realistic situation the attenuation of the noise by the beamformer will not be so reliable - coherent noise may sum constructively at some frequencies and destructively at others - and this direct mapping of channel SNR to $\Phi_m^{(1)}(\omega)$ will not hold. Applying $\Phi_m^{(1)}(\omega)$ to each channel³ and beamforming yields a new signal estimate, $\hat{S}^{(1)}(\omega)$, with power spectrum
³ Since each channel has identical statistics and therefore identical $\Phi_m^{(1)}$ in this example, it is mathematically equivalent to apply the filter on the beamformer output.
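The $\omega = \omega_0$ branch of Equation (5.31) can be checked with a small Monte Carlo sketch (the sinusoid is modeled here as a random-phase complex amplitude of power $\psi^2$, an assumption made only for this check):

```python
import numpy as np

rng = np.random.default_rng(0)
M, trials = 8, 200000
psi2, sigma2 = 1.0, 0.5   # signal and noise powers at the signal frequency

# Common signal plus independent noise in each of the M channels.
S = np.sqrt(psi2 / 2) * (rng.standard_normal(trials) + 1j * rng.standard_normal(trials))
N = np.sqrt(sigma2 / 2) * (rng.standard_normal((M, trials)) + 1j * rng.standard_normal((M, trials)))
Y = S[None, :] + N

S0 = Y.mean(axis=0)                # delay-and-sum estimate at omega_0
measured = np.mean(S0 * np.conj(Y[0])).real / np.mean(np.abs(Y[0]) ** 2)
predicted = (psi2 + sigma2 / M) / (psi2 + sigma2)   # Eq. (5.31), omega = omega_0
```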
Figure 5.2: The attenuation of $\Phi_m^{(1)}(\omega)$ from Equation (5.31) at the signal frequency for different numbers of microphones and input SNRs.
Figure 5.3: The attenuation of $\Phi_m^{(1)}(\omega)$ as a function of input SNR raised to different powers, corresponding to $\Phi_m^{(2)}(\omega)$ and $\Phi_m^{(3)}(\omega)$. The number of channels is fixed at 16.
$$E\left[|\hat{S}^{(1)}(\omega)|^2\right] = \begin{cases}\psi^2\,\dfrac{\left(1 + \frac{\sigma^2}{M\psi^2}\right)^3}{\left(1 + \frac{\sigma^2}{\psi^2}\right)^2} & \omega = \omega_0\\[2ex] \dfrac{\sigma^2}{M^3} & \omega \neq \omega_0\end{cases}$$
where the gain at $\omega = \omega_0$ has been rewritten to more clearly separate the influences of the signal-to-noise ratio and the number of microphones. Note the $\frac{1}{M^3}$ reduction in noise power. This is the cube of the reduction in noise power achieved by the DSBF.
Repeating the process to generate $\Phi_m^{(2)}(\omega)$ yields
$$\Phi_m^{(2)}(\omega) = \Phi_m^{(1)}(\omega)\left[\Phi_m^{(1)}(\omega)\right]^2 = \begin{cases}\left(\dfrac{\psi^2 + \frac{\sigma^2}{M}}{\psi^2 + \sigma^2}\right)^3 & \omega = \omega_0\\[2ex] \dfrac{1}{M^3} & \omega \neq \omega_0\end{cases}$$
which shows that subsequent iterations of $\Phi_m$ are simply powers of $\Phi_m^{(1)}$. In particular,
$$\Phi_m^{(k+1)}(\omega) = \left[\Phi_m^{(k)}(\omega)\right]^2\Phi_m^{(1)}(\omega)$$
so that iteration $k$ applies the $(2^k - 1)$-th power of $\Phi_m^{(1)}(\omega)$. Figure 5.3 shows the value of $\Phi_m^{(1)}(\omega)$ raised to the 3rd and 7th powers, corresponding to $\Phi_m^{(2)}(\omega)$ and $\Phi_m^{(3)}(\omega)$. Note how the attenuation falls off steeply; signals at different frequencies will be attenuated to a degree that is greatly sensitive to the SNR at that frequency, potentially resulting in undesirable signal coloration if a higher power of $\Phi_m^{(1)}(\omega)$ is employed. This is illustrated in Figure 5.3.
One way to avoid this sort of signal distortion while increasing the noise suppression of the filter is to map the filter response non-uniformly, compressing $\Phi_m^{(1)}(\omega)$ in the neighborhood of 0 dB while
Figure 5.4: Ad hoc methods of warping the filter gains to create a flatter response at moderately high SNR while preserving a strong attenuation at low SNR. One curve shows $\Phi_m^{(1)}$ and another the result after warping by Equation (5.32). For the third, the attenuation was held at 1 for values of $\Phi_m^{(1)}$ greater than -2 dB and set to $\left[\Phi_m^{(1)}\right]^5$ for values below that.
maintaining a strong attenuation in low-SNR regions. For instance, any gain above some threshold could be set to unity while leaving gains below the threshold alone. Another strategy would be to raise $\Phi_m^{(1)}(\omega)$ to a variable power based on its value. A possible (absolutely ad hoc) warping which maintains a longer flat region and a faster dropoff below some threshold is
$$\Phi_m(\omega) = \left[\Phi_m^{(1)}(\omega)\right]^{\left|20\log_{10}\Phi_m^{(1)}(\omega)\right|} \tag{5.32}$$
The effect of this ad hoc warping is shown in Figure 5.4. Both warped curves are flatter than $\Phi_m^{(1)}$ above the threshold region while attenuating more strongly below it.
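The warping of Equation (5.32) is a one-liner; a sketch (the small floor keeping the logarithm finite is an implementation choice):

```python
import numpy as np

def warp_gain(phi):
    """Ad hoc warping of Eq. (5.32): phi ** |20*log10(phi)|.

    Gains near unity (exponent near 0) stay nearly flat, while small
    gains are driven down steeply.
    """
    phi = np.clip(phi, 1e-6, 1.0)
    return phi ** np.abs(20.0 * np.log10(phi))
```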
5.4 Summary
The derivation of the single-channel Wiener filter was presented and extended to a multi-input MMSE solution, the multi-channel Wiener (MCW) filter. The form of this multi-channel Wiener filter was simplified for the additive-noise and convolution-plus-additive-noise signal scenarios, resulting in solutions of low computational complexity. The MCW method was shown to incorporate the optimal-SNR inter-microphone weighting derived in Chapter 4, and the overall frequency weighting of the MCW algorithm was shown to be different from that of the Wiener post-filter (WSF) algorithm. Another non-optimal but intuitively appealing application of Wiener filters to microphone arrays as pre-filters was described and its behavior as an iterative process explored. These methods, along with a reference Wiener post-filter (WSF), will be implemented in Chapter 7.
All the Wiener algorithms presented in this chapter require a noise-free or at least noise-reduced estimate
of the signal spectrum. Chapter 6 addresses the spectrum estimation problem in the context of microphone
arrays.
CHAPTER 6:
SIGNAL SPECTRUM ESTIMATION
The Wiener filter requires knowledge of the power spectrum of the desired signal (see Equation (5.6)). In
some communications applications the statistics of the desired signal may be reasonably well
approximated by an a priori distribution, but when the signal of interest is speech, ongoing signal
measurements are required to estimate the rapidly changing signal power spectrum. In this chapter some
methods of spectrum estimation will be investigated. The cross-spectrum signal estimation method which
is often used in microphone-array systems[39, 38, 41] will be shown to be a special case of the ubiquitous
noise-spectrum subtraction methods[33]. A novel spectral estimate that combines the cross-spectrum and
minimum-noise subtraction methods will be developed with some investigation of parameter optimization
for the method.
6.1 Spectral-Subtraction
In the classical single-channel spectral-subtraction case[33] the signal spectrum is commonly estimated by measuring the noise power during silence regions and estimating the signal power with:
$$|\hat{S}(\omega)|^2 = |S(\omega) + N(\omega)|^2 - E\left[|N(\omega)|^2\right] \tag{6.1}$$
To the extent that the noise is stationary and uncorrelated with the signal this is a good estimate of the
signal power, though care must be taken to avoid over-estimating the instantaneous noise power and
inserting negative values into the signal power-spectrum estimate[33].
One way to estimate the noise power[82, 34, 83] is to use the minimum observed value of the power spectrum over some interval:
$$|\hat{N}_k(\omega)|^2 = \min_{k-N < j \le k} |Y_j(\omega)|^2 \tag{6.2}$$
where the noise power spectrum estimate for analysis frame $k$ and frequency $\omega$, $|\hat{N}_k(\omega)|^2$, is the minimum value taken from the last $N$ analysis frames of the noise-corrupted signal power spectrum, $|Y_j(\omega)|^2$. Implementations of this technique typically use a smoothed version of the power spectrum. The implementation used herein weights past analysis frames with an exponentially decaying weight factor:
$$|\bar{Y}_k(\omega)|^2 = (1-\alpha)|Y_k(\omega)|^2 + \alpha|\bar{Y}_{k-1}(\omega)|^2 \tag{6.3}$$
where $|\bar{Y}_k(\omega)|^2$, the smoothed spectrum estimate for frame $k$, is formed by weighting the raw estimate for the current frame, $|Y_k(\omega)|^2$, by $(1-\alpha)$ and the estimate for the previous frame, $|\bar{Y}_{k-1}(\omega)|^2$, by $\alpha$. The noise
Figure 6.1: Average BSD, SSNR and peak SNR for minimum spectral subtraction scheme as described
in Equation (6.3) as a function of the averaging constant, α, and the number of past analysis frames from
which the minimum is taken, N.
estimate for frame $k$ is then formed by taking the minimum value of $|\bar{Y}_j(\omega)|^2$ over the frames $k-N < j \le k$.
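The smoothed minimum-statistics estimate of Equations (6.2)-(6.3) can be sketched as follows (array layout and names are illustrative):

```python
import numpy as np

def min_stats_noise(power, alpha=0.6, N=40):
    """Minimum-statistics noise estimate, Eqs. (6.2)-(6.3).

    power : (K, F) short-time power spectra |Y_k(omega)|^2
    Smooths recursively with constant alpha, then takes the minimum of
    the smoothed spectra over the last N frames at each frequency.
    """
    smoothed = np.empty_like(power)
    smoothed[0] = power[0]
    for k in range(1, len(power)):
        smoothed[k] = (1 - alpha) * power[k] + alpha * smoothed[k - 1]
    noise = np.empty_like(power)
    for k in range(len(power)):
        noise[k] = smoothed[max(0, k - N + 1):k + 1].min(axis=0)
    return noise
```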
6.2 Cross-Power
When multiple input channels are available the cross-power spectra of the channels can be used to form
the signal power estimate, Ŝ ω 2 , in a way that takes advantage of the correlated nature of the signal and
the (hopefully) uncorrelated nature of the interference. Using the signal plus independent noise model
from Equation (5.15) the expected value of the cross-spectrum of 2 channels is (from Equation (5.16))
1. For this optimization, data from the training set rather than the test set was used.
2. The analysis length and FFT size were chosen to correspond with the parameters of the BSD measure (see Section 2.2.3) because the BSD is being measured directly from the power spectra estimated by the spectral subtraction process.
3. It is expected that the best values for these parameters will vary with the test conditions: noise levels, channel variations, etc.

    E[Y_m(ω) Y_l*(ω)] = |S(ω)|²    (6.4)
Since the noise is assumed to be uncorrelated, the expected value of its cross-spectrum is 0. A pair of
microphones, m and l, can be used to form an estimate of the signal power by taking the real portion of
their cross-spectrum:

    P̂_ml(ω) = Re{Y_m(ω) Y_l*(ω)}    (6.5)

In general there are M microphones available, so there are M(M−1)/2 independent estimates of |S(ω)|² that can be
formed from Equation (6.5). Taking the average of these individual estimates and then applying a half-wave
rectification yields an estimate of the signal power incorporating information from all the microphones:

    |Ŝ(ω)|² = max( 0, (2 / M(M−1)) Σ_{m=1}^{M−1} Σ_{l=m+1}^{M} P̂_ml(ω) )    (6.6)
This is essentially the development done by Zelinski in [38]. The signal power spectrum was estimated by
averaging together the cross-power spectra of all possible microphone combinations in a 4-microphone
array. This estimate of the signal spectrum was then used in a Wiener filter applied to the output of the
beamformer as in Equation (5.9). This formulation of the spectral estimate has been used elsewhere,
including [40], [30] and [84]. In [84] taking the real portion of the mean cross-power spectrum is eschewed in
favor of the magnitude. The rationale for this is to design the derived Wiener filter to attenuate the
spatially uncorrelated noise while ignoring coherent noise [84] (or rather including it equally in the
numerator and denominator of the Wiener filter transfer function), thereby leaving the attenuation of
coherent noise to the spatial selectivity of the beamformer.
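The pairwise estimator of Equations (6.5) and (6.6) can be sketched directly; this assumes the channel spectra for one analysis frame are available as a complex array, and the function name is illustrative.

```python
import numpy as np

def cross_power_signal_estimate(Y):
    """Signal power estimate from pairwise cross-spectra, Eqs. (6.5)-(6.6).

    Y: (M, F) complex array of channel spectra for one analysis frame.
    Returns an (F,) real array: the half-wave-rectified mean of Re{Y_m Y_l*}.
    """
    M = Y.shape[0]
    acc = np.zeros(Y.shape[1])
    for m in range(M - 1):
        for l in range(m + 1, M):
            # Real part of the cross-spectrum of channels m and l, Eq. (6.5).
            acc += np.real(Y[m] * np.conj(Y[l]))
    # Average over the M(M-1)/2 pairs, then half-wave rectify, Eq. (6.6).
    return np.maximum(0.0, acc / (M * (M - 1) / 2))
```

When all channels carry an identical signal the estimate reduces to the signal power spectrum, since Re{Y Y*} = |Y|² for every pair.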
    |Y_B|² = |(1/M) Σ_{m=1}^{M} Y_m|²
           = (1/M²) Σ_{m=1}^{M} |Y_m|² + (1/M²) Σ_{l=1}^{M} Σ_{m≠l} Y_l Y_m*
           = (1/M²) Σ_{m=1}^{M} |Y_m|² + (2/M²) Σ_{l=1}^{M−1} Σ_{m=l+1}^{M} Re{Y_l Y_m*}    (6.7)

where Y_B is the delay-and-sum beamformer output and Y_m is the mth channel (expressed in the frequency
domain). The estimate in Equation (6.6) can be formed by computing and subtracting out the
auto-spectrum terms from Equation (6.7) and scaling appropriately. This entails the computation of only
M+1 power-spectrum estimates rather than M(M−1)/2 cross-spectrum estimates⁴. Specifically, given Y_B as
expressed in Equation (6.7) above, the spectral estimate in Equation (6.6) can be realized:
4. This relationship between the power spectrum of the beamformer output and the desired cross-spectrum estimate is also pointed out in [30].
    (2 / M(M−1)) Σ_{m=1}^{M−1} Σ_{l=m+1}^{M} Re{Y_m Y_l*}
        = (M² / M(M−1)) ( |Y_B|² − (1/M²) Σ_{m=1}^{M} |Y_m|² )
        = (M / (M−1)) ( |Y_B|² − (1/M²) Σ_{m=1}^{M} |Y_m|² )    (6.8)
This economy of computation is only available if the function chosen to combine the cross-spectrum
estimates is the mean and the function chosen to project the complex-valued cross-spectra onto the reals is the real-part
function. If the absolute value of the cross-spectral estimates is used [84][85][39], this breakdown of the
beamformer power-spectrum does not apply. Likewise, if some function other than the mean (e.g., the median) is used to
combine the individual cross-spectral estimates, the simplification also does not apply.
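The algebraic shortcut of Equation (6.8) can be checked numerically. The sketch below compares the direct pairwise average (Equation (6.6) without rectification) against the beamformer-power form; the array sizes and random spectra are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(0)
M, F = 4, 8
# Arbitrary complex channel spectra for one analysis frame.
Y = rng.normal(size=(M, F)) + 1j * rng.normal(size=(M, F))

# Direct form: average of the M(M-1)/2 real cross-spectra.
direct = np.zeros(F)
for m in range(M - 1):
    for l in range(m + 1, M):
        direct += np.real(Y[m] * np.conj(Y[l]))
direct /= M * (M - 1) / 2

# Beamformer form, Eq. (6.8): needs only |Y_B|^2 and the M channel powers.
YB = Y.mean(axis=0)
shortcut = (M / (M - 1)) * (np.abs(YB) ** 2 - np.sum(np.abs(Y) ** 2, axis=0) / M**2)

assert np.allclose(direct, shortcut)
```

The two forms agree to machine precision, which is why only M+1 power spectra need to be computed instead of M(M−1)/2 cross-spectra.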
    |N̂_cp(ω)|² = |Y_B(ω)|² − |Ŝ(ω)|² = (1 / (M−1)) ( (1/M) Σ_{m=1}^{M} |Y_m(ω)|² − |Y_B(ω)|² )

    |N̂_ss(ω)|² = min_{k−N ≤ k′ ≤ k} |Ȳ_{k′}(ω)|²

where |N̂_cp(ω)|² is the noise estimate from the cross-power method of Equation (6.8) and |N̂_ss(ω)|² is the noise
estimate from the spectral-subtraction estimate of Equation (6.2). The signal estimate, |Ŝ(ω)|², is formed by
subtracting the larger of the two noise estimates from the beamformer power spectrum, |Y_B(ω)|².
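The combination rule just described can be sketched in a few lines; the final half-wave rectification is an assumption here (it mirrors the rectification used in Equation (6.6)), as is the function name.

```python
import numpy as np

def combined_signal_estimate(YB_power, n_cp, n_ss):
    """Combination estimate: subtract the larger of the two noise
    estimates from the beamformer power spectrum.

    YB_power, n_cp, n_ss: (F,) arrays for one analysis frame.
    The max(0, .) rectification is an assumption, keeping the
    power estimate non-negative as in Eq. (6.6).
    """
    noise = np.maximum(n_cp, n_ss)
    return np.maximum(0.0, YB_power - noise)
```

Taking the larger noise estimate frequency by frequency makes the combined estimate the more conservative (lower) of the two signal estimates at each bin.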
The signal power spectrum estimate types enumerated above were then measured against power spectra
generated from the close-talking microphone reference recordings. Peak SNR (SNR), segmental SNR
Figure 6.2: Average BSD, peak SNR and SSNR for the different spectral estimation methods for the quiet
database using 8 microphones ( ) and 16 microphones ( ). The values were averaged across all 438
utterances in the test set. See Figure 3.7 for scale comparisons.
Figure 6.3: Average BSD, peak SNR and SSNR for the different spectral estimation methods for the noisy
database using 8 microphones ( ) and 16 microphones ( ). The values were averaged across all 438
utterances in the test set. See Figure 3.13 for scale comparisons.
(SSNR) and Bark spectral distortion (BSD) values are computed and averaged over the 438 test-set talker
utterances. For all estimation techniques the data segmentation was done with a 512 point (32ms)
Hamming window with a half-window overlap and a 1024 point FFT.
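The segmentation just described can be sketched as follows; the function name is an assumption, and the sketch omits the processing applied to the resulting spectra.

```python
import numpy as np

def frame_spectra(x, n_win=512, n_fft=1024):
    """Segment a signal into half-overlapped, Hamming-windowed 512-point
    frames (32 ms at 16 kHz) and zero-pad each to a 1024-point FFT."""
    hop = n_win // 2
    win = np.hamming(n_win)
    n_frames = (len(x) - n_win) // hop + 1
    frames = np.stack([x[k * hop:k * hop + n_win] * win
                       for k in range(n_frames)])
    # rfft with n=1024 zero-pads each 512-point frame to the FFT length.
    return np.fft.rfft(frames, n=n_fft, axis=1)
```

The zero-padding to twice the window length is what preserves linear (rather than circular) convolution when filters derived from these spectra are applied.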
The most glaring feature of the quiet-database results in Figure 6.2 is that the Bark spectral distortion (BSD) and
segmental SNR (SSNR) are worse after any sort of processing of the beamformer spectrum. Using
Figures 3.7 and 3.13 for comparison, however, the degradation in the SSNR and BSD measurements of Figure 6.2 is
marginal. The total increase in BSD shown in Figure 6.2 is approximately 5% of the difference between
the values measured for the 1 and 16 microphone beamformers in Figure 3.7(a). For the noisy data in
Figure 6.3 the BSD improves (decreases) slightly with all post-processing methods. The magnitude of the
improvement in Figure 6.3(a) is approximately 8 times greater than the decline in Figure 6.2(a). The
SSNR deteriorates with all post-processing methods for both data sets but the decrease for the noisy data
set is about half as much as with the quiet data. In both cases the change in SSNR is approximately an
order of magnitude smaller than the total range shown in Figure 3.7(b). In contrast, the peak SNR is
improved significantly by all 3 post-processing methods for both quiet and noisy data, and though the
cross-power method has a worse peak SNR than the spectral-subtraction method on the quiet data, the
combined cross-power/spectral-subtraction method displays the best peak SNR in both noisy and quiet
cases. Also, unlike the marginal decline or improvement in BSD and SSNR, the magnitude of the
improvement in peak SNR in Figure 6.2(c) is comparable to the improvement achieved by the 16
microphone beamformer shown in Figure 3.7(c).
Since BSD and SSNR are measured only during active speech segments, this suggests that the minimum
spectral subtraction method does a good job of attenuating noise during silent passages, but is less
effective at reducing distortion during segments of speech. For the quiet database this tradeoff between
reducing the noise and distorting the speech results in a slight overall performance degradation precisely
because the noise is minimal; the distortion introduced by the processing is on the same order as the
distortion already present in the signal. The combination algorithm incorporates the improvement in signal
distortion of the cross-power method and the improvement in peak SNR of the spectral subtraction method.
6.5 Summary
In this chapter methods for signal spectrum estimation were described. The cross-spectrum method was
shown to be a special case of spectral subtraction. An algorithm combining a minimum noise estimate and
the cross-spectrum noise estimate was motivated and described. A preliminary comparison of the different
estimation methods using signal distortion measures was presented, supporting the use of the combination
estimate. In Chapter 7 optimal filtering strategies will be implemented using the spectrum estimation
methods described here.
CHAPTER 7:
IMPLEMENTATIONS AND ANALYSIS
In this chapter variations on the filtering strategies described in Chapters 4 and 5 will be implemented and
evaluated. In particular, implementations of the optimal-SNR filter-and-sum strategy (OSNR) from
Equation (4.12), the Wiener sum-and-filter (WSF) from Equation (5.9), the Wiener filter-and-sum (WFS) from
Equation (5.30) and the multi-channel Wiener (MCW) from Equation (5.26) beamformers will be described, and
the distortion measures and speech-recognition performance results for 8 and 16 microphone versions
presented. For the Wiener techniques, the different methods of estimating the signal spectrum described in
Chapter 6 will be used and compared.
A 512 point (32ms at 16kHz) rectangular window with a half-window shift was used for spectral
analysis.
To preserve a linear convolution in the frequency domain a 1024 point FFT was used.
The noise power in each channel, σ²_m(ω), was estimated using the minimum statistic method
described in Section 6.1, and the channel power (less the noise estimate) was averaged over the utterance. These long-term power estimates were then divided by
the power estimate of the first channel to provide an estimate of the channel transfer function. The
Figure 7.1: The structure of the DSBF with optimal-SNR based channel filtering (OSNR). Signal and noise
statistics are generated for each channel. The resulting filter weights are normalized across the array to
yield a flat total frequency response for the array.
Table 7.1: Recognition performance for the OSNR beamformer expressed in % words in error. The lowest
word error rate in each column is highlighted in boldface.
result is a constant-magnitude transfer function estimate for each channel. For instance:

    |Ĥ_m(ω)| = ( (|Y_m(ω)|² − σ²_m(ω)) / (|Y_1(ω)|² − σ²_1(ω)) )^{1/2}
To guard against divide-by-zero or multiply-by-zero situations, the gain at each frequency was
constrained to lie within −40dB to 3dB by hard limiting at both ends of the range.
To reduce artifacts in the reconstructed signal the derived filters were truncated to 512 points in the
time domain to preserve the linear convolution property. This corresponds to a smoothing of the
filter in the frequency domain.
The filter weights were normalized across the array at each frequency so that the total array
frequency response was flat.
The filters were applied to the beamformer output in the frequency domain and reconstructed in the
time domain with an overlap-add technique. A Hanning window was used to taper the overlapping
segments together during reconstruction.
After filtering the individual channels they were summed to form the final output.
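Two of the steps above, the gain limiting and the across-array normalization, can be sketched as follows. The function names and the simple sum-to-one normalization (which yields a flat total response when the aligned channel paths are otherwise unity) are illustrative assumptions.

```python
import numpy as np

def clamp_gain(g, lo_db=-40.0, hi_db=3.0):
    """Hard-limit a linear gain to the [-40 dB, 3 dB] range."""
    lo, hi = 10 ** (lo_db / 20), 10 ** (hi_db / 20)
    return np.clip(g, lo, hi)

def normalize_weights(W):
    """Scale the channel weights at each frequency so they sum to 1,
    giving the array a flat overall frequency response.

    W: (M, F) array of non-negative channel gains.
    """
    return W / W.sum(axis=0, keepdims=True)
```

The clamp prevents the filter from ever fully zeroing (or excessively boosting) a frequency bin, while the normalization ensures the weighting redistributes energy across channels without coloring the summed output.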
The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with the
OSNR algorithm as described above, using 8 and 16 microphones. The results are presented in the sections
that follow.
1. The uncolored nature of the beamformer output is a given since the overall frequency response is constrained to be uniform.
2. This different quality to the noise is extremely subtle and probably would not be noticed in casual listening, or in less than optimal environments, but the beamformers can be reliably distinguished in a blind test.
Figure 7.2: Narrowband spectrogram of a noisy utterance processed with OSNR. The top figure is from the
unweighted 16 channel beamformer and the bottom figure is from the optimal-SNR weighted 16 channel
beamformer. The talker is female and the text being spoken is the alpha-digit string ’BUCKLE83566’. The
overall reduction in background noise is apparent, especially the bands around 4500Hz and 6500Hz.
Table 7.2: Summary of the measured average FD, BSD, SSNR and peak SNR values for the OSNR beam-
former.
measured performance actually decreases. The .05% increase in error rate corresponds to 2 added errors³.
All other results for the quiet database are nearly an order of magnitude greater and are improvements in
performance. Admittedly these are very small improvements in terms of the number of word errors
involved, but note that the difference between the DSBF (11.92%) and the close-talking microphone
(8.16%) performance is only 3.76%. A .5% change in performance is 13% of that margin. The .4%
improvement for the 16 microphone MAP-HMM case is slightly more than 10% of that performance gap.
With the noisy data, the performance is significantly improved relative to the unweighted DSBF. In the 16
microphone MAP-HMM case the 4.5% decrease in error rate brings the beamformer performance 30%
closer to the 8.16% error rate of the close-talking microphone. In the noisy case the SNR varies enough
across the array for the weighting to provide significant gain. In the quiet case the noise is very similar in
each channel and little can be gained by weighting the microphones nonuniformly.
Table 7.2 shows the distortion measurements for the optimal-SNR weighted beamformer. All the
measurements show some improvement over the unweighted DSBF, but the most notable improvement is
in peak SNR which shows only about 4dB improvement for both 8 and 16 microphone arrays using the
3. There are 4497 total test words, so each error contributes 100/4497 ≈ .02% to the error rate.
Figure 7.3: The structure of the DSBF with Wiener post-filtering or Wiener sum-and-filter (WSF). Note
that the channels may feed forward into the noise-reduction step as they may be necessary to generate the
post-filter.
1. The cross-spectral power signal estimate from (6.8) was used. This is consistent with the
implementation generally found in the literature [39]. In the results this is denoted by WSFcor.
2. The minimum statistic noise power estimate described in Sections 6.1 and 6.4. This is denoted
below by WSFmin.
3. The combination noise power estimate described in Sections 6.3 and 6.4. This is denoted
below by WSFcom.
The spectral densities used in the formulation of the Wiener filter from (5.6) were smoothed in time
with an exponential weighting factor of 0.4. That is, |Ŝ_k(ω)|² ← (1 − 0.4)|Ŝ_k(ω)|² + 0.4|Ŝ_{k−1}(ω)|².
This value was chosen to correspond with the smoothing reported in
[39] and is intended to strike a balance between a low variance estimate of the spectral densities
while still accommodating rapid variation in the speech spectrum.
To reduce artifacts in the reconstructed signal, the post filter is truncated to 512 points in the time
domain to preserve the linear convolution property. This corresponds to a smoothing of the filter in
the frequency domain.
The filters were applied to the beamformer output in the frequency domain and reconstructed in the
time domain with an overlap-add technique. A Hanning window was used to taper the overlapping
segments together during reconstruction.
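The frequency-domain filtering and overlap-add reconstruction described above can be sketched as follows; this is illustrative, assuming rfft-domain frames and a simple Hanning taper, and omits the exact windowing normalization used in the thesis.

```python
import numpy as np

def apply_postfilter_overlap_add(frames_spec, gains, n_win=512, n_fft=1024):
    """Apply per-frame spectral gains and reconstruct by overlap-add.

    frames_spec: (K, n_fft//2 + 1) rfft spectra of half-overlapped frames.
    gains: matching array of real post-filter gains.
    """
    hop = n_win // 2
    taper = np.hanning(n_fft)  # tapers the overlapping segments together
    out = np.zeros(hop * (len(frames_spec) - 1) + n_fft)
    for k, (spec, g) in enumerate(zip(frames_spec, gains)):
        # Filter in the frequency domain, return to the time domain.
        seg = np.fft.irfft(spec * g, n=n_fft)
        out[k * hop:k * hop + n_fft] += seg * taper
    return out
```

The taper suppresses the frame-boundary discontinuities that a time-varying filter would otherwise introduce into the reconstructed waveform.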
The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with the
WSF algorithm as described above, using 8 and 16 microphones. The results are presented in the sections
that follow.
Figure 7.4: Narrowband spectrograms for the WSF beamformer. The bottom 3 rows correspond to the
different ways of forming the signal power spectrum estimate. The left hand column images are generated
from data with no extra pre-filtering and the right hand column images are from data that was processed
with the OSNR filtering prior to the WSF processing. The spectrograms for the unweighted DSBF and
OSNR outputs are shown at the top for reference. The talker is female and the text being spoken is the
alpha-digit string ’BUCKLE83566’. All examples are from 16 microphone implementations.
Table 7.3: Recognition performance for the WSF beamformers expressed in % words in error. The lowest
word error rate in each column is highlighted in boldface.
Table 7.4: Recognition performance for the WSFosnr beamformers expressed in % words in error. The
lowest word error rate in each column is highlighted in boldface.
lower estimate of the noise and the com estimate (by definition) the highest noise estimate; the cor
estimate performs best in the low-noise cases and the com estimate performs best in the high-noise cases.
Note that the OSNR performance is better than the WSF performance (without OSNR pre-processing).
The best improvement for the quiet data (WSF_min^osnr) makes up 27% of the difference from the DSBF to the
close-talking microphone performance. The best noisy performance (WSF_min^osnr) is 45% of the performance
difference.
The largest improvements in the distortion measures can be seen in the peak SNR values. The peak SNR
measured for the quiet data is approximately 1.5 times greater, and 2 times greater for the noisy data. For
the other measures the difference is generally much smaller. For the quiet data especially the difference in
measured distortion is sometimes vanishingly small. The noisy data shows significantly more of a
Table 7.5: Summary of the measured average FD, BSD, SSNR and peak SNR values for the WSF beam-
formers. The baseline values for the delay-and-sum beamformer are shown for reference (DSBF). The best
value (lowest for distortions and highest for SNRs) is highlighted in bold-face.
Table 7.6: Summary of the measured average FD, BSD, SSNR and peak SNR values for the WSFosnr
beamformers. The baseline values of the delay-and-sum beamformer with optimal-SNR weighting are
shown for reference (OSNR). The best value (lowest for distortions and highest for SNRs) in each category
is highlighted in bold-face.
difference. Note that for the quiet data the BSD is worsened by any version of WSF, with the cor spectral
estimate type degrading the least of the 3 estimate types. This is consistent with the observation above that
the cor spectral estimate with its conservative noise estimate performs best on the quiet data whereas the
min and com spectral estimate types are most likely over-estimating the noise resulting in a degree of
signal distortion that outweighs the noise suppression.
Figure 7.5: The structure of the Wiener filter-and-sum (WFS) beamformer. This is the same as Figure
5.1 with the addition of a configurable post-filtering step on the first beamformer output. The individual
channels may feed forward into the noise reduction step.
The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with the
WFS algorithm as described above, using 8 and 16 microphones. The results are presented in the sections
that follow.
Figure 7.6: Narrowband spectrograms from the WFS beamformer. The 4 bottom rows correspond to the
different ways of forming the signal power spectrum estimate. The left hand column images are generated
from data with no extra pre-filtering and the right hand column images are from data that was processed
with the OSNR filtering prior to the WFS processing. The spectrograms for unweighted DSBF and OSNR
outputs are shown at the top for reference. The talker is female and the text being spoken is the alpha-digit
string ’BUCKLE83566’.
each other and have a noticeably lower level of residual noise than either the bf or cor processing types.
The noisy examples are greatly enhanced by the use of OSNR as a pre-processing step; the spectral peaks
in the background noise are greatly attenuated by the OSNR processing. The effect of OSNR
pre-processing on the quiet data is impossible to distinguish by listening. The WFS recordings exhibit a
degree of “breathing” at the transitions from speech to silence and silence to speech. This “breathing” is
more apparent in the quiet recordings (also when the min and com processing types with greater noise
suppression are used) where less residual background noise is available to mask the artifact. The processed
speech also exhibits a varying degree of echo and similar processing artifacts comparable to that observed
with WSF processing. Artifacts are lower when 16 microphones are used.
Table 7.7: Recognition performance for the WFS beamformer expressed in % words in error.
Table 7.8: Recognition performance for the WFSosnr beamformer expressed in % words in error.
cases before MAP training but once MAP training has been applied any improvement disappears. The
noisy data on the other hand does show a significant decrease in error rate with (as in the preceding
section) the com spectral estimate type leading the way.
Tables 7.9 and 7.10 show the measured distortion values for the WFS beamformer. For all measures
except for peak SNR, the quiet data shows no improvement with the WFS processing. The noisy data
shows some improvement though no particular spectral estimate method stands out from the others.
Table 7.9: Distortion values for the WFS beamformer. The best value in each column is highlighted in
boldface.
Table 7.10: Distortion values for the WFSosnr beamformer. The best value in each column is highlighted in
boldface.
Figure 7.7: Diagram of the optimal multi-channel Wiener (MCW) beamformer. The delay compensation
stage is followed by a parameter estimation stage which feeds into the channel filters applied before the
final summation.
– The noise power spectrum (σ²_m(ω)) for each channel was estimated in the same manner as for
the OSNR processing (see Section 7.1), with the minimum statistic method as described in
Section 6.1.
– The transfer function for each channel (H_m(ω) in Equation (5.26)) was estimated the same way
as in the OSNR processing (see Section 7.1), with the normalized average power spectrum for
each channel after noise subtraction.
– The signal power (|S|² in the numerator of Equation (5.26)) was estimated from the input
channel data with the 3 different methods described in Chapter 6, the same 3 methods used in
Sections 7.2 and 7.3 above.
– The signal power in the denominator of Equation (5.26) (the |S(ω)|²|H_m(ω)|² term) was
estimated with the channel power after subtracting the noise estimate, |Y_m(ω)|² − σ²_m.
The same exponential smoothing described in Section 7.2 was used to smooth the spectral estimates
in the numerator of Equation (5.26).
To guard against divide-by-zero or multiply-by-zero situations, the gain at each frequency was
constrained to lie within −40dB to 3dB by truncating at both ends of the range.
To reduce artifacts in the reconstructed signal, the post filter is truncated to 512 points in the time
domain to preserve the linear convolution property. This corresponds to a smoothing of the filter in
the frequency domain.
The filters were applied to the beamformer output in the frequency domain and reconstructed in the
time domain with an overlap-add technique. A Hanning window was used to taper the overlapping
segments together during reconstruction.
As in the preceding sections the 3 different spectral estimate types were used (cor, min, com).
For the MCW implementations the osnr version denotes the use of OSNR weighted data when
forming the signal spectrum estimate in the numerator of Equation (5.26). Unlike the
preceding algorithms, where the OSNR process was used as a pre-processing step, the MCW
algorithm already incorporates the OSNR weighting, so the MCWosnr implementations use the OSNR
pre-processing only on the data used in the signal spectrum estimate.
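The filter-truncation step above (limiting the derived filter to 512 time-domain points to smooth its frequency response and preserve linear convolution with a 1024-point FFT) can be sketched as follows; the function name is an assumption.

```python
import numpy as np

def truncate_filter(H, n_keep=512, n_fft=1024):
    """Truncate a frequency-domain filter to n_keep time-domain points.

    Zeroing the tail of the impulse response keeps the circular
    convolution with a 512-point segment equivalent to a linear
    convolution, and amounts to a smoothing of the filter in the
    frequency domain.
    """
    h = np.fft.irfft(H, n=n_fft)
    h[n_keep:] = 0.0
    return np.fft.rfft(h, n=n_fft)
```

A filter whose impulse response is already shorter than 512 points passes through unchanged.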
Figure 7.8: Narrowband spectrograms from the MCW beamformer. The 3 rows correspond to the different
ways of forming the signal power spectrum estimate. The spectrogram for the unweighted DSBF output
is shown at the top for reference. The talker is female and the text being spoken is the alpha-digit string
’BUCKLE83566’.
The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with the
MCW algorithm as described above, using 8 and 16 microphones. The results are presented in the sections
that follow.
Table 7.11: Recognition performance for the MCW beamformer expressed in % words in error.
Table 7.12: Recognition performance for the MCWosnr beamformer expressed in % words in error.
use of the OSNR data in the spectral estimation step does improve performance by a non-negligible
amount. The best-case improvement with MAP training using the quiet data is 21% of the difference
between the DSBF baseline and the close-talking microphone performance, for both the 8 microphone and
16 microphone cases. Using the noisy data the improvement is 44% for 16 microphones and 39% for 8
microphones.
The distortion measures shown in the corresponding tables are qualitatively similar to those in the preceding sections.
BSD is made worse in all cases, quiet and noisy. FD shows marginal improvement with the quiet data and
somewhat greater improvement with the noisy data. SSNR declines in nearly all cases, though it declines
less for the noisy data than for the quiet data. Peak SNR improves by nearly a factor of 2 in all cases.
Figure 7.9: Summary of word error rates in % words in error. The bottom level of each graph corresponds
to the close-talking microphone error rate. The axis on the right hand side shows the percent improvement
from the DSBF baseline. That is, the DSBF is 0% improved and the close-talking microphone is 100%
improved.
7.5 Summary
Figures 7.9 and 7.10 show the recognition performance for each tested combination of microphones (8 or
16) and database (quiet or noisy). Figure 7.9 shows the performance for algorithms using the unweighted
channels as input and Figure 7.10 shows the performance for algorithms using the OSNR filtering as a
preprocessing stage. These are the same values tabulated in the previous section, but presented graphically
and side by side to facilitate comparisons over the full range of algorithms.
Figure 7.10: Word error rates with OSNR input in % words in error. The bottom level of each graph
corresponds to the close-talking microphone error rate. The axis on the right hand side shows the percent
improvement from the DSBF baseline. That is, the DSBF is 0% improved and the close-talking microphone
is 100% improved.
Looking at the MAP-HMM column of Figure 7.9, for tests of the quiet data the OSNR, WSF and MCW
processing all improve recognition rates while WFS reduces recognition performance. For tests of the noisy
data every filtering strategy improves recognition performance, though the OSNR filtering out-performs all
but the MCW algorithm. This is a strong result considering that the OSNR is the only algorithm in this
comparison (apart from DSBF) that is distortionless. That is, OSNR has an overall flat system frequency
response whereas the other methods (WSF, WFS, MCW) all impose a non-uniform overall frequency
weighting that distorts the spectrum.
Figure 7.11: The best performing filtering schemes using 16 microphones and MAP training. These values
are culled from Figures 7.9 and 7.10.
In Figure 7.10 results for the quiet data show a similar trend to Figure 7.9: improved performance for
every strategy except WFS, for which performance declines. Unlike the tests without OSNR
pre-processing, for the noisy data the performance of WFSosnr, WSFosnr and MCWosnr are all better than
the performance of the OSNR filtering alone. The gains from the OSNR weighting and the gain from the
noise-reduction filtering which follows are additive, which is not unexpected. Using the OSNR as a
pre-processing step simultaneously improves the signal estimate available to the filtering step and provides
an inter-microphone weighting that is missing from the WFS and WSF processing types. MCW already
incorporates the non-uniform microphone weighting but gains from using the OSNR weighted data for the
spectral estimate.
The relatively poor performance of the WFS algorithms on the quiet data may seem
counter-intuitive, since the WFS algorithms are arguably the best sounding processing types on the quiet
data, at least in terms of adding the fewest artifacts to the processed speech. The ad-hoc
nature of the WFS algorithm and the way it may distort the spectrum seem to be reflected in the recognition
results. WFS is the only algorithm that is not based directly on an optimization, and it is also the only
algorithm that reduces the recognition performance on quiet data. This is probably not a coincidence.
To better compare the best performing algorithms, Figure 7.11 shows a subset of the results in Figures 7.9
and 7.10 using 16 microphones and the MAP model for both quiet and noisy databases. With the quiet
data the performance of the MCW and WSF variants is virtually identical. With the noisy data the MCW
outperforms the WSF, but the versions with OSNR included are deadlocked again. As discussed in Section
5.2.3, WSFosnr and MCWosnr use the same inter-microphone weighting function and differ only in the
specifics of the final frequency shaping. What the results here show is that the difference in frequency
weighting between the WSF and MCW methods is not significant enough to affect the recognition
performance; the difference in performance between WSF and MCW goes away when the OSNR
weighting is used equally in both methods.
Figures 7.12 and 7.13 graphically summarize the values of the various distortion measures applied to the
filtering algorithms4. In Figure 7.12 the FD measure varies only slightly with the different algorithms
when used on the quiet data; with the noisy data, on the other hand, every algorithm significantly lowers the
measured FD. This is not unlike the recognition results, where with the noisy data any distortion introduced
by the processing methods is outweighed by the degree to which they suppress the noise. For the quiet
data the level of noise is low enough that the gain from reducing it and the penalty paid for introducing
filtering distortions are much more similar in magnitude.
The BSD values measured on the quiet data increase for all algorithms. The increase for WSFcor is
minimal but virtually every other algorithm shows a significant increase. The increase in BSD is generally
greater for those methods with greater noise suppression: using the com and min spectral estimates
generally results in a higher BSD, and these two methods generally produce a larger estimate of the noise
(and correspondingly greater noise suppression) than the cor method. The measurements on the noisy data
show a similar upward trend in the BSD, though in this case only the MCW values are worse than the
DSBF and OSNR baselines. The SSNR values shown in Figure 7.13 show the complementary trend with
4 The values for the implementations using OSNR pre-processing are qualitatively extremely similar and are not plotted here.
[Bar charts: FD (top) and BSD (bottom) for 8 and 16 microphones, quiet (left) and noisy (right) data, across the DSBF, OSNR, WFS, WSF and MCW variants.]
Figure 7.12: Summary of FD and BSD values measured on the variety of beamforming algorithms.
slight differences. The similarity of these trends is entirely expected since the SSNR measurement is
essentially a linear-frequency version of the BSD measurement. In this measurement the WSF algorithms
show slight improvement even on the quiet data, and the MCW algorithms (as with the BSD
measurements) still show the worst performance by this measure. This ordering is reversed in the peak
SNR graphs: every algorithm shows a significant increase in SNR, and the WFS and MCW algorithms
show greater SNR than the WSF algorithms. These results point towards the tradeoff between introducing
distortion and suppressing noise. The more aggressively the noise is suppressed (indicated by SNR), the
more unwanted signal distortions (indicated by BSD and SSNR) will creep in. The surprise is how the
FD measure does not follow the other distortion measures as tightly as it did in Chapter 3. Despite having
the worst BSD performance in the group, the MCW algorithms' FD scores and recognition performance
[Bar charts: SSNR and SNR in dB for 8 and 16 microphones, quiet (left) and noisy (right) data, across the DSBF, OSNR, WFS, WSF and MCW variants.]
Figure 7.13: Summary of SSNR and SNR values measured on the variety of beamforming algorithms.
5 4 trials for OSNR (8 and 16 microphones, quiet and noisy), 24 trials each for WSF and MCW (8 and 16 microphones, quiet and
noisy, OSNR pre-processed or not, 3 spectral estimate types) and 32 trials for WFS (8 and 16 microphones, quiet and noisy, OSNR
pre-processed or not, 4 spectral estimate types).
[Scatter plots: word error rate vs. FD (RMSE: 2.8% baseline, 1.8% MAP), BSD (8.9%, 4.4%), SSNR (6.0%, 2.9%) and SNR (6.1%, 3.2%), for the baseline and MAP models.]
Figure 7.14: Scatter plots of error rate and distortion measures. Each figure plots word error rate as a
function of the measured distortion values FD, BSD, SSNR and SNR. One set of marks denotes results from the baseline
HMM and the other denotes the results after MAP training. There are 84 data points. The linear fits to each set of
points are overlaid and the RMS errors from the linear fit for the baseline-HMM and the MAP-HMM are
shown below each plot.
somewhat better than the reference close-talking microphone error rate of 8.16%, suggesting that data
points at lower distortions and error rates than those plotted here would fall above the linear trend shown
here. It is interesting, though, that the MAP-HMM and baseline-HMM linear trends for the different
measures intersect at such similar performance points.
The corresponding RMS linear-fit error figures for the baseline measurements made on the DSBF data are
shown in Table 3.9. Compared to the measurements made on the DSBF-processed data in Section 3.3.2, the
distortion measurements here are generally less correlated with error rate, the baseline-HMM error rate
especially. Only FD has a better linear fit here than with the DSBF measurements. This disparity exists despite
the restricted range over which the error rates here fall; in Section 3.3.2 a good deal of the linear-fit error
was due to a strong nonlinearity in the relation between distortion measures and error rates. In the
measurements presented here the trends appear quite linear, but with a greater variance from the trend.
The increase in overall apparent linearity is at least in part because the range of error rates observed in this
data is significantly smaller than the range observed in Section 3.3.2.
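The linear fits and RMS fit errors discussed above amount to ordinary least squares followed by an RMS residual; the sketch below uses made-up (distortion, error-rate) pairs rather than the thesis data, and the function name is illustrative.

```python
import math

# Sketch of the RMS linear-fit error used to compare a distortion
# measure with error rate: fit y = a*x + b by least squares, then
# report the RMS residual of the fit.
def linear_fit_rmse(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    b = my - a * mx
    return math.sqrt(sum((yi - (a * xi + b)) ** 2
                         for xi, yi in zip(x, y)) / n)

fd = [0.5, 0.7, 0.9, 1.1, 1.3]        # hypothetical FD values
wer = [12.0, 18.5, 27.0, 33.0, 41.5]  # hypothetical word error rates (%)
print(round(linear_fit_rmse(fd, wer), 3))
```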
Figure 7.15 shows scatter plots of the FD, BSD, and SNR as functions of each other for measurements
taken on the noise-suppressed data (OSNR, WSF, WFS, MCW) and for measurements taken on the DSBF
baseline data presented in Chapter 3. The greatly reduced correlation between measures seen with the
noise-reduction algorithms is readily apparent. With the DSBF measurements the successive addition of
microphones yields data points that travel somewhat continuously through all the measurement spaces.
With the nonlinear nature of the noise-suppression algorithms this property appears to no longer hold. In
[Scatter plots: BSD vs. FD, SNR vs. FD and SNR vs. BSD for the noise-suppression algorithms (WSF, WFS, MCW, OSNR) and the DSBF.]
Figure 7.15: Scatter plots comparing the correlation of distortion measures for the noise-suppression algorithms
(OSNR, WSF, WFS, MCW) and for the DSBF. One set of marks represents the values measured in Chapter 3 on
the output of the DSBF; the other represents the values measured from the noise-reduction algorithms
(OSNR, WSF, WFS, MCW).
Figure 7.15 the scatter plot of BSD and SNR reflects the low correlation between these measures
compared to the relatively tight correlation that was observed with the DSBF measurements. Figure 7.16
repeats the scatter plot from Figure 7.15 but breaks each algorithm out into its own marker and linear-fit
line. From this it is apparent that within each algorithm type the correlation between measures is much
stronger than it is between algorithm types, though certainly the reduced number of points in each cluster
contributes somewhat to that perception. In particular, for the FD vs. BSD scatter plot the difference
between the algorithms is primarily a different bias in BSD for each algorithm. The FD vs. SNR plot shows
quite a bit less separation between the different algorithms. Note also that taken on their own the OSNR
points fall very neatly along a linear trend, much more like the DSBF measurements in Chapter 3. In fact
the OSNR measurements generally fall very close to the trends established by the DSBF data, as shown in
Figure 7.17. In contrast to Figure 7.15, the OSNR measurements taken alone are quite consistent with the
trends set by the DSBF measurements. This reflects how closely related the OSNR algorithm is to
the DSBF algorithm: it does not employ the Wiener-filtering noise suppression that is common to all
the other algorithm types.
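The per-algorithm breakdown used in Figure 7.16 amounts to grouping the measurement points by algorithm label and fitting a least-squares line within each group; a sketch follows, with invented points and an illustrative function name.

```python
from collections import defaultdict

# Sketch of a per-algorithm breakdown: group (x, y) measurement pairs
# by algorithm label and fit a least-squares line within each group.
def fits_by_algorithm(points):
    """points: iterable of (label, x, y); returns {label: (slope, intercept)}."""
    groups = defaultdict(list)
    for label, x, y in points:
        groups[label].append((x, y))
    fits = {}
    for label, xy in groups.items():
        xs = [x for x, _ in xy]
        ys = [y for _, y in xy]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        a = sum((x - mx) * (y - my) for x, y in xy) / \
            sum((x - mx) ** 2 for x in xs)
        fits[label] = (a, my - a * mx)
    return fits

# Invented (FD, BSD)-style points for two hypothetical algorithm groups.
pts = [("OSNR", 0.5, 0.06), ("OSNR", 0.7, 0.08), ("OSNR", 0.9, 0.10),
       ("MCW", 0.5, 0.10), ("MCW", 0.7, 0.11), ("MCW", 0.9, 0.12)]
print(fits_by_algorithm(pts))
```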
[Scatter plots: BSD vs. FD, SNR vs. FD and SNR vs. BSD, broken out by algorithm (OSNR, WSF, WFS, MCW).]
Figure 7.16: Scatter plot of distortion measurements by algorithm type. When broken out into the different
algorithm types the distortion measures show a stronger linear correlation with each other.
[Scatter plots: BSD vs. FD, SNR vs. FD and SNR vs. BSD for OSNR and DSBF.]
Figure 7.17: Scatter plots of the OSNR distortion measurements along with the DSBF measurements. The
OSNR measurements fall much closer to the DSBF trends than those of the other tested algorithms.
CHAPTER 8:
SUMMARY AND CONCLUSIONS
The goal of this work was to measure the performance of a delay-and-sum beamformer and to investigate
techniques for improving upon that performance. Several measures by which to judge performance were
introduced in Chapter 2. The measures introduced vary from traditional signal-to-noise ratio measures
(SNR, SSNR) to perceptually motivated measures that more closely reflect subjective speech quality
(BSD). The feature distortion (FD) measure was also introduced as an attempt to predict the performance
of a speech recognition system. In Chapter 3 a database of microphone-array recordings was described.
This database of recordings was originally collected to make direct comparisons between the performance
of the microphone array and a close-talking microphone in a speech recognition task [6]. Because the
microphone-array recordings include simultaneous recordings with a close-talking microphone, signal
quality measures that require a reference signal (FD, SSNR, BSD) could be used to evaluate the results of
beamforming algorithms. Chapter 3 also describes how a high-noise database was created by adding noise
recorded by the same microphone array to the original, relatively quiet, recordings. Chapter 3 describes
the performance of a delay-and-sum beamformer using from 1 to 16 microphones. The results were
evaluated with the measures described in Chapter 2 and with the performance of an alphadigit speech
recognition system. The MAP retraining method was used to adapt the speech recognition models and
optimize the performance on the novel microphone-array data. Chapter 4 used simulations of a linear
microphone array to investigate the limits of delay-and-sum techniques in noisy and reverberant
environments. Motivated by the results in Chapter 4, in Chapter 5 MMSE optimizations for single channel
signal enhancement were extended to an optimal multi-input solution (MCW). The optimal multi-input
solution was solved for signal-plus-noise and filtered-signal-plus-noise models for the received signal. In
addition, an intuitively appealing but non-optimal filter-and-sum approach (WFS) was presented and
analyzed. Chapter 6 describes some methods for generating the spectral estimate required for the
implementation of the Wiener filtering strategies including a novel combination of cross-spectrum and
minimum-noise-subtraction spectrum estimation. Finally, Chapter 7 presents implementations and
evaluations of the various speech enhancement algorithms. Significant points of the results include:
- Overall the noise-reduction techniques were quite successful in improving recognition performance,
reducing the gap between the DSBF performance and the close-talking microphone performance by
up to 27% on the quiet data and 45% on the noisy data.
- The OSNR weighting is very successful in the noisy-data tests, outperforming all but the MCW
algorithm. This is significant in that, unlike the other algorithms, the OSNR weighting is a
distortion-free filtering.
- The MCW algorithm has the best speech recognition performance on the noisy data and is within
the smallest of margins of the WSF algorithm on the quiet data.
- When OSNR is used as a pre-processing step, the MCW and WSF algorithms perform nearly
identically on the speech-recognition task. The OSNR pre-processing is the deciding factor in the
speech recognition performance; the difference between the frequency weightings of the WSF and
MCW algorithms is insignificant by comparison.
- The min and com spectrum estimates generally result in better recognition scores and worse
distortion scores than the cor cross-spectrum method. This is largely due to the generally larger
noise spectrum estimates from these two methods.
- The WFS algorithm has the worst speech recognition scores and distortion measures of the three
Wiener filtering schemes, although it shows strong improvement in SNR and in informal evaluations
of subjective quality. WFS is the only algorithm that has worse recognition performance than the
DSBF on the quiet data set.
- The FD measure does a consistently good job of predicting speech recognition performance.
- The Wiener-based methods show very different relationships between measurements than the DSBF
and OSNR algorithms. FD is still strongly related to speech recognition performance, but the strong
relationships with BSD, SSNR and SNR observed with the DSBF tests are not seen here. This was
foreseen in Chapter 3; the DSBF is unique in that adding microphones simultaneously reduces the
noise and enhances the signal in a fairly uniform manner. The Wiener filtering strategies, on the
other hand, are based upon amplifying the signal in high-SNR regions and squelching it in low-SNR
regions, and do so in a nonlinear fashion. The result is that noise is suppressed at the cost of
increased signal distortion.
- The MAP training technique was very effective at tuning the recognizer to the novel data, often
reducing the error rate by nearly 50%. On the other hand, the baseline-HMM recognition
performance closely follows the MAP-HMM performance; for comparing the performance of two
speech enhancement methods it may not be necessary to do MAP retraining, since the performance
given by the baseline model may reflect the MAP results sufficiently well.
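The amplify/squelch behavior noted in the points above can be illustrated with the classic single-channel Wiener suppression rule; this is a generic sketch of that rule only, not the MCW/WSF/WFS formulations derived in Chapter 5.

```python
# Generic Wiener suppression rule: the per-bin gain SNR/(1+SNR) passes
# high-SNR frequency bins nearly unchanged and squelches low-SNR bins.
# This illustrates the tradeoff discussed above; it is not the exact
# multi-channel weighting used by MCW, WSF or WFS.
def wiener_gain(snr):
    """Per-bin Wiener gain for a given a priori SNR (linear, not dB)."""
    return snr / (1.0 + snr)

for snr in (0.1, 1.0, 10.0):
    print(snr, round(wiener_gain(snr), 3))
```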
BIBLIOGRAPHY
[1] M. S. Brandstein and D. B. Ward, editors. Microphone Arrays: Signal Processing Techniques and
Applications. Springer Verlag, 2001.
[2] J. L. Flanagan, A. Surendran, and E. Jan. Spatially selective sound capture for speech and audio
processing. Speech Communication, 13(1-2):207–222, 1993.
[3] Y. Grenier. A microphone array for car environments. In Proceedings of ICASSP-92 [92], pages
305–309.
[4] W. Kellerman. A self-steering digital microphone array. In Proceedings of ICASSP-91 [93], pages
3581–3584.
[5] J. Adcock, J. DiBiase, M. Brandstein, and H. F. Silverman. Practical issues in the use of a
frequency-domain delay estimator for microphone-array applications. In Proceedings of Acoustical
Society of America Meeting, Austin, Texas, November 1994.
[6] J. Adcock, Y. Gotoh, D. J. Mashao, and H. F. Silverman. Microphone-array speech recognition via
incremental MAP training. In Proceedings of ICASSP-96 [94], pages 897–900.
[7] J. L. Flanagan. Bandwidth design for speech-seeking microphone arrays. In Proceedings of
ICASSP-85, pages 732–735, Tampa, FL, March 1985. IEEE.
[8] J. L. Flanagan, D. Berkley, G. Elko, J. West, and M. Sondhi. Autodirective microphone systems.
Acustica, 73:58–71, 1991.
[9] S. Oh, V. Viswanathan, and P. Papamichalis. Hands-free voice communication in an automobile with
a microphone array. In Proceedings of ICASSP-92 [92], pages 281–284.
[10] H. F. Silverman. Some analysis of microphone arrays for speech data acquisition. IEEE Trans.
Acoust. Speech Signal Process., ASSP-35(2):1699–1712, December 1987.
[11] C. Che, M. Rahim, and J. Flanagan. Robust speech recognition in a multimedia teleconferencing
environment. J. Acoust. Soc. Am., 92(4, pt.2):2476(A), 1992.
[12] D. Giuliani, M. Omologo, and P. Svaizer. Talker localization and speech recognition using a
microphone array and a cross-power spectrum phase analysis. In Proceedings of ICSLP, volume 3,
pages 1243–1246, September 1994.
[13] Maurizio Omologo and Piergiorgio Svaizer. Acoustic event localization using a
crosspower-spectrum phase based technique. In Proceedings of ICASSP-94, volume II, pages
273–276, Adelaide, Australia, April 1994. IEEE.
[14] M. Omologo and P. Svaizer. Use of the cross-power spectrum phase in acoustic event localization.
Technical Report Technical Report No. 9303-13, IRST, Povo di Trento, Italy, March 1993.
[15] B. D. Van Veen and K. M. Buckley. Beamforming: A versatile approach to spatial filtering. IEEE
ASSP Magazine, 5(2):4–24, April 1988.
[16] J. L. Flanagan, J. D. Johnson, R. Zahn, and G. W. Elko. Computer-steered microphone arrays for
sound transduction in large rooms. J. Acoust. Soc. Am., 78(5):1508–1518, November 1985.
[17] H. F. Silverman. Some analysis of microphone arrays for speech data acquisition. LEMS Technical
Report 27, LEMS, Division of Engineering, Brown University, Providence, RI 02912, September
1986.
[18] Masato Miyoshi and Yutaka Kaneda. Inverse filtering of room acoustics. IEEE Transactions on
Acoustics, Speech, and Signal Processing, 36(2):145–152, February 1988.
[19] Hideaki Yamada, Hong Wang, and Fumitada Itakura. Recovering of broad band reverberant speech
signal by sub-band MINT method. In Proceedings of ICASSP-91 [93], pages 969–972.
[20] S. T. Neely and J. B. Allen. Invertibility of a room impulse response. J. Acoust. Soc. Amer.,
66(1):165–169, July 1979.
[21] J. Mourjopolous. On the variation and invertibility of room impulse response functions. Journal of
Sound and Vibration, 102(2):217–228, 1985.
[22] Takafumi Hikichi and Fumitada Itakura. Time variation of room acoustic transfer functions and its
effects on a multi-microphone dereverberation approach. Preprint received at 2nd International
Workshop on Microphone Arrays, Rutgers University, NJ, 1994.
[23] E. Jan, P. Svaizer, and J. Flanagan. Matched-filter processing of microphone array for spatial volume
selectivity. In Proceedings of ICASSP-95 [95], pages 1460–1463.
[24] O. L. Frost. An algorithm for linearly constrained adaptive array processing. Proceedings of the
IEEE, 60(8):926–935, August 1972.
[25] L. J. Griffiths and C. W. Jim. An alternative approach to linearly constrained adaptive beamforming.
IEEE Transactions on Antennas and Propagation, AP-30(1):27–34, January 1982.
[26] B. Widrow, P. E. Mantey, L. J. Griffiths, and B. B. Goode. Adaptive antenna systems. Proceedings of
the IEEE, 55:2143–2159, 1967.
[27] B. Widrow, J. R. Glover, J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Ziedler,
E. Dong, and R. C. Goodlin. Adaptive noise cancelling: Principles and applications. Proceedings of
the IEEE, 63(12):1692–1716, December 1975.
[28] Osamu Hoshuyama and Akihiko Sugiyama. A robust adaptive beamformer for microphone arrays
with a blocking matrix using constrained adaptive filters. In Proceedings of ICASSP-96 [94], pages
925–928.
[29] Jens Meyer and Carsten Sydow. Noise cancelling for microphone arrays. In Proceedings of
ICASSP-97 [96], pages 211–213.
[30] Joerg Bitzer, Klaus Uwe Simmer, and Karl-Dirk Kammeyer. Multi-microphone noise reduction by
post-filter and superdirective beamformer. In Proceedings of International Workshop on Acoustic
Echo and Noise Control, pages 100–103, Pocono Manor, USA, September 1999.
[31] Peter L. Chu. Superdirective microphone array for a set-top videoconferencing system. In
Proceedings of ICASSP-97 [96], pages 235–238.
[32] J. Kates. Superdirective arrays for hearing aids. Journal of the Acoustical Society of America,
94(4):1930–1933, 1993.
[33] S. F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on
Acoustics, Speech and Signal Processing, 27(2):113–120, April 1979.
[34] Levent Arslan, Alan McCree, and Vishu Viswanathan. New methods for adaptive noise suppression.
In Proceedings of ICASSP-95 [95], pages 812–815.
[35] T. S. Sun, S. Nandkumar, J. Carmody, J. Rothweiler, A. Goldschen, N. Russell, S. Mpasi, and
P. Green. Speech enhancement using a ternary-decision based filter. In Proceedings of ICASSP-95
[95], pages 820–823.
[36] R. J. McAulay and M. L. Malpass. Speech enhancement using a soft-decision noise suppression
filter. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28:137–145, 1980.
[37] E. Bryan George. Single-sensor speech enhancement using a soft-decision/variable attenuation
algorithm. In Proceedings of ICASSP-95 [95], pages 816–819.
[38] R. Zelinski. A microphone array with adaptive post-filtering for noise reduction in reverberant
rooms. In Proceedings of ICASSP-88, pages 2578–2580, New York, April 1988. IEEE.
[39] Claude Marro, Yannick Mahieux, and K. Uwe Simmer. Analysis of Noise Reduction and
Dereverberation Techniques Based on Microphone Arrays with Postfiltering. IEEE Transactions on
Speech and Audio Processing, 6(3):240–259, May 1998.
[40] Joerg Meyer and Klaus Uwe Simmer. Multi-channel speech enhancement in a car environment using
Wiener filtering and spectral subtraction. In Proceedings of ICASSP-97 [96], pages 1167–1171.
[41] Djamila Mahmoudi and Andrzej Drygajlo. Combined Wiener and coherence filtering in wavelet
domain for microphone array speech enhancement. In Proceedings of ICASSP-98 [97], pages
385–389.
[42] T. E. Tremain, M. A. Kohler, and T. G. Champion. Philosophy and goals of the DoD 2400 bps
vocoder selection process. In Proceedings of ICASSP-96 [94], pages 1137–1140.
[43] Matthew R. Bielefeld and Lynn M. Supplee. Developing a test program for the DoD 2400 bps
vocoder selection process. In Proceedings of ICASSP-96 [94], pages 1141–1144.
[44] John D. Tardelli and Elizabeth Woodard Kreamer. Vocoder intelligibility and quality test methods. In
Proceedings of ICASSP-96 [94], pages 1145–1148.
[45] Elizabeth Woodard Kreamer and John D. Tardelli. Communicability testing for voice coders. In
Proceedings of ICASSP-96 [94], pages 1153–1156.
[46] M. A. Kohler, Philip A. LaFollette, and Matthew R. Bielefeld. Criteria for the DoD 2400 bps
vocoder selection. In Proceedings of ICASSP-96 [94], pages 1161–1164.
[47] M. A. Kohler, Philip La Follette, and Matthew R. Bielefeld. Criteria for the DoD 2400 bps vocoder
selection. In Proceedings of ICASSP-96 [94], pages 1161–1164.
[48] Schuyler R. Quackenbush, Thomas P. Barnwell III, and Mark A. Clements. Objective Measures of
Speech Quality. Prentice Hall, Englewood Cliffs, NJ, 1988.
[49] K. Lam, O. Au, C. Chan, K. Hui, and S. Lau. Objective speech quality measure for cellular phone. In
Proceedings of ICASSP-96 [94], pages 487–490.
[50] Shihua Wang, Andrew Sekey, and Allen Gersho. An objective measure for predicting subjective
quality of speech coders. IEEE Journal on Selected Areas in Communications, 10(5):819–829, June
1992.
[51] Wonho Yang, Majid Benbouchta, and Robert Yantorno. Performance of the modified Bark spectral
distortion as an objective speech quality measure. In Proceedings of ICASSP-98 [97], pages
541–544.
[52] Wonho Yang and Robert Yantorno. Improvement of MBSD by scaling noise masking threshold and
correlation analysis with MOS difference instead of MOS. In Proceedings of ICASSP-99, Phoenix,
Arizona, April 1999. IEEE.
[53] John R. Deller, Jr., John G. Proakis, and John H. L. Hansen. Discrete-Time Processing of Speech
Signals. Prentice Hall, Upper Saddle River, NJ, 1987.
[54] E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer-Verlag, 1990.
[55] D. W. Robinson and R. S. Dadson. A re-determination of the equal-loudness relations for pure tones.
British Journal of Applied Physics, 7:166–181, May 1956.
[56] James D. Johnston. Transform coding of audio signals using perceptual noise criteria. IEEE Journal
on Selected Areas in Communications, 6(2):314–323, Feb 1988.
[57] D. J. Mashao, Y. Gotoh, and H. F. Silverman. Analysis of LPC/DFT features for an HMM-based
alphadigit recognizer. IEEE Signal Processing Letters, 3(4):103–106, April 1996.
[58] H. F. Silverman and Yoshihiko Gotoh. On the implementation and computation of training an HMM
recognizer having explicit state durations and multiple-feature-set, tied-mixture output probabilities.
LEMS Technical Report 129, LEMS, Division of Engineering, Brown University, Providence, RI
02912, December 1993.
[59] M. Hochberg, J. Foote, and H. Silverman. The LEMS talker-independent connected speech
alphadigit recognition system. Technical Report 82, LEMS, Division of Engineering, Brown
University, Providence RI, 1991.
[60] Lawrence R. Rabiner and Biing Hwang Juang. Fundamentals of Speech Recognition. Prentice-Hall,
Englewood Cliffs, N.J., 1993.
[61] Stefan Gustafsson, Peter Jax, and Peter Vary. A novel psychoacoustically motivated audio
enhancement algorithm preserving background noise characteristics. In Proceedings of ICASSP-98
[97], pages 397–400.
[62] Yoh'ichi Tohkura. A weighted cepstral distance measure for speech recognition. IEEE Trans. on
Acoustics, Speech and Signal Processing, 35(10):1414–1422, 1987.
[63] S. E. Kirtman and H. F. Silverman. A user-friendly system for microphone-array research. In
Proceedings of ICASSP-95 [95], pages 3015–3018.
[64] Maurizio Omologo and Piergiorgio Svaizer. Acoustic source location in noisy and reverberant
environment using CSP analysis. In Proceedings of ICASSP-96 [94], pages 921–924.
[65] P. Svaizer, M. Matassoni, and M. Omologo. Acoustic source location in a three-dimensional space
using crosspower spectrum phase. In Proceedings of ICASSP-97 [96], pages 231–234.
[66] M. Omologo and P. Svaizer. Use of the cross-power spectrum phase in acoustic event localization.
IEEE Transactions on Speech and Audio Processing, 5(3):288–292, 1997.
[67] S. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, first
edition, 1993.
[68] Y. Gotoh, M. M. Hochberg, D. J. Mashao, and H. F. Silverman. Incremental MAP estimation of
HMMs for efficient training and improved performance. In Proceedings of ICASSP-95 [95], pages
457–460.
[69] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Society, series B, 39(1):1–38, 1977.
[70] Radford M. Neal and Geoffrey E. Hinton. A new view of the EM algorithm that justifies incremental
and other variants. submitted to Biometrika, 1993.
[71] Jean-Luc Gauvain and Chin-Hui Lee. Maximum a posteriori estimation for multivariate Gaussian
mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing,
2(2):291–298, April 1994.
[72] Y. Gotoh and H. F. Silverman. Incremental ML estimation of HMM parameters for efficient training.
In Proceedings of ICASSP-96 [94].
[73] Y. Gotoh, M. M. Hochberg, and H. F. Silverman. Efficient training algorithms for HMMs using
incremental estimation. IEEE Transactions on Speech and Audio Processing, 6(6):539–548,
November 1996.
[74] William W Seto. Schaum’s Outline of Theory and Problems of Acoustics. Schaum’s Outline Series.
McGraw-Hill Publishing Company, New York, 1971.
[75] F. Pirz. Design of a wideband, constant beamwidth, array microphone for use in the near field. Bell
System Technical Journal, 58(8):1839–1850, October 1979.
[76] M. Goodwin and G. Elko. Constant beamwidth beamforming. In Proceedings of ICASSP-93 [98],
pages 169–172.
[77] J. Lardies. Acoustic ring array with constant beamwidth over a very wide frequency range. Acoust.
Letters, 13(5):77–81, 1989.
[78] William Mendenhall, Dennis D. Wackerly, and Richard L. Scheaffer. Mathematical Statistics with
Applications. The Duxbury Series in Statistics and Decision Sciences. PWS-KENT, Boston,
Massachusetts, fourth edition, 1990.
[79] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art
of Scientific Computing. Cambridge University Press, Cambridge, UK CB2 1RP, 2nd edition, 1992.
[80] J. B. Allen and D. A. Berkley. Image method for efficiently simulating small room acoustics. J.
Acoust. Soc. Am., 65(4):943–950, April 1979.
[81] D. Johnson and D. Dudgeon. Array Signal Processing- Concepts and Techniques. Prentice Hall, first
edition, 1993.
[82] Peter L. Chu. Desktop mic array for teleconferencing. In Proceedings of ICASSP-95 [95], pages
2999–3002. Volume 5.
[83] H. G. Hirsche and C. Ehrlicher. Noise estimation techniques for robust speech recognition. In
Proceedings of ICASSP-95 [95], pages 153–156.
[84] Sven Fischer and Karl-Dirk Kammeyer. Broadband beamforming with adaptive postfiltering for
speech acquisition in noisy environments. In Proceedings of ICASSP-97 [96], pages 359–363.
[85] Regine Le Bouquin-Jeannes, Ahmad Akbari Azirani, and Gerard Faucon. Enhancement of speech
degraded by coherent and incoherent noise using a cross-spectral estimator. IEEE Transactions on
Speech and Audio Processing, 5(5):484–487, September 1997. Correspondence.
[86] Chang D. Yoo and Jae S. Lim. Speech enhancement based on the generalized dual excitation model
with adaptive analysis window. In Proceedings of ICASSP-95 [95], pages 832–835.
[87] C. d’Alessandro, B. Yegnanarayana, and V. Darsinos. Decomposition of speech signals into
deterministic and stochastic components. In Proceedings of ICASSP-95 [95], pages 760–763.
[88] John Hardwick, Chang D. Yoo, and Jae S. Lim. Speech enhancement using the dual excitation
speech model. In Proceedings of ICASSP-93 [98], pages 367–370.
[89] Zenton Goh, Kah Chye Tan, and B. T. G. Tan. Speech enhancement based on a voiced-unvoiced
speech model. In Proceedings of ICASSP-98 [97], pages 401–404.
[90] Lance Riek and Randy Goldberg. A Practical Handbook of Speech Coders. CRC Press, Boca Raton,
FL, 2000.
[91] Nathalie Virag. Speech enhancement based on masking properties of the auditory system. In
Proceedings of ICASSP-95 [95], pages 796–799.
[92] IEEE. International Conference on Acoustics, Speech, and Signal Processing, San Francisco, CA,
March 1992.
[93] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, May
1991.
[94] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, May
1996.
[95] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, May
1995.
[96] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany,
April 1997.
[97] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington,
May 1998.
[98] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, MN,
April 1993.