
OPTIMAL FILTERING AND SPEECH RECOGNITION WITH

MICROPHONE ARRAYS

By
John E. Adcock
Sc.B., Computer Science, Brown University, 1989
Sc.M., Engineering, Brown University, 1993

A dissertation submitted in partial fulfillment of the requirements for the degree of


Doctor of Philosophy in the Division of Engineering at Brown University

Providence, Rhode Island


May 2001
This dissertation by John E. Adcock is accepted in its present form by the Division of Engineering as
satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date
Harvey F. Silverman, Director

Recommended to the Graduate Council

Date
Michael S. Brandstein, Reader

Date
Allan E. Pearson, Reader

Approved by the Graduate Council

Date
Peder J. Estrup
Dean of the Graduate School and Research

© Copyright 2001 by John E. Adcock
THE VITA OF JOHN E. ADCOCK

John was born October 19, 1967 in Boston, Massachusetts. As a child he spent five years with his family
in Paris, France, returning in 1980 to Pound Ridge, NY, where he lived while attending high school in
Bedford Hills, NY. John received the Bachelor of Science degree in Computer Science, magna cum laude,
from Brown University in 1989. He subsequently spent two years as an engineer in the Signal Processing
Center of Technology at Lockheed Sanders in Nashua, New Hampshire before returning to Brown
University to pursue advanced degrees in Engineering. John earned his Master of Science degree in
Engineering from Brown in 1993. John received a University Fellowship in 1991, and the Doris I.
Eccleston ‘25 Fellowship in 1996 for support of his graduate studies. In 1993 John worked with Brown
Engineering alumnus Krishna Nathan in the handwriting recognition group at IBM Watson Research
laboratories in Hawthorne, NY and in 1995 as Brown Engineering alumnus Professor Michael
Brandstein’s teaching assistant at the Johns Hopkins University Center for Talented Youth summer
program in Lancaster, PA. In 1998 and 1999 John’s fingers provided the thundering bass lines for the
popular local rock band Wave. John is co-inventor of a 1998 Brown University patent describing a method
for microphone-array source location. John has worked as a self-employed programmer/technical
consultant and was briefly a partner in an Internet bingo venture.

ACKNOWLEDGMENTS

I thank my advisor, Professor Harvey Silverman, for his support and trust over the course of my graduate
career at Brown. Thanks to my readers, Professor Allan E. Pearson and especially Professor Michael
Brandstein whose critical input and encouragement were vital to the completion of this work.
In my time at Brown I have had the privilege of working and playing with many wonderful and talented
people: Michael Hochberg, Jon Foote and Alexis Tzannes, who were here with me at the beginning, and Joe
DiBiase, Michael Brandstein, Aaron Smith and Michael Blane, who shared my trials in the remainder. I
also thank my wonderful friends Lance Riek, Laura Maxwell and Carina Quezada for their relentless
support.
Thanks to Ginny Novak for the uncomplaining effort she has consistently made to extend my deadlines
and otherwise keep me in the good graces of the Registrar and Graduate School.
Finally, at the culmination of my formal education, I thank my parents for all they’ve taught me over the
course of many years.

CONTENTS

1 Introduction 1
1.1 Methods for Acquiring Speech With Microphone Arrays . . . . . . . . . . . . . . . . . . 1
1.2 The Scope of This Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Evaluating Speech Enhancement 4


2.1 Listening Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Objective Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Signal-to-Noise Ratio (SNR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Segmental Signal-to-Noise Ratio (SSNR) . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 Bark Spectral Distortion (BSD) . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Speech Recognition Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Feature Distortion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Speech Recognition With Microphone Arrays 9


3.1 Experimental Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Beamforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.3 Recognizer Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.4 Signal Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.5 Recognition Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Noisy Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Signal Measurements and Recognition Performance . . . . . . . . . . . . . . . . 20
3.3 Correlation Between Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Linear Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 Fit Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.3 Nonlinear Fits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Towards Enhancing Delay and Sum Beamforming 31


4.1 Overview of the Delay-and-Sum beamformer . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Delay-Weight-and-Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Delay-Filter-and-Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.1 Optimal-SNR Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 A Reverberant Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

v
5 Optimal Filtering 41
5.1 The Single Channel Wiener Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.1 Additive Uncorrelated Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Multi-channel Wiener Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2.1 Additive Uncorrelated Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.2 Direct Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.3 Filtered Signal Plus Additive Independent Noise . . . . . . . . . . . . . . . . . . 46
5.2.4 Filtered Signal Plus Semi-Independent Noise Model . . . . . . . . . . . . . . . . 48
5.3 A Non-Optimal Filter and Sum Framework . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6 Signal Spectrum Estimation 53


6.1 Spectral-Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Cross-Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2.1 Computational Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.3 Combining Spectral-Subtraction and Cross-Power . . . . . . . . . . . . . . . . . . . . . . 56
6.4 Comparison of Signal Estimate Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

7 Implementations and Analysis 59


7.1 Optimal-SNR Filter-and-Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.1.1 Subjective Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.1.2 Objective Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.2 Wiener Sum-and-Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.2.1 Subjective Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.2.2 Objective Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3 Wiener Filter-and-Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.3.1 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.3.2 Objective Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.4 Multi-Channel Wiener . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.4.1 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.4.2 Objective Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

8 Summary and Conclusions 84


8.1 Directions for Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Bibliography 86

LIST OF TABLES

3.1 Breakdown of the experimental database . . . . . . . . . . . . . . . . . . . . . . . . . 9


3.2 Distortion measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Word error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Distortion measurements for the noisy database . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Word error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.6 Matrix of correlation coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7 Matrix of correlation coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.8 RMS linear fit error for linear predictors of recognition error rate. . . . . . . . . . . . . 25
3.9 RMS fit error for linear predictors of recognition error rate . . . . . . . . . . . . . . . . 26
3.10 RMS errors for linear least squares estimators of recognition error rates . . . . . . . . . 29
3.11 RMS errors for linear least squares estimators of recognition error rates . . . . . . . . . 29

7.1 Recognition performance for the OSNR beamformer . . . . . . . . . . . . . . . . . . . 60


7.2 Summary of the measured average FD, BSD, SSNR and peak SNR . . . . . . . . . . . 61
7.3 Recognition performance for the WSF beamformers . . . . . . . . . . . . . . . . . . . 65
7.4 Recognition performance for the WSFosnr beamformers . . . . . . . . . . . . . . . . . 65
7.5 Summary of the measured average FD, BSD, SSNR and peak SNR . . . . . . . . . . . 65
7.6 Summary of the measured average FD, BSD, SSNR and peak SNR . . . . . . . . . . . 66
7.7 Recognition performance for the WFS beamformer . . . . . . . . . . . . . . . . . . . . 69
7.8 Recognition performance for the WFSosnr beamformer . . . . . . . . . . . . . . . . . . 69
7.9 Distortion values for the WFS beamformer. . . . . . . . . . . . . . . . . . . . . . . . . 69
7.10 Distortion values for the WFSosnr beamformer. . . . . . . . . . . . . . . . . . . . . . . 70
7.11 Recognition performance for the MCW beamformer . . . . . . . . . . . . . . . . . . . 73
7.12 Recognition performance for the MCWosnr beamformer . . . . . . . . . . . . . . . . . 73
7.13 Measured distortion for the MCW beamformer. . . . . . . . . . . . . . . . . . . . . . . 73
7.14 Measured distortion for the MCWosnr beamformer. . . . . . . . . . . . . . . . . . . . . 74

LIST OF FIGURES

2.1 Relationship between Hz and Bark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6


2.2 Equal-loudness curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Spreading function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3.1 Layout of the LEMS microphone-array system . . . . . . . . . . . . . . . . . . . . . . 10


3.2 Data flow for the array recording process. . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Example recorded time sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4 Log-magnitude spectrograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5 Outline of the method for delay steering the array recordings. . . . . . . . . . . . . . . 12
3.6 Measured talker locations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.7 Distortion measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.8 Word recognition error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.9 Layout of the recording room . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.10 PCM sequence of a talker recording with pink noise . . . . . . . . . . . . . . . . . . . 18
3.11 Spectrograms of the noisy recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.12 Aliasing spectral bands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.13 Distortion measurements for the added-noise database. . . . . . . . . . . . . . . . . . . 21
3.14 Word recognition error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.15 Scatter plots comparing the various distortion measures . . . . . . . . . . . . . . . . . . 24
3.16 Scatter plots comparing the various distortion measures . . . . . . . . . . . . . . . . . . 26
3.17 Scatter plot of baseline-HMM error rate versus MAP-HMM error rate . . . . . . . . . . 27
3.18 Scatter plot of baseline-HMM error rate versus MAP-HMM error rate . . . . . . . . . . 28
3.19 Scatter plots of the linear least squares fit . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.20 Scatter plots of the linear least squares fit . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1 Idealized SNR gain as a function of the number of sensors . . . . . . . . . . . . . . . . 31


4.2 A linear microphone array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Simulation showing the improvement in SNR . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 Linear microphone array with interfering noise point-source . . . . . . . . . . . . . . . 34
4.5 SNR improvement for DSBF with point-source noise . . . . . . . . . . . . . . . . . . . 34
4.6 Location of the source, noise, and microphones for the reverberant simulation. . . . . . 37
4.7 The simulated impulse response for the talker received at microphone 1. . . . . . . . . . 37
4.8 Optimal filter for microphone 1 for the 40 microphone case . . . . . . . . . . . . . . . 37
4.9 Signal-to-noise+reverb improvement for a simulated room . . . . . . . . . . . . . . . . 39
4.10 BSD measure for the 4 different beamforming schemes . . . . . . . . . . . . . . . . . . 39
4.11 Signal-to-noise-only ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.12 Signal-to-reverberation ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.1 Flow diagram for a Wiener filter-and-sum beamformer . . . . . . . . . . . . . . . . . . 50


5.2 The attenuation of Φm(ω) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.3 The attenuation of Φm(ω) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


5.4 Ad hoc methods of warping the filter gains . . . . . . . . . . . . . . . . . . . . . . . . 52

6.1 Average BSD, SSNR and peak SNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Average BSD, peak SNR and SSNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.3 Average BSD, peak SNR and SSNR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

7.1 The structure of the DSBF with optimal-SNR based channel filtering (OSNR). . . . . . 59
7.2 Narrowband spectrogram of a noisy utterance processed with OSNR. . . . . . . . . . . 61
7.3 The structure of the DSBF with Wiener post-filtering . . . . . . . . . . . . . . . . . . . 63
7.4 Narrowband spectrograms for the WSF beamformer. . . . . . . . . . . . . . . . . . . . 64
7.5 The structure of the Wiener filter-and-sum (WFS) beamformer. . . . . . . . . . . . . . 67
7.6 Narrowband spectrograms from the WFS beamformer. . . . . . . . . . . . . . . . . . . 68
7.7 Diagram of the optimal multi-channel Wiener (MCW) beamformer. . . . . . . . . . . . 71
7.8 Narrowband spectrograms from the MCW beamformer. . . . . . . . . . . . . . . . . . 72
7.9 Summary of word error rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.10 Word error rates with OSNR input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.11 The best performing filtering schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.12 Summary of FD and BSD values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.13 Summary of SSNR and SNR values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.14 Scatter plots of error rate and distortion measures. . . . . . . . . . . . . . . . . . . . . 80
7.15 Scatter plots comparing correlation of distortion measures . . . . . . . . . . . . . . . . 81
7.16 Scatter plot of distortion measurements by algorithm type. . . . . . . . . . . . . . . . . 82
7.17 Scatter plots of OSNR distortion measurements . . . . . . . . . . . . . . . . . . . . . . 83

CHAPTER 1:
INTRODUCTION

Microphone arrays are becoming an increasingly popular tool for speech capture[1, 2, 3, 4, 5, 6] and may
soon render the traditional desk-top or headset microphone obsolete. Unlike conventional directional
microphones, microphone arrays are electronically steerable, which gives them the ability to acquire a
high-quality signal (or signals) from a desired direction (or directions) while attenuating off-axis noise or
interference. Because the steering is implemented in software and not by a physical realignment of
sensors, moving targets can be tracked anywhere within the receptive area of the microphone array, and the
number of simultaneously active targets is limited only by the available processing power. The
applications for microphone-array speech interfaces include telephony and teleconferencing in home,
office[7, 4, 8] and car environments[3, 9], speech recognition and automatic dictation[6, 10, 11, 12], and
acoustic surveillance[13, 14], to name a few. To realize the promise of unobtrusive hands-free speech
interfaces that microphone arrays offer, they must perform effectively and robustly in a wide variety of
challenging environments.
Microphone-array systems face several sources of signal degradation:

- additive noise uncorrelated with the speech signal (background noise),
- convolutional distortion of the speech signal (reverberation)¹,
- additive noise correlated with the speech signal (correlated noise).

In speech-acquisition applications the primary source of interference will vary. In a teleconferencing or
speaker-phone application there may be interfering speech present in addition to background noise and
reverberation. In a car-phone application the primary interference will be non-speech correlated and
uncorrelated noise. In an auditorium or concert hall reverberation may be the primary source of
interference.

1.1 Methods for Acquiring Speech With Microphone Arrays


Beamforming
Delay-and-sum beamforming (DSBF) is the mainstay of sensor-array signal processing[15].
Delay-and-sum beamforming is relatively simple to implement and suppresses correlated as well as
uncorrelated noise (although uncorrelated noise is suppressed most consistently)[16, 17]. It requires no
signal model or measurement of signal statistics to implement. Delay-and-sum beamforming also has the
desirable property that it introduces no nonlinear distortion products into the desired signal. The only
added signal distortion from the beamforming process is a possible linear distortion of the frequency
response due to the variation of the beamwidth over frequency. Unfortunately the idealized SNR gain of a
delay-and-sum beamformer is only 10 log10(M) dB, where M is the number of microphones in the array. To
achieve a significant improvement in SNR (say 30 dB) with simple delay-and-sum beamforming therefore
requires an impractical number of microphones (M = 1000 for a 30 dB gain), even under idealized noise
conditions. This built-in limitation of DSBF motivates research into supplemental processing techniques to
improve microphone-array performance.

¹ Which contains both correlated and uncorrelated components.
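For concreteness, a minimal delay-and-sum sketch follows (NumPy assumed); the integer steering delays are assumed to have been estimated elsewhere, for example by the time-delay estimation described in Chapter 3, and the function interface is illustrative only:

```python
# A minimal delay-and-sum beamformer sketch, not the exact system used
# in this work.
import numpy as np

def delay_and_sum(channels: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """channels: (M, N) array of M microphone signals; delays in samples."""
    M, N = channels.shape
    out = np.zeros(N)
    for m in range(M):
        d = int(delays[m])
        shifted = np.roll(channels[m], -d)  # advance by the steering delay
        if d > 0:
            shifted[-d:] = 0.0              # zero the wrapped-around samples
        elif d < 0:
            shifted[:-d] = 0.0
        out += shifted                      # coherent sum over channels
    return out

# Idealized gain: 10*log10(M) dB, so M = 16 gives ~12 dB and a 30 dB
# gain would require M = 1000 microphones.
```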

Inversion Techniques
Multiple input/output inverse theorem (MINT) techniques are aimed at inverting the impulse response of
the room which is assumed to be known a priori and thereby eliminating the effects of
reverberation[18, 19]. Although a room impulse response is not certain to be invertible[20, 21], under
certain constraints on the nature of the individual transfer functions between the source and each
microphone, a perfect inverse system can be realized for the multiple-input/single-output system[19].
Although this is an effective method for reducing the effects of reverberation it requires an accurate
estimate of the transfer function between the source and each microphone, a measurement that can be
quite difficult to make in practice. A simultaneous recording and playback system is required to accurately
measure the transfer function between the sound source and a microphone, and a separate transfer function
must be measured for every point in the room from which a talker may speak. The room transfer function
will change as the room configuration changes, whether due to rearrangement of furniture or the number
of occupants. To further complicate the task of measuring the room transfer functions, significant
variations in impulse response occur over time as temperature and humidity, and therefore the speed of
propagation of sound, vary[22] and impulse response inversion techniques may be very sensitive to the
accuracy of the measured impulse response[21].
Matched-filtering compensates for the effects of reverberation by whitening the received signal with the
time-reverse of the estimated impulse response [23]. Matched-filtering relies upon the generally
uncorrelated nature of the channel impulse response to strengthen the main impulse while spreading
energy away from the central peak. As such, it performs best when significant reverberant energy exists.
Although a sub-optimal inversion technique, matched-filtering does not require the inverse of the channel
impulse response. Like MINT techniques, matched-filtering requires prior knowledge of the impulse
response.
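As a sketch, matched filtering of a single channel amounts to convolving the received signal with the time-reversed impulse response; the fragment below assumes the channel impulse response h is known, which, as noted above, is the difficult part in practice:

```python
# A one-line matched-filtering sketch under the assumption that the
# channel impulse response h has been measured.
import numpy as np

def matched_filter(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Filter the received signal with the time-reverse of the impulse response."""
    # Energy from the reverberant tail concentrates near lag len(h) - 1.
    return np.convolve(x, h[::-1], mode="full")
```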

Adaptive Beamformers
Traditional adaptive beamformers[24, 25, 26, 15] optimize a set of channel filters under some set of
constraints. A typical optimization is for minimal output power subject to flat overall frequency response.
These techniques do well in narrowband, far-field applications and where the signal of interest has
generally stationary statistics but are not as well suited for use in speech applications where:

- the signal of interest has a wide bandwidth,
- the signal of interest is non-stationary,
- interfering signals also have a wide bandwidth,
- interfering signals may be spatially distributed,
- interfering signals may be non-stationary.

Adaptive noise cancellation (ANC)[27] techniques exploit knowledge of the interference signal to cancel it
out of the desired input signal. In some ANC applications the interference signal can be obtained
independently (perhaps with a sensor located close to the interferer but far from the desired source). The
remaining problem is then to eliminate the interfering signal from a mix of desired and interfering signals
typically by matching the delay and gain of the measured noise and subtracting it from the beamformer
output. A particular sort of adaptive array employing ANC has been used with some success in
microphone-array applications [28, 29]. The generalized sidelobe canceler (GSC) uses an adaptive array
structure to measure a noise-only signal which is then canceled from the beamformer output. Obtaining a
noise measurement that is free from signal leakage, especially in reverberant environments, is generally
where the difficulty lies in implementing a robust and effective GSC.
Another variant of adaptive beamformers that has found some application in microphone arrays is the
superdirective array[30, 31, 32]. Superdirective array structures generally exhibit greater off-axis noise
suppression than delay-and-sum arrays of similar aperture size, but at the expense of being geometry
dependent (endfire arrays are a common superdirective configuration), which restricts the effective steering
range.

Noise Reduction Filtering


Single-channel noise reduction techniques are also applicable to microphone-array processing, either
before or after any summing operation. A widely known technique for single-channel noise reduction is
spectral subtraction[33] in which the magnitude of the noise bias is estimated and subtracted in the
short-time frequency (Fourier) domain to obtain a noise-suppressed short-term signal magnitude spectrum
estimate. Variants on the spectral-subtraction idea include using the spectral estimate in a Wiener filter[34]
and other methods aimed at tuning the degree of noise suppression with varying SNR to best preserve
subjective speech quality[35, 36, 37]. These methods are all similar in the sense that they generally involve
processing the short-term magnitude spectrum to generate a noise-reduced signal spectrum estimate which
is then used to generate a filter through which the signal is processed. Wiener filtering techniques, which
fall in this general family of algorithms, have been applied to microphone-array applications, typically as a
post-processing or post-filtering step[38, 39, 40, 41].
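A minimal magnitude-domain spectral-subtraction sketch in the spirit of [33] is shown below; the frame size, the noise estimate taken from leading noise-only frames, and the spectral floor are illustrative assumptions, not parameters drawn from this work:

```python
# A sketch of magnitude spectral subtraction with overlap-add resynthesis.
import numpy as np

def spectral_subtract(x, frame=512, hop=256, noise_frames=10):
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i*hop : i*hop+frame] * win for i in range(n_frames)])
    X = np.fft.rfft(frames, axis=1)
    noise_mag = np.abs(X[:noise_frames]).mean(axis=0)          # noise bias estimate
    mag = np.maximum(np.abs(X) - noise_mag, 0.05 * np.abs(X))  # subtract, then floor
    Y = mag * np.exp(1j * np.angle(X))                         # keep the noisy phase
    y = np.zeros(len(x))                                       # overlap-add resynthesis
    for i, f in enumerate(np.fft.irfft(Y, n=frame, axis=1)):
        y[i*hop : i*hop+frame] += f
    return y
```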

1.2 The Scope of This Work


All the algorithms discussed in the preceding section have their advantages and disadvantages.
Beamforming in particular has the advantage that it is relatively simple to implement and requires no prior
knowledge of the signal or the environment to be effective. Noise reduction filtering strategies share a
similar advantage in that they generally have low computational complexity and the signal and noise
parameters required for their implementation can be estimated fairly readily and robustly from the
received signals. In the following chapters the performance of a delay-and-sum beamformer (DSBF) will
be evaluated and extensions to the DSBF developed to incorporate noise reduction filtering in a novel way.
Chapter 2 will introduce a set of objective measures to be used in evaluating the effectiveness of a speech
enhancement technique. In Chapter 3 a set of microphone-array recordings will be introduced and baseline
measurements on a delay-and-sum beamformer presented, including the performance of a speech
recognition system. In Chapter 4 an optimal microphone weighting will be derived and the performance of
a weighted beamformer in a reverberant simulation analyzed. In Chapter 5 the derivation of MMSE filters
(Wiener filters) for signal enhancement will be described and a novel multi-input extension to the
single-channel method will be derived. Methods for noise and signal power estimation will be presented in
Chapter 6 and then applied in Chapter 7 where the results of implementations of the optimal weighting and
filtering strategies introduced in Chapters 4 and 5 will be presented and contrasted with the performance of
the baseline delay-and-sum beamformer as well as a standard Wiener post-filter algorithm.
CHAPTER 2:
EVALUATING SPEECH ENHANCEMENT

When evaluating a speech or signal enhancement technique, the intended application influences the choice
of benchmark. If the application is speech recognition then clearly recognition performance is the
objective measure of interest. If the microphone-array system is acquiring speech for a teleconferencing
task the intelligibility and subjective quality of the speech as perceived by human listeners is the important
measure. In this chapter some signal quality measures will be described. These measures will be used in
later chapters to evaluate the performance of proposed speech-enhancement algorithms.

2.1 Listening Scores


Listening scores are commonly used for evaluating speech coders[42, 43, 44, 45, 46] and are typically
broken down into intelligibility tests which measure the intelligibility of distorted or processed speech and
quality or acceptability tests which measure the subjective quality of distorted or processed speech.
Speech intelligibility tests include the Diagnostic Rhyme Test (DRT), Modified Rhyme Test (MRT) and
Phonetically Balanced Word Lists (PB) tests. Speech quality tests include the Diagnostic Acceptability
Measure (DAM), Mean Opinion Score (MOS) and Degradation Mean Opinion Score (DMOS) tests.
ANSI standards exist for DRT, MRT, and PB tests. Although these and similar tests are widely considered
useful for tasks such as comparing vocoder standards [47], it is impractical to do a new set of listening
tests every time a parameter is changed or an algorithm is updated. Quality tests in particular require
professional evaluation by expert listeners¹ and intelligibility tests generally require a specialized
vocabulary for the test speech. For this reason easily computed objective measures, derived directly from
the output waveform, that accurately reflect speech quality are a very valuable commodity.

¹ Dynastat, Inc. is one company that performs these services.

2.2 Objective Measures


A large study of correlations between objective measures of speech quality and the results of listening tests
is described in [48]. Recent publications cite the high correlation of Bark Spectral Distortion (BSD) with
MOS scores[49, 50]. A modified BSD that employs an explicitly perceptual masking model has exhibited
even better correlation with MOS scores[51, 52].

2.2.1 Signal-to-Noise Ratio (SNR)


Signal-to-noise ratio (SNR) is a ubiquitous measure of signal quality.

$$\mathrm{SNR}(x,y) = 10\log_{10}\frac{\sum_{n=1}^{N} x(n)^2}{\sum_{n=1}^{N}\bigl(x(n)-y(n)\bigr)^2} \qquad (2.1)$$

where x(n) is an undistorted reference signal and y(n) is the distorted (for instance with additive noise) test
signal. In some cases the signal and noise may not be available separately to form the ratio in Equation
(2.1). In such situations an alternative is to estimate the peak SNR directly from the measured signal. If the
recording under analysis has regions with only noise and no signal (recordings of speech almost always
do) and the noise is assumed to be stationary, these regions can be used to estimate the noise power.
Correspondingly a region where the signal is active can be used to estimate the signal+noise power. The
peak SNR can be formed from the ratio of these measured powers:

$$\mathrm{peak\ SNR}(x,y) = 10\log_{10}\frac{\sum_{k=1}^{K} y_{n_{\max}}(k)^2 - \sum_{k=1}^{K} y_{n_{\min}}(k)^2}{\sum_{k=1}^{K} y_{n_{\min}}(k)^2} \qquad (2.2)$$

where the signal is broken down into frames of K samples and y_n(k) indicates the kth sample of the nth
analysis frame of the observed signal y. The power in each frame is measured and n_max indicates the
analysis frame with the highest power and n_min denotes the analysis frame with the lowest power in the test
signal. The difference in the numerator of Equation (2.2) is based on the assumption that the noise and
signal are statistically independent, which implies that E[(s + n)²] = E[s²] + E[n²]. In this way an
estimate of the peak SNR is made without access to the clean reference signal as in Equation (2.1).
SNR is attractive for several reasons:

- It is simple to compute, using Equation (2.1) if the reference signal and noise may be separated, or
  using Equation (2.2) if an estimate must be made only from an observed signal.
- It is intuitively appealing, especially to electrical or communications engineers who are accustomed
  to the idea that improved SNR indicates improved information transfer.

Unfortunately SNR correlates poorly with subjective speech quality[53, 48]. SNR as written in Equation
(2.1) is sensitive to phase shift (delay), bias and scaling, which are often perceptually insignificant.
Meanwhile, the peak SNR as measured in Equation (2.2) has no clean signal reference to work from and is
really measuring the dynamic range rather than the distortion in the recording under test. As such it is
possible for the peak SNR to improve even as the signal is becoming more distorted. SNR and peak SNR
are still useful measurements, but care must be taken to interpret them in the proper context.
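For concreteness, both quantities can be computed directly from their definitions; the sketch below implements Equations (2.1) and (2.2), with the frame length K an assumed value:

```python
# Direct implementations of Equations (2.1) and (2.2); x is the aligned
# clean reference and y the observed test signal.
import numpy as np

def snr(x: np.ndarray, y: np.ndarray) -> float:
    """Equation (2.1): reference-based SNR in dB."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - y) ** 2))

def peak_snr(y: np.ndarray, K: int = 512) -> float:
    """Equation (2.2): peak SNR estimated from the observed signal alone."""
    powers = np.array([np.sum(y[n*K:(n+1)*K] ** 2) for n in range(len(y) // K)])
    p_max, p_min = powers.max(), powers.min()  # signal+noise and noise-only frames
    return 10.0 * np.log10((p_max - p_min) / p_min)
```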

2.2.2 Segmental Signal-to-Noise Ratio (SSNR)


The segmental signal-to-noise ratio (SSNR) has been determined to be a better estimator of subjective
speech quality[53, 48]:


$$\mathrm{SSNR}(x,y) = \frac{1}{N}\sum_{n=1}^{N} 10\log_{10}\frac{\sum_{k=1}^{K} x_n(k)^2}{\sum_{k=1}^{K}\bigl(x_n(k)-y_n(k)\bigr)^2} \qquad (2.3)$$

where x_n(k) denotes the kth sample of the nth frame of the reference signal x, and y_n(k) the
corresponding frame and sample of the distorted test signal. Because the ratio is evaluated on individual
frames, loud and soft portions contribute equally to the measure. Speech detection is desirable to prevent
silence frames from unduly biasing the average with extremely low segment SNRs[53, 48]. Also,
long-term frequency response adjustment of the test signal may be desirable to avoid biasing the error with
frequency response effects that could be easily compensated for.
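A direct implementation of Equation (2.3) follows; the simple energy-threshold speech detector is an illustrative stand-in for the detection step just described, and the frame length and threshold are assumed values:

```python
# A sketch of segmental SNR (Equation (2.3)) over speech-active frames.
import numpy as np

def ssnr(x: np.ndarray, y: np.ndarray, K: int = 512, thresh_db: float = -40.0) -> float:
    n_frames = len(x) // K
    powers = [np.sum(x[n*K:(n+1)*K] ** 2) for n in range(n_frames)]
    peak = max(powers)
    ratios = []
    for n in range(n_frames):
        if 10.0 * np.log10(powers[n] / peak + 1e-12) < thresh_db:
            continue                                  # skip silence frames
        xf, yf = x[n*K:(n+1)*K], y[n*K:(n+1)*K]
        ratios.append(10.0 * np.log10(powers[n] / np.sum((xf - yf) ** 2)))
    return float(np.mean(ratios))                     # average per-frame ratios
```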

2.2.3 Bark Spectral Distortion (BSD)


The Bark frequency scale is based upon a variety of psycho-acoustic experiments that investigate
relationships between the bandwidth of an acoustic stimulus, its perceived loudness, or its ability to mask
tone stimuli[54]. These and other experiments form the basis for the concept of a critical bandwidth, which
corresponds in a rough sense to the bandwidth resolution of human hearing. The Bark frequency scale is
normalized by critical bandwidth; at any frequency a unit of 1 Bark corresponds to 1 critical bandwidth.
The transformation from linear frequency in Hertz, f , to Bark frequency, z, is commonly approximated by
the following relationship[54], which is plotted in Figure 2.1:
$$z = 13\arctan(0.00076\,f) + 3.5\arctan\!\left(\left(\frac{f}{7500}\right)^{2}\right) \qquad (2.4)$$

Figure 2.1: Relationship between Hz and Bark. The dashed line is from Equation (2.4) and the marks
indicate the band edges derived from the psycho-acoustical experiments undertaken by Zwicker[54].

Figure 2.2: Equal-loudness curves from Robinson[55]. Each line is at the constant phon value indicated on
the plot. The phon and intensity (dB) values are equal at 1 kHz by definition.

In addition to frequency warping, the Bark spectrum incorporates spreading, preemphasis, and loudness
scaling to model the perceived excitation of the auditory nerve along the basilar membrane in the ear. In
this work the Bark spectrum, Lx(z), of a discrete time signal, x(k), is formed by the following steps:

1. Take a windowed time segment: x_w(k) = w(k) x(k).
   A 32 ms (512 points at 16 kHz) Hanning window is used to form a tapered time segment.

2. Compute the PSD: X(l) = |F{x_w(k)}|².
   The magnitude squared of a 1024-point DFT is used.

3. Apply preemphasis: X_e(l) = W(l) X(l).
   The equal-loudness curve[55] for 70 dB loudness is used to form a preemphasis filter[50]. See
   Figure 2.2. This equalizes the perceptual contribution of energy at different frequencies.
Figure 2.3: Spreading function from Wang[50].

4. Warp to Bark frequency: X(z) = warp{X_e(l)}.
   The linear frequency spectrum of the DFT is warped onto the constant-rate-bandwidth Bark scale.
   See Figure 2.1.

5. Apply spreading function: X_s(z) = SF(z) ∗ X(z).
   Excitation spreading is approximated by convolution with a spreading function[50] pictured in
   Figure 2.3. This is a first approximation to account for the effects of simultaneous masking.

6. Convert to loudness in phons: P_x(z) = 10 log10 X_s(z).
   The power excitation is converted to dB.

7. Convert to loudness in sones: L_x(z) = 2^((P_x(z) − 40)/10).
   The loudness in dB is warped onto the (approximate) sone scale[50], where a doubling in perceived
   loudness has a constant distance. Each 10 dB corresponds approximately to a doubling in perceived
   loudness for phons above 40 dB. In the absence of an absolute loudness calibration the 40 dB offset
   term in this expression is not especially meaningful.

The BSD measure itself[51] is computed by taking the mean squared difference between reference and test
Bark spectra and then normalizing by the mean Bark spectral power of the reference signal:

$$\mathrm{BSD}(x,y) = \frac{\frac{1}{N}\sum_{n=1}^{N}\sum_{z=1}^{Z}\bigl(L_x^n(z) - L_y^n(z)\bigr)^2}{\frac{1}{N}\sum_{n=1}^{N}\sum_{z=1}^{Z}\bigl(L_x^n(z)\bigr)^2} \qquad (2.5)$$

where L_x^n(z) and L_y^n(z) are the discrete Bark power-spectra in dB of the reference and test signals for
time frame index n and Bark frequency index z. Speech detection is performed by an energy thresholding
operation on the reference signal so that the distortion measure averages the distortion only over speech
segments. Also the test signal is filtered before computing the Bark spectrum to equalize the average
spectral power in 6 octave-spaced frequency bands to prevent the measure from being overly sensitive to
the long term spectral density of the test signal.
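The pipeline above can be condensed into a short sketch; the code below implements Equations (2.4) and (2.5), with the preemphasis (step 3) and spreading (step 5) stages stubbed out for brevity and integer Bark binning standing in for the warping of step 4:

```python
# A compressed sketch of the Bark-spectrum steps 1-7 and Equation (2.5);
# illustrative, not the exact implementation used in this work.
import numpy as np

def hz_to_bark(f):
    """Equation (2.4)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_spectrum(frame, fs=16000, nfft=1024):
    win = np.hanning(len(frame))                       # step 1: 32 ms Hanning window
    X = np.abs(np.fft.rfft(frame * win, nfft)) ** 2    # step 2: PSD
    # step 3: preemphasis by the 70 dB equal-loudness curve (omitted here)
    bins = np.floor(hz_to_bark(np.fft.rfftfreq(nfft, 1.0 / fs))).astype(int)
    Xz = np.bincount(bins, weights=X)                  # step 4: warp to Bark bands
    # step 5: convolution with the spreading function (omitted here)
    P = 10.0 * np.log10(Xz + 1e-12)                    # step 6: phons (dB)
    return 2.0 ** ((P - 40.0) / 10.0)                  # step 7: sones

def bsd(ref_frames, test_frames):
    """Equation (2.5) over speech-active frames of reference and test signals."""
    Lx = np.stack([bark_spectrum(f) for f in ref_frames])
    Ly = np.stack([bark_spectrum(f) for f in test_frames])
    return np.sum((Lx - Ly) ** 2) / np.sum(Lx ** 2)
```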
The Modified Bark Spectral Distortion (MBSD)[51] measure incorporates an explicit model of
simultaneous masking to determine if distortion at a particular frequency is audible. If the error between
test and reference signals at a particular frequency falls below the masking threshold, then that error is not
included in the error sum (since it is deemed to be inaudible). Accurate computation of the masking
threshold is a very involved process. Yang[51] cites a simplified method for determining the overall
masking threshold given by Johnston[56]. Even the simplified method is quite involved, so masking is
omitted from the BSD computations used herein.
2.3 Speech Recognition Performance
The use of speech-recognition accuracy as a means of evaluating a speech enhancement method has some
advantages. Certainly if the goal is to achieve robust hands-free use of a speech recognition system,
recognition accuracy is the measurement of interest. Unfortunately the evaluation of the performance of a
speech recognition system is not without drawbacks. If the recognition system being used is sensitive to
the acoustic environment in which it is used (and all are), it may require some form of retraining or
adaptation, and this retraining may require significant training data.

2.3.1 Feature Distortion


In lieu of retraining a speech recognition system, an obvious metric (when a reference signal is available)
is the difference between the features of the processed speech and the features of the reference speech. The
LEMS speech recognizer uses a feature vector made up of the real Mel-warped cepstra and its time
derivative and the energy of the speech signal and its time derivative[57, 58, 59]. Specifically, the LEMS
speech recognizer models the speech signal observations with Gaussian distributions of 3 ubiquitous[60]
feature vectors for each analysis frame: Mel cepstral values 1 to 12 (12 features), time derivatives of the
Mel cepstral values 1 to 12 (12 features) and the signal energy and time derivative of the energy (2
features). The speech signals are sampled at 16kHz. The features are evaluated with a 640 point (40ms)
Hamming window with a 160 point (10ms) frame shift. The energy is gain normalized for each utterance.
The mean cepstral vector for the test utterance is subtracted from the vectors for that utterance and each
cepstra is normalized by the standard deviation of the cepstra at that quefrency measured over the
utterance.
To measure the feature distortion (FD) the test features are subtracted from the reference features, squared,
then normalized within each of the 3 sub-vectors by the squared sum of the reference features in that
sub-vector. The 3 mean distortion values are then averaged together giving an equal weight to the
distortion in each sub-vector:
$$\mathrm{FD}(R,T) = \frac{1}{3}\sum_{m=1}^{3}\frac{\sum_{l=1}^{N}\sum_{k=1}^{K_m}\bigl(R_k^m(l) - T_k^m(l)\bigr)^2}{\sum_{l=1}^{N}\sum_{k=1}^{K_m} R_k^m(l)^2} \qquad (2.6)$$

where FD(R, T) denotes the feature distortion between the reference signal, R, and the signal being tested,
T. R_k^m(l) denotes the kth feature of the mth sub-vector for analysis frame l for the features derived from
the reference signal, and T_k^m(l) denotes the corresponding feature value for the test signal.


Although cepstral distance is used in a variety of speech processing applications including speech quality
assessment[61, 62], the measure presented above is referred to instead as the feature distortion because it
operates directly on the features used by the LEMS speech recognizer. Mel-cepstral distortion is similar in
nature to Bark spectral distortion as both measures are derived from a warped and smoothed (or liftered)
log spectral representation.
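Given feature matrices already aligned and split into the three sub-vectors, Equation (2.6) reduces to a few lines; the sketch below assumes the per-frame features have been computed as described above:

```python
# A direct implementation of Equation (2.6).
import numpy as np

def feature_distortion(R_subs, T_subs):
    """R_subs, T_subs: lists of 3 arrays, each of shape (N frames, K_m features),
    holding the cepstra, delta-cepstra, and energy/delta-energy sub-vectors."""
    total = 0.0
    for R, T in zip(R_subs, T_subs):
        # Normalize each sub-vector's squared error by its reference energy
        total += np.sum((R - T) ** 2) / np.sum(R ** 2)
    return total / 3.0   # equal weight to each sub-vector
```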

2.4 Summary
In this chapter several objective measures of speech quality were introduced, varying from the traditional
signal-to-noise ratio measure to a perceptually motivated Bark spectral distortion measure, and the
speech-recognition targeted feature distortion measure. In the following chapters the measures presented
here will be used to evaluate the performance of proposed microphone-array processing techniques. By
using a set of measures with different underlying principles a multifaceted view of the performance of the
algorithms to follow will be possible.
CHAPTER 3:
SPEECH RECOGNITION WITH MICROPHONE ARRAYS

In this chapter tests on the performance of a 16-element microphone array will be presented. A database of
multichannel recordings was collected and speech recognition tests performed on delay-and-sum
beamformer configurations using from 1 to 16 microphones. The methods used to record and process the
multichannel recordings will be described and the performance of each microphone-array configuration
will be presented. Each tested configuration will be evaluated with the measures described in Chapter 2:
feature distortion (FD), Bark spectral distortion (BSD), segmental signal-to-noise ratio (SSNR) and peak
SNR. Also, each array configuration will be used as a front end to the LEMS alphadigit speech
recognition system and its performance assessed in that role. Finally, the significance of and relationships
between the various performance measures will be discussed.

3.1 Experimental Database


A microphone-array speech database was collected from 22 talkers of American English. The vocabulary
comprises the American English alphabet (A–Z), the digits (0–9), “space” and “period”. The typical
utterance contains approximately 12 vocabulary items and is approximately 4 seconds long. Each talker
contributed the same number of utterances. Table 3.1 shows the data sets broken down by gender for each
of the training and test sets. The training set is used to retrain the recognizer for the novel acoustic
environment (see Section 3.1.3).

3.1.1 Data Acquisition


The microphone-array environment used in this experiment is depicted in Figure 3.1. It consists of
16 pressure gradient microphones, 8 on each of two orthogonal walls of a 3.5 × 4.8 m enclosure,
horizontally placed at a height of 1.6 m. Within each 8-microphone sub-array the microphones are
uniformly spaced at 16.5 cm intervals. The microphone array is in a partially walled-off area of a 7 × 8 m
acoustically-untreated workstation lab. Approximately 70% of the surface area of the enclosure walls is
covered with 7.5 cm acoustic foam, the 3 m ceiling is painted concrete, and the floor is carpeted. The
reverberation time within the enclosure is approximately 200 ms.
The utterances were recorded with the talker standing approximately 2m away from each of the
microphone sub-arrays. The microphone-array recording was performed with a custom-built 12-bit
20 kHz multichannel acquisition system[63]. The 20 kHz datastream was resampled to match the 16 kHz
sampling rate used by the recognition system. During recording, the talker wore a close-talking headset

data set    female   male   # utterances   # words
training       5       6        436          4415
testing        5       6        438          4497

Table 3.1: Breakdown of the experimental database by the number of talkers of each gender, the number of
utterances and the number of words in each of the training and test sets.

Figure 3.1: Layout of the LEMS microphone-array system using 16 pressure gradient microphones (units:
cm).

Figure 3.2: Data flow for the array recording process.

microphone. This is the same microphone used to collect the high-quality speech data for training the
baseline HMM system (see Section 3.1.3). Using the analog-to-digital conversion unit of a Sparc10
workstation, the signal from the close-talking microphone was digitized to 16 bits at 16 kHz
simultaneously with the 16 remote microphones in the array system. Both the close-talking and the array
recordings were segmented by hand to remove leading and trailing silence and then the close-talking
recording was time-aligned to the first channel of the multi-channel recordings. See Figure 3.2.
Figures 3.3 and 3.4 show data from an example recording from the recognition database. Figure 3.3 shows
the time-sequences for a single utterance recorded from the close-talking microphone (a), a single
microphone in the array (b) and the output of the 16 channel DSBF (c). Figure 3.4 shows the
corresponding spectrograms for the sequences shown in Figure 3.3. The noise-suppressing effect of the
beamformer is evident in both Figures 3.3 and 3.4. Also evident is that the output of the beamformer,
though greatly improved from the single microphone, is quite a bit more noisy than the recording from the
close-talking microphone.

3.1.2 Beamforming
After recording and preliminary segmentation and alignment, the channels of every multichannel data file
were time-aligned with the reference close-talking microphone recording. Figure 3.5 shows an outline of
the process. The close-talking microphone recording was used as a reference to ensure the best possible
time alignment. This is critical when computing distortion measurements (BSD, SSNR, etc.) that assume
the test and reference signals are precisely aligned. The time-alignment was achieved by using an
Figure 3.3: Example recorded time sequences: the recording from the close-talking microphone (a), channel
1 from the microphone array (b) and the output of the beamformer (c). The talker is male. The spoken text
is "GRAPH 600".

implementation of the all-phase transform (PHAT)[13, 64, 65, 66]. The all-phase transform of two
time-domain sequences x(k) and y(k) is given by the inverse DFT of their magnitude-normalized
cross-spectrum:

$$\mathrm{PHAT}(x,y) = F^{-1}\left\{\frac{X(\omega)\,Y^{*}(\omega)}{\bigl|X(\omega)\,Y^{*}(\omega)\bigr|}\right\} \qquad (3.1)$$

where F⁻¹ denotes the inverse Fourier transform and X(ω) and Y(ω) the Fourier transforms of x(k) and
y(k), respectively. A 512 point (32 ms) Hamming window with a 256 point (16 ms) shift was used in
computing the cross-spectra. The cross-spectra were smoothed by averaging 7 adjacent frames¹, then
magnitude normalized, and an inverse Fourier transform applied. The resulting cross-correlation was then
upsampled by a factor of 20² and the lag corresponding to the peak value chosen. Some post-processing
was performed to eliminate spurious estimates and to constrain the delay estimates during non-speech
periods.
¹ Although not entirely still during the recording, the talker movements were generally limited to leaning or shifting, rarely resulting
in a change of more than 0.1 m in location. The resulting time-delay changes were generally small and slowly varying and not adversely
affected by the time-averaging of the cross-spectra.
² This fairly high upsampling factor wasn't chosen because resolution to 1/20 of a sample is necessary, but because the higher
sampling rate makes it more likely that the peak sampled value will correspond with the actual peak value of the underlying waveform.
Figure 3.4: Log-magnitude spectrograms of the example sequences shown in Figure 3.3: the recording
from the close-talking microphone (a), channel 1 from the microphone array (b) and the output of the
beamformer (c). The talker is male. The spoken text is "GRAPH 600". The analysis window is a 512 point
(32 ms) Hamming window with a half-window (16 ms) shift.

Figure 3.5: Outline of the method for delay steering the array recordings.

Each channel was steered using the estimated time delays and a delay-steered version of the multichannel
data file was saved to disk. Figure 3.6(a) shows locations derived from the estimated delays using
a maximum-likelihood (ML) location estimator[67]. Figure 3.6(b) shows the distribution of the x and y
coordinates for ML location estimates of every analysis frame from the entire database. The talker location
estimates were generated for analysis only; the channel delays were not constrained to correspond to any
particular source radiation model during processing.
For the purposes of establishing a baseline for the beamformer performance, no channel weighting or
normalization was performed.
Figure 3.6: Measured talker locations. (a) Scatter plot of locations from a single talker. (b) Distribution of
the measured x and y talker locations taken over the entire database.

The channels were simply delay-steered according to the estimated delays and summed³ in sequential
order (see Figure 3.1 for the microphone numbering).

³ The sum was used rather than the mean since the microphone-array recordings are 12 bit and the recognition system uses 16 bit
PCM input data, so the sum of the 16 12-bit channels will never overflow a 16-bit word. Using the mean of the channels instead of the
sum would involve a requantization step. Although the effect of this requantization is almost certainly unmeasurable by any method
used herein, there was no reason not to preserve the data with maximum precision.
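A minimal sketch of the PHAT-based delay estimation of Equation (3.1) for one pair of frames appears below; the 7-frame cross-spectrum smoothing, 20× upsampling and outlier post-processing described above are omitted, and the function interface is illustrative:

```python
# A sketch of PHAT delay estimation for a single pair of windowed frames.
import numpy as np

def phat_delay(ref_frame: np.ndarray, mic_frame: np.ndarray, nfft: int = 1024) -> int:
    """Estimate the lag (in samples) between a reference and an array channel."""
    win = np.hamming(len(ref_frame))
    X = np.fft.rfft(ref_frame * win, nfft)
    Y = np.fft.rfft(mic_frame * win, nfft)
    C = X * np.conj(Y)
    C /= np.abs(C) + 1e-12          # magnitude normalization: the PHAT weighting
    p = np.fft.irfft(C, nfft)       # generalized cross-correlation
    lag = int(np.argmax(p))
    return lag if lag <= nfft // 2 else lag - nfft   # map wrapped negative lags
```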

3.1.3 Recognizer Training


As mentioned previously, speech recognizers tend to be very sensitive to changes in acoustic environment
or other changes to the dynamic range, noise floor, frequency response, etc., of the data under test. Half of
the data was used to retrain the recognizer to the novel acoustic space for each different beamformer or
microphone.

Incremental MAP Training


It was reported in[68] that substantial speed improvements in HMM training can be obtained using
incremental maximum a posteriori (MAP) estimation. The significance of this approach is that it does not
lose any recognition performance while speeding the convergence. The learning technique presented is a
variation on the recursive Bayes approach for performing sequential estimation of model parameters given
incremental data. Let x₁, …, x_T be i.i.d. observations and θ be a random variable such that f(x_t | θ) is a
likelihood on θ given by x_t. The posterior distribution of θ is

$$f(\theta \mid x_1,\ldots,x_t) \;\propto\; f(x_t \mid \theta)\, f(\theta \mid x_1,\ldots,x_{t-1}) \qquad (3.2)$$

where f(θ | x₁) ∝ f(x₁ | θ) f(θ) and f(θ) is the prior distribution on the parameters. The recursive Bayes
approach results in a sequence of MAP estimations of θ,

$$\hat{\theta}_t = \arg\max_{\theta}\, f(\theta \mid x_1,\ldots,x_t) \qquad (3.3)$$

A corresponding sequence of posterior parameters acts as the memory for previously observed data. If the
likelihood f(x_t | θ) is from the exponential family (that is, a sufficient statistic of fixed dimension exists)
and f(θ) is the conjugate prior, then the posterior f(θ | x₁, …, x_t) is a member of the same distribution as
the prior regardless of sample size t. This implies that the representation of the posterior remains fixed as
additional data is observed.
In the case of missing-data problems (e.g., HMMs), the expectation-maximization (EM) algorithm can be
used to provide an iterative solution for estimation of the MAP parameters[69]. The iterative EM MAP
estimation process can be combined with the recursive Bayes approach. The approach that incorporates
(3.2) and (3.3) with the incremental EM method[70] (that is, randomly selecting a subset of data from the
training set and immediately applying the updated model) is fully described in[68]. Also, Gauvain and Lee
have presented the expressions for computing the posterior distributions and MAP estimates of continuous
observation density HMM (CD-HMM) parameters[71]. Because the posterior is from the same family as
the prior, (3.2) and (3.3) are equivalent to the update expressions in[71] and are not repeated here.

Baseline Model
A baseline talker-independent continuous density hidden Markov model (CD-HMM) is obtained by a
conventional maximum likelihood (ML) training scheme using a training database of high-quality data
acquired with a close-talking headset microphone. The training set contained 3484 utterances
from 80 talkers. The initial parameters of the CD-HMM are derived from a discrete observation hidden
semi-Markov model (HSMM) using a Poisson distribution to model state duration. This model is then
converted to a tied-mixture HSMM by simply replacing each discrete symbol with a multivariate normal
distribution. Normal means and full covariances are estimated from the training data.

Prior Generation
The initial prior distributions are also derived from the training data set used to train the baseline HMM.
The employed prior distributions are the normal-Wishart distribution for the parameters of the normal
distribution and the Dirichlet distribution for the rest of model parameters. The parameters describing the
priors are set such that the mode of the distribution corresponds to the initial CD-HMM. The strength of
the prior (that is, the amount of observed data required for the posterior to significantly differ from the
prior) is set to “moderately” strong belief. A subjective measure of prior strength is used[6, 72, 73] where
a very weak prior is (almost) equivalent to a non-informative prior and a very strong prior (almost)
corresponds to impulses at the initial parameter values⁴.

Model Parameter Adjustment


Starting from the baseline HMM, incremental MAP training is performed to adjust the model parameters
for the novel database. 10 utterances are randomly chosen at each iteration for 100 iterations. Note that the
training data size for the second stage (428 utterances from 11 talkers) is an order of magnitude smaller
than that used for creating the baseline HMM. Gotoh [72, 73] presents extensive information on the effect
of varying the parameters (training set size, prior strength, number of iterations) of the MAP training.
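To make the recursive update concrete, the toy below applies the same incremental MAP recipe to the simplest conjugate pair, a normal prior on a Gaussian mean with known variance; the batch size, iteration count and prior strength mirror the values above, while the data and prior mean are invented for illustration. The full CD-HMM case of [68, 71] updates normal-Wishart and Dirichlet posteriors analogously inside the EM loop:

```python
# A toy illustration of the incremental MAP idea of Equations (3.2)-(3.3);
# not the CD-HMM training procedure itself.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=428)  # stand-in for 428 utterances

prior_mean, prior_strength = 0.0, 0.1            # "moderate" prior strength
post_mean, post_count = prior_mean, prior_strength

for _ in range(100):                             # 100 incremental iterations
    batch = rng.choice(data, size=10)            # 10 randomly chosen "utterances"
    # Conjugate (normal-normal) update: the posterior stays normal, so only
    # its sufficient statistics are carried forward (Equation (3.2)).
    post_mean = (post_count * post_mean + batch.sum()) / (post_count + len(batch))
    post_count += len(batch)

theta_map = post_mean                            # posterior mode, Equation (3.3)
print(f"MAP estimate of the mean: {theta_map:.3f}")
```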

3.1.4 Signal Measurements


Figure 3.7 and Table 3.2 show baseline distortion measurements for a delay-and-sum beamformer with a
variable number of microphones and for each microphone taken individually. It is worth noting that the
best measuring single channel has distortion and SNR comparable to the 2-microphone beamformer. The
distortion and SNR measurements were made on each of the 438 utterances in the recognizer test set and
then averaged to form the ensemble values shown in Figure 3.7. Of the 4 objective measures presented
below only the peak SNR can be measured from the reference close-talking microphone data. The peak
SNR of the recordings made with the close-talking microphone is 43.14dB.

⁴ In [72, 6] Gotoh characterizes the initial prior weights as “weak” or “moderate” or “strong”, often without attributing a specific
value. Presumably the intention was to drive me insane. The prior strength value used herein is 0.1 and corresponds to a “moderate”
prior strength.
Figure 3.7: Distortion measurements. (a) Feature distortion (FD), (b) Bark spectral distortion (BSD), (c)
segmental SNR (SSNR) and (d) peak SNR shown as a function of the number of microphones used in
a delay-and-sum beamformer and for each channel individually. The measurements shown were averaged
over all recorded utterances for the 11 test talkers. One set of markers traces the beamformer, with the x-axis
giving the number of microphones included in the sum; the microphones were added in order, starting with
microphone 1 (see Figure 3.1). The other markers denote the average distortion or SNR value for each
channel taken alone.

The overall improvement in peak SNR from the 1-microphone beamformer to the 16-microphone
beamformer is 9.7dB. As expected, this is somewhat less than the ideal value derived in Chapter 4 due to
the non-ideal noise cancellation and signal reinforcement. Note also that this improvement figure depends
heavily on which microphone is chosen for the 1-microphone beamformer. Microphone 1 has one of
the lowest peak SNRs of any channel and the 9.7dB figure is therefore somewhat generous. Choosing
instead the single microphone with the highest SNR (microphone 12), the total improvement in peak SNR is only
7.65dB.

3.1.5 Recognition Performance


The LEMS speech recognition system was used to evaluate the speech recognition performance of the
beamformer-processed speech. Figure 3.8 and Table 3.3 show the per-microphone and delay-and-sum
beamformer recognition error rates before MAP training (baseline-HMM) and after MAP training
(MAP-HMM), as described in Section 3.1.3.

By number of microphones in DSBF


#mics in DSBF 1 2 3 4 5 6 7 8
FD 0.89 0.81 0.74 0.69 0.66 0.63 0.61 0.60
BSD .096 .087 .079 .074 .070 .068 .067 .066
SSNR (dB) 3.83 4.23 4.55 4.77 5.00 5.14 5.31 5.38
Peak SNR (dB) 18.18 19.84 20.90 21.90 22.93 23.59 24.21 24.53
#mics in DSBF 9 10 11 12 13 14 15 16
FD 0.59 0.57 0.56 0.55 0.54 0.53 0.52 0.51
BSD .063 .061 .059 .058 .056 .055 .055 .054
SSNR (dB) 5.57 5.69 5.83 5.91 5.99 6.07 6.13 6.17
Peak SNR (dB) 24.86 25.23 25.73 26.25 26.74 27.20 27.59 27.86
By individual microphone used
mic # 1 2 3 4 5 6 7 8
FD 0.89 0.90 0.83 0.82 0.82 0.82 0.81 0.87
BSD .096 .094 .085 .090 .086 .090 .094 .100
SSNR (dB) 3.83 3.91 4.26 4.06 4.20 4.14 3.95 3.77
Peak SNR (dB) 18.18 17.87 18.73 18.91 19.41 19.25 18.82 17.58
mic # 9 10 11 12 13 14 15 16
FD 0.89 0.82 0.81 0.80 0.79 0.81 0.84 0.83
BSD .098 .096 .096 .090 .089 .087 .091 .090
SSNR (dB) 3.63 3.73 3.70 3.93 4.16 4.07 3.98 3.91
Peak SNR (dB) 18.39 19.17 19.43 20.21 20.20 20.17 19.25 19.06

Table 3.2: Distortion measurements plotted in Figure 3.7. Distortion is shown as a function of the number
of microphones included in a DSBF and as a function of the single microphone used.

Note that after MAP training the best single-channel
recognition result is marginally better than the performance of the 2-channel beamformer; a testament to
the usefulness of the MAP training. A different choice of microphones to include in the 2-channel
beamformer would change this result, considering that microphone 1 is one of the poorest performing
individual channels. To put these values in perspective, the word error rate for the data collected with the
close-talking microphone is 8.16%. The improvement in MAP-HMM recognition accuracy from the
1-microphone beamformer to the 16-microphone beamformer is 9.38 percentage points (a 44% relative reduction in error).
Comparing the 16-microphone beamformer against the best performing single microphone (12), the error
rate is reduced by 5.45 percentage points (a 31% relative reduction in error). The performance of the array is close enough to the
close-talking microphone performance that a small number of errors represents a large fraction of the gap between
the array performance and the close-talking microphone performance.

3.2 Noisy Database


The experimental database described above contains relatively low levels of noise. The best achievable
speech recognition error rate with the 16-channel beamformer (using MAP training) is only 3.76 percentage points
worse than the MAP-HMM close-talking microphone result. To create a noisier condition with more
dramatically degraded recognition rates, a recording of a noise source was made with the same microphone
array and added into the experimental speech database. Pink noise⁵ [74] was played out through a 4"
diameter speaker and recorded by the same set of microphones previously used to record the talkers. The
noise source was located near and directed towards microphone 1 as indicated in Figure 3.9.

5 Pink noise contains equal power in each octave. This corresponds to a -3dB per octave slope of the power spectral density.

Figure 3.8: Word recognition error rates as a function of the number of microphones used in a delay-and-
sum beamformer and as a function of the single microphone used alone, before and after MAP training.
One pair of curves traces the beamformer with varying numbers of microphones before and after MAP
training; the other markers show the recognition performance of each channel taken individually, before
and after MAP training. The microphones were added in numerical order for the beamformer (see Figure 3.1).
The strong line at the base of the graph corresponds to the error rate for the data acquired with the
close-talking microphone, 8.16%. These values are tabulated in Table 3.3.

By number of microphones in DSBF


# mics in DSBF 1 2 3 4 5 6 7 8
Baseline-HMM 41.49 36.36 32.16 30.09 28.13 26.35 24.66 24.28
MAP-HMM 21.3 18.52 16.43 15.86 15.23 14.63 14.14 14.03
# mics in DSBF 9 10 11 12 13 14 15 16
Baseline-HMM 23.04 23.22 22.19 21.12 20.46 20.81 19.81 19.44
MAP-HMM 13.81 13.52 12.85 12.67 12.28 12.32 11.81 11.92
By individual microphone used
mic # 1 2 3 4 5 6 7 8
Baseline-HMM 41.49 41.76 37.07 37.76 37.74 37.00 38.49 41.98
MAP-HMM 21.30 21.15 19.37 19.01 18.5 18.81 19.03 20.92
mic # 9 10 11 12 13 14 15 16
Baseline-HMM 40.60 37.62 38.65 35.05 35.49 36.16 37.40 38.16
MAP-HMM 20.41 18.72 18.77 17.37 17.37 18.01 17.81 18.83

Table 3.3: Word error rates (%) for the HMM before and after MAP training as a function of the number of
microphones included in the DSBF or the single microphone used. The beamformer values and the
single-microphone values are plotted in Figure 3.8.

Figure 3.9: Layout of the recording room as in Figure 3.1 but showing the position of the interfering noise
source.

Figure 3.10: PCM waveforms of a talker recording with the pink noise recording added. This is the same talker
and utterance used in Figures 3.3 and 3.4. The top plot is channel 16 alone and the bottom plot is the output
of the 16-channel beamformer.

Each channel of the noise recording was added to the corresponding channel of the talker recordings. For
beamforming, the inter-microphone delays estimated from the clean speech were used to steer both the
original speech channels and the added noise. Figures 3.10 and 3.11 show time and spectrogram plots for
channel 16 and the beamformer output for a talker recording with the pink noise recording added.
The discerning reader may notice the pronounced bands of noise visible in Figure 3.11 around 2800 and
5600Hz and assume that these are the result of a resonance in the playback or recording system. In fact these
bands result from the beamforming operation and the spatial aliasing inherent in the geometry of this
particular microphone array. The bands appear in all recordings but their exact location in frequency varies
with each recording as a function of the applied steering delays. To illustrate this, Figure 3.12 shows the
spectrum of the DSBF output with no speech present as the steering location is moved in a spiral of
increasing radius starting at x=2, y=2. The noise bands appear at harmonically spaced intervals which vary
Figure 3.11: Spectrograms of the noisy recordings shown in Figure 3.10. A single channel of a noise-added
recording on top and the output of the 16 channel DSBF on the bottom.
Figure 3.12: Aliasing spectral bands in the response of the beamformer to a stationary noise input as the
beamformer is steered to different locations indicated in the bottom plot. The x-coordinate of the steering
location is indicated with a solid line and the y-coordinate with the dashed line.

smoothly with the steering location. As the beamformer is steered to different locations the noise source
falls in different portions of the beamformer sidelobes and aliasing pattern. Spatial aliasing is a side
effect of inter-microphone spacing larger than half the wavelength of the target signal. Microphone arrays may
be designed with a constant-width main lobe [75, 76, 77], which eliminates the sort of aliasing seen here,
though the tradeoff is that the main lobe is wider throughout most of the bandwidth. The particular array
geometry used here is prone to aliasing; an explicit solution to this issue is outside the scope of this
work, though processing techniques presented in later chapters will significantly reduce this effect.

By number of microphones in DSBF


#mics in DSBF 1 2 3 4 5 6 7 8
FD 1.94 1.77 1.55 1.41 1.33 1.27 1.21 1.17
BSD .221 .218 .185 .161 .150 .142 .131 .124
SSNR (dB) 1.21 1.00 1.67 2.17 2.44 2.72 3.03 3.25
Peak SNR (dB) 0.95 2.46 4.80 6.44 7.75 8.92 10.01 10.86
#mics in DSBF 9 10 11 12 13 14 15 16
FD 1.14 1.11 1.06 1.03 1.02 1.00 0.99 0.98
BSD .118 .114 .108 .104 .101 .099 .097 .096
SSNR (dB) 3.42 3.55 3.75 3.86 3.95 4.03 4.06 4.06
Peak SNR (dB) 11.45 12.04 12.99 13.62 14.02 14.39 14.74 14.90
By individual microphone used
mic # 1 2 3 4 5 6 7 8
FD 1.94 2.03 1.66 1.68 1.68 1.52 1.48 1.40
BSD .221 .248 .185 .191 .195 .170 .169 .166
SSNR (dB) 1.21 0.62 1.81 1.59 1.52 2.17 2.12 2.07
Peak SNR (dB) 0.95 0.46 4.13 4.08 4.80 6.38 7.79 9.45
mic # 9 10 11 12 13 14 15 16
FD 1.49 1.52 1.21 1.34 1.48 1.42 1.36 1.45
BSD .176 .184 .141 .164 .173 .170 .184 .189
SSNR (dB) 1.81 1.76 2.64 2.09 2.06 1.94 1.59 1.51
Peak SNR (dB) 7.29 7.10 11.34 9.39 7.51 8.46 8.76 7.49

Table 3.4: Distortion measurements for the noisy database plotted in Figure 3.13. Distortion as a function
of the number of microphones included in the delay-and-sum beamformer and as a function of each
microphone taken alone using the noisy speech data.

3.2.1 Signal Measurements and Recognition Performance


The measurements and recognition tests made on the basic talker recordings were repeated on the
added-noise database. Figures 3.13 and 3.14, along with Table 3.4, summarize these results. The proximity
of the noise source to microphone 1 is evident in the low recognition rates and high distortion values for
the microphones closest to the noise source. For perspective, note that the best performing single
microphone, 11 (see Figure 3.14), is on a par with the 8-channel beamformer built from the noisier
microphones 1-8. This discrepancy strongly suggests that an appropriately weighted sum of microphones
could outperform the uniformly weighted beamformer presented here.
The overall improvement in peak SNR from the 1-microphone beamformer to the 16-microphone
beamformer is 13.95dB. This is greater than the improvement predicted in Chapter 4 of
10 log10(16) ≈ 12.04dB. In the noisy recordings here, though, the noise in each channel is not equal.
Microphone 1 has the highest level of noise, and using it as the starting point inflates the
apparent improvement. If microphone 11 is used for the single-microphone beamformer, the total SNR
improvement comes out to only 3.6dB.
The improvement in MAP-HMM recognition accuracy from the 1-microphone beamformer to the
16-microphone beamformer is 58.4 percentage points (a 71% relative reduction in error). Comparing the 16-microphone
beamformer against the best performing single microphone (11), the error rate is reduced by 9.02 percentage points (a 27%
relative reduction in error).

Figure 3.13: Distortion measurements for the added-noise database. (a) Feature distortion (FD), (b) Bark
spectral distortion (BSD), (c) segmental SNR (SSNR) and (d) peak SNR shown as a function of the number
of microphones used in a delay-and-sum beamformer and for each channel individually. The measurements
shown were averaged over all recorded utterances with added noise for the 11 test talkers. One set of markers
traces the beamformer, with the x-axis giving the number of microphones included in the sum; the microphones
were added in order, starting with microphone 1 (see Figure 3.9). The other markers denote the average
distortion or SNR value for each channel taken alone. These values are tabulated in Table 3.4.

3.3 Correlation Between Measures


All the graphs in Sections 3.1.4 and 3.1.5 show a generally similar trend as a function of the microphone(s) used.
These trends have been averaged over many presentations and do not show the variance of the measures or
the strength of the correlation between the different distortion and SNR measures. A distortion or SNR
measure that correlates strongly with recognition performance has value as a quick means of estimating
recognition performance. On the other hand, if all the measures are very strongly correlated then it makes
little sense to compute them all, since one would suffice.

3.3.1 Linear Correlation


A linear correlation analysis of the individual measurements summarized in Sections 3.1.4 and 3.1.5 is
presented below. Each test set utterance provides one observation of FD, BSD, SSNR, peak SNR and
Figure 3.14: Word recognition error rates as a function of the number of microphones used in a delay-and-
sum beamformer and as a function of channel, before and after MAP training. One pair of curves traces
the beamformer with varying numbers of microphones before and after MAP training; the other markers
show the recognition performance of each channel taken individually, before and after MAP training.
The microphones were added in numerical order for the beamformer (see Figure 3.9). These values are
tabulated in Table 3.5.

By number of microphones in DSBF


# mics in DSBF 1 2 3 4 5 6 7 8
Baseline-HMM 96.69 92.59 84.32 77.21 71.78 67.62 63.40 60.44
MAP-HMM 81.90 72.09 56.10 47.74 41.45 38.43 35.45 32.49
# mics in DSBF 9 10 11 12 13 14 15 16
Baseline-HMM 58.06 56.39 53.46 50.72 50.72 48.92 47.94 47.59
MAP-HMM 31.22 29.66 27.35 26.11 25.39 24.44 23.50 23.48
By individual microphone used
mic # 1 2 3 4 5 6 7 8
Baseline-HMM 96.69 98.91 86.88 86.55 87.17 79.83 75.87 71.56
MAP-HMM 81.90 86.57 60.91 61.84 60.91 49.59 46.16 41.09
mic # 9 10 11 12 13 14 15 16
Baseline-HMM 76.50 77.41 63.15 68.62 76.65 73.52 72.67 77.61
MAP-HMM 47.37 47.88 32.40 38.43 45.96 42.85 40.80 46.10

Table 3.5: Word error rates (%) for the model before and after MAP training as a function of the number of
microphones included in the delay-and-sum beamformer and as a function of the single microphone
used alone. These values are plotted in Figure 3.14.

recognition performance before and after MAP retraining. The coefficient of correlation is given by [78]:

$$\rho = \frac{\mathrm{Cov}(Y_1, Y_2)}{\sigma_1 \sigma_2} \qquad (3.4)$$

where $\mathrm{Cov}(Y_1, Y_2)$ denotes the covariance of the joint distribution of the two variables under examination
and $\sigma_1$, $\sigma_2$ are the corresponding standard deviations, $\sigma_1 = \left(E\left[(Y_1 - \bar{Y}_1)^2\right]\right)^{1/2}$. The covariance is given by:

$$\mathrm{Cov}(Y_1, Y_2) = E[Y_1 Y_2] - E[Y_1]\,E[Y_2] \qquad (3.5)$$

                  Baseline  MAP       FD      BSD     SSNR    SNR
                  Error %   Error %
Baseline Error %  1.00      0.94      0.95    0.86    -0.75   -0.89
MAP Error %                 1.00      0.89    0.82    -0.69   -0.81
FD                                    1.00    0.94    -0.81   -0.95
BSD                                           1.00    -0.90   -0.94
SSNR                                                  1.00     0.89
SNR                                                            1.00

Table 3.6: Matrix of correlation coefficients for the different types of measurements without per-talker
normalization.
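For concreteness, Equations (3.4) and (3.5) amount to the familiar sample statistics. A minimal numpy illustration follows; the data here are synthetic, not the measurements behind Table 3.6:

```python
import numpy as np

rng = np.random.default_rng(1)
y1 = rng.normal(size=682)             # e.g. per-talker FD values (synthetic)
y2 = 0.8 * y1 + rng.normal(size=682)  # e.g. correlated error rates (synthetic)

cov = np.mean(y1 * y2) - np.mean(y1) * np.mean(y2)   # Equation (3.5)
rho = cov / (np.std(y1) * np.std(y2))                # Equation (3.4)
assert np.isclose(rho, np.corrcoef(y1, y2)[0, 1])    # matches the library routine
print(f"rho = {rho:.3f}")
```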
Table 3.6 shows the inter-measure correlation coefficients for the recognition scores and distortion
measures presented above. For each talker the average value of each measure was computed over the set of
utterances for that talker. For each of the 11 test talkers, 31 values of each measure were taken (15 values
measured from beamformers with 2 to 16 microphones and 16 values measured from
each microphone taken individually), both for the original database and the added-noise database, for a total
of 682 values over which the correlation coefficients were measured. The correlation coefficients are shown
as signed values since the distortion measures and signal-to-noise ratio measures are inversely correlated.
The signs of the measured correlation coefficients are all appropriate for the measures being compared;
measures of goodness correlate positively with other measures of goodness and negatively with measures
of badness. Table 3.6 shows generally very strong correlations. The correlation between FD and
baseline-HMM error rate is the strongest, with generally lower correlations between any other distortion or
SNR measure and either recognition error rate. The strong correlation between FD and baseline-HMM error
rate can be seen clearly in the scatter plots in Figure 3.15(c). Note that FD is (slightly) more strongly
correlated with the baseline-HMM error rate than the MAP-HMM error rate is. Although it is not
surprising that the recognizer performance, in particular before retraining, would be closely related to the
feature distortion, it is somewhat unexpected that the feature distortion would be more closely coupled
than the MAP-HMM error rate. Note also that all distortion and SNR measures are less strongly correlated
with the MAP-HMM error rate than with the baseline-HMM error rate.
The preceding analysis fails to take into account the talker-dependence of the error rates. That is,
speech recognition error rates can vary considerably between talkers, or even between individual utterances from a
single talker. Using the results from the recordings acquired with the close-talking microphone, the
inter-talker standard deviation of the error rates of the 11 talkers in the test set is 5.9%. This is largely due
to a single talker with a very high error rate; this same talker stands out clearly in the scatter plots in
Figure 3.15. Excluding this one outlying talker, the standard deviation of the per-talker error rates is 2.0%.
This phenomenon is not necessarily an issue of sound quality but often one of the manner of pronunciation
or elocution⁶. To eliminate this talker variability the per-talker word error rates were computed and the
mean error rate for each talker was subtracted from that talker's values. The result is a differential error
rate for each talker relative to their mean error rate. The same normalization was performed with the
distortion and SNR measures⁷, rendering them per-talker difference measures as well. Table 3.7 shows the
inter-measure correlation coefficients after this normalization. Figure 3.15 shows scatter plots of the more

6 In particular, the very worst performing talker is a female talker with very high pitch.
7 Although this may not have been strictly necessary, little is gained by maintaining the absolute distortion measurements in the

absence of absolute recognition error rates. That is, all the results were already rendered relative by the per-talker normalization of the
error rates.
Figure 3.15: Scatter plots comparing the various distortion measures with recognition error rates. The black
points correspond to the values with per-talker bias removed and the lighter patches to the values without
any normalization. One point is shown for each of 62 measurements for 11 talkers (682 points in total).

strongly correlated pairs. Most obviously, and not surprisingly, the per-talker bias normalization has
generally increased the correlation values. This effect can be seen in the plots of Figure 3.15 as the spread
of the data away from the primary linear trend is greatly reduced in the talker-normalized data. As in the
unnormalized case, FD is still most strongly correlated with the baseline-HMM error rate, though now
both BSD and peak SNR are much closer than in the unnormalized case.

                  Baseline  MAP       FD      BSD     SSNR    SNR
                  Error %   Error %
Baseline Error %  1.00      0.95      0.98    0.95    -0.90   -0.97
MAP Error %                 1.00      0.96    0.93    -0.85   -0.90
FD                                    1.00    0.97    -0.90   -0.97
BSD                                           1.00    -0.91   -0.94
SSNR                                                  1.00     0.93
SNR                                                            1.00

Table 3.7: Matrix of correlation coefficients for the different measurements with per-talker bias
normalization of each of the 11 × 62 measurements.


                              Baseline  MAP       FD      BSD     SSNR    SNR
                              Error %   Error %
Without Talker Normalization
Baseline Error %              0.00      8.18      7.86    12.40   16.06   10.99
MAP Error %                   6.54      0.00      8.68    11.25   14.12   11.34
With Talker Normalization
Baseline Error %              0.00      7.33      4.27    7.18    10.17   5.66
MAP Error %                   5.84      0.00      5.41    6.87    9.84    8.04

Table 3.8: RMS linear fit error for linear predictors of recognition error rate. Each column corresponds to
the measure used as the predictor and each row corresponds to the target of the predictor. Errors are in
the units of the recognition error rate, % words in error.

3.3.2 Fit Error


While the correlation coefficient is useful for comparing the relative correlation between measurements on
different scales, the RMS linear fit error can be used to put the differences between correlation coefficients
into perspective. The fit error gives an indication of how much error would be expected when using one
measure (or group of measures) to predict another.
Each measure was fit with a linear function of every other measure. The least-squares coefficients of the
linear model were obtained and the RMS error from the linear fit measured. Table 3.9 aside, Table 3.8 shows the RMS
linear fit errors for linear predictors of recognition error rates, both with and without per-talker
normalization of the measures. For each measure, the fit and fit error are computed over 682 total
observations from the 11 test talkers and 62 different single-microphone and beamformer configurations.
The errors shown in Table 3.8 are quite large; too large for practical use as a predictor of recognition rate
on a talker-by-talker basis. The lowest talker-normalized error value, at 4.27%, is comparable to the
difference between the 8- and 16-microphone beamformers in Figure 3.8.
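The fit-error computation itself is a small least-squares exercise. The sketch below uses synthetic data standing in for a predictor (such as FD) and a target error rate; it is a minimal illustration of the procedure, not the dissertation's actual code:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.5, 2.0, size=682)          # e.g. per-talker FD (synthetic)
y = 40.0 * x + rng.normal(0, 5, size=682)    # e.g. error rate, % (synthetic)

# Least-squares line y ~ a*x + b, then the RMS residual is the "fit error".
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
rms_fit_error = np.sqrt(np.mean((y - A @ coef) ** 2))
print(f"RMS linear fit error: {rms_fit_error:.2f} % words in error")
```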
The fit error is also useful for examining the correlation of the overall average measures, grouping all
talkers together as in Figures 3.7 and 3.8. When the overall averages are used there are only 62 values
available for each measurement type, which is not sufficient for a good estimate of the correlation
coefficients⁸. Averaging all the talkers together greatly reduces the variance of the measures, and the linear
fit error provides a way to quantify this effect. The reduced variance of the measures can be seen in
Figure 3.16, which shows scatter plots of the measures most strongly correlated with recognition error rate.
Table 3.9 shows the linear fit errors for the average values of each measurement. It is apparent from
Figure 3.16(c) that the strongest linear correlation is still between FD and baseline-HMM error rate.
The fit errors shown in Table 3.9 bear this out. The SNR scatter plot shows a slight
nonlinear trend. The BSD scatter plot shows a similar linear relationship with baseline-HMM error rate,
but with a greater spread from the linear trend.

8 The self-correlation coefficient of 62 random points is only around 0.98.


Figure 3.16: Scatter plots comparing the various distortion measures with recognition error rates. All talkers
are averaged together for each of 62 data points.

                  Baseline  MAP       FD      BSD     SSNR    SNR
                  Error %   Error %
Baseline Error %  0.00      6.38      2.14    4.07    5.68    3.14
MAP Error %       5.05      0.00      3.96    4.39    7.65    6.83

Table 3.9: RMS fit error for linear predictors of recognition error rate using the overall average value of
each measure. Each column corresponds to the measure used as the predictor and each row corresponds
to the measure being predicted. Errors are in the units of the recognition error rate, % words in error.
[Figure 3.17 panel fit errors: (a) 1st order, E=6.15; (b) 2nd order, E=3.41; (c) 3rd order, E=0.79;
(d) 4th order, E=0.74; (e) 5th order, E=0.61; (f) 6th order, E=0.59.]

Figure 3.17: Scatter plot of baseline-HMM error rate versus MAP-HMM error rate with polynomial fits of
various orders. One data point is shown for each error rate averaged over all the talkers.

3.3.3 Nonlinear Fits


The relationship between baseline-HMM error rate and MAP-HMM error rate has a decidedly nonlinear
trend to it (see Figures 3.15(a) and 3.16(a)) but a relatively low variance away from that trend.
Incorporating the appropriate nonlinearity could therefore provide a stronger correlation between the two error rates.
Figure 3.17 shows scatter plots of the MAP-HMM error rate against 1st-order through 6th-order
polynomial functions of the baseline-HMM error rate, along with the trajectory of each polynomial fit. For
these polynomials the constant term was set to 0 to constrain the fit to pass through the origin, since
presumably when there are no errors with the baseline model there will be none with the MAP-trained
model.
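Constraining the polynomial to pass through the origin amounts to simply omitting the constant basis term from the least-squares problem. A sketch with synthetic data (not the values plotted in Figure 3.17) follows:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 100, size=62)                   # baseline error rates (synthetic)
y = 0.9 * x - 0.002 * x**2 + rng.normal(0, 2, 62)  # MAP error rates (synthetic)

def origin_poly_fit(x, y, order):
    """Least-squares polynomial fit with the constant term forced to zero."""
    A = np.column_stack([x**k for k in range(1, order + 1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, np.sqrt(np.mean((y - A @ coef) ** 2))

for order in range(1, 7):
    _, rms = origin_poly_fit(x, y, order)
    print(f"order {order}: RMS fit error = {rms:.2f}")
```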
Predictably, the fit error declines significantly with higher orders of polynomial fit. For orders higher than
2 the polynomials can fit the convex shape of the data points and still change slope to intersect the origin.
The drop in error from 4th to 5th order is similarly due to the polynomial fitting another zig in the data.
With only 62 data points in a fairly sparse distribution this is most likely the result of overfitting.
Figure 3.18 shows polynomial fits of the talker-by-talker data points. Here the fit error does not improve at all
for polynomials of greater than 3rd order, reinforcing the conclusion that the improvement at higher orders
shown in Figure 3.17 is the result of overfitting and not indicative of any underlying trend in the data.
It is also reasonable to expect that a combination of FD, BSD, SNR, SSNR and functions thereof might
form a better fit to recognition rate than any one taken alone. A general linear least squares [79] fit
provides an approach for forming estimates of one variable from linear combinations of arbitrary functions
of another variable or variables. The general form of the least squares model is
[Figure 3.18 panel fit errors: (a) 1st order, E=6.81; (b) 2nd order, E=4.57; (c) 3rd order, E=3.66;
(d) 4th order, E=3.66; (e) 5th order, E=3.66; (f) 6th order, E=3.66.]

Figure 3.18: Scatter plot of baseline-HMM error rate versus MAP-HMM error rate with polynomial fits of
various orders. One data point is shown for each talker and each processing type. Only talker normalized
data is shown.

$$\hat{y}(x) = \sum_{k=1}^{M} a_k F_k(x) \qquad (3.6)$$

where each $F_k(x)$ is an arbitrary fixed basis function of the input $x$. The optimal least squares coefficients,
$a_k$, are those that minimize the total squared error, $\chi^2$, over the set of observations $\{x_i, y_i\}$:

$$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}(x_i)}{\sigma_i} \right)^2 \qquad (3.7)$$

where $\sigma_i$ is the standard deviation of measurement $i$, or 1 if it is unknown or the deviations are all equal.
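Once the basis functions $F_k$ are evaluated, Equations (3.6) and (3.7) reduce to an ordinary matrix least-squares problem. The sketch below uses synthetic stand-ins for FD and SNR and sets $\sigma_i = 1$, as in the experiments; it illustrates the mechanics only:

```python
import numpy as np

rng = np.random.default_rng(4)
fd = rng.uniform(0.5, 2.0, size=682)                    # synthetic FD values
snr = rng.uniform(0, 30, size=682)                      # synthetic SNR values
err = 30 * fd - 0.5 * snr + rng.normal(0, 3, size=682)  # synthetic error rates

# Basis functions F_k(x): constant, FD, SNR, FD^2, SNR^2.
basis = [np.ones_like(fd), fd, snr, fd**2, snr**2]
A = np.column_stack(basis)
a, *_ = np.linalg.lstsq(A, err, rcond=None)  # minimizes chi^2 with sigma_i = 1
rms = np.sqrt(np.mean((err - A @ a) ** 2))
print(f"RMS error with {len(basis)} basis functions: {rms:.2f}")
```

Adding more basis functions can only lower the training fit error, which is exactly the overfitting caveat raised above for the small 62-point data set.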
Tables 3.10 and 3.11 show the RMS error for general least squares fits of the baseline-HMM and
MAP-HMM error rates using different sets of measures as input. In each column another function of a
measure is added to the set of basis functions: $F_1(x) = 1$, $F_2(x) = \mathrm{FD}(x)$, $F_3(x) = \mathrm{SNR}(x)$ and so on. For
each set of basis functions the optimal linear coefficients were computed and the RMS error measured.
The standard deviation factor in Equation (3.7) was set to a constant value of 1. When fitting to the
baseline-HMM error rate, powers of the MAP-HMM error rate are added as basis functions, and when
fitting to the MAP-HMM error rate, powers of the baseline-HMM error rate are added as basis functions.
Except for the error rate, the functions are added in decreasing order of their linear correlation. As would be
expected, the fit error decreases significantly as basis functions are added. In particular the fit of the
MAP-HMM error rates is greatly enhanced by the addition of the squared functions, which allows the
optimization to fit the curvature of the relationship and achieve a fit as good or nearly as good as the fit of
the baseline-HMM error rate.
                FD      +SNR    +BSD    +SSNR   +FD², SNR²,   +Err%^{1,2,3}
                                                BSD², SSNR²
Without Talker Normalization
Baseline Err%   7.86    7.83    7.53    7.42    7.00          3.42
MAP Err%        8.68    8.32    8.30    8.20    6.96          4.11
With Talker Normalization
Baseline Err%   4.27    4.09    4.05    4.03    3.79          3.08
MAP Err%        5.41    4.74    4.70    4.58    3.71          2.76

Table 3.10: RMS errors for linear least squares estimators of recognition error rates using the per-talker
values (11 talkers × 62 configurations). Units are % words in error. Basis functions are added to the least
squares estimator starting with FD; each column adds another function to the set of basis functions, and
the 5th column adds squared functions of the 4 distortion measures. For the column labeled "Err%^{1,2,3}",
powers 1 through 3 of the baseline-HMM error rate are added to the basis functions for predicting
MAP-HMM error rate, and vice versa. The optimization and error computation is performed over the 682
per-talker values as in Table 3.8.

                FD      +SNR    +BSD    +SSNR   +FD², SNR²,   +Err%^{1,2,3}
                                                BSD², SSNR²
Baseline Err%   2.14    1.59    1.57    1.35    0.81          0.51
MAP Err%        3.96    2.11    2.10    2.07    1.06          0.58

Table 3.11: RMS errors for linear least squares estimators of recognition error rates using the ensemble
average values of each measure. Units are % words in error. Basis functions are added to the least squares
estimator starting with FD; each column adds another function to the set of basis functions, and the 5th
column adds squared functions of the 4 distortion measures. For the column labeled "Err%^{1,2,3}", powers 1
through 3 of the baseline-HMM error rate are added to the basis functions for predicting MAP-HMM error
rate, and vice versa. The optimization and error computation is performed over the 62 average values as in
Table 3.9.

Figure 3.19 shows the scatter plots of the linear predictors including 1st and 2nd powers of FD, SNR, BSD
and SSNR against the predicted error rates. The fits with 3rd powers included are not shown, since adding the
3rd power reduced the error only very marginally.
Figure 3.19 shows how the general least squares fit has a strong linear relationship with the predicted error
rate (compare Figure 3.15), but the variance around this linear trend remains significant. As before,
when the data is averaged across all the talkers (see Figure 3.20) the variance from the linear trend is (not
surprisingly) greatly reduced.

Figure 3.19: Scatter plots of the linear least squares fit including all 1st and 2nd powers of FD, SNR,
BSD and SSNR against the predicted error rate. The corresponding fit error is shown in Table 3.10. The
black points correspond to the values with per-talker bias removed and the lighter patches to the values
without any normalization. As before there are 682 data points plotted, one for each of 11 talkers times 62
measurements.

Figure 3.20: Scatter plots of the linear least squares fit including all 1st and 2nd powers of FD, SNR, BSD
and SSNR against the predicted error rate for the overall averages. The corresponding fit error is shown in
Table 3.11.

3.4 Summary
A set of multichannel recordings was made and processed through a uniformly weighted delay-and-sum
beamformer using from 1 to 16 microphones. The resulting enhanced recordings were evaluated with
distortion measures (FD, BSD, SSNR, peak SNR) and with the performance of a speech recognition
system. As expected, the performance of the DSBF improves as the number of microphones used
increases.

- Recognition performance improves steadily as microphones are added to the DSBF, resulting in
  approximately a 40% decrease in error rate for the quiet data and a 70% decrease in error rate for the
  noisy data (compared to the performance of microphone 1).
- Every distortion measure shows a similar monotonic improvement as microphones are added to the
  DSBF.
- For the added-noise case, the large range of performance measured for each single microphone
  suggests that a non-uniform weighting of the microphones should provide an improvement over the
  uniform weighting used here.
- The feature distortion measure (FD) tracked the recognition performance very closely and may
  provide a way to predict recognition performance without actually running a large data set
  through the speech recognizer.
- The improvement in peak SNR was also quite well correlated with the improvement in speech
  recognition score. In general this is not expected to be true; there are trivial operations that could
  increase SNR while destroying the speech signal (adding noise only during times of active speech,
  for instance). For the DSBF, however, the overall performance is well reflected by its ability to
  suppress noise.
- MAP training greatly improved the speech recognition accuracy in every instance. The
  improvement due to MAP training is of the same order of magnitude as the improvement from the
  DSBF.

In the following chapters, methods intended to improve upon the performance of the simple unweighted
DSBF used here will be developed. Chapter 4 investigates some alternative weighting methods based upon
the array and source geometry. Chapter 5 develops an MMSE multi-input noise-suppression
filtering system, and Chapter 7 shows experimental results on the recorded database using those methods.
Chapter 4:
Towards Enhancing Delay and Sum Beamforming

This chapter will examine the performance that can be expected from delay-and-sum beamformers in a
simple microphone-array scenario and investigate the improvements possible through an optimal
microphone weighting scheme. Reverberant room simulations with multiple noise sources will be used to
evaluate the impact of adding microphones to a linear array. The beamformer performance will be
assessed with objective measures including signal-to-noise ratio (SNR), signal-to-reverberation ratio
(SRR) and Bark spectral distortion (BSD).

4.1 Overview of the Delay-and-Sum Beamformer

Consider the idealized model with a signal $s(t)$ impinging upon $M$ sensors. For convenience assume $s(t)$ is
zero mean. The signal received at each sensor is a delayed version of the original signal plus an additive
noise component:

$$y_m(t) = h_m\, s(t - \tau_m) + n_m(t), \qquad 1 \le m \le M \qquad (4.1)$$

where $n_m(t)$ is a zero-mean normally distributed noise signal with variance $\sigma_m^2$, $\tau_m$ is the time delay to the
$m$th sensor and $h_m$ is the signal attenuation at the $m$th sensor. $n_m(t)$ is uncorrelated with $s(t)$ and with all $n_l(t)$
for $m \ne l$. Assuming that the $\tau_m$ are known, each received signal can be appropriately delayed. The
beamformed output is then the sum of $M$ aligned copies of the signal $s(t)$ plus $M$ uncorrelated additive noise
sources:

$$y(t) = \sum_{m=1}^{M} \left[ h_m\, s(t) + n_m(t + \tau_m) \right] = s(t) \sum_{m=1}^{M} h_m + \sum_{m=1}^{M} n_m(t + \tau_m)$$

Figure 4.1: Idealized SNR gain as a function of the number of sensors in a DSBF beamformer given that
the noise at each sensor is uncorrelated with the noise at all other sensors and the signal.

To simplify the analysis assume that the noise power is identical in each channel, $\sigma_m^2 = \sigma^2$, and the signal
gain is likewise identical, $h_m = 1$ for all $1 \le m \le M$. This assumption implies that no sensor contributes
more than any other sensor; the SNR at each sensor is identical. The signal-to-noise ratio of a single (delayed)
channel is

$$\mathrm{SNR}_1 = \frac{E\left[s^2(t)\right]}{E\left[n_1^2(t - \tau_1)\right]} = \frac{E\left[s^2(t)\right]}{\sigma^2}$$

and the signal-to-noise ratio of the beamformer output is

$$\mathrm{SNR}_{uniform} = \frac{E\left[\left(\sum_{m=1}^{M} s(t)\right)^2\right]}{E\left[\left(\sum_{m=1}^{M} n_m(t + \tau_m)\right)^2\right]} = \frac{M^2\, E\left[s^2(t)\right]}{\sum_{m=1}^{M} \sigma_m^2} = \frac{M\, E\left[s^2(t)\right]}{\sigma^2}$$

Taking the log of the ratio of these two SNRs to get the improvement in dB yields:

$$10 \log_{10}\left(\frac{\mathrm{SNR}_{uniform}}{\mathrm{SNR}_1}\right) = 10 \log_{10} M \qquad (4.2)$$

Figure 4.1 plots this gain function. For every doubling of the number of sensors a 3dB improvement in
output SNR is realized. Even in this idealized scenario 100 sensors are needed to achieve a 20dB
improvement in SNR.
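The $10 \log_{10} M$ law is easy to confirm numerically under the same idealized assumptions (perfectly aligned signal copies, independent unit-variance noise). The sketch below is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 16, 100_000
s = rng.normal(size=N)                # zero-mean signal
noise = rng.normal(size=(M, N))       # independent unit-variance noise, per channel

y = (s + noise).mean(axis=0)          # delay-and-sum (delays already removed)
snr_1 = s.var() / noise[0].var()      # single-channel SNR
snr_M = s.var() / (y - s).var()       # beamformer-output SNR
print(f"gain = {10 * np.log10(snr_M / snr_1):.2f} dB "
      f"(ideal: {10 * np.log10(M):.2f} dB)")
```

For M = 16 the printed gain lands within a fraction of a dB of the ideal 12.04 dB, the value quoted for the 16-microphone array earlier in this chapter.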
The situation becomes more complicated if each sensor measures a different level of signal and noise.
That is, $h_m \ne h_l$ and $\sigma_m^2 \ne \sigma_l^2$. In this case the SNR of the simple delay-and-sum beamformer is

$$\mathrm{SNR}_{nonuniform} = \frac{E\left[\left(\sum_{m=1}^{M} h_m\, s(t)\right)^2\right]}{E\left[\left(\sum_{m=1}^{M} n_m(t)\right)^2\right]} = \frac{\left(\sum_{m=1}^{M} h_m\right)^2 E\left[s^2(t)\right]}{\sum_{m=1}^{M} \sigma_m^2} \qquad (4.3)$$


4.2 Delay-Weight-and-Sum
The output SNR of the beamformer can be maximized by weighting each $y_m(t)$ before summing
the channels. If $g_m$ is the weight applied to channel $m$, then the SNR of the output of this
delay-weight-and-sum beamformer is

$$\mathrm{SNR}_{weighted} = \frac{\left(\sum_{m=1}^{M} g_m h_m\right)^2 E\left[s^2(t)\right]}{\sum_{m=1}^{M} g_m^2 \sigma_m^2} \qquad (4.4)$$

A simple optimization can be performed on the expression in Equation (4.4) by taking the derivative with
respect to $g_l$ and setting the result equal to zero:

$$\frac{\partial}{\partial g_l} \mathrm{SNR}_{weighted} = \frac{2 \left(\sum_{m=1}^{M} g_m h_m\right) h_l\, E\left[s^2(t)\right]}{\sum_{m=1}^{M} g_m^2 \sigma_m^2} - \frac{\left(\sum_{m=1}^{M} g_m h_m\right)^2 2 g_l \sigma_l^2\, E\left[s^2(t)\right]}{\left(\sum_{m=1}^{M} g_m^2 \sigma_m^2\right)^2} = 0$$

which simplifies to the expression

$$g_l = \frac{h_l \sum_{m=1}^{M} g_m^2 \sigma_m^2}{\sigma_l^2 \sum_{m=1}^{M} g_m h_m}$$

which is trivially satisfied by

$$g_l = \frac{h_l}{\sigma_l^2} \qquad (4.5)$$

Substituting the optimal weight from Equation (4.5) into Equation (4.4) yields the following expression
for the SNR of this optimally weighted beamformer:

$$\mathrm{SNR}_{optimal\ weighted} = \frac{\left(\sum_{m=1}^{M} \frac{h_m^2}{\sigma_m^2}\right)^2 E\left[s^2(t)\right]}{\sum_{m=1}^{M} \frac{h_m^2}{\sigma_m^2}} = \sum_{m=1}^{M} \frac{h_m^2}{\sigma_m^2}\, E\left[s^2(t)\right] \qquad (4.6)$$

Figure 4.2: A linear microphone array with inter-microphone spacing d and distance from the talker to the
array midpoint r.

Figure 4.3: Simulation showing the improvement in SNR for the optimally-weighted beamformer and the
unweighted beamformer for a near-field linear array. The ideal SNR curve from Figure 4.1 is included for
comparison.

To get a sense of what this could mean in practice, consider the example in Figure 4.2. A talker stands
$r = 1$m away from the center of a symmetrical linear microphone array with constant microphone spacing
of $d = 10$cm. Each microphone receives an equal level of independent noise ($\sigma_m^2 = 1$). The $h_m$ (and $g_m$)
terms are inversely proportional to the distance from the talker to the microphone and are given by

$$h_m = \frac{r}{\sqrt{r^2 + \ell_m^2}}$$

where $\ell_m$ is the lateral offset of microphone $m$ from the array midpoint, a multiple of the spacing $d$.
(This choice of distances results in unity gain at the center of the array.) As
microphones are added at the ends of the array we can compute the improvement in SNR as a function of
the number of microphones with Equation (4.6). This is plotted in Figure 4.3.
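As a sanity check, the unweighted (Equation (4.3) with $g_m = 1$) and optimally weighted (Equation (4.6)) SNR curves for this near-field example can be computed in a few lines of numpy. This is a minimal sketch, not the code used to produce Figure 4.3, and the exact microphone indexing of the symmetric array is an assumption (mics fanning out from the midpoint in alternating directions):

```python
import numpy as np

r, d, sigma2 = 1.0, 0.10, 1.0   # talker distance (m), mic spacing (m), noise power

def lateral_offsets(M, d):
    # Mics fan out from the array midpoint: 0, +d, -d, +2d, -2d, ...
    # (an assumed indexing; the text's exact numbering may differ)
    return np.array([((m + 1) // 2) * d * (1 if m % 2 else -1) for m in range(M)])

h = r / np.sqrt(r**2 + lateral_offsets(100, d) ** 2)   # unity gain at midpoint

snr1 = h[0] ** 2 / sigma2                              # single-mic reference
for M in (10, 40, 100):
    hm = h[:M]
    snr_uni = hm.sum() ** 2 / (M * sigma2)             # Equation (4.3), g_m = 1
    snr_opt = (hm**2 / sigma2).sum()                   # Equation (4.6)
    print(f"M={M:3d}: unweighted {10*np.log10(snr_uni/snr1):5.2f} dB, "
          f"optimal {10*np.log10(snr_opt/snr1):5.2f} dB")
```

The curves flatten as M grows, which is the limited-gain behavior discussed next.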
It is apparent from Figure 4.3 that when even a simple model of signal attenuation is taken into account, the
realizable gain from a delay-and-sum or delay-weight-and-sum beamformer can be quite limited. As
microphones are added in this simple example, the SNR in each added microphone drops as the added
microphones sit further and further from the source, eventually negating the gain of adding the
distant microphone at all¹.
Taking this example one step further, supplant the independent noise at each microphone with a Gaussian

1 Granted, this is a contrived example: with the 10cm spacing described, the array is 4.1m wide when there are 40 microphones
in the array and 10m wide when there are 100 microphones in the array. Clearly this is not the ideal array geometry for a talker 1 meter
away from the array, but it makes the point.

Figure 4.4: Linear microphone array with interfering noise point-source located 2m to the right of the talker.

Figure 4.5: SNR improvement for DSBF with point-source noise and ambient noise.

point-noise source 2m to the right of the talker. Assume that the level of noise measured by the
microphone closest to the noise source is equal to the ambient noise level. This is depicted in Figure 4.4.
In this case the optimal weighting according to Equation (4.5) is no longer simply inversely proportional to
the distance from the talker, but also proportional to the square of the distance from the interfering noise
source (because in the simple spherical propagation model the noise power in the denominator of
Equation (4.5) is inversely proportional to the square of the distance from the noise source).
Figure 4.5 shows the SNR improvement as a function of the number of microphones in the array². This
particular result obviously will vary with the ratio of the ambient noise power to the point-source noise power. If
the ambient noise power is much greater than the point-noise power the result will approach the one
plotted in Figure 4.3.
The curves shown in Figures 4.3 and 4.5 differ only subtly. The achievable SNR is marginally
higher (≈1dB) in Figure 4.5 for both the unweighted and optimally weighted schemes, but in either
scenario the weighted solution enjoys only about 1dB of improvement in SNR over the unweighted
solution.

2 In this scenario the noise is no longer independent and the expression in Equation (4.6) is not truly applicable. Although correlated
noise may add destructively or constructively depending upon the geometry of the array and sources, a reasonable (or pessimistic)
expectation is that the array will do worse in the presence of correlated noise. The SNR improvements shown here are over-estimates
under that expectation.
4.3 Delay-Filter-and-Sum
The obvious extension of the simple source model in Equation (4.1) is to generalize the signal scaling
factor $h_m$ to a convolutional element or channel impulse response. Each sensor in this model receives a
filtered version of the signal plus an independent noise signal:

$$y_m(t) = h_m(t) * s(t) + n_m(t)$$

(where $*$ denotes convolution). A corresponding convolutional element is introduced into the channel
weighting in the beamformer:

$$y(t) = \sum_{m=1}^{M} g_m(t) * \left[ h_m(t) * s(t) + n_m(t) \right] \qquad (4.7)$$

The channel-dependent filtering can be distributed to write the beamformer output in terms of the
signal-derived portion, $y_s(t)$, and the noise-derived portion, $y_n(t)$:

$$y_s(t) = \sum_{m=1}^{M} g_m(t) * h_m(t) * s(t)$$

$$y_n(t) = \sum_{m=1}^{M} g_m(t) * n_m(t)$$

Rewriting these expressions in the frequency domain facilitates the formulation of a frequency-dependent
optimal weighting:

$$Y_s(\omega) = \sum_{m=1}^{M} G_m(\omega)\, H_m(\omega)\, S(\omega)$$

$$Y_n(\omega) = \sum_{m=1}^{M} G_m(\omega)\, N_m(\omega)$$

where $H_m(\omega)$, $S(\omega)$ and $N_m(\omega)$ are the Fourier transforms of $h_m(t)$, $s(t)$ and $n_m(t)$, respectively. $G_m(\omega)$ is
the frequency-dependent weighting. $Y_s(\omega)$ is the Fourier transform of the signal-derived portion of the
beamformer output, and $Y_n(\omega)$ is the Fourier transform of the noise-derived portion of the beamformer
output. The output SNR can be written as a function of frequency:

$$\mathrm{SNR}_{weighted}(\omega) = \frac{E\left[\left|Y_s(\omega)\right|^2\right]}{E\left[\left|Y_n(\omega)\right|^2\right]} = \frac{E\left[\left|\sum_{m=1}^{M} G_m(\omega) H_m(\omega) S(\omega)\right|^2\right]}{E\left[\left|\sum_{m=1}^{M} G_m(\omega) N_m(\omega)\right|^2\right]} \qquad (4.8)$$

$$= \frac{\left|\sum_{m=1}^{M} G_m(\omega) H_m(\omega)\right|^2 E\left[\left|S(\omega)\right|^2\right]}{\sum_{m=1}^{M} \left|G_m(\omega)\right|^2 \sigma_m^2(\omega)} \qquad (4.9)$$

where $\sigma_m^2(\omega) = E\left[\left|N_m(\omega)\right|^2\right]$. Once again the assumption in place is that the noise in each channel is
independent of the noise in every other channel.

4.3.1 Optimal-SNR Solution

As before this SNR can be optimized by taking the derivative, with respect to $G_l^*(\omega)$ this time, and setting
the result equal to 0. Dropping the $(\omega)$ notation and omitting the limits on the summations (they are all
$m = 1, \ldots, M$) for the sake of brevity, the following solution results:

$$\frac{\partial}{\partial G_l^*} \mathrm{SNR}_{weighted} \propto \frac{\left(\sum \left|G_m\right|^2 \sigma_m^2\right) \left(\sum G_m H_m\right) H_l^* - \left|\sum G_m H_m\right|^2 G_l \sigma_l^2}{\left(\sum \left|G_m\right|^2 \sigma_m^2\right)^2} = 0 \qquad (4.10)$$

which simplifies to

$$G_l = \frac{H_l^* \left(\sum \left|G_m\right|^2 \sigma_m^2\right) \left(\sum G_m H_m\right)}{\sigma_l^2 \left|\sum G_m H_m\right|^2}$$

and, not surprisingly, this is satisfied by

$$G_l(\omega) = \frac{H_l^*(\omega)}{\sigma_l^2(\omega)} \qquad (4.11)$$

The solution in Equation (4.11) features the time-reverse (conjugate) of the impulse response, $H_l^*(\omega)$, and is a
noise-weighted variant of the matched-filter method [23]. The $H_l^*(\omega)$ filter acts as a sort of pseudo-inverse
for $H_l(\omega)$.



Magnitude-only Solution
If the channel impulse response is modeled as a magnitude-only filter, $H_l(\omega) = \left|H_l(\omega)\right|$, then
Equation (4.11) becomes

$$G_l(\omega) = \frac{\left|H_l(\omega)\right|}{\sigma_l^2(\omega)} \qquad (4.12)$$

and the filter-and-sum strategy in this case can be considered as a filterbank implementation where
Equation (4.5) is used to derive a real-valued weight for every filterbank frequency $\omega$.

Beamformer Frequency Response

The set of filters described in Equations (4.11) and (4.12) will distort the frequency response of the
beamformer (that is, $\sum_{m=1}^{M} G_m(\omega) \ne 1$) if left in the form presented. To preserve a flat frequency
response, the filters derived from Equations (4.11) and (4.12) should have their
magnitudes normalized by a factor of $1 / \sum_{m=1}^{M} G_m(\omega)$ for each analysis frequency, $\omega$, to ensure that the gain of
the beamformer is uniform across frequency.
It may seem an obvious step to apply a frequency weighting that implements the same sort of SNR
optimization as the microphone weighting, but if Equation (4.9) is rewritten as a function of frequency
weights instead of channel weights it will not lead to a similar solution for a frequency weighting.
Frequency weighting strategies are discussed in Chapter 5.
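A sketch of how the per-frequency weights of Equation (4.11) and the flat-response normalization might be applied on an FFT grid follows. The channel transfer functions and noise PSDs here are random placeholders, and the normalization shown forces $\sum_m G_m(\omega) = 1$ exactly, a slight variant of the magnitude normalization described above:

```python
import numpy as np

rng = np.random.default_rng(6)
M, nfft = 8, 512
# Placeholder channel transfer functions H_m(w) and noise PSDs sigma_m^2(w).
H = rng.normal(size=(M, nfft)) + 1j * rng.normal(size=(M, nfft))
noise_psd = rng.uniform(0.5, 2.0, size=(M, nfft))

G = np.conj(H) / noise_psd              # Equation (4.11), per channel and bin
G /= G.sum(axis=0, keepdims=True)       # flat-response normalization

# The beamformer output spectrum for channel input spectra Y_m(w):
Y = rng.normal(size=(M, nfft)) + 1j * rng.normal(size=(M, nfft))
out = (G * Y).sum(axis=0)
assert np.allclose(G.sum(axis=0), 1.0)  # unit gain at every analysis frequency
```

Note that near-zero values of the channel sum can make this normalization ill-conditioned, which is consistent with the zeros in the derived filters discussed in the simulation below.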

4.4 A Reverberant Simulation


The following section describes a simulation of a linear microphone array using the channel weighting
strategies described above.

4.4.1 Methods
Figure 4.6 depicts the layout of the simulated room. 40 microphones are arranged in a linear array with
10cm microphone separation across the short wall of a 4x6m room. 3 interfering noise sources are
simulated with recordings of computer-equipment fan noise and placed as shown in Figure 4.6. A digital
recording of a male talker made with a close-talking microphone is used as the desired source signal. The
sampling rate is 16000Hz. The impulse response from each source (the talker and 3 separate noise
sources) to each microphone is simulated using the image method [80]; see Figure 4.7 for an example
impulse response. The resulting reverberation time is approximately 250ms and the unprocessed

Figure 4.6: Location of the source, noise, and microphones for the reverberant simulation.

Figure 4.7: The simulated impulse response for the talker received at microphone 1.

Figure 4.8: Optimal filter for microphone 1 for the 40 microphone case using the complex model of the
channel transfer function (based on Equation (4.11)).
signal-to-noise ratio at microphone 1 is approximately −3dB. Signal or noise power in the following is
computed by taking the mean squared value of the time signal over the length of the test signal³.
The microphone outputs are delay-steered to the source location, then the statistics of the noise and signal
are measured over the entire 3 second utterance and used to derive the weight or filter for each channel. The
channels are weighted (or filtered) and summed in the following 4 ways:

1. Uniformly weighted (unweighted). This is simply a delay-and-sum beamformer; each channel is
weighted equally.

2. Weighted according to Equation (4.5) (weighted). The noise power and the magnitude of the direct
path in the impulse response are measured for each microphone to form each weight according to
Equation (4.5).

3. Weighted at each analysis frequency according to Equation (4.11) (freq weighted). The PSD of the
noise and the time inverse of the impulse response for each microphone are used to generate the
optimal filter according to Equation (4.11).

4. Weighted at each analysis frequency according to Equation (4.12) (mag freq weighted). A
magnitude-only version of the filter computed in item 3 above is used to weight each frequency at
each microphone.

In this simulation a 512 point analysis window (32ms) was used for measuring the power spectral density
of the signal and noise. The filters in the filter-and-sum beamformer are 1024 points long for the complex
model of the impulse response ($G_m = H_m^*/\sigma_m^2$; method (3), freq weighted) and 512 points for the
magnitude-only case ($G_m = |H_m|/\sigma_m^2$; method (4), mag freq weighted). A 1024 point tapered truncation of
the channel impulse responses, $h_m(t)$, is used to compute the numerator for the filters⁴. The shorter
window length was used for the magnitude-only version only because the shorter window is a more
practical analysis length for speech signals. For the complex case the window had to be lengthened to
include a reasonable portion of the impulse response.
For cases (3) and (4) the overall array frequency response is normalized as described in Section 4.3.1. This
frequency normalization smoothes out the overall spectral shape of the beamformer response but,
especially in case (3) where the conjugate of the channel transfer function is being used as the beamformer
filter (see Figure 4.8), there are zeros in the derived channel filters; the resulting total beamformer
frequency response still contains these zeros. Because of this beamformer frequency response distortion
and the action of the matched filter, the signal-to-noise and reverberation ratios in Figures 4.9, 4.11 and 4.12 show
an improvement even for the case of a single microphone.

4.4.2 Results
Figure 4.9 depicts the SNR yielded by the different weighting strategies as a function of the number of
microphones, added in order of increasing distance from the source. For the SN ratio shown in Figure 4.9,
the numerator is the energy of the direct-path signal of the talker and the denominator includes the energy
of the reverberation due to the talker as well as all direct and reverberant energy from the noise sources. In
other words, anything other than the talker direct path is considered to be noise in this ratio. The signal
impulse response is divided into direct and reverberant components by applying a 4ms⁵-wide window

3 If the unprocessed SNR seems low, consider that this is an average SNR. Even at −3dB the target speech is intelligible. The peak
SNR is approximately 5dB.
4 The truncated impulse response was used to try to inject a little bit of practicality into the implementation of the filter-and-sum
beamformers. This can be increased along with the lengths of the derived filters to correspond to the total length of the simulated
channel impulse responses at the cost of increased computational complexity, but the results do not significantly change.
5 This value was chosen fairly ad hoc. It should be noted that the measured signal power is very sensitive to the value of this
parameter. The wider the time window that is considered to be direct-path energy, the higher the measured direct signal energy will be
and subsequently the higher all the signal power ratios (SNR, SRR) will be.
Figure 4.9: Signal-to-noise+reverb improvement for a simulated room with 3 equipment fan noise sources
as a function of the number of microphones in the array.

Figure 4.10: BSD measure for the 4 different beamforming schemes as a function of the number of micro-
phones in the array.

around the impulse corresponding to the direct path in the simulated channel impulse response. The talker
signal is then convolved with the direct-path component and the reverberant component separately and
summed separately for cases (1), (2), and (4). For case (3), one of the actions of the derived filter is to
increase the power in the main lobe while spreading reverberant energy out away from the main
impulse [23]; consequently, for this case the derived filter is convolved with the simulated channel impulse
response and then decomposed into direct and reverberant components so that the "matched-filtering"
effect can be measured accurately.
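A sketch of this direct/reverberant decomposition follows: the impulse response is split with a 4ms window around the direct-path impulse and the source is convolved with each part separately. The toy impulse response and the rectangular window shape are assumptions for illustration; the dissertation does not specify its exact windowing:

```python
import numpy as np

fs = 16000

def split_direct_reverb(h, width_s=0.004):
    """Split an impulse response at its largest peak into direct and
    reverberant components (rectangular 4 ms window; shape assumed)."""
    peak = np.argmax(np.abs(h))
    half = int(width_s * fs / 2)
    direct = np.zeros_like(h)
    lo, hi = max(0, peak - half), peak + half
    direct[lo:hi] = h[lo:hi]
    return direct, h - direct

rng = np.random.default_rng(7)
h = 0.2 * np.exp(-np.arange(4000) / 800.0) * rng.normal(size=4000)  # toy tail
h[200] = 1.0                                                        # direct path
s = rng.normal(size=fs)                                             # toy source
h_d, h_r = split_direct_reverb(h)
sig = np.convolve(s, h_d)
rev = np.convolve(s, h_r)
srr_db = 10 * np.log10(np.mean(sig**2) / np.mean(rev**2))
print(f"signal-to-reverberation ratio: {srr_db:.1f} dB")
```

As footnote 5 warns, widening the window moves energy from the "reverberant" to the "direct" bucket and raises every such ratio.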
For all methods the SNR improvement falls well short of the 10 log10(40) ≈ 16dB theoretical array gain
from Equation (4.2). This is hardly surprising given the correlated nature of the noise included in this
simulation, both from the simulated noise sources and from the talker reverberation. Figure 4.9 shows that
the simple weighted beamformer (2) and the magnitude-only filter-and-sum beamformer (4) hardly do any
better than the uniformly weighted beamformer (1). Although method (3) does noticeably better according
to the signal power ratio measures (Figures 4.9, 4.11 and 4.12), its Bark spectral distortion
(Figure 4.10) is only marginally lower than even the simplest unweighted beamformer. Note also how the
incremental improvement in SNR is quite similar for all methods. Method (3)'s higher SNR starts right at
the single-microphone case, suggesting that the main source of the higher ratios can be attributed to the
matched-filtering and spectral distortion effects rather than to the optimization in the microphone weighting.
Figure 4.11: Signal-to-noise-only ratios as a function of the number of microphones in the array.
Figure 4.12: Signal-to-reverberation ratios as a function of the number of microphones in the array.

Informal listening tests confirm that method (3) sounds marginally clearer than the other methods, but this improvement comes at the cost of knowing the channel impulse response exactly. The computation of the optimal filters in simulation is trivial since the channel impulse responses are all known, but in a practical situation accurate channel impulse responses will be quite difficult to measure and will vary widely with changes in the room[21], not to mention the position of the talker.
The marginal results of the weighting methods investigated in this simulation suggest that a different strategy may be more fruitful in providing improvement over the basic unweighted DSBF. Chapter 5 will introduce another form of optimization, and the resulting algorithm will be implemented in Chapter 7.
CHAPTER 5:
OPTIMAL FILTERING

In the previous chapter an optimal-SNR weighting strategy was derived based upon a combination of noise statistics and geometric signal propagation models. The derived weights or filters were constant for a particular arrangement of talker and noise sources in the room, with no dependence on the signal received at the array sensors. Additionally, no frequency shaping or distortion was permitted in the array frequency response. In this chapter, data-dependent optimal filtering strategies that use spectral shaping in an attempt to improve signal quality will be investigated, in particular the Wiener filter and a novel multi-channel variant of the Wiener filter. A non-optimal application of Wiener pre-filtering to microphone arrays will also be introduced.

5.1 The Single Channel Wiener Filter


Assume a signal, y(t), that contains the desired signal, s(t), corrupted by some as yet unspecified noise or other distortion. A filter φ(t) that, when convolved with the received signal, y(t), approximates s(t) is desired. The signal estimate is given by:

    ŝ(t) = φ(t) * y(t)    (5.1)
and the error by:

    e(t) = s(t) - ŝ(t)

If minimum mean-squared error is the criterion for choosing φ(t), the expression for the mean-squared error can be written:

    ξ = E[ ∫_{-∞}^{+∞} |s(t) - ŝ(t)|² dt ] = E[ ∫_{-∞}^{+∞} |e(t)|² dt ]    (5.2)
Rewriting the expression for the error in the frequency domain and employing Parseval's relation yields:

    Ŝ(ω) = Φ(ω) Y(ω)
    E(ω) = S(ω) - Ŝ(ω)

    ξ = E[ (1/2π) ∫_{-∞}^{+∞} |S(ω) - Ŝ(ω)|² dω ] = E[ (1/2π) ∫_{-∞}^{+∞} |E(ω)|² dω ]    (5.3)

where e(t) = s(t) - ŝ(t) and S(ω), Ŝ(ω), Φ(ω), Y(ω) and E(ω) are the Fourier transforms of s(t), ŝ(t), φ(t), y(t) and e(t), respectively. The filter Φ(ω) is chosen to minimize the total squared error, ξ. Moving the expected value operation inside the integral, taking the derivative¹ of the integrand with respect to Φ*(ω) and setting it equal to 0 yields
1. A frequently omitted detail is that this is not, strictly speaking, the derivative of the total squared error, ξ. Nevertheless, the values of Φ that minimize this pseudo-derivative also minimize ξ.



    ∂/∂Φ*(ω) E[ E(ω) E*(ω) ] = E[ ∂( E(ω)E*(ω) ) / ∂Φ*(ω) ] = -E[ E(ω) Y*(ω) ] = 0

Substituting back in for E(ω) and solving for Φ(ω) yields

    E[ ( S(ω) - Φ(ω)Y(ω) ) Y*(ω) ] = 0
    E[ S(ω)Y*(ω) ] - Φ(ω) E[ Y(ω)Y*(ω) ] = 0
    Φ(ω) E[ |Y(ω)|² ] = E[ S(ω)Y*(ω) ]

    Φ(ω) = E[ S(ω)Y*(ω) ] / E[ |Y(ω)|² ]    (5.4)

Equation (5.4) is the most general form of the Wiener filter.

5.1.1 Additive Uncorrelated Noise


A simple model for the received signal, y t , in which the desired signal, s t is corrupted by an additive  

noise signal, n t , is simply y t  s t


 

n t . Using this signal model in Equation (5.4) would result in a









 




variety of cross terms unless some assumptions are made about the nature of the signal, s t , and noise, 


nt . 


A commonly made assumption that simplifies Equation (5.4) is that the signal and noise are uncorrelated; more explicitly, that the expected value of the cross-correlation of the signal and noise is equal to 0:

    ∫_{-∞}^{+∞} E[ S(ω) N*(ω) ] dω = 0    (5.5)

where E[·] denotes expected value. Using the signal model y(t) = s(t) + n(t), or its frequency-domain counterpart, Y(ω) = S(ω) + N(ω), we get a new expression for the Wiener filter:

    Φ = E[ S Y* ] / E[ |Y|² ] = |S|² / ( |S|² + E[ |N|² ] )    (5.6)

or, rewriting this in terms of the measured signal Y and the noise statistic E[|N|²]:

    Φ = ( |Y|² - E[ |N|² ] ) / |Y|²    (5.7)
Equations (5.6) and (5.7) are the most commonly seen forms of the Wiener filter. Note that the filter coefficients are strictly real and non-negative, and in Equation (5.6) it is clear that Φ ≤ 1. The result of Equation (5.7) may not satisfy these conditions if the observed power |Y|² is less than the noise estimate E[|N|²]. Care must be taken in the implementation to ensure that noise in the observations doesn't create degenerate filter coefficients.
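As an illustration of Equations (5.6)-(5.7) and the care just described, the following is a minimal sketch of a frequency-domain Wiener filter with a gain floor. It is not the implementation used in this work; the frame length, FFT size, floor value, and the provided noise_psd (an estimate of E[|N(ω)|²] on the rfft grid) are all illustrative assumptions.

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, floor=0.01):
    # Equation (5.7): gain = (|Y|^2 - E[|N|^2]) / |Y|^2, clamped to
    # [floor, 1] so frames whose observed power dips below the noise
    # estimate do not produce negative (degenerate) coefficients.
    gain = (noisy_power - noise_power) / np.maximum(noisy_power, 1e-12)
    return np.clip(gain, floor, 1.0)

def wiener_filter(y, noise_psd, frame_len=512, nfft=1024):
    """Overlap-add, frequency-domain Wiener filtering of a 1-D signal y.

    noise_psd: estimate of E[|N(w)|^2] on the nfft-point rfft grid
               (length nfft//2 + 1).
    """
    hop = frame_len // 2
    win = np.hanning(frame_len)
    out = np.zeros(len(y))
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len] * win
        Y = np.fft.rfft(frame, nfft)
        G = wiener_gain(np.abs(Y) ** 2, noise_psd)
        s_hat = np.fft.irfft(G * Y, nfft)[:frame_len]
        out[start:start + frame_len] += s_hat   # half-overlap Hanning OLA
    return out
```

The gain floor is a common design choice: it keeps the filter from zeroing out bins entirely, which limits the warbling artifacts discussed in Chapter 7.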


The model for y t can be altered to include a convolutional distortion of the signal component:



yt

ht st n t . In this case, using the same assumption about the lack of correlation between




 




noise and signal, Equation (5.4) becomes

E S HS N  H  S 2
Φ 
 

(5.8)



E  HS N  2 

 H   S 2 E  N  2 2 

Details of Implementation

Note that if the estimate of the signal power already incorporates the transfer function, H, then the formulation in Equation (5.8) shouldn't be applied directly. For instance, suppose a filter, Φ̂, is to be derived according to Equation (5.8), but the estimate of |S|², denoted |Ŝ|², is formed by subtracting the expected value of the noise power from the instantaneous measurement of the input signal:

    |Ŝ|² = |Y|² - E[ |N|² ] = |HS + N|² - E[ |N|² ] ≈ |H|² |S|²

Then it is inappropriate to substitute this signal estimate directly into Equation (5.8) as

    Φ̂ = H* |Ŝ|² / |Y|²

because according to the signal model |Ŝ|² already includes a factor of |H|². To achieve the form in Equation (5.8), |Ŝ|² needs to be divided by H rather than multiplied by H*. That is,

    Φ̂ = (1/H) |Ŝ|² / |Y|² = (1/H) ( |Y|² - E[ |N|² ] ) / |Y|² ≈ H* |S|² / |Y|²

5.2 Multi-channel Wiener Filter


In a multi-channel configuration the most obvious application of a Wiener filter is to apply it after the
beamforming operation:

M
yB t  ∑ ym t

 

 

m 1

A post-filter φ t can be designed to minimize the squared error between ŝ t  φ t y B t and s t in a

















manner analogous to the development of the Wiener filter above. Since the channels have been summed
into a single output channel the MMSE solution is simply a Wiener post-filter on the beamformed signal:

S ω YB ω
Φ ω  E
 

(5.9)
 

 YB ω  2 


This sum-and-filter formulation has been used variously in[38, 39].


An alternative is to derive a filter-and-sum process. In the following formulation each of the M channels is filtered independently:

    ŝ(t) = Σ_{m=1}^{M} φ_m(t) * y_m(t)    (5.10)

Or, expressed in the frequency domain:

    Ŝ(ω) = Σ_{m=1}^{M} Φ_m(ω) Y_m(ω)    (5.11)

In a manner similar to the derivation of the single channel Wiener filter, a solution for Φ_m(ω) can be derived starting from Equation (5.3). Substituting Ŝ(ω) as defined in Equation (5.11) yields the following expression for the total squared error:

    ξ = E[ ∫ |E(ω)|² dω ] = E[ ∫ |S(ω) - Ŝ(ω)|² dω ] = E[ ∫ | S(ω) - Σ_{m=1}^{M} Φ_m(ω) Y_m(ω) |² dω ]    (5.12)

Taking the derivative of the integrand of Equation (5.12) with respect to Φ_m*(ω) and setting it equal to zero yields:


    ∂/∂Φ_m*(ω) E[ E(ω) E*(ω) ] = E[ ∂( E(ω)E*(ω) ) / ∂Φ_m*(ω) ] = -E[ E(ω) Y_m*(ω) ] = 0

The resulting system of M equations can be written in matrix form,

    ⎡ E|Y1|²      E[Y2 Y1*]   ⋯  E[YM Y1*] ⎤ ⎡ Φ1 ⎤   ⎡ E[S Y1*] ⎤
    ⎢ E[Y1 Y2*]   E|Y2|²      ⋯  E[YM Y2*] ⎥ ⎢ Φ2 ⎥ = ⎢ E[S Y2*] ⎥
    ⎢     ⋮           ⋮       ⋱      ⋮     ⎥ ⎢ ⋮  ⎥   ⎢    ⋮     ⎥
    ⎣ E[Y1 YM*]   E[Y2 YM*]   ⋯  E|YM|²    ⎦ ⎣ ΦM ⎦   ⎣ E[S YM*] ⎦

and the general solution written as:

    ⎡ Φ1 ⎤   ⎡ E|Y1|²      E[Y2 Y1*]   ⋯  E[YM Y1*] ⎤⁻¹ ⎡ E[S Y1*] ⎤
    ⎢ Φ2 ⎥ = ⎢ E[Y1 Y2*]   E|Y2|²      ⋯  E[YM Y2*] ⎥   ⎢ E[S Y2*] ⎥    (5.13)
    ⎢ ⋮  ⎥   ⎢     ⋮           ⋮       ⋱      ⋮     ⎥   ⎢    ⋮     ⎥
    ⎣ ΦM ⎦   ⎣ E[Y1 YM*]   E[Y2 YM*]   ⋯  E|YM|²    ⎦   ⎣ E[S YM*] ⎦
The discerning reader will note that the matrix in Equation (5.13) is the spatial correlation matrix[81] and can be written as the outer product of the input signal vector:

    E[ Y Yᴴ ],  with Y = [Y1, Y2, …, YM]ᵀ    (5.14)

Measurement of the spatial correlation matrix in Equation (5.14) can be problematic in practice. To estimate this matrix by averaging different instances of the Y vectors requires at least M instances to achieve a spatial correlation matrix with full rank, since a particular instance of the outer product YYᴴ has rank 1. Consider what this means for a typical speech signal, which can only be considered stationary for 40ms or so: if there are 16 microphones in the array, the spatial correlation estimate requires at least 16 independent frames within that 40ms window to form a spatial correlation matrix of full rank. With a half-overlap Hamming analysis window this would imply an individual analysis frame of 4.7ms. At a 16kHz sampling rate this results in a frequency resolution of approximately 212Hz. For larger numbers of microphones the frequency resolution only gets worse.
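This rank limitation is easy to demonstrate numerically. The following minimal sketch (illustrative only) averages instantaneous rank-1 outer products at a single frequency bin and shows that the rank of the estimate cannot exceed the number of averaged frames:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 16  # number of microphones
for n_frames in (4, 16, 64):
    R = np.zeros((M, M), dtype=complex)
    for _ in range(n_frames):
        # one complex snapshot Y per analysis frame at a single frequency bin
        Y = rng.normal(size=M) + 1j * rng.normal(size=M)
        R += np.outer(Y, Y.conj())      # each term is a rank-1 matrix
    R /= n_frames
    print(n_frames, np.linalg.matrix_rank(R))  # rank = min(n_frames, M)
```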

5.2.1 Additive Uncorrelated Noise


Once again, the simplest model is the additive noise model. In the multi-channel case each received signal has an independent noise signal:

    y_m(t) = s(t) + n_m(t)  ⟺  Y_m(ω) = S(ω) + N_m(ω)    (5.15)

Assuming initially that not only are the signal and noise uncorrelated, but that each n_m(t) is uncorrelated with each n_l(t) for m ≠ l, implies the following substitutions:

    E[ |Y_m|² ] = |S|² + E[ |N_m|² ],   E[ S Y_m* ] = |S|²,   E[ Y_m Y_l* ] = |S|²  (m ≠ l)    (5.16)

Using the compact overbar notation for expected value, E[X] ≡ X̄, and writing σ_l² ≡ E[|N_l|²], incorporating the simplifications from Equation (5.16) into Equation (5.13) yields

    ⎡ Φ1 ⎤   ⎡ |S|² + σ1²   |S|²         ⋯  |S|²        ⎤⁻¹ ⎡ |S|² ⎤
    ⎢ Φ2 ⎥ = ⎢ |S|²         |S|² + σ2²   ⋯  |S|²        ⎥   ⎢ |S|² ⎥    (5.17)
    ⎢ ⋮  ⎥   ⎢   ⋮             ⋮         ⋱    ⋮         ⎥   ⎢  ⋮   ⎥
    ⎣ ΦM ⎦   ⎣ |S|²         |S|²         ⋯  |S|² + σM²  ⎦   ⎣ |S|² ⎦
The matrix in Equation (5.17) can be written as the sum of a constant matrix and a diagonal matrix of noise autocorrelation values:

    |S|² 1 1ᵀ + diag( σ1², σ2², …, σM² )

where 1 is the M-vector of ones. The diagonal matrix is full rank and positive definite, so its sum with the positive semi-definite constant matrix is also full rank. The matrix inverse in Equation (5.17) therefore exists in general and can be formed without long-term averaging of observations of YYᴴ.

5.2.2 Direct Solution


The highly structured form of the matrix in Equation (5.17) suggests that there may be a simplified form of
the solution that does away with the matrix inversion. The form of this simplified solution can be
discerned by writing the inversion in terms of the adjoint matrix (or matrix of cofactors) and the
determinant. Specifically:
1
Aco f A 1

det A
where for a matrix, A, det A is its determinant and Aco f is the matrix of cofactors. Applying this basic
formula for the inverse to Equation (5.17) yields:
     
Φ1 ∏M 1 σm
2
m
   
   
Φ2 ∏M 2 σm
2
 S
2 m
..  .. (5.18)
 S  2 ∑k  m  k σm
∏M m σm
∏M

.
M 2 2 
.

1 

ΦM 

∏M
m M σm
2

where ∏Mm  k σm denotes the product of all σm terms except for the m  k term. It may be clearer to view
2 2

Equation (5.18) with the product of the σm ’s factored out. Note that
2

M σ2k
∏ σ2m ∏m σm
M 2
m k 
so Equation (5.18) can be rewritten with each product term expressed in this way; the common factor Π_m σ_m² then cancels from the numerator and denominator to yield:

    [Φ1, Φ2, …, ΦM]ᵀ = ( |S|² / ( |S|² Σ_{k=1}^{M} 1/σ_k² + 1 ) ) · [ 1/σ1², 1/σ2², …, 1/σM² ]ᵀ    (5.19)

or, written more succinctly in terms of the weight for a single microphone (and including the dependence on ω previously omitted for brevity):

    Φ_m(ω) = ( |S(ω)|² / ( |S(ω)|² Σ_{k=1}^{M} 1/σ_k²(ω) + 1 ) ) · 1/σ_m²(ω)    (5.20)
In this form it can be clearly seen that each Φ_m is proportional to the reciprocal of the noise power for that channel, with a common overall weighting that is a function of |S|² and the σ_m². It is also more apparent that the computational complexity of this solution is O(M) rather than the O(M³) typically required by the matrix inverse². Note that for M = 1 this solution is identical to the Wiener filter in Equation (5.6). Also, if the noise power is the same in each channel, σ_m²(ω) = σ²(ω), then the resulting Φ_m(ω) is also the same for each channel and given by:

    Φ_m(ω) = (1/M) · |S(ω)|² / ( |S(ω)|² + σ²(ω)/M )

which is the delay-and-sum weight 1/M combined with a Wiener filter matched to the beamformer output (whose noise power is σ²/M). Since this filter is the same for each channel, by the principles of linear systems it can be applied after the beamforming summation, resulting in a beamformer with Wiener post-filter, as would be expected.
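The closed form can be checked against the direct solution. The following minimal sketch (illustrative only) builds the matrix of Equation (5.17) for random noise powers and confirms that solving the linear system reproduces Equation (5.20):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8
S2 = 2.0                           # |S(w)|^2 at one frequency bin
sigma2 = rng.uniform(0.5, 3.0, M)  # per-channel noise powers

# Direct solution, Equation (5.17): (|S|^2 * 11^T + diag(sigma2)) Phi = |S|^2 * 1
A = S2 * np.ones((M, M)) + np.diag(sigma2)
phi_direct = np.linalg.solve(A, S2 * np.ones(M))

# Closed form, Equation (5.20): O(M) instead of O(M^3)
phi_closed = (S2 / sigma2) / (S2 * np.sum(1.0 / sigma2) + 1.0)

assert np.allclose(phi_direct, phi_closed)
```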

5.2.3 Filtered Signal Plus Additive Independent Noise


Proceeding as above, a slightly more comprehensive signal model is one where each channel is subject to convolutional distortion as well as independent additive noise:

    y_m(t) = h_m(t) * s(t) + n_m(t)  ⟺  Y_m(ω) = H_m(ω) S(ω) + N_m(ω)    (5.21)

In this case the following substitutions apply, where P^Y_{ml} = E[ Y_m Y_l* ]:

    E[ |Y_m|² ] = |H_m|² |S|² + σ_m²,   E[ S Y_m* ] = H_m* |S|²,   P^Y_{ml} = H_m H_l* |S|²  (m ≠ l)    (5.22)

and lead to the solution (from Equation (5.13)):

    ⎡ Φ1 ⎤   ⎡ |H1|²|S|² + σ1²    H2 H1*|S|²        ⋯  HM H1*|S|²       ⎤⁻¹ ⎡ H1*|S|² ⎤
    ⎢ Φ2 ⎥ = ⎢ H1 H2*|S|²        |H2|²|S|² + σ2²    ⋯  HM H2*|S|²       ⎥   ⎢ H2*|S|² ⎥    (5.23)
    ⎢ ⋮  ⎥   ⎢      ⋮                 ⋮             ⋱        ⋮          ⎥   ⎢    ⋮    ⎥
    ⎣ ΦM ⎦   ⎣ H1 HM*|S|²        H2 HM*|S|²        ⋯  |HM|²|S|² + σM²  ⎦   ⎣ HM*|S|² ⎦

The matrix to be inverted in Equation (5.23) is the sum of a vector outer product and a diagonal matrix of noise autocorrelation values:

    |S|² h* hᵀ + diag( σ1², σ2², …, σM² ),  with h = [H1, H2, …, HM]ᵀ

The diagonal matrix is positive definite (barring a vanishing noise signal), so the sum is also positive definite and the matrix inverse in Equation (5.23) exists in general. As in the previous case, this expression for the optimal filter can be rewritten in a simplified form that obviates the generalized matrix inversion in Equation (5.23). The simplified solution is given by:

2. A matrix inverse can be computed in a manner that has complexity O(M^{log₂ 7}) but at the expense of a very large constant factor[79]. The typical LU-decomposition algorithm for matrix inversion is an O(M³) process.



     
    [Φ1, Φ2, …, ΦM]ᵀ = ( |S|² / ( |S|² Σ_{k=1}^{M} |H_k|² Π_{m≠k} σ_m² + Π_{m=1}^{M} σ_m² ) ) · [ H1* Π_{m≠1} σ_m², H2* Π_{m≠2} σ_m², …, HM* Π_{m≠M} σ_m² ]ᵀ    (5.24)

This result can be rewritten in a more revealing form by factoring out the product of the noise variances. As above, note that

    Π_{m≠k} σ_m² = ( Π_{m=1}^{M} σ_m² ) / σ_k²
so Equation (5.24) can be rewritten with each product term expressed in this way; the common factor Π_m σ_m² then cancels from the numerator and denominator, resulting in:

    [Φ1, Φ2, …, ΦM]ᵀ = ( |S|² / ( |S|² Σ_{k=1}^{M} |H_k|²/σ_k² + 1 ) ) · [ H1*/σ1², H2*/σ2², …, HM*/σM² ]ᵀ    (5.25)

or, written more succinctly in terms of the weight for a single microphone (and including the dependence on ω previously omitted for brevity):

    Φ_m(ω) = ( |S(ω)|² / ( |S(ω)|² Σ_{k=1}^{M} |H_k(ω)|²/σ_k²(ω) + 1 ) ) · H_m*(ω)/σ_m²(ω)    (5.26)

As in the previous case, the computational complexity of the solution written in this form is only O(M), as opposed to O(M³) for the form including the matrix inversion. Equation (5.25) is very similar to the optimal-SNR weighting derived in Equation (4.11): each Φ_m is the ratio of the conjugated channel transfer function to the channel noise power, but now also includes an overall weighting at each frequency. This is consistent with the optimal-SNR result of Equation (4.11).
Note that when H_m = 1 ∀ m = 1…M, Equation (5.25) is identical to the solution for the model without signal filtering in Equation (5.19). Also, for the case where M = 1, Equation (5.25) becomes

    Φ = H* |S|² / ( |H|² |S|² + σ² )

which is the single channel Wiener filter.


A reasonable question to ask is whether the solution in Equation (5.26) is equivalent to an optimal-SNR weighting as in Equation (4.11) followed by a Wiener post-filter. The optimal-SNR weighting with normalization is given by

    Φ_m^{osnr,0}(ω) = ( H_m*(ω)/σ_m²(ω) ) / ( Σ_{k=1}^{M} |H_k(ω)|/σ_k²(ω) )    (5.27)
where the denominator is designed to normalize the gain of the array so that

    Σ_{k=1}^{M} | Φ_k^{osnr,0}(ω) | = 1

Adding a Wiener weighting on top of this weighting adds a factor of |S(ω)|² / E[|Y(ω)|²], where Y(ω) is the output of the optimal-SNR weighted beamformer. Incorporating this factor into Φ_m^{osnr,0}(ω) from Equation (5.27) yields a new weighting:

    Φ_m^{osnr,1}(ω) = |S(ω)|² ( H_m*(ω)/σ_m²(ω) ) / ( [ Σ_{k=1}^{M} |H_k(ω)|/σ_k²(ω) ] · E[ | Σ_{k=1}^{M} Φ_k^{osnr,0}(ω) Y_k(ω) |² ] )    (5.28)
Using the independent-noise model from above to simplify the expected value in Equation (5.28) (and dropping the ω notation for brevity's sake) yields:

    E[ | Σ_k Φ_k^{osnr,0} Y_k |² ] = E[ | Σ_k Φ_k^{osnr,0} ( H_k S + N_k ) |² ]
                                   = | Σ_k Φ_k^{osnr,0} H_k |² |S|² + Σ_k |Φ_k^{osnr,0}|² σ_k²

Comparing this to Equation (5.26), the H_m*(ω)/σ_m²(ω) and |S(ω)|² terms (in both numerator and denominator) are in common. What remains is to compare the normalization terms in the denominators:

    |S|² Σ_{k=1}^{M} |H_k|²/σ_k² + 1   ≟   [ Σ_k |H_k|/σ_k² ] · ( |S|² | Σ_k Φ_k^{osnr,0} H_k |² + Σ_k |Φ_k^{osnr,0}|² σ_k² )

This can be re-expanded in terms of H_m(ω) and σ_m²(ω) in search of an equivalence. Substituting Φ_k^{osnr,0} = ( H_k*/σ_k² ) / Σ_l ( |H_l|/σ_l² ) gives

    | Σ_k Φ_k^{osnr,0} H_k |² = ( Σ_k |H_k|²/σ_k² )² / ( Σ_l |H_l|/σ_l² )²
    Σ_k |Φ_k^{osnr,0}|² σ_k² = ( Σ_k |H_k|²/σ_k² ) / ( Σ_l |H_l|/σ_l² )²

In this form it is clear (or at least more clear) that the two weightings are not equivalent: the squared sums on the right-hand side introduce cross-terms that are not cancelled on the left, and the two denominators differ by the factor ( Σ_k |H_k|²/σ_k² ) / ( Σ_k |H_k|/σ_k² ), which is unity only in special cases. The relative weighting between the channels is the same as for the optimal-SNR weighting, but the overall weighting at each frequency is not.
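This non-equivalence can be confirmed numerically. The following minimal sketch (purely illustrative: random H_m and σ_m² at a single frequency bin, names invented for this example) builds both weightings and shows that they share the same relative channel weighting but differ in overall scale:

```python
import numpy as np

rng = np.random.default_rng(2)
M, S2 = 8, 1.5
H = rng.normal(size=M) + 1j * rng.normal(size=M)  # channel transfer functions
sig2 = rng.uniform(0.5, 2.0, M)                   # channel noise powers

# Multi-channel Wiener weights, Equation (5.26)
mcw = (S2 * H.conj() / sig2) / (S2 * np.sum(np.abs(H)**2 / sig2) + 1.0)

# Optimal-SNR weighting with normalization, Equation (5.27)
osnr0 = (H.conj() / sig2) / np.sum(np.abs(H) / sig2)
# Wiener post-filter factor |S|^2 / E|Y|^2 on the weighted beamformer output
EY2 = np.abs(np.sum(osnr0 * H))**2 * S2 + np.sum(np.abs(osnr0)**2 * sig2)
osnr1 = osnr0 * (S2 / EY2)

ratio = mcw / osnr1
print(np.allclose(ratio, ratio[0]))  # True: same relative channel weighting
print(np.allclose(mcw, osnr1))       # False: overall frequency weighting differs
```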

5.2.4 Filtered Signal Plus Semi-Independent Noise Model


Changing the signal model one more time: the signal again undergoes convolutional distortion, as above, but in this case the corrupting noise is not independent from channel to channel; that is to say, P^N_{ml} ≠ 0. The assumption that the signal and noise are uncorrelated is still in effect. This signal model leads to the following substitutions:



    E[ |Y_m|² ] = |H_m|² |S|² + σ_m²,   E[ S Y_m* ] = H_m* |S|²,   P^Y_{ml} = H_m H_l* |S|² + P^N_{ml}

Applying these substitutions to Equation (5.13) yields

    Φ = ( |S|² h* hᵀ + Pᴺ )⁻¹ |S|² h*    (5.29)

where h = [H1, H2, …, HM]ᵀ and Pᴺ is the noise cross-correlation matrix, with off-diagonal entries P^N_{ml} and diagonal entries σ_m².
On the face of it this matrix might appear singular, but because it is the expected value of the noise covariance that is added, it is generally not singular. That is, the matrix in Equation (5.29) can be written as the sum of two outer products:

    |S|² h* hᵀ + E[ n* nᵀ ],  with n = [N1, N2, …, NM]ᵀ

where the second term is the expected value of the noise cross-correlation. This is a Hermitian matrix and, except under degenerate values of noise correlation, it will be full rank; therefore the matrix inverse in Equation (5.29) will exist. In Equations (5.17) and (5.23) the noise in each channel was assumed to be independent of the noise in any other channel, simplifying this cross-correlation matrix to a diagonal matrix of noise autocorrelation values. Effective estimation of the noise correlation matrix through the averaging of multiple observations may be made if the noise is slowly varying. This is in contrast to the spatial correlation matrix in Equation (5.14), which contains an estimate of the rapidly varying speech signal.
Unlike the previous cases, the matrix in Equation (5.29) does not lend itself to a simplified inverse operation. Equation (5.29) also requires the estimation of the complete noise cross-correlation matrix rather than just the noise autocorrelation terms used in Equations (5.17) and (5.23).

5.3 A Non-Optimal Filter and Sum Framework


An alternative way to incorporate Wiener filtering into a beamformer is to simply apply the Wiener filters before the sum of the beamformer. The DSBF output can be used to provide the clean-signal estimate, since it has reduced noise compared to the individual channels. The advantage of this method is that the DSBF itself can provide the signal estimate, and the matrix inversion of the previous section can be avoided altogether. Also, since the Wiener filtering occurs before the sum, it is possible that the artifacts from the Wiener filtering in each channel will tend to cancel in the beamformer output, resulting in a lower level of filtering artifacts in the final output. This process is diagrammed in Figure 5.1. After delay steering, the channels are averaged together, forming the basic DSBF output. This signal is then used as a signal estimate to design a Wiener filter for each individual channel. Because the DSBF output is used as the signal reference, the individual channel filters will implement the same noise-canceling and signal-reinforcing behavior that the DSBF accomplishes. That is to say, if the beamformer provides good noise attenuation at some frequency, that frequency will be weighted more strongly by the channel Wiener filters; conversely, where the beamformer does not provide noise attenuation, the channel Wiener filters will attenuate at that frequency. Since this filtering is done on each channel before the beamformer sum, the two processes (Wiener filtering and beamforming) are additive.
In general the process can be iterative, reusing the filter-and-sum beamformer output at iteration k, S^(k), as the new signal reference to generate the channel filters for iteration k+1, Φ^(k+1). Explicitly:
 


Figure 5.1: Flow diagram for a Wiener filter-and-sum beamformer using the delay-and-sum beamformer output as the signal estimate for the Wiener filters. The delay-steered channels are summed to form S^(0), which drives the per-channel Wiener filters; the filtered channels are then summed to form S^(1).

    Φ_m^(0)(ω) = 1

    S^(k)(ω) = (1/M) Σ_{m=1}^{M} Φ_m^(k)(ω) Y_m(ω)

    Φ_m^(k+1)(ω) = |S^(k)(ω)|² / |Y_m(ω)|²    (5.30)

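A minimal sketch of this iteration on one analysis frame (illustrative only; a practical implementation would add the smoothing, gain flooring, and overlap-add reconstruction described in Chapter 7):

```python
import numpy as np

def wiener_filter_and_sum(Y, n_iter=1, floor=1e-12):
    """Iterate Equation (5.30) on one frame.

    Y : (M, F) array of delay-steered channel spectra (M channels, F bins)
    Returns the filter-and-sum output spectrum after n_iter iterations.
    """
    S = Y.mean(axis=0)                     # S^(0): plain DSBF output, Phi^(0) = 1
    for _ in range(n_iter):
        Phi = np.abs(S)**2 / np.maximum(np.abs(Y)**2, floor)
        Phi = np.minimum(Phi, 1.0)         # keep the gains <= 1
        S = (Phi * Y).mean(axis=0)         # S^(k+1)
    return S
```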

This is illustrated in an idealized example. Consider an M channel array. The noise in each channel is Gaussian, uncorrelated between the channels, and of equal power in each channel. Let the signal of interest be a sine wave of frequency ω0 at a nominal power, E[|S(ω0)|²] = ψ². The noise spectrum has a constant power, E[|N(ω)|²] = σ². The power spectrum of a single channel is then:

    E[ |Y(ω)|² ] = ψ² + σ²  for ω = ω0;   σ²  for ω ≠ ω0

The noise power in the initial DSBF output is reduced by a factor of 1/M:

    E[ |S^(0)(ω)|² ] = ψ² + σ²/M  for ω = ω0;   σ²/M  for ω ≠ ω0

Now forming the ratio in Equation (5.30) to generate a Wiener filter for each channel results in a filter with the following transfer function:

    Φ_m^(1)(ω) = ( ψ² + σ²/M ) / ( ψ² + σ² )  for ω = ω0;   1/M  for ω ≠ ω0    (5.31)

In the Wiener filter the factor of 1/M reappears, but now in magnitude rather than power, effectively doubling (in dB) the noise suppression achieved by the beamformer. The gain at the signal frequency is not unity, but it will approach 1 for ψ² ≫ σ², and as M increases it approaches the minimum mean-squared-error optimal gain of ψ²/(ψ² + σ²). Figure 5.2 shows the value of this term for varying numbers of microphones and signal-to-noise ratios. In this simple example the value of Φ_m^(1)(ω) is directly related to the SNR of channel m. In a more realistic situation the attenuation of the noise by the beamformer will not be so reliable - coherent noise may sum constructively at some frequencies and destructively at others - and this direct mapping of channel SNR to Φ_m^(1)(ω) will not hold. Applying Φ_m^(1)(ω) to each channel³ and beamforming (averaging) to generate S^(1)(ω) results in

3. Since each channel has identical statistics and therefore an identical Φ_m in this example, it is mathematically equivalent to apply the filter on the beamformer output.
Figure 5.2: The attenuation of Φ_m^(1)(ω) from Equation (5.31) at the signal frequency for different values of SNR, 10 log10(ψ²/σ²), and numbers of microphones in the beamformer, M = 2, 3, 4, 8, 256.


Figure 5.3: The attenuation of Φ_m^(1)(ω) as a function of input SNR, raised to different powers corresponding to Φ_m^(1)(ω), Φ_m^(2)(ω) and Φ_m^(3)(ω) (1st, 3rd and 7th powers). The number of channels is fixed at 16.
    E[ |S^(1)(ω)|² ] = ψ² ( 1 + σ²/(Mψ²) )³ / ( 1 + σ²/ψ² )²  for ω = ω0;   σ²/M³  for ω ≠ ω0

where the gain at ω = ω0 has been rewritten to more clearly separate the influences of the signal-to-noise ratio and the number of microphones. Note the 1/M³ reduction in noise power: this is the cube of the reduction in noise power achieved by the DSBF.
Repeating the process to generate Φ_m^(2)(ω) yields

    Φ_m^(2)(ω) = [ Φ_m^(1)(ω) ]³

which shows that subsequent iterations of Φ_m are simply powers of Φ_m^(1). In particular,

    Φ_m^(k+1) = [ Φ_m^(k) ]² Φ_m^(1)

Figure 5.3 shows the value of Φ_m^(1)(ω) raised to the 3rd and 7th powers, corresponding to Φ_m^(2)(ω) and Φ_m^(3)(ω). Note how the attenuation falls off steeply: signals at different frequencies will be attenuated to a degree that is greatly sensitive to the SNR at that frequency, potentially resulting in undesirable signal coloration if a higher power of Φ_m^(1)(ω) is employed.
One way to avoid this sort of signal distortion while increasing the noise suppression of the filter is to map the filter response non-uniformly, compressing Φ_m(ω) in the neighborhood of 0dB while maintaining a strong attenuation in low-SNR regions.

Figure 5.4: Ad hoc methods of warping the filter gains to create a flatter response at moderately high SNR while preserving a strong attenuation at low SNR. One curve shows Φ_m^(1), another the result of warping by Equation (5.32), and a third a hard-threshold warping in which the attenuation was held at 1 for values of Φ_m^(1) greater than -2dB and set to (Φ_m^(1))⁵ for values below that.


For instance, any gain above some threshold could be set to unity while leaving gains below the threshold alone. Another strategy would be to raise Φ_m^(1)(ω) to a variable power based on its value. A possible (absolutely ad hoc) warping which maintains a longer flat region and a faster dropoff below some threshold is

    Φ_m(ω) = [ Φ_m^(1)(ω) ]^( 1 - 20 log10 Φ_m^(1)(ω) )    (5.32)

The effect of this ad hoc warping is shown in Figure 5.4. Both warped curves are flatter than Φ_m^(1) above 9dB SNR and then drop off sharply at lower SNRs.
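If the reconstruction of Equation (5.32) above is taken at face value, the two warping strategies can be sketched as follows (the -2dB threshold and 5th power are the values quoted in the Figure 5.4 caption; all function names are invented for this illustration):

```python
import numpy as np

def warp_threshold(gain, thresh_db=-2.0, power=5):
    # Hold the gain at unity above the threshold; attenuate hard below it.
    gain_db = 20.0 * np.log10(np.maximum(gain, 1e-12))
    return np.where(gain_db > thresh_db, 1.0, gain ** power)

def warp_variable_power(gain):
    # Equation (5.32): raise the gain to a power that grows as the gain shrinks,
    # leaving gains near 0dB almost untouched.
    exponent = 1.0 - 20.0 * np.log10(np.maximum(gain, 1e-12))
    return gain ** exponent
```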

5.4 Summary
The derivation of the single channel Wiener filter was presented and extended to a multi-input MMSE solution, the multi-channel Wiener (MCW) filter. The form of this multi-channel Wiener filter was simplified for the additive-noise and convolution-plus-additive-noise signal scenarios, resulting in solutions of low computational complexity. The MCW method was shown to incorporate the optimal-SNR inter-microphone weighting derived in Chapter 4, and the overall frequency weighting of the MCW algorithm was shown to be different from that of the Wiener post-filter (WSF) algorithm. Another non-optimal but intuitively appealing application of Wiener filters to microphone arrays, as pre-filters, was described and its behavior as an iterative process explored. These methods, along with a reference Wiener post-filter (WSF), will be implemented in Chapter 7.
All the Wiener algorithms presented in this chapter require a noise-free, or at least noise-reduced, estimate of the signal spectrum. Chapter 6 addresses the spectrum estimation problem in the context of microphone arrays.
CHAPTER 6:
SIGNAL SPECTRUM ESTIMATION

The Wiener filter requires knowledge of the power spectrum of the desired signal (see Equation (5.6)). In some communications applications the statistics of the desired signal may be reasonably well approximated by an a priori distribution, but when the signal of interest is speech, ongoing signal measurements are required to estimate the rapidly changing signal power spectrum. In this chapter some methods of spectrum estimation will be investigated. The cross-spectrum signal estimation method, which is often used in microphone-array systems[39, 38, 41], will be shown to be a special case of the ubiquitous noise-spectrum subtraction methods[33]. A novel spectral estimate that combines the cross-spectrum and minimum-noise subtraction methods will be developed, with some investigation of parameter optimization for the method.

6.1 Spectral-Subtraction
In the classical single-channel spectral-subtraction case[33] the signal spectrum is commonly estimated by measuring the noise power during silence regions and estimating the signal power with:

    |Ŝ(ω)|² = |S(ω) + N(ω)|² - E[ |N(ω)|² ]    (6.1)

To the extent that the noise is stationary and uncorrelated with the signal, this is a good estimate of the signal power, though care must be taken to avoid over-estimating the instantaneous noise power and inserting negative values into the signal power-spectrum estimate[33].
One way to estimate the noise power[82, 34, 83] is to use the minimum observed value of the power spectrum over some interval:

    |N̂_k(ω)|² = min_{k' ∈ [k-N, k]} |Y_{k'}(ω)|²    (6.2)

where the noise power spectrum estimate for analysis frame k and frequency ω, |N̂_k(ω)|², is the minimum value taken from the last N analysis frames of the noise-corrupted signal power spectrum, |Y_{k'}(ω)|².
The advantages of this approach are:

- An explicit speech/no-speech decision is not necessary.
- Over-estimation of the noise power is less likely.
- The noise estimate will adapt to changing noise.
- It is very simple to implement.

Implementations of this technique typically use a smoothed version of the power spectrum. The implementation used herein weights past analysis frames with an exponentially decaying weight factor:

    |Ȳ_k(ω)|² = (1 - α) |Y_k(ω)|² + α |Ȳ_{k-1}(ω)|²    (6.3)

where |Ȳ_k(ω)|², the smoothed spectrum estimate for frame k, is formed by weighting the raw estimate for the current frame, |Y_k(ω)|², by 1 - α and the smoothed estimate for the previous frame, |Ȳ_{k-1}(ω)|², by α.

Figure 6.1: Average BSD, SSNR and peak SNR for the minimum spectral-subtraction scheme described in Equation (6.3), as a function of the averaging constant, α, and the number of past analysis frames from which the minimum is taken, N.

The noise estimate for frame k is then formed by taking the minimum value of |Ȳ_{k'}(ω)|² for k' ∈ [k-N, k]. Unfortunately, because the processing is done on an utterance-by-utterance basis, only 3 or 4 seconds of input are processed at a given time. This leads to a distinct "start-up" phenomenon at the beginning of each utterance whereby the noise estimate at the beginning of the utterance is significantly lower than the estimate towards the end of the utterance. To ameliorate this problem the minimum-choosing process is done forwards and backwards and the two results averaged together. To determine appropriate values of the weighting factor α and the history length N, various values were used to do spectral subtraction on a set of 4 noisy utterances from the noisy database and 4 utterances from the quiet database. The data for each utterance was processed with a delay-and-sum beamformer using 1 to 16 microphones, so a total of 16*8=128 signals with various noise characteristics were processed. For each test instance the BSD, SSNR and peak SNR were measured from the resulting power spectra. Figure 6.1 shows the average BSD, SSNR, and peak SNR as a function of α and N for the quiet and noisy databases¹; a 512 point Hamming window and a 1024 point zero-padded FFT were used². A reasonable choice of parameters, falling near the optimal areas of both BSD and SSNR while staying as high on the SNR curve as possible, is α = 0.6 and N = 80 (1.28 seconds)³. Note that the peak SNR is a monotonically increasing function of α. As the time constant of the averaging function increases, the magnitude of the noise estimate will tend to increase as more speech energy is incorporated into the noise estimate. The increase in peak SNR is a by-product of the increase in the magnitude of the noise estimate and doesn't account for the "quality" of the noise estimate.
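A minimal sketch of the smoothed, forward-backward minimum-statistics noise estimate described above (Equations (6.2) and (6.3)); the defaults mirror the α = 0.6, N = 80 choice, and the helper name is invented:

```python
import numpy as np

def min_noise_estimate(power_spec, alpha=0.6, n_frames=80):
    """Minimum-statistics noise estimate from a (K, F) power spectrogram.

    power_spec: K analysis frames by F frequency bins of |Y_k(w)|^2
    Returns a (K, F) noise power estimate |N_k(w)|^2.
    """
    K, _ = power_spec.shape
    # Equation (6.3): exponential smoothing over frames
    smoothed = np.empty_like(power_spec)
    smoothed[0] = power_spec[0]
    for k in range(1, K):
        smoothed[k] = (1 - alpha) * power_spec[k] + alpha * smoothed[k - 1]

    def running_min(x):
        # Equation (6.2): minimum over the trailing n_frames frames
        out = np.empty_like(x)
        for k in range(K):
            out[k] = x[max(0, k - n_frames):k + 1].min(axis=0)
        return out

    # Forward and backward passes averaged to reduce the start-up bias
    return 0.5 * (running_min(smoothed) + running_min(smoothed[::-1])[::-1])
```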

6.2 Cross-Power
When multiple input channels are available, the cross-power spectra of the channels can be used to form the signal power estimate, |Ŝ(ω)|², in a way that takes advantage of the correlated nature of the signal and the (hopefully) uncorrelated nature of the interference. Using the signal-plus-independent-noise model from Equation (5.15), the expected value of the cross-spectrum of 2 channels is (from Equation (5.16))

1. For this optimization, data from the training set rather than the test set was used.
2. The analysis length and FFT size were chosen to correspond with the parameters of the BSD measure (see Section 2.2.3), because the BSD is being measured directly from the power spectra estimated by the spectral subtraction process.
3. It is expected that the best values for these parameters will vary with the test conditions: noise levels, channel variations, etc. The values derived here are entirely specific to the database used.



    E[ Y_m(ω) Y_l*(ω) ] = |S(ω)|²    (6.4)

since the noise is assumed to be uncorrelated, the expected value of its cross-correlation is 0. A pair of microphones, m and l, can be used to form an estimate of the signal power by taking the real portion of their cross-spectrum:

    P̂_ml(ω) = Re{ Y_m(ω) Y_l*(ω) }    (6.5)
In general there are M microphones available, so there are M(M-1)/2 independent estimates of |S(ω)|² that can be formed from Equation (6.5). Taking the average of these individual estimates and then applying a half-wave rectification yields an estimate of the signal power incorporating information from all the microphones:

    |Ŝ(ω)|² = max( 0, ( 2 / (M(M-1)) ) Σ_{m=1}^{M-1} Σ_{l=m+1}^{M} P̂_ml(ω) )    (6.6)
This is essentially the development done by Zelinski in[38], where the signal power spectrum was estimated by averaging together the cross-power spectra of all possible microphone combinations in a 4-microphone array. This estimate of the signal spectrum was then used in a Wiener filter applied to the output of the beamformer as in Equation (5.9). This formulation of the spectral estimate has been used elsewhere, including[40][30][84]. In [84] taking the real portion of the mean cross-power spectrum is eschewed in favor of the magnitude. The rationale is to design the derived Wiener filter to attenuate the spatially uncorrelated noise while ignoring coherent noise[84] (or rather including it equally in the numerator and denominator of the Wiener filter transfer function), thereby leaving the attenuation of coherent noise to the spatial selectivity of the beamformer.

6.2.1 Computational Considerations



Although Equation (6.6) is written as the mean of M(M-1)/2 cross-spectrum computations, it should be noted that this value can be measured without forming the cross-spectra individually. The power spectrum of the delay-and-sum beamformer output contains all the cross-spectra from Equation (6.6) as well as the auto-spectrum of each microphone:

    |Y_B|² = | (1/M) Σ_{m=1}^{M} Y_m |²
           = (1/M²) Σ_{m=1}^{M} |Y_m|² + (2/M²) Σ_{l=1}^{M-1} Σ_{m=l+1}^{M} Re{ Y_l Y_m* }    (6.7)

where Y_B is the delay-and-sum beamformer output and Y_m is the mth channel (expressed in the frequency domain). The estimate in Equation (6.6) can be formed by computing and subtracting out the auto-spectrum terms from Equation (6.7) and scaling appropriately. This entails the computation of only M+1 power-spectrum estimates rather than M(M-1)/2 cross-spectrum estimates⁴. Specifically, given Y_B as expressed in Equation (6.7) above, the spectral estimate in Equation (6.6) can be realized:

4. This relationship between the power spectrum of the beamformer output and the desired cross-spectrum estimate is also pointed out in[30].

    |Ŝ|² = ( 2 / (M(M-1)) ) Σ_{m=1}^{M-1} Σ_{l=m+1}^{M} Re{ Y_m Y_l* }
         = ( M / (M-1) ) ( |Y_B|² - (1/M²) Σ_{m=1}^{M} |Y_m|² )    (6.8)


This economy of computation is only available if the function chosen to combine the cross-spectrum estimates is the mean and the function chosen to project the complex-valued cross-spectra is the real part. If the absolute value of the cross-spectral estimates is used[84][85][39], this decomposition of the beamformer power spectrum does not apply. Likewise, if some function other than the mean is used to combine the individual cross-spectral estimates (e.g. the median), this simplification also does not apply.

6.3 Combining Spectral-Subtraction and Cross-Power


In light of Equation (6.8) it is apparent that the cross-power spectral estimate is a special case of spectral subtraction (followed by a scaling factor). In this case the noise power estimate to be subtracted is a scaled average of the individual channel power spectra, as opposed to an average or minimum statistic of the power spectrum of the beamformer output. The two different strategies for forming noise estimates can be used simultaneously. The rationale for doing so is that the cross-power estimate lumps anything that is correlated between channels into the signal estimate. Because of this, when the noise combines coherently the noise will tend to be underestimated. The spectral subtraction method, on the other hand, uses long-term statistics of the beamformer output power spectrum to estimate the noise bias; coherent noise will show up in the beamformer output and in the corresponding noise estimate. Because the tendency is for the cross-power estimate of the noise to be too small, the appropriate combination is to use the larger of the two noise estimates at any given time:

    |N̂_cp|² = (1/M²) Σ_{m=1}^{M} |Y_m|²
    |N̂_ss|² = min_{k' ∈ [k-N, k]} |Ȳ_{k'}|²
    |Ŝ|² = |Y_B|² - max( |N̂_cp|², |N̂_ss|² )    (6.9)

where |N̂_cp|² is the noise estimate from the cross-power method of Equation (6.8) and |N̂_ss|² is the noise estimate from the spectral-subtraction estimate of Equation (6.2). The signal estimate, |Ŝ|², is formed by subtracting the larger of the two noise estimates from the beamformer power spectrum, |Y_B|².
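A minimal sketch of the combined estimate of Equation (6.9) at one analysis frame (illustrative; the spectral-subtraction noise term is assumed to come from a minimum-statistics estimator such as the one sketched in Section 6.1):

```python
import numpy as np

def combined_signal_estimate(Y, noise_ss):
    """Equation (6.9) at one analysis frame.

    Y        : (M, F) delay-steered channel spectra
    noise_ss : (F,) minimum-statistic noise estimate |N_ss|^2 for this frame
    """
    M = Y.shape[0]
    YB2 = np.abs(Y.mean(axis=0)) ** 2                 # beamformer power |Y_B|^2
    noise_cp = np.sum(np.abs(Y) ** 2, axis=0) / M**2  # cross-power noise estimate
    S2 = YB2 - np.maximum(noise_cp, noise_ss)         # subtract the larger estimate
    return np.maximum(S2, 0.0)                        # half-wave rectification
```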

6.4 Comparison of Signal Estimate Methods


The recordings described in Section 3.1.1 were used to evaluate the performance of some of the variations of the signal spectrum estimation methods described above. The 4 signal power estimates evaluated are:

1. the power spectrum of the beamformer output
2. the spectral-subtraction method
3. the cross-power method
4. the combination cross-power and spectral-subtraction method

The signal power spectrum estimate types enumerated above were then measured against power spectra generated from the close-talking microphone reference recordings. Peak SNR (SNR), segmental SNR

Figure 6.2: Average BSD, SSNR and peak SNR for the different spectral estimation methods (BF, Spec. Sub., Cross Power, Combo) for the quiet database using 8 and 16 microphones. The values were averaged across all 438 utterances in the test set. See Figure 3.7 for scale comparisons.

Figure 6.3: Average BSD, SSNR and peak SNR for the different spectral estimation methods (BF, Spec. Sub., Cross Power, Combo) for the noisy database using 8 and 16 microphones. The values were averaged across all 438 utterances in the test set. See Figure 3.13 for scale comparisons.

(SSNR) and Bark spectral distortion (BSD) values were computed and averaged over the 438 test-set talker utterances. For all estimation techniques the data segmentation was done with a 512 point (32ms) Hamming window with a half-window overlap, and a 1024 point FFT was used.
The most glaring feature of the quiet-database results in Figure 6.2 is that the Bark distortion (BSD) and segmental SNR (SSNR) are worse after any sort of processing of the beamformer spectrum. Using Figures 3.7 and 3.13 for comparison, however, the degradation in the SSNR and BSD measurements of Figure 6.2 is marginal. The total increase in BSD shown in Figure 6.2 is approximately 5% of the difference between the values measured for the 1 and 16 microphone beamformers in Figure 3.7(a). For the noisy data in Figure 6.3 the BSD improves (decreases) slightly with all post-processing methods. The magnitude of the improvement in Figure 6.3(a) is approximately 8 times greater than the decline in Figure 6.2(a). The SSNR deteriorates with all post-processing methods for both data sets, but the decrease for the noisy data set is about half as much as with the quiet data. In both cases the change in SSNR is approximately an order of magnitude smaller than the total range shown in Figure 3.7(b). In contrast, the peak SNR is improved significantly by all 3 post-processing methods for both quiet and noisy data, and though the cross-power method has a worse peak SNR than the spectral-subtraction method on the quiet data, the combined cross-power/spectral-subtraction method displays the best peak SNR in both noisy and quiet cases. Also, unlike the marginal decline or improvement in BSD and SSNR, the magnitude of the improvement in peak SNR in Figure 6.2(c) is comparable to the improvement achieved by the 16
microphone beamformer shown in Figure 3.7(c).
Since BSD and SSNR are measured only during active speech segments, this suggests that the minimum spectral subtraction method does a good job of attenuating noise during silence passages but is less effective at reducing distortion during segments of speech. For the quiet database this tradeoff between reducing the noise and distorting the speech results in a slight overall performance degradation, precisely because the noise is minimal: the distortion introduced by the processing is on the same order as the distortion already present in the signal. The combination algorithm incorporates the improvement in signal distortion of the cross-power method and the improvement in peak SNR of the spectral subtraction method.

6.5 Summary
In this chapter methods for signal spectrum estimation were described. The cross-spectrum method was shown to be a special case of spectral subtraction. An algorithm combining a minimum noise estimate and the cross-spectrum noise estimate was motivated and described. A preliminary comparison of the different estimation methods using signal distortion measures was presented, supporting the use of the combination estimate. In Chapter 7 optimal filtering strategies will be implemented using the spectrum estimation methods described here.
CHAPTER 7:
IMPLEMENTATIONS AND ANALYSIS

In this chapter variations on the filtering strategies described in Chapters 4 and 5 will be implemented and evaluated. In particular, implementations of the optimal-SNR filter-and-sum strategy (OSNR) from Equation (4.12), the Wiener sum-and-filter (WSF) from Equation (5.9), the Wiener filter-and-sum (WFS) from Equation (5.30) and the multi-channel Wiener (MCW) beamformer from Equation (5.26) will be described, and the distortion measures and speech-recognition performance results for 8 and 16 microphone versions presented. For the Wiener techniques, the different methods of estimating the signal spectrum described in Chapter 6 will be used and compared.

7.1 Optimal-SNR Filter-and-Sum


Figure 7.1 shows the basic structure of the optimal-SNR weighted beamformer (OSNR). Each channel was filtered with a magnitude weighting derived as in Equation (4.12). The implementation details are as follows (a code sketch of the per-channel weighting appears at the end of this list):

- A 512 point (32ms at 16kHz) rectangular window with a half-window shift was used for spectral analysis.
- To preserve a linear convolution in the frequency domain, a 1024 point FFT was used.
- The noise power in each channel, σ_m²(ω), was estimated using the minimum statistic method described in Section 6.1.
- To estimate the channel transfer functions, H_m(ω), the signal power (after subtracting the noise estimate) was averaged over the utterance. These long-term power estimates were then divided by the power estimate of the first channel to provide an estimate of the channel transfer function.

Figure 7.1: The structure of the DSBF with optimal-SNR based channel filtering (OSNR). Signal and noise statistics are generated for each channel; the delay-steered channels are filtered and then summed. The resulting filter weights are normalized across the array to yield a flat total frequency response for the array.


Database     Quiet                          Noisy
Model        Baseline       MAP             Baseline       MAP
# mics       8      16      8      16       8      16      8      16
DSBF         24.28  19.44   14.03  11.92    60.44  47.59   32.49  23.48
OSNR         23.93  19.01   14.08  11.52    53.92  40.25   27.13  18.81

Table 7.1: Recognition performance for the OSNR beamformer expressed in % words in error. The lowest word error rate in each column is highlighted in boldface.

The result is a constant-magnitude transfer function estimate for each channel. For instance:

    |H_m(ω)| = [ ( ⟨|Y_m(ω)|²⟩ - σ_m²(ω) ) / ( ⟨|Y_1(ω)|²⟩ - σ_1²(ω) ) ]^{1/2}

where ⟨·⟩ denotes the average over the utterance.

- To guard against divide-by-zero or multiply-by-zero situations, the gain at each frequency was constrained to lie within -40dB to 3dB by hard limiting at both ends of the range.
- To reduce artifacts in the reconstructed signal, the derived filters were truncated to 512 points in the time domain to preserve the linear convolution property. This corresponds to a smoothing of the filter in the frequency domain.
- The filter weights were normalized across the array at each frequency so that the total array frequency response was flat.
- The filters were applied in the frequency domain and the filtered channels reconstructed in the time domain with an overlap-add technique. A Hanning window was used to taper the overlapping segments together during reconstruction.
- After filtering, the individual channels were summed to form the final output.
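As a rough illustration of the per-channel weighting just described, here is a minimal sketch following the H_m*/σ_m² shape of the optimal-SNR weighting (Equation (5.27)) with the magnitude-only transfer-function estimates above. The exact form of Equation (4.12) is defined in Chapter 4, so every name here, and the order of limiting and normalization, should be treated as assumptions:

```python
import numpy as np

def osnr_weights(chan_power, noise_power, lo_db=-40.0, hi_db=3.0):
    """Magnitude weights for the OSNR filter-and-sum beamformer (sketch).

    chan_power : (M, F) long-term average power spectra of the channels
    noise_power: (M, F) minimum-statistic noise power estimates
    Returns (M, F) real, non-negative channel gains normalized so the
    total array response is flat (gains sum to 1 at each frequency).
    """
    sig = np.maximum(chan_power - noise_power, 1e-12)
    Hmag = np.sqrt(sig / sig[0:1])               # |H_m| relative to channel 1
    w = Hmag / np.maximum(noise_power, 1e-12)    # optimal-SNR shape |H_m|/sigma_m^2
    w = w / w.sum(axis=0, keepdims=True)         # flat overall array response
    return np.clip(w, 10**(lo_db / 20), 10**(hi_db / 20))  # hard gain limits
```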

The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with the OSNR algorithm as described above, using 8 and 16 microphones. The results are presented in the sections that follow.

7.1.1 Subjective Observations


The output of the OSNR beamformer sounds uncolored¹ and free of the sort of warbling artifacts that are typical of noise-suppression filters (including those in the ensuing sections). For the quiet database the noise, although not noticeably attenuated, sounds slightly whiter; periodicities in the background noise are subtly but audibly reduced². For the noisy database the character of the background noise is greatly changed. The bands of noise visible in Figure 3.11 are now attenuated and the whistling quality of the noise is gone. Overall the background noise sounds much whiter than with the unweighted beamformer. Figure 7.2 shows a comparison of noisy optimal-SNR weighted and unweighted DSBF spectrograms.

7.1.2 Objective Performance


Table 7.1 shows the recognition performance when using the optimal-SNR weighted filter-and-sum technique. The least difference is seen for the 8 microphone beamformer and the quiet database, where the

1. The uncolored nature of the beamformer output is a given, since the overall frequency response is constrained to be uniform.
2. This different quality to the noise is extremely subtle and probably would not be noticed in casual listening or in less than optimal environments, but the beamformers can be reliably distinguished in a blind test.
Figure 7.2: Narrowband spectrograms of a noisy utterance processed with OSNR. The top figure is from the unweighted 16 channel beamformer and the bottom figure is from the optimal-SNR weighted 16 channel beamformer. The talker is female and the text being spoken is the alpha-digit string 'BUCKLE83566'. The overall reduction in background noise is apparent, especially in the bands around 4500Hz and 6500Hz.

Database     Quiet          Noisy          Quiet          Noisy
# mics       8      16      8      16      8      16      8      16
measure      FD                            BSD
DSBF         0.60   0.51    1.17   0.98    .066   .054    .124   .096
OSNR         0.59   0.51    1.06   0.85    .065   .053    .106   .083
measure      SSNR                          peak SNR
DSBF         5.38   6.17    3.25   4.06    24.53  27.86   10.86  14.90
OSNR         5.48   6.26    3.69   4.41    25.21  28.90   14.09  18.33

Table 7.2: Summary of the measured average FD, BSD, SSNR and peak SNR values for the OSNR beamformer.

measured performance actually decreases. The .05% increase in error rate corresponds to 2 added errors³. All other results for the quiet database are nearly an order of magnitude greater and are improvements in performance. Admittedly these are very small improvements in terms of the number of word errors involved, but note that the difference between the DSBF (11.92%) and the close-talking microphone (8.16%) performance is only 3.76%; a .5% change in performance is 13% of that margin. The .4% improvement for the 16 microphone MAP-HMM case is slightly more than 10% of that performance gap. With the noisy data, the performance is significantly improved relative to the unweighted DSBF: in the 16 microphone MAP-HMM case the 4.5% decrease in error rate brings the beamformer performance 30% closer to the 8.16% error rate of the close-talking microphone. In the noisy case the SNR varies enough across the array for the weighting to provide significant gain. In the quiet case the noise is very similar in each channel and little can be gained by weighting the microphones nonuniformly.
Table 7.2 shows the distortion measurements for the optimal-SNR weighted beamformer. All the measurements show some improvement over the unweighted DSBF, but the most notable improvement is in peak SNR, which shows only about a 4dB improvement for both 8 and 16 microphone arrays using the noisy data.
3. There are 4497 total test words, so each error contributes 100 × (1/4497) ≈ .022% to the error rate.


This improvement in peak SNR may seem small compared to the improvement shown by other techniques presented herein, but unlike the Wiener filtering strategies in the following sections, the OSNR beamformer achieves this improvement in SNR while maintaining a flat overall frequency response. That is, the peak SNR values from the OSNR beamformer are not inflated by arbitrarily high noise suppression during silence passages: the overall array response is uniform during silence as well as during speech.

Figure 7.3: The structure of the DSBF with Wiener post-filtering, or Wiener sum-and-filter (WSF): the delay-steered channels are summed and the result passed through a noise-reduction post-filter. Note that the channels may feed forward into the noise-reduction step, as they may be necessary to generate the post-filter.

7.2 Wiener Sum-and-Filter


A delay-and-sum beamformer with Wiener post-filter (herein termed "Wiener sum-and-filter" or WSF) in the manner of [39] was implemented as follows:

- A 512-point (32ms at 16kHz) rectangular window with a half-window shift was used for spectral analysis.
- To preserve a linear convolution in the frequency domain, a 1024 point FFT was used.
- 3 different methods were used to form the signal spectral estimate, or rather 3 different methods were used to estimate the noise power spectrum to be subtracted from the beamformer power spectrum:
  1. The cross-spectral power signal estimate from (6.8). This is consistent with the implementation generally found in the literature[39]. In the results this is denoted by WSFcor.
  2. The minimum statistic noise power estimate described in Sections 6.1 and 6.4, denoted below by WSFmin.
  3. The combination noise power estimate described in Sections 6.3 and 6.4, denoted below by WSFcom.
- The spectral densities used in the formulation of the Wiener filter from (5.6) were smoothed in time with an exponential weighting factor of 0.4. That is,

      ⟨|Ŝ_k(ω)|²⟩ = (1 - 0.4) |Ŝ_k(ω)|² + 0.4 ⟨|Ŝ_{k-1}(ω)|²⟩

  This value was chosen to correspond with the smoothing reported in [39] and is intended to strike a balance between a low-variance estimate of the spectral densities and accommodating rapid variation in the speech spectrum.
- To reduce artifacts in the reconstructed signal, the post-filter was truncated to 512 points in the time domain to preserve the linear convolution property. This corresponds to a smoothing of the filter in the frequency domain.
- The filters were applied to the beamformer output in the frequency domain and the result reconstructed in the time domain with an overlap-add technique. A Hanning window was used to taper the overlapping segments together during reconstruction.

The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with the WSF algorithm as described above, using 8 and 16 microphones. The results are presented in the sections that follow.
Figure 7.4: Narrowband spectrograms for the WSF beamformer. The bottom 3 rows correspond to the different ways of forming the signal power spectrum estimate (cor, min, com). The left-hand column images are generated from data with no extra pre-filtering (WSF) and the right-hand column images are from data that was processed with the OSNR filtering prior to the WSF processing (WSF^osnr). The spectrograms for the unweighted DSBF and OSNR outputs are shown at the top for reference. The talker is female and the text being spoken is the alpha-digit string 'BUCKLE83566'. All examples are from 16 microphone implementations.

7.2.1 Subjective Observations


Figure 7.4 shows example spectrograms for an utterance processed with the WSF algorithm. In general the com spectral estimate shows a markedly reduced background noise level compared to the other spectral estimate methods. This observation is confirmed by listening: the background noise is more strongly suppressed by the min and com methods. For the noisy data the com method sounds noticeably better than either of the other two methods (though when OSNR pre-processing is used, com and min sound very similar). The cor method also introduces a greater degree of warbling and tonal noise in the residual background noise. With the quiet data, the level of tone and warble artifacts is low for the com and min spectral estimate types. With the noisy data a greater level of warble and tone artifacts is introduced. Some double-talk and reverberation artifacts can be heard in the 8 microphone case; less so when 16 microphones are used. As when OSNR is used without any post-filtering, the versions using OSNR as a preprocessing step show dramatically reduced spectral peaks in the background noise.

7.2.2 Objective Performance


Tables 7.3 and 7.4 show the recognition performance of the WSF and WSF^osnr beamformers. The values for the quiet database show a slight performance improvement in all cases. The improvement of WSF over DSBF is comparable to the improvement of WSF^osnr over OSNR; the improvements are somewhat additive. In the noisy cases the com spectral estimate performs marginally better in every case; with the quiet data the cor estimate edges out the other methods.

Database     Quiet                          Noisy
Model        Baseline       MAP             Baseline       MAP
# mics       8      16      8      16       8      16      8      16
DSBF         24.28  19.44   14.03  11.92    60.44  47.59   32.49  23.48
WSFcor       21.77  16.70   13.03  11.10    50.39  36.60   27.15  20.61
WSFmin       21.19  18.50   13.41  11.50    51.66  36.83   27.20  20.35
WSFcom       20.72  18.14   13.14  11.54    47.43  34.29   26.75  19.70

Table 7.3: Recognition performance for the WSF beamformers expressed in % words in error. The lowest word error rate in each column is highlighted in boldface.

Database              Quiet                       Noisy
Model          Baseline        MAP         Baseline        MAP
# mics         8      16     8      16     8      16     8      16
OSNR         23.93  19.01  14.08  11.52  53.92  40.25  27.13  18.81
WSFosnr-cor  21.57  16.41  12.88  10.96  44.03  30.11  25.35  17.92
WSFosnr-min  21.06  18.52  13.43  10.87  43.59  28.71  23.35  16.90
WSFosnr-com  20.84  18.12  13.16  10.90  40.41  28.06  23.30  16.59

Table 7.4: Recognition performance for the WSFosnr beamformers expressed in % words in error. The
lowest word error rate in each column is highlighted in boldface.

This is not unexpected: the cor estimate produces a generally lower estimate of the noise and the com
estimate (by definition) the highest; the cor estimate performs best in the low-noise cases and the com
estimate performs best in the high-noise cases. Note that the OSNR performance is better than the WSF
performance (without OSNR pre-processing). The best improvement for the quiet data (WSFosnr-min)
makes up 27% of the difference from the DSBF to the close-talking microphone performance. The best
noisy performance (WSFosnr-min) is 45% of the performance difference.
The largest improvements in the distortion measures can be seen in the peak SNR values: relative to the
DSBF baseline, the peak SNR is approximately 1.5 times greater for the quiet data and 2 times greater for
the noisy data. For the other measures the difference is generally much smaller; for the quiet data
especially, the difference in measured distortion is sometimes vanishingly small.

                       FD                          BSD
Database         Quiet        Noisy          Quiet        Noisy
# mics         8     16     8     16       8     16     8     16
DSBF         0.60  0.51   1.17  0.98     .066  .054  .124  .096
WSFcor       0.57  0.51   0.82  0.73     .068  .055  .100  .075
WSFmin       0.54  0.49   0.86  0.74     .075  .063  .098  .077
WSFcom       0.54  0.49   0.77  0.69     .075  .062  .100  .076

                      SSNR                      peak SNR
DSBF         5.38  6.17   3.25  4.06    24.53  27.86  10.86  14.90
WSFcor       5.31  6.24   3.62  4.49    28.41  32.71  15.78  20.41
WSFmin       5.66  6.38   4.03  4.88    40.94  45.02  21.15  26.56
WSFcom       5.63  6.37   3.98  4.89    39.91  44.54  22.63  28.45

Table 7.5: Summary of the measured average FD, BSD, SSNR and peak SNR values for the WSF beam-
formers. The baseline values for the delay-and-sum beamformer are shown for reference (DSBF). The best
value in each category (lowest for distortions, highest for SNRs) is highlighted in boldface.

                        FD                          BSD
Database          Quiet        Noisy          Quiet        Noisy
# mics          8     16     8     16       8     16     8     16
OSNR          0.59  0.51   1.06  0.85     .065  .053  .106  .083
WSFosnr-cor   0.56  0.49   0.78  0.68     .067  .055  .090  .068
WSFosnr-min   0.53  0.49   0.80  0.67     .075  .063  .092  .074
WSFosnr-com   0.54  0.49   0.74  0.65     .075  .062  .092  .073

                       SSNR                      peak SNR
OSNR          5.48  6.26   3.69  4.41    25.21  28.90  14.09  18.33
WSFosnr-cor   5.57  6.41   4.00  4.81    30.08  35.22  19.59  24.96
WSFosnr-min   5.72  6.41   4.36  5.12    41.49  45.97  25.20  31.39
WSFosnr-com   5.69  6.41   4.33  5.14    40.29  45.36  26.22  32.49

Table 7.6: Summary of the measured average FD, BSD, SSNR and peak SNR values for the WSFosnr
beamformers. The baseline values of the delay-and-sum beamformer with optimal-SNR weighting are
shown for reference (OSNR). The best value in each category (lowest for distortions, highest for SNRs)
is highlighted in boldface.

The noisy data shows significantly more of a difference. Note that for the quiet data the BSD is worsened
by every version of WSF, with the cor spectral estimate type degrading the least of the three estimate types.
This is consistent with the observation above that the cor spectral estimate, with its conservative noise
estimate, performs best on the quiet data, whereas the min and com spectral estimate types are most likely
over-estimating the noise, resulting in a degree of signal distortion that outweighs the noise suppression.
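To make this tradeoff concrete, the sketch below shows a generic single-channel Wiener-style gain. It is an
illustration only, not the channel filters defined by this chapter's equations, and the example numbers are
hypothetical: inflating the noise estimate (as the min and com methods tend to do) deepens the attenuation
everywhere, which buys noise suppression at the price of attenuating, and thereby distorting, low-SNR
speech bins.

    import numpy as np

    def wiener_gain(power_y, noise_est):
        # Generic Wiener-style gain per frequency bin: estimated signal power
        # over observed power. A larger noise estimate pushes the gain toward
        # zero, suppressing noise but also attenuating low-SNR speech.
        return np.maximum(power_y - noise_est, 0.0) / np.maximum(power_y, 1e-12)

    power_y = np.array([1.0, 4.0, 16.0])        # observed bin powers
    print(wiener_gain(power_y, noise_est=1.0))  # conservative (cor-like) estimate
    print(wiener_gain(power_y, noise_est=2.0))  # inflated (com-like) estimate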

[Figure 7.5 diagram: per-channel delay compensation e^{jωτm} of Ym, a first sum forming the beamformer output S, a noise-reduction stage, per-channel filters of the form S·Ym′ / Ym·Ym′, and a final sum. Stage labels: Delay, Sum, N.R. Filter, Sum.]

Figure 7.5: The structure of the Wiener filter-and-sum (WFS) beamformer. This is the same as Figure
5.1 with the addition of a configurable post-filtering step on the first beamformer output. The individual
channels may feed forward into the noise reduction step.

7.3 Wiener Filter-and-Sum


A version of the ad-hoc Wiener filter-and-sum (WFS) strategy described in Section 5.3 is implemented and
evaluated. Figure 7.5 shows the structure of the algorithm. The delay-and-sum beamformer output is
formed, an optional noise-reduction filter is applied, and the result is fed back as a signal reference to
generate Wiener prefilters for each channel.
As described in Section 5.3, the output of the beamformer can be used as a signal reference to pre-filter the
channels individually. This algorithm was implemented and applied to the databases described in
Sections 3.1 and 3.2. The implementation details are as follows (a sketch of the pre-filtering loop appears
after this list):

- For the pre-filtering step a 512-point (32 ms) Hanning window was used, half overlapped, with a
  1024-point FFT.
- Four different signal spectrum estimates were used for the numerator of the Wiener filters:
  1. The power spectrum of the unfiltered DSBF output (bf).
  2. The cross-correlation spectral estimate (cor) (see Section 6.2).
  3. The minimum-statistic noise spectral estimate (min) (see Section 6.1).
  4. The combination spectral estimate (com) (see Section 6.3).
- The same exponential smoothing described in Section 7.2 was used to smooth the spectral estimates
  used in the generation of the channel filters.
- To reduce artifacts, the filters were smoothed by taking them back into the time domain and
  truncating them to 512 points, preserving the linear convolution property of the frequency-domain
  implementation.
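The pre-filtering loop can be sketched as follows. This is a minimal illustration, assuming (from the
SY′/YY′ labels in Figure 7.5) that each channel filter takes the cross-power over auto-power form; the
function name, array shapes, and smoothing constant alpha are illustrative, not taken from the
dissertation's implementation.

    import numpy as np

    def wfs_channel_filters(chan_frames, ref_frames, alpha=0.8,
                            fft_len=1024, taps=512):
        # chan_frames: (M, T, 512) Hanning-windowed, half-overlapped channel frames
        # ref_frames : (T, 512) frames of the (optionally noise-reduced) DSBF output
        # Returns (T, M, taps) filter impulse responses, truncated to `taps`
        # points so the frequency-domain filtering stays a linear convolution.
        M, T, _ = chan_frames.shape
        bins = fft_len // 2 + 1
        cross = np.zeros((M, bins), dtype=complex)   # smoothed S * conj(Ym)
        auto = np.zeros((M, bins))                   # smoothed |Ym|^2
        filters = np.zeros((T, M, taps))
        for t in range(T):
            S = np.fft.rfft(ref_frames[t], fft_len)
            for m in range(M):
                Y = np.fft.rfft(chan_frames[m, t], fft_len)
                # exponential smoothing of the spectral estimates over frames
                cross[m] = alpha * cross[m] + (1 - alpha) * S * np.conj(Y)
                auto[m] = alpha * auto[m] + (1 - alpha) * np.abs(Y) ** 2
                W = cross[m] / (auto[m] + 1e-12)     # Wiener-style channel prefilter
                w = np.fft.irfft(W, fft_len)
                filters[t, m] = w[:taps]             # time-domain truncation
        return filters

Truncating each filter to 512 time-domain points, as in the last item above, is equivalent to smoothing its
frequency response; this suppresses the frame-to-frame filter fluctuations that would otherwise be audible
as artifacts.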

The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with the
WFS algorithm as described above, using 8 and 16 microphones. The results are presented in the sections
that follow.

7.3.1 Subjective Evaluation


The amount of warbling and tonal artifacts in the residual background noise is noticeably lower for the
WFS algorithm than for the WSF algorithm. In particular, the speech processed with WFScor has a more
suppressed and more natural-sounding residual noise than WSFcor. Of the spectral estimate types, the bf
version has the least effective suppression of background noise; the "sound" of the original background
can still be discerned in those recordings. WFScor has a lower level of residual noise than WFSbf, as can
be seen in the spectrograms in Figure 7.6.
[Figure 7.6 image: paired narrowband spectrogram panels, frequency (Hertz) vs. time (sec.); rows labeled bf, cor, min, com below a reference row; left column WFS, right column WFSosnr.]

Figure 7.6: Narrowband spectrograms from the WFS beamformer. The four bottom rows correspond to the
different ways of forming the signal power spectrum estimate. The left-hand column images are generated
from data with no extra pre-filtering and the right-hand column images are from data that was processed
with the OSNR filtering prior to the WFS processing. The spectrograms for the unweighted DSBF and
OSNR outputs are shown at the top for reference. The talker is female and the text being spoken is the
alpha-digit string 'BUCKLE83566'.

The min and com versions sound virtually indistinguishable from each other and have a noticeably lower
level of residual noise than either the bf or cor processing types. The noisy examples are greatly enhanced
by the use of OSNR as a pre-processing step; the spectral peaks in the background noise are greatly
attenuated by the OSNR processing. On the quiet data the effect of OSNR pre-processing is impossible to
distinguish by listening. The WFS recordings exhibit a degree of "breathing" at the transitions from speech
to silence and silence to speech. This "breathing" is more apparent in the quiet recordings (and when the
min and com processing types, with their greater noise suppression, are used), where less residual
background noise is available to mask the artifact. The processed speech also exhibits a varying degree of
echo and similar processing artifacts comparable to that observed with WSF processing. Artifacts are
lower when 16 microphones are used.

7.3.2 Objective Performance


Tables 7.7 and 7.8 summarize the recognition performance of the WFS algorithm. Most notably, the
recognition performance is generally worse with the quiet data.

Database            Quiet                       Noisy
Model        Baseline        MAP         Baseline        MAP
# mics       8      16     8      16     8      16     8      16
DSBF       24.28  19.44  14.03  11.92  60.44  47.59  32.49  23.48
WFSbf      23.46  18.55  14.10  11.88  51.63  35.40  27.89  19.77
WFScor     23.55  19.46  14.28  12.08  48.63  32.93  28.37  19.77
WFSmin     23.24  21.10  14.77  12.79  47.45  32.04  26.91  18.88
WFScom     23.86  20.72  14.59  12.47  46.25  31.42  26.42  19.23

Table 7.7: Recognition performance for the WFS beamformer expressed in % words in error.

Database              Quiet                       Noisy
Model          Baseline        MAP         Baseline        MAP
# mics         8      16     8      16     8      16     8      16
OSNR         23.93  19.01  14.08  11.52  53.92  40.25  27.13  18.81
WFSosnr-bf   23.35  18.12  13.68  11.92  46.25  30.06  25.04  17.92
WFSosnr-cor  22.55  18.97  14.45  12.21  44.16  28.17  25.66  17.75
WFSosnr-min  23.37  20.72  14.30  12.56  41.18  27.13  23.95  16.66
WFSosnr-com  23.64  20.57  14.30  12.52  40.45  26.91  24.06  16.50

Table 7.8: Recognition performance for the WFSosnr beamformer expressed in % words in error.

A slight improvement can be seen in some cases before MAP training, but once MAP training has been
applied any improvement disappears. The noisy data, on the other hand, does show a significant decrease
in error rate, with (as in the preceding section) the com spectral estimate type leading the way.
Tables 7.9 and 7.10 show the measured distortion values for the WFS beamformer. For all measures
except peak SNR, the quiet data shows no improvement with the WFS processing. The noisy data shows
some improvement, though no particular spectral estimate method stands out from the others.

                       FD                          BSD
Database         Quiet        Noisy          Quiet        Noisy
# mics         8     16     8     16       8     16     8     16
DSBF         0.60  0.51   1.17  0.98     .066  .054  .124  .096
WFSbf        0.58  0.50   0.90  0.72     .078  .066  .106  .077
WFScor       0.58  0.51   0.81  0.67     .082  .069  .110  .079
WFSmin       0.56  0.50   0.81  0.67     .090  .076  .109  .088
WFScom       0.57  0.51   0.78  0.66     .090  .076  .112  .089

                      SSNR                      peak SNR
DSBF         5.38  6.17   3.25  4.06    24.53  27.86  10.86  14.90
WFSbf        5.09  5.90   3.57  4.50    30.28  37.62  16.51  24.15
WFScor       5.03  5.84   3.55  4.51    32.89  41.07  18.95  27.45
WFSmin       5.06  5.74   3.79  4.61    41.84  48.92  24.38  32.64
WFScom       5.03  5.72   3.72  4.59    41.12  48.60  24.71  33.33

Table 7.9: Distortion values for the WFS beamformer. The best value in each column is highlighted in
boldface.

                        FD                          BSD
Database          Quiet        Noisy          Quiet        Noisy
# mics          8     16     8     16       8     16     8     16
OSNR          0.59  0.51   1.06  0.85     .065  .053  .106  .083
WFSosnr-bf    0.57  0.49   0.86  0.68     .077  .065  .097  .073
WFSosnr-cor   0.57  0.51   0.80  0.65     .081  .069  .100  .076
WFSosnr-min   0.56  0.51   0.78  0.64     .088  .075  .102  .085
WFSosnr-com   0.57  0.51   0.77  0.64     .089  .075  .104  .085

                       SSNR                      peak SNR
OSNR          5.48  6.26   3.69  4.41    25.21  28.90  14.09  18.33
WFSosnr-bf    5.13  5.92   3.72  4.63    30.58  38.31  18.67  26.54
WFSosnr-cor   5.07  5.87   3.70  4.65    33.22  41.83  21.44  29.94
WFSosnr-min   5.12  5.77   3.96  4.75    42.11  49.61  27.30  36.20
WFSosnr-com   5.07  5.76   3.89  4.73    41.38  49.26  27.34  36.37

Table 7.10: Distortion values for the WFSosnr beamformer. The best value in each column is highlighted
in boldface.

[Figure 7.7 diagram: per-channel delay compensation e^{jωτm} of Ym, estimation of the parameters |S|², σm², and Hm, per-channel filters Φm, and a final sum. Stage labels: Delay, Estimate Parameters, Apply Filters, Sum.]

Figure 7.7: Diagram of the optimal multi-channel Wiener (MCW) beamformer. The delay compensation
stage is followed by a parameter estimation stage which feeds into the channel filters applied before the
final summation.

7.4 Multi-Channel Wiener


Figure 7.7 shows the basic structure of the multi-channel Wiener (MCW) algorithm described in Section
5.2. Equation (5.26) forms the basis of the channel filters. The implementation details are as follows (a
sketch of the filter conditioning and reconstruction steps appears after this list):

- A 512-point analysis window was used with a 1024-point FFT length.
- The channel filters were derived according to Equation (5.26):
  - The noise power spectrum σ²m(ω) for each channel was estimated in the same manner as for
    the OSNR processing (Section 7.1), with the minimum-statistic method described in Section 6.1.
  - The transfer function for each channel (Hm(ω) in Equation (5.26)) was estimated in the same
    way as in the OSNR processing (see Section 7.1), using the normalized average power spectrum
    of each channel after noise subtraction.
  - The signal power (|S|² in the numerator of Equation (5.26)) was estimated from the input
    channel data with the three different methods described in Section 6, the same three methods
    used in Sections 7.2 and 7.3 above.
  - The signal power in the denominator of Equation (5.26) (the |S(ω)|²|Hm(ω)|² term) was
    estimated with the channel power after subtracting the noise estimate, |Ym(ω)|² − σ²m.
- The same exponential smoothing described in Section 7.2 was used to smooth the spectral estimates
  in the numerator of Equation (5.26).
- To guard against divide-by-zero or multiply-by-zero situations, the gain at each frequency was
  constrained to lie within −40 dB to 3 dB by truncating at both ends of the range.
- To reduce artifacts in the reconstructed signal, the post-filter is truncated to 512 points in the time
  domain to preserve the linear convolution property. This corresponds to a smoothing of the filter in
  the frequency domain.
- The filters were applied to the beamformer output in the frequency domain and the result
  reconstructed in the time domain with an overlap-add technique. A Hanning window was used to
  taper the overlapping segments together during reconstruction.
- As in the preceding sections, the three different spectral estimate types were used (cor, min, com).
- For the MCW implementations the osnr version denotes the use of OSNR-weighted data when
  forming the signal spectrum estimate in the numerator of Equation (5.26). As opposed to the
  preceding algorithms, where the OSNR process was used as a pre-processing step, the MCW
  algorithm already incorporates the OSNR weighting, so the MCWosnr implementations use the
  OSNR pre-processing only on the data used in the signal spectrum estimate.
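The gain-limiting, filter-truncation, and overlap-add steps from this list can be sketched as below. This is a
minimal illustration under the stated settings (1024-point FFT, 512-point truncation, −40 dB to +3 dB
limits); the function names, and the choice to clip only the magnitude while preserving the filter phase, are
assumptions rather than details taken from the dissertation.

    import numpy as np

    def condition_filter(W, fft_len=1024, taps=512, floor_db=-40.0, ceil_db=3.0):
        # Constrain the gain at each frequency to [floor_db, ceil_db] to guard
        # against multiply-by-zero and runaway gains, then truncate the filter
        # to `taps` time-domain points (a smoothing in frequency) to preserve
        # the linear convolution property.
        lo, hi = 10 ** (floor_db / 20), 10 ** (ceil_db / 20)
        mag = np.clip(np.abs(W), lo, hi)
        W = mag * np.exp(1j * np.angle(W))       # keep the filter phase
        w = np.fft.irfft(W, fft_len)
        w[taps:] = 0.0                           # time-domain truncation
        return np.fft.rfft(w, fft_len)           # smoothed frequency response

    def overlap_add(frames, hop):
        # Hanning-tapered overlap-add reconstruction of filtered frames.
        # frames: (T, N) time-domain frames; hop: frame advance (N // 2 here,
        # where half-overlapped Hanning tapers sum to approximately unity).
        T, N = frames.shape
        win = np.hanning(N)
        y = np.zeros(hop * (T - 1) + N)
        for t in range(T):
            y[t * hop:t * hop + N] += frames[t] * win
        return y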
[Figure 7.8 image: paired narrowband spectrogram panels, frequency (Hertz) vs. time (sec.); rows labeled cor, min, com below a reference row; left column MCW, right column MCWosnr.]

Figure 7.8: Narrowband spectrograms from the MCW beamformer. The three bottom rows correspond to
the different ways of forming the signal power spectrum estimate. The spectrogram for the unweighted
DSBF output is shown at the top for reference. The talker is female and the text being spoken is the
alpha-digit string 'BUCKLE83566'.

The microphone-array databases (quiet and noise-added) described in Chapter 3 were processed with the
MCW algorithm as described above, using 8 and 16 microphones. The results are presented in the sections
that follow.

7.4.1 Subjective Evaluation


Figure 7.8 shows spectrograms of the example utterance for the MCW-processed speech. The
spectrograms suggest a very strong suppression of the background noise, and this is confirmed by listening.
The increased noise suppression for min and com can also be seen in Figure 7.8, as can the greater
suppression of the noise bands in the OSNR-processed column. The MCW-processed speech has a
distinctly processed quality. The residual background noise is at a lower level than for the corresponding
outputs from WSF or WFS but consists overwhelmingly of warbling tones; the background noise is
squelched to such a degree that any sense of the ambiance of the original recordings is lost, even in the
8-microphone case. The different methods of signal estimation are virtually indistinguishable from each
other by listening.

7.4.2 Objective Performance


Tables 7.11 and 7.12 show the recognition performance for the MCW beamformer. An improvement is
shown for every tabulated case, with the min and com spectral estimation types performing better than the
cor type. The min spectral estimate performs marginally better than the com estimate in most cases. For
the quiet data the difference between MCW and MCWosnr is virtually nonexistent.

Database            Quiet                       Noisy
Model        Baseline        MAP         Baseline        MAP
# mics       8      16     8      16     8      16     8      16
DSBF       24.28  19.44  14.03  11.92  60.44  47.59  32.49  23.48
MCWcor     24.02  18.26  13.36  11.23  51.41  34.27  27.37  18.70
MCWmin     20.26  17.14  12.83  11.36  46.56  30.80  25.77  17.88
MCWcom     20.81  17.37  12.88  11.14  46.79  30.71  25.84  18.35

Table 7.11: Recognition performance for the MCW beamformer expressed in % words in error.

Database              Quiet                       Noisy
Model          Baseline        MAP         Baseline        MAP
# mics         8      16     8      16     8      16     8      16
OSNR         23.93  19.01  14.08  11.52  53.92  40.25  27.13  18.81
MCWosnr-cor  23.15  17.46  13.43  11.34  44.14  30.15  24.31  17.39
MCWosnr-min  20.15  16.86  12.81  11.25  40.63  27.06  22.79  16.61
MCWosnr-com  20.53  17.21  12.74  11.12  41.16  27.44  23.19  16.81

Table 7.12: Recognition performance for the MCWosnr beamformer expressed in % words in error.

For the noisy data the use of the OSNR data in the spectral estimation step does improve performance by
a non-negligible amount. The best-case improvement with MAP training on the quiet data is 21% of the
difference between the DSBF baseline and the close-talking microphone performance, for both the
8-microphone and 16-microphone cases. On the noisy data the improvement is 44% for 16 microphones
and 39% for 8 microphones.
The distortion measures shown in Tables 7.13 and 7.14 are qualitatively similar to those in the preceding
sections. BSD is made worse in all cases, quiet and noisy. FD shows marginal improvement with the quiet
data and somewhat greater improvement with the noisy data. SSNR declines in nearly all cases, though it
declines less for the noisy data than for the quiet data. Peak SNR improves by nearly a factor of 2 in all
cases.

                       FD                          BSD
Database         Quiet        Noisy          Quiet        Noisy
# mics         8     16     8     16       8     16     8     16
DSBF         0.60  0.51   1.17  0.98     .066  .054  .124  .096
MCWcor       0.59  0.52   0.83  0.70     .099  .093  .125  .099
MCWmin       0.56  0.50   0.82  0.69     .106  .097  .125  .109
MCWcom       0.57  0.50   0.81  0.69     .106  .098  .129  .109

                      SSNR                      peak SNR
DSBF         5.38  6.17   3.25  4.06    24.53  27.86  10.86  14.90
MCWcor       4.54  4.98   3.22  4.07    34.86  44.17  21.20  30.84
MCWmin       4.63  4.94   3.52  4.14    53.55  52.44  26.67  36.35
MCWcom       4.57  4.91   3.42  4.14    42.52  51.94  26.54  36.48

Table 7.13: Measured distortion for the MCW beamformer.



                        FD                          BSD
Database          Quiet        Noisy          Quiet        Noisy
# mics          8     16     8     16       8     16     8     16
OSNR          0.59  0.51   1.06  0.85     .065  .053  .106  .083
MCWosnr-cor   0.59  0.51   0.80  0.68     .100  .093  .115  .102
MCWosnr-min   0.56  0.50   0.78  0.66     .105  .097  .123  .113
MCWosnr-com   0.57  0.50   0.78  0.67     .106  .098  .123  .112

                       SSNR                      peak SNR
OSNR          5.48  6.26   3.69  4.41    25.21  28.90  14.09  18.33
MCWosnr-cor   4.59  5.01   3.50  4.16    35.74  45.69  26.48  34.64
MCWosnr-min   4.67  4.96   3.74  4.19    44.28  53.58  32.18  40.44
MCWosnr-com   4.60  4.93   3.67  4.19    43.18  53.03  31.49  39.95

Table 7.14: Measured distortion for the MCWosnr beamformer.



[Figure 7.9 image: bar charts of word error rate (%) for the quiet data (top) and noisy data (bottom), 8 and 16 microphones, baseline HMM (left) and MAP HMM (right); bars for DSBF, OSNR, and the bf, cor, min, and com variants of WSF, WFS, and MCW, with a right-hand axis giving % improvement over DSBF.]

Figure 7.9: Summary of word error rates in % words in error. The bottom level of each graph corresponds
to the close-talking microphone error rate. The axis on the right hand side shows the percent improvement
from the DSBF baseline. That is, the DSBF is 0% improved and the close-talking microphone is 100%
improved.

7.5 Summary
Figures 7.9 and 7.10 show the recognition performance for each tested combination of microphones (8 or
16) and database (quiet or noisy). Figure 7.9 shows the performance for algorithms using the unweighted
channels as input and Figure 7.10 shows the performance for algorithms using the OSNR filtering as a
preprocessing stage. These are the same values tabulated in the previous section, but presented graphically
and side by side to facilitate comparisons over the full range of algorithms.
[Figure 7.10 image: bar charts of word error rate (%) for the quiet data (top) and noisy data (bottom), 8 and 16 microphones, baseline HMM (left) and MAP HMM (right); bars for DSBF, OSNR, and the bf, cor, min, and com variants of WSFosnr, WFSosnr, and MCWosnr, with a right-hand axis giving % improvement over DSBF.]

Figure 7.10: Word error rates with OSNR input in % words in error. The bottom level of each graph
corresponds to the close-talking microphone error rate. The axis on the right hand side shows the percent
improvement from the DSBF baseline. That is, the DSBF is 0% improved and the close-talking microphone
is 100% improved.

Looking at the MAP-HMM column of Figure 7.9: for tests of the quiet data, OSNR, WSF and MCW
processing all improve recognition rates, while WFS reduces recognition performance. For tests of the
noisy data every filtering strategy improves recognition performance, though the OSNR filtering
out-performs all but the MCW algorithm. This is a strong result considering that the OSNR is the only
algorithm in this comparison (apart from DSBF) that is distortionless. That is, OSNR has an overall flat
system frequency response, whereas the other methods (WSF, WFS, MCW) all impose a non-uniform
overall frequency weighting that distorts the spectrum.

[Figure 7.11 image: bar charts of word error rate (%) and % improvement for the quiet data (left) and noisy data (right); bars for DSBF, OSNR, WSFcor, WSFcom, MCWosnr-min, and MCWosnr-com.]

Figure 7.11: The best performing filtering schemes using 16 microphones and MAP training. These values
are culled from Figures 7.9 and 7.10.

In Figure 7.10 the results for the quiet data show a trend similar to Figure 7.9: improved performance for
every strategy except WFS, for which performance declines. Unlike the tests without OSNR
pre-processing, for the noisy data the performance of WFSosnr, WSFosnr and MCWosnr is in every case
better than the performance of the OSNR filtering alone. The gains from the OSNR weighting and the
gains from the noise-reduction filtering which follows it are additive, which is not unexpected. Using the
OSNR as a pre-processing step simultaneously improves the signal estimate available to the filtering step
and provides an inter-microphone weighting that is missing from the WFS and WSF processing types.
MCW already incorporates the non-uniform microphone weighting but gains from using the
OSNR-weighted data for the spectral estimate.
The relatively poor performance of the WFS algorithms on the quiet data may seem somewhat
counter-intuitive, since the WFS algorithms are arguably the best-sounding processing types on the quiet
data, at least in terms of adding the least amount of artifacts to the processed speech. The ad-hoc nature
of the WFS algorithm, and the way it may distort the spectrum, seem to be reflected in the recognition
results. WFS is the only algorithm that isn't based directly on an optimization, and it's also the only
algorithm that reduces the recognition performance on quiet data. This probably isn't a coincidence.
To better compare the best performing algorithms, Figure 7.11 shows a subset of the results in Figures 7.9
and 7.10 using 16 microphones and the MAP model for both quiet and noisy databases. With the quiet
data the performance of the MCW and WSF variants is virtually identical. With the noisy data the MCW
outperforms WSF, but the versions with OSNR included are effectively tied again. As discussed in Section
5.2.3, WSFosnr and MCWosnr use the same inter-microphone weighting function and differ only in the
specifics of the final frequency shaping. What the results here show is that the difference in frequency
weighting between the WSF and MCW methods is not significant enough to affect the recognition
performance: the difference in performance between WSF and MCW goes away when the OSNR
weighting is used equally in both methods.
Figures 7.12 and 7.13 graphically summarize the values of the various distortion measures applied to the
filtering algorithms.⁴ In Figure 7.12 the FD measure varies only slightly with the different algorithms
when used on the quiet data; with the noisy data, on the other hand, every algorithm significantly lowers
the measured FD. This is not unlike the recognition results, where with the noisy data any distortion
introduced by the processing methods is outweighed by the degree to which they suppress the noise. For
the quiet data the level of noise is low enough that the gain from reducing it and the penalty paid for
introducing filtering distortions are much more similar in magnitude.
The BSD values measured on the quiet data increase for all algorithms. The increase for WSFcor is
minimal but virtually every other algorithm shows a significant increase. The increase in BSD is generally
greater for those methods with greater noise suppression: the com and min spectral estimates generally
result in a higher BSD, and these two methods generally produce a larger estimate of the noise (and
greater corresponding noise suppression) than the cor method. The measurements on the noisy data show
a similar upward trend in the BSD, though in this case only the MCW values are worse than the DSBF and
OSNR baselines. The SSNR values shown in Figure 7.13 show the complementary trend with slight
differences.

⁴ The values for the implementations using OSNR pre-processing are qualitatively extremely similar and are not plotted here.
[Figure 7.12 image: bar charts of FD (top) and BSD (bottom) for quiet and noisy data, 8 and 16 microphones; bars for DSBF, OSNR, and the bf, cor, min, and com variants of WSF, WFS, and MCW.]

Figure 7.12: Summary of FD and BSD values measured on the variety of beamforming algorithms.

The similarity of these trends is entirely expected, since the SSNR measurement is essentially a
linear-frequency version of the BSD measurement. In this measurement the WSF algorithms show slight
improvement even on the quiet data, and the MCW algorithms (as with the BSD measurements) still show
the worst performance by this measure. This ordering is reversed on the peak SNR graphs: every
algorithm shows a significant increase in SNR, and the WFS and MCW algorithms show greater SNR than
the WSF algorithm. These results point towards the tradeoff between introducing distortion and
suppressing noise. The more aggressively the noise is suppressed (indicated by SNR), the more unwanted
signal distortion (indicated by BSD and SSNR) will creep in. The surprise is that the FD measure does not
follow the other distortion measures as tightly as it did in Chapter 3. Despite having the worst BSD
performance in the group, the MCW algorithms' FD scores and recognition performance are among the
best observed.
[Figure 7.13 image: bar charts of SSNR (top) and peak SNR (bottom), in dB, for quiet and noisy data, 8 and 16 microphones; bars for DSBF, OSNR, and the bf, cor, min, and com variants of WSF, WFS, and MCW.]

Figure 7.13: Summary of SSNR and SNR values measured on the variety of beamforming algorithms.



Figure 7.14 shows scatter plots of the recognition error rate for each of the 84 individual trials⁵ as a
function of each distortion measure, along with a superimposed linear fit. The RMS linear fit errors for the
baseline-HMM and the MAP-HMM are shown below each plot. By far, FD shows the strongest linear
correlation with recognition error rate, with SSNR, SNR and BSD following in that order. Note that the
MAP-HMM and baseline-HMM linear trends all intersect at an error rate of approximately 5%.

⁵ Four trials for OSNR (8 and 16 microphones, quiet and noisy), 24 trials each for WSF and MCW (8 and 16 microphones, quiet and
noisy, OSNR pre-processed or not, 3 spectral estimate types), and 32 trials for WFS (8 and 16 microphones, quiet and noisy, OSNR
pre-processed or not, 4 spectral estimate types).
[Figure 7.14 image: scatter plots of word error rate (%) vs. FD, BSD, SSNR (dB), and SNR (dB), each with linear fits. RMS fit errors: FD 2.8% (Base), 1.8% (MAP); BSD 8.9% (Base), 4.4% (MAP); SSNR 6.0% (Base), 2.9% (MAP); SNR 6.1% (Base), 3.2% (MAP).]

Figure 7.14: Scatter plots of error rate and distortion measures. Each figure plots word error rate as a
function of the measured distortion values FD, BSD, SSNR, and SNR. One marker type denotes results
from the baseline HMM and the other denotes the results after MAP training. There are 84 data points.
The linear fit to each set of points is overlaid, and the RMS errors from the linear fit for the baseline-HMM
and the MAP-HMM are shown below each plot.

This is somewhat better than the reference close-talking microphone error rate of 8.16%, suggesting that
data points at lower distortions and error rates than those plotted here would fall above the linear trend
shown here. It is interesting, though, that the MAP-HMM and baseline-HMM linear trends for the
different measures intersect at such similar performance points.
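For reference, the superimposed fits and their RMS errors reduce to a few lines of, for example, numpy.
The arrays below are hypothetical placeholder values, not the 84 measured trials.

    import numpy as np

    # Hypothetical illustration values: one distortion value (FD) and one
    # word error rate (%) per trial.
    fd = np.array([0.49, 0.54, 0.66, 0.78, 0.98])
    wer = np.array([11.1, 13.4, 18.4, 23.3, 32.5])

    slope, intercept = np.polyfit(fd, wer, 1)    # superimposed linear fit
    residuals = wer - (slope * fd + intercept)
    rmse = np.sqrt(np.mean(residuals ** 2))      # RMS fit error, in % WER
    print(f"WER ~= {slope:.1f}*FD + {intercept:.1f}  (RMSE {rmse:.2f}%)")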
The corresponding RMS linear-fit errors for the baseline measurements made on the DSBF data are shown
in Table 3.9. Compared to the measurements made on the DSBF-processed data in Section 3.3.2, the
distortion measurements here are generally less correlated with error rate, the baseline-HMM error rate
especially. Only FD has a better linear fit here than with the DSBF measurements. This disparity exists
despite the restricted range over which the error rates here fall; in Section 3.3.2 a good deal of the linear fit
error was due to a strong nonlinearity in the relation between distortion measures and error rates. In the
measurements presented here the trends appear quite linear, but with a greater variance about the trend.
The increase in overall apparent linearity is at least in part because the range of error rates observed in this
data is significantly smaller than the range observed in Section 3.3.2.
Figure 7.15 shows scatter plots of the FD, BSD, and SNR as functions of each other, for measurements
taken on the noise-suppressed data (OSNR, WSF, WFS, MCW) and for measurements taken on the DSBF
baseline data presented in Chapter 3. The greatly reduced correlation between measures seen with the
noise-reduction algorithms is readily apparent. With the DSBF measurements, the successive addition of
microphones yields data points that travel somewhat continuously through all the measurement spaces.
With the nonlinear nature of the noise-suppression algorithms this property appears to no longer hold.
[Figure 7.15 image: scatter plots of BSD vs. FD, SNR vs. FD, and SNR vs. BSD, with separate marks for the noise-suppression algorithms (WSF, WFS, MCW, OSNR) and for DSBF.]

Figure 7.15: Scatter plots comparing the correlation of distortion measures for the noise-suppression
algorithms (OSNR, WSF, WFS, MCW) and for DSBF. One set of marks represents the values measured in
Chapter 3 on the output of the DSBF; the other set represents the values measured from the noise-reduction
algorithms OSNR, WSF, WFS, and MCW.

In Figure 7.15 the scatter plot of BSD and SNR reflects the low correlation between these measures,
compared to the relatively tight correlation that was observed with the DSBF measurements. Figure 7.16
repeats the scatter plots from Figure 7.15 but breaks each algorithm out into its own marker and linear fit
line. From this it is apparent that within each algorithm type the correlation between measures is much
stronger than it is between algorithm types, though certainly the reduced number of points in each cluster
contributes somewhat to that perception. In particular, for the FD vs. BSD scatter plot the difference
between the algorithms is primarily a different bias in BSD for each algorithm. The FD vs. SNR plot
shows quite a bit less separation between the different algorithms. Note also that, taken on their own, the
OSNR points fall very neatly along a linear trend, much more like the DSBF measurements in Chapter 3.
In fact the OSNR measurements generally fall very close to the trends established by the DSBF data, as
shown in Figure 7.17. In contrast to Figure 7.15, the OSNR measurements taken alone are quite consistent
with the trends set by the DSBF measurements. This reflects how closely related the OSNR algorithm is
to the DSBF algorithm: it does not employ the Wiener-filtering noise suppression that is common to all
the other algorithm types.
[Figure 7.16 image: scatter plots of BSD vs. FD, SNR vs. FD, and SNR vs. BSD, with separate markers and linear-fit lines for OSNR, WSF, WFS, and MCW.]

Figure 7.16: Scatter plot of distortion measurements by algorithm type. When broken out into the different
algorithm types the distortion measures show a stronger linear correlation with each other.
[Figure 7.17 image: scatter plots of BSD vs. FD, SNR vs. FD, and SNR vs. BSD for the OSNR and DSBF measurements.]

Figure 7.17: Scatter plots of the OSNR distortion measurements along with the DSBF measurements. The
OSNR measurements fall much closer to the DSBF trends than those of the other tested algorithms.
Chapter 8: Summary and Conclusions

The goal of this work was to measure the performance of a delay-and-sum beamformer and to investigate
techniques for improving upon that performance. Several measures by which to judge performance were
introduced in Chapter 2. The measures introduced vary from traditional signal-to-noise ratio measures
(SNR, SSNR) to perceptually motivated measures that more closely reflect subjective speech quality
(BSD). The feature distortion (FD) measure was also introduced as an attempt to predict the performance
of a speech recognition system. In Chapter 3 a database of microphone-array recordings was described.
This database of recordings was originally collected to make direct comparisons between the performance
of the microphone array and a close-talking microphone in a speech recognition task [6]. Because the
microphone-array recordings include simultaneous recordings with a close-talking microphone, signal
quality measures that require a reference signal (FD, SSNR, BSD) could be used to evaluate the results of
beamforming algorithms. Chapter 3 also describes how a high-noise database was created by adding noise
recorded by the same microphone array to the original, relatively quiet, recordings, and reports the
performance of a delay-and-sum beamformer using from 1 to 16 microphones. The results were evaluated
with the measures described in Chapter 2 and with the performance of an alphadigit speech recognition
system. The MAP retraining method was used to adapt the speech recognition models and optimize the
performance on the novel microphone-array data. Chapter 4 used simulations of a linear microphone array
to investigate the limits of delay-and-sum techniques in noisy and reverberant environments. Motivated by
the results in Chapter 4, Chapter 5 extended MMSE optimizations for single-channel signal enhancement
to an optimal multi-input solution (MCW). The optimal multi-input solution was solved for
signal-plus-noise and filtered-signal-plus-noise models of the received signal. In addition, an intuitively
appealing but non-optimal filter-and-sum approach (WFS) was presented and analyzed. Chapter 6
describes some methods for generating the spectral estimate required for the implementation of the Wiener
filtering strategies, including a novel combination of cross-spectrum and minimum-noise-subtraction
spectrum estimation. Finally, Chapter 7 presents implementations and evaluations of the various speech
enhancement algorithms. Significant points of the results include:
- Overall the noise-reduction techniques were quite successful in improving recognition performance,
  reducing the gap between the DSBF performance and the close-talking microphone performance by
  up to 27% on the quiet data and 45% on the noisy data.
- The OSNR weighting is very successful in the noisy data tests, outperforming all but the MCW
  algorithm. This is significant in that, unlike the other algorithms, the OSNR weighting is a
  distortion-free filtering.
- The MCW algorithm has the best speech recognition performance on the noisy data and is within
  the smallest of margins of the WSF algorithm on the quiet data.
- When OSNR is used as a pre-processing step, the MCW and WSF algorithms perform nearly
  identically on the speech-recognition task. The OSNR pre-processing is the deciding factor in the
  speech recognition performance; the difference between the frequency weightings of the WSF and
  MCW algorithms is insignificant by comparison.
- The min and com spectrum estimates generally result in better recognition scores and worse
  distortion scores than the cor cross-spectrum method. This is largely due to the generally larger
  noise spectrum estimates from these two methods.
- The WFS algorithm has the worst speech recognition scores and distortion measures of the three
  Wiener filtering schemes, although it shows strong improvement in SNR and in informal evaluations
  of subjective quality. WFS is the only algorithm that has worse recognition performance than the
  DSBF on the quiet data set.
- The FD measure does a consistently good job of predicting speech recognition performance.
- The Wiener-based methods show very different relationships between measurements than the DSBF
  and OSNR algorithms. FD is still strongly related to speech recognition performance, but the strong
  relationships with BSD, SSNR and SNR observed with the DSBF tests are not seen here. This was
  foreseen in Chapter 3; the DSBF is unique in that adding microphones simultaneously reduces the
  noise and enhances the signal in a fairly uniform manner. The Wiener filtering strategies, on the
  other hand, are based upon amplifying the signal in high-SNR regions and squelching it in low-SNR
  regions, and do so in a nonlinear fashion. The result is that noise is suppressed at the cost of
  increased signal distortion.
- The MAP training technique was very effective at tuning the recognizer to the novel data, often
  reducing the error rate by nearly 50%. On the other hand, the baseline-HMM recognition
  performance closely follows the MAP-HMM performance; for comparing the performance of two
  speech enhancement methods it may not be necessary to do MAP retraining, since the performance
  given by the baseline model may reflect the MAP results sufficiently well.

8.1 Directions for Further Study


Throughout this work no attempt was made to incorporate a speech model, nor was any specific noise
model imposed. Incorporating a speech model certainly has the potential to improve the recovered speech
by imposing constraints on the trajectory of the estimated speech signal rather than relying upon
unconstrained non-parametric spectral estimates [86, 87, 88, 89]. The difficulty lies in having a model that
can simultaneously represent all sorts of speech accurately while being sufficiently constrained to avoid
modeling noise elements. In a similar manner, some gain may be realized by using a noise model. Such
models could represent a particular method of noise production or could attempt to track noise sources
with specific statistical constraints. The large variety of noise types that may be encountered (narrowband,
broadband, coherent, ambient, impulsive) indicates that a flexible model or multiple simultaneous models
would be required for accurate modeling. Ultimately the speech production model could be integrated
with the speech recognition system, providing a single global model guiding optimal filtering for subjective
quality and for speech recognition performance within one speech modeling framework.
From a strictly signal-processing point of view, all the processing herein would most likely be enhanced by
the use of wavelet transforms or some other nonlinearly spaced filterbank processing [41]. The
linearly-spaced FFT is a very convenient tool, but ideally the signal processing would be tailored to the
sensitivity of the human auditory system. The features used in the Bark spectral distortion and the
Mel-warped features used by the speech recognition system are based on nonlinear frequency scales (Bark
and Mel, respectively), though the underlying processing is done with linear filterbanks. Why not
incorporate the varying frequency resolution (and at the same time a varying time resolution) into the
underlying signal-processing front end and the corresponding noise reduction? An auditory model could
be incorporated to help determine where and when in the received signal the greatest noise-reduction
gains can be achieved, or where the greatest penalty for added distortion will be incurred [90, 91].
Bibliography

[1] M. S. Brandstein and D. B. Ward, editors. Microphone Arrays: Signal Processing Techniques and
Applications. Springer Verlag, 2001.
[2] J. L. Flanagan, A. Surendran, and E. Jan. Spatially selective sound capture for speech and audio
processing. Speech Communication, 13(1-2):207–222, 1993.
[3] Y. Grenier. A microphone array for car environments. In Proceedings of ICASSP-92 [92], pages
305–309.
[4] W. Kellerman. A self-steering digital microphone array. In Proceedings of ICASSP-91 [93], pages
3581–3584.
[5] J. Adcock, J. DiBiase, M. Brandstein, and H. F. Silverman. Practical issues in the use of a
frequency-domain delay estimator for microphone-array applications. In Proceedings of Acoustical
Society of America Meeting, Austin, Texas, November 1994.
[6] J. Adcock, Y. Gotoh, D. J. Mashao, and H. F. Silverman. Microphone-array speech recognition via
incremental MAP training. In Proceedings of ICASSP-96 [94], pages 897–900.
[7] J. L. Flanagan. Bandwidth design for speech-seeking microphone arrays. In Proceedings of
ICASSP-85, pages 732–735, Tampa, FL, March 1985. IEEE.
[8] J. L. Flanagan, D. Berkley, G. Elko, J. West, and M. Sondhi. Autodirective microphone systems.
Acustica, 73:58–71, 1991.
[9] S. Oh, V. Viswanathan, and P. Papamichalis. Hands-free voice communication in an automobile with
a microphone array. In Proceedings of ICASSP-92 [92], pages 281–284.
[10] H. F. Silverman. Some analysis of microphone arrays for speech data acquisition. IEEE Trans.
Acoust. Speech Signal Process., ASSP-35(2):1699–1712, December 1987.
[11] C. Che, M. Rahim, and J. Flanagan. Robust speech recognition in a multimedia teleconferencing
environment. J. Acoust. Soc. Am., 92(4, pt.2):2476(A), 1992.
[12] D. Giuliani, M. Omologo, and P. Svaizer. Talker localization and speech recognition using a
microphone array and a cross-power spectrum phase analysis. In Proceedings of ICSLP, volume 3,
pages 1243–1246, September 1994.
[13] Maurizio Omologo and Piergiorgio Svaizer. Acoustic event localization using a
crosspower-spectrum phase based technique. In Proceedings of ICASSP-94, volume II, pages
273–276, Adelaide, Australia, April 1994. IEEE.

[14] M. Omologo and P. Svaizer. Use of the cross-power spectrum phase in acoustic event localization.
Technical Report Technical Report No. 9303-13, IRST, Povo di Trento, Italy, March 1993.
[15] B. D. Van Veen and K. M. Buckley. Beamforming: A versatile approach to spatial filtering. IEEE
ASSP Magazine, 5(2):4–24, April 1988.

[16] J. L. Flanagan, J. D. Johnson, R. Zahn, and G. W. Elko. Computer-steered microphone arrays for
sound transduction in large rooms. J. Acoust. Soc. Am., 78(5):1508–1518, November 1985.
[17] H. F. Silverman. Some analysis of microphone arrays for speech data acquisition. LEMS Technical
Report 27, LEMS, Division of Engineering, Brown University, Providence, RI 02912, September
1986.
[18] Masato Miyoshi and Yutaka Kaneda. Inverse filtering of room acoustics. IEEE Transactions on
Acoustics, Speech, and Signal Processing, 36(2):145–152, February 1988.
[19] Hideaki Yamada, Hong Wang, and Fumitada Itakura. Recovering of broad band reverberant speech
signal by sub-band MINT method. In Proceedings of ICASSP-91 [93], pages 969–972.
[20] S. T. Neely and J. B. Allen. Invertibility of a room impulse response. J. Acoust. Soc. Amer.,
66(1):165–169, July 1979.
[21] J. Mourjopolous. On the variation and invertibility of room impulse response functions. Journal of
Sound and Vibration, 102(2):217–228, 1985.
[22] Takafumi Hikichi and Fumitada Itakura. Time variation of room acoustic transfer functions and its
effects on a multi-microphone dereverberation approach. Preprint received at 2nd International
Workshop on Microphone Arrays, Rutgers University, NJ, 1994.
[23] E. Jan, P. Svaizer, and J. Flanagan. Matched-filter processing of microphone array for spatial volume
selectivity. In Proceedings of ICASSP-95 [95], pages 1460–1463.
[24] O. L. Frost. An algorithm for linearly constrained adaptive array processing. Proceedings of the
IEEE, 60(8):926–935, August 1972.
[25] L. J. Griffiths and C. W. Jim. An alternative approach to linearly constrained adaptive beamforming.
IEEE Transactions on Antennas and Propagation, AP-30(1):27–34, January 1982.
[26] B. Widrow, P. E. Mantey, L. J. Griffiths, and B. B. Goode. Adaptive antenna systems. Proceedings of
the IEEE, 55:2143–2159, 1967.
[27] B. Widrow, J. R. Glover, J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Ziedler,
E. Dong, and R. C. Goodlin. Adaptive noise cancelling: Principles and applications. Proceedings of
the IEEE, 63(12):1692–1716, December 1975.
[28] Osamu Hoshuyama and Akihiko Sugiyama. A robust adaptive beamformer for microphone arrays
with a blocking matrix using constrained adaptive filters. In Proceedings of ICASSP-96 [94], pages
925–928.
[29] Jens Meyer and Carsten Sydow. Noise cancelling for microphone arrays. In Proceedings of
ICASSP-97 [96], pages 211–213.
[30] Joerg Bitzer, Klaus Uwe Simmer, and Karl-Dirk Kammeyer. Multi-microphone noise reduction by
post-filter and superdirective beamformer. In Proceedings of International Workshop on Acoustic
Echo and Noise Control, pages 100–103, Pocono Manor, USA, September 1999.
[31] Peter L. Chu. Superdirective microphone array for a set-top videoconferencing system. In
Proceedings of ICASSP-97 [96], pages 235–2358.

[32] J. Kates. Superdirective arrays for hearing aids. Journal of the acoustical society of america,
94(4):1930–1933, 1993.
[33] S. F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on
Acoustics, Speech and Signal Processing, 27(2):113–120, April 1979.
[34] Levent Arslan, Alan McCree, and Vishu Viswanathan. New methods for adaptive noise suppression.
In Proceedings of ICASSP-95 [95], pages 812–815.
[35] T. S. Sun, S. Nandkumar, J. Carmody, J. Rothweiler, A. Goldschen, N. Russell, S. Mpasi, and
P. Green. Speech enhancement using a ternary-decision based filter. In Proceedings of ICASSP-95
[95], pages 820–823.
[36] R. J. McAulay and M. L. Malpass. Speech enhancement using a soft-decision noise suppression
filter. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28:137–145, 1980.
[37] E. Bryan George. Single-sensor speech enhancement using a soft-decision/variable attenuation
algorithm. In Proceedings of ICASSP-95 [95], pages 816–819.
[38] R. Zelinski. A microphone array with adaptive post-filtering for noise reduction in reverberant
rooms. In Proceedings of ICASSP-88, pages 2578–2580, New York, April 1988. IEEE.
[39] Claude Marro, Yannick Mahieux, and K. Uwe Simmer. Analysis of Noise Reduction and
Dereverberation Techniques Based on Microphone Arrays with Postfiltering. IEEE Transactions on
Speech and Audio Processing, 6(3):240–259, May 1998.
[40] Joerg Meyer and Klaus Uwe Simmer. Multi-channel speech enhancement in a car environment using
Wiener filtering and spectral subtraction. In Proceedings of ICASSP-97 [96], pages 1167–1171.
[41] Djamila Mahmoudi and Andrzej Drygajlo. Combined Wiener and coherence filtering in wavelet
domain for microphone array speech enhancement. In Proceedings of ICASSP-98 [97], pages
385–389.
[42] T. E. Tremain, M. A. Kohler, and T. G. Champion. Philosophy and goals of the DoD 2400 bps
vocoder selection process. In Proceedings of ICASSP-96 [94], pages 1137–1140.
[43] Matthew R. Bielefeld and Lynn M. Supplee. Developing a test program for the DoD 2400 bps
vocoder selection process. In Proceedings of ICASSP-96 [94], pages 1141–1144.
[44] John D. Tardelli and Elizabeth Woodard Kreamer. Vocoder intelligibility and quality test methods. In
Proceedings of ICASSP-96 [94], pages 1145–1148.
[45] Elizabeth Woodard Kreamer and John D. Tardelli. Communicability testing for voice coders. In
Proceedings of ICASSP-96 [94], pages 1153–1156.
[46] M. A. Kohler, Philip A. LaFollette, and Matthew R. Bielefeld. Criteria for the DoD 2400 bps
vocoder selection. In Proceedings of ICASSP-96 [94], pages 1161–1164.
[47] M. A. Kohler, Philip La Follette, and Matthew R. Bielefeld. Criteria for the DoD 2400 bps vocoder
selection. In Proceedings of ICASSP-96 [94], pages 1161–1164.
[48] Schuyler R. Quackenbush, Thomas P. Barnwell III, and Mark A. Clements. Objective Measures of
Speech Quality. Prentice Hall, Englewood Cliffs, NJ, 1988.
[49] K. Lam, O. Au, C. Chan, K. Hui, and S. Lau. Objective speech quality measure for cellular phone. In
Proceedings of ICASSP-96 [94], pages 487–490.
[50] Shihua Wang, Andrew Sekey, and Allen Gersho. An objective measure for predicting subjective
quality of speech coders. IEEE Journal on Selected Areas in Communications, 10(5):819–829, June
1992.
[51] Wonho Yang, Majid Benbouchta, and Robert Yantorno. Performance of the modified Bark spectral
distortion as an objective speech quality measure. In Proceedings of ICASSP-98 [97], pages
541–544.
[52] Wonho Yang and Robert Yantorno. Improvement of MBSD by scaling noise masking threshold and
correlation analysis with MOS difference instead of MOS. In Proceedings of ICASSP-99, Phoenix,
Arizona, April 1999. IEEE.
[53] John R. Deller, Jr., John G. Proakis, and John H. L. Hansen. Discrete-Time Processing of Speech
Signals. Prentice Hall, Upper Saddle River, NJ, 1987.
[54] E. Zwicker and H. Fastl. Psychoacoustics Facts and Models. Springer-Verlag, 1990.
[55] D. W. Robinson and R. S. Dadson. A re-determination of the equal-loudness relations for pure tones.
British Journal of Applied Physics, 7:166–181, May 1956.
[56] James D. Johnston. Transform coding of audio signals using perceptual noise criteria. IEEE Journal
on Selected Areas in Communications, 6(2):314–323, Feb 1988.

[57] D. J. Mashao, Y. Gotoh, and H. F. Silverman. Analysis of LPC/DFT features for an HMM-based
alphadigit recognizer. IEEE Signal Processing Letters, 3(4):103–106, April 1996.
[58] H. F. Silverman and Yoshihiko Gotoh. On the implementation and computation of training an HMM
recognizer having explicit state durations and multiple-feature-set, tied-mixture output probabilities.
LEMS Technical Report 129, LEMS, Division of Engineering, Brown University, Providence, RI
02912, December 1993.
[59] M. Hochberg, J. Foote, and H. Silverman. The LEMS talker-independent connected speech
alphadigit recognition system. Technical Report 82, LEMS, Division of Engineering, Brown
University, Providence RI, 1991.
[60] Lawrence R. Rabiner and Biing Hwang Juang. Fundamentals of Speech Recognition. Prentice-Hall,
Englewood Cliffs, N.J., 1993.
[61] Stefan Gustafsson, Peter Jax, and Peter Vary. A novel psychoacoustically motivated audio
enhancement algorithm preserving background noise characteristics. In Proceedings of ICASSP-98
[97], pages 397–400.
[62] Yohichi Yohkura. A weighted cepstral distance measure for speech recognition. IEEE Trans. on
Acoustics Speech and Signal Processing, 35(10):1414–1422, 1987.
[63] S. E. Kirtman and H. F. Silverman. A user-friendly system for microphone-array research. In
Proceedings of ICASSP-95 [95], pages 3015–3018.
[64] Maurizio Omologo and Piergiorgio Svaizer. Acoustic source location in noisy and reverberant
environment using CSP analysis. In Proceedings of ICASSP-96 [94], pages 921–924.
[65] P. Svaizer, M. Matassoni, and M. Omologo. Acoustic source location in a three-dimensional space
using crosspower spectrum phase. In Proceedings of ICASSP-97 [96], pages 231–234.
[66] M. Omologo and P. Svaizer. Use of the cross-power spectrum phase in acoustic event localization.
IEEE Transactions on Speech and Audio Processing, 5(3):288–292, 1997.
[67] S. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, first
edition, 1993.
[68] Y. Gotoh, M. M. Hochberg, D. J. Mashao, and H. F. Silverman. Incremental MAP estimation of
HMMs for efficient training and improved performance. In Proceedings of ICASSP-95 [95], pages
457–460.
[69] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Society, series B, 39(1):1–38, 1977.
[70] Radford M. Neal and Geoffrey E. Hinton. A new view of the EM algorithm that justifies incremental
and other variants. Submitted to Biometrika, 1993.
[71] Jean-Luc Gauvain and Chin-Hui Lee. Maximum a posteriori estimation for multivariate Gaussian
mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing,
2(2):291–298, April 1994.
[72] Y. Gotoh and H. F. Silverman. Incremental ML estimation of HMM parameters for efficient training.
In Proceedings of ICASSP-96 [94].
[73] Y. Gotoh, M. M. Hochberg, and H. F. Silverman. Efficient training algorithms for HMMs using
incremental estimation. IEEE Transactions on Speech and Audio Processing, 6(6):539–548,
November 1998.
[74] William W. Seto. Schaum’s Outline of Theory and Problems of Acoustics. Schaum’s Outline Series.
McGraw-Hill Publishing Company, New York, 1971.
[75] F. Pirz. Design of a wideband, constant beamwidth, array microphone for use in the near field. Bell
System Technical Journal, 58(8):1839–1850, October 1979.
[76] M. Goodwin and G. Elko. Constant beamwidth beamforming. In Proceedings of ICASSP-93 [98],
pages 169–172.
[77] J. Lardies. Acoustic ring array with constant beamwidth over a very wide frequency range. Acoust.
Letters, 13(5):77–81, 1989.
[78] William Mendenhall, Dennis D. Wackerly, and Richard L. Scheaffer. Mathematical Statistics with
Applications. The Duxbury Series in Statistics and Decision Sciences. PWS-KENT, Boston,
Massachusetts, fourth edition, 1990.
[79] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art
of Scientific Computing. Cambridge University Press, Cambridge, UK, 2nd edition, 1992.
[80] J. B. Allen and D. A. Berkley. Image method for efficiently simulating small room acoustics. J.
Acoust. Soc. Am., 65(4):943–950, April 1979.
[81] D. Johnson and D. Dudgeon. Array Signal Processing: Concepts and Techniques. Prentice Hall, first
edition, 1993.
[82] Peter L. Chu. Desktop mic array for teleconferencing. In Proceedings of ICASSP-95 [95], pages
2999–3002, volume 5.
[83] H. G. Hirsch and C. Ehrlicher. Noise estimation techniques for robust speech recognition. In
Proceedings of ICASSP-95 [95], pages 153–156.
[84] Sven Fischer and Karl-Dirk Kammeyer. Broadband beamforming with adaptive postfiltering for
speech acquisition in noisy environments. In Proceedings of ICASSP-97 [96], pages 359–363.
[85] Regine Le Bouquin-Jeannes, Ahmad Akbari Azirani, and Gerard Faucon. Enhancement of speech
degraded by coherent and incoherent noise using a cross-spectral estimator. IEEE Transactions on
Speech and Audio Processing, 5(5):484–487, September 1997. Correspondence.
[86] Chang D. Yoo and Jae S. Lim. Speech enhancement based on the generalized dual excitation model
with adaptive analysis window. In Proceedings of ICASSP-95 [95], pages 832–835.
[87] C. d’Alessandro, B. Yegnanarayana, and V. Darsinos. Decomposition of speech signals into
deterministic and stochastic components. In Proceedings of ICASSP-95 [95], pages 760–763.
[88] John Hardwick, Chang D. Yoo, and Jae S. Lim. Speech enhancement using the dual excitation
speech model. In Proceedings of ICASSP-93 [98], pages 367–370.
[89] Zenton Goh, Kah Chye Tan, and B. T. G. Tan. Speech enhancement based on a voiced-unvoiced
speech model. In Proceedings of ICASSP-98 [97], pages 401–404.
[90] Lance Riek and Randy Goldberg. A Practical Handbook of Speech Coders. CRC Press, Boca Raton,
FL, 2000.
[91] Nathalie Virag. Speech enhancement based on masking properties of the auditory system. In
Proceedings of ICASSP-95 [95], pages 796–799.
[92] IEEE. International Conference on Acoustics, Speech, and Signal Processing, San Francisco, CA,
March 1992.
[93] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, May
1991.
[94] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, May
1996.
[95] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, May
1995.
[96] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany,
April 1997.
[97] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington,
May 1998.
[98] IEEE. International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, MN,
April 1993.