
2010 2nd International Conference on Signal Processing Systems (ICSPS)

An Endpoint Detection Algorithm Based on MFCC and Spectral Entropy Using BP NN


Haiying Zhang
Software School, Xiamen University, Xiamen, China
zhang2002@xmu.edu.cn

Hailong Hu
Software School, Xiamen University, Xiamen, China
hhl417@163.com

Abstract: Endpoint detection is the preliminary stage of speech signal processing, and it is vital to speech recognition. Most recent endpoint detection algorithms give satisfactory results at high SNRs (signal-to-noise ratios) but may fail when the noise level is excessive. In this paper, a novel endpoint detection algorithm based on 12-order MFCC and spectral entropy in the framework of a BP neural network is presented. Experiments show that the proposed method is more reliable and efficient at low SNRs than the traditional methods based on short-term energy.

Keywords: endpoint detection; BP neural network; MFCC; spectral entropy; short-term energy

I. INTRODUCTION

Endpoint detection is the task of detecting the presence and boundaries of speech within an audio signal, and it is an indispensable part of speech recognition. The study in [1] has shown that recognition performance is closely related to the accuracy of endpoint detection. An efficient endpoint detection algorithm can reduce the time complexity of speech recognition as well as improve the performance of the recognition system. Such an algorithm should be accurate, robust, and self-adaptive, where robustness means that the algorithm is reliable under different noise conditions. Most recent endpoint detection methods, such as those based on short-term energy, perform well at high SNRs or in the absence of noise; however, the energy parameter alone is not enough to detect speech boundaries in noisy conditions [2].

To detect the endpoints of speech effectively at low SNRs, this paper proposes an endpoint detection algorithm based on MFCC (Mel-frequency cepstral coefficients) and spectral entropy. It takes the 12-order MFCC parameters and the spectral entropy proposed by Shen [3] as the feature vector, and uses a trained BP neural network as a classifier to distinguish speech from non-speech segments in audio signals. Experimental results indicate that the algorithm outperforms the energy-based algorithm.

II. TRADITIONAL METHODS OF SPEECH ENDPOINT DETECTION

A. Methods Based on Threshold

Most conventional endpoint detection algorithms are threshold-based: the value of one feature of a speech frame (such as short-term energy or spectral entropy) is calculated first and then compared with a predefined threshold. If the value exceeds the threshold, the frame is classified as speech; otherwise it is classified as noise. In the widely used endpoint detection algorithms based on short-term energy, the threshold can be adjusted dynamically according to the noise level. Adaptive threshold techniques were proposed in [4][5]; the idea is as follows. Assuming that the first few frames are noise, the noise level $E_{\text{noise}}(t)$ is estimated during noisy segments by the recursive formula

$$E_{\text{noise}}(t) = \alpha\, E_{\text{noise}}(t-1) + (1-\alpha)\, E(t) \tag{1}$$

where $E(t)$ is the energy of frame $t$ and $\alpha$ is an experimental smoothing parameter. The threshold $T_e(t)$ of frame $t$ is then calculated as

$$T_e(t) = E_{\text{noise}}(t) + \beta \tag{2}$$

where $\beta$ is a fixed offset that determines the threshold. During detection, if $E(t) \geq T_e(t)$, frame $t$ is classified as speech and the update in (1) stops; when $E(t) < T_e(t)$, frame $t$ is classified as noise and the update in (1) resumes. (A code sketch of this procedure is given below.)
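As a concrete illustration of Eqs. (1)-(2), the following Python/NumPy sketch implements the adaptive-threshold detector. The paper itself gives no code, and the values of α, β and the five-frame initialization are illustrative assumptions here, not parameters from [4][5].

```python
import numpy as np

def energy_vad(frames, alpha=0.95, beta=2.0, init_frames=5):
    """Adaptive-threshold speech/noise labeling following Eqs. (1)-(2).

    frames: 2-D array (n_frames, frame_len). alpha and beta stand in for
    the experimental parameters of [4][5]; the defaults are assumptions.
    """
    energy = np.sum(frames.astype(float) ** 2, axis=1)   # short-term energy E(t)
    e_noise = np.mean(energy[:init_frames])              # assume first frames are noise
    labels = np.zeros(len(energy), dtype=bool)
    for t, e in enumerate(energy):
        threshold = e_noise + beta                       # Eq. (2)
        if e >= threshold:
            labels[t] = True                             # speech: update of (1) stops
        else:
            e_noise = alpha * e_noise + (1 - alpha) * e  # Eq. (1) resumes
    return labels
```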




Figure 1. Endpoints of utterance "A" detected by the method based on short-term energy (solid line: hand-labeled endpoint; dashed line: detected endpoint)

After spectral entropy was first proposed by Shen [3] for detecting the endpoints of speech, many endpoint detection algorithms based on spectral entropy were proposed [6][7]; these likewise depend on experimental parameters.

B. Demerits of Methods Based on Threshold

The traditional methods are simple and fast, but their drawbacks are clear:
- It is difficult to choose a proper threshold under different conditions. Many researchers have proposed auto-adaptive methods for deciding thresholds, but these still require experiential or experimental parameters to adjust the threshold, such as α and β in [4][5].
- Threshold-based methods need assumptions in order to determine the threshold. For example, most methods assume that the first few frames (e.g., five frames) are noise; if the audio signal does not meet this assumption, the preset threshold is no longer valid.
- The features used by threshold-based methods are sensitive to various types of noise, so their ability to distinguish speech from non-speech is poor at low SNRs or under non-stationary noise.

Figure 1 shows the endpoints of utterance "A" detected by the method based on short-term energy; the solid line represents the hand-labeled endpoint and the dashed line the detected endpoint. As the figure shows, when the speech is clean or the SNR is high, the algorithm detects the endpoints accurately; but as the noise increases, the performance declines greatly (see (d) in Figure 1).

III. ENDPOINT DETECTION ALGORITHM BASED ON BP NN

From the above discussion, it can be concluded that noise greatly affects the accuracy of detection; an effective endpoint detection algorithm should perform well in all conditions. Backpropagation (BP) neural networks are broadly applied to classification, approximation, prediction, and control. Based on a biological analogy, neural networks try to emulate the human brain's ability to learn from examples or incomplete data and, especially, to generalize concepts [8]. For its excellent classification ability, the BP NN is adopted as the classifier in this paper. The main process is to first extract speech features from each signal frame, then use the trained BP neural network to classify the frames based on these features, and finally estimate the endpoints of the speech from the classification result.

A. Selection of Speech Features

In order to detect speech, many parameters have been introduced. Besides the short-term energy and spectral entropy mentioned above, several other features have been proposed, including the ZCR (zero-crossing rate), LPCs (linear prediction coefficients), cepstral coefficients, and pitch. Although these features can help detect speech, they all have their own disadvantages.



ZCR and LPCs are sensitive to noise and become invalid under strong noise; pitch is useful for expressing the characteristics of the speech signal, but extracting the correct pitch in noisy environments is difficult [7]. It is therefore important to seek features that sufficiently specify the characteristics of speech and are robust in noisy environments. From this analysis, the selected features must meet two conditions:
- The feature presents the intrinsic characteristics of speech signals and makes it easy to distinguish speech from non-speech under different SNR conditions.
- The feature is easy to extract from speech signals, even in noisy conditions.

According to these conditions, this paper chooses MFCC and spectral entropy as the features used in the detection algorithm.

MFCC is a method applied for speech parameterization. The human ear has been proven to resolve frequencies non-linearly across the audio spectrum, so filter-bank analysis is desirable because it is a spectrally based method [8]. Since MFCC closely simulates the human auditory characteristics without any assumptions, it has been widely used in the field of speech recognition. Because of its recognition capability and immunity to noise, it can distinguish speech segments from noise segments effectively.

Shen [3] first used an entropy-based parameter to detect speech signals and achieved good results. The entropy of speech signals differs from that of most noise signals because of the intrinsic characteristics of speech spectra and their different probability density distributions [6]. The basic theory of spectral entropy is as follows: for each frame, the spectrum is obtained by FFT (fast Fourier transform); then, for each frequency component $f_i$, its probability density is estimated as

$$p_i = \frac{s(f_i)}{\sum_{k=1}^{N} s(f_k)}, \qquad i = 1, \ldots, N \tag{3}$$

where $p_i$ is the probability density for frequency component $f_i$, $s(f_i)$ is the spectral energy of $f_i$, and $N$ is the total number of frequency components in the FFT. After the probability density has been calculated, the corresponding spectral entropy of each frame can be estimated as

$$H = -\sum_{k=1}^{N} p_k \log(p_k) \tag{4}$$

Information entropy is related only to the randomness of the energy, not to its amplitude, so spectral entropy can prevent the speech signal from being masked by noise energy. The spectral entropy is therefore robust to noise to some extent.

Thus, after the 12-order MFCC parameters and the spectral entropy have been calculated, the BP neural network can be used to classify each frame based on these 13 features. A sketch of the feature extraction follows.
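The following Python sketch shows how the 13-dimensional per-frame feature vector can be computed. This is a minimal illustration: the authors used MATLAB 7.0, and the use of librosa for the 12-order MFCC is an assumption of this sketch, not the paper's implementation.

```python
import numpy as np
import librosa  # assumed here only as a convenient MFCC implementation

def spectral_entropy(frame):
    """Spectral entropy of a single frame, following Eqs. (3)-(4)."""
    s = np.abs(np.fft.rfft(frame)) ** 2        # spectral energy s(f_i)
    p = s / (np.sum(s) + 1e-12)                # probability density, Eq. (3)
    return -np.sum(p * np.log(p + 1e-12))      # entropy H, Eq. (4)

def frame_features(frame, sr=16000):
    """13-dimensional feature vector: 12-order MFCC + spectral entropy."""
    mfcc = librosa.feature.mfcc(y=frame.astype(float), sr=sr,
                                n_mfcc=12, n_fft=len(frame),
                                hop_length=len(frame))
    return np.concatenate([mfcc.mean(axis=1), [spectral_entropy(frame)]])
```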

B. The Structure of the BP Neural Network

Figure 2. The structure of the BP neural network

The topological structure of the BP neural network is shown in Fig. 2. It is a feed-forward neural network model composed of an input layer, a hidden layer, and an output layer. There are 13 neurons in the input layer to receive the 13 features defined above; the hidden layer has 20 neurons; and the output layer has a single neuron that outputs a value between 0 and 1. If the value is less than 0.5, the frame is classified as non-speech; otherwise it is a speech frame. After the structure of the BP NN has been defined, the main task is training, which is based on a training set whose frames are labeled as noise or speech in advance. The trained network can then be used to classify the test set. A sketch of such a network follows.
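A minimal sketch of a 13-20-1 network of this kind, using scikit-learn's MLPClassifier (trained by backpropagation) as a stand-in for the paper's BP NN; the solver, learning rate, and placeholder data are assumptions, since the paper does not specify its training parameters.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# 13 inputs -> 20 hidden neurons -> 1 output, logistic activations.
net = MLPClassifier(hidden_layer_sizes=(20,), activation='logistic',
                    solver='sgd', learning_rate_init=0.01, max_iter=2000)

# X: (n_frames, 13) normalized features; y: 1 = speech frame, 0 = non-speech.
X = np.random.rand(1000, 13)                 # placeholder training data
y = (np.random.rand(1000) > 0.5).astype(int)
net.fit(X, y)
speech_prob = net.predict_proba(X)[:, 1]     # value in (0, 1); >= 0.5 means speech
```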


C. Description of the Algorithm Proposed in This Paper

This section describes how to detect the speech signal in background noise using the trained BP neural network. Although the network can now classify the audio frames, we do not treat the first frame predicted as speech as the starting point of the speech segment, nor the last frame predicted as speech as the ending point, since a single misclassified frame would make the detection fail. Frames that are wrongly predicted are called singularities. To decrease the error caused by singularities, we introduce the PSD (predicted speech density): the proportion of frames predicted as speech among the total frames within a certain range. Since speech is continuous, it has a certain time span containing a certain number of consecutive frames; so if only one frame among a number of consecutive frames is predicted as speech, the prediction can be asserted to be wrong. PSD is defined as

$$\mathrm{PSD} = N / W \tag{5}$$

where $W$ is the number of frames over which the PSD is calculated (we set $W = 6$ in our algorithm) and $N$ is the number of frames predicted as speech within these $W$ frames. The PSD value of a frame is taken as the predicted speech density of its following $W$ frames. To reduce the error introduced by singularities, we treat the first frame whose PSD ≥ ρ as the start of the speech segment and the last frame whose PSD ≥ ρ as the end of the speech segment. In our experiment we set ρ = 0.5, a choice made with fault tolerance in mind.

It must be pointed out that the function of the PSD parameter proposed in this paper differs from that of the threshold set by threshold-based methods. The threshold in conventional methods determines whether a signal frame is speech or noise; here, the classes of the signal frames have already been determined by the trained network, and PSD is only a corrective parameter that compensates for the misclassification of individual frames. A sketch of this decision rule is given below.
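A minimal sketch of the PSD-based decision rule of Eq. (5), with W = 6 and ρ = 0.5 as in the paper; the sliding-window implementation details are assumptions.

```python
import numpy as np

def detect_endpoints(is_speech, W=6, rho=0.5):
    """Endpoint estimation from per-frame BP-NN decisions using PSD, Eq. (5).

    is_speech: boolean array, True where the network predicted speech.
    The PSD of frame t is computed over the W frames starting at t.
    """
    n = len(is_speech)
    psd = np.array([np.mean(is_speech[t:t + W]) for t in range(n - W + 1)])
    hits = np.where(psd >= rho)[0]
    if hits.size == 0:
        return None                       # no speech segment detected
    return hits[0], hits[-1]              # start frame, end frame
```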

The algorithm proceeds in the following steps (a sketch of Steps 1-2 follows the list):

Step 1: Transform the audio signals from the time domain to the frequency domain.
1) Frame the audio signals; we set the frame length to 256 samples with an overlap of 128.
2) Multiply each frame by a Hamming window to keep the continuity of the first and last points of the frame.
3) Transform each frame to the frequency domain using the FFT.

Step 2: Extract the MFCC parameters and spectral entropy.
1) Calculate the 12-order MFCC parameters and the spectral entropy of each frame.
2) Normalize the 13 features calculated above.

Step 3: Predict the endpoints of the speech segment.
1) Input the 13 feature values of each frame to the BP neural network to predict whether it is speech or non-speech.
2) For each frame, calculate its PSD; the first frame whose PSD ≥ ρ is estimated as the start of the speech segment, and the last frame whose PSD ≥ ρ as the end.
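A sketch of Steps 1-2 in Python/NumPy, reusing the hypothetical frame_features() helper sketched in Section III.A (frame length 256 and overlap 128 as stated; the z-score normalization is an assumption, since the paper does not name its normalization method).

```python
import numpy as np

def extract_features(signal, frame_len=256, overlap=128):
    """Steps 1-2: framing, Hamming windowing, and the 13 features per frame."""
    step = frame_len - overlap
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * window   # Step 1: frame + window
        feats.append(frame_features(frame))                # Step 2.1: 12 MFCC + entropy
    feats = np.array(feats)
    # Step 2.2: normalize each of the 13 features (z-score, assumed)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)
```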

IV. EXPERIMENTAL RESULTS

A. Experimental Environment

The experimental environment is an Intel(R) Core(TM) 2 Duo CPU at 2.10 GHz running the Vista operating system; MATLAB 7.0 is used to simulate the algorithm. The experimental data consist of utterances of the 26 letters from A to Z and the 10 digits from 0 to 9, each recorded by 10 students in a quiet environment; each utterance lasts 3~6 seconds. The sampling rate of the speech signals is 16 kHz. Four kinds of noise signals (white, factory, babble, and pink noise) from the NOISEX-92 database are added to the clean speech signals at SNRs of 20 dB, 15 dB, 10 dB, 5 dB, and 0 dB. All the audio signals are separated into two groups: two-thirds are used as the training set and the rest as the test set. (The noise-mixing step is sketched below.)
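The paper does not detail how the noise is mixed at a given SNR; the following sketch shows the standard construction assumed here, scaling the noise so that the speech-to-noise power ratio matches the target.

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix a noise signal into clean speech at a target SNR (in dB).

    Assumed construction: scale the noise so that
    10*log10(P_clean / P_noise) equals snr_db.
    """
    noise = noise[:len(clean)]
    p_clean = np.mean(clean.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. the factory-noise 0 dB test condition:
# noisy = add_noise(clean_utterance, factory_noise, snr_db=0)
```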

B. Analysis of Experimental Results

TABLE I. ACCURACY COMPARISON OF SPEECH ENDPOINT DETECTION BETWEEN OUR ALGORITHM AND THE ALGORITHM BASED ON SHORT-TERM ENERGY

Noise type  | SNR    | Our algorithm | Energy-based algorithm
pure speech | -      | 98.2%         | 97.2%
white       | 20 dB  | 98.2%         | 96.8%
white       | 15 dB  | 97.9%         | 95.9%
white       | 10 dB  | 97.2%         | 94.5%
white       | 5 dB   | 93.7%         | 89.6%
white       | 0 dB   | 88.2%         | 58.7%
factory     | 20 dB  | 96.4%         | 97.2%
factory     | 15 dB  | 95.1%         | 96.5%
factory     | 10 dB  | 93.7%         | 91.7%
factory     | 5 dB   | 88.2%         | 77.8%
factory     | 0 dB   | 76.4%         | 45.8%
babble      | 20 dB  | 96.3%         | 97.9%
babble      | 15 dB  | 94.0%         | 94.4%
babble      | 10 dB  | 90.3%         | 82.3%
babble      | 5 dB   | 86.1%         | 69.1%
babble      | 0 dB   | 73.6%         | 47.9%
pink        | 20 dB  | 96.1%         | 97.8%
pink        | 15 dB  | 95.8%         | 97.2%
pink        | 10 dB  | 94.8%         | 90.6%
pink        | 5 dB   | 93.7%         | 81.7%
pink        | 0 dB   | 88.2%         | 52.2%

In the experiment, a detected starting frame is counted as accurate if it lies within 10 frames of the hand-labeled starting frame, and similarly for the ending frame. Table I compares the accuracy of speech endpoint detection between our algorithm and the algorithm based on short-term energy. The two algorithms achieve similar accuracy and both perform well at high SNRs. As the noise level increases, however, the performance of the energy-based algorithm declines greatly: for example, under factory noise at 0 dB SNR its accuracy is only 45.8%, whereas our algorithm still achieves 76.4%. The experiment shows that our algorithm outperforms the algorithm based on short-term energy in noisy environments.



V. CONCLUSIONS

Endpoint detection becomes relatively difficult in noisy environments, but it is definitely important for robust speech recognition. Short-term energy and spectral entropy have been widely used in endpoint detection; however, these features are not stable and robust enough in noisy environments. This paper proposes an endpoint detection algorithm based on MFCC and spectral entropy using a BP NN, which has the following merits compared with the traditional methods:
- It uses the BP neural network to classify the audio signals directly, so it does not need a threshold or the related parameters used to classify the audio signals.
- It needs no assumptions about the audio signals; for example, it does not assume that the first few frames are noise.
- It is robust and more accurate in noisy environments.

REFERENCES

[1] L. Karray and A. Martin, "Towards Improving Speech Detection Robustness for Speech Recognition in Adverse Conditions," Speech Communication, vol. 40, pp. 261-276, 2003.
[2] G.-D. Wu and C.-T. Lin, "Word Boundary Detection with Mel-Scale Frequency Bank in Noisy Environment," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 5, pp. 541-554, 2000.
[3] J. L. Shen, J. W. Huang, and L. S. Lee, "Robust Entropy-Based Endpoint Detection for Speech Recognition in Noisy Environments," in Proc. International Conference on Spoken Language Processing (ICSLP), Sydney, 1998.
[4] S. V. Gerven and F. Xie, "A Comparative Study of Speech Detection Methods," in Proc. EUROSPEECH 1997, vol. III, pp. 1095-1098.
[5] P. Renevey and A. Drygajlo, "Entropy Based Voice Activity Detection in Very Noisy Conditions," in Proc. EUROSPEECH 2001, pp. 1887-1890.
[6] C. Jia and B. Xu, "An Improved Entropy-Based Endpoint Detection Algorithm," in Proc. International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, 2002.
[7] B.-F. Wu and K.-C. Wang, "Robust Endpoint Detection Algorithm Based on the Adaptive Band-Partitioning Spectral Entropy in Adverse Environments," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 762-775, 2005.
[8] C. K. On, P. M. Pandiyan, S. Yaacob, and A. Saudi, "Mel-Frequency Cepstral Coefficient Analysis in Speech Recognition," in Proc. International Conference on Computing & Informatics, 2006.

