

ABSTRACT
Voice/speech is one of the most popular and reliable biometric technologies used in automatic personal identification systems. This paper presents a speaker identification system using cepstral-based speech features. The commonly used cepstral-based features, Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC) and Real Cepstral Coefficients (RCC), are employed in the speaker identification system.

Cont.
The experimental results show that the identification accuracy with MFCC is superior to that of both LPCC and RCC. This paper introduces a new approach to control and drive a DC motor using voice recognition based on MFCC. The DC motor is used to control a wheelchair.

INTRODUCTION
Speech is one of the natural forms of communication, and recent developments have made it possible to use it in security systems. The task is to use a speech sample to determine the identity of the person who produced it from among a population of speakers. This paper makes it possible to use a speaker's voice to verify their identity and control access to a DC motor. The MFCC algorithm for speech recognition is more accurate than Linear Prediction Coding (LPC).

CONT.
The external DC motor is connected through an interface between the computer and the hardware circuit. The hardware circuit mainly consists of an ARM microcontroller, a MAX232 IC and a driver IC (L293D).

PRINCIPLES OF SPEECH/VOICE RECOGNITION


Speech recognition is considered one of the most popular and reliable biometric technologies used in automatic personal identification systems. Speech recognition systems are used for a variety of applications such as multimedia browsing tools, access control, security and finance.

Cont
The physiological structure of the vocal tract differs from person to person; due to this, we can differentiate one person's voice from another's. This difference in vocal-tract structure is reflected in the frequency spectrum of the speech signal. MFCC extraction is a filter-bank-based approach implemented using a time-frequency analysis technique: time analysis is first done through a framing operation, and frequency analysis is then done by passing each frame through a filter bank.

Cont
The filters are designed so that they resemble human auditory frequency perception. Presently, MFCC is the most widely used feature set for speaker recognition (it was first proposed for speech recognition). In the MFCC filter bank, the low frequencies are given more importance than the high frequencies. Speech recognition performance degrades significantly under varying environmental conditions in many application areas; recognition accuracy can be improved by the removal of noise.

Cont
A speech recognition system includes:
1. a feature extraction method
2. a feature recognition method.
Here, MFCC is used as the feature extraction method and the Euclidean squared distance is used as the recognition method.

CLASSIFICATION OF SPEECH RECOGNITION SYSTEM


Speaker recognition methods can be divided into text-independent and text-dependent methods.

TEXT-INDEPENDENT SYSTEM
A text-independent automatic speaker identification system (ASIS) does not rely on a specific text being spoken in either the training or the testing phase. It relies on the long-term statistical characteristics of speech to effect a successful identification.

TEXT-DEPENDENT SYSTEM
In a text-dependent ASI system, a fixed utterance, such as a password, card number or PIN code, is used in both the training and testing phases, and the system relies on specific features of the test utterance in order to effect a match. A text-dependent system requires less training than a text-independent one and provides a practical solution in real applications.

CEPSTRAL BASED FEATURES


The source-filter model of speech production assumes

that the speech segment centered at time to is produced when the excitation signal, e(n,to), is passed through a linear filter, h(n,to), the model of vocal tract. That is, for a small segment of time which the properties of the speech signal are assumed to be stationary, the speech signal is composed of a excitation sequence (quickly varying part) convolved with a vocal system impulse response(slowly varying part

Cont..
s(n, t0) = e(n, t0) * h(n, t0)    (1)
To extract the vocal-tract-specific characteristics, it is desirable to filter out the excitation component from the filter component. The convolution makes it difficult to separate the two parts; therefore, cepstral analysis is introduced. The cepstral coefficients are generally derived either through linear predictive (LP) analysis or through mel filter-bank analysis.

REAL CEPSTRAL COEFFICIENT (RCC)


To compute RCCs, the signal is transformed from the time domain to the frequency domain by applying the Fourier transform. According to the convolution theorem, Eq. (2), the convolution expression of Eq. (1) becomes a multiplication, as shown in Eq. (3). When the spectrum is represented logarithmically, its components become additive due to the property of the logarithm, as in Eq. (4).
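The referenced equations, reconstructed here from the surrounding description (the segment time index t0 is dropped for brevity):

FT{ e(n) * h(n) } = E(ω) · H(ω)    (2)
S(ω) = E(ω) · H(ω)    (3)
log|S(ω)| = log|E(ω)| + log|H(ω)|    (4)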

The inverse Fourier transform is linear and therefore operates individually on the two components. The result, cs(m, t0), is called the cepstrum, or the real cepstral coefficients, of s(n, t0), and the domain of cs(m, t0) is called the quefrency domain. The vocal-tract characteristics are encoded into the lower quefrencies, so the excitation can be removed by keeping only the lower cepstral coefficients.
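As an illustration, a minimal real-cepstrum computation in Python/NumPy (the paper's implementation was in MATLAB; this sketch is only for clarity):

import numpy as np

def real_cepstrum(frame):
    # FFT -> log magnitude -> inverse FFT gives the real cepstrum
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
    return np.fft.ifft(log_magnitude).real

# Keeping only the lower coefficients retains the vocal-tract part, e.g.:
# vocal_tract = real_cepstrum(frame)[:13]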

MEL-FREQUENCY CEPSTRAL COEFFICIENTS (MFCC)


They are calculated in the same way as the real cepstral coefficients, except that the frequency scale is warped to correspond to the mel scale. This mapping is usually done using the equation

Mel(f) = 2595 · log10(1 + f/700)

The calculation of the mel-frequency cepstral coefficients is illustrated in the block diagram below.
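A sketch of the mel-scale mapping and its inverse in Python (illustrative only):

import numpy as np

def hz_to_mel(f):
    # Mel(f) = 2595*log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # inverse of the mapping above
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)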

MFCC CALCULATION
The speech is first pre-emphasized with a pre-emphasis filter H(z) = 1 − a·z⁻¹ to spectrally flatten the signal, where a is between 0.9 and 1; here a is approximated as 31/32. In the time domain, the relationship between the output s̃(n) and the input s(n) of the pre-emphasis block is s̃(n) = s(n) − a·s(n−1).
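A minimal pre-emphasis sketch in Python, assuming a = 31/32 as in the paper:

import numpy as np

def pre_emphasis(signal, a=31.0/32.0):
    # s~(n) = s(n) - a*s(n-1); the first sample is passed through unchanged
    out = signal.astype(float).copy()
    out[1:] -= a * signal[:-1]
    return out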

BLOCK DIAGRAM OF THE MFCC EXTRACTION ALGORITHM

Cont
s̃(n) = s(n) − a·s(n−1) = s(n) − (31/32)·s(n−1) = s(n) − ( s(n−1) − s(n−1)/32 )

With a = 31/32, the multiplication thus reduces to a subtraction and a divide-by-32 (a right shift by 5 bits). The pre-emphasized speech is then separated into short segments called sub-frames. The frame length is set to 10 ms (80 samples at an 8 kHz sampling rate) to guarantee stationarity inside the frame, with no overlap between frames. The Hamming window is used mainly to reduce the edge effect, so an 80-point window is used.
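A framing-and-windowing sketch in Python under these assumptions (8 kHz sampling, 80-sample non-overlapped sub-frames; illustrative only):

import numpy as np

def frame_and_window(signal, frame_len=80):
    # split into non-overlapped 80-sample sub-frames and apply a Hamming window
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return frames * np.hamming(frame_len)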

Cont
As the window size becomes smaller, the short-time spectrum gives a poorer frequency resolution but a better estimate of the overall spectral envelope. A 128-point FFT is used; by the symmetry property of the spectrum of a real signal, only 64 coefficients are needed. A rectangular filter bank is used, and the frame overlap is handled after the filter-bank stage. In a rectangular filter bank, the output characteristic of each filter is either a "1" or a "0";

Cont
thus the operations are changed to "add" or "not add". For a 128-point FFT, the filter bank is reduced to 23 equally spaced rectangular filters; experiments indicate that 23 filters produce the highest recognition accuracy, and the rectangular filters require only 120 additions. Here fn and fn+1 represent the original 160-point frames with 50% overlap, and sfn and sfn+1 represent the new 80-point non-overlapped sub-frames.
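A sketch of a rectangular (0/1) filter bank applied to the 64 usable FFT power bins; the exact band edges are not given in the slides, so equal spacing is assumed:

import numpy as np

def rectangular_filterbank(power_bins, n_filters=23):
    # power_bins: 64 FFT power values; each rectangular filter simply
    # adds the bins inside its band (weights are 1 inside, 0 outside)
    edges = np.linspace(0, len(power_bins), n_filters + 1).astype(int)
    return np.array([power_bins[edges[k]:edges[k + 1]].sum()
                     for k in range(n_filters)])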

Cont
We add the filter-bank outputs sFn,k and sFn+1,k to generate the power coefficient Sn,k. We have reduced almost half of the computation by moving the overlap operation to the end of the spectrum calculation. The following DCT and delta calculations are the same, and there are likewise 26 features in each frame. This extraction algorithm reduces the total number of multiplications.
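In code, recreating the 50% overlap at the filter-bank output is a single addition of adjacent sub-frame outputs (a sketch, following the notation above):

import numpy as np

def overlap_at_filterbank(sF):
    # sF: (num_subframes, 23) filter-bank outputs of non-overlapped sub-frames
    # S[n, k] = sF[n, k] + sF[n+1, k] emulates the original 50%-overlapped frames
    return sF[:-1] + sF[1:]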

Cont
We can calculate the mel-frequency cepstrum from the output power of the filter bank using the DCT equation below, where Sk is the output power of the k-th filter of the filter bank and n runs from 1 to 12. We can also calculate the logged energy of each frame as one of the coefficients; it is computed without any windowing or pre-emphasis. Up to now we have obtained 13 cepstral coefficients.
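The standard mel-cepstrum DCT consistent with this description (a reconstruction; the original equation appears only as a figure):

c(n) = Σ_{k=1..23} log(Sk) · cos( n·(k − 0.5)·π/23 ),   n = 1, …, 12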

Cont
To enhance the performance of the speech recognition system, time derivatives are added to the basic static parameters. The delta coefficients are obtained from the following formula:
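The standard delta regression consistent with this description (a reconstruction; T is the number of neighbouring frames used on each side):

Δc(n, t) = [ Σ_{τ=1..T} τ·( c(n, t+τ) − c(n, t−τ) ) ] / [ 2·Σ_{τ=1..T} τ² ]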
After all the calculations, the total number of MFCC features for each frame is 26 (13 static coefficients plus 13 deltas).

LINEAR PREDICTIVE CEPSTRAL COEFFICIENTS (LPCC)


The derivation is summarized by Eq. (7), which transforms the linear predictive coding (LPC) coefficients into a set of cepstral coefficients. It is noted that while there is a finite number of LPCs, the number of cepstral coefficients is infinite; however, the cepstrum is a decaying sequence, so a finite number of coefficients is sufficient to approximate it.
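The standard LPC-to-cepstral recursion consistent with this description (a reconstruction; a(k) are the LPC coefficients, p is the LPC order):

c(1) = a(1)
c(n) = a(n) + Σ_{k=1..n−1} (k/n)·c(k)·a(n−k),   1 < n ≤ p
c(n) = Σ_{k=n−p..n−1} (k/n)·c(k)·a(n−k),        n > p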

FEATURE MATCHING
Feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modelling (HMM) and Vector Quantization (VQ). The VQ approach has been used here for its ease of implementation and high accuracy.

Dynamic Time Warping


In this approach, the training data is converted to templates. The recognition process then consists of matching the incoming speech with the stored templates; the template with the lowest distance measure from the input pattern is the recognized word. Finding the best match (lowest distance measure) is based upon dynamic programming, and such a system is called a Dynamic Time Warping (DTW) word recognizer.
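A compact DTW distance sketch in Python (illustrative; real systems add path constraints):

import numpy as np

def dtw_distance(a, b):
    # a, b: feature sequences of shape (len, dim); returns the DTW distance
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]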

Vector quantization (VQ)


Vector quantization (VQ) is a lossy data compression method based on the principle of block coding. We use the VQ method to classify the voice features, which avoids the problem of time warping. Here we use the splitting method to initialize the codebook of every speaker's features. Once we have the final codebooks, we use them to decide who the speaker is: a feature vector from an unknown speaker is first matched against the database.

Cont
Then the system computes the Euclidean distance between the feature and every speaker's codebook:

D = min[ d(X, Y) ]    (7)

where X is the unknown speaker's feature and Y is a codebook. We compute the distance between X and every codebook of a speaker, and then take the minimal value as the distance D. Here we set a threshold value; if every speaker's distance is more than the threshold value, we judge again using the next feature.
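A sketch of this matching rule in Python (names and the threshold handling are illustrative):

import numpy as np

def speaker_distance(x, codebook):
    # D = min over codevectors y of the Euclidean distance d(x, y)
    return np.min(np.linalg.norm(codebook - x, axis=1))

def identify(x, codebooks, threshold):
    # codebooks: dict mapping speaker id -> (M, dim) codebook array
    dists = {spk: speaker_distance(x, cb) for spk, cb in codebooks.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] <= threshold else None  # None: judge again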

Cont
D = (1/K) · Σ_{k=1..K} min[ d(Xk, Y) ]

i.e., the minimal distances of the K feature vectors are averaged. If k reaches the maximum value we set and no speaker's distance is less than the threshold value, the system judges that the speaker is not in the database.

Cont
If only one speaker's distance is less than the threshold, the system judges that it is just that speaker. If several speakers' distances are less than the threshold value, the candidates with the minimal distances go through a GMM judgment model; the number of candidate speakers should be set according to the local conditions.

Cont
The speaker-based VQ codebook generation can be stated as follows. Given a set of I training feature vectors, {a1, a2, …, aI}, characterizing the variability of a speaker, find a partitioning of the feature-vector space, {S1, S2, …, SM}, such that the whole feature space is represented as S = S1 ∪ S2 ∪ … ∪ SM. Each partition Si forms a convex, non-overlapping region, and every vector inside Si is represented by the corresponding centroid vector bi of Si. The partitioning is done in such a way that the average distortion is minimized.

LBG Design Algorithm



The LBG algorithm requires an initial codebook. The initial codebook is obtained by the splitting method. In this method, an initial codevector is set as the average of the entire training sequence. This codevector is then split into two. The iterative algorithm is run with these two vectors as the initial codebook. The final two codevectors are split into four and the process is repeated until the desired number of codevectors is obtained.
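A condensed LBG sketch in Python under the usual assumptions (Euclidean distortion, split factor eps = 0.01; names are illustrative):

import numpy as np

def lbg(train, n_codevectors, eps=0.01, n_iters=20):
    # train: (N, dim) training vectors; returns an (n_codevectors, dim) codebook
    codebook = train.mean(axis=0, keepdims=True)  # initial codevector: global mean
    while len(codebook) < n_codevectors:
        # split every codevector into two slightly perturbed copies
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iters):  # Lloyd iterations: assign, then re-centre
            dists = np.linalg.norm(train[:, None, :] - codebook[None], axis=2)
            nearest = dists.argmin(axis=1)
            for k in range(len(codebook)):
                members = train[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook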

Cont
The VQ is shown for two speakers: the circles refer to speaker 1 and the triangles to speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker. The figure shows the use of a different number of centroids for the same data field. After calculation of the MFCC and VQ, the Euclidean distance is calculated for nearest-speech matching.

Conceptual diagram that explains the VQ process

Pictorial view of a codebook with 15 centroids

EXPERIMENTAL SETUP FOR DC MOTOR CONTROL

Cont.
The speech signal is captured by a microphone connected to the computer. Software is written to calculate the MFCC and VQ (LBG algorithm); MATLAB version 7.5 is used to recognize the input speech taken from the microphone. For the hardware part, an ARM microcontroller is used to make the DC motor understand the commands; the microcontroller is programmed in Embedded C. The interfacing between the computer and the microcontroller is done over RS-232. To drive the DC motor, the driver IC L293D is used.
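A sketch of sending a recognized command from the PC to the microcontroller over RS-232 using Python's pyserial (the paper's host code was MATLAB; the port name and the single-byte command protocol here are assumptions):

import serial  # pyserial

# open the RS-232 link (port name and baud rate are assumptions)
link = serial.Serial("COM1", 9600, timeout=1)

COMMANDS = {"forward": b"F", "reverse": b"R", "stop": b"S"}  # hypothetical protocol

def send_command(word):
    # word: the recognized speech command, e.g. "forward"
    link.write(COMMANDS[word])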

RESULTS

The code has been developed using the MFCC and VQ algorithms in MATLAB version 7.5 on the Windows Vista platform, and the supporting hardware has also been implemented. The interfacing between hardware and software is done using an RS-232 cable (MAX232 IC). The external DC motor can be driven in the forward or reverse direction, and it can also be stopped, by giving speech commands. While calculating the MFCC for the database at the time of speech recognition, …

CONCLUSION
In this paper, the MFCC and VQ techniques are used in speech recognition to control a DC motor drive, using an ARM microcontroller, in order to control the movement of a wheelchair. The code developed in MATLAB using MFCC and VQ can even be used to control and drive stepper motors, servo motors, etc.
