
Artificial Neural Network for Speech Recognition

Overview:
• Presenting an artificial neural network to recognize and classify speech.
• Spoken digits: "one", "two", "three", etc.
• Choosing a speech representation scheme.
• Training a perceptron.
• Results.
Representing Speech:
• Problem
  - Recorded samples never produce identical waveforms; they vary in length, amplitude, background noise, and sample rate.
  - However, the perceptual information in the speech remains consistent.
• Solution
  - Extract the speech-related information.
  - See: spectrogram.
Representing Speech:
[Figure: waveforms and spectrograms of two different recordings of "one"]
Spectrogram:
• Shows the change in amplitude spectra over time.
• Three dimensions:
  - X axis: time
  - Y axis: frequency
  - Z axis: color intensity represents magnitude
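The spectrogram above can be sketched as a short-time Fourier transform: window the signal into overlapping frames and take the magnitude of each frame's FFT. A minimal NumPy version (the frame length and hop size here are illustrative choices, not values from the slides):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: |FFT| of overlapping, windowed frames.

    Rows = frequency bins (Y axis), columns = time frames (X axis);
    the magnitudes themselves are the "color intensity" (Z axis).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Example: one second of a 440 Hz tone sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

For a pure tone the energy concentrates in one frequency row across all time columns; for speech, the pattern of rows changes frame to frame, which is what the figures above visualize.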
Mel Frequency Cepstrum Coefficients:
• The spectrogram provides a good visual representation of speech but still varies significantly between samples.
• Cepstral analysis is a popular method for feature extraction in speech recognition applications, and can be accomplished using Mel Frequency Cepstrum Coefficient (MFCC) analysis.
Mel Frequency Cepstrum Coefficients:
• The inverse Fourier transform of the log of the Fourier transform of a signal, using the Mel-scale filterbank.
• The mfcc function returns vectors of 13 dimensions.
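As a sketch of the transform just described, here is the core cepstral step in NumPy, with the Mel-scale filterbank omitted for brevity (a real MFCC implementation, such as the Auditory Toolbox's mfcc, warps the spectrum onto the Mel scale before the log); the helper name and frame are mine:

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Inverse FFT of the log magnitude spectrum of one frame.

    MFCC analysis additionally applies a Mel-scale filterbank before
    the log; here we only mirror the 13-dimensional output size.
    """
    spectrum = np.abs(np.fft.fft(frame))
    log_spectrum = np.log(spectrum + 1e-10)   # small offset avoids log(0)
    cepstrum = np.fft.ifft(log_spectrum).real # real signal -> real cepstrum
    return cepstrum[:n_coeffs]

# One 256-sample frame of a pure tone
frame = np.sin(2 * np.pi * 50 * np.arange(256) / 256)
coeffs = real_cepstrum(frame)
```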
Network Architecture:
• Input layer
  - 26 cepstral coefficients.
• Hidden layer
  - 100 fully connected hidden-layer units.
  - Weights range between -1 and +1.
  - Initially random.
  - Remain constant during training.
• Output
  - 1 output unit for each target.
  - Limited to values between 0 and +1.
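The architecture above can be sketched as a single forward pass in NumPy; the layer sizes follow the slide, while the variable names and the random input vector are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    """Maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

n_input, n_hidden = 26, 100
W = rng.uniform(-1, 1, size=(n_hidden, n_input))  # input->hidden, fixed
v = rng.uniform(-1, 1, size=n_hidden)             # hidden->output, trained
bias = -1.0

s = rng.uniform(-1, 1, size=n_input)   # stand-in 26-dim feature vector
h = sigmoid(W @ s + bias)              # 100 hidden activations in (0, 1)
o = sigmoid(v @ h + bias)              # scalar output in (0, 1)
```

Because the input-to-hidden weights W stay fixed, the hidden layer acts as a random nonlinear projection of the 26 cepstral coefficients, and only the hidden-to-output vector v is learned.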
Sample Training Stimuli (Spectrograms):
[Figure: spectrograms of spoken "one", "two", and "three"]
Training the Network:
• Spoken digits were recorded
  - Seven samples of each digit.
  - "One" through "eight" recorded.
  - A total of 56 recordings, with varying lengths and environmental conditions.
  - Background noise was removed from each sample.
Training the Network:
• Calculate the MFCCs using Malcolm Slaney's Auditory Toolbox:
  - c = mfcc(s, fs, fix((3*fs)/(length(s)-256)))
  - This limits the frame rate so that mfcc always produces a matrix of two vectors, corresponding to the coefficients of the two halves of the sample.
• Convert the 13x2 matrix to a 26-dimensional column vector:
  - c = c(:)
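For readers outside MATLAB: c(:) stacks a matrix column by column into one vector. An equivalent of the 13x2 to 26-dimensional reshaping in NumPy (the example matrix is a stand-in for real MFCC output):

```python
import numpy as np

c = np.arange(26).reshape(13, 2)   # stand-in for the 13x2 MFCC matrix
# MATLAB's c(:) is column-major, so use Fortran ("F") order here:
# first the 13 coefficients of the first half of the sample,
# then the 13 coefficients of the second half.
feature = c.flatten(order="F")     # 26-dimensional feature vector
```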
Training the Network:
• Supervised learning
  - Choose an intended target and create a target vector.
  - The target vector is 56-dimensional.
  - If training the network to recognize a spoken "one", the target has a value of +1 for each of the known "one" stimuli and 0 for everything else.
• Train a multilayer perceptron with the feature vectors (simplified):
  - Select a stimulus at random.
  - Calculate the response to the stimulus.
  - Calculate the error.
  - Update the weights.
  - Repeat.
Training the Network:
• Given enough iterations, the perceptron learns to distinguish stimuli of the intended target from all other stimuli.
• Calculate the response to a stimulus:
  - Calculate the hidden layer: h = sigmoid(W*s + bias)
  - Calculate the response: o = sigmoid(v*h + bias)
• Sigmoid transfer function: sigmoid(x) = 1/(1+e^(-x))
  - Maps values to between 0 and +1.
Training the Network:
• Calculate error
  - For a given stimulus, the error is the difference between target and response: t - o
  - t will be either 0 or 1.
  - o will be between 0 and +1.
• Update weights
  - v = v_previous + γ(t-o)h^T
  - v is the weight vector between the hidden-layer units and the output.
  - γ (gamma) is the learning rate.
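Putting the forward pass, error, and update rule together, a minimal sketch of the training loop in NumPy; random vectors stand in for the real MFCC features, and, as on the slides, only the hidden-to-output weights v are updated:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_input, n_hidden, n_stimuli = 26, 100, 56
W = rng.uniform(-1, 1, size=(n_hidden, n_input))   # fixed input weights
v = rng.uniform(-1, 1, size=n_hidden)              # trained output weights
bias, gamma = -1.0, 1.0                            # bias and learning rate

stimuli = rng.uniform(-1, 1, size=(n_stimuli, n_input))  # stand-in features
targets = np.zeros(n_stimuli)
targets[:7] = 1.0     # pretend the first 7 stimuli are the intended digit

for _ in range(3000):
    i = rng.integers(n_stimuli)           # select a stimulus at random
    h = sigmoid(W @ stimuli[i] + bias)    # hidden layer
    o = sigmoid(v @ h + bias)             # response
    v = v + gamma * (targets[i] - o) * h  # delta rule: v += gamma*(t-o)*h^T

errors = [abs(targets[i] - sigmoid(v @ sigmoid(W @ stimuli[i] + bias) + bias))
          for i in range(n_stimuli)]
```

Note that, as written on the slide, the update multiplies the raw error (t-o) by the hidden activations without a sigmoid-derivative term; this is the simple delta-rule form, not full backpropagation.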
Results:
Target = "one"
• Learning rate: +1
• Bias: -1
• 100 hidden-layer units
• 3000 iterations
• 316 seconds to learn the target
Results:
• Response to unseen stimuli
  - Stimuli produced by the same voice used to train the network, with noise removed.
  - The network was tested against eight unseen stimuli corresponding to the eight spoken digits.
  - It returned 1 (full activation) for "one" and zero for all other stimuli.
  - Results were consistent across targets, i.e. when trained to recognize "two", "three", etc.
  - sigmoid(v*sigmoid(W*t1 + bias) + bias) == 1
Results:
• Response to a noisy sample
  - The network returned a low but positive (> 0) response to a sample whose noise had not been removed.
• Response to a foreign speaker
  - The network gave mixed results when presented with samples from speakers other than the training speaker.
• In all cases, the error rate decreased and accuracy improved with more learning iterations.
References:
• Jurafsky, Daniel and Martin, James H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (1st ed.). Prentice Hall.
• Golden, Richard M. (1996). Mathematical Methods for Neural Network Analysis and Design (1st ed.). MIT Press.
• Anderson, James A. (1995). An Introduction to Neural Networks (1st ed.). MIT Press.
• Hosom, John-Paul, Cole, Ron, Fanty, Mark, Schalkwyk, Johan, Yan, Yonghong, Wei, Wei (1999, February 2). Training Neural Networks for Speech Recognition. Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology. http://speech.bme.ogi.edu/tutordemos/nnet_training/tutorial.html
• Slaney, Malcolm. Auditory Toolbox. Interval Research Corporation.