
International Journal of Science and Advanced Technology (ISSN 2221-8386)

Volume 4 No 4 April 2014

http://www.ijsat.com

Voice Recognition System by Using an Automatic Classification Processor
J. Fernando Piñal-Moctezuma
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico

Selene Edith Maya
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico

Salvador Ayala-Raggi
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico

Ana María Rodríguez
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico

Josefina Castañeda
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico

Francisco Portillo
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico

Gerardo Mino-Aguilar
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico
gmino44@ieee.org

Rodrigo Maya
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico

Ricardo Álvarez
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico

Tecilli Tapia
Facultad de Ciencias de la Electrónica
Benemérita Universidad Autónoma de Puebla
Puebla, Mexico

Abstract: In this paper an Automatic Speech Recognition (ASR) system is implemented by using an automatic pattern recognition processor. Instead of conventional speech recognition techniques, the method relies on Artificial Vision paradigms: images generated as a function of intrinsic characteristics extracted from the speech signal are used for the training and pattern-generation stages, and this information is then used to classify the signals appropriately. Nevertheless, the state of the art in Digital Signal Processing, focused on speech, is also used, since it is essential for the signal-conditioning algorithms as well as for the speech feature extraction. The performance of the system was compared using four different feature extraction methods: Average Magnitude, Spectrograms (time-frequency), Linear Prediction Coefficients and, finally, a technique we propose called Short-Time Analysis of the Fundamental Signal. For the training and pattern-generation stages, the state of the art in Artificial Intelligence for Computer Vision and for Speech Recognition was reviewed, stressing in particular Principal Component Analysis, in order to later implement the Eigenfaces method, which is the prevailing reference in the field of Facial Recognition and the fundamental recognition technique of this work. Next, the state of the art on automatic classifiers and the k-Nearest Neighbors (k-NN) algorithm was reviewed. Lastly, a test bench was generated, where the final implementation of the algorithms was created; tests were then executed over the generated spaces, obtaining the recognition rates of the system. For this, the TMW Voice Corpus was used, enabling the evaluation of the final performance of the implemented system.

Keywords: voice recognition, pattern, signal, eigenfaces, principal component analysis.

I. INTRODUCTION

The voice is a signal that travels through disturbed air and is produced by physiological human functions involving three essential physical processes: the generation of pressurized air, the regulation of the vibration of that air, and the control of the resonance of the generated acoustic signal. This physiological process generates a signal that contains a great amount of information, which can be analyzed as a high-dimensionality flow of information (seen from the point of view of multivariate statistical analysis).

A common way of representing acoustic signals (including voice signals) is to measure the energy of the signal in different frequency bands, calculating this energy over small, successive time intervals. In this fashion each frequency band can be seen as a single dimension of a multidimensional space whose dimension equals the number of frequency bands; under this perspective, a speech-signal segment can be represented as a point in this new space.
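As an illustration of this band-energy representation (a sketch, not code from the paper; the frame length, hop size and band edges below are arbitrary choices for the example), the following Python fragment splits a signal into short frames and computes the log energy in a handful of frequency bands, so that each frame becomes a point in a space with one dimension per band:

```python
import numpy as np

def band_energy_features(x, fs, frame_len=0.025, hop=0.010,
                         band_edges=(100, 500, 1000, 2000, 4000)):
    """Represent a signal as per-frame log energies in a few frequency bands."""
    n = int(frame_len * fs)          # samples per frame
    step = int(hop * fs)             # samples between frame starts
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    feats = []
    for start in range(0, len(x) - n + 1, step):
        frame = x[start:start + n] * np.hamming(n)
        power = np.abs(np.fft.rfft(frame)) ** 2
        # one dimension per band: total power between consecutive band edges
        bands = [power[(freqs >= lo) & (freqs < hi)].sum()
                 for lo, hi in zip(band_edges[:-1], band_edges[1:])]
        feats.append(np.log(np.asarray(bands) + 1e-12))
    return np.array(feats)           # shape: (num_frames, num_bands)

# Example: a 1 s synthetic tone sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
features = band_energy_features(np.sin(2 * np.pi * 440 * t), fs)
print(features.shape)
```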

Given the physiological restrictions on the movements of the human body (jaw, lungs, tongue, etc.), the vocal apparatus has a limited number of degrees of freedom, which can lead to a reduced mathematical representation. Besides these physiological restrictions, and from the phonetic point of view, only a small subset of all the sounds that a human being can produce is effectively used in oral communication. This motivates the search for the low-dimensionality structures (manifolds) inherent to speech, so that by means of these methods we can obtain a parameterization of the fundamental variability of the data flow of these signals using only a few characteristics. Under this approximation, the information contained in the voice signals can be seen as low-dimensional data embedded in a higher-dimensional space (as seen in Figure 1).

In this work an Automatic Speech Recognition System is implemented by using an automatic pattern recognition processor which, instead of conventional speech recognition techniques, is based on Artificial Vision paradigms: images produced as a function of intrinsic characteristics extracted from the speech are used for the training and pattern-generation stages, and this information is then used to classify the signals appropriately, as shown in Figure 2.
II. METHOD

In this work an Automatic Speech Recognition System was implemented using dimensionality reduction techniques based on Facial Recognition paradigms as the fundamental method for the data analysis. In addition, and with the purpose of translating the speech signals into a domain appropriate for their observation and classification, Digital Signal Processing techniques are used to construct the feature vectors necessary for this classification. The proposed system is shown in Figure 3, and it is based on the classic paradigm of an ASR system.
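As a structural sketch only (the function names and the space/classifier interfaces below are illustrative assumptions, not taken from the paper), the classic ASR paradigm of Figure 3 can be summarized as an off-line process that builds a library of reference patterns and an on-line process that conditions an incoming signal, extracts its features and identifies it against that library:

```python
import numpy as np

def preprocess(signal):
    """Placeholder for pre-emphasis, voice activity detection and normalization."""
    return signal / (np.max(np.abs(signal)) + 1e-12)

def train(training_signals, labels, extract_features, build_space):
    """Off-line process: build the knowledge base (library) of reference patterns."""
    features = [extract_features(preprocess(s)) for s in training_signals]
    space = build_space(features)                     # e.g. a PCA / Eigenfaces basis
    library = [(space.project(f), lab) for f, lab in zip(features, labels)]
    return space, library

def recognize(signal, space, library, extract_features, classify):
    """On-line process: preprocess, extract features and identify the spoken word."""
    weights = space.project(extract_features(preprocess(signal)))
    return classify(weights, library)                 # e.g. k-Nearest Neighbors
```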
Figure 3. Proposed ASR system, showing the classic paradigm of Pattern Recognition, which involves comparing the parameters or features representing the spoken word with reference patterns for each word in the vocabulary library. In the diagram, the speech signal passes through signal preprocessing, feature extraction and identification (pattern association) as the on-line process, while training builds the knowledge base (library) as the off-line process, producing the recognition results.

Figure 1. A two-dimensional manifold embedded in a non-linear way in a three-dimensional space. The existence of a low-dimensional structure in speech signals is supported by previous studies dating back to the classic formant-plane analysis of vowels [1] [2].

III. SIGNAL PRE-PROCESSING

In the Pre-processing stage, and with the goal of adapting the speech signals into a representation that helps the following stages perform more efficiently, a pre-emphasis filter, a Voice Activity Detection stage, a short-time analysis of the voice signal (when the chosen features require it) and, finally, a normalization and temporal-alignment process were used (see Figure 4).

Figure 2. Proposed method for applying PCA through Eigenfaces: the speech signal is pre-processed; features are extracted (Average Magnitude, Spectrogram, Linear Prediction Coefficients or Short-Time Analysis of the Fundamental Signal); images are generated from them; dimensionality is reduced with Eigenfaces; and the result is evaluated (classified).

Figure 4. Voice signals before (top) and after (bottom) the detection of vocalized segments. The first 20 ms of the signal are used to estimate the background-noise level, and a threshold of 15% of the signal range above the background noise is used to discriminate vocalized segments.
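As a rough illustration of this pre-processing (a sketch under the assumptions stated above, not the authors' code), the following fragment applies a standard first-order pre-emphasis filter and a simple energy-based voice activity detector that estimates the noise floor from the first 20 ms of the signal and keeps frames whose energy exceeds that floor by 15% of the observed energy range; the pre-emphasis coefficient and the exact thresholding rule are typical values assumed here:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def detect_voiced(x, fs, frame_ms=20, threshold_ratio=0.15):
    """Energy-based VAD: noise floor from the first 20 ms, 15%-of-range threshold."""
    n = int(fs * frame_ms / 1000)
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    noise_floor = energy[0]                       # first 20 ms assumed non-speech
    threshold = noise_floor + threshold_ratio * (energy.max() - noise_floor)
    voiced = energy > threshold
    if not voiced.any():
        return x
    return np.concatenate([f for f, v in zip(frames, voiced) if v])

# Usage: y = detect_voiced(pre_emphasis(signal), fs=16000)
```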

IV. FEATURE EXTRACTION

In the Feature Extraction stage a transformation is applied to the pre-processed speech signals, taking them from the domain and co-domain in which they originally reside to a new representation based on digital images, by means of the generation of Sonic Images of the speech signals. With the goal of stressing the particular characteristics residing in these signals, schemes from the state of the art in speech processing were drawn upon; specifically, Linear Prediction Coding, the Average Magnitude of the signal, the Fundamental Frequency of the signal and the Discrete Short-Time Fourier Transform were used (Figure 5).
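As an example of one of these schemes (an illustrative sketch, not the authors' implementation), Linear Prediction Coefficients can be obtained per frame with the autocorrelation method and the Levinson-Durbin recursion; the model order and the Hamming window are assumptions typical of speech analysis:

```python
import numpy as np

def lpc(frame, order=12):
    """Linear Prediction Coefficients via autocorrelation + Levinson-Durbin."""
    n = len(frame)
    frame = frame * np.hamming(n)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                       # avoid division by zero on silence
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]                 # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)
    return a                                 # A(z) = 1 + a1*z^-1 + ... + a_order*z^-order

# Usage: the coefficients of each short frame become one row of the feature image
```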


Figure 5. Representation of the speech signal as a decibel spectrogram, filtered to the frequency bands relevant to the human voice. In this fashion the stage transforms the speech signals into a representation in which the most relevant, particular and peculiar features of the signals are available in a simple, evident and affordable way, in order to improve the pattern recognition.
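A minimal sketch of how such a Sonic Image could be produced (the sampling rate, FFT size and the voice band of roughly 80 Hz to 4 kHz are assumptions for illustration, not parameters reported by the paper) is the following, using scipy's short-time Fourier analysis:

```python
import numpy as np
from scipy.signal import spectrogram

def sonic_image(x, fs=16000, f_lo=80.0, f_hi=4000.0):
    """Decibel spectrogram restricted to a band relevant to the human voice."""
    freqs, times, sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=240)
    band = (freqs >= f_lo) & (freqs <= f_hi)          # keep only voice-relevant rows
    img = 10.0 * np.log10(sxx[band] + 1e-12)          # power in decibels
    # normalize to [0, 1] so every word produces a comparable image
    img = (img - img.min()) / (img.max() - img.min() + 1e-12)
    return img                                        # 2-D array usable as an image

# Usage: img = sonic_image(voiced_segment); img.shape -> (freq_bins, time_frames)
```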

V. DIMENSIONALITY REDUCTION

For the training of the recognition algorithm (the main part of this research), the dimensionality of the space generated by the set of features extracted from the speech signals, now in the form of images, was reduced. This work makes use of Principal Component Analysis (PCA) through the Eigenfaces technique. By means of this technique a space containing all the different voice patterns that the system can classify was generated, training it prior to the recognition stage for which it was conceived.

Figure 6. Training set (generated for this example from spectrograms).

An image of N x M pixels can be decomposed into a column vector of N x M elements. The objective is to take a group of M images of N x M pixels, represent each of these images I as a vector (Figure 6), and find the average face of the signals. The difference between each member of the group and this average face was then found, producing the statistically centered data of Figure 7.

Figure 7. Statistically centered data (mean removed).

The covariance matrix of the training set (Figure 8) was calculated from these examples.

Figure 8. Covariance matrix generated from the training examples.

By the use of the covariance matrix, the eigenvalues of the space were found, and the best among them, those delivering the greatest power of representation, were selected (Figure 9).

Figure 9. Eigenvalues found from the covariance matrix.

From the corresponding eigenvectors, an eigenvector (eigenvoices) matrix (Figure 10) was constructed.
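For reference, a standard Eigenfaces formulation consistent with this description (following [5]; the symbols and the covariance normalization are the usual ones and are assumed here, not quoted from the paper) is:

```latex
% Psi: average face of the M training vectors Gamma_n; Phi_i: difference faces;
% C: covariance matrix whose leading eigenvectors u_k (the eigenvoices) span the space;
% omega_k: weights of an image Gamma projected onto the K retained eigenvoices.
\begin{align}
  \Psi     &= \frac{1}{M}\sum_{n=1}^{M} \Gamma_n \\
  \Phi_i   &= \Gamma_i - \Psi \\
  C        &= \frac{1}{M}\sum_{n=1}^{M} \Phi_n \Phi_n^{\mathsf{T}} = A A^{\mathsf{T}},
              \qquad A = [\,\Phi_1\ \Phi_2\ \cdots\ \Phi_M\,] \\
  \omega_k &= u_k^{\mathsf{T}} (\Gamma - \Psi), \qquad k = 1, \dots, K
\end{align}
```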



Figure 10. Eigenvoices matrix generated to train the ASR system.

For the construction of the knowledge library, each image from the training group was projected into the generated eigenspace and its weights were kept (Figure 11).

Figure 11. Knowledge library containing the weights of each of the training examples.

VI. AUTOMATIC RECOGNITION

The next step, now that the system is trained, is to begin the automatic recognition process by feeding in the words to be identified. To this end, the Feature Extraction stage is used to condition the signals. A Sonic Image of the word to be recognized is generated using the training and signal-identification stage, and this image is then projected into the space generated by that same stage. This procedure facilitates the use of a classification method that sorts the word to be recognized. The classifier is based on the k-Nearest Neighbors (k-NN) technique, which operates, using the Euclidean distance, on the weights obtained by projecting the voice signal to be recognized.

Finally, a testing platform was developed to evaluate the system in an integral as well as a reduced way. It allows modifying the signal-processing parameters, the type of feature extraction and the particular parameterization of each technique, the percentage of representation power of the generated space, the percentage of nearest neighbors used by the classifier as a function of the dimensionality of that space and, lastly, the number of elements with which the system is trained as well as the number of testing elements the system evaluates. To this end, the TMW voice corpus (Tohoku University and Panasonic Isolated Spoken Word Database), kindly granted for the development of this work by the voice resources group of the National Institute of Informatics of Japan, was used.

VII. SIMULATION RESULTS

After the completion of the proposed tests, it was found that the ASR system performs properly in the recognition of words, as shown in the graphs where the recognition rate is plotted against the number of training examples for the algorithm. They show that as the number of training elements grows, the recognition rate also increases.

Figure 12. Results for the Average Magnitude used as Feature Vectors, for a space of seven words and fifty testing elements.

Figure 13. Results for the Spectrograms used as Feature Vectors, for a space of seven words and fifty testing elements.

Figure 14. Results for the Linear Prediction Coefficients used as Feature Vectors, for a space of seven words and fifty testing elements.


Figure 15. Results for the Short-Time Analysis of the Fundamental Signal used as Feature Vectors, for a space of seven words and fifty testing elements.
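To make the training and identification procedure of Sections V and VI concrete, the following sketch builds an eigenspace from the sonic images, stores the projected weights as the knowledge library, and classifies a new image with k-Nearest Neighbors over the Euclidean distance. It is an illustrative reconstruction of the described procedure, not the authors' code, and uses the small-covariance trick common to the Eigenfaces method; the number of components and k are example values.

```python
import numpy as np

def train_eigenvoices(images, labels, num_components=20):
    """Build the eigenspace (eigenvoices) and the knowledge library of weights."""
    X = np.array([img.ravel() for img in images], dtype=float)   # one row per image
    mean = X.mean(axis=0)                                        # average "face"
    A = (X - mean).T                                             # columns are difference faces
    # Small-covariance trick: eigenvectors of A^T A give those of A A^T
    vals, vecs = np.linalg.eigh(A.T @ A)
    order = np.argsort(vals)[::-1][:num_components]              # keep the largest eigenvalues
    eigvoices = A @ vecs[:, order]
    eigvoices /= np.linalg.norm(eigvoices, axis=0)               # unit-norm eigenvoices
    library = ((X - mean) @ eigvoices, list(labels))             # weights of each training example
    return mean, eigvoices, library

def classify_knn(image, mean, eigvoices, library, k=3):
    """Project the unknown image and vote among its k nearest training weights."""
    weights, labels = library
    w = (image.ravel() - mean) @ eigvoices
    dist = np.linalg.norm(weights - w, axis=1)                   # Euclidean distances
    nearest = np.argsort(dist)[:k]
    candidates = [labels[i] for i in nearest]
    return max(set(candidates), key=candidates.count)            # majority vote
```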

VIII. CONCLUSIONS

The main contribution of this work is the development of an original method for Automatic Speech Recognition that exploits the hypothesis that the voice possesses an inherent low-dimensional structure, drawing upon Facial Recognition paradigms. The following is a summary of the advantages of the proposed technique:

• A method capable of recognizing speech through the construction of images, taking advantage of the feature vectors extracted from the voice.

• The application of this method to classic feature-extraction procedures in speech processing, in order to verify the functionality of the method and, at the same time, to compare the performance and robustness of these traditional procedures under this new scheme.

• The creation of a feature-extraction technique that imprints onto an image the fundamental signals obtained by the frame-by-frame analysis of a voice signal.

REFERENCES

[1] Tohoku University Matsushita Electric Isolated Word Speech Database (TMW).

[2] R. J. Weiss and D. P. Ellis, "Speech separation using speaker-adapted eigenvoice speech models," Computer Speech and Language, Elsevier, vol. 24, pp. 16-29, 2010.

[3] A. Errity, "Exploring the Dimensionality of Speech using Manifold Learning and Dimensionality Reduction Methods," School of Computing, Faculty of Engineering and Computing, Dublin City University, Dublin, 2010.

[4] G. E. Peterson and H. L. Barney, "Control Methods Used in a Study of the Vowels," Journal of the Acoustical Society of America, vol. 24, no. 2, pp. 175-184, 1952.

[5] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '91), pp. 586-591, 1991.

[6] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve procedure for the characterization of human faces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 1, pp. 103-108, 1990.

[7] R. J. Weiss and D. P. Ellis, "Speech separation using speaker-adapted eigenvoice speech models," Computer Speech and Language, Elsevier, vol. 24, pp. 16-29, 2010.


Selene Edith Maya Rueda was born in the city of Puebla, Puebla, Mexico. She received the degree of Bachelor of Electronics at the Faculty of Physics and Mathematics of the Benemérita Universidad Autónoma de Puebla in 1995, and the degree of Master of Science in Electrical Engineering at the National Institute of Astrophysics, Optics and Electronics. Currently she is part of the academic staff of the Faculty of Electronics Sciences of the Benemérita Universidad Autónoma de Puebla, assigned to the area of Digital Systems. She has carried out research in the areas of Digital Image Analysis, Computer Vision and Power Electronics.

M. of En. Juan Fernando Piñal Moctezuma was born in Puebla, Mexico, in 1982. He received a Master of Electronics Engineering, graduating with honors, from the Benemérita Universidad Autónoma de Puebla in 2013; a Master of Science in Electronics and Telecommunications from the Centro de Investigación Científica y de Educación Superior de Ensenada, Baja California, in 2010; a Diploma in Digital Systems with Reconfigurable Logic in 2007; and a Bachelor of Science in Electronics from the Benemérita Universidad Autónoma de Puebla, also in 2007. His academic specialties are Instrumentation and Control, Digital Systems and Pattern Recognition, and Power Electronics and Digital Systems, respectively. He is an IEEE member. Since 2013 he has been working as Services Manager and Engineering Sub-Manager at the company Intema S.A. de C.V.

Rodríguez Ana María was born in Mexico in 1970. She received the M.Sc. degree in Electronics Engineering at the Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE) in 1998. She is currently working as a titular professor in the Faculty of Electronics Sciences at the Universidad Autónoma de Puebla, Mexico, in the area of digital systems and its applications.

Dr. Salvador Ayala-Raggi was born in Puebla, Mexico. He is a professor at the Benemérita Universidad Autónoma de Puebla. He received his Ph.D. and Master of Science degrees in Electronics Engineering from the Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), and a Bachelor in Electronics degree from the Benemérita Universidad Autónoma de Puebla. He is a specialist in digital signal processing.

M. of Sc. Francisco Portillo Robledo was born in Puebla, Mexico. He is a professor at the Benemérita Universidad Autónoma de Puebla. He received a Master of Science degree in Electronics from the Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), and a Bachelor in Electronics degree from the Benemérita Universidad Autónoma de Puebla. He is a specialist in digital systems design. Currently he is the Academic Secretary of the Faculty of Electronics Sciences.

Dra. Josefina Castañeda-Camacho was born in Puebla, Mexico, in 1973. She received the B.Sc. degree in Electrical Engineering from the Autonomous University of Puebla in 1996, and the M.Sc. and Ph.D. degrees in Electrical Engineering from the Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV) in 2000 and 2007, respectively. Currently she works in the Electronics Department of the Benemérita Universidad Autónoma de Puebla (BUAP). Her main research interests include teletraffic analysis, cellular digital systems dimensioning, and performance modeling and evaluation of overlaid systems and packet networks.

M. of Sc. Rodrigo Maya was born in Puebla, Mexico. He is a professor at the Benemérita Universidad Autónoma de Puebla. He received a Master of Science degree in Electronics from the Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), and a Bachelor in Electronics degree from the Instituto Politécnico Nacional, Mexico. He is a specialist in digital systems design.

Dr. Gerardo Mino-Aguilar was born in Puebla, Mexico, in 1971. In 1995 he received a Bachelor degree in Electronics Engineering from the Benemérita Universidad Autónoma de Puebla; in 1999, a Master degree from the Universidad de las Américas Puebla; and in 2006, a Ph.D. in Electronics Engineering from the Universitat Politècnica de Catalunya, Spain. Currently, he is the head of the postgraduate department for the master degree in Electronics Engineering and Electronics Instrumentation at BUAP. His work is focused on power electronics, power quality, power generation, electric drives, and drivetrains for electric vehicles.

M. Sc. Álvarez González Ricardo was born in Puebla, Mexico, in 1967. He received the B.S. degree in Electronics from the Benemérita Universidad Autónoma de Puebla (BUAP) and the M.S. degree in Electronics from the National Institute of Astrophysics, Optics and Electronics (INAOE). He currently works as a Titular Professor in the Faculty of Electronics Sciences of the Benemérita Universidad Autónoma de Puebla, Mexico.

M. of En. Tecilli Tapia Tlatelpa was born in Mexico in 1987. She received a Master of Electronics Engineering and a Bachelor of Science in Electronics degree from the Benemérita Universidad Autónoma de Puebla in 2013 and 2009, respectively. Her academic specialties are Digital Communications Systems and Optoelectronics, respectively.

