► Recall
► What are sound features?
► Feature detection and extraction
► Features in Sphinx III
Recall:
► Speech signal is a ‘slowly’ time-varying signal
► There are a number of linguistically distinct speech
sounds (phonemes) in a language.
► It is possible to represent a speech sound as a 3D
spectrogram, showing the speech intensity in the
different frequency bands over time
► Most speech recognition systems rely heavily on vowel
recognition to achieve high performance (vowels are
long in duration and spectrally well defined, and
therefore easily recognized)
Speech sounds and features
Examples:
► Vowels (a, u, …)
► Diphthongs (e.g. ay as in guy, …)
► Semivowels (w, l, r, y)
► Nasal Consonants (m, n)
► Unvoiced Fricatives (f, s)
► Voiced Fricatives (v, th, z)
► Voiced Stops (b, d, g) and Unvoiced Stops (p, t, k)
MFCC’s
► MFCC’s are beautiful, because they incorporate knowledge of the
nature of speech sounds in measurement of the features.
► They utilize rudimentary models of human perception.
► The Fourier Transform maps the signal from the time domain to the
frequency domain.
► The frequency domain is sliced into 12 smaller parts, each with its
own MFCC.
► The features include absolute energy and 12 spectral measurements.
► Time derivatives model spectral change.
[Diagram: Input Speech → Fourier Transform → Cepstral Analysis →
Perceptual Weighting, followed by two Time Derivative stages. Outputs:
Energy + Mel-Spaced Cepstrum, Delta Energy + Delta Cepstrum, and
Delta-Delta Energy + Delta-Delta Cepstrum.]
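The delta and delta-delta streams in the diagram are typically computed as regression-based time derivatives of the cepstral trajectory. A minimal numpy sketch (the window width and edge padding here are assumptions, not necessarily Sphinx’s exact settings):

```python
import numpy as np

def delta(features, width=2):
    """Regression-based time derivative of a (frames x coeffs) matrix.

    Implements the standard delta formula
        d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2)
    over +/- `width` frames, repeating the edge frames as padding.
    """
    n = len(features)
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (padded[width + k : width + k + n] -
                   padded[width - k : width - k + n])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

# Toy cepstral matrix: 5 frames x 3 coefficients, rising by 3 per frame,
# so the delta at each interior frame is exactly 3.
cep = np.arange(15, dtype=float).reshape(5, 3)
d = delta(cep)     # delta cepstrum
dd = delta(d)      # delta-delta cepstrum
```

Applying the same operator twice yields the delta-delta (acceleration) stream, matching the two cascaded Time Derivative stages in the diagram.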
What ‘to do’ with the MFCC’s:
► A speech recognizer can be built using the energy values (time domain) and
the 12 MFCC’s (frequency domain), plus the first- and second-order derivatives
of those coefficients.
► The derivatives are useful because they provide information about spectral
change
In Sphinx, the streams of features are stored in the feature vectors as follows:
► CEP: C1-C12
► DCEP: D1-D12
► Energy values: C0, D0, DD0
► D2CEP: DD1-DD12
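The four streams above can be concatenated into one 39-dimensional observation per frame (13 cepstral values including energy, plus their first and second derivatives). A sketch with hypothetical random data standing in for real extracted features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100 frames of the streams listed above.
# C0/D0/DD0 are the energy values; C1-C12 etc. are the cepstral streams.
frames = 100
cep   = rng.standard_normal((frames, 13))   # C0  + C1-C12   (CEP)
dcep  = rng.standard_normal((frames, 13))   # D0  + D1-D12   (DCEP)
d2cep = rng.standard_normal((frames, 13))   # DD0 + DD1-DD12 (D2CEP)

# One 39-dimensional observation per frame for the recognizer.
feature_vectors = np.hstack([cep, dcep, d2cep])
```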
► So, at this point in the speech recognition process, you have stored
feature vectors for the entire speech segment you are looking at,
providing the necessary information about what kinds of features are in
that segment.
► Now, the feature stream can be analyzed using a Hidden-Markov
Model (HMM)
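As a sketch of that analysis step: the forward algorithm scores how well an observation sequence fits a given HMM, and a word recognizer evaluates each word model and picks the best score. The discrete-symbol emissions below are a simplification (recognizers like Sphinx use Gaussian mixtures over the continuous feature vectors):

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Log-likelihood log P(obs | model) via the forward algorithm.

    pi:  (N,)   initial state probabilities
    A:   (N, N) transitions, A[i, j] = P(next state j | state i)
    B:   (N, M) emissions,   B[i, k] = P(symbol k | state i)
    obs: sequence of observation symbol indices
    """
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()          # rescale each step to avoid underflow
        log_lik += np.log(s)
        alpha = alpha / s
    return log_lik

# A recognizer would score the feature stream against each word's HMM
# (e.g. the models for "one", "two", "oh") and pick the highest score.
```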
[Diagram: input speech passes through Feature Extraction Modules to
produce a Feature Vector stream with features such as voicing, round,
nasal, glide, frication, and burst. Concatenated HMM states (a1 … a6)
are trained to model words such as “one”, “two”, and “oh”. The feature
stream is analyzed using a Hidden-Markov Model (HMM).]