
Overview

► Recall
► What are sound features?
► Feature detection and extraction
► Features in Sphinx III
Recall:
► Speech signal is a ‘slowly’ time-varying signal
► There are a number of linguistically distinct speech
sounds (phonemes) in a language.
► The speech signal can be represented as a 3D
spectrogram: speech intensity across different
frequency bands over time
► Most SR systems rely heavily on vowel recognition to
achieve high performance (vowels are long in duration
and spectrally well defined, and therefore easily
recognized)
Speech sounds and features
Examples:
► Vowels (a, u, …)
► Diphthongs (f.i. ay as in guy, …)
► Semivowels (w, l, r, y)
► Nasal Consonants (m, n)
► Unvoiced Fricatives (f, s)
► Voiced Fricatives (v, th, z)
► Voiced Stops (b, d, g) and Unvoiced Stops (p, t, k)

► They all have their own characteristics (features)


ASR Stages
1) speech analysis system: to provide an appropriate spectral
representation of the characteristics of the time-varying speech
signal
2) feature detection stage: to convert the spectral
measurements to a set of features that describe the broad acoustic
properties of the different phonetic units (f.i. nasality, frication,
formant locations, voiced-unvoiced classification, ratios of high-
and low-frequency energy, etc.)
3) segmentation and labeling phase: to find stable regions
and then label the segmented region according to how well the
features within that region match those of individual phonetic
units
4) final output of the recognizer: the word or word sequence
that best matches the extracted features
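A minimal sketch of this four-stage data flow, with placeholder stage functions (the names spectral_analysis, detect_features, segment_and_label, the toy thresholds, and the frame sizes are illustrative, not Sphinx code):

```python
import numpy as np

def spectral_analysis(samples, rate, frame=400, hop=160):
    # 1) short-time spectral representation (magnitude spectrogram)
    frames = [samples[i:i + frame] * np.hamming(frame)
              for i in range(0, len(samples) - frame, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def detect_features(spectra):
    # 2) broad acoustic properties per frame, e.g. energy and a
    #    high/low-frequency energy ratio (stand-ins for nasality etc.)
    energy = spectra.sum(axis=1)
    ratio = spectra[:, spectra.shape[1] // 2:].sum(axis=1) / (energy + 1e-10)
    return np.column_stack([energy, ratio])

def segment_and_label(feats):
    # 3) toy labeling: threshold the features per frame (a real system
    #    matches regions against phonetic units instead)
    return ['fricative-like' if r > 0.5 else 'vowel-like' for _, r in feats]

samples = np.random.randn(16000)          # one second of dummy audio
labels = segment_and_label(detect_features(spectral_analysis(samples, 16000)))
# 4) a real recognizer would now find the word sequence best matching `labels`
```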
Feature detection (and extraction)
► A speech segment contains certain characteristics,
or features.
► Different segments of speech contain different features,
specific to the kind of segment!
► Goal is to try to classify a speech segment into one of
several broad speech classes (f.i. via a binary tree:
compact/diffuse, acute/grave, long/short, high/low
frequency, etc; a toy sketch follows below)
► Ideally, feature vectors for a given word should be the
same regardless of the way in which the word has been
uttered
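A toy sketch of such a binary-tree classification, assuming hypothetical binary features like 'voiced', 'compact' and 'frication' (the questions and class names are illustrative only, not a real feature set):

```python
def broad_class(features):
    """Toy binary-tree classifier over hypothetical binary features,
    mirroring compact/diffuse, acute/grave style questions."""
    if features['voiced']:
        if features['compact']:
            return 'vowel-like'
        return 'nasal/glide-like'
    if features['frication']:
        return 'unvoiced fricative'
    return 'unvoiced stop'

# e.g. a voiced, compact segment is classed as vowel-like
print(broad_class({'voiced': True, 'compact': True, 'frication': False}))
```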
Last week:
Mel-Frequency Cepstrum Coefficients
► Fourier Transform extracts the frequency components of
a signal in the time domain
► Frequency domain is filtered/sliced into 12 smaller parts,
and for each part its own coefficient (MFCC) can be
calculated
► MFCCs use the log-spectrum of the speech signal.
The logarithmic nature of the technique is significant, since
the human auditory system perceives sound on a
logarithmic scale above certain frequencies
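A minimal sketch of this computation for one frame: FFT, mel-spaced triangular filters, log, then a DCT keeping 12 coefficients. The parameters (16 kHz rate, 26 filters, 512-point FFT) and helper names are assumptions for illustration; Sphinx's own front end differs in detail:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sample_rate=16000, n_filters=26, n_coeffs=12):
    """Compute MFCCs for a single speech frame (windowed inside)."""
    n_fft = 512
    # Power spectrum of the Hamming-windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # Triangular filterbank, equally spaced on the mel scale up to Nyquist
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    # Log filterbank energies -> DCT -> keep coefficients 1..n_coeffs
    log_energies = np.log(spectrum @ fbank.T + 1e-10)
    return dct(log_energies, type=2, norm='ortho')[1:n_coeffs + 1]
```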
Acoustic Modeling: Feature Extraction

[Diagram: Input Speech -> Fourier Transform -> Cepstral Analysis ->
Perceptual Weighting -> Time Derivative (twice), producing Energy +
Mel-Spaced Cepstrum, Delta Energy + Delta Cepstrum, and Delta-Delta
Energy + Delta-Delta Cepstrum]

• MFCC’s are beautiful, because they incorporate knowledge of the
nature of speech sounds in measurement of the features.
• Fourier Transform: time domain -> frequency domain
• Frequency domain is sliced into 12 smaller parts, each with its
own MFCC
• Utilize rudimentary models of human perception.
• Include absolute energy and 12 spectral measurements.
• Time derivatives to model spectral change
What to do with the MFCCs:
► A speech recognizer can be built using the energy values (time domain) and
12 MFCCs (frequency domain), plus the first- and second-order derivatives of
those coefficients:

13 (absolute energy (1) and MFCCs (12))
13 (delta: first-order derivatives of the 13 absolute coefficients)
13 (delta-delta: second-order derivatives of the 13 absolute coefficients)
------------------------------------------------
39 total: basic MFCC front end

► The derivatives are useful because they provide information about the spectral
change
► This total of 39 coefficients will provide information about the different
features in that segment! (a sketch of the delta computation follows below)
► The feature measurements of the segments are stored in so-called ‘feature
vectors’, which can be used in the next stage of the speech recognition (f.i. a
Hidden Markov Model)
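A sketch of how the 39-dimensional vectors could be assembled from the 13 static coefficients per frame, using a standard regression-style delta over a window of ±2 neighbouring frames (the window size and function names are assumptions, not Sphinx's exact recipe):

```python
import numpy as np

def add_deltas(static, window=2):
    """Append first- and second-order time derivatives to the 13 static
    coefficients per frame (energy + 12 MFCCs), giving 39 per frame.
    `static` has shape (n_frames, 13)."""
    def delta(feats):
        # Regression-style slope over +/- `window` neighbouring frames,
        # repeating the edge frames at the segment boundaries
        padded = np.pad(feats, ((window, window), (0, 0)), mode='edge')
        num = sum(t * (padded[window + t:len(feats) + window + t]
                       - padded[window - t:len(feats) + window - t])
                  for t in range(1, window + 1))
        return num / (2 * sum(t * t for t in range(1, window + 1)))

    d = delta(static)                    # delta: spectral change
    dd = delta(d)                        # delta-delta: acceleration
    return np.hstack([static, d, dd])    # shape (n_frames, 39)

static = np.random.randn(100, 13)        # dummy static coefficients
feats = add_deltas(static)               # -> shape (100, 39)
```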
In Sphinx III:
computation of feature vectors
► feat_s2mfc2feat
► feat_s2mfc2feat_block

1. MFC file is read (a hedged reader sketch follows below)
2. Initialization: defining the kind of input->feature conversion desired (there are
some differences between Sphinx II and Sphinx III)
3. Feature vectors are computed for the entire segment specified
(feat_s2mfc2feat and feat_s2mfc2feat_block)
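For illustration, a hedged Python reader for such an MFC file, assuming the common Sphinx-II-style convention of a 4-byte integer count of float32 values followed by the cepstra themselves; the byte-order guess and the n_ceps default are assumptions, not guaranteed properties of every MFC file:

```python
import numpy as np

def read_mfc(path, n_ceps=13):
    """Read a Sphinx-style .mfc file (assumed format: int32 count of
    float32 values, then frames x n_ceps cepstra). The byte order is not
    stored in the file, so it is guessed from the header."""
    with open(path, 'rb') as f:
        raw = f.read()
    count = np.frombuffer(raw[:4], dtype='<i4')[0]
    order = '<'
    if count != (len(raw) - 4) // 4:     # header disagrees: try big-endian
        count = np.frombuffer(raw[:4], dtype='>i4')[0]
        order = '>'
    data = np.frombuffer(raw[4:4 + 4 * count], dtype=order + 'f4')
    return data.reshape(-1, n_ceps)      # one row of 13 cepstra per frame
```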

In Sphinx, within the feature vectors, the streams of features are stored as follows:
► CEP: C1-C12
► DCEP: D1-D12
► Energy values: C0, D0, DD0
► D2CEP: DD1-DD12
► So, at this point in the speech recognition process, you have stored
feature vectors for the entire speech segment you are looking at,
providing the necessary information about what kind of features are in
that segment. (a stream-splitting sketch follows below)
► Now, the feature stream can be analyzed using a Hidden Markov
Model (HMM)
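A small sketch of slicing one 39-dimensional frame into these four streams, assuming the frame is laid out as C0-C12, D0-D12, DD0-DD12 as on the previous slide; the input layout and dictionary keys are illustrative, not Sphinx's internal storage order:

```python
import numpy as np

def split_streams(vec39):
    """Split one 39-dimensional frame (C0-C12, D0-D12, DD0-DD12) into the
    four streams listed above."""
    c, d, dd = vec39[0:13], vec39[13:26], vec39[26:39]
    return {
        'CEP':    c[1:13],                          # C1-C12
        'DCEP':   d[1:13],                          # D1-D12
        'ENERGY': np.array([c[0], d[0], dd[0]]),    # C0, D0, DD0
        'D2CEP':  dd[1:13],                         # DD1-DD12
    }

streams = split_streams(np.arange(39.0))            # dummy frame
```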
[Diagram: Input speech -> Feature Extraction Modules -> Feature Vector ->
HMM. Detected features such as voicing, round, nasal, glide, frication and
burst are concatenated and used to train HMMs (states a1, a2, …, a5, a6)
for words such as “one”, “two” and “oh”; the feature stream is analyzed
using a Hidden Markov Model (HMM)]
