D.MEENAKSHI
G.SILPA
V.RAJITHA
BESSEL FEATURES FOR SPEECH SIGNAL PROCESSING
PROJECT REPORT
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF
BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
BY
D.MEENAKSHI (06261A0412)
G.SILPA (06261A0420)
V.RAJITHA (06261A0456)
MAHATMA GANDHI INSTITUTE OF TECHNOLOGY
(Affiliated to Jawaharlal Nehru Technological University, Hyderabad, A.P.)
CERTIFICATE
This is to certify that the project work entitled "Bessel Features for Speech
Signal Processing" is a bona fide work carried out by
D.Meenakshi (06261A0412)
G.Silpa (06261A0420)
V.Rajitha (06261A0456)
The results embodied in this report have not been submitted to any other University or
Institution for the award of any degree or diploma.
(Signature)                                   (Signature)
Mr. T. D. Bhatt, Associate Professor          Dr. Nagbhooshanam
Faculty Advisor/Liaison                       Professor & Head
ACKNOWLEDGEMENT
Finally, we thank all the people who have directly or indirectly helped us through
the course of our Project.
D.Meenakshi
G. Silpa
V. Rajitha
ABSTRACT
The human speech signal is a multicomponent signal whose components are
called formants. Multicomponent signals produce delineated concentrations in the
time-frequency plane, with a clear delineation into different regions. Different
time-frequency distributions may give somewhat different representations; however,
they all give roughly the same picture with regard to the existence of the
components. Most efforts in devising recognition schemes have been directed toward
the recognition of human speech. Although it has been appreciated for over fifty
years that speech is multicomponent, no particular exploitation had been made of that fact.
Recently, however, an ingenious idea has been proposed and developed by
Fineberg and Mammone which takes advantage of the multicomponent nature of a signal.
Suppose, for the sake of illustration, we consider signals consisting of two components.
The phase of the first component of the unknown and of the template candidate is
determined. Subtracting the two phases at each instant of time defines the
transformation function for going from the template to the unknown. One can think of
this as the possible distortion function for the first component; it would equal zero
if there were no distortion. Similarly, one determines the distortion function for the
second component. If the two distortion functions are equal, then we have a match.
Fineberg and Mammone have successfully used this method for the classification of speech.
The excellence of the results can be interpreted as indicating that formants are
indeed correlated and distorted in the same way. This is an important finding about the
nature of speech. Note that in the above discussion the distortion function is taken
to be the difference of the phases; however, different circumstances may make other
distortion functions more appropriate. For example, one can define the distortion
function to be the ratio of the two phases. In general, one can think of the distortion
as a function of the signal and the environment. It would be of considerable interest
to investigate distortion functions for common situations.
The discrete energy separation algorithm (DESA) together with Gabor
filtering provides a standard approach to estimating the amplitude envelope (AE) and
instantaneous frequency (IF) functions of a multicomponent amplitude- and
frequency-modulated (AM-FM) signal. The filtering operation introduces amplitude and
phase modulation in the separated monocomponent signals, which may lead to errors in
the final estimation of the modulation functions. We have used a method called the
Fourier-Bessel expansion-based discrete energy separation algorithm (FB-DESA) for
component separation and estimation of the AE and IF functions of a multicomponent
AM-FM signal. The FB-DESA method does not introduce any amplitude or phase
modulation in the separated monocomponent signals, leading to accurate estimates of
the AE and IF functions. Simulation results with synthetic and natural signals are
included to illustrate the effectiveness of the method.
Table of contents

CHAPTER 1. OVERVIEW
1.1 Introduction
1.2 Aim of the project
1.3 Methodology
1.4 Significance and applications
1.5 Organization of work

CHAPTER 2. INTRODUCTION TO BESSEL FEATURES FOR SPEECH SIGNAL PROCESSING
2.1 Introduction
2.2 Multicomponent signal
2.3 Series representation
2.4 Fourier-Bessel expansion

CHAPTER 3. REVIEW OF APPROACHES FOR BESSEL FEATURES
3.1 Introduction
3.2 Parseval's Formula
3.3 Hankel transform

CHAPTER 4. THEORY OF BESSEL FEATURES
4.1 Introduction
4.2 Solution of the Differential Equation
4.3 Mean Frequency computation
4.4 Reconstruction of the signal

CHAPTER 5. BESSEL FEATURES FOR DETECTION OF VOICE ONSET TIME (VOT)
5.1 Introduction
5.2 Significance of VOT
5.3 Detection of VOT
5.3.1 Fourier-Bessel expansion
5.3.2 AM-FM model and DESA method
5.3.3 Approach to detect VOTs from speech using Amplitude Envelope (AE)
5.4 Results
5.5 Conclusions

CHAPTER 6. BESSEL FEATURES FOR DETECTION OF GLOTTAL CLOSURE INSTANTS (GCI)
6.1 Introduction
6.2 Significance of Epoch in speech analysis
6.3 Review of existing approaches
6.3.1 Epoch extraction from short-time energy of speech signal
6.3.2 Epoch extraction from linear prediction analysis
6.3.3 Limitation of existing approaches
6.4 Detection of GCI using FB expansion and AM-FM model
6.4.1 Fourier-Bessel expansion
6.4.2 AM-FM model and DESA method
6.5 Studies on detection of GCIs for various speech utterances
6.6 Glottal activity detection
6.7 Conclusion

CHAPTER 7. SUMMARY AND CONCLUSIONS
7.1 Summary of the work

CHAPTER 8. REFERENCES
LIST OF FIGURES
5.5 Plot of the bar graphs for utterances of /ke/, /te/, /pe/
LIST OF TABLES
5.1 FB coefficient orders for emphasizing the vowel and consonant parts
5.2 VOT values for female (F01) and male (M05) speakers
CHAPTER 1
OVERVIEW
1.1 INTRODUCTION
Multicomponent signals produce delineated concentrations in the time-frequency
plane. The human speech signal is a multicomponent signal whose components are called
formants. Most efforts in devising recognition schemes have been directed toward the
recognition of human speech. Although it has been appreciated for over fifty years
that speech is multicomponent, no particular exploitation had been made of that fact.
Recently, however, an ingenious idea that takes advantage of the multicomponent
nature of a signal has been proposed and developed by Fineberg and Mammone.
1.3 METHODOLOGY
The Fourier-Bessel (FB) expansion and the AM-FM model have been employed to
obtain efficient results in speech signal processing.
CHAPTER 2
INTRODUCTION TO BESSEL FEATURES FOR SPEECH
SIGNAL PROCESSING
2.1 INTRODUCTION
The human speech signal is a multicomponent signal whose components are called
formants. Multicomponent signals produce delineated concentrations in the
time-frequency plane. The general belief is that in a multicomponent signal each
concentration in the time-frequency plane corresponds to a part of the signal. That is, if

S(t) = S1(t) + S2(t),

then each part, S1 and S2, corresponds to a component. A monocomponent signal looks
like a single mountain ridge in the time-frequency plane. The center of the ridge forms
a trajectory whose frequency generally varies from time to time; this trajectory is the
instantaneous frequency of the signal. If the signal is written in the form
S(t) = A(t) e^{jφ(t)},

the instantaneous frequency is the derivative φ′(t) of the phase, which can be
interpreted as an average, namely the average of the frequencies present at a given
time. The broadness of the ridge at that time is the spread (root mean square) of
frequencies at time t; we call it the instantaneous bandwidth.
Basis function for zero-order Bessel function
The use of the Fourier-Bessel expansion in speech processing, particularly speaker
identification, bears further research. The shift-variant property of the Hankel
transform may prove valuable for non-stationary analysis, and there are some
indications that fewer coefficients may be required. Since the coefficients are real,
the speech can be reconstructed directly from its coefficient-versus-time-index plot
without the need to retain phase components; this may prove useful if conversion back
to the time domain is desired.
2.4 CONCLUSIONS
CHAPTER 3
REVIEW OF APPROACHES FOR BESSEL FEATURES
3.1 INTRODUCTION
Any orthonormal set of basis functions can be used to represent an arbitrary
function. Fourier series theory tells us that the series coefficients are given by a
discrete Fourier transform, so coefficient generation is straightforward with the
numerous FFT algorithms that abound.

Calculation of the Fourier-Bessel series coefficients requires computation of a
Hankel transform, which until recently greatly diminished consideration of this series
for potential applications. Fast Hankel transforms have now been developed which allow
computation of F-B coefficients at a speed only slightly slower than Fourier
coefficients; this should result in increased use of the F-B expansion.
3.2 PARSEVAL'S FORMULA

For the zero-order Fourier-Bessel expansion on (0, a), the signal energy satisfies

∫_0^a t f²(t) dt = (a²/2) Σ_{m=1}^∞ C_m² [J_1(λ_m)]².

This is simply a restatement of Parseval's well-known formula concerning
Fourier series coefficients.
3.3 HANKEL TRANSFORM
3.4 CONCLUSIONS
Any orthonormal set of basis functions can be used to represent an arbitrary
function. Calculation of the Fourier-Bessel series coefficients requires computation
of a Hankel transform, which until recently greatly diminished consideration of this
series for potential applications. A fast Hankel transform algorithm was presented
that allows the Fourier-Bessel series coefficients to be computed efficiently.
CHAPTER 4
THEORY OF BESSEL FEATURES
4.1 INTRODUCTION
Bessel functions arise as solutions of Bessel's differential equation

x² y″ + x y′ + (x² − n²) y = 0.

The general solution is y(x) = A Jn(x) + B Yn(x), where Jn(x) is called a Bessel
function of the first kind of order n and Yn(x) is called a Bessel function of the
second kind of order n. Bessel functions are expressible in series form. It should be
noted that the FB series coefficients Cm are unique for a given signal, just as the
Fourier series coefficients are unique for a given signal. However, unlike the
sinusoidal basis functions in the Fourier series, the Bessel functions decay over
time. This feature of the Bessel functions makes the FB series expansion suitable for
nonstationary signals.
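The claim that Jn solves this equation can be checked numerically. The sketch below uses SciPy's `jv`/`jvp`; the evaluation grid and tolerance are illustrative choices, not part of the report:

```python
import numpy as np
from scipy.special import jv, jvp  # Bessel J_n and its derivatives

n = 0
x = np.linspace(0.5, 20.0, 200)   # avoid x = 0, where the equation is singular

# Residual of Bessel's equation x^2 y'' + x y' + (x^2 - n^2) y = 0 for y = J_n
residual = x**2 * jvp(n, x, 2) + x * jvp(n, x, 1) + (x**2 - n**2) * jv(n, x)
print(np.max(np.abs(residual)))   # should be near machine precision
```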
Also, it is possible that the Fourier-Bessel coefficients in some sense better
capture the fundamental nature of the speech waveform; the shift-variant property may
be desirable and could result in improved speaker identification/authentication
probabilities. Since the Fourier-Bessel coefficients are real, the noisy-phase problem
upon reconstruction is avoided, which may be advantageous. The entire range of image
processing algorithms developed over the past several decades would be available for
exploitation to improve upon the speech characteristics.
In particular,

J0(x) = 1 − x²/2² + x⁴/(2²·4²) − ⋯ = Σ_{k≥0} (−1)^k (x/2)^{2k} / (k!)².
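This truncated series can be compared against a library implementation; the 25-term cutoff and the test points below are arbitrary illustrative choices:

```python
import math
from scipy.special import j0

def j0_series(x, terms=25):
    """Partial sum of J0(x) = sum_k (-1)^k (x/2)^(2k) / (k!)^2."""
    return sum((-1)**k * (x / 2.0)**(2 * k) / math.factorial(k)**2
               for k in range(terms))

# The truncated series agrees closely with SciPy's j0 for moderate x
for x in (0.5, 2.0, 5.0):
    print(x, j0_series(x), j0(x))
```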
It can be readily shown that Bessel functions are orthogonal with respect to the
weighting function x. This can be seen by computing

∫_0^1 x Jn(αx) Jn(βx) dx = [β Jn(α) J′n(β) − α Jn(β) J′n(α)] / (α² − β²)

and

∫_0^1 x Jn²(αx) dx = ½ [J′n(α)² + (1 − n²/α²) Jn(α)²].

Now if α and β are different roots of Jn(x) = 0, we can write

∫_0^1 x Jn(αx) Jn(βx) dx = 0,   α ≠ β,

and thus Jn(αx) and Jn(βx) are orthogonal with respect to the weighting function x.
Having established orthogonality, a series expansion of an arbitrary function can be
written in terms of Bessel functions in the form

f(x) = Σ_{m=1}^∞ Cm Jn(λm x),

where λ1, λ2, λ3, … are the positive roots of Jn(x) = 0. The coefficients Cm are
given by

Cm = 2 ∫_0^1 x f(x) Jn(λm x) dx / [J_{n+1}(λm)]².

If we wish to expand f(t) over some arbitrary interval (0, a), the zero-order Bessel
series expansion becomes

f(t) = Σ_{m=1}^∞ Cm J0(λm t / a),   0 < t < a,

with the coefficients Cm calculated from

Cm = 2 ∫_0^a t f(t) J0(λm t / a) dt / (a² [J1(λm)]²),

where λm are the ascending-order positive roots of J0(λ) = 0. The integral in the
numerator is the Hankel transform.
The coefficients of the FB expansion have been used to compute the mean
frequency. The FB coefficients are unique for a given signal in the same way that
Fourier series coefficients are unique for a given signal. However, unlike the
sinusoidal basis functions in the Fourier transform, the Bessel functions are
aperiodic and decay over time. These features of the Bessel functions make the FB
series expansion better suited to the analysis of nonstationary signals than the
simple Fourier transform.
4.3 MEAN FREQUENCY COMPUTATION
Here 1 ≤ m ≤ Q, where Q is the order of the FB expansion and J1(·) denotes the
first-order Bessel function. The FB expansion order Q must be known a priori. The
interval between successive zero-crossings of the Bessel function J0(·) increases
slowly with time and approaches π in the limit. If the order Q is unknown, then in
order to cover the full signal bandwidth of half the sampling frequency, Q must be
set equal to the length of the signal.
The mean frequency is calculated as in [11]:

F_mean = Σ_{m=1}^{Q} f_m E_m / Σ_{m=1}^{Q} E_m,

where

E_m = C_m²   (energy at order m),
f_m = m / (2a)   (frequency at order m).
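The mean-frequency formula can be sketched directly from a coefficient vector; the one-hot test vector below is an illustrative example, not data from the report:

```python
import numpy as np

def mean_frequency(C, a):
    """F_mean = sum_m f_m E_m / sum_m E_m with E_m = C_m^2, f_m = m / (2a)."""
    m = np.arange(1, len(C) + 1)
    E = C**2                    # energy at order m
    f = m / (2.0 * a)           # frequency at order m (Hz)
    return np.sum(f * E) / np.sum(E)

# A coefficient vector whose energy sits entirely at order m = 100, a = 1 s
C = np.zeros(300)
C[99] = 1.0
print(mean_frequency(C, 1.0))   # 50.0 Hz
```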
Signals with decaying characteristics (such as speech) may be more compactly
represented by Bessel-function basis vectors than by pure sinusoids.
For the test function f(t) = J0(t), the Fourier series coefficients produced an
extremely accurate reconstruction of the function under transformation. A
Fourier-Bessel expansion resulted in a higher error, but the numbers of coefficients
required were dramatically different: regenerating f(t) = J0(t) from Fourier
coefficients required all 256 values, while just one Fourier-Bessel coefficient
suffices to reconstruct the function.

Any function decomposed into basis vectors of the same analytic form will
produce a single coefficient; indeed, expanding the test signal f(t) = sin(t) in a
Fourier series requires a single coefficient. Nevertheless, the point being made is
that an unknown signal is represented more efficiently (more information in fewer
coefficients) when expanded in the set of basis functions that "resemble" itself.
The figure shows Bessel functions of different orders: the red waveform is the
zeroth-order Bessel function, the green waveform is the first-order Bessel function,
and the blue waveform is the second-order Bessel function.
4.4 RECONSTRUCTION OF THE SIGNAL
Fig a represents speech signal. Fig b represents frequencies present in the speech
signal as a cluster. Fig c represents band limited signal reconstructed from the original
one using Bessel coefficients.
4.5 CONCLUSIONS
CHAPTER 5
BESSEL FEATURES FOR DETECTION OF VOICE ONSET
TIME (VOT)
5.1 INTRODUCTION
The instant of onset of vocal fold vibration relative to the release of closure
(burst) is the commonly used feature to analyze the manner of articulation in production
of stop consonants. The interval from the time of burst release to the time of onset
of vocal fold vibration is defined as the voice onset time (VOT) [4].
Accurate determination of VOT from acoustic signals is important both theoretically
and clinically. From a clinical perspective, the VOT constitutes an important clue for
assessment of speech production of hearing impaired speakers [5]. From a theoretical
perspective, the VOT of stop consonants often serves as a significant acoustic correlate to
discriminate voiced from unvoiced, and aspirated from unaspirated stop consonants. The
unvoiced unaspirated stop consonants typically have low and positive VOTs, meaning
that the voicing of the following vowel begins near the instant of closure release. The
unvoiced aspirated stop consonants followed by a vowel have slightly higher VOTs than
their unaspirated counterparts, as the burst is followed by the aspiration noise. The
duration of the VOT in such cases is a practical measure of aspiration. The longer the
VOT, the stronger is the aspiration. On the other hand, voiced stop consonants have a
negative VOT, meaning that the vocal folds start vibrating before the stop is released.
Voice onset time (VOT) can be used to classify Mandarin-, Turkish-, and
German-accented American English. It is an important temporal feature which is often
overlooked in speech perception, speech recognition, and accent detection. The
amplitude envelope (AE) function is therefore useful for detection of VOT. Sub-band
frequency analysis is performed to detect the VOT of unvoiced stops in [9]. The
amplitude modulation component (AMC) is used to detect vowel plus voiced onset
regions (VORs) in different frequency bands, assuming the stop-to-vowel transition
has different amplitude envelopes for partitioned frequency ranges. In the following
section we shall discuss the
effective VOT detection approach using FB expansion followed by AM-FM model for
stop consonant vowel units (/ke/, /ki/, /ku/, /te/, /ti/, /tu/, /pe/, /pi/, /pu/). The dominant
frequency bands of the voiced onset region for various stops and vowels are as
follows: /k/ between 1500 and 2500 Hz; /t/ between 2000 and 3000 Hz; /p/ between 2500
and 3500 Hz; vowel between 300 and 1200 Hz [10,6]. The VOT detection discussed here
is conceptually simpler and can be implemented as a one step process, which makes real
time implementation feasible.
The detection of VOT has been done using the FB expansion and the AM-FM model.
Section 5.3.1 discusses the FB expansion. The AM-FM signal model and its analysis
using the DESA method are discussed in Section 5.3.2. Section 5.3.3 describes the
proposed algorithm for VOT detection using the FB expansion and the AE function of
the AM-FM model. The VOT detection results for speech data collected from various
male and female speakers are presented in Section 5.4.
5.3.1 FOURIER-BESSEL EXPANSION

The zero-order Fourier-Bessel series expansion of a signal x(t) considered over
an arbitrary interval (0, T) is expressed as

x(t) = Σ_{m=1}^{Q} C_m J_0(λ_m t / T),

where the coefficients C_m are computed from

C_m = 2 ∫_0^T t x(t) J_0(λ_m t / T) dt / (T² [J_1(λ_m)]²),

and λ_m are the ascending-order positive roots of J_0(λ) = 0.
It has been shown that there is one-to-one correspondence between the frequency
content of the signal and the order (m) where the coefficient attains peak magnitude [10].
If the AM-FM components of formant of the speech signal are well separated in the
frequency domain, the speech signal components will be associated with various distinct
clusters of non-overlapping FB coefficients [11]. Each component of the speech signal
can be reconstructed by identifying and separating the corresponding FB coefficients.
5.3.2 AM-FM MODEL AND DESA METHOD

For discrete-time signals, Kaiser defined a nonlinear energy-tracking operator
[12]. For a discrete-time signal x(n) it is defined as

Ψ[x(n)] = x²(n) − x(n−1) x(n+1).
The energy operator can estimate the modulating signal, or more precisely its scaled
version, when either AM or FM is present [12]. When AM and FM are present
simultaneously, three algorithms are described in [12] to estimate the instantaneous
frequency and A(n) separately. The best among the three algorithms according to
performance is the discrete energy separation algorithm 1 (DESA-1).
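A minimal sketch of the energy operator and DESA-1 follows; the edge handling, numerical guard constants, and the synthetic AM test signal are illustrative assumptions rather than details from the report:

```python
import numpy as np

def teager(x):
    """Discrete Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1]**2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]          # simple edge handling
    return psi

def desa1_envelope(x):
    """Amplitude envelope |A(n)| via DESA-1."""
    y = np.empty_like(x)
    y[0] = 0.0
    y[1:] = x[1:] - x[:-1]                     # backward difference y(n)
    psi_x = teager(x)
    psi_y = teager(y)
    psi_y_next = np.roll(psi_y, -1)            # psi[y(n+1)]
    cos_w = 1.0 - (psi_y + psi_y_next) / (4.0 * np.maximum(psi_x, 1e-12))
    sin2_w = np.clip(1.0 - cos_w**2, 1e-12, None)
    return np.sqrt(np.maximum(psi_x, 0.0) / sin2_w)

# Synthetic AM signal: slow 5 Hz envelope on a 400 Hz carrier at fs = 8 kHz
fs = 8000
n = np.arange(4 * fs)
true_env = 1.0 + 0.3 * np.cos(2 * np.pi * 5 * n / fs)
x = true_env * np.cos(2 * np.pi * 400 * n / fs)
env = desa1_envelope(x)
err = np.max(np.abs(env[100:-100] - true_env[100:-100]) / true_env[100:-100])
print(err)   # small for slowly varying AM
```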
5.3.3 APPROACH TO DETECT VOTS FROM SPEECH USING AMPLITUDE
ENVELOPE (AE)
In order to detect the voice onset time, the consonant- and vowel-emphasized
regions of the speech utterance are separated using the FB expansion over appropriate
ranges of orders. Since the separated regions are narrow-band signals, they can be
modeled using the AM-FM signal model. The DESA technique is applied on the emphasized
regions of the speech utterance to estimate the AE function for detection of VOT.
From the vowel-emphasized part, the beginning of the vowel is detected. Starting from
the beginning of the vowel and tracing back in the consonant-emphasized part, the
beginning of the consonant region is detected. The VOT is obtained as the difference
between the beginning of the vowel and the beginning of the consonant region.
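The step sequence above can be sketched as follows, assuming the AE functions of the consonant- and vowel-emphasized bands have already been estimated; here they are mocked as smooth pulses, and the onset times and the 50% threshold are purely illustrative:

```python
import numpy as np

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)

# Mock amplitude envelopes (in practice, from FB band emphasis followed by DESA)
ae_consonant = np.exp(-((t - 0.30) / 0.01)**2)         # burst near 0.30 s
ae_vowel = 1.0 / (1.0 + np.exp(-(t - 0.32) * 400.0))   # vowel onset near 0.32 s

def onset_time(ae, t, frac=0.5):
    """First instant the envelope crosses frac * max(ae)."""
    return t[np.argmax(ae >= frac * ae.max())]

tc = onset_time(ae_consonant, t)   # beginning of the consonant (burst)
tv = onset_time(ae_vowel, t)       # beginning of the vowel
tvot = tv - tc
print(tc, tv, tvot)
```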
5.4 RESULTS

In this section we discuss the suitability of the proposed method for VOT
detection. The speech data used consist of 24 isolated utterances (from 12 male and
12 female speakers) of the units /ke/, /te/, /pe/, /ki/, /ti/, /pi/, /ku/, /tu/,
/pu/. The speech signals were sampled at 16 kHz with 16-bit resolution and stored as
separate wave files.
Here we consider an important subset of basic units, namely SCV (stop consonant
vowel) units. Stop consonants are sounds produced by forming a complete closure at
some point along the vocal tract, building up pressure behind the closure, and
releasing the pressure by a sudden opening. These units have two distinct regions in
their production characteristics: the region just before the onset of the vowel
(corresponding to the consonant region) and the steady vowel region. Figure 5.1 shows
the regions of significant events in the production of the SCV unit /kha/ with Vowel
Onset Point (VOP) at sample number 3549. Table 5.1 shows the Fourier-Bessel
coefficient orders required for emphasizing the vowel and consonant regions of the
speech utterances.
Region of speech signal Required FB coefficient orders
/a/ P1=12 to P2=48
/k/ P1=60 to P2=100
/t/ P1=80 to P2=120
/p/ P1=100 to P2=140
Table 5.1 FB coefficient orders for emphasizing the vowel and consonant parts of the
different speech utterances.
For illustration, first we consider the speech utterance /ke/, whose waveform is
shown in Figure 5.2. The spectrogram and the amplitude envelope estimates for the
vowel- and consonant-emphasized regions of the speech utterance /ke/ are shown in
Figures 5.2 (a), (c) and (d) respectively. Similarly, the waveform, spectrogram, and
amplitude envelope estimates for the vowel and consonant regions of the speech
utterances /te/ and /pe/ are shown in Figures 5.3 and 5.4 respectively. It is seen
that the amplitude envelopes corresponding to the vowel and consonant regions are
emphasized by the proposed method. This enables us to identify the beginning of the
consonant (tc) and the beginning of the vowel region (tv). We subtract tc from tv to
obtain the voice onset time: tvot = tv − tc. For testing we have considered 24
utterances from various speakers; their respective tv, tc, and VOT values are shown
in Table 5.2.
5.5 CONCLUSION

In this chapter, the Fourier-Bessel (FB) expansion and the amplitude- and
frequency-modulated (AM-FM) signal model have been used to detect the voice onset
time (VOT). The FB expansion is used to emphasize the vowel and consonant regions,
which results in narrow-band signals from the speech utterance. The DESA method has
been applied to estimate the amplitude envelope of the AM-FM signals owing to its low
complexity and good time resolution.
Fig 5.1 Regions of significant events in the production of the SCV unit /kha/ with Vowel
Onset Point (VOP) at sample number 3549.
Fig 5.2 Plot of the (a) Spectrogram, (b) Waveform, (c) AE estimation of the vowel (/e/)
emphasized part, (d) AE estimation of the consonant part (/k/) emphasized part for the
speech utterance /ke/.
Fig 5.3 Plot of the (a) Spectrogram, (b) Waveform, (c) AE estimation of the vowel (/e/)
emphasized part, (d) AE estimation of the consonant part (/t/) emphasized part for the
speech utterance /te/.
Fig 5.4 Plot of the (a) Spectrogram, (b) Waveform, (c) AE estimation of the vowel (/e/)
emphasized part, (d) AE estimation of the consonant part (/p/) emphasized part for the
speech utterance /pe/.
Fig 5.5 Plot of the bar graphs for utterances of /ke/, /te/, /pe/ for various male
and female speakers.
Waveform VOP (sec) BURST (sec) VOT(sec)
Ke_F01_s1.wav 0.8443 0.8227 0.0216
Ke_F01_s2.wav 0.8168 0.8039 0.0132
Ke_F01_s3.wav 0.9633 0.9473 0.0160
Ke_F01_s4.wav 0.4611 0.4394 0.0217
Te_F01_s1.wav 0.6178 0.6029 0.0149
Te_F01_s2.wav 0.6979 0.6839 0.0140
Te_F01_s3.wav 0.7401 0.7236 0.0165
Te_F01_s4.wav 0.7212 0.7088 0.0124
Pe_F01_s1.wav 0.5377 0.5239 0.0138
Pe_F01_s2.wav 0.5308 0.5153 0.0155
Pe_F01_s3.wav 0.8250 0.8087 0.0163
Pe_F01_s4.wav 0.8239 0.8154 0.0085
Ke_M05_s1.wav 1.4540 1.4170 0.0370
Ke_M05_s2.wav 0.5560 0.5230 0.0330
Ke_M05_s3.wav 0.7480 0.7136 0.0344
Ke_M05_s4.wav 0.6574 0.6137 0.0437
Te_M05_s1.wav 0.6687 0.6502 0.0185
Te_M05_s2.wav 0.4704 0.4604 0.0100
Te_M05_s3.wav 0.6814 0.6548 0.0266
Te_M05_s4.wav 0.6013 0.5843 0.0170
Pe_M05_s1.wav 0.9851 0.9718 0.0133
Pe_M05_s2.wav 0.7262 0.7164 0.0098
Pe_M05_s3.wav 0.4899 0.4809 0.0090
Pe_M05_s4.wav 0.4341 0.4291 0.0050
Table 5.2 VOT values for female (F01) and male (M05) speakers for utterances /ke/, /te/
and /pe/ respectively.
CHAPTER 6
BESSEL FEATURES FOR DETECTION OF GLOTTAL CLOSURE INSTANTS (GCI)
6.1 INTRODUCTION
The primary mode of excitation of the vocal tract system during speech production is due
to the vibration of the vocal folds. For voiced speech, the most significant excitation takes
place around the glottal closure instant (GCI), called the epoch. The performance of
many speech analysis and synthesis approaches depends on accurate estimation of GCIs.
In this chapter we propose to use a new method based on Fourier-Bessel (FB) expansion
and amplitude and frequency modulated (AM-FM) signal model for the detection of GCI
locations in speech utterances.
The organization of this chapter is as follows. In Section 6.2 the significance
of epochs is discussed. A review of existing approaches for the detection of epochs
is provided in Section 6.3. The detection of GCIs using the FB expansion and the
AM-FM signal model is discussed in Section 6.4. A study on the detection of GCIs for
various categories of sound units is provided in Section 6.5. The detection of
glottal activity is analyzed in Section 6.6. The final section summarizes the study
of GCIs.

6.2 SIGNIFICANCE OF EPOCH IN SPEECH ANALYSIS
Glottal closure instants are defined as the instants of significant excitation of
the vocal-tract system. Speech analysis consists of determining the frequency
response of the vocal-tract system and the glottal pulses representing the excitation
source. Although the source of excitation for voiced speech is the sequence of
glottal pulses, the significant excitation of the vocal-tract system within the
glottal pulse can be considered to occur at the GCI, called the epoch. Many speech
analysis situations depend on accurate estimation of the location of the epoch within
the glottal pulse. For example, knowledge of the epoch location is useful for
accurate estimation of the fundamental frequency (F0).
Analysis of speech signals in the closed-glottis regions provides an accurate
estimate of the frequency response of the supralaryngeal vocal-tract system [12][13].
With knowledge of the epochs, it may be possible to determine the characteristics of
the voice source by careful analysis of the signal within a glottal pulse. The epochs
can be used as pitch markers for prosody manipulation, which is useful in
applications like text-to-speech synthesis, voice conversion, and speech-rate
conversion [14][15]. Knowledge of the epoch locations may be used for estimating the
time delay between speech signals collected over a pair of spatially separated
microphones [16]. The segmental signal-to-noise ratio (SNR) of the speech signal is
high in the regions around the epochs; hence it is possible to enhance the speech by
exploiting the characteristics of speech signals around the epochs [17]. It has been
shown that the excitation features derived from the regions around the epoch
locations provide complementary speaker-specific information to the existing spectral
features.
The instants of significant excitation also play an important role in human
perception. It is because of the epochs in speech that a human being is able to
perceive speech even at a distance from the source, though the spectral components of
the direct signal suffer an attenuation of over 40 dB. The human neural mechanism has
the ability to selectively process the robust regions around the epochs to extract
acoustic cues even under degraded conditions. It is this ability to focus on
micro-level events that may be responsible for perceiving speech information even
under severe degradation such as noise, reverberation, the presence of other
speakers, and channel variations.
6.3 REVIEW OF EXISTING APPROACHES

Several methods have been proposed for estimating the GCI from a speech signal.
We categorize these methods as follows: (a) methods based on the short-time energy of
the speech signal, (b) methods based on the predictability of an all-pole linear
predictor, and (c) methods based on the properties of the group delay (GD), i.e., the
negative-going zero crossings of a GD measure derived from the speech signal. This
classification is not rigid, and one category can overlap with another depending on
the interpretation of the method.
6.3.1 EPOCH EXTRACTION FROM SHORT-TIME ENERGY OF SPEECH SIGNAL
GCIs can be detected from the energy peaks in a waveform derived directly from
the speech signal [17, 18] or from features of its time-frequency representation
[19, 20]. The epoch filter proposed in this work computes the Hilbert envelope (HE)
of the high-pass-filtered composite signal to locate the epoch instants. It was shown
that the instant of excitation of the vocal tract could be identified precisely even
for continuous speech. However, this method is suitable for analyzing only clean
speech signals.
A method for detecting the epochs in a speech signal using the properties of
minimum phase signals and GD function was proposed in [25]. The method is based on
the fact that the average value of the GD function of a signal within an analysis frame
corresponds to the location of the significant excitation. An improved method based on
computation of the GD function directly from the speech signal was proposed in [26].
6.4 DETECTION OF GCI USING FB EXPANSION AND AM-FM MODEL

The proposed method is based on the FB expansion and the AM-FM signal model. The
inherent filtering property of the FB expansion is used to weaken the effect of
formants in the speech utterances. The FB coefficients are unique for a given signal
in the same way that Fourier series coefficients are unique for a given signal.
However, unlike the sinusoidal basis functions in the Fourier transform, the Bessel
functions are aperiodic and decay
over time. These features of the Bessel functions make the FB series expansion suitable
for analysis of non-stationary signals such as speech when compared to simple Fourier
transform [9-11]. The discrete energy separation algorithm (DESA) has been used to
estimate the amplitude envelope (AE) function of the AM-FM model due to its good time
resolution; this is advantageous as the estimates are well localized in the time
domain.
6.4.1 FOURIER-BESSEL EXPANSION

The zero-order Fourier-Bessel series expansion of a signal x(t) considered over
some arbitrary interval (0, T) is expressed as

x(t) = Σ_{m=1}^{Q} C_m J_0(λ_m t / T),

where the coefficients C_m are computed from

C_m = 2 ∫_0^T t x(t) J_0(λ_m t / T) dt / (T² [J_1(λ_m)]²),

and λ_m are the ascending-order positive roots of J_0(λ) = 0.
It has been shown that there is one-to-one correspondence between the frequency
content of the signal and the order (m) where the coefficient attains peak magnitude [10].
If the AM-FM components of formant of the speech signal are well separated in the
frequency domain, the speech signal components will be associated with various distinct
clusters of non-overlapping FB coefficients [11]. Each component of the speech signal
can be reconstructed by identifying and separating the corresponding FB coefficients.
6.4.2 AM-FM Model and DESA Method
For both continuous and discrete time signals, Kaiser has defined a nonlinear
energy tracking operator [11]. For the discrete time case, the energy operator for x[n] is
defined as

Ψ[x(n)] = x²(n) − x(n−1) x(n+1).
The energy operator can estimate the modulating signal, or more precisely its scaled
version, when either AM or FM is present [11]. When AM and FM are present
simultaneously, three algorithms are described in [11] to estimate the instantaneous
frequency and A(n) separately. The best among the three algorithms according to
performance is the discrete energy separation algorithm 1 (DESA-1).
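The DESA-1 recursion maps directly to a few lines of code. Below is a pure-Python sketch; the MATLAB listings in the appendix implement the same steps, while the arccos clamp, the small floor under the square root, and the carry-forward defaults are our numerical safeguards rather than part of the published algorithm:

```python
import math

def teager(x, n):
    """Discrete Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    return x[n] * x[n] - x[n - 1] * x[n + 1]

def desa1(x):
    """DESA-1: estimate instantaneous frequency (rad/sample) and amplitude
    envelope |A(n)| of a discrete AM-FM signal x, for n = 2 .. len(x)-3."""
    y = [x[n] - x[n - 1] for n in range(1, len(x))]    # y(n) = x(n) - x(n-1)
    omega, amp = [], []
    for n in range(2, len(x) - 2):
        psi_x = teager(x, n)
        psi_y = teager(y, n - 1) + teager(y, n)        # Psi[y](n) + Psi[y](n+1)
        if psi_x <= 0.0:                               # carry the previous estimate,
            omega.append(omega[-1] if omega else 0.0)  # as the MATLAB listings do
            amp.append(amp[-1] if amp else 0.0)
            continue
        c = 1.0 - psi_y / (4.0 * psi_x)
        omega.append(math.acos(max(-1.0, min(1.0, c))))
        amp.append(math.sqrt(psi_x / max(1e-12, 1.0 - c * c)))
    return omega, amp
```

For a constant-amplitude, constant-frequency sinusoid the estimates are exact: a cosine of 0.3 rad/sample yields omega ≈ 0.3 and amp ≈ 1 at every sample.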
In order to detect GCIs, we emphasize the low-frequency content of the speech
signal in the range 0 to 300 Hz. This is achieved by reconstructing the signal from FB
coefficients of appropriately low order m. Since the resulting band-limited signal is a
narrow-band signal, it can be modeled using the AM-FM signal model.
The advantage of choosing the 0 to 300 Hz band is that the characteristics of the
time-varying vocal-tract system do not affect the locations of the GCIs, because the
resonances of the vocal-tract system lie above 300 Hz. It has been observed that the
characteristics of the peaks due to GCIs can be extracted by reconstructing the speech
signal using the FB expansion up to order m = 75. The DESA technique is applied to this
band-limited AM-FM signal of the speech utterance to determine the amplitude envelope
(AE) function; the peaks of the envelope mark the locations of the GCIs.
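The FB order needed to cover a given band depends on the frame length and the sampling rate: since λ_m ≈ mπ, the m-th zero-order coefficient represents content near m·fs/(2N) Hz for an N-sample frame at rate fs, which is the mapping the Bandlimitfbc.m listing in the appendix applies as `range = N*fd/(fs/2)`. A small sketch of that mapping (the function name and the rounding are ours):

```python
def fb_order_for_band(num_samples, fs, f_max):
    """Highest zero-order FB coefficient index covering 0..f_max Hz for a frame
    of num_samples samples at sampling rate fs, using lambda_m ~ m*pi so that
    the m-th coefficient sits near m*fs/(2*num_samples) Hz."""
    return int(round(2.0 * num_samples * f_max / fs))
```

With the Bandlimitfbc.m settings (N = 6001 samples, fs = 8000 Hz, fd = 300 Hz) this gives order 450; a fixed order such as m = 75 therefore corresponds to one particular frame length and sampling rate.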
6.5 STUDIES ON DETECTION OF GCIs FOR VARIOUS SPEECH
UTTERANCES
In this section we present studies on epoch (GCI) extraction for both male and
female speakers. Figure 6.1 shows the speech signal of a male speaker and its
corresponding spectrogram. The band-limited AM-FM signal, reconstructed using the FB
expansion, is shown in the second waveform of the figure, and the third waveform shows
its estimated amplitude envelope. The differenced EGG signal is shown in the fourth
waveform. The peaks in the amplitude envelope agree with the peaks in the differenced
EGG signal in most cases. Similar observations hold for the female speaker shown in
Figure 6.2.
This enables us to identify the locations of GCIs from the peaks of the amplitude
envelope of the band-limited AM-FM signal of the given speech utterance. From Figures
6.1 and 6.2, it can be noticed that the number of GCIs for the female speaker is greater
than for the male speaker over the same duration of speech. This is because the
fundamental frequency (the reciprocal of the interval between successive GCIs) of female
speakers is generally higher than that of male speakers.
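The pitch computation implied here is a one-liner once the GCI instants are available. A pure-Python sketch (epoch times assumed in seconds):

```python
def instantaneous_f0(gci_times):
    """Instantaneous fundamental frequency (Hz): the reciprocal of the interval
    between each pair of successive glottal closure instants (in seconds)."""
    return [1.0 / (t2 - t1) for t1, t2 in zip(gci_times, gci_times[1:])]
```

Because each estimate comes from one glottal cycle, cycle-to-cycle pitch variation is preserved, unlike block-based pitch trackers that average over a whole analysis frame.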
The strength of excitation helps in detecting regions of the speech signal with
glottal activity. Regions where the strength of excitation is significant correspond to
vocal-fold vibration (glottal activity); these are treated as the voiced regions. In the
absence of vocal-fold vibration, the vocal-tract system can be considered to be excited by
random noise, as in the case of frication.
The energy of a random-noise excitation is spread in both the time and frequency
domains, whereas the energy of an impulse, though distributed uniformly in frequency, is
highly concentrated in time. As a result, the filtered signal exhibits significantly lower
amplitude for random-noise excitation than for impulse-like excitation. Hence the
filtered signal can be used to detect the regions of glottal activity (vocal-fold vibration).
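Since impulse-like excitation concentrates energy in time while noise excitation spreads it, a simple amplitude threshold on the envelope separates the two regimes. A toy sketch of that idea (the 10% threshold is our assumption, not a value from the report):

```python
def glottal_activity(envelope, threshold_ratio=0.1):
    """Flag samples of the amplitude envelope that exceed a fraction of its
    peak value, a crude marker for regions of vocal-fold vibration (voiced
    regions); everything below the threshold is treated as noise-excited."""
    peak = max(envelope)
    return [e >= threshold_ratio * peak for e in envelope]
```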
6.7 CONCLUSIONS
The primary mode of excitation of the vocal tract system during speech
production is due to the vibration of the vocal folds. For voiced speech, the most
significant excitation takes place around the glottal closure instant (GCI). The instants of
significant excitation play an important role in human perception. The studies on GCIs
help in identifying the various regions in continuous speech. The extracted GCIs help in
identifying the fundamental frequency (pitch) of the speaker. The pitch is a feature
unique to a speaker.
GLOTTAL CLOSURE INSTANTS OF A MALE SPEAKER
Fig. 6.1 Epoch (or GCI) extraction of a male speaker using Fourier-Bessel (FB)
expansion and AM-FM model
GLOTTAL CLOSURE INSTANTS OF A FEMALE SPEAKER
Fig. 6.2 Epoch (or GCI) extraction of a female speaker using Fourier-Bessel (FB)
expansion and AM-FM model
CHAPTER 7
SUMMARY AND CONCLUSIONS
Using the epoch locations as anchor points within each glottal cycle, a method to
estimate the instantaneous fundamental frequency of voiced speech segments is
presented. The fundamental frequency is estimated as the reciprocal of the interval
between two successive epoch locations derived from the speech signal. Since the
method is based on the point property of epoch and does not involve any block
processing, it provides cycle-to-cycle variations in pitch during voicing. This results in
instantaneous fundamental frequency. Errors due to spurious zero crossings in weak
voiced regions are corrected using the filtered Hilbert envelope (HE) of the speech
signal. In this work, the pitch frequency of various subjects under different
environmental conditions is analyzed.
CHAPTER 8
REFERENCES
On time domain prosodic modifications of speech”, in Proc. IEEE, May, 1989.
16. K. S. Rao and B. Yegnanarayana, "Prosody modification using instants of
significant excitation", IEEE Trans. Audio, Speech, Lang. Process., May 2006.
17. B. Yegnanarayana, S. R. M. Prasanna, R. Duraiswami, and D. Zotkin, "Processing
of reverberant speech for time-delay estimation", IEEE Trans. Speech Audio
Process., 2005.
18. B. Yegnanarayana and P. S. Murty, "Enhancement of reverberant speech using LP
residual signal", IEEE Trans. Speech Audio Process., vol. 8, 2000.
19. C. Ma, Y. Kamp, and L. F. Willems, "A Frobenius norm approach to glottal
closure detection from the speech signal", IEEE Trans. Speech Audio Process.,
vol. 2, 1994.
20. C. R. Jankowski Jr., T. F. Quatieri, and D. A. Reynolds, "Measuring fine structure
in speech: Application to speaker identification", in Proc. IEEE Int. Conf., 1995.
25. K. Sri Rama Murty and B. Yegnanarayana, "Epoch extraction from speech
signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 8, pp. 1602-1613,
Nov. 2008.
SOURCE CODE
Gci_female.m
clc;
clear all;
close all;
inputfile='30401.wav'
eggfile='30401.egg'
samplesrange=[89601:92800];
fborderleft=1;
fborderright=75;
MM=length(samplesrange);
%computation of roots of bessel function Jo(x)
if exist('alfa') == 0
x=2;
alfa=zeros(1,MM);
for i=1:MM
ex=1;
while abs(ex)>.00001
ex=-besselj(0,x)/besselj(1,x);
x=x-ex;
end
alfa(i)=x;
fprintf('Root # %g = %8.5f ex = %9.6f \n',i,x,ex)
x=x+pi;
end
end
[s,fs]=wavread(inputfile);
s=s;
S=s';
% S=diff(S);
S=-S(samplesrange);
ax(1)=subplot(4,1,1);
% plot(samplesrange, S);
plot(samplesrange/fs, S);
axis tight
grid on
s=S;
N=length(s);
nb=1:N;
a=N;
for m1=1:MM
a3(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));
end
for mm=1:N
g1_r=[(alfa(fborderleft:fborderright))/a ];
F1_r(mm)=sum(a3(fborderleft:fborderright).*besselj(0,g1_r*mm));
end
y1=F1_r;
ax(2)=subplot(4,1,2);
% plot(samplesrange, y1)
plot((samplesrange)/fs, y1)
axis tight
grid on
% y1=y1-mean(y1);
ax(3)=subplot(4,1,3);
[egg, fs]=wavread(eggfile);
vg=-diff(egg);
% plot(samplesrange,vg(samplesrange))
plot((samplesrange)/fs,vg(samplesrange))
axis tight
grid on
for l=2:N-1
xx=y1;
si(l)=xx(l)^2-xx(l-1)*xx(l+1);
end
for m=2:N-1
yy(m)=y1(m)-y1(m-1);
end
for m=2:N-2
siy(m)=yy(m)^2-yy(m-1).*yy(m+1);
end
for mm=2:N-2
if siy(mm)<0
yy1(mm)=siy(mm-1);
yy1(mm)=yy1(mm-1);
else
yy1(mm)=siy(mm);
end
end
siy=yy1;
for m1=2:N-3
omega(m1)=acos(1-((siy(m1)+siy(m1+1))/(4*si(m1))));
end
for mm=2:N-3
if imag(omega(mm))==0
yy1(mm)=omega(mm);
else
yy1(mm)=omega(mm-1);
yy1(mm)=yy1(mm-1);
end
end
omega=yy1;
for m1=2:N-3
amp(m1)=sqrt(((si(m1))/(1-(1-((siy(m1)+siy(m1+1))/(4*si(m1))))^2)));
end
for mm=2:N-3
if imag(amp(mm))==0
yy1(mm)=amp(mm);
else
yy1(mm)=amp(mm-1);
yy1(mm)=yy1(mm-1);
end
end
[ca,cd]=dwt(yy1,'db2');
yy1=idwt(ca,[],'db2');
amp1=yy1;
amp1(end+1:end+2)=amp1(end);
% X2=overlapadd(omega1,W,INC);
ax(4)=subplot(4,1,4);
% plot(samplesrange,amp1)
plot((samplesrange)/fs,amp1)
axis tight
grid on
% ax(4)=subplot(4,1,4);
%
% plot((1:length(X2))/32000,X2)
linkaxes(ax,'x');
Gci_male.m
clc;
clear all;
close all;
inputfile='10501.wav'
eggfile='10501.egg'
samplesrange=[76001:79200];
fborderleft=1;
fborderright=75;
MM=length(samplesrange);
%computation of roots of bessel function Jo(x)
if exist('alfa') == 0
x=2;
alfa=zeros(1,MM);
for i=1:MM
ex=1;
while abs(ex)>.00001
ex=-besselj(0,x)/besselj(1,x);
x=x-ex;
end
alfa(i)=x;
%fprintf('Root # %g = %8.5f ex = %9.6f \n',i,x,ex)
x=x+pi;
end
end
[s,fs]=wavread(inputfile);
s=s;
S=s';
% S=diff(S);
S=-S(samplesrange);
ax(1)=subplot(4,1,1);
% plot(samplesrange, S);
plot(samplesrange/fs, S);
axis tight
grid on
s=S;
N=length(s);
nb=1:N;
a=N;
for m1=1:MM
a3(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));
end
for mm=1:N
g1_r=[(alfa(fborderleft:fborderright))/a ];
F1_r(mm)=sum(a3(fborderleft:fborderright).*besselj(0,g1_r*mm));
end
y1=F1_r;
ax(2)=subplot(4,1,2);
% plot(samplesrange, y1)
plot((samplesrange)/fs, y1)
axis tight
grid on
% y1=y1-mean(y1);
ax(3)=subplot(4,1,3);
[egg, fs]=wavread(eggfile);
vg=-diff(egg);
% plot(samplesrange,vg(samplesrange))
plot((samplesrange)/fs,vg(samplesrange))
axis tight
grid on
for l=2:N-1
xx=y1;
si(l)=xx(l)^2-xx(l-1)*xx(l+1);
end
for m=2:N-1
yy(m)=y1(m)-y1(m-1);
end
for m=2:N-2
siy(m)=yy(m)^2-yy(m-1).*yy(m+1);
end
for mm=2:N-2
if siy(mm)<0
yy1(mm)=siy(mm-1);
yy1(mm)=yy1(mm-1);
else
yy1(mm)=siy(mm);
end
end
siy=yy1;
for m1=2:N-3
omega(m1)=acos(1-((siy(m1)+siy(m1+1))/(4*si(m1))));
end
for mm=2:N-3
if imag(omega(mm))==0
yy1(mm)=omega(mm);
else
yy1(mm)=omega(mm-1);
yy1(mm)=yy1(mm-1);
end
end
omega=yy1;
for m1=2:N-3
amp(m1)=sqrt(((si(m1))/(1-(1-((siy(m1)+siy(m1+1))/(4*si(m1))))^2)));
end
for mm=2:N-3
if imag(amp(mm))==0
yy1(mm)=amp(mm);
else
yy1(mm)=amp(mm-1);
yy1(mm)=yy1(mm-1);
end
end
[ca,cd]=dwt(yy1,'db2');
yy1=idwt(ca,[],'db2');
amp1=yy1;
amp1(end+1:end+2)=amp1(end);
% X2=overlapadd(omega1,W,INC);
ax(4)=subplot(4,1,4);
% plot(samplesrange,amp1)
plot((samplesrange)/fs,amp1)
axis tight
grid on
% ax(4)=subplot(4,1,4);
%
% plot((1:length(X2))/32000,X2)
linkaxes(ax,'x');
---------------------------------------------------------------------------------------------------------
Vot.m
clc;
clear all;
close all;
inputfile='ku_F01_S4.wav';
MM=320;
%MM=1100;
%computation of roots of bessel function Jo(x)
if exist('alfa') == 0
x=2;
alfa=zeros(1,MM);
for i=1:MM
ex=1;
while abs(ex)>.00001
ex=-besselj(0,x)/besselj(1,x);
x=x-ex;
end
alfa(i)=x;
% fprintf('Root # %g = %8.5f ex = %9.6f \n',i,x,ex)
x=x+pi;
end
end
[s,fs]=wavread(inputfile);
% yy=resample(s,1,2);
%
% s=yy(9500:10600);
S=s';
INC=160;
%INC=550;
NW=INC*2;
W=sqrt(hamming(NW+1));
W(end)=[];
F=enframe(S,W,INC);
[r,c]=size(F);
for i=1:r
s1=F(i,:);
N=length(s1);
nb=1:N;
a=N;
for m1=1:MM
a3(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s1.*besselj(0,alfa(m1)/a*nb));
end
for mm=1:N
g1_r=[(alfa(12:48))/a ];
F1_r(mm)=sum(a3(12:48).*besselj(0,g1_r*mm));
% g1_r=[(alfa(20:85))/a ];
% F1_r(mm)=sum(a3(20:85).*besselj(0,g1_r*mm));
% g1_r=[(alfa)/a ];
% F1_r(mm)=sum(a3.*besselj(0,g1_r*mm));
end
y1=F1_r;
for l=2:N-1
xx=y1;
si(l)=xx(l)^2-xx(l-1)*xx(l+1);
end
for m=2:N-1
yy(m)=y1(m)-y1(m-1);
end
for m=2:N-2
siy(m)=yy(m)^2-yy(m-1).*yy(m+1);
end
for mm=2:N-2
if siy(mm)<0
yy1(mm)=siy(mm-1);
yy1(mm)=yy1(mm-1);
else
yy1(mm)=siy(mm);
end
end
siy=yy1;
for m1=2:N-3
omega(m1)=acos(1-((siy(m1)+siy(m1+1))/(4*si(m1))));
end
for mm=2:N-3
if imag(omega(mm))==0
yy1(mm)=omega(mm);
else
yy1(mm)=omega(mm-1);
yy1(mm)=yy1(mm-1);
end
end
omega=yy1;
for m1=2:N-3
amp(m1)=sqrt(((si(m1))/(1-(1-((siy(m1)+siy(m1+1))/(4*si(m1))))^2)));
end
for mm=2:N-3
if imag(amp(mm))==0
yy1(mm)=amp(mm);
else
yy1(mm)=amp(mm-1);
yy1(mm)=yy1(mm-1);
end
end
[ca,cd]=dwt(yy1,'db2');
yy1=idwt(ca,[],'db2');
amp1(i,:)=yy1;
end
amp1(:,c-1)=amp1(:,c-2);
amp1(:,c)=amp1(:,c-1);
Xvowel=overlapadd(amp1,W,INC);
%%%%%
% [s,fs]=wavread(inputfile);
% % yy=resample(s,1,2);
% % fs=fs/2;
% % s=yy(9500:10600);
% S=s';
INC=160;
NW=INC*2
W=sqrt(hamming(NW+1));
W(end)=[];
F=enframe(S,W,INC);
[r,c]=size(F);
for i=1:r
s2=F(i,:);
N=length(s2);
nb=1:N;
a=N;
for m1=1:MM
a3(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s2.*besselj(0,alfa(m1)/a*nb));
end
for mm=1:N
% g1_r=[(alfa(60:100))/a ];
% F1_r(mm)=sum(a3(60:100).*besselj(0,g1_r*mm));
g1_r=[(alfa(60:100))/a ];
F1_r(mm)=sum(a3(60:100).*besselj(0,g1_r*mm));
end
y1=F1_r;
for l=2:N-1
xx=y1;
si(l)=xx(l)^2-xx(l-1)*xx(l+1);
end
for m=2:N-1
yy(m)=y1(m)-y1(m-1);
end
for m=2:N-2
siy(m)=yy(m)^2-yy(m-1).*yy(m+1);
end
for mm=2:N-2
if siy(mm)<0
yy1(mm)=siy(mm-1);
yy1(mm)=yy1(mm-1);
else
yy1(mm)=siy(mm);
end
end
siy=yy1;
for m1=2:N-3
omega(m1)=acos(1-((siy(m1)+siy(m1+1))/(4*si(m1))));
end
for mm=2:N-3
if imag(omega(mm))==0
yy1(mm)=omega(mm);
else
yy1(mm)=omega(mm-1);
yy1(mm)=yy1(mm-1);
end
end
omega=yy1;
for m1=2:N-3
amp(m1)=sqrt(((si(m1))/(1-(1-((siy(m1)+siy(m1+1))/(4*si(m1))))^2)));
end
for mm=2:N-3
if imag(amp(mm))==0
yy1(mm)=amp(mm);
else
yy1(mm)=amp(mm-1);
yy1(mm)=yy1(mm-1);
end
end
[ca,cd]=dwt(yy1,'db2');
yy1=idwt(ca,[],'db2');
amp2(i,:)=yy1;
end
amp2(:,c-1)=amp2(:,c-2);
amp2(:,c)=amp2(:,c-1);
Xc=overlapadd(amp2,W,INC);
fig = figure;
ax(1)=subplot(4,1,1);
%specgram(s,320,16000,320,160) ; colormap(1-gray);
svlSpgram(s,2^8,fs,10*fs/1000,9*fs/1000,30,4000);
%specgram(s,1100,8000,1100,550)
ax(2)=subplot(4,1,2);
plot((1:length(s))/16000, s);grid;axis tight;
%text(1.81,0.04,'(b)','fontweight','bold');
%plot((1:length(s))/8000, s);
%plot(s);grid;axis tight
% spgramsvg('ka_F01_S2.wav', 320, 160, 8000)
ax(3)=subplot(4,1,3);
plot((1:length(Xvowel))/16000,Xvowel);grid;axis tight;
%text(1.81,0.15,'(c)','fontweight','bold');
ax(4)=subplot(4,1,4);
plot((1:length(Xc))/16000,Xc);grid;axis tight;xlabel('Time (sec)');
%text(1.81,0.004,'(d)','fontweight','bold');
% ax(5)=subplot(5,1,5);
% plot((1:length(Xc)-1)/16000,diff(Xc));grid;axis tight;xlabel('Time (sec)');
%text(1.81,0.004,'(d)','fontweight','bold');
linkaxes(ax,'x');
alltext=findall(fig,'type','text');
allaxes=findall(fig,'type','axes');
allfont=[alltext(:);allaxes(:)];
set(allfont,'fontsize',16);
Overlapadd.m
function [x,zo]=overlapadd(f,win,inc)
%OVERLAPADD join overlapping frames together X=(F,WIN,INC)
%
% Inputs: F(NR,NW) contains the frames to be added together, one
% frame per row.
% WIN(NW) contains a window function to multiply each frame.
% WIN may be omitted to use a default rectangular window
% If processing the input in chunks, WIN should be replaced by
% ZI on the second and subsequent calls where ZI is the saved
% output state from the previous call.
% INC gives the time increment (in samples) between
% succesive frames [default = NW].
%
% Outputs: X(N,1) is the output signal. The number of output samples is N=NW+(NR-1)*INC.
% ZO Contains the saved state to allow a long signal
% to be processed in chunks. In this case X will contain only N=NR*INC
% output samples.
%
% Example of frame-based processing:
% INC=20
% set frame increment
% NW=INC*2
% oversample by a factor of 2 (4 is also often used)
% S=cos((0:NW*7)*6*pi/NW);
% example input signal
% W=sqrt(hamming(NW+1)); W(end)=[]; % sqrt hamming window of period NW
% F=enframe(S,W,INC); % split into frames
% ... process frames ...
% X=overlapadd(F,W,INC); % reconstitute the time waveform (omit "X=" to plot waveform)
% (at your option) any later version.
%
% This program is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You can obtain a copy of the GNU General Public License from
% http://www.gnu.org/copyleft/gpl.html or by writing to
% Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[nr,nf]=size(f); % number of frames and frame length
if nargin<2
win=nf; % default increment
end
if isstruct(win)
w=win.w;
if ~numel(w) && length(w)~=nf
error('window length does not match frames size');
end
inc=win.inc;
xx=win.xx;
else
if nargin<3
inc=nf;
end
if numel(win)==1 && win==fix(win) && nargin<3 % win has been omitted
inc=win;
w=[];
else
w=win(:).';
if length(w)~=nf
error('window length does not match frames size');
end
if all(w==1)
w=[];
end
end
xx=[]; % partial output from previous call is null
end
nb=ceil(nf/inc); % number of overlap buffers
no=nf+(nr-1)*inc; % buffer length
z=zeros(no,nb); % space for overlapped output speech
if numel(w)
z(repmat(1:nf,nr,1)+repmat((0:nr-1)'*inc+rem((0:nr-1)',nb)*no,1,nf))=f.*repmat(w,nr,1);
else
z(repmat(1:nf,nr,1)+repmat((0:nr-1)'*inc+rem((0:nr-1)',nb)*no,1,nf))=f;
end
x=sum(z,2);
if ~isempty(xx)
x(1:length(xx))=x(1:length(xx))+xx; % add on leftovers from previous call
end
if nargout>1 % check if we want to preserve the state
mo=inc*nr; % completed output samples
if no<mo
x(mo,1)=0;
zo.xx=[];
else
zo.xx=x(mo+1:end);
zo.w=w;
zo.inc=inc;
x=x(1:mo);
end
elseif ~nargout
if isempty(xx)
k1=nf-inc; % dubious samples at start
else
k1=0;
end
k2=nf-inc; % dubious samples at end
plot(1+(0:nr-1)*inc,x(1+(0:nr-1)*inc),'>r',nf+(0:nr-1)*inc,x(nf+(0:nr-1)*inc),'<r', ...
1:k1+1,x(1:k1+1),':b',k1+1:no-k2,x(k1+1:end-k2),'-b',no-k2:no,x(no-k2:no),':b');
xlabel('Sample Number');
title(sprintf('%d frames of %d samples with %.0f%% overlap = %d samples',nr,nf,100*(1-inc/nf),no));
end
Enframe.m
function f=enframe(x,win,inc)
%ENFRAME split signal up into (overlapping) frames: one per row. F=(X,WIN,INC)
%
% F = ENFRAME(X,LEN) splits the vector X(:) up into
% frames. Each frame is of length LEN and occupies
% one row of the output matrix. The last few frames of X
% will be ignored if its length is not divisible by LEN.
% It is an error if X is shorter than LEN.
%
% F = ENFRAME(X,LEN,INC) has frames beginning at increments of INC
% The centre of frame I is X((I-1)*INC+(LEN+1)/2) for I=1,2,...
% The number of frames is fix((length(X)-LEN+INC)/INC)
%
% F = ENFRAME(X,WINDOW) or ENFRAME(X,WINDOW,INC) multiplies
% each frame by WINDOW(:)
%
% Example of frame-based processing:
% INC=20
% set frame increment
% NW=INC*2
% oversample by a factor of 2 (4 is also often used)
% S=cos((0:NW*7)*6*pi/NW);
% example input signal
% W=sqrt(hamming(NW+1)); W(end)=[]; % sqrt hamming window of period NW
% F=enframe(S,W,INC); % split into frames
% ... process frames ...
% X=overlapadd(F,W,INC); % reconstitute the time waveform (omit "X=" to plot waveform)
% (at your option) any later version.
%
% This program is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You can obtain a copy of the GNU General Public License from
% http://www.gnu.org/copyleft/gpl.html or by writing to
% Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
nx=length(x(:));
nwin=length(win);
if (nwin == 1)
len = win;
else
len = nwin;
end
if (nargin < 3)
inc = len;
end
nf = fix((nx-len+inc)/inc);
f=zeros(nf,len);
indf= inc*(0:(nf-1)).';
inds = (1:len);
f(:) = x(indf(:,ones(1,len))+inds(ones(nf,1),:));
if (nwin > 1)
w = win(:)';
f = f .* w(ones(nf,1),:);
end
SvlSpgram.m
function [X, f_r, t_r] = svlSpgram(x, n, Fs, window, overlap, clipdB, maxfreq)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Usage:
% [X [, f [, t]]] = svlSpgram(x [, n [, Fs [, window [, overlap[, clipdB [, maxfreq ]]]]]])
%
% Generate a spectrogram for the signal. This chops the signal into
% overlapping slices, windows each slice and applies a Fourier
% transform to determine the frequency components at that slice.
%
% INPUT:
% x: signal or vector of samples
% n: size of fourier transform window, or [] for default=256
% Fs: sample rate, or [] for default=2 Hz
% window: shape of the fourier transform window,
% or [] for default=hanning(n)
% Note: window length can be specified instead, in which case
% window=hanning(length)
% overlap: overlap with previous window,
% or [] for default=length(window)/2
% clipdB:Clip or cut-off any spectral component more than 'clipdB'
% below the peak spectral strength.(default = 35 dB)
% maxfreq: Maximum freq to be plotted in the spectrogram (default = Fs/2)
%
% OUTPUT:
% X: STFT of the signal x
% f: The frequency values corresponding to the STFT values
% t: Time instants at which the STFT values are computed
%
% Example
%--------
% x = chirp([0:0.001:2],0,2,500); # freq. sweep from 0-500 over 2 sec.
% Fs=1000; # sampled every 0.001 sec so rate is 1 kHz
% step=ceil(20*Fs/1000); # one spectral slice every 20 ms
% window=ceil(100*Fs/1000); # 100 ms data window
% svlSpgram(x, 2^nextpow2(window), Fs, window, window-step);
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Original version by Paul Kienzle; modified by Sean Fulop March 2002
%
% Customized by Anand and then by Dhananjaya
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% assign defaults
if nargin < 2 | isempty(n), n = min(256, length(x)); end
if nargin < 3 | isempty(Fs), Fs = 2; end
if nargin < 4 | isempty(window), window = hanning(n); end
if nargin < 5 | isempty(overlap), overlap = length(window)/2; end
if nargin < 6 | isempty(clipdB), clipdB = 35; end % clip anything below 35 dB
if nargin < 7 | isempty(maxfreq), maxfreq = Fs/2; end
if rem(n,2)==1
ret_n = (n+1)/2;
else
ret_n = n/2;
end
f = [0:ret_n-1]*Fs/n;
t = offset/Fs;
%maxfreq = Fs/2;
STFTmag = abs(STFT(2:n*maxfreq/Fs,:)); % magnitude of STFT
STFTmag = STFTmag/max(max(STFTmag)); % normalize so max magnitude will be 0 dB
STFTmag = max(STFTmag, 10^(-clipdB/10)); % clip everything below -35 dB
% display as an indexed grayscale image showing the log magnitude of the STFT,
% i.e. a spectrogram; the colormap is flipped to invert the default setting,
% in which white is most intense and black least---in speech science we want
% the opposite of that.
% imagesc(t, f(2:n*maxfreq/Fs), 20*log10(STFTmag)); axis xy; colormap(flipud(gray));
if nargout==0
if Fs<2000
imagesc(t, f(2:n*maxfreq/Fs), 20*log10(STFTmag));
%pcolor(t, f(2:n*maxfreq/Fs), 20*log10(STFTmag));
ylabel('Hz');
else
%imagesc(t, f(2:n*maxfreq/Fs)/1000, 20*log10(STFTmag));
pcolor(t, f(2:n*maxfreq/Fs)/1000, 20*log10(STFTmag));
ylabel('kHz');
end
axis xy;
colormap(flipud(gray));
shading interp;
%xlabel('Time (seconds)');
end
Fbc.m
function c = fbc(s, alfa)
% FBC Zero-order Fourier-Bessel coefficients of the signal s.
% alfa holds the ascending positive roots of J0, computed as in Gci_female.m.
N = length(s);
a = N;
nb = 1:N;
c = zeros(1,N);
for m1 = 1:N
c(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s(:)'.*besselj(0,alfa(m1)/a*nb));
end
% cindex=[6:10];
% cindex=18:24;
% cindex=[6:10 18:24];
% cindex=[2:5 5:8 26:30];
% cindex=[2:6 6:10 10:15 52:58];
Residual.m
function [residual,LPCoeffs,eta,Ne] = Residual(speech,Fs,segmentsize,segmentshift,lporder,preempflag,plotflag)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% USAGE : [residual,LPCoeffs,Ne] = Residual(speech,framesize,frameshift,lporder,preempflag,plotflag)
% INPUTS :
% speech - speech in ASCII
% Fs - Sampling frequency (Hz)
% segmentsize - framesize for lpanalysis (ms)
% segmentshift - frameshift for lpanalysis (ms)
% lporder - order of lpc
% preempflag - If 1 do preemphasis
% plotflag - If 1 plot results
% OUTPUTS :
% residual - residual signal
% LPCoeffs - 2D array containing LP coeffs of all frames
% Ne - Normalized error
% Pre-emphasis of the speech signal for LP analysis (the first-order
% pre-emphasis coefficient below is an assumption).
if preempflag
dspeech = filter([1 -0.95],1,speech);
else
dspeech = speech;
end
dspeech = dspeech(:);
framesize = floor(segmentsize * Fs/1000);
frameshift = floor(segmentshift * Fs/1000);
nframes=floor((length(dspeech)-framesize)/frameshift)+1;
LPCoeffs=zeros(nframes,(lporder+1));
Lspeech = length(dspeech);
numSamples = Lspeech - framesize;
j=1;
% Processing the frames.
for i=1:frameshift:Lspeech-framesize
curFrame = dspeech(i:i+framesize-1);
frame = speech(i:i+framesize-1);
% Inverse filtering.
if i <= lporder
frameToFilter(1:lporder) = 0;
else
frameToFilter(1:lporder) = dspeech(i-lporder:i-1);
%frameToFilter(1:lporder) = speech(i-lporder:i-1);
end
frameToFilter(lporder+1:lporder+framesize)=curFrame;
%frameToFilter(lporder+1:lporder+framesize)=frame;
resFrame = InverseFilter(frameToFilter,lporder,LPCoeffs(j,:));
numer=resFrame(lporder+1:framesize);
denom=curFrame(lporder+1:framesize);
eta(i) = sum(numer.*numer)/sum(denom.*denom);
%eta(i)
else
LPCoeffs(j,:) = 0;
Ne(j) = 0;
resFrame(1:framesize) = 0;
end
end
clear frameToFilter;
i=i+frameshift;
% Updating the remaining residual samples of the penultimate frame.
%residual(i+frameshift:i+framesize-1) = resFrame(frameshift+1:framesize);
% Processing the last frame. However, this last frame will have
% a length of {Lspeech-i+1} samples.
curFrame = dspeech(i:Lspeech);
frame = speech(i:Lspeech);
l=Lspeech-i+1;
if(sum(nanFlag) == 0)
LPCoeffs(j,:) = real(a);
Ne(j) = E;
% Inverse filtering.
frameToFilter(1:lporder) = dspeech(i-lporder:i-1);
%frameToFilter(1:lporder) = speech(i-lporder:i-1);
frameToFilter(lporder+1:lporder+l)=curFrame;
%frameToFilter(lporder+1:lporder+l)=frame;
%resFrame(1:l) = InverseFilter(frameToFilter,lporder,LPCoeffs(j,:));
resFrame(1:l) = InverseFilter(frameToFilter,lporder,LPCoeffs(j,:));
else
LPCoeffs(j,:) = 0;
Ne(j) = 0;
resFrame(1:l) = 0;
end
end
residual(i:i+l-1) = resFrame(1:l);
% Residual computation is now complete.
% The lengths of speech and residual are equal now.
end
figure;
ax(1) = subplot(2,1,1);plot(x,speech);
xlim([x(1) x(end)]);
xlabel('Time (s)');ylabel('Signal');grid;
ax(2) = subplot(2,1,2);plot(x,residual);
xlim([x(1) x(end)]);
xlabel('Time (s)');ylabel('LP Residual');grid;
linkaxes(ax,'x');
end
Inversefilter.m
function residual = InverseFilter(frameToFilter,lporder,a)
l = length(frameToFilter);
if l <= lporder
return;
else
% Note that l-lporder = frameSize.
for i=1:l-lporder
predictedSample=0;
% Note that a(1) = 1; Hence start from j=2.
for j=2:lporder+1
predictedSample = predictedSample - a(j)*frameToFilter(lporder+1+i-j);
end
residual(i) = frameToFilter(i+lporder) - predictedSample;
end
end
Hilberenvelope.m
function [HilbertEnv]=HilbertEnvelope(signal,Fs,plotflag)
tempSeq=hilbert(signal);
HilbertSeq=imag(tempSeq);
sigSqr=signal.*signal;
HilbertSqr=HilbertSeq.*HilbertSeq;
HilbertEnv=sqrt(sigSqr+HilbertSqr);
%wavwrite(HilbertSeq/1.01/max(abs(HilbertSeq)),Fs,16,'ht.wav');
wavwrite(HilbertSeq,Fs,16,'ht.wav');
if(plotflag==1)
% Setting scale for x-axis.
len = length(signal);
x = [1:len]*1/Fs;
figure;
subplot(3,1,1);plot(x,signal);
%xlabel('Time (ms)');
ylabel('Signal');grid;
hold on;
%plot(x,HilbertSeq.*HilbertSeq,'r');
plot(x,HilbertEnv,'r');
hold off;
subplot(3,1,2);plot(x,HilbertSeq);
ylabel('HT of Signal');grid;
subplot(3,1,3);plot(x,HilbertEnv);
ylabel('HE of Signal');grid;
%xlabel('Time (ms)');ylabel('Hilbert Envelope of LP Residual');grid;
xlabel('Time (s)');
%figure;
%plot(signal,HilbertSeq,'k.');grid;
%plot(x, signal.*HilbertSeq);grid;
for i=1:16:len-16
xi = signal(i:i+16);
yi = HilbertSeq(i:i+16);
%plot(xi-mean(xi),yi-mean(yi),'k.');grid;
end
end
Bandlimitfbc.m
clc;
clear all;
close all;
[os,fs] = wavread('sa1_8000.wav');
plot(os),title('Original Speech Signal'),axis tight,grid on;
samplesrange = [14000:20000];
S = os';
s = S(samplesrange);
N = length(s);
figure(),plot(s),title('Extracted voiced speech segment'),axis tight, grid on;
if exist('alfa') == 0
x=2;
alfa=zeros(1,N);
for i=1:N
ex=1;
while abs(ex)>.00001
ex=-besselj(0,x)/besselj(1,x);
x=x-ex;
end
alfa(i)=x;
%fprintf('Root # %g = %8.5f ex = %9.6f \n',i,x,ex);
x=x+pi;
end
end
fts = fft(s);
ftsby2 = fts(1:length(fts)/2);
n = length(ftsby2);
tf = [1:n].*((fs/2)/n);
figure(),plot(tf,20*log10(abs(ftsby2))),title('Spectrum of the Extracted speech signal'),axis tight, grid on;
a = N;
nb = 1:N;
for m1 = 1:N
fbc(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));
end
N=length(s);
nb=1:N;
MM = N;
a=N;
for m1=1:MM
c(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));
p(m1)= (c(m1)^2)*(a^2*(besselj(1,alfa(m1))).^2)/2;
end
f(1:MM)=alfa(1:MM)/(2*pi*a);
fmean1 = sum(f(1:MM).*p(1:MM))/sum(p(1:MM));
mf = fmean1*fs
fd = 300;
range = round(N*fd/(fs/2)); % highest FB order covering 0..fd Hz
fbc = fbc(1:range);
for mm=1:N
g1_r=[alfa(1:range)/a];
rs(mm)=sum(fbc.*besselj(0,g1_r*mm));
end
figure(),plot(rs),title('Band Limited signal with frequency < 300Hz'),axis tight, grid on;
ftrs = fft(rs);
ftrsby2 = ftrs(1:length(ftrs)/2);
nr = length(ftrsby2);
tfr = [1:nr].*((fs/2)/nr);
figure(),plot(tfr,20*log10(abs(ftrsby2))),title('Spectrum of the Band Limited Signal'),axis tight, grid on;
Computeresidual.m
function [residual,LPCoeffs] = computeResidual(speech,Fs,segmentsize,segmentshift,lporder,preempflag,plotflag)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% USAGE : [residual,LPCoeffs,Ne] = computeResidual(speech,framesize,frameshift,lporder,preempflag,plotflag)
% INPUTS :
% speech - speech in ASCII
% Fs - Sampling frequency (Hz)
% segmentsize - framesize for lpanalysis (ms)
% segmentshift - frameshift for lpanalysis (ms)
% lporder - order of lpc
% preempflag - If 1 do preemphasis
% plotflag - If 1 plot results
% OUTPUTS :
% residual - residual signal
% LPCoeffs - 2D array containing LP coeffs of all frames
% Ne - Normalized error
% Pre-emphasis of the speech signal for LP analysis (coefficient assumed).
if preempflag
dspeech = filter([1 -0.95],1,speech);
else
dspeech = speech;
end
dspeech = dspeech(:);
Lspeech = length(dspeech);
framesize = floor(segmentsize * Fs/1000);
frameshift = floor(segmentshift * Fs/1000);
nframes = floor((Lspeech-framesize)/frameshift)+1;
sbuf = buffer(dspeech, framesize, framesize-frameshift, 'nodelay'); % frame the speech for LP analysis
[rs,cs] = size(sbuf);
% Computation of LPCs.
for i=1:cs
nanflag = 0;
erg(i) = sum(sbuf(:,i).*sbuf(:,i));
if erg(i) ~= 0
a = lpc(sbuf(:,i),lporder);
nanflag = sum(isnan(real(a)));
else
a1 = [1 zeros(1,lporder)];
end
if nanflag == 0
A(:,i) = real(a(:));
else
A(:,i) = a1(:);
end
end
% Computation of LP residual.
x = [zeros(1,lporder) (dspeech(:))'];
xbuf = buffer(x, lporder+framesize, lporder+framesize-frameshift,'nodelay');
[rx,cx] = size(xbuf);
tmp = x(Lspeech+lporder-rx+1:Lspeech+lporder);
xbuf(:,cx) = tmp(:); % Last column of the buffer.
% Inverse filtering.
j=1;
for i=1:cx-1
res = filter(A(:,i), 1, xbuf(:,i));
LPCoeffs = A';
size(LPCoeffs)
figure; subplot(2,1,1);plot(x,speech);
xlim([x(1) x(end)]);
ylabel('Signal');grid;
subplot(2,1,2);plot(x,residual);
xlim([x(1) x(end)]);
xlabel('Time (s)');ylabel('LP Residual');grid;
end
Synthesizespeech.m
function [speech] = synthesizeSpeech(Residual, LPCs, Fs, lporder,fsize,fshift,plotflag)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% function [speech]=SynthesizeSpeechUsingResidual(idNoEx,lporder,framesize,frameshift,plotflag)
% fsize,fshift: In ms
% Use .res and .lpc files.
% In .lpc file, each row is a set of LPCs for one frame.
% Get sampling frequency from .res.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
framesize = floor(fsize*Fs/1000);
frameshift = floor(fshift*Fs/1000);
speech=zeros(1,length(Residual));
j=1;
PrevFrm=zeros(1,framesize); % initial synthesis-filter memory (assumed; undefined on the first frame in the original)
for(i=1:frameshift:length(Residual)-framesize)
ResFrm=Residual(i:i+framesize-1);
a=LPCs(j,:);
j=j+1;
SpFrm=SynthFilter(real(PrevFrm),real(ResFrm),real(a),lporder,framesize,0);
speech(i:i+frameshift-1)=SpFrm(1:frameshift);
PrevFrm=SpFrm; % carry the synthesized frame forward as filter memory
%pause
end
speech(i+frameshift:i+framesize-1)=SpFrm(frameshift+1:framesize);
%PROCESSING LASTFRAME SAMPLES
if(i<length(Residual))
ResFrm = Residual(i:length(Residual));
a=LPCs(j,:);
j=j+1;
PrevFrm=speech((i-framesize):(i-1));
SpFrm=SynthFilter(real(PrevFrm),real(ResFrm),real(a),lporder,framesize,0);
speech(i:i+length(SpFrm)-1)=SpFrm(1:length(SpFrm));
end
if(plotflag==1)
figure;
l = length(speech);
x = [1:l]/Fs;
subplot(2,1,1);plot(x,real(Residual),'k');grid;
xlim([x(1) x(end)]);
subplot(2,1,2);plot(x,real(speech),'k');grid;
xlim([x(1) x(end)]);
xlabel('Time (s)');
end
Synthfilter.m
function [SpchFrm]=SynthFilter(PrevSpFrm,ResFrm,FrmLPC,LPorder,FrmSize,plotflag)
%USAGE: [SpchFrm]=SynthFilter(PrevSpFrm,ResFrm,FrmLPC,LPorder,FrmSize,plotflag)
tempfrm=zeros(1,2*FrmSize);
tempfrm((FrmSize-LPorder):FrmSize)=PrevSpFrm((FrmSize-LPorder):FrmSize);
for(i=1:FrmSize)
t=0;
for(j=1:LPorder)
t=t+FrmLPC(j+1)*tempfrm(-j+i+FrmSize);
%pause
end
% ResFrm(i);
%s=-t+ResFrm(i)
%pause
%tempfrm(FrmSize+i)=s;
tempfrm(FrmSize+i)=-t+ResFrm(i);
%pause
end
SpchFrm=tempfrm(FrmSize+1:2*FrmSize);
if(plotflag==1)
figure;
subplot(2,1,1);plot(ResFrm);grid;
subplot(2,1,2);plot(SpchFrm);grid;
end