
BESSEL FEATURES FOR SPEECH SIGNAL PROCESSING

D.MEENAKSHI
G.SILPA
V.RAJITHA

Department of Electronics and Communication Engineering


MAHATMA GANDHI INSTITUTE OF TECHNOLOGY
(Affiliated to Jawaharlal Nehru Technological University, Hyderabad, A.P.)

Chaitanya Bharathi P.O., Gandipet, Hyderabad – 500 075


2010

BESSEL FEATURES FOR SPEECH SIGNAL PROCESSING

PROJECT REPORT
SUBMITTED IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
BY

D.MEENAKSHI (06261A0412)
G.SILPA (06261A0420)
V.RAJITHA (06261A0456)

Department of Electronics and Communication Engineering


MAHATMA GANDHI INSTITUTE OF TECHNOLOGY
(Affiliated to Jawaharlal Nehru Technological University, Hyderabad, A.P.)

Chaitanya Bharathi P.O., Gandipet, Hyderabad – 500 075


2010

MAHATMA GANDHI INSTITUTE OF TECHNOLOGY
(Affiliated to Jawaharlal Nehru Technological University, Hyderabad, A.P.)

Chaitanya Bharathi P.O., Gandipet, Hyderabad-500 075

Department of Electronics and Communication Engineering

CERTIFICATE

Date: 26 April 2010

This is to certify that the project work entitled “Bessel Features For Speech Signal Processing” is a bona fide work carried out by

D.Meenakshi (06261A0412)
G.Silpa (06261A0420)
V.Rajitha (06261A0456)

in partial fulfillment of the requirements for the degree of BACHELOR OF TECHNOLOGY in ELECTRONICS & COMMUNICATION ENGINEERING by the Jawaharlal Nehru Technological University, Hyderabad, during the academic year 2009-10.

The results embodied in this report have not been submitted to any other University or
Institution for the award of any degree or diploma.

(Signature) (Signature)
-------------------------- ------------------
Mr. T. D. Bhatt, Associate Professor, Faculty Advisor/Liaison
Dr. E. Nagabhooshanam, Professor & Head

ACKNOWLEDGEMENT

We express our deep sense of gratitude to our guide, Mr. Suryakanth V. Gangashetty, IIIT, Hyderabad, for his invaluable guidance and encouragement in carrying out our project.

We are highly indebted to our faculty liaison, Mr. T. D. Bhatt, Associate Professor, Electronics and Communication Engineering Department, who has given us all the necessary technical guidance in carrying out this project.

We wish to express our sincere thanks to Dr. E. Nagabhooshanam, Head of the Department of Electronics and Communication Engineering, M.G.I.T., for permitting us to pursue our project in Cranes Varsity and encouraging us throughout the project.

Finally, we thank all the people who have directly or indirectly helped us through the course of our project.

D.Meenakshi
G. Silpa
V. Rajitha

ABSTRACT

The human speech signal is a multicomponent signal in which the components are called formants. Multicomponent signals produce delineated concentrations in the time-frequency plane: there is a clear delineation into different regions. Different time-frequency distributions may give somewhat different representations; however, they all give roughly the same picture in regard to the existence of the components. Most efforts in devising recognition schemes have been directed toward the recognition of human speech. While it has been appreciated for over fifty years that speech is multicomponent, no particular exploitation has been made of that fact.
Recently, however, an ingenious idea has been proposed and developed by
Fineberg and Mammone which takes advantage of the multicomponent nature of a signal.
Suppose, for the sake of illustration, we consider signals consisting of two components.
The phase of the first component of the unknown and of the template candidate is
determined. Subtraction of the two phases for each instant of time defines the
transformation function for going from the template to the unknown. One can think of
this as the possible distortion function for the first component. It would equal zero if there were no distortion. Similarly, one determines the distortion function for the second component.
If the two distortion functions are equal then we have a match. Fineberg and Mammone
have successfully used this method for the classification of speech.
The excellence of the results can be interpreted as indicating that indeed formants
are correlated and distorted in the same way. This is an important finding about the nature
of speech. Note that in the above discussion, the distortion function is taken to be the
difference of the phases. However, different circumstances may make other distortion functions more appropriate. For example, one can define the distortion function to be the ratio of the two phases. In general, one can think of the distortion as a function of the
signal and the environment. It would be of considerable interest to investigate distortion
functions for common situations.

The discrete energy separation algorithm (DESA) together with Gabor filtering provides a standard approach to estimating the amplitude envelope (AE) and the instantaneous frequency (IF) functions of a multicomponent amplitude- and frequency-modulated (AM-FM) signal. The filtering operation introduces amplitude and phase modulation in the separated monocomponent signals, which may lead to an error in the final estimation of the modulation functions. We have used a method called the Fourier-Bessel expansion-based discrete energy separation algorithm (FB-DESA) for component separation and estimation of the AE and IF functions of a multicomponent AM-FM signal. The FB-DESA method does not introduce any amplitude or phase modulation in the separated monocomponent signal, leading to accurate estimation of the AE and IF functions. Simulation results with synthetic and natural signals are included to illustrate the effectiveness of the proposed method.

The Voice Onset Time (VOT) is an important characteristic of stop consonants which plays a significant role in perceptual discrimination of phonemes of the same place of articulation [6]. It also plays an important role in word segmentation, stress-related phenomena, and dialectal and accented variations in speech patterns [7-8]. The VOT can also be used for classification of accents: it can be used to classify Mandarin-, Turkish-, German- and American-accented English. It is an important temporal feature which is often overlooked in speech perception, speech recognition, as well as accent detection.

Many speech analysis situations depend on accurate estimation of the location of the epoch within the glottal pulse. For example, knowledge of the epoch location is useful for accurate estimation of the fundamental frequency (F0). Analyses of speech signals in the closed-glottis regions provide an accurate estimate of the frequency response of the supralaryngeal vocal-tract system [12], [13]. With the knowledge of epochs, it may be possible to determine the characteristics of the voice source by a careful analysis of the signal within a glottal pulse. The epochs can be used as pitch markers for prosody manipulation, which is useful in applications like text-to-speech synthesis, voice conversion and speech rate conversion.

Table of contents

CERTIFICATE FROM ECE DEPARTMENT
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES

CHAPTER 1. OVERVIEW
1.1 Introduction
1.2 Aim of the project
1.3 Methodology
1.4 Significance and applications
1.5 Organization of work
CHAPTER 2. INTRODUCTION TO BESSEL FEATURES FOR SPEECH
SIGNAL PROCESSING
2.1 Introduction
2.2 Multicomponent signal
2.2.1 Series representation
2.3 Fourier Bessel
2.4 Conclusions
CHAPTER 3. REVIEW OF APPROACHES FOR BESSEL FEATURES
3.1 Introduction
3.2 Parseval's Formula
3.3 Hankel transform
3.4 Conclusions
CHAPTER 4. THEORY OF BESSEL FEATURES
4.1 Introduction
4.2 Solution of Differential Equation
4.3 Mean Frequency computation
4.4 Reconstruction of the signal
4.5 Conclusions
CHAPTER 5. BESSEL FEATURES FOR DETECTION OF VOICE ONSET TIME
(VOT)
5.1 Introduction

5.2 Significance of VOT
5.3 Detection of VOT
5.3.1 Fourier-Bessel expansion
5.3.2 AM-FM model and DESA method
5.3.3 Approach to detect VOTs from speech using
Amplitude Envelope (AE)
5.4 Results
5.5 Conclusions
CHAPTER 6. BESSEL FEATURES FOR DETECTION OF
GLOTTAL CLOSURE INSTANTS (GCI)
6.1 Introduction
6.2 Significance of Epoch in speech analysis
6.3 Review of existing approaches
6.3.1 Epoch extraction from short time
energy of speech signal
6.3.2 Epoch extraction from linear prediction analysis
6.3.3 Limitation of existing approaches
6.4 Detection of GCI using FB expansion and AM-FM model
6.4.1 Fourier-Bessel expansion
6.4.2 AM-FM model and DESA method
6.5 Studies on detection of GCIs for various speech utterances
6.6 Glottal activity detection
6.7 Conclusion
CHAPTER 7. SUMMARY AND CONCLUSIONS
7.1 Summary of the work
CHAPTER 8. REFERENCES

LIST OF FIGURES

2.1 Zero order Bessel function………………………………………

3.1 Sinusoidal function………………………………………………

4.1 Bessel function for different order………………………………

4.2 Reconstructed Band limited signal………………………………

5.1 Regions of significant events in the production............................

5.2 Plot of waveforms for speech utterance /ke/.................................

5.3 Plot of waveforms for speech utterance /te/..................................

5.4 Plot of waveforms for speech utterance /pe/.................................

5.5 Plot of the Bar graphs for utterances of /ke/, /te/, /pe/..................

6.1 Epoch (or GCI) extraction of a male speaker..............................

6.2 Epoch (or GCI) extraction of a female speaker..........................

LIST OF TABLES

5.1 FB coefficient orders for emphasizing the vowel and consonant parts

5.2 VOT values for female (F01) and male (M05) speakers

CHAPTER 1
OVERVIEW

1.1 INTRODUCTION
Multicomponent signals produce delineated concentrations in the time-frequency plane. The human speech signal is a multicomponent signal in which the components are called formants. Most efforts in devising recognition schemes have been directed toward the recognition of human speech. While it has been appreciated for over fifty years that speech is multicomponent, no particular exploitation has been made of that fact. Recently, however, an ingenious idea has been proposed and developed by Fineberg and Mammone which takes advantage of the multicomponent nature of a signal.

1.2 AIM OF THE PROJECT


The aim is to explore Bessel features and apply them to speech signal processing tasks such as accent detection, language identification and speaker identification.

1.3 METHODOLOGY
The Fourier-Bessel (FB) expansion and the AM-FM model have been employed to obtain efficient results in speech signal processing.

1.4 SIGNIFICANCE AND APPLICATIONS


The applications include speech segmentation, speaker verification, speaker identification, speech recognition and language identification. Pattern recognition tasks may also be addressed using these features.

1.5 ORGANISATION OF WORK


First, the significance of Bessel features has been studied, and their applications in speech signal processing for the detection of Voice Onset Time (VOT) and Glottal Closure Instants (GCI) have been examined.

CHAPTER 2
INTRODUCTION TO BESSEL FEATURES FOR SPEECH
SIGNAL PROCESSING

2.1 INTRODUCTION

Representation of a signal directly by its sample values or by an analytic function may not be desirable or practical. Many practical signals are highly redundant; both image and speech signals fall under this category, and it may be desirable and possibly necessary to represent the signal with a smaller number of samples for economy of storage and/or transmission bandwidth. The processing of signals can often be carried out more efficiently in a domain other than that of the original signal. Many natural and man-made signals have a unique structure in time and frequency. In the time-frequency plane there is a clear delineation into different regions. Different time-frequency distributions may give somewhat different representations; however, they all give roughly the same picture in regard to the existence of the components. Pattern recognition techniques rely on the ability to generate a set of coefficients from the raw data (time-domain samples) that are more compact and more closely related to the signal characteristics of interest.

2.2 MULTICOMPONENT SIGNAL

The human speech signal is a multicomponent signal in which the components are called formants. Multicomponent signals produce delineated concentrations in the time-frequency plane. The general belief is that in a multicomponent signal each component in the time-frequency plane corresponds to a part of the signal. That is, if
S(t) = S1(t) + S2(t),
then each part, S1 and S2, corresponds to a component. A monocomponent signal looks like a single mountain ridge. The center of the ridge forms a trajectory in the time-frequency plane that generally varies with time. The trajectory is the instantaneous frequency of the signal. If the signal is written in the form
S(t) = A(t) e^(jφ(t)),

The instantaneous frequency is an average, namely the average of the frequencies at a
given time. The broadness at that time is the spread (root mean square) of the frequencies at time t; we call it the instantaneous bandwidth.
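For a signal in this AM-FM form, the amplitude envelope A(t) and the instantaneous frequency can be estimated from the analytic signal. The sketch below uses SciPy's Hilbert transform for illustration (an assumption of this example; later chapters use the FB expansion with DESA instead), and all test-signal parameters are arbitrary:

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000                                   # sampling rate (Hz), illustrative
t = np.arange(0, 1.0, 1.0 / fs)

# synthetic AM-FM signal: slow envelope, 500 Hz carrier with sinusoidal FM
A = 1.0 + 0.5 * np.cos(2 * np.pi * 3 * t)
f_inst = 500 + 50 * np.sin(2 * np.pi * 2 * t)
phase = 2 * np.pi * np.cumsum(f_inst) / fs
s = A * np.cos(phase)

z = hilbert(s)                              # analytic signal A(t) e^{j phi(t)}
ae = np.abs(z)                              # amplitude envelope estimate
inst_f = np.diff(np.unwrap(np.angle(z))) * fs / (2 * np.pi)

# compare away from the interval edges, where the Hilbert estimate is unreliable
mid = slice(fs // 4, 3 * fs // 4)
print(np.max(np.abs(ae[mid] - A[mid])), np.max(np.abs(inst_f[mid] - f_inst[mid])))
```

Because the envelope and the modulation are slow relative to the carrier, both estimates track the true functions closely in the interior of the interval.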

2.2.1 SERIES REPRESENTATION

If the frequency content of a signal is desired, a series representation packs the frequency information into fewer samples than a time-domain representation. Hence,
signal decomposition by means of series representation is important to such applications
as Seismic, Speech and Image processing. Fourier series and Fourier transform
representations of the speech signal have been applied extensively in speech storage and
speech compression systems. The advantage of these representations lies in the ability to
store or transmit the parameters associated with the transformation instead of the binary
codes representing the values of samples of the waveform taken in the time domain.
Usually the transform domain parameters can be handled with greater efficiency than the
time domain parameters.

2.3 FOURIER BESSEL

We are interested in expanding a speech signal into a Fourier Bessel series.


Generally speaking the spectrum cannot be used to ascertain or define whether a signal is
mono component or not, although in some cases it may give an indication that
components are present. The reason is that the spectrum tells us what frequencies existed
for the total duration of the signal, while the existence of components happens locally in
time. If each component is limited to its own mutually exclusive band for all time then
and only then the spectrum may give an indication of the existence of components. If
there is an overlap at any point of time, the spectrum will not indicate that there were
components since the spectral components will coalesce.
The spectrum gives no indication of components; it just tells us what frequencies existed irrespective of when they existed. In general the spectrum is not a good indicator of components. What is needed is a measure of the spread of frequencies at each time: the instantaneous bandwidth.

Fig 2.1 Zero-order Bessel function (basis function for the zero-order expansion)

It appears that the application of the F-B series to speech processing, particularly speaker identification, bears further research. The shift-variant property of the Hankel transform may prove valuable for non-stationary analysis, and there are some indications that fewer coefficients may be required. Since the coefficients are real, the speech can be directly reconstructed from its coefficient-versus-time-index plot without the need to retain phase components; this may prove to be of some use if conversion back to the time domain is desired.

The Fourier series representation employs an orthogonal set of sinusoidal


functions as a basis set, while the Fourier transform uses a complex exponential function,
related to the sinusoids through Euler's relation, as its kernel. The sinusoidal functions are periodic and are ideal for representing general periodic functions, but may not fully match the properties of other waveforms. In particular, the random, non-stationary nature of speech waveforms does not lead to the most efficient representation by sinusoid-based transformations. In general, for series expansions or integral transforms of signals, the representation converges more rapidly if there is a better match between the basis or kernel function and the signal being represented. Thus the Fourier transform of a sine wave is impulsive, indicating a perfect match between the signal and the kernel function and a correspondingly infinite convergence rate for the transform. This principle may be
exploited to improve the signal to noise ratio of speech signals in digital signal
processing.

A basis set or kernel function with regular zero-crossings and decaying


amplitudes would be expected to provide efficiencies in representing speech signals for
storage and compression. Bessel functions provide the desired properties and have been
accordingly exploited for speech processing. The Fourier-Bessel transform can be shown to be more efficient in representing speech-like waveforms by comparing the Fourier and Fourier-Bessel transforms of a rectangular pulse, a triangular pulse, and a linearly damped sinusoidal pulse.

2.4 CONCLUSIONS

Generally speaking the spectrum cannot be used to ascertain or define whether a


signal is mono component or not, although in some cases it may give an indication that
components are present. The reason is that the spectrum tells us what frequencies existed
for the total duration of the signal, while the existence of components happens locally in
time. It appears that the application of F-B series speech processing, particularly speaker
identification, bears further research. The shift variant property of the Hankel transform
may prove valuable for non-stationary analysis and some indications that fewer
coefficients may be required. Since the coefficients are real, the speech can be directly reconstructed from its coefficient-versus-time-index plot without the need to retain phase components.

CHAPTER 3
REVIEW OF APPROACHES FOR BESSEL FEATURES

3.1 INTRODUCTION

Any orthonormal set of basis functions can be used to represent some arbitrary
function. Fourier series theory includes that the series coefficients are given by a Discrete
Fourier transform, thus coefficient generation is an easy process with the numerous FFT
algorithms that abound.
Calculation of the Fourier-Bessel series coefficients requires computation of a
Hankel Transform, which until recently greatly diminished consideration of this series for
potential applications. Fast Hankel transform algorithms have now been developed which allow computation of F-B coefficients at a speed only slightly slower than that of Fourier coefficients; this should result in increased use of the F-B expansion.

3.2 PARSEVAL’S FORMULA


The theory of series representation of an arbitrary signal is more general than
expressing a signal as a sum of sinusoids. In fact, any orthonormal set of basis functions
can be used to represent some arbitrary function. For example, if we define an orthogonal
set of functions as follows,
∫ ɸm(t) ɸn(t) dt = 1, m = n; 0, m ≠ n,
the function can be written as
f(t) = ∑ Cn ɸn(t),
where
Cn = ∫ f(t) ɸn(t) dt.
If we restrict f(t) to signals possessing finite energy and band-limited frequency spectra, a useful property can be stated:
The energy, E, of f(t) is given by
E = ∫ f(t)² dt = ∑ Cn² < ∞

This is simply a restatement of Parseval's well-known formula concerning Fourier series coefficients.
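The relation can be verified numerically for the discrete Fourier basis. The sketch below, with an arbitrary random test signal, checks the DFT form of Parseval's formula using NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                    # arbitrary finite-energy signal

X = np.fft.fft(x)
time_energy = np.sum(x ** 2)                    # E computed in the time domain
freq_energy = np.sum(np.abs(X) ** 2) / len(x)   # same E from the coefficients
print(time_energy, freq_energy)
```

Up to floating-point rounding, the two energies agree exactly.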

Fig 3.1 Sinusoidal functions: basis functions for the Fourier transform

Although the generalized form of series representation is useful for the


construction of mathematical proofs, we are more interested in specific choices of the
basis function, ɸn(t). Obviously, choosing
ɸn(t) = e^(jnωt)
results in a Fourier series. If f(t) is available over a finite segment of time (-T, T), a realistic assumption, it may be desirable to concentrate the energy within this interval. Denoting the concentrated energy by the fractional energy ratio
E = ∫ from -T to T │ɸn(t)│² dt / ∫ over all t │ɸn(t)│² dt
As the waveform becomes more speech-like, the Fourier-Bessel transform converges faster than the Fourier transform. This means that the frequency spectrum is
narrower, so that when dealing with signal plus wide band noise, a filter with lower
cutoff frequency can be employed to attenuate more of the noise power without also
attenuating the signal.

3.3 HANKEL TRANSFORM

It can be shown that E, the fractional energy ratio, is maximized for ɸn(t) corresponding to the prolate spheroidal functions. This choice of basis function has been investigated thoroughly and has found use within many areas of digital signal processing. Another possible choice for ɸn(t) is a family of Bessel functions, which results in an expansion termed the Fourier-Bessel series. In this case, choosing the zero-order Bessel function for illustration,
ɸm(t) = J0(λm t),
and f(t) can then be expressed as
f(t) = ∑ Cm J0(λm t).
Fourier series possesses some analytical properties such as shift invariance that
make the various mathematical manipulations much simpler. Fourier series theory
includes that the series coefficients are given by a Discrete Fourier transform, thus
coefficient generation is an easy process with the numerous FFT algorithms that abound.
Calculation of the Fourier-Bessel series coefficients requires computation of a Hankel
Transform, which until recently greatly diminished consideration of this series for
potential applications. Fast Hankel transform algorithms have now been developed which allow computation of F-B coefficients at a speed only slightly slower than that of Fourier coefficients; this should result in increased use of the F-B expansion.

3.4 CONCLUSIONS

Any orthonormal set of basis functions can be used to represent some arbitrary
function. Calculation of the Fourier-Bessel series coefficients requires computation of a
Hankel Transform, which until recently greatly diminished consideration of this series for
potential applications. A fast Hankel transform algorithm was presented that allows the
Fourier-Bessel series coefficients to be computed efficiently.

CHAPTER 4
THEORY OF BESSEL FEATURES

4.1 INTRODUCTION
Bessel functions arise as solutions of Bessel's differential equation. The general solution is expressed in terms of Jn(x), a Bessel function of the first kind of order n, and Yn(x), a Bessel function of the second kind of order n. Bessel functions are expressible in series form. It should be noted that the FB series coefficients Cm are unique for a given
signal, similarly as the Fourier series coefficients are unique for a given signal. However,
unlike the sinusoidal basis functions in the Fourier series, the Bessel functions decay over
time. This feature of the Bessel functions makes the FB series expansion suitable for
nonstationary signals.
Also, it is possible that the Fourier-Bessel coefficients in some sense better
capture the fundamental nature of the speech waveform; the shift variant property may be
desirable and possibly result in improved speaker identification/authentication
probabilities. Since the Fourier-Bessel coefficients are real, the noisy phase problem upon reconstruction is avoided, which may be advantageous. The entire range of image
processing algorithms developed over the past several decades would be available for
exploitation to improve upon the speech characteristics.

4.2 SOLUTION OF THE DIFFERENTIAL EQUATION

Bessel functions arise as a solution of the differential equation:


x²y″ + xy′ + (x² - n²)y = 0, n ≥ 0, (1)
this is called Bessel's differential equation. The general solution of (1) is given by
y = C1 Jn(x) + C2 Yn(x),
where Jn(x) is called a Bessel function of the first kind of order n and Yn(x) is called a Bessel function of the second kind of order n. Bessel functions are expressible in series form; for example, Jn(x) can be written
Jn(x) = ∑r (-1)^r (x/2)^(n+2r) / [r! Γ(n+r+1)]

And in particular
J0(x) = 1 - x²/2² + x⁴/(2²·4²) - …
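The partial sums of this series can be checked numerically. In the sketch below, the series is compared against the integral representation J0(x) = (1/π) ∫0^π cos(x sin θ) dθ, used purely as an independent reference; the number of terms and grid points are arbitrary choices:

```python
import numpy as np

def j0_series(x, terms=30):
    """Partial sum of J0(x) = sum_r (-1)^r (x/2)^(2r) / (r!)^2."""
    total, term = 0.0, 1.0                       # term for r = 0 is 1
    for r in range(terms):
        total += term
        term *= -(x / 2) ** 2 / ((r + 1) ** 2)   # ratio of consecutive terms
    return total

def j0_integral(x, n=10001):
    """Reference value from J0(x) = (1/pi) * int_0^pi cos(x sin t) dt,
    evaluated with a composite trapezoid rule."""
    t = np.linspace(0.0, np.pi, n)
    y = np.cos(x * np.sin(t))
    h = t[1] - t[0]
    return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1]) / np.pi

for x in (0.5, 2.4048, 5.0):
    print(x, j0_series(x), j0_series(x) - j0_integral(x))
```

Near x = 2.4048, the first positive root of J0, the series sum is close to zero, consistent with the roots λm used throughout this chapter.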
It can be readily shown that Bessel functions are orthogonal with respect to the weighting function x. This can be seen by computing
∫0^1 x Jn(αx) Jn(βx) dx = [β Jn(α) J'n(β) - α Jn(β) J'n(α)] / (α² - β²)
and
∫0^1 x Jn²(αx) dx = ½ [J'n(α)² + (1 - n²/α²) Jn²(α)].
Now if α and β are different roots of Jn(x) = 0, we can write
∫0^1 x Jn(αx) Jn(βx) dx = 0, α ≠ β,
and thus Jn(αx) and Jn(βx) are orthogonal with respect to the weighting function x. Having established orthogonality, a series expansion of an arbitrary function can be written in terms of Bessel functions in the form
f(x) = ∑ Cm Jn(λm x),
where λ1, λ2, λ3, … are the positive roots of Jn(x) = 0. The coefficients Cm are given by
Cm = 2 ∫0^1 x f(x) Jn(λm x) dx / [Jn+1(λm)]².
If we wish to expand f(t) over some arbitrary interval (0, a), the zero-order Bessel series expansion becomes
f(t) = ∑ Cm J0(λm t / a), 0 < t < a,
with the coefficients Cm calculated from
Cm = 2 ∫0^a t f(t) J0(λm t / a) dt / (a² [J1(λm)]²),
where λm are the ascending-order positive roots of J0(λ) = 0. The integral in the numerator is a Hankel transform.
The coefficients of the FB expansion have been used to compute the Mean
Frequency. The FB coefficients are unique for a given signal in the same way that Fourier
series coefficients are unique for a given signal. However, unlike the sinusoidal basis functions in the Fourier transform, the Bessel functions are aperiodic and decay over time. These features of the Bessel functions make the FB series expansion more suitable for the analysis of non-stationary signals than the plain Fourier transform.
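The coefficient formula above can be evaluated by direct numerical integration. The sketch below assumes SciPy's Bessel routines are available; the sampling rate, interval, tone frequency and expansion order are illustrative choices. It computes the zero-order FB coefficients of a 200 Hz tone and checks that the coefficient magnitude peaks at the order corresponding to the tone frequency, m ≈ 2·a·f0:

```python
import numpy as np
from scipy.special import j0, j1, jn_zeros

fs, a, f0 = 4000, 1.0, 200.0                # sampling rate, interval (0, a), tone
t = np.arange(0, a, 1.0 / fs)
x = np.sin(2 * np.pi * f0 * t)

Q = 500                                     # expansion order (illustrative)
lam = jn_zeros(0, Q)                        # ascending positive roots of J0
dt = 1.0 / fs
# C_m = 2 * int_0^a t x(t) J0(lam_m t / a) dt / (a^2 * J1(lam_m)^2)
C = np.array([2.0 * np.sum(t * x * j0(l * t / a)) * dt / (a**2 * j1(l)**2)
              for l in lam])

m_peak = int(np.argmax(np.abs(C))) + 1      # 1-based order of the peak coefficient
print(m_peak, m_peak / (2 * a))             # peak order mapped to Hz
```

The mapping m/(2a) used in the last line anticipates the frequency-per-order relation given in the next section.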

4.3 MEAN FREQUENCY COMPUTATION

The zero-order Fourier-Bessel series expansion of a signal x(t) considered over some arbitrary interval (0, a) is expressed as
x(t) = ∑m Cm J0(λm t / a), 1 ≤ m ≤ Q,
where λm, m = 1, 2, 3, …, are the ascending-order positive roots of J0(λ) = 0, and J0((λm/a)t) are the zero-order Bessel functions. The sequence of Bessel functions {J0((λm/a)t)} forms an orthogonal set on the interval 0 ≤ t ≤ a with respect to the weight t.
Using the orthogonality of the set {J0((λm/a)t)}, the FB coefficients Cm are computed by using the following equation:
Cm = 2 ∫0^a t x(t) J0(λm t / a) dt / (a² [J1(λm)]²),
with 1 ≤ m ≤ Q, where Q is the order of the FB expansion and J1(·) are the first-order Bessel functions. The FB expansion order Q must be known a priori. The interval between successive zero-crossings of the Bessel function J0(λ) increases slowly and approaches π in the limit. If the order Q is unknown, then in order to cover the full signal bandwidth (half of the sampling frequency), Q must be set equal to the length of the signal.
It should be noted that the FB series coefficients Cm are unique for a given signal,
similarly as the Fourier series coefficients are unique for a given signal. However, unlike
the sinusoidal basis functions in the Fourier series, the Bessel functions decay over time.
This feature of the Bessel functions makes the FB series expansion suitable for
nonstationary signals.
The mean frequency is calculated as in [11]:

Fmean = ∑m fm Em / ∑m Em

where
Em = Cm² (energy at order m),
fm = m / (2a) (frequency at order m).
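A minimal numerical sketch of this computation, assuming SciPy's Bessel routines: the FB coefficients of an arbitrary 200 Hz test tone are obtained by direct quadrature, and the energy-weighted mean frequency should land near the tone frequency:

```python
import numpy as np
from scipy.special import j0, j1, jn_zeros

fs, a, f0 = 4000, 1.0, 200.0                # all parameters illustrative
t = np.arange(0, a, 1.0 / fs)
x = np.sin(2 * np.pi * f0 * t)

Q = 500
lam = jn_zeros(0, Q)                        # ascending positive roots of J0
dt = 1.0 / fs
C = np.array([2.0 * np.sum(t * x * j0(l * t / a)) * dt / (a**2 * j1(l)**2)
              for l in lam])

m = np.arange(1, Q + 1)
E = C ** 2                                  # energy at order m
f_m = m / (2 * a)                           # frequency associated with order m
F_mean = np.sum(f_m * E) / np.sum(E)
print(F_mean)
```

Since the coefficient energy is concentrated around the order corresponding to 200 Hz, the weighted mean stays close to the tone frequency.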
Signals with decaying characteristics (such as speech) may be more compactly represented by Bessel-function basis vectors rather than by pure sinusoids. Also, it is possible that the Fourier-Bessel
coefficients in some sense better capture the fundamental nature of the speech waveform;

the shift variant property may be desirable and possibly result in improved speaker
identification/authentication probabilities.
For the test function f (t) =J0 (t), the Fourier series coefficients produced an
extremely accurate reconstruction of the function under transformation. A Fourier-Bessel
expansion resulted in a higher error, but the numbers of coefficients were dramatically
different. Regenerating f(t)= J0 (t) from Fourier coefficients required all 256 values to
achieve the result; by contrast just one Fourier-Bessel coefficient is required to
reconstruct the function.
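This concentration property is easy to reproduce numerically: expanding the first zero-order FB basis function in the FB series should place essentially all of the energy in C1. A sketch, assuming SciPy; the grid size and number of orders are arbitrary:

```python
import numpy as np
from scipy.special import j0, j1, jn_zeros

a, n = 1.0, 4000
t = np.linspace(0.0, a, n, endpoint=False)
lam = jn_zeros(0, 50)                       # first 50 positive roots of J0
f = j0(lam[0] * t / a)                      # the first FB basis function itself

dt = a / n
C = np.array([2.0 * np.sum(t * f * j0(l * t / a)) * dt / (a**2 * j1(l)**2)
              for l in lam])
print(C[0], np.max(np.abs(C[1:])))          # ~1 and ~0, up to quadrature error
```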
Any function decomposed into basis vectors of the same analytic form will
produce a single coefficient. Indeed, expanding the test signal f (t) =sin (t) via Fourier
series requires a single coefficient. Nevertheless, the point being made is that an
unknown signal will be more efficiently (more information in fewer coefficients)
represented if expanded in the set of basis functions that “resemble” itself. Since the Fourier-Bessel coefficients are real, the noisy phase problem upon reconstruction is
avoided, which may be advantageous. The entire range of image processing algorithms
developed over the past several decades would be available for exploitation to improve
upon the speech characteristics.

The figure below shows Bessel functions of different orders: the red waveform represents the zeroth-order Bessel function, the green waveform the first-order Bessel function, and the blue waveform the second-order Bessel function.

Fig 4.1 Bessel functions of different order

The zero-order Fourier-Bessel series expansion of a signal x(t) over the interval (0, a) is
x(t) = ∑ Cm J0(λm t / a).

4.4 RECONSTRUCTION OF THE SIGNAL

In Fig 4.2, panel (a) shows the speech signal, panel (b) the frequencies present in the speech signal as a cluster, and panel (c) the band-limited signal reconstructed from the original using Bessel coefficients.

Fig 4.2 Reconstruction of the speech signal using Bessel coefficients.
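The analysis-synthesis round trip behind such a reconstruction can be sketched as follows, assuming SciPy; the two-term test signal is an arbitrary stand-in for a band-limited signal, and the signal is resynthesized from its coefficients as a sum of weighted Bessel basis functions:

```python
import numpy as np
from scipy.special import j0, j1, jn_zeros

a, n = 1.0, 4000
t = np.linspace(0.0, a, n, endpoint=False)
lam = jn_zeros(0, 10)                       # roots of J0 for the first 10 orders

# test signal built from two FB basis functions (orders 3 and 7)
x = 0.7 * j0(lam[2] * t / a) + 0.3 * j0(lam[6] * t / a)

dt = a / n
# analysis: coefficients by direct quadrature of the FB formula
C = np.array([2.0 * np.sum(t * x * j0(l * t / a)) * dt / (a**2 * j1(l)**2)
              for l in lam])

# synthesis: x_hat(t) = sum_m C_m J0(lam_m t / a)
x_hat = sum(c * j0(l * t / a) for l, c in zip(lam, C))
err = np.max(np.abs(x - x_hat))
print(C[2], C[6], err)
```

The recovered coefficients match the construction weights, and the reconstruction error is limited only by the quadrature accuracy.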

4.5 CONCLUSIONS

In this chapter, we presented the general signal decomposition problem in terms of an orthogonal series expansion. Focus was primarily on the Fourier-Bessel series expansion, with the Fourier series used for comparison purposes. A fast Hankel transform algorithm allows the Fourier-Bessel series coefficients to be computed efficiently. Mean frequency computation is made possible using Fourier-Bessel coefficients.

CHAPTER 5
BESSEL FEATURES FOR DETECTION OF VOICE ONSET
TIME (VOT)

5.1 INTRODUCTION

The instant of onset of vocal fold vibration relative to the release of closure
(burst) is the commonly used feature to analyze the manner of articulation in production
of stop consonants. The interval between the time of burst release to the time of onset of
vocal fold vibration is defined as voice onset time (VOT) [4].
Accurate determination of VOT from acoustic signals is important both theoretically
and clinically. From a clinical perspective, the VOT constitutes an important clue for
assessment of speech production of hearing impaired speakers [5]. From a theoretical
perspective, the VOT of stop consonants often serves as a significant acoustic correlate to
discriminate voiced from unvoiced, and aspirated from unaspirated stop consonants. The
unvoiced unaspirated stop consonants typically have low and positive VOTs, meaning
that the voicing of the following vowel begins near the instant of closure release. The
unvoiced aspirated stop consonants followed by a vowel have slightly higher VOTs than
their unaspirated counterparts, as the burst is followed by the aspiration noise. The
duration of the VOT in such cases is a practical measure of aspiration. The longer the
VOT, the stronger is the aspiration. On the other hand, voiced stop consonants have a
negative VOT, meaning that the vocal folds start vibrating before the stop is released.

5.2 SIGNIFICANCE OF VOICE ONSET TIME (VOT)

The Voice Onset Time (VOT) is an important characteristic of stop consonants


which plays a significant role in perceptual discrimination of phonemes of the same place
of articulation [6]. It also plays an important role in word segmentation, stress related
phenomena, and dialectal and accented variations in speech patterns [7-8]. The VOT can
also be used for classification of accents.

Voice Onset Time (VOT) can be used to classify Mandarin-, Turkish-, German- and American-accented English. It is an important temporal feature which is often overlooked in speech perception, speech recognition, as well as accent detection. Therefore, the
amplitude envelope (AE) function is useful for detection of VOT. The sub-band
frequency analysis is performed to detect VOT of unvoiced stops in [9]. The amplitude
modulation component (AMC) is used to detect vowel plus voiced onset regions (VORs)
in different frequency bands assuming the stop to vowel transition has different amplitude
envelopes for partitioned frequency ranges. In the following section we discuss an
effective VOT detection approach using the FB expansion followed by the AM-FM model for
stop consonant vowel units (/ke/, /ki/, /ku/, /te/, /ti/, /tu/, /pe/, /pi/, /pu/). The dominant
frequency bands of the voiced onset region for various stops and vowels are as
follows: /k/ between 1500 and 2500 Hz; /t/ between 2000 and 3000 Hz; /p/ between 2500
and 3500 Hz; vowel between 300 and 1200 Hz [10,6]. The VOT detection discussed here
is conceptually simpler and can be implemented as a one step process, which makes real
time implementation feasible.

5.3 DETECTION OF VOICE ONSET TIME (VOT)

The detection of VOT has been done using the FB expansion and the AM-FM model.
Section 5.3.1 discusses the FB expansion. The AM-FM signal model and its analysis
using the DESA method are discussed in Section 5.3.2. Section 5.3.3 describes the
proposed algorithm for VOT detection using the FB expansion and the AE function of
the AM-FM model. The VOT detection results for speech data collected from various
male and female speakers are presented in Section 5.4.

5.3.1 FOURIER-BESSEL (FB) EXPANSION


The zero-order Fourier-Bessel series expansion of a signal x(t) considered over
some arbitrary interval (0, T) is expressed as

x(t) = sum_{m=1}^{M} C_m J0(lambda_m t / T),

where the coefficients C_m are given by

C_m = [2 / (T^2 J1(lambda_m)^2)] integral_{0}^{T} t x(t) J0(lambda_m t / T) dt,

and lambda_m, m = 1, 2, ..., M, are the ascending-order positive roots of J0(lambda) = 0,
with J0 and J1 denoting the zero-order and first-order Bessel functions of the first kind.

It has been shown that there is one-to-one correspondence between the frequency
content of the signal and the order (m) where the coefficient attains peak magnitude [10].
If the AM-FM components of the formants of the speech signal are well separated in the
frequency domain, the speech signal components will be associated with various distinct
clusters of non-overlapping FB coefficients [11]. Each component of the speech signal
can be reconstructed by identifying and separating the corresponding FB coefficients.
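This band-selection step is exactly what the MATLAB scripts in the Source Code section implement. A minimal NumPy/SciPy sketch of the same computation is given below for illustration; the report's own implementation is in MATLAB, and the 10-cycle test signal is our example:

```python
import numpy as np
from scipy.special import j0, j1, jn_zeros

def fb_coefficients(x, num_orders):
    """Zero-order Fourier-Bessel coefficients of x over n = 1..N."""
    N = len(x)
    n = np.arange(1, N + 1)
    lam = jn_zeros(0, num_orders)          # ascending positive roots of J0
    C = np.array([(2.0 / (N**2 * j1(lam[m])**2))
                  * np.sum(n * x * j0(lam[m] * n / N))
                  for m in range(num_orders)])
    return C, lam

def fb_reconstruct(C, lam, N, p1, p2):
    """Re-sum only orders p1..p2 (1-based) to emphasize one band."""
    n = np.arange(1, N + 1)
    return sum(C[m] * j0(lam[m] * n / N) for m in range(p1 - 1, p2))

# A sinusoid with k cycles over the record peaks near order m = 2k
N = 200
x = np.sin(2 * np.pi * 10 * np.arange(1, N + 1) / N)
C, lam = fb_coefficients(x, 60)
```

Keeping only a cluster of orders around the peak (as Table 5.1 does for the vowel and consonant bands) yields the band-limited signal used in the rest of the chapter.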

5.3.2 AM-FM MODEL AND DESA METHOD


For both continuous- and discrete-time signals, Kaiser has defined a nonlinear energy
tracking operator [12]. For the discrete-time case, the energy operator for x[n] is defined
as

Psi[x[n]] = x[n]^2 - x[n-1]*x[n+1].

The energy operator can estimate the modulating signal, or more precisely its scaled
version, when either AM or FM alone is present [12]. When AM and FM are present
simultaneously, three algorithms are described in [12] to estimate the instantaneous
frequency Omega[n] and the amplitude envelope A[n] separately. The best among the
three, according to performance, is the discrete energy separation algorithm 1 (DESA-1).

5.3.3 APPROACH TO DETECT VOTS FROM SPEECH USING AMPLITUDE
ENVELOPE (AE)

In order to detect the voice onset time, the emphasized consonant and vowel regions of
the speech utterance are separated using the FB expansion over appropriate ranges of
orders of the Bessel function. Since the separated regions are narrow-band signals, they
can be modeled using the AM-FM signal model. The DESA technique is applied on the
emphasized regions of the speech utterance in order to estimate the AE function for
detection of VOT. The beginning of the vowel is detected from the vowel-emphasized
part. Then, by tracing back from the beginning of the vowel towards the beginning of
the consonant region in the consonant-emphasized part, the beginning of the consonant
region is detected. The VOT is obtained by taking the difference between the beginning
of the vowel and the beginning of the consonant region.

5.4 RESULTS

5.4.1 RESULTS OF VARIOUS UTTERANCES FOR MALE AND FEMALE SPEAKERS

In this section we discuss the suitability of the proposed method for VOT detection.
The speech data used consists of isolated utterances of the units /ke/, /te/, /pe/, /ki/,
/ti/, /pi/, /ku/, /tu/, /pu/ from 24 speakers (12 male and 12 female). The speech
signals were sampled at 16 kHz with 16 bits resolution, and stored as separate wave files.
Here, we consider an important subset of basic units, namely SCV (stop
consonant vowel) units. Stop consonants are sounds produced by forming a complete
closure at some point along the vocal tract, building up pressure behind the closure, and
releasing the pressure by a sudden opening. These units have two distinct regions in the production
characteristics: the region just before the onset of the vowel (corresponds to consonant
region) and steady vowel region. Figure 5.1 shows the regions of significant events in the
production of the SCV unit /kha/ with Vowel Onset Point (VOP) at sample number 3549.
Table 5.1 shows the requirement of the Fourier-Bessel coefficient orders for emphasizing
the vowel and consonant regions of the speech utterances.

Region of speech signal Required FB coefficient orders
/a/ P1=12 to P2=48
/k/ P1=60 to P2=100
/t/ P1=80 to P2=120
/p/ P1=100 to P2=140

Table 5.1 FB coefficient orders for emphasizing the vowel and consonant parts of the
different speech utterances.

For illustration, first we consider the speech utterance /ke/, whose waveform is
shown in Figure 5.2(b). The spectrogram and the amplitude envelope estimates for the
vowel and consonant emphasized regions of /ke/ are shown in Figures 5.2(a), (c) and
(d), respectively. Similarly, the plots of the waveform, spectrogram, amplitude envelope
estimation for vowel and consonant region of the speech utterances /te/ and /pe/ are
shown in Figures 5.3 and 5.4, respectively. It is seen that the amplitude envelopes
corresponding to the vowel and consonant regions are emphasized by the proposed
method. This enables us to identify the beginning of the consonant (tc) and the beginning
of the vowel region (tv). We subtract tc from tv in order to obtain the voice onset time
(tvot), tvot = tv - tc. For testing we have considered 24 utterances from various speakers. Their
respective tv and tc and VOT values are shown in Table 5.2.
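The final computation is a single subtraction; using the first row of Table 5.2 as a worked example:

```python
def voice_onset_time(t_vowel_onset, t_burst):
    """VOT (s) as the interval from burst release to the onset of voicing."""
    return t_vowel_onset - t_burst

# Row 1 of Table 5.2 (Ke_F01_s1.wav): VOP = 0.8443 s, burst = 0.8227 s
vot = voice_onset_time(0.8443, 0.8227)    # 0.0216 s, i.e. 21.6 ms
```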

5.5 CONCLUSION
In this chapter, the Fourier-Bessel (FB) expansion and the amplitude and frequency
modulated (AM-FM) signal model have been proposed to detect the voice onset time
(VOT). The FB expansion is used to emphasize the vowel and consonant regions, which
results in narrow-band signals from the speech utterance. The DESA method has been
applied for estimating the amplitude envelope of the AM-FM signals due to its low
complexity and good time resolution.

VOICE ONSET TIME (VOT)

Fig 5.1 Regions of significant events in the production of the SCV unit /kha/ with Vowel
Onset Point (VOP) at sample number 3549.

Fig 5.2 Plot of the (a) spectrogram, (b) waveform, (c) AE estimation of the vowel (/e/)
emphasized part, and (d) AE estimation of the consonant (/k/) emphasized part for the
speech utterance /ke/.

Fig 5.3 Plot of the (a) spectrogram, (b) waveform, (c) AE estimation of the vowel (/e/)
emphasized part, and (d) AE estimation of the consonant (/t/) emphasized part for the
speech utterance /te/.

Fig 5.4 Plot of the (a) spectrogram, (b) waveform, (c) AE estimation of the vowel (/e/)
emphasized part, and (d) AE estimation of the consonant (/p/) emphasized part for the
speech utterance /pe/.


Fig 5.5 Plot of the bar graphs for utterances of /ke/, /te/, /pe/ for various male and
female speakers.

Waveform          VOP (sec)   Burst (sec)   VOT (sec)
Ke_F01_s1.wav 0.8443 0.8227 0.0216
Ke_F01_s2.wav 0.8168 0.8039 0.0132
Ke_F01_s3.wav 0.9633 0.9473 0.0160
Ke_F01_s4.wav 0.4611 0.4394 0.0217
Te_F01_s1.wav 0.6178 0.6029 0.0149
Te_F01_s2.wav 0.6979 0.6839 0.0140
Te_F01_s3.wav 0.7401 0.7236 0.0165
Te_F01_s4.wav 0.7212 0.7088 0.0124
Pe_F01_s1.wav 0.5377 0.5239 0.0138
Pe_F01_s2.wav 0.5308 0.5153 0.0155
Pe_F01_s3.wav 0.8250 0.8087 0.0163
Pe_F01_s4.wav 0.8239 0.8154 0.0085
Ke_M05_s1.wav 1.4540 1.4170 0.0370
Ke_M05_s2.wav 0.5560 0.5230 0.0330
Ke_M05_s3.wav 0.7480 0.7136 0.0344
Ke_M05_s4.wav 0.6574 0.6137 0.0437
Te_M05_s1.wav 0.6687 0.6502 0.0185
Te_M05_s2.wav 0.4704 0.4604 0.0100
Te_M05_s3.wav 0.6814 0.6548 0.0266
Te_M05_s4.wav 0.6013 0.5843 0.0170
Pe_M05_s1.wav 0.9851 0.9718 0.0133
Pe_M05_s2.wav 0.7262 0.7164 0.0098
Pe_M05_s3.wav 0.4899 0.4809 0.0090
Pe_M05_s4.wav 0.4341 0.4291 0.0050

Table 5.2 VOT values for female (F01) and male (M05) speakers for utterances /ke/, /te/
and /pe/ respectively.

CHAPTER 6

BESSEL FEATURES FOR DETECTION OF


GLOTTAL CLOSURE INSTANTS (GCI)

6.1 INTRODUCTION

The primary mode of excitation of the vocal tract system during speech production is due
to the vibration of the vocal folds. For voiced speech, the most significant excitation takes
place around the glottal closure instant (GCI), called the epoch. The performance of
many speech analysis and synthesis approaches depends on accurate estimation of GCIs.
In this chapter we propose to use a new method based on Fourier-Bessel (FB) expansion
and amplitude and frequency modulated (AM-FM) signal model for the detection of GCI
locations in speech utterances.

The organization of this chapter is as follows. In Section 6.2 the significance of epochs
is discussed. A review of the existing approaches for detection of epochs is provided in
Section 6.3. The detection of GCIs using the FB expansion and the AM-FM signal model
is discussed in Section 6.4. A study on the detection of GCIs for various categories of
sound units is provided in Section 6.5. The detection of glottal activity is analyzed in
Section 6.6. The final section summarizes the study of GCIs.

6.2 SIGNIFICANCE OF EPOCHS IN SPEECH ANALYSIS

Glottal closure instants are defined as the instants of significant excitation of the
vocal-tract system. Speech analysis consists of determining the frequency response of the
vocal-tract system and the glottal pulses representing the excitation source. Although the
source of excitation for voiced speech is the sequence of glottal pulses, the significant
excitation of the vocal-tract system within a glottal pulse can be considered to occur at
the GCI, called an epoch. Many speech analysis tasks depend on accurate estimation
of the location of the epoch within the glottal pulse. For example, knowledge of the
epoch location is useful for accurate estimation of the fundamental frequency (f0).

Analysis of speech signals in the closed-glottis regions provides an accurate estimate of
the frequency response of the supralaryngeal vocal-tract system [12] [13]. With the
knowledge of epochs, it may be possible to determine the characteristics of the voice
source by a careful analysis of the signal within a glottal pulse. The epochs can be used
as pitch markers for prosody manipulation, which is useful in applications like text-to-
speech synthesis, voice conversion and speech rate conversion [14] [15]. Knowledge of
the epoch locations may be used for estimating the time delay between speech signals
collected over a pair of spatially separated microphones [16]. The segmental signal-to-
noise ratio (SNR) of the speech signal is high in the regions around the epochs. Hence it
is possible to enhance the speech by exploiting the characteristics of speech signals
around the epochs [17]. It has been shown that the excitation features derived from the
regions around the epoch locations provide complementary speaker-specific information
to the existing spectral features.

The instants of significant excitation also play an important role in human perception.
It is because of the epochs in speech that a human being is able to perceive speech even
at a distance from the source, though the spectral components of the direct signal suffer
an attenuation of over 40 dB. The neural mechanism of human beings has the ability to
selectively process the robust regions around the epochs for extracting acoustic cues
even under degraded conditions. It is this ability to focus on micro-level events that may
be responsible for perceiving the speech information even under severe degradation
such as noise, reverberation, the presence of other speakers and channel variations.

6.3 REVIEW OF EXISTING APPROACHES

Several methods have been proposed for estimating the GCI from a speech signal. We
categorize these methods as follows: (a) methods based on short-time energy of speech
signal, (b) methods based on the predictability of an all-pole linear predictor, and (c)
methods based on the properties of group delay (GD), i.e., the negative-going zero
crossings of a GD measure derived from the speech signal. This classification is not
rigid, and one category can overlap with another based on the interpretation of the
method.

6.3.1 EPOCH EXTRACTION FROM SHORT-TIME ENERGY OF SPEECH SIGNAL

GCIs can be detected from the energy peaks in the waveform derived directly
from the speech signal [17, 18] or from the features in its time-frequency representation
[19, 20]. The epoch filter proposed in that work computes the Hilbert envelope (HE) of
the high-pass filtered composite signal to locate the epoch instants. It was shown that
the instant of excitation of the vocal tract could be identified precisely even for
continuous speech. However, this method is suitable for analyzing only clean speech
signals.

6.3.2 EPOCH EXTRACTION FROM LINEAR PREDICTION ANALYSIS

Many methods of epoch extraction rely on the discontinuities in a linear model of
speech production. An early approach used the predictability measure to detect epochs by
finding the maximum of the determinant of the auto covariance matrix [21, 22] of the
speech signal. The determinant of the matrix as a function of time increases sharply when
the speech segment covered by the data matrix contains an excitation, and it decreases
when the speech segment is excitation free. This method does not work well for some
vowels, particularly when many pulses occur in the determinant computed around the
GCI.

A method for unambiguous identification of epochs from the LP residual was
proposed in [23] [24]. In this work the amplitude envelope of the analytic signal of the
LP residual, referred to as the Hilbert Envelope (HE) of the Linear prediction (LP)
residual, is used for epoch extraction. Computation of the HE overcomes the effect due to
inaccurate phase compensation during inverse filtering. This method works well for clean
signals, but the performance degrades under noisy conditions.

6.3.3 EPOCH EXTRACTION BASED ON THE PROPERTIES OF GROUP-DELAY

A method for detecting the epochs in a speech signal using the properties of
minimum phase signals and the GD function was proposed in [25]. The method is based
on the fact that the average value of the GD function of a signal within an analysis frame
corresponds to the location of the significant excitation. An improved method based on
computation of the GD function directly from the speech signal was proposed in [26].

The Dynamic Programming Projected Phase-Slope Algorithm (DYPSA) for
automatic estimation of GCIs in speech is presented in [27, 28]. The candidates for GCIs
were obtained from the zero crossings of the phase-slope function derived from the energy
weighted GD, and were refined by employing a dynamic programming algorithm. The
performance of this method was better than the previous methods.

6.3.4 LIMITATIONS OF EXISTING APPROACHES

An epoch is an instantaneous property. However, in most of the methods discussed
(except the GD-function-based method) the epochs are detected by employing block
processing
approaches, which result in ambiguity about the precise location of the epochs. One of
the difficulties in using the prediction error approach is that it often contains effects due
to the resonances of the vocal-tract system. As a result, the excitation peaks become less
prominent in the residual signal, and hence unambiguous detection of the GCIs becomes
difficult. Most of the existing methods rely on LP residual signal derived by inverse
filtering the speech signal. Even with the GD-based methods, it is still difficult to
detect the epochs in the case of weakly voiced consonants, nasals and semi-vowels,
breathy voices and female speakers.

6.4 DETECTION OF GCI USING FOURIER-BESSEL (FB) EXPANSION
AND THE AM-FM SIGNAL MODEL

The method is based on the FB expansion and the AM-FM signal model. The inherent
filtering property of the FB expansion is used to weaken the effect of formants in the
speech utterances. The FB coefficients are unique for a given signal in the same way that
Fourier series coefficients are unique for a given signal. However, unlike the sinusoidal
basis functions in the Fourier transform, the Bessel functions are aperiodic and decay
over time. These features of the Bessel functions make the FB series expansion suitable
for analysis of non-stationary signals such as speech when compared to simple Fourier
transform [9-11]. The discrete energy separation algorithm (DESA) has been used to
estimate the amplitude envelope (AE) function of the AM-FM model due to its good
time resolution; this is advantageous because the resulting estimates are well localized
in the time domain.

6.4.1 FOURIER-BESSEL EXPANSION

The zero-order Fourier-Bessel series expansion of a signal x(t) considered over
some arbitrary interval (0, T) is expressed as

x(t) = sum_{m=1}^{M} C_m J0(lambda_m t / T),

where the coefficients C_m are given by

C_m = [2 / (T^2 J1(lambda_m)^2)] integral_{0}^{T} t x(t) J0(lambda_m t / T) dt,

and lambda_m, m = 1, 2, ..., M, are the ascending-order positive roots of J0(lambda) = 0,
with J0 and J1 denoting the zero-order and first-order Bessel functions of the first kind.

It has been shown that there is one-to-one correspondence between the frequency
content of the signal and the order (m) where the coefficient attains peak magnitude [10].
If the AM-FM components of formant of the speech signal are well separated in the
frequency domain, the speech signal components will be associated with various distinct
clusters of non-overlapping FB coefficients [11]. Each component of the speech signal
can be reconstructed by identifying and separating the corresponding FB coefficients.

6.4.2 AM-FM MODEL AND DESA METHOD

For both continuous- and discrete-time signals, Kaiser has defined a nonlinear
energy tracking operator [11]. For the discrete-time case, the energy operator for x[n] is
defined as

Psi[x[n]] = x[n]^2 - x[n-1]*x[n+1].

The energy operator can estimate the modulating signal, or more precisely its scaled
version, when either AM or FM alone is present [11]. When AM and FM are present
simultaneously, three algorithms are described in [11] to estimate the instantaneous
frequency Omega[n] and the amplitude envelope A[n] separately. The best among the
three, according to performance, is the discrete energy separation algorithm 1 (DESA-1).

6.4.3 APPROACH TO DETECT GCIS FROM SPEECH USING AMPLITUDE
ENVELOPE (AE)

In order to detect GCIs, we emphasize the low-frequency content of the speech
signal in the range of 0 to 300 Hz. This is achieved by using the FB expansion up to an
appropriate order m. Since the resultant band-limited signal is a narrow-band signal, it
can be modeled using the AM-FM signal model.
The advantage of choosing the 0 to 300 Hz band is that the characteristics of the time-
varying vocal-tract system will not affect the locations of the GCIs, because the
vocal-tract system has resonances at frequencies higher than 300 Hz. It has been
observed that the characteristics of the peaks due to GCIs can be extracted by
reconstructing the speech signal using the FB expansion of order m = 75. The DESA
technique is applied on this band-limited AM-FM signal of the speech utterance in order
to estimate the amplitude envelope (AE) function for detection of GCIs. The peaks in
the envelope of the AM-FM signal provide the locations of the GCIs.
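Given the AE function, GCI candidates can be picked as local maxima subject to a minimum spacing of roughly one pitch period. A hypothetical sketch using SciPy's `find_peaks`; the `max_f0` and `height` parameters are our assumptions, not values from the report:

```python
import numpy as np
from scipy.signal import find_peaks

def detect_gcis(ae, fs, max_f0=400.0):
    """Return GCI candidate times (s) as prominent peaks of the AE."""
    min_dist = max(1, int(fs / max_f0))             # >= one pitch period apart
    peaks, _ = find_peaks(ae, distance=min_dist,
                          height=0.3 * np.max(ae))  # ignore small ripples
    return peaks / fs

# Synthetic AE: Gaussian bumps at a 100 Hz epoch rate
fs = 8000
true_epochs = 0.005 + 0.010 * np.arange(5)
t = np.arange(int(0.06 * fs)) / fs
ae = sum(np.exp(-0.5 * ((t - te) / 0.001) ** 2) for te in true_epochs)
gcis = detect_gcis(ae, fs)
```

In practice `ae` would be the DESA-estimated envelope of the 0 to 300 Hz band-limited signal described above.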

6.5 STUDIES ON DETECTION OF GCIs FOR VARIOUS SPEECH
UTTERANCES

In this section we provide an analysis of the studies on epochs (GCIs) for both male
and female speakers. Figure 6.1 gives the speech signal of a male speaker and its
corresponding spectrogram. The band-limited AM-FM signal, reconstructed using the
FB expansion, is shown in the second waveform of the figure. The third waveform
shows the amplitude envelope (AE) estimated from the second waveform. The
differenced EGG signal is shown in the fourth waveform. It is seen that the peaks in the
amplitude envelope and the peaks in the differenced EGG signal agree in most cases.
Similar observations are noticed for a female speaker, shown in Figure 6.2.

This enables us to identify the locations of GCIs from the peaks of the amplitude
envelope of the band-limited AM-FM signal of the given speech utterance. From Figures
6.1 and 6.2, it can be noticed that the number of GCIs for the female speaker is greater
than for the male speaker over the same duration of speech. This is because the
fundamental frequency (the reciprocal of the interval between successive GCIs) of
female speakers is generally higher than that of male speakers.
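This relation gives the instantaneous fundamental frequency directly from consecutive epoch locations; for example, epochs every 8 ms correspond to 125 Hz, and a higher-pitched speaker packs proportionally more GCIs into the same interval. A minimal sketch:

```python
import numpy as np

def instantaneous_f0(gci_times):
    """f0 (Hz) as the reciprocal of each interval between successive GCIs."""
    periods = np.diff(np.asarray(gci_times, dtype=float))
    return 1.0 / periods

# Epochs every 8 ms -> 125 Hz throughout the segment
f0 = instantaneous_f0([0.000, 0.008, 0.016, 0.024])
```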

6.6 GLOTTAL ACTIVITY DETECTION

The strength of excitation helps in detecting the regions of the speech signal with
glottal activity. Regions where the strength of excitation is significant are referred to as
regions of vocal fold vibration (glottal activity); that is, they are considered the voiced
regions, where glottal activity is detected. In the absence of vocal fold vibration, the
vocal-tract system can be considered to be excited by random noise, as in the case of
frication.

The energy of a random noise excitation is distributed in both the time and frequency
domains, whereas the energy of an impulse is distributed uniformly in the frequency
domain but is highly concentrated in the time domain. As a result, the filtered signal
exhibits significantly lower amplitude for random noise excitation than for impulse-like
excitation. Hence the filtered signal can be used to detect the regions of glottal activity
(vocal fold vibration).
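This contrast can be operationalized by comparing the short-time amplitude of the filtered signal against a fraction of its utterance-level maximum. The frame length and threshold ratio below are our illustrative assumptions, not values from the report:

```python
import numpy as np

def glottal_activity(filtered, frame_len=100, threshold_ratio=0.1):
    """Flag frames of the band-limited signal as voiced (glottal activity).

    Impulse-like excitation concentrates energy in time, so voiced frames
    have a much larger mean absolute amplitude than noise-excited ones.
    """
    n_frames = len(filtered) // frame_len
    frames = np.abs(np.asarray(filtered[:n_frames * frame_len], dtype=float))
    level = frames.reshape(n_frames, frame_len).mean(axis=1)
    return level > threshold_ratio * level.max()

# First half: strong periodic signal (voiced); second half: weak noise
rng = np.random.default_rng(0)
sig = np.concatenate([np.sin(2 * np.pi * 0.05 * np.arange(1000)),
                      0.01 * rng.standard_normal(1000)])
voiced = glottal_activity(sig)
```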

6.7 CONCLUSIONS

The primary mode of excitation of the vocal tract system during speech
production is due to the vibration of the vocal folds. For voiced speech, the most
significant excitation takes place around the glottal closure instant (GCI). The instants of
significant excitation play an important role in human perception. The studies on GCIs
help in identifying the various regions in continuous speech. The extracted GCIs help in
identifying the fundamental frequency (pitch) of the speaker. The pitch is a feature
unique to a speaker.

GLOTTAL CLOSURE INSTANTS OF A MALE SPEAKER

Fig. 6.1 Epoch (or GCI) extraction of a male speaker using Fourier-Bessel (FB)
expansion and AM-FM model

GLOTTAL CLOSURE INSTANTS OF A FEMALE SPEAKER

Fig 6.2 Epoch (or GCI) extraction of a female speaker using Fourier-Bessel (FB)
expansion and AM-FM model

CHAPTER 7
SUMMARY AND CONCLUSIONS

7.1 SUMMARY OF THE WORK


The Glottal Closure Instant (GCI), also known as the epoch, is one of the important
events in the speech production mechanism. The primary mode of excitation of the
vocal-tract system during speech production is due to the vibration of the
vocal folds. For voiced speech, the most significant excitation takes place around the
GCIs. The rate of glottal closure is referred to as strength of the epoch. The GCIs and the
strength of the epochs form important features of the excitation source. In this work, we
propose to use a method based on Fourier-Bessel expansion and AM-FM model to detect
the regions of glottal activity and to estimate the strength of excitation in each glottal
cycle.
The influence of the vocal-tract system is relatively small at low frequencies, as the
vocal-tract system has resonances at much higher frequencies. Hence, we use the
Fourier-Bessel expansion followed by the DESA method to extract the epoch locations
and their strengths. The method involves subjecting the speech signal to the Bessel
transformation, using the required range of coefficients to band-limit the signal, and
then applying the discrete energy separation algorithm (DESA) to obtain the amplitude
envelope (AE), whose peaks give the epoch locations and enable us to distinguish
between male and female speakers.
The excitation source information is used to analyze the manner of articulation
(MOA) of stop consonants. Two of the excitation source features utilized in this study are
the filtered speech signal and the normalized error. The filtered speech signal is used to
characterize the excitation information during vocal-fold vibration. The normalized error
derived from linear prediction (LP) analysis is used to highlight the regions of noisy
excitation caused by a rush of air through the open glottis during closure release and
aspiration. It is observed from the studies that these two features jointly highlight
important events, such as the onset of voicing and the instant of closure release, in stop
consonants. Using these two excitation source features, the voice onset time and the burst
duration of stop consonants were measured.

Using the epoch locations as anchor points within each glottal cycle, a method to
estimate the instantaneous fundamental frequency of voiced speech segments is
presented. The fundamental frequency is estimated as the reciprocal of the interval
between two successive epoch locations derived from the speech signal. Since the
method is based on the point property of epoch and does not involve any block
processing, it provides the cycle-to-cycle variations in pitch during voicing, i.e. the
instantaneous fundamental frequency. Errors due to spurious zero crossings in weakly
voiced regions are corrected using the filtered Hilbert envelope (HE) of the speech
signal. In this work, analysis of the pitch frequency for various subjects in different
environmental conditions is carried out.

The ability to detect VOT in speech is a challenging problem because it combines
temporal and frequency structure over a very short duration, and no fully successful
automatic VOT detection scheme has yet been published in the literature. In this study,
the amplitude modulation component (AMC) of the Teager energy operator (TEO), a
sub-band-frequency-based nonlinear energy tracking operator, was employed to detect
the VOR and to estimate the VOT. The proposed algorithm was applied to the problem
of accent classification using American English, Chinese and Indian accented speakers.
Using 546 tokens, consisting of three words from 24 speakers, the average mismatch
between the automatic and hand-labeled VOT was 0.735 ms (among the 409 tokens
which were detected with less than 10 percent error). This represents a 1.15%
mismatch. It was also shown that the average VOTs differ among the three language
groups, making VOT a good feature for accent classification.

CHAPTER 8
REFERENCES

2. A. Papoulis, Signal Analysis, McGraw-Hill, New York, 1977.
3. R. Bracewell, The Fourier Transform and its Applications, McGraw-Hill, New
York, 1964.
4. Y. L. Luke, Integrals of Bessel Functions, McGraw-Hill, New York, 1962.
5. A. S. Abramson and L. Lisker, "Voice onset time in stop consonants: acoustic
analysis and synthesis", in Proc. 5th International Congress of Phonetic Sciences,
1965.
6. R. B. Monsen, "Normal and reduced phonological space: the study of vowels by a
deaf adolescent", Journal of Phonetics, vol. 4, 1976.
7. J. Jiang, M. Chen, and A. Alwan, "On the perception of voicing in syllable-initial
plosives in noise", Journal of the Acoustical Society of America, vol. 119, 2006.
8. L. Lisker and A. S. Abramson, "A cross-language study of voicing in initial stops:
acoustical measurements", Word, vol. 20, 1964.
9. L. Lisker and A. S. Abramson, "Some effects of context on voice onset time in
English stops", Language and Speech, vol. 10, 1967.
10. S. Das and J. H. L. Hansen, "Detection of voice onset time (VOT) for unvoiced
stops (/p/, /t/, /k/) using the Teager energy operator (TEO) for automatic detection
of accented English", in Proc. 6th Nordic Signal Processing Symposium, 2004.
11. P. Ladefoged, A Course in Phonetics, third edition, Harcourt Brace College
Publishers, Fort Worth, 1993.
12. R. B. Pachori and P. Sircar, "Analysis of multicomponent AM-FM signals using
FB-DESA method", Digital Signal Processing, vol. 20, 2010.
13. D. Veeneman and S. BeMent, "Automatic glottal inverse filtering from speech
and electroglottographic signals", IEEE Trans. Signal Processing, vol. 33, 1985.
14. B. Yegnanarayana and R. N. J. Veldhuis, "Extraction of vocal-tract system
characteristics from speech signals", IEEE Trans. Speech and Audio Processing,
vol. 6, 1998.
15. C. Hamon, E. Moulines, and F. Charpentier, "A diphone synthesis system based
on time-domain prosodic modifications of speech", in Proc. IEEE Int. Conf.
Acoustics, Speech, and Signal Processing, May 1989.
16. K. S. Rao and B. Yegnanarayana, "Prosody modification using instants of
significant excitation", IEEE Trans. Audio, Speech, and Language Processing,
May 2006.
17. B. Yegnanarayana, S. R. M. Prasanna, R. Duraiswami, and D. Zotkin, "Processing
of reverberant speech for time-delay estimation", IEEE Trans. Speech and Audio
Processing, 2005.
18. B. Yegnanarayana and P. S. Murty, "Enhancement of reverberant speech using LP
residual signal", IEEE Trans. Speech and Audio Processing, vol. 8, 2000.
19. Y. K. C. Ma and L. F. Willems, "A Frobenius norm approach to glottal closure
detection from the speech signal", IEEE Trans. Speech and Audio Processing,
vol. 2, 1994.
20. C. R. Jankowski Jr., T. F. Quatieri, and D. A. Reynolds, "Measuring fine structure
in speech: application to speaker identification", in Proc. IEEE Int. Conf.
Acoustics, Speech, and Signal Processing, 1995.
25. K. Sri Rama Murthy and B. Yegnanarayana, "Epoch extraction from speech
signals", IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 8,
pp. 1602-1613, Nov. 2008.

SOURCE CODE

Gci_female.m
clc;

clear all;

close all;

inputfile='30401.wav'
eggfile='30401.egg'
samplesrange=[89601:92800];
fborderleft=1;
fborderright=75;

MM=length(samplesrange);
%computation of roots of bessel function Jo(x)
if exist('alfa') == 0
x=2;
alfa=zeros(1,MM);
for i=1:MM
ex=1;
while abs(ex)>.00001
ex=-besselj(0,x)/besselj(1,x);
x=x-ex;
end
alfa(i)=x;
fprintf('Root # %g = %8.5f ex = %9.6f \n',i,x,ex)
x=x+pi;
end
end

[s,fs]=wavread(inputfile);
s=s;

S=s';

% S=diff(S);

S=-S(samplesrange);

ax(1)=subplot(4,1,1);

% plot(samplesrange, S);
plot(samplesrange/fs, S);

axis tight
grid on

% spgramsvg('ka_F01_S2.wav', 320, 160, 8000)

s=S;
N=length(s);
nb=1:N;

a=N;

for m1=1:MM
a3(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));
end

%reconstruction of the signal

for mm=1:N
g1_r=[(alfa(fborderleft:fborderright))/a ];
F1_r(mm)=sum(a3(fborderleft:fborderright).*besselj(0,g1_r*mm));
end

y1=F1_r;

ax(2)=subplot(4,1,2);
% plot(samplesrange, y1)
plot((samplesrange)/fs, y1)
axis tight
grid on
% y1=y1-mean(y1);

ax(3)=subplot(4,1,3);

[egg, fs]=wavread(eggfile);

vg=-diff(egg);

% plot(samplesrange,vg(samplesrange))
plot((samplesrange)/fs,vg(samplesrange))

axis tight
grid on

for l=2:N-1
xx=y1;
si(l)=xx(l)^2-xx(l-1)*xx(l+1);
end

for m=2:N-1
yy(m)=y1(m)-y1(m-1);
end
for m=2:N-2
siy(m)=yy(m)^2-yy(m-1).*yy(m+1);
end

for mm=2:N-2
    if siy(mm)<0
        yy1(mm)=yy1(mm-1);   % hold previous value when Teager energy goes negative
    else
        yy1(mm)=siy(mm);
    end
end
siy=yy1;
for m1=2:N-3
omega(m1)=acos(1-((siy(m1)+siy(m1+1))/(4*si(m1))));
end
for mm=2:N-3
    if imag(omega(mm))==0
        yy1(mm)=omega(mm);
    else
        yy1(mm)=yy1(mm-1);   % hold previous real-valued estimate
    end
end
omega=yy1;

for m1=2:N-3
amp(m1)=sqrt(((si(m1))/(1-(1-((siy(m1)+siy(m1+1))/(4*si(m1))))^2)));
end
for mm=2:N-3
    if imag(amp(mm))==0
        yy1(mm)=amp(mm);
    else
        yy1(mm)=yy1(mm-1);   % hold previous real-valued estimate
    end
end
[ca,cd]=dwt(yy1,'db2');

yy1=idwt(ca,[],'db2');

amp1=yy1;

amp1(end+1:end+2)=amp1(end);

% X2=overlapadd(omega1,W,INC);
ax(4)=subplot(4,1,4);

% plot(samplesrange,amp1)
plot((samplesrange)/fs,amp1)
axis tight
grid on
% ax(4)=subplot(4,1,4);
%
% plot((1:length(X2))/32000,X2)

linkaxes(ax,'x');

Gci_male.m

clc;

clear all;

close all;

inputfile='10501.wav'
eggfile='10501.egg'
samplesrange=[76001:79200];
fborderleft=1;
fborderright=75;

MM=length(samplesrange);
%computation of roots of bessel function Jo(x)
if exist('alfa') == 0
x=2;
alfa=zeros(1,MM);
for i=1:MM
ex=1;
while abs(ex)>.00001
ex=-besselj(0,x)/besselj(1,x);
x=x-ex;
end
alfa(i)=x;
%fprintf('Root # %g = %8.5f ex = %9.6f \n',i,x,ex)
x=x+pi;
end
end

[s,fs]=wavread(inputfile);
s=s;

S=s';

% S=diff(S);

S=-S(samplesrange);

ax(1)=subplot(4,1,1);

% plot(samplesrange, S);
plot(samplesrange/fs, S);

axis tight
grid on

% spgramsvg('ka_F01_S2.wav', 320, 160, 8000)

s=S;
N=length(s);
nb=1:N;

a=N;

for m1=1:MM
a3(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));
end

%reconstruction of the signal

for mm=1:N
g1_r=[(alfa(fborderleft:fborderright))/a ];
F1_r(mm)=sum(a3(fborderleft:fborderright).*besselj(0,g1_r*mm));

end

y1=F1_r;

ax(2)=subplot(4,1,2);
% plot(samplesrange, y1)
plot((samplesrange)/fs, y1)
axis tight
grid on
% y1=y1-mean(y1);

ax(3)=subplot(4,1,3);

[egg, fs]=wavread(eggfile);

vg=-diff(egg);

% plot(samplesrange,vg(samplesrange))
plot((samplesrange)/fs,vg(samplesrange))
axis tight
grid on

for l=2:N-1
xx=y1;
si(l)=xx(l)^2-xx(l-1)*xx(l+1);
end

for m=2:N-1
yy(m)=y1(m)-y1(m-1);
end
for m=2:N-2
siy(m)=yy(m)^2-yy(m-1).*yy(m+1);
end

for mm=2:N-2
if siy(mm)<0
yy1(mm)=siy(mm-1);
yy1(mm)=yy1(mm-1);
else
yy1(mm)=siy(mm);
end
end
siy=yy1;
for m1=2:N-3
omega(m1)=acos(1-((siy(m1)+siy(m1+1))/(4*si(m1))));
end
for mm=2:N-3
if imag(omega(mm))==0
yy1(mm)=omega(mm);

else
yy1(mm)=omega(mm-1);
yy1(mm)=yy1(mm-1);
end
end
omega=yy1;

for m1=2:N-3
amp(m1)=sqrt(((si(m1))/(1-(1-((siy(m1)+siy(m1+1))/(4*si(m1))))^2)));
end
for mm=2:N-3
if imag(amp(mm))==0

yy1(mm)=amp(mm);
else
yy1(mm)=amp(mm-1);
yy1(mm)=yy1(mm-1);
end
end

[ca,cd]=dwt(yy1,'db2');

yy1=idwt(ca,[],'db2');

amp1=yy1;

amp1(end+1:end+2)=amp1(end);

% X2=overlapadd(omega1,W,INC);
ax(4)=subplot(4,1,4);

% plot(samplesrange,amp1)
plot((samplesrange)/fs,amp1)
axis tight
grid on
% ax(4)=subplot(4,1,4);
%
% plot((1:length(X2))/32000,X2)

linkaxes(ax,'x');
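The si/siy/omega/amp loops above are a discrete energy-separation (DESA-1-style) estimator: the Teager energy Psi[x](n) = x(n)^2 - x(n-1)x(n+1) of the band-limited signal and of its first difference y(n) = x(n) - x(n-1) give the instantaneous frequency omega(n) = acos(1 - (Psi[y](n) + Psi[y](n+1)) / (4 Psi[x](n))) and the corresponding amplitude. For a pure cosine these estimates are exact, which makes a compact check; a stdlib-only Python sketch:

```python
import math

# Teager energy operator, as in the si/siy loops above
def teager(s, n):
    return s[n]**2 - s[n-1]*s[n+1]

# pure cosine: amplitude A, digital frequency Omega (rad/sample)
A, Omega, phi = 0.8, 0.3, 1.1
x = [A*math.cos(Omega*n + phi) for n in range(20)]
y = [x[n] - x[n-1] for n in range(1, len(x))]   # first difference

n = 5
si  = teager(x, n)
siy = teager(y, n) + teager(y, n+1)
omega_est = math.acos(1 - siy/(4*si))                # frequency estimate
amp_est   = math.sqrt(si/(1 - (1 - siy/(4*si))**2))  # amplitude estimate
```

For a constant-frequency sinusoid, Psi[x] = A^2 sin^2(Omega) and Psi[y] = 4 A^2 sin^2(Omega/2) sin^2(Omega), so the acos argument reduces to cos(Omega) and both estimates are exact; the guard loops in the listing only matter for real speech, where negative Teager energies occur.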

---------------------------------------------------------------------------------------------------------
Vot.m
clc;

clear all;

close all;
inputfile='ku_F01_S4.wav';
MM=320;
%MM=1100;
%computation of roots of bessel function Jo(x)
if exist('alfa') == 0
x=2;
alfa=zeros(1,MM);
for i=1:MM
ex=1;
while abs(ex)>.00001
ex=-besselj(0,x)/besselj(1,x);
x=x-ex;

end
alfa(i)=x;
% fprintf('Root # %g = %8.5f ex = %9.6f \n',i,x,ex)
x=x+pi;
end
end
[s,fs]=wavread(inputfile);
% yy=resample(s,1,2);
%
% s=yy(9500:10600);
S=s';
INC=160;
%INC=550;
NW=INC*2;
W=sqrt(hamming(NW+1));
W(end)=[];
F=enframe(S,W,INC);

[r,c]=size(F);

for i=1:r

s1=F(i,:);
N=length(s1);
nb=1:N;

a=N;

for m1=1:MM
a3(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s1.*besselj(0,alfa(m1)/a*nb));
end

%reconstruction of the signal

for mm=1:N
g1_r=[(alfa(12:48))/a ];
F1_r(mm)=sum(a3(12:48).*besselj(0,g1_r*mm));
% g1_r=[(alfa(20:85))/a ];
% F1_r(mm)=sum(a3(20:85).*besselj(0,g1_r*mm));
% g1_r=[(alfa)/a ];
% F1_r(mm)=sum(a3.*besselj(0,g1_r*mm));

end

y1=F1_r;

for l=2:N-1
xx=y1;
si(l)=xx(l)^2-xx(l-1)*xx(l+1);
end

for m=2:N-1
yy(m)=y1(m)-y1(m-1);
end
for m=2:N-2
siy(m)=yy(m)^2-yy(m-1).*yy(m+1);
end

for mm=2:N-2
if siy(mm)<0
yy1(mm)=siy(mm-1);
yy1(mm)=yy1(mm-1);
else
yy1(mm)=siy(mm);
end
end
siy=yy1;
for m1=2:N-3
omega(m1)=acos(1-((siy(m1)+siy(m1+1))/(4*si(m1))));
end
for mm=2:N-3
if imag(omega(mm))==0
yy1(mm)=omega(mm);

else
yy1(mm)=omega(mm-1);
yy1(mm)=yy1(mm-1);
end
end
omega=yy1;
for m1=2:N-3
amp(m1)=sqrt(((si(m1))/(1-(1-((siy(m1)+siy(m1+1))/(4*si(m1))))^2)));
end
for mm=2:N-3
if imag(amp(mm))==0
yy1(mm)=amp(mm);
else
yy1(mm)=amp(mm-1);
yy1(mm)=yy1(mm-1);
end
end
[ca,cd]=dwt(yy1,'db2');

yy1=idwt(ca,[],'db2');

amp1(i,:)=yy1;
end

amp1(:,c-1)=amp1(:,c-2);

amp1(:,c)=amp1(:,c-1);

Xvowel=overlapadd(amp1,W,INC);

%%%%%

% [s,fs]=wavread(inputfile);
% % yy=resample(s,1,2);
% % fs=fs/2;
% % s=yy(9500:10600);
% S=s';

INC=160;
NW=INC*2;
W=sqrt(hamming(NW+1));
W(end)=[];
F=enframe(S,W,INC);

[r,c]=size(F);

for i=1:r

s2=F(i,:);
N=length(s2);
nb=1:N;

a=N;

for m1=1:MM
a3(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s2.*besselj(0,alfa(m1)/a*nb));
end

%reconstruction of the signal

for mm=1:N
% g1_r=[(alfa(60:100))/a ];
% F1_r(mm)=sum(a3(60:100).*besselj(0,g1_r*mm));
g1_r=[(alfa(60:100))/a ];
F1_r(mm)=sum(a3(60:100).*besselj(0,g1_r*mm));
end

y1=F1_r;
for l=2:N-1
xx=y1;
si(l)=xx(l)^2-xx(l-1)*xx(l+1);
end

for m=2:N-1
yy(m)=y1(m)-y1(m-1);
end
for m=2:N-2
siy(m)=yy(m)^2-yy(m-1).*yy(m+1);
end

for mm=2:N-2
if siy(mm)<0
yy1(mm)=siy(mm-1);
yy1(mm)=yy1(mm-1);
else
yy1(mm)=siy(mm);
end
end
siy=yy1;
for m1=2:N-3
omega(m1)=acos(1-((siy(m1)+siy(m1+1))/(4*si(m1))));
end
for mm=2:N-3
if imag(omega(mm))==0
yy1(mm)=omega(mm);

else
yy1(mm)=omega(mm-1);
yy1(mm)=yy1(mm-1);
end
end
omega=yy1;
for m1=2:N-3
amp(m1)=sqrt(((si(m1))/(1-(1-((siy(m1)+siy(m1+1))/(4*si(m1))))^2)));
end
for mm=2:N-3
if imag(amp(mm))==0
yy1(mm)=amp(mm);

else
yy1(mm)=amp(mm-1);
yy1(mm)=yy1(mm-1);
end
end
[ca,cd]=dwt(yy1,'db2');

yy1=idwt(ca,[],'db2');

amp2(i,:)=yy1;
end

amp2(:,c-1)=amp2(:,c-2);

amp2(:,c)=amp2(:,c-1);

Xc=overlapadd(amp2,W,INC);

fig = figure;

ax(1)=subplot(4,1,1);
%specgram(s,320,16000,320,160) ; colormap(1-gray);
svlSpgram(s,2^8,fs,10*fs/1000,9*fs/1000,30,4000);
%specgram(s,1100,8000,1100,550)

ax(2)=subplot(4,1,2);
plot((1:length(s))/16000, s);grid;axis tight;
%text(1.81,0.04,'(b)','fontweight','bold');
%plot((1:length(s))/8000, s);
%plot(s);grid;axis tight
% spgramsvg('ka_F01_S2.wav', 320, 160, 8000)
ax(3)=subplot(4,1,3);
plot((1:length(Xvowel))/16000,Xvowel);grid;axis tight;
%text(1.81,0.15,'(c)','fontweight','bold');

ax(4)=subplot(4,1,4);
plot((1:length(Xc))/16000,Xc);grid;axis tight;xlabel('Time (sec)');
%text(1.81,0.004,'(d)','fontweight','bold');

% ax(5)=subplot(5,1,5);
% plot((1:length(Xc)-1)/16000,diff(Xc));grid;axis tight;xlabel('Time (sec)');
%text(1.81,0.004,'(d)','fontweight','bold');

linkaxes(ax,'x');

alltext=findall(fig,'type','text');

allaxes=findall(fig,'type','axes');
allfont=[alltext(:);allaxes(:)];
set(allfont,'fontsize',16);
Overlapadd.m
function [x,zo]=overlapadd(f,win,inc)
%OVERLAPADD join overlapping frames together X=(F,WIN,INC)
%
% Inputs: F(NR,NW) contains the frames to be added together, one
% frame per row.
% WIN(NW) contains a window function to multiply each frame.
% WIN may be omitted to use a default rectangular window
% If processing the input in chunks, WIN should be replaced by
% ZI on the second and subsequent calls where ZI is the saved
% output state from the previous call.
% INC gives the time increment (in samples) between
% successive frames [default = NW].
%
% Outputs: X(N,1) is the output signal. The number of output samples is N=NW+(NR-1)*INC.
% ZO Contains the saved state to allow a long signal
% to be processed in chunks. In this case X will contain only N=NR*INC
% output samples.
%
% Example of frame-based processing:
% INC=20                            % set frame increment
% NW=INC*2                          % oversample by a factor of 2 (4 is also often used)
% S=cos((0:NW*7)*6*pi/NW);          % example input signal
% W=sqrt(hamming(NW+1)); W(end)=[]; % sqrt hamming window of period NW
% F=enframe(S,W,INC);               % split into frames
% ... process frames ...
% X=overlapadd(F,W,INC);            % reconstitute the time waveform (omit "X=" to plot waveform)

% Copyright (C) Mike Brookes 2009


% Version: $Id: overlapadd.m,v 1.2 2009/06/08 16:21:49 dmb Exp $
%
% VOICEBOX is a MATLAB toolbox for speech processing.
% Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This program is free software; you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation; either version 2 of the License, or

% (at your option) any later version.
%
% This program is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You can obtain a copy of the GNU General Public License from
% http://www.gnu.org/copyleft/gpl.html or by writing to
% Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[nr,nf]=size(f); % number of frames and frame length
if nargin<2
win=nf; % default increment
end
if isstruct(win)
w=win.w;
if ~numel(w) && length(w)~=nf
error('window length does not match frames size');
end
inc=win.inc;
xx=win.xx;
else
if nargin<3
inc=nf;
end
if numel(win)==1 && win==fix(win) && nargin<3 % win has been omitted
inc=win;
w=[];
else
w=win(:).';
if length(w)~=nf
error('window length does not match frames size');
end
if all(w==1)
w=[];
end
end
xx=[]; % partial output from previous call is null
end
nb=ceil(nf/inc); % number of overlap buffers
no=nf+(nr-1)*inc; % buffer length
z=zeros(no,nb); % space for overlapped output speech
if numel(w)
z(repmat(1:nf,nr,1)+repmat((0:nr-1)'*inc+rem((0:nr-1)',nb)*no,1,nf))=f.*repmat(w,nr,1);
else

z(repmat(1:nf,nr,1)+repmat((0:nr-1)'*inc+rem((0:nr-1)',nb)*no,1,nf))=f;
end
x=sum(z,2);
if ~isempty(xx)
x(1:length(xx))=x(1:length(xx))+xx; % add on leftovers from previous call
end
if nargout>1 % check if we want to preserve the state
mo=inc*nr; % completed output samples
if no<mo
x(mo,1)=0;
zo.xx=[];
else
zo.xx=x(mo+1:end);
zo.w=w;
zo.inc=inc;
x=x(1:mo);
end
elseif ~nargout
if isempty(xx)
k1=nf-inc; % dubious samples at start
else
k1=0;
end
k2=nf-inc; % dubious samples at end
plot(1+(0:nr-1)*inc,x(1+(0:nr-1)*inc),'>r',nf+(0:nr-1)*inc,x(nf+(0:nr-1)*inc),'<r', ...
1:k1+1,x(1:k1+1),':b',k1+1:no-k2,x(k1+1:end-k2),'-b',no-k2:no,x(no-k2:no),':b');
xlabel('Sample Number');
title(sprintf('%d frames of %d samples with %.0f%% overlap = %d samples',nr,nf,100*(1-inc/nf),no));
end
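overlapadd pairs with enframe below; the scripts build W=sqrt(hamming(NW+1)) and drop the last sample (a periodic sqrt-Hamming) at 50% overlap, so the analysis and synthesis windows multiply to w^2, and the overlapped sum of w^2 is a constant 1.08 (from the Hamming coefficients 0.54/0.46). Reconstruction is therefore exact up to that fixed gain. A quick constant-overlap-add check in Python:

```python
import math

NW, INC = 32, 16   # NW = INC*2, i.e. 50% overlap, as in the scripts
# periodic Hamming window: hamming(NW+1) with the last sample dropped
w2 = [0.54 - 0.46*math.cos(2*math.pi*n/NW) for n in range(NW)]  # analysis*synthesis = w^2
# constant-overlap-add: w^2 summed at hop INC should be flat
cola = [w2[n] + w2[n + INC] for n in range(INC)]
```

Every entry of `cola` equals 1.08, since the two cosine terms are half a period apart and cancel; this is why the frame-modify-overlapadd chain in Vot.m reconstructs the waveform without frame-rate ripple.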

Enframe.m
function f=enframe(x,win,inc)
%ENFRAME split signal up into (overlapping) frames: one per row. F=(X,WIN,INC)
%
% F = ENFRAME(X,LEN) splits the vector X(:) up into
% frames. Each frame is of length LEN and occupies
% one row of the output matrix. The last few frames of X
% will be ignored if its length is not divisible by LEN.
% It is an error if X is shorter than LEN.
%
% F = ENFRAME(X,LEN,INC) has frames beginning at increments of INC
% The centre of frame I is X((I-1)*INC+(LEN+1)/2) for I=1,2,...
% The number of frames is fix((length(X)-LEN+INC)/INC)
%
% F = ENFRAME(X,WINDOW) or ENFRAME(X,WINDOW,INC) multiplies
% each frame by WINDOW(:)
%
% Example of frame-based processing:
% INC=20                            % set frame increment
% NW=INC*2                          % oversample by a factor of 2 (4 is also often used)
% S=cos((0:NW*7)*6*pi/NW);          % example input signal
% W=sqrt(hamming(NW+1)); W(end)=[]; % sqrt hamming window of period NW
% F=enframe(S,W,INC);               % split into frames
% ... process frames ...
% X=overlapadd(F,W,INC);            % reconstitute the time waveform (omit "X=" to plot waveform)

% Copyright (C) Mike Brookes 1997


% Version: $Id: enframe.m,v 1.6 2009/06/08 16:21:42 dmb Exp $
%
% VOICEBOX is a MATLAB toolbox for speech processing.
% Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This program is free software; you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation; either version 2 of the License, or

% (at your option) any later version.
%
% This program is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You can obtain a copy of the GNU General Public License from
% http://www.gnu.org/copyleft/gpl.html or by writing to
% Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

nx=length(x(:));
nwin=length(win);
if (nwin == 1)
len = win;
else
len = nwin;
end
if (nargin < 3)
inc = len;
end
nf = fix((nx-len+inc)/inc);
f=zeros(nf,len);
indf= inc*(0:(nf-1)).';
inds = (1:len);
f(:) = x(indf(:,ones(1,len))+inds(ones(nf,1),:));
if (nwin > 1)
w = win(:)';
f = f .* w(ones(nf,1),:);
end

SvlSpgram.m
function [X, f_r, t_r] = svlSpgram(x, n, Fs, window, overlap, clipdB, maxfreq)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Usage:
% [X [, f [, t]]] = svlSpgram(x [, n [, Fs [, window [, overlap[, clipdB [, maxfreq ]]]]]])
%
% Generate a spectrogram for the signal. This chops the signal into
% overlapping slices, windows each slice and applies a Fourier
% transform to determine the frequency components at that slice.
%
% INPUT:
% x: signal or vector of samples
% n: size of fourier transform window, or [] for default=256
% Fs: sample rate, or [] for default=2 Hz
% window: shape of the fourier transform window,
% or [] for default=hanning(n)
% Note: window length can be specified instead, in which case
% window=hanning(length)
% overlap: overlap with previous window,
% or [] for default=length(window)/2
% clipdB: Clip or cut off any spectral component more than 'clipdB'
% below the peak spectral strength.(default = 35 dB)
% maxfreq: Maximum freq to be plotted in the spectrogram (default = Fs/2)
%
% OUTPUT:
% X: STFT of the signal x
% f: The frequency values corresponding to the STFT values
% t: Time instants at which the STFT values are computed
%
% Example
%--------
% x = chirp([0:0.001:2],0,2,500); # freq. sweep from 0-500 over 2 sec.
% Fs=1000; # sampled every 0.001 sec so rate is 1 kHz
% step=ceil(20*Fs/1000); # one spectral slice every 20 ms
% window=ceil(100*Fs/1000); # 100 ms data window
% svlSpgram(x, 2^nextpow2(window), Fs, window, window-step);
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%

% Original version by Paul Kienzle; modified by Sean Fulop March 2002
%
% Customized by Anand and then by Dhananjaya
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

if nargin < 1 | nargin > 7


error('usage: ([Y [, f [, t]]] = svlSpgram(x [, n [, Fs [, window [, overlap[, clipdB]]]]]) ')
end

% assign defaults
if nargin < 2 | isempty(n), n = min(256, length(x)); end
if nargin < 3 | isempty(Fs), Fs = 2; end
if nargin < 4 | isempty(window), window = hanning(n); end
if nargin < 5 | isempty(overlap), overlap = length(window)/2; end
if nargin < 6 | isempty(clipdB), clipdB = 35; end % clip anything below 35 dB
if nargin < 7 | isempty(maxfreq), maxfreq = Fs/2; end

% if only the window length is given, generate hanning window


if length(window) == 1, window = hanning(window); end

% should be extended to accept a vector of frequencies at which to


% evaluate the fourier transform (via filterbank or chirp
% z-transform)
if length(n)>1,
error('spgram doesn''t handle frequency vectors yet')
end

% compute window offsets


win_size = length(window);
if (win_size > n)
n = win_size;
warning('spgram fft size adjusted---must be at least as long as frame')
end
step = win_size - overlap;

% build matrix of windowed data slices


S = buffer(x,win_size,overlap,'nodelay');
W = window(:,ones(1,size(S,2)));
S = S .* W;
offset = [0:size(S,2)-1]*step;

% compute fourier transform


STFT = fft(S,n);

% extract the positive frequency components


if rem(n,2)==1

ret_n = (n+1)/2;
else
ret_n = n/2;
end

STFT = STFT(1:ret_n, :);

f = [0:ret_n-1]*Fs/n;
t = offset/Fs;

if nargout>1, f_r = f; end


if nargout>2, t_r = t; end

%maxfreq = Fs/2;
STFTmag = abs(STFT(2:n*maxfreq/Fs,:)); % magnitude of STFT
STFTmag = STFTmag/max(max(STFTmag)); % normalize so max magnitude will be 0 dB
STFTmag = max(STFTmag, 10^(-clipdB/10)); % clip everything below -35 dB

if nargout>0, X = STFTmag; end

% display as an indexed grayscale image showing the log magnitude of the STFT,
% i.e. a spectrogram; the colormap is flipped to invert the default setting,
% in which white is most intense and black least---in speech science we want
% the opposite of that.
% imagesc(t, f(2:n*maxfreq/Fs), 20*log10(STFTmag)); axis xy; colormap(flipud(gray));

if nargout==0
if Fs<2000
imagesc(t, f(2:n*maxfreq/Fs), 20*log10(STFTmag));
%pcolor(t, f(2:n*maxfreq/Fs), 20*log10(STFTmag));
ylabel('Hz');
else
%imagesc(t, f(2:n*maxfreq/Fs)/1000, 20*log10(STFTmag));
pcolor(t, f(2:n*maxfreq/Fs)/1000, 20*log10(STFTmag));
ylabel('kHz');
end
axis xy;
colormap(flipud(gray));
shading interp;
%xlabel('Time (seconds)');
end
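The display path normalizes the magnitudes to a 0 dB peak and clips everything below a floor before taking 20*log10. A sketch of that clipping in Python; note this sketch uses an amplitude floor of 10^(-clipdB/20) so the floor lands at exactly -clipdB on a 20*log10 scale (the listing above divides the exponent by 10, which on the same scale places the floor at -2*clipdB):

```python
import math

clipdB = 35
mags = [1.0, 0.2, 0.003, 1e-6]           # example spectral magnitudes
peak = max(mags)
norm = [m/peak for m in mags]            # 0 dB at the peak
floor = 10**(-clipdB/20)                 # amplitude floor for a -clipdB display
clipped = [max(m, floor) for m in norm]
db = [20*math.log10(m) for m in clipped] # values now lie in [-clipdB, 0]
```

Clipping before the log keeps near-silent bins from dominating the grayscale range of the spectrogram.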

Fbc.m
function [c,F1_r]= fbc(s)

N = length(s);

% calculation of the roots


if exist('alfa') == 0
x=2;
alfa=zeros(1,N);
for i=1:N
ex=1;
while abs(ex)>.00001
ex=-besselj(0,x)/besselj(1,x);
x=x-ex;
end
alfa(i)=x;
%fprintf('Root # %g = %8.5f ex = %9.6f \n',i,x,ex);
x=x+pi;
end
end

a = N;
nb = 1:N;
for m1 = 1:N
c(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s'.*besselj(0,alfa(m1)/a*nb));
end
% cindex=[6:10];
% cindex=18:24;
% cindex=[6:10 18:24];
% cindex=[2:5 5:8 26:30];
% cindex=[2:6 6:10 10:15 52:58];

%cindex=[6:30 130:145 160:175 225:235];


for mm=1:N
g1_r=[alfa(mm)/a];
F1_r(mm)=sum(c(mm).*besselj(0,g1_r*mm));
% g1_r=[alfa(1:N/8)/a];
% F1_r(mm)=sum(a3(1:N/8).*besselj(0,g1_r*mm));
end
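For reference, the pair of sums implemented above is the zeroth-order Fourier–Bessel expansion used throughout these scripts, with a = N the record length and alfa(m) the m-th positive root of J0:

```latex
c_m = \frac{2}{a^{2} J_{1}^{2}(\alpha_m)} \sum_{n=1}^{a} n\, s(n)\, J_{0}\!\left(\frac{\alpha_m n}{a}\right),
\qquad
s(n) \approx \sum_{m=1}^{M} c_m\, J_{0}\!\left(\frac{\alpha_m n}{a}\right),
```

where the m-th coefficient is associated with the analog frequency f_m ≈ alfa(m)·fs/(2·pi·a), which is how Bandlimitfbc.m maps a 300 Hz cut-off to a coefficient index.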

Residual.m
function [residual,LPCoeffs,eta,Ne] = Residual(speech,Fs,segmentsize,segmentshift,lporder,preempflag,plotflag)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% USAGE : [residual,LPCoeffs,Ne] = Residual(speech,framesize,frameshift,lporder,preempflag,plotflag)
% INPUTS :
% speech - speech in ASCII
% Fs - Sampling frequency (Hz)
% segmentsize - framesize for lpanalysis (ms)
% segmentshift - frameshift for lpanalysis (ms)
% lporder - order of lpc
% preempflag - If 1 do preemphasis
% plotflag - If 1 plot results
% OUTPUTS :
% residual - residual signal
% LPCoeffs - 2D array containing LP coeffs of all frames
% Ne - Normalized error

% LOG: NaN errors have been fixed.
%      Some elements of 'residual' were NaN; the error
%      has been corrected.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Preemphasizing speech signal


if(preempflag==1)
dspeech=diff(speech);
dspeech(length(dspeech)+1)=dspeech(length(dspeech));
else
dspeech=speech;
end

[r,c] = size(dspeech);
if r==1
dspeech = dspeech';
end

framesize = floor(segmentsize * Fs/1000);

frameshift = floor(segmentshift * Fs/1000);
nframes=floor((length(dspeech)-framesize)/frameshift)+1;
LPCoeffs=zeros(nframes,(lporder+1));
Lspeech = length(dspeech);
numSamples = Lspeech - framesize;

Lspeech
nframes
(nframes-1)*frameshift + framesize;

j=1;
% Processing the frames.
for i=1:frameshift:Lspeech-framesize
curFrame = dspeech(i:i+framesize-1);
frame = speech(i:i+framesize-1);

% Check if energy of the frame is zero.


if(sum(abs(curFrame)) == 0)
LPCoeffs(j,:) = 0;
Ne(j) = 0;
resFrame(1:framesize) = 0;
else
%[a,E] = lpc(hamming(framesize).*curFrame,lporder);
[a,E] = lpc(curFrame,lporder);
nanFlag = isnan(real(a));

% Check for ill-conditioning that can lead to NaNs.


if(sum(nanFlag) == 0)
LPCoeffs(j,:) = real(a);
Ne(j) = E;

% Inverse filtering.
if i <= lporder
frameToFilter(1:lporder) = 0;
else
frameToFilter(1:lporder) = dspeech(i-lporder:i-1);
%frameToFilter(1:lporder) = speech(i-lporder:i-1);
end

frameToFilter(lporder+1:lporder+framesize)=curFrame;
%frameToFilter(lporder+1:lporder+framesize)=frame;
resFrame = InverseFilter(frameToFilter,lporder,LPCoeffs(j,:));

numer=resFrame(lporder+1:framesize);
denom=curFrame(lporder+1:framesize);

eta(i) = sum(numer.*numer)/sum(denom.*denom);
%eta(i)

else
LPCoeffs(j,:) = 0;
Ne(j) = 0;
resFrame(1:framesize) = 0;
end
end

% Write current residual into the main residual array.


residual(i:i+frameshift-1) = resFrame(1:frameshift);
j=j+1;
end

clear frameToFilter;

i=i+frameshift;
% Updating the remaining residual samples of the penultimate frame.
%residual(i+frameshift:i+framesize-1) = resFrame(frameshift+1:framesize);

% The above processing covers only L samples, where


% L = {(nframes-1)*frameshift + framesize}.
% Still, the last {Lspeech-L} samples remain to be processed.
% Note that 0 <= {Lspeech-L} < framesize.

% Processing the last frame. However, this last frame will have
% a length of {Lspeech-i+1} samples.

if(i < Lspeech)

curFrame = dspeech(i:Lspeech);
frame = speech(i:Lspeech);
l=Lspeech-i+1;

% Check if energy of the frame is zero.


if(sum(abs(curFrame)) == 0)
LPCoeffs(j,:) = 0;
Ne(j) = 0;
resFrame(1:l) = 0;
else
%[a,E] = lpc(hamming(l).*curFrame,lporder);
[a,E] = lpc(curFrame,lporder);
nanFlag = isnan(real(a));

if(sum(nanFlag) == 0)
LPCoeffs(j,:) = real(a);
Ne(j) = E;

% Inverse filtering.
frameToFilter(1:lporder) = dspeech(i-lporder:i-1);

%frameToFilter(1:lporder) = speech(i-lporder:i-1);

frameToFilter(lporder+1:lporder+l)=curFrame;
%frameToFilter(lporder+1:lporder+l)=frame;

%resFrame(1:l) = InverseFilter(frameToFilter,lporder,LPCoeffs(j,:));
resFrame(1:l) = InverseFilter(frameToFilter,lporder,LPCoeffs(j,:));
else
LPCoeffs(j,:) = 0;
Ne(j) = 0;
resFrame(1:l) = 0;
end
end

residual(i:i+l-1) = resFrame(1:l);
% Residual computation is now complete.
% The lengths of speech and residual are equal now.
end

% Plotting the results


if plotflag==1

% Setting scale for x-axis.


i=1:Lspeech;
x = (i/Fs);

figure;
ax(1) = subplot(2,1,1);plot(x,speech);
xlim([x(1) x(end)]);
xlabel('Time (s)');ylabel('Signal');grid;

ax(2) = subplot(2,1,2);plot(x,residual);
xlim([x(1) x(end)]);
xlabel('Time (s)');ylabel('LP Residual');grid;
linkaxes(ax,'x');
end

Inversefilter.m
function residual = InverseFilter(frameToFilter,lporder,a)

% frameToFilter: It has 'lporder + framesize' samples.


% lporder: Order of LP analysis.
% a: Array of predictor coefficients, of size 'lporder+1'.

l = length(frameToFilter);

if l <= lporder
return;
else
% Note that l-lporder = frameSize.
for i=1:l-lporder
predictedSample=0;
% Note that a(1) = 1; Hence start from j=2.
for j=2:lporder+1
predictedSample = predictedSample - a(j)*frameToFilter(lporder+1+i-j);
end
residual(i) = frameToFilter(i+lporder) - predictedSample;
end
end
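InverseFilter computes the prediction error e(n) = s(n) + sum over j=2..lporder+1 of a(j)*s(n-j+1), for coefficients in the MATLAB lpc convention a = [1, a2, ..., a(p+1)]. A self-contained Python check on a synthetic AR(1) signal, where inverse filtering must return the excitation exactly:

```python
# AR(1) synthesis followed by inverse filtering (prediction-error filter).
# a is in MATLAB lpc convention: A(z) = 1 - 0.5 z^-1, so the true model is
# x[n] = 0.5*x[n-1] + w[n], and inverse filtering must recover w exactly.
a = [1.0, -0.5]
w = [1.0, -0.3, 0.7, 0.2, -0.9]   # excitation (the "residual" we expect back)
x = []
for n, wn in enumerate(w):
    prev = x[n-1] if n > 0 else 0.0
    x.append(0.5*prev + wn)

res = []
for n in range(len(x)):
    e = 0.0
    for j in range(len(a)):
        e += a[j]*(x[n-j] if n-j >= 0 else 0.0)   # e[n] = sum_j a[j] x[n-j]
    res.append(e)
```

The `frameToFilter(1:lporder)` padding in the listing serves the same purpose as the zero guard here: it supplies the lporder samples of history needed at the start of each frame.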

7
Hilbertenvelope.m
function [HilbertEnv]=HilbertEnvelope(signal,Fs,plotflag)

% returns a complex helical sequence, sometimes called the analytic


% signal, from a real data sequence. The analytic signal has a real
% part, which is the original data, and an imaginary part, which
% contains the Hilbert Transform. The imaginary part is a version
% of the original real sequence with a 90 degree phase shift. Sines
% are therefore transformed to cosines and vice versa.

tempSeq=hilbert(signal);
HilbertSeq=imag(tempSeq);

% HilbertSeq contains the Hilbert transformed version of input signal


% Hilbert Envelope is given by t=sqrt((sig*sig)+(h(sig)*h(sig)));

sigSqr=signal.*signal;
HilbertSqr=HilbertSeq.*HilbertSeq;
HilbertEnv=sqrt(sigSqr+HilbertSqr);

%wavwrite(HilbertSeq/1.01/max(abs(HilbertSeq)),Fs,16,'ht.wav');
wavwrite(HilbertSeq,Fs,16,'ht.wav');

if(plotflag==1)
% Setting scale for x-axis.
len = length(signal);
x = [1:len]*1/Fs;

figure;
subplot(3,1,1);plot(x,signal);
%xlabel('Time (ms)');
ylabel('Signal');grid;
hold on;
%plot(x,HilbertSeq.*HilbertSeq,'r');
plot(x,HilbertEnv,'r');
hold off;

subplot(3,1,2);plot(x,HilbertSeq);
ylabel('HT of Signal');grid;
subplot(3,1,3);plot(x,HilbertEnv);
ylabel('HE of Signal');grid;

7
%xlabel('Time (ms)');ylabel('Hilbert Envelope of LP Residual');grid;
xlabel('Time (s)');

%figure;
%plot(signal,HilbertSeq,'k.');grid;
%plot(x, signal.*HilbertSeq);grid;
for i=1:16:len-16
xi = signal(i:i+16);
yi = HilbertSeq(i:i+16);
%plot(xi-mean(xi),yi-mean(yi),'k.');grid;
end
end
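HilbertEnvelope relies on the toolbox hilbert() function; the same analytic signal can be built directly by zeroing the negative-frequency half of the DFT, and for a pure cosine the envelope sqrt(x^2 + H{x}^2) = |analytic signal| is just the constant amplitude. A small stdlib-only sketch (O(N^2) DFT, fine for a demo):

```python
import math, cmath

def analytic(x):
    """Analytic signal via the DFT: keep DC/Nyquist, double positive bins, zero negative."""
    N = len(x)
    X = [sum(x[n]*cmath.exp(-2j*math.pi*k*n/N) for n in range(N)) for k in range(N)]
    for k in range(1, N//2):
        X[k] *= 2.0
    for k in range(N//2 + 1, N):
        X[k] = 0.0
    return [sum(X[k]*cmath.exp(2j*math.pi*k*n/N) for k in range(N))/N for n in range(N)]

N = 64
x = [0.7*math.cos(2*math.pi*4*n/N) for n in range(N)]
env = [abs(z) for z in analytic(x)]   # Hilbert envelope
```

Here `env` is flat at the amplitude 0.7, which is exactly the smoothing property that makes the Hilbert envelope of the LP residual useful for locating excitation epochs.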

Bandlimitfbc.m
clc;
clear all;
close all;

[os,fs] = wavread('sa1_8000.wav');
plot(os),title('Original Speech Signal'),axis tight,grid on;
samplesrange = [14000:20000];
S = os';
s = S(samplesrange);
N = length(s);
figure(),plot(s),title('Extracted voiced speech signal of 20msec'),axis tight, grid on;
if exist('alfa') == 0
x=2;
alfa=zeros(1,N);
for i=1:N
ex=1;
while abs(ex)>.00001
ex=-besselj(0,x)/besselj(1,x);
x=x-ex;
end
alfa(i)=x;
%fprintf('Root # %g = %8.5f ex = %9.6f \n',i,x,ex);
x=x+pi;
end
end

fts = fft(s);
ftsby2 = fts(1:length(fts)/2);
n = length(ftsby2);
tf = [1:n].*((fs/2)/n);
figure(),plot(tf,20*log10(abs(ftsby2))),title('Spectrum of the Extracted speech signal'),axis tight, grid on;

a = N;
nb = 1:N;
for m1 = 1:N
fbc(m1) = (2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s.*besselj(0,alfa(m1)/a*nb));
end

% Calculating Mean Frequency


s1=s;

N=length(s1);
nb=1:N;
MM = N;
a=N;
for m1=1:MM
c(m1)=(2/(a^2*(besselj(1,alfa(m1))).^2))*sum(nb.*s1.*besselj(0,alfa(m1)/a*nb));
p(m1)= (c(m1)^2)*(a^2*(besselj(1,alfa(m1))).^2)/2;
end
f(1:MM)=alfa(1:MM)/(2*pi*a);
fmean1 = sum(f(1:MM).*p(1:MM))/sum(p(1:MM));
mf = fmean1*fs

% Reconstructing the signal from the coefficients

fd = 300;
range = N*fd/(fs/2);

fbc = fbc(1:range);

fbc = [fbc zeros(1,N-length(fbc))];

for mm=1:N
g1_r=[alfa/a];
rs(mm)=sum(fbc.*besselj(0,g1_r*mm));
end

figure(),plot(rs),title('Band Limited signal with frequency < 300Hz'),axis tight, grid on;

ftrs = fft(rs);
ftrsby2 = ftrs(1:length(ftrs)/2);
nr = length(ftrsby2);
tfr = [1:nr].*((fs/2)/nr);
figure(),plot(tfr,20*log10(abs(ftrsby2))),title('Spectrum of the Band Limited Signal'),axis tight, grid on;
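Bandlimitfbc band-limits by keeping only the Fourier–Bessel coefficients whose order maps below fd = 300 Hz (range = N*fd/(fs/2)) and reconstructing from the truncated set. The effect is analogous to zeroing bins of an ordinary Fourier expansion; since Bessel routines are not in the Python stdlib, here is the analogous DFT version, which removes the high-frequency component exactly when the tones fall on bins:

```python
import math, cmath

def dft(x):
    N = len(x)
    return [sum(x[n]*cmath.exp(-2j*math.pi*k*n/N) for n in range(N)) for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k]*cmath.exp(2j*math.pi*k*n/N) for k in range(N)).real/N for n in range(N)]

N = 64
x = [math.cos(2*math.pi*4*n/N) + math.cos(2*math.pi*20*n/N) for n in range(N)]
X = dft(x)
cutoff = 8   # keep bins 0..7 and their conjugate mirror (the "range" truncation)
Y = [X[k] if (k < cutoff or k > N - cutoff) else 0.0 for k in range(N)]
y = idft(Y)  # only the bin-4 tone should survive
```

In the Fourier–Bessel case the retained basis functions are J0(alfa(m)*n/a) rather than complex exponentials, but the truncate-and-resum structure is the same as in the listing above.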

Computeresidual.m
function [residual,LPCoeffs] = computeResidual(speech,Fs,segmentsize,segmentshift,lporder,preempflag,plotflag)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% USAGE : [residual,LPCoeffs,Ne] = computeResidual(speech,framesize,frameshift,lporder,preempflag,plotflag)
% INPUTS :
% speech - speech in ASCII
% Fs - Sampling frequency (Hz)
% segmentsize - framesize for lpanalysis (ms)
% segmentshift - frameshift for lpanalysis (ms)
% lporder - order of lpc
% preempflag - If 1 do preemphasis
% plotflag - If 1 plot results
% OUTPUTS :
% residual - residual signal
% LPCoeffs - 2D array containing LP coeffs of all frames
% Ne - Normalized error

% LOG: NaN errors have been fixed.
%      Some elements of 'residual' were NaN; the error
%      has been corrected.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Preemphasizing speech signal


if(preempflag==1)
dspeech=diff(speech);
dspeech(length(dspeech)+1)=dspeech(length(dspeech));
else
dspeech=speech;
end

dspeech = dspeech(:);

framesize = floor(segmentsize * Fs/1000);


frameshift = floor(segmentshift * Fs/1000);
nframes=floor((length(dspeech)-framesize)/frameshift)+1;
Lspeech = length(dspeech);
numSamples = Lspeech - framesize;

Lspeech
nframes
(nframes-1)*frameshift + framesize

sbuf = buffer(dspeech, framesize, framesize-frameshift,'nodelay');


[rs,cs] = size(sbuf);
tmp = dspeech(Lspeech-rs+1:Lspeech);
sbuf(:,cs) = tmp(:); % Last column of the buffer.

% Computation of LPCs.
for i=1:cs
nanflag = 0;
erg(i) = sum(sbuf(:,i).*sbuf(:,i));
if erg(i) ~= 0
a = lpc(sbuf(:,i),lporder);
nanflag = sum(isnan(real(a)));
else
a1 = [1 zeros(1,lporder)];
end

if nanflag == 0
A(:,i) = real(a(:));
else
A(:,i) = a1(:);
end
end

% Computation of LP residual.
x = [zeros(1,lporder) (dspeech(:))'];
xbuf = buffer(x, lporder+framesize, lporder+framesize-frameshift,'nodelay');
[rx,cx] = size(xbuf);
tmp = x(Lspeech+lporder-rx+1:Lspeech+lporder);
xbuf(:,cx) = tmp(:); % Last column of the buffer.

% Inverse filtering.
j=1;
for i=1:cx-1
res = filter(A(:,i), 1, xbuf(:,i));

% Write current residual into the main residual array.


residual(j:j+frameshift-1) = res(lporder+1:frameshift+lporder);
j=j+frameshift;
end
res = filter(A(:,cx), 1, xbuf(:,cx));
residual((cx-1)*frameshift + 1: Lspeech) = res((cx-1)*frameshift - Lspeech + rx + 1:rx);

LPCoeffs = A';
size(LPCoeffs)

% Plotting the results


if plotflag==1

% Setting scale for x-axis.


i=1:Lspeech;
x = i/Fs;

figure; subplot(2,1,1);plot(x,speech);
xlim([x(1) x(end)]);
ylabel('Signal');grid;

subplot(2,1,2);plot(x,residual);
xlim([x(1) x(end)]);
xlabel('Time (s)');ylabel('LP Residual');grid;
end

Synthesizespeech.m
function [speech] = synthesizeSpeech(Residual, LPCs, Fs, lporder,fsize,fshift,plotflag)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% function [speech]=SynthesizeSpeechUsingResidual(idNoEx,
lporder,framesize,frameshift,plotflag)
% fsize,fshift: In ms
% Use .res and .lpc files.
% In .lpc file, each row is a set of LPCs for one frame.
% Get sampling frequency from .res.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

framesize = floor(fsize*Fs/1000);
frameshift = floor(fshift*Fs/1000);

speech=zeros(1,length(Residual));

j=1;
for(i=1:frameshift:length(Residual)-framesize)

ResFrm=Residual(i:i+framesize-1);

a=LPCs(j,:);

j=j+1;

if(i <= framesize)


PrevFrm=zeros(1,framesize);
PrevFrm(framesize-(i-2):framesize)=speech(1:i-1);
else
PrevFrm=speech((i-framesize):(i-1));
end

SpFrm=SynthFilter(real(PrevFrm),real(ResFrm),real(a),lporder,framesize,0);

speech(i:i+frameshift-1)=SpFrm(1:frameshift);
%pause

end
speech(i+frameshift:i+framesize-1)=SpFrm(frameshift+1:framesize);

%PROCESSING LASTFRAME SAMPLES
if(i<length(Residual))

ResFrm = Residual(i:length(Residual));

a=LPCs(j,:);

j=j+1;

PrevFrm=speech((i-framesize):(i-1));

SpFrm=SynthFilter(real(PrevFrm),real(ResFrm),real(a),lporder,framesize,0);
speech(i:i+length(SpFrm)-1)=SpFrm(1:length(SpFrm));

end

if(plotflag==1)

figure;
l = length(speech);
x = [1:l]/Fs;

subplot(2,1,1);plot(x,real(Residual),'k');grid;
xlim([x(1) x(end)]);

subplot(2,1,2);plot(x,real(speech),'k');grid;
xlim([x(1) x(end)]);
xlabel('Time (s)');

end
Synthfilter.m
function [SpchFrm]=SynthFilter(PrevSpFrm,ResFrm,FrmLPC,LPorder,FrmSize,plotflag)

%USAGE: [SpchFrm]=SynthFilter(PrevSpFrm,ResFrm,FrmLPC,LPorder,FrmSize,plotflag);

tempfrm=zeros(1,2*FrmSize);

tempfrm((FrmSize-LPorder):FrmSize)=PrevSpFrm((FrmSize-LPorder):FrmSize);

for(i=1:FrmSize)

t=0;
for(j=1:LPorder)

t=t+FrmLPC(j+1)*tempfrm(-j+i+FrmSize);

%pause
end

% ResFrm(i);

%s=-t+ResFrm(i)
%pause
%tempfrm(FrmSize+i)=s;

tempfrm(FrmSize+i)=-t+ResFrm(i);

%pause
end

SpchFrm=tempfrm(FrmSize+1:2*FrmSize);

if(plotflag==1)

figure;
subplot(2,1,1);plot(ResFrm);grid;

subplot(2,1,2);plot(SpchFrm);grid;

end
