
UCCN2043 Lecture Notes

4.0 Coding of Text, Voice, Image, and Video Signals

The information that has to be exchanged between two entities (persons or machines) in a
communication system can be in one of the following formats:
Text
Voice
Image
Video
In an electrical communication system, the information is first converted into an electrical
signal. For instance,
A microphone is the transducer that converts the human voice into an analog signal.
Similarly, the video camera converts the real-life scenery into an analog signal.
In a digital communication system, the first step is to convert the analog signal into digital
format using analog-to-digital conversion techniques. This digital signal representation for
various types of information is the topic of this lesson.
4.1 Text Messages

Text messages are generally represented in ASCII (American Standard Code for Information
Interchange), in which a 7-bit code is used to represent each character. Another code form
called EBCDIC (Extended Binary Coded Decimal Interchange Code) is also used. ASCII is
the most widely used coding scheme for representation of text in computers.
Using ASCII, the number of characters that can be represented is limited to 128 because only
7-bit code is used. Out of these 128 characters, 33 are non-printing control characters (many
now obsolete) that affect how text and space are processed. The other 95 are printable
characters, including the space (which is considered an invisible graphic). The ASCII code is
used for representing many European languages as well.
To transmit text messages, first the text is converted into any one of the character-encoding
schemes (such as ASCII), and then the bit stream is converted into an electrical signal.
Note: In extended ASCII, each character is represented by 8 bits. Using 8 bits, a number of
graphic characters and control characters can be represented.
To represent all the world languages, Unicode has been developed. Unicode uses 16 bits to
represent each character and can be used to encode the characters of any recognized language
in the world. Modern programming languages such as Java and markup languages such as
XML support Unicode.
It is important to note that the ASCII/Unicode coding mechanism is not the best way,
according to Shannon. If we consider the frequency of occurrence of the letters of a language
and use small codewords for frequently occurring letters, the coding will be more efficient.
However, more processing will be required, and more delay will result.
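The frequency-based idea can be made concrete with a small sketch that builds a Huffman code, the standard way to assign short codewords to frequently occurring symbols. The letter frequencies below are illustrative values, not a real language model:

```python
import heapq

def huffman_code(freqs):
    """Build a prefix code: frequent symbols get shorter codewords."""
    # Heap entries: (weight, tiebreaker, {symbol: codeword-so-far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Illustrative relative frequencies for a few English letters:
freqs = {"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "q": 0.1, "z": 0.07}
code = huffman_code(freqs)
total = sum(freqs.values())
avg_len = sum(freqs[s] * len(code[s]) for s in freqs) / total
print(len(code["e"]), len(code["q"]))   # 'e' gets a shorter codeword than 'q'
print(avg_len < 7)                       # beats a fixed 7-bit code on average
```

As the lesson notes, the price of the shorter average codeword length is extra encoding/decoding work at both ends.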

The best coding mechanism for text messages was developed by Morse. The Morse code was
used extensively for communication in the old days. Many ships used the Morse code until
May 2000. In Morse code, characters are represented by dots and dashes. Morse code is no
longer used in standard communication systems.
Note: Morse code uses dots and dashes to represent various English characters. It is an
efficient code because short codes are used to represent high-frequency letters and
long codes are used to represent low-frequency letters. The letter E is represented by
just one dot and the letter Q is represented by dash dash dot dash.
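The variable-length property of Morse code can be sketched with a lookup table; the subset below follows the standard International Morse code table:

```python
# A few codewords from the International Morse code table (dot '.', dash '-'):
MORSE = {"E": ".", "T": "-", "A": ".-", "N": "-.",
         "S": "...", "O": "---", "Q": "--.-"}

def to_morse(text):
    """Encode letters found in the table, separated by spaces."""
    return " ".join(MORSE[ch] for ch in text.upper() if ch in MORSE)

print(to_morse("EQ"))   # . --.-
# High-frequency E gets 1 symbol; low-frequency Q gets 4:
print(len(MORSE["E"]), len(MORSE["Q"]))
```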

4.2 Voice

To transmit voice from one place to another, the speech (acoustic signal) is first converted
into an electrical signal using a transducer, the microphone. This electrical signal is an analog
signal. The voice signal corresponding to the speech "how are you" is shown in Figure 4.1.

Figure 4.1: Speech waveform.

The important characteristics of the voice signal are given here:


The voice signal occupies a bandwidth of 4 kHz, i.e., the highest frequency component in
the voice signal is 4 kHz. Though higher frequency components are present, they are not
significant, so a filter is used to remove all frequency components above 4 kHz.
In telephone networks, the bandwidth is limited to only 3.4 kHz.
The pitch varies from person to person. Pitch is the fundamental frequency in the voice
signal. In a male voice, the pitch is in the range of 50 to 250 Hz. In a female voice, the pitch
is in the range of 200 to 400 Hz.
The speech sounds can be classified broadly as voiced sounds and unvoiced sounds.
Signals corresponding to voiced sounds (such as the vowels a, e, i, o, u) will be periodic
signals and will have high amplitude. Signals corresponding to unvoiced sounds (such as
th, s, z, etc.) will look like noise signals and will have low amplitude.
Voice signal is considered a non-stationary signal, i.e., the characteristics of the signal
(such as pitch and energy) vary. However, if we take small portions of the voice signals of
about 20msec duration, the signal can be considered stationary. In other words, during this
small duration, the characteristics of the signal do not change much. Therefore, the pitch
value can be calculated using the voice signal of 20msec. However, if we take the next
20msec, the pitch may be different.
The voice signal occupies a bandwidth of 4 kHz. The voice signal can be broken down into a
fundamental frequency and its harmonics. The fundamental frequency or pitch is low for a
male voice and high for a female voice.
These characteristics are used while converting the analog voice signal into digital form.
Analog-to-digital conversion of voice signals can be done using one of two techniques:
waveform coding and vocoding.
Note: The characteristics of speech signals described here are used extensively for speech
processing applications such as text-to-speech conversion and speech recognition.
Music signals have a bandwidth of 20 kHz. The techniques used for converting music
signals into digital form are the same as for voice signals.

4.2.1 Waveform Coding

Waveform coding is done in such a way that the analog electrical signal can be reproduced at
the receiving end with minimum distortion. Hundreds of waveform coding techniques have
been proposed by many researchers. We will study two important waveform coding
techniques: pulse code modulation (PCM) and adaptive differential pulse code modulation
(ADPCM).

Pulse Code Modulation


Pulse Code Modulation (PCM) is the first and the most widely used waveform coding
technique. The ITU-T Recommendation G.711 specifies the algorithm for coding speech in
PCM format.
PCM coding technique is based on Nyquist's theorem, which states that if a signal is sampled
uniformly at least at the rate of twice the highest frequency component, it can be
reconstructed without any distortion.
The highest frequency component in the voice signal is 4 kHz, so we need to sample the
waveform at 8000 samples per second, i.e., every 1/8000th of a second (125 microseconds). We
have to find the amplitude of the waveform every 125 microseconds and transmit that
value instead of transmitting the analog signal as it is.

The sample values are still analog values, and we can "quantize" these values into a fixed
number of levels. As shown in Figure 4.2, if the number of quantization levels is 256, we can
represent each sample by 8 bits. So, 1 second of voice signal can be represented by
8000 × 8 bits = 64 kbits. Hence, for transmitting voice using PCM, we require a 64 kbps data rate.

Figure 4.2: Pulse Code Modulation.


However, note that since we are approximating the sample values through quantization, there
will be a distortion in the reconstructed signal; this distortion is known as quantization noise.
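The sample/quantize/count arithmetic above can be sketched as follows. The uniform quantizer here is a simplification for illustration (standard G.711 PCM uses the logarithmic quantization described next):

```python
import math

SAMPLE_RATE = 8000   # samples/second (Nyquist: 2 x 4 kHz)
LEVELS = 256         # quantization levels -> 8 bits/sample
BITS = 8

def quantize(x, levels=LEVELS):
    """Map an amplitude in [-1, 1] to one of `levels` integer codes."""
    step = 2.0 / levels
    code = int((x + 1.0) / step)
    return min(code, levels - 1)

# One second of a 1 kHz test tone, one sample every 125 microseconds:
samples = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
           for n in range(SAMPLE_RATE)]
codes = [quantize(s) for s in samples]

bit_rate = SAMPLE_RATE * BITS
print(bit_rate)   # 64000 bits/s -> the 64 kbps PCM rate
```

The gap between each analog sample and the quantization level it is rounded to is exactly the quantization noise described above.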
In the PCM coding technique standardized by ITU in the G.711 recommendation, the
nonlinear characteristic of human hearing is exploited: the ear is more sensitive to
quantization noise in low-amplitude signals than to noise in large-amplitude signals.
In G.711, a logarithmic (non-linear) quantization function is applied to the speech signal, and
so the small signals are quantized with higher precision. Two quantization functions, called
A-law and μ-law, have been defined in G.711.
μ-law is used in the U.S. and Japan.
A-law is used in Europe and the countries that follow European standards.
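The logarithmic companding idea can be sketched with the continuous μ-law characteristic (μ = 255). Note this is an illustration: the actual G.711 codec approximates this curve with piecewise-linear segments rather than computing logarithms directly.

```python
import math

MU = 255.0  # companding constant used by mu-law

def mu_law_compress(x):
    """Continuous mu-law characteristic for an amplitude x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse characteristic: recover the linear amplitude."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# A quiet signal (0.01) is boosted well above 0.2 before uniform
# quantization, so small signals get effectively finer precision:
print(round(mu_law_compress(0.01), 3))   # 0.228
print(round(mu_law_compress(0.5), 3))    # 0.876
```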
The speech quality produced by the PCM coding technique is called toll quality speech and is
taken as the reference to compare the quality of other speech coding techniques.
For CD-quality audio, the sampling rate is 44.1 kHz (one sample approximately every 23 microseconds),
and each sample is coded with 16 bits. For a two-channel stereo audio stream, the bit rate
required is
2 × 44,100 × 16 = 1,411,200 bits/s ≈ 1.41 Mbps.

Adaptive Differential Pulse Code Modulation


One simple modification that can be made to PCM is that we can code the difference between
two successive samples rather than coding the samples directly. This technique is known as
differential pulse code modulation (DPCM).

Another characteristic of the voice signal that can be used is that a sample value can be
predicted from past sample values. At the transmitting side, we predict the sample value and
find the difference between the predicted value and the actual value and then send the
difference value. This technique is known as adaptive differential pulse code modulation
(ADPCM).
In ADPCM, each sample is represented by 4 bits, and hence the data rate required is 32kbps.
Therefore, using ADPCM, voice signals can be coded at 32kbps without any degradation of
quality as compared to PCM.
ITU-T Recommendation G.721 specifies the coding algorithm. In ADPCM, the value of
speech sample is not transmitted, but the difference between the predicted value and the
actual sample value is. Generally, the ADPCM coder takes the PCM coded speech data and
converts it to ADPCM data.
The block diagram of an ADPCM encoder is shown in Figure 4.3(a).

Figure 4.3: (a) ADPCM Encoder


Eight-bit μ-law PCM samples are input to the encoder and are converted into linear
format. Each sample value is predicted using a prediction algorithm, and then the predicted
value of the linear sample is subtracted from the actual value to generate the difference signal.
Adaptive quantization is performed on this difference value to produce a 4-bit ADPCM
sample value, which is transmitted. Instead of representing each sample by 8 bits, in ADPCM
only 4 bits are used.
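The difference-coding idea behind these encoders can be sketched as a toy DPCM codec. It uses a first-order predictor (the previous reconstructed sample) and transmits the differences losslessly; a real ADPCM codec such as G.721 additionally quantizes each difference adaptively down to 4 bits:

```python
def dpcm_encode(samples):
    """Predict each sample as the previous one; transmit only the difference."""
    diffs, prediction = [], 0
    for s in samples:
        d = s - prediction
        diffs.append(d)
        prediction += d      # the decoder tracks the same prediction
    return diffs

def dpcm_decode(diffs):
    out, prediction = [], 0
    for d in diffs:
        prediction += d
        out.append(prediction)
    return out

voice_like = [0, 3, 6, 8, 9, 9, 8, 6, 3, 0]   # slowly varying samples
diffs = dpcm_encode(voice_like)
print(diffs)   # differences stay small, so fewer bits are needed per value
```

Because speech changes slowly between 125-microsecond samples, the differences have a much smaller range than the samples themselves, which is what makes the 8-bit to 4-bit reduction possible.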

Figure 4.3: (b) ADPCM Decoder


At the receiving end, the decoder, shown in Figure 4.3(b), obtains the de-quantized version of
the digital signal. This value is added to the value generated by the adaptive predictor to
produce the linear PCM coded speech, which is adjusted to reconstruct the μ-law-based PCM
coded speech.
There are many other waveform coding techniques, such as delta modulation (DM) and
continuously variable slope delta modulation (CVSD). Using these, the coding rate can be
reduced to 16kbps, 9.8kbps, and so on. As the coding rate is reduced, the quality of the speech
also degrades. However, there are coding techniques with which good-quality speech can be
produced at low coding rates.
The PCM coding technique is used extensively in telephone networks. ADPCM is used in
telephone networks as well as in many radio systems such as digital enhanced cordless
telecommunications (DECT).
Note: Over the past 50 years, hundreds of waveform coding techniques have been
developed with which data rates can be reduced to as low as 9.8kbps to get good
quality speech.

4.2.2 Vocoding

A radically different method of coding speech signals was proposed by H. Dudley in 1939.
He named his coder vocoder, a term derived from VOice CODER. In a vocoder, the electrical
model for speech production seen in Figure 4.4 is used.

Figure 4.4: Electrical model of speech production


This model is called the source-filter model because the speech production mechanism is
considered as two distinct entities: a filter to model the vocal tract and an excitation source.
The excitation source consists of a pulse generator and a noise generator. The filter is excited
by the pulse generator to produce voiced sounds (vowels) and by the noise generator to
produce unvoiced sounds (consonants).

The vocal tract filter is a time-varying filter, i.e., the filter coefficients vary with time. As the
characteristics of the voice signal vary slowly with time, for time periods on the order of
20msec the filter coefficients can be assumed to be constant.
In vocoding techniques, at the transmitter, the speech signal is divided into frames of 20msec
in duration. Each frame contains 160 samples. Each frame is analyzed to check whether it is a
voiced frame or unvoiced frame by using parameters such as energy, amplitude levels, etc.
For voiced frames, the pitch is determined. For each frame, the filter coefficients are also
determined. These parameters (voiced/unvoiced classification, filter coefficients, and pitch
for voiced frames) are transmitted to the receiver.
At the receiving end, the speech signal is reconstructed using the electrical model of speech
production. Using this approach, the data rate can be reduced to as low as 1.2kbps. However,
compared to waveform coding techniques, the quality of speech will not be very good.
A number of techniques are used for calculating the filter coefficients. Linear prediction is
the most widely used of these techniques.

Linear Prediction
The basic concept of linear prediction is that the sample of a voice signal can be
approximated as a linear combination of the past samples of the signal.
If S_n is the nth speech sample, then

S_n = Σ_{k=1}^{P} a_k S_{n-k} + G U_n

where
a_k (k = 1, …, P) are the linear prediction coefficients,
G is the gain of the vocal tract filter, and
U_n is the excitation to the filter.
Linear prediction coefficients (generally 8 to 12) represent the vocal tract filter coefficients.
Calculating the linear prediction coefficients involves solving P linear equations. One of the
most widely used methods for solving these equations is through the Durbin and Levinson
algorithm.
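A minimal sketch of the Levinson-Durbin recursion is given below. The autocorrelation values in the sanity check are those of an ideal first-order predictor (r[k] = 0.5^k), chosen so the expected answer is known; real coders compute r[k] from the 160 samples of each frame:

```python
def levinson_durbin(r, order):
    """Solve the LPC normal equations given autocorrelation values r[0..order]."""
    a = [0.0] * (order + 1)   # a[1..order] are the prediction coefficients
    e = r[0]                  # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e           # reflection coefficient for this stage
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1.0 - k * k)    # error energy shrinks at every stage
    return a[1:], e

# Sanity check: for r[k] = 0.5**k the recursion should recover
# a_1 = 0.5 and a_2 = 0 (a pure first-order predictor).
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs)   # [0.5, 0.0]
```

The same loop scales directly to the 8 to 12 coefficients mentioned above; only `order` and the autocorrelation input change.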
Coding of the voice signal using linear prediction analysis involves the following steps:

At the transmitting end, divide the voice signal into frames, each frame of 20msec
duration. For each frame, calculate the linear prediction coefficients and pitch and find
out whether the frame is voiced or unvoiced. Convert these values into code words and
send them to the receiving end.

At the receiver, using these parameters and the speech production model, reconstruct the
voice signal.

In the linear prediction technique, a voice sample is approximated as a linear combination of
the past P samples. The linear prediction coefficients are calculated every 20 milliseconds and
sent to the receiver, which reconstructs the speech samples using these coefficients. Using a
linear prediction vocoder, voice signals can be compressed to as low as 1.2kbps.
The quality of speech will be very good for data rates down to 9.6kbps, but the voice sounds
synthetic at lower data rates. Slight variations of this technique are used extensively
in many practical systems such as mobile communication systems and speech synthesizers.
Note: Variations of LPC technique are used in many commercial systems, such as mobile
communication systems and Internet telephony.

4.3 Image

To transmit an image, the image is divided into grids called pixels (or picture elements). The
higher the number of grids, the higher the resolution. Grid sizes such as 1024 × 768 and
800 × 600 are generally used in computer graphics.
For black-and-white pictures, each pixel is given a certain gray-scale value. If there are 256
gray-scale levels, each pixel is represented by 8 bits. So, to represent a picture with a grid size
of 400 × 600 pixels with each pixel of 8 bits, 240 kbytes of storage is required.
To represent color, the levels of the three fundamental colors (red, blue, and green) are
combined. More shades of color can be represented if more levels of each color are used.
In image coding, the image is divided into small grids called pixels, and each pixel is
quantized. The higher the number of pixels, the higher will be the quality of the reconstructed
image.
For example, if an image is coded with a resolution of 352 × 240 pixels, and each pixel is
represented by 24 bits, the size of the image is 352 × 240 × 24/8 = 253,440 bytes ≈ 247.5 KB.
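Both storage calculations above can be checked with a small helper (the function name is illustrative):

```python
def image_bytes(width, height, bits_per_pixel):
    """Uncompressed image size in bytes."""
    return width * height * bits_per_pixel // 8

# Grayscale with 256 levels -> 8 bits/pixel:
print(image_bytes(400, 600, 8))    # 240000 bytes = 240 kbytes
# 24-bit colour image:
print(image_bytes(352, 240, 24))   # 253440 bytes = 247.5 KB (253440 / 1024)
```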
To store images as well as to send them through a communication medium, the image
needs to be compressed. A compressed image occupies less storage space on a medium such
as a hard disk or CD-ROM, and when sent through a communication medium, it can be
transmitted faster.
One of the most widely used image coding formats is the JPEG format. The Joint Photographic
Experts Group (JPEG) proposed this standard for coding of images. The block diagram of JPEG
image compression is shown in Figure 4.5.

Figure 4.5: JPEG compression


For compressing the image using the JPEG compression technique, the image is divided into
blocks of 8 by 8 pixels and each block is processed using the following steps:
1. Apply the discrete cosine transform (DCT), which takes the 8 × 8 matrix and produces an
8 × 8 matrix that contains the frequency coefficients. This is similar to the Fast
Fourier Transform (FFT) used in digital signal processing. The output matrix
represents the image in the spatial frequency domain.
2. Quantize the frequency coefficients obtained in Step 1. This is just rounding off the
values to the nearest quantization level. As a result, the quality of the image will
slightly degrade.
3. Convert the quantization levels into bits. Since there will be little change in the
consecutive frequency coefficients, the differences in the frequency coefficients are
encoded instead of directly encoding the coefficients.
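Steps 1 and 2 can be sketched with a naive (unoptimized) 2-D DCT-II. Real JPEG codecs use fast DCT factorizations, and the uniform quantization step of 16 below is a placeholder for illustration, not an actual JPEG quantization table:

```python
import math

N = 8  # JPEG processes 8 x 8 pixel blocks

def dct_2d(block):
    """Naive 2-D DCT-II of an N x N block (O(N^4); shown for clarity only)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    for x in range(N) for y in range(N))
            out[u][v] = c(u) * c(v) * s
    return out

# Step 1: a flat block of gray pixels puts all its energy in the DC term:
flat = [[100] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0]))   # 800; every other coefficient is ~0

# Step 2: quantization = divide by a step size and round (this is lossy):
quantized = [[round(coeffs[u][v] / 16) for v in range(N)] for u in range(N)]
print(quantized[0][0])       # 50
```

The long runs of zero (or near-constant) coefficients produced by Step 2 are exactly why the difference encoding in Step 3 compresses so well.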
Compression ratios of 30:1 can be achieved using JPEG compression. In other words, a
300kB image can be reduced to about 10kB.
Note: JPEG image compression is used extensively in Web page development. As compared
to bitmapped files (which have a .bmp extension), JPEG images (which have
a .jpg extension) occupy less space and hence can be downloaded faster when we access
a Web site.

4.4 Video

A video signal occupies a bandwidth of 5MHz. Using the Nyquist sampling theorem, we
need to sample the video signal at 10 million samples per second. If we use 8-bit PCM, the
video signal requires a data rate of 80Mbps. This is a very high data rate, and this coding
technique is not suitable for digital transmission of video. A number of video coding
techniques have been proposed to reduce the data rate.
For video coding, the video is considered a series of frames. At least 16 frames per second
are required to get the perception of moving video. Each frame is compressed using the
image compression techniques and transmitted. Using this technique, video can be
compressed to 64kbps, though the quality will not be very good.
Video encoding is an extension of image encoding. As shown in Figure 4.6, a series of
images or frames, typically 16 to 30 frames, is transmitted per second. Due to persistence
of vision, these discrete images appear as continuous moving video.
Accordingly, the data rate for transmission of video will be the number of frames multiplied
by the data rate for one frame. The data rate is reduced to about 64kbps in desktop video
conferencing systems where the resolution of the image and the number of frames are
reduced considerably. The resulting video is generally acceptable for conducting business
meetings over the Internet or corporate intranets, but not for transmission of, say, dance
programs, because the video will have many jerks.


Figure 4.6: Video coding through frames and pixels.


A variety of video compression standards have been developed. Notable among them are
MPEG-2, which is used for video broadcasting, and MPEG-4, which is used in video
conferencing applications; HDTV standards are used for high-definition television broadcasting.
Moving Picture Experts Group (MPEG) released a number of standards for video coding. The
following standards are used presently:
MPEG-2: This standard is for digital video broadcasting. The data rates are between 3 and
7.5Mbps. The picture quality will be much better than analog TV. This standard is used in
broadcasting through direct broadcast satellites.
MPEG-4: This standard is used extensively for coding, creation, and distribution of audiovisual content for many applications because it supports a wide range of data rates. The
MPEG-4 standard addresses the following aspects:
Representing audio-visual content, called media objects.
Describing the composition of these objects to create compound media objects.
Multiplexing and synchronizing the data.
The primitive objects can be still images, audio, text, graphics, video, or synthesized speech.
Video coding between 5kbps and 10Mbps, speech coding from 1.2kbps to 24kbps, audio
(music) coding at 128kbps, etc. are possible.
MP3 (MPEG Audio Layer 3), defined in the earlier MPEG-1/MPEG-2 standards, is widely
used for distribution of music at a 128kbps data rate.

For video conferencing, 384 kbps and 2.048 Mbps data rates are very commonly used to
obtain better quality as compared to 64kbps. Video conferencing equipment that supports
these data rates is commercially available.
MPEG-4 is used in mobile communication systems for supporting video conferencing while
on the move. It also is used in video conferencing over the Internet.
In spite of the many developments in digital communication, video broadcasting continues to
be analog in most countries. Many standards have been developed for digital video
applications. When optical fiber is used extensively as the transmission medium, perhaps then
digital video will gain popularity. The important European digital formats for video are given
here:
Multimedia (CIF format): width 360 pixels; height 288 pixels; 6.25 to 25 frames/second;
bit rate without compression 7.8 to 31 Mbps; with compression 1 to 3 Mbps.
Video conferencing (QCIF format): width 180 pixels; height 144 pixels; 6.25 to 25 frames
per second; bit rate without compression 1.9 to 7.8 Mbps; with compression 0.064 to 1
Mbps.
Digital TV (ITU-R BT.601 format): width 720 pixels; height 576 pixels; 25 frames per second;
bit rate without compression 166 Mbps; with compression 5 to 10 Mbps.
HDTV (ITU-R BT.709 format): width 1920 pixels; height 1250 pixels; 25 frames per second;
bit rate without compression 960 Mbps; with compression 20 to 40 Mbps.
Note: Commercialization of digital video broadcasting has not happened very fast. It is
expected that utilization of HDTV will take off in the first decade of the twenty-first
century.

4.5 Pulse Modulation and Sampling (Glover 5.2, page 163)

Referring to Figure 4.7, the principal transmitter subsystem is an Analog-to-Digital
Converter (ADC), which performs the sampling, quantization, and PCM encoding processes. A
good understanding of pulse modulation techniques needs to be established, as pulse
modulation is part of the ADC process and, as such, constitutes an important part of a digital
communications transmitter.
However, it is important to note that pulse modulations can also be used as modulation
schemes in their own right for analogue communications.


Figure 4.7: ADC Process in digital communications


4.5.1 Pulse Modulation

Pulse modulation describes the process whereby the amplitude, width, or position of
individual pulses in a periodic pulse train is varied (i.e., modulated) in sympathy with the
amplitude of a baseband information signal, g(t).
Figures 4.8(a) to (d) show the analog input signal, pulse amplitude modulation, pulse
width modulation, and pulse position modulation respectively.

Figure 4.8: Illustration of pulse amplitude, width and position modulation



Since pulse amplitude modulation (PAM) relies on changes in pulse amplitude, it requires a
larger signal-to-noise ratio (SNR) than pulse position modulation (PPM) or pulse width
modulation (PWM). This is essentially because a given amount of additive noise can change
the amplitude of a pulse (with rapid rise and fall times) by a greater fraction than the position
of its edges (Figure 4.9).

Figure 4.9: Effects of noise on pulses: (a) noise induced position and width errors completely
absent for ideal pulse; (b) small, noise induced, position and width errors for realistic pulse.

Pulse modulation may be an end in itself, allowing, for example, many separate
information-carrying signals to share a single physical channel by interleaving the individual
signal pulses as illustrated in Figure 4.10. Such pulse interleaving is called time division
multiplexing (TDM) and is discussed in detail in Lesson 7.

Figure 4.10: Time division multiplexing of two pulse amplitude modulated signals


Pulse modulation, however, may also represent an intermediate stage in the generation of
digitally modulated signals from the input analog signal. This process is called sampling.
Note: It is important to realise that pulse modulation is not, in itself, a digital but an
analogue technique.

4.5.2 Sampling (Glover & "Pulse Amplitude Modulation, Pulse Code Modulation and Sampling.pdf")

The process of selecting or recording the ordinate values of a continuous (usually analogue)
function at specific (usually equally spaced) values of its abscissa is called sampling. If the
function is a signal which varies with time then the samples are sometimes called a time
series. This is the most common type of sampling process encountered in electronic
communications although spatial sampling of images is also important.
There are obvious similarities between sampling and PAM. In fact, in many cases, the two
processes are indistinguishable. In an ADC process, sampling (or PAM, if you prefer) always
precedes quantization and PCM.
PCM modifies the pulses created by PAM to create a completely digital signal. This is done
by quantizing the PAM pulses, i.e., assigning integer values in a specific range to the
sampled instants.
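The full chain in Figure 4.11 (sample, then quantize, then encode each level as a binary word) can be sketched end to end. The 4-bit word size is chosen only to keep the output short; telephone PCM uses 8 bits as described in Section 4.2.1:

```python
import math

def sample(signal, fs, n_samples):
    """PAM step: take the signal's amplitude at uniform instants t = n/fs."""
    return [signal(n / fs) for n in range(n_samples)]

def quantize(x, levels):
    """Assign an integer code in [0, levels-1] to an amplitude in [-1, 1]."""
    return min(int((x + 1) * levels / 2), levels - 1)

def pcm_encode(signal, fs, n_samples, bits):
    """Sampling -> quantization -> binary PCM codewords."""
    codes = [quantize(x, 2 ** bits) for x in sample(signal, fs, n_samples)]
    return ["{:0{w}b}".format(c, w=bits) for c in codes]

# One cycle of a 1 kHz tone sampled at 8 kHz, coded as 4-bit PCM words:
words = pcm_encode(lambda t: math.sin(2 * math.pi * 1000 * t), 8000, 8, 4)
print(words)
```

The list printed at the end is the digital bit stream of Figure 4.13: the analog waveform has been reduced to one fixed-length binary word per sampling instant.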
Figure 4.11 shows the entire analog-to-digital conversion process in an ADC. Figure 4.12
illustrates the PAM process and Figure 4.13 illustrates the quantization and PCM process.

Figure 4.11: From analog signal to PCM digital code


Figure 4.12: The (a) analog input signal and (b) PAM signal

Figure 4.13: PCM process and transmission of PCM signal

