Data Compression
Chapter 5
Definition 3 (Median) The median value of a random variable is that value such that approximately 50% of the sample point values are above the median, and approximately 50% of the sample point values are below the median.
The run length random variable R is the number of Continue (C) outcomes before the Stop outcome S occurs. We underlined "before" because the random variable of the geometric distribution includes the final outcome that terminates the sequence of Bernoulli trials.
Observation 2 (Differences) The difference between the geometric and the run length distribution is a simple matter of what gets counted.
Two random events, Success (denoted S) and Failure (denoted F), are the only possible outcomes of a Bernoulli trial, and they form the binary distribution upon which the random variable of the geometric distribution is based.
Suppose the run length we had to code was less than md. How many code bits would be best to tell the decoder the run was less than md? Once this is signaled to the decoder, the decoder has narrowed the possible values of r to 0, 1, ..., md - 1.
Given a run length compression application, a popular code used for run length encodings is the Golomb code [Gol66]. This code is actually a parameterized family of codes, and it has been proven an optimal prefix code for the geometric distribution.
Before describing the Golomb code, which is based
on the geometric distribution, we discuss Bernoulli trials and experiments.
1 + 1/2 + 1/4 + 1/8 + 1/16 + ··· = 2    (5.2)
We now relate the geometric distribution to a particular Bernoulli experiment, and then see how the Golomb code for run length coding is optimal for the geometric distribution. The geometric distribution is a parameterized family of distributions on the positive integers.
Definition 7 (Bernoulli trial) A Bernoulli trial, as defined in mathematical texts, is a single binary event that results either in Success or else in Failure.
The binomial distribution and the geometric distribution are both defined in terms of a Bernoulli experiment. Following a description of the binomial distribution, we describe the geometric distribution.

Because the trials are independent, we can simply calculate the probability of a sequence of N Bernoulli trials as the product of N probabilities. There are no conditional probabilities to worry about.
p(x) = (pS)^x (pC)^(N-x)    (5.4)
We need to calculate the probability that the binomial random variable X takes on a possible value x, where x denotes the number of successful outcomes out of N trials. To calculate the probability value assigned to x, denoted P(X = x), we need to know how many combinations (arrangements), denoted Comb(x, N), yield x successes and N - x failures. The number of arrangements Comb(x, N), called the "combination of N things taken x at a time", is defined by Eq 5.5 as follows:
p(S)^1 p(F)^(N-1)    (5.3)
However, the S can occur in any one of the N positions, so we need to multiply Eq 5.3 by N to get the probability of the outcome 1 for the binomial distribution. For larger numbers of successes, we need to use the number of combinations in which s successes occur out of N trials. Combinations are covered in Chapter One.
The binomial distribution is so named because the binomial coefficients are used in its description. The Bernoulli trials {1, ..., k, ..., N} are statistically independent; therefore, the probability pS(k) of the success outcome is the same value pS for every trial k.

C(N, x) = Comb(x, N) = N! / (x! (N - x)!)    (5.5)
p(x) = [N! / (x! (N - x)!)] (pS)^x (pC)^(N-x)    (5.6)
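As a numeric check on Eq 5.5 and Eq 5.6, a minimal sketch in Python (the function names comb_xN and binomial_p are ours, not from the text):

```python
from math import factorial

def comb_xN(x, N):
    # Eq 5.5: number of arrangements of x successes among N trials
    return factorial(N) // (factorial(x) * factorial(N - x))

def binomial_p(x, N, pS):
    # Eq 5.6: Comb(x, N) * pS^x * pC^(N - x), with pC = 1 - pS
    pC = 1.0 - pS
    return comb_xN(x, N) * pS**x * pC**(N - x)
```

Summing binomial_p over x = 0, ..., N gives 1, as any distribution must.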
According to the Law of Large Numbers, the fraction of successes in N Bernoulli trials tends to the parameter pS as N tends to infinity.
The multinomial distribution is a generalization of the binomial distribution from two symbols (Success and Failure) to M symbols. In N trials, how many of each of the M symbols occur? For example, for N = 7 trials with symbol counts 2, 3, and 2, the number of arrangements is:

7! / (2! 3! 2!)    (5.7)
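The count in Eq 5.7 generalizes to any list of symbol counts; a sketch (multinomial_count is our own name):

```python
from math import factorial

def multinomial_count(counts):
    # Number of arrangements of N = sum(counts) symbols in which
    # symbol i occurs counts[i] times: N! / (n1! n2! ... nM!)
    result = factorial(sum(counts))
    for n in counts:
        result //= factorial(n)
    return result
```

For the counts 2, 3, 2 of Eq 5.7 this yields 7!/(2! 3! 2!) = 210.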
The geometric distribution, based on random variable XG, has a different experiment based on Bernoulli trials: the number of Bernoulli trials up to and including the first success. Thus, random variable XG can take on values from the set {1, 2, ...}, according to whether the success outcome S (the experiment Stops) occurs on trial number 1, 2, and so on. The value x of random variable XG takes on values 1, 2, etc., up to an arbitrarily large integer. In contrast, the experiment for the binomial distribution always generates a string of N binary outcomes, and so ends on the N-th trial.
We now discuss an intuitive approach to the calculation of the mean of the geometric distribution, i.e., of the expected outcome E(XG). This value is the average number of trials per experiment. If the mean is 10, then on average, success occurs on the 10th trial.
The calculation of the mean is relatively simple at the intuitive level, with a few examples. We know that each experiment ends with the outcome whose probability is pS, which we take to be 0.5 or less, so the mean must be greater than one. Indeed, the mean is always greater than or equal to one (since there is always the one success that ends the experiment). On the other hand, before that success there is, on average, some number of C outcomes (the Bernoulli trial is a failure, so the experiment Continues).
For example, if pS = 0.25, then on average there are three C outcomes before the S that stops the sequence. Over a large number of experiments, the average must converge to a mean of 4 trials per experiment. In general,

E(XG) = 1 / pS.    (5.8)
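Eq 5.8 can be checked numerically by summing the expectation series directly (geometric_mean is an illustrative name, not from the text):

```python
def geometric_mean(pS, terms=100000):
    # Truncated expectation sum E(XG) = sum over x of x * pS * pC^(x-1);
    # Eq 5.8 says this converges to 1 / pS.
    pC = 1.0 - pS
    return sum(x * pS * pC**(x - 1) for x in range(1, terms + 1))
```

With pS = 0.25 the truncated sum comes out at the mean of 4 trials per experiment claimed above.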
5.4.2
Σ_{i=0}^{∞} a q^i    (5.9)

Σ_{i=0}^{∞} a q^i = a / (1 - q)    (5.10)
p(1) = pS,    p(2) = pS pC    (5.11)

p(x) = pS (pC)^(x-1),    x = 1, 2, ...    (5.12)
Suppose we are interested in summing the probabilities for x = 1, 2, ..., m - 1. We can also solve the problem by determining the probability that the random variable equals or exceeds value m, and subtracting that result from 1. Taking Eq 5.10 and Eq 5.12 together, assuming pS is known, we can determine the probability that the geometric distribution's random variable XG equals or exceeds some integer value md. Probability p(x = md) = pS (pC)^(md-1) is the first term. Factor pS is common to all terms, and the exponent of pC is increased by 1 for each successive term, beginning with exponent md - 1.

p(X >= m) = p(X = m) + p(X = m + 1) + ··· = Σ_{i=m}^{∞} pS (pC)^(i-1)    (5.13)

p(X >= m) = (pC)^(m-1)    (5.14)
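The closed form of Eq 5.14 can be checked against a direct summation of the tail of Eq 5.13 (geometric_tail is our own name):

```python
def geometric_tail(pS, m, terms=100000):
    # Direct sum of Eq 5.13: p(X >= m) = sum_{i=m}^inf pS * pC^(i-1)
    pC = 1.0 - pS
    return sum(pS * pC**(i - 1) for i in range(m, m + terms))
```

For pS = 0.3 and m = 5, the truncated sum agrees with pC^(m-1) = 0.7^4.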
One assumption for the geometric distribution is
that pC does not change as the run length increases.
In other words, no matter how long the run becomes,
the probability pC that the run continues remains the
same.
In practical terms, the differences between the geometric distribution random variable XG and the run length random variable R are listed below.

p(r) = pS (pC)^r,    r = 0, 1, 2, ...    (5.15)
For R, if success is achieved on the first trial, the outcome is 0. A run length R of 0 means we were looking for the first instance of the designated run symbol, but the run stop event S occurred. If the symbol alphabet is binary, then the value of the event that stopped the run is known to be the second symbol. In any case, binary or M-ary alphabet, we can use pC for the run symbol and pS for any other, such that pS = 1 - pC.
The scenario for a run mode within a data string from an M-ary alphabet A is simply a sequence of binary events, one of two atomic events: C and S. The run symbol maps to C and any other symbol to S. If we are doing runs within an M-ary alphabet, then the event S should be followed by the non-run member of the alphabet that stopped the run. Curiously, in using run length coding in M-ary alphabets, and triggering "run mode" with two successive symbols of the same value, one can achieve a compression gain even if the probability of the designated run symbol is less than 0.5. The reason is that one saves enough bits when the run symbol repeats to pay off the cost of being wrong the majority of the time. If the probability of "run stop" is greater than 0.5, then with arithmetic coding it costs less than a bit to be wrong, after which we code the unknown symbol under the probability model one would have used anyway. On the other hand, if the run symbol occurs, then no more bits are needed.
Σ_{i=0}^{md-1} p(i) ≈ 1/2    (5.17)

Alternatively, since the probability mass less than the median md should approximate the probability mass greater than the median:

p(r > md - 1) = Σ_{i=md}^{∞} p(i) ≈ 1/2    (5.18)

p(r > md - 1) = pS (pC)^md + pS (pC)^(md+1) + pS (pC)^(md+2) + ···    (5.19)

p(r > md - 1) = pS (pC)^md [1 + pC + (pC)^2 + ···]    (5.20)

p(r > md - 1) = pS (pC)^md / (1 - pC) = (pC)^md ≈ 1/2    (5.21)
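The collapse of the tail sum in Eq 5.19 to pC^md in Eq 5.21 can be verified numerically (run_length_tail is an illustrative name):

```python
def run_length_tail(pC, md, terms=100000):
    # Direct sum of Eq 5.19: p(r > md - 1) = sum_{i=md}^inf pS * pC^i
    pS = 1.0 - pC
    return sum(pS * pC**i for i in range(md, md + terms))
```

When pC is chosen so that pC^md = 1/2 (e.g. pC = 0.5^(1/5) with md = 5), the tail mass beyond md - 1 is one half, as Eq 5.21 requires of the median.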
Compare Eq 5.16 and Eq 5.21, and marvel that the median run length is indeed the Golomb number md. Note also that the median value for the geometric distribution's random variable is md + 1, because for a given pC, the number of Bernoulli trials per experiment is one greater than the run length value for the same experiment.
Gallager and VanVoorhis [GVV75] show the Golomb code to be optimal (assuming the constraint of an integer-length codeword for each run), and characterize its coding efficiency. Their work provides the equation that defines the breakpoint probabilities bounding the probability ranges applicable to each value of the Golomb parameter md, making precise the probability range for each value of md.
r = (md × Q) + R,    (5.22)

where Q is the quotient and R is the remainder of dividing r by md.
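Eq 5.22 is the heart of the encoder: the quotient Q goes out in unary and the remainder R in the truncated binary form the example tables use. A sketch (golomb_encode and truncated_binary are our own names, not from the text):

```python
from math import ceil, log2

def truncated_binary(R, md):
    # Remainders 0..(2^b - md - 1) get b-1 bits; the rest get b bits,
    # offset so that the codes remain prefix-free.
    b = ceil(log2(md))
    u = (1 << b) - md
    if R < u:
        return format(R, '0%db' % (b - 1))
    return format(R + u, '0%db' % b)

def golomb_encode(r, md):
    # Eq 5.22: r = md * Q + R; Q in unary (Q ones then a zero),
    # R in truncated binary.
    Q, R = divmod(r, md)
    return '1' * Q + '0' + truncated_binary(R, md)
```

For md = 5 this reproduces the example table below: golomb_encode(14, 5) gives 110111.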
The Mod-md counter is incremented as each run symbol is processed. The final step of the encoder left-shifts the least-significant k bits of that counter to the code string Cd(S). Regarding Mod-md counter operation when md is a power of 2, say 2^k: the Mod-md counter counts from 0 to 2^k - 1. When the value in the Mod-md counter is 2^k - 1 and is then incremented, the Mod-md counter rolls over to 0 and sets a Carry signal in the bit position of weight 2^k.
In the Encoder Algorithm below, it may be the case
that the sequence ends before a Stop symbol occurs.
Thus there are three conditions: handle the C symbol, handle the S symbol, and handle an End-of-File
condition.
Encoder Algorithm
A natural question for the decoder is: "How does the decoder know the last character?" One of the answers: tell the decoder how many data characters are to be decoded, and stop reconstructing the data string after decoding that many characters.
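The three encoder conditions named above can be sketched as a loop; a self-contained sketch, assuming runs are coded with a unary quotient and truncated-binary remainder (encode_runs and _golomb are our own names):

```python
from math import ceil, log2

def _golomb(r, md):
    # unary quotient + truncated-binary remainder, as in the examples
    Q, R = divmod(r, md)
    b = ceil(log2(md))
    u = (1 << b) - md
    tail = format(R, '0%db' % (b - 1)) if R < u else format(R + u, '0%db' % b)
    return '1' * Q + '0' + tail

def encode_runs(symbols, run_symbol, md):
    # Three conditions: C (run continues), S (run stops), End-of-File.
    out = []
    run = 0
    for s in symbols:
        if s == run_symbol:
            run += 1                      # handle the C symbol
        else:
            out.append(_golomb(run, md))  # handle the S symbol
            run = 0
    if run:                               # handle the End-of-File
        out.append(_golomb(run, md))      # condition: flush the run
    return out
```

The End-of-File branch here simply flushes the unfinished run; a real codec would also need the character-count convention above so the decoder knows where to stop.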
Examples
Example. md = 5. Notice that the space between the "unary" part and the "remainder" part is there for better understanding and readability of the code.
Run   Codeword
---   --------
0     0 00
1     0 01
2     0 10
3     0 110
4     0 111
5     10 00
6     10 01
7     10 10
8     10 110
9     10 111
10    110 00
11    110 01
12    110 10
Run   Codeword
---   --------
13    110 110
14    110 111
15    1110 00
16    1110 01
17    1110 10
18    1110 110
19    1110 111
20    11110 00
21    11110 01
22    11110 10
23    11110 110
24    11110 111
25    111110 00
Example. md = 13.

Run   Codeword
---   --------
0     0 000
1     0 001
2     0 010
3     0 0110
4     0 0111
5     0 1000
6     0 1001
7     0 1010
8     0 1011
9     0 1100
10    0 1101
11    0 1110
12    0 1111
13    10 000
14    10 001
15    10 010
16    10 0110
17    10 0111
18    10 1000
19    10 1001
20    10 1010
21    10 1011
22    10 1100
23    10 1101
24    10 1110
25    10 1111
26    110 000
27    110 001
Codeword: 110 111. To decode this codeword (md = 5): the unary part 110 gives quotient 2, contributing 5 × 2 = 10; the remainder part 111 decodes to 4; so the run length is 10 + 4 = 14.
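The decoding walk-through above can be written as a function that inverts the unary-plus-truncated-binary code (golomb_decode is our own name):

```python
from math import ceil, log2

def golomb_decode(bits, md):
    # Count leading 1s for the quotient, skip the terminating 0,
    # then read the truncated-binary remainder.
    # Returns (run length, number of bits consumed).
    Q = 0
    while bits[Q] == '1':
        Q += 1
    i = Q + 1                      # skip the terminating 0
    b = ceil(log2(md))
    u = (1 << b) - md
    R = int(bits[i:i + b - 1], 2) if b > 1 else 0
    if R < u:
        i += b - 1                 # short (b-1)-bit remainder
    else:
        R = int(bits[i:i + b], 2) - u
        i += b                     # long b-bit remainder
    return md * Q + R, i
```

golomb_decode('110111', 5) recovers the run length 14 of the worked example.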
Determination of Golomb md

The selection of the Golomb number md(pC) has an ingenious explanation that undoubtedly motivated its discovery. We know that if the probability of an event is 1/2, then it should be encoded with a single bit. Working backwards from that notion, we examine the unary part of the code and ask: to add one bit to the unary code, what should the probability be? The answer: we should encode as many instances of the "run continues" outcome, say md of them, each at probability pC, such that (pC)^md = 1/2.

Put another way, suppose that in run length coding we discover it is optimal when md successive C events increase the code length by one bit. In other words, suppose that md successive C events have a self-information of 1 bit. Converting the self-information to a probability, this means that md successive C events should have a probability of 1/2. Golomb's response has the insight of defining parameter md(pC) to be such that (pC)^md = 1/2. Taking the log to base 2 of each side, we have md log2 pC = -1. Solving for value md:

md = -1 / log2 pC.    (5.23)
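Eq 5.23 rarely yields an integer, so some rounding convention is needed; a sketch that rounds to the nearest positive integer (golomb_md is our own name, and the exact optimal choice is governed by Eq 5.24 rather than by rounding):

```python
from math import log2

def golomb_md(pC):
    # Eq 5.23: md = -1 / log2(pC), rounded to the nearest
    # positive integer as one simple convention.
    return max(1, round(-1.0 / log2(pC)))
```

For pC = 0.5 this gives md = 1, and for pC = 0.5^(1/5) (so pC^5 = 1/2) it gives md = 5.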
(pC)^md + (pC)^(md+1) <= 1 < (pC)^(md-1) + (pC)^md    (5.24)
Table 5.1 shows properties of the Golomb code.
Eq 5.24 of Gallager and VanVoorhis is used to identify ranges of run stop probabilities. The points were
calculated experimentally in the APL programming
environment for a given value of md by using test values of p until equality was almost reached in the left
half of Eq 5.24.
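The breakpoint search described above can be sketched without APL: walk md upward until the left half of Eq 5.24 is satisfied; the previous md then fails it, so the right half holds automatically (optimal_md is an illustrative name):

```python
def optimal_md(pC):
    # Smallest md with pC^md + pC^(md+1) <= 1; by construction the
    # previous md fails this test, so 1 < pC^(md-1) + pC^md
    # also holds, satisfying Eq 5.24.
    md = 1
    while pC**md + pC**(md + 1) > 1.0:
        md += 1
    return md
```

For pC = 0.5 this selects md = 1, and for pC = 0.9 it selects md = 7.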
Address   Code   Run   Shift
-------   ----   ---   -----
000       00     0     2
001       00     0     2
010       010    1     3
011       011    2     3
100       100    3     3
101       101    4     3
110       110    5     3
111       111    6     3
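The lookup table above can drive a decoder directly: index by the next 3 input bits, emit the run value, and advance by the listed shift. Since the code 00 is only 2 bits long, both addresses 000 and 001 map to it. A sketch (TABLE and table_decode are our own names):

```python
# Mirror of the Address/Code/Run/Shift table above.
TABLE = {
    '000': (0, 2), '001': (0, 2), '010': (1, 3), '011': (2, 3),
    '100': (3, 3), '101': (4, 3), '110': (5, 3), '111': (6, 3),
}

def table_decode(bits):
    runs = []
    i = 0
    while len(bits) - i >= 2:
        # Pad a short final window so it still forms a 3-bit address.
        window = (bits[i:i + 3] + '000')[:3]
        run, shift = TABLE[window]
        runs.append(run)
        i += shift
    return runs
```

Table-driven decoding of this kind trades a small memory for decoding several bits per lookup instead of one bit at a time.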
1000,10111,10010,0111,10101
11111110111111111111101111111101111110111111111110.
Bibliography
[BK74]
[Gol66] S. W. Golomb, "Run-Length Encodings," IEEE Transactions on Information Theory, IT-12(3):399-401, July 1966.
[GVV75] R. G. Gallager and D. C. Van Voorhis, "Optimal Source Codes for Geometrically Distributed Integer Alphabets," IEEE Transactions on Information Theory, IT-21(2):228-230, March 1975.
[Lan83]
[LR83]
[OTD90]
[Teu78]