Abstract—The central unit of a Viterbi decoder is a data-dependent feedback loop which performs an add-compare-select (ACS) operation. This nonlinear recursion is the only bottleneck for a high-speed parallel implementation.

This paper presents a solution to implement the Viterbi algorithm by parallel hardware for high data rates. For a fixed processing speed of given hardware it allows a linear speedup in the throughput rate by a linear increase in hardware complexity. A systolic array implementation is presented.

The method described here is based on the underlying finite state feature. Thus it is possible to transfer this method to other types of algorithms which contain a data-dependent feedback loop and have a finite state property.

I. INTRODUCTION

To boost the achievable throughput rate of an implementation of an algorithm, parallel and/or pipelined architectures can be used. For high-speed implementations of an algorithm, architectures are desired that at maximum lead to a linear increase in hardware complexity for a linear speedup in the throughput rate if the limit of the computational speed of the hardware is reached. An architecture that achieves this linear dependency is referred to as a linear scale solution. It can be derived for a number of algorithms such as those of the plain feedforward type. Also, for algorithms containing linear feedback loops a linear scale solution can be found [1]. However, a linear scale solution has not yet been achieved for algorithms containing a data-dependent decision feedback. An algorithm of the latter type is the Viterbi algorithm (VA), which is related to dynamic programming [2].

In this paper, a linear scale solution (architecture) is presented which allows the implementation of the VA despite the fact that the VA contains a data-dependent decision feedback loop. In Section II the VA and its application are described. Section III introduces the new method which achieves the linear scale solution. The add-compare-select (ACS) unit of the VA can be implemented with this new method as a systolic array as shown in Section IV. Investigations concerning the implementation of the survivor memory are found in Section V. Conclusions form the contents of the summarizing Section VI.

II. PROBLEM DEFINITION

In 1967, the VA was presented as a method of decoding convolutional codes [3]. In the meantime it has been proven to be a solution for a variety of digital estimation problems. The VA is an efficient realization of optimum sequence estimation of a finite state discrete-time Markov process where the optimality can be achieved by criteria such as maximum likelihood or maximum a posteriori. For a tutorial on the VA see [4]. Below, the VA is explained only briefly to introduce the notation used.

The underlying discrete-time Markov process has a number of N_z states z_i. At time (n + 1)T a transition takes place from the state of time nT to the new state of time (n + 1)T. The transitions are independent (Markov process) and occur on a memoryless (noisy) channel.¹ The transition dynamics can be described by a trellis diagram, see Fig. 1. Note that parallel transition branches can also exist, as in Fig. 1 from z_1 to z_3. To simplify the notation, we assume T = 1 and the transition probabilities to be time invariant.

The VA estimates (reconstructs) the path the Markov process has taken through the trellis recursively (sequence estimation). At each new time instant n and for every state the VA calculates the optimum path which leads to that state, and discards all other paths already at time n as nonoptimal. This is accomplished by summing a probability measure called state metric Γ_{n,z_i} for each state z_i at every time instant n. At the next time instant n + 1, depending on the newly observed transition, a transition metric λ_{n,z_k→z_i} is calculated for all possible transition branches of the trellis.

The algorithm for obtaining the updated Γ_{n+1,z_i} can be described in the following way. It is called the add-compare-select (ACS) unit of the VA. For each state z_i and all its predecessor states z_k choose that path as optimum according to the following decision:

    Γ_{n+1,z_i} := maximum (Γ_{n,z_k} + λ_{n,z_k→z_i}).
                   (all possible z_k → z_i)

The surviving path has to be updated for each state and has to be stored in an additional memory called survivor memory. For a sufficiently large number of observed transitions (survivor depth B) it is highly probable that all N_z paths merge when they are followed back. Hence, the number B of transitions which have to be stored as the path leading to each state is finite, which allows the estimated transition of time instant n − B to be determined.

Note, when parallel branches [(a) and (b)] exist, one can find the maximum of their transition metrics before the ACS procedure is performed, since

    maximum (Γ_{n,z_k} + λ^(a)_{n,z_k→z_i}, Γ_{n,z_k} + λ^(b)_{n,z_k→z_i}) = Γ_{n,z_k} + maximum (λ^(a)_{n,z_k→z_i}, λ^(b)_{n,z_k→z_i}).

Therefore, the notation used here assumes that the maximum metric of each set of parallel branches is found prior to the ACS operation being performed. It is the one referred to as λ_{n,z_k→z_i}.

An implementation of the VA, called the Viterbi decoder

¹ For certain problems the VA has proven to be an efficient solution even for channels with memory (intersymbol interference [5], [6]).

Paper approved by the Editor for Coding Theory and Applications of the IEEE Communications Society. Manuscript received August 12, 1987; revised May 1, 1988. This paper was presented in part at ICC'88, Philadelphia, PA, June 12-15, 1988.
The authors are with Aachen University of Technology, Templergraben 55, 5100 Aachen, West Germany.
IEEE Log Number 8929111.
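The ACS recursion above can be sketched in software. The following is an illustrative model only (the dictionary encoding of the trellis, the state names, and all metric values are assumptions for the example, not taken from the paper):

```python
# Illustrative sketch of one ACS (add-compare-select) step of the VA.
# trellis[i] lists the predecessor states k of state i; lam[(k, i)] is the
# transition metric lambda_{n, z_k -> z_i}.  All names/values are invented.

def acs_step(gamma, trellis, lam):
    """One update Gamma_{n+1,i} = max over k of (Gamma_{n,k} + lambda_{n,k->i}).

    Returns the updated state metrics and, per state, the surviving
    predecessor (the decision that goes to the survivor memory).
    """
    new_gamma, survivor = {}, {}
    for i, preds in trellis.items():
        # add: candidate path metrics; compare/select: keep the maximum
        best_k = max(preds, key=lambda k: gamma[k] + lam[(k, i)])
        new_gamma[i] = gamma[best_k] + lam[(best_k, i)]
        survivor[i] = best_k
    return new_gamma, survivor

# Tiny 3-state example (integer metrics, values invented):
trellis = {0: [0, 1], 1: [0, 2], 2: [1, 2]}
gamma = {0: 0, 1: -10, 2: -20}
lam = {(0, 0): -5, (1, 0): -2, (0, 1): -10, (2, 1): -1,
       (1, 2): -3, (2, 2): -4}
gamma, survivor = acs_step(gamma, trellis, lam)
```

The data dependency is visible here: the input `gamma` of each step is the output of the previous step, which is exactly the nonlinear feedback loop the paper sets out to break.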
Fig. 1. General example of a trellis.

Fig. 3. Principle of the M-step trellis shown for a simple example (M = 3). (Figure labels: original trellis, 1-step, states z_1 … z_3, time axis nT, (n + 1)T.)
…feedback of step trellis (1s trellis, 1s-VA, etc.). An illustration for a simple example with M = 3 is given in Fig. 3. Now the Ms-trellis can be used for Viterbi decoding the same process, allowing the Ms-ACS loop to be computed M times slower.

Fig. 2. (Caption garbled in source; figure labels: input transition metric, updated state metrics Γ, survivor memory.)

Fig. 4. Rooted tree of Ms-transitions leaving state z_i for M = 3 of the example of Fig. 3.

Fig. 6. Timing diagram of the decoding cycles of the Ms-VD and the 1s-ACS units given the time scale of the Ms transitions.
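The M-step trellis idea can be illustrated as composing M one-step transition-metric matrices in the max-plus algebra: the Ms-transition metric from one state to another is the metric of the best M-step path between them, so the Ms-ACS loop only has to close once every M steps. This matrix encoding is an illustrative assumption, not the paper's notation:

```python
# Sketch: building M-step (Ms) transition metrics by max-plus composition
# of one-step (1s) metrics.  A[i][j] = metric of the 1s branch z_j -> z_i,
# or -inf where no branch exists.  All metric values are invented.
NEG = float("-inf")

def maxplus_mul(A, B):
    """Max-plus matrix product: C[i][j] = max over k of (A[i][k] + B[k][j])."""
    n = len(A)
    return [[max(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def ms_metrics(one_step, M):
    """Compose M one-step metric matrices into one Ms-transition matrix."""
    C = one_step[0]
    for A in one_step[1:M]:
        C = maxplus_mul(A, C)   # later steps multiply on the left
    return C

# Equivalence check for a 2-state example:
A1 = [[1, 0], [NEG, 2]]
A2 = [[0, 3], [1, NEG]]
gamma0 = [0, -1]
# M = 2 successive 1s-ACS updates...
g = gamma0
for A in (A1, A2):
    g = [max(A[i][k] + g[k] for k in range(2)) for i in range(2)]
# ...give the same state metrics as a single Ms-ACS update:
C = ms_metrics([A1, A2], 2)
g_ms = [max(C[i][k] + gamma0[k] for k in range(2)) for i in range(2)]
assert g == g_ms
```

The composition itself is pure feedforward work, which is why it can be parallelized; only the single Ms-ACS update per M steps remains recursive.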
Fig. 8. Systolic array solution of the Ms/1s-VD, clocked at time instances lθ (rate 1/θ; l is the time index). Here, λ_n is the complete set of 1s transition metrics of time instant n. (Figure labels: transition metric memory, Ms-ACS, rooted 1s trellis of z_i, N_z = 3 columns of 1s-ACS units, L = M − 1 rows, input of state metrics, 1-step-ACS unit, M-step-VD, updated state metrics.)

Fig. 9. N_z-fold pipelined and interleaved systolic Ms/1s-VD. Here, λ_l is the complete set of 1s transition metrics of time instant n. The sets of λ_l are fed in P = N_z times in a row. Therefore, the index l is incremented every N_z clock cycles. The array is clocked by rate P/θ = N_z/θ.
ACS cells have to be clocked in a way that the Ms-VD receives their results in the correct time slots, which yields that the rate of multiplexing has to be equal to the clock rate of the Ms-VD, 1/τ = 1/(MT). Now, for θ = τ = MT this leads to an overall synchronous system, in which at each time instant lT (l as time index) the computation of a set of N_z parallel 1s-VA's is started and the same number of 1s-VA's are completed. Therefore, instead of implementing a set of parallel multiplexed 1s-ACS units one can also implement a pipeline structure of these 1s-ACS units. The pipeline has the length M − 1 (θ = τ = MT ⇒ L = M − 1) and the computation of each 1s-VA is pipelined through this implementation. At the end of this pipeline the results of the last iteration of the 1s-VA's can simply be fed to the Ms-ACS unit.

This is shown in Fig. 8 for an example with N_z = 3, M = 9, and τ = θ = MT (⇒ L = M − 1). The systolic array is clocked at the time instants lT which results in a throughput rate of 1/T = M/θ = (L + 1)/θ. Each column of the array computes the (M − 1)-fold ACS procedure based on one rooted 1s trellis. Therefore, a parallel number of N_z columns has to be implemented. As a result this systolic array implementation consists of a number of N_z independent parallel columns, each made up of cells which communicate only in the top-down direction. Since the input (transition metrics) to all ACS units of one row is the same, only one conventional 1s transition metric unit has to be implemented for each row. To minimize the interconnection wiring between the rows of each column of the array the methods presented in [11]-[13] and/or of the cascade processor presented in [8] can be applied (here as a pure feedforward implementation). The systolic array can be transferred to a wavefront array solution which can be easier to clock in case of a large array (clock skew).

For any implementation (systolic/wavefront array or multiplexed version) the 1s-ACS units can be divided into a set of P pipelined (latched) parts, e.g., for P = 3 into three parts with part 1: add, part 2: compare and part 3: select. Therefore, depending on the number P of pipelined parts, P 1s-VA's can be interleaved in one ACS unit. An especially interesting pipelined architecture can be derived for the systolic array solution of Fig. 8, since the whole 1s-array is of simple feedforward structure. If one column is pipelined by P = N_z, then the processing performed by all N_z columns can be pipeline interleaved [15] in this one column, see Fig. 9. Hence, this new array is clocked at rate P/θ = N_z/θ. The main advantage of this pipelined systolic array is the better exploitation of processing hardware and the reduced amount of wiring required. The wiring is reduced in particular between the 1s-array and the Ms-VD. Here the simple array supplies the Ms-VD in parallel with N_z² Ms-transition metrics (equal to 1s-state metrics) where the pipelined array supplies the Ms-VD serially N_z times in a row with N_z metrics. This allows the Ms-VD to carry out a serial processing of its ACS procedure, which is another major advantage of the pipelined array.

V. SURVIVOR MEMORY

By introducing the Ms/1s approach a linear scale solution was presented for the ACS unit of a parallel high-speed Ms/1s-VD. Also a linear scale solution can easily be found for the transition metric unit. However, such a linear scale solution cannot be found for the total survivor memory needed. The size of the survivor memory of each 1s-VD is linearly
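The survivor memory mentioned above stores, for each state and time instant, the decision made by the ACS unit; after B (survivor depth) observed transitions the paths have merged with high probability, so the estimate of time n − B can be output by tracing back. A minimal traceback sketch, where the storage layout and all decision values are illustrative assumptions:

```python
# Sketch of survivor-memory traceback: decisions[n][i] holds the surviving
# predecessor of state i chosen by the ACS unit at time n.  Tracing back
# B steps from the currently best state recovers the estimated state at
# time n - B.  Layout and values are invented for illustration.

def traceback(decisions, start_state, B):
    state = start_state
    for n in range(len(decisions) - 1, len(decisions) - 1 - B, -1):
        state = decisions[n][state]   # follow the stored decision backwards
    return state

# Example with N_z = 3 states and B = 3 stored decision columns:
decisions = [
    {0: 1, 1: 2, 2: 2},   # decisions of time n - 2
    {0: 0, 1: 0, 2: 1},   # decisions of time n - 1
    {0: 1, 1: 1, 2: 0},   # decisions of time n
]
# Suppose state 2 currently has the best metric; trace back B = 3 steps:
est = traceback(decisions, 2, 3)   # estimated state at time n - B
```

This per-decoder memory grows with B, which is why, as the section notes, a linear scale solution exists for the ACS and transition metric units but not for the total survivor memory.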
FETTWEIS AND MEYR: PARALLEL VITERBI ALGORITHM IMPLEMENTATION 789
To show that the method described is of practical interest a view on our design is given here; we examine the VLSI

…cially the interaction between algorithm and architecture for high-speed parallel VLSI implementations.
790 IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 37, NO. 8, AUGUST 1989
Heinrich Meyr (M'75-SM'83-F'86) received the Dipl.-Ing. and Ph.D. degrees from the Swiss Federal Institute of Technology (ETH), Zurich, in 1967 and 1973, respectively.

From 1968 to 1970 he held research positions at Brown Boveri Corporation, Zurich, and the Swiss Federal Institute for Reactor Research. From 1970 to the summer of 1977 he was with Hasler Research Laboratory, Bern, Switzerland. His last position at Hasler was Manager of the Research Department. During 1974 he was a Visiting Assistant Professor with the Department of Electrical Engineering, University of Southern California, Los Angeles. Since the summer of 1977 he has been Professor of Electrical Engineering at the Aachen University of Technology (RWTH), Aachen, West Germany. His research focuses on synchronization, digital signal processing, and in particular, on algorithms and architectures suitable for VLSI implementation. In this area he is frequently in demand as a consultant to industrial concerns. He has published work in various fields and journals and holds over a dozen patents.

Dr. Meyr served as a Vice Chairman for the 1978 IEEE Zurich Seminar and as an International Chairman for the 1980 National Telecommunications Conference, Houston, TX. He served as Associate Editor for the IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING from 1982 to 1985, and as Vice President for International Affairs of the IEEE Communications Society.