http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html[04.10.2013 8:38:06]
Two-operand adders
Ripple carry adder (RCA)
The most straightforward implementation of a final stage adder for two n-bit operands is a ripple carry adder, which requires n full adders (FAs). The carry-out of the ith FA is connected to the carry-in of the (i+1)th FA. Figure 1 shows a ripple carry adder for n-bit operands, producing n-bit sum outputs and a carry out.
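As an illustrative sketch (Python as behavioural pseudocode for the hardware; the function names are mine, not AMG's), the ripple carry structure can be modelled bit by bit, LSB first:

```python
def full_adder(x, y, cin):
    # Sum bit is the XOR of the three inputs; carry-out is the majority.
    s = x ^ y ^ cin
    cout = (x & y) | (x & cin) | (y & cin)
    return s, cout

def ripple_carry_add(xs, ys, cin=0):
    """Add two n-bit operands given as bit lists, LSB first.
    The carry-out of FA i feeds the carry-in of FA i+1."""
    c = cin
    sum_bits = []
    for x, y in zip(xs, ys):
        s, c = full_adder(x, y, c)
        sum_bits.append(s)
    return sum_bits, c
```

Because the carry must ripple through all n full adders in the worst case, the delay of this adder grows linearly in n, which motivates the faster structures below.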
The carry recurrence of this adder, c(i+1) = xi yi + (xi XOR yi) ci, can be interpreted as stating that there is a carry-out of a stage either if one is generated at that stage or if an incoming carry is propagated through it. In other words, a carry is generated if both operand bits are 1, and an incoming carry is propagated if one of the operand bits is 1 and the other is 0. Therefore, letting Gi and Pi denote generation and propagation at the ith stage, we have:

Gi = xi yi
Pi = xi XOR yi
c(i+1) = Gi + Pi ci

for operand bits xi and yi and carry-in ci. Unrolling this recurrence expresses each carry directly in terms of the operand bits and c0, so all the carries can be calculated in parallel from the operands.
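A small model of this generate/propagate recurrence (illustrative names, not generator output); once Gi and Pi are formed from the operands alone, each carry depends only on them and on c0:

```python
def lookahead_carries(xs, ys, c0=0):
    """Carries from Gi = xi AND yi, Pi = xi XOR yi and
    c(i+1) = Gi OR (Pi AND ci), operands as bit lists LSB first."""
    g = [x & y for x, y in zip(xs, ys)]
    p = [x ^ y for x, y in zip(xs, ys)]
    carries = [c0]
    for gi, pi in zip(g, p):
        carries.append(gi | (pi & carries[-1]))
    return carries
```

The loop here evaluates the recurrence serially for clarity; in hardware the unrolled expressions are evaluated in parallel.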
Parallel prefix adders (Ladner-Fischer adder, Kogge-Stone adder, Brent-Kung adder, Han-Carlson adder)
Parallel prefix adders are constructed from a fundamental carry operator, denoted by the symbol o, defined as follows: (G'', P'') o (G', P') = (G'' + G' P'', P' P''), where G'' and G' indicate generations and P'' and P' indicate propagations. The fundamental carry operator is represented in Figure 4.
Figure 4. Carry operator.
A parallel prefix adder can be represented as a parallel prefix graph consisting of carry operator nodes. Figure 5 is the parallel prefix graph of a Ladner-Fischer adder. This structure has minimum logic depth, but its fan-out requirement grows as large as n/2.
Figure 5. 16-bit Ladner-Fischer adder.
Figure 6 is the parallel prefix graph of a Kogge-Stone adder. This structure has minimum logic depth and a full binary tree with minimum fan-out, resulting in a fast adder but with a large area.
Figure 6. 16-bit Kogge-Stone adder.
Figure 7 is the parallel prefix graph of a Brent-Kung adder. This adder is the opposite extreme: maximum logic depth and minimum area.
Figure 7. 16-bit Brent-Kung adder.
Figure 8 is the parallel prefix graph of a Han-Carlson adder. This adder has a hybrid design combining stages from the Brent-Kung and Kogge-Stone adders.
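A behavioural sketch of the prefix computation using the Kogge-Stone combination pattern and the fundamental carry operator (a hypothetical model written for illustration, not AMG output): at distance d = 1, 2, 4, ..., each bit position combines its (G, P) pair with the pair d positions below it.

```python
def kogge_stone_carries(xs, ys, c0=0):
    """Parallel-prefix carry computation, operands as bit lists LSB first.
    Applies (G'', P'') o (G', P') = (G'' + G' P'', P' P'') at doubling
    distances until every position spans bits [0..i]."""
    n = len(xs)
    gp = [(x & y, x ^ y) for x, y in zip(xs, ys)]  # per-bit (Gi, Pi)
    d = 1
    while d < n:
        new = list(gp)
        for i in range(d, n):
            g_hi, p_hi = gp[i]       # more-significant span
            g_lo, p_lo = gp[i - d]   # span immediately below it
            new[i] = (g_hi | (p_hi & g_lo), p_hi & p_lo)
        gp = new
        d *= 2
    # c(i+1) = G[0..i] OR (P[0..i] AND c0)
    return [c0] + [g | (p & c0) for g, p in gp]
```

Only ceil(log2 n) combination levels are needed, which is the minimum logic depth mentioned above.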
Conditional sum adder
The basic idea of the conditional sum adder is to generate two sets of outputs for a given group of operand bits, say k bits. Each set includes k sum bits and an outgoing carry: one set assumes that the eventual incoming carry will be zero, while the other assumes that it will be one. Once the incoming carry is known, we need only select the correct set of outputs without waiting for the carry to propagate through the k positions. In this generator, the n-bit operands are divided into two groups of n/2 bits each; each of these can be further divided into two groups of n/4 bits each, and this process can, in principle, be continued until groups of size 1 are reached. The above idea is applied to each group separately. Figure 9 depicts a conditional sum adder for 4-bit operands.
Carry-skip adder
Figure 11. Carry-skip block.
Figure 12 shows an 8-bit carry-skip adder consisting of four fixed-size blocks, each of size 2. The fixed block size should be selected so that the time of the longest carry-propagation chain is minimized. With equal ripple and skip delays, the optimal block size is k_opt = sqrt(n/2).
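Assuming unit delays for both rippling within a block and skipping across one (a common textbook simplification, not a statement about AMG's cell library), the worst-case carry path and the resulting optimal block size can be sketched as:

```python
import math

def carry_skip_delay(n, k, t_ripple=1.0, t_skip=1.0):
    """Worst-case carry path of an n-bit fixed-block carry-skip adder
    with block size k: ripple through the first block, skip across the
    n/k - 2 middle blocks, ripple through the last block."""
    blocks = n // k
    return 2 * k * t_ripple + (blocks - 2) * t_skip

def optimal_block_size(n):
    # Minimizing 2k + n/k - 2 over k gives k_opt = sqrt(n/2).
    return math.sqrt(n / 2)
```

With realistic, unequal ripple and skip delays the optimum shifts, which is why the variable-block-size organization below does better.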
Figure 12. 8-bit fixed-block-size carry-skip adder.
Figure 13 shows a 16-bit carry-skip adder consisting of seven variable-size blocks. The optimal organization uses L blocks with sizes k1, k2, ..., kL = 1, 2, 3, ..., 3, 2, 1, which reduces the ripple-carry delay through the outermost blocks.
Figure 13. 16-bit variable-block-size carry-skip adder.
Please note that the delay figures for carry-skip adders on the Reference data page are estimated from false paths rather than true paths. Figure 14 compares the delays of true paths and false paths in the case of the Hitachi 0.18 um process.
Array
An array is a straightforward way to accumulate partial products using a chain of adders. An n-operand array consists of n-2 carry-save adders. Figure 15 shows an array for 18 operands, producing 2 outputs, where CSA indicates a carry-save adder having three multi-bit inputs and two multi-bit outputs.
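The n-2 CSA chain can be modelled compactly on Python integers, since bitwise operations apply to whole words (an illustrative sketch, not AMG's netlist):

```python
def csa(x, y, z):
    """(3,2) carry-save adder on integer words: three operands in,
    a sum word and a carry word out, with x + y + z == s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def csa_array(operands):
    """Linear array: n operands need n - 2 CSAs; each CSA folds one
    new operand into the running carry-save pair."""
    s, c = csa(operands[0], operands[1], operands[2])
    for op in operands[3:]:
        s, c = csa(s, c, op)
    return s, c
```

No carries propagate inside the array; a single carry-propagate addition of s and c at the end recovers the binary result.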
Wallace tree
The Wallace tree is known for its optimal computation time when adding multiple operands down to two outputs using carry-save adders. It guarantees the lowest overall delay but requires the largest number of wiring tracks (vertical feedthroughs between adjacent bit-slices), the number of which is a measure of wiring complexity. Figure 16 shows an 18-operand Wallace tree, where CSA indicates a carry-save adder having three multi-bit inputs and two multi-bit outputs.
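The tree organization can be sketched as follows (an illustrative model on integer words; the real structure is wired per bit-slice): at every level the current operands are grouped three at a time into CSAs, so the operand count shrinks by a factor of about 2/3 per level, giving O(log n) levels instead of the array's O(n) chain.

```python
def wallace_reduce(operands):
    """Reduce a list of integer operands to two words whose sum equals
    the sum of all operands, using (3,2) carry-save steps per level."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        for i in range(0, len(ops) - 2, 3):
            x, y, z = ops[i], ops[i + 1], ops[i + 2]
            nxt.append(x ^ y ^ z)                              # sum word
            nxt.append(((x & y) | (x & z) | (y & z)) << 1)     # carry word
        nxt += ops[len(ops) - len(ops) % 3:]  # 0-2 leftovers pass through
        ops = nxt
    return ops
```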
Overturned-stairs tree
The overturned-stairs tree requires fewer wiring tracks than the Wallace tree and has lower overall delay than the balanced delay tree. Figure 18 shows an 18-operand overturned-stairs tree, where CSA indicates a carry-save adder having three multi-bit inputs and two multi-bit outputs.
Dadda tree
The Dadda tree is based on (3,2) counters. To reduce the hardware complexity, we allow the use of (2,2) counters in addition to (3,2) counters. Given the matrix of partial product bits, the number of bits in each column is reduced so as to minimize the number of (3,2) and (2,2) counters used.
This minimizes the number of counters for a given number of partial products (and hence has a possibility of reducing the amount of hardware involved and the execution time).
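The Dadda strategy can be summarized by the sequence of permitted column heights d1 = 2, d(j+1) = floor(1.5 * dj): a column of partial-product bits is only reduced far enough to reach the next smaller number in this sequence. A small sketch (my own helper names, for illustration):

```python
def dadda_sequence(max_height):
    """Dadda numbers 2, 3, 4, 6, 9, 13, ... up to and including the
    first value >= max_height."""
    seq = [2]
    while seq[-1] < max_height:
        seq.append(seq[-1] * 3 // 2)
    return seq

def dadda_stages(n):
    # Number of counter stages needed to reduce an n-high column
    # of partial-product bits down to height 2.
    return len([d for d in dadda_sequence(n) if d < n])
```

Reducing only as far as necessary at each stage is what keeps the counter count minimal.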
Multipliers
AMG provides parallel multipliers consisting of a Partial Product Generator (PPG), a Partial Product Accumulator (PPA), and a Final Stage Adder (FSA), as shown in Figure 21. The PPG stage first generates partial products from the multiplicand and multiplier in parallel. The PPA stage then performs multi-operand addition over all the generated partial products and produces their sum in carry-save form. Finally, the FSA converts the carry-save form to the corresponding binary output.
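A minimal unsigned sketch of the three-stage flow (simple AND-gate PPG and a linear CSA chain as the PPA; AMG's actual generators offer other PPG/PPA choices):

```python
def multiply(x, y, n=8):
    """Unsigned n-bit multiply in the PPG / PPA / FSA style."""
    # PPG: one shifted partial product x * y_i * 2^i per multiplier bit.
    pps = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    # PPA: accumulate all partial products in carry-save form.
    s, c = pps[0], pps[1]
    for pp in pps[2:]:
        s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
    # FSA: one final carry-propagate addition converts to binary.
    return s + c
```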
Constant-coefficient multipliers
AMG provides constant-coefficient multipliers in the form P = R * X, where R is an integer coefficient and X and P are the integer input and output. The hardware algorithms for constant-coefficient multiplication are based on multi-input, one-output addition algorithms (i.e., combinations of PPAs and FSAs). There are many possible multiplier structures for a specific coefficient R, and the complexity of these structures varies significantly with the value of R. We consider here the use of a special number representation called the Signed-Weight (SW) number system, which is useful for constructing compact PPAs. At present, the combination of the CSD (Canonic Signed-Digit) coefficient encoding technique with SW-based PPAs seems to provide the most practical hardware implementation of fast constant-coefficient multipliers. Accordingly, AMG supports such hardware algorithms for constant-coefficient multiplication, where the range of R is from -2^31 to 2^31.
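CSD recoding itself is standard and can be sketched as follows (the SW number system details are omitted here; function names are illustrative): each digit is in {-1, 0, 1} with no two adjacent nonzeros, so R * X needs one shifted add or subtract of X per nonzero digit only.

```python
def csd(r):
    """Canonic signed-digit recoding of a non-negative integer,
    digits in {-1, 0, 1}, LSB first, no two adjacent nonzeros."""
    digits = []
    while r:
        if r & 1:
            d = 2 - (r & 3)  # r mod 4 == 1 -> +1, r mod 4 == 3 -> -1
            r -= d
            digits.append(d)
        else:
            digits.append(0)
        r >>= 1
    return digits

def const_multiply(r, x):
    # Sum of shifted +/- x terms, one per nonzero CSD digit of r.
    return sum(d * (x << i) for i, d in enumerate(csd(r)))
```

For example, R = 7 recodes as 8 - 1 (two nonzero digits instead of three), halving the number of partial products.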
Multiply-accumulators (MACs)
AMG provides multiply-accumulators in the form P = X1*Y1 + X2*Y2 + ... + Xn*Yn, where each Xi indicates an integer variable or constant and each Yi indicates an integer variable. Figure 22 shows an n-term multiply-accumulator. A multiply-accumulator is generated by a combination of the hardware algorithms for multipliers and constant-coefficient multipliers: all the partial products from the PPGs are accumulated in carry-save form by a single PPA, and the carry-save form is converted to the corresponding binary output by an FSA. Figure 23 shows a simple multiply-accumulator with function P = X*Y + Z, which is frequently used in DSP systems.
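The key point, that every product's partial products share one PPA and one FSA, can be sketched for the unsigned case (an illustrative model with my own naming; assumes each Yi fits in n bits):

```python
def mac_carry_save(pairs, z=0, n=8):
    """P = X1*Y1 + ... + Xk*Yk + Z: pool the partial products of all
    products (plus the addend Z) into a single carry-save accumulation,
    then perform exactly one final carry-propagate addition."""
    pps = [z]
    for x, y in pairs:
        pps += [x << i for i in range(n) if (y >> i) & 1]
    s, c = pps[0], 0
    for pp in pps[1:]:
        s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
    return s + c  # the single FSA step
```

Compared with adding each product separately, this removes all but one carry-propagate addition from the critical path.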
Figure 24 shows how the MAC is employed in actual DSP applications. Structure (a) illustrates a typical situation, where the MAC performs a multiply-add operation in an iterative fashion. Structure (b) shows a faster design, where two product terms are computed simultaneously in a single iteration. The number of product terms computed per cycle can be increased further depending on the target application.
(a)
(b)
References
1. I. Koren, "Computer Arithmetic Algorithms", A K Peters, 2001.
2. B. Parhami, "Computer Arithmetic: Algorithms and Hardware Designs", Oxford University Press, 2000.
3. C. S. Wallace, "A suggestion for a fast multiplier", IEEE Trans. Computers, Vol. EC-13, pp. 14-17, February 1964.
4. D. Zuras and W. H. McAllister, "Balanced delay trees and combinatorial division in VLSI", IEEE J. Solid-State Circuits, Vol. SC-21, No. 5, pp. 814-819, October 1986.
5. Z. J. Mou and F. Jutand, "Overturned-stairs adder trees and multiplier design", IEEE Trans. Computers, Vol. C-41, No. 8, pp. 940-948, August 1992.
6. L. Dadda, "Some schemes for parallel multipliers", Alta Frequenza, Vol. 34, No. 5, pp. 349-356, March 1965.