Color space conversion is very important in many types of image processing applications, including video compression. This operation consumes up to 40% of the entire processing power of a highly optimised decoder. Therefore, techniques which efficiently implement this conversion are desired. This paper presents four different scalable architectures for the efficient implementation of two such color space converters using an FPGA based system. The distributed arithmetic technique and systolic design have been exploited to implement the proposed structures on the Celoxica RC1000-PP FPGA development board. The implementation approaches exhibit better performance when compared with existing implementations.

Keywords: Color space conversion, Systolic architecture, Distributed arithmetic, FPGA.

1 Introduction

Color is a visual sensation produced by light in the visible region of the spectrum incident on the retina. Since the human visual system has three types of color photoreceptor cone cells, three components are necessary and sufficient to describe a color [1].

A color space (also called a color model or color system) is a method by which we can specify, create and visualise color. There are many existing color spaces and most of them represent each color as a point in a three-dimensional coordinate system. Each color space is optimized for a well-defined application area [2]. The three most popular color models are RGB (used in computer graphics); YIQ, YUV and YCrCb (used in video systems); and CMYK (used in color printing). All of the color spaces can be derived from the RGB information supplied by devices such

Processing an image in the RGB color space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many broadcast, video and imaging standards use luminance and color difference video signals, such as YCrCb, making a mechanism for converting between formats necessary. Several cores for RGB to YCrCb conversion designed for FPGA implementation can be found on the market, such as the cores proposed by Amphion Ltd [3], CAST Inc [4] and ALMA Tech [5].

As part of an ongoing research project to develop a hardware accelerator for image and signal processing algorithms based on matrix computations at Queen’s University of Belfast [6, 7, 8, 9], this paper proposes the use of an FPGA as a low cost accelerator for RGB ↔ YCrCb Color Space Converters (CSCs) using Systolic Architecture (SA) and Distributed Arithmetic (DA) approaches. For the second approach, two architectures based on serial and parallel manipulation of pixels have been proposed.

The target hardware for the implementation and verification of the proposed architectures is the Celoxica RC1000-PP PCI based FPGA development board equipped with a Xilinx XCV2000E Virtex FPGA [10, 11]. The rest of the paper is organised as follows. A review of the conversion from R’G’B’ to Y’CrCb is given in Section 2. Sections 3 and 4 are concerned with the mathematical backgrounds and the descriptions of the proposed architectures based on SA and DA techniques respectively. The hardware implementations, with results and analysis, are then presented in Section 5. Finally, concluding remarks are given in Section 6.
ICGST-GVIP Journal, Volume 5, Issue1, December 2004
value. In this color space R’G’B’ is separated into a luminance part (Y’) and two chrominance parts (Cb

[Figure: coefficient pairs -0.148 / 1.164, -0.291 / -0.392, 0.439 / -0.813 and 128 / -276.8; signal labels SE and Cin]
Figure 2: Proposed systolic architecture (1)

[Figure: second systolic architecture. PE0 receives A23, A13, A03 and B3; PE1 receives A22, A12, A02 and B2; PE2 receives A21, A11, A01 and B1]

PEs; each PE has the same structure as the PEs used in the first architecture. The two architectures differ in throughput and in the area each one requires. It is worth noting that, using the first architecture, the entire computation can be carried out after M clock cycles and requires N × M PEs, while using the second architecture the entire computation can be carried out after 2 × (M − 1) clock cycles and requires M PEs.

Table 1 illustrates the performance obtained by the two proposed architectures. In our case the throughput rate has been defined as the reciprocal of the time between successive output vectors. It can be seen from the table that architecture (1) delivers data at a higher throughput rate than architecture (2).
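The trade-off between the two systolic architectures can be made concrete with a small calculation. The sketch below only evaluates the cycle and PE counts quoted above; the function name and the example sizes N = 3, M = 4 (a 3 × 4 conversion matrix) are illustrative assumptions and are not values taken from Table 1.

```python
def systolic_cost(n, m):
    """Return the cycle and PE counts quoted in the text for both architectures."""
    arch1 = {"cycles": m, "pes": n * m}          # architecture (1): M cycles, N x M PEs
    arch2 = {"cycles": 2 * (m - 1), "pes": m}    # architecture (2): 2 x (M - 1) cycles, M PEs
    return arch1, arch2


# Assumed example: a 3 x 4 constant matrix (N = 3 rows, M = 4 columns).
arch1, arch2 = systolic_cost(3, 4)
print(arch1, arch2)
```

Under these assumed sizes, architecture (1) trades three times as many PEs for a shorter computation, which is consistent with the throughput comparison made in the text.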
VLSI implementation. The advantage of a DA-based ROM approach is its efficiency of implementation. The basic operations required are a sequence of ROM look-ups, additions, subtractions and shift operations on the input data sequence [17]. Examples of the use of DA can be found in [17, 18, 19].

4.1 Proposed Architecture Based Serial Manipulation Approach

4.1.1 Mathematical Background

Consider the matrix-vector product given by the following equation:

$$C_i = \sum_{k=0}^{N-1} A_{ik} \times B_k \qquad (4)$$

where the $\{A_{ik}\}$ are L-bit constants and the $\{B_k\}$ are written in the unsigned binary representation as

$$B_k = \sum_{m=0}^{W-1} b_{k,m} \times 2^m \qquad (5)$$

where $b_{k,m}$ is the mth bit of $B_k$, which is zero or one, and W is the word-length used, which represents the resolution of each color component of a pixel.

Substituting (5) in (4),

$$C_i = \sum_{k=0}^{N-1} A_{ik} \times \left( \sum_{m=0}^{W-1} b_{k,m} \times 2^m \right) = \sum_{m=0}^{W-1} \left( \sum_{k=0}^{N-1} A_{ik} \times b_{k,m} \right) \times 2^m \qquad (6)$$

Define:

$$Z_m = \sum_{k=0}^{N-1} A_{ik} \times b_{k,m} \qquad (7)$$

so that

$$C_i = \sum_{m=0}^{W-1} Z_m \times 2^m \qquad (8)$$

The idea is that since the term $Z_m$ depends on the $b_{k,m}$ values and has only $2^N$ possible values, it is possible to precompute these values and store them in ROMs. An input set of N bits $(b_{0,m}, b_{1,m}, \ldots, b_{(N-1),m})$ is used as an address to retrieve the corresponding $Z_m$ value. The content of each ROM is different and depends on the coefficients of the constant matrix A. These intermediate results are accumulated over W clock cycles to produce the $C_i$ coefficients.

4.2 Case Study: Converting From R’G’B’ ↔ Y’CrCb

Since all the components are in the range of 0 to 255, 8 bits are enough to represent them. In our application (N = 4 and W = 8), $C_i$ can be computed as:

$$C_i = \sum_{m=0}^{7} Z_m \times 2^m \qquad (9)$$

where:

$$Z_m = \sum_{k=0}^{3} A_{ik} \times b_{k,m} \qquad (10)$$

3 ROMs (one for each row of matrix A), each of size $2^N = 2^4 = 16$, are needed in order to store the $2^4$ possible precomputed partial product values. Since

$$b_{3,m} = \begin{cases} 1 & \text{for } m = 0 \\ 0 & \text{for } m \neq 0 \end{cases} \qquad (11)$$

Equations 9 and 10 can be rewritten as:

$$C_i = \sum_{m=0}^{7} Z^*_m \times 2^m + A_{i3} \qquad (12)$$
where:

$$Z^*_m = \sum_{k=0}^{2} A_{ik} \times b_{k,m} \qquad (13)$$

It is worth mentioning that the size of the ROMs has been reduced to $2^3$. Table 2 gives the content of each ROM.

Table 2: Content of the ROM i (0 ≤ i ≤ 2)

b0,m   b1,m   b2,m   Content of the ROM i
0      0      0      0
0      0      1      Ai2
0      1      0      Ai1
0      1      1      Ai1 + Ai2
1      0      0      Ai0
1      0      1      Ai0 + Ai2
1      1      0      Ai0 + Ai1
1      1      1      Ai0 + Ai1 + Ai2

4.2.1 Proposed Architecture

Since our objective is to implement a core which performs two different color conversions (R’G’B’ ↔ Y’CrCb), 6 ROMs are needed (3 for each conversion). Figures 4 and 5 show the proposed core pins and its internal architecture respectively.

[Figure 4: CSC core pins. Inputs B0, B1, B2 and S; outputs C0[0:7], C1[0:7], C2[0:7]]

[Figure: the bits b0,m, b1,m and b2,m address two blocks of 3 ROMs (RGB to YCrCb, and YCrCb to RGB); each of the three PEs shifts the ROM output by m (<< m) and accumulates it (+) to produce C0, C1 and C2; control signals S and CE]
Figure 5: Serial CSC based DA Architecture

[Figure 6: ROM block addressed by b0,m, b1,m and b2,m (labels: ROM1, ROM2, P0, P1)]

The proposed architecture consists of three identical Processing Elements (PEs) and two memory blocks. Each PE comprises a parallel ACCumulator (ACC) and a right shifter, and each memory block consists of three ROMs of size $2^3$ each (see Figure 6). The content of each ROM is different and depends on the matrix A coefficients, which in turn depend on the conversion type.
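The bit-serial scheme of Equations 12 and 13 and Table 2 can be illustrated in software. The following is a minimal sketch, not the Handel-C design itself: it builds one 8-entry ROM per matrix row in the address order of Table 2, then accumulates one ROM output per "clock cycle", least significant bit first, starting from the offset A_i3. The function names, the LSB-first ordering, and the integer test coefficients are assumptions made for illustration only.

```python
def build_rom(row):
    """ROM for one row (A_i0, A_i1, A_i2); address bits are (b0, b1, b2) as in Table 2."""
    a0, a1, a2 = row
    rom = []
    for addr in range(8):
        b0, b1, b2 = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
        rom.append(a0 * b0 + a1 * b1 + a2 * b2)
    return rom


def da_serial(matrix, pixel, width=8):
    """Evaluate C_i = sum_m Z*_m * 2^m + A_i3 for each row, one bit position per step."""
    out = []
    for row in matrix:
        rom = build_rom(row[:3])
        acc = row[3]                                # accumulator starts at the offset A_i3
        for m in range(width):                      # one "clock cycle" per input bit
            b0, b1, b2 = ((v >> m) & 1 for v in pixel)
            addr = (b0 << 2) | (b1 << 1) | b2
            acc += rom[addr] * (1 << m)             # shift the ROM output by m and accumulate
        out.append(acc)
    return out


# Illustrative integer coefficients (hypothetical, chosen so the check is exact).
A = [[1, 2, 3, 4], [5, 6, 7, 8], [-1, 0, 2, -3]]
pixel = (200, 17, 255)
direct = [sum(a * b for a, b in zip(row[:3], pixel)) + row[3] for row in A]
result = da_serial(A, pixel)
print(result, direct)
```

Because each ROM holds every possible sum of the three coefficients, no multiplier is needed: the loop body uses only a look-up, a shift and an addition, which is the property the DA approach relies on.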
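4.3 Proposed Architecture Based Parallel Manipulation Approach

4.3.1 Mathematical Background

Consider an N × M image (Figure 7), where N is the image height and M is the image width.

Let each image pixel be represented by $b_{ijk}$ (0 ≤ i ≤ N − 1, 0 ≤ j ≤ M − 1, 0 ≤ k ≤ 2), where:

$b_{ij0} = R_{ij}$: the red component of the pixel in row i and column j
$b_{ij1} = G_{ij}$: the green component of the pixel in row i and column j
$b_{ij2} = B_{ij}$: the blue component of the pixel in row i and column j

The conversion of one pixel is given by:

$$\begin{bmatrix} c_{ij0} \\ c_{ij1} \\ c_{ij2} \end{bmatrix} = \begin{bmatrix} A_{00} & A_{01} & A_{02} & A_{03} \\ A_{10} & A_{11} & A_{12} & A_{13} \\ A_{20} & A_{21} & A_{22} & A_{23} \end{bmatrix} \times \begin{bmatrix} b_{ij0} \\ b_{ij1} \\ b_{ij2} \\ 1 \end{bmatrix}$$

where the $c_{ijk}$ represent the output image color space components and A, the 3 × 4 matrix above, represents one of the constant matrices in equations 1 and 2.

[Figure: output image components arranged as c_000 ... c_0(M−1)0, c_001 ... c_0(M−1)1, c_002 ... c_0(M−1)2, down to c_(N−1)00 ... c_(N−1)(M−1)2]

The $c_{ijk}$ elements (the output image color space components) can be computed using the following equation, in which the fourth input is the constant 1:

$$c_{ijk} = \sum_{m=0}^{3} A_{km} \times b_{ijm}$$

with

$$b_{ijm} = \sum_{l=0}^{W-1} b_{ijm,l} \times 2^l \quad (0 \le m \le 2) \qquad (17)$$

Using the same development as in the previous section,

$$c_{ijk} = \sum_{l=0}^{W-1} Z^*_l \times 2^l + A_{k3} \qquad (18)$$

where:

$$Z^*_l = \sum_{m=0}^{2} A_{km} \times b_{ijm,l}$$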
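To show how the bit-plane decomposition of Equations 17 and 18 differs from a bit-serial evaluation, the sketch below forms all W partial products for a pixel in one pass, one per bit position l, and then sums them with the offset A_k3. In hardware these partial products would be produced by separate PEs; here they are simply list entries. The function names and the integer coefficients are hypothetical and serve only to make the check exact.

```python
def rom_for_row(a0, a1, a2):
    """Eight precomputed sums Z*_l, addressed by the bit triple (b0, b1, b2)."""
    return [a0 * ((addr >> 2) & 1) + a1 * ((addr >> 1) & 1) + a2 * (addr & 1)
            for addr in range(8)]


def convert_pixel_parallel(A, pixel, width=8):
    """Evaluate c_ijk = sum_l Z*_l * 2^l + A_k3 with one partial product per bit position."""
    out = []
    for a0, a1, a2, offset in A:
        rom = rom_for_row(a0, a1, a2)
        partial = []
        for l in range(width):                      # conceptually, one PE per bit position l
            b0, b1, b2 = ((v >> l) & 1 for v in pixel)
            partial.append(rom[(b0 << 2) | (b1 << 1) | b2] * (1 << l))
        out.append(sum(partial) + offset)           # adder stage plus the offset A_k3
    return out


# Hypothetical integer matrix and pixel used only to check the decomposition.
A = [[1, 2, 3, 4], [5, 6, 7, 8], [-1, 0, 2, -3]]
converted = convert_pixel_parallel(A, (200, 17, 255))
print(converted)
```

The result equals the direct matrix-vector product; only the order of evaluation changes, which is what allows the work to be spread across several PEs.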
[Figure 7: image conversion between an N × M R’G’B’ image (R, G and B planes) and an N × M Y’CrCb image (Y, Cr and Cb planes); i indexes rows, j indexes columns and k indexes components]

[Figure: parallel DA datapath. For each output Cij0, Cij1 and Cij2, partial products shifted by <<1 to <<7 are added to the initial value a03 + 0.5, a13 + 0.5 or a23 + 0.5 respectively. PE: Processor Element; Delay: delay element]
[Figure: partial products (PP) handled by the eight PEs over successive clock cycles (CC)]

       1st CC   2nd CC   3rd CC   ...   7th CC   8th CC   9th CC
PE1    PP00k0   PP01k0   PP02k0   ...   PP06k0   PP07k0   PP08k0
PE2    Delay    PP00k1   PP01k1   ...   PP05k1   PP06k1   PP07k1
PE3    Delay    Delay    PP00k2   ...   PP04k2   PP05k2   PP06k2
PE4    Delay    Delay    Delay    ...   PP03k3   PP04k3   PP05k3
PE5    Delay    Delay    Delay    ...   PP02k4   PP03k4   PP04k4
PE6    Delay    Delay    Delay    ...   PP01k5   PP02k5   PP03k5
PE7    Delay    Delay    Delay    ...   PP00k6   PP01k6   PP02k6
PE8    Delay    Delay    Delay    ...   Delay    PP00k7   PP01k7

Outputs: C00, C01, ...

conversion can be carried out in (3 × 4 × N × M) clock cycles, where (3 × 4) is the constant matrix A size.

for i = 1 to N do            // scanning image rows
  for j = 1 to M do          // scanning image columns
    for k = 1 to 3 do        // scanning the three color components of a pixel
      for m = 1 to 4 do      // scanning columns of the constant conversion matrix
        cijk += akm x bijm
      end for
    end for
  end for
end for
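The nested loops above translate directly into a software reference model. The sketch below is one possible rendering, assuming that each pixel is extended with a constant 1 so that the fourth matrix column acts as the offset. The coefficient values are the widely used ITU-R BT.601 studio-range values, ordered Y, Cr, Cb; they are an assumption here, since equations 1 and 2 of the paper are not reproduced in this excerpt, and the function and variable names are illustrative.

```python
def convert_image(image, A):
    """image: N rows of M pixels (3-tuples); A: 3 x 4 matrix whose last column is the offset."""
    out = []
    for row in image:                               # scanning image rows
        out_row = []
        for pixel in row:                           # scanning image columns
            b = (*pixel, 1)                         # append the constant 1 for the offset column
            c = tuple(sum(A[k][m] * b[m] for m in range(4)) for k in range(3))
            out_row.append(c)
        out.append(out_row)
    return out


# Assumed coefficients (ITU-R BT.601, studio range); rows ordered Y, Cr, Cb.
RGB_TO_YCRCB = [
    [0.257, 0.504, 0.098, 16],
    [0.439, -0.368, -0.071, 128],
    [-0.148, -0.291, 0.439, 128],
]

converted = convert_image([[(0, 0, 0), (255, 255, 255)]], RGB_TO_YCRCB)
black, white = converted[0]
print(black, white)
```

With these assumed coefficients a black pixel maps to (16, 128, 128) and a white pixel to approximately (235, 128, 128), which is a convenient sanity check for any hardware implementation of the same matrix.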
The proposed CSC cores based on DA and SA techniques have been designed using the Handel-C language [20]. Handel-C is a high level language that is at the heart of a hardware compilation system known as the Celoxica Development Kit (DK) [21], which is designed to compile programs written in a C-like

The implementations target the Celoxica RC1000 PCI-based FPGA development board. The RC1000-PP board used is a standard PCI bus card equipped with the Virtex-E2000 FPGA chip (package: bg560, speed grade 6). It has 8 MBytes of SRAM directly connected to the FPGA in four 32-bit wide memory banks. All are accessible by the FPGA and any device on the PCI bus in parallel [10]. A schematic block

[Figure: system-level model; HW/SW partitioning into C code (host processor) and Handel-C code (FPGA hardware); external cores (Schematic, VHDL, CoreGen ...); C compiler (MS Visual C++) and Celoxica DK2 IDE; simulation; EDIF; Xilinx layout tools (FPGA place & route); FPGA bitstream (full configuration) and, through Xilinx JBits, FPGA bitstream (partial configuration); host processor program; prototyping platform]
Figure 11: Handel-C design flow

[Figure: XCV2000E FPGA connected to memory banks Bank0 to Bank3, with PCI, DMA, Control and 8 Bit Status blocks]
Figure 12: RC1000-PP block diagram

[Figure: 3 × 3 array of PEs (PE00 to PE22) with coefficients A00 to A22, inputs B0, B1 and B2, initial values A03 + 0.5, A13 + 0.5 and A23 + 0.5, and outputs C0, C1 and C2]
Figure 13: Modified systolic architecture (1)

[Figure: linear array of PEs fed with the matrix coefficients and the inputs B1 and B2, initial values A03 + 0.5, A13 + 0.5 and A23 + 0.5, and outputs C2, C1 and C0]
Figure 14: Modified systolic architecture (2)
Ym / Yij0,l   Crm / Crij1,l   Cbm / Cbij2,l   ROM1     ROM2     ROM3
0             0               0               0        0        0
0             0               1               0        -0.392   0
0             1               0               1.596    -0.813   1.596
0             1               1               1.596    -1.025   1.596
1             0               0               1.164    1.164    1.164
1             0               1               1.164    0.772    1.164
1             1               0               2.76     0.351    2.76
1             1               1               2.76     -0.041   2.76
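The ROM entries tabulated above are real-valued, whereas this section states that they are stored in a 13-bit fixed-point format (8 bits for the integer part, 5 bits for the fractional part) and that outputs are rounded by preloading each accumulator with an extra 0.5. The sketch below shows one way such quantisation and rounding could be modelled. It assumes a two's-complement word whose 8 integer bits include the sign, which the text does not specify, and all function names are hypothetical.

```python
FRAC_BITS = 5      # fractional bits, as stated in the text
TOTAL_BITS = 13    # total word length, as stated in the text


def to_fixed(x):
    """Quantise x to a signed 13-bit word with 5 fractional bits (two's complement assumed)."""
    word = round(x * (1 << FRAC_BITS))
    lo, hi = -(1 << (TOTAL_BITS - 1)), (1 << (TOTAL_BITS - 1)) - 1
    if not lo <= word <= hi:
        raise OverflowError(f"{x} does not fit in {TOTAL_BITS} bits")
    return word


def to_real(word):
    """Value represented by a fixed-point word."""
    return word / (1 << FRAC_BITS)


def round_output(acc_word):
    """Round to an integer by adding 0.5 and truncating, as done via the (A_i3 + 0.5) preload."""
    return (acc_word + (1 << (FRAC_BITS - 1))) >> FRAC_BITS


y_word = to_fixed(1.164)        # the Y coefficient from the table above
g_word = to_fixed(-0.392)       # a negative coefficient from the table above
rounded = round_output(to_fixed(2.76))
print(y_word, to_real(y_word), g_word, rounded)
```

With 5 fractional bits the coefficient 1.164 is stored as 37/32 = 1.15625, which illustrates the kind of representation difference that can contribute to a small RMS error between software and hardware results.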
• Duplicating the ROMs using the same implementation approach used for the first architecture (with a selector signal which allows the user to choose the appropriate converter); or

• Setting the contents of the ROMs in advance, depending on the desired conversion.

The precomputed partial products are stored in the ROMs using a 13-bit fixed point representation (8 bits for the integer part and 5 bits for the fractional part). 13-bit arithmetic is used inside the architecture. The inputs and outputs of the two architectures are represented using 8 bits and the outputs are rounded. As in the SA-based CSC implementation, the same rounding technique is applied here. The initial value for each accumulator ACCi is set in advance to (Ai3 + 0.5), where (0 ≤ i ≤ 2).

The MACs and parallel signed adders have been implemented using Xilinx’s CoreGen utility [22]. The shifters and ROM initialisation have been implemented using VHDL. All design components have been connected together using Handel-C.

In order to make a fair and consistent comparison with the existing FPGA based color space converters, the XCV50E-8 FPGA device has been targeted. Table 6 illustrates the performance obtained for the proposed architecture in terms of the area consumed and the speed which can be achieved.

The proposed DA architectures based on serial and parallel manipulation approaches show significant improvements in comparison with the existing implementations [3, 4, 5], which perform the R’G’B’ to Y’CrCb conversion, in terms of the area consumed and the maximum running clock frequency. The advantage of the two other proposed architectures is that they can be used for any color space conversion based on equation 3.

Table 7 compares the hardware and software implementations in terms of the computation time and the RMS error (which is due to the use of different data representations in the two implementations), when using the second proposed DA architecture. The RMS error is defined as:

$$\mathrm{RMSError} = \sqrt{\frac{1}{N \times M} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left( I_{soft}(i,j) - I_{hard}(i,j) \right)^2}$$

Table 7 shows the test results for two different images (the Baboon image (512 × 512) and the Pepper image (256 × 256)). It can be seen that the same converted image can be obtained faster when using the FPGA implementation, with a minimum error (due to the use of different data representations in the two implementations).

Table 7 [the original, software implementation and hardware implementation image columns are not reproduced]

Original image       Component   RMS error   Computation time (ms)
                                             Software   Hardware
Baboon (512 × 512)   Y           0.487       126        1.2
                     Cr          0.630
                     Cb          0.461
Pepper (256 × 256)   Y           0.684       43         0.28
                     Cr          0.830
                     Cb          0.396

6 Conclusion

Processing an image in the RGB color space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many broadcast, video and imaging standards use luminance and color difference video signals, such as YCrCb, making a mechanism for converting between formats necessary. In this paper, novel scalable architectures based on DA and SA approaches for R’G’B’ ↔ Y’CrCb conversions, which require enormous computing power, have been reported. The implementation results show the effectiveness of the DA approach. The performance of the proposed architecture, in terms of the area used and the maximum running frequency, has been assessed and shows that the proposed system requires less area and can be run at a higher frequency when compared with existing systems. The proposed systolic structures can perform other conversions based on matrix-vector multiplication, while the DA structure can be used for other conversions by modifying the content of the ROMs.

References

[3] Datasheet (www.amphion.com), “Color Space Converters,” Amphion Semiconductor Ltd, DS6400 V1.1, April 2002.

[4] Application Note (www.cast-inc.com), “CSC Color Space Converter,” CAST Inc, April 2002.

[5] Datasheet (www.alma-tech.com), “High Performance Color Space Converter,” ALMA Technologies, May 2002.

[6] F. Bensaali and A. Amira, “Design and Efficient FPGA Implementation of an RGB to YCrCb Color Space Converter Using Distributed Arithmetic,” Proceedings of the International Conference on Field Programmable Logic (FPL), Lecture Notes in Computer Science, to be published by Springer Verlag, August 2004.
[9] F. Bensaali, A. Amira, I.S. Uzun and A. Ahmedsaid, “Efficient Implementation of Large Parallel Matrix Product for DOTs,” The International Conference on Computer, Communication and Control Technologies (CCCT’03), Florida, USA, July 2003.

[10] Datasheet (www.celoxica.com), “RC1000 Reconfigurable hardware development platform,” Celoxica Ltd., 2001.

[11] URL: www.xilinx.com

[12] A. Albiol, L. Torres and E.J. Delp, “An unsupervised color image segmentation algorithm for face detection applications,” Proceedings of the International Conference on Image Processing, Vol. 2, pp 681-684, October 2001.

[13] P. Kuchi, P. Gabbur, P.S. Bhat and S. David, “Human Face Detection and Tracking using Skin Color Modelling and Connected Component Operators,” The IETE Journal of Research, Special issue on Visual Media Processing, May 2002.

[14] M. Bartkowiak, “Optimisations of Color Transformation for Real Time Video Decoding,” Digital Signal Processing for Multimedia Communications and Services, EURASIP ECMCS 2001, Budapest, September 2001.

[15] J.L. Mitchell and W.B. Pennebaker, “MPEG Video Compression Standard,” Chapman & Hall, 1996.

[16] J. Bracamonte, P. Standelmann, M. Ansorge and F. Pellandini, “A Multiplierless Implementation Scheme for the JPEG Image Coding Algorithm,” IEEE Nordic Signal Processing Symposium, Kolmarden, Sweden, June 13 - 15, 2000.

[17] A. Amira, “An FPGA Based Parameterisable System For Discrete Hartley Transforms Implementation,” Proceedings of The International Conference on Image Processing (ICIP), Barcelona, Spain, September 2003.

[18] H. Ohlsson and L. Wanhammar, “Maximally fast numerically equivalent state-space recursive digital filters using distributed arithmetic,” Proceedings of the IEEE Symposium in Nordic Signal Processing (NORSIG2000), Kolmarden, Sweden, pp 295-298, June 2000.

[19] O. Gustafsson and L. Wanhammar, “Implementation of a Digital Beamformer in an FPGA using Distributed Arithmetic,” Proceedings of the IEEE Symposium in Nordic Signal Processing (NORSIG2000), Kolmarden, Sweden, pp 295-298, June 2000.

[20] Manual (www.celoxica.com), “Handel-C Language Reference Manual,” Celoxica Ltd., 2003.

[21] URL: www.celoxica.com

[22] Application Note, “Xilinx CoreGen and Handel-C,” AN 58 v1.0, 2001.

[23] M. Defossez, “Using the Virtex Look-Up Tables,” Xilinx Application Note (www.xilinx.com).