
ICGST-GVIP Journal, Volume 5, Issue1, December 2004

Design and Implementation of Efficient Architectures for Color Space Conversion
F. Bensaali and A. Amira
School of Computer Science, Queen’s University of Belfast,
University Road, BT7 1NN, Belfast,UK
[f.bensaali, a.amira]@qub.ac.uk

Abstract

Color space conversion is very important in many types of image processing applications, including video compression. This operation consumes up to 40% of the entire processing power of a highly optimised decoder. Therefore, techniques which efficiently implement this conversion are desired. This paper presents four different scalable architectures for the efficient implementation of two such color space converters using an FPGA-based system. The distributed arithmetic technique and systolic design have been exploited to implement the proposed structures on the Celoxica RC1000-PP FPGA development board. The implemented approaches exhibit better performance when compared with existing implementations.

Keywords: Color space conversion, Systolic architecture, Distributed arithmetic, FPGA.

1 Introduction

Color is a visual sensation produced by the light in the visible region of the spectrum incident on the retina. Since the human visual system has three types of color photoreceptor cone cells, three components are necessary and sufficient to describe a color [1].

Color spaces (also called color models or color systems) are methods by which we can specify, create and visualise color. There are many existing color spaces and most of them represent each color as a point in a three-dimensional coordinate system. Each color space is optimized for a well-defined application area [2]. The three most popular groups of color models are RGB (used in computer graphics); YIQ, YUV and YCrCb (used in video systems); and CMYK (used in color printing). All of the color spaces can be derived from the RGB information supplied by devices such as cameras and scanners.

Processing an image in the RGB color space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many broadcast, video and imaging standards use luminance and color difference video signals, such as YCrCb, making a mechanism for converting between formats necessary. Several cores for RGB to YCrCb conversion designed for FPGA implementation can be found on the market, such as the cores proposed by Amphion Ltd [3], CAST Inc [4] and ALMA Tech [5].

As part of an ongoing research project at Queen’s University of Belfast to develop a hardware accelerator for image and signal processing algorithms based on matrix computations [6, 7, 8, 9], this paper proposes the use of an FPGA as a low-cost accelerator for RGB ↔ YCrCb Color Space Converters (CSCs) using Systolic Architecture (SA) and Distributed Arithmetic (DA) approaches. For the second approach, two architectures based on serial and parallel manipulation of pixels have been proposed.

The target hardware for the implementation and verification of the proposed architectures is the Celoxica RC1000-PP PCI-based FPGA development board equipped with a Xilinx XCV2000E Virtex FPGA [10, 11]. The rest of the paper is organised as follows. A review of the conversion from R’G’B’ to Y’CrCb is given in Section 2. Sections 3 and 4 are concerned with the mathematical backgrounds and the descriptions of the proposed architectures based on the SA and DA techniques respectively. The hardware implementations, with results and analysis, are then presented in Section 5. Finally, concluding remarks are given in Section 6.

2 Color Space Conversion: A Review

As mentioned in the introduction, many color models have been proposed, each oriented towards supporting a specific task or solving a particular problem. Described below are the two color systems selected for our study, which are used in many image processing applications.

2.1 RGB Color Space

The RGB color space is a simple and robust color definition. RGB uses three numerical components to represent a color. This color space can be thought of as a three-dimensional coordinate system whose axes correspond to the three components, R or Red, G or Green, and B or Blue. RGB is the color space that computer displays use. It corresponds most closely to the behavior of the human eye [1]. RGB is an additive color system: the three primary colors red, green, and blue are added to form the desired color. For a true color image, the red, green, and blue components of a pixel are each eight bits wide, which gives about sixteen million ($2^{24}$) possible colors. Each component has a range of 0 to 255, with all three 0s producing black and all three 255s producing white [1]. In the rest of this paper, the gamma-corrected RGB values are denoted R’G’B’.

2.2 Y’CrCb Color Space

Y’CrCb is a scaled and offset version of the YUV color space where Y represents luminance (or brightness), U represents color, and V represents the saturation value. In this color space R’G’B’ is separated into a luminance part (Y’) and two chrominance parts (Cb and Cr). Y’ is defined to have a range of 16 to 235; Cb and Cr have a range of 16 to 240 [1].

2.3 Converting From R’G’B’ to Y’CrCb

The calculation of Y’CrCb color components from R’G’B’ components consumes up to 40% of the processing power in a highly optimised decoder [14]. Accelerating this operation would therefore accelerate the whole process. A color in the R’G’B’ color space is converted to the Y’CrCb color space using the following equation:

$$
\begin{aligned}
Y' &= 0.257R' + 0.504G' + 0.098B' + 16 \\
Cr &= 0.439R' - 0.368G' - 0.071B' + 128 \\
Cb &= -0.148R' - 0.291G' + 0.439B' + 128
\end{aligned}
\tag{1}
$$

The inverse conversion can be carried out using the following equation:

$$
\begin{aligned}
R' &= 1.164Y' + 1.596Cr - 222.912 \\
G' &= 1.164Y' - 0.813Cr - 0.392Cb + 135.616 \\
B' &= 1.164Y' + 2.017Cb - 276.8
\end{aligned}
\tag{2}
$$

Figure 1 shows the direct mapping of equations 1 and 2.

Figure 1: General Block Diagram for R’G’B’ ↔ Y’CrCb CSC
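As a quick software reference for equations 1 and 2, the two conversions can be sketched directly in floating point. This is only an illustrative model, not the hardware design described in this paper; the function names `rgb_to_ycrcb` and `ycrcb_to_rgb` are hypothetical.

```python
def rgb_to_ycrcb(r, g, b):
    """Equation 1: gamma-corrected R'G'B' (0..255) to Y'CrCb."""
    y = 0.257 * r + 0.504 * g + 0.098 * b + 16
    cr = 0.439 * r - 0.368 * g - 0.071 * b + 128
    cb = -0.148 * r - 0.291 * g + 0.439 * b + 128
    return y, cr, cb


def ycrcb_to_rgb(y, cr, cb):
    """Equation 2: Y'CrCb back to gamma-corrected R'G'B'."""
    r = 1.164 * y + 1.596 * cr - 222.912
    g = 1.164 * y - 0.813 * cr - 0.392 * cb + 135.616
    b = 1.164 * y + 2.017 * cb - 276.8
    return r, g, b


if __name__ == "__main__":
    pixel = (100, 150, 200)
    converted = rgb_to_ycrcb(*pixel)
    print("Y'CrCb:", converted)
    print("back to R'G'B':", ycrcb_to_rgb(*converted))
```

Because the published coefficients are rounded to three decimals, a round trip through both functions recovers the original pixel only approximately, which is why the outputs are rounded to 8 bits in practice.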
Decomposing an R’G’B’ color image into one luminance image and two chrominance images is the method that has been used in most commercial applications, such as face detection [12, 13], as well as the JPEG and MPEG imaging standards [14, 15, 16].

The suitability of the Y’CrCb color space for these kinds of applications is due to:
• The lack of correlation among the Y’CrCb components, so each component can be analysed separately.
• Human eyes are more sensitive to changes in brightness than in color, so the Cr and Cb components can be compressed more heavily than the Y’ component to get a better compression ratio.

3 Proposed CSC based SA

An SA represents a network of Processing Elements (PEs) that rhythmically compute and pass data through the system. The main features of systolic systems are modularity and regularity, which are important in FPGA implementations [7]. In this section two architectures based on the bit-parallel SA approach for CSC implementation are described.

The CSC core implements the following mathematical formula to convert from one space to another:

$$
\begin{bmatrix} C_0 \\ C_1 \\ C_2 \end{bmatrix}
=
\begin{bmatrix}
A_{00} & A_{01} & A_{02} & A_{03} \\
A_{10} & A_{11} & A_{12} & A_{13} \\
A_{20} & A_{21} & A_{22} & A_{23}
\end{bmatrix}
\times
\begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ 1 \end{bmatrix}
\tag{3}
$$

where $C_i$ ($0 \le i \le 2$) and $B_i$ ($0 \le i \le 3$, with $B_3 = 1$) represent the output and input color components respectively. Equation 3 can be mapped into the two proposed architectures as shown in Figures 2 and 3.

Figure 2: Proposed systolic architecture (1) (3 × 4 array of PEs; inset: PE structure with storage elements (SE), signed integer multiplier, signed integer adder and logical right shift by f bits)

Figure 3: Proposed systolic architecture (2) (linear array of four PEs)

Since the matrix A coefficients are real numbers, floating-point or fixed-point representations can be used to perform the multiplication. If the range of real values that must be represented is small, or can be scaled in order to make it smaller, fixed-point arithmetic is one way of providing cheap, fast non-integer support. Fixed-point arithmetic is appropriate for our application because, as can be seen from equations 1 and 2, the range of the values is small.

The first architecture consists of twelve identical PEs (the number of PEs is equal to N × M, where N and M are the number of rows and columns of the matrix A respectively). Each PE comprises a parallel fixed-point Multiply ACcumulator (MAC), a set of Storage Elements (SEs) where the coefficients $A_{ik}$ and $B_k$ are stored, and another storage element for pipelining the partial products. The MAC contains a parallel signed integer multiplier, a parallel signed integer adder and a right shifter, which shifts the multiplier output by the number of bits used for the fractional part of the color components. The input data elements $A_{ik}$ are fed in a parallel fashion while the vector elements $B_k$ are fed in a parallel fashion and remain fixed in their corresponding PE cell during the entire computation. Because of the value range of the R’G’B’ and Y’CrCb components, the input elements are represented with 13 bits (8 bits for the integer part and 5 bits for the fractional part).

The second architecture consists of four identical PEs; each PE has the same structure as the PEs used in the first architecture. The two architectures differ in throughput and in the area required. It is worth noting that using the first architecture, the entire computation can be carried out after M clock cycles and requires N × M PEs, while using the second architecture the entire computation can be carried out after 2 × (M − 1) clock cycles and requires M PEs.

Table 1 illustrates the performance obtained by the two proposed architectures. In our case the throughput rate has been defined as the reciprocal of the time between successive output vectors. It can be seen from the table that architecture (1) delivers data at a higher throughput rate when compared with architecture (2).

The two proposed architectures (1) and (2) can be used for applications requiring a matrix-vector product, such as 3D affine transformations [8].

4 Proposed CSC Based DA

Since color space conversion can be expressed as a Matrix-Vector (MV) multiplication, two algorithms based on DA are presented in this section.

DA distributes arithmetic operations rather than grouping them as multipliers do. Conventional DA, called ROM-based DA, decomposes the variable input

Table 1: Architectures Performances

Architecture    Computation time    Area complexity    Throughput rate (vector/clock cycle)
Proposed (1)    (M)T                O(N × M)           1
Proposed (2)    2(M − 1)T           O(M)               1/N

of the inner product to bit level in order to generate precomputed data. ROM-based DA uses a ROM table to store the precomputed data, which makes it regular and efficient in the use of silicon area in a VLSI implementation. The advantage of a ROM-based DA approach is its efficiency of implementation. The basic operations required are a sequence of ROM look-ups, additions, subtractions and shifts of the input data sequence [17]. Examples of the use of DA can be found in [17, 18, 19].

4.1 Proposed Architecture Based Serial Manipulation Approach

4.1.1 Mathematical Background

Consider the matrix-vector product given by the following equation:

$$
C_i = \sum_{k=0}^{N-1} A_{ik} \times B_k \tag{4}
$$

where the $\{A_{ik}\}$ are L-bit constants and the $\{B_k\}$ are written in unsigned binary representation as shown in equation 5:

$$
B_k = \sum_{m=0}^{W-1} b_{k,m} \times 2^m \tag{5}
$$

where $b_{k,m}$ is the $m$-th bit of $B_k$, which is zero or one, and $W$ is the word length used, which represents the resolution of each color component of a pixel.

Substituting 5 in 4:

$$
C_i = \sum_{k=0}^{N-1} A_{ik} \times \left( \sum_{m=0}^{W-1} b_{k,m} \times 2^m \right)
    = \sum_{m=0}^{W-1} \left( \sum_{k=0}^{N-1} A_{ik} \times b_{k,m} \right) \times 2^m \tag{6}
$$

Define:

$$
Z_m = \sum_{k=0}^{N-1} A_{ik} \times b_{k,m} \tag{7}
$$

Therefore, $C_i$ can be computed as:

$$
C_i = \sum_{m=0}^{W-1} Z_m \times 2^m \tag{8}
$$

The idea is that since the term $Z_m$ depends on the $b_{k,m}$ values and has only $2^N$ possible values, it is possible to precompute these values and store them in ROMs. An input set of N bits $(b_{0,m}, b_{1,m}, \ldots, b_{(N-1),m})$ is used as an address to retrieve the corresponding $Z_m$ value. The content of each ROM is different and depends on the coefficients of the constant matrix A. These intermediate results are accumulated over W clock cycles to produce the $C_i$ coefficients.

4.2 Case Study: Converting From R’G’B’ ↔ Y’CrCb

Since all the components are in the range of 0 to 255, 8 bits are enough to represent them. In our application (N = 4 and W = 8), $C_i$ can be computed as:

$$
C_i = \sum_{m=0}^{7} Z_m \times 2^m \tag{9}
$$

where:

$$
Z_m = \sum_{k=0}^{3} A_{ik} \times b_{k,m} \tag{10}
$$

Three ROMs (one for each row of matrix A) with a size of $2^N = 2^4 = 16$ are needed in order to store the $2^4$ possible precomputed partial product values. Since the last element of the vector B is equal to 1:

$$
b_{3,m} =
\begin{cases}
1 & \text{for } m = 0 \\
0 & \text{for } m \neq 0
\end{cases}
\tag{11}
$$

Using equations 10 and 11, equation 9 can be rewritten as:

$$
C_i = \sum_{m=0}^{7} Z^{*}_m \times 2^m + A_{i3} \tag{12}
$$

where:

$$
Z^{*}_m = \sum_{k=0}^{2} A_{ik} \times b_{k,m} \tag{13}
$$

It is worth mentioning that the size of the ROMs has been reduced to $2^3$. Table 2 gives the content of each ROM.

Table 2: Content of the ROM i (0 ≤ i ≤ 2)

b0,m   b1,m   b2,m   Content of the ROM i
0      0      0      0
0      0      1      Ai2
0      1      0      Ai1
0      1      1      Ai1 + Ai2
1      0      0      Ai0
1      0      1      Ai0 + Ai2
1      1      0      Ai0 + Ai1
1      1      1      Ai0 + Ai1 + Ai2

4.2.1 Proposed Architecture

Since our objective is to implement a core which performs two different color conversions (R’G’B’ ↔ Y’CrCb), 6 ROMs are needed (3 for each conversion). Figures 4 and 5 show the proposed core pins and its internal architecture respectively.

Figure 4: Symbol of the CSC Core

Figure 5: Serial CSC based DA Architecture

The pins description is given in Table 3.

Table 3: Pins Description

Name   Dir   Description
B0     I     First input color space component
B1     I     Second input color space component
B2     I     Third input color space component
C0     O     First output color space component
C1     O     Second output color space component
C2     O     Third output color space component
S      I     Color space conversion type selection

The proposed architecture consists of three identical Processing Elements (PEs) and two memory blocks. Each PE comprises a parallel ACCumulator (ACC) and a right shifter, and each memory block consists of three ROMs with a size of $2^3$ each (see Figure 6). The content of each ROM is different and depends on the matrix A coefficients, which depend on the conversion type.

Figure 6: Memory Block Structure

It is worth mentioning that our architecture is scalable: it can be used to perform n conversions by adding 3 × n ROMs to store the conversion matrix coefficients while always keeping the same PEs. An N × M image can be converted using the proposed architecture by setting the inputs every 8 clock cycles using the R’G’B’ components of a new pixel (Y’CrCb for the inverse conversion).
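The bit-serial scheme of equations 12 and 13 can be illustrated with a small functional model. The sketch below is an assumption-laden software analogue, not the Handel-C design: it uses floating-point ROM entries instead of the fixed-point hardware format, processes one bit plane per loop iteration (standing in for one clock cycle), and the names `build_rom` and `serial_da_convert` are hypothetical.

```python
# Coefficients of Equation 1 (R'G'B' -> Y'CrCb); the last column is the offset A_i3.
A = [
    [0.257, 0.504, 0.098, 16.0],
    [0.439, -0.368, -0.071, 128.0],
    [-0.148, -0.291, 0.439, 128.0],
]


def build_rom(row):
    """Return the 2^3-entry ROM of Table 2 for one matrix row.

    The address is formed from the bits (b0, b1, b2), with b0 as the
    most significant address bit.
    """
    rom = []
    for addr in range(8):
        b0, b1, b2 = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
        rom.append(row[0] * b0 + row[1] * b1 + row[2] * b2)
    return rom


ROMS = [build_rom(row) for row in A]


def serial_da_convert(b0, b1, b2, word_length=8):
    """Evaluate Equation 12 bit-serially for one pixel (8-bit inputs)."""
    acc = [row[3] for row in A]  # each accumulator starts at the offset A_i3
    for m in range(word_length):  # one bit plane per "clock cycle", LSB first
        addr = (((b0 >> m) & 1) << 2) | (((b1 >> m) & 1) << 1) | ((b2 >> m) & 1)
        for i in range(3):
            acc[i] += ROMS[i][addr] * (1 << m)  # Z*_m weighted by 2^m
    return acc


if __name__ == "__main__":
    print(serial_da_convert(100, 150, 200))
```

The same three-bit address drives all three ROMs in each cycle, which is why a single memory block of three ROMs serves the three output components at once.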
4.3 Proposed Architecture Based Parallel Manipulation Approach

4.3.1 Mathematical Background

Consider an N × M image (Figure 7) (N: image height, M: image width).

Let each image pixel be represented by $b_{ijk}$ ($0 \le i \le N-1$, $0 \le j \le M-1$, $0 \le k \le 2$), where:

$$
\begin{cases}
b_{ij0} = R_{ij} & \text{the red component of the pixel in row } i \text{ and column } j \\
b_{ij1} = G_{ij} & \text{the green component of the pixel in row } i \text{ and column } j \\
b_{ij2} = B_{ij} & \text{the blue component of the pixel in row } i \text{ and column } j
\end{cases}
\tag{14}
$$

The image can be converted using the following mathematical formula:

$$
\begin{bmatrix}
\begin{pmatrix} c_{000} \\ c_{001} \\ c_{002} \end{pmatrix} & \cdots & \begin{pmatrix} c_{0(M-1)0} \\ c_{0(M-1)1} \\ c_{0(M-1)2} \end{pmatrix} \\
\begin{pmatrix} c_{100} \\ c_{101} \\ c_{102} \end{pmatrix} & \cdots & \begin{pmatrix} c_{1(M-1)0} \\ c_{1(M-1)1} \\ c_{1(M-1)2} \end{pmatrix} \\
\vdots & \ddots & \vdots \\
\begin{pmatrix} c_{(N-1)00} \\ c_{(N-1)01} \\ c_{(N-1)02} \end{pmatrix} & \cdots & \begin{pmatrix} c_{(N-1)(M-1)0} \\ c_{(N-1)(M-1)1} \\ c_{(N-1)(M-1)2} \end{pmatrix}
\end{bmatrix}
=
\begin{bmatrix}
A_{00} & A_{01} & A_{02} & A_{03} \\
A_{10} & A_{11} & A_{12} & A_{13} \\
A_{20} & A_{21} & A_{22} & A_{23}
\end{bmatrix}
\otimes
\begin{bmatrix}
\begin{pmatrix} b_{000} \\ b_{001} \\ b_{002} \\ 1 \end{pmatrix} & \cdots & \begin{pmatrix} b_{0(M-1)0} \\ b_{0(M-1)1} \\ b_{0(M-1)2} \\ 1 \end{pmatrix} \\
\begin{pmatrix} b_{100} \\ b_{101} \\ b_{102} \\ 1 \end{pmatrix} & \cdots & \begin{pmatrix} b_{1(M-1)0} \\ b_{1(M-1)1} \\ b_{1(M-1)2} \\ 1 \end{pmatrix} \\
\vdots & \ddots & \vdots \\
\begin{pmatrix} b_{(N-1)00} \\ b_{(N-1)01} \\ b_{(N-1)02} \\ 1 \end{pmatrix} & \cdots & \begin{pmatrix} b_{(N-1)(M-1)0} \\ b_{(N-1)(M-1)1} \\ b_{(N-1)(M-1)2} \\ 1 \end{pmatrix}
\end{bmatrix}
\tag{15}
$$

where the operation $\otimes$ can be defined as follows: each vector $(c_{ij0}, c_{ij1}, c_{ij2})^T$ is the result of the product

$$
\begin{bmatrix}
A_{00} & A_{01} & A_{02} & A_{03} \\
A_{10} & A_{11} & A_{12} & A_{13} \\
A_{20} & A_{21} & A_{22} & A_{23}
\end{bmatrix}
\times
\begin{bmatrix} b_{ij0} \\ b_{ij1} \\ b_{ij2} \\ 1 \end{bmatrix}
$$

where the $c_{ijk}$ represent the output image color space components and the 3 × 4 matrix A represents one of the constant matrices in equations 1 and 2.

The $c_{ijk}$ elements (the output image color space components) can be computed using the following equation:

$$
c_{ijk} = \sum_{m=0}^{3} A_{km} \times b_{ijm} \tag{16}
$$

where the $\{A_{km}\}$ are L-bit constants and the $\{b_{ijm}\}$ are written in unsigned binary representation as shown in equation 17:

$$
b_{ijm} = \sum_{l=0}^{W-1} b_{ijm,l} \times 2^l \quad (0 \le m \le 2) \tag{17}
$$

Using the same development as in the previous section, equation 16 can be rewritten as:

$$
c_{ijk} = \sum_{l=0}^{7} Z^{*}_l \times 2^l + A_{k3} \tag{18}
$$

where:

$$
Z^{*}_l = \sum_{m=0}^{2} A_{km} \times b_{ijm,l} \tag{19}
$$

As in the first proposed architecture, the content of each ROM is different and depends on the matrix A coefficients, which depend on the conversion type.

4.3.2 Proposed Architecture

Equation 17 can be mapped into the proposed architecture as shown in Figure 8.

The architecture consists of 8 identical $PE_n$ ($0 \le n \le 7$). Each $PE_n$ comprises three parallel signed integer adders, three n-bit right shifters and one ROMs block, which has the structure shown in Figure 6. It is worth noting that the architecture has a latency of W and a throughput rate equal to 1. The entire image conversion can be carried out in (Latency + (N × M)/Throughput) = 8 + (N × M) clock cycles, while using the standard algorithm (Figure 9), the

conversion can be carried out in (3 × 4 × N × M) clock cycles, where (3 × 4) is the size of the constant matrix A.

Figure 7: R’G’B’ to Y’CrCb conversion

Figure 8: Proposed parallel architecture based on DA principles (PE: Processor Element)

Figure 10: Functional analysis diagram

for i ← 1 to N do                // scanning image rows
    for j ← 1 to M do            // scanning image columns
        for k ← 1 to 3 do        // scanning the three color components of a pixel
            for m ← 1 to 4 do    // scanning columns of the constant conversion matrix (b_ij4 = 1)
                c_ijk += a_km × b_ijm
            end for
        end for
    end for
end for

Figure 9: Pseudo code for the standard algorithm
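To make the cycle-count comparison concrete, the standard algorithm of Figure 9 can be written as a runnable sketch that also counts multiply-accumulate operations. The assumption that each multiply-accumulate takes one clock cycle is only the reading implied by the (3 × 4 × N × M) figure quoted in the text; the function names and the nested-list image layout are hypothetical choices made for illustration.

```python
# Conversion matrix of Equation 1 (R'G'B' -> Y'CrCb), offsets in the last column.
A = [
    [0.257, 0.504, 0.098, 16.0],
    [0.439, -0.368, -0.071, 128.0],
    [-0.148, -0.291, 0.439, 128.0],
]


def standard_convert(image):
    """Standard algorithm of Figure 9.

    `image` is a list of N rows, each a list of M pixels, each pixel a
    (b0, b1, b2) tuple. Returns the converted image and the number of
    multiply-accumulate operations performed.
    """
    n_rows, n_cols = len(image), len(image[0])
    out = [[[0.0, 0.0, 0.0] for _ in range(n_cols)] for _ in range(n_rows)]
    macs = 0
    for i in range(n_rows):                 # image rows
        for j in range(n_cols):             # image columns
            b = (*image[i][j], 1)           # augmented input vector, last element = 1
            for k in range(3):              # output color components
                for m in range(4):          # columns of the conversion matrix
                    out[i][j][k] += A[k][m] * b[m]
                    macs += 1
    return out, macs


def standard_cycles(n_rows, n_cols):
    """Clock cycles for the standard algorithm, assuming one MAC per cycle."""
    return 3 * 4 * n_rows * n_cols


def parallel_da_cycles(n_rows, n_cols, latency=8):
    """Clock cycles for the parallel DA architecture: latency + one pixel per cycle."""
    return latency + n_rows * n_cols


if __name__ == "__main__":
    print(standard_cycles(512, 512), parallel_da_cycles(512, 512))
```

Under these assumptions the ratio between the two cycle counts approaches 12 for large images, since the fixed latency of 8 cycles becomes negligible.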

Figure 10 shows the functional analysis diagram of the proposed architecture.

5 Hardware Implementation

The proposed CSC cores based on the DA and SA techniques have been designed using the Handel-C language [20]. Handel-C is a high-level language at the heart of a hardware compilation system known as the Celoxica Development Kit (DK) [21], which is designed to compile programs written in a C-like high-level language into synchronous hardware. DK produces a netlist file, which is used during the place and route stage to generate the image or bitstream file [21] (Figure 11).

The implementations target the Celoxica RC1000 PCI-based FPGA development board. The RC1000-PP board used is a standard PCI bus card equipped with the Virtex-E2000 FPGA chip (package: bg560, speed grade 6). It has 8 MBytes of SRAM directly connected to the FPGA in four 32-bit wide memory


banks. All are accessible by the FPGA and any device on the PCI bus in parallel [10]. A schematic block diagram of the RC1000-PP board is shown in Figure 12.

Figure 11: Handel-C design flow

Figure 12: RC1000-PP block diagram

5.1 CSC Based SA

Since the last vector element $B_3$ is equal to 1, the number of PEs in the two architectures shown in Figures 2 and 3 can be reduced. Figures 13 and 14 show the modified architectures. It is worth mentioning that using the first architecture, the entire computation can be carried out after (M − 1) clock cycles and requires N × (M − 1) PEs, while using the second architecture the entire computation can be carried out after 2 × (M − 1) − 1 clock cycles and requires (M − 1) PEs.

Figure 13: Modified systolic architecture (1)

Figure 14: Modified systolic architecture (2)

During the conversion (R’G’B’ ↔ Y’CrCb), the outputs are rounded. Rounding usually looks at the fractional part and, if it is greater than or equal to 0.5, increases the result by one. This implies a condition to verify and another addition operation. A more efficient way to round a number is to add 0.5 to the result and truncate the fractional part. This technique has been applied in our implementation. The initial value for each parallel adder in the first three PEs is set to ($A_{i3}$ + 0.5), where ($0 \le i \le 2$). The parallel signed adders and multipliers have been implemented using Xilinx’s CoreGen utility, which contains many designs that can often save time for a programmer; it is possible to integrate CoreGen blocks with a program in Handel-C using the interface declaration [22].

5.2 CSC Based DA

This section describes the hardware implementation of the CSCs based on DA principles. The ROMs have been implemented using the LUTs of the FPGA Configurable Logic Blocks (CLBs), which have some interesting capabilities, such as the RAM and ROM capability, that allow very fast and efficient designs to be created [23]. Tables 4 and 5 give the content of the ROMs used in both architectures for the R’G’B’ to Y’CrCb and Y’CrCb to R’G’B’ conversions respectively.

The second proposed architecture can be used for the inverse conversion (Y’CrCb to R’G’B’) by either of the two options listed after Tables 4 and 5.

Table 4: The Content of the ROMs (R’G’B’ to Y’CrCb)

Rm / Rij,l   Gm / Gij,l   Bm / Bij,l   ROM1     ROM2      ROM3
0            0            0            0        0         0
0            0            1            0.098    -0.071    0.439
0            1            0            0.504    -0.368    -0.291
0            1            1            0.602    -0.439    0.148
1            0            0            0.257    0.439     -0.148
1            0            1            0.355    0.368     0.291
1            1            0            0.761    0.071     -0.439
1            1            1            0.859    0         0

Table 5: The Content of the ROMs (Y’CrCb to R’G’B’)

Ym / Yij,l   Crm / Crij,l   Cbm / Cbij,l   ROM1     ROM2      ROM3
0            0              0              0        0         0
0            0              1              0        -0.392    2.017
0            1              0              1.596    -0.813    0
0            1              1              1.596    -1.205    2.017
1            0              0              1.164    1.164     1.164
1            0              1              1.164    0.772     3.181
1            1              0              2.76     0.351     1.164
1            1              1              2.76     -0.041    3.181
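The effect of storing the ROM contents in fixed point and rounding by adding 0.5 before truncation can be explored with a small integer-only model. Everything below is a hedged sketch: quantising each ROM entry by round-to-nearest, the parameter `frac_bits`, and the function names are assumptions made for illustration, and the model is not claimed to reproduce the accuracy figures of the hardware implementation.

```python
# Coefficients and offsets of Equation 1 (R'G'B' -> Y'CrCb).
COEFFS = [
    [0.257, 0.504, 0.098],
    [0.439, -0.368, -0.071],
    [-0.148, -0.291, 0.439],
]
OFFSETS = [16.0, 128.0, 128.0]


def to_fixed(value, frac_bits):
    """Quantise a real value to an integer with `frac_bits` fractional bits
    (round-to-nearest is an assumption of this sketch)."""
    return int(round(value * (1 << frac_bits)))


def build_fixed_roms(frac_bits):
    """Build the three 8-entry ROMs of Table 4 as fixed-point integers."""
    roms = []
    for row in COEFFS:
        rom = []
        for addr in range(8):
            bits = ((addr >> 2) & 1, (addr >> 1) & 1, addr & 1)
            partial = sum(a * b for a, b in zip(row, bits))
            rom.append(to_fixed(partial, frac_bits))
        roms.append(rom)
    return roms


def convert_fixed(r, g, b, frac_bits=5):
    """DA conversion of one 8-bit pixel using integer arithmetic only."""
    roms = build_fixed_roms(frac_bits)
    result = []
    for i in range(3):
        # Accumulator preset to (A_i3 + 0.5): adding 0.5 up front turns the
        # final truncation into rounding.
        acc = to_fixed(OFFSETS[i] + 0.5, frac_bits)
        for m in range(8):
            addr = (((r >> m) & 1) << 2) | (((g >> m) & 1) << 1) | ((b >> m) & 1)
            acc += roms[i][addr] << m
        result.append(acc >> frac_bits)  # truncate the fractional part
    return result


if __name__ == "__main__":
    for bits in (5, 8, 16):
        print(bits, convert_fixed(10, 200, 50, frac_bits=bits))
```

Running the loop at the end with different values of `frac_bits` gives a feel for how the number of fractional bits trades ROM width against conversion error.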

• Duplicating the ROMs using the same implementation approach used for the first architecture (with a selector signal which allows the user to choose the appropriate converter); or
• Setting the contents of the ROMs in advance, depending on the desired conversion.

The precomputed partial products are stored in the ROMs using a 13-bit fixed-point representation (8 bits for the integer part and 5 bits for the fractional part). 13-bit arithmetic is used inside the architecture. The inputs and outputs of the two architectures are represented using 8 bits and the outputs are rounded. As in the SA-based CSC implementation, the same rounding technique is applied here. The initial value for each accumulator $ACC_i$ is set in advance to ($A_{i3}$ + 0.5), where ($0 \le i \le 2$).

The MACs and parallel signed adders have been implemented using Xilinx’s CoreGen utility [22]. The shifters and ROM initialisation have been implemented using VHDL. All design components have been connected together using Handel-C.

In order to make a fair and consistent comparison with the existing FPGA-based color space converters, the XCV50E-8 FPGA device has been targeted. Table 6 illustrates the performance obtained for the proposed architectures in terms of the area consumed and the speed which can be achieved.

The proposed DA architectures based on the serial and parallel manipulation approaches show significant improvements in comparison with the existing implementations [3, 4, 5], which perform the R’G’B’ to Y’CrCb conversion, in terms of the area consumed and the maximum running clock frequency. The advantage of the two other proposed architectures is that they can be used for any color space conversion based on equation 3.

Table 7 illustrates the comparison of the hardware and software implementations in terms of the RMS error (due to the use of different data representations in the two implementations),

$$
RMSError = \sqrt{\frac{1}{N \times M} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left( I_{soft}(i,j) - I_{hard}(i,j) \right)^2}
$$

and the computation time, when using the second proposed DA architecture.

Table 7 shows the test results for two different images (the Baboon image (512 × 512) and the Pepper image (256 × 256)). It can be seen that the same converted image can be obtained faster when using the FPGA implementation, with a minimal error (due to the use of different data representations in the two implementations).

6 Conclusion

Processing an image in the RGB color space, with a set of RGB values for each pixel, is not the most efficient method. To speed up some processing steps, many broadcast, video and imaging standards use

Table 6: Performance comparison with existing CSC cores

Design                          Slices   Speed (MHz)
Proposed SA architecture (1)    305      68
Proposed SA architecture (2)    1022     72
Proposed DA architecture (1)    70       128
Proposed DA architecture (2)    193      234
CAST Inc [4]                    222      112
ALMA Tech [5]                   222      105
Amphion Ltd [3]                 204      90

Table 7: Software/hardware implementations for RGB to YCrCb CSC comparisons

Original image   Software implementation   Hardware implementation   RMS Error    Computation time (ms)
                                                                                  Software   Hardware
(image)          (image)                   (image)                   Y   0.487    126        1.2
                                                                     Cr  0.630
                                                                     Cb  0.461
(image)          (image)                   (image)                   Y   0.684    43         0.28
                                                                     Cr  0.830
                                                                     Cb  0.396

luminance and color difference video signals, such as YCrCb, making a mechanism for converting between formats necessary. In this paper, novel scalable architectures based on DA and SA approaches for R’G’B’ ↔ Y’CrCb conversions, which require enormous computing power, have been reported. The implementation results show the effectiveness of the DA approach. The performance of the proposed architecture, in terms of the area used and the maximum running frequency, has been assessed and has shown that the proposed system requires less area and can run at a higher frequency when compared with existing systems. The proposed systolic structures can perform other conversions based on matrix-vector multiplication, while the DA structure can be used for other conversions by modifying the content of the ROMs.

References

[1] B. Payette, “Color Space Converter: R’G’B’ to Y’CrCb,” Xilinx Application Note, XAPP637, V1.0, September 2002.
[2] R.C. Gonzalez and R.E. Woods, “Digital Image Processing,” Second Edition, Prentice Hall Inc, 2002.
[3] Datasheet (www.amphion.com), “Color Space Converters,” Amphion Semiconductor Ltd, DS6400 V1.1, April 2002.
[4] Application Note (www.cast-inc.com), “CSC Color Space Converter,” CAST Inc, April 2002.
[5] Datasheet (www.alma-tech.com), “High Performance Color Space Converter,” ALMA Technologies, May 2002.
[6] F. Bensaali and A. Amira, “Design and Efficient FPGA Implementation of an RGB to YCrCb Color Space Converter Using Distributed Arithmetic,” Proceedings of the International Conference on Field Programmable Logic (FPL), Lecture Notes in Computer Science, to be published by Springer Verlag, August, 2004.
[7] A. Amira, “A custom Coprocessor for Matrix Algorithm,” PhD thesis, Queen’s University of Belfast, 2001.
[8] F. Bensaali, A. Amira, I.S. Uzun and A. Ahmedsaid, “An FPGA Implementation of 3D Affine Transformations,” The 10th IEEE International Conference on Electronics, Circuits and Systems (ICECS’03), Sharjah, UAE, December, 2003.

[9] F. Bensaali, A. Amira, I.S. Uzun and A. Ahmedsaid, “Efficient Implementation of Large Parallel Matrix Product for DOTs,” The International Conference on Computer, Communication and Control Technologies (CCCT’03), Florida, USA, July, 2003.
[10] Datasheet (www.celoxica.com), “RC1000 Reconfigurable hardware development platform,” Celoxica Ltd., 2001.
[11] URL: www.xilinx.com
[12] A. Albiol, L. Torres and E.J. Delp, “An unsupervised color image segmentation algorithm for face detection applications,” In Proceedings of the International Conference on Image Processing, pp 681-684, Vol. 2, October 2001.
[13] P. Kuchi, P. Gabbur, P.S. Bhat and S. David, “Human Face Detection and Tracking using Skin Color Modelling and Connected Component Operators,” The IETE Journal of Research, Special issue on Visual Media Processing, May 2002.
[14] M. Bartkowiak, “Optimisations of Color Transformation for Real Time Video Decoding,” Digital Signal Processing for Multimedia Communications and Services, EURASIP ECMCS 2001, Budapest, September 2001.
[15] J.L. Mitchell and W.B. Pennebaker, “MPEG Video Compression Standard,” Chapman & Hall, 1996.
[16] J. Bracamonte, P. Standelmann, M. Ansorge and F. Pellandini, “A Multiplierless Implementation Scheme for the JPEG Image Coding Algorithm,” IEEE Nordic Signal Processing Symposium, Kolmarden, Sweden, June 13 - 15, 2000.
[17] A. Amira, “An FPGA Based Parameterisable System For Discrete Hartley Transforms Implementation,” Proceedings of The International Conference on Image Processing (ICIP), Barcelona, Spain, September 2003.
[18] H. Ohlsson and L. Wanhammar, “Maximally fast numerically equivalent state-space recursive digital filters using distributed arithmetic,” Proceedings of the IEEE Symposium in Nordic Signal Processing (NORSIG2000), Kolmarden, Sweden, pp 295-298, June 2000.
[19] O. Gustafsson and L. Wanhammar, “Implementation of a Digital Beamformer in an FPGA using Distributed Arithmetic,” Proceedings of the IEEE Symposium in Nordic Signal Processing (NORSIG2000), Kolmarden, Sweden, pp 295-298, June 2000.
[20] Manual (www.celoxica.com), “Handel-C Language Reference Manual,” Celoxica Ltd., 2003.
[21] URL: www.celoxica.com
[22] Application Note, “Xilinx CoreGen and Handel-C,” AN 58 v1.0, 2001.
[23] M. Defossez, “Using the Virtex Look-Up Tables,” Xilinx Application Note (www.xilinx.com).
