An Overview of Data Bandwidth Hierarchy For An Embedded Stream Processor

2009 International Forum on Computer Science-Technology and Applications
Duan Zongtao(1), Zhang Yanni(2), Duan Zongyuan(3)

(1) Information College, Chang’an University, Xi’an, Shaanxi 710064, China
(2) Xi’an Metrology & Measurement Technique Institute, Xi’an, Shaanxi 710068, China
(3) Xi’an Objective Software Company, Xi’an, Shaanxi 710068, China
ztduan@chd.edu.cn
ABSTRACT: The overall speed of computation is determined not just by the speed of the processor, but also by the ability of the memory system to feed data to it. Imagine is a novel image processor built by a research group at Stanford University. This article introduces the data bandwidth hierarchy of this processor, which consists of local register files (LRFs), a stream register file (SRF), and main memory. The LRFs are used by the ALU clusters. The SRF resembles the cache of a traditional processor, but it is a unit specific to stream processors: it serves as the load/store unit for streams. With this data bandwidth hierarchy, image processing speed is improved greatly.

KEYWORDS: data bandwidth hierarchy; Imagine; stream processor; embedded; image processor; computer architecture

I. INTRODUCTION
Image processing applications demand computation rates of 10-100 GOPS. Achieving such rates requires tens to hundreds of arithmetic units. However, the overall speed of computation is determined not just by the speed of the processor, but also by the ability of the memory system to feed data to it. While clock rates of high-end processors have increased at roughly 40% per year over the past decade, DRAM access times have improved at only roughly 10% per year over the same interval. Coupled with increases in instructions executed per clock cycle, this gap between processor speed and memory presents a tremendous performance bottleneck. The growing mismatch between processor speed and DRAM latency is typically bridged by a hierarchy of successively faster memory devices, called caches, that rely on locality of data reference to deliver higher memory system performance[1].

In fact, image processing does not exhibit the locality of data reference that caches rely on, so a cache hierarchy is not well suited to it. VLSI constraints motivate a data bandwidth hierarchy that provides enough data bandwidth to support the demands of image processing. In modern VLSI, hundreds of arithmetic units can fit on an inexpensive chip, but communication among those arithmetic units is expensive in terms of area, power, and delay. Conventional storage hierarchies do not provide enough data bandwidth to support large numbers of arithmetic units. In contrast, a bandwidth hierarchy effectively feeds numerous arithmetic units by scaling data bandwidth across multiple levels[2]. This bridges the gap between the bandwidth available from modern DRAM and the bandwidth demanded by tens to hundreds of arithmetic units.

II. THE CHARACTERISTICS OF IMAGE PROCESSING
Because a cache hierarchy is not well suited to image processing, it is necessary to analyze the characteristics of image processing.

Image processing includes a preprocessing stage, also called low-level processing. Preprocessing organizes data for later image processing and mainly comprises image enhancement and image restoration. From a computational point of view, preprocessing can be viewed as three kinds of processes: pixel-dependent processes (such as thresholding, quantization, and coding), partial-pixel processes (such as filtering and convolution), and area-pixel and whole-pixel processes (digital transforms such as the Fourier transform).

All of these computations share the following characteristics. The first is spatial invariance: the same instruction can be executed on every pixel of the image. We call this characteristic data independence. The second is locality: for each pixel, the processing result depends only on a limited number of neighboring pixels. The third is a fixed data structure: the result of image processing is still an image array. From these we conclude that image preprocessing has obvious data parallelism.

From the viewpoint of the memory system, preprocessing needs an efficient data bandwidth hierarchy to feed data to the large number of arithmetic units that exploit this data parallelism.

III. IMAGINE ARCHITECTURE
Imagine is a typical embedded stream processor for image processing. It is a programmable single-chip processor built by a Stanford research group, as Figure 1 shows. Imagine provides a storage bandwidth hierarchy that corresponds to the three levels of Figure 2. Using this hierarchy to exploit the parallelism and locality of image applications, Imagine is able to sustain 8.5 GFLOPS on key kernels[3]. This is comparable to special-purpose processors, yet Imagine remains easily programmable for a wide range of applications. Imagine is designed to fit on a 1 cm² chip in 0.25 µm CMOS and to operate at 400 MHz.

The stream architecture of the Imagine media processor effectively exploits the desirable application characteristics.
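The data independence and locality described in Section II can be illustrated with a minimal sketch. The function names and the small test image below are assumptions made for this example only; they are not part of Imagine or its programming tools. The sketch shows a pixel-dependent process (thresholding) and a partial-pixel process (a 3x3 mean filter), both of which produce an image array of the same shape as the input.

```python
# Illustrative sketch only: the function names and the tiny test image are
# assumptions for this example, not part of Imagine or its tool chain.

def threshold(image, level):
    """Pixel-dependent process: each output pixel depends on one input pixel."""
    return [[255 if p >= level else 0 for p in row] for row in image]


def mean_filter_3x3(image):
    """Partial-pixel process: each output pixel depends on at most 9 inputs.

    Border pixels average only the neighbours that lie inside the image.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        total += image[ny][nx]
                        count += 1
            out[y][x] = total // count  # integer mean of the neighbourhood
    return out


image = [
    [10, 10, 10, 10],
    [10, 200, 200, 10],
    [10, 200, 200, 10],
    [10, 10, 10, 10],
]
binary = threshold(image, 128)   # same instruction applied to every pixel
smooth = mean_filter_3x3(image)  # result is still a 4x4 image array
```

Because no output pixel depends on another output pixel, every iteration of the loops over `y` and `x` could in principle run on a separate arithmetic unit, which is the data parallelism that a stream processor is meant to exploit.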
978-0-7695-3930-0/09 $26.00 © 2009 IEEE

DOI 10.1109/IFCSTA.2009.14
Imagine meets the computation and bandwidth demands of image applications by directly processing the data streams that occur naturally within these applications. Figure 1 shows a block diagram of Imagine’s architecture. Imagine is designed to be a coprocessor that operates on multimedia data streams. The stream register file (SRF) effectively isolates the arithmetic units from the memory system, making Imagine a load/store architecture for streams. All stream operations transfer data streams to or from the SRF. For instance, the network interface transfers streams directly out of and into the SRF, isolating network transfers from memory accesses and computation. This simplifies the design of the processor and allows the clients (the arithmetic clusters, the memory system, the network interface, etc.) to tolerate the latency of other stream clients. In essence, the SRF enables the streaming data types inherent in image processing applications to be routed efficiently throughout the processor.

Figure 1 Imagine architecture block diagram

As shown in Figure 1, eight arithmetic clusters, a microcontroller, a streaming memory system, a network interface, and a stream controller are connected to the SRF. The arithmetic clusters consume data streams from the SRF, process them, and return their output data streams to the SRF. The microcontroller is connected to the SRF so that compiled kernel programs can be loaded from the SRF into the microcontroller’s microcode store[4]. The streaming memory system transfers streams between the SRF and the off-chip DRAM. The network interface transfers streams between the SRF and other processors or devices connected to the network. The stream controller allows the host processor to transfer data and microcode programs into or out of the SRF, although large data transfers would be made through the Imagine network.

Figure 2 Imagine’s Bandwidth Hierarchy

IV. DATA BANDWIDTH HIERARCHY
Imagine is a programmable media processor that matches the demands of media processing applications to the capabilities of modern VLSI technology. Imagine implements a stream architecture, which includes an efficient data bandwidth hierarchy and a streaming memory system. The bandwidth hierarchy bridges the gap between the 2 GB/s memory system and the arithmetic units, which require 544 GB/s to achieve full utilization. At the base of the bandwidth hierarchy, the streaming memory system sustains a significant fraction of the available DRAM bandwidth, enabling image processing applications to take full advantage of the bandwidth hierarchy.

The bandwidth hierarchy enables Imagine to achieve 77-96% of the performance of a stream processor with infinite memory and global data bandwidth. Kernels make efficient use of the local register files for temporary storage to supply the arithmetic units with data, and of the stream register file to supply input streams to the arithmetic units and store their output streams. The stream register file efficiently captures the locality of stream recirculation within media processing applications, thereby limiting the bandwidth demands on off-chip memory. When mapped to the three-tiered storage hierarchy, the bandwidth demands of image processing applications are well matched to the provided bandwidth, which is scaled by a ratio of 1:16:272 across the levels of the hierarchy[5].

As Figure 2 illustrates, Imagine’s data bandwidth hierarchy includes three levels of storage: external DRAM, the SRF, and the local register file (LRF) in each ALU cluster.

Like a conventional cache hierarchy, Imagine’s data bandwidth hierarchy has a memory level. This level is DRAM; in Figure 1 and Figure 2 it is implemented with SDRAM. The memory bandwidth is the bandwidth between the SDRAM and the streaming memory system. The external DRAM provides a peak bandwidth of 2 GB/s.

The next level of the storage hierarchy, the stream register file (SRF), provides a peak bandwidth of 32 GB/s, which is 16 times higher than the memory bandwidth. The
SRF is a 128 KB (1 Mbit) memory optimized for stream transfers.

Finally, the distributed register file structure within the arithmetic clusters provides a peak bandwidth of 544 GB/s, which is 17 times higher than the SRF bandwidth. The external DRAM, SRF, and local register files form a three-tiered bandwidth hierarchy in which the bandwidth is scaled by a ratio of 1:16:272 across the levels[6].

A bandwidth hierarchy allows media processing applications to use data bandwidth efficiently. Temporary data is stored locally in the arithmetic clusters, where it may be accessed frequently and quickly. Intermediate streams are stored on-chip in the SRF, where they can be recirculated between kernels without requiring costly memory references[7]. Finally, input, output, and other global data is stored in external DRAM, since it is referenced infrequently and requires large amounts of storage space.

A cache hierarchy manages data movement and storage dynamically[8]. However, media applications have regular, predictable memory behavior, so the programmer and compiler have enough knowledge to manage the memory hierarchy effectively at compile time[9].

V. CONCLUSION
Media processing applications benefit from bandwidth scaling across multiple levels of the storage hierarchy to bridge the gap between DRAM bandwidth and the data bandwidth required by the arithmetic units. Without this bandwidth scaling, media applications are limited by memory bandwidth, which severely reduces the number of arithmetic units that can be utilized efficiently. A three-tiered bandwidth hierarchy, including distributed local register files, a global stream register file, and external DRAM, can effectively meet the needs of these applications by scaling the bandwidth by over an order of magnitude at each level.

ACKNOWLEDGMENT
We would like to thank the Chinese postdoctoral foundation, the Natural Science Basic Research Plan in Shaanxi Province of China, and the Natural Science Basic Research Plan of the Education Ministry of China.

REFERENCES
[1] A. Grama, “Introduction to Parallel Computing,” Second Edition, China Machine Press, 2003, pp. 3-4.
[2] S. Rixner, “Stream Processor Architecture,” Kluwer Academic Publishers, 2001, pp. 53-59.
[3] Shen Xu-bang, “MPP Embedded Computer Design,” Tsinghua University Press, 1999, pp. 250-251.
[4] S. Rixner et al., “A Bandwidth-Efficient Architecture for Media Processing,” Proc. 31st Int’l Symp. Microarchitecture, IEEE Computer Society Press, Los Alamitos, Calif., 1998, pp. 3-7.
[5] U. J. Kapasi et al., “The Imagine Stream Processor,” Proceedings of the IEEE International Conference on Computer Design, 2002, pp. 282-288.
[6] Shen Xu-bang, “RISC and Back-end Compile Technology,” Tsinghua University Press, Beijing, 1994, pp. 344-346.
[7] Lin Chuang, “Computer Network and Computer System Performance Evaluation,” Tsinghua University Press, Beijing, 2001, pp. 143-145.
[8] Qiang Ming, “A Kind of Digital Simulation Software for Data Bus in Satellite,” Computer Simulation, Vol. 15, 4.
[9] Kai Hwang, “Computer Architecture and Parallel Processing,” Science Press, 1990, pp. 133-134.
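As a check on the figures quoted in Section IV, the following short sketch recomputes the bandwidth ratios from the three peak bandwidths (2 GB/s, 32 GB/s, and 544 GB/s) and converts the SRF capacity to bits. The variable names are chosen for this sketch only, and the capacity conversion assumes 1 KB = 1024 bytes and 1 Mbit = 1024 x 1024 bits.

```python
# Peak bandwidths quoted in Section IV, in GB/s. The variable names are
# assumptions made for this sketch.
DRAM_GBPS = 2
SRF_GBPS = 32
LRF_GBPS = 544

srf_over_dram = SRF_GBPS / DRAM_GBPS   # SRF relative to external DRAM
lrf_over_srf = LRF_GBPS / SRF_GBPS     # local register files relative to SRF
lrf_over_dram = LRF_GBPS / DRAM_GBPS   # local register files relative to DRAM

# Ratio across the three levels, normalised to the DRAM bandwidth.
ratio = (1, int(srf_over_dram), int(lrf_over_dram))

# SRF capacity: 128 KB expressed in Mbit (assuming binary prefixes).
srf_bits = 128 * 1024 * 8
srf_mbit = srf_bits / (1024 * 1024)

print("DRAM : SRF : LRF =", ":".join(str(r) for r in ratio))
print("SRF capacity in Mbit =", srf_mbit)
```

The two per-level factors, 16 and 17, are both above ten, which is consistent with the conclusion's statement that the bandwidth scales by over an order of magnitude at each level.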
