Applications on Network Processors
Michael Watts
January 26th, 2006
The Stage
• Demand to move applications from end nodes to network edge
• Increased processing power at edge makes this possible
[Figure: example network with end nodes and edge devices; Corporate Office West and Corporate Office East connected through the Internet]
Processing Units
• Symmetric
[Figure: packets]
Intel IXP1200
• Symmetric architecture
• Processors (266MHz, 32-bit RISC)
– 1 x StrongARM controller
• L1 and L2 cache
– 6 x microengines (ME)
• 4 hardware supported threads each
• No cache, lots of registers
• Shared Memory
– 8 MBytes SRAM
– 256 MBytes SDRAM
– StrongARM and MEs share memory bus
– No built-in memory management
Intel IXP1200 Architecture
NPU Simulators
• Purpose
– Execute programs on foreign platform
– Provide performance statistics
• SimpleScalar
– Cycle-accurate hardware simulation
– Architecture similar to MIPS
– Modified GNU GCC generates binaries
PacketBench
• Developed at University of Massachusetts
• Uses SimpleScalar
• Provides API for basic NPU functions
• NPU platform independence
• Drawback: no support for multiprocessor architectures
Benchmarks
• Applications designed to assess performance characteristics of a single platform or differences between platforms
– Synthetic
• Mimic a particular type of workload
– Application
• Real-world applications
• Our focus: application benchmarks for the domain of NPUs
Benchmark Suites
• MiBench
– Target: embedded microprocessors
– Including Rijndael encryption (AES)
• NetBench
– Target: NPUs
– Including Message-Digest 5 (MD5) and URL-based switching
• Source available in C
• Limitation: single-threaded
The Simulator
• Modified existing multiprocessor simulator
• Built on SimpleScalar
• Modeled after Intel IXP1200
– Modeled processing units, memory, and cache structure
– Processors share memory bus
– SRAM reserved for instruction stacks
Parameter        | StrongARM        | Microengines
Scheduling       | Out-of-order     | In-order
Width            | 1 (single-issue) | 1 (single-issue)
L1 I Cache Size  | 16 KByte         | SRAM (0 penalty)
L1 D Cache Size  | 8 KByte          | 1 KByte (replace registers)
Methods of Use
• Simulator compiles on Linux using GCC
• Takes SimpleScalar binary as input
sim3ixp1200 [-h] [sim-args] program [program-args]
barrier();
if (thread_id == 0) {
    // StrongARM
}
else if (thread_id == 1) {
    // 1st microengine thread
}
else {
    // microengine threads 2 to ncpu
}
Benchmark Applications
• Modified 3 kernels from MiBench and NetBench
– Message-Digest 5 (MD5)
– URL-based switching (URL)
– Advanced Encryption Standard (AES) [Rijndael]
• Modified memory allocations
• Modified source of incoming packets
• Parallelized
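The slides do not say how the memory allocations were changed. On a platform with no built-in memory management, one common approach is to replace heap calls with a fixed arena, so the following is only a minimal sketch under that assumption. The names `arena_alloc` and `arena_reset` and the arena size are hypothetical and do not come from the benchmarks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_BYTES 4096u  /* hypothetical size, not taken from the slides */

static _Alignas(8) uint8_t arena[ARENA_BYTES];
static size_t arena_used = 0;

/* Hand out n bytes from the fixed arena, rounded up to a multiple of 8.
 * Returns NULL when the arena cannot satisfy the request. */
void *arena_alloc(size_t n)
{
    size_t rounded = (n + 7u) & ~(size_t)7u;

    assert(arena_used <= ARENA_BYTES);            /* allocator invariant */
    if (rounded < n || rounded > ARENA_BYTES - arena_used)
        return NULL;                              /* overflow or arena full */

    void *p = &arena[arena_used];
    arena_used += rounded;
    return p;
}

/* Release every allocation at once, for example between packets. */
void arena_reset(void)
{
    arena_used = 0;
}
```

A kernel ported this way would call `arena_alloc` wherever it previously called `malloc`, and `arena_reset` once per packet instead of individual `free` calls.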
MD5
• Creates a 128-bit signature of input
• Used extensively in public-key cryptography and verification of data integrity
• Packet processing offloaded to microengine (ME) threads
• Packets processed in parallel
MD5 Algorithm
• Every packet processed on separate ME thread
• StrongARM monitors for idle threads and assigns work
[Figure: incoming packets, microengines]
MD5 Parallelization
[Figure: StrongARM, microengines]
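The dispatch scheme described for MD5 (the StrongARM hands each incoming packet to an idle ME thread) can be approximated by a small sequential model. This is a sketch, not the simulator's code: `dispatch_packets` and the per-packet cost array are hypothetical, all packets are assumed to be queued at cycle 0, and dispatch overhead and memory bus contention are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 24  /* 6 microengines x 4 hardware threads each */

/* Model of the controller loop: each packet, in arrival order, goes to the
 * worker thread that becomes idle first. Returns the cycle at which the
 * last packet finishes. */
unsigned long dispatch_packets(const unsigned long *cost, size_t npackets,
                               size_t nthreads)
{
    unsigned long busy_until[MAX_THREADS] = {0};
    unsigned long finish = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);

    for (size_t p = 0; p < npackets; p++) {
        size_t idle = 0;

        /* Find the thread that frees up earliest. */
        for (size_t t = 1; t < nthreads; t++)
            if (busy_until[t] < busy_until[idle])
                idle = t;

        busy_until[idle] += cost[p];      /* thread processes the packet */
        if (busy_until[idle] > finish)
            finish = busy_until[idle];
    }
    return finish;
}
```

With four packets of 4 cycles each, the model finishes at cycle 16 on one thread, 8 on two, and 4 on four, which is the ideal case that the measured results can be compared against.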
URL
• Directs packets based on payload content
• Useful for load-balancing, fault detection and recovery
• Layer 7 switch, content-switch, web-switch
• Uses pattern matching algorithm
URL Algorithm
• Work for each packet split among ME threads
• StrongARM iterates over search tree, assigning work to idle ME threads
• ME threads report when match found
[Figure: incoming packets, StrongARM, microengines]
URL Parallelization
[Figure: microengines, StrongARM]
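To illustrate how the work for a single packet might be split, the sketch below divides a rule table into slices, one per worker, and returns as soon as a worker reports a match. It deliberately swaps the search tree mentioned above for a flat table with substring matching, and it runs the workers one after another rather than in parallel. `struct url_rule`, `scan_slice`, and `url_switch` are hypothetical names, not part of NetBench.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical routing rule: payloads containing `pattern` go to `server`. */
struct url_rule {
    const char *pattern;
    int server;
};

/* One worker's share: scan rules[begin, end) and return the index of the
 * first rule whose pattern occurs in the payload, or -1 if none does. */
static int scan_slice(const struct url_rule *rules, size_t begin, size_t end,
                      const char *payload)
{
    for (size_t i = begin; i < end; i++)
        if (strstr(payload, rules[i].pattern) != NULL)
            return (int)i;
    return -1;
}

/* Controller side: cut the table into contiguous slices, give one slice to
 * each worker, and return the server of the first match reported. */
int url_switch(const struct url_rule *rules, size_t nrules,
               const char *payload, size_t nworkers)
{
    assert(nworkers >= 1);
    size_t chunk = (nrules + nworkers - 1) / nworkers;  /* ceiling division */

    for (size_t w = 0; w < nworkers; w++) {
        size_t begin = w * chunk;
        if (begin >= nrules)
            break;                          /* more workers than slices */
        size_t end = begin + chunk < nrules ? begin + chunk : nrules;

        int hit = scan_slice(rules, begin, end, payload);
        if (hit >= 0)
            return rules[hit].server;       /* worker reports the match */
    }
    return -1;                              /* no rule matched */
}
```

In a truly parallel version each slice would run on its own ME thread, and the controller would need a rule for choosing among several simultaneous matches; this sequential sketch simply takes the lowest-numbered rule.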
AES
• Block cipher encryption algorithm
• Made US government standard in 2001
• 256-bit key
• Same parallelization technique as MD5
• Key loaded into each ME’s stack during initialization
• Packet encryption performed in parallel
Performance Tests
• Purpose
– Evaluate multi-threading kernels and end-to-end applications
• Tests
– Isolation
– Shared
– Static
– Dynamic
Isolation Tests
• Establish baseline
• Explore effects of multi-threading kernels
• Each kernel run in isolation
• Number of ME threads varied from 1 to 24
• Speedup graphed against serial version
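Speedup relative to the serial version is conventionally the serial run time divided by the parallel run time. The slides do not give the exact formula used for the graphs, so the helper below is a sketch under that assumption, with cycle counts as the hypothetical unit.

```c
#include <assert.h>

/* Speedup of a parallel run over the serial baseline. Values above 1.0
 * mean the parallel run was faster; values below 1.0 mean it was slower. */
double speedup(unsigned long serial_cycles, unsigned long parallel_cycles)
{
    assert(parallel_cycles > 0);
    return (double)serial_cycles / (double)parallel_cycles;
}
```

For example, a kernel that takes 1000 cycles serially and 250 cycles in parallel has a speedup of 4.0, while one that takes 2000 cycles in parallel has a speedup of 0.5.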
MD5 Isolation Results
• 0: serial on StrongARM
• 1-24: parallel on MEs
• Decreased speedup on 1 ME
• Significant speedup overall
• Note decreasing slope at 7, 13, and 19 threads
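One possible reading of the slope changes at 7, 13, and 19 threads: with 6 microengines, those are the thread counts at which some microengine must start hosting a second, third, or fourth thread. This assumes threads are spread round-robin across the microengines, which the slides do not state. The function name below is hypothetical.

```c
#include <assert.h>

#define NUM_MES        6  /* microengines on the IXP1200 */
#define THREADS_PER_ME 4  /* hardware threads per microengine */

/* Number of threads sharing the busiest microengine when nthreads worker
 * threads are placed round-robin (an assumed placement policy). */
int threads_on_busiest_me(int nthreads)
{
    assert(nthreads >= 0 && nthreads <= NUM_MES * THREADS_PER_ME);
    return (nthreads + NUM_MES - 1) / NUM_MES;  /* ceiling division */
}
```

Under this assumption the value steps from 1 to 2 at 7 threads, from 2 to 3 at 13, and from 3 to 4 at 19, matching the thread counts noted above.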
URL Isolation Results
[Figure labels: URL, AES, URL queue, AES queue, Network, Microengines]
• URL queue fills as MD5 outperforms URL
• Additional threads created for URL
• AES threads created each time URL finishes
Dynamic Results