
Multi-Threaded End-to-End

Applications on Network
Processors

Michael Watts
January 26th, 2006
The Stage
• Demand to move applications from end
nodes to network edge
• Increased processing power at edge
makes this possible

Example

[Diagram: Corporate Office West and Corporate Office East, each with end nodes and an edge device, connected across The Internet]

• End nodes responsible for establishing secure communication
• All communication between corporate offices secured at Internet edge
Applications at Network Edge
• Provide service to end nodes
– Security
– Quality of Service
– Intrusion detection
– Load balancing
• Kernels carry out a single task
– Such as MD5, URL-based switching, and AES
• End-to-end applications combine multiple
kernels
Intelligent Devices
• High level applications at network edge
– Demand processing power
– Demand flexibility of general-purpose processors
• Application-Specific Integrated Circuit (ASIC)
– Speed without flexibility
– Customized for particular use
• Network Processing Unit (NPU)
– Programmable flexibility
– Performance through parallelization
Benchmarks
• Increasing complexity of next-generation
applications
– More demand on NPUs
– Benchmark applications used to test
performance of NPUs
• Current network benchmarks
– Single-threaded kernels
– Insufficient for NPU multi-processor
architecture
Contributions
• Multi-threaded end-to-end application
benchmark suite
• Generic NPU simulator
• Analysis shows kernel performance
inaccurate indicator of end-to-end
application performance
Overview
1. Network Processors and Simulators
2. The NPU Simulator
3. Benchmark Applications
4. Tests and Results
5. Conclusion
6. Future Work
Network Processors
• NPU
– Programmable packet processing device
– Over 30 self-identified NPUs
• NPU Architecture
– Dedicated co-processors
– High-speed network interfaces
– Multiple processing units
• Pipelined
• Symmetric
Pipelined vs. Symmetric
• Pipelined

[Diagram: a packet flowing through a chain of processing units]

• Symmetric

[Diagram: multiple packets, each handled by its own processing unit]
Intel IXP1200
• Symmetric architecture
• Processors (266MHz, 32-bit RISC)
– 1 x StrongARM controller
• L1 and L2 cache
– 6 x microengines (ME)
• 4 hardware supported threads each
• No cache, lots of registers
• Shared Memory
– 8 MBytes SRAM
– 256 MBytes SDRAM
– StrongARM and MEs share memory bus
– No built-in memory management
Intel IXP1200 Architecture

[Block diagram of the Intel IXP1200]
NPU Simulators
• Purpose
– Execute programs on foreign platform
– Provide performance statistics
• SimpleScalar
– Cycle-accurate hardware simulation
– Architecture similar to MIPS
– Modified GNU GCC generates binaries
PacketBench
• Developed at University of Massachusetts
• Uses SimpleScalar
• Provides API for basic NPU functions
• NPU platform independence
• Drawback: no support for multiprocessor
architectures
Benchmarks
• Applications designed to assess
performance characteristics of a single
platform or differences between platforms
– Synthetic
• Mimic a particular type of workload
– Application
• Real-world applications
• Our focus: application benchmarks for the
domain of NPUs
Benchmark Suites
• MiBench
– Target: embedded microprocessors
– Including Rijndael encryption (AES)
• NetBench
– Target: NPUs
– Including Message-Digest 5 (MD5) and URL-based switching
• Source available in C
• Limitation: single-threaded
The Simulator
• Modified existing multiprocessor simulator
• Built on SimpleScalar
• Modeled after Intel IXP1200
– Modeled processing units, memory, and cache
structure
– Processors share memory bus
– SRAM reserved for instruction stacks
Parameter         StrongARM          Microengines
Scheduling        Out-of-order       In-order
Width             1 (single-issue)   1 (single-issue)
L1 I Cache Size   16 KByte           SRAM (0 penalty)
L1 D Cache Size   8 KByte            1 KByte (replaces registers)
Methods of Use
• Simulator compiles on Linux using GCC
• Takes SimpleScalar binary as input
sim3ixp1200 [-h] [sim-args] program [program-args]

• Threads argument controls number of microengine threads (0-24)
• 6 microengines allotted threads using round-robin
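A hypothetical invocation; the threads-flag spelling and benchmark file name are assumptions for illustration, not taken from the simulator's actual usage text:

sim3ixp1200 -threads 8 md5_benchmark

This would run the MD5 benchmark with 8 ME threads distributed round-robin across the 6 microengines.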
Application Development
• Developed in C
• Compiled using GCC 2.7.2.3 cross-compiler
– Linux/x86 → SimpleScalar
• No POSIX thread support, same binary
executed by each thread
• No memory management
• Multi-threading
– getcpu()
– barrier()
– ncpus
Example Code
// common initialization

barrier();

int thread_id = getcpu();

if (thread_id == 0) {
    // StrongARM
}
else if (thread_id == 1) {
    // 1st microengine thread
}
else {
    // microengine threads 2 through ncpus - 1
}
Benchmark Applications
• Modified 3 kernels from MiBench and
NetBench
– Message-Digest 5 (MD5)
– URL-based switching (URL)
– Advanced Encryption Standard (AES)
[Rijndael]
• Modified memory allocations
• Modified source of incoming packets
• Parallelized
MD5
• Creates a 128-bit signature of input
• Used extensively in public-key
cryptography and verification of data
integrity
• Packet processing offloaded to
microengine (ME) threads
• Packets processed in parallel
MD5 Algorithm

[Diagram: incoming packets distributed across microengine threads]

• Every packet processed on separate ME thread
• StrongARM monitors for idle threads and assigns work (see the sketch below)
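A minimal sketch of this dispatch pattern in C. Only getcpu(), barrier(), and ncpus come from the simulator interface described earlier; every other type and helper here (packet_t, next_packet(), find_idle_thread(), and so on) is a hypothetical stand-in for illustration:

typedef struct {
    unsigned char *data;                 /* packet payload                  */
    int len;                             /* payload length in bytes         */
    unsigned char sig[16];               /* 128-bit MD5 signature           */
} packet_t;

extern void barrier(void);               /* from the simulator              */
extern int  getcpu(void);                /* from the simulator              */
extern packet_t *next_packet(void);      /* assumed: NULL when stream ends  */
extern int  find_idle_thread(void);      /* assumed: -1 if no ME is idle    */
extern void assign_packet(int thr, packet_t *p);    /* assumed             */
extern packet_t *wait_for_packet(int thr);          /* assumed             */
extern void mark_idle(int thr);                     /* assumed             */
extern void md5_digest(const unsigned char *d, int n, unsigned char out[16]);

void run_md5(void)
{
    barrier();                           /* all threads start together      */
    if (getcpu() == 0) {                 /* thread 0: StrongARM controller  */
        packet_t *p;
        while ((p = next_packet()) != NULL) {
            int t;
            while ((t = find_idle_thread()) < 0)
                ;                        /* spin until an ME thread is idle */
            assign_packet(t, p);
        }
    } else {                             /* ME thread: hash packets         */
        for (;;) {
            packet_t *p = wait_for_packet(getcpu());
            md5_digest(p->data, p->len, p->sig);
            mark_idle(getcpu());
        }
    }
}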
MD5 Parallelization

[Diagram: StrongARM handing packets to microengine threads]
URL
• Directs packets based on payload content
• Useful for load-balancing, fault detection
and recovery
• Layer 7 switch, content-switch, web-switch
• Uses pattern matching algorithm
URL Algorithm

[Diagram: incoming packets; StrongARM walks the search tree and assigns work to microengines]

• Work for each packet split among ME threads
• StrongARM iterates over search tree, assigning work to idle ME threads (sketched below)
• ME threads report when match found
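A sketch of that division of labor in C. The tree-walk-on-StrongARM, match-on-ME split is from the slide; the search tree is flattened into a list for brevity, and every identifier below is a hypothetical stand-in:

typedef struct node {
    const char *pattern;                 /* URL pattern at this tree node   */
    struct node *next;                   /* simplified: linearized search   */
} node_t;

extern int  find_idle_thread(void);      /* assumed: -1 if no ME is idle    */
extern void assign_match(int thr, const unsigned char *payload, int len,
                         const char *pattern);      /* assumed             */
extern int  match_reported(void);        /* assumed: set when an ME matches */

/* StrongARM side: walk the search tree, farming each pattern out to an
 * idle ME thread, and stop as soon as any thread reports a match. */
void dispatch_url(node_t *tree, const unsigned char *payload, int len)
{
    node_t *n = tree;
    while (n != NULL && !match_reported()) {
        int t = find_idle_thread();
        if (t >= 0) {
            assign_match(t, payload, len, n->pattern);
            n = n->next;                 /* advance through the tree        */
        }                                /* else: all MEs busy, retry       */
    }
}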
URL Parallelization

[Diagram: StrongARM coordinating microengine threads]
AES
• Block cipher encryption algorithm
• Made US government standard in 2001
• 256-bit key
• Same parallelization technique as MD5
• Key loaded into each ME’s stack during
initialization
• Packet encryption performed in parallel (sketched below)
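A sketch of the per-thread encryption loop in C. The 16-byte block size and the per-ME key copy are from the slides; aes_expand_key() and aes_encrypt_block() are assumed stand-ins for the kernel's Rijndael routines:

typedef struct { unsigned int round_keys[60]; } aes_ctx_t;   /* 256-bit key */

extern void aes_expand_key(aes_ctx_t *ctx, const unsigned char key[32]);
extern void aes_encrypt_block(aes_ctx_t *ctx, unsigned char block[16]);

/* ME thread side: the key is expanded once into the ME's stack during
 * initialization, then each packet is encrypted in 16-byte chunks. */
void encrypt_packet(aes_ctx_t *ctx, unsigned char *data, int len)
{
    int off;
    for (off = 0; off + 16 <= len; off += 16)
        aes_encrypt_block(ctx, data + off);   /* one 16-byte block          */
    /* padding of any trailing partial block omitted in this sketch */
}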
Performance Tests
• Purpose
– Evaluate multi-threading kernels and end-to-
end applications
• Tests
– Isolation
– Shared
– Static
– Dynamic
Isolation Tests
• Establish baseline
• Explore effects of multi-threading kernels
• Each kernel run in isolation
• Number of ME threads varied from 1 to 24
• Speedup graphed against serial version
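Speedup here is the standard ratio (a general definition, not specific to this thesis):

speedup(n) = T_serial / T_parallel(n)

where T_serial is the runtime of the single-threaded StrongARM version and T_parallel(n) the runtime with n ME threads.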
MD5 Isolation Results

[Graph: speedup vs. number of ME threads]
• 0: serial on StrongARM
• 1-24: parallel on MEs
• Decreased speedup on 1 ME
• Significant speedup overall
• Note decreasing slope at 7, 13, and 19 threads
URL Isolation Results

[Graph: speedup vs. number of ME threads]

• When 1 thread finds a match, it must wait for other threads to finish
– Polling version required each thread to poll a global flag
– Performed slightly worse (1.64 compared to 1.75)
– Matching pattern found in 40% of packets
• When too many threads work at once, shared resource bottlenecks limit speedup
AES Isolation Results

[Graph: speedup vs. number of ME threads]

• Performs poorly on MEs
• Packets processed in 16-byte chunks
• State maintained in accumulator for packet lifetime
• Static lookup table of 8 KBytes
• L1 data cache: 8 KBytes for StrongARM, 1 KByte for MEs
• Consumes more cycles on ME by factor of 8.4
Shared Tests
• Reveal sensitivity of each kernel to
concurrent execution of other kernels
• StrongARM serves as controller
• Baseline of 1 MD5, 4 URL, and 1 AES
thread
• Separate packet streams for each kernel
• Number of threads increased for kernel
under test
Shared Results

[Graph: speedup of each kernel as its thread count grows]

• MD5: not substantially affected
• URL: maximum of 1.17 (compared to 1.75)
• AES: order of magnitude higher
– Baseline uses ME, not StrongARM
Static Tests
• Characteristics of end-to-end application
• Location of bottlenecks
• Kernels work together to process single
packet stream
• Find optimal thread configuration
End-to-End Application

• Distribution of sensitive information from trusted network over Internet to different hosts
1. Calculate MD5 signature
2. Determine destination host using URL
3. Encrypt packet using AES
4. Send packet and signature to host
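A sketch of the per-packet path those four steps describe, in C; the function names are hypothetical stand-ins, not the thesis' actual routines:

extern void md5_digest(const unsigned char *d, int n, unsigned char out[16]);
extern int  lookup_destination(const unsigned char *payload, int len);
extern void aes_encrypt_packet(unsigned char *data, int len);
extern void send_to_host(int host, const unsigned char *data, int len,
                         const unsigned char sig[16]);

void process_packet(unsigned char *data, int len)
{
    unsigned char sig[16];
    int host;
    md5_digest(data, len, sig);           /* 1. calculate MD5 signature     */
    host = lookup_destination(data, len); /* 2. pick host via URL switching */
    aes_encrypt_packet(data, len);        /* 3. encrypt packet with AES     */
    send_to_host(host, data, len, sig);   /* 4. send packet and signature   */
}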
Static Results

[Graph: speedup per thread configuration]

• Baseline of 1 MD5, 4 URL, and 1 AES thread
• Additional thread tried on each kernel
• Best configuration used as starting point for next
• Final result 1 MD5, 11 URL, and 12 AES threads
Static Results (cont.)
• Although MD5 showed the best speedup in Isolation, additional threads could not improve speedup in Static
– Amdahl's Law: 1 / ((1 – P) + (P / S)) (worked example below)
• More threads initially allocated to URL
– URL bottleneck until 10 threads
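As a worked illustration of Amdahl's Law with hypothetical numbers (chosen only to show the shape of the limit, not measured in the thesis): if MD5 accounts for P = 0.5 of the end-to-end work and that portion is sped up by S = 8, the overall speedup is 1 / ((1 – 0.5) + (0.5 / 8)) = 1 / 0.5625 ≈ 1.78, far below 8. The serial fraction caps the benefit, so extra MD5 threads buy little.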
Dynamic Tests
• MEs not dedicated to single kernel,
instead assigned work by StrongARM
based on demand
• StrongARM responsible for allocating
threads and maintaining wait-queues
• Realistic configuration
• Increased development complexity
Dynamic Algorithm

[Diagram: StrongARM feeding MD5, URL, and AES work to microengine threads through packet queues]

• MD5 → URL → AES
• StrongARM monitors MEs (scheduler sketched below)
• Assigns work to idle threads
– First from queues, then from incoming packet stream
• Packet queues
– AES queue
– URL queue
– Network
• URL queue fills as MD5 outperforms URL
• Additional threads created for URL
• AES threads created each time URL finishes
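A sketch of that on-demand scheduler in C. The queues-before-network ordering is from the slide; the AES-before-URL priority and all identifiers below are assumptions:

typedef struct {
    int kind;                            /* WORK_MD5, WORK_URL, or WORK_AES */
    void *payload;
} work_t;

enum { WORK_MD5, WORK_URL, WORK_AES };

extern work_t *dequeue_aes(void);        /* assumed: NULL if queue empty    */
extern work_t *dequeue_url(void);        /* assumed: NULL if queue empty    */
extern work_t *dequeue_network(void);    /* assumed: next raw packet (MD5)  */
extern int  find_idle_thread(void);      /* assumed: -1 if no ME is idle    */
extern void assign_work(int thr, work_t *w);        /* assumed             */

/* StrongARM side: keep every ME thread busy, preferring queued work from
 * later pipeline stages over fresh packets from the network. */
void dynamic_scheduler(void)
{
    for (;;) {
        int t = find_idle_thread();
        work_t *w;
        if (t < 0)
            continue;                    /* all ME threads busy             */
        w = dequeue_aes();
        if (w == NULL) w = dequeue_url();
        if (w == NULL) w = dequeue_network();
        if (w != NULL)
            assign_work(t, w);           /* ME runs MD5, URL, or AES work   */
    }
}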
Dynamic Results

[Graph: speedup of Dynamic compared to Static]

• Baseline same as Static
• Substantial speedup over Static
Dynamic Results (cont.)

• 25% as many cycles as Static
• Some ME threads in Static sit idle, wasting cycles
• Less affected by URL bottleneck
• Able to adjust to varying packet sizes
Analysis
• Isolation
– Established baseline
• Shared
– Explored concurrent kernels
• Static
– End-to-end application characteristics
– Thread allocation optimization
• Dynamic
– Contrast on-demand to static thread allocation
Conclusion
• NPU multi-processor simulator
• Multi-threaded end-to-end benchmark
applications
• Analysis of benchmarks on NPU simulator
– Kernel performance is not indicative of end-to-
end application performance
– MD5 scaled well in Isolation and Shared, but had little effect in end-to-end applications
Future Work
• NPU simulator
– Already used in two other M.S. thesis projects
– Larger cycle count capability
– Updated to model current NPU generation
• End-to-end applications
– Simulated on next-generation simulator
– Further investigation into bottlenecks
Future Work (cont.)
• Benchmark suite
– Include additional kernels
– Model more real-world end-to-end
applications
Thank You, Questions
