Valencia, Spain September 7, 2010

The Future Is Heterogeneous Computing


Mike Houston
Principal Architect, Accelerated Parallel Processing
Advanced Micro Devices
October 27th, 2010

Page 1

| The Future Is Heterogeneous Computing | Oct 27, 2010

Workload Example: Changing Consumer Behavior

- 20 hours of video uploaded to YouTube every minute
- Approximately 9 billion video files owned are high-definition
- 50 million+ digital media files added to personal content libraries every day
- 1000 images uploaded to Facebook every second

Challenges for Next Generation Systems

- The Power Wall: even more broadly constraining in the future!
- Complexity management (HW and SW): principles for managing exponential growth; parallelism, programmability and efficiency
- Optimized SW for system-level solutions
- System balance: memory technologies and system design; interconnect design

The Power Wall

- Easy prediction: power will continue to be the #1 design constraint for computer systems design. Why?
  - Vmin will not continue tracking Moore's Law
  - Integration of system-level components consumes chip power
  - A well-utilized 100GB/sec DDR memory interface consumes ~15W for the I/O alone!
- 2nd-order effects of power:
  - Thermal, packaging & cooling (node-level & datacenter-level)
  - Electrical stability in the face of rising variability
- Thermal Design Points (TDPs) in all market segments continue to drop
- Lightly loaded and idle power characteristics are key parameters in the Operational Expense (OpEx) equation
- The percent of total world energy consumed by computing devices continues to grow year-on-year
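The slide's I/O power claim is easy to sanity-check with a little arithmetic. A minimal sketch (the 15W and 100GB/sec figures come from the slide; the picojoules-per-bit framing is an illustrative assumption):

```python
def io_energy_per_bit_pj(power_watts, bandwidth_bytes_per_sec):
    """I/O energy cost in picojoules per bit transferred."""
    bits_per_sec = bandwidth_bytes_per_sec * 8
    return power_watts / bits_per_sec * 1e12  # J/bit -> pJ/bit

# A well-utilized 100GB/sec DDR interface burning ~15W on I/O alone
# works out to roughly 18.75 pJ for every bit moved.
print(io_energy_per_bit_pj(15, 100e9))  # 18.75
```

That per-bit cost is fixed by the PHY, which is why I/O power scales with bandwidth rather than shrinking with process.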

Optimized SW for System-level Solutions

- Long history of SW optimizations for HW characteristics:
  - Optimizing compilers
  - Cache / TLB blocking
  - Multi-processor coordination: communication & synchronization
  - Non-uniform memory characteristics: process and memory affinity
- The scarcity/abundance principle favors increased use of abstractions:
  - Abstraction leads to increased productivity but costs performance
  - Still allow experts to burrow down into lower-level, on-the-metal details
- The system-level integration era will demand even more:
  - Many-core: user-mode and/or managed-runtime scheduling?
  - Heterogeneous many-core: capability-aware scheduling?
- SW productivity versus optimization dichotomy:
  - Exposed HW leads to better performance but requires a programming model aware of platform characteristics

The Memory Wall: Getting Thicker

There has always been a critical balance between data availability and processing. Situation / when / implication / industry solutions:

- DRAM vs. CPU cycle-time gap (early 1990s): memory wait time dominates computing. Solutions: non-blocking caches, out-of-order machines, larger caches, cache hierarchies, elaborate prefetch.
- SW productivity crisis, i.e. object-oriented languages and managed runtime environments (mid 1990s): larger working sets, more diverse data types. Solutions: huge caches.
- Single-thread to CMP focus (2005 and beyond): multiple working sets! Virtual machines! More memory accesses. Solutions: multiple memory controllers, extreme PHYs.
- New & emerging abstractions, i.e. browser-based runtimes, image/video as basic data types, throughput-based designs (2009 and beyond): even larger working sets, larger data types. Solutions: accelerated parallel processing, chip stacking, TBD.

Interconnect Challenges

- Coherence domain: knowing when to stop
  - Interesting implications for on-chip interconnect networks
- Industry mantra: never bet against Ethernet
  - But current Ethernet is not well suited for lossless transmission
  - Troublesome for storage, messaging and more
- The more subtle and trickier problems:
  - Adaptive routing, congestion management, QoS, end-to-end characteristics, and more
- The data centers of tomorrow are going to take great interest in this area

Single-thread Performance

[Figure: four trend charts, each marked "we are here". Moore's Law: integration (log scale) vs. time, still climbing. The Power Wall: power budget (TDP) vs. time, flattening (server: power = $$; desktop: eliminate fans; mobile: battery), with DFM, variability, reliability and wire delay as contributing limits. The Frequency Wall: frequency vs. time, flattening. The IPC Complexity Wall: IPC vs. issue width, flat. A companion plot of performance vs. cache size shows diminishing returns from locality. The net effect: single-thread performance has leveled off.]

Parallel Programs and Amdahl's Law

- Speed-up = 1 / (SW + (1 - SW) / N), where SW is the fraction of serial work and N is the number of processors
- [Figure: speed-up vs. number of CPU cores (1 to 128) for 0%, 10%, 35% and 100% serial work; even modest serial fractions flatten the curve well below linear scaling]
- Power budget example: assume a 100W TDP socket, with 10W for global clocking, 20W for the on-chip network/caches and 15W for I/O (memory, PCIe, etc.). This leaves 55W for all the cores: ~850mW per core!
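The speed-up formula and the power-budget arithmetic on this slide can be checked directly. A small sketch (assuming 64 cores for the per-core figure, which is consistent with the slide's ~850mW result):

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's Law: speed-up = 1 / (SW + (1 - SW) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even 10% serial work caps speed-up near 10x, regardless of core count:
for n in (8, 64, 128):
    print(n, round(amdahl_speedup(0.10, n), 2))

# Power budget: 100W TDP minus 10W clocking, 20W network/caches, 15W I/O
# leaves 55W for the cores; at 64 cores that is under a watt each.
per_core_mw = (100 - 10 - 20 - 15) / 64 * 1000
print(round(per_core_mw))
```

The asymptote 1/SW is what makes the serial fraction, not the core count, the dominant term at scale.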

35 Years of Microprocessor Trend Data

[Figure: trends over time for transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (Watts) and number of cores.]

Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten; dotted-line extrapolations by C. Moore.

The Power Wall Again!

- Escalating multi-core designs will crash into the power wall just as single cores did with escalating frequency. Why?
  - To maintain a reasonable balance, core additions must be accompanied by increases in other resources that consume power (on-chip network, caches, memory and I/O BW, ...)
  - Spiral-upwards effect on power
- The use of multiple cores forces each core to actually slow down
  - At some point, the power limits will not even allow you to activate all of the cores at the same time
- Small, low-power cores tend to be very weak on single-threaded general-purpose workloads
  - The customer value proposition will continue to demand excellent performance on general-purpose workloads
  - The transition to compelling general-purpose parallel workloads will not be a fast one

What about Throughput Computing?

- Works around Amdahl's Law by focusing on the throughput of multiple independent tasks
  - Servers: transaction processing, web clicks, search queries
  - Clients: graphics, multimedia, sensory inputs (future)
  - HPC: data-level parallelism
- New bottlenecks start to appear:
  - At some point, the OS itself becomes the serial component: hence user-mode scheduling and task-stealing runtimes
  - Memory BW: the goal is to saturate the pipeline to memory with a large number of outstanding references and a large number of active and/or standby threads
  - Power: overall utilization goes up, and so does power consumption; still the #1 constraint in modern computer design

Three Eras of Processor Performance

- Single-Core Era (metric: single-thread performance vs. time)
  - Enabled by: Moore's Law, voltage scaling, microarchitecture
  - Constrained by: power, complexity
- Multi-Core Era (metric: throughput performance vs. time and # of processors)
  - Enabled by: Moore's Law, desire for throughput, 20 years of SMP architecture
  - Constrained by: power, parallel SW availability, scalability
- Heterogeneous Systems Era (metric: targeted application performance vs. time and data-parallel exploitation)
  - Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
  - Currently constrained by: programming models, communication overheads

AMD x86 64-bit CMP Evolution

- 2003, AMD Opteron: 90nm SOI, K8 core, 1MB L2 / no L3, 3x 1.6GT/s HyperTransport, 2x DDR1-300
- 2005, Dual-Core AMD Opteron: 90nm SOI, K8 core, 1MB L2 / no L3, 3x 1.6GT/s HyperTransport, 2x DDR1-400
- 2007, Quad-Core AMD Opteron: 65nm SOI, Greyhound core, 512kB L2 / 2MB L3, 3x 2GT/s HyperTransport, 2x DDR2-667
- 2008, 45nm Quad-Core AMD Opteron: 45nm SOI, Greyhound+ core, 512kB L2 / 6MB L3, 3x 4.0GT/s HyperTransport, 2x DDR2-800
- 2009, Six-Core AMD Opteron: 45nm SOI, Greyhound+ core, 512kB L2 / 6MB L3, 3x 4.8GT/s HyperTransport, 2x DDR2-1066
- 2010, AMD Opteron 6100 Series: 45nm SOI, Greyhound+ core, 512kB L2 / 12MB L3, 4x 6.4GT/s HyperTransport, 4x DDR3-1333

The max power budget remains consistent across the line.

AMD Opteron 6100 Series Silicon and Package

[Die photo: six cores plus L3 cache per die.]

- 12 AMD64 x86 cores
- 18 MB on-chip cache
- 4 memory channels @ 1333 MHz
- 4 HT links @ 6.4 GT/sec

AMD Radeon HD5870 GPU Architecture


GPU Processing Performance Trend

[Figure: peak single-precision GigaFLOPS vs. time, late 2005 through 2009, rising from the R520 generation (ATI Radeon X1800, ATI FireGL V7200/V7300/V7350) through R580(+) (X19xx), R600 (ATI Radeon HD 2900, ATI FireGL V7600/V8600/V8650), RV670 (ATI Radeon HD 3800, ATI FireGL V7700, AMD FireStream 9170) and RV770 (ATI Radeon HD 4800, ATI FirePro V8700, AMD FireStream 9250/9270) to Cypress (ATI Radeon HD 5870) at roughly 2700 GFLOPS. Milestones marked along the curve: GPGPU via CTM; unified shaders; double-precision floating point; a 2.5x ALU increase with the Stream SDK (CAL+IL/Brook+); and, for Cypress, OpenCL 1.1+, DirectX 11 and 2.25x performance.]

* Peak single-precision performance; for RV670, RV770 & Cypress, divide by 5 for peak double-precision performance.

GPU Efficiency

[Figure: GFLOPS/W and GFLOPS/mm2 for six GPUs: ATI Radeon X1800 XT (Nov-05), X1900 XTX (Jan-06), HD 2900 PRO (Sep-07), HD 3870 (Nov-07), HD 4870 (Jun-08) and HD 5870 (Oct-09). Both metrics climb steeply across generations, from 0.42 GFLOPS/W on the X1800 XT to 14.47 GFLOPS/W and 7.90 GFLOPS/mm2 on the HD 5870.]

AMD Accelerated Parallel Processing (APP) Technology is:

- Heterogeneous: developers leverage AMD GPUs and CPUs for optimal application performance and user experience
- High performance: a massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency
- Industry standards: OpenCL enables cross-platform development

Target segments: gaming, digital content creation, productivity, sciences, government, engineering.

Moving Past Proprietary Solutions for Ease of Cross-Platform Programming

- Open and custom tools: high-level language compilers, high-level tools, application-specific libraries
- Industry-standard interfaces: DirectX, OpenCL, OpenGL
- Targets: AMD GPUs, AMD CPUs, other CPUs/GPUs

OpenCL:
- Cross-platform development
- Interoperability with OpenGL and DX
- CPU/GPU backends enable a balanced platform approach

Heterogeneous Computing:
Next-Generation Software Ecosystem

Software stack (top to bottom):
- End-user applications
- High-level frameworks: advanced optimizations & load balancing
- Tools: HLL compilers, debuggers, profilers
- Middleware/libraries: video, imaging, math/sciences, physics
- OpenCL & DirectCompute
- Hardware & drivers: AMD Fusion, discrete CPUs/GPUs

Goals: increase the ease of application development; load balance across CPUs and GPUs, leveraging AMD Fusion performance advantages; drive new features into industry standards.

AMD Balanced Platform Advantage

- The CPU is excellent for running some algorithms
  - Ideal place to process if the GPU is fully loaded
  - Great use for additional CPU cores
- The GPU is ideal for data-parallel algorithms like image processing, CAE, etc.
  - Great use for AMD Accelerated Parallel Processing (APP) technology
  - Great use for additional GPUs

Workload split: graphics workloads and other highly parallel workloads on the GPU; serial/task-parallel workloads on the CPU. Delivers advanced performance for a wide range of platform configurations.

Challenges: Extracting Parallelism

A 2D array representing a very large dataset can be processed at three granularities:

- Fine-grain data parallel code: loop 1M times for 1M pieces of data

      i=0
      i++
      load x(i)
      fmul
      store
      cmp i (1000000)
      bc

  Maps very well to throughput-oriented data-parallel engines.

- Coarse-grain data parallel code: loop 16 times for 16 pieces of data

      i=0
      i++
      load x(i)
      fmul
      store
      cmp i (16)
      bc

  Maps very well to integrated SIMD dataflow (i.e., SSE).

- Nested data parallel code:

      i,j=0
      i++
      j++
      load x(i,j)
      fmul
      store
      cmp j (100000)
      bc
      cmp i (100000)
      bc

  Lots of conditional data parallelism; benefits from closer coupling between CPU & GPU.
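The three granularities above can be sketched in Python, with NumPy standing in for the wide data-parallel engine (the array shapes and the multiply-by-2 kernel are illustrative assumptions, not from the slide):

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float32)

# Coarse-grain: loop 16 times, each iteration one wide vector op --
# the shape that maps well to integrated SIMD dataflow such as SSE.
chunks = x.reshape(16, -1)
coarse = np.empty_like(chunks)
for i in range(16):
    coarse[i] = chunks[i] * 2.0

# Fine-grain: one data-parallel op across all 1M elements -- the shape
# a throughput-oriented engine (GPU) is built to saturate.
fine = x * 2.0

# Nested: a 2D dataset with inner and outer parallel dimensions --
# the case that benefits from close CPU-GPU coupling.
grid = x.reshape(1000, 1000)
nested = (grid * 2.0).ravel()

print(np.allclose(coarse.ravel(), fine) and np.allclose(nested, fine))  # True
```

All three compute the same result; what differs is how much parallel work each "iteration" hands to the hardware.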

A New Era of Processor Performance

[Chart: programmability vs. throughput performance. On the CPU side, microprocessor advancement moves through the Single-Core Era into the Multi-Core Era; on the GPU side, GPU advancement moves from graphics driver-based programs to OpenCL/DX driver-based programs. The two trajectories converge in the Heterogeneous Systems Era: system-level programmable, heterogeneous computing, in contrast to the homogeneous computing of the earlier eras.]

Now the AMD Fusion Era of Computing Begins
