Valencia, Spain September 7, 2010

The Future Is Heterogeneous Computing


Mike Houston
Principal Architect, Accelerated Parallel Processing
Advanced Micro Devices
October 27th, 2010

Page 1

| The Future Is Heterogeneous Computing | Oct 27, 2010

Workload Example: Changing Consumer Behavior

- 20 hours of video uploaded to YouTube every minute
- Approximately 9 billion video files owned are high-definition
- 50 million+ digital media files added to personal content libraries every day
- 1000 images uploaded to Facebook every second

Challenges for Next Generation Systems

- The Power Wall: even more broadly constraining in the future!
- Complexity management (HW and SW): principles for managing exponential growth; parallelism, programmability and efficiency
- Optimized SW for system-level solutions
- System balance: memory technologies and system design; interconnect design

The Power Wall

- Easy prediction: power will continue to be the #1 design constraint for computer systems design. Why?
  - Vmin will not continue tracking Moore's Law
  - Integration of system-level components consumes chip power
  - A well-utilized 100GB/sec DDR memory interface consumes ~15W for the I/O alone!
- 2nd-order effects of power:
  - Thermal, packaging & cooling (node-level & datacenter-level)
  - Electrical stability in the face of rising variability
- Thermal Design Points (TDPs) in all market segments continue to drop
- Lightly loaded and idle power characteristics are key parameters in the Operational Expense (OpEx) equation
- The percent of total world energy consumed by computing devices continues to grow year-on-year
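The slide's I/O power claim is easy to sanity-check with a little arithmetic. A minimal sketch (the 15W and 100GB/sec figures come from the slide; the picojoules-per-bit framing is an illustrative assumption):

```python
def io_energy_per_bit_pj(power_watts, bandwidth_bytes_per_sec):
    """I/O energy cost in picojoules per bit transferred."""
    bits_per_sec = bandwidth_bytes_per_sec * 8
    return power_watts / bits_per_sec * 1e12  # J/bit -> pJ/bit

# A well-utilized 100GB/sec DDR interface burning ~15W on I/O alone
# works out to roughly 18.75 pJ for every bit moved.
print(io_energy_per_bit_pj(15, 100e9))  # 18.75
```

That per-bit cost is fixed by the PHY, which is why I/O power scales with bandwidth rather than shrinking with process.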

Optimized SW for System-level Solutions

- Long history of SW optimizations for HW characteristics:
  - Optimizing compilers
  - Cache / TLB blocking
  - Multi-processor coordination: communication & synchronization
  - Non-uniform memory characteristics: process and memory affinity
- The scarcity/abundance principle favors increased use of abstractions:
  - Abstraction leads to increased productivity but costs performance
  - Still allow experts to burrow down into lower-level, on-the-metal details
- The system-level integration era will demand even more:
  - Many-core: user-mode and/or managed-runtime scheduling?
  - Heterogeneous many-core: capability-aware scheduling?
- SW productivity versus optimization dichotomy:
  - Exposed HW leads to better performance but requires a programming model aware of platform characteristics

The Memory Wall: Getting Thicker

There has always been a critical balance between data availability and processing. Situation / when / implication / industry solutions:

- DRAM vs. CPU cycle-time gap (early 1990s): memory wait time dominates computing. Solutions: non-blocking caches, out-of-order machines, larger caches, cache hierarchies, elaborate prefetch.
- SW productivity crisis, i.e. object-oriented languages and managed runtime environments (mid 1990s): larger working sets, more diverse data types. Solutions: huge caches.
- Single-thread to CMP focus (2005 and beyond): multiple working sets! Virtual machines! More memory accesses. Solutions: multiple memory controllers, extreme PHYs.
- New & emerging abstractions, i.e. browser-based runtimes, image/video as basic data types, throughput-based designs (2009 and beyond): even larger working sets, larger data types. Solutions: accelerated parallel processing, chip stacking, TBD.

Interconnect Challenges

- Coherence domain: knowing when to stop
  - Interesting implications for on-chip interconnect networks
- Industry mantra: never bet against Ethernet
  - But current Ethernet is not well suited for lossless transmission
  - Troublesome for storage, messaging and more
- The more subtle and trickier problems:
  - Adaptive routing, congestion management, QoS, end-to-end characteristics, and more
- The data centers of tomorrow are going to take great interest in this area

Single-thread Performance

[Figure: four trend charts, each marked "we are here". Moore's Law: integration (log scale) vs. time, still climbing. The Power Wall: power budget (TDP) vs. time, flattening (server: power = $$; desktop: eliminate fans; mobile: battery), with DFM, variability, reliability and wire delay as contributing limits. The Frequency Wall: frequency vs. time, flattening. The IPC Complexity Wall: IPC vs. issue width, flat. A companion plot of performance vs. cache size shows diminishing returns from locality. The net effect: single-thread performance has leveled off.]

Parallel Programs and Amdahl's Law

- Speed-up = 1 / (SW + (1 - SW) / N), where SW is the fraction of serial work and N is the number of processors
- [Figure: speed-up vs. number of CPU cores (1 to 128) for 0%, 10%, 35% and 100% serial work; even modest serial fractions flatten the curve well below linear scaling]
- Power budget example: assume a 100W TDP socket, with 10W for global clocking, 20W for the on-chip network/caches and 15W for I/O (memory, PCIe, etc.). This leaves 55W for all the cores: ~850mW per core!
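The speed-up formula and the power-budget arithmetic on this slide can be checked directly. A small sketch (assuming 64 cores for the per-core figure, which is consistent with the slide's ~850mW result):

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's Law: speed-up = 1 / (SW + (1 - SW) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even 10% serial work caps speed-up near 10x, regardless of core count:
for n in (8, 64, 128):
    print(n, round(amdahl_speedup(0.10, n), 2))

# Power budget: 100W TDP minus 10W clocking, 20W network/caches, 15W I/O
# leaves 55W for the cores; at 64 cores that is under a watt each.
per_core_mw = (100 - 10 - 20 - 15) / 64 * 1000
print(round(per_core_mw))
```

The asymptote 1/SW is what makes the serial fraction, not the core count, the dominant term at scale.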

35 Years of Microprocessor Trend Data

[Figure: trends over time for transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (Watts) and number of cores.]

Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten; dotted-line extrapolations by C. Moore.

The Power Wall Again!

- Escalating multi-core designs will crash into the power wall just as single cores did with escalating frequency. Why?
  - To maintain a reasonable balance, core additions must be accompanied by increases in other resources that consume power (on-chip network, caches, memory and I/O BW, ...)
  - Spiral-upwards effect on power
- The use of multiple cores forces each core to actually slow down
  - At some point, the power limits will not even allow you to activate all of the cores at the same time
- Small, low-power cores tend to be very weak on single-threaded general-purpose workloads
  - The customer value proposition will continue to demand excellent performance on general-purpose workloads
  - The transition to compelling general-purpose parallel workloads will not be a fast one

What about Throughput Computing?

- Works around Amdahl's Law by focusing on the throughput of multiple independent tasks
  - Servers: transaction processing, web clicks, search queries
  - Clients: graphics, multimedia, sensory inputs (future)
  - HPC: data-level parallelism
- New bottlenecks start to appear:
  - At some point, the OS itself becomes the serial component: hence user-mode scheduling and task-stealing runtimes
  - Memory BW: the goal is to saturate the pipeline to memory with a large number of outstanding references and a large number of active and/or standby threads
  - Power: overall utilization goes up, and so does power consumption; still the #1 constraint in modern computer design

Three Eras of Processor Performance

- Single-Core Era (metric: single-thread performance vs. time)
  - Enabled by: Moore's Law, voltage scaling, microarchitecture
  - Constrained by: power, complexity
- Multi-Core Era (metric: throughput performance vs. time and # of processors)
  - Enabled by: Moore's Law, desire for throughput, 20 years of SMP architecture
  - Constrained by: power, parallel SW availability, scalability
- Heterogeneous Systems Era (metric: targeted application performance vs. time and data-parallel exploitation)
  - Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
  - Currently constrained by: programming models, communication overheads

AMD x86 64-bit CMP Evolution

- 2003, AMD Opteron: 90nm SOI, K8 core, 1MB L2 / no L3, 3x 1.6GT/s HyperTransport, 2x DDR1-300
- 2005, Dual-Core AMD Opteron: 90nm SOI, K8 core, 1MB L2 / no L3, 3x 1.6GT/s HyperTransport, 2x DDR1-400
- 2007, Quad-Core AMD Opteron: 65nm SOI, Greyhound core, 512kB L2 / 2MB L3, 3x 2GT/s HyperTransport, 2x DDR2-667
- 2008, 45nm Quad-Core AMD Opteron: 45nm SOI, Greyhound+ core, 512kB L2 / 6MB L3, 3x 4.0GT/s HyperTransport, 2x DDR2-800
- 2009, Six-Core AMD Opteron: 45nm SOI, Greyhound+ core, 512kB L2 / 6MB L3, 3x 4.8GT/s HyperTransport, 2x DDR2-1066
- 2010, AMD Opteron 6100 Series: 45nm SOI, Greyhound+ core, 512kB L2 / 12MB L3, 4x 6.4GT/s HyperTransport, 4x DDR3-1333

The max power budget remains consistent across the line.

AMD Opteron 6100 Series Silicon and Package

[Die photo: six cores plus L3 cache per die.]

- 12 AMD64 x86 cores
- 18 MB on-chip cache
- 4 memory channels @ 1333 MHz
- 4 HT links @ 6.4 GT/sec

AMD Radeon HD5870 GPU Architecture


GPU Processing Performance Trend

[Figure: peak single-precision GigaFLOPS vs. time, late 2005 through 2009, rising from the R520 generation (ATI Radeon X1800, ATI FireGL V7200/V7300/V7350) through R580(+) (X19xx), R600 (ATI Radeon HD 2900, ATI FireGL V7600/V8600/V8650), RV670 (ATI Radeon HD 3800, ATI FireGL V7700, AMD FireStream 9170) and RV770 (ATI Radeon HD 4800, ATI FirePro V8700, AMD FireStream 9250/9270) to Cypress (ATI Radeon HD 5870) at roughly 2700 GFLOPS. Milestones marked along the curve: GPGPU via CTM; unified shaders; double-precision floating point; a 2.5x ALU increase with the Stream SDK (CAL+IL/Brook+); and, for Cypress, OpenCL 1.1+, DirectX 11 and 2.25x performance.]

* Peak single-precision performance; for RV670, RV770 & Cypress, divide by 5 for peak double-precision performance.

GPU Efficiency

[Figure: GFLOPS/W and GFLOPS/mm2 for six GPUs: ATI Radeon X1800 XT (Nov-05), X1900 XTX (Jan-06), HD 2900 PRO (Sep-07), HD 3870 (Nov-07), HD 4870 (Jun-08) and HD 5870 (Oct-09). Both metrics climb steeply across generations, from 0.42 GFLOPS/W on the X1800 XT to 14.47 GFLOPS/W and 7.90 GFLOPS/mm2 on the HD 5870.]

AMD Accelerated Parallel Processing (APP) Technology is:

- Heterogeneous: developers leverage AMD GPUs and CPUs for optimal application performance and user experience
- High performance: a massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency
- Industry standards: OpenCL enables cross-platform development

Target segments: gaming, digital content creation, productivity, sciences, government, engineering.

Moving Past Proprietary Solutions for Ease of Cross-Platform Programming

- Open and custom tools: high-level language compilers, high-level tools, application-specific libraries
- Industry-standard interfaces: DirectX, OpenCL, OpenGL
- Targets: AMD GPUs, AMD CPUs, other CPUs/GPUs

OpenCL:
- Cross-platform development
- Interoperability with OpenGL and DX
- CPU/GPU backends enable a balanced platform approach

Heterogeneous Computing:
Next-Generation Software Ecosystem

Software stack (top to bottom):
- End-user applications
- High-level frameworks: advanced optimizations & load balancing
- Tools: HLL compilers, debuggers, profilers
- Middleware/libraries: video, imaging, math/sciences, physics
- OpenCL & DirectCompute
- Hardware & drivers: AMD Fusion, discrete CPUs/GPUs

Goals: increase the ease of application development; load balance across CPUs and GPUs, leveraging AMD Fusion performance advantages; drive new features into industry standards.

AMD Balanced Platform Advantage

- The CPU is excellent for running some algorithms
  - Ideal place to process if the GPU is fully loaded
  - Great use for additional CPU cores
- The GPU is ideal for data-parallel algorithms like image processing, CAE, etc.
  - Great use for AMD Accelerated Parallel Processing (APP) technology
  - Great use for additional GPUs

Workload split: graphics workloads and other highly parallel workloads on the GPU; serial/task-parallel workloads on the CPU. Delivers advanced performance for a wide range of platform configurations.

Challenges: Extracting Parallelism

A 2D array representing a very large dataset can be processed at three granularities:

- Fine-grain data parallel code: loop 1M times for 1M pieces of data

      i=0
      i++
      load x(i)
      fmul
      store
      cmp i (1000000)
      bc

  Maps very well to throughput-oriented data-parallel engines.

- Coarse-grain data parallel code: loop 16 times for 16 pieces of data

      i=0
      i++
      load x(i)
      fmul
      store
      cmp i (16)
      bc

  Maps very well to integrated SIMD dataflow (i.e., SSE).

- Nested data parallel code:

      i,j=0
      i++
      j++
      load x(i,j)
      fmul
      store
      cmp j (100000)
      bc
      cmp i (100000)
      bc

  Lots of conditional data parallelism; benefits from closer coupling between CPU & GPU.
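The three granularities above can be sketched in Python, with NumPy standing in for the wide data-parallel engine (the array shapes and the multiply-by-2 kernel are illustrative assumptions, not from the slide):

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float32)

# Coarse-grain: loop 16 times, each iteration one wide vector op --
# the shape that maps well to integrated SIMD dataflow such as SSE.
chunks = x.reshape(16, -1)
coarse = np.empty_like(chunks)
for i in range(16):
    coarse[i] = chunks[i] * 2.0

# Fine-grain: one data-parallel op across all 1M elements -- the shape
# a throughput-oriented engine (GPU) is built to saturate.
fine = x * 2.0

# Nested: a 2D dataset with inner and outer parallel dimensions --
# the case that benefits from close CPU-GPU coupling.
grid = x.reshape(1000, 1000)
nested = (grid * 2.0).ravel()

print(np.allclose(coarse.ravel(), fine) and np.allclose(nested, fine))  # True
```

All three compute the same result; what differs is how much parallel work each "iteration" hands to the hardware.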

A New Era of Processor Performance

[Chart: programmability vs. throughput performance. On the CPU side, microprocessor advancement moves through the Single-Core Era into the Multi-Core Era; on the GPU side, GPU advancement moves from graphics driver-based programs to OpenCL/DX driver-based programs. The two trajectories converge in the Heterogeneous Systems Era: system-level programmable, heterogeneous computing, in contrast to the homogeneous computing of the earlier eras.]

Now the AMD Fusion Era of Computing Begins
