01-Parallel Computing Explained

Published by: adnanajm on Oct 04, 2010
Copyright:Attribution Non-commercial


  • Parallel Computing Overview
  • Parallelism in our Daily Lives
  • Agenda
  • Parallelism in Computer Programs
  • Parallel Computing
  • Data Parallelism
  • An example of data parallelism
  • Quick Intro to OpenMP
  • OpenMP Loop Parallelism
  • OpenMP Style of Parallelism
  • OpenMP Task Parallelism
  • Parallelism in Computers
  • Operating System Parallelism
  • Two Unix Parallelism Features
  • Arithmetic Parallelism
  • Disk Parallelism
  • Comparison of Parallel Computers
  • Trends and Examples
  • Memory Organization
  • Shared Memory
  • Distributed Shared Memory
  • Flow of Control
  • Flynn's Taxonomy
  • SIMD Computers
  • MIMD Computers
  • SPMD Computers
  • Summary of SIMD versus MIMD
  • Bus Network
  • Tree Network
  • Summary
  • How to Parallelize a Code
  • Task Parallelism
  • Parallelism Issues
  • Porting Issues
  • Recompile
  • Word Length
  • Standards Violations
  • IEEE Arithmetic Differences
  • Diagnostic Listings
  • Further Information
  • Scalar Tuning
  • Compiler Optimizations
  • Statement Level
  • Block Level
  • Routine Level
  • Software Pipelining
  • Loop Unrolling
  • Vendor Tuned Code
  • Parallel Code Tuning
  • Chunk Size
  • Timing and Profiling
  • Timing
  • Timing an Executable
  • Timing a Batch Job
  • Profiling
  • Profiling Analysis
  • Cache Concepts
  • Cache Thrashing
  • Cache Summary
  • Agenda
  • Locating the Cache Problem
  • Cache Tuning Strategy
  • Preserve Spatial Locality
  • Locality Problem
  • Grouping Data Together
  • Cache Thrashing Example
  • Not Enough Cache
  • Parallel Performance Analysis
  • Efficiency
  • About the IBM Regatta P690
  • IBMp690 General Overview
  • IBMp690 Building Blocks
  • Power4 Core
  • Multi-Chip Modules
  • Memory Subsystem
  • Memory distribution within an MCM
  • The Operating System

Parallel Computing Explained

Slides prepared from the CI-Tutor courses at NCSA (http://ci-tutor.ncsa.uiuc.edu/)
By S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University
March 2009

1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690

1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
    1.1.1 Parallelism in our Daily Lives
    1.1.2 Parallelism in Computer Programs
    1.1.3 Parallelism in Computers
      Disk Parallelism
    1.1.4 Performance Measures
    1.1.5 More Parallelism Issues
  1.2 Comparison of Parallel Computers
  1.3 Summary

Parallel Computing Overview
• Who should read this chapter?
  • New users: to learn concepts and terminology.
  • Intermediate users: for review or reference.
  • Management staff: to understand the basic concepts, even if you don't plan to do any programming.
• Note: advanced users may opt to skip this chapter.

Introduction to Parallel Computing
• High performance parallel computers
  • can solve large problems much faster than a desktop computer
    • fast CPUs, large memory, high speed interconnects, and high speed input/output
  • are able to speed up computations
    • by making the sequential components run faster
    • by doing more operations in parallel
• High performance parallel computers are in demand
  • need for tremendous computational capabilities in science, engineering, and business
  • require gigabytes/terabytes of memory and gigaflops/teraflops of performance
  • scientists are striving for petascale performance

Introduction to Parallel Computing
• High performance parallel computers (HPPC) are used in a wide variety of disciplines.
  • Meteorologists: prediction of tornadoes and thunderstorms
  • Computational biologists: analysis of DNA sequences
  • Pharmaceutical companies: design of new drugs
  • Oil companies: seismic exploration
  • Wall Street: analysis of financial markets
  • NASA: aerospace vehicle design
  • Entertainment industry: special effects in movies and commercials
• These complex scientific and business applications all need to perform computations on large datasets or large equations.

Parallelism in our Daily Lives
• There are two types of processes that occur in computers and in our daily lives:
• Sequential processes
  • occur in a strict order
  • it is not possible to do the next step until the current one is completed.
  • Examples
    • The passage of time: the sun rises and the sun sets.
    • Writing a term paper: pick the topic, research, and write the paper.
• Parallel processes
  • many events happen simultaneously
  • Examples
    • Plant growth in the springtime
    • An orchestra


Parallelism in Computer Programs
• Conventional wisdom:
  • Computer programs are sequential in nature.
  • Only a small subset of them lend themselves to parallelism.
  • Algorithm: the "sequence of steps" necessary to do a computation.
  • For the first 30 years of computer use, programs were run sequentially.
• The 1980s saw great successes with parallel computers.
  • Dr. Geoffrey Fox published a book entitled Parallel Computing Works!
  • It describes many scientific accomplishments resulting from parallel computing.
• Computer programs are parallel in nature.
  • Only a small subset of them need to be run sequentially.

Parallel Computing
• What a computer does when it carries out more than one computation at a time using more than one processor.
• By using many processors at once, we can speed up the execution.
  • If one processor can perform the arithmetic in time t,
  • then ideally p processors can perform the arithmetic in time t/p.
  • What if I use 100 processors? What if I use 1000 processors?
• Almost every program has some form of parallelism.
  • You need to determine whether your data or your program can be partitioned into independent pieces that can be run simultaneously.
  • Decomposition is the name given to this partitioning process.
• Types of parallelism:
  • data parallelism
  • task parallelism
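The t/p rule above can be tried out with a few lines of Python. This is only a sketch of the ideal case: the function name `ideal_parallel_time` and the one-hour workload are made up for illustration, and real programs fall short of this bound because of sequential sections and communication overhead.

```python
def ideal_parallel_time(t: float, p: int) -> float:
    """Ideal run time on p processors when one processor needs time t."""
    if p < 1:
        raise ValueError("need at least one processor")
    return t / p


# Hypothetical workload: one hour (3600 s) of arithmetic on one processor.
for p in (1, 100, 1000):
    print(f"{p:5d} processors -> {ideal_parallel_time(3600.0, p):8.1f} s")
```

With 100 processors the ideal time drops from an hour to 36 seconds, and with 1000 processors to 3.6 seconds, provided the work really can be split into that many independent pieces.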

Data Parallelism
• The same code segment runs concurrently on each processor, but each processor is assigned its own part of the data to work on.
  • Do loops (in Fortran) define the parallelism.
  • The iterations must be independent of each other.
  • Data parallelism is called "fine grain parallelism" because the computational work is spread into many small subtasks.
• Example
  • Dense linear algebra, such as matrix multiplication, is a perfect candidate for data parallelism.

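An example of data parallelism

Original Sequential Code

      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO

Parallel Code

! Every K iteration updates the same C(I,J), so C is declared as a
! reduction variable to avoid a data race between threads.
!$OMP PARALLEL DO REDUCTION(+:C)
      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO
!$OMP END PARALLEL DO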

Quick Intro to OpenMP
• OpenMP is a portable standard for parallel directives covering both data and task parallelism.
  • More information about OpenMP is available on the OpenMP website.
  • We will have a lecture on Introduction to OpenMP later.
• With OpenMP, the loop that is performed in parallel is the loop that immediately follows the Parallel Do directive.
  • In our sample code, it's the K loop:
    • DO K=1,N

OpenMP Loop Parallelism

Iteration-Processor Assignments

Processor   Iterations of K   Data Elements
proc0       K=1:5             A(I, 1:5),    B(1:5, J)
proc1       K=6:10            A(I, 6:10),   B(6:10, J)
proc2       K=11:15           A(I, 11:15),  B(11:15, J)
proc3       K=16:20           A(I, 16:20),  B(16:20, J)

The code segment running on each processor

      DO J=1,N
        DO I=1,N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
        END DO
      END DO
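The iteration-to-processor assignment in the table can be modelled in plain Python. The sketch below is not OpenMP and runs sequentially; it only shows how 20 iterations of K split into four contiguous blocks, and that adding up the per-block contributions gives the full matrix product. The helper names (`block_ranges`, `partial_product`, `matmul_by_k_blocks`) are invented for this illustration.

```python
def block_ranges(n_iters, n_procs):
    """Split iterations 1..n_iters into contiguous blocks, one per processor."""
    base, extra = divmod(n_iters, n_procs)
    ranges, start = [], 1
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def partial_product(a, b, k_lo, k_hi):
    """Contribution to C = A*B from K = k_lo..k_hi (1-based, inclusive)."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for k in range(k_lo - 1, k_hi):
        for j in range(n):
            for i in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_by_k_blocks(a, b, n_procs):
    """Sum the per-block contributions, as a reduction over processors would."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for k_lo, k_hi in block_ranges(n, n_procs):
        part = partial_product(a, b, k_lo, k_hi)  # the work of one "processor"
        for i in range(n):
            for j in range(n):
                c[i][j] += part[i][j]
    return c


print(block_ranges(20, 4))  # the K blocks from the table above
```

Because every block contributes to every element of C, the partial results have to be combined at the end; that combining step is what a reduction does in a real parallel run.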

OpenMP Style of Parallelism
• can be done incrementally as follows:
  1. Parallelize the most computationally intensive loop.
  2. Measure the performance of the code.
  3. If performance is not satisfactory, parallelize another loop.
  4. Repeat steps 2 and 3 as many times as needed.
• The ability to perform incremental parallelism is considered a positive feature of data parallelism.
• It is contrasted with the MPI (Message Passing Interface) style of parallelism, which is an "all or nothing" approach.

Task Parallelism
• Task parallelism may be thought of as the opposite of data parallelism.
• Instead of the same operations being performed on different parts of the data, each process performs different operations.
• You can use task parallelism when your program can be split into independent pieces, often subroutines, that can be assigned to different processors and run concurrently.
• Task parallelism is called "coarse grain" parallelism because the computational work is spread into just a few subtasks.
• More code is run in parallel because the parallelism is implemented at a higher level than in data parallelism.
• Task parallelism is often easier to implement and has less overhead than data parallelism.

Task Parallelism
• The abstract code shown in the diagram is decomposed into 4 independent code segments that are labeled A, B, C, and D.
• The right hand side of the diagram illustrates the 4 code segments running concurrently.

Task Parallelism

Original Code

      program main
        code segment labeled A
        code segment labeled B
        code segment labeled C
        code segment labeled D
      end

Parallel Code

      program main
!$OMP PARALLEL
!$OMP SECTIONS
        code segment labeled A
!$OMP SECTION
        code segment labeled B
!$OMP SECTION
        code segment labeled C
!$OMP SECTION
        code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
      end

OpenMP Task Parallelism
• With OpenMP, the code that follows each SECTION(S) directive is allocated to a different processor.
• In our sample parallel code, the allocation of code segments to processors is as follows.

Processor   Code
proc0       code segment labeled A
proc1       code segment labeled B
proc2       code segment labeled C
proc3       code segment labeled D
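The same idea, four independent code segments handed to four workers, can be sketched in Python with the standard `concurrent.futures` module. This is an analogy rather than OpenMP: the four `segment_*` functions are placeholders invented for the example, and a thread pool stands in for the processors.

```python
from concurrent.futures import ThreadPoolExecutor


# Four independent pieces of work, standing in for code segments A, B, C, D.
def segment_a():
    return sum(range(10))


def segment_b():
    return max(3, 7, 5)


def segment_c():
    return "abc".upper()


def segment_d():
    return sorted([3, 1, 2])


def run_sections():
    """Submit each segment to its own worker and collect results in order."""
    segments = [segment_a, segment_b, segment_c, segment_d]
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        futures = [pool.submit(seg) for seg in segments]
        return [f.result() for f in futures]


print(run_sections())
```

The segments share no data, which is what makes them safe to run concurrently; if segment B needed a result from segment A, the decomposition would no longer be valid.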

Parallelism in Computers
• How parallelism is exploited and enhanced within the operating system and hardware components of a parallel computer:
  • operating system
  • arithmetic
  • memory
  • disk

Operating System Parallelism
• All of the commonly used parallel computers run a version of the Unix operating system. In the table below each OS listed is in fact Unix, but the name of the Unix OS varies with each vendor.

Parallel Computer       OS
SGI Origin2000          IRIX
HP V-Class              HP-UX
Cray T3E                Unicos
IBM SP                  AIX
Workstation Clusters    Linux

• For more information about Unix, a collection of Unix documents is available.

Two Unix Parallelism Features
• background processing facility
  • With the Unix background processing facility you can run the executable a.out in the background and simultaneously view the man page for the etime function in the foreground. There are two Unix commands that accomplish this:
      a.out > results &
      man etime
• cron feature
  • With the Unix cron feature you can submit a job that will run at a later time.

Arithmetic Parallelism
• Multiple execution units
  • facilitate arithmetic parallelism.
  • The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are each done in a separate execution unit. This allows several execution units to be used simultaneously, because the execution units operate independently.
• Fused multiply and add
  • is another parallel arithmetic feature.
  • Parallel computers are able to overlap multiply and add. This arithmetic is named MultiplyADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on HP computers. In either case, the two arithmetic operations are overlapped and can complete in hardware in one computer cycle.
• Superscalar arithmetic
  • is the ability to issue several arithmetic operations per computer cycle.
  • It makes use of the multiple, independent execution units. On superscalar computers there are multiple slots per cycle that can be filled with work. This gives rise to the name n-way superscalar, where n is the number of slots per cycle. The SGI Origin2000 is called a 4-way superscalar computer.

Memory Parallelism
• memory interleaving
  • memory is divided into multiple banks, and consecutive data elements are interleaved among them. For example, if your computer has 2 memory banks, then data elements with even memory addresses would fall into one bank, and data elements with odd memory addresses into the other.
• multiple memory ports
  • Port means a bi-directional memory pathway. When the data elements that are interleaved across the memory banks are needed, the multiple memory ports allow them to be accessed and fetched in parallel, which increases the memory bandwidth (MB/s or GB/s).
• multiple levels of the memory hierarchy
  • There is global memory that any processor can access. There is memory that is local to a partition of the processors. Finally there is memory that is local to a single processor, that is, the cache memory and the memory elements held in registers.
• Cache memory
  • Cache is a small memory that has fast access compared with the larger main memory and serves to keep the faster processor filled with data.
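The even/odd example of memory interleaving generalizes to any number of banks. The sketch below assumes the simplest scheme, where the bank is the address modulo the number of banks; real hardware may interleave on larger units than a single element, and the function names are made up for this illustration.

```python
def bank_of(address, n_banks):
    """Bank that holds a given element address (assumed: address mod n_banks)."""
    return address % n_banks


def interleave(addresses, n_banks):
    """Group addresses by the bank they fall into."""
    banks = [[] for _ in range(n_banks)]
    for addr in addresses:
        banks[bank_of(addr, n_banks)].append(addr)
    return banks


# Two banks: even addresses land in bank 0, odd addresses in bank 1.
print(interleave(range(8), 2))
# Four banks: four consecutive elements can be fetched from four different banks.
print(interleave(range(8), 4))
```

Consecutive addresses always land in different banks, which is why a loop that walks through an array in order can keep several memory ports busy at once.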

Memory Parallelism
[Figures: Memory Hierarchy; Cache Memory]

Disk Parallelism
• RAID (Redundant Array of Inexpensive Disks)
  • RAID disks are on most parallel computers.
  • The advantage of a RAID disk system is that it provides a measure of fault tolerance.
  • If one of the disks goes down, it can be swapped out, and the RAID disk system remains operational.
• Disk Striping
  • When a data set is written to disk, it is striped across the RAID disk system. That is, it is broken into pieces that are written simultaneously to the different disks in the RAID disk system. When the same data set is read back in, the pieces are read in parallel, and the full data set is reassembled in memory.
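Disk striping can be illustrated with a small in-memory model. The sketch assumes a plain round-robin layout with a fixed chunk size and models no parity or mirroring, so it shows only the striping idea and not RAID's fault tolerance. The names `stripe` and `reassemble` are hypothetical, and the "disks" are just Python lists.

```python
def stripe(data: bytes, n_disks: int, chunk: int):
    """Break data into chunks and deal them round-robin across the disks."""
    disks = [[] for _ in range(n_disks)]
    for idx, start in enumerate(range(0, len(data), chunk)):
        disks[idx % n_disks].append(data[start:start + chunk])
    return disks


def reassemble(disks):
    """Read chunk row 0 from every disk, then row 1, and so on."""
    out = []
    longest = max(len(d) for d in disks)
    for row in range(longest):
        for d in disks:
            if row < len(d):
                out.append(d[row])
    return b"".join(out)


disks = stripe(b"abcdefghij", n_disks=3, chunk=2)
print(disks)
print(reassemble(disks))
```

In a real system each row of chunks would be written or read by all disks at the same time, which is where the bandwidth gain comes from.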


Performance Measures
• Peak Performance
  • is the top speed at which the computer can operate.
  • It is a theoretical upper limit on the computer's performance.
• Sustained Performance
  • is the highest consistently achieved speed.
  • It is a more realistic measure of computer performance.
• Cost Performance
  • is used to determine if the computer is cost effective.
• MHz
  • is a measure of the processor speed.
  • The processor speed is commonly measured in millions of cycles per second, where a computer cycle is defined as the shortest time in which some work can be done.
• MIPS
  • is a measure of how quickly the computer can issue instructions.
  • Millions of instructions per second is abbreviated as MIPS, where the instructions are computer instructions such as: memory reads and writes, logical operations, floating point operations, integer operations, and branch instructions.

Performance Measures
• Mflops (Millions of floating point operations per second)
  • measures how quickly a computer can perform floating-point operations such as add, subtract, multiply, and divide.
• Speedup
  • measures the benefit of parallelism.
  • It shows how your program scales as you compute with more processors, compared to the performance on one processor.
  • Ideal speedup happens when the performance gain is linearly proportional to the number of processors used.
• Benchmarks
  • are used to rate the performance of parallel computers and parallel programs.
  • A well known benchmark that is used to compare parallel computers is the Linpack benchmark.
  • Based on the Linpack results, a list is produced of the Top 500 Supercomputer Sites. This list is maintained by the University of Tennessee and the University of Mannheim.
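Speedup can be computed directly from two timings. The sketch below uses the usual definitions, speedup as one-processor time divided by p-processor time and efficiency as speedup divided by p. The slides list efficiency as a later topic without defining it here, so treat that formula as an assumption, and the timings as invented numbers.

```python
def speedup(t1: float, tp: float) -> float:
    """Speedup: time on one processor divided by time on p processors."""
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Fraction of ideal (linear) speedup actually achieved on p processors."""
    return speedup(t1, tp) / p


# Hypothetical timings in seconds for the same program.
t1 = 100.0
for p, tp in [(2, 50.0), (4, 25.0), (8, 20.0)]:
    print(f"p={p}: speedup={speedup(t1, tp):.2f}, efficiency={efficiency(t1, tp, p):.2f}")
```

In this made-up data the program scales ideally up to 4 processors (speedup 4, efficiency 1.0) and then flattens out at 8 processors (speedup 5, efficiency about 0.63).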

More Parallelism Issues
• Load balancing
  • is the technique of evenly dividing the workload among the processors.
  • For data parallelism it involves how iterations of loops are allocated to processors.
  • Load balancing is important because the total time for the program to complete is the time spent by the longest executing thread.
• The problem size
  • must be large and must be able to grow as you compute with more processors.
  • In order to get the performance you expect from a parallel computer you need to run a large application with large data sizes, otherwise the overhead of passing information between processors will dominate the calculation time.
• Good software tools
  • are essential for users of high performance parallel computers.
  • These tools include:
    • parallel compilers
    • parallel debuggers
    • performance analysis tools
    • parallel math software
  • The availability of a broad set of application software is also important.
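The statement that total time equals the time of the longest-running thread can be checked with a toy model. The sketch assumes a loop whose iteration cost grows with the iteration number (costs 1 through 8, invented for the example) and compares two ways of handing iterations to four threads; the function names are illustrative only.

```python
def completion_time(load_per_thread):
    """The program finishes when the most heavily loaded thread finishes."""
    return max(load_per_thread)


def contiguous_blocks(costs, n_threads):
    """Give each thread one contiguous block of iterations."""
    size = -(-len(costs) // n_threads)  # ceiling division
    return [sum(costs[i:i + size]) for i in range(0, len(costs), size)]


def round_robin(costs, n_threads):
    """Deal iterations to threads one at a time, like cards."""
    loads = [0] * n_threads
    for i, cost in enumerate(costs):
        loads[i % n_threads] += cost
    return loads


costs = list(range(1, 9))  # iteration i costs i time units
print(contiguous_blocks(costs, 4), completion_time(contiguous_blocks(costs, 4)))
print(round_robin(costs, 4), completion_time(round_robin(costs, 4)))
```

Both allocations do the same 36 units of work, but the contiguous split finishes at time 15 and the round-robin split at time 12; a perfectly balanced split would finish at 9.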

More Parallelism Issues
• The high performance computing market is risky and chaotic. Many supercomputer vendors are no longer in business, making the portability of your application very important.
• A workstation farm
  • is defined as a fast network connecting heterogeneous workstations.
  • The individual workstations serve as desktop systems for their owners.
  • When they are idle, large problems can take advantage of the unused cycles in the whole system.
  • An application of this concept is the SETI project. You can participate in searching for extraterrestrial intelligence with your home PC. More information about this project is available at the SETI Institute.
• Condor
  • is software that provides resource management services for applications that run on heterogeneous collections of workstations.
  • Miron Livny at the University of Wisconsin at Madison is the director of the Condor project, and has coined the phrase high throughput computing to describe this process of harnessing idle workstation cycles. More information is available at the Condor Home Page.

Agenda
1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
  1.2 Comparison of Parallel Computers
    1.2.1 Processors
    1.2.2 Memory Organization
    1.2.3 Flow of Control
    1.2.4 Interconnection Networks
      1.2.4.1 Bus Network
      1.2.4.2 Cross-Bar Switch Network
      1.2.4.3 Hypercube Network
      1.2.4.4 Tree Network
      1.2.4.5 Interconnection Networks Self-test
    1.2.5 Summary of Parallel Computer Characteristics
  1.3 Summary

Comparison of Parallel Computers
• Now you can explore the hardware components of parallel computers:
  • kinds of processors
  • types of memory organization
  • flow of control
  • interconnection networks
• You will see what is common to these parallel computers, and what makes each one of them unique.

Kinds of Processors
• There are three types of parallel computers:
1. computers with a small number of powerful processors
  • Typically have tens of processors.
  • The cooling of these computers often requires very sophisticated and expensive equipment, making these computers very expensive for computing centers.
  • They are general-purpose computers that perform especially well on applications that have large vector lengths.
  • Examples of this type of computer are the Cray SV1 and the Fujitsu VPP5000.

Kinds of Processors
• There are three types of parallel computers:
2. computers with a large number of less powerful processors
  • Named a Massively Parallel Processor (MPP), they typically have thousands of processors.
  • The processors are usually proprietary and air-cooled.
  • Because of the large number of processors, the distance between the furthest processors can be quite large, requiring a sophisticated internal network that allows distant processors to communicate with each other quickly.
  • These computers are suitable for applications with a high degree of concurrency.
  • The MPP type of computer was popular in the 1980s.
  • Examples of this type of computer were the Thinking Machines CM-2 computer, and the computers made by the MasPar company.

Kinds of Processors
• There are three types of parallel computers:
3. computers that are medium scale, in between the two extremes
  • Typically have hundreds of processors.
  • The processor chips are usually not proprietary; rather they are commodity processors like the Pentium III.
  • These are general-purpose computers that perform well on a wide range of applications.
  • The most common example of this class is the Linux Cluster.

Trends and Examples
• Processor trends:

Decade   Processor Type                     Computer Example
1970s    Pipelined, Proprietary             Cray-1
1980s    Massively Parallel, Proprietary    Thinking Machines CM2
1990s    Superscalar, RISC, Commodity       SGI Origin2000
2000s    CISC, Commodity                    Workstation Clusters

• The processors on today's commonly used parallel computers:

Computer                Processor
SGI Origin2000          MIPS RISC R12000
HP V-Class              HP PA 8200
Cray T3E                Compaq Alpha
IBM SP                  IBM Power3
Workstation Clusters    Intel Pentium III, Intel Itanium

Memory Organization
• The following paragraphs describe the three types of memory organization found on parallel computers:
  • distributed memory
  • shared memory
  • distributed shared memory

Distributed Memory
• In distributed memory computers, the total memory is partitioned into memory that is private to each processor.
• There is a Non-Uniform Memory Access time (NUMA), which is proportional to the distance between the two communicating processors.
• On NUMA computers, data is accessed the quickest from a private memory, while data from the most distant processor takes the longest to access.
• Some examples are the Cray T3E, the IBM SP, and workstation clusters.

Distributed Memory
• When programming distributed memory computers, the code and the data should be structured such that the bulk of a processor's data accesses are to its own private (local) memory.
  • This is called having good data locality.
• Today's distributed memory computers use message passing such as MPI to communicate between processors as shown in the following example:
  [Figure: message passing between processors]
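The flavor of message passing can be sketched without an MPI installation. In the Python model below, threads stand in for processors, each worker touches only the chunk of data it was given (its "private memory"), and `queue.Queue` objects stand in for send and receive. This is an analogy built from the standard library, not MPI, and all names are invented for the example.

```python
import queue
import threading


def worker(rank, local_data, inboxes, results):
    """Sum private data, then pass the partial sum to rank 0 as a message."""
    partial = sum(local_data)              # only local (private) data is read
    if rank != 0:
        inboxes[0].put((rank, partial))    # "send" the partial sum to rank 0
    else:
        total = partial
        for _ in range(len(inboxes) - 1):
            _, value = inboxes[0].get()    # blocking "receive" from any rank
            total += value
        results["total"] = total


def distributed_sum(chunks):
    """Run one worker per chunk and return the sum gathered on rank 0."""
    inboxes = [queue.Queue() for _ in chunks]
    results = {}
    threads = [
        threading.Thread(target=worker, args=(rank, chunk, inboxes, results))
        for rank, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["total"]


print(distributed_sum([[1, 2], [3, 4], [5, 6]]))
```

Each worker does most of its work on local data and sends a single small message, which is the data locality pattern recommended above.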

Distributed Memory
• One advantage of distributed memory computers is that they are easy to scale. As the demand for resources grows, computer centers can easily add more memory and processors.
  • This is often called the LEGO block approach.
• The drawback is that programming of distributed memory computers can be quite complicated.

Shared Memory
• In shared memory computers, all processors have access to a single pool of centralized memory with a uniform address space.
• Any processor can address any memory location at the same speed so there is Uniform Memory Access time (UMA).
• Processors communicate with each other through the shared memory.
• The advantages and disadvantages of shared memory machines are roughly the opposite of distributed memory computers.
  • They are easier to program because they resemble the programming of single processor machines.
  • But they don't scale like their distributed memory counterparts.

Distributed Shared Memory
• In Distributed Shared Memory (DSM) computers, a cluster or partition of processors has access to a common shared memory.
• It accesses the memory of a different processor cluster in a NUMA fashion.
• Memory is physically distributed but logically shared.
• Attention to data locality again is important.
• Distributed shared memory computers combine the best features of both distributed memory computers and shared memory computers.
  • That is, DSM computers have both the scalability of distributed memory computers and the ease of programming of shared memory computers.
• Some examples of DSM computers are the SGI Origin2000 and the HP V-Class computers.

Trends and Examples
• Memory organization trends:

Decade   Memory Organization          Example
1970s    Shared Memory                Cray-1
1980s    Distributed Memory           Thinking Machines CM-2
1990s    Distributed Shared Memory    SGI Origin2000
2000s    Distributed Memory           Workstation Clusters

• The memory organization of today's commonly used parallel computers:

Computer                Memory Organization
SGI Origin2000          DSM
HP V-Class              DSM
Cray T3E                Distributed
IBM SP                  Distributed
Workstation Clusters    Distributed

Flow of Control
• When you look at the flow of control you will see three types of parallel computers:
  • Single Instruction Multiple Data (SIMD)
  • Multiple Instruction Multiple Data (MIMD)
  • Single Program Multiple Data (SPMD)

Flynn's Taxonomy
• Flynn's Taxonomy, devised in 1972 by Michael Flynn of Stanford University, describes computers by how streams of instructions interact with streams of data.
• There can be single or multiple instruction streams, and there can be single or multiple data streams. This gives rise to 4 types of computers:

                          Single Data    Multiple Data
Single Instruction        SISD           SIMD
Multiple Instruction      MISD           MIMD

• Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD and MIMD.
• Of these 4, only SIMD and MIMD are applicable to parallel computers.
• Another computer type, SPMD, is a special case of MIMD.

SIMD Computers
• SIMD stands for Single Instruction Multiple Data.
• Each processor follows the same set of instructions,
• with different data elements being allocated to each processor.
• SIMD computers have distributed memory with typically thousands of simple processors, and the processors run in lock step.
• SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications, such as neural networks.
• Some examples of SIMD computers were the Thinking Machines CM-2 computer and the computers from the MasPar company.
• The processors are commanded by the global controller that sends instructions to the processors.
  • It says add, and they all add.
  • It says shift to the right, and they all shift to the right.
  • The processors are like obedient soldiers, marching in unison.

MIMD Computers
• MIMD stands for Multiple Instruction Multiple Data.
• There are multiple instruction streams with separate code segments distributed among the processors.
• MIMD is actually a superset of SIMD, so that the processors can run the same instruction stream or different instruction streams.
• In addition, there are multiple data streams; different data elements are allocated to each processor.
• MIMD computers can have either distributed memory or shared memory.
• While the processors on SIMD computers run in lock step, the processors on MIMD computers run independently of each other.
• MIMD computers can be used for either data parallel or task parallel applications.
• Some examples of MIMD computers are the SGI Origin2000 computer and the HP V-Class computer.

SPMD Computers
• SPMD stands for Single Program Multiple Data.
• SPMD is a special case of MIMD.
• SPMD execution happens when a MIMD computer is programmed to have the same set of instructions per processor.
• With SPMD computers, while the processors are running the same code segment, each processor can run that code segment asynchronously.
• Unlike SIMD, the synchronous execution of instructions is relaxed.
• An example is the execution of an if statement on a SPMD computer.
  • Because each processor computes with its own partition of the data elements, it may evaluate the right hand side of the if statement differently from another processor.
  • One processor may take a certain branch of the if statement, and another processor may take a different branch of the same if statement.
  • Hence, even though each processor has the same set of instructions, those instructions may be evaluated in a different order from one processor to the next.
• The analogies we used for describing SIMD computers can be modified for MIMD computers.
  • Instead of the SIMD obedient soldiers, all marching in unison, in the MIMD world the processors march to the beat of their own drummer.
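The if-statement example can be made concrete with a short Python sketch. One function plays the role of the single program; it is called once per "processor" with that processor's partition of the data, and the branch taken depends on the local values. The function name, the partitions, and the use of a thread pool to stand in for processors are all assumptions of this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def spmd_program(rank, local_data):
    """The same program on every processor; the branch depends on local data."""
    local_sum = sum(local_data)
    if local_sum >= 0:
        return (rank, "nonnegative branch", local_sum)
    else:
        return (rank, "negative branch", -local_sum)


# Each processor gets its own partition of the data (values are made up).
partitions = [[1, 2], [-5, 1], [0, 0], [-1, -1]]

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    results = list(pool.map(spmd_program, range(len(partitions)), partitions))

for row in results:
    print(row)
```

Ranks 0 and 2 take the first branch while ranks 1 and 3 take the second, even though all four run identical code; on a SIMD machine this kind of divergence would not be possible without idling some processors.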

Summary of SIMD versus MIMD

                     SIMD                       MIMD
Memory               distributed memory         distributed memory or shared memory
Code Segment         same per processor         same or different
Processors Run In    lock step                  asynchronously
Data Elements        different per processor    different per processor
Applications         data parallel              data parallel or task parallel

Trends and Examples
• Flow of control trends:

Decade   Flow of Control   Computer Example
1980s    SIMD              Thinking Machines CM-2
1990s    MIMD              SGI Origin2000
2000s    MIMD              Workstation Clusters

• The flow of control on today's commonly used parallel computers:

Computer                Flow of Control
SGI Origin2000          MIMD
HP V-Class              MIMD
Cray T3E                MIMD
IBM SP                  MIMD
Workstation Clusters    MIMD


Interconnection Networks
• What exactly is the interconnection network?
  • The interconnection network is made up of the wires and cables that define how the multiple processors of a parallel computer are connected to each other and to the memory units.
  • The time required to transfer data is dependent upon the specific type of the interconnection network.
  • This transfer time is called the communication time.
• What network characteristics are important?
  • Diameter: the maximum distance that data must travel for 2 processors to communicate.
  • Bandwidth: the amount of data that can be sent through a network connection.
  • Latency: the delay on a network while a data packet is being stored and forwarded.
• Types of Interconnection Networks
  • The network topologies (geometric arrangements of the computer network connections) are:
    • Bus
    • Cross-bar Switch
    • Hypercube
    • Tree

Interconnection Networks
• The aspects of network issues are:
  • Cost
  • Scalability
  • Reliability
  • Suitable Applications
  • Data Rate
  • Diameter
  • Degree
• General Network Characteristics
  • Some networks can be compared in terms of their degree and diameter.
  • Degree: how many communicating wires are coming out of each processor.
    • A large degree is a benefit because it has multiple paths.
  • Diameter: the distance between the two processors that are farthest apart.
    • A small diameter corresponds to low latency.

Bus Network
• Bus topology is the original coaxial cable-based Local Area Network (LAN) topology in which the medium forms a single bus to which all stations are attached.
• The positive aspects
  • It is a mature technology that is well known and reliable.
  • The cost is also very low.
  • It is simple to construct.
• The negative aspects
  • limited data transmission rate.
  • not scalable in terms of performance.
• Example: SGI Power Challenge.
  • Only scaled to 18 processors.

Cross-Bar Switch Network
• A cross-bar switch is a network that works through a switching mechanism to access shared memory.
• It scales better than the bus network but it costs significantly more.
• The telephone system uses this type of network. An example of a computer with this type of network is the HP V-Class.
• [Figure: a cross-bar switch network, showing the processors talking through the switchboxes to store or retrieve data in memory.]
  • There are multiple paths for a processor to communicate with a certain memory.
  • The switches determine the optimal route to take.

Hypercube Network
• In a hypercube network, the processors are connected as if they were corners of a multidimensional cube. Each node in an N dimensional cube is directly connected to N other nodes.
• The fact that the number of directly connected, "nearest neighbor", nodes increases with the total size of the network is also highly desirable for a parallel computer.
• The degree of a hypercube network is log2 n and the diameter is log2 n, where n is the number of processors.
• Examples of computers with this type of network are the CM-2, NCUBE-2, and the Intel iPSC/860.
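The degree and diameter of a hypercube can be checked by building one. The sketch below assumes the standard labeling where each of the n = 2^d nodes is a d-bit number and two nodes are neighbors when their labels differ in exactly one bit; the function names are made up for the example.

```python
from collections import deque


def hypercube_neighbors(node, dim):
    """Neighbors of a node: flip each of the dim address bits in turn."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hypercube_diameter(dim):
    """Longest shortest path, found by breadth-first search from node 0.

    The hypercube is symmetric, so searching from one node is enough.
    """
    dist = {0: 0}
    frontier = deque([0])
    while frontier:
        node = frontier.popleft()
        for nb in hypercube_neighbors(node, dim):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                frontier.append(nb)
    return max(dist.values())


for dim in (2, 3, 4):
    n = 2 ** dim
    print(f"n={n}: degree={len(hypercube_neighbors(0, dim))}, diameter={hypercube_diameter(dim)}")
```

For n = 16 processors both the degree and the diameter come out as 4, which is log2 16; doubling the number of processors adds only one wire per node and one hop to the longest route.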

Tree Network
• The processors are the bottom nodes of the tree. For a processor to retrieve data, it must go up in the network and then go back down.
• This is useful for decision making applications that can be mapped as trees.
• The degree of a tree network is 1.
• The diameter of the network is 2 log2(n+1) - 2, where n is the number of processors.
• The Thinking Machines CM-5 is an example of a parallel computer with this type of network.
• Tree networks are very suitable for database applications because they allow multiple searches through the database at a time.
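The diameter formula can be checked numerically under one explicit assumption: the network is a complete binary tree and n counts every node in it, switches as well as processors. Under that reading the farthest pair is two bottom-level nodes on opposite sides of the root. The sketch below builds such a tree with heap-style numbering, where the parent of node v is v // 2; all function names are illustrative.

```python
import math
from collections import deque


def complete_binary_tree(n_nodes):
    """Adjacency lists for nodes 1..n_nodes, parent of v is v // 2."""
    adj = {v: [] for v in range(1, n_nodes + 1)}
    for v in range(2, n_nodes + 1):
        adj[v].append(v // 2)
        adj[v // 2].append(v)
    return adj


def farthest(adj, start):
    """Breadth-first search; returns the farthest node and its distance."""
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                frontier.append(w)
    far = max(dist, key=dist.get)
    return far, dist[far]


def tree_diameter(adj):
    """Two BFS passes give the diameter of any tree."""
    far, _ = farthest(adj, 1)
    return farthest(adj, far)[1]


def diameter_formula(n_nodes):
    return 2 * math.log2(n_nodes + 1) - 2


for n in (3, 7, 15):
    adj = complete_binary_tree(n)
    print(n, tree_diameter(adj), diameter_formula(n))
```

The bottom-level nodes each have a single link (to their parent), which matches the stated degree of 1 for the processors.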

Interconnected Networks
• Torus Network: A mesh with wrap-around connections in both the x and y directions.
• Multistage Network: A network with more than one networking unit.
• Fully Connected Network: A network where every processor is connected to every other processor.
• Hypercube Network: Processors are connected as if they were corners of a multidimensional cube.
• Mesh Network: A network where each interior processor is connected to its four nearest neighbors.

Interconnected Networks
• Bus Based Network: Coaxial cable based LAN topology in which the medium forms a single bus to which all stations are attached.
• Cross-bar Switch Network: A network that works through a switching mechanism to access shared memory.
• Tree Network: The processors are the bottom nodes of the tree.
• Ring Network: Each processor is connected to two others and the line of connections forms a circle.

Summary of Parallel Computer Characteristics
• How many processors does the computer have?
  • 10s?
  • 100s?
  • 1000s?
• How powerful are the processors?
  • What's the MHz rate?
  • What's the MIPS rate?
• What's the instruction set architecture?
  • RISC
  • CISC

Summary of Parallel Computer Characteristics
• How much memory is available?
  • total memory
  • memory per processor
• What kind of memory?
  • distributed memory
  • shared memory
  • distributed shared memory
• What type of flow of control?
  • SIMD
  • MIMD
  • SPMD

Summary of Parallel Computer Characteristics
• What is the interconnection network?
  • Bus
  • Crossbar
  • Hypercube
  • Tree
  • Torus
  • Multistage
  • Fully Connected
  • Mesh
  • Ring
  • Hybrid

Design decisions made by some of the major parallel computer vendors

Computer              Programming Style  OS      Processors         Memory       Flow of Control  Network
SGI Origin2000        OpenMP, MPI        IRIX    MIPS RISC R10000   DSM          MIMD             Crossbar, Hypercube
HP V-Class            OpenMP, MPI        HP-UX   HP PA 8200         DSM          MIMD             Crossbar, Ring
Cray T3E              SHMEM              Unicos  Compaq Alpha       Distributed  MIMD             Torus
IBM SP                MPI                AIX     IBM Power3         Distributed  MIMD             IBM Switch
Workstation Clusters  MPI                Linux   Intel Pentium III  Distributed  MIMD             Myrinet, Tree

Summary
• This completes our introduction to parallel computing.
• You have learned about parallelism in computer programs, and also about parallelism in the hardware components of parallel computers.
• In addition, you have learned about the commonly used parallel computers, and how these computers compare to each other.
• There are many good texts which provide an introductory treatment of parallel computing. Here are two useful references:
  • Highly Parallel Computing, Second Edition. George S. Almasi and Allan Gottlieb. Benjamin/Cummings Publishers, 1994.
  • Parallel Computing Theory and Practice. Michael J. Quinn. McGraw-Hill, Inc., 1994.

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
  2.1 Automatic Compiler Parallelism
  2.2 Data Parallelism by Hand
  2.3 Mixing Automatic and Hand Parallelism
  2.4 Task Parallelism
  2.5 Parallelism Issues
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690

How to Parallelize a Code
• This chapter describes how to turn a single processor program into a parallel one, focusing on shared memory machines.
• Both automatic compiler parallelization and parallelization by hand are covered.
• The details for accomplishing both data parallelism and task parallelism are presented.

Automatic Compiler Parallelism
• Automatic compiler parallelism enables you to use a single compiler option and let the compiler do the work.
• The advantage of it is that it's easy to use.
• The disadvantages are:
  • The compiler only does loop level parallelism, not task parallelism.
  • The compiler wants to parallelize every do loop in your code. If you have hundreds of do loops this creates way too much parallel overhead.

Automatic Compiler Parallelism
• To use automatic compiler parallelism on a Linux system with the Intel compilers, specify the following.
    ifort -parallel -O2 ... prog.f
• The compiler creates conditional code that will run with any number of threads.
• Specify the number of threads with setenv, then run and make sure you still get the right answers:
    setenv OMP_NUM_THREADS 4
    a.out > results

Data Parallelism by Hand
• First identify the loops that use most of the CPU time (the Profiling lecture describes how to do this).
• By hand, insert into the code OpenMP directive(s) just before the loop(s) you want to make parallel.
• Some code modifications may be needed to remove data dependencies and other inhibitors of parallelism.
• Use your knowledge of the code and data to assist the compiler.
• For the SGI Origin2000 computer, insert into the code an OpenMP directive just before the loop that you want to make parallel.

    !$OMP PARALLEL DO
    do i=1,n
      ... lots of computation ...
    end do
    !$OMP END PARALLEL DO

Data Parallelism by Hand
• Compile with the mp compiler option.
    f90 -mp ... prog.f
• As before, the compiler generates conditional code that will run with any number of threads.
• If you want to rerun your program with a different number of threads, you do not need to recompile, just re-specify the setenv command.
    setenv OMP_NUM_THREADS 8
    a.out > results2
• The setenv command can be placed anywhere before the a.out command.
• The setenv command must be typed exactly as indicated. If you have a typo, you will not receive a warning or error message. To make sure that the setenv command is specified correctly, type:

• It produces a listing of your environment variable settings.

Mixing Automatic and Hand Parallelism
• You can have one source file parallelized automatically by the compiler, and another source file parallelized by hand. Suppose you split your code into two files named prog1.f and prog2.f.
    f90 -c -apo ... prog1.f      (automatic parallelism for prog1.f)
    f90 -c -mp ... prog2.f       (parallelism by hand for prog2.f)
    f90 prog1.o prog2.o          (creates one executable)
    a.out > results              (runs the executable)

Task Parallelism
• You can accomplish task parallelism as follows:
    !$OMP PARALLEL
    !$OMP SECTIONS
      ... lots of computation in part A ...
    !$OMP SECTION
      ... lots of computation in part B ...
    !$OMP SECTION
      ... lots of computation in part C ...
    !$OMP END SECTIONS
    !$OMP END PARALLEL
• Compile with the mp compiler option.
    f90 -mp ... prog.f
• Use the setenv command to specify the number of threads.
    setenv OMP_NUM_THREADS 3
    a.out > results

Parallelism Issues
• There are some issues to consider when parallelizing a program.
  • Should data parallelism or task parallelism be used?
  • Should automatic compiler parallelism or parallelism by hand be used?
  • Which loop in a nested loop situation should be the one that becomes parallel?
  • How many threads should be used?

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
  3.1 Recompile
  3.2 Word Length
  3.3 Compiler Options for Debugging
  3.4 Standards Violations
  3.5 IEEE Arithmetic Differences
  3.6 Math Library Differences
  3.7 Compute Order Related Differences
  3.8 Optimization Level Too High
  3.9 Diagnostic Listings
  3.10 Further Information

Porting Issues
• In order to run a computer program that presently runs on a workstation, a mainframe, a vector computer, or another parallel computer, on a new parallel computer you must first "port" the code.
• After porting the code, it is important to have some benchmark results you can use for comparison.
  • To do this, run the original program on a well-defined dataset, and save the results from the old or "baseline" computer.
  • Then run the ported code on the new computer and compare the results.
• If the results are different, don't automatically assume that the new results are wrong; they may actually be better. There are several reasons why this might be true, including:
  • Precision Differences - the new results may actually be more accurate than the baseline results.
  • Code Flaws - porting your code to a new computer may have uncovered a hidden flaw in the code that was already there.
• Detection methods for finding code flaws, solutions, and workarounds are provided in this lecture.

Recompile
• Some codes just need to be recompiled to get accurate results.
• The compilers available on the NCSA computer platforms are shown in the following table:
  Language SGI Origin2000 MIPSpro Fortran 77 Fortran 90 Fortran 90 High Performance Fortran C C++ cc CC f77 f90 f95 pghpf icc icpc gcc g++ Portland Group IA-32 Linux Intel ifort ifort ifort pghpf pgcc pgCC icc icpc gcc g++ GNU g77 Portland Group pgf77 pgf90 IA-64 Linux Intel ifort ifort ifort GNU g77

Word Length
• Code flaws can occur when you are porting your code to a different word length computer.
• For C, the size of an integer variable differs depending on the machine and how the variable is generated. On the IA32 and IA64 Linux clusters, the size of an integer variable is 4 and 8 bytes, respectively. On the SGI Origin2000, the corresponding value is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag.
• For Fortran, the SGI MIPSpro and Intel compilers contain the following flags to set default variable size.
  • -in where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
  • -rn where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.

Compiler Options for Debugging
• On the SGI Origin2000, the MIPSpro compilers include debugging options via the -DEBUG: group. The syntax is as follows:
    -DEBUG:option1[=value1]:option2[=value2]...
• Two examples are:
  • Array-bound checking: check for subscripts out of range at runtime.
      -DEBUG:subscript_check=ON
  • Force all un-initialized stack, automatic and dynamically allocated variables to be initialized.
      -DEBUG:trap_uninitialized=ON

Compiler Options for Debugging
• On the IA32 Linux cluster, the Fortran compiler is equipped with the following -C flags for runtime diagnostics:
  • -CA: pointers and allocatable references
  • -CB: array and subscript bounds
  • -CS: consistent shape of intrinsic procedure
  • -CU: use of uninitialized variables
  • -CV: correspondence between dummy and actual arguments

Standards Violations
• Code flaws can occur when the program has non-ANSI standard Fortran coding.
• ANSI standard Fortran is a set of rules for compiler writers that specify, for example, the value of the do loop index upon exit from the do loop.
• Standards Violations Detection
  • To detect standards violations on the SGI Origin2000 computer use the -ansi flag.
  • This option generates a listing of warning messages for the use of non-ANSI standard coding.
  • On the Linux clusters, the -ansi[-] flag enables/disables assumption of ANSI conformance.

IEEE Arithmetic Differences
• Code flaws occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not.
• The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior.
• For example, it prohibits the compiler writer from replacing x/y with x*recip(y) since the two results may differ slightly for some operands. You can make your program strictly conform to the IEEE standard.
• To make your program conform to the IEEE Arithmetic Standards on the SGI Origin2000 computer use:
    f90 -OPT:IEEE_arithmetic=n ... prog.f
  where n is 1, 2, or 3.
• This option specifies the level of conformance to the IEEE standard where 1 is the most stringent and 3 is the most liberal.
• On the Linux clusters, the Intel compilers can achieve conformance to IEEE standard at a stringent level with the -mp flag, or a slightly relaxed level with the -mp1 flag.
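The x/y versus x*recip(y) point is easy to check numerically. The following Python sketch (an illustration only; Python floats are IEEE double precision on common platforms, and the operand range is an arbitrary choice) searches small integers for pairs where the two forms give different results.

```python
# Search small integer operands for cases where dividing directly and
# multiplying by a precomputed reciprocal give different doubles.
mismatches = []
for x in range(1, 100):
    for y in range(1, 100):
        direct = x / y
        via_reciprocal = x * (1.0 / y)
        if direct != via_reciprocal:
            mismatches.append((x, y, direct, via_reciprocal))

print(len(mismatches), "of", 99 * 99, "operand pairs differ")
x, y, direct, via_reciprocal = mismatches[0]
print(x, y, repr(direct), repr(via_reciprocal))
```

One familiar case is x = 3, y = 10: the direct quotient prints as 0.3, while 3 * 0.1 prints as 0.30000000000000004. Differences of this size are what a relaxed IEEE setting can introduce.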

Math Library Differences
• Most high-performance parallel computers are equipped with vendor-supplied math libraries.
• On the SGI Origin2000 platform, there are SGI/Cray Scientific Library (SCSL) and Complib.sgimath.
• SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK and Fast Fourier Transform (FFT) routines.
• SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version.
• The complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version.
• The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.

Math Library Differences
• On the IA32 Linux cluster, the libraries to link to are:
  • For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
  • For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
  • When calling MKL routines from C/C++ programs, you also need to link with -lF90.
• On the IA64 Linux cluster, the corresponding libraries are:
  • For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
  • For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
  • When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins

Compute Order Related Differences
• Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The compute order in which the threads will run cannot be guaranteed.
• For example, in a data parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program.
• Note: If your algorithm depends on data being compared in a specific order, your code is inappropriate for a parallel computer.
• Use the following method to detect compute order related differences:
  • If your loop looks like
      DO I = 1, N
    change it to
      DO I = N, 1, -1
    The results should not change if the iterations are independent.
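The loop-reversal check can be tried on a toy case. The Python sketch below is an illustration with made-up loop bodies, not code from the lecture: one loop has independent iterations, the other carries a dependency from one iteration to the next, and each is run in forward and in reverse order.

```python
def independent(a, b, order):
    """c(i) = a(i) + b(i): each iteration touches only its own element."""
    c = [0.0] * len(a)
    for i in order:
        c[i] = a[i] + b[i]
    return c


def dependent(a, b, order):
    """a(i) = a(i-1) + b(i): each iteration reads the previous result."""
    a = list(a)                      # work on a copy
    for i in order:
        if i > 0:
            a[i] = a[i - 1] + b[i]
    return a


n = 6
a = [1.0] * n
b = [1.0] * n
forward = range(n)                   # DO I = 1, N
backward = range(n - 1, -1, -1)      # DO I = N, 1, -1

print(independent(a, b, forward) == independent(a, b, backward))
print(dependent(a, b, forward) == dependent(a, b, backward))
```

The first comparison prints True and the second prints False: reversing the loop exposes the dependency, which is the signal that the loop is not safe to run in parallel as written.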

Optimization Level Too High
• Code flaws can occur when the optimization level has been set too high, thus trading accuracy for speed.
• The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at higher optimization levels.
• Setting the Optimization Level
  • Both SGI Origin2000 computer and IBM Linux clusters provide Level 0 (no optimization) to Level 3 (most aggressive) optimization, using the -O{0,1,2, or 3} flag. One should bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking correctness and precision of calculation is highly recommended when -O3 is used.
  • For example on the Origin 2000
      f90 -O0 ... prog.f
    turns off all optimizations.

Optimization Level Too High
• Isolating Optimization Level Problems
  • You can sometimes isolate optimization level problems using the method of binary chop.
  • To do this, divide your program prog.f into halves. Name them prog1.f and prog2.f.
  • Compile the first half with -O0 and the second half with -O3
      f90 -c -O0 prog1.f
      f90 -c -O3 prog2.f
      f90 prog1.o prog2.o
      a.out > results
  • If the results are correct, the optimization problem lies in prog1.f
  • Next divide prog1.f into halves. Name them prog1a.f and prog1b.f
  • Compile prog1a.f with -O0 and prog1b.f with -O3
      f90 -c -O0 prog1a.f
      f90 -c -O3 prog1b.f
      f90 prog1a.o prog1b.o prog2.o
      a.out > results
  • Continue in this manner until you have isolated the section of code that is producing incorrect results.
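The binary chop is a bisection search, and its control flow can be sketched in Python. This is only an outline under stated assumptions: exactly one source unit misbehaves when optimized, and `results_ok` is a hypothetical callback standing in for "compile the listed units with -O0 and the rest with -O3, run, and compare against the baseline results".

```python
def binary_chop(units, results_ok):
    """Narrow units down to the one whose optimization breaks the results.

    results_ok(low_opt) is a hypothetical callback: it should build the
    program with the units in low_opt at -O0 and everything else at -O3,
    run it, and return True when the output matches the baseline.
    Assumes exactly one unit is at fault.
    """
    suspects = list(units)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if results_ok(half):
            suspects = half                    # fault is in the -O0 half
        else:
            suspects = suspects[len(half):]    # fault is in the other half
    return suspects[0]


# Stand-in for real compile-and-run steps: pretend prog1b.f is the culprit.
culprit = "prog1b.f"
units = ["prog1a.f", "prog1b.f", "prog2a.f", "prog2b.f"]
print(binary_chop(units, lambda low_opt: culprit in low_opt))
```

Each pass halves the suspect list, so a program split into k pieces needs about log2(k) rebuild-and-run cycles rather than k.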

Diagnostic Listings
• The SGI Origin 2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:
    f90 -listing ...
    f90 -fullwarn ...
    f90 -showdefaults ...
    f90 -version ...
    f90 -help ...

Further Information
• SGI
  • man f77/f90/cc
  • man debug_group
  • man math
  • man complib.sgimath
  • MIPSpro 64-Bit Porting and Transition Guide
  • Online Manuals
• Linux clusters pages
  • ifort/icc/icpc -help (IA32, IA64, Intel64)
  • Intel Fortran Compiler for Linux
  • Intel C/C++ Compiler for Linux

Agenda
• 1 Parallel Computing Overview
• 2 How to Parallelize a Code
• 3 Porting Issues
• 4 Scalar Tuning
  • 4.1 Aggressive Compiler Options
  • 4.2 Compiler Optimizations
  • 4.3 Vendor Tuned Code
  • 4.4 Further Information

Scalar Tuning
• If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime.
• This chapter describes many of these techniques:
  • The use of the most aggressive compiler options
  • The improvement of loop unrolling
  • The use of subroutine inlining
  • The use of vendor supplied tuned code
• The detection of cache problems and their solution are presented in the Cache Tuning chapter.

Aggressive Compiler Options
• For the SGI Origin2000 and the Linux clusters, the main optimization switch is -On where n ranges from 0 to 3.
  • -O0 turns off all optimizations.
  • -O1 and -O2 do beneficial optimizations that will not affect the accuracy of results.
  • -O3 specifies the most aggressive optimizations. It takes the most compile time, may produce changes in accuracy, and turns on software pipelining.

Aggressive Compiler Options
• It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes.
• It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization.
• On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n=1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to IEEE standard at different levels.
• On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.

Agenda
• 1 Parallel Computing Overview
• 2 How to Parallelize a Code
• 3 Porting Issues
• 4 Scalar Tuning
  • 4.1 Aggressive Compiler Options
  • 4.2 Compiler Optimizations
    • 4.2.1 Statement Level
    • 4.2.2 Block Level
    • 4.2.3 Routine Level
    • 4.2.4 Software Pipelining
    • 4.2.5 Loop Unrolling
    • 4.2.6 Subroutine Inlining
    • 4.2.7 Optimization Report
    • 4.2.8 Profile-guided Optimization (PGO)
  • 4.3 Vendor Tuned Code
  • 4.4 Further Information

Compiler Optimizations
• The various compiler optimizations can be classified as follows:
  • Statement Level Optimizations
  • Block Level Optimizations
  • Routine Level Optimizations
  • Software Pipelining
  • Loop Unrolling
  • Subroutine Inlining
• Each of these is described in the following sections.

Statement Level
• Constant Folding
  • Replace simple arithmetic operations on constants with the pre-computed result.
  • y = 5+7 becomes y = 12
• Short Circuiting
  • Avoid executing parts of conditional tests that are not necessary.
  • if (I.eq.J .or. I.eq.K) expression
    when I=J immediately compute the expression
• Register Assignment
  • Put frequently used variables in registers.

Block Level
• Dead Code Elimination
  • Remove unreachable code and code that is never executed or used.
• Instruction Scheduling
  • Reorder the instructions to improve memory pipelining.

Routine Level
• Strength Reduction
  • Replace expressions in a loop with an expression that takes fewer cycles.
• Common Subexpressions Elimination
  • Expressions that appear more than once are computed once, and the result is substituted for each occurrence of the expression.
• Constant Propagation
  • Compile time replacement of variables with constants.
• Loop Invariant Elimination
  • Expressions inside a loop that don't change with the do loop index are moved outside the loop.
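Two of these routine-level transformations can be shown by applying them by hand. The Python sketch below is only an illustration of what the compiler does automatically; the loop body is invented for the example. It moves a loop-invariant square root out of the loop and replaces the multiply 4*i with a running addition (strength reduction).

```python
import math


def before(a, x, n):
    """Loop as written: invariant sqrt and a multiply inside the loop."""
    out = []
    for i in range(n):
        scale = math.sqrt(x)          # does not depend on i
        out.append(a[i] * scale + 4 * i)
    return out


def after(a, x, n):
    """Same loop after hand-applying the two transformations."""
    scale = math.sqrt(x)              # loop invariant moved outside
    out = []
    offset = 0                        # strength reduction: 4*i becomes a running sum
    for i in range(n):
        out.append(a[i] * scale + offset)
        offset += 4
    return out


a = [0.5 * i for i in range(10)]
print(before(a, 2.0, 10) == after(a, 2.0, 10))
```

Both versions produce identical values; the second simply does less work per iteration.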

Software Pipelining
• Software pipelining allows the mixing of operations from different loop iterations in each iteration of the hardware loop. It is used to get the maximum work done per clock cycle.
• Note: On the R10000s there is out-of-order execution of instructions, and software pipelining may actually get in the way of this feature.

Loop Unrolling
• The loop's stride (or step) value is increased, and the body of the loop is replicated. It is used to improve the scheduling of the loop by giving a longer sequence of straight line code. An example of loop unrolling follows:

    Original Loop               Unrolled Loop
    do I = 1, 99                do I = 1, 99, 3
      c(I) = a(I) + b(I)          c(I)   = a(I)   + b(I)
    enddo                         c(I+1) = a(I+1) + b(I+1)
                                  c(I+2) = a(I+2) + b(I+2)
                                enddo

  There is a limit to the amount of unrolling that can take place because there are a limited number of registers.
• On the SGI Origin2000, loops are unrolled to a level of 8 by default. You can unroll to a level of 12 by specifying:
    f90 -O3 -OPT:unroll_times_max=12 ... prog.f
• On the IA32 Linux cluster, the corresponding flags are -unroll and -unroll0 for unrolling and no unrolling, respectively.
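The 99-iteration example divides evenly by 3. When the trip count is not a multiple of the unroll factor, a cleanup loop is needed for the leftover iterations. A minimal Python sketch of the same transformation, written by hand with illustrative function names (a compiler generates the equivalent automatically):

```python
def add_rolled(a, b):
    """Original loop: one element per iteration."""
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def add_unrolled_by_3(a, b):
    """Same loop unrolled by 3, with a cleanup loop for the remainder."""
    n = len(a)
    c = [0] * n
    main = n - n % 3                  # largest multiple of 3 not exceeding n
    for i in range(0, main, 3):       # stride 3, body replicated 3 times
        c[i] = a[i] + b[i]
        c[i + 1] = a[i + 1] + b[i + 1]
        c[i + 2] = a[i + 2] + b[i + 2]
    for i in range(main, n):          # leftover iterations
        c[i] = a[i] + b[i]
    return c


a = list(range(100))
b = list(range(100, 200))
print(add_rolled(a, b) == add_unrolled_by_3(a, b))
```

With 100 elements the main loop covers 99 and the cleanup loop handles the last one.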

Subroutine Inlining
• Subroutine inlining replaces a call to a subroutine with the body of the subroutine itself.
• One reason for using subroutine inlining is that when a subroutine is called inside a do loop that has a huge iteration count, subroutine inlining may be more efficient because it cuts down on call overhead.
• However, the chief reason for using it is that do loops that contain subroutine calls may not parallelize.

Subroutine Inlining
• On the SGI Origin2000 computer, there are several options to invoke inlining:
  • Inline all routines except those specified to -INLINE:never
      f90 -O3 -INLINE:all ... prog.f
  • Inline no routines except those specified to -INLINE:must
      f90 -O3 -INLINE:none ... prog.f
  • Specify a list of routines to inline at every call
      f90 -O3 -INLINE:must=subrname ... prog.f
  • Specify a list of routines never to inline
      f90 -O3 -INLINE:never=subrname ... prog.f
• On the Linux clusters, the following flags can invoke function inlining:
  • -ip: inline function expansion for calls defined within the current source file
  • -ipo: inline function expansion for calls defined in separate files

Optimization Report
• Intel 9.x and later compilers can generate reports that provide useful information on optimization done on different parts of your code.
• To generate such optimization reports in a file filename, add the flag -opt-report-file filename.
• If you have a lot of source files to process simultaneously, and you use a makefile to compile, you can also use make's "suffix" rules to have optimization reports produced automatically, each with a unique name. For example,
    .f.o:
            ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f
  creates optimization reports that are named identically to the original Fortran source but with the suffix ".f" replaced by ".opt".

Optimization Report
• To help developers and performance analysts navigate through the usually lengthy optimization reports, the NCSA program OptView is designed to provide an easy-to-use and intuitive interface that allows the user to browse through their own source code, cross-referenced with the optimization reports.
• OptView is installed on NCSA's IA64 Linux cluster under the directory /usr/apps/tools/bin. You can either add that directory to your UNIX PATH or you can invoke optview using an absolute path name. You'll need to be using the X-Window system and to have set your DISPLAY environment variable correctly for OptView to work.
• OptView can provide a quick overview of which loops in a source code or source codes among multiple files are highly optimized and which might need further work. For a detailed description of the use of OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/

Profile-guided Optimization (PGO)
• Profile-guided optimization allows Intel compilers to use valuable runtime information to make better decisions about function inlining and interprocedural optimizations to generate faster codes. Its methodology is illustrated as follows:

Profile-guided Optimization (PGO)
• First, you do an instrumented compilation by adding the -prof-gen flag in the compile process:
    icc -prof-gen -c a1.c a2.c a3.c
    icc a1.o a2.o a3.o -lirc
• Then, you run the program with a representative set of data to generate the dynamic information files given by the .dyn suffix.
• These files contain valuable runtime information for the compiler to do better function inlining and other optimizations.
• Finally, the code is recompiled again with the -prof-use flag to use the runtime information.
    icc -prof-use -ipo -c a1.c a2.c a3.c
• A profile-guided optimized executable is generated.

Vendor Tuned Code
• Vendor math libraries have codes that are optimized for their specific machine.
• On the SGI Origin2000 platform, Complib.sgimath and SCSL are available.
• On the Linux clusters, Intel MKL is available. Ways to link to these libraries are described in Section 3 - Porting Issues.

Further Information
• SGI IRIX man and www pages
  • man opt
  • man lno
  • man inline
  • man ipa
  • man perfex
  • Performance Tuning for the Origin2000 at http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
• Linux clusters help and www pages
  • ifort/icc/icpc -help (Intel)
  • http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
  • http://perfsuite.ncsa.uiuc.edu/OptView/

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
  5.1 Sequential Code Limitation
  5.2 Parallel Overhead
  5.3 Load Balance
    5.3.1 Loop Schedule Types
    5.3.2 Chunk Size

Parallel Code Tuning
• This chapter describes several of the most common techniques for parallel tuning, the type of programs that benefit, and the details for implementing them.
• The majority of this chapter deals with improving load balancing.

Sequential Code Limitation
• Sequential code is a part of the program that cannot be run with multiple processors. Some reasons why it cannot be made data parallel are:
  • The code is not in a do loop.
  • The do loop contains a read or write.
  • The do loop contains a dependency.
  • The do loop has an ambiguous subscript.
  • The do loop has a call to a subroutine or a reference to a function subprogram.
• Sequential Code Fraction
  • As shown by Amdahl's Law, if the sequential fraction is too large, there is a limitation on speedup. If you think too much sequential code is a problem, you can calculate the sequential fraction of code using the Amdahl's Law formula.

Sequential Code Limitation
• Measuring the Sequential Code Fraction
  • Decide how many processors to use, this is p.
  • Run and time the program with 1 processor to give T(1).
  • Run and time the program with p processors to give T(p).
  • Form a ratio of the 2 timings T(1)/T(p), this is SP.
  • Substitute SP and p into the Amdahl's Law formula:
    • f=(1/SP-1/p)/(1-1/p), where f is the fraction of sequential code.
  • Solve for f, this is the fraction of sequential code.
• Decreasing the Sequential Code Fraction
  • The compilation optimization reports list which loops could not be parallelized and why. You can use this report as a guide to improve performance on do loops by:
    • Removing dependencies
    • Removing I/O
    • Removing calls to subroutines and function subprograms
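The measurement steps above reduce to a one-line calculation. The Python sketch below uses hypothetical timings (100 s on 1 processor, 40 s on 4 processors), chosen only to make the arithmetic easy to follow; the function names are illustrative.

```python
def sequential_fraction(t1, tp, p):
    """Estimate the sequential fraction f from two wall-clock timings."""
    sp = t1 / tp                           # measured speedup SP = T(1)/T(p)
    return (1.0 / sp - 1.0 / p) / (1.0 - 1.0 / p)


def amdahl_speedup(f, p):
    """Speedup predicted by Amdahl's Law for sequential fraction f."""
    return 1.0 / (f + (1.0 - f) / p)


# Hypothetical timings: 100 s on 1 processor, 40 s on 4 processors.
f = sequential_fraction(100.0, 40.0, 4)
print(round(f, 3))                         # estimated sequential fraction
print(round(amdahl_speedup(f, 4), 3))      # reproduces the measured speedup
print(round(1.0 / f, 3))                   # speedup limit as p grows large
```

Here SP = 2.5, so f = (0.4 - 0.25) / 0.75 = 0.2. With a fifth of the run time sequential, the speedup can never exceed 1/f = 5 no matter how many processors are added.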

Parallel Overhead
• Parallel overhead is the processing time spent
  • creating threads
  • spin/blocking threads
  • starting and ending parallel regions
  • synchronizing at the end of parallel regions
• When the computational work done by the parallel processes is too small, the overhead time needed to create and control the parallel processes can be disproportionately large, limiting the savings due to parallelism.
• Measuring Parallel Overhead
  • To get a rough under-estimate of parallel overhead:
    • Run and time the code using 1 processor.
    • Parallelize the code.
    • Run and time the parallel code using only 1 processor.
    • Subtract the 2 timings.

Parallel Overhead
• Reducing Parallel Overhead
  • To reduce parallel overhead:
    • Don't parallelize all the loops.
    • Don't parallelize small loops.
      • To benefit from parallelization, a loop needs about 1000 floating point operations or 500 statements in the loop. You can use the IF modifier in the OpenMP directive to control when loops are parallelized.
          !$OMP PARALLEL DO IF(n > 500)
          do i=1,n
            ... body of loop ...
          end do
          !$OMP END PARALLEL DO
    • Use task parallelism instead of data parallelism. It doesn't generate as much parallel overhead and often more code runs in parallel.
    • Don't use more threads than you need.
    • Parallelize at the highest level possible.

Load Balance
• Load balance
  • is the even assignment of subtasks to processors so as to keep each processor busy doing useful work for as long as possible.
  • Load balance is important for speedup because the end of a do loop is a synchronization point where threads need to catch up with each other.
  • If processors have different work loads, some of the processors will idle while others are still working.
• Measuring Load Balance
  • On the SGI Origin, to measure load balance, use the perfex tool which is a command line interface to the R10000 hardware counters. The command
      perfex -e16 -mp a.out > results
    reports per thread cycle counts. Compare the cycle counts to determine load balance problems. The master thread (thread 0) always uses more cycles than the slave threads. If the counts are vastly different, it indicates load imbalance.

Load Balance
• For Linux systems, the thread cpu times can be compared with ps. A thread with unusually high or low time compared to the others may not be working efficiently [high cputime could be the result of a thread spinning while waiting for other threads to catch up].
    ps uH
• Improving Load Balance
  • To improve load balance, try changing the way that loop iterations are allocated to threads by
    • changing the loop schedule type
    • changing the chunk size
  • These methods are discussed in the following sections.

Loop Schedule Types
• On the SGI Origin2000 computer, 4 different loop schedule types can be specified by an OpenMP directive. They are:
  • Static
  • Dynamic
  • Guided
  • Runtime
• If you don't specify a schedule type, the default will be used.
• Default Schedule Type
  • The default schedule type allocates 20 iterations on 4 threads as:
    [Figure: allocation of 20 iterations to 4 threads under the default schedule]

Loop Schedule Types
• Static Schedule Type
  • The static schedule type is used when some of the iterations do more work than others. With the static schedule type, iterations are allocated in a round-robin fashion to the threads.
• An Example
  • Suppose you are computing on the upper triangle of a 100 x 100 matrix, and you use 2 threads, named t0 and t1. With default scheduling, workloads are uneven.

Loop Schedule Types
• Whereas with static scheduling, the columns of the matrix are given to the threads in a round robin fashion, resulting in better load balance.
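The upper-triangle example can be quantified with a short calculation. The Python sketch below rests on assumptions the slides do not spell out: the parallel loop runs over columns, column j of the upper triangle holds j elements (diagonal included), and the default schedule hands each thread one contiguous block of columns. The names are illustrative.

```python
def column_work(j):
    """Elements of column j (1-based) in the upper triangle, diagonal included."""
    return j


def work_per_thread(assignment, nthreads):
    """Total elements processed by each thread for a column-to-thread map."""
    totals = [0] * nthreads
    for j, t in assignment.items():
        totals[t] += column_work(j)
    return totals


n, nthreads = 100, 2
# Contiguous blocks: t0 gets columns 1..50, t1 gets columns 51..100.
block = {j: (j - 1) * nthreads // n for j in range(1, n + 1)}
# Round robin: columns alternate t0, t1, t0, t1, ...
round_robin = {j: (j - 1) % nthreads for j in range(1, n + 1)}

print(work_per_thread(block, nthreads))        # [1275, 3775]
print(work_per_thread(round_robin, nthreads))  # [2500, 2550]
```

Under the block assignment t1 does almost three times the work of t0, so t0 idles at the end of the loop. Under round robin the two totals differ by only 50 elements out of 5050.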

Loop Schedule Types
• Dynamic Schedule Type
  • The iterations are dynamically allocated to threads at runtime. Each thread is given a chunk of iterations. When a thread finishes its work, it goes into a critical section where it's given another chunk of iterations to work on.
  • This type is useful when you don't know the iteration count or work pattern ahead of time. Dynamic gives good load balance, but at a high overhead cost.
• Guided Schedule Type
  • The guided schedule type is dynamic scheduling that starts with large chunks of iterations and ends with small chunks of iterations. That is, the number of iterations given to each thread depends on the number of iterations remaining. The guided schedule type reduces the number of entries into the critical section, compared to the dynamic schedule type. Guided gives good load balancing at a low overhead cost.
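To see why guided scheduling enters the critical section less often, it helps to list the chunk sizes it would hand out. The Python sketch below assumes each request receives the remaining iterations divided by the number of threads, rounded up; actual OpenMP implementations are free to choose the sizes differently, so treat the formula as an illustration only.

```python
import math


def guided_chunks(niter, nthreads, min_chunk=1):
    """Chunk sizes handed out, assuming each is ceil(remaining / nthreads)."""
    chunks, remaining = [], niter
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / nthreads))
        size = min(size, remaining)       # never hand out more than is left
        chunks.append(size)
        remaining -= size
    return chunks


chunks = guided_chunks(20, 4)
print(chunks)         # [5, 4, 3, 2, 2, 1, 1, 1, 1]
print(len(chunks))    # 9 requests
```

For 20 iterations on 4 threads this gives 9 requests for work, compared with 20 requests for dynamic scheduling with a chunk size of 1, while the small chunks at the end still even out the finish times.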

Chunk Size
• The word chunk refers to a grouping of iterations. Chunk size means how many iterations are in the grouping. The static and dynamic schedule types can be used with a chunk size. If a chunk size is not specified, then the chunk size is 1.
• Suppose you specify a chunk size of 2 with the static schedule type. Then 20 iterations are allocated on 4 threads:
  [Figure: allocation of 20 iterations to 4 threads with static scheduling and a chunk size of 2]
• The schedule type and chunk size are specified as follows:
    !$OMP PARALLEL DO SCHEDULE(type, chunk)
    ...
    !$OMP END PARALLEL DO
• Where type is STATIC, or DYNAMIC, or GUIDED and chunk is any positive integer.
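The allocation for the chunk-size-2 example can be worked out directly. The Python sketch below assumes chunks of consecutive iterations are dealt to the threads in round-robin order, which is how the OpenMP specification defines SCHEDULE(STATIC, chunk); the function name and the 0-based thread numbering are illustrative.

```python
def static_schedule(niter, nthreads, chunk):
    """Deal chunks of consecutive iterations to threads in round-robin order."""
    owner = {t: [] for t in range(nthreads)}
    for start in range(0, niter, chunk):
        t = (start // chunk) % nthreads
        # Iterations are numbered from 1, as in a Fortran do loop.
        owner[t].extend(range(start + 1, min(start + chunk, niter) + 1))
    return owner


for t, iterations in static_schedule(20, 4, 2).items():
    print("thread", t, iterations)
```

With 20 iterations, 4 threads and a chunk size of 2, thread 0 receives iterations 1-2, 9-10 and 17-18, thread 1 receives 3-4, 11-12 and 19-20, and threads 2 and 3 receive two chunks each. Calling static_schedule(20, 4, 5) instead gives each thread one contiguous block of 5 iterations.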

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information

Timing and Profiling
• Now that your program has been ported to the new computer, you will want to know how fast it runs.
• This chapter describes how to measure the speed of a program using various timing routines.
• The chapter also covers how to determine which parts of the program account for the bulk of the computational load so that you can concentrate your tuning efforts on those computationally intensive parts of the program.

Timing
• In the following sections, we'll discuss timers and review the profiling tools ssrun and prof on the Origin and vprof and gprof on the Linux Clusters. The specific timing functions described are:
• Timing a section of code
  • FORTRAN
    • etime, dtime, cpu_time for CPU time
    • time and f_time for wallclock time
  • C
    • clock for CPU time
    • gettimeofday for wallclock time
• Timing an executable
  • time a.out
• Timing a batch run
  • busage
  • qstat
  • qhist

CPU Time
• etime
  • A section of code can be timed using etime.
  • It returns the elapsed CPU time in seconds since the program started.
    real*4 tarray(2),time1,time2,timeres
    ... beginning of program
    time1=etime(tarray)
    ... start of section of code to be timed
    ... lots of computation
    ... end of section of code to be timed
    time2=etime(tarray)
    timeres=time2-time1

CPU Time
• dtime
  • A section of code can also be timed using dtime.
  • It returns the elapsed CPU time in seconds since the last call to dtime.
    real*4 tarray(2),timeres
    ... beginning of program
    timeres=dtime(tarray)
    ... start of section of code to be timed
    ... lots of computation
    ... end of section of code to be timed
    timeres=dtime(tarray)
    ... rest of program

CPU Time
The etime and dtime Functions
• User time.
  • This is returned as the first element of tarray.
  • It's the CPU time spent executing user code.
• System time.
  • This is returned as the second element of tarray.
  • It's the time spent executing system calls on behalf of your program.
• Sum of user and system time.
  • This is the function value that is returned.
  • It's the time that is usually reported.
• Metric.
  • Timings are reported in seconds.
  • Timings are accurate to 1/100th of a second.
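C has no etime, but the same user/system split is available on POSIX systems through getrusage. The sketch below mimics the etime convention (user time in t[0], system time in t[1], their sum as the return value); the wrapper name cpu_seconds is hypothetical and is not part of any library mentioned in these slides.

```c
#include <assert.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Store user CPU seconds in t[0] and system CPU seconds in t[1];
   return their sum, or -1.0 if getrusage fails. */
double cpu_seconds(double t[2])
{
    struct rusage ru;
    assert(t);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    t[0] = (double)ru.ru_utime.tv_sec + 1.0e-6 * (double)ru.ru_utime.tv_usec;
    t[1] = (double)ru.ru_stime.tv_sec + 1.0e-6 * (double)ru.ru_stime.tv_usec;
    return t[0] + t[1];
}
```

Calling cpu_seconds before and after a section of code and subtracting the two results gives the CPU time of that section, in the same way as the etime example.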

CPU Time
Timing Comparison Warnings
• For the SGI computers:
  • The etime and dtime functions return the MAX time over all threads for a parallel program.
  • This is the time of the longest thread, which is usually the master thread.
• For the Linux Clusters:
  • The etime and dtime functions are contained in the VAX compatibility library of the Intel FORTRAN Compiler.
  • To use this library include the compiler flag -Vaxlib.
• Another warning: Do not put calls to etime and dtime inside a do loop. The overhead is too large.

CPU Time
cpu_time
• The cpu_time routine is available only on the Linux clusters as it is a component of the Intel FORTRAN compiler library.
• It provides substantially higher resolution and has substantially lower overhead than the older etime and dtime routines.
• It can be used as an elapsed timer.
    real*8 time1, time2, timeres
    ... beginning of program
    call cpu_time(time1)
    ... start of section of code to be timed
    ... lots of computation
    ... end of section of code to be timed
    call cpu_time(time2)
    timeres=time2-time1
    ... rest of program

CPU Time
clock
• For C programmers, one can call the cpu_time routine using a FORTRAN wrapper or call the standard library function clock that can be used to determine elapsed CPU time.
    #include <time.h>
    static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
    double time1, time2, timres;
    ...
    time1=(clock()*iCPS);
    ...
    /* do some work */
    ...
    time2=(clock()*iCPS);
    timres=time2-time1;

Wall clock Time
time
• For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970.
• It is a means of getting the elapsed wall clock time.
• The wall clock time is reported in integer seconds.
    external time
    integer*4 time1,time2,timeres
    ... beginning of program
    time1=time( )
    ... start of section of code to be timed
    ... lots of computation
    ... end of section of code to be timed
    time2=time( )
    timeres=time2 - time1

Wall clock Time
f_time
• For the Linux clusters, the appropriate FORTRAN function for elapsed time is f_time.
    integer*8 f_time
    external f_time
    integer*8 time1,time2,timeres
    ... beginning of program
    time1=f_time()
    ... start of section of code to be timed
    ... lots of computation
    ... end of section of code to be timed
    time2=f_time()
    timeres=time2 - time1
• As above for etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib.

Wall clock Time
gettimeofday
• For C programmers, wallclock time can be obtained by using the very portable routine gettimeofday.
    #include <stddef.h>   /* definition of NULL */
    #include <sys/time.h> /* definition of timeval struct and prototyping of gettimeofday */
    double t1,t2,elapsed;
    struct timeval tp;
    int rtn;
    ...
    rtn=gettimeofday(&tp, NULL);
    t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    ...
    /* do some work */
    ...
    rtn=gettimeofday(&tp, NULL);
    t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    elapsed=t2-t1;

Timing an Executable
• To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time)
    time ...options... a.out
• where options can be '-p' for a simple output or '-f format' which allows the user to display more than just time related information.
• Consult the man pages on the time command for format options.

Timing a Batch Job
• Time of a batch job running or completed.
• Origin
    busage jobid
• Linux clusters
    qstat jobid   # for a running job
    qhist jobid   # for a completed job


Profiling
• Profiling determines where a program spends its time.
• It detects the computationally intensive parts of the code.
• Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load.
• Most codes follow the 90-10 Rule.
  • That is, 90% of the computation is done in 10% of the code.

Profiling Tools
Profiling Tools on the Origin
• On the SGI Origin2000 computer there are profiling tools named ssrun and prof.
• Used together they do profiling, or what is called hot spot analysis.
• They are useful for generating timing profiles.
• ssrun
  • The ssrun utility collects performance data for an executable that you specify.
  • The performance data is written to a file named "executablename.exptype.id".
• prof
  • The prof utility analyzes the data file created by ssrun and produces a report.
• Example
    ssrun -fpcsamp a.out
    prof -h a.out.fpcsamp.m12345 > prof.list

Profiling Tools
Profiling Tools on the Linux Clusters
• On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun, prof and perfex tools.
• gprof
  • Basic profiling information can be generated using the OS utility gprof.
  • First, compile the code with the compiler flags -qp -g for the Intel compiler (-g on the Intel compiler does not change the optimization level) or -pg for the GNU compiler.
  • Second, run the program.
  • Finally analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out.
    efc -O -qp -g -o foo foo.f
    ./foo
    gprof foo gmon.out

Profiling Tools
Profiling Tools on the Linux Clusters
• vprof
  • On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library.
  • To instrument the whole application requires recompiling and linking to vprof and PAPI libraries.
    setenv VMON PAPI_TOT_CYC
    ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
    ./md
    /usr/apps/tools/vprof/bin/cprof -e md vmon.out

Profile Listings
Profile Listings on the Origin
• Prof Output First Listing

      Cycles      %   Cum%   Secs  Proc
    --------  -----  -----  -----  ------
    42630984  58.47  58.47   0.57  VSUB
     6498294   8.91  67.38   0.09  PFSOR
     6141611   8.42  75.81   0.08  PBSOR
     3654120   5.01  80.82   0.05  PFSOR1
     2615860   3.59  84.41   0.03  VADD
     1580424   2.17  86.57   0.02  ITSRCG
     1144036   1.57  88.14   0.02  ITSRSI
      886044   1.22  89.36   0.01  ITJSI
      861136   1.18  90.54   0.01  ITJCG

• The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.

Profile Listings
Profile Listings on the Origin
• Prof Output Second Listing

      Cycles      %   Cum%  Line  Proc
    --------  -----  -----  ----  ------
    36556944  50.14  50.14  8106  VSUB
     5313198   7.29  57.43  6974  PFSOR
     4968804   6.81  64.24  6671  PBSOR
     2989882   4.10  68.34  8107  VSUB
     2564544   3.52  71.86  7097  PFSOR1
     1988420   2.73  74.59  8103  VSUB
     1629776   2.24  76.82  8045  VADD
      994210   1.36  78.19  8108  VSUB
      969056   1.33  79.52  8049  VADD
      483018   0.66  80.18  6972  PFSOR

• The second listing gives the number of cycles per source code line.
• The lines are listed in descending order of cycle count.

Profile Listings
Profile Listings on the Linux Clusters
• gprof Output First Listing

    Flat profile:
    Each sample counts as 0.000976562 seconds.
      %   cumulative    self               self      total
     time    seconds   seconds     calls  us/call    us/call  name
    -----  ---------  --------  --------  --------  ---------  -----------
    38.07       5.67      5.67       101  56157.18  107450.88  compute_
    34.72      10.84      5.17  25199500      0.21       0.21  dist_
    25.48      14.64      3.80                                 SIND_SINCOS
     1.25      14.83      0.19                                 sin
     0.37      14.88      0.06                                 cos
     0.05      14.89      0.01     50500      0.15       0.15  dotr8_
     0.05      14.90      0.01       100     68.36      68.36  update_
     0.01      14.90      0.00                                 f_fioinit
     0.01      14.90      0.00                                 f_intorange
     0.01      14.90      0.00                                 mov
     0.00      14.90      0.00         1      0.00       0.00  initialize_

• The listing gives a 'flat' profile of functions and routines encountered, sorted by 'self seconds' which is the number of seconds accounted for by this function alone.

Profile Listings
Profile Listings on the Linux Clusters
• gprof Output Second Listing

    Call graph:
    index  % time   self  children     called             name
    -----  ------   ----  --------  -----------------     ----------------
    [1]     72.9    0.00    10.86                         main [1]
                    5.67     5.18       101/101               compute_ [2]
                    0.01     0.00       100/100               update_ [8]
                    0.00     0.00         1/1                 initialize_ [12]
    ----------------------------------------------------------------------
                    5.67     5.18       101/101               main [1]
    [2]     72.8    5.67     5.18       101               compute_ [2]
                    5.17     0.00  25199500/25199500          dist_ [3]
                    0.01     0.00     50500/50500             dotr8_ [7]
    ----------------------------------------------------------------------
                    5.17     0.00  25199500/25199500          compute_ [2]
    [3]     34.7    5.17     0.00  25199500               dist_ [3]
    ----------------------------------------------------------------------
                                                          <spontaneous>
    [4]     25.5    3.80     0.00                         SIND_SINCOS [4]
    ...

• The second listing gives a 'call-graph' profile of functions and routines encountered. The definitions of the columns are specific to the line in question. Detailed information is contained in the full output from gprof.

Profile Listings
Profile Listings on the Linux Clusters
• vprof Listing

    Columns correspond to the following events:
        PAPI_TOT_CYC - Total cycles (1956 events)
    File Summary:
        100.0% /u/ncsa/gbauer/temp/md.f
    Function Summary:
         84.4% compute
         15.6% dist
    Line Summary:
        [per-line cycle percentages for /u/ncsa/gbauer/temp/md.f lines 102, 104, 105, 106, 107, 162, 164, 165, 166 and 169]

• The above listing (using the -e option to cprof) displays not only cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions.

Profile Listings
Profile Listings on the Linux Clusters
• vprof Listing (cont.)

    [Line Summary continued: entries for /u/ncsa/gbauer/temp/md.f lines 100, 109, 149 and 163]
    ...
    [annotated source, each line prefixed by its share of total cycles]
    100    do j=1,np
    101      if (i .ne. j) then
    102        call dist(nd,box,pos(1,i),pos(1,j),rij,d)
    103        ! attribute half of the potential energy to particle 'j'
    104        pot = pot + 0.5*v(d)
    105        do k=1,nd
    106          f(k,i) = f(k,i) - rij(k)*dv(d)/d
    107        enddo
    108      endif
    109    enddo

Profiling Analysis
• The program being analyzed in the previous Origin example has approximately 10000 source code lines, and consists of many subroutines.
• The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine.
• The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation.
• Going back to the source code, line 8106 is a line inside a do loop.
• Putting an OpenMP compiler directive in front of that do loop you can get 50% of the program to run in parallel with almost no work on your part.
• Since the compiler has rearranged the source lines the line numbers given by ssrun/prof give you an area of the code to inspect.
• To view the rearranged source use the option
    f90 ... -FLIST:=ON
    cc ... -CLIST:=ON
• For the Intel compilers, the appropriate options are
    ifort ... -E ...
    icc ... -E ...

Further Information
• SGI Irix
  • man etime
  • man 3 time
  • man 1 time
  • man busage
  • man timers
  • man ssrun
  • man prof
  • Origin2000 Performance Tuning and Optimization Guide
• Linux Clusters
  • man 3 clock
  • man 2 gettimeofday
  • man 1 time
  • man 1 gprof
  • man 1B qstat
  • Intel Compilers
  • Vprof on NCSA Linux Cluster

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690

Agenda
7 Cache Tuning
  7.1 Cache Concepts
    7.1.1 Memory Hierarchy
    7.1.2 Cache Mapping
    7.1.3 Cache Thrashing
    7.1.4 Cache Coherence
  7.2 Cache Specifics
  7.3 Code Optimization
  7.4 Measuring Cache Performance
  7.5 Locating the Cache Problem
  7.6 Cache Tuning Strategy
  7.7 Preserve Spatial Locality
  7.8 Locality Problem
  7.9 Grouping Data Together
  7.10 Cache Thrashing Example
  7.11 Not Enough Cache
  7.12 Loop Blocking
  7.13 Further Information

Cache Concepts
• The CPU time required to perform an operation is the sum of the clock cycles executing instructions and the clock cycles waiting for memory.
• The CPU cannot be performing useful work if it is waiting for data to arrive from memory.
• Clearly then, the memory system is a major factor in determining the performance of your program and a large part is your use of the cache.
• The following sections will discuss the key concepts of cache including:
  • Memory subsystem hierarchy
  • Cache mapping
  • Cache thrashing
  • Cache coherence

Memory Hierarchy
• The different subsystems in the memory hierarchy have different speeds, sizes, and costs.
  • Smaller memory is faster
  • Slower memory is cheaper
• The hierarchy is set up so that the fastest memory is closest to the CPU, and the slower memories are further away from the CPU.

Memory Hierarchy
• It's a hierarchy because every level is a subset of a level further away.
  • All data in one level is found in the level below.
• The purpose of cache is to improve the memory access time to the processor.
  • There is an overhead associated with it, but the benefits outweigh the cost.
• Registers
  • Registers are the sources and destinations of CPU data operations.
  • They hold one data element each and are 32 bits or 64 bits wide.
  • They are on-chip and built from SRAM.
  • Computers usually have 32 or 64 registers.
  • The Origin MIPS R10000 has 64 physical 64-bit registers of which 32 are available for floating-point operations.
  • The Intel IA64 has 328 registers for general-purpose (64 bit), floating-point (80 bit), predicate (1 bit), branch and other functions.
  • Register access speeds are comparable to processor speeds.

Memory Hierarchy
• Main Memory Improvements
  • A hardware improvement called interleaving reduces main memory access time.
  • In interleaving, memory is divided into partitions or segments called memory banks.
  • Consecutive data elements are spread across the banks.
  • Each bank supplies one data element per bank cycle.
  • Multiple data elements are read in parallel, one from each bank.
  • The problem with interleaving is that the memory interleaving improvement assumes that memory is accessed sequentially.
  • If there is 2-way memory interleaving, but the code accesses every other location, there is no benefit.
  • The bank cycle time is 4-8 times the CPU clock cycle time so the main memory can't keep up with the fast CPU and keep it busy with data.
  • Large main memory with a cycle time comparable to the processor is not affordable.
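The stride problem with interleaving can be seen by counting how many banks a sequence of accesses actually touches. This sketch assumes the simplest possible mapping, where data element number addr lives in bank addr mod nbanks; the function names are hypothetical.

```c
#include <assert.h>

/* Bank that holds data element number addr (assumed mapping). */
int bank_of(long addr, int nbanks)
{
    return (int)(addr % nbanks);
}

/* Number of distinct banks touched by count accesses that start at
   element start and advance by stride elements each time. */
int banks_used(long start, long stride, int count, int nbanks)
{
    assert(start >= 0 && stride >= 0 && count >= 0);
    assert(nbanks > 0 && nbanks <= 64);
    int seen[64] = {0};
    int used = 0;
    for (int k = 0; k < count; k++) {
        int b = bank_of(start + (long)k * stride, nbanks);
        if (!seen[b]) { seen[b] = 1; used++; }
    }
    return used;
}
```

With 2 banks, sequential access (stride 1) alternates between both banks, while stride 2 hits the same bank every time, so nothing is read in parallel.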

y For example. a cache line of data is brought into the cache instead of a single data element. y When the cache line size becomes too large. y When a main memory access is made. the transfer time increases. y Spatial Locality: When an item is referenced. y The additional elements in the cache line will most likely be needed soon. y Cache Line y The overhead of the cache can be reduced by fetching a chunk or block of data elements. y The cache miss rate falls as the size of the cache line increases. a cache line is typically 32 or 128 bytes. y A cache line is defined in terms of a number of bytes. . y This takes advantage of spatial locality. y Temporal Locality: When an item is referenced.Memory Hierarchy y Principle of Locality y The way your program operates follows the Principle of Locality. but there is a point of negative returns on the cache line size. it will be referenced again soon. items whose addresses are nearby will tend to be referenced soon.

Memory Hierarchy
• Cache Hit
  • A cache hit occurs when the data element requested by the processor is in the cache.
  • You want to maximize hits.
  • The Cache Hit Rate is defined as the fraction of cache hits.
  • It is the fraction of the requested data that is found in the cache.
• Cache Miss
  • A cache miss occurs when the data element requested by the processor is NOT in the cache.
  • You want to minimize cache misses. Cache Miss Rate is defined as 1.0 - Hit Rate
  • Cache Miss Penalty, or miss time, is the time needed to retrieve the data from a lower level (downstream) of the memory hierarchy. (Recall that the lower levels of the hierarchy have a slower access time.)
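The hit rate, miss rate, and miss penalty combine into an average cost per memory access. The sketch below uses the common textbook estimate (hit time plus miss rate times miss penalty), which is an assumption added here and not a formula from these slides; the function names and the cycle counts in the usage note are hypothetical.

```c
#include <assert.h>

/* Miss rate = 1.0 - hit rate, computed from raw counts. */
double miss_rate(long hits, long accesses)
{
    assert(accesses > 0 && hits >= 0 && hits <= accesses);
    return 1.0 - (double)hits / (double)accesses;
}

/* Estimated average cycles per access:
   hit time + miss rate * miss penalty. */
double avg_access_cycles(double hit_cycles, double rate, double penalty_cycles)
{
    assert(rate >= 0.0 && rate <= 1.0);
    return hit_cycles + rate * penalty_cycles;
}
```

For example, 75 hits out of 100 accesses is a miss rate of 0.25; with a 1-cycle hit and an 11-cycle penalty the estimate is 3.75 cycles per access, almost four times the cost of a pure hit.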

Memory Hierarchy
• Levels of Cache
  • It used to be that there were two levels of cache: on-chip and off-chip.
  • L1/L2 is still true for the Origin MIPS and the Intel IA-32 processors.
  • Caches closer to the CPU are called Upstream. Caches further from the CPU are called Downstream.
  • The on-chip cache is called First level, L1, or primary cache.
    • An on-chip cache performs the fastest but the computer designer makes a trade-off between die size and cache size. Hence, on-chip cache has a small size. When the on-chip cache has a cache miss the time to access the slower main memory is very large.
  • The off-chip cache is called Second Level, L2, or secondary cache.
    • A cache miss is very costly. To solve this problem, computer designers have implemented a larger, slower off-chip cache. This chip speeds up the on-chip cache miss time. L1 cache misses are handled quickly. L2 cache misses have a larger performance penalty.
  • The cache external to the chip is called Third Level, L3.
  • The newer Intel IA-64 processor has 3 levels of cache.

Memory Hierarchy
• Split or Unified Cache
  • In unified cache, typically L2, the cache is a combined instruction-data cache.
    • A disadvantage of a unified cache is that when the data access and instruction access conflict with each other, the cache may be thrashed, e.g. a high cache miss rate.
  • In split cache, typically L1, the cache is split into 2 parts:
    • one for the instructions, called the instruction cache
    • another for the data, called the data cache.
    • The 2 caches are independent of each other, and they can have independent properties.
• Memory Hierarchy Sizes
  • Memory hierarchy sizes are specified in the following units:
    • Cache Line: bytes
    • L1 Cache: Kbytes
    • L2 Cache: Mbytes
    • Main Memory: Gbytes

Cache Mapping
• Cache mapping determines which cache location should be used to store a copy of a data element from main memory. There are 3 mapping strategies:
  • Direct mapped cache
  • Set associative cache
  • Fully associative cache
• Direct Mapped Cache
  • In direct mapped cache, a line of main memory is mapped to only a single line of cache.
  • Consequently, a particular cache line can be filled from (size of main memory / size of cache) different lines from main memory.
  • Direct mapped cache is inexpensive but also inefficient and very susceptible to cache thrashing.
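A direct mapped cache can be described by one index calculation. The sketch below assumes the usual scheme in which the memory line number, taken modulo the number of cache lines, selects the cache line; the function name and the sizes in the usage note are illustrative.

```c
#include <assert.h>

/* Cache line index that byte address addr maps to in a direct mapped
   cache of cache_bytes total size with line_bytes per line. */
long direct_mapped_line(unsigned long addr, unsigned long line_bytes,
                        unsigned long cache_bytes)
{
    assert(line_bytes > 0 && cache_bytes >= line_bytes);
    assert(cache_bytes % line_bytes == 0);
    unsigned long num_lines = cache_bytes / line_bytes;
    return (long)((addr / line_bytes) % num_lines);
}
```

With a 32 KB cache and 32-byte lines, addresses 0 and 32768 both map to cache line 0, so two data elements exactly one cache size apart always compete for the same line.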

Cache Mapping
• Direct Mapped Cache
  [Figure: direct mapped cache. Source: http://larc.ee.nthu.edu.tw/~cthuang/courses/ee3450/lectures/07_memory.html]

Cache Mapping
• Fully Associative Cache
  • For fully associative cache, any line of cache can be loaded with any line from main memory.
  • This technology is very fast but also very expensive.
  [Figure source: http://www.xbitlabs.com/images/video/radeon-x1000/caches.png]

Cache Mapping
• Set Associative Cache
  • For N-way set associative cache, you can think of cache as being divided into N sets (usually N is 2 or 4).
  • A line from main memory can then be written to its cache line in any of the N sets.
  • This is a trade-off between direct mapped and fully associative cache.
  [Figure source: http://www.alasir.com/articles/cache_principles/cache_way.png]

Cache Mapping
• Cache Block Replacement
  • With direct mapped cache, a cache line can only be mapped to one unique place in the cache. The new cache line replaces the cache block at that address. With set associative cache there is a choice of 3 strategies:
  1. Random
    • There is a uniform random replacement within the set of cache blocks.
    • The advantage of random replacement is that it's simple and inexpensive to implement.
  2. LRU (Least Recently Used)
    • The block that gets replaced is the one that hasn't been used for the longest time.
    • The principle of temporal locality tells us that recently used data blocks are likely to be used again soon.
    • An advantage of LRU is that it preserves temporal locality.
    • A disadvantage of LRU is that it's expensive to keep track of cache access patterns.
    • In empirical studies, there was little performance difference between LRU and Random.
  3. FIFO (First In First Out)
    • Replace the block that was brought in N accesses ago, regardless of the usage pattern.
    • In empirical studies, Random replacement generally outperformed FIFO.
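LRU replacement is easiest to follow on a single set. The sketch below simulates one N-way set in software, using a counter as a timestamp for each block; real hardware keeps this bookkeeping in a few status bits, and the type and function names here are hypothetical.

```c
#include <assert.h>

#define MAX_WAYS 8

typedef struct {
    long tag[MAX_WAYS];        /* block held in each way, -1 if empty */
    long last_used[MAX_WAYS];  /* time of the most recent access      */
    int  ways;
    long clock;                /* advances by one on every access     */
} lru_set;

void lru_init(lru_set *s, int ways)
{
    assert(s && ways > 0 && ways <= MAX_WAYS);
    s->ways = ways;
    s->clock = 0;
    for (int w = 0; w < ways; w++) { s->tag[w] = -1; s->last_used[w] = 0; }
}

/* Access block tag (tag >= 0). Returns 1 on a hit. On a miss the block
   replaces the least recently used way and 0 is returned. */
int lru_access(lru_set *s, long tag)
{
    assert(s && tag >= 0);
    int victim = 0;
    s->clock++;
    for (int w = 0; w < s->ways; w++) {
        if (s->tag[w] == tag) { s->last_used[w] = s->clock; return 1; }
        if (s->last_used[w] < s->last_used[victim]) victim = w;
    }
    s->tag[victim] = tag;
    s->last_used[victim] = s->clock;
    return 0;
}
```

In a 2-way set, the access sequence 1, 2, 1, 3 evicts block 2 on the last access because block 1 was used more recently; a following access to 1 still hits.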

Cache Thrashing
• Cache thrashing is a problem that happens when a frequently used cache line gets displaced by another frequently used cache line.
• Cache thrashing can happen for both instruction and data caches.
• The CPU can't find the data element it wants in the cache and must make another main memory cache line access.
• The same data elements are repeatedly fetched into and displaced from the cache.
• Cache thrashing happens because the computational code statements have too many variables and arrays for the needed data elements to fit in cache.
  • Cache lines are discarded and later retrieved.
  • The arrays are dimensioned too large to fit in cache.
  • The arrays are accessed with indirect addressing, e.g. a(k(j)).
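Thrashing can be reproduced with a small software model of a direct mapped cache. The sketch below counts misses for a loop that reads a[i] and b[i] in turn, with 8-byte elements; the model, the function name, and the base addresses are assumptions chosen for illustration.

```c
#include <assert.h>

#define SIM_MAX_LINES 1024

/* Misses for n iterations of "use a[i], then b[i]" in a direct mapped
   cache with num_lines lines of line_bytes each. a_base and b_base are
   the byte addresses of the two arrays of 8-byte elements. */
long count_misses(unsigned long a_base, unsigned long b_base, long n,
                  unsigned long line_bytes, unsigned long num_lines)
{
    assert(n >= 0 && line_bytes > 0);
    assert(num_lines > 0 && num_lines <= SIM_MAX_LINES);
    long tags[SIM_MAX_LINES];
    long misses = 0;
    for (unsigned long k = 0; k < num_lines; k++) tags[k] = -1;
    for (long i = 0; i < n; i++) {
        unsigned long addr[2];
        addr[0] = a_base + 8UL * (unsigned long)i;
        addr[1] = b_base + 8UL * (unsigned long)i;
        for (int k = 0; k < 2; k++) {
            unsigned long block = addr[k] / line_bytes;
            unsigned long idx = block % num_lines;
            if (tags[idx] != (long)block) {   /* line holds other data */
                tags[idx] = (long)block;
                misses++;
            }
        }
    }
    return misses;
}
```

With a 32 KB cache of 32-byte lines, placing b exactly 32768 bytes after a makes every one of the 2048 accesses miss, because a[i] and b[i] keep displacing each other. Shifting b by one extra cache line (32 bytes) drops the count to 512, one miss per line fetched.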

Cache Coherence
• Cache coherence
  • is maintained by an agreement between data stored in cache, other caches, and main memory.
  • It is the means by which all the memory subsystems maintain data coherence.
• When the same data is being manipulated by different processors, they must inform each other of their modification of data.
• The term Protocol is used to describe how caches and main memory communicate with each other.

Cache Coherence
• Snoop Protocol
  • All processors monitor the bus traffic to determine cache line status.
• Directory Based Protocol
  • Cache lines contain extra bits that indicate which other processor has a copy of that cache line, and the status of the cache line: clean (cache line does not need to be sent back to main memory) or dirty (cache line needs to update main memory with content of cache line).
• Hardware Cache Coherence
  • Cache coherence on the Origin computer is maintained in the hardware, transparent to the programmer.

Cache Coherence
• False sharing
  • happens in a multiprocessor system as a result of maintaining cache coherence.
  • Both processor A and processor B have the same cache line.
  • A modifies the first word of the cache line.
  • B wants to modify the eighth word of the cache line.
  • But A has sent a signal to B that B's cache line is invalid.
  • B must fetch the cache line again before writing to it.
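A common way to avoid false sharing is to pad per-processor data so that each item occupies its own cache line. The sketch below assumes a 128-byte line and an array that starts on a line boundary; the type names, the helper same_line, and the line size are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 128   /* assumed line size */

/* Counters packed back to back: neighbours share a cache line. */
typedef struct { long value; } packed_counter;

/* Counter padded out to a full line: each one gets its own line. */
typedef struct {
    long value;
    char pad[CACHE_LINE_BYTES - sizeof(long)];
} padded_counter;

/* 1 if elements i and j of an array of elem_bytes-sized items fall in
   the same cache line, assuming the array starts on a line boundary. */
int same_line(size_t i, size_t j, size_t elem_bytes, size_t line_bytes)
{
    assert(elem_bytes > 0 && line_bytes > 0);
    return (i * elem_bytes) / line_bytes == (j * elem_bytes) / line_bytes;
}
```

With packed counters, processors A and B updating elements 0 and 1 keep invalidating each other's copy of the line; with padded counters the two elements sit in different lines and the updates no longer interfere.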

Cache Coherence
• A cache miss creates a processor stall.
  • The processor is stalled until the data is retrieved from the memory.
  • The stall is minimized by continuing to load and execute instructions, until the data that is stalling is retrieved.
• These techniques are called:
  • Prefetching
  • Out of order execution
  • Software pipelining
• Typically, the compiler will do these at -O3 optimization.

Cache Coherence
• The following is an example of software pipelining:
• Suppose you compute
    Do I=1,N
      y(I)=y(I) + a*x(I)
    End Do
• In pseudo-assembly language, this is what the Origin compiler will do:
    cycle t+0     ld y(I+3)
    cycle t+1     ld x(I+3)
    cycle t+2     st y(I-4)
    cycle t+3     st y(I-3)
    cycle t+4     st y(I-2)
    cycle t+5     st y(I-1)
    cycle t+6     ld y(I+4)
    cycle t+7     ld x(I+4)
    cycle t+8     ld y(I+5)
    cycle t+9     ld x(I+5)
    cycle t+10    ld y(I+6)
    cycle t+11    ld x(I+6)
    madd I, madd I+1, madd I+2, madd I+3 (issued alongside the loads and stores above)

Cache Coherence
• Since the Origin processor can only execute 1 load or 1 store at a time, the compiler places loads in the instruction pipeline well before the data is needed.
• It is then able to continue loading while simultaneously performing a fused multiply-add (a+b*c).
• The code above gets 8 flops in 12 clock cycles. The peak is 24 flops in 12 clock cycles for the Origin.
• The Intel Pentium III (IA-32) and the Itanium (IA-64) will have differing versions of the code above but the same concepts apply.
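For C programmers, the loop being pipelined and the resulting fraction of peak can be written out as follows. This is a sketch only: the function names are hypothetical, and the C loop is simply a transcription of the Fortran loop, not compiler output.

```c
#include <assert.h>

/* y(I) = y(I) + a*x(I): one multiply-add (2 flops), two loads and one
   store per iteration, so the loop is limited by loads and stores. */
void daxpy(long n, double a, const double *x, double *y)
{
    assert(n >= 0 && x && y);
    for (long i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

/* Achieved flops as a fraction of peak flops over the same cycles. */
double fraction_of_peak(double flops, double peak_flops)
{
    assert(peak_flops > 0.0);
    return flops / peak_flops;
}
```

Four iterations need 8 loads and 4 stores, which is 12 cycles at one load or store per cycle, and perform 8 flops; against a peak of 24 flops in 12 cycles that is one third of peak.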

Agenda
7 Cache Tuning
  7.1 Cache Concepts
  7.2 Cache Specifics
    7.2.1 Cache on the SGI Origin2000
    7.2.2 Cache on the Intel Pentium III
    7.2.3 Cache on the Intel Itanium
    7.2.4 Cache Summary
  7.3 Code Optimization
  7.4 Measuring Cache Performance
  7.5 Locating the Cache Problem
  7.6 Cache Tuning Strategy
  7.7 Preserve Spatial Locality
  7.8 Locality Problem
  7.9 Grouping Data Together
  7.10 Cache Thrashing Example
  7.11 Not Enough Cache
  7.12 Loop Blocking
  7.13 Further Information

Cache on the SGI Origin2000
• L1 Cache (on-chip primary cache)
  • Cache size: 32KB floating point data
  • 32KB integer data and instruction
  • Cache line size: 32 bytes
  • Associativity: 2-way set associative
• L2 Cache (off-chip secondary cache)
  • Cache size: 4MB per processor
  • Cache line size: 128 bytes
  • Associativity: 2-way set associative
  • Replacement: LRU
  • Coherence: Directory based
  • 2-way interleaved (2 banks)

Cache on the SGI Origin2000
• Bandwidth L1 cache-to-processor
  • 1.6 GB/s/bank
  • 3.2 GB/sec overall possible
  • Latency: 1 cycle
• Bandwidth between L1 and L2 cache
  • 1 GB/s
  • Latency: 11 cycles
• Bandwidth between L2 cache and local memory
  • 0.5 GB/s
  • Latency: 61 cycles
• Average 32 processor remote memory
  • Latency: 150 cycles

Cache on the Intel Pentium III
• L1 Cache (on-chip primary cache)
  • Cache size: 16KB floating point data
  • 16KB integer data and instruction
  • Cache line size: 16 bytes
  • Associativity: 4-way set associative
• L2 Cache (off-chip secondary cache)
  • Cache size: 256 KB per processor
  • Cache line size: 32 bytes
  • Associativity: 8-way set associative
  • Replacement: pseudo-LRU
  • Coherence: interleaved (8 banks)

Cache on the Intel Pentium III
• Bandwidth L1 cache-to-processor
  • 16 GB/s
  • Latency: 2 cycles
• Bandwidth between L1 and L2 cache
  • 11.7 GB/s
  • Latency: 4-10 cycles
• Bandwidth between L2 cache and local memory
  • 1.0 GB/s
  • Latency: 15-21 cycles

Cache on the Intel Itanium
• L1 Cache (on-chip primary cache)
  • Cache size: 16KB floating point data
  • 16KB integer data and instruction
  • Cache line size: 32 bytes
  • Associativity: 4-way set associative
• L2 Cache (off-chip secondary cache)
  • Cache size: 96KB unified data and instruction
  • Cache line size: 64 bytes
  • Associativity: 6-way set associative
  • Replacement: LRU
• L3 Cache (off-chip tertiary cache)
  • Cache size: 4MB per processor
  • Cache line size: 64 bytes
  • Associativity: 4-way set associative
  • Replacement: LRU

Cache on the Intel Itanium
• Bandwidth L1 cache-to-processor
  • 25.6 GB/s
  • Latency: 1 - 2 cycles
• Bandwidth between L1 and L2 cache
  • 25.6 GB/sec
  • Latency: 6 - 9 cycles
• Bandwidth between L2 and L3 cache
  • 11.7 GB/sec
  • Latency: 21 - 24 cycles
• Bandwidth between L3 cache and main memory
  • 2.1 GB/sec
  • Latency: 50 cycles

Cache Summary

    Chip            MIPS R10000     Pentium III      Itanium
    #Caches         2               2                3
    Associativity   2/2             4/8              4/6/4
    Replacement     LRU             Pseudo-LRU       LRU
    CPU MHz         195/250         1000             800
    Peak Mflops     390/500         1000             3200
    LD,ST/cycle     1 LD or 1 ST    1 LD and 1 ST    2 LD or 2 ST

• Only one load or store may be performed each CPU cycle on the R10000.
• This indicates that loads and stores may be a bottleneck.
• Efficient use of cache is extremely important.

Agenda
7 Cache Tuning
  7.1 Cache Concepts
  7.2 Cache Specifics
  7.3 Code Optimization
  7.4 Measuring Cache Performance
    7.4.1 Measuring Cache Performance on the SGI Origin2000
    7.4.2 Measuring Cache Performance on the Linux Clusters
  7.5 Locating the Cache Problem
  7.6 Cache Tuning Strategy
  7.7 Preserve Spatial Locality
  7.8 Locality Problem
  7.9 Grouping Data Together
  7.10 Cache Thrashing Example
  7.11 Not Enough Cache
  7.12 Loop Blocking
  7.13 Further Information

Code Optimization
• Gather statistics to find out where the bottlenecks are in your code so you can identify what you need to optimize.
• The following questions can be useful to ask:
  • How much time does the program take to execute?
    • Use /usr/bin/time a.out for CPU time
  • Which subroutines use the most time?
    • Use ssrun and prof on the Origin or gprof and vprof on the Linux clusters.
  • Which loop uses the most time?
    • Put etime/dtime or other recommended timer calls around loops for CPU time.
    • For more information on timers see Timing and Profiling section.
  • What is contributing to the cpu time?
    • Use the Perfex utility on the Origin or perfex or hpmcount on the Linux clusters.

Code Optimization
• Some useful optimizing and profiling tools are
  • etime/dtime/time
  • perfex
  • ssusage
  • ssrun/prof
  • gprof
  • cvpav, cvd
• See the NCSA web pages on Compiler, Performance, and Productivity Tools http://www.ncsa.uiuc.edu/UserInfo/Resources/Software/Tools/ for information on which tools are available on NCSA platforms.

Measuring Cache Performance on the SGI Origin2000
• The R10000 processors of NCSA's Origin2000 computers have hardware performance counters.
• There are 32 events that are measured and each event is numbered.
    0 = cycles
    1 = Instructions issued
    ...
    26 = Secondary data cache misses
    ...
• The Perfex Utility
  • The hardware performance counters can be measured using the perfex utility.
    perfex [options] command [arguments]
  • View man perfex for more information.

Measuring Cache Performance on the SGI Origin2000
• where the options are:
    -e counter1 -e counter2
               This specifies which events are to be counted. You enter the number of the event you want counted. (Remember to have a space in between the "e" and the event number.)
    -a         sample ALL the events
    -mp        Report all results on a per thread basis.
    -y         Report the results in seconds, not cycles.
    -x         Gives extra summary info including Mflops
    command    Specify the name of the executable file.
    arguments  Specify the input and output arguments to the executable file.

Measuring Cache Performance on the SGI Origin2000
• Examples
  • perfex -e 25 -e 26 a.out
    - outputs the L1 and L2 cache misses
    - the output is reported in cycles
  • perfex -a -y a.out > results
    - outputs ALL the hardware performance counters
    - the output is reported in seconds

Measuring Cache Performance on the Linux Clusters
• The Intel Pentium III and Itanium processors provide hardware event counters that can be accessed from several tools.
• perfex for the Pentium III and pfmon for the Itanium
  • To view usage and options for perfex and pfmon:
      perfex -h
      pfmon --help
  • To measure L2 cache misses:
      perfex -e P6_L2_LINES_IN a.out
      pfmon --events=L2_MISSES a.out

Measuring Cache Performance on the Linux Clusters
• psrun [soft add +perfsuite]
  • Another tool that provides access to the hardware event counters and also provides derived statistics is perfsuite.
  • To add perfsuite's psrun to the current shell environment:
      soft add +perfsuite
  • To measure cache misses:
      psrun a.out
      psprocess a.out*.xml

Agenda
7 Cache Tuning
  7.1 Cache Concepts
  7.2 Cache Specifics
  7.3 Code Optimization
  7.4 Measuring Cache Performance
  7.5 Locating the Cache Problem
  7.6 Cache Tuning Strategy
  7.7 Preserve Spatial Locality
  7.8 Locality Problem
  7.9 Grouping Data Together
  7.10 Cache Thrashing Example
  7.11 Not Enough Cache
  7.12 Loop Blocking
  7.13 Further Information

Locating the Cache Problem
• For the Origin, the perfex output is a first-pass detection of a cache problem.
• If you then use the CaseVision tools, you can locate the cache problem in your code.
• The CaseVision tools are
  • cvpav for performance analysis
  • cvd for debugging
• CaseVision is not available on the Linux clusters.
• Tools like vprof and libhpm provide routines for users to instrument their code.
• Using vprof with the PAPI cache events can provide detailed information about where poor cache utilization is occurring.

Cache Tuning Strategy
• The strategy for performing cache tuning on your code is based on data reuse.
  • Temporal Reuse
    • Use the same data elements on more than one iteration of the loop.
  • Spatial Reuse
    • Use data that is encached as a result of fetching nearby data elements from downstream memory.
• Strategies that take advantage of the Principle of Locality will improve performance.

do J=1.K) * B(K. y To ensure stride-one access modify the code using loop interchange.J)=C(I.J) + A(I.K) * B(K.n C(I.J) end do « y It is not wrong but runs much slower than it could.J) + A(I.n C(I. y The following code does not preserve spatial locality: do I=1.n do K=1.n do J=1. The code has been modified for spatial reuse. .J)=C(I.J) end do « y For Fortran the innermost loop index should be the leftmost index of the arrays.n do K=1.Preserve Spatial Locality y Check loop nesting to ensure stride-one memory access.n do I=1.

Locality Problem
• Suppose your code looks like:
    DO J=1,N
      DO I=1,N
        A(I,J)=B(J,I)
      ENDDO
    ENDDO
• The loop as it is typed above does not have unit-stride access on loads.
• If you interchange the loops, the code doesn't have unit-stride access on stores.
• Use the optimized, intrinsic-function transpose from the FORTRAN compiler instead of hand-coding it.

Grouping Data Together
• Consider the following code segment:
    d=0.0
    do I=1,n
      j=index(I)
      d = d + sqrt(x(j)*x(j) + y(j)*y(j) + z(j)*z(j))
    end do
• Since the arrays are accessed with indirect accessing, it is likely that 3 new cache lines need to be brought into the cache for each iteration of the loop. Modify the code by grouping together x, y, and z into a 2-dimensional array named r.
    d=0.0
    do I=1,n
      j=index(I)
      d = d + sqrt(r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j))
    end do
• Since r(1,j), r(2,j), and r(3,j) are contiguous in memory, it is likely they will be in one cache line. Hence, 1 cache line, rather than 3, is brought in for each iteration of I. The code has been modified for cache reuse.
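A rough way to see the benefit is to count the cache lines touched per iteration. The Python sketch below is only a model under stated assumptions: 128-byte cache lines, 8-byte reals, and x, y, z stored back to back in memory. The function names are hypothetical.

```python
LINE_BYTES = 128   # assumed cache line size
ELEM_BYTES = 8     # assumed 8-byte reals


def lines_separate(j, n):
    """Cache lines touched by x(j), y(j), z(j) held in three separate arrays
    of length n laid out one after another (an assumption of this model)."""
    bases = (0, n * ELEM_BYTES, 2 * n * ELEM_BYTES)
    return len({(base + (j - 1) * ELEM_BYTES) // LINE_BYTES for base in bases})


def lines_grouped(j):
    """Cache lines touched by r(1,j), r(2,j), r(3,j) in a column-major r(3,n)."""
    first = 3 * (j - 1) * ELEM_BYTES
    return len({(first + k * ELEM_BYTES) // LINE_BYTES for k in range(3)})


n = 100_000
sample = range(1, n + 1, 997)   # scattered indices standing in for index(I)
separate = [lines_separate(j, n) for j in sample]
grouped = [lines_grouped(j) for j in sample]
print("separate arrays, lines per iteration:", min(separate), "to", max(separate))
print("grouped array,   lines per iteration:", min(grouped), "to", max(grouped))
```

In this model the grouped layout occasionally straddles two lines, which matches the slide's wording that the three values are "likely" to share one cache line.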

Cache Thrashing Example
• This example thrashes a 4MB direct mapped cache.
    parameter (max = 1024*1024)
    common /xyz/ a(max), b(max)
    do I=1,max
      something = a(I) + b(I)
    enddo
• The cache lines for both a and b have the same cache address.
• To avoid cache thrashing in this example, pad common with the size of a cache line.
    parameter (max = 1024*1024)
    common /xyz/ a(max), extra(32), b(max)
    do I=1,max
      something = a(I) + b(I)
    enddo
• Improving cache utilization is often the key to getting good performance.
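The conflict can be checked by computing which slot of a direct-mapped cache each element falls into. This Python sketch assumes 4-byte reals and a 128-byte cache line (so that extra(32) is exactly one line); both values, and the helper names, are assumptions made for illustration.

```python
CACHE_BYTES = 4 * 1024 * 1024   # 4 MB direct-mapped cache, as in the example
LINE_BYTES = 128                # assumed line size (32 four-byte words)
WORD_BYTES = 4                  # assumed default 4-byte REAL
MAX = 1024 * 1024
NUM_LINES = CACHE_BYTES // LINE_BYTES


def cache_slot(byte_address):
    """Slot of a direct-mapped cache that holds this address."""
    return (byte_address // LINE_BYTES) % NUM_LINES


def count_conflicts(pad_words, step=4096):
    """Count sampled indices i where a(i) and b(i) map to the same slot,
    with pad_words words of padding placed between the two arrays."""
    conflicts = 0
    for i in range(1, MAX + 1, step):
        a_addr = (i - 1) * WORD_BYTES
        b_addr = (MAX + pad_words + i - 1) * WORD_BYTES
        if cache_slot(a_addr) == cache_slot(b_addr):
            conflicts += 1
    return conflicts


conflicts_unpadded = count_conflicts(0)
conflicts_padded = count_conflicts(32)
print("sampled conflicts without padding:", conflicts_unpadded)
print("sampled conflicts with extra(32): ", conflicts_padded)
```

Without padding, a and b are exactly one cache size apart, so every a(I) evicts the line holding b(I) and vice versa. One line of padding shifts b by a single slot.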

Not Enough Cache
• Ideally you want the inner loop's arrays and variables to fit into cache.
• If a scalar program won't fit in cache, its parallel version may fit in cache with a large enough number of processors.
• This often results in super-linear speedup.

Loop Blocking
• This technique is useful when the arrays are too large to fit into the cache.
• Loop blocking uses strip mining of loops and loop interchange.
• A blocked loop accesses array elements in sections that optimally fit in the cache.
• It allows for spatial and temporal reuse of data, thus minimizing cache misses.
• The following example (next slide) illustrates loop blocking of matrix multiplication.
• The code in the PRE column depicts the original code, the POST column depicts the code when it is blocked.

Loop Blocking

PRE
    do k=1,n
      do j=1,n
        do i=1,n
          c(i,j)=c(i,j)+a(i,k)*b(k,j)
        enddo
      enddo
    enddo

POST
    ! assumes n is a multiple of iblk
    do kk=1,n,iblk
      do jj=1,n,iblk
        do ii=1,n,iblk
          do j=jj,jj+iblk-1
            do k=kk,kk+iblk-1
              do i=ii,ii+iblk-1
                c(i,j)=c(i,j)+a(i,k)*b(k,j)
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo
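The PRE and POST loop nests can be mirrored in Python to check that blocking changes only the order of the work, not the result. This is a sketch for checking correctness, not performance: Python lists do not show the cache effect, and the min() bounds are an added assumption so that n need not be a multiple of the block size.

```python
def matmul_naive(a, b, n):
    """PRE form: plain k, j, i loop nest."""
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            for i in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, iblk):
    """POST form: strip-mined kk, jj, ii loops around j, k, i block loops."""
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, iblk):
        for jj in range(0, n, iblk):
            for ii in range(0, n, iblk):
                for j in range(jj, min(jj + iblk, n)):
                    for k in range(kk, min(kk + iblk, n)):
                        for i in range(ii, min(ii + iblk, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


n = 6
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
same = matmul_naive(a, b, n) == matmul_blocked(a, b, n, 4)
print("blocked result matches naive result:", same)
```

In a compiled language the block size iblk would be chosen so that the blocks of a, b, and c being worked on fit in cache together.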

Further Information
• Computer Organization and Design: The Hardware/Software Interface, David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc.
• Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers, Inc.
• The Cache Memory Book, Jim Handy, Academic Press
• High Performance Computing, Charles Severance, O'Reilly and Associates, Inc.
• A Practitioner's Guide to RISC Microprocessor Architecture, Patrick H. Stakem, John Wiley & Sons, Inc.
• Tutorial on Optimization of Fortran, John Levesque, Applied Parallel Research
• Intel® Architecture Optimization Reference Manual
• Intel® Itanium® Processor Manuals

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
  8.1 Speedup
  8.2 Speedup Extremes
  8.3 Efficiency
  8.4 Amdahl's Law
  8.5 Speedup Limitations
  8.6 Benchmarks
  8.7 Summary
9 About the IBM Regatta P690

Parallel Performance Analysis
• Now that you have parallelized your code, and have run it on a parallel computer using multiple processors, you may want to know the performance gain that parallelization has achieved.
• This chapter describes how to compute parallel code performance.
• Often the performance gain is not perfect, and this chapter also explains some of the reasons for limitations on parallel performance.
• Finally, this chapter covers the kinds of information you should provide in a benchmark, and some sample benchmarks are given.

Speedup
• The speedup of your code tells you how much performance gain is achieved by running your program in parallel on multiple processors.
  • A simple definition is that it is the length of time it takes a program to run on a single processor, divided by the time it takes to run on multiple processors.
  • Speedup generally ranges between 0 and p, where p is the number of processors.
• Scalability
  • When you compute with multiple processors in a parallel environment, you will also want to know how your code scales.
  • The scalability of a parallel code is defined as its ability to achieve performance proportional to the number of processors used.
  • As you run your code with more and more processors, you want to see the performance of the code continue to improve.
  • Computing speedup is a good way to measure how a program scales as more processors are used.

Speedup
• Linear Speedup
  • If it takes one processor an amount of time t to do a task and if p processors can do the task in time t / p, then you have perfect or linear speedup (Sp = p).
  • That is, running with 4 processors improves the time by a factor of 4, running with 8 processors improves the time by a factor of 8, and so on.
  • This is shown in the following illustration.
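As a worked example of the definition, speedup can be computed directly from measured run times. The timings in this Python sketch are made up for illustration and are not measurements from any NCSA system.

```python
def speedup(t_one, t_p):
    """Speedup Sp: single-processor run time divided by p-processor run time."""
    return t_one / t_p


# Hypothetical wall-clock times in seconds, keyed by number of processors.
timings = {1: 120.0, 2: 62.0, 4: 33.0, 8: 20.0}
speedups = {p: speedup(timings[1], t) for p, t in timings.items()}

for p in sorted(speedups):
    print(f"p={p}  Sp={speedups[p]:.2f}  (linear speedup would be {p})")
```

Plotting Sp against p for such a table gives the speedup curve, and the gap to the line Sp = p shows how far the code is from linear speedup.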

Speedup Extremes
• The extremes of speedup happen when speedup is
  • greater than p, called super-linear speedup,
  • less than 1.
• Super-Linear Speedup
  • You might wonder how super-linear speedup can occur. How can speedup be greater than the number of processors used?
  • The answer usually lies with the program's memory use. When using multiple processors, each processor only gets part of the problem compared to the single processor case. It is possible that the smaller problem can make better use of the memory hierarchy, that is, the cache and the registers. For example, the smaller problem may fit in cache when the entire problem would not.
  • When super-linear speedup is achieved, it is often an indication that the sequential code, run on one processor, had serious cache miss problems.
  • The most common programs that achieve super-linear speedup are those that solve dense linear algebra problems.

Speedup Extremes
• Parallel Code Slower than Sequential Code
  • When speedup is less than one, it means that the parallel code runs slower than the sequential code.
  • This happens when there isn't enough computation to be done by each processor.
  • The overhead of creating and controlling the parallel threads outweighs the benefits of parallel computation, and it causes the code to run slower.
  • To eliminate this problem you can try to increase the problem size or run with fewer processors.

Efficiency
• Efficiency is a measure of parallel performance that is closely related to speedup and is often also presented in a description of the performance of a parallel program.
• Efficiency with p processors is defined as the ratio of speedup with p processors to p.
• Efficiency is a fraction that usually ranges between 0 and 1.
• Ep = 1 corresponds to perfect speedup of Sp = p.
• You can think of efficiency as describing the average speedup per processor.
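Efficiency follows from the same run times used to compute speedup. The Python sketch below uses hypothetical timings; only the relation Ep = Sp / p comes from the text.

```python
def efficiency(t_one, t_p, p):
    """Efficiency Ep = Sp / p, where Sp = t_one / t_p."""
    return (t_one / t_p) / p


# Hypothetical wall-clock times in seconds, keyed by number of processors.
timings = {1: 120.0, 2: 62.0, 4: 33.0, 8: 20.0}

for p in sorted(timings):
    ep = efficiency(timings[1], timings[p], p)
    print(f"p={p}  Ep={ep:.2f}")
```

An efficiency that drops steadily as p grows is a sign that overhead or sequential work is starting to dominate.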

Amdahl's Law
• An alternative formula for speedup is named Amdahl's Law, attributed to Gene Amdahl, one of America's great computer scientists.
• This formula, introduced in 1967, states that no matter how many processors are used in a parallel run, a program's speedup will be limited by its fraction of sequential code.
• That is, almost every program has a fraction of the code that doesn't lend itself to parallelism.
• This is the fraction of code that will have to be run with just one processor, even in a parallel run.
• Amdahl's Law defines speedup with p processors as follows:
    Sp = 1 / (f + (1 - f) / p)
• Where the term f stands for the fraction of operations done sequentially with just one processor, and the term (1 - f) stands for the fraction of operations done in perfect parallelism with p processors.

Amdahl's Law
• The sequential fraction of code, f, is a unitless measure ranging between 0 and 1.
• When f is 0, meaning there is no sequential code, then speedup is p, or perfect parallelism. This can be seen by substituting f = 0 in the formula above, which results in Sp = p.
• When f is 1, meaning there is no parallel code, then speedup is 1, or there is no benefit from parallelism. This can be seen by substituting f = 1 in the formula above, which results in Sp = 1.
• This shows that Amdahl's speedup ranges between 1 and p, where p is the number of processors used in a parallel processing run.

Amdahl's Law
• The interpretation of Amdahl's Law is that speedup is limited by the fact that not all parts of a code can be run in parallel.
• Substituting in the formula, when the number of processors goes to infinity, your code's speedup is still limited by 1 / f.
• Amdahl's Law shows that the sequential fraction of code has a strong effect on speedup.
• This helps to explain the need for large problem sizes when using parallel computers.
• It is well known in the parallel computing community that you cannot take a small application and expect it to show good performance on a parallel computer.
• To get good performance, you need to run large applications, with large data array sizes, and lots of computation.
• The reason for this is that as the problem size increases the opportunity for parallelism grows, and the sequential fraction shrinks, and it shrinks in its importance for speedup.
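The limiting cases can be checked numerically. This Python sketch evaluates Amdahl's Law in its usual form, Sp = 1 / (f + (1 - f) / p), with f the sequential fraction and p the number of processors. The sample value f = 0.05 is an arbitrary choice for illustration.

```python
def amdahl_speedup(f, p):
    """Amdahl's Law: speedup with p processors and sequential fraction f."""
    return 1.0 / (f + (1.0 - f) / p)


f = 0.05  # illustrative sequential fraction, not a measured value
for p in (1, 2, 4, 8, 16, 64, 1024):
    print(f"p={p:5d}  Sp={amdahl_speedup(f, p):6.2f}")
print(f"limit as p grows: 1/f = {1.0 / f:.2f}")
```

Even with only 5% sequential code, the speedup in this model can never exceed 20, however many processors are added.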

Agenda
8 Parallel Performance Analysis
  8.1 Speedup
  8.2 Speedup Extremes
  8.3 Efficiency
  8.4 Amdahl's Law
  8.5 Speedup Limitations
    8.5.1 Memory Contention Limitation
    8.5.2 Problem Size Limitation
  8.6 Benchmarks
  8.7 Summary

Speedup Limitations
• This section covers some of the reasons why a program doesn't get perfect speedup. Some of the reasons for limitations on speedup are:
• Too much I/O
  • Speedup is limited when the code is I/O bound.
  • That is, when there is too much input or output compared to the amount of computation.
• Too much memory contention
  • Speedup is limited when there is too much memory contention.
  • You need to redesign the code with attention to data locality.
  • Cache reutilization techniques will help here.
• Wrong algorithm
  • Speedup is limited when the numerical algorithm is not suitable for a parallel computer.
  • You need to replace it with a parallel algorithm.

Speedup Limitations
• Wrong problem size
  • Speedup is limited when the problem size is too small to take best advantage of a parallel computer.
  • In addition, speedup is limited when the problem size is fixed.
  • That is, when the problem size doesn't grow as you compute with more processors.
• Too much sequential code
  • Speedup is limited when there's too much sequential code.
  • This is shown by Amdahl's Law.
• Too much parallel overhead
  • Speedup is limited when there is too much parallel overhead compared to the amount of computation.
  • These are the additional CPU cycles accumulated in creating parallel regions, creating threads, synchronizing threads, spin/blocking threads, and ending parallel regions.
• Load imbalance
  • Speedup is limited when the processors have different workloads.
  • The processors that finish early will be idle while they are waiting for the other processors to catch up.

Memory Contention Limitation
• Gene Golub, a professor of Computer Science at Stanford University, writes in his book on parallel computing that the best way to define memory contention is with the word delay.
  • When different processors all want to read or write into the main memory, there is a delay until the memory is free.
• On the SGI Origin2000 computer, you can determine whether your code has memory contention problems by using SGI's perfex utility.
  • The perfex utility is covered in the Cache Tuning lecture in this course.
  • You can also refer to SGI's manual page, man perfex, for more details.
• On the Linux clusters, you can use the hardware performance counter tools to get information on memory performance.
  • On the IA32 platform, use perfex, vprof, hpmcount, psrun/perfsuite.
  • On the IA64 platform, use vprof, pfmon, psrun/perfsuite.

Memory Contention Limitation
• Many of these tools can be used with the PAPI performance counter interface.
• Be sure to refer to the man pages and webpages on the NCSA website for more information.
• If the output of the utility shows that memory contention is a problem, you will want to use some programming techniques for reducing memory contention.
• A good way to reduce memory contention is to access elements from the processor's cache memory instead of the main memory.
• Some programming techniques for doing this are:
  • Access arrays with unit stride.
  • Order nested do loops (in Fortran) so that the innermost loop index is the leftmost index of the arrays in the loop. For the C language, the order is the opposite of Fortran.
  • Avoid specific array sizes that are the same as the size of the data cache or that are exact fractions or exact multiples of the size of the data cache.
  • Pad common blocks.
• These techniques are called cache tuning optimizations. The details for performing these code modifications are covered in the section on Cache Optimization of this lecture.

Problem Size Limitation
• Small Problem Size
  • Speedup is almost always an increasing function of problem size.
  • If there's not enough work to be done by the available processors, the code will show limited speedup.
  • The effect of small problem size on speedup is shown in the following illustration.

Problem Size Limitation
• Fixed Problem Size
  • When the problem size is fixed, you can reach a point of negative returns when using additional processors.
  • As you compute with more and more processors, each processor has less and less computation to perform.
  • The additional parallel overhead, compared to the amount of computation, causes the speedup curve to start turning downward as shown in the following figure.
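The downward turn of the speedup curve can be reproduced with a toy cost model. The Python sketch below assumes that the work divides perfectly among p processors and that parallel overhead grows linearly with p. Both the model and its constants are assumptions chosen for illustration, not measurements.

```python
def model_time(p, work=100.0, overhead=0.25):
    """Toy run time: perfectly divided work plus overhead that grows with p."""
    return work / p + overhead * (p - 1)


def model_speedup(p):
    """Speedup of the toy model relative to its one-processor run time."""
    return model_time(1) / model_time(p)


# Processor count that gives the highest speedup in this model.
best_p = max(range(1, 65), key=model_speedup)

for p in (1, 4, 16, best_p, 32, 64):
    print(f"p={p:2d}  speedup={model_speedup(p):5.2f}")
print("speedup peaks at p =", best_p)
```

Past the peak, each extra processor adds more overhead than it removes computation, which is the point of negative returns described above.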

Benchmarks
• It will finally be time to report the parallel performance of your application code.
• You will want to show a speedup graph with the number of processors on the x axis, and speedup on the y axis.
• Some other things you should report and record are:
  • the date you obtained the results
  • the problem size
  • the computer model
  • the compiler and the version number of the compiler
  • any special compiler options you used

Benchmarks
• When doing computational science, it is often helpful to find out what kind of performance your colleagues are obtaining.
  • In this regard, NCSA has a compilation of parallel performance benchmarks online at http://www.ncsa.uiuc.edu/UserInfo/Perf/NCSAbench/.
  • You might be interested in looking at these benchmarks to see how other people report their parallel performance.
  • In particular, the NAMD benchmark is a report about the performance of the NAMD program that does molecular dynamics simulations.

Summary
• There are many good texts on parallel computing which treat the subject of parallel performance analysis. Here are two useful references:
  • Scientific Computing: An Introduction with Parallel Computing, Gene Golub and James Ortega, Academic Press, Inc.
  • Parallel Computing: Theory and Practice, Michael J. Quinn, McGraw-Hill, Inc.

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
  9.1 IBM p690 General Overview
  9.2 IBM p690 Building Blocks
  9.3 Features Performed by the Hardware
  9.4 The Operating System
  9.5 Further Information

About the IBM Regatta P690
• To obtain your program's top performance, it is important to understand the architecture of the computer system on which the code runs.
• This chapter describes the architecture of NCSA's IBM p690.
• Technical details on the size and design of the processors, memory, cache, and the interconnect network are covered along with technical specifications for the compute rate, memory size and speed, and interconnect bandwidth.

IBM p690 General Overview
• The p690 is IBM's latest Symmetric Multi-Processor (SMP) machine with Distributed Shared Memory (DSM).
  • This means that memory is physically distributed and logically shared.
  • It is based on the Power4 architecture and is a successor to the Power3-II based RS/6000 SP system.
• IBM p690 Scalability
  • The IBM p690 is a flexible, modular, and scalable architecture.
  • It scales in these terms:
    • Number of processors
    • Memory size
    • I/O and memory bandwidth and the Interconnect bandwidth

Agenda
9 About the IBM Regatta P690
  9.1 IBM p690 General Overview
  9.2 IBM p690 Building Blocks
    9.2.1 Power4 Core
    9.2.2 Multi-Chip Modules
    9.2.3 The Processor
    9.2.4 Cache Architecture
    9.2.5 Memory Subsystem
  9.3 Features Performed by the Hardware
  9.4 The Operating System
  9.5 Further Information

IBM p690 Building Blocks
• An IBM p690 system is built from a number of fundamental building blocks.
• The first of these building blocks is the Power4 Core, which includes the processors and L1 and L2 caches.
• At NCSA, four of these Power4 Cores are linked to form a Multi-Chip Module.
• This module includes the L3 cache and four Multi-Chip Modules are linked to form a 32 processor system (see figure on the next slide).
• Each of these components will be described in the following sections.

32-processor IBM p690 configuration (Image courtesy of IBM) .

Power4 Core
• The Power4 Chip contains:
  • Two processors
  • Local caches (L1)
  • External cache for each processor (L2)
  • I/O and Interconnect interfaces

The POWER4 chip (Image courtesy of IBM)

Multi-Chip Modules
• Four Power4 Chips are assembled to form a Multi-Chip Module (MCM) that contains 8 processors.
• Each MCM also supports the L3 cache for each Power4 chip.
• Multiple MCM interconnection (Image courtesy of IBM)

The Processor
• The processors at the heart of the Power4 Core are speculative superscalar out of order execution chips.
• The Power4 is a 4-way superscalar RISC architecture running instructions on its 8 pipelined execution units.
• Speed of the Processor
  • The NCSA IBM p690 has CPUs running at 1.3 GHz.
• 64-Bit Processor Execution Units
  • There are 8 independent fully pipelined execution units.
    • 2 load/store units for memory access
    • 2 identical floating point execution units capable of fused multiply/add
    • 2 fixed point execution units
    • 1 branch execution unit
    • 1 logic operation unit

The Processor
• The units are capable of 4 floating point operations, fetching 8 instructions and completing 5 instructions per cycle.
• It is capable of handling up to 200 in-flight instructions.
• Performance Numbers
  • Peak Performance:
    • 4 floating point instructions per cycle
    • 1.3 Gcycles/sec * 4 flop/cycle yields 5.2 GFLOPS
  • MIPS Rating:
    • 5 instructions per cycle
    • 1.3 Gcycles/sec * 5 instructions/cycle yields 6500 MIPS
• Instruction Set
  • The instruction set (ISA) on the IBM p690 is the PowerPC AS Instruction set.
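The peak figures are simple products of the clock rate and the per-cycle counts quoted above. This short Python check multiplies them out. The variable names are illustrative.

```python
CLOCK_HZ = 1.3e9        # 1.3 GHz clock, as stated for the NCSA p690
FLOPS_PER_CYCLE = 4     # floating point operations per cycle
INSTR_PER_CYCLE = 5     # instructions completed per cycle

peak_gflops = CLOCK_HZ * FLOPS_PER_CYCLE / 1e9   # billions of flops per second
mips = CLOCK_HZ * INSTR_PER_CYCLE / 1e6          # millions of instructions per second

print(f"peak floating point rate: {peak_gflops:.1f} GFLOPS per processor")
print(f"instruction rate:         {mips:.0f} MIPS per processor")
```

Completing 5 instructions per cycle at 1.3 GHz is 6.5 billion instructions per second, that is, 6500 MIPS.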

Cache Architecture
• Each Power4 Core has both a primary (L1) cache associated with each processor and a secondary (L2) cache shared between the two processors. In addition, each Multi-Chip Module has an L3 cache.
• Level 1 Cache
  • The Level 1 cache is in the processor core. It has split instruction and data caches.
  • L1 Instruction Cache
    • The properties of the Instruction Cache are:
      • 64KB in size
      • direct mapped
      • cache line size is 128 bytes
  • L1 Data Cache
    • The properties of the L1 Data Cache are:
      • 32KB in size
      • 2-way set associative
      • FIFO replacement policy
      • 2-way interleaved
      • cache line size is 128 bytes
  • Peak speed is achieved when the data accessed in a loop is entirely contained in the L1 data cache.

it looks in the L2 cache. The properties of the L2 Cache are: y external from the processor y unified instruction and data cache y 1.Cache Architecture y Level 2 Cache on the Power4 Chip y When the processor can't find a data element in the L1 cache.41MB per Power4 chip (2 processors) y 8-way set associative y split between 3 controllers y cache line size is 128 bytes y pseudo LRU replacement policy for cache coherence y 124.8 GB/s peak bandwidth from L2 .

Cache Architecture
• Level 3 Cache on the Multi-Chip Module
  • When the processor can't find a data element in the L2 cache, it looks in the L3 cache. The properties of the L3 Cache are:
    • external from the Power4 Core
    • unified instruction and data cache
    • 128MB per Multi-Chip Module (8 processors)
    • 8-way set associative
    • cache line size is 512 bytes
    • 55.5 GB/s peak bandwidth from L2

Memory Subsystem
• The total memory is physically distributed among the Multi-Chip Modules of the p690 system (see the diagram in the next slide).
• Memory Latencies
  • The latency penalties for each of the levels of the memory hierarchy are:
    • L1 Cache - 4 cycles
    • L2 Cache - 14 cycles
    • L3 Cache - 102 cycles
    • Main Memory - 400 cycles

Memory distribution within an MCM .


Features Performed by the Hardware
• The following is done completely by the hardware, transparent to the user:
  • Global memory addressing (makes the system memory shared)
  • Address resolution
  • Maintaining cache coherency
  • Automatic page migration from remote to local memory (to reduce interconnect memory transactions)

The Operating System
• The operating system is AIX. NCSA's p690 system is currently running version 5.1 of AIX. Version 5.1 is a full 64-bit file system.
• Compatibility
  • AIX 5.1 is highly compatible with both BSD and System V Unix.

Further Information
• Computer Architecture: A Quantitative Approach
  • John Hennessy, et al. Morgan Kaufmann Publishers, 2nd Edition, 1996
• Computer Organization and Design: The Hardware/Software Interface
  • David A. Patterson, et al. Morgan Kaufmann Publishers, 2nd Edition, 1997
• IBM P Series [595] at the URL:
  • http://www-03.ibm.com/systems/p/hardware/highend/590/index.html
• IBM p690 Documentation at NCSA at the URL:
  • http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IBMp690/
