Parallel Computing Explained

Slides Prepared from the CI-Tutor Courses at NCSA http://ci-tutor.ncsa.uiuc.edu/ By S. Masoud Sadjadi School of Computing and Information Sciences Florida International University March 2009

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690

Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
    1.1.1 Parallelism in our Daily Lives
    1.1.2 Parallelism in Computer Programs
    1.1.3 Parallelism in Computers
      1.1.3.4 Disk Parallelism
    1.1.4 Performance Measures
    1.1.5 More Parallelism Issues

1.2 Comparison of Parallel Computers
1.3 Summary

Parallel Computing Overview
- Who should read this chapter?
  - New Users: to learn concepts and terminology.
  - Intermediate Users: for review or reference.
  - Management Staff: to understand the basic concepts, even if you don't plan to do any programming.
  - Note: Advanced users may opt to skip this chapter.

Introduction to Parallel Computing
- High performance parallel computers
  - can solve large problems much faster than a desktop computer
  - have fast CPUs, large memory, high speed interconnects, and high speed input/output
  - are able to speed up computations
    - by making the sequential components run faster
    - by doing more operations in parallel
- High performance parallel computers are in demand
  - need for tremendous computational capabilities in science, engineering, and business
  - require gigabytes/terabytes of memory and gigaflops/teraflops of performance
  - scientists are striving for petascale performance

Introduction to Parallel Computing
- HPPC are used in a wide variety of disciplines:
  - Meteorologists: prediction of tornadoes and thunderstorms
  - Computational biologists: analyze DNA sequences
  - Pharmaceutical companies: design of new drugs
  - Oil companies: seismic exploration
  - Wall Street: analysis of financial markets
  - NASA: aerospace vehicle design
  - Entertainment industry: special effects in movies and commercials
- These complex scientific and business applications all need to perform computations on large datasets or large equations.

Parallelism in our Daily Lives
- There are two types of processes that occur in computers and in our daily lives:
  - Sequential processes
    - occur in a strict order
    - it is not possible to do the next step until the current one is completed
    - Examples
      - The passage of time: the sun rises and the sun sets.
      - Writing a term paper: pick the topic, then write the paper.
  - Parallel processes
    - many events happen simultaneously
    - Examples
      - Plant growth in the springtime
      - An orchestra

Agenda
1 Parallel Computing Overview
1.1 Introduction to Parallel Computing
    1.1.1 Parallelism in our Daily Lives
    1.1.2 Parallelism in Computer Programs
      1.1.2.1 Data Parallelism
      1.1.2.2 Task Parallelism
    1.1.3 Parallelism in Computers
      1.1.3.4 Disk Parallelism
    1.1.4 Performance Measures
    1.1.5 More Parallelism Issues

1.2 Comparison of Parallel Computers
1.3 Summary

Parallelism in Computer Programs
- Conventional wisdom:
  - Computer programs are sequential in nature.
  - Only a small subset of them lend themselves to parallelism.
  - Algorithm: the "sequence of steps" necessary to do a computation.
  - For the first 30 years of computer use, programs were run sequentially.
- The 1980's saw great successes with parallel computers.
  - Dr. Geoffrey Fox published a book entitled Parallel Computing Works!
  - Many scientific accomplishments resulted from parallel computing.
- The new conventional wisdom:
  - Computer programs are parallel in nature.
  - Only a small subset of them need to be run sequentially.

Parallel Computing
- What a computer does when it carries out more than one computation at a time using more than one processor.
- By using many processors at once, we can speed up the execution.
  - If one processor can perform the arithmetic in time t,
  - then ideally p processors can perform the arithmetic in time t/p.
  - What if I use 100 processors? What if I use 1000 processors?
- Almost every program has some form of parallelism.
  - You need to determine whether your data or your program can be partitioned into independent pieces that can be run simultaneously.
  - Decomposition is the name given to this partitioning process.
- Types of parallelism:
  - data parallelism
  - task parallelism

Data Parallelism
- The same code segment runs concurrently on each processor, but each processor is assigned its own part of the data to work on.
- Do loops (in Fortran) define the parallelism.
- The iterations must be independent of each other.
- Data parallelism is called "fine grain parallelism" because the computational work is spread into many small subtasks.
- Example
  - Dense linear algebra, such as matrix multiplication, is a perfect candidate for data parallelism.

An example of data parallelism

Original Sequential Code:
  DO K=1,N
    DO J=1,N
      DO I=1,N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
      END DO
    END DO
  END DO

Parallel Code:
  !$OMP PARALLEL DO
  DO K=1,N
    DO J=1,N
      DO I=1,N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
      END DO
    END DO
  END DO
  !$OMP END PARALLEL DO

Quick Intro to OpenMP
- OpenMP is a portable standard for parallel directives covering both data and task parallelism.
- More information about OpenMP is available on the OpenMP website.
- We will have a lecture on Introduction to OpenMP later.
- With OpenMP, the loop that is performed in parallel is the loop that immediately follows the Parallel Do directive.
- In our sample code, it's the K loop:
  DO K=1,N

OpenMP Loop Parallelism

Iteration-Processor Assignments
  Processor   Iterations of K   Data Elements
  proc0       K=1:5             A(I, 1:5)    B(1:5, J)
  proc1       K=6:10            A(I, 6:10)   B(6:10, J)
  proc2       K=11:15           A(I, 11:15)  B(11:15, J)
  proc3       K=16:20           A(I, 16:20)  B(16:20, J)

The code segment running on each processor:
  DO J=1,N
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
    END DO
  END DO

OpenMP Style of Parallelism
- Parallelization can be done incrementally, as follows:
  1. Parallelize the most computationally intensive loop.
  2. Compute the performance of the code.
  3. If performance is not satisfactory, parallelize another loop.
  4. Repeat steps 2 and 3 as many times as needed.
- The ability to perform incremental parallelism is considered a positive feature of data parallelism.
- It is contrasted with the MPI (Message Passing Interface) style of parallelism, which is an "all or nothing" approach.

Task Parallelism
- Task parallelism may be thought of as the opposite of data parallelism.
- Instead of the same operations being performed on different parts of the data, each process performs different operations.
- You can use task parallelism when your program can be split into independent pieces, often subroutines, that can be assigned to different processors and run concurrently.
- Task parallelism is called "coarse grain" parallelism because the computational work is spread into just a few subtasks.
- More code is run in parallel because the parallelism is implemented at a higher level than in data parallelism.
- Task parallelism is often easier to implement and has less overhead than data parallelism.

Task Parallelism
- The abstract code shown in the diagram is decomposed into 4 independent code segments that are labeled A, B, C, and D.
- The right hand side of the diagram illustrates the 4 code segments running concurrently.

Task Parallelism

Original Code:
  program main
    code segment labeled A
    code segment labeled B
    code segment labeled C
    code segment labeled D
  end

Parallel Code:
  program main
  !$OMP PARALLEL
  !$OMP SECTIONS
    code segment labeled A
  !$OMP SECTION
    code segment labeled B
  !$OMP SECTION
    code segment labeled C
  !$OMP SECTION
    code segment labeled D
  !$OMP END SECTIONS
  !$OMP END PARALLEL
  end

OpenMP Task Parallelism
- With OpenMP, the code that follows each SECTION(S) directive is allocated to a different processor. In our sample parallel code, the allocation of code segments to processors is as follows:

  Processor   Code
  proc0       code segment labeled A
  proc1       code segment labeled B
  proc2       code segment labeled C
  proc3       code segment labeled D

Parallelism in Computers
- How parallelism is exploited and enhanced within the operating system and hardware components of a parallel computer:
  - operating system
  - arithmetic
  - memory
  - disk

Operating System Parallelism
- All of the commonly used parallel computers run a version of the Unix operating system. In the table below each OS listed is in fact Unix, but the name of the Unix OS varies with each vendor.

  Parallel Computer      OS
  SGI Origin2000         IRIX
  HP V-Class             HP-UX
  Cray T3E               Unicos
  IBM SP                 AIX
  Workstation Clusters   Linux

- For more information about Unix, a collection of Unix documents is available.

Two Unix Parallelism Features
- background processing facility
  - With the Unix background processing facility you can run the executable a.out in the background and simultaneously view the man page for the etime function in the foreground. There are two Unix commands that accomplish this:
    a.out > results &
    man etime
- cron feature
  - With the Unix cron feature you can submit a job that will run at a later time.

Arithmetic Parallelism
- Multiple execution units
  - facilitate arithmetic parallelism.
  - The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are each done in a separate execution unit. This allows several execution units to be used simultaneously, because the execution units operate independently.
- Superscalar arithmetic
  - is the ability to issue several arithmetic operations per computer cycle.
  - It makes use of the multiple, independent execution units. On superscalar computers there are multiple slots per cycle that can be filled with work. This gives rise to the name n-way superscalar, where n is the number of slots per cycle. The SGI Origin2000 is called a 4-way superscalar computer.
- Fused multiply and add
  - is another parallel arithmetic feature.
  - Parallel computers are able to overlap multiply and add. This arithmetic is named Multiply-Add (MADD) on SGI computers, and Fused Multiply Add (FMA) on HP computers. In either case, the two arithmetic operations are overlapped and can complete in hardware in one computer cycle.

Memory Parallelism
- memory interleaving
  - Memory is divided into multiple banks, and consecutive data elements are interleaved among them. For example, if your computer has 2 memory banks, then data elements with even memory addresses would fall into one bank, and data elements with odd memory addresses into the other.
- multiple memory ports
  - Port means a bi-directional memory pathway. When the data elements that are interleaved across the memory banks are needed, the multiple memory ports allow them to be accessed and fetched in parallel, which increases the memory bandwidth (MB/s or GB/s).
- multiple levels of the memory hierarchy
  - There is global memory that any processor can access. There is memory that is local to a partition of the processors. Finally there is memory that is local to a single processor, that is, the cache memory and the memory elements held in registers.
- Cache memory
  - Cache is a small memory that has fast access compared with the larger main memory and serves to keep the faster processor filled with data.

Memory Parallelism
(diagram: memory hierarchy and cache memory)

Disk Parallelism
- RAID (Redundant Array of Inexpensive Disks)
  - RAID disks are on most parallel computers.
  - The advantage of a RAID disk system is that it provides a measure of fault tolerance.
  - If one of the disks goes down, it can be swapped out, and the RAID disk system remains operational.
- Disk Striping
  - When a data set is written to disk, it is striped across the RAID disk system. That is, it is broken into pieces that are written simultaneously to the different disks in the RAID disk system. When the same data set is read back in, the pieces are read in parallel, and the full data set is reassembled in memory.

Agenda
1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
    1.1.1 Parallelism in our Daily Lives
    1.1.2 Parallelism in Computer Programs
    1.1.3 Parallelism in Computers
      1.1.3.4 Disk Parallelism
    1.1.4 Performance Measures
    1.1.5 More Parallelism Issues
  1.2 Comparison of Parallel Computers
  1.3 Summary

Performance Measures
- Peak Performance
  - is the top speed at which the computer can operate.
  - It is a theoretical upper limit on the computer's performance.
- Sustained Performance
  - is the highest consistently achieved speed.
  - It is a more realistic measure of computer performance.
- Cost Performance
  - is used to determine if the computer is cost effective.
- MHz
  - is a measure of the processor speed.
  - The processor speed is commonly measured in millions of cycles per second, where a computer cycle is defined as the shortest time in which some work can be done.
- MIPS
  - is a measure of how quickly the computer can issue instructions.
  - Millions of instructions per second is abbreviated as MIPS, where the instructions are computer instructions such as: memory reads and writes, integer operations, floating point operations, logical operations, and branch instructions.

Performance Measures
- Mflops (Millions of floating point operations per second)
  - measures how quickly a computer can perform floating-point operations such as add, subtract, multiply, and divide.
- Speedup
  - measures the benefit of parallelism.
  - It shows how your program scales as you compute with more processors, compared to the performance on one processor.
  - Ideal speedup happens when the performance gain is linearly proportional to the number of processors used.
- Benchmarks
  - are used to rate the performance of parallel computers and parallel programs.
  - A well known benchmark that is used to compare parallel computers is the Linpack benchmark.
  - Based on the Linpack results, a list is produced of the Top 500 Supercomputer Sites. This list is maintained by the University of Tennessee and the University of Mannheim.
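In symbols (standard definitions, not taken from the original slides): if T(1) is the runtime on one processor and T(p) the runtime on p processors, then

\[
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p},
\]

where S(p) is the speedup and E(p) the parallel efficiency; ideal (linear) speedup corresponds to S(p) = p, i.e. E(p) = 1.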

More Parallelism Issues
- Load balancing
  - is the technique of evenly dividing the workload among the processors.
  - For data parallelism it involves how iterations of loops are allocated to processors.
  - Load balancing is important because the total time for the program to complete is the time spent by the longest executing thread.
- The problem size
  - must be large and must be able to grow as you compute with more processors.
  - In order to get the performance you expect from a parallel computer you need to run a large application with large data sizes; otherwise the overhead of passing information between processors will dominate the calculation time.
- Good software tools
  - are essential for users of high performance parallel computers.
  - These tools include: parallel compilers, parallel debuggers, performance analysis tools, and parallel math software.
  - The availability of a broad set of application software is also important.

More Parallelism Issues
- The high performance computing market is risky and chaotic. Many supercomputer vendors are no longer in business, making the portability of your application very important.
- A workstation farm
  - is defined as a fast network connecting heterogeneous workstations.
  - The individual workstations serve as desktop systems for their owners.
  - When they are idle, large problems can take advantage of the unused cycles in the whole system.
  - An application of this concept is the SETI project. You can participate in searching for extraterrestrial intelligence with your home PC. More information about this project is available at the SETI Institute.
- Condor
  - is software that provides resource management services for applications that run on heterogeneous collections of workstations.
  - Miron Livny at the University of Wisconsin at Madison is the director of the Condor project, and has coined the phrase high throughput computing to describe this process of harnessing idle workstation cycles. More information is available at the Condor Home Page.

Agenda
1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
  1.2 Comparison of Parallel Computers
    1.2.1 Processors
    1.2.2 Memory Organization
    1.2.3 Flow of Control
    1.2.4 Interconnection Networks
      1.2.4.1 Bus Network
      1.2.4.2 Cross-Bar Switch Network
      1.2.4.3 Hypercube Network
      1.2.4.4 Tree Network
      1.2.4.5 Interconnection Networks Self-test
    1.2.5 Summary of Parallel Computer Characteristics
  1.3 Summary

Comparison of Parallel Computers
- Now you can explore the hardware components of parallel computers:
  - kinds of processors
  - types of memory organization
  - flow of control
  - interconnection networks
- You will see what is common to these parallel computers, and what makes each one of them unique.

Kinds of Processors
- There are three types of parallel computers:
  1. Computers with a small number of powerful processors
     - Typically have tens of processors.
     - The cooling of these computers often requires very sophisticated and expensive equipment, making these computers very expensive for computing centers.
     - They are general-purpose computers that perform especially well on applications that have large vector lengths.
     - Examples of this type of computer are the Cray SV1 and the Fujitsu VPP5000.

Kinds of Processors
- There are three types of parallel computers:
  2. Computers with a large number of less powerful processors
     - Named a Massively Parallel Processor (MPP), they typically have thousands of processors.
     - The processors are usually proprietary and air-cooled.
     - Because of the large number of processors, the distance between the furthest processors can be quite large, requiring a sophisticated internal network that allows distant processors to communicate with each other quickly.
     - These computers are suitable for applications with a high degree of concurrency.
     - The MPP type of computer was popular in the 1980s.
     - Examples of this type of computer were the Thinking Machines CM-2 computer and the computers made by the MassPar company.

Kinds of Processors
- There are three types of parallel computers:
  3. Computers that are medium scale, in between the two extremes
     - Typically have hundreds of processors.
     - The processor chips are usually not proprietary; rather they are commodity processors like the Pentium III.
     - These are general-purpose computers that perform well on a wide range of applications.
     - The most common example of this class is the Linux Cluster.

Trends and Examples
- Processor trends:

  Decade   Processor Type                    Computer Example
  1970s    Pipelined, Proprietary            Cray-1
  1980s    Massively Parallel, Proprietary   Thinking Machines CM2
  1990s    Superscalar, RISC, Commodity      SGI Origin2000
  2000s    CISC, Commodity                   Workstation Clusters

- The processors on today's commonly used parallel computers:

  Computer               Processor
  SGI Origin2000         MIPS RISC R12000
  HP V-Class             HP PA 8200
  Cray T3E               Compaq Alpha
  IBM SP                 IBM Power3
  Workstation Clusters   Intel Pentium III, Intel Itanium

Memory Organization
- The following paragraphs describe the three types of memory organization found on parallel computers:
  - distributed memory
  - shared memory
  - distributed shared memory

Distributed Memory
- In distributed memory computers, the total memory is partitioned into memory that is private to each processor.
- There is a Non-Uniform Memory Access time (NUMA), which is proportional to the distance between the two communicating processors.
- On NUMA computers, data is accessed the quickest from a private memory, while data from the most distant processor takes the longest to access.
- Some examples are the Cray T3E, the IBM SP, and workstation clusters.

Distributed Memory
- When programming distributed memory computers, the code and the data should be structured such that the bulk of a processor's data accesses are to its own private (local) memory.
- This is called having good data locality.
- Today's distributed memory computers use message passing such as MPI to communicate between processors as shown in the following example:
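The slide's original example is a diagram that is not reproduced here. As a stand-in, this is a minimal sketch of two processes exchanging one value with MPI; the variable name, value, and message tag are invented for illustration:

      program mpi_example
      use mpi
      implicit none
      integer :: rank, ierr, status(MPI_STATUS_SIZE)
      real :: x

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      ! Process 0 owns x in its private memory and sends a copy to
      ! process 1; neither process can read the other's memory directly.
      if (rank == 0) then
         x = 3.14
         call MPI_SEND(x, 1, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
      else if (rank == 1) then
         call MPI_RECV(x, 1, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
         print *, 'process 1 received', x
      end if

      call MPI_FINALIZE(ierr)
      end program mpi_example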

Distributed Memory
- One advantage of distributed memory computers is that they are easy to scale. As the demand for resources grows, computer centers can easily add more memory and processors.
- This is often called the LEGO block approach.
- The drawback is that programming of distributed memory computers can be quite complicated.

Shared Memory
- In shared memory computers, all processors have access to a single pool of centralized memory with a uniform address space.
- Any processor can address any memory location at the same speed, so there is Uniform Memory Access time (UMA).
- Processors communicate with each other through the shared memory.
- The advantages and disadvantages of shared memory machines are roughly the opposite of distributed memory computers.
  - They are easier to program because they resemble the programming of single processor machines.
  - But they don't scale like their distributed memory counterparts.

Distributed Shared Memory
- In Distributed Shared Memory (DSM) computers, a cluster or partition of processors has access to a common shared memory.
- It accesses the memory of a different processor cluster in a NUMA fashion.
- Memory is physically distributed but logically shared.
- Attention to data locality again is important.
- Distributed shared memory computers combine the best features of both distributed memory computers and shared memory computers.
- That is, DSM computers have both the scalability of distributed memory computers and the ease of programming of shared memory computers.
- Some examples of DSM computers are the SGI Origin2000 and the HP V-Class computers.

Trends and Examples
- Memory organization trends:

  Decade   Memory Organization         Example
  1970s    Shared Memory               Cray-1
  1980s    Distributed Memory          Thinking Machines CM-2
  1990s    Distributed Shared Memory   SGI Origin2000
  2000s    Distributed Memory          Workstation Clusters

- The memory organization of today's commonly used parallel computers:

  Computer               Memory Organization
  SGI Origin2000         DSM
  HP V-Class             DSM
  Cray T3E               Distributed
  IBM SP                 Distributed
  Workstation Clusters   Distributed

Flow of Control
- When you look at the flow of control you will see three types of parallel computers:
  - Single Instruction Multiple Data (SIMD)
  - Multiple Instruction Multiple Data (MIMD)
  - Single Program Multiple Data (SPMD)

Flynn's Taxonomy
- Flynn's Taxonomy, devised in 1972 by Michael Flynn of Stanford University, describes computers by how streams of instructions interact with streams of data.
- There can be single or multiple instruction streams, and there can be single or multiple data streams. This gives rise to 4 types of computers as shown in the diagram below.
- Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD, and MIMD.
- Of these 4, only SIMD and MIMD are applicable to parallel computers.
- Another computer type, SPMD, is a special case of MIMD.

SIMD Computers
- SIMD stands for Single Instruction Multiple Data.
- Each processor follows the same set of instructions, with different data elements being allocated to each processor.
- SIMD computers have distributed memory with typically thousands of simple processors, and the processors run in lock step.
- The processors are commanded by the global controller that sends instructions to the processors.
  - It says add, and they all add.
  - It says shift to the right, and they all shift to the right.
- The processors are like obedient soldiers, marching in unison.
- SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications, such as neural networks.
- Some examples of SIMD computers were the Thinking Machines CM-2 computer and the computers from the MassPar company.

MIMD Computers
- MIMD stands for Multiple Instruction Multiple Data.
- There are multiple instruction streams with separate code segments distributed among the processors.
- MIMD is actually a superset of SIMD.
- In addition, there are multiple data streams; different data elements are allocated to each processor.
- While the processors on SIMD computers run in lock step, the processors on MIMD computers run independently of each other, so that the processors can run the same instruction stream or different instruction streams.
- MIMD computers can be used for either data parallel or task parallel applications.
- MIMD computers can have either distributed memory or shared memory.
- Some examples of MIMD computers are the SGI Origin2000 computer and the HP V-Class computer.

SPMD Computers
- SPMD stands for Single Program Multiple Data.
- SPMD is a special case of MIMD.
- SPMD execution happens when a MIMD computer is programmed to have the same set of instructions per processor.
- With SPMD computers, while the processors are running the same code segment, each processor can run that code segment asynchronously.
- Unlike SIMD, the synchronous execution of instructions is relaxed.
- An example is the execution of an if statement on a SPMD computer.
  - Because each processor computes with its own partition of the data elements, it may evaluate the right hand side of the if statement differently from another processor.
  - One processor may take a certain branch of the if statement, and another processor may take a different branch of the same if statement.
  - Hence, even though each processor has the same set of instructions, those instructions may be evaluated in a different order from one processor to the next.
- The analogies we used for describing SIMD computers can be modified for MIMD computers.
  - Instead of the SIMD obedient soldiers, all marching in unison, in the MIMD world the processors march to the beat of their own drummer.
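As an illustration of the if-statement discussion above, here is a minimal sketch (not from the original slides) of SPMD-style branching with OpenMP; the thread-id test and printed messages are invented for the example:

      program spmd_branch
      use omp_lib
      implicit none
      integer :: id
      real :: x
!$OMP PARALLEL PRIVATE(id, x)
      ! Every thread runs the same code segment, but each one evaluates
      ! the if test on its own data, so different threads may take
      ! different branches, asynchronously.
      id = omp_get_thread_num()
      x = real(id)
      if (x > 1.0) then
         print *, 'thread', id, 'took the first branch'
      else
         print *, 'thread', id, 'took the second branch'
      end if
!$OMP END PARALLEL
      end program spmd_branch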

Summary of SIMD versus MIMD

                       SIMD                      MIMD
  Memory               distributed memory        distributed memory or shared memory
  Code Segment         same per processor        same or different
  Processors Run In    lock step                 asynchronously
  Data Elements        different per processor   different per processor
  Applications         data parallel             data parallel or task parallel

Trends and Examples
- Flow of control trends:

  Decade   Flow of Control   Computer Example
  1980s    SIMD              Thinking Machines CM-2
  1990s    MIMD              SGI Origin2000
  2000s    MIMD              Workstation Clusters

- The flow of control on today's commonly used parallel computers:

  Computer               Flow of Control
  SGI Origin2000         MIMD
  HP V-Class             MIMD
  Cray T3E               MIMD
  IBM SP                 MIMD
  Workstation Clusters   MIMD

Agenda
1 Parallel Computing Overview
  1.1 Introduction to Parallel Computing
  1.2 Comparison of Parallel Computers
    1.2.1 Processors
    1.2.2 Memory Organization
    1.2.3 Flow of Control
    1.2.4 Interconnection Networks
      1.2.4.1 Bus Network
      1.2.4.2 Cross-Bar Switch Network
      1.2.4.3 Hypercube Network
      1.2.4.4 Tree Network
      1.2.4.5 Interconnection Networks Self-test
    1.2.5 Summary of Parallel Computer Characteristics
  1.3 Summary

Interconnection Networks
- What exactly is the interconnection network?
  - The interconnection network is made up of the wires and cables that define how the multiple processors of a parallel computer are connected to each other and to the memory units.
  - The time required to transfer data is dependent upon the specific type of the interconnection network.
  - This transfer time is called the communication time.
- What network characteristics are important?
  - Diameter: the maximum distance that data must travel for 2 processors to communicate.
  - Bandwidth: the amount of data that can be sent through a network connection.
  - Latency: the delay on a network while a data packet is being stored and forwarded.
- Types of Interconnection Networks
  - The network topologies (geometric arrangements of the computer network connections) are:
    - Bus
    - Cross-bar Switch
    - Hypercube
    - Tree

Interconnection Networks
- The aspects of network issues are:
  - Cost
  - Scalability
  - Reliability
  - Suitable Applications
  - Data Rate
  - Diameter
  - Degree
- General Network Characteristics
  - Some networks can be compared in terms of their degree and diameter.
  - Degree: how many communicating wires are coming out of each processor.
    - A large degree is a benefit because it provides multiple paths.
  - Diameter: the distance between the two processors that are farthest apart.
    - A small diameter corresponds to low latency.

Bus Network
- Bus topology is the original coaxial cable-based Local Area Network (LAN) topology in which the medium forms a single bus to which all stations are attached.
- The positive aspects
  - It is a mature technology that is well known and reliable.
  - It is simple to construct.
  - The cost is also very low.
- The negative aspects
  - Limited data transmission rate.
  - Not scalable in terms of performance.
  - Example: the SGI Power Challenge only scaled to 18 processors.

Cross-Bar Switch Network
- A cross-bar switch is a network that works through a switching mechanism to access shared memory.
- There are multiple paths for a processor to communicate with a certain memory.
- The switches determine the optimal route to take.
- Here is a diagram of a cross-bar switch network which shows the processors talking through the switchboxes to store or retrieve data in memory.
- The telephone system uses this type of network. An example of a computer with this type of network is the HP V-Class.
- It scales better than the bus network, but it costs significantly more.

Hypercube Network
- In a hypercube network, the processors are connected as if they were corners of a multidimensional cube. Each node in an N dimensional cube is directly connected to N other nodes.
- The degree of a hypercube network is log n and the diameter is log n, where n is the number of processors.
- The fact that the number of directly connected, "nearest neighbor", nodes increases with the total size of the network is also highly desirable for a parallel computer.
- Examples of computers with this type of network are the CM-2, NCUBE-2, and the Intel iPSC860.

Tree Network
- The processors are the bottom nodes of the tree. For a processor to retrieve data, it must go up in the network and then go back down.
- Tree networks are very suitable for database applications because they allow multiple searches through the database at a time.
- This is useful for decision making applications that can be mapped as trees.
- The degree of a tree network is 1. The diameter of the network is 2 log (n+1) - 2, where n is the number of processors.
- The Thinking Machines CM-5 is an example of a parallel computer with this type of network.

Interconnection Networks
- Torus Network: A mesh with wrap-around connections in both the x and y directions.
- Mesh Network: A network where each interior processor is connected to its four nearest neighbors.
- Fully Connected Network: A network where every processor is connected to every other processor.
- Hypercube Network: Processors are connected as if they were corners of a multidimensional cube.
- Multistage Network: A network with more than one networking unit.

Interconnection Networks
- Bus Based Network: Coaxial cable based LAN topology in which the medium forms a single bus to which all stations are attached.
- Cross-bar Switch Network: A network that works through a switching mechanism to access shared memory.
- Tree Network: The processors are the bottom nodes of the tree.
- Ring Network: Each processor is connected to two others and the line of connections forms a circle.

Summary of Parallel Computer Characteristics
- How many processors does the computer have?
  - 10s?
  - 100s?
  - 1000s?
- How powerful are the processors?
  - What's the MHz rate?
  - What's the MIPS rate?
- What's the instruction set architecture?
  - RISC
  - CISC

Summary of Parallel Computer Characteristics
- How much memory is available?
  - total memory
  - memory per processor
- What kind of memory?
  - distributed memory
  - shared memory
  - distributed shared memory
- What type of flow of control?
  - SIMD
  - MIMD
  - SPMD

Summary of Parallel Computer Characteristics
- What is the interconnection network?
  - Bus
  - Crossbar
  - Hypercube
  - Tree
  - Torus
  - Multistage
  - Fully Connected
  - Mesh
  - Ring
  - Hybrid

Design decisions made by some of the major parallel computer vendors

  Computer               OS       Processors          Memory        Flow of Control   Network               Programming Style
  SGI Origin2000         IRIX     MIPS RISC R10000    DSM           MIMD              Crossbar, Hypercube   OpenMP, MPI
  HP V-Class             HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring        OpenMP, MPI
  Cray T3E               Unicos   Compaq Alpha        Distributed   MIMD              Torus                 SHMEM, MPI
  IBM SP                 AIX      IBM Power3          Distributed   MIMD              IBM Switch            MPI
  Workstation Clusters   Linux    Intel Pentium III   Distributed   MIMD              Myrinet Tree          MPI

Summary
- This completes our introduction to parallel computing.
- You have learned about parallelism in computer programs, and also about parallelism in the hardware components of parallel computers.
- In addition, you have learned about the commonly used parallel computers, and how these computers compare to each other.
- There are many good texts which provide an introductory treatment of parallel computing. Here are two useful references:
  - Highly Parallel Computing, Second Edition. George S. Almasi and Allan Gottlieb. Benjamin/Cummings Publishers, Inc., 1994.
  - Parallel Computing Theory and Practice. Michael J. Quinn. McGraw-Hill, 1994.

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
  2.1 Automatic Compiler Parallelism
  2.2 Data Parallelism by Hand
  2.3 Mixing Automatic and Hand Parallelism
  2.4 Task Parallelism
  2.5 Parallelism Issues
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690

How to Parallelize a Code
- This chapter describes how to turn a single processor program into a parallel one, focusing on shared memory machines.
- Both automatic compiler parallelization and parallelization by hand are covered.
- The details for accomplishing both data parallelism and task parallelism are presented.

Automatic Compiler Parallelism
- Automatic compiler parallelism enables you to use a single compiler option and let the compiler do the work.
- The advantage of it is that it's easy to use.
- The disadvantages are:
  - The compiler only does loop level parallelism, not task parallelism.
  - The compiler wants to parallelize every do loop in your code. If you have hundreds of do loops this creates way too much parallel overhead.

Automatic Compiler Parallelism
- To use automatic compiler parallelism on a Linux system with the Intel compilers, specify the following:

  ifort -parallel -O2 ... prog.f

- The compiler creates conditional code that will run with any number of threads.
- Specify the number of threads and make sure you still get the right answers with setenv:

  setenv OMP_NUM_THREADS 4
  a.out > results

Data Parallelism by Hand
- First identify the loops that use most of the CPU time (the Profiling lecture describes how to do this).
- By hand, insert into the code OpenMP directive(s) just before the loop(s) you want to make parallel.
- Some code modifications may be needed to remove data dependencies and other inhibitors of parallelism.
- Use your knowledge of the code and data to assist the compiler.
- For the SGI Origin2000 computer, insert into the code an OpenMP directive just before the loop that you want to make parallel:

  !$OMP PARALLEL DO
  do i=1,n
     ... lots of computation ...
  end do
  !$OMP END PARALLEL DO

Data Parallelism by Hand
- Compile with the mp compiler option:

  f90 -mp ... prog.f

- As before, the compiler generates conditional code that will run with any number of threads.
- If you want to rerun your program with a different number of threads, you do not need to recompile; just re-specify the setenv command:

  setenv OMP_NUM_THREADS 8
  a.out > results2

- The setenv command can be placed anywhere before the a.out command.
- The setenv command must be typed exactly as indicated. If you have a typo, you will not receive a warning or error message. To make sure that the setenv command is specified correctly, type:

  setenv

- It produces a listing of your environment variable settings.

Mixing Automatic and Hand Parallelism
- You can have one source file parallelized automatically by the compiler, and another source file parallelized by hand. Suppose you split your code into two files named prog1.f and prog2.f.

  f90 -c -apo ... prog1.f    (automatic parallelization for prog1.f)
  f90 -c -mp ... prog2.f     (by hand parallelization for prog2.f)
  f90 prog1.o prog2.o        (creates one executable)
  a.out > results            (runs the executable)

Task Parallelism
- You can accomplish task parallelism as follows:

  !$OMP PARALLEL
  !$OMP SECTIONS
     ... lots of computation in part A ...
  !$OMP SECTION
     ... lots of computation in part B ...
  !$OMP SECTION
     ... lots of computation in part C ...
  !$OMP END SECTIONS
  !$OMP END PARALLEL

- Compile with the mp compiler option:

  f90 -mp ... prog.f

- Use the setenv command to specify the number of threads:

  setenv OMP_NUM_THREADS 3
  a.out > results

Parallelism Issues
- There are some issues to consider when parallelizing a program.
  - Should data parallelism or task parallelism be used?
  - Should automatic compiler parallelism or parallelism by hand be used?
  - Which loop in a nested loop situation should be the one that becomes parallel?
  - How many threads should be used?

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
  3.1 Recompile
  3.2 Word Length
  3.3 Compiler Options for Debugging
  3.4 Standards Violations
  3.5 IEEE Arithmetic Differences
  3.6 Math Library Differences
  3.7 Compute Order Related Differences
  3.8 Optimization Level Too High
  3.9 Diagnostic Listings
  3.10 Further Information

Porting Issues
- In order to run a computer program that presently runs on a workstation, a mainframe, a vector computer, or another parallel computer, on a new parallel computer you must first "port" the code.
- After porting the code, it is important to have some benchmark results you can use for comparison.
  - To do this, run the original program on a well-defined dataset, and save the results from the old or "baseline" computer.
  - Then run the ported code on the new computer and compare the results.
- If the results are different, don't automatically assume that the new results are wrong; they may actually be better. There are several reasons why this might be true, including:
  - Precision Differences: the new results may actually be more accurate than the baseline results.
  - Code Flaws: porting your code to a new computer may have uncovered a hidden flaw in the code that was already there.
- Detection methods for finding code flaws, solutions, and workarounds are provided in this lecture.

Recompile
- Some codes just need to be recompiled to get accurate results.
- The compilers available on the NCSA computer platforms are shown in the following table:

  Language                    SGI Origin2000   IA-32 Linux                       IA-64 Linux
                              MIPSpro          Portland Group   Intel   GNU      Intel   GNU
  Fortran 77                  f77              pgf77            ifort   g77      ifort   g77
  Fortran 90                  f90              pgf90            ifort            ifort
  Fortran 95                  f95                               ifort            ifort
  High Performance Fortran                     pghpf                             pghpf
  C                           cc               pgcc             icc     gcc      icc     gcc
  C++                         CC               pgCC             icpc    g++      icpc    g++

Word Length
- Code flaws can occur when you are porting your code to a different word length computer.
- For C, the size of an integer variable differs depending on the machine and how the variable is generated. On the SGI Origin2000, the size of an integer variable is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag. On the IA32 and IA64 Linux clusters, the size of an integer variable is 4 and 8 bytes, respectively.
- For Fortran, the SGI MIPSpro and Intel compilers contain the following flags to set default variable size:
  - -in, where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
  - -rn, where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.

Compiler Options for Debugging
- On the SGI Origin2000, the MIPSpro compilers include debugging options via the -DEBUG:group. The syntax is as follows:

  -DEBUG:option1[=value1]:option2[=value2]...

- Two examples are:
  - Array-bound checking: check for subscripts out of range at runtime.
    -DEBUG:subscript_check=ON
  - Force all un-initialized stack, automatic, and dynamically allocated variables to be initialized.
    -DEBUG:trap_uninitialized=ON

Compiler Options for Debugging
- On the IA32 Linux cluster, the Fortran compiler is equipped with the following -C flags for runtime diagnostics:
  - -CA: pointers and allocatable references
  - -CB: array and subscript bounds
  - -CS: consistent shape of intrinsic procedure
  - -CU: use of uninitialized variables
  - -CV: correspondence between dummy and actual arguments

Standards Violations
- Code flaws can occur when the program has non-ANSI standard Fortran coding.
- ANSI standard Fortran is a set of rules for compiler writers that specify, for example, the value of the do loop index upon exit from the do loop.
- Standards Violations Detection
  - To detect standards violations on the SGI Origin2000 computer, use the -ansi flag.
  - This option generates a listing of warning messages for the use of non-ANSI standard coding.
  - On the Linux clusters, the -ansi[-] flag enables/disables assumption of ANSI conformance.

IEEE Arithmetic Differences
- Code flaws occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not.
- The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior.
- For example, it prohibits the compiler writer from replacing x/y with x*recip(y), since the two results may differ slightly for some operands.
- To make your program conform to the IEEE Arithmetic Standard on the SGI Origin2000 computer, use:

  f90 -OPT:IEEE_arithmetic=n ... prog.f

  where n is 1, 2, or 3.
- This option specifies the level of conformance to the IEEE standard, where 1 is the most stringent and 3 is the most liberal.
- On the Linux clusters, the Intel compilers can achieve conformance to the IEEE standard at a stringent level with the -mp flag, or a slightly relaxed level with the -mp1 flag. You can make your program strictly conform to the IEEE standard.

Math Library Differences
- Most high-performance parallel computers are equipped with vendor-supplied math libraries.
- On the SGI Origin2000 platform, there are the SGI/Cray Scientific Library (SCSL) and Complib.sgimath.
  - SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.
  - SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version.
  - The Complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version.
- The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.

Math Library Differences
- On the IA32 Linux cluster, the libraries to link to are:
  - For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
  - For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
  - When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins
- On the IA64 Linux cluster, the corresponding libraries are:
  - For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
  - For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
  - When calling MKL routines from C/C++ programs, you also need to link with -lF90

Compute Order Related Differences
- Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer.
- For example, in a data parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program.
- Note: If your algorithm depends on data being compared in a specific order, your code is inappropriate for a parallel computer.
- Use the following method to detect compute order related differences:
  - If your loop looks like
    DO I = 1, N
    change it to
    DO I = N, 1, -1
  - The results should not change if the iterations are independent.
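A minimal sketch of the loop-reversal check (illustrative only; the array, loop bounds, and the summation are invented). Because a running sum carries a dependence between iterations, reversing the loop changes the floating-point result slightly, which is exactly the kind of order-related difference this test exposes; a loop with truly independent iterations would give identical answers both ways:

      program order_check
      implicit none
      integer, parameter :: n = 1000
      integer :: i
      real :: a(n), s_forward, s_reverse

      do i = 1, n
         a(i) = 1.0 / real(i)
      end do

      ! Sum in the original order.
      s_forward = 0.0
      do i = 1, n
         s_forward = s_forward + a(i)
      end do

      ! Sum with the loop reversed.
      s_reverse = 0.0
      do i = n, 1, -1
         s_reverse = s_reverse + a(i)
      end do

      print *, 'forward =', s_forward, ' reverse =', s_reverse
      end program order_check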

Optimization Level Too High
- Code flaws can occur when the optimization level has been set too high, thus trading speed for accuracy.
- The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at a higher optimization level.
- Setting the Optimization Level
  - Both the SGI Origin2000 computer and the IBM Linux clusters provide Level 0 (no optimization) to Level 3 (most aggressive) optimization, using the -O{0,1,2, or 3} flag.
  - For example, on the Origin2000

    f90 -O0 ... prog.f

    turns off all optimizations.
  - One should bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking correctness and precision of calculation is highly recommended when -O3 is used.

Optimization Level Too High
- Isolating Optimization Level Problems
  - You can sometimes isolate optimization level problems using the method of binary chop.
  - To do this, divide your program prog.f into halves. Name them prog1.f and prog2.f.
  - Compile the first half with -O0 and the second half with -O3:

    f90 -c -O0 prog1.f
    f90 -c -O3 prog2.f
    f90 prog1.o prog2.o
    a.out > results

  - If the results are correct, the optimization problem lies in prog1.f.
  - Next divide prog1.f into halves. Name them prog1a.f and prog1b.f.
  - Compile prog1a.f with -O0 and prog1b.f with -O3:

    f90 -c -O0 prog1a.f
    f90 -c -O3 prog1b.f
    f90 prog1a.o prog1b.o prog2.o
    a.out > results

  - Continue in this manner until you have isolated the section of code that is producing incorrect results.

Diagnostic Listings
- The SGI Origin2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:

  f90 -listing ...
  f90 -fullwarn ...
  f90 -showdefaults ...
  f90 -version ...
  f90 -help ...

Further Information
- SGI
  - man f77/f90/cc
  - man debug_group
  - man math
  - man complib.sgimath
  - MIPSpro 64-Bit Porting and Transition Guide
  - Online Manuals
- Linux clusters pages
  - ifort/icc/icpc -help (IA32, IA64, Intel64)
  - Intel Fortran Compiler for Linux
  - Intel C/C++ Compiler for Linux

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
  4.1 Aggressive Compiler Options
  4.2 Compiler Optimizations
  4.3 Vendor Tuned Code
  4.4 Further Information

Scalar Tuning
- If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime.
- This chapter describes many of these techniques:
  - The use of the most aggressive compiler options
  - The improvement of loop unrolling
  - The use of subroutine inlining
  - The use of vendor supplied tuned code
- The detection of cache problems, and their solution, are presented in the Cache Tuning chapter.

Aggressive Compiler Options
- For the SGI Origin2000 and the Linux clusters, the main optimization switch is -On, where n ranges from 0 to 3.
  - -O0 turns off all optimizations.
  - -O1 and -O2 do beneficial optimizations that will not affect the accuracy of results.
  - -O3 specifies the most aggressive optimizations. It takes the most compile time, may produce changes in accuracy, and turns on software pipelining.

Aggressive Compiler Options
- It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes.
- It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization.
- On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n=1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to the IEEE standard at different levels.
- On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
  4.1 Aggressive Compiler Options
  4.2 Compiler Optimizations
    4.2.1 Statement Level
    4.2.2 Block Level
    4.2.3 Routine Level
    4.2.4 Software Pipelining
    4.2.5 Loop Unrolling
    4.2.6 Subroutine Inlining
    4.2.7 Optimization Report
    4.2.8 Profile-guided Optimization (PGO)
  4.3 Vendor Tuned Code
  4.4 Further Information

Compiler Optimizations
- The various compiler optimizations can be classified as follows:
  - Statement Level Optimizations
  - Block Level Optimizations
  - Routine Level Optimizations
  - Software Pipelining
  - Loop Unrolling
  - Subroutine Inlining
- Each of these is described in the following sections.

Statement Level
- Constant Folding
  - Replace simple arithmetic operations on constants with the pre-computed result.
  - y = 5 + 7 becomes y = 12
- Short Circuiting
  - Avoid executing parts of conditional tests that are not necessary.
  - For if (I.eq.J .or. I.eq.K) expression, when I=J, immediately compute the expression.
- Register Assignment
  - Put frequently used variables in registers.

Block Level
- Dead Code Elimination
  - Remove unreachable code and code that is never executed or used.
- Instruction Scheduling
  - Reorder the instructions to improve memory pipelining.

Routine Level
- Strength Reduction
  - Replace expressions in a loop with an expression that takes fewer cycles.
- Common Subexpression Elimination
  - Expressions that appear more than once are computed once, and the result is substituted for each occurrence of the expression.
- Constant Propagation
  - Compile time replacement of variables with constants.
- Loop Invariant Elimination
  - Expressions inside a loop that don't change with the do loop index are moved outside the loop.
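A minimal sketch of loop invariant elimination (array names and values are invented); the same rewrite is what the compiler performs automatically at this level:

      program invariant_demo
      implicit none
      integer, parameter :: n = 5
      integer :: i
      real :: a(n), b(n), c, s, t

      s = 2.0
      t = 3.0
      b = 1.0

      ! Before the optimization: s*t is recomputed on every iteration,
      ! even though it never changes inside the loop.
      do i = 1, n
         a(i) = b(i) * (s * t)
      end do

      ! After loop invariant elimination: the invariant product is
      ! hoisted out of the loop and computed once.
      c = s * t
      do i = 1, n
         a(i) = b(i) * c
      end do

      print *, a
      end program invariant_demo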

Software Pipelining
- Software pipelining allows the mixing of operations from different loop iterations in each iteration of the hardware loop. It is used to get the maximum work done per clock cycle.
- Note: On the R10000s there is out-of-order execution of instructions, and software pipelining may actually get in the way of this feature.

Loop Unrolling
- The loop's stride (or step) value is increased, and the body of the loop is replicated. It is used to improve the scheduling of the loop by giving a longer sequence of straight line code. An example of loop unrolling follows:

  Original Loop:
  do I = 1, 99
     c(I) = a(I) + b(I)
  enddo

  Unrolled Loop:
  do I = 1, 99, 3
     c(I)   = a(I)   + b(I)
     c(I+1) = a(I+1) + b(I+1)
     c(I+2) = a(I+2) + b(I+2)
  enddo

- There is a limit to the amount of unrolling that can take place because there are a limited number of registers.
- On the SGI Origin2000, loops are unrolled to a level of 8 by default. You can unroll to a level of 12 by specifying:

  f90 -O3 -OPT:unroll_times_max=12 ... prog.f

- On the IA32 Linux cluster, the corresponding flags are -unroll for unrolling and -unroll0 for no unrolling.

Subroutine Inlining
- Subroutine inlining replaces a call to a subroutine with the body of the subroutine itself.
- One reason for using subroutine inlining is that when a subroutine is called inside a do loop that has a huge iteration count, subroutine inlining may be more efficient because it cuts down on loop overhead.
- However, the chief reason for using it is that do loops that contain subroutine calls may not parallelize.
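A minimal sketch (invented subroutine and data) showing the same loop before and after inlining by hand:

      program inline_demo
      implicit none
      integer, parameter :: n = 5
      integer :: i
      real :: x(n), y(n)

      x = 2.0

      ! Loop with a subroutine call: the call adds overhead on every
      ! iteration and may block automatic parallelization of the loop.
      do i = 1, n
         call scale_add(x(i), y(i))
      end do

      ! The same loop with the body inlined: no call overhead, and the
      ! compiler can see that the iterations are independent.
      do i = 1, n
         y(i) = 3.0 * x(i) + 1.0
      end do

      print *, y

      contains

      subroutine scale_add(a, b)
      real, intent(in)  :: a
      real, intent(out) :: b
      b = 3.0 * a + 1.0
      end subroutine scale_add

      end program inline_demo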

Subroutine Inlining
- On the SGI Origin2000 computer, there are several options to invoke inlining:
  - Inline all routines except those specified to -INLINE:never
    f90 -O3 -INLINE:all ... prog.f
  - Inline no routines except those specified to -INLINE:must
    f90 -O3 -INLINE:none ... prog.f
  - Specify a list of routines to inline at every call
    f90 -O3 -INLINE:must=subrname ... prog.f
  - Specify a list of routines never to inline
    f90 -O3 -INLINE:never=subrname ... prog.f
- On the Linux clusters, the following flags can invoke function inlining:
  - -ip: inline function expansion for calls defined within the current source file
  - -ipo: inline function expansion for calls defined in separate files

Optimization Report
- Intel 9.x and later compilers can generate reports that provide useful information on optimization done on different parts of your code.
- To generate such optimization reports in a file filename, add the flag -opt-report-file filename.
- If you have a lot of source files to process simultaneously, and you use a makefile to compile, you can also use make's "suffix" rules to have optimization reports produced automatically, each with a unique name. For example,

  .f.o:
          ifort -c -o $@ $(FFLAGS) -opt-report-file $*.opt $*.f

  creates optimization reports that are named identically to the original Fortran source but with the suffix ".f" replaced by ".opt".

Optimization Report
- To help developers and performance analysts navigate through the usually lengthy optimization reports, the NCSA program OptView is designed to provide an easy-to-use and intuitive interface that allows the user to browse through their own source code, cross-referenced with the optimization reports.
- OptView is installed on NCSA's IA64 Linux cluster under the directory /usr/apps/tools/bin. You can either add that directory to your UNIX PATH or you can invoke optview using an absolute path name. You'll need to be using the X-Window system and to have set your DISPLAY environment variable correctly for OptView to work.
- OptView can provide a quick overview of which loops in a source code, or in source codes among multiple files, are highly optimized and which might need further work.
- For a detailed description of the use of OptView, see: http://perfsuite.ncsa.uiuc.edu/OptView/

Profile-guided Optimization (PGO)
- Profile-guided optimization allows Intel compilers to use valuable runtime information to make better decisions about function inlining and interprocedural optimizations, to generate faster code.
- Its methodology is illustrated as follows:

Profile-guided Optimization (PGO)
- First, you do an instrumented compilation by adding the -prof-gen flag in the compile process:

  icc -prof-gen -c a1.c a2.c a3.c
  icc a1.o a2.o a3.o -lirc

- Then, you run the program with a representative set of data to generate the dynamic information files given by the .dyn suffix.
  - These files contain valuable runtime information for the compiler to do better function inlining and other optimizations.
- Finally, the code is recompiled again with the -prof-use flag to use the runtime information:

  icc -prof-use -ipo -c a1.c a2.c a3.c

- A profile-guided optimized executable is generated.

Vendor Tuned Code
- Vendor math libraries have codes that are optimized for their specific machine.
- On the SGI Origin2000 platform, Complib.sgimath and SCSL are available.
- On the Linux clusters, Intel MKL is available.
- Ways to link to these libraries are described in Section 3, Porting Issues.

Further Information
- SGI IRIX man and www pages
  - man opt
  - man lno
  - man inline
  - man ipa
  - man perfex
  - Performance Tuning for the Origin2000 at http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Origin2000OLD/Doc/
- Linux clusters help and www pages
  - ifort/icc/icpc -help (Intel)
  - http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/Intel64Cluster/ (Intel64)
  - http://perfsuite.ncsa.uiuc.edu/OptView/

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
  5.1 Sequential Code Limitation
  5.2 Parallel Overhead
  5.3 Load Balance
    5.3.1 Loop Schedule Types
    5.3.2 Chunk Size

Parallel Code Tuning
- This chapter describes several of the most common techniques for parallel tuning, the type of programs that benefit, and the details for implementing them.
- The majority of this chapter deals with improving load balancing.

Sequential Code Limitation
- Sequential code is a part of the program that cannot be run with multiple processors. Some reasons why it cannot be made data parallel are:
  - The code is not in a do loop.
  - The do loop contains a read or write.
  - The do loop contains a dependency.
  - The do loop has an ambiguous subscript.
  - The do loop has a call to a subroutine or a reference to a function subprogram.
- Sequential Code Fraction
  - As shown by Amdahl's Law, if the sequential fraction is too large, there is a limitation on speedup.
  - If you think too much sequential code is a problem, you can calculate the sequential fraction of code using the Amdahl's Law formula.

Sequential Code Limitation
- Measuring the Sequential Code Fraction
  - Decide how many processors to use; this is p.
  - Run and time the program with 1 processor to give T(1).
  - Run and time the program with p processors to give T(p).
  - Form a ratio of the 2 timings T(1)/T(p); this is SP.
  - Solve for f, the fraction of sequential code, by substituting SP and p into the Amdahl's Law formula:
    f = (1/SP - 1/p) / (1 - 1/p)
- Decreasing the Sequential Code Fraction
  - The compilation optimization reports list which loops could not be parallelized and why. You can use this report as a guide to improve performance on do loops by:
    - Removing dependencies
    - Removing I/O
    - Removing calls to subroutines and function subprograms
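As a worked example with invented numbers: suppose p = 4, T(1) = 100 s, and T(4) = 40 s, so SP = 100/40 = 2.5. Then

\[
f \;=\; \frac{1/S_P - 1/p}{1 - 1/p} \;=\; \frac{1/2.5 - 1/4}{1 - 1/4} \;=\; \frac{0.40 - 0.25}{0.75} \;=\; 0.2,
\]

i.e. roughly 20% of the runtime is spent in sequential code.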

Parallel Overhead
- Parallel overhead is the processing time spent
  - creating threads
  - spin/blocking threads
  - starting and ending parallel regions
  - synchronizing at the end of parallel regions
- When the computational work done by the parallel processes is too small, the overhead time needed to create and control the parallel processes can be disproportionately large, limiting the savings due to parallelism.
- Measuring Parallel Overhead
  - To get a rough under-estimate of parallel overhead:
    - Run and time the code using 1 processor.
    - Parallelize the code.
    - Run and time the parallel code using only 1 processor.
    - Subtract the 2 timings.

Parallel Overhead
- Reducing Parallel Overhead
  - Don't parallelize all the loops.
  - Don't parallelize small loops.
    - To benefit from parallelization, a loop needs about 1000 floating point operations or 500 statements in the loop.
    - You can use the IF modifier in the OpenMP directive to control when loops are parallelized:
          !$OMP PARALLEL DO IF(n > 500)
          do i=1,n
            ... body of loop ...
          end do
          !$OMP END PARALLEL DO
  - Use task parallelism instead of data parallelism. It doesn't generate as much parallel overhead and often more code runs in parallel.
  - Parallelize at the highest level possible.
  - Don't use more threads than you need.
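The same guard can be expressed with the OpenMP if clause in C; a minimal sketch, with the array name and loop body chosen only for illustration:

      #include <omp.h>

      /* Only spawn the parallel region when the trip count n is large
       * enough for the work to outweigh thread-management overhead. */
      void scale(double *a, double s, int n)
      {
          #pragma omp parallel for if(n > 500)
          for (int i = 0; i < n; i++)
              a[i] *= s;
      }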

Load Balance
- Load balance is the even assignment of subtasks to processors so as to keep each processor busy doing useful work for as long as possible.
- Load balance is important for speedup because the end of a do loop is a synchronization point where threads need to catch up with each other.
- If processors have different work loads, some of the processors will idle while others are still working.
- Measuring Load Balance
  - On the SGI Origin, to measure load balance, use the perfex tool, which is a command line interface to the R10000 hardware counters. The command
        perfex -e16 -mp a.out > results
    reports per-thread cycle counts. Compare the cycle counts to determine load balance problems. If the counts are vastly different, it indicates load imbalance.
  - The master thread (thread 0) always uses more cycles than the slave threads.

Load Balance
- For Linux systems, the thread CPU times can be compared with ps. A thread with unusually high or low time compared to the others may not be working efficiently (high cputime could be the result of a thread spinning while waiting for other threads to catch up).
      ps uH
- Improving Load Balance
  - To improve load balance, try changing the way that loop iterations are allocated to threads by
    - changing the loop schedule type
    - changing the chunk size
  - These methods are discussed in the following sections.

Loop Schedule Types
- On the SGI Origin2000 computer, 4 different loop schedule types can be specified by an OpenMP directive. They are:
  - Static
  - Dynamic
  - Guided
  - Runtime
- If you don't specify a schedule type, the default will be used.
- Default Schedule Type
  - The default schedule type allocates 20 iterations on 4 threads as:

Loop Schedule Types
- Static Schedule Type
  - The static schedule type is used when some of the iterations do more work than others. With the static schedule type, iterations are allocated in a round-robin fashion to the threads.
- An Example
  - Suppose you are computing on the upper triangle of a 100 x 100 matrix, and you use 2 threads, named t0 and t1. With default scheduling, workloads are uneven.

Loop Schedule Types
- Whereas with static scheduling, the columns of the matrix are given to the threads in a round-robin fashion, resulting in better load balance.

Loop Schedule Types
- Dynamic Schedule Type
  - The iterations are dynamically allocated to threads at runtime. Each thread is given a chunk of iterations. When a thread finishes its work, it goes into a critical section where it's given another chunk of iterations to work on.
  - This type is useful when you don't know the iteration count or work pattern ahead of time. Dynamic gives good load balance, but at a high overhead cost.
- Guided Schedule Type
  - The guided schedule type is dynamic scheduling that starts with large chunks of iterations and ends with small chunks of iterations. That is, the number of iterations given to each thread depends on the number of iterations remaining.
  - The guided schedule type reduces the number of entries into the critical section, compared to the dynamic schedule type. Guided gives good load balancing at a low overhead cost.

Chunk Size
- The word chunk refers to a grouping of iterations. Chunk size means how many iterations are in the grouping.
- The static and dynamic schedule types can be used with a chunk size. If a chunk size is not specified, then the chunk size is 1.
- Suppose you specify a chunk size of 2 with the static schedule type. Then 20 iterations are allocated on 4 threads:
- The schedule type and chunk size are specified as follows:
      !$OMP PARALLEL DO SCHEDULE(type, chunk)
      ...
      !$OMP END PARALLEL DO
- Where type is STATIC, DYNAMIC, or GUIDED and chunk is any positive integer.
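In C/C++ OpenMP the same choice is made with the schedule clause. A minimal sketch of the triangular-matrix example from the previous slides (the array name and chunk size are illustrative):

      #include <omp.h>

      void triangular_update(double **a, int n)
      {
          /* Hand the columns out round-robin in chunks of 2 so the uneven
           * per-column work of a triangular loop is spread across threads. */
          #pragma omp parallel for schedule(static, 2)
          for (int j = 0; j < n; j++)
              for (int i = 0; i <= j; i++)    /* upper triangle only */
                  a[i][j] *= 2.0;
      }

Changing schedule(static, 2) to schedule(dynamic, 2) or schedule(guided) trades bookkeeping overhead against load balance, as described above.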

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
6.1 Timing
6.1.1 Timing a Section of Code
6.1.1.1 CPU Time
6.1.1.2 Wall clock Time
6.1.2 Timing an Executable
6.1.3 Timing a Batch Job
6.2 Profiling
6.2.1 Profiling Tools
6.2.2 Profile Listings
6.2.3 Profiling Analysis
6.3 Further Information

Timing and Profiling
- Now that your program has been ported to the new computer, you will want to know how fast it runs.
- This chapter describes how to measure the speed of a program using various timing routines.
- The chapter also covers how to determine which parts of the program account for the bulk of the computational load so that you can concentrate your tuning efforts on those computationally intensive parts of the program.

Timing
- In the following sections, we'll discuss timers and review the profiling tools ssrun and prof on the Origin and vprof and gprof on the Linux Clusters. The specific timing functions described are:
- Timing a section of code
  - FORTRAN
    - etime, dtime, cpu_time for CPU time
    - time and f_time for wallclock time
  - C
    - clock for CPU time
    - gettimeofday for wallclock time
- Timing an executable
  - time a.out
- Timing a batch run
  - busage
  - qstat
  - qhist

CPU Time
- etime
  - A section of code can be timed using etime.
  - It returns the elapsed CPU time in seconds since the program started.

      real*4 tarray(2), time1, time2, timeres
      ... beginning of program
      time1=etime(tarray)
      ... start of section of code to be timed
      ... lots of computation
      ... end of section of code to be timed
      time2=etime(tarray)
      timeres=time2-time1

CPU Time
- dtime
  - A section of code can also be timed using dtime.
  - It returns the elapsed CPU time in seconds since the last call to dtime.

      real*4 tarray(2), timeres
      ... beginning of program
      timeres=dtime(tarray)
      ... start of section of code to be timed
      ... lots of computation
      ... end of section of code to be timed
      timeres=dtime(tarray)
      ... rest of program

CPU Time
The etime and dtime Functions
- User time
  - This is returned as the first element of tarray.
  - It's the CPU time spent executing user code.
  - It's the time that is usually reported.
- System time
  - This is returned as the second element of tarray.
  - It's the time spent executing system calls on behalf of your program.
- Metric
  - Sum of user and system time.
  - This is the function value that is returned.
  - Timings are reported in seconds.
  - Timings are accurate to 1/100th of a second.

CPU Time
Timing Comparison Warnings
- For the SGI computers:
  - The etime and dtime functions return the MAX time over all threads for a parallel program.
  - This is the time of the longest thread, which is usually the master thread.
- For the Linux Clusters:
  - The etime and dtime functions are contained in the VAX compatibility library of the Intel FORTRAN Compiler.
  - To use this library include the compiler flag -Vaxlib.
- Another warning: Do not put calls to etime and dtime inside a do loop. The overhead is too large.

CPU Time
cpu_time
- The cpu_time routine is available only on the Linux clusters as it is a component of the Intel FORTRAN compiler library.
- It provides substantially higher resolution and has substantially lower overhead than the older etime and dtime routines.
- It can be used as an elapsed timer.

      real*8 time1, time2, timeres
      ... beginning of program
      call cpu_time(time1)
      ... start of section of code to be timed
      ... lots of computation
      ... end of section of code to be timed
      call cpu_time(time2)
      timeres=time2-time1
      ... rest of program

CPU Time
clock
- For C programmers, one can call the cpu_time routine using a FORTRAN wrapper or call the intrinsic function clock that can be used to determine elapsed CPU time.

      #include <time.h>
      static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
      double time1, time2, timeres;
      ...
      time1=(clock()*iCPS);
      ...
      /* do some work */
      ...
      time2=(clock()*iCPS);
      timeres=time2-time1;

Wall clock Time
time
- For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970.
- It is a means of getting the elapsed wall clock time.
- The wall clock time is reported in integer seconds.

      external time
      integer*4 time1, time2, timeres
      ... beginning of program
      time1=time( )
      ... start of section of code to be timed
      ... lots of computation
      ... end of section of code to be timed
      time2=time( )
      timeres=time2 - time1

Wall clock Time
f_time
- For the Linux clusters, the appropriate FORTRAN function for elapsed time is f_time.

      integer*8 f_time
      external f_time
      integer*8 time1, time2, timeres
      ... beginning of program
      time1=f_time()
      ... start of section of code to be timed
      ... lots of computation
      ... end of section of code to be timed
      time2=f_time()
      timeres=time2 - time1

- As above for etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib.

Wall clock Time
gettimeofday
- For C programmers, wallclock time can be obtained by using the very portable routine gettimeofday.

      #include <stddef.h>    /* definition of NULL */
      #include <sys/time.h>  /* definition of timeval struct and prototyping of gettimeofday */
      double t1, t2, elapsed;
      struct timeval tp;
      int rtn;
      ...
      rtn=gettimeofday(&tp, NULL);
      t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
      ...
      /* do some work */
      ...
      rtn=gettimeofday(&tp, NULL);
      t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
      elapsed=t2-t1;

Timing an Executable
- To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time):
      time [options] a.out
- where options can be '-p' for a simple output or '-f format', which allows the user to display more than just time related information.
- Consult the man pages on the time command for format options.

Timing a Batch Job
- Time of a batch job running or completed.
- Origin
      busage jobid
- Linux clusters
      qstat jobid   # for a running job
      qhist jobid   # for a completed job

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
6.1 Timing
6.1.1 Timing a Section of Code
6.1.1.1 CPU Time
6.1.1.2 Wall clock Time
6.1.2 Timing an Executable
6.1.3 Timing a Batch Job
6.2 Profiling
6.2.1 Profiling Tools
6.2.2 Profile Listings
6.2.3 Profiling Analysis
6.3 Further Information

Profiling
- Profiling determines where a program spends its time.
- It detects the computationally intensive parts of the code.
- Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load.
- Most codes follow the 90-10 Rule.
  - That is, 90% of the computation is done in 10% of the code.

Profiling Tools
Profiling Tools on the Origin
- On the SGI Origin2000 computer there are profiling tools named ssrun and prof.
- Used together they do profiling, or what is called hot spot analysis.
- They are useful for generating timing profiles.
- ssrun
  - The ssrun utility collects performance data for an executable that you specify.
  - The performance data is written to a file named "executablename.exptype.id".
- prof
  - The prof utility analyzes the data file created by ssrun and produces a report.
- Example
      ssrun -fpcsamp a.out
      prof -h a.out.fpcsamp.m12345 > prof.list

Profiling Tools
Profiling Tools on the Linux Clusters
- On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun, prof and perfex tools.
- gprof
  - Basic profiling information can be generated using the OS utility gprof.
  - First, compile the code with the compiler flags -qp -g for the Intel compiler (-g on the Intel compiler does not change the optimization level) or -pg for the GNU compiler.
  - Second, run the program.
  - Finally, analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out

      efc -O -qp -g -o foo foo.f
      ./foo
      gprof foo gmon.out

Profiling Tools
Profiling Tools on the Linux Clusters
- vprof
  - On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library.
  - To instrument the whole application requires recompiling and linking to the vprof and PAPI libraries.

      setenv VMON PAPI_TOT_CYC
      ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
      ./md
      /usr/apps/tools/vprof/bin/cprof -e md vmon.out

Profile Listings
Profile Listings on the Origin
- Prof Output First Listing

      Cycles      %      Cum%    Secs   Proc
      --------    -----  -----   ----   ------
      42630984    58.47  58.47   0.57   VSUB
      6498294      8.91  67.38   0.09   PFSOR
      6141611      8.42  75.81   0.08   PBSOR
      3654120      5.01  80.82   0.05   PFSOR1
      2615860      3.59  84.41   0.03   VADD
      1580424      2.17  86.57   0.02   ITSRCG
      1144036      1.57  88.14   0.02   ITSRSI
       886044      1.22  89.36   0.01   ITJSI
       861136      1.18  90.54   0.01   ITJCG

- The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.

Profile Listings
Profile Listings on the Origin
- Prof Output Second Listing

      Cycles      %      Cum%    Line   Proc
      --------    -----  -----   ----   ------
      36556944    50.14  50.14   8106   VSUB
      5313198      7.29  57.43   6974   PFSOR
      4968804      6.82  64.25   6671   PBSOR
      2989882      4.10  68.35   8107   VSUB
      2564544      3.52  71.87   7097   PFSOR1
      1988420      2.73  74.60   8103   VSUB
      1629776      2.24  76.84   8045   VADD
       994210      1.36  78.20   8108   VSUB
       969056      1.33  79.53   8049   VADD
       483018      0.66  80.19   6972   PFSOR

- The second listing gives the number of cycles per source code line.
- The lines are listed in descending order of cycle count.

Profile Listings
Profile Listings on the Linux Clusters
- gprof Output First Listing
      Flat profile:
      Each sample counts as 0.000976562 seconds.
      Columns: % time, cumulative seconds, self seconds, calls, self us/call, total us/call, name.
      The routines reported, in descending order of self seconds, are:
      compute_ (101 calls), dist_ (25199500 calls), SIND_SINCOS, sin, cos,
      dotr8_ (50500 calls), update_ (100 calls), f_fioinit, f_intorange, mov,
      initialize_ (1 call).
- The listing gives a 'flat' profile of functions and routines encountered, sorted by 'self seconds', which is the number of seconds accounted for by this function alone.

Profile Listings
Profile Listings on the Linux Clusters
- gprof Output Second Listing
      Call graph:
      Columns: index, % time, self, children, called, name.
      The entries include main [1] (72.9% of total time including children),
      compute_ [2] (101/101 calls from main), dist_ [3] (25199500/25199500 calls
      from compute_), SIND_SINCOS [4] (<spontaneous>), dotr8_ [7] (50500/50500
      calls), update_ [8] (100/100 calls), and initialize_ [12] (1/1 call).
- The second listing gives a 'call-graph' profile of functions and routines encountered. The definitions of the columns are specific to the line in question. Detailed information is contained in the full output from gprof.

Profile Listings
Profile Listings on the Linux Clusters
- vprof Listing
      Columns correspond to the following events:
        PAPI_TOT_CYC - Total cycles (1956 events)
      File Summary:
        100.0% /u/ncsa/gbauer/temp/md.f
      Function Summary:
         84.4% compute
         15.6% dist
      Line Summary:
        the hottest lines of md.f (lines 102-107 and 162-169 are reported); one
        line accounts for roughly two thirds of the cycles and another for about
        14%, with the remaining listed lines each contributing a few percent or less.
- The above listing (using the -e option to cprof) displays not only cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions.

Profile Listings
Profile Listings on the Linux Clusters
- vprof Listing (cont.) The per-line output annotates each source line with the percentage of total cycles spent on it. The annotated loop from md.f reads:

      100        do j=1,np
      101          if (i .ne. j) then
      102            call dist(nd,box,pos(1,i),pos(1,j),rij,d)
      103            ! attribute half of the potential energy to particle 'j'
      104            pot = pot + 0.5*v(d)
      105            do k=1,nd
      106              f(k,i) = f(k,i) - rij(k)*dv(d)/d
      107            enddo
      108          endif
      109        enddo

Profiling Analysis
- The program being analyzed in the previous Origin example has approximately 10000 source code lines and consists of many subroutines.
- The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine.
- The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation.
- Going back to the source code, line 8106 is a line inside a do loop.
- Putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part.
- Since the compiler has rearranged the source lines, the line numbers given by ssrun/prof give you an area of the code to inspect.
- To view the rearranged source use the option
      f90 ... -FLIST:=ON
      cc ... -CLIST:=ON
- For the Intel compilers, the appropriate options are
      ifort ... -E ...
      icc ... -E ...

Further Information
- SGI Irix
  - man etime
  - man 3 time
  - man 1 time
  - man busage
  - man timers
  - man ssrun
  - man prof
  - Origin2000 Performance Tuning and Optimization Guide
- Linux Clusters
  - man 3 clock
  - man 2 gettimeofday
  - man 1 time
  - man 1 gprof
  - man 1B qstat
  - Intel Compilers
  - Vprof on NCSA Linux Cluster

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690

Agenda
7 Cache Tuning
7.1 Cache Concepts
7.1.1 Memory Hierarchy
7.1.2 Cache Mapping
7.1.3 Cache Thrashing
7.1.4 Cache Coherence
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information

Cache Concepts
- The CPU time required to perform an operation is the sum of the clock cycles executing instructions and the clock cycles waiting for memory.
- The CPU cannot be performing useful work if it is waiting for data to arrive from memory.
- Clearly then, the memory system is a major factor in determining the performance of your program, and a large part is your use of the cache.
- The following sections will discuss the key concepts of cache, including:
  - Memory subsystem hierarchy
  - Cache mapping
  - Cache thrashing
  - Cache coherence

Memory Hierarchy
- The different subsystems in the memory hierarchy have different speeds, sizes, and costs.
  - Smaller memory is faster
  - Slower memory is cheaper
- The hierarchy is set up so that the fastest memory is closest to the CPU, and the slower memories are further away from the CPU.

Memory Hierarchy
- It's a hierarchy because every level is a subset of a level further away.
- All data in one level is found in the level below.
- Registers
  - Registers are the sources and destinations of CPU data operations.
  - They hold one data element each and are 32 bits or 64 bits wide.
  - They are on-chip and built from SRAM.
  - Register access speeds are comparable to processor speeds.
  - Computers usually have 32 or 64 registers.
  - The Origin MIPS R10000 has 64 physical 64-bit registers, of which 32 are available for floating-point operations.
  - The Intel IA64 has 328 registers for general-purpose (64 bit), floating-point (80 bit), predicate (1 bit), branch and other functions.
- The purpose of cache is to improve the memory access time to the processor.
  - There is an overhead associated with it, but the benefits outweigh the cost.

Memory Hierarchy
- Main Memory Improvements
  - Large main memory with a cycle time comparable to the processor is not affordable.
  - The bank cycle time is 4-8 times the CPU clock cycle time, so the main memory can't keep up with the fast CPU and keep it busy with data.
  - A hardware improvement called interleaving reduces main memory access time.
  - In interleaving, memory is divided into partitions or segments called memory banks.
  - Consecutive data elements are spread across the banks.
  - Each bank supplies one data element per bank cycle.
  - Multiple data elements are read in parallel, one from each bank.
  - The problem with interleaving is that the improvement assumes that memory is accessed sequentially.
  - If there is 2-way memory interleaving, but the code accesses every other location, there is no benefit.

Memory Hierarchy
- Principle of Locality
  - The way your program operates follows the Principle of Locality.
  - Temporal Locality: When an item is referenced, it will be referenced again soon.
  - Spatial Locality: When an item is referenced, items whose addresses are nearby will tend to be referenced soon.
- Cache Line
  - The overhead of the cache can be reduced by fetching a chunk or block of data elements.
  - When a main memory access is made, a cache line of data is brought into the cache instead of a single data element.
  - A cache line is defined in terms of a number of bytes. For example, a cache line is typically 32 or 128 bytes.
  - This takes advantage of spatial locality: the additional elements in the cache line will most likely be needed soon.
  - The cache miss rate falls as the size of the cache line increases, but there is a point of negative returns on the cache line size.
  - When the cache line size becomes too large, the transfer time increases.

Memory Hierarchy
- Cache Hit
  - A cache hit occurs when the data element requested by the processor is in the cache.
  - You want to maximize hits.
  - The Cache Hit Rate is defined as the fraction of cache hits; it is the fraction of the requested data that is found in the cache.
- Cache Miss
  - A cache miss occurs when the data element requested by the processor is NOT in the cache.
  - You want to minimize cache misses.
  - The Cache Miss Rate is defined as 1.0 - Hit Rate.
  - The Cache Miss Penalty, or miss time, is the time needed to retrieve the data from a lower level (downstream) of the memory hierarchy. (Recall that the lower levels of the hierarchy have a slower access time.)
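A common way to combine these quantities, not stated explicitly in the slides but a standard rule of thumb, is the average memory access time: hit time plus miss rate times miss penalty. A minimal C sketch with made-up numbers:

      /* Average memory access time (in cycles), first-order model:
       * the hit time is always paid, a miss additionally pays the penalty. */
      double amat(double hit_time, double miss_rate, double miss_penalty)
      {
          return hit_time + miss_rate * miss_penalty;
      }

      /* Example (hypothetical values): a 1-cycle L1 hit, a 5% miss rate and a
       * 60-cycle penalty give amat(1.0, 0.05, 60.0) = 4.0 cycles on average. */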

Memory Hierarchy
- Levels of Cache
  - It used to be that there were two levels of cache: on-chip and off-chip.
  - An on-chip cache performs the fastest, but the computer designer makes a trade-off between die size and cache size. Hence, on-chip cache has a small size.
  - When the on-chip cache has a cache miss, the time to access the slower main memory is very large. A cache miss is very costly. To solve this problem, computer designers have implemented a larger, slower off-chip cache. This chip speeds up the on-chip cache miss time.
  - The on-chip cache is called First Level, L1, or primary cache.
  - The off-chip cache is called Second Level, L2, or secondary cache.
  - L1 cache misses are handled quickly; L2 cache misses have a larger performance penalty.
  - Caches closer to the CPU are called Upstream; caches further from the CPU are called Downstream.
  - L1/L2 is still true for the Origin MIPS and the Intel IA-32 processors.
  - The newer Intel IA-64 processor has 3 levels of cache. The cache external to the chip is called Third Level, L3.

Memory Hierarchy
- Split or Unified Cache
  - In unified cache, typically L2, the cache is a combined instruction-data cache.
  - In split cache, typically L1, the cache is split into 2 parts:
    - one for the instructions, called the instruction cache
    - another for the data, called the data cache.
  - The 2 caches are independent of each other, and they can have independent properties.
  - A disadvantage of a unified cache is that when the data access and instruction access conflict with each other, the cache may be thrashed, e.g. a high cache miss rate.
- Memory Hierarchy Sizes
  - Memory hierarchy sizes are specified in the following units:
    - Cache Line: bytes
    - L1 Cache: Kbytes
    - L2 Cache: Mbytes
    - Main Memory: Gbytes

Cache Mapping
- Cache mapping determines which cache location should be used to store a copy of a data element from main memory. There are 3 mapping strategies:
  - Direct mapped cache
  - Set associative cache
  - Fully associative cache
- Direct Mapped Cache
  - In direct mapped cache, a line of main memory is mapped to only a single line of cache.
  - Consequently, a particular cache line can be filled from (size of main memory divided by size of cache) different lines of main memory.
  - Direct mapped cache is inexpensive but also inefficient and very susceptible to cache thrashing.
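As a rough sketch of how the direct-mapped placement is computed (purely illustrative, not a description of any particular processor):

      #include <stdint.h>

      /* For a direct-mapped cache, the target line is fully determined by the
       * address: strip the byte offset within the line, then take the result
       * modulo the number of lines in the cache. */
      unsigned direct_mapped_line(uintptr_t addr, unsigned line_bytes, unsigned num_lines)
      {
          return (unsigned)((addr / line_bytes) % num_lines);
      }

      /* Two addresses exactly one cache-size apart (num_lines * line_bytes)
       * map to the same line, which is why direct mapping is prone to thrashing. */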

Cache Mapping
- Direct Mapped Cache
  http://larc.ee.nthu.edu.tw/~cthuang/courses/ee3450/lectures/07_memory.html

Cache Mapping
- Fully Associative Cache
  - For fully associative cache, any line of cache can be loaded with any line from main memory.
  - This technology is very fast but also very expensive.
  http://www.xbitlabs.com/images/video/radeon-x1000/caches.png

Cache Mapping
- Set Associative Cache
  - For N-way set associative cache, you can think of cache as being divided into N sets (usually N is 2 or 4).
  - A line from main memory can then be written to its cache line in any of the N sets.
  - This is a trade-off between direct mapped and fully associative cache.
  http://www.alasir.com/articles/cache_principles/cache_way.png

Cache Mapping
- Cache Block Replacement
  - With direct mapped cache, a cache line can only be mapped to one unique place in the cache. The new cache line replaces the cache block at that address.
  - With set associative cache there is a choice of 3 strategies:
    1. Random
       - There is a uniform random replacement within the set of cache blocks.
       - The advantage of random replacement is that it's simple and inexpensive to implement.
    2. LRU (Least Recently Used)
       - The block that gets replaced is the one that hasn't been used for the longest time.
       - The principle of temporal locality tells us that recently used data blocks are likely to be used again soon, so an advantage of LRU is that it preserves temporal locality.
       - A disadvantage of LRU is that it's expensive to keep track of cache access patterns.
       - In empirical studies, there was little performance difference between LRU and Random.
    3. FIFO (First In First Out)
       - Replace the block that was brought in N accesses ago, regardless of the usage pattern.
       - In empirical studies, random replacement generally outperformed FIFO.
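A minimal sketch of the LRU bookkeeping for one N-way set (purely illustrative; real caches do this in hardware, and the 4-way size and age-counter scheme are assumptions for the example):

      #define WAYS 4

      /* One set of an N-way set-associative cache: a tag and an age per way.
       * On a miss, the way with the largest age (least recently used) is evicted. */
      struct set { unsigned long tag[WAYS]; unsigned age[WAYS]; };

      int access_set(struct set *s, unsigned long tag)
      {
          int victim = 0;
          for (int w = 0; w < WAYS; w++) {
              if (s->tag[w] == tag) {              /* hit: this way becomes youngest */
                  for (int k = 0; k < WAYS; k++) s->age[k]++;
                  s->age[w] = 0;
                  return 1;
              }
              if (s->age[w] > s->age[victim]) victim = w;
          }
          /* miss: replace the least recently used way */
          for (int k = 0; k < WAYS; k++) s->age[k]++;
          s->tag[victim] = tag;
          s->age[victim] = 0;
          return 0;
      }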

Cache Thrashing
- Cache thrashing is a problem that happens when a frequently used cache line gets displaced by another frequently used cache line.
  - The same data elements are repeatedly fetched into and displaced from the cache.
  - Cache lines are discarded and later retrieved.
  - The CPU can't find the data element it wants in the cache and must make another main memory cache line access.
- Cache thrashing can happen for both instruction and data caches.
- Cache thrashing happens because the computational code statements have too many variables and arrays for the needed data elements to fit in cache.
  - The arrays are dimensioned too large to fit in cache.
  - The arrays are accessed with indirect addressing, e.g. a(k(j)).

Cache Coherence
- Cache coherence is maintained by an agreement between data stored in cache, other caches, and main memory.
- It is the means by which all the memory subsystems maintain data coherence.
- When the same data is being manipulated by different processors, they must inform each other of their modification of data.
- The term Protocol is used to describe how caches and main memory communicate with each other.

Cache Coherence
- Snoop Protocol
  - All processors monitor the bus traffic to determine cache line status.
- Directory Based Protocol
  - Cache lines contain extra bits that indicate which other processor has a copy of that cache line, and the status of the cache line - clean (cache line does not need to be sent back to main memory) or dirty (cache line needs to update main memory with the content of the cache line).
- Hardware Cache Coherence
  - Cache coherence on the Origin computer is maintained in the hardware, transparent to the programmer.

Cache Coherence
- False sharing happens in a multiprocessor system as a result of maintaining cache coherence.
  - Both processor A and processor B have the same cache line.
  - A modifies the first word of the cache line.
  - B wants to modify the eighth word of the cache line.
  - But A has sent a signal to B that B's cache line is invalid.
  - B must fetch the cache line again before writing to it.
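A minimal OpenMP C sketch of the same pattern and the usual padding fix (the thread count, array names and 64-byte pad are illustrative assumptions; the pad should be at least one cache line on the target machine):

      #include <omp.h>

      #define NTHREADS 8

      /* Each thread updates its own counter, but adjacent counters share a
       * cache line, so every update invalidates the line in the other caches. */
      long shared_counters[NTHREADS];                 /* prone to false sharing */

      /* Padding each counter out to its own cache line removes the conflict. */
      struct padded { long value; char pad[64 - sizeof(long)]; };
      struct padded padded_counters[NTHREADS];

      void count_events(int n)
      {
          #pragma omp parallel num_threads(NTHREADS)
          {
              int id = omp_get_thread_num();
              for (int i = 0; i < n; i++)
                  padded_counters[id].value++;        /* no false sharing */
          }
      }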

Cache Coherence
- A cache miss creates a processor stall.
  - The processor is stalled until the data is retrieved from the memory.
  - The stall is minimized by continuing to load and execute instructions until the data that is stalling is retrieved.
- These techniques are called:
  - Prefetching
  - Out of order execution
  - Software pipelining
- Typically, the compiler will do these at -O3 optimization.

Cache Coherence
- The following is an example of software pipelining. Suppose you compute:
      Do I=1,N
        y(I)=y(I) + a*x(I)
      End Do
- In pseudo-assembly language, this is what the Origin compiler will do:

      cycle t+0    ld y(I+3)
      cycle t+1    ld x(I+3)
      cycle t+2    st y(I-4)
      cycle t+3    st y(I-3)
      cycle t+4    st y(I-2)
      cycle t+5    st y(I-1)
      cycle t+6    ld y(I+4)
      cycle t+7    ld x(I+4)
      cycle t+8    ld y(I+5)
      cycle t+9    ld x(I+5)
      cycle t+10   ld y(I+6)
      cycle t+11   ld x(I+6)

  with the four fused multiply-adds (madd I, madd I+1, madd I+2, madd I+3) issued in parallel with these memory operations.

Cache Coherence
- Since the Origin processor can only execute 1 load or 1 store at a time, the compiler places loads in the instruction pipeline well before the data is needed.
- It is then able to continue loading while simultaneously performing a fused multiply-add (a+b*c).
- The code above gets 8 flops in 12 clock cycles. The peak is 24 flops in 12 clock cycles for the Origin.
- The Intel Pentium III (IA-32) and the Itanium (IA-64) will have differing versions of the code above, but the same concepts apply.

Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.2.1 Cache on the SGI Origin2000
7.2.2 Cache on the Intel Pentium III
7.2.3 Cache on the Intel Itanium
7.2.4 Cache Summary
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information

Cache on the SGI Origin2000
- L1 Cache (on-chip primary cache)
  - Cache size: 32KB floating point data; 32KB integer data and instruction
  - Cache line size: 32 bytes
  - Associativity: 2-way set associative
- L2 Cache (off-chip secondary cache)
  - Cache size: 4MB per processor
  - Cache line size: 128 bytes
  - Associativity: 2-way set associative
  - Replacement: LRU
  - Coherence: Directory based
  - 2-way interleaved (2 banks)

Cache on the SGI Origin2000
- Bandwidth L1 cache-to-processor
  - 1.6 GB/s/bank
  - 3.2 GB/sec overall possible
  - Latency: 1 cycle
- Bandwidth between L1 and L2 cache
  - 1 GB/s
  - Latency: 11 cycles
- Bandwidth between L2 cache and local memory
  - 0.5 GB/s
  - Latency: 61 cycles
- Average 32 processor remote memory
  - Latency: 150 cycles

Cache on the Intel Pentium III
- L1 Cache (on-chip primary cache)
  - Cache size: 16KB floating point data; 16KB integer data and instruction
  - Cache line size: 16 bytes
  - Associativity: 4-way set associative
- L2 Cache (off-chip secondary cache)
  - Cache size: 256 KB per processor
  - Cache line size: 32 bytes
  - Associativity: 8-way set associative
  - Replacement: pseudo-LRU
  - Coherence: interleaved (8 banks)

Cache on the Intel Pentium III
- Bandwidth L1 cache-to-processor
  - 16 GB/s
  - Latency: 2 cycles
- Bandwidth between L1 and L2 cache
  - 11.7 GB/s
  - Latency: 4-10 cycles
- Bandwidth between L2 cache and local memory
  - 1.0 GB/s
  - Latency: 15-21 cycles

Cache on the Intel Itanium
- L1 Cache (on-chip primary cache)
  - Cache size: 16KB floating point data; 16KB integer data and instruction
  - Cache line size: 32 bytes
  - Associativity: 4-way set associative
- L2 Cache (off-chip secondary cache)
  - Cache size: 96KB unified data and instruction
  - Cache line size: 64 bytes
  - Associativity: 6-way set associative
  - Replacement: LRU
- L3 Cache (off-chip tertiary cache)
  - Cache size: 4MB per processor
  - Cache line size: 64 bytes
  - Associativity: 4-way set associative
  - Replacement: LRU

Cache on the Intel Itanium
- Bandwidth L1 cache-to-processor
  - 25.6 GB/s
  - Latency: 1 - 2 cycles
- Bandwidth between L1 and L2 cache
  - 25.6 GB/sec
  - Latency: 6 - 9 cycles
- Bandwidth between L2 and L3 cache
  - 11.7 GB/sec
  - Latency: 21 - 24 cycles
- Bandwidth between L3 cache and main memory
  - 2.1 GB/sec
  - Latency: 50 cycles

Cache Summary

      Chip          #Caches  Associativity  Replacement  CPU MHz   Peak Mflops  LD,ST/cycle
      MIPS R10000   2        2/2            LRU          195/250   390/500      1 LD or 1 ST
      Pentium III   2        4/8            Pseudo-LRU   1000      1000         1 LD and 1 ST
      Itanium       3        4/6/4          LRU          800       3200         2 LD or 2 ST

- Only one load or store may be performed each CPU cycle on the R10000.
- This indicates that loads and stores may be a bottleneck.
- Efficient use of cache is extremely important.

Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.4.1 Measuring Cache Performance on the SGI Origin2000
7.4.2 Measuring Cache Performance on the Linux Clusters
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information

Code Optimization
- Gather statistics to find out where the bottlenecks are in your code so you can identify what you need to optimize.
- The following questions can be useful to ask:
  - How much time does the program take to execute?
    - Use /usr/bin/time a.out for CPU time.
  - Which subroutines use the most time?
    - Use ssrun and prof on the Origin or gprof and vprof on the Linux clusters.
  - Which loop uses the most time?
    - Put etime/dtime or other recommended timer calls around loops for CPU time.
  - What is contributing to the cpu time?
    - Use the perfex utility on the Origin or perfex or hpmcount on the Linux clusters.
- For more information on timers see the Timing and Profiling section.

Code Optimization
- Some useful optimizing and profiling tools are
  - etime/dtime/time
  - perfex
  - ssusage
  - ssrun/prof
  - gprof
  - cvpav, cvd
- See the NCSA web pages on Compiler, Performance, and Productivity Tools at http://www.ncsa.uiuc.edu/UserInfo/Resources/Software/Tools/ for information on which tools are available on NCSA platforms.

Measuring Cache Performance on the SGI Origin2000
- The R10000 processors of NCSA's Origin2000 computers have hardware performance counters.
- The Perfex Utility
  - The hardware performance counters can be measured using the perfex utility.
        perfex [options] command [arguments]
  - There are 32 events that are measured, and each event is numbered:
        0 = cycles
        1 = Instructions issued
        ...
        26 = Secondary data cache misses
  - View man perfex for more information.

Measuring Cache Performance on the SGI Origin2000
- where the options are:
  - -e counter1 -e counter2
    - This specifies which events are to be counted. You enter the number of the event you want counted. (Remember to have a space in between the "e" and the event number.)
  - -a  sample ALL the events
  - -mp  Report all results on a per thread basis.
  - -y  Report the results in seconds, not cycles.
  - -x  Gives extra summary info including Mflops
  - command  Specify the name of the executable file.
  - arguments  Specify the input and output arguments to the executable file.

Measuring Cache Performance on the SGI Origin2000
- Examples
  - perfex -e 25 -e 26 a.out
    - outputs the L1 and L2 cache misses
    - the output is reported in cycles
  - perfex -a -y a.out > results
    - outputs ALL the hardware performance counters
    - the output is reported in seconds

Measuring Cache Performance on the Linux Clusters
- The Intel Pentium III and Itanium processors provide hardware event counters that can be accessed from several tools.
- perfex for the Pentium III and pfmon for the Itanium
  - To view usage and options for perfex and pfmon:
        perfex -h
        pfmon --help
  - To measure L2 cache misses:
        perfex -e P6_L2_LINES_IN a.out
        pfmon --events=L2_MISSES a.out

Measuring Cache Performance on the Linux Clusters
- psrun
  - Another tool that provides access to the hardware event counters and also provides derived statistics is perfsuite.
  - To add perfsuite's psrun to the current shell environment:
        soft add +perfsuite
  - To measure cache misses:
        psrun a.out
        psprocess a.out*.xml

Agenda
7 Cache Tuning
7.1 Cache Concepts
7.2 Cache Specifics
7.3 Code Optimization
7.4 Measuring Cache Performance
7.5 Locating the Cache Problem
7.6 Cache Tuning Strategy
7.7 Preserve Spatial Locality
7.8 Locality Problem
7.9 Grouping Data Together
7.10 Cache Thrashing Example
7.11 Not Enough Cache
7.12 Loop Blocking
7.13 Further Information

Locating the Cache Problem
- For the Origin, the perfex output is a first-pass detection of a cache problem.
- If you then use the CaseVision tools, you can locate the cache problem in your code.
- The CaseVision tools are
  - cvpav for performance analysis
  - cvd for debugging
- CaseVision is not available on the Linux clusters.
- Tools like vprof and libhpm provide routines for users to instrument their code.
- Using vprof with the PAPI cache events can provide detailed information about where poor cache utilization is occurring.

Cache Tuning Strategy
- The strategy for performing cache tuning on your code is based on data reuse.
- Temporal Reuse
  - Use the same data elements on more than one iteration of the loop.
- Spatial Reuse
  - Use data that is encached as a result of fetching nearby data elements from downstream memory.
- Strategies that take advantage of the Principle of Locality will improve performance.

Preserve Spatial Locality
- Check loop nesting to ensure stride-one memory access.
- The following code does not preserve spatial locality:

      do I=1,n
        do J=1,n
          do K=1,n
            C(I,J)=C(I,J) + A(I,K) * B(K,J)
          end do
        end do
      end do

- It is not wrong, but it runs much slower than it could.
- For Fortran the innermost loop index should be the leftmost index of the arrays.
- To ensure stride-one access, modify the code using loop interchange. The code below has been modified for spatial reuse:

      do J=1,n
        do K=1,n
          do I=1,n
            C(I,J)=C(I,J) + A(I,K) * B(K,J)
          end do
        end do
      end do

Locality Problem
- Suppose your code looks like:

      DO J=1,N
        DO I=1,N
          A(I,J)=B(J,I)
        ENDDO
      ENDDO

- The loop as it is typed above does not have unit-stride access on loads.
- If you interchange the loops, the code doesn't have unit-stride access on stores.
- Use the optimized, intrinsic-function transpose from the FORTRAN compiler instead of hand-coding it.

Grouping Data Together
- Consider the following code segment:

      d=0.0
      do I=1,n
        j=index(I)
        d = d + sqrt(x(j)*x(j) + y(j)*y(j) + z(j)*z(j))
      end do

- Since the arrays are accessed with indirect addressing, it is likely that 3 new cache lines need to be brought into the cache for each iteration of the loop.
- Modify the code by grouping together x, y, and z into a 2-dimensional array named r. The code has been modified for cache reuse:

      d=0.0
      do I=1,n
        j=index(I)
        d = d + sqrt(r(1,j)*r(1,j) + r(2,j)*r(2,j) + r(3,j)*r(3,j))
      end do

- Since r(1,j), r(2,j), and r(3,j) are contiguous in memory, it is likely they will be in one cache line. Hence, 1 cache line, rather than 3, is brought in for each iteration of I.

Cache Thrashing Example
- This example thrashes a 4MB direct mapped cache:

      parameter (max = 1024*1024)
      common /xyz/ a(max), b(max)
      do I=1,max
        something = a(I) + b(I)
      enddo

- The cache lines for both a and b have the same cache address.
- To avoid cache thrashing in this example, pad common with the size of a cache line:

      parameter (max = 1024*1024)
      common /xyz/ a(max), extra(32), b(max)
      do I=1,max
        something = a(I) + b(I)
      enddo

- Improving cache utilization is often the key to getting good performance.

Not Enough Cache
- Ideally you want the inner loop's arrays and variables to fit into cache.
- If a scalar program won't fit in cache, its parallel version may fit in cache with a large enough number of processors.
- This often results in super-linear speedup.

Loop Blocking
- This technique is useful when the arrays are too large to fit into the cache.
- Loop blocking uses strip mining of loops and loop interchange.
- A blocked loop accesses array elements in sections that optimally fit in the cache, thus minimizing cache misses.
- It allows for spatial and temporal reuse of data.
- The following example (next slide) illustrates loop blocking of matrix multiplication.
- The code in the PRE column depicts the original code; the POST column depicts the code when it is blocked.

Loop Blocking

PRE:
      do k=1,n
        do j=1,n
          do i=1,n
            c(i,j)=c(i,j)+a(i,k)*b(k,j)
          enddo
        enddo
      enddo

POST:
      do kk=1,n,iblk
        do jj=1,n,iblk
          do ii=1,n,iblk
            do k=kk,kk+iblk-1
              do j=jj,jj+iblk-1
                do i=ii,ii+iblk-1
                  c(i,j)=c(i,j)+a(i,k)*b(k,j)
                enddo
              enddo
            enddo
          enddo
        enddo
      enddo

Further Information
- Computer Organization and Design: The Hardware/Software Interface, David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, Inc.
- Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers, Inc.
- The Cache Memory Book, Jim Handy, Academic Press, Inc.
- A Practitioner's Guide to RISC Microprocessor Architecture, Patrick H. Stakem, John Wiley & Sons, Inc.
- High Performance Computing, Charles Severance, O'Reilly and Associates, Inc.
- Tutorial on Optimization of Fortran, John Levesque, Applied Parallel Research
- Intel(R) Architecture Optimization Reference Manual
- Intel(R) Itanium(R) Processor Manuals

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
8.1 Speedup
8.2 Speedup Extremes
8.3 Efficiency
8.4 Amdahl's Law
8.5 Speedup Limitations
8.6 Benchmarks
8.7 Summary
9 About the IBM Regatta P690

Parallel Performance Analysis
- Now that you have parallelized your code and have run it on a parallel computer using multiple processors, you may want to know the performance gain that parallelization has achieved.
- This chapter describes how to compute parallel code performance.
- Often the performance gain is not perfect, and this chapter also explains some of the reasons for limitations on parallel performance.
- Finally, this chapter covers the kinds of information you should provide in a benchmark, and some sample benchmarks are given.

Speedup
- The speedup of your code tells you how much performance gain is achieved by running your program in parallel on multiple processors.
- A simple definition is that it is the length of time it takes a program to run on a single processor divided by the time it takes to run on multiple processors.
- Speedup generally ranges between 0 and p, where p is the number of processors.
- Scalability
  - When you compute with multiple processors in a parallel environment, you will also want to know how your code scales.
  - The scalability of a parallel code is defined as its ability to achieve performance proportional to the number of processors used.
  - As you run your code with more and more processors, you want to see the performance of the code continue to improve.
  - Computing speedup is a good way to measure how a program scales as more processors are used.

Speedup
- Linear Speedup
  - If it takes one processor an amount of time t to do a task and if p processors can do the task in time t / p, then you have perfect or linear speedup (Sp = p).
  - That is, running with 4 processors improves the time by a factor of 4, running with 8 processors improves the time by a factor of 8, and so on.
  - This is shown in the following illustration.

Speedup Extremes
- The extremes of speedup happen when speedup is
  - greater than p, called super-linear speedup,
  - less than 1.
- Super-Linear Speedup
  - You might wonder how super-linear speedup can occur. How can speedup be greater than the number of processors used?
  - The answer usually lies with the program's memory use. When using multiple processors, each processor only gets part of the problem compared to the single processor case. It is possible that the smaller problem can make better use of the memory hierarchy, that is, the cache and the registers. For example, the smaller problem may fit in cache when the entire problem would not.
  - When super-linear speedup is achieved, it is often an indication that the sequential code, run on one processor, had serious cache miss problems.
  - The most common programs that achieve super-linear speedup are those that solve dense linear algebra problems.

Speedup Extremes
- Parallel Code Slower than Sequential Code
  - When speedup is less than one, it means that the parallel code runs slower than the sequential code.
  - This happens when there isn't enough computation to be done by each processor.
  - The overhead of creating and controlling the parallel threads outweighs the benefits of parallel computation, and it causes the code to run slower.
  - To eliminate this problem you can try to increase the problem size or run with fewer processors.

Efficiency
- Efficiency is a measure of parallel performance that is closely related to speedup and is often also presented in a description of the performance of a parallel program.
- Efficiency with p processors is defined as the ratio of speedup with p processors to p.
- Efficiency is a fraction that usually ranges between 0 and 1.
- Ep = 1 corresponds to perfect speedup of Sp = p.
- You can think of efficiency as describing the average speedup per processor.
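A minimal C sketch of these two definitions (the timings used in the example are made up):

      #include <stdio.h>

      /* Speedup: single-processor time divided by p-processor time. */
      double speedup(double t1, double tp)           { return t1 / tp; }

      /* Efficiency: speedup divided by the number of processors used. */
      double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

      int main(void)
      {
          /* Hypothetical timings: 120 s on 1 processor, 20 s on 8 processors. */
          printf("Sp = %.2f, Ep = %.2f\n",
                 speedup(120.0, 20.0), efficiency(120.0, 20.0, 8));  /* Sp = 6.00, Ep = 0.75 */
          return 0;
      }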

Amdahl's Law
- An alternative formula for speedup is named Amdahl's Law, attributed to Gene Amdahl, one of America's great computer scientists.
- This formula, introduced in the 1980s, states that no matter how many processors are used in a parallel run, a program's speedup will be limited by its fraction of sequential code.
- That is, almost every program has a fraction of the code that doesn't lend itself to parallelism.
- This is the fraction of code that will have to be run with just one processor, even in a parallel run.
- Amdahl's Law defines speedup with p processors as follows:

      Sp = 1 / ( f + (1 - f)/p )

- Where the term f stands for the fraction of operations done sequentially with just one processor, and the term (1 - f) stands for the fraction of operations done in perfect parallelism with p processors.

Amdahl's Law
- The sequential fraction of code, f, is a unitless measure ranging between 0 and 1.
- This shows that Amdahl's speedup ranges between 1 and p, where p is the number of processors used in a parallel processing run.
- When f is 1, meaning there is no parallel code, then speedup is 1, or there is no benefit from parallelism. This can be seen by substituting f = 1 in the formula above, which results in Sp = 1.
- When f is 0, meaning there is no sequential code, then speedup is p, or perfect parallelism. This can be seen by substituting f = 0 in the formula above, which results in Sp = p.
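As a quick worked example of the formula above (the numbers are hypothetical): with f = 0.1 and p = 8, Sp = 1/(0.1 + 0.9/8) is about 4.7, so even 10% sequential code keeps the 8-processor speedup well below 8. A minimal C sketch:

      #include <stdio.h>

      /* Amdahl's Law: speedup with p processors when a fraction f of the
       * operations must run sequentially. */
      double amdahl_speedup(double f, int p)
      {
          return 1.0 / (f + (1.0 - f) / p);
      }

      int main(void)
      {
          /* Hypothetical case: 10% sequential code on 8 processors. */
          printf("Sp = %.2f\n", amdahl_speedup(0.1, 8));   /* prints Sp = 4.71 */
          return 0;
      }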

Amdahl's Law
- The interpretation of Amdahl's Law is that speedup is limited by the fact that not all parts of a code can be run in parallel.
- Substituting in the formula, when the number of processors goes to infinity, your code's speedup is still limited by 1 / f.
- Amdahl's Law shows that the sequential fraction of code has a strong effect on speedup.
- This helps to explain the need for large problem sizes when using parallel computers.
- It is well known in the parallel computing community that you cannot take a small application and expect it to show good performance on a parallel computer.
- To get good performance, you need to run large applications, with large data array sizes, and lots of computation.
- The reason for this is that as the problem size increases, the opportunity for parallelism grows, the sequential fraction shrinks, and it shrinks in its importance for speedup.

Agenda
8 Parallel Performance Analysis
8.1 Speedup
8.2 Speedup Extremes
8.3 Efficiency
8.4 Amdahl's Law
8.5 Speedup Limitations
8.5.1 Memory Contention Limitation
8.5.2 Problem Size Limitation
8.6 Benchmarks
8.7 Summary

Speedup Limitations
- This section covers some of the reasons why a program doesn't get perfect speedup. Some of the reasons for limitations on speedup are:
  - Too much I/O
    - Speedup is limited when the code is I/O bound.
    - That is, when there is too much input or output compared to the amount of computation.
  - Wrong algorithm
    - Speedup is limited when the numerical algorithm is not suitable for a parallel computer.
    - You need to replace it with a parallel algorithm.
  - Too much memory contention
    - Speedup is limited when there is too much memory contention.
    - You need to redesign the code with attention to data locality.
    - Cache reutilization techniques will help here.

Speedup Limitations
  - Too much sequential code
    - Speedup is limited when there's too much sequential code.
    - This is shown by Amdahl's Law.
  - Too much parallel overhead
    - Speedup is limited when there is too much parallel overhead compared to the amount of computation.
    - These are the additional CPU cycles accumulated in creating parallel regions, creating threads, synchronizing threads, spin/blocking threads, and ending parallel regions.
  - Load imbalance
    - Speedup is limited when the processors have different workloads.
    - The processors that finish early will be idle while they are waiting for the other processors to catch up.
  - Wrong problem size
    - Speedup is limited when the problem size is too small to take best advantage of a parallel computer.
    - In addition, speedup is limited when the problem size is fixed.
    - That is, when the problem size doesn't grow as you compute with more processors.

Memory Contention Limitation
- Gene Golub, a professor of Computer Science at Stanford University, writes in his book on parallel computing that the best way to define memory contention is with the word delay.
- When different processors all want to read or write into the main memory, there is a delay until the memory is free.
- On the SGI Origin2000 computer, you can determine whether your code has memory contention problems by using SGI's perfex utility.
  - The perfex utility is covered in the Cache Tuning lecture in this course.
  - You can also refer to SGI's manual page, man perfex, for more details.
- On the Linux clusters, you can use the hardware performance counter tools to get information on memory performance.
  - On the IA32 platform, use perfex, hpmcount, vprof, psrun/perfsuite.
  - On the IA64 platform, use pfmon, vprof, psrun/perfsuite.

Memory Contention Limitation
- Many of these tools can be used with the PAPI performance counter interface.
- Be sure to refer to the man pages and webpages on the NCSA website for more information.
- If the output of the utility shows that memory contention is a problem, you will want to use some programming techniques for reducing memory contention.
- A good way to reduce memory contention is to access elements from the processor's cache memory instead of the main memory.
- Some programming techniques for doing this are:
  - Access arrays with unit stride.
  - Order nested do loops (in Fortran) so that the innermost loop index is the leftmost index of the arrays in the loop. For the C language, the order is the opposite of Fortran (see the sketch after this list).
  - Pad common blocks.
  - Avoid specific array sizes that are the same as the size of the data cache or that are exact fractions or exact multiples of the size of the data cache.
- These techniques are called cache tuning optimizations. The details for performing these code modifications are covered in the section on Cache Optimization of this lecture.
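A minimal C sketch of the unit-stride rule (the array name and size are illustrative): C stores arrays in row-major order, so the innermost loop should run over the rightmost index, the opposite of the Fortran convention described above.

      #define N 1024

      void scale_matrix(double a[N][N], double s)
      {
          /* Row-major C: a[i][j] and a[i][j+1] are adjacent in memory,
           * so keeping j innermost gives unit-stride (stride-one) access. */
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  a[i][j] *= s;
      }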

Problem Size Limitation
- Small Problem Size
  - Speedup is almost always an increasing function of problem size.
  - If there's not enough work to be done by the available processors, the code will show limited speedup.
  - The effect of small problem size on speedup is shown in the following illustration.

Problem Size Limitation
- Fixed Problem Size
  - When the problem size is fixed, you can reach a point of negative returns when using additional processors.
  - As you compute with more and more processors, each processor has less and less amount of computation to perform.
  - The additional parallel overhead, compared to the amount of computation, causes the speedup curve to start turning downward as shown in the following figure.

Benchmarks
- It will finally be time to report the parallel performance of your application code.
- You will want to show a speedup graph with the number of processors on the x axis and speedup on the y axis.
- Some other things you should report and record are:
  - the date you obtained the results
  - the problem size
  - the computer model
  - the compiler and the version number of the compiler
  - any special compiler options you used

Benchmarks
- When doing computational science, it is often helpful to find out what kind of performance your colleagues are obtaining.
- In this regard, NCSA has a compilation of parallel performance benchmarks online at http://www.ncsa.uiuc.edu/UserInfo/Perf/NCSAbench/.
- You might be interested in looking at these benchmarks to see how other people report their parallel performance.
- In particular, the NAMD benchmark is a report about the performance of the NAMD program that does molecular dynamics simulations.

Summary
- There are many good texts on parallel computing which treat the subject of parallel performance analysis. Here are two useful references:
  - Scientific Computing: An Introduction with Parallel Computing, Gene Golub and James Ortega, Academic Press, Inc.
  - Parallel Computing: Theory and Practice, Michael J. Quinn, McGraw-Hill, Inc.

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
7 Cache Tuning
8 Parallel Performance Analysis
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information

About the IBM Regatta P690
- To obtain your program's top performance, it is important to understand the architecture of the computer system on which the code runs.
- This chapter describes the architecture of NCSA's IBM p690.
- Technical details on the size and design of the processors, memory, cache, and the interconnect network are covered, along with technical specifications for the compute rate, memory size and speed, and interconnect bandwidth.

IBM p690 General Overview
- The p690 is IBM's latest Symmetric Multi-Processor (SMP) machine with Distributed Shared Memory (DSM).
- This means that memory is physically distributed and logically shared.
- It is based on the Power4 architecture and is a successor to the Power3-II based RS/6000 SP system.
- IBM p690 Scalability
  - The IBM p690 is a flexible, modular, and scalable architecture.
  - It scales in these terms:
    - Number of processors
    - Memory size
    - I/O and memory bandwidth and the interconnect bandwidth

Agenda
9 About the IBM Regatta P690
9.1 IBM p690 General Overview
9.2 IBM p690 Building Blocks
9.2.1 Power4 Core
9.2.2 Multi-Chip Modules
9.2.3 The Processor
9.2.4 Cache Architecture
9.2.5 Memory Subsystem
9.3 Features Performed by the Hardware
9.4 The Operating System
9.5 Further Information

IBM p690 Building Blocks
- An IBM p690 system is built from a number of fundamental building blocks.
- The first of these building blocks is the Power4 Core, which includes the processors and L1 and L2 caches.
- At NCSA, four of these Power4 Cores are linked to form a Multi-Chip Module. This module includes the L3 cache, and four Multi-Chip Modules are linked to form a 32-processor system (see figure on the next slide).
- Each of these components will be described in the following sections.

32-processor IBM p690 configuration (Image courtesy of IBM) .

Power4 Core
- The Power4 Chip contains:
  - Two processors
  - Local caches (L1)
  - External cache for each processor (L2)
  - I/O and interconnect interfaces

The POWER4 chip (Image courtesy of IBM)

Multi-Chip Modules
- Four Power4 Chips are assembled to form a Multi-Chip Module (MCM) that contains 8 processors.
- Each MCM also supports the L3 cache for each Power4 chip.

Multiple MCM interconnection (Image courtesy of IBM)

The Processor
- The processors at the heart of the Power4 Core are speculative, superscalar, out-of-order execution chips.
- The Power4 is a 4-way superscalar RISC architecture running instructions on its 8 pipelined execution units.
- Speed of the Processor
  - The NCSA IBM p690 has CPUs running at 1.3 GHz.
- 64-Bit Processor Execution Units
  - There are 8 independent, fully pipelined execution units:
    - 2 load/store units for memory access
    - 2 identical floating point execution units capable of fused multiply/add
    - 2 fixed point execution units
    - 1 branch execution unit
    - 1 logic operation unit

The Processor
- The units are capable of 4 floating point operations, fetching 8 instructions and completing 5 instructions per cycle.
- Each processor is capable of handling up to 200 in-flight instructions.
- Performance Numbers (reproduced in the sketch below):
  - Peak Performance:
    - 4 floating point instructions per cycle
    - 1.3 Gcycles/sec * 4 flop/cycle yields 5.2 GFLOPS
  - MIPS Rating:
    - 5 instructions per cycle
    - 1.3 Gcycles/sec * 5 instructions/cycle yields 6,500 MIPS
- Instruction Set
  - The instruction set (ISA) on the IBM p690 is the PowerPC AS instruction set.
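The peak-rate arithmetic above is easy to reproduce. The short program below simply multiplies the quoted clock rate by the quoted per-cycle issue rates; the 32-processor extrapolation is an illustration of the same arithmetic, not a vendor figure.

    /* peak_rate.c - reproduces the peak-rate arithmetic quoted above. */
    #include <stdio.h>

    int main(void)
    {
        double clock_gcycles = 1.3;  /* Gcycles/sec                       */
        double flops_per_cyc = 4.0;  /* floating point instructions/cycle */
        double insns_per_cyc = 5.0;  /* completed instructions/cycle      */

        double gflops = clock_gcycles * flops_per_cyc;          /* 5.2  */
        double mips   = clock_gcycles * insns_per_cyc * 1000.0; /* 6500 */

        printf("peak per processor: %.1f GFLOPS, %.0f MIPS\n", gflops, mips);
        printf("peak per 32-processor p690 node: %.1f GFLOPS\n", 32 * gflops);
        return 0;
    }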

Cache Architecture
- Each Power4 Core has both a primary (L1) cache associated with each processor and a secondary (L2) cache shared between the two processors. In addition, each Multi-Chip Module has an L3 cache.
- Level 1 Cache
  - The Level 1 cache is in the processor core. It has split instruction and data caches.
  - L1 Instruction Cache - the properties of the Instruction Cache are:
    - 64KB in size
    - direct mapped
    - cache line size is 128 bytes
  - L1 Data Cache - the properties of the L1 Data Cache are:
    - 32KB in size
    - 2-way set associative
    - FIFO replacement policy
    - 2-way interleaved
    - cache line size is 128 bytes
- Peak speed is achieved when the data accessed in a loop is entirely contained in the L1 data cache (see the sketch below).
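As a rough illustration of that last point, the loop below works on two arrays whose combined size (16 KB) fits comfortably inside the 32 KB L1 data cache, so repeated sweeps find their data already resident. The array size and repeat count are illustrative assumptions, not tuned values.

    /* l1_resident.c - sketch: a loop whose whole working set fits in the
     * 32 KB L1 data cache.  Sizes are illustrative only.                 */
    #include <stdio.h>

    #define N 1024              /* 2 arrays * 1024 doubles = 16 KB < 32 KB */

    static double x[N], y[N];

    int main(void)
    {
        double sum = 0.0;

        for (int i = 0; i < N; i++) {   /* first sweep brings x and y into L1 */
            x[i] = i;
            y[i] = 0.5 * i;
        }

        for (int rep = 0; rep < 10000; rep++)   /* later sweeps hit in L1 */
            for (int i = 0; i < N; i++)
                sum += x[i] * y[i];

        printf("sum = %f\n", sum);
        return 0;
    }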

Cache Architecture
- Level 2 Cache on the Power4 Chip
  - When the processor can't find a data element in the L1 cache, it looks in the L2 cache. The properties of the L2 Cache are:
    - external from the processor
    - unified instruction and data cache
    - 1.41MB per Power4 chip (2 processors)
    - 8-way set associative
    - split between 3 controllers
    - cache line size is 128 bytes
    - pseudo LRU replacement policy for cache coherence
    - 124.8 GB/s peak bandwidth from L2

Cache Architecture
- Level 3 Cache on the Multi-Chip Module
  - When the processor can't find a data element in the L2 cache, it looks in the L3 cache. The properties of the L3 Cache are:
    - external from the Power4 Core
    - unified instruction and data cache
    - 128MB per Multi-Chip Module (8 processors)
    - 8-way set associative
    - cache line size is 512 bytes
    - 55.5 GB/s peak bandwidth from L2

Memory Subsystem
- The total memory is physically distributed among the Multi-Chip Modules of the p690 system (see the diagram on the next slide).
- Memory Latencies
  - The latency penalties for each of the levels of the memory hierarchy are (a generic way to observe them is sketched below):
    - L1 Cache: 4 cycles
    - L2 Cache: 14 cycles
    - L3 Cache: 102 cycles
    - Main Memory: 400 cycles
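Latency differences of this kind can be made visible with a pointer-chasing loop that walks a buffer one cache line at a time through dependent loads. The sketch below is a generic microbenchmark, not an NCSA tool; the buffer size, hop count, and timing method are rough assumptions, and hardware prefetching and TLB effects will keep the measured number from matching the raw cycle counts exactly.

    /* latency_chase.c - generic pointer-chasing sketch for exposing the
     * gap between cache and main-memory latency.  Parameters are rough
     * assumptions, not a calibrated benchmark.                          */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define STRIDE 16              /* 16 size_t elements = 128 bytes = 1 line */

    int main(void)
    {
        size_t n = (size_t)1 << 25;            /* 32M elements: larger than L3 */
        size_t *buf = malloc(n * sizeof *buf);
        if (!buf) return 1;

        for (size_t i = 0; i < n; i++)         /* each element points one      */
            buf[i] = (i + STRIDE) % n;         /* cache line further ahead     */

        long hops = 100000000L;
        clock_t t0 = clock();
        size_t idx = 0;
        for (long h = 0; h < hops; h++)
            idx = buf[idx];                    /* dependent load chain         */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("about %.1f ns per load (idx=%zu)\n", secs * 1e9 / hops, idx);
        free(buf);
        return 0;
    }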

Memory distribution within an MCM .

Agenda
9 About the IBM Regatta P690
  9.1 IBM p690 General Overview
  9.2 IBM p690 Building Blocks
  9.3 Features Performed by the Hardware
  9.4 The Operating System
  9.5 Further Information

Features Performed by the Hardware
- The following is done completely by the hardware, transparent to the user:
  - Global memory addressing (makes the system memory shared)
  - Address resolution
  - Maintaining cache coherency
  - Automatic page migration from remote to local memory (to reduce interconnect memory transactions)

The Operating System
- The operating system is AIX. NCSA's p690 system is currently running version 5.1 of AIX.
- Version 5.1 is a full 64-bit file system.
- Compatibility
  - AIX 5.1 is highly compatible with both BSD and System V Unix.

Further Information
- Computer Architecture: A Quantitative Approach
  - John Hennessy, et al., Morgan Kaufman Publishers, 2nd Edition, 1996
- Computer Organization and Design: The Hardware/Software Interface
  - David A. Patterson, et al., Morgan Kaufman Publishers, 2nd Edition, 1997
- IBM P Series [595] at the URL:
  - http://www-03.ibm.com/systems/p/hardware/highend/590/index.html
- IBM p690 Documentation at NCSA at the URL:
  - http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/IBMp690/
