
TEST-1 SOLUTIONS

Subject: Advanced Computer Architecture

PART-1
Answer any one full question.

1) Give Flynn's classification of various computer architectures. Clearly explain the features
of each with conceptual diagrams.

(10 Marks)

Sol: Michael Flynn introduced a classification of computer architectures based on the
notions of instruction and data streams. They are:
1. SISD (single instruction stream over a single data stream)
2. SIMD (single instruction stream over multiple data streams)
3. MIMD (multiple instruction streams over multiple data streams)
4. MISD (multiple instruction streams over a single data stream)

1. SISD (single instruction stream over a single data stream):
Conventional sequential machines are called SISD computers as shown in Fig 1a.
CU = control unit, PU = processing unit, MU = memory unit, IS = instruction stream, DS = data stream
[Figure: the CU issues an instruction stream (IS) to the PU; the PU exchanges a data stream (DS) with the MU; I/O is attached to the MU.]
Fig 1a: SISD uniprocessor architecture


2. SIMD (single instruction stream over multiple data streams):
Vector computers, which are equipped with scalar and vector hardware, are called SIMD
computers, as shown in Fig 1b.

PE = processing element, LM = local memory
[Figure: the control unit (CU) broadcasts a single instruction stream (IS) to the processing elements PE1 … PEn; each PEi operates on its own data stream (DS) from its local memory LMi. The program is loaded from the host, and the data sets are distributed to the local memories from the host.]
Fig 1b: SIMD architecture (with distributed memory)
3. MIMD (multiple instruction streams over multiple data streams):
The term parallel computer is usually reserved for MIMD machines, as shown in Fig 1c.
[Figure: control units CU1 … CUn each issue an instruction stream (IS) to processing units PU1 … PUn; the PUs exchange data streams (DS) with a shared memory, and I/O is attached through the shared memory.]
Fig 1c: MIMD architecture (with shared memory)


4. MISD (multiple instruction streams over a single data stream):
MISD machines are modeled in Fig 1d.

The same data stream flows through an array of processors executing different
instruction streams.

[Figure: a memory (program and data) supplies a single data stream (DS) that flows through the processing units PU1, PU2, …, PUn in sequence; each PUi executes a different instruction stream (IS) issued by its own control unit CUi. I/O is attached to the memory.]
Fig 1d: MISD architecture (the systolic array)

Of the four machine models, most parallel computers assume the MIMD model
for general-purpose computations.
The SIMD and MISD models are more suitable for special-purpose computations.
Therefore MIMD is the most popular model, SIMD comes next, and MISD is the
least popular model.
2) a) A 40 MHz processor was supposed to execute 200000 instructions with the following
instruction mix and CPI needed for each instruction type:

Instruction type        CPI     Instruction mix
Integer arithmetic       2          60%
Data transfer            4          18%
Floating point           6          12%
Control transfer         5          10%

Determine the effective CPI, MIPS rate and execution time.


(5 Marks)
Sol:
CPI = (number of clock cycles) / (number of instructions)

Effective CPI = [(60/100 × Ic) × 2 + (18/100 × Ic) × 4 + (12/100 × Ic) × 6 + (10/100 × Ic) × 5] / Ic
             = 1.2 + 0.72 + 0.72 + 0.5
Effective CPI = 3.14 clock cycles/instruction.

MIPS rate = Ic / (T × 10^6)
          = Ic / (Ic × CPI × τ × 10^6)          where τ = 1/(40 × 10^6) s is the clock cycle time
          = 1 / (3.14 × (1/(40 × 10^6)) × 10^6)
MIPS rate = 12.7388 MIPS.

Execution time:
T = Ic × CPI × τ
  = 200 × 10^3 × 3.14 × 1/(40 × 10^6)
T = 15.7 ms.
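These figures can be cross-checked with a short script. Below is a minimal Python sketch (not part of the original solution); the mix percentages, per-type CPI values, clock rate and instruction count are taken from the question above:

```python
# Minimal sketch: effective CPI, MIPS rate and execution time
# for the instruction mix given in question 2(a).

clock_hz = 40e6              # 40 MHz processor
instruction_count = 200_000

# (fraction of instructions, CPI) per instruction type
mix = [
    (0.60, 2),   # integer arithmetic
    (0.18, 4),   # data transfer
    (0.12, 6),   # floating point
    (0.10, 5),   # control transfer
]

effective_cpi = sum(frac * cpi for frac, cpi in mix)           # 3.14
mips_rate = clock_hz / (effective_cpi * 1e6)                   # ~12.74
execution_time = instruction_count * effective_cpi / clock_hz  # ~0.0157 s

print(f"Effective CPI : {effective_cpi:.2f}")
print(f"MIPS rate     : {mips_rate:.2f}")
print(f"Execution time: {execution_time * 1e3:.1f} ms")
```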
2) b) Differentiate between implicit and explicit parallelism with a neat sketch.
(5 Marks)
Sol:
Implicit parallelism:
An implicit approach uses a conventional language, such as C, Fortran, Lisp or Pascal,
to write the source program.
The sequentially coded source program is translated into parallel object code by a
parallelizing compiler.
As shown in Fig 5 (a), this compiler must be able to detect parallelism and assign
target machine resources.
This compiler approach has been applied in programming shared-memory
multiprocessors.
This approach requires less effort on the part of the programmer.

[Figure flow: Programmer → source code written in sequential C, Fortran, Lisp, or Pascal → parallelizing compiler → parallel object code → execution by runtime system.]

Fig 5 (a): Implicit Parallelism


Explicit parallelism:
This approach, as shown in Fig 5 (b), requires more effort by the programmer to
develop the source program.
Parallelism is explicitly specified in the user program.
This significantly reduces the burden on the compiler to detect parallelism.
Instead, the compiler needs to preserve the parallelism and, where possible, assign
target machine resources.

[Figure flow: Programmer → source code written in concurrent dialects of C, Fortran, Lisp, or Pascal → concurrency-preserving compiler → concurrent object code → execution by runtime system.]

Fig 5 (b): Explicit Parallelism
PART-2

3) Explain UMA and NUMA Model of Shared-Memory Multiprocessors with a neat


diagram.
(10 Marks)
Sol:
The multiprocessor parallel models are
i) Uniform memory access model [UMA].
ii) Non-uniform memory access model [NUMA].
i)

Uniform memory access model [UMA]:

In this model the physical memory is uniformly shared by all processors.
All processors have equal access time to all memory words.
Each processor uses a private cache.
Multiprocessors are tightly coupled systems due to the high degree of resource sharing.
The system interconnect takes the form of a common bus, a crossbar switch or a
multistage network.
The UMA model is suitable for general-purpose, time-sharing applications by multiple
users.
Coordination of parallel events, synchronization and communication among
processors are done through shared variables.
When all the processors have equal access time to all the peripheral devices, the
system is called a symmetric multiprocessor.
In this case all the processors are equally capable of running the executive programs.
In an asymmetric multiprocessor, only one or a subset of processors are executive-
capable.
The remaining processors have no I/O capability and thus are called attached
processors.
An executive or a master processor can execute the OS and handle I/O.
Attached processors execute user code under the supervision of the master processor.

[Figure: processors P1, P2, …, Pn are connected through the system interconnect (bus, crossbar, or multistage network) to shared-memory modules SM1 … SMn and to I/O.]
Fig 2: The UMA multiprocessor model.
ii)

Non-uniform memory access model [NUMA]:

A NUMA multiprocessor is a shared-memory system in which the access time varies
with the location of the memory word.
Two NUMA machine models are shown in Fig 3 (a) and (b).
The shared memory is physically distributed to all processors as local memories.
The collection of all local memories forms a global address space accessible by all
processors.
It is faster to access a local memory with a local processor. Access to remote
memory attached to other processors takes longer due to the added delay through the
interconnection network.
Besides distributed memories, globally shared memory can be added to a
multiprocessor system.
In this case there are three memory access patterns:
a. Local memory access (fastest).
b. Global memory access.
c. Remote memory access (slowest).
In the hierarchical cluster model, processors are divided into several clusters.
Each cluster is itself a UMA or a NUMA multiprocessor.
The clusters are connected to global shared-memory modules. The entire system is
considered a NUMA multiprocessor.
All processors belonging to the same cluster are allowed to uniformly access the
cluster shared-memory modules. All clusters have equal access to the global memory.
The access time to the cluster memory is shorter than that to the global memory.
(a) Shared local memories: processors P1 … Pn, each with its own local memory LM1 … LMn, are connected through an interconnection network.
(b) A hierarchical cluster model: within each cluster (Cluster 1 … Cluster N), the processors P access cluster shared-memory (CSM) modules through a cluster interconnection network (CIN); all clusters are connected through a global interconnection network to the global shared-memory (GSM) modules.
Fig 3: Two NUMA models for multiprocessor systems.

Answer any two full questions.


4) Explain the architecture of a vector supercomputer with a neat diagram.
Sol:
The architecture of a vector supercomputer is shown in Fig 4.
The vector processor is built on top of the scalar processor.
The vector processor is attached to the scalar processor as an optional feature.
Program and data are first loaded into the main memory through a host computer.
All instructions are first decoded by the scalar control unit.
If the decoded instruction is a scalar operation or a program control operation, it
will be directly executed by the scalar processor using the scalar functional

pipelines.
If the instruction is decoded as a vector operation, it will be sent to the vector
control unit. This control unit will supervise the flow of vector data between the

main memory and the vector functional pipelines. The vector data flow is
coordinated by the control unit. A number of vector functional pipelines may be
built into a vector processor.
In a vector supercomputer, the vector processor can be built on one of two
architectures, namely
1. Register-to-register architecture
2. Memory-to-memory architecture
Register-to-register architecture:

Here vector registers are used to hold the vector operands, intermediate results and
final vector results.
The vector functional pipelines retrieve operands from and put results into the
vector registers.
All vector registers are programmable in user instructions.
Each vector register is equipped with a component counter which keeps track
of the component registers used in successive pipeline cycles.
In general, there are a fixed number of vector registers and functional pipelines
in a vector processor.

Memory-to-memory architecture:
In this architecture, vector operands and intermediate results are copied directly
into the main memory and are retrieved from the memory as and when they are required.

[Figure: the host computer loads program and data from mass storage into the main memory (program and data). The scalar processor decodes all instructions with its scalar control unit and executes scalar instructions in its scalar functional pipelines; vector instructions are passed to the vector control unit of the vector processor, which drives the vector functional pipelines and coordinates the data flow with the main memory. I/O connects the machine to the user.]

Fig 4: The architecture of a vector supercomputer.

5) a) Explain different types of data dependency with an example


b) Draw the data dependency graph for the following.
S1: Load R1, M(100)
S2: Move R2, R1
S3: Inc R1
S4: Add R2, R1
S5: Store M(100), R1

(5+5Marks)

Sol:
a) There are 5 types of data dependences. They are as follows:

(1) Flow dependence:
A statement S2 is flow-dependent on statement S1 if an execution path exists
from S1 to S2 and if at least one output of S1 feeds in as input to S2 (denoted S1 → S2).
Ex:
S1: Load R1, A
S2: Add R2, R1
Here S2 is flow-dependent on S1 through register R1.

(2) Anti dependence:
Statement S2 is anti-dependent on statement S1 if S2 follows S1 in program order
and if the output of S2 overlaps the input to S1.
Ex:
S1: Add R2, R1
S2: Move R1, R3
Here S2 is anti-dependent on S1 through register R1 (S1 reads R1, S2 writes R1).

(3) Output dependence:
Two statements are output-dependent if they produce (write) the same output variable.
Ex:
S1: Load R1, A
S2: Move R1, R3
Here S1 and S2 are output-dependent through register R1 (both write R1).

(4) I/O dependence:
Read and write are I/O statements. I/O dependence occurs not because the
same variable is involved but because the same file is referenced by both I/O
statements.

(5) Unknown dependence:
The dependence relation between two statements cannot be determined in the
following situations:
The subscript of a variable is itself subscripted.
The subscript does not contain the loop index variable.
A variable appears more than once with subscripts having different coefficients of
the loop variable.
The subscript is nonlinear in the loop index variable.
When one or more of these conditions exist, a conservative assumption is to
claim unknown dependence among the statements involved.

b) Draw the data dependency graph for the following.


S1: Load R1, M(100)
S2: Move R2, R1
S3: Inc R1
S4: Add R2, R1
S5: Store M(100), R1
Sol: The data dependence graph is as shown below.

[Graph: nodes S1–S5. Flow dependences through R1 and R2: S1 → S2, S1 → S3, S2 → S4, S3 → S4 and S3 → S5. In addition, S3 is anti-dependent on S2 (S2 reads R1, S3 writes it), S3 is output-dependent on S1 (both write R1), S4 is output-dependent on S2 (both write R2), and S5 is anti-dependent on S1 through M(100).]
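The graph above can be cross-checked mechanically. The following is a minimal Python sketch (not part of the original solution); the read/write sets are hand-coded from the five instructions, using the destination-first convention of the listing:

```python
# Minimal sketch: derive flow/anti/output dependences from
# hand-coded read/write sets of the five instructions in 5(b).

instrs = {
    "S1": {"reads": {"M(100)"},    "writes": {"R1"}},      # Load R1, M(100)
    "S2": {"reads": {"R1"},        "writes": {"R2"}},      # Move R2, R1
    "S3": {"reads": {"R1"},        "writes": {"R1"}},      # Inc R1
    "S4": {"reads": {"R2", "R1"},  "writes": {"R2"}},      # Add R2, R1
    "S5": {"reads": {"R1"},        "writes": {"M(100)"}},  # Store M(100), R1
}

order = ["S1", "S2", "S3", "S4", "S5"]

for i, a in enumerate(order):
    for b in order[i + 1:]:
        ra, wa = instrs[a]["reads"], instrs[a]["writes"]
        rb, wb = instrs[b]["reads"], instrs[b]["writes"]
        if wa & rb:
            print(f"{b} is flow-dependent on {a} via {wa & rb}")
        if ra & wb:
            print(f"{b} is anti-dependent on {a} via {ra & wb}")
        if wa & wb:
            print(f"{a} and {b} are output-dependent via {wa & wb}")
```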

PART-3
Answer any Two full questions.
6) Trace out the following program to detect the parallelism using Bernstein's conditions
P1: C = D x E
P2: M= G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
Assume that each step requires one cycle to execute and two adders are available.
Compare serial and parallel execution of the above program.
(10 Marks)
Sol: Bernstein revealed a set of conditions based on which two processes can execute in
parallel.
P1, P2 - processes
I1, I2 - input sets
O1, O2 - output sets
P1 || P2 if and only if
  I1 ∩ O2 = ∅
  I2 ∩ O1 = ∅
  O1 ∩ O2 = ∅
P1 || P2 || … || Pk if and only if Pi || Pj for all i ≠ j.

[Figure: dataflow graphs of P1–P5.
Fig (a): Sequential execution in 5 steps.
Fig (b): Parallel execution in 3 steps (with two adders available).]

Checking the pairs with Bernstein's conditions:
P1 || P5, P2 || P3, P2 || P5, P4 || P5, P3 || P5.
Collectively, P2 || P3 || P5, because P2 || P3, P2 || P5 and P3 || P5 all hold pairwise.
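The pairwise tests can also be automated. This is a minimal Python sketch (illustrative only); the input and output sets are read off the five statements P1–P5 above:

```python
# Minimal sketch: pairwise Bernstein's-condition check for the
# five statements in question 6.

from itertools import combinations

# (inputs, outputs) per process
procs = {
    "P1": ({"D", "E"}, {"C"}),   # C = D * E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G / E
}

def parallel(p, q):
    """True if p || q under Bernstein's three conditions."""
    (i1, o1), (i2, o2) = procs[p], procs[q]
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

for p, q in combinations(procs, 2):
    if parallel(p, q):
        print(f"{p} || {q}")
# Prints: P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5
```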

7)
Explain hardware and software parallelism with an example.
(10Marks)
Sol:
Hardware parallelism:
This refers to parallelism defined by machine architecture and hardware multiplicity.
One way to characterize the parallelism is by the number of instruction issues per
machine cycle.
If a processor issues k instructions per machine cycle, it is called a k-issue processor.
A conventional processor takes one or more machine cycles to issue a single instruction.
Such processors are called one-issue machines, with a single instruction pipeline in the processor.
A multiprocessor system built with n k-issue processors should be able to handle a
maximum of nk threads of instructions simultaneously.
Software parallelism:
It is defined by the control and data dependences of programs.
The degree of parallelism is revealed in the program profile or in the program flow graph.

Software parallelism can be achieved by algorithms, programming style and compiler
optimization.
Parallelism in a program varies during the execution period.
Control parallelism:
It is a kind of software parallelism. It appears in the form of pipelining or multiple
functional units. Since both pipelining and functional parallelism are handled by the hardware,
the programmer needs to take no special action to invoke them.
Data parallelism:

It offers the highest potential for concurrency.


It is practiced in both SIMD and MIMD modes on MPP systems.
Data parallel code is easier to write and to debug than control parallel code.
Synchronization in SIMD data parallelism is handled by the hardware.
Data parallelism exploits parallelism in proportion to the quantity of data involved.
Example: Assuming two multiplier units and two add/subtract units, calculate the average
software parallelism.
Then, assuming a 2-issue processor in which one memory access (load/store) and one
arithmetic operation can execute simultaneously, calculate the average hardware parallelism.
[Fig (a): Software parallelism — the dataflow graph of 8 operations (loads L1–L4, followed by arithmetic operations producing A and B) completes in 3 cycles: 1 cycle with 4 operations, then 2 cycles with 2 operations each.]
Software parallelism = 8 operations / 3 cycles = 2.67 instructions per cycle.


[Fig (b): Hardware parallelism — on the 2-issue processor the same 8 operations (L1–L4 and the arithmetic operations producing A and B) are spread over 7 cycles, since at most one memory access and one arithmetic operation can be issued per cycle.]
7 cycles and 8 operations.
Hardware parallelism = 8 operations / 7 cycles = 1.14 instructions per cycle.
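The two averages are simply total operations divided by total cycles. The tiny Python sketch below reproduces them; the per-cycle operation counts are assumptions made for illustration (only the totals of 8 operations over 3 and 7 cycles matter):

```python
# Minimal sketch: average parallelism = total operations / total cycles.

software_schedule = [4, 2, 2]              # ops per cycle in Fig (a): 3 cycles, 8 ops
hardware_schedule = [1, 1, 2, 1, 1, 1, 1]  # assumed ops per cycle in Fig (b): 7 cycles, 8 ops

def average_parallelism(ops_per_cycle):
    return sum(ops_per_cycle) / len(ops_per_cycle)

print(f"Software parallelism: {average_parallelism(software_schedule):.2f}")  # 2.67
print(f"Hardware parallelism: {average_parallelism(hardware_schedule):.2f}")  # 1.14
```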
8) Explain how grain packing can be done to compute the sum of the 4 elements in the
resulting product matrix C = A × B, where A and B are 2×2 matrices. Assume the grain size
for multiplication is 101 and the grain size for addition is 8.
(10 Marks)
Sol:
Grain size of a multiplication (A × B node) = 101; grain size of an addition (A + B node) = 8.

C = A × B, where

A = | A11  A12 |      B = | B11  B12 |      C = | C11  C12 |
    | A21  A22 |          | B21  B22 |          | C21  C22 |

C11 = A11 B11 + A12 B21
C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21
C22 = A21 B12 + A22 B22
SUM, C = C11 + C12 + C21 + C22

Fine grain graph:
Each of the 8 multiplications (grain size 101 each) and each of the 7 additions (grain size 8 each) is a separate node, ending in the node that produces SUM.

Coarse grain graph:
The fine-grain nodes are packed into five coarse nodes: U, V, W and X each compute one element Cij (two multiplications and one addition), and Y computes the final SUM (three additions).
Grain size of U = 101 + 101 + 8 = 210
Grain size of V = 210
Grain size of W = 210
Grain size of X = 210
Grain size of Y = 8 + 8 + 8 = 24
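The packed grain sizes can be tallied with a short script. This is an illustrative Python sketch (the groupings follow the coarse-grain packing described above; MUL and ADD are the given grain sizes):

```python
# Minimal sketch: grain sizes after packing the fine-grain graph of
# C = A x B (2x2) plus the final SUM into coarse nodes U, V, W, X, Y.

MUL = 101  # grain size of one multiplication
ADD = 8    # grain size of one addition

def grain_size(n_mul, n_add):
    """Total grain size of a node containing n_mul multiplies and n_add adds."""
    return n_mul * MUL + n_add * ADD

# U..X each compute one element Cij = Ai1*B1j + Ai2*B2j
coarse_nodes = {name: grain_size(2, 1) for name in ("U", "V", "W", "X")}
# Y computes SUM = C11 + C12 + C21 + C22 (three additions)
coarse_nodes["Y"] = grain_size(0, 3)

for name, size in coarse_nodes.items():
    print(f"Grain size of {name} = {size}")   # U..X: 210, Y: 24
```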

****

También podría gustarte