COMP 206: Computer Architecture and Implementation
Montek Singh
Thu, Jan 22, 2009
Quantitative Principles of Computer Design
An introduction to design and analysis, built on five principles:
Take Advantage of Parallelism
Principle of Locality
Focus on the Common Case
Amdahl’s Law
The Processor Performance Equation
1) Taking Advantage of Parallelism (examples)
Increase throughput of a server computer via multiple processors or multiple disks
Detailed HW design:
Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand (see the sketch after this list)
Multiple memory banks searched in parallel in set-associative caches
Pipelining (next slides)
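A minimal sketch of the carry-lookahead idea (illustrative Python, not from the slides; function names and bit layout are my own): carries are computed with an associative generate/propagate operator, so a log-depth parallel-prefix network can replace the linear ripple chain.

```python
def carries(a_bits, b_bits, carry_in=0):
    """Carry into each bit (little-endian); last entry is the carry-out."""
    # (generate, propagate) per bit; index 0 is a virtual bit holding carry_in
    gp = [(carry_in, 0)] + [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]

    def combine(hi, lo):  # associative: merge two adjacent (g, p) groups
        return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

    # Kogge-Stone-style parallel prefix: the combines within each level are
    # independent, which is where hardware turns the linear carry chain
    # into a logarithmic one.
    dist = 1
    while dist < len(gp):
        gp = [gp[i] if i < dist else combine(gp[i], gp[i - dist])
              for i in range(len(gp))]
        dist *= 2
    return [g for g, _ in gp]

a, b = [1, 0, 1, 1], [1, 1, 0, 1]        # 13 + 11, little-endian
c = carries(a, b)
s = [(x ^ y) ^ c[i] for i, (x, y) in enumerate(zip(a, b))] + [c[-1]]
# s == [0, 0, 0, 1, 1]  ->  24
```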
Pipelining
Overlap instruction execution to reduce the total time to complete an instruction sequence.
Not every instruction depends on its immediate predecessor ⇒ executing instructions completely or partially in parallel is possible.
Classic 5-stage pipeline:
1) Instruction Fetch (Ifetch)
2) Register Read (Reg)
3) Execute (ALU)
4) Data Memory Access (Dmem)
5) Register Write (Reg)
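A quick back-of-envelope sketch (assumed numbers, not from the slides) of why the overlap pays off: with a 5-stage pipeline and no hazards, n instructions take depth + (n − 1) cycles instead of 5n.

```python
STAGES = ["Ifetch", "Reg", "ALU", "Dmem", "Reg write"]

def unpipelined_cycles(n, depth=len(STAGES)):
    return n * depth                # each instruction runs start to finish

def pipelined_cycles(n, depth=len(STAGES)):
    return depth + (n - 1)          # fill the pipe once, then one per cycle

n = 1000
print(unpipelined_cycles(n))        # 5000
print(pipelined_cycles(n))          # 1004 -> speedup approaches 5x
```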
Pipelined Instruction Execution
[Diagram: four instructions in program order, each flowing through Ifetch → Reg → ALU → DMem → Reg, overlapped one clock cycle apart; time (in clock cycles) runs along the horizontal axis.]
Limits to Pipelining
Hazards prevent the next instruction from executing during its designated clock cycle:
Structural hazards: attempt to use the same hardware to do two different things at once
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
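A rough sketch of what hazards cost, using the standard stall-based model (the numbers here are assumed): every stall cycle raises the effective CPI above the ideal 1.0 and eats into the pipeline speedup.

```python
def pipeline_speedup(depth, stalls_per_instr, ideal_cpi=1.0):
    # speedup over the unpipelined machine = depth / (CPI with stalls)
    return depth * ideal_cpi / (ideal_cpi + stalls_per_instr)

print(pipeline_speedup(5, 0.0))     # 5.0   (no hazards)
print(pipeline_speedup(5, 0.5))     # ~3.33 (a stall every other instruction)
```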
Increasing Clock Rate
Pipelining is also used for this
Clock rate is determined by gate delays
[Diagram: a latch or register, followed by combinational logic, feeding the next latch/register; the slowest such stage sets the clock period.]
2) The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space at any one time, and tend to reuse data.
Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
For the last 30 years, hardware has relied on locality for memory performance. A sketch of the two access patterns follows.
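An illustrative sketch (mine, not from the slides) of spatial locality: both loops visit the same elements, but the row-order loop walks adjacent addresses in a row-major array, while the column-order loop strides by a full row, which on real hardware misses in the cache far more often.

```python
NROWS, NCOLS = 1000, 1000
a = [0] * (NROWS * NCOLS)      # row-major layout: (r, c) lives at r*NCOLS + c

def sum_row_order():           # touches consecutive addresses: good locality
    return sum(a[r * NCOLS + c] for r in range(NROWS) for c in range(NCOLS))

def sum_col_order():           # strides by NCOLS each step: poor locality
    return sum(a[r * NCOLS + c] for c in range(NCOLS) for r in range(NROWS))
```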
Levels of the Memory Hierarchy
(Upper levels are smaller, faster, and costlier per byte; lower levels are larger and slower.)

CPU registers: 100s of bytes; 300-500 ps (0.3-0.5 ns)
  ↕ instruction operands, 1-8 bytes (staged by program/compiler)
L1 and L2 caches: 10s-100s of KBytes; ~1 ns - ~10 ns; $1000s/GByte
  ↕ blocks: 32-64 bytes into L1, 64-128 bytes into L2 (staged by cache cntl)
Main memory: GBytes; 80 ns - 200 ns; ~$100/GByte
  ↕ pages, 4K-8K bytes (staged by OS)
Disk: 10s of TBytes; 10 ms (10,000,000 ns); ~$1/GByte
  ↕ files, MBytes (staged by user/operator)
Tape: infinite capacity; sec-min; ~$1/GByte
3) Focus on the Common Case
In making a design trade-off, favor the frequent case over the infrequent case
e.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
e.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
The frequent case is often simpler and can be made faster than the infrequent case
e.g., overflow is rare when adding two numbers, so improve performance by optimizing the more common case of no overflow
This may slow down overflow handling, but overall performance improves by optimizing for the normal case
What the frequent case is, and how much performance improves by making it faster ⇒ Amdahl’s Law
4) Amdahl’s Law (History, 1967)
Historical context
Amdahl was demonstrating “the continued validity of
the single processor approach and of the weaknesses
of the multiple processor approach”
Paper contains no mathematical formulation, just
arguments and simulation
“The nature of this overhead appears to be sequential so
that it is unlikely to be amenable to parallel processing
techniques.”
“A fairly obvious conclusion which can be drawn at this
point is that the effort expended on achieving high parallel
performance rates is wasted unless it is accompanied by
achievements in sequential processing rates of very nearly
the same magnitude.”
Nevertheless, it is of widespread applicability in all kinds of situations.
G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities”, AFIPS Conference Proceedings, pp. 483-485, April 1967.
http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
Speedup
Book shows two forms of the speedup eqn:

    Speedup_overall = ExTime_new / ExTime_old

    Speedup_overall = ExTime_old / ExTime_new

We will use the second because you get “speedup” factors like 2X.
4) Amdahl’s Law

    ExTime_new = ExTime_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
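The formula translates directly into a small helper (a sketch; the names are mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of old ExTime is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)
```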
Amdahl’s Law example
New CPU is 10X faster
I/O-bound server, so 60% of time is spent waiting for I/O

    Speedup_overall = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
                    = 1 / [ (1 − 0.4) + 0.4 / 10 ]
                    = 1 / 0.64 = 1.56

It’s human nature to be attracted by “10X faster”, vs. keeping in perspective that it’s just 1.6X faster.
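As a check, the amdahl_speedup sketch above reproduces this number:

```python
print(amdahl_speedup(0.4, 10))      # 1.5625 -- about 1.56x overall
```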
Amdahl’s Law for Multiple Tasks

Average execution rate (performance):

    R_avg = 1 / Σ_i (F_i / R_i),   where Σ_i F_i = 1

F_i = fraction of results generated at rate R_i.
Note: F_i is NOT the “fraction of time spent working at this rate”.
Units: F_i is dimensionless [1] and R_i is in results/second, so each term F_i/R_i is seconds per result and R_avg comes out in results/second.
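This weighted harmonic mean is easy to mechanize; a small sketch (my naming):

```python
def average_rate(fractions, rates):
    """Overall rate when fractions[i] of all results come at rates[i]."""
    assert abs(sum(fractions) - 1.0) < 1e-9     # the F_i must sum to 1
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))
```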
Example
30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, 50% at 100 MFLOPS. What is the average performance in MFLOPS? What is the bottleneck?

    R_avg = 1 / (0.3/1 + 0.2/10 + 0.5/100)
          = 100 / (30 + 2 + 0.5)
          = 100 / 32.5 = 3.08 MFLOPS

Shares of total time: 30/32.5 = 92.3%, 2/32.5 = 6.2%, 0.5/32.5 = 1.5%

[Bar chart: the three time shares on a 0-to-1 scale; the 1-MFLOPS work dominates.]
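Using the average_rate sketch from the previous slide to reproduce the numbers:

```python
fractions, rates = [0.3, 0.2, 0.5], [1.0, 10.0, 100.0]     # rates in MFLOPS
print(average_rate(fractions, rates))                       # ~3.08 MFLOPS
terms = [f / r for f, r in zip(fractions, rates)]           # seconds/result
print([t / sum(terms) for t in terms])  # ~[0.923, 0.062, 0.015]: bottleneck
```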
Another Example
Which change is more effective on a certain machine: speeding up 10-fold the floating point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)

F_sqrt = fraction of FP sqrt results;  R_sqrt = rate of producing FP sqrt results
F_non-sqrt = fraction of non-sqrt results;  R_non-sqrt = rate of producing non-sqrt results
F_fp = fraction of FP results;  R_fp = rate of producing FP results
F_non-fp = fraction of non-FP results;  R_non-fp = rate of producing non-FP results

Given (from the stated time fractions):

    F_non-sqrt / R_non-sqrt = 4 × (F_sqrt / R_sqrt)     [sqrt is 20% of time]
    F_non-fp / R_non-fp = F_fp / R_fp                   [FP is 50% of time]
Solution using Amdahl’s Law
Let x = F_sqrt / R_sqrt and y = F_fp / R_fp.

Speeding up sqrt 10-fold:

    R_after = 1 / (F_sqrt / (10 × R_sqrt) + F_non-sqrt / R_non-sqrt)
            = 1 / (0.1x + 4x) = 1 / (4.1x)

Speeding up all FP 2-fold:

    R_after = 1 / (F_fp / (2 × R_fp) + F_non-fp / R_non-fp)
            = 1 / (0.5y + y) = 1 / (1.5y)

    R_after / R_before = [1 / (1.5y)] / [1 / (2y)] = 2 / 1.5 = 1.33

For the sqrt option, R_before = 1 / (x + 4x) = 1 / (5x), so its speedup is only 5/4.1 ≈ 1.22. The 2-fold FP improvement (1.33X) is therefore more effective.
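Both answers also fall out of the time-based form of Amdahl’s Law; using the amdahl_speedup sketch from earlier:

```python
print(amdahl_speedup(0.2, 10))   # ~1.22: sqrt 10x faster, 20% of time
print(amdahl_speedup(0.5, 2))    # ~1.33: all FP 2x faster, 50% of time
```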
Implications of Amdahl’s Law
The improvement provided by a feature is limited by how often that feature is used
As stated, Amdahl’s Law is valid only if the system always works with exactly one of the rates
Overlap between CPU and I/O operations? Amdahl’s Law as given here is not applicable
The bottleneck is the most promising target for improvements
“Make the common case fast”
Infrequent events, even if they consume a lot of time, will make little difference to performance
Typical use: change only one parameter of the system, and compute the effect of this change
The same program, with the same input data, should run on the machine in both cases
5) Processor Performance

    CPU time [sec/program] = CPU cycles for program [cycles/program] × clock cycle time [sec/cycle]

or

    CPU time [sec/program] = CPU cycles for program [cycles/program] / clock rate [cycles/sec]
CPI – Clocks per Instruction

    CPI [clock cycles/instruction] = CPU cycles for program [cycles/program] / instruction count [instructions/program]
Details of CPI

    CPI = Σ_i CPI_i × (I_i / instruction count)

    CPI × instruction count = Σ_i (CPI_i × I_i)

    CPU performance = clock rate / Σ_i (CPI_i × I_i)

(I_i = number of instructions of class i; CPI_i = clocks per instruction for class i.)
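A sketch of this per-class bookkeeping (helper names are mine):

```python
def average_cpi(instr_counts, cpis):
    """Weighted CPI: sum(CPI_i * I_i) / instruction count."""
    return (sum(n * c for n, c in zip(instr_counts, cpis))
            / sum(instr_counts))

def cpu_time_seconds(instr_counts, cpis, clock_rate_hz):
    """CPU time = total cycles / clock rate."""
    return sum(n * c for n, c in zip(instr_counts, cpis)) / clock_rate_hz
```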
Processor Performance Eqn

    CPU time = Instruction count × CPI × Clock cycle time
Processor Performance Eqn
How can we improve performance?
                                          Clock rate   CPI   Instruction count
Hardware technology (realization)             x
Hardware organization (implementation)        x          x
Instruction set (architecture)                           x          x
Compiler technology                                      x          x
Program                                                  x          x
Example 1
A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?
Example 1 (Solution)
Before the change:

Instruction type   Frequency   CPI
ALU ops            0.43        1
Loads              0.21        2
Stores             0.12        2
Branches           0.24        2

    CPI = 0.43 × 1 + (0.21 + 0.12 + 0.24) × 2 = 1.57
    CPU time = IC × CPI × Clock cycle time = IC × 1.57 × T = 1.57 × IC × T

After the change, with x = 0.43 / 4 = 0.1075 (the converted ALU ops):

Instruction type   Frequency            CPI
ALU ops            (0.43 − x)/(1 − x)   1
Loads              (0.21 − x)/(1 − x)   2
Stores             0.12/(1 − x)         2
Branches           0.24/(1 − x)         3
Reg-mem ops        x/(1 − x)            2

    CPI = [ (0.43 − x) × 1 + (0.21 − x + 0.12 + x) × 2 + 0.24 × 3 ] / (1 − x)
        = 1.7025 / 0.8925 = 1.908

    CPU time = (1 − x) × IC × 1.908 × T = 1.703 × IC × T

Since 1.703 × IC × T > 1.57 × IC × T, the change actually hurts performance.
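A quick mechanical check of the arithmetic above, mirroring the slide’s algebra (IC and T normalized to 1; a sketch):

```python
freq = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}

cycles_before = freq["alu"] * 1 + (freq["load"] + freq["store"]
                                   + freq["branch"]) * 2          # 1.57

x = 0.25 * freq["alu"]                 # 0.1075: ALU ops folded into reg-mem
cycles_after = ((freq["alu"] - x) * 1                          # remaining ALU
                + (freq["load"] - x + freq["store"] + x) * 2   # loads/stores/reg-mem
                + freq["branch"] * 3)                          # slower branches
print(cycles_before, cycles_after)     # 1.57 vs 1.7025 cycles per old instr.
```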
Example 2
A load-store machine has the characteristics shown in Example 1’s table. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?
Example 2 (Solution)
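The worked solution did not survive in this copy of the slides. The following is a hedged reconstruction, assuming the same instruction mix and CPIs as Example 1 (0.43/0.21/0.12/0.24 with CPIs 1/2/2/2):

```python
CLOCK_HZ = 500e6                             # 500 MHz, i.e. a 2 ns cycle
cpi = {"alu": 1, "load": 2, "store": 2, "branch": 2}

def mips_and_time(counts):
    cycles = sum(counts[k] * cpi[k] for k in counts)
    avg_cpi = cycles / sum(counts.values())
    return CLOCK_HZ / (avg_cpi * 1e6), cycles / CLOCK_HZ   # (MIPS, seconds)

unopt = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}
opt = dict(unopt, alu=unopt["alu"] / 2)      # compiler drops half the ALU ops

print(mips_and_time(unopt))   # ~318 MIPS
print(mips_and_time(opt))     # ~290 MIPS, yet *less* execution time
# The optimized code has the lower MIPS rating but runs faster:
# the MIPS ranking disagrees with the execution-time ranking.
```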
Performance of (Blocking) Caches

With no cache misses:
    CPU time = CPU cycles × Clock cycle time

With cache misses:
    CPU time = (CPU cycles + Memory stall cycles) × Clock cycle time
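A sketch of the second equation, using the usual decomposition of stall cycles into accesses × miss rate × miss penalty (that decomposition is the textbook convention, not shown on this slide; the example numbers are mine):

```python
def cpu_time(cpu_cycles, mem_accesses, miss_rate, miss_penalty_cycles,
             clock_cycle_time):
    stall_cycles = mem_accesses * miss_rate * miss_penalty_cycles
    return (cpu_cycles + stall_cycles) * clock_cycle_time

# e.g. 1e9 CPU cycles, 3e8 accesses, 2% misses, 100-cycle penalty, 2 ns clock
print(cpu_time(1e9, 3e8, 0.02, 100, 2e-9))   # 3.2 s vs the 2.0 s ideal
```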
Fallacies and Pitfalls
Fallacies: commonly held misconceptions. When discussing a fallacy, we try to give a counterexample.
Pitfalls: easily made mistakes, often generalizations of principles that are true only in a limited context.
We present fallacies and pitfalls to help you avoid these errors.
Fallacies and Pitfalls (1/3)
Fallacy: benchmarks remain valid indefinitely.
Once a benchmark becomes popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark: “benchmarksmanship.”
Of the 70 benchmarks from the 5 SPEC releases, 70% were dropped from the next release because they were no longer useful.
Pitfall: a single point of failure.
Rule of thumb for fault-tolerant systems: make sure that every component is redundant so that no single component failure can bring down the whole system (e.g., the power supply).
Fallacies and Pitfalls (2/3)
Fallacy: the rated MTTF of disks is 1,200,000 hours, or ≈ 140 years, so disks practically never fail.
Disk lifetime is ~5 years ⇒ replace a disk every 5 years; at that rate, the average disk would outlive 28 replacement cycles (140 years is a long time!).
Is that meaningful?
A better unit: the % that fail within 5 years (next slide).
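A back-of-envelope sketch of that better unit (the 5-year service life is from the slide; the exponential failure model is my assumption):

```python
import math

mttf_hours = 1.2e6                   # rated MTTF from the fallacy
service_hours = 5 * 365 * 24         # 5-year disk lifetime
# Fraction of disks expected to fail during one service life:
print(1 - math.exp(-service_hours / mttf_hours))   # ~3.6%
```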
Fallacies and Pitfalls (3/3)
References
G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities”, AFIPS Conference Proceedings, pp. 483-485, April 1967.
http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf