COMP 206: Computer Architecture and Implementation
Montek Singh
Thu, Jan 22, 2009
Quantitative Principles of Computer Design
An introduction to design and analysis, built on five principles:
Take Advantage of Parallelism
Principle of Locality
Focus on the Common Case
Amdahl’s Law
The Processor Performance Equation
1) Taking Advantage of Parallelism (examples)
Increase throughput of a server computer via multiple processors or multiple disks
Detailed HW design:
Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand (see the sketch after this list)
Multiple memory banks searched in parallel in set-associative caches
Pipelining (next slides)
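A minimal sketch of the carry-lookahead idea (illustrative Python, not from the slides; function names and bit layout are my own): carries are computed with an associative generate/propagate operator, so a log-depth parallel-prefix network can replace the linear ripple chain.

```python
def carries(a_bits, b_bits, carry_in=0):
    """Carry into each bit (little-endian); last entry is the carry-out."""
    # (generate, propagate) per bit; index 0 is a virtual bit holding carry_in
    gp = [(carry_in, 0)] + [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]

    def combine(hi, lo):  # associative: merge two adjacent (g, p) groups
        return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

    # Kogge-Stone-style parallel prefix: the combines within each level are
    # independent, which is where hardware turns the linear carry chain
    # into a logarithmic one.
    dist = 1
    while dist < len(gp):
        gp = [gp[i] if i < dist else combine(gp[i], gp[i - dist])
              for i in range(len(gp))]
        dist *= 2
    return [g for g, _ in gp]

a, b = [1, 0, 1, 1], [1, 1, 0, 1]        # 13 + 11, little-endian
c = carries(a, b)
s = [(x ^ y) ^ c[i] for i, (x, y) in enumerate(zip(a, b))] + [c[-1]]
# s == [0, 0, 0, 1, 1]  ->  24
```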
Pipelining
Overlap instruction execution to reduce the total time to complete an instruction sequence.
Not every instruction depends on its immediate predecessor ⇒ executing instructions completely or partially in parallel is possible.
Classic 5-stage pipeline:
1) Instruction Fetch (Ifetch)
2) Register Read (Reg)
3) Execute (ALU)
4) Data Memory Access (Dmem)
5) Register Write (Reg)
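A quick back-of-envelope sketch (assumed numbers, not from the slides) of why the overlap pays off: with a 5-stage pipeline and no hazards, n instructions take depth + (n − 1) cycles instead of 5n.

```python
STAGES = ["Ifetch", "Reg", "ALU", "Dmem", "Reg write"]

def unpipelined_cycles(n, depth=len(STAGES)):
    return n * depth                # each instruction runs start to finish

def pipelined_cycles(n, depth=len(STAGES)):
    return depth + (n - 1)          # fill the pipe once, then one per cycle

n = 1000
print(unpipelined_cycles(n))        # 5000
print(pipelined_cycles(n))          # 1004 -> speedup approaches 5x
```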
Pipelined Instruction Execution
[Diagram: four instructions in program order, each flowing through Ifetch → Reg → ALU → DMem → Reg, overlapped one clock cycle apart; time (in clock cycles) runs along the horizontal axis.]
Limits to Pipelining
Hazards prevent the next instruction from executing during its designated clock cycle:
Structural hazards: attempt to use the same hardware to do two different things at once
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
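A rough sketch of what hazards cost, using the standard stall-based model (the numbers here are assumed): every stall cycle raises the effective CPI above the ideal 1.0 and eats into the pipeline speedup.

```python
def pipeline_speedup(depth, stalls_per_instr, ideal_cpi=1.0):
    # speedup over the unpipelined machine = depth / (CPI with stalls)
    return depth * ideal_cpi / (ideal_cpi + stalls_per_instr)

print(pipeline_speedup(5, 0.0))     # 5.0   (no hazards)
print(pipeline_speedup(5, 0.5))     # ~3.33 (a stall every other instruction)
```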
Increasing Clock Rate
Pipelining is also used for this
Clock rate is determined by gate delays
[Diagram: a latch or register, followed by combinational logic, feeding the next latch/register; the slowest such stage sets the clock period.]
2) The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space at any one time, and tend to reuse data.
Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
For the last 30 years, hardware has relied on locality for memory performance. A sketch of the two access patterns follows.
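An illustrative sketch (mine, not from the slides) of spatial locality: both loops visit the same elements, but the row-order loop walks adjacent addresses in a row-major array, while the column-order loop strides by a full row, which on real hardware misses in the cache far more often.

```python
NROWS, NCOLS = 1000, 1000
a = [0] * (NROWS * NCOLS)      # row-major layout: (r, c) lives at r*NCOLS + c

def sum_row_order():           # touches consecutive addresses: good locality
    return sum(a[r * NCOLS + c] for r in range(NROWS) for c in range(NCOLS))

def sum_col_order():           # strides by NCOLS each step: poor locality
    return sum(a[r * NCOLS + c] for c in range(NCOLS) for r in range(NROWS))
```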
Levels of the Memory Hierarchy
(Upper levels are smaller, faster, and costlier per byte; lower levels are larger and slower.)

CPU registers: 100s of bytes; 300-500 ps (0.3-0.5 ns)
  ↕ instruction operands, 1-8 bytes (staged by program/compiler)
L1 and L2 caches: 10s-100s of KBytes; ~1 ns - ~10 ns; $1000s/GByte
  ↕ blocks: 32-64 bytes into L1, 64-128 bytes into L2 (staged by cache cntl)
Main memory: GBytes; 80 ns - 200 ns; ~$100/GByte
  ↕ pages, 4K-8K bytes (staged by OS)
Disk: 10s of TBytes; 10 ms (10,000,000 ns); ~$1/GByte
  ↕ files, MBytes (staged by user/operator)
Tape: infinite capacity; sec-min; ~$1/GByte
3) Focus on the Common Case
In making a design trade-off, favor the frequent case over the infrequent case
e.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
e.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
The frequent case is often simpler and can be made faster than the infrequent case
e.g., overflow is rare when adding two numbers, so improve performance by optimizing the more common case of no overflow
This may slow down overflow handling, but overall performance improves by optimizing for the normal case
What the frequent case is, and how much performance improves by making it faster ⇒ Amdahl’s Law
4) Amdahl’s Law (History, 1967)
Historical context
Amdahl was demonstrating “the continued validity of
the single processor approach and of the weaknesses
of the multiple processor approach”
Paper contains no mathematical formulation, just
arguments and simulation
“The nature of this overhead appears to be sequential so
that it is unlikely to be amenable to parallel processing
techniques.”
“A fairly obvious conclusion which can be drawn at this
point is that the effort expended on achieving high parallel
performance rates is wasted unless it is accompanied by
achievements in sequential processing rates of very nearly
the same magnitude.”
Nevertheless, it is of widespread applicability in all kinds of situations.
G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities”, AFIPS Conference Proceedings, pp. 483-485, April 1967.
http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
Speedup
Book shows two forms of the speedup eqn:

    Speedup_overall = ExTime_new / ExTime_old

    Speedup_overall = ExTime_old / ExTime_new

We will use the second because you get “speedup” factors like 2X.
4) Amdahl’s Law

    ExTime_new = ExTime_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
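The formula translates directly into a small helper (a sketch; the names are mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of old ExTime is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)
```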
Amdahl’s Law example
New CPU is 10X faster
I/O-bound server, so 60% of time is spent waiting for I/O

    Speedup_overall = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
                    = 1 / [ (1 − 0.4) + 0.4 / 10 ]
                    = 1 / 0.64 = 1.56

It’s human nature to be attracted by “10X faster”, vs. keeping in perspective that it’s just 1.6X faster.
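As a check, the amdahl_speedup sketch above reproduces this number:

```python
print(amdahl_speedup(0.4, 10))      # 1.5625 -- about 1.56x overall
```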
Amdahl’s Law for Multiple Tasks

Average execution rate (performance):

    R_avg = 1 / Σ_i (F_i / R_i),   where Σ_i F_i = 1

F_i = fraction of results generated at rate R_i.
Note: F_i is NOT the “fraction of time spent working at this rate”.
Units: F_i is dimensionless [1] and R_i is in results/second, so each term F_i/R_i is seconds per result and R_avg comes out in results/second.
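This weighted harmonic mean is easy to mechanize; a small sketch (my naming):

```python
def average_rate(fractions, rates):
    """Overall rate when fractions[i] of all results come at rates[i]."""
    assert abs(sum(fractions) - 1.0) < 1e-9     # the F_i must sum to 1
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))
```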
Example
30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, 50% at 100 MFLOPS. What is the average performance in MFLOPS? What is the bottleneck?

    R_avg = 1 / (0.3/1 + 0.2/10 + 0.5/100)
          = 100 / (30 + 2 + 0.5)
          = 100 / 32.5 = 3.08 MFLOPS

Shares of total time: 30/32.5 = 92.3%, 2/32.5 = 6.2%, 0.5/32.5 = 1.5%

[Bar chart: the three time shares on a 0-to-1 scale; the 1-MFLOPS work dominates.]
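Using the average_rate sketch from the previous slide to reproduce the numbers:

```python
fractions, rates = [0.3, 0.2, 0.5], [1.0, 10.0, 100.0]     # rates in MFLOPS
print(average_rate(fractions, rates))                       # ~3.08 MFLOPS
terms = [f / r for f, r in zip(fractions, rates)]           # seconds/result
print([t / sum(terms) for t in terms])  # ~[0.923, 0.062, 0.015]: bottleneck
```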
Another Example
Which change is more effective on a certain machine: speeding up 10-fold the floating point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and the two changes are mutually exclusive.)

F_sqrt = fraction of FP sqrt results;  R_sqrt = rate of producing FP sqrt results
F_non-sqrt = fraction of non-sqrt results;  R_non-sqrt = rate of producing non-sqrt results
F_fp = fraction of FP results;  R_fp = rate of producing FP results
F_non-fp = fraction of non-FP results;  R_non-fp = rate of producing non-FP results

Given (from the stated time fractions):

    F_non-sqrt / R_non-sqrt = 4 × (F_sqrt / R_sqrt)     [sqrt is 20% of time]
    F_non-fp / R_non-fp = F_fp / R_fp                   [FP is 50% of time]
Solution using Amdahl’s Law
Let x = F_sqrt / R_sqrt and y = F_fp / R_fp.

Speeding up sqrt 10-fold:

    R_after = 1 / (F_sqrt / (10 × R_sqrt) + F_non-sqrt / R_non-sqrt)
            = 1 / (0.1x + 4x) = 1 / (4.1x)

Speeding up all FP 2-fold:

    R_after = 1 / (F_fp / (2 × R_fp) + F_non-fp / R_non-fp)
            = 1 / (0.5y + y) = 1 / (1.5y)

    R_after / R_before = [1 / (1.5y)] / [1 / (2y)] = 2 / 1.5 = 1.33

For the sqrt option, R_before = 1 / (x + 4x) = 1 / (5x), so its speedup is only 5/4.1 ≈ 1.22. The 2-fold FP improvement (1.33X) is therefore more effective.
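Both answers also fall out of the time-based form of Amdahl’s Law; using the amdahl_speedup sketch from earlier:

```python
print(amdahl_speedup(0.2, 10))   # ~1.22: sqrt 10x faster, 20% of time
print(amdahl_speedup(0.5, 2))    # ~1.33: all FP 2x faster, 50% of time
```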
Implications of Amdahl’s Law
The improvement provided by a feature is limited by how often that feature is used
As stated, Amdahl’s Law is valid only if the system always works with exactly one of the rates
Overlap between CPU and I/O operations? Amdahl’s Law as given here is not applicable
The bottleneck is the most promising target for improvements
“Make the common case fast”
Infrequent events, even if they consume a lot of time, will make little difference to performance
Typical use: change only one parameter of the system, and compute the effect of this change
The same program, with the same input data, should run on the machine in both cases
5) Processor Performance

    CPU time [sec/program] = CPU cycles for program [cycles/program] × clock cycle time [sec/cycle]

or

    CPU time [sec/program] = CPU cycles for program [cycles/program] / clock rate [cycles/sec]
CPI – Clocks per Instruction

    CPI [clock cycles/instruction] = CPU cycles for program [cycles/program] / instruction count [instructions/program]
Details of CPI

    CPI = Σ_i CPI_i × (I_i / instruction count)

    CPI × instruction count = Σ_i (CPI_i × I_i)

    CPU performance = clock rate / Σ_i (CPI_i × I_i)

(I_i = number of instructions of class i; CPI_i = clocks per instruction for class i.)
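A sketch of this per-class bookkeeping (helper names are mine):

```python
def average_cpi(instr_counts, cpis):
    """Weighted CPI: sum(CPI_i * I_i) / instruction count."""
    return (sum(n * c for n, c in zip(instr_counts, cpis))
            / sum(instr_counts))

def cpu_time_seconds(instr_counts, cpis, clock_rate_hz):
    """CPU time = total cycles / clock rate."""
    return sum(n * c for n, c in zip(instr_counts, cpis)) / clock_rate_hz
```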
Processor Performance Eqn

    CPU time = Instruction count × CPI × Clock cycle time
Processor Performance Eqn
How can we improve performance?
                                          Clock rate   CPI   Instruction count
Hardware technology (realization)             x
Hardware organization (implementation)        x          x
Instruction set (architecture)                           x          x
Compiler technology                                      x          x
Program                                                  x          x
Example 1
A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?
Example 1 (Solution)
Before the change:

Instruction type   Frequency   CPI
ALU ops            0.43        1
Loads              0.21        2
Stores             0.12        2
Branches           0.24        2

    CPI = 0.43 × 1 + (0.21 + 0.12 + 0.24) × 2 = 1.57
    CPU time = IC × CPI × Clock cycle time = IC × 1.57 × T = 1.57 × IC × T

After the change, with x = 0.43 / 4 = 0.1075 (the converted ALU ops):

Instruction type   Frequency            CPI
ALU ops            (0.43 − x)/(1 − x)   1
Loads              (0.21 − x)/(1 − x)   2
Stores             0.12/(1 − x)         2
Branches           0.24/(1 − x)         3
Reg-mem ops        x/(1 − x)            2

    CPI = [ (0.43 − x) × 1 + (0.21 − x + 0.12 + x) × 2 + 0.24 × 3 ] / (1 − x)
        = 1.7025 / 0.8925 = 1.908

    CPU time = (1 − x) × IC × 1.908 × T = 1.703 × IC × T

Since 1.703 × IC × T > 1.57 × IC × T, the change actually hurts performance.
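A quick mechanical check of the arithmetic above, mirroring the slide’s algebra (IC and T normalized to 1; a sketch):

```python
freq = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}

cycles_before = freq["alu"] * 1 + (freq["load"] + freq["store"]
                                   + freq["branch"]) * 2          # 1.57

x = 0.25 * freq["alu"]                 # 0.1075: ALU ops folded into reg-mem
cycles_after = ((freq["alu"] - x) * 1                          # remaining ALU
                + (freq["load"] - x + freq["store"] + x) * 2   # loads/stores/reg-mem
                + freq["branch"] * 3)                          # slower branches
print(cycles_before, cycles_after)     # 1.57 vs 1.7025 cycles per old instr.
```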
Example 2
A load-store machine has the characteristics shown in Example 1’s table. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?
Example 2 (Solution)
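The worked solution did not survive in this copy of the slides. The following is a hedged reconstruction, assuming the same instruction mix and CPIs as Example 1 (0.43/0.21/0.12/0.24 with CPIs 1/2/2/2):

```python
CLOCK_HZ = 500e6                             # 500 MHz, i.e. a 2 ns cycle
cpi = {"alu": 1, "load": 2, "store": 2, "branch": 2}

def mips_and_time(counts):
    cycles = sum(counts[k] * cpi[k] for k in counts)
    avg_cpi = cycles / sum(counts.values())
    return CLOCK_HZ / (avg_cpi * 1e6), cycles / CLOCK_HZ   # (MIPS, seconds)

unopt = {"alu": 0.43, "load": 0.21, "store": 0.12, "branch": 0.24}
opt = dict(unopt, alu=unopt["alu"] / 2)      # compiler drops half the ALU ops

print(mips_and_time(unopt))   # ~318 MIPS
print(mips_and_time(opt))     # ~290 MIPS, yet *less* execution time
# The optimized code has the lower MIPS rating but runs faster:
# the MIPS ranking disagrees with the execution-time ranking.
```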
Performance of (Blocking) Caches

With no cache misses:
    CPU time = CPU cycles × Clock cycle time

With cache misses:
    CPU time = (CPU cycles + Memory stall cycles) × Clock cycle time
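A sketch of the second equation, using the usual decomposition of stall cycles into accesses × miss rate × miss penalty (that decomposition is the textbook convention, not shown on this slide; the example numbers are mine):

```python
def cpu_time(cpu_cycles, mem_accesses, miss_rate, miss_penalty_cycles,
             clock_cycle_time):
    stall_cycles = mem_accesses * miss_rate * miss_penalty_cycles
    return (cpu_cycles + stall_cycles) * clock_cycle_time

# e.g. 1e9 CPU cycles, 3e8 accesses, 2% misses, 100-cycle penalty, 2 ns clock
print(cpu_time(1e9, 3e8, 0.02, 100, 2e-9))   # 3.2 s vs the 2.0 s ideal
```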
Fallacies and Pitfalls
Fallacies: commonly held misconceptions. When discussing a fallacy, we try to give a counterexample.
Pitfalls: easily made mistakes, often generalizations of principles that are true only in a limited context.
We present fallacies and pitfalls to help you avoid these errors.
Fallacies and Pitfalls (1/3)
Fallacy: benchmarks remain valid indefinitely.
Once a benchmark becomes popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark: “benchmarksmanship.”
Of the 70 benchmarks from the 5 SPEC releases, 70% were dropped from the next release because they were no longer useful.
Pitfall: a single point of failure.
Rule of thumb for fault-tolerant systems: make sure that every component is redundant so that no single component failure can bring down the whole system (e.g., the power supply).
Fallacies and Pitfalls (2/3)
Fallacy: the rated MTTF of disks is 1,200,000 hours, or ≈ 140 years, so disks practically never fail.
Disk lifetime is ~5 years ⇒ replace a disk every 5 years; at that rate, the average disk would outlive 28 replacement cycles (140 years is a long time!).
Is that meaningful?
A better unit: the % that fail within 5 years (next slide).
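A back-of-envelope sketch of that better unit (the 5-year service life is from the slide; the exponential failure model is my assumption):

```python
import math

mttf_hours = 1.2e6                   # rated MTTF from the fallacy
service_hours = 5 * 365 * 24         # 5-year disk lifetime
# Fraction of disks expected to fail during one service life:
print(1 - math.exp(-service_hours / mttf_hours))   # ~3.6%
```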
Fallacies and Pitfalls (3/3)
References
G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities”, AFIPS Conference Proceedings, pp. 483-485, April 1967.
http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf