
AMD K7 Processor Architecture
Introduction
• AMD K7 is the first 7th-generation PC CPU. The first six generations
were the 8086, 80286, 80386, 80486, Pentium (AMD K5/K6) and
Pentium II (AMD K6-2/K6-3). It is designed to operate above
500MHz.

• AMD K7, also known as AMD Athlon, was introduced in the first
half of 1999, and its architecture forms the basis for the
subsequent Athlon XP versions until the release of K8 (AMD
Hammer).

• Its competitor, the Intel Pentium III, was released in the same
year; the two processors are compared wherever
possible throughout the presentation.
Main Features
• Out-of-order, 3-way superscalar x86 microprocessor
• 9 independent execution pipelines, with a 10-stage
integer and a 15-stage FP pipeline:
– 3 Integer Execution Units
– 3 Address Calculation Units
– 3 Floating Point Execution Units
• 64kB instruction and 64kB data L1 caches
• Integrated L2 cache controller supporting up to 8MB
• Extended 3DNow! instructions
Main Features
• K7 uses the Digital™ Alpha™ EV6 system bus
interface. This is probably the most important
architectural difference from the previous
generations. EV6 provides:
- Data transfer on both rising and falling clock edges, resulting in
doubled bus speed
- Scalability beyond 200MHz (beyond 400MHz effective bus speed)
- Highest bandwidth of its time:
Athlon using 100MHz (x2) → 1.60 GB/s
PIII using 133MHz → 1.01 GB/s
- 72-bit (64 + 8 ECC) data bus
- Independent address bus able to address 8 terabytes
- Independent snoop bus
Main Features – EV6 cont.
- Low-voltage signaling for low-cost motherboard implementations
→ Motherboards with GeForce, Dolby and Ethernet available below $80.
- Point-to-point topology with clock forwarding for
scalable multiprocessing.
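
The bandwidth figures quoted for EV6 follow from simple arithmetic: bus clock times transfers per clock times bus width. The sketch below assumes a 64-bit (8-byte) data path for both buses and takes 1 GB as 10^9 bytes; the function name is illustrative only. With these assumptions the Athlon figure comes out at 1.60 GB/s, while the 133MHz PIII bus comes out near 1.06 GB/s, slightly above the 1.01 GB/s quoted above, so that figure presumably uses a different rounding or unit convention.

```python
def peak_bandwidth_gb_s(clock_mhz, transfers_per_clock, bus_bytes=8):
    """Peak bus bandwidth in GB/s, taking 1 GB as 10**9 bytes.

    bus_bytes=8 assumes a 64-bit data path (ECC bits carry no payload).
    """
    return clock_mhz * 1e6 * transfers_per_clock * bus_bytes / 1e9


# EV6 transfers data on both clock edges: 2 transfers per clock.
athlon_ev6 = peak_bandwidth_gb_s(100, 2)
# The PIII front-side bus transfers once per clock.
piii_bus = peak_bandwidth_gb_s(133, 1)

print(f"Athlon EV6 @ 100MHz x2: {athlon_ev6:.2f} GB/s")
print(f"PIII bus   @ 133MHz:    {piii_bus:.2f} GB/s")
```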
AMD K7 Processor Block Diagram
Cache Architecture
• Separate L1 instruction and data caches
• Both are 64kB, 64-bit, 2-way set associative and dual
ported, with a 24-entry (32-entry for the data cache) L1 TLB and a
256-entry L2 TLB.
• The instruction cache stores predecode information to assist the multiple
instruction decoders.
• The L2 cache controller can interface to up to 8MB of industry-
standard SDR or DDR SRAMs and provides full tags
for a 512kB cache or partial tags for larger caches.
The interface is 64 bits + 8 bits ECC.
Cache Competition
• AMD Athlon (1999):
– 2x64kB, 64-bit, 2-way, 3-cycle L1 cache with 64-byte lines
– 512kB, 64-bit, 2-way, 18-cycle off-chip L2 with 64-byte lines
• Intel PIII Katmai (1999):
– 2x16kB, 64-bit, 4-way, 3-cycle L1 cache with 32-byte lines
– 512kB, 64-bit, 2-way, 21-cycle off-chip L2 with 32-byte lines
• Intel PIII Coppermine (1999): L2 changed to
– 256kB, 256-bit, 8-way, 4-cycle on-chip
• AMD Athlon Thunderbird (2000): L2 changed to
– 256kB, 64-bit, 16-way, 7-cycle on-chip
– Exclusive cache structure, meaning that the L1 and L2
caches hold different data
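
The size, associativity and line length listed above fix the number of sets in each cache, and the exclusive structure fixes how much distinct data L1 and L2 can hold together. The following is a minimal sketch of that arithmetic. The function names are illustrative, the line sizes of the on-chip L2 variants are assumed unchanged from the off-chip versions, and the non-exclusive case is simplified to a strictly inclusive L2, which the slides do not claim for any particular processor.

```python
def cache_geometry(size_kb, ways, line_bytes):
    """Return set count and address-bit split for a set-associative cache.

    Assumes size, ways and line length are all powers of two.
    """
    sets = size_kb * 1024 // (ways * line_bytes)
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


def unique_capacity_kb(l1_total_kb, l2_kb, exclusive):
    """Distinct data held by L1+L2; the inclusive case is a simplification."""
    return l1_total_kb + l2_kb if exclusive else l2_kb


# (size in kB, ways, line bytes); on-chip L2 line sizes are assumptions.
CACHES = {
    "Athlon L1 (each)": (64, 2, 64),
    "Athlon off-chip L2": (512, 2, 64),
    "PIII Katmai L1 (each)": (16, 4, 32),
    "PIII Katmai off-chip L2": (512, 2, 32),
    "PIII Coppermine L2": (256, 8, 32),
    "Athlon Thunderbird L2": (256, 16, 64),
}

for name, params in CACHES.items():
    print(name, cache_geometry(*params))

# Thunderbird: 2x64kB L1 plus an exclusive 256kB L2.
print("Thunderbird unique capacity (kB):",
      unique_capacity_kb(128, 256, exclusive=True))
```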
Pipeline Architecture - Decoders
- 3-way decoders convert instructions into fixed-length “Macro-Ops” (or
MOPs) and send them to the Instruction Control Unit (ICU)
- ICU contains 72 entries vs. 20 entries for PIII → superior out-of-order
execution performance
Pipeline – Integer Execution Units
- 3 IEU, 3 AGU
- 15-entry integer scheduler
- 24-entry, 32-bit register file with 9 read and
8 write ports
Pipeline - Floating Point Unit
- Floating Point Units execute MMX,
x87 (FP) and 3DNow! instructions
- 36-entry FP scheduler
- 88-entry, 90-bit register file with 5 read and
5 write ports.
- Some stages of the MUL pipeline
may be unused during DIV/Sqrt
iterations. ICU informs the FP
scheduler in such cases so that there
is sufficient time to schedule
independent MULs in the unused
cycle.
- DIV by exact 2^n or zero takes 11 cycles
- DIV and Sqrt latency/throughput, in cycles:
        Single   Double   Extended
DIV     16/13    20/17    24/21
Sqrt    19/16    27/24    35/32
Pipeline – Load/Store Unit
• 44-entry Load/Store queue
• Data forwarding from
stores to dependent loads
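
Store-to-load forwarding means a load can take its value from an older store that has not yet reached memory. The sketch below is a deliberately simplified model of that idea, not the K7 implementation: the class and method names are hypothetical, only whole-word accesses at matching addresses are handled, and the real unit also tracks loads, partial overlaps and ordering, which are omitted here. Only the capacity of 44 entries is taken from the slide.

```python
from collections import deque


class StoreQueue:
    """Toy model of pending stores with forwarding to dependent loads."""

    def __init__(self, capacity=44):
        self.capacity = capacity
        self.pending = deque()  # (address, value) pairs, oldest first

    def store(self, address, value):
        if len(self.pending) >= self.capacity:
            raise RuntimeError("queue full: the pipeline would stall here")
        self.pending.append((address, value))

    def load(self, address, memory):
        """Return (value, forwarded). The youngest matching store wins."""
        for addr, value in reversed(self.pending):
            if addr == address:
                return value, True
        return memory.get(address, 0), False

    def retire_oldest(self, memory):
        """Commit the oldest pending store to memory."""
        addr, value = self.pending.popleft()
        memory[addr] = value


memory = {0x10: 1}
queue = StoreQueue()
queue.store(0x10, 7)
print(queue.load(0x10, memory))  # value comes from the queue, not memory
queue.retire_oldest(memory)
print(queue.load(0x10, memory))  # value now comes from memory
```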
Pipeline - Stages
Stage   Integer               Floating Point
1       Fetch                 Fetch
2       Scan                  Scan
3       Align1                Align1
4       Align2                Align2
5       Early Decode (EDEC)   Early Decode (EDEC)
6       IDEC                  IDEC
7       Schedule              Stack Rename
8       Execute               Register Rename
9       Address Generation    Schedule Write
10      Data Cache Access     Schedule
11                            Register File Read
12                            Floating-point execution
13                            Floating-point execution
14                            Floating-point execution
15                            Floating-point execution
Branch Prediction
• Dynamic branch prediction logic composed
of:
– Branch prediction table (BPT): two-way, 2048-entry (512
for PIII). The BPT stores prediction information that is
used for predicting the direction of conditional
branches.
– Branch target address table:
stores target addresses of conditional and
unconditional branches.
– Return address stack: 12-entry;
optimizes CALL/RET instruction pairs
– BPT is accessed during the Fetch stage and the prediction is made
during the Scan stage using the Smith prediction algorithm (2-bit
counters)
– Misprediction penalty is 10 cycles
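
A 2-bit saturating counter predictor of the Smith type can be sketched in a few lines. This is a generic textbook model, not the K7 logic: the table is indexed directly by the branch address modulo the table size (the slide describes a two-way, 2048-entry table, which is not modeled), counters start in the weakly not-taken state, and all names are illustrative.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 taken."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.counters = [1] * entries  # assumption: start weakly not taken

    def _index(self, pc):
        return pc % self.entries  # assumption: simple direct-mapped index

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, trace):
    """Fraction of correctly predicted branches in a (pc, taken) trace."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# A loop branch taken 9 times and then not taken, repeated 10 times.
loop_trace = [(0x400, t) for _ in range(10) for t in [True] * 9 + [False]]
print(f"accuracy on loop trace: {accuracy(TwoBitPredictor(), loop_trace):.2f}")
```

The second counter bit is what keeps the loop-exit misprediction from also costing a misprediction on the next loop entry.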

• Approximate correct branch predictions:
– AMD Athlon: 95%
– Intel Pentium III: 90-92%
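
The prediction accuracy and the 10-cycle misprediction penalty combine into an average cost per branch. The sketch below works this out for the Athlon figures only, since the slides give no penalty for the Pentium III. The branch fraction of 20% is a hypothetical workload parameter chosen for illustration, and the function names are not from the source.

```python
def mispredict_cycles_per_branch(accuracy, penalty_cycles=10):
    """Average cycles lost per branch to mispredictions."""
    return (1.0 - accuracy) * penalty_cycles


def cpi_overhead(accuracy, branch_fraction, penalty_cycles=10):
    """Average cycles lost per instruction.

    branch_fraction is the share of instructions that are branches;
    it depends on the workload and is an assumption here.
    """
    return branch_fraction * mispredict_cycles_per_branch(accuracy, penalty_cycles)


athlon_per_branch = mispredict_cycles_per_branch(0.95)
athlon_cpi = cpi_overhead(0.95, branch_fraction=0.2)
print(f"Athlon: {athlon_per_branch:.2f} cycles lost per branch")
print(f"Athlon: {athlon_cpi:.2f} cycles lost per instruction at 20% branches")
```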
3DNow! Technology

• 3DNow! is a set of SIMD instructions designed to
accelerate FP-intensive multimedia
applications.
• Instructions operate on two packed single-
precision 32-bit doublewords simultaneously:
Dst[63:32] = Dst[63:32] op Src[63:32]
Dst[31:00] = Dst[31:00] op Src[31:00]
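
The two-lane operation above can be emulated on a 64-bit integer that stands in for a register. This is only a sketch of the data layout: the helper names are hypothetical, and the arithmetic is done in Python double precision and then rounded to single precision when repacked, which may differ in the last bit from real hardware for some inputs.

```python
import operator
import struct


def pack2(lo, hi):
    """Pack two floats as singles: lo in bits 31:0, hi in bits 63:32."""
    return int.from_bytes(struct.pack("<ff", lo, hi), "little")


def unpack2(reg):
    """Return (lo, hi) single-precision values from a 64-bit integer."""
    lo, hi = struct.unpack("<ff", reg.to_bytes(8, "little"))
    return lo, hi


def packed_op(dst, src, op):
    """Apply op lane by lane: Dst[lane] = Dst[lane] op Src[lane]."""
    d_lo, d_hi = unpack2(dst)
    s_lo, s_hi = unpack2(src)
    return pack2(op(d_lo, s_lo), op(d_hi, s_hi))


dst = pack2(1.0, 2.0)
src = pack2(0.5, 4.0)
print(unpack2(packed_op(dst, src, operator.add)))  # packed add of both lanes
print(unpack2(packed_op(dst, src, operator.mul)))  # packed multiply
```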
3DNow! Technology
• After significant code analysis, AMD engineers found that there are two
compelling implementation alternatives:
- extending MMX with 3DNow! instructions
- using wide registers separate from MMX, a 4-operand instruction format
and support for MAC (multiply-accumulate).
• Anything in between requires significantly greater hardware area or
complexity without providing a corresponding performance benefit.

• AMD chose the first, which achieves most of the performance benefit
with significantly less area and power. Since no additional registers are
used, no new state is introduced → compatibility with existing
OSs.
• The second choice is implemented in the PowerPC G4 under the name
AltiVec.
3DNow! Technology
• Instead of division and sqrt, reciprocal and reciprocal sqrt are
implemented in AMD K7, since they are encountered more often in
multimedia applications.
• MMX and 3DNow! instructions have at most 4-cycle latency (only for
3DNow! Add and Mul) and 1-cycle throughput. This is much faster
than single-precision FP division (13 cycles) and sqrt (16 cycles).
• Using 2 FP pipelines simultaneously, maximum throughput is 4
FPops/cycle.
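
The peak figure of 4 FP operations per cycle is the product of two pipelines and two packed single-precision lanes per instruction. The sketch below turns that into a peak rate at a given clock. The 500MHz value is used only as an illustrative clock, taken from the design target mentioned in the introduction, and the function names are not from the source.

```python
def peak_fp_ops_per_cycle(pipelines=2, lanes_per_instruction=2):
    """Peak packed FP operations per cycle at 1 instruction/cycle/pipeline."""
    return pipelines * lanes_per_instruction


def peak_flops(clock_hz, pipelines=2, lanes_per_instruction=2):
    """Peak FP operations per second at the given clock."""
    return clock_hz * peak_fp_ops_per_cycle(pipelines, lanes_per_instruction)


print("FP ops per cycle:", peak_fp_ops_per_cycle())
print(f"Peak at 500MHz: {peak_flops(500e6) / 1e9:.1f} GFLOPS")
```

This is an upper bound: it assumes both pipelines issue a packed instruction every cycle with no stalls.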
Integer Performance of AMD Athlon
Floating Point Performance of AMD Athlon
Conclusion
• As the first 7th-generation CPU, the AMD K7
was a major leap forward in CPU
history.
• It had both performance and cost benefits
compared to the Intel PIII and started
the competition that led to today’s
AMD Athlon XP and P4 processors.
References
• Hesley, S., V. Andrade, B. Burd, G. Constant, J. Correll, M. Crowley, M.
Golden, N. Hopkins, S. Islam, S. Johnson, R. Khondker, D. Meyer, J.
Moench, H. Partovi, R. Posey, F. Weber and J. Yong, “A 7th Generation
x86 Microprocessor”, IEEE International Solid State Circuits Conference,
pp. 92-93, 1999.
• Scherer, A., M. Golden, N. Juffa, S. Meier, S. Oberman, H. Partovi and F.
Weber, “An Out-of-Order Three-Way Superscalar Multimedia Floating
Point Unit”, IEEE International Solid State Circuits Conference,
pp. 94-95, 1999.
• Oberman, S., “Floating Point Division and Square Root Algorithms and
Implementation in the AMD-K7 Microprocessor”, 14th IEEE Symposium on
Computer Arithmetic, pp. 106-115, 1999.
• Oberman, S., G. Favor and F. Weber, “AMD 3DNow! Technology:
Architecture and Implementations”, IEEE Micro, 1999.
• AMD Athlon Processor Datasheet and Technical Brief from www.amd.com
• Intel PIII Processor Datasheet from www.intel.com
Questions?
