
AMD Opteron Overview

Michael Trotter (mjt5v) Tim Kang (tjk2n) Jeff Barbieri (jjb3v)

Introduction
An overview of the AMD Opteron processor family
Focuses on the Barcelona core

Barcelona is AMD's 65nm quad-core CPU

Fetch
Fetches 32B per cycle from the L1 instruction cache into the pre-decode/pick buffer
Because x86 instructions are variable length, Barcelona uses pre-decode information to mark where each instruction ends, simplifying decode

Inst. Decode
The instruction cache contains a pre-decoder which scans 4B of the instruction stream each cycle
The resulting pre-decode information travels with each line of instructions, stored in the ECC bits of the L1I, L2, and L3 caches

Instructions are then passed through the sideband stack optimizer


x86 includes instructions that directly manipulate each thread's stack
AMD introduced the sideband stack optimizer to remove these stack manipulations from the instruction stream
Thus, many stack operations can be processed in parallel
Frees up the reservation stations, re-order buffers, and regular ALUs for other work

Branch Prediction
Branch selector chooses between a bi-modal predictor and a global predictor
The bi-modal predictor and the branch selector are both stored as pre-decode information in the ECC bits of the instruction cache
The global predictor combines the relative instruction pointer (RIP) of a conditional branch with a global history register
The history register tracks the last 12 branches; the combined index selects one of 16K 2-bit saturating counters in the prediction table
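
The exact index hash is not public; below is a minimal gshare-style sketch, assuming the RIP is XORed with the 12-bit history to index the table (the XOR combination is an assumption; the table size, history length, and 2-bit counters are from the slides).

```cpp
#include <cstdint>
#include <array>

// Gshare-style sketch of Barcelona's global predictor.
// From the slides: 12 branches of history, 16K 2-bit saturating counters.
// The XOR indexing is an assumption; AMD's actual hash is undocumented here.
class GlobalPredictor {
    std::array<uint8_t, 16384> counters{};  // 2-bit counters
    uint16_t history = 0;                   // outcomes of the last 12 branches

    size_t index(uint64_t rip) const {
        return (rip ^ history) & (counters.size() - 1);
    }
public:
    bool predict(uint64_t rip) const {
        return counters[index(rip)] >= 2;   // high bit of the 2-bit counter
    }
    void update(uint64_t rip, bool taken) {
        uint8_t &c = counters[index(rip)];
        if (taken && c < 3) ++c;            // saturate at 3 (strongly taken)
        if (!taken && c > 0) --c;           // saturate at 0 (strongly not taken)
        history = ((history << 1) | taken) & 0xFFF;  // keep 12 bits of history
    }
};
```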

The branch target address calculator (BTAC) checks the predicted targets of relative branches
Can correct a mispredicted target with a two-cycle penalty

Barcelona uses an indirect predictor

Specifically designed to handle branches with multiple targets (e.g., switch/case statements)

Return address stack has 24 entries
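
A return address stack can be sketched as a small circular buffer: calls push the return address, returns pop the predicted target. The 24-entry depth is from the slides; the overwrite-on-overflow behavior is an assumption.

```cpp
#include <cstdint>
#include <array>

// Sketch of a 24-entry return address stack (depth from the slides;
// wrap-on-overflow behavior is an assumption).
class ReturnAddressStack {
    std::array<uint64_t, 24> entries{};
    unsigned top = 0;  // next free slot, modulo 24
public:
    void onCall(uint64_t return_rip) {        // push on a predicted call
        entries[top] = return_rip;
        top = (top + 1) % entries.size();     // oldest entry is overwritten
    }
    uint64_t onReturn() {                     // pop as the predicted target
        top = (top + entries.size() - 1) % entries.size();
        return entries[top];
    }
};
```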

Pipeline
Uses a 12-stage pipeline

Out-of-Order Execution (ROB)
The Pack Buffer (post-decoding buffer) sends groups of 3 micro-ops to the re-order buffer (ROB)
The re-order buffer contains 24 entries, with 3 lanes per entry
Holds a total of 72 micro-ops

Instructions can be moved between lanes to avoid a congested reservation station or to observe issue restrictions

From the ROB, instructions issue to the appropriate scheduler
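
A minimal model of this organization, assuming a circular buffer of 24 three-lane entries with in-order retirement (field names are illustrative):

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Illustrative model of Barcelona's ROB: 24 entries x 3 lanes = 72 micro-ops.
// A group of up to 3 micro-ops from the pack buffer occupies one entry.
struct MicroOp { uint64_t rip; bool done = false; };

class ReorderBuffer {
    std::array<std::array<std::optional<MicroOp>, 3>, 24> rows{};
    unsigned head = 0, tail = 0, count = 0;
public:
    bool allocate(const std::array<std::optional<MicroOp>, 3>& group) {
        if (count == rows.size()) return false;   // ROB full: stall dispatch
        rows[tail] = group;
        tail = (tail + 1) % rows.size();
        ++count;
        return true;
    }
    bool retireOldest() {                          // retirement is in order
        if (count == 0) return false;
        for (auto& op : rows[head])
            if (op && !op->done) return false;     // oldest group not finished
        rows[head].fill(std::nullopt);
        head = (head + 1) % rows.size();
        --count;
        return true;
    }
};
```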


Integer Future File and Register File (IFFRF)


The IFFRF contains 40 registers broken up into three distinct sets
The Architectural Register File
Contains 16 64-bit non-speculative registers
Instructions cannot modify the Architectural Register File until they commit

Speculative instructions read from and write to the Future File


Contains the most recent speculative state of the 16 architectural registers

The last 8 registers are scratchpad registers used by the microcode.

Should a branch misprediction or an exception occur, the pipeline rolls back and the architectural register file overwrites the contents of the Future File
There are three reservation stations (i.e., schedulers) within the integer cluster
Each station is tied to a specific lane in the ROB and holds 8 instructions
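
A minimal sketch of the future-file mechanism described above, assuming speculative writes land in the future file, retirement copies results into the architectural file, and rollback restores the future file wholesale:

```cpp
#include <array>
#include <cstdint>

// Sketch of the Future File / Architectural Register File interplay.
// 16 architectural registers per the slides; the 8 microcode scratchpad
// registers are omitted for brevity.
class IntegerRegisters {
    std::array<uint64_t, 16> arch{};     // committed, non-speculative state
    std::array<uint64_t, 16> future{};   // most recent speculative state
public:
    uint64_t read(unsigned r) const { return future[r]; }   // issue-time reads
    void writeSpeculative(unsigned r, uint64_t v) { future[r] = v; }

    // At retirement, the instruction's result (carried by the ROB)
    // becomes architectural state.
    void commit(unsigned r, uint64_t retiredValue) { arch[r] = retiredValue; }

    // Mispredict or exception: the architectural file overwrites the
    // Future File, discarding all speculative state.
    void rollback() { future = arch; }
};
```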

Integer Execution
Barcelona uses three symmetric ALUs which can execute almost any integer instruction
Three full-featured ALUs require more die area and power
Can provide higher performance for certain edge cases
Enables a simpler design for the ROB and schedulers

Floating Point Execution


Floating-point operations are first sent to the FP mapper and renamer
In the renamer, up to 3 FP instructions per cycle are each assigned a destination register from the 120-entry FP register file
Once the micro-ops have been renamed, they may be issued to the three FP schedulers
Operands can be obtained from either the FP register file or the forwarding network
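
Renaming can be sketched with a map table and a free list: each micro-op's architectural destination is remapped to a free physical register. The 120-entry file size is from the slides; the 16 architectural registers and the free-list structure are illustrative assumptions.

```cpp
#include <array>
#include <cstdint>
#include <deque>
#include <optional>

// Illustrative renamer for a 120-entry FP physical register file.
// Up to 3 micro-ops per cycle would each call rename() for a free register.
class FpRenamer {
    std::deque<uint16_t> freeList;           // unallocated physical registers
    std::array<uint16_t, 16> mapTable{};     // architectural -> physical
public:
    FpRenamer() {
        for (uint16_t a = 0; a < 16; ++a) mapTable[a] = a;
        for (uint16_t p = 16; p < 120; ++p) freeList.push_back(p);
    }
    std::optional<uint16_t> rename(unsigned archDest) {
        if (freeList.empty()) return std::nullopt;   // no free registers: stall
        uint16_t phys = freeList.front();
        freeList.pop_front();
        mapTable[archDest] = phys;   // younger readers now see the new mapping
        return phys;
    }
    // At retirement, the previous mapping of the destination is freed.
    void release(uint16_t oldPhys) { freeList.push_back(oldPhys); }
};
```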

Floating Point Execution (SIMD)


The FPUs are 128 bits wide, so Streaming SIMD Extension (SSE) instructions can execute in a single pass
Similarly, the load-store units and the FMISC unit handle 128-bit wide data to improve SSE performance
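
For illustration, one addps intrinsic performs four single-precision adds; on Barcelona's 128-bit FPUs this is a single pass, where a 64-bit FPU would need two:

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdio>

int main() {
    // One 128-bit addps: four single-precision adds in a single pass
    // on a 128-bit wide FPU (two passes on older 64-bit FPUs).
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, c);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  // 11 22 33 44
    return 0;
}
```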

Memory Overview

Memory Hierarchy
4 separate 128KB 2-way set associative L1 caches (64KB instruction + 64KB data per core)
Latency = 3 cycles
Write-back to L2
The data paths to and from the L1D cache were also widened to 256 bits (128 bits in each direction)

4 separate 512KB 16-way set associative L2 caches (one per core)

Latency = 12 cycles
Line size is 64B
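
The set count follows from the stated geometry: 512KB / (16 ways x 64B lines) = 512 sets, i.e. 6 offset bits and 9 index bits. A small sketch of the address split (geometry from the slides; the helper struct is illustrative):

```cpp
#include <cstdint>
#include <cstdio>

// Address decomposition for a set-associative cache.
// Barcelona's L2 per the slides: 512KB, 16-way, 64B lines
//   sets = 512*1024 / (16*64) = 512  -> 6 offset bits, 9 index bits.
struct CacheGeometry {
    uint64_t sizeBytes, ways, lineBytes;
    uint64_t sets() const { return sizeBytes / (ways * lineBytes); }
    uint64_t setIndex(uint64_t addr) const { return (addr / lineBytes) % sets(); }
    uint64_t tag(uint64_t addr) const { return addr / (lineBytes * sets()); }
};

int main() {
    CacheGeometry l2{512 * 1024, 16, 64};
    uint64_t addr = 0x7ffe12345678;  // arbitrary example address
    std::printf("sets=%llu set=%llu tag=%llx\n",
                (unsigned long long)l2.sets(),
                (unsigned long long)l2.setIndex(addr),
                (unsigned long long)l2.tag(addr));
    return 0;
}
```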

L3 Cache
Shared 2MB 32-way set associative L3
Latency = 38 cycles
Uses 64B lines
The L3 cache was designed with data sharing in mind

When a line is requested, the L3 keeps its copy if the line is likely to be shared

This leads to duplication that would not occur in a strictly exclusive hierarchy

In the past, a pseudo-LRU algorithm would evict the (approximately) least-recently-used line in the cache
In Barcelona's L3, the replacement algorithm has been changed to prefer evicting unshared lines
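
A sketch of that preference, using a plain LRU timestamp as a stand-in for the undocumented pseudo-LRU state: evict the least-recently-used unshared line if one exists, otherwise fall back to the overall LRU victim.

```cpp
#include <cstdint>
#include <vector>

// Sketch of Barcelona's L3 replacement preference: evict an unshared line
// when possible; fall back to the (pseudo-)LRU victim otherwise.
struct L3Line {
    uint64_t tag = 0;
    bool shared = false;    // line believed shared among cores
    uint64_t lastUse = 0;   // LRU timestamp, standing in for pseudo-LRU bits
};

// Returns the index of the way to evict within one set.
size_t chooseVictim(const std::vector<L3Line>& set) {
    size_t lruAll = 0, lruUnshared = set.size();  // set.size() = "none found"
    for (size_t i = 0; i < set.size(); ++i) {
        if (set[i].lastUse < set[lruAll].lastUse) lruAll = i;
        if (!set[i].shared &&
            (lruUnshared == set.size() ||
             set[i].lastUse < set[lruUnshared].lastUse))
            lruUnshared = i;
    }
    return lruUnshared != set.size() ? lruUnshared : lruAll;
}
```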

Access to the L3 must be arbitrated since the L3 is shared between four different cores
A round-robin algorithm is used to give access to one of the four cores each cycle.
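
A round-robin arbiter rotates priority so each core is served in turn; a minimal sketch for four requesters:

```cpp
#include <array>
#include <optional>

// Minimal round-robin arbiter for four cores sharing the L3.
// Each cycle, the grant goes to the next core (after the last winner)
// that has a pending request.
class RoundRobinArbiter {
    unsigned last = 3;  // start so core 0 is checked first
public:
    std::optional<unsigned> grant(const std::array<bool, 4>& requests) {
        for (unsigned i = 1; i <= 4; ++i) {
            unsigned core = (last + i) % 4;
            if (requests[core]) { last = core; return core; }
        }
        return std::nullopt;  // no requests this cycle
    }
};
```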

Each core has 8 data prefetchers (a total of 32 per device)


They fill the L1D cache directly
Can have up to 2 outstanding fetches to any address
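
The slides do not say which access patterns the prefetchers recognize; purely for illustration, the sketch below assumes a simple stride detector, keeping the 2-outstanding-fetch cap from the slides.

```cpp
#include <cstdint>

// Illustrative stride prefetcher. The cap of 2 outstanding fetches is from
// the slides; the stride-detection scheme itself is an assumption.
class StridePrefetcher {
    uint64_t lastAddr = 0;
    int64_t lastStride = 0;
    unsigned outstanding = 0;               // decremented when a fill returns
public:
    // Returns a prefetch address, or 0 if no prefetch should be issued.
    uint64_t onAccess(uint64_t addr) {
        int64_t stride = (int64_t)addr - (int64_t)lastAddr;
        bool match = (stride != 0 && stride == lastStride);
        lastStride = stride;
        lastAddr = addr;
        if (match && outstanding < 2) {     // at most 2 fetches in flight
            ++outstanding;
            return addr + stride;           // prefetch the next expected address
        }
        return 0;
    }
    void onFillReturned() { if (outstanding) --outstanding; }
};
```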

Memory Controllers
Each memory controller supports independent 64B transactions
The integrated DDR2 memory controller resolves an L3 cache miss in under 60 nanoseconds

TLB
Barcelona offers non-speculative memory access re-ordering in the form of Load Store Units (LSU)
Thus, some memory operations can be issued out-of-order

In the 12-entry LSU1, the oldest operations translate their addresses from the virtual to the physical address space using the L1 DTLB
During this translation, the lower 12 bits of a load's address are checked against the addresses of older stores
If they differ, the load cannot alias the store and proceeds ahead of it
If they match, load-store forwarding occurs
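
The low 12 bits are the page offset, which is unchanged by virtual-to-physical translation (with 4KB pages), so the comparison can happen before translation completes. A sketch of the check:

```cpp
#include <cstdint>

// The low 12 bits (the 4KB page offset) are identical in the virtual and
// physical address, so they can be compared before translation finishes.
constexpr uint64_t PAGE_OFFSET_MASK = 0xFFF;

enum class LoadAction { PassStore, ForwardFromStore };

// Sketch of the LSU1 check between a young load and an older store.
// A matching offset may be a false positive (different pages), so real
// hardware re-checks once the full physical addresses are known.
LoadAction disambiguate(uint64_t loadVAddr, uint64_t storeVAddr) {
    if ((loadVAddr & PAGE_OFFSET_MASK) != (storeVAddr & PAGE_OFFSET_MASK))
        return LoadAction::PassStore;        // cannot alias: load goes ahead
    return LoadAction::ForwardFromStore;     // same offset: forward store data
}
```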

Should a miss in the L1 DTLB occur, the L2 DTLB will be checked


Once the load or store has located its address in the cache, the operation moves on to LSU2

LSU2 holds up to 32 memory accesses, where they stay until they are removed at retirement
The LSU2 handles any cache or TLB misses via scheduling and probing
On a cache miss, the LSU2 looks in the L2, then the L3, then main memory
On a TLB miss, it looks in the L2 TLB and then the page tables in main memory
The LSU2 also holds store instructions, which may not modify the caches until retirement, to ensure correctness
Thus, the LSU2 contains the majority of the complexity in the memory pipeline
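
The miss-handling order can be sketched as a lookup chain: L1 DTLB, then L2 DTLB, then a page-table walk in main memory (the hash maps below are stand-ins for the real set-associative TLB arrays):

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of the two-level DTLB lookup chain; std::unordered_map stands in
// for the real set-associative TLB structures.
struct Tlbs {
    std::unordered_map<uint64_t, uint64_t> l1Dtlb, l2Dtlb;  // VPN -> PFN

    uint64_t walkPageTables(uint64_t vpn) {
        // Placeholder for the hardware page-table walk in main memory.
        return vpn;  // identity mapping, purely for illustration
    }

    uint64_t translate(uint64_t vaddr) {
        uint64_t vpn = vaddr >> 12, off = vaddr & 0xFFF;
        auto hit = l1Dtlb.find(vpn);
        if (hit == l1Dtlb.end()) {                 // L1 DTLB miss
            hit = l2Dtlb.find(vpn);
            if (hit == l2Dtlb.end()) {             // L2 DTLB miss: walk tables
                uint64_t pfn = walkPageTables(vpn);
                hit = l2Dtlb.emplace(vpn, pfn).first;
            }
            hit = l1Dtlb.emplace(vpn, hit->second).first;  // refill L1 DTLB
        }
        return (hit->second << 12) | off;          // physical address
    }
};
```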

Hypertransport
Barcelona has four HyperTransport 3.0 links for inter-processor communication and I/O devices
HyperTransport 3.0 adds a feature called un-ganging, or lane splitting
Each HT 3.0 link is composed of two 16-bit lanes (one in each direction)
Each 16-bit lane can be split into a pair of independent 8-bit wide links

Shanghai
The latest model in the Opteron series
Several improvements over Barcelona
45nm process
6MB L3 cache
Improved clock speeds
A host of other improvements
