Introduction
AMD Opteron
Focuses on Barcelona
Fetch
Fetches 32B per cycle from the L1 instruction cache into the pre-decode/Pick buffer
To simplify decoding, Barcelona uses pre-decode information to mark the end of each instruction
Inst. Decode
The instruction cache contains a pre-decoder which scans 4B of the instruction stream each cycle
Pre-decode information is stored in the ECC bits of the L1I, L2 and L3 caches, traveling alongside each line of instructions
Branch Prediction
Branch selector chooses between a bi-modal predictor and a global predictor
The bi-modal predictor and branch selector are both stored in the ECC bits of the instruction cache, as pre-decode information
The global predictor combines the relative instruction pointer (RIP) of a conditional branch with a global history register
Tracks the last 12 branches with a 16K-entry prediction table of 2-bit saturating counters
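The global predictor above can be modeled in a few lines. This is an illustrative sketch, not AMD's actual logic: the table size (16K entries), 2-bit counters, and 12-bit history come from the notes, while the XOR hash of the RIP and history (gshare-style) is an assumption.

```python
# Sketch of a global branch predictor: a 16K-entry table of 2-bit
# saturating counters indexed by a hash of the branch RIP and a 12-bit
# global history register. The XOR indexing is an assumption.

TABLE_SIZE = 16 * 1024          # 16K entries
HISTORY_BITS = 12               # outcomes of the last 12 branches

class GlobalPredictor:
    def __init__(self):
        self.counters = [1] * TABLE_SIZE   # 2-bit counters, start weakly not-taken
        self.history = 0                   # global history register

    def _index(self, rip):
        return (rip ^ self.history) % TABLE_SIZE

    def predict(self, rip):
        return self.counters[self._index(rip)] >= 2   # taken if counter is 2 or 3

    def update(self, rip, taken):
        i = self._index(rip)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # shift the outcome into the 12-bit history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)
```

Because the 2-bit counters saturate, a branch must mispredict twice in a row before the predicted direction flips, which filters out one-off anomalies in an otherwise stable branch.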
The branch target address calculator (BTAC) checks the targets for relative branches
Can correct mispredictions with a two-cycle penalty
Pipeline
Uses a 12 stage pipeline
OO (ROB)
The Pack Buffer (post-decoding buffer) sends groups of 3 micro-ops to the re-order buffer (ROB)
The re-order buffer contains 24 entries, with 3 lanes per entry
Holds a total of 72 instructions
Instructions can be moved between lanes to avoid a congested reservation station or to observe issue restrictions
ROB
Should a branch misprediction or an exception occur, the pipeline rolls back and the architectural register file overwrites the contents of the Future File
There are three reservation stations, i.e. schedulers, within the integer cluster
Each station is tied to a specific lane in the ROB and holds 8 instructions
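The lane-balancing idea above can be sketched as follows. This is a toy model under stated assumptions: each of the three ROB lanes feeds its own 8-entry reservation station (from the notes), and the steering policy of sending micro-ops to the least-loaded lane is an illustrative choice, not AMD's documented algorithm.

```python
# Toy sketch of ROB lane steering: a dispatch group of up to 3 micro-ops
# is spread across the 3 lanes, least-congested reservation station first.

RS_CAPACITY = 8   # entries per reservation station
NUM_LANES = 3     # one lane per reservation station

def assign_lanes(group, rs_occupancy):
    """Assign up to 3 micro-ops to distinct lanes, least-loaded first.
    rs_occupancy is a list of 3 current occupancy counts (mutated here)."""
    assert len(group) <= NUM_LANES
    # order the lanes by how full their reservation stations are
    lanes = sorted(range(NUM_LANES), key=lambda l: rs_occupancy[l])
    assignment = {}
    for uop, lane in zip(group, lanes):
        if rs_occupancy[lane] >= RS_CAPACITY:
            break                      # that station is full; stall the rest
        assignment[uop] = lane
        rs_occupancy[lane] += 1
    return assignment
```

The point of the model: because a micro-op is not tied to a fixed lane at dispatch, a single congested reservation station does not stall an entire group.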
Integer Execution
Barcelona uses three symmetric ALUs which can execute almost any integer instruction
Three full-featured ALUs require more die area and power
Can provide higher performance for certain edge cases
Enables a simpler design for the ROB and schedulers
Memory Overview
Memory Hierarchy
Four separate 128KB 2-way set associative L1 caches, one per core
Latency = 3 cycles
Write-back to L2
The data paths into and out of the L1D cache were also widened to 256 bits (128 bits transmit and 128 bits receive)
L3 Cache
Shared 2MB 32-way set associative L3
Latency = 38 cycles
Uses 64B lines
The L3 cache was designed with data sharing in mind
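The geometry above pins down how an address maps into this L3: 2 MB divided by (32 ways × 64 B lines) gives 1024 sets, so 6 bits select the byte within a line and the next 10 bits select the set. A minimal sketch of that decomposition, assuming a physical address is indexed directly:

```python
# Address decomposition for a 2MB, 32-way, 64B-line cache like the L3
# described above: 2 MB / (32 * 64 B) = 1024 sets -> 6 offset bits, 10 index bits.

LINE_SIZE = 64
WAYS = 32
CACHE_SIZE = 2 * 1024 * 1024
NUM_SETS = CACHE_SIZE // (WAYS * LINE_SIZE)   # 1024

OFFSET_BITS = LINE_SIZE.bit_length() - 1      # 6
INDEX_BITS = NUM_SETS.bit_length() - 1        # 10

def decompose(addr):
    """Split a physical address into (tag, set index, byte offset)."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

With only 1024 sets, each set holds 32 candidate lines, which is what makes the replacement policy discussed next matter so much.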
In the past, a pseudo-LRU algorithm would evict the oldest line in the cache.
In Barcelona's L3, the replacement algorithm has been changed to prefer evicting unshared lines
Access to the L3 must be arbitrated since the L3 is shared between four different cores
A round-robin algorithm is used to give access to one of the four cores each cycle.
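The round-robin arbitration described above can be illustrated with a small model. This is a sketch, not AMD's actual arbiter: the grant rotates among the four cores, and a core with no pending request is skipped so the slot is not wasted (the skipping behavior is an assumption).

```python
# Toy round-robin arbiter for the shared L3: each cycle, the grant
# rotates among the four cores, starting just past the last winner.

NUM_CORES = 4

class RoundRobinArbiter:
    def __init__(self):
        self.last = NUM_CORES - 1      # so core 0 wins the first tie

    def grant(self, requests):
        """requests: list of 4 bools (pending L3 request per core).
        Returns the granted core index, or None if nobody is requesting."""
        for i in range(1, NUM_CORES + 1):
            core = (self.last + i) % NUM_CORES
            if requests[core]:
                self.last = core
                return core
        return None
```

Rotating the starting point guarantees fairness: no core can be starved for more than three cycles while it has a pending request.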
Memory Controllers
Each memory controller supports independent 64B transactions
The integrated DDR2 memory controller ensures that an L3 cache miss is resolved in less than 60 nanoseconds
TLB
Barcelona offers non-speculative memory access re-ordering in the form of Load Store Units (LSU)
Thus, some memory operations can be issued out-of-order
In the 12-entry LSU1, the oldest operations translate their addresses from the virtual address space to the physical address space using the L1 DTLB
During this translation, the lower 12 bits of a load operation's address are tested against previously stored addresses
If they differ, the load proceeds ahead of the store
If they are the same, load-store forwarding occurs
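The partial-address check above works because the low 12 bits are the page offset, which is unchanged by virtual-to-physical translation, so they can be compared before the full translation completes. A minimal sketch of that disambiguation, with illustrative names (picking the youngest matching store for forwarding is an assumption):

```python
# Sketch of load-store disambiguation on the low 12 bits: only the page
# offset of a load is compared against older, not-yet-retired stores.

PAGE_OFFSET_MASK = (1 << 12) - 1   # low 12 bits survive translation

def resolve_load(load_addr, older_stores):
    """older_stores: list of (store_addr, value) pairs, oldest first.
    Returns ('forward', value) from the youngest store whose page offset
    matches, or ('proceed', None) if the load may pass every store."""
    for store_addr, value in reversed(older_stores):
        if (store_addr & PAGE_OFFSET_MASK) == (load_addr & PAGE_OFFSET_MASK):
            return ('forward', value)
    return ('proceed', None)
```

Note the check is conservative: two different physical addresses that share a page offset still match, forcing an unnecessary forward or stall, but a true dependence is never missed.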
LSU2 holds up to 32 memory accesses, where they remain until they complete and are removed
The LSU2 handles any cache or TLB misses via scheduling and probing
In the case of a cache miss, the LSU2 looks in the L2, then the L3, and then memory
In the case of a TLB miss, it looks in the L2 TLB and then main memory
The LSU2 also holds store instructions, which are not allowed to actually modify the caches until retirement, to ensure correctness
Thus, the LSU2 contains the majority of the complexity in the memory pipeline
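The miss cascade the LSU2 walks can be sketched as a simple hierarchy lookup. Only the latencies quoted in these notes are used (3-cycle L1, 38-cycle L3); the L2 and DRAM figures are placeholder assumptions, as is the strictly sequential probing.

```python
# Illustrative cache-miss cascade: on an L1D miss, the request walks the
# L2, then the shared L3, then main memory, accumulating latency.

LEVELS = [
    ("L1D", 3),      # 3-cycle L1 latency, from the L1 section above
    ("L2", 12),      # assumed placeholder
    ("L3", 38),      # 38-cycle L3 latency, from the L3 section above
    ("DRAM", 180),   # assumed placeholder
]

def lookup(addr, contents):
    """contents: dict mapping level name -> set of cached line addresses.
    Returns (level that hit, total cycles spent walking the hierarchy)."""
    total = 0
    for name, latency in LEVELS:
        total += latency
        if name == "DRAM" or addr in contents.get(name, set()):
            return name, total
    return "DRAM", total   # unreachable: DRAM always hits
```

The model makes the cost structure visible: every level probed on the way down adds its latency before the data finally returns.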
Hypertransport
Barcelona has four HyperTransport 3.0 links for inter-processor communication and I/O devices
HyperTransport 3.0 adds a feature called unganging, or lane-splitting
Each HT 3.0 link is composed of two 16-bit lanes (one in each direction)
Each can be split into a pair of independent 8-bit-wide links
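The bandwidth trade-off of unganging can be shown with a toy calculation: a ganged 16-bit lane moves 2 bytes per transfer, while each unganged 8-bit sublink moves 1 byte per transfer but carries its own independent stream. The model below is illustrative only; it ignores HT packet framing and clocking.

```python
# Toy model of HT 3.0 unganging: transfers needed to move a payload over
# a link of a given width (16-bit ganged vs. 8-bit unganged sublink).

def transfers_needed(num_bytes, link_width_bits):
    """Transfers required to move num_bytes over a link of the given width."""
    bytes_per_transfer = link_width_bits // 8
    return -(-num_bytes // bytes_per_transfer)   # ceiling division

# A 64B packet takes 32 transfers ganged (16-bit) or 64 transfers on one
# 8-bit sublink; but two sublinks can carry two such packets concurrently.
```

So unganging does not add raw bandwidth; it trades per-stream speed for two independent channels, which helps when traffic to two destinations would otherwise serialize.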
Shanghai
The latest model of the Opteron series
Several improvements over Barcelona
45nm process
6MB L3 cache
Improved clock speeds
A host of other improvements