Documentos de Académico
Documentos de Profesional
Documentos de Cultura
thread id
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0memory
1 2 location
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
128 byte
15.02.2017 Real-Time Graphics 2 7
Granularity Example 1/4
Coalesced
0memory
1 2 location
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
128 byte
15.02.2017 Real-Time Graphics 2 9
Granularity Example 2/4
Mixed
0memory
1 2 location
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
128 byte
15.02.2017 Real-Time Graphics 2 11
Granularity Example 3/4
Coalesced 69.69GB/s
Offset 4 20.27GB/s
Offset 2 (0.29)
10000000
1000000
ScatterAlloc
100000
Cycles
CUDA Alloc
10000
1000
1 10 100 1000 10000
Threads
15.02.2017 Real-Time Graphics 2 20
Constant memory
Read-only
Ideal for coefficients and other data that is read uniformly by warps
Data is stored in global memory, read through cache
__constant__ qualifier
limited to 64KB
Fermi added uniform access:
Load Uniform (LDU) command for const pointer
Goes through same cache
Compiler determines that all threads in a block will dereference the
same address
no size limit
15.02.2017 Real-Time Graphics 2 21
Constant memory contd
Cache throughput
4 bytes per warp per clock
All threads in warp read the same address
Otherwise serialized
__constant__ float myarray[128];
__global__ void kernel()
{
...
float x = myarray[23]; //uniform
float y = myarray[blockIdx.x + 2]; //uniform
float z = myarray[threadIdx.x]; //non-uniform
}
Thread 0
Thread 1
Thread 2
Thread 3
bank = (address/4) % 32
15.02.2017 Real-Time Graphics 2 68
Shared memory contd
Key to good performance is the access pattern
Conflict free access: all threads within a warp access
different banks (mind Kepler exceptions)
Multicast: all threads accessing the same word are
served with one transaction
Serialization: multiple access conflicts will be serialized
Introduce padding to avoid bank conflicts
Thread 31 Bank 31 Thread 31 Bank 31 Thread 31 Bank 31 Thread 31 Bank 31
conflict free access conflict free access broadcast two-way bank conflict
(different words)
15.02.2017 Real-Time Graphics 2 70
Shared memory contd
__global__ void kernel(...)
{
__shared__ float mydata[32*32];
... 0 1 2 3 4 5 6 7 8 9 3031
float sum = 0; 0 1 2 3 4 5 6 7 8 9 3031
for(uint i = 0; i < 32; ++i) 0 1 2 3 4 5 6 7 8 9 3031
sum += mydata[threadIdx.x + i*32]; 0 //conflict 8 293031
1 2 3 4 5 6 7 free
0 1 2 3 4 5 6 7 8 293031
... 0 1 2 3 4 5 6 7 8 293031
sum = 0; 0 1 2 3 4 5 6 7 8 9 30 31
0 1 2 3 4 5 6 7 8 9 30 31
for(uint i = 0; i < 32; ++i) 0 1 2 3 4 5 6 7 8 9 30 31
sum += mydata[threadIdx.x*32 + i]; 0 1//32-way
2 3 4 5 6 conflict
7 8 29 30 31
... 0 1 2 3 4 5 6 7 8 29 30 31
} 0 1 2 3 4 5 6 7 8 29 30 31
Registers
Global Shared
Memory Memory
7.6x 6x
13.5x 12x
180280 GB/s
1.3 TB/s (Fermi)
2.3 TB/s (Kepler)
8 TB/s (Fermi)
27 TB/s (Kepler)