OpenCL
Modules 1 & 2 (full day)
March 2014
George Leaver
IT Services
george.leaver@manchester.ac.uk
In case of Fire Alarm
Introduction to OpenCL 2
Course Content
Acknowledgements
Research Computing
IT Services
University of Manchester
http://www.its.manchester.ac.uk/research
What is Research Computing?
Training
Friendly staff on hand to advise on research computing issues
Comprehensive training course programme (free to attend)
http://wiki.rac.manchester.ac.uk/community/Courses
GPU Computing
Heterogeneous Computing
APUs: CPU+GPU
(Images: AMD APUs; Intel MIC architecture. Source: Intel.)
Aims
Simple FP Example: Serial CPU
x + y = z
void arrayAddCPU( int n, const float *x, const float *y, float *z )
{
for ( int i=0; i<n; i++ )
z[i] = x[i] + y[i];
}
Vector Processing in CPU
Early GPU Programming
Some OpenCL
The OpenCL way: replace loops with a kernel
A kernel is run for every data item on the GPU cores
Similarities with the previous vector and shader examples!
x + y = z

// Ordinary C code on CPU
void arrayAddCPU( const int n, const float *x,
                  const float *y, float *z )
{
    int i;
    for ( i=0; i<n; i++ )
        z[i] = x[i] + y[i];
}

// OpenCL C code on GPU
__kernel void arrayAddOCL( __global const float *x,
                           __global const float *y,
                           __global float *z )
{
    int i = get_global_id(0);
    z[i] = x[i] + y[i];
}
Think of many kernel instances

Many kernel instances map onto the GPU's cores, e.g.:
  32 cores per SM; 16 SMs x 32 = 512 cores.
  Each SM has its own scheduler / dispatch units, register file, and caches: L1 + shared memory (64KB), L2 (768KB).
GPUs
Our code
What code are we going to develop today?
Kernel(s) to run on the device (GPU)
Execute on each "data point" in the problem domain in parallel
Performs your computation
Usually a small function. Several different kernels used for large problems.
Read / write device memory
Think about which memory space a variable exists in (see later)
On the login node:

module load libs/cuda    # default is 5.5.22

Provides access to the following directories:
  $CUDA_HOME/include/   Nvidia OpenCL headers (CL/opencl.h)
  $CUDA_SDK/OpenCL/     Nvidia OpenCL SDK (sample code) - v4.2.9!!
but no libOpenCL.so, which is needed for compilation and execution and is
usually only available on the GPU nodes.
Your OpenCL app must run on a backend GPU node (3 methods):

1. qsub myjobfile, where myjobfile contains:
       #!/bin/bash
       #$ -S bash
       #$ -cwd
       #$ -V
       #$ -l nvidia -l course
       ./myoclprog
2. qsub -b y -cwd -V -l nvidia -l course ./myoclprog
3. qrsh -V -l nvidia -l course xterm

(Uppercase V; lowercase letter 'el', not the number 1; remove '-l course' outside of today's course.)
Nvidia OpenCL SDK for reference
# Now run your newly compiled OpenCL examples, eg: device query example
OpenCL/bin/linux/release/oclDeviceQuery
Exercise 0
See separate slides
OpenCL Basics
Execution Model
Background
Created by Khronos Compute Group + industry ~2008
Apple wanted a standard for many-core CPUs and GPUs
Specification to be implemented by vendors
OpenCL 1.0 (Dec 2008), 1.1 (Jun 2010), 1.2 (Nov 2011), 2.0 (Nov 2013)
Currently most vendors support up to v1.2
Resources: OpenCL Registry: http://www.khronos.org/opencl/
Specification (READ THIS)
www.khronos.org/registry/cl/specs/opencl-1.1.pdf
www.khronos.org/registry/cl/specs/opencl-1.2.pdf
www.khronos.org/registry/cl/specs/opencl-2.0.pdf
Reference card (REFER TO THIS)
http://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf
http://www.khronos.org/files/opencl-1-2-quick-reference-card.pdf
http://www.khronos.org/files/opencl20-quick-reference-card.pdf
Reference (man) pages:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/
http://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/
URLs verified March 2014
Vendor Support Q1 2014
http://developer.nvidia.com/opencl
CUDA toolkit 4.2, 5.5 (supports OpenCL 1.1)
Download CUDA driver, CUDA toolkit, GPU Computing SDK (example code)
OpenCL on Nvidia GPUs only
http://developer.amd.com/opencl
AMD APP SDK 2.9 (supports OpenCL 1.2; v2.0 with a beta driver)
Download the AMD driver if on GPU + AMD APP SDK
OpenCL on AMD GPUs & Intel/AMD CPUs (very useful)
http://software.intel.com/en-us/vcsource/tools/opencl-sdk-xe/
Intel SDK for OpenCL Apps 2012 (supports OpenCL 1.2)
Download SDK
OpenCL on many Intel CPUs (see release notes)
https://developer.apple.com/opencl
OS X core technology via Xcode (Nvidia: OpenCL 1.0, AMD: OpenCL 1.2)
URLs verified November 2013
Books
Anatomy of OpenCL
OpenCL Platform Model
The OpenCL view of your system
E.g., a processing element is a core on Nvidia h/w; the host is a CPU running your application.
Expressing Parallelism
Work-item and Work-group IDs
Kernel ID Functions: 1-D Example

A 1-D grid (32x1) with a 16x1 work-group size. (Dimension indices: 1st dim = 0, 2nd dim = 1, 3rd dim = 2.)

All ID functions here are called by a kernel:

    int i = get_global_id(0);
    output[i] = input[i]*input[i];

get_work_dim() = 1
get_global_size(0) = 32
get_num_groups(0) = 2
get_local_size(0) = 16 in both work-groups

input:   9  2  1  7  9  8  4  5  2  1  0  2  6  6  3  0  7  9  5  5  2  8  0  2  3  7  6  4  1  2  0  9
output: 81  4  1 49 81 64 16 25  4  1  0  4 36 36  9  0 49 81 25 25  4 64  0  4  9 49 36 16  1  4  0 81

E.g., the work-item with get_global_id(0) = 2 has get_group_id(0) = 0 and get_local_id(0) = 2; the work-item with get_global_id(0) = 24 has get_group_id(0) = 1 and get_local_id(0) = 8.
Kernel ID Functions: 2-D Example

A 2-D grid of 16x12 work-items with a 4x4 work-group size. (Dimension indices: 1st dim = 0, 2nd dim = 1, 3rd dim = 2.)

get_work_dim() = 2
get_global_size(0) = 16    get_global_size(1) = 12
get_num_groups(0) = 4      get_num_groups(1) = 3
get_local_size(0) = 4      get_local_size(1) = 4

E.g., the work-item at global position G:(9,4) is in the work-group with get_group_id(0) = 2 and get_group_id(1) = 1, at local position L:(1,0).
Work-groups

Work-items within a work-group execute concurrently, at least conceptually.
All work-groups have the same number of work-items.
  The hardware specifies the maximum number allowed.
  You cannot change the number of work-items once the kernel is running.
Provides scalability: the runtime distributes the work-groups across however
many Compute Units the device has, so the same code gets an automatic
speed-up (linear, in theory) on a device with more Compute Units.

(Based on diagram in Nvidia, OpenCL Programming Guide for CUDA Architecture, v4.1, 3/1/2012)
Scalable Programming
Work-item scheduling

A Compute Unit is a SIMD / vector unit.
Work-items in a work-group are grouped into:
  Warps (Nvidia: 32 work-items)
  Wavefronts (AMD: 64 work-items)
A warp contains consecutive work-item IDs:
  Warps: work-items 0-31, 32-63, ...
The same instruction from your kernel is executed at a time in the SIMD unit:
  Each work-item is executed by one element of the vector unit.
  Lock-step execution: the entire warp is processed at once by a vector unit.
If the work-group contains more work-items than a warp (it often does!):
  Warps of work-items are swapped in and out on the vector unit,
  e.g. when a warp stalls waiting for a memory transfer.
  Very little cost is associated with swapping (there is hardware to do it).
Writing an OpenCL Program
Part 1 - OpenCL C for Kernels
Our code
What code are we going to develop?
Kernel(s) to run on the device (GPU)
Execute on each "data point" in the problem domain in parallel
Performs your computation
Usually a small function. Several different kernels used for large problems.
Read / write device memory
Think about which memory space a variable exists in (see later)
Written in OpenCL C
Restricted version of ISO C99
No recursion, function pointers, or functions from the standard headers
Preprocessor (C99) supported
#ifdef etc
Scalar datatypes plus vector types
E.g., float4, int16 etc (with host equivalents: cl_float4, cl_int16)
Image types: image2d_t, image3d_t etc
Alternative to memory buffers for images (not covered)
Built-in functions: already seen work-item/group id functions
Many math.h functions, synchronisation functions
Optional functions up to vendors: e.g., double precision
OpenCL Kernels

Remember: one instance of the kernel is created for each work-item (thread).

Must have the __kernel keyword; otherwise it is a function that only a kernel can call.
No return value allowed (void).
Declare the address space of memory object arguments.
1-D Example

Raise all elements of an array by the given exponent, using a 1-D index-space.

__kernel void exponentor( __global int* data, const uint n, const uint exponent )
{
    int i = get_global_id(0);
    if ( i < n ) {
        int base = data[i];
        for ( uint e=1; e < exponent; ++e )
            data[i] *= base;
    }
}

(See a trick later to avoid passing in the exponent parameter.)
(Diagram: source 2D array of size (w,h), with origin (0,0) and centre (cx,cy), rotated into the destination 2D array.)
2-D Example

WARNING: Spot the horrible inefficiency in this kernel!

__kernel void rotArray( const float theta, const int w, const int h,
                        __global const int *inImage, __global int *outImage )
{
    int dstx = get_global_id(0);  // This work-item's point in index-space
    int dsty = get_global_id(1);  // is where we will write to (not read from)
    int cx = w/2; int cy = h/2;   // Centre pixel is the offset

    // (Completion sketch - the slide is truncated here.) Rotate the destination
    // point back about the centre to find the source point to read from:
    float s = sin(theta), c = cos(theta);  // every work-item recomputes these!
    int srcx = (int)(  c*(dstx-cx) + s*(dsty-cy) ) + cx;
    int srcy = (int)( -s*(dstx-cx) + c*(dsty-cy) ) + cy;
    if ( srcx >= 0 && srcx < w && srcy >= 0 && srcy < h )
        outImage[ dsty*w + dstx ] = inImage[ srcy*w + srcx ];
}
Vector Notation and Ops in Kernels
Vector literals
float4 fvec = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
uint4 uvec = (uint4)(1); // uvec will be (1,1,1,1)
float4 fvec = (float4)((float2)(1.0f, 2.0f), (float2)(3.0f, 4.0f));
float4 fvec = (float4)(1.0f, 2.0f); // ERROR (need 4 components)
Vector Ops

Swizzling:

float4 fvec = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
float4 swiz = fvec.wzxy;  // swiz = (4.0, 3.0, 1.0, 2.0)
float4 gvec = fvec.xxyy;  // gvec = (1.0, 1.0, 2.0, 2.0)

Assignment of components:

float4 fvec = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
fvec.xw = (float2)(5.0f, 6.0f);  // fvec = (5.0, 2.0, 3.0, 6.0)
fvec.wx = (float2)(7.0f, 8.0f);  // fvec = (8.0, 2.0, 3.0, 7.0)
fvec.xy = (float4)(1.0f, 2.0f, 3.0f, 4.0f);  // ERROR (xy is a float2)
Vector Ops

Operations and functions accept vectors:

float4 fvec = (float4)(1.0f, 2.0f, 7.0f, 8.0f);
float4 gvec = (float4)(4.0f, 3.0f, 2.0f, 1.0f);
fvec -= gvec;                         // fvec = (-3.0, -1.0, 5.0, 7.0)
gvec = fabs(fvec);                    // gvec = (3.0, 1.0, 5.0, 7.0) (fabs for floats)
float c = 3.0f;
fvec = gvec + c;                      // fvec = (6.0, 4.0, 8.0, 10.0)
fvec *= 3.142f;                       // Component-wise mult
gvec = sin(fvec);                     // sin() on each component
float4 svec = sin(fvec.xxyy);         // sin() on x,x,y,y components
float4 crsp = cross(fvec, gvec);      // Cross product (w component set to 0)
float dotp = dot(fvec, gvec);         // Dot product
float sum = dot(fvec, (float4)1.0f);  // Sum the elements of fvec
Exercise 1 (parts 1 & 2)
See separate slides
Writing an OpenCL Program
Part 2 - Host Code Overview
Our code
What code are we going to develop?
Host program
Set up your device (GPU etc)
Transfer data to/from the device
"Manage" kernels to be run on the device (see later)
Don't try to turn all your code into a kernel!
Device (GPUs) run your Kernel(s)
Execute on each "data point" in the problem domain in parallel
Performs your computation
Usually a small function. Several different kernels used for large problems.
Read / write device memory
Think about which memory space a variable exists in (see later)
Our OpenCL App
Summary of the steps
Diagram by Tim Mattson (Intel), Udeepta Bordoloi (AMD), OpenCL An Introduction for HPC programmers, ISC11 presentation
Don't Panic
Host Code Starting Point
Platform Model A quick overview
Select a Platform in your system (Nvidia, AMD, Intel)
cl_int clGetPlatformIDs( cl_uint num_entries, cl_platform_id *platforms,
cl_uint *num_platforms )
The Context
Command Queues
Each device requires its own Command Queue
Host puts commands (operations) in to the queue
This is how you ask the GPU to do something
All commands go via the queue. Like tasks.
Device executes the commands (asynchronously)
in-order : order in which you put commands in queue (default)
out-of-order : device schedules commands as it sees fit
Queue points to one device (in a context)
A device can be attached to multiple queues though
Can synchronize within and across queues
e.g., prevent one command running before another completes
cl_command_queue clCreateCommandQueue( cl_context context,
cl_device_id device,
cl_command_queue_properties properties,
cl_int *errcode_ret )
Programs
The Kernel
Steps 1-7 Done
Summary of Setup
Got a platform (Nvidia, AMD, etc)
Got device(s) supported by that platform (GPUs, etc.)
Created a context to manage devices and other objects
Created a command queue, in that context, to feed device
Created a program, in that context, from OpenCL C source code
Built (compiled and linked) the program for device(s) in the context
Extracted a specific kernel so we can launch it on the device
In the Lab
Lab code does steps 1-7 for you using our helper function
#include <CL/opencl.h> // System OpenCL header
#include <oclHelper.h> // Our helper function for the lab
// Do all the nasty OpenCL set up. Will use the first GPU device we find.
oclHelper( SETUP_ALL, &platform_id, &device_id, &context, &cmd_queue, &program,
&kernel, kernelSrcFile, kernelName, build_options, enable_timing );
Steps 8-13
OpenCL API Errors
Memory Objects / Buffers
// Create a write-only buffer: kernel will only write to it, not read it
vecA_d = clCreateBuffer( ctx, CL_MEM_WRITE_ONLY, sizeof(float)*count,
NULL, &err );
// Flags OR-ed together: kernel will only read buffer. We populate buffer too.
cl_mem vecB_d;
float *vecB_h = readFloatDataFromFile("myinput.dat", &count);
cl_mem_flags mflags = CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR;
vecB_d = clCreateBuffer( ctx, mflags, sizeof(float)*count, vecB_h, &err );
Buffer flags

cl_mem_flags is a bitfield, e.g.:

CL_MEM_READ_WRITE     Allocate a mem obj that can be read from and written to
                      by a kernel.
CL_MEM_COPY_HOST_PTR  Ask OpenCL to alloc device memory and copy data in from
                      host_ptr. Cannot be used with CL_MEM_USE_HOST_PTR.
                      Can be used with CL_MEM_ALLOC_HOST_PTR.
9. Copying data via Command Queue
Can read contiguous data back from device to host, or write from host to
device, either synchronously (blocking) or asynchronously (non-blocking).
cl_int clEnqueueReadBuffer( cl_command_queue command_queue,
cl_mem buffer, cl_bool blocking_read,
size_t offset, size_t cb,
void *ptr, cl_uint num_events_in_wait_list,
const cl_event *event_wait_list, cl_event *event )
// Set count (a large number) then alloc host array and populate it (somehow)
vecA_h = (float *)malloc(count*sizeof(float));
readMyDataFromSomewhere(vecA_h, count);
// Create a read-only buffer: kernel will only read it, not write it
vecA_d = clCreateBuffer( ctx, CL_MEM_READ_ONLY, count*sizeof(float), NULL, &err );
// Set the kernel args (refer to your kernel source code!) Note err flag trick.
err = clSetKernelArg( kernel, 0, sizeof(cl_int), &n );
err |= clSetKernelArg( kernel, 1, sizeof(cl_mem), &vecA_d ); // Input mem obj
err |= clSetKernelArg( kernel, 2, sizeof(cl_mem), &vecB_d ); // Input mem obj
err |= clSetKernelArg( kernel, 3, sizeof(cl_mem), &vecC_d ); // Output mem obj
if ( err != CL_SUCCESS ) ...
// No work offset (work-item ids start at 0). No events to wait for or returned.
err = clEnqueueNDRangeKernel( cmd_queue, kernel, 1, NULL, &globalSize,
                              &localSize, 0, NULL, NULL );
// Command returns immediately to host, which can carry on working (see later)
Launching the Kernel

The work_dim arg specifies the array size of the global/local work args.
global_work_offset changes the initial value of get_global_id():
  Pass in NULL to keep it at (0, 0, 0) (must be NULL in OpenCL 1.1).
global_work_size gives the total number of work-items required.
local_work_size gives the work-group size.

// 2-D example: processing a matrix of size 2048 rows x 4096 columns.
// The number of columns (4096) is the X dimension (or width);
// the number of rows (2048) is the Y dimension (or height).
// No work offset (work-item ids start at 0). No events to wait for or returned.
err = clEnqueueNDRangeKernel( cmd_queue, kernel, 2, NULL, globalSize,
                              localSize, 0, NULL, NULL );
Number of work-items and work-groups
Number of work-groups
More work-items than needed

What if the global / local numbers don't divide?
  Silly example: your array length is prime.
  More realistic: global_work_size[0] = 100,000 with local_work_size[0] = 64
  asks OpenCL for 1562.5 work-groups! Error.

Three ways to map work-items to array elements:

// 10,000,000 work-items, each processes 1 element (note the i<n guard)
__kernel void myK( const int n, __global const float *x, ... ) {
    int i = get_global_id(0);
    if ( i<n )
        do_something_with( x[i] );
}

// 1,000,000 work-items, each processes 10 elements, stride = 1
__kernel void myK( const int n, __global const float *x, ... ) {
    int i = get_global_id(0);
    for ( int j=0; j<10; j++ )
        do_something_with( x[i*10+j] );
}

// 1,000,000 work-items, each processes 10 elements,
// stride = #work-items (1,000,000) (most efficient - see later)
__kernel void myK( const int n, __global const float *x, ... ) {
    int i = get_global_id(0);
    int stride = get_global_size(0);
    for ( ; i<n; i+=stride )
        do_something_with( x[i] );
}
Choosing the work-group size
12. Transfer results back to Host
Enqueue a read command to get data back from device
As before, transfer can be blocking or non-blocking
Don't transfer back if simply passing cl_mem obj to another kernel
An in-order queue guarantees the kernel has finished
long count; // Example number of elements in my array
float *vecC_h; // Host array to be populated with results
cl_mem vecC_d; // Device mem obj written to by kernel
// Alloc host array to hold results from device (usually allocated earlier)
vecC_h = (float *)malloc(count*sizeof(float));
Waiting for the Kernel
13. Tidying Up

// This ensures all previous commands have finished (may not be needed)
clFinish( cmd_queue );

A complete minimal host program (no error checking):

// Kernel as one long single string (normally we read from a file but hard to fit on a single slide!)
const char *ksource = \
"__kernel void myKernel( const int n, __global const float *inArr, __global float *outArr ) {\n" \
"  int idx = get_global_id(0);\n" \
"  // Do something wonderful in OpenCL\n" \
"  outArr[idx] = inArr[idx]*inArr[idx];\n" \
"}\n";

#define VSIZE 1024*1024

int main( void )
{
    cl_int err;
    cl_uint num_platforms;
    cl_platform_id p_id;
    cl_device_id d_id;
    cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, 0, 0};
    cl_context ctx;
    cl_command_queue cq;
    cl_program prog;
    cl_kernel kernel;
    cl_mem A_d, B_d;
    static float A_h[VSIZE], B_h[VSIZE];
    long nbytes;
    int vsize = VSIZE;
    size_t gsize = VSIZE;

    nbytes = VSIZE*sizeof(float);   // Num bytes to alloc
    initArray( A_h, VSIZE );        // Random values?

    // Assuming just one platform, one GPU
    err = clGetPlatformIDs( 1, &p_id, &num_platforms );
    clGetDeviceIDs( p_id, CL_DEVICE_TYPE_GPU, 1, &d_id, NULL );
    cprops[1] = (cl_context_properties)p_id;
    ctx = clCreateContext( cprops, 1, &d_id, NULL, NULL, NULL );
    cq  = clCreateCommandQueue( ctx, d_id, 0, NULL );
    prog = clCreateProgramWithSource( ctx, 1, &ksource, NULL, NULL );
    clBuildProgram( prog, 0, NULL, NULL, NULL, NULL );
    kernel = clCreateKernel( prog, "myKernel", NULL );

    // Zeros at the end of these calls should really be NULLs
    A_d = clCreateBuffer( ctx, CL_MEM_READ_ONLY, nbytes, 0, 0 );
    B_d = clCreateBuffer( ctx, CL_MEM_WRITE_ONLY, nbytes, 0, 0 );

    // Enqueue a blocking write from host to device mem obj
    clEnqueueWriteBuffer( cq, A_d, CL_TRUE, 0, nbytes, A_h, 0, 0, 0 );

    // Set the kernel args
    clSetKernelArg( kernel, 0, sizeof(int), &vsize );
    clSetKernelArg( kernel, 1, sizeof(cl_mem), &A_d );
    clSetKernelArg( kernel, 2, sizeof(cl_mem), &B_d );

    // Enqueue a kernel launch using a 1-D NDRange. Auto W-G size.
    clEnqueueNDRangeKernel( cq, kernel, 1, NULL, &gsize, 0, 0, 0, 0 );

    // Enqueue a blocking readback from the device
    clEnqueueReadBuffer( cq, B_d, CL_TRUE, 0, nbytes, B_h, 0, 0, 0 );
    // Could do something with the results in B_h now ...

    // Tidy up
    clReleaseMemObject( A_d );
    clReleaseMemObject( B_d );
    clReleaseProgram( prog );
    clReleaseKernel( kernel );
    clReleaseCommandQueue( cq );
    clReleaseContext( ctx );
}
Exercise 1 (part 3)
Exercise 2 (part 1)
Memory and Kernels
Memory hierarchy
Kernel Address Space Qualifiers
The OpenCL memory model
The OpenCL memory model:

Private mem (__private): read/write per work-item. Each work-item 1..N in
each work-group has its own private memory.
Local mem (__local): read/write by the work-items in a work-group. One local
memory per work-group 1..M.
Global mem (__global): read/write by work-items in all work-groups (global
and constant memory).
Local Memory
Local Memory cont
__kernel void myKernel(const int n, __global float *x, __local float *localArray)
{
int local_idx = get_local_id(0); // 0 ... get_local_size(0)-1
localArray[local_idx] = ...; // work-items in work-group init array
}
Memory Consistency
Local memory highlights a potential problem
All work-items in a work-group writing to a local array
How do we know when all work-items have written (or read, etc)?
__local float wgValues[64]; // Assume work-group size of 64
wgValues[get_local_id(0)] = myComputation(...);
// Have all work-items in the work-group written to wgValues[]? See later
if ( get_local_id(0) == 0 ) { // Lots of idle work-items!
    float sum = 0.0f;
    for ( int i=0; i<64; i++ )
        sum += wgValues[i];
}
Memory model has relaxed consistency
Different work-items may see a different view of local and global
memory as computation progresses
Load/store consistency within a work-item
Global & local mem consistent within a work-group at barrier
Command queue events ensure consistency between kernel calls
Global Memory
Constant Memory
Similar to __global but read-only within a kernel
May be allocated in same global memory
Device may have a separate constant mem cache (e.g., 64KB)
Vars with 'program scope' must be declared __constant
// Create a read-only cl_mem obj. Initialize with these host values.
const float table_h[5] = {1.0, 3.0, 5.0, 3.0, 1.0};
cl_mem table_d = clCreateBuffer( ctx, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR,
sizeof(float)*5, table_h, NULL );
// Pass read-only buffer to kernel just like any other cl_mem obj
err = clSetKernelArg( kernel, 0, sizeof(int), &numElems ); // As earlier
err |= clSetKernelArg( kernel, 1, sizeof(cl_mem), &vecA_d ); // As earlier
err |= clSetKernelArg( kernel, 2, sizeof(cl_mem), &table_d ); // constant
__constant int myConst = 123; // Var has 'program scope' (use pre-processor)
__kernel void myKernel(const int n, __global float *x, __constant float *table)
{
for ( int i=0; i<5; i++ )
// Do something with table[i]
}
Accessing Memory

Coalesced __global memory access for efficiency:
  work-item i accesses array[i]
  work-item i+1 accesses array[i+1]

Access to C matrices (row-major), e.g. a 1-D index-space where a single work-item processes an entire row of a C matrix (seen in mat-mul examples):

// C (CPU cache-friendly)
for (r=0; r<height; r++)
    for (c=0; c<width; c++)
        z = A[r*width+c];

// Kernel (non-coalesced): each work-item reads along its own row
WI = get_global_id(0);
for (c=0; c<width; c++)
    z = A[WI*width+c];

// Kernel (coalesced): adjacent work-items read adjacent elements of a row
WI = get_global_id(0);
for (r=0; r<height; r++)
    z = A[r*width+WI];
Coalescing

The rules are slightly more relaxed than a strict work-item i -> array[i] mapping:
accessing sequential elements in a half-warp = 1 memory op!

(Diagram: the 32 work-items of a warp read 32 sequential __global memory elements; each half-warp of 16 work-items is serviced by a single memory transaction.)
Memory sizes

// (Kernel signature reconstructed - the slide shows only the body.)
__kernel void myKernel( __global const float *A, __local float *tmp_arr )
{
    int gbl_id = get_global_id(0);
    int loc_id = get_local_id(0);
    int loc_sz = get_local_size(0);

    // For some reason we want to fill up the first half of the local array
    if ( loc_id < loc_sz/2 )
        tmp_arr[loc_id] = A[gbl_id];

    // All work-items must hit the barrier. They'll all see a consistent tmp_arr[]
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item can now use the elements from tmp_arr[] safely.
    // Often used if we'd be repeatedly accessing the same A[] elements.
    for ( int j=0; j<loc_sz; j++ )
        my_compute( gbl_id, tmp_arr[j], A );
}
Example: Parallel Reduction (CPU)

With OpenMP each thread computes a partial sum, and OpenMP performs the final reduction for you, e.g. partial sums 45 + 20 + 38 + 32 = 135.

Example: Parallel Reduction in __local memory (work-group size 16)

Each step halves the stride; work-item i adds element i+stride to element i, with a barrier between steps:

get_local_id(0):  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
__local memory:   9  2  1  7  9  8  4  5  2  1  0  2  6  6  3  0
stride = 8:      11  3  1  9 15 14  7  5      barrier(CLK_LOCAL_MEM_FENCE)
stride = 4:      26 17  8 14                  barrier(CLK_LOCAL_MEM_FENCE)
stride = 2:      34 31                        barrier(CLK_LOCAL_MEM_FENCE)
stride = 1:      65  (the work-group's sum)

// Copy from global mem in to local memory (should check for out of bounds)
localSums[local_id] = x[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for (uint stride=group_size/2; stride>0; stride /= 2) { // stride halved each pass
    if (local_id < stride)          // (loop body reconstructed - slide truncated)
        localSums[local_id] += localSums[local_id + stride];
    barrier(CLK_LOCAL_MEM_FENCE);
}
// localSums[0] now holds the work-group's sum
OpenCL_Programming_Overview.pdf
OpenCL_Programming_Guide.pdf
OpenCL_Best_Practices_Guide.pdf