
MATLAB Optimization, Parallelism, and GPU Computing


Kai Mollerud
CEMS IT office
What I'll Cover
Basics: what parallel computing is, and why GPUs are so good at it.
When is a GPU better than a CPU?
What you'll need / how to use the GPU
Development process
Learning to write fast, non-GPU programs
Turning non-GPU programs into GPU programs
Parallelism
This is the key idea behind all high-performance computing, especially GPU computing.
Parallelism can be difficult to fully understand, because people don't often do things in parallel.
Here is an image of some real-world parallel problem solving:
Analogue Parallelism
How is This Parallelism?
The chalk holder is performing what's called a SIMD operation: single instruction, multiple data.
Each piece of data (chalk) must be of the same type to fit in the array, but they can have different values (color, length).
Likewise, a computer can perform the same operation on each element in an array simultaneously.
So, why GPU computing?
GPU vs. CPU
A modern CPU has between 2 and 16 processing cores.
CPUs are designed to handle a wide array of tasks, often performing several heterogeneous operations at once.
A modern GPU, on the other hand, can have up to 2048 stream processors.
A GPU's usual job is to decide what color each pixel on your monitor should be; a 1080p monitor has 2,073,600 pixels that can change color ~60 times a second.
Parallel Problems
Not all problems are well suited to parallel computation.
There are 3 levels of parallelism, determined by how much the operations involved depend on each other:
Fine-grained, Coarse-grained, Embarrassingly parallel
Put simply, GPU computing is best suited to embarrassingly parallel problems, and sometimes usable for problems with coarse-grained parallelism.
The technical reasoning here revolves around memory performance; ask me later if you would like a more detailed explanation.
When to use GPU computing
Just because a problem is parallel doesn't mean GPU computing is the right choice.
CPUs can do multiple operations at once, and run much faster than GPUs.
Where GPUs really shine is on problems that are parallel and have very large amounts of data to process.
Whether a problem will really benefit from GPU computing isn't always obvious until you have actually written the program.
Luckily, MATLAB makes it easy to write a program for the CPU first, then adapt it to the GPU to see if it's worth it.
The Development Process
Step 1) Write a program
Step 2) Make the program fast
Step 3) Adapt the program to use the GPU
Step 1) Write a program
When you start writing a program, performance is not important.
Try to focus on good organization of your program; make it easy to read and modify.
Keeping things organized will make the next 2 steps much easier.
Personally, I start by writing comments to describe each block of code.


Example Code #1
first_draft.m
1. Populates an array with some floating point values
2. Calculates the mean value of the array
3. Performs an operation on each element
4. Repeats steps 1-3 1000 times
This obviously isn't a useful calculation, but it is computationally similar to some programs I have seen researchers using.
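The actual first_draft.m isn't reproduced here; a minimal sketch with the same four-step structure might look like the following (the array size, the per-element operation, and all variable names are assumptions):

```matlab
% Hypothetical sketch of first_draft.m: loop-heavy first draft.
N = 1000;                        % array size (assumed)
for trial = 1:1000               % 4) repeat steps 1-3 1000 times
    % 1) Populate an array with floating point values
    data = zeros(1, N);
    for i = 1:N
        data(i) = rand();
    end
    % 2) Calculate the mean value of the array
    m = mean(data);
    % 3) Perform an operation on each element
    for i = 1:N
        if data(i) > m
            data(i) = data(i) - m;
        else
            data(i) = data(i) + m;
        end
    end
end
```

The nested loops and the per-element if/else are exactly the patterns the later drafts remove.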
Step 2) Make it fast
This is not a simple subject: computers are complex, and making a program run quickly means understanding how the computer runs the program.
An inefficient program won't get better just because you run it on the GPU.
Rather than tell you every trick I know for speeding up programs, I'll show you how to experiment and learn.
I'll also show you a few tricks.
Optimization tools
Code profiler
Programs run a bit slower in the profiler
You can save the output of the profiler as an HTML file to look at later; this is useful when measuring performance changes.
Control your runtime
You will need to run your code again and again
Scale down the simulation detail, comment out plotting functions, etc.
If it's part of a larger program, find a way to isolate it from the rest.
tic + toc
The code profiler does this for you, but sometimes you just want one number to look at, and these are easy to use.
Use a fast computer.
If your group runs simulations, you should think about getting a dedicated computer to run them on.
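A minimal illustration of tic + toc (the matrix size here is arbitrary):

```matlab
tic;                           % start the stopwatch
A = rand(5000);                % the code you want to time
B = A * A;
elapsed = toc;                 % seconds elapsed since tic
fprintf('Elapsed: %.3f s\n', elapsed);
```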
Optimization techniques
Avoid nesting loops if at all possible
Use for loops instead of while loops
Not necessarily faster, but cleaner and easier to parallelize
Avoid conditionals
Use the find() function
If you use an if/else, put the most common case first.
Consider using a switch statement
Avoid calling functions inside loops.
Think about MEX functions for very big calculations
Lets you use C programs from MATLAB
C is a lot faster than MATLAB
Don't use the mean() function, it's slow. Use sum()/numel()
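To make the find() and sum()/numel() tips concrete, here is a small sketch (the 0.5 threshold and variable names are arbitrary):

```matlab
x = rand(1, 1e6);
% Loop + conditional (slow):
%   for i = 1:numel(x)
%       if x(i) > 0.5, x(i) = 0; end
%   end
% Vectorized equivalent using find():
idx = find(x > 0.5);
x(idx) = 0;
% (logical indexing, x(x > 0.5) = 0, also works)

m = sum(x) / numel(x);    % same result as mean(x)
```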

Example code #2
Second_draft.m
About 92% faster than #1
Uses find() to avoid conditionals
Eliminates the nested loops by using vector operations
Replaces the mean() function with sum()/numel()
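As with the first draft, Second_draft.m itself isn't shown; a hypothetical vectorized version of the same computation, applying the three changes listed above, might look like:

```matlab
% Hypothetical sketch of Second_draft.m (names and the
% per-element operation are assumptions).
N = 1000;
for trial = 1:1000
    data = rand(1, N);               % populate in one vector call
    m = sum(data) / numel(data);     % sum()/numel() instead of mean()
    hi = find(data > m);             % find() replaces the if/else
    lo = find(data <= m);
    data(hi) = data(hi) - m;         % vector ops replace the inner loop
    data(lo) = data(lo) + m;
end
```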
Step 3) Using the GPU
MATLAB uses vectors for everything. GPUs are built for vector operations.
This makes the conversion really easy.
To do GPU computing in MATLAB you will need:
Parallel Computing Toolbox (the university has this licensed)
An NVIDIA graphics card with compute capability version 1.3 or higher.
Entry cost of about $150 for a decent card
GPU functions
Performing a calculation on the GPU involves 2-3 steps:
Put the data you need into GPU memory
Call a GPU-enabled function on that data
Move the results from GPU memory to CPU memory.
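Those steps might look like this minimal sketch (the operation itself is arbitrary):

```matlab
a = rand(1000);      % ordinary array in CPU memory
g = gpuArray(a);     % step 1: copy the data into GPU memory
h = sin(g) .^ 2;     % step 2: GPU-enabled operations return gpuArrays
r = gather(h);       % step 3: copy the result back to CPU memory
```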

Putting data on the GPU
MATLAB's Parallel Computing Toolbox provides the gpuArray data type
Any gpuArray variable is stored in GPU memory
gpuArray supports most data types, and behaves more or less the same as a normal array
Any operation on a gpuArray variable will return a gpuArray variable.
Putting data on the GPU
You can create gpuArrays in 2 ways:
Copy a variable from CPU memory to GPU memory
Create a variable directly on the GPU
Copying a variable to the GPU
a and b are independent; subsequent operations on one do not affect the other
a must be nonsparse, and must be of type single, double, int/uint 8/16/32/64, or logical
i.e. no custom data types
b has a 108-byte placeholder in CPU memory, and uses 1600 bytes of GPU memory
Transferring takes time; don't do it inside a loop
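A minimal example consistent with the a/b description above (the 10×20 shape is an assumption, chosen to match the 1600-byte figure: 200 doubles × 8 bytes):

```matlab
a = rand(10, 20);    % 200 doubles (1600 bytes) in CPU memory
b = gpuArray(a);     % b is a copy in GPU memory
b = b + 1;           % runs on the GPU; a is unchanged
```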
Creating data on a GPU directly
You can use: ones, zeros, inf, nan, true, false, eye, colon, rand, randi, randn, linspace, logspace
This avoids the time cost of transferring from CPU memory to GPU memory.
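For example, using some of the constructors listed above (exact call syntax depends on your MATLAB version; both forms below are from the Parallel Computing Toolbox):

```matlab
z = zeros(500, 'gpuArray');        % 500x500 zeros, created on the GPU
r = gpuArray.rand(1, 1e6);         % random values generated on the GPU
v = gpuArray.linspace(0, 1, 100);  % linspace, also directly on the GPU
```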
GPU computing functions
MATLAB has overloaded many functions to execute on the GPU when you call them with a gpuArray as an argument.
A few important ones: trig functions, log, find, max, plot (& related)
Full list online: http://www.mathworks.com/help/distcomp/using-gpuarray.html (some added in 2013b not listed)
Example code #3
third_draft.m
Almost identical to #2
Turns the array into a gpuArray so the operations are run on the GPU
Actually a bit slower than #2
That is, slower when using the same parameters. More on this shortly.
Bringing GPU data back
The gather() function takes in a gpuArray and copies it to CPU memory.
Again, this takes time; try to leave data on the GPU as long as you can and transfer all of it back at once.
I can go into detail about GPU vs. CPU memory behavior later if there's time/interest; otherwise ask me / email me.
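A small sketch of that advice (the loop body is arbitrary):

```matlab
g = gpuArray.rand(1, 1e6);
% Bad: calling gather() inside the loop forces a GPU-to-CPU
% transfer on every iteration.
% Better: keep everything on the GPU, gather once at the end.
for k = 1:100
    g = g .* 2;          % stays in GPU memory
end
x = gather(g);           % single transfer back to CPU memory
```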
Using the GPU in your code
Knowing how to use the GPU is half the battle; the rest is knowing when.
There's a simple way to learn this: take some code, change something to a gpuArray, and see how the runtime changes.
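That experiment can be as simple as the following sketch (fft and the matrix size are arbitrary choices; wait(gpuDevice) makes the GPU finish before toc so the timing is honest):

```matlab
n = 2000;
a = rand(n);
tic; c = fft(a); tCPU = toc;                     % CPU version

g = gpuArray(a);
tic; d = fft(g); wait(gpuDevice); tGPU = toc;    % same call on a gpuArray
fprintf('CPU: %.3f s   GPU: %.3f s\n', tCPU, tGPU);
```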
When to use the GPU
GPUs are good for:
Big arrays/vectors
Doing simple tasks many times
They're bad for:
Conditional logic
Manipulating a few specific array elements
Quantitative example
I wrote 3 programs to do the same task. The task exhibits coarse-grained parallelism and has a deterministic run-time.
Naive.m is a simple, non-parallel implementation. It isn't exceptionally bad, but no effort has been made to make it run efficiently.
CPU.m is a CPU-only, parallel implementation that is essentially as fast as it can be.
GPU.m is very similar to CPU.m, but uses GPU operations wherever possible.
I recorded performance metrics from these 3 programs across a range of inputs, increasing the size of the input data each time.
Testing details
The tests were run on a Dell OptiPlex 990
Intel i5-2400, 4 cores @ 3.1 GHz (3.3 GHz with Turbo Boost)
4 GB 1333 MHz RAM
NVIDIA GeForce GTX 650 Ti
1 GB GDDR5 memory @ 5400 MHz
768 CUDA cores @ 941 MHz
Windows 7 64-bit Enterprise
The numbers I gathered are unique to this computer. Your results will vary, but should follow similar trends.
Runtime vs. array size (chart)
Elements per second (chart)
Coding for the GPU
Try not to move data between CPU and GPU very often
Replace conditional logic with set theory (loops and if statements vs. vector ops and find())
Try to isolate variables.
Storing values in an array to look at later can replace random accesses to those values while calculating them
Be clever.
You may need to change your entire approach to a problem to get the most out of GPU computing
Questions?
