
Python for GPUs

Bryan Catanzaro, NVIDIA Research



Some slides from Mark Harris (NVIDIA) and Andreas Klöckner (NYU)
Rapid Development
Powerful Libraries
Commercial Support
Large Community
Is Python Fast Enough?
Python apps often implement performance-critical functions in C/C++.
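For illustration, a minimal sketch of that pattern, with the hot loop delegated to NumPy's compiled C routines instead of a Python-level loop (hypothetical code):

import numpy as np

def saxpy_python(a, x, y):
    # Pure Python: the loop runs in the interpreter, element by element.
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy_numpy(a, x, y):
    # Same computation; the loop runs inside NumPy's compiled C code.
    return a * x + y

x = np.random.rand(1000000)
y = np.random.rand(1000000)
z = saxpy_numpy(2.0, x, y)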
Three Python projects
PyCUDA/PyOpenCL (Andreas Klöckner)
Bindings for GPU runtimes
Intended to be used with Runtime Code Generation

NumbaPro (Continuum Analytics)
Write CUDA code in Python
GPU bindings
Copperhead (Bryan Catanzaro)
A data parallel Python dialect
Runtime compiled to GPUs and CPUs
PyCUDA: Programming Approaches
Decisions that determine your approach to throughput computing:
AOT vs JIT
Meta vs not
In-language vs Hybrid
If hybrid, why not use a scripting language?
PyCUDA: Why do scripting?
GPUs are everything that scripting languages are not:
Highly parallel
Very architecture-sensitive
Built for maximum FP/memory throughput
GPUs and scripting languages complement each other:
CPU: largely restricted to control tasks (1000/sec)
Scripting fast enough
Python + OpenCL = PyOpenCL
Python + CUDA = PyCUDA
Dive into PyCUDA
!"#$%& #()*+,-,*&$!.!&
!"#$%& #()*+,-+%!/0% ,1 +%/
!"#$%& .*"#(

2%$" #()*+,-)$"#!30% !"#$%& 4$*%)05$+*30
"$+ 6 4$*%)05$+*307888
99:3$;,399 /$!+ "*3&!#3(9&<0"723$,& =+01&> 23$,& =,> 23$,& =;?
@
)$.1& !.& ! 6 &<%0,+A+B-BC
+01&D!E 6 ,D!E = ;D!EC
F
888?

CUDA Code
Dive into PyCUDA, cont.
"*3&!#3(9&<0" 6 "$+-:0&92*.)&!$.78"*3&!#3(9&<0"8?

, 6 .*"#(-%,.+$"-%,.+.7GHH?-,1&(#07.*"#(-23$,&IJ?
; 6 .*"#(-%,.+$"-%,.+.7GHH?-,1&(#07.*"#(-23$,&IJ?

+01& 6 .*"#(-K0%$193!L07,?
"*3&!#3(9&<0"7
+%/-M*&7+01&?> +%/-A.7,?> +%/-A.7;?>
;3$)L67GHH>N>N?> :%!+67N>N??

#%!.& +01&O,=;
numpy interop
kernel launch
PyCUDA/PyOpenCL Philosophy
Provide complete access
Automatically manage resources
Provide abstractions
Allow interactive use
Check for and report errors automatically
Integrate tightly with numpy
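A minimal sketch of what the numpy integration looks like in practice, along the lines of the PyCUDA examples (array shape and values are arbitrary):

import numpy as np
import pycuda.autoinit              # initializes CUDA and creates a context
import pycuda.gpuarray as gpuarray

a = np.random.randn(4, 4).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)          # numpy array -> GPU array
a_doubled = (2 * a_gpu).get()       # arithmetic runs on the GPU; .get() copies back
print(a_doubled)
print(2 * a)                        # same result computed by numpy on the CPU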
PyCUDA/PyOpenCL: Completeness
PyCUDA exposes all of the CUDA driver API
For example:
Streams/events
Surfaces/textures
Peer to peer access, pinned memory
Profiling, ... (a short events example follows below)
PyOpenCL exposes all of OpenCL
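For example, driver-level events give exact device timing of GPU work (a minimal sketch; the array and the operation being timed are arbitrary):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

x = gpuarray.to_gpu(np.random.randn(1 << 20).astype(np.float32))

start, end = drv.Event(), drv.Event()
start.record()                      # enqueue 'start' on the default stream
y = 2 * x + 1                       # GPU work to be timed
end.record()                        # enqueue 'end' after the work
end.synchronize()                   # wait until the GPU reaches 'end'
print("GPU time: %.3f ms" % start.time_till(end))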
Workflow
[Workflow diagram: Edit → PyOpenCL/PyCUDA → Run → Program("...") → in cache? → if not, Compiler → Binary → Upload to GPU → Run on GPU]
Metaprogramming
Idea: in GPU scripting, GPU code does not need to be a compile-time constant.
(Key: code is data; it wants to be reasoned about at run time.)
Good for code generation.
[Diagram: Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result; the human writes the Python code, the machine handles the rest.]
How to metaprogram in PyCUDA/PyOpenCL
Three (main) ways of generating code:
Simple %-operator substitution (a sketch follows after this list)
Combine with C preprocessor: simple, often sufficient
Use a templating engine (Mako works very well)
codepy:
Build C syntax trees from Python
Generates readable, indented C
Many ways of evaluating code; the most important one:
Exact device timing via events
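A minimal sketch of the first approach, %-operator substitution; the kernel itself is hypothetical:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Kernel source is an ordinary Python string; tuning parameters are
# substituted at run time, before the GPU compiler ever sees the code.
source = """
__global__ void scale(float *x)
{
    const int i = blockIdx.x * %(block_size)d + threadIdx.x;
    x[i] = %(factor)f * x[i];
}
"""

block_size = 128
mod = SourceModule(source % {"block_size": block_size, "factor": 2.0})
scale = mod.get_function("scale")

x = np.random.randn(1024).astype(np.float32)
scale(drv.InOut(x), block=(block_size, 1, 1), grid=(1024 // block_size, 1))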
Other nice things
Elementwise functions very similar to numpy ufuncs
reductions, scans
gpuarray with overloaded arithmetic operators
random number generators
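A sketch combining several of these, following the patterns in the PyCUDA documentation (names and sizes are illustrative):

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel
from pycuda.curandom import rand as curand

# An elementwise kernel: write the body once, apply it to whole gpuarrays.
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a * x[i] + b * y[i]",
    "lin_comb")

x = curand((1000,))                 # uniform random floats generated on the GPU
y = curand((1000,))
z = gpuarray.empty_like(x)
lin_comb(2.0, x, 3.0, y, z)         # runs as a single GPU kernel

print(gpuarray.sum(z).get())        # reduction, also on the GPU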
PyCUDA/PyOpenCL information
http://mathema.tician.de/software/pyopencl (or /pycuda)
Downloads:
Direct: PyOpenCL 60k, PyCUDA 30k
Binaries: Win, Debian, Arch, Fedora, Gentoo, ...
MIT License
Compiler cache, RAII, error checking
Requires: numpy, Python 2.4+ (Win/OS X/Linux)
Community: mailing list, wiki, add-on packages (PyFFT, scikits.cuda, Sailfish, PyWENO, Copperhead, ...)
NumbaPro from Continuum
Anaconda Accelerate from Continuum Analytics
NumbaPro: array-oriented compiler for Python & NumPy
Compile Python for GPUs or CPUs
Automatically compile Python functions on NumPy arrays
Or write CUDA Python kernels for maximum performance
Fast Development + Fast Execution: Ideal Combination
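A sketch of the "automatically compile Python functions on NumPy arrays" path; the import and the target name follow the NumbaPro documentation of the time and should be treated as assumptions here:

import numpy as np
from numbapro import vectorize      # NumbaPro-era import (assumption)

@vectorize(['float32(float32, float32)'], target='gpu')   # 'gpu' target name assumed
def add(a, b):
    # Scalar function; the decorator turns it into a GPU ufunc.
    return a + b

x = np.arange(1024, dtype=np.float32)
y = np.arange(1024, dtype=np.float32)
print(add(x, y))                    # transfers and kernel launch handled automatically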
http://continuum.io
Free Academic
License
1024² Mandelbrot      Time      Speedup vs. Pure Python
Pure Python           4.85 s    --
NumbaPro (CPU)        0.11 s    44x
CUDA Python (K20)     0.004 s   1221x
import numpy as np
from numbapro import cuda, uint8, uint32, f8   # NumbaPro-era imports (assumed)

@cuda.jit(restype=uint32, argtypes=[f8, f8, uint32], device=True)
def mandel(x, y, max_iters):
    # Device function: iteration count until |z| >= 2 for c = x + y*i.
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i
    return max_iters

@cuda.jit(argtypes=[uint8[:,:], f8, f8, f8, f8, uint32])
def mandel_kernel(img, min_x, max_x, min_y, max_y, iters):
    # One thread per pixel; cuda.grid(2) yields this thread's (x, y) index.
    x, y = cuda.grid(2)
    if x < img.shape[0] and y < img.shape[1]:
        img[y, x] = mandel(min_x + x * ((max_x - min_x) / img.shape[0]),
                           min_y + y * ((max_y - min_y) / img.shape[1]), iters)

gimage = np.zeros((1024, 1024), dtype=np.uint8)
d_image = cuda.to_device(gimage)
# Launch a 32x32 grid of 32x32-thread blocks: one thread per pixel of the 1024x1024 image.
mandel_kernel[(32, 32), (32, 32)](d_image, -2.0, 1.0, -1.0, 1.0, 20)
d_image.to_host()
CUDA Python: CUDA programming, Python syntax
Copperhead
Goal: efficiency and productivity
Note: Copperhead is a research project, not a product.
[Diagram: Copperhead sits at the intersection of Python, data parallelism, and the need for productivity.]
Copperhead code is just Python code.
No C-isms, no annotations.
http://copperhead.github.io
Hello world of data parallelism
Consider this intrinsically parallel procedure:

def axpy(a, x, y):
    return map(lambda xi, yi: a * xi + yi, x, y)

or, for the lambda averse:

def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

This procedure is both
completely valid Python code
compilable to data parallel substrates (CUDA, OpenCL, OpenMP+AVX intrinsics, etc.)
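In practice such a procedure is marked for compilation with Copperhead's @cu decorator and called on numpy arrays; the exact import below is an assumption:

import numpy as np
from copperhead import cu           # assumption: @cu marks functions for compilation

@cu
def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

x = np.arange(100, dtype=np.float32)
y = np.arange(100, dtype=np.float32)
z = axpy(2.0, x, y)                 # compiled at first call, run on the selected place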
Support for Heterogeneity
Programmer specifies the execution place:

with places.gpu0:
    gpu_result = axpy(...)

with places.openmp:
    cpu_result = axpy(...)

Currently supported:
CUDA
OpenMP
TBB
Sequential C++
Runtime Data Management
The Copperhead runtime manages all data
Data lazily transferred to and from memory spaces
Memory is garbage collected via Python's garbage collector
Data interoperates with numpy, matplotlib, etc.
a = ...              # some input data
b = foo(a)
c = foo(b)
d = foo(c)
print(d)

[Diagram: the same arrays (a, b, c, d) live in CPU and GPU memory spaces; values move between the two lazily.]
Runtime code generation
Copperhead compiler produces C++ code
C++ code is compiled to a dynamic library using codepy
Compilation artifacts persistently stored in __pycache__
Runtime overhead: ~10-100 µsec (from Python, per fn call)
[Charts: Minimal Black Scholes example, showing compile time (seconds) and per-call execution overhead (seconds).]
Some results (GTX480)
Solving Laplace's equation (from Travis Oliphant's blog)





[Charts: runtime in seconds (log scale) for Pure Python, Numpy, and Copperhead.]
Sorting an array of 1M float32 elements




Conclusion
Increasing options for Python on GPUs:
PyCUDA/PyOpenCL (Andreas Klöckner)
Bindings for GPU runtimes

NumbaPro (Continuum Analytics)
Write CUDA code in Python

Copperhead (Bryan Catanzaro)
A data parallel Python dialect

Questions?
Bryan Catanzaro
bcatanzaro@nvidia.com

http://research.nvidia.com
