on High-Performance Multi-GPU Systems
Supervisor
Prof. Alberto Sánchez Campos
Prof. Miguel Angel Otaduy Tristán
Candidate
Dr. Marcos Novalbos Mendiguchía
July 2015
Dr. Alberto Sánchez Campos, with ID number 50120554N, and Dr. Miguel Angel
CERTIFY
October 2015.
Copyright 2015
This work is the sum of the efforts of many people who, directly or indirectly,
have contributed to its development. For that reason I wanted to dedicate it to everyone who at some
and to the people who, without being closely involved, have given me good advice at some
moment.
that is done to meet the objectives. And in any case, if something goes wrong one can always
thesis supervisors; without them this work would have been impossible. I want to thank Alberto for
the confidence he placed in me from the very beginning; it has been many years working
I wanted to thank Miguel Ángel for all the hours invested in this work, the revisions
the last months before the submission of this work; few people show so much interest
Martínez Benito, Álvaro López-Medrano and Roldán Martínez for the effort they
invested in the company they founded and with which they tried to make a difference. And of
course, I wanted to thank Jaime for all these years working together, hours
GPU.
2011 (past and present), with whom I have shared the working hours over
the last 4 years. We have lived many experiences together and
helped each other through them; many thanks to Alberto, Álvaro, Ángela, Carlos,
David, Gabriel, Javier, Jorge, Mónica, Laura and Zeleste. And, in addition, to the rest of
GMRV, especially Jose, Juanpe, Luis, Marcos, Óscar, Pablo, Richard SM, Sofía
and Susana.
Spanish Summary
different disciplines in order to faithfully recreate the interactions between elements at the
These simulations are necessary to study properties that would be impossible
a small system formed by about 92,224 atoms, it could take up to
of the computations performed to simulate the physical motion of the atoms that
make up the system is so high that, to this day, it is unthinkable to run simulations
It is essential to reduce computation times. In order to give a
Since its beginnings, different techniques have been developed to accelerate the calculations
so much that it is easy to find high-performance machines that have several
programmable graphics cards installed to accelerate the computations. In particular, they turn out
to be a great hardware aid for molecular simulation systems, reducing
Among other objectives, the aim is to give a new approach to the use of these architectures
To carry out this task, several tools have been developed
packaging of data for direct communications between GPUs. This algorithm has
moving data between CPU and GPU. Different forms of
spatial partitioning for molecular systems have also been investigated, selecting the most suitable one for
(MSM). Finally, since the environments with the largest number of available GPUs
Background
of molecular dynamics:
Both lines are contradictory: precise computations introduce a computational load
tend to be based on the use of less restrictive methods that usually introduce
precision errors. The optimizations usually carried out in recent years
focus on improving known algorithms, adapting them to make use of
action of the atoms that form the system. The total simulation time is divided
into small time steps in which the forces acting on each atom are computed in
order to obtain its velocity, and from that velocity the new position for the next
time step is computed. The smaller that simulation step is, the more precise the
computations are, but the longer the simulation takes to complete. In particular,
two types of forces can be distinguished, one of them subdivided into two further
types:
to the atoms that form the bond. They are usually the fastest forces to
• Electrostatic forces: the electrostatic and Van der Waals forces are those
produced by the charges of the atoms. Since electrostatic forces decay rapidly
with distance, these forces are usually divided into two
Long-range forces, computed using all the atoms of the system
computing them with exact algorithms is very high, so approximations with a
certain degree of error are usually used. Since the interactions beyond the
cutoff radius are of little importance, in some cases it is possible
It solves the computations using FFTs over a grid of charge potentials.
The original version of PME uses spectral differentiation and a total of four FFTs
per simulation step, while Smooth PME (SPME) [6] uses B-spline interpolation,
reducing the number of FFTs to two. PME is used in a multitude of molecular
simulators, such as NAMD [26], GROMACS [11] or ACEMD [10]. Moreover,
Molecular simulators
molecular simulation for high-performance environments. NAMD [26] is one of the
longest-lived, with its first versions dating from 1995. It is among the most popular,
being used in
NAMD distributes the computation work among the available computing nodes by
performing a spatial partition of the system. Each partition is assigned to a computing node
keeps the information of the atoms that belong to it together with the neighboring patches
available CPUs on each computing node. Each task is defined as
a patch it does not own, a copy of the necessary data will be sent along with the associated
task.
computations. Small tasks are created and assigned to the GPUs together with the necessary
data. Once the data is on the GPU, the necessary kernels are launched to
compute the forces, and the results are returned to CPU memory. This scheme of
using the GPU as a co-processor forces a large amount of information exchange
velopment; its first versions date from 1991. Initially it was implemented as a
like NAMD, it performs a spatial partition of the system to distribute it among
co-processors to accelerate the computations of certain parts of the code, although in that
case only the short-range forces have been optimized. As with NAMD,
at each step the input data is copied to the GPU, and then the results are downloaded
new mathematical models that simplify the structures of the molecules before
operating with them. It is optimized to take advantage of multi-GPU systems installed
in a single workstation, and it is one of the fastest simulators in existence.
When several GPUs are available, each one computes one type of force in parallel
is reduced, since all the communications between GPUs pass through
the maximum number of GPUs that can be used is limited by the number of cards
Objectives
GPU-CPU systems that make it possible to accelerate the force computations. However, it is the CPU
that keeps control of the application, using the GPUs as mere co-processors.
Current GPUs have a computing power that exceeds that of most CPUs,
but they are severely limited by the communications between the CPU and the GPU
direct communications between GPUs installed on the same motherboard [30], or
allow the GPUs to be used as computing nodes for simulation algorithms of
To achieve the purpose of this work, the following objectives have been defined:
time. Traditional simulators use the GPUs as co-processors, while
improvements to MSM so that it is as fast as PME, adapting it for
multi-GPU systems.
In summary, this work tries to provide new ways of using multi-GPU
environments. The solution must be scalable, and it must make it possible to accelerate
molecular dynamics computations for the molecular systems that require large amounts
of resources.
Methodology
Based on the objectives set out, they have been grouped into a series of milestones
graphics cards, at most between 4 and 8 GPUs, so the sizes of the simulated
systems used to be limited. To overcome those limitations,
cluster-type environments, where a larger number of GPUs connected to
GPUs used.
on GPU.
parallel algorithms that used the GPUs as co-processors, moving a large
amount of data between CPU and GPU. They have been adapted in such a way that
each GPU operates on its own data partition, so only the data needed to
update their states is sent between GPUs.
The tests have been satisfactory, proving in many cases that it is
that were impossible to simulate in environments with few GPUs, due to the small
that can be simulated. When the system is partitioned, these lists contain
holes or empty zones. The more partitions there are, the more the empty zones
more efficient storage. Hash tables adapt very well to this scheme,
compacting the useful data and saving memory, so they have been
incorporated into the system.
data. Since atoms can migrate from one GPU to another, we have defined
Conclusions
have focused on the two types of architectures described in the previous section. For
the shared-bus multi-GPU systems on a single motherboard, we used a
machine equipped with Ubuntu GNU/Linux 10.04, two Intel Xeon Quad Core CPUs
GTX580 GPUs connected to a PCIe 2.0 bus on a Tyan S7025 motherboard equipped with
and an NVidia GTX760 GPU with 2 GB of RAM. The nodes are
environments. Figure 1 shows the molecules used for each of the tests.
tions hosted on different GPUs. The three molecular systems (Figure 1) used
of NAMD.
• 400K (399,150 atoms): a synthetic molecular system with a data load
All the tests consisted of a run of 2000 simulation steps, representing the
computation of 4 picoseconds (4·10^-12 seconds). Figures 2a show the scalability
results obtained for 2 and 4 GPUs, plus an estimate
show that the implementation performs better as the size of the system grows,
sharing more work among the different GPUs. The speedup obtained for
APOA1 is lower than the rest because it is the smallest system, and the times
nanoseconds that can be simulated in one day. In all cases our solution improves
its efficiency compared with NAMD. The three molecular systems (Figure 1) used for the
of water.
execution times obtained compared with NAMD. It can be seen that, as before,
the larger the systems are, the better the speedup obtained.
described above. Since the communication buses are much slower
than in on-board bus systems, to compensate for the time lost in
is much larger, easing the scalability of the system and proving that with a network
(Figure 1) are composed of a large number of atoms, so it was not
the graphics cards used in the cluster have less graphics memory, so
memory per cluster node for DHFR_555. It can be observed
how it decreases as more simulation nodes are added. Figures 5 show
the speedup computed for the system. It must be taken into account that the
reference used for the calculation (speedup = 1) starts at 4 GPUs, so for
each of the GPU configurations it could be estimated to be 4 times higher than
sending times, while Figure 5b shows the speedup for the computation times
Figure 5: Speedup comparison of the three molecules. Note that the reference
configuration is for 4 GPUs.
Contents

1 Introduction
1.3 Objectives

I STATE OF THE ART

2 Molecular dynamics
2.1 Introduction
2.2.1 Bonds
2.2.2 Angles
2.5 Integrators
2.5.1 Verlet
2.5.2 Respa
3.2.1 NAMD
3.2.2 GROMACS
3.2.3 ACEMD

II PROBLEM STATEMENT AND PROPOSAL

4 Problem Statement
7.1 Algorithm
7.1.2 Updates

III CONCLUSIONS AND FUTURE WORK

Bibliography
List of Figures

3.1 Diagram showing the major operations of MSM. The bottom level represents the atoms, and higher levels represent coarser grids.
5.1 Comparison of binary (a) vs. linear spatial partitioning (b). The striped regions represent the periodicity of the simulation volume.
5.2 The different types of cells at the interface between two portions of the simulation volume.
C206.
5.6 Running time (2000 steps) for the binary partition strategy on C206.
5.7 Scalability (a) and performance comparison with NAMD (b), measured in terms of simulated nanoseconds per day.
6.1 Partition of the multilevel grid under periodic boundaries. Left: All grid points on each level, distributed into 3 GPU devices. Right: Data structure of GPU device 0 on all levels, showing its interior grid points, interface points, and buffers to communicate partial sums to other devices. With interfaces of size 3, several interface points contribute to the same buffer location, and in level 2 there are even interior points that map to interface points.
small portion of the system, referencing the data by local IDs. Local IDs are translated to global data IDs and sent to the second GPU.
7.5 Speedup comparison of the three molecules. Note that the reference configuration is for 4 GPUs.
Dedicated to my family and friends; without them this would not have been possible
Introduction
Molecular dynamics simulations [29] are computational approaches for studying the
behavior of complex biomolecular systems at the atom level, estimating their dynamic
and equilibrium properties which can not be solved analytically. Their most direct
applications are related to identifying and predicting the structure of proteins, but
their interactions for a given period of time. Molecular dynamics simulations enable
the prediction of the shape and arrangement of molecular systems that cannot be
directly observed or measured, and they have demonstrated their impact on
applications of drug and nanodevice design [29]. However, they are limited by size
high temporal and high spatial resolution. For instance, simulating just one nanosecond
of the motion of a well-known system with 92,224 atoms (the ApoA1 benchmark)
order of 1 µs are dictated by vibrations taking place at scales as fine as 1 fs = 10^-15 s;
therefore, effective analysis requires the computation of many simulation steps. At
the same time, meaningful molecular systems are often composed of even millions
potentials, which makes molecular dynamics an n-body problem with quadratic cost.
algorithms that update atoms in a parallel way. Such algorithms were initially
puter clusters with several computing nodes connected by a local area network
(LAN) [13, 23, 4]. More recent alternatives have used hybrid GPU-CPU architec-
tures to provide parallelism [32], taking advantage of the massive parallel capabilities
of GPUs. This approach interconnects several computing nodes, each one with one
or more GPUs serving as co-processors of the CPUs [16, 12]. The compute power of
this approach is bounded by the cost to transfer data between CPUs and GPUs and
are computed exactly, from long-range ones, and approximate such long-range forces.
The Particle Mesh Ewald (PME) method [5] is probably the most popular
a grid, computes a grid-based potential using an FFT, and finally interpolates the
potential back to the atoms. Its cost is dominated by the FFT, which yields an
lel algorithms, including massive parallelization on GPUs [27, 11], or even multi-GPU
parallelization [20]. The PME method is suited for single GPU parallelization, but
not for distributed computation, thus limiting the scalability of long-range molecular
dynamics.
• Speed optimizations
result in slower simulations, and speed optimizations usually assume a certain error.
Nowadays, optimizations are focused on improving well known algorithms and devel-
There are several molecular simulation algorithms optimized for shared memory sys-
tems, multi-CPU networks and distributed computing. Currently, one of the major
The present Ph.D. thesis was initially motivated by the research initiated by
Plebiotic SL company in collaboration with the Modeling and Virtual Reality Group
(GMRV) of the Universidad Rey Juan Carlos de Madrid. The initial objectives of
simulator named PleMD. This simulator achieved good simulation times but lacked
scalability, due to the large amount of data shared between CPU and GPU.
The objectives of this PhD thesis aim to exploit both on-board and dis-
1.3 Objectives
simulation times. However, the CPU keeps the control of the application, using
same board [30]. These features enable the use of GPUs as the central compute
nodes of parallel molecular dynamics algorithms, and not just as mere co-processors,
and implementation of algorithms should be carried out with scalability and light
• The definition of shared areas between partitions that maintain data coherency,
named interfaces.
• The design and implementation of a parallel algorithm for the setup of interface
data packages to be transferred between GPUs. This algorithm will run entirely
on GPU.
vironments for molecular simulation. The thesis proposes a scalable solution that
molecular dynamics.
• Problem statement and proposal: This part includes all the contributions pre-
considered.
• Conclusions and future work: the last part discusses the conclusions of this
Chapter 8: This chapter extracts the main achievements of this work. Also
STATE OF THE ART
Chapter 2
Molecular dynamics
2.1 Introduction
are surrounded by water molecules, and periodic boundary conditions are imposed
on the simulation volume, i.e., the simulation volume is implicitly replicated infinite
be found in [29].
the action of three types of forces: bonded forces, non-bonded short-range forces
(composed of Van der Waals forces and electrostatic interactions between atoms
The simulation time is divided into steps of very small size, in the order of
1 fs = 10^-15 s. Given atom positions Xi and velocities Vi at time Ti, the simulation
algorithm evaluates the interaction forces and integrates them to obtain positions
A chemical bond represents the attraction between two atoms that form a chemical
connection. These types of bonds are related to the charge and number of electrons
that atoms may share or transfer. There are several types of bonds, depending on
the number of atoms that form the bond and its geometry.
forces. That is because bonds exist in groups of two or more atoms closer than a
cutoff radius. Bonded force interactions should be calculated for each atom in the
Impropers.
2.2.1 Bonds
The bonds between two atoms are described by simple harmonic springs. The energy
Ebond = k (|rij| − r0)^2
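As an illustration (not taken from the thesis code), the force derived from this harmonic bond energy could be accumulated on one atom as follows; the function and argument names are hypothetical:

__device__ void harmonicBondForce(float3 ri, float3 rj, float k, float r0, float3 *fi)
{
    // E = k (|r_ij| - r0)^2 with r_ij = r_i - r_j; the force on atom i is
    // F_i = -dE/dr * r_ij/|r_ij| = -2 k (|r_ij| - r0) * r_ij/|r_ij|.
    float3 d = make_float3(ri.x - rj.x, ri.y - rj.y, ri.z - rj.z);
    float  r = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
    float  c = -2.0f * k * (r - r0) / r;
    fi->x += c * d.x;
    fi->y += c * d.y;
    fi->z += c * d.z;
    // The force on atom j is the exact opposite (Newton's third law).
}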
2.2.2 Angles
Angles describe a bond formed by three atoms. These bonds are defined as angular
harmonic springs. The energy of an angle bond formed by three atoms (i, j and k) is
described as follows:

Eangle = kθ (θ − θ0)^2

• kθ: Constant of the harmonic angle spring that bonds the three atoms.
• θ: Angle formed by the i-j bond and the vector that connects k and j.
• θ0: Angle formed by the i-j bond and the vector that connects k and j at rest.
Dihedral and Improper bonds describe the interaction between four linked atoms.
These bonds are modeled by an angle spring between the planes formed by the first
3 atoms (i, j and k) and the second set of 3 atoms (j, k and l). The energy for a
dihedral angle φ with force constant k, multiplicity n and phase δ is

Edihedral = k (1 + cos(nφ − δ))   if n > 0, or
Edihedral = k (φ − δ)^2           if n = 0.
by a large distance. Electrostatic energy describes the force resulting from the inter-
action between charged particles. The resulting energy between two atoms i and j is

E = ε14 · C qi qj / (ε0 |rij|)

• ε14: Scale factor for 1-4 interactions (pairs of atoms connected by three bonds).
It is zero for 1-2 and 1-3 interactions (pairs of atoms connected by one and
two bonds, respectively) and is equal to 1.0 for any other interaction.
• C = 2.31 × 10^-19 J·nm
• ε0 = Dielectric constant
into two types, short range and long range, which are treated separately.
The Van der Waals interactions describe the force resulting from the interaction of
atoms. The Van der Waals energy between two atoms i and j is described as follows:

Evdw = A / rij^12 − B / rij^6

The A and B constants are precomputed using the parameters σij and εij, which
are in turn precomputed from the σ and ε values of the individual atoms. Those are
input constants for each type of atom. This is the entire equation sequence:

σij = (σi + σj) / 2
εij = sqrt(εi · εj)
A = 4 σij^12 εij
B = 4 σij^6 εij
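A minimal sketch (not the thesis implementation) of how this Van der Waals term could be evaluated on the fly, combining the per-atom σ and ε values and skipping pairs beyond a cutoff Rc; all names are illustrative:

__host__ __device__ inline float vdwEnergy(float r2,               // squared distance
                                           float sigma_i, float sigma_j,
                                           float eps_i,   float eps_j,
                                           float Rc)
{
    if (r2 > Rc * Rc) return 0.0f;             // beyond the cutoff: negligible
    float sigma = 0.5f * (sigma_i + sigma_j);  // sigma_ij = (sigma_i + sigma_j)/2
    float eps   = sqrtf(eps_i * eps_j);        // eps_ij   = sqrt(eps_i * eps_j)
    float s2 = sigma * sigma;
    float s6 = s2 * s2 * s2;
    float A = 4.0f * eps * s6 * s6;            // A = 4 sigma^12 eps
    float B = 4.0f * eps * s6;                 // B = 4 sigma^6  eps
    float inv_r6  = 1.0f / (r2 * r2 * r2);
    float inv_r12 = inv_r6 * inv_r6;
    return A * inv_r12 - B * inv_r6;           // E = A/r^12 - B/r^6
}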
As with electrostatic forces, Van der Waals forces are also considered long or
medium range. These interactions happen between atoms that may be separated by
a large distance. These forces decay faster than electrostatic forces, so it is possible
to establish a cutoff distance after which the force is negligible. For this reason,
2.5 Integrators
2.5.1 Verlet
Molecular dynamics often use second-order integrators, such as Leapfrog and Verlet,
which offer greater stability than Euler methods. In the following, the integration al-
gorithm is implemented using the Velocity Verlet scheme, similar to the Leapfrog
method, but the positions, velocities and forces are obtained at the same value of time.

x(t + ∆t) = x(t) + v(t)∆t + (1/2m) F(t)∆t^2              (2.1)

v(t + ∆t/2) = v(t) + (1/2m) F(t)∆t                       (2.2)

v(t + ∆t) = v(t + ∆t/2) + (1/2m) F(x(t + ∆t))∆t          (2.3)

where x is the position vector, v is the velocity vector and F the force vector.
As F(t + ∆t) does not depend on v, equation 2.3 can be evaluated as soon as the
forces at the new positions are available.
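A sketch of equations 2.1–2.3 as two CUDA kernels, with the force recomputation happening on the host between them; the data layout and kernel names are assumptions, not the simulator's actual code:

__global__ void verletKickDrift(float3 *x, float3 *v, const float3 *f,
                                const float *mass, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float c = 0.5f * dt / mass[i];
    v[i].x += c * f[i].x;  v[i].y += c * f[i].y;  v[i].z += c * f[i].z;   // eq. 2.2
    x[i].x += v[i].x * dt; x[i].y += v[i].y * dt; x[i].z += v[i].z * dt;  // eq. 2.1
}

__global__ void verletFinalKick(float3 *v, const float3 *f, const float *mass,
                                float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float c = 0.5f * dt / mass[i];
    v[i].x += c * f[i].x;  v[i].y += c * f[i].y;  v[i].z += c * f[i].z;   // eq. 2.3
}

// Per step (host side): verletKickDrift<<<...>>>(...); recompute forces at the
// new positions; verletFinalKick<<<...>>>(...) with the new forces.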
2.5.2 Respa
multiple time steps (multi-time stepping). This method tries to avoid computing long-
range forces at every time step. While standard integration methods require the
calculation of all forces, both of short and long range, RESPA establishes a relation-
ship between the number of times the short and long range forces are calculated.
The Van der Waals and electrostatic forces take considerably more com-
puting time and also allow a longer time step than the bonded forces. In turn,
the long-range electrostatic forces allow a longer time step with respect to the Van
der Waals. Algorithm 2 shows a pseudo code where forces have been divided into
hardForce (fh), mediumForce (fm) and softForce (fs), and their respective tran-
sition times are ∆th, ∆tm and ∆ts. Speed, coordinates and mass are shown as v, R
and m, respectively.

3: for j = 1 to M do
4:     v = v + fm ∆tm / (2m)
5:     for k = 1 to H do
6:         v = v + fh ∆th / (2m)
7:         r = r + v ∆tm
8:         fh = ComputeHardForces()
9:         v = v + fh ∆th / (2m)
and fs H×M×S. The hard forces correspond to the bonded forces, the medium forces
correspond to Van der Waals and electrostatic forces calculated closer than a cutoff dis-
tance, and the soft forces are the electrostatic forces calculated beyond the cutoff distance.
Furthermore, for integration of velocity and position, we have used the Verlet scheme
In this chapter we present a review of the state of the art in computer driven simu-
lations of molecular dynamics, and more specifically in the two main topics covered
dynamics simulators. The last section of this chapter presents an analysis of the
best known molecular dynamics simulators, showing how some algorithms have been
virtual 3D coordinate system that models the real environment inside a specific
volume. This section presents some of the best-known techniques used to speed up
simulations.
Bonded forces represent the interaction between a group of atoms linked by some
kind of bond. For each bond, the energy that affects the atoms involved is measured
applying the forces model described in Section 2.2. Since the energy calculations
for each bond are independent of each other, and the workload is similar within the
bonds of the same type, the most popular way to parallelize this computation is by
using a task subdivision. This method is easily parallelizable using both multi-CPU
Non-bonded forces decay rapidly with distance, so only the interactions between
atoms closer than a cutoff radius (Rc) are accurately calculated. Atoms separated by
grid, which is updated at a lower rate than the simulation. This method is known as
the cell list [26, 33, 11, 29]. In this algorithm, the volume of the simulation is divided into
cells or three-dimensional boxes whose dimension is given by the cutoff radius (Rc).
Van der Waals and short-range electrostatic forces are calculated between pairs of
As with bonded forces, short-range forces are easily parallelizable by using
can be performed individually. Then, interactions between each atom of a box with
respect to each atom in a second box can also be computed in parallel. This takes
partitioning techniques.
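The cell assignment at the heart of the cell-list method is itself naturally parallel, with one thread per atom. A minimal sketch, assuming orthorhombic periodic cells of side Rc and positions at most one box length outside the domain; all names are illustrative:

__global__ void assignCells(const float3 *pos, int *cellOfAtom,
                            float3 boxMin, float cellSize, int3 nCells, int nAtoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;
    int cx = (int)floorf((pos[i].x - boxMin.x) / cellSize);
    int cy = (int)floorf((pos[i].y - boxMin.y) / cellSize);
    int cz = (int)floorf((pos[i].z - boxMin.z) / cellSize);
    // Wrap cell indices under periodic boundary conditions.
    cx = (cx + nCells.x) % nCells.x;
    cy = (cy + nCells.y) % nCells.y;
    cz = (cz + nCells.z) % nCells.z;
    cellOfAtom[i] = (cz * nCells.y + cy) * nCells.x + cx;
}
// Atoms are then sorted by cellOfAtom so that each cell occupies a contiguous range,
// and interaction candidates are restricted to the 27 neighboring cells.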
from time to time along the simulation. This introduces small time lags in the simu-
computing nodes could be very large. In such cases the objective is to reduce the
There are many approaches to improve the quadratic cost of long-range molecular
dynamics, either using approximate solutions or parallel implementations (See [25] for
a survey). Massively parallel solutions on GPUs have also been proposed, although
Particle Mesh Ewald (PME) [5] is the most popular method to compute
long-range molecular forces. Lattice Ewald methods solve the long-range potential
on a grid using an FFT. Regular PME uses spectral differentiation and a total of
four FFTs per time step, while Smooth PME (SPME) [7] uses B-spline interpolation
reducing the number of FFTs to two. PME is widely used in parallel molecular dy-
namics frameworks such as NAMD [27], GROMACS [12] or ACEMD [11]. PME cannot
easily be scaled to multiple GPUs due to the all-to-all communication needed by the FFT.
However, Nukada et al. proposed an extension to SPME by decomposing the global FFT into
a series of independent FFTs over separate regions of a molecular system, but they did
not conduct a scalability analysis.
Multigrid approaches utilize multiple grid levels with different spatial resolutions to
methods have been demonstrated to be superior to other methods [31], such as the
Fast Multipole Method (FMM) [37], because they achieve better scalability while
keeping acceptable error levels. The Meshed Continuum Method (MCM) [1] and
Multilevel Summation Method (MSM) [9] are the two most relevant multigrid meth-
ods for long-range force computation. MCM uses density functions to sample the
particles onto a grid and calculates the potential by solving a Poisson equation in a
multigrid fashion. On the other hand, MSM calculates the potential directly on a
grid by using several length scales. The scales are spread over a hierarchy of grids,
and the potential of coarse levels is successively corrected by contributions from finer
levels up to the finest grid, which yields the final potential. This approach exhibits
higher options for scalability than PME or other multigrid algorithms. MSM has
been massively parallelized on a single GPU [8], although the performance of this
Next, we describe MSM in more detail, as it is our method of choice for the
For a particle system with charges {q1, ..., qN} at positions {r1, ..., rN}, the electro-
static energy is

U(r1, ..., rN) = (1/2) Σ_{i=1..N} Σ_{j=1..N, j≠i} qi qj / ||ri − rj||          (3.1)
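Evaluated directly, equation 3.1 is an O(N^2) computation, which is the cost MSM is designed to avoid. A simple, hypothetical CUDA kernel makes this explicit (positions and charges packed as float4, result in arbitrary units):

__global__ void directPotentialEnergy(const float4 *rq,  // xyz = position, w = charge
                                      float *U, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    float4 a = rq[i];
    float u = 0.0f;
    for (int j = 0; j < N; ++j) {            // every atom interacts with all others
        if (j == i) continue;
        float4 b = rq[j];
        float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        u += a.w * b.w * rsqrtf(dx * dx + dy * dy + dz * dz);
    }
    U[i] = 0.5f * u;                         // the 1/2 factor avoids double-counting pairs
}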
interactions with just O(N ) computational work. MSM splits the potential into
For the long-range component, the method first distributes atom charges
onto the finest grid. This process is called anterpolation. A nodal basis function
φ(r) with local support about each grid point is used to distribute charges. Once all
atom charges are distributed onto the finest grid, charges are distributed onto the
next coarser grid, using the same basis functions. This process is called restriction,
Figure 3.1 depicts the full MSM method. On each level, the method com-
putes direct sums of nearby grid charges up to a radius of ⌊2Rc/h0⌋ grid points,
where h0 is the resolution of the finest grid. Hardy and Skeel [9] indicate that a resolution
that the resolution is halved on each coarser grid, hence direct sums cover twice
the distance with the same number of points. The direct sum of pairwise charge
Figure 3.1: Diagram showing the major operations of MSM. The bottom level
represents the atoms, and higher levels represent coarser grids.
potentials is analogous to the one for short-range non-bonded forces, with the excep-
tion that grid distances are fixed and can be computed as preprocessing, hence the
A GPU optimized version of the direct sum was developed by Hardy et al [8].
The weighted grid is stored in constant memory and charges in shared memory. A
computes the finest levels on GPU, while the coarsest levels are computed on CPU.
Once direct sums are computed on each level, potentials are interpolated
from coarse to ner levels, and contributions from all levels are accumulated. This
process is called prolongation. Finally, potentials from the nest grid are interpolated
on the atoms.
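Taken together, one MSM evaluation follows a fixed control flow over the grid hierarchy. The host-side sketch below only illustrates that flow; the per-level routines are stubs standing in for the real GPU kernels, and all names are assumptions:

// Stubs standing in for the real per-level GPU routines.
static void anterpolateChargesToFinestGrid() {}
static void restrictCharges(int /*from*/, int /*to*/) {}
static void directSumOnLevel(int /*level*/) {}
static void prolongatePotentials(int /*from*/, int /*to*/) {}
static void interpolatePotentialsToAtoms() {}

void msmLongRangeForces(int nLevels)
{
    anterpolateChargesToFinestGrid();              // atoms -> finest grid (level 0)
    for (int l = 0; l < nLevels - 1; ++l)
        restrictCharges(l, l + 1);                 // level l -> coarser level l + 1
    for (int l = 0; l < nLevels; ++l)
        directSumOnLevel(l);                       // direct sums are independent per level
    for (int l = nLevels - 1; l > 0; --l)
        prolongatePotentials(l, l - 1);            // accumulate coarse corrections downward
    interpolatePotentialsToAtoms();                // finest grid -> forces on atoms
}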
but molecular dynamics suffers the added difficulty of dealing with periodic boundary
GPU platforms, and those improvements could be extended to other types of n-body
problems.
This section presents some of the most popular software solutions for molecular
simulation. All of them use some of the techniques described in the previous section,
adapted in some way to parallel architectures: Multi-CPU, GPUs and even clusters.
rithms on hybrid CPU-GPU architectures [18, 36]. Very recently, Rustico et al. [28]
3.2.1 NAMD
NAMD [27] performs a spatial partition of the system. Each partition is allocated in a
node in a cluster. These subdivisions are known as patches; each patch keeps
information of the atoms within it, and the neighboring patches that need shared
data. NAMD then defines work tasks and distributes these tasks among the
available CPUs on each computing node. Tasks are defined as interactions between
patches; if a computing node needs data from a patch that does not belong to it, the
task will make a copy of the necessary data before being assigned. To speed up the
calculations, NAMD creates smaller tasks and copies the necessary data to the GPUs.
Then, it launches the necessary GPU kernels in order to perform the simulation, and
finally it copies back
the results to CPU memory. By using this scheme, GPUs are seen as massively
parallel co-processors.
3.2.2 GROMACS
zones for information sharing between nodes. CPUs may use GPUs as co-processors;
in that case only short-range forces are computed on GPUs, forcing data to be uploaded
from CPU to GPU, and the results to be downloaded back, at every simulation step.
3.2.3 ACEMD
ACEMD divides the force computation of the molecular system, and each type of force
is handled on a separate GPU. This approach exploits on-board multi-GPU architectures,
but its scalability is limited because all inter-GPU communications pass through the CPU,
and the maximum number of GPUs is limited by the number of cards that fit on a single
motherboard.
PROBLEM STATEMENT AND PROPOSAL
Chapter 4
Problem Statement
The problems presented by molecular simulation systems are included within the
Grand challenge problems [34]. Not only are simulation times important, but they
also need large RAM resources to host the molecular system. An enormous volume of
ter usually has thousands of nodes interconnected by a high speed network, enough
to host the data of the simulation. However, algorithms must be adapted, and there
are several drawbacks that must be solved to use the full power of these systems.
able that are used for molecular dynamics. Also, our solutions for short range and
long range molecular dynamics simulations that make use of novel parallel architec-
The use of GPUs in Computer Science has witnessed a wide range of configurations.
own memory hierarchy, separated from the CPU. Nowadays GPUs can have several
gigabytes of RAM, so just one GPU can host a large amount of data. Also, if the
mainboard has enough slots, a single computer can host several GPUs interconnected
by a high speed bus, making itself a small hybrid multicomputer node. However
computers that integrate several GPUs usually are very expensive, so only 2 to 4
one or two GPUs on each node. Hybrid CPU-GPU systems are available in most of
rations allow scaling the performance of the system, but are severely limited by the
communications. However, this is not the case on most of the multi-GPU clusters
throughput of InfiniBand, only a few supercomputers actually use it, due to its price,
and the common communication speeds found in the supercomputer networks are
close to 6 GB/s.
For example, the Human Brain Project is a large 10-year scientific research project
which aims to provide a model of the whole brain. To achieve the goals of this
can be remedied with further experimentation. For this purpose various platforms
Madrid), hosts one of the most powerful supercomputers used for this project.
There have been great achievements in recent years by using high per-
formance hybrid multi-CPU systems for molecular dynamics. In 2013, the NCSA
Blue Waters supercomputer was used to perform the simulation of the HIV-1 capsid
molecular system [38]. The HIV-1 capsid was formed by 64 million atoms, and 3500
single core nodes equipped with NVIDIA Tesla K20X were used to perform 500ns of
simulation time. In real time, it took around 35 days (14ns/day) to reach results.
different nodes are not as fast as communications in the same motherboard, becoming
the bottleneck of the application. The following sections will describe the problems
covered in this thesis, and will give solutions to make better use of new architectures.
As stated in the previous section, a priori a multi-GPU system with two or
more graphics processors connected via a fast bus is capable of providing a highly
is the CPU which hosts most of the program logic, using the GPUs as coprocessors.
With this scheme, if several GPUs are present, the CPU must take care of moving
data to and from each GPU. One of the problems of this scheme lies in the fact that
GPUs are massive parallel architectures that run their own code in parallel with the
CPU. Each GPU should be able to identify shared data with other GPUs, package it
and perform data exchange with neighboring GPUs. Those methods should be fast
This work tries to find a solution by using direct GPU-GPU data commu-
update the spatial partitioning and set up transfer data packages on each GPU. This
way, better simulation times are achieved, while keeping scalability of the system.
tions use PME, which is very fast in single-GPU environments. However, it is not
easily portable to distributed memory systems, due to the large amount of data needed
namics, based on the Multilevel Summation Method (MSM) [15] [8], that can be
slower than PME. Chapter 6 also proposes an optimization for this method by re-
scalability along with better simulations time. The more GPUs available, the more
subdivisions are made, which take less memory on each node. This way, it is possible
single node.
speed-ups but store translation tables that grow with the size of the molecular system,
and are not scalable in memory. Chapter 7 proposes a solution by using GPU hash
tables instead of static memory arrays, saving memory while keeping the scalability
of the system. Also, a version for distributed multi-GPU cluster systems is proposed,
Chapter 5
On-Board Multi-GPU Short-Range Force Computation
This chapter presents a parallel algorithm for the solution of short-range molecular
as coprocessors, CPUs are used as the primary processor, keeping data on the host's
RAM. Data is copied to the GPU when a kernel is launched, and results are copied
back. This solution presents some scalability problems due to the amount of
information shared between GPU and CPU. The aim of this chapter is to present an
This chapter is focused on bonded and non-bonded short range force com-
dynamics of one portion of a molecular system on each GPU, and we take advantage
Section 7.1.2 presents a novel parallel algorithm to update the spatial par-
titioning and set up transfer data packages on each GPU. The molecular dynamics
simulations are parallelized at two levels. At the high level, we present a spatial
the low level, we parallelize on each GPU the simulation of its corresponding portion.
Most notably, we present algorithms for the massively parallel update of the spatial
partitions and for the setup of data packages to be transferred to other GPUs.
level algorithm that partitions the molecular system, and each GPU handles in a
parallel manner the computation and update of its corresponding portion, as well as
method, with a grid resolution of Rc /2. The cell-list data structure can be updated
width of interfaces and simplifying the update of partitions. We partition the sim-
ulation domain only once at the beginning of the simulation, and then update the
partitions by transferring atoms that cross borders. We have tested two partitioning
Figure 5.1: Comparison of binary (a) vs. linear spatial partitioning (b). The
striped regions represent the periodicity of the simulation volume.
Figure 5.2: The different types of cells at the interface between two portions of
the simulation volume.
• Linear partition (Figure 5.1b ): we divide the molecular system into regular
portions using planes orthogonal to the largest dimension of the full simulation
volume. With this method, each portion has only 2 neighbors, but the inter-
faces are larger; therefore, it trades fewer communication messages for more
Based on our cell-based partition strategy, each GPU contains three types
• Shared cells that contain atoms updated by a certain GPU, but whose data
• Interface cells that contain atoms owned by another GPU, and used for force
grator, highlighting in blue with a star the differences w.r.t. a single-GPU version.
These differences can be grouped in two tasks: update partitions and synchronize dy-
namics of neighboring portions. Once every ten time steps, we update the partitions
in two steps.
1. Identify atoms that need to be updated, i.e., atoms that enter shared cells of
a new portion.
To synchronize dynamics, we transfer forces of all shared atoms, and then each GPU
integrates the velocities and positions of its private and shared atoms, but also its
1. Identify the complete set of shared atoms after updating the cell-list data struc-
ture.
1: procedure Step(currentStep)
2:     if currentStep mod 10 = 0 then
3:         ∗ identifyUpdateAtomIds()
4:         ∗ transferUpdatePositionsAndVelocities()
5:         updateCellList()
6:         ∗ identifySharedAtomIds()
7:     end if
8:     integrateTemporaryPosition(0.5 · ∆t)
9:     computeShortRangeForces()
10:    ∗ transferSharedShortRangeForces()
11:    for nStepsBF do
12:        integratePosition(0.5 · ∆t/nStepsBF)
13:        computeBondedForces()
14:        ∗ transferSharedBondedForces()
15:        integrateKickVelocity(∆t/nStepsBF)
16:        integratePosition(0.5 · ∆t/nStepsBF)
17:    end for
18:    currentStep = currentStep + 1
19: end procedure
As outlined above, each GPU stores one portion of the complete molecular system
and simulates this subsystem using standard parallel algorithms [35]. In this section,
We propose algorithms that separate the identification of atoms whose data needs
to be transferred from the setup of the transfer packages. In this way, we can
reuse data structures and algorithms both in partition updates and force transfers.
Data transfers are issued directly between GPUs, thereby minimizing communication
overheads.
The basic molecular dynamics algorithm stores atom data in two arrays:
• staticAtomData corresponds to data that does not change during the simula-
tion, such as atom type, bonds, electrostatic and mechanical coefficients, etc.
and the atom's cell. It is sorted according to the cell-list structure, and all
Both arrays store the identiers of the corresponding data in the other array to
resolve indirections. Each GPU stores a copy of the staticAtomData of the whole
molecule, and keeps dynamicAtomData for its private, shared, and interface cells.
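A hypothetical layout of the two arrays, only to make the division of data concrete; the field names are illustrative and not the simulator's actual definitions:

struct StaticAtomData {      // never changes during the simulation
    int    staticId;         // global static identifier
    int    type;             // atom type (indexes force-field coefficients)
    float  charge;
    float  mass;
    int    dynamicId;        // back-reference into dynamicAtomData
};

struct DynamicAtomData {     // sorted according to the cell-list structure
    float3 position;
    float3 velocity;
    float3 force;            // force accumulator
    int    cell;             // owning cell in the cell list
    int    staticId;         // back-reference into staticAtomData
};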
the atom identifiers are accordingly reset. Atoms that move out of a GPU's portion
for each atom a list of neighbor portions that it is shared with. We also define two
• cellNeighbors is a static array that stores, for each cell, a list of neighbor por-
tions.
• transferIDs stores pairs of neighbor identifiers and dynamic atom identifiers.
This data structure is set during atom identification procedures, and it is used
for the creation of the transfer packages.
Each GPU contains a transferIDs data structure of size nNeighbors · nAtoms,
where nNeighbors is the number of neighbor portions, and nAtoms is the number
of atoms in its corresponding portion. This data structure is set at two stages of the
cases, we initialize the neighbor identifier in the transferIDs data structure to the
maximum unsigned integer value. Then, we visit all atoms in parallel in one CUDA
kernel, and flag the (atom, neighbor) pairs that actually need to be transferred. We
store one flag per neighbor and atom to avoid collisions at write operations. Finally,
we sort the transferIDs data structure according to the neighbor identifier, and the
(atom, neighbor) pairs that were flagged are considered as valid and are automatically
located at the beginning of the array. We have used the highly efficient GPU-based
Merge-Sort implementation in the NVidia SDK 4.5 [24] (5.3ms to sort an unsorted
data. The actual implementation of the MustTransferData procedure depends
transferred to a certain neighbor portion if it is not yet present in its list of neighbors.
portion if it is included in its list of neighbors. In practice, we also update the list
For data transfers, we set in each GPU a buffer containing the output data and the
static atom identifiers. To set the buffer, we visit all valid entries of the transferIDs
array in parallel in one CUDA kernel, and fetch the transfer data using the dynamic
atom identifier. The particular transfer data may consist of forces or positions and
velocities.
Transfer data for all neighbor GPUs is stored in one unique buffer; therefore,
we set an additional array with begin and end indices for each neighbor's chunk.
This small array is copied to the CPU, and the CPU invokes one asynchronous
copy function to transfer data between each GPU and one of its neighbors. We use
NVidia's driver for unified memory access (Unified Virtual Addressing, UVA) [30] to
Upon reception of positions and velocities during the update of the parti-
tions, each GPU appends new entries of dynamicAtomData at the end of the array.
These entries will be automatically sorted as part of the update of the cell-list. Upon
reception of forces during force synchronization, each GPU writes the force values
to the force accumulator in the dynamicAtomData. The received data contains the
target atoms' static identifiers, which are used to indirectly access their dynamic
identifiers.
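The overall pipeline (flag pairs, sort by neighbor, copy chunks between GPUs) could look roughly like the sketch below. It is only illustrative: the predicate array, structure fields and function names are assumptions, and the sort step (done in the thesis with the SDK merge sort) is left as a comment.

#include <cuda_runtime.h>

struct TransferID { unsigned int neighbor; int dynamicId; };  // one entry per (atom, neighbor)

__global__ void flagAtomsToTransfer(const int *mustTransfer,   // hypothetical per-pair predicate
                                    TransferID *transferIDs,
                                    int nAtoms, int nNeighbors)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms * nNeighbors) return;
    int atom = i / nNeighbors, neigh = i % nNeighbors;
    // Entries start as "invalid" (maximum unsigned int); valid pairs get the real neighbor id.
    transferIDs[i].neighbor  = mustTransfer[i] ? (unsigned int)neigh : 0xFFFFFFFFu;
    transferIDs[i].dynamicId = atom;
}

// Host side: sort transferIDs by neighbor id (valid entries end up first), pack one
// contiguous chunk per neighbor, then issue direct GPU-GPU copies over PCIe via UVA.
void sendChunkToNeighbor(const void *srcDevPtr, int srcDevice,
                         void *dstDevPtr, int dstDevice,
                         size_t bytes, cudaStream_t stream)
{
    cudaMemcpyPeerAsync(dstDevPtr, dstDevice, srcDevPtr, srcDevice, bytes, stream);
}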
ing PCIe for direct GPU-GPU communication. We show speed-ups and improved
chine outfitted with Ubuntu GNU/Linux 10.04, two Intel Xeon Quad Core 2.40GHz
CPUs with hyperthreading, 32 GB of RAM and four NVidia GTX580 GPUs con-
nected to PCIe 2.0 slots in an Intel 5520 IOH Chipset of a Tyan S7025 motherboard.
The system's PCIe 2.0 bus bandwidth for peer-to-peer throughputs via IOH chip was
9 GB/s full duplex, and 3.9 GB/s for GPUs on different IOHs [17]. The IOH does
not support non-contiguous byte enables from PCI Express for remote peer-to-peer
depicted in Figure 5.3. Direct GPU-GPU communication can be performed only for
GPUs connected to the same IOH. For GPUs connected through QPI, the driver
Given our testbed architecture, we have tested the scalability of our pro-
transmission times for 8 and 16 partitions using the bandwidth obtained with 4
GPUs and the actual data size of 8 and 16 partitions respectively.
• ApoA1 (92,224 atoms) is a well known high density lipoprotein (HDL) in hu-
simulations.
All our test simulations were executed using MTS Algorithm 3, with a time
forces. In all our tests, we measured averaged statistics for 2000 simulation steps,
To evaluate our two partition strategies described in Section 7.1.2, we have compared
their performance on the C206 molecule. We have selected C206 due to its higher
complexity and data size. Figure 5.5a indicates that, as expected, the percentage
of interface cells grows faster for the linear partition. Note that with 2 partitions
the size of the interface is identical with both strategies because the partitions are
actually the same. With 16 partitions, all cells become interface cells for the linear
partition strategy, showing the limited scalability of this approach. Figure 5.5b shows
that, on the other hand, the linear partition strategy exhibits a higher transmission
bandwidth. Again, this result was expected, as the number of neighbor partitions is
All in all, Figure 5.5c compares the actual simulation time for both partition
strategies. This time includes the transmission time plus the computation time
of the slowest partition. For the C206 benchmark, the binary partition strategy
exhibits better scalability, and the reason is that the linear strategy suers a high
Figure 5.6 shows how the total simulation time is split between computa-
tion and transmission times for the binary partition strategy. Note again that the
underlying architecture, but also on the specic molecule, its size, and its spatial
atom distribution.
Figure 5.7a shows the total speedup for the three benchmark molecules using our
proposal (with a binary partition strategy). Note again that speedups for 8 and 16
GPUs, shown in dotted lines, are estimated based on the bandwidth with 4 GPUs.
The results show that the implementation makes the most out of the molecule's size
by sharing the workload among dierent GPUs. The speedup of APOA1 is lower be-
cause it is the smallest molecule and the simulation is soon limited by communication
times.
Figure 5.7b compares performance with NAMD in terms of the nanoseconds that can
be simulated in one day. The three benchmark
molecules were simulated on NAMD using the same settings as on our implementa-
tion. Recall that NAMD distributes work tasks among CPU cores and uses GPUs as
performance for NAMD with 8 and 16 GPUs, as we could not separate computa-
tion and transmission times. All in all, the results show that our proposal clearly
each partition stores static data for the full molecule. This limitation is addressed
in Chapter 7. From our measurements, the static data occupies on average 78MB
for 100K atoms, which means that modern GPUs with 2GB of RAM could store
molecules with up to 2.5 million atoms. In the dynamic data, there are additional
memory overheads due to the storage of interface cells and sorting lists, but these
interface cells grow at a lower rate than private cells as the size of the molecule
grows.
Figure 5.6: Running time (2000 steps) for the binary partition strategy on C206.
Figure 5.7: Scalability (a) and performance comparison with NAMD (b), measured
in terms of simulated nanoseconds per day.
Chapter 6
On-Board Multi-GPU Long-Range Force Computation
This chapter presents a parallel and scalable solution to compute long-range molec-
ular forces, based on the multilevel summation method (MSM). As shown in the
previous chapter, making use of several GPUs as independent computing nodes al-
range forces computations by using several GPUs. The MSM algorithm oers good
PME method, the de facto standard for long-range molecular force computation.
But most importantly, we propose a distributed MSM that avoids the scalability
diculties of PME.
multilevel grid, together with massively parallel algorithms for interface update and
synchronization. The last section of this chapter shows the scalability of our approach
allocating a portion of the system to each GPU and using a boundary interface to
the original algorithm. See also [9] for a thorough description of the method. Note
that the direct sums are independent of each other, and the direct sum on a certain
level and the restriction to the coarser level can be executed asynchronously.
To perform the direct sum part on each level, the original MSM applies a 3D convo-
lution over all grid points using a kernel with 2⌊2Rc/h⌋+1 points in each dimension.
However, Hardy [9] shows that the direct sum is the most computationally expensive
The grids of charges and kernel weights should have identical dimensions; therefore,
we extend the kernel. Note that the kernel is constant, hence we only compute its
Even though the FFT has O(N log N ) complexity as opposed to O(N )
complexity of the convolution, in practice large kernels yield a steep linear complexity
for the convolution approach. For very large molecules, the log N factor of the FFT
would dominate, but with our distributed MSM presented next in Section 6.2, FFTs
the convolution and FFT approaches, and the FFT approach enjoys a speed-up of
almost 10×. Table 6.1 shows timing comparisons for two molecular systems. The
examples were executed on an Intel Core i7 CPU 860 at 2.80GHz with a NVIDIA
GTX Titan GPU and CUDA Toolkit 5.5. FFTs were computed using NVIDIA's
cuFFT library.
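For reference, the FFT-based direct sum on one level could be organized as below, assuming cuFFT: the charge grid and the pre-transformed kernel (extended to the same dimensions, as described above) are multiplied in frequency space and transformed back. Error handling is omitted and all names are illustrative.

#include <cufft.h>

__global__ void complexMultiply(cufftComplex *a, const cufftComplex *b, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cufftComplex x = a[i], y = b[i];
    a[i].x = (x.x * y.x - x.y * y.y) * scale;   // complex product, scaled to undo
    a[i].y = (x.x * y.y + x.y * y.x) * scale;   // the unnormalized inverse FFT
}

void convolveLevel(cufftReal *d_charges, cufftComplex *d_freq,
                   const cufftComplex *d_kernelFreq,   // FFT of the weights, computed once
                   int nx, int ny, int nz)
{
    cufftHandle fwd, inv;
    cufftPlan3d(&fwd, nx, ny, nz, CUFFT_R2C);
    cufftPlan3d(&inv, nx, ny, nz, CUFFT_C2R);
    int nFreq = nx * ny * (nz / 2 + 1);
    cufftExecR2C(fwd, d_charges, d_freq);
    complexMultiply<<<(nFreq + 255) / 256, 256>>>(d_freq, d_kernelFreq, nFreq,
                                                  1.0f / (nx * ny * nz));
    cufftExecC2R(inv, d_freq, d_charges);        // potentials overwrite the charge grid
    cufftDestroy(fwd);
    cufftDestroy(inv);
}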
The cuto distance Rc has a great impact on both error and performance.
Error is lower for higher cutos, and this can be observed from the fact that a larger
cuto distance increases the kernel size as well. For our performance analysis, we used
a cuto radius of 9.0 Å, which is a standard value for molecular dynamics simulations.
Assuming a fixed grid size, the resolution of the grid h, which is automatically
set for each level and each axis, determines the overall performance and accuracy.
Smaller values of h for the same number of levels imply higher accuracy, but this also
translates into a larger kernel size 2⌊2Rc/h0⌋+1, hence adding to the computational
cost. The table shows the grid resolution on each axis (in Å), as well as the kernel
size.
Table 6.1 also compares the performance of MSM and PME under the
PME (SPME) algorithm [7], following the optimizations described by Harvey and
the MSM algorithm proposed by Hardy [9]. With our FFT-based optimization, the
the multilevel grid of MSM among multiple GPUs. As a computing element, each
GPU handles in a parallel manner the computation and update of its corresponding
portion of the molecular system, as well as the communications with other GPUs.
In this section, we rst describe the partition of the molecular system, then the
handling of periodic boundary conditions across all MSM levels, and nally our
Following the observations drawn in [20] for short-range molecular forces, we partition
a molecular system linearly along its longest axis, as this approach reduces the cost
to communicate data between partitions. Then, for DMSM, we partition each level
of the MSM grid into regular portions using planes orthogonal to the longest axis.
Each GPU device stores a portion of the grid at each level, including two types of
grid points: i) interior grid points owned by the GPU itself. ii) interface grid points
kernel, i.e., ⌊2Rc/h⌋ points to the left and right of the interior ones, as shown in
Figure 6.1. The interface stores replicas of the grid points of neighboring partitions,
which are arranged in device memory just like interior points, to allow seamless
data access. The interface is used both to provide access to charges of neighboring
Note that, due to the use of a linear partitioning strategy, the neighboring nodes
along the shorter directions are the result of periodic boundary conditions, and they
do not need to be stored as interface points as they are readily available as interior
points.
The partitions are made only once at the beginning of the simulation. At
formed by replicating periodically images of the molecular system under study along
all three spatial directions [6]. Periodic replication is also applied to the MSM grid;
In higher levels of the multilevel grid, where the total number of grid points
along the longest axis is similar to the convolution kernel size, periodic boundaries
complicate the management of interface points. Two main complications may occur,
shown in Figure 6.1: the same point may map to two or more interface points, and
even interior points may map to interface points. To deal with interface handling,
• Begin and end indices of neighbor partitions, to know what part of the interface
• Periodic begin and end indices of the interfaces of neighbor partitions, to know
Since the multilevel grid is static during the simulation, the auxiliary indices
of neighbor partitions are created and shared between GPUs once as a preprocessing
Figure 6.1: Partition of the multilevel grid under periodic boundaries. Left: All
grid points on each level, distributed into 3 GPU devices. Right: Data structure
of GPU device 0 (blue) on all levels, showing: its interior grid points, interface
points for an interface of size 3, and buffers to communicate partial sums to other
devices. Interface points due to periodic boundary conditions are shown striped.
Arrows indicate sums of interface values to the output buers. With interfaces
of size 3, in levels 1 and 2 several interface points contribute to the same buer
location, and in level 2 there are even interior points that map to interface points.
step. Once each GPU knows the indices of its neighbors, it creates the incoming and
outgoing data buffers to share interface data, and sets static mappings that allow
ple stages of the original MSM algorithm. There are two synchronization operations:
sum and prolongation steps, values are accumulated onto the interface grid
points in each GPU device. These interface points are local copies of interior
points of other GPUs, hence the values stored on interface points need to be
First, the values from the interface points are accumulated into the output
buffers. Second, the buffers are transferred to their destination GPUs. And
third, the receiver GPUs accumulate the incoming values into their interior
parallel manner on each GPU. Periodic boundary conditions are also handled
efficiently, and the accumulation of multiple copies of the same point is dealt
2. updateInterf aces: Once interior grid values are set, it may be necessary to
update their copies in other GPUs, i.e., the interface grid points of other GPUs.
Data is transferred between pairs of GPUs directly. This step is necessary after
charge anterpolation, after restriction, after the direct sum of potentials, and
after prolongation.
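For one level on one GPU, the two synchronization operations can be sketched with two small kernels around a direct peer-to-peer copy. The index maps are precomputed once (the multilevel grid is static); the names and layout are assumptions, not the thesis implementation:

__global__ void packInterfaceSums(const float *grid, int interfaceOffset,
                                  const int *interfaceToBuffer,   // precomputed static mapping
                                  float *outBuffer, int nInterfacePts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nInterfacePts) return;
    // Under periodic boundaries several interface points can map to the same
    // buffer slot (see Figure 6.1), so the accumulation is done atomically.
    atomicAdd(&outBuffer[interfaceToBuffer[i]], grid[interfaceOffset + i]);
}

__global__ void accumulateIncoming(float *grid, const int *bufferToInterior,
                                   const float *inBuffer, int nIncoming)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nIncoming) return;
    atomicAdd(&grid[bufferToInterior[i]], inBuffer[i]);   // add into the owner's interior points
}

// Between the two kernels, outBuffer is sent to the neighboring GPU with a direct
// peer-to-peer copy. updateInterfaces is the reverse path: interface copies are simply
// overwritten with the owner's interior values instead of being accumulated.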
a star the steps that augment the original MSM algorithm. We distinguish
charge values q from potential values V, which are used as arguments of the
Superscripts indicate grid levels. With our DMSM algorithm, all operations to set
up, transfer, and collect data packages are highly parallelized, thus minimizing the
This section analyzes the scalability of our proposal presented in the previous section.
Precise Pangolin 12.04, two Intel Xeon Quad Core 2.40GHz CPUs with hyperthread-
ing, 32 GB of RAM and four NVidia GTX580 GPUs connected to PCIe 2.0 slots in
Given our testbed architecture, we have tested the scalability of our pro-
designed synthetically.
Figure 6.3 shows the speedup and running times for the three molecules using our
proposal with the settings shown in Table 6.4a. Note that running times have been
show the results obtained with the CPU implementation of PME in NAMD, one
of the most used tools for molecular dynamics, as a baseline for comparison. The
results show that our method benefits from larger molecules. The reason is that
anterpolation, whose workload is easier to share among GPUs, dominates the cost
GPUs. Figure 6.4b shows the data transfers between GPUs to update their interfaces
for the 2x1VT4 molecule for a single step of DMSM. We have selected 2x1VT4 due
to its higher complexity and data size, with more than 1.2 Million atoms. The gure
indicates that, as expected, the data size of interface cells grows linearly, since each
new partition adds a constant data transfer that depends on the grid resolution h
and its corresponding interface size. Furthermore, the average data size transferred
Finally, Figure 6.4c shows how the total simulation time is split between
computation and interface updates for the 2x1VT4 molecule, to analyze the impor-
tance of the transferred data size. With up to 4 partitions, the cost is dominated by
way, the speedup grows almost linearly with each additional GPU. All in all, the
results show that our proposal presents very good scalability in on-board multi-GPU
platforms.
Molecule    h_x,y,z (Å)
400K        {2.57, 2.57, 2.57}
1VT4        {1.86, 1.86, 0.93}
2x1VT4      {1.89, 1.87, 1.78}
Distributed Multi-GPU Molecular Dynamics
This chapter presents a parallel and scalable solution to compute bonded and non-
bonded forces in a distributed multi-GPU environment. The previous chapters presented
on-board multi-GPU solutions. However, their scalability is limited by the number of
GPUs that can be connected, which is currently limited to 4-8 GPUs. Therefore, the
objective of this chapter is to extend the solution to distributed environments, where
several nodes with GPUs can collaborate to solve the problem. A further limitation of
the previous approach is memory scalability: every node has to keep a complete copy
of the molecule in memory, limiting the maximum size of the molecule to simulate.
Additionally, this prevents the use of low-end GPUs with a small amount of GPU
global memory.
This chapter presents new algorithms to overcome these limitations. Section 7.1
presents the elements required to perform a complete division of the system while
keeping data coherency. To do this, new unique global identifiers for atoms and bonds
are generated. Section 7.1.1 explains our method to partition the molecular system,
where each GPU maintains only a small part of the whole molecule. Section 7.1.2
explains how data is updated by interchanging atoms and bonds between neighboring
GPUs. Finally, Section 7.2 evaluates the proposal in a distributed multi-GPU
environment.
7.1 Algorithm
Our algorithm targets a distributed environment, such as a cluster composed of several
nodes with GPUs. The main objective is to avoid storing a complete copy of the
molecule on each node's memory, thus allowing memory scalability. Our solution acts
at two different stages: initialization of the simulation and runtime execution of the
simulation. Next, we summarize the roles of the two main components. The
SystemLoader distributes the molecule among the computing nodes, ensuring that each
GPU receives only one portion of the molecule. It is in charge of reading the molecular
system and performing the data partitioning, deciding in a balanced way which part
of the molecule goes to each GPU. Additionally, it creates a neighborhood table for
each GPU, establishing global atom and bond identifiers. Each Integrator then performs
the simulation, updating and synchronizing its data partition with its neighbors.
At initialization, the SystemLoader reads the molecular system and generates the
list of Integrators (one for each GPU) that will perform the simulation. Each Integrator
receives a single partition, as described in Chapter 5 (see Fig. 5.1 and Fig. 5.2), along
with a list of neighbors with which to exchange updates of their shared areas. After
distribution, the SystemLoader is idle most of the time, but it is also responsible for
collecting partial simulation results from the Integrators. A schematic sketch of this
initialization flow is given below.
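The following CUDA C++ host-side sketch illustrates this initialization flow. Only SystemLoader and Integrator correspond to components named in the text; Partition, buildPartition and the member names are hypothetical placeholders, and the actual classes hold far more state (GPU buffers, neighborhood tables, etc.).

#include <cstddef>
#include <utility>
#include <vector>

struct Partition {                       // data handed to one GPU
    std::vector<long long> atomGlobalIds;
    std::vector<int>       neighborGpus; // GPUs sharing boundary regions
};

class Integrator {                       // one per GPU; runs the simulation loop
public:
    explicit Integrator(Partition p) : part(std::move(p)) {}
    void step(double dt) { (void)dt; /* forces + velocity Verlet, see Algorithm 6 */ }
private:
    Partition part;
};

class SystemLoader {                     // reads the molecule and distributes it
public:
    std::vector<Integrator> distribute(std::size_t numGpus) {
        std::vector<Integrator> integrators;
        for (std::size_t g = 0; g < numGpus; ++g)
            integrators.emplace_back(buildPartition(g, numGpus));
        // After this point the SystemLoader stays mostly idle; it only
        // collects partial simulation results from the Integrators.
        return integrators;
    }
private:
    Partition buildPartition(std::size_t gpu, std::size_t numGpus) {
        Partition p;                     // balanced spatial split (omitted here)
        (void)gpu; (void)numGpus;
        return p;
    }
};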
The following sections describe the methods used to make the system partitioning
and its runtime maintenance possible.
As shown earlier in Chapter 5 and Chapter 6, each partition of the molecular system
is itself divided into three different sections: shared, unshared and interface data.
As shown in Section 5.2, there are data sets that do not change during the simulation.
With that scheme, atom migration is very quick, because only the dynamic part of the
data must be sent. However, the staticAtomDataID data stored on each partition does
not decrease with the number of partitions, limiting memory scalability and therefore
the maximum size of the system that can be simulated.
In this chapter, we propose a new algorithm that maintains, for each parti-
tion, a copy of staticAtomData only for the atoms that reside within the partition.
Local identifiers are still used for the computations within each GPU. However, our
algorithm introduces a new global identifier that enables the migration of static data
between GPUs when atoms leave their partition. Local data IDs act as a descriptor
that references the position of the data within the arrays contained in the GPU RAM.
On the other hand, globalDataID stores information about the GPU node that owns
the data, as well as the neighboring nodes that share a copy. Among its fields we find:
• SharedGPUs: A list with the neighboring GPUs that share the atom or bond.
This global identifier is unique across the whole molecular system.
A spatial partition determines the data belonging to each GPU. Atoms are assigned
to each partition based on their 3D position. Bonds composed of two or more atoms
use the midpoint method [2], which assigns each bond to the partition containing the
geometric midpoint of its atoms; a sketch of this assignment is given below.
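The following host-side sketch shows one possible form of the global descriptor and of the midpoint-based bond assignment. The struct and function names, and the exact field layout, are illustrative assumptions; only the SharedGPUs concept and the midpoint rule come from the text.

#include <array>
#include <vector>

struct GlobalDataID {
    long long        globalId;   // unique across the whole molecular system
    int              ownerGpu;   // GPU node that owns the data
    std::vector<int> sharedGpus; // neighboring GPUs holding a copy
};

using Vec3 = std::array<double, 3>;

// Assign a bond to the spatial partition that contains the midpoint of its atoms.
int assignBondToPartition(const std::vector<Vec3>& bondAtoms,
                          const Vec3& boxMin,
                          const Vec3& partitionSize,
                          const std::array<int, 3>& partitionsPerAxis) {
    Vec3 mid{0.0, 0.0, 0.0};
    for (const Vec3& p : bondAtoms)
        for (int d = 0; d < 3; ++d)
            mid[d] += p[d] / static_cast<double>(bondAtoms.size());

    std::array<int, 3> cell{};
    for (int d = 0; d < 3; ++d) {
        cell[d] = static_cast<int>((mid[d] - boxMin[d]) / partitionSize[d]);
        if (cell[d] < 0) cell[d] = 0;                               // clamp to the box
        if (cell[d] >= partitionsPerAxis[d]) cell[d] = partitionsPerAxis[d] - 1;
    }
    // Linear partition index; the actual mapping used may differ.
    return cell[0] + partitionsPerAxis[0] * (cell[1] + partitionsPerAxis[1] * cell[2]);
}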
7.1.2 Updates
The integration method used in previous chapters needs updated forces just before
integrating positions, forcing each type of force to be transmitted separately after
computing it. To save communication time, the integrator was changed to a Velocity
Verlet version, which only needs updated positions before computing forces.
Algorithm 6 shows the distributed simulation step; a minimal sketch of the integration
kernels is given below, after the list of updates. Two kinds of updates are required:
• Partition updates. Atoms that migrate across partitions are sent along with their
bond information, both partition data and pre-calculated shared data, so that the
receiving GPU has the information needed to rebuild the system. When a GPU
detects that all atoms that form a bond have left the boundaries of the partition,
that bond is removed from the partition as well.
procedure Step(currentStep)
    integrateVelocity(0.5 · ∆t)
    integratePosition(∆t)
    if currentStep mod 10 = 0 then
        ∗ identifyUpdateAtomIds()
        ∗ transferUpdatePositionsAndVelocities()
        updateCellList()
        ∗ identifySharedAtomIds()
    else
        ∗ transferSharedAtomPositions()
    end if
    computeAllForces()
    integrateVelocity(0.5 · ∆t)
    currentStep = currentStep + 1
end procedure
• Shared data updates. Dynamic data of all shared atoms has to be updated
on each step before continuing the simulation. In this case, only dynamic data is
exchanged with the neighboring GPUs that share those atoms.
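As a complement to Algorithm 6, the following CUDA sketch shows how the local part of a Velocity Verlet step could look on one GPU. Kernel and array names are illustrative and do not correspond to the actual implementation; the communication steps marked with ∗ in Algorithm 6 are only indicated with comments.

#include <cuda_runtime.h>

__global__ void integrateVelocityKernel(float3* vel, const float3* force,
                                        const float* invMass, float halfDt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    vel[i].x += halfDt * force[i].x * invMass[i];
    vel[i].y += halfDt * force[i].y * invMass[i];
    vel[i].z += halfDt * force[i].z * invMass[i];
}

__global__ void integratePositionKernel(float3* pos, const float3* vel,
                                        float dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    pos[i].x += dt * vel[i].x;
    pos[i].y += dt * vel[i].y;
    pos[i].z += dt * vel[i].z;
}

// Host-side step for the atoms local to one GPU, mirroring Algorithm 6.
void step(float3* pos, float3* vel, float3* force, const float* invMass,
          int n, float dt) {
    int blocks = (n + 255) / 256;
    integrateVelocityKernel<<<blocks, 256>>>(vel, force, invMass, 0.5f * dt, n);
    integratePositionKernel<<<blocks, 256>>>(pos, vel, dt, n);
    // ... every 10 steps: migrate atoms/bonds and rebuild cell lists;
    //     otherwise: exchange positions of shared atoms with neighbor GPUs ...
    // computeAllForces(pos, force, n);   // bonded + short-range non-bonded forces
    integrateVelocityKernel<<<blocks, 256>>>(vel, force, invMass, 0.5f * dt, n);
}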
The methods for data communication and partition update are similar to
those explained in Section 5.2. Data is sent along with its associated identifier on
every step to update it. However, each GPU computes the molecular forces using
local identifiers, which must be translated before being sent. IDs are translated using
two tables: localIDToGlobalID before sending, and globalIDToLocalID after receiving
data. A naive approach would be to use arrays as translation tables, where the
identifiers are stored in the position indicated by their ID, i.e., globalID =
localIDToGlobalID[localID] and localID = globalIDToLocalID[globalID]. Although
this method is very fast, the globalIDToLocalID array would have to include all the
atoms of the system, even those that are not in the partition, preventing memory
usage from scaling down with the number of partitions.
Figure 7.1: Communication scheme from GPU A to GPU B. Each GPU hosts
a small portion of the system, referencing the data by local IDs. Local IDs are
translated to global data IDs and sent to the second GPU. After data reception, a
translation to local IDs is performed.
To avoid this, a GPU hash table is used instead, storing only the global keys needed
on each GPU. A sketch of this translation scheme is given below.
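The following host-side sketch illustrates the translation scheme with a dense local-to-global array and a sparse global-to-local hash map; the actual implementation keeps an equivalent hash table in GPU memory. Names and types are illustrative.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct IdTranslator {
    std::vector<std::int64_t> localToGlobal;             // dense: one entry per resident atom
    std::unordered_map<std::int64_t, int> globalToLocal; // sparse: only resident atoms

    void add(int localId, std::int64_t globalId) {
        if (localId >= static_cast<int>(localToGlobal.size()))
            localToGlobal.resize(localId + 1, -1);
        localToGlobal[localId] = globalId;
        globalToLocal[globalId] = localId;
    }

    // Translate outgoing data to global IDs before sending ...
    std::int64_t toGlobal(int localId) const { return localToGlobal[localId]; }

    // ... and incoming global IDs back to local IDs after receiving.
    // Returns -1 when the atom is not resident in this partition.
    int toLocal(std::int64_t globalId) const {
        auto it = globalToLocal.find(globalId);
        return it == globalToLocal.end() ? -1 : it->second;
    }
};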
7.2 Distributed Multi-GPU Molecular Dynamics evaluation
The testbed is a cluster of nodes connected by a Gigabit Ethernet network, each node
outfitted with Linux Mint 14, 8 GB of RAM and one NVidia GTX760 GPU with 2 GB
of RAM. Ideally, inter-GPU communications would be performed directly between GPUs
connected by a network; however, our testbed performs the communication using CPU
memory and the network stack as intermediate steps.
The selected tests are focused on two aspects: memory usage and running
times. In order to test the memory scalability of the proposal, we have used four
molecular systems as benchmarks (see Fig. 7.2), all of them with a large number of
atoms. One of them is the 2x1VT4 molecule presented in Section 6.3. This molecule
could not be simulated in a single node due to the small GPU memory available; a
minimum of 2 GPUs is needed. The remaining benchmarks are built from several
copies of DHFR (b), which cannot be simulated in fewer than 4 nodes.
All test simulations were executed using the Verlet algorithm (see Sec-
tion 2.5.1), with a single time step of 1 fs for short-range non-bonded and bonded
forces. In all tests, we measured averaged statistics over 100 simulation steps, i.e., a
simulated time of 100 fs.
To evaluate the scalability of the proposal, several tests have been performed. Fig-
ure 7.3 shows the amount of data sent along the simulation. As can be seen, as the
number of nodes increases, the total size of the shared information grows because of
the updates with a higher number of neighbors. However, the dotted line shows that
the amount of data sent per node is practically constant in all cases. Furthermore,
the amount of data per node decreases as the number of nodes is increased, because
a higher number of nodes means a smaller partition data size. In summary, using
a higher number of nodes does not imply a penalty in the size of the data to be
communicated.
Figure 7.3: Communicated data size for DHFR_844 along 100 steps.
Figure 7.4 shows the GPU RAM allocated for the DHFR_555 molecule for different
numbers of nodes. It was not possible to simulate this system using 1 or 2 nodes. The
GPUs installed on each node have only 2 GB of RAM, and this memory is also shared
with the GUI, so the actual GPU RAM available for CUDA is smaller. For 4 nodes,
1.17 GB of GPU RAM are used, and the amount of memory needed decreases when
more nodes are added. Since each node has more neighbors, the RAM usage reduction
is not linear: each node reserves extra memory for communications, but the maximum
number of neighbors per node is bounded.
Figure 7.5 shows the speedup for the first three molecules used on the
testbed. Figure 7.5a shows the speedup evolution using 4, 8, 16 and 32 nodes. Note
that the speedup has been measured using the system with 4 nodes (one GPU each)
as reference, because the DHFR molecules tested cannot run with fewer nodes. As a
reference, in an ideal case assuming linear scalability, the obtained speedups could be
up to four with 16 nodes and eight with 32 nodes. In practice, communication over the
network takes a significant percentage of the time. As a consequence, the smallest
molecule shows the worst results. With larger molecules, force computation takes a
larger percentage of the total time, improving scalability. To prove that this solution
could take advantage of a faster network, Figure 7.5b shows speedups for force
computation only, ignoring the time spent in communications. In this case, all the
molecules achieve a similar scalability, with a nearly linear speedup. As stated before,
these speedups are calculated using the 4 GPU configuration as reference.
Figure 7.5: Speedup comparison of the three molecules. Note that the reference
configuration is for 4 GPUs.
Figure 7.6 shows the total simulation times for DHFR_555, split between
computation and transmission times. Note that the transmission times for 16 nodes
are higher than the others. This can be explained by the fact that the network is
divided in two groups of nodes connected through several routers, and the
communication performance depends on how the partitions are spread among the
computing nodes. In this configuration some of the nodes were farther from their
neighbors than others, increasing the communication times. For the 32-node
simulation, a better configuration was used, showing better communication times. In
spite of this, communication times do not grow much when adding more nodes, while
computation times keep decreasing.
In order to prove the ability of the proposal to simulate huge molecular systems, a
last test was performed (see Fig. 7.7). 32x1VT4 is a synthetic system made of 32
copies of 1VT4, which adds up to a total of 20,107,488 atoms. Due to its complexity
and size, we have estimated the speedup of the test by taking as reference the
simulation time of one copy of the molecule in one node. Simulation times are given
in nanoseconds per day. One copy of 1VT4 can be simulated at a speed of 1.5 ns/day
in one node of the cluster, so 32 copies should run at around 0.047 ns/day using one
node (1.5 ns/day divided by 32).
Chapter 8
Conclusions and Future Work
Chapter 1 introduced the problems found in molecular dynamics and the objec-
tives for this work. The current Ph.D. thesis has properly fulfilled all the primary
objectives: a scalable molecular dynamics solution has been developed both for on-board
multi-GPU systems and for clusters of independent computing nodes. The approach
extends and optimizes the Multilevel Summation Method and introduces massively
parallel algorithms to update and synchronize the interfaces of spatial partitions
on GPUs. The evaluations carried out show that the current implementation is faster
than the reference implementations used for comparison and scales well with the
number of GPUs. We have simulated massive molecular systems formed by more than
20 million atoms. Section 8.1 summarizes the contributions of this work, and
Section 8.2 presents future lines of research which are open starting from the
conclusions of this work.
The following sections present the contributions for each accomplished goal.
Our initial efforts were focused on simulating short-range bonded and non-bonded
forces on on-board multi-GPU systems, using the GPUs as parallel integrators and
achieving high speed-ups thanks to the spatial partitioning developed and to the
direct communication between GPUs.
The first milestone reached was the selection of a partition scheme well suited
to multi-GPU molecular dynamics, after analyzing different partitioning strategies.
A data-packing algorithm for on-board multi-GPU architectures is presented in
Section 5.2. This algorithm is the key for directly transferring information between
GPUs, enabling the execution of most of the code on the GPUs themselves and
demonstrating the benefits of using GPUs as central compute nodes instead of
simple co-processors.
Chapter 6 presents the next milestone achieved in this thesis. The proposal extends
and optimizes the Multilevel Summation Method and takes advantage of direct GPU-
to-GPU communication. MSM presents more suitable characteristics for distribution
along several nodes or GPUs, but it is slower compared to PME. We first improve the
performance of MSM on a single GPU, and then distribute the computation across
several GPUs.
Section 6.1 states the benefits of our approach with respect to the original MSM
and to the well-known long-range molecular dynamics algorithm PME. We then show
how to perform a spatial partitioning of the multilevel grid, dividing atom and grid
data between GPUs and keeping the partition interfaces synchronized. Also,
Section 6.3 evaluates the scalability of the proposal, showing promising results.
Chapter 7 presents the final milestone achieved. The major drawback of on-board
multi-GPU systems is the limited number of GPUs that can be used in a single node,
so the solution is extended to distributed environments with several nodes. A new
partition method for the molecular system is also presented, enabling the simulation
of molecules that do not fit in the memory of a single GPU. The method is based on
the definition of local data IDs for computations, and global data IDs for
communications.
Section 7.2 shows the results of the tests performed, demonstrating the
scalability of the approach. The solution achieves good simulation times, opening the
possibility for massive molecular systems to be simulated on distributed multi-GPU
platforms.
The objectives stated at the beginning of this Ph.D. thesis have been satisfactorily
reached. The evaluation carried out allows us to conclude that our multi-GPU molec-
ular dynamics approach presents very good behavior in terms of performance and
scalability. Furthermore, this work opens new research lines for current applications.
One of the most demanding applications of molecular dynamics is virus simulation.
These molecular systems are so large that usually only some selected parts are
simulated. A scalable solution such as the one proposed in this work may make the
simulation of such large systems practical.
Also, the solutions presented in this thesis can be exported to other simula-
tion fields. Several of them can be applied to n-body problems, such as celestial
mechanics. SPH fluid dynamics and mass-spring cloth applications are other examples
of dynamic simulations that may benefit from the spatial partition schemes presented
here. We propose the following lines of future work:
• Our current solution relies on a static partitioning, which does not guaran-
tee load balancing across GPUs. The tests indicate that practical molecular
systems maintain rather even atom distributions, but dynamic load balancing
could further improve performance for highly heterogeneous systems.
• Our work could be complemented with more advanced communication protocols
and architectures. For on-board multi-GPU systems, there are currently
architectures that outperform the Intel IOH/QPI interface for the PCIe bridge used
in the experiments. Also, distributed multi-GPU systems could benefit from network
interconnects faster than the Gigabit Ethernet used in our testbed.
• One of the main drawbacks is that MSM adds a certain overhead at coarse grid
levels, which offer little work to distribute across a large number of GPUs, and
periodic boundaries wrap around the whole molecular system. Addressing these
limitations may allow us to simulate a molecular system made of nearly 300 million
atoms.
Bibliography
[1] Matthias Bolten. Multigrid methods for structured grids and their application
[2] K. J. Bowers, R. O. Dror, and D. E. Shaw. The midpoint method for par-
67
[3] David S Cerutti and David A Case. Multi-level Ewald: A hybrid multigrid / fast
[5] Tom Darden, Darrin York, and Lee Pedersen. Particle mesh Ewald: An
N·log(N) method for Ewald sums in large systems. The Journal of Chem-
[6] O N de Souza and R L Ornstein. Effect of periodic box size on aqueous molecular
[7] Ulrich Essmann, Lalith Perera, Max L. Berkowitz, Tom Darden, Hsing Lee,
and Lee G. Pedersen. A smooth particle mesh Ewald method. The Journal of
[8] David J. Hardy, John E. Stone, and Klaus Schulten. Multilevel summation
[9] David Joseph Hardy and Robert D Skeel. Multilevel summation for the fast
mesh Ewald method on GPU hardware. Journal of Chemical Theory and Com-
[12] Berk Hess, Carsten Kutzner, David van der Spoel, and Erik Lindahl. GRO-
2008. 2, 21, 26
[13] Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard Brooks, and Milan Hodošček.
[15] Jesús A. Izaguirre, Scott S. Hampton, and Thierry Matthey. Parallel multigrid
[16] Laxmikant Kalé, Robert Skeel, Milind Bhandarkar, Robert Brunner, Attila Gur-
and Klaus Schulten. NAMD2: Greater scalability for parallel molecular dynam-
Bernd Mohr, and Dieter an Mey, editors, Euro-Par, volume 8097 of LNCS, pages
[21] Akira Nukada, Kento Sato, and Satoshi Matsuoka. Scalable multi-GPU 3-
[22] NVidia. CUFFT :: CUDA Toolkit Documentation, accessed Online Jan 2014.
53
[23] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics.
[26] D.C. Rapaport. Large-scale Molecular Dynamics Simulation Using Vector and
[27] Christopher I. Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, and
Wen-Mei W. Hwu. GPU acceleration of cutoff pair potentials for molecular mod-
2012. 25
[30] Tim C. Schroeder. Peer-to-Peer & Unified Virtual Addressing, 2011. XIII, 4, 43
[31] Robert D Skeel, Ismail Tezcan, and David J Hardy. Multiple grid methods for
684, 2002. 22
28(16):2618–2640, 2007. 2
[33] J.A. van Meel, A. Arnold, D. Frenkel, S.F. Portegies Zwart, and R.G. Belleman.
266, 2008. 20
2010. 36, 40
[36] Juekuan Yang, Yujuan Wang, and Yunfei Chen. GPU accelerated molecular dy-
[37] Rio Yokota, Jaydeep P. Bardhan, Matthew G. Knepley, L.A. Barba, and
[38] Gongpu Zhao, Juan R. Perilla, Ernest L. Yufenyuy, Xin Meng, Bo Chen, Jiying