Thesis Marcos Novalbos Mendiguchia

Universidad Rey Juan Carlos
Departamento de Ciencias de la Computación, Arquitectura

de Computadores,Lenguajes y Sistemas Informáticos y
Estadística e Investigación Operativa.
Doctorado en Ingeniería Informática
Scalable Molecular Dynamics
on
High-Performance Multi-GPU
Systems
By Marcos Novalbos Mendiguchía
A dissertation submitted in partial fulllment of the requirements for the
degree of Doctor of Philosophy in Computer Science
Supervisor
Prof. Alberto Sánchez Campos
Prof. Miguel Angel Otaduy Tristán
Candidate
Dr. Marcos Novalbos Mendiguchía
July 2015
El Dr. Alberto Sánchez Campos, con DNI 50120554N, y el Dr. Miguel Angel
Otaduy Tristán, con DNI 72447035W, como directores de la presente tesis
CERTIFICAN
Que los trabajos de investigación desarrollados en la memoria de tesis doctoral
Scalable Molecular Dynamics on High-Performance Multi-GPU Sys-

tems son aptos para ser presentados por el Ingeniero Informático Marcos Novalbos
Mendiguchía ante el Tribunal que en su día se consigne, para aspirar al Grado de
Doctor por la Universidad Rey Juan Carlos.
Y para que así conste rman el presente documento en Móstoles a 21 de
Octubre de 2015.
V.B. Directores de Tesis
Dr. Alberto Sánchez Campos Dr. Miguel Angel Otaduy Tristán.

Scalable Molecular Dynamics on High-Performance Multi-GPU
Systems
Copyright 2015
by Marcos Novalbos Mendiguchía

Agradecimientos
Este trabajo es la suma de los esfuerzos de mucha gente que, directa o indirectamente
han contribuido en su desarrollo. Por eso quería dedicárselo a todos los que en algún
momento me han ayudado, y hacerles llegar mi agradecimiento: familia, amigos,
y personas que sin tener mucha relación me han dado buenos consejos en algún
momento.
Primeramente quería agradecerle la paciencia y el apoyo que he tenido con
mi familia. De mi madre Carmen, mi hermana Maria del Mar y mi cuñado Ricardo
he aprendido que es imprescindible esfuerzo, constancia y responsabilidad en todo lo
que se hace para cumplir los objetivos. Y en todo caso, si algo sale mal siempre se
puede contar con ellos para buscar una solución.
También quería agradecerles el trabajo y dedicación a mis dos directores
de Tesis, sin ellos este trabajo habría sido imposible. A Alberto quiero agradecerle
la conanza que puso en mí desde el principio, han sido muchos años trabajando
juntos y sin su apoyo e interés constante seguramente no habría avanzado nunca. A
Miguel Ángel quería agradecerle todas las horas invertidas en este trabajo, revisiones
de última hora incluidas. A ambos quería agradeceros sobretodo la paciencia en los
últimos meses previos a la entrega de este trabajo, poca gente muestra tanto interés
en ayudar a otros a conseguir sus objetivos.
Quiero agradecer a todos los trabajadores de Plebiotic S.L. su nanciación
y la conanza que depositaron en mi trabajo mientras duró el proyecto. A Roberto
Martínez Benito, Álvaro López-Medrano y Roldán Martínez por sus esfuerzos inver-
tidos en la empresa que fundaron y con la que intentaron marcar la diferencia. Y por
supuesto, quería agradecer a Jaime por todos estos años trabajando juntos, horas
VI
perdidas depurando código y del que he aprendido muchísimo de programación en
GPU.
Por último, me gustaría agradecer a todos los integrantes del despacho
2011 (pasados y presentes) con los que he compartido las horas de trabajo durante
los últimos 4 años. Han sido muchas las experiencias que hemos vivido juntos y
en las que nos hemos ayudado, muchas gracias a Alberto, Álvaro, Ángela, Carlos,
David, Gabriel, Javier, Jorge, Mónica, Laura y Zeleste. Y por añadidura al resto de
GMRV, en especial a Jose, Juanpe, Luis, Marcos, Óscar, Pablo, Richard SM, Sofía
y Susana.
Resumen en Castellano
Los sistemas de simulación de dinámica molecular aúnan los esfuerzos de distin-
tas disciplinas para poder recrear elmente las interacciones entre elementos a nivel
atómico, en un entorno simulado. Estas simulaciones tienen en cuenta los movimien-
tos y equilibrios de energía necesarios, imitando el movimiento real de los átomos y
sus interacciones durante un espacio de tiempo nito.
Estas simulaciones son necesarias para estudiar propiedades que serían im-
posibles de capturar de forma analítica, ayudando en la investigación de nuevas
drogas y medicamentos. El hecho de poder predecir las interacciones y formas de
ciertas proteínas usando programas informáticos permite a las empresas farmacéuti-
cas ahorrar tiempo y dinero en sus investigaciones.
Sin embargo, los sistemas de simulación molecular están limitados por la
capacidad de cálculo de los ordenadores actuales. Las mediciones de tiempo de
simulación se dan en el orden de nanosegundos/día, dando a entender que para
obtener los movimientos durante un corto espacio de tiempo simulado es necesario
mucho tiempo real.
Por ejemplo, si se quisiera representar un nanosegundo de movimientos de
un pequeño sistema formado por unos 92. 224 átomos, podrían ser necesarios hasta
14 días de tiempo de cómputo en un ordenador monoprocesador. La complejidad
de los cálculos realizados para simular los movimientos físicos de los átomos que
componen el sistema es tan alta que a día de hoy es impensable realizar simulaciones
moleculares en tiempo real.
Es muy necesario reducir los tiempos de cómputo. Para poder dar una
respuesta relativamente rápida, las simulaciones moleculares se llevan a cabo en en-

VIII
tornos virtuales ejecutados en sistemas informáticos con gran capacidad de cálculo.
Desde sus inicios se han desarrollado distintas técnicas para poder acelerar los cálcu-
los aprovechando las características de los sistemas informáticos de alto rendimiento.
Con la evolución de los sistemas de computadores paralelos como clústeres o multi-
procesadores, se ha conseguido gran reducción de los tiempos de simulación.
En concreto, las arquitecturas de tarjetas grácas o GPUs han propor-
cionado un gran incremento en el rendimiento de multitud de aplicaciones, gracias a
sus características masivamente paralelas. La popularidad de las GPUs ha aumen-
tado tanto que es fácil encontrar máquinas de altas prestaciones que tengan instaladas
varias tarjetas grácas programables para acelera los cálculos. En concreto, resultan
ser un gran apoyo hardware para los sistemas de simulación molecular, reduciendo
drásticamente los tiempos de simulación para muchos de los algoritmos utilizados.
El trabajo que aquí se presenta está centrado en la explotación de los sis-
temas multi-GPU para acelerar el cálculo de las simulaciones de dinámica molecular.
Entre otros objetivos, se pretende dar un nuevo enfoque en el uso de estas arquitec-
turas, explotando sus características como nodos de cómputo autónomos.
Para poder llevar a cabo esta tarea se han desarrollado diversas herramien-
tas y algoritmos nuevos. En concreto, se ha desarrollado un algoritmo de empa-
quetado de datos para comunicaciones directas entre GPUs. Éste algoritmo tiene
la particularidad de que se ejecuta completamente en GPU, evitando perder tiempo
moviendo datos entre CPU y GPU. También se han investigado distintas formas de
partición espacial para sistemas moleculares, seleccionando el más adecuado para en-
tornos multiGPU. Se han introducido mejoras para algoritmos de dinámica molecular
de cálculo de fuerzas largas, optimizando el método Multigrid Summation Method
(MSM). Por último, dado que los entornos con mayor cantidad de GPUs disponibles
IX
suelen ser sistemas tipo cluster de memoria distribuida, se ha portado el código a
estos sistemas y se han realizado pruebas de escalabilidad con resultados óptimos
para simulación de moléculas de gran tamaño.
Esta tesis continúa el trabajo de investigación iniciado por la compañía
Plebiotic S.L. en colaboración con el Grupo de Modelado y Realidad Virtual (GMRV)
del departamento de Ciencias de la Computación, Arquitectura de la Computación,
Lenguajes y Sistemas Informáticos y Estadística e Investigación Operativa de la
Escuela de Ingeniería Informática de la Universidad Rey Juan Carlos de Madrid.
En las siguientes secciones se muestra el estado del arte en ámbito de
dinámica molecular, los objetivos marcados, un resumen del trabajo realizado, y
las conclusiones que se han podido extraer de las pruebas realizadas.
Antecedentes
Se pueden distinguir dos líneas de investigación dentro de los sistemas de simulación
de dinámica molecular:
• Mejoras en velocidad de simulación
• Mejoras en la precisión de los cálculos
Ambas líneas son contradictorias: los cálculos precisos introducen una carga
computacional extra, mientras que las optimizaciones de velocidad de ejecución sue-
len basarse en el uso de métodos menos restrictivos que usualmente introducen er-
rores de precisión. Las optimizaciones que se suelen realizar en los últimos años se
centran en mejorar los algoritmos conocidos, adaptándolos para que hagan uso de
X
arquitecturas de altas prestaciones. Existen diversos algoritmos optimizados para
sistemas de memoria compartida y entornos distribuidos tipo cluster. En concreto,
una de las mayores dicultades se encuentra en el aprovechamiento de arquitecturas
masivamente paralelas tipo GPU.
Tipos de fuerzas simuladas
La dinámica molecular se centra en el cálculo de las fuerzas resultantes de la inter-
acción de los átomos que forman el sistema. El tiempo total de simulación se divide
en pequeños pasos de tiempo en los que se calculan las fuerzas que interactúan con
cada átomo para calcular su velocidad, y a partir de esa velocidad se calcula la nueva
posición para el siguiente paso de tiempo. Cuanto más pequeño es ese paso de sim-
ulación, más precisos son los cálculos, pero también tarda más en completarse. En
concreto se pueden distinguir dos tipos de fuerzas, una de ellas subdividida en otros
dos tipos:
• Fuerzas de enlace: Cuando se modelan sistemas moleculares, los átomos que
forman parte de la misma molécula quedan unidos mediante distintos tipos de
enlaces. Estos enlaces se simulan como si fueran muelles, afectando únicamente
a los átomos que formen el enlace. Suelen ser las fuerzas más rápidas de
calcular, representando poca carga de cómputo.
• Fuerzas electrostáticas: Las fuerzas electrostáticas de Van der Waals son las
producidas debido a las cargas de los átomos. Dado que las fuerzas electrostáti-
cas decaen rápidamente con la distancia, estas fuerzas se suelen dividir en dos
tipos para facilitar los cálculos:
Fuerzas de corto alcance, calculadas de forma exacta dentro de una distan-

XI
cia de corte respecto de un átomo. El cálculo de fuerzas de corto alcance
es bastante más pesado que el de fuerzas de enlace, llevándose una gran
parte del tiempo de cálculo de fuerzas.
Fuerzas de largo alcance, calculadas usando todos los átomos del sistema
que se encuentran más allá de una distancia de corte. El tiempo de cál-
culo usando algoritmos exactos es muy alto, por lo que se suelen usar
aproximaciones con cierto grado de error. Dado que las interacciones más
allá del radio de corte son poco importantes, en algunos casos es posible
ignorarlas sin que el sistema se resienta.
PME [4] es el método de cálculo de fuerzas de largo alcance más popular.
Resuelve los cálculos utilizando FFTs sobre una rejilla de potenciales de carga. FFT.
La versión original de PME usa diferenciación espectral y un total de cuatro FFTs por
cada paso de simulación, mientras que Smooth PME (SPME) [6] usa interpolación
por B-spline reduciendo el número de FFTs a dos. Se usa PME en multitud de sim-
uladores moleculares, como NAMD [26], GROMACS [11] o ACEMD [10]. Además,
PME ha sido paralelizado en GPU, aunque debido a la naturaleza del algoritmo es
difícil de implementar para que aproveche varias GPUs simultáneamente.
Simuladores moleculares
En los últimos 20 años se han desarrollado diversos simuladores de dinámica molec-
ular para entornos de altas prestaciones. NAMD [26] es uno de los más longevos,
datando las primeras versiones en 1995. Es de los más populares, siendo usado en
muchos de los proyectos más importantes de simulación molecular. Desde el inicio
se centró en desarrollar algoritmos paralelos para reducir los tiempos de ejecución.
NAMD reparte el trabajo de cálculo entre los nodos de cómputo disponibles real-
XII
izando una partición espacial del sistema. Cada partición se asigna a un nodo de
cómputo, que puede ser un núcleo en un sistema multi-CPU, o un nodo de un sis-
tema distribuido. Las subdivisiones creadas se denomina parches, y cada parche
mantiene la información de los átomos que le pertenecen junto con los parches ve-
cinos. A partir de ahí, se denen trabajos que se distribuyen a lo largo de las
CPUs disponibles en cada nodo de cómputo. Cada trabajo se queda denido como
una interacción entre parches, si algún nodo de cómputo necesita información de un
parche que no tiene, se enviará una copia de los datos necesarios junto con el trabajo
asociado.
En las últimas versiones de NAMD es posible usar GPUs para acelerar
cálculos. Se crean trabajos pequeños que se asignan a las GPUs junto con los datos
necesarios. Una vez se tienen los datos en GPU se lanzan los kernels necesarios para
calcular las fuerzas y se devuelven los resultados a memoria de CPU. Este esquema de
uso de la GPU como coprocesador fuerza que haya mucho intercambio de información
entre GPU y CPU.
GROMACS [11] es otro simulador molecular con gran historia de desar-
rollo, sus primeras versiones son de 1991. Inicialmente se implementó como un su-
percomputador, con multitud de nodos de cómputo unidos entre sí en una topología
de anillo. Más tarde se portó el sistema a código C, permitiendo su ejecución en
máquinas más comunes de cómputo paralelo, como clúster o multiprocesadores. Al
igual que NAMD, realiza una partición espacial del sistema para distribuirlo entre
los nodos disponibles. El sistema se divide en un grid escalonado, con áreas de
información compartida entre particiones adyacentes. Se pueden usar GPUs como
coprocesadores para acelerar los cálculos de ciertas partes del código, aunque en ese
caso sólo se han optimizado las fuerzas de corto alcance. Al igual que con NAMD,
en cada paso se realizan copias de los datos de entrada a GPU, y luego la descarga
XIII
de datos resultado a CPU.
ACEMD [10] es un simulador molecular relativamente más moderno que los
anteriores. Éste simulador está centrado en optimizaciones de algoritmos basados en
nuevos modelos matemáticos, que simplican las estructuras de las moléculas antes
de operar con ellas. Está optimizado para aprovechar sistemas multi-GPU instaladas
en una única estación de trabajo, y es uno de los simuladores más rápidos existentes.
En caso de haber varias GPUs, cada una realiza en paralelo el cálculo de un tipo de
fuerzas. Esta aproximación explota los sistemas multiGPU, aunque su escalabilidad
es reducida ya que todas las comunicaciones entre GPUs se realizan pasando por
memoria de CPU. Además, al estar limitado a estaciones de trabajo, el número
máximo de GPUs que pueden ser usadas está limitado por el número de tarjetas
grácas que se puedan instalar en la placa base.
Objetivos
Los sistemas de simulación de dinámica molecular pueden hacer uso de arquitec-
turas de altas prestaciones. En muchos casos, se pueden tener soluciones híbridas
GPU-CPU que permiten acelerar los cálculos de fuerzas. Sin embargo, es la CPU
la que mantiene el control de la aplicación, usando las GPUs como meros coproce-
sadores. Las GPUs actuales tienen una potencia de cálculo que supera la mayoría de
las CPUs, pero quedan muy limitadas por las comunicaciones entre la CPU y GPU
por la carga/descarga de datos. Las arquitecturas actuales soportan comunicaciones
directas entre GPUs que se encuentren instaladas en la misma placa base [30], o in-
cluso en tarjetas conectadas a nodos en la misma red de trabajo. Estas características
permiten usar las GPUs como nodos de cómputo para algoritmos de simulación de
dinámica molecular y no como coprocesadores, reduciendo la sobrecarga introducida

XIV
por las comunicaciones GPU-CPU
En concreto, el presente trabajo pretende demostrar la viabilidad del

uso de GPUs como nodos de computación autonómica, desarrollando algo-
ritmos de comunicación y gestión de datos que se ejecuten completamente
en GPU, junto con algoritmos de simulación molecular adaptados a en-
tornos distribuidos MultiGPU. Para poder demostrar esta hipótesis, se deben
diseñar e implementar los algoritmos teniendo especial cuidado en la escalabilidad
del sistema y en minimizar la cantidad de datos a comunicar entre nodos.
Para alcanzar el propósito del trabajo, se han denido los siguientes obje-
tivos:
• Mejora de los tiempos de simulación en entornos MultiGPU. Como se comentó
en párrafos anteriores, la comunicación directa entre GPUs permite ahorrar
tiempo. Los simuladores tradicionales usan las GPUs como coprocesador, mien-
tras que nuestro objetivo es utilizarlas como nodos de cómputo independientes.
El objetivo incluye desarrollar protocolos de comunicación GPU-GPU y algo-
ritmos de empaquetado de datos para arquitecturas masivamente paralelas.
• Mejoras de los algoritmos de simulación molecular actuales, adaptándolos a
arquitecturas GPU y a entornos distribuidos multiGPU. En concreto, el cál-
culo de fuerzas de largo alcance es un problema difícil de distribuir entre varias
GPUs. El método tradicional es PME, pero no se adapta a nuestras necesi-
dades. Por el contrario, el método MSM de cálculo de fuerzas de largo alcance
es mucho más factible de usar para nuestros propósitos. El objetivo incluye
mejoras en MSM para que sea tan rápido como PME, adaptándolo para sis-
temas multiGPU.
XV
• Mejora de la simulación de grandes sistemas moleculares formados por mil-
lones de átomos, manteniendo la escalabilidad de sistema. Para poder simular
sistemas moleculares grandes en entornos multiGPU es necesario utilizar arqui-
tecturas de altas prestaciones que dispongan de decenas de tarjetas grácas.
El objetivo incluye la adaptación de los algoritmos desarrollados a entornos
distribuidos tipo cluster, donde es factible tener un gran número de GPUs.
Como resumen, este trabajo intenta aportar nuevas formas de uso para los
entornos multiGPU. La solución debe ser escalable, y permitir acelerar los cálculos de
dinámica molecular para los sistemas moleculares que necesiten grandes cantidades
de recursos.
Metodología
En base a los objetivos planteados, se han agrupado los objetivos en una serie de hitos
a conseguir. El objetivo nal es el aprovechamiento de cualquier arquitectura multi-
GPU, por lo que inicialmente se ha seleccionado una arquitectura concreta basada
en múltiples GPUs conectadas a través de un bus de datos de alta velocidad en la
misma placa base. Éstas arquitecturas suelen disponer de un número reducido de
tarjetas grácas, como mucho entre 4 y 8 GPUs, por lo que los tamaños de los sis-
temas simulados solían estar limitados en tamaño. Para superar esas limitaciones, la
última parte del proyecto se ha centrado en realizar pruebas en entornos distribuidos
tipo cluster, donde se puede disponer de una mayor cantidad de GPUs conectadas a
través de una red de datos.
A continuación se detallan los hitos conseguidos en los dos sistemas multi-
GPU utilizados.
XVI
Dinámica molecular para entornos multiGPU de bus compartido en

la misma placa base
• Diseño e implementación de un algoritmo de partición espacial que genera
particiones regulares a partir del volumen de simulación. Cada partición es
asignada a una única GPU.
• Denición de áreas compartidas entre particiones que sirven para mantener la
coherencia del sistema, denominadas interfaces. La gestión de estas interfaces
es llevada de forma independiente por cada una de las GPUs.
• Diseño e implementación de un algoritmo paralelo para preparar los paquetes
de datos de las interfaces, que se usarán para realizar las comunicaciones y
actualizaciones entre GPUs. Este algoritmo se ha implementado por completo
en GPU.
• Adaptación de algoritmos actuales de dinámica molecular basados en partición
de trabajos a métodos de división espacial. Inicialmente se partía de varios
algoritmos paralelos que usaban las GPUs como coprocesador, que movían una
gran cantidad de datos entre CPU y GPU. Se han adaptado de tal manera que
cada GPU opera sobre su propia partición de datos, por lo que únicamente
se envían datos entre GPUs que son necesarios para la actualización de sus
particiones. La comunicación GPU-CPU es mínima, únicamente se envían los
datos necesarios para invocar los kernels de GPU.
• Denición de un método distribuido para el cómputo de fuerzas de largo al-
cance. En concreto, se ha usado como base el método MSM, optimizándolo
para que use FFTs, e implementándolo de forma distribuida en un entorno
multiGPU. Las pruebas realizadas han sido sobre un entorno multiGPU en

XVII
una única estación de trabajo, pero el método es extensible a entornos dis-
tribuidos tipo cluster.
• Realización de pruebas de escalabilidad y eciencia de los algoritmos implemen-
tados. Las pruebas han sido satisfactorias, probándose en muchos casos que es
más rápido que NAMD, uno de los simuladores moleculares de referencia.
Dinámica molecular para entornos multGPU de memoria dis-

tribuida, conectados a través de una red local
• Adaptación de los algoritmos desarrollados a un entorno multiGPU tipo clus-
ter, donde se comprueba la viabilidad de la solución para simular sistemas
moleculares de varios millones de átomos. El objetivo era simular sistemas
que eran imposibles de simular en entornos con pocas GPUs, debido a la poca
memoria ram que tiene para cargar el volumen de datos.
• Resolución de problemas de escalabilidad en memoria. Para acelerar los cál-
culos, existen una serie de listas de traducción de identicadores de átomos.
Éstas listas no escalan en memoria, limitando el tamaño máximo del sistema
que se puede simular. Al realizar las particiones del sistema, éstas listas con-
tienen huecos o zonas vacías. Cuantas más particiones hay, las zonas vacías
crecen, por lo que se podría solucionar el problema usando otro sistema de
almacenaje más eciente. Las Tablas Hash se adaptan muy bien a este sis-
tema, compactando los datos útiles y ahorrando memoria, por lo que se han
incorporado al sistema.
• Resolución de problemas de actualización del sistema molecular en un entorno
de memoria distribuida. Los métodos anteriores se aprovechaban de las carac-
terísticas de los entornos de memoria compartida para realizar la partición de

XVIII
datos. Ya que los átomos pueden migrar de una GPU a otra, se ha denido
un protocolo de comunicación de datos que permite describir de forma unívoca
los átomos y enlaces que se envían.
Conclusiones
A continuación se detallan las pruebas realizadas y las conclusiones obtenidas. Se
han centrado en los dos tipos de arquitecturas descritas en la sección anterior. Para
los sistemas multiGPU de bus compartido en la misma placa base se ha usado una
máquina equipada con Ubuntu GNU/Linux 10.04, dos CPUs Intel Xeon Quad Core
2.40GHz con tecnología hyperthreading, 32 GB de RAM y cuatro GPUs NVidia
GTX580 conectadas a un bus PCIe 2.0 en una placa base Tyan S7025 equipada con
un chipset IOH Intel 5520.
Para los sistemas multiGPU distribuidos en una red de trabajo, se ha usado
un cluster de 32 ordenadores con Sistema Operativo Linux Mint 14, 8 GB de RAM
y una GPU NVidia GTX760 equipada con 2GB of ram. Los nodos se encuentran
interconnectados por una red Gigabit/ethernet. Para comunicaciones entre GPUs se
ha usado OpenMPI 1.8, compilado con soporte para NVidia/Cuda.
Las siguientes secciones detallan los resultados obtenidos en ambos en-
tornos. La Figura 1 muestra las moléculas usadas para cada una de las pruebas.
XIX
Simulación de fuerzas de corto alcance para sistemas multiGPU de

bus compartido en la misma placa base
Las primeras evaluaciones se centraron en la simulación de fuerzas de enlace y fuerzas
electrostáticas de corto alcance en el entorno multiGPU descrito anteriormente. Se
midieron tiempos de comunicaciones y tiempos de simulación para 1, 2, y 4 parti-
ciones alojadas en distintas GPUs. Los tres sistemas moleculares (Figura 1) usados
para las pruebas fueron los siguientes:
• ApoA1 (92,224 átomos): Es un sistema bien conocido, una lipoproteínade alta
densidad (HDL) en plasma humano. Se suele utilizar en los tests de rendimiento
de NAMD.
• C206 (256,436 átomos) Es un sistema complejo formado por una proteína,
un ligando y una membrana. Debido a su heterogeneidad, presenta algunos
desafíos de balanceo de carga.
• 400K (399,150 átomos) Un sistema molecular sintético con una carga de datos
balanceada, consistente en 133,050 moléculas de agua.
Todos los tests han consistido en una ejecución de 2000 pasos de simulación,
representando el cálculo de 4picosegundos (4·10−12 segundos). Las Grácas 2a mues-
tran los resultados de escalabilidad obtenidos para 2 y 4 GPUs, más una estimación
(líneas discontinuas) de los resultados que se podrían esperar en sistemas con 8 y
16 GPUs teniendo en cuenta las limitaciones de las comunicaciones. Los resultados
muestran que la implementación funciona mejor según aumenta el tamaño del sis-
tema, compartiendo más trabajo entre las diferentes GPUs. El speedup obtenido en
APOA1 es menor que el resto debido al que es el sistema más pequeño, y los tiempos
de comunicación limitan rápidamente la escalabilidad del sistema

XX
La Gráca 2b evalúa los resultados obtenidos frente a NAMD, el simulador
molecular de referencia. El rendimiento se mide en términos de número de nanose-
gundos que se pueden simular en un día. En todos los casos nuestra solución mejora
la ejecución respecto de NAMD en un factor de al menos 4×.
Simulación de fuerzas de largo alcance para sistemas multiGPU de

bus compartido en la misma placa base
Las siguientes pruebas se centraron en la simulación de fuerzas electrostáticas de largo
alcance en el entorno multiGPU descrito anteriormente. En este caso se implementó
una versión distribuida de MSM, se midió su escalabilidad en 4GPUs, y se comparó
su eciencia con NAMD. Los tres sistemas moleculares (Figura 1) usados para las
pruebas fueron los siguientes:
• 400K (399,150 átomos) : Presentada anteriormente, es un sistema molécular
sintética con una carga de datos balanceada, consistente en 133,050 moléculas
de agua.
• 1VT4 (645,933, átomos) Es un sistema complejo formado por una holoenzima,
ensamblada alrededor del adaptador proteínico dApaf-1/DARK/HAC-1.
• 2x1VT4 (1,256,718 átomos) Un sistema formado por dos copias de 1VT4.
La Gráca 3 evalúa tanto la escalabilidad de la solución como los resultados
tiempos de ejecución obtenidos frente a NAMD. Se puede apreciar que, al igual que
antes, cuanto más grandes son los sistemas el speedup optenido es mejor.
XXI
Simulación de dinámica molecular en entornos multiGPU distribui-

dos
Por último, se realizaron pruebas de escalabilidad y rendimiento en el cluster de-
scrito anteriormente. Dado que los buses de comunicación son mucho más lentos
que en los sistemas de bus en placa base, para compensar los tiempos perdidos en
comunicaciones se han usado sistemas moleculares especialmente grandes. De esta
manera la comparación de tiempos de ejecución respecto de tiempos de comunicación
es mucho mayor, facilitando la escalabilidad del sistema y probando que con una red
más rápida el sistema sería escalable.
Los tests de escalabilidad se han centrado en dos aspectos, escalabilidad
en memoria y en tiempos de ejecución. Los sistemas moleculares seleccionados
(Figura 1), están compuestos por una gran cantidad de átomos, por lo que no era
posible simularlos en un único nodo del clúster:
• 2x1VT4 (1,256,718 átomos): Es el sistema usado en las simulaciones anteriores.
Éste sistema estaba al límite de uso de memoria en el sistema anterior, las
tarjetas grácas usadas en el cluster tienen menos memoria gráca por lo que
se necesitan al menos un mínimo de 2 nodos para simular.
• DHFR_555 (2,944,750 átomos): Es un sistema sintético con una carga de
cálculo balanceado formado por varias copias de la molécula DHFR (f ), que
necesita un mínimo de 4 nodos para simularse.
• DHFR_844 (3,015,424 átomos): Es otro sistema sintético formado por varias
copias de DHFR, distribuido de forma distinta.
Una de las mejoras implementadas estaba centrada en la escalabilidad en

XXII
memoria del sistema. La Gráca 4 muestra la cantidad de datos reservados en
memoria por cada uno de los nodos del cluster para DHFR_555. Se puede observar
cómo disminuye según se añaden más nodos de simulación. Las Grácas 5 muestran
el speedup calculado para el sistema. Hay que tener en cuenta que el sistema de
referencia usado para el cálculo (seedup=1) empieza en 4 GPUs, por lo que para
cada una de conguraciones de GPUs se podría estimar que sería 4 veces mayor que
lo mostrado. La Gráca 5a muestra el speedup total, teniendo en cuenta los tiempos
de envío, mientras que la Gráca 5b muestra el speedup para los tiempos de cálculo
de fuerzas. Teniendo en cuenta eso, se puede determinar que el sistema es escalable
según aumenta el número de GPUs, y se beneciaría de una red de comunicaciones
mucho más rápida

XXIII
(a) ApoA1 en (b) C206 en (c) 400K

agua agua
(d) 1VT4 en (e) 2x1VT4 en (f ) Molécula

agua agua DHFR
(g) Varias (h) Varias (i)32 copias de

copias de DHFR copias de DHFR 1VT4 en una
en una con- en una con- conguración de
guración de guración de 4x4x2
5x5x5 5x5x5
Figure 1: Moléculas de prueba

XXIV
(a) Speedup (b) Comparación con NAMD
Figure 2: Escalabilidad (a) y comparación de rendimiento con NAMD (b), medido

en términos de nanosegundos simulados por día.
Figure 3: Tiempos de ejecución y speedup

XXV
Figure 4: Reservas de memoria para DHFR_555.
(a) Speedup Global (b) Only computation speedup
Figure 5: Speedup comparison of the three molecules. Note that the reference
conguration is for 4 GPUs.
Contents
List of Figures XXXIII
1 Introduction 1
1.1 Motivations and scope of the work . . . . . . . . . . . . . . . . . . . 1
1.2 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Document organization . . . . . . . . . . . . . . . . . . . . . . . . . . 6
I STATE-OF-THE-ART 9
2 Molecular dynamics 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Bonded interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Bonds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Angles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Dihedrals and Impropers . . . . . . . . . . . . . . . . . . . . . 13
2.3 Non-Bonded Electrostatic Interactions . . . . . . . . . . . . . . . . . 14

XXVIII Contents
2.4 Van Der Waals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Integrators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.1 Verlet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5.2 Respa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Parallel Molecular Dynamics 19
3.1 Parallelization Techniques . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1 Bonded Forces . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.2 Non-bonded forces . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.2.1 Short range . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.3 Long Range Non-bonded Forces . . . . . . . . . . . . . . . . . 21
3.1.4 The Multilevel Summation Method . . . . . . . . . . . . . . . 23
3.2 Parallel molecular simulators . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 NAMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.2 GROMACS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.3 ACEMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Contents XXIX
II PROBLEM-STATEMENT-AND-PROPOSAL 27
4 Problem Statement 29
4.1 A grand challenge problem . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Novel architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Multi-GPU communications bottleneck . . . . . . . . . . . . . . . . 32
4.3.1 Contribution: Direct GPU-GPU communications . . . . . . . 33
4.3.2 Contribution: Distributed MSM . . . . . . . . . . . . . . . . . 33
4.4 Solutions for memory scalability . . . . . . . . . . . . . . . . . . . . . 34
5 On-Board Multi-GPU Short-Range Force Computation 35
5.1 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Parallel Partition Update and Synchronization . . . . . . . . . . . . 40
5.2.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.2 Identication of Transfer Data . . . . . . . . . . . . . . . . . 41
5.2.3 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Short Range On-Board Multi-GPU Evaluation . . . . . . . . . . . . 44
5.3.1 Comparison of Partition Strategies . . . . . . . . . . . . . . . 45
5.3.2 Scalability Analysis and Comparison with NAMD . . . . . . . 46

XXX Contents
6 On-Board Multi-GPU Long-Range Force Computation 51
6.1 Optimized MSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.1.1 FFT-Based Sums . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2 Distributed MSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2.1 Multigrid Partitions . . . . . . . . . . . . . . . . . . . . . . . 54
6.2.2 Periodic Boundary Conditions on Multiple GPUs . . . . . . . 55
6.2.3 Parallel Update and Synchronization of Interfaces . . . . . . . 57
6.3 Evaluation for On-Board Multi-GPU MSM . . . . . . . . . . . . . . 59
6.3.1 Scalability Analysis . . . . . . . . . . . . . . . . . . . . . . . . 59
7 Distributed Multi-GPU Molecular Dynamics 63
7.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.1.1 System partition . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.1.2 Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.2 Distributed Multi-GPU Molecular Dynamics evaluation . . . . . . . 70
7.2.1 Scalability Analysis . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2.2 Simulation of Huge Molecules . . . . . . . . . . . . . . . . . . 75

Contents XXXI
III CONCLUSIONS-AND-FUTURE-WORK 77
8 Conclusions and Future Work 79
8.1 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . 80
8.1.1 On-Board Multi-GPU Short-Range Molecular Dynamics . . . 80
8.1.2 On-Board Multi-GPU Long-Range Molecular Dynamics . . . 81
8.1.3 Molecular Dynamics for Distributed Multi-GPU Architectures 81
8.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Bibliography 85
List of Figures
1 Moléculas de prueba . . . . . . . . . . . . . . . . . . . . . . . . . . . XXIII
2 Escalabilidad (a) y comparación de rendimiento con NAMD (b), me-
dido en términos de nanosegundos simulados por día. . . . . . . . . . XXIV
3 Tiempos de ejecución y speedup . . . . . . . . . . . . . . . . . . . . . XXIV
4 Reservas de memoria para DHFR_555. . . . . . . . . . . . . . . . . XXV
5 Speedup comparison of the three molecules. Note that the reference
conguration is for 4 GPUs. . . . . . . . . . . . . . . . . . . . . . . . XXV
3.1 Diagram showing the major operations of MSM. The bottom level
represents the atoms, and higher levels represent coarser grids. . . . . 24
4.1 HIV1 Capsid, 64 million atoms total including solvent . . . . . . . . 32
5.1 Comparison of binary (a) vs. linear spatial partitioning (b). The
striped regions represent the periodicity of the simulation volume. . . 37
5.2 The dierent types of cells at the interface between two portions of
the simulation volume. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 PCIe conguration of our testbed. . . . . . . . . . . . . . . . . . . . 43
5.4 Benchmark molecules. . . . . . . . . . . . . . . . . . . . . . . . . . . 48

XXXIV List of Figures
5.5 Performance comparison of binary and linear partition strategies on
C206. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.6 Running time (2000 steps) for the binary partition strategy on C206. 50
5.7 Scalability (a) and performance comparison with NAMD (b), mea-
sured in terms of simulated nanoseconds per day. . . . . . . . . . . . 50
6.1 Partition of the multilevel grid under periodic boundaries. Left: All
grid points on each level, distributed into 3 GPU devices. Right: Data
structure of GPU device 0 (blue) on all levels, showing: its interior
grid points, interface points for an interface of size 3, and buers to
communicate partial sums to other devices. Interface points due to
periodic boundary conditions are shown striped. Arrows indicate sums
of interface values to the output buers. With interfaces of size 3, in
levels 1 and 2 several interface points contribute to the same buer
location, and in level 2 there are even interior points that map to
interface points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 Running time and speedup . . . . . . . . . . . . . . . . . . . . . . . 61
6.4 Scalability Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1 Communication scheme from GPU A to GPU B. Each GPU hosts a
small portion of the system, referencing the data by local IDs. Local
IDs are translated to global data IDs and sent to the second GPU.
After data reception, a translation to local IDs is performed. . . . . 69

List of Figures XXXV
7.3 Data size communications for DHFR_844 along 100 steps. . . . . . . 72
7.4 Memory allocation for DHFR_555. . . . . . . . . . . . . . . . . . . . 73
7.5 Speedup comparison of the three molecules. Note that the reference
conguration is for 4 GPUs. . . . . . . . . . . . . . . . . . . . . . . . 74
7.6 Breakdown of running time (100 steps) for DHFR_844. . . . . . . . 75
7.7 Benchmark molecule, composed of 32 copies of 1VT4 in 4x4x2 cong-
uration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Dedicado a mi familia y amigos, sin ellos no habría sido posible
Caminante no hay camino, se hace camino al andar.
Antonio Machado, Campos de Castilla, 1912.

Chapter 1
Introduction
1.1 Motivations and scope of the work
Molecular dynamics simulations [29] are computational approaches for studying the
behavior of complex biomolecular systems at the atom level, estimating their dynamic
and equilibrium properties which can not be solved analytically. Their most direct
applications are related to identifying and predicting the structure of proteins, but
they also provide a tool for drug or material design.
These simulations recreate the movements of atoms and molecules due to
their interactions for a given period of time. Molecular dynamics simulations enable
the prediction of the shape and arrangement of molecular systems that cannot be
directly observed or measured, and and they have demonstrated their impact on
applications of drug and nanodevice design [29].However, they are limited by size
and computational time due to the current available computational resources.
Molecular dynamics is a computationally expensive problem, due to both
high temporal and high spatial resolution. For instance, simulating just one nanosec-
ond of the motion of a well known system with 92 224 atoms (ApoA1 benchmark)
using only one processor takes up to 14 days [19].
The trajectories and arrangements of molecules over temporal scales in the

2 1. Introduction
order of 1µs are dictated by vibrations taking place at scales as ne as 1fs = 10−15 s;
therefore, eective analysis requires the computation of many simulation steps. At
the same time, meaningful molecular systems are often composed of even millions
of atoms. Most importantly, the motion of atoms is aected by distant electrostatic
potentials, which makes molecular dynamics an n-body problem with quadratic cost.
The simulation times of molecular dynamics can be reduced thanks to
algorithms that update atoms in a parallel way. Such algorithms were initially
implemented on multi-CPU architectures, such as multicore processors or com-
puter clusters with several computing nodes connected by a local area network
(LAN) [13, 23, 4]. More recent alternatives have used hybrid GPU-CPU architec-
tures to provide parallelism [32], taking advantage of the massive parallel capabilities
of GPUs. This approach interconnects several computing nodes, each one with one
or more GPUs serving as co-processors of the CPUs [16, 12]. The compute power of
this approach is bounded by the cost to transfer data between CPUs and GPUs and
between compute nodes.
Typical solutions to molecular dynamics separate short-range forces, which
are computed exactly, from long-range ones, and approximate such long-range forces.
The Particle Mesh Ewald (PME) method [5] is probably the most popular
approximation to long-range molecular forces, and it discretizes atom charges on
a grid, computes a grid-based potential using a FFT, and nally interpolates the
potential back to the atoms. Its cost is dominated by the FFT, which yields an
asymptotic complexity O(N log N ).
Molecular dynamics computations can be further accelerated through paral-
lel algorithms, including massive parallelization on GPUs [27, 11], or even multi-GPU
parallelization [20]. The PME method is suited for single GPU parallelization, but
1.2. State of the art 3
not for distributed computation, thus limiting the scalability of long-range molecular
dynamics.
1.2 State of the art
In Molecular Dynamics, two dierent research lines can be found:
• Speed optimizations
• Improve accuracy of calculations
These two research lines are apparently conicting:: accurate calculations
result in slower simulations, and speed optimizations usually assume a certain error.
Nowadays, optimizations are focused on improving well known algorithms and devel-
oping newer algorithms in order to take advantage of high performance computers.
There are several molecular simulation algorithms optimized for shared memory sys-
tems, multi-CPU networks and distributed computing. Currently, one of the major
challenges is the use of massively parallel processing architectures like GPUs.
The present Ph.D. thesis was initially motivated by the research initiated by
Plebiotic SL company in collaboration with the Modeling and Virtual Reality Group
(GMRV) of the Universidad Rey Juan Carlos de Madrid. The initial objectives of
Plebiotic SL focused on-board Multi-GPU systems, developing their own molecular
simulator named PleMD. This simulator achieved good simulation times but lacked
scalability, due to the large amount of data shared between CPU and GPU.
The objectives of this PhD thesis aim to exploit both on-board and dis-
tributed multi-GPU architectures to improve speed and scalability, while keeping

4 1. Introduction
accuracy in simulations. The overall objectives can be summarized as follows:
• Design a communication protocol for molecular dynamics that makes use of
several GPUs interconected by any kind of bus.
• Develop faster and scalable algorithms adapted to multi-GPU environments,
while retaining simulation accuracy.
1.3 Objectives
Typical solutions to molecular dynamics make use of high performance architectures.
In some cases, hybrid CPU-GPU computation is used in order to achieve better
simulation times. However, the CPU keeps the control of the application, using
GPUs as mere coprocessors. The parallel computation capability of modern GPUs
outperforms CPUs, but it is limited by CPU-GPU communications. Novel GPU
architectures support direct communication between GPUs, even mounted on the
same board [30]. These features enable the use of GPUs as the central compute
nodes of parallel molecular dynamics algorithms, and not just as mere co-processors,
thereby reducing the communication load.
Thus, the central aim of this work is to demonstrate the feasibility of

using GPUs as autonomous computing nodes, developing communication
and data management algorithms that are executed completely on GPU,
along with molecular simulation algorithms adapted to distributed multi-
GPU environments. In order to demonstrate this initial hypothesis, the design
and implementation of algorithms should be carried out with scalability and light
data communication in mind.

1.3. Objectives 5
The proposed approach has the following objectives:
• Improvement of simulation times in multi-GPU environments.
• Improvement of the current molecular dynamics algorithms, adapted to GPUs
and distributed architectures.
• Improvement of the simulation of molecular systems formed by millions of
atoms, while retaining scalability.
To fulll the previous objectives, the following milestones were proposed:
• The design and implementation of a spatial partition algorithm that performs
regular partitions of the simulation system. Each partition is kept on each
GPU of the system.
• The denition of shared areas between partitions that maintain data coherency,
named interfaces .
• The design and implementation of a parallel algorithm for the setup of interface
data packages to be transferred between GPUs. This algorithm will run entirely
on GPU.
• The adaptation of the current parallel molecular dynamics algorithms based
on task partition method, to a spatial partition method.
• The denition of a distributed method for slow forces computations.
As a summary, this work provides novel ways to exploit multi-GPU en-
vironments for molecular simulation. The thesis proposes a scalable solution that
increases performance of challenging data-intensive molecular dynamics simulations.

6 1. Introduction
1.4 Document organization
The rest of this document is organized into three blocks:
• State-of-the-art: The aim of this section is to provide a general overview of
up-to-date work in molecular dynamics.
Chapter 2: Presents some of the best known methods and concepts of
molecular dynamics.
Chapter 3: Explains some of the best known parallel algorithms used in
the implementation of several molecular dynamics simulators and presents
some of the most popular software solutions for molecular simulation.
• Problem statement and proposal: This part includes all the contributions pre-
sented in this thesis.
Chapter 4: Denes the main problems to be solved. The milestones pre-
sented in the previous section are explained in this chapter.
Chapter 5: Describes how the rst milestone was achieved. It covers in
depth how the space of simulation is divided, and how it is distributed in
an on-board multi-GPU environment. Only fast and medium forces are
considered.
Chapter 6: Is focused on long-range forces for on-board multi-GPU sys-
tems. Presents an optimization of the multilevel summation method
(MSM), achieving better simulation times than the original method.
Chapter 7: Describes the last milestone. It presents a scalable molecular
dynamics algorithm implementation for a distributed network of GPUs.

1.4. Document organization 7
Conclusions and future work: the last part discusses the conclusions of this
doctoral thesis and the open lines of this proposal.
Chapter 8: This chapter extracts the main achievements of this work. Also
presents some of the future researching lines to expand the capabilities of
the simulator developed.

Part I
STATE-OF-THE-ART
Chapter 2
Molecular dynamics
2.1 Introduction
In computer simulations of molecular dynamics, atoms are modeled as particles in a
virtual 3D spatial coordinate system. In biological systems, the molecules of interest
are surrounded by water molecules, and periodic boundary conditions are imposed
on the simulation volume, i.e., the simulation volume is implicitly replicated innite
times. A more comprehensive description of the basics of molecular dynamics can
be found in [29].
The motion of atoms is computed by solving Newtonian mechanics under
the action of three types of forces: bonded forces, non-bonded short-range forces
(composed of Van der Waals forces and electrostatic interactions between atoms
closer than a cuto radius Rc ), and non-bonded long-range forces (consisting of
electrostatic interactions between atoms separated by a distance greater than Rc ).
The simulation time is divided into steps of very small size, in the order of
1fs = 10−15 s. Given atom positions Xi and velocities Vi at time Ti , the simulation
algorithm evaluates the interaction forces and integrates them to obtain positions
Xi+1 and velocities Vi+1 at time Ti+1 .

12 2. Molecular dynamics
2.2 Bonded interaction
A chemical bond represents the attraction between two atoms that form a chemical
connection. These types of bonds are related to the charge and number of electrons
that atoms may share or transfer. There are several types of bonds, depending on
the number of atoms that form the bond and its geometry.
In molecular dynamics, all bonded forces are considered as short range
forces. That is because bonds exists in groups of two or more atoms closer than a
cuto radius. Bonded force interactions should be calculated for each atom in the
bond. Algorithm 1 shows the pseudocode for the serial algorithm.
Algorithm 1 Serial algorithm for bonded force computation

1: procedure computeBondedForces(transf erIDs)
2: for atom in atoms do
3: for bond in bonds[atom] do
4: for bondedAtom in bond.bondedAtoms do
5: if atom! = bondedAtom) then
6: atom.f orces+ = getBondedF orces(atom, bondedAtom)
7: end if
8: end for
9: end for
10: end for
11: end procedure
The following subsections describe four types of bonds used in molecular
dynamics to describe bonded interactions: Simple bonds, Angles, Dihedrals and
Impropers.
2.2.1 Bonds
The bonds between two atoms are described by simple harmonic springs. The energy
between two atoms i and j is given by:

2.2. Bonded interaction 13
Ebond = k(|rij | − r0 )2
• k: Constant of the spring that bonds both atoms.
• |ri,j |= Distance between i and j atoms
• r0 = Distance between i and j atoms in relax
2.2.2 Angles
Angles describe a bond formed by three atoms. These bonds are dened as angular
harmonic springs. The energy of an angle bond formed by three atoms (i,j and j) is
described as follows:
Eangle = kθ (θ − θ0 )2 + kub (|rik − rub )2
• kθ : Constant of the harmonic angle spring that bonds the three atoms.
• θ|= Angle formed by i-j atoms and the vector that connects k - j.
• θ0 = Angle formed by i-j atoms and the vector that connects k - j at rest.
2.2.3 Dihedrals and Impropers
Dihedral and Improper bonds describe the interaction between four linked atoms.
These bonds are modeled by an angle spring between the planes formed by the rst
3 atoms (i,j and k) and the second set of 3 atoms (j , k and l). The energy for a
dihedral or improper angle between the atoms i, j, k and l is given by:

Ed/i = k(1 + cos(nφ + θ))
If n > 0, or
Ed/i = k(φ − θ)2
If n = 0.
2.3 Non-Bonded Electrostatic Interactions
Electrostatic forces are considered long-range interactions between atoms separated
by a large distance. Electrostatic energy describes the force resulting from the inter-
action between charged particles. The resulting energy between two atoms i and j is
described by Coulomb's law:
Cqi qj
E = ε14 ε0 |ri,j |
• ε14 : Scale factor for 1-4 interactions (pairs of atoms connected by three bonds).
It is zero for 1-2 and 1-3 interactions (pairs of atoms connected by single and
double bonds, respectively) and is equal to 1.0 for any other interaction.
• C = 2, 31 × 1019 J nm
• qi qj = Charges of i and j atoms, respectively
• ε0 = Dielectric constant
• |ri,j |= Distance between i and j atoms

2.4. Van Der Waals 15
As already mentioned, the computation of electrostatic forces is divided
into two types, short range and long range, treated separated.
2.4 Van Der Waals
The Van der Waals interactions describe the force resulting from the interaction of
atoms. The Van der Waals energy between two atoms i and j is described as follows:
A B
Evdw = r12
− r6ij
ij
A and B constants are precomputed using parameters σij and εij , which
also are precomputed using the σ and ε values of the single atoms. Those are input
constants for each type of atom. This is the entire equation sequence:
σi +σj
σi j = 2
√
εi j = εi εj
A = 4σi j 12 εi j
B = 4σi j εi j
Same as electrostatic forces, Van der Waals forces are also considered long or
medium range. These interactions happen between atoms that may be separated by
a large distance. These forces decay faster than electrostatic forces, so it is possible
to establish a cuto distance after which the force is negligible. For this reason,
sometimes it is possible to consider only medium-range forces.

2.5 Integrators
2.5.1 Verlet
Molecular dynamics often use second-order integrators, such as Leapfrog and Verlet,
which oer greater stability than Euler methods. In the following, the integration al-
gorithm is implemented using the Velocity Verlet scheme, similar to Leapfrog method,
but the positions, speeds and forces are obtained at the same value of time .
1
x(t + ∆t) = x(t) + v(t)∆t + F (t)∆t2 ) (2.1)
2m
∆t 1
v(t + ) = v(t) + F (t)∆t (2.2)
2 2m
∆t 1
v(t + ∆t) = v(t + )+ F (x(t + ∆t)) (2.3)
2 2m
Where x is the position vector, v is velocity vector and F the force vector.
As F (t + ∆t) does not depend on v, equation 2.2 is replaced by equation 2.2. This
integration scheme is applied for every atom in the system.
2.5.2 Respa
RESPA (REference System Propagator Algorithm) is a method of integration with
multiple step times (Multi-time step). This method tries to avoid computing long-
range forces for every time step. While standard integration methods require the
2.5. Integrators 17
calculation of all forces, both of short and long range, RESPA establishes a relation-
ship between the number of times the short and long range forces are calculated.
The Van der Waals and electrostatic forces take considerably more com-
puting time and also allow a longer time step than the bonded forces. In turn,
the long-range electrostatic forces allow a longer time step with respect to the Van
der Waals. Algorithm 2 shows a pseudo code where forces have been divided into
hardf orce (fh ), mediumf orce (fm ) and sof tf orce (fs ), and their respective tran-
sition times are ∆th , ∆tm and ∆ts . Speed, coordinates and mass are shown as v, R
and m, respectively.
Algorithm 2 Generic RESPA Multi Step algorithm

1: for i = 1 to S do
2: v = v + fs2m
∆ts
3: for j = 1 to M do
4: v = v + fm2m∆tm
5: for k = 1 to H do
6: v = v + fh2m∆th
7: r = r + v∆tm
8: fh = ComputeHardF orces()
9: v = v + fh2m∆th
10: end for

11: f m = ComputeM ediumF orces()
12: v = v + fm2m
∆tm
13: end for

14: f s = ComputeSof tF orces()
15: v = v + fs2m
∆ts
16: end for
This algorithm calculates fs only S times while fm is evaluated HxM times,
and fs HxMxS. The hard forces correspond to the bonding forces, the medium forces
correspond to Van der Waals and electrostatic calculated closer than a cuto dis-
tance, and soft electrostatic are the forces calculated beyond the cuto distance.
Furthermore, for integration of velocity and position, we have used the Verlet scheme
explained in the previous section.

Chapter 3
Parallel Molecular Dynamics
In this chapter we present a review of the state of the art in computer driven simu-
lations of molecular dynamics, and more specically in the two main topics covered
in this thesis. After a general overview of parallel techniques adapted to molecular
dynamics simulations, we present related work on the topic of parallel molecular
dynamics simulators. The last section of this chapter presents an analysis of the
best known molecular dynamics simulators, showing how some algorithms have been
adapted to multi-CPU and multi-GPU parallel architectures.
3.1 Parallelization Techniques
In computer-driven molecular dynamics simulations, atoms are contained within a
virtual 3D coordinate system that models the real environment inside a specic
volume. This section presents some of the most known techniques used to speed up
simulations.
Several optimizations can be found in order to take advantage of parallel
computer architectures. In the following sections we summarize some of the most
common techniques used for each kind of atom interaction.

20 3. Parallel Molecular Dynamics
3.1.1 Bonded Forces
Bonded forces represent the interaction between a group of atoms linked by some
kind of bond. For each bond, the energy that aects the atoms involved is measured
applying the forces model described in Section 2.2. Since the energy calculations
for each bond are independent of each other, and the workload is similar within the
bonds of the same type, the most popular way to parallelize this computation is by
using a task subdivision. This method is easily parallelizable using both multi-CPU
and GPU parallel architectures.
3.1.2 Non-bonded forces
Non-bonded forces decay rapidly with distance, so only the interactions between
atoms closer than a cuto radius (Rc ) are accurately calculated. Atoms separated by
a distance greater than Rc are calculated using approximations, in order to accelerate

the simulations. These optimizations can reduce the computational cost from O(N 2 )
to O(N ).
3.1.2.1 Short range
The computation of short-range non-bonded forces can be accelerated using a regular
grid, which is updated at a lower rate than the simulation. This method is known as
cell list [26, 33, 11, 29]. In this algorithm, the volume of the simulation s divided into
cells or three-dimensional boxes whose dimension is given by the cuto radius (Rc ).
Van der Waals and short-range electrostatic forces are calculated between pairs of
atoms that are in the same cell or a neighboring cell.

3.1. Parallelization Techniques 21
Like in Bonded forces, Short Range forces are easily parallelizable by using
task-based decomposition methods. First, the interactions between neighboring cells
can be performed individually. Then, interactions between each atom of a box with
respect to each atom in a second box can be computed also in parallel. This takes
advantage of massively parallel architectures such as GPUs or clusters, using spatial
partitioning techniques.
However atoms move between cells, so the structures need to be updated
from time to time along the simulation. This introduces small time lags in the simu-
lation, hindering optimal scalability. In the case of distributed memory architectures
interconnected by a network, the amount of information shared between dierent
computing nodes could be very large. In such cases the objetive is to reduce the
amount of information transfered to prevent bottlenecks.
3.1.3 Long Range Non-bonded Forces
There are many approaches to improve the quadratic cost of long-range molecular
dynamics, either using approximate solutions or parallel implementations (See [25] for
a survey). Massively parallel solutions on GPUs have also been proposed, although
GPUs are mostly used as co-processors [27].
Particle Mesh Ewald (PME) [5] is the most popular method to compute
long-range molecular forces. Lattice Ewald methods solve the long-range potential
on a grid using an FFT. Regular PME uses spectral dierentiation and a total of
four FFTs per time step, while Smooth PME (SPME) [7] uses B-spline interpolation
reducing the number of FFTs to two. PME is widely used in parallel molecular dy-
namics frameworks such as NAMD [27], GROMACS [12] or ACEMD [11]. PME can
be massively parallelized on a single GPU, but it is dicult to distribute over multi-

ple GPUs due to the all-to-all communication needed by the FFT. However Nukada
et al. [21] propose a scalable multi-GPU 3DFFT to minimize all-to-all comunica-
tions. Cerutti et al. [3] proposed Multi-Level Ewald (MLE) as an approximation
to SPME by decomposing the global FFT into a series of independent FFTs over
separate regions of a molecular system, but they did not conduct scalability analysis.
Other long-range force approximations are based on multigrid algorithms.
Multigrid approaches utilize multiple grid levels with dierent spatial resolutions to
compute long-range potentials with O(N ) cost. In molecular dynamics, multigrid
methods have been demonstrated to be superior to other methods [31], such as the
Fast Multipole Method (FMM) [37], because they achieve better scalability while
keeping acceptable error levels. The Meshed Continuum Method (MCM) [1] and
Multilevel Summation Method (MSM) [9] are the two most relevant multigrid meth-
ods for long-range force computation. MCM uses density functions to sample the
particles onto a grid and calculates the potential by solving a Poisson equation in a
multigrid fashion. On the other hand, MSM calculates the potential directly on a
grid by using several length scales. The scales are spread over a hierarchy of grids,
and the potential of coarse levels is successively corrected by contributions from ner
levels up to the nest grid, which yields the nal potential. This approach exhibits
higher options for scalability than PME or other multigrid algorithms. MSM has
been massively parallelized on a single GPU [8], although the performance of this
implementation is notably worse than PME.
Next, we describe MSM in more detail, as it is our method of choice for the
acceleration of long-range force computation.

3.1. Parallelization Techniques 23
3.1.4 The Multilevel Summation Method
For a particle system with charges {q1 , . . . qN } at positions {r1 , . . . rN }, the electro-
static potential energy is
N N
1X X qi qj
U (r1 , ...rN ) = . (3.1)
2 ||ri − rj ||
i=1 j=1,j6=i
Its exact computation has O(N 2 ) complexity.
MSM is a fast algorithm for computing an approximation to the electrostatic
interactions with just O(N ) computational work. MSM splits the potential into
short-range and long-range components. The short-range component is computed
as a direct particle-particle interaction while the long-range one is approximated
through a hierarchy of grids.
For the long-range component, the method rst distributes atom charges
onto the nest grid. This process is called anterpolation. A nodal basis function
φ(r) with local support about each grid point is used to distribute charges. Once all
atom charges are distributed onto the nest grid, charges are distributed onto the
next coarser grid, using the same basis functions. This process is called restriction,
and it is repeated until the coarsest grid is reached.
Figure 3.1 depicts the full MSM method. On each level, the method com-
putes direct sums of nearby grid charges up to a radius of b2 Rc /h0 c grid points,
where h0 is the resolution of the nest grid. Hardy and Skeel [9] indicate that a reso-
lution h0 between 1Å and 3Å is sucient for molecular dynamics simulations. Note
that the resolution is halved on each coarser grid, hence direct sums cover twice
the distance with the same number of points. The direct sum of pairwise charge
Figure 3.1: Diagram showing the major operations of MSM. The bottom level
represents the atoms, and higher levels represent coarser grids.
potentials is analogous to the one for short-range non-bonded forces, with the excep-
tion that grid distances are xed and can be computed as preprocessing, hence the
computation is simply an accumulation of weighted grid charges.
A GPU optimized version of the direct sum was developed by Hardy et al [8].
The weighted grid is stored in constant memory and charges in shared memory. A
sliding window technique is used to achieve an ecient reading. Hardy's algorithm
computes the nest levels on GPU, while the coarsest levels are computed on CPU.
Once direct sums are computed on each level, potentials are interpolated
from coarse to ner levels, and contributions from all levels are accumulated. This
process is called prolongation. Finally, potentials from the nest grid are interpolated
on the atoms.
Multigrid methods have been used extensively in a variety of scientic elds,
but molecular dynamics suers the added diculty of dealing with periodic boundary
conditions. Izaguirre and Matthey [15] developed an MPI-based parallel multigrid
summation on clusters and shared-memory computers for n-body problems.
Chapter 6 presents a solution for long-range molecular dynamics on multi-
GPU platforms, and those improvements could be extended to other types of n-body
3.2. Parallel molecular simulators 25
problems.
3.2 Parallel molecular simulators
This section presents some of the most popular software solutions for molecular
simulation. All of them use some of the techniques described in the previous section,
adapted in some way to parallel architectures: Multi-CPU, GPUs and even clusters.
Several authors have proposed ways to parallelize molecular dynamics algo-
rithms on hybrid CPU-GPU architectures [18, 36]. Very recently, Rustico et al. [28]
have proposed a spatial partitioning approach for multi-GPU particle-based uid
simulation, which shares many features with molecular dynamics.
3.2.1 NAMD
NAMD [27] performs a spatial partition of the system. Each partition is allocated in a
computing node, which might be one core in a multi-CPU machine, or a distributed
node in a cluster. These subdivisions are known as patches, each patch keeps
information of the atoms within it, and the neighboring patches that need shared
data. Then, NAMD denes work tasks and then distributes these tasks among the
available CPUs, on each computing node. Tasks are dened as interactions between
patches, if a computing node needs data from a patch that does not belong to it,
task will make a copy of the necessary data before being assigned. To speed-up the
computation of non-bonded short-range forces, GPUs are massively used. NAMD
creates smaller tasks and copies the necessary data to GPUs. Then, it launches the
necessary GPU kernels in order to perform the simulation, and nally it copies back
the results to CPU memory. By using this scheme, GPUs are seen as massively
parallel co-processors.
3.2.2 GROMACS
GROMACS [12] performs a spatial partitioning on the molecular system to distribute
it on a multi-core architecture. The system is subdivided into a staggered grid, with
zones for information sharing between nodes. CPUs may use GPUs as co-processors
to speed-up force computations, in a similar way to NAMD. Only non-bonded short
range forces are computed on GPUs, forcing to upload data from CPU to GPU, and
copying back results from GPU to CPU.
3.2.3 ACEMD
ACEMD [11] performs GPU-parallel computation of the various forces in a molec-
ular system, and each type of force is handled on a separate GPU. This approach
exploits on-board multi-GPU architectures, but its scalability is limited because all
communications are handled through the CPU.

Part II
PROBLEM-STATEMENT-AND-
PROPOSAL
Chapter 4
Problem Statement
4.1 A grand challenge problem
The problems presented by molecular simulation systems are included within the
Grand challenge problems [34]. Not only are simulation times important, but they
also need large RAM resources to host the molecular system. An enormous volume of
data is required to simulate a system formed by several millions of atoms, requiring
high performance computing resources during a long time to obtain results.
The Grand challenge problems are solved in supercomputing centers,
which have large amounts of resources at their disposal. A supercomputing cen-
ter usually has thousands of nodes interconnected by a high speed network, enough
to host the data of the simulation. However, algorithms must be adapted, and there
are several drawbacks that must be solved to use the full power of these systems.
The following sections summarize some of he architectures currently avail-
able that are used for molecular dynamics. Also, our solutions for short range and
long range molecular dynamics simulations that make use of novel parallel architec-
tures are introduced.

30 4. Problem Statement
4.2 Novel architectures
The use of GPUs in Computer Science has witnessed a wide range of congurations.
At a high level, a GPU is formed by a massive parallel multiprocessor with its
own memory hierarchy, separated from the CPU. Nowadays GPUs can have several
gigabytes of RAM, so just one GPU can host a large amount of data. Also, if the
mainboard has enough slots, a single computer can host several GPUs interconnected
by a high speed bus, making itself a small hybrid multicomputer node. However
computers that integrate several GPUs usually are very expensive, so only 2 to 4
GPUs are used in most of the single node congurations.
It is easier to have a large number of GPUs in clusters. In recent years,
high performance parallel architectures have raised their performance by integrating
one or two GPUs on each node. Hybrid CPU-GPU systems are available in most of
the supercomputing centers, interconected by high speed networks. These congu-
rations allow scaling the performance of the system, but are severely limited by the
communications between nodes.
In a system with several GPUs, PCI-express 4.0 buses allow up to 31GB/s
communications. However, this is not the case on most of the multi-GPU clusters
available. In small research centers, it is easy to nd small clusters interconected by
Gigabit Ethernet networks, which provides communication speeds up to 125MB/s.
In supercomputing centers, Myrinet and Inniband networks are used, achieving a
maximum throughput of 1.2GB/s and 37GB/s respectively. Despite of the good
throughput of Inniband, only a few supercomputers actually use it, due to its price,
and the common communication speeds found in the supercomputer networks are
close to 6GB/s.
4.2. Novel architectures 31
Several applications are possible only by using these parallel architectures.
For example, the Human Brain Project is a large 10-year scientic research project
which aims to provide a model of the whole brain. To achieve the goals of this
project it is necessary to use supercomputing technologies that enable models and
simulations of brain information to identify patterns and possible deciencies that
can be remedied with further experimentation. For this purpose various platforms
have been developed: CeSViMa (Centro de Supercomputación y Visualización de
Madrid), hosts one of the most powerful supercomputers used for this project.
There have been great achievements in recent years by using high per-
formance hybrid multi-CPU systems for molecular dynamics. In 2013, the NCSA
Blue Waters supercomputer was used to perform the simulation fo the HIV-1 capsid
molecular system [38]. The HIV-1 capsid was formed by 64 million atoms, and 3500
single core nodes equipped with NVIDIA Tesla K20X were used to perform 500ns of
simulation time. In real time, it took around 35 days (14ns/day) to reach results.
However, the exploitation of multi-GPU systems is still under development.
In order to get the maximum benet from those architectures it is necessary to
identify the problems of the current implementations of molecular dynamics. As
stated in the previous paragraphs, the communications between GPUs hosted by
dierent nodes are not as fast as communications in the same motherboard, becoming
the bottleneck of the application. The following sections will describe the problems
covered in this thesis, and will give solutions to make better use of new architectures.
Figure 4.1: HIV1 Capsid, 64 million atoms total including solvent
4.3 Multi-GPU communications bottleneck
As stated in the previous section, section, a priori a multi-GPU system with two or
more graphics processors connected via a fast bus is capable of providing a highly
ecient and scalable solution. However, in most of the current implementations it
is the CPU which hosts most of the program logic, using the GPUs as coprocessors.
With this scheme, if several GPUs are present, the CPU must take care of uploading
an downloading data to each GPU memory, wasting time in communications.
In molecular dynamics, a large amount of data must be shared between
computing nodes in order to maintain simulation consistency. A better scheme must
be developed in order to achieve better simulation times, by reducing GPU-CPU

4.3. Multi-GPU communications bottleneck 33
communications. The next subsections explain the contributions in this area.
4.3.1 Contribution: Direct GPU-GPU communications
In a multi-GPU environment, each GPU can be seen as a complete computing node,
so direct GPU-GPU communication can be used instead of uploading/downloading
data from GPU to CPU. One of the problems of this scheme relies in the fact that
GPUs are massive parallel architectures that run their own code in parallel with the
CPU. Each GPU should be able to identify shared data with other GPUs, package it
and perform data exchange with neighboring GPUs. Those methods should be fast
and scalable, in order to allow better simulation times.
This work tries to nd a solution by using direct GPU-GPU data commu-
nications, focused on scalability and performance. Our rst goal is to parallelize a
state-of-the-art molecular dynamics algorithm by employing a spatial partitioning
approach to simulate the dynamics of one portion of a molecular system on each
GPU. Chapter 5 is focused on parallelizing bonded and short-range non-bonded
forces on an on-board multi-GPU system, providing a novel parallel algorithm to
update the spatial partitioning and set up transfer data packages on each GPU. This
way, better simulation times are achieved, while keeping scalability of the system.
4.3.2 Contribution: Distributed MSM
Long-range non-bonded forces are treated in a dierent way. Current implementa-
tions use PME, which is very fast on single GPUs environments. However, it is not
easily portable to distributed memory systems, due to the large amount data needed
to share between nodes. Chapter 6 proposes a solution to long-range molecular dy-

namics, based on the Multilevel Summation Method (MSM) [15] [8], that can be
used in distributed memory environments.
MSM is better suited for a distributed implementation, but it is notably
slower than PME. Chapter 6 also proposes an optimization for this method by re-
placing 3D convolutions with FFTs, making the performance of MSM on a single
GPU comparable to that of PME. Finally, a distributed-memory implementation is
proposed in order to make use of several GPUs.
4.4 Solutions for memory scalability
In order to simulate larger molecular systems, it is necessary to achieve memory
scalability along with better simulations time. The more GPUs available, the more
subdivisions are made, which take less memory on each node. This way, it is possible
to use a cluster of computers to simulate systems that can not be simulated on a
single node.
The contributions presented in Chapter 5 and Chapter 6 achieve good time
speed-ups but store translation tables that grow with the size of the molecular system,
and are not scalable in memory. Chapter 7 proposes a solution by using GPU hash
tables instead of static memory arrays, saving memory while keeping the scalability
of the system. Also, a version for distributed multi-GPU cluster systems is proposed,
that makes use of direct GPU-GPU communications.

Chapter 5
On-Board Multi-GPU
Short-Range Force Computation
This chapter presents a parallel algorithm for the solution of short-range molecular
dynamics on on-board multi-GPU architectures. Previous works make use of GPUs
as coprocessors, CPUs are used as the primary processor, keeping data on the host's
RAM. Data is copied to the GPU when a kernel is launched, and results are copied
back. This solution presents some scalability problems due to the amount of in-
formation shared between GPU and CPU. The aim of this chapter is to present an
alternative by using the GPUs as independent computing nodes, reducing CPU-GPU
and GPU-GPU communications.
This chapter is focused on bonded and non-bonded short range force com-
putations. The state-of-the-art molecular dynamics algorithm (Chapter 2.1) is par-
allelized at two levels: First, a spatial partitioning is performed to simulate the
dynamics of one portion of a molecular system on each GPU, and we take advantage
of direct communication between GPUs to transfer necessary data among portions.
Second, we parallelize the simulation algorithm to exploit the multi-processor com-
puting model of GPUs.
Section 7.1.2 presents a novel parallel algorithm to update the spatial par-
titioning and set up transfer data packages on each GPU. The molecular dynamics
36 5. On-Board Multi-GPU Short-Range Force Computation
simulations are parallelized at two levels. At the high level, we present a spatial
partitioning approach to assign one portion of a molecular system to each GPU. At
the low level, we parallelize on each GPU the simulation of its corresponding portion.
Most notably, we present algorithms for the massively parallel update of the spatial
partitions and for the setup of data packages to be transferred to other GPUs.
5.1 Algorithm Overview
In contrast to previous parallel molecular dynamics algorithms, we propose a two-
level algorithm that partitions the molecular system, and each GPU handles in a
parallel manner the computation and update of its corresponding portion, as well as
the communications with other GPUs.
To solve the dynamics, we use a generic Verlet/Respa MTS integrator as
described in section 2.5.1. In our examples, we have used a time step ∆t = 2 fs
for short-range non-bonded forces, and we update bonded forces nStepsBF = 2

times per time step. We accelerate short-range non-bonded forces using the cell-list
method, with a grid resolution of Rc /2. The cell-list data structure can be updated
and visited eciently on a GPU using the methods in [35].
We partition the molecule using grid-aligned planes, thus minimizing the
width of interfaces and simplifying the update of partitions. We partition the sim-
ulation domain only once at the beginning of the simulation, and then update the
partitions by transferring atoms that cross borders. We have tested two partitioning
approaches with dierent advantages:
• Binary partition (Figure 5.1a ): we recursively halve molecule portions using

5.1. Algorithm Overview 37
(a) Binary Partition (b) Linear Partition
Figure 5.1: Comparison of binary (a) vs. linear spatial partitioning (b). The
striped regions represent the periodicity of the simulation volume.
(a) Cells at the Interface
Figure 5.2: The dierent types of cells at the interface between two portions of
the simulation volume.
planes orthogonal to their largest dimension. Each portion may have up to 26

neighbors in 3D.
• Linear partition (Figure 5.1b ): we divide the molecular system into regular
portions using planes orthogonal to the largest dimension of the full simulation
volume. With this method, each portion has only 2 neighbors, but the inter-
faces are larger; therefore, it trades fewer communication messages for more
expensive partition updates.

Based on our cell-based partition strategy, each GPU contains three types
of cells as shown in Figure 5.2:
• Private cells that are exclusively assigned to one GPU.
• Shared cells that contain atoms updated by a certain GPU, but whose data
needs to be shared with neighboring portions.
• Interface cells that contain atoms owned by another GPU, and used for force
computations in the given GPU.
Algorithm 1 shows the pseudo-code of our proposed multi-GPU MTS inte-
grator, highlighting in blue with a star the dierences w.r.t. a single-GPU version.
These dierences can be grouped in two tasks: update partitions and synchronize dy-
namics of neighboring portions. Once every ten time steps, we update the partitions
in two steps.
1. Identify atoms that need to be updated, i.e., atoms that enter shared cells of
a new portion.
2. Transfer the positions and velocities of these atoms.
To synchronize dynamics, we transfer forces of all shared atoms, and then each GPU
integrates the velocities and positions of its private and shared atoms, but also its
interface atoms. Again, we carry out the synchronization in two steps.
1. Identify the complete set of shared atoms after updating the cell-list data struc-
ture.
2. Transfer the forces of shared atoms as soon as they are computed.

5.1. Algorithm Overview 39
Algorithm 3 Multi-GPU Verlet/r-Respa MTS integrator. The modications w.r.t.

the single-GPU version are highlighted in blue with a star.
1: procedure Step(currentStep)
2: if currentStep mod 10 = 0 then
3: ∗ identif yU pdateAtomIds()
4: ∗ transf erU pdateP ositionsAndV elocities()
5: updateCellList()
6: ∗ identif ySharedAtomIds()
7: end if
8: integrateT emporaryP osition(0.5 · ∆t)
9: computeShortRangeF orces()
10: ∗ transf erSharedShortRangeF orces()
11: for nStepsBF do
12: integrateP osition(0.5 · ∆t/nStepsBF )
13: computeBondedF orces()
14: ∗ transf erSharedBondedF orces()
15: integrateKickV elocity(∆t/nStepsBF )
16: integrateP osition(0.5 · ∆t/nStepsBF )
17: end for
18: currentStep = currentStep + 1
19: end procedure
5.2 Parallel Partition Update and Synchronization
As outlined above, each GPU stores one portion of the complete molecular system
and simulates this subsystem using standard parallel algorithms [35]. In this section,
we describe massively parallel algorithms to update the partitions and to transfer
interface forces to ensure proper synchronization of dynamics between subsystems.
We propose algorithms that separate the identication of atoms whose data needs
to be transferred from the setup of the transfer packages. In this way, we can
reuse data structures and algorithms both in partition updates and force transfers.
Data transfers are issued directly between GPUs, thereby minimizing communication
overheads.
5.2.1 Data Structures
The basic molecular dynamics algorithm stores atom data in two arrays:
• staticAtomData corresponds to data that does not change during the simula-
tion, such as atom type, bonds, electrostatic and mechanical coecients, etc.
It is sorted according to static atom indices.
• dynamicAtomData that contains position and velocity, a force accumulator,
and the atom's cell. It is sorted according to the cell-list structure, and all
atoms in the same cell appear in consecutive order.
Both arrays store the identiers of the corresponding data in the other array to
resolve indirections. Each GPU stores a copy of the staticAtomData of the whole
molecule, and keeps dynamicAtomData for its private, shared, and interface cells.
The dynamicAtomData is resorted in each call to the updateCellList procedure, and

5.2. Parallel Partition Update and Synchronization 41
the atom identiers are accordingly reset. Atoms that move out of a GPU's portion
are simply discarded.
In our multi-GPU algorithm, we extend the dynamicAtomData, and store
for each atom a list of neighbor portions that it is shared with. We also dene two
additional arrays on each GPU:
• cellNeighbors is a static array that stores, for each cell, a list of neighbor por-
tions.
• transferIDs is a helper data structure that stores pairs of neighbor identiers
and dynamic atom identiers. This data structure is set during atom identi-
cation procedures, and it is used for the creation of the transfer packages.
5.2.2 Identication of Transfer Data
Each GPU contains a transf erIDs data structure of size nN eighbors · nAtoms,
where nN eighbors is the number of neighbor portions, and nAtoms is the number
of atoms in its corresponding portion. This data structure is set at two stages of the
MTS Algorithm 3, identif yU pdateAtomIds and identif ySharedAtomIds. In both
cases, we initialize the neighbor identier in the transf erIDs data structure to the
maximum unsigned integer value. Then, we visit all atoms in parallel in one CUDA
kernel, and ag the (atom, neighbor) pairs that actually need to be transferred. We
store one ag per neighbor and atom to avoid collisions at write operations. Finally,
we sort the transf erIDs data structure according to the neighbor identier, and the
(atom, neighbor) pairs that were agged are considered as valid and are automatically
located at the beginning of the array. We have used the highly ecient GPU-based
Merge-Sort implementation in the NVidia SDK 4.5 [24] (5.3ms to sort an unsorted
array with one million values on a NVidia GeForce GTX580).
Algorithm 4 Identication of atoms whose data needs to be transferred, along with

their target neighbor.
1: procedure IdentifyTransferAtomIds(transf erIDs)

2: for atomID in atoms do
3: for neighborID in cellN eighbors(dynamicAtomData[atomId].cellID)
do
4: if M ustT ransf erData(atomID, neighborID) then
5: of f set = neighborID · nAtoms + atomID
6: transf erIDs[of f set].atomID = atomID
7: transf erIDs[of f set].neighborID = neighborID
8: end if
9: end for
10: end for
11: Sort(transf erIDs, neighborID)
12: end procedure
Algorithm 4 shows the general pseudo-code for the identication of transfer
data. The actual implementation of the M ustT ransf erData procedure depends
on the actual data to be transferred. For partition updates, an atom needs to be
transferred to a certain neighbor portion if it is not yet present in its list of neighbors.
For force synchronization, an atom needs to be transferred to a certain neighbor
portion if it is included in its list of neighbors. In practice, we also update the list
of neighbors of every atom as part of the identif yU pdateAtomIds procedure.
5.2.3 Data Transfer
For data transfers, we set in each GPU a buer containing the output data and the
static atom identiers. To set the buer, we visit all valid entries of the transf erIDs
array in parallel in one CUDA kernel, and fetch the transfer data using the dynamic
atom identier. The particular transfer data may consist of forces or positions and
5.2. Parallel Partition Update and Synchronization 43
velocities, depending on the specic step in the MTS Algorithm 3.
Transfer data for all neighbor GPUs is stored in one unique buer; therefore,
we set an additional array with begin and end indices for each neighbor's chunk.
This small array is copied to the CPU, and the CPU invokes one asynchronous
copy function to transfer data between each GPU and one of its neighbors. We use
NVidia's driver for unied memory access (Unied Virtual Addressing, UVA) [30] to
perform direct memory copy operations between GPUs.
Upon reception of positions and velocities during the update of the parti-
tions, each GPU appends new entries of dynamicAtomData at the end of the array.
These entries will be automatically sorted as part of the update of the cell-list. Upon
reception of forces during force synchronization, each GPU writes the force values
to the force accumulator in the dynamicAtomData. The received data contains the
target atoms' static identiers, which are used to indirectly access their dynamic
identiers.
Figure 5.3: PCIe conguration of our testbed.

5.3 Short Range On-Board Multi-GPU Evaluation
This section demonstrates our approach on a multi-GPU on-board architecture, us-
ing PCIe for direct GPU-GPU communication. We show speed-ups and improved
scalability over NAMD, a state-of-the-art multi-CPU-GPU simulation algorithm that
uses GPUs as co-processors.
In order to validate our proposal, we carried out our experiments on a ma-
chine outtted with Ubuntu GNU/Linux 10.04, two Intel Xeon Quad Core 2.40GHz
CPUs with hyperthreading, 32 GB of RAM and four NVidia GTX580 GPUs con-
nected to PCIe 2.0 slots in an Intel 5520 IOH Chipset of a Tyan S7025 motherboard.
The system's PCIe 2.0 bus bandwidth for peer-to-peer throughputs via IOH chip was
9GB/c full duplex, and 3.9 GB/s for GPUs on dierent IOHs [17]. The IOH does
not support non-contiguous byte enables from PCI Express for remote peer-to-peer
MMIO transactions [14]. The complete deployment of our testbed architecture is
depicted in Figure 5.3. Direct GPU-GPU communication can be performed only for
GPUs connected to the same IOH. For GPUs connected through QPI, the driver
performs the communication using CPU RAM [17].
Given our testbed architecture, we have tested the scalability of our pro-
posal by measuring computation and transmission times for 1, 2, and 4 partitions
running on dierent GPUs. We have estimated scalability further by estimating
transmission times for 8 and 16 partitions using the bandwidth obtained with 4
GPUs and the actual data size of 8 and 16 partitions respectively.
We have used three molecular systems as benchmarks (see Figure 6.2):
• ApoA1 (92,224 atoms) is a well known high density lipoprotein (HDL) in hu-
5.3. Short Range On-Board Multi-GPU Evaluation 45
man plasma. It is often used in performance tests with NAMD.
• C206 (256,436 atoms) is a complex system formed by a protein, a ligand and
a membrane. It presents load balancing challenges for molecular dynamics
simulations.
• 400K (399,150 atoms) is a well-balanced system of 133,050 molecules of water
designed synthetically for simulation purposes.
All our test simulations were executed using MTS Algorithm 3, with a time
step of 2 fs for short-range non-bonded forces, and 1 fs (nStepsBF = 2) for bonded
forces. In all our tests, we measured averaged statistics for 2000 simulation steps,
i.e., a total duration of 4ps (4 · 10−12 s).
5.3.1 Comparison of Partition Strategies
To evaluate our two partition strategies described in Section 7.1.2, we have compared
their performance on the C206 molecule. We have selected C206 due to its higher
complexity and data size. Figure 5.5a indicates that, as expected, the percentage
of interface cells grows faster for the linear partition. Note that with 2 partitions
the size of the interface is identical with both strategies because the partitions are
actually the same. With 16 partitions, all cells become interface cells for the linear
partition strategy, showing the limited scalability of this approach. Figure 5.5b shows
that, on the other hand, the linear partition strategy exhibits a higher transmission
bandwidth. Again, this result was expected, as the number of neighbor partitions is
smaller with this strategy.
All in all, Figure 5.5c compares the actual simulation time for both partition
strategies. This time includes the transmission time plus the computation time
of the slowest partition. For the C206 benchmark, the binary partition strategy
exhibits better scalability, and the reason is that the linear strategy suers a high
load imbalance, as depicted by the plot of standard deviation across GPUs.
Figure 5.6 shows how the total simulation time is split between computa-
tion and transmission times for the binary partition strategy. Note again that the
transmission times for 8 and 16 partitions are estimated, not measured. Up to 4

partitions, the cost is dominated by computations, and this explains the improved
performance with the binary strategy despite its worse bandwidth.
The optimal choice of partition strategy appears to be dependent on the
underlying architecture, but also on the specic molecule, its size, and its spatial
atom distribution.
5.3.2 Scalability Analysis and Comparison with NAMD
Figure (a) shows the total speedup for the three benchmark molecules using our
proposal (with a binary partition strategy). Note again that speedups for 8 and 16
GPUs, shown in dotted lines, are estimated based on the bandwidth with 4 GPUs.
The results show that the implementation makes the most out of the molecule's size
by sharing the workload among dierent GPUs. The speedup of APOA1 is lower be-
cause it is the smallest molecule and the simulation is soon limited by communication
times.
Figure 5.7b evaluates our combined results in comparison with a well-known
parallel molecular dynamics implementation, NAMD. Performance is measured in
terms of the nanoseconds that can be simulated in one day. The three benchmark
molecules were simulated on NAMD using the same settings as on our implementa-
tion. Recall that NAMD distributes work tasks among CPU cores and uses GPUs as
co-processors, in contrast to our fully GPU-based approach. We could not estimate
performance for NAMD with 8 and 16 GPUs, as we could not separate computa-
tion and transmission times. All in all, the results show that our proposal clearly
outperforms NAMD for all molecules by a factor of approximately 4×.
In terms of memory scalability, our approach suers the limitation that
each partition stores static data for the full molecule.This limitation is addressed
in Chapter 7. From our measurements, the static data occupies on average 78MB
for 100K atoms, which means that modern GPUs with 2GB of RAM could store
molecules with up to 2.5 million atoms. In the dynamic data, there are additional
memory overheads due to the storage of interface cells and sorting lists, but these
data structures become smaller as the number of partitions grows. In addition,
interface cells grow at a lower rate than private cells as the size of the molecule
grows.
(a) ApoA1 in water (b) C206 in water
(c) 400K
Figure 5.4: Benchmark molecules.

(a) Interface size (b) Bandwidth
(c) Sim. times (2000 steps)
Figure 5.5: Performance comparison of binary and linear partition strategies on

C206.
Figure 5.6: Running time (2000 steps) for the binary partition strategy on C206.
(a) Speedup (b) Comparison vs. NAMD
Figure 5.7: Scalability (a) and performance comparison with NAMD (b), mea-
sured in terms of simulated nanoseconds per day.
Chapter 6
On-Board Multi-GPU Long-Range
Force Computation
This chapter presents a parallel and scalable solution to compute long-range molec-
ular forces, based on the multilevel summation method (MSM). As shown in the
previous chapter, making use of several GPUs as independent computing nodes al-
lows us to perform faster simulations, reducing latency due to data transfers.
The objective in this chapter is to achieve a similar scalability on long-
range forces computations by using several GPUs. The MSM algorithm oers good
features to be used in a multi-GPU distributed environment, despite being slower
than PME. An optimization of MSM that replaces 3D convolutions with FFTs is
presented in this chapter, achieving a single-GPU performance comparable to the
PME method, the de facto standard for long-range molecular force computation.
But most importantly, we propose a distributed MSM that avoids the scalability
diculties of PME.
Our distributed solution is based on a spatial partitioning of the MSM
multilevel grid, together with massively parallel algorithms for interface update and
synchronization. The last section of this chapter shows the scalability of our approach
on an on-board multi-GPU platform.

52 6. On-Board Multi-GPU Long-Range Force Computation
6.1 Optimized MSM
Our method runs the whole simulation on an on-board multi-GPU architecture by
allocating a portion of the system to each GPU and using a boundary interface to
communicate updates directly between portions.
Algorithm 5 highlights the dierences between our distributed MSM and
the original algorithm. See also [9] for a thorough description of the method. Note
that the direct sums are independent of each other, and the direct sum on a certain
level and the restriction to the coarser level can be executed asynchronously.
6.1.1 FFT-Based Sums
To perform the direct sum part on each level, the original MSM applies a 3D convo-
lution over all grid points using a kernel with 2 b2 Rc /hc+1 points in each dimension.
However, Hardy [9] shows that the direct sum is the most computationally expensive
part. We substitute this convolution with a product in frequency domain. Speci-
cally, we compute grid potentials in three steps:
1. Forward FFT of the grids of charges and kernel weights.
2. Complex point-wise product of the two resulting vectors
3. Inverse FFT to obtain the potentials.
The grids of charges and kernel weights should have identical dimensions; therefore,
we extend the kernel. Note that the kernel is constant, hence we only compute its
FFT once per level as a preprocess.

6.1. Optimized MSM 53
Even though the FFT has O(N log N ) complexity as opposed to O(N )
complexity of the convolution, in practice large kernels yield a steep linear complexity
for the convolution approach. For very large molecules, the log N factor of the FFT
would dominate, but with our distributed MSM presented next in Section 6.2, FFTs
are computed on each partition separately, hence N is bounded. We have compared
the performance of ecient GPU implementations of massively parallel MSM using
the convolution and FFT approaches, and the FFT approach enjoys a speed-up of
almost 10×. Table 6.1 shows timing comparisons for two molecular systems. The
examples were executed on an Intel Core i7 CPU 860 at 2.80GHz with a NVIDIA
GTX Titan GPU and CUDA Toolkit 5.5. FFTs were computed using NVIDIA's
highly ecient cuFFT library [22].
The cuto distance Rc has a great impact on both error and performance.
Error is lower for higher cutos, and this can be observed from the fact that a larger
cuto distance increases the kernel size as well. For our performance analysis, we used
a cuto radius of 9.0 Å, which is a standard value for molecular dynamics simulations.
Assuming a xed grid size, the resolution of the grid h, which is automatically
set for each level and each axis, determines the overall performance and accuracy.
Smaller values of h for the same number of levels imply higher accuracy, but this also
translates into a larger kernel size 2 b2 Rc /h0 c+1, hence adding to the computational
cost. The table shows the grid resolution on each axis (in Å), as well as the kernel
size.
Table 6.1 also compares the performance of MSM and PME under the
same grid resolutions. We implemented an ecient GPU version of the Smooth
PME (SPME) algorithm [7], following the optimizations described by Harvey and
De Fabrities[10]. We also implemented the previously mentioned GPU version of
the MSM algorithm proposed by Hardy [9]. With our FFT-based optimization, the
#Atoms hx,y,z Kernel size tM SM tM SMF F T tP M E

256,436 {1.88,1.87,2.65} 9x9x6 31.901 4.79 5.095
90,849 {1.56,1.56,1.56} 11x11x11 43.694 5.09 2.22
Table 6.1: Performance comparison for long-range force computation on two

molecular systems, using regular MSM with 3D convolution, our optimized MSM
based on FFTs, and PME. Timings correspond to one simulation step and are given
in ms. All cases were executed using a 64 × 64 × 64 grid.
performance of MSM becomes comparable to that of PME.
6.2 Distributed MSM
We propose a distributed MSM (DMSM) that partitions a molecular system and
the multilevel grid of MSM among multiple GPUs. As a computing element, each
GPU handles in a parallel manner the computation and update of its corresponding
portion of the molecular system, as well as the communications with other GPUs.
In this section, we rst describe the partition of the molecular system, then the
handling of periodic boundary conditions across all MSM levels, and nally our
parallel algorithms for interface update and synchronization.
6.2.1 Multigrid Partitions
Following the observations drawn in [20] for short-range molecular forces, we partition
a molecular system linearly along its longest axis, as this approach reduces the cost
to communicate data between partitions. Then, for DMSM, we partition each level
of the MSM grid into regular portions using planes orthogonal to the longest axis.
Each GPU device stores a portion of the grid at each level, including two types of
grid points: i) interior grid points owned by the GPU itself. ii) interface grid points
6.2. Distributed MSM 55
owned by neighboring GPUs.
The size of the interface corresponds to the half-width of the convolution
kernel, i.e., b2 Rc /hc points to the left and right of the interior ones, as shown in
Figure 6.1. The interface stores replicas of the grid points of neighboring partitions,
which are arranged in device memory just like interior points, to allow seamless
data access. The interface is used both to provide access to charges of neighboring
partitions and to store partial potentials corresponding to those same partitions.
Note that, due to the use of a linear partitioning strategy, the neighboring nodes
along the shorter directions are the result of periodic boundary conditions, and they
do not need to be stored as interface points as they are readily available as interior
points.
The partitions are made only once at the beginning of the simulation. At
runtime, interface values need to be communicated when needed as part of restriction,
direct sum of potentials, and prolongation.
6.2.2 Periodic Boundary Conditions on Multiple GPUs
As outlined in Section 2.1, molecular dynamics are performed on innite systems
formed by replicating periodically images of the molecular system under study along
all three spatial directions [6]. Periodic replication is also applied to the MSM grid;
therefore, on the boundary of the molecular system interfaces represent images of
grid points on the opposite sides, as shown in Figure 6.1.
In higher levels of the multilevel grid, where the total number of grid points
along the longest axis is similar to the convolution kernel size, periodic boundaries
complicate the management of interface points. Two main complications may occur,
shown in Figure 6.1: the same point may map to two or more interface points, and
even interior points may map to interface points. To deal with interface handling,
each GPU device stores the following data on each level:
• Begin and end indices of neighbor partitions, to know what part of the interface
belongs to each GPU device.
• Periodic begin and end indices of the interfaces of neighbor partitions, to know
what interior points constitute interfaces for other GPU devices.
Since the multilevel grid is static during the simulation, the auxiliary indices
of neighbor partitions are created and shared between GPUs once as a preprocessing
Figure 6.1: Partition of the multilevel grid under periodic boundaries. Left: All
grid points on each level, distributed into 3 GPU devices. Right: Data structure
of GPU device 0 (blue) on all levels, showing: its interior grid points, interface
points for an interface of size 3, and buers to communicate partial sums to other
devices. Interface points due to periodic boundary conditions are shown striped.
Arrows indicate sums of interface values to the output buers. With interfaces
of size 3, in levels 1 and 2 several interface points contribute to the same buer
location, and in level 2 there are even interior points that map to interface points.
6.2. Distributed MSM 57
step. Once each GPU knows the indices of its neighbors, it creates the incoming and
outgoing data buers to share interface data, and sets static mappings that allow
ecient read/write operations with these buers as shown in Figure 6.1.
6.2.3 Parallel Update and Synchronization of Interfaces
Algorithm 5 DMSM method main loop.

1: procedure computeDMSM
2: n = nlevels
3: q 0 ← Anterpolation()
4: ∗ accumulateInteriorCopies(q 0 )
5: ∗ updateInterf aces(q 0 )
6: for i = 0 . . . n − 2 do
7: V i ← DirectSum(q i )
8: q i+1 ← Restriction(q i )
9: ∗ updateInterf aces(q i+1 )
10: end for
11: V n−1 ← DirectSum(q n−1 )
12: ∗ accumulateInteriorCopies(V n−1 )
13: ∗ updateInterf aces(V n−1 )
14: for i = n − 2 . . . 0 do
15: V i ← P rolongation(V i+1 )
16: ∗ accumulateInteriorCopies(V i )
17: ∗ updateInterf aces(V i )
18: end for
19: Interpolation(V 0 )
20: end procedure
Our DMSM algorithm needs to update and synchronize interfaces at multi-
ple stages of the original MSM algorithm. There are two synchronization operations:
1. accumulateInteriorCopies: In the charge anterpolation, the coarsest direct
sum and prolongation steps, values are accumulated onto the interface grid
points in each GPU device. These interface points are local copies of interior
points of other GPUs, hence the values stored on interface points need to be
accumulated onto their true owners. This operation is executed in 3 steps.
First, the values from the interface points are accumulated into the output
buers. Second, the buers are transferred to their destination GPUs. And
third, the receiver GPUs accumulate the incoming values into their interior
grid points. Thanks to the preprocessing of mappings described previously,
the accumulation to the output buers is executed eciently in a massively
parallel manner on each GPU. Periodic boundary conditions are also handled
eciently, and the accumulation of multiple copies of the same point is dealt
with during the accumulation to output buers, prior to data transfer.
2. updateInterf aces: Once interior grid values are set, it may be necessary to
update their copies in other GPUs, i.e., the interface grid points of other GPUs.
Data is transferred between pairs of GPUs directly. This step is necessary after
charge anterpolation, after restriction, after the direct sum of potentials, and
after prolongation.
Algorithm 5 shows our DMSM algorithm, highlighting in blue and with
a star the steps that augment the original MSM algorithm. We distinguish
charge values q from potential values V, which are used as arguments of the
accumulateInteriorCopies and updateInterf aces procedures when appropriate.
Superscripts indicate grid levels. With our DMSM algorithm, all operations to set
up, transfer, and collect data packages are highly parallelized, thus minimizing the
cost of communications and maximizing scalability.

6.3. Evaluation for On-Board Multi-GPU MSM 59
6.3 Evaluation for On-Board Multi-GPU MSM
This section analyzes the scalability of our proposal presented in the previous section.
We carried out our experiments on a machine outtted with Ubuntu GNU/Linux
Precise Pangolin 12.04, two Intel Xeon Quad Core 2.40GHz CPUs with hyperthread-
ing, 32 GB of RAM and four NVidia GTX580 GPUs connected to PCIe 2.0 slots in
an Intel 5520 IOH Chipset of a Tyan S7025 motherboard.
Given our testbed architecture, we have tested the scalability of our pro-
posal by measuring computation and transmission times for 1, 2, and 4 partitions
running on dierent GPUs. We have used three molecular systems as benchmarks
(see Figure 6.2), all three with a large number of atoms:
• 400K (399,150 atoms) is a well-balanced system of 133,050 molecules of water
designed synthetically.
• 1VT4 (645,933, atoms) is a multi-molecular holoenzyme complex assembled
around the adaptor protein dApaf-1/DARK/HAC-1.
• 2x1VT4 (1,256,718 atoms) is a complex system formed by two 1VT4 molecules.
6.3.1 Scalability Analysis
Figure 6.3 shows the speedup and running times for the three molecules using our
proposal with the settings shown in Table 6.4a. Note that running times have been
measured using a GTX580 GPU, being aected by NVidia's CUDA AtomicAdd()
operation, whose implementation depends on the hardware architecture. We also
show the results obtained with the CPU implementation of PME in NAMD, one
(a) 400K (b) 1VT4 in water
(c) 2x1VT4 in water
of the most used tools for molecular dynamics, as a baseline for comparison. The
results show that our method benets from larger molecules. The reason is that
anterpolation, whose workload is easier to share among GPUs, dominates the cost
of updates in this case.

6.3. Evaluation for On-Board Multi-GPU MSM 61
The scalability of the system is limited because of interface updates between
GPUs. Figure 6.4b shows the data transfers between GPUs to update their interfaces
for the 2x1VT4 molecule for a single step of DMSM. We have selected 2x1VT4 due
to its higher complexity and data size, with more than 1.2 Million atoms. The gure
indicates that, as expected, the data size of interface cells grows linearly, since each
new partition adds a constant data transfer that depends on the grid resolution h
and its corresponding interface size. Furthermore, the average data size transfered
per GPU is similar to the data needed in a single-GPU implementation in order to
account for periodic boundary conditions, as shown in Figure 6.4b.
Finally, Figure 6.4c shows how the total simulation time is split between
computation and interface updates for the 2x1VT4 molecule, to analyze the impor-
tance of the transferred data size. With up to 4 partitions, the cost is dominated by
computations, with interface transfers adding up to only a low percentage. In this
way, the speedup grows almost linearly with each additional GPU. All in all, the
results show that our proposal presents very good scalability in on-board multi-GPU
Figure 6.3: Running time and speedup

platforms.
Molecule hx,y,z
400K {2.57,2.57,2.57}
1VT4 {1.86,1.86,0.93}
2x1VT4 {1.89,1.87,1.78}
(a) Evaluation Settings
(b) Interface size (2x1VT4)
(c) Simulation cost (2x1VT4)
Figure 6.4: Scalability Analysis.

Chapter 7
Distributed Multi-GPU Molecular
Dynamics
This chapter presents a parallel and scalable solution to compute bonded and non-
bonded molecular forces to make use of distributed high performance multi-GPU
environments. As shown earlier in Chapter 5 and Chapter 6, several attached GPUs
hosted in on-board systems can be used as independent computing nodes to increase
performance. For example, TYAN FT72B7015 barebones may host up to 8 NVidia
GPUs in a single mainboard, becoming one of the highest-performance on-board
multi-GPU solutions. However, scalability is limited by the number of GPUs that can
be connected, which is currently limited to 4-8 GPUS. Therefore, the objective of this
chapter is instead to design a distributed high performance multi-GPU environment
where several nodes with GPUs can collaborate to solve the problem.
The main limitation of the methods presented in Chapter 5 is the lack of
memory scalability. Every node has to keep a complete copy of the molecule in
memory, limiting the maximum size of the molecule to simulate. Additionally, this
prevents the use of low-end GPUs with a small amount of GPU global memory.
This chapter presents new algorithms to overcome these limitations. Section 7.1
presents the elements required to perform a complete division of the system keeping
data coherency. To do this, new unique global identiers for atoms and bonds are
64 7. Distributed Multi-GPU Molecular Dynamics
generated.
Section 7.1.1 explains our method to partition the molecular system, where
each GPU maintains only a small part of the whole molecule. Section 7.1.2 explains
how data is updated by interchanging atoms and bonds between neighboring GPUs.
Section 7.2 demonstrates the scalability of our approach on a multi-GPU cluster
environment.
7.1 Algorithm
We propose a collaborative scheme to perform the simulation on a distributed envi-
ronment, such as a cluster composed of several nodes with GPUs. The main objective
is to avoid storing a complete copy of the molecule on each node's memory, thus al-
lowing memory scalability. Our solution acts at two dierent stages: initialization
of the simulation and runtime execution of the simulation. Next, we summarize the
processes designed for these two stages.
• SystemLoader. This process distributes the molecular system among comput-
ing nodes, ensuring that each GPU receives only one portion of the molecule.
It is in charge of reading the molecular system and performing the data par-
tition. This requires assigning identiers to every GPU available, selecting in
a balanced way which part of the molecule goes to each one. Additionally, it
creates a neighborhood table for each GPU, establishing global atom and bond
IDs required for shared data and partition updates.
• Integrators. These processes are responsible of running the molecular dy-
namics within each partition. Each integrator independently performs the

7.1. Algorithm 65
simulation, updating and synchronizing the data partition with its neighbors.
At the beginning of the simulation, the SystemLoader reads the whole
molecular system and generates the list of Integrators (one for each GPU) that will
perform the simulation. Each Integrator receives a single partition, as it is described
in Chapter 5 (see Fig. 5.1 and Fig. 5.2), along with a list of neighbors to exchange
updates of their shared areas. After distribution, the SystemLoader is idle most of
the time, but it is also responsible for collecting partial simulation results from the
Integrators, merging them, and saving them to disk.
The following sections describe the methods used to make the system par-
tition and updates between integrator nodes.
7.1.1 System partition
As shown earlier in Chapter 5 and Chapter 6, each partition of the molecular system
is itself divided into three dierent sections: shared, unshared and interf ace data.
As shown in Section 5.2, there are data sets that do not change during the simulation,
called staticAtomData. In the solution proposed in Chapter 5, this static data is
replicated on all GPUs, being considered as "global data". To improve memory
accesses, this data is kept separate from dynamicAtomData, which is private to
each GPU. Thus, each atom holds two types of identiers:
• staticAtomDataID. Identier assigned after loading system data. It references
the global position of an atom or bond data.
• dynamicAtomDataID. Identier assigned after performing the CellList par-
tition. It is updated dynamically during the simulation. It references the local

data position within the partition assigned to each GPU.
In Chapter 5, staticAtomDataID arrays are replicated on all GPUs since
the atom migration is very quick, because only the dynamic part of the data must
be sent. However, the data staticAtomDataID stored on each partition does not
decrease with the number of partitions, limiting memory scalability and therefore
the molecule datasize.
In this chapter, we propose a new algorithm that maintains, for each parti-
tion, a copy of staticAtomData only for the atoms that reside within the partition.
dynamicAtomDataID and staticAtomDatID identiers persist as they accelerate
computations within each GPU. However, our algorithm introduces a new global
identier that enables the migration of static data between GPUs when atoms leave
their assigned partitions. As a result, three types of identiers are used:
• globalDataID: Global identier used to identify the same element (atom or
bond) between dierent GPUs
• localStaticDataID: Identier assigned by each GPU to identify local static
data within its partition.
• localDynamicDataID: Identier assigned by each GPU to identify local dy-
namic data after performing local CellList updates.
localStaticDataID and localDynamicDataID are dened by an integer
descriptor that references their position within the array of data contained in the
GPU RAM. On the other hand, globalDataID stores information about the GPU
node that owns the data, as well as the neighboring nodes that share a copy. There-
fore, globalDataID has the following tuple:

7.1. Algorithm 67
• OwnerGP U : The identication of the GPU that owns it.
• SharedGP U s: A list with the neighboring GPUs that share the atom or bond.
• DataID: Unique identier, assigned by the Systemloader after loading the
molecular system.
Finally, partitions follow the patterns seen in Chapter 5 to identify the
data belonging to each GPU. Atoms are assigned to each partition based on their
3D position. Bonds composed of two or more atoms use the midpoint method [2],
based on the positions of all atoms that form the bond..
7.1.2 Updates
Algorithm 3 in Section presented the integration method used in previous implemen-
tations. That method needs updated forces just before integrating positions, forcing
each type of force to be transmitted separately after computing it. To save commu-
nication times, the integrator was changed to a Velocity Verlet version, which only
needs updated positions before computing forces. Algorithm 6 shows the distributed
implementation with communication methods highlighted in blue with a star.
There are two points to update data:
• Partition/CellList updates. CellList structures must be updated every 10 steps.
This includes both partitions and pre-calculated shared data information. The
atoms that migrate across partitions are sent along with their bond information,
to have the information needed to rebuild the system. When the GPU detects
that all atoms that form a bond have left the boundaries of the partition, that
bond is deleted from the partition to save space.

Algorithm 6 Multi-GPU Velocity Verlet/r-Respa single step integrator. Data trans-

fers are highlighted in blue with a star.
1: procedure Step(currentStep)
2: integrateV elocity(0.5 · ∆t)
3: integrateP osition(∆t)
4: if currentStep mod 10 = 0 then
5: ∗ identif yU pdateAtomIds()
6: ∗ transf erU pdateP ositionsAndV elocities()
7: updateCellList()
8: ∗ identif ySharedAtomIds()
9: else∗ transf erSharedAtomP ositions()
10: end if
11: computeAllF orces()
12: integrateV elocity(0.5 · ∆t)
13: currentStep = currentStep + 1
14: end procedure
• Shared data updates. Dynamic data of all shared atoms has to be updated
on each step before continuing the simulation. In this case, dynamic data is
dened by the 3D positions of atoms.
The methods for data communication and partition update are similar to
those explained in Section 5.2. Data is sent along with its associated identier on
every step to update it. However, each GPU computes the molecular forces using
local identiers, which must be translated before being sent. IDs are translated us-
ing two tables: localIDT oGlobalID before sending, and globalIDT oLocalID after
receiving data. A naive approach would be to use arrays as translation tables. The
identiers would be stored in the position indicated by their ID, i.e., GlobalID =
localIDT oGlobalID[localId] and localId = globalIDT olocalID[GlobalID]. Al-
though this method is very fast, the space needed for globalIDT olocalID arrays
should include the atoms that are not in the partition, preventing memory size to
decrease with the number of partitions. To avoid memory scalability problems, a

7.1. Algorithm 69
Figure 7.1: Communication scheme from GPU A to GPU B. Each GPU hosts
a small portion of the system, referencing the data by local IDs. Local IDs are
translated to global data IDs and sent to the second GPU. After data reception, a
translation to local IDs is performed.
GPU hash table is used instead, storing only the global keys needed on each GPU.
Figure 7.1 illustrates the communication scheme. To accelerate the calcula-
tions, identif ySharedAtomIds() method shown in Algorithm 6 precomputes some
translation tables that will be used in the next 10 steps.

7.2 Distributed Multi-GPU Molecular Dynamics evalu-
ation
This section analyzes the implementation of our approach on a distributed high
performance multi-GPU environment, showing its speedup and scalability. We have
performed the evaluation tests in a cluster composed of 32 nodes interconnected
by a Gigabit Ethernet network outtted with Linux Mint 14, 8 GB of RAM and
one NVidia GTX760 GPU with 2GB of RAM. The inter-GPU communications were
implemented using OpenMPI 1.8, compiled with NVidia/CUDA support.
MPI allows direct communication between GPUs in dierent nodes con-
nected by a network. However, our testbed performs the communication using CPU
RAM, decreasing the maximum system performance.
The selected tests are focused on two aspects: memory accesses and running
times. In order to test the memory scalability of the proposal, we have used four
molecular systems as benchmarks (see Fig. 7.2), all of them with a large number of
atoms that can not be simulated in a small number of nodes:
• 2x1VT4 (1,256,718 atoms) is a complex system formed by two 1VT4 molecules,
presented in Chapter 6.3. This molecule was not possible to simulate in a single
node due to the small GPU memory available. A minimum of 2 GPUs are
needed.
• DHFR_555 (2,944,750 atoms) is a well-balanced synthetic system made of
several copies of DHFR (b), which cannot be simulated in less than 4 nodes.
The basic system is formed by an enzyme surrounded by water.
• DHFR_844 (3,015,424 atoms) is another synthetic system made of several

7.2. Distributed Multi-GPU Molecular Dynamics evaluation 71
(a) 2x1VT4 in water (b) Basic DHFR molecule
(c) Several copies of DHFR, in (d) Several copies of DHFR, in

5x5x5 conguration 8x4x4 conguration
copies of DHFR (b), distributed in a dierent conguration.
All test simulations were executed using the Verlet algorithm (see Sec-
tion 2.5.1), with a single time step of 1 fs for short-range non-bonded and bonded
forces. In all tests, we measured averaged statistics for 100 simulation steps, i.e., a
total duration of 0.1ps (1 · 10−13 s).
7.2.1 Scalability Analysis
To evaluate the scalability of the proposal, several tests have been performed. Fig-
ure 7.3 shows the amount of data sent along the simulation. As it can be seen, as the
number of nodes is increased, the datasize of the shared information grows because of
the updates with a higher number of neighbors. However, the dotted line shows that
the amount of data sent per node is practically constant in all cases. Furthermore,
the amount of data per node decreases as the number of nodes is increased because
a higher number of nodes means smaller partition datasize. In summary, the use
of a higher number of nodes does not imply a penalty in the size of the data to be
communicated.
Figure 7.3: Data size communications for DHFR_844 along 100 steps.
Figure 7.4 shows the GPU-RAM allocated for DHFR_555 molecule for
each cluster conguration to test memory scalability. As commented before, it is not
possible to simulate this system using 1 or 2 nodes. The GPUs installed on each node
have only 2GB of RAM, and this memory is also shared with the GUI, so the actual
available GPU RAM for CUDA is smaller. For 4 nodes, 1.17GB of GPU-RAM are
Figure 7.4: Memory allocation for DHFR_555.
used, and the amount of memory needed is decreased when more nodes are added.
Since each node has more neighbors, the RAM usage reduction is not linear. Each
node reserves extra memory needed for communications, but the maximum number
of neighbors (27 in 3D) ensure that this extra memory is bounded.
Figure 7.5 shows the speedup for the three rst molecules used on the
testbed. Figure 7.5a shows speedup evolution using 4, 8 16 and 32 nodes. Note
that the speedup has been measured using system with 4 GPUs/nodes as reference,
because the DHFR molecules tested cannot run with fewer nodes. As a reference, in
an ideal case, assuming linear scalability, the obtained speedups could be up to four
times better in the best case.
However, as it can be seen, inter-GPU communications take a large per-
centage of the time. As a consequence, the smallest molecule shows the worst results.
With larger molecules, force computation takes a larger percentage of the total time,
hence replicated DHFR congurations exhibits better scalability than 2x1VT4. To
prove that this solution could take advantage of a faster network, Figure 7.5b shows
speedups for force computation only, ignoring the time spent in communications. In
this case, all the molecules achieve a similar scalability, with a nearly linear speedup.
As stated before, these speedups are calculated using the 4 GPU conguration as
reference.
(a) Global Speedup (b) Only computation speedup
Figure 7.5: Speedup comparison of the three molecules. Note that the reference
conguration is for 4 GPUs.
Figure 7.6 shows the total simulation times for DHFR_555, split between
computation and transmission times. Note that the transmission times for 16 nodes
are higher than the others. This could be explained due to the fact that the network
is divided in two groups of nodes, connected though several routers. The commu-
nication protocol performance depends on how the partitions are spread among the
computing nodes. In this conguration some of the nodes were farther from their
neighbors than others, increasing the communication times. For 32 nodes simula-
tion, a better conguration was used, showing better communication times. In spite
of this, communication times do not grow much when adding more nodes, while
computation times keep decreasing.

Figure 7.6: Breakdown of running time (100 steps) for DHFR_844.
Figure 7.7: Benchmark molecule, composed of 32 copies of 1VT4 in 4x4x2 con-

guration.
7.2.2 Simulation of Huge Molecules
In order to prove the ability of the proposal to simulate huge molecular systems, a
last test was performed (see Fig. 7.7). 32x1VT4 is a synthetic system made of 32
copies of 1VT4, which adds a total of 20,107,488 atoms. Due to its complexity and
size, this system needs the 32 GPUs to perform the simulation.
Due to the inability to perform a simulation using less than 32 nodes, we

have estimated the speedup of the test by taking the simulation times of one copy of
the molecule in one node. Simulation times are given in nanoseconds per day. One
copy of 1VT4 can be simulated at a speed of 1.5ns/day in one node of the cluster,
so 32 copies should take around 0.047ns/day using one node (1.5 nanoseconds/day
divided by 32). Simulations of 32x1VT4 ran at 0.6ns/day, so the speedup of our
proposal using 32 nodes is of 12.77.

Part III
CONCLUSIONS-AND-FUTURE-
WORK
Chapter 8
Conclusions and Future Work
Chapter 1 introduced some problems found on molecular dynamics and the objec-
tives for this work. The current Ph.D. thesis has properly fullled all the primary
objectives. With the obtained results we can state that:
This thesis presents a parallel scalable molecular dynamics algorithm for
both on-board and distributed multi-GPU architectures, by using GPUs as indepen-
dent computing nodes. The approach extends and optimizes the Multilevel Summation
Method, takes advantage of direct GPU-GPU communications, and introduces mas-
sively parallel algorithms to update and synchronize the interfaces of spatial partitions
on GPUs. The evaluations carried out show that the current implementation is faster
than NAMD, one of the molecular dynamics simulators of reference. Moreover, we
have simulated massive molecular systems formed by more than 20 million of atoms,
demonstrating the potential of the method in a distributed multi-GPU environment.
Section 8.1 summarizes the contributions on each subproblem, and Sec-
tion 8.2 presents future lines of research which are open starting from the conclusion
of this work.
80 8. Conclusions and Future Work
8.1 Summary of contributions
The following sections present the contributions for each accomplished goal.
8.1.1 On-Board Multi-GPU Short-Range Molecular Dynamics
Our initial eorts were focused on simulating short-range bonded and non-bonded
molecular dynamics on on-board multi-GPU architectures. Chapter 5 describes the
proposed solution. Our approach is built on parallel multiple-time-stepping inte-
grators, achieving high speed-ups thanks to the spatial partitioning developed and
direct GPU-GPU communications.
The rst milestone reached was the selection of a partition scheme that
t our requirements. Section 5.3.1 presents the comparison of dierent partition
strategies.
Then, a novel data package management algorithm for massive parallel
architectures is presented in Section 5.2. This algorithm is the key for directly
transferring information between GPUs, enabling the execution of most of the code
on the GPU, avoiding CPU communications.
Finally, Section 5.3 presents an evaluation of the scalability of the proposal,
demonstrating the benets of using GPUs as central compute nodes instead of being
simple co-processors.
8.1. Summary of contributions 81
8.1.2 On-Board Multi-GPU Long-Range Molecular Dynamics
Chapter 6 presents the next milestone achieved in this thesis. The proposal extends
and optimizes the Multilevel Summation Method, takes advantage of direct GPU-
GPU communications, and introduces massively parallel algorithms to update and
synchronize the interfaces of spatial partitions on GPUs.
The most popular algorithm for long-range molecular dynamics is PME.
However, it is not easily parallelizable in a multi-GPU environment. Instead MSM
presents more suitable characteristics for distributing it along several nodes or GPUs,
but it is slower compared to PME. We rst improve the performance of MSM by using
a FFT instead of a 3D convolution in the computation of direct sums on individual
GPUs.
Section 6.1 states the benets of our approach vs. the original MSM and
the well known long-range molecular dynamics algorithm PME. We then show how
to perform a spatial partitioning of the multilevel grid, dividing atom data between
GPUs, and designing massively parallel algorithms to minimize communications to
eciently update and synchronize interfaces.
Also, Section 6.3 evaluates the scalability of the proposal, showing promis-
ing results for a distributed multi-GPU MSM algorithm.
8.1.3 Molecular Dynamics for Distributed Multi-GPU Architec-

tures
Chapter 7 presents the nal milestone achieved. The major drawback of on-board
multi-GPU systems is the limited number of GPUs that can be used in a single
node. To address this limitation, we present a new implementation of the previously
shown short-range force method for distributed multi-GPU architectures. A scalable
partition method for the molecular system is also presented, enabling the simulation
of massive molecular systems.
New shared-data communication schemes are presented in Section 7.1,
based on the denition of local data IDs for computations, and global data IDs
for communications.
Section 7.2 shows the results of the tests performed, demonstrating the
memory and performance scalability of the proposal. In summary, our approach
achieves good simulation times, opening the possibility for massive molecular systems
simulations in distributed multi-GPU environments.
8.2 Future work
The objectives stated at the beginning of this Ph.D thesis have been satisfactorily
reached. The evaluation carried out allows us to conclude that our multi-GPU molec-
ular dynamics approach presents very good behavior in terms of performance and
scalability. Furthermore, this work opens new research lines for current applications.
Pharmaceutical research could benet from simulating massive molecular
systems composed of hundred of millions of atoms. One of the recurrent problems in
molecular dynamics is virus simulation. These molecular systems are so large that
usually only some selected parts are simulated. A scalable solution as as the one
proposed in this work may make the simulation of such large systems practical.
Also, the solutions presented in this thesis can be exported to other simula-
8.2. Future work 83
tion elds. Several solutions can be applied to solve n-body problems, such as such
as celestial mechanics.. SPH uid dynamics and mass-spring cloth applications are
some examples of dynamics simulations that may benet from the spatial partition
and multi-GPU communication schemes presented in Chapters 5 and 7.
However, our approach presents several limitations that motivate future
work:
• Our current solution relies on a static partitioning, which does not guaran-
tee load balancing across GPUs. The tests indicate that practical molecular
systems maintain rather even atom distributions, but dynamic load balancing
might be necessary for ner partitions.
• Our work could be complemented with more advanced protocols and architec-
tures to optimize communications between GPUs. For on-board multi-GPU
systems, there are currently architectures that outperform the Intel IOH/QPI
interface for the PCIe bridge used in the experiments. Also, distributed multi-
GPU architectures present diverse network congurations. An adaptive com-
munication scheme could improve communication times.
• One of the main drawbacks is that MSM adds a certain overhead at coarse
levels, where the number of points to be computed is close to the number
of GPUs, and periodic boundaries wrap around the whole molecular system,
introducing many-to-many communications. To alleviate the negative conse-
quences on scalability, we plan to redesign the algorithm on coarse levels to
run on a smaller number of GPUs once the work load is manageable.
Finally, we are planning to simulate larger molecular systems on an HPC
environment. Specically, the Barcelona Supercomputing Center(BSC) has a cluster

named MinoTauro composed of 128 GPU NVIDIA Tesla M2090 interconnected by an
Inniband QDR. By extrapolating the results of our simulations, this conguration
may allow us to simulate a molecular system made of nearly 300 million atoms.
Bibliography
[1] Matthias Bolten. Multigrid methods for structured grids and their application
in particle simulation. Dr., Univ. Wuppertal, JÃ¼lich, 2008. 22
[2] K. J. Bowers, R. O. Dror, and D. E. Shaw. The midpoint method for par-
allelization of particle simulations. J Chem Phys, 124(18):184109, May 2006.
67
[3] David S Cerutti and David A Case. Multi-level ewald: A hybrid multigrid / fast
fourier transform approach to the electrostatic particle-mesh problem. J Chem
Theory Comput, 6(2):443458, 2010. 22
[4] Terry W. Clark and James Andrew McCammon. Parallelization of a molecular
dynamics non-bonded force algorithm for MIMD architecture. Computers &
Chemistry, pages 219224, 1990. 2
[5] Tom Darden, Darrin York, and Lee Pedersen. Particle mesh ewald: An
nÃ¢ÂÂlog(n) method for ewald sums in large systems. The Journal of Chem-
ical Physics, 98(12):1008910092, 1993. 2, 21
[6] O N de Souza and R L Ornstein. Eect of periodic box size on aqueous molecular
dynamics simulation of a dna dodecamer with particle-mesh ewald method.
Biophys J, 72(6):23957, 1997. 55
[7] Ulrich Essmann, Lalith Perera, Max L. Berkowitz, Tom Darden, Hsing Lee,
and Lee G. Pedersen. A smooth particle mesh ewald method. The Journal of
Chemical Physics, 103(19):85778593, 1995. 21, 53

86 Bibliography
[8] David J. Hardy, John E. Stone, and Klaus Schulten. Multilevel summation
of electrostatic potentials using graphics processing units. Parallel Computing,
35(3):164 177, 2009. Revolutionary Technologies for Acceleration of Emerging
Petascale Applications. 22, 24, 34
[9] David Joseph Hardy and Robert D Skeel. Multilevel summation for the fast
evaluation of forces for the simulation of biomolecules. University of Illinois at
Urbana-Champaign, Champaign, IL, 2006. 22, 23, 52, 53
[10] M. J. Harvey and G. De Fabritiis. An implementation of the smooth particle
mesh ewald method on gpu hardware. Journal of Chemical Theory and Com-
putation, 5(9):23712377, 2009. 53
[11] M. J. Harvey, G. Giupponi, and G. De Fabritiis. ACEMD: Accelerating
Biomolecular Dynamics in the Microsecond Time Scale. Journal of Chemical
Theory and Computation, 5(6):16321639, 2009. 2, 20, 21, 26
[12] Berk Hess, Carsten Kutzner, David van der Spoel, and Erik Lindahl. GRO-
MACS 4: Algorithms for Highly Ecient, Load-Balanced, and Scalable Molec-
ular Simulation. Journal of Chemical Theory and Computation, 4(3):435447,
2008. 2, 21, 26
[13] Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard Brooks, and Milan Hodo Scek.
Parallelizing molecular dynamics programs for distributed memory machines:
An application of the chaos runtime support library, 1994. 2
[14] Intel. Intel 5520 Chipset: Datasheet, March 2009. 44
[15] Jesús A. Izaguirre, Scott S. Hampton, and Thierry Matthey. Parallel multigrid
summation for the n-body problem. J. Parallel Distrib. Comput., 65(8):949962,
August 2005. 24, 34

Bibliography 87
[16] Laxmikant Kalé, Robert Skeel, Milind Bhandarkar, Robert Brunner, Attila Gur-
soy, Neal Krawetz, James Phillips, Aritomo Shinozaki, Krishnan Varadarajan,
and Klaus Schulten. NAMD2: Greater scalability for parallel molecular dynam-
ics. Journal of Computational Physics, 151(1):283 312, 1999. 2
[17] A. Koehler. Scalable Cluster Computing with NVIDIA GPUs, 2012. 44
[18] Samuel Kupka. Molecular dynamics on graphics accelerators, 2006. 25
[19] NAMD on Biowulf GPU nodes, accessed Feb 2013. 1
[20] Marcos Novalbos, Jaime Gonzalez, Miguel A. Otaduy, Alvaro Lopez-Medrano,
and Alberto Sanchez. On-board multi-gpu molecular dynamics. In Felix Wolf,
Bernd Mohr, and Dieter an Mey, editors, Euro-Par, volume 8097 of LNCS, pages
862873. Springer, 2013. 2, 54
[21] Akira Nukada, Kento Sato, and Satoshi Matsuoka. Scalable multi-GPU 3-
D FFT for TSUBAME 2.0 Supercomputer. In Proceedings of the Interna-
tional Conf. on High Performance Computing, Networking, Storage and Analy-
sis (SC'12), pages 44:144:10, 2012. 22
[22] NVidia. CUFFT :: CUDA Toolkit Documentation, accessed Online Jan 2014.
53
[23] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics.
Journal of Computational Physics, 117:119, 1995. 2
[24] Victor Podlozhnyuk. CUDA Samples:: CUDA Toolkit documentation - NVidia's
GPU Merge-sort implementations, accessed Feb 2013. 41
[25] Christoph Rachinger. Scalable computation of long-range potentials for molec-
ular dynamics. Master's thesis, KTH, Numerical Analysis, NA, 2013. 21

88 Bibliography
[26] D.C. Rapaport. Large-scale Molecular Dynamics Simulation Using Vector and
Parallel Computers. North-Holland, 1988. 20
[27] Christopher I. Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, and
Wen-Mei W. Hwu. Gpu acceleration of cuto pair potentials for molecular mod-
eling applications. In Proceedings of the 5th conference on Computing frontiers,
CF '08, pages 273282, 2008. 2, 21, 25
[28] E. Rustico, G. Bilotta, G. Gallo, A. Herault, and C. Del Negro. Smoothed
particle hydrodynamics simulations on multi-gpu systems. In Euromicro In-
ternational Conference on Parallel, Distributed and Network-Based Processing,
2012. 25
[29] Tamar Schlick. Molecular Modeling and Simulation: An Interdisciplinary Guide.
Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2002. 1, 11, 20
[30] Tim C. Schroeder. Peer-to-Peer & Unied Virtual Addressing, 2011. XIII, 4, 43
[31] Robert D Skeel, Ismail Tezcan, and David J Hardy. Multiple grid methods for
classical molecular dynamics. Journal of Computational Chemistry, 23(6):673
684, 2002. 22
[32] John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy,
Leonardo G. Trabuco, and Klaus Schulten. Accelerating molecular modeling
applications with graphics processors. Journal of Computational Chemistry,
28(16):26182640, 2007. 2
[33] J.A. van Meel, A. Arnold, D. Frenkel, S.F. Portegies Zwart, and R.G. Belleman.
Harvesting graphics power for md simulations. Molecular Simulation, 34(3):259
266, 2008. 20
Bibliography 89
[34] B. W. Wah, T. S. Huang, A. K. Joshi, D. Moldovan, J. Aloimonos, R. K. Ba-
jcsy, D. Ballard, D. DeGroot, K. DeJong, C. R. Dyer, S. E. Fahlman, R. Grish-
man, L. Hirschman, R. E. Korf, S. E. Levinson, D. P. Miranker, N. H. Morgan,
S. Nirenburg, T. Poggio, E. M. Riseman, C. Stanll, S. J. Stolfo, S. L. Tani-
moto, and C. Weems. Report on workshop on high performance computing and
communications for grand challenge applications: Computer vision, speech and
natural language processing, and articial intelligence. IEEE Transactions on
Knowledge and Data Engineering, 5(1):138154, 1993. 29
[35] Peng Wang. Short-Range Molecular Dynamics on GPU (GTC2010), September
2010. 36, 40
[36] Juekuan Yang, Yujuan Wang, and Yunfei Chen. GPU accelerated molecular dy-
namics simulation of thermal conductivities. Journal of Computational Physics,
221(2):799 804, 2007. 25
[37] Rio Yokota, Jaydeep P. Bardhan, Matthew G. Knepley, L.A. Barba, and
Tsuyoshi Hamada. Biomolecular electrostatics using a fast multipole BEM on
up to 512 GPUs and a billion unknowns. Computer Physics Communications,
182(6):1272 1283, 2011. 22
[38] Gongpu Zhao, Juan R. Perilla, Ernest L. Yufenyuy, Xin Meng, Bo Chen, Jiying
Ning, Jinwoo Ahn, Angela Gronenborn, Klaus Schulten, Christopher Aiken, ,
and Peijun Zhang. Mature hiv-1 capsid structure by cryo-electron microscopy
and all-atom molecular dynamics. Nature, 497:643646, 2013. 31

90 Bibliography

Thesis Marcos Novalbos Mendiguchia

Cargado por

Información del documento

Derechos de autor

Formatos disponibles

Compartir este documento

Compartir o incrustar documentos

Opciones para compartir

¿Le pareció útil este documento?

¿Este contenido es inapropiado?

Copyright:

Formatos disponibles

Thesis Marcos Novalbos Mendiguchia

Cargado por

Copyright:

Formatos disponibles

Universidad Rey Juan Carlos

Departamento de Ciencias de la Computación, Arquitectura

Scalable Molecular Dynamics

By Marcos Novalbos Mendiguchía

A dissertation submitted in partial fulllment of the requirements for the

degree of Doctor of Philosophy in Computer Science

Otaduy Tristán, con DNI 72447035W, como directores de la presente tesis

Que los trabajos de investigación desarrollados en la memoria de tesis doctoral

 Scalable Molecular Dynamics on High-Performance Multi-GPU Sys-

Mendiguchía ante el Tribunal que en su día se consigne, para aspirar al Grado de

Doctor por la Universidad Rey Juan Carlos.

Y para que así conste rman el presente documento en Móstoles a 21 de

V.B. Directores de Tesis

Dr. Alberto Sánchez Campos Dr. Miguel Angel Otaduy Tristán.

by Marcos Novalbos Mendiguchía

momento me han ayudado, y hacerles llegar mi agradecimiento: familia, amigos,

Primeramente quería agradecerle la paciencia y el apoyo que he tenido con

mi familia. De mi madre Carmen, mi hermana Maria del Mar y mi cuñado Ricardo

he aprendido que es imprescindible esfuerzo, constancia y responsabilidad en todo lo

puede contar con ellos para buscar una solución.

También quería agradecerles el trabajo y dedicación a mis dos directores

juntos y sin su apoyo e interés constante seguramente no habría avanzado nunca. A

de última hora incluidas. A ambos quería agradeceros sobretodo la paciencia en los

en ayudar a otros a conseguir sus objetivos.

Quiero agradecer a todos los trabajadores de Plebiotic S.L. su nanciación

y la conanza que depositaron en mi trabajo mientras duró el proyecto. A Roberto

perdidas depurando código y del que he aprendido muchísimo de programación en

Por último, me gustaría agradecer a todos los integrantes del despacho

Los sistemas de simulación de dinámica molecular aúnan los esfuerzos de distin-

atómico, en un entorno simulado. Estas simulaciones tienen en cuenta los movimien-

tos y equilibrios de energía necesarios, imitando el movimiento real de los átomos y

sus interacciones durante un espacio de tiempo nito.

posibles de capturar de forma analítica, ayudando en la investigación de nuevas

drogas y medicamentos. El hecho de poder predecir las interacciones y formas de

ciertas proteínas usando programas informáticos permite a las empresas farmacéuti-

cas ahorrar tiempo y dinero en sus investigaciones.

Sin embargo, los sistemas de simulación molecular están limitados por la

capacidad de cálculo de los ordenadores actuales. Las mediciones de tiempo de

simulación se dan en el orden de nanosegundos/día, dando a entender que para

obtener los movimientos durante un corto espacio de tiempo simulado es necesario

mucho tiempo real.

Por ejemplo, si se quisiera representar un nanosegundo de movimientos de

14 días de tiempo de cómputo en un ordenador monoprocesador. La complejidad

moleculares en tiempo real.

respuesta relativamente rápida, las simulaciones moleculares se llevan a cabo en en-

tornos virtuales ejecutados en sistemas informáticos con gran capacidad de cálculo.

los aprovechando las características de los sistemas informáticos de alto rendimiento.

Con la evolución de los sistemas de computadores paralelos como clústeres o multi-

procesadores, se ha conseguido gran reducción de los tiempos de simulación.

En concreto, las arquitecturas de tarjetas grácas o GPUs han propor-

cionado un gran incremento en el rendimiento de multitud de aplicaciones, gracias a

sus características masivamente paralelas. La popularidad de las GPUs ha aumen-

drásticamente los tiempos de simulación para muchos de los algoritmos utilizados.

El trabajo que aquí se presenta está centrado en la explotación de los sis-

temas multi-GPU para acelerar el cálculo de las simulaciones de dinámica molecular.

turas, explotando sus características como nodos de cómputo autónomos.

tas y algoritmos nuevos. En concreto, se ha desarrollado un algoritmo de empa-

la particularidad de que se ejecuta completamente en GPU, evitando perder tiempo

tornos multiGPU. Se han introducido mejoras para algoritmos de dinámica molecular

de cálculo de fuerzas largas, optimizando el método Multigrid Summation Method

suelen ser sistemas tipo cluster de memoria distribuida, se ha portado el código a

estos sistemas y se han realizado pruebas de escalabilidad con resultados óptimos

para simulación de moléculas de gran tamaño.

Esta tesis continúa el trabajo de investigación iniciado por la compañía

A dissertation submitted in partial fulllment of the requirements for the

Scalable Molecular Dynamics on High-Performance Multi-GPU Sys-

Y para que así conste rman el presente documento en Móstoles a 21 de

Quiero agradecer a todos los trabajadores de Plebiotic S.L. su nanciación

y la conanza que depositaron en mi trabajo mientras duró el proyecto. A Roberto

sus interacciones durante un espacio de tiempo nito.

simulación se dan en el orden de nanosegundos/día, dando a entender que para

mucho tiempo real.

En concreto, las arquitecturas de tarjetas grácas o GPUs han propor-

una de las mayores dicultades se encuentra en el aprovechamiento de arquitecturas

Fuerzas de corto alcance, calculadas de forma exacta dentro de una distan-

tema distribuido. Las subdivisiones creadas se denomina parches, y cada parche

cinos. A partir de ahí, se denen trabajos que se distribuyen a lo largo de las

grácas que se puedan instalar en la placa base.