
Universidad Rey Juan Carlos

Departamento de Ciencias de la Computación, Arquitectura


de Computadores, Lenguajes y Sistemas Informáticos y
Estadística e Investigación Operativa.
Doctorado en Ingeniería Informática

Scalable Molecular Dynamics

on

High-Performance Multi-GPU

Systems

By Marcos Novalbos Mendiguchía

A dissertation submitted in partial fulfillment of the requirements for the

degree of Doctor of Philosophy in Computer Science

Supervisors
Prof. Alberto Sánchez Campos
Prof. Miguel Angel Otaduy Tristán
Candidate
Dr. Marcos Novalbos Mendiguchía

July 2015
Dr. Alberto Sánchez Campos, with DNI 50120554N, and Dr. Miguel Angel Otaduy Tristán, with DNI 72447035W, as supervisors of this thesis,

CERTIFY

That the research work presented in the doctoral thesis "Scalable Molecular Dynamics on High-Performance Multi-GPU Systems" is suitable to be defended by the Computer Engineer Marcos Novalbos Mendiguchía before the examination committee appointed in due course, in order to obtain the degree of Doctor from the Universidad Rey Juan Carlos.

In witness whereof, they sign this document in Móstoles, on October 21, 2015.

Approval of the thesis supervisors

Dr. Alberto Sánchez Campos        Dr. Miguel Angel Otaduy Tristán


Scalable Molecular Dynamics on High-Performance Multi-GPU
Systems

Copyright 2015

by Marcos Novalbos Mendiguchía


Acknowledgements

This work is the sum of the efforts of many people who have contributed, directly or indirectly, to its development. For that reason I would like to dedicate it to everyone who has helped me at some point, and to express my gratitude to them: family, friends, and people who, without being particularly close to me, gave me good advice at some moment.

First of all, I would like to thank my family for their patience and support. From my mother Carmen, my sister Maria del Mar and my brother-in-law Ricardo I have learned that effort, perseverance and responsibility are essential in everything one does in order to reach one's goals. And in any case, if something goes wrong, you can always count on them to find a solution.

I would also like to thank my two thesis supervisors for their work and dedication; without them this work would have been impossible. I want to thank Alberto for the confidence he placed in me from the very beginning; we have worked together for many years and without his support and constant interest I would surely never have made progress. I want to thank Miguel Ángel for all the hours invested in this work, last-minute revisions included. Above all, I want to thank both of you for your patience during the last months before the submission of this work; few people show so much interest in helping others achieve their goals.

I want to thank all the employees of Plebiotic S.L. for their funding and for the confidence they placed in my work while the project lasted. To Roberto Martínez Benito, Álvaro López-Medrano and Roldán Martínez, for the efforts they invested in the company they founded and with which they tried to make a difference. And of course, I want to thank Jaime for all these years working together, for the hours lost debugging code, and because I have learned a great deal about GPU programming from him.

Finally, I would like to thank all the members of office 2011 (past and present) with whom I have shared working hours during the last 4 years. We have lived through many experiences together and helped each other along the way; many thanks to Alberto, Álvaro, Ángela, Carlos, David, Gabriel, Javier, Jorge, Mónica, Laura and Zeleste. And, by extension, to the rest of GMRV, especially Jose, Juanpe, Luis, Marcos, Óscar, Pablo, Richard SM, Sofía and Susana.
Summary (Resumen en Castellano)

Molecular dynamics simulation systems combine the efforts of different disciplines to faithfully recreate the interactions between elements at the atomic level in a simulated environment. These simulations take into account the required motions and energy balances, imitating the real movement of atoms and their interactions over a finite period of time.

Such simulations are needed to study properties that would be impossible to capture analytically, helping in the research of new drugs and medicines. Being able to predict the interactions and shapes of certain proteins using computer programs allows pharmaceutical companies to save time and money in their research.

However, molecular simulation systems are limited by the computing capacity of current computers. Simulation throughput is reported in the order of nanoseconds per day, meaning that a long real time is required to obtain the motions over a short span of simulated time.

For instance, representing one nanosecond of motion of a small system of about 92,224 atoms may require up to 14 days of computation on a single-processor computer. The complexity of the calculations needed to simulate the physical motion of the atoms that make up the system is so high that real-time molecular simulation is unthinkable today.

Reducing computation times is therefore essential. In order to provide a relatively fast response, molecular simulations are run in virtual environments executed on computing systems with large computational capacity. Since the early days of the field, different techniques have been developed to accelerate the calculations by exploiting the features of high-performance computing systems. With the evolution of parallel computer systems such as clusters or multiprocessors, large reductions of the simulation times have been achieved.

In particular, graphics card (GPU) architectures have provided a large performance increase for a multitude of applications, thanks to their massively parallel nature. The popularity of GPUs has grown so much that it is easy to find high-performance machines with several programmable graphics cards installed to accelerate computations. They turn out to be a great hardware support for molecular simulation systems, drastically reducing simulation times for many of the algorithms used.

The work presented here focuses on the exploitation of multi-GPU systems to accelerate molecular dynamics simulations. Among other goals, it aims to provide a new approach to the use of these architectures, exploiting their capabilities as autonomous computing nodes.

To carry out this task, several new tools and algorithms have been developed. Specifically, a data packing algorithm for direct GPU-to-GPU communications has been developed. This algorithm has the particularity that it runs entirely on the GPU, avoiding the time lost moving data between CPU and GPU. Different forms of spatial partitioning for molecular systems have also been investigated, selecting the most suitable one for multi-GPU environments. Improvements have been introduced for long-range force computation algorithms, optimizing the Multilevel Summation Method (MSM). Finally, since the environments with the largest number of available GPUs are usually distributed-memory clusters, the code has been ported to these systems and scalability tests have been carried out, with very good results for the simulation of large molecules.

This thesis continues the research work started by the company Plebiotic S.L. in collaboration with the Modeling and Virtual Reality Group (GMRV) of the Department of Computer Science, Computer Architecture, Computer Languages and Systems, and Statistics and Operations Research of the School of Computer Engineering of the Universidad Rey Juan Carlos in Madrid.

The following sections present the state of the art in molecular dynamics, the objectives set, a summary of the work carried out, and the conclusions drawn from the experiments performed.

Background

Two research lines can be distinguished within molecular dynamics simulation systems:

• Improvements in simulation speed

• Improvements in the accuracy of the computations

Both lines are in conflict: accurate computations introduce an extra computational load, whereas execution speed optimizations are usually based on less restrictive methods that tend to introduce accuracy errors. The optimizations carried out in recent years focus on improving well-known algorithms, adapting them to make use of high-performance architectures. There are several algorithms optimized for shared-memory systems and distributed, cluster-like environments. One of the biggest difficulties lies, in particular, in exploiting massively parallel architectures such as GPUs.

Types of simulated forces

Molecular dynamics focuses on computing the forces resulting from the interaction of the atoms that form the system. The total simulation time is divided into small time steps in which the forces acting on each atom are computed to obtain its velocity, and from that velocity the new position for the next time step is computed. The smaller the simulation step, the more accurate the computations, but also the longer the simulation takes to complete. Two types of forces can be distinguished, one of them subdivided into two further types:

• Bonded forces: When molecular systems are modeled, the atoms that belong to the same molecule are joined by different types of bonds. These bonds are simulated as if they were springs, affecting only the atoms that form the bond. They are usually the fastest forces to compute, representing a small computational load.

• Electrostatic forces: The electrostatic and Van der Waals forces are those produced by the charges of the atoms. Since electrostatic forces decay rapidly with distance, they are usually divided into two types to simplify the computations:

  – Short-range forces, computed exactly within a cutoff distance around an atom. The computation of short-range forces is considerably heavier than that of bonded forces, taking a large share of the force computation time.

  – Long-range forces, computed using all the atoms of the system located beyond the cutoff distance. The computation time using exact algorithms is very high, so approximations with a certain degree of error are normally used. Since interactions beyond the cutoff radius are less important, in some cases they can be ignored without degrading the system.

PME [4] is the most popular long-range force computation method. It solves the computations using FFTs over a grid of charge potentials. The original version of PME uses spectral differentiation and a total of four FFTs per simulation step, whereas Smooth PME (SPME) [6] uses B-spline interpolation, reducing the number of FFTs to two. PME is used in a multitude of molecular simulators, such as NAMD [26], GROMACS [11] or ACEMD [10]. Furthermore, PME has been parallelized on GPU, although due to the nature of the algorithm it is difficult to implement in a way that exploits several GPUs simultaneously.

Molecular simulators

Over the last 20 years, several molecular dynamics simulators have been developed for high-performance environments. NAMD [26] is one of the longest-lived, with its first versions dating back to 1995. It is among the most popular, being used in many of the most important molecular simulation projects. From the beginning it focused on developing parallel algorithms to reduce execution times. NAMD distributes the computational work among the available compute nodes by performing a spatial partition of the system. Each partition is assigned to a compute node, which may be a core in a multi-CPU system or a node of a distributed system. The subdivisions created are called patches, and each patch keeps the information of the atoms that belong to it together with its neighboring patches. From there, jobs are defined and distributed across the CPUs available in each compute node. Each job is defined as an interaction between patches; if a compute node needs information from a patch it does not own, a copy of the required data is sent along with the associated job.

In the latest versions of NAMD it is possible to use GPUs to accelerate computations. Small jobs are created and assigned to the GPUs together with the required data. Once the data are on the GPU, the necessary kernels are launched to compute the forces and the results are copied back to CPU memory. This scheme, which uses the GPU as a coprocessor, forces a large amount of data exchange between GPU and CPU.

GROMACS [11] is another molecular simulator with a long development history; its first versions date from 1991. It was initially implemented as a supercomputer, with many compute nodes connected in a ring topology. Later the system was ported to C code, allowing it to run on more common parallel machines such as clusters or multiprocessors. Like NAMD, it performs a spatial partition of the system to distribute it among the available nodes. The system is divided into a staggered grid, with areas of shared information between adjacent partitions. GPUs can be used as coprocessors to accelerate certain parts of the code, although in that case only the short-range forces have been optimized. As with NAMD, at each step the input data are copied to the GPU, and the resulting data are then downloaded back to the CPU.

ACEMD [10] is a relatively more modern molecular simulator than the previous ones. It focuses on algorithm optimizations based on new mathematical models that simplify the structures of the molecules before operating on them. It is optimized to exploit multi-GPU systems installed in a single workstation, and it is one of the fastest simulators available. When several GPUs are present, each one computes one type of force in parallel. This approach exploits multi-GPU systems, although its scalability is limited since all communications between GPUs go through CPU memory. Moreover, being limited to workstations, the maximum number of GPUs that can be used is bounded by the number of graphics cards that can be installed on the motherboard.

Objectives

Molecular dynamics simulation systems can make use of high-performance architectures. In many cases, hybrid GPU-CPU solutions are used to accelerate force computations. However, it is the CPU that keeps control of the application, using the GPUs as mere coprocessors. Current GPUs have a computing power that exceeds that of most CPUs, but they are severely limited by CPU-GPU communications caused by data upload and download. Current architectures support direct communications between GPUs installed on the same motherboard [30], or even between cards attached to nodes on the same network. These features make it possible to use GPUs as compute nodes for molecular dynamics simulation algorithms rather than as coprocessors, reducing the overhead introduced by GPU-CPU communications.

Specifically, this work aims to demonstrate the feasibility of using GPUs as autonomous computing nodes, developing communication and data management algorithms that run entirely on the GPU, together with molecular simulation algorithms adapted to distributed multi-GPU environments. In order to demonstrate this hypothesis, the algorithms must be designed and implemented paying special attention to the scalability of the system and to minimizing the amount of data to be communicated between nodes.

To achieve the purpose of this work, the following objectives have been defined:

• Improvement of simulation times in multi-GPU environments. As discussed above, direct communication between GPUs saves time. Traditional simulators use GPUs as coprocessors, whereas our goal is to use them as independent compute nodes. This objective includes developing GPU-GPU communication protocols and data packing algorithms for massively parallel architectures.

• Improvement of current molecular simulation algorithms, adapting them to GPU architectures and to distributed multi-GPU environments. In particular, the computation of long-range forces is a problem that is hard to distribute among several GPUs. The traditional method is PME, but it does not suit our needs. In contrast, the MSM long-range force computation method is much more suitable for our purposes. This objective includes improving MSM so that it is as fast as PME, adapting it to multi-GPU systems.

• Improvement of the simulation of large molecular systems formed by millions of atoms, while maintaining the scalability of the system. In order to simulate large molecular systems in multi-GPU environments, high-performance architectures with tens of graphics cards are needed. This objective includes adapting the developed algorithms to distributed, cluster-like environments, where a large number of GPUs is feasible.

In summary, this work attempts to provide new ways of using multi-GPU environments. The solution must be scalable and must make it possible to accelerate molecular dynamics computations for molecular systems that require large amounts of resources.

Methodology

Based on the objectives set out above, they have been grouped into a series of milestones. The final goal is to exploit any multi-GPU architecture, so a specific architecture was initially selected, based on multiple GPUs connected through a high-speed data bus on the same motherboard. These architectures usually provide a small number of graphics cards, at most between 4 and 8 GPUs, so the sizes of the simulated systems used to be limited. To overcome those limitations, the last part of the project focused on running tests in distributed, cluster-like environments, where a larger number of GPUs connected through a data network is available.

The milestones achieved on the two multi-GPU systems used are detailed below.

Molecular dynamics for multi-GPU environments with a shared bus on the same motherboard

• Design and implementation of a spatial partitioning algorithm that generates regular partitions of the simulation volume. Each partition is assigned to a single GPU.

• Definition of shared areas between partitions, called interfaces, used to keep the system coherent. The management of these interfaces is carried out independently by each GPU.

• Design and implementation of a parallel algorithm to prepare the interface data packages used for communications and updates between GPUs. This algorithm has been implemented entirely on the GPU (a sketch is given after this list).

• Adaptation of current molecular dynamics algorithms based on task partitioning to spatial partitioning methods. The starting point was a set of parallel algorithms that used the GPUs as coprocessors and moved a large amount of data between CPU and GPU. They have been adapted so that each GPU operates on its own data partition, and only the data needed to update the partitions are sent between GPUs. GPU-CPU communication is minimal; only the data required to launch the GPU kernels are sent.

• Definition of a distributed method for long-range force computation. Specifically, the MSM method has been used as a basis, optimizing it to use FFTs and implementing it in a distributed fashion on a multi-GPU environment. The tests were run on a multi-GPU environment in a single workstation, but the method is extensible to distributed, cluster-like environments.

• Scalability and efficiency tests of the implemented algorithms. The tests were satisfactory, showing in many cases that the implementation is faster than NAMD, one of the reference molecular simulators.
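To make the idea concrete, the following is a minimal CUDA sketch of such GPU-side packing followed by a direct peer-to-peer transfer. The data layout (AtomState, per-atom cell indices, an interface-cell flag array) and the function names are illustrative assumptions, not the actual implementation described in this thesis; peer access between the two devices is assumed to have been enabled beforehand with cudaDeviceEnablePeerAccess, and the stream is assumed to belong to the source device.

#include <cuda_runtime.h>

struct AtomState { float4 pos; float4 vel; int globalId; };

// One thread per atom: atoms whose cell is flagged as part of the interface
// reserve a slot in the send buffer with an atomic counter and copy themselves,
// so the compaction happens entirely on the GPU.
__global__ void packInterface(const AtomState* atoms, int numAtoms,
                              const int* cellOfAtom, const char* isInterfaceCell,
                              AtomState* sendBuf, int* sendCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numAtoms) return;
    if (isInterfaceCell[cellOfAtom[i]]) {
        int slot = atomicAdd(sendCount, 1);
        sendBuf[slot] = atoms[i];
    }
}

// Host side: launch the packing kernel on GPU 'src' and push the packed buffer
// straight into GPU 'dst' memory, without staging through CPU memory.
void exchangeInterface(int src, int dst,
                       const AtomState* d_atoms, int numAtoms,
                       const int* d_cellOfAtom, const char* d_isInterfaceCell,
                       AtomState* d_sendBuf, int* d_sendCount,
                       AtomState* d_recvBufOnDst, cudaStream_t stream)
{
    cudaSetDevice(src);
    cudaMemsetAsync(d_sendCount, 0, sizeof(int), stream);
    int threads = 256, blocks = (numAtoms + threads - 1) / threads;
    packInterface<<<blocks, threads, 0, stream>>>(d_atoms, numAtoms, d_cellOfAtom,
                                                  d_isInterfaceCell, d_sendBuf, d_sendCount);
    int h_count = 0;
    cudaMemcpyAsync(&h_count, d_sendCount, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    // Direct GPU-to-GPU copy over the PCIe bus.
    cudaMemcpyPeerAsync(d_recvBufOnDst, dst, d_sendBuf, src,
                        h_count * sizeof(AtomState), stream);
}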

Molecular dynamics for distributed-memory multi-GPU environments connected through a local area network

• Adaptation of the developed algorithms to a cluster-like multi-GPU environment, where the feasibility of the solution is verified by simulating molecular systems of several million atoms. The goal was to simulate systems that were impossible to simulate in environments with few GPUs, due to the limited RAM available to hold the data volume.

• Resolution of memory scalability problems. To accelerate the computations, a set of atom identifier translation lists is used. These lists do not scale in memory, limiting the maximum size of the system that can be simulated. When the system is partitioned, these lists contain holes or empty regions; the more partitions there are, the larger the empty regions grow, so the problem can be solved by using a more efficient storage scheme. Hash tables fit this pattern very well, compacting the useful data and saving memory, so they have been incorporated into the system.

• Resolution of the problems of updating the molecular system in a distributed-memory environment. The previous methods took advantage of the features of shared-memory environments to perform the data partitioning. Since atoms can migrate from one GPU to another, a data communication protocol has been defined that unambiguously describes the atoms and bonds being sent (a sketch of the identifier translation is given after this list).
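As an illustration of these last two points, the following minimal host-side sketch combines hash-table based identifier translation with a migration message in which atoms are described by their global IDs. Structure and function names are illustrative assumptions rather than the code used in this thesis.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct MigratedAtom { std::uint64_t globalId; float pos[3]; float vel[3]; };

// Atoms travel on the wire identified by global IDs, while each node stores them
// under compact local IDs. A hash table replaces the sparse translation arrays,
// so memory grows with the atoms actually owned by the node rather than with
// the size of the whole molecular system.
class IdTranslator {
    std::unordered_map<std::uint64_t, int> globalToLocal; // compact, no empty gaps
    std::vector<std::uint64_t> localToGlobal;
public:
    // Register an incoming atom (e.g. one that migrated from another GPU/node).
    int addOrLookup(std::uint64_t gid) {
        auto it = globalToLocal.find(gid);
        if (it != globalToLocal.end()) return it->second;
        int lid = static_cast<int>(localToGlobal.size());
        globalToLocal.emplace(gid, lid);
        localToGlobal.push_back(gid);
        return lid;
    }
    std::uint64_t toGlobal(int lid) const { return localToGlobal[lid]; }
};

// Unpacking a migration message: every atom is identified unambiguously by its
// global ID and mapped to a local slot before being uploaded to the GPU arrays.
std::vector<int> unpackMigration(IdTranslator& table,
                                 const std::vector<MigratedAtom>& msg) {
    std::vector<int> localSlots;
    localSlots.reserve(msg.size());
    for (const MigratedAtom& a : msg)
        localSlots.push_back(table.addOrLookup(a.globalId));
    return localSlots;
}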

Conclusions

The tests carried out and the conclusions obtained are detailed below. They focus on the two types of architectures described in the previous section. For the multi-GPU systems with a shared bus on the same motherboard, a machine was used running Ubuntu GNU/Linux 10.04, with two Intel Xeon Quad Core 2.40 GHz CPUs with hyperthreading, 32 GB of RAM and four NVidia GTX580 GPUs connected to a PCIe 2.0 bus on a Tyan S7025 motherboard with an Intel 5520 IOH chipset.

For the multi-GPU systems distributed over a network, a cluster of 32 computers was used, each running Linux Mint 14, with 8 GB of RAM and one NVidia GTX760 GPU with 2 GB of RAM. The nodes are interconnected by a Gigabit Ethernet network. For communications between GPUs, OpenMPI 1.8 compiled with NVidia/CUDA support was used.

The following sections detail the results obtained in both environments. Figure 1 shows the molecules used in each of the tests.

Short-range force simulation on multi-GPU systems with a shared bus on the same motherboard

The first evaluations focused on the simulation of bonded forces and short-range electrostatic forces in the multi-GPU environment described above. Communication and simulation times were measured for 1, 2 and 4 partitions hosted on different GPUs. The three molecular systems (Figure 1) used for the tests were the following:

• ApoA1 (92,224 atoms): A well-known system, a high-density lipoprotein (HDL) in human plasma. It is commonly used in NAMD performance benchmarks.

• C206 (256,436 atoms): A complex system formed by a protein, a ligand and a membrane. Due to its heterogeneity, it poses some load-balancing challenges.

• 400K (399,150 atoms): A synthetic molecular system with a balanced data load, consisting of 133,050 water molecules.

All the tests consisted of a run of 2000 simulation steps, corresponding to 4 picoseconds (4·10⁻¹² seconds) of simulated time. Figure 2a shows the scalability results obtained for 2 and 4 GPUs, plus an estimate (dashed lines) of the results that could be expected on systems with 8 and 16 GPUs, taking the communication limits into account. The results show that the implementation performs better as the system size grows, since more work is shared among the different GPUs. The speedup obtained for ApoA1 is lower than the rest because it is the smallest system, and communication times quickly limit its scalability.

Figure 2b compares the results obtained against NAMD, the reference molecular simulator. Performance is measured in terms of the number of nanoseconds that can be simulated per day. In all cases our solution outperforms NAMD by a factor of at least 4×.

Long-range force simulation on multi-GPU systems with a shared bus on the same motherboard

The next tests focused on the simulation of long-range electrostatic forces in the multi-GPU environment described above. In this case a distributed version of MSM was implemented, its scalability was measured on 4 GPUs, and its efficiency was compared with NAMD. The three molecular systems (Figure 1) used for the tests were the following:

• 400K (399,150 atoms): Presented above, a synthetic molecular system with a balanced data load, consisting of 133,050 water molecules.

• 1VT4 (645,933 atoms): A complex system formed by a holoenzyme assembled around the protein adaptor dApaf-1/DARK/HAC-1.

• 2x1VT4 (1,256,718 atoms): A system formed by two copies of 1VT4.

Figure 3 shows both the scalability of the solution and the execution times obtained compared with NAMD. As before, it can be seen that the larger the system, the better the speedup obtained.

Molecular dynamics simulation in distributed multi-GPU environments

Finally, scalability and performance tests were carried out on the cluster described above. Since the communication links are much slower than in the on-board bus systems, especially large molecular systems were used to compensate for the time lost in communications. In this way the ratio of computation time to communication time is much higher, easing the scalability of the system and showing that with a faster network the system would remain scalable.

The scalability tests focused on two aspects: memory scalability and execution time scalability. The selected molecular systems (Figure 1) are composed of a very large number of atoms, so it was not possible to simulate them on a single cluster node:

• 2x1VT4 (1,256,718 atoms): The system used in the previous simulations. It was at the limit of memory usage on the previous platform; the graphics cards used in the cluster have less graphics memory, so at least 2 nodes are needed to simulate it.

• DHFR_555 (2,944,750 atoms): A synthetic system with a balanced computational load, formed by several copies of the DHFR molecule (Figure 1f), which needs a minimum of 4 nodes to be simulated.

• DHFR_844 (3,015,424 atoms): Another synthetic system formed by several copies of DHFR, distributed differently.

One of the implemented improvements focused on the memory scalability of the system. Figure 4 shows the amount of memory allocated by each cluster node for DHFR_555; it can be seen how it decreases as more simulation nodes are added. Figure 5 shows the speedup computed for the system. Note that the reference configuration used for the computation (speedup = 1) already uses 4 GPUs, so for each GPU configuration the actual speedup could be estimated to be about 4 times higher than shown. Figure 5a shows the total speedup, including communication times, whereas Figure 5b shows the speedup of the force computation times alone. Taking this into account, it can be concluded that the system scales as the number of GPUs increases, and that it would benefit from a much faster communication network.


Figure 1: Benchmark molecules: (a) ApoA1 in water, (b) C206 in water, (c) 400K, (d) 1VT4 in water, (e) 2x1VT4 in water, (f) DHFR molecule, (g) several copies of DHFR in a 5x5x5 configuration, (h) several copies of DHFR in a 5x5x5 configuration, (i) 32 copies of 1VT4 in a 4x4x2 configuration.

Figure 2: Scalability (a) and performance comparison with NAMD (b), measured in terms of simulated nanoseconds per day.

Figure 3: Running times and speedup.

Figure 4: Memory allocation for DHFR_555.

Figure 5: Speedup comparison of the three molecules: (a) global speedup, (b) computation-only speedup. Note that the reference configuration is for 4 GPUs.
Contents

List of Figures XXXIII

1 Introduction 1

1.1 Motivations and scope of the work . . . . . . . . . . . . . . . . . . . 1

1.2 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Document organization . . . . . . . . . . . . . . . . . . . . . . . . . . 6

I STATE-OF-THE-ART 9

2 Molecular dynamics 11

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Bonded interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.1 Bonds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.2 Angles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.3 Dihedrals and Impropers . . . . . . . . . . . . . . . . . . . . . 13

2.3 Non-Bonded Electrostatic Interactions . . . . . . . . . . . . . . . . . 14



2.4 Van Der Waals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.5 Integrators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5.1 Verlet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5.2 Respa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Parallel Molecular Dynamics 19

3.1 Parallelization Techniques . . . . . . . . . . . . . . . . . . . . . . . . 19

3.1.1 Bonded Forces . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1.2 Non-bonded forces . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1.2.1 Short range . . . . . . . . . . . . . . . . . . . . . . . 20

3.1.3 Long Range Non-bonded Forces . . . . . . . . . . . . . . . . . 21

3.1.4 The Multilevel Summation Method . . . . . . . . . . . . . . . 23

3.2 Parallel molecular simulators . . . . . . . . . . . . . . . . . . . . . . 25

3.2.1 NAMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2.2 GROMACS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.2.3 ACEMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

II PROBLEM-STATEMENT-AND-PROPOSAL 27

4 Problem Statement 29

4.1 A grand challenge problem . . . . . . . . . . . . . . . . . . . . . . . . 29

4.2 Novel architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.3 Multi-GPU communications bottleneck . . . . . . . . . . . . . . . . 32

4.3.1 Contribution: Direct GPU-GPU communications . . . . . . . 33

4.3.2 Contribution: Distributed MSM . . . . . . . . . . . . . . . . . 33

4.4 Solutions for memory scalability . . . . . . . . . . . . . . . . . . . . . 34

5 On-Board Multi-GPU Short-Range Force Computation 35

5.1 Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.2 Parallel Partition Update and Synchronization . . . . . . . . . . . . 40

5.2.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.2.2 Identification of Transfer Data . . . . . . . . . . . . . . . . . 41

5.2.3 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.3 Short Range On-Board Multi-GPU Evaluation . . . . . . . . . . . . 44

5.3.1 Comparison of Partition Strategies . . . . . . . . . . . . . . . 45

5.3.2 Scalability Analysis and Comparison with NAMD . . . . . . . 46



6 On-Board Multi-GPU Long-Range Force Computation 51

6.1 Optimized MSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.1.1 FFT-Based Sums . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.2 Distributed MSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

6.2.1 Multigrid Partitions . . . . . . . . . . . . . . . . . . . . . . . 54

6.2.2 Periodic Boundary Conditions on Multiple GPUs . . . . . . . 55

6.2.3 Parallel Update and Synchronization of Interfaces . . . . . . . 57

6.3 Evaluation for On-Board Multi-GPU MSM . . . . . . . . . . . . . . 59

6.3.1 Scalability Analysis . . . . . . . . . . . . . . . . . . . . . . . . 59

7 Distributed Multi-GPU Molecular Dynamics 63

7.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

7.1.1 System partition . . . . . . . . . . . . . . . . . . . . . . . . . 65

7.1.2 Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

7.2 Distributed Multi-GPU Molecular Dynamics evaluation . . . . . . . 70

7.2.1 Scalability Analysis . . . . . . . . . . . . . . . . . . . . . . . . 72

7.2.2 Simulation of Huge Molecules . . . . . . . . . . . . . . . . . . 75



III CONCLUSIONS-AND-FUTURE-WORK 77

8 Conclusions and Future Work 79

8.1 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . 80

8.1.1 On-Board Multi-GPU Short-Range Molecular Dynamics . . . 80

8.1.2 On-Board Multi-GPU Long-Range Molecular Dynamics . . . 81

8.1.3 Molecular Dynamics for Distributed Multi-GPU Architectures 81

8.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

Bibliography 85
List of Figures

1 Benchmark molecules . . . . . . . . . . . . . . . . . . . . . . . . . . XXIII

2 Scalability (a) and performance comparison with NAMD (b), measured in terms of simulated nanoseconds per day. . . . . . . . . . . . . XXIV

3 Running times and speedup . . . . . . . . . . . . . . . . . . . . . . . XXIV

4 Memory allocation for DHFR_555. . . . . . . . . . . . . . . . . . . XXV

5 Speedup comparison of the three molecules. Note that the reference

configuration is for 4 GPUs. . . . . . . . . . . . . . . . . . . . . . . XXV

3.1 Diagram showing the major operations of MSM. The bottom level

represents the atoms, and higher levels represent coarser grids. . . . . 24

4.1 HIV1 Capsid, 64 million atoms total including solvent . . . . . . . . 32

5.1 Comparison of binary (a) vs. linear spatial partitioning (b). The

striped regions represent the periodicity of the simulation volume. . . 37

5.2 The different types of cells at the interface between two portions of

the simulation volume. . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.3 PCIe configuration of our testbed. . . . . . . . . . . . . . . . . . . 43

5.4 Benchmark molecules. . . . . . . . . . . . . . . . . . . . . . . . . . . 48



5.5 Performance comparison of binary and linear partition strategies on

C206. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.6 Running time (2000 steps) for the binary partition strategy on C206. 50

5.7 Scalability (a) and performance comparison with NAMD (b), mea-

sured in terms of simulated nanoseconds per day. . . . . . . . . . . . 50

6.1 Partition of the multilevel grid under periodic boundaries. Left: All

grid points on each level, distributed into 3 GPU devices. Right: Data

structure of GPU device 0 (blue) on all levels, showing: its interior

grid points, interface points for an interface of size 3, and buffers to

communicate partial sums to other devices. Interface points due to

periodic boundary conditions are shown striped. Arrows indicate sums

of interface values to the output buffers. With interfaces of size 3, in

levels 1 and 2 several interface points contribute to the same buffer

location, and in level 2 there are even interior points that map to

interface points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6.2 Benchmark molecules. . . . . . . . . . . . . . . . . . . . . . . . . . . 60

6.3 Running time and speedup . . . . . . . . . . . . . . . . . . . . . . . 61

6.4 Scalability Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

7.1 Communication scheme from GPU A to GPU B. Each GPU hosts a

small portion of the system, referencing the data by local IDs. Local

IDs are translated to global data IDs and sent to the second GPU.

After data reception, a translation to local IDs is performed. . . . . 69



7.2 Benchmark molecules. . . . . . . . . . . . . . . . . . . . . . . . . . . 71

7.3 Data size communications for DHFR_844 along 100 steps. . . . . . . 72

7.4 Memory allocation for DHFR_555. . . . . . . . . . . . . . . . . . . . 73

7.5 Speedup comparison of the three molecules. Note that the reference

configuration is for 4 GPUs. . . . . . . . . . . . . . . . . . . . . . . 74

7.6 Breakdown of running time (100 steps) for DHFR_844. . . . . . . . 75

7.7 Benchmark molecule, composed of 32 copies of 1VT4 in 4x4x2 configu-

ration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Dedicated to my family and friends; without them this would not have been possible

Wanderer, there is no path; the path is made by walking.

Antonio Machado, Campos de Castilla, 1912.


Chapter 1

Introduction

1.1 Motivations and scope of the work

Molecular dynamics simulations [29] are computational approaches for studying the

behavior of complex biomolecular systems at the atom level, estimating their dynamic

and equilibrium properties which can not be solved analytically. Their most direct

applications are related to identifying and predicting the structure of proteins, but

they also provide a tool for drug or material design.

These simulations recreate the movements of atoms and molecules due to

their interactions for a given period of time. Molecular dynamics simulations enable

the prediction of the shape and arrangement of molecular systems that cannot be

directly observed or measured, and they have demonstrated their impact on

applications of drug and nanodevice design [29]. However, they are limited by size

and computational time due to the current available computational resources.

Molecular dynamics is a computationally expensive problem, due to both

high temporal and high spatial resolution. For instance, simulating just one nanosec-

ond of the motion of a well known system with 92 224 atoms (ApoA1 benchmark)

using only one processor takes up to 14 days [19].

The trajectories and arrangements of molecules over temporal scales in the



order of 1 µs are dictated by vibrations taking place at scales as fine as 1 fs = 10⁻¹⁵ s;
therefore, effective analysis requires the computation of many simulation steps. At

the same time, meaningful molecular systems are often composed of even millions

of atoms. Most importantly, the motion of atoms is affected by distant electrostatic

potentials, which makes molecular dynamics an n-body problem with quadratic cost.

The simulation times of molecular dynamics can be reduced thanks to

algorithms that update atoms in a parallel way. Such algorithms were initially

implemented on multi-CPU architectures, such as multicore processors or com-

puter clusters with several computing nodes connected by a local area network

(LAN) [13, 23, 4]. More recent alternatives have used hybrid GPU-CPU architec-

tures to provide parallelism [32], taking advantage of the massive parallel capabilities

of GPUs. This approach interconnects several computing nodes, each one with one

or more GPUs serving as co-processors of the CPUs [16, 12]. The compute power of

this approach is bounded by the cost to transfer data between CPUs and GPUs and

between compute nodes.

Typical solutions to molecular dynamics separate short-range forces, which

are computed exactly, from long-range ones, and approximate such long-range forces.

The Particle Mesh Ewald (PME) method [5] is probably the most popular

approximation to long-range molecular forces, and it discretizes atom charges on

a grid, computes a grid-based potential using an FFT, and finally interpolates the

potential back to the atoms. Its cost is dominated by the FFT, which yields an

asymptotic complexity O(N log N ).

Molecular dynamics computations can be further accelerated through paral-

lel algorithms, including massive parallelization on GPUs [27, 11], or even multi-GPU

parallelization [20]. The PME method is suited for single GPU parallelization, but

not for distributed computation, thus limiting the scalability of long-range molecular

dynamics.

1.2 State of the art

In Molecular Dynamics, two different research lines can be found:

• Speed optimizations

• Improvements in the accuracy of the calculations

These two research lines are apparently conflicting: accurate calculations

result in slower simulations, and speed optimizations usually assume a certain error.

Nowadays, optimizations are focused on improving well known algorithms and devel-

oping newer algorithms in order to take advantage of high performance computers.

There are several molecular simulation algorithms optimized for shared memory sys-

tems, multi-CPU networks and distributed computing. Currently, one of the major

challenges is the use of massively parallel processing architectures like GPUs.

The present Ph.D. thesis was initially motivated by the research initiated by

Plebiotic SL company in collaboration with the Modeling and Virtual Reality Group

(GMRV) of the Universidad Rey Juan Carlos de Madrid. The initial objectives of

Plebiotic SL focused on on-board multi-GPU systems, developing their own molecular

simulator named PleMD. This simulator achieved good simulation times but lacked

scalability, due to the large amount of data shared between CPU and GPU.

The objectives of this PhD thesis aim to exploit both on-board and dis-

tributed multi-GPU architectures to improve speed and scalability, while keeping



accuracy in simulations. The overall objectives can be summarized as follows:

• Design a communication protocol for molecular dynamics that makes use of

several GPUs interconnected by any kind of bus.

• Develop faster and scalable algorithms adapted to multi-GPU environments,

while retaining simulation accuracy.

1.3 Objectives

Typical solutions to molecular dynamics make use of high performance architectures.

In some cases, hybrid CPU-GPU computation is used in order to achieve better

simulation times. However, the CPU keeps the control of the application, using

GPUs as mere coprocessors. The parallel computation capability of modern GPUs

outperforms CPUs, but it is limited by CPU-GPU communications. Novel GPU

architectures support direct communication between GPUs, even mounted on the

same board [30]. These features enable the use of GPUs as the central compute

nodes of parallel molecular dynamics algorithms, and not just as mere co-processors,

thereby reducing the communication load.

Thus, the central aim of this work is to demonstrate the feasibility of


using GPUs as autonomous computing nodes, developing communication
and data management algorithms that are executed completely on GPU,
along with molecular simulation algorithms adapted to distributed multi-
GPU environments. In order to demonstrate this initial hypothesis, the design

and implementation of algorithms should be carried out with scalability and light

data communication in mind.



The proposed approach has the following objectives:

• Improvement of simulation times in multi-GPU environments.

• Improvement of the current molecular dynamics algorithms, adapted to GPUs

and distributed architectures.

• Improvement of the simulation of molecular systems formed by millions of

atoms, while retaining scalability.

To fulfill the previous objectives, the following milestones were proposed:

• The design and implementation of a spatial partition algorithm that performs

regular partitions of the simulation system. Each partition is kept on each

GPU of the system.

• The definition of shared areas between partitions that maintain data coherency,

named interfaces.

• The design and implementation of a parallel algorithm for the setup of interface

data packages to be transferred between GPUs. This algorithm will run entirely

on GPU.

• The adaptation of the current parallel molecular dynamics algorithms based

on a task partition method to a spatial partition method.

• The definition of a distributed method for slow (long-range) force computations.

As a summary, this work provides novel ways to exploit multi-GPU en-

vironments for molecular simulation. The thesis proposes a scalable solution that

increases performance of challenging data-intensive molecular dynamics simulations.



1.4 Document organization

The rest of this document is organized into three blocks:

• State-of-the-art: The aim of this section is to provide a general overview of

up-to-date work in molecular dynamics.

– Chapter 2: Presents some of the best known methods and concepts of

molecular dynamics.

– Chapter 3: Explains some of the best known parallel algorithms used in

the implementation of several molecular dynamics simulators and presents

some of the most popular software solutions for molecular simulation.

• Problem statement and proposal: This part includes all the contributions pre-

sented in this thesis.

– Chapter 4: Defines the main problems to be solved. The milestones pre-

sented in the previous section are explained in this chapter.

– Chapter 5: Describes how the first milestone was achieved. It covers in

depth how the space of simulation is divided, and how it is distributed in

an on-board multi-GPU environment. Only fast and medium forces are

considered.

– Chapter 6: Focuses on long-range forces for on-board multi-GPU sys-

tems. It presents an optimization of the multilevel summation method

(MSM), achieving better simulation times than the original method.

– Chapter 7: Describes the last milestone. It presents a scalable molecular

dynamics algorithm implementation for a distributed network of GPUs.



• Conclusions and future work: The last part discusses the conclusions of this

doctoral thesis and the open lines of this proposal.

– Chapter 8: Summarizes the main achievements of this work. It also

presents some of the future research lines to expand the capabilities of

the simulator developed.


Part I

STATE-OF-THE-ART
Chapter 2

Molecular dynamics

2.1 Introduction

In computer simulations of molecular dynamics, atoms are modeled as particles in a

virtual 3D spatial coordinate system. In biological systems, the molecules of interest

are surrounded by water molecules, and periodic boundary conditions are imposed

on the simulation volume, i.e., the simulation volume is implicitly replicated infinite

times. A more comprehensive description of the basics of molecular dynamics can

be found in [29].

The motion of atoms is computed by solving Newtonian mechanics under

the action of three types of forces: bonded forces, non-bonded short-range forces

(composed of Van der Waals forces and electrostatic interactions between atoms

closer than a cutoff radius Rc), and non-bonded long-range forces (consisting of

electrostatic interactions between atoms separated by a distance greater than Rc ).

The simulation time is divided into steps of very small size, in the order of

1 fs = 10⁻¹⁵ s. Given atom positions Xi and velocities Vi at time Ti, the simulation

algorithm evaluates the interaction forces and integrates them to obtain positions

Xi+1 and velocities Vi+1 at time Ti+1 .



2.2 Bonded interaction

A chemical bond represents the attraction between two atoms that form a chemical

connection. These types of bonds are related to the charge and number of electrons

that atoms may share or transfer. There are several types of bonds, depending on

the number of atoms that form the bond and its geometry.

In molecular dynamics, all bonded forces are considered as short range

forces. That is because bonds exist in groups of two or more atoms closer than a

cutoff radius. Bonded force interactions should be calculated for each atom in the

bond. Algorithm 1 shows the pseudocode for the serial algorithm.

Algorithm 1 Serial algorithm for bonded force computation

1: procedure computeBondedForces(transferIDs)
2:   for atom in atoms do
3:     for bond in bonds[atom] do
4:       for bondedAtom in bond.bondedAtoms do
5:         if atom != bondedAtom then
6:           atom.forces += getBondedForces(atom, bondedAtom)
7:         end if
8:       end for
9:     end for
10:  end for
11: end procedure

The following subsections describe four types of bonds used in molecular

dynamics to describe bonded interactions: Simple bonds, Angles, Dihedrals and

Impropers.

2.2.1 Bonds

The bonds between two atoms are described by simple harmonic springs. The energy

between two atoms i and j is given by:


Ebond = k (|rij| − r0)²

• k: force constant of the spring that bonds both atoms.

• |rij|: distance between atoms i and j.

• r0: distance between atoms i and j at rest (equilibrium bond length).
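As a small worked example, the following sketch (CUDA-style C++ with illustrative parameter names and units) evaluates the harmonic bond energy defined above and accumulates the resulting forces on the two bonded atoms; it is only a sketch of the formula, not the simulator's implementation.

#include <cuda_runtime.h>
#include <math.h>

// Harmonic bond term: E = k (|rij| - r0)^2, with the force derived from the
// gradient of E and applied with opposite signs to the two bonded atoms.
__host__ __device__ inline float bondEnergyAndForce(const float3 ri, const float3 rj,
                                                    float k, float r0,
                                                    float3* fi, float3* fj)
{
    float3 rij = make_float3(ri.x - rj.x, ri.y - rj.y, ri.z - rj.z);
    float dist = sqrtf(rij.x * rij.x + rij.y * rij.y + rij.z * rij.z);
    float delta = dist - r0;                 // stretch with respect to the rest length
    float energy = k * delta * delta;        // E_bond
    float coeff = -2.0f * k * delta / dist;  // -dE/d|rij| projected along rij
    fi->x += coeff * rij.x;  fi->y += coeff * rij.y;  fi->z += coeff * rij.z;
    fj->x -= coeff * rij.x;  fj->y -= coeff * rij.y;  fj->z -= coeff * rij.z;
    return energy;
}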

2.2.2 Angles

Angles describe a bond formed by three atoms. These bonds are defined as angular harmonic springs. The energy of an angle bond formed by three atoms (i, j and k) is described as follows:

Eangle = kθ (θ − θ0)² + kub (|rik| − rub)²

• kθ: constant of the harmonic angle spring that bonds the three atoms.

• θ: angle formed by the vector that connects i and j and the vector that connects k and j.

• θ0: the same angle at rest.

• kub, rub: Urey-Bradley constant and rest distance between atoms i and k.

2.2.3 Dihedrals and Impropers

Dihedral and Improper bonds describe the interaction between four linked atoms. These bonds are modeled by an angle spring between the planes formed by the first 3 atoms (i, j and k) and the second set of 3 atoms (j, k and l). The energy for a dihedral or improper angle between the atoms i, j, k and l is given by:

Ed/i = k (1 + cos(nφ + θ))

if n > 0, or

Ed/i = k (φ − θ)²

if n = 0.

2.3 Non-Bonded Electrostatic Interactions

Electrostatic forces are considered long-range interactions between atoms separated

by a large distance. Electrostatic energy describes the force resulting from the inter-

action between charged particles. The resulting energy between two atoms i and j is

described by Coulomb's law:

E = ε14 · (C qi qj) / (ε0 |rij|)

• ε14: scale factor for 1-4 interactions (pairs of atoms connected by three bonds). It is zero for 1-2 and 1-3 interactions (pairs of atoms connected by one and two bonds, respectively) and is equal to 1.0 for any other interaction.

• C = 2.31 × 10⁻¹⁹ J·nm

• qi, qj: charges of atoms i and j, respectively.

• ε0: dielectric constant.

• |rij|: distance between atoms i and j.


As already mentioned, the computation of electrostatic forces is divided

into two types, short range and long range, which are treated separately.

2.4 Van Der Waals

The Van der Waals interactions describe the force resulting from the interaction of

atoms. The Van der Waals energy between two atoms i and j is described as follows:

Evdw = A / rij¹² − B / rij⁶

The A and B constants are precomputed using the parameters σij and εij, which in turn are precomputed from the σ and ε values of the individual atoms. Those are input constants for each type of atom. This is the entire equation sequence:

σij = (σi + σj) / 2

εij = √(εi εj)

A = 4 σij¹² εij

B = 4 σij⁶ εij

Like electrostatic forces, Van der Waals forces are also considered long or medium range. These interactions happen between atoms that may be separated by a large distance. However, these forces decay faster than electrostatic forces, so it is possible to establish a cutoff distance after which the force is negligible. For this reason, it is sometimes possible to consider only their medium-range part.
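Putting the two non-bonded terms together, a minimal sketch of the non-bonded energy for one pair of atoms lying inside the cutoff radius could look as follows; the names (NonBondedParams, coulombC, eps0, scale14) are illustrative assumptions, and the A/B coefficients follow the combination rules given above.

#include <math.h>

struct NonBondedParams { float charge, sigma, epsilon; };

// Pairwise non-bonded energy: Lennard-Jones (A/r^12 - B/r^6) plus Coulomb term.
inline float pairEnergy(const NonBondedParams& a, const NonBondedParams& b,
                        float dist, float coulombC, float eps0, float scale14)
{
    // Van der Waals part with per-pair A and B built from sigma/epsilon.
    float sigma = 0.5f * (a.sigma + b.sigma);
    float eps   = sqrtf(a.epsilon * b.epsilon);
    float s6    = powf(sigma, 6.0f);
    float A     = 4.0f * eps * s6 * s6;
    float B     = 4.0f * eps * s6;
    float r6    = powf(dist, 6.0f);
    float evdw  = A / (r6 * r6) - B / r6;

    // Electrostatic part following the notation of Section 2.3.
    float eelec = scale14 * coulombC * a.charge * b.charge / (eps0 * dist);
    return evdw + eelec;
}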



2.5 Integrators

2.5.1 Verlet

Molecular dynamics often uses second-order integrators, such as Leapfrog and Verlet, which offer greater stability than Euler methods. In the following, the integration algorithm is implemented using the Velocity Verlet scheme, similar to the Leapfrog method, but where positions, velocities and forces are obtained at the same time value.

x(t + ∆t) = x(t) + v(t)∆t + (1/2m) F(t)∆t²            (2.1)

v(t + ∆t/2) = v(t) + (1/2m) F(t)∆t                    (2.2)

v(t + ∆t) = v(t + ∆t/2) + (1/2m) F(x(t + ∆t))∆t       (2.3)

where x is the position vector, v the velocity vector and F the force vector. Since F(t + ∆t) does not depend on v, the half-step update of equation 2.2 can be completed by equation 2.3 once the forces at the new positions are available. This integration scheme is applied to every atom in the system.
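A minimal sketch of one integration step following equations (2.1)-(2.3) is shown below; the flat array layout and the caller-supplied force routine are illustrative assumptions, not the thesis implementation.

#include <cstddef>

// x, v, f hold 3*N floats; computeForces(x, f, n) refills f from the new positions.
typedef void (*ForceFn)(const float* x, float* f, std::size_t n);

void velocityVerletStep(float* x, float* v, float* f, const float* mass,
                        std::size_t n, float dt, ForceFn computeForces)
{
    for (std::size_t i = 0; i < n; ++i) {
        float invM = 1.0f / mass[i];
        for (int d = 0; d < 3; ++d) {
            std::size_t k = 3 * i + d;
            // Eq. (2.1): advance positions with the current velocity and force.
            x[k] += v[k] * dt + 0.5f * invM * f[k] * dt * dt;
            // Eq. (2.2): half-step velocity update with the old force.
            v[k] += 0.5f * invM * f[k] * dt;
        }
    }
    computeForces(x, f, n);          // forces at the new positions
    for (std::size_t i = 0; i < n; ++i) {
        float invM = 1.0f / mass[i];
        for (int d = 0; d < 3; ++d) {
            std::size_t k = 3 * i + d;
            // Eq. (2.3): complete the velocity update with the new force.
            v[k] += 0.5f * invM * f[k] * dt;
        }
    }
}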

2.5.2 Respa

RESPA (REference System Propagator Algorithm) is a method of integration with

multiple step times (Multi-time step). This method tries to avoid computing long-

range forces for every time step. While standard integration methods require the

calculation of all forces, both of short and long range, RESPA establishes a relation-

ship between the number of times the short and long range forces are calculated.

The Van der Waals and electrostatic forces take considerably more com-

puting time and also allow a longer time step than the bonded forces. In turn,

the long-range electrostatic forces allow a longer time step with respect to the Van

der Waals. Algorithm 2 shows a pseudo code where forces have been divided into

hard forces (fh), medium forces (fm) and soft forces (fs), and their respective time steps are ∆th, ∆tm and ∆ts. Velocity, coordinates and mass are denoted by v, r and m, respectively.

Algorithm 2 Generic RESPA multiple time step algorithm

1: for i = 1 to S do
2:   v = v + fs ∆ts / (2m)
3:   for j = 1 to M do
4:     v = v + fm ∆tm / (2m)
5:     for k = 1 to H do
6:       v = v + fh ∆th / (2m)
7:       r = r + v ∆th
8:       fh = ComputeHardForces()
9:       v = v + fh ∆th / (2m)
10:    end for
11:    fm = ComputeMediumForces()
12:    v = v + fm ∆tm / (2m)
13:  end for
14:  fs = ComputeSoftForces()
15:  v = v + fs ∆ts / (2m)
16: end for

This algorithm evaluates fs only S times, while fm is evaluated S×M times and fh S×M×H times. The hard forces correspond to the bonded forces, the medium forces correspond to the Van der Waals and electrostatic forces computed within the cutoff distance, and the soft forces are the electrostatic forces computed beyond the cutoff distance. Furthermore, for the integration of velocity and position we have used the Verlet scheme explained in the previous section.


Chapter 3

Parallel Molecular Dynamics

In this chapter we present a review of the state of the art in computer-driven simu-

lations of molecular dynamics, and more specifically in the two main topics covered

in this thesis. After a general overview of parallel techniques adapted to molecular

dynamics simulations, we present related work on the topic of parallel molecular

dynamics simulators. The last section of this chapter presents an analysis of the

best known molecular dynamics simulators, showing how some algorithms have been

adapted to multi-CPU and multi-GPU parallel architectures.

3.1 Parallelization Techniques

In computer-driven molecular dynamics simulations, atoms are contained within a

virtual 3D coordinate system that models the real environment inside a specific

volume. This section presents some of the best-known techniques used to speed up

simulations.

Several optimizations can be found in order to take advantage of parallel

computer architectures. In the following sections we summarize some of the most

common techniques used for each kind of atom interaction.



3.1.1 Bonded Forces

Bonded forces represent the interaction between a group of atoms linked by some kind of bond. For each bond, the energy that affects the atoms involved is computed by applying the force model described in Section 2.2. Since the energy calculations

for each bond are independent of each other, and the workload is similar within the

bonds of the same type, the most popular way to parallelize this computation is by

using a task subdivision. This method is easily parallelizable using both multi-CPU

and GPU parallel architectures.
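As an illustration of this task subdivision, a minimal CUDA sketch with one thread per bond, assuming a simple harmonic bond term U = ½ k (d − d0)²; the data layout and force model are illustrative, not the actual ones of Section 2.2:

    struct Bond { int a, b; float k, d0; };   // atom indices, stiffness, rest length

    // One thread per bond; forces are accumulated atomically because several
    // bonds may write to the same atom.
    __global__ void harmonicBondForces(const Bond* bonds, const float3* x,
                                       float3* f, int nBonds)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nBonds) return;

        Bond bd = bonds[i];
        float3 d = make_float3(x[bd.b].x - x[bd.a].x,
                               x[bd.b].y - x[bd.a].y,
                               x[bd.b].z - x[bd.a].z);
        float dist  = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
        float coeff = bd.k * (dist - bd.d0) / dist;   // magnitude over distance

        atomicAdd(&f[bd.a].x,  coeff * d.x);
        atomicAdd(&f[bd.a].y,  coeff * d.y);
        atomicAdd(&f[bd.a].z,  coeff * d.z);
        atomicAdd(&f[bd.b].x, -coeff * d.x);
        atomicAdd(&f[bd.b].y, -coeff * d.y);
        atomicAdd(&f[bd.b].z, -coeff * d.z);
    }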

3.1.2 Non-bonded forces

Non-bonded forces decay rapidly with distance, so only the interactions between atoms closer than a cutoff radius (Rc) are accurately calculated. Interactions between atoms separated by a distance greater than Rc are approximated, in order to accelerate the simulations. These optimizations can reduce the computational cost from O(N^2) to O(N).

3.1.2.1 Short range

The computation of short-range non-bonded forces can be accelerated using a regular

grid, which is updated at a lower rate than the simulation. This method is known as

cell list [26, 33, 11, 29]. In this algorithm, the simulation volume is divided into cells or three-dimensional boxes whose dimension is given by the cutoff radius (Rc).

Van der Waals and short-range electrostatic forces are calculated between pairs of

atoms that are in the same cell or a neighboring cell.
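A minimal CUDA sketch of the cell assignment behind this method, assuming an axis-aligned, periodic simulation box (names and the exact cell size are illustrative; the multi-GPU algorithm presented later uses a resolution of Rc/2):

    // Assign each atom to a cell of side cellSize (typically Rc or Rc/2).
    // One thread per atom; cellOfAtom can later be used to sort atoms by cell.
    __global__ void assignCells(const float3* x, int* cellOfAtom,
                                float3 boxMin, float cellSize,
                                int3 gridDim3, int nAtoms)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nAtoms) return;

        int cx = (int)floorf((x[i].x - boxMin.x) / cellSize);
        int cy = (int)floorf((x[i].y - boxMin.y) / cellSize);
        int cz = (int)floorf((x[i].z - boxMin.z) / cellSize);

        // Wrap indices for periodic boundary conditions.
        cx = (cx % gridDim3.x + gridDim3.x) % gridDim3.x;
        cy = (cy % gridDim3.y + gridDim3.y) % gridDim3.y;
        cz = (cz % gridDim3.z + gridDim3.z) % gridDim3.z;

        cellOfAtom[i] = (cz * gridDim3.y + cy) * gridDim3.x + cx;
    }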



As with bonded forces, short-range forces are easily parallelizable using task-based decomposition methods. First, the interactions between pairs of neighboring cells can be processed independently. Then, the interactions between each atom of one cell and each atom of the other cell can also be computed in parallel. This takes

advantage of massively parallel architectures such as GPUs or clusters, using spatial

partitioning techniques.

However, atoms move between cells, so these structures need to be updated periodically during the simulation. This introduces small time lags in the simulation, hindering optimal scalability. In the case of distributed memory architectures interconnected by a network, the amount of information shared between different computing nodes can be very large. In such cases the objective is to reduce the amount of information transferred to prevent bottlenecks.

3.1.3 Long Range Non-bonded Forces

There are many approaches to improve the quadratic cost of long-range molecular

dynamics, either using approximate solutions or parallel implementations (see [25] for

a survey). Massively parallel solutions on GPUs have also been proposed, although

GPUs are mostly used as co-processors [27].

Particle Mesh Ewald (PME) [5] is the most popular method to compute

long-range molecular forces. Lattice Ewald methods solve the long-range potential

on a grid using an FFT. Regular PME uses spectral differentiation and a total of

four FFTs per time step, while Smooth PME (SPME) [7] uses B-spline interpolation

reducing the number of FFTs to two. PME is widely used in parallel molecular dy-

namics frameworks such as NAMD [27], GROMACS [12] or ACEMD [11]. PME can

be massively parallelized on a single GPU, but it is difficult to distribute over multi-



ple GPUs due to the all-to-all communication needed by the FFT. However, Nukada et al. [21] propose a scalable multi-GPU 3D FFT to minimize all-to-all communications. Cerutti et al. [3] proposed Multi-Level Ewald (MLE) as an approximation to SPME by decomposing the global FFT into a series of independent FFTs over separate regions of a molecular system, but they did not conduct a scalability analysis.

Other long-range force approximations are based on multigrid algorithms.

Multigrid approaches utilize multiple grid levels with different spatial resolutions to

compute long-range potentials with O(N ) cost. In molecular dynamics, multigrid

methods have been demonstrated to be superior to other methods [31], such as the

Fast Multipole Method (FMM) [37], because they achieve better scalability while

keeping acceptable error levels. The Meshed Continuum Method (MCM) [1] and

Multilevel Summation Method (MSM) [9] are the two most relevant multigrid meth-

ods for long-range force computation. MCM uses density functions to sample the

particles onto a grid and calculates the potential by solving a Poisson equation in a

multigrid fashion. On the other hand, MSM calculates the potential directly on a

grid by using several length scales. The scales are spread over a hierarchy of grids,

and the potential of coarse levels is successively corrected by contributions from finer levels up to the finest grid, which yields the final potential. This approach offers better prospects for scalability than PME or other multigrid algorithms. MSM has been massively parallelized on a single GPU [8], although the performance of that implementation is notably worse than PME's.

Next, we describe MSM in more detail, as it is our method of choice for the

acceleration of long-range force computation.



3.1.4 The Multilevel Summation Method

For a particle system with charges {q_1, . . . , q_N} at positions {r_1, . . . , r_N}, the electrostatic potential energy is

\[ U(r_1, \dots, r_N) = \frac{1}{2} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{q_i q_j}{\lVert r_i - r_j \rVert}. \tag{3.1} \]

Its exact computation has O(N^2) complexity.

MSM is a fast algorithm for computing an approximation to the electrostatic

interactions with just O(N ) computational work. MSM splits the potential into

short-range and long-range components. The short-range component is computed

as a direct particle-particle interaction while the long-range one is approximated

through a hierarchy of grids.

For the long-range component, the method first distributes atom charges onto the finest grid. This process is called anterpolation. A nodal basis function φ(r) with local support about each grid point is used to distribute charges. Once all atom charges are distributed onto the finest grid, charges are distributed onto the

next coarser grid, using the same basis functions. This process is called restriction,

and it is repeated until the coarsest grid is reached.

Figure 3.1 depicts the full MSM method. On each level, the method computes direct sums of nearby grid charges up to a radius of ⌊2Rc/h0⌋ grid points, where h0 is the resolution of the finest grid. Hardy and Skeel [9] indicate that a resolution h0 between 1 Å and 3 Å is sufficient for molecular dynamics simulations. Note that the resolution is halved on each coarser grid, hence direct sums cover twice

the distance with the same number of points.

Figure 3.1: Diagram showing the major operations of MSM. The bottom level represents the atoms, and higher levels represent coarser grids.

The direct sum of pairwise charge potentials is analogous to the one for short-range non-bonded forces, with the exception that grid distances are fixed and can be computed as preprocessing, hence the computation is simply an accumulation of weighted grid charges.

A GPU-optimized version of the direct sum was developed by Hardy et al. [8]. The weighted grid is stored in constant memory and the charges in shared memory. A sliding window technique is used to achieve efficient memory reads. Hardy's algorithm computes the finest levels on the GPU, while the coarsest levels are computed on the CPU.

Once direct sums are computed on each level, potentials are interpolated from coarse to finer levels, and contributions from all levels are accumulated. This process is called prolongation. Finally, potentials from the finest grid are interpolated on the atoms.
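As a structural outline only (the per-level routines are empty placeholders standing for the corresponding kernels, with hypothetical names and a fixed toy grid size), the flow just described can be sketched as:

    #include <vector>
    using Grid = std::vector<float>;   // flattened charge/potential grid of one level

    // Placeholder level operations; in the real method these are GPU kernels.
    Grid anterpolate(const std::vector<float>& atomCharges)      { return Grid(64); }
    Grid restrictTo(const Grid& fine)                            { return Grid(fine.size() / 8); }
    Grid directSum(const Grid& q)                                { return Grid(q.size()); }
    void prolongateAdd(const Grid& coarse, Grid& fine)           { /* add interpolation of coarse */ }
    void interpolateToAtoms(const Grid& finest, std::vector<float>& atomPotential) {}

    // Structure of MSM's long-range part: anterpolation, restrictions,
    // per-level direct sums, prolongations, and final interpolation.
    void msmLongRange(const std::vector<float>& charges,
                      std::vector<float>& potential, int nLevels)
    {
        std::vector<Grid> q(nLevels), V(nLevels);
        q[0] = anterpolate(charges);                   // atoms -> finest grid
        for (int l = 0; l + 1 < nLevels; ++l)
            q[l + 1] = restrictTo(q[l]);               // finer -> coarser charges
        for (int l = nLevels - 1; l >= 0; --l) {
            V[l] = directSum(q[l]);                    // local weighted sums on level l
            if (l + 1 < nLevels)
                prolongateAdd(V[l + 1], V[l]);         // accumulate coarser potentials
        }
        interpolateToAtoms(V[0], potential);           // finest grid -> atoms
    }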

Multigrid methods have been used extensively in a variety of scientific fields, but molecular dynamics suffers the added difficulty of dealing with periodic boundary

conditions. Izaguirre and Matthey [15] developed an MPI-based parallel multigrid

summation on clusters and shared-memory computers for n-body problems.

Chapter 6 presents a solution for long-range molecular dynamics on multi-

GPU platforms, and those improvements could be extended to other types of n-body

problems.

3.2 Parallel molecular simulators

This section presents some of the most popular software solutions for molecular

simulation. All of them use some of the techniques described in the previous section,

adapted in some way to parallel architectures: Multi-CPU, GPUs and even clusters.

Several authors have proposed ways to parallelize molecular dynamics algo-

rithms on hybrid CPU-GPU architectures [18, 36]. Very recently, Rustico et al. [28]

have proposed a spatial partitioning approach for multi-GPU particle-based fluid

simulation, which shares many features with molecular dynamics.

3.2.1 NAMD

NAMD [27] performs a spatial partition of the system. Each partition is allocated to a computing node, which might be one core in a multi-CPU machine, or a distributed node in a cluster. These subdivisions are known as patches; each patch keeps information about the atoms within it and about the neighboring patches that need shared data. NAMD then defines work tasks and distributes these tasks among the available CPUs on each computing node. Tasks are defined as interactions between patches; if a computing node needs data from a patch that does not belong to it, the task makes a copy of the necessary data before being assigned. To speed up the computation of non-bonded short-range forces, GPUs are used extensively. NAMD creates smaller tasks and copies the necessary data to the GPUs. Then, it launches the necessary GPU kernels in order to perform the simulation, and finally it copies back

the results to CPU memory. By using this scheme, GPUs are seen as massively

parallel co-processors.

3.2.2 GROMACS

GROMACS [12] performs a spatial partitioning on the molecular system to distribute

it on a multi-core architecture. The system is subdivided into a staggered grid, with

zones for information sharing between nodes. CPUs may use GPUs as co-processors

to speed up force computations, in a similar way to NAMD. Only non-bonded short-range forces are computed on GPUs, which requires uploading data from CPU to GPU and copying results back from GPU to CPU.

3.2.3 ACEMD

ACEMD [11] performs GPU-parallel computation of the various forces in a molec-

ular system, and each type of force is handled on a separate GPU. This approach

exploits on-board multi-GPU architectures, but its scalability is limited because all

communications are handled through the CPU.


Part II

PROBLEM STATEMENT AND PROPOSAL
Chapter 4

Problem Statement

4.1 A grand challenge problem

The problems presented by molecular simulation systems are included within the Grand challenge problems [34]. Not only are simulation times long, but these simulations also need large RAM resources to host the molecular system. An enormous volume of data is required to simulate a system formed by several million atoms, requiring high-performance computing resources over a long time to obtain results.

The Grand challenge problems are solved in supercomputing centers,

which have large amounts of resources at their disposal. A supercomputing cen-

ter usually has thousands of nodes interconnected by a high speed network, enough

to host the data of the simulation. However, algorithms must be adapted, and there

are several drawbacks that must be solved to use the full power of these systems.

The following sections summarize some of the architectures currently available that are used for molecular dynamics. Also, our solutions for short-range and long-range molecular dynamics simulations that make use of novel parallel architectures are introduced.



4.2 Novel architectures

GPUs have been used in computer science in a wide range of configurations. At a high level, a GPU is a massively parallel multiprocessor with its own memory hierarchy, separated from the CPU. Nowadays GPUs can have several gigabytes of RAM, so a single GPU can host a large amount of data. Also, if the mainboard has enough slots, a single computer can host several GPUs interconnected by a high-speed bus, effectively becoming a small hybrid multicomputer node. However, computers that integrate several GPUs are usually very expensive, so only 2 to 4 GPUs are used in most single-node configurations.

It is easier to have a large number of GPUs in clusters. In recent years, high-performance parallel architectures have increased their performance by integrating one or two GPUs on each node. Hybrid CPU-GPU systems are available in most supercomputing centers, interconnected by high-speed networks. These configurations allow scaling the performance of the system, but are severely limited by the communications between nodes.

In a system with several GPUs, PCI-Express 4.0 buses allow up to 31GB/s communications. However, this is not the case on most of the multi-GPU clusters available. In small research centers, it is easy to find small clusters interconnected by Gigabit Ethernet networks, which provide communication speeds up to 125MB/s. In supercomputing centers, Myrinet and InfiniBand networks are used, achieving a maximum throughput of 1.2GB/s and 37GB/s respectively. Despite the good throughput of InfiniBand, only a few supercomputers actually use it, due to its price, and the common communication speeds found in supercomputer networks are close to 6GB/s.

Several applications are possible only by using these parallel architectures. For example, the Human Brain Project is a large 10-year scientific research project which aims to provide a model of the whole brain. To achieve the goals of this project it is necessary to use supercomputing technologies that enable models and simulations of brain information to identify patterns and possible deficiencies that can be remedied with further experimentation. For this purpose various platforms have been developed; for example, CeSViMa (Centro de Supercomputación y Visualización de Madrid) hosts one of the most powerful supercomputers used for this project.

There have been great achievements in recent years by using high-performance hybrid multi-CPU systems for molecular dynamics. In 2013, the NCSA Blue Waters supercomputer was used to perform the simulation of the HIV-1 capsid molecular system [38]. The HIV-1 capsid was formed by 64 million atoms, and 3500 single-core nodes equipped with NVIDIA Tesla K20X GPUs were used to perform 500 ns of simulation time. In real time, it took around 35 days (14 ns/day) to reach results.

However, the exploitation of multi-GPU systems is still under development. In order to get the maximum benefit from those architectures it is necessary to identify the problems of the current implementations of molecular dynamics. As stated in the previous paragraphs, communications between GPUs hosted by different nodes are not as fast as communications within the same motherboard, becoming the bottleneck of the application. The following sections describe the problems covered in this thesis, and give solutions to make better use of new architectures.

Figure 4.1: HIV1 Capsid, 64 million atoms total including solvent

4.3 Multi-GPU communications bottleneck

As stated in the previous section, a priori a multi-GPU system with two or more graphics processors connected via a fast bus is capable of providing a highly efficient and scalable solution. However, in most of the current implementations it is the CPU which hosts most of the program logic, using the GPUs as coprocessors. With this scheme, if several GPUs are present, the CPU must take care of uploading and downloading data to each GPU's memory, wasting time in communications.

In molecular dynamics, a large amount of data must be shared between

computing nodes in order to maintain simulation consistency. A better scheme must

be developed in order to achieve better simulation times, by reducing GPU-CPU



communications. The next subsections explain the contributions in this area.

4.3.1 Contribution: Direct GPU-GPU communications

In a multi-GPU environment, each GPU can be seen as a complete computing node,

so direct GPU-GPU communication can be used instead of uploading/downloading

data from GPU to CPU. One of the problems of this scheme lies in the fact that GPUs are massively parallel architectures that run their own code in parallel with the CPU. Each GPU should be able to identify the data it shares with other GPUs, package it

and perform data exchange with neighboring GPUs. Those methods should be fast

and scalable, in order to allow better simulation times.

This work tries to find a solution by using direct GPU-GPU data communications, focused on scalability and performance. Our first goal is to parallelize a

state-of-the-art molecular dynamics algorithm by employing a spatial partitioning

approach to simulate the dynamics of one portion of a molecular system on each

GPU. Chapter 5 is focused on parallelizing bonded and short-range non-bonded

forces on an on-board multi-GPU system, providing a novel parallel algorithm to

update the spatial partitioning and set up transfer data packages on each GPU. This

way, better simulation times are achieved, while keeping scalability of the system.

4.3.2 Contribution: Distributed MSM

Long-range non-bonded forces are treated in a different way. Current implementations use PME, which is very fast in single-GPU environments. However, it is not easily portable to distributed memory systems, due to the large amount of data that needs to be shared between nodes. Chapter 6 proposes a solution for long-range molecular dynamics, based on the Multilevel Summation Method (MSM) [15, 8], that can be used in distributed memory environments.

MSM is better suited for a distributed implementation, but it is notably

slower than PME. Chapter 6 also proposes an optimization for this method by re-

placing 3D convolutions with FFTs, making the performance of MSM on a single

GPU comparable to that of PME. Finally, a distributed-memory implementation is

proposed in order to make use of several GPUs.

4.4 Solutions for memory scalability

In order to simulate larger molecular systems, it is necessary to achieve memory scalability along with better simulation times. The more GPUs available, the more subdivisions are made, each taking less memory on its node. This way, it is possible to use a cluster of computers to simulate systems that cannot be simulated on a single node.

The contributions presented in Chapter 5 and Chapter 6 achieve good speed-ups, but they store translation tables that grow with the size of the molecular system and are not scalable in memory. Chapter 7 proposes a solution by using GPU hash tables instead of static memory arrays, saving memory while keeping the scalability of the system. In addition, a version for distributed multi-GPU cluster systems that makes use of direct GPU-GPU communications is proposed.


Chapter 5

On-Board Multi-GPU

Short-Range Force Computation

This chapter presents a parallel algorithm for the solution of short-range molecular

dynamics on on-board multi-GPU architectures. Previous works use GPUs as coprocessors: the CPU acts as the primary processor, keeping data in the host's RAM. Data is copied to the GPU when a kernel is launched, and results are copied

back. This solution presents some scalability problems due to the amount of in-

formation shared between GPU and CPU. The aim of this chapter is to present an

alternative by using the GPUs as independent computing nodes, reducing CPU-GPU

and GPU-GPU communications.

This chapter is focused on bonded and non-bonded short-range force computations. The state-of-the-art molecular dynamics algorithm (Chapter 2) is par-

allelized at two levels: First, a spatial partitioning is performed to simulate the

dynamics of one portion of a molecular system on each GPU, and we take advantage

of direct communication between GPUs to transfer necessary data among portions.

Second, we parallelize the simulation algorithm to exploit the multi-processor com-

puting model of GPUs.

Section 5.2 presents a novel parallel algorithm to update the spatial par-

titioning and set up transfer data packages on each GPU. The molecular dynamics

simulations are parallelized at two levels. At the high level, we present a spatial

partitioning approach to assign one portion of a molecular system to each GPU. At

the low level, we parallelize on each GPU the simulation of its corresponding portion.

Most notably, we present algorithms for the massively parallel update of the spatial

partitions and for the setup of data packages to be transferred to other GPUs.

5.1 Algorithm Overview

In contrast to previous parallel molecular dynamics algorithms, we propose a two-

level algorithm that partitions the molecular system, and each GPU handles in a

parallel manner the computation and update of its corresponding portion, as well as

the communications with other GPUs.

To solve the dynamics, we use a generic Verlet/RESPA MTS integrator as described in Section 2.5. In our examples, we have used a time step ∆t = 2 fs

for short-range non-bonded forces, and we update bonded forces nStepsBF = 2


times per time step. We accelerate short-range non-bonded forces using the cell-list

method, with a grid resolution of Rc /2. The cell-list data structure can be updated

and visited efficiently on a GPU using the methods in [35].

We partition the molecule using grid-aligned planes, thus minimizing the width of interfaces and simplifying the update of partitions. We partition the simulation domain only once at the beginning of the simulation, and then update the partitions by transferring atoms that cross borders.

Figure 5.1: Comparison of binary (a) vs. linear spatial partitioning (b). The striped regions represent the periodicity of the simulation volume.

Figure 5.2: The different types of cells at the interface between two portions of the simulation volume.

We have tested two partitioning approaches with different advantages:

• Binary partition (Figure 5.1a): we recursively halve molecule portions using planes orthogonal to their largest dimension. Each portion may have up to 26 neighbors in 3D.

• Linear partition (Figure 5.1b ): we divide the molecular system into regular

portions using planes orthogonal to the largest dimension of the full simulation

volume. With this method, each portion has only 2 neighbors, but the inter-

faces are larger; therefore, it trades fewer communication messages for more

expensive partition updates.



Based on our cell-based partition strategy, each GPU contains three types

of cells as shown in Figure 5.2:

• Private cells that are exclusively assigned to one GPU.

• Shared cells that contain atoms updated by a certain GPU, but whose data

needs to be shared with neighboring portions.

• Interface cells that contain atoms owned by another GPU, and used for force

computations in the given GPU.

Algorithm 3 shows the pseudo-code of our proposed multi-GPU MTS integrator, highlighting in blue with a star the differences w.r.t. a single-GPU version. These differences can be grouped into two tasks: updating the partitions and synchronizing the dynamics of neighboring portions. Once every ten time steps, we update the partitions in two steps:

1. Identify atoms that need to be updated, i.e., atoms that enter shared cells of

a new portion.

2. Transfer the positions and velocities of these atoms.

To synchronize dynamics, we transfer forces of all shared atoms, and then each GPU

integrates the velocities and positions of its private and shared atoms, but also its

interface atoms. Again, we carry out the synchronization in two steps.

1. Identify the complete set of shared atoms after updating the cell-list data struc-

ture.

2. Transfer the forces of shared atoms as soon as they are computed.



Algorithm 3 Multi-GPU Verlet/r-RESPA MTS integrator. The modifications w.r.t. the single-GPU version are highlighted in blue with a star.

1: procedure Step(currentStep)
2:   if currentStep mod 10 = 0 then
3:     * identifyUpdateAtomIds()
4:     * transferUpdatePositionsAndVelocities()
5:     updateCellList()
6:     * identifySharedAtomIds()
7:   end if
8:   integrateTemporaryPosition(0.5 · ∆t)
9:   computeShortRangeForces()
10:  * transferSharedShortRangeForces()
11:  for nStepsBF do
12:    integratePosition(0.5 · ∆t/nStepsBF)
13:    computeBondedForces()
14:    * transferSharedBondedForces()
15:    integrateKickVelocity(∆t/nStepsBF)
16:    integratePosition(0.5 · ∆t/nStepsBF)
17:  end for
18:  currentStep = currentStep + 1
19: end procedure

5.2 Parallel Partition Update and Synchronization

As outlined above, each GPU stores one portion of the complete molecular system

and simulates this subsystem using standard parallel algorithms [35]. In this section,

we describe massively parallel algorithms to update the partitions and to transfer

interface forces to ensure proper synchronization of dynamics between subsystems.

We propose algorithms that separate the identification of atoms whose data needs

to be transferred from the setup of the transfer packages. In this way, we can

reuse data structures and algorithms both in partition updates and force transfers.

Data transfers are issued directly between GPUs, thereby minimizing communication

overheads.

5.2.1 Data Structures

The basic molecular dynamics algorithm stores atom data in two arrays:

• staticAtomData corresponds to data that does not change during the simula-

tion, such as atom type, bonds, electrostatic and mechanical coefficients, etc.

It is sorted according to static atom indices.

• dynamicAtomData that contains position and velocity, a force accumulator,

and the atom's cell. It is sorted according to the cell-list structure, and all

atoms in the same cell appear in consecutive order.

Both arrays store the identifiers of the corresponding data in the other array to

resolve indirections. Each GPU stores a copy of the staticAtomData of the whole

molecule, and keeps dynamicAtomData for its private, shared, and interface cells.

The dynamicAtomData is resorted in each call to the updateCellList procedure, and



the atom identifiers are accordingly reset. Atoms that move out of a GPU's portion

are simply discarded.

In our multi-GPU algorithm, we extend the dynamicAtomData, and store

for each atom a list of neighbor portions that it is shared with. We also define two

additional arrays on each GPU:

• cellNeighbors is a static array that stores, for each cell, a list of neighbor por-

tions.

• transferIDs is a helper data structure that stores pairs of neighbor identifiers and dynamic atom identifiers. This data structure is set during atom identification procedures, and it is used for the creation of the transfer packages (a sketch of a possible layout of these structures follows this list).
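As an illustration only, a possible CUDA/C++ layout of these structures (field names, types, and the neighborMask encoding are assumptions, not the simulator's actual definitions):

    // Illustrative device-side data layout; names and types are assumptions.
    struct StaticAtomData {                // sorted by static atom index
        int   type;                        // atom type (force-field parameters)
        int   dynamicID;                   // index into dynamicAtomData
        float charge, mass;                // electrostatic/mechanical coefficients
        // bond lists, etc.
    };

    struct DynamicAtomData {               // sorted in cell-list order
        float3   position, velocity;
        float3   forceAccumulator;
        int      cellID;                   // cell that contains the atom
        int      staticID;                 // index into staticAtomData
        unsigned neighborMask;             // bit set per neighbor portion sharing the atom
    };

    struct TransferID {                    // one entry per (atom, neighbor) pair
        unsigned neighborID;               // UINT_MAX marks an invalid entry
        unsigned atomID;                   // dynamic atom identifier
    };

    // Per GPU: cellNeighbors stores, for each cell, its list of neighbor portions;
    // transferIDs holds nNeighbors * nAtoms entries, later sorted by neighborID.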

5.2.2 Identication of Transfer Data

Each GPU contains a transferIDs data structure of size nNeighbors · nAtoms, where nNeighbors is the number of neighbor portions, and nAtoms is the number of atoms in its corresponding portion. This data structure is set at two stages of the MTS Algorithm 3, identifyUpdateAtomIds and identifySharedAtomIds. In both cases, we initialize the neighbor identifier in the transferIDs data structure to the maximum unsigned integer value. Then, we visit all atoms in parallel in one CUDA kernel, and flag the (atom, neighbor) pairs that actually need to be transferred. We store one flag per neighbor and atom to avoid collisions at write operations. Finally, we sort the transferIDs data structure according to the neighbor identifier, and the (atom, neighbor) pairs that were flagged are considered as valid and are automatically located at the beginning of the array. We have used the highly efficient GPU-based Merge-Sort implementation in the NVidia SDK 4.5 [24] (5.3 ms to sort an unsorted

array with one million values on a NVidia GeForce GTX580).

Algorithm 4 Identification of atoms whose data needs to be transferred, along with their target neighbor.

1: procedure IdentifyTransferAtomIds(transferIDs)
2:   for atomID in atoms do
3:     for neighborID in cellNeighbors(dynamicAtomData[atomID].cellID) do
4:       if MustTransferData(atomID, neighborID) then
5:         offset = neighborID · nAtoms + atomID
6:         transferIDs[offset].atomID = atomID
7:         transferIDs[offset].neighborID = neighborID
8:       end if
9:     end for
10:  end for
11:  Sort(transferIDs, neighborID)
12: end procedure

Algorithm 4 shows the general pseudo-code for the identification of transfer data. The actual implementation of the MustTransferData procedure depends on the actual data to be transferred. For partition updates, an atom needs to be transferred to a certain neighbor portion if it is not yet present in its list of neighbors. For force synchronization, an atom needs to be transferred to a certain neighbor portion if it is included in its list of neighbors. In practice, we also update the list of neighbors of every atom as part of the identifyUpdateAtomIds procedure.
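A minimal CUDA sketch of the parallel flagging stage of Algorithm 4, using the illustrative structures sketched in the previous section (the transfer criterion shown is the force-synchronization one, encoded here through a hypothetical per-atom neighbor bitmask; a real implementation may differ):

    #define MAX_CELL_NEIGHBORS 26     // up to 26 neighbor portions per cell in 3D

    // Illustrative criterion for force synchronization: transfer the atom to a
    // neighbor portion if that neighbor already appears in its neighbor list.
    __device__ bool mustTransferData(const DynamicAtomData& a, unsigned neighborID)
    {
        return (a.neighborMask >> neighborID) & 1u;
    }

    // One thread per atom: flag the (atom, neighbor) pairs to be transferred.
    // transferIDs has nNeighbors * nAtoms entries, pre-initialized to UINT_MAX,
    // and is afterwards sorted by neighborID so that valid pairs come first.
    __global__ void identifyTransferAtomIds(TransferID* transferIDs,
                                            const DynamicAtomData* atoms,
                                            const int* cellNeighbors,      // flattened per-cell lists
                                            const int* cellNeighborCount,
                                            int nAtoms)
    {
        int atomID = blockIdx.x * blockDim.x + threadIdx.x;
        if (atomID >= nAtoms) return;

        int cell = atoms[atomID].cellID;
        for (int k = 0; k < cellNeighborCount[cell]; ++k) {
            unsigned neighborID = cellNeighbors[cell * MAX_CELL_NEIGHBORS + k];
            if (mustTransferData(atoms[atomID], neighborID)) {
                unsigned offset = neighborID * nAtoms + atomID;   // one slot per pair
                transferIDs[offset].atomID     = atomID;
                transferIDs[offset].neighborID = neighborID;
            }
        }
    }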

5.2.3 Data Transfer

For data transfers, we set in each GPU a buffer containing the output data and the static atom identifiers. To set the buffer, we visit all valid entries of the transferIDs array in parallel in one CUDA kernel, and fetch the transfer data using the dynamic atom identifier. The particular transfer data may consist of forces or positions and velocities, depending on the specific step in the MTS Algorithm 3.

Transfer data for all neighbor GPUs is stored in one unique buffer; therefore, we set an additional array with begin and end indices for each neighbor's chunk. This small array is copied to the CPU, and the CPU invokes one asynchronous copy function to transfer data between each GPU and one of its neighbors. We use NVidia's driver support for unified memory access (Unified Virtual Addressing, UVA) [30] to perform direct memory copy operations between GPUs.
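A minimal host-side sketch of such a direct transfer with the CUDA runtime's peer-access API (buffer pointers, devices and chunk sizes are illustrative):

    #include <cuda_runtime.h>

    // Enable peer access once at startup (only possible for GPUs on the same IOH).
    void enablePeerAccess(int gpuA, int gpuB)
    {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, gpuA, gpuB);
        if (canAccess) {
            cudaSetDevice(gpuA);
            cudaDeviceEnablePeerAccess(gpuB, 0);
        }
    }

    // Copy one neighbor's chunk of the packed transfer buffer directly from
    // srcDevice memory to dstDevice memory, without staging through the CPU.
    void sendChunkToNeighbor(void* dstBuffer, int dstDevice,
                             const void* srcBuffer, int srcDevice,
                             size_t chunkBytes, cudaStream_t stream)
    {
        cudaMemcpyPeerAsync(dstBuffer, dstDevice, srcBuffer, srcDevice,
                            chunkBytes, stream);
    }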

Upon reception of positions and velocities during the update of the parti-

tions, each GPU appends new entries of dynamicAtomData at the end of the array.

These entries will be automatically sorted as part of the update of the cell-list. Upon

reception of forces during force synchronization, each GPU writes the force values

to the force accumulator in the dynamicAtomData. The received data contains the

target atoms' static identifiers, which are used to indirectly access their dynamic identifiers.

Figure 5.3: PCIe configuration of our testbed.



5.3 Short Range On-Board Multi-GPU Evaluation

This section demonstrates our approach on a multi-GPU on-board architecture, us-

ing PCIe for direct GPU-GPU communication. We show speed-ups and improved

scalability over NAMD, a state-of-the-art multi-CPU-GPU simulation algorithm that

uses GPUs as co-processors.

In order to validate our proposal, we carried out our experiments on a machine outfitted with Ubuntu GNU/Linux 10.04, two Intel Xeon Quad Core 2.40GHz

CPUs with hyperthreading, 32 GB of RAM and four NVidia GTX580 GPUs con-

nected to PCIe 2.0 slots in an Intel 5520 IOH Chipset of a Tyan S7025 motherboard.

The system's PCIe 2.0 bus bandwidth for peer-to-peer throughput via the IOH chip was 9 GB/s full duplex, and 3.9 GB/s for GPUs on different IOHs [17]. The IOH does

not support non-contiguous byte enables from PCI Express for remote peer-to-peer

MMIO transactions [14]. The complete deployment of our testbed architecture is

depicted in Figure 5.3. Direct GPU-GPU communication can be performed only for

GPUs connected to the same IOH. For GPUs connected through QPI, the driver

performs the communication using CPU RAM [17].

Given our testbed architecture, we have tested the scalability of our pro-

posal by measuring computation and transmission times for 1, 2, and 4 partitions

running on different GPUs. We have extrapolated scalability further by estimating

transmission times for 8 and 16 partitions using the bandwidth obtained with 4
GPUs and the actual data size of 8 and 16 partitions respectively.

We have used three molecular systems as benchmarks (see Figure 5.4):

• ApoA1 (92,224 atoms) is a well-known high-density lipoprotein (HDL) in hu-

man plasma. It is often used in performance tests with NAMD.

• C206 (256,436 atoms) is a complex system formed by a protein, a ligand and

a membrane. It presents load balancing challenges for molecular dynamics

simulations.

• 400K (399,150 atoms) is a well-balanced system of 133,050 molecules of water

designed synthetically for simulation purposes.

All our test simulations were executed using MTS Algorithm 3, with a time

step of 2 fs for short-range non-bonded forces, and 1 fs (nStepsBF = 2) for bonded

forces. In all our tests, we measured averaged statistics for 2000 simulation steps,

i.e., a total duration of 4 ps (4 · 10^-12 s).

5.3.1 Comparison of Partition Strategies

To evaluate our two partition strategies described in Section 5.1, we have compared

their performance on the C206 molecule. We have selected C206 due to its higher

complexity and data size. Figure 5.5a indicates that, as expected, the percentage

of interface cells grows faster for the linear partition. Note that with 2 partitions

the size of the interface is identical with both strategies because the partitions are

actually the same. With 16 partitions, all cells become interface cells for the linear

partition strategy, showing the limited scalability of this approach. Figure 5.5b shows

that, on the other hand, the linear partition strategy exhibits a higher transmission

bandwidth. Again, this result was expected, as the number of neighbor partitions is

smaller with this strategy.

All in all, Figure 5.5c compares the actual simulation time for both partition

strategies. This time includes the transmission time plus the computation time

of the slowest partition. For the C206 benchmark, the binary partition strategy

exhibits better scalability, and the reason is that the linear strategy suffers a high

load imbalance, as depicted by the plot of standard deviation across GPUs.

Figure 5.6 shows how the total simulation time is split between computa-

tion and transmission times for the binary partition strategy. Note again that the

transmission times for 8 and 16 partitions are estimated, not measured. Up to 4


partitions, the cost is dominated by computations, and this explains the improved

performance with the binary strategy despite its worse bandwidth.

The optimal choice of partition strategy appears to be dependent on the

underlying architecture, but also on the specific molecule, its size, and its spatial

atom distribution.

5.3.2 Scalability Analysis and Comparison with NAMD

Figure 5.7a shows the total speedup for the three benchmark molecules using our proposal (with a binary partition strategy). Note again that speedups for 8 and 16 GPUs, shown in dotted lines, are estimated based on the bandwidth with 4 GPUs. The results show that the implementation makes the most out of the molecule's size by sharing the workload among different GPUs. The speedup of ApoA1 is lower be-

cause it is the smallest molecule and the simulation is soon limited by communication

times.

Figure 5.7b evaluates our combined results in comparison with a well-known

parallel molecular dynamics implementation, NAMD. Performance is measured in

terms of the nanoseconds that can be simulated in one day. The three benchmark

molecules were simulated on NAMD using the same settings as on our implementa-

tion. Recall that NAMD distributes work tasks among CPU cores and uses GPUs as

co-processors, in contrast to our fully GPU-based approach. We could not estimate

performance for NAMD with 8 and 16 GPUs, as we could not separate computa-

tion and transmission times. All in all, the results show that our proposal clearly

outperforms NAMD for all molecules by a factor of approximately 4×.

In terms of memory scalability, our approach suffers the limitation that each partition stores static data for the full molecule. This limitation is addressed

in Chapter 7. From our measurements, the static data occupies on average 78MB

for 100K atoms, which means that modern GPUs with 2GB of RAM could store

molecules with up to 2.5 million atoms. In the dynamic data, there are additional

memory overheads due to the storage of interface cells and sorting lists, but these

data structures become smaller as the number of partitions grows. In addition,

interface cells grow at a lower rate than private cells as the size of the molecule

grows.

(a) ApoA1 in water (b) C206 in water

(c) 400K

Figure 5.4: Benchmark molecules.



(a) Interface size (b) Bandwidth

(c) Sim. times (2000 steps)

Figure 5.5: Performance comparison of binary and linear partition strategies on


C206.

Figure 5.6: Running time (2000 steps) for the binary partition strategy on C206.

(a) Speedup (b) Comparison vs. NAMD

Figure 5.7: Scalability (a) and performance comparison with NAMD (b), mea-
sured in terms of simulated nanoseconds per day.
Chapter 6

On-Board Multi-GPU Long-Range

Force Computation

This chapter presents a parallel and scalable solution to compute long-range molec-

ular forces, based on the multilevel summation method (MSM). As shown in the

previous chapter, making use of several GPUs as independent computing nodes al-

lows us to perform faster simulations, reducing latency due to data transfers.

The objective in this chapter is to achieve a similar scalability on long-range force computations by using several GPUs. The MSM algorithm offers good

features to be used in a multi-GPU distributed environment, despite being slower

than PME. An optimization of MSM that replaces 3D convolutions with FFTs is

presented in this chapter, achieving a single-GPU performance comparable to the

PME method, the de facto standard for long-range molecular force computation.

But most importantly, we propose a distributed MSM that avoids the scalability

difficulties of PME.

Our distributed solution is based on a spatial partitioning of the MSM

multilevel grid, together with massively parallel algorithms for interface update and

synchronization. The last section of this chapter shows the scalability of our approach

on an on-board multi-GPU platform.



6.1 Optimized MSM

Our method runs the whole simulation on an on-board multi-GPU architecture by

allocating a portion of the system to each GPU and using a boundary interface to

communicate updates directly between portions.

Algorithm 5 highlights the differences between our distributed MSM and

the original algorithm. See also [9] for a thorough description of the method. Note

that the direct sums are independent of each other, and the direct sum on a certain

level and the restriction to the coarser level can be executed asynchronously.

6.1.1 FFT-Based Sums

To perform the direct sum on each level, the original MSM applies a 3D convolution over all grid points using a kernel with 2⌊2Rc/h⌋+1 points in each dimension. However, Hardy [9] shows that the direct sum is the most computationally expensive part. We substitute this convolution with a product in the frequency domain. Specifically, we compute grid potentials in three steps:

1. Forward FFT of the grids of charges and kernel weights.

2. Complex point-wise product of the two resulting vectors.

3. Inverse FFT to obtain the potentials.

The grids of charges and kernel weights should have identical dimensions; therefore,

we extend the kernel. Note that the kernel is constant, hence we only compute its

FFT once per level as a preprocess.
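A minimal sketch of these three steps on one level using NVIDIA's cuFFT (grid sizes, buffer handling and normalization are simplified, and kernelHat is assumed to hold the precomputed FFT of the extended kernel; this is not the thesis' actual implementation):

    #include <cufft.h>

    // Point-wise complex product: chargeHat *= kernelHat.
    __global__ void complexPointwiseMul(cufftComplex* chargeHat,
                                        const cufftComplex* kernelHat, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        cufftComplex a = chargeHat[i], b = kernelHat[i];
        chargeHat[i].x = a.x * b.x - a.y * b.y;
        chargeHat[i].y = a.x * b.y + a.y * b.x;
    }

    // Grid potentials on one level: forward FFT of the charges, product with
    // the precomputed kernel FFT, inverse FFT. cuFFT's C2R output is unscaled,
    // so a final 1/(nx*ny*nz) normalization is needed (omitted here).
    void levelPotentialsFFT(float* charges, float* potentials,
                            cufftComplex* chargeHat, const cufftComplex* kernelHat,
                            int nx, int ny, int nz)
    {
        cufftHandle fwd, inv;
        cufftPlan3d(&fwd, nx, ny, nz, CUFFT_R2C);
        cufftPlan3d(&inv, nx, ny, nz, CUFFT_C2R);

        int nHat = nx * ny * (nz / 2 + 1);               // R2C output size
        cufftExecR2C(fwd, charges, chargeHat);
        complexPointwiseMul<<<(nHat + 255) / 256, 256>>>(chargeHat, kernelHat, nHat);
        cufftExecC2R(inv, chargeHat, potentials);

        cufftDestroy(fwd);
        cufftDestroy(inv);
    }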



Even though the FFT has O(N log N ) complexity as opposed to O(N )
complexity of the convolution, in practice large kernels yield a steep linear complexity

for the convolution approach. For very large molecules, the log N factor of the FFT

would dominate, but with our distributed MSM presented next in Section 6.2, FFTs

are computed on each partition separately, hence N is bounded. We have compared

the performance of efficient GPU implementations of massively parallel MSM using

the convolution and FFT approaches, and the FFT approach enjoys a speed-up of

almost 10×. Table 6.1 shows timing comparisons for two molecular systems. The

examples were executed on an Intel Core i7 CPU 860 at 2.80GHz with a NVIDIA

GTX Titan GPU and CUDA Toolkit 5.5. FFTs were computed using NVIDIA's

highly efficient cuFFT library [22].

The cutoff distance Rc has a great impact on both error and performance. Error is lower for higher cutoffs, and this can be observed from the fact that a larger cutoff distance increases the kernel size as well. For our performance analysis, we used a cutoff radius of 9.0 Å, which is a standard value for molecular dynamics simulations. Assuming a fixed grid size, the resolution of the grid h, which is automatically set for each level and each axis, determines the overall performance and accuracy. Smaller values of h for the same number of levels imply higher accuracy, but this also translates into a larger kernel size 2⌊2Rc/h0⌋+1, hence adding to the computational cost. The table shows the grid resolution on each axis (in Å), as well as the kernel size.

Table 6.1 also compares the performance of MSM and PME under the same grid resolutions. We implemented an efficient GPU version of the Smooth PME (SPME) algorithm [7], following the optimizations described by Harvey and De Fabritiis [10]. We also implemented the previously mentioned GPU version of the MSM algorithm proposed by Hardy [9]. With our FFT-based optimization, the

the MSM algorithm proposed by Hardy [9]. With our FFT-based optimization, the

#Atoms    h_{x,y,z}             Kernel size   t_MSM    t_MSM-FFT   t_PME
256,436   {1.88, 1.87, 2.65}    9x9x6         31.901   4.79        5.095
90,849    {1.56, 1.56, 1.56}    11x11x11      43.694   5.09        2.22

Table 6.1: Performance comparison for long-range force computation on two


molecular systems, using regular MSM with 3D convolution, our optimized MSM
based on FFTs, and PME. Timings correspond to one simulation step and are given
in ms. All cases were executed using a 64 × 64 × 64 grid.

performance of MSM becomes comparable to that of PME.

6.2 Distributed MSM

We propose a distributed MSM (DMSM) that partitions a molecular system and

the multilevel grid of MSM among multiple GPUs. As a computing element, each

GPU handles in a parallel manner the computation and update of its corresponding

portion of the molecular system, as well as the communications with other GPUs.

In this section, we first describe the partition of the molecular system, then the handling of periodic boundary conditions across all MSM levels, and finally our

parallel algorithms for interface update and synchronization.

6.2.1 Multigrid Partitions

Following the observations drawn in [20] for short-range molecular forces, we partition

a molecular system linearly along its longest axis, as this approach reduces the cost

to communicate data between partitions. Then, for DMSM, we partition each level

of the MSM grid into regular portions using planes orthogonal to the longest axis.

Each GPU device stores a portion of the grid at each level, including two types of

grid points: i) interior grid points owned by the GPU itself. ii) interface grid points

owned by neighboring GPUs.

The size of the interface corresponds to the half-width of the convolution kernel, i.e., ⌊2Rc/h⌋ points to the left and right of the interior ones, as shown in

Figure 6.1. The interface stores replicas of the grid points of neighboring partitions,

which are arranged in device memory just like interior points, to allow seamless

data access. The interface is used both to provide access to charges of neighboring

partitions and to store partial potentials corresponding to those same partitions.

Note that, due to the use of a linear partitioning strategy, the neighboring nodes

along the shorter directions are the result of periodic boundary conditions, and they

do not need to be stored as interface points as they are readily available as interior

points.

The partitions are made only once at the beginning of the simulation. At runtime, interface values are communicated as needed, as part of restriction, the direct sum of potentials, and prolongation.

6.2.2 Periodic Boundary Conditions on Multiple GPUs

As outlined in Section 2.1, molecular dynamics are performed on infinite systems formed by periodically replicating images of the molecular system under study along

all three spatial directions [6]. Periodic replication is also applied to the MSM grid;

therefore, on the boundary of the molecular system interfaces represent images of

grid points on the opposite sides, as shown in Figure 6.1.

In higher levels of the multilevel grid, where the total number of grid points

along the longest axis is similar to the convolution kernel size, periodic boundaries

complicate the management of interface points. Two main complications may occur,

shown in Figure 6.1: the same point may map to two or more interface points, and

even interior points may map to interface points. To deal with interface handling,

each GPU device stores the following data on each level:

• Begin and end indices of neighbor partitions, to know what part of the interface

belongs to each GPU device.

• Periodic begin and end indices of the interfaces of neighbor partitions, to know

what interior points constitute interfaces for other GPU devices.

Figure 6.1: Partition of the multilevel grid under periodic boundaries. Left: All grid points on each level, distributed into 3 GPU devices. Right: Data structure of GPU device 0 (blue) on all levels, showing: its interior grid points, interface points for an interface of size 3, and buffers to communicate partial sums to other devices. Interface points due to periodic boundary conditions are shown striped. Arrows indicate sums of interface values to the output buffers. With interfaces of size 3, in levels 1 and 2 several interface points contribute to the same buffer location, and in level 2 there are even interior points that map to interface points.

Since the multilevel grid is static during the simulation, the auxiliary indices of neighbor partitions are created and shared between GPUs once as a preprocessing step. Once each GPU knows the indices of its neighbors, it creates the incoming and outgoing data buffers to share interface data, and sets static mappings that allow efficient read/write operations with these buffers, as shown in Figure 6.1.

6.2.3 Parallel Update and Synchronization of Interfaces

Algorithm 5 DMSM method main loop.

1: procedure computeDMSM
2:   n = nlevels
3:   q^0 ← Anterpolation()
4:   * accumulateInteriorCopies(q^0)
5:   * updateInterfaces(q^0)
6:   for i = 0 . . . n − 2 do
7:     V^i ← DirectSum(q^i)
8:     q^(i+1) ← Restriction(q^i)
9:     * updateInterfaces(q^(i+1))
10:  end for
11:  V^(n−1) ← DirectSum(q^(n−1))
12:  * accumulateInteriorCopies(V^(n−1))
13:  * updateInterfaces(V^(n−1))
14:  for i = n − 2 . . . 0 do
15:    V^i ← Prolongation(V^(i+1))
16:    * accumulateInteriorCopies(V^i)
17:    * updateInterfaces(V^i)
18:  end for
19:  Interpolation(V^0)
20: end procedure

Our DMSM algorithm needs to update and synchronize interfaces at multi-

ple stages of the original MSM algorithm. There are two synchronization operations:

1. accumulateInteriorCopies: In the charge anterpolation, the coarsest direct

sum and prolongation steps, values are accumulated onto the interface grid

points in each GPU device. These interface points are local copies of interior

points of other GPUs, hence the values stored on interface points need to be accumulated onto their true owners. This operation is executed in three steps. First, the values from the interface points are accumulated into the output buffers. Second, the buffers are transferred to their destination GPUs. And third, the receiver GPUs accumulate the incoming values into their interior grid points. Thanks to the preprocessing of mappings described previously, the accumulation to the output buffers is executed efficiently in a massively parallel manner on each GPU. Periodic boundary conditions are also handled efficiently, and the accumulation of multiple copies of the same point is dealt with during the accumulation to output buffers, prior to data transfer (a sketch of these gather/scatter steps is given after this list).

2. updateInterfaces: Once interior grid values are set, it may be necessary to

update their copies in other GPUs, i.e., the interface grid points of other GPUs.

Data is transferred between pairs of GPUs directly. This step is necessary after

charge anterpolation, after restriction, after the direct sum of potentials, and

after prolongation.
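A minimal CUDA sketch of the gather and scatter kernels behind accumulateInteriorCopies (steps 1 and 3 above; step 2 is a direct GPU-GPU copy like the one used in Chapter 5). The mapping arrays stand for the precomputed static mappings mentioned earlier; their names are illustrative:

    // Step 1: accumulate interface-point values into the outgoing buffer.
    // Each entry pairs a buffer slot with a contributing interface grid point;
    // slots fed by several periodic images appear multiple times, so
    // contributions are added with atomicAdd.
    __global__ void gatherInterfaceToBuffer(float* outBuffer,
                                            const float* gridValues,
                                            const int* bufferSlotOfEntry,
                                            const int* gridPointOfEntry,
                                            int nEntries)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nEntries) return;
        atomicAdd(&outBuffer[bufferSlotOfEntry[i]], gridValues[gridPointOfEntry[i]]);
    }

    // Step 3 (on the receiving GPU): add the incoming buffer into interior points.
    __global__ void scatterBufferToInterior(float* gridValues,
                                            const float* inBuffer,
                                            const int* gridPointOfSlot,
                                            int nSlots)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nSlots) return;
        atomicAdd(&gridValues[gridPointOfSlot[i]], inBuffer[i]);
    }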

Algorithm 5 shows our DMSM algorithm, highlighting in blue and with

a star the steps that augment the original MSM algorithm. We distinguish

charge values q from potential values V, which are used as arguments of the

accumulateInteriorCopies and updateInterfaces procedures when appropriate.

Superscripts indicate grid levels. With our DMSM algorithm, all operations to set

up, transfer, and collect data packages are highly parallelized, thus minimizing the

cost of communications and maximizing scalability.



6.3 Evaluation for On-Board Multi-GPU MSM

This section analyzes the scalability of our proposal presented in the previous section.

We carried out our experiments on a machine outfitted with Ubuntu GNU/Linux

Precise Pangolin 12.04, two Intel Xeon Quad Core 2.40GHz CPUs with hyperthread-

ing, 32 GB of RAM and four NVidia GTX580 GPUs connected to PCIe 2.0 slots in

an Intel 5520 IOH Chipset of a Tyan S7025 motherboard.

Given our testbed architecture, we have tested the scalability of our pro-

posal by measuring computation and transmission times for 1, 2, and 4 partitions

running on different GPUs. We have used three molecular systems as benchmarks

(see Figure 6.2), all three with a large number of atoms:

• 400K (399,150 atoms) is a well-balanced system of 133,050 molecules of water

designed synthetically.

• 1VT4 (645,933 atoms) is a multi-molecular holoenzyme complex assembled

around the adaptor protein dApaf-1/DARK/HAC-1.

• 2x1VT4 (1,256,718 atoms) is a complex system formed by two 1VT4 molecules.

6.3.1 Scalability Analysis

Figure 6.3 shows the speedup and running times for the three molecules using our proposal with the settings shown in Figure 6.4a. Note that running times have been measured using a GTX580 GPU, and are affected by NVidia's CUDA atomicAdd()

operation, whose implementation depends on the hardware architecture. We also

show the results obtained with the CPU implementation of PME in NAMD, one

(a) 400K (b) 1VT4 in water

(c) 2x1VT4 in water

Figure 6.2: Benchmark molecules.

of the most used tools for molecular dynamics, as a baseline for comparison. The

results show that our method benefits from larger molecules. The reason is that

anterpolation, whose workload is easier to share among GPUs, dominates the cost

of updates in this case.



The scalability of the system is limited because of interface updates between

GPUs. Figure 6.4b shows the data transfers between GPUs to update their interfaces

for the 2x1VT4 molecule for a single step of DMSM. We have selected 2x1VT4 due

to its higher complexity and data size, with more than 1.2 Million atoms. The gure

indicates that, as expected, the data size of interface cells grows linearly, since each

new partition adds a constant data transfer that depends on the grid resolution h
and its corresponding interface size. Furthermore, the average data size transfered

per GPU is similar to the data needed in a single-GPU implementation in order to

account for periodic boundary conditions, as shown in Figure 6.4b.

Finally, Figure 6.4c shows how the total simulation time is split between

computation and interface updates for the 2x1VT4 molecule, to analyze the impor-

tance of the transferred data size. With up to 4 partitions, the cost is dominated by

computations, with interface transfers adding up to only a low percentage. In this

way, the speedup grows almost linearly with each additional GPU. All in all, the

results show that our proposal presents very good scalability on on-board multi-GPU platforms.

Figure 6.3: Running time and speedup.

Molecule   h_{x,y,z}
400K       {2.57, 2.57, 2.57}
1VT4       {1.86, 1.86, 0.93}
2x1VT4     {1.89, 1.87, 1.78}

(a) Evaluation Settings

(b) Interface size (2x1VT4)

(c) Simulation cost (2x1VT4)

Figure 6.4: Scalability Analysis.


Chapter 7

Distributed Multi-GPU Molecular

Dynamics

This chapter presents a parallel and scalable solution to compute bonded and non-

bonded molecular forces to make use of distributed high performance multi-GPU

environments. As shown earlier in Chapter 5 and Chapter 6, several attached GPUs

hosted in on-board systems can be used as independent computing nodes to increase

performance. For example, TYAN FT72B7015 barebones may host up to 8 NVidia

GPUs in a single mainboard, becoming one of the highest-performance on-board

multi-GPU solutions. However, scalability is constrained by the number of GPUs that can be connected, currently limited to 4-8 GPUs. Therefore, the objective of this chapter is instead to design a distributed high-performance multi-GPU environment

where several nodes with GPUs can collaborate to solve the problem.

The main limitation of the methods presented in Chapter 5 is the lack of

memory scalability. Every node has to keep a complete copy of the molecule in

memory, limiting the maximum size of the molecule to simulate. Additionally, this

prevents the use of low-end GPUs with a small amount of GPU global memory.

This chapter presents new algorithms to overcome these limitations. Section 7.1

presents the elements required to perform a complete division of the system keeping

data coherency. To do this, new unique global identiers for atoms and bonds are

generated.

Section 7.1.1 explains our method to partition the molecular system, where

each GPU maintains only a small part of the whole molecule. Section 7.1.2 explains

how data is updated by interchanging atoms and bonds between neighboring GPUs.

Section 7.2 demonstrates the scalability of our approach on a multi-GPU cluster

environment.

7.1 Algorithm

We propose a collaborative scheme to perform the simulation in a distributed environment, such as a cluster composed of several nodes with GPUs. The main objective is to avoid storing a complete copy of the molecule in each node's memory, thus allowing memory scalability. Our solution acts at two different stages: initialization

of the simulation and runtime execution of the simulation. Next, we summarize the

processes designed for these two stages.

• SystemLoader. This process distributes the molecular system among comput-

ing nodes, ensuring that each GPU receives only one portion of the molecule.

It is in charge of reading the molecular system and performing the data par-

tition. This requires assigning identifiers to every GPU available, selecting in

a balanced way which part of the molecule goes to each one. Additionally, it

creates a neighborhood table for each GPU, establishing global atom and bond

IDs required for shared data and partition updates.

• Integrators. These processes are responsible for running the molecular dy-

namics within each partition. Each integrator independently performs the



simulation, updating and synchronizing the data partition with its neighbors.

At the beginning of the simulation, the SystemLoader reads the whole

molecular system and generates the list of Integrators (one for each GPU) that will
perform the simulation. Each Integrator receives a single partition, as described
in Chapter 5 (see Fig. 5.1 and Fig. 5.2), along with a list of neighbors to exchange

updates of their shared areas. After distribution, the SystemLoader is idle most of

the time, but it is also responsible for collecting partial simulation results from the

Integrators, merging them, and saving them to disk.
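As an illustration of this start-up phase, the following minimal sketch assumes an MPI-style process layout in which rank 0 plays the role of the SystemLoader and every other rank drives one GPU as an Integrator; the message payload and helper logic are placeholders, not the actual implementation.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // SystemLoader: read the whole molecular system once and send one
        // spatial partition (atoms, bonds, neighborhood table) to each Integrator.
        for (int dst = 1; dst < size; ++dst) {
            int atomCount = 0;  // placeholder for the serialized partition
            MPI_Send(&atomCount, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
        }
        // Afterwards it only gathers partial results and writes them to disk.
    } else {
        // Integrator: receive a single partition and simulate it, exchanging
        // shared-area updates with the neighbors listed in the partition.
        int atomCount = 0;
        MPI_Recv(&atomCount, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank %d received its partition (%d atoms)\n", rank, atomCount);
    }
    MPI_Finalize();
    return 0;
}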

The following sections describe the methods used to make the system par-

tition and updates between integrator nodes.

7.1.1 System partition

As shown earlier in Chapter 5 and Chapter 6, each partition of the molecular system

is itself divided into three different sections: shared, unshared and interface data.

As shown in Section 5.2, there are data sets that do not change during the simulation,

called staticAtomData. In the solution proposed in Chapter 5, this static data is

replicated on all GPUs, being considered as "global data". To improve memory

accesses, this data is kept separate from dynamicAtomData, which is private to

each GPU. Thus, each atom holds two types of identifiers:

• staticAtomDataID. Identifier assigned after loading system data. It references

the global position of an atom or bond data.

• dynamicAtomDataID. Identifier assigned after performing the CellList par-

tition. It is updated dynamically during the simulation. It references the local
data position within the partition assigned to each GPU.

In Chapter 5, staticAtomDataID arrays are replicated on all GPUs, which

makes atom migration very quick because only the dynamic part of the data must

be sent. However, the staticAtomDataID data stored on each partition does not

decrease with the number of partitions, limiting memory scalability and therefore

the maximum molecule data size.

In this chapter, we propose a new algorithm that maintains, for each parti-

tion, a copy of staticAtomData only for the atoms that reside within the partition.

dynamicAtomDataID and staticAtomDataID identifiers persist as they accelerate

computations within each GPU. However, our algorithm introduces a new global

identifier that enables the migration of static data between GPUs when atoms leave

their assigned partitions. As a result, three types of identifiers are used:

• globalDataID: Global identifier used to identify the same element (atom or

bond) between different GPUs.

• localStaticDataID: Identifier assigned by each GPU to identify local static

data within its partition.

• localDynamicDataID: Identifier assigned by each GPU to identify local dy-

namic data after performing local CellList updates.

localStaticDataID and localDynamicDataID are defined by an integer

descriptor that references their position within the array of data contained in the

GPU RAM. On the other hand, globalDataID stores information about the GPU

node that owns the data, as well as the neighboring nodes that share a copy. There-

fore, globalDataID consists of the following tuple (a minimal data-structure sketch is given after the list):



• OwnerGPU: The identifier of the GPU that owns the element.

• SharedGPUs: A list of the neighboring GPUs that share the atom or bond.

• DataID: Unique identifier, assigned by the SystemLoader after loading the

molecular system.
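A minimal data-structure sketch of these identifiers is given below; the field names and integer widths are illustrative assumptions rather than the exact layout used in the implementation.

#include <cstdint>
#include <vector>

// Local identifiers are plain indices into the arrays stored on one GPU.
using LocalStaticDataID  = std::uint32_t;  // position in the local static arrays
using LocalDynamicDataID = std::uint32_t;  // position in the local dynamic arrays,
                                           // reassigned on every CellList update

// Global identifier shared by all GPUs for the same atom or bond.
struct GlobalDataID {
    std::uint32_t dataID;                   // unique ID assigned by the SystemLoader
    std::uint16_t ownerGPU;                 // GPU that owns the element
    std::vector<std::uint16_t> sharedGPUs;  // neighboring GPUs holding a shared copy
};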

Finally, partitions follow the patterns seen in Chapter 5 to identify the

data belonging to each GPU. Atoms are assigned to each partition based on their

3D position. Bonds composed of two or more atoms use the midpoint method [2],

based on the positions of all atoms that form the bond.
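A sketch of this midpoint assignment is shown below, assuming a uniform cell grid; partitionOf stands in for the actual CellList lookup, and periodic boundary wrapping is omitted for brevity.

#include <vector>

struct Vec3 { double x, y, z; };

// Maps a 3D position to a cell/partition index (stand-in for the CellList lookup).
int partitionOf(const Vec3& p, double cellSize, int cellsPerAxis) {
    auto cellIndex = [&](double v) {
        int c = static_cast<int>(v / cellSize);
        if (c < 0) c = 0;
        if (c >= cellsPerAxis) c = cellsPerAxis - 1;
        return c;
    };
    return cellIndex(p.x) + cellsPerAxis * (cellIndex(p.y) + cellsPerAxis * cellIndex(p.z));
}

// A bond formed by two or more atoms is owned by the partition that contains
// the midpoint (centroid) of its atoms, following the midpoint method [2].
int bondOwner(const std::vector<Vec3>& bondAtoms, double cellSize, int cellsPerAxis) {
    Vec3 mid{0.0, 0.0, 0.0};
    for (const Vec3& a : bondAtoms) { mid.x += a.x; mid.y += a.y; mid.z += a.z; }
    const double n = static_cast<double>(bondAtoms.size());
    mid.x /= n; mid.y /= n; mid.z /= n;
    return partitionOf(mid, cellSize, cellsPerAxis);
}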

7.1.2 Updates

Algorithm 3, presented earlier, described the integration method used in previous implemen-

tations. That method needs updated forces just before integrating positions, forcing

each type of force to be transmitted separately after computing it. To save commu-

nication times, the integrator was changed to a Velocity Verlet version, which only

needs updated positions before computing forces. Algorithm 6 shows the distributed

implementation with communication methods highlighted in blue with a star.

There are two points to update data:

• Partition/CellList updates. CellList structures must be updated every 10 steps.

This includes both partitions and pre-calculated shared data information. The

atoms that migrate across partitions are sent along with their bond information,

so that the receiving partition has the information needed to rebuild the system. When the GPU detects

that all atoms that form a bond have left the boundaries of the partition, that

bond is deleted from the partition to save space (a sketch of this clean-up rule is given after the list).



Algorithm 6 Multi-GPU Velocity Verlet/r-Respa single step integrator. Data transfers are highlighted in blue with a star.

1:  procedure Step(currentStep)
2:      integrateVelocity(0.5 · ∆t)
3:      integratePosition(∆t)
4:      if currentStep mod 10 = 0 then
5:          ∗ identifyUpdateAtomIds()
6:          ∗ transferUpdatePositionsAndVelocities()
7:          updateCellList()
8:          ∗ identifySharedAtomIds()
9:      else
10:         ∗ transferSharedAtomPositions()
11:     end if
12:     computeAllForces()
13:     integrateVelocity(0.5 · ∆t)
14:     currentStep = currentStep + 1
15: end procedure

• Shared data updates. Dynamic data of all shared atoms has to be updated

on each step before continuing the simulation. In this case, dynamic data is

defined by the 3D positions of atoms.
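The clean-up rule applied to bonds during a partition/CellList update can be sketched as follows; host-side C++ containers are used here only for clarity, whereas the thesis performs this detection on the GPU.

#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Bond { std::vector<std::uint32_t> atomGlobalIDs; };

// Drop every bond whose atoms have all left this partition; a bond is kept
// while at least one of its atoms still resides here.
void pruneBonds(std::vector<Bond>& localBonds,
                const std::unordered_set<std::uint32_t>& atomsStillInPartition) {
    localBonds.erase(
        std::remove_if(localBonds.begin(), localBonds.end(),
            [&](const Bond& b) {
                return std::none_of(b.atomGlobalIDs.begin(), b.atomGlobalIDs.end(),
                    [&](std::uint32_t id) { return atomsStillInPartition.count(id) != 0; });
            }),
        localBonds.end());
}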

The methods for data communication and partition update are similar to

those explained in Section 5.2. Data is sent along with its associated identier on

every step to update it. However, each GPU computes the molecular forces using

local identifiers, which must be translated before being sent. IDs are translated us-

ing two tables: localIDToGlobalID before sending, and globalIDToLocalID after

receiving data. A naive approach would be to use arrays as translation tables. The

identifiers would be stored in the position indicated by their ID, i.e., GlobalID =
localIDToGlobalID[localId] and localId = globalIDToLocalID[GlobalID]. Al-

though this method is very fast, the space needed for the globalIDToLocalID arrays

would have to include the atoms that are not in the partition, preventing the memory size from

decreasing with the number of partitions. To avoid memory scalability problems, a
GPU hash table is used instead, storing only the global keys needed on each GPU.

Figure 7.1: Communication scheme from GPU A to GPU B. Each GPU hosts
a small portion of the system, referencing the data by local IDs. Local IDs are
translated to global data IDs and sent to the second GPU. After data reception, a
translation to local IDs is performed.

Figure 7.1 illustrates the communication scheme. To accelerate the calcula-

tions, the identifySharedAtomIds() method shown in Algorithm 6 precomputes some

translation tables that will be used in the next 10 steps.
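The translation step can be sketched as follows; a host-side hash map is used here only to illustrate the lookup logic, while the actual implementation relies on a GPU hash table so that only the keys resident on each GPU are stored.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct IdTranslator {
    // Dense array: local IDs are compact, so a plain vector suffices.
    std::vector<std::uint32_t> localIDToGlobalID;
    // Sparse map: only the global IDs resident on this GPU are stored, so the
    // memory footprint shrinks as the number of partitions grows.
    std::unordered_map<std::uint32_t, std::uint32_t> globalIDToLocalID;

    // Before sending: translate the local IDs of the shared atoms to global IDs.
    std::vector<std::uint32_t> toGlobal(const std::vector<std::uint32_t>& localIds) const {
        std::vector<std::uint32_t> out;
        out.reserve(localIds.size());
        for (std::uint32_t id : localIds) out.push_back(localIDToGlobalID[id]);
        return out;
    }

    // After receiving: translate global IDs back to this GPU's local IDs.
    std::vector<std::uint32_t> toLocal(const std::vector<std::uint32_t>& globalIds) const {
        std::vector<std::uint32_t> out;
        out.reserve(globalIds.size());
        for (std::uint32_t id : globalIds) out.push_back(globalIDToLocalID.at(id));
        return out;
    }
};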



7.2 Distributed Multi-GPU Molecular Dynamics evaluation

This section analyzes the implementation of our approach on a distributed high

performance multi-GPU environment, showing its speedup and scalability. We have

performed the evaluation tests in a cluster composed of 32 nodes interconnected

by a Gigabit Ethernet network. Each node runs Linux Mint 14 and is outfitted with 8 GB of RAM and

one NVidia GTX760 GPU with 2 GB of RAM. The inter-GPU communications were

implemented using OpenMPI 1.8, compiled with NVidia/CUDA support.

MPI allows direct communication between GPUs in different nodes con-

nected by a network. However, our testbed performs the communication using CPU

RAM, decreasing the maximum system performance.
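The two transfer paths can be sketched as follows; whether the direct branch is available depends on the MPI build and interconnect (CUDA-aware MPI with GPUDirect support), and the buffer names are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// Send n positions stored in GPU memory (d_positions) to another rank.
void sendPositions(float* d_positions, int n, int dstRank, bool cudaAwareMpi) {
    if (cudaAwareMpi) {
        // Direct path: a CUDA-aware MPI can read the device buffer itself.
        MPI_Send(d_positions, n, MPI_FLOAT, dstRank, 0, MPI_COMM_WORLD);
    } else {
        // Staged path (our testbed): copy device -> host, then send from CPU RAM.
        std::vector<float> h_positions(n);
        cudaMemcpy(h_positions.data(), d_positions, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        MPI_Send(h_positions.data(), n, MPI_FLOAT, dstRank, 0, MPI_COMM_WORLD);
    }
}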

The selected tests are focused on two aspects: memory accesses and running

times. In order to test the memory scalability of the proposal, we have used four

molecular systems as benchmarks (see Fig. 7.2), all of them with a large number of

atoms that cannot be simulated in a small number of nodes:

• 2x1VT4 (1,256,718 atoms) is a complex system formed by two 1VT4 molecules,

presented in Section 6.3. This molecule could not be simulated on a single

node due to the small amount of GPU memory available. A minimum of 2 GPUs is

needed.

• DHFR_555 (2,944,750 atoms) is a well-balanced synthetic system made of

several copies of DHFR (b), which cannot be simulated on fewer than 4 nodes.

The basic system is formed by an enzyme surrounded by water.

• DHFR_844 (3,015,424 atoms) is another synthetic system made of several
copies of DHFR (b), distributed in a different configuration.

(a) 2x1VT4 in water    (b) Basic DHFR molecule
(c) Several copies of DHFR, in 5x5x5 configuration    (d) Several copies of DHFR, in 8x4x4 configuration

Figure 7.2: Benchmark molecules.

All test simulations were executed using the Verlet algorithm (see Sec-

tion 2.5.1), with a single time step of 1 fs for short-range non-bonded and bonded

forces. In all tests, we measured averaged statistics for 100 simulation steps, i.e., a
total duration of 0.1 ps (1 · 10⁻¹³ s).

7.2.1 Scalability Analysis

To evaluate the scalability of the proposal, several tests have been performed. Fig-

ure 7.3 shows the amount of data sent during the simulation. As can be seen, as the

number of nodes increases, the total size of the shared information grows because

updates involve a higher number of neighbors. However, the dotted line shows that

the amount of data sent per node is practically constant in all cases. Furthermore,

the amount of data per node decreases as the number of nodes is increased, because

a higher number of nodes means a smaller partition data size. In summary, the use

of a higher number of nodes does not imply a penalty in the size of the data to be

communicated.

Figure 7.3: Data size communications for DHFR_844 along 100 steps.

Figure 7.4 shows the GPU-RAM allocated for the DHFR_555 molecule for

each cluster configuration to test memory scalability. As mentioned before, it is not

possible to simulate this system using 1 or 2 nodes. The GPUs installed on each node

have only 2 GB of RAM, and this memory is also shared with the GUI, so the actual

available GPU RAM for CUDA is smaller. For 4 nodes, 1.17 GB of GPU-RAM are
used, and the amount of memory needed decreases when more nodes are added.

Since each node has more neighbors, the RAM usage reduction is not linear. Each

node reserves extra memory for communications, but the maximum number

of neighbors (27 in 3D) ensures that this extra memory is bounded.

Figure 7.4: Memory allocation for DHFR_555.

Figure 7.5 shows the speedup for the first three molecules used in the

testbed. Figure 7.5a shows the speedup evolution using 4, 8, 16 and 32 nodes. Note

that the speedup has been measured using the system with 4 GPUs/nodes as reference,

because the DHFR molecules tested cannot run with fewer nodes. In an ideal case,

assuming linear scalability, the obtained speedups could be up to four

times higher.

However, as can be seen, inter-GPU communications take a large per-

centage of the time. As a consequence, the smallest molecule shows the worst results.

With larger molecules, force computation takes a larger percentage of the total time,

hence the replicated DHFR configurations exhibit better scalability than 2x1VT4. To

prove that this solution could take advantage of a faster network, Figure 7.5b shows

speedups for force computation only, ignoring the time spent in communications. In
this case, all the molecules achieve similar scalability, with a nearly linear speedup.

As stated before, these speedups are calculated using the 4 GPU conguration as

reference.

(a) Global Speedup (b) Only computation speedup

Figure 7.5: Speedup comparison of the three molecules. Note that the reference
conguration is for 4 GPUs.

Figure 7.6 shows the total simulation times for DHFR_555, split between

computation and transmission times. Note that the transmission times for 16 nodes

are higher than the others. This can be explained by the fact that the network

is divided into two groups of nodes connected through several routers. The commu-

nication protocol performance depends on how the partitions are spread among the

computing nodes. In this configuration some of the nodes were farther from their

neighbors than others, increasing the communication times. For the 32-node simula-

tion, a better configuration was used, showing better communication times. In spite

of this, communication times do not grow much when adding more nodes, while

computation times keep decreasing.



Figure 7.6: Breakdown of running time (100 steps) for DHFR_844.

Figure 7.7: Benchmark molecule, composed of 32 copies of 1VT4 in 4x4x2 configuration.

7.2.2 Simulation of Huge Molecules

In order to prove the ability of the proposal to simulate huge molecular systems, a

last test was performed (see Fig. 7.7). 32x1VT4 is a synthetic system made of 32

copies of 1VT4, which adds a total of 20,107,488 atoms. Due to its complexity and

size, this system needs all 32 GPUs to perform the simulation.

Due to the inability to perform a simulation with fewer than 32 nodes, we
have estimated the speedup of the test by taking the simulation times of one copy of

the molecule in one node. Simulation times are given in nanoseconds per day. One

copy of 1VT4 can be simulated at a speed of 1.5 ns/day in one node of the cluster,

so 32 copies should run at around 0.047 ns/day using one node (1.5 nanoseconds/day

divided by 32). Simulations of 32x1VT4 ran at 0.6 ns/day, so the speedup of our

proposal using 32 nodes is 12.77.
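In equation form, this estimate is

speedup₃₂ = 0.6 ns/day / (1.5 ns/day / 32) = 0.6 / 0.047 ≈ 12.77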


Part III

CONCLUSIONS AND FUTURE WORK
Chapter 8

Conclusions and Future Work

Chapter 1 introduced some problems found in molecular dynamics and the objec-

tives for this work. The current Ph.D. thesis has properly fulfilled all the primary

objectives. With the obtained results we can state that:

This thesis presents a parallel scalable molecular dynamics algorithm for

both on-board and distributed multi-GPU architectures, by using GPUs as indepen-

dent computing nodes. The approach extends and optimizes the Multilevel Summation

Method, takes advantage of direct GPU-GPU communications, and introduces mas-

sively parallel algorithms to update and synchronize the interfaces of spatial partitions

on GPUs. The evaluations carried out show that the current implementation is faster

than NAMD, one of the reference molecular dynamics simulators. Moreover, we

have simulated massive molecular systems formed by more than 20 million atoms,

demonstrating the potential of the method in a distributed multi-GPU environment.

Section 8.1 summarizes the contributions on each subproblem, and Sec-

tion 8.2 presents the future lines of research opened by this work.

8.1 Summary of contributions

The following sections present the contributions for each accomplished goal.

8.1.1 On-Board Multi-GPU Short-Range Molecular Dynamics

Our initial efforts were focused on simulating short-range bonded and non-bonded

molecular dynamics on on-board multi-GPU architectures. Chapter 5 describes the

proposed solution. Our approach is built on parallel multiple-time-stepping inte-

grators, achieving high speed-ups thanks to the spatial partitioning developed and

direct GPU-GPU communications.

The first milestone reached was the selection of a partition scheme that

fit our requirements. Section 5.3.1 presents the comparison of different partition

strategies.

Then, a novel data package management algorithm for massively parallel

architectures is presented in Section 5.2. This algorithm is the key to directly

transferring information between GPUs, enabling the execution of most of the code

on the GPU, avoiding CPU communications.

Finally, Section 5.3 presents an evaluation of the scalability of the proposal,

demonstrating the benefits of using GPUs as central compute nodes instead of being

simple co-processors.

8.1.2 On-Board Multi-GPU Long-Range Molecular Dynamics

Chapter 6 presents the next milestone achieved in this thesis. The proposal extends

and optimizes the Multilevel Summation Method, takes advantage of direct GPU-

GPU communications, and introduces massively parallel algorithms to update and

synchronize the interfaces of spatial partitions on GPUs.

The most popular algorithm for long-range molecular dynamics is PME.

However, it is not easily parallelizable in a multi-GPU environment. Instead, MSM

presents more suitable characteristics for distribution across several nodes or GPUs,

but it is slower than PME. We first improve the performance of MSM by using

an FFT instead of a 3D convolution in the computation of direct sums on individual

GPUs.

Section 6.1 states the benefits of our approach vs. the original MSM and

the well-known long-range molecular dynamics algorithm PME. We then show how

to perform a spatial partitioning of the multilevel grid, dividing atom data between

GPUs, and designing massively parallel algorithms to minimize communications to

efficiently update and synchronize interfaces.

Also, Section 6.3 evaluates the scalability of the proposal, showing promis-

ing results for a distributed multi-GPU MSM algorithm.

8.1.3 Molecular Dynamics for Distributed Multi-GPU Architec-


tures

Chapter 7 presents the final milestone achieved. The major drawback of on-board

multi-GPU systems is the limited number of GPUs that can be used in a single

node. To address this limitation, we present a new implementation of the previously

shown short-range force method for distributed multi-GPU architectures. A scalable

partition method for the molecular system is also presented, enabling the simulation

of massive molecular systems.

New shared-data communication schemes are presented in Section 7.1,

based on the definition of local data IDs for computations, and global data IDs

for communications.

Section 7.2 shows the results of the tests performed, demonstrating the

memory and performance scalability of the proposal. In summary, our approach

achieves good simulation times, opening the possibility of simulating massive molecular

systems in distributed multi-GPU environments.

8.2 Future work

The objectives stated at the beginning of this Ph.D. thesis have been satisfactorily

reached. The evaluation carried out allows us to conclude that our multi-GPU molec-

ular dynamics approach presents very good behavior in terms of performance and

scalability. Furthermore, this work opens new research lines for current applications.

Pharmaceutical research could benefit from simulating massive molecular

systems composed of hundreds of millions of atoms. One of the recurrent problems in

molecular dynamics is virus simulation. These molecular systems are so large that

usually only some selected parts are simulated. A scalable solution such as the one

proposed in this work may make the simulation of such large systems practical.

Also, the solutions presented in this thesis can be exported to other simula-
tion fields. Several of the presented solutions can be applied to solve n-body problems, such

as celestial mechanics. SPH fluid dynamics and mass-spring cloth applications are

some examples of dynamics simulations that may benefit from the spatial partition

and multi-GPU communication schemes presented in Chapters 5 and 7.

However, our approach presents several limitations that motivate future

work:

• Our current solution relies on a static partitioning, which does not guaran-

tee load balancing across GPUs. The tests indicate that practical molecular

systems maintain rather even atom distributions, but dynamic load balancing

might be necessary for finer partitions.

• Our work could be complemented with more advanced protocols and architec-

tures to optimize communications between GPUs. For on-board multi-GPU

systems, there are currently architectures that outperform the Intel IOH/QPI

interface for the PCIe bridge used in the experiments. Also, distributed multi-

GPU architectures present diverse network configurations. An adaptive com-

munication scheme could improve communication times.

• One of the main drawbacks is that MSM adds a certain overhead at coarse

levels, where the number of points to be computed is close to the number

of GPUs, and periodic boundaries wrap around the whole molecular system,

introducing many-to-many communications. To alleviate the negative conse-

quences on scalability, we plan to redesign the algorithm on coarse levels to

run on a smaller number of GPUs once the workload is manageable.

Finally, we are planning to simulate larger molecular systems in an HPC

environment. Specifically, the Barcelona Supercomputing Center (BSC) has a cluster

named MinoTauro composed of 128 NVIDIA Tesla M2090 GPUs interconnected by

InfiniBand QDR. By extrapolating the results of our simulations, this configuration

may allow us to simulate a molecular system made of nearly 300 million atoms.
Bibliography

[1] Matthias Bolten. Multigrid methods for structured grids and their application

in particle simulation. Dr., Univ. Wuppertal, Jülich, 2008. 22

[2] K. J. Bowers, R. O. Dror, and D. E. Shaw. The midpoint method for par-

allelization of particle simulations. J Chem Phys, 124(18):184109, May 2006.

67

[3] David S Cerutti and David A Case. Multi-level Ewald: A hybrid multigrid / fast

Fourier transform approach to the electrostatic particle-mesh problem. J Chem

Theory Comput, 6(2):443–458, 2010. 22

[4] Terry W. Clark and James Andrew McCammon. Parallelization of a molecular

dynamics non-bonded force algorithm for MIMD architecture. Computers &

Chemistry, pages 219–224, 1990. 2

[5] Tom Darden, Darrin York, and Lee Pedersen. Particle mesh Ewald: An

N·log(N) method for Ewald sums in large systems. The Journal of Chem-

ical Physics, 98(12):10089–10092, 1993. 2, 21

[6] O N de Souza and R L Ornstein. Effect of periodic box size on aqueous molecular

dynamics simulation of a DNA dodecamer with particle-mesh Ewald method.

Biophys J, 72(6):2395–7, 1997. 55

[7] Ulrich Essmann, Lalith Perera, Max L. Berkowitz, Tom Darden, Hsing Lee,

and Lee G. Pedersen. A smooth particle mesh Ewald method. The Journal of

Chemical Physics, 103(19):8577–8593, 1995. 21, 53



[8] David J. Hardy, John E. Stone, and Klaus Schulten. Multilevel summation

of electrostatic potentials using graphics processing units. Parallel Computing,

35(3):164–177, 2009. Revolutionary Technologies for Acceleration of Emerging

Petascale Applications. 22, 24, 34

[9] David Joseph Hardy and Robert D Skeel. Multilevel summation for the fast

evaluation of forces for the simulation of biomolecules. University of Illinois at

Urbana-Champaign, Champaign, IL, 2006. 22, 23, 52, 53

[10] M. J. Harvey and G. De Fabritiis. An implementation of the smooth particle

mesh Ewald method on GPU hardware. Journal of Chemical Theory and Com-

putation, 5(9):2371–2377, 2009. 53

[11] M. J. Harvey, G. Giupponi, and G. De Fabritiis. ACEMD: Accelerating

Biomolecular Dynamics in the Microsecond Time Scale. Journal of Chemical

Theory and Computation, 5(6):1632–1639, 2009. 2, 20, 21, 26

[12] Berk Hess, Carsten Kutzner, David van der Spoel, and Erik Lindahl. GRO-

MACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molec-

ular Simulation. Journal of Chemical Theory and Computation, 4(3):435–447,

2008. 2, 21, 26

[13] Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard Brooks, and Milan Hodošček.

Parallelizing molecular dynamics programs for distributed memory machines:

An application of the CHAOS runtime support library, 1994. 2

[14] Intel. Intel 5520 Chipset: Datasheet, March 2009. 44

[15] Jesús A. Izaguirre, Scott S. Hampton, and Thierry Matthey. Parallel multigrid

summation for the n-body problem. J. Parallel Distrib. Comput., 65(8):949–962,

August 2005. 24, 34



[16] Laxmikant Kalé, Robert Skeel, Milind Bhandarkar, Robert Brunner, Attila Gur-

soy, Neal Krawetz, James Phillips, Aritomo Shinozaki, Krishnan Varadarajan,

and Klaus Schulten. NAMD2: Greater scalability for parallel molecular dynam-

ics. Journal of Computational Physics, 151(1):283–312, 1999. 2

[17] A. Koehler. Scalable Cluster Computing with NVIDIA GPUs, 2012. 44

[18] Samuel Kupka. Molecular dynamics on graphics accelerators, 2006. 25

[19] NAMD on Biowulf GPU nodes, accessed Feb 2013. 1

[20] Marcos Novalbos, Jaime Gonzalez, Miguel A. Otaduy, Alvaro Lopez-Medrano,

and Alberto Sanchez. On-board multi-GPU molecular dynamics. In Felix Wolf,

Bernd Mohr, and Dieter an Mey, editors, Euro-Par, volume 8097 of LNCS, pages

862–873. Springer, 2013. 2, 54

[21] Akira Nukada, Kento Sato, and Satoshi Matsuoka. Scalable multi-GPU 3-

D FFT for TSUBAME 2.0 Supercomputer. In Proceedings of the Interna-

tional Conf. on High Performance Computing, Networking, Storage and Analy-

sis (SC'12), pages 44:1–44:10, 2012. 22

[22] NVidia. CUFFT :: CUDA Toolkit Documentation, accessed Online Jan 2014.

53

[23] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics.

Journal of Computational Physics, 117:1–19, 1995. 2

[24] Victor Podlozhnyuk. CUDA Samples:: CUDA Toolkit documentation - NVidia's

GPU Merge-sort implementations, accessed Feb 2013. 41

[25] Christoph Rachinger. Scalable computation of long-range potentials for molec-

ular dynamics. Master's thesis, KTH, Numerical Analysis, NA, 2013. 21



[26] D.C. Rapaport. Large-scale Molecular Dynamics Simulation Using Vector and

Parallel Computers. North-Holland, 1988. 20

[27] Christopher I. Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, and

Wen-Mei W. Hwu. GPU acceleration of cutoff pair potentials for molecular mod-

eling applications. In Proceedings of the 5th conference on Computing frontiers,

CF '08, pages 273–282, 2008. 2, 21, 25

[28] E. Rustico, G. Bilotta, G. Gallo, A. Herault, and C. Del Negro. Smoothed

particle hydrodynamics simulations on multi-GPU systems. In Euromicro In-

ternational Conference on Parallel, Distributed and Network-Based Processing,

2012. 25

[29] Tamar Schlick. Molecular Modeling and Simulation: An Interdisciplinary Guide.

Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2002. 1, 11, 20

[30] Tim C. Schroeder. Peer-to-Peer & Unified Virtual Addressing, 2011. XIII, 4, 43

[31] Robert D Skeel, Ismail Tezcan, and David J Hardy. Multiple grid methods for

classical molecular dynamics. Journal of Computational Chemistry, 23(6):673–

684, 2002. 22

[32] John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy,

Leonardo G. Trabuco, and Klaus Schulten. Accelerating molecular modeling

applications with graphics processors. Journal of Computational Chemistry,

28(16):2618–2640, 2007. 2

[33] J.A. van Meel, A. Arnold, D. Frenkel, S.F. Portegies Zwart, and R.G. Belleman.

Harvesting graphics power for MD simulations. Molecular Simulation, 34(3):259–

266, 2008. 20

[34] B. W. Wah, T. S. Huang, A. K. Joshi, D. Moldovan, J. Aloimonos, R. K. Ba-

jcsy, D. Ballard, D. DeGroot, K. DeJong, C. R. Dyer, S. E. Fahlman, R. Grish-

man, L. Hirschman, R. E. Korf, S. E. Levinson, D. P. Miranker, N. H. Morgan,

S. Nirenburg, T. Poggio, E. M. Riseman, C. Stanfill, S. J. Stolfo, S. L. Tani-

moto, and C. Weems. Report on workshop on high performance computing and

communications for grand challenge applications: Computer vision, speech and

natural language processing, and artificial intelligence. IEEE Transactions on

Knowledge and Data Engineering, 5(1):138–154, 1993. 29

[35] Peng Wang. Short-Range Molecular Dynamics on GPU (GTC2010), September

2010. 36, 40

[36] Juekuan Yang, Yujuan Wang, and Yunfei Chen. GPU accelerated molecular dy-

namics simulation of thermal conductivities. Journal of Computational Physics,

221(2):799–804, 2007. 25

[37] Rio Yokota, Jaydeep P. Bardhan, Matthew G. Knepley, L.A. Barba, and

Tsuyoshi Hamada. Biomolecular electrostatics using a fast multipole BEM on

up to 512 GPUs and a billion unknowns. Computer Physics Communications,

182(6):1272–1283, 2011. 22

[38] Gongpu Zhao, Juan R. Perilla, Ernest L. Yufenyuy, Xin Meng, Bo Chen, Jiying

Ning, Jinwoo Ahn, Angela Gronenborn, Klaus Schulten, Christopher Aiken,

and Peijun Zhang. Mature HIV-1 capsid structure by cryo-electron microscopy

and all-atom molecular dynamics. Nature, 497:643–646, 2013. 31

