
Comput. Methods Appl. Mech. Engrg. 258 (2013) 63–80
journal homepage: www.elsevier.com/locate/cma
GPU-acceleration of stiffness matrix calculation and efficient initialization of EFG meshless methods

A. Karatarakis, P. Metsis, M. Papadrakakis
Institute of Structural Analysis and Antiseismic Research, National Technical University of Athens, Zografou Campus, Athens 15780, Greece

Article info

Article history:
Received 5 October 2012
Received in revised form 20 January 2013
Accepted 12 February 2013
Available online 4 March 2013
Keywords:
Meshless methods
Element free Galerkin (EFG)
Preprocessing
Stiffness matrix assembly
Parallel computing
GPU acceleration

Abstract

Meshless methods have a number of virtues in problems concerning crack growth and propagation, large displacements, strain localization and complex geometries, among others. Despite the fact that they do not rely on a mesh, meshless methods require a preliminary step for the identification of the correlation between nodes and Gauss points before building the stiffness matrix. This is implicitly performed with the mesh generation in FEM but must be explicitly done in EFG methods and can be time-consuming. Furthermore, the resulting matrices are more densely populated and the computational cost for the formulation and solution of the problem is much higher than in the conventional FEM. This is mainly attributed to the vast increase in interactions between nodes and integration points due to their extended domains of influence. For these reasons, computing the stiffness matrix in EFG meshless methods is a very computationally demanding task which needs special attention in order to be affordable in real-world applications. In this paper, we address the pre-processing phase, dealing with the problem of defining the necessary correlations between nodes and Gauss points and between interacting nodes, as well as the computation of the stiffness matrix. A novel approach is proposed for the formulation of the stiffness matrix which exhibits several computational merits, one of which is its amenability to parallelization, which allows the utilization of graphics processing units (GPUs) to accelerate computations.
© 2013 Elsevier B.V. All rights reserved.

1. Introduction

In meshless methods (MMs) there is no need to construct a mesh, as in the finite element method (FEM), which is often in conflict with the real physical compatibility condition that a continuum possesses [1]. Moreover, stresses obtained using FEM are discontinuous and less accurate, while a considerable loss of accuracy is observed when dealing with large deformation problems because of element distortion. Furthermore, due to the underlying structure of the classical mesh-based methods, they are not well suited for treating problems with discontinuities that do not align with element edges. MMs were developed with the objective of eliminating part of the above mentioned difficulties [2]. With MMs, manpower time is limited to a minimum due to the absence of a mesh and mesh-related phenomena. Complex geometries are handled easily with the use of scattered nodes.
One of the first and most prominent meshless methods is the element free Galerkin (EFG) method introduced by Belytschko et al. [3]. EFG requires only nodal data; no element connectivity is needed to construct the shape functions. However, a global background cell structure is necessary for the numerical integration.

Moreover, since the number of interactions between nodes and/or integration points is heavily increased, due to large domains of influence, the resulting matrices are more densely populated and the computational cost for the formulation and solution of the problem is much higher than in the conventional FEM [3].
To improve the computational efficiency of MMs, parallel implementations such as the MPI parallel paradigm have been used in large-scale applications [4,5] and several alternative methodologies have been proposed concerning the formulation of the problem. The smoothed FEM (SFEM) [6] couples FEM with meshless methods by incorporating a strain smoothing operation used in the mesh-free nodal integration method. The linear point interpolation method (PIM) [7] obtains the partial derivatives of the shape functions effortlessly due to the local character of the radial basis functions. A coupled EFG/boundary element scheme [8] takes advantage of both the EFG and the boundary element method. Furthermore, solvers which perform an improved factorization of the stiffness matrix and use special algorithms for realizing the matrix–vector multiplication are proposed in [9,10]. Divo and Kassab [11] presented a domain decomposition scheme for a meshless collocation method, where collocation expressions are used at each subdomain with artificially created interfaces. Wang et al. [7] presented a parallel reproducing kernel particle method (RKPM), using a particle overlapping scheme which significantly increases the number of shared particles and the time for communicating
information between them. Recently, a novel approach for reducing the computational cost of EFG methods was proposed, employing domain decomposition techniques on the physical as well as on the algebraic domains [12]. In that work the solution of the resulting algebraic problems is performed with the dual domain decomposition FETI method, with and without overlapping between the subdomains. The non-overlapping scheme has led to a significant decrease of the overall computational cost.
Applications of graphics processing units (GPUs) to scientific computations are attracting a lot of attention due to their low cost in conjunction with their inherently remarkable performance features. Parametric tests on 2D and 3D elasticity problems revealed the potential of the proposed approach as a result of the exploitation of multi-core CPU hardware resources and the intrinsic software and hardware features of the GPUs.
Driven by the demands of the gaming industry, graphics hardware has substantially evolved over the years with remarkable floating point arithmetic performance. In the early years, these operations had to be programmed indirectly, by mapping them to graphic manipulations and using graphic libraries such as openGL and DirectX. This approach of solving general purpose problems is known as general purpose computing on GPUs (GPGPU). GPU programming was greatly facilitated with the initial release of the CUDA-SDK [13–15], which resulted in a rapid development of GPU computing and the appearance of GPU-powered clusters in the Top500 supercomputers [16]. Unlike CPUs, GPUs have an inherent parallel throughput architecture that focuses on executing many concurrent threads slowly, rather than executing a single thread very fast.
Work pertaining to GPUs has extended to a large spectrum of applications even before CUDA made their use easier. A number of studies in engineering applications have been recently reported on a variety of GPU platforms using implicit computational algorithms: in fluid mechanics [17–21], molecular dynamics [22,23], topology optimization [24], wave propagation [25], Helmholtz problems [26], and neurosurgical simulations [27]. Linear algebra applications have also been a topic of scientific interest for GPU implementations. Dense linear algebra algorithms are reported in [28], while a thorough analysis of the algorithmic performance of basic linear algebra operations can be found in [29]. The performance of iterative solvers is analyzed in [30], and a parametric study of the PCG solver is performed on multi-GPU CUDA clusters in [31,32]. A hybrid CPU–GPU implementation of domain decomposition methods is presented in [33], where speedups of the order of 40× have been achieved. It should be noted that all implementations prior to CUDA 1.3 were performed in single precision, since support for double-precision floating point operations was added in CUDA 1.3. This has caused some misinterpretations in a number of published comparisons between the GPU and the CPU, usually in favor of the GPU.
The present work aims at a drastic reduction of the computational effort required for the initialization phase and for assembling the stiffness matrix by implementing a novel node pair-wise procedure. It is believed that with the proposed computational handling of the pre-processing phase and the accelerated formulation of the stiffness matrix, together with recent improvements on the solution of the resulting algebraic equations [12], MMs are becoming computationally competitive and are expected to demonstrate their inherent advantages in solving real, large-scale engineering problems.

2. Basic ingredients of the meshless EFG method


The approximation of a scalar function u in terms of Lagrangian coordinates in the meshless EFG method can be written as

u(x, t) = \sum_{i \in S} \Phi_i(x) u_i(t)   (1)

where \Phi_i are the shape functions, u_i are the nodal values at particle i located at position x_i, and S is the set of nodes i for which \Phi_i(x) \neq 0. The shape functions in Eq. (1) are only approximants and not interpolants, since u_i \neq u(x_i).
The shape functions \Phi_i are obtained from the weight coefficients w_i, which are functions of a distance parameter r = \lVert x_i - x \rVert / d_i, where d_i defines the domain of influence (doi) of node i. The domain of influence is crucial to solution accuracy, stability and computational cost, as it defines the degree of continuity between the nodes and the bandwidth of the system matrices.
The approximation u^h is expressed as a polynomial of length m with non-constant coefficients. The local approximation around a point \bar{x}, evaluated at a point x, is given by

u_L^h(x, \bar{x}) = p^T(x)\, a(\bar{x})   (2)

where p(x) is a complete polynomial of length m and a(\bar{x}) contains non-constant coefficients that depend on \bar{x}:

a(\bar{x}) = [\, a_0(\bar{x})\;\; a_1(\bar{x})\;\; a_2(\bar{x})\;\; \ldots\;\; a_m(\bar{x}) \,]^T   (3)

In two-dimensional problems, the linear basis p(x) is given by

p^T(x) = [\, 1\;\; x\;\; y \,], \quad m = 3   (4)

and the quadratic basis by

p^T(x) = [\, 1\;\; x\;\; y\;\; x^2\;\; y^2\;\; xy \,], \quad m = 6   (5)

The unknown parameters a_j(x) are determined at any point x by minimizing a functional J(x) defined by a weighted average over all nodes i \in 1, \ldots, n:

J(\bar{x}) = \sum_{i=1}^{n} w(\bar{x} - x_i)\, [\, u_L^h(x_i, \bar{x}) - u_i \,]^2 = \sum_{i=1}^{n} w(\bar{x} - x_i)\, [\, p^T(x_i)\, a(\bar{x}) - u_i \,]^2   (6)

where the weighted terms are the differences between the local approximation u_L^h(x_i, \bar{x}) and the nodal values u_i, while the weight function satisfies the condition w(x - x_i) \neq 0. An extremum of J(x) with respect to the coefficients a_j(x) can be obtained by setting the derivative of J with respect to a(x) equal to zero. This condition gives the following relation:

A(x)\, a(x) = W(x)\, u   (7)

where

A(x) = \sum_{i=1}^{n} w(x - x_i)\, p(x_i)\, p^T(x_i)   (8)

W(x) = [\, w(x - x_1)\, p(x_1) \;\; w(x - x_2)\, p(x_2) \;\; \ldots \;\; w(x - x_n)\, p(x_n) \,]   (9)

Solving for a(x) in Eq. (7) and substituting into Eq. (2), the approximants u^h can be defined as follows:

u^h(x) = p^T(x)\, A^{-1}(x)\, W(x)\, u   (10)

which, together with Eq. (1), leads to the derivation of the shape function \Phi_i associated with node i at point x:

\Phi_i(x) = p^T(x)\, A^{-1}(x)\, W_i(x)   (11)

A solution of a local problem A(x) z = p(x) of size m \times m is performed whenever the shape functions are to be evaluated. This constitutes a drawback of moving least squares-based (MLS-based) MMs, since the computational cost can be substantial and it is possible for the moment matrix A(x) to be ill-conditioned [2].


The Galerkin weak form of the above formulation gives the discrete algebraic equation

K u = f   (12)

with

K_{ij} = \int_{\Omega} B_i^T E B_j \, d\Omega   (13)

f_i = \int_{\Gamma_t} \Phi_i \bar{t} \, d\Gamma + \int_{\Omega} \Phi_i b \, d\Omega   (14)

In 2D problems the matrix B is given by

B_i = \begin{bmatrix} \Phi_{i,x} & 0 \\ 0 & \Phi_{i,y} \\ \Phi_{i,y} & \Phi_{i,x} \end{bmatrix}   (15)

and in 3D problems by

B_i = \begin{bmatrix} \Phi_{i,x} & 0 & 0 \\ 0 & \Phi_{i,y} & 0 \\ 0 & 0 & \Phi_{i,z} \\ \Phi_{i,y} & \Phi_{i,x} & 0 \\ 0 & \Phi_{i,z} & \Phi_{i,y} \\ \Phi_{i,z} & 0 & \Phi_{i,x} \end{bmatrix}   (16)

Due to the lack of the Kronecker delta property of the shape functions, the essential boundary conditions cannot be imposed in the same way as in FEM. Several techniques are available, such as Lagrange multipliers, penalty methods and EFG–FEM coupling.
For the integration of Eq. (13), virtual background cells are considered by dividing the problem domain into integration cells over which a Gaussian quadrature is performed:

\int_{\Omega} f(x) \, d\Omega = \sum_{\xi} f(\xi)\, \det J_{\xi}(\xi)   (17)

where \xi are the local coordinates and \det J_{\xi}(\xi) is the determinant of the Jacobian.

Fig. 1. Domain of influence of a Gauss point in (a) EFG; (b) FEM, for the same number of nodes and Gauss points.

3. Gauss point-wise formulation of the stiffness matrix

The stiffness matrix of Eq. (13) is usually formed by adding the contributions of the products B_G^T E B_G of all Gauss points G to the stiffness matrix according to the formula:

K = \sum_{G} B_G^T E B_G = \sum_{G} Q_G   (18)

where the deformation matrix B_G is computed at the corresponding Gauss point. The summation is performed for each Gauss point and affects all nodes within its domain of influence. Compared to FEM, the amount of calculations for performing this task is significantly higher, since the domains of influence of Gauss points are much larger than the corresponding domains in FEM, as is schematically shown in Fig. 1 for a domain discretized with EFG and FEM having an equal number of nodes and Gauss points. Throughout this paper we do not address the issue of the accuracy obtained by the two methods with the same number of nodes and Gauss points.
In FEM, each Gauss point is typically involved in element-level computations for the formation of the element stiffness matrix, which is then added to the appropriate positions of the global stiffness matrix. Moreover, the shape functions and their derivatives are predefined for each element type and need to be evaluated on all combinations of nodes and Gauss points within each element. In EFG methods, however, the contribution of each Gauss point is directly added to the global stiffness matrix, while the shape functions are not predefined and span across larger domains with a significantly higher number of Gauss point-node interactions.
Although in EFG methods there is no need to construct a mesh, the correlation between nodes and Gauss points needs to be defined. This preliminary step before building the stiffness matrix is implicitly performed with the mesh creation in FEM but must be explicitly done in EFG methods, and can be time-consuming if not appropriately handled. For the aforementioned reasons, computing the stiffness matrix in EFG meshless methods is a very computationally demanding task which needs special attention in order to be affordable in real-world applications.
3.1. Node-Gauss point correlation
In the initialization step, the basic entities are created, namely the nodes and the Gauss points together with their domains of influence.

Table 1
Computing time required for all node-Gauss point correlations.

Example | Nodes   | Gauss points | Search time (s): Global serial | Regioned serial | Regioned parallel
2D-1    | 25,921  | 102,400      | 23   | 1.3  | 0.5
2D-2    | 75,625  | 300,304      | 300  | 3.4  | 1.0
2D-3    | 126,025 | 501,264      | 836  | 5.4  | 1.4
3D-1    | 9,221   | 64,000       | 7    | 3.7  | 0.9
3D-2    | 19,683  | 140,608      | 45   | 7.8  | 1.7
3D-3    | 35,937  | 262,144      | 157  | 15.7 | 3.3

The domains of influence define the correlation between nodes and Gauss points. With the absence of an element mesh, the correlation of Gauss points and nodes must be established explicitly at the initialization phase.
A first approach is to search over the global physical domain for the Gauss points belonging to the domain of influence of each node. This approach performs a large amount of unnecessary calculations, since the domains of influence are localized areas. In order to reduce the time spent for identifying the interaction between Gauss points and nodes, the search can be performed on Gauss regions.
A rectangular grid is created and each of the regions so defined is referred to as a Gauss region. Each Gauss region contains a group of Gauss points (Fig. 3). Given the coordinates of a particular node, it is immediately known in which region it is located. The search per node is conducted over the neighboring Gauss regions only, instead of the global domain. Thus, regardless of the size of the problem, the search per node is restricted to a small number of Gauss regions.
In order to quickly decide whether a neighboring Gauss region will be searched or not, the centroid of each Gauss region is used as a representative point for the whole region. If the centroid of a Gauss region lies inside the domain of influence of a node, then all Gauss points of that region will be processed for possible interaction with the node; otherwise they will be ignored. However, there may be cases of Gauss points which are inside the domain of influence of a node but are ignored because the centroid of their Gauss region lies outside the domain of influence, as can be seen in Fig. 3. In order to account for such cases, the centroids are tested against an extended domain of influence. The extended domain of influence is only used for the centroids, so the contribution of Gauss points is evaluated based on the actual domain of influence of the node.
The extended domain of influence should be large enough to include the centroids of regions that would otherwise be missed and small enough to avoid false positives, i.e. regions that test true but contain no influencing Gauss points. In order to accomplish this, the maximum distance between the centroid and a point on the border of the respective Gauss region is computed. The extended domain of influence is then defined by adding this distance to the initial domain of influence. A Gauss region can be formed from a cluster of Gauss cells or it can be totally unrelated to the Gauss cells.
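To make the region-based search concrete, the following host-side C++ sketch (compilable with the CUDA toolchain used later in the paper) illustrates the centroid test with the extended domain of influence on an assumed uniform grid of Gauss regions; the data structures (Point, Node, RegionGrid) are hypothetical and not the authors' implementation.

```cuda
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified 2D data structures (illustrative only).
struct Point { double x, y; };
struct Node  { Point p; double doi; std::vector<std::size_t> gaussIds; };

// Gauss regions laid out on a uniform nx-by-ny grid of spacing h, starting at 'origin'.
struct RegionGrid {
    Point origin; double h; int nx, ny;
    std::vector<std::vector<std::size_t>> gauss;  // gauss[iy*nx+ix]: Gauss points of region (ix, iy)
};

static double dist(const Point& a, const Point& b) { return std::hypot(a.x - b.x, a.y - b.y); }

// For each node, visit only the regions whose centroid lies inside the extended domain
// of influence (actual doi + maximum centroid-to-border distance), then test their
// Gauss points against the actual domain of influence of the node.
void correlate(std::vector<Node>& nodes, const RegionGrid& grid,
               const std::vector<Point>& gaussPts)
{
    const double halfDiag = 0.5 * grid.h * std::sqrt(2.0);   // centroid-to-corner distance
    for (auto& node : nodes) {
        const double ext  = node.doi + halfDiag;              // extended doi for the centroid test
        const int    span = static_cast<int>(std::ceil(ext / grid.h));
        const int    ix0  = static_cast<int>((node.p.x - grid.origin.x) / grid.h);
        const int    iy0  = static_cast<int>((node.p.y - grid.origin.y) / grid.h);
        for (int iy = std::max(0, iy0 - span); iy <= std::min(grid.ny - 1, iy0 + span); ++iy)
            for (int ix = std::max(0, ix0 - span); ix <= std::min(grid.nx - 1, ix0 + span); ++ix) {
                const Point c{grid.origin.x + (ix + 0.5) * grid.h,
                              grid.origin.y + (iy + 0.5) * grid.h};
                if (dist(node.p, c) > ext) continue;           // region cannot contribute
                for (std::size_t g : grid.gauss[iy * grid.nx + ix])
                    if (dist(node.p, gaussPts[g]) <= node.doi)
                        node.gaussIds.push_back(g);            // node -> influencing Gauss point
            }
    }
}
```

Since every node inspects only a bounded number of neighboring regions, the cost per node is independent of the problem size, which is the source of the O(n) behavior discussed below.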
The time required to define the correlations in three 2D and three 3D elasticity problems with varying numbers of degrees of freedom (dof) is shown in Table 1. The 2D problems correspond to square domains and the 3D problems to cubic domains, with rectangular domains of influence (doi) with dimensionless parameter 2.5. These domains maximize the number of correlations and consequently the computational cost for the given number of nodes. In these examples, each Gauss region is equivalent to a single Gauss cell. Thus, in the 2D examples each Gauss cell contains 16 Gauss points (4 × 4 rule) and in the 3D examples 64 Gauss points (4 × 4 × 4 rule). The examples are run on a Core i7-980X, which has six physical cores (12 logical cores) at 3.33 GHz and 12 MB cache. Each node can define its correlations independently of the other nodes, which makes the procedure amenable to parallel computation.
When each node checks all Gauss points of the domain, the time complexity is O(n·n_G), where n is the number of nodes and n_G is the number of Gauss points. As a result, the time needed to define the correlations with a global search quickly becomes prohibitive. In the case of Gauss regions, each node needs to check a constant number of Gauss points regardless of the size of the problem, so the time complexity is O(n).
With the implementation of Gauss regions, the initialization phase of EFG methods in complex domains takes less time than in FEM, since the generation of a finite element mesh can sometimes be laborious and time-consuming [34]. At the end of the initialization step each node has a list of influencing Gauss points and each Gauss point has a list of influenced nodes.
3.2. Comparison to FEM for equal number of nodes and Gauss points

Table 2 shows the number of Gauss points influencing a single node and the number of nodes influenced by a single Gauss point in typical 2D and 3D problems. The numbers displayed for EFG correspond to the majority of nodes (side/corner nodes or Gauss points have a lower number of influences).
Table 3 shows the total number of correlations for the six examples considered. The significantly higher number in EFG methods is a direct consequence of the larger domain of influence, as shown in Figs. 1 and 2.
3.3. Computation of stiffness contribution for each Gauss point
3.3.1. Shape function derivative calculation
The shape functions in the EFG formulation span across larger domains of influence than in FEM and their evaluation is performed over a large number of correlated Gauss point–node combinations.
Table 2
Influences per node and Gauss point for EFG and FEM.

                                  | 2D: EFG (doi = 2.5) | 2D: FEM (QUAD4) | 3D: EFG (doi = 2.5) | 3D: FEM (HEXA8)
Gauss points influencing a node   | 100                 | 16              | 1000                | 64
Nodes influenced by a Gauss point | 25                  | 4               | 125                 | 8

Table 3
Total number of node-Gauss point correlations in EFG and FEM.

Example | Nodes   | Gauss points | Total correlations: EFG | FEM       | Ratio
2D-1    | 25,921  | 102,400      | 2,534,464               | 409,600   | 6.2
2D-2    | 75,625  | 300,304      | 7,463,824               | 1,201,216 | 6.2
2D-3    | 126,025 | 501,264      | 12,475,024              | 2,005,056 | 6.2
3D-1    | 9,221   | 64,000       | 7,077,888               | 512,000   | 13.8
3D-2    | 19,683  | 140,608      | 16,003,008              | 1,124,864 | 14.2
3D-3    | 35,937  | 262,144      | 30,371,328              | 2,097,152 | 14.5

For the evaluation of the deformation matrix B, the shape functions and their derivatives are calculated with the following procedure for each Gauss point: (i) calculate the weight function coefficients w, w_{,x}, w_{,y}, w_{,z} for each node in the domain of influence of the Gauss point; (ii) calculate the moment matrix A of Eq. (8) and its derivatives A_x, A_y, A_z at the Gauss point, with contributions from all influenced nodes; (iii) use the moment matrix and its derivatives along with the weight coefficients to calculate the shape function and derivative values for all influenced nodes of the Gauss point.
The moment matrix and its derivatives are functions of the polynomial p, which is a complete polynomial of order q for any material point of the domain. In the case of a linear basis (Eq. (4)) the moment matrix A and its derivatives are 3 × 3 or 4 × 4 matrices for 2D and 3D elasticity problems, respectively. The contribution of each node to the moment matrix and its derivatives is related to the product p p^T. According to Eq. (8), the moment matrix and its derivatives are given by

A = \sum_i w_i (p p^T)_i, \quad A_x = \sum_i w_{,x\,i} (p p^T)_i, \quad A_y = \sum_i w_{,y\,i} (p p^T)_i, \quad A_z = \sum_i w_{,z\,i} (p p^T)_i, \quad \forall i \in \mathrm{Infl.Nodes}   (19)

For the 2D linear basis, the moment matrix thus consists of the following terms

A = \begin{bmatrix} \sum_i w_i & \sum_i w_i x_i & \sum_i w_i y_i \\ \sum_i w_i x_i & \sum_i w_i x_i^2 & \sum_i w_i x_i y_i \\ \sum_i w_i y_i & \sum_i w_i x_i y_i & \sum_i w_i y_i^2 \end{bmatrix}   (20)

while similar expressions define its derivatives A_x, A_y, A_z.
The shape function value \Phi_i(x) associated with node i at point x is expressed according to Eq. (11), while the derivatives \Phi_{i,x}, \Phi_{i,y}, \Phi_{i,z} are given by

\Phi_{i,x} = w_{i,x}\, p_G^T A^{-1} p_i + w_i \{0\;1\;0\;0\} A^{-1} p_i - w_i\, p_G^T A^{-1} A_x A^{-1} p_i
\Phi_{i,y} = w_{i,y}\, p_G^T A^{-1} p_i + w_i \{0\;0\;1\;0\} A^{-1} p_i - w_i\, p_G^T A^{-1} A_y A^{-1} p_i
\Phi_{i,z} = w_{i,z}\, p_G^T A^{-1} p_i + w_i \{0\;0\;0\;1\} A^{-1} p_i - w_i\, p_G^T A^{-1} A_z A^{-1} p_i   (21)

where the polynomials p_G and p_i are evaluated at the Gauss point G and at the influenced node i, respectively.
In Eqs. (11) and (21) the following operations are repeated for all influenced nodes of a Gauss point:

p_A^T = p_G^T A^{-1}, \quad p_{Ax}^T = p_A^T A_x A^{-1}, \quad p_{Ay}^T = p_A^T A_y A^{-1}, \quad p_{Az}^T = p_A^T A_z A^{-1}   (22)

These matrix–vector multiplications can be reused in several calculations for every influenced node of a particular Gauss point. For a large moment matrix A, the direct computation of its inverse is burdensome, so an LU factorization is typically performed [2]. In this implementation, an explicit algorithm is used for the inversion of the moment matrix in order to minimize the calculations.
For each influenced node i, the following three groups of calculations are then performed:

\Phi_i = w_i\, p_A^T p_i
\Phi_{i,x}^1 = w_{i,x}\, p_A^T p_i, \quad \Phi_{i,x}^2 = w_i \{0\;1\;0\;0\} A^{-1} p_i, \quad \Phi_{i,x}^3 = w_i\, p_{Ax}^T p_i
\Phi_{i,y}^1 = w_{i,y}\, p_A^T p_i, \quad \Phi_{i,y}^2 = w_i \{0\;0\;1\;0\} A^{-1} p_i, \quad \Phi_{i,y}^3 = w_i\, p_{Ay}^T p_i
\Phi_{i,z}^1 = w_{i,z}\, p_A^T p_i, \quad \Phi_{i,z}^2 = w_i \{0\;0\;0\;1\} A^{-1} p_i, \quad \Phi_{i,z}^3 = w_i\, p_{Az}^T p_i, \quad \forall i \in \mathrm{Infl.Nodes}   (23)
3.3.2. B^T E B calculation

A fast computation of the matrix product

Q_G = B_G^T E B_G   (24)

of Eq. (18) is important because it is repeated at each integration point. This may not be so critical in FEM compared to the total simulation time, but it is very important in EFG meshless methods, where the number of Gauss points and the number of influenced nodes per Gauss point are both significantly greater.
The computations of Eq. (24) can be broken into smaller operations for each combination of influenced nodes i, j belonging to the domain of influence of the Gauss point:

Q_{ij} = B_i^T E B_j = Q_{ji}^T   (25)

Once a submatrix Q_{ij} is calculated, it is added to the corresponding positions of K (Eq. (18)). The computation of Q_{ij} together with the associated indexing required to access the entries of K dominates the total effort for the formulation of the global stiffness matrix [35].
The Q_{ij} for an isotropic material in 3D elasticity takes the form

Q_{ij} = \underset{3\times 6}{B_i^T}\; \underset{6\times 6}{E}\; \underset{6\times 3}{B_j} = \begin{bmatrix} \Phi_{i,x}\Phi_{j,x} M + \Phi_{i,y}\Phi_{j,y}\mu + \Phi_{i,z}\Phi_{j,z}\mu & \Phi_{i,x}\Phi_{j,y}\lambda + \Phi_{i,y}\Phi_{j,x}\mu & \Phi_{i,x}\Phi_{j,z}\lambda + \Phi_{i,z}\Phi_{j,x}\mu \\ \Phi_{i,y}\Phi_{j,x}\lambda + \Phi_{i,x}\Phi_{j,y}\mu & \Phi_{i,y}\Phi_{j,y} M + \Phi_{i,x}\Phi_{j,x}\mu + \Phi_{i,z}\Phi_{j,z}\mu & \Phi_{i,y}\Phi_{j,z}\lambda + \Phi_{i,z}\Phi_{j,y}\mu \\ \Phi_{i,z}\Phi_{j,x}\lambda + \Phi_{i,x}\Phi_{j,z}\mu & \Phi_{i,z}\Phi_{j,y}\lambda + \Phi_{i,y}\Phi_{j,z}\mu & \Phi_{i,z}\Phi_{j,z} M + \Phi_{i,y}\Phi_{j,y}\mu + \Phi_{i,x}\Phi_{j,x}\mu \end{bmatrix}   (26)

E and B_i/B_j are never formed explicitly. Instead, three values for E, namely the two Lamé parameters \lambda, \mu and the P-wave modulus M = 2\mu + \lambda, and three values for B_i, specifically \Phi_{i,x}, \Phi_{i,y}, \Phi_{i,z}, are stored. Since some of the multiplications are repeated, the calculations in Eq. (26) can be efficiently performed with 30 multiplications and 12 additions.

Fig. 2. Domain of influence of a node in (a) EFG; (b) FEM, for the same number of nodes and Gauss points.
Fig. 3. Identifying the influencing Gauss points of a node.
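As an illustration of how Eq. (26) can be evaluated from the six stored derivatives and the three stored material constants while reusing the repeated products, consider the following sketch (quadrature weight omitted, names hypothetical); it is not the authors' exact sequence of 30 multiplications and 12 additions.

```cuda
// Sketch of Eq. (26): the 3x3 block Q_ij for an isotropic material, computed directly
// from the stored shape function derivatives and the three material constants.
struct Deriv3 { double x, y, z; };           // Phi_{i,x}, Phi_{i,y}, Phi_{i,z}

void computeQij(const Deriv3& di, const Deriv3& dj,
                double lambda, double mu, double M,   // M = 2*mu + lambda (P-wave modulus)
                double Q[3][3])
{
    // Reused products Phi_{i,a} * Phi_{j,b}
    const double xx = di.x * dj.x, yy = di.y * dj.y, zz = di.z * dj.z;
    const double xy = di.x * dj.y, yx = di.y * dj.x;
    const double xz = di.x * dj.z, zx = di.z * dj.x;
    const double yz = di.y * dj.z, zy = di.z * dj.y;

    Q[0][0] = xx * M + (yy + zz) * mu;   Q[0][1] = xy * lambda + yx * mu;    Q[0][2] = xz * lambda + zx * mu;
    Q[1][0] = yx * lambda + xy * mu;     Q[1][1] = yy * M + (xx + zz) * mu;  Q[1][2] = yz * lambda + zy * mu;
    Q[2][0] = zx * lambda + xz * mu;     Q[2][1] = zy * lambda + yz * mu;    Q[2][2] = zz * M + (yy + xx) * mu;
}
```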
3.3.3. Summation of Gauss point contributions

Contrary to FEM, where the stiffness matrices are built at the element level by integrating over the element Gauss points before assembling the global stiffness matrix, the absence of elements in EFG meshless methods requires each Gauss point to directly append its contribution to the global stiffness matrix. Since there are considerably more Gauss points in EFG and each Gauss point influences many more nodes, indexing time during the creation of the stiffness matrix is an important factor in EFG simulations. Thus, an efficient implementation for building the stiffness matrix in sparse format is needed in the Gauss point-wise approach. The procedure requires updating previous values of the matrix, so a sparse matrix type that allows lookups is needed. Updates happen a large number of times for every non-zero element of the matrix, so they consume a large portion of the total effort. A sparse matrix format suitable for this method is the dictionary of keys (DOK) [36], and our implementation is based on hash tables [37].

Table 4
Computing time for the formulation of the stiffness matrix in the CPU implementations of the Gauss point-wise approach.

Example | dof     | Gauss points | Time (s): Conventional GP | Proposed GP | Ratio
2D-1    | 51,842  | 102,400      | 107    | 12    | 9
2D-2    | 152,250 | 300,304      | 313    | 34    | 9
2D-3    | 252,050 | 501,264      | 502    | 53    | 9
3D-1    | 27,783  | 64,000       | 2,374  | 241   | 10
3D-2    | 59,049  | 140,608      | 6,328  | 616   | 10
3D-3    | 107,811 | 262,144      | 13,302 | 1,165 | 11

Table 5
Comparison of the proposed Gauss point-wise method for the formulation of the stiffness matrix when using sparse and skyline format.

Example | dof     | Gauss points | Time (s): Sparse | Skyline | Ratio
2D-1    | 51,842  | 102,400      | 12    | 7   | 1.6
2D-2    | 152,250 | 300,304      | 34    | 20  | 1.7
2D-3    | 252,050 | 501,264      | 53    | 31  | 1.7
3D-1    | 27,783  | 64,000       | 241   | 68  | 3.5
3D-2    | 59,049  | 140,608      | 616   | 174 | 3.5
3D-3    | 107,811 | 262,144      | 1,165 | 329 | 3.5

Table 6
Number of stored stiffness elements when using skyline and sparse format.

Example | dof     | Gauss points | Stored elements: Skyline | Sparse     | Ratio
2D-1    | 51,842  | 102,400      | 66,221,715               | 4,110,003  | 16
2D-2    | 152,250 | 300,304      | 331,150,875              | 12,129,675 | 27
2D-3    | 252,050 | 501,264      | 713,161,275              | 20,287,275 | 35
3D-1    | 27,783  | 64,000       | 136,041,444              | 21,734,532 | 6
3D-2    | 59,049  | 140,608      | 486,852,444              | 49,932,576 | 10
3D-3    | 107,811 | 262,144      | 1,343,011,428            | 95,696,604 | 14
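A minimal sketch of such a DOK accumulation, backed by a hash table keyed on the (row, column) pair, is given below; the key packing and block scattering shown here are illustrative assumptions, not the authors' data structure.

```cuda
#include <cstdint>
#include <unordered_map>

// Dictionary-of-keys sparse matrix backed by a hash table. Every addition of a Q_ij
// block requires a lookup, which is what makes indexing time dominant in the Gauss
// point-wise approach.
struct DokMatrix {
    std::unordered_map<std::uint64_t, double> data;
    static std::uint64_t key(std::uint32_t row, std::uint32_t col) {
        return (static_cast<std::uint64_t>(row) << 32) | col;   // pack (row, col) into one key
    }
    void add(std::uint32_t row, std::uint32_t col, double value) {
        data[key(row, col)] += value;                            // lookup + update per contribution
    }
};

// Adds one Gauss point contribution Q_ij (dof x dof block) for nodes i and j.
void scatterBlock(DokMatrix& K, std::uint32_t nodeI, std::uint32_t nodeJ,
                  const double Q[3][3], int dofPerNode = 3)
{
    for (int a = 0; a < dofPerNode; ++a)
        for (int b = 0; b < dofPerNode; ++b)
            K.add(nodeI * dofPerNode + a, nodeJ * dofPerNode + b, Q[a][b]);
}
```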
3.4. Performance of the Gauss point-wise approach

The performance of the Gauss point-wise approach in the CPU is shown in Table 4. The proposed Gauss point-wise (GP) approach is compared with the conventional one without the improvements described in this section.
The Gauss point-wise approach is heavily influenced by indexing time, especially in the 3D examples. A matrix format with better indexing properties would benefit the Gauss point-wise approach. The quick identification of interacting node pairs described in Section 4.1 allows the fast prediction of the non-zero coefficients of the stiffness matrix, as demonstrated in Table 9. This leads to the calculation of the indexes of the skyline format of the matrix. The skyline format exhibits faster indexing times but increased memory requirements compared to a sparse matrix format, which contains only non-zero elements.
Table 5 compares the proposed Gauss point-wise approach for building the stiffness matrix when using the sparse and the skyline format. The difference highlights the importance of indexing time in EFG methods, where the stiffness matrix is accessed a large number of times during its formulation.
However, the skyline format stores a higher number of elements, as shown in Table 6, and thus requires a larger amount of memory. Note that the skyline format depends on the numbering of the nodes in the domain, and an ideal numbering is used in the presented examples. This dependency may lead to even more stored zeros and further exacerbate the memory requirements of the skyline format, whereas the sparse format always stores the same number of elements regardless of numbering.
4. Node pair-wise formulation of the stiffness matrix

An alternative way to perform the computation of the global stiffness matrix is the proposed node pair-wise approach. The computation of the global stiffness coefficient K_{ij} is performed for all interacting node pairs i–j and is formed from the contributions of the shared Gauss points of their domains of influence. Fig. 4 depicts two interacting nodes, as a result of having common Gauss points in the intersection of their domains of influence, and one node that does not interact with the other two.

4.1. Interacting node pairs and their shared Gauss points

The interacting node pairs approach requires an extra initialization step for identifying the interacting node pairs and their shared Gauss points. The identification of the Gauss points associated with each interacting node pair is beneficial since it accelerates the computation of the stiffness matrix and all node pair related calculations. More importantly, it also enables an efficient parallel implementation and, in particular, the utilization of massively parallel processing, including GPUs.
In FEM the nodes interact through neighboring elements only and thus the interacting node pairs can be easily defined from the element-node connectivity (Fig. 5b). In EFG meshless methods, however, a node pair contributes non-zero entries to the stiffness matrix, and is therefore active, if there is at least one Gauss point whose domain of influence includes both nodes (Fig. 5a). The naive approach is to look at all possible combinations of node pairs, find their shared Gauss points and keep those node pairs that are interacting, together with the corresponding shared Gauss points. The shared Gauss points are located in the intersection of the domains of influence of the two interacting nodes (Fig. 4). This approach, however, takes a prohibitive amount of time because it needs to calculate the shared Gauss points for all possible n(n + 1)/2 combinations of node pairs, where n is the number of nodes. Table 7 shows all possible combinations of node pairs and those that are interacting, as well as the associated computing time for a naive identification.
Thus, the identification of the shared Gauss points is expensive, unless the unnecessary searches for Gauss points of non-interacting nodes are avoided. This is accomplished by first identifying the interacting nodes. As in the naive approach, all n(n + 1)/2 combinations can be checked and, if there is at least one Gauss point in common, the node pair is marked as interacting. However, this is still an O(n²) process, so it does not scale well and quickly grows into an unacceptable amount of time.


Fig. 4. Intersection of domains of influence.
Fig. 5. Interacting nodes: (a) EFG; (b) FEM.

By taking advantage of the previous initialization step described in Section 3.1, the identification of interacting node pairs can be accelerated considerably. Each node has a list of influencing Gauss points and each Gauss point has a list of influenced nodes. Therefore, each node looks for interacting nodes in the lists of influenced nodes of its Gauss points.

Table 7
Computing time required for a naive identification of interacting nodes and their shared Gauss points.

Example | Nodes   | All combinations | Interacting | Time (s)
2D-1    | 25,921  | 335,962,081      | 1,033,981   | 771
2D-2    | 75,625  | 2,859,608,125    | 3,051,325   | 6,908
2D-3    | 126,025 | 7,941,213,325    | 5,103,325   | 23,380
3D-1    | 9,221   | 42,518,031       | 2,418,035   | 608
3D-2    | 19,683  | 193,720,086      | 5,554,625   | 3,021
3D-3    | 35,937  | 645,751,953      | 10,644,935  | 16,290

Fig. 6. Identifying interacting node pairs for node A. A, B, C, D, E represent nodes whereas i, j, k represent Gauss points.
Fig. 7. Identifying interacting node pairs by considering Gauss points near the border of the domain of influence.

Table 8
Computing time for the identification of interacting nodes.

Example | Time (s): Serial | Parallel
2D-1    | 1.5  | 0.2
2D-2    | 4.5  | 0.7
2D-3    | 9.8  | 1.6
3D-1    | 20.1 | 2.8
3D-2    | 42.6 | 5.6
3D-3    | 85.6 | 11.2

Table 9
Computing time for the identification of interacting nodes by only inspecting Gauss points near the border.

Example | Time (s): Serial | Parallel
2D-1    | 0.2 | <0.1
2D-2    | 0.5 | <0.1
2D-3    | 0.8 | <0.1
3D-1    | 0.5 | <0.1
3D-2    | 0.9 | 0.2
3D-3    | 1.6 | 0.3


Fig. 6 shows node A, which is influenced by Gauss points i, j and k. Each of these Gauss points influences various nodes, including node A. Those nodes are guaranteed to interact with A, since there is at least one Gauss point in common between them. In Fig. 6 the interactions are: A–A, A–B, A–C, A–D and A–E.
The corresponding computing times for this process are shown in Table 8. As previously, the examples are run on a Core i7-980X, which has six physical cores (12 logical cores) at 3.33 GHz. Each node can search for interacting nodes independently of the other nodes, so parallelism offers very good acceleration.
With this approach the identification of interacting nodes is improved, but it can be further accelerated by noting that an interacting node may appear in the lists of several Gauss points of A, as does node B in Fig. 6. Since the number of influencing Gauss points of a node is large (1000 for the majority of nodes in our 3D examples), there will be a large number of duplicates in the process, which are discarded. To reduce the number of duplicates, we only inspect those Gauss points that are near the border of the domain of influence of the node (Fig. 7). These Gauss points define the interactions with nodes further away while still including all closer nodes. This considerably reduces the time, as can be seen in Table 9.
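The list-based identification can be sketched as follows, assuming the node-to-Gauss-point and Gauss-point-to-node lists produced by the initialization step; restricting the outer loop to the border Gauss points (Fig. 7) simply changes which list is supplied. The layout and names are illustrative only, not the authors' implementation.

```cuda
#include <cstddef>
#include <unordered_set>
#include <vector>

// Each node scans the influenced-node lists of its influencing Gauss points and records
// every node it meets; duplicates are discarded by the set. pairs[i] then holds all nodes
// j >= i that interact with node i (including the i-i self pair).
std::vector<std::vector<std::size_t>> findInteractingPairs(
    const std::vector<std::vector<std::size_t>>& gaussOfNode,   // node -> influencing Gauss points
    const std::vector<std::vector<std::size_t>>& nodesOfGauss)  // Gauss point -> influenced nodes
{
    const std::size_t n = gaussOfNode.size();
    std::vector<std::vector<std::size_t>> pairs(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::unordered_set<std::size_t> found;
        for (std::size_t g : gaussOfNode[i])          // ideally only the border Gauss points
            for (std::size_t j : nodesOfGauss[g])
                if (j >= i) found.insert(j);          // keep each unordered pair once
        pairs[i].assign(found.begin(), found.end());
    }
    return pairs;
}
```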

Table 10
Computing time to identify the shared Gauss points of an interacting node pair.

Example | Time (s): Serial | Parallel
2D-1    | 2.1   | 0.4
2D-2    | 6.1   | 1.2
2D-3    | 8.8   | 1.5
3D-1    | 46.6  | 7.4
3D-2    | 135.6 | 18.8
3D-3    | 315.7 | 45.8

Following the identification of the interacting node pairs, the determination of shared Gauss points is performed the least possible number of times, i.e. only once for every interacting node pair, in contrast to the n(n + 1)/2 times of the naive approach. This leads to a vast reduction of the required computing time compared to the naive approach (Table 7), as can be seen in Table 10.
For further improvement, regioning (Fig. 8) can be utilized; the results are shown in Table 11. The Gauss regions may be the same as those in the initialization phase (Section 3.1) or they can be different. Shared Gauss points are only searched within regions shared by both nodes of the pair. In both intersection identifications, with and without regions, each node pair can identify its shared Gauss points independently of other node pairs, so parallelism offers very good acceleration, as shown in Tables 10 and 11.
In the 2D examples considered, each region has 16 Gauss points and the results are slightly worse with regioning, because skipping 16 Gauss points per skipped region was not enough to compensate for the added overhead. A higher number of Gauss points per region eventually makes regioning worthwhile in the 2D examples too.

Fig. 8. Region-wise search for interacting nodes. Only the shaded regions are inspected for shared Gauss points.

Table 11
Computing time to identify the shared Gauss points of an interacting node pair with regioning.

Example | Time (s): Serial | Parallel
2D-1    | 2.4   | 0.6
2D-2    | 6.8   | 1.6
2D-3    | 11.0  | 2.8
3D-1    | 24.9  | 4.8
3D-2    | 57.9  | 10.7
3D-3    | 118.0 | 22.4

Table 12
Number of interacting node pairs in EFG and FEM.

Example | Nodes   | Interacting node pairs: EFG | FEM     | Ratio
2D-1    | 25,921  | 1,033,981                   | 128,641 | 8.0
2D-2    | 75,625  | 3,051,325                   | 376,477 | 8.1
2D-3    | 126,025 | 5,103,325                   | 627,997 | 8.1
3D-1    | 9,221   | 2,418,035                   | 118,121 | 20.5
3D-2    | 19,683  | 5,554,625                   | 256,361 | 21.7
3D-3    | 35,937  | 10,644,935                  | 474,305 | 22.4

In the 3D examples, the extra dimension and the fact that each region has 64 Gauss points make regioning more important. The benefits of regioning become greater as the number of Gauss points per region increases.

4.2. Comparison to FEM for equal number of nodes and Gauss points

Table 12 shows the number of interacting node pairs in EFG and FEM for an equal number of nodes and Gauss points. Interactions in EFG extend over much larger regions than in FEM, as is shown in Fig. 5. Furthermore, the numbers are indicative of the total non-zeros of the corresponding stiffness matrices. The total number of non-zero coefficients can be calculated by

NZ = 4 \cdot NP - n \;\; \text{(2D)}, \qquad NZ = 9 \cdot NP - 3n \;\; \text{(3D)}   (27)

where NP is the number of interacting node pairs and n is the number of nodes.
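As a quick check of Eq. (27) against the values reported in Tables 6 and 12: for example 2D-1, NP = 1,033,981 and n = 25,921, so

NZ = 4 \times 1{,}033{,}981 - 25{,}921 = 4{,}110{,}003,

which matches the number of stored sparse coefficients listed for 2D-1 in Table 6.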


Each interacting node pair corresponds to a non-zero submatrix of the stiffness matrix, whose size is equal to the number of dof of each node. To calculate the corresponding coefficients, contributions from several Gauss points are summed to form the final submatrix. The total number of Gauss point contributions for the whole problem is shown in Table 13.
From the above tables it is clear that the computational effort required for EFG methods is much higher than in FEM.

Table 13
Total Gauss point contributions for EFG and FEM.

Example | Gauss points | Total GP contributions: EFG | FEM       | Ratio
2D-1    | 102,400      | 32,725,544                  | 1,024,000 | 32.0
2D-2    | 300,304      | 96,647,624                  | 3,003,040 | 32.2
2D-3    | 501,264      | 161,681,224                 | 5,012,640 | 32.3
3D-1    | 64,000       | 408,317,728                 | 2,304,000 | 177.2
3D-2    | 140,608      | 942,981,088                 | 5,061,888 | 186.3
3D-3    | 262,144      | 1,813,006,048               | 9,437,184 | 192.1

4.3. Computation of global stiffness coefficients for each interacting node pair

The computation of the stiffness elements for each interacting node pair is split into two phases. In the first phase, the shape function derivatives for each influenced node of every Gauss point are calculated as described in Section 3.3.1 for the Gauss point-wise method. Then, instead of continuing with the calculation of the stiffness matrix coefficients corresponding to a particular Gauss point, the shape function derivatives are stored for the calculation of the Q_{ij} matrices in the next phase. The required storage of all shape function derivatives is small, so storing them temporarily is not an issue.
In the second phase, the stiffness matrix coefficients of each interacting node pair are computed. For each interacting node pair i–j, the matrix Q_{ij} of Eq. (25) is calculated over all shared Gauss points and summed to form the final values of the corresponding coefficients of the global matrix:

K_{ij} = \sum_{G} Q_{ij} = \sum_{G} B_i^T E B_j   (28)

The calculation of the Q_{ij} matrices is performed as described in Section 3.3.2. The matrices B_i, B_j contain the shape function derivative values calculated in the first phase, and each pre-calculated shape function derivative is used a large number of times.
Both phases are amenable to parallelization, the first with respect to Gauss points and the second with respect to interacting node pairs, and they involve no race conditions or need for synchronization, which makes the interacting node pairs approach an ideal method for massively parallel systems.
4.4. Sparse matrix format for the interacting node pairs approach

The final values of each K_{ij} submatrix are calculated and written once to the corresponding positions of the global stiffness matrix, instead of being gradually updated as in the Gauss point-wise approach. Apart from the reduced number of accesses to the matrix, this method does not require lookups, which allows the use of a simpler and more efficient sparse matrix format, such as the coordinate list (COO) format [38]. A simple implementation with three arrays, one for row indexes, one for column indexes and one for the value of each non-zero matrix coefficient, is sufficient and is easily applied both on the CPU and on the GPU, while also requiring less memory than a format that allows lookups. Note that the node pair-wise method has no indexing time due to its nature, in contrast to the Gauss point-wise approach described in Section 3.3.3. This is why the computing times shown for the interacting node pairs approach in Table 14 are lower in the CPU implementations presented in Section 7.
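A minimal sketch of this node pair-wise write into COO arrays is shown below; how the phase-1 derivatives of nodes i and j are looked up for each shared Gauss point is abstracted into a pre-gathered list, and computeQij stands for the routine sketched in Section 3.3.2. These are illustrative assumptions, not the authors' data layout.

```cuda
#include <cstdint>
#include <utility>
#include <vector>

struct Deriv3 { double x, y, z; };                        // Phi_{,x}, Phi_{,y}, Phi_{,z}
struct Coo { std::vector<std::uint32_t> row, col; std::vector<double> val; };

// Routine from the Section 3.3.2 sketch (defined elsewhere).
void computeQij(const Deriv3&, const Deriv3&, double, double, double, double Q[3][3]);

// Gather all shared Gauss point contributions of one interacting node pair (Eq. (28))
// into a local 3x3 block and write the final coefficients once, with no lookups.
void assembleNodePair(std::uint32_t nodeI, std::uint32_t nodeJ,
                      const std::vector<std::pair<Deriv3, Deriv3>>& sharedContrib,
                      double lambda, double mu, Coo& K)
{
    const double M = 2.0 * mu + lambda;
    double Kij[3][3] = {};                                // final block for this pair
    for (const auto& c : sharedContrib) {                 // loop over shared Gauss points
        double Q[3][3];
        computeQij(c.first, c.second, lambda, mu, M, Q);
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b) Kij[a][b] += Q[a][b];
    }
    for (int a = 0; a < 3; ++a)                           // single write per coefficient
        for (int b = 0; b < 3; ++b) {
            K.row.push_back(3 * nodeI + a);
            K.col.push_back(3 * nodeJ + b);
            K.val.push_back(Kij[a][b]);
        }
}
```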
Table 14
Computing time for the formulation of the stiffness matrix in the serial CPU implementations of the Gauss point-wise (GP) and node pair-wise (NP) approaches.

Example | dof     | Gauss points | CPU time (s): Conventional GP | Proposed GP | Proposed NP
2D-1    | 51,842  | 102,400      | 107    | 12    | 11
2D-2    | 152,250 | 300,304      | 313    | 34    | 28
2D-3    | 252,050 | 501,264      | 502    | 53    | 47
3D-1    | 27,783  | 64,000       | 2,374  | 241   | 134
3D-2    | 59,049  | 140,608      | 6,328  | 616   | 328
3D-3    | 107,811 | 262,144      | 13,302 | 1,165 | 645

4.5. Parallelization features of the interacting node pairs approach

The interacting node pairs approach has certain advantages compared to the Gauss point-wise approach. The most important one is related to its amenability to parallelism, in contrast to the Gauss point-wise approach. The Gauss point-wise approach can be visualized in Fig. 9, where the contributions of three Gauss points to the stiffness matrix are schematically depicted. Since in EFG methods each Gauss point affects a large number of nodes, each K_{ij} submatrix is formed by a large number of stiffness contributions. Parallelizing the Gauss point-wise approach involves scatter parallelism, which is schematically shown in Fig. 10 for two Gauss points C and D. Each part of the sum can be calculated in parallel, but there are conflicting updates to the same element of the stiffness matrix. These race conditions can be avoided with proper synchronization, but in massively parallel systems like the GPU, where thousands of threads may be working concurrently, this is very detrimental to performance because all updates are serialized with atomic operations [39].
In the interacting node pairs approach, instead of constantly updating the matrix, the final values for the submatrix of each interacting node pair are calculated and then appended to the matrix. For the calculation of a submatrix, all contributions of the Gauss points belonging to the intersection of the domains of influence of the two interacting nodes are summed together. Thus, the interacting node pairs approach utilizes gather parallelism, as shown schematically in Fig. 11.
In a parallel implementation, each working unit, or thread, prepares a submatrix K_{ij} related to a specific interacting node pair i–j. It gathers all contributions from the Gauss points and writes to a specific memory location accessed by no other thread. Thus, this method requires no synchronization or atomic operations. An important benefit of this approach concerns the indexing cost of the stiffness matrix elements: in the Gauss point-wise method each stiffness matrix element is updated a large number of times, while in the proposed interacting node pair approach the final value is calculated and written only once.

Fig. 9. Schematic representation of the contribution of 3 Gauss points to the global stiffness matrix.
Fig. 10. Scatter parallelism required for the Gauss point-wise approach.
Fig. 11. Gather parallelism implemented in the interacting node pairs approach.

5. GPU programming

Graphics processing units (GPUs) are parallel devices of the SIMD (single instruction, multiple data) classification, which describes devices with multiple processing elements that perform the same operation on multiple data simultaneously and exploit data-level parallelism. Programming in openCL or CUDA is easier than legacy general purpose computing on GPUs (GPGPU), since it only involves learning a few extensions to C and thus requires no graphics-specific knowledge. In the openCL/CUDA context, the CPU is also referred to as the host and the GPU as the device. The general processing flow of GPU programming is depicted in Fig. 12. GPUs have a large number of streaming processors (SPs), which can collectively offer significantly more gigaflops than current high-end CPUs.

5.1. GPU threads

The GPU applies the same functions to a large number of data. These data-parallel functions are called kernels. Kernels generate a large number of threads in order to exploit data parallelism, hence the single instruction, multiple threads (SIMT) paradigm. A thread is the smallest unit of processing that can be scheduled by an operating system. Threads on GPUs take very few clock cycles to generate and schedule, due to the GPU's underlying hardware support, unlike CPUs where thousands of clock cycles are required. All threads generated by a kernel define a grid and are organized in groups which are commonly referred to as thread blocks [in CUDA] or thread groups [in openCL]. A grid consists of a number of blocks (all equal in size), and each block consists of a number of threads (Fig. 13).
There is another type of thread grouping called warps, which are the units of thread scheduling in the GPU. The number of threads in a warp is specific to the particular hardware implementation; it depends on how many threads the available hardware can process at the same time. The purpose of warps is to ensure high hardware utilization. For example, if a warp initiates a long-latency operation and is waiting for results in order to continue, it is put on hold and another warp is selected for execution in order to avoid having idle processors while waiting for the operation to complete. When the long-latency operation completes, the original warp will eventually resume execution. With a sufficient number of warps, the processors are likely to have a continuous workload in spite of the long-latency operations. It is recommended that the number of threads per block be chosen as a multiple of the warp size [15].
The number of threads in each block is subject to refinement. It should be a power of 2 and, on contemporary hardware, less than 1024. The warp size of the cards used in the present study is 32; hence, the number should ideally be 32 or higher.
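As a small illustration of this recommendation, a host-side launch configuration might be chosen as follows (values and names are illustrative only, not the authors' settings):

```cuda
#include <cuda_runtime.h>

// Illustrative only: the block size is a power of two and a multiple of the warp size
// (32), and enough blocks are launched to cover all work items.
void launchExample(int numWorkItems)
{
    const int threadsPerBlock = 128;                                   // 4 warps of 32 threads
    const int numBlocks = (numWorkItems + threadsPerBlock - 1) / threadsPerBlock;
    dim3 grid(numBlocks), block(threadsPerBlock);
    // someKernel<<<grid, block>>>(...);   // placeholder for an actual kernel launch
    (void)grid; (void)block;
}
```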
5.2. GPU memory
GPGPU devices have a variety of different memories that can be utilized by programmers in order to achieve high performance. Fig. 14 shows a simplified representation of the different memories. The global memory is the memory responsible for interaction with the host/CPU. The data to be processed by the device/GPU is first transferred from the host memory to the device global memory. Also, output data from the device needs to be placed here before being passed over to the host. Global memory is large in size and off-chip. Constant memory also provides interaction with the host, but the device is only allowed to read from it and not write to it. It is small, but provides fast access for data needed by all threads.
There are also other types of memories which cannot be accessed by the host. Data in these memories can be accessed in a highly efficient manner. The memories differ depending on which threads have access to them. Registers [CUDA] or private memories [openCL] are thread-bound, meaning that each thread can only access its own registers. Registers are typically used for holding variables that need to be accessed frequently but that do not need to be shared with other threads. Shared memories [CUDA] or local memories [openCL] are allocated to thread blocks/groups instead of single threads, which allows all threads in a block to access variables in the shared memory locations allocated specifically for that block. Shared memories are almost as fast as registers while also allowing cooperation between threads of the same block.

Fig. 12. GPU processing flow paradigm: (1) data transfer to GPU memory, (2) CPU instructions to GPU, (3) GPU parallel processing, (4) result transfer to main memory.
Fig. 13. Thread organization.
Fig. 14. Visual representation of GPU memory model and scope.

5.3. Reductions in the GPU

In several parts of the GPU implementation, reductions need to be performed in order to calculate a sum. On a sequential processor, the summation would be implemented by writing a simple loop with a single accumulator variable to construct the sum of all elements in sequence. On a parallel machine, using a single accumulator variable would create a global serialization point and lead to very poor performance. In order to overcome this problem, a parallel reduction strategy is implemented, where each parallel thread sums a fixed-length sub-sequence of the input. Then, these partial sums are gathered by summing pairs of partial sums in parallel. Each step of this pair-wise summation halves the number of partial sums and ultimately produces the final sum after log2(N) steps, as shown in Fig. 15.
In order to calculate the sum of several vectors into a single vector, a similar process is performed, but each thread sums two vectors instead of two values in every step.

Fig. 15. Parallel summation using a tree-like structure.
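A minimal CUDA version of this tree-like block reduction is sketched below; the paper's implementation is written in openCL, so this is only an equivalent illustration, assuming a power-of-two block size.

```cuda
#include <cuda_runtime.h>

// Each thread first accumulates a strided sub-sequence of the input, then the partial
// sums are combined pairwise in shared memory, halving their number at every step.
// Launch as: blockSum<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(double)>>>(...);
// with threadsPerBlock a power of two.
__global__ void blockSum(const double* in, int n, double* blockResults)
{
    extern __shared__ double partial[];                    // one partial sum per thread
    const int tid = threadIdx.x;
    double sum = 0.0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        sum += in[i];                                      // per-thread sub-sequence
    partial[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {         // pairwise tree reduction
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockResults[blockIdx.x] = partial[0];   // one partial result per block
}
```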
6. GPU implementation of the node pair-wise approach

Contrary to the Gauss point-wise approach, the interacting node pair approach for the formation of the stiffness matrix in EFG simulations is well suited to the GPU. Each one of the two phases described in Section 4.3 is computed with its own kernel and exhibits a different level of parallelism. The implementations in this work are written in openCL for greater portability.
6.1. Phase 1: calculation of shape function and derivative values

In the first phase the shape function and its derivatives are calculated for all influenced nodes of every Gauss point. The calculations in this phase are described in detail in Section 3.3.1. There are two levels of parallelism: the major one over the Gauss points and the minor one over the influenced nodes. A thread block/group is assigned to each Gauss point and each thread handles one influenced node at a time. This is schematically shown in Fig. 16, where it is assumed that each thread handles a single influencing node (this is for demonstration purposes only and not mandatory). Since the number of threads should be a power of 2 and the number of influenced nodes can be anything, some threads will not produce useful results.
For the most part of this phase all threads of a block are busy. The exceptions are the inversion of the moment matrix A and the reductions which are used to sum the contributions of all influenced nodes into the moment matrix A and the vectors p_A, p_{Ax}, p_{Ay}, p_{Az}. The process is shown schematically in Fig. 17.
Since each Gauss point has its own thread block, all values related to a particular Gauss point are stored in the shared/local memory. This includes the moment matrix and all vectors (p_A, p_{Ax}, p_{Ay}, p_{Az}). The interaction with the global memory is performed only at the beginning of the process, where each thread reads the coordinates of the corresponding Gauss point and influenced node and stores them in registers, and at the end of the process, where the resulting shape function values are written to the global memory. Constant memory is used for storing the ranges of the influence domains. As a result, all calculations are performed with data found in fast memories, which is very beneficial from a performance point of view.

Fig. 16. Thread organization in phase 1.
Fig. 17. Phase 1 concurrency level for the calculation of shape function values in the GPU.

6.2. Phase 2: calculation of the global stiffness coefficients

In the second phase there are also two levels of parallelism, the major one being at the level of interacting node pairs and the minor one at the level of Gauss points. A thread block/group is assigned to each node pair and each thread of the block handles one Gauss point at a time. This is schematically shown in Fig. 18, where it is assumed that each thread handles 3 shared Gauss points (this is for demonstration purposes only and not mandatory). Since the number of threads should be a power of 2 and the number of shared Gauss points can be anything, some threads will process only 2 shared Gauss points.
In this phase, all threads of a block go through all available shared Gauss points of the node pair and calculate the Q_{ij} submatrices (Eq. (25)) as described in Section 3.3.2. Each thread t of the block sums contributions from different shared Gauss points and updates its own partial K_{ij}^t, so there is no need for atomic operations. After all shared Gauss points have been processed, the partial K_{ij}^t matrices of the threads of the block are summed with a reduction into the final values of the stiffness coefficients K_{ij}. The process is shown in Fig. 19.

Fig. 18. Thread organization in phase 2.
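The following simplified CUDA kernel sketches this organization for 3D elasticity: one block per interacting node pair, per-thread partial blocks over a strided subset of the shared Gauss points, and a shared-memory reduction before a single write of the nine coefficients. The flattened per-pair layout with pre-gathered derivatives, and the use of CUDA rather than openCL, are assumptions made for brevity; this is not the authors' kernel.

```cuda
#include <cuda_runtime.h>

// One thread block per interacting node pair. Shared memory size at launch must be
// blockDim.x * 9 * sizeof(double), and blockDim.x is assumed to be a power of two.
__global__ void pairStiffnessKernel(const int* pairOffset,            // [numPairs+1] ranges
                                    const double* dIx, const double* dIy, const double* dIz,
                                    const double* dJx, const double* dJy, const double* dJz,
                                    double lambda, double mu,
                                    double* pairValues)                // [numPairs*9] output
{
    extern __shared__ double partial[];                // blockDim.x partial 3x3 blocks
    const int p   = blockIdx.x;                        // this block's node pair
    const int tid = threadIdx.x;
    const double M = 2.0 * mu + lambda;
    double Q[9] = {0.0};                               // private partial K_ij of this thread

    for (int e = pairOffset[p] + tid; e < pairOffset[p + 1]; e += blockDim.x) {
        const double xx = dIx[e]*dJx[e], yy = dIy[e]*dJy[e], zz = dIz[e]*dJz[e];
        const double xy = dIx[e]*dJy[e], yx = dIy[e]*dJx[e];
        const double xz = dIx[e]*dJz[e], zx = dIz[e]*dJx[e];
        const double yz = dIy[e]*dJz[e], zy = dIz[e]*dJy[e];
        Q[0] += xx*M + (yy + zz)*mu;  Q[1] += xy*lambda + yx*mu;   Q[2] += xz*lambda + zx*mu;
        Q[3] += yx*lambda + xy*mu;    Q[4] += yy*M + (xx + zz)*mu; Q[5] += yz*lambda + zy*mu;
        Q[6] += zx*lambda + xz*mu;    Q[7] += zy*lambda + yz*mu;   Q[8] += zz*M + (yy + xx)*mu;
    }
    for (int k = 0; k < 9; ++k) partial[tid * 9 + k] = Q[k];
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // reduce the per-thread partial blocks
        if (tid < s)
            for (int k = 0; k < 9; ++k) partial[tid * 9 + k] += partial[(tid + s) * 9 + k];
        __syncthreads();
    }
    if (tid == 0)                                      // single, uncontended write per pair
        for (int k = 0; k < 9; ++k) pairValues[p * 9 + k] = partial[k];
}
```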
7. Numerical results in 2D and 3D elasticity problems

The two procedures elaborated in this work for the computation of the stiffness matrix in large-scale EFG meshless simulations are tested on the same 2D and 3D elasticity problems already used for testing throughout this paper. The geometric domains of these problems maximize the number of correlations and consequently the computational cost for the given number of nodes. The examples are run on the following hardware. CPU: a Core i7-980X, which has six physical cores (12 logical cores) at 3.33 GHz and 12 MB cache. GPU: a GeForce GTX680 with 1536 CUDA cores and 2 GB of GDDR5 memory.
The performance of the Gauss point-wise (GP) and node pair-wise (NP) approaches on the CPU is given in Table 14. The proposed Gauss point-wise approach is compared with the conventional one without the previously described improvements. The performance of the GPU implementation of the node pair-wise method is shown in Table 15. Speedup ratios of the GPU implementation compared to the CPU implementations are given in Table 16. The total elapsed time for the initialization phase and the formulation of the stiffness matrix in the conventional way is shown in Table 17. By applying all techniques proposed in this paper and utilizing one GPU, we achieve the results of Table 18, which also shows the speedup compared to the conventional implementation.

Table 15
Computing time for the formulation of the stiffness matrix in the GPU implementation of the interacting node-pair approach.

Example | dof     | Gauss points | NP GPU time (s): Kernel 1 | Kernel 2 | Total
2D-1    | 51,842  | 102,400      | 0.05 | 0.19  | 0.2
2D-2    | 152,250 | 300,304      | 0.13 | 0.56  | 0.7
2D-3    | 252,050 | 501,264      | 0.21 | 0.89  | 1.1
3D-1    | 27,783  | 64,000       | 0.17 | 2.41  | 2.6
3D-2    | 59,049  | 140,608      | 0.32 | 6.17  | 6.5
3D-3    | 107,811 | 262,144      | 0.62 | 12.31 | 12.9

Table 16
Relative speedup ratios of the GPU implementation compared to the CPU implementations.

Example | Speedup vs. Conventional GP | vs. Proposed GP | vs. Proposed NP
2D-1    | 450   | 50 | 46
2D-2    | 457   | 50 | 41
2D-3    | 456   | 48 | 43
3D-1    | 921   | 93 | 52
3D-2    | 975   | 95 | 50
3D-3    | 1,028 | 90 | 50

Table 17
Total serial CPU computing time for the conventional initialization phase and formulation of the stiffness matrix.

Example | Conventional time (s): Initialization | Formulation | Total
2D-1    | 23  | 107    | 130
2D-2    | 300 | 313    | 613
2D-3    | 836 | 502    | 1,338
3D-1    | 7   | 2,374  | 2,381
3D-2    | 45  | 6,328  | 6,373
3D-3    | 157 | 13,302 | 13,459

schematically in Fig. 20. The production can be done on domain


or subdomain level so it is performed by the hardware assigned
to them. Processing can be done by any available CPUs, GPUs or
other processing units thanks to the huge amount of interacting
node pairs and the fact that each node pair is completely
independent of other node pairs. Example results are shown in
Table 19, where the speedup ratios refer to the conventional
implementations.
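A minimal sketch of such a producer-consumer scheme is given below, reusing the NodePair record and phase2Kernel of the earlier sketch; the batch granularity, queue layout and the identifyNodePairBatch routine are illustrative placeholders rather than the implementation used for the reported timings.

```cuda
// Illustrative producer-consumer sketch: a CPU thread identifies node pairs and
// pushes them into a queue, while a consumer thread streams batches to the GPU
// and launches the phase-2 kernel of the previous sketch.
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct NodePair { int firstGauss, numGauss, outOffset; };        // as in the kernel sketch
__global__ void phase2Kernel(const NodePair* pairs, double* K);  // defined in the kernel sketch

static std::queue<std::vector<NodePair>> batches;
static std::mutex mtx;
static std::condition_variable cv;
static bool productionDone = false;

// Placeholder for the CPU-side correlation search of one domain/subdomain.
std::vector<NodePair> identifyNodePairBatch(int subdomain) { (void)subdomain; return {}; }

void produce(int numSubdomains)              // producer: runs on the CPU
{
    for (int s = 0; s < numSubdomains; ++s) {
        std::vector<NodePair> batch = identifyNodePairBatch(s);
        { std::lock_guard<std::mutex> lock(mtx); batches.push(std::move(batch)); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(mtx); productionDone = true; }
    cv.notify_all();
}

void consumeOnGpu(double* d_K)               // consumer: feeds batches to the GPU
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    while (true) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [] { return !batches.empty() || productionDone; });
        if (batches.empty()) break;          // production finished and queue drained
        std::vector<NodePair> batch = std::move(batches.front());
        batches.pop();
        lock.unlock();
        if (batch.empty()) continue;

        NodePair* d_pairs = nullptr;
        cudaMalloc(&d_pairs, batch.size() * sizeof(NodePair));
        cudaMemcpyAsync(d_pairs, batch.data(), batch.size() * sizeof(NodePair),
                        cudaMemcpyHostToDevice, stream);
        phase2Kernel<<<(unsigned)batch.size(), 128, 0, stream>>>(d_pairs, d_K);
        cudaStreamSynchronize(stream);       // wait before releasing the batch buffer
        cudaFree(d_pairs);
    }
    cudaStreamDestroy(stream);
}

int main()
{
    double* d_K = nullptr;
    cudaMalloc(&d_K, 1 << 20);               // placeholder size for the stiffness coefficients
    std::thread producer(produce, 16);
    std::thread consumer(consumeOnGpu, d_K);
    producer.join();
    consumer.join();
    cudaFree(d_K);
    return 0;
}
```

Additional consumers (one per spare CPU core or per additional GPU) can draw from the same queue, since each node pair is independent; this corresponds to the hybrid CPU/GPU configuration whose timings are reported in Table 19.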
7.1. Turbine blade example

A real-world example of a turbine blade is tested. The geometry of the example was taken from the training examples of FEMAP. The EFG model has 31,512 degrees of freedom and 29,135 Gauss points. The geometry of the turbine blade is shown in Fig. 21 and the node placement is shown in Fig. 22. The same hardware as in the previous examples is used here, namely a Core i7-980X CPU and a GeForce GTX680 GPU. The challenges associated with node and Gauss point generation and the selection of an appropriate domain of influence will be investigated in a future work.

The Gauss points used for integration are depicted in Fig. 23. The density of the Gauss points clearly demonstrates the large number of Gauss points required in EFG methods.

The total elapsed time for the initialization phase and the formulation of the stiffness matrix in the conventional way and with the improved initialization and the proposed Gauss point-wise method is shown in Table 20.


Table 18
Best achieved total time for the initialization phase and formulation of the stiffness matrix.

Example | Initialization, CPU parallel (s) | Node pairs, CPU parallel (s) | Formulation, GPU (s) | Total (s) | Speedup
2D-1 | 0.5 | 0.6 | 0.2 | 1.4 | 93
2D-2 | 1.0 | 1.6 | 0.7 | 3.2 | 191
2D-3 | 1.4 | 2.8 | 1.1 | 5.3 | 252
3D-1 | 0.9 | 4.8 | 2.6 | 8.2 | 289
3D-2 | 1.7 | 10.9 | 6.5 | 19.1 | 334
3D-3 | 3.3 | 22.7 | 12.9 | 38.9 | 346

Fig. 21. Geometry of the turbine blade.

Fig. 22. Turbine blade example: Position of the 10,504 nodes.

Fig. 23. Turbine blade example: Position of the 29,135 Gauss points.

Table 19
Best achieved total time for the initialization phase and formulation of the stiffness matrix when using the CPU and GPU concurrently.

Example | Initialization, CPU parallel (s) | NP + formulation, hybrid CPU/GPU (s) | Total (s) | Speedup
2D-1 | 0.5 | 0.6 | 1.2 | 110
2D-2 | 1.0 | 1.6 | 2.6 | 236
2D-3 | 1.4 | 2.9 | 4.3 | 310
3D-1 | 0.9 | 5.0 | 5.9 | 403
3D-2 | 1.7 | 11.5 | 13.2 | 481
3D-3 | 3.3 | 24.0 | 27.3 | 494

Table 20
Turbine blade example: Total serial CPU computing time for the initialization phase and formulation of the stiffness matrix with the Gauss point-wise method.

 | Initialization (s) | Formulation (s) | Total (s)
Conventional | 3 | 341 | 344
Proposed | 1.4 | 36.9 | 38.3

Table 21
Turbine blade example: Best achieved total time for the initialization phase and formulation of the stiffness matrix.

Initialization, CPU parallel (s) | Node pairs, CPU parallel (s) | Formulation, GPU (s) | Total (s) | Speedup
0.5 | 0.7 | 0.5 | 1.7 | 202

By applying all the techniques proposed in this paper and utilizing one GPU, we can achieve a speedup of more than two orders of magnitude compared to the conventional implementation, as demonstrated in Table 21.
8. Concluding remarks

The proposed improvements of the initialization phase through the utilization of Gauss regions significantly reduce the time required to create the necessary correlations between the entities of the meshless method. With Gauss regions, the process scales very well, in contrast to a global search, and the initialization takes only a small percentage of the problem formulation time.

The improvements in the Gauss point-wise approach for assembling the stiffness matrix offer an order of magnitude speedup compared to the conventional approach. This is attributed to the reduced number of calculations in all parts of the process and to the usage of an efficient sparse matrix format and an implementation specifically tailored to the formulation phase of the stiffness matrix. Indexing is a major factor affecting the computational cost. The skyline format is therefore faster due to its lower indexing cost; however, its significantly higher memory requirements make it problematic for larger problems, where a sparse format is preferable or mandatory.
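The indexing argument can be illustrated with a minimal host-side sketch, assuming textbook skyline and CSR layouts (a diag array pointing to the diagonal entries of the skyline, and rowPtr/colInd/val arrays for CSR); these generic forms are not the specific data structures used in this work.

```cuda
// Host-side illustration of the indexing-cost argument under assumed layouts.
#include <algorithm>
#include <vector>

// Skyline (column-wise) storage: diag[j] is the position of the j-th diagonal
// entry, and column j is stored contiguously down to the diagonal, so locating
// entry (i,j) with i <= j is a single arithmetic operation.
double& skylineEntry(std::vector<double>& a, const std::vector<int>& diag, int i, int j)
{
    return a[diag[j] - (j - i)];             // entry assumed to lie inside the skyline
}

// CSR storage: the column indices of row i must be searched (here by binary
// search), so every access costs O(log nnz_row) instead of O(1).
double& csrEntry(std::vector<double>& val, const std::vector<int>& rowPtr,
                 const std::vector<int>& colInd, int i, int j)
{
    auto first = colInd.begin() + rowPtr[i];
    auto last  = colInd.begin() + rowPtr[i + 1];
    auto pos   = std::lower_bound(first, last, j);   // entry assumed to exist
    return val[rowPtr[i] + static_cast<int>(pos - first)];
}
```

One common remedy, presumably in the spirit of the tailored implementation mentioned above, is to resolve these positions once during initialization so that the assembly loop performs only direct writes.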
The proposed node pair-wise approach has several benefits over the Gauss point-wise approach, the most important being its amenability to parallelism, especially on massively parallel systems like GPUs. Each node pair can be processed separately by any available processor in order to compute the corresponding stiffness submatrix. The node pair-wise approach is characterized as embarrassingly parallel since it requires no synchronization whatsoever between node pairs.

A GPU implementation is applied to the node pair-wise approach, offering great speedups compared to the CPU implementations. The node pairs keep the GPU constantly busy with calculations, resulting in high hardware utilization, which is evidenced by the high speedup ratios of approximately two orders of magnitude in the test examples presented. The node pair-wise approach can be applied as is to any available hardware, achieving even lower computing times. This includes using many GPUs, hybrid CPU(s)/GPU(s) implementations and, in general, any available processing unit. The importance of the latter becomes apparent when considering contemporary and future developments such as heterogeneous systems architecture (HSA).

In conclusion, the parametric tests performed in the framework of this study showed that, with the proposed implementation and the exploitation of currently available low-cost hardware, the cost of the expensive formulation of the stiffness matrix in meshless EFG methods can be reduced by orders of magnitude. The presented node pair-wise approach enables the efficient utilization of any available hardware and, in conjunction with the fast initialization and its inherent parallelization features, can accomplish high speedup ratios. This convincingly addresses the main computational shortcoming of meshless methods, making them competitive in solving large-scale engineering problems.

Acknowledgments

This work has been supported by the European Research Council Advanced Grant MASTER "Mastering the computational challenges in numerical modeling and optimum design of CNT reinforced composites" (ERC-2011-ADG_20110209).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.cma.2013.02.011.
