
Efficient Parallel Implementation of the Fox Algorithm

Akpan, Okon H.
Computer Science Department
Bowie State University, Bowie MD
oakpan@cs.bowiestate.edu

Abstract

This paper presents an efficient parallel implementation of the Fox algorithm on a shared-memory supercomputer, the SGI Altix 350. The Fox algorithm is concerned with matrix multiplication, C ← A × B. In this study, the matrices are square block matrices of order n with equal block size n̄ = n/√p, where p is the number of processors involved in the implementation. The performance of the parallel implementation is acceptable, with the speedup, s_p, and efficiency, ε_p, approaching their respective upper limiting values of p and 1 only when neither p is too small with n too large nor p too large with n too small. The implementation is also found to be space-optimal but not time-optimal due to the heavy inter-process data traffic and the consequent dominance of that traffic time, t_comm, over the computation time, t_calc, especially when p is large.

1 Introduction

Over the past decades, a very large number of algorithms have been proposed for matrix multiplication. The extensive literature resulting from the voluminous studies focused on designing time- and space-efficient matrix multiplication algorithms for various computational environments and architectures underscores the importance of this operation in many areas of scientific and engineering applications. Matrix multiplication is not only the kernel operation in linear algebra but also the fundamental operation in a large number of scientific applications including computer graphics, combinatorial algorithms, and robotics. The evolution of architectures of parallel and distributed computers keeps fresh the interest in finding the most efficient and cost-optimal method for matrix multiplication [1, 2, 3, 4, 5, 6, 7]. Also, the existence of high-performance mathematical software packages and libraries, as well as the recent efforts at standardization of some of them, notably MPI [8, 9, 10], have spurred studies aimed at obtaining the best performance from a number of important scientific operations, including matrix multiplication, on supercomputers [11, 12, 13, 14, 15]. The supercomputing libraries and packages widely used include PVM, TCGMSG, MPI, OpenMP, Threads, and Shmem, the most popular of which is MPI, which is supported and implemented on almost every architecture.

The conventional serial multiplication of n × n matrices has asymptotic (run-time) complexity O(n³). Strassen's algorithm [16], which recursively divides the matrices into 2 × 2 blocks and then multiplies the sub-matrices using 7 scalar multiplications and 18 scalar additions and subtractions, takes O(n^2.8074) steps, a significant improvement over the conventional serial matrix multiplication. This means that, theoretically, as n → ∞, Strassen's algorithm is a lot faster than the conventional algorithm. One drawback, though, of Strassen's algorithm is its somewhat lower numerical stability as the order of the matrix becomes very large. Nevertheless, the search for methods that improve upon Strassen's algorithm has been going on and is likely to continue well into the future. For example, Winograd's algorithm, a variant of Strassen's in which the number of scalar additions and subtractions is reduced from 18 to 15 while the run-time complexity stays the same, has a slightly smaller multiplicative constant in the big-O. All these efforts aim at achieving a run-time complexity of O(n^α) with 2 < α < 3. Currently, the best known run-time complexity, O(n^2.376), is that of the Coppersmith and Winograd algorithm [17].
A large number of matrix multiplication algorithms have been successfully implemented on both distributed- and shared-memory parallel computers [4, 7, 14, 15, 18, 19, 20]. On distributed-memory processors, research has essentially been focused on parallelization of the conventional matrix multiplication. Dekel, Nassimi, and Sahni [20] have shown that matrix multiplication can be done in O(n³/p + log(p/n²)) time on a hypercube with p processors, where n² ≤ p ≤ n³. It has also been demonstrated that two n × n matrices can be multiplied on the CREW PRAM in O(log n) time using O(n^(α+ε)) processors for a fixed positive value of ε [18]. It is also reported in [19] that matrix multiplication can be achieved in constant time on a reconfigurable mesh with n⁴ processors. It should be mentioned that, although such implementations may be fast, they are usually not cost-optimal.

The Fox [1] and Cannon [21] algorithms have been parallelized and implemented on parallel computers. Both algorithms use a number of stages to carry out the multiplication C ← A × B, in which the component matrices are square. The Fox algorithm involves row-wise broadcasts of a_ij and column-wise upward shifts of b_ij at every (but the first) stage of computation. The Cannon algorithm, on the other hand, involves row-wise/column-wise rotations of the sub-matrices and also an interleaving of communication and sub-matrix multiplication. Both implementations have been found to be at least memory-efficient.

This study is focused on an implementation of the Fox algorithm on a shared-memory multiprocessor using MPI. MPI communication primitives are used to decompose the component matrices, distribute them to the processes, and then carry out the parallel computation on the computer. The results of these computations are compared to those of the sequential computation.

2 The Fox Algorithm

2.1 Sequential Fox Algorithm

Given square matrices A = (a_ij) and B = (b_ij) of order n, the product A × B = C = (c_ij) is also a square matrix of the same order, n. The conventional matrix multiplication

    c_ij ← Σ_{k=0}^{n−1} a_ik b_kj,   i = 0, 1, ..., n−1,  j = 0, 1, ..., n−1,   (1)

being of order O(n³), is excessively expensive in terms of execution time (and memory too!) especially as n → ∞. Because the multiplication involves n² independently-computed dot products, one obvious route toward time saving is to parallelize the first two loops, thereby reducing the order to O(n²) and, consequently, improving the overall execution time. An implementation of the above algorithm works equally correctly, producing the same results, if the matrices A, B, and C are block matrices in which each block is of order n̄ and an element, say, a_ij, in the algorithm is now an n̄ × n̄ block matrix.

The sequential Fox algorithm for multiplication of A and B proceeds in n stages, where n, again, is the order of the matrices:

    stage 0:               c_ij ← a_ii × b_ij
    stage k (1 ≤ k < n):   c_ij ← c_ij + a_ik̄ × b_k̄j,   (2)

where k̄ = (i + k) mod n. That is, at stage 0, c_ij is computed as the product of a_ii and b_ij, and at stage k, c_ij is updated with the product of a_ik̄ and b_k̄j, where k̄ = (i + k) mod n.
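A minimal serial sketch of the staged formulation in (2) follows; it is illustrative only and not taken from the paper (the fixed order N and the function name fox_seq are assumptions):

    /* Serial sketch of (2): C <- A x B computed in n stages. */
    #define N 4

    void fox_seq(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++)              /* stage 0: c_ij = a_ii * b_ij */
            for (int j = 0; j < N; j++)
                C[i][j] = A[i][i] * B[i][j];

        for (int k = 1; k < N; k++)              /* stages 1 .. n-1             */
            for (int i = 0; i < N; i++) {
                int kbar = (i + k) % N;          /* kbar = (i + k) mod n        */
                for (int j = 0; j < N; j++)
                    C[i][j] += A[i][kbar] * B[kbar][j];
            }
    }

Because (i + k) mod n sweeps every column index exactly once as k runs from 0 to n − 1, the result is the same as that of the conventional product (1).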
2.2 Parallel Fox Algorithm

The product of the matrices A and B can also be computed in parallel provided there are p processors (p < n) available. First, the elements of the operand matrices should be appropriately distributed to the p processors to ensure good load balance.
If A, B, and C are q × q block matrices in which each of the constituent elements is an n̄ × n̄ sub-block, where q = √p and n̄ = n/q, then the distribution involves routing the component sub-blocks to the appropriate processors. In practice, at every stage k > 0, process p_ij (0 ≤ i, j < q) computes the sub-block c_ij as the product of the a_ik̄ sub-block broadcast by process p_i,k̄ (on the same row i) and b_k̄j obtained from p_ij's southern neighboring process p_k̄j. Actually, a single south-north toroidal shift at the end of each stage k ensures the routing of the correct b_k̄j to processor p_ij for its (k + 1)th stage product operation.

2.3 MPI Implementation

2.3.1 Data Distribution

Every supercomputing library or package has several commands with which a matrix can be effectively decomposed and distributed to processors. In MPI [8, 9, 10], one of the simplest ways of doing this is to realize the sub-blocks of the A, B, and C matrices as data objects of an MPI derived data type (MPI_Datatype) and then distribute them to the appropriate processors. This is the data distribution method used in this study.
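As an illustration of this technique only (not the paper's code: the names nbar, q, myRank, and the root's global array A are assumptions of this sketch, while aBlock and BlockType follow the paper's naming), an n̄ × n̄ sub-block of a row-major n × n array can be described with MPI_Type_vector and routed to the process that owns it:

    /* Sketch: describe an nbar x nbar block of a row-major n x n array as a
       derived datatype and route block (i, j) to the process of rank i*q + j.
       Assumes MPI_Init has been called and the arrays are already allocated. */
    MPI_Datatype BlockType;
    MPI_Type_vector(nbar, nbar, n, MPI_DOUBLE, &BlockType);  /* nbar rows of
                                                   nbar doubles, stride n     */
    MPI_Type_commit(&BlockType);

    if (myRank == 0) {
        for (int i = 0; i < q; i++)
            for (int j = 0; j < q; j++) {
                if (i * q + j == 0) {
                    for (int r = 0; r < nbar; r++)   /* root keeps block (0,0) */
                        for (int c = 0; c < nbar; c++)
                            aBlock[r][c] = A[r][c];
                } else {
                    MPI_Send(&A[i * nbar][j * nbar], 1, BlockType,
                             i * q + j, 0, MPI_COMM_WORLD);
                }
            }
    } else {
        MPI_Recv(&aBlock[0][0], nbar * nbar, MPI_DOUBLE, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

The same pattern distributes the blocks of B; a committed type of this kind is presumably what the BlockType used later in Figure 1 refers to.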
2.3.2 Cartesian Virtual Topology

Because at any stage k > 0 of the parallel Fox algorithm c_ij is computed (by process p_ij) as the product of A's sub-block broadcast by process p_i,k̄ on row i to all other processes on the same row and B's sub-block shifted from p_ij's southern neighbor, p_k̄j, the parallel implementation calls for realizing the processes of each row and of each column as different process groups with their respective communicators. This greatly facilitates the data communication required by the implementation. In MPI, these process groupings are very easily accomplished in a two-step process: 1) create a Cartesian communicator, say, Gridcomm, from MPI_COMM_WORLD, the MPI default communicator; 2) from Gridcomm, create a row Cartesian communicator, Rcomm, and a column communicator, Ccomm, the latter with periodicity so as to permit the toroidal shifts of B's sub-blocks from the northern-most process to the southern-most process. No such periodicity is needed among the Rcomm row processes.
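A sketch of this two-step construction (not taken from the paper; the dims, periods, and reorder values shown are assumptions consistent with the description above):

    /* Sketch: q x q Cartesian grid, then one communicator per row and per
       column.  Dimension 0 (the row index) is periodic so that B's blocks
       can be shifted toroidally along each grid column.                     */
    MPI_Comm Gridcomm, Rcomm, Ccomm;
    int dims[2]    = {q, q};
    int periods[2] = {1, 0};        /* periodic along dimension 0 only        */
    int remain[2];

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &Gridcomm);

    remain[0] = 0; remain[1] = 1;   /* vary along dim 1 -> one Rcomm per row  */
    MPI_Cart_sub(Gridcomm, remain, &Rcomm);

    remain[0] = 1; remain[1] = 0;   /* vary along dim 0 -> one Ccomm per col  */
    MPI_Cart_sub(Gridcomm, remain, &Ccomm);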
2.3.3 Parallel Implementation

Figure 1 below shows the code fragment for the MPI implementation of the Fox algorithm in ANSI C. The implementation is based upon the following assumptions:

1. Available are p processors, where p is a perfect square so that q = √p.

2. Matrices A, B, and C, of order n, are q × q block matrices in which each sub-block is n̄ × n̄, where n̄ = n/q. In the code fragment, the component sub-blocks of matrices A, B, and C are variables with identifier names aBlock, bBlock, and cBlock respectively, all of these sub-blocks having an MPI derived data type (MPI_Datatype) attribute.

3. On the q × q Cartesian grid, each process on row i and column j (0 ≤ i, j < q) is in the groups of processes having communicators Rcomm and Ccomm. Also, each process has an x-coordinate (myRow), a y-coordinate (myCol), a row rank, myRowRank, a column rank, myColRank, and, finally, a northern neighbor (northNeighbor) and a southern neighbor (southNeighbor).

4. At any stage k, 0 ≤ k < q, and on any row, the process bCastProc which broadcasts its aBlock to the other processes on the same row is the only process with myRowRank = (myRow + k) mod q.

5. MatMult is a function which implements the sequential matrix multiplication given in (1).

6. Finally, the broadcast statements in the scopes of the if and else constructs broadcast a_ik̄ to the processes on the same row (that is, among the Rcomm processes), while the MPI_Sendrecv_replace statement causes a single south-north shift of b_k̄j on the same column (that is, among the Ccomm processes).

    //step 1: Determine neighboring processes
    southNeighbor = (myRow + 1) % q;
    northNeighbor = (myRow + q - 1) % q;

    //step 2: Loop q times to compute cij
    for (int k = 0; k < q; k++) {
        bCastProc = (myRow + k) % q;
        if (myRowRank == bCastProc) {
            MPI_Bcast(&aBlock[0][0], 1, BlockType, bCastProc, Rcomm);
            MatMult(aBlock, bBlock, cBlock);
        }
        else {
            MPI_Bcast(&tempBlock[0][0], 1, BlockType, bCastProc, Rcomm);
            MatMult(tempBlock, bBlock, cBlock);
        }
        MPI_Sendrecv_replace(&bBlock[0][0], 1, BlockType, northNeighbor, sendMsgTag,
                             southNeighbor, recvMsgTag, Ccomm, MPI_STATUS_IGNORE);
    }

Figure 1: Parallel Fox Program
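The function MatMult referenced in assumption 5 and in Figure 1 is not listed in the paper; a minimal sketch of a local block multiply-accumulate that would fill that role (the fixed block order NBAR and the two-dimensional array signature are assumptions of this sketch) is:

    /* Sketch: local block multiply-accumulate, cBlock += aBlock * bBlock.
       cBlock is assumed to be zero-initialized before the stage loop.       */
    #define NBAR 100   /* block order nbar; illustrative value               */

    void MatMult(double a[NBAR][NBAR], double b[NBAR][NBAR], double c[NBAR][NBAR])
    {
        for (int i = 0; i < NBAR; i++)
            for (int j = 0; j < NBAR; j++)
                for (int k = 0; k < NBAR; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

Accumulating (rather than overwriting) is what lets the q calls made in the stage loop of Figure 1 build up the full c_ij of equation (2).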
2.3.4 Computational Environment

The environment used for the parallel implementation of the Fox algorithm was the SGI Altix 350 of the Alabama Supercomputer Center (ASC) at Huntsville, Alabama. The Altix 350 is a mid-range 64-bit supercomputing platform built by SGI specifically for scientific computing applications. It has 64-bit Intel Itanium 2 processors which can easily scale to 128 in a single system image (SSI), and to thousands more via clustering using the powerful SGI NUMAflex global shared-memory architecture. The Altix 350 supports MPI and SGI's Message Passing Toolkit (MPT) for distributed-memory parallelism, and OpenMP, SHMEM, and POSIX threads for shared-memory parallelism.

The Altix 350 is built with rack-mountable bricks. The C-brick, which consists of 2 Itanium 2 processors and a cache, is the computational module. The front-side buses of the two processors are connected to custom SuperHub (SHub) ASICs and NUMAlink, which interface the two processors to the memory DIMMs, to the I/O subsystems, and to other SHubs via the NUMAflex network. The NUMAlink interconnect channel between all modules in the system creates a single, contiguous memory of up to 384 GB and enables each process a direct access to every I/O slot in the system.

The Alabama Supercomputer Center's Altix 350 has sets of 16 CPUs clustered into shared-memory nodes with InfiniBand (for message passing between processors), SHub ASICs and NUMAlink-4 interconnect (for shared global memory), a fibre channel switch (for CXFS file system data), and Gigabit Ethernet. This Altix cluster has 256 CPUs and a theoretical performance rate of over 380 Gflops.

3 Time Optimal Analysis

3.1 Speedup and Efficiency

Given a parallel algorithm implemented with p processors, the speedup, s_p, is given as the ratio

    s_p = t_1 / t_p,   (3)

where t_1 is the time that a sequential implementation of that same algorithm takes to execute with 1 processor, and t_p the time a parallel implementation of the algorithm takes to execute with p processors.
Normally, t_p is the sum of t_comm, the time for inter-processor exchanges (of data, messages, etc.), and t_calc, the time spent on the actual computation (by the p processors). The speedup is bounded as 1 ≤ s_p ≤ p. The efficiency, ε_p, is given as

    ε_p = t_1 / (p t_p) = s_p / p,

and is bounded as 1/p ≤ ε_p ≤ 1.

3.2 The MPI Communication Primitives

The two kinds of MPI inter-process communication primitives heavily utilized in the implementation are the point-to-point and the collective primitives. The models of these primitives are given immediately below.

3.2.1 Point-to-Point Communication

The point-to-point communication primitives are the MPI_Send and MPI_Recv pair and MPI_Sendrecv, for sending and receiving a block of m bytes between 2 processors. The model for the MPI_Send and MPI_Recv pair is

    γ + mβ,   (4)

where γ is the communication latency and β the communication time per byte (the reciprocal of β is the bandwidth). MPI_Sendrecv, assumed to take twice as long as the MPI_Send and MPI_Recv pair, is modeled as double the time of (4). Therefore, the combined time for the point-to-point primitives is

    3γ + 3mβ.   (5)

3.2.2 Collective Communication

The collective communication primitive is MPI_Bcast, in which a block of data of m bytes is broadcast to the p processors of the group involved. If the broadcast is implemented on a linear array of nodes of a minimum spanning tree, then the broadcast primitive can be modeled with

    ⌈log p⌉ γ + ((p − 1)/p) m β.   (6)

3.2.3 Combined Communication

A combination of (5) and (6) gives a model of the inter-process communication of p processors:

    (⌈log p⌉ + 3) γ + ((4p − 1)/p) m β.   (7)

3.3 Time Computation Complexity

For a sequential multiplication of matrices of order n × n with a single processor, the computational time is given as

    t_1 = n³ α,   (8)

where α is the time for one floating-point operation. The memory requirement for each of the matrices A, B, and C is n² s, where s is the number of bytes per floating-point number; the blocks distributed to the processes are of order n̄ = n/√p.

For the parallel Fox algorithm implemented by p processes, the total inter-process communication time, t_comm, according to (7), is given as

    t_comm = (⌈log p⌉ + 3) γ + ((4p − 1)/p)(m²/p) β
           = (⌈log p⌉ + 3) γ + ((4p − 1)/p²) m² β.   (9)

Assuming that t_1 of (8) is distributed equally over the p processors, the time for the parallel implementation with p processors, t_calc, is

    t_calc = t_1 / p = (n³/p) α.   (10)

The total parallel execution time, t_p, which results from the combination of (9) and (10), is

    t_p = t_comm + t_calc = (⌈log p⌉ + 3) γ + ((4p − 1)/p²) m² β + (n³/p) α.   (11)
The speedup is then given as

    s_p = n³ α / [ (⌈log p⌉ + 3) γ + ((4p − 1)/p²) m² β + (n³/p) α ],   (12)

and the efficiency as

    ε_p = s_p / p = n³ α / ( p [ (⌈log p⌉ + 3) γ + ((4p − 1)/p²) m² β + (n³/p) α ] ).   (13)

4 Space Optimal Analysis

The parallel Fox program given in Figure 1 is space-optimal: each processor, at any stage of the computation, holds at most 4n̄² data points from matrices A, B, and C, namely 2 blocks of A, 1 block of B, and 1 block of C, each block being n̄ × n̄. The extra block of A on each node is the one broadcast from another node in the same row group with communicator Rcomm. B's blocks are shared among the processes of the same column group with communicator Ccomm, and there is no extra B block on any node because, when the B block involved in the local matrix multiplication on a node is shifted to its northern neighbor, a B block is shifted to the same node from its southern neighbor at the same time. This shift-out-shift-in operation is carried out by the MPI operation

    MPI_Sendrecv_replace(&bBlock[0][0], 1, BlockType, northNeighbor, sendMsgTag,
                         southNeighbor, recvMsgTag, Ccomm, MPI_STATUS_IGNORE);

in the parallel code of Figure 1. Thus the space requirement 4n̄² ≪ M (M being the memory size of each node) is optimal because, matrices A, B, and C being square, the memory requirement cannot be further reduced by any partitioning and data distribution scheme.
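As an illustrative check of this bound (the values n = 1,600 and p = 64 are hypothetical; s = 8 bytes as in Section 4.1, so n̄ = n/√p = 200), the per-node footprint is

    4 n̄² s = 4 × 200² × 8 bytes ≈ 1.28 MB,

which is far below the memory, M, available to each node of the Altix 350.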


4.1 Computation Parameters

The following are the estimates of the Altix 350 NUMAlink-based parameters for MPI in C:

1. Bandwidth (bi-directional) ≈ 520 MB/s. Hence, the time for a bi-directional transfer of one byte is β ≈ 1/520 ≈ 0.0018 µs.

2. Latency (send/receive): γ ≈ 7.6 µs.

3. Floating-point operation time: α ≈ 0.002 µs.

4. sizeof of the C data type double: s = 8 bytes.

5. sizeof of the MPI data type (derived data type for the matrix blocks): m = 8n̄ bytes (n̄ = n/√p).
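With these estimates, the analytical model of Section 3 can be evaluated directly. The sketch below is not part of the paper; the choice p = 64, n = 800 and the reading of ⌈log p⌉ as a base-2 logarithm are assumptions:

    /* Evaluate the performance model (9)-(12) with the Section 4.1 estimates.
       p, n, and the base-2 logarithm are assumptions of this sketch.         */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double beta  = 0.0018;       /* us per byte                     */
        const double gam   = 7.6;          /* latency, us                     */
        const double alpha = 0.002;        /* us per floating-point operation */

        int    p    = 64;                  /* processor count, perfect square */
        double n    = 800.0;               /* matrix order                    */
        double nbar = n / sqrt((double)p);
        double m    = 8.0 * nbar;          /* block message size, bytes       */

        double tcomm = (ceil(log2((double)p)) + 3.0) * gam
                     + (4.0 * p - 1.0) / ((double)p * p) * m * m * beta; /* (9)  */
        double tcalc = n * n * n * alpha / p;                            /* (10) */
        double tp    = tcomm + tcalc;                                    /* (11) */
        double sp    = n * n * n * alpha / tp;                           /* (12) */

        printf("tcomm = %.1f us, tcalc = %.1f us, sp = %.2f, ep = %.3f\n",
               tcomm, tcalc, sp, sp / p);
        return 0;
    }

Note that this only evaluates the idealized communication model of Section 3.2; the measured speedups reported in Section 6 are considerably lower.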
5 Experimental Results

The sequential Fox algorithm given in (2) and the parallel Fox algorithm whose MPI code fragment is given in Figure 1 were implemented on the ASC's SGI Altix 350 for matrix orders (n) from 20 to 1,600. The sequential algorithm was implemented on a single processor while the parallel code ran on 4 to 100 processors. The results are given in Figure 2 below.

[Figure 2 plots the execution time against the matrix size, n, for the sequential implementation and for the parallel implementation with p = 4, 16, 64, and 100.]

Figure 2: Results of Implementation of Sequential and Parallel Fox Algorithms for Various Values of n and p

6 Observations

Summarized below are the important observations on the performance of the implementations:

• The performance of the parallel Fox program, in terms of the speedup (12) and efficiency (13), depends very strongly on the magnitudes of p and n. If p is small and n large, or p large and n small, both s_p and ε_p tend to approach their lower limiting values of 1 and 1/p respectively. For example, when p = 4, n = 1,200 (n̄ = 600), s_4 = 1.69 and ε_4 = 0.42; when p = 100, n = 80 (n̄ = 8), s_100 = 1.25 and ε_100 = 0.01.

• Outside the two extreme situations given above, both s_p and ε_p improve, but neither attains its upper limiting value for any value of p and n. For example, when p = 4, n = 100 (n̄ = 50), s_4 = 3.64 and ε_4 = 0.92; when p = 64, n = 800 (n̄ = 100), s_64 = 23.33 and ε_64 = 0.36; when p = 100, n = 800 (n̄ = 80), s_100 = 91.30 and ε_100 = 0.91.

7 Conclusion

The performance of the parallel implementation depends on the number of processors, p, and the order, n, of the matrices (equivalently, the block order n̄ = n/√p). The performance improves (but never attains the optimal values s_p = p, ε_p = 1) when n is not too small nor p too large; otherwise, the performance deteriorates. When p is large and n small, the time for inter-processor data traffic, t_comm, given in (9), dominates the computation time. On the other hand, when p is small and n large, t_calc (the computation time) dominates, but the overall implementation is then close to being serial, hence s_p ≈ 1. Therefore, realizing reasonable performance with the method given in this study comes from a judicious choice of both p and n. Lastly, even though the performance may improve at moderate choices of the values of n and p, the parallel multiplication is recommended only for matrix orders n ≥ 500, and serial computation for smaller orders, because the small matrix orders are not worth the resources demanded by the parallel computation.

8 Acknowledgment

I am very grateful to the administration and management team of the Alabama Supercomputer Center, Huntsville, AL, which gave me an opportunity to use the wealth of their supercomputing resources to carry out the experimentations reported in this paper when I taught in the MCIS department of Jacksonville State University, Jacksonville, AL. I am also grateful to Dr. Sadanand Srivastava, the chair of the Computer Science Department, Bowie State University, Bowie, MD, the department I am presently serving, without whose encouragement and support the timely completion of this study would have been impossible.

References

[1] Fox, G., S. Otto, and A. J. G. Hey, Matrix Algorithms on a Hypercube I: Matrix Multiplication, Parallel Computing, vol. 3, pp. 17-31, 1987.

[2] Lederman, S. H., E. M. Jacobson, and A. Tsao, Comparison of Scalable Parallel Matrix Multiplication Libraries, Proc. of the Scalable Parallel Libraries Conf., IEEE Comp. Society Press, pp. 142-149, 1994.

[3] Agarwal, R. C., F. G. Gustavson, and M. Zubair, A High Performance Matrix Multiplication on a Distributed Memory Parallel Computer Using Overlapped Communication, IBM Journ. Res. Develop., vol. 38, pp. 673-681, 1994.

[4] Rees, S. A., and J. P. Black, An Experimental Investigation of Distributed Matrix Multiplication Techniques, Software-Practice and Experience, vol. 21(10), pp. 1041-1063, Oct. 1991.

[5] Dongarra, J., et al., Sourcebook of Parallel Computing, Morgan Kaufmann Pub. Co., San Francisco, 2003.
[6] Grelck, C., and S.-B. Scholz, SAC: From High Level Programming with Arrays to Efficient Parallel Computing, Par. Proc. Letters, vol. 13(3), pp. 401-412, 2003.

[7] Johnsson, S. L., and C. T. Ho, Algorithms for Multiplying Matrices of Arbitrary Shapes Using Shared Memory Primitives on Boolean Cubes, Tech. Report TR-569, Yale Univ., New Haven, CT, 1987.

[8] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Int'l Jour. of Supercomp. Appl. and High Perf. Comp., vol. 8, no. 3/4, pp. 165-414, Fall/Winter 1994.

[9] Snir, M., et al., MPI: The Complete Reference, Volume 1, The MPI Core, MIT Press, Cambridge, MA, 1998.

[10] Brightwell, R., et al., Design, Implementation, and Performance of MPI on Portals 3.0, Int'l Jour. of Supercomp. Appl., vol. 17, no. 1, pp. 7-20, Spring 2003.

[11] Choi, J., J. Dongarra, and D. W. Walker, PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers, Concurrency: Pract. and Exper., vol. 6, pp. 543-570, Oct. 1994.

[12] van de Geijn, R., and J. Watts, SUMMA: Scalable Universal Matrix Multiplication Algorithm, Tech. Report TR-95-13, Dept. of Comp. Sc., Univ. of Texas at Austin, TX, 1995.

[13] Parello, D., et al., On Increasing Awareness in Program Optimization to Bridge the Gap Between Peak and Sustained Processor Performance: Matrix Multiply, Proceed. of the 2002 ACM/IEEE Conf. on Supercomputing, Baltimore, MD, pp. 1-11, Nov. 2002.

[14] Jagadish, H. V., and T. Kailath, A Family of New Efficient Arrays for Matrix Multiplication, IEEE Trans. Comp., vol. C-38, pp. 140-155, Jan. 1989.

[15] Chatterjee, S., et al., Recursive Array Layouts and Fast Parallel Matrix Multiplication, Annual ACM Symp. on Parallel Algorithms and Architectures, St. Malo, France, pp. 222-231, June 1999.

[16] Horowitz, E., and S. Sahni, Fundamentals of Computer Algorithms, Comp. Sc. Press, Potomac, MD, 1978.

[17] Coppersmith, D., and S. Winograd, Matrix Multiplication via Arithmetic Progressions, Jour. of Symb. Comp., vol. 9, pp. 251-280, 1990.

[18] Bini, D., and V. Pan, Polynomial and Matrix Computations, Volume 1: Fundamental Algorithms, Birkhäuser, Boston, 1994.

[19] Golub, G. H., and C. F. van Loan, Matrix Computations, Johns Hopkins Univ. Press, Baltimore, MD, 1996.

[20] Dekel, E., D. Nassimi, and S. Sahni, Parallel Matrix and Graph Algorithms, SIAM Jour. on Comput., vol. 10, pp. 657-673, 1981.

[21] Cannon, L. E., A Cellular Computer to Implement the Kalman Filter Algorithm, Ph.D. Thesis, Montana State Univ., 1969.
