Type, northNeighbor, sendMsgTag, southNeighbor, recvMsgTag, Ccomm);

in the parallel code, Figure 1. Thus the space requirement 4n² ≤ M (M being the memory size of each node) is optimal because, matrices A, B, C being square, the memory requirement cannot be further reduced by any …

[Plot vs. Matrix Size, n (n = 200-1200); vertical scale 0-250]
1. Bandwidth (bi-directional) ≈ 520 MB/s. Hence, the time for a bi-directional transfer of one byte is β ≈ 1/520 ≈ 0.0019 µs.

Summarized below are the important observations on the performance of the implementations:
• The performance of the parallel Fox program, in terms of the speedup (12) and efficiency (13), depends very strongly on the magnitudes of p and n. If p is small and n large, or p large and n small, both sp and εp tend to approach their lower limiting values of 1 and 1/p respectively. For example, when p = 4, n = 1,200 (block order 600), s4 = 1.69 and ε4 = 0.42; when p = 100, n = 80 (block order 8), s100 = 1.25 and ε100 = 0.01.

• Outside the two extreme situations given above, both sp and εp improve, but with neither attaining its upper limiting value for any value of p and n. For example, when p = 4, n = 100 (block order 50), s4 = 3.64 and ε4 = 0.92; when p = 64, n = 800 (block order 100), s64 = 23.33 and ε64 = 0.36; when p = 100, n = 800 (block order 80), s100 = 91.30 and ε100 = 0.91.

7 Conclusion

The performance of the parallel implementation depends on the number of processors, p, and the matrix order, n (and hence the order of the constituent matrix blocks). The performance improves (but never attains the optimal values of sp = p, εp = 1) when n is not too small and p is not too large; otherwise, the performance deteriorates. When p is large and n small, tcomm, the time for inter-processor data traffic given in (9), dominates the computation time. On the other hand, when p is small and n large, tcalc (the computation time) dominates, but the overall implementation is then close to being serial; hence, sp ≈ 1. Therefore, realization of a reasonable performance of the method given in this study comes from a judicious choice of both p and n. Lastly, even though the performance may improve at moderate choices of the values of n and p, the parallel multiplication is recommended only for matrix orders n ≥ 500, with serial computation for smaller orders, because small matrix orders are not worth the resources demanded by the parallel computation.

8 Acknowledgment

I am very grateful to the administration and management team of the Alabama Supercomputer Center, Huntsville, AL, which gave me an opportunity to use the wealth of their supercomputing resources to carry out the experimentations reported in this paper when I taught in the MCIS department of Jacksonville State University, Jacksonville, AL. I am also grateful to Dr. Sadanand Srivastava, chair of the Computer Science Department, Bowie State University, Bowie, MD, the department I am presently serving, without whose encouragement and support the timely completion of this study would have been impossible.

References

[1] Fox, G., S. Otto, and A. J. G. Hey, Matrix Algorithms on a Hypercube I: Matrix Multiplication, Parallel Computing, vol. 3, pp. 17-31, 1987.

[2] Lederman, S. H., E. M. Jacobson, and A. Tsao, Comparison of Scalable Parallel Matrix Multiplication Libraries, Proc. of Scalable Parallel Libraries Conf., IEEE Comp. Society Press, pp. 142-149, 1994.

[3] Agarwal, R. C., F. G. Gustavson, and M. Zubair, A High Performance Matrix Multiplication on a Distributed Memory Parallel Computer Using Overlapped Communication, IBM Journ. Res. Develop., vol. 38, pp. 673-681, 1994.

[4] Rees, S. A., and J. P. Black, An Experimental Investigation of Distributed Matrix Multiplication Techniques, Software-Practice and Experience, vol. 21(10), pp. 1041-1063, Oct. 1991.

[5] Dongarra, J., et al., Sourcebook of Parallel Computing, Morgan Kaufmann Pub. Co., San Francisco, 2003.

[6] Grelck, C., and S.-B. Scholz, SAC: From High-Level Programming with Arrays to Efficient Parallel Computing, Par. Proc. Letters, vol. 13(3), pp. 401-412, 2003.

[7] Johnsson, S. L., and C. T. Ho, Algorithms for Multiplying Matrices of Arbitrary Shapes Using Shared Memory Primitives on Boolean Cubes, Tech. Report TR-569, Yale Univ., New Haven, CT, 1987.

[8] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Int'l Jour. of Supercomp. Appl. and High Perf. Comp., vol. 8, no. 3/4, pp. 165-414, Fall/Winter 1994.

[9] Snir, M., et al., MPI: The Complete Reference, Volume 1, The MPI Core, MIT Press, Cambridge, MA, 1998.

[10] Brightwell, R., et al., Design, Implementation, and Performance of MPI on Portals 3.0, Int'l Jour. of Supercomp. Appl., vol. 17, no. 1, pp. 7-20, Spring 2003.

[11] Choi, J., J. Dongarra, and D. W. Walker, PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers, Concurrency: Pract. and Exper., vol. 6, pp. 543-570, Oct. 1994.

[14] …, IEEE Trans. Comp., vol. C-38, pp. 140-155, Jan. 1989.

[15] Chatterjee, S., et al., Recursive Array Layouts and Fast Parallel Matrix Multiplication, Annual ACM Symp. on Par. Algorithms and Architectures, St. Malo, France, pp. 222-231, June 1999.

[16] Horowitz, E., and S. Sahni, Fundamentals of Computer Algorithms, Comp. Sc. Press, Potomac, MD, 1978.

[17] Coppersmith, D., and S. Winograd, Matrix Multiplication via Arithmetic Progressions, Jour. of Symb. Comp., vol. 9, pp. 251-280, 1990.

[18] Bini, D., and V. Pan, Polynomial and Matrix Computations, Volume 1: Fundamental Algorithms, Birkhäuser, Boston, 1994.

[19] Golub, G. H., and C. F. van Loan, Matrix Computations, Johns Hopkins Univ. Press, Baltimore, MD, 1996.

[20] Dekel, E., D. Nassimi, and S. Sahni, Parallel Matrix and Graph Algorithms, SIAM Jour. on Comput., vol. 10, pp. 657-673, 1981.

[21] Cannon, L. E., A Cellular Computer to Implement the Kalman Filter Algorithm, Ph.D. Thesis, Montana State Univ., 1969.