SRS Library documentation

Stop your MPI application & restart it on a different number of processors.

Contents

1 Introduction
2 SRSPrecompiler
3 Installation
4 Requirements
5 Utils
5.1 stop_application
5.2 rss_ckpt
5.3 rss_restore
5.4 change_ckpt_interval
5.5 ibp_move
5.6 bandwidth_matrix_gen.sh

6 Examples
6.1 SRS Restart example
6.2 SRS_Register example
6.3 SRS_Check_Stop example
6.4 SRS_Read examples
6.5 SRS_DistributeFunc_Create example
6.6 SRS_DistributeMap_Create example
6.7 The big picture, a working example

7 Test Programs
8 Credits
9 References

This page was generated with the help of DOC++ (http://docpp.sourceforge.net), December 28, 2009.

Motivation
The SRS library allows users to stop a running application and restart it
on a different number of processors with a different data distribution.
This is useful in a number of cases:

• Application migration
The machines in a cluster on which the application is currently running
may become unavailable after a period of time. After learning this from
the system administrators, the user determines that the application will
not be able to complete within that period. The user therefore wants to
stop the application, move it to another cluster and continue it from the
point where it was stopped.
Application migration is also useful for resource management systems like
Condor, where an application has to be migrated when the workstation
owner returns to using the machine. Since Condor does not have adequate
support for MPI programs, the SRS library can be readily used to support
MPI programs in the Condor framework.
Certain scheduling systems in distributed systems use preemptive methods
to ensure the progress of different applications. In this case, the stop and
restart mechanism provided by the SRS library can be used for context
switching between different applications.
• Expanding the processor set
In many cases, the users of parallel programs do not know the exact number
of processors to use and may want to determine it by trial and error. The
user can start the application on an initial set of processors, determine
that the application is not running at sufficient speed, stop the
application, restart and continue it with more processors, stop the
application again and so on.
• Reducing the processor set
The user may want to reduce the number of processors used by the
application, either to increase its performance or because some resources
are no longer available.
• Changing the data distribution
As with the number of processors, the users of parallel programs are often
at a loss regarding the type of data distribution to use for the data in
their program. The user can start with an initial data distribution, say
block data distribution, and run the application. If the performance of
the application is not satisfactory, the user can stop the application,
compile it with a new data distribution, say block cyclic, restart the
application and continue from the point where it was stopped, but this time
with the block-cyclic data distribution, note the performance change, stop
the application again and so on.
• Fault tolerance
Apart from the proactive stopping of applications by the user, the
application may terminate abnormally due to sudden host failures. SRS
provides periodic checkpointing so that when the host is brought back up,
the application can be restarted and continued from the point where it
terminated abnormally.

1 Introduction

SRS is a user-level, non-transparent checkpointing library containing APIs
that the user can call in an MPI parallel program. The resulting program
can be started on a certain set of processors, interrupted due to failures
on the system or stopped using the utilities provided in SRS, and continued
on a different set of processors with different architectures, a different
number of processors and even a different data distribution in the
application. Any native MPI implementation can be used with the SRS library.
The modifications to the parallel program include registering the variables
to be checkpointed using SRS_Register(), inserting conditional statements
that execute certain segments of the code in the start mode and certain
other parts of the program in the restart mode, and reading checkpointed
data using SRS_Read(). Thus, while users need to know which data have to be
checkpointed and how the application behaves in the start and restart
modes, they need not know exactly how the checkpointing is done or where
the data is stored. Once the SRS header files are included, calls to SRS
functions are made in the application, and the application is compiled and
linked with the SRS library, the application is ready to be stopped and
restarted.
SRS uses different storage frameworks for storing checkpointed data, namely
simple file-based mechanisms, the IBP framework from the University of
Tennessee and Yahoo!'s Hadoop. The user has to run a program called the
Runtime Support System (RSS) that exists throughout the lifetime of the
application. The RSS maintains pointers to data between different
application stops and restarts (see Installation & Requirements).
Also included in the SRS library are a few utilities (see Utils):
1. a program that allows the user to stop the application at any time
during its execution,
2. a program that allows the user to change the checkpointing interval
during application execution,
3. a program that allows the user to checkpoint the RSS daemon, and
4. a program that allows the user to restore the RSS checkpointed on one
machine to a different machine.
The checkpointed data is encoded in eXternal Data Representation (XDR),
so that the application works across different architectures. An example
case is when the user wants to stop the application on a little-endian
architecture and restart it on a big-endian architecture.
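
The portability described above rests on one property of XDR: a 32-bit integer is always written as four bytes, most significant byte first, whatever the byte order of the host. The following is a minimal sketch of that idea in plain C. It does not use the XDR library or any SRS call, and the helper names encode_int32_be and decode_int32_be are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write v into buf as 4 bytes, most significant byte first
   (the byte order XDR uses for 32-bit integers). */
void encode_int32_be(int32_t v, unsigned char buf[4]) {
    uint32_t u = (uint32_t)v;
    assert(buf != NULL);
    buf[0] = (unsigned char)(u >> 24);
    buf[1] = (unsigned char)(u >> 16);
    buf[2] = (unsigned char)(u >> 8);
    buf[3] = (unsigned char)u;
}

/* Rebuild the integer from the 4 bytes. Shifts operate on values, not on
   memory layout, so the result does not depend on the host byte order. */
int32_t decode_int32_be(const unsigned char buf[4]) {
    uint32_t u;
    assert(buf != NULL);
    u = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
        ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    return (int32_t)u;
}
```

A buffer written by encode_int32_be on a little-endian machine can be read back with decode_int32_be on a big-endian machine, which is the situation described in the example case.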

2 SRSPrecompiler

• Introduction:
SRSPrecompiler is intended to automate the process of inserting the SRS
calls into the application, so that the resulting programs become
fault-tolerant, migratable and malleable.
• Prerequisites:
1. gdsl-1.4 (Generic Data Structure Library): It can be downloaded from
http://download.gna.org/gdsl/. Follow the installation steps specified in
the INSTALL file of gdsl.
2. gcc-3.4.4: To parse the user program, SRSPrecompiler uses the gcc-3.4.4
macros.
• Installation:
a) >> gunzip SRS-1.2.1.tgz
b) >> tar -xvf SRS-1.2.1.tar
Let LOC be the location of SRS-1.2.1.
c) >> cd LOC/SRSPrecompiler
d) >> ./make-precompiler.sh
Step d) should produce the binary LOC/bin/srs_compile.
To remove the srs_compile binary, run make clean in LOC/SRSPrecompiler.
e) >> cd LOC
f) Follow the steps given in "Quick start to install SRS", starting from
step 3.
• Compilation:
This section describes how to compile a plain C/MPI program. The
srs_compile utility converts a plain C/MPI application into a
self-restartable, self-checkpointing, malleable and migratable C/MPI
application. Compiling source code with srs_compile is similar to
compiling with gcc; the only differences are the binary name and the
output file.
Put the srs_compile binary in your path.
If you are using tcsh:
>> setenv PATH ${PATH}:LOC/SRSPrecompiler/bin
If you are using bash:
>> export PATH=$PATH:LOC/SRSPrecompiler/bin
In the application, place SRS_PollPoint() function calls at the locations
where there should be potential checkpoints. An example of such a file
with SRS_PollPoint() is /SRS/test/redistribute_test.c. The following
statements illustrate compiling with srs_compile.

1) Run srs_compile on redistribute_test.c.
garl-intel1> srs_compile redistribute_test.c -I/home/SRS-1.2.1/include -I/usr/local/mpich-1.2.5/ch_p4/include -I/home/Research/gdsl-1.4/include
(Note: Make sure that gcc version 3.4.4 is in your PATH variable.)
This should produce redistribute_test.p.c.
2) Compile the precompiled file (redistribute_test.p.c) using gcc.
garl-intel1> gcc -o redistribute_test.o redistribute_test.p.c -lm -lmpich -L/usr/local/mpich-1.2.5/ch_p4/lib -lpthread -lsrs -lmallocwrapper -L/home/SRS-1.2.1/lib /home/gdsl-1.4/lib/libgdsl.a
This creates the executable "redistribute_test.o".

3 Installation

Follow these steps to use the SRS library:

1. Follow the quick start steps given at
http://garl.serc.iisc.ernet.in/SRS/SRS.htm.
2. Before proceeding, see the README file and the Requirements section.

4 Requirements

• MPI
The SRS library is built on top of MPI. Hence any application that uses
SRS must be an MPI-based program.
• Headers
The application that uses SRS must include two header files, srs.h and
datatype.h.
• Libs
The user must link the application with libsrs.a (lib/ directory). The
user must also link the application with the pthread library, which is
needed for the IBP functionality.
• Runtime
Before starting the application, the user must run the executable rss (bin/
directory). This is a sequential program and can be run on a machine
different from the machines where the actual application is run.
• Optional
– IBP
This is needed when the user wants to use the IBP distributed storage
framework for storing checkpoints. IBP depots must be started on the
pool of machines where the user can potentially run the application
(see ibp_server_mt). Note that the default storage mechanism is the
simple file-based one. If you want to use IBP, change line 4 in
SRS/include/dsi.h from "#define FSERVER" to "#define IBPD" and
recompile SRS.
– HADOOP
This is needed when the user wants to use Yahoo!'s Hadoop
infrastructure for storing checkpoints.
• Config file
Before starting the application, the user must have the file 'srs.config'
in the same location where process 0 of the application will be executing.
A sample srs.config is given below.

RUNTIME_HOST = torc1.cs.utk.edu
RUNTIME_PORT = 9009
FAULT_TOLERANCE = yes
CKPT_INTERVAL = 40
NO_IBP_SERVERS = 3
IBP_SERVERS = garl-intel4 garl-intel2 garl-intel3

The RUNTIME_HOST line points to the host where rss was started.
The RUNTIME_PORT line points to the port where rss will be accepting
connections. This port is printed out when running the rss program.

The SRS library also does periodic checkpointing of data to provide fault
tolerance. Providing 'y' or 'yes' in the FAULT_TOLERANCE line enables
this mechanism (you can also give 'n' or 'no'). The interval of periodic
checkpointing in seconds can be specified using CKPT_INTERVAL. The
checkpointing interval can also be changed during application execution
(see Utils).
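
The sample file uses one "KEY = value" entry per line. As an illustration of that format only, the sketch below splits such a line into its key and value and interprets a FAULT_TOLERANCE value. This is not the parser used inside SRS; the function names parse_config_line and is_enabled are hypothetical, and treating 'y' and 'yes' as the only enabling values follows the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Split one "KEY = value" line into key and value.
   Returns 1 on success, 0 if the line does not have that shape.
   key and value must each have room for 128 bytes. */
int parse_config_line(const char *line, char key[128], char value[128]) {
    assert(line != NULL);
    /* %127[^= \t] reads the key up to a space, tab or '=';
       %127[^\n] reads the rest of the line as the value. */
    if (sscanf(line, " %127[^= \t] = %127[^\n]", key, value) != 2)
        return 0;
    return 1;
}

/* FAULT_TOLERANCE is enabled by "y" or "yes". */
int is_enabled(const char *value) {
    return strcmp(value, "y") == 0 || strcmp(value, "yes") == 0;
}
```

For a multi-valued entry such as IBP_SERVERS, the value returned is the whole remainder of the line; splitting it into host names is left to the caller.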

NO_IBP_SERVERS specifies the number of IBP depots available for that run,
and IBP_SERVERS points to those depots. In this sample config file the
available IBP servers are "garl-intel4", "garl-intel2" and "garl-intel3".
The IBP servers are chosen when the application is started or restarted,
and the decision depends on the availability of the file
"bandwidth_matrix.dat". The file is generic and contains the bandwidth
between the machines on which the application can be run and the machines
on which the IBP depots are available. The file is generated by the
utility "bandwidth_matrix_gen.sh" (see Utils). Every process chooses the
IBP depot for which the bandwidth between the machine where the process is
running and the depot is maximum. If two or more processes choose the same
IBP depot, the depots are assigned in a limited round-robin fashion: if a
chosen IBP depot is already being used by another process, a search is
made for the next IBP depot whose bandwidth is within 10% of that of the
chosen depot. If such an IBP depot is present and available, it is
assigned to the process; otherwise the chosen IBP depot itself is
assigned. If "bandwidth_matrix.dat" is not available, the IBP depots are
assigned in a round-robin fashion.
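
The limited round-robin rule can be sketched for a single process as follows. This is an illustration of the description above, not the SRS implementation: the function name choose_depot is hypothetical, "within 10%" is read as a bandwidth of at least 90% of the preferred depot's bandwidth, and the search order (starting at the depot after the preferred one and wrapping around) is an assumption.

```c
#include <assert.h>

/* bw[d]   : bandwidth between this process's machine and depot d
   used[d] : nonzero if depot d is already assigned to another process
   Returns the index of the depot to assign and marks it as used. */
int choose_depot(const double *bw, int *used, int ndepots) {
    int d, k, best = 0, pick;
    assert(ndepots > 0);
    for (d = 1; d < ndepots; d++)          /* depot with maximum bandwidth */
        if (bw[d] > bw[best]) best = d;
    pick = best;
    if (used[best]) {
        /* Preferred depot is taken: look for a free depot whose
           bandwidth is within 10% of the preferred one. */
        for (k = 1; k < ndepots; k++) {
            d = (best + k) % ndepots;
            if (!used[d] && bw[d] >= 0.9 * bw[best]) {
                pick = d;
                break;
            }
        }
    }
    used[pick] = 1;   /* if no alternative was found, the preferred depot is shared */
    return pick;
}
```

With bandwidths {100, 95, 50}, three successive calls return depots 0, 1 and 0: the second process is moved to the nearly equivalent depot 1, while the third stays on depot 0 because depot 2 is too slow to qualify.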

5 Utils

Names
5.1 stop_application
5.2 rss_ckpt
5.3 rss_restore
5.4 change_ckpt_interval
5.5 ibp_move
5.6 bandwidth_matrix_gen.sh

SRS provides several utilities, which are in the bin/ directory after SRS
has been built successfully.

5.1 stop_application

Whenever the user wants to stop the program, the user runs a program called
stop_application that is included in the SRS library.
The command is:
>> stop_application <runtime host> <runtime port>

• runtime host
the host where rss was started
• runtime port
the port where rss is accepting connections. This port is printed
out when running the rss program.

The user can restart the application in the same way it was started
initially.
The rss program should still be running, and srs.config should not have been
changed between runs. When the application runs to completion, the rss
program terminates.

5.2 rss_ckpt

This utility is useful when using a large distributed infrastructure with
different clusters. When the application is migrated from cluster-1 to
cluster-2, instead of requiring the application to contact the rss daemon
of cluster-1, the rss daemon can also be checkpointed and migrated from
cluster-1 to cluster-2 using rss_ckpt and rss_restore (next section).

Whenever the user wants to store the rss daemon, the user runs a program
called rss_ckpt that is included in the SRS library.
The command is:
>> rss_ckpt <runtime host> <runtime port>

• runtime host
the host where rss was started
• runtime port
the port where rss is accepting connections. This port is printed
out when running the rss program.

rss_ckpt stores the rss daemon that is currently running on the port to the
file rss_ckpt.dat. The user can load rss_ckpt.dat into an rss daemon
running on another machine. The user can then restart the application in
the same way it was started initially.
When the application runs to completion, the rss program terminates normally.

5.3 rss_restore

Whenever the user wants to load the rss daemon previously stored in the file
rss_ckpt.dat into the current rss daemon, the user runs a program called
rss_restore that is included in the SRS library.
The command is:
>> rss_restore <runtime host> <runtime port>

• runtime host
the host where rss was started
• runtime port
the port where rss is accepting connections. This port is printed
out when running the rss program.

rss_restore loads rss_ckpt.dat into the rss daemon currently running on
the runtime port. The user can then restart the application in the same way
it was started initially.
When the application runs to completion, the rss program terminates.

5.4 change_ckpt_interval

This utility changes the interval between checkpoints (only if
FAULT_TOLERANCE is set to yes in srs.config). The command is:
>> change_ckpt_interval <runtime host> <runtime port> <new interval>

• runtime host
the host where rss was started
• runtime port
the port where rss is accepting connections. This port is printed
out when running the rss program.
• new interval
the new time to wait between two checkpoints.

This change takes effect only after SRS_Check_Stop is called in the user's
application.

5.5 ibp_move

This utility is used to move the IBP depots from one set of machines to a
different set of machines. Before using the utility, the user must have the
file 'ibp_servers.config' in the location from which the utility is called.
This utility cannot be used while the application is running. It can be
used once the application has been stopped, and is useful when the user
wants to restart the application on a different set of machines from which
the present IBP servers are not accessible. The command is:
>> ibp_move <runtime host> <runtime port>

• runtime host
the host where rss was started
• runtime port
the port where rss is accepting connections.

A sample ibp_servers.config is given below.

CHANGES = 3
garl-intel1 garl-intel2
garl-intel3 garl-intel2
garl-intel4 garl-intel2

CHANGES specifies the number of IBP depots to be moved. Each subsequent
line contains a source IBP depot (first field) from which the data is
moved and the destination IBP depot (second field). In this sample config
file, the data from IBP depots "garl-intel1", "garl-intel3" and "garl-intel4"
is moved to "garl-intel2". When the rss has finished moving the IBP depots,
it prints out a message. Once this utility has been executed, the IBP depots
on the source machines ("garl-intel1", "garl-intel3" and "garl-intel4") can
be shut down and the user application restarted with only the destination
IBP depot ("garl-intel2") running.
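
To make the file layout concrete, the sketch below reads text in the format shown above: a "CHANGES = N" line followed by N "source destination" pairs. It is only an illustration of the format, not the code used by the utility or by rss; the function name parse_ibp_moves and the limits MAX_MOVES and HOST_LEN are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_MOVES 16   /* illustrative limit on the number of moves */
#define HOST_LEN  64   /* illustrative limit on a host name length  */

/* Parse "CHANGES = N" followed by N "source destination" pairs.
   Returns N, or -1 if the text is malformed or N is out of range. */
int parse_ibp_moves(const char *text, char src[][HOST_LEN], char dst[][HOST_LEN]) {
    int n, i, used = 0, step = 0;
    assert(text != NULL);
    /* %n records how many characters were consumed so far. */
    if (sscanf(text, " CHANGES = %d%n", &n, &used) != 1 || n < 0 || n > MAX_MOVES)
        return -1;
    for (i = 0; i < n; i++) {
        text += used;   /* skip what the previous sscanf call consumed */
        if (sscanf(text, " %63s %63s%n", src[i], dst[i], &step) != 2)
            return -1;
        used = step;
    }
    return n;
}
```

For the sample file, the function returns 3 and fills src with the three source depots and dst with "garl-intel2" three times.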

5.6 bandwidth_matrix_gen.sh

This utility is used to generate the file "bandwidth_matrix.dat". The file is
generic and contains the bandwidth between the machines on which the
application can be run and the machines on which the IBP depots are
available.
The command is:
>> sh bandwidth_matrix_gen.sh machines_info.txt &
A sample machines_info.txt is given below.

garl-intel1
hosts
garl-intel1
garl-intel2
ibpservers
garl-intel4

The utility requires a running Network Weather Service
(http://nws.cs.ucsb.edu/ewiki/), a distributed system that periodically
monitors and dynamically forecasts the performance that various network and
computational resources can deliver over a given time interval. The skill
skillName:tcpMessageMonitor should be started on the Network Weather Service.
The first line in machines_info.txt contains the name of the host on which the
NWS nameserver is started, followed by the hosts on which the user application
can be run and the machines on which the IBP servers are available. This utility
runs in the background, refreshing the bandwidth_matrix.dat file every half
hour. A log file "bandwidth_matrix_gen.log" is generated by the utility to assist
the users.

6 Examples

Names
6.1 SRS Restart example
6.2 SRS_Register example
6.3 SRS_Check_Stop example
6.4 SRS_Read examples
6.5 SRS_DistributeFunc_Create example
6.6 SRS_DistributeMap_Create example
6.7 The big picture, a working example

6.1 SRS Restart example

If the application uses a matrix that is checkpointed when the application is
stopped in the middle of its execution, then the matrix needs to be
initialized with initial values only when the application is executed for
the first time.

int main(void){
    int* matrix;
    int i, restart_value;
    matrix = (int*)malloc(sizeof(int)*10);
    restart_value = SRS_Restart_Value();
    /* Initialize only on the first run, not on a restart. */
    if(restart_value == 0){
        for(i=0; i<10; i++){
            matrix[i] = i;
        }
    }
    return 0;
}

6.2 SRS_Register example

int main(){
    double x[10];
    int i;
    int local_size = 2;
    SRS_Register("X", x, GRADS_DOUBLE, 10, CYCLIC, &local_size);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
}

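6.3 SRS_Check_Stop example

int main(){
    int stop_value;
    int *a;    /* assumed to be allocated earlier in the program */
    stop_value = SRS_Check_Stop(NULL);
    /* A value of 1 means a stop was requested: release resources and exit. */
    if(stop_value == 1){
        free(a);
        MPI_Finalize();
        exit(0);
    }
}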

6.4 SRS_Read examples

In the following examples, only partial code statements are shown to
demonstrate the SRS_Read call.

Example 1:
This is a simple example in which an array of integers is copied from the set
of processes in the old application to the corresponding set of processes in the
new application run.

int main(){
    int A[10];
    int i, restart_value;
    SRS_Init();
    restart_value = SRS_Restart_Value();
    if(restart_value == 1){
        SRS_Read("A", A, 0, NULL);
    }
    SRS_Register("A", A, GRADS_INT, 10, 0, NULL);
}

Example 2:
In this example, block-cyclic data distribution is used for both old and new
application runs. Thus the same data is distributed in a block cyclic fashion
over a new set of processes when the application is restarted. Unlike Example
1, this example can be stopped and restarted on a different set of processes.

int main(){
    int A[10];
    int i, restart_value;
    SRS_Init();
    restart_value = SRS_Restart_Value();
    if(restart_value == 1){
        SRS_Read("A", A, BLOCK, NULL);
    }
    SRS_Register("A", A, GRADS_INT, 10, BLOCK, NULL);
}

Example 3:
This example demonstrates the use of the SAME value for the new distribution in
SRS_Read(). In this example, SAME is used for propagating the checkpointed
iterator to all the processes so that all the processes in the current application
run can start from the same iteration.

int main(){
    int i, restart_value;
    int iter_start = 0;    /* first run starts at iteration 0 */
    SRS_Init();
    restart_value = SRS_Restart_Value();
    if(restart_value == 1){
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
    for(i=iter_start; i<10; i++){
        // computation
    }
}

6.5 SRS_DistributeFunc_Create example

In this example, SRS_DistributeFunc_Create is used to create a handle to a
block data distribution. This handle is used in the subsequent SRS_Register.

DataMapInfo* block_distribution(int data_size, int proc_count,
                                void* other_data, char* input_arg){
    int i, total_offset;
    DataMapInfo* data_map;
    data_map = (DataMapInfo*)malloc(sizeof(DataMapInfo));
    data_map->info_count = proc_count;
    data_map->offset = (int*)malloc(sizeof(int)*proc_count);
    data_map->size = (int*)malloc(sizeof(int)*proc_count);
    data_map->proc = (int*)malloc(sizeof(int)*proc_count);
    total_offset = 0;
    for(i=0; i<proc_count; i++){
        data_map->offset[i] = total_offset;
        data_map->size[i] = data_size/proc_count +
            ((data_size % proc_count) > i);
        data_map->proc[i] = i;
        total_offset += data_map->size[i];
    }
    input_arg = NULL;
    return data_map;
}

int main(int argc, char** argv){
    int A[10];
    int restart_value;
    int distributefunc_handle;
    DataMapInfo* (*distribute_func)(int, int, void*, char*);
    MPI_Init(&argc, &argv);
    SRS_Init();
    restart_value = SRS_Restart_Value();
    distribute_func = block_distribution;
    /* Create a handle for the user-defined distribution and register A with it. */
    SRS_DistributeFunc_Create(distribute_func, &distributefunc_handle);
    SRS_Register("A", A, GRADS_INT, 10, distributefunc_handle, NULL);
    SRS_Finish();
    MPI_Finalize();
    return 0;
}
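
The size formula in block_distribution() gives the first (data_size % proc_count) processes one extra element. The following stand-alone sketch reproduces only that arithmetic with plain arrays, so it can be tried without SRS; the function name block_sizes is hypothetical and no SRS type or call is used.

```c
#include <assert.h>

/* Fill size[] and offset[] for a block distribution of data_size
   elements over proc_count processes, using the same formula as
   block_distribution(): the first (data_size % proc_count)
   processes receive one extra element. */
void block_sizes(int data_size, int proc_count, int *size, int *offset) {
    int i, total_offset = 0;
    assert(proc_count > 0);
    for (i = 0; i < proc_count; i++) {
        offset[i] = total_offset;
        size[i] = data_size / proc_count + ((data_size % proc_count) > i);
        total_offset += size[i];
    }
}
```

For data_size = 10 and proc_count = 3 this gives sizes 4, 3, 3 at offsets 0, 4, 7; for proc_count = 4 it gives sizes 3, 3, 2, 2 at offsets 0, 3, 6, 8.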

6.6 SRS_DistributeMap_Create example

In this example, the block-cyclic data distribution is constructed using the data
map structure and a handle is created using SRS_DistributeMap_Create().
This handle is used in the subsequent SRS_Register() call.

int main(int argc, char** argv){
    int A[10];
    int handle;
    int restart_value;
    DataMapInfo* dataMap;
    MPI_Init(&argc, &argv);
    SRS_Init();
    restart_value = SRS_Restart_Value();
    dataMap = (DataMapInfo*)malloc(sizeof(DataMapInfo));
    dataMap->info_count = 5;
    dataMap->offset = (int*)malloc(sizeof(int)*5);
    dataMap->size = (int*)malloc(sizeof(int)*5);
    dataMap->proc = (int*)malloc(sizeof(int)*5);
    /* Five blocks of two elements, dealt to processes 0, 1, 2, 0, 1. */
    dataMap->offset[0] = 0;
    dataMap->size[0] = 2;
    dataMap->proc[0] = 0;
    dataMap->offset[1] = 2;
    dataMap->size[1] = 2;
    dataMap->proc[1] = 1;
    dataMap->offset[2] = 4;
    dataMap->size[2] = 2;
    dataMap->proc[2] = 2;
    dataMap->offset[3] = 6;
    dataMap->size[3] = 2;
    dataMap->proc[3] = 0;
    dataMap->offset[4] = 8;
    dataMap->size[4] = 2;
    dataMap->proc[4] = 1;
    SRS_DistributeMap_Create(dataMap, &handle);
    SRS_Register("A", A, GRADS_INT, 10, handle, NULL);
    SRS_Finish();
    MPI_Finalize();
    return 0;
}
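
The map written out by hand in this example follows a regular pattern, so the same offset, size and proc values could also be computed in a loop. The sketch below shows one way to do that with plain arrays; it is independent of SRS, the function name block_cyclic_map is hypothetical, and copying the results into a DataMapInfo structure is left out.

```c
#include <assert.h>

/* Build a block-cyclic map: consecutive blocks of block_size elements are
   dealt to processes 0, 1, ..., proc_count-1, 0, 1, ... in turn.
   Returns the number of blocks written (at most max_blocks). */
int block_cyclic_map(int data_size, int block_size, int proc_count,
                     int *offset, int *size, int *proc, int max_blocks) {
    int b = 0, start;
    assert(block_size > 0 && proc_count > 0);
    for (start = 0; start < data_size && b < max_blocks; start += block_size, b++) {
        int remaining = data_size - start;
        offset[b] = start;
        /* The last block may be shorter than block_size. */
        size[b] = remaining < block_size ? remaining : block_size;
        proc[b] = b % proc_count;
    }
    return b;
}
```

Called with data_size = 10, block_size = 2 and proc_count = 3, it produces five blocks at offsets 0, 2, 4, 6, 8 owned by processes 0, 1, 2, 0, 1, which matches the values assigned one by one in the example.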

6.7 The big picture, a working example

In this example, an array A whose size is divisible by the number of
processors is evenly distributed across all the processors.
When the application is started for the first time (line 24), the root
process initializes the array and distributes subarrays to all the
processors using MPI_Scatter.
Each process executes a loop whose number of iterations is equal to the size
of the array. In each iteration of the loop, the single element of the array
whose index is given by the iteration number is incremented by 10.
This increment is carried out by the processor that owns the element.
Each process registers its subarray and the iteration number for checkpointing.
At the start of each iteration of the loop, each process calls SRS_Check_Stop to
see if the application has to stop. If the application has received a stop signal,
each process frees the allocated arrays and calls MPI_Finalize() and exit().
When the application is restarted, each process reads its portion of the array
and the array is once again distributed in a block fashion. Each process also
reads the iteration number from which it has to continue. Since all the processes
have to continue from the same iteration, SAME is used for SRS_Read.
Thus this example can be started on m processors, stopped and
restarted on n processors, where n can be different from m.
The only requirement for this example is that the size of the array be
divisible by m and n.
The program can be stopped and restarted on different sets of processors any
number of times. When the program completes, the final values of the array,
which range from 10 to (size of the array - 1) + 10, are displayed.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"
#include "srs.h"
#include "datatype.h"
int main(int argc, char** argv){
    int* global_A;
    int* local_A;
    int rank, size;
    int global_size, local_size;
    int proc_number, local_index;
    int i, j, iter_start, restart_value, stop_value;
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Init(&argc, &argv);
    SRS_Init();
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    global_size = atoi(argv[1]);
    local_size = global_size/size;
    restart_value = SRS_Restart_Value();
    global_A = (int*)malloc(sizeof(int)*global_size);
    local_A = (int*)malloc(sizeof(int)*local_size);
    if(restart_value == 0){
        if(rank == 0){
            for(i=0; i<global_size; i++){
                global_A[i] = i;
            }
        }
        MPI_Scatter (global_A, local_size, MPI_INT, local_A, local_size,
                     MPI_INT, 0, comm );
        iter_start = 0;
    }
    else{
        SRS_Read("A", local_A, BLOCK, NULL);
        SRS_Read("iterator", &iter_start, SAME, NULL);
    }
    SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
    SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
    printf("Proc. %d initial: ", rank);
    for(j=0; j<local_size; j++){
        printf("%d ", local_A[j]);
    }
    printf("\n");
    for(i=iter_start; i<global_size; i++){
        stop_value = SRS_Check_Stop();
        if(stop_value == 1){
            free(global_A);
            free(local_A);
            MPI_Finalize();
            exit(0);
        }
        proc_number = i/local_size;
        local_index = i%local_size;
        if(rank == proc_number){
            local_A[local_index] += 10;
        }
        printf("Proc. %d Iter. %d: ", rank, i);
        for(j=0; j<local_size; j++){
            printf("%d ", local_A[j]);
        }
        printf("\n");
        sleep(1);
    }
    free(global_A);
    free(local_A);
    SRS_Finish();
    MPI_Finalize();
    exit(0);
}
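
The claim that the final values do not depend on the number of processors can be checked without MPI or SRS. The sketch below is a serial simulation of the loop in the listing: it keeps the per-process blocks in one flat array and applies the same owner computation (i/local_size and i%local_size). The function name simulate is hypothetical, and stops and restarts are not modelled.

```c
#include <assert.h>
#include <stdlib.h>

/* Serial simulation of the main loop: global_size elements are split into
   nprocs blocks, and iteration i increments the element owned by process
   i / local_size at local index i % local_size. Returns the flattened
   array (caller frees), or NULL if the size is not divisible by nprocs. */
int *simulate(int global_size, int nprocs) {
    int i, local_size, *a;
    assert(nprocs > 0);
    if (global_size % nprocs != 0) return NULL;
    local_size = global_size / nprocs;
    a = (int *)malloc(sizeof(int) * (size_t)global_size);
    if (a == NULL) return NULL;
    for (i = 0; i < global_size; i++) a[i] = i;           /* initial values */
    for (i = 0; i < global_size; i++) {
        int proc_number = i / local_size;
        int local_index = i % local_size;
        a[proc_number * local_size + local_index] += 10;  /* owner's update */
    }
    return a;
}
```

For an array of size 12, simulate(12, 3) and simulate(12, 4) both yield element i equal to i + 10, that is, the values 10 to 21.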

7 Test Programs

To help you with your first SRS experiments, we provide some ready-to-use
applications.

• In the bin/ directory, two basic test applications are built:
– simple_test: a simple MPI test.
– redistribute_test: a more complex test including redistribution
(don't forget the size argument).
• In the src/test/ directory, some well-known numerical applications are
instrumented with SRS. Currently, the ScaLAPACK eigenvalue problem and
PETSc CG, CGS, BICG and BCGS are provided. In the corresponding
directories, you will find a specific README file for compiling and
executing the applications.

8 Credits

• Sathish Vadhiyar (vss@serc.iisc.ernet.in). Coordinator.
• Antoine Henry (antoine.henry@insa-lyon.fr). Worked as a student intern
from INSA, Lyon, France on this project from Jan-Aug 2007. Currently
in LIAMA, China.
• K. Raghavendra (raghavendra83@gmail.com). Project Assistant from
Jan. 2008 - date.
• N. Sri Harsha (sriharsha.nooli@gmail.com). Project Assistant from Jul.
2009 - date.

9 References

1. MPICH. http://www-unix.mcs.anl.gov/mpi/mpich2/
2. MPI-LAM. http://www.lam-mpi.org/
3. The Internet Backplane Protocol. http://loci.cs.utk.edu/ibp/
4. Vadhiyar, S. and Dongarra, J. SRS - A Framework for Developing
Malleable and Migratable Parallel Applications for Distributed Systems.
Parallel Processing Letters, Vol. 13, number 2, pp. 291-312, June 2003.
http://garl.serc.iisc.ernet.in/SRS/SRS.htm
5. Vadhiyar, S. and Dongarra, J. Performance Oriented Migration
Framework for the Grid. Proceedings of The 3rd IEEE/ACM International
Symposium on Cluster Computing and the Grid (CCGrid 2003), pp. 130-137,
May 2003, Tokyo, Japan.
http://garl.serc.iisc.ernet.in/SRS/SRS.htm
