TIL MPI_Comm_spawn
(This post assumes prior experience with MPI).
I read a blog post some time ago where the author talked about how he wanted to write more blog posts. One of his strategies was to take some new thing he had learned, even if it was something as small as a simple language or library feature, and write about it as a ‘Today I Learned (TIL) xyz’ post. I decided to take a page from his book and write a little bit about a feature of MPI I learned about, called MPI_Comm_spawn.
Learning MPI (Message Passing Interface) is a necessity if you’re working with supercomputers, since the vast majority of the code that runs on these systems uses the MPI library to distribute computations across the thousands of nodes that make up any particular supercomputer. If you’re interested, there’s a beginner tutorial here.
Before I explain MPI_Comm_spawn itself, I need to talk about spawning processes in general. You may already be familiar with the pattern of using fork and exec to spawn a new process from the current program. If you’re not, this Wikipedia page has a good summary of it. Essentially, you can make your program clone itself at a particular point, creating the “child” process, and then replace the clone with a completely different program, erasing any trace of the original. It’s also possible to establish communication with the child process through some inter-process communication method, even though the child is now running a new program. This is generally how a shell works, and it’s also how Python’s subprocess module works. The example below forks the process, execs the ls program, and returns the output, with all of this handled internally by the subprocess module for our convenience.
import subprocess

# run "ls -al", capturing stdout/stderr instead of letting them print directly
output = subprocess.run(["ls", "-al"], capture_output=True)
print(output)
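Since the rest of this post is in C, here’s a minimal sketch of the same pattern done by hand (my own illustration, not taken from any tutorial): fork a child, exec ls in it, and read its output back through a pipe.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
  int fds[2]; // fds[0]: read end, fds[1]: write end
  if (pipe(fds) == -1) {
    perror("pipe");
    return EXIT_FAILURE;
  }

  pid_t pid = fork();
  if (pid == 0) {
    // child: point stdout at the pipe, then replace ourselves with ls
    dup2(fds[1], STDOUT_FILENO);
    close(fds[0]);
    close(fds[1]);
    execlp("ls", "ls", "-al", (char *)NULL);
    perror("execlp"); // only reached if the exec fails
    _exit(EXIT_FAILURE);
  }

  // parent: close the write end, then read whatever the child prints
  close(fds[1]);
  char buf[4096];
  ssize_t n;
  while ((n = read(fds[0], buf, sizeof(buf))) > 0) {
    fwrite(buf, 1, (size_t)n, stdout);
  }
  close(fds[0]);
  waitpid(pid, NULL, 0); // reap the child
  return EXIT_SUCCESS;
}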
There’s a pretty good tutorial explaining fork and exec in a lot more detail, including how to communicate between the parent and child processes, with examples and exercises to help you get a better understanding. For our purposes, what I’m trying to establish here is that MPI_Comm_spawn operates on the same principles, but with all the MPI infrastructure available to it.
MPI_Comm_spawn lets a set of MPI ranks spawn a new, separate set of MPI ranks as children. (The call is collective: every rank in the parent communicator makes it, with one designated “root” rank whose arguments control the spawn.) These children are all started as peers to each other and are able to talk to each other through their own MPI_COMM_WORLD communicator, unconnected to the parent’s MPI_COMM_WORLD.
Here’s what an MPI_Comm_spawn call looks like:

MPI_Comm_spawn(
    argv[0],        // program to spawn (argv[0] spawns another copy of the
                    // currently running program; this can also be the file
                    // name of a different executable)
    MPI_ARGV_NULL,  // the spawned program's args (none here)
    5,              // number of child processes to spawn
    MPI_INFO_NULL,  // an MPI_Info object that can carry extra hints, such
                    // as where to place the children
    parent_rank,    // the "root": the rank whose arguments are actually
                    // used for the spawn. Every rank in comm makes this
                    // call, but only the root's arguments matter
    comm,           // the communicator containing parent_rank that is made
                    // visible to the spawned children. This one is an
                    // intracommunicator.
    &child_comm,    // the communicator that the parent and its peers can
                    // use to send and receive messages from the children.
                    // Populated when MPI_Comm_spawn returns; this one is an
                    // intercommunicator.
    spawn_errors    // array of errcodes, with one entry per child we will
                    // spawn. Each entry is set to MPI_SUCCESS or an error
                    // code depending on whether that particular child
                    // spawned correctly
);
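A note on that last argument: because it’s an array with one entry per child, you can check each child individually after the call returns. A small sketch, assuming the spawn_errors array and the count of 5 from the call above:

for (int i = 0; i < 5; i++) {
  if (spawn_errors[i] != MPI_SUCCESS) {
    fprintf(stderr, "child %d failed to spawn\n", i);
  }
}

If you don’t care about per-child error codes, you can pass MPI_ERRCODES_IGNORE instead of an array.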
In each child spawned, you can call MPI_Comm_get_parent, which returns the child’s side of the intercommunicator we got back from the MPI_Comm_spawn call in the parent. This allows any of the children to send and receive messages to the original parent rank and all the parent’s peers in that communicator by specifying their rank. Below is a full example of MPI_Comm_spawn, with messages being sent and received between the children and the parent (and its peers).
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  MPI_Comm parent_comm;
  int rank;
  MPI_Comm_get_parent(&parent_comm);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int size;
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (parent_comm == MPI_COMM_NULL) {
    // We have no parent communicator, so we were launched directly by the
    // user
    if (rank == 0) {
      printf("Initial world size: %d\n", size);
    }
    MPI_Comm child_comm;
    int spawn_errors[5]; // one error code per child we are going to spawn
    printf("We are processes spawned directly by you. I am rank %d.\n", rank);
    int parent_rank = 2;
    MPI_Comm_spawn(
        argv[0],        // program to spawn (argv[0] spawns another copy of
                        // the currently running program; this can also be
                        // the file name of a different executable)
        MPI_ARGV_NULL,  // the spawned program's args (none here)
        5,              // number of child processes to spawn
        MPI_INFO_NULL,  // an MPI_Info object that can carry extra hints,
                        // such as where to place the children
        parent_rank,    // the "root": the rank whose arguments are actually
                        // used for the spawn. Every rank in comm makes this
                        // call, but only the root's arguments matter
        comm,           // the communicator containing parent_rank that is
                        // made visible to the spawned children (an
                        // intracommunicator)
        &child_comm,    // the communicator the parents can use to talk to
                        // the children, populated when MPI_Comm_spawn
                        // returns (an intercommunicator)
        spawn_errors    // array of errcodes, one per child; each entry is
                        // MPI_SUCCESS or an error code depending on whether
                        // that particular child spawned correctly
    );
    if (rank == 3) {
      // Receive a message from child rank 1 via the child communicator
      int hello_from_child = 0;
      int child_sender = 1;
      MPI_Recv(&hello_from_child, 1, MPI_INT, child_sender, 0, child_comm,
               MPI_STATUS_IGNORE);
      printf("number sent from child: %d\n", hello_from_child);
      // Send a message to child rank 2 via the child communicator
      int hello_from_parent = 56;
      int child_receiver = 2;
      MPI_Send(&hello_from_parent, 1, MPI_INT, child_receiver, 0, child_comm);
    }
  } else {
    // We are children spawned from an MPI process
    if (rank == 0) {
      printf("child world size: %d\n", size);
    }
    printf("I am a child from MPI_Comm_spawn. I am rank %d.\n", rank);
    int child_sender = 1;
    int hello_from_child = 22;
    if (rank == child_sender) {
      // Send a message from child rank 1 to rank 3 in the parent's
      // communicator
      MPI_Send(&hello_from_child, 1, MPI_INT, 3, 0, parent_comm);
      printf("among children: Sent number from child to parent\n");
    }
    // children communicating with each other
    int child_to_child_number = 0;
    if (rank == 0) {
      // Send a message from child rank 0 to child rank 1 via the spawned
      // processes' own MPI_COMM_WORLD
      printf("among children: sending from rank 0 to 1\n");
      child_to_child_number = 73;
      MPI_Send(&child_to_child_number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    if (rank == 1) {
      // Receive the message from child rank 0 via the spawned processes'
      // own MPI_COMM_WORLD
      MPI_Recv(&child_to_child_number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      printf("among children: received from rank 0: %d\n",
             child_to_child_number);
    }
    if (rank == 2) {
      // Receive the message from rank 3 in the parent's communicator
      int hello_from_parent = 0;
      int parent_sender = 3;
      MPI_Recv(&hello_from_parent, 1, MPI_INT, parent_sender, 0, parent_comm,
               MPI_STATUS_IGNORE);
      printf("among children: number received from parent: %d\n",
             hello_from_parent);
    }
  }

  MPI_Finalize();
  return EXIT_SUCCESS;
}
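To build it, I compiled with the mpicc wrapper (assuming OpenMPI is installed and the file is saved as example.c):
$ mpicc example.c -o example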
The output would look something like the below. I’ve cleaned up the output for clarity; the actual output lines would be all mixed together, since the various ranks, including the spawned ones, all run in parallel:
$ mpirun -np 4 ./example
Initial world size: 4
We are processes spawned directly by you. I am rank 0.
We are processes spawned directly by you. I am rank 2.
We are processes spawned directly by you. I am rank 3.
We are processes spawned directly by you. I am rank 1.
child world size: 5
I am a child from MPI_Comm_spawn. I am rank 2.
I am a child from MPI_Comm_spawn. I am rank 3.
I am a child from MPI_Comm_spawn. I am rank 4.
I am a child from MPI_Comm_spawn. I am rank 1.
I am a child from MPI_Comm_spawn. I am rank 0.
among children: Sent number from child to parent
number sent from child: 22
among children: sending from rank 0 to 1
among children: received from rank 0: 73
among children: number received from parent: 56
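One related call I came across while reading about this (not used in the example above): if addressing the two groups through the intercommunicator gets awkward, MPI_Intercomm_merge can fuse both sides into a single ordinary intracommunicator. A sketch of how that would look with the setup above, called on both the parent and child sides:

MPI_Comm merged;
if (parent_comm == MPI_COMM_NULL) {
  // parent side: merge our group with the children's; high = 0 places
  // the parents before the children in the new rank ordering
  MPI_Intercomm_merge(child_comm, 0, &merged);
} else {
  // child side: same call on the intercommunicator from MPI_Comm_get_parent
  MPI_Intercomm_merge(parent_comm, 1, &merged);
}
// merged now contains all 4 parents and all 5 children, and the usual
// MPI_Send/MPI_Recv by rank works within it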
MPI_Comm_spawn may be useful in cases where you have a different MPI program that you want to quickly spawn off to perform some task, without needing to integrate it into your current MPI code (I sketch an example of this below). I can’t really point to any real-world use cases, unfortunately, because at my workplace we use job launchers like IBM’s jsrun and SLURM’s srun on the supercomputers, and those don’t support MPI_Comm_spawn (you’ll just get an incomprehensible runtime error if you try to launch the above program with srun or jsrun). My examples worked with OpenMPI’s mpirun launcher.
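To make that use case concrete, here’s a sketch of spawning a separate executable with its own arguments. The ./worker path and its flags are hypothetical names for illustration, and this assumes a launcher like mpirun that supports spawning:

// argv for the children is NULL-terminated and, unlike exec, does not
// include the program name itself
char *worker_args[] = {"--input", "data.bin", NULL};
MPI_Comm worker_comm;
MPI_Comm_spawn("./worker", worker_args, 5, MPI_INFO_NULL,
               0, MPI_COMM_WORLD, &worker_comm, MPI_ERRCODES_IGNORE);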
I went down this rabbit hole of MPI_Comm_spawn because I was trying to figure out a weird bug in MPI4Py (a Python interface for MPI with support for various MPI library implementations, plus some additional Pythonic features), and my initial hypothesis was that it was using MPI_Comm_spawn somehow. I’m working on a blog post about that particular saga in more detail (in fact, this post was broken off from a section I was writing in that post, because I figured that one was long enough as it is and this would’ve been a long tangential section). I can tell you the issue was not MPI_Comm_spawn in that case; MPI4Py doesn’t use this feature at all. But it was pretty neat to at least learn about all this.
Here’s some additional documentation: the OpenMPI documentation on MPI_Comm_spawn, and a RookieHPC page that’s a little more digestible.