MPI I/O is a set of standard MPI routines for performing file read and write operations; this document describes a number of its useful features. The MPI-2 standard defines these routines for transferring data to and from external storage, and they offer several advantages over traditional language I/O. The MPI libraries on the NERSC IBM SP contain all of the I/O routines defined in the MPI standard.
IBM SP: You must use one of the "thread-safe" compilers to use the MPI I/O routines. The "thread-safe" compilers have names that end in "_r", e.g. mpxlf90_r, mpxlf_r, mpcc_r, and mpCC_r.
An MPI file is an ordered list of MPI datatypes. These datatypes can be predefined, such as byte, real, or integer, or user-defined. The MPI I/O routines support sequential or random access to these types. A file is opened collectively by any subset of MPI processes, and any collective I/O operations on the file are collective over this subset.
A view defines what data are visible to each process. A view consists of a displacement, an etype (elementary type), and a filetype.
For example, imagine the file laid out in a linear array. Given a displacement, etype, and filetype:
```
displacement:  D
etype:         E
filetype:      H E E H H H
```
where "H" indicates a hole. Then by "tiling" the file with this filetype, a process has access to only the portions of the file marked with an "E":
```
D | H E E H H H | H E E H H H | H E E H H H | H E E H H H | ...
```
It is possible for different processes to define different filetypes, and hence different fileviews. For example, given three processes with the following filetypes:
```
process 0 filetype:  E H H H H H
process 1 filetype:  H E E H H H
process 2 filetype:  H H H E E E
```
Then the file is partitioned among the processes as follows:
```
D | 0 1 1 2 2 2 | 0 1 1 2 2 2 | 0 1 1 2 2 2 | ...
```
Using filetypes effectively is the key to efficient MPI I/O. Ideally, with judicious use of filetypes a single MPI read or write call is all that is needed to write out an entire data structure.
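As an illustration, the "H E E H H H" filetype in the diagram above could be built with the datatype constructor routines discussed later in this document. The following is a minimal sketch (the variable names are ours, and MPI_REAL8 is assumed as the etype): one block of two etypes is placed at a one-etype offset, and the type's extent is then resized with the MPI-2 routine MPI_TYPE_CREATE_RESIZED to six etypes, so that tiling the file leaves the trailing holes.

```fortran
integer blk, filetype, ierr
integer(kind=MPI_ADDRESS_KIND) lb, extent

! one block of 2 etypes, displaced 1 etype into the tile: (H) E E
call MPI_TYPE_INDEXED(1, (/2/), (/1/), MPI_REAL8, blk, ierr)

! resize the extent to 6 etypes so the tile ends with 3 holes: H E E H H H
call MPI_TYPE_GET_EXTENT(MPI_REAL8, lb, extent, ierr)
call MPI_TYPE_CREATE_RESIZED(blk, 0_MPI_ADDRESS_KIND, 6*extent, filetype, ierr)
call MPI_TYPE_COMMIT(filetype, ierr)
```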
We have a Fortran MPI application in which a real*8 vector is distributed in equal chunks over 4 processes. The file should contain the vector contiguously:
```
0 0 0 0 0 0   1 1 1 1 1 1   2 2 2 2 2 2   3 3 3 3 3 3
```
The data from each process forms a contiguous piece of the file, so there is little point in setting up a filetype that covers the entire file. We can use a displacement of zero bytes, an etype of MPI_REAL8, and a filetype of MPI_REAL8. The data access routines can then be used to position the data in the file.
A more complex example shows the benefit of using fileviews. Imagine an MPI program that uses a real*8 matrix distributed in regular blocked fashion over 4 processes arranged in a 2 by 2 grid. Each MPI process is identified by an (x,y) coordinate:
```
(0,0) (0,0) (0,0) (0,1) (0,1) (0,1)
(0,0) (0,0) (0,0) (0,1) (0,1) (0,1)
(0,0) (0,0) (0,0) (0,1) (0,1) (0,1)
(1,0) (1,0) (1,0) (1,1) (1,1) (1,1)
(1,0) (1,0) (1,0) (1,1) (1,1) (1,1)
(1,0) (1,0) (1,0) (1,1) (1,1) (1,1)
```
We want to store the data in column-major order in the file. We choose a displacement of zero bytes and an etype of MPI_REAL8. For process (0,0) we need a filetype that, tiled over the file, exposes the first three elements of each of the first three columns and leaves holes everywhere else. Constructing such a filetype by hand could be cumbersome, but MPI provides a routine, MPI_TYPE_CREATE_SUBARRAY, to construct a datatype which is just such a "chunk" of a multi-dimensional array. There are many different routines to create new datatypes, and usually the datatypes required for a fileview can be constructed without too much difficulty.
We define fileviews similarly for the other processes. At this point we can use any of the data access routines to write out the contiguous local data on each process, knowing that it will end up in the correct position in the file with one write call and no file-positioning calls. If we use the collective data access routines, small I/O requests issued on multiple processes can be automatically merged into larger, more efficient I/O requests.
```
MPI_FILE_OPEN(comm, filename, amode, info, fh, ierror)

IN  comm      Specifies the communicator (handle)
IN  filename  Specifies the name of the file to open (string)
IN  amode     Specifies the file access mode (integer)
IN  info      Specifies the information object that provides hints (handle)
OUT fh        Returns the file handle (handle)
OUT ierror    Return code (integer)
```
comm

The file is opened collectively over the communicator comm; all processes in comm must take part in the open. If independent access to a file is needed, MPI_COMM_SELF can be used.
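For example, a minimal sketch (the file name is assumed) of opening a file for independent access:

```fortran
integer fh, ierr

! each process opens the file independently of the others
call MPI_FILE_OPEN(MPI_COMM_SELF, 'mydata', MPI_MODE_RDONLY, &
     MPI_INFO_NULL, fh, ierr)
...
call MPI_FILE_CLOSE(fh, ierr)
```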
amode

amode describes how the file will be accessed. The most commonly used options are:

```
MPI_MODE_RDONLY   read only
MPI_MODE_RDWR     read and write
MPI_MODE_WRONLY   write only
MPI_MODE_CREATE   create the file if it does not exist
MPI_MODE_EXCL     raise an error if creating a file that already exists
```

Options can be combined with the IOR function in Fortran, as in the sketch below.
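For instance, a write-only mode that creates the file and fails if it already exists might be assembled as follows:

```fortran
integer filemode

! combine access-mode flags with the Fortran IOR intrinsic
filemode = IOR(MPI_MODE_WRONLY, IOR(MPI_MODE_CREATE, MPI_MODE_EXCL))
```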
info

The info object is new to MPI-2 and provides a mechanism to pass a package of options to a routine. An info object is created with the MPI_INFO_CREATE routine. MPI_INFO_NULL can be used if you do not want to set any options.
fh

fh returns the handle used to refer to the open file in all subsequent I/O calls.
```
MPI_FILE_CLOSE(fh, ierror)

INOUT fh      Specifies the file handle (handle)
OUT   ierror  Return code (integer)
```
```
MPI_FILE_SET_VIEW(fh, disp, etype, filetype, datarep, info, ierror)

IN  fh        Specifies the file handle (handle)
IN  disp      Specifies the displacement (integer(kind=MPI_OFFSET_KIND))
IN  etype     Specifies the elementary datatype (handle)
IN  filetype  Specifies the filetype (handle)
IN  datarep   Specifies the representation of data in the file (string)
IN  info      Provides information regarding file access patterns (handle)
OUT ierror    Return code (integer)
```
fh

fh is the file handle returned by MPI_FILE_OPEN.
disp

disp is specified as an absolute offset in bytes from the beginning of the file. Note that on the SP this is a non-default-sized integer, and you must declare it in Fortran as kind=MPI_OFFSET_KIND.
etype

etype is the elementary datatype: the unit of positioning and data access within the file.
filetype

filetype is either a single etype or a derived MPI datatype constructed from multiple instances of the same etype, or holes of the same extent as the etype. You can construct a filetype by using any of the MPI datatype constructor routines.
datarep

datarep is a string specifying the representation of the data in the file. The examples in this document use 'native', in which data is stored in the file exactly as it is in memory.
info

The info object is new to MPI-2 and provides a mechanism to pass a package of options to a routine. An info object is created with the MPI_INFO_CREATE routine. MPI_INFO_NULL can be used if you do not want to set any options.
We have an MPI application where a real*8 vector is distributed in equal chunks over 4 processes. The following diagram shows the mapping from process data to file data:
```
local data (process 0):  0 0 0 0 0 0

file:  0 0 0 0 0 0 H H H H H H H H H H H H H H H H H H
```
As mentioned above, the data from each process is contiguous in the file, so there is little point in setting up a filetype that covers the entire file. We can use a filetype of MPI_REAL8 and use the data access routines described in the next section to position the data in the file. The source code to open the file and set up the fileview might look like:
```fortran
integer filemode, finfo, fhv, ierr
integer(kind=MPI_OFFSET_KIND) disp
...
filemode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)

! create an empty info object
call MPI_INFO_CREATE(finfo, ierr)

call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', filemode, finfo, &
                   fhv, ierr)

! set the view with a displacement of 0, an etype of MPI_REAL8,
! and a filetype of MPI_REAL8
disp = 0
call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, MPI_REAL8, 'native', &
                       finfo, ierr)
...
```
The second example is more complicated. We have a two-dimensional array evenly distributed in both dimensions, to be stored in column-major order in the file. In the case of process (0,0), we have to map a contiguous local two-dimensional array to the file as follows:
```
local data (process (0,0)):

    1 4 7
    2 5 8
    3 6 9

file:

    1 2 3 H H H  4 5 6 H H H  7 8 9 H H H  H H H H H H  H H H H H H  H H H H H H
```
In this case it is worth setting up a filetype that covers the whole file, so that the local data can be written out with one call. We use a displacement of zero and an etype of MPI_REAL8; the filetype is defined using the routine MPI_TYPE_CREATE_SUBARRAY. The total matrix is n × n, and each process has an m × m block. We assume that we already have the x and y coordinates of the process within the grid; these could be obtained, for example, by using the MPI Cartesian topology routines.
```fortran
...
integer sizes(2), subsizes(2), starts(2), xcoord, ycoord
integer fileinfo, ierror, fh, filetype
integer(kind=MPI_OFFSET_KIND) :: disp

! this datatype describes the mapping of the local array
! to the global array (file)
sizes(1) = n
sizes(2) = n
subsizes(1) = m
subsizes(2) = m

! xcoord and ycoord are assumed to hold this process's (x,y) coordinates
! in the grid; the starting coordinates are zero-based, even in Fortran
starts(1) = m*xcoord
starts(2) = m*ycoord

call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
     MPI_ORDER_FORTRAN, MPI_REAL8, filetype, ierror)
call MPI_TYPE_COMMIT(filetype, ierror)

call MPI_INFO_CREATE(fileinfo, ierror)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'poisson.data', &
     IOR(MPI_MODE_RDWR, MPI_MODE_CREATE), fileinfo, fh, ierror)

disp = 0
call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
     MPI_INFO_NULL, ierror)
...
```
In all, there are 15 routines to read data and 15 to write data. They can be categorized by positioning (explicit offsets, individual filepointers, or shared filepointers), synchronism (blocking or nonblocking), and coordination (noncollective or collective). The following table summarizes the routines for reading data; corresponding routines for writing data exist, with names such as MPI_FILE_WRITE_AT.
| Positioning | Synchronism | Noncollective | Collective |
|---|---|---|---|
| Explicit offsets | Blocking | MPI_FILE_READ_AT | MPI_FILE_READ_AT_ALL |
| Explicit offsets | Nonblocking | MPI_FILE_IREAD_AT + MPI_WAIT | MPI_FILE_READ_AT_ALL_BEGIN / MPI_FILE_READ_AT_ALL_END |
| Individual filepointers | Blocking | MPI_FILE_READ | MPI_FILE_READ_ALL |
| Individual filepointers | Nonblocking | MPI_FILE_IREAD + MPI_WAIT | MPI_FILE_READ_ALL_BEGIN / MPI_FILE_READ_ALL_END |
| Shared filepointers | Blocking | MPI_FILE_READ_SHARED | MPI_FILE_READ_ORDERED |
| Shared filepointers | Nonblocking | MPI_FILE_IREAD_SHARED + MPI_WAIT | MPI_FILE_READ_ORDERED_BEGIN / MPI_FILE_READ_ORDERED_END |
```
MPI_FILE_READ_AT(fh, offset, buf, count, datatype, status, ierror)
MPI_FILE_READ_AT_ALL(fh, offset, buf, count, datatype, status, ierror)
MPI_FILE_WRITE_AT(fh, offset, buf, count, datatype, status, ierror)
MPI_FILE_WRITE_AT_ALL(fh, offset, buf, count, datatype, status, ierror)

IN  fh        Specifies the file handle (handle)
IN  offset    Specifies the file offset (integer(kind=MPI_OFFSET_KIND))
IN  buf       Specifies the initial address of the buffer (choice)
IN  count     Specifies the number of elements in the buffer (integer)
IN  datatype  Specifies the datatype of each buffer element (handle)
OUT status    Returns the status object (status)
OUT ierror    Return code (integer)
```
fh

fh specifies the file handle. For the collective versions (routine names ending in _ALL) the read or write is collective over all the processes that opened the file.
offset

offset is specified as a relative offset (there is an initial displacement from the beginning of the physical file set by the fileview), measured in units of the current fileview etype, and counting only those etypes made available for reading or writing. Note that on the SP this is a non-default-sized integer, and you must declare it kind=MPI_OFFSET_KIND in Fortran.
buf

buf specifies the buffer that supplies the data to be written or receives the data that is read.
count

count specifies the number of elements of type datatype to transfer to or from buf.

datatype

datatype specifies the MPI datatype of each element in the buffer.
status

status returns an MPI status object, which can be queried to show the amount of data read or written by using the MPI_GET_COUNT routine, as in the sketch below. MPI_STATUS_IGNORE can be used if this argument is not needed.
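A minimal sketch of such a query (fh, offset, buf, and m are assumed to be set up as in the examples in this document):

```fortran
integer status(MPI_STATUS_SIZE), nread, ierr

call MPI_FILE_READ_AT(fh, offset, buf, m, MPI_REAL8, status, ierr)

! nread returns the number of MPI_REAL8 elements actually transferred
call MPI_GET_COUNT(status, MPI_REAL8, nread, ierr)
```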
We extend the first example to include the code that performs the write. The offset argument to MPI_FILE_WRITE_AT is computed such that the local piece of the vector, buf, is written to the file at the correct place: the process rank multiplied by the length of the local data (the same for every process):
```fortran
integer me, filemode, finfo, fhv, ierr
integer status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) disp, offset
real(kind=8) buf(m)
...
filemode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
call MPI_INFO_CREATE(finfo, ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', filemode, finfo, &
                   fhv, ierr)
disp = 0
call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, MPI_REAL8, 'native', &
                       finfo, ierr)
...
! each process writes its chunk at rank*m etypes into the file
call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
offset = m*me
call MPI_FILE_WRITE_AT(fhv, offset, buf, m, MPI_REAL8, status, ierr)
```
The code to write out the data for the second example is shown below. Because we use a relatively complex filetype, which defines the layout correctly for each process, the call to write the data is identical for each MPI process.
```fortran
real(kind=8) a(m,m)
integer sizes(2), subsizes(2), starts(2), xcoord, ycoord
integer fileinfo, ierror, fh, filetype, status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) :: disp, offset

! this datatype describes the mapping of the local array
! to the global array (file)
sizes(1) = n
sizes(2) = n
subsizes(1) = m
subsizes(2) = m

! xcoord and ycoord are assumed to hold this process's (x,y) coordinates
! in the grid; the starting coordinates are zero-based, even in Fortran
starts(1) = m*xcoord
starts(2) = m*ycoord

call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
     MPI_ORDER_FORTRAN, MPI_REAL8, filetype, ierror)
call MPI_TYPE_COMMIT(filetype, ierror)

call MPI_INFO_CREATE(fileinfo, ierror)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', &
     IOR(MPI_MODE_RDWR, MPI_MODE_CREATE), fileinfo, fh, ierror)

disp = 0
call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
     MPI_INFO_NULL, ierror)

! the fileview places each block correctly, so every process issues
! the same collective write starting at offset 0
offset = 0
call MPI_FILE_WRITE_AT_ALL(fh, offset, a, m*m, MPI_REAL8, &
     status, ierror)
```
```
MPI_FILE_IREAD_AT(fh, offset, buf, count, datatype, request, ierror)
MPI_FILE_IREAD_AT_ALL(fh, offset, buf, count, datatype, request, ierror)
MPI_FILE_IWRITE_AT(fh, offset, buf, count, datatype, request, ierror)
MPI_FILE_IWRITE_AT_ALL(fh, offset, buf, count, datatype, request, ierror)

IN  fh        Specifies the file handle (handle)
IN  offset    Specifies the file offset (integer(kind=MPI_OFFSET_KIND))
IN  buf       Specifies the initial address of the buffer (choice)
IN  count     Specifies the number of elements in the buffer (integer)
IN  datatype  Specifies the datatype of each buffer element (handle)
OUT request   Returns the request object (handle)
OUT ierror    Return code (integer)
```
fh

fh specifies the file handle. For the collective versions (routine names ending in _ALL) the read or write is collective over all the processes that opened the file.
offset

offset is specified as a relative offset (there is an initial displacement from the beginning of the physical file set by the fileview), measured in units of the current fileview etype, and counting only those etypes made available for reading or writing. Note that on the SP this is a non-default-sized integer, and you must declare it kind=MPI_OFFSET_KIND in Fortran.
buf

buf specifies the buffer that supplies the data to be written or receives the data that is read.
count

count specifies the number of elements of type datatype to transfer to or from buf.

datatype

datatype specifies the MPI datatype of each element in the buffer.
request

request returns an MPI request object. This must be used in a call to MPI_WAIT or MPI_TEST to ensure that the data movement is complete.
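A minimal sketch of polling with MPI_TEST to overlap computation with the I/O; do_useful_work is a hypothetical placeholder for local computation:

```fortran
logical flag
integer status(MPI_STATUS_SIZE), ierr

flag = .false.
do while (.not. flag)
   call do_useful_work()                     ! hypothetical computation
   call MPI_TEST(request, flag, status, ierr)
end do
```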
The code for the nonblocking write is very similar to that for the blocking write.
```fortran
integer me, filemode, finfo, fhv, request, ierr
integer status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) disp, offset
real(kind=8) buf(m)
...
filemode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
call MPI_INFO_CREATE(finfo, ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', filemode, finfo, &
                   fhv, ierr)
disp = 0
call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, MPI_REAL8, 'native', &
                       finfo, ierr)
...
call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
offset = m*me
call MPI_FILE_IWRITE_AT(fhv, offset, buf, m, MPI_REAL8, request, ierr)
... ! computation can proceed while the write is in flight
call MPI_WAIT(request, status, ierr)
```
Similarly, for the second example, the changes needed to use nonblocking writes are very small.
```fortran
...
real(kind=8) a(m,m)
integer sizes(2), subsizes(2), starts(2), xcoord, ycoord
integer fileinfo, ierror, fh, filetype, request
integer status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) :: disp, offset

! this datatype describes the mapping of the local array
! to the global array (file)
sizes(1) = n
sizes(2) = n
subsizes(1) = m
subsizes(2) = m

! xcoord and ycoord are assumed to hold this process's (x,y) coordinates
! in the grid; the starting coordinates are zero-based, even in Fortran
starts(1) = m*xcoord
starts(2) = m*ycoord

call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
     MPI_ORDER_FORTRAN, MPI_REAL8, filetype, ierror)
call MPI_TYPE_COMMIT(filetype, ierror)

call MPI_INFO_CREATE(fileinfo, ierror)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', &
     IOR(MPI_MODE_RDWR, MPI_MODE_CREATE), fileinfo, fh, ierror)

disp = 0
call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
     MPI_INFO_NULL, ierror)

offset = 0
call MPI_FILE_IWRITE_AT_ALL(fh, offset, a, m*m, MPI_REAL8, &
     request, ierror)
... ! computation can proceed while the write is in flight
call MPI_WAIT(request, status, ierror)
```
MPI I/O provides individual filepointer routines, both blocking and non-blocking:
```
MPI_FILE_READ(fh, buf, count, datatype, status, ierror)
MPI_FILE_READ_ALL(fh, buf, count, datatype, status, ierror)
MPI_FILE_IREAD(fh, buf, count, datatype, request, ierror)
MPI_FILE_IREAD_ALL(fh, buf, count, datatype, request, ierror)
MPI_FILE_SEEK(fh, offset, whence, ierror)
MPI_FILE_GET_POSITION(fh, offset, ierror)
```
These routines work in a similar way to the explicit offset routines, except that each MPI process keeps a private filepointer for the file, which is updated after each read or write. The routines MPI_FILE_SEEK and MPI_FILE_GET_POSITION can be used to set and query the value of the filepointer, respectively, as in the sketch below. For documentation, see the Further Examples section.
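A minimal sketch (fh, buf, m, and me are assumed to be set up as in the earlier examples) of positioning the individual filepointer and reading:

```fortran
integer(kind=MPI_OFFSET_KIND) offset, pos
integer status(MPI_STATUS_SIZE), ierr

! position the private filepointer at this rank's chunk (in etypes)
offset = m*me
call MPI_FILE_SEEK(fh, offset, MPI_SEEK_SET, ierr)

call MPI_FILE_READ(fh, buf, m, MPI_REAL8, status, ierr)

! the filepointer has advanced; pos should now equal offset + m
call MPI_FILE_GET_POSITION(fh, pos, ierr)
```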
The most common MPI I/O routines have been discussed above, but there are a number of additional I/O related routines:
| Routine Name | Function |
|---|---|
| MPI_FILE_DELETE | Deletes a file |
| MPI_FILE_SET_SIZE | Resizes (truncates or extends) a file |
| MPI_FILE_GET_SIZE | Queries the size of a file |
| MPI_FILE_PREALLOCATE | Allocates storage for a file |
| MPI_FILE_GET_GROUP | Returns the group of MPI processes that opened the file |
| MPI_FILE_GET_AMODE | Returns the access mode for a file |
| MPI_FILE_SET_INFO | Sets implementation hints for a file |
| MPI_FILE_GET_INFO | Queries implementation hints for a file |
| MPI_FILE_GET_VIEW | Queries the fileview for a file |
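A minimal sketch of a few of these routines (the file handle fh is assumed to be open, and the file name passed to MPI_FILE_DELETE is hypothetical):

```fortran
integer(kind=MPI_OFFSET_KIND) fsize
integer famode, ierr

call MPI_FILE_GET_SIZE(fh, fsize, ierr)    ! current file size in bytes
call MPI_FILE_GET_AMODE(fh, famode, ierr)  ! access mode given at open

! delete a file by name (the file must not be open)
call MPI_FILE_DELETE('oldfile.dat', MPI_INFO_NULL, ierr)
```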
Often, to get good performance from MPI I/O, you need to define your own datatypes to create effective fileviews. This reduces the number of data access calls that need to be made. Creating datatypes requires familiarity with the MPI User-Defined Datatype routines. The following table shows the various routines:
| Routine Name | Purpose |
|---|---|
| MPI_TYPE_CONTIGUOUS | defines a contiguous block |
| MPI_TYPE_VECTOR | defines strided blocks |
| MPI_TYPE_HVECTOR | defines strided blocks, with the stride given in bytes |
| MPI_TYPE_INDEXED | defines strided blocks, with a different displacement per block |
| MPI_TYPE_HINDEXED | defines strided blocks, with a different displacement per block, in bytes |
| MPI_TYPE_STRUCT | defines a general datatype |
| MPI_TYPE_CREATE_DARRAY | defines a distributed multi-dimensional array |
| MPI_TYPE_CREATE_SUBARRAY | defines a chunk of a multi-dimensional array |
| MPI_TYPE_COMMIT | makes a type ready for use |
| MPI_TYPE_FREE | deletes a type |
| MPI_TYPE_EXTENT | returns the extent of a type in bytes |
| MPI_TYPE_SIZE | returns the size of a type in bytes |
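As a further illustration of these constructors, the sketch below (nprocs and me are assumed to hold the number of processes and this process's rank) builds a fileview that interleaves one real*8 element per process in round-robin order: the etype is resized so that each tile holds one visible element followed by nprocs-1 holes, and the byte displacement selects each rank's slot.

```fortran
integer filetype, ierr
integer(kind=MPI_ADDRESS_KIND) lb, ext
integer(kind=MPI_OFFSET_KIND) disp

call MPI_TYPE_GET_EXTENT(MPI_REAL8, lb, ext, ierr)

! one visible element per tile of nprocs elements
call MPI_TYPE_CREATE_RESIZED(MPI_REAL8, 0_MPI_ADDRESS_KIND, &
     nprocs*ext, filetype, ierr)
call MPI_TYPE_COMMIT(filetype, ierr)

! shift each rank to its own slot within the first tile (disp is in bytes)
disp = me*ext
call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
     MPI_INFO_NULL, ierr)
```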
Online descriptions of these routines can be found in the MPI Subroutine Reference. A more readable account can be found in Chapter 3 of MPI - The Complete Reference, Vol. 1.
The IBM SP implementation is documented in the MPI Subroutine Reference. The implementation accepts hints passed through an info object; these hints are documented in the description of MPI_FILE_OPEN. For example:
```fortran
integer finfo, ierr
...
call MPI_INFO_CREATE(finfo, ierr)
call MPI_INFO_SET(finfo, 'IBM_largeblock_io', 'true', ierr)
...
```