MPI I/O is a set of standard MPI routines for performing file read and write operations; this document describes a number of its useful features. The MPI-2 standard defines these routines for transferring data to and from external storage, and they offer several advantages over traditional language I/O. The MPI libraries on the NERSC IBM SP contain all of the I/O routines defined in the MPI standard.
IBM SP: You must use one of the "thread-safe" compilers to use the MPI I/O routines. The "thread-safe" compilers have names that end in "_r", e.g. mpxlf90_r, mpxlf_r, mpcc_r, and mpCC_r.
An MPI file is an ordered list of MPI datatypes. These datatypes can be predefined, such as byte, real, or integer, or user-defined. The MPI I/O routines support sequential or random access to these types. A file is opened collectively by any subset of MPI processes, and any collective I/O operations on the file are collective over this subset.
A view defines what data are visible to each process. A view consists of a displacement, an etype (elementary type), and a filetype.
For example, imagine the file laid out in a linear array. Given a displacement, etype, and filetype:
```
displacement:  D
etype:         E
filetype:      H E E H H H
```
where "H" indicates a hole. Then by "tiling" the file with this filetype, a process has access to only the portions of the file marked with an "E":
```
D | H E E H H H | H E E H H H | H E E H H H | H E E H H H | ...
```
It is possible for different processes to define different filetypes, and hence different fileviews. For example, given three processes with the following filetypes:
```
process 0 filetype:  E H H H H H
process 1 filetype:  H E E H H H
process 2 filetype:  H H H E E E
```
Then the file is partitioned among the processes as follows:
```
D | 0 1 1 2 2 2 | 0 1 1 2 2 2 | 0 1 1 2 2 2 | ...
```
Using filetypes effectively is the key to efficient MPI I/O. Ideally, with judicious use of filetypes a single MPI read or write call is all that is needed to write out an entire data structure.
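As an illustration, the "H E E H H H" filetype in the diagram above could be built with the datatype constructor routines discussed later in this document. The following is a minimal sketch (the variable names are ours, and MPI_REAL8 is assumed as the etype): one block of two etypes is placed at a one-etype offset, and the type's extent is then resized with the MPI-2 routine MPI_TYPE_CREATE_RESIZED to six etypes, so that tiling the file leaves the trailing holes.

```fortran
integer blk, filetype, ierr
integer(kind=MPI_ADDRESS_KIND) lb, extent

! one block of 2 etypes, displaced 1 etype into the tile: (H) E E
call MPI_TYPE_INDEXED(1, (/2/), (/1/), MPI_REAL8, blk, ierr)

! resize the extent to 6 etypes so the tile ends with 3 holes: H E E H H H
call MPI_TYPE_GET_EXTENT(MPI_REAL8, lb, extent, ierr)
call MPI_TYPE_CREATE_RESIZED(blk, 0_MPI_ADDRESS_KIND, 6*extent, filetype, ierr)
call MPI_TYPE_COMMIT(filetype, ierr)
```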
We have a Fortran MPI application in which a real*8 vector is distributed in equal chunks over 4 processes. The file should contain the vector contiguously:
```
0 0 0 0 0 0   1 1 1 1 1 1   2 2 2 2 2 2   3 3 3 3 3 3
```
The data from each process forms a contiguous piece of the file, so there is little point in setting up a filetype that covers the entire file. We can use a displacement of zero bytes, an etype of MPI_REAL8, and a filetype of MPI_REAL8. The data access routines can then be used to position the data in the file.
A more complex example shows the benefit of using fileviews. Imagine an MPI program that uses a real*8 matrix distributed in regular blocked fashion over 4 processes arranged in a 2 by 2 grid. Each MPI process is identified by an (x,y) coordinate:
```
(0,0) (0,0) (0,0) (0,1) (0,1) (0,1)
(0,0) (0,0) (0,0) (0,1) (0,1) (0,1)
(0,0) (0,0) (0,0) (0,1) (0,1) (0,1)
(1,0) (1,0) (1,0) (1,1) (1,1) (1,1)
(1,0) (1,0) (1,0) (1,1) (1,1) (1,1)
(1,0) (1,0) (1,0) (1,1) (1,1) (1,1)
```
We want to store the data in column-major order in the file. We choose a displacement of zero bytes and an etype of MPI_REAL8. For process (0,0) we need a filetype that, tiled over the file, exposes the first three elements of each of the first three columns and leaves holes everywhere else. Constructing such a filetype by hand could be cumbersome, but MPI provides a routine, MPI_TYPE_CREATE_SUBARRAY, to construct a datatype which is just such a "chunk" of a multi-dimensional array. There are many different routines to create new datatypes, and usually the datatypes required for a fileview can be constructed without too much difficulty.
We define fileviews similarly for the other processes. At this point we can use any of the data access routines to write out the contiguous local data on each process, knowing that it will end up in the correct position in the file with one write call and no file-positioning calls. If we use the collective data access routines, small I/O requests issued on multiple processes can be automatically merged into larger, more efficient I/O requests.
```
MPI_FILE_OPEN(comm, filename, amode, info, fh, ierror)

IN  comm      Specifies the communicator (handle)
IN  filename  Specifies the name of the file to open (string)
IN  amode     Specifies the file access mode (integer)
IN  info      Specifies the information object that provides hints (handle)
OUT fh        Returns the file handle (handle)
OUT ierror    Return code (integer)
```
comm

The file is opened collectively over the communicator comm; all processes in comm must take part in the open. If independent access to a file is needed, MPI_COMM_SELF can be used.
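For example, a minimal sketch (the file name is assumed) of opening a file for independent access:

```fortran
integer fh, ierr

! each process opens the file independently of the others
call MPI_FILE_OPEN(MPI_COMM_SELF, 'mydata', MPI_MODE_RDONLY, &
     MPI_INFO_NULL, fh, ierr)
...
call MPI_FILE_CLOSE(fh, ierr)
```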
amode

amode describes how the file will be accessed. The most commonly used options are:

```
MPI_MODE_RDONLY   read only
MPI_MODE_RDWR     read and write
MPI_MODE_WRONLY   write only
MPI_MODE_CREATE   create the file if it does not exist
MPI_MODE_EXCL     raise an error if creating a file that already exists
```

Options can be combined with the IOR function in Fortran, as in the sketch below.
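For instance, a write-only mode that creates the file and fails if it already exists might be assembled as follows:

```fortran
integer filemode

! combine access-mode flags with the Fortran IOR intrinsic
filemode = IOR(MPI_MODE_WRONLY, IOR(MPI_MODE_CREATE, MPI_MODE_EXCL))
```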
info

The info object is new to MPI-2 and provides a mechanism to pass a package of options to a routine. An info object is created with the MPI_INFO_CREATE routine. MPI_INFO_NULL can be used if you do not want to set any options.
fh

fh returns the handle used to refer to the open file in all subsequent I/O calls.
```
MPI_FILE_CLOSE(fh, ierror)

INOUT fh      Specifies the file handle (handle)
OUT   ierror  Return code (integer)
```
```
MPI_FILE_SET_VIEW(fh, disp, etype, filetype, datarep, info, ierror)

IN  fh        Specifies the file handle (handle)
IN  disp      Specifies the displacement (integer(kind=MPI_OFFSET_KIND))
IN  etype     Specifies the elementary datatype (handle)
IN  filetype  Specifies the filetype (handle)
IN  datarep   Specifies the representation of data in the file (string)
IN  info      Provides information regarding file access patterns (handle)
OUT ierror    Return code (integer)
```
fh

fh is the file handle returned by MPI_FILE_OPEN.
disp

disp is specified as an absolute offset in bytes from the beginning of the file. Note that on the SP this is a non-default-sized integer, and you must declare it in Fortran as kind=MPI_OFFSET_KIND.
etype

etype is the elementary datatype: the unit of positioning and data access within the file.
filetype

filetype is either a single etype or a derived MPI datatype constructed from multiple instances of the same etype, or holes of the same extent as the etype. You can construct a filetype by using any of the MPI datatype constructor routines.
datarep

datarep is a string specifying the representation of the data in the file. The examples in this document use 'native', in which data is stored in the file exactly as it is in memory.
info

The info object is new to MPI-2 and provides a mechanism to pass a package of options to a routine. An info object is created with the MPI_INFO_CREATE routine. MPI_INFO_NULL can be used if you do not want to set any options.
We have an MPI application where a real*8 vector is distributed in equal chunks over 4 processes. The following diagram shows the mapping from process data to file data:
```
local data (process 0):  0 0 0 0 0 0

file:  0 0 0 0 0 0 H H H H H H H H H H H H H H H H H H
```
As mentioned above, the data from each process is contiguous in the file, so there is little point in setting up a filetype that covers the entire file. We can use a filetype of MPI_REAL8 and use the data access routines described in the next section to position the data in the file. The source code to open the file and set up the fileview might look like:
```fortran
integer filemode, finfo, fhv, ierr
integer(kind=MPI_OFFSET_KIND) disp
...
filemode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)

! create an empty info object
call MPI_INFO_CREATE(finfo, ierr)

call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', filemode, finfo, &
                   fhv, ierr)

! set the view with a displacement of 0, an etype of MPI_REAL8,
! and a filetype of MPI_REAL8
disp = 0
call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, MPI_REAL8, 'native', &
                       finfo, ierr)
...
```
The second example is more complicated. We have a two-dimensional array evenly distributed in both dimensions, to be stored in column-major order in the file. In the case of process (0,0), we have to map a contiguous local two-dimensional array to the file as follows:
```
local data (process (0,0)):

    1 4 7
    2 5 8
    3 6 9

file:

    1 2 3 H H H  4 5 6 H H H  7 8 9 H H H  H H H H H H  H H H H H H  H H H H H H
```
In this case it is worth setting up a filetype that covers the whole file, so that the local data can be written out with one call. We use a displacement of zero and an etype of MPI_REAL8; the filetype is defined using the routine MPI_TYPE_CREATE_SUBARRAY. The total matrix is n × n, and each process has an m × m block. We assume that we already have the x and y coordinates of the process within the grid; these could be obtained, for example, by using the MPI Cartesian topology routines.
```fortran
...
integer sizes(2), subsizes(2), starts(2), xcoord, ycoord
integer fileinfo, ierror, fh, filetype
integer(kind=MPI_OFFSET_KIND) :: disp

! this datatype describes the mapping of the local array
! to the global array (file)
sizes(1) = n
sizes(2) = n
subsizes(1) = m
subsizes(2) = m

! xcoord and ycoord are assumed to hold this process's (x,y) coordinates
! in the grid; the starting coordinates are zero-based, even in Fortran
starts(1) = m*xcoord
starts(2) = m*ycoord

call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
     MPI_ORDER_FORTRAN, MPI_REAL8, filetype, ierror)
call MPI_TYPE_COMMIT(filetype, ierror)

call MPI_INFO_CREATE(fileinfo, ierror)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'poisson.data', &
     IOR(MPI_MODE_RDWR, MPI_MODE_CREATE), fileinfo, fh, ierror)

disp = 0
call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
     MPI_INFO_NULL, ierror)
...
```
In all, there are 15 routines to read data and 15 to write data. They can be categorized by positioning (explicit offsets, individual filepointers, or shared filepointers), synchronism (blocking or nonblocking), and coordination (noncollective or collective). The following table summarizes the routines for reading data; corresponding routines for writing data exist, with names such as MPI_FILE_WRITE_AT.
| Positioning | Synchronism | Noncollective | Collective |
|---|---|---|---|
| Explicit offsets | Blocking | MPI_FILE_READ_AT | MPI_FILE_READ_AT_ALL |
| Explicit offsets | Nonblocking | MPI_FILE_IREAD_AT + MPI_WAIT | MPI_FILE_READ_AT_ALL_BEGIN / MPI_FILE_READ_AT_ALL_END |
| Individual filepointers | Blocking | MPI_FILE_READ | MPI_FILE_READ_ALL |
| Individual filepointers | Nonblocking | MPI_FILE_IREAD + MPI_WAIT | MPI_FILE_READ_ALL_BEGIN / MPI_FILE_READ_ALL_END |
| Shared filepointers | Blocking | MPI_FILE_READ_SHARED | MPI_FILE_READ_ORDERED |
| Shared filepointers | Nonblocking | MPI_FILE_IREAD_SHARED + MPI_WAIT | MPI_FILE_READ_ORDERED_BEGIN / MPI_FILE_READ_ORDERED_END |
```
MPI_FILE_READ_AT(fh, offset, buf, count, datatype, status, ierror)
MPI_FILE_READ_AT_ALL(fh, offset, buf, count, datatype, status, ierror)
MPI_FILE_WRITE_AT(fh, offset, buf, count, datatype, status, ierror)
MPI_FILE_WRITE_AT_ALL(fh, offset, buf, count, datatype, status, ierror)

IN  fh        Specifies the file handle (handle)
IN  offset    Specifies the file offset (integer(kind=MPI_OFFSET_KIND))
IN  buf       Specifies the initial address of the buffer (choice)
IN  count     Specifies the number of elements in the buffer (integer)
IN  datatype  Specifies the datatype of each buffer element (handle)
OUT status    Returns the status object (status)
OUT ierror    Return code (integer)
```
fh

fh specifies the file handle. For the collective versions (routine names ending in _ALL) the read or write is collective over all the processes that opened the file.
offset

offset is specified as a relative offset (there is an initial displacement from the beginning of the physical file set by the fileview), measured in units of the current fileview etype, and counting only those etypes made available for reading or writing. Note that on the SP this is a non-default-sized integer, and you must declare it kind=MPI_OFFSET_KIND in Fortran.
buf

buf specifies the buffer that supplies the data to be written or receives the data that is read.
count

count specifies the number of elements of type datatype to transfer to or from buf.

datatype

datatype specifies the MPI datatype of each element in the buffer.
status

status returns an MPI status object, which can be queried to show the amount of data read or written by using the MPI_GET_COUNT routine, as in the sketch below. MPI_STATUS_IGNORE can be used if this argument is not needed.
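A minimal sketch of such a query (fh, offset, buf, and m are assumed to be set up as in the examples in this document):

```fortran
integer status(MPI_STATUS_SIZE), nread, ierr

call MPI_FILE_READ_AT(fh, offset, buf, m, MPI_REAL8, status, ierr)

! nread returns the number of MPI_REAL8 elements actually transferred
call MPI_GET_COUNT(status, MPI_REAL8, nread, ierr)
```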
We extend the first example to include the code that performs the write. The offset argument to MPI_FILE_WRITE_AT is computed such that the local piece of the vector, buf, is written to the file at the correct place: the process rank multiplied by the length of the local data (the same for every process):
```fortran
integer me, filemode, finfo, fhv, ierr
integer status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) disp, offset
real(kind=8) buf(m)
...
filemode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
call MPI_INFO_CREATE(finfo, ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', filemode, finfo, &
                   fhv, ierr)
disp = 0
call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, MPI_REAL8, 'native', &
                       finfo, ierr)
...
! each process writes its chunk at rank*m etypes into the file
call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
offset = m*me
call MPI_FILE_WRITE_AT(fhv, offset, buf, m, MPI_REAL8, status, ierr)
```
The code to write out the data for the second example is shown below. Because we use a relatively complex filetype, which defines the layout correctly for each process, the call to write the data is identical for each MPI process.
```fortran
real(kind=8) a(m,m)
integer sizes(2), subsizes(2), starts(2), xcoord, ycoord
integer fileinfo, ierror, fh, filetype, status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) :: disp, offset

! this datatype describes the mapping of the local array
! to the global array (file)
sizes(1) = n
sizes(2) = n
subsizes(1) = m
subsizes(2) = m

! xcoord and ycoord are assumed to hold this process's (x,y) coordinates
! in the grid; the starting coordinates are zero-based, even in Fortran
starts(1) = m*xcoord
starts(2) = m*ycoord

call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
     MPI_ORDER_FORTRAN, MPI_REAL8, filetype, ierror)
call MPI_TYPE_COMMIT(filetype, ierror)

call MPI_INFO_CREATE(fileinfo, ierror)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', &
     IOR(MPI_MODE_RDWR, MPI_MODE_CREATE), fileinfo, fh, ierror)

disp = 0
call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
     MPI_INFO_NULL, ierror)

! the fileview places each block correctly, so every process issues
! the same collective write starting at offset 0
offset = 0
call MPI_FILE_WRITE_AT_ALL(fh, offset, a, m*m, MPI_REAL8, &
     status, ierror)
```
```
MPI_FILE_IREAD_AT(fh, offset, buf, count, datatype, request, ierror)
MPI_FILE_IREAD_AT_ALL(fh, offset, buf, count, datatype, request, ierror)
MPI_FILE_IWRITE_AT(fh, offset, buf, count, datatype, request, ierror)
MPI_FILE_IWRITE_AT_ALL(fh, offset, buf, count, datatype, request, ierror)

IN  fh        Specifies the file handle (handle)
IN  offset    Specifies the file offset (integer(kind=MPI_OFFSET_KIND))
IN  buf       Specifies the initial address of the buffer (choice)
IN  count     Specifies the number of elements in the buffer (integer)
IN  datatype  Specifies the datatype of each buffer element (handle)
OUT request   Returns the request object (handle)
OUT ierror    Return code (integer)
```
fh

fh specifies the file handle. For the collective versions (routine names ending in _ALL) the read or write is collective over all the processes that opened the file.
offset

offset is specified as a relative offset (there is an initial displacement from the beginning of the physical file set by the fileview), measured in units of the current fileview etype, and counting only those etypes made available for reading or writing. Note that on the SP this is a non-default-sized integer, and you must declare it kind=MPI_OFFSET_KIND in Fortran.
buf

buf specifies the buffer that supplies the data to be written or receives the data that is read.
count

count specifies the number of elements of type datatype to transfer to or from buf.

datatype

datatype specifies the MPI datatype of each element in the buffer.
request

request returns an MPI request object. This must be used in a call to MPI_WAIT or MPI_TEST to ensure that the data movement is complete.
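A minimal sketch of polling with MPI_TEST to overlap computation with the I/O; do_useful_work is a hypothetical placeholder for local computation:

```fortran
logical flag
integer status(MPI_STATUS_SIZE), ierr

flag = .false.
do while (.not. flag)
   call do_useful_work()                     ! hypothetical computation
   call MPI_TEST(request, flag, status, ierr)
end do
```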
The code for the nonblocking write is very similar to that for the blocking write.
```fortran
integer me, filemode, finfo, fhv, request, ierr
integer status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) disp, offset
real(kind=8) buf(m)
...
filemode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
call MPI_INFO_CREATE(finfo, ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', filemode, finfo, &
                   fhv, ierr)
disp = 0
call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, MPI_REAL8, 'native', &
                       finfo, ierr)
...
call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
offset = m*me
call MPI_FILE_IWRITE_AT(fhv, offset, buf, m, MPI_REAL8, request, ierr)
... ! computation can proceed while the write is in flight
call MPI_WAIT(request, status, ierr)
```
Similarly, for the second example, the changes needed to use nonblocking writes are very small.
```fortran
...
real(kind=8) a(m,m)
integer sizes(2), subsizes(2), starts(2), xcoord, ycoord
integer fileinfo, ierror, fh, filetype, request
integer status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) :: disp, offset

! this datatype describes the mapping of the local array
! to the global array (file)
sizes(1) = n
sizes(2) = n
subsizes(1) = m
subsizes(2) = m

! xcoord and ycoord are assumed to hold this process's (x,y) coordinates
! in the grid; the starting coordinates are zero-based, even in Fortran
starts(1) = m*xcoord
starts(2) = m*ycoord

call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
     MPI_ORDER_FORTRAN, MPI_REAL8, filetype, ierror)
call MPI_TYPE_COMMIT(filetype, ierror)

call MPI_INFO_CREATE(fileinfo, ierror)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', &
     IOR(MPI_MODE_RDWR, MPI_MODE_CREATE), fileinfo, fh, ierror)

disp = 0
call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
     MPI_INFO_NULL, ierror)

offset = 0
call MPI_FILE_IWRITE_AT_ALL(fh, offset, a, m*m, MPI_REAL8, &
     request, ierror)
... ! computation can proceed while the write is in flight
call MPI_WAIT(request, status, ierror)
```
MPI I/O provides individual filepointer routines, both blocking and non-blocking:
```
MPI_FILE_READ(fh, buf, count, datatype, status, ierror)
MPI_FILE_READ_ALL(fh, buf, count, datatype, status, ierror)
MPI_FILE_IREAD(fh, buf, count, datatype, request, ierror)
MPI_FILE_IREAD_ALL(fh, buf, count, datatype, request, ierror)
MPI_FILE_SEEK(fh, offset, whence, ierror)
MPI_FILE_GET_POSITION(fh, offset, ierror)
```
These routines work in a similar way to the explicit offset routines, except that each MPI process keeps a private filepointer for the file, which is updated after each read or write. The routines MPI_FILE_SEEK and MPI_FILE_GET_POSITION can be used to set and query the value of the filepointer, respectively, as in the sketch below. For documentation, see the Further Examples section.
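A minimal sketch (fh, buf, m, and me are assumed to be set up as in the earlier examples) of positioning the individual filepointer and reading:

```fortran
integer(kind=MPI_OFFSET_KIND) offset, pos
integer status(MPI_STATUS_SIZE), ierr

! position the private filepointer at this rank's chunk (in etypes)
offset = m*me
call MPI_FILE_SEEK(fh, offset, MPI_SEEK_SET, ierr)

call MPI_FILE_READ(fh, buf, m, MPI_REAL8, status, ierr)

! the filepointer has advanced; pos should now equal offset + m
call MPI_FILE_GET_POSITION(fh, pos, ierr)
```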
The most common MPI I/O routines have been discussed above, but there are a number of additional I/O related routines:
| Routine Name | Function |
|---|---|
| MPI_FILE_DELETE | Deletes a file |
| MPI_FILE_SET_SIZE | Resizes (truncates or extends) a file |
| MPI_FILE_GET_SIZE | Queries the size of a file |
| MPI_FILE_PREALLOCATE | Allocates storage for a file |
| MPI_FILE_GET_GROUP | Returns the group of MPI processes that opened the file |
| MPI_FILE_GET_AMODE | Returns the access mode for a file |
| MPI_FILE_SET_INFO | Sets implementation hints for a file |
| MPI_FILE_GET_INFO | Queries implementation hints for a file |
| MPI_FILE_GET_VIEW | Queries the fileview for a file |
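A minimal sketch of a few of these routines (the file handle fh is assumed to be open, and the file name passed to MPI_FILE_DELETE is hypothetical):

```fortran
integer(kind=MPI_OFFSET_KIND) fsize
integer famode, ierr

call MPI_FILE_GET_SIZE(fh, fsize, ierr)    ! current file size in bytes
call MPI_FILE_GET_AMODE(fh, famode, ierr)  ! access mode given at open

! delete a file by name (the file must not be open)
call MPI_FILE_DELETE('oldfile.dat', MPI_INFO_NULL, ierr)
```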
Often, to get good performance from MPI I/O, you need to define your own datatypes to create effective fileviews. This reduces the number of data access calls that need to be made. Creating datatypes requires familiarity with the MPI User-Defined Datatype routines. The following table shows the various routines:
| Routine Name | Purpose |
|---|---|
| MPI_TYPE_CONTIGUOUS | defines a contiguous block |
| MPI_TYPE_VECTOR | defines strided blocks |
| MPI_TYPE_HVECTOR | defines strided blocks, with the stride given in bytes |
| MPI_TYPE_INDEXED | defines strided blocks, with a different displacement per block |
| MPI_TYPE_HINDEXED | defines strided blocks, with a different displacement per block, in bytes |
| MPI_TYPE_STRUCT | defines a general datatype |
| MPI_TYPE_CREATE_DARRAY | defines a distributed multi-dimensional array |
| MPI_TYPE_CREATE_SUBARRAY | defines a chunk of a multi-dimensional array |
| MPI_TYPE_COMMIT | makes a type ready for use |
| MPI_TYPE_FREE | deletes a type |
| MPI_TYPE_EXTENT | returns the extent of a type in bytes |
| MPI_TYPE_SIZE | returns the size of a type in bytes |
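As a further illustration of these constructors, the sketch below (nprocs and me are assumed to hold the number of processes and this process's rank) builds a fileview that interleaves one real*8 element per process in round-robin order: the etype is resized so that each tile holds one visible element followed by nprocs-1 holes, and the byte displacement selects each rank's slot.

```fortran
integer filetype, ierr
integer(kind=MPI_ADDRESS_KIND) lb, ext
integer(kind=MPI_OFFSET_KIND) disp

call MPI_TYPE_GET_EXTENT(MPI_REAL8, lb, ext, ierr)

! one visible element per tile of nprocs elements
call MPI_TYPE_CREATE_RESIZED(MPI_REAL8, 0_MPI_ADDRESS_KIND, &
     nprocs*ext, filetype, ierr)
call MPI_TYPE_COMMIT(filetype, ierr)

! shift each rank to its own slot within the first tile (disp is in bytes)
disp = me*ext
call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
     MPI_INFO_NULL, ierr)
```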
Online descriptions of these routines can be found in the MPI Subroutine Reference. A more readable account can be found in Chapter 3 of MPI - The Complete Reference, Vol. 1.
The IBM SP implementation is documented in the MPI Subroutine Reference. The implementation accepts hints passed through an info object; these hints are documented in the description of MPI_FILE_OPEN. For example:
```fortran
integer finfo, ierr
...
call MPI_INFO_CREATE(finfo, ierr)
call MPI_INFO_SET(finfo, 'IBM_largeblock_io', 'true', ierr)
...
```