Introduction to MPI I/O

MPI I/O refers to a set of standard MPI routines for performing file read and write operations. MPI I/O has a number of useful features, which are described in this document. The MPI libraries on the NERSC IBM SP contain all of the routines defined in the MPI standard.
Introduction

The MPI-2 standard defines a set of routines for transferring data to and from external storage. It offers a number of advantages over traditional language I/O, including collective operations, fileviews that support noncontiguous access patterns, and nonblocking I/O.
Compiling Code with MPI I/O Routines

The MPI I/O routines are part of the standard MPI libraries on the IBM SP. To use them you must compile with one of the "thread-safe" compilers, whose names end in "_r": mpxlf90_r, mpxlf_r, mpcc_r, and mpCC_r.

Fundamental Concepts

An MPI file is an ordered collection of MPI datatypes. These datatypes can be predefined (such as byte, real, or integer) or user-defined. The MPI I/O routines support both sequential and random access to these types. A file is opened collectively by a group of MPI processes, and any collective I/O operations on the file are collective over this group.

A view defines what data in the file are visible to each process. A view consists of a displacement, an etype (elementary type), and a filetype.
For example, imagine the file laid out in a linear array. Given a displacement, etype, and filetype:
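For instance (an illustrative layout), take a displacement of zero, an etype of one data element, and a filetype consisting of one element followed by two holes:

  etype:     E
  filetype:  E H H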
where "H" indicates a hole. Then by "tiling" the file with this filetype, a process has access to only the portions of the file marked with an "E":
It is possible for different processes to define different filetypes, and hence different fileviews. For example, given three processes with the following filetypes:
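For instance, each filetype might select a different element out of every three (digits mark the elements visible to each process):

  process 0 filetype:  0 H H
  process 1 filetype:  H 1 H
  process 2 filetype:  H H 2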
Then the file is partitioned among the processes as follows:
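  file:  0 1 2 0 1 2 0 1 2 ...

Each process reads or writes only its own elements, even though all three share the same file.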
Using filetypes effectively is the key to efficient MPI I/O. Ideally, with judicious use of filetypes, a single MPI read or write call is all that is needed to transfer an entire data structure.

Examples

We have a Fortran MPI application where a real*8 vector is distributed in equal chunks over 4 processes. The file should contain the vector contiguously:
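Schematically, with each process holding a chunk of m elements:

  process:     0          1          2          3
  local data:  [m reals]  [m reals]  [m reals]  [m reals]
  file:        [m reals][m reals][m reals][m reals]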
The data from each process forms a contiguous piece of the file, so there is little point in setting up a filetype which covers the entire file. We can use a displacement of zero bytes, an etype of MPI_REAL8, and a filetype of MPI_REAL8.

A more complex example shows the benefit of using fileviews. Imagine an MPI program that uses a real*8 matrix distributed in regular blocked fashion over 4 processes arranged in a 2 by 2 grid. Each MPI process is identified by an (x,y) coordinate:
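For example, with x as the row coordinate and y as the column coordinate:

  (0,0)  (0,1)
  (1,0)  (1,1)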
We want to store the data in column-major order in a file. We choose a displacement of zero bytes and an etype of MPI_REAL8. The filetype must pick out exactly those blocks of the file that belong to each process; such a datatype would be tedious to construct by hand, but MPI provides a routine, MPI_TYPE_CREATE_SUBARRAY, that builds it directly (see the second example below).

Basic File Manipulation Routines

Opening a File

MPI_FILE_OPEN(comm, filename, amode, info, fh, ierror)

  IN  comm      Specifies the communicator (handle)
  IN  filename  Specifies the name of the file to open (string)
  IN  amode     Specifies the file access mode (integer)
  IN  info      Specifies the information object that provides hints (handle)
  OUT fh        Returns the file handle (handle)
  OUT ierror    Return code (integer)
Closing a File

MPI_FILE_CLOSE(fh, ierror)

  IN  fh      Specifies the file handle (handle)
  OUT ierror  Return code (integer)
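For example, a minimal sketch that collectively opens a file read-only and closes it again ('datafile' is a placeholder name):

      integer fh, ierr
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'datafile', MPI_MODE_RDONLY, &
                         MPI_INFO_NULL, fh, ierr)
      ! ... read from the file ...
      call MPI_FILE_CLOSE(fh, ierr)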
Creating a Fileview

MPI_FILE_SET_VIEW(fh, disp, etype, filetype, datarep, info, ierror)

  IN  fh        Specifies the file handle (handle)
  IN  disp      Specifies the displacement (integer(kind=MPI_OFFSET_KIND))
  IN  etype     Specifies the elementary datatype (handle)
  IN  filetype  Specifies the filetype (handle)
  IN  datarep   Specifies the representation of data in the file (string)
  IN  info      Provides information regarding file access patterns (handle)
  OUT ierror    Return code (integer)
Examples

We have an MPI application where a real*8 vector is distributed in equal chunks over 4 processes. The following diagram shows the mapping from process data to file data:
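  local data:  rank 0     rank 1     rank 2     rank 3
               [m reals]  [m reals]  [m reals]  [m reals]
                   |          |          |          |
                   v          v          v          v
  file:        [m reals][m reals][m reals][m reals]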
As mentioned above, the data from each process is contiguous in the file, so there is little point in setting up a filetype which covers the entire file. We can use a filetype of MPI_REAL8:

      integer filemode, finfo, fhv, ierr
      integer(kind=MPI_OFFSET_KIND) disp
      ...
      filemode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
      ! create an empty info object
      call MPI_INFO_CREATE(finfo, ierr)
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', filemode, finfo, &
                         fhv, ierr)
      ! set view with a displacement of 0, an etype of MPI_REAL8, and a
      ! filetype of MPI_REAL8
      disp = 0
      call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, MPI_REAL8, 'native', &
                             finfo, ierr)
      ...

The second example is more complicated. We have a two-dimensional array evenly distributed in both dimensions. It should be stored in column-major order in the file. In the case of process (0,0), we have to map a local contiguous two-dimensional array to the file as follows:
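Schematically, for the 2 by 2 process grid (the global array in the file is n x n, with n = 2m):

  local data (m x m)          file (n x n, column-major)
  +--------+                  +--------+--------+
  | (0,0)  |      --->        | (0,0)  | (0,1)  |
  +--------+                  +--------+--------+
                              | (1,0)  | (1,1)  |
                              +--------+--------+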
In this case, it is worth setting up a filetype that covers the whole file, so that the local data can be written out with one call. We can use a displacement of zero, an etype of MPI_REAL8, and a filetype created with MPI_TYPE_CREATE_SUBARRAY:

      ...
      integer sizes(2), subsizes(2), starts(2), xcoord, ycoord
      integer fileinfo, ierror, fh, filetype
      integer(kind=MPI_OFFSET_KIND) :: disp
      ! this datatype describes the mapping of the local array
      ! to the global array (file)
      sizes(1) = n
      sizes(2) = n
      subsizes(1) = m
      subsizes(2) = m
      ! assume xcoord and ycoord are the coordinates of this process in
      ! the processor grid; note that the starting offsets are zero-based,
      ! even in Fortran
      starts(1) = m*xcoord
      starts(2) = m*ycoord
      call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
               MPI_ORDER_FORTRAN, MPI_REAL8, filetype, ierror)
      call MPI_TYPE_COMMIT(filetype, ierror)
      call MPI_INFO_CREATE(fileinfo, ierror)
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'poisson.data', &
               IOR(MPI_MODE_RDWR, MPI_MODE_CREATE), fileinfo, fh, ierror)
      disp = 0
      call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
               MPI_INFO_NULL, ierror)
      ...

Data Access Routines

In all, there are 15 routines to read data and 15 to write data. They can be categorized by positioning (explicit offsets, individual filepointers, or the shared filepointer), by synchronism (blocking or nonblocking), and by coordination (noncollective or collective).
The following table summarizes the routines for reading data. Corresponding routines for writing data exist, with names such as MPI_FILE_WRITE_AT.

                                   noncollective      collective
  explicit offset, blocking:       MPI_FILE_READ_AT   MPI_FILE_READ_AT_ALL
  explicit offset, nonblocking:    MPI_FILE_IREAD_AT  MPI_FILE_IREAD_AT_ALL
  individual pointer, blocking:    MPI_FILE_READ      MPI_FILE_READ_ALL
  individual pointer, nonblocking: MPI_FILE_IREAD     MPI_FILE_IREAD_ALL

Blocking, Noncollective and Collective, Reading and Writing

MPI_FILE_READ_AT(fh, offset, buf, count, datatype, status, ierror)
MPI_FILE_READ_AT_ALL(fh, offset, buf, count, datatype, status, ierror)
MPI_FILE_WRITE_AT(fh, offset, buf, count, datatype, status, ierror)
MPI_FILE_WRITE_AT_ALL(fh, offset, buf, count, datatype, status, ierror)

  IN  fh        Specifies the file handle (handle)
  IN  offset    Specifies the file offset (integer(kind=MPI_OFFSET_KIND))
      buf       The initial address of the buffer: IN for writes, OUT for reads (choice)
  IN  count     Specifies the number of elements in the buffer (integer)
  IN  datatype  Specifies the data type of each buffer element (handle)
  OUT status    Returns the status object (status)
  OUT ierror    Return code (integer)
Examples
We extend the first example to include the code to perform the write. The rank of each process determines the offset at which its chunk is written:

      integer me, filemode, finfo, fhv, ierr
      integer status(MPI_STATUS_SIZE)
      integer(kind=MPI_OFFSET_KIND) disp, offset
      real(kind=8) buf(m)
      ...
      filemode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
      call MPI_INFO_CREATE(finfo, ierr)
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', filemode, finfo, &
                         fhv, ierr)
      disp = 0
      call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, MPI_REAL8, 'native', &
                             finfo, ierr)
      ...
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
      offset = m*me
      call MPI_FILE_WRITE_AT(fhv, offset, buf, m, MPI_REAL8, status, ierr)

The code to write out the data for the second example is shown below. Because we use a relatively complex filetype, which defines the layout correctly for each process, the call to write the data is identical for each MPI process.

      dimension a(m,m)
      integer sizes(2), subsizes(2), starts(2), xcoord, ycoord
      integer fileinfo, ierror, fh, filetype, status(MPI_STATUS_SIZE)
      integer(kind=MPI_OFFSET_KIND) :: disp, offset
      ! this datatype describes the mapping of the local array
      ! to the global array (file)
      sizes(1) = n
      sizes(2) = n
      subsizes(1) = m
      subsizes(2) = m
      ! assume xcoord and ycoord are the coordinates of this process in
      ! the processor grid; the starting offsets are zero-based, even in Fortran
      starts(1) = m*xcoord
      starts(2) = m*ycoord
      call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
               MPI_ORDER_FORTRAN, MPI_REAL8, filetype, ierror)
      call MPI_TYPE_COMMIT(filetype, ierror)
      call MPI_INFO_CREATE(fileinfo, ierror)
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', &
               IOR(MPI_MODE_RDWR, MPI_MODE_CREATE), fileinfo, fh, ierror)
      disp = 0
      call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
               MPI_INFO_NULL, ierror)
      offset = 0
      call MPI_FILE_WRITE_AT_ALL(fh, offset, a, m*m, MPI_REAL8, &
               status, ierror)

Nonblocking, Noncollective and Collective, Reading and Writing

MPI_FILE_IREAD_AT(fh, offset, buf, count, datatype, request, ierror)
MPI_FILE_IREAD_AT_ALL(fh, offset, buf, count, datatype, request, ierror)
MPI_FILE_IWRITE_AT(fh, offset, buf, count, datatype, request, ierror)
MPI_FILE_IWRITE_AT_ALL(fh, offset, buf, count, datatype, request, ierror)

  IN  fh        Specifies the file handle (handle)
  IN  offset    Specifies the file offset (integer(kind=MPI_OFFSET_KIND))
      buf       The initial address of the buffer: IN for writes, OUT for reads (choice)
  IN  count     Specifies the number of elements in the buffer (integer)
  IN  datatype  Specifies the data type of each buffer element (handle)
  OUT request   Returns the request object (handle)
  OUT ierror    Return code (integer)
Examples

The programming to use the nonblocking write is very similar to that of the blocking write.

      integer me, filemode, finfo, fhv, request, ierr
      integer status(MPI_STATUS_SIZE)
      integer(kind=MPI_OFFSET_KIND) disp, offset
      real(kind=8) buf(m)
      ...
      filemode = IOR(MPI_MODE_RDWR, MPI_MODE_CREATE)
      call MPI_INFO_CREATE(finfo, ierr)
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', filemode, finfo, &
                         fhv, ierr)
      disp = 0
      call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, MPI_REAL8, 'native', &
                             finfo, ierr)
      ...
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
      offset = m*me
      call MPI_FILE_IWRITE_AT(fhv, offset, buf, m, MPI_REAL8, request, ierr)
      ...
      call MPI_WAIT(request, status, ierr)

Similarly, for the second example, the changes needed to use nonblocking writes are very small.

      ...
      dimension a(m,m)
      integer sizes(2), subsizes(2), starts(2), xcoord, ycoord
      integer fileinfo, ierror, fh, filetype, request
      integer status(MPI_STATUS_SIZE)
      integer(kind=MPI_OFFSET_KIND) :: disp, offset
      ! this datatype describes the mapping of the local array
      ! to the global array (file)
      sizes(1) = n
      sizes(2) = n
      subsizes(1) = m
      subsizes(2) = m
      ! xcoord and ycoord are the coordinates of this process in the
      ! processor grid; the starting offsets are zero-based, even in Fortran
      starts(1) = m*xcoord
      starts(2) = m*ycoord
      call MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
               MPI_ORDER_FORTRAN, MPI_REAL8, filetype, ierror)
      call MPI_TYPE_COMMIT(filetype, ierror)
      call MPI_INFO_CREATE(fileinfo, ierror)
      call MPI_FILE_OPEN(MPI_COMM_WORLD, 'FILE', &
               IOR(MPI_MODE_RDWR, MPI_MODE_CREATE), fileinfo, fh, ierror)
      disp = 0
      call MPI_FILE_SET_VIEW(fh, disp, MPI_REAL8, filetype, 'native', &
               MPI_INFO_NULL, ierror)
      offset = 0
      call MPI_FILE_IWRITE_AT_ALL(fh, offset, a, m*m, MPI_REAL8, &
               request, ierror)
      ...
      call MPI_WAIT(request, status, ierror)

Individual Filepointer Data Access Routines

MPI I/O also provides individual filepointer routines, both blocking and nonblocking:

MPI_FILE_READ(fh, buf, count, datatype, status, ierror)
MPI_FILE_READ_ALL(fh, buf, count, datatype, status, ierror)
MPI_FILE_IREAD(fh, buf, count, datatype, request, ierror)
MPI_FILE_IREAD_ALL(fh, buf, count, datatype, request, ierror)
MPI_FILE_SEEK(fh, offset, whence, ierror)
MPI_FILE_GET_POSITION(fh, offset, ierror)
These routines work in a similar way to the explicit offset routines, except that each MPI process keeps a private filepointer for each file, which is updated after each read or write. The routines MPI_FILE_SEEK and MPI_FILE_GET_POSITION move and query this filepointer.
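A minimal sketch (assuming the fileview from the first example, with an etype of MPI_REAL8, is already set on the file handle fhv): each process seeks its individual filepointer to its own chunk and reads m values.

      integer fhv, me, ierr, status(MPI_STATUS_SIZE)
      integer(kind=MPI_OFFSET_KIND) offset
      real(kind=8) buf(m)
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
      ! offsets are measured in etype units relative to the fileview
      offset = m*me
      call MPI_FILE_SEEK(fhv, offset, MPI_SEEK_SET, ierr)
      call MPI_FILE_READ(fhv, buf, m, MPI_REAL8, status, ierr)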
Other Useful Routines

The most common MPI I/O routines have been discussed above, but the standard defines a number of additional I/O-related routines, such as MPI_FILE_DELETE, MPI_FILE_GET_SIZE, MPI_FILE_SET_SIZE, and MPI_FILE_SYNC.
Often, to get good performance from MPI I/O, you need to define your own datatypes to create effective fileviews; this reduces the number of data access calls that must be made. Creating datatypes requires familiarity with the MPI user-defined datatype routines, which include MPI_TYPE_CONTIGUOUS, MPI_TYPE_VECTOR, MPI_TYPE_INDEXED, MPI_TYPE_CREATE_SUBARRAY, MPI_TYPE_CREATE_DARRAY, and MPI_TYPE_COMMIT.
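As an illustration (a sketch, not one of the examples above), the following code builds the round-robin fileview described under Fundamental Concepts, generalized to nprocs processes: each process sees every nprocs-th real*8 element of the file. Resizing the extent of MPI_REAL8 leaves nprocs-1 holes after each element when the filetype is tiled, and the displacement staggers the processes; fhv is assumed to be an already-opened file handle.

      integer rank, nprocs, filetype, fhv, ierr
      integer(kind=MPI_OFFSET_KIND) disp
      integer(kind=MPI_ADDRESS_KIND) lb, extent
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      lb = 0
      extent = 8*nprocs          ! one real*8 followed by nprocs-1 holes
      call MPI_TYPE_CREATE_RESIZED(MPI_REAL8, lb, extent, filetype, ierr)
      call MPI_TYPE_COMMIT(filetype, ierr)
      disp = 8*rank              ! stagger each process by one element (in bytes)
      call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL8, filetype, 'native', &
                             MPI_INFO_NULL, ierr)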
Online descriptions of these routines can be found in the MPI Subroutine Reference. A more readable account can be found in Chapter 3 of MPI - The Complete Reference, Vol. 1.

IBM SP Implementation

The IBM SP implementation is documented in the MPI Subroutine Reference. The implementation accepts hints passed through an info object; the available hints are documented in the description of MPI_FILE_OPEN. For example:

      integer finfo, ierr
      ...
      call MPI_INFO_CREATE(finfo, ierr)
      ! request IBM's large-block I/O optimization; finfo is then
      ! passed to MPI_FILE_OPEN
      call MPI_INFO_SET(finfo, 'IBM_largeblock_io', 'true', ierr)
      ...