Goal:

. Currently it's possible to define a list of aggregators and have
  them do all i/o on a file.  If independent i/o will not happen, only
  the aggregators need to actually open the file.  We will implement a
  deferred open strategy for non-aggregators.

Because the list of aggregators is supplied through a hint, the
presence or absence of that hint must not affect whether an i/o
operation succeeds.  If a process did not actually open the file
because it was not an aggregator, and it then tries to write, that
write must succeed (trigger the open and perform the operation),
assuming the process does have access to the file.

Delaying the open until i/o is done means a delay in reporting some
types of errors: if a non-aggregator doesn't have access to a file, the
error will be reported at the time of the i/o operation instead of at
open time.

On systems where only certain nodes have access to the file system,
this delay in error reporting actually helps: non-aggregators will not
trigger file access errors, as long as everyone sticks to collective
i/o.

Motivation:

. future scalability: if a thousand processors are doing computations
  involving parallel i/o, but only 16 processors actually perform i/o,
  only those 16 should open the file.
. it 'feels right' that a process not doing any i/o should not open the
  file.
. ASCI would like to see this feature

Assumptions:

If a process makes an independent i/o call, then it is sufficient for
only that process to have the file open for that function call to
succeed.

Plan of attack:

Very broadly, there are two classes of i/o we are concerned about:
independent and collective.  We can catch independent i/o on an
open-deferred file handle at ROMIO's mpi-io layer.  We have to wait
until the ADIO layer to catch collective i/o, because higher layers do
not know who will actually perform the i/o.  (Is it possible to do
collective i/o on a non-aggregator?  Perhaps MPI_COMM_SELF was passed
in?  Not really.
Only MPI_File_open takes a communicator.)

Some MPI-IO functions don't actually manipulate the low-level system
file descriptor.  We won't have to do anything differently with these
functions (most of the MPI_File_get_* functions fall into this
category, as do the MPI_File_*_end functions).

Most MPI-IO functions end up calling ADIO functions that manipulate the
low-level system file descriptor (and would thus fail if the file was
not actually opened).  These functions have to be modified to check if
the file has been opened.  If it has not, call the ADIOI_Immediate open
function before proceeding with the operation.

MPI_File_delete is a bit tricky: it takes a file name and an info
object as parameters, not a file handle.  We do not handle
MPI_File_delete in any special way in the deferred open case: if a
process calls MPI_File_delete, we will make no special attempt to
forward the request to an aggregator.

PVFS, PIOFS, and PFS resolve some hints in ADIO_xxx_Open.
ADIO_xxx_Open doesn't blindly resolve any pre-set hints.  For example,
ADIO_PVFS_Open sets up a pvfs_filestat structure with any information
provided by user hints, opens the file, then queries the opened file
for the values to use for the "striping_factor", "striping_unit" and
"start_iodevice" hints.

We can keep MPI_File_open mostly the same, though we do have to re-work
the MPI_MODE_CREATE & MPI_MODE_EXCL case and how we deal with the
shared file pointer.  In that case MPI_File_open calls ADIO_Open,
checks if the open succeeded, then immediately closes the file.  That
open/check/close step is ok, but we will move it into ADIO_Open, where
we have easier access to the list of aggregators and can ensure one
aggregator does the 'open/test/close' if deferred open is in effect.

The call to ADIO_Set_shared_fp must also happen on an aggregator.  We
will designate one of the callers to ADIO_Open as an 'io_worker'.
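
One way to picture the 'io_worker' designation is the minimal sketch
below.  It is not ROMIO code: struct sketch_hints and both functions
are hypothetical names made up for illustration, and the rule "first
aggregator in the list when deferred open is in effect, rank 0
otherwise" is an assumption, not something this note specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down stand-in for the hint data discussed here. */
struct sketch_hints {
    int deferred_open;    /* nonzero when deferred open is in effect */
    int naggregators;     /* number of entries in ranklist */
    const int *ranklist;  /* ranks of the aggregators */
};

/* Return 1 if `rank` appears in the aggregator list, 0 otherwise. */
int sketch_is_aggregator(const struct sketch_hints *h, int rank)
{
    assert(h != NULL);
    for (int i = 0; i < h->naggregators; i++) {
        if (h->ranklist[i] == rank)
            return 1;
    }
    return 0;
}

/* Return 1 if `rank` should act as the io_worker, 0 otherwise.
 * Assumption: with deferred open the first listed aggregator is chosen,
 * so the open/test/close step runs on a process that really opens the
 * file; without deferred open we fall back to rank 0. */
int sketch_is_io_worker(const struct sketch_hints *h, int rank)
{
    assert(h != NULL);
    if (h->deferred_open && h->naggregators > 0)
        return rank == h->ranklist[0];
    return rank == 0;
}
```

With aggregators {4, 8} and deferred open in effect, rank 4 would be
the io_worker under this assumed rule; with deferred open off, rank 0
would be.
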
After calling ADIO_Open, we can look at fd->io_worker (instead of
blindly relying on rank == 0).

The collective operations set the file view, currently done through
ADIO_Fcntl.  We open the file before performing any fcntl operations.
Setting the file view doesn't need the file opened, though, so we can
break out the file view stuff into a separate function.

In addition to the obviously collective operations (anything ending in
Coll), ADIO_Open and ADIO_Close are also collective.

We dup the communicator for use in ROMIO.  We will dup the communicator
a second time if a process is an aggregator.  This second communicator
will be used for collective operations.

ADIO_SeekIndividual must have the file open to seek to the end of a
file.  When seeking to the beginning or to a specific offset, we can
update the ADIO File structure without updating the system file
descriptor.  Write operations will seek to the location in the ADIO
File structure before writing data.

If the romio_no_indep_rw hint is set, ADIO_Open will need to figure out
if the node is an aggregator and, if not, skip the ADIOI_xxx_Open call.
We would like to be able to assign a sentinel value to fd_sys, except
there is no value guaranteed to not be a valid value.  We will add a
flag to the ADIO_FileD structure to indicate if the open has actually
happened or not.

If the romio_no_indep_rw hint is not set, all processes will open the
file (the current behavior).

Independent MPI-IO functions will check if the file has been opened
before making i/o calls, and call the fs-specific open function if it
has not.

The collective operations will happen across aggregators, who already
opened the file at ADIO_Open time.  Only independent MPI-IO functions
need to worry about the file not being open.

If we use the ADIOI_GEN collective functions for collective IO *and* we
are an aggregator, then we will open the file in ADIO_Open.  Otherwise,
we will defer the open until an independent IO request is made.
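
The "flag plus open on first independent i/o" idea above can be
sketched as follows.  This is only an illustration under stated
assumptions: struct sketch_file, its is_open and open_calls fields,
and all sketch_* functions are hypothetical and are not the real
ADIO_FileD layout or ADIO entry points.  The fake open stands in for
an fs-specific open function.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_file;
typedef int (*sketch_open_fn)(struct sketch_file *fd);

/* Hypothetical stand-in for the per-file structure. */
struct sketch_file {
    int fd_sys;           /* system descriptor; not meaningful until is_open */
    int is_open;          /* the extra flag: has the real open happened? */
    int open_calls;       /* bookkeeping for this sketch only */
    sketch_open_fn open;  /* stands in for the fs-specific open pointer */
};

/* Fake fs-specific open: hands back a made-up descriptor. */
int sketch_fs_open(struct sketch_file *fd)
{
    fd->fd_sys = 3;
    fd->open_calls++;
    return 0;
}

/* Perform the deferred open if it has not happened yet. */
int sketch_ensure_open(struct sketch_file *fd)
{
    assert(fd != NULL && fd->open != NULL);
    if (fd->is_open)
        return 0;
    int err = fd->open(fd);
    if (err == 0)
        fd->is_open = 1;
    return err;
}

/* Independent write path: open on demand, then do the i/o. */
int sketch_write_indep(struct sketch_file *fd, const void *buf, size_t len)
{
    int err = sketch_ensure_open(fd);
    if (err != 0)
        return err;  /* access errors surface here, not at open time */
    (void)buf;       /* a real implementation would write via fd->fd_sys */
    (void)len;
    return 0;
}
```

A separate flag is used rather than a special fd_sys value because, as
noted above, no descriptor value is guaranteed to be invalid.
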
Some independent IO requests do not need a valid system file
descriptor.  In these cases, we will naturally continue to defer the
open.  If the independent IO request needs a valid file descriptor,
then we can use the ADIOI_xxx_Open function pointer to get a valid file
descriptor for that process.

Related to that, we will go through and change the ADIOI_Fns_struct
entries to use ADIOI_GEN_* if ADIOI_xxx_* simply calls the ADIOI_GEN
version.  It saves a function call in the common case, but more
importantly makes it possible to make a simple, quick check of the
function pointer to see if the system uses the generic collective
operations.

MPI_File_get_group will return the group that called MPI_File_open, but
not the group that actually opened the file.

If adding support for a file system:

. could use the generic collective operations, or
. implement fs-specific collective operations, in which case either
  . support deferred open, or
  . filter out the hint in common/hint.c ( XXX: check this )

In summary:

romio_no_indep_rw not set             files opened normally
two-phase not enabled                 files opened normally
two-phase + romio_no_indep_rw set     defer open on non-aggregators
                                      until time of i/o
romio_no_indep_rw set but indep       open file and do i/o
  i/o happens anyway

Alternatives:

We could instead change each file-system-specific ADIO implementation
to handle deferred opens.  That would mean significant changes to ADIO
semantics and substantial changes to every ADIO filesystem
implementation so that it can handle independent i/o on an
open-deferred file descriptor.
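
As a compact restatement of the summary table and of the function
pointer check described in this note, the open-time decision could
look like the sketch below.  All names are hypothetical, and treating
"the table entry points at the generic routine" as equivalent to
"two-phase enabled" is this sketch's reading of the table, not a
statement about ROMIO's actual code.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*sketch_coll_fn)(void);

/* Stand-ins for a generic collective routine and an fs-specific one. */
int sketch_gen_write_coll(void) { return 0; }
int sketch_fs_write_coll(void)  { return 1; }

/* Hypothetical view of what is known at open time. */
struct sketch_open_state {
    sketch_coll_fn write_coll;  /* entry from the per-fs function table */
    int no_indep_rw;            /* nonzero if romio_no_indep_rw is set */
    int is_aggregator;          /* nonzero if this process is an aggregator */
};

/* Return 1 if this process should defer its open, 0 to open normally. */
int sketch_should_defer_open(const struct sketch_open_state *s)
{
    assert(s != NULL);
    /* Quick pointer comparison: does this fs use the generic routine? */
    int uses_generic = (s->write_coll == sketch_gen_write_coll);
    return uses_generic && s->no_indep_rw && !s->is_aggregator;
}
```

Only the third row of the table yields a deferred open here; the
fourth row is what happens later, when an independent i/o call finds
the file not yet open.
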