Abstract
One obstacle to running very large two- and three-dimensional codes on the Cray X-MP and Y-MP systems is to efficiently perform array transpositions using SSD storage. This article discusses how such transpositions can be performed by means of algorithms that feature exclusively unit stride, long vector transfers between main memory and SSD, and which only require a single pass through the data (provided sufficient main memory buffers are available).
The author is with the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center, Moffett Field, CA 94035.
Introduction
The limited main memory on X-MP and Y-MP (compared to the Cray-2) has led some users of very large 2-D and 3-D simulation codes to consider using the SSD (solid state device) memory system. For some of these codes, not all of the necessary arrays need be present in main memory at any one time, and so such codes can run fairly efficiently using SSD. For other codes, however, it is not convenient to segregate the data in this fashion.
One situation that is particularly challenging to handle in SSD is a very large 2-D or 3-D array that must be accessed by each dimension. Accessing such arrays in SSD by the first dimension is not difficult, since in Fortran this is the natural storage order. Accessing by the second or third dimension amounts to accessing the array with a very large stride, which renders SSD access inefficient. This is because SSD access, like disk access, is only efficient when done with large blocks of contiguous data.
One way to solve this problem is to perform an array transposition after accessing the array by the first dimension. Then subsequent computation can again be done with large vector, unit stride data accesses. Depending on the circumstances, additional array transpositions may be necessary, including one to restore the array to its original ordering. What this means is that if an efficient means can be found to transpose an array in an external memory such as SSD, then large 2-D and 3-D codes could be executed on the X-MP and Y-MP without prohibitive I/O cost. This article will describe two methods for array transposition in an external memory such as SSD. Each of these methods employs exclusively large block, contiguous data transfers between main and external memory, and each requires only a single read-write pass through the data, provided sufficient main memory buffer space is available.
Fraser's Algorithm
Probably the most efficient algorithm currently known to transpose data in external storage is due to Fraser [1, 2]. A particularly attractive aspect of this algorithm is that it can easily be tuned for maximum efficiency on a given system. It is easier to exhibit an example of Fraser's algorithm than to precisely state it. Suppose one wishes to transpose a matrix on an external random access dataset into a matrix, where and . Suppose also that the size of an efficient I/O block is (on a Cray SSD system it is 512). Finally, suppose that two main memory buffers of size are available, and that an external scratch dataset of size is available. Let the notation (14 13 12 ... 2 1 0) denote the binary digit positions in the binary expansion of an index in the -long input array. Then the steps required to transpose this array can be compactly presented in Table 1.
The notation at the beginning of each line in Table 1 indicates the
source of the data in each operation: E1 denotes external dataset
number 1, M2 denotes main memory buffer number 2, etc. Note that the
transfers between external memory and main memory only alter locations
6 through 14, and leave locations 0 through 5 unchanged (i.e., 64-long
contiguous blocks are preserved), and that transfers between two main
memory buffers only alter locations 0 through 8 (i.e. only affect data
within a single 512-long main memory buffer).
The first step, from external to main, involves fetching contiguous
blocks of size 64 from disk with a block stride of four (i.e. fetch
the first 64-long block, skip three blocks, fetch the fifth 64-long
block, etc.). The first step is done in batches of 8 blocks, so that
512 words are fetched to one of the main memory buffers before
proceeding. The second step, which is performed between the two main
memory buffers, is to transpose the resulting 512-long array,
considered as a matrix, into a matrix. In
the third step, the eight 64-long blocks in the main memory buffer are
stored out to external memory, this time with a block stride of eight.
This completes one pass through the external data set.
In the next pass, eight contiguous 64-long blocks are fetched into
main memory, and the resulting 512-long array is transposed in a block
fashion that preserves 8-long contiguous sections. Finally, the
resulting 64-long blocks are stored back to external memory, again in
a manner that achieves a certain block permutation. The array has now
been transposed in just two passes. With an adjustment of the
parameters (for example, with a block size of 32 and a memory buffer
size of 1024), the transposition could be achieved in a single pass.
The author has implemented Fraser's algorithm on a Cray Y-MP running
UNICOS, using SDS space in SSD. The resulting implementation appears
to be very effective, as can be judged by the performance figures in
Table 2. In addition to the CPU times listed, some I/O time was
logged as ``system time''. As in many applications that use I/O, this
system time can vary wildly from run to run, depending on swapping and
other system activity. When the system I/O is not excessively busy,
this time is generally about the same as the user CPU time listed in
the table.