Array Transposition in SSD

David H. Bailey
October 3, 1989

Abstract

One obstacle to running very large two- and three-dimensional codes on the Cray X-MP and Y-MP systems is to efficiently perform array transpositions using SSD storage. This article discusses how such transpositions can be performed by means of algorithms that feature exclusively unit stride, long vector transfers between main memory and SSD, and which only require a single pass through the data (provided sufficient main memory buffers are available).

The author is with the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center, Moffett Field, CA 94035.

Introduction

The limited main memory on X-MP and Y-MP (compared to the Cray-2) has led some users of very large 2-D and 3-D simulation codes to consider using the SSD (solid state device) memory system. For some of these codes, not all of the necessary arrays need be present in main memory at any one time, and so such codes can run fairly efficiently using SSD. For other codes, however, it is not convenient to segregate the data in this fashion.

One situation that is particularly challenging to handle in SSD is a very large 2-D or 3-D array that must be accessed by each dimension. Accessing such arrays in SSD by the first dimension is not difficult, since in Fortran this is the natural storage order. Accessing by the second or third dimension amounts to accessing the array with a very large stride, which renders SSD access inefficient. This is because SSD access, like disk access, is only efficient when done with large blocks of contiguous data.

One way to solve this problem is to perform an array transposition after accessing the array by the first dimension. Then subsequent computation can again be done with large vector, unit stride data accesses. Depending on the circumstances, additional array transpositions may be necessary, including one to restore the array to its original ordering. What this means is that if an efficient means can be found to transpose an array in an external memory such as SSD, then large 2-D and 3-D codes could be executed on the X-MP and Y-MP without prohibitive I/O cost. This article will describe two methods for array transposition in an external memory such as SSD. Each of these methods employs exclusively large block, contiguous data transfers between main and external memory, and each requires only a single read-write pass through the data, provided sufficient main memory buffer space is available.

Fraser's Algorithm

Probably the most efficient algorithm currently known to transpose data in external storage is due to Fraser [1, 2]. A particularly attractive aspect of this algorithm is that it can easily be tuned for maximum efficiency on a given system. It is easier to exhibit an example of Fraser's algorithm than to precisely state it. Suppose one wishes to transpose a matrix on an external random access dataset into a matrix, where and . Suppose also that the size of an efficient I/O block is (on a Cray SSD system it is 512). Finally, suppose that two main memory buffers of size are available, and that an external scratch dataset of size is available. Let the notation (14 13 12 ... 2 1 0) denote the binary digit positions in the binary expansion of an index in the -long input array. Then the steps required to transpose this array can be compactly presented in Table 1.