Application: sparse matrix-vector multiplication. 


Wim Vanroose, 9/4/2002

Introduction

Computing the matrix exponential of an operator acting on a vector is an important problem in mathematics, physics and engineering. Many problems reduce to an ODE of the form

d v(t)/dt =  A v(t)

with v(t) a time-dependent vector and A a linear operator. The solution v(t) is subject to the initial condition v(0) = v_0.
The solution of this equation can be formally written as

v(t) = exp(t A) v_0.

So, calculating the matrix exponential provides us with the solution.
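
As a minimal illustration of this formula (the operator, initial vector and time below are made up for the example), for a small dense A the exponential can be formed directly:

import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])   # example operator: a harmonic oscillator
v0 = np.array([1.0, 0.0])     # initial condition v(0) = v_0
t = 0.5

v_t = expm(t * A) @ v0        # v(t) = exp(t A) v_0
print(v_t)                    # ~ [cos(t), -sin(t)]

Forming exp(t A) explicitly like this is of course only feasible when A is small and dense.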

For small t << 1, approximations of the exponential like Pade or Crank-Nicolson work very well. The problem is that in the application I am interested in, the matrix A is a huge sparse matrix, and the Pade and similar approximations only work for small matrices.
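
For concreteness, here is a sketch of a single Crank-Nicolson step, which is exactly the (1,1) Pade approximant of exp(dt A); the sparse example operator (a 1D Laplacian) and the step size are choices of the sketch:

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 100
# example sparse operator: a tridiagonal 1D Laplacian
A = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n), format="csc")
I = sp.identity(n, format="csc")
dt = 0.01
v = np.ones(n)

# one step: solve (I - dt/2 A) v_new = (I + dt/2 A) v,
# so v_new ~ exp(dt A) v with an error of O(dt^3) per step
v_new = spla.spsolve(I - 0.5 * dt * A, (I + 0.5 * dt * A) @ v)

Every step requires a sparse linear solve, so this only pays off when dt is small and the solve is cheap.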

Since the matrix A is sparse, the exponential can be calculated with the Lanczos algorithm. By applying A multiple times to the initial state v_0, and using a Gram-Schmidt-like orthogonalization, the Lanczos algorithm creates an orthogonal basis for the Krylov subspace

Span(v_0, A v_0, A^2 v_0, ...).

The action of the matrix exponential on v_0 is then calculated in this subspace. This is possible since the representation of A in this limited subspace is again a small matrix, and the exponential of that small matrix can be calculated with the conventional Pade approximation.
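
A sketch of the procedure (the Krylov dimension m, the function names, and the use of scipy's expm for the small Pade step are choices of the sketch, not prescribed above):

import numpy as np
from scipy.linalg import expm

def lanczos_expm(matvec, v0, t, m=30):
    # Approximate exp(t A) v0 in an m-dimensional Krylov subspace;
    # matvec(v) must return A v.
    n = v0.shape[0]
    V = np.zeros((n, m))            # orthonormal Lanczos vectors
    alpha = np.zeros(m)             # diagonal of the small matrix
    beta = np.zeros(m - 1)          # off-diagonal of the small matrix
    beta0 = np.linalg.norm(v0)
    V[:, 0] = v0 / beta0
    for j in range(m):
        w = matvec(V[:, j])                 # apply A, possibly on the fly
        alpha[j] = np.dot(V[:, j], w)
        w -= alpha[j] * V[:, j]             # Gram-Schmidt-like step:
        if j > 0:                           # the three-term recurrence
            w -= beta[j - 1] * V[:, j - 1]
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            if beta[j] == 0.0:              # invariant subspace found
                m = j + 1
                V, alpha, beta = V[:, :m], alpha[:m], beta[:m - 1]
                break
            V[:, j + 1] = w / beta[j]
    # small tridiagonal representation of A in the Krylov basis
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    # exponential of the small matrix (conventional Pade, via scipy),
    # applied to the first basis vector: exp(t A) v0 ~ ||v0|| V exp(t T) e_1
    return beta0 * V @ expm(t * T)[:, 0]

The point is that A only enters through matvec, so the large matrix itself never has to be stored.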

The code for this procedure is pretty easy: you essentially have to write a matrix-vector product routine that implements the application of A to a vector v. I have this code running on a single-processor machine.
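
For example, a hypothetical routine of this kind for a 1D Laplacian; the matrix is never stored, only its action on v is computed:

import numpy as np

def apply_A(v):
    # w[i] = -2 v[i] + v[i-1] + v[i+1], computed on the fly
    w = -2.0 * v
    w[:-1] += v[1:]     # superdiagonal contribution
    w[1:] += v[:-1]     # subdiagonal contribution
    return w

# combined with the Lanczos sketch above:
# v_t = lanczos_expm(apply_A, np.ones(1000), t=0.1, m=30)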

Application: sparse matrix-vector multiplication

So, in order to port the matrix exponential code to a parallel machine, I need to calculate the matrix-vector product in parallel. Therefore, the application I am interested in is how to organise sparse matrix-vector multiplication on a parallel machine. Obviously, this is not the first application that relies heavily on this multiplication.

Because the matrix is sparse, it looks very difficult to find the optimal way of programming the matrix-vector product: the performance depends critically on the distribution of the non-zero elements.

We don't want to use a parallel generalisation of the uniprocessor BLAS routine dgemv; for example, PLAPACK has the call PLA_dgemv and ScaLAPACK has PSgemv. These routines are optimized for dense matrices and don't take advantage of the sparsity of the matrix A. In addition, in our application A is so big that we never want to construct it explicitly. We prefer to apply it on-the-fly when we calculate the matrix-vector product.
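
One possible organisation (an assumption of this sketch, not a statement about what PLAPACK or ScaLAPACK do) is to distribute the rows of A in blocks over the processes, keep the matrix implicit, and gather the full input vector before each product; mpi4py is assumed as the message-passing layer:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_local = 1000                       # rows owned by this process
offset = rank * n_local              # global index of the first local row

def apply_A_local(v_global):
    # local rows of a 1D Laplacian, applied to the full vector on the fly
    w = np.empty(n_local)
    for i in range(n_local):
        g = offset + i               # global row index
        w[i] = -2.0 * v_global[g]
        if g > 0:
            w[i] += v_global[g - 1]
        if g < v_global.size - 1:
            w[i] += v_global[g + 1]
    return w

v_local = np.ones(n_local)
# gather the pieces of v from all processes into one full vector
v_global = np.concatenate(comm.allgather(v_local))
w_local = apply_A_local(v_global)    # each process computes its own rows

Gathering the whole vector on every process is wasteful for a banded operator like this one, where only the boundary entries of the neighbouring blocks are needed; that is exactly the kind of dependence on the non-zero structure discussed next.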

I've found some recent papers on this problem, and it seems that the results depend on the distribution of the non-zero elements in the matrix A.

Conclusion

At first sight, no statement about the peak performance and the scalability can be made without considering the structure of the sparse matrix.