A Reference Implementation for Extended and Mixed Precision BLAS

Introduction

Software (Last update: 02/15/2008)

Code Generation with M4 Macro Processor

Testing

Testing DOT
Obtaining the accurate result
Testing SPMV and GBMV

Other Languages Besides C

Future Work

Developers

Introduction

This library of routines is part of a reference implementation for the Dense and Banded BLAS routines, along with their Extended and Mixed Precision versions, as documented in Chapters 2 and 4 of the BLAS Technical Forum Standard.

Extended precision is only used internally in the BLAS; the input and output arguments remain just Single or Double as in the existing BLAS. For the internal precision, we allow Single, Double, Indigenous, or Extra. In our reference implementation we assume that Single and Double are the corresponding IEEE floating point formats. Indigenous means the widest hardware-supported format available. Its intent is to let the machine run close to top speed, while being as accurate as possible. On some machines this would be a 64-bit (Double) format, but on others such as Intel machines it means the 80-bit IEEE format (which has 64 fraction bits). Our reference implementation currently supports machines on which Indigenous is the same as Double. Extra means anything at least 1.5 times as accurate as Double, and in particular wider than 80 bits. An existing quadruple precision format could be used to implement Extra precision, but we cannot assume that the language or compiler supports any format wider than Double. So for our reference implementation, we use a technique called double-double in which extra precise numbers are represented by a pair of double precision numbers, providing about 106 bits of precision.

We have designed all our routines assuming that single precision arithmetic is actually done in IEEE single precision (32 bits) and that double precision arithmetic is actually done in IEEE double precision (64 bits). It also passes our tests on the Intel machine with 80-bit floating point registers.

Mixed precision permits some input/output arguments to be of different mathematical types, meaning real and complex, or different precisions, meaning single and double. This permits such operations as real-by-complex matrix multiplication, which can be rather faster than using alternatives that do not mix precision.

The purpose of this implementation is to provide a proof-of-concept implementation, showing that the considerable complexity of the specification is actually implementable and testable with a manageable amount of software. We have not attempted to optimize performance, but our code should be as good as straightforward but careful code written by hand.

Software

The BLAS Standard defines language bindings for Fortran 95, Fortran 77, and C. Here, we have only implemented the C version. However, our methodologies for code generation and correctness testing also apply to Fortran.

In this release, we provide the following routines:

Level 1
- DOT (inner product)
- SUM (sum)
- AXPBY (Scaled vector accumulation)
- WAXPBY (Scaled vector addition)
Level 2
- GEMV (General matrix vector product)
- GBMV (Banded matrix vector product)
- SYMV (Symmetric matrix vector product)
- SPMV (Symmetric matrix vector product, packed format)
- HEMV (Hermitian matrix vector product)
- HPMV (Hermitian matrix vector product, packed format)
- TRSV (Triangular solve)
Level 3
- GEMM (General matrix matrix product)
- SYMM (Symmetric matrix matrix product)
- HEMM (Hermitian matrix matrix product)

All these routines pass our systematic testing of all possible combinations of mixed and extended precision. We will eventually include everything in the intersection of Chapter 2 and Chapter 4 with systematic testing.

The whole package is contained in xblas.tar.gz. You can also browse the directory structure, and retrieve the individual files. The TOMS paper and the Technical Report contain more detailed information.

Code Generation with M4 Macro Processor

Our goal is to automate the construction and testing of the BLAS as proposed in Chapters 2 and 4 of the Standard. Here is the challenge and our proposed solution.

In the existing BLAS, there are usually 4 routines associated with each operation. All input, output, and internal variables are single or double precision and real or complex. But under the new extended and mixed precision rules (see Chapter 4 for details), the input, output and internal variables may have different precisions and types. Therefore, the combination of all these types results in tremendously more routines associated with each operation. For example, DOT will have 32 routines altogether, 4 "standard" versions (from Chapter 2) and 28 mixed and extended precision versions (from Chapter 4). In addition, the 16 versions with extended precision support up to three internal precisions that can be chosen at runtime. We have automated the code generation as much as possible. We use the M4 macro processor to facilitate this task.

The idea is to define a macro for each fundamental operation. The macro's argument list contains the variables, accompanied by their types and precisions. For example, for the "+" operation

	c <- a + b

we define the following macro:

   	ADD(c, c_type, a, a_type, b, b_type)

where, x_type can one of:

	real_single
	real_double
	real_extra
	complex_single
	complex_double
	complex_extra

Inside the macro body, we use an "if-test" on c_type, a_type and b_type, to generate the appropriate code segment for "+". (This is similar to operator overloading in C++; but we do it manually.) All these if-tests are evaluated at macro-evaluation time, and do not appear in the executable code. Indeed, our goal was to produce efficient C code, which means minimizing branches in inner loops. Other macros include SUB, MUL, DIV, DECLARE (variable declaration), ASSIGN, etc.

Since these macros are shared among all the BLAS routines, we put them in a common header file, named cblas.m4.h . Each BLAS routine also has its own macro file, such as dot.m4, spmv.m4 and gbmv.m4, to generate the specific functions. All the M4 macro files are located in the m4/ subdirectory.

For example, the inner loop of the M4 macro for the dot product is simply as follows (the M4 parameters $2, $3, and $4 are types):

        for (i = 0; i < n; ++i) {
            GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2)
                  /* put ix-th element of vector x into x_ii */
            GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3)
                  /* put iy-th element of vector y into y_ii */
            MUL(prod, $4, x_ii, $2, y_ii, $3) /* prod = x[i]*y[i] */
            ADD(sum, $4, sum, $4, prod, $4) /* sum = sum+prod */
            ix += incx;
            iy += incy;
        } /* endfor */

The motivation for this macro-based approach is simplifying software engineering. For example, the file dot.m4 of M4 macros for the dot product is 401 lines long (245 non-comment lines) but expands into 11145 lines in 32 C subroutines implementing different versions of DOT. Similarly the macros for TRSV expand from 723 lines (454 non-comment lines) to 37099 lines in 16 C subroutines. (This does not count the shared M4 macros in directory m4/.)

Testing

The goal of the testing code is to validate the underlying implementation. The challenge is twofold: First, we must thoroughly test routines claiming to use extra precision internally, where the test code is not allowed to declare any extra precision variables or use any other extra precision facilities not available to the code being tested. This requires great care in generating test data. Second, we must use M4 to automatically generate the many versions of test code needed for the many versions of the code being tested.

For each BLAS routine, we perform the following steps in the test code:

Generate input scalars, vectors and arrays, according to the routine's specification, so that the result exposes the internal precision actually used.
Call the BLAS routine
For each output, compute a test ratio of the computed error to the theoretical error bound:
```
	test ratio = | Computed_value - "True_value" | / Error_Bound
```

By design, the test ratio should be bounded by 1. A large ratio indicates that the computed result is either completely wrong, or not as accurate as claimed in the specification.

The following section will discuss how we generate "good" inputs in order to reveal the internal precisions actually used.

Testing DOT

DOT performs the following function:

	r <- alpha * SUM_{i=1,n}(x_i*y_i) + beta * r_in

Assume that the result r_comp is computed as follows

precision eps_int is used internally,

precision eps_out in used when the final result is rounded on output

underflow threshold UN_int is used internally,

underflow threshold UN_out is used on output

and that additionally we compute a very accurate approximation r_acc with

precision eps_acc = 2^(-106) (double-double format)

underflow threshold UN_acc = 2^(-1022)

Then the error bound satisfies

(*) |r_comp - r_acc| <= (n+2)*(eps_int + eps_acc)*S + U + eps_out*|r_acc|
                      = Scale

where,

    S = |alpha| * SUM_{i=1,n} |x_i*y_i| + |beta * r_in|
    U = (2*|alpha|*n + 3)*(UN_int + UN_acc) + UN_out

Thus, we can confirm that r_comp has been computed as accurately as claimed (i.e. with internal precision defined by eps_int and UN_int) by testing whether the

          test ratio = |r_comp - r_acc| / Scale

is at most 1. Suppose that no underflow occurs, and that

          eps_int_actual >> eps_int >= eps_acc.

where eps_int_actual is the internal precision actually used in some buggy version of DOT that we want to test. Then we can expect the test ratio to be as large as

                        (n+2)*(eps_int_actual + eps_acc)*S + eps_out*|r_acc|
          test ratio ~  ----------------------------------------------
                        (n+2)*(eps_int  + eps_acc)*S + eps_out*|r_acc|

If we can make r_acc very small, then this ratio will be roughly

          test ratio =  eps_int_actual / eps_int >> 1

which means the bug in DOT will be detected, and in fact the test ratio will actually tell us how much internal precision we effectively used.

Thus our goal is to pick test data alpha, x(1:n), y(1:n), beta and r_in to make |r_acc| as small as possible, MUCH smaller than S, so that eps_int term dominates on the right of the inequality (*). Otherwise eps_int will be invisible in the error bound, then we cannot tell what precision is used internally.

In our test generator, we choose input data alpha, beta, r_in, x(1:n) and y(1:n) judiciously in order to cause as much cancellation in r as possible.

Obtaining accurate r_acc

The general approach is to choose some of the input values of x(i) and y(i) so that the exact (partial) dot product of these values has a lot of nonzero fraction bits, preferably at least 106. Then the remaining values of x(i) and y(i) are chosen to cancel the previous bits as much as possible. This latter computation seems to require high precision.

One possibility to use an arbitrary precision package, such as MPFUN, but our goal is to make this test code self contained, and use no extra precision facilities not available to the code being tested. Since most of our routines can be reduced to a series of dot products, testing can be based on DOT. We only need a TRUSTED dot routine. Currently, we are using our own dot routine with double-double internal precision to compute r_truth, which is accurate to 106 bits. This means that any internal precision higher than double-double cannot be detected, and may result in a tiny test ratio. A very tiny test ratio (such as zero) may also occur if the result happens to be computed exactly.

This raises the possibility that there are "matching" bugs in our trusted DOT and the DOT under test, so that bugs are missed. To avoid this possibility we also generate some test examples where the cancellation is done mathematically (and so exactly) rather than depending on a computation. The idea is simple: For example, choose x(1:3) and y(1:3) so that

            x(1)*y(1) = -x(3)*y(3) >> x(2)*y(2)

so that SUM_{i=1,3} x(i)*y(i) = x(2)*y(2) exactly.

Testing SPMV and GBMV

SPMV, GBMV, and many other BLAS2 routines perform the following function:

	y <- alpha * A * x + beta * y

Testing them is no more difficult than testing DOT, because each component of the computed y vector is a dot product, and satisfies the error bound (*). So we simply use the same test generator as DOT, and compute a test ratio for each component of y vector.The only tricky part is that some entries in each dot product may be fixed. For example, the first row of A and the vector x can be chosen freely, but after that x is fixed and, if A is symmetric, the first entry of each subsequent row is fixed. Our dot-product test generator handles all these cases.

This approach can be generalized to most other Level 2 and 3 BLAS, except for the triangular solve function, which is future work.

Other Languages Besides C

Although we have not tried, we believe similarly structured macros could generate F77 and F95 code as well.

Note that conventional operator overloading where the implementation of each binary operation only depends on the argument types and not the result type is not enough for our purposes. It cannot implement extra = double*double correctly, for example. The alternative would be to promote one argument to the output type, but could be inefficient depending on how it is implemented (extra = extra*double is significantly more expensive than extra = double*double). This may limit the utility of this feature in F95 and C++.

Future Work

There are several avenues we would like to explore in the future. First, we will finish the code generation and testing for all the remaining functions in Chapter 4 of the BLAS Standard.

Second, we will evaluate the overall performance of the library. Our code in the present form is meant to be a reference implementation, serving much the same purpose as the existing Fortran BLAS distribution in Netlib. Our generated code consists only of straightforward loops, without any architectural optimizations, such as blocking for locality or loop unrolling. Because of the high degree of data reuse in double-double arithmetic, this yielded good performance for extended precision, but further improvements are possible, especially for mixed precision.

And third, we will investigate the benefit of the new BLAS for more linear algebra algorithms and applications.

Developers

Xiaoye Li
Jim Demmel
David Bailey
Greg Henry
William Kahan
Yozo Hida
Jimmy Iskandar
Anil Kapur
Michael Martin
Brandon Thompson
Teresa Tung
Daniel Yoo

with help from Ben Wanzo, Berkat Tung and Weihua Shen.

Please send any comments or bug reports to extended_blas@cs.berkeley.edu.

A Reference Implementation for Extended and Mixed Precision BLAS

Table of Contents

Code Generation with M4 Macro Processor