Extended precision is only used internally in the BLAS; the input and output arguments remain just Single or Double as in the existing BLAS. For the internal precision, we allow Single, Double, Indigenous, or Extra. In our reference implementation we assume that Single and Double are the corresponding IEEE floating point formats. Indigenous means the widest hardware-supported format available. Its intent is to let the machine run close to top speed, while being as accurate as possible. On some machines this would be a 64-bit (Double) format, but on others such as Intel machines it means the 80-bit IEEE format (which has 64 fraction bits). Our reference implementation currently supports machines on which Indigenous is the same as Double. Extra means anything at least 1.5 times as accurate as Double, and in particular wider than 80 bits. An existing quadruple precision format could be used to implement Extra precision, but we cannot assume that the language or compiler supports any format wider than Double. So for our reference implementation, we use a technique called double-double in which extra precise numbers are represented by a pair of double precision numbers, providing about 106 bits of precision.
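The double-double pair representation rests on error-free transformations. A minimal sketch (this is Knuth's standard two_sum algorithm, not code taken from the package itself): the rounded sum s and its exact rounding error e together represent a + b exactly, and chaining such pairs yields roughly 106 bits of precision.

```c
#include <assert.h>

/* Knuth's two_sum: compute s = fl(a + b) and the exact rounding
 * error e, so that s + e == a + b exactly.  The pair (s, e) is the
 * building block of double-double arithmetic. */
static void two_sum(double a, double b, double *s, double *e) {
    double v;
    *s = a + b;
    v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}
```

For example, adding 1.0 and 1e-30 in plain double rounds the small term away entirely, but two_sum preserves it in the error part: s is 1.0 and e is 1e-30, so no information is lost.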
We have designed all our routines assuming that single precision arithmetic is actually done in IEEE single precision (32 bits) and that double precision arithmetic is actually done in IEEE double precision (64 bits). The code also passes our tests on Intel machines, whose 80-bit floating point registers can carry extra precision.
Mixed precision permits some input/output arguments to be of different mathematical types, meaning real and complex, or different precisions, meaning single and double. This permits such operations as real-by-complex matrix multiplication, which can be significantly faster than alternatives that do not mix precision.
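To illustrate the saving, scaling a complex vector elementwise by a real vector needs only two real multiplies per element, where a general complex-by-complex product needs four multiplies and two adds. A hypothetical kernel sketch (the function name and interleaved re/im layout are illustrative assumptions, not the package's API):

```c
/* Hypothetical kernel: y[i] = a[i] * z[i], with a real and z complex
 * stored interleaved (re, im).  Each element costs two real
 * multiplies instead of the four multiplies and two adds a full
 * complex-by-complex product would need. */
static void rc_scale(int n, const double *a, const double *z, double *y) {
    for (int i = 0; i < n; ++i) {
        y[2 * i]     = a[i] * z[2 * i];     /* real part */
        y[2 * i + 1] = a[i] * z[2 * i + 1]; /* imaginary part */
    }
}
```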
The purpose of this release is to provide a proof-of-concept implementation, showing that the considerable complexity of the specification is actually implementable and testable with a manageable amount of software. We have not attempted to optimize performance, but our code should be as good as straightforward but careful code written by hand.
In this release, we provide the following routines:
The whole package is contained in xblas.tar.gz. You can also browse the directory structure, and retrieve the individual files. The TOMS paper and the Technical Report contain more detailed information.
In the existing BLAS, there are usually 4 routines associated with each operation: all input, output, and internal variables are single or double precision, and real or complex. But under the new extended and mixed precision rules (see Chapter 4 for details), the input, output and internal variables may have different precisions and types, so the combination of all these types results in far more routines per operation. For example, DOT has 32 routines altogether: 4 "standard" versions (from Chapter 2) and 28 mixed and extended precision versions (from Chapter 4). In addition, the 16 versions with extended precision support up to three internal precisions that can be chosen at runtime. We have automated the code generation as much as possible, using the M4 macro processor to facilitate this task.
The idea is to define a macro for each fundamental operation. The macro's argument list contains the variables, accompanied by their types and precisions. For example, for the "+" operation
c <- a + b

we define the following macro:
ADD(c, c_type, a, a_type, b, b_type)

where x_type can be one of:
real_single, real_double, real_extra, complex_single, complex_double, complex_extra

Inside the macro body, we use an "if-test" on c_type, a_type and b_type to generate the appropriate code segment for "+". (This is similar to operator overloading in C++, but we do it manually.) All these if-tests are evaluated at macro-evaluation time, and do not appear in the executable code. Indeed, our goal was to produce efficient C code, which means minimizing branches in inner loops. Other macros include SUB, MUL, DIV, DECLARE (variable declaration), ASSIGN, etc.
Since these macros are shared among all the BLAS routines, we put them in a common header file, named cblas.m4.h. Each BLAS routine also has its own macro file, such as dot.m4, spmv.m4 and gbmv.m4, to generate the specific functions. All the M4 macro files are located in the m4/ subdirectory.
For example, the inner loop of the M4 macro for the dot product is simply as follows (the M4 parameters $2, $3, and $4 are types):
for (i = 0; i < n; ++i) {
  GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2)  /* put ix-th element of vector x into x_ii */
  GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3)  /* put iy-th element of vector y into y_ii */
  MUL(prod, $4, x_ii, $2, y_ii, $3)      /* prod = x[i]*y[i] */
  ADD(sum, $4, sum, $4, prod, $4)        /* sum = sum+prod */
  ix += incx;
  iy += incy;
} /* endfor */
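When $2, $3 and $4 are all real_double, these macros reduce to plain C operations. A sketch of the kind of loop the expansion produces (illustrative, not verbatim from the generated sources; the function wrapper is added here for self-containment):

```c
/* Sketch of the loop generated when x, y and the accumulator are all
 * real double: GET_VECTOR_ELEMENT, MUL and ADD expand to ordinary
 * C loads, multiplies and adds, with no branches in the inner loop. */
static double ddot_sketch(int n, const double *x, int incx,
                          const double *y, int incy) {
    double sum = 0.0, prod, x_ii, y_ii;
    int i, ix = 0, iy = 0;
    for (i = 0; i < n; ++i) {
        x_ii = x[ix];       /* GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2) */
        y_ii = y[iy];       /* GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3) */
        prod = x_ii * y_ii; /* MUL(prod, $4, x_ii, $2, y_ii, $3) */
        sum = sum + prod;   /* ADD(sum, $4, sum, $4, prod, $4) */
        ix += incx;
        iy += incy;
    }
    return sum;
}
```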
The motivation for this macro-based approach is to simplify the software engineering. For example, the file dot.m4 of M4 macros for the dot product is 401 lines long (245 non-comment lines) but expands into 11145 lines in 32 C subroutines implementing different versions of DOT. Similarly, the macros for TRSV expand from 723 lines (454 non-comment lines) to 37099 lines in 16 C subroutines. (This does not count the shared M4 macros in directory m4/.)
For each BLAS routine, we perform the following steps in the test code:
test ratio = | Computed_value - "True_value" | / Error_Bound
The following section will discuss how we generate "good" inputs in order to reveal the internal precisions actually used.
r <- alpha * SUM_{i=1,n}(x_i*y_i) + beta * r_in

Assume that the result r_comp is computed with internal precision eps_int and underflow threshold UN_int, and that additionally we compute a very accurate approximation r_acc with precision eps_acc and underflow threshold UN_acc; the output precision is eps_out with underflow threshold UN_out. Then the error bound satisfies
(*)  |r_comp - r_acc| <= (n+2)*(eps_int + eps_acc)*S + U + eps_out*|r_acc| = Scale

where
S = |alpha| * SUM_{i=1,n} |x_i*y_i| + |beta * r_in|
U = (2*|alpha|*n + 3)*(UN_int + UN_acc) + UN_out
Thus, we can confirm that r_comp has been computed as accurately as claimed (i.e. with internal precision defined by eps_int and UN_int) by testing whether the
test ratio = |r_comp - r_acc| / Scale

is at most 1. Suppose that no underflow occurs, and that
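A sketch of how a tester could evaluate the bound (*) and the resulting ratio (the function name, argument order and constant choices here are illustrative assumptions, not the package's actual API):

```c
#include <math.h>

/* Compute the test ratio |r_comp - r_acc| / Scale, where Scale is
 * the right-hand side of bound (*):
 *   Scale = (n+2)*(eps_int + eps_acc)*S + U + eps_out*|r_acc|
 * with S = |alpha| * SUM |x_i*y_i| + |beta * r_in| and U the
 * combined underflow term, passed in precomputed. */
static double dot_test_ratio(int n, double alpha, const double *x,
                             const double *y, double beta, double r_in,
                             double r_comp, double r_acc,
                             double eps_int, double eps_acc,
                             double eps_out, double U) {
    double S = 0.0;
    for (int i = 0; i < n; ++i)
        S += fabs(x[i] * y[i]);
    S = fabs(alpha) * S + fabs(beta * r_in);
    double scale = (n + 2) * (eps_int + eps_acc) * S + U
                 + eps_out * fabs(r_acc);
    return fabs(r_comp - r_acc) / scale;
}
```

A correct routine should yield a ratio of at most 1; a routine using less internal precision than claimed yields a much larger one.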
eps_int_actual >> eps_int >= eps_acc

where eps_int_actual is the internal precision actually used in some buggy version of DOT that we want to test. Then we can expect the test ratio to be as large as
                 (n+2)*(eps_int_actual + eps_acc)*S + eps_out*|r_acc|
test ratio  ~    ----------------------------------------------------
                 (n+2)*(eps_int + eps_acc)*S + eps_out*|r_acc|

If we can make |r_acc| very small, then this ratio will be roughly
test ratio = eps_int_actual / eps_int >> 1

which means the bug in DOT will be detected, and in fact the test ratio will tell us how much internal precision was effectively used.
Thus our goal is to pick test data alpha, x(1:n), y(1:n), beta and r_in to make |r_acc| as small as possible, MUCH smaller than S, so that the eps_int term dominates on the right-hand side of inequality (*). Otherwise eps_int will be invisible in the error bound, and we cannot tell what precision is used internally.
In our test generator, we choose input data alpha, beta, r_in, x(1:n) and y(1:n) judiciously in order to cause as much cancellation in r as possible.
The general approach is to choose some of the input values of x(i) and y(i) so that the exact (partial) dot product of these values has a lot of nonzero fraction bits, preferably at least 106. Then the remaining values of x(i) and y(i) are chosen to cancel the previous bits as much as possible. This latter computation seems to require high precision.
One possibility is to use an arbitrary precision package, such as MPFUN, but our goal is to make this test code self-contained, using no extra precision facilities not available to the code being tested. Since most of our routines can be reduced to a series of dot products, testing can be based on DOT. We only need a TRUSTED dot routine. Currently, we are using our own dot routine with double-double internal precision to compute r_truth, which is accurate to 106 bits. This means that any internal precision higher than double-double cannot be detected, and may result in a tiny test ratio. A very tiny test ratio (such as zero) may also occur if the result happens to be computed exactly.
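A minimal sketch of such double-double accumulation (the two_sum and two_prod error-free transformations are the standard Knuth and Dekker algorithms; the names and structure are illustrative, not the package's actual m4-generated code):

```c
/* two_sum (Knuth): s + e == a + b exactly, with s = fl(a + b). */
static void two_sum(double a, double b, double *s, double *e) {
    double v;
    *s = a + b;
    v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}

/* two_prod (Dekker splitting): p + e == a * b exactly.
 * 134217729 = 2^27 + 1 splits a 53-bit significand in half. */
static void two_prod(double a, double b, double *p, double *e) {
    double ca = 134217729.0 * a, cb = 134217729.0 * b;
    double ahi = ca - (ca - a), alo = a - ahi;
    double bhi = cb - (cb - b), blo = b - bhi;
    *p = a * b;
    *e = ((ahi * bhi - *p) + ahi * blo + alo * bhi) + alo * blo;
}

/* Dot product with a double-double accumulator (hi, lo),
 * accurate to about 106 bits. */
static double dd_dot(int n, const double *x, const double *y) {
    double hi = 0.0, lo = 0.0;
    for (int i = 0; i < n; ++i) {
        double p, pe, s, e;
        two_prod(x[i], y[i], &p, &pe);
        two_sum(hi, p, &s, &e);  /* add the product head ...       */
        e += lo + pe;            /* ... fold in both error tails    */
        hi = s + e;              /* renormalize (fast two_sum)      */
        lo = e - (hi - s);
    }
    return hi + lo;
}

/* Plain double accumulation, for comparison. */
static double naive_dot(int n, const double *x, const double *y) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

On cancelling data such as products 2^40, 2^-20, -2^40 the naive double accumulator returns 0, while the double-double accumulator recovers the exact answer 2^-20.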
This raises the possibility that there are "matching" bugs in our trusted DOT and the DOT under test, so that bugs are missed. To avoid this possibility we also generate some test examples where the cancellation is done mathematically (and so exactly) rather than depending on a computation. The idea is simple: For example, choose x(1:3) and y(1:3) so that
x(1)*y(1) = -x(3)*y(3) >> x(2)*y(2)

so that SUM_{i=1,3} x(i)*y(i) = x(2)*y(2) exactly.
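A concrete instance of such exactly cancelling data (the function names are illustrative; this assumes float and double arithmetic are performed in their declared IEEE precisions, as on SSE hardware, rather than in wider x87 registers): with products 2^20, 2^-10, -2^20, the exact sum is x(2)*y(2) = 2^-10, which a single precision accumulator loses entirely while a double accumulator recovers it exactly. The test ratio thus reveals the internal precision actually used.

```c
/* Single precision accumulation: 2^20 + 2^-10 needs 31 significand
 * bits, more than float's 24, so the small term is rounded away and
 * the final sum is 0. */
static float sdot_acc_float(int n, const float *x, const float *y) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Double precision accumulation: 31 bits fit easily in double's 53,
 * so the exact value 2^-10 survives the cancellation. */
static float sdot_acc_double(int n, const float *x, const float *y) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += (double)x[i] * y[i];
    return (float)sum;
}
```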
y <- alpha * A * x + beta * y
Testing them is no more difficult than testing DOT, because each component of the computed y vector is a dot product and satisfies the error bound (*). So we simply use the same test generator as DOT and compute a test ratio for each component of the y vector. The only tricky part is that some entries in each dot product may be fixed. For example, the first row of A and the vector x can be chosen freely, but after that x is fixed and, if A is symmetric, the first entry of each subsequent row is fixed. Our dot-product test generator handles all these cases.
This approach can be generalized to most other Level 2 and 3 BLAS, except for the triangular solve function, which is future work.
Note that conventional operator overloading, where the implementation of each binary operation depends only on the argument types and not the result type, is not enough for our purposes. It cannot implement extra = double*double correctly, for example. The alternative would be to promote one argument to the output type, but this could be inefficient depending on how it is implemented (extra = extra*double is significantly more expensive than extra = double*double). This may limit the utility of this feature in F95 and C++.
Second, we will evaluate the overall performance of the library. Our code in the present form is meant to be a reference implementation, serving much the same purpose as the existing Fortran BLAS distribution in Netlib. Our generated code consists only of straightforward loops, without any architectural optimizations, such as blocking for locality or loop unrolling. Because of the high degree of data reuse in double-double arithmetic, this yielded good performance for extended precision, but further improvements are possible, especially for mixed precision.
And third, we will investigate the benefit of the new BLAS for more linear algebra algorithms and applications.
Please send any comments or bug reports to extended_blas@cs.berkeley.edu.