NAME
bml
DESCRIPTION
Kuck and Associates' CLASSPACK Basic Math Library contains
highly optimized versions of standard computational building
blocks which users of high-performance computers can use to
improve the speed and portability of their numerical appli-
cations programs.
The BLAS (Basic Linear Algebra Subprograms) routines, which
make up the bulk of this library, are widely used in dense
numerical linear algebra programs. Since they are the most
useful vector and matrix operations, the availability of
versions tuned to the user's machine provides a simple way
for speeding up existing programs and writing new numerical
programs. They also enhance portability, since programs can
contain standard calls to a widely-available set of library
routines.
The BLAS routines were developed in three groups. The BLAS
1 routines consist of vector-vector operations such as dot
products and "scalar * vector + vector". The BLAS 2 rou-
tines provide matrix-vector operations such as matrix-vector
multiply and the rank-1 update of a matrix. The BLAS 3 rou-
tines provide matrix-matrix operations such as rank-k update
and the solution of triangular systems with multiple right-
hand sides. This library contains all three groups.
The Fourier Transform is a common operation in many fields,
especially signal processing. Kuck and Associates' Basic
Math Library includes three routines (in single-precision
and double-precision versions) for performing FFTs (Fast
Fourier Transforms). These have been tuned for optimal per-
formance on this machine.
While FFT routines are not as standardized as the BLAS rou-
tines, portability is eased by limiting changes to small
sections of code which interface between the user's program
and the available library.
In addition to BLAS and Fast Fourier Transform routines,
special routines are provided for machines with Intel i860
processors. These routines include the sparse BLAS routines
for performing the scalar-vector product and dot product,
special scalar-vector product routines designed for the case
when the result vector resides in the on-chip cache, several
special vector-vector operations, programs for interchanging
the dimensions of 2-, 3, and 4-dimensional arrays, and sub-
routines for solving tridiagonal, block tridiagonal, penta-
diagonal, and block pentadiagonal systems.
The Basic Math Library object routines are packaged as a
library archive named libkmath.a. Programs must be linked
with this library.
The descriptions of the routines in this library specify the
legal values for all scalar integer and character arguments.
The routines themselves check these arguments when they are
called to determine if they have legal values. If the value
of one of the arguments is not legal, an error action fol-
lows. The action taken depends on the class of routine.
o BLAS Level 1 Routines: Return from the program without
performing any computations.
o BLAS Level 2 Routines: Issue an error message which con-
tains the name of the routine and the number of the argu-
ment with the illegal value, e.g.,
** On entry to DGEMV parameter number 1 had an illegal
value
o BLAS Level 3 Routines: Issue an error message which con-
tains the name of the routine and the number of the argu-
ment with the illegal value.
** On entry to DGEMM parameter number 1 had an illegal
value
o Fast Fourier Transforms: return from the program without
performing any computations.
Functions that return values should be declared with a
statement
extern type function();
where type is the data type of the result and function is
the name of the function.
The following functions in the Basic Math Library/C return
integer values:
isamax, idamax, icamax, izamax
The following functions in the Basic Math Library/C return
real values:
sasum, scasum, sdot, sdoti, sdsdot, snrm2, scnrm2
The following functions in the Basic Math Library/C return
doublereal values:
dasum, dzasum, ddot, ddoti, dnrm2, dznrm2
The following functions in the Basic Math Library/C return
complex values:
cdotc, cdotu
The following functions in the Basic Math Library/C return
doublecomplex values:
zdotc, zdotu
The following sections describe the BLAS (Basic Linear Alge-
bra Subprograms) routines available in the Basic Math
Library/C.
The routines are arranged in alphabetical order, ignoring
the prefixes which indicate the data type the specific rou-
tine uses, e.g., routine names beginning with d are double-
real. All of the BLAS routines come in multiple versions,
usually for real, doublereal, complex, and doublecomplex.
The data types mentioned above are defined as:
typedef float real;
typedef double doublereal;
typedef struct { float r,i } complex;
typedef struct { double r,i } doublecomplex;
typedef longint integer;
These data types (or equivalents) must be defined in your
program to produce correct results when using the BLAS rou-
tines.
For more information, consult the following articles:
C. Lawson, R. Hanson, D. Kincaid, and F. Krough, "Basic
Linear Algebra Subprograms for Fortran Usage," ACM Trans. on
Math. Soft. 5 (1979) 308-325.
J. Dongarra, J. DuCroz, S. Hammarling, and R. Hanson, "An
Extended Set of Fortran Basic Linear Algebra Subprograms,"
ACM Trans. on Math. Soft. 14,1 (1988) 1-32.
J. Dongarra, J. DuCroz, I. Duff, and S. Hammarling, "A Set
of Level 3 Basic Linear Algebra Subprograms," ACM Trans on
Math. Soft. (Dec. 1989).
ROUTINE ARGUMENTS
Vector arguments are passed in one-dimensional arrays.
Associated with the vector are a length and an increment
which are passed as integer variables. The length specifies
the number of elements in the vector. The increment (also
called stride) of the vector specifies the spacing between
vector elements and the order of the elements in the one-
dimensional array in which the vector is passed. If a vec-
tor of length n and increment incx is passed in a one-
dimensional array x, then its values stored at x[0],
x[|incx|], ..., x[ (n-1)*|incx| ]. If incx is positive,
then the elements are stored in increasing order in the
array x. If incx is negative, then the elements are stored
in decreasing order with the first element being stored in
x[ (n-1)*|incx| ]. If incx is zero, then all elements of
the vector have the same value which is stored in x[0]. The
dimension of the one-dimensional array holding the vector
must always be at least
idimx = 1 + (n-1)*|incx|
Example 1: Let x[0] to x[9] be the one-dimensional real
array
x = ( 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0 )
If
incx = 2 and n = 4
then the vector argument with elements in order from first
to last is
( 1.0, 3.0, 5.0, 7.0 )
If
incx = -2 and n = 4
then the vector argument with elements in order from first
to last is
( 7.0, 5.0, 3.0, 1.0 )
If
incx = 0 and n = 4
then the vector argument with elements in order from first
to last is
( 1.0, 1.0, 1.0, 1.0 )
One-dimensional substructures of a matrix, such as the rows,
columns, and diagonals, can be passed as vector arguments
provided that the correct starting address and increment are
specified. With C row-major ordering used for storing the
m-by-n matrix, the increment between elements in the same
row is 1, the increment between elements in the same column
is n, and the increment between elements on the same diago-
nal is n+1. Example 2: Let A be the real 5x4 matrix
declared as
real a[5][4];
The third row of A can be scaled by 2.0 by invoking the BLAS
routine sscal with the following statement:
sscal (5, 2.0, a[2][0], 1);
The second column can be scaled by 2.0 with the statement:
sscal (4, 2.0, a[0][1], 4);
The main diagonal of A can be scaled by 2.0 with the state-
ment:
sscal (5, 2.0, a[0][0], 5);
Notes: Some routines have the restriction that the vector
increment cannot be zero. This restriction is required to
maintain compatibility with the Argonne BLAS interface. If
the increment of a vector argument is not specified, then it
is assumed to be 1.
Matrix arguments are passed in two-dimensional arrays. C
row-major ordering of the storage is assumed, i.e., elements
of the same row occupy successive storage locations. Asso-
ciated with a matrix argument are three quantities: its
leading dimension which specifies the number of storage
locations between elements in the same column, its number of
rows, and its number of columns. The leading dimension of
the matrix must always be at least as large as the number of
columns. In addition, a character transposition parameter is
often passed which indicates whether the matrix argument is
to be used in normal or transposed form or, if the matrix is
complex, if the conjugate transpose of the matrix is to be
used. The values of the transposition parameter for these
cases are 'N', respectively.
EXAMPLES
This section presents examples illustrating the calling
sequence of functions in the Basic Math Library.
Example 1: The following program illustrates a call to the
BLAS 1 routine saxpy.
main()
{
typedef long int integer;
typedef float real;
integer n=5, incx=1, incy=-2;
real alpha=1.1;
real x[5] = { 0.1, 0.2, 0.3, 0.4, 0.5 };
real y[9] = { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0 };
saxpy( n, alpha, x, incx, y, incy );
printf (" %f %f %f %f %f \ n %f %f %f %f \ n", y[0], y[1],
y[2], y[3], y[4], y[5], y[6], y[7], y[8]);
}
Output:
1.550000 2.000000 3.440000 4.000000 5.330000
6.000000 7.220000 8.000000 9.110000
Example 2: The following program illustrates a call to the
BLAS 2 routine sger.
main()
{
typedef long int integer;
typedef float real;
integer m=3, n=2, lda=2, incx= 1, incy=2, i;
real alpha=0.1;
real x[3] ={ 0.1, 0.2, 0.3 };
real y[3] ={ 1.0, 2.0, 3.0 };
real a[4][2] ={{1.1, 1.2},{2.1, 2.2},{3.1, 3.2},{4.1, 4.2}};
sger( m, n, alpha, x, incx, y, incy, a, lda );
for (i=0;i<=3;++i){
printf (" %f %f \ n", a[i][0], a[i][1]);
}
}
Output:
1.110000 1.230000
2.120000 2.260000
3.130000 3.290000
4.100000 4.200000
Example 3: The following program illustrates a call to the
BLAS 3 routine strmm.
main()
{
typedef long int integer;
typedef float real;
integer m=3, n=2, lda=3, ldb=2, i;
char side= 'L', uplo='U', transa='N', diag='U';
real alpha = 1.0 ;
real a[4][3] ={{1.1, 1.2, 1.3},{2.1, 2.2, 2.3},
{3.1, 3.2, 3.3},{4.1, 4.2, 4.3}}
real b[5][2] ={{0.1,0.6}, {0.2,0.7}, {0.3,0.8},
{0.4,0.9}, {0.5,1.0}}
strmm( side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb );
for (i=1;i<=5;++i) {
printf ("%f %f \ n", b[i][0], b[i][1]);
}
}
Output:
0.073000 0.248000
0.089000 0.254000
0.030000 0.080000
0.400000 0.900000
0.500000 1.000000
Example 4: The following program illustrates calls to the
FFT routines scfft1d and csfft1d:
main()
{
typedef long int integer;
typedef float real;
integer n=8, iflag;
real r[10] = {1.0, 2.0, 3.0, 4.0, 5.0,
6.0, 7.0, 8.0, 9.0, 10.0};
real wsave[20];
iflag = 0;
scfft1d (r, n, iflag, wsave);
iflag = 1;
scfft1d (r, n, iflag, wsave);
printf (" %f %f %f %f %f %f \ n %f %f %f %f \ n",
r[0], r[1], r[2], r[3], r[4], r[5], r[6], r[7], r[8], r[9];
iflag=1;
csfft1d (r, n, iflag, wsave);
printf (" %f %f %f %f %f %f \ n %f %f %f %f \ n",
r[0], r[1], r[2], r[3], r[4], r[5], r[6], r[7], r[8], r[9];
}
Output:
36.000000 0.000000 -4.000000 9.656855 -4.000000 4.000000
-4.000000 1.656854 -4.000000 0.000000
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000
7.000000 8.000000 0.000000 0.000000
PERFORMANCE HINTS
This section presents suggestions on how to obtain optimal
performance from the routines in the Basic Math Library on
computers which use the Intel i860 Processor. Whenever pos-
sible, follow these guidelines concerning array arguments to
routines in the library: Vector arguments should have an
increment equal to one. doublecomplex arrays should always
begin on a 16-byte boundary, i.e., the memory address of the
first element of the array should be divisible by 16. This
can be accomplished by declaring the arrays outside of the
main program, making them common (global) arrays. The fol-
lowing
#include "kai_c.h"
doublecomplex X[100], Y[100]
main()
{
...
}
will force X and Y onto 16-byte storage boundaries. Array
arguments to the FFT programs should lie on 16-byte storage
boundaries. Vector arguments to BLAS 1 routines should be
in the data cache. If there are two vector arguments to a
BLAS 1 routine, it is better if only the first argument is
in cache rather than only the second argument. A vector
will usually be in cache after it has just been referenced,
provided it is small enough (no larger than 8192 bytes). The
first loop below will generally perform better than the
second, since X will be in cache after the first call and is
accessed as the first operand while Y will not remain in
cache.
incx=1;
incy=1;
n=1000;
for (j=0;j<=99;++j) {
s = s + sdot (n, X, incx, &Y[j][0], incy);
}
for (j=0;j<=99;++j) {
s = s + sdot (n, &Y[j][0], incy, X, incx);
}
There is not much impact if the vector arguments to the BLAS
2 or FFT routines are not in cache because these routines
are designed to manage the cache. The BLAS 3 routines also
manage the cache but have no vector arguments.
Acknowledgement and Disclaimer