NAME
	  bml


     DESCRIPTION
	  Kuck and Associates' CLASSPACK Basic Math Library contains
	  highly optimized versions of standard	computational building
	  blocks which users of	high-performance computers can use to
	  improve the speed and	portability of their numerical appli-
	  cations programs.

	  The BLAS (Basic Linear Algebra Subprograms) routines,	which
	  make up the bulk of this library, are	widely used in dense
	  numerical linear algebra programs.  Since they are the most
	  useful vector	and matrix operations, the availability	of
	  versions tuned to the	user's machine provides	a simple way
	  for speeding up existing programs and	writing	new numerical
	  programs.  They also enhance portability, since programs can
	  contain standard calls to a widely-available set of library
	  routines.

	  The BLAS routines were developed in three groups.  The BLAS
	  1 routines consist of	vector-vector operations such as dot
	  products and "scalar * vector	+ vector".  The	BLAS 2 rou-
	  tines	provide	matrix-vector operations such as matrix-vector
	  multiply and the rank-1 update of a matrix. The BLAS 3 rou-
	  tines	provide	matrix-matrix operations such as rank-k	update
	  and the solution of triangular systems with multiple right-
	  hand sides.  This library contains all three groups.

	  The Fourier Transform	is a common operation in many fields,
	  especially signal processing.	 Kuck and Associates' Basic
	  Math Library includes	three routines (in single-precision
	  and double-precision versions) for performing	FFTs (Fast
	  Fourier Transforms).	These have been	tuned for optimal per-
	  formance on this machine.

	  While	FFT routines are not as	standardized as	the BLAS rou-
	  tines, portability is	eased by limiting changes to small
	  sections of code which interface between the user's program
	  and the available library.

	  In addition to BLAS and Fast Fourier Transform routines,
	  special routines are provided	for machines with Intel	i860
	  processors.  These routines include the sparse BLAS routines
	  for performing the scalar-vector product and dot product,
	  special scalar-vector	product	routines designed for the case
	  when the result vector resides in the	on-chip	cache, several
	  special vector-vector	operations, programs for interchanging
	  the dimensions of 2-,	3, and 4-dimensional arrays, and sub-
	  routines for solving tridiagonal, block tridiagonal, penta-
	  diagonal, and	block pentadiagonal systems.
	  The Basic Math Library object	routines are packaged as a
	  library archive named	libkmath.a.  Programs must be linked
	  with this library.

	  The descriptions of the routines in this library specify the
	  legal	values for all scalar integer and character arguments.
	  The routines themselves check	these arguments	when they are
	  called to determine if they have legal values.  If the value
	  of one of the	arguments is not legal,	an error action	fol-
	  lows.	 The action taken depends on the class of routine.

	    o BLAS Level 1 Routines: Return from the program without
	    performing any computations.

	    o BLAS Level 2 Routines: Issue an error message which con-
	    tains the name of the routine and the number of the	argu-
	    ment with the illegal value, e.g.,

	       ** On entry to DGEMV parameter number 1 had an illegal
	    value

	    o BLAS Level 3 Routines: Issue an error message which con-
	    tains the name of the routine and the number of the	argu-
	    ment with the illegal value.

	       ** On entry to DGEMM parameter number 1 had an illegal
	    value

	    o Fast Fourier Transforms: return from the program without
	    performing any computations.

	  Functions that return	values should be declared with a
	  statement
	       extern type function();
	  where	type is	the data type of the result and	function is
	  the name of the function.

	  The following	functions in the Basic Math Library/C return
	  integer values:
	       isamax, idamax, icamax, izamax

	  The following	functions in the Basic Math Library/C return
	  real values:
	       sasum, scasum, sdot, sdoti, sdsdot, snrm2, scnrm2

	  The following	functions in the Basic Math Library/C return
	  doublereal values:
	       dasum, dzasum, ddot, ddoti, dnrm2, dznrm2

	  The following	functions in the Basic Math Library/C return
	  complex values:
	       cdotc, cdotu

	  The following	functions in the Basic Math Library/C return
	  doublecomplex	values:
	       zdotc, zdotu

	  The following	sections describe the BLAS (Basic Linear Alge-
	  bra Subprograms) routines available in the Basic Math
	  Library/C.

	  The routines are arranged in alphabetical order, ignoring
	  the prefixes which indicate the data type the	specific rou-
	  tine uses, e.g., routine names beginning with	d are double-
	  real.	 All of	the BLAS routines come in multiple versions,
	  usually for real, doublereal,	complex, and doublecomplex.

	  The data types mentioned above are defined as:

		   typedef  float  real;
		   typedef  double  doublereal;
		   typedef  struct { float r,i } complex;
		   typedef  struct { double r,i	} doublecomplex;
		   typedef  longint  integer;


	  These	data types (or equivalents) must be defined in your
	  program to produce correct results when using	the BLAS rou-
	  tines.

	  For more information,	consult	the following articles:

	  C. Lawson, R.	Hanson,	D. Kincaid, and	F. Krough, "Basic
	  Linear Algebra Subprograms for Fortran Usage," ACM Trans. on
	  Math.	Soft. 5	(1979) 308-325.

	  J. Dongarra, J. DuCroz, S. Hammarling, and R.	Hanson,	"An
	  Extended Set of Fortran Basic	Linear Algebra Subprograms,"
	  ACM Trans. on	Math. Soft. 14,1 (1988)	1-32.

	  J. Dongarra, J. DuCroz, I. Duff, and S. Hammarling, "A Set
	  of Level 3 Basic Linear Algebra Subprograms,"	ACM Trans on
	  Math.	Soft. (Dec. 1989).

     ROUTINE ARGUMENTS
	  Vector arguments are passed in one-dimensional arrays.
	  Associated with the vector are a length and an increment
	  which	are passed as integer variables.  The length specifies
	  the number of	elements in the	vector.	 The increment (also
	  called stride) of the	vector specifies the spacing between
	  vector elements and the order	of the elements	in the one-
	  dimensional array in which the vector	is passed.  If a vec-
	  tor of length	n and increment	incx is	passed in a one-
	  dimensional array x, then its	values stored at x[0],
	  x[|incx|], ..., x[ (n-1)*|incx| ].  If incx is positive,
	  then the elements are	stored in increasing order in the
	  array	x.  If incx is negative, then the elements are stored
	  in decreasing	order with the first element being stored in
	  x[ (n-1)*|incx| ].  If incx is zero, then all	elements of
	  the vector have the same value which is stored in x[0].  The
	  dimension of the one-dimensional array holding the vector
	  must always be at least
			    idimx = 1 +	(n-1)*|incx|

	  Example 1: Let x[0] to x[9] be the one-dimensional real
	  array
	   x = ( 1.0, 2.0, 3.0,	4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0 )
	  If
			       incx = 2	and n =	4
	  then the vector argument with	elements in order from first
	  to last is
			     ( 1.0, 3.0, 5.0, 7.0 )
	  If
			      incx = -2	and n =	4
	  then the vector argument with	elements in order from first
	  to last is
			     ( 7.0, 5.0, 3.0, 1.0 )
	  If
			       incx = 0	and n =	4
	  then the vector argument with	elements in order from first
	  to last is
			     ( 1.0, 1.0, 1.0, 1.0 )

	  One-dimensional substructures	of a matrix, such as the rows,
	  columns, and diagonals, can be passed	as vector arguments
	  provided that	the correct starting address and increment are
	  specified.  With C row-major ordering	used for storing the
	  m-by-n matrix, the increment between elements	in the same
	  row is 1, the	increment between elements in the same column
	  is n,	and the	increment between elements on the same diago-
	  nal is n+1.  Example 2: Let A	be the real 5x4	matrix
	  declared as
		     real a[5][4];

	  The third row	of A can be scaled by 2.0 by invoking the BLAS
	  routine sscal	with the following statement:
		     sscal (5, 2.0, a[2][0], 1);

	  The second column can	be scaled by 2.0 with the statement:
		     sscal (4, 2.0, a[0][1], 4);

	  The main diagonal of A can be	scaled by 2.0 with the state-
	  ment:
		     sscal (5, 2.0, a[0][0], 5);



	  Notes: Some routines have the	restriction that the vector
	  increment cannot be zero. This restriction is	required to
	  maintain compatibility with the Argonne BLAS interface.  If
	  the increment	of a vector argument is	not specified, then it
	  is assumed to	be 1.

	  Matrix arguments are passed in two-dimensional arrays. C
	  row-major ordering of	the storage is assumed,	i.e., elements
	  of the same row occupy successive storage locations.	Asso-
	  ciated with a	matrix argument	are three quantities: its
	  leading dimension which specifies the	number of storage
	  locations between elements in	the same column, its number of
	  rows,	and its	number of columns.  The	leading	dimension of
	  the matrix must always be at least as	large as the number of
	  columns. In addition,	a character transposition parameter is
	  often	passed which indicates whether the matrix argument is
	  to be	used in	normal or transposed form or, if the matrix is
	  complex, if the conjugate transpose of the matrix is to be
	  used.	 The values of the transposition parameter for these
	  cases	are 'N', respectively.

     EXAMPLES
	  This section presents	examples illustrating the calling
	  sequence of functions	in the Basic Math Library.

	  Example 1: The following program illustrates a call to the
	  BLAS 1 routine saxpy.

		       main()
		       {
			 typedef long int integer;
			 typedef float	  real;

			 integer n=5, incx=1, incy=-2;
			 real alpha=1.1;
			 real x[5] = { 0.1, 0.2, 0.3, 0.4, 0.5 };
			 real y[9] = { 1.0, 2.0, 3.0, 4.0, 5.0,	6.0, 7.0, 8.0, 9.0 };

			 saxpy(	n, alpha, x, incx, y, incy );

			 printf	(" %f %f %f %f %f \ n %f %f %f %f \ n",	y[0], y[1],
			    y[2], y[3],	y[4], y[5], y[6], y[7],	y[8]);
		      }

	  Output:

	   1.550000 2.000000 3.440000 4.000000 5.330000
	   6.000000 7.220000 8.000000 9.110000

	  Example 2: The following program illustrates a call to the
	  BLAS 2 routine sger.

		   main()
		   {
		       typedef long int	integer;
		       typedef float	real;

		       integer m=3, n=2, lda=2,	incx= 1, incy=2, i;
		       real alpha=0.1;
		       real x[3] ={ 0.1, 0.2, 0.3 };
		       real y[3] ={ 1.0, 2.0, 3.0 };
		       real a[4][2] ={{1.1, 1.2},{2.1, 2.2},{3.1, 3.2},{4.1, 4.2}};

		       sger( m,	n, alpha, x, incx, y, incy, a, lda );

		       for (i=0;i<=3;++i){
			  printf (" %f %f \ n",	a[i][0], a[i][1]);
		       }
		    }

	  Output:

	      1.110000 1.230000
	      2.120000 2.260000
	      3.130000 3.290000
	      4.100000 4.200000

	  Example 3: The following program illustrates a call to the
	  BLAS 3 routine strmm.

		    main()
		    {
		      typedef long int integer;
		      typedef float    real;

		      integer  m=3, n=2, lda=3,	ldb=2, i;
		      char     side= 'L', uplo='U', transa='N',	diag='U';
		      real     alpha = 1.0 ;
		      real  a[4][3] ={{1.1, 1.2, 1.3},{2.1, 2.2, 2.3},
				      {3.1, 3.2, 3.3},{4.1, 4.2, 4.3}}
		      real  b[5][2] ={{0.1,0.6}, {0.2,0.7}, {0.3,0.8},
				      {0.4,0.9}, {0.5,1.0}}

		      strmm( side, uplo, transa, diag, m, n, alpha, a, lda, b, ldb );

		      for (i=1;i<=5;++i) {
			  printf ("%f %f \ n", b[i][0],	b[i][1]);
		      }
		    }
	  Output:

	      0.073000 0.248000
	      0.089000 0.254000
	      0.030000 0.080000
	      0.400000 0.900000
	      0.500000 1.000000

	  Example 4: The following program illustrates calls to	the
	  FFT routines scfft1d and csfft1d:

		   main()
		   {
		      typedef long int integer;
		      typedef float    real;

		      integer n=8, iflag;
		      real r[10] = {1.0, 2.0, 3.0, 4.0,	5.0,
				    6.0, 7.0, 8.0, 9.0,	10.0};
		      real wsave[20];

		      iflag = 0;
		      scfft1d (r, n, iflag, wsave);
		      iflag = 1;
		      scfft1d (r, n, iflag, wsave);

		      printf ("	%f %f %f %f %f %f \ n %f %f %f %f \  n",
			 r[0], r[1], r[2], r[3], r[4], r[5], r[6], r[7], r[8], r[9];

		      iflag=1;
		      csfft1d (r, n, iflag, wsave);

		      printf ("	%f %f %f %f %f %f \ n %f %f %f %f \  n",
			 r[0], r[1], r[2], r[3], r[4], r[5], r[6], r[7], r[8], r[9];

		  }
	  Output:

	    36.000000 0.000000 -4.000000 9.656855 -4.000000 4.000000
	    -4.000000 1.656854 -4.000000 0.000000
	    1.000000 2.000000 3.000000 4.000000	5.000000 6.000000
	    7.000000 8.000000 0.000000 0.000000

     PERFORMANCE HINTS
	  This section presents	suggestions on how to obtain optimal
	  performance from the routines	in the Basic Math Library on
	  computers which use the Intel	i860 Processor.	 Whenever pos-
	  sible, follow	these guidelines concerning array arguments to
	  routines in the library: Vector arguments should have	an
	  increment equal to one.  doublecomplex arrays	should always
	  begin	on a 16-byte boundary, i.e., the memory	address	of the
	  first	element	of the array should be divisible by 16.	 This
	  can be accomplished by declaring the arrays outside of the
	  main program,	making them common (global) arrays.  The fol-
	  lowing

			#include  "kai_c.h"
			   doublecomplex  X[100], Y[100]

			   main()
			   {
			      ...
			   }

	  will force X and Y onto 16-byte storage boundaries.  Array
	  arguments to the FFT programs	should lie on 16-byte storage
	  boundaries.  Vector arguments	to BLAS	1 routines should be
	  in the data cache. If	there are two vector arguments to a
	  BLAS 1 routine, it is	better if only the first argument is
	  in cache rather than only the	second argument.  A vector
	  will usually be in cache after it has	just been referenced,
	  provided it is small enough (no larger than 8192 bytes). The
	  first	loop below will	generally perform better than the
	  second, since	X will be in cache after the first call	and is
	  accessed as the first	operand	while Y	will not remain	in
	  cache.
		  incx=1;
		  incy=1;
		  n=1000;
		  for (j=0;j<=99;++j) {
		     s	= s + sdot (n, X, incx,	&Y[j][0], incy);
		  }

		  for (j=0;j<=99;++j) {
		     s = s + sdot (n, &Y[j][0],	incy, X, incx);
		  }

	  There	is not much impact if the vector arguments to the BLAS
	  2 or FFT routines are	not in cache because these routines
	  are designed to manage the cache. The	BLAS 3 routines	also
	  manage the cache but have no vector arguments.
Acknowledgement and Disclaimer