
Using LibSci on Franklin

The Cray Scientific Libraries package, LibSci, is available on Franklin. LibSci includes BLAS, LAPACK, ScaLAPACK, BLACS, and SuperLU routines.

Related documentation: the SuperLU User's Guide and the LAPACK, ScaLAPACK, BLAS, and BLACS man pages.

The LibSci module is automatically loaded upon login. The compiler wrappers (ftn, CC, cc) will automatically load and link to the LibSci library.
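
For example, a code that calls BLAS or LAPACK routines can be built with the wrapper alone, with no explicit library flags (the file name here is illustrative, not from this page):

% ftn -o mysolver mysolver.f90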

The BLAS and LAPACK functionality in LibSci combines Cray custom routines with the University of Texas' libGoto library. The BLAS and LAPACK routines from LibSci are the defaults and are preferred over the corresponding routines provided in ACML. The threaded BLAS in LibSci can be used by setting the GOTO_NUM_THREADS environment variable on Franklin and by using the appropriate Torque options to run fewer than 4 MPI tasks per node.
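
As a minimal sketch of this (the file name and matrix size are illustrative, not from the original page), an ordinary dgemm call like the one below picks up LibSci's threaded BLAS at run time when GOTO_NUM_THREADS is set; no source changes are required:

! dgemm_threads.f90 -- illustrative sketch, not from the original page
! compile:   ftn dgemm_threads.f90
! run:       under aprun with GOTO_NUM_THREADS set (see the sample job script below)

      program dgemm_threads
      implicit none
      integer, parameter :: n = 2000
      real*8, dimension(:,:), allocatable :: a, b, c

      allocate(a(n,n), b(n,n), c(n,n))
      call random_number(a)
      call random_number(b)
      c = 0.d0

! Ordinary BLAS call; LibSci's threaded dgemm uses GOTO_NUM_THREADS threads
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.d0, c, n)
      write(*,*) 'c(n,n) = ', c(n,n)

      deallocate(a, b, c)
      end program dgemm_threads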

ScaLAPACK in LibSci is also supported in mixed mode, i.e., MPI parallelism across compute nodes combined with threaded BLAS parallelism over the cores within a node. This is set up through compile-time and run-time options and environment variables, as follows:

1) adjust the ScaLAPACK process grid dimensions to account for the decrease in BLACS processes;
2) ensure that the number of BLACS processes equals the number of nodes required, NOT the number of cores;
3) link with the libsci_mp library via "-lsci_mp";
4) set GOTO_NUM_THREADS in the PBS job script used to launch the job.

Here is a sample job script for using threaded BLAS or mixed-mode ScaLAPACK:

#!/bin/csh
#PBS -q debug
#PBS -l mppwidth=64
#PBS -l mppnppn=1
#PBS -l walltime=00:30:00
#PBS -j eo

cd $PBS_O_WORKDIR
setenv GOTO_NUM_THREADS 4

ftn atest.f90 -lsci_mp
aprun -n 64 -N 1 ./a.out
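
In this script, mppnppn=1 together with aprun -N 1 places a single MPI task on each four-core Franklin compute node, so the four threads requested through GOTO_NUM_THREADS each have a core to run on; mppwidth=64 requests 64 such tasks, i.e., 64 nodes.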

LibSci also includes an automatic interface to IRT, the Iterative Refinement Toolkit. Iterative refinement can be used "under the covers" in your application by setting the environment variable IRT_USE_SOLVERS to 1. You can therefore obtain significant speed-ups in applications that spend time in LAPACK or ScaLAPACK linear solvers (Cholesky or LU) just by setting this environment variable, which is particularly useful for benchmarkers who cannot change their codes. Please refer to the "intro_irt" man page for details about the explicit interface to IRT, as well as the caveats on the safety of using iterative refinement.

You can use IRT under the covers if your code uses any of the following routines or combinations of routines (a minimal sketch follows the list):

      dgetrf and dgetrs    (LAPACK)
      dpotrf and dpotrs    (LAPACK)
      dgesv                (LAPACK)
      dposv                (LAPACK)
      pdgetrf and pdgetrs  (ScaLAPACK)
      pdpotrf and pdpotrs  (ScaLAPACK)
      pdgesv               (ScaLAPACK)
      pdposv               (ScaLAPACK)
      zgetrf and zgetrs    (LAPACK)
      zpotrf and zpotrs    (LAPACK)
      zgesv                (LAPACK)
      zposv                (LAPACK)
      pzgetrf and pzgetrs  (ScaLAPACK)
      pzpotrf and pzpotrs  (ScaLAPACK)
      pzgesv               (ScaLAPACK)
      pzposv               (ScaLAPACK)
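
For example, the following is a minimal sketch (the file name and problem size are illustrative, not from the original page) of a plain dgesv solve that becomes eligible for under-the-covers iterative refinement once IRT_USE_SOLVERS is set to 1 in the job environment, e.g. with "setenv IRT_USE_SOLVERS 1" in the batch script:

! irt_dgesv.f90 -- illustrative sketch, not from the original page
! compile:   ftn irt_dgesv.f90
! run:       with IRT_USE_SOLVERS set to 1 in the job environment

      program irt_dgesv
      implicit none
      integer, parameter :: n = 1000
      integer :: info
      integer, dimension(:), allocatable :: ipiv
      real*8, dimension(:,:), allocatable :: a
      real*8, dimension(:), allocatable :: b

      allocate(a(n,n), b(n), ipiv(n))
      call random_number(a)
      call random_number(b)

! Ordinary LAPACK call; with IRT_USE_SOLVERS=1 LibSci can apply
! iterative refinement to this solve without any source changes
      call dgesv(n, 1, a, n, ipiv, b, n, info)
      write(*,*) 'dgesv info = ', info, '  b(1) = ', b(1)

      deallocate(a, b, ipiv)
      end program irt_dgesv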

Example Fortran Code using ScaLAPACK from LibSci

Below is the compile and run of a Fortran program that uses ScaLAPACK from LibSci.

Notice that there is no explicit reference to LibSci on the compile line. LibSci is included in the default user environment, and the compiler wrappers perform the linking automatically.

% cat ABCp.f90
! Matrix-Matrix Multiply, AB = C using LibSci ScaLAPACK
! filename:  ABCp.f90
! compile:   ftn -fast -o ABCp ABCp.f90
 
! input:     ABCp.dat
!              prow    number of rows in proc grid
!              pcol    number of columns in proc grid
!              n       number of rows/columns in matrix A
!              nb      matrix distribution block size   
 
! output:    fort.u, where u=10+processor number, and stdout
 
 
      implicit none
 
      integer :: n, nb    ! problem size and block size
      integer :: myunit   ! local output unit number
      integer :: myArows, myAcols   ! size of local subset of global array
      integer :: i,j, igrid,jgrid, iproc,jproc, myi,myj, p
      real*8, dimension(:,:), allocatable :: myA,myB,myC
 
      integer :: numroc   ! blacs routine
      integer :: me, procs, icontxt, prow, pcol, myrow, mycol  ! blacs data
      integer :: info    ! scalapack return value
      integer, dimension(9)   :: ides_a, ides_b, ides_c ! scalapack array desc
 
! Read problem description
 
      open(unit=1,file="ABCp.dat",status="old",form="formatted")
      read(1,*)prow
      read(1,*)pcol
      read(1,*)n
      read(1,*)nb
      close(1)
 
      if (((n/nb) < prow) .or. ((n/nb) < pcol)) then
         print *,"Problem size too small for processor set!"
         stop 100
      endif
 
! Initialize blacs processor grid
 
      call blacs_pinfo   (me,procs)
      call blacs_get     (0, 0, icontxt)
      call blacs_gridinit(icontxt, 'R', prow, pcol)
      call blacs_gridinfo(icontxt, prow, pcol, myrow, mycol)
      myunit = 10+me
      write(myunit,*)"--------"
      write(myunit,*)"Output for processor ",me," to unit ",myunit
      write(myunit,*)"Proc ",me,": myrow, mycol in p-array is ", &
         myrow, mycol
 
! Construct local arrays
! Global structure:  matrix A of n rows and n columns
!                    matrix B of n rows and n columns
!                    matrix C of n rows and n columns
 
      myArows = numroc(n, nb, myrow, 0, prow)
      myAcols = numroc(n, nb, mycol, 0, pcol)
      write(myunit,*)"Size of global array is ",n," x ",n
      write(myunit,*)"Size of block is        ",nb," x ",nb
      write(myunit,*)"Size of local array is  ",myArows," x ",myAcols
 
! Initialize local arrays    
 
      allocate(myA(myArows,myAcols)) 
      allocate(myB(myArows,myAcols)) 
      allocate(myC(myArows,myAcols)) 
 
      do i=1,n
         call g2l(i,n,prow,nb,iproc,myi)
         if (myrow==iproc) then
            do j=1,n
            call g2l(j,n,pcol,nb,jproc,myj)
               if (mycol==jproc) then
                  myA(myi,myj) = real(i+j)
                  myB(myi,myj) = real(i-j)
                  myC(myi,myj) = 0.d0
 
!                 write(myunit,*)"A(",i,",",j,")", &
!                                " --> myA(",myi,",",myj,")=",myA(myi,myj), &
!                                "on proc(",iproc,",",jproc,")"
 
               endif
            enddo
         endif
      enddo
 
! Prepare array descriptors for ScaLAPACK 
 
      ides_a(1) = 1         ! descriptor type
      ides_a(2) = icontxt   ! blacs context
      ides_a(3) = n         ! global number of rows
      ides_a(4) = n         ! global number of columns
      ides_a(5) = nb        ! row block size
      ides_a(6) = nb        ! column block size
      ides_a(7) = 0         ! initial process row
      ides_a(8) = 0         ! initial process column
      ides_a(9) = myArows   ! leading dimension of local array
 
      do i=1,9
         ides_b(i) = ides_a(i)
         ides_c(i) = ides_a(i)
      enddo
 
! Call ScaLAPACK library routine
 
      call pdgemm('T','T',n,n,n,1.0d0, myA,1,1,ides_a,  &
                    myB,1,1,ides_b,0.d0, &
                    myC,1,1,ides_c )
 
! Print results
 
      call g2l(n,n,prow,nb,iproc,myi)
      call g2l(n,n,pcol,nb,jproc,myj)
      if ((myrow==iproc) .and. (mycol==jproc))  &
         write(*,*) 'c(',n,n,')=',myC(myi,myj)
 
! Deallocate the local arrays
 
      deallocate(myA, myB, myC)
 
! End blacs for processors that are used
 
      call blacs_gridexit(icontxt)
      call blacs_exit(0)
 
      end
 
! convert global index to local index in block-cyclic distribution
 
   subroutine g2l(i,n,np,nb,p,il)
 
   implicit none
   integer :: i    ! global array index, input
   integer :: n    ! global array dimension, input
   integer :: np   ! processor array dimension, input
   integer :: nb   ! block size, input
   integer :: p    ! processor array index, output
   integer :: il   ! local array index, output
   integer :: im1   
 
   im1 = i-1
   p   = mod((im1/nb),np)
   il  = (im1/(np*nb))*nb + mod(im1,nb) + 1
   
   return
   end
 
! convert local index to global index in block-cyclic distribution
 
   subroutine l2g(il,p,n,np,nb,i)
 
   implicit none
   integer :: il   ! local array index, input
   integer :: p    ! processor array index, input
   integer :: n    ! global array dimension, input
   integer :: np   ! processor array dimension, input
   integer :: nb   ! block size, input
   integer :: i    ! global array index, output
   integer :: ilm1   
 
   ilm1 = il-1
   i    = (((ilm1/nb) * np) + p)*nb + mod(ilm1,nb) + 1
   
   return
   end
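
As a quick check of g2l (worked by hand, not part of the original page): with nb = 64 and np = 8, the global index i = 20000 gives p = mod(19999/64, 8) = mod(312, 8) = 0 and il = (19999/512)*64 + mod(19999,64) + 1 = 39*64 + 31 + 1 = 2528, so element (20000,20000) resides on process (0,0) at local position (2528,2528), which is exactly where the final print statement reads myC.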

% cat runit
#PBS -N scalapackjob 
#PBS -q debug
#PBS -l mppwidth=64
#PBS -l walltime=00:05:00
#PBS -e run.out
#PBS -j eo
 
cd $PBS_O_WORKDIR
 
ftn -o ABCp -fast ABCp.f90   
aprun -n 64 ./ABCp

% cat ABCp.dat
8
8
20000
64

% cat run.out

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/opt/xt-asyncpe/1.0c/bin/ftn: INFO: linux target is being used
 c(        20000        20000 )=    5333133330000.000
Application 243414 resources: utime 0, stime 0
 
   -------------------------- Batch Job Report ------------------------------
 
   Job Id:         5570270.nid00003
   User Name:      yunhe
   Group Name:     yunhe
   Job Name:       scalapackjob
   Session ID:     31162
   Resource List:  walltime=00:05:00
   Queue Name:     debug
   Account String: mpccc
   Total Tasks:    64
   Tasks per Node: 4
   Node Count:     16
 
Job Exit Summary:
   APINFO_SUCCESS:  application completed with no detectable system errors
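
As a back-of-the-envelope check on the input above (worked out here, not part of the original output): with n = 20000, nb = 64, and an 8 x 8 process grid, each matrix dimension has 312 full 64-element blocks plus a 32-element remainder, so numroc gives every process 39 blocks (2496 local rows/columns) and the first process row and column an extra 32, for a maximum local array of 2528 x 2528. At 8 bytes per element that is roughly 51 MB per local array, or about 150 MB per process for myA, myB, and myC combined.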

