Weekly Communication

Week June 28, 2002

FOR symmetric operator
    I observed more than 2 times speedup when I switch gmres+ilu
    to icc+cg in sloving the Dagger operator. But after a few time
    steps, I got "current blowup". That means the algorithm is
    unstable. I need to check the code again to be sure that
    everything is correct.

    This is what I did:

    I refered to Hanks's OMP version and followed the
    same rule to form the star and dagger matriices.

    Then I solved them in 3 ways:
    1) using gmres+ilu
         -0.8754     0.4613e-11     -0.3435e-5
         -0.3687     0.2878e-11     -0.8994e-5  --- original

    2) using icc+ilu
         -1.2726     0.2108e-11     -0.3080e-5
         -0.3687     0.2878e-11     -0.8994e-5  --- original

    3) using icc+cg
       no problem but unstable.
       
       I think this is due to the boundary condition.

    So I need to talk to Hank how he did in the OMP version to
    keep the matrix symmetric under different boundary conditions.

    BTW, I didn't see and efficiency improvement when ilu is
    switched to icc.

Week June 14, 2002

FOR symmetric operator
    I have implemented the symmetric operators and their solvers 
    in FOUR ways. Some of them have close digits, but none of them 
    completely match. These differences are due to the diagonalized
    lumped approximation lmass (userData->work.lmass):
           _             _             -                         -
    lmass=| sum_j mass_1j |  to  Mass=|mass_11 mass_12 ... mass_1k|
          | sum_j mass_2j |           |mass_21 mass_22 ... mass_2k|
          |  .            |           |  .       .     ...   .    |
          |  .            |           |  .       .     ...   .    |
          |  .            |           |  .       .     ...   .    |
          | sum_j mass_lj |           |mass_l1 mass_l2 ... mass_lk|
           -             -             -                         -

    
EXAMPLE I: grad * grad U = W

    int grad * grad U lambda_i da = int W lambda_i da  
    -->>
    - sum_j (int grad lambda_j * grad lambda_i da) U_j = 
    __________________________________________________
    |             sum_j (int lambda_i lambda_j da) W_j
    |             ____________________________________
    |             |
    |             |
    Stiff * U  =  Mass * W
                  |
                  |
                  ->lmass * W
                  ->matrix * vector ==>> vector * vector
                       

EXAMPLE II: 1/R * grad * R * grad U - 1/(dt * g) = 1/(dt * g) W

    int 1/R *grad*R*grad U lambda_i da - 
    __________________________________
    |                  int 1/(dt*g) lambda_i da = 
    |                  ________________________
    |                  |               int 1/(dt*g) W lambda_i da
    |                  |               __________________________
    |                  |               |
    |                  |               |
    DaggerMatrix * U - Mass * D * U =  Mass * D * W
                       |               |
                       |               |
                       ->lmass * D * U |
                                       ->lmass * D * W
                       ->matrix * D ==>> vector * D

    D is the diagonal matrix given by
           _                      _
        D=| 1/(dt*g_1)             |
          |    1/(dt*g_2)          |
          |      .                 |   g_j is resistivity|viscosity
          |        .               |   So D changes in every call.
          |          .             |
          |            1/(dt*g_k)  |
           -                      -
    
JUSTIFY lmass*W approximating Mass*W
    1- Mass is symmetric, so there exits matrix X:
       X^T Mass X = Diag(eigen_1, eigen_2, ... eigen_k)
                         |
                         |
                         first eigenvalue of Mass

    2- According to Gershgorin theory, 
       |eigen_i| <= sum_j |mass_ij|
                  = sum_j  mass_ij   as mass_{ij}s are positive.
                    |
                    |
                    the ith component of lmass

    ==>> This might be the reason that lmass*W & lmass*D*W works.

FOUR ways used to implement the symmetric star and dagger operators
        grad * R   * grad U - R/(dt*g)   * U = -R/(dt*g)   * W
        grad *(1/R)* grad A - 1/(dt*g*R) * U = -1/(dt*g*R) * W

    1- lamss*DR*W to approximate Mass*DR*W:
       the same idea Hank used in EXAMPLE I, now D changes to DR
           _                       _
       DR=| 1/(dt*gR_1)             |
          |    1/(dt*gR_2)          |
          |      .                  |   gR_j=R_j/g_j    or
          |        .                |   gR_j=1/(g_j*R_j)
          |          .              |
          |            1/(dt*gR_k)  |
           -                       -
                                       
    2- lmassr*D*W to approximate Mass*DR*W:
       the same concept Hank used in lmass. forming lmassr
             _            _             -                         -
     lmassr=|sum_j masR_1j|  to  MassR=|masR_11 masR_12 ... masR_1k|
            |sum_j masR_2j|            |masR_21 masR_22 ... masR_2k|
            | .           |            |  .       .     ...   .    |
            | .           |            |  .       .     ...   .    |
            | .           |            |  .       .     ...   .    |
            |sum_j masR_lj|            |masR_l1 masR_l2 ... masR_lk|
             -           -              -                         -
       massR_ij=mass_ij/R_{vertex k}  or  mass_{ij}*R_{vertex k}
       MassR is made symmetric. 


    3- Using Mass*DR*W directly ==>> closer digits
    4- using MassR*D*W directly ==>> closer digits
    5- StarMatrix   = (stiff_ij/Raverage)   
       DaggerMatrix = (stiff_ij*Raverage)

    digits of gamma,      wkin,          w(10,1)
    original  -0.3687     0.2878e-11     -0.8994e-5
    method 1   0.2495     0.2044e-11     -0.1514e-4
    method 2   0.2439     0.2142e-11     -0.1571e-4
    method 3  -1.7153     0.9873e-12     -0.1690e-6
    method 4  -0.6488     0.3063e-11     -0.6332e-5
    Raverg 5  -0.8754     0.4613e-11     -0.3435e-5
    Rasymm 5  -1.2726     0.2108e-11     -0.3080e-5
    Results from 1- 2- were quite different from original numbers.
    Possible reason: g is almost the same over (R,Z) space, but 
                     R changes a lot.

CONCLUSION
    1- Mass*DR & MassR*D are not symmetric generally.
       if g is constant over (R,Z) space, they are symmetric;
    2- in order to make Mass*DR & MassR*D symmetric generally,
       that will be prohibitively expensive, therefore, unrealistic.
    3- check the code again and seek a better way for approximation
       otherwise, continue to use general method.

Week June 14, 2002

For GMRES + ILU
   1) on Uniprocessor, ilu fill-in is supported up to 5 levels;
   2) on Multiprocessor, 
      block Jacobi preconditioning with ILU on each processor 
      To use ILU(k) on each block use -sub_pc_ilu_levels 3

   ==>> The reason why ilu with fill-in is faster:
     With more fill-ins, ilu is more closely approximate the original
     matrix. Therefore, the number of iterations decreases.

   ==>> Preconditioning:
    cluster the eigenvalues far from zero and close to 1

   ==>> ICCG: CG with ICC 
    for Ax=b if A SPD (Symmetric Positive Definite)
    for normal equation A^TA x= A^T b if A is not SPD

For turning off 
	eta, rmu, vmu, pkks, pdissf, pkkk, icheta
     we can save about 23% of user time:
     RadialGrids     TurnON     TurnOff     Difference     Percentage
          30          54.13       42.34          11.79       21.78%
          40          91.62       70.37          21.25       23.19%
          50         139.73      107.39          32.34       23.14%
          60         203.09      155.43          47.65       23.37%
          70         282.72      216.29          66.43       23.50%
     (poloidal partition is 3)

Form symmetric star dagger operator:
    0)star[3,3]  =
       Dlam_0*Dlam_0/R_0    Dlam_0*Dlam_1/R_1    Dlam_0*Dlam_2/R_2
       Dlam_1*Dlam_0/R_1    Dlam_1*Dlam_1/R_2    Dlam_1*Dlam_2/R_0
       Dlam_2*Dlam_0/R_2    Dlam_2*Dlam_1/R_0    Dlam_2*Dlam_2/R_1
      dagger[3,3]  =
       Dlam_0*Dlam_0*R_0    Dlam_0*Dlam_1*R_1    Dlam_0*Dlam_2*R_2 
       Dlam_1*Dlam_0*R_1    Dlam_1*Dlam_1*R_2    Dlam_1*Dlam_2*R_0
       Dlam_2*Dlam_0*R_2    Dlam_2*Dlam_1*R_0    Dlam_2*Dlam_2*R_1
      massr[3,3]  =
        lam_0* lam_0/R_0     lam_0* lam_1/R_1     lam_0* lam_2/R_2
        lam_1* lam_0/R_1     lam_1* lam_1/R_2     lam_1* lam_2/R_0
        lam_2* lam_0/R_2     lam_2* lam_1/R_0     lam_2* lam_2/R_1
      rmass[3,3]  =
        lam_0* lam_0*R_0     lam_0* lam_1*R_1     lam_0* lam_2*R_2 
        lam_1* lam_0*R_1     lam_1* lam_1*R_2     lam_1* lam_2*R_0
        lam_2* lam_0*R_2     lam_2* lam_1*R_0     lam_2* lam_2*R_1
    1) mesh/felement/accumLocalFE.c 
       1.1-in AccumLocalFE
          -double  star[9], dagger[9], massr[9], rmass[9];
           double  Rave;
          -Rave = ( r[map.elements[jElement][0]] +
                    r[map.elements[jElement][1]] +
                    r[map.elements[jElement][2]] )/3.0;
          -star[i*3+j]   = s[i*3+j]/r[map.elements[jElement][(i+j)%3]];
           dagger[i*3+j] = s[i*3+j]*r[map.elements[jElement][(i+j)%3]];
           massr[i*3+j]  = m[i*3+j]/r[map.elements[jElement][(i+j)%3]];
           rmass[i*3+j]  = m[i*3+j]*r[map.elements[jElement][(i+j)%3]];
          -UpdateMassStiff(jElement, m, s, dRoverR, star, dagger, massr, rmass,
           UpdateMassStiff(jElement, m, s, dRoverR, star, dagger, massr, rmass,
           UpdateMassStiff(jElement, m, s, dRoverR, star, dagger, massr, rmass,
       1.2-in UpdateMassStiff
           l->matrix[whichPhiSection].star[iRow][iColumn] += star[i*3+j];
           l->matrix[whichPhiSection].dagger[iRow][iColumn] += dagger[i*3+j];
           l->matrix[whichPhiSection].massr[iRow][iColumn] += massr[i*3+j];
           l->matrix[whichPhiSection].rmass[iRow][iColumn] += rmass[i*3+j];
       1.3-in accumLocalFE.h
           UpdateMassStiff(int jE, double *m, double *s, double *dRoverR, 
                   double *star , double *dagger, double *massr, double *rmass,
    2) mesh/felement/felement.h
       2.1-in class FEMatrix
           double    **star;
           double    **dagger; 
           double    **massr;
           double    **rmass;
       2.2-in class FiniteElementMatrix
           Mat          *Star;  /*cj form symm operator*/
           Mat          *Dagger;
           Mat          *MassR;
           Mat          *RMass;
    3) mesh/felement/allocateLocalFEMatrix.c
       3.1-in AllocLocalFEMatrix
           p[i].star         = fMatrix(matrixInfo);
           p[i].dagger       = fMatrix(matrixInfo);
           p[i].massr        = fMatrix(matrixInfo);
           p[i].rmass        = fMatrix(matrixInfo);
    4) mesh/felement/constructFEMatrix.c
       4.1-in ConstructFEMatrix
           gfe->Star  =(Mat *)malloc((unsigned)map.NumPhiSections....
           gfe->Dagger=(Mat *)malloc((unsigned)map.NumPhiSections....
           gfe->MassR =(Mat *)malloc((unsigned)map.NumPhiSections....
           gfe->RMass =(Mat *)malloc((unsigned)map.NumPhiSections....
       4.2-in ConstructFEMatrix
           ierr = MatCreateMPIAIJ(....  &(gfe->Star[whichPhiSection]));
           ierr = MatCreateMPIAIJ(....  &(gfe->Dagger[whichPhiSection]));
           ierr = MatCreateMPIAIJ(....  &(gfe->MassR[whichPhiSection]));
           ierr = MatCreateMPIAIJ(....  &(gfe->RMass[whichPhiSection]));
       4.3-in ConstructFEMatrix
           ierr = MatSetLocalToGlobalMapping(gfe->Star[....], ....
           ierr = MatSetLocalToGlobalMapping(gfe->Dagger[....], ....
           ierr = MatSetLocalToGlobalMapping(gfe->MassR[....], 
           ierr = MatSetLocalToGlobalMapping(gfe->RMass[....],
       4.4-in ConstructFEMatrix
           ierr = MatSetValuesLocal(gfe->Star[whichPhiSection], .....
                               lfe->matrix[whichPhiSection].star[i], ....
           ierr = MatSetValuesLocal(gfe->Dagger[whichPhiSection], ....
                               lfe->matrix[whichPhiSection].dagger[i], ....
           ierr = MatSetValuesLocal(gfe->MassR[whichPhiSection], ....
                               lfe->matrix[whichPhiSection].massr[i], ....
           ierr = MatSetValuesLocal(gfe->RMass[whichPhiSection], ....
                               lfe->matrix[whichPhiSection].rmass[i],
       4.5-in ConstructFEMatrix
           ierr = MatAssemblyBegin(gfe->Star[whichPhiSection], ....
           ierr = MatAssemblyEnd(gfe->Star[whichPhiSection], .....
           ierr = MatAssemblyBegin(gfe->Dagger[whichPhiSection], ....
           ierr = MatAssemblyEnd(gfe->Dagger[whichPhiSection], ....
           ierr = MatAssemblyBegin(gfe->MassR[whichPhiSection], ....
           ierr = MatAssemblyEnd(gfe->MassR[whichPhiSection], ....
           ierr = MatAssemblyBegin(gfe->RMass[whichPhiSection], ....
           ierr = MatAssemblyEnd(gfe->RMass[whichPhiSection], ....
       4.6-in ConstructFEMatrix
           Free_fMatrix(lfe->matrixInfo, lfe->matrix[....].star);
           Free_fMatrix(lfe->matrixInfo, lfe->matrix[....].dagger);
           Free_fMatrix(lfe->matrixInfo, lfe->matrix[....].massr);
           Free_fMatrix(lfe->matrixInfo, lfe->matrix[....].rmass);
    5) m3d/interface/poissc.c
       5.1-in poissc_
           int        iR=0;
           poissolve(i, gfe.Stiffness, iR,
       5.2-in poistarc_
           int        iR=1;
           poissolve(i, gfe.Star, iR,
       5.3-in poisdagc_
           int        iR=2;
           poissolve(i, gfe.Dagger, iR,
       5.4-in poissolve(....  Mat *Msolver, const int iR, ....
           poisNeuman  ( whichPhiSection, Msolver, iR, A, B);
           poisDiriclet( whichPhiSection, Msolver, iR, A, B, ibc);
       5.5-in poisNeuman  (.... Mat *Msolver, const int iR,
           in poisDiriclet(.... Mat *Msolver, const int iR, ....
             if( iR==1 ) 
           ierr = VecPointwiseDivide(userData.canvas.rhs.g,
                  grid.R.g[whichPhiSection],userData.canvas.rhs.g )
             if( iR==2 )
           ierr = VecPointwiseMult(userData.canvas.rhs.g,
                  grid.R.g[whichPhiSection],userData.canvas.rhs.g )
       5.5==>>5.6
             if( iR==0 )
         ierr = VecPointwiseMult(A.g[whichPhiSection],
         userData.work.lmass.g[whichPhiSection], userData.canvas.rhs.g);
         CHKERRQ(ierr);
             if( iR==1 )
         ierr = MatMult(gfe.MassR[whichPhiSection], A.g[whichPhiSection],
         userData.canvas.rhs.g ); CHKERRQ(ierr);
             if( iR==2 )
         ierr = MatMult(gfe.RMass[whichPhiSection], A.g[whichPhiSection],
                     userData.canvas.rhs.g ); CHKERRQ(ierr);

       5.1-in poismc_
           int        iR=0;
           Diffpar(i, gfe.Stiffness, iR,
       5.2-in poismstc_
           int        iR=1;
           Diffpar(i, gfe.Star, iR,
       5.3-in poismdgc_
           int        iR=2;
           Diffpar(i, gfe.Dagger, iR,
       5.4-in Diffpar(.... Mat *Msolve, const int iR, ....
             if( iR==1 ) 
           ierr = VecPointwiseDivide(userData.canvas.rhs_algebraic.g,
                  grid.R.g[whichPhiSection],userData.canvas.rhs_algebraic.g )
             if( iR==2 )
           ierr = VecPointwiseMult(userData.canvas.rhs_algebraic.g,
                  grid.R.g[whichPhiSection],userData.canvas.rhs_algebraic.g )
       5.4==>>5.5
               if( iR==0 )
           ierr = VecPointwiseDivide(userData.work.lmass.g[whichPhiSection],
             userData.coeff.resistivity.g[whichPhiSection],
             userData.canvas.rhs_algebraic.g ); CHKERRQ(ierr);
               if( iR==1 )
           ierr = MatMult(gfe.MassR[whichPhiSection],
             userData.coeff.resistivity.g[whichPhiSection],
             userData.canvas.rhs_algebraic.g ); CHKERRQ(ierr);
               if( iR==2 )
           ierr = MatMult(gfe.RMass[whichPhiSection],
             userData.coeff.resistivity.g[whichPhiSection],
             userData.canvas.rhs_algebraic.g ); CHKERRQ(ierr);
    6) mhd/allocation/mhdData.h
       in MHD_work add
         ParData     lmassr;
         ParData     lrmass;
    7) mhd/allocation/constructMHDdata.c
       under m3d work space add
         duplicatePData(map, mhd.basic[0].tI, &mhd.work.lmassr);
         duplicatePData(map, mhd.basic[0].tI, &mhd.work.lrmass);
    8) mhd/allocation/constructLmass.c
         ierr = MatMult(gfe->MassR[i], scratch->tmpData.g,
                  userData->work.lmassr.g[i]); CHKERRQ(ierr);
         ierr = MatMult(gfe->RMass[i], scratch->tmpData.g,
                  userData->work.lrmass.g[i]); CHKERRQ(ierr);
         GetArrayInPData(*phone, *map, userData->work.lmassr);
         GetArrayInPData(*phone, *map, userData->work.lrmass);
    9) 
       5.5-in poisDiriclet(.... Mat *Msolver, const int iR, ....
             if( iR==0 ) 
	  ierr = VecPointwiseMult(A.g[whichPhiSection],
           userData.work.lmass.g[whichPhiSection],
           userData.canvas.rhs.g);
           CHKERRQ(ierr);
             if( iR==1 ) 
	  ierr = VecPointwiseMult(A.g[whichPhiSection],
           userData.work.lmassr.g[whichPhiSection],
           userData.canvas.rhs.g);
           CHKERRQ(ierr);
             if( iR==2 )
	  ierr = VecPointwiseMult(A.g[whichPhiSection], 
           userData.work.lrmass.g[whichPhiSection], 
           userData.canvas.rhs.g);
           CHKERRQ(ierr);

       5.6-in poisNeuman  (.... Mat *Msolver, const int iR,
  if(iR==0){
  ierr = VecPointwiseMult(A.g[whichPhiSection],
           userData.work.lmass.g[whichPhiSection],
           userData.canvas.rhs.g); CHKERRQ(ierr); }
  if(iR==1){
  ierr = VecPointwiseMult(A.g[whichPhiSection],
           userData.work.lmassr.g[whichPhiSection],
           userData.canvas.rhs.g); CHKERRQ(ierr); }
  if(iR==2){
  ierr = VecPointwiseMult(A.g[whichPhiSection],
           userData.work.lrmass.g[whichPhiSection],
           userData.canvas.rhs.g); CHKERRQ(ierr); }

  if(iR==0){
  for (i=0; i


 Week June 7, 2002 
Notes:

=>> Using Petsc, we don't have many options to make use of the matrix 
    structure to improve the performance. During these 2 weeks, I 
    strongly feel that we are limited by this package.

=>> The following 3 lines are recommended to be included in your job 
    submit script files:
    -ksp_gmres_restart 120 
    -ksp_gmres_modifiedgramschmidt 
    -pc_ilu_levels 3
    
    Here is my test: 32 = Number of grids in minor radial direction.

   - preconditioning is better
     -ksp_type gmres -pc_type ilu    is 10 times faster than
     -ksp_type gmres -pc_type none

   - gmres with modified Gram-Schmidt is even better
     -ksp_gmres_modifiedgramschmidt -pc_type ilu is 9% faster than
     -ksp_type gmres                -pc_type ilu

   - increase restart number to get even better peformance
     -ksp_gmres_restart 120 -ksp_gmres_modifiedgramschmidt -pc_type ilu
          is again 9% faster than
                            -ksp_gmres_modifiedgramschmidt -pc_type ilu

   - increase fillin level in ILU to again get even better peformance
     -ksp_gmres_restart 120 -ksp_gmres_modifiedgramschmidt -pc_ilu_levels 3
          is again 9% faster than
     -ksp_gmres_restart 120 -ksp_gmres_modifiedgramschmidt -pc_type ilu

   SO THE PERFORMANCE IS IMPROVED 4 TIMES COMPARED TO THE DEFAULT SETTING.


Jin: w3.pppl.gov/~jchen/doc.html is updated
     . Eigenspectrum and Performance

     . Direct GMRES and ILU preconditioned GMRES

     . connectivity matrix:  connecting lfe and gfe