15. Tensor Contraction Engine Module: CI, MBPT, and CC

15.1 Overview

The Tensor Contraction Engine (TCE) Module of NWChem implements a variety of approximations that converge to the exact solution of the Schrödinger equation. These include configuration interaction theory through singles, doubles, triples, and quadruples substitutions; coupled-cluster theory through connected singles, doubles, triples, and quadruples substitutions; and many-body perturbation theory through fourth order in its tensor formulation. Not only are the optimized parallel programs for some of these high-end correlation theories new, but the way in which they were developed is also unique. The working equations of all of these methods have been derived completely automatically by a symbolic manipulation program called a Tensor Contraction Engine (TCE), and the optimized parallel programs have likewise been computer-generated by the same program and interfaced to NWChem. The development of the TCE program and this portion of NWChem has been financially supported by the United States Department of Energy, Office of Science, Office of Basic Energy Sciences, through the SciDAC program.

The capabilities of the module include:

New capabilities added in version 4.6 are:

From version 4.6 onwards, the distributed binary executables do not contain CCSDTQ and its derivative methods, owing to their large volume. The source code includes them, so a user can reinstate them by setting the environment variable CCSDTQ to yes (setenv CCSDTQ yes) and recompiling the TCE module. The following optimizations have been used in the module:

In addition to the changes made in the 4.7 version, the most essential components of the 5.0 release include:

New features in the 5.1 release included:

New features in the 5.1.1 release include:

15.2 Performance of CI, MBPT, and CC methods

For reviews or tutorials of these highly-accurate correlation methods, the user is referred to:

For background on development of the symbolic algebra tools which help create the code used in this model see:

For details of the CC algorithms implemented in the 5.0 version, see:

and references therein.

For details of the CC methods implemented in the 5.1 version, see:

and references therein.

15.3 Algorithms of CI, MBPT, and CC methods

15.3.1 Spin, spatial, and index permutation symmetry

The TCE thoroughly analyzes the working equations of many-electron theory models and automatically generates a program that takes full advantage of these symmetries simultaneously. To do so, the TCE first recognizes the index permutation symmetries among the working equations, and performs strength reduction and factorization while carefully monitoring the index permutation symmetries of intermediate tensors. Accordingly, every input and output tensor (such as integrals, excitation amplitudes, and residuals) has just two independent but strictly ordered index strings, and each intermediate tensor has just four independent but strictly ordered index strings. The operation cost and storage size of a tensor contraction are minimized by using the index range restrictions arising from these index permutation symmetries as well as spin and spatial symmetry integration.

15.3.2 Runtime orbital range tiling

To keep the peak local memory usage at a manageable level, at the beginning of the calculation the orbitals are rearranged into tiles (blocks) that each contain orbitals with the same spin and spatial symmetries. The tensor contractions in these methods are then carried out at the tile level; the spin, spatial, and index permutation symmetries are employed to reduce the operation and storage costs at the tile level as well.

15.3.3 Dynamic load balancing parallelism

In a parallel execution, dynamic load balancing of tile-level local tensor index sorting and local tensor contraction (matrix multiplication) will be invoked.

15.3.4 Parallel I/O schemes

Each process is dynamically assigned a local tensor index sorting and tensor contraction. It must first retrieve the tiles of the input tensors, then perform these local operations, and finally accumulate the output tensors to the storage. We have developed a uniform interface for these I/O operations to (1) a global file on a global file system, (2) global memory on a global or distributed memory system, and (3) semi-replicated files on a distributed file system. Some of these operations depend on the ParSoft library.


15.4 Input syntax

The keyword to invoke the many-electron theories in the module is TCE. To perform a single-point energy calculation, include

      TASK TCE ENERGY
in the input file, which may be preceded by the TCE input block that details the calculations:
  TCE
    [(DFT||HF||SCF) default HF=SCF]
    [FREEZE [[core] (atomic || <integer nfzc default 0>)] \
             [virtual <integer nfzv default 0>]]
    [(LCCD||CCD||CCSD||CC2||LR-CCSD||LCCSD||CCSDT||CCSDTA||CCSDTQ|| \
      CCSD(T)||CCSD[T]||CCSD(2)_T||CCSD(2)||CCSDT(2)_Q|| \
      CR-CCSD[T]||CR-CCSD(T)|| \
      LR-CCSD(T)||LR-CCSD(TQ)-1||CREOMSD(T)|| \
      QCISD||CISD||CISDT||CISDTQ|| \
      MBPT2||MBPT3||MBPT4||MP2||MP3||MP4) default CCSD]
    [THRESH <double thresh default 1e-6>]
    [MAXITER <integer maxiter default 100>]
    [PRINT (none||low||medium||high||debug)
      <string list_of_names ...>]
    [IO (fortran||eaf||ga||sf||replicated||dra||ga_eaf) default ga]
    [DIIS <integer diis default 5>]
    [LSHIFT <double lshift default is 0.0d0>]
    [NROOTS <integer nroots default 0>]
    [TARGET <integer target default 1>]
    [TARGETSYM <character targetsym default 'none'>]
    [SYMMETRY]
    [2EORB]
    [2EMET <integer fast2e default 1>]
    [T3A_LVL] 
    [ACTIVE_OA]
    [ACTIVE_OB]
    [ACTIVE_VA]
    [ACTIVE_VB]
    [DIPOLE]
    [TILESIZE <no default (automatically adjusted)>]
    [(NO)FOCK <logical recompf default .true.>]
    [FRAGMENT <default -1 (off)>]
  END
Also supported are energy gradient calculations, geometry optimizations, and vibrational frequency (or Hessian) calculations, all on the basis of numerical differentiation. To perform these calculations, use
      TASK TCE GRADIENT
or
      TASK TCE OPTIMIZE
or
      TASK TCE FREQUENCIES

The user may also specify the parameters of reference wave function calculation in a separate block for either HF (SCF) or DFT, depending on the first keyword in the above syntax.

Since every keyword except the model has a default value, a minimal input file will be

  GEOMETRY
  Be 0.0 0.0 0.0
  END

  BASIS
  Be library cc-pVDZ
  END

  TCE
  ccsd
  END

  TASK TCE ENERGY
which performs a CCSD/cc-pVDZ calculation of the Be atom in its singlet ground state with a spin-restricted HF reference.

15.5 Keywords of TCE input block

15.5.1 HF, SCF, or DFT -- the reference wave function

This keyword tells the module whether the HF (SCF) or DFT module is to be used for the calculation of the reference wave function. The keywords HF and SCF are one and the same keyword internally, and they are the default. When these are used, the details of the HF (SCF) calculation can be specified in the SCF input block, whereas if DFT is chosen, a DFT input block may be provided.

For instance, an RHF-RCCSDT calculation (R standing for spin-restricted) can be performed with the following input blocks:

  SCF
  SINGLET
  RHF
  END

  TCE
  SCF
  CCSDT
  END

  TASK TCE ENERGY
This calculation (and any correlation calculation in the TCE module using an RHF or RDFT reference for a closed-shell system) skips the storage and computation of all $\beta$ spin blocks of integrals and excitation amplitudes. An ROHF-UCCSDT calculation (U standing for spin-unrestricted) for an open-shell doublet system can be requested by
  SCF
  DOUBLET
  ROHF
  END

  TCE
  SCF
  CCSDT
  END

  TASK TCE ENERGY
and likewise, UHF-UCCSDT for an open-shell doublet system can be specified with
  SCF
  DOUBLET
  UHF
  END

  TCE
  SCF
  CCSDT
  END

  TASK TCE ENERGY
The operation and storage costs of the last two calculations are identical. To use the KS DFT reference wave function for a UCCSD calculation of an open-shell doublet system,
  DFT
  ODFT
  MULT 2
  END

  TCE
  DFT
  CCSD
  END

  TASK TCE ENERGY
Note that the default model of the DFT module is LDA.

15.5.2 CCSD, CCSDT, CCSDTQ, CISD, CISDT, CISDTQ, MBPT2, MBPT3, MBPT4, etc. -- the correlation models

These keywords stand for the following models:

All of these models are based on spin-orbital expressions of the amplitude and energy equations, and are designed primarily for spin-unrestricted reference wave functions. However, for a restricted reference wave function of a closed-shell system, some further reduction of the operation and storage costs is made. Within the unrestricted framework, all of these methods take full advantage of spin, spatial, and index permutation symmetries to save operation and storage costs at every stage of the calculation. Consequently, these computer-generated programs perform significantly faster than, for instance, a hand-written spin-adapted CCSD program in NWChem, although the nominal operation cost of a spin-adapted CCSD is just one half of that of spin-unrestricted CCSD (in spin-unrestricted CCSD there are three independent sets of excitation amplitudes, whereas in spin-adapted CCSD there is only one set, so the nominal operation cost of the latter is one third of that of the former; for a restricted reference wave function of a closed-shell system, all $\beta$ spin blocks of the excitation amplitudes and integrals can be trivially mapped to the $\alpha$ spin blocks, reducing the ratio to one half).

While the MBPT (MP) models implemented in the TCE module give correlation energies identical to those of a conventional implementation for a canonical HF reference of a closed-shell system, the former are intrinsically more general and theoretically robust for other, less standard reference wave functions and for open-shell systems. This is because the zeroth-order Hamiltonian is chosen to be the full Fock operator (not just its diagonal part), and no further approximation is invoked. So unlike a conventional implementation, where the Fock matrix is assumed to be diagonal and the correlation energy is evaluated from a single analytical formula involving orbital energies (or diagonal Fock matrix elements), the present tensor MBPT requires the iterative solution of amplitude equations and a subsequent energy evaluation, and is generally more expensive than the former. For example, the operation cost of many conventional implementations of MBPT(2) scales as the fourth power of the system size, but the cost of the present tensor MBPT(2) scales as the fifth power of the system size, as the latter permits a non-canonical HF reference and the former does not (reinstating a non-canonical HF reference in the former makes it also scale as the fifth power of the system size).

15.5.3 THRESH -- the convergence threshold of iterative solutions of amplitude equations

This keyword specifies the convergence threshold for the iterative solution of the amplitude equations, and applies to all of the CI, CC, and MBPT models. The threshold refers to the norm of the residual, namely, the deviation from the amplitude equations. The default value is 1e-6.

15.5.4 MAXITER -- the maximum number of iterations

It sets the maximum allowed number of iterations for the iterative solution of the amplitude equations. The default value is 100.

15.5.5 IO -- parallel I/O scheme

There are five parallel I/O schemes implemented for all the models, which need to be wisely chosen for a particular problem and computer architecture.

The GA algorithm, which is the default, stores all input (integrals and excitation amplitudes), output (residuals), and intermediate tensors in the shared memory area across all nodes by virtue of the GA library. This fully in-core algorithm replaces disk I/O by inter-process communication. It is the recommended algorithm whenever feasible. Note that the memory management through runtime orbital range tiling described above applies to the local (unshared) memory of each node, which may be allocated separately from the shared memory space for GA. So when there is not enough shared memory space (either physically or due to software limitations, in particular the shmmax setting), the GA algorithm can crash with an out-of-memory error. The replicated scheme is currently the only disk-based algorithm for a genuinely distributed file system. Each node keeps an identical copy of the input tensors and holds non-identical, overlapping segments of the intermediate and output tensors on its local disk. Whenever data coherency is required, a file reconciliation process takes place to make the intermediate and output data identical across the nodes. This algorithm, while requiring redundant data space on local disks, performs reasonably efficiently in parallel. For sequential execution, it reduces to the EAF scheme. For a global file system, the SF scheme is recommended. This, together with the Fortran77 direct access scheme, does not usually exhibit scalability unless shared files on the global file system also share the same I/O buffer. For sequential executions, the SF, EAF, and replicated schemes are interchangeable, while the Fortran77 scheme is appreciably slower.

Two newer I/O algorithms, dra and ga_eaf, combine the GA algorithm with DRA- or EAF-based replicated algorithms. In the former, arrays that are not active in the GA algorithm (e.g., prior $T$ amplitudes used in DIIS, or EOM-CC trial vectors) are moved to DRA. In the latter, the intermediates formed by tensor contractions are initially stored in GA, thereby avoiding the need to accumulate the fragments of an intermediate scattered across EAFs as in the original EAF algorithm. Once an intermediate is formed completely, it is replicated as EAFs.

The spin-free 4-index transformation algorithms are compatible only with the GA I/O scheme, although out-of-core algorithms for the 4-index transformation are accessible using the 2emet options. See Section 15.5.12 for details.
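
For example, a minimal sketch of a TCE block that selects the semi-replicated disk-based scheme instead of the default GA scheme (the choice of CCSD here is an arbitrary placeholder) would be:

  TCE
    CCSD
    IO replicated
  END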

15.5.6 DIIS -- the convergence acceleration

It sets the number of iterations after which a DIIS extrapolation is performed to accelerate the convergence of the excitation amplitudes. The default value is 5, which means that in every fifth iteration one DIIS extrapolation is performed (and in the remaining iterations Jacobi rotation is used). When a zero or negative value is specified, DIIS is turned off. It is not recommended to perform DIIS in every iteration, whereas setting a large value for this parameter requires a large memory (disk) space to keep the excitation amplitudes of previous iterations. In the 5.0 version we significantly improved the DIIS solver by reorganizing the iterative process and by introducing a level shift option (LSHIFT) that makes it possible to increase the small orbital energy differences used in calculating the updates for the cluster amplitudes. Typical values of lshift range between 0.3 and 0.5 for CC calculations on ground states of multi-configurational character. Otherwise, the value of lshift is by default set equal to 0.
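
For instance, a hedged sketch of a TCE block that performs a DIIS extrapolation every fifth iteration and applies a level shift of 0.3 (a value within the range quoted above; the CCSD choice is arbitrary) would be:

  TCE
    CCSD
    DIIS 5
    LSHIFT 0.3
  END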

15.5.7 FREEZE -- the frozen core/virtual approximation

Some of the lowest-lying core orbitals and/or some of the highest-lying virtual orbitals may be excluded from the correlation calculation by this keyword (this does not affect the ground-state HF or DFT calculation). No orbitals are frozen by default. To exclude the atom-like core regions altogether, one may request

  FREEZE atomic
To specify the number of lowest-lying occupied orbitals to be excluded, one may use
  FREEZE 10
which excludes the 10 lowest-lying occupied orbitals. This is equivalent to writing
  FREEZE core 10
To freeze the highest-lying virtual orbitals, use the virtual keyword. For instance, to freeze the top 5 virtuals, use
  FREEZE virtual 5

15.5.8 NROOTS -- the number of excited states

One can specify the number of excited-state roots to be determined. The default value is 1. It is advised that users request several more roots than are actually needed, since, owing to the nature of the trial vector algorithm, some low-lying roots can be missed when they do not have sufficient overlap with the initial guess vectors.

15.5.9 TARGET and TARGETSYM -- the target root and its symmetry

At the moment, the first and second geometrical derivatives of excitation energies that are needed in force, geometry, and frequency calculations are obtained by numerical differentiation. These keywords may be used to specify which excited-state root is used for the geometrical derivative calculation. For instance, when TARGET 3 and TARGETSYM a1g are included in the input block, the total energy (ground-state energy plus excitation energy) of the third-lowest excited-state root (excluding the ground state) transforming as the irreducible representation a1g will be passed to the module that performs the derivative calculations. The default values of these keywords are 1 and none, respectively.

The keyword TARGETSYM is essential in excited-state geometry optimization, since it is very common that the order of excited states changes owing to the geometry changes in the course of the optimization. Without TARGETSYM, the optimizer could (and would likely) end up optimizing the geometry of an excited state different from the one the user had intended at the starting geometry. On the other hand, in frequency calculations TARGETSYM must be none, since the finite displacements made in the course of a frequency calculation lift the spatial symmetry of the equilibrium geometry. When these finite displacements can alter the order of the excited states, including the target state, the frequency calculation is not feasible.

15.5.10 SYMMETRY -- restricting the excited state symmetry

By adding this keyword to the input block, the user can request that the module seek only the roots of the irreducible representation specified by TARGETSYM. By default, this option is not set. TARGETSYM must be specified when SYMMETRY is invoked.
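
As an illustration of how NROOTS, TARGET, TARGETSYM, and SYMMETRY work together, the following hedged sketch (the method, number of roots, and irreducible representation label are placeholders) optimizes the geometry of the second root of a1g symmetry:

  TCE
    CCSD
    NROOTS 4
    TARGET 2
    TARGETSYM a1g
    SYMMETRY
  END

  TASK TCE OPTIMIZE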

15.5.11 2EORB -- alternative storage of two-electron integrals

In the 5.0 version a new option has been added in order to provide a more economical way of storing the two-electron integrals used in CC calculations based on RHF and ROHF references. The 2EORB keyword can be used for all CC methods except those using an active space (CCSDt). All two-electron integrals are transformed and subsequently stored in a way that is compatible with the assumed tiling scheme. The transformation from the orbital to the spin-orbital form of the two-electron integrals is performed on the fly during execution of the CC module. This option, although slower, significantly reduces the memory required by the first half of the 4-index transformation and by the final file holding the fully transformed two-electron integrals. Savings in the memory requirements of an order of magnitude (and more) have been observed for large-scale open-shell calculations.


15.5.12 2EMET -- alternative storage of two-electron integrals

In addition to the algorithm implemented in the 5.0 version, several new computation-intensive algorithms have been added in the 5.1 version with the purpose of improving scalability and overcoming the local memory bottleneck of the 5.0 2EORB 4-index transformation. In order to give the user full control over this part of the TCE code, several keywords were designed to define the most vital parameters that determine the performance of the 4-index transformation. All new keywords must be used together with the 2EORB keyword. The 2emet keyword (default value 1, i.e., 2emet 1, which refers to the 5.0 version of the 4-index transformation) defines the algorithm to be used. By specifying 2emet 2, the TCE code will execute an algorithm based on a two-step procedure with two intermediate files. In some instances this algorithm shows better timings than algorithms 3 and 4, although it is more memory demanding. In contrast to algorithms 1, 3, and 4, this approach can make use of disk to store the intermediate files. For this purpose one should use the keyword idiskx (idiskx 0 causes all intermediate files to be stored in global arrays, while idiskx 1 tells the code to use disk to store the intermediates; the default value of idiskx is 0). Algorithm 3 (2emet 3) uses only one intermediate file, whereas algorithm 4 (2emet 4) is a version of algorithm 3 with the option of reducing the memory requirements. For example, by using the new keyword split 4 the size of the only intermediate file is reduced by a factor of 4 (the split keyword can only be used in the context of algorithm 4). All new algorithms (i.e., 2emet 2 and higher) use the attilesize setting to define the size of the atomic tile. By default attilesize is set to 30. For larger systems the use of larger values of attilesize is recommended (typically 40-60).

Additional algorithms in version 5.2 are numbers 5, 6, and 9. Other values of 2emet are not supported and refer to methods that do not function properly. Algorithms 5 and 6 were written as out-of-core $N^5$ methods (idiskx 1) and are the most efficient algorithms at the present time. The corresponding in-core variants (idiskx 0) are available but require excessive memory with respect to the methods discussed above, although they may be faster if sufficient memory is available (getting enough memory often requires an excessive number of nodes, which decreases performance in the later stages of the calculation). The difference between 5 and 6 is that 5 writes to a single file (SF or GA) while 6 uses multiple files. For smaller calculations, particularly single-node jobs, 5 is faster than 6, but for more than a handful of processors algorithm 6 should be used. The performance discrepancy depends on the hardware, but algorithm 6 eliminates simultaneous disk access on parallel file systems, or memory mutexes in the in-core case. For NFS file systems attached to parallel clusters no performance differences have been observed, but for Lustre and PVFS they are significant. Using algorithm 5 on large parallel file systems can make the file system inaccessible to other users, invoking the wrath of system administrators.

Algorithm 9 is an out-of-core solution to the memory bottleneck of the 2-e integrals. In this approach, the intermediates of the 4-index transformation as well as the MO integrals are stored in an SF file. As before, this requires a shared file system. Because algorithm 9 is based upon algorithm 5, described above, it is not expected to scale. The primary purpose of algorithm 9 is to make the performance of the NWChem coupled-cluster codes competitive with fast serial codes on workstations, and it succeeds in this purpose when corresponding functionality is compared. A more scalable version of this algorithm is possible, but its utility is limited since large parallel computers do not permit the wall times necessary to use an out-of-core method, which is necessarily slower than the in-core variant. An extensible solution to these issues using complex heterogeneous I/O is in development. Restarting with algorithm 9 is not supported, and attempting to use this feature with the present version may produce meaningless results.

New in 5.2 is the inclusion of multiple 2emet options for the spin-orbital transformations, which are the default when 2eorb is not set and are mandatory for UHF and KS references. There are currently three algorithms available: 1, 2, and 3. The numbering scheme does not correspond in any way to the numbering scheme of the 2eorb case, except that 2emet 1 corresponds to the default algorithm present in previous releases, which uses the user-defined I/O scheme. Algorithm 2 (2emet 2) writes an SF file for the half-transformed integrals, which is at least an order of magnitude larger than the fully transformed integrals, but stores the fully transformed integrals in core. Thus, once the 4-index transformation is complete, this algorithm performs exactly as when algorithm 1 is used. Unfortunately, the spin-orbital fully transformed 2-e integrals are still quite large, and an algorithm corresponding to 2eorb/2emet=9 is available with 2emet 3. Algorithm 3 is also limited in its scalability, but it permits relatively large UHF-based calculations on single workstations for patient users.

In cases where the user has access to both shared and local file systems for parallel calculations, the permanent_dir setting refers to the location of the SF files. The file system specified by scratch_dir is not used by any of the 4-index transformation algorithms that are compatible with io=ga.

In the later part of this manual several examples illustrate the use of the newly introduced keywords.
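
As a brief sketch of how these keywords combine (the correlation method, tile sizes, and directory path below are illustrative placeholders only), a TCE block using the out-of-core algorithm 5 with the intermediates written to disk, and the SF files directed to a shared file system via permanent_dir, might look like:

  # /shared/scratch is a placeholder path on a shared file system
  permanent_dir /shared/scratch

  tce
    ccsd
    io ga
    2eorb
    2emet 5
    idiskx 1
    attilesize 40
    tilesize 30
  end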

15.5.13 DIPOLE -- the ground- and excited-state dipole moments

When this keyword is set, the ground-state CC calculation enters another round of iterative steps for the so-called $\Lambda$ equation in order to obtain the one-particle density matrix and dipole moments. Likewise, for excited states (EOM-CC), the transition moments and dipole moments are computed when (and only when) this option is set. In the latter case, the EOM-CC left-hand-side solutions will be sought, incurring approximately three times the computational cost of the excitation energies alone (note that the EOM-CC effective Hamiltonian is not Hermitian and has distinct left and right eigenvectors).

15.5.14 (NO)FOCK -- (not) recompute Fock matrix

The default is FOCK, meaning that the Fock matrix will be reconstructed (as opposed to using the orbital energies as the diagonal part of the Fock matrix). This is essential for getting correct correlation energies with ROHF or DFT reference wave functions. However, at present this module cannot reconstruct the Fock matrix when one-component relativistic effects are operative. So when a user wishes to run TCE's correlation methods with a DK or other relativistic reference, NOFOCK must be set and the orbital energies are used for the Fock matrix.
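
As an illustration, a hedged sketch of such an input fragment (the RELATIVISTIC/DOUGLAS-KROLL lines assume the standard NWChem relativistic input block, and the choice of CCSD is arbitrary) might look like:

  RELATIVISTIC
    DOUGLAS-KROLL
  END

  TCE
    CCSD
    NOFOCK
  END

  TASK TCE ENERGY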

15.5.15 PRINT -- the verbosity

This keyword changes the level of output verbosity. One may also request that some particular items in Table 15.1 be printed.


Table 15.1: Printable items in the TCE modules and their default print levels.

  Item                          Print Level   Description
  ``time''                      vary          CPU and wall times
  ``tile''                      vary          Orbital range tiling information
  ``t1''                        debug         $T_1$ excitation amplitude dumping
  ``t2''                        debug         $T_2$ excitation amplitude dumping
  ``t3''                        debug         $T_3$ excitation amplitude dumping
  ``t4''                        debug         $T_4$ excitation amplitude dumping
  ``general information''       default       General information
  ``correlation information''   default       TCE information
  ``mbpt2''                     debug         Canonical HF MBPT2 test
  ``get_block''                 debug         I/O information
  ``put_block''                 debug         I/O information
  ``add_block''                 debug         I/O information
  ``files''                     debug         File information
  ``offset''                    debug         File offset information
  ``ao1e''                      debug         AO one-electron integral evaluation
  ``ao2e''                      debug         AO two-electron integral evaluation
  ``mo1e''                      debug         One-electron integral transformation
  ``mo2e''                      debug         Two-electron integral transformation

15.6 Sample input

The following is a sample input for a ROHF-UCCSD energy calculation of a water radical cation.

START h2o

TITLE "ROHF-UCCSD/cc-pVTZ H2O"

CHARGE 1

GEOMETRY
O     0.00000000     0.00000000     0.12982363
H     0.75933475     0.00000000    -0.46621158
H    -0.75933475     0.00000000    -0.46621158
END

BASIS
* library cc-pVTZ
END

SCF
ROHF
DOUBLET
THRESH 1.0e-10
TOL2E  1.0e-10
END

TCE
CCSD
END

TASK TCE ENERGY
The same result can be obtained by the following input:
START h2o

TITLE "ROHF-UCCSD/cc-pVTZ H2O"

CHARGE 1

GEOMETRY
O     0.00000000     0.00000000     0.12982363
H     0.75933475     0.00000000    -0.46621158
H    -0.75933475     0.00000000    -0.46621158
END

BASIS
* library cc-pVTZ
END

SCF
ROHF
DOUBLET
THRESH 1.0e-10
TOL2E  1.0e-10
END

TASK UCCSD ENERGY

An EOM-CCSDT calculation of excitation energies, the excited-state dipole moment, and transition moments.

START tce_h2o_eomcc
 
GEOMETRY UNITS BOHR
H    1.474611052297904   0.000000000000000   0.863401706825835
O    0.000000000000000   0.000000000000000  -0.215850436155089
H   -1.474611052297904   0.000000000000000   0.863401706825835
END
 
BASIS
* library sto-3g
END
 
SCF
SINGLET
RHF
END
 
TCE
CCSDT
DIPOLE
FREEZE CORE ATOMIC
NROOTS 1
END
 
TASK TCE ENERGY

Active-space CCSDt/EOMCCSDt calculations (version I) of several excited states of the Be$_3$ molecule. Three highest-lying occupied $\alpha$ and $\beta$ orbitals (active_oa and active_ob) and nine lowest-lying unoccupied $\alpha$ and $\beta$ orbitals (active_va and active_vb) define the active space.

START TCE_ACTIVE_CCSDT

ECHO

GEOMETRY UNITS ANGSTROM
SYMMETRY C2V
BE  0.00  0.00   0.00
BE  0.00  1.137090 -1.96949
end

BASIS spherical
# --- DEFINE YOUR BASIS SET ---
END

SCF
THRESH 1.0e-10
TOL2E 1.0e-10
SINGLET
RHF
END

TCE
FREEZE ATOMIC
CCSDTA
TILESIZE 15
THRESH 1.0d-5
ACTIVE_OA 3
ACTIVE_OB 3
ACTIVE_VA 9
ACTIVE_VB 9
T3A_LVL 1
NROOTS  2
END 

TASK TCE ENERGY

Completely renormalized EOMCCSD(T) (CR-EOMCCSD(T)) calculations for the ozone molecule as described with the POL1 basis set. The CREOMSD(T) directive automatically initializes a three-step procedure: (1) CCSD calculations; (2) EOMCCSD calculations; (3) non-iterative CR-EOMCCSD(T) corrections.

START TCE_CR_EOM_T_OZONE

ECHO

GEOMETRY UNITS BOHR
SYMMETRY C2V
O   0.0000000000        0.0000000000        0.0000000000
O   0.0000000000       -2.0473224350       -1.2595211660
O   0.0000000000        2.0473224350       -1.2595211660
END

BASIS SPHERICAL
O    S
     10662.285000000      0.00079900
      1599.709700000      0.00615300
       364.725260000      0.03115700
       103.651790000      0.11559600
        33.905805000      0.30155200
O    S
        12.287469000      0.44487000
         4.756805000      0.24317200
O    S
         1.004271000      1.00000000
O    S
         0.300686000      1.00000000
O    S
         0.090030000      1.00000000
O    P
        34.856463000      0.01564800
         7.843131000      0.09819700
         2.306249000      0.30776800
         0.723164000      0.49247000
O    P
         0.214882000      1.00000000
O    P
         0.063850000      1.00000000
O    D
         2.306200000      0.20270000
         0.723200000      0.57910000
O    D
         0.214900000      0.78545000
         0.063900000      0.53387000
END

SCF
THRESH 1.0e-10
TOL2E 1.0e-10
SINGLET
RHF
END

TCE
FREEZE ATOMIC
CREOMSD(T)
TILESIZE 20
THRESH 1.0d-6
NROOTS 2
END

TASK TCE ENERGY

LR-CCSD(T) calculations for the glycine molecule in the aug-cc-pVTZ basis set. The 2EORB option is used in order to minimize the memory requirements associated with the storage of the two-electron integrals.

START TCE_LR_CCSD_T

ECHO

GEOMETRY UNITS BOHR
O      -2.8770919486        1.5073755650        0.3989960497
C      -0.9993929716        0.2223265108       -0.0939400216
C       1.6330980507        1.1263991128       -0.7236778647
O      -1.3167079358       -2.3304840070       -0.1955378962
N       3.5887721300       -0.1900460352        0.6355723246
H       1.7384347574        3.1922914768       -0.2011420479
H       1.8051078216        0.9725472539       -2.8503867814
H       3.3674278149       -2.0653924379        0.5211399625
H       5.2887327108        0.3011058554       -0.0285088728
H      -3.0501350657       -2.7557071585        0.2342441831
END

BASIS
* library aug-cc-pVTZ
END

SCF
THRESH 1.0e-10
TOL2E 1.0e-10
SINGLET
RHF
END

TCE
FREEZE ATOMIC
2EORB
TILESIZE 15
LR-CCSD(T)
THRESH 1.0d-7
END

TASK TCE ENERGY

CCSD calculations for the triplet state of the C$_{20}$ molecule. New algorithms for the 4-index transformation are used.

title "c20_cage"
echo
start c20_cage
memory stack 2320 mb heap 180 mb global 2000 mb noverify

geometry print xyz units bohr
   symmetry c2
   C      -0.761732  -1.112760   3.451966
   C       0.761732   1.112760   3.451966
   C       0.543308  -3.054565   2.168328
   C      -0.543308   3.054565   2.168328
   C       3.190553   0.632819   2.242986
   C      -3.190553  -0.632819   2.242986
   C       2.896910  -1.982251   1.260270
   C      -2.896910   1.982251   1.260270
   C      -0.951060  -3.770169   0.026589
   C       0.951060   3.770169   0.026589
   C       3.113776   2.128908   0.076756
   C      -3.113776  -2.128908   0.076756
   C       3.012003  -2.087494  -1.347695
   C      -3.012003   2.087494  -1.347695
   C       0.535910  -2.990532  -2.103427
   C      -0.535910   2.990532  -2.103427
   C       3.334106   0.574125  -2.322563
   C      -3.334106  -0.574125  -2.322563
   C      -0.764522  -1.081362  -3.453211
   C       0.764522   1.081362  -3.453211
end

basis spherical
 * library cc-pvtz
end

scf
   triplet
   rohf
   thresh 1.e-8
   maxiter 200
end

tce
   ccsd
   maxiter 60
   diis 5
   thresh 1.e-6
   2eorb
   2emet 3
   attilesize 40
   tilesize 30
   freeze atomic
end

task tce energy

15.7 TCE Response Properties

15.7.1 Introduction

Response properties can be calculated within the TCE. Ground-state dipole polarizabilities can be computed at the CCSD, CCSDT, and CCSDTQ levels of theory. Neither CCSDT-LR nor CCSDTQ-LR is compiled by default. Like the rest of the TCE, properties can be calculated with RHF, UHF, ROHF, and DFT reference wave functions.

Specific details for the implementation of CCSD-LR and CCSDT-LR can be found in the following papers:

An appropriate background on coupled-cluster linear response (CC-LR) can be found in the references of those papers.

15.7.2 Performance

The coupled-cluster response codes were generated in the same manner as the rest of the TCE, so all previous comments on performance apply here as well. The improved offsets available in the CCSD and EOM-CCSD codes are now also available in the CCSD-$\Lambda$ and CCSD-LR codes. The bottleneck for CCSD-LR is the same as for EOM-CCSD, and likewise for CCSDT-LR and EOM-CCSDT. The CCSD-LR code has been tested on as many as 1024 processors for systems with more than 2000 spin-orbitals, while the CCSDT-LR code has also been run on as many as 1024 processors. The CCSDTQ-LR code is not particularly useful due to the extreme memory requirements of the quadruples amplitudes, limited scalability, and poor convergence of the CCSDTQ equations in general.

15.7.3 Input

The input commands for TCE response properties exclusively use set directives (see Section 5.7) instead of TCE input block keywords. There are currently only three commands available:

set tce:lineresp <logical lineresp default: F>
set tce:afreq <double precision afreq(9) default: 0.0>
set tce:respaxis <logical respaxis(3) default: T T T>

The boolean variable lineresp invokes the linear response equations for the corresponding coupled-cluster method (only CCSD and CCSDT possess this feature) and evaluates the dipole polarizability. When lineresp is true, the $\Lambda$-equations will also be solved, so the dipole moment is calculated as well. If no other options are set, the complete dipole polarizability tensor will be calculated at zero frequency (static). Up to nine real frequencies can be set; adding more should not crash the code, but it will produce meaningless quantities. If one desires to calculate more frequencies at one time, merely change the line double precision afreq(9) in $(NWCHEM_TOP)/src/tce/include/tce.fh appropriately and recompile.

The user can choose to calculate response amplitudes only for certain axes, either because of redundancy due to symmetry or because of memory limitations. The boolean vector respaxis, of length three, determines whether a given set of response amplitudes is allocated, solved for, and used in the polarizability tensor evaluation. The logical variables represent the X, Y, and Z axes, respectively. If respaxis is set to T F T, for example, the responses with respect to the X and Z dipoles will be calculated, and the four (three unique) tensor components will be evaluated. This feature is also useful for conserving memory: by calculating only one axis at a time, the memory requirements can be reduced by $25\%$ or more, depending on the number of DIIS vectors used. Reducing the number of DIIS vectors also reduces the memory requirements.

It is strongly advised that, when calculating polarizabilities at high frequencies, the user set the frequencies in increasing order, preferably starting with zero or another small value. This approach is computationally efficient (the initial guess for each subsequent response is the previously converged value) and avoids starting from a zero vector for the response amplitudes.

15.7.4 Examples

This example runs in-core on a large workstation.

geometry units angstrom
 symmetry d2h
 C               0.000    1.390    0.000
 H               0.000    2.470    0.000
 C               1.204    0.695    0.000
 H               2.139    1.235    0.000
 C               0.000   -1.390    0.000
 H               0.000   -2.470    0.000
 C              -1.204   -0.695    0.000
 H              -2.139   -1.235    0.000
 C               1.204   -0.695    0.000
 H               2.139   -1.235    0.000
 C              -1.204    0.695    0.000
 H              -2.139    1.235    0.000
end

basis spherical
  * library aug-cc-pvdz
end

tce
  freeze atomic
  ccsd
  io ga
  2eorb
  tilesize 16
end

set tce:lineresp T
set tce:afreq 0.000 0.072
set tce:respaxis T T T

task tce energy

This is a relatively simple example for CCSDT-LR.

geometry units au
  symmetry c2v
  H 0       0        0
  F 0       0        1.7328795
end

basis spherical
  * library aug-cc-pvdz
end

tce
  ccsdt
  io ga
  2eorb
end

set tce:lineresp T
set tce:afreq 0.0 0.1 0.2 0.3 0.4
set tce:respaxis T F T

task tce energy

15.8 TCE Restart Capability

15.8.1 Overview

Check-pointing and restart are critical for computational chemistry applications of any scale, but particularly for those run on supercomputers or for an extended period on workstations and clusters. The TCE supports parallel check-pointing and restart using the Shared Files (SF) library in the Global Arrays Tools. The SF library requires that the file system be accessible by every node, so reading and writing restart files can only be performed on a shared file system. For workstations, this condition is trivially met. Restart files must be persistent to be useful, so volatile file systems, or those which are periodically erased, should not be used for check-pointing.

Restart is possible for all ground-state amplitudes ($T$, $\Lambda$ and $T^{(1)}$) but not for excited-state amplitudes, as in an EOM-CC calculation. The latter functionality is under development.

Restart capability is useful in the following situations:

At the present time, restarting the amplitudes during a potential energy surface scan or a numerical geometry optimization/frequency calculation is not advised, owing to the phase issue in the molecular orbital coefficients. If the phase changes, the amplitudes will no longer be a useful guess and may lead to nonsensical results. Expert users may be able to use restart when the geometry varies by making careful choices in the SCF input with the ``rotate'' and ``lock'' options, but this use of restart is not supported.

If SF encounters a failure during restart I/O, the job will fail. The capability to ignore a subset of failures, such as when saving the amplitudes prior to convergence, will be available in the future. This is useful on some large machines where the file system is being taxed by other jobs and may appear unavailable at the moment a check-point write is attempted.

The performance of SF I/O for restart is excellent, and the wall time for reading and writing integrals and amplitudes is negligible, even on a supercomputer (such systems have very fast parallel file systems in most cases). The only platform for which restart may cause I/O problems is BlueGene, due to the ratio of compute to I/O nodes (64 on BlueGene/P).

15.8.2 Input

set tce:read_integrals <logical read_integrals default: F F F F F>
set tce:read_t <logical read_t default: F F F F>
set tce:read_l <logical read_l default: F F F F>
set tce:read_tr <logical read_tr default: F F F F>
set tce:save_integrals <logical save_integrals default: F F F F F>
set tce:save_t <logical save_t default: F F F F>
set tce:save_l <logical save_l default: F F F F>
set tce:save_tr <logical save_tr default: F F F F>
set tce:save_interval <integer save_interval default: 100000>

The boolean variables read_integrals and save_integrals control which integrals are read/saved. The first location is for the 1-e integrals, the second for the 2-e integrals, and the third for the dipole integrals. The fourth and fifth positions are reserved for quadrupole and octupole integrals, but this functionality is not available. The read_t, read_l, read_tr, save_t, save_l, and save_tr variables control the reading/saving of the $T$, $\Lambda$, and $T^{(1)}$ (response) amplitudes. Restart control of the response amplitudes is implicitly controlled by the value of respaxis (see above). Requesting amplitudes that are beyond the scope of a given calculation, such as $T_3$ in a CCSD calculation, does not produce an error, as these commands will never be processed.

Attempting to restart with a set of amplitudes without the corresponding integrals is ill-advised, owing to the phase issue discussed above. For the same reason, one cannot save a subset of the integrals, so if it is even remotely possible that the dipole moment or polarizabilities will be desired for a given molecule, the dipole integrals should be saved as well. It is possible to save the dipole integrals without setting dipole in the TCE input block; setting save_integrals(3) to true is sufficient for this to occur.

The save_interval variable controls the frequency with which amplitudes are saved. By default, the amplitudes are saved only when the iterative process has converged, meaning that if the iterations do not converge within the maximum number allowed, one must start the calculation again from scratch. The solution is to set save_interval to a smaller value, such as the number of DIIS cycles being used.

The user must not change the tilesize when reading in saved amplitudes. The results of this are catastrophic, and under no circumstance will it lead to physically meaningful results. Restart does not work with 2eorb and 2emet 9; no error will be produced, but the results may be meaningless.

15.8.3 Examples

geometry units au
  symmetry c2v
  H 0       0        0
  F 0       0        1.7328795
end

basis spherical
  * library aug-cc-pvdz
end

tce
  ccsdt
  io ga
end

set tce:lineresp T
set tce:afreq 0.0 0.1 0.2 0.3 0.4
set tce:respaxis T F T

task tce energy
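
To check-point a run like this one for later reuse, one could add set directives along the lines of the following hedged sketch (which quantities to save depends on what the subsequent job needs, and it is assumed here that the four logical positions of save_t/read_t correspond to the $T_1$ through $T_4$ amplitudes):

# first run: save the 1-e, 2-e, and dipole integrals and the T amplitudes
set tce:save_integrals T T T F F
set tce:save_t T T T F
set tce:save_interval 5

# subsequent run: read them back instead of recomputing
# set tce:read_integrals T T T F F
# set tce:read_t T T T F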

15.9 Maximizing performance

The following are recommended parameters for getting the best performance and efficiency for common methods on various hardware configurations. The optimal settings are far from universal, and it is extremely important that users take care in how they apply these recommendations. Testing a variety of settings on a simple example is recommended when optimal settings are desired. Nonetheless, a few guiding principles will improve the performance of TCE jobs markedly, in some cases making otherwise impossible jobs possible.

15.9.1 Memory considerations

The default memory settings for NWChem are not optimal for TCE calculations. When 2 GB of memory is available per process, the following settings are close to optimal for CCSD jobs:

memory stack 800 mb heap 100 mb global 1000 mb

For property jobs, which require more amplitudes to be stored, it is wise to favor the global allocation:

memory stack 500 mb heap 100 mb global 1300 mb

If you get a ga_create error during the iterative steps, reduce the number of DIIS vectors. If this error occurs during the four-index transformation (after d_v2 filesize appears in the output), you need more GA space, a different 2emet algorithm, or more nodes.

The memory requirements for CCSD(T) are quite different because the triples are generated in local memory. The value of $tilesize$ should not be larger than $30$ in most cases and one should set something similar to the following

memory stack 1200 mb heap 100 mb global 600 mb

The local memory required will be $\sim tilesize^N$, where $N=4$ for CCSD, $N=6$ for CCSD(T) and CCSDT, and $N=8$ for CCSDTQ. One should set $tilesize\sim{15}$ for CCSDT and CCSDTQ, although symmetry will affect the local memory use significantly.

15.9.2 SCF options

For parallel jobs on clusters with poor disk performance on the file system used for scratch_dir, it is a good idea to disable disk I/O during the SCF stage of the calculation. This is done by adding
semidirect memsize N filesize 0, where N is $80\%$ of the stack memory divided by 8, since the value in this directive is the number of dwords rather than bytes. With these settings, if the aggregate memory is sufficient to store the integrals, the SCF performance will be excellent, and it will be better than if direct is set in the SCF input block. If scratch_dir is set to a local disk, then one should use as much disk as is permissible, controlled by the value of filesize. On many high-performance computers, filling up the local scratch disk will crash the node, so one cannot be careless with these settings. In addition, on many such machines the shared file system performance is better than that of the local disk (this is true for many NERSC systems).
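
For example, with the 800 MB stack suggested earlier, $80\%$ of the stack is about 640 MB, which corresponds to roughly $8\times 10^{7}$ dwords; a sketch of the corresponding SCF block (the exact value is not critical) would be:

scf
  semidirect memsize 80000000 filesize 0
end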

15.9.3 Convergence criteria

It makes no sense to converge a calculation to a precision not relevant to experiment. However, the relationship between convergence criteria and calculated quantities is not fully known for some properties. For example, the effect of the convergence criteria on the polarizability is significant in some cases. In the case of CN, convergence of $10^{-11}$ is necessary to resolve the polarizability tensor components to $10^{-2}$. However, for many systems $10^{-7}$ convergence is sufficient to get accurate results for all properties. It is important to calibrate the effect of convergence on property calculations, particularly for open-shell and post-CCSD methods, on a modest basis set before relaxing the convergence criteria too much.

15.9.4 IO schemes and integral transformation algorithms

The effect of the 2eorb keyword on memory use is huge. However, this option can only be used with IO=GA and an RHF/ROHF reference. There are a number of choices for the integral transformation algorithm when using spin-free integrals. The fastest algorithm is $2EMET=5$, but significant disk I/O is required for it, and one must set permanent_dir to a fast, shared file system for this algorithm to work. If disk performance is not good, one should use either $2EMET=3$ or $2EMET=4$, depending on how much memory is available. If one sees a ga_create error with $2EMET=3$, switch to algorithm 4 and add split 8 to the TCE input block.
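
As a hedged sketch of that fallback (the correlation method and tile sizes here are illustrative placeholders only), the corresponding TCE block might look like:

tce
  ccsd
  io ga
  2eorb
  2emet 4
  split 8
  attilesize 40
  tilesize 30
end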

