From: Alexander Shaposhnikov
Organization: Institute of Semiconductor Physics, Siberian Branch of the Russian Academy of Sciences (ISP SB RAS)
To: Dunyou Wang
Cc: nwchem-users@emsl.pnl.gov
Subject: Re: [NWCHEM] NWChem/GA MPI error
Date: Fri, 19 Jan 2007 10:39:45 +0600

Dear Mr. Wang,

The first machine is a dual Xeon "Clovertown" quad-core, 2670 MHz, 8 cores total, with 8 GB RAM. The second is a dual Opteron dual-core, 2400 MHz, 4 cores total, with 8 GB RAM.
NWChem was built as

  > make CC=icc FC=ifort

with the following environment variables:

#NWChem---------------------------------------------
export LARGE_FILES=TRUE
export NWCHEM_TOP=~/quant/nwchem
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=" nwdft gradients dftgrad stepper driver moints property nwpw"
export USE_MPI=y
export USE_MPIF=y
export LIBMPI="-lmpi_f77 -lmpi_f90 -lmpi"
export MPI_LIB=/usr/local/lib
export MPI_INCLUDE=/usr/local/include
export BLASOPT="-L/home/alex/libs -lgoto "
#NWChem--------------------

Machine file used to run with Open MPI:
------------------------------------------------
Clovertown slots=8 max_slots=8
Opteron265 slots=4 max_slots=4
---------------------------------------------

MPI command to run job.nw on N_CPU CPUs:
-----------------------------------------------------------------------
nohup mpiexec -hostfile ~/ompi_hostfile -n N_CPU \
    $NWCHEM_TOP/bin/$NWCHEM_TARGET/nwchem job.nw > $FILE &
------------------------------------------------------------------------

Output from the siosi3.nw test run:
-------------------------------------------------------------------------
 8: Opteron265 len=10
 9: Opteron265 len=10
10: Opteron265 len=10
11: Opteron265 len=10
ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'.
Clovertown cluster:0 nodes:8 master=0
Opteron265 cluster:1 nodes:4 master=8
 0: Clovertown len=10
 1: Clovertown len=10
 2: Clovertown len=10
 3: Clovertown len=10
 4: Clovertown len=10
 5: Clovertown len=10
 6: Clovertown len=10
 7: Clovertown len=10
 argument 1 = job.nw

              Northwest Computational Chemistry Package (NWChem) 5.0
              ------------------------------------------------------

                    Environmental Molecular Sciences Laboratory
                       Pacific Northwest National Laboratory
                                Richland, WA 99352

                  COPYRIGHT (C) 1994, 1995, 1996, 1997, 1998, 1999,
                       2000, 2001, 2002, 2003, 2004, 2005, 2006
            Pacific Northwest National Laboratory, Battelle Memorial Institute.
                             >>> All Rights Reserved <<<

           Job information
           ---------------

    hostname      = Clovertown
    program       = nwchem
    date          = Mon Jan 15 04:45:16 2007
    compiled      = `date`
    source        = /home/alex/quant/nwchem-5.0
    nwchem branch = 5.0
    input         = job.nw
    prefix        = siosi3.
    data base     = ./siosi3.db
    status        = startup
    nproc         = 12
    time left     = -1s

           Memory information
           ------------------

    heap   = 21299201 doubles = 162.5 Mbytes
    stack  = 21299201 doubles = 162.5 Mbytes
    global = 42598400 doubles = 325.0 Mbytes (distinct from heap & stack)
    total  = 85196802 doubles = 650.0 Mbytes
    verify   = no
    hardfail = no

           Directory information
           ---------------------
    0 permanent = .
    0 scratch   = /scr

                                NWChem Input Module
                                -------------------

                                 NWChem DFT Module
                                 -----------------

  Caching 1-el integrals

            General Information
            -------------------
          SCF calculation type: DFT
          Wavefunction type: closed shell
          No. of atoms      :   33
          No. of electrons  :  186
           Alpha electrons  :   93
            Beta electrons  :   93
          Charge            :    0
          Spin multiplicity :    1
          Use of symmetry is: off; symmetry adaption is: off
          Maximum number of iterations: 30
          AO basis - number of functions:  347
                     number of shells:     160
          A Charge density fitting basis will be used.
          CD basis - number of functions:  832
                     number of shells:     335
          Convergence on energy requested:   1.00D-06
          Convergence on density requested:  1.00D-05
          Convergence on gradient requested: 5.00D-04

              XC Information
              --------------
                   Slater Exchange Functional     1.000 local
                   VWN V Correlation Functional   1.000 local

             Grid Information
             ----------------
          Grid used for XC integration: medium
          Radial quadrature:  Mura-Knowles
          Angular quadrature: Lebedev
          Tag    B.-S. Rad.   Rad. Pts.   Rad. Cut.   Ang. Pts.
          ---    ----------   ---------   ---------   ---------
          O          0.60         49         16.0        434
          Si         1.10         88         19.0        590
          H          0.35         45         21.0        434
          Grid pruning is: on
          Number of quadrature shells: 1857
          Spatial weights used: Erf1

      Convergence Information
      -----------------------
          Convergence aids based upon iterative change in total energy or
          number of iterations. Levelshifting, if invoked, occurs when the
          HOMO/LUMO gap drops below (HL_TOL): 1.00D-02. DIIS, if invoked,
          will attempt to extrapolate using up to (NFOCK): 10 stored Fock
          matrices.
                  Damping(70%)   Levelshifting(0.5)     DIIS
                  ------------   ------------------   --------
          dE  on:    start             ASAP            start
          dE off:   2 iters          30 iters         30 iters

      Screening Tolerance Information
      -------------------------------
          Density screening/tol_rho:                   1.00D-10
          AO Gaussian exp screening on grid/accAOfunc: 14
          CD Gaussian exp screening on grid/accCDfunc: 20
          XC Gaussian exp screening on grid/accXCfunc: 20
          Schwarz screening/accCoul:                   1.00D-10
          Spatial weight screening/radius(au):         4.09D+01

      Superposition of Atomic Density Guess
      -------------------------------------
          Sum of atomic energies: -2842.55896057

      Non-variational initial energy
      ------------------------------
          Total energy = -2848.597708
          1-e energy   = -8548.884484
          2-e energy   =  3387.044398
          HOMO         =    -0.373192
          LUMO         =    -0.026200

   Time after variat. SCF:    3.0

      3 Center 2 Electron Integral Information
      ----------------------------------------
          Maximum number of 3-center 2e- integrals is: 100180288.
          This is reduced with Schwarz screening to:    34780928.
          Incore requires a per proc buffer size of:     3621355.
          The minimum integral buffer size is:            179712
          Minimum dble words available (all nodes) is:  42570277
          This is reduced (for later use) to:           42127133
          Suggested buffer size is:                      3621355

          3.621 MW buffer allocated for incore 3-center
          2e- integral storage on stack.
  The percent of 3c 2e- integrals held in-core is: 100.00

   Time prior to 1st pass:    4.0
8:8:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
8:8:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
Last System Error Message from Task 8:: Resource temporarily unavailable
9:9:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
9:9:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
Last System Error Message from Task 9:: Resource temporarily unavailable
[Opteron265:02753] MPI_ABORT invoked on rank 9 in communicator MPI_COMM_WORLD with errorcode 0
10:10:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
Last System Error Message from Task 10:: Resource temporarily unavailable
[Opteron265:02754] MPI_ABORT invoked on rank 10 in communicator MPI_COMM_WORLD with errorcode 0
11:11:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
11:11:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
Last System Error Message from Task 11:: Resource temporarily unavailable
[Opteron265:02755] MPI_ABORT invoked on rank 11 in communicator MPI_COMM_WORLD with errorcode 0
forrtl: error (78): process killed (SIGTERM)
0:Terminate signal was sent, status=: 15
0:Terminate signal was sent, status=: 15
Last System Error Message from Task 0:: Inappropriate ioctl for device
4:armci_rcv_data: read failed: -1
4:armci_rcv_data: read failed: -1
Last System Error Message from Task 4:: Operation now in progress
[Clovertown:04415] MPI_ABORT invoked on rank 4 in communicator MPI_COMM_WORLD with errorcode -1
5:armci_rcv_data: read failed: -1
5:armci_rcv_data: read failed: -1
Last System Error Message from Task 5:: Operation now in progress
[Clovertown:04416] MPI_ABORT invoked on rank 5 in communicator MPI_COMM_WORLD with errorcode -1
6:armci_rcv_data: read failed: -1
6:armci_rcv_data: read failed: -1
Last System Error Message from Task 6:: Resource temporarily unavailable
[Clovertown:04417] MPI_ABORT invoked on rank 6 in communicator MPI_COMM_WORLD with errorcode -1
7:armci_rcv_data: read failed: -1
7:armci_rcv_data: read failed: -1
Last System Error Message from Task 7:: Resource temporarily unavailable
[Clovertown:04418] MPI_ABORT invoked on rank 7 in communicator MPI_COMM_WORLD with errorcode -1
1:Terminate signal was sent, status=: 15
-10000(s):Terminate signal was sent, status=: 15
-10000(s):Terminate signal was sent, status=: 15
1:Terminate signal was sent, status=: 15
Last System Error Message from Task 1:: Resource temporarily unavailable
2:Terminate signal was sent, status=: 15
2:Terminate signal was sent, status=: 15
Last System Error Message from Task 2:: Resource temporarily unavailable
3:Terminate signal was sent, status=: 15
3:Terminate signal was sent, status=: 15
Last System Error Message from Task 3:: Operation now in progress
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
mpiexec: killing job...
--------------------------------------------------------------------------------------

Best Regards,
Alexander Shaposhnikov

On Friday 19 January 2007 04:39, you wrote:
> Dear Alexander,
>
> Would you please tell us some detailed information about your machine?
> You said you tested on two nodes with 8 and 4 CPUs; would you please
> clarify the way you requested the CPUs on each node?
>
> My colleague, Bruce Palmer, who is in charge of the mirrored array part
> in GA, is going to take a look at your problem, so would you also send
> in your output from the testing?
> Best regards
> Dunyou Wang
>
> Alexander Shaposhnikov wrote:
> > Hi,
> > trying to run the siosi3.nw test on two nodes with 8 and 4 CPUs
> > connected with dual gigabit ethernet, with NWChem 5.0 built with the
> > Intel ifort and icc 9.1 EM64T compilers and Open MPI for message
> > passing (system: SUSE 10.2, 64-bit), I am getting the following error:
> > /////////////////////////////////////////////////////////////////////////
> > Time prior to 1st pass: 4.0
> > 8:8:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
> > 8:8:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
> > Last System Error Message from Task 8:: Resource temporarily unavailable
> > 9:9:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
> > 9:9:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
> > Last System Error Message from Task 9:: Resource temporarily unavailable
> > [Opteron265:03029] MPI_ABORT invoked on rank 9 in communicator
> > MPI_COMM_WORLD with errorcode 0
> > 10:10:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
> > 10:10:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
> > Last System Error Message from Task 10:: Resource temporarily unavailable
> > [Opteron265:03030] MPI_ABORT invoked on rank 10 in communicator
> > MPI_COMM_WORLD with errorcode 0
> > 11:11:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
> > 11:11:ga_copy:ga_merge_mirrored:nga_access_ptr:locate top failed:: 0
> > Last System Error Message from Task 11:: Resource temporarily unavailable
> > [Opteron265:03031] MPI_ABORT invoked on rank 11 in communicator
> > MPI_COMM_WORLD with errorcode 0
> > forrtl: error (78): process killed (SIGTERM)
> > 0:Terminate signal was sent, status=: 15
> > 0:Terminate signal was sent, status=: 15
> > Last System Error Message from Task 0:: Inappropriate ioctl for device
> > 4:armci_rcv_data: read failed: -1
> > 4:armci_rcv_data: read failed: -1
> > Last System Error Message from Task 4:: Operation now in progress
> > [Clovertown:04601] MPI_ABORT invoked on rank 4 in communicator
> > MPI_COMM_WORLD with errorcode -1
> > 5:armci_rcv_data: read failed: -1
> > 5:armci_rcv_data: read failed: -1
> > Last System Error Message from Task 5:: Operation now in progress
> > [Clovertown:04608] MPI_ABORT invoked on rank 5 in communicator
> > MPI_COMM_WORLD with errorcode -1
> > 6:armci_rcv_data: read failed: -1
> > 6:armci_rcv_data: read failed: -1
> > Last System Error Message from Task 6:: Resource temporarily unavailable
> > [Clovertown:04609] MPI_ABORT invoked on rank 6 in communicator
> > MPI_COMM_WORLD with errorcode -1
> > 7:armci_rcv_data: read failed: -1
> > 7:armci_rcv_data: read failed: -1
> > Last System Error Message from Task 7:: Resource temporarily unavailable
> > [Clovertown:04610] MPI_ABORT invoked on rank 7 in communicator
> > MPI_COMM_WORLD with errorcode -1
> > 1:Terminate signal was sent, status=: 15
> > -10000(s):Terminate signal was sent, status=: 15
> > -10000(s):Terminate signal was sent, status=: 15
> > 1:Terminate signal was sent, status=: 15
> > Last System Error Message from Task 1:: Resource temporarily unavailable
> > 2:Terminate signal was sent, status=: 15
> > 2:Terminate signal was sent, status=: 15
> > Last System Error Message from Task 2:: Resource temporarily unavailable
> > 3:Terminate signal was sent, status=: 15
> > 3:Terminate signal was sent, status=: 15
> > Last System Error Message from Task 3:: Operation now in progress
> > forrtl: error (78): process killed (SIGTERM)
> > forrtl: error (78): process killed (SIGTERM)
> > forrtl: error (78): process killed (SIGTERM)
> > forrtl: error (78): process killed (SIGTERM)
> > /////////////////////////////////////////////////////////////////////////
> >
> > The task runs fine on each node separately. I can also run other MPI
> > applications across the two nodes. Any suggestion as to the source of
> > this problem and how to fix it is welcome.
> >
> > Best Regards,
> > Alexander Shaposhnikov
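
P.S. For what it's worth, the memory figures in the pasted output are internally consistent: NWChem reports sizes in 8-byte doubles, heap + stack + global equals the reported total, and the Mbyte values follow from the double counts (a quick arithmetic check in plain Python, using only the numbers printed above):

```python
BYTES_PER_DOUBLE = 8
MB = 1024 * 1024  # the output reports binary megabytes

heap, stack, glob, total = 21299201, 21299201, 42598400, 85196802  # doubles

# The components should sum to the reported total.
assert heap + stack + glob == total

# Each count, converted to megabytes, matches the printed value.
print(round(heap * BYTES_PER_DOUBLE / MB, 1))   # 162.5
print(round(glob * BYTES_PER_DOUBLE / MB, 1))   # 325.0
print(round(total * BYTES_PER_DOUBLE / MB, 1))  # 650.0
```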