__color__ __group__ ticket summary component version type owner created _changetime _description _reporter 1 None 428 MPI_Win_fence memory consumption mpich2 None bug jayesh * 1235780829 1236984277 {{{ Hi List, the attached test program uses MPI_Accumulate/MPI_Win_fence for one sided communication with derived datatype. The program runs fine with mpich2-1.1a2 except for my debugging version of MPICH2 compiled with ./configure --with-device=ch3:nemesis --enable-g=dbg,mem,meminit In this case the MPI_Win_fence on the target side comuses about 90% of main memory (e.g. > 3 GB). As the behaviour is completely different for predefined datatypes, I suspect that the memory consumption is related to the construction of the derived datatype on the target side. Is there a workaround for this? Thanks + Best regards, Dorian }}} Dorian Krause 1 None 462 test1_dt failed in nightlies mpich2 None bug jayesh 1237132067 1237223429 test1_dt (rma) failed in the nightlies. We should consider this separately from the attribute test failures (which were fixed last night). jayesh 2 None 29 nemesis ext_procs optimization mpich2 None bug buntinas 1217598154 1236632771 In r975 I committed a rough cut of dynamic processes for nemesis newtcp. In mpid_nem_inline.h I commented out an optimization that uses MPID_nem_mem_region.ext_procs because it prevents the proper operation of dynamic processes. Unfortunately, removing it adds ~100ns to our zero-byte message latencies. So there is a FIXME in the code that reads like this: {{{ /* FIXME the ext_procs bit is an optimization for the all-local-procs case. This has been commented out for now because it breaks dynamic processes. Some other solution should be implemented eventually, possibly using a flag that is set whenever a port is opened. [goodell@ 2008-06-18] */ }}} In general, this won't affect real uses who run any inter-node jobs, since they were already polling every time anyway. However, it does hurt those wonderful microbenchmarks. A hack fix is to leave this in but also check to see if a port has been opened. A possibly better fix is to only poll the network every X iterations of "poll everything", where X is some tunable parameter. This req is a reminder for this FIXME. -Dave goodell 2 None 79 Nemesis support for non-Intel/AMD platforms mpich2 None bug buntinas 1218473627 1236632795 (This is a resend of one of the lost emails) Should we make the default device depend on whether we're intel-Unix? Bill William Gropp Paul and Cynthia Saylor Professor of Computer Science University of Illinois Urbana-Champaign William Gropp 2 None 148 multiple netmod support mpich2 None bug buntinas * 1221675882 1221683096 This is a place holder for supporting multiple netmods simultaneously in nemesis * poll on multiple netmods * configure which vcs use which netmods buntinas 2 None 149 Define netmod interface mpich2 None bug buntinas * 1221675971 1236632814 Place holder for defining the netmod interface * versioning * allow for future modifications buntinas 2 None 152 PAC_F90_CHECK_COMPILER_OPTION + pgf90 mpich2 None bug chan 1221712853 1237058207 PAC_F90_CHECK_COMPILER_OPTION rejects -O2 as a valid compiler flag for pgf90, because the compiler produces different output (just the file name) with -O2 when linked with object file compiled without -O2. Here is the related config.log output: configure:10592: checking whether routines compiled with -O2 can be linked with ones compiled without -O2 configure:10598: pgf90 -c conftest2.f90 >conftest2.out 2>&1 configure:10601: $? 
= 0 configure:10603: pgf90 -O2 -o conftest conftest2.o conftest.f90 >conftest.bas 2>&1 configure:10606: $? = 0 configure: Compiler output differed in two cases 0a1 > conftest.f90: configure:10651: result: no A.Chan Anthony Chan 2 None 165 Config and binary file conflict mpich2 None bug balaji 1222202918 1224791129 Hi, has anyone let you guys know about the file conflict in your "mpd" projects (Music Player Daemon and Multi Processing Daemon) yet? This old bug report summarizes it, and aside from telling the package manager not to install both on the same system, nothing has happened since: http://bugs.gentoo.org/145367 I've run into the same problem and was wondering if you'd be willing to do something about it? Debian calls its mpd "mpich-mpd-bin". Why not just "mpich-mpd" I'm not sure but that sounds reasonable to me. On the other hand, MusicPD is pretty much standalone and there are probably less scripts out in the wild that have its name hardcoded---you start in an init script and that's it. So that would perhaps be easier to change: /usr/bin/musicplayerd and /etc/musicplayerd.conf? Would be great if you guys could work something out... cheers, _Matthias -- I prefer encrypted and signed messages. KeyID: FAC37665 Fingerprint: 8C16 3F0A A6FC DF0D 19B0 8DEF 48D9 1700 FAC3 7665 Matthias Bethke 2 None 179 Fw: [ROMIO Req #936] Inconsistent and incorrect use of MPIR_Nest_incr and MPIR_Nest_decr mpich2 None bug None 1222886214 1222886214 Forwarding to Trac. > From: William Gropp > To: romio-maint@mcs.anl.gov > Content-Type: text/plain; charset="US-ASCII"; format=flowed; delsp=yes > Content-Transfer-Encoding: 7bit > Mime-Version: 1.0 (Apple Message framework v929.2) > Subject: [ROMIO Req #936] Inconsistent and incorrect use of MPIR_Nest_incr and MPIR_Nest_decr and MPI routines > Date: Mon, 22 Sep 2008 09:09:07 -0500 > X-Mailer: Apple Mail (2.929.2) > X-Virus-Scanned: by amavisd-new-20030616-p10 (Debian) at mailgw.mcs.anl.gov > Cc: romio-maint@mcs.anl.gov > > I enabled the nesting tests in MPICH2 and found a number of problems > in the ROMIO code (particularly in iwrite.c and iread.c). In looking > at these files, I saw no need for the MPIR_Nest_incr or Nest_decr . > These macros should only be used when calling an MPI routine (not an > MPIR or other internal implementation routine). Their purpose is to > tell the MPI routine not to invoke the MPI error handler but instead > to return an error code; they're also currently used to avoid > recursive calls to the global thread mutex; again, this only applies > to the MPI routines, not the internal routines. > thakur@mcs.anl.gov (Rajeev Thakur) 2 None 221 Fwd: MPICH2 bug? (attributes) mpich2 None bug gropp 1224603493 1230652737 {{{ Jeff was kind enough to point out this bug in our attribute handling. I've tested it out on my mac with gcc and g77 and I definitely get a bus error in case 4. His tests are in the attached tarball and something along these lines should probably be added to our test suites. -Dave Begin forwarded message: {{{ > From: Jeff Squyres > Date: October 21, 2008 Oct 21 8:28:17 AM CDT > To: Dave Goodell > Subject: MPICH2 bug? > > Yo Dave -- > > MPI attributes suck. I was the poor schlep who was tapped to write > them in OMPI, and it took me a *long* time to get them right. In > doing so, I came up with what I thought were 9 discrete cases for > reading and writing attributes. 
I outline the details in the > comment beginning here: > > https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/attribute/ > attribute.c#L23 > > To make sure I got this stuff right, I wrote up a test program that > checks all 9 cases. MPICH2 seems to segv on case 4 (I didn't > really dig any further than that). Can you check it out? > > -- > Jeff Squyres > Cisco Systems }}} }}} goodell 2 None 258 Fwd: Reducing MPI_REAL16's mpich2 None bug None 1225299144 1225307529 {{{ Yet another wacky inter-language issue to check on... Begin forwarded message: > From: Jeff Squyres > Date: October 29, 2008 Oct 29 11:25:21 AM CDT > To: Dave Goodell > Subject: Reducing MPI_REAL16's > > Yo Dave -- > > In the genre of obscure MPI bugs... Will MPICH2 also have this > same issue? > > https://svn.open-mpi.org/trac/ompi/ticket/1603 > > See comments 1, 2, and 4 in particular. > > -- > Jeff Squyres > Cisco Systems > }}} goodell 2 None 277 Eliminating MPICH2ism from test suite mpich2 None bug gropp 1226099615 1226349256 I'm running the MPICH2 test suite under the IBM MPI, and I've found a number of problems. Some are ambiguities in the MPI spec that have been fixed in 2.1; some are improper use of MPICH internals (there were some refs to status.count and status.cancelled, neither of which is valid MPI). Some are unsupported features in the IBM MPI. Some appear to be bugs in the IBM MPI, for which I'll probably want to enhance the output from the tests in those cases. This is a place holder for the updates. gropp 2 None 278 MPICH2-1.0.8 on windows with gfortran mpich2 None bug jayesh 1226099669 1236888782 {{{ BTW: cygwin now has gfortran available. So you might want to support it aswell in the binary distribution. [difference being single underscore vs double underscore fortran symbols]. Its available as gfortran-4.exe as part of gcc4-gfortran package. Satish }}} Satish Balay 2 None 279 Re: MPICH2-1.0.8 on windows with Compaq f90 mpich2 None bug jayesh 1226156500 1236891730 {{{ Changing to INTEGER (KIND=4) gets this going and I have a successful configure & build of PETSc with it.. [as mentioned in the previous e-mail using 'INTEGER' on the 32bit windows install might work with both g77 & compaq f90] Satish On Fri, 7 Nov 2008, Satish Balay wrote: > > This is with compaq f90 on windows. It support (KIND=4) - but not > (KIND=8)[its an old compiler - but I think some folks still use it - > as it goes with VC6, so I test PETSc with it] > > Satish > > ----------------------------- > > Checking for header mpif.h > sh: /home/sbalay/petsc-dev/bin/win32fe/win32fe f90 -c -o conftest.o -threads -debug:full -fpp:-m -I/cygdrive/c/Program\ Files/MPICH2/include conftest.F > Executing: /home/sbalay/petsc-dev/bin/win32fe/win32fe f90 -c -o conftest.o -threads -debug:full -fpp:-m -I/cygdrive/c/Program\ Files/MPICH2/include conftest.F > sh: conftest.i^M > c:\PROGRA~1\MPICH2\INCLUDE\mpif.h(404) : Error: This is not a valid data type. [KIND]^M > INTEGER (KIND=8) MPI_DISPLACEMENT_CURRENT^M > ----------------^^M > > Possible ERROR while running compiler: conftest.i^M > c:\PROGRA~1\MPICH2\INCLUDE\mpif.h(404) : Error: This is not a valid data type. [KIND]^M > INTEGER (KIND=8) MPI_DISPLACEMENT_CURRENT^M > ----------------^^M > ret = 256 > Source: > program main > include 'mpif.h' > end > > > }}} Satish Balay 2 None 284 Retrofit job attributes to PMI v1.1 mpich2 None bug buntinas 1226437293 1236620279 There is a function called PMI_Get_clique_ranks() which is not part of the PMI v1 interface and is not implemented in all process managers. 
The proposal was made to implement PMI v2's job attributes functionality into PMI v1 (and bump the version number to 1.1) to address the need for getting info on local processes from the process manager, in a more general way. This is a placeholder to remind us to do this. buntinas 2 None 287 Trunk warnings: "...Handles still allocated" mpich2 None bug 1226523478 1233517929 When I run pt2pt/scancel & some io tests on trunk I get the following message when compiled with "--enable-strict --enable-g=all", ############################################################################# shakey:/sandbox/jayesh/freshBuild/mpich2/test/mpi/pt2pt> mpiexec -n 2 scancel No Errors In direct memory block for handle type REQUEST, 3 handles are still allocated ############################################################################# Regards, Jayesh jayesh 2 None 288 Re: [mpi-all-commits] r3512 - mpich2/branches/dev/knem/src/mpid/ch3/include mpich2 None bug None 1226587769 1226588955 {{{ The request creation is on the critical path for latency - the intent was that the parts of the code that needed these fields was responsible for setting them before using them. While these changes are reasonable temporary fixes, we need to reduce the overhead in request creation and management. One starter would be to correct the code that should have set the fields that are now nulled out here - where are they? Bill On Nov 13, 2008, at 8:26 AM, goodell@mcs.anl.gov wrote: > Author: goodell > Date: 2008-11-13 08:26:51 -0600 (Thu, 13 Nov 2008) > New Revision: 3512 > > Modified: > mpich2/branches/dev/knem/src/mpid/ch3/include/mpidimpl.h > Log: > Merge r3511 from trunk -> knem. This zeroes out some additional > fields in MPID_Requests. > > > Modified: mpich2/branches/dev/knem/src/mpid/ch3/include/mpidimpl.h > =================================================================== > --- mpich2/branches/dev/knem/src/mpid/ch3/include/mpidimpl.h > 2008-11-13 14:20:26 UTC (rev 3511) > +++ mpich2/branches/dev/knem/src/mpid/ch3/include/mpidimpl.h > 2008-11-13 14:26:51 UTC (rev 3512) > @@ -316,6 +316,7 @@ > (sreq_)->comm = comm; \ > (sreq_)->cc = 1; \ > (sreq_)->cc_ptr = &(sreq_)->cc; \ > + (sreq_)->partner_request = NULL; \ > MPIR_Comm_add_ref(comm); \ > (sreq_)->status.MPI_ERROR = MPI_SUCCESS; \ > (sreq_)->status.cancelled = FALSE; \ > @@ -331,6 +332,8 @@ > (sreq_)->dev.segment_ptr = NULL; \ > (sreq_)->dev.OnDataAvail = NULL; \ > (sreq_)->dev.OnFinal = NULL; \ > + (sreq_)->dev.iov_count = NULL; \ > + (sreq_)->dev.iov_offset = NULL; \ > } > > /* This is the receive request version of MPIDI_Request_create_sreq */ > @@ -353,11 +356,14 @@ > (rreq_)->cc_ptr = &(rreq_)->cc; \ > (rreq_)->status.MPI_ERROR = MPI_SUCCESS; \ > (rreq_)->status.cancelled = FALSE; \ > + (rreq_)->partner_request = NULL; \ > (rreq_)->dev.state = 0; \ > (rreq_)->dev.cancel_pending = FALSE; \ > (rreq_)->dev.datatype_ptr = NULL; \ > (rreq_)->dev.segment_ptr = NULL; \ > (rreq_)->dev.iov_offset = 0; \ > + (rreq_)->dev.OnDataAvail = NULL; \ > + (rreq_)->dev.OnFinal = NULL; \ > MPIDI_CH3_REQUEST_INIT(rreq_);\ > } > > William Gropp Deputy Director for Research Institute for Advanced Computing Applications and Technologies Paul and Cynthia Saylor Professor of Computer Science University of Illinois Urbana-Champaign }}} William Gropp 2 None 292 Some thread tests hang with Nemesis and SMPD mpich2 None bug jayesh * 1226850709 1228324911 {{{ Looks like with Nemesis and SMPD, the following thread tests fail: alltoall, multisend, multispawn, taskmaster 
http://www.mcs.anl.gov/research/projects/mpich2/nightly/new/latest/run_20568 10692/test_1/make_testing.html Rajeev }}} "Rajeev Thakur" 2 None 299 ERROR MESSAGE: ../../../include/mpitypedefs.h:17:25: sys/bitypes.h: No such file or directory mpich2 None bug None 1227254507 1228235134 {{{ Dear Sir/Mdm, I am trying to install MPICH2 into CYGWIN and encountered this error message at the end of my make.log file. Attached is my make.log file for your kind consideration. I would like to request for some advice about it, please. Thank you. regards, Cornelius 1) This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its contents to any other person. 2) If you believe you're receiving this e-mail in error or prefer not to receive publicity materials from Data Storage Institute, please send e-mail to admin@dsi.a-star.edu.sg with the subject "Unsubscribe". Please remember to include the body text received. Removal requests will be honored and respected. Please allow 1 to 3 days for processing. You may still receive other e-mails from us within the grace period. 3) As an anti-virus measure, our mail server rejects the following attachments: *.bat, *.com, *.cmd, *.exe, *.hta, *.Ink, *.pif, *.scr, *.shs; *.vb*; *.{*, *.js, *.sct, *.wsh, *.jse, *.swf. If you need to send us an attachment of this type, please contact us at helpline@dsi.a-star.edu.sg. Thank you! }}} "Lee Mun Chiew, Cornelius" 2 None 300 cross-compiling is broken in MPICH2-1.1a2 mpich2 None bug gropp 1227293869 1227567796 {{{ The culprit is PAC_CC_FUNCTION_NAME_SYMBOL finding its way into the top-level configure.in. It invokes AC_RUN_IFELSE, which won't work when cross-compiling. I note that there is other code in configure.in to test for the presence of __func__ and such, lacking only the check to see that it works correctly. (Would a compiler really ever provide this without it working correctly?) Another approach would be to forget the autoconf stuff and make the use of __func__ conditional under something like this: #if __STDC__ && __STDC_VERSION__ >= 199901L The downside is that you miss out on usage for pre-C99 compilers that support __func__. But from the looks of what Globus is doing with this, that doesn't seem like a big deal. -dg }}} David Gingold 2 None 304 Mem leak during error condns in MPIR/MPIC* funcs mpich2 None bug jayesh 1227553867 1237158827 This is a placeholder to remind us to cleanup memory in error cases for MPIR/MPIC* functions. eg: In bcast.c we have the following code, MPIR_Bcast(){ ... if (!is_contig || !is_homogeneous) { tmp_buf = MPIU_Malloc(nbytes); ... } ... if ((nbytes < MPIR_BCAST_SHORT_MSG) || (comm_size < MPIR_BCAST_MIN_PROCS)) { ... while (mask < comm_size) { if (relative_rank & mask) { ... if (mpi_errno != MPI_SUCCESS) { /* FIXME: tmp_buf NOT FREED IN THIS CASE */ MPIU_ERR_POP(mpi_errno); } ... } mask <<= 1; } ... } if (!is_contig || !is_homogeneous) { ... MPIU_Free(tmp_buf); } fn_exit: ... fn_fail: ... } There are many cases like these in the MPIR/MPIC* funcs. A good fix would be to get rid of MPIU_Malloc() and use MPIU_CHKLMEM_MALLOC()/MPIU_CHKLMEM_FREEALL() instead. Regards, Jayesh jayesh 2 None 307 about /iface:mixed_str_len_arg mpich2 None bug None 1227634426 1229106357 {{{ Dear MPI developing group, I am trying to run a FORTRAN code (Intel Fortran compilier). In my code I need to use the compilier option: "/iface:mixed_str_len_arg". 
Unfortunately MPICH2 does not support this mixed_str_len_arg. I tried compiling the MPICH2 source code with the mixed_str_len_arg option, but it still does not work. Do you know how to compile an MPI version that supports /iface:mixed_str_len_arg (based on the Intel Fortran compiler)? Cheers, Wei Yao }}} "Wei Yao" 2 None 315 Minor wart during MPE install mpich2 None bug chan 1228236112 1228430941 {{{ I saw this flash by when doing an install: Installing SLOG2SDK's share mkdir: /Users/gropp/tmp/mpi2-inst-nemesis/share/logfiles: File exists *** Error making directory /Users/gropp/tmp/mpi2-inst-nemesis/share/logfiles. *** Installed MPE2 in /Users/gropp/tmp/mpi2-inst-nemesis Bill William Gropp Deputy Director for Research Institute for Advanced Computing Applications and Technologies Paul and Cynthia Saylor Professor of Computer Science University of Illinois Urbana-Champaign }}} William Gropp 2 None 321 f77/buildiface bugs around HAVE_MULTIPLE_PRAGMA_WEAK mpich2 None bug None 1228498140 1229624590 {{{ This fixes a couple of problems in the f77/buildiface script. One is a typo (SYBMOLS instead of SYMBOLS), the other was a scope error hiding behind that. -dg .... Index: src/binding/f77/buildiface =================================================================== --- src/binding/f77/buildiface (revision 65243) +++ src/binding/f77/buildiface (working copy) @@ -820,7 +820,7 @@ #endif /* USE_WEAK_SYMBOLS */\ /* End MPI profiling block */\n\n"; - &AddFwrapWeakName( $lcname, $ucname ); + &AddFwrapWeakName( $lcname, $ucname, $args ); } } @@ -3567,11 +3567,11 @@ # Allow multiple underscore versions of names # but without the PMPI versions (needed for the wrapper library) sub AddFwrapWeakName { - my ($lcname, $ucname) = @_; + my ($lcname, $ucname, $args) = @_; print $OUTFD " /* These definitions are used only for generating the Fortran wrappers */ -#if defined(USE_WEAK_SYBMOLS) && defined(HAVE_MULTIPLE_PRAGMA_WEAK) && \\ +#if defined(USE_WEAK_SYMBOLS) && defined(HAVE_MULTIPLE_PRAGMA_WEAK) && \\ defined(USE_ONLY_MPI_NAMES)\n"; &print_weak_decl( $OUTFD, "MPI_$ucname", $args, $lcname ); &print_weak_decl( $OUTFD, "mpi_${lcname}__", $args, $lcname ); }}} David Gingold 2 None 325 maint/updatefiles fails to create configure on niagara mpich2 None bug gropp 1228778619 1234825132 Running maint/updatefiles on niagara1 doesn't create configure files. No error is reported by maint/updatefiles. Perhaps another change in the last 2 months improved the error reporting. buntinas 2 None 331 corrupted block allocated in segment.c[222] mpich2 None bug None 1229615379 1229965347 {{{ Hi, I'm using MPICH2-1.0.8 on CentOS 5.2 with MPI_THREAD_MULTIPLE. I configured it with: CC="icc" ./configure --prefix=$HOME/mpich2-icc --with-device=ch3:sock --with-thread-package=pthreads --enable-threads --enable-error-checking=all --enable-error-messages=all --enable-timing=all --enable-g=all --disable-fast I'm running my program with MALLOC_CHECK_=1 and got the following message repeated many times in stderr: [4] Block at address 0x0000000008a7cce0 is corrupted (probably write past end) [4] Block allocated in segment.c[222] My program ended with this message after running for over 5 hours on 8 nodes: rank 7 in job 1 node40_45659 caused collective abort of all ranks exit status of rank 7: killed by signal 9 I don't know if this helps or if these messages came at the same time. Please let me know if you need more information or if there is a way I can log more information.
I use MPI_Send, MPI_Bsend, MPI_Ssend, MPI_Recv, MPI_Barrier, and MPI_Reduce in my program. This is the first time I've seen this error, but in previous runs I have had messages that say: job aborted; reason = mpd disappeared but I don't know if this is related. Cheers, Shawn }}} "Shawn Poindexter" 2 None 333 a problem with Fortran and mpe_logf on Windows mpich2 None bug jayesh * 1229682354 1234893136 {{{ Dear all, I have installed MPICH2 on my PC running Windows XP and Digital Visual Fortran 6.0. All things are OK but I can't generate a clog file after running wmpiexec. If (include 'mpe_logf.h') is added to the source.f, many errors are reported during the link operation. The considered source file and the generated errors are attached. Please tell me how I can solve this problem. Thank you. Alaa El-nashar }}} alaa nashar 2 None 343 [mpich-discuss] request to enhance jumpshot script for portability mpich2 None bug None 1230648771 1230662103 {{{ version used: 1.0.8 the current jumpshot script created by config has the JAVA path and MPICH2 installation path hard coded in. This makes shipping jumpshot to a different env impossible without hacking. I am requesting that the jumpshot/jumpshot.in code (in src/mpe2/src/slog2sdk/bin) be enhanced with the following changes: # Set JAVA environment if [ "XX${JRE}" = "XX" ] ; then JVM=/bin/java ### else JVM=${JRE}/bin/java fi JVMOPTS="" ### # Assume user's environmental JVMFLAGS is better than what configure found. JVMFLAGS=${JVMFLAGS:-${JVMOPTS}} ### # Set PATH to various jar's needed by the GUI MPIEXEC_PATH=`which mpiexec` echo ${MPIEXEC_PATH} if [ "XX${MPIEXEC_PATH}" = "XX" ] ; then GUI_LIBDIR=/lib ### else GUI_LIBDIR=$(dirname $(dirname $MPIEXEC_PATH))/lib fi where paths in <> are hard coded like the existing code. the env var JRE (or any name the MPICH2 group prefers) decides where to pick up the JRE. lines marked ### are lines from the 1.0.8 release. this will make relocating jumpshot to a different system easy. thanks tan }}} chong tan 2 None 354 [mpich2-dev] followup, smpd + mpiexec_rsh.c mpich2 None bug jayesh 1231876471 1236108778 {{{ Another bug with smpd and mpiexec_rsh startup: the working directory is never passed to the rsh invocations. E.g. I run the job from /home/frey/mpitest and have the program attempting to open "test.in" and it fails since the "rsh" puts me in /home/frey. So relative paths will never work with mpiexec_rsh startup. :::::::::::::::::::::::::::::::::::::::::::::::::::::: Jeffrey T. Frey, Ph.D. Systems Programmer IV / Cluster Management Network & Systems Services / College of Engineering University of Delaware, Newark DE 19716 Office: (302) 831-6034 Mobile: (302) 419-4976 http://turin.nss.udel.edu/ 99 A1 7F 5E 71 70 8A 38 3C 4A A2 B1 4D 0A B2 49 :::::::::::::::::::::::::::::::::::::::::::::::::::::: }}} Jeffrey Frey 2 None 355 RE: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4 computers mpich2 None bug jayesh 1231882249 1232032590 {{{ Hi, From the error codes in the hostname tests it looks like Computer1 (where the shared network folder resides) is unable to handle the number of connections to it. ############ Error code desc from MS ############ ERROR_REQ_NOT_ACCEP (71 0x47) : No more connections can be made to this remote computer at this time because there are already as many connections as the computer can accept.
############ Error code desc from MS ############ We should retry (but we do not) in this case. Can you verify that the existing network mapped drive connections are cleanedup in all the machines (Type "net use" in a command prompt on each machine to view the existing network mapped conns)? Regards, Jayesh _____ From: Tina Tina [mailto:gucigu@gmail.com] Sent: Tuesday, January 13, 2009 3:21 PM To: Jayesh Krishna Subject: Re: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4 computers Dear Community! I started testng with the exampel cpi.exe program (so the problem is not in my program). I run the following command for all computers X=(1..8) and everything worked ok: "C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\Computer1\MPI$ -wdir X:\CPI\ -hosts 1 ComputerX -machinefile "C:\Program Files\MPICH2\bin\machines.txt" X:\CPI\cpi.exe Than I ran the following command: "C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\Computer1\MPI$ -wdir X:\CPI\ -n X -machinefile "C:\Program Files\MPICH2\bin\machines.txt" X:\CPI\cpi.exe Note: I also changed the machines.txt file as you suggested (adding :1). The result was the following for X up to 5 it worked ok (I did only one test run). But when I tested with X=6 (aka. on 6 computers). I got the following error: launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer2' failed, error 3 - The system cannot find the path specified. On next run with X=6: launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer2' failed, error 3 - The system cannot find the path specified. launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer6' failed, error 3 - The system cannot find the path specified. launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer3' failed, error 3 - The system cannot find the path specified. launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer5' failed, error 3 - The system cannot find the path specified. launch failed: CreateProcess(X:\CPI\cpi.exe) on 'Computer4' failed, error 3 - The system cannot find the path specified. On next run with X=6: I got the same error as on the first run. And this errors were repeating on and on and on ... most of the times the error with only one computer and in most cases it was the second computer in the machinefile list. But not necesary. When there were more than one launch failed errors (like in second case) the order could be also different. In 20 tries not one was successfull. Than just for kicks I tried with X=8 I got the same errors with random number of launch failed errors and more or less random ComputerX that reported this. But every now or than I got one of the following errors (after the list of launch failed errors): 1) unable to post a write for the next command, sock error: generic socket failure, error stack: MPIDU_Sock_post_writev(1768): An established connection was aborted by the software in your host machine. (errno 10053) unable to post a write of the close command to tear down the job tree as part of the abort process. unable to post an abort command. 2) unable to post a read for the next command header, sock error: generic socket failure, error stack: MPIDU_Sock_post_readv(1656): An existing connection was forcibly closed by the remote host. (errno 10054) unable to post a read for the next command on left context. 3) unable to read the cmd header on the left context, socket connection closed. Hope this info helps Regards P.S.: I tried a couple of runs with X=5 and got mixed results, on some runs it worked ok on some it did not. Basically the same as with my program. 
So I would still say, as the number of computers increases, the problem gets worse. P.P.S.: Almost forgot to test the hostname. Here are the results of two runs. "C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\computer1\MPI$ -wdir X:\CPI\ -n 8 -machinefile "C:\Program Files\MPICH2\bin\machines.txt" hostname *********** Warning ************ Unable to map \\computer1\MPI$. (error 71) *********** Warning ************ *********** Warning ************ Unable to map \\computer1\MPI$. (error 71) *********** Warning ************ computer4 computer1 computer8 computer2 *********** Warning ************ Unable to map \\computer1\MPI$. (error 71) *********** Warning ************ computer7 computer5 computer3 *********** Warning ************ Unable to map \\computer1\MPI$. (error 71) *********** Warning ************ computer6 "C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\computer1\MPI$ -wdir X:\CPI\ -n 8 -machinefile "C:\Program Files\MPICH2\bin\machines.txt" hostname *********** Warning ************ Unable to map \\computer1\MPI$. (error 71) *********** Warning ************ *********** Warning ************ Unable to map \\computer1\MPI$. (error 71) *********** Warning ************ computer3 computer7 computer5 computer1 computer4 computer8 computer2 computer6 2009/1/13 Jayesh Krishna Hi, # Do you get any error message related to mapping network drives when you ran your job ? Please provide us with the command+output of your MPI job (Copy-paste your complete mpiexec command and its output in your email). # Can you run a command like (Note that I have removed "-noprompt" option), mpiexec -map x:\\computer1\MPI -wdir x:\ -n 8 -machinefile testallnamesmf.txt hostname with the following contents in the machinefile (testallnamesmf.txt - contains all the computer/host names - Note that I specify that only 1 MPI process be launched on each host using "hostname:1" syntax), computer1:1 -ifhn 192.168.1.1 computer2:1 -ifhn 192.168.1.2 ... computer8:1 -ifhn 192.168.1.8 # Does your program fail consistently for certain computers ? Try running a simple job (mpiexec -map x:\\computer1\MPI -wdir x:\ -n 1 -machinefile testmf.txt hostname) with only specifying 1 computer/host at a time. # Try removing "-noprompt" from the mpiexec command and see if mpiexec prompts you for anything (password, inputs etc). Regards, Jayesh _____ From: mpich-discuss-bounces@mcs.anl.gov [mailto:mpich-discuss-bounces@mcs.anl.gov] On Behalf Of Tina Tina Sent: Tuesday, January 13, 2009 12:01 PM To: mpich-discuss@mcs.anl.gov Subject: [mpich-discuss] MPICH2 1.1a2 - problems with more than 4 computers Dear Community! I am using the latest version of MPICH2 for Windows (the problem occurs also on 1.0.8). I have 8 computers connected over giga-bit switch. I have written a program that uses MPI for paralelization. When I run a program on one or two computers. Everything works OK (lets say most of the time). When I run it on 4 computers, sometimes it works and sometimes it does not. The error that I get is: launch failed: CreateProcess(X:\mpi_program.exe) on 'computerX' failed, error 3 - The system cannot find the path specified. Most times I get this error for one computer in machine list, but it can also happen for 2 or more computers etc. If I increase number of computers over 4. I get this error almost every time. With 6 or more this happens every time. It looks like the higher the number the worse it gets. I would really like to make this work. Has anybody had such experiences and what was the solution. 
It looks like the computer tries to start the program before the mapped drive would be made operational. Is there any way to increase this delay? Or are there any other settings that needs to be set? There are some other errors that I occasionally get, but this is the most important one (for now). Systems: Windows XP SP3 (on all computers) Installed latest MPICH2 Connection giga-bit NICs (local network) over switch Example of run command: "C:\Program Files\MPICH2\bin\mpiexec.exe" -map X:\\computer1\MPI -wdir X:\ -n 4 -machinefile "C:\Program Files\MPICH2\bin\machines.txt" -noprompt X:\mpi_program.exe \\computer1\MPI is a shared folder on computer1 from which the command is run machines.txt consists of following lines: computer1 -ifhn 192.168.1.1 computer2 -ifhn 192.168.1.2 ... computer8 -ifhn 192.168.1.8 These are the NICs I would like MPI to use them for communication. The order of computers in machines.txt is irrelevant (it happens on every combination). Regards }}} "Jayesh Krishna" 2 None 363 Re: MPI_IN_PLACE bug in Allgatherv in MPE's collchk mpich2 None bug chan 1231966878 1237008489 {{{ ----- "Satyanarayana Kakollu" wrote: > Hi Anthony, > Is it safe to use MPI_ALLGATHERV with MPI_IN_PLACE in fortran? > > Should we just use the recv buffer as send buffer instead of > MPI_IN_PLACE? > > Thanks, > Satya > > > > On Tue, Jan 6, 2009 at 4:45 PM, Anthony Chan > wrote: > > > > > Hi Satyanarayana, > > > > The support of MPI_IN_PLACE for Allgatherv in CollChk library > > is definitely in 1.0.6p1. My simple test program didn't reveal > > any problem. If your program is small, could you send it to > > me so I can check if the collchk library contains any bug ? > > > > Thanks, > > A.Chan > > > > ----- "Anthony Chan" wrote: > > > > > ----- "Rajeev Thakur" wrote: > > > > > > > That might be a bug in the collchk library. If sendbuf is > > > MPI_IN_PLACE > > > > in > > > > Allgatherv, the sendcount argument should be ignored. > > > > > > > > Rajeev > > > > > > > > > > > > > > > > _____ > > > > > > > > From: Satyanarayana Kakollu [mailto:kakollu@gmail.com] > > > > Sent: Friday, December 19, 2008 9:53 AM > > > > To: Anthony Chan > > > > Cc: Rajeev Thakur > > > > Subject: Re: Trouble with MPI_BCAST > > > > > > > > > > > > Thank you Rajeev and Anthony, > > > > > > > > -mpe=mpicheck give the following message at an MPI_ALL_GATHERV > call > > > > in our > > > > code. > > > > > > > > ALLGATHERV (Rank 0) --> Inconsistent datatype signatures > detected > > > > between > > > > local rank 0 > > > > > > > > I am using the MPI_IN_PLACE option with send count set as '0', > can > > > > this be > > > > the problem ? > > > > > > > > Satya > > > > > > > > On Wed, Dec 17, 2008 at 10:02 PM, Anthony Chan > > > > > wrote: > > > > > > > > > > > > > > > > Or use "mpicc -mpe=mpicheck" or "mpif90 -mpe=mpicheck" as > linker. > > > > > > > > A.Chan > > > > > > > > > > > > ----- "Rajeev Thakur" wrote: > > > > > > > > > Satya, > > > > > Try linking with -lmpe_collchk. It will run MPE's > > > > > collective call > > > > > checker to see if there is any discrepancy in the parameters > > > passed > > > > > to > > > > > MPI_Bcast. If that doesn't show any errors, try running a > simple > > > > test > > > > > program that contains only the broadcast. 
> > > > > > > > > > Rajeev > > > > > > > > > > > > > > > > > > > > _____ > > > > > > > > > > From: Satyanarayana Kakollu [mailto:kakollu@gmail.com] > > > > > Sent: Tuesday, December 16, 2008 5:31 PM > > > > > To: Rajeev Thakur > > > > > Subject: Trouble with MPI_BCAST > > > > > > > > > > > > > > > Rajeev, > > > > > > > > > > We are seeing that our code is getting stuck at MPI_BCAST on > a > > > > > customer > > > > > machine. The call simple, all ranks use same size buffer and > > > count, > > > > > we > > > > > verified that the root is same on all ranks. > > > > > > > > > > The code works on our clusters, but not on the user's > machine. > > > Here > > > > > are the > > > > > differences between our clusters and the user's machine. > > > > > > > > > > > > > > > Our clusters User's machine > > > > > > > > > > Multi-proc nodes Single SMP node with 8 > cores on > > > > > two > > > > > sockets. > > > > > CentOS 4, RHEL 4 RHEL 5 client version > > > > > mpich2 1.0.6p1 mpich2 1.0.6p1 (same) > > > > > > > > > > We were using gdb to localize the bug to MPI_BCAST two of the > 8 > > > > ranks > > > > > do not > > > > > get past the BCAST. If we replace the BCAST with PT2PT > > > > communication > > > > > it is > > > > > running well for 1000s of iterations. > > > > > > > > > > We linked our applications statically, on the RHEL 4 machine. > > > > > > > > > > Can you share your first thoughts about the issue. > > > > > > > > > > Thanks, > > > > > Satya > > }}} kakollu@gmail.com 2 None 366 RE: [mpich-discuss] can not find function MPI_Type_create_f90_real mpich2 None bug jayesh 1232033959 1232045689 {{{ Hi, The current fortran libraries in MPICH2 on windows don't support the *TYPE_CREATE_F90* functions. I have added this request to our bug tracking system and will update you on our progress (Should be available in the next release - end of this month.). Regards, Jayesh _____ From: mpich-discuss-bounces@mcs.anl.gov [mailto:mpich-discuss-bounces@mcs.anl.gov] On Behalf Of trimtrim trimtrim Sent: Thursday, January 15, 2009 3:14 AM To: mpich-discuss@mcs.anl.gov Subject: [mpich-discuss] can not find function MPI_Type_create_f90_real Dear every one: I am try to use the function of "MPI_Type_create_f90_real" to select the MPI send data type. But when I link the program, it shows I can't find the library. "Error 1 error LNK2019: unresolved external symbol _MPI_TYPE_CREATE_F90_REAL referenced in function _MAIN__ TEST " Does anyone knows which library I need to add, the library "fmpe.lib" and "fmpich2.lib" are already add to the linker. Below is the attached program. Many thanks Regards Haihua. PROGRAM MAIN USE mpi INTEGER(KIND=4),PARAMETER::p = 12,r=37; INTEGER(KIND=4),PARAMETER:: realKind =selected_real_kind(p,r) INTEGER(KIND=4):: MPI_INT_KIND,info; CALL MPI_Type_create_f90_real(p,r,MPI_INT_KIND,info); write(*,*) realKind; Write(*,*) PRECISION(x),range(x); END }}} "Jayesh Krishna" 2 None 375 Need to define C datatypes in Fortran's mpif.h mpich2 None bug None 1233025048 1233025048 {{{ Creating ticket for this... -----Original Message----- From: William Gropp [mailto:wgropp@illinois.edu] Sent: Monday, January 26, 2009 8:43 PM To: Rajeev Thakur Cc: 'Dave Goodell' Subject: Re: Datatypes in multiple languages Yes, that's what it means - we need to add them (with a decimal version of the value) to mpif.h . 
Bill On Jan 26, 2009, at 8:25 PM, Rajeev Thakur wrote: > Bill, > In the Chapter on Language Interoperability, it says "All > predefined datatypes can be used in datatype constructors in any > language" (pg 483, ln 46). Does it mean that all C datatypes must also > be defined in Fortran's mpif.h? We don't have any of them defined > currently, but we do have the > opposite: Fortran datatypes defined in C mpi.h. > > Rajeev > }}} "Rajeev Thakur" 2 None 401 BARRIER instructions not generated in CH3 SHM device on X86_64 linux. mpich2 None bug goodell 1233615465 1236108915 {{{ Hello, While investigating MPICH2 1.0.8 SHM implementation with RHEL5 Linux on Intel X86_64, we noticed that the following macros: MPID_WRITE_BARRIER() & MPID_READ_BARRIER() are translated to empty code in src/mpid/ch3/util/shmbase/ch3_shm.c. The macros for generating these barriers are defined in src/mpid/ch3/channels/shm/include/mpidi_ch3_impl.h. The following line (63) in mpidi_ch3_impl.h: #ifdef HAVE_GCC_AND_PENTIUM_ASM should be replaced with this one: #if defined(HAVE_GCC_AND_PENTIUM_ASM) || defined(HAVE_GCC_AND_X86_64_ASM) After this change the barrier macros produce the proper "fence" instructions. Thank you, Tal Nevo Application Performance Engineering ScaleMP Inc. }}} Tal Nevo 2 None 439 f90 f77 configure test failure on solaris mpich2 None bug 1236616633 1236616633 I noticed this configure message on solaris, and I don't think it's right (g77 does not work with g77). Maybe it is, in which case ignore this. -d {{{ checking whether Fortran 90 works with Fortran 77... cat: cannot open conftest.f90 Output from the link step is ld: fatal: file conftest1.f90: unknown file type ld: fatal: File processing errors. No output written to conftest collect2: ld returned 1 exit status no configure: WARNING: The test program that was used and the output may be found in config.log configure: WARNING: The selected Fortran 90 compiler /opt/csw/gcc3/bin/g77 does not work with the selected Fortran 77 compiler /opt/csw/gcc3/bin/g77. Use the environment variables F90 and F77 respectively to select compatible Fortran compilers. The check here tests to see if a main program compiled with the Fortran 90 compiler can link with a subroutine compiled with the Fortran 77 compiler. }}} http://www.mcs.anl.gov/research/projects/mpich2/nightly/old/runs/SPARC-Solaris-GNU32-mpd-ch3:nemesis-2009-03-05-20-45.xml buntinas 2 None 441 mpiu_shm_wrappers warnings mpich2 None bug jayesh * 1236707313 1236895762 I get a lot of warning messages (mostly in mpiu_shm_wrappers) when building MPICH2. [...snip...] /home/balaji/projects/mpich2/trunk/trunk/src/util/wrappers/mpiu_shm_wrappers.h: In function 'MPIU_SHMW_Seg_open': /home/balaji/projects/mpich2/trunk/trunk/src/util/wrappers/mpiu_shm_wrappers.h:889: warning: format not a string literal and no format arguments [...snip...] Here's my configure line: {{{ ../trunk/configure --enable-g=dbg,log \ --with-pm=hydra:gforker:remshell:mpd \ --disable-cxx --disable-f77 --disable-f90 \ --disable-mpe --disable-romio --disable-fast \ --disable-spawn --enable-strict=posix \ --enable-dependencies }}} I didn't really dig into which option was causing these warnings. balaji 2 None 463 Re: MPICH2 installation problem mpich2 None bug None 1237149939 1237149939 {{{ Hi, We'll need some more information, and you should send this to mpich2-maint@mcs.anl.gov (which I've cc'ed here). Just based on what is here, my guess is that the wrong MPI library was found; can you also send the compile and link commands and output? 
Bill On Mar 12, 2009, at 10:20 PM, Yang Yang wrote: > Hi, Dr. Gropp, > > I recently installed MPICH2-1.0.8 on the cluster to replace the old > MPICH-1.2.7. Although the example pi code can be compiled and run > using MPICH2, the one software package that used to be working under > MPICH-1.2.7 is not working now with MPICH2-1.0.8. The compilation > was successful and the executable was built. But when I tried to > run the package by mpiexec -n 16 ./exe it generated the message as > follows: > > Initializing MPI > node01.cluster: Not running from mpirun?. > Initializing MPI > node10.cluster: Not running from mpirun?. > Initializing MPI > node13.cluster: Not running from mpirun?. > Initializing MPI > western-wind.cluster: Not running from mpirun?. > Initializing MPI > node05.cluster: Not running from mpirun?. > Initializing MPI > node04.cluster: Not running from mpirun?. > Initializing MPI > Initializing MPI > node14.cluster: Not running from mpirun?. > node15.cluster: Not running from mpirun?. > Initializing MPI > node08.cluster: Not running from mpirun?. > Initializing MPI > node03.cluster: Not running from mpirun?. > Initializing MPI > node09.cluster: Not running from mpirun?. > Initializing MPI > node07.cluster: Not running from mpirun?. > Initializing MPI > node02.cluster: Not running from mpirun?. > Initializing MPI > node11.cluster: Not running from mpirun?. > Initializing MPI > node12.cluster: Not running from mpirun?. > Initializing MPI > node06.cluster: Not running from mpirun?. > > What's wrong? I used the same compilers for the software package > as used for building MPICH2. > > I appreciate your help. > > Regards, > > Yang William Gropp Deputy Director for Research Institute for Advanced Computing Applications and Technologies Paul and Cynthia Saylor Professor of Computer Science University of Illinois Urbana-Champaign }}} William Gropp 2 None 464 Hydra: Multi-executable launches on the same node mpich2 None bug balaji 1237163769 1237163769 Hydra uses a separate proxy whenever there is a separate executable name. For the case where two different executables are launched on the same host, both proxies try to open the same port and one of them fails. balaji 2 None 465 PSM netmod for Nemesis mpich2 None bug balaji 1237173548 1237173548 This ticket is a reminder for us to clean up the PSM netmod in nemesis (will probably need to rewrite it based on the changes in the MX module). balaji 2 None 466 meet error, when run mpdboot mpich2 None bug None 1237187199 1237187199 {{{ Hello, I get an error when I execute "mpdboot -n 2 -f mpd.hosts". I installed MPICH2 following the instructions in "installguide.pdf" step by step. Everything including ssh goes smoothly. I have two computers, node01 and node02. If I run mpdboot manually, it goes well. root@node01:~# mpd & [1] 5753 root@node01:~# mpdtrace -l node01_48187 (159.226.10.27) root@node01:~# exit logout Connection to node01 closed. root@node02:~# mpd -h node01 -p 48187 & [1] 5875 root@node02:~# mpdtrace node02 node01 root@node02:~# But when I run "mpdboot -n 2 -f mpd.hosts" on node01 or node02, an error occurs. root@node01:~# mpdboot -n 2 -f mpd.hosts mpdboot_node01 (handle_mpd_output 406): from mpd on node02, invalid port info: no_port root@node01:~# I have checked the iptables, which I never changed; they are blank for INPUT, OUTPUT and FORWARD, with no surprise.
I try the "mpdcheck", which give some information: root@node01:~# mpdcheck -f mpd.hosts -ssh client on node02 failed to access the server here is the output: bash: /home/wsh/bin/mpdcheck.py: No such file or directory root@node01:~# there are some information else, which maybe useful: root@node01:~# mpdboot -n 2 -f mpd.hosts -d -v debug: starting running mpdallexit on node01 LAUNCHED mpd on node01 via debug: launch cmd= /home/wsh/bin/mpd.py --ncpus=1 -e -d debug: mpd on node01 on port 33709 RUNNING: mpd on node01 debug: info for running mpd: {'ncpus': 1, 'list_port': 33709, 'entry_port': '', 'host': 'node01', 'entry_host': '', 'ifhn': ''} LAUNCHED mpd on node02 via node01 debug: launch cmd= ssh -x -n -q node02 '/home/wsh/bin/mpd.py -h node01 -p 33709 --ncpus=1 -e -d' debug: mpd on node02 on port no_port mpdboot_node01 (handle_mpd_output 406): from mpd on node02, invalid port info: no_port root@node01:~# and run"mpdcheck -pc on node01: root@node01:~# mpdcheck -pc --- print results of: gethostbyname_ex(gethostname()) ('node01', [], ['159.226.10.27']) --- try to run /bin/hostname node01 --- try to run uname -a Linux node01 2.6.27-13-generic #1 SMP Thu Feb 26 07:26:43 UTC 2009 i686 GNU/Linux --- try to print /etc/hosts 127.0.0.1 localhost.localdomain localhost #127.0.1.1 ubuntu.ubuntu-domain ubuntu 159.226.10.27 scc-m 159.226.10.27 node01 159.226.10.41 node02 # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters ff02::3 ip6-allhosts --- try to print /etc/resolv.conf # Generated by NetworkManager nameserver 159.226.2.135 --- try to run /sbin/ifconfig -a eth0 Link encap:Ethernet HWaddr 00:0b:db:bb:d8:2f inet addr:159.226.10.27 Bcast:159.226.10.255 Mask:255.255.255.0 inet6 addr: 2001:cc0:2004:2:20b:dbff:febb:d82f/64 Scope:Global inet6 addr: fe80::20b:dbff:febb:d82f/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:24886 errors:0 dropped:0 overruns:0 frame:0 TX packets:10709 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:18262430 (18.2 MB) TX bytes:1162554 (1.1 MB) Interrupt:17 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:758 errors:0 dropped:0 overruns:0 frame:0 TX packets:758 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:129788 (129.7 KB) TX bytes:129788 (129.7 KB) pan0 Link encap:Ethernet HWaddr 8a:95:9a:1d:ab:c0 BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) --- try to print /etc/nsswitch.conf # /etc/nsswitch.conf # # Example configuration of GNU Name Service Switch functionality. # If you have the `glibc-doc-reference' and `info' packages installed, try: # `info libc "Name Service Switch"' for information about this file. 
passwd: compat group: compat shadow: compat hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4 networks: files protocols: db files services: db files ethers: db files rpc: db files netgroup: nis root@node01:~# mpdcheck on node02: --- print results of: gethostbyname_ex(gethostname()) ('node02', [], ['159.226.10.41']) --- try to run /bin/hostname node02 --- try to run uname -a Linux node02 2.6.27-7-generic #1 SMP Fri Oct 24 06:42:44 UTC 2008 i686 GNU/Linux --- try to print /etc/hosts 127.0.0.1 localhost.localdomain localhost ##127.0.1.1 ubuntu.ubuntu-domain ubuntu 159.226.10.27 scc-m 159.226.10.27 node01 159.226.10.41 node02 # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters ff02::3 ip6-allhosts --- try to print /etc/resolv.conf # Generated by NetworkManager nameserver 159.226.2.135 --- try to run /sbin/ifconfig -a eth0 Link encap:Ethernet HWaddr 00:0a:eb:ad:e3:c5 inet addr:159.226.10.41 Bcast:159.226.10.255 Mask:255.255.255.0 inet6 addr: 2001:cc0:2004:2:20a:ebff:fead:e3c5/64 Scope:Global inet6 addr: fe80::20a:ebff:fead:e3c5/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:9822 errors:0 dropped:0 overruns:0 frame:0 TX packets:7460 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:3538769 (3.5 MB) TX bytes:1208257 (1.2 MB) Interrupt:16 Base address:0xc000 eth1 Link encap:Ethernet HWaddr 50:78:4c:70:f9:df UP BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) Interrupt:17 Base address:0xc400 lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:512 errors:0 dropped:0 overruns:0 frame:0 TX packets:512 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:34873 (34.8 KB) TX bytes:34873 (34.8 KB) pan0 Link encap:Ethernet HWaddr b2:ed:0d:3c:2b:c0 BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) --- try to print /etc/nsswitch.conf # /etc/nsswitch.conf # # Example configuration of GNU Name Service Switch functionality. # If you have the `glibc-doc-reference' and `info' packages installed, try: # `info libc "Name Service Switch"' for information about this file. passwd: compat group: compat shadow: compat hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4 networks: files protocols: db files services: db files ethers: db files rpc: db files netgroup: nis root@node02:~# Any help will be highly appreciated. Thanks. }}} 王强 2 None 146 Shared memory capable collectives mpich2 None feature goodell 1221634594 1236712841 We need to make sure that shared-memory capable collectives are implemented in 1.1. This ticket is to keep track of what all is pending for 1.1a2. Most current work is in the shmemcoll branch. balaji 2 None 443 Non-ssh boot-strap servers for Hydra mpich2 None feature balaji 1236722791 1236722839 Currently, ssh is the only supported boot-strap server for Hydra. We will need to eventually add support for slurm, pbs, sge and (maybe) fork as well. 
balaji 2 None 444 Dynamic process support in Hydra mpich2 None feature balaji 1236722952 1236722952 This is a place holder for dynamic process support in Hydra. balaji 2 None 445 Hydra proxy enhancements mpich2 None feature balaji 1236723281 1236723461 The current proxy implementation in Hydra is fairly simple. This needs to be extended in the following ways: 1. The proxy should be able to use the boot-strap server. The interface right now is not clean enough to allow this and needs to be fixed. This will let us launch a multi-level hierarchy of proxies. 2. The proxy currently only handles process launch and stdout/stderr/stdin functionality. Code-wise, however, the proxy is parallel to the process manager and should be able to provide some PMI functionality as well. This will help on large-scale systems, but is currently not supported. 3. Manual proxy launching capability: for platforms that don't have boot-strap servers, it should be possible to launch them either manually or as persistent daemons (e.g., on windows). 4. Connected proxies: on systems which have high-speed and scalable network capabilities (IB, MX), the proxies do not have to be disconnected. This makes most sense only when the proxies are pre-launched and not spawned as a part of mpiexec. balaji 2 None 446 Windows support for Hydra mpich2 None feature jayesh 1236723343 1236723343 This is a place holder for windows support for Hydra. balaji 2 None 457 Hydra: Process-core mapping ability mpich2 None feature balaji 1236991392 1236991465 We need to be able to allow processes to be bound to a processor or core on the system. This is do-able using external applications such as numactl that are available on some platforms, but that might not be portable. There are two possible options for this: 1. Extend PMI (maybe part of PMI-1.1 or PMI-2), so the process manager can tell the MPI process what core it should bind to, and the process can internally do the binding. 2. Instead of forking off the application processes directly, the proxies can spawn our own binding processes; these binding processes bind themselves to the appropriate core based on information from the proxy and then execvp the actual application. My preference is option 2. Other things to consider here are a portable way for a process to internally bind itself to a core. -- Pavan balaji 3 None 40 Support for type_create_indexed_block in ROMIO romio None bug thakur 1217952911 1236712472 Need to add support for MPI_Type_create_indexed_block in ROMIO flatten code. Rajeev "Rajeev Thakur" 3 None 89 MPI_Waitany does not return correct errors for truncation mpich2 None bug thakur 1218575064 1236712434 Need to see why MPI_Waitany does not return error truncate when Irecv is posted with a smaller buffer than the send. "Rajeev Thakur" 3 None 90 issues around MPID_Dev_comm_create_hook(), etc. mpich2 None bug goodell 1218639335 1236712586 Need to look into this. _____ From: owner-mpich2-dev@mcs.anl.gov [mailto:owner-mpich2-dev@mcs.anl.gov] On Behalf Of David Gingold Sent: Thursday, August 07, 2008 8:48 PM To: mpich2-dev@mcs.anl.gov Subject: [mpich2-dev] issues around MPID_Dev_comm_create_hook(), etc. I'm scrambling to get a release out, but lest I forget about this later, I thought I should mention a few brief bits about MPID_Dev_comm_create_hook() and friends: 1. The callers of these don't check the return values. It would be nicer to allow the hooks to pass errors up, e.g. if the create hook does memory allocation. 2. 
MPIR_Setup_intercomm_localcomm() doesn't call MPID_Dev_comm_create_hook(). Should it? (This had me on a bit of a goose chase this evening, but I'm better now.) 3. I've ended up hanging device-specific things off the communicator that might instead be hung off the communicator's group. (The bits, in my case, are representations of what ranks are local versus off-node.) Should we have MPID_Dev_group_{create,destroy}_hook() functions, also? I note that there is already a MPID_DEV_GROUP_DECL facility. -dg -- David Gingold Principal Software Engineer SiCortex Three Clock Tower Place, Suite 210 Maynard MA 01754 (978) 897-0214 x224 "Rajeev Thakur" 3 None 100 deleting dead/unsupported code mpich2 None bug buntinas 1219062395 1237022079 Some cleanup items that we don't want to forget about... {{{ Begin forwarded message: > From: Pavan Balaji > Date: August 17, 2008 Aug 17 2:29:51 PM CDT > To: mpich2-core@mcs.anl.gov > Subject: Re: [mpich2-core] deleting dead/unsupported code > Reply-To: mpich2-core@mcs.anl.gov > > > src/mpid/ch3/channels/nemesis/nemesis/net_mode/elan_module should > also probably go out. And newtcp_module should be renamed to > tcp_module. > > If we are spending some time to clean up the directory structure, > it might be worth changing the net_mod directory to "nm" and > removing _module in each netmod's name. Function name lengths would > be cut down by half :-). > > Merging nemesis/nemesis to nemesis is also on the plate, but will > probably take more time. > > -- Pavan }}} goodell 3 None 182 unify communicator creation paths mpich2 None bug 1222980457 1222980457 Not all communicators are created through the MPIR_Comm_create routine. Some are created via MPIR_Setup_intercomm_localcomm while MPIR_Process.{comm_world,comm_self,icomm_world} are created by hand in a separate array. Each piece of duplicated communicator construction logic is a spot where we are likely to have a bug some time in the future. If all three (or more) code paths are not kept in sync correctly then we will likely experience a bug. As a bonus, this should make the code easier to read and understand. This ticket is here to keep us from forgetting to clean this up. goodell 3 None 197 rreq->dev.recv_pending_count uninitialized mpich2 None bug buntinas 1223505987 1224790554 Now that we have valgrind integration (r3255) valgrind is showing the use of some uninitialized data here: {{{ goodell-desktop% mpiexec -n 3 valgrind -q ./reduce No Errors ==16186== Conditional jump or move depends on uninitialised value(s) ==16186== at 0x45FF9D: MPID_Irecv (mpid_irecv.c:76) ==16186== by 0x40CA3B: MPIC_Sendrecv (helper_fns.c:115) ==16186== by 0x46D731: MPIR_Barrier (barrier.c:70) ==16186== by 0x46DFE0: PMPI_Barrier (barrier.c:387) ==16186== by 0x45E842: MPID_Finalize (mpid_finalize.c:92) ==16186== by 0x421DD8: PMPI_Finalize (finalize.c:152) ==16186== by 0x4023DD: main (reduce.c:53) }}} adding --db-attach=yes into the mix shows that rreq->dev.recv_pending_count is filled with 0xef, confirming the uninitialized data. We need to figure out what path is missing this value and if there is a sensible default to initialize it to in the constructor. -Dave goodell 3 None 204 mpid_abort warning mpich2 None bug goodell 1223667852 1225197494 {{{ I get this warning for mpid_abort. 
/homes/thakur/cvs/mpich2/src/mpid/ch3/src/mpid_abort.c: In function ‘MPID_Abort’: /homes/thakur/cvs/mpich2/src/mpid/ch3/src/mpid_abort.c:99: warning: function declared ‘noreturn’ has a ‘return’ statement /homes/thakur/cvs/mpich2/src/mpid/ch3/src/mpid_abort.c:100: warning: ‘noreturn’ function does return }}} "Rajeev Thakur" 3 None 216 Incorrect behavior of MPICH2 C++ when Error handler MPI::ERRORS_RETURN is set mpich2 None bug goodell * 1224162971 1236713188 {{{ Hi, Rajeev. During investigation of some problem with MPI C++ code we have found that error handling in MPICH2 does not conform to the MPI standard. The issue is that the program throws an exception when the MPI::ERRORS_RETURN error handler is set. It is demonstrated in the attached example. <> I think that this behavior is due to the definition of the MPIX_CALL macro in the mpicxx.h file. #define MPIX_CALL( fnc ) \ {int err; err = fnc ; if (err) throw Exception(err);} Victor. -------------------------------------------------------------------- Closed Joint Stock Company Intel A/O Registered legal address: Krylatsky Hills Business Park, 17 Krylatskaya Str., Bldg 4, Moscow 121614, Russian Federation This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. }}} "Shumilin, Victor" 3 None 231 won't compile with --with-thread-package=no mpich2 None bug chan 1224871468 1236713424 {{{ ./configure --with-thread-package=no make }}} fails to build cpi with linker errors, e.g.: info_getvallen.c:(.text+0x41): undefined reference to `MPE_Thread_tls_get' info_getvallen.c:(.text+0x14c): undefined reference to `MPE_Thread_tls_get' info_getvallen.c:(.text+0x200): undefined reference to `MPE_Thread_mutex_unlock' info_getvallen.c:(.text+0x220): undefined reference to `MPE_Thread_mutex_lock' info_getvallen.c:(.text+0x383): undefined reference to `MPE_Thread_tls_set' info_getvallen.c:(.text+0x4cb): undefined reference to `MPE_Thread_tls_set' There are also lots of warnings about the incompatible type of MPE_Thread_tls_get buntinas 3 None 289 File_set_view doesn't check committed status of datatypes mpich2 None bug 1226602142 1226602142 MPI_File_set_view (and perhaps others) does not check that the datatypes are committed (as required by the standard). The IBM MPI does require this; I found this out while debugging some of the I/O test cases. gropp 3 None 374 investigate shared memory segment size mpich2 None bug buntinas 1232996514 1236714575 Alexander claims that the current nemesis shared memory segment size is approximately 16MiB, which might be too much as memory/core ratios shrink. We need to investigate this to see if it's actually a problem and if there is anything sensible we can do to reduce the size of the segment. -Dave goodell 3 None 393 Cross compilation requirement for manual cross files mpich2 None bug chan 1233285855 1236714697 This ticket is a reminder based on some input from Rob Latham about the use of cross files for cross compilation. Here's the summary of the problem: Currently, the MPICH2 configuration files perform runtime checks to find type sizes and other parameters. During cross-compilation such runtime checks are not possible, so we expect the values to be provided to us in a cross file.
However, it looks like there are some tricks to do this at compile time: http://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/configure.in#L207 We should follow a similar approach to remove or minimize the number of runtime checks we need, and thus not rely on having a user-provided cross file. balaji 3 None 411 intel tests don't free datatypes mpich2 None bug 1234567339 1234567339 From a quick skim of the code and when building with --enable-g=handle, it looks like the Intel tests don't free the datatypes that they create. It would be good to create an MPITEST_Finalize that does any necessary cleanup that corresponds to MPITEST_Init. -Dave goodell 3 None 442 Hydra stdin support mpich2 None bug balaji 1236721742 1236721742 Hydra does not currently support stdin when more than one process is launched. This needs to be eventually fixed. Filing this ticket so we don't forget. -- Pavan balaji 3 None 448 Fwd: MPIR_Grequest_waitall mpich2 None bug None 1236800632 1236800696 {{{ Begin forwarded message: > From: Matthew Koop > Date: March 11, 2009 Mar 11 2:42:50 PM CDT > To: Dave Goodell > Cc: Darius Buntinas , Pavan Balaji >, , > Subject: Re: MPIR_Grequest_waitall > > Hi Dave, > > Sure, that sounds great. > > Matt > > On Wed, 11 Mar 2009, Dave Goodell wrote: > >> Hi Matt, >> >> "greq_wait" passes for the most recent version of MPICH2 that I >> happen >> to have compiled on my laptop. Something must be different between >> MPICH2 and MVAPICH2 in this area. >> >> I don't have the time right now to track down a bug I can't >> reproduce, >> but I can turn this into a Trac ticket and we'll take a closer look >> at >> it in the next week or two if you'd like. >> >> -Dave >> >> On Mar 10, 2009, at 2:27 PM, Matthew Koop wrote: >> >>> >>> The 'greq_wait' test fails for MVAPICH2 -- although it never even >>> enters >>> the CH3 layer -- everything is at the upper layer. >>> >>> Matt >>> >>> On Tue, 10 Mar 2009, Darius Buntinas wrote: >>> >>>> >>>> Do you have a test program that demonstrates the bug? >>>> >>>> -d >>>> >>>> On 03/10/2009 12:44 PM, Matthew Koop wrote: >>>>> I was looking into an issue here we were seeing with >>>>> tests/mpi/threads/pt2pt/greq_wait and it doesn't seem like >>>>> 'wait_fn' will >>>>> always be populated. >>>>> >>>>> MPI_Grequest_start sets wait_fn to NULL, cc_ptr is not null, and >>>>> kind is >>>>> set to MPID_UREQUEST. Then in MPIR_Grequest_waitall: >>>>> >>>>> 625 for (i = 0; i < count; ++i) >>>>> 626 { >>>>> 627 /* skip over requests we're not interested in */ >>>>> 628 if (request_ptrs[i] == NULL || *request_ptrs[i]- >>>>>> cc_ptr == 0 >>>>> || request_ptrs[i]->kind != MPID_UREQUEST) >>>>> 629 continue; >>>>> 630 mpi_error = (request_ptrs[i]->wait_fn)(1, >>>>> &request_ptrs[i]->grequest_extra_state, 0, NULL); >>>>> 631 if (mpi_error) MPIU_ERR_POP(mpi_error); >>>>> 632 } >>>>> >>>>> Shows that wait_fn gets called in this case. >>>>> >>>>> Matt >>>>> >>>> >>> >> > }}} koop@cse.ohio-state.edu 3 None 265 SMPD machinefile format mpich2 None docs jayesh 1225479325 1236713545 SMPD machinefile format is different from that of mpd, either we should support that format in SMPD or clearly document it. This is a placeholder to remind me to go through the user guides and update stuff related to windows & smpd -Jayesh jayesh 3 None 183 Support for Sun Studio compiler mpich2 None feature chan 1222991512 1237058224 This is a place holder to add support for Sun Studio compilers on x86. 
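Referring back to the MPIR_Grequest_waitall snippet quoted in ticket 448 above: below is a minimal, self-contained sketch of the kind of program that exercises that code path -- a plain generalized request started with MPI_Grequest_start (which populates no wait_fn) and completed from a second thread while the main thread sits in MPI_Waitall. This is not the actual greq_wait test from the suite; the thread handling, sleep, and output are assumptions made for brevity.
{{{
#include <mpi.h>
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

/* Trivial generalized-request callbacks. */
static int query_fn(void *extra_state, MPI_Status *status)
{
    MPI_Status_set_elements(status, MPI_BYTE, 0);
    MPI_Status_set_cancelled(status, 0);
    status->MPI_SOURCE = MPI_UNDEFINED;
    status->MPI_TAG = MPI_UNDEFINED;
    return MPI_SUCCESS;
}
static int free_fn(void *extra_state) { return MPI_SUCCESS; }
static int cancel_fn(void *extra_state, int complete) { return MPI_SUCCESS; }

static void *completer(void *arg)
{
    MPI_Request *req = arg;
    sleep(1);                      /* let the main thread reach MPI_Waitall first */
    MPI_Grequest_complete(*req);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    pthread_t thr;
    MPI_Request req;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &req);
    pthread_create(&thr, NULL, completer, &req);

    /* A plain generalized request has no wait_fn, so the waitall path
       quoted in the ticket must handle that case without calling
       through a NULL function pointer. */
    MPI_Waitall(1, &req, MPI_STATUSES_IGNORE);

    pthread_join(thr, NULL);
    MPI_Finalize();
    printf(" No Errors\n");
    return 0;
}
}}}
On the implementation side, the question the ticket raises is whether the quoted loop should simply skip requests whose wait_fn is NULL or instead fall back to the regular progress engine for them.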
buntinas 3 None 192 make intercomm bcast SMP-aware mpich2 None feature goodell 1223405655 1236713108 The way that MPI_Bcast is implemented right now is SMP-aware for intracomms but not for intercomms. This needs to be corrected, probably by introducing a small utility function in bcast.c that performs SMP broadcasts (called something like MPIR_SMPBcast). -Dave goodell 3 None 286 supporting -Werror mpich2 None feature None 1226508635 1236714354 {{{ Given the lack of response from others, I've interpreted this as consensus and created a tracking ticket for this feature. -Dave Begin forwarded message: > From: Dave Goodell > Date: November 11, 2008 Nov 11 2:03:32 PM CST > To: mpich2-core@mcs.anl.gov > Subject: Re: [mpich2-core] supporting -Werror > Reply-To: mpich2-core@mcs.anl.gov > > This seems like a sensible way to accomplish what I want. I'm not > sure what else we would want to put in MAKE_CFLAGS in the future, > but it gets the job done. > > -Dave > > On Nov 11, 2008, at 1:55 PM, Anthony Chan wrote: > >> Hi Dave, >> >> Are you saying that you want -Werror to be used in building >> the MPICH2 libraries but not used during any of the configure tests ? >> If so, it seems to me the easiest thing to do create a special >> makefile >> only CFLAGS, e.g. MAKE_CFLAGS, which is set during make step, >> i.e. "make MAKE_CFLAGS=..." like you have been doing, and MAKE_CFLAGS >> is defined in each Makefile as "CFLAGS = $(CFLAGS) $(MAKE_CFLAGS)". >> >> PS. I think fixing configure is the wrong approach, too massive >> and complicated, because we are altering the meaning of CFLAGS >> in configure tests... >> >> A.Chan >> >> ----- "Dave Goodell" wrote: >> >>> David Gingold from SiCortex is in the process of updating to >>> mpich2-1.1.0a1 and is looking for a way to build with "-Werror". I >>> would like for us to be able to support this for our own development >>> >>> as well. >>> >>> Unfortunately, configuring with CFLAGS="-Werror" breaks numerous >>> configure tests and causes configure to make the wrong determination >>> >>> about system characteristics. For example configure thinks that it >>> can't find any suitable timer implementation on my mac when >>> configured with -Werror. >>> >>> Running "make CFLAGS=-Werror" sort of works, except that it stomps >>> any CFLAGS that were set by configure and I don't know if our >>> recursive make reliably passes variables to sub-makes in all cases. >>> >>> Because fixing all of the autoconf tests is likely to be an >>> intractable problem, what I think we want is a configure switch that >>> >>> will cause -Werror to be included in CFLAGS after all configure >>> tests >>> >>> have been made but before AC_OUTPUT time. This obviously will cause >>> >>> some builds to fail if the system is not warnings-clean, but this >>> wouldn't be the default option. The main trick to this approach is >>> that we would have to basically do the same thing in each sub- >>> configure because of the preciousness of the CFLAGS. Maybe a >>> PAC_WERROR inserted just before all AC_OUTPUTs in the tree, I'm not >>> quite sure... >>> >>> Any thoughts? As usual with these build system issues there's >>> probably a problem that I'm not thinking of, but that's why it's >>> good >>> >>> to discuss this sort of thing. Alternative proposals are welcome. >>> >>> -Dave > }}} goodell 4 None 118 Simple MPICH2 Delegate Bug mpich2 None bug jayesh 1220052846 1236712633 Hi, I have setup 3 Win64 hosts using "smpd -register_spn" with no problem. 
On the domain controller, I have created a user that is authorized for delegation and set up all 3 hosts to allow delegation. Then, assume the following two scenarios: Scenario 1: submission host: vm-cce1 execution host: vm-cce2 command: mpiexec -delegate -hosts 1 vm-cce2 hostname result: vm-cce2 Scenario 2: submission host: vm-cce1 execution host: vm-cce1 command: mpiexec -delegate -hosts 1 vm-cce1 hostname result: op_read error on left context: socket connection closed unable to read the re-connect request, socket connection closed. Scenario 2 appears to be a bug. Why is it that I cannot use delegation when talking to the localhost vm-cce1? This I think is a bug. Try it for yourself. Regards, Larry Adams Senior Systems Engineer Platform Computing Tele: (586) 510-0007 Cell: (586) 899-1138 Skype: TheWitness "Larry Adams" 4 None 122 Dynamic process context IDs mpich2 None bug goodell 1220452373 1236712744 Hi Roberto, We have done several rounds of checks and do not see any difference between MPICH2 1.0.7 and the TCP/IP interface of MVAPICH2 1.2. Both of these should perform exactly the same. We are continuing our investigation. We are wondering whether you can send us a sample piece of code that reproduces the problem you are indicating across these two interfaces. This will help us debug this problem faster and help you solve your problem. I've added other CCs to this email; maybe other people are interested in having a look. Attached you will find the test program I'm working on to turn up the problem. I'm not completely sure it works perfectly, since I wasn't able to complete its execution, but please let me know if I made something wrong inside the code. The testmaster is quite easy: you must provide the number of jobs to simulate (say 50000) and the node file that the resource manager provides for its schedule. Actually, the node that matches the master will be excluded from the slave nodes. The testmaster creates a ring of threads from the assigned nodes. Walking the ring, a thread is started for each free node it finds, so you should have as many threads as assigned nodes working in multithreading. To simulate some work, each thread internally generates a random integer, sets some MPI_Info (host and pwd), spawns the testslave job, sends it the generated random number, and waits for the testslave to receive and send back that number; the sent and received numbers are compared in order to verify their coherency, the slave sends an empty MPI_Send() to signal its termination, the thread then calls MPI_Comm_disconnect() to close the slave connection, and finally all the MPI_Info are cleared. At this point the thread terminates. When the requested number of jobs has been correctly "worked out", the application should terminate ... but without cleaning up (too tired, sorry ;-), so it just waits a bit and finalizes MPI. So far, I have not been able to complete any execution. Currently the application is still crashing with the backtrace you find below. Only once was I able to reach 3500 jobs, but one thread was stuck in a mutex. Looking at the backtrace, you can find the same race I'm getting in my applications. Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1087666512 (LWP 18231)] 0x00000000006a3902 in MPIDI_PG_Dup_vcr () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 Missing separate debuginfos, use: debuginfo-install glibc.x86_64 (gdb) info threads 29 Thread 1121462608 (LWP 18232) 0x0000003465a0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 * 28 Thread 1087666512 (LWP 18231) 0x00000000006a3902 in MPIDI_PG_Dup_vcr () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 27 Thread 1142442320 (LWP 18230) 0x0000003464ecbd66 in poll () from /lib64/libc.so.6 26 Thread 1098156368 (LWP 18229) 0x0000003464e9ac61 in nanosleep () from /lib64/libc.so.6 1 Thread 140135980537584 (LWP 18029) main (argc=3, argv=0x7ffffb5992d8) at testmaster.c:437 (gdb) bt #0 0x00000000006a3902 in MPIDI_PG_Dup_vcr () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #1 0x0000000000668012 in SetupNewIntercomm () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #2 0x00000000006682c8 in MPIDI_Comm_accept () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #3 0x00000000006a6617 in MPID_Comm_accept () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #4 0x000000000065ec5f in MPIDI_Comm_spawn_multiple () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #5 0x00000000006a17e6 in MPID_Comm_spawn_multiple () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #6 0x00000000006783fd in PMPI_Comm_spawn () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #7 0x00000000004017de in NodeThread_threadMain (arg=0x120a790) at testmaster.c:314 #8 0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0 #9 0x0000003464ed4b0d in clone () from /lib64/libc.so.6 (gdb) thread 29 [Switching to thread 29 (Thread 1121462608 (LWP 18232))]#0 0x0000003465a0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 (gdb) bt #0 0x0000003465a0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000065e2e7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #2 0x00000000006675ca in FreeNewVC () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #3 0x0000000000668302 in MPIDI_Comm_accept () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #4 0x00000000006a6617 in MPID_Comm_accept () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #5 0x000000000065ec5f in MPIDI_Comm_spawn_multiple () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #6 0x00000000006a17e6 in MPID_Comm_spawn_multiple () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #7 0x00000000006783fd in PMPI_Comm_spawn () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #8 0x00000000004017de in NodeThread_threadMain (arg=0x120d590) at testmaster.c:314 #9 0x0000003465a06407 in 
start_thread () from /lib64/libpthread.so.0 #10 0x0000003464ed4b0d in clone () from /lib64/libc.so.6 (gdb) thread 27 [Switching to thread 27 (Thread 1142442320 (LWP 18230))]#0 0x0000003464ecbd66 in poll () from /lib64/libc.so.6 (gdb) bt #0 0x0000003464ecbd66 in poll () from /lib64/libc.so.6 #1 0x00000000006d63bf in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #2 0x000000000065e1e7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #3 0x00000000006cf87c in PMPI_Send () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glib c2.3.4/libmpich.so.1.1 #4 0x0000000000401831 in NodeThread_threadMain (arg=0x120a6f0) at testmaster.c:480 #5 0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0 #6 0x0000003464ed4b0d in clone () from /lib64/libc.so.6 (gdb) thread 26 [Switching to thread 26 (Thread 1098156368 (LWP 18229))]#0 0x0000003464e9ac61 in nanosleep () from /lib64/libc.so.6 (gdb) bt #0 0x0000003464e9ac61 in nanosleep () from /lib64/libc.so.6 #1 0x0000003464e9aa84 in sleep () from /lib64/libc.so.6 #2 0x000000000040197c in NodeThread_threadMain (arg=0x120d630) at testmaster.c:505 #3 0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0 #4 0x0000003464ed4b0d in clone () from /lib64/libc.so.6 (gdb) "Rajeev Thakur" 4 None 123 MPIDU_Yield and MPID_Thread_yield mpich2 None bug balaji 1220475167 1236143376 MPIDU_Yield has been implemented as a mpid/common/locks utility function for a number of platforms. MPID_Thread_yield is implemented at the MPI top-level, but only for a subset of cases (e.g., sched_yield and yield; no windows version or select version is present). It's probably a better idea to move the MPIDU_Yield function as a top-level utility as MPID_Yield, and allow MPID_Thread_yield to use MPID_Yield internally. This means that the internal usage of the MPIDU_Yield function has to change as well, and probably the header file dependencies too. Sending this email as a place holder for this fix. -- Pavan -- Pavan Balaji http://www.mcs.anl.gov/~balaji Pavan Balaji 4 None 195 Internal error in packet size mpich2 None bug buntinas * 1223484726 1236619851 {{{ I'm getting this today: william-gropps-computer:examples gropp$ ./cpi Internal error - packet definition is too small. Generic is 32 bytes, MPIDI_CH3_Pkt_t is 36 This is with the ch3:sock device/channel. The full configure line is /Users/gropp/projects/software/mpich2-current/configure --with- pm=gforker:mpd --with-device=ch3:sock --enable-threads=runtime -- enable-thread-cs=global --enable-refcount=default --enable- g=log,mem,dbg,mutex,nesting --enable-strict=posix --enable- dependencies --without-mpe --prefix=/Users/gropp/tmp/thread-tests/ mpich2-current-inst --enable-debuginfo Bil William Gropp Deputy Director for Research Institute for Advanced Computing Applications and Technologies Paul and Cynthia Saylor Professor of Computer Science University of Illinois Urbana-Champaign }}} William Gropp 4 None 281 RE: [mpich-discuss] closesocket failed error when running an MinGWcompiled executable. mpich2 None bug jayesh 1226422229 1236713686 {{{ Hi, This is a bug in the current state machine of SMPD. This should not affect the execution of your MPI program (This error occurs when the process manager tries to cleanup connections after the MPI program finishes execution). 
(PS: You can go ahead with your program development and ignore the closesocket() errors for now. We will fix this bug soon.) Regards, Jayesh -----Original Message----- From: mpich-discuss-bounces@mcs.anl.gov [mailto:mpich-discuss-bounces@mcs.anl.gov] On Behalf Of Dmitri Chubarov Sent: Tuesday, November 11, 2008 1:23 AM To: someindianbloke@gmail.com Cc: mpich-discuss@mcs.anl.gov Subject: Re: [mpich-discuss] closesocket failed error when running an MinGWcompiled executable. Hi, Error 10093 is a winsock error code for "Successful WSAStartup not yet performed". Do check if you get the same error on other machines to rule out network misconfiguration in your Windows installation. On Tue, Nov 11, 2008 at 7:02 AM, Chiraj wrote: > Hi, > > I have compiled a C executable using the MPICH2 Windows libraries and > MinGW. I have tried running the executable using "mpiexec -localroot > -localonly 2 main.exe 0.xml", but I get the following error: > > closesocket failed, sock 1284, error 10093 > > Could some please tell me what this means I am doing wrong? I have > tried searching everywhere on what this error means. I have registered > my user credentials using wmpiregister and checked if I have started > smpd as a windows service. I am running Windows Xp Professional SP3 on > an Intel Pentium 4 base system. > > Chiraj > }}} "Jayesh Krishna" 4 None 305 Location of the console - hardcoded /tmp mpich2 None bug None 1227569934 1236714472 {{{ Hi, while developing a Tight Integration of the mpd startup method of MPICH2 into SGE, I found that the location of the console is hard- coded to be /tmp. Would it be an RFE to redirect it to $TMPDIR, if it's set? As I also set MPD_CON_EXT to get unique entries per jobnumber (on the master node of a parallel job, slave nodes have no consoles at all in this setup), I would also like to force the console to be in the SGE created $TMPDIR. -- Reuti }}} Reuti 4 None 395 etags configure checks mpich2 None bug balaji 1233288333 1236109235 MPICH2 binary packages currently have a dependency on emacs-common; its configuration relies on the availability of etags that is provided by the emacs-common package. This should be removed since "make etags" is completely broken currently and is rarely used through "make" in MPICH2. balaji 4 None 435 Code Duplication in Collectives mpich2 None bug 1236288136 1236288136 A lot of the code in the collectives is duplicated. These should be moved to helper functions. balaji 4 None 77 configure support for memory barriers mpich2 None feature goodell 1218220179 1236712347 mpidu_mem_barriers.h contains support for memory barriers on various platforms. Unfortunately, it doesn't yet have any non-intel, non-unix support and so it won't work for lots of platforms that MPICH2 runs on. Up until r1299 it had a "#warning" statement in there, but that isn't portable to several compilers, including the Visual Studio 64-bit compiler. I replaced it with an MPID_Abort statement that will trigger when the barrier is invoked in r1299. We should probably change it to a configure-time check so that users know up front that MPICH2 won't work on that platform. This will be more important once we begin using the memory barriers outside of nemesis, since ch3:sock and other code needs to be broadly portable. -Dave goodell 4 None 290 better valgrind integration mpich2 None feature 1226693985 1226694168 This is a catchall ticket for some of the valgrind integration features that I'd like to put into mpich2 and don't want to forget about. 1. 
Add an {{{MPIU_Assert_valid_and_not_null}}} - This would check for a !NULL value but also look {{{0xefefefef}}} if mem debugging is enabled and/or test validity via the {{{VALGRIND_CHECK_VALUE_IS_DEFINED(_lvalue)}}} valgrind client request macro if valgrind is available. {{{MPIU_Assert_zero_and_not_null}}} would be very similar. 2. Use proper memory pool management macros to track the handle allocation. {{{VALGRIND_CREATE_MEMPOOL}}} and friends is what is used for this. 3. Use {{{VALGRIND_CREATE_BLOCK}}} to add descriptions to regions of memory in order to make understanding valgrind messages clearer. 4. I suspect the knem LMT code will frequently cause valgrind to think that memory is undefined when it is actually defined. Look at ways to give valgrind a better view of things. 5. Add valgrind client requests as an alternative to the initializations performed by {{{--enable-g=meminit}}}. That way when Tom uses valgrind on him MPI program he doesn't get uninitialized writev warnings but doesn't have to pay a full initialization latency penalty. 6. Figure out a good way to integrate valgrind into the nightly tests. This would help catch bugs that our current {{{--enable-g}}} features can't. goodell 4 None 297 make -j N support mpich2 None feature goodell * 1227221670 1236714390 We should support parallel builds (via "make -j N"). Builds in general could go much much faster, and would especially speed up on slow clock speed platforms like the SiCortex. -Dave goodell 4 None 306 Add MPID_Segment_transpack() function mpich2 None feature buntinas 1227629920 1227629920 A transpack function would be useful when copying from a noncontig buffer to a noncontig buffer (like in MPIR_Localcopy). The idea is that you would pass in the source segment and destination segment and the function would copy directly from the source to the dest buffer. Currently this is done using two copies where the data is packed from the noncontig source buffer into the temp buffer and then unpacked to the noncontig dest buffer. I don't believe that a transpack function can be implemented above the ADI level because access is needed to segment manipulation functions. I'm creating this ticket as a placeholder to remind us to look into this. buntinas 4 None 458 Flow control in MPICH2 mpich2 None feature 1236992453 1236992453 This is a reminder that we need to add communication flow-control in MPICH2. balaji 4 None 459 MPICH2 on Vista mpich2 None feature 1236992880 1236992880 I was going over the older tickets in mpich2tkreq and moving relevant and non-duplicate ones here. One of the tickets was on MPICH2 for Vista. This has been on our plates for some time, but we never got to it because of not enough demand. This ticket is a reminder that it needs to be done at some point. Keeping this as long term for now, unless someone thinks this is critical. balaji
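Referring back to the valgrind integration items listed in ticket 290 above: the following is a minimal C sketch of what items 1-3 could look like using the standard client-request headers (valgrind/valgrind.h and valgrind/memcheck.h). The macro name MPIU_Assert_valid_and_not_null is taken from the ticket, but its body, the toy slab allocator, and all sizes and names here are illustrative assumptions rather than MPICH2's actual implementation.
{{{
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <valgrind/valgrind.h>   /* RUNNING_ON_VALGRIND, VALGRIND_CREATE_MEMPOOL, ... */
#include <valgrind/memcheck.h>   /* VALGRIND_CHECK_VALUE_IS_DEFINED, VALGRIND_CREATE_BLOCK, ... */

/* Item 1: hypothetical assertion macro (not the real MPICH2 one).  It rejects
 * NULL, rejects the 0xefefefef fill pattern written by --enable-g=meminit
 * builds, and, under valgrind, asks memcheck whether the pointer value itself
 * is defined. */
#define MPIU_Assert_valid_and_not_null(ptr_)                          \
    do {                                                              \
        VALGRIND_CHECK_VALUE_IS_DEFINED(ptr_);                        \
        assert((ptr_) != NULL);                                       \
        assert((uintptr_t)(ptr_) != (uintptr_t)0xefefefefUL);         \
    } while (0)

/* Items 2 and 3: a toy bump allocator over a single slab, registered with
 * valgrind as a mempool and with each object labeled so that error messages
 * can name the block they refer to. */
#define SLAB_SIZE (64 * 1024)
static char   slab[SLAB_SIZE];
static size_t slab_used;

void handle_pool_init(void)
{
    if (RUNNING_ON_VALGRIND) {
        VALGRIND_CREATE_MEMPOOL(slab, /*rzB=*/0, /*is_zeroed=*/0);
        VALGRIND_MAKE_MEM_NOACCESS(slab, SLAB_SIZE);  /* untouched bytes are off limits */
    }
}

void *handle_alloc(size_t size)
{
    void *obj;
    if (slab_used + size > SLAB_SIZE)
        return NULL;
    obj = slab + slab_used;
    slab_used += size;
    if (RUNNING_ON_VALGRIND) {
        VALGRIND_MEMPOOL_ALLOC(slab, obj, size);                       /* item 2 */
        VALGRIND_CREATE_BLOCK(obj, size, "handle object (example)");   /* item 3 */
    }
    return obj;
}
}}}
Item 5 would presumably use VALGRIND_MAKE_MEM_DEFINED on the regions that --enable-g=meminit currently initializes, silencing the writev warnings without paying the initialization cost, but that is an assumption about intent rather than an agreed design.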