Ticket #403: part0001.4.html

File part0001.4.html, 20.3 kB (added by Jayesh Krishna, 3 months ago)

Added by email2trac

Subject: RE: [mpich2-maint] #403: mpiexec kills the remote login shell

 Hi,
  Can you try out the attached patch? It contains some extra debug statements which will help us narrow down your problem.

Applying the patch
---------------------
1) Change directory to the top level of the MPICH2 source.
2) patch -p0 < mpich2_1_0_8_Korebot.patch
3) Re-compile & re-install MPICH2.

  Now re-run smpd & mpiexec in debug mode with a simple MPI program, hellow.c (smpd -d > smpd.log / mpiexec -verbose -n 1 ./hellow > mpiexec.log).

Regards,
Jayesh

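For reference, the hellow.c mentioned above is the simple hello-world example shipped under examples/ in the MPICH2 source tree; a minimal MPI program along those lines (a sketch, not the exact shipped file) is:

========================================
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from process %d of %d\n", rank, size);
    fflush(stdout);   /* flush explicitly so the output is not lost if the run ends abruptly */
    MPI_Finalize();
    return 0;
}
========================================
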
-----Original Message-----
From: Yu-Cheng Chou [mailto:cycchou@ucdavis.edu]
Sent: Thursday, February 12, 2009 1:33 PM
To: Jayesh Krishna
Subject: Re: [mpich2-maint] #403: mpiexec kills the remote login shell

Hi,
For the first question, I am not able to get a core dump for mpiexec/hellow/ssh on the Korebot because of its limited memory.

For the second question, I can run such a program with fflush(stdout) and fflush(stderr) statements on the Korebot.

Thank you

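The "simple non-MPI C program with fflush(stdout) & fflush(stderr)" discussed in this exchange is just a small test along the lines of the sketch below (an illustration of the kind of program meant, not the exact code that was run). If running it directly in the ssh session leaves the shell alive, the problem is more likely in how mpiexec/smpd handle process exit and standard I/O.

========================================
#include <stdio.h>

int main(void)
{
    printf("stdout test\n");
    fflush(stdout);                  /* force buffered stdout out immediately */
    fprintf(stderr, "stderr test\n");
    fflush(stderr);                  /* stderr is typically unbuffered, but flush anyway */
    return 0;
}
========================================
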
On Thu, Feb 12, 2009 at 11:14 AM, Jayesh Krishna <jayesh@mcs.anl.gov> wrote:
> Hi,
>  I have yet to make the debug module (shouldn't take much time). The answers to the questions in my prev email will help me to put in the right debug statements.
>
> Regards,
> Jayesh
>
> -----Original Message-----
> From: Yu-Cheng Chou [mailto:cycchou@ucdavis.edu]
> Sent: Thursday, February 12, 2009 12:23 PM
> To: Jayesh Krishna
> Subject: Re: [mpich2-maint] #403: mpiexec kills the remote login shell
>
> Hi,
> Would you give me the debug module directly?
> Thank you
>
> On Thu, Feb 12, 2009 at 10:15 AM, Jayesh Krishna <jayesh@mcs.anl.gov> wrote:
>> Hi,
>>  Do you get a core dump of mpiexec/hellow/ssh ? (If yes, what does it show?)
>>  Can you run a simple non-MPI C program with fflush(stdout) & fflush(stderr) in it?
>>  If the above suggestions don't narrow down the problem, I will give you a debug module (a patch with some extra printfs) to help us narrow down the problem.
>>
>> (PS: I looked into the code, but cannot think of anything that might fail in your environment.)
>>
>> Regards,
>> Jayesh
>>
>> -----Original Message-----
>> From: Yu-Cheng Chou [mailto:cycchou@ucdavis.edu]
>> Sent: Thursday, February 05, 2009 4:27 PM
>> To: Jayesh Krishna
>> Cc: mpich2-maint@mcs.anl.gov
>> Subject: Re: [mpich2-maint] #403: mpiexec kills the remote login shell
>>
>>>  Hi,
>>> The debug outputs look normal (the problem could be with the part of the code at mpiexec exit() which has no dbg statements). I have added this to our bug tracking list.
>>> Meanwhile,
>>>
>>> #  Can you send us your ".smpd" config file ?
>>
>> The ".smpd" file contains only the following single line.
>>
>> phrase=123
>>
>>> #  Did you modify the MPICH2 code to run on Korebot (please send us your configure command & any env settings used to configure/make MPICH2)?
>>
>> I did not modify the MPICH2 source code.
>> The configure command that I used is listed below.
>>
>> ./configure LDFLAGS=-L/tmp/korebot/mpich2-1.0.8/korebot_openssl/lib --host=arm-linux --with-cross=crosstype --with-pm=smpd --with-mpe=no --disable-f90 --disable-f77 --disable-cxx --prefix=/tmp/korebot/mpich2-1.0.8/korebot_mpich2
>>
>> The "korebot_openssl/lib" directory contains the libraries needed for building smpd.
>>
>> The content of the file "crosstype" is listed below.
>>
>> CROSS_SIZEOF_FLOAT_INT=8
>> CROSS_SIZEOF_DOUBLE_INT=12
>> CROSS_SIZEOF_LONG_INT=8
>> CROSS_SIZEOF_SHORT_INT=8
>> CROSS_SIZEOF_2_INT=8
>> CROSS_SIZEOF_LONG_DOUBLE_INT=16
>>
>> Thank you
>>
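
The CROSS_SIZEOF_* entries above supply answers that configure cannot probe for when cross-compiling. Assuming they record sizeof() of the corresponding C pair structs (the layouts behind MPI datatypes such as MPI_FLOAT_INT and MPI_DOUBLE_INT), they can be double-checked by running a small program such as this sketch on the ARM target:

========================================
#include <stdio.h>

/* Pair structs matching the CROSS_SIZEOF_* entries; the struct names
 * here are only for this check, only the sizes matter. */
struct float_int       { float a;       int b; };
struct double_int      { double a;      int b; };
struct long_int        { long a;        int b; };
struct short_int       { short a;       int b; };
struct two_int         { int a;         int b; };
struct long_double_int { long double a; int b; };

int main(void)
{
    printf("FLOAT_INT=%u DOUBLE_INT=%u LONG_INT=%u SHORT_INT=%u 2_INT=%u LONG_DOUBLE_INT=%u\n",
           (unsigned) sizeof(struct float_int),
           (unsigned) sizeof(struct double_int),
           (unsigned) sizeof(struct long_int),
           (unsigned) sizeof(struct short_int),
           (unsigned) sizeof(struct two_int),
           (unsigned) sizeof(struct long_double_int));
    return 0;
}
========================================

If the printed sizes disagree with the values in crosstype, that mismatch is worth fixing before digging further into the ssh problem.
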
>>> > -----Original Message-----
>>> > From: Yu-Cheng Chou [mailto:cycchou@ucdavis.edu]
>>> > Sent: Wednesday, February 04, 2009 2:32 PM
>>> > To: Jayesh Krishna
>>> > Cc: mpich-discuss@mcs.anl.gov
>>> > Subject: Re: [mpich-discuss] mpiexec kills the remote login shell
>>> >
>>> > >  Hi,
>>> > >   Does smpd abort when you run your MPI job ?
>>> >
>>> > No.
>>> >
>>> > >
>>> > > Regards,
>>> > > Jayesh
>>> > >
>>> > > -----Original Message-----
>>> > > From: Yu-Cheng Chou [mailto:cycchou@ucdavis.edu]
>>> > > Sent: Wednesday, February 04, 2009 1:56 PM
>>> > > To: Jayesh Krishna
>>> > > Cc: mpich-discuss@mcs.anl.gov
>>> > > Subject: Re: [mpich-discuss] mpiexec kills the remote login shell
>>> > >
>>> > > Hi,
>>> > >
>>> > > I can cross-compile the program and then simply run the executable on Korebot with no errors.
>>> > >
>>> > >> Hi,
>>> > >>  Can you try running (without mpiexec) a simple C program with exit(-1) on Korebot ?
>>> > >>
>>> > >> ========================================
>>> > >> #include <stdlib.h>
>>> > >> int main(int argc, char *argv[])
>>> > >> {
>>> > >>     exit(-1);
>>> > >> }
>>> > >> ========================================
>>> > >>
>>> > >> Regards,
>>> > >> Jayesh
>>> > >> ________________________________
>>> > >> From: mpich-discuss-bounces@mcs.anl.gov [mailto:mpich-discuss-bounces@mcs.anl.gov] On Behalf Of Jayesh Krishna
>>> > >> Sent: Wednesday, February 04, 2009 1:04 PM
>>> > >> To: 'Yu-Cheng Chou'
>>> > >> Cc: mpich-discuss@mcs.anl.gov
>>> > >> Subject: Re: [mpich-discuss] mpiexec kills the remote login shell
>>> > >>
>>> > >>  Hi,
>>> > >>   Can you also attach the corresponding smpd debug output ?
>>> > >>
>>> > >> Regards,
>>> > >> Jayesh
>>> > >>
>>> > >> -----Original Message-----
>>> > >> From: Yu-Cheng Chou [mailto:cycchou@ucdavis.edu]
>>> > >> Sent: Wednesday, February 04, 2009 1:02 PM
>>> > >> To: Jayesh Krishna
>>> > >> Cc: mpich-discuss@mcs.anl.gov
>>> > >> Subject: Re: [mpich-discuss] mpiexec kills the remote login shell
>>> > >>
>>> > >> Hi,
>>> > >>
>>> > >> Firstly, the previously attached mpiexec verbose output is the wrong one. I've attached the correct one to this email.
>>> > >>
>>> > >> Secondly, I want to point out that as long as mpiexec is initiated from Korebot to run a program, whether it is an MPI or non-MPI program, and whether or not the program can be found, the ssh connection to Korebot will be gone as soon as mpiexec is finished.
>>> > >>
>>> > >> Thank you
>>> > >>
>>> > >>> Hi,
>>> > >>>   The mpiexec output shows the following error when running hellow,
>>> > >>>
>>> > >>> ==================
>>> > >>> Unable to exec 'hello' on korebot
>>> > >>> Error 2 - No such file or directory
>>> > >>> ==================
>>> > >>>
>>> > >>>   Please provide the debug output of smpd (smpd -d 2>&1 | tee smpd.out) along with mpiexec (mpiexec -verbose -n 2 ./hellow 2>&1 | tee mpiexec.out).
>>> > >>>
>>> > >>> #  Can you run simple C programs (without using mpiexec) on Korebot ?
>>> > >>> #  Is the ssh connection aborted when you run non-MPI programs (mpiexec -n 2 hostname) ?
>>> > >>> #  Can you send us your ".smpd" config file ?
>>> > >>> #  Did you modify the MPICH2 code to run on Korebot (please send us your configure command & any env settings used to configure/make MPICH2)?
>>> > >>>
>>> > >>> Regards,
>>> > >>> Jayesh
>>> > >>>
>>> > >>> ________________________________
>>> > >>> From: mpich-discuss-bounces@mcs.anl.gov [mailto:mpich-discuss-bounces@mcs.anl.gov] On Behalf Of Jayesh Krishna
>>> > >>> Sent: Wednesday, February 04, 2009 8:41 AM
>>> > >>> To: 'Yu-Cheng Chou'
>>> > >>> Cc: mpich-discuss@mcs.anl.gov
>>> > >>> Subject: Re: [mpich-discuss] mpiexec kills the remote login shell
>>> > >>>
>>> > >>>  Hi,
>>> > >>>   I will take a look at the debug logs and get back to you.
>>> > >>> Meanwhile, can you run simple C programs without using mpiexec on Korebot ?
>>> > >>>   MPICH2 currently does not support heterogeneous systems (so you won't be able to run your MPI job across ARM & other architectures).
>>> > >>>
>>> > >>> Regards,
>>> > >>> Jayesh
>>> > >>>
>>> > >>> -----Original Message-----
>>> > >>> From: Yu-Cheng Chou [mailto:cycchou@ucdavis.edu]
>>> > >>> Sent: Tuesday, February 03, 2009 7:52 PM
>>> > >>> To: Jayesh Krishna
>>> > >>> Cc: mpich-discuss@mcs.anl.gov
>>> > >>> Subject: Re: [mpich-discuss] mpiexec kills the remote login shell
>>> > >>>
>>> > >>>> # Can you run non-MPI programs using mpiexec (mpiexec -n 2 hostname) ?
>>> > >>> Yes.
>>> > >>>
>>> > >>>> # Can you compile and run the hello world program (examples/hellow.c) provided with MPICH2 (mpiexec -n 2 ./hellow)?
>>> > >>> Yes.
>>> > >>>
>>> > >>>> # How did you start smpd (the command used to start smpd) ? How did you run your MPI job (the command used to run your job)?
>>> > >>> I have a ".smpd" file containing one line of information, which is "phrase=123".
>>> > >>> Thus, I started smpd using "smpd -s".
>>> > >>> Then I used "mpiexec -n 1 hellow" to run hellow on Korebot.
>>> > >>>
>>> > >>>> # How did you find that mpiexec kills the sshd process (We typically ssh to unix machines and run mpiexec without any problems) ?
>>> > >>> I logged in to Korebot with two terminals.
>>> > >>> From terminal #1, I checked all the processes running on Korebot.
>>> > >>> From terminal #2, I started smpd and ran hellow using the commands mentioned above.
>>> > >>> After hellow was finished, the connection to Korebot via terminal #2 was closed.
>>> > >>> From terminal #1, I could see that the sshd process associated with terminal #2 was gone.
>>> > >>>
>>> > >>>>  Can you run smpd/mpiexec in debug mode and provide us with the outputs (smpd -d / mpiexec -n 2 -verbose hostname) ?
>>> > >>> The first attached text file is the output from running hellow in mpiexec's verbose mode.
>>> > >>>
>>> > >>> There is another issue.
>>> > >>> This time, I used two machines. One is the Korebot mentioned above, and the other is a laptop running Ubuntu Linux.
>>> > >>> I started smpd with the same ".smpd" file and command as mentioned above on both Korebot and the laptop.
>>> > >>> There is a machine file called "hostfile" on Korebot. The file contains the names of the two machines.
>>> > >>>
>>> > >>> korebot
>>> > >>> shrimp
>>> > >>>
>>> > >>> Then, from Korebot, I ran cpi using the following command.
>>> > >>>
>>> > >>> mpiexec -machinefile ./hostfile -verbose -n 2 cpi
>>> > >>>
>>> > >>> But the value of pi comes out as a huge number. I think it is related to "double type variables" being transferred between processes running on an ARM-based Linux machine and a general Linux machine.
>>> > >>>
>>> > >>> The second attached text file is the output from running cpi in mpiexec's verbose mode.
>>> > >>>
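
A quick way to check whether the wrong value of pi comes from mismatched double representations is to dump the raw bytes of a known double on each host and compare them; a minimal sketch:

========================================
#include <stdio.h>
#include <string.h>

/* Print the in-memory bytes of a double so the layout can be
 * compared between the ARM Korebot and the x86 laptop. */
int main(void)
{
    double d = 3.14159265358979;
    unsigned char bytes[sizeof(d)];
    size_t i;

    memcpy(bytes, &d, sizeof(d));
    for (i = 0; i < sizeof(d); i++)
        printf("%02x ", bytes[i]);
    printf("\n");
    return 0;
}
========================================

If the byte patterns differ (older ARM floating-point ABIs, for example, store doubles with their two 32-bit words swapped), MPI_DOUBLE data exchanged between the two hosts will be misread, which would explain a wildly wrong pi and is consistent with the note above that MPICH2 does not support heterogeneous systems.
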
>>> > >>>>
>>> > >>>> I am cross-compiling mpich2-1.0.8 with smpd for the Khepera III mobile robot.
>>> > >>>>
>>> > >>>> This mobile robot has a Korebot board, which is an ARM-based computer with a Linux operating system.
>>> > >>>>
>>> > >>>> The cross-compilation was fine.
>>> > >>>>
>>> > >>>> Firstly, I logged in to Korebot through ssh.
>>> > >>>> Secondly, I started smpd.
>>> > >>>> Thirdly, I ran mpiexec to execute an MPI program (cpi) that comes with the package.
>>> > >>>>
>>> > >>>> The result was correct, but when mpiexec was finished, the ssh connection to the Korebot was closed.
>>> > >>>> I found that mpiexec kills the sshd process through which I was remotely connected to Korebot.
>>> > >>>>
>>> > >>>> I've been looking for the cause, but still have not found any clues.
>>> > >>>>
>>> > >>>> Could you give me any ideas to solve this problem?
>>> > >>>>
>>> > >>>> Thank you,
>>> > >>>>
>>> > >>>> Yu-Cheng
>>>
>>> --
>>> Ticket URL: <https://trac.mcs.anl.gov/projects/mpich2/ticket/403>