<chapter>
  <title>Component Operations</title>

  <para>
    During normal operation, the Cobalt components produce a variety of
    messages, which allows most system state to be tracked through the
    logs. All messages are logged to the syslog facility LOG_LOCAL0, so
    ensure that these messages are captured.
  </para>
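
  <para>
    For example, with a traditional syslogd, a single configuration line
    routes the LOG_LOCAL0 facility to a dedicated file. The destination
    path below is only an example; adjust it for your site.
  </para>

  <programlisting>
# /etc/syslog.conf -- capture Cobalt component messages (example target)
local0.*                                        /var/log/cobalt.log
  </programlisting>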

  <section>
    <title>Job Execution</title>

    <para>
      Job execution is the most common operation in Cobalt. It is a
      procedure that requires several components to work in
      concert. All jobs go through the same basic steps:
    </para>

    <variablelist>
      <varlistentry>
        <term>Initial Job Queueing</term>
        <listitem>
          <para>
            A request is sent to the queue manager describing a
            new job. Aspects of this request are checked both on the
            server side and in <filename>cqsub</filename>, for better
            user error messages. <!-- Whenever a job is created or changes -->
<!--        state, appropriate events are emitted. These events can be -->
<!--        seen using the <filename>eminfo.py</filename> command. Any -->
<!--        client that has subscribed to this sort of event will -->
<!--        receive a copy. -->
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Job Scheduling</term>
        <listitem>
          <para>
            The scheduler periodically polls the queue manager for new
            jobs, and can also receive events as an asynchronous
            notification of queue activity. At these times, it
            connects to the queue manager and fetches information
            about current jobs. This process results in a set of idle
            partitions and a set of idle jobs. If both sets are
            non-empty, the scheduler attempts to place idle jobs on
            idle partitions. This cycle culminates in the execution of
            suitable jobs, if they can be scheduled. (A simplified
            sketch of this placement cycle appears after this list.)
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Job Execution</term>
        <listitem>
          <para>
            Once the queue manager gets a job-run command from the
            scheduler, it can start the job on the specified
            resources. At this point, the job state machine is
            activated. This state machine can contain different steps
            depending on the underlying architecture and which queue
            manager features are enabled. For example, enabling
            allocation management functionality causes jobs to run
            several extra job steps before completion. These extra
            steps will not be discussed here; our main focus is
            generic job execution.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term>Process Group Execution</term>
        <listitem>
          <para>
            The queueing system spawns some number of parallel
            processes for each job. The execution, management, and
            cleanup of these processes is handled by the process
            manager. It, like the queue manager, emits a number
            of events as process groups execute.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Process Group Cleanup</term>
        <listitem>
          <para>
            Parallel process management semantics are not unlike unix
            process semantics. Processes can be started, signalled,
            killed, and can exit of their own accord. Similar to unix
            processes, process groups must be reaped once they have
            finished execution. At reap time, stdio and return codes
            are available to the "parent" component.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Job Step Execution</term>
        <listitem>
          <para>
            As the job executes, some number of process groups are
            run, resulting in several cycles of the previous two
            steps. Note that process groups can be serial as well, so
            steps like the job prologue and epilogue are executed in
            an identical fashion.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Job Completion</term>
        <listitem>
          <para>
            Once all steps have completed, the job is
            finished. Cleanup consists of logging a usage summary, job
            deletion from the queue, and event emission. At this
            point, the job no longer exists.
          </para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Scheduler Cleanup</term>
        <listitem>
          <para>
            When the job no longer exists in the queue manager, the
            scheduler flags it as exited and frees its execution
            location. It then attempts to schedule idle jobs in this
            location.
          </para>
        </listitem>
      </varlistentry>
    </variablelist>
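
    <para>
      The placement cycle described under Job Scheduling can be
      summarized with a short sketch. The helper functions below
      (get_idle_jobs, get_idle_partitions, run_job) are hypothetical
      stand-ins for queries against the queue manager and the partition
      inventory, not actual Cobalt interfaces; real placement also
      honors the queue, partition, and reservation policies described
      later in this chapter.
    </para>

    <programlisting>
# Illustrative sketch of the scheduler's placement cycle.
def placement_cycle(get_idle_jobs, get_idle_partitions, run_job):
    idle_jobs = get_idle_jobs()              # fetched from the queue manager
    idle_partitions = get_idle_partitions()  # locations with nothing running
    if not idle_jobs or not idle_partitions:
        return                               # nothing to do this cycle
    for job in idle_jobs:
        # consider only partitions large enough for the job
        fits = [p for p in idle_partitions if p["size"] >= job["nodes"]]
        if fits:
            # prefer the smallest partition that fits
            partition = min(fits, key=lambda p: p["size"])
            run_job(job, partition)          # queue manager starts the job
            idle_partitions.remove(partition)
    </programlisting>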
  </section>

  <section>
    <title>Job Log Trace</title>

    <para>
      The following is a set of example logs pertaining to a single
      job.
    </para>

    <programlisting>
Jun 29 20:27:14 sn1 BGSched: Found new job 4719
Jun 29 20:27:14 sn1 BGSched: Scheduling job 4719 on partition R000_J108-32
Jun 29 20:27:14 sn1 cqm: Running job 4719 on R000_J108-32
Jun 29 20:27:14 sn1 cqm: running step SetBGKernel for job 4719
Jun 29 20:27:14 sn1 cqm: running step RunBGUserJob for job 4719
Jun 29 20:27:14 sn1 bgpm: ProcessGroup 84 Started on partition R000_J108-32. pid: 29368
Jun 29 20:27:16 sn1 bgpm: Running /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun mpirun
  -np 32 -partition R000_J108-32 -mode co
  -cwd /bgl/home1/adiga/alumina/surface/slab_30/1x1/300K/zerok
  -exe /home/adiga/alumina/surface/slab_30/1x1/300K/zerok/DLPOLY.X
Jun 29 21:05:28 sn1 bgpm: ProcessGroup 84 Finshed. pid 29368
Jun 29 21:05:28 sn1 cqm: user completed for job 4719
Jun 29 21:05:28 sn1 cqm: running step FinishUserPgrp for job 4719
Jun 29 21:05:29 sn1 bgpm: Got wait-process-group from 10.0.0.1
Jun 29 21:05:29 sn1 cqm: running step Finish for job 4719
Jun 29 21:05:29 sn1 cqm: Job 4719/adiga on 32 nodes done. queue:9.18s user:2294.08s
Jun 29 21:05:35 sn1 BGSched: Job 4719 gone from qm
Jun 29 21:05:35 sn1 BGSched: Freeing partition R000_J108-32
Jun 29 21:28:37 sn1 BGSched: Found new job 4720
    </programlisting>

    <para>
      If this job had run out of time or been deleted with
      <filename>cqdel</filename>, additional log messages would appear
      to that effect.
    </para>

  </section>

  <section>
    <title>Job Accounting Messages</title>

    <para>
      Job accounting log messages are logged to files in the directory
      specified by <filename>log_dir</filename> in the [cqm] section
      of the config file. Basic messages are logged by the
      queue-manager for job queuing (Q), execution (S), and exit
      (E). Additional messages include the location where the job is
      running and the exit code.

      <programlisting>
Q;jobid;user;queue
S;jobid;user;job name;nodes;processors;mode;walltime
Job jobid/user/Q:queue: Running job on location
E;jobid;user;walltime
Job jobid/user on nodes nodes done. queue:[queuetime]s
  user:[walltime]s  exit:exitcode
      </programlisting>
    </para>
    <para>
      Example:
      <programlisting>
2007-06-06 12:43:50 Q;59;bob;default
2007-06-06 12:44:56 S;59;bob;N/A;32;32;co;20
2007-06-06 12:44:56 Job 59/bob/4539/Q:default: Running job on 32_R000_J108_N3
2007-06-06 12:44:56 Job 59/bob using kernel default
2007-06-06 12:45:08 E;59;bob;14
2007-06-06 12:45:09 Job 59/bob on 32 nodes done. queue:65.88s user:11.98s  exit:0
      </programlisting>
    </para>
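
    <para>
      Because these accounting records are plain, semicolon-delimited
      lines, they are easy to post-process. The sketch below tallies job
      submissions per user from the Q records; the file name is a
      placeholder, so point it at a file under the configured
      <filename>log_dir</filename>.
    </para>

    <programlisting>
# Count job submissions per user from Cobalt accounting records.
from collections import Counter

submissions = Counter()
with open("accounting.log") as logfile:   # placeholder path
    for line in logfile:
        # Records look like "2007-06-06 12:43:50 Q;59;bob;default";
        # keep the payload after the date and time fields.
        payload = line.rstrip("\n").split(" ", 2)[-1]
        if payload.startswith("Q;"):
            record_type, jobid, user, queue = payload.split(";")
            submissions[user] += 1

for user, count in submissions.most_common():
    print("%s %d" % (user, count))
    </programlisting>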

    <para>
      More details can be found in the log messages from the
      scheduler, process-manager, and queue-manager. The scheduler
      logs where the job is executing. The process-manager (bgpm) logs
      the program executable and arguments, as well as the exit code
      of the program. The queue-manager also logs statistics about the
      job execution, such as the kernel used, the actual run time and
      the queue wait time.
    </para>

    <para>
      Example:
      <programlisting>
Dec 15 17:29:39 localhost bgsched[4760]: Job 25537/bob: Scheduling
  job 25537 on partition 32wayN0
Dec 15 17:29:39 localhost cqm[4152]: Job 25537/bob using kernel default
Dec 15 17:29:39 localhost bgpm[4124]: Job 25537/bob: ProcessGroup 1
  Started on partition 32wayN0. pid: 4220
Dec 15 17:29:39 localhost bgpm[4220]: Job 25537/bob: Running
  /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun mpirun -np 32 -partition
  32wayN0 -mode co -cwd /home/bob/tests -exe /home/bob/tests/ring-hello
Dec 15 17:35:49 localhost bgpm[4124]: Job 25537/bob: ProcessGroup 1
  Finshed with exit code 0. pid 4220
Dec 15 17:35:59 localhost cqm[4152]: Job 25537/bob on 32 nodes
  done. queue:2.99s user:10.18s
      </programlisting>
    </para>
  </section>

  <section>
    <title>Data Persistence</title>

    <para>
      Each Cobalt component has some data that must persist across
      restarts, and all components store this data using a common
      mechanism implemented in shared code. Periodically, each
      component marshals its persistent data down to a text stream
      (using Python's cPickle module) and saves it in a file in the
      directory <filename>/var/spool/cobalt/</filename>. The filenames
      in this directory correspond to the component implementation
      names; these are the same names that appear in syslog log
      messages (i.e. cqm, bgpm, BGSched).
    </para>
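
    <para>
      As a rough illustration of this mechanism (not the components'
      actual code), a component might save its state as shown below;
      the component name and state contents are placeholders.
    </para>

    <programlisting>
# Hedged sketch of the periodic state save described above.
import os

try:
    import cPickle as pickle   # module named above (Python 2)
except ImportError:
    import pickle              # Python 3 equivalent

SPOOL_DIR = "/var/spool/cobalt"

def save_state(component_name, state):
    """Pickle `state` into the file named after the component."""
    path = os.path.join(SPOOL_DIR, component_name)
    with open(path + ".new", "wb") as statefile:
        pickle.dump(state, statefile)
    os.rename(path + ".new", path)   # swap in the new snapshot atomically

# For example, a queue-manager-like component might call:
#     save_state("cqm", {"jobs": {}, "queues": ["default"]})
    </programlisting>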

<!--     <para> -->
<!--       This data can be manipulated from a python interpreter using the -->
<!--       <filename>cddbg.py</filename>. This should not be attempted -->
<!--       unless you really know what you are doing. -->
<!--     </para> -->
  </section>
  <section>
    <title>Resource Manager Operations</title>

    <para>This section describes basic operations of the Cobalt
    resource manager in terms of initial setup and long-term
    operations. First we will describe how jobs flow through the
    system, and then we will describe how administrators can shape
    this process.</para>

    <para>
      Internally, the Cobalt queue manager (cqm) stores a set of
      queues, each of which is associated with some number of
      jobs. Each queue has a set (potentially empty) of submission
      restrictions. These restrictions can limit a queue to jobs of
      particular sizes, walltimes, or users. At submission time, jobs
      are verified against all restrictions. If a job fails to pass
      any of these, it is rejected, and the user is presented with an
      error message describing the failure. Queues also have a state
      that governs their behavior. The state "running" allows
      the submission and execution of jobs. The state "draining"
      allows jobs to execute, but prevents submission of new jobs. The
      state "stopped" allows jobs to be submitted, but does not allow
      job execution. Finally, the state "dead" prevents both job
      submission and execution.
    </para>
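
    <para>
      The four states differ only in whether they permit submission and
      execution. A minimal summary of that behavior, written as a
      Python table purely for illustration (this is not Cobalt code),
      might look like this:
    </para>

    <programlisting>
# Queue states and what each permits: (submission allowed, execution allowed).
QUEUE_STATES = {
    "running":  (True,  True),    # jobs may be submitted and run
    "draining": (False, True),    # existing jobs run, no new submissions
    "stopped":  (True,  False),   # jobs may be submitted but will not run
    "dead":     (False, False),   # neither submission nor execution
}

def can_submit(state):
    return QUEUE_STATES[state][0]

def can_run(state):
    return QUEUE_STATES[state][1]
    </programlisting>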

    <para>When setting up a new Cobalt instance, the default queue must
    be created. This can be done with the command:</para>
    <programlisting>
      $ cqadm.py --addq default
    </programlisting>

    <para>
      Restrictions can also be added using cqadm. Once the queues are
      enabled, users can submit jobs.
    </para>

    <para>
      Job execution is controlled by the scheduler. It makes
      scheduling decisions based on several criteria, which are
      described below and combined into a single check in the sketch
      that follows the list.
    </para>

    <variablelist>
      <varlistentry>
        <term>Queue Status</term>
        <listitem>
          <para>The scheduler will only execute jobs from queues that
            are in the "running" or "draining" states.</para>

          <para>Queue status can be toggled using the cqadm command.</para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Partition Status</term>
        <listitem>
          <para>Partitions have two flags that govern their
          use. A partition marked "active" is functional; it, and any
          containing partitions that are also active, will work
          properly. If a partition is "inactive", then neither it nor
          any larger partition containing it will be used.</para>

          <para>Partitions also have a "scheduled" flag. This allows
          administrators to control whether a partition should have
          jobs scheduled on it. Typically, administrators will
          frequently toggle the "scheduled" flag to govern the
          available machine topology, while only changing the "active"
          flag during system faults.</para>

          <para>Both partition status flags can be toggled using the
          partadm command.</para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Partition Queue Setting</term>
        <listitem>
          <para>
            Partitions are made available to one or more queues at any
            given time. Only jobs from the specified queues can be run on a
            partition.
          </para>

          <para>Partition queue settings can be set using the partadm command.</para>
        </listitem>
      </varlistentry>
      <varlistentry>
        <term>Reservations</term>
        <listitem>
          <para>
            Reservations each have a list of associated users,
            partitions, and an active period. During this period, only
            users listed on a reservation can run on reserved
            resources. If multiple reservations are active
            simultaneously, then a user must be listed on all of them
            in order to consume resources.</para>
          <para>Reservations can be set, displayed, and released using
          the commands setres, showres, and releaseres, respectively.</para>
        </listitem>
      </varlistentry>
    </variablelist>
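
    <para>
      Taken together, these criteria act as a gate on every job and
      partition pairing that the scheduler considers. The sketch below
      combines them into a single check; the dictionary arguments are
      illustrative stand-ins, not actual Cobalt data structures.
    </para>

    <programlisting>
# Hedged sketch combining the scheduling criteria listed above.
def eligible(job, queue, partition, reservations, now):
    # Queue status: only "running" or "draining" queues release jobs.
    if queue["state"] not in ("running", "draining"):
        return False
    # Partition status: the partition must be active and scheduled.
    if not (partition["active"] and partition["scheduled"]):
        return False
    # Partition queue setting: the job's queue must be allowed here.
    if queue["name"] not in partition["queues"]:
        return False
    # Reservations: while a reservation covering this partition is
    # active, only its listed users may consume the resources.
    for res in reservations:
        active = now >= res["start"] and res["end"] > now
        if active and partition["name"] in res["partitions"]:
            if job["user"] not in res["users"]:
                return False
    return True
    </programlisting>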

  </section>
</chapter>