<chapter>
<title>Component Operations</title>

<para>
During normal operations, the Cobalt components produce a variety
of messages, allowing most system state to be tracked through
logs. All messages are logged to the syslog facility LOG_LOCAL0,
so ensure that these messages are captured.
</para>
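
<para>
For example, with a classic syslogd, a configuration line like the
following would capture these messages; the destination file here
is an assumption, and the exact syntax depends on the syslog
daemon in use.
</para>
<programlisting>
# /etc/syslog.conf -- route Cobalt's LOG_LOCAL0 messages to one file
local0.*                /var/log/cobalt.log
</programlisting>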

<section>
<title>Job Execution</title>

<para>
Job execution is the most common operation in Cobalt. It is a
procedure that requires several components to work in
concert. All jobs go through the same basic steps:
</para>
---|
19 |
|
---|
20 |
<variablelist> |
---|
21 |
<varlistentry> |
---|
22 |
<term>Initial Job Queueing</term> |
---|
23 |
<listitem> |
---|
24 |
<para> |
---|
25 |
A request is sent to the queue manager describing a |
---|
26 |
new job. Aspects of this request are checked both on the |
---|
27 |
server side, and in <filename>cqsub</filename>, for better |
---|
28 |
user error messages. |
---|
29 |
|
---|
30 |
|
---|
31 |
|
---|
32 |
|
---|
33 |
</para> |
---|
34 |
</listitem> |
---|
35 |
</varlistentry> |
<varlistentry>
<term>Job Scheduling</term>
<listitem>
<para>
The scheduler periodically polls the queue manager for new
jobs, and can also receive events as asynchronous
notifications of queue activity. At these times, it connects
to the queue manager and fetches information about current
jobs. This process results in a set of idle partitions and a
set of idle jobs. If both sets are non-empty, the scheduler
attempts to place idle jobs on idle partitions. This cycle
culminates in the execution of suitable jobs, if they can be
scheduled.
</para>
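<para>
A minimal sketch of this placement cycle, for illustration
only; this is not Cobalt's actual scheduler code, and the job
and partition dictionaries are assumed shapes.
</para>
<programlisting>
def place_jobs(idle_jobs, idle_partitions):
    """Match idle jobs to idle partitions, first fit."""
    placements = []
    for job in idle_jobs:
        for part in list(idle_partitions):
            if part['size'] >= job['nodes']:
                placements.append((job['jobid'], part['name']))
                idle_partitions.remove(part)
                break
    return placements
</programlisting>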
</listitem>
</varlistentry>
<varlistentry>
<term>Job Execution</term>
<listitem>
<para>
Once the queue manager receives a job-run command from the
scheduler, it can start the job on the specified
resources. At this point, the job state machine is
activated. This state machine can contain different steps
depending on the underlying architecture and which queue
manager features are enabled. For example, enabling
allocation management functionality causes jobs to run
several extra job steps before completion. These extra
steps will not be discussed here; our main focus is
generic job execution.
</para>
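<para>
As an illustration, a generic BG/L job passes through an
ordered list of named steps such as the following; the names
are taken from the log trace in the next section, and the
exact sequence depends on platform and enabled features.
</para>
<programlisting>
# steps the job state machine runs, in order
steps = ['SetBGKernel', 'RunBGUserJob', 'FinishUserPgrp', 'Finish']
</programlisting>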
</listitem>
</varlistentry>

<varlistentry>
<term>Process Group Execution</term>
<listitem>
<para>
The queueing system spawns some number of parallel
processes for each job. The execution, management, and
cleanup of these processes is handled by the process
manager. Like the queue manager, it emits a number of
events as process groups execute.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Process Group Cleanup</term>
<listitem>
<para>
Parallel process management semantics are much like Unix
process semantics. Processes can be started, signalled,
killed, and can exit of their own accord. As with Unix
processes, process groups must be reaped once they have
finished execution. At reap time, stdio and return codes
are available to the "parent" component.
</para>
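<para>
The Unix analogy, as a minimal sketch: a child process must
be reaped with waitpid() before its exit status becomes
available to the parent.
</para>
<programlisting>
import os

pid = os.fork()
if pid == 0:
    os._exit(3)                    # child exits with code 3
_, status = os.waitpid(pid, 0)     # parent reaps the child
print("exit code: %d" % os.WEXITSTATUS(status))
</programlisting>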
</listitem>
</varlistentry>
<varlistentry>
<term>Job Step Execution</term>
<listitem>
<para>
As the job executes, some number of process groups will be
run, resulting in several cycles of the previous two
steps. Note that process groups can be serial as well, so
steps like the job prologue and epilogue are executed in an
identical fashion.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Job Completion</term>
<listitem>
<para>
Once all steps have completed, the job is finished. Cleanup
consists of logging a usage summary, deleting the job from
the queue, and emitting an event. At this point, the job no
longer exists.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Scheduler Cleanup</term>
<listitem>
<para>
When the job no longer exists in the queue manager, the
scheduler flags it as exited and frees its execution
location. It then attempts to schedule idle jobs in this
location.
</para>
</listitem>
</varlistentry>
</variablelist>
</section>

<section>
<title>Job Log Trace</title>

<para>
The following is a set of example logs pertaining to a single
job.
</para>

<programlisting>
Jun 29 20:27:14 sn1 BGSched: Found new job 4719
Jun 29 20:27:14 sn1 BGSched: Scheduling job 4719 on partition R000_J108-32
Jun 29 20:27:14 sn1 cqm: Running job 4719 on R000_J108-32
Jun 29 20:27:14 sn1 cqm: running step SetBGKernel for job 4719
Jun 29 20:27:14 sn1 cqm: running step RunBGUserJob for job 4719
Jun 29 20:27:14 sn1 bgpm: ProcessGroup 84 Started on partition R000_J108-32. pid: 29368
Jun 29 20:27:16 sn1 bgpm: Running /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun mpirun
-np 32 -partition R000_J108-32 -mode co
-cwd /bgl/home1/adiga/alumina/surface/slab_30/1x1/300K/zerok
-exe /home/adiga/alumina/surface/slab_30/1x1/300K/zerok/DLPOLY.X
Jun 29 21:05:28 sn1 bgpm: ProcessGroup 84 Finshed. pid 29368
Jun 29 21:05:28 sn1 cqm: user completed for job 4719
Jun 29 21:05:28 sn1 cqm: running step FinishUserPgrp for job 4719
Jun 29 21:05:29 sn1 bgpm: Got wait-process-group from 10.0.0.1
Jun 29 21:05:29 sn1 cqm: running step Finish for job 4719
Jun 29 21:05:29 sn1 cqm: Job 4719/adiga on 32 nodes done. queue:9.18s user:2294.08s
Jun 29 21:05:35 sn1 BGSched: Job 4719 gone from qm
Jun 29 21:05:35 sn1 BGSched: Freeing partition R000_J108-32
Jun 29 21:28:37 sn1 BGSched: Found new job 4720
</programlisting>

<para>
In the event that this job ran out of time or was deleted with
<filename>cqdel</filename>, additional log messages would appear
to that effect.
</para>

</section>

<section>
<title>Job Accounting Messages</title>

<para>
Job accounting log messages are logged to files in the directory
specified by <filename>log_dir</filename> in the [cqm] section
of the config file. Basic messages are logged by the
queue-manager for job queueing (Q), execution (S), and exit
(E). Additional messages include the location where the job is
running and the exit code.
</para>

<programlisting>
Q;jobid;user;queue
S;jobid;user;job name;nodes;processors;mode;walltime
Job jobid/user/Q:queue: Running job on location
E;jobid;user;walltime
Job jobid/user on nodes nodes done. queue:[queuetime]s
user:[walltime]s exit:exitcode
</programlisting>
<para>
Example:
</para>
<programlisting>
2007-06-06 12:43:50 Q;59;bob;default
2007-06-06 12:44:56 S;59;bob;N/A;32;32;co;20
2007-06-06 12:44:56 Job 59/bob/4539/Q:default: Running job on 32_R000_J108_N3
2007-06-06 12:44:56 Job 59/bob using kernel default
2007-06-06 12:45:08 E;59;bob;14
2007-06-06 12:45:09 Job 59/bob on 32 nodes done. queue:65.88s user:11.98s exit:0
</programlisting>
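
<para>
The semicolon-delimited records are easy to consume
programmatically. The following is a minimal parsing sketch
based on the formats listed above; it is illustrative, not part
of Cobalt.
</para>
<programlisting>
def parse_record(line):
    """Parse one accounting line into a dict, or return None."""
    date, time, record = line.split(None, 2)
    fields = record.split(';')
    if fields[0] == 'Q':
        return {'event': 'queued', 'jobid': fields[1],
                'user': fields[2], 'queue': fields[3]}
    if fields[0] == 'S':
        return {'event': 'started', 'jobid': fields[1],
                'user': fields[2], 'walltime': fields[7]}
    if fields[0] == 'E':
        return {'event': 'exited', 'jobid': fields[1],
                'user': fields[2], 'walltime': fields[3]}
    return None   # free-form messages are skipped

print(parse_record('2007-06-06 12:45:08 E;59;bob;14'))
</programlisting>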

<para>
More details can be found in the log messages from the
scheduler, process-manager, and queue-manager. The scheduler
logs where the job is executing. The process-manager (bgpm) logs
the program executable and arguments, as well as the exit code
of the program. The queue-manager also logs statistics about the
job execution, such as the kernel used, the actual run time, and
the queue wait time.
</para>

<para>
Example:
</para>
<programlisting>
Dec 15 17:29:39 localhost bgsched[4760]: Job 25537/bob: Scheduling
job 25537 on partition 32wayN0
Dec 15 17:29:39 localhost cqm[4152]: Job 25537/bob using kernel default
Dec 15 17:29:39 localhost bgpm[4124]: Job 25537/bob: ProcessGroup 1
Started on partition 32wayN0. pid: 4220
Dec 15 17:29:39 localhost bgpm[4220]: Job 25537/bob: Running
/bgl/BlueLight/ppcfloor/bglsys/bin/mpirun mpirun -np 32 -partition
32wayN0 -mode co -cwd /home/bob/tests -exe /home/bob/tests/ring-hello
Dec 15 17:35:49 localhost bgpm[4124]: Job 25537/bob: ProcessGroup 1
Finshed with exit code 0. pid 4220
Dec 15 17:35:59 localhost cqm[4152]: Job 25537/bob on 32 nodes
done. queue:2.99s user:10.18s
</programlisting>
</section>

<section>
<title>Data Persistence</title>

<para>
Each Cobalt component has some data that must be persistent, and
all components store this data using a common mechanism
implemented in shared code. Periodically, each component
marshals its persistent data down to a text stream (using
Python's cPickle module) and saves it in a file in the directory
<filename>/var/spool/cobalt/</filename>. The filenames in this
directory correspond to the component implementation names;
these are the names that appear in syslog log messages (i.e.
cqm, bgpm, BGSched).
</para>
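
<para>
A minimal sketch of this save/restore cycle, for illustration
only; the function names and exact file handling are
assumptions, not Cobalt's internal API.
</para>
<programlisting>
import cPickle   # Python 2, as used by Cobalt

def save_state(component, state):
    f = open('/var/spool/cobalt/%s' % component, 'wb')
    cPickle.dump(state, f)
    f.close()

def restore_state(component):
    f = open('/var/spool/cobalt/%s' % component, 'rb')
    state = cPickle.load(f)
    f.close()
    return state
</programlisting>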
</section>

<section>
<title>Resource Manager Operations</title>

<para>This section describes basic operations of the Cobalt
resource manager, in terms of initial setup and long-term
operations. First we will describe how jobs flow through the
system, and then we will describe how administrators can shape
this process.</para>

<para>
Internally, the Cobalt queue manager (cqm) stores a set of
queues, each of which is associated with some number of
jobs. Each queue has a set (potentially empty) of submission
restrictions. These restrictions can limit a queue to jobs of
particular sizes, walltimes, or users. At submission time, jobs
are verified against all restrictions. If a job fails to pass
any of these, it is rejected, and the user is presented with an
error message describing the failure. Queues also have a state
that governs their behavior. The state "running" allows both
the submission and execution of jobs. The state "draining"
allows jobs to execute, but prevents submission of new jobs. The
state "stopped" allows jobs to be submitted, but does not allow
job execution. Finally, the state "dead" prevents both job
submission and execution.
</para>
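
<para>
A compact summary of the four queue states, as a sketch; this
dictionary is illustrative, not Cobalt's internal
representation.
</para>
<programlisting>
QUEUE_STATES = {
    #  state       may submit  may run
    'running':   (True,       True),
    'draining':  (False,      True),
    'stopped':   (True,       False),
    'dead':      (False,      False),
}

def can_submit(state):
    return QUEUE_STATES[state][0]

def can_run(state):
    return QUEUE_STATES[state][1]
</programlisting>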

<para>When setting up a new Cobalt instance, the default queue must
be created. This can be done with the command:</para>
<programlisting>
$ cqadm.py --addq default
</programlisting>

<para>
Restrictions can also be added using cqadm. Once the queues are
enabled, users can submit jobs.
</para>

<para>
Job execution is controlled by the scheduler, which makes
scheduling decisions based on several criteria.
</para>

<variablelist>
<varlistentry>
<term>Queue Status</term>
<listitem>
<para>The scheduler will only execute jobs from queues that
are in the "running" or "draining" states.</para>

<para>Queue status can be changed using the cqadm command.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Partition Status</term>
<listitem>
<para>Partitions have two flags that govern their
use. "Active" partitions are functional: the partition, and
all containing partitions that are also active, work
properly. If a partition is "inactive", then it and all
larger containing partitions will not be used.</para>

<para>Partitions also have a "scheduled" flag, which allows
administrators to control whether a partition should have
jobs scheduled on it. Typically, administrators toggle the
"scheduled" flag frequently, to govern the available machine
topology, and change the "active" flag only during system
faults.</para>

<para>Both partition status flags can be toggled using the
partadm command.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Partition Queue Setting</term>
<listitem>
<para>
Partitions are made available to one or more queues at any
given time. Only jobs from the specified queues can be run on
a partition.
</para>

<para>Partition queue settings can be set using the partadm command.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Reservations</term>
<listitem>
<para>
Each reservation has a list of associated users, a list of
partitions, and an active period. During this period, only
users listed on the reservation can run on the reserved
resources. If multiple reservations are active
simultaneously, then a user must be listed on all of them in
order to consume resources.</para>
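<para>
A minimal sketch of this access rule; the data layout is an
assumption for illustration, not Cobalt's internal
representation.
</para>
<programlisting>
def user_may_run(user, active_reservations):
    """True if the user is listed on every currently active
    reservation (trivially true when none are active)."""
    return all(user in r['users'] for r in active_reservations)
</programlisting>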
<para>Reservations can be set, displayed, and released using
the commands setres, showres, and releaseres, respectively.</para>
</listitem>
</varlistentry>
</variablelist>

</section>
</chapter>