<chapter>
<title>Component Operations</title>

<para>
During normal operations, the Cobalt components produce a variety
of messages, allowing most system state to be tracked through
logs. All messages are logged to the syslog facility LOG_LOCAL0,
so ensure that these messages are captured.
</para>
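
<para>
For example, with a classic syslogd, a configuration line like the
following would capture these messages; the destination file here
is an assumption, and the exact syntax depends on the syslog
daemon in use.
</para>
<programlisting>
# /etc/syslog.conf -- route Cobalt's LOG_LOCAL0 messages to one file
local0.*                /var/log/cobalt.log
</programlisting>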

<section>
<title>Job Execution</title>

<para>
Job execution is the most common operation in Cobalt. It is a
procedure that requires several components to work in
concert. All jobs go through the same basic steps:
</para>
---|
19 |
|
---|
20 |
<variablelist> |
---|
21 |
<varlistentry> |
---|
22 |
<term>Initial Job Queueing</term> |
---|
23 |
<listitem> |
---|
24 |
<para> |
---|
25 |
A request is sent to the queue manager describing a |
---|
26 |
new job. Aspects of this request are checked both on the |
---|
27 |
server side, and in <filename>cqsub</filename>, for better |
---|
28 |
user error messages. |
---|
29 |
|
---|
30 |
|
---|
31 |
|
---|
32 |
|
---|
33 |
</para> |
---|
34 |
</listitem> |
---|
35 |
</varlistentry> |
<varlistentry>
<term>Job Scheduling</term>
<listitem>
<para>
The scheduler periodically polls the queue manager for new
jobs, and can also receive events as asynchronous
notifications of queue activity. At these times, it connects
to the queue manager and fetches information about current
jobs. This process results in a set of idle partitions and a
set of idle jobs. If both sets are non-empty, the scheduler
attempts to place idle jobs on idle partitions. This cycle
culminates in the execution of suitable jobs, if they can be
scheduled.
</para>
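<para>
A minimal sketch of this placement cycle, for illustration
only; this is not Cobalt's actual scheduler code, and the job
and partition dictionaries are assumed shapes.
</para>
<programlisting>
def place_jobs(idle_jobs, idle_partitions):
    """Match idle jobs to idle partitions, first fit."""
    placements = []
    for job in idle_jobs:
        for part in list(idle_partitions):
            if part['size'] >= job['nodes']:
                placements.append((job['jobid'], part['name']))
                idle_partitions.remove(part)
                break
    return placements
</programlisting>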
</listitem>
</varlistentry>
<varlistentry>
<term>Job Execution</term>
<listitem>
<para>
Once the queue manager receives a job-run command from the
scheduler, it can start the job on the specified
resources. At this point, the job state machine is
activated. This state machine can contain different steps
depending on the underlying architecture and which queue
manager features are enabled. For example, enabling
allocation management functionality causes jobs to run
several extra job steps before completion. These extra
steps will not be discussed here; our main focus is
generic job execution.
</para>
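<para>
As an illustration, a generic BG/L job passes through an
ordered list of named steps such as the following; the names
are taken from the log trace in the next section, and the
exact sequence depends on platform and enabled features.
</para>
<programlisting>
# steps the job state machine runs, in order
steps = ['SetBGKernel', 'RunBGUserJob', 'FinishUserPgrp', 'Finish']
</programlisting>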
</listitem>
</varlistentry>

<varlistentry>
<term>Process Group Execution</term>
<listitem>
<para>
The queueing system spawns some number of parallel
processes for each job. The execution, management, and
cleanup of these processes is handled by the process
manager. Like the queue manager, it emits a number of
events as process groups execute.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Process Group Cleanup</term>
<listitem>
<para>
Parallel process management semantics are much like Unix
process semantics. Processes can be started, signalled,
killed, and can exit of their own accord. As with Unix
processes, process groups must be reaped once they have
finished execution. At reap time, stdio and return codes
are available to the "parent" component.
</para>
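<para>
The Unix analogy, as a minimal sketch: a child process must
be reaped with waitpid() before its exit status becomes
available to the parent.
</para>
<programlisting>
import os

pid = os.fork()
if pid == 0:
    os._exit(3)                    # child exits with code 3
_, status = os.waitpid(pid, 0)     # parent reaps the child
print("exit code: %d" % os.WEXITSTATUS(status))
</programlisting>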
</listitem>
</varlistentry>
<varlistentry>
<term>Job Step Execution</term>
<listitem>
<para>
As the job executes, some number of process groups will be
run, resulting in several cycles of the previous two
steps. Note that process groups can be serial as well, so
steps like the job prologue and epilogue are executed in an
identical fashion.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Job Completion</term>
<listitem>
<para>
Once all steps have completed, the job is finished. Cleanup
consists of logging a usage summary, deleting the job from
the queue, and emitting an event. At this point, the job no
longer exists.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Scheduler Cleanup</term>
<listitem>
<para>
When the job no longer exists in the queue manager, the
scheduler flags it as exited and frees its execution
location. It then attempts to schedule idle jobs in this
location.
</para>
</listitem>
</varlistentry>
</variablelist>
</section>

<section>
<title>Job Log Trace</title>

<para>
The following is a set of example logs pertaining to a single
job.
</para>

<programlisting>
Jun 29 20:27:14 sn1 BGSched: Found new job 4719
Jun 29 20:27:14 sn1 BGSched: Scheduling job 4719 on partition R000_J108-32
Jun 29 20:27:14 sn1 cqm: Running job 4719 on R000_J108-32
Jun 29 20:27:14 sn1 cqm: running step SetBGKernel for job 4719
Jun 29 20:27:14 sn1 cqm: running step RunBGUserJob for job 4719
Jun 29 20:27:14 sn1 bgpm: ProcessGroup 84 Started on partition R000_J108-32. pid: 29368
Jun 29 20:27:16 sn1 bgpm: Running /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun mpirun
-np 32 -partition R000_J108-32 -mode co
-cwd /bgl/home1/adiga/alumina/surface/slab_30/1x1/300K/zerok
-exe /home/adiga/alumina/surface/slab_30/1x1/300K/zerok/DLPOLY.X
Jun 29 21:05:28 sn1 bgpm: ProcessGroup 84 Finshed. pid 29368
Jun 29 21:05:28 sn1 cqm: user completed for job 4719
Jun 29 21:05:28 sn1 cqm: running step FinishUserPgrp for job 4719
Jun 29 21:05:29 sn1 bgpm: Got wait-process-group from 10.0.0.1
Jun 29 21:05:29 sn1 cqm: running step Finish for job 4719
Jun 29 21:05:29 sn1 cqm: Job 4719/adiga on 32 nodes done. queue:9.18s user:2294.08s
Jun 29 21:05:35 sn1 BGSched: Job 4719 gone from qm
Jun 29 21:05:35 sn1 BGSched: Freeing partition R000_J108-32
Jun 29 21:28:37 sn1 BGSched: Found new job 4720
</programlisting>

<para>
In the event that this job ran out of time or was deleted with
<filename>cqdel</filename>, additional log messages would appear
to that effect.
</para>

</section>

<section>
<title>Job Accounting Messages</title>

<para>
Job accounting log messages are logged to files in the directory
specified by <filename>log_dir</filename> in the [cqm] section
of the config file. Basic messages are logged by the
queue-manager for job queueing (Q), execution (S), and exit
(E). Additional messages include the location where the job is
running and the exit code.
</para>

<programlisting>
Q;jobid;user;queue
S;jobid;user;job name;nodes;processors;mode;walltime
Job jobid/user/Q:queue: Running job on location
E;jobid;user;walltime
Job jobid/user on nodes nodes done. queue:[queuetime]s
user:[walltime]s exit:exitcode
</programlisting>
<para>
Example:
</para>
<programlisting>
2007-06-06 12:43:50 Q;59;bob;default
2007-06-06 12:44:56 S;59;bob;N/A;32;32;co;20
2007-06-06 12:44:56 Job 59/bob/4539/Q:default: Running job on 32_R000_J108_N3
2007-06-06 12:44:56 Job 59/bob using kernel default
2007-06-06 12:45:08 E;59;bob;14
2007-06-06 12:45:09 Job 59/bob on 32 nodes done. queue:65.88s user:11.98s exit:0
</programlisting>
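
<para>
The semicolon-delimited records are easy to consume
programmatically. The following is a minimal parsing sketch
based on the formats listed above; it is illustrative, not part
of Cobalt.
</para>
<programlisting>
def parse_record(line):
    """Parse one accounting line into a dict, or return None."""
    date, time, record = line.split(None, 2)
    fields = record.split(';')
    if fields[0] == 'Q':
        return {'event': 'queued', 'jobid': fields[1],
                'user': fields[2], 'queue': fields[3]}
    if fields[0] == 'S':
        return {'event': 'started', 'jobid': fields[1],
                'user': fields[2], 'walltime': fields[7]}
    if fields[0] == 'E':
        return {'event': 'exited', 'jobid': fields[1],
                'user': fields[2], 'walltime': fields[3]}
    return None   # free-form messages are skipped

print(parse_record('2007-06-06 12:45:08 E;59;bob;14'))
</programlisting>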

<para>
More details can be found in the log messages from the
scheduler, process-manager, and queue-manager. The scheduler
logs where the job is executing. The process-manager (bgpm) logs
the program executable and arguments, as well as the exit code
of the program. The queue-manager also logs statistics about the
job execution, such as the kernel used, the actual run time, and
the queue wait time.
</para>

<para>
Example:
</para>
<programlisting>
Dec 15 17:29:39 localhost bgsched[4760]: Job 25537/bob: Scheduling
job 25537 on partition 32wayN0
Dec 15 17:29:39 localhost cqm[4152]: Job 25537/bob using kernel default
Dec 15 17:29:39 localhost bgpm[4124]: Job 25537/bob: ProcessGroup 1
Started on partition 32wayN0. pid: 4220
Dec 15 17:29:39 localhost bgpm[4220]: Job 25537/bob: Running
/bgl/BlueLight/ppcfloor/bglsys/bin/mpirun mpirun -np 32 -partition
32wayN0 -mode co -cwd /home/bob/tests -exe /home/bob/tests/ring-hello
Dec 15 17:35:49 localhost bgpm[4124]: Job 25537/bob: ProcessGroup 1
Finshed with exit code 0. pid 4220
Dec 15 17:35:59 localhost cqm[4152]: Job 25537/bob on 32 nodes
done. queue:2.99s user:10.18s
</programlisting>
</section>

<section>
<title>Data Persistence</title>

<para>
Each Cobalt component has some data that must be persistent, and
all components store this data using a common mechanism
implemented in shared code. Periodically, each component
marshals its persistent data down to a text stream (using
Python's cPickle module) and saves it in a file in the directory
<filename>/var/spool/cobalt/</filename>. The filenames in this
directory correspond to the component implementation names;
these are the names that appear in syslog log messages (i.e.
cqm, bgpm, BGSched).
</para>
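
<para>
A minimal sketch of this save/restore cycle, for illustration
only; the function names and exact file handling are
assumptions, not Cobalt's internal API.
</para>
<programlisting>
import cPickle   # Python 2, as used by Cobalt

def save_state(component, state):
    f = open('/var/spool/cobalt/%s' % component, 'wb')
    cPickle.dump(state, f)
    f.close()

def restore_state(component):
    f = open('/var/spool/cobalt/%s' % component, 'rb')
    state = cPickle.load(f)
    f.close()
    return state
</programlisting>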
</section>

<section>
<title>Resource Manager Operations</title>

<para>This section describes basic operations of the Cobalt
resource manager, in terms of initial setup and long-term
operations. First we will describe how jobs flow through the
system, and then we will describe how administrators can shape
this process.</para>

<para>
Internally, the Cobalt queue manager (cqm) stores a set of
queues, each of which is associated with some number of
jobs. Each queue has a set (potentially empty) of submission
restrictions. These restrictions can limit a queue to jobs of
particular sizes, walltimes, or users. At submission time, jobs
are verified against all restrictions. If a job fails to pass
any of these, it is rejected, and the user is presented with an
error message describing the failure. Queues also have a state
that governs their behavior. The state "running" allows both
the submission and execution of jobs. The state "draining"
allows jobs to execute, but prevents submission of new jobs. The
state "stopped" allows jobs to be submitted, but does not allow
job execution. Finally, the state "dead" prevents both job
submission and execution.
</para>
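
<para>
A compact summary of the four queue states, as a sketch; this
dictionary is illustrative, not Cobalt's internal
representation.
</para>
<programlisting>
QUEUE_STATES = {
    #  state       may submit  may run
    'running':   (True,       True),
    'draining':  (False,      True),
    'stopped':   (True,       False),
    'dead':      (False,      False),
}

def can_submit(state):
    return QUEUE_STATES[state][0]

def can_run(state):
    return QUEUE_STATES[state][1]
</programlisting>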

<para>When setting up a new Cobalt instance, the default queue must
be created. This can be done with the command:</para>
<programlisting>
$ cqadm.py --addq default
</programlisting>

<para>
Restrictions can also be added using cqadm. Once the queues are
enabled, users can submit jobs.
</para>

<para>
Job execution is controlled by the scheduler, which makes
scheduling decisions based on several criteria.
</para>

<variablelist>
<varlistentry>
<term>Queue Status</term>
<listitem>
<para>The scheduler will only execute jobs from queues that
are in the "running" or "draining" states.</para>

<para>Queue status can be changed using the cqadm command.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Partition Status</term>
<listitem>
<para>Partitions have two flags that govern their
use. "Active" partitions are functional: the partition, and
all containing partitions that are also active, work
properly. If a partition is "inactive", then it and all
larger containing partitions will not be used.</para>

<para>Partitions also have a "scheduled" flag, which allows
administrators to control whether a partition should have
jobs scheduled on it. Typically, administrators toggle the
"scheduled" flag frequently, to govern the available machine
topology, and change the "active" flag only during system
faults.</para>

<para>Both partition status flags can be toggled using the
partadm command.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Partition Queue Setting</term>
<listitem>
<para>
Partitions are made available to one or more queues at any
given time. Only jobs from the specified queues can be run on
a partition.
</para>

<para>Partition queue settings can be set using the partadm command.</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Reservations</term>
<listitem>
<para>
Each reservation has a list of associated users, a list of
partitions, and an active period. During this period, only
users listed on the reservation can run on the reserved
resources. If multiple reservations are active
simultaneously, then a user must be listed on all of them in
order to consume resources.</para>
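<para>
A minimal sketch of this access rule; the data layout is an
assumption for illustration, not Cobalt's internal
representation.
</para>
<programlisting>
def user_may_run(user, active_reservations):
    """True if the user is listed on every currently active
    reservation (trivially true when none are active)."""
    return all(user in r['users'] for r in active_reservations)
</programlisting>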
<para>Reservations can be set, displayed, and released using
the commands setres, showres, and releaseres, respectively.</para>
</listitem>
</varlistentry>
</variablelist>

</section>
</chapter>