I/O nodes location: the location of the I/O node(s) responsible for
@@ -93,7 +92,7 @@
The command is currently integrated into the busybox, i.e. it is
not a standalone binary. It is not intended for execution by users; in
fact, it requires super-user privileges to authenticate with the daemon (by
-means of privileged TCP port number).
+means of a privileged TCP port number).
@@ -114,13 +113,17 @@
Due to the limitations of the upstream API, the information is only
available once the partition is fully booted, and polling is used to obtain
it. For short-running jobs (less than 60 seconds), zinfo could potentially
-miss the fact that a job ran at all.
+miss the fact that a job ran at all. However, after at most a minute it will
+notice that the job is no longer in the queue, and terminate with an error
+message.
For user's convenience, a job can be specified by providing either
its Cobalt job ID (if Cobalt scheduler is being used), its BlueGene job ID,
-or the block/partition ID.
+or the block/partition ID. The use of block/partition IDs is in general
+not recommended, as they are not unique across subsequent jobs, and so could
+lead to race conditions.
@@ -159,46 +162,43 @@
mail -s "Job $cjob_id finished" $USER </dev/null
-
Security
+zinfo_ping
-Basic precautions to take:
+Another command that is available is zinfo_ping. It is meant for
+use by a system administrator as a heartbeat monitor. It tries to connect
+to the daemon and perform the initial protocol handshake. The supplied sample
+startup script supports a verify argument, which causes
+zinfo_ping to be invoked and, if it cannot communicate with
+zinfod, restarts the daemon.
-
--
-daemon port: ensure that the listening TCP/IP socket of the daemon running
-on the service node is not accessible from outside of the BlueGene machine.
-
-Rationale: the communication protocol between zinfod and
-the other two tools does not involve any authentication/authorization
-steps. The communication should thus be made impossible for those without
-authorization to be on the BlueGene machine at all.
-
+Notes
+
-
-daemon process: do not run it as root. Run it as an unprivileged
-user.
-
-Rationale: the daemon process does not need any elevated
-privileges for proper operation.
+As currently implemented, the zinfo infrastructure assumes that I/O nodes are
+rebooted every time a new job is started, which is typically the case with
+driver version 202. With the improvements made by IBM in driver release 1, it
+is no longer necessary to reboot so often. We will try to address this issue
+once our site is upgraded to driver release 1.
-
-
-Threats:
-
-
-
-
-if the daemon process dies for some reason, a user might be able to replace
-it with his/her own version, so modified that it sends false authorization
-data in response to startup notifications, allowing such user to obtain the
-identity of other users (maybe even root) on I/O nodes. One way to prevent
-that is to ban untrusted users from the service node (a reasonable
-precaution under any circumstances), another would be to change the
-listening port number to a privileged one (<1024), but then you need to
-start the daemon as root and make it drop privileges later.
+Running the zinfo daemon with root privileges is not advised. It doesn't
+need them, and doing so would be a security risk. The daemon provides support
+for dropping super-user privileges once it's initialized. Make use of this
+feature.
+
+-
+A related threat has to do with the daemon port. It is recommended to make
+it inaccessible from outside of the BlueGene machine, using a firewall,
+because the communication protocol does not use reliable
+authentication/authorization steps. Further, it might be a good idea to
+change the port number to a privileged one (<1024), so that, in the
+unlikely event of the daemon dying, an unprivileged user could not spawn
+a malicious replacement (this is only an issue if ordinary users are
+allowed to log on to the service node).
@@ -212,7 +212,7 @@
-You need to specify a mode of operations first: --mode jobs or
+You need to specify a mode of operation first: --mode jobs or
--mode blocks. A more convenient way to do that is to invoke one
of the zbgl-listjobs and zbgl-listblocks commands, which
are simply links to zbgl-list.
@@ -248,7 +248,7 @@
Job Id: 20266
C Job Id: 10580
User: joe
-Block Id: R000_J104-32 (Row: %row Rack: 0 Midplanes: bottom)
+Block Id: R000_J104-32 (Row: 0 Rack: 0 Midplanes: bottom)
Job Name: mpirun.24536.sn1
Status: Running
Started: 2005-09-06 11:16:12