The nug30 Computation

The nug30 Computational Pool

By using the flocking and glide-in mechanisms provided by Condor, we were able to bring together a computational pool consisting of 2510 processors from various locations and of varying characteristics. Table 1 shows the number and type of processors at each participating site.

Table 1: Computational Pool
Number	Arch/OS	Location
414	Intel/Linux	Argonne
96	SGI/Irix	Argonne
1024	SGI/Irix	NCSA
16	Intel/Linux	NCSA
45	SGI/Irix	NCSA
246	Intel/Linux	Wisconsin
146	Intel/Solaris	Wisconsin
133	Sun/Solaris	Wisconsin
190	Intel/Linux	Georgia Tech
94	Intel/Solaris	Georgia Tech
54	Intel/Linux	Italy (INFN)
25	Intel/Linux	New Mexico
12	Sun/Solaris	Northwestern
5	Intel/Linux	Columbia U.
10	Sun/Solaris	Columbia U.
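
As a quick cross-check of Table 1, the rows can be totaled by architecture and by site. The sketch below simply transcribes the table. The names `TABLE_1` and `totals_by` are ours, chosen for illustration, and are not part of any Condor or MW tool.

```python
from collections import defaultdict

# Rows of Table 1: (processor count, architecture/OS, location).
TABLE_1 = [
    (414, "Intel/Linux", "Argonne"),
    (96, "SGI/Irix", "Argonne"),
    (1024, "SGI/Irix", "NCSA"),
    (16, "Intel/Linux", "NCSA"),
    (45, "SGI/Irix", "NCSA"),
    (246, "Intel/Linux", "Wisconsin"),
    (146, "Intel/Solaris", "Wisconsin"),
    (133, "Sun/Solaris", "Wisconsin"),
    (190, "Intel/Linux", "Georgia Tech"),
    (94, "Intel/Solaris", "Georgia Tech"),
    (54, "Intel/Linux", "Italy (INFN)"),
    (25, "Intel/Linux", "New Mexico"),
    (12, "Sun/Solaris", "Northwestern"),
    (5, "Intel/Linux", "Columbia U."),
    (10, "Sun/Solaris", "Columbia U."),
]


def totals_by(rows, column):
    """Sum processor counts grouped by a column (1 = Arch/OS, 2 = Location)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[0]
    return dict(totals)


total_processors = sum(count for count, _, _ in TABLE_1)
by_arch = totals_by(TABLE_1, 1)
by_site = totals_by(TABLE_1, 2)

for arch, count in sorted(by_arch.items(), key=lambda kv: -kv[1]):
    print(f"{arch:15s} {count:5d}")
print(f"{'Total':15s} {total_processors:5d}")
```

The total agrees with the 2510 processors quoted above. These are processors that were eligible to join the pool, not processors that were all in use at once.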

Interesting facts about the participating machines:

The 1024 SGI/Irix processors at NCSA currently rank that supercomputer as the 52nd fastest in the world, according to Top500.org. However, this is a very heavily used machine, and we were able to acquire at most 41 processors at any one time.
The Linux machines at Argonne are part of the new Chiba City cluster.
The machines at Georgia Tech are part of the Interactive High Performance Computing Lab.
The computers in Italy are part of the Italian "Computational Grid" and, as such, were spread throughout the entire country: Rome, Bologna, Padova, Milan, and Naples.

Graphs of the nug30 computation

The Evolution of the nug30 Computation

On June 8, 2000 at 11:05:36 CDT, Jean-Pierre Goux and Jeff Linderoth started running the MW-QAP code on nug30, logging in remotely to Jeff's Personal Condor pool at the University of Wisconsin-Madison. The computation completed on June 15, at 21:20:07 CDT, after which there was much rejoicing.

The nug30 computation was stopped five times during the week for various reasons:

6/09 14:23:29 Job was manually stopped so that a new master process, capable of handling more than 1000 workers, could be used.
6/13 03:36:40 Job terminated due to a bug in the Condor schedd code (now fixed in the newest version of Condor).
6/13 14:12:51 Job was manually stopped at the request of the master machine's system administrator. The machine needed to be rebooted to fix (unrelated) NFS problems that were affecting other users.
6/14 19:23:46 Job terminated due to a bug in the Condor schedd code (the same bug that caused the second termination).
6/15 11:27:42 Job was manually stopped to quickly remove a number of worker submissions that had been made incorrectly. The incorrect submissions were the result of Jeff incorrectly editing a configuration file.

After each termination, the computation was restarted from the most recent checkpoint. Checkpoints were taken every 15 minutes during the run, so at most 15 minutes of computation time was lost per stoppage. Although no one is happy when bugs are present or human error occurs, these things will happen. Adding program robustness (in the form of checkpointing) was critical to the success of solving nug30.
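
The idea of periodic checkpointing can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the checkpointing code used in the nug30 run: the function names, the `pickle` file format, and the `{"pending", "done"}` state layout are all hypothetical, and squaring a number stands in for real work on a subproblem.

```python
import os
import pickle
import tempfile
import time

CHECKPOINT_INTERVAL = 15 * 60  # seconds; the 15-minute interval quoted above


def save_checkpoint(state, path):
    """Write state to a temporary file, then atomically rename it over path.

    A crash during the write leaves the previous checkpoint untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, work_items, interval=CHECKPOINT_INTERVAL, clock=time.monotonic):
    """Process work items, resuming from a checkpoint if one is present."""
    state = load_checkpoint(path, {"pending": list(work_items), "done": []})
    last_saved = clock()
    while state["pending"]:
        item = state["pending"].pop()
        state["done"].append(item * item)  # placeholder for real work
        if clock() - last_saved >= interval:
            save_checkpoint(state, path)
            last_saved = clock()
    save_checkpoint(state, path)  # final state, so a rerun does no extra work
    return state
```

Calling `run("nug.ckpt", range(1000))` a second time after an interruption picks up the saved `pending` list and ignores the `work_items` argument. Writing to a temporary file and renaming it means a failure in the middle of a save cannot corrupt the last good checkpoint.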

nug30 Computation Statistics

The optimal solution to the nug30 QAP instance is: 14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23
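
For readers unfamiliar with the QAP, the sketch below shows how a solution of this form is scored. It assumes the common convention that position i of the list gives the (1-based) location assigned to facility i; this page does not state which convention applies to the list above. The nug30 flow and distance data are not reproduced here, so the 3x3 matrices `TOY_FLOW` and `TOY_DIST` are made up purely for illustration.

```python
import math
from itertools import permutations

NUG30_SOLUTION = [14, 5, 28, 24, 1, 3, 16, 15, 10, 9, 21, 2, 4, 29, 25,
                  22, 13, 26, 17, 30, 6, 20, 19, 8, 18, 7, 27, 12, 11, 23]


def is_permutation(perm, n):
    """True if perm contains each of 1..n exactly once."""
    return sorted(perm) == list(range(1, n + 1))


def qap_cost(flow, dist, perm):
    """Sum of flow[i][j] * dist[loc(i)][loc(j)] over all ordered pairs (i, j).

    perm[i] is the 1-based location of facility i (an assumed convention).
    """
    n = len(perm)
    return sum(
        flow[i][j] * dist[perm[i] - 1][perm[j] - 1]
        for i in range(n)
        for j in range(n)
    )


# Hypothetical toy instance: three facilities, three locations on a line.
TOY_FLOW = [[0, 5, 2], [5, 0, 3], [2, 3, 0]]
TOY_DIST = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

# Brute force is fine for n = 3; for n = 30 there are 30! orderings,
# which is why a branch and bound search is needed instead.
best = min(permutations(range(1, 4)),
           key=lambda p: qap_cost(TOY_FLOW, TOY_DIST, p))

print(is_permutation(NUG30_SOLUTION, 30))
print(best, qap_cost(TOY_FLOW, TOY_DIST, best))
print(math.factorial(30))
```

The first check only confirms that the list above is a valid assignment of 30 distinct locations; verifying its cost would require the actual nug30 data.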

In order to prove the optimality of this solution, 11,892,208,412 nodes of a branch and bound tree were explored. Solving the associated node subproblems and computing the branching information required 574,254,156,532 Frank-Wolfe iterations.

On average, there were 653 machines participating in the computation, with a maximum of 1009. One of the most remarkable features of the run was that almost 1 million linear assignment problems (LAPs) were solved each second during the course of the run, since one LAP must be solved for each Frank-Wolfe iteration. Table 2 shows a number of other interesting statistics about the nug30 run and the computational pool. The machine speeds have been normalized to an HP-C3000 workstation by comparing the time required for each participating machine to compute the same portion of the branch and bound tree. Thus the "average" machine used in the nug30 computation was 56% as fast as an HP-C3000.
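
The "almost 1 million LAPs per second" figure follows directly from the counts reported on this page. The short calculation below is our own arithmetic, assuming the rate is measured against the running wall clock time of 597,872 seconds listed in Table 2; the variable names are illustrative.

```python
# Figures quoted on this page.
FRANK_WOLFE_ITERATIONS = 574_254_156_532  # one LAP per iteration
NODES_EXPLORED = 11_892_208_412
WALL_CLOCK_SECONDS = 597_872              # running wall clock time, Table 2

laps_per_second = FRANK_WOLFE_ITERATIONS / WALL_CLOCK_SECONDS
nodes_per_second = NODES_EXPLORED / WALL_CLOCK_SECONDS
iterations_per_node = FRANK_WOLFE_ITERATIONS / NODES_EXPLORED

print(f"LAPs solved per second:        {laps_per_second:,.0f}")
print(f"Nodes explored per second:     {nodes_per_second:,.0f}")
print(f"Frank-Wolfe iterations / node: {iterations_per_node:.1f}")
```

The same division shows that each branch and bound node cost a little under 50 Frank-Wolfe iterations on average.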

Table 2: nug30 Run Statistics
Statistic	Value
Average number of available workers	652.7
Maximum number of available workers	1009
Running wall clock time (sec)	597,872
Total CPU time (sec)	346,640,860
Average machine speed	0.560
Minimum machine speed	0.045
Maximum machine speed	1.074
Equivalent CPU time (sec) on an HP-C3000	218,823,577
Parallel Efficiency	93%
Number of times a machine joined the computation	19,063
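
To put the figures in Table 2 on a more familiar scale, the sketch below converts them to days and CPU-years and compares the running wall clock time with the calendar time between the start and end timestamps given earlier. Reading the difference as time spent stopped between restarts is our interpretation, not a figure reported by the run, and the constant names are ours.

```python
from datetime import datetime

# Start and end of the run as given on this page (both CDT).
START = datetime(2000, 6, 8, 11, 5, 36)
END = datetime(2000, 6, 15, 21, 20, 7)

RUNNING_WALL_CLOCK = 597_872   # seconds, from Table 2
TOTAL_CPU = 346_640_860        # seconds, from Table 2

SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

elapsed = int((END - START).total_seconds())
not_running = elapsed - RUNNING_WALL_CLOCK  # assumed: time between restarts

running_days = RUNNING_WALL_CLOCK / SECONDS_PER_DAY
cpu_years = TOTAL_CPU / SECONDS_PER_YEAR

print(f"Calendar time, start to finish: {elapsed:,} s")
print(f"Running wall clock time:        {running_days:.2f} days")
print(f"Time not running (assumed):     {not_running / 3600:.1f} hours")
print(f"Total CPU time:                 {cpu_years:.1f} CPU-years")
```

In other words, roughly a week of wall clock time delivered on the order of a decade of CPU time.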

Table 3 shows the percentage of the work done at each participating location.

Table 3: Percentage of Work Done at Each Location
Location	Percentage
Argonne	42.27
Wisconsin	33.69
Georgia Tech	11.90
INFN	5.65
NCSA	2.74
New Mexico	1.42
Columbia	1.23
Northwestern	1.10

Table 4 shows the percentage of the work done by machines of each operating system and architecture type.

Table 4: Percentage of Work Done by Each Architecture/Operating System
Arch/OS	Percentage
Intel/Linux	79.57
SGI/Irix	8.76
Sun/Solaris	6.17
Intel/Solaris	5.50

Some Historic Photos

metaneos@mcs.anl.gov

Last modified: Mon Jul 3 23:17:42 CDT 2000