By using the flocking and glide-in mechanisms provided by Condor, we were able to bring together a computational pool consisting of 2510 processors from various locations and of varying characteristics. Table 1 shows the number and type of processors at each participating site.
Number | Arch/OS | Location |
---|---|---|
414 | Intel/Linux | Argonne |
96 | SGI/Irix | Argonne |
1024 | SGI/Irix | NCSA |
16 | Intel/Linux | NCSA |
45 | SGI/Irix | NCSA |
246 | Intel/Linux | Wisconsin |
146 | Intel/Solaris | Wisconsin |
133 | Sun/Solaris | Wisconsin |
190 | Intel/Linux | Georgia Tech |
94 | Intel/Solaris | Georgia Tech |
54 | Intel/Linux | Italy (INFN) |
25 | Intel/Linux | New Mexico |
12 | Sun/Solaris | Northwestern |
5 | Intel/Linux | Columbia U. |
10 | Sun/Solaris | Columbia U. |
Interesting facts about the participating machines:
On June 8, 2000 at 11:05:36 CDT, Jean-Pierre Goux and Jeff Linderoth started running the MW-QAP code on nug30, logging in remotely to Jeff's Personal Condor pool at the University of Wisconsin-Madison. The computation completed on June 15, at 21:20:07 CDT, after which there was much rejoicing.
The nug30 computation was stopped five times during the week for various reasons:
After each termination, the computation was restarted from the most recent checkpoint. Checkpoints were taken every 15 minutes during the run, so at most 15 minutes of computation was lost at each restart. Although no one is happy when bugs are present or human error occurs, these things will happen. Adding robustness to the program, in the form of checkpointing, was critical to the success of solving nug30.
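The checkpoint-and-restart pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MW's actual checkpoint mechanism: the master's state is assumed to be a JSON-serializable queue of pending tasks plus a completion counter, and the names `save_checkpoint`, `load_checkpoint`, and `run_master` are hypothetical. Writing to a temporary file and then renaming it over the old checkpoint means a crash during the write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile
import time

CHECKPOINT_INTERVAL_S = 15 * 60  # 15 minutes, the interval used in the nug30 run


def save_checkpoint(state, path):
    """Write state atomically so a crash mid-write never corrupts the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach disk before the rename
    os.replace(tmp_path, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path, default):
    """Return the saved state if a checkpoint exists, otherwise the default state."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return default


def run_master(path, initial_tasks, process_task, interval=CHECKPOINT_INTERVAL_S):
    """Hypothetical master loop: resumes from a checkpoint if one is present.

    process_task(task) returns a list of new tasks (e.g. child nodes of a
    branch-and-bound node); the state is saved every `interval` seconds.
    """
    state = load_checkpoint(path, {"pending": list(initial_tasks), "done": 0})
    last_save = time.monotonic()
    while state["pending"]:
        task = state["pending"].pop()
        state["pending"].extend(process_task(task))
        state["done"] += 1
        if time.monotonic() - last_save >= interval:
            save_checkpoint(state, path)
            last_save = time.monotonic()
    save_checkpoint(state, path)  # final state once the queue is empty
    return state
```

On restart, `run_master` ignores `initial_tasks` and picks up the saved queue, so only the work done since the last save is repeated.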
In order to prove the optimality of this solution, 11,892,208,412 nodes of a branch and bound tree were explored. Solving the associated node subproblems and computing the branching information required 574,254,156,532 Frank-Wolfe iterations.
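As a check on the scale of the run, the node and iteration counts can be combined with the running wall-clock and CPU times listed in Table 2 and with the start and finish times of the run. The short script below does only that arithmetic; the one assumption beyond the reported figures is a 365-day year for the CPU-years conversion. The roughly 12-hour gap between elapsed calendar time and running time presumably reflects the periods when the computation was stopped, but that is an inference rather than a reported figure.

```python
from datetime import datetime

# Figures reported for the nug30 run.
NODES = 11_892_208_412            # branch-and-bound nodes explored
FW_ITERATIONS = 574_254_156_532   # Frank-Wolfe iterations (one LAP solved per iteration)
RUNNING_WALL_S = 597_872          # running wall-clock time in seconds (Table 2)
TOTAL_CPU_S = 346_640_860         # total CPU time in seconds (Table 2)

iters_per_node = FW_ITERATIONS / NODES            # about 48.3
laps_per_second = FW_ITERATIONS / RUNNING_WALL_S  # about 960,497
cpu_years = TOTAL_CPU_S / (365 * 24 * 3600)       # about 10.99, assuming 365-day years

# Start and finish times of the run; both are CDT, so no time-zone conversion is needed.
start = datetime(2000, 6, 8, 11, 5, 36)
finish = datetime(2000, 6, 15, 21, 20, 7)
elapsed_s = (finish - start).total_seconds()      # 641,671 s of calendar time
not_running_s = elapsed_s - RUNNING_WALL_S        # 43,799 s, roughly 12.2 hours

print(f"Frank-Wolfe iterations per node: {iters_per_node:.1f}")
print(f"LAPs solved per second:          {laps_per_second:,.0f}")
print(f"Total CPU time:                  {cpu_years:.2f} years")
print(f"Calendar time not spent running: {not_running_s / 3600:.1f} h")
```

The rate of about 960,000 LAPs per second is where the "almost 1 million" figure comes from, and the total CPU time corresponds to roughly 11 years on a single processor.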
On average, 653 machines participated in the computation, with a maximum of 1009. One of the most remarkable features of the run was that almost 1 million linear assignment problems (LAPs) were solved each second. (One LAP must be solved for each Frank-Wolfe iteration.) Table 2 shows a number of other interesting statistics about the nug30 run and the computational pool. The machine speeds have been normalized to an HP-C3000 workstation by comparing the time each participating machine required to compute the same portion of the branch-and-bound tree. (Thus the "average" machine used in the nug30 computation was 56% as fast as an HP-C3000.)
Statistic | Value |
---|---|
Average number of available workers | 652.7 |
Maximum number of available workers | 1009 |
Running wall-clock time (sec) | 597,872 |
Total CPU time (sec) | 346,640,860 |
Average machine speed | 0.560 |
Minimum machine speed | 0.045 |
Maximum machine speed | 1.074 |
Equivalent CPU time (sec) on an HP-C3000 | 218,823,577 |
Parallel efficiency | 93% |
Number of times a machine joined the computation | 19,063 |
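The speed normalization amounts to dividing the reference machine's time on a common benchmark subtree by each machine's time on that same subtree. The sketch below assumes that definition, and further assumes that the equivalent HP-C3000 CPU time is the sum of each machine's CPU time scaled by its relative speed. The `Machine` class, its fields, and the benchmark times in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str           # hypothetical machine label
    benchmark_s: float  # time to compute the common benchmark portion of the tree
    cpu_s: float        # CPU time the machine contributed to the run


def relative_speed(reference_benchmark_s: float, m: Machine) -> float:
    """Speed relative to the reference machine: 1.0 = as fast, 0.5 = half as fast."""
    return reference_benchmark_s / m.benchmark_s


def equivalent_reference_cpu_s(reference_benchmark_s: float, machines: list) -> float:
    """CPU time the reference machine would need for the same total work (assumed definition)."""
    return sum(m.cpu_s * relative_speed(reference_benchmark_s, m) for m in machines)


def mean_speed(reference_benchmark_s: float, machines: list) -> float:
    """Unweighted mean of the relative speeds, one term per machine."""
    return sum(relative_speed(reference_benchmark_s, m) for m in machines) / len(machines)


if __name__ == "__main__":
    REF_S = 100.0  # hypothetical HP-C3000 time on the benchmark portion, seconds
    pool = [Machine("slow-linux", 200.0, 1000.0), Machine("fast-irix", 80.0, 400.0)]
    print(equivalent_reference_cpu_s(REF_S, pool))  # 1000.0
    print(mean_speed(REF_S, pool))                  # 0.875
```

In the example the unweighted mean speed is 0.875, while the CPU-time-weighted speed (equivalent time divided by total CPU time) is 1000 / 1400, about 0.71, so the two kinds of average can differ noticeably. If Table 2's equivalent time was computed this way, the weighted ratio for nug30 is 218,823,577 / 346,640,860, about 0.63, which suggests the 0.560 average speed is a per-machine average rather than one weighted by CPU time.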
Table 3 shows the percentage of the work done at each participating location.
Location | Work done (%) |
---|---|
Argonne | 42.27 |
Wisconsin | 33.69 |
Georgia Tech | 11.90 |
Italy (INFN) | 5.65 |
NCSA | 2.74 |
New Mexico | 1.42 |
Columbia U. | 1.23 |
Northwestern | 1.10 |
Table 4 shows the percentage of the work done by machines of each operating system and architecture type.
Arch/OS | Work done (%) |
---|---|
Intel/Linux | 79.57 |
SGI/Irix | 8.76 |
Sun/Solaris | 6.17 |
Intel/Solaris | 5.50 |