Installation Phases

The NT cluster will be built and tested in roughly four phases.

Phase I Getting it turned on

The first phase of building the NT cluster is to configure the majority of the nodes with NT 4.0 Workstation. In this phase we want to learn how to do basic administration of a cluster of NT workstations. Part of this process is developing automated installation tools for the nodes and learning how to use Microsoft tools, such as SMS, to maintain patch levels and software versions on the system. We are also trying to answer basic administration questions, such as how to monitor the status, load, and utilization of the machines and how to alert the administrator when problems occur.
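As a rough illustration of the monitoring and alerting question, the sketch below flags nodes that report a high load or that have stopped reporting. The node names, the sample format, and the thresholds are all hypothetical; this does not describe any tool actually used on the cluster.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeSample:
    """One status report from a node (hypothetical format)."""
    cpu_load: float      # fraction of CPU in use, 0.0 to 1.0
    age_seconds: float   # seconds since the node last reported


def find_alerts(samples: Dict[str, NodeSample],
                max_load: float = 0.95,
                max_age: float = 120.0) -> List[str]:
    """Return one alert message per node that looks unhealthy.

    A node that has gone silent is reported as silent only, since its
    last load figure is stale. The default thresholds are assumptions.
    """
    alerts = []
    for name in sorted(samples):
        sample = samples[name]
        if sample.age_seconds > max_age:
            alerts.append(f"{name}: no report for {sample.age_seconds:.0f} s")
        elif sample.cpu_load > max_load:
            alerts.append(f"{name}: CPU load {sample.cpu_load:.0%}")
    return alerts
```

How the samples are collected on each node, and how the resulting messages reach the administrator (mail, pager, console), are left open here.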

Some of the open issues in administration are:

At this point the machines are turned on, and 9 of the nodes are running NT 4.0 Workstation. In the process of building the machines we developed a way to boot the nodes from a single disk and complete most of the installation without an operator present at the console. Although Microsoft has an alternative solution for this, we feel that ours is easier to use and more configurable than the boot disks created with Microsoft's server manager. See our unattended install page for the distribution and for instructions on using and modifying the disk for your site. Also see the Administrators Log book for notes on our experiences with the cluster up to this point.

Phase II Developing the tools for the system

Once the system is up and running, the next phase is to develop and port tools that will enable the nodes to function as a parallel cluster of machines. Environments such as MPI, SIO-ADI, Nexus, CC++, PETSc, CAVElib, CAVEcomm, Voyager, BlockSolve, and others will need to be ported to NT. Other tools, such as job schedulers, will have to be developed from scratch. There are also opportunities to experiment with Microsoft tools for enabling parallel computation; possibilities include OLE, DCOM, and Microsoft's implementation of RPC.
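To make the job scheduler idea concrete, here is a minimal sketch of a first-come, first-served scheduler that hands whole nodes to queued jobs. The class, its methods, and the node names are hypothetical; it is an illustration of the simplest possible policy, not the design planned for the cluster.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class FifoScheduler:
    """First-come, first-served assignment of whole nodes to jobs."""

    def __init__(self, nodes: List[str]) -> None:
        self.free: List[str] = sorted(nodes)           # idle nodes
        self.queue: Deque[Tuple[str, int]] = deque()   # (job, nodes needed)
        self.running: Dict[str, List[str]] = {}        # job -> its nodes

    def submit(self, job: str, nodes_needed: int) -> None:
        """Add a job to the back of the queue."""
        self.queue.append((job, nodes_needed))

    def dispatch(self) -> List[str]:
        """Start queued jobs in order, stopping at the first that does not fit.

        Strict ordering means a large job at the head of the queue holds
        back smaller jobs behind it; a job that asks for more nodes than
        the cluster has would block the queue indefinitely.
        """
        started = []
        while self.queue and self.queue[0][1] <= len(self.free):
            job, count = self.queue.popleft()
            self.running[job] = self.free[:count]
            self.free = self.free[count:]
            started.append(job)
        return started

    def finish(self, job: str) -> None:
        """Return a finished job's nodes to the free pool."""
        self.free = sorted(self.free + self.running.pop(job))
```

A real scheduler would also need to launch processes on the nodes and detect failed jobs, and might let small jobs run ahead of a blocked large one; none of that is attempted here.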

Phase III Running user jobs on the system

When tools like MPI and Nexus have been ported to NT, we can start experimenting with running real jobs on the system. Argonne already has a large user base on the IBM SP, and it should not be difficult to find users who are willing to experiment with alternative architectures. Applications include any class of problems that can be solved efficiently on networked clusters of machines. The first set of applications will most likely be relatively small, due in part to the minimal hardware configuration of the current cluster. However, if it appears that NT will scale well, we plan to upgrade the system with additional resources to enable a larger class of applications.

Phase IV Performance analysis as compared to other types of systems

The last phase of the initial experiment will be to test the performance of the system in executing user applications. By this point we will probably already have some performance measurements from the tool development and porting phase. We plan to do additional performance studies with real applications and compare the results to other parallel environments, such as the IBM SP, and to other operating systems that run on Pentium-class machines (Linux, FreeBSD, etc.).
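One common way to put such comparisons on an equal footing is to report speedup and parallel efficiency for each system. The sketch below assumes wall-clock timings of the same application on one node and on several nodes; the function names are hypothetical, and this page does not say which metrics the study will actually use.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup: time on one node divided by time on several nodes."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, nodes: int) -> float:
    """Parallel efficiency: speedup divided by the number of nodes used.

    A value of 1.0 means perfect scaling; lower values indicate time lost
    to communication, load imbalance, or serial sections.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speedup(t_serial, t_parallel) / nodes
```

For example, an application that takes 120 seconds on one node and 20 seconds on 8 nodes (made-up numbers, not measurements) has a speedup of 6 and an efficiency of 0.75.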

Additional Work

If the NT cluster is a successful project, we can try additional experiments with scalability and performance. In addition to expanding the available hardware in the cluster, we can try faster commodity network hardware. Since we also have the IBM SP, it would be interesting to see whether the NT cluster can be integrated into the SP environment.