Our goal is to make this cluster as modular and scalable as possible, so it will not just be one monolithic 1024-node cluster. We'll call the cluster GAEA and break it into parts as follows:
As you can see there are two types of GENUSES, COMPUTE and THOUGHT. The difference is that the COMPUTE GENUS is composed entirely of COMPUTE SPECIES, while the THOUGHT GENUS contains the THOUGHT SPECIES needed to power the FAMILY, eleven COMPUTE SPECIES, and the network gear needed to run the FAMILY. Now that you know what the basic building blocks are, we'll quickly run through the rest of the classifications:
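The classification hierarchy can be sketched in code. This is a minimal model, not part of the actual cluster software: the class and field names are our own, and the counts (31 COMPUTE SPECIES plus 1 THOUGHT SPECIES per FAMILY, filling a 32-port switch) are taken from the figures given later in this section.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Species:
    """One node in the cluster; kind is 'COMPUTE' or 'THOUGHT'."""
    name: str
    kind: str

@dataclass
class Family:
    """A FAMILY: the sub-cluster unit that jobs are scheduled into."""
    number: int
    species: List[Species] = field(default_factory=list)

    @property
    def compute(self) -> List[Species]:
        return [s for s in self.species if s.kind == "COMPUTE"]

    @property
    def thought(self) -> List[Species]:
        return [s for s in self.species if s.kind == "THOUGHT"]

def build_family(number: int, computes: int = 31) -> Family:
    # One THOUGHT SPECIES plus 31 COMPUTE SPECIES = 32 nodes,
    # matching the 32-port ethernet switch per FAMILY.
    fam = Family(number)
    fam.species.append(Species(f"thought{number}", "THOUGHT"))
    for i in range(computes):
        fam.species.append(Species(f"node{number}-{i}", "COMPUTE"))
    return fam

fam = build_family(1)
assert len(fam.species) == 32  # fills one 32-port switch
```

The hostname scheme (`thought1`, `node1-0`, ...) is purely illustrative.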
This diagram shows only 8 COMPUTE SPECIES and 1 THOUGHT SPECIES, but the layout holds for all SPECIES in the FAMILY. All the SPECIES in the FAMILY are connected to one 32-port ethernet switch, and all are connected to a 32-port serial expander (actually two 16-port serial hubs linked together). The THOUGHT SPECIES is connected to the serial expander via the ethernet uplink so that it acts as a console server for all the COMPUTE SPECIES.
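The console-server role amounts to a mapping between serial-expander ports and the COMPUTE SPECIES whose consoles they carry. A sketch of that mapping, with entirely hypothetical port numbering and node names (a real deployment would configure something like a console-server daemon with equivalent information):

```python
# Two 16-port serial hubs linked together give 32 expander ports per
# FAMILY. The port assignment below (port i -> compute node i) is an
# illustrative assumption, not the documented wiring.
NUM_PORTS = 32

def port_map(family: int, computes: int = 31) -> dict:
    """Map expander port number -> compute node whose console it carries."""
    return {port: f"node{family}-{port}" for port in range(computes)}

consoles = port_map(1)
# e.g. to reach the console of node1-5, the THOUGHT SPECIES opens
# expander port 5 on FAMILY 1's serial expander.
```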
The next diagram shows how the FAMILIES are connected together to form the complete cluster:
The previous two diagrams are the building blocks for our cluster. This approach breaks the cluster down into sub-clusters, and job handling will be done on a per-FAMILY basis. The theory is that 31 COMPUTE SPECIES should be adequate for most jobs; if a job actually needs more than that, it takes a performance hit because it must leave the FAMILY. Optimal job times will be achieved when a job runs strictly within a single FAMILY. The THOUGHT SPECIES in each FAMILY will handle job scheduling for the FAMILY, handle inter-FAMILY communication, act as console server for the FAMILY, provide the FAMILY's distributed filesystem/RAID service, handle other FAMILY-specific services (NIS/YP, NFS, NTP, etc.), and be integral to the installation process.
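The placement policy described above can be sketched as follows: prefer the tightest single FAMILY that can hold the whole job, and only spill across FAMILIES (accepting the performance hit) when no single FAMILY can. This is our own illustration of the policy, not the actual scheduler; the function and its spill heuristic are assumptions.

```python
def place_job(nodes_needed: int, free_nodes: dict):
    """free_nodes maps family_id -> count of free COMPUTE SPECIES.
    Returns (placement dict family_id -> nodes used, spans_families flag)."""
    # First choice: the smallest single FAMILY that fits the whole job,
    # which keeps large FAMILIES free for large jobs.
    fits = [f for f, free in free_nodes.items() if free >= nodes_needed]
    if fits:
        best = min(fits, key=lambda f: free_nodes[f])
        return {best: nodes_needed}, False
    # Otherwise spill across FAMILIES, largest free pools first, and
    # accept the inter-FAMILY communication penalty.
    placement, remaining = {}, nodes_needed
    for f, free in sorted(free_nodes.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            placement[f] = take
            remaining -= take
    if remaining:
        raise RuntimeError("not enough free COMPUTE SPECIES in the cluster")
    return placement, True

# A 20-node job fits inside one FAMILY; a 40-node job must span FAMILIES.
```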