A brief description of what Sandia has developed for cluster infrastructure. Our target platform is a 5-10k node system. Currently we are completely diskless, which influenced some of our bootable-hierarchy decisions. Our infrastructure is best described in a layered fashion.

Class Hierarchy: We have developed a "class hierarchy" that is used as a generic foundation for any cluster that we wish to create. There are no limitations on the cluster that will be created, other than that the devices it consists of must be in the class hierarchy; if we wish to use a new device, we simply add it to the class hierarchy. A "device" in this case is a generic term that includes any type of node or computer, as well as any support equipment that is useful to describe: external power controllers for nodes that lack that capability, terminal servers, switches, etc. The methods of each device class describe the capabilities of the device. Inheritance is used to leverage commonality among device types; for example, the three types of Alpha nodes we have possess similar enough SRM consoles to share common methods among all three device classes.

Database Configuration Program: Since it is assumed that each cluster will differ not only in what devices it is made of but also in how those devices are connected together, the configuration program will be to some degree unique for each cluster. The configuration program creates an instance of each device the cluster consists of and, very importantly, describes how the devices are connected together (Ethernet, serial, power, etc.). This is important, for example, in hierarchical systems, for determining device dependencies: which device boots off of which higher-level device, which device serves the console for a lower-level device, and so on. The output of the configuration program is a persistent object store that describes the cluster to be managed in great detail.
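The class hierarchy and configuration program described above might be sketched as follows. This is a minimal illustration, not Sandia's actual code: all class names, device types, and the use of `pickle` as the persistent object store are assumptions for the sake of the example.

```python
import pickle

class Device:
    """Generic device: any node, computer, terminal server, switch, or
    external power controller that is useful to describe."""
    def __init__(self, name):
        self.name = name
        self.connections = {}  # kind ("ethernet", "serial", "power") -> peer name

    def connect(self, kind, peer):
        """Record that this device is wired to `peer` via `kind`."""
        self.connections[kind] = peer.name

class Node(Device):
    def boot(self):
        return f"booting {self.name}"

class AlphaNode(Node):
    # Similar-enough SRM consoles let the Alpha variants share these methods.
    def srm_command(self, cmd):
        return f"{self.name} SRM> {cmd}"

class DS10Node(AlphaNode): pass   # hypothetical Alpha variants
class ES40Node(AlphaNode): pass

class TerminalServer(Device): pass
class PowerController(Device): pass

def configure_cluster():
    """Cluster-specific configuration program: instantiate each device and,
    very importantly, describe how the devices are connected together."""
    ts = TerminalServer("ts0")
    pc = PowerController("pc0")
    nodes = []
    for i in range(4):
        n = DS10Node(f"n{i}")
        n.connect("serial", ts)  # console served by the terminal server
        n.connect("power", pc)   # external power control for this node
        nodes.append(n)
    return {d.name: d for d in [ts, pc] + nodes}

cluster = configure_cluster()
# The configuration program's output is a persistent object store
# ("the database"); here that is simply the pickled instances.
blob = pickle.dumps(cluster)
```

Adding a new device type is then just a matter of adding another class to the hierarchy; the configuration program and everything downstream of the database are unaffected.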
We generically call this object store the database.

Database: The database holds an instance of each device that makes up the cluster, along with linkages for any interdependencies that are determined to be important. Each object holds whatever information is needed to communicate with the device over serial and/or network connections, power the device, or perform any other necessary function. Attributes can be leveraged to describe things like the "role" of the device or any other information thought to be useful. MAC addresses are stored in the database for the purpose of generating configuration files, as are IP addresses and any other necessary information.

Tools: Tools have been developed to satisfy the needs of cluster management. These tools get all of the necessary information about the cluster from the database and are portable from cluster to cluster without modification. Basic functions these tools provide are discovery (a process that interrogates a device for information that needs to be stored in the database and performs any necessary configuration of the device), power control, booting, status, console (a console connection to any device), an assortment of configuration tools, etc. The tools are also layered: the lowest-level tools are usually the ones that interface with the database and are intended to be orthogonal; higher-level tools leverage lower-level tools, and so on.

Bootable Hierarchy: The bootable hierarchy for the entire cluster is also generated from the database, since the database describes all of the information necessary to construct the appropriate boot infrastructure. This bootable hierarchy can take many forms; currently our entire hierarchy is made up of diskless nodes below the highest-level nodes, but this is not a requirement. We leverage commonality by creating a single image of a root filesystem taken from a stock Linux CD.
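The tool layering described above can be sketched briefly. Everything here is illustrative: the record layout, the "role" attribute, and the function names are assumptions, not the actual tool interfaces. The point is only that low-level tools are the sole layer touching the database, and higher-level tools compose them.

```python
class Device:
    """Stand-in for a database record; `role` is the kind of attribute
    the text describes (e.g. "compute", "admin")."""
    def __init__(self, name, role):
        self.name, self.role = name, role

    def power_on(self):
        # Lowest-level, device-specific primitive.
        return f"{self.name}: power on"

# Low-level tool: the only layer that interfaces with the database,
# intended to be orthogonal to the other low-level tools.
def lookup(db, role=None):
    return [d for d in db.values() if role is None or d.role == role]

# Higher-level tool: leverages the low-level tool. It is portable from
# cluster to cluster because everything cluster-specific is in the database.
def power_on_role(db, role):
    return [d.power_on() for d in lookup(db, role)]

# A tiny hypothetical database:
db = {n: Device(n, "compute" if n != "admin0" else "admin")
      for n in ["admin0", "n0", "n1"]}
```

With this split, a new cluster needs a new database but no changes to the tools themselves.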
The next level is what we call "proto" directories, which either override what appears in the stock Linux release or link through to it. Above the proto directories are the specific node directories, which again can override what is below or link through. Any number of levels can be generated based on the needs of the cluster. We find that for our systems, proto directories based on the "role" of the node are the most useful. For example, the most common node in a compute cluster is "compute"; therefore we have a compute role and a compute proto directory. Most of the node directories link to the compute proto directory, and a simple change in the compute proto directory effects a change on all compute nodes.

We have built more than 8 different clusters, all of different architectures, with this infrastructure, and it has worked virtually unmodified. It provides a standard interface for cluster management to our production staff, which dramatically eases cluster management. Currently our largest install is 1600+ nodes. We are integrating a cluster that consists of 1536 "center section" nodes that are switchable (may be connected) to any of 4 different "heads". Each head exists on a separate network (the reason for the diskless implementation). Three of the heads consist of 256 nodes that provide I/O and, from the user's perspective, the interface to the cluster. The other head is for development purposes and consists of 128 nodes that can be connected to the center section when scalability tests must be performed.

Let me know if I can describe any of these sections more fully; it is difficult to relate in this brief manner.
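P.S. The override-or-link-through lookup in the bootable hierarchy can be sketched as a first-hit search over layers: node directory first, then the role's proto directory, then the stock root image. The layer names, file names, and flat layout below are hypothetical; real proto directories would mirror the filesystem tree.

```python
import os
import tempfile

def resolve(path, layers):
    """Return `path` from the first layer that provides it, mimicking a node
    directory overriding its proto directory, which overrides the stock root."""
    for layer in layers:
        candidate = os.path.join(layer, path)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(path)

base = tempfile.mkdtemp()
root  = os.path.join(base, "root")           # stock Linux root image
proto = os.path.join(base, "proto/compute")  # role-based overrides
node  = os.path.join(base, "nodes/n0")       # node-specific overrides
for d in (root, proto, node):
    os.makedirs(d)

# A stock file, a compute-role override, and a node-specific override:
for path, text in [(os.path.join(root,  "etc_issue"),    "stock"),
                   (os.path.join(proto, "etc_fstab"),    "compute fstab"),
                   (os.path.join(node,  "etc_hostname"), "n0")]:
    with open(path, "w") as f:
        f.write(text)

layers = [node, proto, root]  # most specific first
```

A change to `proto/compute` is picked up by every node whose directory links through to it, which is why a single edit there effects a change on all compute nodes.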