Node Build and Configuration Notebook


Sandia's Cluster Infrastructure

A brief description of what Sandia has developed for cluster infrastructure. 
 
Our target platform is a 5-10k node system. Currently we are completely
diskless, which influenced some of our bootable hierarchy decisions.
 
Our infrastructure is best described in a layered fashion. 
 
Class Hierarchy:
We have developed a "Class Hierarchy" that is used as a generic foundation
for any cluster that we wish to create. There are no limitations on the
cluster to be created other than that the devices it consists of must be
in the class hierarchy. If we wish to use a new device, we simply add it
to the class hierarchy. A "device" in this case is a generic term that
includes any type of node or computer, as well as any sort of support
equipment that is useful to describe: external power controllers for nodes
that lack that capability, terminal servers, switches, etc.
The methods of each device class describe the capabilities of the device.
Inheritance is used to leverage commonality among device types; for
example, the three types of Alpha nodes we have possess similar enough SRM
consoles to share common methods among all three device classes.
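As a concrete illustration, a minimal Python sketch of such a hierarchy
might look like the following. The class and method names are hypothetical,
chosen only to show the inheritance idea, not our actual code.

    # Minimal sketch of the kind of device class hierarchy described above.
    # Class and method names are illustrative only.

    class Device:
        """Generic base: anything the cluster contains (nodes, switches, ...)."""
        def __init__(self, name):
            self.name = name

        def power_on(self):
            raise NotImplementedError   # each concrete device supplies its own method

    class Node(Device):
        """Any bootable computer in the cluster."""
        def boot(self):
            raise NotImplementedError

    class AlphaNode(Node):
        """SRM console behavior shared by all of the Alpha node types."""
        def srm_command(self, cmd):
            print(f"{self.name}: SRM> {cmd}")   # stand-in for a real console session

        def boot(self):
            self.srm_command("boot")

    class AlphaDS10(AlphaNode):
        """One specific Alpha model; inherits the common SRM methods."""
        pass

    class TerminalServer(Device):
        """Support equipment: serves console connections for other devices."""
        def power_on(self):
            print(f"{self.name}: power on")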
 
Database Configuration Program:
Since it is assumed that each cluster will differ not only in what devices
it is made of, but also in how those devices are connected together, the
configuration program will be to some degree unique for each cluster. The
configuration program creates an instance of each device the cluster
consists of and also, very importantly, describes how the devices are
connected together (Ethernet, serial, power, etc.). This is important, for
example, for hierarchical systems to determine device dependencies, such
as which device boots off of which higher-level device, or which device
serves the console for a lower-level device.
The output of the configuration program is a persistent object store that
describes, in great detail, the cluster that is to be managed. We
generically call this object store the database.
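A per-cluster configuration program in this style might look roughly like
the sketch below. The trivial Device class, the attribute names, and the
use of Python's shelve module as a stand-in for the persistent object store
are all assumptions for illustration.

    # Rough sketch of a per-cluster configuration program.

    import shelve

    class Device:
        """Trivial stand-in for the device classes in the class hierarchy."""
        def __init__(self, name, **links):
            self.name = name
            self.__dict__.update(links)   # e.g. boot_server, console_server, power_port

    def configure_cluster(db_path="cluster.db"):
        with shelve.open(db_path) as db:
            admin = Device("admin-0")              # highest-level (admin/boot) node
            ts = Device("ts-0")                    # terminal server for consoles
            db[admin.name] = admin
            db[ts.name] = ts
            for i in range(4):                     # the nodes this cluster consists of
                node = Device(f"n{i}",
                              boot_server=admin.name,   # which device this node boots from
                              console_server=ts.name,   # which device serves its console
                              power_port=i)             # outlet on an external power controller
                db[node.name] = node

    if __name__ == "__main__":
        configure_cluster()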
 
Database:
The database holds an instance of each device that makes up the cluster,
along with linkages for any interdependencies that are determined to be
important. Each object holds whatever information is needed to communicate
with the device over serial and/or network connections, power the device,
or perform any other function that is necessary. Attributes can be
leveraged to describe things like the "role" of the device or any other
information that is thought to be useful. MAC addresses and IP addresses
are stored in this database for the purpose of generating configuration
files, along with any other information that is needed.
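For example, the stored MAC and IP attributes make it straightforward to
generate configuration file fragments directly from the database. The
sketch below assumes the shelve-based store and the hypothetical attribute
names (mac, ip, role) from the previous example.

    # Sketch of generating a configuration file fragment from the database.
    # The attribute names (mac, ip, role) are assumptions, not the real schema.

    import shelve

    def dhcp_entries(db_path="cluster.db", role="compute"):
        lines = []
        with shelve.open(db_path) as db:
            for name, dev in db.items():
                if getattr(dev, "role", None) != role:
                    continue                      # only emit entries for the requested role
                lines.append(f"host {name} {{ hardware ethernet {dev.mac};"
                             f" fixed-address {dev.ip}; }}")
        return "\n".join(lines)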
 
Tools:
Tools have been developed to satisfy the needs of cluster management. These
tools get all of the necessary information about the cluster from the
database and are portable from cluster to cluster without modification.
Basic functions that these tools provide are discovery (a process that
interrogates the device for information that needs to be stored in the
database and performs any necessary configuration of the device), power
control, booting, status, console (provides a console connection to any
device), an assortment of configuration tools, etc.
The tools are also layered, in that the lowest-level tools are usually the
ones that interface with the database and are intended to be orthogonal.
Higher-level tools leverage lower-level tools, and so on.
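The layering can be pictured roughly as in the sketch below: a low-level,
database-facing helper and a higher-level power tool built on top of it.
The function names and device attributes are hypothetical.

    # Sketch of the tool layering: a low-level database-facing helper and a
    # higher-level power tool built on it.

    import shelve
    import sys

    def get_device(name, db_path="cluster.db"):
        """Lowest layer: the only code that talks to the database directly."""
        with shelve.open(db_path) as db:
            return db[name]

    def power(name, action):
        """Higher layer: resolves the device, then acts through its recorded links."""
        dev = get_device(name)
        # A real tool would contact the power controller recorded for this device;
        # here we only show the dependency being resolved from the database.
        print(f"{action} {name} via power port {getattr(dev, 'power_port', '?')}")

    if __name__ == "__main__":
        power(sys.argv[1], sys.argv[2])   # e.g.  power n0 on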
 
Bootable Hierarchy:
The bootable hierarchy for the entire cluster is also generated from the
database, since the database describes all of the information necessary to
construct the appropriate infrastructure for booting. This bootable
hierarchy can take many forms. Currently our entire hierarchy is made up
of diskless nodes below the highest-level nodes, but this is not a
requirement. We leverage commonality by creating a single image of a root
filesystem taken from a stock Linux CD. The next level is what we call
"proto" directories, which either override what appears in the stock Linux
release or link through to it. Above the proto directories are the
specific node directories, which again can override what is below or link
through. Any number of levels can be generated based on the needs of the
cluster. We find that, for our systems, proto directories based on the
"role" of the node are the most useful. For example, the most common node
in a compute cluster is "compute"; therefore we have a compute role and a
compute proto directory. Most of the node directories link to the compute
proto directory, and a simple change in the compute proto directory
effects a change on all compute nodes.
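The override-or-link-through behavior amounts to a layered lookup, roughly
like the sketch below. The directory names are illustrative, not our
actual layout.

    # Sketch of the layered lookup: a file comes from the most specific layer
    # that overrides it, otherwise the lookup falls through toward the stock
    # root image.

    import os

    def resolve(path, node, role, root="/cluster"):
        layers = [
            os.path.join(root, "nodes", node),   # node-specific overrides
            os.path.join(root, "proto", role),   # role-based proto directory, e.g. proto/compute
            os.path.join(root, "stock"),         # unmodified root filesystem from the Linux CD
        ]
        for layer in layers:
            candidate = os.path.join(layer, path.lstrip("/"))
            if os.path.exists(candidate):
                return candidate                 # first layer that provides the file wins
        return None

    # e.g. resolve("etc/fstab", node="n12", role="compute")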
 
We have built more than eight clusters, all of different architectures,
with this infrastructure, and it has worked virtually unmodified. It
provides a standard interface for cluster management to our production
staff, which dramatically eases cluster management. Currently our largest
install is 1600+ nodes. We are integrating a cluster that consists of 1536
"center section nodes" that are switchable (may be connected) to any of 4
different "heads". Each head exists on a separate network, which is the
reason for the diskless implementation. Three of the heads consist of 256
nodes that provide I/O and serve as the users' interface to the cluster.
The other head is for development purposes and consists of 128 nodes that
can be connected to the center section when scalability tests must be
performed.
 
Let me know if I can describe any of these sections more fully. It is
difficult to relate everything in this brief manner.