Installation Notes for SCMS 2.2 and BeowulfBuilder for RedHat 8.0


Here I provide step-by-step instructions for building a test diskless Beowulf cluster using BeowulfBuilder and SCMS 2.2 on RedHat 8.0. I have not got it working yet, but I am very close.

The frontend and compute nodes use dual-CPU Tyan motherboards with 2.4 GHz Xeon CPUs. Each motherboard has one Fast Ethernet and one Gigabit Ethernet port on board. For the moment we have disabled the Gigabit ports. Hence "eth0" is the Fast Ethernet interface on the slave nodes (i.e. compute1, compute2, etc.). On the master (i.e. the frontend), eth0 is for external communication and eth1 is for internal networking.
All motherboards are set to boot via PXE from the onboard eth0. The slave nodes are diskless: they have only CPUs, memory, and the onboard eth0 network card.

1. Linux RedHat 8.0 Installation

We first installed Linux RedHat 8.0 on the master (i.e. the frontend). We chose a custom installation, installed everything, and selected the "no firewall" option.

2. Installation of SCMS

We downloaded the SCMS 2.2 RPM packages (the lib*, python, and scms packages) from
http://www.opensce.org/moin. We put all these files in one directory and then installed them all at once (i.e. rpm -ivh lib*.rpm pyth*.rpm scms*.rpm).
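
To double-check that everything actually registered with the RPM database, a quick query along these lines can help (just a suggestion of ours, not part of the original procedure):
rpm -qa | grep -i scms       # list the installed SCMS packages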

3. Installation of BeowulfBuilder

After SCMS was installed, we installed BeowulfBuilder. The order of installation, I think, is important; otherwise you may need to rebuild your cluster so that the /etc/sce directory is included on your slave nodes...

STEP 1.

We downloaded the latest version of BeowulfBuilder, beowulfbuilder-2.7-9.8.x.i386.rpm, and simply installed it.
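
The install itself is a single rpm command (shown for completeness; adjust the filename if you grabbed a newer build):
rpm -ivh beowulfbuilder-2.7-9.8.x.i386.rpm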

STEP 2.

We next set up the configuration files. Here is our modified beowulfbuilder.conf.template file:
SERVER_NETWORK="172.16.0.0"
SERVER_IP="172.16.0.1"
SERVER_NETMASK="255.255.0.0"
CLIENT_NETWORK="172.16.0.0"
CLIENT_NETID="172.16"
CLIENT_HOSTIP_RANGE="172.16.255.2 172.16.255.254"
CLIENT_NETMASK="255.255.0.0"
CLIENT_BROADCAST="172.16.255.255"
CLIENT_HOSTNAME="node"
DOMAINNAME="ncnr.nist.gov"
CLIENT_DOMAINNAME_SERVER="ncnr.nist.gov"
CLUSTER_NAME="tancluster"
SERVER_INT_NAME="googoo"
NISDOMAIN="ncnr.nist.gov"
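
Before running bbuilder, it is worth confirming that the master's internal interface really carries the SERVER_IP given above; in our setup that interface is eth1 (this check is our own addition, not part of the BeowulfBuilder instructions):
ifconfig eth1        # should show inet addr:172.16.0.1 and Mask:255.255.0.0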

We next modified the /etc/share/beowulfbuilder-2.7/bbconf file to increase the size of the ramdisk; the relevant setting is DEFAULTRAMDISKSIZE, which we set to 393216 (384 MB, assuming the value is given in 1 KB blocks):
TFTPBOOT_DIR=/tftpboot
DBETHER="$TFTPBOOT_DIR/DBETHER"
CONFIG_FILE="$TFTPBOOT_DIR/beowulfbuilder.conf"
PXECFGDIR="pxelinux.cfg"
GRUBCFGDIR="grub.cfg"
DEFAULTRAMDISKNAME=rootfs.gz
DEFAULTRAMDISKSIZE=393216
DEFAULTKERNELNAME=vmlinuz
RAMDISKSUFFIX=root.gz
KERNSUFFIX=kernel
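
Later, once a slave node has booted, you can check from the master whether the larger ramdisk actually took effect (our own sanity check, not part of the BeowulfBuilder instructions):
rsh node1 'dmesg | grep -i ramdisk'   # the kernel should report RAM disks of 393216K
rsh node1 df -h /                     # the root filesystem of a diskless node lives on the ramdisk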

If your master (i.e. the frontend) is set up so that eth0 is the external network interface, you are done with the BeowulfBuilder configuration. However, if eth1 is your external interface, then you need to modify the file /usr/sbin/bbuilder and change every "eth1" to "eth0".
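
A simple way to make that substitution while keeping a backup of the original script (our own approach; any editor does the job equally well):
cp /usr/sbin/bbuilder /usr/sbin/bbuilder.orig
sed 's/eth1/eth0/g' /usr/sbin/bbuilder.orig > /usr/sbin/bbuilder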

STEP 3.

We are finally ready to run "bbuilder" and start building our diskless cluster. Just type "bbuilder" to start BeowulfBuilder. It will create the /tftpboot directory and modify your /etc/hosts, /etc/hosts.equiv, and /etc/dhcpd.conf files. Please note that you need to keep the window in which bbuilder is running large; otherwise the program may complain and simply quit. If this happens, just resize the window and run it again. If everything goes normally, select "Add node". Now you are ready to turn on the first slave node. You should see its MAC address and IP on the bbuilder screen. During this process, I noticed the following warning:
couldn't find /var/beowulfbuilder : No such file or directory

This warning seems harmless, but it is quite possible that it is the main reason why SCMS does not work for me yet! After node1 finishes booting, you can turn on the next node, and so on. After all nodes are up, press F2 (note that for SCE 1.5, before pressing F2, you have to change the file permissions on /etc/hosts, i.e. chmod 644 /etc/hosts; otherwise you will get lots of "permission denied" errors). We are done with adding nodes. Next select "Configure Synchronize" and you are done!
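
For reference, the /etc/hosts permission fix mentioned above, together with a guess at silencing the /var/beowulfbuilder warning, would look like this (creating the directory is untested on our side; it is only a suspicion that bbuilder expects it to exist):
chmod 644 /etc/hosts           # required before pressing F2 under SCE 1.5
mkdir -p /var/beowulfbuilder   # untested guess at a workaround for the warning above
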
A SMALL BUG: During the installation of the slave nodes, I noticed that each node complains about gdm settings. Apparently the /var/gdm directory has the wrong permissions and owner; it should belong to gdm. This bug, however, also seems harmless.
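
If the gdm complaint bothers you, the ownership can presumably be fixed on each node along these lines (we have not verified that gdm needs anything beyond the ownership change):
chown -R gdm:gdm /var/gdm      # gdm complains when it does not own this directory
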
At this point, our cluster is, in principle, ready for parallel computing. You should also be able to rsh to the nodes without any problem.

STEP 4.

Finally, we start the cluster monitoring software SCMS. This is where I am stuck! After the above steps, I rebooted the master and the slave nodes; this was necessary to get SCMS running under SCE 1.5, so I simply repeated the same procedure here, i.e. reboot the master first and then the slaves. In the boot.log of both the master and the slave nodes it says that "rms startup succeeded", so both master and slaves are running rms after booting. The /etc/sce directory is also populated at this stage. When I type "scms", everything seems to work except that I only see "googoo" (i.e. the frontend), as if my cluster does not contain any slave nodes. I tried "update configuration" and "restart daemon", but neither worked.

Below is a list of the commands I ran and the corresponding responses. I am hoping that an SCMS expert will have a look at these and let me know what the problem is!

[root@googoo sce]# cms_stop
node1: Stopping rms on 172.16.255.254 : [ OK ]
googoo: Stopping rms on 172.16.0.1 : [ OK ]
node2: Stopping rms on 172.16.255.253 : [ OK ]

[root@googoo sce]# cms_start
googoo: Starting rms on 172.16.0.1 : [ OK ]
node1: Starting rms on 172.16.255.254 : [ OK ]
node2: Starting rms on 172.16.255.253 : [ OK ]

Even though this looks like it works, it does not! If I type "cms_host -l", it prints a blank line, as if there were no slaves. Similarly, if I type "pexec -a env", it prints the environment variables only for the master, "googoo". If I type "pexec -h node1 env", it again prints a blank line.
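
To narrow the problem down, a few generic checks on a slave may be worth running; these are our own diagnostic ideas, not documented SCMS procedures:
rsh node1 'ps ax | grep rms | grep -v grep'   # is the rms daemon still running on the slave?
rsh node1 ls -l /etc/sce                      # does the slave have the SCE configuration files?
rsh node1 ping -c 2 googoo                    # can the slave reach the master by name?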


Below are a few commands used to test the network setup and the cms commands on the slave nodes:

[root@googoo sce]# hostname
googoo.ncnr.nist.gov
[root@googoo sce]# hostname -i
172.16.0.1
[root@googoo sce]# rsh node1
[root@node1 root]#
[root@node1 root]# hostname
node1.ncnr.nist.gov
[root@node1 root]# hostname -i
172.16.255.254
Finally, I tried the following:
route add -net 224.0.0.0 netmask 240.0.0.0 dev eth1
which seems to work. Please note that this command has to be run every time the system is booted.
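
Since the route disappears after every reboot, one simple (if crude) way to automate it on the master is to append the command to /etc/rc.d/rc.local; this is our own workaround, not an official fix:
echo '/sbin/route add -net 224.0.0.0 netmask 240.0.0.0 dev eth1' >> /etc/rc.d/rc.local
The 224.0.0.0/240.0.0.0 range is the multicast address space, so the daemons presumably rely on multicast over the internal interface.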