Installation Notes for SCMS 2.2 and BeowulfBuilder for RedHat 8.0
Here I provide step-by-step instructions for building
a test diskless beowulf cluster system using beowulfbuilder and
scms 2.2 on RedHat 8.0. I haven't got it fully working yet, but
I am very close.
The frontend and compute nodes are Tyan motherboards
with dual 2.4 GHz Xeon CPUs. Each motherboard has one
fast and one gigabit ethernet port on board. For the moment, we have
disabled the gigabit ports on the motherboards. Hence "eth0" is the
fast ethernet on the slave nodes (i.e. compute1, compute2, etc).
For the master (i.e. frontend), eth0 is for external communication
and eth1 is for internal networking.
All the motherboards are set to boot by
PXE using the onboard eth0. The slave nodes have only CPUs,
memory, and the onboard eth0 network card.
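For reference, here is a sketch of the frontend's network-script files
under this layout. The internal eth1 values match the
beowulfbuilder.conf.template shown in Step 2 below; the external eth0
settings are site-specific placeholders, not taken from our actual setup:

# /etc/sysconfig/network-scripts/ifcfg-eth0  (external; placeholder values)
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1  (internal cluster network)
DEVICE=eth1
BOOTPROTO=static
IPADDR=172.16.0.1
NETMASK=255.255.0.0
ONBOOT=yes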
1. Linux RedHat 8.0 Installation
We first installed RedHat 8.0 on the master
(i.e. frontend). We used the custom installation, installed
everything, and chose the "no firewall" option.
2. Installation of SCMS
We downloaded the SCMS 2.2 RPMs (the lib*, pyth*, and scms* packages) from
http://www.opensce.org/moin .
We put all these files in one directory and then
installed all of them, as shown below.
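A minimal sketch of that install, assuming all the downloaded RPMs sit
together in one directory (the directory name here is ours; the glob
patterns follow the package names above):

cd /root/scms-rpms              # directory holding the downloaded RPMs (name is ours)
rpm -ivh lib*.rpm pyth*.rpm scms*.rpm   # -i install, -v verbose, -h progress hashes
rpm -qa | grep -i scms          # confirm the scms packages registered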
3. Installation of BeowulfBuilder
After SCMS was installed, we installed beowulfbuilder. The
order of installation, I think, is important; otherwise you may
need to rebuild your cluster so that the /etc/sce directory is
included on your slave nodes...
STEP 1.
We downloaded the latest version of the beowulfbuilder,
beowulfbuilder-2.7-9.8.x.i386.rpm,
and just installed it, as shown below.
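The install itself is just the usual rpm command, run from the directory
holding the download (the trailing query assumes the package registers
under the name "beowulfbuilder"):

rpm -ivh beowulfbuilder-2.7-9.8.x.i386.rpm
rpm -ql beowulfbuilder | head   # check where the files land (/usr/share/beowulfbuilder-2.7, /usr/sbin/bbuilder)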
STEP 2.
We next set up the configuration files:
- /usr/share/beowulfbuilder-2.7/beowulfbuilder.conf.template
- /usr/share/beowulfbuilder-2.7/bbconf
Here's our modified beowulfbuilder.conf.template file:
SERVER_NETWORK="172.16.0.0"
SERVER_IP="172.16.0.1"
SERVER_NETMASK="255.255.0.0"
CLIENT_NETWORK="172.16.0.0"
CLIENT_NETID="172.16"
CLIENT_HOSTIP_RANGE="172.16.255.2 172.16.255.254"
CLIENT_NETMASK="255.255.0.0"
CLIENT_BROADCAST="172.16.255.255"
CLIENT_HOSTNAME="node"
DOMAINNAME="ncnr.nist.gov"
CLIENT_DOMAINNAME_SERVER="ncnr.nist.gov"
CLUSTER_NAME="tancluster"
SERVER_INT_NAME="googoo"
NISDOMAIN="ncnr.nist.gov"
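A quick sanity check on these values (assuming, as on our frontend, that
eth1 is the internal interface):

/sbin/ifconfig eth1 | grep 'inet addr'   # should show inet addr:172.16.0.1 and Mask:255.255.0.0, matching SERVER_IP and SERVER_NETMASK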
We next modified the /usr/share/beowulfbuilder-2.7/bbconf
file to increase the size of the ramdisk (DEFAULTRAMDISKSIZE is
presumably in the kernel's 1 KB ramdisk blocks, so 393216
corresponds to 384 MB):
TFTPBOOT_DIR=/tftpboot
DBETHER="$TFTPBOOT_DIR/DBETHER"
CONFIG_FILE="$TFTPBOOT_DIR/beowulfbuilder.conf"
PXECFGDIR="pxelinux.cfg"
GRUBCFGDIR="grub.cfg"
DEFAULTRAMDISKNAME=rootfs.gz
DEFAULTRAMDISKSIZE=393216
DEFAULTKERNELNAME=vmlinuz
RAMDISKSUFFIX=root.gz
KERNSUFFIX=kernel
If your master (i.e. frontend) is set up such that
eth0 is for the external network, you are done
with the configuration of the beowulf. However, if
your eth1 is for external networking, then
you need to modify the file /usr/sbin/bbuilder:
change all occurrences of "eth1" to "eth0", as sketched below.
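A one-liner for that edit (keep a backup; the sed shipped with RedHat 8.0
predates the -i flag, so we write from a copy):

cp /usr/sbin/bbuilder /usr/sbin/bbuilder.orig
sed 's/eth1/eth0/g' /usr/sbin/bbuilder.orig > /usr/sbin/bbuilder   # redirecting into the existing file keeps its permissions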
STEP 3.
We are finally ready to run "bbuilder" and start building
our diskless cluster. You just type "bbuilder" to start
the beowulfbuilder. It will create the /tftpboot directory
and modify your /etc/hosts, /etc/hosts.equiv, and /etc/dhcpd.conf
files. Please note that you need to keep the window
in which bbuilder is running large; otherwise the program
may complain and just quit. If this happens, resize
the window and rerun it. If everything goes normally,
you select "Add node". Now you are ready to turn on
the first slave node. You should see its MAC address and IP
on bbuilder's screen.
During this process, I noticed that I got
a warning:
couldn't find /var/beowulfbuilder : No such file or directory
This warning seems harmless, but it is quite possible that
it is the main reason why SCMS does not work for me yet!!!
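An untested guess at a workaround is simply to create the directory by
hand before running bbuilder:

mkdir -p /var/beowulfbuilder   # silences the warning; whether it fixes anything is untested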
After node1 finishes booting, you can turn on the next node,
and so on. After all nodes are up, you press F2 (please note that
for SCE 1.5, before pressing F2, you have to change the file
permissions on /etc/hosts, i.e. chmod 644 /etc/hosts; otherwise
you will get lots of "permission denied" errors). We are
done with adding nodes. Next select
"Configure Synchronize" and you are done!
A SMALL BUG: During installation of the
slave nodes, I noticed that each node complains about
gdm settings. Apparently the /var/gdm directory has the
wrong file permissions and owner; it should belong to
gdm. However, this bug seems harmless as well!! A possible
fix is sketched below.
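If the complaints bother you, a hedged fix (assuming, as the error
suggests, that the directory should belong to the gdm user) is to run
on each node:

chown -R gdm:gdm /var/gdm   # hand the directory to the gdm user, per the error message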
At this point, our cluster is, in principle, ready for parallel
computing. You should also be able to rsh to the nodes without any problem.
STEP 4.
Finally, we will start the cluster monitoring software
SCMS. This is where I am stuck! After the above
steps, I rebooted the master and slave nodes. This was
necessary to get SCMS running under SCE 1.5, so I
just repeated the same thing here, i.e. rebooted the
master and then the slaves. Below is the boot.log for
both master and slave nodes:
Please note that both log files say
"rms: rms startup succeeded". Hence both the master and the slaves
are running rms after booting. At this stage I have
the following files in /etc/sce:
When I type "scms", everything seems to work except that
I only see "googoo" (i.e. the frontend), as if my cluster does
not contain any slave nodes. I tried "update configuration"
and "restart daemon", but neither worked.
Below is a list of commands that I ran
and the corresponding responses. I am hoping that an expert
on SCMS will have a look at these and let me know the
problem!!!
[root@googoo sce]# cms_stop
node1: Stopping rms on 172.16.255.254 : [ OK ]
googoo: Stopping rms on 172.16.0.1 : [ OK ]
node2: Stopping rms on 172.16.255.253 : [ OK ]
[root@googoo sce]# cms_start
googoo: Starting rms on 172.16.0.1 : [ OK ]
node1: Starting rms on 172.16.255.254 : [ OK ]
node2: Starting rms on 172.16.255.253 : [ OK ]
Even though it seems to work, it does not!! If I type "cms_host -l",
it reports a blank line, as if there were no slaves! Similarly, if I type
"pexec -a env", it prints the environment variables for only the master
"googoo". If I type "pexec -h node1 env", it again prints a blank
line...
Below is a list of commands to test the network
setup and the cms commands on the slave nodes:
[root@googoo sce]# hostname
googoo.ncnr.nist.gov
[root@googoo sce]# hostname -i
172.16.0.1
[root@googoo sce]# rsh node1
[root@node1 root]#
[root@node1 root]# hostname
node1.ncnr.nist.gov
[root@node1 root]# hostname -i
172.16.255.254
Finally, I tried the following:
route add -net 224.0.0.0 netmask 240.0.0.0 dev eth1
which seems to work. Note that 224.0.0.0 with netmask 240.0.0.0 is the
IP multicast range (224.0.0.0/4), which rms presumably uses to discover
the nodes on the internal interface. Please note that this command has
to be run every time the system is booted; a way to make it persistent
is sketched below.
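One way to make the route persistent across reboots (an assumption on
our part, not something we have tested) is to append it to rc.local,
which runs late in the boot sequence:

# restore the multicast route at every boot
echo '/sbin/route add -net 224.0.0.0 netmask 240.0.0.0 dev eth1' >> /etc/rc.d/rc.local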