Old Procedure: Adding a new pool node

Draft procedure for adding a new pool node to FNDCA.


Preparing a new Pool Node for dCache service
v0.91
16 Mar 2007
RDK


This is specific to pool nodes. Instructions for admin and monitor nodes vary.

[MOST RECENT CHANGE] Minor, in logfile references.


Assumptions: What has been done by ISA/IA/SSA before hand-off to dCache developers
------------

1. FSL 4.x is installed on the node, based on the "generic farm worker node".

2. The hostname of a pool node is expected to follow the pattern stken*. If the
hostname does NOT follow this pattern, then the scripts:

	~enstore/dcache-code/dcache-fermi-config/scripts/setup-enstore
	~enstore/dcache-code/dcache-fermi-config/check_crc

will require modification. Traditionally, hostnames like stkendca<NN>a have been
used for pool nodes where <NN> is an integer. Some admin nodes have different
hostnames, since only POOL nodes must follow this particular pattern.

3. Yum update is either disabled completely, or at least the yum update of the
Sun Java RPM (referred to as "jdk" nowadays, but also known in the past as
"j2sdk") is disabled. This is critical to prevent a service outage due to a yum
update.
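
	One way to do this (a sketch, not the only option; the exact package
	names to exclude are an assumption) is to exclude the Java RPMs in
	/etc/yum.conf and/or disable the nightly yum update service:

	# in /etc/yum.conf, under [main]:
	exclude=jdk* j2sdk*

	# chkconfig yum off	# if the nightly yum cron service is installed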

4. The root account contains appropriate .k5login

5. The enstore account has been cloned from a working pool node. Be sure to
	preserve permissions on sym-links as is. The enstore account therefore
	contains an appropriate .k5login

6. /etc/grid-security (CA chain and x.509 host credentials) has been installed

7. Kerberos host credentials have been installed: /etc/krb5.keytab

8. Data partitions exist to host the pools and are mounted at /diska,/diskb,...
The pattern /disk[a-z] is encoded in the script:

	~enstore/dcache-code/dcache-fermi-config/check_crc

and only files in pools on FNDCA that match this base path of /disk[a-z] will be
scanned by the CRC integrity check.

9. PNFS should be mounted (and available for export to this node). For example,
	the /etc/fstab entry on most pool nodes is:

stkensrv1:/fs  /pnfs/fs  nfs  sync,rsize=4096,wsize=4096,user,intr,bg,hard,rw,noac 0 0

10. Make sure there is a symbolic link: ln -s fs /pnfs/fnal.gov

[root@stkendca17a ~]# ls -alF /pnfs
total 13
drwxr-xr-x   3 root root 4096 Feb 20 14:37 ./
drwxr-xr-x  33 root root 4096 Feb  9 12:25 ../
lrwxrwxrwx   1 root root    2 Feb 20 14:37 fnal.gov -> fs/
drwxrwxrwx   1 root root  512 Feb 13  2004 fs/

11. Install a UPS/UPD "products" account, basic installation. A partial install
will confuse a complete basic install, so if one of the RPMs is already present,
remove "upsupdbootstrap" first. Then:

	% yum install upsupdbootstrap upsupdbootstrap-local
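
	If a partial install is present, something like the following clears it
	out first (a sketch, assuming the standard rpm/yum tools on FSL 4):

	# rpm -q upsupdbootstrap upsupdbootstrap-local
	# yum remove upsupdbootstrap upsupdbootstrap-local

	then re-run the yum install above.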

12. Turn off unnecessary services and login daemons, run updatedb (once at
least)
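
	A typical sequence (the exact list of services to disable is
	site-specific; the services named here are an assumption):

	# chkconfig cups off     ; service cups stop
	# chkconfig sendmail off ; service sendmail stop
	# /usr/bin/updatedb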

13. Logins and copy services: make sure possible for root and enstore.
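
	A quick check from an admin node (generic commands, not part of the
	original procedure; stkendca17a is the example node used below):

	$ ssh root@stkendca17a uptime
	$ ssh enstore@stkendca17a uptime
	$ scp /etc/hosts enstore@stkendca17a:/tmp/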

14. On new 64-bit systems, we are still running some 32-bit applications. In
order to do this successfully, we have found that we need to install a 32-bit
system library: libgcc-<version>.i386. If this is not done, encp's may fail, and
odd error messages will appear in the dcache/encp logfiles (now located on the
same host as the dcache component that is writing to them) stating something
like "/usr/lib/libgcc_s.so cannot be found so pthread_cancel cannot be run".




Overview of what developers have done: (assuming addition to a live system)
--------------------------------------

*** As user "root" on the pool node ***

1. Remove GNU and Blackdown java-related rpm packages and dependents. The exact
	list varies by FSL version. The clues are packages containing "gcj" and
	"jpp" in their names, and packages that require these. "Yum remove" can
	help find the dependents. The OLD FSL 3.X removal list as an example:

	j2re-blackdown-1.4.1-gcc32.1	redhat-java-rpm-scripts-1.0.2-2
	libgcj-3.2.3-42			libgcj-ssa-3.5ssa-0.20030801.48
	gcc-java-ssa-3.5ssa-0.20030801.48
	libgcj-ssa-devel-3.5ssa-0.20030801.48		javamail-20031006-1
	redhat-lsb		gettext			commons-logging-1.0.2-12
	jaf-20030319-1		jakarta-regexp-1.2-12	bcel-5.0-10
	junit-3.8.1-1		cup-v10k-10		xerces-j-2.2.1-11
	xalan-j-2.4.1-11	ant-libs-1.5.2-23	ant-1.5.2-23

	This is more than really needs to be removed, but since this is a
	closed turnkey service, removing the extra packages does no harm.

	For FSL 4.4, only one package seems to fit this description:
		"yum remove jpackage-utils-1.6.0-2jpp_3rh.noarch"

2. Install the Sun Java RPM and the CURRENT sym-link. The accepted current RPM
	version is usually stored in ~enstore/JAVA on dCache nodes nowadays.

	$ cd /usr/java
	$ ln -s jdk1.5.0_10 CURRENT
	$ ls -alF

	lrwxrwxrwx   1 root root   11 Feb  8 14:58 CURRENT -> jdk1.5.0_10/
	drwxr-xr-x   9 root root 4096 Feb  8 14:57 jdk1.5.0_10/
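
	The RPM install itself, done before creating the CURRENT link above, is
	typically just the following (the exact RPM file name under
	~enstore/JAVA is an assumption):

	# rpm -ivh ~enstore/JAVA/jdk-1_5_0_10-linux-i586.rpm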

3. Yum install emacs/xemacs and other basic utilities to help administrators.
	For FSL 4.4, we did: "yum install emacs xemacs cvs which"

4. Prepare to be able to login as user enstore, and test:
	# cd /usr/local/bin
	# ln -s /home/enstore/dcache-deploy/dcache-fermi-config/ENSTORE_HOME

	Note: trouble with "su - enstore" means missing ENSTORE_HOME symlink

*** as user "enstore" on the pool node ***

5. Set the config sym-link in ~enstore/dcache-code to the appropriate base
	directory for the dCache instance to be used, for example: fndca,
	fndcat, cdfen, cdftest. For example, to bring up the node in the public
	dcache test stand, the config sym-link should point to fndcat, via:
	config -> dcache-fermi-config/fndcat
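
	A minimal sketch of setting that link (using the fndcat example above):

	$ cd ~enstore/dcache-code
	$ ln -sfn dcache-fermi-config/fndcat config
	$ ls -l config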

6. Add the new pool node hostname to the dcache and pool farmlet files.

	$ cd ~enstore/dcache-code/config
	$ edit dcache.farmlet
	$ edit pool.farmlet

7. Create appropriate dCache config files for the node: boot, poollist, setup
	(a copy-and-edit sketch follows this step)

	- boot: just copy another pool node's boot file, giving it an appropriate name

	- poollist: copy pre-existing one for a pool filling same role (read,
	write, volatile), change the partition, pool name, etc.

	- setup: copy pre-existing one for a pool filling same role (read,
	write, volatile), change the size, max active numbers, etc.

	POOL FILE SYSTEM DEFINITIONS:
	- Each pool <= 1 TB, entirely contained on a partition.
	- More than one pool per partition is OK.
	- Total pool space <= 95% of actual partition space.
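
	A minimal sketch of the copy-and-edit flow for these files (the source
	and destination file names here are hypothetical, patterned on the
	examples elsewhere in this document):

	$ cd ~enstore/dcache-code/config
	$ cp stkendca16a-dcache-boot stkendca17a-dcache-boot
	$ cp stkendca16a.write-pool-2.poollist stkendca17a.write-pool-3.poollist
	$ cp stkendca16a.write-pool-2.setup stkendca17a.write-pool-3.setup
	$ vi stkendca17a.write-pool-3.poollist	# change partition, pool name, etc.
	$ vi stkendca17a.write-pool-3.setup	# change size, max active numbers, etc.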

8. Check in config file additions and changes to CVS.

	### Cannot cvs update yet... credentials issue
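
	Once the credentials issue is resolved, the check-in is the usual CVS
	flow (the file names here are hypothetical):

	$ cd ~enstore/dcache-code/config
	$ cvs add stkendca17a-dcache-boot stkendca17a.write-pool-3.poollist stkendca17a.write-pool-3.setup
	$ cvs commit -m "add stkendca17a write-pool-3"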

*** as "root" on pool node ***

9. Create pool directories in /diska, /diskb, etc. and install "setup" sym-link

For one particular pool, write-pool-3 living on /diskc:

[root@stkendca17a write-pool-2]# cd /diskc
[root@stkendca17a diskc]# mkdir write-pool-3
[root@stkendca17a diskc]# cd write-pool-3
[root@stkendca17a write-pool-3]# mkdir control data
[root@stkendca17a write-pool-3]# ln -s /home/enstore/dcache-deploy/config/stkendca17a.write-pool-3.setup setup

*** as "enstore" on pool node ***

10. Add pools to PoolManager.conf, assign them to an appropriate pool group.
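
	For reference, the classic PoolManager.conf syntax for registering a
	pool and adding it to a pool group looks like the lines below; the pool
	and pool group names are hypothetical, so match them to the instance:

	psu create pool stkendca17a-write-pool-3
	psu addto pgroup writePools stkendca17a-write-pool-3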

11. Add the pools to the pool_files.config with appropriate details. I do not
	use this myself, but it should in principle reproduce all the pool
	configuration files in one pass. (Not usually done for test stand)

12. Check in config file changes to CVS.

	### CANNOT DO YET ON 17a
	### copied to 20a and cvs add/ci from there to capture our progress

13. CVS update the ~enstore/dcache-code/config area of at least all admin nodes
	and the new pool node. You can use the following command to do this for
	the admin nodes, as user "enstore" on fndca:

	% dgang -a "cd ~enstore/dcache-code/config ; cvs update -A"

*** as user "root" on the pool node ***

14. Install symbolic links for boot-up and in /usr/local/bin
	- Jon wrote a script to do this for you, but it is not maintained for
	FNDCA. Just look at a working pool node... you will get the idea
	(see also the sketch after this list).
	- Boot Examples: All point to the "boot" file mentioned above.
		/etc/rc.d/init.d/dcache-boot --> stkendca17a-dcache-boot

		/etc/rc.d/rc3.d/K05dcache-boot --> ../init.d/dcache-boot
		/etc/rc.d/rc3.d/S95dcache-boot
		/etc/rc.d/rc6.d/K05dcache-boot

	### Idea: add "config" action to the boot script to setup the links

	- /usr/local/bin Examples:
		dgang -> /home/enstore/dcache-deploy/dcache-fermi-config/dgang
		doorcommand -> /home/enstore/dcache-deploy/scripts/doorcommand
		pathfinder -> /home/enstore/dcache-deploy/dcache-fermi-config/pathfinder
		poolcmd -> /home/enstore/dcache-deploy/scripts/poolcmd
		real-encp.sh -> /home/enstore/dcache-deploy/scripts/real-encp.sh
		rgang -> /home/enstore/dcache-deploy/dcache-fermi-config/rgang.py
		timed_cmd.sh -> /home/enstore/dcache-deploy/dcache-fermi-config/timed_cmd.sh

	- There are many sym-links in the dCache directory structure under
		~enstore, but these should have been preserved in the cloning.
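
	A minimal sketch of creating the boot-up links, mirroring the link
	targets listed above (verify against a working pool node first; the
	location of the node-specific boot file is an assumption here):

	# cd /etc/rc.d/init.d
	# ln -s stkendca17a-dcache-boot dcache-boot
	# cd /etc/rc.d/rc3.d
	# ln -s ../init.d/dcache-boot K05dcache-boot
	# ln -s ../init.d/dcache-boot S95dcache-boot
	# cd /etc/rc.d/rc6.d
	# ln -s ../init.d/dcache-boot K05dcache-boot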

*** as user "products" or "enstore" on the pool node ***

15. Be sure the version and configuration of ENSTORE/encp is appropriate.

	- We are moving away from in situ encp builds towards encp upd installs.
	The UPD approach prevents centralized ("fifo") logging, but it
	simplifies encp upgrades and avoids dcache admins having to know about
	enstore software distribution and builds from CVS.
	
	If you are using the UPD-based encp, then you need to install the encp
	product, declare it current, and edit the setup-enstore script in
	~enstore/dcache-code/dcache-fermi-config/scripts

	*** as user "root" on pool node ***
	# su - products
	-sh-3.00$ source etc/setups.sh 
	-sh-3.00$ setup upd
	-sh-3.00$ upd install encp v3_6d -q dcache -G -c
	informational: installed encp v3_6d.
	upd install succeeded.

	Then edit setup-enstore and cvs check it in.

	### Logfiles are being erased when pools shut down (non-fifo mode). This
	needs to be verified. I think I have seen these logs being appended to
	as well, in a different configuration.

	- If still using the in situ encp build, the cloning should be enough
	for now. However, the next time you upgrade encp, you will need to
	install a number of products to succeed: enstore-specific python, a fake
	enstore ups setup, swig, and perhaps blt. Long explanation.

*** as user "root" on the pool node ***

16. Start up the pool node as root on that node
	% /etc/rc.d/init.d/dcache-boot start

*** as user "enstore" on the head node ***

17. Enable the new pool node in the dcache configuration:
	% dcache
		slm-enstore> exit
		.. > set dest PoolManager@dCacheDomain
		PoolManager@dCacheDomain> reload -yes

*** as user "enstore" on pool node ***

18. Entries for the following nodes should be present in
	~enstore/.ssh/known_hosts

	- fndca, fndca.fnal.gov, fndca3a, fndca3a.fnal.gov (production)

	- fndcat, fndcat.fnal.gov, stkendca3a, stkendca3a.fnal.gov (test stand)

	These are needed for the various monitoring/watchdog scripts to work.
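
	One way to populate these entries (an assumption, not part of the
	original procedure) is ssh-keyscan, or simply copy
	~enstore/.ssh/known_hosts from a working pool node:

	$ ssh-keyscan -t rsa fndca fndca.fnal.gov fndca3a fndca3a.fnal.gov >> ~enstore/.ssh/known_hosts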

*** as general user with web browser ***

19. If, after 5 minutes, the node does not show up on the cells page and the
	pool request queues page, and the pools are running, then try restarting
	the httpd CELL on the head node. Be patient, as this monitoring takes
	time to sync up.

20. Watch the billing errors page to be sure that some successful client
	activity takes place with the new pool nodes.

*** as user "enstore" on the pool node or monitor node ***

21. Check the dCache logfiles for error messages originating from the pools. If
	non-centralized logging is used (FNDCA, teststands), you will find the
	pool logs on the pool node. If centralized logging is used (CDFDCA), the
	central logs are on the monitor node.

22. Test that stores and restores work: reads, writes, etc.
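
	A quick smoke test with dccp, assuming PNFS is mounted on the client and
	a writable test directory exists (the path here is hypothetical):

	$ dccp /etc/hosts /pnfs/fnal.gov/usr/test/dcache/newpool-smoke-test
	$ dccp /pnfs/fnal.gov/usr/test/dcache/newpool-smoke-test /tmp/newpool-readback
	$ cmp /etc/hosts /tmp/newpool-readback && echo OK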


Post-install assumptions:
-------------------------

1. BE SURE yum auto-update is OFF, or that the jdk rpm is exempted from updates.

2. New pool node is added to NGOP as appropriate (e.g. logfile space, up/down).

3. New pool node is added to Ganglia monitoring if and when this exists.


Other Notes:
------------
I have created a workgroup called "DcacheServer", located in CVS at:
	/lts4rolling/i386/sites/Fermi/workgroups/DcacheServer. I hand this off
	to SSA. Just let me know if you need to send e-mail to the maintainers
	to add/remove administrators for this workgroup.

