
FNDCA Components

An overview of the components, scripts, cronjobs, and deployment approach used in the FNDCA system in the current (old-style, not rpm-based) deployment configuration. This is a draft intended to help the effort to evolve the configuration to an rpm-based deployment organization.

Fndca Infrastructure and Organization
Rough Draft 2
19 Dec 2006
RDK


[ASIDE] Items to be sure to cover, not yet obvious in text at this time

*) Requirements for nodes running dcache, pgsql, tomcat, srm (new style).
*) How "at" jobs interact with the cronjobs to achieve web pages
*) How CRC check is done in detail



Historical FNAL Deployment Organization
=======================================

1) Overall structure broken down along the file system organization

1.1) ~enstore is the home of dcache installations, based on the history of dCache
being brought to FNAL by Enstore developers. It contains a collection of
directories to support deployment, a few utilities to support developers, and
some accumulated junk. Noteworthy are:

	1. dcache-deploy is the main body of dcache deployment, described below

	2. dcache-log contains the dCache log files
		These may be local one-per-domain ("nofifo") or merged ("fifo")

	3. dcache-billing, dcache-ftp-tlog*, dcache-logins, dcache-queues, ...
		contain plain text records and logs used to support web pages
		developed by FNAL to track billing, FTP transfers, etc.

	4. JAVA contains Sun JDK rpms and any post-install tweaks

	5. TOOLS contains developer tools like jprofiler

	6. jvmstat* contain mostly outdated Sun jvmstat installations; these should be
	moved into TOOLS

	7. pgsql (not everywhere) points to the current Postgresql deployment

	8. tomcat (not everywhere) points to the current Tomcat deployment

	9. .bashrc, .bash_profile: set up the base environment for scripts
		(true for ~root as well)
	10. unix.uid.list: manually updated mapping of usernames to uids at FNAL

1.2) ~enstore/dcache-deploy is a symlink to the base of FNAL dcache software
infrastructure. In all known instances, it points to ~enstore/dcache-code.

	~enstore/dcache-deploy contains:

	1. classes -> dcache-fermi-config/jars/20061114-1637utc-1.7.0-19-1.5.0_09/
		jar files containing actual dCache service classes	
	2. config -> dcache-fermi-config/fndca/
		configuration files for this instance of dCache (fndca)
	3. dcache-fermi-config: base directory of most dcache service contents
		3.1. the directory itself
			contains a disorganized ad hoc collection of scripts and
			utilities used in dcache operation in addition to what
			is in its scripts sub-directory (7).
		3.2. fndca, fndcat, cms, cdfen
			Base directories for per-instance configuration. Note
			that cdfen = CDFDCA, cdftest = CDFDCAT.
		3.3. jars
			dCache software "classes" releases built at FNAL.
		3.4. A LOT of other stuff, some ref'd by symlinks from elsewhere

	4. docs -> dcache-fermi-config/docs/
		not described
	5. gsint -> dcache-fermi-config/gsint/
		not described
	6. jobs -> dcache-fermi-config/jobs/
		start-up wrappers for dCache JVMs/Domains. All changed in CVS.
	7. scripts -> dcache-fermi-config/scripts/
		palliative and administrative support scripts
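
[ASIDE] The symlink layout above lends itself to a mechanical check. The
following is a minimal sketch only, not part of the existing FNDCA tooling:
the link names and targets are copied from the list above, while the names
EXPECTED_LINKS, check_links, and demo are hypothetical. The demo builds a
throwaway directory with two deliberate faults rather than touching a real
~enstore/dcache-deploy area.

```python
import os
import tempfile

# Symlink names and targets copied from the list above. "classes" is left out
# because its target names one specific jar release.
EXPECTED_LINKS = {
    "config": "dcache-fermi-config/fndca/",
    "docs": "dcache-fermi-config/docs/",
    "gsint": "dcache-fermi-config/gsint/",
    "jobs": "dcache-fermi-config/jobs/",
    "scripts": "dcache-fermi-config/scripts/",
}


def check_links(base, expected=EXPECTED_LINKS):
    """Return (name, problem) pairs for links that are missing or wrong."""
    problems = []
    for name, target in sorted(expected.items()):
        path = os.path.join(base, name)
        if not os.path.islink(path):
            problems.append((name, "not a symlink"))
        elif os.readlink(path).rstrip("/") != target.rstrip("/"):
            problems.append((name, "points to " + os.readlink(path)))
    return problems


def demo():
    """Build a throwaway layout with two deliberate faults and check it."""
    with tempfile.TemporaryDirectory() as base:
        for name, target in EXPECTED_LINKS.items():
            if name == "docs":
                continue  # fault 1: link is missing
            if name == "config":
                target = "dcache-fermi-config/fndcat/"  # fault 2: wrong instance
            os.symlink(target, os.path.join(base, name))
        return check_links(base)


if __name__ == "__main__":
    for name, problem in demo():
        print(name, "-", problem)
```

On a real node the call would presumably be check_links on the dcache-deploy
directory itself, with the expected table adjusted per instance.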

1.3) /etc/rc.d/init.d contains symlinks and boot scripts for dcache components
such as dcache-boot, postgres-boot, tomcat-boot, monitoring-boot.

1.4) /usr/local/bin contains symlinks to utility programs and some boot scripts

1.5) crontabs. The config areas contain crontabs for users root and enstore on
the head and monitoring node, user enstore on door nodes, and possibly more.
These are covered in more detail in section 2.2.

1.6) /fnal/ups/prd/www_pages is the "home page" area of the dcache WWW page. It
only exists on the head node serving the traditional "outer" dcache web page
created and supported by FNAL and run under apache httpd. Much [TODO]

	1. HTTPD non-default configuration settings

	2. index.html, robots.txt are the usual httpd basics: the default page
	and web page indexing block respectively. 

	3. dcache

		3.1 *.gif, *.jpg - graphics and images from DESY

		3.2 billing

		3.3 diskList, diskList.save

		3.4 files, files.save

		3.5 lifetime

		3.6 logins

		3.7 queue

		3.8 running.html - "Running dCache"

		3.9 Mapping of monitoring generated files to web pages:

"Daily Billing"		http://fndca3a.fnal.gov/dcache/billing.html
"File Lifetime Plots"	http://fndca3a.fnal.gov/dcache/dc_lifetime_plots.html
"Login Plots"		http://fndca3a.fnal.gov/dcache/dc_login_plots.html
"Queue Plots"		http://fndca3a.fnal.gov/dcache/dc_queue_plots.html
"Login List"		http://fndca3a.fnal.gov/dcache/DOORS.html
"P2P Queue" (unlinked)	http://fndca3a.fnal.gov/dcache/p2p.html
"Restore Queue"		http://fndca3a.fnal.gov/dcache/RC.html
"Active Transfers"	http://fndca3a.fnal.gov/dcache/transfers.html

		3.10 http://fndca2a.fnal.gov:8090/dcache/lsplots

		3.11 http://fndca3a.fnal.gov/dcache/files

		3.12 http://fndca3a.fnal.gov/cgi-bin/dcache_files.py

		3.13 http://fndca3a.fnal.gov:2288/statistics


1.7) ~enstore/enstore contains the Enstore code distribution from CVS. It is used
to build Enstore utilities in place that are used by dCache such as: encp, ecrc,
config_server, log_server, and a few minor scripts.

	There are several ways to set up this area, each of which could put
	different executables in the PATH on different instances of dCache.
	The encp and ecrc tools are now available in a client package.
	
	The configuration server, log server, alarm_server, and event relay
	programs are not available in a separately redistributable client
	package. These servers are used primarily (but not exclusively) to
	collect logfile output across a distributed dCache instance and merge
	all into a single rotating logfile. While this is convenient for
	reviewing logs, this logging mechanism and its use has weaknesses. For
	instance, dCache uses it synchronously so if a log disk fills, the whole
	service stops. It is not actively supported distinct from Enstore, so
	its use should be reconsidered. CMS no longer uses this mechanism, and we may
	consider dropping it as well for FNDCA to simplify the deployment
	re-organization project. A follow-up project may consider an alternative
	framework to achieve the same functionality.

1.8) pagedcache --- Eileen is working on this, upgrading it.

	[TODO] Describe where it resides, what files are involved.

--------------------------------------------------------------------------------

2) Break-down by functional organization

2.1) ~enstore/dcache-deploy/config/cold-start and cold-stop.

	These recent scripts define how to do a cold start and cold stop of an
	old-style dCache instance. They are maintained in each instance's config
	area to allow specialization to be captured in CVS as soon as possible,
	with efforts to make the scripts general again coming later when time
	permits. This is a break-down of what the scripts execute as another
	dimension along which to break down the dCache system description.

	1. dgang, rgang, rgang.py

		1. dgang is a script layer on top of rgang that adds
			dcache-specific functionality and ease-of-use. This is
			the expected administrator interface to rgang.

		2. rgang is a frozen python program built from rgang.py

		3. rgang.py is a means to execute commands across multiple
			nodes in a distributed dcache system. It can be thought of
			as rsh coupled with text files describing which nodes
			are to be contacted.

	2. config/*.farmlet - the means to describe what nodes play which roles
	in a dCache system. Typical files are:

		1. admin.farmlet  = head node, door nodes (NOT monitor nodes)
		2. dcache.farmlet = all nodes in system
		3. head.farmlet   = head node only
		4. monitoring.farmlet = monitor node(s) only
		5. pnfs.farmlet   = node where PnfsManager runs (NOT pnfsd)
		6. pool.farmlet   = any node hosting one or more pools
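
[ASIDE] To make the "rsh coupled with text files" picture concrete, here is a
minimal sketch of reading a farmlet and fanning a command out over its nodes.
It is not the rgang implementation. It assumes a farmlet is plain text with
one node name per line; the "#" comment handling, the node names in SAMPLE,
and the function names are all hypothetical. Nothing is executed: the sketch
only builds the argument lists that a remote shell would be given.

```python
def parse_farmlet(text):
    """Return node names from farmlet text (assumed: one node per line)."""
    nodes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # "#" comments are an assumption
        if line:
            nodes.append(line)
    return nodes


def fan_out(nodes, command, remote_shell="rsh"):
    """Build one argv list per node; nothing is executed here."""
    return [[remote_shell, node, command] for node in nodes]


# Hypothetical farmlet contents, for illustration only.
SAMPLE = """\
# hypothetical pool.farmlet
poolnode1
poolnode2

poolnode3
"""

if __name__ == "__main__":
    for argv in fan_out(parse_farmlet(SAMPLE), "uptime"):
        print(" ".join(argv))
```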

	3. postgresql-boot - starts a postgresql database server. At least the
		following aspects of dcache use postgresql:

		1. billing
		2. SRM
		3. future monitoring plots

	4. "service httpd start" - FNDCA uses a httpd rpm deployment of Apache
		httpd, which can be started easily by the "service" interface.

	5. tomcat-boot - starts a tomcat servlet container. At least the
		following aspects of dcache use tomcat:

		1. Vladimir's old-style monitoring plots
		2. SRM
		3. future monitoring plots

	6. kdcmux-boot - starts the FNAL-developed KDC multiplexer service. This
		service allows Java programs to spread their kerberos
		authentication requests across a suite of KDCs to break the
		limitation to exactly one KDC imposed by the Java Kerberos API.
		This is crucial to support the load of large kerberos-oriented
		dCache instances such as CDF dCache.

		[TODO] file(s) involved, description of operation.

	7. logger-boot - starts four Enstore-oriented services to support the
		merge/catenation of all logfiles into a single logfile.

		1. configuration_server
		2. log_server
		3. alarm_server
		4. event_relay

		[TODO] provide minimal description of what each does
		[TODO] relate this to: fifo, nofifo... where defaults defined

	8. PnfsManager start-up separate from General start-up

		It is a feature of FNAL production dCache deployment to have
		PnfsManager running on the PNFS server node. This improves overall
		performance since more data is transferred between the pnfs
		server and the PnfsManager than between the PnfsManager and the
		rest of dCache. The teststand for a production dCache instance
		(like FNDCAT is the teststand for FNDCA) runs on the same PNFS,
		but it must run its PnfsManager on its head node to have a
		distinct PnfsManager instance.

	9. General start-up

		dgang is used to do the general start up of dCache internal
		services called "Domains". Note that which domains are started
		on which node is defined at a lower level (than the farmlet
		files) in the dcache boot scripts. For instance:

		1. fndca-dcache-boot:
			processes="lm skmslm httpd adminDoor dCache billing"
		2. fndca1-dcache-boot:
			processes="door00 door01 doorK00 doorK01 doorG00
				doorG01 kerberizedftpdoor0 gridftpdoor0
				pinManager weakftpdoor0 srm"
		3. stkensrv1-dcache-boot:
			processes="pnfs cleaner"
			where pnfs here means the PnfsManager
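
[ASIDE] Because domain placement lives in the boot scripts rather than in the
farmlet files, it can be handy to answer "which boot script starts domain X?"
from one table. The sketch below copies the three processes= values quoted
above into a dictionary; the dictionary and the function names are
hypothetical, and a real tool would have to read the values out of the boot
scripts themselves.

```python
# Values copied from the boot scripts quoted above.
BOOT_PROCESSES = {
    "fndca-dcache-boot": "lm skmslm httpd adminDoor dCache billing",
    "fndca1-dcache-boot": ("door00 door01 doorK00 doorK01 doorG00 "
                           "doorG01 kerberizedftpdoor0 gridftpdoor0 "
                           "pinManager weakftpdoor0 srm"),
    "stkensrv1-dcache-boot": "pnfs cleaner",
}


def scripts_starting(domain, table=BOOT_PROCESSES):
    """Return the boot scripts whose processes list contains the domain."""
    return sorted(s for s, procs in table.items() if domain in procs.split())


def duplicated_domains(table=BOOT_PROCESSES):
    """Return domains listed by more than one boot script."""
    seen = {}
    for script, procs in table.items():
        for domain in procs.split():
            seen.setdefault(domain, []).append(script)
    return {d: sorted(s) for d, s in seen.items() if len(s) > 1}


if __name__ == "__main__":
    print(scripts_starting("srm"))
    print(duplicated_domains())
```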

	10. monitoring-boot - starts the FNAL-specific watchdogs and monitoring
		support system on the monitor node. This system uses the Unix
		"at" daemon to execute actions at regular intervals. Exactly
		which watchdogs or monitoring are executed varies between dcache
		instances and is defined in this boot script itself. Those
		which may be applicable to FNDCA include:

		Categories: according to primary deliverable of script
			Plotter - creates a web page for viewing
			Watchdog - sends e-mail if a condition is found
			Palliative - reduces impact of a weakness in dCache

		1.  login.list - plotter
			Calls scripts/listioalldoors.sh

			This script creates the "Login List" web page.
			http://fndca3a.fnal.gov/dcache/DOORS.html

		2.  moverls.list - plotter
			Calls scripts/moverls.sh 

			This script creates the "Pool Mover List" web page that
			is not linked from the main FNDCA web page.
			http://fndca3a.fnal.gov/dcache/moverls.html

		3.  restore.list - plotter
			Calls scripts/kill_restore_butincache.sh

			This script creates the "Restore Queue" web page.
			http://fndca3a.fnal.gov/dcache/RC.html

		4.  enabled.list - watchdog
			Calls scripts/check_poolenabled.sh

			This script checks for pools reported offline for at
			least 6 iterations of checking, and sends e-mail if any.
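
[ASIDE] The "at least 6 iterations" rule can be pictured as a per-pool
counter. This is a minimal sketch, not the logic of check_poolenabled.sh:
it assumes the 6 iterations must be consecutive and that a pool seen online
resets its count, neither of which is confirmed here. All names are
hypothetical, and sending the e-mail is left out.

```python
THRESHOLD = 6  # "at least 6 iterations", from the description above


def update_offline_counts(counts, offline_now, all_pools):
    """Advance the per-pool consecutive-offline counters by one iteration."""
    new_counts = {}
    for pool in all_pools:
        if pool in offline_now:
            new_counts[pool] = counts.get(pool, 0) + 1
        else:
            new_counts[pool] = 0  # assumed: seeing the pool online resets it
    return new_counts


def pools_to_report(counts, threshold=THRESHOLD):
    """Return the pools that have been offline for at least threshold checks."""
    return sorted(p for p, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    pools = ["pool_a", "pool_b"]  # hypothetical pool names
    counts = {}
    for iteration in range(1, 7):
        # pool_a stays offline; pool_b recovers on the sixth check
        offline = {"pool_a", "pool_b"} if iteration < 6 else {"pool_a"}
        counts = update_offline_counts(counts, offline, pools)
        print(iteration, pools_to_report(counts))
```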

		5.  postgres.list - watchdog
			Calls check_postgres

			This script checks that the needed postgres database
			instances are running, and sends e-mail if none are found.

		6.  queue.plot - plotter
			Calls scripts/queues.sh

			This script gathers information and creates plots of
			the queue levels for each pool in dcache: number of
			movers active or queued, number of restores active or
			queued, etc. After final processing done by a cronjob,
			these plots are shown on the web page:

			http://fndca3a.fnal.gov/dcache/dc_queue_plots.html

		7.  login.plot - plotter
			Calls scripts/logins.sh

			This script gathers information and creates plots of
			the login levels for each door in dcache. After final
			processing done by a cronjob, these plots are shown on
			the web page:

			http://fndca3a.fnal.gov/dcache/dc_login_plots.html

		8.  pool.stats - watchdog, plotter
			Calls status/updatePoolStatus.sh
			Calls status/updateDirectory.sh

			Involved in the processing of billing information to
			create a detailed statistics breakdown by pool or
			storage class with intermediate information stored in
			the sub-directory fndca:~enstore/dcache-statistics.
			Displayed on the web page:

			http://fndca3a.fnal.gov:2288/statistics

		9.  retry.Pool - watchdog
			Calls scripts/retry.waiting

			This script checks for excessive retries due to, for
			instance, Enstore backlogs, and sends e-mail if some are
			found.

		10. retry.P2P - plotter
			Calls scripts/retry.p2p

			This script creates the "dCache P2P Queue" web page that
			is not linked from the main FNDCA web page.
			http://fndca3a.fnal.gov/dcache/p2p.html

		11. retry.no-mover-found - watchdog, palliative
			Calls scripts/retry.NoMoverFound

			This script checks for cases where no mover is found for
			a request (?), and sends e-mail if some are found.

		12. kill_close_wait.sh - palliative
			Calls scripts/kill_close_wait

			CLOSE_WAIT is a socket state usually associated with a
			socket that has not been properly closed per protocol.
			This script cleans up unclosed sockets that can
			accumulate in a large dcache system, exhausting file
			descriptors. In principle, should not be needed with
			ideal dcache code or a low load dcache system (since a
			JVM does clean these up eventually).
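
[ASIDE] To see how many sockets are stuck in CLOSE_WAIT, one can count them
per local port from netstat-style output. The sketch below only counts; it
does not close or kill anything, and it is not how kill_close_wait works.
It assumes the six-column layout printed by Linux "netstat -tn". The sample
text, its addresses, and the port numbers are made up for illustration.

```python
from collections import Counter

# Made-up sample in the column layout of Linux "netstat -tn".
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:22125        192.0.2.77:40112        CLOSE_WAIT
tcp        0      0 192.0.2.10:22125        192.0.2.78:40355        ESTABLISHED
tcp        1      0 192.0.2.10:22125        192.0.2.79:51002        CLOSE_WAIT
tcp        1      0 192.0.2.10:2811         192.0.2.80:39001        CLOSE_WAIT
"""


def close_wait_by_port(netstat_text):
    """Count CLOSE_WAIT sockets per local port in netstat-style text."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if (len(fields) >= 6 and fields[0].startswith("tcp")
                and fields[5] == "CLOSE_WAIT"):
            port = fields[3].rsplit(":", 1)[1]  # text after the last colon
            counts[port] += 1
    return dict(counts)


if __name__ == "__main__":
    for port, count in sorted(close_wait_by_port(SAMPLE).items()):
        print(port, count)
```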

2.2) Cronjobs

	Categories: according to primary deliverable of script
		Admin - manages logs, cleans up disk space, or makes back-ups
		Plotter - creates a web page for viewing
		Watchdog - sends e-mail if a condition is found
		Palliative - reduces impact of a weakness in dCache

	1. crontab.enstore.fndca

		1. make_queue_plots.sh - plotter
			Runs dcache_make_queue_plot_page.py

			This script takes the information gathered by an "at"
			job as input to formally create the web page:

			http://fndca3a.fnal.gov/dcache/dc_queue_plots.html

		2. make_login_plots.sh - plotter
			Runs dcache_make_login_plot_page.py

			This script takes the information gathered by an "at"
			job as input to formally create the web page:

			http://fndca3a.fnal.gov/dcache/dc_login_plots.html

		3. Billing.summary - plotter

			This script creates the FNAL "Billing" web page, listing
			actions per day (distinct from DESY Billing web page).

			http://fndca3a.fnal.gov/dcache/billing.html

		4. repls start - plotter

			This script initiates the creation of a file listing for
			each pool. Some processing is done and the intermediate
			results are stored on each pool node.

		5. repls copy - plotter

			This script gathers the results from each pool node into
			the master web listing found at:

			http://fndca3a.fnal.gov/dcache/files

			Note: we are accepting that sometimes the "repls start"
			may not be done when we do the copy hours later... very
			rare, but possible. The alternative would be to have
			authorized keys for each pool node entered on the head
			node which is fragile (nodes come/go) and unwieldy.

		6. lifetime start - plotter

			This script initiates the creation of file lifetime
			lists for each pool. Some processing is done and the
			intermediate results are stored on each pool node.

		7. lifetime copy - plotter

			This script gathers the results from each pool node
			and creates plots for the web page at:

			http://fndca3a.fnal.gov/dcache/dc_lifetime_plots.html

			Note: we are accepting that sometimes the "lifetime start"
			may not be done when we do the copy hours later... very
			rare, but possible. The alternative would be to have
			authorized keys for each pool node entered on the head
			node which is fragile (nodes come/go) and unwieldy.

		8. ftp_gather - plotter

			This script gathers the FTP logs from door nodes to the
			admin node. These logs are used to create the "Recent
			FTP Transfers on fndca" web page using:

			http://fndca3a.fnal.gov/cgi-bin/dcache_files.py

	2. crontab.enstore.fndcam

		1. pg_dumpall - admin

			Back up the postgresql database on monitor node.

	3. crontab.root.fndca

		1. move-old-logs - admin

			Clean up old FTP transfer and related logs to the
			ftp-tlog-old directory (where they will no longer be
			visible on the Recent FTP Transfers web page).

			[TODO] There may be a weakness here. I recall not seeing
			new directories created in the "old" ftp log area to
			accommodate new users. The script assumed only files
			would be moved, not directory trees. This should be
			confirmed and fixed.
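
[ASIDE] If the weakness is confirmed, one possible fix is to create the
destination directory on demand before each move. The following is a minimal
sketch of that idea, not the move-old-logs script: the function names are
hypothetical, the selection of "old" files by age is left out, and the demo
runs in a throwaway directory. It assumes the destination tree is not located
inside the source tree.

```python
import os
import shutil
import tempfile


def move_tree_files(src_root, dst_root):
    """Move every file under src_root to the same relative path in dst_root."""
    moved = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)  # create per-user dirs on demand
        for name in filenames:
            shutil.move(os.path.join(dirpath, name),
                        os.path.join(dst_dir, name))
            moved.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(moved)


def demo():
    """Move a tiny log tree that includes a directory for a new user."""
    with tempfile.TemporaryDirectory() as top:
        src = os.path.join(top, "ftp-tlog")
        dst = os.path.join(top, "ftp-tlog-old")
        os.makedirs(os.path.join(src, "newuser"))
        for rel in ("b.log", os.path.join("newuser", "a.log")):
            with open(os.path.join(src, rel), "w") as handle:
                handle.write("x\n")
        moved = move_tree_files(src, dst)
        landed = os.path.isfile(os.path.join(dst, "newuser", "a.log"))
        return moved, landed


if __name__ == "__main__":
    print(demo())
```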


		2. move-ftp-cert-logs - admin

			Clean up old FTP transfer C=US items. I am not familiar
			with what these are, though.

		3. tmpwatch - admin

			Clean up statistics and support file areas, but do not
			remove the doc stored in the statistics sub-directory.

		4. check_port_block - watchdog

			Check that the appropriate ipchains (personal firewall)
			blocks on the use of certain dcache ports from offsite
			are in place. Example: unsecured dcap ports are supposed
			to be blocked from use by nodes outside of fnal.gov.

	4. crontab.root.fndca1

		1. tmpwatch - admin

			Clean up old FTP logs still on door node.

	5. crontab.root.fndcam

		1. real-encp-cleanup kickoff - admin

			Initiate the clean up of old logs left by real-encp on
			each of the pool nodes.

		2. check_crc kickoff - watchdog

			Initiate the check of CRC of almost every file in cache
			on each of the pool nodes. Only files older than 12
			hours in cache are checked.
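
[ASIDE] The 12-hour rule can be sketched as an age filter applied before any
checksum is computed. This is illustrative only and is not the check_crc
code. It assumes the file modification time is a usable stand-in for "time
in cache", and it uses zlib.adler32 purely as a placeholder checksum; what
the real check compares is still to be described. All names are hypothetical.

```python
import os
import tempfile
import time
import zlib

MIN_AGE_SECONDS = 12 * 3600  # "older than 12 hours", from the text above


def files_due_for_check(pool_dir, now=None, min_age=MIN_AGE_SECONDS):
    """Return paths of regular files whose mtime is older than min_age."""
    now = time.time() if now is None else now
    due = []
    for name in sorted(os.listdir(pool_dir)):
        path = os.path.join(pool_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > min_age:
            due.append(path)
    return due


def checksum(path, chunk_size=1 << 20):
    """Placeholder checksum (zlib.adler32), computed chunk by chunk."""
    value = 1  # standard Adler-32 starting value
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(block, value)
    return value & 0xFFFFFFFF


def demo():
    """Create one 13-hour-old and one 1-hour-old file; check only the old."""
    with tempfile.TemporaryDirectory() as pool:
        now = time.time()
        for name, age_hours in (("old.dat", 13), ("new.dat", 1)):
            path = os.path.join(pool, name)
            with open(path, "wb") as handle:
                handle.write(b"hello")
            stamp = now - age_hours * 3600
            os.utime(path, (stamp, stamp))
        due = files_due_for_check(pool, now=now)
        return [(os.path.basename(p), checksum(p)) for p in due]


if __name__ == "__main__":
    print(demo())
```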
			
			[TODO] more detail on what is really compared in this.


2.3) DCache Call-outs

	1. Enstore restores/stores: real-encp, encp.options

2.4) Vladimir's monitoring plots

	1. old-style tomcat-based (web.xml)

	2. Lazlo-based

2.5) pagedcache

2.6) Operational Utilities: poolcmd, doorcommand, pathfinder, real-encp.sh,...

2.7) Dmitry's PNFS consistency incremental scan

	http://www-stken.fnal.gov/enstore/dcache_monitor

2.8) Miscellaneous utilities: dropit (copied from a FNAL FUE product)

--------------------------------------------------------------------------------

3) Supporting Infrastructure not covered elsewhere

3.1) authorized_keys

3.2) kerberos service credentials

3.3) /etc/grid-security

3.4) CVS: HPPC and a little on DESY (now in transition to subversion)

--------------------------------------------------------------------------------

4) Pool, Cost Model, and Timeout Configuration

	[TODO] PoolManager.conf and pool.batch explained

.the end.
by Robert Kennedy last modified 2007-01-10 11:08