FNDCA Components
An overview of the components, scripts, cronjobs, and deployment approach used in the FNDCA system in the current (old-style, not rpm-based) deployment configuration. This is a draft intended to help the effort to evolve the configuration to an rpm-based deployment organization.
Fndca Infrastructure and Organization
Rough Draft 2
19 Dec 2006
RDK

[ASIDE] Items to be sure to cover, not yet obvious in text at this time
  *) Requirements for nodes running dcache, pgsql, tomcat, srm (new style).
  *) How "at" jobs interact with the cronjobs to achieve web pages
  *) How CRC check is done in detail

Historical FNAL Deployment Organization
=======================================

1) Overall structure broken down along the file system organization

1.1) ~enstore is the home of dcache installations, based on the history
     of dCache being brought to FNAL by Enstore developers. It contains a
     collection of directories to support deployment, a few utilities to
     support developers, and some accumulated junk. Noteworthy are:

     1. dcache-deploy is the main body of dcache deployment, described below
     2. dcache-log contains the dCache log files
        These may be local one-per-domain ("nofifo") or merged ("fifo")
     3. dcache-billing, dcache-ftp-tlog*, dcache-logins, dcache-queues, ...
        contain plain text records and logs used to support web pages
        developed by FNAL to track billing, FTP transfers, etc.
     4. JAVA contains Sun JDK rpms and any post-install tweaks
     5. TOOLS contains developer tools like jprofiler
     6. jvmstat* contain mostly outdated Sun jvmstat installations,
        should be moved into TOOLS
     7. pgsql (not everywhere) points to the current PostgreSQL deployment
     8. tomcat (not everywhere) points to the current Tomcat deployment
     9. .bashrc, .bash_profile: set up the base environment for scripts
        (true for ~root as well)
    10. unix.uid.list: manually updated mapping of usernames to uids at FNAL

1.2) ~enstore/dcache-deploy is a symlink to the base of FNAL dcache
     software infrastructure. In all known instances, it points to
     ~enstore/dcache-code. ~enstore/dcache-deploy contains:

     1. classes -> dcache-fermi-config/jars/20061114-1637utc-1.7.0-19-1.5.0_09/
        jar files containing actual dCache service classes
     2. config -> dcache-fermi-config/fndca/
        configuration files for this instance of dCache (fndca)
     3. dcache-fermi-config: base directory of most dcache service contents
        3.1. the directory itself contains a disorganized ad hoc collection
             of scripts and utilities used in dcache operation in addition
             to what is in its scripts sub-directory (7).
        3.2. fndca, fndcat, cms, cdfen
             Base directories for per-instance configuration.
             Note that cdfen = CDFDCA, cdftest = CDFDCAT.
        3.3. jars
             dCache software "classes" releases built at FNAL.
        3.4. A LOT of other stuff, some ref'd by symlinks from elsewhere
     4. docs -> dcache-fermi-config/docs/
        not described
     5. gsint -> dcache-fermi-config/gsint/
        not described
     6. jobs -> dcache-fermi-config/jobs/
        start-up wrappers for dCache JVMs/Domains. All changed in CVS.
     7. scripts -> dcache-fermi-config/scripts/
        palliative and administrative support scripts

1.3) /etc/rc.d/init.d contains symlinks and boot scripts for dcache
     components such as dcache-boot, postgres-boot, tomcat-boot,
     monitoring-boot.

1.4) /usr/local/bin contains symlinks to utility programs and some boot
     scripts

1.5) crontabs. The config areas contain crontabs for users root and
     enstore on the head and monitoring nodes, user enstore on door nodes,
     and possibly more. These are covered in more detail in section 2.2.

1.6) /fnal/ups/prd/www_pages is the "home page" area of the dcache WWW
     page. It only exists on the head node serving the traditional "outer"
     dcache web page created and supported by FNAL and run under apache
     httpd. Much [TODO]

     1. HTTPD non-default configuration settings
     2. index.html, robots.txt are the usual httpd basics: the default page
        and web page indexing block respectively.
     3. dcache
        3.1 *.gif, *.jpg - graphics and images from DESY
        3.2 billing
        3.3 diskList, diskList.save
        3.4 files, files.save
        3.5 lifetime
        3.6 logins
        3.7 queue
        3.8 running.html - "Running dCache"
        3.9 Mapping of monitoring generated files to web pages:
            "Daily Billing"
              http://fndca3a.fnal.gov/dcache/billing.html
            "File Lifetime Plots"
              http://fndca3a.fnal.gov/dcache/dc_lifetime_plots.html
            "Login Plots"
              http://fndca3a.fnal.gov/dcache/dc_login_plots.html
            "Queue Plots"
              http://fndca3a.fnal.gov/dcache/dc_queue_plots.html
            "Login List"
              http://fndca3a.fnal.gov/dcache/DOORS.html
            "P2P Queue" (unlinked)
              http://fndca3a.fnal.gov/dcache/p2p.html
            "Restore Queue"
              http://fndca3a.fnal.gov/dcache/RC.html
            "Active Transfers"
              http://fndca3a.fnal.gov/dcache/transfers.html
        3.10 http://fndca2a.fnal.gov:8090/dcache/lsplots
        3.11 http://fndca3a.fnal.gov/dcache/files
        3.12 http://fndca3a.fnal.gov/cgi-bin/dcache_files.py
        3.13 http://fndca3a.fnal.gov:2288/statistics

1.7) ~enstore/enstore contains the Enstore code distribution from CVS. It
     is used to build Enstore utilities in place that are used by dCache
     such as: encp, ecrc, config_server, log_server, and a few minor
     scripts. There are several ways to set up this area, each of which
     could put different executables in the PATH on different instances of
     dCache.

     The encp and ecrc tools are now available in a client package. The
     configuration server, log server, alarm_server, and event relay
     programs are not available in a separately redistributable client
     package. These servers are used primarily (but not exclusively) to
     collect logfile output across a distributed dCache instance and merge
     all into a single rotating logfile. While this is convenient for
     reviewing logs, this logging mechanism and its use have weaknesses.
     For instance, dCache uses it synchronously, so if a log disk fills,
     the whole service stops. It is not actively supported distinct from
     Enstore, so its use should be reconsidered.
     CMS no longer uses this mechanism, and we may consider dropping it as
     well for FNDCA to simplify the deployment re-organization project. A
     follow-up project may consider an alternative framework to achieve the
     same functionality.

1.8) pagedcache --- Eileen is working on this, upgrading it.
     [TODO] Describe where it resides, what files are involved.

--------------------------------------------------------------------------------

2) Break-down by functional organization

2.1) ~enstore/dcache-deploy/config/cold-start and cold-stop.
     These recent scripts define how to do a cold start and cold stop of an
     old-style dCache instance. They are maintained in each instance's
     config area to allow specialization to be captured in CVS as soon as
     possible, with efforts to make the scripts general again coming later
     when time permits. This is a break-down of what the scripts execute as
     another dimension along which to break down the dCache system
     description.

     1. dgang, rgang, rgang.py
        1. dgang is a script layer on top of rgang that adds dcache-specific
           functionality and ease-of-use. This is the expected administrator
           interface to rgang.
        2. rgang is a frozen python program built from rgang.py
        3. rgang.py is a means to execute commands across multiple nodes in
           a distributed dcache system. It can be thought of as rsh coupled
           with text files describing which nodes are to be contacted.

     2. config/*.farmlet - the means to describe what nodes play which roles
        in a dCache system. Typical files are:
        1. admin.farmlet      = head node, door nodes (NOT monitor nodes)
        2. dcache.farmlet     = all nodes in system
        3. head.farmlet       = head node only
        4. monitoring.farmlet = monitor node(s) only
        5. pnfs.farmlet       = node where PnfsManager runs (NOT pnfsd)
        6. pool.farmlet       = any node hosting one or more pools

     3. postgres-boot - starts a postgresql database server. At least the
        following aspects of dcache use postgresql:
        1. billing
        2. SRM
        3. future monitoring plots

     4. "service httpd start" - FNDCA uses an httpd rpm deployment of Apache
        httpd, which can be started easily by the "service" interface.

     5. tomcat-boot - starts a tomcat servlet container. At least the
        following aspects of dcache use tomcat:
        1. Vladimir's old-style monitoring plots
        2. SRM
        3. future monitoring plots

     6. kdcmux-boot - starts the FNAL-developed KDC multiplexer service.
        This service allows Java programs to spread their kerberos
        authentication requests across a suite of KDCs to break the
        limitation to exactly one KDC imposed by the Java Kerberos API.
        This is crucial to support the load of large kerberos-oriented
        dCache instances such as CDF dCache.
        [TODO] file(s) involved, description of operation.

     7. logger-boot - starts four Enstore-oriented services to support the
        merge/catenation of all logfiles into a single logfile.
        1. configuration_server
        2. log_server
        3. alarm_server
        4. event_relay
        [TODO] provide minimal description of what each does
        [TODO] relate this to: fifo, nofifo... where defaults defined

     8. PnfsManager start-up separate from General start-up
        It is a feature of FNAL production dCache deployment to have
        PnfsManager running on the Pnfs server node. This improves overall
        performance since more data is transferred between the pnfs server
        and the PnfsManager than between the PnfsManager and the rest of
        dCache. The teststand for a production dCache instance (like FNDCAT
        is the teststand for FNDCA) runs on the same PNFS, but it must run
        its PnfsManager on its head node to have a distinct PnfsManager
        instance.

     9. General start-up
        dgang is used to do the general start-up of dCache internal
        services called "Domains". Note that which domains are started on
        which node is defined at a lower level (than the farmlet files) in
        the dcache boot scripts. For instance:
        1. fndca-dcache-boot:
           processes="lm skmslm httpd adminDoor dCache billing"
        2. fndca1-dcache-boot:
           processes="door00 door01 doorK00 doorK01 doorG00 doorG01
           kerberizedftpdoor0 gridftpdoor0 pinManager weakftpdoor0 srm"
        3. stkensrv1-dcache-boot:
           processes="pnfs cleaner"
           where pnfs here means the PnfsManager

    10. monitoring-boot - starts the FNAL-specific watchdogs and monitoring
        support system on the monitor node. This system uses the Unix "at"
        daemon to execute actions at regular intervals. Exactly which
        watchdogs or monitoring are executed varies between dcache
        instances and is defined in this boot script itself. Those which
        may be applicable to FNDCA include:

        Categories: according to primary deliverable of script
          Plotter    - creates a web page for viewing
          Watchdog   - sends e-mail if a condition is found
          Palliative - reduces impact of a weakness in dCache

         1. login.list - plotter
            Calls scripts/listioalldoors.sh
            This script creates the "Login List" web page.
            http://fndca3a.fnal.gov/dcache/DOORS.html

         2. moverls.list - plotter
            Calls scripts/moverls.sh
            This script creates the "Pool Mover List" web page that is not
            linked from the main FNDCA web page.
            http://fndca3a.fnal.gov/dcache/moverls.html

         3. restore.list - plotter
            Calls scripts/kill_restore_butincache.sh
            This script creates the "Restore Queue" web page.
            http://fndca3a.fnal.gov/dcache/RC.html

         4. enabled.list - watchdog
            Calls scripts/check_poolenabled.sh
            This script checks for pools reported offline for at least 6
            iterations of checking, and sends e-mail if any are found.

         5. postgres.list - watchdog
            Calls check_postgres
            This script checks that the needed postgres database instances
            are running, and sends e-mail if none are found.

         6. queue.plot - plotter
            Calls scripts/queues.sh
            This script gathers information and creates plots of the queue
            levels for each pool in dcache: number of movers active or
            queued, number of restores active or queued, etc. After final
            processing done by a cronjob, these plots are shown on the web
            page:
            http://fndca3a.fnal.gov/dcache/dc_queue_plots.html

         7. login.plot - plotter
            Calls scripts/logins.sh
            This script gathers information and creates plots of the login
            levels for each door in dcache. After final processing done by
            a cronjob, these plots are shown on the web page:
            http://fndca3a.fnal.gov/dcache/dc_login_plots.html

         8. pool.stats - watchdog, plotter
            Calls status/updatePoolStatus.sh
            Calls status/updateDirectory.sh
            Involved in the processing of billing information to create a
            detailed statistics breakdown by pool or storage class, with
            intermediate information stored in the sub-directory
            fndca:~enstore/dcache-statistics. Displayed on the web page:
            http://fndca3a.fnal.gov:2288/statistics

         9. retry.Pool - watchdog
            Calls scripts/retry.waiting
            This script checks for excessive retries due to, for instance,
            Enstore backlogs, and sends e-mail if some are found.

        10. retry.P2P - plotter
            Calls scripts/retry.p2p
            This script creates the "dCache P2P Queue" web page that is not
            linked from the main FNDCA web page.
            http://fndca3a.fnal.gov/dcache/p2p.html

        11. retry.no-mover-found - watchdog, palliative
            Calls scripts/retry.NoMoverFound
            This script checks for cases where no mover is found for a
            request (?), and sends e-mail if some are found.

        12. kill_close_wait.sh - palliative
            Calls scripts/kill_close_wait
            CLOSE_WAIT is a socket state usually associated with a socket
            that has not been properly closed per protocol. This script
            cleans up unclosed sockets that can accumulate in a large
            dcache system, exhausting file descriptors. In principle, it
            should not be needed with ideal dcache code or a low load
            dcache system (since a JVM does clean these up eventually).

2.2) Cronjobs

     Categories: according to primary deliverable of script
       Admin      - manages logs, cleans up disk space, or makes back-ups
       Plotter    - creates a web page for viewing
       Watchdog   - sends e-mail if a condition is found
       Palliative - reduces impact of a weakness in dCache

     1. crontab.enstore.fndca

        1. make_queue_plots.sh - plotter
           Runs dcache_make_queue_plot_page.py
           This script takes the information gathered by an "at" job as
           input to formally create the web page:
           http://fndca3a.fnal.gov/dcache/dc_queue_plots.html

        2. make_login_plots.sh - plotter
           Runs dcache_make_login_plot_page.py
           This script takes the information gathered by an "at" job as
           input to formally create the web page:
           http://fndca3a.fnal.gov/dcache/dc_login_plots.html

        3. Billing.summary - plotter
           This script creates the FNAL "Billing" web page, listing actions
           per day (distinct from DESY Billing web page).
           http://fndca3a.fnal.gov/dcache/billing.html

        4. repls start - plotter
           This script initiates the creation of a file listing for each
           pool. Some processing is done and the intermediate results are
           stored on each pool node.

        5. repls copy - plotter
           This script gathers the results from each pool node into the
           master web listing found at:
           http://fndca3a.fnal.gov/dcache/files
           Note: we are accepting that sometimes the "repls start" may not
           be done when we do the copy hours later... very rare, but
           possible. The alternative would be to have authorized keys for
           each pool node entered on the head node, which is fragile (nodes
           come/go) and unwieldy.

        6. lifetime start - plotter
           This script initiates the creation of file lifetime lists for
           each pool. Some processing is done and the intermediate results
           are stored on each pool node.

        7. lifetime copy - plotter
           This script gathers the results from each pool node and creates
           plots for the web page at:
           http://fndca3a.fnal.gov/dcache/dc_lifetime_plots.html
           Note: we are accepting that sometimes the "repls start" may not
           be done when we do the copy hours later... very rare, but
           possible. The alternative would be to have authorized keys for
           each pool node entered on the head node, which is fragile (nodes
           come/go) and unwieldy.

        8. ftp_gather - plotter
           This script gathers the FTP logs from door nodes to the admin
           node.
           These logs are used to create the "Recent FTP Transfers on
           fndca" web page using:
           http://fndca3a.fnal.gov/cgi-bin/dcache_files.py

     2. crontab.enstore.fndcam

        1. pg_dumpall - admin
           Back up the postgresql database on monitor node.

     3. crontab.root.fndca

        1. move-old-logs - admin
           Clean up old FTP transfer and related logs to the ftp-tlog-old
           directory (where they will no longer be visible on the Recent
           FTP Transfers web page).
           [TODO] There may be a weakness here. I recall not seeing new
           directories created in the "old" ftp log area to accommodate new
           users. The script assumed only files would be moved, not
           directory trees. This should be confirmed and fixed.

        2. move-ftp-cert-logs - admin
           Clean up old FTP transfer C=US items. I am not familiar with
           what these are, though.

        3. tmpwatch - admin
           Clean up statistics and support file areas, but do not remove
           the doc stored in the statistics sub-directory.

        4. check_port_block - watchdog
           Check that the appropriate ipchains (personal firewall) blocks
           on the use of certain dcache ports from offsite are in place.
           Example: unsecured dcap ports are supposed to be blocked from
           use by nodes outside of fnal.gov.

     4. crontab.root.fndca1

        1. tmpwatch - admin
           Clean up old FTP logs still on door node.

     5. crontab.root.fndcam

        1. real-encp-cleanup kickoff - admin
           Initiate the clean-up of old logs left by real-encp on each of
           the pool nodes.

        2. check_crc kickoff - watchdog
           Initiate the check of CRC of almost every file in cache on each
           of the pool nodes. Only files older than 12 hours in cache are
           checked.
           [TODO] more detail on what is really compared in this.

2.3) DCache Call-outs
     1. Enstore restores/stores: real-encp, encp.options

2.4) Vladimir's monitoring plots
     1. old-style tomcat-based (web.xml)
     2. Lazlo-based

2.5) pagedcache

2.6) Operational Utilities: poolcmd, doorcommand, pathfinder,
     real-encp.sh,...
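
[ILLUSTRATION] The "check_crc kickoff" cronjob in section 2.2 is described
only by its selection rule: files in cache older than 12 hours are checked.
The sketch below shows that rule in Python. It is not the actual check_crc
script, whose details are still a [TODO] above. Everything beyond the
12-hour threshold is an assumption made for illustration: file modification
time stands in for "time in cache", Adler-32 from Python's zlib stands in
for whatever checksum is really compared, and the "expected" mapping of
path to recorded checksum is hypothetical.

```python
import os
import time
import zlib

MIN_AGE_SECONDS = 12 * 60 * 60  # "older than 12 hours", from the text above


def files_old_enough(pool_dir, now=None, min_age=MIN_AGE_SECONDS):
    """Return sorted paths of regular files under pool_dir older than min_age.

    Assumption: modification time is used as a proxy for time in cache.
    """
    now = time.time() if now is None else now
    selected = []
    for root, _dirs, names in os.walk(pool_dir):
        for name in names:
            path = os.path.join(root, name)
            if os.path.isfile(path) and now - os.path.getmtime(path) > min_age:
                selected.append(path)
    return sorted(selected)


def file_checksum(path, chunk_size=1 << 20):
    """Adler-32 of a file, read in chunks (stand-in algorithm, an assumption)."""
    value = 1  # zlib's Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def find_mismatches(pool_dir, expected, now=None):
    """Return paths whose checksum differs from expected[path].

    `expected` is a hypothetical mapping of path -> recorded checksum;
    files without a recorded value are skipped rather than reported.
    """
    bad = []
    for path in files_old_enough(pool_dir, now=now):
        if path in expected and file_checksum(path) != expected[path]:
            bad.append(path)
    return bad
```

A watchdog in the sense used in this document would send e-mail when
find_mismatches() returns a non-empty list. Skipping recently written files
avoids checking files that may still be in transfer, which is one plausible
reason for the 12-hour rule.
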
2.7) Dmitry's PNFS consistency incremental scan
     http://www-stken.fnal.gov/enstore/dcache_monitor

2.8) Miscellaneous utilities: dropit (copied from a FNAL FUE product)

--------------------------------------------------------------------------------

3) Supporting Infrastructure not covered elsewhere

3.1) authorized_keys
3.2) kerberos service credentials
3.3) /etc/grid-security
3.4) CVS: HPPC and a little on DESY (now in transition to subversion)

--------------------------------------------------------------------------------

4) Pool, Cost Model, and Timeout Configuration

[TODO] PoolManager.conf and pool.batch explained

.the end.