D0en SDE upgrade plan
This is a draft of the upgrade plan for the SDE upgrade of Dzero Enstore, scheduled for the Nov 6th downtime. This document is still being updated. It will be reviewed at a meeting on Thurs Nov 1st at 1:30pm.
D0en SDE upgrade plan

Last Updated: 1 Nov 2007 11:15
Scheduled Date: Tues 06 Nov 2007

Synopsis:

On Tuesday, Nov 6th, we need to move all D0 Enstore services from the
existing d0ensrvN nodes to the new SDE server configuration. All of the
existing server processes will migrate to SDE nodes. All of the existing
databases and other configuration and data files needed will be migrated
to the new RAID systems.

Note that during this downtime, ADIC technicians will be on-site
replacing the Robot#2 ARM. This is scheduled for 0700-1700. This work by
ADIC is still tentative, depending on the arrival of necessary
replacement hardware.

Roles:

All e-mail with questions, comments or updates related to this plan
should be directed to 'Enstore-Admin@fnal.gov'.

??? Question for everyone: Are these numbers correct? I'd like a pager
    or alternate number for everyone on this list, please.

SSA Downtime Coordinator:    Ken Schumacher x4579, pgr 630-905-1149
SSA Systems Admin:           Mike Harrison x8651, pgr 630-218-32435
SSA Monitoring/Test:         David Berg x3021, pgr 630-722-0051
DMD Enstore team lead:       Sasha Moibenko x3937, alt?
DMD Enstore Services Admin:  Mike Zalokar x6289, alt?
DMD PNFS Database Admin:     Vladimir Podstavkov x2855, alt?
DSS Mgmt:                    Gene Oleynik x6805, pgr 630-218-3136
DMS Mgmt:                    Matt Crawford x3461, alt?
DMS Head:                    Don Petravick x3935 (or x5214), pgr 630-314-5437

Milestones:

* Thurs Nov 1st: Review this plan, modify as needed and accept. We will
  need management commitment to the defer and fall-back criteria.
  Meeting to be led by Gene.
* Tues Nov 6th: All hardware must be on-line and available and all
  pre-install tasks completed. Also, Enstore must be done writing any
  precious files and stopped before we begin the upgrade.
* Successful dump and migration of database dumps, log files and plot
  directories.
* DMD Department needs to complete successful testing of the new Enstore
  configuration in the defined minimal test configuration (as defined
  below).
* Experiment needs to complete successful testing of the new Enstore
  configuration after we bring up the full production configuration.

??? Question for everyone: Are there any additional milestones or
    Go/No-go decision points that should be defined here?

NOTE: If any of these milestones is not satisfactorily completed as
scheduled, this upgrade will need to be deferred (rescheduled) and any
necessary fall-back to the previous installation will need to be
implemented. See the Fall-back section of this document, found at the
end of the Detailed Upgrade Process section which follows.

??? Question for Gene: Are you the person who would make any No-go
    decision? If not you, then who has the authority to make that call?

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DETAILED PROCESS FOR THE UPGRADE:
----------------------------
Prior to scheduled downtime:

* Verify that all hardware is ready and properly registered.
___* There are outstanding TISSUE incidents against our service names.
     These appear to be false positives from net scans. How do we get
     them cleared?
   * Mike Z contacted Randy Reitz and sent mail to CST. Vladimir Brovove
     is supposed to correct this. This is a false positive report
     because these are alias names and not formal node names.
   * Mike will forward any follow-up e-mails that he gets back from CST.
   * Stan has asked Mike Z to request an update from Networking.
___* Ken to follow up (if needed) with Networking.

??? Question for Mike Z: Have you heard anything more from Networking or
    CST regarding these TISSUE incidents?

??? Question for Mike H: Are the SATA blades configured to do regularly
    scheduled surface scans and to report any problems seen? Are these
    SATA blades sending e-mail to ssa-auto to report any significant
    events?
___* We need to determine who owns this action item. This should be
     taken care of and verified by COB Friday.

???
Question for Mike H or SSA: Has someone verified the BIOS settings for
    Hyperthreading off and the power-up action upon restoration of
    power?
   * Per Mike Z, we believe Hyperthreading is disabled already.
___* We need to determine who owns this action item. This should be
     taken care of and verified by COB Friday.

* Verify the public network and private network connections.
  - All five SDE nodes should have private network connections.
  - All five SDE nodes should have public network connections with
    aliases.
___* Mike Z will provide a mapping of hostnames to service aliases to be
     used in the new SDE configuration by COB Thurs.
___* Ken will add a table of alias assignments to this document (based
     on the mapping to be provided by Mike Z).
___* Mike will draft a list (by noon Friday) of the network registration
     changes that will need to be submitted the day prior to the
     upgrade.
___* SSA will register new names (i.e. d0ensrv1old) for the existing
     servers to be brought up with new IPs. This will also involve
     getting new Kerberos host/ftp principals based on these new names.

??? Question for Enstore Developers: Will we also need to create the
    special Enstore principals to go with these new names? My guess is
    that we do not, since these nodes will not run Enstore services
    under these names.
----------------------------
Preparation of systems and pre-install software/config/data:

___* Install OS and all required RPMs via CFEngine on one host. Per
     Mike H, this will be tested again Thurs and/or Friday.

??? Question: I believe Mike H is testing this? Is that correct?

___* Install and configure enstore on each of the new servers for
     testing. Most of the configuration and bringing in of additional
     software is managed by CFEngine. After the test installation, the
     following need to be verified and status reported to the
     'Enstore-admin' list.
___* Configure mount points for the external RAIDs. Mike H is
     responsible for reporting on this. (As of Wed, this had not yet
     been tested.)
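A minimal sketch of what the mount-point verification above might look like; the mount point names here are placeholders (the plan elsewhere refers to paths like /srv2), so substitute the real list for each SDE node.

```shell
#!/bin/sh
# Sketch: report whether each expected external RAID mount point is
# mounted. The mount point list below is a placeholder -- substitute
# the real mount points for each SDE node.
status=0
for mp in /srv1 /srv2; do
    if grep -q " $mp " /proc/mounts; then
        echo "OK: $mp mounted"
    else
        echo "MISSING: $mp not mounted"
        status=1
    fi
done
```

Running this on each new server and mailing the output to 'Enstore-admin' would cover the reporting requirement.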
___* Install and configure PNFS. Vladimir is responsible for verifying
     this and reporting.
___* Install and configure dCache.
   - While there is no stand-alone dCache on Dzero, I understand there
     are minimal elements of dCache bundled with SDE. These will simply
     not be configured as dCache in this upgrade.
   - For future upgrades of STKen and CDFen, this is where the
     configuration of the dCache instances needs to occur or be
     verified.
___* Mike Z will prepare a configuration file for the minimal test
     configuration.
___* Mike Z will prepare a configuration file for the new SDE production
     configuration.
___* Test the enstore system (test-stand, non-production) under the
     minimal test configuration.
___* Test the enstore system (test-stand, non-production) in a full SDE
     configuration.
___* Ken will send e-mail to Mike Diesburg asking who will be the
     contact when Enstore Admin is ready to have DZero/SAM test this new
     Enstore configuration, before farm jobs and other users are allowed
     to restart using Enstore.
----------------------------
Test Data Migration:

Note: Chih-Hao has documented the database dump and restore processes at
https://plone4.fnal.gov/P0/Enstore_and_Dcache/developers/enstore-developers/screwdriverless-enstore/d0ensrv0-d0ensrv0n-databases-migration-on-november-6-2007

___* Vladimir will perform an initial dump of the PNFS database from the
     existing servers to disk on the new server. Est. is 2.5 hours.
   * Start Time: ____________ End Time: ____________ Size: ___________
___* Vladimir will perform a test of the restore of the PNFS database
     onto the SDE prototype nodes. Est. is 1.5 hours.
   * Start Time: ____________ End Time: ____________ Size: ___________
___* Mike Z will perform an initial dump of the three other Enstore
     databases (FileClerk/VC, Accounting and DriveStat). Est. is 30
     mins.
   * Start Time: ____________ End Time: ____________ Size: ___________
___* Mike Z will perform a test of the restore of these other databases
     onto the SDE prototype nodes.
     Est. is between 2 and 3 hours.
   * Start Time: ____________ End Time: ____________ Size: ___________
___* SSA (Ken?) will perform an initial rsync copy of the files from
     'd0ensrv2:/diska/enstore-log' (excluding the 'history'
     subdirectory) to 'd0ensrv2n:/srv2/enstore/enstore-log', preserving
     ownership and permissions and over-writing any files already in
     that area.

NOTE: An Appendix is included at the end of this document listing all of
the directories on the server RAID storage. After review by Sasha and
Ken, we believe everything that needs to move (other than databases) is
in this one directory.

___* This step should be scripted, one script per original server, to
     allow repeating the rsync copies with consistent options and
     parameters specified. Ken will develop this.
   * Start Time: ____________ End Time: ____________ Size: ___________
___* SSA (who?) needs to review the currently outstanding alarms on the
     D0en systems. Existing alarms will not migrate to the new
     installation. As many of the existing alarms as possible should be
     resolved before the upgrade.
--------------------------------------------------------------
Friday Nov 2nd:

* Ken (as Dzero Enstore Liaison) will e-mail a reminder to notify users
  of the downtime. This e-mail should be addressed to 'd0en-announce'
  with copies to 'enstore-admin' and 'helpdesk'.
  * This reminder needs to specifically request that users stop all jobs
    from running on Tuesday. Otherwise, their jobs will continually
    retry their Enstore operations. We want to avoid having any
    transfers queued up when Enstore is brought back on-line, until
    after testing is completed.
* SSA (Ken?) will perform another rsync copy of the files updated since
  the previous rsync copy of 'd0ensrv2:/diska/enstore-log'. Record the
  time needed for this incremental copy of one day's changes.
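The scripted rsync copy described above might look like the sketch below; the source and destination paths follow the plan, but the exact option set is an assumption Ken's script should pin down (-a preserves ownership and permissions; --exclude skips the 'history' subdirectory). Shown as a dry run that prints the command rather than executing it.

```shell
#!/bin/sh
# Dry-run sketch of the per-server rsync wrapper. Paths follow the
# plan; the option set is an assumption to be confirmed in the real
# script. Echoed rather than executed here.
SRC=d0ensrv2:/diska/enstore-log/
DST=d0ensrv2n:/srv2/enstore/enstore-log/
CMD="rsync -a --exclude history $SRC $DST"
echo "$CMD"
```

Re-running the same wrapper for the Friday, Monday and Tuesday incremental copies keeps the options consistent across all passes.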
* Start Time: ____________ End Time: ____________ Size: ___________
--------------------------------------------------------------
Mon Nov 5th:

___* Ken will configure NGOP for a full D0 Enstore scheduled downtime.
     Start time should be listed as 0630 with an estimated duration of
     12 hours.
___* SSA (Ken?) will perform another rsync copy of the files updated
     since the previous rsync copy of 'd0ensrv2:/diska/enstore-log'.
     Record the time needed for this incremental copy of the weekend's
     changes.
   * Start Time: ____________ End Time: ____________ Size: ___________
___* SSA will submit node name/IP swap registrations. This will update
     MISCOMP and prepare the proper files for the Tues 7:30am DNS
     reload.
___* SSA (??) will schedule the 'at' job which will start draining D0en
     at 5am on Tues.
--------------------------------------------------------------
Tues Nov 6th:

* 5 AM: Start draining of all D0en library managers. This step should
  be scheduled as an 'at' job.
___* Someone (SSA? DMD?) needs to verify that this worked.
--------------------------
During scheduled downtime:

* Actual switchover:
  - 0630 - 'enstore sched enstore --down' to prevent RedBall alarms.
___* Ken: Stop all existing Dzero Enstore services with 'enstore Estop'.
     Leave the databases on-line so they can be dumped. Stop all web
     servers on the old hosts.
___* Sasha: Check that everything did indeed stop, probably with an
     'rgang enstore EPS'.
************************************************************************
*** Milestone #1: Nothing proceeds until __who?___ confirms we are here.
************************************************************************
### Need to consider starting or holding these data migrations so that
### the DNS reload does not cause things to fail. If the nodes involved
### in these copies all have /etc/hosts and resolve from these local
### files before querying the DNS, we should be fine.

___* Vladimir: Dump existing PNFS DB from the old host, leaving the dump
     file(s) on the new host.
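For the PostgreSQL-backed Enstore databases, the dump steps in this switchover might look like the sketch below. The database names, user, and destination directory are placeholders, not confirmed production values; Chih-Hao's documented procedure (linked in the Test Data Migration section) is the authoritative reference. Printed as a dry run rather than executed.

```shell
#!/bin/sh
# Dry-run sketch: print the dump commands for the PostgreSQL-backed
# Enstore databases. Database names, user, and destination path are
# placeholders -- follow Chih-Hao's documented procedure for the real
# steps.
DEST=/srv0/DB_DUMP
for db in enstoredb accounting drivestat; do
    echo "pg_dump -U enstore -Fc $db -f $DEST/$db.dmp"
done
```

The -Fc (custom) format lets the corresponding pg_restore on the new host rebuild each database selectively.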
___* Mike Z: Dump the other Enstore DBs from the old hosts, leaving the
     dump file(s) on the new host.
___* SSA (??): Shut down the old hosts and change the host names on
     those retiring systems. That includes the /etc/hosts files as well
     as the network interface configuration files in
     '/etc/sysconfig/network-scripts'.
___* SSA (Ken?) will update the names of these retiring hosts in the
     d0ensrv3:/etc/consserver.cf configuration file. Also remove
     references to the d0ents0 terminal server from that configuration
     file.

# NOTE: The private subnet addresses for the existing D0en servers need
#       to be used with the new installation. (Where should this note
#       be in this document?) This shows up in the /etc/hosts file to
#       be installed on the new SDE D0en systems.

___* Mike Z: Verify that the new node names / IPs / CNAME definitions
     are in place as part of the normal 0730 DNS reload.
___* Vladimir: Load the new PNFS DB onto the new hosts.
___* Mike Z: Load the other databases onto the new hosts.
___* Ken: Rsync the 'enstore-log' files (excluding 'history') via
     script.
___* Verify that the old systems are operating under new names and new
     IP addresses. This is so none of the new services unintentionally
     connect to the older servers.
************************************************************************
*** Milestone #2: Nothing proceeds until __who?___ confirms we are here.
************************************************************************
- Bring up the new systems configuration in a minimal test
  configuration. Ensure that all library managers are in a "LOCKED"
  state and the system is still in the 'enstore sched enstore --down'
  state.

# NOTE: The minimal test configuration will consist of the following:
#       Library Managers: testlto2, TEST-9940B, samnull
#       Media Changers: samnull and test
#       Movers: Use 9940B27 in the TEST-9940B library
#               Use D41H in the testlto2 library
#       This config should include all null movers.

???
Question for Sasha: Is there a test media changer, or will this be
    defined as part of defining the minimal test configuration?
-------------------
Validation/Testing:

* Test Enstore functionality.
___* Developers should define the list of things to check.
* Test monitoring functionality (SAAG pages, NGOP, plots, etc.).
___* SSA should define this list of things to check.
* Test console server access (proper names in all prompts).
___* SSA will go through the full list of console ports configured and
     verify that the proper system prompt is seen.
* Per Stan, the service migration process is not ready for this D0en
  upgrade. In future upgrades, we will want to test the migration of one
  or more services from one server node to another at this point.
************************************************************************
*** Milestone #3: Nothing proceeds until __who?___ confirms we are here.
************************************************************************
* Upon successful validation and testing, enable all services and stop
  draining.
* Notify the appropriate users that they may begin their testing. Ask
  them to notify Enstore-admin@fnal.gov when they are done. Then farm
  processing can begin again.
************************************************************************
*** Milestone #4: No production use of Enstore until __who?___ confirms
***               we are here.
************************************************************************
* Notify the experiment (via the d0en-announce mailing list) that the
  update is completed and Enstore is available for use.

End of upgrade process.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Fall-back criteria:

( Need to define, in detail, the conditions that would warrant falling
  back to the previously installed instance of Enstore on the original
  servers. )

* If we fail to successfully dump and migrate the database dumps, log
  files and plot directories, we should simply restart the existing
  configuration.
* DMD Department needs to complete successful testing of the new Enstore
  configuration in the defined minimal test configuration. If our own
  testing fails, we need to implement a roll back to the previous
  configuration.
* Experiment needs to complete successful testing of the new Enstore
  configuration after we bring up the full production configuration. If
  the experiment reports that their tests are failing and we cannot
  resolve the issues promptly, we need to roll back to the previous
  configuration.
--------------------------------------------
Fall-back plans:

( Need to define and refine the steps necessary to roll back to the old
  instance of Enstore. This will be based in part on how far we get in
  the migration before deciding to roll back. )
----------
* We may need to make an arrangement with DataComm for an additional DNS
  reload in the case where we need to roll back. Two possible scenarios:
  1) We simply submit a set of updates to back out the changes submitted
     on Monday. Those changes will be processed Tues evening and be
     updated on the Site DNS Wed at 0730.
  2) We may need to submit these changes and arrange for a special DNS
     reload with these changes included.
___* We need to discuss the possibility of option #2 with Networking in
     advance.

# Note: One possible alternative would be to set all nodes to reference
#       the local hosts file prior to consulting the DNS or NIS. Then we
#       can go back to DNS-preferred mode on Wed after the new DNS
#       update which sets everything back to the original
#       mode/configuration.
----------
___* Rollback plan per Sasha: If we need to roll back, we will simply
     rename the original production nodes to what they were before we
     started.

NOTE: While I (Ken) agree in principle, we need to define this in more
detail.
- We need to list the sequence for turning off nodes in the failed
  upgrade, then rename systems (which files are updated to accomplish
  this),
- then what needs to be started and in what sequence.
- Will this also require restarting mover nodes?

What testing do we need to do on the restored system before we release
it to the users?

If any production Enstore commands were processed on the upgraded
installation and we need to fall back, we need to understand how we will
preserve any changes to production data in Enstore.
- For what number of tapes would we simply recover the databases by
  making manual changes?
- For what number of tapes would we be better off dumping and restoring
  the updated production databases (and, based on whatever reason we
  determined the need to roll back, can we trust the data)?
- What changes are needed in the testing of the upgrade to catch the
  problem which required the rollback the next time we attempt an
  upgrade?
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Appendix:

Note: These listings can be dropped from this document once we are sure
we know which files need to be migrated from the current nodes over to
the new systems.

Listing of directories as of 18:30 Tues 30 Oct 2007:

---------- d0ensrv0 ------------------------------
[root@d0ensrv0 root]# df -kl
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              9851340   1856384   7494532  20% /
none                   2043864         0   2043864   0% /dev/shm
/dev/sda1            241003008  19362716 221640292   9% /diska
/dev/sda2            241011072  28045976 212965096  12% /diskb
[root@d0ensrv0 root]# ls -lAF /diska
total 32
drwx------   10 products products     4096 Sep 10 10:42 accounting-db/
drwx------   10 products products     4096 Mar  2  2007 accounting-db.saved/
drwxrwxrwx    2 products root         4096 Oct 30 14:17 DB_DUMP/
drwx------   10 products products     4096 Sep 10 10:42 drivestat-db/
drwx------   10 products products     4096 Mar  2  2007 drivestat-db.saved/
drwxr-xr-x    2 enstore  enstore      4096 Oct 30 18:13 enstore-journal/
drwxr-xr-x    2 products products     4096 Mar  2  2007 enstore-journal.saved/
drwxr-xr-x    2 root     root         4096 Sep 10  2003 lost+found/
[root@d0ensrv0 root]# ls -lAF /diskb
total 48
drwxr-xr-x    6 root     root         4096 Aug 14  2006 DELETE_AFTER_2006_09_01/
dr-x------    2 root     root         4096 Mar  5  2003 .enstore-database/
drwxr-xr-x    2 enstore  enstore      4096 Oct 30 18:13 enstore-database/
drwx------   10 products products     4096 Sep 10 10:41 enstore-db/
drwx------   10 products products     4096 Aug 14  2006 enstore-db.before_any_action/
drwx------   10 products products     4096 Aug 14  2006 enstore-db.right_after_pg_resetxlog/
drwx------    8 products enstore      4096 Apr  3  2006 enstore-db.saved/
drwxrwxr-x    3 products products     4096 Aug 14  2006 from_d0ensrv3/
drwxr-xr-x    2 root     root        16384 Mar  1  2004 lost+found/
---------- d0ensrv1 ------------------------------
[root@d0ensrv1 root]# df -kl
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              9851340   2329084   7021832  25% /
none                   2043864         0   2043864   0% /dev/shm
/dev/sda1            241003008  57917708 183085300  25% /diska
/dev/sda2            241011072  14070344 226940728   6% /diskb
[root@d0ensrv1 root]# ls -lAF /diska
total 20
drwxr-xr-x    2 root     root        16384 Apr 13  2004 lost+found/
drwxrwxr-x    7 root     root         4096 Apr  3  2006 pnfs/
[root@d0ensrv1 root]# ls -lAF /diskb
total 60
drwx------    2 enstore  root         4096 Mar 12  2007 Full-dump/
drwxr-xr-x    2 root     root        16384 Oct  8  2001 lost+found/
drwxrwxrwx    2 root     root         4096 Oct 23  2004 Migration_tmp/
drwx------    4 enstore  root         8192 Oct 30 12:43 pnfs-backup/
drwxrwxr-x    2 enstore  root        24576 Oct 30 00:00 postgres-log/
drwxrwxr-x    3 huangch  g023         4096 Mar 30  2006 SCANNED/
---------- d0ensrv2 ------------------------------
[root@d0ensrv2 root]# df -kl
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              9851340   2999608   6351308  33% /
none                   2025408         0   2025408   0% /dev/shm
/dev/sda1            241003008 132629440 108373568  56% /diska
/dev/sda2            241011072   5457680 235553392   3% /diskb
[root@d0ensrv2 root]# ls -lAF /diska
total 480
drwxr-xr-x    3 enstore  enstore     12288 Oct 30 18:11 aml2-log/
drwxr-xr-x    2 enstore  enstore      4096 Apr  2  2002 aml2-report/
drwxr-xr-x    3 enstore  enstore      4096 Apr  2  2002 aml2Shadow/
drwxr-xr-x    2 berman   g023         4096 Apr 25  2006 berman/
drwxrwxr-x    7 enstore  enstore      4096 Jul 26 14:44 CRONS/
drwxrwxr-x    4 enstore  enstore      4096 Jul 26 14:44 CRONS.disabled/
drwxr-xr-x    4 enstore  enstore     20480 Oct 30 18:24 enstore-log/
drwxr-xr-x    2 root     root        16384 Apr  2  2002 lost+found/
drwxr-xr-x    3 enstore  enstore     81920 Oct 30 00:00 ratekeeper/
drwxr-xr-x    3 enstore  enstore    307200 Oct 30 18:15 tape-inventory/
drwxrwxr-x    6 enstore  enstore     24576 Oct 30 18:25 www_pages/
[root@d0ensrv2 root]# ls -lAF /diskb
total 5423620
drwxrwxr-x    2 enstore  enstore      4096 Nov 12  2005 GONE/
drwx------    2 root     root        16384 Oct 13  2003 lost+found/
-rw-rw-r--    1 root     root   5548328960 Oct 23 20:41 tape-inventory.tar
drwxr-xr-x    2 enstore  enstore      4096 Mar  6  2007 UPDATE_RESTART-2007-03-06/
---------- d0ensrv3 ------------------------------
[root@d0ensrv3 root]# df -kl
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              9851340   5911572   3439344  64% /
none                   1027720         0   1027720   0% /dev/shm
/dev/sda1            241003008  19338372 214319240   9% /diska
/dev/sda2            241011072  62746968 170918464  27% /diskc
[root@d0ensrv3 root]# ls -lAF /diska
total 32
drwxrwxr-x    2 enstore  enstore      4096 Oct 30 07:31 BackupToTape/
lrwxrwxrwx    1 enstore  enstore        19 Jun 28  2006 check-db-tmp -> /diskc/check-db-tmp/
drwxrwxr-x   70 enstore  enstore      8192 Oct 30 18:13 enstore-backup/
lrwxrwxrwx    1 root     root           25 Jun 28  2006 enstore-log-backup -> /diskc/enstore-log-backup/
drwxr-xr-x    2 root     root        16384 Aug  7  1999 lost+found/
drwxrwxr-x    2 enstore  enstore      4096 Oct 10 09:29 pgdb-backup/
lrwxrwxrwx    1 root     root           18 Jun 28  2006 pnfs-backup -> /diskc/pnfs-backup/
[root@d0ensrv3 root]# ls -lAF /diskc
total 168
drwxrwxr-x   67 enstore  enstore      4096 Oct 12 09:19 aml2Shadow/
drwxr-xr-x   10 root     enstore      4096 Mar 22  2007 backup/
drwx------   10 enstore  enstore      4096 Oct 30 17:11 check-database/
drwx------    8 enstore  enstore      4096 Apr  4  2006 check-database.saved/
drwxrwxr-x    2 enstore  enstore      4096 Mar 30  2006 check-db-tmp/
drwxrwxrwx    2 huangch  g023         4096 Apr  3  2006 DB_DUMP/
drwxr-xr-x    2 enstore  enstore     24576 Oct 30 17:51 db-inventory/
drwxrwxr-x    2 enstore  enstore      4096 Mar 27  2006 db-inventory_cache/
drwxr-xr-x    3 root     root         4096 Mar 28  2006 enstore/
drwxrwxr-x    2 enstore  enstore      4096 Oct 30 18:20 enstore-log-backup/
drwxr-xr-x    5 root     root        16384 Oct 22 21:45 lost+found/
drwxrwxrwx    2 root     root         4096 Oct 30 12:43 pnfs-backup/
drwxrwxrwx    2 enstore  enstore     86016 Oct 30 18:27 pnfs-backup.xlogs/
---------- d0ensrv4 ------------------------------
[root@d0ensrv4 root]# df -kl
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              9851340   1684792   7666124  19% /
none                   2049452         0   2049452   0% /dev/shm
---------- d0ensrv5 ------------------------------
[root@d0ensrv5 root]# df -kl
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              9851340   4203828   5147088  45% /
none                    255164         0    255164   0% /dev/shm
[root@d0ensrv5 root]# cd /usr/farm
[root@d0ensrv5 farm]# ls -lAF
total 48
drwxr-sr-x    2 root     root         4096 May 10  2006 bin/
-r-xr-xr-x    1 root     root          943 Jan 30  2004 conserver.rc.d.initd*
-rw-r--r--    1 root     root          240 Feb 10  1999 cons-soft.bashrc
-rw-r--r--    1 root     root         4547 Feb 10  1999 cons-soft.services
drwxr-sr-x    2 root     root         4096 Jan 22  1999 .install/
drwxr-sr-x    2 root     root         4096 Apr  5  2006 lib/
drwxr-sr-x    2 root     root         4096 Oct 16 15:21 log/
drwxr-xr-x    2 root     root         4096 Jan 22  1999 lost+found/
drwxr-sr-x    2 root     root         4096 May 10  2006 SAVEDlog/
drwxr-sr-x    6 root     root         4096 Apr  5  2006 src/
drwxr-sr-x    2 root     root         4096 Apr  5  2006 syslog/
[root@d0ensrv5 farm]# du -sk *
144     bin
4       conserver.rc.d.initd
4       cons-soft.bashrc
8       cons-soft.services
12      lib
105016  log
4       lost+found
152     SAVEDlog
10752   src
68      syslog