CDFOI v1.2.0 "Scheduled Tasks" List - WBS (13 June 2008)
ID	WBS	Name	Notes
0		CDF Offline Initiative	This plan is defined to start on 31 March 2008....
1	1	CDF Offline Architecture
2	1.1	Strategy Sheets
3	1.1.1	CDF Grid Infrastructure
4	1.1.1.1	Version 1 vetted by Initiative
5	1.1.1.2	Version 2 vetted by CDF Spokespeople
6	1.1.1.3	Version 3 posted for review
7	1.1.1.4	"Review by CDF Spokes, CD Mgmt, Initiative Mgmt"	"V3 presented to CDF Spokespeople on 6/5/2008, with general agreement on strategies and qualified agreement on the Project Organization component."
8	1.1.2	CDF CAF-Grid Instances
9	1.1.2.1	Version 2 vetted by CDF Spokespeople
10	1.1.2.2	Version 3 posted for review
11	1.1.2.3	"Review by CDF Spokes, CD Mgmt, Initiative Mgmt"	"V3 presented to CDF Spokespeople on 6/5/2008, with general agreement on strategies and qualified agreement on the Project Organization component."
12	1.1.3	Milestone: High Priority Strategy Sheets approved
13	1.1.4	CDF Offline Infrastructure
14	1.1.5	CDF Disk Pool
15	1.1.6	CDF Data Handling
16	1.1.7	MILE 7: Strategy Sheets completed
17	1.2	Offline Services Design Document
18	1.2.1	High-Level Arch Diagram for CAF and Grid
19	1.2.2	Production Processing Workflow Diagrams
20	1.2.2.1	CAF Components Diagram Posted
21	1.2.3	Services-Hardware Map
22	1.3	Project Organization Chart	Org Chart *PLUS* clear responsibilities and boundaries defined among the sub-divisions of the Offline Project
23	1.3.1	Basic Project Org Chart
24	1.3.4	Project Org Chart Proposal development
25	1.3.3	"Review by CDF Spokes, CD Mgmt, Initiative Mgmt"	"V3 presented to CDF Spokespeople on 6/5/2008, with general agreement on strategies and qualified agreement on the Project Organization component."
26	1.3.2	Detailed Responsibility Assignments in Org Chart	This includes any shifting of responsibilities that may be considered.
27	1.4	MILE 3: Offline Project and Services documented
28	1.5	High-Level Offline Architecture Document
29	1.6	MILE 15: CD Offline Architecture defined
30	2	Shared Operations Management Topics
31	2.1	Low-level Monitoring and Alarms (Zabbix)
32	2.1.1	Evaluation Requirements Document
33	2.1.2	Evaluation and Evaluation Summary Document
34	2.1.3	System Specifications Document
35	2.1.4	Multi-Phase Deployment and Short-term Support Plan
36	2.1.5	Milestone: Zabbix System specified
37	2.1.7	Negotiate service hosting and support
38	2.1.8	Setup service hosting
39	2.1.6	Initial Production-quality Agents
40	2.1.6.1	CDF-Independent Agent scripting and testing
41	2.1.6.2	CDF-specific Agents code
42	2.1.9	Integration of available agents on hosted service
43	2.1.10	MILE 5: Zabbix Phase 1 Deployment successful
44	2.1.11	Configuration Testing and Tuning
45	2.1.12	Develop Complete Suite of Production Agents (by expert)
46	2.1.13	Integration and Reporting for new agents (by expert)
47	2.1.14	Milestone: Zabbix Phase 2 Deployment successful
48	2.1.15	Agents Development and Deployment Doc
49	2.1.16	Operations and Configuration Management Doc
50	2.1.17	Develop new agents for another system (by CAF ops team)
51	2.1.18	Integrate new agents for another system (by CAF ops team)
52	2.1.19	Milestone: Zabbix Phase 3 Deployment successful
53	2.1.20	Long-term Platform and Service Support Agreement
54	2.1.21	Hand-off to CDF CAF Operations team
55	2.1.22	MILE 20: Zabbix production deployment achieved
56	2.2	Issue/bug Tracking (Jira)
57	2.2.1	Jira Evaluation: Beta Config on Beta Platform
58	2.2.2	Jira Configuration Review
59	2.2.3	Jira Evaluation: Prod Config on Beta Platform
60	2.2.4	Beta Configuration Document
61	2.2.5	Requirements Document
62	2.2.6	Evaluation Document
63	2.2.7	Initial Integration into Existing Support Processes
64	2.2.8	Create and Deliver a user Tutorial
65	2.2.9	Refine Integration into Existing CDF Support Processes	"MeV (6/3/2008): The ultimate goal is to have *only* issues@fnal.gov in the email lists. As of this weekend, this is true for all of the lists except cdf_caf and cdfdb-support. The latter is scope creep, and I hope progress will be made on the former th..."
66	2.2.10	MILE 1: Jira Production Service on Beta Platform successful
67	2.2.14	INPUT: Choose to outsource production platform
68	2.2.15	Purchase Requisition development
69	2.2.16	Purchase Requisition into Lab System
70	2.2.17	Initial Access to N-user Jira Service (purchase eff. completed)
71	2.2.18	Milestone: N-user Jira Service purchased and accessible
72	2.2.19	Migration from Beta to Production Platform
73	2.2.20	Adapt Most Configuration to Specifics of Production Platform
74	2.2.29	Milestone: Switch Service to Jira on Production Platform
75	2.2.26	Refine CDF Support Process Integration - Last Loose E-mail Lists	"MeV (6/3/2008): The ultimate goal is to have *only* issues@fnal.gov in the email lists. As of this weekend, this is true for all of the lists except cdf_caf and cdfdb-support. The latter is scope creep, and I hope progress will be made on the former th..."
76	2.2.12	Metrics and Reports Based on Issue Tracking Content
77	2.2.22	Configuration Fine-Tuning	MeV (6/10/2008) - rephrased:...
78	2.2.27	LDAP Integration for Production Platform	MeV (6/10/2008): waiting on approval from computer security
79	2.2.28	PIX Email Handler Plug-in Integration for Production Platform	"MeV (6/10/2008): Glenn is testing the new version of PIX on the evaluation license, but it seems to be eating email off the imapserver and not generating tickets. He is trying to get help from the company in Germany, but no one is responding (bad sign)."
80	2.2.21	MILE 8: Jira Migrated from Beta to Production Platform
81	2.2.23	Configuration Management and Application Support Guidance Doc
82	2.2.24	Hand-off to Operations team
83	2.2.25	MILE 13: Jira Issue Tracker deployment achieved
84	2.3	Downtime Planning and Recovery
85	2.3.1	Root Cause Analysis for late March downtime
95	2.4	Code Repository
96	2.4.1	Assess Risk of CVS-to-SVN Migration during Initiative
106	2.7	Milestone: Shared Operations Management Issues addressed
114	3.4	Code Server Node Upgrades
115	3.5	SL4 Migration [Rough Draft Plan]
116	3.5.1	Basic Migration Planning	RS/LG (6/10/2008) paraphrased:...
117	3.5.2	Implement Modifications to Infrastructure
118	3.5.2.1	INPUT: Implement changes to SRT	Already done by the time the SL4 Migration Plan was defined.
119	3.5.2.2	INPUT: Deploy UPS v4.7.4 with 64 bit support	Already done by the time the SL4 Migration Plan was defined.
120	3.5.2.3	Update External Products
121	3.5.2.4	Deploy Stand-alone Xrootd
122	3.5.3	Switch development to new build scheme
123	3.5.4	INPUT: ICHEP Activity sufficiently ended
124	3.5.5	Configure Major Releases to Build under SL4 and new build scheme
125	3.5.6	MILE 23: Major Releases Ready for SL4 and new build scheme
126	3.5.7	Migrate Remaining SL3 Machines to SL4: ILP
127	3.5.8	Migrate Remaining SL3 Machines to SL4: Desktops
128	3.5.10	Milestone: SL4 Migration completed
129	3.10	CVS Service Migration
130	3.11	"MILE 24: Code Server, SL4, CVS Migrations completed"
135	4	CDF Grid Infrastructure
136	4.1	INPUT: Low-Level Monitoring (Zabbix) Integration with Operations
137	4.2	INPUT: Issue Tracking (Jira) Integration with Support
138	4.3	Condor User Monitoring	Migrate CAF monitoring from Python dict to RDB-backed
139	4.3.1	Identify and gain access to hardware for evaluation work
140	4.3.2	Setup Zabbix (distinct from FEF/GuG service)	Hans Wenzel (6/4/2008):...
141	4.3.3	Setup Condor Quill++	Hans Wenzel (6/4/2008):...
142	4.3.12	Resolve Condor Quill++/CAF Authentication Problem	Federica Moscato (6/3/2008) paraphrased:...
143	4.3.4	Recreate some aspects of User Monitoring based on Quill and Zabbix
144	4.3.5	MILE 9: Demonstrate User Monitoring based on RDB-backed system
145	4.3.6	Specify Production System
146	4.3.7	Develop/configure Production System
147	4.3.8	Deploy Production System
148	4.3.9	Document Production System
149	4.3.10	Hand-off Production System to Operations
150	4.3.11	MILE 21: User Monitoring migrated to RDB-backed system
151	4.4	FNAL KCA Upgrade
152	4.4.1	Plan and prepare for new KCA turn-on	"New KCA Turn-on: week of May 15, probably May 16..."
153	4.4.2	Assess impact of KCA upgrade on GroupCAF
154	4.4.3	Simple Test Trial 1 against Test KCA - OSG Stack
155	4.4.4	Simple Test Trial 1 against Test KCA - LCG Stack
156	4.4.5	"Investigate test failures, identify cause(s)"
157	4.4.6	Determine and deploy best remedy for test failures
158	4.4.7	Simple Test Trial 2 against Test KCA - OSG Stack
159	4.4.8	Simple Test Trial 2 against Test KCA - LCG Stack
160	4.4.9	Submission Tests against Test KCA - OSG Stack
161	4.4.10	Submission Tests against Test KCA - LCG Stack
162	4.4.11	Tests for fix to whitespace and other character insertion to DN	Donatella Lucchesi (6/4/2008):...
163	4.4.12	MILE 4: KCA Upgrade achieves goals
164	4.4.13	"Plan CDF VOMS work required, discuss with VOMS service providers"
165	4.4.14	VOMS service providers prepare for adaptation
166	4.4.15	Milestone: CDF VOMS service providers ready for adaptation
167	4.4.17	LCGCAF VOMS Entries introduced
168	4.4.18	CNAF VOMS Entries introduced
169	4.4.19	NAMCAF VOMS Entries introduced
170	4.4.20	PACCAF VOMS Entries introduced
171	4.4.21	FermiGrid CAF VOMS Entries introduced
172	4.4.16	EXTERNAL: FNAL Production KCA Upgrade
173	4.4.25	FNAL KCA Service Switchover (overlapped with 1/2 day downtime)
174	4.4.23	MILE 14: FNAL KCA Upgrade adaptations tested and deployed
175	4.4.22	Unneeded VOMS entries removed
176	4.4.24	Milestone: FNAL KCA Upgrade task closed
177	4.5	Migrate GroupCAF to FermiGrid CAF
178	4.5.1	"Procure and prepare fcdfhead10,11"
179	4.5.2	"Install and configure fcdfhead10,11 as FermiGrid head nodes"	Federica Moscato (6/3/2008):...
180	4.5.2.1	"Install and configure fcdfhead10,11 as FermiGrid head nodes - Initial Trial"	Federica Moscato (6/3/2008):...
183	4.5.10	"Install and configure TestCAF: 12,13 based on 10,11 experience"
184	4.5.3	Installation Procedures w/limited root access
185	4.5.4	Specifications and Test Plan document
186	4.5.5	Build Test Framework - Jobs using data input and not
187	4.5.6	INPUT: Shadow CAF prepared for CDF use	"Steve Timm reported to CAF_DEVELOPERS that the ""sleeper"" CAF is ready for CDF use. While it is initially set up for fewer slots, it has scripts for 10k slots. This hand-off will need some follow-up to ensure mutual understanding of what is ready."
188	4.5.39	CDF Initial Acceptance Test for ShadowCAF: usable at modest scale
189	4.5.7	CDF Final Acceptance Test for ShadowCAF: ready for use at 10k scale?
190	4.5.8	MILE 2: Shadow CAF ready for use at 10k scale
191	4.5.9	Disaster Recovery Plan for head and other critical nodes
192	4.5.11	"Adapt TestCAF for Recovery Drill w.r.t. fcdfhead10,11 config"
193	4.5.12	"Recovery Drill on TestCAF: 12,13"
194	4.5.13	"Test Features, Robustness, and Scalability w/o glideinWMS: baseline; 10,11"
195	4.5.14	Adapt configuration to support multiple collectors
196	4.5.15	"Test Features, Robustness, and Scalability w/o glideinWMS: multiple collectors; 10,11"
197	4.5.16	Document Testing Results and Post
198	4.5.17	Scaling Test Result Review (4 Decision Points)	DECISION: Is glideinWMS required and ready now or can it come later?...
199	4.5.18	Final Migration Plan Review	Coupled to glideinWMS preparedness since it is assumed this will be needed for scaling reasons....
200	4.5.19	MILE 11: Final Plan for GroupCAF to FermiGrid CAF Migration	This plan will revise the work between this milestone and the next milestone (FermiGrid Scheduling released).
201	4.5.20	Integration of FermiGrid CAF/NAMCAF across OSG1-4
202	4.5.21	Configure OSG1-4 to establish FermiGrid CAF/NAMCAF priorities
203	4.5.22	Release FermiGrid scheduling to all CDF OSG WNs
204	4.5.23	MILE 16: FermiGrid Scheduling released
205	4.5.24	Complete Remaining WN OS Upgrades
206	4.5.25	Decommission old FermiGrid head nodes
207	4.5.26	Establish Stable Operations at Capacity to Enable Production Migration
208	4.5.27	"Revise installation, recovery procedures for general operations"
209	4.5.28	Long-term Support Agreements
210	4.5.29	Phased Migration Plan of WN and Production Users
211	4.5.30	INPUT: Production p17 processing completed
212	4.5.31	MILE 18: FermiGrid CAF Ready to Absorb GroupCAF
213	4.5.32	Phase 1 Migration of WNs from GroupCAF to FermiGridCAF
214	4.5.32.1	Phase 1: N1 racks of WN updated and migrated to FermiGrid CAF
215	4.5.32.2	Phase 1: Class X of Production Users migrated to FermiGrid CAF
216	4.5.33	Phase 2 Migration of WNs from GroupCAF to FermiGridCAF
217	4.5.33.1	Phase 2: N2 racks of WN updated and migrated to FermiGrid CAF
218	4.5.33.2	Phase 2: Class Y of Production Users migrated to FermiGrid CAF
219	4.5.34	Phase 3 Migration of WNs from GroupCAF to FermiGridCAF
220	4.5.34.1	Phase 3: N3 racks of WN updated and migrated to FermiGrid CAF
221	4.5.34.2	Phase 3: Class Z of Production Users migrated to FermiGrid CAF
222	4.5.35	Phase 4 Migration of WNs from GroupCAF to FermiGridCAF
223	4.5.35.1	Phase 4: N4 racks of WN updated and migrated to FermiGrid CAF
224	4.5.35.2	Phase 4: Class Z of Production Users migrated to FermiGrid CAF
225	4.5.36	Post-Migration FermiGrid CAF Tuning
226	4.5.37	Switch off GroupCAF
227	4.5.37.1	"Migrate ""Add New Users"" to new host"
228	4.5.37.2	GroupCAF NFS File Server also an ICAF node?
229	4.5.37.3	Decommission Old GroupCAF Head Nodes
230	4.5.38	MILE 22: GroupCAF to FermiGrid CAF Migration completed
231	4.6	Adopt glideinWMS [Needs Replanning]
232	4.6.1	INPUT: GlideinWMS proven mature enough for CDF production use
233	4.6.2	Determine Ops responsibilities and support for glideinWMS	"Meeting on 6/2/2008: CDF Offline, Initiative, and Grid Service Dept mgmt - discussed responsibilities based on Initiative worklist and note in DocDB written by Keith Chadwick (defines requirements of FermiGrid customer service). Igor Sfiligoi may be avai..."
234	4.6.8	DECISION: Establish responsibilities among glideinWMS and CAFexe
235	4.6.3	INPUT: Determine Hardware needed to support glideinWMS (estimate 2 nodes/CAF)
236	4.6.4	Milestone: glideinWMS Hardware Decision
237	4.6.20	Reassessment of GlideinWMS Status and Replanning
238	4.6.5	Procure and Prepare Hardware needed to support glideinWMS
239	4.6.6	Disaster Recovery Plan for glideinWMS nodes
240	4.6.7	Specifications and Test Plan document for glideinWMS
241	4.6.9	Initial Rework of Glidecaf Code
242	4.6.10	Establish Test CAF system for glideinWMS testing
243	4.6.10.1	Upgrade to latest glideinWMS code version on gFactory node
244	4.6.10.2	Upgrade and configure VO front-end machine
245	4.6.11	Test Setup and Initial glideinWMS Use
246	4.6.12	"Test Features, Robustness, and Scalability: 12,13"
247	4.6.13	Refinement of Glidecaf Code
248	4.6.14	Tune glideinWMS to CDF needs
249	4.6.15	MILE 10: Ready to deploy glideinWMS to first production system
250	4.6.16	"Install, Configure glideinWMS Software System for Production Use (Decision Point)"	"DECISION: Install glideinWMS services on test or prod h/w, and use glideinWMS in production system?"
251	4.6.17	Deploy glideinWMS on Production System (Decision point; downtime req'd)	DECISION: Install on NAMCAF or FermiGrid CAF?
252	4.6.18	MILE 17: GlideinWMS Ready for Full Deployment
254	4.7	Modify CAF code: Support multiple distinct schedd hosts
255	4.7.1	Adjust CAF Software for multiple schedd hosts
256	4.7.2	Test multiple schedd hosts in realistic environment
257	4.7.3	Deployment of multiple schedd hosts in production (CAF downtime req'd?)
258	4.7.4	MILE 19: Multiple Schedd Hosts in CAF production
298	5	CDF CAF-Grid Instances
300	5.2	FermiGrid CAF Work
301	5.2.1	FermiGrid CAF: Critical Node Upgrades
309	5.3	NAMCAF Work
310	5.3.1	NAMCAF: Critical Node Upgrades
335	5.7	FNAL CAF Team Operations
336	5.7.1	CAF Operations Shift Level of Effort
337	5.7.1.1	Shift 1 - Downtime Recovery
338	5.7.1.2	Shift 2
339	5.7.1.3	Shift 3 - CAF Attack 1
340	5.7.1.4	Shift 4
341	5.7.1.5	Shift 5
342	5.7.1.6	Shift 6 - CAF Attack 2
343	5.7.1.7	Shift 7
344	5.7.1.8	Shift 8
345	5.7.1.9	Shift 9
346	5.7.1.10	Shift 10
347	5.7.1.11	Shift 11
348	5.7.1.12	Shift 12
349	5.7.1.13	Shift 13
350	5.7.1.14	Shift 14
351	5.7.1.15	Shift 15
352	5.7.1.16	Shift 16
353	5.7.1.17	Shift 17
354	5.7.1.18	Shift 18
355	5.7.1.19	Shift 19
356	5.7.1.20	Shift 20
357	5.7.1.21	Shift 21
358	5.7.1.22	Shift 22
359	5.7.1.23	Shift 23
360	5.7.1.24	Shift 24
361	5.7.1.25	Shift 25
362	5.7.1.26	Shift 26
363	5.7.1.27	Shift 27
364	5.7.2	CAF Operations Off-Shift Level of Effort
365	5.7.2.1	April 2008
366	5.7.2.2	Early May 2008
367	5.7.2.3	Late May - June 2008
368	5.7.2.6	July 2008
369	5.7.2.5	August 2008
370	5.7.2.4	September 2008
371	6	CDF Disk Pool
372	6.1	Decommission/Upgrade ICAF Nodes
373	6.1.1	INPUT: ICAF User Account Migration procedure
374	6.1.2	INPUT: ICAF Hardware Replacement Decision
375	6.1.3	Develop ICAF Hardware Replacement Plan
376	6.1.4	Identify or specify replacement User Space and Service hardware
377	6.1.5	MOOT: Simplify ICAF user account management	It was discovered while developing the ICAF upgrade plan and hardware specification that FEF had already taken on user account management and was using YP. There may be some modest changes desired (not required immediately) to ICAF infrastructure to ac...
378	6.1.6	Setup and preparation of User Space and Service hardware
379	6.1.7	Milestone: ICAF ready for user space migration	There may still be a user account management script to modify to accommodate the revised approach... but this does not need to hold up the migration.
380	6.1.8	Migrate User Space to new hardware - Single Node Test Phase
381	6.1.9	Migrate User Space to new hardware - More Nodes Phase
382	6.1.10	Migrate User Space to new hardware - Remaining Nodes Phase
383	6.1.11	Migrate Service node
384	6.1.12	Identify or specify replacement Backup Space hardware
385	6.1.13	Selection and Setup of Backup Space's backup application
386	6.1.14	Migrate Backup Space
387	6.1.20	Adjustments to ICAF Gui and related infrastructure code
388	6.1.16	MILE 12: ICAF Nodes Upgrade deployed
389	6.1.17	Decommission old ICAF Service and/or Backup Space nodes
390	6.1.18	Milestone: ICAF Nodes Upgrade closed
399	7	CDF Data Handling
409	7.5	File Server Upgrades
410	7.5.1	"Determine Supported Platform (OS, file system) to use"	"6/6/2008: New deployments delayed by conflict: Scientific Linux dropped support for XFS, but dCache only supporting use on XFS file systems. RS states REX/Ops should do the dCache deployment, but on what file system since the OS must be Scientific Linu..."
411	7.5.2	Deployment of Available New File Servers
412	8	Project Management
413	8.1	Project Communications
414	8.1.1	Executive Meetings
415	8.1.1.1	Joint Executive Meeting 1
416	8.1.2	Management Meetings	"Weekly, generally on Fridays at 11:30am->..."
417	8.1.3	Coordination Meetings	"Overlay on CDF Offline Operations meetings, weekly on Wednesdays at 10am."
418	8.2	Project Planning
419	8.2.1	WBS v1.0
420	8.2.2	Organization Management Plan	Embedded in Slides presented at the 4/9 CDF Offline Operations meeting.
421	8.2.3	Communications Management Plan	Embedded in Slides presented at the 4/9 CDF Offline Operations meeting.
422	8.2.4	WBS v1.1
423	8.2.5	WBS v1.2
424	8.2.6	Early Schedule v0.9
425	8.2.7	Baseline Early High Priority Tasks Schedule v1.0
426	8.2.8	Resource-Loaded Complete Initiative Schedule v2.0	"This will require several steps in replanning portions of the Initiative. Schedules v1.1, v1.2, and maybe v1.3 will be intermediate schedules that introduce resource-loading and expand to lower priority work done in parallel with higher priority using..."
427	8.2.9	MILE 6: Project Planning completed
428	8.3	Project Administration
429	8.3.1	Planning Phase (Level of Effort = 10%)
430	8.3.2	Transition Phase (Level of Effort = 15%)
431	8.3.3	Execution Phase (Level of Effort = 25%)
432	8.3.4	Milestone: Project Execution completed
433	8.4	Project Close-out
434	8.4.1	Project Objectives Close-out Document
435	8.4.2	Project Close-out Meeting
436	8.4.3	Project Close-out Report
437	8.4.4	"Project Close-out Report Approval: CDF, CD"
438	8.4.5	Archive project artifacts
439	8.4.6	Milestone: Initiative Closed-out
440	9	END MILESTONE: CDF Offline Initiative completed