
FY07 FNAL Dcache Project WBS Details v1

This is an EARLY DRAFT of a detailed hierarchical listing of work identified by the FNAL dCache project as an input to the FNAL budgeting process. FNAL does not have enough staff to do all the work listed here in FY07. This list will be prioritized to fit the available staff, and the remaining work will be deferred. Some tasks are being discussed as candidates for outside contributions.

File contents

FNAL Dcache Project
FY06 Detailed WBS Activities Listing
V1.3
25 Sep 2006
Rob Kennedy

CCF - Data Movement and Storage - Upper Storage - DCache Project
All activities are charged to budget task DMS-UP-ST-DCACHE-OPS

----------------------------------------------------------------------------
High-Level Activities Breakdown
----------------------------------------------------------------------------
1. Operations
	1.1 Production Services: will include a list with responsibility limits
		1.1.1 Day-to-Day operations
		1.1.2 Upgrades and Installations
		1.1.3 Operations Issues (introduce palliatives if necessary)
		1.1.4 Facility/Customer Requirements, Config, Planning
		1.1.5 Administrative procedures, scripts, and related doc
		1.1.6 Developer Shift System Management (scheduling, duties)
		1.1.7 Communications: Operations notices and reporting
	1.2 Integration Services: customer accessible test stands (FCC)
		1.2.1 FNDCAT - with IA
		1.2.2 CDFDCAT - with Run2 Sys, CDF DH
	1.3 Related Services
		1.3.1 PNFS - with other groups
		1.3.2 Postgresql used by PNFS, DCache, SRM - with other groups
		1.3.3 Apache and dCache internal httpd
	^^^
	 \\\=== Proposed level of Effort Reporting = Activities


2. Development
	2.1 Infrastructure Development
		2.1.1 FNAL dcache service re-organization: major project
		2.1.2 PNFS to Chimera transition: major project
		2.1.3 Packaging and Distribution (RPMs, FSL workgroup,...)
		2.1.4 Automated Build and Test Platform
		2.1.5 Dev Process Definition with Issue Tracking and Workflow
		2.1.6 Documentation and Tutorials
		2.1.7 FNAL Dcache Plone site (and scattered web sites)
		2.1.8 Storage developer test stand (WH8SE): procure, operate
	2.2 Feature Development: Sub-projects and areas of responsibility
		2.2.1 Resilient dCache: Sub-Project Leader = Alex K.
		2.2.2 GridFTP and related components
		2.2.3 VO Auth Module Integration - near completion
		2.2.4 Monitoring - improve, integrate, RPM deploy
		2.2.5 Logging - improve, archive
		2.2.6 Namespace Service (Pnfs, Chimera, and transition)
		2.2.7 HSM Interface (encp, real-encp)
		2.2.8 Experience-driven development (replace palliatives)
		2.2.9 Datafile Aggregation - tar small files on way to Enstore
		2.2.10 Investigate Other Caches & File Systems
	2.3 Project Collaboration and Communication
		2.3.1 Dcache Collaboration
		2.3.2 OSG Participation
		2.3.3 LCG Participation
		2.3.4 Other Grid Storage Projects
		2.3.5 General Outreach, Conferences, Workshops
	^^^
	 \\\=== Proposed level of Effort Reporting = Activities


3. Administration
	3.1 Administration: general administration, paperwork, reports, etc.
	3.2 Project Management: WBS, budget, project planning
	^^^
	 \\\=== Proposed level of Effort Reporting = Activities

----------------------------------------------------------------------------
Detailed Activities Breakdown
----------------------------------------------------------------------------
1. Operations
	1.1 Production Services: will include a list with responsibility limits
		1.1.1 Day-to-Day operations
			- "Shifters": Group. Must know, can delegate or do.
			- 1 week Mon to Mon, reports on Fri and Mon
			- With IA on FNDCA
			- With CDFDH on CDFDCA, IA for crc failures
			- With Jon, Ted on CMSDCA
			- LQCD mostly manages LQCDDCA, but we help on issues
		1.1.2 Upgrades and Installations
			- Overall dCache, Java, and monitoring software upgrades
			- dCache software installation on new nodes
		1.1.3 Operations Issues (introduce palliatives if necessary)
			- As they arise; still > 0.50 FTE including other categories
			- Reduce DNS, KDC "slowness" vulnerability
			- central tracking of pnfs, door response time
			- policy AND matching monitoring on use-patterns
				including short-term data to tape through dCache
		1.1.4 Facility/Customer Requirements, Config, Planning
			- "Liaisons": Group
			- mostly assigned per-system, some per-customer
			- FNDCA: Rob
			- CDFDCA: Alex
			- CMSDCA: Timur
			- LQCDDCA: Vladimir
			- PNFS: Vladimir
			- Salk Institute ingest: Alex (FNDCA)
			- RMAN DB backups: Rob (FNDCA)
			- E907: Vladimir (FNDCA)
			- SDSS DAQ chain: Vladimir (FNDCA)
			- US-CMS-T2 sites (elevated issues): Alex
			- Review or co-create responsibility break-down for each
		1.1.5 Administrative procedures, scripts, and related doc
			- Utilities, watchdogs, monitors, and palliatives
		1.1.6 Developer Shift System Management (scheduling, duties)
			- Lead: Rob, modeled on Enstore
			- Review model given SRM, Dcache, customer evolution
		1.1.7 Communications: Operations notices and reporting
			- Weekly reports to CCF Storage operations
			- Weekly reports to CD operations
			- Detailed reports as needed to document incidents
			- Notices to keep customers informed of service issues
	1.2 Integration Services: customer accessible test stands (FCC)
		1.2.1 FNDCAT - with IA
			- Test pre-production dCache code
			- Test a few new file servers before production service
			- Rob
		1.2.2 CDFDCAT - with Run2 Sys, CDF DH
			- Test many new file servers before production service
			- Alex
	1.3 Related Services
		1.3.1 PNFS - with other groups
			- Lead: Vladimir
			- apply service baselines where not already done
		1.3.2 Postgresql used by PNFS, DCache, SRM - with other groups
			- Lead: Vladimir
			- document for others to share in support
			- further work towards baseline compliance
		1.3.3 Apache and dCache internal httpd
			- ad hoc part of dCache project
			- work towards baseline compliance

2. Development
	2.1 Infrastructure Development
		2.1.1 FNAL dcache service re-organization: major project
			- Lead: Rob
			- shift to DESY-style service config files and layout
			- use encp as UPS/UPD product, not from CVS
			- run under user "dcache", split from enstore
			- apply service baselines to dcache operation as well
			- shift to pure RPM-based installation and config
			- review framework of monitoring and palliative
				scripts to run under dcache
			- review distributed logging mechanism used at FNAL
			- harden and package KDC multiplexer
			- package generalized monitoring and palliative in RPMs
			- will require phased transition
		2.1.2 PNFS to Chimera transition: major project
			- Lead: Vladimir
			- Still much testing to be done
			- Must be shown to work with Enstore too
		2.1.3 Packaging and Distribution (general tasks)
			- Dcache.org RPMs: FNAL needs krb5 clients, etc.
			- FNAL dcap UPS/UPD product
			- Transition FNAL dcap support to RPM use, not UPD
			- Fermi SciLinux "DcacheServer" workgroup defn for IA
			- Include option to use a JRE packaged within dCache
		2.1.4 Automated Build and Test Platform
			- Maven/CruiseControl for automated builds now running
			- Integrate basic feature/transfer test suite
			- JUnit-based feature and unit testing
			- FTP client evaluator for customers to run themselves
		2.1.5 Dev Process Definition with Issue Tracking and Workflow
			- DESY's RT does not serve local FNAL operations issues
			- Evaluate TIssue and plone-based issue trackers,
				integrate with FNAL dCache plone site
			- Other tools to support development process
		2.1.6 Documentation and Tutorials
			- Primary doc = "DCache Book", filling out as we go
			- FNAL Admin doc (see Admin toolkit topic)
			- User guide based on AHeavey's PU00090 (we support)
		2.1.7 FNAL Dcache Plone site (and scattered web sites)
			- Fill out with existing doc using tags, smart folders
			- Automate administrative processing with report input
		2.1.8 Storage developer test stand (WH8SE): procure, operate
			- Procure next gen nodes: dual 1.4GHz workers
			- Maintain per security requirements and test needs
	2.2 Feature Development: Sub-projects and areas of responsibility
		2.2.1 Resilient dCache: Sub-Project Leader = Alex K.
			- Sub-project leader: Alex
			- Near-zero administration features
			- General V2 design, if needed
			- Specialize configs: file in RAID + any other disk
			- Pool scheduling coordination with PoolManager, e.g.
				in order to replicate on distinct nodes
			- Integrate with central space management in dCache
		2.2.2 GridFTP and related components
			- Lead: Rob
			- Default "ls" should not check file permissions
			- Bidirectional socket adapter validation
			- Command-specific monitoring via logs and/or DB
			- On-demand start-up of (gridftp) doors by SRM to
				enhance scalability, reduce redundant copies
			- Light-weight, distinct SocketAdapter for scaling
			- NIO in SocketAdapter, optimization of NIO use in FTP
			- Mitigation of high-performance transfer issues in OS
			- FTP command support: a few basic commands, most
				remaining to be done are gridftp-related
			- Test/validation suite for common "supported" clients
			- Permit multi-door/multi-NIC configs on a node
			- Integration with another gridftp code base?
			- Gridftp X-mode development
		2.2.3 VO Auth Module Integration
			- Lead: Ted, with input from Timur
			- Near completion, will need to do some hand-off doc...
		2.2.4 Monitoring - improve, integrate, RPM deploy
			- Lead: Vladimir
			- expand use of internal DB for decoupled monitoring
		2.2.5 Logging - improve, back-up to permanent store
			- Lead: Vladimir
			- exploit Log4j and overhaul logging content
			- Log4j back-end to stage logs to disk, compress, tape
			- Light-weight logfile scanner or equivalent
				histogram-creating mechanism within Log4j
			- Error and cause tracking to DB to permit error and cause
				histograms by outsiders, e.g. auth vs. DNS failures
		2.2.6 Namespace Service (Pnfs, Chimera)
			- Lead: Vladimir
			- Chimera feature testing, readiness for production?
			- Monitoring, Logging, and other required features of a
				high-demand production service
		2.2.7 HSM Interface (encp, real-encp)
			- Lead: Rob
			- Stale precious files: several causes. Fix what can be fixed.
			- Avoid redundant requests/encps for same file
			- Work with Lower Storage on a small-footprint encp
			- Review of real-encp/encp goals and expectations
		2.2.8 Experience-driven development (fixes, improvements, wishes)
			- theme is to reduce operations overhead long-term
				and make service easier to administer at FNAL
			2.2.8.1 Maintain master list on FNAL dCache plone site
			2.2.8.2 Overall Service hardening to external failures
			2.2.8.3 Per-node limits in addition to per-pool limits
			2.2.8.4 Write pool "twins" (2 files in different write
				pools, with same pnfsid - should be impossible)
			2.2.8.5 Infrequent cell start-up failures with 1.6.X
			2.2.8.6 PNFS timeouts (no such directory) under modest
				loads + monitoring to track the cause
			2.2.8.7 FNDCA volatile pool config - should be no stores
			2.2.8.8 Volatile pools - no hang if file gone, in pnfs
			2.2.8.9 wget client and staging leaves detritus
			2.2.8.10 kftp client alternative: Minos raw data logging
			2.2.8.11 N+2 queue request hangs (max set = N)
			2.2.8.12 Overwrite in PNFS fails, but wipes file size
			2.2.8.13 User-visible error in billing DB when timeout
			2.2.8.14 Live update/upgrade capability to reduce
				need for downtimes: note CDF CAF impact
			2.2.8.15 Dead/slow pool blackhole effect
			2.2.8.16 Store queue info through "null" FlushManager
			2.2.8.17 KDC Multiplexer review, overhaul
			2.2.8.18 Error code API, esp. for dcap
			2.2.8.19 Billing DB API for aggregation of results 
			2.2.8.20 Bad credentials can break X509 doors. Krb5?
			2.2.8.21 Pool File listings - less heavy approach?
			2.2.8.22 PnfsCompanion deployment after Chimera on FNDCA
				and CDFDCA.
			2.2.8.23 Incremental PNFS scans to catch file metadata
				inconsistencies while files still in write pools
			2.2.8.24 Track CRC failures to identify h/w issues
			........ (more)
		2.2.9 Datafile Aggregation - tar small files on way to Enstore
			- Major project, discussed but not initiated
			- Seems simple in theory, very hard with file families
		2.2.10 Investigate Other Caches & File Systems
			- As time permits, e.g. ZFS from Sun (Vladimir)

	2.3 Project Collaboration and Communication
		2.3.1 Dcache Collaboration
			- Biweekly "management" meeting
			- Weekly (when needed) developer collaboration meeting
		2.3.2 OSG Participation
			- Interface with OSG Storage Technical Group
			- Interface with OSG Storage-related projects
		2.3.3 LCG Participation
			- Support SRM-driven changes (mostly DESY nowadays)
		2.3.4 Other Grid Storage Projects
			- Mostly through FNAL CD GDM forum
		2.3.5 General Outreach, Conferences, Workshops
			- ad hoc, plus BNL/US-LHC dCache cooperation

3. Administration
	3.1 Administration: general administration, paperwork, reports, etc.
		- Group Steering: biweekly meetings, reports
		- Section and Department Management biweekly meetings
		- Groupware to reduce overhead, improve accessibility of info
	3.2 Project Management: WBS, budget, project planning
		- Rob, with input from deputy (Vladimir) and resilient dcache
			sub-project leader (Alex)
		- CompDiv Projects: reports, presentations 1-2 times per year
		- Project WBS, Budget: once per year with follow-up
		- Long-term facility/program-oriented planning: once per year
		- Project tactical planning: roughly 4 times per year
		- Funding, developer help: grants, SBIR, new collaborations
		
by Robert Kennedy last modified 2006-10-24 12:21