Computing Division Operations Meeting Minutes - 2004/03/22
====================
NOTE: Pre-meeting submissions appended below for DZero, Planning (Goals)

ES&H (AP)
====================
804 days w/o lost-time injury.
No reports of walk throughs received.
Continuing with Tri-partite interviews.
Trained 3 people in service coordinator training.
An operator trained in emergency warden training.
GERT Training - 13 people due - all can take it online.
(VW) Any issues or things one would like to bring up relating to Safety? (none)
When feedback from OSHA? A couple of months. Said that the lab did as well as general industry.

CDF (RC)
====================
Tuesday UPS fixed. Went ok. At 90% usage. Will bring some more machines back online -- go up to 95%.
(RK) Friday - CDF groups decided to make samples - overwhelmed tape bandwidth. Reminded groups that they have to coordinate with the Data Handling Group. Good news is dcache seems to be handling it.
(VW) What is process?
(RK) Anyone at CDF who is going to access data not in cache is to notify the Data Handling Group so we can act as central clearing house. Big turnover recently in who is doing this. Had several groups start up on Friday. Had they informed us, RK would have looked over requests and prestaged files used in common. Has worked in past. Will continue until can coordinate Data Handling with Job submission.

CMS (DF)
====================
Pool nodes died at CERN over the weekend. Hardware. Similar units had been running for 18 months. Hope to have it diagnosed and started back up soon.
Nothing much happening over the weekend on the Data Challenge, except of course there was a lot of activity to diagnose. Had to get CERN to do things.
(VW) Who is doing what over the weekend?
(DF) Don't know.

DZero (AB)
====================
See submission below.
Mike Diesburg took long weekend off.
Fedora incompatibility. Experts weighing in.
(MK) What version are they using?
(AB) Not sure.
(MK) What can we do to help them?
(AB) Stay tuned.
Fairly normal week. Low on d0mino. Large week on ClueD0.

EAG (SK)
====================
Skies have cleared. 5800 Kevents. In process.
All systems up and running except for some getting serviced.
SDSSdp0 up and running. Recovered all data from disks that we thought were lost from crash. Obviously another of the 3ware controller card problems.
Data release 2 (DR2) made public a week ago. No significant complaints.
(VW) How are you doing with the cutover to ingesting the tapes into the robot? Last heard - 2 weeks ago - doing tests. Tests mostly successful but a few problems. Process?
Discussion.
(JB) Can report that the problems we know about seem to be in old DLT drives. Waiting for new drives.

EXP (LBG)
====================
Took beneficial occupancy of near detector hall. LAN install happening this week.
Collaboration meeting happening end of this week. Expect some systems will be blocked. Would be nice if we can get their systems patched and unblocked if they show up and have problems with their laptops.
Demo machine for proposed databases here and being tested by Julie. Will put in req. for real system if tests prove ok.

CCF (JB)
====================
Main dcache node for general access replaced this morning. Web site is back.

-Data Comm (RC)
Near end - Minos work - network cabling and installation began today and should be done by Thursday (estimate).
Starlight project: Will have meeting with ComEd person re Make Ready work. Will get more precise schedule of when fiber will be turned over to us. Will characterize it. Will be poised to do this.
VW would like to see list of what we are testing for as part of the characterization. Will probably only get one easy shot at it so need to review.

-Networking (PD)
DHCP rollout - first irate customer - we notified LSS - but it did not trickle down to Fred Ullrich in Visual Media Services.
Did finish WH except all the conference rooms and except for wireless and except for BSS. A few more to do but almost done otherwise.
Progress on CDF Gigabit Ethernet Upgrade (Did not catch all comments--DR).

-Security (DS) (MK)
Hit Friday night about 11 pm by a new worm that affects an internet firewall. 80 some systems.
Two main effects. A lot of network traffic. Affected IMAP server.
Worm corrupted data. Contained fairly quickly.
First packet to first machine infected was 10 seconds. Not capable of being addressed by human intervention.
Vulnerability Announced Thursday, Patch Friday, First Worm Friday night.
Worm generated lots of network traffic as it blasted itself back to other machines. Impact was due to traffic utilization. Crammed everything into a single packet.
Put in ACL to log. Mistyped. Fixed after calls.
(VW) What is it costing us?
(DS) 18 machines in lab services which we are cleaning up. Have prioritized list we are working through to correct. People not able to work today.
(VW) High cost.
Discussion: Autoblocker triggered and blocked 80 some systems. Noticed that high number. Alerted people. What happens if they bring in a laptop with this Black Ice software? Will get isolated.
(DS) Will be rolling out training course online for system administrators. Should show up in ITNA's. Can go into MISCOMP and look up how many systems people are signed up for if you know their id, and then encourage their attendance.
LCG and LGEEE (?) phone con today. Will have discussions about how these two activities relate.
Discussion about Fermilab KCA. LCG project did not include Fermilab KCA certificates. Will get that remedied. Fermilab KCA certificate due to expire next month. Maintenance. Fragile. Depends on how people maintain their sites. Dependent on people paying attention.

-Computer Center Operations (MS)
A couple of minor OSI's. Dozen pages and phone calls outgoing and incoming.
KTEV OCS mounts over weekend.
Tape mounts are almost all migration to 9940B's, not analysis.

-Storage (WB)
Some time next month, during downtime: need to talk to DZero and CDF about converting databases to new pnfs databases (improve performance and fix 2 GB problem).
CDF and DZero: 6 to 8 hours of down time. Other pnfs users will have an outage as well - 20 hour downtime projected.

CSS (SN)
====================
Covered previously in meeting.

CEPA (LL)
====================
Carmenita Moore passed away over the weekend due to complications from pneumonia. Will be sorely missed.
A moment of silence: ///\\\
Cards will be available for signature at admin desks. Service arrangements circulated in e-mail.
DBS hire coming on April 5th.
CMS / CERN operations. Databases operations. Trying to figure out how to do operations in this complicated environment. Julie and Amil have document.
(VW) Who is the person from the CMS detector side who is working on this?
(LL) A number of people "in spirit". Planning to hire someone. Just been posted.
ESE: Interviewing. A number of strong candidates.

Planning and Customer Support ()
====================
See Goals Submission below.
Budget discussion to be set.

Projects (M?)
====================
No project status reports this week.
Would like to thank everyone for help with their review.
Computing Techniques seminar this week - Tuesday, March 30, 2004, 1:30 pm, WH Curia II: "The Globus Toolkit and the OGSI - WSRF Evolution," by Greg Nawrocki of The Globus Alliance.
Trying to get Curia II - Tuesday March 30th. Scheduled at 1:30 pm. Abstract in CD News. Author is part of the Globus Alliance.
Planning to have a "fest" in a week or two to go over the various home pages.
(VW) Is there a reschedule of the attempt to get our projects up to date?
Joy has been working on it. Joy is on vacation this week so it will be

Operations (DR)
====================
Please note:
1. CDF UPS cable fix is complete.
2. Painting of back staircase up to Mezzanine Level is complete. Painting of Doors in FCC will occur over next several weeks.
3. Debra Guzman from Visual Media Services will be taking photos within the Data Center over the next several weeks. (Will background on this--DR)

Division (VW):
====================
Operations Review. Thanks to everyone for their efforts. Feedback from reviewers. Very good feedback. Lab got good marks. Reviewers made a few suggestions and recommendations. Web pages look so much better and metrics coming on.
Another Review this week - Many from CD are talking - DOE Review of the Program -- Coming at it from the Program. I hope it will go well.
Going to PAC shortly after that. Will be looking at a couple of proposals from several groups (2 from EAG). Also, Minerva (?). Off-axis is proposing (Nova).
Did WPAS. Must catch up with our budget now. Don't want to spend everything at the last minute. Problems with infrastructure.
Goals - see below - Sensitive item list. Need to get this done.
CDF? (RC) About 800 to do. A third to do.
(AB) Will you (RT) send around to Department Heads a list of who they have to get on?
(VW) Need to re-think this for next year.

Minutes - Send comments to ritchie@fnal.gov

++++++++++++++++++++++++++++
Electronic Submissions
++++++++++++++++++++++++++++

DZero Submission:
--------------------
D0 operation report (3/22/2004)
==============================
Production Farm: In shutdown, and Mike took the weekend off!
We processed 5.2M events last week, and wrote 2.7M.
Had to reboot D0mino; otherwise, a quiet week.
Some of the clued0 machines have been upgraded to Fedora, and there seems to be an incompatibility with the standard root version. Experts investigating.
Tapes: 209.7 TB in mezosilo, 261 TB in 9940b (1 TB written last week), 162 TB in LTO (4 TB written last week).
Analysis Stations:
              Data Analyzed   Events Analyzed
D0mino        3.5 TB          60 M
CAB           8 TB            275 M
cabsrv1       15 TB           475 M
Clued0        1.6 TB          45 M    <--- Another HUGE week for Clued0
d0karlsruhe   0 TB            0 M

From Planning
-------------
Goals Tally

DEPT      #   Submitted   Balance Due
CDO      34          15            19
CCF      47          14            33
CEPA     51          30            21
CSS      60          39            21
CDF      15                        15
CMS      11                        11
D0       13          13             0
EAG      14          13             1
EXP       6           3             3
TOTAL   251         127           124
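
The Goals Tally arithmetic can be cross-checked with a short script. This is only a minimal sketch: the figures are transcribed from the tally, the blank "Submitted" entries for CDF and CMS are assumed to mean 0, and the names ROWS, REPORTED_TOTAL and check_tally are hypothetical, not part of any Planning tool.

```python
# Goals tally rows: (dept, number of goals, submitted, balance due).
# Assumption: the blank "Submitted" cells for CDF and CMS are read as 0.
ROWS = [
    ("CDO", 34, 15, 19),
    ("CCF", 47, 14, 33),
    ("CEPA", 51, 30, 21),
    ("CSS", 60, 39, 21),
    ("CDF", 15, 0, 15),
    ("CMS", 11, 0, 11),
    ("D0", 13, 13, 0),
    ("EAG", 14, 13, 1),
    ("EXP", 6, 3, 3),
]

# TOTAL row as reported in the tally.
REPORTED_TOTAL = (251, 127, 124)


def check_tally(rows, reported_total):
    """Return a list of inconsistencies found in the tally (empty if none)."""
    problems = []
    # Each department's balance due should equal goals minus submitted.
    for dept, goals, submitted, balance in rows:
        if goals - submitted != balance:
            problems.append(f"{dept}: {goals} - {submitted} != {balance}")
    # Column sums should match the reported TOTAL row.
    sums = tuple(sum(row[i] for row in rows) for i in (1, 2, 3))
    if sums != reported_total:
        problems.append(f"TOTAL: computed {sums}, reported {reported_total}")
    return problems


if __name__ == "__main__":
    issues = check_tally(ROWS, REPORTED_TOTAL)
    print("consistent" if not issues else "\n".join(issues))
```

With the blank cells read as 0, every row and the TOTAL row agree, which supports placing the CDF and CMS figures of 15 and 11 in the Balance Due column.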